# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb's business teams. Understanding what drives price, and being able to predict it accurately, helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You'll focus not just on code, but on what the results mean and how you'd communicate them to stakeholders.




## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You'll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`. (Note: If you had significant errors with Assignment 7, you can use the file named `airbnb_listings.csv` in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?
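
The notebook's own loading cell is not shown here, so the following is only a minimal sketch of how the import step might look. The helper name `load_listings` is hypothetical, and the sketch assumes both CSV files sit in the working directory under the names given above.

```python
from pathlib import Path

import pandas as pd


def load_listings(primary="cleaned_airbnb_data_7.csv", backup="airbnb_listings.csv"):
    """Read the cleaned CSV, falling back to the backup file if it is missing.

    `load_listings` is a hypothetical helper name, not part of the assignment.
    """
    path = Path(primary) if Path(primary).exists() else Path(backup)
    return pd.read_csv(path)


# Only attempt the load when one of the expected files is actually present.
if Path("cleaned_airbnb_data_7.csv").exists() or Path("airbnb_listings.csv").exists():
    df = load_listings()
    print(df.shape)   # (number of rows, number of columns)
    print(df.head())  # first five listings
```

Calling `df.shape` and `df.head()` right after loading answers both response questions in one step.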


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,11508,https://www.airbnb.com/rooms/11508,20250129142212,2025-01-30,city scrape,Amazing Luxurious Apt-Palermo Soho,LUXURIOUS 1 BDRM APT- POOL/ GYM/ 24-HR SECURIT...,AREA: PALERMO SOHO<br /><br />Minutes walking ...,https://a0.muscache.com/pictures/19357696/b1de...,42762,...,4.93,4.98,4.93,4.86,f,1,1,0,0,0.29
1,14222,https://www.airbnb.com/rooms/14222,20250129142212,2025-01-30,city scrape,"RELAX IN HAPPY HOUSE - PALERMO, BUENOS AIRES",Beautiful cozy apartment in excellent location...,Palermo is such a perfect place to explore the...,https://a0.muscache.com/pictures/4695637/bbae8...,87710233,...,4.82,4.9,4.87,4.75,f,6,6,0,0,0.8
2,15074,https://www.airbnb.com/rooms/15074,20250129142212,2025-01-30,previous scrape,ROOM WITH RIVER SIGHT,,,https://a0.muscache.com/pictures/91166/c0fdcb4...,59338,...,,,,,f,1,0,1,0,
3,16695,https://www.airbnb.com/rooms/16695,20250129142212,2025-01-30,city scrape,DUPLEX LOFT 2 - SAN TELMO,,San Telmo is one of the best neighborhoods in ...,https://a0.muscache.com/pictures/619c33a9-0618...,64880,...,4.83,4.8,4.39,4.41,t,9,9,0,0,0.27
4,20062,https://www.airbnb.com/rooms/20062,20250129142212,2025-01-30,city scrape,PENTHOUSE /Terrace & pool /City views /2bedrooms,,,https://a0.muscache.com/pictures/165679/2eb448...,75891,...,4.94,4.93,4.93,4.79,f,4,4,0,0,1.84


In [4]:
# Get the number of rows and columns
num_rows, num_cols = df.shape

# Print the information
print(f"The dataset includes {num_rows} rows and {num_cols} columns.")

The dataset includes 35172 rows and 76 columns.


### ✍️ Your Response: 🔧
1. The dataset includes various attributes for each Airbnb listing, such as listing URL, scrape ID, last scraped date, source, name, description, neighborhood overview, picture URL, host ID, and more. As we saw from the .head() output, it appears to contain a mix of identification, descriptive, and potentially numerical data about each listing.



2. The dataset includes 35172 rows and 76 columns.

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns, like post IDs or text, may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they're still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?
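
As a rough illustration of this step, the sketch below drops identifier, URL, and free-text columns from a tiny made-up frame. The column list is an assumption for demonstration only and should be adapted to whatever is actually in your file.

```python
import pandas as pd

# Tiny made-up frame standing in for the real listings data.
df = pd.DataFrame({
    "id": [11508, 14222],
    "listing_url": ["https://example.com/a", "https://example.com/b"],
    "name": ["Loft", "Studio"],
    "accommodates": [2, 4],
    "price": [50.0, 80.0],
})

# Identifiers, URLs, and free text to remove; extend this list for your own file.
cols_to_drop = ["id", "listing_url", "name", "description", "picture_url"]

# errors="ignore" skips names that are not present instead of raising KeyError.
df_model = df.drop(columns=cols_to_drop, errors="ignore")
print(df_model.columns.tolist())  # ['accommodates', 'price']
```

Using `errors="ignore"` keeps the cell rerunnable even after some of the columns have already been removed.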


Index(['host_id', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
       'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms',
       'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'availability_eoy',
       'number_of_reviews_ly', 'estimated_occupancy_l365d',
       'estimated_revenue_l

### ✍️ Your Response: 🔧
1. I dropped the following columns: id, listing_url, scrape_id, last_scraped, source, name, description, neighborhood_overview, picture_url, host_url, host_name, host_since, host_location, host_about, host_thumbnail_url, host_picture_url, neighbourhood, neighbourhood_group_cleansed, and jurisdiction_names. They are unique identifiers, URLs, or free-text fields that are not directly useful for predicting price in a linear regression model.

2. Including these columns could add noise and increase model complexity without adding predictive power. Identifiers and URLs are unique to each listing, so a model can only memorize them rather than learn from them. Free-text fields cannot be used by a linear regression unless they are first converted to numeric features (for example, through one-hot encoding), which would create a very large number of sparse columns and can introduce multicollinearity if not handled properly.

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other, and to the target, helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?
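
One way to produce a ranked list of correlations with `price` is sketched below on a small made-up frame. The values are invented for illustration, and restricting the matrix to numeric columns with `select_dtypes` is an assumption about how the notebook handled text columns.

```python
import pandas as pd

# Made-up listings; only the numeric columns can enter a correlation matrix.
df = pd.DataFrame({
    "price": [50, 80, 120, 200, 150],
    "accommodates": [1, 2, 4, 6, 5],
    "minimum_nights": [7, 3, 2, 1, 2],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt"],
})

numeric_df = df.select_dtypes(include="number")

# Pearson correlation of every numeric column with price, strongest positive first.
price_corr = numeric_df.corr()["price"].sort_values(ascending=False)
print("Correlation with price:")
print(price_corr)
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.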


Correlation with price:
price                                           1.000000
estimated_revenue_l365d                         0.157548
calculated_host_listings_count_private_rooms    0.035703
availability_30                                 0.026971
availability_60                                 0.026390
availability_90                                 0.019492
bathrooms                                       0.017096
accommodates                                    0.013808
bedrooms                                        0.012011
beds                                            0.010984
longitude                                       0.004134
minimum_nights                                  0.002982
maximum_minimum_nights                          0.002746
minimum_nights_avg_ntm                          0.002086
minimum_minimum_nights                          0.001663
availability_eoy                                0.000234
availability_365                               -0.000494
maximum

### ✍️ Your Response: 🔧
1. Strongest positive or negative correlation with price: The variable with the strongest positive correlation with price is estimated_revenue_l365d (0.157548). There are no variables with a strong negative correlation, as all values are close to zero. estimated_occupancy_l365d (-0.022639) has the most negative correlation, but it's still very weak.
2. Variables that might be useful predictors: While the correlations are generally weak, estimated_revenue_l365d, calculated_host_listings_count_private_rooms, and the availability related columns show the highest (though still low) positive correlations. Variables like estimated_occupancy_l365d, number_of_reviews_ltm, number_of_reviews_ly, and minimum_maximum_nights have the highest (though still low) negative correlations. These variables might be useful predictors, but the overall weak correlations suggest that a simple linear regression might not explain a large portion of the price variation.

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you're predicting (target) and what you're using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?
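
A minimal sketch of separating the target from the predictors follows. It assumes only numeric columns are kept and that missing values are filled with the column mean (the reflection in section 10 mentions mean filling); both choices are assumptions here, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up listings with one missing value and one text column.
df = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 200.0],
    "accommodates": [1, 2, 4, 6],
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
})

# Target: the value we want to predict.
y = df["price"]

# Features: every numeric column except the target itself.
X = df.select_dtypes(include="number").drop(columns=["price"])

# LinearRegression cannot handle NaN, so fill gaps with each column's mean.
X = X.fillna(X.mean())

print(X.columns.tolist())  # ['accommodates', 'bedrooms']
```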


### ✍️ Your Response: 🔧
1. I am using all numeric columns in the DataFrame except price as features (58 in total). The full list can be seen by printing `X.columns`.
2. This is a regression problem because the target variable (price) is a continuous numerical value. Regression models are used to predict a continuous outcome, whereas classification models are used to predict a categorical outcome (i.e., assigning data points to discrete classes).

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing
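
The split can be sketched as follows on randomly generated stand-in data. The `random_state=42` seed is an assumption added for reproducibility and is not specified by the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data: 100 listings, 2 features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=100),
    "bedrooms": rng.integers(1, 5, size=100),
})
y = pd.Series(rng.uniform(20, 500, size=100), name="price")

# test_size=0.2 holds out 20% of rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set shape (X, y):", X_train.shape, y_train.shape)
print("Testing set shape (X, y):", X_test.shape, y_test.shape)
```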



Training set shape (X, y): (28137, 58) (28137,)
Testing set shape (X, y): (7035, 58) (7035,)


## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set
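
The fitting step might look like the sketch below. The data is synthetic, generated from a known linear rule so that the recovered coefficients can be sanity-checked; none of the numbers come from the Airbnb dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=200),
    "bathrooms": rng.integers(1, 4, size=200),
})
# Synthetic rule: price = 20 + 30*accommodates + 15*bathrooms + noise
y = 20 + 30 * X["accommodates"] + 15 * X["bathrooms"] + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))
```

On this synthetic data the fitted coefficients should land close to 30 and 15, which is a quick way to confirm the pipeline is wired correctly before running it on real listings.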



## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We'll use Mean Squared Error (MSE) and R² to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R² score for your model

### In Your Response:
1. What is your R² score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?
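
To make the two metrics concrete, here is a small worked sketch with hand-picked numbers (not taken from the Airbnb data). It also reports RMSE, the square root of MSE, which is an addition here because it is expressed in the same units as price and is easier to judge.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hand-picked actual and predicted prices; every prediction is off by 10.
y_test = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

mse = mean_squared_error(y_test, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # back in the units of price
r2 = r2_score(y_test, y_pred)              # share of variance explained

print(f"Mean Squared Error (MSE): {mse:.2f}")    # 100.00
print(f"Root Mean Squared Error: {rmse:.2f}")    # 10.00
print(f"R-squared (R²): {r2:.3f}")               # 0.985
```

Here each squared error is 100, so the MSE is 100 and the RMSE is 10. The total squared deviation of the actual prices from their mean is 20,000, so R² = 1 - 300/20,000 = 0.985.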


Mean Squared Error (MSE): 3161918444383.021
R-squared (R²): 0.010645122984766608


### ✍️ Your Response: 🔧
1. The R² score for the model is 0.0106. This means that only about 1.06% of the variance in price is explained by the numerical features used in this linear regression model. This is a very low R² score: the model explains almost none of the variation in Airbnb prices.
2. The Mean Squared Error (MSE) is 3161918444383.021. Because MSE is in squared price units, it is easier to judge after taking the square root: the RMSE is roughly 1.78 million in the units of `price`. Whether that is large depends on the typical range of prices in the dataset, but combined with the near-zero R² it indicates that the model's predictions are, on average, far from the actual prices. To improve the model, you could:
   - Include more relevant features, especially categorical ones, after appropriate encoding (e.g., one-hot encoding for room_type, neighbourhood_cleansed, etc.).
   - Explore different feature engineering techniques.
   - Consider more complex models that can capture non-linear relationships.
   - Address potential outliers in the target variable or features.

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?
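
A coefficient table of this kind could be built roughly as follows. The data is synthetic with known coefficients, so the signs and sizes shown are invented and say nothing about real listings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=200).astype(float),
    "review_scores_value": rng.uniform(3, 5, size=200),
    "minimum_nights": rng.integers(1, 30, size=200).astype(float),
})
# Synthetic rule with made-up effects: +25, -40, and +0.5 per unit.
y = (10 + 25 * X["accommodates"] - 40 * X["review_scores_value"]
     + 0.5 * X["minimum_nights"] + rng.normal(0, 1, size=200))

model = LinearRegression().fit(X, y)

# Pair each feature name with its fitted coefficient.
coef_table = pd.DataFrame({"feature": X.columns, "coefficient": model.coef_})

# Sort by absolute size so large negative effects rank alongside large positive ones.
order = coef_table["coefficient"].abs().sort_values(ascending=False).index
coef_table = coef_table.loc[order].reset_index(drop=True)
print(coef_table)
```

Sorting by absolute value rather than raw value keeps strongly negative coefficients from sinking to the bottom of the table.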


| | feature | coefficient |
|---|---|---|
| 3 | latitude | -1408557.0 |
| 4 | longitude | -211275.0 |
| 31 | review_scores_checkin | -77113.11 |
| 34 | review_scores_value | 76666.62 |
| 29 | review_scores_accuracy | 44600.15 |
| 28 | review_scores_rating | -39319.76 |
| 30 | review_scores_cleanliness | -38323.82 |
| 37 | calculated_host_listings_count_private_rooms | 35239.88 |
| 32 | review_scores_communication | -19913.21 |
| 33 | review_scores_location | 17088.42 |


### ✍️ Your Response: 🔧
1. The features with the largest positive coefficients in the table are review_scores_value (76,666.62), review_scores_accuracy (44,600.15), calculated_host_listings_count_private_rooms (35,239.88), and review_scores_location (17,088.42). The magnitude of a coefficient depends on the scale of its feature, so these values are not directly comparable. For example, latitude and longitude have the largest coefficients by absolute value (both negative), but this likely reflects the narrow range of these values within one city rather than a strong effect on price per unit change. Among features with more interpretable scales, calculated_host_listings_count_private_rooms and several of the review scores show a notable positive association with price.

2. latitude and longitude have very large negative coefficients, which can be surprising without considering their scale and the specific geographic area of the dataset. Other features with negative coefficients include various review_scores (rating, cleanliness, checkin, communication), bedrooms, beds, number_of_reviews related features, and some availability metrics. The negative coefficients for review scores might seem counterintuitive, as higher scores would typically be expected to increase price. This could be due to multicollinearity among review scores or other factors not captured in this simple linear model. The negative coefficients for bedrooms and beds are also surprising, as more bedrooms/beds would usually lead to a higher price. This might indicate complex relationships or interactions not captured by the linear model, or it could be influenced by other factors like room type (e.g., private rooms vs. entire homes).

3. Business insights:
   - Location matters: The large coefficients for latitude and longitude, despite being hard to interpret directly without context, suggest that location is a significant driver of price. Further analysis focusing on neighborhood or geographic clusters would be valuable.
   - Review scores have a complex relationship: The mixed positive and negative coefficients for different review scores suggest that the relationship between reviews and price is not straightforward in this simple linear model. More detailed analysis or different modeling approaches might be needed to understand how guest reviews truly impact pricing.
   - Property size and type are likely important: The surprising negative coefficients for bedrooms and beds highlight the need to include categorical features like room_type in the model. The type of listing (entire home, private room, shared room) likely has a much stronger and clearer relationship with price than just the number of bedrooms or beds.
   - Host listing count for private rooms: The positive coefficient for calculated_host_listings_count_private_rooms might indicate that hosts with more private room listings tend to charge more, perhaps due to experience or professional management.
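
The point about coefficient size depending on feature scale can be illustrated with a short sketch. Standardizing features with `StandardScaler` before fitting is a technique swapped in here for illustration, not something the notebook is known to have done, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 500
X = pd.DataFrame({
    "latitude": rng.uniform(-34.70, -34.50, size=n),            # very narrow range
    "accommodates": rng.integers(1, 9, size=n).astype(float),   # much wider range
})
# Synthetic rule: a large per-unit latitude effect, but latitude barely varies.
y = (100 + 100 * (X["latitude"] + 34.6) + 30 * X["accommodates"]
     + rng.normal(0, 5, size=n))

# Raw coefficients: change in price per one unit of each feature.
raw = pd.Series(LinearRegression().fit(X, y).coef_, index=X.columns)

# Standardized coefficients: change in price per one standard deviation.
X_scaled = StandardScaler().fit_transform(X)
scaled = pd.Series(LinearRegression().fit(X_scaled, y).coef_, index=X.columns)

print("Raw coefficients:")
print(raw)
print("Standardized coefficients:")
print(scaled)
```

In this synthetic setup the raw latitude coefficient is the larger of the two, yet after standardization `accommodates` dominates, because a full unit of latitude is far more than the feature ever varies within one city.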


## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features, but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3–5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R² between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to your customized learning outcome you created in canvas?
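
The three steps above can be sketched as follows on synthetic data. The helper `fit_and_score` is a hypothetical name, and keeping the top 2 features (rather than 3 to 5) is only because the toy frame has four columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "bathrooms": rng.integers(1, 4, size=n).astype(float),
    "latitude": rng.uniform(-34.7, -34.5, size=n),
    "number_of_reviews": rng.integers(0, 300, size=n).astype(float),
})
# Synthetic rule: only accommodates and bathrooms truly affect price.
y = 20 + 30 * X["accommodates"] + 15 * X["bathrooms"] + rng.normal(0, 10, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def fit_and_score(columns):
    """Fit on the given columns and return (model, test MSE, test R²)."""
    model = LinearRegression().fit(X_train[columns], y_train)
    pred = model.predict(X_test[columns])
    return model, mean_squared_error(y_test, pred), r2_score(y_test, pred)


# Baseline: all features.
baseline, base_mse, base_r2 = fit_and_score(list(X.columns))

# Rank features by absolute coefficient and keep the strongest ones.
coefs = pd.Series(baseline.coef_, index=X.columns)
top_features = coefs.abs().sort_values(ascending=False).head(2).index.tolist()

# Refined: only the top-ranked features.
refined, ref_mse, ref_r2 = fit_and_score(top_features)

print("Top features:", top_features)
print(f"Baseline MSE: {base_mse:.2f}, R²: {base_r2:.4f}")
print(f"Refined  MSE: {ref_mse:.2f}, R²: {ref_r2:.4f}")
```

Ranking by raw coefficient size can favor features with a narrow numeric range, such as latitude, so it is worth checking whether the selected columns make sense before trusting the refined model.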


Top 5 features: ['latitude', 'longitude', 'review_scores_checkin', 'review_scores_value', 'review_scores_accuracy']

Refined Model Mean Squared Error (MSE): 3198161980648.2256
Refined Model R-squared (R²): -0.0006953717164670525

Baseline Model Mean Squared Error (MSE): 3161918444383.021
Baseline Model R-squared (R²): 0.010645122984766608


### ✍️ Your Response: 🔧
1. I kept the top 5 features with the strongest absolute coefficients from the baseline model. These features are: latitude, longitude, review_scores_checkin, review_scores_value, and review_scores_accuracy. I chose these features because they had the largest impact on the predicted price in the initial model, based on the magnitude of their coefficients. The idea was to see if focusing on the most influential numerical features would improve the model's performance by potentially reducing noise from less impactful predictors.

2. No, model performance did not improve. The refined model's R² (-0.0007) is lower than the baseline's (0.0106), and its MSE (3198161980648.2256) is higher than the baseline's (3161918444383.021). A negative R² means the refined model predicts the test set slightly worse than simply predicting the mean price for every listing. Using only the top 5 numerical features therefore explains even less of the price variation and produces slightly higher prediction errors than using all numerical features. This could be because, even though these features had the largest coefficients among the numerical features, the combined effect of all numerical features, however weak, was slightly better at capturing some of the price variance. It also suggests that the truly impactful predictors are likely the categorical ones that we excluded earlier.

3. Based purely on the performance metrics (MSE and R²), neither model is particularly strong at predicting Airbnb prices with just the numerical features. The baseline model with all numerical features performs slightly better than the refined model, but both have very low R² scores, indicating they explain very little of the price variation. I would recommend neither of these models for making business decisions related to pricing. Instead, I would recommend further analysis and model building that includes categorical features (like room_type, neighbourhood_cleansed, etc.) after appropriate encoding, as these are likely much stronger predictors of price.

4. This experience directly relates to my customized learning outcome about evaluating different feature selection strategies. It demonstrated that simply picking features based on the magnitude of their coefficients in a basic linear model is not always an effective strategy for improving model performance, especially when dealing with a dataset that likely has complex relationships and important categorical features not initially included. It highlighted the importance of considering the nature of the data and exploring different feature engineering and selection techniques beyond just simple numerical correlation. It also reinforced the need to clearly communicate the limitations of a model with low explanatory power (low R²) to stakeholders, emphasizing that this simple model is not sufficient for making accurate pricing predictions and that a more comprehensive approach is required.


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to your customized learning outcome you created in canvas?


### ✍️ Your Response: 🔧
1. While the model's predictive power was low with only numerical features, it provided some preliminary insights into which numerical factors have the strongest linear association with price in this dataset. It started to answer the question of "Which quantifiable listing characteristics (excluding text and categorical fields) seem to influence price the most?" However, it also strongly suggested that numerical features alone are not sufficient to accurately predict price.

2. Recommendations:
   - For Airbnb: Focus on incorporating and properly handling categorical data (like room_type, neighbourhood_cleansed, etc.) in any pricing models. These are likely much stronger predictors than the numerical features explored here. Continue investing in data quality for review scores and other numerical metrics, but understand that their impact might be complex and not purely linear.
   - For hosts: Review scores showed mixed positive and negative coefficients, so their specific impact on price is not clear from this simple model. Location is likely a key driver, but this model doesn't provide specific location-based pricing guidance. The number of private room listings has a weak positive correlation with price, which might suggest that professional hosts with multiple private rooms operate differently or in different market segments. Overall, this model is not sufficient to provide concrete pricing recommendations to individual hosts.

3. Next steps:
   - Include categorical features: This is the most crucial next step. Encode categorical variables like room_type, neighbourhood_cleansed, property_type, host_response_time, host_is_superhost, etc., using techniques like one-hot encoding.
   - Feature engineering: Create new features, such as interaction terms (e.g., between room_type and accommodates), or features derived from text data (e.g., sentiment analysis of descriptions or amenities counts).
   - Handle missing values: Implement more sophisticated missing value imputation strategies beyond just filling with the mean (e.g., using median, mode, or more complex methods).
   - Explore different models: Try other regression algorithms that can handle non-linear relationships and potentially categorical features more effectively (e.g., Decision Trees, Random Forests, Gradient Boosting models).
   - Address outliers: Investigate and potentially handle outliers in the price variable and influential features.
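
As a small illustration of the one-hot encoding step mentioned above, the sketch below encodes a made-up `room_type` column with `pandas.get_dummies`. The values are invented, and dropping the first category to avoid redundant columns is one reasonable choice rather than a requirement.

```python
import pandas as pd

# Made-up listings with one categorical column.
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "accommodates": [4, 2, 1, 2],
    "price": [120.0, 60.0, 25.0, 55.0],
})

# One 0/1 column per category; drop_first removes the first category
# ("Entire home/apt"), which becomes the baseline the others are compared to.
encoded = pd.get_dummies(df, columns=["room_type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['accommodates', 'price', 'room_type_Private room', 'room_type_Shared room']
```

With this encoding, the coefficient on `room_type_Private room` would be read as the expected price difference between a private room and an entire home, holding the other features fixed.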

4. This experience directly relates to my customized learning outcome about evaluating different feature selection strategies. It demonstrated that simply picking features based on the magnitude of their coefficients in a basic linear model is not always an effective strategy for improving model performance, especially when dealing with a dataset that likely has complex relationships and important categorical features not initially included. It highlighted the importance of considering the nature of the data and exploring different feature engineering and selection techniques beyond just simple numerical correlation. It also reinforced the need to clearly communicate the limitations of a model with low explanatory power (low R²) to stakeholders, emphasizing that this simple model is not sufficient for making accurate pricing predictions and that a more comprehensive approach is required.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [16]:
!jupyter nbconvert --to html "assignment_11_ChristensenBryson.ipynb"

[NbConvertApp] Converting notebook assignment_11_ChristensenBryson.ipynb to html
[NbConvertApp] Writing 339426 bytes to assignment_11_ChristensenBryson.html
