<a href="https://colab.research.google.com/github/egs1sos/IS-4487/blob/main/assignment_11_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb's business teams. Understanding what drives price, and being able to predict it accurately, helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You'll focus not just on code, but on what the results mean and how you'd communicate them to stakeholders.




## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You'll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`.   (Note: If you had significant errors with assignment 7, you can use the file named "airbnb_listings.csv" in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?


In [None]:
# Add code here 🔧
url = '/content/cleaned_airbnb_data.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,flag
0,958,https://www.airbnb.com/rooms/958,20250901181253,2025-09-01,city scrape,"Bright, Modern Garden Unit - 1BR/1BTH",Our bright garden unit overlooks a lovely back...,Quiet cul de sac in friendly neighborhood<br /...,https://a0.muscache.com/pictures/be1bf5ac-a955...,1169,...,4.98,4.78,STR-0006854,f,1,1,0,0,2.53,False
1,5858,https://www.airbnb.com/rooms/5858,20250901181253,2025-09-01,city scrape,Creative Sanctuary,We live in a large Victorian house on a quiet ...,I love how our neighborhood feels quiet but is...,https://a0.muscache.com/pictures/hosting/Hosti...,8904,...,4.77,4.68,,f,1,1,0,0,0.53,False
2,8014,https://www.airbnb.com/rooms/8014,20250901181253,2025-09-01,city scrape,female HOST quiet fast internet market parking,Room is on the second floor so it gets a good ...,"The neighborhood is very residential, close to...",https://a0.muscache.com/pictures/2cc1fc3d-0ae0...,22402,...,4.59,4.66,STR-0000974,f,3,0,3,0,0.57,False
3,8142,https://www.airbnb.com/rooms/8142,20250901181253,2025-09-01,city scrape,*FriendlyRoom Apt. Style -UCSF/USF - San Franc...,Nice and good public transportation. 7 minute...,"N Juda Muni, Bus and UCSF Shuttle.<br /><br />...",https://a0.muscache.com/pictures/hosting/Hosti...,21994,...,4.7,4.7,,f,20,0,20,0,0.07,False
4,8339,https://www.airbnb.com/rooms/8339,20250901181253,2025-09-01,city scrape,Historic Alamo Square Victorian,"For creative humans who love art, space, photo...",,https://a0.muscache.com/pictures/miso/Hosting-...,24215,...,4.94,4.75,STR-0000264,f,1,1,0,0,0.13,False


### ✍️ Your Response: 🔧
1. The dataset contains one row per Airbnb listing, with listing details (URL, name, description, neighborhood overview), host information, review scores, and host listing counts.

2. There are 78 columns.
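
The row and column counts asked for above can be read directly from a DataFrame's `shape`. The following is a minimal sketch on a tiny made-up frame (the `demo` data is an assumption for illustration, not the assignment's file):

```python
import pandas as pd

# Tiny stand-in for the Airbnb file; the real data would come from pd.read_csv(...)
demo = pd.DataFrame({
    "name": ["Garden Unit", "Creative Sanctuary", "Quiet Room"],
    "price": [150.0, 235.0, None],
    "bedrooms": [1, 2, 1],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape
print(f"{n_rows} rows x {n_cols} columns")

# Missing values per column, worth checking before any modeling
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

Running the same two checks on the real DataFrame would give both numbers the question asks for.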

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns, such as post IDs or free text, may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they're still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?


In [None]:
# Add code here 🔧
columns_to_drop = ['id', 'description', 'scrape_id', 'source']
columns_to_drop_existing = [col for col in columns_to_drop if col in df.columns]
df.drop(columns_to_drop_existing, axis=1, inplace=True)
df.head()

Unnamed: 0,listing_url,last_scraped,name,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,flag
0,https://www.airbnb.com/rooms/958,2025-09-01,"Bright, Modern Garden Unit - 1BR/1BTH",Quiet cul de sac in friendly neighborhood<br /...,https://a0.muscache.com/pictures/be1bf5ac-a955...,1169,https://www.airbnb.com/users/show/1169,Holly,2008-07-31,"San Francisco, CA",...,4.98,4.78,STR-0006854,f,1,1,0,0,2.53,False
1,https://www.airbnb.com/rooms/5858,2025-09-01,Creative Sanctuary,I love how our neighborhood feels quiet but is...,https://a0.muscache.com/pictures/hosting/Hosti...,8904,https://www.airbnb.com/users/show/8904,Philip Jonathon,2009-03-02,"San Francisco, CA",...,4.77,4.68,,f,1,1,0,0,0.53,False
2,https://www.airbnb.com/rooms/8014,2025-09-01,female HOST quiet fast internet market parking,"The neighborhood is very residential, close to...",https://a0.muscache.com/pictures/2cc1fc3d-0ae0...,22402,https://www.airbnb.com/users/show/22402,Jia,2009-06-20,"San Francisco, CA",...,4.59,4.66,STR-0000974,f,3,0,3,0,0.57,False
3,https://www.airbnb.com/rooms/8142,2025-09-01,*FriendlyRoom Apt. Style -UCSF/USF - San Franc...,"N Juda Muni, Bus and UCSF Shuttle.<br /><br />...",https://a0.muscache.com/pictures/hosting/Hosti...,21994,https://www.airbnb.com/users/show/21994,Aaron,2009-06-17,"San Francisco, CA",...,4.7,4.7,,f,20,0,20,0,0.07,False
4,https://www.airbnb.com/rooms/8339,2025-09-01,Historic Alamo Square Victorian,,https://a0.muscache.com/pictures/miso/Hosting-...,24215,https://www.airbnb.com/users/show/24215,Rosmarie,2009-07-02,"San Francisco, CA",...,4.94,4.75,STR-0000264,f,1,1,0,0,0.13,False


### ✍️ Your Response: 🔧
1. I dropped `id`, `description`, `scrape_id`, `neighbourhood_group_cleansed`, `calendar_updated`, and `source` because they are not relevant to a regression on price.

2. Including these columns risks adding noise: identifiers and free text have no meaningful numeric relationship to price, so the model could pick up spurious patterns and its results would be skewed and harder to interpret.
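
As an alternative to filtering the list by hand, `DataFrame.drop` accepts `errors="ignore"`, which skips names that are not present. A small sketch on made-up data (the `demo` frame is an assumption for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "id": [958, 5858],
    "description": ["Bright garden unit", "Large Victorian house"],
    "price": [150.0, 235.0],
})

# 'scrape_id' is absent from this demo frame; errors="ignore" skips it
to_drop = ["id", "description", "scrape_id"]
trimmed = demo.drop(columns=to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

Because `drop` returns a new frame here, the original `demo` is left unchanged, which makes it easy to re-run the cell.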

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other, and to the target, helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?


In [None]:
# Add code here 🔧
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['price'].sort_values(ascending=False))

price                                           1.000000
estimated_revenue_l365d                         0.418524
host_total_listings_count                       0.158010
host_id                                         0.095392
host_listings_count                             0.091678
flag                                            0.066926
accommodates                                    0.061797
availability_30                                 0.052633
availability_365                                0.045418
availability_60                                 0.045087
availability_eoy                                0.043388
availability_90                                 0.042902
bedrooms                                        0.034858
longitude                                       0.034738
calculated_host_listings_count                  0.030099
bathrooms                                       0.025194
review_scores_cleanliness                       0.021262
beds                           

### ✍️ Your Response: 🔧
1. `estimated_revenue_l365d` has the strongest positive correlation with price (0.42), and `estimated_occupancy_l365d` has the strongest negative correlation.

2. `estimated_revenue_l365d` and `host_total_listings_count` look like the most useful predictors, although the second one's correlation with price is weak (0.16).
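
Sorting correlations by signed value pushes strong negative relationships to the bottom of the list. One way to see both directions at once is to rank by absolute value, sketched here on synthetic data (the `demo` frame and its `distance_km` column are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "price":        [100, 150, 200, 250, 300],
    "accommodates": [1, 2, 3, 4, 5],   # rises with price
    "distance_km":  [9, 7, 6, 3, 1],   # falls as price rises (hypothetical column)
    "host_id":      [5, 1, 4, 2, 3],   # arbitrary identifier
})

# Correlation of every numeric column with price, excluding price itself
corr_with_price = demo.corr(numeric_only=True)["price"].drop("price")

# Order by strength regardless of sign, but keep the signed values for reading
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

In this toy example the negatively correlated column ranks above the identifier column even though its signed value is the lowest.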

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you're predicting (target) and what you're using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?


In [None]:
# Add code here 🔧
target = df['price']
features = df.drop('price', axis=1)
features.head()

Unnamed: 0,listing_url,last_scraped,name,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,flag
0,https://www.airbnb.com/rooms/958,2025-09-01,"Bright, Modern Garden Unit - 1BR/1BTH",Quiet cul de sac in friendly neighborhood<br /...,https://a0.muscache.com/pictures/be1bf5ac-a955...,1169,https://www.airbnb.com/users/show/1169,Holly,2008-07-31,"San Francisco, CA",...,4.98,4.78,STR-0006854,f,1,1,0,0,2.53,False
1,https://www.airbnb.com/rooms/5858,2025-09-01,Creative Sanctuary,I love how our neighborhood feels quiet but is...,https://a0.muscache.com/pictures/hosting/Hosti...,8904,https://www.airbnb.com/users/show/8904,Philip Jonathon,2009-03-02,"San Francisco, CA",...,4.77,4.68,,f,1,1,0,0,0.53,False
2,https://www.airbnb.com/rooms/8014,2025-09-01,female HOST quiet fast internet market parking,"The neighborhood is very residential, close to...",https://a0.muscache.com/pictures/2cc1fc3d-0ae0...,22402,https://www.airbnb.com/users/show/22402,Jia,2009-06-20,"San Francisco, CA",...,4.59,4.66,STR-0000974,f,3,0,3,0,0.57,False
3,https://www.airbnb.com/rooms/8142,2025-09-01,*FriendlyRoom Apt. Style -UCSF/USF - San Franc...,"N Juda Muni, Bus and UCSF Shuttle.<br /><br />...",https://a0.muscache.com/pictures/hosting/Hosti...,21994,https://www.airbnb.com/users/show/21994,Aaron,2009-06-17,"San Francisco, CA",...,4.7,4.7,,f,20,0,20,0,0.07,False
4,https://www.airbnb.com/rooms/8339,2025-09-01,Historic Alamo Square Victorian,,https://a0.muscache.com/pictures/miso/Hosting-...,24215,https://www.airbnb.com/users/show/24215,Rosmarie,2009-07-02,"San Francisco, CA",...,4.94,4.75,STR-0000264,f,1,1,0,0,0.13,False


### ✍️ Your Response: 🔧
1. I'm using every column except `price` as a candidate feature. Only the numeric columns are passed to the model in Section 6.

2. This is a regression problem, not a classification problem, because price is a continuous numeric value rather than a category.
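
A compact way to define the target and a numeric-only feature matrix, while discarding rows with no price, might look like the sketch below. The `demo` frame is made up for illustration and is not the assignment's data:

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "price": [120.0, None, 310.0, 95.0],
    "bedrooms": [1, 2, 3, 1],
    "accommodates": [2, 4, 6, 2],
})

# Rows without a price cannot be used to train or score a price model
labeled = demo.dropna(subset=["price"])

y = labeled["price"]                                              # target
X = labeled.drop(columns="price").select_dtypes(include="number")  # numeric features only
print(X.columns.tolist(), len(y))
```

Dropping unlabeled rows before the split keeps `X` and `y` aligned, so no masking is needed later.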

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing



In [None]:
# Add code here 🔧
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
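
To confirm that a split really is 80/20 and reproducible, the shapes can be checked after calling `train_test_split`. A minimal sketch on synthetic arrays (`X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic rows with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```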

## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set



In [None]:
# Add code here 🔧
from sklearn.impute import SimpleImputer
import numpy as np

# Drop training rows whose target (price) is missing
nan_in_y_train = y_train.isna()
X_train_cleaned = X_train[~nan_in_y_train]
y_train_cleaned = y_train[~nan_in_y_train]

# Linear regression needs numeric inputs, so keep numeric columns only
X_train_numeric = X_train_cleaned.select_dtypes(include=np.number)
X_test_numeric = X_test.select_dtypes(include=np.number)

# Fill missing feature values with the training-set column means
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_numeric)
X_test_imputed = imputer.transform(X_test_numeric)

# Fit on the training data, then predict prices for the test set
model = LinearRegression()
model.fit(X_train_imputed, y_train_cleaned)
y_pred = model.predict(X_test_imputed)
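
The impute-then-fit steps can also be bundled with scikit-learn's `make_pipeline`, so the same imputation is applied automatically at prediction time. This is a sketch on synthetic data, not the notebook's method; the arrays below are assumptions chosen so the fit is exact:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data on the line price = 50 * x + 100, with one missing feature value.
# The mean of the observed x values is 2, so the missing row is imputed as x = 2.
X_train_demo = np.array([[1.0], [2.0], [np.nan], [3.0]])
y_train_demo = np.array([150.0, 200.0, 200.0, 250.0])

pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipe.fit(X_train_demo, y_train_demo)

# A missing value at prediction time is filled with the training mean (2)
pred = pipe.predict(np.array([[3.0], [np.nan]]))
print(pred)
```

A pipeline also makes it harder to fit the imputer on test rows by accident, since `fit` only ever sees the training data.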

## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We'll use Mean Squared Error (MSE) and R² to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R² score for your model

### In Your Response:
1. What is your R² score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?


In [None]:
# Add code here 🔧
# Score only the test rows that have an actual price
nan_in_y_test = y_test.isna().to_numpy()
y_test_cleaned = y_test[~nan_in_y_test]
y_pred_cleaned = y_pred[~nan_in_y_test]
mse = mean_squared_error(y_test_cleaned, y_pred_cleaned)
r2 = r2_score(y_test_cleaned, y_pred_cleaned)
print(f"MSE: {mse}")
print(f"R²: {r2}")

MSE: 5644348.308497341
R²: 0.5114792223581511


### ✍️ Your Response: 🔧
1. My R² is about 0.51, so the model explains roughly half of the variation in price. That makes it only moderately accurate.

2. My MSE is large (about 5.6 million). I could bring it down by cleaning the data further and dropping unnecessary columns.
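
MSE is in squared price units, which makes "large or small" hard to judge. Taking the square root gives RMSE, which is in the same units as price. A short worked calculation using the baseline MSE printed above:

```python
import math

mse = 5644348.308497341      # baseline MSE reported by the model above
rmse = math.sqrt(mse)        # same units as price
print(f"RMSE: {rmse:,.0f}")
```

This comes to roughly 2,376 price units per listing, a typical error size that is easier to explain to a host than the raw MSE.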

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?


In [None]:
# Add code here 🔧
coefficients = pd.DataFrame({'Feature': X_train_numeric.columns, 'Coefficient': model.coef_})
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
coefficients

Unnamed: 0,Feature,Coefficient
33,calculated_host_listings_count,394.7314
30,review_scores_communication,109.5591
32,review_scores_value,82.84685
5,bathrooms,54.74172
28,review_scores_cleanliness,53.31782
26,review_scores_rating,50.34958
4,accommodates,44.83989
7,beds,29.71727
31,review_scores_location,13.80616
2,host_total_listings_count,6.18273


### ✍️ Your Response: 🔧
1. `calculated_host_listings_count` had the largest positive coefficient (about 395), followed by the review scores for communication (about 110) and value (about 83).

2. The number of reviews had a surprisingly negative coefficient; I expected more reviews to go along with higher prices.

3. Airbnb could use a host's number of listings and review scores as a starting point for price predictions.
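
Sorting coefficients in descending order hides large negative effects at the bottom of the table. To put the most impactful features first regardless of sign, `sort_values` can take `key=abs`. A sketch with made-up coefficient values (hypothetical, not the model's actual output):

```python
import pandas as pd

# Hypothetical coefficients, for illustration only
coef_demo = pd.DataFrame({
    "Feature": ["bathrooms", "number_of_reviews", "beds"],
    "Coefficient": [54.7, -80.2, 29.7],
})

# key=abs ranks by magnitude while keeping the sign visible in the table
ranked = coef_demo.sort_values(by="Coefficient", key=abs, ascending=False)
print(ranked)
```

Keep in mind that coefficient size depends on each feature's units, so magnitudes are only directly comparable when features are on a similar scale.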


## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features, but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3–5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R² between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to the customized learning outcome you created in Canvas?


In [None]:
# Add code here 🔧
from sklearn.impute import SimpleImputer

# Keep five features with large positive coefficients in the baseline model
reduced_features = features[['calculated_host_listings_count', 'review_scores_communication', 'review_scores_value', 'bathrooms', 'accommodates']]

# Note: the imputer is fit on all rows before the split, so test rows influence the imputed means
imputer_reduced = SimpleImputer(strategy='mean')
reduced_features_imputed = imputer_reduced.fit_transform(reduced_features)

# Same test_size and random_state as the baseline, so both models are scored on the same rows
X_train_reduced, X_test_reduced, y_train_reduced, y_test_reduced = train_test_split(reduced_features_imputed, target, test_size=0.2, random_state=42)

# Train only on rows with a known price
not_nan_in_y_train_reduced = y_train_reduced.notna().to_numpy()
X_train_reduced_cleaned = X_train_reduced[not_nan_in_y_train_reduced]
y_train_reduced_cleaned = y_train_reduced[not_nan_in_y_train_reduced]
refined_model = LinearRegression()
refined_model.fit(X_train_reduced_cleaned, y_train_reduced_cleaned)
y_pred_reduced = refined_model.predict(X_test_reduced)

# Score only test rows with a known price
not_nan_in_y_test_reduced = y_test_reduced.notna().to_numpy()
y_test_reduced_cleaned = y_test_reduced[not_nan_in_y_test_reduced]
y_pred_reduced_cleaned = y_pred_reduced[not_nan_in_y_test_reduced]
mse_reduced = mean_squared_error(y_test_reduced_cleaned, y_pred_reduced_cleaned)
r2_reduced = r2_score(y_test_reduced_cleaned, y_pred_reduced_cleaned)
print(f"Refined Model MSE: {mse_reduced}")
print(f"Refined Model R²: {r2_reduced}")
print(f"\nBaseline Model MSE: {mse}")
print(f"Baseline Model R²: {r2}")

Refined Model MSE: 11559711.869395258
Refined Model R²: -0.0004980421302329674

Baseline Model MSE: 5644348.308497341
Baseline Model R²: 0.5114792223581511


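### ✍️ Your Response: 🔧
1. I kept `calculated_host_listings_count`, `review_scores_communication`, `review_scores_value`, `bathrooms`, and `accommodates`, five of the features with the largest positive coefficients in the baseline model.

2. Performance got worse: R² fell from 0.51 to roughly 0, and MSE about doubled (from 5.6 million to 11.6 million). A large coefficient does not by itself mean strong predictive power, because coefficient size depends on each feature's scale, and the refined model leaves out `estimated_revenue_l365d`, the feature most correlated with price. I also feel I didn't clean these columns as well as I wanted to.

3. I would recommend the original model, since it was more accurate on the test set.

4. This relates to my goal of using analytics to make strategic decisions, since regression results can inform which strategic direction a company takes.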
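
The refined model above imputes missing values before splitting, which lets test rows influence the imputed means. A sketch of the alternative order (split first, then fit the imputer on training rows only), using synthetic data that is an assumption for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Six synthetic rows with one feature; one value is missing
X_demo = np.array([[1.0], [np.nan], [3.0], [5.0], [7.0], [100.0]])
y_demo = np.arange(6, dtype=float)

# Split first so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_tr_imp = imputer.fit_transform(X_tr)   # means learned from training rows only
X_te_imp = imputer.transform(X_te)       # test rows reuse the training means

print(imputer.statistics_)
```

On a dataset of this notebook's size the difference in imputed means is usually small, but keeping the order consistent makes the baseline and refined models easier to compare.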


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. This model answered the question of whether price can be predicted from a listing's other attributes.

2. I would recommend that Airbnb further refine this regression model to make it more accurate before relying on it.

3. I would like to clean and transform more columns, as I felt that affected the accuracy of the model.

4. This relates to my goal of using analytics to make strategic decisions, since regression results can inform which strategic direction a company takes.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_11_LastnameFirstname.ipynb"