# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb's business teams. Understanding what drives price, and being able to predict it accurately, helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You'll focus not just on code, but on what the results mean and how you'd communicate them to stakeholders.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_11_regression.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You'll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`. (Note: If you had significant errors with Assignment 7, you can use the file named `airbnb_listings.csv` in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?


In [None]:
# Add code here 🔧

In [4]:
df = pd.read_csv('cleaned_airbnb_data.csv')
display(df.head())

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,5456,https://www.airbnb.com/rooms/5456,20250613040113,2025-06-13,city scrape,"Walk to 6th, Rainey St and Convention Ctr",Great central location for walking to Convent...,My neighborhood is ideally located if you want...,https://a0.muscache.com/pictures/14084884/b5a3...,8028,...,4.82,4.73,4.79,,f,1.0,1.0,0.0,0.0,3.59
1,5769,https://www.airbnb.com/rooms/5769,20250613040113,2025-06-13,city scrape,NW Austin Room,,Quiet neighborhood with lots of trees and good...,https://a0.muscache.com/pictures/23822033/ac94...,8186,...,4.94,4.77,4.92,,f,1.0,0.0,1.0,0.0,1.65
2,6413,https://www.airbnb.com/rooms/6413,20250613040113,2025-06-14,previous scrape,Gem of a Studio near Downtown,"Great studio apartment, perfect a single perso...",Travis Heights is one of the oldest neighborho...,https://a0.muscache.com/pictures/hosting/Hosti...,13879,...,4.98,4.87,4.93,,f,1.0,1.0,0.0,0.0,0.65
3,6448,https://www.airbnb.com/rooms/6448,20250613040113,2025-06-13,city scrape,"Secluded Studio @ Zilker - King Bed, Bright & ...","Clean, private space with everything you need ...",The neighborhood is fun and funky (but quiet)!...,https://a0.muscache.com/pictures/airflow/Hosti...,14156,...,4.98,4.97,4.88,,t,1.0,1.0,0.0,0.0,2.02
4,8502,https://www.airbnb.com/rooms/8502,20250613040113,2025-06-13,city scrape,Woodland Studio Lodging,Studio rental on lower level of home located i...,,https://a0.muscache.com/pictures/miso/Hosting-...,25298,...,4.88,4.69,4.63,,f,1.0,1.0,0.0,0.0,0.29


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2410 entries, 0 to 2409
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            2410 non-null   int64  
 1   listing_url                                   2410 non-null   object 
 2   scrape_id                                     2410 non-null   int64  
 3   last_scraped                                  2410 non-null   object 
 4   source                                        2410 non-null   object 
 5   name                                          2410 non-null   object 
 6   description                                   2345 non-null   object 
 7   neighborhood_overview                         1664 non-null   object 
 8   picture_url                                   2410 non-null   object 
 9   host_id                                       2410 non-null   i

### ✍️ Your Response: 🔧
1. The dataset includes all the data collected about specific Airbnb rentals.
2. Each column holds at most 15,187 values, and there are 79 columns.

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns, such as post IDs or free text, may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they're still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?


In [6]:
# Step 1: clean all the data FIRST
# Drop columns with all missing values
df.dropna(axis=1, how='all', inplace=True)

# Fill missing values in numeric columns with the mean
numeric_cols = df.select_dtypes(include=['number']).columns
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2410 entries, 0 to 2409
Data columns (total 76 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            2410 non-null   int64  
 1   listing_url                                   2410 non-null   object 
 2   scrape_id                                     2410 non-null   int64  
 3   last_scraped                                  2410 non-null   object 
 4   source                                        2410 non-null   object 
 5   name                                          2410 non-null   object 
 6   description                                   2345 non-null   object 
 7   neighborhood_overview                         1664 non-null   object 
 8   picture_url                                   2410 non-null   object 
 9   host_id                                       2410 non-null   i

None

In [31]:
# Step 2: Drop columns that are identifiers, URLs, free text, or host metadata
columns_to_drop = [
    'maximum_nights_avg_ntm', 'minimum_nights_avg_ntm', 'maximum_maximum_nights',
    'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights',
    'id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
    'description', 'neighborhood_overview', 'picture_url',
    'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
    'host_about', 'host_response_time', 'host_response_rate',
    'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
    'host_picture_url', 'host_neighbourhood', 'host_verifications',
    'host_has_profile_pic', 'host_identity_verified',
    'neighbourhood', 'bathrooms_text', 'amenities',
    'calendar_last_scraped', 'first_review', 'last_review',
]

# Drop columns that exist in the DataFrame
existing_columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df.drop(columns=existing_columns_to_drop, inplace=True)

display(df.head())

Unnamed: 0,host_listings_count,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,1.0,2.0,78702,30.26057,-97.73441,Entire guesthouse,Entire home/apt,3,1.0,1.0,...,4.9,4.82,4.73,4.79,f,1,1,0,0,3.59
1,1.0,4.0,78729,30.45697,-97.78422,Private room in home,Private room,2,1.0,1.0,...,4.91,4.94,4.77,4.92,f,1,0,1,0,1.65
2,1.0,1.0,78704,30.24885,-97.73587,Entire guesthouse,Entire home/apt,2,1.73838,2.095101,...,4.99,4.98,4.87,4.93,f,1,1,0,0,0.65
3,1.0,2.0,78704,30.26034,-97.76487,Entire guesthouse,Entire home/apt,2,1.0,1.0,...,4.99,4.98,4.97,4.88,t,1,1,0,0,2.02
4,1.0,1.0,78741,30.23466,-97.73682,Entire guest suite,Entire home/apt,2,1.0,1.0,...,4.85,4.88,4.69,4.63,f,1,1,0,0,0.29


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15187 entries, 0 to 15186
Data columns (total 39 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_listings_count                           15187 non-null  float64
 1   host_total_listings_count                     15187 non-null  float64
 2   neighbourhood_cleansed                        15187 non-null  int64  
 3   latitude                                      15187 non-null  float64
 4   longitude                                     15187 non-null  float64
 5   property_type                                 15187 non-null  object 
 6   room_type                                     15187 non-null  object 
 7   accommodates                                  15187 non-null  int64  
 8   bathrooms                                     15187 non-null  float64
 9   bedrooms                                      15187 non-null 

Scaled numerical columns:
['host_listings_count', 'host_total_listings_count', 'neighbourhood_cleansed', 'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'availability_eoy', 'number_of_reviews_ly', 'estimated_occupancy_l365d', 'estimated_revenue_l365d', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']


Unnamed: 0,host_listings_count,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,accommodates,bathrooms,bedrooms,beds,price,...,property_type_Tower,property_type_Treehouse,property_type_Yurt,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room,has_availability_t,instant_bookable_f,instant_bookable_t
0,-0.206809,-0.211216,-1.075487,-0.308152,0.247408,-0.586211,-0.839966,-0.784119,-0.388463,$101.00,...,False,False,False,True,False,False,False,True,True,False
1,-0.206809,-0.208888,0.222251,2.751947,-0.525182,-0.878067,-0.839966,-0.784119,-0.788097,$45.00,...,False,False,False,False,False,True,False,True,True,False
2,-0.206809,-0.212381,-0.979358,-0.49076,0.224762,-0.878067,0.0,0.0,0.0,,...,False,False,False,True,False,False,False,True,True,False
3,-0.206809,-0.211216,-0.979358,-0.311735,-0.225049,-0.878067,-0.839966,-0.784119,-0.388463,$155.00,...,False,False,False,True,False,False,False,True,False,True
4,-0.206809,-0.212381,0.799023,-0.711854,0.210027,-0.878067,-0.839966,-0.784119,-0.788097,$43.00,...,False,False,False,True,False,False,False,True,True,False
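
The cell that produced the scaled and one-hot encoded output above is not shown. As a rough sketch only, one way to get a similar result is to standardize the numeric columns and one-hot encode the categorical ones. The use of `StandardScaler` and `pd.get_dummies` here is an assumption, and the `toy` DataFrame is hypothetical stand-in data, not the actual listings.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy listings; the real notebook works on the full Airbnb DataFrame.
toy = pd.DataFrame({
    "accommodates": [2, 4, 6, 3],
    "bedrooms": [1.0, 2.0, 3.0, 1.0],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Hotel room"],
    "price": ["$101.00", "$45.00", "$155.00", "$43.00"],
})

# Scale only the numeric predictors. price is still a string at this point,
# so select_dtypes leaves it out and it is not scaled.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
toy[numeric_cols] = StandardScaler().fit_transform(toy[numeric_cols])

# One-hot encode the categorical column (assumed approach: pd.get_dummies).
encoded = pd.get_dummies(toy, columns=["room_type"])

print("Scaled numerical columns:")
print(numeric_cols)
print(encoded.head())
```

Leaving `price` unscaled keeps the target in dollars, which makes the later error metrics easier to explain to a business audience.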


### ✍️ Your Response: 🔧
1. I dropped many variables that contained photo links, long descriptions, IDs, names, and similar fields.

2. Including them risks a model that is not representative of the actual data, because these columns add a lot of noise that needed to be cleaned up.

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other, and to the target, helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?


In [None]:
# Add code here 🔧

In [37]:
# Strip '$' and ',' from price, then convert to numeric (unparseable values become NaN)
df['price'] = df['price'].astype(str).str.replace('[$,]', '', regex=True)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Drop rows where price could not be converted to numeric
df.dropna(subset=['price'], inplace=True)

# Correlation of every numeric column with 'price', strongest positive first
correlation_matrix = df.corr(numeric_only=True)['price'].sort_values(ascending=False)

# Drop features whose correlation is undefined (NaN), e.g. constant columns
correlation_matrix.dropna(inplace=True)

display(correlation_matrix)

Unnamed: 0,price
price,1.000000
room_type_Hotel room,0.690135
host_total_listings_count,0.394259
estimated_revenue_l365d,0.366934
property_type_Room in hotel,0.331316
...,...
number_of_reviews_ltm,-0.054318
estimated_occupancy_l365d,-0.075099
instant_bookable_f,-0.079704
longitude,-0.081083
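
Sorting from highest to lowest puts strong negative correlations at the bottom of the list, where they are easy to miss. A small sketch of ranking by absolute correlation instead, so the strongest relationships in either direction come first, is shown here. The `toy` data and the `distance_km` column are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy data; distance_km is an invented column for illustration.
toy = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0, 300.0],
    "accommodates": [2, 3, 4, 5, 6],
    "distance_km": [10.0, 8.0, 7.0, 4.0, 2.0],
    "number_of_reviews": [5, 1, 4, 2, 3],
})

# Correlation of each feature with price, excluding price itself.
price_corr = toy.corr(numeric_only=True)["price"].drop("price")

# Rank by magnitude so strong negative relationships are not buried at the bottom.
ranked = price_corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

In this toy example `distance_km` ranks second even though its correlation is negative, which is the kind of predictor a plain descending sort would push to the end.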


### ✍️ Your Response: 🔧
1.

2.

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you're predicting (target) and what you're using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?
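
As a minimal sketch of this step, the target and predictors can be separated with a column selection and a `drop`. The `toy` DataFrame below is hypothetical; in the notebook the same two lines would be applied to the cleaned `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned, encoded listings.
toy = pd.DataFrame({
    "accommodates": [2, 4, 6, 3],
    "bedrooms": [1, 2, 3, 1],
    "room_type_Private room": [False, True, False, False],
    "price": [101.0, 45.0, 155.0, 43.0],
})

# Target: the numeric price we want to predict.
y = toy["price"]

# Features: every remaining column, with price removed so it cannot leak into X.
X = toy.drop(columns=["price"])

print(X.columns.tolist())
print(y.dtype)
```

Because `y` is a continuous dollar amount rather than a set of categories, predicting it is a regression task.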


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing
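
A sketch of an 80/20 split with `train_test_split()` follows. The synthetic `X` and `y` are hypothetical, and `random_state=42` is an arbitrary choice that simply makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 listings with two features and a noisy price.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=100),
    "bedrooms": rng.integers(1, 5, size=100),
})
y = 50 + 30 * X["accommodates"] + rng.normal(0, 10, size=100)

# test_size=0.2 holds out 20% of the rows for testing; the other 80% train the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```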



In [None]:
# Add code here 🔧

## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set
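
The fit-and-predict steps can be sketched as below. The data is hypothetical and built so that price is an exact linear function of the two features, which lets you check that the fitted coefficients recover the values used to generate it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic listings: price = 50 + 30*accommodates + 20*bedrooms.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=100),
    "bedrooms": rng.integers(1, 5, size=100),
})
y = 50 + 30 * X["accommodates"] + 20 * X["bedrooms"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then predict prices for the held-out test rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(dict(zip(X.columns, model.coef_.round(2))))
print(round(float(model.intercept_), 2))
```

On real listings the relationship is noisy, so the coefficients are estimates rather than exact values.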



In [None]:
# Add code here 🔧

## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We'll use Mean Squared Error (MSE) and R² to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R² score for your model

### In Your Response:
1. What is your R² score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?
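
To make the two metrics concrete, here is a small worked sketch using four hypothetical prices and predictions. Taking the square root of MSE (RMSE) is an extra step not required by the assignment, but it puts the error back in dollars, which helps when judging whether MSE is large or small.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted nightly prices, in dollars.
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]

mse = mean_squared_error(y_true, y_pred)   # average squared error, in dollars squared
rmse = mse ** 0.5                          # same error expressed in dollars
r2 = r2_score(y_true, y_pred)              # share of price variance explained

print(f"MSE:  {mse:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"R2:   {r2:.3f}")
```

Every prediction here is off by exactly $10, so MSE is 100, RMSE is 10, and R² is 0.968.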


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?
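
One possible way to build the coefficient table is sketched below, sorting by absolute value so large negative effects appear near the top alongside large positive ones. The features and the price formula are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: price = 100 + 40*bedrooms - 25*is_shared_room + 5*review_score.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, size=80),
    "is_shared_room": rng.integers(0, 2, size=80),
    "review_score": rng.uniform(3, 5, size=80),
})
y = 100 + 40 * X["bedrooms"] - 25 * X["is_shared_room"] + 5 * X["review_score"]

model = LinearRegression().fit(X, y)

# Pair each feature name with its coefficient, then sort by magnitude.
coef_table = (
    pd.DataFrame({"feature": X.columns, "coefficient": model.coef_})
    .sort_values("coefficient", key=lambda s: s.abs(), ascending=False)
    .reset_index(drop=True)
)

print(coef_table)
```

Coefficient sizes are only directly comparable when the features share a scale, which is one reason to standardize numeric columns before fitting.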


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

3.


## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features, but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3–5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R² between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to the customized learning outcome you created in Canvas?
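
The baseline-versus-refined comparison could be sketched as follows. Everything here is hypothetical: the data is synthetic with two informative features and three pure-noise features, `fit_and_score` is a helper invented for this example, and only the top 2 features are kept because the toy data has just two real signals.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: two informative features plus three noise features.
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame(
    rng.normal(size=(n, 5)),
    columns=["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"],
)
y = 50 + 30 * X["signal_a"] + 20 * X["signal_b"] + rng.normal(0, 5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_and_score(columns):
    """Fit on the given columns; return (model, test MSE, test R^2)."""
    model = LinearRegression().fit(X_train[columns], y_train)
    pred = model.predict(X_test[columns])
    return model, mean_squared_error(y_test, pred), r2_score(y_test, pred)

# Baseline: all features.
all_columns = list(X.columns)
baseline, baseline_mse, baseline_r2 = fit_and_score(all_columns)

# Refined: keep the features with the largest absolute baseline coefficients.
coefs = pd.Series(baseline.coef_, index=all_columns)
top_features = coefs.abs().sort_values(ascending=False).head(2).index.tolist()
refined, refined_mse, refined_r2 = fit_and_score(top_features)

comparison = pd.DataFrame(
    {"MSE": [baseline_mse, refined_mse], "R2": [baseline_r2, refined_r2]},
    index=["baseline (all features)", "refined (top features)"],
)
print(top_features)
print(comparison.round(3))
```

Both models are scored on the same test set, so any difference in MSE or R² reflects the feature choice rather than a different split.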


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

3.

4.


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1.

2.

3.

4.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_11_LastnameFirstname.ipynb"