# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb‚Äôs business teams. Understanding what drives price ‚Äî and being able to predict it accurately ‚Äî helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You‚Äôll focus not just on code, but on what the results mean and how you‚Äôd communicate them to stakeholders.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_11_regression.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [49]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You‚Äôll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`.   (Note: If you had significant errors with assignment 7, you can use the file named "airbnb_listings.csv" in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?


In [50]:
df = pd.read_csv('/content/cleaned_airbnb_data.csv')
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15187 entries, 0 to 15186
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            15187 non-null  int64  
 1   listing_url                                   15187 non-null  object 
 2   scrape_id                                     15187 non-null  int64  
 3   last_scraped                                  15187 non-null  object 
 4   source                                        15187 non-null  object 
 5   name                                          15187 non-null  object 
 6   description                                   14840 non-null  object 
 7   neighborhood_overview                         7914 non-null   object 
 8   picture_url                                   15186 non-null  object 
 9   host_id                                       15187 non-null 

None

In [None]:
# Add code here üîß

### ‚úçÔ∏è Your Response: üîß
1. This dataset includes information about the rental properties, everywhere from location, host, reviews, rental size, avg stay, etc.

2. There are around 15,187 rows and 78 columns in this dataset

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns ‚Äî like post IDs or text ‚Äî may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they‚Äôre still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?


In [None]:
# Add code here üîß

In [51]:
columns_to_drop = [col for col in ['id','maximum_nights_avg_ntm ','host_total_listings_count','minimum_nights_avg_ntm','maximum_maximum_nights','neighborhood_overview','minimum_minimum_nights','maximum_minimum_nights','minimum_maximum_nights','first_review','last_review','scrape_id', 'listing_url', 'picture_url', 'host_id', 'host_url', 'last_scraped', 'calendar_last_scraped', 'source', 'host_thumbnail_url', 'host_picture_url', 'name', 'description', 'host_about', 'host_verifications', 'neighbourhood', 'bathrooms_text', 'host_name', 'host_location', 'host_neighbourhood', 'neighbourhood_group_cleansed', 'calendar_updated', 'license'] if col in df.columns]

df = df.drop(columns=columns_to_drop)

display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15187 entries, 0 to 15186
Data columns (total 47 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_since                                    15185 non-null  object 
 1   host_response_time                            10868 non-null  object 
 2   host_response_rate                            10868 non-null  object 
 3   host_acceptance_rate                          11688 non-null  object 
 4   host_is_superhost                             14709 non-null  object 
 5   host_listings_count                           15185 non-null  float64
 6   host_has_profile_pic                          15185 non-null  object 
 7   host_identity_verified                        15185 non-null  object 
 8   neighbourhood_cleansed                        15187 non-null  int64  
 9   latitude                                      15187 non-null 

None

In [33]:
display(df.head())

Unnamed: 0,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2009-02-16,within a few hours,100%,92%,t,1.0,t,t,78702,30.26057,...,4.9,4.82,4.73,4.79,f,1,1,0,0,3.59
1,2009-02-19,within an hour,100%,100%,f,1.0,t,t,78729,30.45697,...,4.91,4.94,4.77,4.92,f,1,0,1,0,1.65
2,2009-04-17,within an hour,100%,100%,t,1.0,t,t,78704,30.24885,...,4.99,4.98,4.87,4.93,f,1,1,0,0,0.65
3,2009-04-20,within an hour,100%,96%,t,1.0,t,t,78704,30.26034,...,4.99,4.98,4.97,4.88,t,1,1,0,0,2.02
4,2009-07-11,within a day,80%,50%,f,1.0,t,f,78741,30.23466,...,4.85,4.88,4.69,4.63,f,1,1,0,0,0.29


### ‚úçÔ∏è Your Response: üîß
1. I dropped a buch of columns that were too unique and not relevant (id's, urls, timestamp data--> all in text) as well as text with long descriptions, redundant columns ( bathroom_text), columns that showed multicoliniarity risks (min and max nights), and columns that were empty.

2. Risks that can occur if we didn't take out these columns is it would be hard to train the model because there is so much busy and unessasary data.

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other ‚Äî and to the target ‚Äî helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?


In [52]:
correlation_with_price = correlation_matrix['price'].sort_values(key=abs, ascending=False)
display("Correlation of features with Price:")
display(correlation_with_price)

'Correlation of features with Price:'

Unnamed: 0,price
price,1.0
estimated_revenue_l365d,0.366934
host_listings_count,0.299557
calculated_host_listings_count,0.209643
maximum_nights,0.106529
longitude,-0.081083
estimated_occupancy_l365d,-0.075099
calculated_host_listings_count_private_rooms,0.058815
number_of_reviews_ltm,-0.054318
reviews_per_month,-0.053729


### ‚úçÔ∏è Your Response: üîß
1. The top 3 that were most correlated to price was estimated_ revenue, host listing count and maximum nights. The top variables that were negatively correlated was reviews score accuracy, min nights avg, and review scores.

2. The variables that migh be valuble predictors are the ones that are positively correlated to the price variable.

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you‚Äôre predicting (target) and what you‚Äôre using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?


### ‚úçÔ∏è Your Response: üîß
1.

2.

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing



In [None]:
# Add code here üîß

## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set



In [None]:
# Add code here üîß

## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We‚Äôll use Mean Squared Error (MSE) and R¬≤ to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R¬≤ score for your model

### In Your Response:
1. What is your R¬≤ score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?


In [None]:
# Add code here üîß

### ‚úçÔ∏è Your Response: üîß
1.

2.

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?


In [None]:
# Add code here üîß

### ‚úçÔ∏è Your Response: üîß
1.

2.

3.


## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features ‚Äî but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3‚Äì5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R¬≤ between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to your customized learning outcome you created in canvas?


In [None]:
# Add code here üîß

### ‚úçÔ∏è Your Response: üîß
1.

2.

3.

4.


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to your customized learning outcome you created in canvas?


### ‚úçÔ∏è Your Response: üîß
1.

2.

3.

4.

## Submission Instructions
‚úÖ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_11_LastnameFirstname.ipynb"

# Task
Analyze the "cleaned_airbnb_data.csv" dataset by performing the following steps: load the data, display information about the dataset, drop specified columns, print the 'availability_30' column, generate a correlation matrix excluding 'price', identify variables strongly related to 'price', explain why this is a regression problem, create a regression model with 'price' as the target variable and the top 10 correlated features as predictors, and evaluate the model.

## Identify top features

### Subtask:
Select the top 10 features most correlated with 'price' (excluding 'price').


**Reasoning**:
Calculate the absolute correlation of numeric features with 'price', sort them, exclude 'price', and select the top 10 features to identify the most relevant predictors for the regression model.



In [53]:
numeric_df = df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()
correlation_with_price = correlation_matrix['price'].abs().sort_values(ascending=False)
top_features = correlation_with_price.drop('price').head(10).index.tolist()

KeyError: 'price'

**Reasoning**:
The previous code failed because the 'price' column was not in the numeric dataframe. I need to check the data type of the 'price' column and convert it to numeric if necessary before calculating correlations.



In [54]:
df['price'] = df['price'].replace({'\$': '', ',': ''}, regex=True).astype(float)
numeric_df = df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()
correlation_with_price = correlation_matrix['price'].abs().sort_values(ascending=False)
top_features = correlation_with_price.drop('price').head(10).index.tolist()
display(top_features)

  df['price'] = df['price'].replace({'\$': '', ',': ''}, regex=True).astype(float)


['estimated_revenue_l365d',
 'host_listings_count',
 'calculated_host_listings_count',
 'maximum_nights',
 'longitude',
 'estimated_occupancy_l365d',
 'calculated_host_listings_count_private_rooms',
 'number_of_reviews_ltm',
 'reviews_per_month',
 'availability_30']

## Define features and target

### Subtask:
Create the feature matrix (X) using the selected top features and the target vector (y) using the 'price' column.


**Reasoning**:
Create the feature matrix X and the target vector y using the selected top features.



In [55]:
X = df[top_features]
y = df['price']

## Handle missing values

### Subtask:
Address any remaining missing values in the selected features.


**Reasoning**:
Check for missing values in the features DataFrame.



In [56]:
display(X.isnull().sum())

Unnamed: 0,0
estimated_revenue_l365d,4479
host_listings_count,2
calculated_host_listings_count,0
maximum_nights,0
longitude,0
estimated_occupancy_l365d,0
calculated_host_listings_count_private_rooms,0
number_of_reviews_ltm,0
reviews_per_month,2911
availability_30,0


**Reasoning**:
Impute missing values in 'estimated_revenue_l365d' and 'reviews_per_month' using the median, and drop rows with missing 'host_listings_count' as there are only two.



In [57]:
X['estimated_revenue_l365d'] = X['estimated_revenue_l365d'].fillna(X['estimated_revenue_l365d'].median())
X['reviews_per_month'] = X['reviews_per_month'].fillna(X['reviews_per_month'].median())
X = X.dropna(subset=['host_listings_count'])
y = y[X.index] # Ensure y aligns with the rows kept in X
display(X.isnull().sum())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['estimated_revenue_l365d'] = X['estimated_revenue_l365d'].fillna(X['estimated_revenue_l365d'].median())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['reviews_per_month'] = X['reviews_per_month'].fillna(X['reviews_per_month'].median())


Unnamed: 0,0
estimated_revenue_l365d,0
host_listings_count,0
calculated_host_listings_count,0
maximum_nights,0
longitude,0
estimated_occupancy_l365d,0
calculated_host_listings_count_private_rooms,0
number_of_reviews_ltm,0
reviews_per_month,0
availability_30,0


## Split data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using the selected features and target variable.



In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(12148, 10)

(3037, 10)

(12148,)

(3037,)

## Fit model

### Subtask:
Fit a linear regression model to the training data and make predictions on the test set.


**Reasoning**:
Fit a linear regression model to the training data and make predictions on the test set.



In [59]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

ValueError: Input y contains NaN.

**Reasoning**:
The traceback indicates that the target variable `y_train` contains NaN values, which the Linear Regression model cannot handle. I need to remove the rows with NaN values from `y_train` and the corresponding rows from `X_train` before fitting the model. I will then perform the fitting and prediction steps.



In [60]:
nan_indices_train = y_train[y_train.isna()].index
X_train = X_train.drop(nan_indices_train)
y_train = y_train.drop(nan_indices_train)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)