# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb's business teams. Understanding what drives price, and being able to predict it accurately, helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You'll focus not just on code, but on what the results mean and how you'd communicate them to stakeholders.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_11_regression.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You'll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`. (Note: If you had significant errors with Assignment 7, you can use the file named `airbnb_listings.csv` in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?
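
One way to answer both questions at once is to build a small summary of shape, column types, and missing values. The following is a minimal sketch on a made-up four-row frame; the `listings` frame and its values are hypothetical stand-ins for the file you load with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned listings file (not real data)
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", None],
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
    "price": [29.0, 101.0, 101.0, 64.0],
})

# Shape answers "how many rows and columns?"
n_rows, n_cols = listings.shape

# One row per column: its dtype and how many values are missing
summary = pd.DataFrame({
    "dtype": listings.dtypes.astype(str),
    "missing": listings.isna().sum(),
})

print(f"{n_rows} rows x {n_cols} columns")
print(summary)
```

The same two calls (`.shape` and `.isna().sum()`) work unchanged on the full dataset.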


In [2]:
# Import your CSV file
df = pd.read_csv('cleaned_airbnb_data.csv')

# Preview the dataset
display(df.head())
df.shape

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,21853,https://www.airbnb.com/rooms/21853,20250612050748,2025-06-26,city scrape,Bright and airy room,We have a quiet and sunny room with a good vie...,We live in a leafy neighbourhood with plenty o...,https://a0.muscache.com/pictures/68483181/87bc...,83531,...,4.82,4.21,4.67,,f,2,0,2,0,0.25
1,30320,https://www.airbnb.com/rooms/30320,20250612050748,2025-06-27,previous scrape,Apartamentos Dana Sol,,,https://a0.muscache.com/pictures/336868/f67409...,130907,...,4.78,4.9,4.69,,f,3,3,0,0,0.94
2,30959,https://www.airbnb.com/rooms/30959,20250612050748,2025-06-27,previous scrape,Beautiful loft in Madrid Center,Beautiful Loft 60m2 size just in the historica...,,https://a0.muscache.com/pictures/78173471/835e...,132883,...,4.63,4.88,4.25,,f,1,1,0,0,0.06
3,40916,https://www.airbnb.com/rooms/40916,20250612050748,2025-06-26,previous scrape,Apartasol Apartamentos Dana,,,https://a0.muscache.com/pictures/hosting/Hosti...,130907,...,4.79,4.88,4.55,,t,3,3,0,0,0.27
4,62423,https://www.airbnb.com/rooms/62423,20250612050748,2025-06-25,city scrape,MAGIC ARTISTIC HOUSE IN THE CENTER OF MADRID,INCREDIBLE HOME OF AN ARTIST SURROUNDED BY PAI...,DISTRICT WITH VERY GOOD VIBES IN THE MIDDLE OF...,https://a0.muscache.com/pictures/miso/Hosting-...,303845,...,4.86,4.97,4.59,,f,3,1,2,0,2.7


(26004, 76)

### ✍️ Your Response: 🔧
1. The dataset contains detailed information about Airbnb property listings, such as host details, property characteristics (e.g., room type, bedrooms, minimum nights), location data (neighborhood, latitude, longitude), and guest review metrics. These features help analyze how listing attributes influence pricing and booking performance.

2. The dataset includes 26,004 rows (individual listings) and 76 columns (variables describing each listing).

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns, like post IDs or text, may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they're still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?
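
If you are unsure which columns are free text or identifiers, a rough heuristic can help: a non-numeric column whose values are almost all unique is unlikely to be a reusable category. The sketch below assumes a 0.9 uniqueness threshold and a hand-written list of numeric ID columns; both are illustrative choices, not part of the assignment.

```python
import pandas as pd

# Hypothetical miniature of the listings data (not real values)
listings = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "listing_url": ["u1", "u2", "u3", "u4"],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Private room"],
    "accommodates": [2, 4, 4, 2],
    "price": [29.0, 101.0, 101.0, 64.0],
})

# Assumption: non-numeric columns that are (almost) all unique are text or IDs
UNIQUE_RATIO_THRESHOLD = 0.9

text_like = [
    col
    for col in listings.select_dtypes(exclude="number").columns
    if listings[col].nunique() / len(listings) >= UNIQUE_RATIO_THRESHOLD
]

# Numeric IDs look like ordinary numbers, so they have to be listed by hand
known_id_columns = ["id", "host_id"]

# errors="ignore" skips names that are not present in this frame
reduced = listings.drop(columns=text_like + known_id_columns, errors="ignore")

print("Flagged as text-like:", text_like)
print("Remaining columns:", list(reduced.columns))
```

Review the flagged list before dropping anything; a heuristic like this can also flag a genuinely useful column in a small sample.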


In [3]:
# Drop columns not useful for modeling
columns_to_drop = ['listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_url', 'host_name', 'host_location', 'host_about', 'host_thumbnail_url', 'host_picture_url', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_group_cleansed', 'property_type', 'calendar_updated', 'calendar_last_scraped', 'first_review', 'last_review', 'license']
df = df.drop(columns=columns_to_drop, errors='ignore')

display(df.head())
df.shape

Unnamed: 0,id,host_id,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,neighbourhood_cleansed,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,21853,83531,2010-02-21,,,,f,2.0,2.0,Cármenes,...,4.75,4.82,4.21,4.67,f,2,0,2,0,0.25
1,30320,130907,2010-05-24,,,,f,3.0,6.0,Sol,...,4.82,4.78,4.9,4.69,f,3,3,0,0,0.94
2,30959,132883,2010-05-26,,,,f,1.0,4.0,Embajadores,...,4.63,4.63,4.88,4.25,f,1,1,0,0,0.06
3,40916,130907,2010-05-24,,,,f,3.0,6.0,Universidad,...,4.85,4.79,4.88,4.55,t,3,3,0,0,0.27
4,62423,303845,2010-11-29,within an hour,100.0,100%,f,3.0,3.0,Justicia,...,4.8,4.86,4.97,4.59,f,3,1,2,0,2.7


(26004, 53)

### ✍️ Your Response: 🔧
1. I dropped 23 columns, including listing URLs, host profile details, property descriptions, neighborhood text, and calendar information such as listing_url, description, host_name, neighbourhood, and calendar_updated.

2. These columns were removed because they contain text, links, or identifiers that do not provide meaningful numeric input for predicting price. If included, they could cause overfitting, add noise or bias, and make the model harder to interpret and less accurate.

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other, and to the target, helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?
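
Sorting correlations from highest to lowest puts strong negative relationships at the very bottom, where they are easy to miss. A minimal sketch, using made-up numbers, that ranks features by the absolute value of their correlation with `price` instead:

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the ranking
listings = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 150.0, 200.0],
    "accommodates": [1, 2, 4, 4, 6],
    "number_of_reviews": [90, 60, 40, 30, 5],
    "minimum_nights": [2, 1, 3, 2, 2],
})

# Correlation of every numeric column with price, excluding price itself
corr_with_price = listings.corr(numeric_only=True)["price"].drop("price")

# Rank by strength of the relationship, keeping the sign for interpretation
ranked = corr_with_price.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

In this toy frame the strongly negative `number_of_reviews` ranks alongside the strongly positive `accommodates`, and the weak `minimum_nights` falls to the end.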


In [4]:
# Generate a correlation matrix
correlation_matrix = df.corr(numeric_only=True)

# Identify which variables are strongly related to price
price_correlation = correlation_matrix['price'].sort_values(ascending=False)

display(price_correlation)

| Feature | Correlation with `price` |
|---|---|
| price | 1.0 |
| estimated_revenue_l365d | 0.280371 |
| accommodates | 0.140619 |
| beds | 0.115445 |
| bedrooms | 0.101488 |
| bathrooms | 0.083521 |
| host_listings_count | 0.049428 |
| host_total_listings_count | 0.048888 |
| maximum_maximum_nights | 0.030205 |
| maximum_nights | 0.030197 |


### ✍️ Your Response: 🔧
1. The strongest positive correlations were with estimated_revenue_l365d, accommodates, beds, and bedrooms, meaning larger or higher-earning listings tend to be priced higher. The strongest negative correlations were with number_of_reviews_l30d, estimated_occupancy_l365d, and calculated_host_listings_count_private_rooms, which are typically linked to lower-priced listings.

2. Variables like accommodates, beds, bedrooms, bathrooms, and estimated_revenue_l365d would be useful predictors since they directly reflect the property's size, amenities, and potential value.

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you're predicting (target) and what you're using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?
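
Before separating features from the target, it can be worth checking for numeric information stored as text. In the preview above, `host_acceptance_rate` appears as `100%`; if pandas reads that column as text, one-hot encoding would treat each distinct percentage as its own category. A minimal sketch of converting such values to numbers, using a hypothetical four-value series:

```python
import pandas as pd

# Hypothetical sample of percent strings, including missing and unparseable entries
rates = pd.Series(["100%", "87%", None, "N/A"], name="host_acceptance_rate")

# Strip the trailing percent sign, then coerce anything non-numeric to NaN
cleaned = pd.to_numeric(rates.str.rstrip("%"), errors="coerce")

print(cleaned)
```

The resulting NaN values can then be handled by the same imputation step as the other numeric columns.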


In [5]:
# Set price as your target variable
y = df['price']

# Convert 'host_since' to datetime and then to a numerical feature (days since earliest host date)
df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')
earliest_host_date = df['host_since'].min()
df['host_since_days'] = (df['host_since'] - earliest_host_date).dt.days

# Remove original 'host_since' and 'price' from your predictors
X = df.drop(['price', 'host_since'], axis=1)

# Display the first few rows of the features and target
display(X.head())
display(y.head())

Unnamed: 0,id,host_id,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,neighbourhood_cleansed,latitude,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,host_since_days
0,21853,83531,,,,f,2.0,2.0,Cármenes,40.40381,...,4.82,4.21,4.67,f,2,0,2,0,0.25,191.0
1,30320,130907,,,,f,3.0,6.0,Sol,40.41476,...,4.78,4.9,4.69,f,3,3,0,0,0.94,283.0
2,30959,132883,,,,f,1.0,4.0,Embajadores,40.41259,...,4.63,4.88,4.25,f,1,1,0,0,0.06,285.0
3,40916,130907,,,,f,3.0,6.0,Universidad,40.42247,...,4.79,4.88,4.55,t,3,3,0,0,0.27,283.0
4,62423,303845,within an hour,100.0,100%,f,3.0,3.0,Justicia,40.41884,...,4.86,4.97,4.59,f,3,1,2,0,2.7,472.0


Unnamed: 0,price
0,29.0
1,101.0
2,101.0
3,101.0
4,64.0


### ✍️ Your Response: 🔧
1. I'm using numerical and categorical features that describe each listing, such as host_listings_count, bedrooms, bathrooms, accommodates, review_scores, and neighbourhood_cleansed. These variables capture property size, quality, and location characteristics that influence price.

2. This is a regression problem because the goal is to predict a continuous numeric value (the listing's price) rather than assigning listings to categories or labels.

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing



In [6]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set
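
Categorical columns and missing values have to be handled before `LinearRegression` can be fit. One possible approach, shown here as a sketch rather than the required method, is a scikit-learn `Pipeline` that learns imputation and one-hot encoding from the training data only and then reuses them on the test data. The miniature `train` and `test` frames and the column choices are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature training and test sets (not real listings)
train = pd.DataFrame({
    "accommodates": [2.0, 4.0, np.nan, 6.0, 3.0, 5.0],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Entire home/apt", "Private room", "Entire home/apt"],
    "price": [40.0, 100.0, 90.0, 160.0, 55.0, 130.0],
})
test = pd.DataFrame({
    "accommodates": [2.0, 4.0],
    # "Shared room" never appears in the training data
    "room_type": ["Shared room", "Entire home/apt"],
})

numeric_cols = ["accommodates"]
categorical_cols = ["room_type"]

# Numeric columns: fill gaps with the training median.
# Categorical columns: one-hot encode, ignoring categories unseen during fit.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])

pipeline.fit(train[numeric_cols + categorical_cols], train["price"])
predictions = pipeline.predict(test)

print(predictions)
```

With `handle_unknown="ignore"`, a category that appears only in the test set is encoded as all zeros, so no manual column alignment is needed.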



In [9]:
from sklearn.impute import SimpleImputer

# Identify categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object']).columns
numerical_features = X_train.select_dtypes(include=['number']).columns

# Perform one-hot encoding on categorical features
X_train_encoded = pd.get_dummies(X_train, columns=categorical_features, dummy_na=False)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_features, dummy_na=False)

# Align columns after one-hot encoding (important for consistent features)
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='inner', axis=1, fill_value=0)

# Impute missing values in the numeric columns. Re-select them after encoding because the
# column set has changed. Depending on the pandas version, the dummy columns are uint8
# (selected here) or bool (not selected); either way they contain no missing values.
numerical_features_encoded = X_train_encoded.select_dtypes(include=['number']).columns

numerical_imputer = SimpleImputer(strategy='mean')
X_train_encoded[numerical_features_encoded] = numerical_imputer.fit_transform(X_train_encoded[numerical_features_encoded])
X_test_encoded[numerical_features_encoded] = numerical_imputer.transform(X_test_encoded[numerical_features_encoded])

# The one-hot columns need no separate imputation: with dummy_na=False, a missing
# category is encoded as 0 in every dummy column.

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train_encoded, y_train)

# Predict prices for the test set
y_pred = model.predict(X_test_encoded)

## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We'll use Mean Squared Error (MSE) and R² to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R² score for your model

### In Your Response:
1. What is your R² score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?


In [10]:
# Calculate MSE and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

Mean Squared Error (MSE): 224242.5379267591
R-squared (R²): 0.00047822764128369055


### ✍️ Your Response: 🔧
1. The R² score is 0.0005, which means the model explains almost none of the variation in Airbnb prices. In other words, the predictors used so far are not effectively capturing the factors that influence price.

2. The MSE of 224,242.54 is quite large, indicating that the model's predictions deviate significantly from actual prices. To improve it, you could log-transform the price variable, remove outliers, add more meaningful predictors (like room_type, accommodates, and neighbourhood), and try regularization or tree-based models for better performance.
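
Because MSE is expressed in squared price units, its size is hard to judge directly. A minimal sketch that takes the square root of the MSE printed above to get the root mean squared error (RMSE), which is in the same units as price:

```python
import math

# Baseline MSE as printed by the evaluation cell above
baseline_mse = 224242.5379267591

# RMSE is the square root of MSE, so it is in price units rather than squared units
baseline_rmse = math.sqrt(baseline_mse)

print(f"RMSE: {baseline_rmse:.2f}")  # roughly 473.5
```

An RMSE of roughly 473 is several times larger than the sample prices previewed earlier (29 to 101). That would be consistent with a few extreme prices dominating the error, though this would need to be checked against the full price distribution.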

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?


In [11]:
# Get the feature names from the encoded training data
feature_names = X_train_encoded.columns

# Get the coefficients from the fitted model
coefficients = model.coef_

# Create a DataFrame to display feature names and coefficients
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the DataFrame by the absolute value of the coefficients in descending order
coefficients_df['Abs_Coefficient'] = abs(coefficients_df['Coefficient'])
coefficients_df = coefficients_df.sort_values(by='Abs_Coefficient', ascending=False)

# Display the sorted table
display(coefficients_df[['Feature', 'Coefficient']])

| | Feature | Coefficient |
|---|---|---|
| 1 | host_id | -9.121379e-09 |
| 42 | host_since_days | -5.698865e-14 |
| 29 | estimated_revenue_l365d | 5.126698e-14 |
| 3 | host_listings_count | -7.025259e-16 |
| 4 | host_total_listings_count | -6.981631e-16 |
| ... | ... | ... |
| 610 | amenities_["Washer", "Hair dryer", "Oven", "Ho... | 7.768110e-23 |
| 361 | amenities_["Hair dryer", "Coffee maker", "Carb... | 6.622115e-23 |
| 198 | neighbourhood_cleansed_Fuente del Berro | -6.401472e-23 |
| 639 | amenities_["Washer", "Paid parking on premises... | 4.836191e-23 |


### ✍️ Your Response: 🔧
1. The largest positive coefficient belongs to estimated annual revenue (estimated_revenue_l365d, about 5.1e-14), so higher-earning listings are associated with higher prices. Certain amenities combinations (e.g., listings with washer, hair dryer, oven) also have positive coefficients, but at roughly 1e-23 they are effectively zero. Because every coefficient in the table is this small, the ranking is best read as directional rather than as a dollar impact.

2. Yes. host_id has the largest coefficient in absolute terms and it is negative (about -9.1e-9), which is surprising because an ID is an arbitrary identifier and should not carry pricing information. host_since_days, both host listing counts, and some neighborhoods (e.g., Fuente del Berro) are also negative. These values could indicate noise or multicollinearity from too many encoded features.

3. Listings with a strong revenue history tend to command higher prices, and amenities may matter as well, although the near-zero amenities coefficients in this model do not demonstrate it. Airbnb hosts and strategists could use a cleaner version of this analysis to prioritize the property features that most effectively increase booking value and profitability.
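
A raw regression coefficient is measured per one unit of its feature, so features on very different scales (an ID in the hundreds of millions versus a guest count under ten) cannot be ranked fairly by coefficient size alone. One common remedy is to standardize the features before fitting. The sketch below uses synthetic data in which price depends only on `accommodates`; the data, seed, and column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features: one meaningful, one arbitrary large-valued identifier
features = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "host_id": rng.integers(1_000, 500_000_000, size=n).astype(float),
})

# Synthetic price that depends on accommodates only, plus noise
price = 30.0 * features["accommodates"] + rng.normal(0.0, 5.0, size=n)

# Standardize so each coefficient means "change in price per one standard deviation"
scaled = StandardScaler().fit_transform(features)
model = LinearRegression().fit(scaled, price)

coef_table = (
    pd.DataFrame({"Feature": features.columns, "Coefficient": model.coef_})
    .sort_values("Coefficient", key=lambda s: s.abs(), ascending=False)
    .reset_index(drop=True)
)

print(coef_table)
```

On the standardized scale, `accommodates` ranks first and the identifier's coefficient is close to zero, which matches how the synthetic prices were generated.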


## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features, but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3–5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R² between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to the customized learning outcome you created in Canvas?
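
A single train/test split can make one model look better than another by chance. As an optional check, cross-validation averages R² over several splits. The sketch below compares an "all features" model with a "refined" one on synthetic data; the data, seed, and column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 300

# Synthetic data: two informative features and two pure-noise features
data = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "bedrooms": rng.integers(1, 5, size=n).astype(float),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
price = (
    25.0 * data["accommodates"]
    + 15.0 * data["bedrooms"]
    + rng.normal(0.0, 20.0, size=n)
)


def mean_cv_r2(columns):
    """Average R² over 5 cross-validation folds for the given feature subset."""
    scores = cross_val_score(
        LinearRegression(), data[columns], price, cv=5, scoring="r2"
    )
    return scores.mean()


all_r2 = mean_cv_r2(list(data.columns))
refined_r2 = mean_cv_r2(["accommodates", "bedrooms"])

print(f"All features: mean R² = {all_r2:.3f}")
print(f"Refined:      mean R² = {refined_r2:.3f}")
```

If the two averages are close, the simpler model is usually the easier one to explain to stakeholders.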


In [12]:
# Choose the top 5 features with the strongest absolute coefficients
top_features = coefficients_df.head(5)['Feature'].tolist()

# Create new training and testing sets with only the top features
X_train_refined = X_train_encoded[top_features]
X_test_refined = X_test_encoded[top_features]

# Rebuild the regression model using just those features
model_refined = LinearRegression()
model_refined.fit(X_train_refined, y_train)

# Predict prices for the test set using the refined model
y_pred_refined = model_refined.predict(X_test_refined)

# Compare MSE and R¬≤ between the baseline and refined model
mse_refined = mean_squared_error(y_test, y_pred_refined)
r2_refined = r2_score(y_test, y_pred_refined)

print("Baseline Model Performance:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")
print("\nRefined Model Performance (Top 5 Features):")
print(f"Mean Squared Error (MSE): {mse_refined}")
print(f"R-squared (R²): {r2_refined}")

Baseline Model Performance:
Mean Squared Error (MSE): 224242.5379267591
R-squared (R²): 0.00047822764128369055

Refined Model Performance (Top 5 Features):
Mean Squared Error (MSE): 193890.59174930275
R-squared (R²): 0.13576670287133208


### ✍️ Your Response: 🔧
1. I kept host_id, host_since_days, estimated_revenue_l365d, host_listings_count, and host_total_listings_count because they had the strongest absolute coefficients, suggesting they have the greatest linear impact on listing price. These variables reflect host experience and overall revenue potential, which are likely tied to how hosts set their prices.

2. Yes. The R² increased from 0.0005 to 0.136, and the MSE decreased from about 224,243 to 193,891, showing that the refined model explains more of the variation in prices and makes more accurate predictions. The improvement occurred because the refined model focused on features with stronger relationships to price, reducing noise from irrelevant variables.

3. I would recommend the refined model, since it's simpler, performs better, and is easier to interpret. However, further improvements could come from adding property size, amenities, and location features, which likely have even stronger effects on price.

4. This assignment aligns with my learning outcomes by demonstrating the application of strategic market analytics: using data to identify the most influential pricing factors, similar to analyzing market drivers in global semiconductor distribution. It also reflects supply chain risk analytics and operational resilience, as the process of refining and validating models mirrors how predictive analytics can be used to assess variability, reduce uncertainty, and make more resilient, data-driven business decisions in complex markets like semiconductors.


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. The model helped answer the question: "Which factors most influence Airbnb listing prices?" It identified that host-related variables like experience, number of listings, and estimated annual revenue have measurable effects on how listings are priced.

2. I would recommend that hosts focus on building reputation and experience through consistent guest satisfaction and maintaining multiple high-quality listings. Airbnb could use these insights to refine pricing algorithms and offer better guidance to hosts about how experience and property characteristics impact pricing.

3. Next, I would include property features (like room type, amenities, and location) and apply log transformations to stabilize price variance. I could also compare results across cities to understand regional pricing dynamics.

4. It connects to my learning outcomes by demonstrating how data analytics supports strategic decision-making: in this case, optimizing Airbnb pricing, similar to forecasting demand or setting pricing strategies in global semiconductor markets. It also reflects risk analytics and operational resilience in real estate pricing, since we should analyze the real-world data we have in hand to predict realistic prices for properties in a specific location and time period.
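
The log transformation mentioned in the next steps can be applied without changing the rest of the workflow. A minimal sketch using scikit-learn's `TransformedTargetRegressor`, which fits the model on `log1p(price)` and converts predictions back to price units with `expm1`; the synthetic right-skewed prices and the single `accommodates` feature are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 200

# Synthetic right-skewed prices: multiplicative noise around an exponential trend
accommodates = rng.integers(1, 9, size=n).astype(float)
price = 40.0 * np.exp(0.25 * accommodates + rng.normal(0.0, 0.3, size=n))

X = accommodates.reshape(-1, 1)

# Fit on log1p(price); predictions are mapped back to price units with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X, price)

predicted = log_model.predict(X[:5])
print(predicted)
```

Because predictions come back in price units, MSE and R² can be computed exactly as before, which keeps the comparison with the untransformed model straightforward.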

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [13]:
!jupyter nbconvert --to html "assignment_11_AlhinaiAlmuhanna.ipynb"

[NbConvertApp] Converting notebook assignment_11_AlhinaiAlmuhanna.ipynb to html
[NbConvertApp] Writing 365228 bytes to assignment_11_AlhinaiAlmuhanna.html
