# **Airbnb Price Prediction**


**Problem Domain:** Supervised Learning (Regression)   
**Name:** Amjaad Alfayez

---

## Dataset Selection & Problem Definition

### Dataset Selection
The dataset chosen is a sample of **Airbnb listing data** (`airnb.csv`), which is well suited to demonstrating regression techniques on a real-world pricing problem.

### Problem Definition: Airbnb Price Prediction
The goal is to build a model that can predict the Price of an Airbnb listing based on its features. This is a classic Supervised Learning Regression problem.

Target Variable (Y): Price (continuous, in US dollars).

Features (X):
1.  Number_of_Beds: The number of beds available (Numeric).
2.  Review_Count: The total number of reviews (Numeric).
3.  Rating: The average rating of the listing (Numeric).
4.  Location_Grouped: One-hot encoded categorical features representing the listing's location.

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the data
# File name taken from the Dataset Selection section; adjust the path to where the CSV is stored.
file_path = "airnb.csv"
df = pd.read_csv(file_path)

print(f"Dataset Shape: {df.shape}")

## Data Cleaning & Preprocessing

### Data Cleaning and Feature Extraction
The raw data required significant cleaning to extract numerical features from text fields and handle missing values.

| Raw Column | Transformation | Resulting Feature |
| :--- | :--- | :--- |
| Price(in dollar) | Converted to numeric, renamed. | Price (Target) |
| Number of bed | Extracted first number. | Number_of_Beds |
| Review and rating | Extracted rating and review count. | Rating, Review_Count |
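
As a rough illustration of the transformations in the table above, the sketch below applies the same regular expressions to three made-up raw strings. The sample values (`"$1,250"`, `"2 queen beds"`, `"4.85 (120)"`) are assumptions about what the raw columns might look like, not rows taken from the dataset.

```python
import re

# Hypothetical raw values; the real dataset's formatting may differ.
raw_price = "$1,250"
raw_beds = "2 queen beds"
raw_review = "4.85 (120)"

# Strip the currency symbol and thousands separator, then convert to a number.
price = float(re.sub(r"[$,]", "", raw_price))

# Take the first run of digits as the number of beds.
beds = float(re.search(r"(\d+)", raw_beds).group(1))

# The rating is the first decimal number; the review count is the integer in parentheses.
rating = float(re.search(r"(\d+\.\d+)", raw_review).group(1))
review_count = float(re.search(r"\((\d+)\)", raw_review).group(1))

print(price, beds, rating, review_count)
```

Note that the rating pattern requires a decimal point, so a listing whose text contains no decimal rating produces no match and ends up as a missing value.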

### Handling Categorical Data
The location information was grouped and converted into numerical format using One-Hot Encoding, which is necessary for Linear Regression.
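
A minimal sketch of what One-Hot Encoding with `drop_first=True` produces is shown here. The location names are placeholders chosen for illustration, and `dtype=int` is used only to make the output easier to read.

```python
import pandas as pd

# Placeholder locations for illustration only.
demo = pd.DataFrame({"Location_Grouped": ["Indonesia", "Mexico", "Other", "Mexico"]})

# drop_first=True removes the first category in sorted order ("Indonesia" here),
# which then acts as the baseline that the other dummy columns are compared against.
encoded = pd.get_dummies(demo, columns=["Location_Grouped"], drop_first=True, dtype=int)

print(encoded)
```

A row with zeros in every dummy column belongs to the dropped baseline category, which is why the coefficient of each remaining location column is read relative to that baseline.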

In [None]:
# Rename and clean the target variable 'Price'
df.rename(columns={'Price(in dollar)': 'Price'}, inplace=True)
df['Price'] = df['Price'].astype(str).str.replace(r'[$,]', '', regex=True)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df.dropna(subset=['Price'], inplace=True)

# Extract 'Number of beds' (Numerical Feature)
df.rename(columns={'Number of bed': 'Bed_Details'}, inplace=True)
df['Number_of_Beds'] = df['Bed_Details'].str.extract(r'(\d+)', expand=False).astype(float)
df.dropna(subset=['Number_of_Beds'], inplace=True)

# Extract 'Review Count' and 'Rating' (Numerical Features)
df.rename(columns={'Review and rating': 'Review_Details'}, inplace=True)
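df['Rating'] = df['Review_Details'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)
df['Review_Count'] = df['Review_Details'].str.extract(r'\((\d+)\)', expand=False).astype(float)
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify df.
df['Rating'] = df['Rating'].fillna(0)
df['Review_Count'] = df['Review_Count'].fillna(0)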

# Extract 'Location' (Categorical Feature)
df['Location'] = df['Title'].str.split(',').str[1:3].str.join(',').str.strip()
top_locations = df['Location'].value_counts().nlargest(5).index
df['Location_Grouped'] = df['Location'].apply(lambda x: x if x in top_locations else 'Other')

# One-Hot Encode the Categorical Feature
df_processed = pd.get_dummies(df, columns=['Location_Grouped'], drop_first=True)

# Final Feature Selection
features = ['Number_of_Beds', 'Review_Count', 'Rating'] + [col for col in df_processed.columns if col.startswith('Location_Grouped_')]
target = 'Price'

df_final = df_processed[features + [target]].copy()

print(f"Final Processed Dataset Shape: {df_final.shape}")

## Exploratory Data Analysis
EDA was performed to understand the distribution of the target variable and the relationship between features.

### Key Observations:
1.  Price Distribution: The target variable (Price) is highly skewed to the right, indicating a few very expensive listings (outliers). This is typical for real-estate pricing data.
2.  Price vs. Beds: There is a positive correlation between the number of beds and the price, which is expected.
3.  Correlation: The correlation heatmap shows that the engineered features are not highly correlated with each other, which is good for avoiding *multicollinearity* in Linear Regression.
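
The observations above can be checked numerically as well as visually. The following sketch uses a small synthetic table as a stand-in for `df_final` (the values are generated, not taken from the dataset) and computes the skewness of the price and the correlation matrix that a heatmap would display.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_final; all values are illustrative.
rng = np.random.default_rng(0)
n = 500
beds = rng.integers(1, 6, size=n).astype(float)      # 1 to 5 beds
reviews = rng.integers(0, 300, size=n).astype(float)
rating = rng.uniform(3.5, 5.0, size=n)
# Log-normal noise produces a right-skewed price, similar in shape to observation 1.
price = 40 * beds * rng.lognormal(mean=0.0, sigma=0.8, size=n)

demo = pd.DataFrame({
    "Number_of_Beds": beds,
    "Review_Count": reviews,
    "Rating": rating,
    "Price": price,
})

# Positive skewness means a long right tail (a few very expensive listings).
price_skew = demo["Price"].skew()

# Pairwise Pearson correlations; large off-diagonal values between features
# would be a warning sign for multicollinearity.
corr = demo.corr()

print(f"Price skewness: {price_skew:.2f}")
print(corr.round(2))
```

On the real data, the same two calls (`.skew()` and `.corr()`) can be applied to the numeric columns of `df_final`.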

In [None]:
# Define X and Y
X = df_final[features]
Y = df_final[target]

# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

## Model Selection & Implementation

### Model Selection: Multiple Linear Regression
Multiple Linear Regression is chosen as it is the foundational model for Regression problems, directly aligning with the course material. It models the linear relationship between multiple independent variables (features) and a continuous dependent variable (price).

### Model Training
The model is trained using the Ordinary Least Squares (OLS) method.

In [None]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, Y_train)

print("Multiple Linear Regression Model Trained Successfully.")

## Theoretical Understanding of the Model

### Linear Regression Theory
Linear Regression models the relationship as a straight line (one feature) or a hyperplane (several features):

$$\text{Price} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$

The model minimizes the Sum of Squared Errors (SSE) between the predicted and actual prices. The model relies on key assumptions, including Linearity, Independence of errors, and Homoscedasticity (constant variance of errors).
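
To make the least-squares idea concrete, here is a small sketch that fits a linear model with NumPy's least-squares solver on synthetic data. The data, the "true" coefficients, and the variable names are all made up for illustration; this is not the notebook's model, only a demonstration that the fitted coefficients are the ones that minimize the SSE.

```python
import numpy as np

# Synthetic data with known coefficients (illustrative values only).
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([50.0, 20.0, -5.0])  # intercept, beta_1, beta_2

# Add a column of ones so the intercept is estimated together with the slopes.
X_design = np.column_stack([np.ones(n), X])
y = X_design @ true_beta + rng.normal(scale=1.0, size=n)

# Least-squares solution: the beta that minimizes ||y - X_design @ beta||^2.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
sse = float(residuals @ residuals)

print("Estimated coefficients:", beta_hat.round(2))
print(f"SSE: {sse:.2f}")
```

At the least-squares solution the residuals are orthogonal to every column of the design matrix, which is the condition expressed by the normal equations.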

### Model Coefficients
The coefficients ($\beta_i$) represent the change in the predicted price for a one-unit increase in the corresponding feature, holding all other features constant.



| Feature | Coefficient | Interpretation |
| :--- | :---: | :--- |
| **Intercept** | **179.39** | The predicted price when all numeric features are zero and the listing is in the baseline location. |
| **Number_of_Beds** | **18.38** | Each additional bed increases the predicted price by 18.38. |
| **Review_Count** | **-0.04** | A very slight negative effect, suggesting that a high review count alone does not drive the price up. |
| **Rating** | **-1.12** | A slight negative effect, which is counter-intuitive and suggests a relationship not fully captured by this simple linear model. |
| **Location_Grouped_Mexico** | **-102.94** | Being in Mexico (compared to the baseline location) decreases the predicted price by $102.94. |

Among the reported coefficients, **Number of Beds** has the largest positive effect on the predicted price.
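
As a worked example of how these coefficients combine, the sketch below evaluates the linear equation by hand for a hypothetical listing (2 beds, 100 reviews, rating 4.8). Only the coefficients reported in this section are used; coefficients for other location columns are not listed, so the example is limited to the baseline location and Mexico.

```python
# Coefficients as reported in this section. Other location dummies are not
# listed, so this sketch only covers the baseline location and Mexico.
INTERCEPT = 179.39
COEF_BEDS = 18.38
COEF_REVIEWS = -0.04
COEF_RATING = -1.12
COEF_MEXICO = -102.94


def predict_price(beds, review_count, rating, in_mexico):
    """Evaluate the fitted linear equation by hand for a single listing."""
    return (
        INTERCEPT
        + COEF_BEDS * beds
        + COEF_REVIEWS * review_count
        + COEF_RATING * rating
        + COEF_MEXICO * (1 if in_mexico else 0)
    )


# Hypothetical listing: 2 beds, 100 reviews, rating 4.8.
baseline = predict_price(2, 100, 4.8, in_mexico=False)
mexico = predict_price(2, 100, 4.8, in_mexico=True)

print(f"Baseline location: {baseline:.2f}")
print(f"Mexico: {mexico:.2f}")
print(f"Difference: {mexico - baseline:.2f}")
```

The two predictions differ by exactly the Mexico coefficient, which illustrates the "holding all other features constant" reading of each coefficient.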

## Evaluation Metrics & Interpretation

The model's performance is evaluated using the standard metrics for Regression:

| Metric | Value | Interpretation |
| :--- | :---: | :--- |
| **RMSE** (Root Mean Squared Error) | **152.94** | The square root of the mean squared error, measured in the same units as the target (dollars). It penalizes large errors more heavily than MAE; a typical error is on the order of $152.94. |
| **MAE** (Mean Absolute Error) | **91.21** | The average absolute error. The model's predictions are off by $91.21 on average. |
| **R-squared** (R2) | **0.1245** | Only 12.45% of the variance in the Airbnb price is explained by our model's features. This indicates the model is a poor fit for the data, suggesting that important features (like property type, amenities, or exact location) are missing. |

**Interpretation:** The low R2 value indicates that a simple linear model is insufficient for accurately predicting Airbnb prices, which are likely influenced by non-linear factors and features not present in this dataset.
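
For reference, the three metrics can be computed directly from their definitions. The sketch below does so with NumPy on four made-up price pairs; the numbers are purely illustrative and unrelated to the results in the table above.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error (sensitive to large errors).
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# R-squared: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R-squared (R2): {r2:.4f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE; a large gap between the two, as in the table above, points to a few predictions with very large errors.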

In [None]:
# Predict on the test set
Y_pred = model.predict(X_test)

# Evaluation Metrics
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
mae = mean_absolute_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R-squared (R2): {r2:.4f}")

## Quick Revision
This project successfully applied Multiple Linear Regression to the Airbnb dataset, demonstrating the entire data mining process from raw data cleaning to model evaluation.
The Model Coefficients confirmed our initial hypothesis: Number of Beds is the strongest positive predictor of price, increasing the predicted price by 18.38 for each additional bed.
However, the evaluation metrics provided a crucial insight: the R-squared ($R^2$) value was only 0.1245, so the model explains about 12.45% of the variance in price. This low value indicates that our simple linear model is insufficient for accurately predicting Airbnb prices and suggests that critical factors, such as property type, amenities, and exact neighborhood, are missing from our feature set.
This outcome is a valuable lesson in data mining: while the model is technically correct, it highlights the need for more complex models or richer data to accurately capture the non-linear complexities of real-world phenomena like housing prices.