
### Notebook 2: Feature Engineering and Model Construction
#### Introduction to Feature Engineering and Model Construction
This notebook focuses on developing new features from the cleaned data and constructing a model to predict prices. The main steps include:

- Developing new features

- Normalizing data

- Splitting data into training and validation sets

- Training a model

- Evaluating model performance

### Part 1: Feature Engineering
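#### Develop New Features
We derive one new feature, price_per_night_per_review: the listing price divided by its number of reviews, set to 0 for listings that have no reviews.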
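To make the ratio concrete, here is a minimal sketch on a toy table. The values are made up and are not rows from cleaned_data. It uses a vectorized division instead of a row-wise apply; note that, unlike the notebook cell, it would also map a missing price to 0.

```python
import numpy as np
import pandas as pd

# Toy listings (made-up values, not rows from cleaned_data)
toy = pd.DataFrame({
    "price": [100.0, 80.0, 150.0],
    "number_of_reviews": [4, 0, 10],
})

# Replace zero review counts with NaN so the division never produces inf,
# then map the resulting NaN back to 0 for those listings.
reviews = toy["number_of_reviews"].replace(0, np.nan)
toy["price_per_night_per_review"] = (toy["price"] / reviews).fillna(0)

print(toy["price_per_night_per_review"].tolist())  # [25.0, 0.0, 15.0]
```

The listing with zero reviews gets 0 rather than an infinite value, which matches the rule used in this notebook.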

In [2]:
# Necessary imports

import pandas as pd
from sqlalchemy import create_engine
import psycopg2
from sklearn.preprocessing import LabelEncoder


In [None]:
# Database connection

host = '127.0.0.1'
db = 'project_airbnb'
user = 'postgres'
pw = 'your_pass'
port = '5432'

db_conn = create_engine(f"postgresql://{user}:{pw}@{host}:{port}/{db}")

In [4]:
# Load cleaned data
df = pd.read_sql_table('cleaned_data', db_conn, schema='cleaned')

# Develop new feature: price divided by review count (0 when a listing has no reviews)
df['price_per_night_per_review'] = df.apply(lambda row: row['price'] / row['number_of_reviews'] if row['number_of_reviews'] != 0 else 0, axis=1)

# Handle missing values in price_per_night_per_review
df['price_per_night_per_review'] = df['price_per_night_per_review'].fillna(df['price_per_night_per_review'].mean())

# Normalize features (zero mean, unit variance per column)
from sklearn.preprocessing import StandardScaler
num_cols = ['price', 'minimum_nights', 'number_of_reviews', 'availability_365']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# Store feature engineered data
df.to_sql('feature_engineered_data', db_conn, schema='cleaned', if_exists='replace', index=False)


297

Based on the analytical question, the four features that I feel will support my analysis are the new price_per_night_per_review plus three existing columns: minimum_nights, number_of_reviews (the price might be higher for listings with many reviews and lower for listings with few), and availability_365 (the number of days a listing is available might also affect its price, since it relates to the law of supply and demand).
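The cell above standardizes the numeric columns on the full table before any split. A variant that is sometimes preferred, sketched here on synthetic data, fits the scaler on the training rows only and reuses those statistics for the validation rows. The array X_demo and the other names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (assumption: stand-ins for columns such as minimum_nights)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 365, size=(200, 3))

X_tr, X_va = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_va_scaled = scaler.transform(X_va)      # validation rows reuse those statistics

print(X_tr_scaled.mean(axis=0))  # approximately 0 for each column
```

Tree-based models such as random forests are generally insensitive to feature scaling, so this mainly matters if the same features are later fed to a scale-sensitive model.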

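### Part 2: Model Construction
#### Split Data and Train Model
Now I will use the 3-way split method to construct an optimal model, with RandomForestRegressor as the model. The cell below carries out the train/validation part of the split, holding out 20% of the rows for validation.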
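For reference, one common way to get three partitions (train, validation, test) is to call train_test_split twice. This is a minimal sketch on synthetic arrays; the 60/20/20 proportions and all variable names are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# First hold out 20% of the rows as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# ...then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_va), len(X_test))  # 60 20 20
```

The validation set is used to compare candidate models, while the test set is kept aside for a single final check.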

In [5]:
# Split data into training and validation sets
from sklearn.model_selection import train_test_split
X = df[['room_type', 'number_of_reviews', 'neighbourhood', 'minimum_nights', 'availability_365', 'price_per_night_per_review']]
y = df['price']

# Convert categorical variables to numerical
le_room_type = LabelEncoder()
le_neighbourhood = LabelEncoder()
X['room_type'] = le_room_type.fit_transform(X['room_type'])
X['neighbourhood'] = le_neighbourhood.fit_transform(X['neighbourhood'])

# Ensure all numerical columns are floats
X = X.astype(float)

# Check for infinity values
import numpy as np
print("Infinity values in X:", np.isinf(X).sum().sum())
print("Infinity values in y:", np.isinf(y).sum())

# Replace infinity values with NaN and then fill with mean
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(X.mean())
y = y.replace([np.inf, -np.inf], np.nan)
y = y.fillna(y.mean())

X_train, X_val, y_train, y_val = train_test_split(X.values, y.values, test_size=0.2, random_state=42)

# Train a model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

try:
    model.fit(X_train, y_train)
except Exception as e:
    # Report and re-raise: the steps below need a fitted model
    print(f"An error occurred: {e}")
    raise

# Evaluate model on validation set
y_pred = model.predict(X_val)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_val, y_pred)
print(f"Validation MSE: {mse}")

# Save the model
import joblib
joblib.dump(model, 'optimal_model.joblib')

# Store validation data for later use
pd.DataFrame(X_val).to_csv('X_val.csv', index=False)
pd.DataFrame(y_val).to_csv('y_val.csv', index=False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['room_type'] = le_room_type.fit_transform(X['room_type'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['neighbourhood'] = le_neighbourhood.fit_transform(X['neighbourhood'])
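The two warnings above are raised because X was created by selecting columns from df and then had columns assigned to it. A common way to avoid them is to take an explicit copy before assigning. The sketch below uses a toy frame with made-up values; df_demo and X_demo are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for df (values are made up)
df_demo = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "number_of_reviews": [4, 0, 10],
})

# .copy() makes X_demo an independent DataFrame, so assigning a column
# to it is unambiguous and leaves df_demo untouched.
X_demo = df_demo[["room_type", "number_of_reviews"]].copy()
X_demo["room_type"] = LabelEncoder().fit_transform(X_demo["room_type"])

print(X_demo["room_type"].tolist())  # [1, 0, 1]
```

LabelEncoder assigns codes in sorted order of the class labels, so "Entire home/apt" becomes 0 and "Private room" becomes 1.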


Infinity values in X: 0
Infinity values in y: 0
Validation MSE: 0.7811781855054779
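Because price was passed through StandardScaler earlier in this notebook, the validation MSE above is in standardized units rather than currency. Its square root, about 0.88, is the typical error measured in standard deviations of price. The sketch below shows how RMSE and R² can be computed alongside MSE; the arrays are made up for illustration and are not the notebook's validation data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up standardized targets and predictions (assumption, for illustration)
y_true = np.array([-1.0, 0.0, 1.0, 2.0])
y_hat = np.array([-0.5, 0.0, 0.5, 1.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)           # same units as the target
r2 = r2_score(y_true, y_hat)  # 1 - SS_res / SS_tot

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

Reporting RMSE or R² next to MSE can make the result easier to compare across models and datasets.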


To see the tables I created in pgAdmin, please run code 1, 2, and 3.
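As an alternative to browsing in pgAdmin, table names can also be listed from Python with SQLAlchemy's inspector. This sketch uses an in-memory SQLite engine as a stand-in so that it runs without a database server; with the PostgreSQL engine from this notebook, one would presumably pass schema='cleaned' to get_table_names.

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

# In-memory SQLite engine as a stand-in for the PostgreSQL connection (assumption)
engine = create_engine("sqlite://")

# Write a one-row toy table so there is something to list
pd.DataFrame({"price": [100.0]}).to_sql(
    "feature_engineered_data", engine, index=False
)

tables = inspect(engine).get_table_names()
print(tables)
```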