Handling Missing Values
1 - Impute Missing Values:
Fill missing reviews_per_month values with 0 or the median. last_review is a date, so fill it with a placeholder date instead (the code below uses 2000-01-01).
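
The notebook code only fills with 0, so here is a minimal sketch of the median alternative. The `toy` frame, its values, and the `has_review` flag are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    "reviews_per_month": [0.21, np.nan, 4.64, 0.10, np.nan],
    "last_review": ["2018-10-19", None, "2019-07-05", "2018-11-19", None],
})

# Record which rows had a review before the gaps are overwritten,
# so the model can still tell "never reviewed" apart from real values.
toy["has_review"] = toy["last_review"].notna()

# Median imputation: median() skips missing values by default.
median_rpm = toy["reviews_per_month"].median()
toy["reviews_per_month_median"] = toy["reviews_per_month"].fillna(median_rpm)

# Zero imputation, for comparison.
toy["reviews_per_month_zero"] = toy["reviews_per_month"].fillna(0)
```

Filling with 0 reads naturally as "no reviews", while the median avoids pulling the column's distribution toward zero; which one suits the model is a judgment call.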

Encoding Categorical Variables
2 - Neighborhood and Room Type:
One-Hot Encoding: Convert neighbourhood_group, neighbourhood, and room_type into binary columns.

Creating New Features
3 - Date Features:
Extract year, month, and day from last_review.
4 - Price-Related Features:
Create a price_per_night feature by dividing price by minimum_nights.

Spatial Features
5 - Geospatial Clustering:
Apply clustering algorithms (e.g., K-means) on latitude and longitude to identify neighborhood clusters.
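
The cells below do not implement this step, so here is a hedged sketch of how it could look. The synthetic coordinates, the `geo` frame, the `geo_cluster` column name, and the choice of three clusters are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic coordinates scattered around three made-up centres (not real listings).
rng = np.random.default_rng(0)
centres = np.array([[40.65, -73.95], [40.75, -73.98], [40.80, -73.94]])
points = np.vstack([c + rng.normal(scale=0.005, size=(50, 2)) for c in centres])
geo = pd.DataFrame(points, columns=["latitude", "longitude"])

# n_clusters=3 matches the synthetic data; on the real listings it would have
# to be chosen, for example with an elbow plot or silhouette scores.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
geo["geo_cluster"] = kmeans.fit_predict(geo[["latitude", "longitude"]])
```

K-means uses Euclidean distance, and on raw latitude and longitude that is only an approximation, because a degree of longitude covers less ground than a degree of latitude at New York's latitude. For a single city that is usually tolerable, but it is worth keeping in mind.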

Aggregations
6 - Review Aggregations:
Calculate the average number of reviews per month for each neighborhood.
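
This aggregation is not implemented in the cells below either. One possible sketch uses `groupby(...).transform`, which attaches the group mean to every row. It would have to run before the one-hot encoding step, because `pd.get_dummies(..., columns=[...])` replaces the original neighbourhood column. The `toy` frame and the new column name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in with made-up values.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Midtown", "Kensington", "Midtown"],
    "reviews_per_month": [0.5, 1.5, 0.4, 0.2, 0.6],
})

# transform("mean") returns one value per row, aligned with the original index,
# so the neighbourhood average can be assigned directly as a new column.
toy["neighbourhood_avg_reviews_per_month"] = (
    toy.groupby("neighbourhood")["reviews_per_month"].transform("mean")
)
```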

Feature Interactions
7 - Price Interaction:
Create interaction terms like price * number_of_reviews or price / (availability_365 + 1); the + 1 avoids division by zero for listings with no availability.

Binning
8 - Price Binning:
Create bins for price to categorize listings into low, medium, high, and very high price ranges.

Scaling and Normalization
9 - Standardize Numerical Features:
Apply standard scaling to price, minimum_nights, number_of_reviews, and availability_365.
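
`StandardScaler` is imported in the notebook but never used, so the following is a minimal sketch of this step under assumed inputs. The `toy` values and the `_scaled` column suffix are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the four numeric columns (illustrative values).
toy = pd.DataFrame({
    "price": [149, 225, 150, 89, 80],
    "minimum_nights": [1, 1, 3, 1, 10],
    "number_of_reviews": [9, 45, 0, 270, 9],
    "availability_365": [365, 355, 365, 194, 0],
})

num_cols = ["price", "minimum_nights", "number_of_reviews", "availability_365"]

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0).
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[num_cols]),
    columns=[f"{c}_scaled" for c in num_cols],
    index=toy.index,
)
```

If the data is later split for modelling, fitting the scaler on the training portion only and reusing it on the rest avoids leaking information from the held-out rows.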

Handling Outliers
10 - Cap Outliers:
Cap extreme values in price, minimum_nights, and number_of_reviews at the 5th and 95th percentiles.

In [49]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
import numpy as np

df = pd.read_csv('AB_NYC_2019.csv')

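# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about and which does not update
# df under copy-on-write.
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
# Placeholder date marking listings that were never reviewed.
df['last_review'] = df['last_review'].fillna('2000-01-01')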

## One-Hot Encoding

In [50]:
df = pd.get_dummies(df, columns=['neighbourhood_group', 'neighbourhood', 'room_type'])

In [51]:
df

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,neighbourhood_Williamsburg,neighbourhood_Willowbrook,neighbourhood_Windsor Terrace,neighbourhood_Woodhaven,neighbourhood_Woodlawn,neighbourhood_Woodrow,neighbourhood_Woodside,room_type_Entire home/apt,room_type_Private room,room_type_Shared room
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,False,False,False,False,False,False,False,False,True,False
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,False,False,False,False,False,False,False,True,False,False
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.94190,150,3,0,2000-01-01,...,False,False,False,False,False,False,False,False,True,False
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,270,2019-07-05,...,False,False,False,False,False,False,False,True,False,False
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,40.67853,-73.94995,70,2,0,2000-01-01,...,False,False,False,False,False,False,False,False,True,False
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,40.70184,-73.93317,40,4,0,2000-01-01,...,False,False,False,False,False,False,False,False,True,False
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,40.81475,-73.94867,115,10,0,2000-01-01,...,False,False,False,False,False,False,False,True,False,False
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,40.75751,-73.99112,55,1,0,2000-01-01,...,False,False,False,False,False,False,False,False,False,True


## Date Features

In [52]:
df['last_review'] = pd.to_datetime(df['last_review'])
df['last_review_year'] = df['last_review'].dt.year
df['last_review_month'] = df['last_review'].dt.month
df['last_review_day'] = df['last_review'].dt.day

In [53]:
df

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,neighbourhood_Woodhaven,neighbourhood_Woodlawn,neighbourhood_Woodrow,neighbourhood_Woodside,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,last_review_year,last_review_month,last_review_day
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,False,False,False,False,False,True,False,2018,10,19
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,False,False,False,False,True,False,False,2019,5,21
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.94190,150,3,0,2000-01-01,...,False,False,False,False,False,True,False,2000,1,1
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,270,2019-07-05,...,False,False,False,False,True,False,False,2019,7,5
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,False,False,False,False,True,False,False,2018,11,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,40.67853,-73.94995,70,2,0,2000-01-01,...,False,False,False,False,False,True,False,2000,1,1
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,40.70184,-73.93317,40,4,0,2000-01-01,...,False,False,False,False,False,True,False,2000,1,1
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,40.81475,-73.94867,115,10,0,2000-01-01,...,False,False,False,False,True,False,False,2000,1,1
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,40.75751,-73.99112,55,1,0,2000-01-01,...,False,False,False,False,False,False,True,2000,1,1


## Price-Related Features

In [54]:
df['price_per_night'] = df['price'] / df['minimum_nights']

In [55]:
df

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,neighbourhood_Woodlawn,neighbourhood_Woodrow,neighbourhood_Woodside,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,last_review_year,last_review_month,last_review_day,price_per_night
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,False,False,False,False,True,False,2018,10,19,149.000000
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,False,False,False,True,False,False,2019,5,21,225.000000
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.94190,150,3,0,2000-01-01,...,False,False,False,False,True,False,2000,1,1,50.000000
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,270,2019-07-05,...,False,False,False,True,False,False,2019,7,5,89.000000
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,False,False,False,True,False,False,2018,11,19,8.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,40.67853,-73.94995,70,2,0,2000-01-01,...,False,False,False,False,True,False,2000,1,1,35.000000
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,40.70184,-73.93317,40,4,0,2000-01-01,...,False,False,False,False,True,False,2000,1,1,10.000000
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,40.81475,-73.94867,115,10,0,2000-01-01,...,False,False,False,True,False,False,2000,1,1,11.500000
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,40.75751,-73.99112,55,1,0,2000-01-01,...,False,False,False,False,False,True,2000,1,1,55.000000


## Price Interaction

In [56]:
df['price_num_reviews'] = df['price'] * df['number_of_reviews']
df['price_availability'] = df['price'] / (df['availability_365'] + 1)

In [57]:
df

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,neighbourhood_Woodside,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,last_review_year,last_review_month,last_review_day,price_per_night,price_num_reviews,price_availability
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,False,False,True,False,2018,10,19,149.000000,1341,0.407104
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,False,True,False,False,2019,5,21,225.000000,10125,0.632022
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.94190,150,3,0,2000-01-01,...,False,False,True,False,2000,1,1,50.000000,0,0.409836
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,270,2019-07-05,...,False,True,False,False,2019,7,5,89.000000,24030,0.456410
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,False,True,False,False,2018,11,19,8.000000,720,80.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,40.67853,-73.94995,70,2,0,2000-01-01,...,False,False,True,False,2000,1,1,35.000000,0,7.000000
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,40.70184,-73.93317,40,4,0,2000-01-01,...,False,False,True,False,2000,1,1,10.000000,0,1.081081
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,40.81475,-73.94867,115,10,0,2000-01-01,...,False,True,False,False,2000,1,1,11.500000,0,4.107143
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,40.75751,-73.99112,55,1,0,2000-01-01,...,False,False,False,True,2000,1,1,55.000000,0,18.333333


## Binning

In [58]:
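# include_lowest=True keeps a price of exactly 0 in the 'low' bin instead of NaN.
df['price_bin'] = pd.cut(df['price'], bins=[0, 50, 100, 200, np.inf],
                         labels=['low', 'medium', 'high', 'very_high'],
                         include_lowest=True)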

In [59]:
df

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,last_review_year,last_review_month,last_review_day,price_per_night,price_num_reviews,price_availability,price_bin
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,False,True,False,2018,10,19,149.000000,1341,0.407104,high
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,True,False,False,2019,5,21,225.000000,10125,0.632022,very_high
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.94190,150,3,0,2000-01-01,...,False,True,False,2000,1,1,50.000000,0,0.409836,high
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,270,2019-07-05,...,True,False,False,2019,7,5,89.000000,24030,0.456410,medium
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,True,False,False,2018,11,19,8.000000,720,80.000000,medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,40.67853,-73.94995,70,2,0,2000-01-01,...,False,True,False,2000,1,1,35.000000,0,7.000000,medium
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,40.70184,-73.93317,40,4,0,2000-01-01,...,False,True,False,2000,1,1,10.000000,0,1.081081,low
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,40.81475,-73.94867,115,10,0,2000-01-01,...,True,False,False,2000,1,1,11.500000,0,4.107143,high
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,40.75751,-73.99112,55,1,0,2000-01-01,...,False,False,True,2000,1,1,55.000000,0,18.333333,medium


## Handling Outliers by Capping

In [60]:
df['price'].max()

10000

In [61]:
# Cap each column at its 5th and 95th percentiles.
# Note: price_per_night, price_num_reviews, price_availability, and price_bin
# were computed above from the uncapped values and are not updated here.
for col in ['price', 'minimum_nights', 'number_of_reviews']:
    lower, upper = df[col].quantile([0.05, 0.95])
    df[col] = df[col].clip(lower, upper)

In [62]:
df

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,last_review_year,last_review_month,last_review_day,price_per_night,price_num_reviews,price_availability,price_bin
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,False,True,False,2018,10,19,149.000000,1341,0.407104,high
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,True,False,False,2019,5,21,225.000000,10125,0.632022,very_high
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.94190,150,3,0,2000-01-01,...,False,True,False,2000,1,1,50.000000,0,0.409836,high
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,114,2019-07-05,...,True,False,False,2019,7,5,89.000000,24030,0.456410,medium
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,True,False,False,2018,11,19,8.000000,720,80.000000,medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,40.67853,-73.94995,70,2,0,2000-01-01,...,False,True,False,2000,1,1,35.000000,0,7.000000,medium
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,40.70184,-73.93317,40,4,0,2000-01-01,...,False,True,False,2000,1,1,10.000000,0,1.081081,low
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,40.81475,-73.94867,115,10,0,2000-01-01,...,True,False,False,2000,1,1,11.500000,0,4.107143,high
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,40.75751,-73.99112,55,1,0,2000-01-01,...,False,False,True,2000,1,1,55.000000,0,18.333333,medium


In [63]:
df['price'].max()

355