## **Feature Engineering**

Feature engineering is applied to three datasets:


1.   Untreated data with no outliers removed (df_nyc_backup)
2.   Data with outliers treated by winsorisation (df_nyc_winsor)
3.   Data with outliers removed by the IQR method (df_nyc_IQR)
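
For reference, the two outlier treatments named in the list differ in whether rows are dropped. The sketch below is an assumption about how they might be implemented, not the notebook's own code: the quantile limits (5% and 95%), the 1.5 x IQR fence, the helper names `winsorise` and `drop_iqr_outliers`, and the toy prices are all illustrative.

```python
import pandas as pd


def winsorise(series: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Clip values to the given quantiles (assumed limits); every row is kept."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)


def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] (assumed fence)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df.loc[mask].copy()


# Toy prices with one extreme value (hypothetical data)
toy = pd.DataFrame({"price": [40, 55, 60, 70, 80, 95, 110, 5000]})

toy_winsor = toy.assign(price=winsorise(toy["price"]))  # extreme value is clipped
toy_iqr = drop_iqr_outliers(toy, "price")               # extreme value is dropped

print(len(toy), len(toy_winsor), len(toy_iqr))  # 8 8 7
```

Under these assumptions, winsorisation keeps the row count unchanged while the IQR filter shrinks the dataset, which is worth keeping in mind when comparing segment counts across the three datasets.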



### Feature Engineering - Dataset With Outliers

In [None]:
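# Define the customer segmentation logic: price quantile cut-offs combined with room type

top_5_percent = df_nyc_backup['price'].quantile(0.95)   # 95th percentile: floor of the top 5% of prices
top_15_percent = df_nyc_backup['price'].quantile(0.85)  # 85th percentile: floor of the top 15% of prices
top_85_percent = df_nyc_backup['price'].quantile(0.15)  # 15th percentile: floor of the top 85% of prices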

def customer_segment(row):
    if row['price'] >= top_5_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1):
        return 'Luxury'
    elif top_15_percent <= row['price'] < top_5_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1):
        return 'High'
    elif top_85_percent <= row['price'] < top_15_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1 or row['room_type_Shared room'] == 1):
        return 'Middle'
    elif row['price'] < top_85_percent and (row['room_type_Private room'] == 1 or row['room_type_Shared room'] == 1):
        return 'Low'
    else:
        return 'Other'

df_nyc_backup['customer_segment'] = df_nyc_backup.apply(customer_segment, axis=1)
print(df_nyc_backup['customer_segment'].value_counts())


customer_segment
Middle    31037
Low        6058
High       4261
Luxury     2472
Other       212
Name: count, dtype: int64
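
The row-wise `apply` above can also be expressed with vectorised boolean masks, which avoids redefining the function for each dataset and is typically faster on tens of thousands of rows. The following is a minimal sketch, assuming the same column names and quantile cut-offs as the cell above; the helper name `assign_customer_segment` and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd


def assign_customer_segment(df: pd.DataFrame) -> pd.Series:
    """Return a customer_segment label per row, using the same rules as the row-wise version."""
    price = df["price"]
    p95 = price.quantile(0.95)
    p85 = price.quantile(0.85)
    p15 = price.quantile(0.15)

    entire = df["room_type_Entire home/apt"] == 1
    private = df["room_type_Private room"] == 1
    shared = df["room_type_Shared room"] == 1

    # np.select picks the first condition that holds, mirroring the if/elif chain
    conditions = [
        (price >= p95) & (entire | private),
        (price >= p85) & (price < p95) & (entire | private),
        (price >= p15) & (price < p85) & (entire | private | shared),
        (price < p15) & (private | shared),
    ]
    labels = ["Luxury", "High", "Middle", "Low"]
    segments = np.select(conditions, labels, default="Other")
    return pd.Series(segments, index=df.index, name="customer_segment")


# Toy data (hypothetical) with the same one-hot room type columns
toy_listings = pd.DataFrame(
    {
        "price": [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.80, 1.00],
        "room_type_Entire home/apt": [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
        "room_type_Private room": [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
        "room_type_Shared room": [0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    }
)
toy_listings["customer_segment"] = assign_customer_segment(toy_listings)
print(toy_listings["customer_segment"].value_counts())
```

The same helper could then be called once per dataset, for example `df_nyc_IQR['customer_segment'] = assign_customer_segment(df_nyc_IQR)`, so that each dataset's cut-offs are computed from its own price distribution.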


### Feature Engineering - Datasets Without Outliers

In [None]:
# Dataset after IQR outlier removal
top_5_percent = df_nyc_IQR['price'].quantile(0.95)
top_15_percent = df_nyc_IQR['price'].quantile(0.85)
top_85_percent = df_nyc_IQR['price'].quantile(0.15)

def customer_segment(row):
    if row['price'] >= top_5_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1):
        return 'Luxury'
    elif top_15_percent <= row['price'] < top_5_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1):
        return 'High'
    elif top_85_percent <= row['price'] < top_15_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1 or row['room_type_Shared room'] == 1):
        return 'Middle'
    elif row['price'] < top_85_percent and (row['room_type_Private room'] == 1 or row['room_type_Shared room'] == 1):
        return 'Low'
    else:
        return 'Other'

print("Customer Segment for df_nyc_IQR")
df_nyc_IQR['customer_segment'] = df_nyc_IQR.apply(customer_segment, axis=1)
print(df_nyc_IQR['customer_segment'].value_counts())

Customer Segment for df_nyc_IQR
customer_segment
Middle    28779
Low        5954
High       4093
Luxury     2174
Other       181
Name: count, dtype: int64


In [None]:
# Dataset after winsorisation of outliers
top_5_percent = df_nyc_winsor['price'].quantile(0.95)
top_15_percent = df_nyc_winsor['price'].quantile(0.85)
top_85_percent = df_nyc_winsor['price'].quantile(0.15)

def customer_segment(row):
    if row['price'] >= top_5_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1):
        return 'Luxury'
    elif top_15_percent <= row['price'] < top_5_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1):
        return 'High'
    elif top_85_percent <= row['price'] < top_15_percent and (row['room_type_Entire home/apt'] == 1 or row['room_type_Private room'] == 1 or row['room_type_Shared room'] == 1):
        return 'Middle'
    elif row['price'] < top_85_percent and (row['room_type_Private room'] == 1 or row['room_type_Shared room'] == 1):
        return 'Low'
    else:
        return 'Other'

print("Customer Segment for df_nyc_winsor")
df_nyc_winsor['customer_segment'] = df_nyc_winsor.apply(customer_segment, axis=1)
print(df_nyc_winsor['customer_segment'].value_counts())

Customer Segment for df_nyc_winsor
customer_segment
Middle    31037
Low        6058
High       4261
Luxury     2472
Other       212
Name: count, dtype: int64
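
To compare the three outputs side by side, the printed counts can be converted into percentage shares. The sketch below simply re-enters the counts shown above; the variable names `counts`, `totals` and `shares` are illustrative. The totals show that df_nyc_winsor keeps all 44,040 rows with the same segment counts as df_nyc_backup, which is consistent with winsorisation clipping values rather than dropping rows, while df_nyc_IQR retains 41,181 rows.

```python
import pandas as pd

# Segment counts copied from the three value_counts() outputs above
counts = pd.DataFrame(
    {
        "df_nyc_backup": {"Middle": 31037, "Low": 6058, "High": 4261, "Luxury": 2472, "Other": 212},
        "df_nyc_IQR": {"Middle": 28779, "Low": 5954, "High": 4093, "Luxury": 2174, "Other": 181},
        "df_nyc_winsor": {"Middle": 31037, "Low": 6058, "High": 4261, "Luxury": 2472, "Other": 212},
    }
)

totals = counts.sum()                        # rows per dataset
shares = (counts / totals * 100).round(1)    # percentage of each segment per dataset

print(totals)
print(shares)
```

In every dataset the "Middle" segment accounts for roughly 70% of listings, so the class proportions are broadly stable across the outlier treatments.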


The target variable is imbalanced: the "Middle" customer segment is by far the largest class (31,037 of 44,040 listings in df_nyc_backup), which could bias the model towards that class. To evaluate performance and understand the model's behaviour across all classes, we will generate a confusion matrix once the model has been applied to each dataset.
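
As an illustration of what the confusion matrix will show, the sketch below builds one with `pd.crosstab` and derives per-class recall from it. The true and predicted labels are made up for the example; in the notebook they would come from the held-out split and the trained model, neither of which exists yet at this point.

```python
import pandas as pd

labels = ["Luxury", "High", "Middle", "Low", "Other"]

# Hypothetical labels: a model that over-predicts the majority "Middle" class
y_true = pd.Series(
    ["Middle", "Middle", "Low", "High", "Luxury", "Middle", "Low", "Other"], name="actual"
)
y_pred = pd.Series(
    ["Middle", "Middle", "Middle", "High", "High", "Middle", "Low", "Middle"], name="predicted"
)

# Rows are actual classes, columns are predicted classes;
# reindex so that classes which were never predicted still appear as zero columns
confusion = pd.crosstab(y_true, y_pred).reindex(index=labels, columns=labels, fill_value=0)

# Recall per class: correct predictions divided by the number of actual rows in that class
recall = pd.Series(
    {label: confusion.loc[label, label] / confusion.loc[label].sum() for label in labels},
    name="recall",
)

print(confusion)
print(recall)
```

With imbalanced classes, overall accuracy can look high even when minority segments such as "Luxury" or "Other" are rarely predicted correctly, so per-class recall is a more informative view than a single accuracy figure.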

Final check of our datasets before proceeding to model training:

In [None]:
df_nyc_backup.head(3)

| | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | neighbourhood_group_Bronx | neighbourhood_group_Brooklyn | neighbourhood_group_Manhattan | neighbourhood_group_Queens | neighbourhood_group_Staten Island | room_type_Entire home/apt | room_type_Private room | room_type_Shared room | customer_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.013914 | 0.0 | 0.014308 | 0.015337 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Middle |
| 1 | 0.021522 | 0.0 | 0.071542 | 0.003067 | 0.972603 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | High |
| 2 | 0.014014 | 0.001601 | 0.0 | 0.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | Middle |


In [None]:
df_nyc_winsor.head(3)

| | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | neighbourhood_group_Bronx | neighbourhood_group_Brooklyn | neighbourhood_group_Manhattan | neighbourhood_group_Queens | neighbourhood_group_Staten Island | room_type_Entire home/apt | room_type_Private room | room_type_Shared room | customer_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.315909 | 0.0 | 0.014308 | 0.015337 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Middle |
| 1 | 0.488636 | 0.0 | 0.071542 | 0.003067 | 0.972603 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | High |
| 2 | 0.318182 | 0.001601 | 0.0 | 0.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | Middle |


In [None]:
df_nyc_IQR.head(3)

| | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | neighbourhood_group_Bronx | neighbourhood_group_Brooklyn | neighbourhood_group_Manhattan | neighbourhood_group_Queens | neighbourhood_group_Staten Island | room_type_Entire home/apt | room_type_Private room | room_type_Shared room | customer_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.370667 | 0.0 | 0.014308 | 0.015337 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Middle |
| 1 | 0.573333 | 0.0 | 0.071542 | 0.003067 | 0.972603 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | High |
| 2 | 0.373333 | 0.001601 | 0.0 | 0.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | Middle |
