
## **Data Splitting into Train and Test Sets**

Map the customer segment labels to numerical values in all three datasets. According to the documentation of the classification algorithms we reviewed, some do not accept non-numerical target values, so we encode the target as integers. The mapping keeps the order of the segments (`Low` < `Middle` < `High` < `Luxury`), with `Other` encoded as 0.

In [None]:
target_label_mapping = {
    'Luxury': 4,
    'High': 3,
    'Middle': 2,
    'Low': 1,
    'Other': 0
}
df_nyc_backup['customer_segment'] = df_nyc_backup['customer_segment'].map(target_label_mapping)
df_nyc_IQR['customer_segment'] = df_nyc_IQR['customer_segment'].map(target_label_mapping)
df_nyc_winsor['customer_segment'] = df_nyc_winsor['customer_segment'].map(target_label_mapping)
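One caveat with `Series.map` and a dictionary: any value that is not a key in the mapping becomes `NaN`, so re-running the mapping cell on columns that are already encoded turns every label into `NaN`. The sketch below shows one possible guard on a small made-up series rather than the real data. The helper name `encode_segment` is hypothetical.

```python
import pandas as pd

target_label_mapping = {'Luxury': 4, 'High': 3, 'Middle': 2, 'Low': 1, 'Other': 0}

def encode_segment(series, mapping):
    """Map labels to integers and fail loudly if any value has no mapping."""
    encoded = series.map(mapping)
    # Values missing from the mapping (or already missing) come back as NaN
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return encoded.astype('int64')

# Toy data for illustration only
toy = pd.Series(['Middle', 'High', 'Middle', 'Low'])
encoded = encode_segment(toy, target_label_mapping)
print(encoded.tolist())
```

Calling `encode_segment` a second time on the encoded result raises a `ValueError` instead of silently producing a column of `NaN`.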

In [None]:
df_nyc_backup.head(3)

Unnamed: 0,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365,neighbourhood_group_Bronx,neighbourhood_group_Brooklyn,neighbourhood_group_Manhattan,neighbourhood_group_Queens,neighbourhood_group_Staten Island,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,customer_segment
0,0.013914,0.0,0.014308,0.015337,1.0,0,1,0,0,0,0,1,0,2
1,0.021522,0.0,0.071542,0.003067,0.972603,0,0,1,0,0,1,0,0,3
2,0.014014,0.001601,0.0,0.0,1.0,0,0,1,0,0,0,1,0,2


In [None]:
df_nyc_IQR.head(3)

Unnamed: 0,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365,neighbourhood_group_Bronx,neighbourhood_group_Brooklyn,neighbourhood_group_Manhattan,neighbourhood_group_Queens,neighbourhood_group_Staten Island,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,customer_segment
0,0.370667,0.0,0.014308,0.015337,1.0,0,1,0,0,0,0,1,0,2
1,0.573333,0.0,0.071542,0.003067,0.972603,0,0,1,0,0,1,0,0,3
2,0.373333,0.001601,0.0,0.0,1.0,0,0,1,0,0,0,1,0,2


In [None]:
df_nyc_winsor.head(3)

Unnamed: 0,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365,neighbourhood_group_Bronx,neighbourhood_group_Brooklyn,neighbourhood_group_Manhattan,neighbourhood_group_Queens,neighbourhood_group_Staten Island,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,customer_segment
0,0.315909,0.0,0.014308,0.015337,1.0,0,1,0,0,0,0,1,0,2
1,0.488636,0.0,0.071542,0.003067,0.972603,0,0,1,0,0,1,0,0,3
2,0.318182,0.001601,0.0,0.0,1.0,0,0,1,0,0,0,1,0,2


### Split Train : Test - Dataset with Outliers

Split the dataset that still contains outliers into training (80%) and test (20%) sets.

In [None]:
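from sklearn.model_selection import train_test_split

# Features: drop the target plus the price and room_type columns
X_outlier = df_nyc_backup.drop(columns=['customer_segment','room_type_Entire home/apt','room_type_Private room','room_type_Shared room','price'])
y_outlier = df_nyc_backup['customer_segment']

# 80:20 split with a fixed seed for reproducibility
X_outlier_train, X_outlier_test, y_outlier_train, y_outlier_test = train_test_split(X_outlier, y_outlier, test_size=0.2, random_state=42)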

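Check for data leakage by listing the feature columns left in the training set: the target, `price`, and the `room_type_*` columns should be absent, and no categorical columns should remain.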

In [None]:
numcols = X_outlier_train.select_dtypes(include=['float64', 'int64']).columns
catcols = X_outlier_train.select_dtypes(include=['object', 'category']).columns

print("Numerical columns:", numcols)
print("Categorical columns:", catcols)

Numerical columns: Index(['minimum_nights', 'number_of_reviews', 'calculated_host_listings_count',
       'availability_365', 'neighbourhood_group_Bronx',
       'neighbourhood_group_Brooklyn', 'neighbourhood_group_Manhattan',
       'neighbourhood_group_Queens', 'neighbourhood_group_Staten Island'],
      dtype='object')
Categorical columns: Index([], dtype='object')
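The dtype listing above shows which columns remain as features, but it does not compare the two splits with each other. A minimal sketch of a more direct check follows, on toy data, assuming leakage here means either rows shared between train and test or excluded columns still present among the features. The helper name `check_split` and the toy frame are hypothetical.

```python
import pandas as pd

def check_split(X_train, X_test, excluded_columns):
    """Return row labels shared by both splits and excluded columns still present."""
    shared_rows = X_train.index.intersection(X_test.index)
    leftover = [c for c in excluded_columns
                if c in X_train.columns or c in X_test.columns]
    return shared_rows, leftover

# Toy frame for illustration only; values are made up
toy = pd.DataFrame({
    'minimum_nights': [0.0, 0.1, 0.2, 0.3, 0.4],
    'availability_365': [1.0, 0.9, 0.5, 0.2, 0.0],
    'price': [0.01, 0.02, 0.03, 0.04, 0.05],
})
features = toy.drop(columns=['price'])

# Simple 80/20 split using pandas only
train = features.sample(frac=0.8, random_state=42)
test = features.drop(train.index)

shared_rows, leftover = check_split(train, test, excluded_columns=['price'])
print("Shared rows:", list(shared_rows))
print("Excluded columns still present:", leftover)
```

The same helper can be applied to the outputs of `train_test_split`, which keeps the original DataFrame index on both splits. The row comparison assumes that index is unique.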


### Split Train : Test - Dataset without Outliers

Split the dataset in which outliers were removed using the IQR rule.

In [None]:
# Same feature selection and 80:20 split as above, applied to the IQR-treated data
X_IQR = df_nyc_IQR.drop(columns=['customer_segment','room_type_Entire home/apt','room_type_Private room','room_type_Shared room','price'])
y_IQR = df_nyc_IQR['customer_segment']

X_IQR_train, X_IQR_test, y_IQR_train, y_IQR_test = train_test_split(X_IQR, y_IQR, test_size=0.2, random_state=42)

Repeat the leakage check on the IQR-treated training features: the target, `price`, and the `room_type_*` columns should be absent.

In [None]:
numcols = X_IQR_train.select_dtypes(include=['float64', 'int64']).columns
catcols = X_IQR_train.select_dtypes(include=['object', 'category']).columns

print("Numerical columns:", numcols)
print("Categorical columns:", catcols)

Numerical columns: Index(['minimum_nights', 'number_of_reviews', 'calculated_host_listings_count',
       'availability_365', 'neighbourhood_group_Bronx',
       'neighbourhood_group_Brooklyn', 'neighbourhood_group_Manhattan',
       'neighbourhood_group_Queens', 'neighbourhood_group_Staten Island'],
      dtype='object')
Categorical columns: Index([], dtype='object')


Split the dataset in which outliers were treated by winsorisation (extreme values are capped rather than removed).

In [None]:
# Same feature selection and 80:20 split as above, applied to the winsorised data
X_winsor = df_nyc_winsor.drop(columns=['customer_segment','room_type_Entire home/apt','room_type_Private room','room_type_Shared room','price'])
y_winsor = df_nyc_winsor['customer_segment']

X_winsor_train, X_winsor_test, y_winsor_train, y_winsor_test = train_test_split(X_winsor, y_winsor, test_size=0.2, random_state=42)

Repeat the leakage check on the winsorised training features: the target, `price`, and the `room_type_*` columns should be absent.

In [None]:
numcols = X_winsor_train.select_dtypes(include=['float64', 'int64']).columns
catcols = X_winsor_train.select_dtypes(include=['object', 'category']).columns

print("Numerical columns:", numcols)
print("Categorical columns:", catcols)

Numerical columns: Index(['minimum_nights', 'number_of_reviews', 'calculated_host_listings_count',
       'availability_365', 'neighbourhood_group_Bronx',
       'neighbourhood_group_Brooklyn', 'neighbourhood_group_Manhattan',
       'neighbourhood_group_Queens', 'neighbourhood_group_Staten Island'],
      dtype='object')
Categorical columns: Index([], dtype='object')
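The three splits above use the same dropped columns, test size, and seed, so they could be expressed once. A minimal sketch under that assumption follows. The helper name `split_features_target` and the toy frame are hypothetical, and `errors='ignore'` is added only so the toy frame can omit some of the excluded columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

DROPPED = ['customer_segment', 'room_type_Entire home/apt',
           'room_type_Private room', 'room_type_Shared room', 'price']

def split_features_target(df, target='customer_segment', dropped=DROPPED,
                          test_size=0.2, random_state=42):
    """Drop the target and excluded columns, then split 80:20 with a fixed seed."""
    # errors='ignore' skips excluded columns that are not in the frame
    X = df.drop(columns=dropped, errors='ignore')
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Toy frame for illustration only; values are made up
toy = pd.DataFrame({
    'price': [0.1 * i for i in range(10)],
    'minimum_nights': [0.0] * 10,
    'availability_365': [1.0] * 10,
    'customer_segment': [2, 3, 2, 1, 0, 4, 2, 3, 1, 2],
})
X_train, X_test, y_train, y_test = split_features_target(toy)
print(X_train.shape, X_test.shape)
```

With the real data, the helper would be called once each on `df_nyc_backup`, `df_nyc_IQR`, and `df_nyc_winsor`. If the segment classes are imbalanced, passing `stratify=y` to `train_test_split` is one option for keeping class proportions similar in both splits, although the original cells do not use it.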
