# __Predicting Airbnb Listing Prices in Sydney__

---

## Task 2: Data Cleaning, Missing Observations and Feature Engineering
- This task includes a set of instructions/ steps listed below along with comprehensive explanations. 




In [2]:
# Import libraries 
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

# Configure seaborn aesthetics
sns.set(style="whitegrid", palette="muted", font_scale=1.2)

In [None]:
# Load datasets
df_train = pd.read_csv(r"C:\Users\haiho\GITHUB\Sydney-Airbnb-prices-prediction\data\raw\train.csv")
df_test = pd.read_csv(r"C:\Users\haiho\GITHUB\Sydney-Airbnb-prices-prediction\data\raw\test.csv")

### **Step 1**: Clean **all** numerical features and the target variable `price` so that they can be used in training algorithms.

Even though the descriptive statistics summary provided by the initial data analysis includes 34 features, we can identify other 3 numerical variables are in the format of string and should be corrected for further analyis. `host_response_rate` and `host_acceptance_rate` are the two features which contain both numerical values and text on each entry, specfically number with percentage symbol (%). Since the intial data types of these two columns are string, we need to convert them to float after extracting the unwanted character. 

In [5]:
# List of data frames and columns to check
dfs = [df_train, df_test]
columns = ['host_response_rate', 'host_acceptance_rate']

# Loop through data frames and columns
for df in dfs:
    for col in columns:
        print(type(df[col]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [6]:
# Remove unwanted characters 
for df in dfs:
    for col in columns:
        df[col] = df[col].str.replace('%', '')

In [7]:
# Transform data columns into float type
for df in dfs:
    for col in columns:
        df[col] = df[col].astype(float)

Besides, `price` as the target variable of the training dataset is also not denoted as a floating point attribute due to the same initial data type. By repeating the procedure, we can convert price to the accurate data type.

In [8]:
# Examine the current data type
print(type(df_train['price']))

<class 'pandas.core.series.Series'>


In the form of string, we can spot the shortest and longest values to determine any unexpected characters.

In [9]:
print("Min value:", min(df_train['price'], key=len))
print("Max value:", max(df_train['price'], key=len))

Min value: $82.00
Max value: $2,746.00


We only want to keep the numeric values and remove the unwanted characters before changing to float.

In [12]:
# Remove unwanted characters 
df_train['price'] = df_train['price'].str.replace('$', '').str.replace(',', '')

In [13]:
# Transform price into float type
df_train['price'] = df_train['price'].astype(float)

### **Step 2:** Create new features from existing features which contain multiple items of information.  

There exits features that contains multiple intems in each entry of both train and test sets. Specfically, `host_verifications` contains `email`,  `phone`, `reviews`, `jumio`, etc.

We firstly look through the top 2 entries of the `host_verifications` feature.

In [None]:
# Examine first 2 values of host_verifications column
df_train['host_verifications'].head(2)

For `host_verifications`, the total numbers of 4 new features can be created as `email`, `phone`, `reviews`, `jumio`:

In [None]:
# Define main verification types
verification_types = ['email', 'phone', 'reviews', 'jumio']

In [19]:
# Calculate value counts of main verification types
for verification in verification_types:
    count = 0
    for row in df_train.itertuples(index = True, name ='Pandas'):
        if verification in getattr(row, 'host_verifications'):
            count+=1
    print(f'Value counts of {verification} verification: {count}')      

Value counts of email verifications : 6495
Value counts of phone verifications : 6995
Value counts of reviews verifications : 4429
Value counts of jumio verifications : 4795


With the idea of consolidating the verification functions into a single function that can handle different verification types, new features are created to indicate whether an Airbnb has `email` or `phone` or `reviews` or `jumio`.

In [None]:
# Identify different verification types
def verification_check(row, verification_type):
    return 1 if verification_type in row['host_verifications'] else 0

In [None]:
# Apply to the train and test sets using lambda functions
for verification in verification_types:
    df_train[verification] = df_train.apply(lambda row: verification_check(row, verification), axis=1)
    df_test[verification] = df_test.apply(lambda row: verification_check(row, verification), axis=1)

In [23]:
# Drop the original column
df_train.drop(columns='host_verifications', inplace=True)
df_test.drop(columns='host_verifications', inplace=True)

We also identify that the `amenities` feature has multiple items of information which also needs to be separated.

In [24]:
# Examine first 2 values of amenities column
df_train['amenities'].head(2)

0    ["Hot water", "Coffee maker", "Heating", "Hair...
1    ["Hot water", "Coffee maker", "Long term stays...
Name: amenities, dtype: object

By finding the longest entry of `amenities` and the top 4 amenities of all Airbnb entries, we can determine the 4 new features to generate.

In [25]:
longest_amen = max(df_train['amenities'], key=len)
longest_amen

'["Clothing storage: wardrobe, walk-in closet, and dresser", "Hot water", "Coffee maker", "Free dryer \\u2013 In building", "Toaster", "Heating", "Long term stays allowed", "Extra pillows and blankets", "Dining table", "Private fenced garden or backyard", "Bikes", "Hair dryer", "Conditioner", "Drying rack for clothing", "Babysitter recommendations", "Laundromat nearby", "Fire pit", "Bathtub", "Oven", "Private entrance", "Lockbox", "Beach essentials", "Dedicated workspace: monitor, desk, table, and office chair", "Bread maker", "Ceiling fan", "Microwave", "Iron", "Free washer \\u2013 In building", "68\\" HDTV with Amazon Prime Video, Apple TV, Netflix, standard cable", "Refrigerator", "Outdoor shower", "Board games", "Fire extinguisher", "Hot water kettle", "Piano", "Samsung Bar Bluetooth sound system", "Children\\u2019s dinnerware", "Stove", "Portable fans", "Ethernet connection", "Bed linens", "Game console", "Cable TV", "Hangers", "Pack \\u2019n play/Travel crib", "BBQ grill", "Body 

In [26]:
# Find the index the longest amenities entry
df_train.index[df_train['amenities']==longest_amen].tolist()

[967]

As we found the existing longest amenity entry in the dataframe, we can design a loop to count occurences of each amentity and apply the same algorithm to create 4 new features of the most popular amenity among all Airbnb.

In [None]:
amenity_counts = {}

# Count occurrences of each amenity
for amenity in longest_amen:
    count = 0
    for row in df_train.itertuples(index=True, name='Pandas'):
        if amenity in getattr(row, 'amenities'):
            count += 1
    amenity_counts[amenity] = count

# Sort the dictionary by count in descending order
sorted_amenities = sorted(amenity_counts.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 amenities
for amenity, count in sorted_amenities[:10]:
    print(f'Value counts of {amenity} amenity: {count}')


In [27]:
# # Count occurrences of each amenity
# for i in longest_amen:
#     count = 0
#     for row in df_train.itertuples(index = True, name ='Pandas'):
#         if i in getattr(row, 'amenities'):
#             count+=1
#     print('Value counts of',i,'amenity :',count)   

Value counts of Clothing storage: wardrobe, walk-in closet, and dresser amenity : 3
Value counts of Hot water amenity : 5506
Value counts of Coffee maker amenity : 2713
Value counts of Free dryer \u2013 In building amenity : 44
Value counts of Toaster amenity : 1229
Value counts of Heating amenity : 5067
Value counts of Long term stays allowed amenity : 6387
Value counts of Extra pillows and blankets amenity : 2957
Value counts of Dining table amenity : 1148
Value counts of Private fenced garden or backyard amenity : 357
Value counts of Bikes amenity : 41
Value counts of Hair dryer amenity : 5697
Value counts of Conditioner amenity : 997
Value counts of Drying rack for clothing amenity : 855
Value counts of Babysitter recommendations amenity : 271
Value counts of Laundromat nearby amenity : 608
Value counts of Fire pit amenity : 140
Value counts of Bathtub amenity : 1285
Value counts of Oven amenity : 3640
Value counts of Private entrance amenity : 2912
Value counts of Lockbox amenity 

Generally, the top 4 amenities to be recorded are: 'Long term stays allowed', 'Wifi', 'Essentials', 'Smoke alarm'. By apply the same function structure of creating verifications, 4 new amenity features can be added.

In [None]:
# Define main verification types
amenity_types = ['Long term stays allowed', 'Wifi', 'Essentials', 'Smoke alarm']

In [None]:
# Identify different verification types
def amenity_check(row, amenity_type):
    return 1 if amenity_type in row['amenities'] else 0

In [None]:
# Apply to the train and test sets using lambda functions
for amenity in amenity_types:
    df_train[amenity] = df_train.apply(lambda row: amenity_check(row, amenity), axis=1)
    df_test[amenity] = df_test.apply(lambda row: amenity_check(row, amenity), axis=1)

In [31]:
# Drop the original columns
df_train.drop(columns='amenities', inplace=True)
df_test.drop(columns='amenities', inplace=True)

In [28]:
# #for Long term stays allowed amenity 
# def longterm_amenity(row):
#     for i in range(len(row)):
#         if 'Long term stays allowed' in row['amenities']:
#             return 1 
#         else: 
#             return 0

# #for Wifi amenity 
# def wifi_amenity(row):
#     for i in range(len(row)):
#         if 'Wifi' in row['amenities']:
#             return 1 
#         else: 
#             return 0

# #for Essentials amenity 
# def essentials_amenity(row):
#     for i in range(len(row)):
#         if 'Essentials' in row['amenities']:
#             return 1 
#         else: 
#             return 0

# #for Smoke alarm amenity 
# def smokealarm_amenity(row):
#     for i in range(len(row)):
#         if 'Smoke alarm' in row['amenities']:
#             return 1 
#         else: 
#             return 0

In [29]:
# #apply to the train set
# df_train['Long term stays allowed'] = df_train.apply(longterm_amenity, axis=1)
# df_train['Wifi'] = df_train.apply(wifi_amenity, axis=1)
# df_train['Essentials'] = df_train.apply(essentials_amenity, axis=1)
# df_train['Smoke alarm'] = df_train.apply(smokealarm_amenity, axis=1)

In [30]:
# #apply to the test set
# df_test['Long term stays allowed'] = df_test.apply(longterm_amenity, axis=1)
# df_test['Wifi'] = df_test.apply(wifi_amenity, axis=1)
# df_test['Essentials'] = df_test.apply(essentials_amenity, axis=1)
# df_test['Smoke alarm'] = df_test.apply(smokealarm_amenity, axis=1)

### **Step 3**: Impute missing values for all features in both training and test datasets.   

**For the training dataset:**

In [None]:
# Examine the shape of the train set
df_train.shape

In [33]:
# List all columns of the train set
df_train.columns

Index(['ID', 'name', 'description', 'neighborhood_overview', 'host_name',
       'host_since', 'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_neighbourhood', 'host_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed',
       'latitude', 'longitude', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_sc

Due to specific characteristics of each feature including range, type, etc; we must specifically design different imputing strategies to suit nature of each column. 

Generally, while we can fill all misisng text variables with **'unknown'** value, numerical variables' missing values can be imputed with their **mode** or **mean**. 

First, we start imputing text-based columns:

In [34]:
# List all text_based columns
text_cols = ['name', 'description', 'neighborhood_overview', 'host_about', 'neighbourhood_cleansed', 'license']

In [35]:
# Fill text_based columns with 'unknown'
df_train[text_cols] = df_train[text_cols].fillna(value='unknonwn')

Continuing with the remaining numeric features, there are 2 popular ways to fill missing values: mode and mean.

By carefully evaluating the nature of each numeric feature, we can split them into 2 separate lists for precise imputing purposes.

In [36]:
# Define all numeric cols to fill with mode values
mode_fill_cols = ['host_location','host_response_time', 'property_type', 'room_type','bathrooms']

In [37]:
# Fill mode values with listed columns
modes = df_train[mode_fill_cols].mode()
df_train[mode_fill_cols] = df_train[mode_fill_cols].fillna(value=modes.iloc[0])

In [38]:
# Define all numeric cols to fill with mean values
mean_fill_cols = ['host_response_rate','host_acceptance_rate','bedrooms','beds','minimum_minimum_nights','maximum_maximum_nights','availability_365','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value','reviews_per_month']

In [39]:
# Fill mean values with listed columns
means = df_train[mean_fill_cols].mean()
df_train[mean_fill_cols] = df_train[mean_fill_cols].fillna(value=means.iloc[0])

Besides, various features are recognised to better be in the form of integers instead of floatin point numbers due to their nature. Therefore, they will be converted to the correct form for analysis purposes.

In [40]:
# Convert into int 
df_train[['bedrooms','beds','availability_365','minimum_minimum_nights','maximum_minimum_nights','minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm','maximum_nights_avg_ntm']] = df_train[['bedrooms','beds','availability_365','minimum_minimum_nights','maximum_minimum_nights','minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm','maximum_nights_avg_ntm']].astype(int)

**For the test set:**

In [41]:
# Examine the shape of the test set
df_test.shape

(3000, 66)

We repeat the same process as for the training dataset so that misisng values of the test set can be appropriately imputed.

In [42]:
df_test[text_cols] = df_test[text_cols].fillna(value='unknonwn')

In [43]:
test_modes = df_test[mode_fill_cols].mode()
df_test[mode_fill_cols] = df_test[mode_fill_cols].fillna(value=test_modes.iloc[0])

In [44]:
test_means = df_test[mean_fill_cols].mean()
df_test[mean_fill_cols] = df_test[mean_fill_cols].fillna(value=test_means.iloc[0])

In [45]:
df_test[['bedrooms','beds','availability_365','minimum_minimum_nights','maximum_minimum_nights','minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm','maximum_nights_avg_ntm']] = df_test[['bedrooms','beds','availability_365','minimum_minimum_nights','maximum_minimum_nights','minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm','maximum_nights_avg_ntm']].astype(int)

### **Step 4**: Encode all categorical variables appropriately. 

Where a categorical feature contains more than 5 unique values, we can map the features into 5 most frequent values + 'other' and then encode appropriately. For instance, we could group then map `property_type` into 5 basic types + 'other': [entire rental unit, private room, entire room, entire towehouse, shared room, other] and then encode.

In [None]:
# Examine the value counts for each property type in both the train and test sets
train_prop_counts = df_train['property_type'].value_counts()
test_prop_counts = df_test['property_type'].value_counts()

# Combine both results
combined_prop_counts = pd.concat([train_prop_counts, test_prop_counts], axis=1, keys=['Train', 'Test'])
combined_prop_counts['Total'] = combined_prop_counts.sum(axis=1)
print(combined_prop_counts)

In [None]:
# Count values in property_type column
# df_train['property_type'].value_counts()

In [None]:
# df_test['property_type'].value_counts()

After investigating the values of `property_type` for both datasets, there can be a way to set groups which are relevant to 'Private room', 'Shared room' and 'Entire room'. 

As it is difficult to identify which values can be grouped together for 'Entire townhouse' or 'Entire rental unit', we need to specify lists of possible values and apply ordinal encodings.

In [48]:
# Define 2 general categories of multiple proprty types
entire_townhouse_types = ['Entire villa', 'Entire residential home','Entire guest suite','Entire guesthouse','Entire bungalow','Tiny house','Entire place','Entire vacation home','Dome house','Earth house','Casa particular']
entire_rental_unit_types = ['Entire serviced apartment', 'Entire loft', 'Entire rental unit', 'Entire condominium (condo)', 'Entire cottage']

# Function to update property_type
def update_property_type(df):
    for i in range(len(df)):
        if 'Private room' in df['property_type'].iloc[i]:
            df['property_type'].iloc[i] = 'Private room'
        elif 'Shared room' in df['property_type'].iloc[i]:
            df['property_type'].iloc[i] = 'Shared room'
        elif 'Room in' in df['property_type'].iloc[i]:
            df['property_type'].iloc[i] = 'Entire room'
        elif df['property_type'].iloc[i] in entire_townhouse_types:
            df['property_type'].iloc[i] = 'Entire townhouse'
        elif df['property_type'].iloc[i] in entire_rental_unit_types:
            df['property_type'].iloc[i] = 'Entire rental unit'

In [None]:
# Apply function to both train and test sets
update_property_type(df_train)
update_property_type(df_test)

In [49]:
# #apply to the training set
# for i in range(len(df_train)):
#     if 'Private room' in df_train['property_type'].loc[i]:
#         df_train['property_type'].loc[i]= 'Private room'
#     if 'Shared room' in df_train['property_type'].loc[i]:
#         df_train['property_type'].loc[i]= 'Shared room'
#     if 'Room in' in df_train['property_type'].loc[i]:
#         df_train['property_type'].loc[i]= 'Entire room'
#     if df_train['property_type'].loc[i] in entire_house:
#         df_train['property_type'].loc[i]= 'Entire townhouse'
#     if df_train['property_type'].loc[i] in entire_unit:
#         df_train['property_type'].loc[i]= 'Entire rental unit'

In [50]:
# #apply to the test set
# for j in range(len(df_test)):
#     if 'Private room' in df_test['property_type'].iloc[j]:
#         df_test['property_type'].iloc[j]= 'Private room'
#     if 'Shared room' in df_test['property_type'].iloc[j]:
#         df_test['property_type'].iloc[j]= 'Shared room'
#     if 'Room in' in df_test['property_type'].iloc[j]:
#         df_test['property_type'].iloc[j]= 'Entire room'
#     if df_test['property_type'].iloc[j] in e_townhouse:
#         df_test['property_type'].iloc[j]= 'Entire townhouse'
#     if df_test['property_type'].iloc[j] in e_rentalunit:
#         df_test['property_type'].iloc[j]= 'Entire rental unit'

Now that 5 basic types of Airbnb property have been identified, the remaining values can be assigned as 'Other'.

In [None]:
# Define categories
valid_property_types = ['Private room', 'Shared room', 'Entire room', 'Entire townhouse', 'Entire rental unit']

# Function to update property_type
def update_property_type(df):
    df['property_type'] = df['property_type'].apply(lambda x: x if x in valid_property_types else 'Other')

# Apply function to both train and test sets
update_property_type(df_train)
update_property_type(df_test)

In [51]:
# for i in range(len(df_train)):
#     if df_train['property_type'].loc[i] not in ['Private room', 'Shared room','Entire room', 'Entire townhouse','Entire rental unit']:
#         df_train['property_type'].loc[i] = 'Other'

In [52]:
# for i in range(len(df_test)):
#     if df_test['property_type'].iloc[i] not in ['Private room', 'Shared room','Entire room', 'Entire townhouse','Entire rental unit']:
#         df_test['property_type'].iloc[i] = 'Other'

In [None]:
# Combine value counts for both sets after grouping types
train_grouped_prop_counts = df_train['property_type'].value_counts()
test_grouped_prop_counts = df_test['property_type'].value_counts()

combined_grouped_prop_counts = pd.concat([train_grouped_prop_counts, test_grouped_prop_counts], axis=1, keys=['Train', 'Test']).fillna(0)
combined_grouped_prop_counts['Total'] = combined_grouped_prop_counts.sum(axis=1)
print(combined_grouped_prop_counts)

In [None]:
# df_train['property_type'].value_counts()

In [None]:
# df_test['property_type'].value_counts()

In [55]:
# Examine the average price of each property type 
df_train['price'].groupby(df_train['property_type']).mean()

property_type
Entire rental unit    240.258974
Entire room           229.866667
Entire townhouse      473.339852
Other                 482.634146
Private room           80.474479
Shared room            47.100000
Name: price, dtype: float64

As the chosen methodology for this forecasting topic is ML regression models, all independent varibles must be in numeric form. Therefore, we will map all identified values of `property_type` into integers based on the each type's average price in the descending order.

In [56]:
# Define mapping rules 
property_type_mapping = {'Other':6, 'Entire townhouse':5, 'Entire rental unit':4, 'Entire room':3, 'Private room':2, 'Shared room':1}

# Apply to both DataFrames
df_train['mapped_property_type'] = df_train['property_type'].map(property_type_mapping)
df_test['mapped_property_type'] = df_test['property_type'].map(property_type_mapping)

In [57]:
# Drop the original columns after mapping 
df_train.drop(columns=['property_type'],inplace=True)
df_test.drop(columns=['property_type'],inplace=True)

With the same nature of having diverse types, we apply the same logic for `bathrooms` feature.

In [None]:
# Examine the value counts for each bathroom type in both the train and test sets
train_bath_counts = df_train['bathrooms'].value_counts()
test_bath_counts = df_test['bathrooms'].value_counts()

# Combine both results
combined_bath_counts = pd.concat([train_bath_counts, test_bath_counts], axis=1, keys=['Train', 'Test'])
combined_bath_counts['Total'] = combined_bath_counts.sum(axis=1)
print(combined_bath_counts)

In [None]:
#df_train['bathrooms'].value_counts()

In [None]:
#df_test['bathrooms'].value_counts()

It is evident that `bathrooms` feature includes more than 5 unique values which indicates an appropriate encoding strategy should also be applied for analysing purposes. 

Based on the counting results of all bathroom types, we can classify them as 5 main groups: 
- One full bath (1 bath) 
- One full private full bath (1 private bath) 
- One full shared bath (1 shared bath) 
- Many full baths (many baths) 
- Many full shared baths (many shared baths) 
- Remaining types (other).

In [None]:
# Function to categorize bathroom values
def simplify_bathrooms(value):
    if '1.5 baths' in value:
        return '1 bath'
    if '1.5 shared baths' in value:
        return '1 shared bath'
    if any(x in value for x in ['2', '3', '4', '5', '6', '7', '11', '19']) and 'shared' not in value:
        return 'Many baths'
    if any(x in value for x in ['2', '3', '4', '5', '6', '7']) and 'shared' in value:
        return 'Many shared baths'
    if any(x in value for x in ['0', 'half']):
        return 'Other'
    return value  # Keep unchanged if not matched

# Apply function to both DataFrames
df_train['bathrooms'] = df_train['bathrooms'].apply(simplify_bathrooms)
df_test['bathrooms'] = df_test['bathrooms'].apply(simplify_bathrooms)


In [60]:
# #replace values 
# #1 full bath
# one_bath = ['1.5 baths']
# df_train['bathrooms'] = df_train['bathrooms'].replace(one_bath,'1 bath')
# df_test['bathrooms'] = df_test['bathrooms'].replace(one_bath,'1 bath')
# #1 full shared bath
# sh_bath = ['1.5 shared baths']
# df_train['bathrooms'] = df_train['bathrooms'].replace(sh_bath,'1 shared bath')
# df_test['bathrooms'] = df_test['bathrooms'].replace(sh_bath,'1 shared bath')
# #many full baths
# many_baths = ['2 baths','3 baths','2.5 baths','3.5 baths','4 baths','6 baths','5.5 baths','6.5 baths','11 baths','19 
# ']
# df_train['bathrooms'] = df_train['bathrooms'].replace(many_baths,'Many baths')
# df_test['bathrooms'] = df_test['bathrooms'].replace(many_baths,'Many baths')
# #many full shared baths
# many_sh_baths = ['2 shared baths','3 shared baths','2.5 shared baths','5 baths','4.5 baths','3.5 shared baths','6 shared baths','4 shared baths','4.5 shared baths','7 shared baths','5.5 shared baths']
# df_train['bathrooms'] = df_train['bathrooms'].replace(many_sh_baths,'Many shared baths')
# df_test['bathrooms'] = df_test['bathrooms'].replace(many_sh_baths,'Many shared baths')
# #the remaining types
# other = ['0 shared baths','0 baths','Shared half-bath','Half-bath','Private half-bath']
# df_train['bathrooms'] = df_train['bathrooms'].replace(other,'Other')
# df_test['bathrooms'] = df_test['bathrooms'].replace(other,'Other')

In [None]:
# Count distinct values for bathrooms feature after grouping
# df_train['bathrooms'].value_counts()

In [None]:
# df_test['bathrooms'].value_counts()

In [None]:
# Examine the value counts for each bathroom type in both the train and test sets
train_grouped_bath_counts = df_train['bathrooms'].value_counts()
test_grouped_bath_counts = df_test['bathrooms'].value_counts()

# Combine both results
combined_bath_counts = pd.concat([train_grouped_bath_counts, test_grouped_bath_counts], axis=1, keys=['Train', 'Test'])
combined_bath_counts['Total'] = combined_bath_counts.sum(axis=1)
print(combined_bath_counts)

The mapping process is also applied for the `bathrooms` features for constructing ML models in next tasks.

In [63]:
# Map grouped values into integers
bathrooms_mapping = {'Other':1, 'Many shared baths':2, 'Many baths':3, '1 shared bath':4, '1 private bath':5, '1 bath':6} 

# Apply to both DataFrames
df_train['mapped_bathrooms'] = df_train['bathrooms'].map(bathrooms_mapping)
df_test['mappped_bathrooms'] = df_test['bathrooms'].map(bathrooms_mapping)

In [64]:
# Drop the original column after mapping 
df_train.drop(columns=['bathrooms'],inplace=True)
df_test.drop(columns=['bathrooms'],inplace=True)

### **Step 5**: Label and Transform data types for features with multiple values.  

In [65]:
# Examine the value counts for each room type in the train DataFrame
df_train['room_type'].value_counts()

Entire home/apt    5337
Private room       1511
Hotel room          103
Shared room          49
Name: room_type, dtype: int64

In [66]:
# Examine the value counts for each host_response_time value in the train DataFrame
df_train['host_response_time'].value_counts()

within an hour        4552
within a few hours    1271
within a day           860
a few days or more     317
Name: host_response_time, dtype: int64

Even though `room_type` and `host_response_time` only has 4 unique values, this feature should also be in an appropriate ordering of the labels by applying the same mapping integer values process.

In [67]:
# Define mapping rules
room_mapping = {'Shared room':1, 'Hotel room':2, 'Private room':3, 'Entire home/apt':4}

# Apply to both DataFrames
df_train['room_maptype'] = df_train['room_type'].map(room_mapping)
df_test['room_maptype'] = df_test['room_type'].map(room_mapping)

In [68]:
# Define mapping rules
response_time_mapping = {'a few days or more':1, 'within a day':2, 'within a few hours':3, 'within an hour':4}

# Apply to both DataFrames
df_train['response_time'] = df_train['host_response_time'].map(response_time_mapping)
df_test['response_time'] = df_test['host_response_time'].map(response_time_mapping)

In [69]:
# Drop the original columns after mapping 
df_train.drop(columns=['room_type'],inplace=True)
df_test.drop(columns=['room_type'],inplace=True)
df_train.drop(columns=['host_response_time'],inplace=True)
df_test.drop(columns=['host_response_time'],inplace=True)

In [70]:
# #drop the original column after mapping 
# df_train.drop(columns=['host_response_time'],inplace=True)
# df_test.drop(columns=['host_response_time'],inplace=True)

In [None]:
# # Combine value counts for both sets after grouping types
# train_grouped_prop_counts = df_train['property_type'].value_counts()
# test_grouped_prop_counts = df_test['property_type'].value_counts()

# combined_grouped_prop_counts = pd.concat([train_grouped_prop_counts, test_grouped_prop_counts], axis=1, keys=['Train', 'Test']).fillna(0)
# combined_grouped_prop_counts['Total'] = combined_grouped_prop_counts.sum(axis=1)
# print(combined_grouped_prop_counts)

Moreover, by assessing the values of the remaining categorical features which are `has_availability`, `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified` and `instant_bookable`, we recognize that they only have 2 values of 't' and 'f'. Therefore, we can encode these 2 values to 1 and 0 respectively.

In [None]:
columns_to_replace = [
    'has_availability',
    'host_is_superhost',
    'host_has_profile_pic',
    'host_identity_verified',
    'instant_bookable'
]
# Examine the top 5 values of these given features 
df_train[columns_to_replace].head()

In [None]:
# Calculate value counts for each column
has_availability_counts = df_train['has_availability'].value_counts()
host_is_superhost_counts = df_train['host_is_superhost'].value_counts()
host_has_profile_pic_counts = df_train['host_has_profile_pic'].value_counts()
host_identity_verified_counts = df_train['host_identity_verified'].value_counts()
instant_bookable_counts = df_train['instant_bookable'].value_counts()

# Concatenate the outputs into a single DataFrame
tf_counts_result = pd.concat([has_availability_counts, host_is_superhost_counts, host_has_profile_pic_counts, host_identity_verified_counts, instant_bookable_counts], axis=1)
tf_counts_result.columns = columns_to_replace

# Display the result
print(tf_counts_result)

In [71]:
# df_train['has_availability'].value_counts()

t    6986
f      14
Name: has_availability, dtype: int64

In [72]:
# df_train['host_is_superhost'].value_counts()

f    4901
t    2099
Name: host_is_superhost, dtype: int64

In [73]:
# df_train['host_has_profile_pic'].value_counts()

t    6981
f      19
Name: host_has_profile_pic, dtype: int64

In [74]:
# df_train['host_identity_verified'].value_counts()

t    6320
f     680
Name: host_identity_verified, dtype: int64

In [75]:
# df_train['instant_bookable'].value_counts()

f    4575
t    2425
Name: instant_bookable, dtype: int64

Now that we can map the t and f values in those columns into 1 and 0, it will be more appropriate to prepare datasets for ML models than text-based data.

In [None]:
# Replacing values in df_train
df_train[columns_to_replace] = df_train[columns_to_replace].replace({'t': 1, 'f': 0})

# Replacing values in df_test
df_test[columns_to_replace] = df_test[columns_to_replace].replace({'t': 1, 'f': 0})

In [76]:
# #encode values
# df_train['has_availability'].replace({'t': 1, 'f': 0}, inplace=True)
# df_train['host_is_superhost'].replace({'t': 1, 'f': 0}, inplace=True)
# df_train['host_has_profile_pic'].replace({'t': 1, 'f': 0}, inplace=True)
# df_train['host_identity_verified'].replace({'t': 1, 'f': 0}, inplace=True)
# df_train['instant_bookable'].replace({'t': 1, 'f': 0}, inplace=True)

In [77]:
# #encode values
# df_test['has_availability'].replace({'t': 1, 'f': 0}, inplace=True)
# df_test['host_is_superhost'].replace({'t': 1, 'f': 0}, inplace=True)
# df_test['host_has_profile_pic'].replace({'t': 1, 'f': 0}, inplace=True)
# df_test['host_identity_verified'].replace({'t': 1, 'f': 0}, inplace=True)
# df_test['instant_bookable'].replace({'t': 1, 'f': 0}, inplace=True)

### **Step 6**: Perform exploratory data analysis to measure the relationship between the features and the target and write up your findings. 

To explore and measure the relationship between current features and `price` as the target variable, we can construct correlation matrix to establish them:

In [78]:
correlations = df_train.corr().unstack().sort_values(ascending=False) #build correlation matrix
correlations = pd.DataFrame(correlations).reset_index() #convert to dataframe
correlations.columns = ['Target', 'Features', 'Correlation'] #label it
correlations.query("Target == 'price' & Features != 'price'") #filter by variable

Unnamed: 0,Target,Features,Correlation
144,price,accommodates,0.597608
165,price,map_property_type,0.381652
187,price,room_maptype,0.263343
188,price,latitude,0.262959
191,price,longitude,0.262011
315,price,calculated_host_listings_count_entire_homes,0.134978
430,price,review_scores_rating,0.096719
435,price,host_listings_count,0.095468
475,price,reviews,0.084221
485,price,reviews_per_month,0.080419


- Majority of the correlation showed a weak relationship as nearly closed to 0. 
- The 3 in the highest order are accommodates, properties and room type. Meaning that the essential organizing of these features could lead to a higher listing price as a common type of feature that creates values of living experiences. 
- On the opposite, the bottom level of Response rate, bathroom type and acceptance rate could result in a low listing price as it is directly related to the comfortable experiences of customers while making the booking and using the facilities. 

**Feature Engineering:**

After cleaning all string-based, categorical and numerical data, we identify that various potential features will not be utilized for forecasting models and included in the subset of the training data. They can be listed as: `name`,`description`,`neighborhood_overview`,`host_name`,`host_since`,`host_location`,`host_about`, `host_neighbourhood`,`neighbourhood`,
`neighbourhood_cleansed`,`first_review`,`last_review` and `license`.