# Customer Segmentation on New York City AirBnB Data

**Members:** <br>
Tsoi Kwan Ma - 476914 <br>
Łukasz Brzoska - 472892<br>
Si-Tang Lin - 476912<br>
I Putu Agastya Harta Pratama - 472876<br>
Chi Phuong Dao - 474064<br>

***Disclaimer on work distribution:*** <br>
Most of the time we worked together as a group, and team members have different levels of experience with coding and machine learning, so our responsibilities overlap with one another.

## **Business Problem**
With hundreds of thousands of listings available on the Airbnb platform across various locations, it is essential to understand customer preferences and to segment listings appropriately. This is crucial for both hosts and Airbnb to maximise occupancy rates, set optimal pricing, and offer personalised experiences.

In this project, our team will analyse Airbnb listings in New York City to optimise Airbnb's offerings and pricing strategies. By categorising the listings based on price and room type, Airbnb can enhance its recommendation algorithms to better align with customer preferences.

Later on, machine learning classification algorithms will be applied using these variables:

**Target Variable: customer segment (by price per room type) <br>
Feature Variables: minimum nights, neighbourhood group, host's listings count, number of reviews and the availability of the listing**
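
One way the target variable above could be derived is to split prices into tiers within each room type, so that a private room is compared with other private rooms rather than with entire apartments. The sketch below is only an illustration under assumptions: the three tiers, the `SEGMENT_LABELS` names and the toy `listings` table are hypothetical and are not taken from the project's actual segmentation.

```python
import pandas as pd

# Toy stand-in for the listings table (made-up prices, not real data).
listings = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 6 + ["Private room"] * 6,
    "price": [90, 120, 150, 200, 260, 400, 35, 50, 60, 75, 95, 140],
})

# Hypothetical tier names; a three-way split is an assumption for illustration.
SEGMENT_LABELS = ["budget", "mid-range", "premium"]

# Cut prices into equal-sized tiers separately for each room type.
pieces = []
for room_type, group in listings.groupby("room_type"):
    tiers = pd.qcut(group["price"], q=len(SEGMENT_LABELS), labels=SEGMENT_LABELS)
    pieces.append(tiers.astype(str))

# pd.concat keeps the original row index, so the assignment aligns row by row.
listings["segment"] = pd.concat(pieces)
print(listings)
```

On the real data the same loop would run over `df_nyc` instead of the toy table. Equal-sized tiers via `pd.qcut` are one choice among several; fixed price thresholds would be an equally valid alternative.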

## **Attribute Information**

This dataset contains 48,895 rows and 16 columns; 15 of them are described below.

1. ID: A unique identifier for each listing.
2. Name: The name of the Airbnb listing.
3. Host_id: A unique identifier for each host.
4. Host_name: The name of the host.
5. Neighbourhood_group: The larger area or borough where the property is located (e.g., Manhattan, Brooklyn).
6. Latitude: The latitude coordinate of the property’s location.
7. Longitude: The longitude coordinate of the property’s location.
8. Room_type: The type of room offered, e.g., Entire home/apt, Private room, Shared room. <br>
  *   Entire home/apt - An entire unit for yourself, for example a studio apartment.
  *   Private room - A private room within a shared apartment.
  *   Shared room - Dormitory situation.
9. Price: The nightly price of the listing in USD.
10. Minimum_nights: The minimum number of nights a guest must book.
11. Calculated_host_listings_count: The total number of listings managed by a host.
12. Availability_365: The number of days in a year the property is available for booking.
13. Number_of_reviews: Total number of reviews received by the listing.
14. Last_review: The date of the most recent review for the property.
15. Reviews_per_month: The average number of reviews the property receives per month.

The data types are as follows: <br>

*   Categorical: name, host_name, neighbourhood_group, room_type.
*   Numerical: id, host_id, latitude, longitude, price, minimum_nights, calculated_host_listings_count, availability_365, number_of_reviews, last_review, reviews_per_month.
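
The grouping above can be checked against what pandas actually infers. The sketch below uses a small made-up `sample` frame (not real rows) with a few of the column names listed above. It assumes that `last_review` is stored as a date string, in which case pandas reads it as text until it is parsed, and it treats `id` as numeric in storage only, since it is an identifier rather than a quantity.

```python
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns.
sample = pd.DataFrame({
    "id": [1, 2],
    "neighbourhood_group": ["Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt"],
    "price": [80, 210],
    "last_review": ["2019-01-05", "2019-06-20"],
    "reviews_per_month": [0.5, 1.25],
})

# Parse the review date explicitly so it is not left as plain text.
sample["last_review"] = pd.to_datetime(sample["last_review"])

# Split the columns by inferred type.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
date_cols = sample.select_dtypes(include="datetime").columns.tolist()
other_cols = sample.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print("numeric:", numeric_cols)
print("dates:", date_cols)
print("categorical/text:", other_cols)
```

Running the same three `select_dtypes` calls on `df_nyc` after loading would show how each of the 16 columns is typed in practice.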

## **Imports**
Libraries and data are imported in this section.

### Libraries Imports
*Run this first in each new kernel session*

All of the libraries needed for our machine learning pipeline are imported in this section.


In [None]:
# Install mapclassify before the imports (optional geopandas dependency for classified maps)
!pip install mapclassify

# Libraries Import for Data Preparation and Visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import sqlite3
import csv
import pickle



In [None]:
# AI is used to assist us with the code.
# ChatGPT was used to help us check whether we had imported the libraries
# needed to run and test the models correctly.
# We also asked AI about import syntax, since we were unsure about the exact
# names of some libraries.

# Library for splitting the dataset and tuning hyperparameters
from sklearn.model_selection import train_test_split, GridSearchCV
# Libraries for Classification Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# Library for model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay, classification_report,
)

In [None]:
# Library for encoding and normalisation
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler

### Importing data
Run this first in each new kernel session
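
The cell below downloads a SQLite database (`AB_NYC_2019.db`) alongside the CSV. Its table names are not stated in this section, so a reasonable first step after the download is to list them. The sketch below shows this on an in-memory database. The `list_tables` helper and the `listings` table are hypothetical names used only for illustration.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# In-memory database standing in for the downloaded file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, price INTEGER)")
conn.execute("INSERT INTO listings (id, price) VALUES (1, 100), (2, 250)")

tables = list_tables(conn)
row_count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
conn.close()

print(tables, row_count)
```

For the real file, the connection would be opened with `sqlite3.connect('AB_NYC_2019.db')` once the download has finished, and a table found this way could then be read with `pd.read_sql_query`.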

In [None]:
# AI is used to generate the code.
# ChatGPT was used to help us streamline the data import process. The aim is
# to have the dataset ready each time we start a new kernel session, without
# having to import it manually every time.
# This code is also inspired by a previous Machine Learning project of one of
# our team members.

# Dataset download
file_id_dataset = '18ZTZy0aOD63QZ7L0DGVwKoy1_kTIuWpL' # AI generated
download_url_dataset = f'https://drive.google.com/uc?id={file_id_dataset}&export=download' # AI generated
# Database download
file_id_db = '1uHykw8z1B0JQr0CywWpN4gwev4ETlAPZ'
download_url_db = f'https://drive.google.com/uc?id={file_id_db}&export=download'

# Download the file
!wget -O AB_NYC_2019.csv '{download_url_dataset}' # AI generated
!wget -O AB_NYC_2019.db '{download_url_db}'

# Load the dataset
df_nyc = pd.read_csv('AB_NYC_2019.csv')

--2025-01-15 16:23:55--  https://drive.google.com/uc?id=18ZTZy0aOD63QZ7L0DGVwKoy1_kTIuWpL&export=download
Resolving drive.google.com (drive.google.com)... 64.233.181.113, 64.233.181.102, 64.233.181.139, ...
Connecting to drive.google.com (drive.google.com)|64.233.181.113|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=18ZTZy0aOD63QZ7L0DGVwKoy1_kTIuWpL&export=download [following]
--2025-01-15 16:23:55--  https://drive.usercontent.google.com/download?id=18ZTZy0aOD63QZ7L0DGVwKoy1_kTIuWpL&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 209.85.200.132, 2607:f8b0:4001:c08::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|209.85.200.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7077973 (6.8M) [application/octet-stream]
Saving to: ‘AB_NYC_2019.csv’


2025-01-15 16:24:01 (33.6 MB/s) - ‘AB_NYC_2019.csv’ saved [707