# Exploring Airbnb Listings: Toronto vs. Vancouver (2022) - A Data Analysis Project Using the CRISP-DM Framework
____

## Project Overview
This project compares 2022 Airbnb listings in Toronto and Vancouver using the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework. The framework provides a structured approach to data analysis by decomposing the project into phases, each with its own set of tasks and objectives. Following it helps keep the analysis comprehensive, rigorous, and methodical.

## CRISP-DM Framework
The CRISP-DM framework consists of six phases:

1. [Business Understanding](#section-1-business-understanding)
2. [Data Understanding](#section-2-data-understanding)
3. Data Preparation
4. Modelling
5. Evaluation
6. Deployment



### Section 1: Business Understanding
---

Airbnb is a popular online platform that lets guests rent accommodations and gives hosts a source of extra income. The company has disrupted the traditional hospitality model of hotels and motels. As a result, Airbnb has become a popular choice for accommodation, with 1M+ listings and bookings made on the platform.

Toronto and Vancouver are two major Canadian cities and tech hubs that attract a large number of visitors annually for business, tourism, education, and immigration. Airbnb has become one of the most sought-after accommodation choices for many international travelers because of the platform's convenience, affordability, and unique experiences. For hosts, Airbnb offers an opportunity to earn additional income by renting out their properties. Understanding the Airbnb market in these two cities is therefore critical for both parties.


#### Business Questions:
**1. Are Airbnb listings in Toronto more expensive than those in Vancouver?**
- **H<sub>0</sub>:** There is no significant difference in listing prices between Toronto and Vancouver.
- **H<sub>A</sub>:** Listings in Toronto are more expensive than those in Vancouver. <br></br>

**2. Are there more Airbnb listings available in Vancouver than Toronto?**
- **H<sub>0</sub>:** There is no significant difference in the number of listings between Vancouver and Toronto.
- **H<sub>A</sub>:** Vancouver has more listings available than Toronto. <br></br>
  
**3. Does the number of bedrooms significantly affect the listing price in both cities?**
- **H<sub>0</sub>:** The number of bedrooms does not significantly affect the listing price in both cities.
- **H<sub>A</sub>:** The number of bedrooms significantly affects the listing price in both cities. <br></br>


**4. Are there any differences in the types of properties listed in Toronto vs. Vancouver?**
- **H<sub>0</sub>:** The distribution of property types is the same between Toronto and Vancouver.
- **H<sub>1</sub>:** There are significant differences in the distribution of property types between Toronto and Vancouver. <br></br>

**5. Does being a superhost significantly affect the number of reviews a listing receives in both cities?**
- **H<sub>0</sub>:** Being a superhost does not significantly affect the number of reviews a listing receives in both cities.
- **H<sub>1</sub>:** Being a superhost significantly affects the number of reviews a listing receives in both cities. <br></br>

**6. Does the average review score differ significantly between Toronto and Vancouver?**
- **H<sub>0</sub>:** There is no significant difference in the average review score between Toronto and Vancouver.
- **H<sub>1</sub>:** The average review score differs significantly between Toronto and Vancouver. <br></br>
  
**7. Is there a relationship between the number of reviews a listing has and its occupancy rate in both cities?**
- **H<sub>0</sub>:** There is no significant relationship between the number of reviews a listing has and its occupancy rates in both cities.
- **H<sub>1</sub>:** The number of reviews a listing has is significantly related to its occupancy rate in both cities.<br></br>

**8. Do instant bookable listings receive more bookings than non-instant bookable listings in both cities?**
- **H<sub>0</sub>:** There is no significant difference in the booking rate between instant bookable and non-instant bookable listings in both cities
- **H<sub>1</sub>:** Instant bookable listings receive more bookings than non-instant bookable listings in both cities <br></br>

**9. Does the proximity of a listing to popular tourist attractions affect its occupancy rates in both cities?**
- **H<sub>0</sub>:** There is no significant relationship between the proximity of a listing to popular tourist attractions and its occupancy rate in both cities.
- **H<sub>1</sub>:** The proximity of a listing to popular tourist attractions is significantly related to its occupancy rate in both cities. <br></br>

**10. Does the availability of certain amenities, such as Wi-Fi or parking, significantly affect the listing price in both cities?**
- **H<sub>0</sub>:** The availability of certain amenities does not significantly affect the listing prices in both cities
- **H<sub>1</sub>:** The availability of certain amenities significantly affects the listing price in both cities.
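
As an illustration of how these hypotheses could be tested, the sketch below applies a one-sided Welch's t-test to question 1 using `scipy.stats.ttest_ind`. The choice of Welch's test and the 0.05 significance level are assumptions, not decisions stated in this project, and the price lists are made-up placeholder values rather than figures from the datasets.

```python
from scipy.stats import ttest_ind

# Hypothetical nightly prices; placeholders, NOT values from the datasets.
toronto_prices = [120.0, 95.0, 180.0, 210.0, 150.0, 130.0, 99.0, 175.0]
vancouver_prices = [110.0, 90.0, 160.0, 140.0, 125.0, 105.0, 98.0, 150.0]

ALPHA = 0.05  # assumed significance level

# equal_var=False selects Welch's t-test, which does not assume equal variances.
# alternative="greater" tests H_A: mean(Toronto) > mean(Vancouver).
result = ttest_ind(
    toronto_prices, vancouver_prices, equal_var=False, alternative="greater"
)
t_stat, p_value = result.statistic, result.pvalue

reject_null = p_value < ALPHA
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

Listing prices are often right-skewed, so a rank-based test or a log transform may be worth considering once the real distributions have been inspected.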


---

#### Objective:
The primary objective of this project is to provide insights that hosts and guests can use to inform their decision-making and pricing strategies on the Airbnb platform. By analyzing data on listed properties and bookings, we can examine trends and patterns that reveal the needs of hosts and guests and the factors that contribute to a positive experience for both parties. Based on these insights, we can make recommendations to improve the Airbnb experience, with the ultimate goal of increasing loyalty and customer satisfaction.





## Section 2: Data Understanding
Data understanding is the second phase of the CRISP-DM process, following business understanding. In this phase, we examine the data used in the Airbnb data analysis project.

### Collect Initial Data
The data for this analysis comes from two separate CSV files containing Airbnb listings data for Toronto and Vancouver. The files were downloaded from [Inside Airbnb](http://insideairbnb.com/get-the-data/) on March 9, 2023.

### Describe Data 
To begin the analysis, we collect and describe the data. This means importing the required Python libraries (pandas and NumPy) and reading the CSV files that contain the Airbnb listing data for Toronto and Vancouver.

We have two datasets, one for Toronto and one for Vancouver. Each dataset contains 75 columns with a mix of data types (integers, floats, and objects). The datasets describe listing characteristics such as host identity, location, number of bedrooms and bathrooms, reviews, and price.

**The dataset contains the following columns:**
| Column Name                  | Description                                                                                        |
|------------------------------|----------------------------------------------------------------------------------------------------|
| id                           | Unique identifier for the listing                                                                  |
| listing_url                  | URL of the listing page                                                                            |
| scrape_id                    | Unique identifier for the data scraping process                                                    |
| last_scraped                 | Date on which the data was last scraped                                                            |
| source                       | Source of the data                                                                                 |
| name                         | Name of the listing                                                                                |
| description                  | Description of the listing                                                                         |
| neighborhood_overview        | Description of the neighborhood where the listing is located                                       |
| picture_url                  | URL of the picture for the listing                                                                 |
| host_id                      | Unique identifier for the host                                                                     |
| host_url                     | URL of the host page                                                                               |
| host_name                    | Name of the host                                                                                   |
| host_since                   | Date on which the host joined Airbnb                                                               |
| host_location                | Location of the host                                                                               |
| host_about                   | Description of the host                                                                            |
| host_response_time           | Time it takes for the host to respond to a message                                                 |
| host_response_rate           | Percentage of messages that the host responds to                                                   |
| host_acceptance_rate         | Percentage of booking requests that the host accepts                                               |
| host_is_superhost            | Whether or not the host has achieved superhost status                                              |
| host_thumbnail_url           | URL of the host's thumbnail picture                                                                |
| host_picture_url             | URL of the host's picture                                                                          |
| host_neighbourhood           | Neighborhood where the host is located                                                             |
| host_listings_count          | Number of listings that the host has                                                               |
| host_total_listings_count    | Total number of listings that the host has                                                         |
| host_verifications           | Methods that the host has used to verify their identity                                            |
| host_has_profile_pic         | Whether or not the host has a profile picture                                                      |
| host_identity_verified       | Whether or not the host has verified their identity                                                |
| neighbourhood                | Neighborhood where the listing is located                                                          |
| neighbourhood_cleansed       | Neighborhood where the listing is located (cleaned)                                                |
| neighbourhood_group_cleansed | Group of neighborhoods where the listing is located (cleaned)                                      |
| latitude                     | Latitude of the listing                                                                            |
| longitude                    | Longitude of the listing                                                                           |
| property_type                | Type of property (e.g. apartment, house, etc.)                                                     |
| room_type                    | Type of room (e.g. entire home, private room, etc.)                                                |
| accommodates                 | Maximum number of guests that the listing can accommodate                                          |
| bathrooms                    | Number of bathrooms                                                                                |
| bathrooms_text               | Description of the bathrooms                                                                       |
| bedrooms                     | Number of bedrooms                                                                                 |
| beds                         | Number of beds                                                                                     |
| amenities                    | List of amenities provided with the listing                                                        |
| price                        | Nightly price of the listing                                                                       |
| minimum_nights               | Minimum number of nights that can be booked                                                        |
| maximum_nights               | Maximum number of nights that can be booked                                                        |
| minimum_minimum_nights       | Smallest minimum_nights value in the calendar over the next 365 days                               |
| maximum_minimum_nights       | Largest minimum_nights value in the calendar over the next 365 days                                |
| minimum_maximum_nights       | Smallest maximum_nights value in the calendar over the next 365 days                               |
| maximum_maximum_nights       | Largest maximum_nights value in the calendar over the next 365 days                                |
| minimum_nights_avg_ntm       | Average minimum_nights value in the calendar over the next 365 days                                |
| maximum_nights_avg_ntm       | Average maximum_nights value in the calendar over the next 365 days                                |
| calendar_updated             | Date on which the calendar was last updated                                                        |
| has_availability             | Whether or not the calendar is updated for the next 12 months                                      |
| availability_30              | Number of available nights within the next 30 days                                                 |
| availability_60              | Number of available nights within the next 60 days                                                 |
| availability_90              | Number of available nights within the next 90 days                                                 |
| availability_365             | Number of available nights within the next 365 days                                                |
| calendar_last_scraped        | Date on which the calendar was last scraped                                                        |
| number_of_reviews            | Number of reviews for the listing                                                                  |
| number_of_reviews_ltm        | Number of reviews for the listing within the last 12 months                                        |
| first_review                 | Date of the first review for the listing                                                           |
| last_review                  | Date of the most recent review for the listing                                                     |
| review_scores_rating         | Overall rating for the listing based on reviews                                                    |
| review_scores_accuracy       | Rating for accuracy based on reviews                                                               |
| review_scores_cleanliness    | Rating for cleanliness based on reviews                                                            |
| review_scores_checkin        | Rating for check-in based on reviews                                                               |
| review_scores_communication  | Rating for communication based on reviews                                                          |
| review_scores_location       | Rating for location based on reviews                                                               |
| review_scores_value          | Rating for value based on reviews                                                                  |
| license                      | License number of the listing (if applicable)                                                      |
| instant_bookable             | Whether or not the listing can be instantly booked                                                 |
| calculated_host_listings_count | Number of listings that the host has                                                             |
| calculated_host_listings_count_entire_homes | Number of entire homes that the host has                                            |
| calculated_host_listings_count_private_rooms | Number of private rooms that the host has                                          |
| calculated_host_listings_count_shared_rooms | Number of shared rooms that the host has                                            |
| reviews_per_month            | Number of reviews per month                                                                        |
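
In Inside Airbnb exports, the `price` column is typically stored as text such as `$1,250.00` rather than as a number, so it usually needs converting before any price comparison. The sketch below assumes that text format, which should be verified against the actual files. The sample frame and the `price_num` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking the ASSUMED text format of the `price` column.
sample_df = pd.DataFrame({"price": ["$85.00", "$1,250.00", None]})

# Strip the currency symbol and thousands separators, then convert to float.
# errors="coerce" turns anything unparseable (including missing values) into NaN.
sample_df["price_num"] = pd.to_numeric(
    sample_df["price"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(sample_df)
```

Keeping the original `price` column alongside the numeric one makes it easy to spot-check rows that were coerced to NaN.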


```python
# Import the required Python libraries
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# from scipy.stats import ttest_ind

# Read the CSV files that contain the Airbnb listings for Toronto and Vancouver
toronto_df = pd.read_csv('airbnb_toronto_listings.csv')
vancouver_df = pd.read_csv('airbnb_vancouver_listings.csv')

# Preview the first 5 rows of each dataframe
print("Toronto Airbnb (2022) Listings:")
print(toronto_df.head())

print("\nVancouver Airbnb (2022) Listings:")
print(vancouver_df.head())
```

The code above imports the pandas and NumPy libraries and reads the CSV files for Toronto and Vancouver using the `read_csv()` function. It then previews the first 5 rows of each dataframe using the `head()` method. Previewing the data has several advantages:

- Validates that the data has been loaded into the dataframe properly
- Gives a quick overview of the data (e.g., number of columns, data types)
- Helps identify errors (e.g., missing values and incorrect data types)
- Shows whether data cleaning or preprocessing is required before analysis
- Helps us understand the structure and format of the data ahead of exploratory data analysis and modelling
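
Once both files load correctly, one convenient structure for the later city-to-city comparisons is a single frame with a `city` label. The following is a minimal sketch of that idea, not a step taken in this project. The small stand-in frames and the `city` column name are assumptions, and the real `toronto_df` and `vancouver_df` would be used in their place.

```python
import pandas as pd

# Stand-in frames; replace with toronto_df and vancouver_df.
toronto_demo = pd.DataFrame({"id": [1, 2], "bedrooms": [1.0, 2.0]})
vancouver_demo = pd.DataFrame({"id": [3], "bedrooms": [3.0]})

# Label each frame with its city, then stack them into one frame.
listings = pd.concat(
    [toronto_demo.assign(city="Toronto"), vancouver_demo.assign(city="Vancouver")],
    ignore_index=True,
)

# Number of listings per city.
print(listings["city"].value_counts())
```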

```python
# Print the data types of each column in each dataframe
print("Toronto Airbnb (2022) Data Types:")
print(toronto_df.dtypes)

print("\nVancouver Airbnb (2022) Data Types:")
print(vancouver_df.dtypes)

# Print the number of unique values in each column of each dataframe
print("Toronto Airbnb (2022) Unique Values:")
print(toronto_df.nunique())

print("\nVancouver Airbnb (2022) Unique Values:")
print(vancouver_df.nunique())
```

```
Toronto Airbnb (2022) Data Types:
id                                              float64
listing_url                                      object
scrape_id                                       float64
last_scraped                                     object
source                                           object
                                                 ...   
calculated_host_listings_count                    int64
calculated_host_listings_count_entire_homes       int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               float64
Length: 75, dtype: object

Vancouver Airbnb (2022) Data Types:
id                                                int64
listing_url                                      object
scrape_id                                         int64
last_scraped                                     object
source                                           object
```
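
Because `print(df.dtypes)` truncates long output, mismatches between the two frames are easy to miss. The sketch below shows one way to list only the shared columns whose dtypes differ. The helper name `dtype_mismatches` and the small stand-in frames are hypothetical, and the real `toronto_df` and `vancouver_df` would be passed instead.

```python
import pandas as pd

def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the shared columns whose dtypes differ between two frames."""
    shared = left.columns.intersection(right.columns)
    comparison = pd.DataFrame({
        "left_dtype": left.dtypes[shared].astype(str),
        "right_dtype": right.dtypes[shared].astype(str),
    })
    return comparison[comparison["left_dtype"] != comparison["right_dtype"]]

# Stand-in frames; replace with toronto_df and vancouver_df.
toronto_demo = pd.DataFrame({"id": [1.0, 2.0], "name": ["a", "b"]})
vancouver_demo = pd.DataFrame({"id": [1, 2], "name": ["c", "d"]})

mismatches = dtype_mismatches(toronto_demo, vancouver_demo)
print(mismatches)
```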

```python
# Print the number of missing values in each column of each dataframe
print("Toronto Airbnb (2022) Missing Values:")
print(toronto_df.isnull().sum())

print("\nVancouver Airbnb (2022) Missing Values:")
print(vancouver_df.isnull().sum())
```

```
Toronto Airbnb (2022) Missing Values:
id                                                 0
listing_url                                        0
scrape_id                                          0
last_scraped                                       0
source                                             0
                                                ... 
calculated_host_listings_count                     0
calculated_host_listings_count_entire_homes        0
calculated_host_listings_count_private_rooms       0
calculated_host_listings_count_shared_rooms        0
reviews_per_month                               4157
Length: 75, dtype: int64

Vancouver Airbnb (2022) Missing Values:
id                                                0
listing_url                                       0
scrape_id                                         0
last_scraped                                      0
source                                            0
                                               ... 
c
```

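The code above displays the number of missing values per column in the Toronto and Vancouver Airbnb (2022) datasets. Both datasets have 75 columns. Because the printed output is truncated, only the first and last five columns are visible, and of these only `reviews_per_month` has missing values. Other columns also contain nulls: the `.info()` summary for Toronto further down reports 16,759 non-null `name` values, 16,501 non-null `description` values, and 9,723 non-null `neighborhood_overview` values out of 16,761 rows.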

The **Toronto dataset has 4157 missing values** for the "reviews_per_month" column, while the **Vancouver dataset has 978 missing values** for this column.
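
Since the default printout hides most of the 75 columns, it can help to list only the columns that actually contain missing values, together with their share of rows. The sketch below shows one way to do this. The helper name `missing_summary` and the stand-in frame are hypothetical, and the real `toronto_df` or `vancouver_df` would be passed instead.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Stand-in frame; replace with toronto_df or vancouver_df.
demo_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.5, None, None, 1.2],
})

demo_summary = missing_summary(demo_df)
print(demo_summary)
```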

```python
# Quick summary: info() prints its report directly and returns None,
# so it is called without print()
print("Toronto Airbnb Listings (2022) Summary:")
toronto_df.info()

print("Vancouver Airbnb Listings (2022) Summary:")
vancouver_df.info()

print("Toronto Airbnb Listings (2022) Rows & Columns:")
print(toronto_df.shape)

print("Vancouver Airbnb Listings (2022) Rows & Columns:")
print(vancouver_df.shape)
```

```
Toronto Airbnb Listings (2022) Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16761 entries, 0 to 16760
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            16761 non-null  float64
 1   listing_url                                   16761 non-null  object 
 2   scrape_id                                     16761 non-null  float64
 3   last_scraped                                  16761 non-null  object 
 4   source                                        16761 non-null  object 
 5   name                                          16759 non-null  object 
 6   description                                   16501 non-null  object 
 7   neighborhood_overview                         9723 non-null   object 
 8   picture_url                                   16761 non-null  object 
 9   host_id              
```
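
The summary above shows Toronto's `id` column stored as `float64`. One possible consequence worth checking is loss of precision: `float64` represents integers exactly only up to 2**53, so long identifiers can collapse into the same value. Whether this affects the Toronto file depends on how its ids are stored, which is not shown here. The sketch below uses two made-up 18-digit ids and reads them as text (`dtype={"id": "string"}`) to keep every digit.

```python
import io
import pandas as pd

# Two made-up 18-digit ids that differ only in the last digits.
csv_text = "id,name\n576460752303423489,Loft\n576460752303423490,Condo\n"

# Reading `id` as text keeps every digit.
listings = pd.read_csv(io.StringIO(csv_text), dtype={"id": "string"})

# Converting the same ids to floats collapses them into one value.
ids_as_float = [float(value) for value in listings["id"]]

print(listings["id"].nunique())  # distinct ids when kept as text
print(len(set(ids_as_float)))    # distinct ids after conversion to float
```

The same `dtype` argument can be passed to the `pd.read_csv` calls for the real files if the ids turn out to be affected.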

### Steps Taken to Understand Data

1. `.head()` method - previews the first 5 rows of each dataset.
``` python
print(toronto_df.head())
print(vancouver_df.head())
```

2. `.info()` method - revealed that both datasets have 75 columns and that the `id` column has dtype `float64` in the Toronto dataset but `int64` in the Vancouver dataset.
```python
# info() prints its report directly and returns None, so no print() is needed
print("Toronto Airbnb Listings (2022) Summary:")
toronto_df.info()

print("Vancouver Airbnb Listings (2022) Summary:")
vancouver_df.info()
```

3. `.isnull().sum()` method - revealed Toronto dataset has 