# INFO 2950 Group Project


In [2]:
import pandas as pd
import numpy as np
import time

import seaborn
from matplotlib import pyplot
import duckdb, sqlalchemy

from sklearn.linear_model import LogisticRegression


In [3]:
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

## Research Question

Do the property characteristics (location, property type, and price) or the host characteristics (host response time, host response rate, and cancellation policy) matter more to the review score of an Airbnb listing in New York City?


## Data collection and cleaning

In [4]:
raw_data_df = pd.read_csv('New_York.csv')
print(raw_data_df.shape)
print(raw_data_df.size)
raw_data_df.head()

(44317, 31)
1373827


Unnamed: 0,id,host_response_time,host_response_rate,host_is_superhost,host_has_profile_pic,neighbourhood_cleansed,latitude,longitude,is_location_exact,property_type,...,maximum_nights,calendar_updated,availability_30,number_of_reviews,review_scores_rating,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,reviews_per_month
0,18461891,,,f,t,Ditmars Steinway,40.774142,-73.916246,t,Apartment,...,6,5 months ago,0,0,,f,f,strict,f,
1,20702398,within an hour,100%,f,t,City Island,40.849191,-73.786509,f,House,...,21,2 weeks ago,19,2,100.0,f,f,moderate,f,2.0
2,6627449,within an hour,100%,f,t,City Island,40.849775,-73.786609,t,Apartment,...,21,2 weeks ago,28,21,95.0,f,f,strict,f,0.77
3,19949243,within a few hours,100%,f,t,City Island,40.848838,-73.782276,f,Boat,...,1125,6 days ago,30,0,,t,f,strict,f,
4,1886820,,,f,t,City Island,40.841144,-73.783052,t,House,...,90,16 months ago,30,0,,f,f,strict,f,


The dataset was downloaded from Kaggle (URL: https://www.kaggle.com/datasets/ivanovskia1/nyc-airbnb-rental-data-october-2017/versions/1?resource=download). We use the NYC Airbnb rental data from October 2017 as our sample to analyze whether property characteristics or host characteristics play a more important role in the review score of an Airbnb listing in NYC. The raw dataframe has 44317 rows and 31 columns, for a total of 1373827 cells (rows × columns). Many of these columns are irrelevant to our question, so we need to clean the data first.

In [5]:
raw_data_df.isnull().sum()

id                                   0
host_response_time               13679
host_response_rate               13679
host_is_superhost                  232
host_has_profile_pic               232
neighbourhood_cleansed               0
latitude                             0
longitude                            0
is_location_exact                    0
property_type                        0
room_type                            0
accommodates                         0
bathrooms                          144
bedrooms                            73
beds                                91
bed_type                             0
amenities                            0
square_feet                      43768
price                                0
guests_included                      0
minimum_nights                       0
maximum_nights                       0
calendar_updated                     0
availability_30                      0
number_of_reviews                    0
review_scores_rating     

Our research question concerns property characteristics and host characteristics, so we drop columns that are irrelevant, such as id and calendar_updated. Some characteristics, like longitude and latitude, are less meaningful to a general audience than a neighborhood name, so we disregard them as well. The square footage of the listing would be a good property characteristic to include, but the null check shows that 43768 listings, about 98.8% of the total, lack this information, so we drop it too. Given our research question, we use location, property type, and price as property characteristics, and host response time, host response rate, and cancellation policy as host characteristics. We use the review scores rating as our output variable. All other columns are removed.
(I can talk more about why we remove certain columns - Linda)
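The column check described above can be sketched on a toy dataframe. This is a minimal illustration, not the cleaning code used in this notebook: `toy_df`, its values, and the `MAX_MISSING_SHARE` cutoff are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "price": [110, 50, 125, 100],
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "host_response_rate": [np.nan, "100%", "100%", "100%"],
})

# Share of missing values per column (the mean of a True/False mask).
missing_share = toy_df.isnull().mean()

# Hypothetical cutoff: drop any column that is more than half empty.
MAX_MISSING_SHARE = 0.5
too_sparse = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()
trimmed_df = toy_df.drop(columns=too_sparse)

print(missing_share)
print("Dropped:", too_sparse)
```

Looking at shares rather than raw counts makes it easier to compare columns and to justify a cutoff to a reader.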

In [6]:
%sql data_df << SELECT neighbourhood_cleansed AS location, property_type, price, host_response_time, host_response_rate, cancellation_policy, review_scores_rating FROM raw_data_df
print(data_df.shape)
print(data_df.size)
data_df.head()

Returning data to local variable data_df
(44317, 7)
310219


Unnamed: 0,location,property_type,price,host_response_time,host_response_rate,cancellation_policy,review_scores_rating
0,Ditmars Steinway,Apartment,110,,,strict,
1,City Island,House,50,within an hour,100%,moderate,100.0
2,City Island,Apartment,125,within an hour,100%,strict,95.0
3,City Island,Boat,100,within a few hours,100%,strict,
4,City Island,House,300,,,strict,


We need to clean the dataset further because host response time, host response rate, and review score rating contain NaN values. We remove every row with missing data. We prefer dropping incomplete rows to filling in values because about 31% of the rows lack host response time and host response rate, and about 22% lack a review score. Filling that many missing values with zeros would strongly bias our analysis. Our raw dataset is very large, so enough rows remain after removal for a reliable and consistent analysis.
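To make the bias argument concrete, here is a small sketch on made-up ratings that compares zero-filling with dropping incomplete rows. The dataframe and its values are assumptions for illustration only. The `dropna(subset=...)` call is one pandas way to express this kind of row filter.

```python
import numpy as np
import pandas as pd

# Made-up data; NaN/None marks a value that is missing for that listing.
toy_df = pd.DataFrame({
    "host_response_time": ["within an hour", None, "within an hour", "within a few hours"],
    "review_scores_rating": [100.0, np.nan, 95.0, np.nan],
})

# Option A: fill missing ratings with zero, which treats "no review" as "worst score".
mean_zero_filled = toy_df["review_scores_rating"].fillna(0).mean()

# Option B: keep only rows where every listed column is present.
complete_df = toy_df.dropna(subset=["host_response_time", "review_scores_rating"])
mean_complete = complete_df["review_scores_rating"].mean()

print(mean_zero_filled)  # 48.75
print(mean_complete)     # 97.5
```

In this toy case, zero-filling drags the average rating from 97.5 down to 48.75 even though no guest gave a low score.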

In [7]:
%sql airbnb_df << SELECT * FROM data_df WHERE review_scores_rating IS NOT NULL AND host_response_time IS NOT NULL AND host_response_rate IS NOT NULL
print(airbnb_df.shape)
print(airbnb_df.size)
airbnb_df.head()

Returning data to local variable airbnb_df
(26620, 7)
186340


Unnamed: 0,location,property_type,price,host_response_time,host_response_rate,cancellation_policy,review_scores_rating
0,City Island,House,50,within an hour,100%,moderate,100.0
1,City Island,Apartment,125,within an hour,100%,strict,95.0
2,City Island,House,69,within an hour,100%,moderate,97.0
3,City Island,Apartment,150,within an hour,100%,flexible,100.0
4,City Island,House,101,within an hour,100%,moderate,100.0


In [8]:
airbnb_df.isnull().sum()

location                0
property_type           0
price                   0
host_response_time      0
host_response_rate      0
cancellation_policy     0
review_scores_rating    0
dtype: int64

At this point, our data is of good quality. The cleaned dataframe has 26620 rows and 7 columns, for a total of 186340 cells. None of the values are missing. We are ready to explore the dataset further. The cleaned data is exported as cleaned_airbnb_data.csv.

In [9]:
airbnb_df.to_csv('cleaned_airbnb_data.csv')
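One detail worth knowing when the exported file is read back: by default `to_csv` also writes the row index, which `read_csv` then loads as a column named `Unnamed: 0`. The sketch below shows the difference on a one-row toy dataframe, using in-memory buffers instead of real files. Whether to pass `index=False` in this project is a judgment call.

```python
import io
import pandas as pd

# One-row toy dataframe; values are illustrative.
toy_df = pd.DataFrame({"location": ["City Island"], "price": [50]})

# Default export: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the index out, so the round trip keeps the same columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```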

## Data description

Our data source is https://www.kaggle.com/datasets/ivanovskia1/nyc-airbnb-rental-data-october-2017?resource=download. Our cleaned data contains 26620 listings, and the data was collected in New York, NY in October 2017.

The data describe hosts, locations, and properties, which we use to make predictions and draw conclusions. Specifically, we narrowed the data down to the host attributes host response time, host response rate, and cancellation policy, and to the property characteristics location, property type, and price. All of these variables are used to answer our research question of whether host characteristics or property characteristics influence Airbnb review scores more.

In [10]:
variable_types = airbnb_df.dtypes
print('Here are the types of each of the variables:')
print(variable_types)


Here are the types of each of the variables:
location                 object
property_type            object
price                     int64
host_response_time       object
host_response_rate       object
cancellation_policy      object
review_scores_rating    float64
dtype: object


The descriptions of each of the variables are: 

location: the New York neighborhood in which the property is located

property_type: the type of Airbnb being rented, for example a house, apartment, or boat

price: represents the nightly rate of the property

host_response_time: the average time a host takes to respond

host_response_rate: the percentage of times a host responds to a booking request

cancellation_policy: whether the host's policy is strict, moderate, or flexible 

review_scores_rating: the Airbnb rating on a scale of 0-100
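The dtypes listing shows host_response_rate and host_response_time stored as `object` (text). A possible way to make them usable in a numeric analysis is sketched below on toy values. This is not code from the notebook: the new column name `host_response_rate_pct` is hypothetical, and the last two labels in `RESPONSE_ORDER` are assumptions, since only "within an hour" and "within a few hours" appear in the rows displayed earlier.

```python
import pandas as pd

# Toy values in the same format as the displayed rows.
toy_df = pd.DataFrame({
    "host_response_rate": ["100%", "90%", "75%"],
    "host_response_time": ["within an hour", "within a few hours", "within an hour"],
})

# Strip the trailing percent sign and convert to a number between 0 and 100.
toy_df["host_response_rate_pct"] = (
    toy_df["host_response_rate"].str.rstrip("%").astype(float)
)

# Assumed ordering from fastest to slowest response; the last two labels
# are assumptions about the data, not values shown in this notebook.
RESPONSE_ORDER = [
    "within an hour",
    "within a few hours",
    "within a day",
    "a few days or more",
]
toy_df["host_response_time"] = pd.Categorical(
    toy_df["host_response_time"], categories=RESPONSE_ORDER, ordered=True
)

print(toy_df.dtypes)
print(toy_df["host_response_time"].cat.codes.tolist())
```

An ordered categorical keeps the readable labels while giving each one an integer code that reflects its rank.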

## Data limitations

## Exploratory data analysis

## Questions for reviewers