In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Research Question: Airbnb Prices in NYC 
New York City is known for its five boroughs, which differ sharply in people, culture, and income. Do Airbnb listings reflect these differences? Do prices change from borough to borough, and how significant are those changes? Can we also accurately predict prices and make suggestions for listings, just as Airbnb does? We explore these questions below in our Data 301 Final Project, which is broken up into 3 parts: EDA, Machine Learning, and Final Conclusion.

## Imports for Data Collection

In [0]:
import pandas as pd
import numpy as np
import sys
sys.setrecursionlimit(10000000)
import json
from pandas import json_normalize  # top-level import (pandas >= 1.0); pandas.io.json.json_normalize is deprecated
import sklearn 
import warnings # the below imports were to stop displaying warnings for aesthetic appeal
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning) 

# About Our Data: 

1. Our main CSV file with 50,000 listings: Inside Airbnb (http://insideairbnb.com/), a website that hosts publicly available information about Airbnb listings all over the world.
2. Our JSON file for NY subways: NYC Open Data (https://opendata.cityofnewyork.us/), which offers geospatial data for making maps. We used it to get the locations of subway stations.
3. A PNG image for our map.
4. A CSV file that maps each neighborhood to its borough.

# Cleaning Our Data: 

Reading in Our Main CSV File

In [3]:
df_bnb = pd.read_csv("listings.csv", on_bad_lines="skip", encoding="utf8")  # skip malformed rows; on_bad_lines replaces error_bad_lines in pandas >= 1.3


Filtering the variables we think are important

In [0]:
df = df_bnb.filter(["id", "name", "summary", "space", "description", "host_id", "host_response_rate",
                     "host_is_superhost", "street", "neighbourhood", "city", "latitude", "longitude", 
                     "property_type", "room_type", "bathrooms", "bedrooms", "beds", "bed_type", "amenities", 
                     "square_feet", "price", "cleaning_fee", "number_of_reviews", "review_scores_rating"]) 

Price, our main prediction variable, is stored as a string. We first filter out bad rows whose price does not start with a dollar sign; for example, some rows had a neighborhood name read into the price column by mistake. We then strip the "$" and the thousands separator "," and convert the result to a float.

In [0]:
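# keep only rows whose price starts with "$" (rows with a missing price are dropped too)
df = df[df.price.str.startswith("$", na=False)].copy()
# price is a string: strip the dollar sign and thousands separator, then convert to float
df["price_int"] = df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)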
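
To make the cleaning step above concrete, here is a minimal sketch on a toy table. The values are made up for illustration and are not taken from the listings file. It passes `regex=False` so that "$" is always treated as a literal character rather than as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Toy data (hypothetical values): two valid prices, one misplaced text value, one missing value
raw = pd.DataFrame({"price": ["$1,250.00", "$85.00", "Harlem", None]})

# Keep only rows whose price starts with "$"; na=False treats missing values as non-matches
valid = raw[raw["price"].str.startswith("$", na=False)].copy()

# Strip the literal "$" and "," characters, then convert the remaining text to float
valid["price_int"] = (
    valid["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(valid)
```

Taking a `.copy()` of the filtered rows before adding a column is one way to avoid pandas' SettingWithCopyWarning instead of silencing it.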

Our data had 200 neighborhoods, but we wanted to group the listings into the 5 boroughs. We did this by joining the neighborhood-to-borough CSV file mentioned above.

In [0]:
joined = pd.read_csv("neighbourhoods.csv")  # another CSV file to join on 
df = df.merge(joined, on=["neighbourhood"])
df = df.rename(columns={"neighbourhood_group": "boroughs"}) # renaming the column
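
One thing worth checking with a join like this: `merge` performs an inner join by default, so any listing whose neighbourhood has no match in the lookup table is dropped without warning. The sketch below, on hypothetical miniature tables, shows one way to see how many rows would be lost by using a left join with `indicator=True`. The table contents are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical miniature versions of the listings table and the lookup table
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood": ["Harlem", "Astoria", "Unknown Place"],
})
lookup = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Queens"],
    "neighbourhood": ["Harlem", "Astoria"],
})

# how="left" keeps every listing; indicator=True adds a "_merge" column
# recording whether each row found a match in the lookup table
merged = listings.merge(lookup, on="neighbourhood", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} listing(s) had no borough match")

# Drop the helper column and rename as in the notebook
merged = merged.drop(columns="_merge").rename(columns={"neighbourhood_group": "boroughs"})
```

With a left join, unmatched listings keep a missing value in `boroughs`, which a later fill step could replace with "other".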

Some listings cost less than 50 dollars a night, and some cost over 10,000 dollars a night. These extremes skewed our data, so we created a new dataframe, df2, that keeps only listings whose "price" is strictly between 50 and 650 dollars a night.

In [0]:
# keep prices strictly between $50 and $650; .copy() lets us add columns to df2 safely later
df2 = df[(df.price_int > 50) & (df.price_int < 650)].copy()
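
The same range filter can also be written with `Series.between`. This is only a sketch on made-up prices; the string form `inclusive="neither"` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical nightly prices, including values at and beyond both cutoffs
prices = pd.DataFrame({"price_int": [30.0, 50.0, 51.0, 200.0, 649.0, 650.0, 10000.0]})

# inclusive="neither" excludes both endpoints, matching "> 50" and "< 650"
in_range = prices[prices["price_int"].between(50, 650, inclusive="neither")].copy()
print(in_range["price_int"].tolist())
```

Listings priced at exactly 50 or 650 dollars are excluded, which matches the strict comparisons used in the notebook.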

The longitude and latitude columns were also stored as objects rather than floats, so we converted them as well. Despite the "_int" suffix, the new columns hold floats.

In [0]:
df2["latitude_int"] = df2["latitude"].astype(float)
df2["longitude_int"] = df2["longitude"].astype(float)
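
`astype(float)` raises an error on the first value it cannot parse. As an alternative sketch, `pd.to_numeric` with `errors="coerce"` turns unparseable values into NaN so that the offending rows can be inspected. The toy values below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical coordinates stored as text, with one bad latitude value
toy = pd.DataFrame({
    "latitude": ["40.7306", "not a number"],
    "longitude": ["-73.9866", "-73.95"],
})

# Convert each column; values that cannot be parsed become NaN instead of raising
for col in ["latitude", "longitude"]:
    toy[col + "_int"] = pd.to_numeric(toy[col], errors="coerce")

# Rows where either coordinate failed to parse
bad_rows = toy[toy[["latitude_int", "longitude_int"]].isna().any(axis=1)]
print(bad_rows)
```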

We also wanted to plot NYC subway stations, so we loaded the JSON file mentioned above. The DataFrame built from that file stored each station's coordinates together in one column as Point(X, Y), so we had to split that column into two separate columns: X (longitude) and Y (latitude).

In [0]:
with open("ny_subways.json") as f:  # close the file once it has been read
    subways = json.load(f)
df_json_subway = json_normalize(subways, "data")
# column 11 holds the station location as text; splitting on whitespace
# yields three pieces: the "POINT" label, "(X" and "Y)"
df_subway = df_json_subway[11].str.split(expand=True)
df_subway["longitude"] = df_subway[1].str.replace("(", "", regex=False)
df_subway["latitude"] = df_subway[2].str.replace(")", "", regex=False)
df_subway = df_subway.drop(columns=[0, 1, 2])  # dropping unnecessary columns
df_subway["latitude_int"] = df_subway["latitude"].astype(float)
df_subway["longitude_int"] = df_subway["longitude"].astype(float)
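
The split-and-strip approach above can also be expressed as a single regular expression. The sketch below assumes each location string looks like `POINT (<longitude> <latitude>)`, which is inferred from how the notebook splits the column; the coordinates are made-up examples.

```python
import pandas as pd

# Hypothetical location strings in the assumed "POINT (X Y)" layout
points = pd.Series([
    "POINT (-73.99107 40.73005)",
    "POINT (-74.00019 40.71880)",
])

# Named groups become column names in the DataFrame returned by str.extract
pattern = r"\(\s*(?P<longitude>-?\d+(?:\.\d+)?)\s+(?P<latitude>-?\d+(?:\.\d+)?)\s*\)"
coords = points.str.extract(pattern).astype(float)
print(coords)
```

Strings that do not match the pattern produce NaN in both columns, which makes malformed rows easy to spot.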

Next, we fill in our missing values: 0 for columns that hold numeric quantities and "other" for categorical text columns.

In [0]:
# 0 for numeric quantities, "other" for categorical text columns;
# one fillna call replaces the per-column inplace calls, which rely on chained assignment
fill_values = {
    "host_response_rate": 0, "bathrooms": 0, "bedrooms": 0, "beds": 0,
    "square_feet": 0, "cleaning_fee": 0, "price": 0,
    "number_of_reviews": 0, "review_scores_rating": 0,
    "property_type": "other", "room_type": "other",
    "bed_type": "other", "boroughs": "other",
}
df2 = df2.fillna(fill_values)
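
As a more general sketch of the same idea, the fill values can be derived from each column's dtype instead of being listed by hand. The toy frame below is hypothetical. Note that in the notebook some numeric quantities, such as price and cleaning fee, are still stored as text at this point, so a dtype-driven fill would only suit them after conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column, each with a gap
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0],
    "room_type": ["Private room", None, "Entire home/apt"],
})

# Pick the fill value from the column dtype: 0 for numbers, "other" for everything else
numeric_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns
fill_values = {**{c: 0 for c in numeric_cols}, **{c: "other" for c in text_cols}}

# fillna with a dict fills each column with its own value and returns a new frame
filled = toy.fillna(fill_values)
print(filled)
```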

In [0]:
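# Work on df2 directly. The earlier version filtered df down to rows with a cleaning fee,
# which also zeroed the response rate of every listing that had no cleaning fee.
has_fee = df2["cleaning_fee"].astype(str).str.startswith("$")
df2["cleaning_fee_int"] = (
    df2.loc[has_fee, "cleaning_fee"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
)
df2["cleaning_fee_int"] = df2["cleaning_fee_int"].fillna(0)  # no "$"-prefixed fee -> 0
df2["price_int"] = df2["price_int"].fillna(0)
rate = df2["host_response_rate"].astype(str).str.replace("%", "", regex=False)
df2["host_response_rate_int"] = pd.to_numeric(rate, errors="coerce").fillna(0)  # missing rate -> 0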