The data cleaning part covers selecting the necessary columns, a few data transformations (e.g. deriving new columns), and dropping unnecessary and mostly empty (NA) columns.<br>
The rest of the feature engineering is done in the next notebook (filling in missing values, dealing with outliers, transforming variables, binning, discretization, one-hot encoding, etc.). Those steps SHOULD BE DONE SEPARATELY FOR THE TRAIN AND TEST SETS to avoid data leakage.
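
To make the leakage point concrete, here is a minimal sketch on made-up data (the `toy` frame, its values and the 75/25 split are assumptions for illustration, not the notebook's actual pipeline). The imputation value is learned on the training rows only and then applied to both sets:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned listings table (values are made up).
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 3.0, 1.0, np.nan, 2.0, 4.0],
    "price_per_person": [30.0, 45.0, 50.0, 60.0, 25.0, 40.0, 55.0, 70.0],
})

# Split first, so that no statistic is ever computed on the test rows.
train = toy.sample(frac=0.75, random_state=0)
test = toy.drop(train.index)

# Learn the imputation value on the training set only ...
bedrooms_median = train["bedrooms"].median()

# ... and apply that same value to both sets.
train = train.assign(bedrooms=train["bedrooms"].fillna(bedrooms_median))
test = test.assign(bedrooms=test["bedrooms"].fillna(bedrooms_median))
```

Filling with a fixed constant (such as 0 for a missing fee) does not depend on the data, so that kind of rule can safely be applied before the split.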

In [1]:
import pandas as pd
import numpy as np

import regex as re
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick 
import matplotlib.dates as mdates
from matplotlib.ticker import PercentFormatter, FuncFormatter
%matplotlib inline
import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
         'axes.labelsize': 'x-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'large',
         'ytick.labelsize':'large'}
pylab.rcParams.update(params)
from cycler import cycler

import seaborn as sns
sns.set()

import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from textacy import preprocessing
import textacy
from nltk.corpus import stopwords
from nltk.stem import *

import spacy
nlp = spacy.load('en_core_web_sm')

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from  sklearn.metrics  import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix
# for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

# for the Q-Q plots
import scipy.stats as stats


# for one hot encoding with feature-engine
from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error
# environment settings
pd.set_option('display.max_column',None)
pd.set_option('display.max_rows',None)

In [2]:
#read data
calendar = pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/calendar.csv')
listings =  pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/listings.csv')
reviews = pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/reviews.csv')

From the previous notebook, in which we initially explored the data, we have defined the following questions to answer:
1. When are the busiest times in Boston? 
2. Which are the busiest neighbourhoods? 
For these two questions I will use descriptive statistics and visualisations. 
To answer them, we will create occupancy metrics from the calendar data, to understand the occupancy levels throughout the year, and then add the neighbourhood information to this data. 
3. What drives the prices of Airbnb listings in Boston? Which are the top variables that affect the price? 
We will build a regression model to predict the price, a continuous variable, and to understand the features that drive the model. 

# 1. Data cleaning

### Check for duplicates, columns with repeated values, drop unnecessary columns

####  Calendar data

In [3]:
calendar.shape

(1308890, 4)

In [4]:
calendar.drop_duplicates(subset=['listing_id','date'],inplace=True)

In [5]:
calendar.shape

(1308525, 4)
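
The two shapes above differ by 1308890 - 1308525 = 365 rows, which is the number of repeated `(listing_id, date)` pairs that were removed. A small sketch of how that count could be checked before dropping, using a made-up `toy_calendar` frame:

```python
import pandas as pd

# Made-up calendar rows; the first two share the same listing_id and date.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2],
    "date": ["2016-09-06", "2016-09-06", "2016-09-07", "2016-09-06"],
    "available": ["t", "t", "f", "t"],
})

key = ["listing_id", "date"]

# duplicated() marks every repeat after the first occurrence of a key.
n_duplicates = int(toy_calendar.duplicated(subset=key).sum())

# drop_duplicates() keeps the first occurrence by default.
deduplicated = toy_calendar.drop_duplicates(subset=key)

print(n_duplicates, deduplicated.shape)
```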

In [6]:
calendar['date'] = pd.to_datetime(calendar['date'])
calendar['month'] = calendar['date'].map(lambda x: x.strftime('%Y-%m'))
#calendar['week'] = calendar['date'].map(lambda x: x.strftime("%V"))

#### Engineering new features - occupancy metrics
- Number of busy days/total number of days throughout the month, split by months
- Number of busy days/total number of days throughout the year
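
The two metrics above can be sketched on a tiny made-up calendar as follows. The `toy_calendar` frame and the helper column `busy` are assumptions for illustration; the cells below reach the same numbers through separate group-bys and a merge.

```python
import pandas as pd

# Made-up calendar: listing 1 has two days in each of two months, listing 2 two days in one month.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2016-09-06", "2016-09-07", "2016-10-01", "2016-10-02",
        "2016-09-06", "2016-09-07",
    ]),
    "available": ["f", "t", "f", "f", "t", "t"],
})
toy_calendar["month"] = toy_calendar["date"].dt.strftime("%Y-%m")

# A day counts as busy when the listing is not available ('f').
toy_calendar["busy"] = toy_calendar["available"].eq("f")

# Monthly metric: busy days / total days per listing and month, in percent.
monthly = (
    toy_calendar.groupby(["listing_id", "month"])
    .agg(total_days=("date", "nunique"), busy_days=("busy", "sum"))
    .reset_index()
)
monthly["occupancy_metrics"] = monthly["busy_days"] / monthly["total_days"] * 100

# Yearly metric: the same ratio over all days of a listing, rounded to whole percent.
yearly = monthly.groupby("listing_id")[["total_days", "busy_days"]].sum().reset_index()
yearly["occupancy_metrics"] = (yearly["busy_days"] / yearly["total_days"] * 100).round(0)
```

Summing a boolean column counts the `True` rows, so listings with no busy days get 0 directly and no fill step is needed in this variant.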

In [7]:
cal_month_avail = calendar.groupby(['listing_id','month','available']).agg({'date':'nunique'}).reset_index()

In [8]:
cal_month_total = cal_month_avail.groupby(['listing_id','month']).agg({'date':'sum'}).reset_index()
cal_month_total.rename({'date':'total_days'},inplace=True,axis=1)

In [9]:
cal_month_busy = cal_month_avail[cal_month_avail['available']=='f'].groupby(['listing_id','month']).agg({'date':'sum'}).reset_index()
cal_month_busy.rename({'date':'busy_days'},inplace=True,axis=1)

In [10]:
cal_month_total = cal_month_total.merge(cal_month_busy, how = 'left', on = ['listing_id','month'])

In [11]:
cal_month_total['busy_days'] = cal_month_total['busy_days'].fillna(0)
cal_month_total['occupancy_metrics'] = cal_month_total['busy_days']/cal_month_total['total_days']*100

In [12]:
cal_metrics_total = cal_month_total.groupby(['listing_id']).agg({'total_days':'sum','busy_days':'sum'}).reset_index()
#cal_metrics_total.sort_values(by='occupancy_metrics',ascending =False).head()

In [13]:
cal_metrics_total['occupancy_metrics'] = round(cal_metrics_total['busy_days']/cal_metrics_total['total_days']*100,0)

In [14]:
cal_metrics_total.head()

Unnamed: 0,listing_id,total_days,busy_days,occupancy_metrics
0,3353,365,116.0,32.0
1,5506,365,21.0,6.0
2,6695,365,41.0,11.0
3,6976,365,46.0,13.0
4,8792,365,117.0,32.0


In [15]:
cal_metrics_total.shape

(3585, 4)

In [16]:
#About 20% of the listings (705 of 3585) are busy (not available) on practically every day of the year: their occupancy rounds to 100%
cal_metrics_total[cal_metrics_total['occupancy_metrics']==100].shape

(705, 4)

In [17]:
cal_metrics_total[cal_metrics_total['occupancy_metrics']<=40].shape

(1660, 4)

In [18]:
calendar.to_pickle('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/processed/calendar.pkl')
cal_month_total.to_pickle('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/processed/calendar_metrics_monthly.pkl')
cal_metrics_total.to_pickle('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/processed/calendar_metrics.pkl')

#### Listings data

In [19]:
listings.shape

(3585, 95)

In [20]:
listings.describe()

Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,latitude,longitude,accommodates,bathrooms,bedrooms,beds,square_feet,guests_included,minimum_nights,maximum_nights,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,jurisdiction_names,calculated_host_listings_count,reviews_per_month
count,3585.0,3585.0,3585.0,3585.0,3585.0,0.0,3585.0,3585.0,3585.0,3571.0,3575.0,3576.0,56.0,3585.0,3585.0,3585.0,0.0,3585.0,3585.0,3585.0,3585.0,3585.0,2772.0,2762.0,2767.0,2765.0,2767.0,2763.0,2764.0,0.0,0.0,3585.0,2829.0
mean,8440875.0,20160910000000.0,24923110.0,58.902371,58.902371,,42.340032,-71.084818,3.041283,1.221647,1.255944,1.60906,858.464286,1.429847,3.171269,28725.84,,8.64993,21.833194,38.558159,179.346444,19.04463,91.916667,9.431571,9.258041,9.646293,9.646549,9.414043,9.168234,,,12.733891,1.970908
std,4500787.0,0.8516813,22927810.0,171.119663,171.119663,,0.024403,0.031565,1.778929,0.501487,0.75306,1.011745,608.87431,1.056787,8.874133,1670136.0,,10.43533,21.860966,33.158272,142.13618,35.571658,9.531686,0.931863,1.168977,0.762753,0.735507,0.903436,1.011116,,,29.415076,2.120561
min,3353.0,20160910000000.0,4240.0,0.0,0.0,,42.235942,-71.171789,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,20.0,2.0,2.0,2.0,4.0,2.0,2.0,,,1.0,0.01
25%,4679319.0,20160910000000.0,6103425.0,1.0,1.0,,42.329995,-71.105083,2.0,1.0,1.0,1.0,415.0,1.0,1.0,365.0,,0.0,0.0,0.0,19.0,1.0,89.0,9.0,9.0,9.0,9.0,9.0,9.0,,,1.0,0.48
50%,8577620.0,20160910000000.0,19281000.0,2.0,2.0,,42.345201,-71.078429,2.0,1.0,1.0,1.0,825.0,1.0,2.0,1125.0,,4.0,16.0,37.0,179.0,5.0,94.0,10.0,10.0,10.0,10.0,10.0,9.0,,,2.0,1.17
75%,12789530.0,20160910000000.0,36221470.0,7.0,7.0,,42.354685,-71.062155,4.0,1.0,2.0,2.0,1200.0,1.0,3.0,1125.0,,15.0,40.0,68.0,325.0,21.0,98.25,10.0,10.0,10.0,10.0,10.0,10.0,,,6.0,2.72
max,14933460.0,20160910000000.0,93854110.0,749.0,749.0,,42.389982,-71.0001,16.0,6.0,5.0,16.0,2400.0,14.0,300.0,100000000.0,,30.0,60.0,90.0,365.0,404.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,,,136.0,19.15


In [21]:
listings['last_scraped'].unique()

array(['2016-09-07'], dtype=object)

In [22]:
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",,"The bus stop is 2 blocks away, and frequent. B...","You will have access to 2 bedrooms, a living r...",,Clean up and treat the home the way you'd like...,https://a2.muscache.com/im/pictures/c0842db1-e...,https://a2.muscache.com/im/pictures/c0842db1-e...,https://a2.muscache.com/im/pictures/c0842db1-e...,https://a2.muscache.com/im/pictures/c0842db1-e...,31303940,https://www.airbnb.com/users/show/31303940,Virginia,2015-04-15,"Boston, Massachusetts, United States",We are country and city connecting in our deck...,,,,f,https://a2.muscache.com/im/pictures/5936fef0-b...,https://a2.muscache.com/im/pictures/5936fef0-b...,Roslindale,1,1,"['email', 'phone', 'facebook', 'reviews']",t,f,"Birch Street, Boston, MA 02131, United States",Roslindale,Roslindale,,Boston,MA,2131.0,Boston,"Boston, MA",US,United States,42.282619,-71.133068,t,House,Entire home/apt,4,1.5,2.0,3.0,Real Bed,"{TV,""Wireless Internet"",Kitchen,""Free Parking ...",,$250.00,,,,$35.00,1,$0.00,2,1125,2 weeks ago,,0,0,0,0,2016-09-06,0,,,,,,,,,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...","If you don't have a US cell phone, you can tex...",Plenty of safe street parking. Bus stops a few...,Apt has one more bedroom (which I use) and lar...,"If I am at home, I am likely working in my hom...",Pet friendly but please confirm with me if the...,https://a1.muscache.com/im/pictures/39327812/d...,https://a1.muscache.com/im/pictures/39327812/d...,https://a1.muscache.com/im/pictures/39327812/d...,https://a1.muscache.com/im/pictures/39327812/d...,2572247,https://www.airbnb.com/users/show/2572247,Andrea,2012-06-07,"Boston, Massachusetts, United States",I live in Boston and I like to travel and have...,within an hour,100%,100%,f,https://a2.muscache.com/im/users/2572247/profi...,https://a2.muscache.com/im/users/2572247/profi...,Roslindale,1,1,"['email', 'phone', 'facebook', 'linkedin', 'am...",t,t,"Pinehurst Street, Boston, MA 02131, United States",Roslindale,Roslindale,,Boston,MA,2131.0,Boston,"Boston, MA",US,United States,42.286241,-71.134374,t,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",,$65.00,$400.00,,$95.00,$10.00,0,$0.00,2,15,a week ago,,26,54,84,359,2016-09-06,36,2014-06-01,2016-08-13,94.0,10.0,9.0,10.0,10.0,9.0,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,I am in a scenic part of Boston with a couple ...,"PUBLIC TRANSPORTATION: From the house, quick p...","I am living in the apartment during your stay,...","ABOUT ME: I'm a laid-back, friendly, unmarried...","I encourage you to use my kitchen, cooking and...",https://a2.muscache.com/im/pictures/6ae8335d-9...,https://a2.muscache.com/im/pictures/6ae8335d-9...,https://a2.muscache.com/im/pictures/6ae8335d-9...,https://a2.muscache.com/im/pictures/6ae8335d-9...,16701,https://www.airbnb.com/users/show/16701,Phil,2009-05-11,"Boston, Massachusetts, United States","I am a middle-aged, single male with a wide ra...",within a few hours,100%,88%,t,https://a2.muscache.com/im/users/16701/profile...,https://a2.muscache.com/im/users/16701/profile...,Roslindale,1,1,"['email', 'phone', 'reviews', 'jumio']",t,t,"Ardale St., Boston, MA 02131, United States",Roslindale,Roslindale,,Boston,MA,2131.0,Boston,"Boston, MA",US,United States,42.292438,-71.135765,t,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",""Wireless Internet"",""Air Condit...",,$65.00,$395.00,"$1,350.00",,,1,$20.00,3,45,5 days ago,,19,46,61,319,2016-09-06,41,2009-07-19,2016-08-05,98.0,10.0,9.0,10.0,10.0,9.0,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,Please be mindful of the property as it is old...,There are buses that stop right in front of th...,The basement has a washer dryer and gym area. ...,We do live in the house therefore might be som...,- The bathroom and house are shared so please ...,https://a2.muscache.com/im/pictures/39764190-1...,https://a2.muscache.com/im/pictures/39764190-1...,https://a2.muscache.com/im/pictures/39764190-1...,https://a2.muscache.com/im/pictures/39764190-1...,6031442,https://www.airbnb.com/users/show/6031442,Meghna,2013-04-21,"Boston, Massachusetts, United States",My husband and I live on the property. He’s a...,within a few hours,100%,50%,f,https://a2.muscache.com/im/pictures/5d430cde-7...,https://a2.muscache.com/im/pictures/5d430cde-7...,,1,1,"['email', 'phone', 'reviews']",t,f,"Boston, MA, United States",,Roslindale,,Boston,MA,,Boston,"Boston, MA",US,United States,42.281106,-71.121021,f,House,Private room,4,1.0,1.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",,$75.00,,,$100.00,$50.00,2,$25.00,1,1125,a week ago,,6,16,26,98,2016-09-06,1,2016-08-28,2016-08-28,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",I have one roommate who lives on the lower lev...,From Logan Airport and South Station you have...,You will have access to the front and side por...,I love my city and really enjoy sharing it wit...,"Please no smoking in the house, porch or on th...",https://a1.muscache.com/im/pictures/97154760/8...,https://a1.muscache.com/im/pictures/97154760/8...,https://a1.muscache.com/im/pictures/97154760/8...,https://a1.muscache.com/im/pictures/97154760/8...,15396970,https://www.airbnb.com/users/show/15396970,Linda,2014-05-11,"Boston, Massachusetts, United States",I work full time for a public school district....,within an hour,100%,100%,t,https://a0.muscache.com/im/users/15396970/prof...,https://a0.muscache.com/im/users/15396970/prof...,Roslindale,1,1,"['email', 'phone', 'reviews', 'kba']",t,t,"Durnell Avenue, Boston, MA 02131, United States",Roslindale,Roslindale,,Boston,MA,2131.0,Boston,"Boston, MA",US,United States,42.284512,-71.136258,t,House,Private room,2,1.5,1.0,2.0,Real Bed,"{Internet,""Wireless Internet"",""Air Conditionin...",,$79.00,,,,$15.00,1,$0.00,2,31,2 weeks ago,,13,34,59,334,2016-09-06,29,2015-08-18,2016-09-01,99.0,10.0,10.0,10.0,10.0,9.0,10.0,f,,,f,flexible,f,f,1,2.25


In [23]:
missing_values = listings.isna().sum()/listings.shape[0]*100
missing_values.sort_values(ascending = False)

has_availability                    100.000000
license                             100.000000
neighbourhood_group_cleansed        100.000000
jurisdiction_names                  100.000000
square_feet                          98.437936
monthly_price                        75.230126
weekly_price                         75.118550
security_deposit                     62.566248
notes                                55.090656
interaction                          43.347280
access                               41.534170
neighborhood_overview                39.470014
host_about                           36.513250
transit                              35.983264
house_rules                          33.249651
cleaning_fee                         30.878661
space                                29.483961
review_scores_accuracy               22.956764
review_scores_location               22.928870
review_scores_value                  22.900976
review_scores_checkin                22.873082
review_scores

In [24]:
listings.shape

(3585, 95)

In [25]:
#drop duplicates
listings.drop_duplicates(inplace=True)

In [26]:
listings.shape

(3585, 95)

### Drop columns with missing values and other columns that are not relevant

In [27]:
#These columns are entirely or mostly missing (about 75% to 100% of the values) -> we are dropping them
listings.drop(columns=['has_availability','neighbourhood_group_cleansed','license','jurisdiction_names','square_feet','monthly_price','weekly_price'],inplace=True)
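
Instead of typing the column names by hand, the list could also be derived from the missing-value shares. This is a sketch on made-up data; the `toy` frame and the 70% `threshold` are assumptions, not values used by the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with columns that are 100%, 75%, 25% and 0% missing.
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],
    "weekly_price": ["$400.00", np.nan, np.nan, np.nan],
    "cleaning_fee": ["$35.00", "$10.00", np.nan, "$50.00"],
    "price": ["$250.00", "$65.00", "$65.00", "$75.00"],
})

# Share of missing values per column, in percent.
missing_share = toy.isna().mean() * 100

# Hypothetical cut-off: drop everything that is more than 70% missing.
threshold = 70
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
```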

In [28]:
# drop other irrelevant columns
listings.drop(columns={'listing_url','thumbnail_url','medium_url','picture_url','xl_picture_url','availability_30','availability_60','availability_90','availability_365','maximum_nights','calendar_updated','host_picture_url','host_listings_count','neighbourhood'},inplace=True,axis = 1)

In [29]:
listings.shape

(3585, 74)

In [30]:
#find columns with the same value
col_list = []
for col in listings.columns:
    if listings[col].nunique()==1:
        col_list.append(col)

In [31]:
col_list

['scrape_id',
 'last_scraped',
 'experiences_offered',
 'state',
 'country_code',
 'country',
 'calendar_last_scraped',
 'requires_license']
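
The same check can be written without an explicit loop. One detail worth knowing is that `nunique()` ignores missing values by default, so a column with one value plus some NaN also counts as constant. A sketch on a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Made-up frame: 'state' is truly constant, 'country' is constant apart from one NaN.
toy = pd.DataFrame({
    "state": ["MA", "MA", "MA"],
    "country": ["United States", "United States", np.nan],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
})

# Default behaviour: NaN is not counted as a value.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# Stricter variant: NaN counts as its own value.
n_unique_with_nan = toy.nunique(dropna=False)
strictly_constant = n_unique_with_nan[n_unique_with_nan == 1].index.tolist()
```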

`last_scraped` holds the date of the last scrape. It has a single value, but we keep it for now -> we will use that date to calculate how many days the host has been in operation since joining.

In [32]:
# drop the columns with the same value
listings.drop(columns={'scrape_id',
 'experiences_offered',
 'state',
 'country_code',
 'country',
 'calendar_last_scraped',
 'requires_license'},inplace=True,axis = 1)

In [33]:
#find columns that take exactly 2 distinct values -> binary (dummy) variables
col_list2 = []
for col in listings.columns:
    if listings[col].nunique()==2:
        col_list2.append(col)

In [34]:
col_list2

['host_is_superhost',
 'host_has_profile_pic',
 'host_identity_verified',
 'is_location_exact',
 'instant_bookable',
 'require_guest_profile_picture',
 'require_guest_phone_verification']

In [35]:
listings['host_is_superhost'].value_counts()

f    3178
t     407
Name: host_is_superhost, dtype: int64

In [36]:
listings['host_has_profile_pic'].value_counts()

t    3577
f       8
Name: host_has_profile_pic, dtype: int64

In [37]:
listings['host_identity_verified'].value_counts()

t    2603
f     982
Name: host_identity_verified, dtype: int64

In [38]:
listings['is_location_exact'].value_counts()

t    3080
f     505
Name: is_location_exact, dtype: int64

In [39]:
listings['instant_bookable'].value_counts()

f    2991
t     594
Name: instant_bookable, dtype: int64

In [40]:
listings['require_guest_profile_picture'].value_counts()

f    3518
t      67
Name: require_guest_profile_picture, dtype: int64

In [41]:
listings['require_guest_phone_verification'].value_counts()

f    3348
t     237
Name: require_guest_phone_verification, dtype: int64

In [42]:
#Transform the Dummy variables into Binary with numerical values
for col in col_list2:
    listings[col] = np.where(listings[col]=='t', 1, 0)
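
The value counts above each sum to 3585, so none of these columns has missing values and `np.where` is safe here. If a flag column did contain NaN, `np.where` would silently turn it into 0. The sketch below, on a made-up `flags` series, contrasts that with a dictionary mapping that keeps NaN visible:

```python
import numpy as np
import pandas as pd

# Made-up flag column with one missing value.
flags = pd.Series(["t", "f", np.nan, "t"], name="host_is_superhost")

# np.where: everything that is not 't' becomes 0, including NaN.
as_where = pd.Series(np.where(flags == "t", 1, 0), index=flags.index)

# map: values without an entry in the dictionary (here NaN) stay NaN.
as_map = flags.map({"t": 1, "f": 0})
```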

In [43]:
#drop also column host_has_profile_pic, since the majority of the values are true
listings.drop('host_has_profile_pic',inplace=True, axis=1)

In [44]:
listings.shape

(3585, 66)

In [45]:
#for security deposit - my assumption is that a missing value means there is NO security deposit. The same holds for the cleaning fee - if omitted, there isn't one.
#This is a fixed rule, not a statistic learned from the data, so it is OK to do it before the train/test SPLIT
listings['security_deposit'] = listings['security_deposit'].fillna(0)
listings['cleaning_fee'] = listings['cleaning_fee'].fillna(0)

In [46]:
#The text in space, summary and description is almost the same, so I will keep only the description, since it has fewer missing values; name and the host URL/name/thumbnail columns are dropped as well
listings.drop(['space','summary','name','host_url','host_name','host_thumbnail_url'],axis=1,inplace=True)

In [47]:
#column market has very few distinct values and almost all rows are 'Boston' -> we will drop it
listings['market'].value_counts()

Boston                   3568
San Francisco               1
Other (Domestic)            1
Other (International)       1
Name: market, dtype: int64

In [48]:
listings.drop(['market'],axis=1,inplace=True)

In [49]:
#For the rest of the text variables a big part of the values is missing. One option (not used here) is to encode them as 1: with data, 0: no data
#col_list = ['notes', 'transit','access','interaction','house_rules','neighborhood_overview','host_about']
#for col in col_list:
#    listings[col] = np.where(listings[col].notna(), 1, 0)

In [50]:
#Drop the above columns 
listings.drop(columns={'notes', 'transit','access','interaction','house_rules','neighborhood_overview','host_about'},axis=1,inplace=True)
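
The notebook drops these text columns. For reference, a presence encoding (1: text provided, 0: missing) could look like the sketch below. The `toy` frame and the `has_` prefix are assumptions for illustration. The flag has to be computed before any fill, because filling the gaps first would make every row count as "with data".

```python
import numpy as np
import pandas as pd

# Made-up text columns with gaps.
toy = pd.DataFrame({
    "notes": ["Bring slippers", np.nan, np.nan],
    "transit": [np.nan, "Bus stop nearby", "Close to the T"],
})

text_cols = ["notes", "transit"]

# notna() gives True/False per cell; astype(int) turns that into 1/0.
flags = toy[text_cols].notna().astype(int).add_prefix("has_")
```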

__The date variable host_since is transformed into a numerical one: the number of days between host_since and last_scraped, i.e. how many days the host has been in operation__

In [51]:
listings['host_since']= pd.to_datetime(listings['host_since'])
listings['last_scraped']= pd.to_datetime(listings['last_scraped'])

In [52]:
listings['days_operation'] = (listings['last_scraped']- listings['host_since']).dt.days
#listings['days_operation'].dtype()
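
As a quick check of the date arithmetic, the sketch below uses the `host_since` dates of the first two listings shown earlier, plus one made-up missing value. A missing `host_since` becomes `NaT` and yields NaN days, which is why the result is a float column in that case.

```python
import pandas as pd

# First two host_since values taken from listings.head(); the third row is a made-up gap.
toy = pd.DataFrame({
    "host_since": ["2015-04-15", "2012-06-07", None],
    "last_scraped": ["2016-09-07", "2016-09-07", "2016-09-07"],
})
toy["host_since"] = pd.to_datetime(toy["host_since"])
toy["last_scraped"] = pd.to_datetime(toy["last_scraped"])

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
toy["days_operation"] = (toy["last_scraped"] - toy["host_since"]).dt.days
```

The first two results, 511 and 1553 days, match the `days_operation` values of the same listings in the final table.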

In [53]:
listings.drop(columns={'host_since','last_review','first_review','last_scraped'},inplace=True,axis=1)

In [54]:
listings.drop(['zipcode'],axis=1,inplace=True)

In [55]:
listings.describe(include='object').transpose()

Unnamed: 0,count,unique,top,freq
description,3585,3423,"The unit is stylishly designed for comfort, va...",7
host_location,3574,176,"Boston, Massachusetts, United States",2421
host_response_time,3114,4,within an hour,1384
host_response_rate,3114,52,100%,2072
host_acceptance_rate,3114,72,100%,1210
host_neighbourhood,3246,53,Allston-Brighton,375
host_verifications,3585,83,"['email', 'phone', 'reviews', 'jumio']",930
street,3585,1239,"Boylston Street, Boston, MA 02215, United States",64
neighbourhood_cleansed,3585,25,Jamaica Plain,343
city,3583,38,Boston,3381


In [56]:
#Transform the price columns to numeric (float) values by removing the currency symbols
cols = ['security_deposit','price','cleaning_fee','extra_people']
for col in cols:
    #regex=False: '$' is treated as a literal character, not as a regex anchor
    listings[col] = listings[col].str.replace('$','',regex=False)
    listings[col] = listings[col].str.replace(',','',regex=False)
    #listings[col].fillna(0,inplace=True)
    listings[col] = pd.to_numeric(listings[col])


In [57]:
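#Re-apply the fill: the string cleaning above turns non-string entries (the zeros filled in earlier) back into NaN
listings['security_deposit'] = listings['security_deposit'].fillna(0)
listings['cleaning_fee'] = listings['cleaning_fee'].fillna(0)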

In [58]:
listings['price'].head()

0    250.0
1     65.0
2     65.0
3     75.0
4     79.0
Name: price, dtype: float64
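
A small standalone sketch of the currency parsing, on a made-up `raw` series. Passing `regex=False` makes `'$'` a literal character; as a regular expression it would mean "end of string". Missing values pass through the string methods and come out as NaN after `pd.to_numeric`.

```python
import pandas as pd

# Made-up price strings in the same format as the listings data, with one gap.
raw = pd.Series(["$250.00", "$1,350.00", None, "$0.00"])

# Strip the currency symbol and the thousands separator as literal characters.
cleaned = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
)

# Convert to float; the missing entry becomes NaN.
as_number = pd.to_numeric(cleaned)
```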

In [59]:
col_objects = listings.select_dtypes('object').columns

In [60]:
for col in col_objects:
    print (col, listings[col].isna().sum())

description 0
host_location 11
host_response_time 471
host_response_rate 471
host_acceptance_rate 471
host_neighbourhood 339
host_verifications 0
street 0
neighbourhood_cleansed 0
city 2
smart_location 0
property_type 3
room_type 0
bed_type 0
amenities 0
cancellation_policy 0


In [61]:
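#Map host location to 1: within Boston, 0: not in Boston
#Missing locations are treated as Boston (1); they are filled before the comparison, because np.where would otherwise map NaN to 0
boston = 'Boston, Massachusetts, United States'
listings['host_location'] = np.where(listings['host_location'].fillna(boston)==boston, 1, 0)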

In [62]:
listings['host_response_time'].value_counts()

within an hour        1384
within a few hours    1218
within a day           469
a few days or more      43
Name: host_response_time, dtype: int64

In [63]:
#as a result from the analysis in 02.1 notebook - dropping the description field
listings.drop(['description'],axis=1,inplace=True)

#### Transform the price into a price per person, taking into account the extra charge for additional guests

In [64]:
listings['price_per_person'] = round(listings['price']/(listings['accommodates']+listings['guests_included']),2)+listings['extra_people']
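
To see what this formula produces, the sketch below applies it to the first three listings, using their `price`, `accommodates`, `guests_included` and `extra_people` values from the tables shown earlier. The `toy` frame is only for illustration.

```python
import pandas as pd

# Values of the first three listings from listings.head().
toy = pd.DataFrame({
    "price": [250.0, 65.0, 65.0],
    "accommodates": [4, 2, 2],
    "guests_included": [1, 0, 1],
    "extra_people": [0.0, 0.0, 20.0],
})

# Same formula as in the cell above: the price is spread over
# accommodates + guests_included, then the extra-guest charge is added.
per_head = (toy["price"] / (toy["accommodates"] + toy["guests_included"])).round(2)
toy["price_per_person"] = per_head + toy["extra_people"]
```

For the third listing this gives 65 / (2 + 1) = 21.67, plus 20 for extra people, i.e. 41.67, the value that appears in the final table.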

In [65]:
listings.drop(['price','accommodates','guests_included','extra_people'],axis=1,inplace=True)

In [66]:
listings.shape

(3585, 44)

In [67]:
listings.head()

Unnamed: 0,id,host_id,host_location,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_total_listings_count,host_verifications,host_identity_verified,street,neighbourhood_cleansed,city,smart_location,latitude,longitude,is_location_exact,property_type,room_type,bathrooms,bedrooms,beds,bed_type,amenities,security_deposit,cleaning_fee,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,days_operation,price_per_person
0,12147973,31303940,1,,,,0,Roslindale,1,"['email', 'phone', 'facebook', 'reviews']",0,"Birch Street, Boston, MA 02131, United States",Roslindale,Boston,"Boston, MA",42.282619,-71.133068,1,House,Entire home/apt,1.5,2.0,3.0,Real Bed,"{TV,""Wireless Internet"",Kitchen,""Free Parking ...",0.0,35.0,2,0,,,,,,,,0,moderate,0,0,1,,511,50.0
1,3075044,2572247,1,within an hour,100%,100%,0,Roslindale,1,"['email', 'phone', 'facebook', 'linkedin', 'am...",1,"Pinehurst Street, Boston, MA 02131, United States",Roslindale,Boston,"Boston, MA",42.286241,-71.134374,1,Apartment,Private room,1.0,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",95.0,10.0,2,36,94.0,10.0,9.0,10.0,10.0,9.0,9.0,1,moderate,0,0,1,1.3,1553,32.5
2,6976,16701,1,within a few hours,100%,88%,1,Roslindale,1,"['email', 'phone', 'reviews', 'jumio']",1,"Ardale St., Boston, MA 02131, United States",Roslindale,Boston,"Boston, MA",42.292438,-71.135765,1,Apartment,Private room,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",""Wireless Internet"",""Air Condit...",0.0,0.0,3,41,98.0,10.0,9.0,10.0,10.0,9.0,10.0,0,moderate,1,0,1,0.47,2676,41.67
3,1436513,6031442,1,within a few hours,100%,50%,0,,1,"['email', 'phone', 'reviews']",0,"Boston, MA, United States",Roslindale,Boston,"Boston, MA",42.281106,-71.121021,0,House,Private room,1.0,1.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",100.0,50.0,1,1,100.0,10.0,10.0,10.0,10.0,10.0,10.0,0,moderate,0,0,1,1.0,1235,37.5
4,7651065,15396970,1,within an hour,100%,100%,1,Roslindale,1,"['email', 'phone', 'reviews', 'kba']",1,"Durnell Avenue, Boston, MA 02131, United States",Roslindale,Boston,"Boston, MA",42.284512,-71.136258,1,House,Private room,1.5,1.0,2.0,Real Bed,"{Internet,""Wireless Internet"",""Air Conditionin...",0.0,15.0,2,29,99.0,10.0,10.0,10.0,10.0,9.0,10.0,0,flexible,0,0,1,2.25,850,26.33


In [68]:
#other columns to drop
listings.drop(['host_id','host_neighbourhood','host_verifications','street','smart_location','is_location_exact','require_guest_profile_picture','require_guest_phone_verification','calculated_host_listings_count','instant_bookable'],axis=1,inplace=True)

In [69]:
listings.shape

(3585, 34)

In [70]:
listings.to_pickle('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/processed/listings.pkl')