# Seattle AirBnB data

## Using data to understand the homeowner's market in Seattle

I approached the data as if I were a homeowner in Seattle. As a homeowner, my main objective would be to offer a great experience for my guests while making a healthy profit, so I structured my business understanding questions around these objectives. The questions for my analysis are as follows:

### Business Understanding:
1. Can we predict what drives higher ratings?
2. When are the most popular times of the year for Seattle home-owners?
3. When are the most profitable times of the year for Seattle home-owners?

### Data Understanding

#### Data Exploration

All data was obtained from Kaggle: https://www.kaggle.com/airbnb/seattle/home

In [966]:
#import libraries and load data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display
import collections
import sklearn
%matplotlib inline

listings = pd.read_csv('./seattle/listings.csv')
calendar = pd.read_csv('./seattle/calendar.csv')
reviews = pd.read_csv('./seattle/reviews.csv')

In [967]:
#explore columns in datasets
print(listings.columns.values)
print(calendar.columns.values)
print(reviews.columns.values)

['id' 'listing_url' 'scrape_id' 'last_scraped' 'name' 'summary' 'space'
 'description' 'experiences_offered' 'neighborhood_overview' 'notes'
 'transit' 'thumbnail_url' 'medium_url' 'picture_url' 'xl_picture_url'
 'host_id' 'host_url' 'host_name' 'host_since' 'host_location'
 'host_about' 'host_response_time' 'host_response_rate'
 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url'
 'host_picture_url' 'host_neighbourhood' 'host_listings_count'
 'host_total_listings_count' 'host_verifications' 'host_has_profile_pic'
 'host_identity_verified' 'street' 'neighbourhood'
 'neighbourhood_cleansed' 'neighbourhood_group_cleansed' 'city' 'state'
 'zipcode' 'market' 'smart_location' 'country_code' 'country' 'latitude'
 'longitude' 'is_location_exact' 'property_type' 'room_type'
 'accommodates' 'bathrooms' 'bedrooms' 'beds' 'bed_type' 'amenities'
 'square_feet' 'price' 'weekly_price' 'monthly_price' 'security_deposit'
 'cleaning_fee' 'guests_included' 'extra_people' 'minimum_nights'
 'm

It appears that all the datasets can potentially be merged by their listing ID, if needed during analysis. First, we check that each column is a variable and each row is an individual observation.
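
As a rough illustration of such a merge, the sketch below joins a toy calendar table to a toy listings table. It assumes that `listing_id` in the calendar data refers to `id` in the listings data; the frames `listings_toy` and `calendar_toy` and all of their values are made up for the example.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration)
listings_toy = pd.DataFrame({'id': [1, 2],
                             'room_type': ['Entire home/apt', 'Private room']})
calendar_toy = pd.DataFrame({
    'listing_id': [1, 1, 2],
    'date': ['2016-01-04', '2016-01-05', '2016-01-04'],
    'available': ['t', 'f', 't'],
})

# Left-join so every calendar row keeps its listing attributes;
# validate='many_to_one' raises if listings_toy contains duplicate ids
merged = calendar_toy.merge(listings_toy, left_on='listing_id', right_on='id',
                            how='left', validate='many_to_one')
print(merged[['listing_id', 'date', 'available', 'room_type']])
```

A left join keeps one row per calendar entry, which is the shape needed for any later per-date analysis.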

In [968]:
#check no. of rows and columns
print(listings.shape)
print(calendar.shape)
print(reviews.shape)

(3818, 92)
(1393570, 4)
(84849, 6)


In [969]:
#check for missing values in the columns for each dataset, get percentages
(listings.isnull().sum()/len(listings)).sort_values(ascending=False)

license                             1.000000
square_feet                         0.974594
monthly_price                       0.602672
security_deposit                    0.511262
weekly_price                        0.473808
notes                               0.420639
neighborhood_overview               0.270299
cleaning_fee                        0.269775
transit                             0.244631
host_about                          0.224987
host_acceptance_rate                0.202462
review_scores_accuracy              0.172342
review_scores_checkin               0.172342
review_scores_value                 0.171818
review_scores_location              0.171556
review_scores_cleanliness           0.171032
review_scores_communication         0.170508
review_scores_rating                0.169460
reviews_per_month                   0.164222
first_review                        0.164222
last_review                         0.164222
space                               0.149031
host_respo

In [970]:
#check distribution of null values
(listings.isnull().sum()/len(listings)).describe()

count    92.000000
mean      0.084893
std       0.181492
min       0.000000
25%       0.000000
50%       0.000000
75%       0.136983
max       1.000000
dtype: float64

For the listings dataset, a number of columns contain missing values, and the license column is completely null.

Looking at the spread of null values, the average share of null rows per column is around 8.5%, and 75% of columns have at most about 13.7% of their rows missing. As I would prefer not to drop columns unnecessarily, I will only consider dropping columns in which 30% or more of the rows are null.

In [971]:
(calendar.isnull().sum()/len(calendar)).sort_values(ascending=False)

price         0.32939
available     0.00000
date          0.00000
listing_id    0.00000
dtype: float64

For the calendar dataset, about 33% of rows in the price column are null.

In [972]:
(reviews.isnull().sum()/len(reviews)).sort_values(ascending=False)

comments         0.000212
reviewer_name    0.000000
reviewer_id      0.000000
date             0.000000
id               0.000000
listing_id       0.000000
dtype: float64

The reviews dataset has almost no missing values.

To prepare the data for the three business questions, we need to determine which of the three datasets, and which columns within them, are relevant to each question.

We have three datasets: listings, calendar and reviews. Based on our brief exploration above, the dataset most relevant to the first question (what drives higher ratings) is the listings dataset. The calendar dataset looks more relevant as a supplement to the listings dataset for our second question on popular times and availability.

Meanwhile, the reviews dataset is mainly unstructured text and is more relevant for qualitative predictors, so we will only analyse it if we lack sufficient information to answer our questions.

After determining the datasets that are relevant for answering our questions, we move to preparing the data for our analysis.

### Question 1: Can we predict what drives higher ratings?
### Data Preparation

#### Treatment of missing values

Since the license column is entirely missing and is not relevant to the questions above, we can drop it from our analysis dataset. As mentioned above, I will also consider dropping columns with 30% or more missing rows, provided there are not too many of them and we don't foresee losing too much information.

In [973]:
#drop license column
listings.drop(columns=['license'],inplace=True)

In [974]:
#check columns >=30% missing values
[cols for cols in listings.columns.values if (listings[cols].isnull().sum()/len(listings))>=0.30]

['notes', 'square_feet', 'weekly_price', 'monthly_price', 'security_deposit']

Considering the columns above, we know that there is the price column that would contain similar information, so we can drop the weekly_price and monthly_price columns without worrying about losing information. 

In [975]:
#drop weekly_price, monthly_price columns
listings.drop(columns=['weekly_price','monthly_price'],inplace=True)

Since a security deposit is an additional cost, I would rather not drop the column. However, because the number of missing values is quite high, I will instead engineer a flag variable indicating whether a listing has a security deposit or not, so that the missing values carry information too.

In [976]:
#create new variable security_deposit_flag 
listings['security_deposit_flag']=np.where(listings['security_deposit'].notnull(), 1, 0)

As for notes, it appears to be unstructured free text, so I will drop it after a quick look at its values. Meanwhile, for square_feet, the number of missing values is so high (about 97%) that I will simply drop it.

In [977]:
#check values in notes column
print(listings['notes'].head(10))
#drop notes and square_feet columns
listings.drop(columns=['notes','square_feet'],inplace=True)

0                                                  NaN
1    What's up with the free pillows?  Our home was...
2    Our house is located just 5 short blocks to To...
3                                                  NaN
4                                            Belltown 
5    Let me know if you need anything or have sugge...
6    The room now has a mini frig to keep your favo...
7    There are three rentals in our back yard . If ...
8                                                  NaN
9    What's up with the free pillows?  Our home was...
Name: notes, dtype: object


#### Choosing the target (y) variable
Next, we revisit the question, which is about what drives higher ratings for homes. The target variable (y column) should come from the set of review_scores columns. However, there are several such columns.

In [978]:
#check column names that begin with 'review_scores' 
[col for col in listings if col.startswith('review_scores_')]

['review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value']

Based on [AirBnB's ratings methodology](https://www.airbnb.com/help/article/1257/how-do-star-ratings-work), the overall rating is the one that reflects the overall experience for guests, and so the review_scores_rating column is the one I set as my target variable.
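
About 17% of listings have no review_scores_rating (see the missing-value percentages above), so those rows cannot serve as labelled examples. The sketch below shows one way to handle this, assuming such rows are dropped rather than imputed; the notebook does not state which approach it takes, and the toy frame and its values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    'accommodates': [2, 4, 3, 6],
    'review_scores_rating': [95.0, np.nan, 88.0, np.nan],
})

# Keep only rows where the target is known, then split features from target
labelled = toy.dropna(subset=['review_scores_rating'])
X = labelled.drop(columns=['review_scores_rating'])
y = labelled['review_scores_rating']
print(len(toy), len(labelled))
```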

#### Feature engineering

However, we need to revisit the other columns in the listings dataset. There are quite a few redundant columns that are unnecessary.

For example, it is unnecessary to have columns that only contain one unique value as they don't provide any predictive power.

In [979]:
#find columns in dataset that only contain one unique value
one_unique=[col for col in listings.columns.values if listings[col].nunique()==1]
one_unique

['scrape_id',
 'last_scraped',
 'experiences_offered',
 'market',
 'country_code',
 'country',
 'has_availability',
 'calendar_last_scraped',
 'requires_license',
 'jurisdiction_names']

Any columns that contain 'url' in the name are also irrelevant as they contain no predictive power or characteristics that lead to higher ratings for homes.

In [980]:
#find columns containing 'url' in the name
url_col=[col for col in listings.columns.values if 'url' in col]
url_col

['listing_url',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_url',
 'host_thumbnail_url',
 'host_picture_url']

In [981]:
#add url and single unique value to drop columns list
drop_cols=[]
drop_cols.extend(one_unique)
drop_cols.extend(url_col)
#drop columns from dataset
listings.drop(columns=drop_cols,inplace=True)

In [982]:
#get total no. of columns after dropping columns
len(listings.columns.values)

70

For the remaining columns (ignoring the review_scores/target variable columns), we look at one sample value from each column to determine whether it would be valuable for our predictions, and whether further feature engineering is required for some columns.

In a way, we need to build our own simple data dictionary for our use, to more easily identify what kind of feature engineering is required.

In [983]:
#ignoring the review_scores column, look at the remaining columns, take one sample from each and build simple data dictionary
listings_sample=[(x,listings[x][1]) for x in listings.columns.values if not x.startswith('review_scores_')]
listings_sample=pd.DataFrame(listings_sample, columns=['column_name','sample_value'])
listings_sample.set_index('column_name',inplace=True)
#check types of values in each column - add new column of value_type to our data dictionary
listings_sample['value_type']=[type(x) for x in listings_sample['sample_value']]
#check how many unique values are in each column for categorical variables encoding
listings_sample['nunique']=[listings[col].nunique() for col in listings_sample.index.values]
listings_sample

| column_name | sample_value | value_type | nunique |
|---|---|---|---|
| id | 953595 | `<class 'numpy.int64'>` | 3818 |
| name | Bright & Airy Queen Anne Apartment | `<class 'str'>` | 3792 |
| summary | Chemically sensitive? We've removed the irrita... | `<class 'str'>` | 3478 |
| space | Beautiful, hypoallergenic apartment in an extr... | `<class 'str'>` | 3119 |
| description | Chemically sensitive? We've removed the irrita... | `<class 'str'>` | 3742 |
| neighborhood_overview | Queen Anne is a wonderful, truly functional vi... | `<class 'str'>` | 2506 |
| transit | Convenient bus stops are just down the block, ... | `<class 'str'>` | 2574 |
| host_id | 5177328 | `<class 'numpy.int64'>` | 2751 |
| host_name | Andrea | `<class 'str'>` | 1466 |
| host_since | 2013-02-21 | `<class 'str'>` | 1380 |


From our data dictionary above, we can see that quite a number of columns contain unstructured data in the form of free text, e.g. name, summary, description, neighborhood_overview, etc. We will decide what to do with them later, as they require quite a lot of feature engineering.

For the other columns, some may not be relevant to the predictive model. For example, host particulars like host_id, host_name, host_since, host_location, host_about and host_verifications don't look useful, as they contain mainly unstructured information that is irrelevant to our question of interest.

In [984]:
#remove irrelevant columns, update data dictionary at the same time
drop_cols=['host_id', 'host_name', 'host_since', 'host_location', 'host_about', 'host_verifications']
listings.drop(columns=drop_cols,inplace=True)
listings_sample.drop(drop_cols,inplace=True)

Furthermore, there are some redundant columns that can be removed as well, for example the neighbourhood data (it looks like they have been aggregated into neighbourhood_group_cleansed which may be more useful).

In [985]:
#check neighbourhood columns to see which one would provide more information
print(listings['neighbourhood_cleansed'].unique())
print(listings['neighbourhood_group_cleansed'].unique())
print(listings['neighbourhood'].unique())

['West Queen Anne' 'Adams' 'West Woodland' 'East Queen Anne' 'Wallingford'
 'North Queen Anne' 'Green Lake' 'Westlake' 'Mann' 'Madrona'
 'University District' 'Harrison/Denny-Blaine' 'Minor' 'Leschi' 'Atlantic'
 'Pike-Market' 'Eastlake' 'South Lake Union' 'Lawton Park' 'Briarcliff'
 'Belltown' 'International District' 'Central Business District'
 'First Hill' 'Yesler Terrace' 'Pioneer Square' 'Gatewood' 'Arbor Heights'
 'Alki' 'North Admiral' 'Crown Hill' 'Fairmount Park' 'Genesee' 'Interbay'
 'Industrial District' 'Mid-Beacon Hill' 'South Beacon Hill' 'Greenwood'
 'Holly Park' 'Fauntleroy' 'North Beacon Hill' 'Mount Baker' 'Brighton'
 'South Delridge' 'View Ridge' 'Dunlap' 'Rainier Beach' 'Columbia City'
 'Seward Park' 'North Delridge' 'Maple Leaf' 'Ravenna' 'Riverview'
 'Portage Bay' 'Bryant' 'Montlake' 'Broadway' 'Loyal Heights'
 'Victory Heights' 'Matthews Beach' 'Whittier Heights' 'Meadowbrook'
 'Olympic Hills' 'Roosevelt' 'Lower Queen Anne' 'Wedgwood'
 'North Beach/Blue Ridge' 'C

After checking the neighbourhood columns above, I decided to use neighbourhood_group_cleansed instead of neighbourhood and neighbourhood_cleansed, as it has fewer levels as a categorical variable while still offering a good amount of information. The host_neighbourhood column is dropped along with them.
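
One quick way to compare candidate columns like these is to count their distinct levels and missing values side by side. The sketch below does this on a toy frame; the neighbourhood values and their groupings are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame; the values and groupings are invented for illustration
toy = pd.DataFrame({
    'neighbourhood': ['Queen Anne', 'Queen Anne', 'Ballard', None],
    'neighbourhood_cleansed': ['West Queen Anne', 'East Queen Anne',
                               'Adams', 'West Woodland'],
    'neighbourhood_group_cleansed': ['Queen Anne', 'Queen Anne',
                                     'Ballard', 'Other neighborhoods'],
})

# Distinct non-null levels and missing counts per column
summary = pd.DataFrame({'levels': toy.nunique(), 'missing': toy.isnull().sum()})
print(summary)
```

Fewer levels means fewer dummy columns after encoding, which is the trade-off weighed here.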

In [986]:
#remove redundant columns, update data dictionary at the same time
drop_cols=['neighbourhood','host_neighbourhood','neighbourhood_cleansed']
listings.drop(columns=drop_cols,inplace=True)
listings_sample.drop(drop_cols,inplace=True)

The city, state, street, smart_location, latitude and longitude columns are all redundant, as all listings are in Seattle, WA, and we already have location information through other variables like neighbourhood group and zip code. Thus they can be dropped. The variables first_review and last_review are dates that I am not going to focus on in my analysis for question 1. This also applies to the variable calendar_updated.

In [987]:
#remove redundant data, update data dictionary at the same time
drop_cols=['city','state','street','smart_location',
            'latitude','longitude','first_review','last_review','calendar_updated']
listings.drop(columns=drop_cols,inplace=True)
listings_sample.drop(drop_cols,inplace=True)

After removing the 'low-hanging fruit' of irrelevant columns, we need to revisit our data dictionary and check whether the values in each column can be processed by the model, and how many levels the categorical columns have that need to be encoded.

In [988]:
print('No. of columns= {} columns.'.format(len(listings_sample)))
listings_sample

No. of columns= 45 columns.


| column_name | sample_value | value_type | nunique |
|---|---|---|---|
| id | 953595 | `<class 'numpy.int64'>` | 3818 |
| name | Bright & Airy Queen Anne Apartment | `<class 'str'>` | 3792 |
| summary | Chemically sensitive? We've removed the irrita... | `<class 'str'>` | 3478 |
| space | Beautiful, hypoallergenic apartment in an extr... | `<class 'str'>` | 3119 |
| description | Chemically sensitive? We've removed the irrita... | `<class 'str'>` | 3742 |
| neighborhood_overview | Queen Anne is a wonderful, truly functional vi... | `<class 'str'>` | 2506 |
| transit | Convenient bus stops are just down the block, ... | `<class 'str'>` | 2574 |
| host_response_time | within an hour | `<class 'str'>` | 4 |
| host_response_rate | 98% | `<class 'str'>` | 45 |
| host_acceptance_rate | 100% | `<class 'str'>` | 2 |


From our simple data dictionary, we can see that a lot of feature engineering is required for many of the columns. For example, host_response_rate and host_acceptance_rate are recorded as strings even though they are percentages. The pricing data (e.g. price, cleaning_fee, etc.) are also recorded as strings, even though they would be more useful to the predictive model as numerical values.

Additionally, the 'amenities' column is recorded as a single string holding a brace-delimited list of amenities, so we will need to process it as well.

#### Encoding for categorical variables 

Leaving all columns containing unstructured data for later, we first encode all categorical variables. Referring to our data dictionary, the categorical variables are as follows:

    host_response_time
    host_is_superhost
    host_has_profile_pic
    host_identity_verified
    neighbourhood_group_cleansed
    zipcode
    is_location_exact
    property_type
    room_type
    bed_type
    amenities (which needs further processing)
    instant_bookable
    cancellation_policy
    require_guest_profile_picture
    require_guest_phone_verification
    security_deposit_flag

I created the security_deposit_flag variable myself, so it is already encoded and we can leave it aside. The others need to be encoded according to their levels, for which I used the pandas get_dummies function.

In [989]:
#check zipcode unique values - appears to have a weird value there
print(listings['zipcode'].unique())
#convert weird value into 98122
listings.loc[listings['zipcode'] == '99\n98122', 'zipcode'] = '98122'
print(listings['zipcode'].unique())

['98119' '98109' '98107' '98117' nan '98103' '98105' '98115' '98101'
 '98122' '98112' '98144' '99\n98122' '98121' '98102' '98199' '98104'
 '98134' '98136' '98126' '98146' '98116' '98177' '98118' '98108' '98133'
 '98106' '98178' '98125']
['98119' '98109' '98107' '98117' nan '98103' '98105' '98115' '98101'
 '98122' '98112' '98144' '98121' '98102' '98199' '98104' '98134' '98136'
 '98126' '98146' '98116' '98177' '98118' '98108' '98133' '98106' '98178'
 '98125']
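
The fix above targets one specific malformed entry. A more general sketch, assuming a valid zip code is always the last run of five digits in the string, could use a regular expression instead; the toy series `zips` is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy series including the malformed entry seen above and a missing value
zips = pd.Series(['98119', '99\n98122', np.nan, '98101'])

# Capture five digits anchored at the end of the string;
# entries that do not match (including NaN) stay missing
clean = zips.str.extract(r'(\d{5})$', expand=False)
print(clean.tolist())
```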


In [990]:
#encode categorical variables
binary_cols=['host_is_superhost','host_has_profile_pic','host_identity_verified','is_location_exact'
             ,'instant_bookable','require_guest_profile_picture','require_guest_phone_verification']
listings[binary_cols]=np.where(listings[binary_cols]=='t', 1, 0)
encode_cols=['host_response_time','neighbourhood_group_cleansed','zipcode'
             ,'property_type','room_type','bed_type','cancellation_policy']
listings=pd.get_dummies(data=listings, columns=encode_cols,drop_first=True)
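
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column (values chosen for illustration). Note also that the `np.where(... == 't', 1, 0)` step above maps missing values to 0 together with 'f', since a missing value never equals 't'.

```python
import pandas as pd

toy = pd.DataFrame({'room_type': ['Entire home/apt', 'Private room',
                                  'Shared room', 'Private room']})

# drop_first=True omits the first level in sorted order ('Entire home/apt'),
# which then acts as the baseline: a row of all zeros means that level
encoded = pd.get_dummies(toy, columns=['room_type'], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for linear models.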

Next we need to look at the 'amenities' column and engineer it to extract categorical variables.

#### Transforming strings into numerical values

Next, we look at the columns that store numerical values as strings, such as the percentage and pricing columns noted above.
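
A minimal sketch of that conversion is shown below. It assumes prices are formatted like '$1,000.00' and rates like '98%'; the toy frame is invented, and the real columns may need extra handling.

```python
import numpy as np
import pandas as pd

# Toy frame; the formats are assumed and the values invented
toy = pd.DataFrame({
    'price': ['$85.00', '$1,000.00', '$150.00'],
    'host_response_rate': ['98%', np.nan, '100%'],
})

# Strip '$' and thousands separators, then cast to float
toy['price'] = toy['price'].str.replace(r'[$,]', '', regex=True).astype(float)

# Strip the trailing '%' and rescale to a 0-1 fraction; NaN stays NaN
toy['host_response_rate'] = (
    toy['host_response_rate'].str.rstrip('%').astype(float) / 100
)
print(toy)
```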



Now we need to revisit what to do with the columns containing unstructured data. 

After reading around, I decided that a simple natural language processing model can be used to transform the unstructured text into feature vectors that could be processed by a machine learning model. In a way, I had to do basic feature extraction on those columns to extract the textual features in numerical format.
I referred to this article [here](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/) for extensive guidance on how to do this.

In [996]:
listings_test=listings.copy()

In [1018]:
# #split amenities by commas
# listings_test['amenities']=listings_test['amenities'].str.split(',')
# #get unique list of amenities
# for x in range(len(listings_test['amenities'])):
#     amenities_list=[listings_test['amenities'][x][y].strip('{}""') for y in range(len(listings_test['amenities'][x]))]

In [1029]:
#get dataframe of listings with amenities split up
amenities_df=listings_test['amenities'].str.split(',',expand=True)
amenities_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,{TV,"""Cable TV""",Internet,"""Wireless Internet""","""Air Conditioning""",Kitchen,Heating,"""Family/Kid Friendly""",Washer,Dryer},...,,,,,,,,,,
1,{TV,Internet,"""Wireless Internet""",Kitchen,"""Free Parking on Premises""","""Buzzer/Wireless Intercom""",Heating,"""Family/Kid Friendly""",Washer,Dryer,...,,,,,,,,,,
2,{TV,"""Cable TV""",Internet,"""Wireless Internet""","""Air Conditioning""",Kitchen,"""Free Parking on Premises""","""Pets Allowed""","""Pets live on this property""",Dog(s),...,Shampoo},,,,,,,,,
3,{Internet,"""Wireless Internet""",Kitchen,"""Indoor Fireplace""",Heating,"""Family/Kid Friendly""",Washer,Dryer,"""Smoke Detector""","""Carbon Monoxide Detector""",...,,,,,,,,,,
4,{TV,"""Cable TV""",Internet,"""Wireless Internet""",Kitchen,Heating,"""Family/Kid Friendly""","""Smoke Detector""","""Carbon Monoxide Detector""","""First Aid Kit""",...,,,,,,,,,,
5,"{""Wireless Internet""","""Free Parking on Premises""",Heating,"""Smoke Detector""",Essentials,Shampoo},,,,,...,,,,,,,,,,
6,"{""Wireless Internet""","""Free Parking on Premises""",Heating,"""Smoke Detector""","""First Aid Kit""",Essentials,Shampoo},,,,...,,,,,,,,,,
7,"{""Wireless Internet""","""Pets live on this property""",Dog(s),Heating,"""Family/Kid Friendly""",Essentials,Shampoo},,,,...,,,,,,,,,,
8,{TV,"""Cable TV""",Internet,"""Wireless Internet""",Kitchen,Breakfast,"""Indoor Fireplace""",Heating,Washer,Dryer,...,,,,,,,,,,
9,{TV,Internet,"""Wireless Internet""",Kitchen,"""Free Parking on Premises""","""Buzzer/Wireless Intercom""",Heating,"""Family/Kid Friendly""",Washer,Dryer,...,,,,,,,,,,


In [1030]:
#remove punctuation from amenities dataframe (pandas 2.0+ treats the pattern literally unless regex=True)
amenities_df=amenities_df.apply(lambda col: col.str.replace(r'[^\w\s]','',regex=True))

In [1141]:
#get listings with most amenities
amenities_df[amenities_df.transpose().isnull().sum()==0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
1330,TV,Cable TV,Internet,Wireless Internet,Wheelchair Accessible,Kitchen,Free Parking on Premises,Pets Allowed,Doorman,Gym,...,First Aid Kit,Safety Card,Fire Extinguisher,Essentials,Shampoo,24Hour Checkin,Hangers,Hair Dryer,Iron,Laptop Friendly Workspace
1421,TV,Internet,Wireless Internet,Air Conditioning,Wheelchair Accessible,Pool,Kitchen,Free Parking on Premises,Pets Allowed,Gym,...,First Aid Kit,Safety Card,Fire Extinguisher,Essentials,Shampoo,24Hour Checkin,Hangers,Hair Dryer,Iron,Laptop Friendly Workspace
2746,TV,Cable TV,Internet,Wireless Internet,Air Conditioning,Wheelchair Accessible,Kitchen,Free Parking on Premises,Gym,Breakfast,...,First Aid Kit,Safety Card,Fire Extinguisher,Essentials,Shampoo,24Hour Checkin,Hangers,Hair Dryer,Iron,Laptop Friendly Workspace
2951,TV,Cable TV,Internet,Wireless Internet,Air Conditioning,Wheelchair Accessible,Kitchen,Free Parking on Premises,Pets Allowed,Breakfast,...,First Aid Kit,Safety Card,Fire Extinguisher,Essentials,Shampoo,24Hour Checkin,Hangers,Hair Dryer,Iron,Laptop Friendly Workspace


In [1142]:
#add all amenities of those listings to lists
lst_1=amenities_df.transpose()[1330].unique().tolist()
lst_2=amenities_df.transpose()[1421].unique().tolist()
lst_3=amenities_df.transpose()[2746].unique().tolist()
lst_4=amenities_df.transpose()[2951].unique().tolist()
lst_1.extend(lst_2)
lst_1.extend(lst_3)
lst_1.extend(lst_4)

#convert amenities_list to no duplicate
amenities_list=set(lst_1)

In [1143]:
amenity_flags=pd.DataFrame({x: amenities_df.eq(x).any(axis=1).astype(int) for x in amenities_list}) #1 if the listing offers the amenity, else 0
amenities_list

{'24Hour Checkin',
 'Air Conditioning',
 'Breakfast',
 'BuzzerWireless Intercom',
 'Cable TV',
 'Carbon Monoxide Detector',
 'Dogs',
 'Doorman',
 'Dryer',
 'Elevator in Building',
 'Essentials',
 'FamilyKid Friendly',
 'Fire Extinguisher',
 'First Aid Kit',
 'Free Parking on Premises',
 'Gym',
 'Hair Dryer',
 'Hangers',
 'Heating',
 'Hot Tub',
 'Indoor Fireplace',
 'Internet',
 'Iron',
 'Kitchen',
 'Laptop Friendly Workspace',
 'Pets Allowed',
 'Pets live on this property',
 'Pool',
 'Safety Card',
 'Shampoo',
 'Smoke Detector',
 'Suitable for Events',
 'TV',
 'Washer',
 'Wheelchair Accessible',
 'Wireless Internet'}

In [1009]:
#get dummies for amenities_df
amenities_df=pd.get_dummies(data=amenities_df,drop_first=True)
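
Calling `pd.get_dummies` on the split dataframe encodes each amenity by the position it occupies in a listing's string, so the same amenity can end up in several columns. An alternative sketch, assuming the raw strings look like the brace-delimited samples shown earlier, builds one 0/1 column per amenity directly with `Series.str.get_dummies`; the toy series below is invented.

```python
import pandas as pd

# Toy amenities strings in the same brace-delimited format
amenities = pd.Series([
    '{TV,"Cable TV",Internet,Heating}',
    '{"Wireless Internet",Heating}',
    '{}',
])

# Remove the braces and quotes, then build one 0/1 column per distinct amenity
flags = (
    amenities.str.strip('{}')
             .str.replace('"', '', regex=False)
             .str.get_dummies(sep=',')
)
print(flags)
```

This also avoids having to collect the amenity names from a handful of listings first, since every distinct value in the column becomes a flag.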

Columns containing unstructured data like sentences will also not feature in the predictive model. Even though they may contain some important information, using them would require parsing those columns for commonly occurring words, which needs further feature engineering.

In [148]:
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
#add numbers and a few extra words to the stop-word list
nums=[str(x) for x in range(1000)]
stop.extend(['seattle',"i've","i'm"])
stop.extend(nums)

In [150]:
lst1=collections.Counter(" ".join(listings_test["description"].dropna()).split()).most_common(1000)
[x for x in lst1 if x[0].lower() not in stop]

[('room', 3855),
 ('kitchen', 2715),
 ('home', 2660),
 ('bedroom', 2653),
 ('bed', 2516),
 ('downtown', 2500),
 ('apartment', 2350),
 ('house', 2170),
 ('space', 2069),
 ('access', 2014),
 ('neighborhood', 1944),
 ('private', 1927),
 ('bathroom', 1914),
 ('living', 1865),
 ('restaurants', 1843),
 ('one', 1824),
 ('bus', 1699),
 ('full', 1680),
 ('located', 1600),
 ('floor', 1549),
 ('walk', 1528),
 ('away', 1525),
 ('two', 1483),
 ('available', 1394),
 ('blocks', 1324),
 ('area', 1318),
 ('Hill', 1313),
 ('great', 1279),
 ('parking', 1245),
 ('queen', 1206),
 ('large', 1167),
 ('quiet', 1163),
 ('street', 1128),
 ('minutes', 1064),
 ('coffee', 1034),
 ('also', 1008),
 ('stay', 1005),
 ('Capitol', 1003),
 ('guests', 996),
 ('city', 969),
 ('comfortable', 965),
 ('distance', 955),
 ('walking', 947),
 ('new', 934),
 ('bath', 923),
 ('Lake', 914),
 ('unit', 865),
 ('close', 865),
 ('location', 859),
 ('TV', 851),
 ('use', 826),
 ('shops', 824),
 ('views', 803),
 ('light', 795),
 ('size', 7
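
In the counts above, tokens keep their original case and any attached punctuation, so 'Hill' and 'hill' are tallied separately. A sketch of a normalised count follows, using a made-up list of descriptions and a tiny hypothetical stop-word set in place of the NLTK list.

```python
import collections
import re

# Toy descriptions and a tiny stop-word set (both invented for illustration)
descriptions = ['Quiet room near Capitol Hill.',
                'Room with a view of Capitol hill!',
                None]
stop_words = {'a', 'of', 'with', 'near'}

counts = collections.Counter()
for text in descriptions:
    if not isinstance(text, str):   # skip missing descriptions
        continue
    # Lower-case first, then keep runs of letters/apostrophes as tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    counts.update(t for t in tokens if t not in stop_words)

print(counts.most_common(3))
```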

In [76]:
listings_test['description']

0       Make your self at home in this charming one-be...
1       Chemically sensitive? We've removed the irrita...
2       New modern house built in 2013.  Spectacular s...
3       A charming apartment that sits atop Queen Anne...
4       Cozy family craftman house in beautiful neighb...
5       We're renting out a small private unit of one ...
6       Enjoy a quiet stay in our comfortable 1915 Cra...
7       Our tiny cabin is private , very quiet and com...
8       Nestled in the heart of the city, this space i...
9       Beautiful apartment in an extremely safe, quie...
10      Queen Anne Hill is a charming neighborhood wit...
11      Beautifully furnished, cozy 1 bedroom mid cent...
12      Spacious apt in popular Seattle neighborhood. ...
13      Enjoy our amazing, updated & modern design cot...
14      Stunning Designsponge featured 6 bed, 3.75 bat...
15      This home is full of light, art and comfort. 5...
16      Master bedroom suite with 1/4 bath & kitchenet...
17      Beauti

In [None]:
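#vec (a fitted vectorizer) and corpus (the text documents) must be defined before this cell can run
bag_of_words = vec.transform(corpus)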
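
The notebook does not show how `vec` and `corpus` are defined. One plausible reading, offered purely as an assumption, is a scikit-learn `CountVectorizer` fitted on the description text. The sketch below uses scikit-learn's built-in English stop-word list rather than the NLTK list built earlier, and the `max_features` value is arbitrary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy descriptions (invented); None stands in for a missing description
descriptions = pd.Series([
    'Quiet room near downtown',
    'Bright room with a full kitchen',
    None,
])

corpus = descriptions.fillna('')           # the vectorizer cannot handle NaN
vec = CountVectorizer(stop_words='english', max_features=1000)
bag_of_words = vec.fit_transform(corpus)   # sparse matrix: documents x vocabulary

print(bag_of_words.shape)
print(sorted(vec.vocabulary_))
```

Each row of `bag_of_words` is a word-count vector for one listing, which could then be joined to the other numeric features.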

In [None]:
#make a copy to check for unstructured data


In [25]:
#drop unstructured data like 'summary', 'description', 'neighborhood_overview','transit',

TypeError: object of type 'float' has no len()
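
The cell that raised this error is not shown, but one common cause is calling `len()` on a text column that contains missing values: pandas stores a missing string as NaN, which is a float. The sketch below, on invented data, shows two ways around it.

```python
import numpy as np
import pandas as pd

text = pd.Series(['Bright & airy apartment', np.nan])

# len(np.nan) raises "TypeError: object of type 'float' has no len()",
# so either fill the gaps first or use the NaN-aware .str accessor
lengths_filled = text.fillna('').apply(len)   # missing text counts as length 0
lengths_nan = text.str.len()                  # missing text stays NaN
print(lengths_filled.tolist())
```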

It could also be that columns like 'host_name' or 'host_about' are not particularly predictive of the rating, as they are unstructured and not indicative of any characteristics of the home.

There are also redundant details like 'longitude' and 'latitude', and 

In [14]:
drop_cols=['listing_url','scrape_id','last_scraped','thumbnail_url',
       'medium_url', 'picture_url', 'xl_picture_url', 'host_id',
       'host_url', 'host_name','host_about','host_thumbnail_url',
       'host_picture_url','country_code','latitude',
       'longitude','requires_license', 'jurisdiction_names']