# Table of Contents
1. [Data Wrangling](#Data_wrangling)<br>
    1a. [Dataframe cleaning](#cleaning)<br>
    1b. [Text preprocessing](#text)<br>

2. [EDA](#EDA)<br>

In [1]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams

<a id='Data_wrangling'></a>

## Data Wrangling

The dataset was obtained at http://insideairbnb.com/get-the-data.html. The detailed Los Angeles listing and calendar dataset were used. <br>

The data wrangling is straight forward in this dataset. First, features unrelated to target feature (number of reviews) and features with insufficient data were removed. Second, the data were clean to have proper format and length of time as host column was created. Then finally the NLP of the text data were conducted.

In [2]:
#load data into a dataframe
path1 = 'Dataset/listings.csv'
path2 = 'Dataset/calendar.csv'
dataset1 = pd.read_csv(path1, index_col = None, parse_dates = ['last_scraped', 'host_since'])
dataset2 = pd.read_csv(path2, index_col = None, parse_dates = ['date'])

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
dataset1.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,109,https://www.airbnb.com/rooms/109,20190914032935,2019-09-14,Amazing bright elegant condo park front *UPGRA...,"*** Unit upgraded with new bamboo flooring, br...","*** Unit upgraded with new bamboo flooring, br...","*** Unit upgraded with new bamboo flooring, br...",none,,...,f,f,strict_14_with_grace_period,t,f,1,1,0,0,0.02
1,344,https://www.airbnb.com/rooms/344,20190914032935,2019-09-14,Family perfect;Pool;Near Studios!,This home is perfect for families; aspiring ch...,"Cheerful & comfortable; near studios, amusemen...",This home is perfect for families; aspiring ch...,none,Quiet-yet-close to all the fun in LA! Hollywoo...,...,t,f,flexible,f,f,1,1,0,0,0.15
2,2708,https://www.airbnb.com/rooms/2708,20190914032935,2019-09-14,Fireplace Mirrored Mini Suit (Website hidden b...,Our best memory foam pillows you'll ever sleep...,Flickering fireplace. Blendtec® Designer 625 ...,Our best memory foam pillows you'll ever sleep...,none,We are minutes away from the Mentor Language I...,...,t,f,strict_14_with_grace_period,f,f,2,0,2,0,0.33
3,2732,https://www.airbnb.com/rooms/2732,20190914032935,2019-09-14,Zen Life at the Beach,,This is a three story townhouse with the follo...,This is a three story townhouse with the follo...,none,,...,f,f,strict_14_with_grace_period,f,f,2,1,1,0,0.19
4,2864,https://www.airbnb.com/rooms/2864,20190914032935,2019-09-14,*Upscale Professional Home with Beautiful Studio*,Centrally located.... Furnished with 42 inch S...,The space is furnished with Thomasville furnit...,Centrally located.... Furnished with 42 inch S...,none,What makes the neighborhood unique is that the...,...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,


<a id='cleaning'></a>

## 1a. Dataframe Cleaning

In [4]:
#get info on dataframe1
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45053 entries, 0 to 45052
Columns: 106 entries, id to reviews_per_month
dtypes: datetime64[ns](2), float64(23), int64(21), object(60)
memory usage: 36.4+ MB


In [5]:
#list out all the columns in Dataframe
print(list(dataset1.columns))

['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space', 'description', 'experiences_offered', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet', 'price', 'weekly_price', 'monthly_price', '

In [6]:
#features were removed based on two reasons: 
#1. insufficient data
#2. no apparent relations to target feature
dataset1 = dataset1.drop(columns = ['id', 'listing_url', 'scrape_id', 'summary', 'space', 'neighborhood_overview', 
                                    'notes', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 
                                    'picture_url', 'xl_picture_url', 'host_url', 'host_name', 'host_location',
                                    'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate',
                                    'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count',
                                    'host_verifications', 'street', 'neighbourhood', 'neighbourhood_group_cleansed', 'city',
                                    'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude',
                                    'longitude', 'is_location_exact', 'square_feet', 'monthly_price', 'minimum_nights',
                                    'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
                                    'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm',
                                    'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30',
                                    'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped',
                                    'number_of_reviews_ltm', 'first_review', 'last_review', 'review_scores_accuracy',
                                    'review_scores_checkin', 'review_scores_checkin', 'review_scores_location',
                                    'review_scores_value', 'requires_license', 'license', 'jurisdiction_names',
                                    'is_business_travel_ready', 'cancellation_policy', 'require_guest_phone_verification',
                                    'require_guest_profile_picture', 'calculated_host_listings_count',
                                    'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms',
                                    'calculated_host_listings_count_shared_rooms','host_is_superhost', 'experiences_offered',
                                    'reviews_per_month', 'host_total_listings_count'])

In [7]:
#info after removing unused columns
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45053 entries, 0 to 45052
Data columns (total 28 columns):
last_scraped                   45053 non-null datetime64[ns]
name                           45047 non-null object
description                    43731 non-null object
transit                        27633 non-null object
host_id                        45053 non-null int64
host_since                     45037 non-null datetime64[ns]
host_has_profile_pic           45037 non-null object
host_identity_verified         45037 non-null object
neighbourhood_cleansed         45053 non-null object
property_type                  45053 non-null object
room_type                      45053 non-null object
accommodates                   45053 non-null int64
bathrooms                      45034 non-null float64
bedrooms                       45007 non-null float64
beds                           44982 non-null float64
bed_type                       45053 non-null object
amenities                 

In [8]:
def binary_modifier(data):
    """if data is null or blank string, return False
    otherwise return True"""
    if pd.isnull(data) or data == '' or data == 0:
        return False
    else:
        return True

In [9]:
#feature engineering columns
#we like to modify transit, weekly price, security deposit, and cleaning fee 
#to rather there is a discount for a full week rent, requires security deposit, or cleaning fee
#or rather there is information on transportation
dataset1.transit = dataset1.transit.apply(lambda x: binary_modifier(x))
dataset1.weekly_price = dataset1.weekly_price.apply(lambda x: binary_modifier(x))
dataset1.security_deposit = dataset1.security_deposit.apply(lambda x: binary_modifier(x))
dataset1.cleaning_fee = dataset1.cleaning_fee.apply(lambda x: binary_modifier(x))

#create column to measure length for which the home was put up on airbnb
dataset1.time_as_host = dataset1.last_scraped-dataset1.host_since
dataset1 = dataset1.drop(columns = ['last_scraped', 'host_since'])

#convert price feature to an int object
dataset1.price = dataset1.price.replace(to_replace = r'\$',value = '', regex = True)
dataset1.price = dataset1.price.replace(to_replace = r'\,',value = '', regex = True)
dataset1.price = dataset1.price.apply(lambda x: int(float(x)))

#convert extra people column to an int object
dataset1.extra_people = dataset1.extra_people.replace(to_replace = r'\$',value = '', regex = True)
dataset1.extra_people = dataset1.extra_people.replace(to_replace = r'\,',value = '', regex = True)
dataset1.extra_people = dataset1.extra_people.apply(lambda x: int(float(x)))

#if there isn't a review score for rating, cleanliness or communication, we can safely assume
#there isn't any number of review. So we fill null values to 0
dataset1.review_scores_rating = dataset1.review_scores_rating.fillna(0)
dataset1.review_scores_cleanliness = dataset1.review_scores_cleanliness.fillna(0)
dataset1.review_scores_communication = dataset1.review_scores_communication.fillna(0)

#drop all the other null values
dataset1 = dataset1.dropna()

  # This is added back by InteractiveShellApp.init_path()


<a id='text'></a>

## 1b. Text Preprocessing 

In [10]:
#NLP steps: 1. lowercase
#           2. tokenize
#           3. lemmatize
#           4. remove stopwords and make sure all character is alphabet


stopword = set(stopwords.words('english'))

def nlp(data):
    """nlp for sentences"""
    #lowercase all characters
    data = data.lower()
    
    #tokenized sentences
    data = word_tokenize(data)
    
    #lemmatize characters
    wnl = WordNetLemmatizer()
    data = [wnl.lemmatize(word) for word in data]
    
    #remove stopwords and contains only alphabets
    data = [word for word in data if word not in stopword and word.isalpha()]
    
    return data

#apply nlp on description and name
dataset1.name = dataset1.name.apply(lambda x: nlp(x))
dataset1.description = dataset1.description.apply(lambda x: nlp(x))

In [11]:
#create bigram features
dataset1['bigram_name'] = dataset1.name.apply(lambda x: list(ngrams(x, 2)))
dataset1['bigram_description'] = dataset1.description.apply(lambda x: list(ngrams(x, 2)))

In [12]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43621 entries, 0 to 45051
Data columns (total 28 columns):
name                           43621 non-null object
description                    43621 non-null object
transit                        43621 non-null bool
host_id                        43621 non-null int64
host_has_profile_pic           43621 non-null object
host_identity_verified         43621 non-null object
neighbourhood_cleansed         43621 non-null object
property_type                  43621 non-null object
room_type                      43621 non-null object
accommodates                   43621 non-null int64
bathrooms                      43621 non-null float64
bedrooms                       43621 non-null float64
beds                           43621 non-null float64
bed_type                       43621 non-null object
amenities                      43621 non-null object
price                          43621 non-null int64
weekly_price                   43621 non-null

In [13]:
dataset2.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,109,2019-09-14,f,$115.00,$115.00,30,730
1,109,2019-09-15,f,$115.00,$115.00,30,730
2,109,2019-09-16,f,$115.00,$115.00,30,730
3,109,2019-09-17,f,$115.00,$115.00,30,730
4,109,2019-09-18,f,$115.00,$115.00,30,730


In [14]:
dataset2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16444345 entries, 0 to 16444344
Data columns (total 7 columns):
listing_id        int64
date              datetime64[ns]
available         object
price             object
adjusted_price    object
minimum_nights    int64
maximum_nights    int64
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 878.2+ MB


In [15]:
dataset2 = dataset2.dropna()
dataset2.price = dataset2.price.replace(to_replace = r'\$',value = '', regex = True)
dataset2.price = dataset2.price.replace(to_replace = r'\,',value = '', regex = True)
dataset2.price = dataset2.price.apply(lambda x: int(float(x)))