# Table of Contents
Data Wrangling<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a. [Dataframe cleaning](#cleaning)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b. [Text preprocessing](#text)<br>

In [1]:
from collections import Counter
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from langdetect import detect

## 1. Data Wrangling

The dataset was obtained from http://insideairbnb.com/get-the-data.html. The detailed Los Angeles listings and calendar datasets were used. <br>

The data wrangling for this dataset is straightforward. First, features unrelated to the target feature (number of reviews) and features with insufficient data were removed. Second, the data were cleaned into proper formats, and a column measuring the length of time as a host was created. Finally, the text data were preprocessed for NLP.

In [2]:
#load data into a dataframe
path1 = '../Data/Raw/listings.csv'
df = pd.read_csv(path1, index_col = None, parse_dates = ['last_scraped', 'host_since'])



In [3]:
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,109,https://www.airbnb.com/rooms/109,20190914032935,2019-09-14,Amazing bright elegant condo park front *UPGRA...,"*** Unit upgraded with new bamboo flooring, br...","*** Unit upgraded with new bamboo flooring, br...","*** Unit upgraded with new bamboo flooring, br...",none,,...,f,f,strict_14_with_grace_period,t,f,1,1,0,0,0.02
1,344,https://www.airbnb.com/rooms/344,20190914032935,2019-09-14,Family perfect;Pool;Near Studios!,This home is perfect for families; aspiring ch...,"Cheerful & comfortable; near studios, amusemen...",This home is perfect for families; aspiring ch...,none,Quiet-yet-close to all the fun in LA! Hollywoo...,...,t,f,flexible,f,f,1,1,0,0,0.15
2,2708,https://www.airbnb.com/rooms/2708,20190914032935,2019-09-14,Fireplace Mirrored Mini Suit (Website hidden b...,Our best memory foam pillows you'll ever sleep...,Flickering fireplace. Blendtec® Designer 625 ...,Our best memory foam pillows you'll ever sleep...,none,We are minutes away from the Mentor Language I...,...,t,f,strict_14_with_grace_period,f,f,2,0,2,0,0.33
3,2732,https://www.airbnb.com/rooms/2732,20190914032935,2019-09-14,Zen Life at the Beach,,This is a three story townhouse with the follo...,This is a three story townhouse with the follo...,none,,...,f,f,strict_14_with_grace_period,f,f,2,1,1,0,0.19
4,2864,https://www.airbnb.com/rooms/2864,20190914032935,2019-09-14,*Upscale Professional Home with Beautiful Studio*,Centrally located.... Furnished with 42 inch S...,The space is furnished with Thomasville furnit...,Centrally located.... Furnished with 42 inch S...,none,What makes the neighborhood unique is that the...,...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,


<a id='cleaning'></a>

## 1a. Dataframe Cleaning

In [4]:
#get info on the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45053 entries, 0 to 45052
Columns: 106 entries, id to reviews_per_month
dtypes: datetime64[ns](2), float64(23), int64(21), object(60)
memory usage: 36.4+ MB


In [5]:
#list out all the columns in Dataframe
print(list(df.columns))

['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space', 'description', 'experiences_offered', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet', 'price', 'weekly_price', 'monthly_price', '

In [6]:
#features were removed for two reasons:
#1. insufficient data
#2. no apparent relation to the target feature
df = df.drop(columns = ['id', 'listing_url', 'scrape_id', 'summary', 'space', 'neighborhood_overview', 
                                    'notes', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 
                                    'picture_url', 'xl_picture_url', 'host_url', 'host_name', 'host_location',
                                    'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate',
                                    'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count',
                                    'host_verifications', 'street', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 
                                    'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 
                                    'latitude', 'longitude', 'is_location_exact', 'square_feet', 'monthly_price', 
                                    'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
                                    'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm',
                                    'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30',
                                    'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped',
                                    'number_of_reviews_ltm', 'first_review', 'last_review', 'review_scores_accuracy',
                                    'review_scores_checkin', 'review_scores_location',
                                    'review_scores_value', 'requires_license', 'license', 'jurisdiction_names',
                                    'is_business_travel_ready', 'cancellation_policy', 'require_guest_phone_verification',
                                    'require_guest_profile_picture', 'calculated_host_listings_count',
                                    'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms',
                                    'calculated_host_listings_count_shared_rooms', 'experiences_offered',
                                    'reviews_per_month', 'host_total_listings_count', 'host_id'])
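
One way to spot columns with insufficient data before writing such a drop list is to rank columns by their share of missing values. The sketch below runs on a toy frame and is only an illustration, not the procedure used to build the list above; the helper name `sparse_columns`, the 50% cutoff, and the sample values are all assumptions.

```python
import numpy as np
import pandas as pd

def sparse_columns(frame, threshold=0.5):
    """Return the columns whose share of missing values exceeds threshold."""
    null_share = frame.isnull().mean()
    return null_share[null_share > threshold].index.tolist()

# toy data standing in for the listings dataframe (values are made up)
toy = pd.DataFrame({
    'square_feet': [np.nan, np.nan, np.nan, 800.0],
    'price': ['$100.00', '$85.00', '$120.00', '$60.00'],
    'license': [None, None, 'placeholder', None],
})

print(sparse_columns(toy))
```

Columns flagged this way would still need a manual check, since a sparse column can be informative and a complete column can be unrelated to the target.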

In [7]:
#info after removing unused columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45053 entries, 0 to 45052
Data columns (total 28 columns):
last_scraped                   45053 non-null datetime64[ns]
name                           45047 non-null object
description                    43731 non-null object
transit                        27633 non-null object
host_since                     45037 non-null datetime64[ns]
host_is_superhost              45037 non-null object
host_has_profile_pic           45037 non-null object
host_identity_verified         45037 non-null object
neighbourhood                  42793 non-null object
property_type                  45053 non-null object
room_type                      45053 non-null object
accommodates                   45053 non-null int64
bathrooms                      45034 non-null float64
bedrooms                       45007 non-null float64
beds                           44982 non-null float64
bed_type                       45053 non-null object
amenities                

In [8]:
def binary_modifier(data):
    """if data is null, a blank string, or 0, return False
    otherwise return True"""
    if pd.isnull(data) or data == '' or data == 0:
        return False
    else:
        return True
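
To make the behaviour of `binary_modifier` concrete, the sketch below applies a copy of the same logic to a few hand-made values (the samples are assumptions, not rows from the dataset). The check is on presence rather than amount, so a non-empty string such as `'$0.00'` would still be flagged `True` if the raw column contained one.

```python
import numpy as np
import pandas as pd

def binary_modifier(data):
    """Same logic as the cell above: False for null, '' or 0, True otherwise."""
    return not (pd.isnull(data) or data == '' or data == 0)

# hand-made samples covering each branch of the check
samples = pd.Series([np.nan, '', 0, '$150.00', '$0.00', 'Bus stop nearby'], dtype=object)
flags = samples.apply(binary_modifier)

print(flags.tolist())
```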

In [9]:
#feature engineering columns
#convert transit, weekly price, security deposit, and cleaning fee to booleans that flag
#whether transportation information is provided, whether a weekly price (a discount
#for a full week's rent) is offered, and whether a security deposit or cleaning fee is required
df.transit = df.transit.apply(binary_modifier)
df.security_deposit = df.security_deposit.apply(binary_modifier)
df.cleaning_fee = df.cleaning_fee.apply(binary_modifier)
df.weekly_price = df.weekly_price.apply(binary_modifier)


#create column measuring how long (in days) the host had been on airbnb when the data was scraped
df['time_as_host'] = df.last_scraped-df.host_since
df.time_as_host = df.time_as_host.apply(lambda x: x.days)
df = df.drop(columns = ['last_scraped', 'host_since'])

#convert price feature from a currency string to an integer
df.price = df.price.replace(to_replace = r'[\$,]', value = '', regex = True)
df.price = df.price.apply(lambda x: int(float(x)))

#convert extra people column from a currency string to an integer
df.extra_people = df.extra_people.replace(to_replace = r'[\$,]', value = '', regex = True)
df.extra_people = df.extra_people.apply(lambda x: int(float(x)))

#if a listing has no review score for rating, cleanliness, or communication, we can safely
#assume it has no reviews yet, so we fill those null values with 0
df.review_scores_rating = df.review_scores_rating.fillna(0)
df.review_scores_cleanliness = df.review_scores_cleanliness.fillna(0)
df.review_scores_communication = df.review_scores_communication.fillna(0)

#drop all the other null values
df = df.dropna()
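
The currency and date conversions in this cell can also be written with vectorized pandas accessors. The following is a minimal sketch on a toy frame, assuming the raw values look like `'$1,250.00'`; the sample rows are invented and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# toy rows standing in for the listings dataframe (values are made up)
toy = pd.DataFrame({
    'price': ['$1,250.00', '$85.00'],
    'last_scraped': pd.to_datetime(['2019-09-14', '2019-09-14']),
    'host_since': pd.to_datetime(['2019-09-04', '2018-09-14']),
})

# strip '$' and ',' in one pass, then convert through float to int
toy['price'] = (toy['price']
                .str.replace(r'[$,]', '', regex=True)
                .astype(float)
                .astype(int))

# .dt.days extracts whole days from a timedelta column
toy['time_as_host'] = (toy['last_scraped'] - toy['host_since']).dt.days

print(toy[['price', 'time_as_host']])
```

Going through `float` first matters because `int('85.00')` raises a `ValueError`, while `int(float('85.00'))` does not.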

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41513 entries, 0 to 45051
Data columns (total 27 columns):
name                           41513 non-null object
description                    41513 non-null object
transit                        41513 non-null bool
host_is_superhost              41513 non-null object
host_has_profile_pic           41513 non-null object
host_identity_verified         41513 non-null object
neighbourhood                  41513 non-null object
property_type                  41513 non-null object
room_type                      41513 non-null object
accommodates                   41513 non-null int64
bathrooms                      41513 non-null float64
bedrooms                       41513 non-null float64
beds                           41513 non-null float64
bed_type                       41513 non-null object
amenities                      41513 non-null object
price                          41513 non-null int64
weekly_price                   41513 non-nul

<a id='text'></a>

## 1b. Text Preprocessing 

In [11]:
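#detect the language of each listing name
def detector(phrases):
    """return the language code detected by langdetect, or '' if detection fails"""
    try:
        return detect(phrases)
    except Exception:
        return ''

df['language'] = df.name.apply(detector)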

In [12]:
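#remove all non-english listings
df = df[df.language == 'en']
#note: drop returns a new dataframe and the result is not assigned here,
#so this only displays the frame without the language column; df itself keeps it
df.drop(columns = ['language'])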

Unnamed: 0,name,description,transit,host_is_superhost,host_has_profile_pic,host_identity_verified,neighbourhood,property_type,room_type,accommodates,...,security_deposit,cleaning_fee,guests_included,extra_people,number_of_reviews,review_scores_rating,review_scores_cleanliness,review_scores_communication,instant_bookable,time_as_host
0,Amazing bright elegant condo park front *UPGRA...,"*** Unit upgraded with new bamboo flooring, br...",False,f,t,t,Culver City,Condominium,Entire home/apt,6,...,True,True,3,25,2,80.0,10.0,8.0,f,4096.0
1,Family perfect;Pool;Near Studios!,This home is perfect for families; aspiring ch...,True,f,t,t,Burbank,House,Entire home/apt,6,...,True,True,6,0,6,93.0,10.0,10.0,t,4082.0
2,Fireplace Mirrored Mini Suit (Website hidden b...,Our best memory foam pillows you'll ever sleep...,True,t,t,t,Hollywood,Apartment,Private room,1,...,True,True,1,0,21,98.0,10.0,10.0,t,4015.0
3,Zen Life at the Beach,This is a three story townhouse with the follo...,False,t,t,f,Santa Monica,Apartment,Private room,1,...,False,True,1,0,19,96.0,9.0,10.0,f,4014.0
4,*Upscale Professional Home with Beautiful Studio*,Centrally located.... Furnished with 42 inch S...,True,f,t,t,Bellflower,Apartment,Entire home/apt,2,...,True,True,1,25,0,0.0,0.0,0.0,f,4006.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45045,Fully furnished one bedroom one month only,Hi nice and cozy one-bedroom apartment super b...,True,f,t,f,Sun Valley,Apartment,Entire home/apt,4,...,False,False,1,0,0,0.0,0.0,0.0,t,1.0
45046,Lakeridge 2 bedroom house in the hills,This miraculous pad will deliver endless hours...,False,f,t,f,Hollywood Hills,House,Entire home/apt,4,...,False,False,1,0,0,0.0,0.0,0.0,t,31.0
45047,UCLA Swimmers/shared house for study,Unbelievable opportunity for UCLA students and...,True,t,t,f,Brentwood,House,Shared room,2,...,True,True,1,100,0,0.0,0.0,0.0,f,2025.0
45049,Luxury Valley Room w/ Private Bath-Do Not Miss...,Luxury meets the Valley Come enjoy a beautiful...,False,t,t,t,Sun Valley,Apartment,Private room,2,...,False,True,1,0,0,0.0,0.0,0.0,t,1125.0
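
`DataFrame.drop` returns a new dataframe rather than modifying the one it is called on, so a bare `df.drop(...)` only affects what the cell displays. A minimal sketch of the difference on a toy frame (the rows are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Zen Life at the Beach', 'Casa cerca de la playa'],
    'language': ['en', 'es'],
})

# .copy() avoids SettingWithCopyWarning when columns are assigned later
english = toy[toy['language'] == 'en'].copy()

english.drop(columns=['language'])            # result discarded, english unchanged
print(list(english.columns))                  # ['name', 'language']

english = english.drop(columns=['language'])  # reassign to keep the change
print(list(english.columns))                  # ['name']
```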


In [13]:
#NLP steps: 1. lowercase
#           2. tokenize
#           3. lemmatize
#           4. remove stopwords and keep only alphabetic tokens


stopword = set(stopwords.words('english'))
#create the lemmatizer once instead of on every call
wnl = WordNetLemmatizer()

def nlp(data):
    """lowercase, tokenize, and lemmatize a sentence,
    then drop stopwords and non-alphabetic tokens"""
    #lowercase all characters
    data = data.lower()
    
    #tokenize the sentence into words
    data = word_tokenize(data)
    
    #lemmatize each token
    data = [wnl.lemmatize(word) for word in data]
    
    #remove stopwords and keep only alphabetic tokens
    data = [word for word in data if word not in stopword and word.isalpha()]
    
    return data

#apply nlp on description and name
df.name = df.name.apply(nlp)
df.description = df.description.apply(nlp)

In [14]:
#create bigram features
df['bigram_name'] = df.name.apply(lambda x: list(ngrams(x, 2)))
df['bigram_description'] = df.description.apply(lambda x: list(ngrams(x, 2)))
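
The bigram step pairs each token with its successor. The pure-Python sketch below reproduces that pairing with `zip` and tallies the pairs with `Counter`, which is one plausible way to use the resulting columns; the helper name `bigrams` and the token lists are assumptions made for illustration.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with the next one, like ngrams(tokens, 2)."""
    return list(zip(tokens, tokens[1:]))

# invented token lists standing in for preprocessed listing names
names = [
    ['cozy', 'private', 'room'],
    ['private', 'room', 'near', 'beach'],
    ['room'],  # fewer than two tokens yields no bigrams
]

# count every bigram across all names
counts = Counter(pair for tokens in names for pair in bigrams(tokens))

print(counts.most_common(1))  # [(('private', 'room'), 2)]
```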

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33875 entries, 0 to 45051
Data columns (total 30 columns):
name                           33875 non-null object
description                    33875 non-null object
transit                        33875 non-null bool
host_is_superhost              33875 non-null object
host_has_profile_pic           33875 non-null object
host_identity_verified         33875 non-null object
neighbourhood                  33875 non-null object
property_type                  33875 non-null object
room_type                      33875 non-null object
accommodates                   33875 non-null int64
bathrooms                      33875 non-null float64
bedrooms                       33875 non-null float64
beds                           33875 non-null float64
bed_type                       33875 non-null object
amenities                      33875 non-null object
price                          33875 non-null int64
weekly_price                   33875 non-nul

<a id='EDA'></a>

In [16]:
#save to processed folder
df.to_csv('../Data/Processed/listings.csv')
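
One thing to keep in mind when saving token lists and bigram tuples to CSV is that they are written as their string representation and come back as plain strings on reload. The sketch below shows the round trip on a toy frame and one way to restore the Python objects with `ast.literal_eval`; the in-memory buffer, the `index=False` argument, and the sample row are assumptions for illustration and differ from the call above.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    'name': [['zen', 'life', 'beach']],
    'bigram_name': [[('zen', 'life'), ('life', 'beach')]],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, 'name']))  # a str, not a list

# parse the string representation back into lists and tuples
for column in ['name', 'bigram_name']:
    reloaded[column] = reloaded[column].apply(ast.literal_eval)

print(reloaded.loc[0, 'bigram_name'])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.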

In [18]:
df.neighbourhood

0            Culver City
1                Burbank
2              Hollywood
3           Santa Monica
4             Bellflower
              ...       
45045         Sun Valley
45046    Hollywood Hills
45047          Brentwood
45049         Sun Valley
45051             Venice
Name: neighbourhood, Length: 33875, dtype: object