# Text analytics on the amenities and host verifications features

This notebook performs text analytics and processing on two fields: `amenities` and `host_verifications`.
The preprocessing covers cleaning, tokenization, lemmatization, and transformation of the text data into vectors, which are fed into a classification model.

In [2]:
#imports 
import pandas as pd
import numpy as np 

import regex as re
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick 
import matplotlib.dates as mdates
from matplotlib.ticker import PercentFormatter, FuncFormatter
%matplotlib inline
import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
         'axes.labelsize': 'x-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'large',
         'ytick.labelsize':'large'}
pylab.rcParams.update(params)
from cycler import cycler

import seaborn as sns
sns.set()

import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from textacy import preprocessing
import textacy
from nltk.stem import *

import spacy
nlp = spacy.load('en_core_web_sm')

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# environment settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',None)

In [3]:
#read data
calendar = pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/calendar.csv')
listings =  pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/listings.csv')
reviews = pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/reviews.csv')

In [4]:
df = listings[['id','amenities']].copy()

In [5]:
df.dropna(axis=0,inplace=True)

In [6]:
df.shape

(3585, 2)

In [7]:
pd.set_option('display.width', None)
df.head()

Unnamed: 0,id,amenities
0,12147973,"{TV,""Wireless Internet"",Kitchen,""Free Parking ..."
1,3075044,"{TV,Internet,""Wireless Internet"",""Air Conditio..."
2,6976,"{TV,""Cable TV"",""Wireless Internet"",""Air Condit..."
3,1436513,"{TV,Internet,""Wireless Internet"",""Air Conditio..."
4,7651065,"{Internet,""Wireless Internet"",""Air Conditionin..."


Steps performed:<br>
We use only a bag-of-words model: for each listing id we count the amenities it lists. The resulting count matrix is fed into the linear model. We do not apply tf-idf, only term-frequency counts, and we use the counted features directly as binary variables. The vocabulary should therefore be fitted on the TRAIN set only, not on the TEST set. If the TEST set contains features that never appear in the TRAIN set, we need to decide how to handle them.
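A minimal sketch of the train-only fit described above, assuming toy amenity strings and a hypothetical helper `split_amenities` (neither is part of this notebook). With scikit-learn's `CountVectorizer`, tokens that were not seen during `fit` are simply ignored by `transform`, which is one possible way to handle TEST-only features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, purely illustrative
train = ["tv,kitchen,wireless internet", "tv,heating"]
test = ["tv,pool,kitchen"]  # 'pool' never appears in the train set


def split_amenities(text):
    # Hypothetical helper: split on commas and drop empty tokens
    return [tok.strip() for tok in text.split(",") if tok.strip()]


# binary=True yields 0/1 indicators instead of raw counts
vec = CountVectorizer(tokenizer=split_amenities, binary=True)
X_train = vec.fit_transform(train)  # vocabulary is learned from train only
X_test = vec.transform(test)        # unseen tokens such as 'pool' are dropped

vocab = sorted(vec.vocabulary_)
print(vocab)              # ['heating', 'kitchen', 'tv', 'wireless internet']
print(X_test.toarray())   # [[0 1 1 0]]
```

Dropping unseen tokens keeps the TEST matrix aligned with the columns the model was trained on. Whether that is acceptable depends on how often new amenities appear.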

### Text cleaning

#### Amenities

In [8]:
def text_cleaner_open(serie, words=None):
    '''
    input: pandas series, optional list of extra stop words
    output: lower-cased series with quotes, braces, brackets and stop words removed.'''
    serie = serie.astype(str).str.lower()
    # English stop words plus any caller-supplied extras (None avoids a mutable default)
    stop = set(stopwords.words('english') + (words or []))

    # regex=False makes the literal match explicit and avoids the pandas FutureWarning
    for char in ['"', '{', '}', '[', ']']:
        serie = serie.str.replace(char, '', regex=False)
    serie = serie.apply(lambda x: " ".join([word for word in x.split()
                                           if word not in stop]))
    return serie

In [9]:
#use textacy for text normalization and preprocessing - removal of accents, hyphens, quotes etc.
def normalize(text):
    text = preprocessing.normalize.hyphenated_words(text)
    text = preprocessing.normalize.unicode(text)
    text = preprocessing.normalize.quotation_marks(text)
    return text

In [10]:
df['amenities'] = df['amenities'].map(normalize)

In [11]:
df['amenities'] = text_cleaner_open(df['amenities'])



In [13]:
df.shape

(3585, 2)

In [14]:
df.head()

Unnamed: 0,id,amenities
0,12147973,"tv,wireless internet,kitchen,free parking prem..."
1,3075044,"tv,internet,wireless internet,air conditioning..."
2,6976,"tv,cable tv,wireless internet,air conditioning..."
3,1436513,"tv,internet,wireless internet,air conditioning..."
4,7651065,"internet,wireless internet,air conditioning,ki..."
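To make the effect of the cleaning visible on a single value, here is a small sketch that mirrors the steps of `text_cleaner_open`. The raw string is an illustrative example rather than a row copied from the data, and `STOP` is a hypothetical stand-in for NLTK's English stop-word list so that the snippet runs without downloading the corpus.

```python
import pandas as pd

# Hypothetical mini stop-word list standing in for stopwords.words('english')
STOP = {"on", "in", "the", "for"}

# Illustrative raw value in the same shape as the amenities column
raw = pd.Series(['{TV,"Wireless Internet",Kitchen,"Free Parking on Premises"}'])

cleaned = raw.str.lower()
for char in ['"', '{', '}', '[', ']']:
    # literal replacement of each symbol
    cleaned = cleaned.str.replace(char, '', regex=False)

# Stop words are removed per whitespace-separated word, so 'on' disappears
# from inside the multi-word amenity while the commas stay in place
cleaned = cleaned.apply(lambda s: " ".join(w for w in s.split() if w not in STOP))

print(cleaned[0])  # tv,wireless internet,kitchen,free parking premises
```

This is why the vocabulary further down contains entries such as `free parking premises` rather than the original amenity name.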


In [18]:
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(','))
vectorizer.fit(df['amenities'])

CountVectorizer(tokenizer=<function <lambda> at 0x7fa0e00c3dc0>)

In [19]:
vectorizer.get_feature_names()

['',
 '24-hour check-in',
 'air conditioning',
 'breakfast',
 'buzzer/wireless intercom',
 'cable tv',
 'carbon monoxide detector',
 'cat(s)',
 'dog(s)',
 'doorman',
 'dryer',
 'elevator building',
 'essentials',
 'family/kid friendly',
 'fire extinguisher',
 'first aid kit',
 'free parking premises',
 'free parking street',
 'gym',
 'hair dryer',
 'hangers',
 'heating',
 'hot tub',
 'indoor fireplace',
 'internet',
 'iron',
 'kitchen',
 'laptop friendly workspace',
 'lock bedroom door',
 'other pet(s)',
 'paid parking premises',
 'pets allowed',
 'pets live property',
 'pool',
 'safety card',
 'shampoo',
 'smoke detector',
 'smoking allowed',
 'suitable events',
 'translation missing: en.hosting_amenity_49',
 'translation missing: en.hosting_amenity_50',
 'tv',
 'washer',
 'washer / dryer',
 'wheelchair accessible',
 'wireless internet']

In [20]:
dt = vectorizer.transform(df['amenities'])

In [21]:
dt.shape

(3585, 46)

In [22]:
# reuse df's index so the join stays aligned even if dropna removed rows
df = df.join(pd.DataFrame(dt.toarray(), columns=vectorizer.get_feature_names(), index=df.index))

In [23]:
df.head()

Unnamed: 0,id,amenities,Unnamed: 3,24-hour check-in,air conditioning,breakfast,buzzer/wireless intercom,cable tv,carbon monoxide detector,cat(s),dog(s),doorman,dryer,elevator building,essentials,family/kid friendly,fire extinguisher,first aid kit,free parking premises,free parking street,gym,hair dryer,hangers,heating,hot tub,indoor fireplace,internet,iron,kitchen,laptop friendly workspace,lock bedroom door,other pet(s),paid parking premises,pets allowed,pets live property,pool,safety card,shampoo,smoke detector,smoking allowed,suitable events,translation missing: en.hosting_amenity_49,translation missing: en.hosting_amenity_50,tv,washer,washer / dryer,wheelchair accessible,wireless internet
0,12147973,"tv,wireless internet,kitchen,free parking prem...",0,0,0,0,0,0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,1,0,0,1
1,3075044,"tv,internet,wireless internet,air conditioning...",0,0,1,0,0,0,1,0,1,0,1,0,1,1,1,0,0,0,0,1,1,1,0,0,1,1,1,0,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1
2,6976,"tv,cable tv,wireless internet,air conditioning...",0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,1,1,0,0,1,1,1,1,0,0,1
3,1436513,"tv,internet,wireless internet,air conditioning...",0,0,1,1,0,0,1,0,0,0,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,1,1,0,0,1
4,7651065,"internet,wireless internet,air conditioning,ki...",0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1


In [24]:
df.shape

(3585, 48)

In [79]:
#tfidf = TfidfTransformer()
#tfidf_dt = tfidf.fit_transform(dt)
#pd.DataFrame(tfidf_dt.toarray(), columns=vectorizer.get_feature_names())
#pd.DataFrame(cosine_similarity(tfidf_dt, tfidf_dt))

#### Host verifications

In [28]:
def text_cleaner_open_verifications(serie, words=None):
    '''
    input: pandas series, optional list of extra stop words
    output: lower-cased, stripped series with quotes, braces, brackets and stop words removed.'''
    serie = serie.astype(str).str.lower()
    # English stop words plus any caller-supplied extras (None avoids a mutable default)
    stop = set(stopwords.words('english') + (words or []))

    # regex=False makes the literal match explicit and avoids the pandas FutureWarning
    for char in ["'", '{', '}', '[', ']']:
        serie = serie.str.replace(char, '', regex=False)
    serie = serie.str.strip()
    serie = serie.apply(lambda x: " ".join([word for word in x.split()
                                           if word not in stop]))
    return serie

In [29]:
df1 = listings[['id','host_verifications']].copy()

In [30]:
df1['host_verifications'] = df1['host_verifications'].map(normalize)
df1['host_verifications'] = text_cleaner_open_verifications(df1['host_verifications'])



In [31]:
df1.head()

Unnamed: 0,id,host_verifications
0,12147973,"email, phone, facebook, reviews"
1,3075044,"email, phone, facebook, linkedin, amex, review..."
2,6976,"email, phone, reviews, jumio"
3,1436513,"email, phone, reviews"
4,7651065,"email, phone, reviews, kba"


In [33]:
# reuse the same vectorizer (comma tokenizer); refitting replaces the amenities vocabulary
vectorizer.fit(df1['host_verifications'])

In [34]:
vectorizer.get_feature_names()

['',
 ' amex',
 ' facebook',
 ' google',
 ' jumio',
 ' kba',
 ' linkedin',
 ' manual_offline',
 ' manual_online',
 ' phone',
 ' reviews',
 ' sent_id',
 ' weibo',
 'email',
 'facebook',
 'phone']
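The leading spaces in the vocabulary above come from splitting `"email, phone"` on `','` alone, which is why `' phone'` and `'phone'` end up as separate features. A possible alternative, sketched here on toy strings with a hypothetical tokenizer `split_and_strip`, is to strip each token and drop empty ones before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_and_strip(text):
    # Hypothetical tokenizer: split on commas, trim whitespace, skip empty tokens
    return [tok.strip() for tok in text.split(",") if tok.strip()]


# Toy strings shaped like the cleaned host_verifications values
samples = [
    "email, phone, facebook, reviews",
    "phone",
    "email, phone, reviews, kba",
]

vec = CountVectorizer(tokenizer=split_and_strip, binary=True)
vec.fit(samples)

features = sorted(vec.vocabulary_)
print(features)  # ['email', 'facebook', 'kba', 'phone', 'reviews']
```

With this tokenizer neither the empty-string feature nor the space-prefixed duplicates would appear, so the manual merging of the phone columns later on would not be needed.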

In [36]:
dt1 = vectorizer.transform(df1['host_verifications'])

In [37]:
dt1.shape

(3585, 16)

In [38]:
df1 = df1.join(pd.DataFrame(dt1.toarray(), columns=vectorizer.get_feature_names()))

In [39]:
df1.head()

Unnamed: 0,id,host_verifications,Unnamed: 3,amex,facebook,google,jumio,kba,linkedin,manual_offline,manual_online,phone,reviews,sent_id,weibo,email,facebook.1,phone.1
0,12147973,"email, phone, facebook, reviews",0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0
1,3075044,"email, phone, facebook, linkedin, amex, review...",0,1,1,0,1,0,1,0,0,1,1,0,0,1,0,0
2,6976,"email, phone, reviews, jumio",0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0
3,1436513,"email, phone, reviews",0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0
4,7651065,"email, phone, reviews, kba",0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0


In [55]:
# drop the empty-token column (third column) produced by the comma tokenizer
df.drop(columns=[''], inplace=True)

In [59]:
df1.columns

Index(['id', 'host_verifications', '', ' amex', ' facebook', ' google',
       ' jumio', ' kba', ' linkedin', ' manual_offline', ' manual_online',
       ' phone', ' reviews', ' sent_id', ' weibo', 'email', 'facebook',
       'phone'],
      dtype='object')

In [61]:
df1.drop([''],axis=1,inplace=True)

In [62]:
df1.columns

Index(['id', 'host_verifications', ' amex', ' facebook', ' google', ' jumio',
       ' kba', ' linkedin', ' manual_offline', ' manual_online', ' phone',
       ' reviews', ' sent_id', ' weibo', 'email', 'facebook', 'phone'],
      dtype='object')

In [None]:
df1[' phone']

In [63]:
df1['phone_all']=df1[' phone']
df1.loc[df1['phone']==1,'phone_all']=1

In [64]:
df1['phone_all'].sum()

3575

In [66]:
df1.drop([' phone','phone'],axis=1,inplace=True)

In [67]:
df1.head()

Unnamed: 0,id,host_verifications,amex,facebook,google,jumio,kba,linkedin,manual_offline,manual_online,reviews,sent_id,weibo,email,facebook.1,phone_all
0,12147973,"email, phone, facebook, reviews",0,1,0,0,0,0,0,0,1,0,0,1,0,1
1,3075044,"email, phone, facebook, linkedin, amex, review...",1,1,0,1,0,1,0,0,1,0,0,1,0,1
2,6976,"email, phone, reviews, jumio",0,0,0,1,0,0,0,0,1,0,0,1,0,1
3,1436513,"email, phone, reviews",0,0,0,0,0,0,0,0,1,0,0,1,0,1
4,7651065,"email, phone, reviews, kba",0,0,0,0,1,0,0,0,1,0,0,1,0,1
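The table above still carries two facebook columns (shown as `facebook` and `facebook.1`), for the same reason the phone columns had to be merged. As a sketch only, assuming a toy frame named `flags` that is not part of this notebook, all columns whose names differ only by surrounding whitespace can be merged in one step by grouping on the stripped name and taking the maximum.

```python
import pandas as pd

# Toy indicator frame with whitespace-variant duplicate columns (illustrative)
flags = pd.DataFrame({
    " phone": [1, 0, 0],
    "phone": [0, 1, 0],
    " facebook": [1, 0, 0],
    "facebook": [0, 0, 1],
    "email": [1, 1, 0],
})

# Transpose so column names become the index, group on the stripped name,
# take the max across each group (1 if any variant is 1), transpose back
merged = flags.T.groupby(lambda name: name.strip()).max().T

print(list(merged.columns))      # ['email', 'facebook', 'phone']
print(merged["phone"].tolist())  # [1, 1, 0]
```

Taking the maximum is equivalent to a logical OR for 0/1 indicators, which matches what the `phone_all` construction does for the phone columns.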


Once the transformations are finalized, I will move them into the common notebook, since the transformations and feature extraction should be fitted only on the TRAINING SET.