# Text analytics on the Description field

This notebook isolates the text preprocessing applied to the listing Description data.
The preprocessing covers cleaning, tokenization, lemmatization, and transformation of the text into numerical vectors, which are then fed into a classification model.<br>
The goal was to check whether there is a relationship between the description and the review score.
After preprocessing, the text is converted into TF-IDF vectors and passed to a linear support vector classifier (`LinearSVC`). The resulting metrics are poor, so I will not include the Description data among the features for predicting prices.<br>
This notebook is kept for reference and as a record of the attempt.


In [1]:
#imports 
import pandas as pd
import numpy as np 

import regex as re
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick 
import matplotlib.dates as mdates
from matplotlib.ticker import PercentFormatter, FuncFormatter
%matplotlib inline
import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
         'axes.labelsize': 'x-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'large',
         'ytick.labelsize':'large'}
pylab.rcParams.update(params)
from cycler import cycler

import seaborn as sns
sns.set()

import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from textacy import preprocessing
import textacy

import spacy
nlp = spacy.load('en_core_web_sm')

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# environment settings: show all columns and rows when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
#read data
calendar = pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/calendar.csv')
listings =  pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/listings.csv')
reviews = pd.read_csv('/Users/asyagadzhalova/Documents/GitHub/Boston-Airbnb-data/src/data/raw_data/reviews.csv')

In [3]:
df = listings[['id','description','review_scores_value']].copy()

In [4]:
df.dropna(axis=0,inplace=True)

In [5]:
df.shape

(2764, 3)

In [6]:
pd.set_option('display.width', None)
df.head()

| | id | description | review_scores_value |
|---|---|---|---|
| 1 | 3075044 | Charming and quiet room in a second floor 1910... | 9.0 |
| 2 | 6976 | Come stay with a friendly, middle-aged guy in ... | 10.0 |
| 3 | 1436513 | Come experience the comforts of home away from... | 10.0 |
| 4 | 7651065 | My comfy, clean and relaxing home is one block... | 10.0 |
| 5 | 12386020 | Super comfy bedroom plus your own bathroom in ... | 10.0 |


Steps performed:<br>
1. Data cleaning, consisting mainly of text normalization with the textacy module (the data is relatively clean, with little noise).
2. Linguistic processing: tokenization, POS tagging, and lemmatization, all done with the spaCy library and its processing pipeline.
3. Vectorization and classification: the built-in `TfidfVectorizer` transforms the text into TF-IDF vectors, which are fed into a support vector classifier for multi-class classification with the listing's review score as the target variable. The classification metrics are very poor, so I assume there is no meaningful relationship between the Description and the review score.<br>

I cannot think of another use for the Description data in this analysis. In principle, it could still be pulled into the overall model for predicting the price.

### Text cleaning

In [7]:
#use textacy for text normalization and preprocessing - hyphenated words, unicode, quotation marks
def normalize(text):
    text = preprocessing.normalize.hyphenated_words(text)
    text = preprocessing.normalize.unicode(text)
    text = preprocessing.normalize.quotation_marks(text)
    return text

In [8]:
df['clean_descr'] = df['description'].map(normalize)
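
To make the effect of these normalization steps concrete, here is a minimal stdlib-only sketch that approximates them. It is not textacy's implementation: the helper name `normalize_sketch`, the regular expression, and the small quote mapping are hypothetical and cover only a few common cases.

```python
import re
import unicodedata

# Hypothetical mapping of a few typographic quotes to ASCII (illustrative subset only).
FANCY_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def normalize_sketch(text):
    # Rejoin words that were hyphenated across a line break ("charm-\ning" -> "charming").
    text = re.sub(r"(\w{2,})-\s*\n\s*(\w{2,})", r"\1\2", text)
    # Compose characters into canonical NFC form ("e" + combining accent -> one code point).
    text = unicodedata.normalize("NFC", text)
    # Replace curly quotes with their ASCII equivalents.
    return text.translate(str.maketrans(FANCY_QUOTES))

sample = "A \u201ccharm-\ning\u201d cafe\u0301 that\u2019s close by"
print(normalize_sketch(sample))
```

In this sketch a regular compound such as "middle-aged" is left untouched, because the hyphen is not followed by a line break.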

In [9]:
df.head()

| | id | description | review_scores_value | clean_descr |
|---|---|---|---|---|
| 1 | 3075044 | Charming and quiet room in a second floor 1910... | 9.0 | Charming and quiet room in a second floor 1910... |
| 2 | 6976 | Come stay with a friendly, middle-aged guy in ... | 10.0 | Come stay with a friendly, middle-aged guy in ... |
| 3 | 1436513 | Come experience the comforts of home away from... | 10.0 | Come experience the comforts of home away from... |
| 4 | 7651065 | My comfy, clean and relaxing home is one block... | 10.0 | My comfy, clean and relaxing home is one block... |
| 5 | 12386020 | Super comfy bedroom plus your own bathroom in ... | 10.0 | Super comfy bedroom plus your own bathroom in ... |


In [10]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f917b00f160>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f917b00f040>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f917aff0a50>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f917b021800>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f91c8b6ae80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f917aff0c10>)]

In [11]:
def extract_lemmas(doc, **kwargs):
    '''
    Input: a spaCy Doc produced by the nlp pipeline
    Output: list of lemmas for the words selected by textacy.extract.words
    '''
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]


def extract_noun_phrases(doc, preceding_pos=['NOUN'], sep='_'):
    patterns = []
    for pos in preceding_pos:
        patterns.append(f"POS:{pos} POS:NOUN:+")
    spans = textacy.extract.matches.token_matches(doc, patterns=patterns)
    return [sep.join([t.lemma_ for t in s]) for s in spans]


def extract_entities(doc, include_types=None, sep='_'):

    ents = textacy.extract.entities(doc,
             include_types=include_types,
             exclude_types=None,
             drop_determiners=True,
             min_freq=1)

    return [sep.join([t.lemma_ for t in e])+'/'+e.label_ for e in ents]


def extract_nlp(doc):
    return {
    'lemmas'          : extract_lemmas(doc,
                                     exclude_pos = ['PART', 'PUNCT',
                                        'DET', 'PRON', 'SYM', 'SPACE'],
                                     filter_stops = False),
    'adjs_verbs'      : extract_lemmas(doc, include_pos = ['ADJ', 'VERB']),
    'nouns'           : extract_lemmas(doc, include_pos = ['NOUN', 'PROPN']),
    'noun_phrases'    : extract_noun_phrases(doc, ['NOUN']),
    'adj_noun_phrases': extract_noun_phrases(doc, ['ADJ']),
    'entities'        : extract_entities(doc, ['PERSON', 'ORG', 'GPE', 'LOC'])
    }
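
The patterns built in `extract_noun_phrases` have the form "one token with the given POS tag, followed by one or more nouns". A minimal pure-Python sketch of that idea is shown below, assuming hand-assigned POS tags instead of a spaCy `Doc`. The function `noun_phrase_sketch` is hypothetical, and the real matcher may return its matches in a different order.

```python
def noun_phrase_sketch(tagged, preceding_pos=("NOUN",), sep="_"):
    """tagged: list of (lemma, pos) pairs; mimics the pattern '<POS> NOUN+'."""
    phrases = []
    for start, (_, pos) in enumerate(tagged):
        if pos not in preceding_pos:
            continue
        end = start + 1
        # Extend over the run of consecutive nouns, emitting every prefix of the run.
        while end < len(tagged) and tagged[end][1] == "NOUN":
            end += 1
            phrases.append(sep.join(lemma for lemma, _ in tagged[start:end]))
    return phrases

# Hand-tagged toy input (the tags are an assumption, not spaCy output).
tagged = [("fabulous", "ADJ"), ("bedroom", "NOUN"), ("suite", "NOUN")]
print(noun_phrase_sketch(tagged, preceding_pos=("ADJ",)))
print(noun_phrase_sketch(tagged, preceding_pos=("NOUN",)))
```

Because every prefix of a noun run counts as a match, one phrase can produce several overlapping features, for example both `fabulous_bedroom` and `fabulous_bedroom_suite`.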

In [307]:
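# `doc` must already hold a spaCy Doc here, e.g. the result of nlp(...) on one cleaned description;
# the cell that created it is not shown in this notebook.
for col, values in extract_nlp(doc).items():
    print(f"{col}: {values}")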

lemmas: ['place', 'be', 'close', 'to', 'home', 'be', 'warm', 'and', 'friendly', 'environment', 'design', 'connect', 'people', 'from', 'all', 'over', 'world', 'room', 'be', 'private', 'in', '4', 'bedroom', 'house', 'other', '2', 'bedroom', 'be', 'also', 'for', 'airbnb', 'traveler', 'just', 'like', 'check', 'in', 'time', '11', 'am', '4', 'pm', '7', 'pm', 'or', 'later', 'if', 'need', 'checkout', '10', 'am', 'Location:67', 'Broadway', 'somerville', 'Ma', '02145', 'Harvard', '3.1', 'mile', 'MIT', '3', 'mile', 'train', 'metro', '1.3', 'mile', 'Sullivan', 'square', 'Bus', 'Stop', 'Highland', 'street', 'at', 'corner', 'ensure', 'smooth', 'check', 'in', '1', 'must', 'have', 'cell', 'phone', 'call', 'or', 'text', 'phone', 'number', 'HIDDEN', 'or', 'on', 'app', '30', 'minute', 'prior', 'to', 'arrival', 'and', 'when', 'arrive', 'because', 'be', 'doorbell', 'in', 'big', 'apartment', 'complex', 'if', 'do', 'have', 'US', 'phone', 'will', 'gladly', 'meet', 'at', 'Dunkin', 'donut', 'next', 'door', 'can

In [308]:
nlp_columns = list(extract_nlp(nlp.make_doc('')).keys())
print(nlp_columns)

['lemmas', 'adjs_verbs', 'nouns', 'noun_phrases', 'adj_noun_phrases', 'entities']


In [309]:
for col in nlp_columns:
    df[col] = None

In [310]:
if spacy.prefer_gpu():
    print("Working on GPU.")
else:
    print("No GPU found, working on CPU.")

No GPU found, working on CPU.


### Linguistic processing

In [311]:
nlp = spacy.load('en_core_web_sm')

In [312]:
batch_size = 50

# Collect the extracted features per document, then assign whole columns at once.
# Writing cell by cell with df[col].iloc[i+j] = values is chained assignment and
# triggers pandas' SettingWithCopyWarning.
rows = [extract_nlp(doc) for doc in nlp.pipe(df['clean_descr'], batch_size=batch_size)]

nlp_df = pd.DataFrame(rows, index=df.index)
for col in nlp_columns:
    df[col] = nlp_df[col]


In [313]:
df.head()

| | id | description | review_scores_value | clean_descr | lemmas | adjs_verbs | nouns | noun_phrases | adj_noun_phrases | entities |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3075044 | Charming and quiet room in a second floor 1910... | 9.0 | Charming and quiet room in a second floor 1910... | [charming, and, quiet, room, in, second, floor... | [charming, quiet, second, quiet, shared, use, ... | [room, floor, condo, building, room, size, bed... | [condo_building, size_bed, darkening_curtain, ... | [quiet_room, second_floor, full_size, full_siz... | [Boston/GPE, Roslindale/GPE, Boston/GPE, Arnol... |
| 2 | 6976 | Come stay with a friendly, middle-aged guy in ... | 10.0 | Come stay with a friendly, middle-aged guy in ... | [come, stay, with, friendly, middle, aged, guy... | [come, stay, friendly, middle, aged, safe, qui... | [guy, Roslindale, neighborhood, Boston, room, ... | [cable_tv, folk_art, family_house, cable_tv, a... | [aged_guy, furnished_room, mexican_folk, mexic... | [Roslindale/GPE, Boston/GPE] |
| 3 | 1436513 | Come experience the comforts of home away from... | 10.0 | Come experience the comforts of home away from... | [come, experience, comfort, of, home, away, fr... | [experience, fabulous, available, sleep, large... | [comfort, home, home, bedroom, suite, Roslinda... | [bedroom_suite, washer_dryer, home_gym, street... | [fabulous_bedroom, fabulous_bedroom_suite, lar... | [Roslindale/GPE, Boston/GPE, Boston/GPE, Bosto... |
| 4 | 7651065 | My comfy, clean and relaxing home is one block... | 10.0 | My comfy, clean and relaxing home is one block... | [comfy, clean, and, relax, home, be, one, bloc... | [comfy, clean, relax, quiet, residential, priv... | [home, block, bus, line, street, room, bed, ba... | [bus_line, half_bath, bus_line, half_bath, kit... | [residential_street, private_room, single_bed,... | [AC_Clean/ORG, Air_Conditioned/ORG, Boston/GPE... |
| 5 | 12386020 | Super comfy bedroom plus your own bathroom in ... | 10.0 | Super comfy bedroom plus your own bathroom in ... | [super, comfy, bedroom, plus, own, bathroom, i... | [super, comfy, big, sunny, outer, interested, ... | [bedroom, bathroom, condo, Roslindale, neighbo... | [minute_ride, driveway_parking, family_home, g... | [comfy_bedroom, own_bathroom, sunny_condo, out... | [Roslindale/GPE, Boston/GPE, Boston/GPE, Bosto... |


In [314]:
df[nlp_columns] = df[nlp_columns].applymap(lambda items: ' '.join(items))
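
Joining each list into one space-separated string matters because the vectorizer used later re-tokenizes the text. The sketch below applies the default `token_pattern` of scikit-learn's `TfidfVectorizer` (tokens of two or more word characters, after lowercasing) with plain `re`. The helper `tokenize_sketch` is hypothetical, and the second input string is a made-up example chosen to resemble the `'24'` and `'7_concierge'` entries in the feature list.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize_sketch(text):
    # The vectorizer lowercases by default before applying the pattern.
    return re.findall(TOKEN_PATTERN, text.lower())

print(tokenize_sketch("cable_tv folk_art"))   # underscores count as word characters
print(tokenize_sketch("24/7_concierge"))      # "/" splits the phrase into two tokens
print(tokenize_sketch("Boston/GPE"))          # entity and label become separate tokens
```

Underscore-joined phrases survive as single tokens, while any other punctuation inside a phrase splits it. This is one plausible reason for fragments such as `_access` or `24` among the learned features.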

In [315]:
df.head()

| | id | description | review_scores_value | clean_descr | lemmas | adjs_verbs | nouns | noun_phrases | adj_noun_phrases | entities |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3075044 | Charming and quiet room in a second floor 1910... | 9.0 | Charming and quiet room in a second floor 1910... | charming and quiet room in second floor 1910 c... | charming quiet second quiet shared use pet fri... | room floor condo building room size bed darken... | condo_building size_bed darkening_curtain c_un... | quiet_room second_floor full_size full_size_be... | Boston/GPE Roslindale/GPE Boston/GPE Arnold_Ar... |
| 2 | 6976 | Come stay with a friendly, middle-aged guy in ... | 10.0 | Come stay with a friendly, middle-aged guy in ... | come stay with friendly middle aged guy in saf... | come stay friendly middle aged safe quiet clea... | guy Roslindale neighborhood Boston room cable ... | cable_tv folk_art family_house cable_tv air_co... | aged_guy furnished_room mexican_folk mexican_f... | Roslindale/GPE Boston/GPE |
| 3 | 1436513 | Come experience the comforts of home away from... | 10.0 | Come experience the comforts of home away from... | come experience comfort of home away from home... | experience fabulous available sleep large size... | comfort home home bedroom suite Roslindale nei... | bedroom_suite washer_dryer home_gym street_par... | fabulous_bedroom fabulous_bedroom_suite large_... | Roslindale/GPE Boston/GPE Boston/GPE Boston/GP... |
| 4 | 7651065 | My comfy, clean and relaxing home is one block... | 10.0 | My comfy, clean and relaxing home is one block... | comfy clean and relax home be one block away f... | comfy clean relax quiet residential private in... | home block bus line street room bed bath half ... | bus_line half_bath bus_line half_bath kitchen_... | residential_street private_room single_bed ful... | AC_Clean/ORG Air_Conditioned/ORG Boston/GPE Ap... |
| 5 | 12386020 | Super comfy bedroom plus your own bathroom in ... | 10.0 | Super comfy bedroom plus your own bathroom in ... | super comfy bedroom plus own bathroom in big s... | super comfy big sunny outer interested public ... | bedroom bathroom condo Roslindale neighborhood... | minute_ride driveway_parking family_home guest... | comfy_bedroom own_bathroom sunny_condo outer_n... | Roslindale/GPE Boston/GPE Boston/GPE Boston/GPE |


### Transforming processed text into feature vectors 
Always split the data into train and test sets before fitting the vectorizer, so that the vocabulary and IDF weights are learned only from the training set and no information leaks from the test set.

In [316]:
X_train, X_test, y_train, y_test = train_test_split(df['noun_phrases'], df['review_scores_value'], test_size=0.33, random_state=42)

In [317]:
tfidf = TfidfVectorizer(min_df=2)
train_vectors = tfidf.fit_transform(X_train)
test_vectors = tfidf.transform(X_test)
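
To illustrate what fitting on the training set and only transforming the test set means, here is a minimal pure-Python sketch of TF-IDF with smoothed IDF and L2 normalization, which are scikit-learn's default settings. It assumes plain whitespace tokenization and toy documents; `fit_tfidf` and `transform_tfidf` are hypothetical helpers, not the library's code.

```python
import math
from collections import Counter

def fit_tfidf(train_docs, min_df=2):
    """Learn vocabulary and smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1; terms in fewer than min_df docs are dropped.
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items() if c >= min_df}

def transform_tfidf(doc, idf):
    """Weight term counts by IDF, then scale the vector to unit L2 norm."""
    counts = Counter(t for t in doc.split() if t in idf)  # terms unseen in training are ignored
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train = ["bus_line half_bath", "bus_line cable_tv", "cable_tv bus_line home_gym"]
idf = fit_tfidf(train)
print(sorted(idf))                                 # terms kept by min_df=2
print(transform_tfidf("bus_line roof_deck", idf))  # roof_deck never appeared in training
```

A test document can only be described with terms the training set already contained, which is why both matrices above have the same 1445 columns.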

In [318]:
test_vectors 

<913x1445 sparse matrix of type '<class 'numpy.float64'>'
	with 4595 stored elements in Compressed Sparse Row format>

In [319]:
train_vectors

<1851x1445 sparse matrix of type '<class 'numpy.float64'>'
	with 10712 stored elements in Compressed Sparse Row format>

In [320]:
model = LinearSVC()
model.fit(train_vectors, y_train)
y_pred = model.predict(test_vectors)

In [329]:
print(accuracy_score(y_test,y_pred))

0.4403066812705367


In [330]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         4.0       0.00      0.00      0.00         3
         5.0       0.00      0.00      0.00         1
         6.0       0.00      0.00      0.00        16
         7.0       0.00      0.00      0.00        14
         8.0       0.17      0.09      0.12       110
         9.0       0.45      0.47      0.46       362
        10.0       0.48      0.55      0.51       407

    accuracy                           0.44       913
   macro avg       0.16      0.16      0.16       913
weighted avg       0.41      0.44      0.42       913
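
One way to put the 0.44 accuracy in context is to compare it with a trivial classifier that always predicts the most frequent score. The sketch below does that arithmetic using only the support column of the report and the accuracy printed earlier; the variable names are illustrative.

```python
# Per-class support copied from the classification report (test set only).
support = {4.0: 3, 5.0: 1, 6.0: 16, 7.0: 14, 8.0: 110, 9.0: 362, 10.0: 407}
model_accuracy = 0.4403066812705367  # value printed by accuracy_score

total = sum(support.values())
majority_class = max(support, key=support.get)
# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = support[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 4))
print(model_accuracy > baseline_accuracy)
```

With these numbers, always predicting 10.0 would score about 0.4458 on the test set, slightly above the model's 0.4403. This supports the conclusion that the noun-phrase features carry little signal for the review score.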





In [331]:
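# get_feature_names() was removed in scikit-learn 1.2; on 1.0+ use get_feature_names_out()
list(tfidf.get_feature_names_out())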

['10_min',
 '15min_walk',
 '1_bath',
 '24',
 '24_hour',
 '24_hour_concierge',
 '24_hour_concierge_service',
 '24_hour_concierge_service_',
 '32in_led',
 '32in_led_tv',
 '5min_walk',
 '7_concierge',
 '7min_walk',
 '_access',
 '_bedroom',
 '_bike',
 '_bike_share',
 '_bike_share_station',
 '_ceiling',
 '_discount',
 '_foot',
 '_fridge',
 '_mail',
 '_op',
 '_refrigerator',
 '_rise',
 '_smoker',
 '_stop',
 '_student',
 '_suite',
 '_suite_bathroom',
 '_top',
 '_tub',
 'a_burger',
 'a_dunkin',
 'access_code',
 'access_parking',
 'acre_property',
 'aero_bed',
 'aft_cabin',
 'aft_deck',
 'afternoon_sun',
 'air_bed',
 'air_bnb',
 'air_condition',
 'air_conditioner',
 'air_conditioning',
 'air_conditioning_property',
 'air_mattress',
 'airbnb_smartphone',
 'airbnb_smartphone_app',
 'airport_shuttle',
 'airport_shuttle_bus',
 'airport_station',
 'airport_terminal',
 'airport_traveler',
 'alarm_clock',
 'alarm_clock_radio',
 'alarm_rental',
 'amenity_package',
 'amenity_resident',
 'answer_question