# Case Study AirBnB Prediction - Text Analytics
This notebook walks through how to:
- extract the named entities mentioned in review comments
- use a pre-trained text analytics model to classify text

In [4]:
# Data Representation
import numpy as np
import pandas as pd

# Processing & Modeling
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn import set_config
set_config(display='diagram')

import spacy
nlp = spacy.load('en_core_web_sm')

# https://github.com/huggingface/transformers
import transformers

# Visualization
import plotly.express as px

random_state = 42
pd.set_option('display.max_rows', 100)

In [5]:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
a
U.K.
startup
for
$
1
billion


In [6]:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 billion")
doc.ents

(Apple, U.K., $1 billion)

In [7]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 29 33 GPE
$1 billion 46 56 MONEY
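
The `start_char` and `end_char` values printed above are character offsets into the original string, with `end_char` exclusive, so slicing the text with them recovers each entity. A minimal check using only the offsets from the output above (it does not need spaCy; the variable names are illustrative):

```python
text = "Apple is looking at buying a U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 5, "ORG"), (29, 33, "GPE"), (46, 56, "MONEY")]

# Slicing the original string with the offsets recovers the entity text
recovered = [(text[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(entity_text, label)
```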


## Task 1: Extract Entities in Reviews
Let's take the first 50 reviews and extract any named entities they mention.

In [9]:
link = 'https://drive.google.com/file/d/1-JRyJEw1K9SysORKOCu36uxujjxFBKq5/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+link.split('/')[-2]
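
The two lines above turn a Google Drive sharing link into a direct-download URL by pulling the file id out of the link. The same idea can be sketched as a small helper that fails loudly on an unexpected link; `drive_download_url` is a hypothetical name, and the sketch assumes the link has the `/file/d/<id>/view` shape used here:

```python
from urllib.parse import urlparse

def drive_download_url(share_link: str) -> str:
    """Build a direct-download URL from a Drive '/file/d/<id>/view' link."""
    parts = urlparse(share_link).path.strip("/").split("/")
    # Expected path segments: file / d / <id> / view
    if len(parts) < 3 or parts[:2] != ["file", "d"]:
        raise ValueError(f"Unrecognized Drive link: {share_link}")
    file_id = parts[2]
    return f"https://drive.google.com/uc?export=download&id={file_id}"

link = 'https://drive.google.com/file/d/1-JRyJEw1K9SysORKOCu36uxujjxFBKq5/view?usp=sharing'
path = drive_download_url(link)
```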

In [10]:
reviews_df = pd.read_csv(path)
reviews_df.head(1)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...


In [11]:
reviews_df.shape

(84849, 6)

In [13]:
print(f"The reviews are from {reviews_df['date'].min()} to {reviews_df['date'].max()}")

The reviews are from 2009-06-07 to 2016-01-03


In [129]:
# Return the text of every named entity spaCy finds in `text`
def extract_entities(text):
    doc = nlp(text)
    entities = [entity.text for entity in doc.ents]
    return entities

# Return the label (ORG, GPE, MONEY, ...) of every named entity in `text`
def extract_entity_labels(text):
    doc = nlp(text)
    labels = [entity.label_ for entity in doc.ents]
    return labels
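
Both helpers above parse the same text, so getting entity texts and labels costs two passes through the model. A minimal sketch that reads both from one parsed document follows; `entity_pairs` is a hypothetical helper that relies only on the `.ents`, `.text`, and `.label_` attributes used above, and a stand-in object replaces the parsed document so the example runs without loading a model:

```python
from types import SimpleNamespace

def entity_pairs(doc):
    """Return (text, label) for every entity in an already-parsed doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]

# Stand-in for a parsed document, so the sketch runs without a spaCy model.
# With spaCy this would be: doc = nlp(text)
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Apple", label_="ORG"),
    SimpleNamespace(text="U.K.", label_="GPE"),
])

pairs = entity_pairs(fake_doc)
print(pairs)
```

On a whole column, one option is `[entity_pairs(doc) for doc in nlp.pipe(reviews_df['comments'].head(50))]`, assuming every comment is a string; `nlp.pipe` processes the texts in batches rather than one call at a time.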

In [185]:
# Demo text
text = [
    'Google amazon texas ten',
    'Apple is looking at buying U.K. startup for $1 Billion',
    'Carnegie Mellon University is great'
]
text_df = pd.DataFrame({'X': text})
text_df

Unnamed: 0,X
0,Google amazon texas ten
1,Apple is looking at buying U.K. startup for $1...
2,Carnegie Mellon University is great


In [132]:
text_df['X'].apply(extract_entities)

0                           [ten]
1       [Apple, U.K., $1 Billion]
2    [Carnegie Mellon University]
Name: X, dtype: object

In [133]:
text_df['X'].apply(extract_entity_labels)

0           [CARDINAL]
1    [ORG, GPE, MONEY]
2                [ORG]
Name: X, dtype: object

In [134]:
# Now let's try it on the reviews 
reviews_df['comments'].head(50).apply(extract_entities)

0                                                    []
1                          [Kelly, Seattle Center, WOW]
2                                 [Kelly, 5 pm, Friday]
3                   [Space Needle, Metropolitan, Kelly]
4                           [Kelly, the Seattle Center]
5                      [Kelly, Seattle, 2015, all week]
6                                               [Kelly]
7                     [Seattle, the weekend, Kelly, 50]
8        [Kelly, Lower, Anne, Belltown, Seattle, Kelly]
9                                                    []
10      [Clean Linen, Towels, Neighbourhood, 10, Kelly]
11               [Kelly, a moment one evening, Seattle]
12                                            [Seattle]
13                                                   []
14                                         [Kelly, One]
15                                     [Kelly, Seattle]
16                                       [Rachel & Jon]
17                                       [Rachel
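
With per-review lists like these, a natural next step is to count how often each entity appears. A minimal sketch using `collections.Counter` on the first five rows copied from the output above; `count_entities` is a hypothetical helper, and with the real data its input would be the Series returned by `apply`:

```python
from collections import Counter
from itertools import chain

def count_entities(entity_lists):
    """Count how often each entity string appears across all reviews."""
    return Counter(chain.from_iterable(entity_lists))

# Rows 0-4 of the output above
sample = [
    [],
    ["Kelly", "Seattle Center", "WOW"],
    ["Kelly", "5 pm", "Friday"],
    ["Space Needle", "Metropolitan", "Kelly"],
    ["Kelly", "the Seattle Center"],
]

counts = count_entities(sample)
print(counts.most_common(1))  # [('Kelly', 4)]
```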

## Task 2: Classify Text w/Pre-Trained Model
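A zero-shot classifier scores a piece of text against candidate labels it was never explicitly trained on. For background, please see [Zero-Shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html).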

In [136]:
# device=0 runs the model on the first GPU; omit it (or pass device=-1) to run on CPU
classifier = transformers.pipeline("zero-shot-classification", device=0)

Some layers from the model checkpoint at roberta-large-mnli were not used when initializing TFRobertaModel: ['classifier']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initial

In [137]:
sequence = "Python is the best langauge ever!!!"
candidate_labels = ["negative", "positive"]

classifier(sequence, candidate_labels)

{'sequence': 'Python is the best langauge ever!!!',
 'labels': ['positive', 'negative'],
 'scores': [0.9931901693344116, 0.006809812970459461]}
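
The result is a dictionary whose `labels` are ordered from highest to lowest score, with `scores` in the same order; with the default settings the scores sum to approximately 1. Pairing the two lists gives a label-to-score lookup. The sketch below uses the output shown above verbatim, so it runs without the model:

```python
result = {
    'sequence': 'Python is the best langauge ever!!!',
    'labels': ['positive', 'negative'],
    'scores': [0.9931901693344116, 0.006809812970459461],
}

# Pair each label with its score
score_by_label = dict(zip(result['labels'], result['scores']))

top_label = result['labels'][0]   # highest-scoring label comes first
top_score = result['scores'][0]
print(top_label, round(top_score, 3))
```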

In [187]:
classifier('NY Giants Sucks', candidate_labels)['labels'][0]

'negative'

In [193]:
sentiment_labels = ['positive', 'negative']

def label_sentiment(text):
    return classifier(text, sentiment_labels)['labels'][0]

def sentiment_score(text):
    return classifier(text, sentiment_labels)['scores'][0]
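
Each helper above calls the classifier separately, so labelling and scoring one text runs the model twice. A minimal sketch that returns both values from a single call follows; `label_and_score` and `stub_classifier` are hypothetical names, and the stub only mimics the result shape shown earlier so the example runs without downloading a model:

```python
def label_and_score(text, classifier, labels=('positive', 'negative')):
    """Run the classifier once and return (top_label, top_score)."""
    result = classifier(text, list(labels))
    return result['labels'][0], result['scores'][0]

# Stand-in classifier that mimics the result shape shown earlier,
# so the sketch runs without downloading a model.
def stub_classifier(text, candidate_labels):
    return {'sequence': text,
            'labels': ['negative', 'positive'],
            'scores': [0.99, 0.01]}

label, score = label_and_score('NY Giants suck', stub_classifier)
print(label, score)
```

With the real pipeline, the stub would be replaced by the `classifier` object created above, for example `label_and_score(text, classifier)`.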

In [194]:
# Demo text
text = [
    'Google amazon texas ten',
    'Apple is looking at buying U.K. startup for $1 Billion',
    'Carnegie Mellon University is great',
    'NY Giants suck',
    'NY Giants are the worst team'
]
text_df = pd.DataFrame({'X': text})
text_df

Unnamed: 0,X
0,Google amazon texas ten
1,Apple is looking at buying U.K. startup for $1...
2,Carnegie Mellon University is great
3,NY Giants suck
4,NY Giants are the worst team


In [195]:
text_df['sentiment'] = text_df['X'].apply(label_sentiment)
text_df['score'] = text_df['X'].apply(sentiment_score)
text_df

Unnamed: 0,X,sentiment,score
0,Google amazon texas ten,positive,0.645827
1,Apple is looking at buying U.K. startup for $1...,positive,0.534415
2,Carnegie Mellon University is great,positive,0.996467
3,NY Giants suck,negative,0.990939
4,NY Giants are the worst team,negative,0.991825


In [196]:
reviews_df['comments'].head(25).apply(label_sentiment)

0     positive
1     positive
2     positive
3     positive
4     positive
5     positive
6     positive
7     positive
8     positive
9     positive
10    positive
11    positive
12    positive
13    positive
14    negative
15    positive
16    positive
17    positive
18    positive
19    positive
20    positive
21    positive
22    positive
23    positive
24    positive
Name: comments, dtype: object