# Case Study: Natural Language Processing

This notebook walks through how to:
- extract named entities mentioned in review comments
- use a pre-trained zero-shot model to classify text
- generate text with a pre-trained language model

In [1]:
!pip install transformers==3.1.0 &> /dev/null
!pip install pyyaml==5.4.1 &> /dev/null

In [1]:
# Data Representation
import numpy as np
import pandas as pd

# Data Modeling
import spacy
import tensorflow
nlp = spacy.load('en_core_web_sm')

# https://github.com/huggingface/transformers
import transformers


random_state = 42
pd.set_option('display.max_rows', 100)

In [None]:
print(f"Transformers version: {transformers.__version__}")
print(f"TensorFlow version: {tensorflow.__version__}")
print(f"Pandas version: {pd.__version__}")

In [18]:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 billion")
doc

Apple is looking at buying a U.K. startup for $1 billion

In [19]:
type(doc)

spacy.tokens.doc.Doc

In [20]:
for token in doc:
    print(token.text) # tokens in the processed string

Apple
is
looking
at
buying
a
U.K.
startup
for
$
1
billion


In [21]:
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 billion")
doc.ents

(Apple, U.K., $1 billion)

In [22]:
spacy.displacy.render(doc, style='dep', jupyter=True)

In [23]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 29 33 GPE
$1 billion 46 56 MONEY
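
The `start_char` and `end_char` values printed above are character offsets into the original string, so slicing the text with them recovers each entity. A minimal sketch that uses only the values shown in the output (no spaCy required):

```python
# Entity spans copied from the output above: (text, start_char, end_char, label).
text = "Apple is looking at buying a U.K. startup for $1 billion"
spans = [
    ("Apple", 0, 5, "ORG"),
    ("U.K.", 29, 33, "GPE"),
    ("$1 billion", 46, 56, "MONEY"),
]

# Slicing the original string with the offsets reproduces the entity text.
recovered = [text[start:end] for _, start, end, _ in spans]
print(recovered)
```

These offsets are what make it possible to highlight or mask entities in the raw text.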


## Task 1: Extract Entities
Let's take the first 50 reviews and extract any entities they mention.

In [24]:
link = 'https://drive.google.com/file/d/1-JRyJEw1K9SysORKOCu36uxujjxFBKq5/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+link.split('/')[-2]
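
The cell above turns a Google Drive sharing link into a direct-download URL by taking the second-to-last path segment as the file ID. The sketch below wraps the same idea in a function; `drive_download_url` is a hypothetical helper name, and the regular expression assumes the link has the `/file/d/<id>/...` shape shown here.

```python
import re

def drive_download_url(share_link):
    """Build a direct-download URL from a Google Drive '/file/d/<id>/...' link.

    `drive_download_url` is a hypothetical helper, not part of any library.
    """
    match = re.search(r"/file/d/([^/]+)", share_link)
    if match is None:
        raise ValueError(f"No file ID found in: {share_link}")
    return "https://drive.google.com/uc?export=download&id=" + match.group(1)

link = 'https://drive.google.com/file/d/1-JRyJEw1K9SysORKOCu36uxujjxFBKq5/view?usp=sharing'
path = drive_download_url(link)
print(path)
```

Unlike `link.split('/')[-2]`, this version raises an explicit error when the link does not contain a file ID.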

In [25]:
reviews_df = pd.read_csv(path)
reviews_df.head(15)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...
5,7202016,43979139,2015-08-23,1154501,Barent,"Kelly was great, place was great, just what I ..."
6,7202016,45265631,2015-09-01,37853266,Kevin,Kelly was great! Very nice and the neighborhoo...
7,7202016,46749120,2015-09-13,24445447,Rick,hola all bnb erz - Just left Seattle where I h...
8,7202016,47783346,2015-09-21,249583,Todd,Kelly's place is conveniently located on a qui...
9,7202016,48388999,2015-09-26,38110731,Tatiana,"The place was really nice, clean, and the most..."


In [26]:
reviews_df.shape

(84849, 6)

In [27]:
print(f"The reviews are from {reviews_df['date'].min()} to {reviews_df['date'].max()}")

The reviews are from 2009-06-07 to 2016-01-03


#### Subtask 1: Create an entity extractor

In [28]:
# Get the text of each entity
def extract_entities(text):
    """Return the text of every named entity spaCy finds in `text`."""
    doc = nlp(text)
    return [entity.text for entity in doc.ents]

# Get the label of each entity
def extract_entity_labels(text):
    """Return the label (e.g. ORG, GPE) of every named entity in `text`."""
    doc = nlp(text)
    return [entity.label_ for entity in doc.ents]
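
Each of the two helpers above runs the spaCy pipeline on the text, so calling both processes every string twice. When both the entity text and its label are needed, one pass can return them together. A minimal sketch, assuming a hypothetical helper name `extract_entity_pairs` that receives the pipeline as an argument:

```python
def extract_entity_pairs(text, nlp):
    """Run the pipeline once and return (entity text, entity label) pairs.

    `nlp` is any spaCy pipeline, such as the `en_core_web_sm` one loaded earlier.
    `extract_entity_pairs` is a hypothetical helper name used for illustration.
    """
    doc = nlp(text)
    return [(entity.text, entity.label_) for entity in doc.ents]
```

With a pandas Series, the pipeline can be passed as a keyword argument, for example `text_df['X'].apply(extract_entity_pairs, nlp=nlp)`.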

In [29]:
# Demo text
text = [
    'Google amazon texas ten',
    'Amazon AWS rangers Washington',
    'Apple is looking at buying U.K. startup for $1 Billion',
    'Carnegie Mellon University is great'
]
text_df = pd.DataFrame({'X': text})
text_df

Unnamed: 0,X
0,Google amazon texas ten
1,Amazon AWS rangers Washington
2,Apple is looking at buying U.K. startup for $1...
3,Carnegie Mellon University is great


In [30]:
text_df['X'].apply(extract_entities)

0            [Google, texas, ten]
1        [Amazon AWS, Washington]
2       [Apple, U.K., $1 Billion]
3    [Carnegie Mellon University]
Name: X, dtype: object

In [31]:
text_df['X'].apply(extract_entity_labels)

0    [ORG, GPE, CARDINAL]
1          [PRODUCT, GPE]
2       [ORG, GPE, MONEY]
3                   [ORG]
Name: X, dtype: object

#### Subtask 2: Apply entity extractor on data

In [32]:
# Now let's try it on the reviews 
reviews_df['comments'].head(50).apply(extract_entities)

0                                                    []
1     [Kelly, Seattle Center, the Space Needle, Chih...
2                                 [Kelly, 5 pm, Friday]
3     [Seattle Center, Space Needle, Metropolitan, K...
4                           [Kelly, the Seattle Center]
5           [Kelly, 5 min, Seattle, Aug 2015, all week]
6                                               [Kelly]
7                 [hola, Seattle, the weekend, Kelly's]
8     [Kelly, Queen Anne, Belltown, Downtown, Seattl...
9     [Muy, Kelly, y lo mas importante, que esta cer...
10                        [night, Neighbourhood, Kelly]
11    [Kelly, a moment, one evening, City Center, Se...
12                                            [Seattle]
13                                                   []
14                                       [Kelly's, One]
15                                     [Kelly, Seattle]
16                            [Rachel & Jon, Farmhouse]
17                                       [Rachel
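
Once every review has a list of entities, a natural next step is to count which entities appear most often. The sketch below uses only a handful of complete entity lists copied from the output above (truncated rows are left out), so the counts describe this small sample, not the full dataset.

```python
from collections import Counter

# Complete (non-truncated) entity lists copied from rows 2, 4, 6, 12 and 15 above.
entity_lists = [
    ["Kelly", "5 pm", "Friday"],
    ["Kelly", "the Seattle Center"],
    ["Kelly"],
    ["Seattle"],
    ["Kelly", "Seattle"],
]

# Flatten the per-review lists and count how often each entity appears.
entity_counts = Counter(entity for entities in entity_lists for entity in entities)
print(entity_counts.most_common(2))
```

The same `Counter` expression should work on the Series returned by `apply(extract_entities)`, since iterating over it yields one list per review.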

## Task 2: Classify Text
Please see [Zero-Shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html)

A few notes on this example:


*   The [zero-shot classifier](https://huggingface.co/facebook/bart-large-mnli) is a general-purpose pre-trained model. For better performance on a specific domain, specialize it with an approach such as [fine-tuning](https://github.com/huggingface/notebooks/blob/main/transformers_doc/custom_datasets.ipynb).
*   Additional pre-trained models that work with the transformers library are available in the [Hugging Face model repository](https://huggingface.co/models).
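
As background for these notes: the zero-shot pipeline poses each candidate label as a natural language inference hypothesis (by default a template along the lines of "This example is {}.") and, in its single-label mode, normalizes the entailment logits across the labels with a softmax. The sketch below reproduces only that final normalization step. The logits are made-up numbers, not model output, and `scores_from_entailment_logits` is a hypothetical name.

```python
import math

def scores_from_entailment_logits(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    exps = [math.exp(logit) for logit in entailment_logits]
    total = sum(exps)
    scored = sorted(
        zip(labels, (value / total for value in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }

# Hypothetical entailment logits for the hypotheses
# "This example is negative." and "This example is positive."
result = scores_from_entailment_logits(["negative", "positive"], [-1.5, 2.3])
print(result)
```

Because of the softmax, the scores always sum to 1, which is why a text that fits neither label still receives a split close to 50/50.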

In [3]:
classifier = transformers.pipeline("zero-shot-classification")  # pass device=0 to run on the first GPU

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BartForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
sequence = "Python is the best langauge ever!!!"
candidate_labels = ["negative", "positive"]

classifier(sequence, candidate_labels)

{'labels': ['positive', 'negative'],
 'scores': [0.9788928627967834, 0.02110718935728073],
 'sequence': 'Python is the best langauge ever!!!'}

In [5]:
classifier('NY Giants Sucks', candidate_labels)['labels'][0]

'negative'

In [10]:
classifier("I'm not sure how I feel about C", candidate_labels)


{'labels': ['negative', 'positive'],
 'scores': [0.881877601146698, 0.118122398853302],
 'sequence': "I'm not sure how I feel about C"}

In [11]:
classifier("C is neither positive nor negative", candidate_labels)

{'labels': ['positive', 'negative'],
 'scores': [0.5069530010223389, 0.4930470287799835],
 'sequence': 'C is neither positive nor negative'}
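
With only two candidate labels, the classifier always picks one of them, even when the scores are nearly tied as in the last example. One possible way to handle this is to treat low-confidence results as neutral. A minimal sketch: `label_with_neutral` is a hypothetical helper, and the 0.6 threshold is an arbitrary illustrative value, not a tuned one.

```python
def label_with_neutral(result, threshold=0.6):
    """Return the top label, or 'neutral' when its score is below `threshold`.

    `threshold=0.6` is an arbitrary illustrative value, not a tuned one.
    """
    top_label, top_score = result["labels"][0], result["scores"][0]
    return top_label if top_score >= threshold else "neutral"

# Result dictionaries copied from the outputs above.
confident = {"labels": ["positive", "negative"],
             "scores": [0.9788928627967834, 0.02110718935728073]}
ambiguous = {"labels": ["positive", "negative"],
             "scores": [0.5069530010223389, 0.4930470287799835]}

print(label_with_neutral(confident), label_with_neutral(ambiguous))
```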

#### Subtask 1: Create a sentiment classifier

In [14]:
sentiment_labels = ['positive', 'negative']

def label_sentiment(text):
    return classifier(text, sentiment_labels)['labels'][0]

def sentiment_score(text):
    return classifier(text, sentiment_labels)['scores'][0]
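
`label_sentiment` and `sentiment_score` each call the classifier, so computing both columns runs the model twice per text. If that cost matters, a single call can return both values. A minimal sketch, assuming a hypothetical helper `label_and_score` that takes the classifier as an argument:

```python
def label_and_score(text, classifier, labels=("positive", "negative")):
    """Classify `text` once and return (top label, top score).

    `classifier` is any callable that returns a dict with 'labels' and
    'scores' sorted from most to least likely, like the pipeline above.
    """
    result = classifier(text, list(labels))
    return result["labels"][0], result["scores"][0]
```

With the notebook's pipeline, `label_and_score('NY Giants suck', classifier)` would give the label and its score from one model call.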

In [15]:
# Demo text
text = [
    'Google amazon texas ten',
    'Apple is looking at buying U.K. startup for $1 Billion',
    'Carnegie Mellon University is great',
    'NY Giants suck',
    'NY Giants are the worst team',
    "Dallas Cowboys are America's Favorite Team!"
]
text_df = pd.DataFrame({'X': text})
text_df

Unnamed: 0,X
0,Google amazon texas ten
1,Apple is looking at buying U.K. startup for $1...
2,Carnegie Mellon University is great
3,NY Giants suck
4,NY Giants are the worst team
5,Dallas Cowboys are America's Favorite Team!


In [16]:
text_df['sentiment'] = text_df['X'].apply(label_sentiment)
text_df['score'] = text_df['X'].apply(sentiment_score)
text_df

Unnamed: 0,X,sentiment,score
0,Google amazon texas ten,positive,0.540497
1,Apple is looking at buying U.K. startup for $1...,positive,0.777548
2,Carnegie Mellon University is great,positive,0.997745
3,NY Giants suck,negative,0.998749
4,NY Giants are the worst team,negative,0.992797
5,Dallas Cowboys are America's Favorite Team!,positive,0.959153


In [33]:
reviews_sentiment = reviews_df['comments'].head(250).apply(label_sentiment)
reviews_sentiment

0      positive
1      positive
2      positive
3      positive
4      positive
         ...   
245    positive
246    positive
247    positive
248    positive
249    positive
Name: comments, Length: 250, dtype: object

In [None]:
reviews_sentiment.value_counts()

positive    245
negative      5
Name: comments, dtype: int64

In [None]:
negative_listing_indicies = reviews_sentiment[reviews_sentiment=='negative'].index.tolist()
negative_listing_indicies

[14, 80, 83, 132, 230]

In [None]:
reviews_df[reviews_df.index.isin(negative_listing_indicies)]['comments']

14     Staying at Kelly's was easy. The location was ...
80     我们是一家三口，可爱的女儿，夫妻二人都是中国来的访问学者，来到美丽的西雅图，住在了Roger...
83     The host canceled this reservation 21 days bef...
132    The apartment was great. The location was fabu...
230    Great stay, only thing is the main house had a...
Name: comments, dtype: object

In [None]:
reviews_df[reviews_df.index.isin(negative_listing_indicies)]['comments'][14]

"Staying at Kelly's was easy. The location was a block away from public transportation, her place was easy to find, keys were easy to access and timing was extremely flexible. Great for the price - nothing too fancy. One negative: the shower didn't drain well. "

In [None]:
reviews_df[reviews_df.index.isin(negative_listing_indicies)]['comments'][80]

'我们是一家三口，可爱的女儿，夫妻二人都是中国来的访问学者，来到美丽的西雅图，住在了Roger的房子里，房子位于美丽的艺术小镇fremont，有各种各样的雕塑，还有宇宙中心的路标，各种小店的橱窗也是充满特色，富有艺术气息。安静美丽的小镇，充满了秋天的味道。离西雅图市区比较近。房子位置好，能从窗户里看到Rainer山。很不幸我们没有看到。'

In [None]:
reviews_df[reviews_df.index.isin(negative_listing_indicies)]['comments'][83]

'The host canceled this reservation 21 days before arrival. This is an automated posting.'

In [None]:
reviews_df[reviews_df.index.isin(negative_listing_indicies)]['comments'][132]

'The apartment was great. The location was fabulous: easy walking distance to downtown, Capitol Hill, South Lake Union etc. Apartment was roomy, extremely clean and has an awesome view of Mt Rainer. Kitchen was well appointed, and there is a big projector TV. No complaints at all. 5 stars.\r\n\r\nThe one slight negative was parking: its a bit difficult/pricey to park. This is really nothing to do with the apartment, just First Hill.'

In [None]:
reviews_df[reviews_df.index.isin(negative_listing_indicies)]['comments'][230]

'Great stay, only thing is the main house had a very strong cat odor so people with allergies be aware. '

## Task 3: Generate Text
Please see [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)

In [12]:
tokenizer = transformers.GPT2Tokenizer.from_pretrained("gpt2")
tokenizer

<transformers.tokenization_gpt2.GPT2Tokenizer at 0x7f0cc5b6fed0>

In [13]:
# add the EOS token as PAD token to avoid warnings
model = transformers.TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
model

All model checkpoint weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


<transformers.modeling_tf_gpt2.TFGPT2LMHeadModel at 0x7f0bc5765910>

In [14]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')
input_ids

<tf.Tensor: shape=(1, 7), dtype=int32, numpy=array([[   40,  2883,  6155,   351,   616, 13779,  3290]], dtype=int32)>

In [16]:
type(input_ids)

tensorflow.python.framework.ops.EagerTensor

In [20]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=3, 
    early_stopping=True
)
beam_output

<tf.Tensor: shape=(1, 50), dtype=int32, numpy=
array([[   40,  2883,  6155,   351,   616, 13779,  3290,    11,   475,
          314,  1101,   407,  1654,   611,   314,  1183,  1683,   307,
         1498,   284,  2513,   351,   683,   757,    13,   198,   198,
           40,  1053,   587,  1804,   428,   329,   257,  1178,   812,
          783,    11,   290,   314,  1053,  1239,   550,   257,  1917,
          351,   340,    13,   314,  1053]], dtype=int32)>
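
The `no_repeat_ngram_size=3` argument asks the generator never to produce the same three-token sequence twice. The sketch below checks that property on the token ids shown in the tensor above, using a hypothetical helper `has_repeated_ngram` and plain Python lists.

```python
def has_repeated_ngram(token_ids, n):
    """Return True if any run of `n` consecutive token ids occurs more than once."""
    seen = set()
    for i in range(len(token_ids) - n + 1):
        ngram = tuple(token_ids[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

# Token ids copied from the `beam_output` tensor above.
generated_ids = [
    40, 2883, 6155, 351, 616, 13779, 3290, 11, 475,
    314, 1101, 407, 1654, 611, 314, 1183, 1683, 307,
    1498, 284, 2513, 351, 683, 757, 13, 198, 198,
    40, 1053, 587, 1804, 428, 329, 257, 1178, 812,
    783, 11, 290, 314, 1053, 1239, 550, 257, 1917,
    351, 340, 13, 314, 1053,
]

print(has_repeated_ngram(generated_ids, 3))  # no 3-gram repeats
print(has_repeated_ngram(generated_ids, 2))  # the 2-gram (314, 1053) repeats
```

Repeated 2-grams are still allowed, which is why a pair of token ids such as `314, 1053` can appear more than once in the output.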

In [21]:
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been doing this for a few years now, and I've never had a problem with it. I've
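
The `num_beams=5` argument in the `generate` call switches from greedy decoding to beam search, which keeps several candidate sequences alive at each step instead of only the single most likely next token. The toy sketch below illustrates the idea with a hand-made table of next-token probabilities. The table and the `beam_search` function are illustrative assumptions and are unrelated to GPT-2's actual probabilities or to the library's implementation.

```python
import math

# Hypothetical toy "language model": next-token probabilities keyed by the last token.
NEXT_TOKEN_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}

def beam_search(start, steps, num_beams):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_TOKEN_PROBS.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search("The", steps=2, num_beams=1)[0]  # one beam is greedy decoding
beam = beam_search("The", steps=2, num_beams=2)[0]

print(" ".join(greedy[0]), round(math.exp(greedy[1]), 2))
print(" ".join(beam[0]), round(math.exp(beam[1]), 2))
```

In this toy table, greedy decoding commits to "nice" because it is the most likely first step, while two beams also keep "dog" alive and end up with a sequence that has a higher overall probability.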
