# Hugging Face Transformers

## 0. Read in Data

In [2]:
import torch
import transformers
import openpyxl

print(torch.__version__)
print("All installed correctly 🚀")

2.9.1
All installed correctly 🚀


In [3]:
import pandas as pd

# modify the column width
pd.set_option('display.max_colwidth', None)

# look at a subset of the reviews
df = pd.read_excel('../Data/Popchip_Reviews_Sentiment.xlsx').head(30)
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269


In [4]:
# confirm the number of reviews
df.shape

(30, 7)

## 1. Sentiment Analysis

### a. Simple Example

In [5]:
# sentiment analysis with hugging face
from transformers import pipeline

sentiment_analyzer = pipeline("sentiment-analysis", # set the task to sentiment analysis
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", # specify the default distilbert model
                              device=-1) # use the computer's cpu

text1 = 'When life gives you lemons, make lemonade! 🙂'
text2 = 'A dozen lemons will make a gallon of lemonade.'
text3 = 'I didn\'t like the taste of that lemonade at all.'

Device set to use cpu


In [6]:
sentiment_analyzer(text1)

[{'label': 'POSITIVE', 'score': 0.996239423751831}]

In [7]:
sentiment_analyzer(text2)

[{'label': 'POSITIVE', 'score': 0.7781559228897095}]

In [8]:
sentiment_analyzer(text3)

[{'label': 'NEGATIVE', 'score': 0.9955589771270752}]

### b. Practical Example

In [9]:
# calculate the sentiment scores
sentiment_analyzer = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              device=-1,
                              truncation=True) # adding truncation here to truncate text before analyzing sentiment

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

Device set to use cpu


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.698489785194397}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [None]:
# Why set truncation=True?
# Some reviews may be longer than the model's maximum input length (512 tokens for DistilBERT).
# With truncation=True, any text beyond that limit is cut off before it reaches the model.
# This prevents errors on long reviews, at the cost of ignoring the text past the limit.
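To make the effect of truncation concrete, here is a minimal sketch that mimics it with a plain whitespace split. This is an illustration only: the real pipeline counts subword tokens produced by the model's tokenizer, not words, and `truncate_words` is a hypothetical helper, not part of the library.

```python
MAX_TOKENS = 512  # DistilBERT's input limit; the real limit is counted in subword tokens

def truncate_words(text, max_tokens=MAX_TOKENS):
    """Keep only the first max_tokens whitespace-separated words (illustrative stand-in)."""
    words = text.split()
    return " ".join(words[:max_tokens])

long_review = "crunchy " * 600          # a made-up review of 600 words
short = truncate_words(long_review)
print(len(long_review.split()), "->", len(short.split()))
```

Anything past the limit is simply dropped, so sentiment expressed only at the end of a very long review would not influence the score.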

In [None]:
%%time

# add a timer and hide all non-critical warnings
from transformers import pipeline, logging

logging.set_verbosity_error()

sentiment_analyzer = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              device=-1, # use the cpu
                              truncation=True)

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 799 ms, sys: 57.2 ms, total: 856 ms
Wall time: 978 ms


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.698489785194397}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [None]:
%%time

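# use the Apple silicon GPU through PyTorch's MPS (Metal Performance Shaders) backend
# note: on a small sample like these 30 reviews, transfer overhead can outweigh any GPU speedup
sentiment_analyzer = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              device='mps', # run on the Apple silicon GPU
                              truncation=True)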

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 418 ms, sys: 213 ms, total: 631 ms
Wall time: 4.92 s


0    [{'label': 'POSITIVE', 'score': 0.9935213923454285}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984843015670776}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object
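Hard-coding `device='mps'` fails on machines without Apple silicon. A minimal sketch of choosing the device at runtime is shown below; `pick_device` is a hypothetical helper name, and the sketch assumes that passing the returned string as `device=` to `pipeline` is acceptable in your installed version.

```python
def pick_device():
    """Return a device string, preferring a GPU backend when one is usable."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed, so there is nothing else to check
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)  # absent in older torch releases
    if mps is not None and mps.is_available():
        return "mps"  # Apple silicon GPU
    return "cpu"

device = pick_device()
print(device)
```

The result could then be passed along as `pipeline(..., device=device)`. As the timings above suggest, a GPU is not automatically faster for a handful of short texts, so it is worth timing both options.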

In [12]:
# each result is a list holding one dictionary
sentiment_scores[0]

[{'label': 'POSITIVE', 'score': 0.9935213923454285}]

In [13]:
# The dictionary inside the list
sentiment_scores[0][0]

{'label': 'POSITIVE', 'score': 0.9935213923454285}

In [14]:
# extract the label for a single review
sentiment_scores[0][0]['label']

'POSITIVE'

In [15]:
# extract the score for a single review
sentiment_scores[0][0]['score']

0.9935213923454285

In [None]:
# extract the label and score, then combine them into one signed sentiment score per review
df['Label_HF'] = sentiment_scores.apply(lambda x: x[0]['label'])
df['Score_HF'] = sentiment_scores.apply(lambda x: x[0]['score'])
df['Sentiment_HF'] = df.apply(lambda row: row['Score_HF'] if row['Label_HF'] == 'POSITIVE' else -row['Score_HF'], axis=1)
# What is HF?
# HF stands for Hugging Face, the company behind the Transformers library used for the sentiment analysis here.
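The signed-score logic can be checked in isolation without pandas or a model. The sketch below uses hand-written results in the same shape the pipeline returns; `signed_score` is a hypothetical helper name and the scores are rounded values for illustration.

```python
def signed_score(result):
    """Map one result such as {'label': 'NEGATIVE', 'score': 0.70} to a value in [-1, 1]."""
    sign = 1 if result["label"] == "POSITIVE" else -1
    return sign * result["score"]

# each entry mimics the pipeline output for one review: a list holding one dictionary
examples = [
    [{"label": "POSITIVE", "score": 0.9935}],
    [{"label": "NEGATIVE", "score": 0.6985}],
]
signed = [signed_score(r[0]) for r in examples]
print(signed)
```

Because the score is the model's confidence in the label it picked, a two-class model never reports less than 0.5, so the signed value never falls between -0.5 and 0.5. Keep that in mind when comparing it with `Sentiment_VADER`, which can take values near zero.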

In [17]:
df.head(5)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698484,-0.698484
3,23692,A2NU55U9LKTB5J,3,Low,Not somthing I would crave,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,NEGATIVE,0.999631,-0.999631
4,23693,A225F7QFP5LIW2,5,High,healthy and delicious,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,POSITIVE,0.999181,0.999181


In [18]:
# view the calculations
df[['Rating', 'Text', 'Sentiment_VADER', 'Label_HF', 'Score_HF', 'Sentiment_HF']].head()

Unnamed: 0,Rating,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,5,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,5,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605
2,5,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698484,-0.698484
3,3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,NEGATIVE,0.999631,-0.999631
4,5,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,POSITIVE,0.999181,0.999181


In [19]:
# view the most positive review
df.sort_values('Sentiment_HF', ascending=False).head(1).Text

28    These Pop Chips are incredible. They taste so much better than baked chips and the quantity you get for 2 points is so much more. I buy the variety case and love them all!
Name: Text, dtype: object

In [20]:
# view the most negative review
df.sort_values('Sentiment_HF', ascending=True).head(1).Text

3    These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day.  They were the bomb then, not so much now.  Won't buy again unless I get them for cheap or free.
Name: Text, dtype: object

### c. Speed Up Code

In [21]:
%%time

# no optimizations
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1, # running on CPU
    truncation=True
)

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 790 ms, sys: 112 ms, total: 903 ms
Wall time: 1.45 s


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.698489785194397}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [22]:
%%time

# four things to try if you can't use a GPU
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english", # 1. smaller model
    device=-1, # running on CPU
    truncation=True,
    use_fast=True # 2. faster tokenization
)

import torch
torch.set_num_threads(1)  # 3. set the number of CPU threads PyTorch uses (1 here; try other values for your machine)

with torch.no_grad(): # 4. disable gradients
    sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 722 ms, sys: 20.1 ms, total: 742 ms
Wall time: 1.27 s


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984878778457642}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object
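Another option worth trying is batching: pipelines accept a list of texts, so instead of one call per row with `.apply`, the reviews can be sent in groups. The sketch below shows the chunking logic with a stand-in analyzer so it runs without downloading a model; `score_in_batches` and `fake_analyzer` are hypothetical names introduced for illustration.

```python
def score_in_batches(texts, analyzer, batch_size=8):
    """Call analyzer on successive slices of texts and flatten the results into one list."""
    results = []
    for start in range(0, len(texts), batch_size):
        results.extend(analyzer(texts[start:start + batch_size]))
    return results

# stand-in for a real pipeline: returns one result dictionary per input text
def fake_analyzer(batch):
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]

scores = score_in_batches(["a", "b", "c"], fake_analyzer, batch_size=2)
print(len(scores))
```

With a real pipeline, something like `sentiment_analyzer(df['Text'].tolist(), batch_size=8)` does the chunking internally. Whether batching helps on CPU depends on the text lengths and the machine, so time it before adopting it.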

## 2. Named Entity Recognition

In [None]:
# What is Named Entity Recognition (NER)?
# Named Entity Recognition (NER) is a natural language processing (NLP) task that identifies named entities in text and classifies them into predefined categories such as persons, organizations, locations, and dates.

### a. Simple Example

In [23]:
# view warning options
logging.set_verbosity_warning() # view more warnings
#logging.set_verbosity_error() # view fewer warnings

In [None]:
# ner with hugging face
ner_analyzer = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        device=-1,
                        aggregation_strategy='SIMPLE') # aggregate entities

text4 = "I ordered an Arnold Palmer at Applebee's in Springfield."

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [25]:
ner_analyzer(text4)

[{'entity_group': 'MISC',
  'score': np.float32(0.9914088),
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': np.float32(0.943614),
  'word': "Applebee ' s",
  'start': 30,
  'end': 40},
 {'entity_group': 'LOC',
  'score': np.float32(0.9780035),
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

In [26]:
# try a different model
ner_analyzer2 = pipeline("ner",
                        model="dslim/bert-base-NER",
                        device=-1,
                        aggregation_strategy='SIMPLE')

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [27]:
ner_analyzer2(text4)

[{'entity_group': 'PER',
  'score': np.float32(0.8766219),
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': np.float32(0.70051295),
  'word': 'Applebee',
  'start': 30,
  'end': 38},
 {'entity_group': 'LOC',
  'score': np.float32(0.6289268),
  'word': "' s",
  'start': 38,
  'end': 40},
 {'entity_group': 'LOC',
  'score': np.float32(0.99173564),
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

### b. Practical Example

In [28]:
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605


In [29]:
# find the named entities in each review
ner_analyzer = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        device='mps',
                        aggregation_strategy='SIMPLE')

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps


In [30]:
# apply to one review
ner_analyzer(df.Text[1])

[{'entity_group': 'MISC',
  'score': np.float32(0.9149265),
  'word': 'Salt and Vinegar',
  'start': 99,
  'end': 115},
 {'entity_group': 'MISC',
  'score': np.float32(0.7742398),
  'word': 'Salt and Vinegar',
  'start': 392,
  'end': 408},
 {'entity_group': 'ORG',
  'score': np.float32(0.9589694),
  'word': 'S & V',
  'start': 450,
  'end': 453}]

In [31]:
# extract the words
[entity['word'] for entity in ner_analyzer(df.Text[1])]

['Salt and Vinegar', 'Salt and Vinegar', 'S & V']

In [32]:
# apply to all reviews
df['Named_Entities'] = df['Text'].apply(lambda x: [entity['word'] for entity in ner_analyzer(x)])

In [33]:
# view the named entities
df[['Text', 'Named_Entities']].head()

Unnamed: 0,Text,Named_Entities
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,[]
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.","[Salt and Vinegar, Salt and Vinegar, S & V]"
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",[Amazon]
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",[]
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",[]


In [34]:
# create a unique list of named entities
named_entities = list(set(df.Named_Entities.explode().dropna().tolist()))
named_entities[:10]

['PopChips',
 'and',
 'General Mills',
 '##D',
 'B',
 'Salt',
 '##le',
 'Stop and Shop',
 'Popchips',
 'com']

In [35]:
# view the number of named entities found
len(named_entities)

33

In [36]:
# exclude subwords from the list
named_entities_clean = [entity for entity in named_entities if '#' not in entity]
sorted(named_entities_clean)

['& V',
 'Amazon',
 'B',
 'COSTCO',
 'Ch',
 'Cheetos',
 'Chip',
 'Costco',
 'General Mills',
 'Lays',
 'Miami',
 'PopChips',
 'Popchi',
 'Popchips',
 'Popchips B',
 'Pringles',
 'S',
 'S & V',
 'Salt',
 'Salt and Vinegar',
 'Stop and Shop',
 'VA',
 "Vinegar Pirate ' s Bo",
 'Watch',
 'and',
 'com']

In [37]:
# view the number of named entities found
len(named_entities_clean)

26
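The `##` prefix marks a WordPiece continuation, meaning the piece belongs to the end of the previous token. Filtering on `#` removes those fragments but leaves partial words such as 'Popchi' behind. As an illustration of how the pieces relate, here is a minimal sketch that rejoins a token sequence; `merge_wordpieces` is a hypothetical helper and the token list is made up.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, attaching '##' continuation pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the word before it
        else:
            # a stray leading '##' piece has nothing to attach to, so keep it without the marker
            words.append(token[2:] if token.startswith("##") else token)
    return words

print(merge_wordpieces(["Pop", "##chi", "##ps", "are", "great"]))
```

Within the pipeline itself, the other `aggregation_strategy` values ('first', 'average', 'max') operate on whole words rather than subword pieces, so they may be worth testing to see whether they reduce the fragments on these reviews.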

## 3. Zero-Shot Classification

In [None]:
# What is Zero-Shot Classification?
# Zero-Shot Classification is a natural language processing (NLP) technique that allows a model to classify text into categories that it has not been explicitly trained on.
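With a natural language inference (NLI) model such as the one used in this section, the usual approach is to turn each candidate label into a hypothesis sentence, ask the model how strongly the text entails it, and normalize those entailment scores across labels. The sketch below walks through that arithmetic with made-up logits; the numbers are hypothetical, and a real run would get them from the model.

```python
import math

HYPOTHESIS_TEMPLATE = "This example is {}."  # the pipeline's default template

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["quote", "food & drinks", "technology"]
hypotheses = [HYPOTHESIS_TEMPLATE.format(label) for label in labels]

# hypothetical entailment logits, one per (text, hypothesis) pair
entailment_logits = [2.0, 0.5, -1.0]

scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

This is why the scores in the outputs of this section sum to 1 and why adding or removing a candidate label changes every score, not just one.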

### a. Simple Example

In [38]:
# zero-shot classification with hugging face
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=-1)

Device set to use cpu


In [39]:
text1, text4

('When life gives you lemons, make lemonade! 🙂',
 "I ordered an Arnold Palmer at Applebee's in Springfield.")

In [40]:
classifier(text1, candidate_labels = ['quote', 'food & drinks', 'technology'])

{'sequence': 'When life gives you lemons, make lemonade! 🙂',
 'labels': ['quote', 'food & drinks', 'technology'],
 'scores': [0.9833194017410278, 0.01117656659334898, 0.005504006519913673]}

In [41]:
classifier(text4, candidate_labels = ['quote', 'food & drinks', 'technology'])

{'sequence': "I ordered an Arnold Palmer at Applebee's in Springfield.",
 'labels': ['food & drinks', 'quote', 'technology'],
 'scores': [0.5157102942466736, 0.44382408261299133, 0.04046565294265747]}

### b. Practical Example

In [42]:
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF,Named_Entities
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521,[]
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605,"[Salt and Vinegar, Salt and Vinegar, S & V]"


In [None]:
# remember our topics from the machine learning section: 'order', 'taste & texture', 'good', 'flavor', 'health'
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device='mps') # use mac's silicon chip

Device set to use mps


In [44]:
# try on one review
classifier(df.Text[0], ['order', 'taste & texture', 'good', 'flavor', 'health'])

{'sequence': 'Popchips are the bomb!!  I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip.  My healthy eating program is saved.',
 'labels': ['health', 'good', 'flavor', 'taste & texture', 'order'],
 'scores': [0.26937171816825867,
  0.2510984539985657,
  0.2403363138437271,
  0.20681004226207733,
  0.03238346800208092]}

In [45]:
# try on another review
classifier(df.Text[1], ['order', 'taste & texture', 'good', 'flavor', 'health'])

{'sequence': 'I like the puffed nature of this chip that makes it more unique in the chip market.  I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever.  I have tried the cheddar and regular flavors as well.  The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular.  The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.',
 'labels': ['flavor', 'good', 'taste & texture', 'order', 'health'],
 'scores': [0.3400878310203552,
  0.27486953139305115,
  0.2560734450817108,
  0.11652442067861557,
  0.012444754131138325]}

In [46]:
# extract just the top label
classifier(df.Text[1], ['order', 'taste & texture', 'good', 'flavor', 'health'])['labels'][0]

'flavor'

In [None]:
# apply to all reviews
df['Category'] = df.Text.apply(lambda x: classifier(x, ['order', 'taste & texture', 'good', 'flavor', 'health'])['labels'][0]) # extract top label

In [48]:
# view the category labels
df[['Text', 'Category']].head()

Unnamed: 0,Text,Category
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,health
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",flavor
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",good
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",taste & texture
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",taste & texture


## 4. Text Summarization

In [None]:
# What is Text Summarization?
# Text Summarization is a natural language processing (NLP) technique that involves condensing a longer piece of text into a shorter version while retaining its main ideas and key information.

### a. Simple Example

In [None]:
# text summarization with hugging face
summarizer = pipeline("summarization", # set the task to summarization
                      model="facebook/bart-large-cnn",
                      device=-1)

text5 = """
            The lemon tree produces a pointed oval yellow fruit. Botanically this is a hesperidium, 
            a modified berry with a tough, leathery rind. The rind is divided into an outer colored layer or zest, 
            which is aromatic with essential oils, and an inner layer of white spongy pith. 
            Inside are multiple carpels arranged as radial segments. The seeds develop inside the carpels. 
            The space inside each segment is a locule filled with juice vesicles. 
            Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins.[3] 
            Their juice contains slightly more citric acid than lime juice (about 47 g/L), 
            nearly twice as much as grapefruit juice, and about five times as much as orange juice.[4]
        """

Device set to use cpu


In [50]:
# try it with the default parameters
summarizer(text5)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins.'}]

In [None]:
# specify the parameters
summarizer(text5, min_length=20, max_length=50) # set the minimum and maximum summary length, measured in tokens (not words or characters)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'}]

In [52]:
# running it again gives the same result, since sampling is off by default
summarizer(text5, min_length=20, max_length=50)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'}]

In [None]:
# with sampling enabled, the summary can differ from run to run
summarizer(text5, min_length=20, max_length=50, do_sample=True) # enable sampling for more varied summaries

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins. Their juice contains slightly more citric acid than lime juice.'}]

In [54]:
# extract just the text portion
summarizer(text5, min_length=20, max_length=50)[0]['summary_text']

'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'

### b. Practical Example

In [55]:
# load pipelines
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device='mps')

Device set to use mps


In [56]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device='mps')

Device set to use mps


In [57]:
summarizer(df.Text[0])[0]['summary_text']

Your max_length is set to 142, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


'Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip.  My healthy eating program is saved. I love Popchips! I love them!! I love popchips. I hate chips.'

In [58]:
# step 1: summarize reviews
df['Summary'] = df['Text'].apply(lambda x: summarizer(x, min_length=20, max_length=50)[0]['summary_text'])
df[['Text', 'Summary']].head(4)

# If a summary is cut off mid-sentence, try increasing max_length and adding early_stopping and length_penalty
#summarizer(df.Text[0], min_length=10, max_length=50, early_stopping=True, length_penalty=.8)[0]['summary_text']

Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
Your max_length is set to 50, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)

Unnamed: 0,Text,Summary
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.","I like the puffed nature of this chip that makes it more unique in the chip market. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come"
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!","I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. If you are on a low salt diet these chips are probably not for you."
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.","These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. Won't buy again unless I get them for cheap or free."
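
The warnings above each suggest a `max_length` of about half the input length (39 → 19, 44 → 22, 24 → 12, 31 → 15). The sketch below follows the same rule of thumb. `suggest_lengths` and its `min_floor` parameter are hypothetical names for illustration, not part of transformers, and the input length is assumed to come from the tokenizer.

```python
def suggest_lengths(input_length, min_floor=5):
    """Hypothetical helper: pick (min_length, max_length) for a short input.

    Mirrors the rule of thumb in the warning text: max_length is about
    half of the input length, and min_length never exceeds max_length.
    """
    max_length = max(input_length // 2, 1)
    min_length = min(min_floor, max_length)
    return min_length, max_length


# input lengths taken from the warnings printed above
for n in (39, 44, 24, 31):
    print(n, suggest_lengths(n))
```

The resulting pair could then be passed to the summarizer as `min_length` and `max_length` for each review, instead of using one fixed pair for every row.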


In [59]:
# step 2: find sentiment scores
sentiment_scores2 = df.Summary.apply(sentiment_analyzer)
sentiment_scores2[:5]

0    [{'label': 'POSITIVE', 'score': 0.9976533055305481}]
1    [{'label': 'POSITIVE', 'score': 0.9991886019706726}]
2    [{'label': 'NEGATIVE', 'score': 0.9929706454277039}]
3    [{'label': 'NEGATIVE', 'score': 0.9985463619232178}]
4    [{'label': 'POSITIVE', 'score': 0.9993218183517456}]
Name: Summary, dtype: object

In [60]:
# extract label and score and create a sentiment score
df['Label_HF2'] = sentiment_scores2.apply(lambda x: x[0]['label'])
df['Score_HF2'] = sentiment_scores2.apply(lambda x: x[0]['score'])
df['Sentiment_HF2'] = df.apply(lambda row: row['Score_HF2'] if row['Label_HF2'] == 'POSITIVE' else -row['Score_HF2'], axis=1)
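
The three lines above can also be read as one small function. This is a minimal sketch on two results copied from the output of the previous cell; `signed_sentiment` is a hypothetical helper name, and no model is loaded here.

```python
def signed_sentiment(result):
    """Turn one pipeline result, e.g. [{'label': 'NEGATIVE', 'score': 0.99}],
    into a single number: +score for POSITIVE, -score otherwise."""
    label = result[0]['label']
    score = result[0]['score']
    return score if label == 'POSITIVE' else -score


# two results copied from the sentiment_scores2 output above
examples = [
    [{'label': 'POSITIVE', 'score': 0.9976533055305481}],
    [{'label': 'NEGATIVE', 'score': 0.9929706454277039}],
]
print([round(signed_sentiment(r), 6) for r in examples])  # [0.997653, -0.992971]
```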

In [61]:
# view the calculations (scores are computed on the summaries, shown next to the original text)
df[['Text', 'Label_HF2', 'Score_HF2', 'Sentiment_HF2']].head()

Unnamed: 0,Text,Label_HF2,Score_HF2,Sentiment_HF2
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,POSITIVE,0.997653,0.997653
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",POSITIVE,0.999189,0.999189
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",NEGATIVE,0.992971,-0.992971
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",NEGATIVE,0.998546,-0.998546
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",POSITIVE,0.999322,0.999322


In [63]:
# compare the sentiment scores
df[['Text', 'Sentiment_VADER', 'Sentiment_HF', 'Sentiment_HF2']].head()

Unnamed: 0,Text,Sentiment_VADER,Sentiment_HF,Sentiment_HF2
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,0.993521,0.997653
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,0.999605,0.999189
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,-0.698484,-0.992971
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,-0.999631,-0.998546
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,0.999181,0.999322
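
One simple way to compare the three columns is to check how often two methods agree on the sign of the score. The sketch below uses only the five rows displayed above, so it says nothing about the full data set; `sign_agreement` is a hypothetical helper name.

```python
# scores copied from the five rows in the comparison table above
vader = [0.9244, 0.7269, 0.979, 0.8689, 0.9613]
hf = [0.993521, 0.999605, -0.698484, -0.999631, 0.999181]
hf2 = [0.997653, 0.999189, -0.992971, -0.998546, 0.999322]


def sign_agreement(a, b):
    """Share of rows where both scores fall on the same side of zero."""
    same = sum((x > 0) == (y > 0) for x, y in zip(a, b))
    return same / len(a)


print(sign_agreement(vader, hf))  # 0.6
print(sign_agreement(hf, hf2))    # 1.0
```

On these five rows, VADER scores every review as positive, while both Hugging Face columns mark rows 2 and 3 as negative.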


## 5. Text Generation

In [None]:
# What is Text Generation?
# Text Generation is a natural language processing (NLP) technique that involves generating coherent and contextually relevant text based on a given input or prompt.

In [62]:
# text generation with hugging face
generator = pipeline("text-generation", model="gpt2", max_length=20, device=-1)

prompt = "On a hot summer day, I love to drink cold lemonade because"

Device set to use cpu


In [None]:
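# set general parameters
# note: as the warnings below show, the default max_new_tokens (256) takes precedence over max_length
generator(prompt,
          max_length=50,
          num_return_sequences=1, # number of generated sequences
          do_sample=False) # disable sampling for deterministic output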

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "On a hot summer day, I love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also love to drink cold lemona

In [64]:
# get a more random output
generator(prompt, max_length=50, num_return_sequences=1, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'On a hot summer day, I love to drink cold lemonade because I love to drink it, and I also like to drink cold lemonade because I love to drink cold lemonade. I want to drink cold lemonade, because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade.\n\n(I want to drink cold lemonade because I want to drink cold lemonade.)\n\nI like cold lemonade because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade.\n\nI like cold lemonade because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade.\n\nI like cold lemonade because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade. I want to drink cold lemonade because I want to drink cold lemonade.\n\nI l

In [65]:
# run the same call again: with do_sample=True the output changes between runs
generator(prompt, max_length=50, num_return_sequences=1, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "On a hot summer day, I love to drink cold lemonade because it tastes great. I've always loved it but it's not as good as it is in my mouth. The lemonade is not a great substitute for lemonade, but I think it's a great alternative to lemonade. I've tried it a couple of times on my own and it works great. I think it's a great replacement for lemonade.\n\nRated 5 out of 5 by B_ from Best lemonade EVER I've been using this lemonade for years. It has been a success. I've found it to be a very pleasant drink, even for those who are not used to lemonade. What a great way to give off a strong citrus flavor without having to drink a lot. I'm very happy with my purchase!\n\nRated 5 out of 5 by jpugil from Great lemonade I bought this lemonade this week and I'm happy!!! This is a great lemonade for a cold day. It is very refreshing and refreshing. It is a great lemonade if you want a little lemonade with lemonade.\n\nRated 5 out of 5 by S_D from Great Lemonade I love this lem
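
The difference between `do_sample=False` and `do_sample=True` can be illustrated on a single toy step. This is only a sketch in plain Python: the vocabulary and probabilities are made up, and `greedy_pick` and `sample_pick` are hypothetical names, not transformers functions. Greedy decoding always takes the most likely token, which is one reason the first output above can fall into a loop; sampling draws a token in proportion to its probability.

```python
import random

# toy next-token distribution; tokens and probabilities are made up
vocab = ["refreshing", "sweet", "cold", "cheap"]
probs = [0.55, 0.25, 0.15, 0.05]


def greedy_pick(vocab, probs):
    """do_sample=False analogue: always take the most likely token."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best]


def sample_pick(vocab, probs, rng):
    """do_sample=True analogue: draw a token in proportion to its probability."""
    return rng.choices(vocab, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
print([greedy_pick(vocab, probs) for _ in range(5)])       # same token every time
print([sample_pick(vocab, probs, rng) for _ in range(5)])  # can vary from draw to draw
```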

## 6. Document Similarity with Embeddings

### a. Simple Example

In [66]:
# feature extraction with hugging face
feature_extractor = pipeline("feature-extraction",
                             model="sentence-transformers/all-MiniLM-L6-v2",
                             device=-1)

Device set to use cpu


In [67]:
# view the text
text1

'When life gives you lemons, make lemonade! 🙂'

In [68]:
# view the first 10 values of the embedding ([0][0] selects the first token's vector)
feature_extractor(text1)[0][0][:10]

[-0.29363277554512024,
 0.20775160193443298,
 0.11103475838899612,
 0.14668850600719452,
 0.3988548219203949,
 0.3143490254878998,
 0.41525599360466003,
 -0.19369441270828247,
 0.11604060977697372,
 -0.885179340839386]

In [69]:
# view the length of the embedding vector
len(feature_extractor(text1)[0][0])

384
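
The feature-extraction pipeline returns one vector per token, nested as `[batch][token][dimension]`, and the `[0][0]` index above keeps only the first token's vector. An alternative that is often used with sentence-transformers models is to average all token vectors (mean pooling). The sketch below shows both options on a made-up nested list with the same layout; `first_token` and `mean_pool` are hypothetical helper names, and the notebook itself continues with the first-token approach.

```python
def first_token(features):
    """Keep only the first token's vector, as feature_extractor(text)[0][0] does."""
    return features[0][0]


def mean_pool(features):
    """Average the token vectors of the first (and only) input text."""
    tokens = features[0]
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


# made-up output for one text with 3 tokens and 2 dimensions
toy = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
print(first_token(toy))  # [1.0, 2.0]
print(mean_pool(toy))    # [3.0, 4.0]
```

With batches of padded inputs, mean pooling would also need to ignore the padding tokens; that step is left out of this sketch.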

### b. Practical Example

#### Step 1: Extract Embeddings

In [75]:
# modify the column width
pd.set_option('display.max_colwidth', None)

# read in the movies data
movies = pd.read_csv('../Data/movie_reviews.csv')
movies.head(2)

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus
0,A Dog's Journey,PG,"Drama, Kids & Family",5/17/19,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",Gail Mancuso,female,50,92,"A Dog's Journey is as sentimental as one might expect, but even cynical viewers may find their ability to resist shedding a tear stretched to the puppermost limit."
1,A Dog's Way Home,PG,Drama,1/11/19,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",Charles Martin Smith,male,60,71,"A Dog's Way Home may not quite be a family-friendly animal drama fan's best friend, but this canine adventure is no less heartwarming for its familiarity."


In [76]:
# extract the embedding representation for each movie description
feature_extractor = pipeline("feature-extraction",
                             model="sentence-transformers/all-MiniLM-L6-v2",
                             device='mps')

embeddings = movies['movie_info'].apply(lambda x: feature_extractor(x)[0][0])
embeddings.head(2)

Device set to use mps


0    [-0.32529786229133606, -0.07725143432617188, 0.12645775079727173, -0.07083047181367874, -0.1954820156097412, 0.39315325021743774, -0.028674812987446785, -0.07342422008514404, -0.04998306185007095, -0.22656352818012238, 0.0639311820268631, -0.19600071012973785, -0.058478984981775284, 0.06706777960062027, -0.11102186143398285, -0.02910090610384941, -0.09101027995347977, -0.04516932740807533, -0.24494890868663788, -0.37002164125442505, -0.11229361593723297, 0.1254516988992691, 0.0339873805642128, 0.03995151445269585, 0.009731832891702652, 0.11734715104103088, 0.24881170690059662, -0.11834382265806198, 0.21207652986049652, -0.9435896873474121, -0.05794324725866318, -0.0903797596693039, 0.043521054089069366, -0.003161325119435787, 0.11663058400154114, -0.06622736901044846, 0.11712948977947235, 0.03098667412996292, 0.3084622025489807, -0.12582361698150635, 0.05811595916748047, 0.10287109017372131, 0.02482311800122261, 0.12457882612943649, -0.1665530651807785, -0.3847602605819702, 0.1391

#### Step 2: Specify the Captain Marvel Embedding

In [77]:
# view one movie - Captain Marvel
movies[movies.movie_title == 'Captain Marvel']

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus
25,Captain Marvel,PG-13,"Action & Adventure, Science Fiction & Fantasy",3/8/19,"The story follows Carol Danvers as she becomes one of the universe's most powerful heroes when Earth is caught in the middle of a galactic war between two alien races. Set in the 1990s, Captain Marvel is an all-new adventure from a previously unseen period in the history of the Marvel Cinematic Universe.","Anna Boden, Ryan Fleck",female,78,53,"Packed with action, humor, and visual thrills, Captain Marvel introduces the MCU's latest hero with an origin story that makes effective use of the franchise's signature formula."


In [78]:
embeddings[25]

[0.025250256061553955,
 -0.08712661266326904,
 0.3106367290019989,
 0.3306695222854614,
 0.03224433586001396,
 0.19540604948997498,
 -0.014434965327382088,
 0.25350841879844666,
 -0.21229849755764008,
 0.4076874554157257,
 -0.257793128490448,
 -0.17885392904281616,
 0.07528190314769745,
 0.2038242518901825,
 -0.16678935289382935,
 -0.18752668797969818,
 -0.10599218308925629,
 -0.24773384630680084,
 -0.04168311506509781,
 0.12357877194881439,
 0.2890177369117737,
 0.2567751407623291,
 -0.04900190606713295,
 0.06799813359975815,
 -0.29883331060409546,
 -0.08523674309253693,
 -0.10169246047735214,
 0.1938239485025406,
 -0.14997249841690063,
 -0.7965190410614014,
 -0.047787487506866455,
 -0.12024735659360886,
 0.1082068532705307,
 0.10693415999412537,
 -0.3177112936973572,
 -0.26869288086891174,
 0.4158342480659485,
 0.06679555028676987,
 0.165336474776268,
 -0.05166644975543022,
 -0.37615859508514404,
 0.07553063333034515,
 0.37573546171188354,
 0.14313198626041412,
 0.06679452955722809,


In [None]:
# save the embedding for that movie
import numpy as np

# 2D Array (Row Vector): [[0.1, 0.5, ...]]
# This is explicitly 1 row containing many columns
embedding_cm = np.array(embeddings[25]).reshape(1, -1) # reshape to 2D array
embedding_cm.shape

(1, 384)

In [80]:
embedding_cm

array([[ 2.52502561e-02, -8.71266127e-02,  3.10636729e-01,
         3.30669522e-01,  3.22443359e-02,  1.95406049e-01,
        -1.44349653e-02,  2.53508419e-01, -2.12298498e-01,
         4.07687455e-01, -2.57793128e-01, -1.78853929e-01,
         7.52819031e-02,  2.03824252e-01, -1.66789353e-01,
        -1.87526688e-01, -1.05992183e-01, -2.47733846e-01,
        -4.16831151e-02,  1.23578772e-01,  2.89017737e-01,
         2.56775141e-01, -4.90019061e-02,  6.79981336e-02,
        -2.98833311e-01, -8.52367431e-02, -1.01692460e-01,
         1.93823949e-01, -1.49972498e-01, -7.96519041e-01,
        -4.77874875e-02, -1.20247357e-01,  1.08206853e-01,
         1.06934160e-01, -3.17711294e-01, -2.68692881e-01,
         4.15834248e-01,  6.67955503e-02,  1.65336475e-01,
        -5.16664498e-02, -3.76158595e-01,  7.55306333e-02,
         3.75735462e-01,  1.43131986e-01,  6.67945296e-02,
        -1.00563303e-01, -9.74753872e-02,  8.79165605e-02,
        -6.28625602e-02, -1.57981172e-01, -2.97595084e-0

#### Step 3: Specify the Embeddings for All Movies

In [None]:
# save the embeddings for all movies
embeddings_movies = np.vstack(embeddings) # stack all embeddings vertically
embeddings_movies.shape

(166, 384)
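
The two shapes used in Steps 2 and 3 can be checked on a small example. This is a sketch with made-up 4-dimensional vectors standing in for the 384-dimensional embeddings: `reshape(1, -1)` turns one vector into a single-row 2D array, and `np.vstack` stacks a collection of vectors into one row per item.

```python
import numpy as np

# three toy 4-dimensional embeddings stored as plain lists (made-up values)
embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

one_movie = np.array(embeddings[1]).reshape(1, -1)  # 1 row, 4 columns
all_movies = np.vstack(embeddings)                  # 3 rows, 4 columns

print(one_movie.shape)   # (1, 4)
print(all_movies.shape)  # (3, 4)
```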

#### Step 4: Calculate Cosine Similarity Scores

In [82]:
# calculate the cosine similarity scores
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores_cm = cosine_similarity(embedding_cm, embeddings_movies)
similarity_scores_cm_series = pd.Series(similarity_scores_cm.flatten(), name='similarity_score')
similarity_scores_cm_series.head()

0    0.749577
1    0.684320
2    0.599276
3    0.673823
4    0.724890
Name: similarity_score, dtype: float64
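
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below writes the formula out in plain Python for illustration; `cosine` is a hypothetical helper, and the notebook keeps using scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))   # perpendicular: 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```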

In [83]:
# combine movie titles, descriptions and scores
similarity_scores_cm_df = pd.concat([movies[['movie_title', 'movie_info']], similarity_scores_cm_series], axis=1)
similarity_scores_cm_df.head()

Unnamed: 0,movie_title,movie_info,similarity_score
0,A Dog's Journey,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",0.749577
1,A Dog's Way Home,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",0.68432
2,A Tuba to Cuba,"The leader of New Orleans' famed Preservation Hall Jazz Band seeks to fulfill his late father's dream of retracing their musical roots to the shores of Cuba in search of the indigenous music that gave birth to New Orleans jazz. A TUBA TO CUBA celebrates the triumph of the human spirit expressed through the universal language of music and challenges us to resolve to build bridges, not walls.",0.599276
3,A Vigilante,"A once abused woman, Sadie (Olivia Wilde), devotes herself to ridding victims of their domestic abusers while hunting down the husband she must kill to truly be free. A Vigilante is a thriller inspired by the strength and bravery of real domestic abuse survivors and the incredible obstacles to safety they face.",0.673823
4,After,"Based on Anna Todd's best-selling novel which became a publishing sensation on social storytelling platform Wattpad, AFTER follows Tessa (Langford), a dedicated student, dutiful daughter and loyal girlfriend to her high school sweetheart, as she enters her first semester in college. Armed with grand ambitions for her future, her guarded world opens up when she meets the dark and mysterious Hardin Scott (Tiffin), a magnetic, brooding rebel who makes her question all she thought she knew about herself and what she wants out of life.",0.72489


In [84]:
# view the top 5 most similar movies
similarity_scores_cm_df.sort_values('similarity_score', ascending=False).head()

Unnamed: 0,movie_title,movie_info,similarity_score
25,Captain Marvel,"The story follows Carol Danvers as she becomes one of the universe's most powerful heroes when Earth is caught in the middle of a galactic war between two alien races. Set in the 1990s, Captain Marvel is an all-new adventure from a previously unseen period in the history of the Marvel Cinematic Universe.",1.0
45,Fast & Furious Presents: Hobbs & Shaw,"Ever since hulking lawman Hobbs (Dwayne Johnson), a loyal agent of America's Diplomatic Security Service, and lawless outcast Shaw (Jason Statham), a former British military elite operative, first faced off in 2015's Furious 7, the duo have swapped smack talk and body blows as they've tried to take each other down. But when cyber-genetically enhanced anarchist Brixton (Idris Elba) gains control of an insidious bio-threat that could alter humanity forever--and bests a brilliant and fearless rogue MI6 agent (The Crown's Vanessa Kirby), who just happens to be Shaw's sister--these two sworn enemies will have to partner up to bring down the only guy who might be badder than themselves.",0.819661
18,Avengers: Endgame,"The grave course of events set in motion by Thanos that wiped out half the universe and fractured the Avengers ranks compels the remaining Avengers to take one final stand in Marvel Studios' grand conclusion to twenty-two films, ""Avengers: Endgame.""",0.794008
131,The LEGO Movie 2: The Second Part,"The much-anticipated sequel to the critically acclaimed, global box office phenomenon that started it all, ""The LEGO (R) Movie 2: The Second Part,"" reunites the heroes of Bricksburg in an all new action-packed adventure to save their beloved city. It's been five years since everything was awesome and the citizens are now facing a huge new threat: LEGO DUPLO (R) invaders from outer space, wrecking everything faster than it can be rebuilt. The battle to defeat the invaders and restore harmony to the LEGO universe will take Emmet (Chris Pratt), Lucy (Elizabeth Banks), Batman (Will Arnett) and their friends to faraway, unexplored worlds, including a strange galaxy where everything is a musical. It will test their courage, creativity and Master Building skills, and reveal just how special they really are.",0.792253
6,Alita: Battle Angel,"From visionary filmmakers James Cameron (AVATAR) and Robert Rodriguez (SIN CITY), comes ALITA: BATTLE ANGEL, an epic adventure of hope and empowerment. When Alita (Rosa Salazar) awakens with no memory of who she is in a future world she does not recognize, she is taken in by Ido (Christoph Waltz), a compassionate doctor who realizes that somewhere in this abandoned cyborg shell is the heart and soul of a young woman with an extraordinary past. As Alita learns to navigate her new life and the treacherous streets of Iron City, Ido tries to shield her from her mysterious history while her street-smart new friend Hugo (Keean Johnson) offers instead to help trigger her memories. But it is only when the deadly and corrupt forces that run the city come after Alita that she discovers a clue to her past - she has unique fighting abilities that those in power will stop at nothing to control. If she can stay out of their grasp, she could be the key to saving her friends, her family and the world she's grown to love.",0.789453


#### DEMO: Create a function to find the most similar movie

In [85]:
# step 1: specify our feature extraction model
feature_extractor = pipeline('feature-extraction',
                     model='sentence-transformers/all-MiniLM-L6-v2',
                     device='mps')

Device set to use mps


In [86]:
# step 2: create a movies x embeddings array (166 x 384)
embeddings = movies.movie_info.apply(lambda row: feature_extractor(row)[0][0])
embeddings_movies = np.vstack(embeddings)
embeddings_movies.shape

(166, 384)

In [87]:
# step 3: create a get_similar_movies function with the inputs: embeddings, movie_index, movie_details, top_n
def get_similar_movies(embeddings, movie_index, movie_details, top_n=3):

    # create a (1 x n_features) embedding for the movie at movie_index
    m_embedding = np.array(embeddings[movie_index]).reshape(1, -1)
    
    # calculate similarity scores against every movie (including itself)
    similarity_scores = cosine_similarity(m_embedding, embeddings)
    similarity_scores_series = pd.Series(similarity_scores.flatten(), name='similarity_score')
    
    # bring in movie info (assumes movie_details has the default 0..n-1 index)
    movies_similarity_scores_df = pd.concat([movie_details, similarity_scores_series], axis=1)

    # return the movie itself (score 1.0) followed by its top_n most similar movies
    return movies_similarity_scores_df.sort_values('similarity_score', ascending=False).iloc[0:top_n+1]
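
The function above returns the query movie as its first row, since every movie has a similarity of 1.0 with itself. If only the recommendations are wanted, one option is to skip the query index when ranking. The sketch below shows that idea on made-up scores; `top_matches` is a hypothetical helper and not part of the demo function.

```python
def top_matches(scores, query_index, top_n=3):
    """Indices of the top_n highest scores, skipping the query itself."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != query_index][:top_n]


# made-up similarity scores; index 1 is the query movie (score 1.0)
scores = [0.75, 1.0, 0.60, 0.82, 0.79]
print(top_matches(scores, query_index=1, top_n=3))  # [3, 4, 0]
```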

In [88]:
# modify the column width
pd.set_option('display.max_colwidth', None)

In [89]:
# find movies similar to Captain Marvel
get_similar_movies(embeddings_movies, 25, movies[['movie_title', 'movie_info']])

Unnamed: 0,movie_title,movie_info,similarity_score
25,Captain Marvel,"The story follows Carol Danvers as she becomes one of the universe's most powerful heroes when Earth is caught in the middle of a galactic war between two alien races. Set in the 1990s, Captain Marvel is an all-new adventure from a previously unseen period in the history of the Marvel Cinematic Universe.",1.0
45,Fast & Furious Presents: Hobbs & Shaw,"Ever since hulking lawman Hobbs (Dwayne Johnson), a loyal agent of America's Diplomatic Security Service, and lawless outcast Shaw (Jason Statham), a former British military elite operative, first faced off in 2015's Furious 7, the duo have swapped smack talk and body blows as they've tried to take each other down. But when cyber-genetically enhanced anarchist Brixton (Idris Elba) gains control of an insidious bio-threat that could alter humanity forever--and bests a brilliant and fearless rogue MI6 agent (The Crown's Vanessa Kirby), who just happens to be Shaw's sister--these two sworn enemies will have to partner up to bring down the only guy who might be badder than themselves.",0.819661
18,Avengers: Endgame,"The grave course of events set in motion by Thanos that wiped out half the universe and fractured the Avengers ranks compels the remaining Avengers to take one final stand in Marvel Studios' grand conclusion to twenty-two films, ""Avengers: Endgame.""",0.794008
131,The LEGO Movie 2: The Second Part,"The much-anticipated sequel to the critically acclaimed, global box office phenomenon that started it all, ""The LEGO (R) Movie 2: The Second Part,"" reunites the heroes of Bricksburg in an all new action-packed adventure to save their beloved city. It's been five years since everything was awesome and the citizens are now facing a huge new threat: LEGO DUPLO (R) invaders from outer space, wrecking everything faster than it can be rebuilt. The battle to defeat the invaders and restore harmony to the LEGO universe will take Emmet (Chris Pratt), Lucy (Elizabeth Banks), Batman (Will Arnett) and their friends to faraway, unexplored worlds, including a strange galaxy where everything is a musical. It will test their courage, creativity and Master Building skills, and reveal just how special they really are.",0.792253


In [90]:
# find movies similar to The LEGO Movie 2
get_similar_movies(embeddings_movies, 131, movies[['movie_title', 'movie_info', 'rating']], top_n=5)

Unnamed: 0,movie_title,movie_info,rating,similarity_score
131,The LEGO Movie 2: The Second Part,"The much-anticipated sequel to the critically acclaimed, global box office phenomenon that started it all, ""The LEGO (R) Movie 2: The Second Part,"" reunites the heroes of Bricksburg in an all new action-packed adventure to save their beloved city. It's been five years since everything was awesome and the citizens are now facing a huge new threat: LEGO DUPLO (R) invaders from outer space, wrecking everything faster than it can be rebuilt. The battle to defeat the invaders and restore harmony to the LEGO universe will take Emmet (Chris Pratt), Lucy (Elizabeth Banks), Batman (Will Arnett) and their friends to faraway, unexplored worlds, including a strange galaxy where everything is a musical. It will test their courage, creativity and Master Building skills, and reveal just how special they really are.",PG,1.0
151,Toy Story 4,"Woody (voice of Tom Hanks) has always been confident about his place in the world, and that his priority is taking care of his kid, whether that's Andy or Bonnie. So when Bonnie's beloved new craft-project-turned-toy, Forky (voice of Tony Hale), declares himself as ""trash"" and not a toy, Woody takes it upon himself to show Forky why he should embrace being a toy. But when Bonnie takes the whole gang on her family's road trip excursion, Woody ends up on an unexpected detour that includes a reunion with his long-lost friend Bo Peep (voice of Annie Potts). After years of being on her own, Bo's adventurous spirit and life on the road belie her delicate porcelain exterior. As Woody and Bo realize they're worlds apart when it comes to life as a toy, they soon come to find that's the least of their worries.",G,0.83106
32,Dolemite Is My Name,"Stung by a string of showbiz failures, floundering comedian Rudy Ray Moore (Academy Award nominee Eddie Murphy) has an epiphany that turns him into a word-of-mouth sensation: step onstage as someone else. Borrowing from the street mythology of 1970s Los Angeles, Moore assumes the persona of Dolemite, a pimp with a cane and an arsenal of obscene fables. However, his ambitions exceed selling bootleg records deemed too racy for mainstream radio stations to play. Moore convinces a social justice-minded dramatist (Keegan-Michael Key) to write his alter ego a film, incorporating kung fu, car chases, and Lady Reed (Da'Vine Joy Randolph), an ex-backup singer who becomes his unexpected comedic foil. Despite clashing with his pretentious director, D'Urville Martin (Wesley Snipes), and countless production hurdles at their studio in the dilapidated Dunbar Hotel, Moore's Dolemite becomes a runaway box office smash and a defining movie of the Blaxploitation era.",R,0.826733
154,Triple Threat,"TRIPLE THREAT, the newest feature from Johnson, is an adrenaline fueled and gritty action thriller starring some of the biggest names in action today. Michael Jai White (BLACK DYNAMITE; UNDISPUTED 2: LAST MAN STANDING), Scott Adkins (Marvel's DOCTOR STRANGE; THE EXPENDABLES 2), Michael Bisping (xXx: RETURN OF XANDER CAGE) star as a group of professional assassins hired to take out a billionaire's daughter who is intent on bringing down a major crime syndicate. A down and out team of mercenaries, played by Tony Jaa (ONG BAK TRILOGY; xXx: RETURN OF XANDER CAGE), Iko Uwais (THE RAID 1 & 2; STAR WARS: THE FORCE AWAKENS) and Tiger Chen (MAN OF TAI CHI), must take on the assassins and stop them before they kill their target. The film co-stars JeeJa Yanin (CHOCOLATE) Michael Wong (Cold War) and Celina Jade (WOLF WARRIOR 2).",R,0.818383
36,Dumbo,"From Disney and visionary director Tim Burton, the all-new grand live-action adventure ""Dumbo"" expands on the beloved classic story where differences are celebrated, family is cherished and dreams take flight. Circus owner Max Medici (Danny DeVito) enlists former star Holt Farrier (Colin Farrell) and his children Milly (Nico Parker) and Joe (Finley Hobbins) to care for a newborn elephant whose oversized ears make him a laughingstock in an already struggling circus. But when they discover that Dumbo can fly, the circus makes an incredible comeback, attracting persuasive entrepreneur V.A. Vandevere (Michael Keaton), who recruits the peculiar pachyderm for his newest, larger-than-life entertainment venture, Dreamland. Dumbo soars to new heights alongside a charming and spectacular aerial artist, Colette Marchant (Eva Green), until Holt learns that beneath its shiny veneer, Dreamland is full of dark secrets.",PG,0.818338
45,Fast & Furious Presents: Hobbs & Shaw,"Ever since hulking lawman Hobbs (Dwayne Johnson), a loyal agent of America's Diplomatic Security Service, and lawless outcast Shaw (Jason Statham), a former British military elite operative, first faced off in 2015's Furious 7, the duo have swapped smack talk and body blows as they've tried to take each other down. But when cyber-genetically enhanced anarchist Brixton (Idris Elba) gains control of an insidious bio-threat that could alter humanity forever--and bests a brilliant and fearless rogue MI6 agent (The Crown's Vanessa Kirby), who just happens to be Shaw's sister--these two sworn enemies will have to partner up to bring down the only guy who might be badder than themselves.",PG-13,0.816315
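
The output above lists the query movie itself as the top match with a similarity_score of 1.0, which is expected because an embedding is always identical to itself. If only the other movies are wanted, the query row can be dropped before taking the top matches. The following is a minimal sketch under stated assumptions, not the notebook's `get_similar_movies` implementation: it assumes cosine similarity as the metric and that the rows of the embeddings array line up positionally with the rows of the metadata DataFrame. The helper name `top_similar_excluding_self` is hypothetical.

```python
import numpy as np
import pandas as pd


def top_similar_excluding_self(embeddings, query_idx, metadata, top_n=5):
    """Hypothetical helper: rank rows by cosine similarity to one query row,
    leaving the query row itself out of the results.

    Assumes embeddings[i] corresponds to metadata.iloc[i].
    """
    emb = np.asarray(embeddings, dtype=float)

    # normalize each row to unit length so a dot product equals cosine similarity
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    # cosine similarity between the query row and every row
    scores = unit @ unit[query_idx]

    # sort from most to least similar, drop the query row, keep the top_n
    order = np.argsort(-scores)
    order = order[order != query_idx][:top_n]

    result = metadata.iloc[order].copy()
    result['similarity_score'] = scores[order]
    return result


# toy data for illustration only (not from the movies dataset)
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
toy_movies = pd.DataFrame({'movie_title': ['A', 'B', 'C', 'D']})

toy_result = top_similar_excluding_self(toy_embeddings, 0, toy_movies, top_n=2)
print(toy_result)
```

With the notebook's own objects, the equivalent call would presumably be `top_similar_excluding_self(embeddings_movies, 131, movies[['movie_title', 'movie_info', 'rating']], top_n=5)`, again assuming that row 131 of the embeddings corresponds to row 131 of `movies`.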
