# Light-Weight Embeddings Notebook


In [3]:
%%javascript
document.title = 'Github'

In this notebook, I'm experimenting with sentence-transformers, a relatively new Python library for computing sentence embeddings. More information can be found at https://www.sbert.net/

In [3]:
# !pip install sentence_transformers

In [4]:
import pickle
from gensim.summarization import summarize  # needs gensim < 4.0; the summarization module was removed in 4.0
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

Here are two short test texts to compare:

In [5]:
test1 = 'I love to go on dates. I am just looking for a woman. Dating is really tough though!'
test2 = 'I am looking for a man. I like dating but it can be hard sometimes'

Testing different models: instantiating Sentence Transformer models with stsb-roberta-large and stsb-distilbert-base

In [69]:
roberta_model = SentenceTransformer('stsb-roberta-large')
distilbert_model = SentenceTransformer('stsb-distilbert-base')

In [70]:
test1_emb = roberta_model.encode(test1).reshape(1,-1)
test2_emb = roberta_model.encode(test2).reshape(1,-1)

In [71]:
test1_emb.shape

(1, 1024)

In [72]:
cosine_similarity(test1_emb, test2_emb)[0][0]

0.8181629
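
For reference, the score above is the dot product of the two embeddings divided by the product of their lengths. Below is a minimal pure-Python sketch of that formula on toy 2-d vectors (not real embeddings); the function name `cosine` is my own and is not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors only, chosen to show the range of the score
print(cosine([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Scores near 1 mean the vectors point the same way, scores near 0 mean they are unrelated, and scores near -1 mean they point in opposite directions.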

In [73]:
# Helper function: cosine similarity between the embeddings of two texts
def compare(a, b, model=roberta_model):
    emb1 = model.encode(a).reshape(1, -1)
    emb2 = model.encode(b).reshape(1, -1)
    return cosine_similarity(emb1, emb2)[0][0]

In [74]:
compare('I like chicken', 'I like chicken wings')

0.93388546

In [81]:
compare('i like chicken', 'i like steak', model=roberta_model)

0.34584302
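
To put these two scores in context, it helps to see what a purely lexical baseline would give for the same pairs. The sketch below is a hypothetical bag-of-words comparison (the name `bow_cosine` is my own); it only counts shared words and knows nothing about meaning.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity of lowercase word-count vectors (a crude lexical baseline)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b)

print(round(bow_cosine('I like chicken', 'I like chicken wings'), 3))
print(round(bow_cosine('i like chicken', 'i like steak'), 3))
```

Word overlap alone scores the chicken/steak pair at about 0.67, because two of the three words match. The much lower roberta score for that pair suggests the model is reacting to the difference in meaning rather than to shared tokens.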

## Opening Seinfeld Dialogues

In [13]:
with open('../data/episode_dialogues.pkl', 'rb') as f:
    episodes_dialogue = pickle.load(f)

In [14]:
# samp is a string of the complete dialogue from the first episode
samp = episodes_dialogue['S01E01']

In [15]:
summary = summarize(samp, ratio = .01)

In [16]:
summary

'Lets go!The dating world is not a fun world...its a pressure world, its a world of tension, its a world of pain...and you know, if a woman comes over to my house, I gotta get that bathroom ready, cause she needs things.\nWell, Bill, the boss thinks youre the man for the position, why dont you strip down and meet some of the people youll be workin with?Wouldnt it be great if you could ask a woman what shes thinking?What a world that would be, if you just could ask a woman what shes thinkin.You know, instead, Im like a detective.\nYoure engaged.Yeah, yeah, hes a great guy...Yeah.Youd really like him, you know, I cant wait to get on that boat.Me too!I swear, I have absolutely no idea what women are thinking.\nYes, we met.Hi, happy birthday.Thanks, ah, everybody, this is Elaine and Jerry.HiI didnt bring anything.Uh, I put you two right here.Oh, Okay  Im sorry, I didnt know what to bring, nobody told me.How big a tip do you think itd take to get him to stop?Im in for five...Ill supply the 
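
The `summarize` function used here lives in `gensim.summarization`, which was removed in gensim 4.0. If that module is unavailable, a rough stand-in can be sketched as below. This is an assumption-laden substitute, not gensim's TextRank algorithm: `simple_summarize` is a hypothetical name, and it simply keeps the sentences whose words are most frequent in the text.

```python
import re
from collections import Counter

def simple_summarize(text, ratio=0.2):
    """Keep the top `ratio` fraction of sentences, scored by average word frequency.

    Hypothetical stand-in; NOT gensim's TextRank-based summarize.
    """
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return ''
    freqs = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        words = re.findall(r'\w+', sentence.lower())
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * ratio))
    # Pick the highest-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return '\n'.join(sentences[i] for i in sorted(ranked[:keep]))

toy_text = 'Soup is great. I like soup a lot. The weather is cold.'
print(simple_summarize(toy_text, ratio=0.34))
```

Note that the splitter expects whitespace after punctuation, which the transcript text above often lacks, so the dialogue would need cleaning before a sketch like this gives sensible sentence boundaries.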

In [82]:
def vectorize_episode(episode_dict, model='stsb-distilbert-base', ratio=.01):
    """Summarize each episode's dialogue and embed the summary as a (1, dim) vector."""
    reduced_episode_vectors = {}
    model = SentenceTransformer(model)  # `model` arrives as a model name string
    for episode, dialogue in episode_dict.items():
        summary = summarize(dialogue, ratio=ratio)
        vector = model.encode(summary).reshape(1, -1)
        reduced_episode_vectors[episode] = vector
    return reduced_episode_vectors

In [83]:
vector_dict = vectorize_episode(episodes_dialogue)

In [84]:
with open('distilbert_episode_vectors.pkl', 'wb') as f:
    pickle.dump(vector_dict, f)
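
Saving the vectors means a later session can skip the slow encoding step and load them straight from disk. Here is a small round-trip sketch using a toy dictionary and a temporary directory; the file name and the toy values are placeholders, not the real episode vectors.

```python
import os
import pickle
import tempfile

# Toy stand-in for vector_dict: episode code -> embedding (plain lists here)
toy_vectors = {'S01E01': [0.1, 0.2, 0.3], 'S01E02': [0.4, 0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_episode_vectors.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_vectors, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_vectors)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files you created yourself or otherwise trust.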

In [85]:
def get_similarities(dialogue, model='stsb-distilbert-base'):
    """Rank episodes in the global vector_dict by cosine similarity to `dialogue`."""
    similarity_list = []
    model = SentenceTransformer(model)  # should match the model used to build vector_dict
    dialogue_vector = model.encode(dialogue).reshape(1, -1)
    for episode, vector in vector_dict.items():
        similarity_list.append((episode, cosine_similarity(dialogue_vector, vector)[0][0]))
    similarity_list.sort(key=lambda x: x[1], reverse=True)
    return similarity_list

In [86]:
dialogue = 'Hi everyone. I really like soup. Soup is my favorite thing. Soup nazi episode is great. Can you get some squash soup?'

In [87]:
sims = get_similarities(dialogue)

In [88]:
sims[:5]

[('S09E07', 0.5182687),
 ('S07E06', 0.47713837),
 ('S05E12', 0.4392568),
 ('S09E18', 0.43371052),
 ('S03E03', 0.4318787)]
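
Since only the first few results are inspected, the full sort could be swapped for a top-k selection. The sketch below shows that idea with `heapq.nlargest` on toy 2-d vectors; `top_k`, `cosine`, and the toy data are all illustrative names and values of my own, not part of the notebook's pipeline.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query, vectors, k=5):
    """Return the k (name, score) pairs most similar to `query`, best first."""
    scored = ((name, cosine(query, vec)) for name, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy vectors standing in for episode embeddings
toy = {'A': [1.0, 0.0], 'B': [0.7, 0.7], 'C': [0.0, 1.0], 'D': [-1.0, 0.0]}
print([name for name, _ in top_k([1.0, 0.1], toy, k=2)])
```

With a couple of hundred episodes the difference from a full sort is negligible, so this is mostly a readability choice.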

In [67]:
summarize(episodes_dialogue['S09E07'], ratio=.01)

"Welcome to flavor country.Yeah, that's pretty good.Hey, I got a date with that doctor you met.Sara Sitarides?Mmhu.Oh...What's with you?You remember that next door neighbor of mine, the apartment that always smells like potatoes?Your whole building smells like potatoes.This jackass goes to Paris, leaves the alarm on."

In [31]:
import imdb

In [33]:
ia = imdb.IMDb()
series = ia.get_movie('0098904')
ia.update(series, 'episodes')
sorted(series['episodes'].keys())

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [52]:
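# Convert codes like 'S07E06' into [season, episode] integer pairs.
# Note: the cell numbers show this ran before the sims printed above were
# recomputed, so the outputs below reflect an earlier ranking.
ordered_episodes = []
for code, _ in sims:
    ordered_episodes.append([int(code[1:3]), int(code[-2:])])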

In [53]:
ordered_episodes[0]

[7, 6]
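
The slicing above relies on every code having exactly the form `SxxEyy`. A slightly more defensive variant, sketched here with a regular expression, fails loudly on anything else; `parse_episode_code` is a hypothetical helper name.

```python
import re

EPISODE_CODE = re.compile(r'S(\d{2})E(\d{2})')

def parse_episode_code(code):
    """Split a code such as 'S07E06' into (season, episode) integers."""
    match = EPISODE_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f'unexpected episode code: {code!r}')
    season, episode = match.groups()
    return int(season), int(episode)

print(parse_episode_code('S07E06'))  # (7, 6)
```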

In [89]:
# Look up the IMDb entry for the top-ranked episode and print its plot
for season_num, episode_num in ordered_episodes[:1]:
    episode = series['episodes'][season_num][episode_num]
    title = episode['title']
    plot = episode['plot']
    print(plot)

A soup stand owner obsesses about his customers' ordering procedure, but his soup is so good that people line up down the block for it anyway.
