# First data exploration

### Extracting and cleaning plot summaries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from transformers import pipeline
from scipy.special import softmax


import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

#Setting pandas display options
pd.set_option("max_colwidth", None)


#Src folder path
src_folder = 'src/data/'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mathieu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Mathieu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [2]:
df_plot_summaries = pd.read_csv(src_folder + 'plot_summaries.txt', sep='\t', header=None,  names=['id', 'summary'])
df_plot_summaries.sample(2)

index,id,summary
37889,29081683,Jerry McGuire is a dress designer who is tired of being looked upon as a wimp. He begins secretly training as a boxer to take on Spike Mullins and win the affections of store clerk Hilda Jensen .http://www.allmovie.com/work/so-this-is-love-110791
39595,25862838,"Set in modern Birmingham, Land Gold Women revolves around a small British Asian family caught between their traditional past and the tumultuous, faction-driven present. Nazir Ali Khan, a soft-spoken, 45-year-old professor of History at a University in Birmingham, emigrated from India in the 1980s. He made Birmingham his home with his conservative wife Rizwana and their two children, Saira, 17 and Asif, 14. He indulges their interests in all things English and Western but now finds himself increasingly nostalgic about his roots. Saira, with a year to complete her graduation, is excited at the prospect of going to university to pursue her interest in Literature. She also hopes that this will give her more time to spend with David, her aspiring writer boyfriend. At this critical juncture in her life, Nazir finds himself feeling increasingly conflicted at the thought of his daughter going out into the big bad world. His fears are further strengthened by the arrival of his older brother Riyaaz from India. A staunch traditional man, Riyaaz arrives with a proposal of marriage for Saira. A man of his word, who takes great pride in his roots, Riyaaz doesn’t intend on taking a ‘no’ for an answer. With the threat of an illicit relationship looming over his head and the prospect of getting cut off from the rest of his family, Nazir finds himself at the brink of a terrible decision to make: Should he save face? Or save his daughter?"


The plot summaries need some cleaning. The data appears to contain the following elements, all of which should be removed:

1) {{annotation}} templates
2) links (references to Wikipedia)
3) <> HTML delimiters
4) ... further inspection needed
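Before removing anything, it can help to gauge how common each artifact is by counting the summaries that match each pattern. The following is a minimal sketch under assumptions: the helper `count_artifacts`, the pattern names, and the toy summaries are hypothetical, and the regular expressions only roughly approximate the artifacts listed above.

```python
import re

# Hypothetical pattern set: one rough regex per artifact type listed above.
ARTIFACT_PATTERNS = {
    "annotation": re.compile(r"\{\{.*?\}\}"),
    "url": re.compile(r"http\S+|www\.\S+"),
    "html_tag": re.compile(r"<.*?>"),
}


def count_artifacts(summaries):
    """Return, per artifact type, how many summaries contain at least one match."""
    counts = {name: 0 for name in ARTIFACT_PATTERNS}
    for text in summaries:
        for name, pattern in ARTIFACT_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


# Toy data standing in for the real summaries (assumption, not from the dataset).
toy_summaries = [
    "A boxer trains in secret. http://example.com/some-film",
    "{{plot}} Two sisters fall for the same tenant.",
    "A quiet drama <ref name=x>about a family.",
    "A clean summary with nothing to remove.",
]

artifact_counts = count_artifacts(toy_summaries)
print(artifact_counts)
```

On the real data, the same function could be called as `count_artifacts(df_plot_summaries['summary'])`, since it accepts any iterable of strings.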


In [3]:
def clean_plot(txt):
    """Strip URLs, markup remnants and attribution phrases from a plot summary."""

    #Remove URLs
    txt = re.sub(r"http\S+|www\.\S+", '', txt)

    #Remove HTML tags
    txt = re.sub(r'<.*?>', '', txt)

    #Remove {{annotations}}
    txt = re.sub(r'\{\{.*?\}\}', '', txt)

    #Remove the ([[ annotation that is never closed
    txt = re.sub(r'\(\[\[', '', txt)

    #Remove the "Synopsis from" attribution phrase
    txt = re.sub(r'Synopsis from', '', txt)

    #Remove leftover <ref...}} tags that the rules above did not catch
    txt = re.sub(r'<ref[^}]*}}', '', txt)

    return txt

In [4]:
df_test_clean = df_plot_summaries.copy()
df_test_clean["summary"] = df_plot_summaries['summary'].apply(clean_plot)
df_test_clean.sample(2)

index,id,summary
32940,5604917,"Repossession men Laurel and Hardy serve a summons to Mr. Kennedy, who has failed to pay the installments for his radio. They wind up destroying both their car and the radio, as Mrs. Kennedy returns home to announce she's just paid for the radio."
35691,32169995,"A nameless horse butcher, whose wife left him soon after their autistic daughter was born, operates his own business while trying to raise the daughter. Despite that she has become a teenager, the butcher continues to wash her like a baby, and struggles to resist the temptation of committing incest. On the day of the daughter's menstruation, the butcher misinterprets the situation and assumes that she has been raped by a worker, whom he immediately seeks out and stabs as revenge. The butcher is imprisoned for the assault and is forced to sell his butcher shop and apartment."


Histogram of the summary lengths (in characters):

In [5]:
df_length_summary = df_test_clean['summary'].copy(deep=True)
df_length_summary = df_length_summary.apply(len) 


In [6]:
df_length_summary

0         178
1        4559
2        3090
3        4917
4        2425
         ... 
42298     220
42299    2971
42300     871
42301    1289
42302    3489
Name: summary, Length: 42303, dtype: int64

In [None]:
plt.hist(df_length_summary, bins=50, edgecolor='black')
plt.xlabel('Summary length (characters)')
plt.ylabel('Number of summaries')
plt.title('Distribution of summary lengths')
plt.show()
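A few summary statistics can complement the histogram, for instance to check how far the mean sits from the median. The following is a minimal sketch using only the standard library; `length_summary` is a hypothetical helper, and the sample contains only the ten lengths displayed in the output above, not the full dataset.

```python
import statistics


def length_summary(lengths):
    """Return the min, median, mean and max of a collection of lengths."""
    values = sorted(lengths)
    return {
        "min": values[0],
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": values[-1],
    }


# The ten lengths visible in the output above (first and last five rows only).
sample_lengths = [178, 4559, 3090, 4917, 2425, 220, 2971, 871, 1289, 3489]

stats = length_summary(sample_lengths)
print(stats)
```

On the full data, `df_length_summary.describe()` reports comparable statistics directly from pandas.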

#### Sentiment analysis test

I see two possibilities: classical sentiment analysis (negative, neutral, positive) with cardiffnlp/twitter-roberta-base-sentiment-latest, a model trained on tweets, or emotion classification with j-hartmann/emotion-english-distilroberta-base.

In [9]:
#First possibility: sentence-by-sentence sentiment classification

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

#Mapping from class index to sentiment label
dict_labels = {0: 'negative', 1: 'neutral', 2: 'positive'}
for t in df_test_clean['summary'].sample(1):
    for sentence in nltk.sent_tokenize(t):
        print(sentence)
        t_encoded = tokenizer(sentence, return_tensors='pt')
        t_output = model(**t_encoded)
        #Convert the logits of the single input sentence to probabilities
        t_scores = softmax(t_output.logits.detach().numpy(), axis=1)[0]
        t_predicted = int(np.argmax(t_scores))
        print(dict_labels[t_predicted], t_scores[t_predicted])

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The story is about three sisters Srividya, Subha and Jayachitra, each getting attracted to their tenant, a young man played by Sivakumar.
neutral 0.8701348
Sivakumar falls in love with Jayachitra.
neutral 0.5637661
However, things take a turn to the worse in the typical Balachander style when Jayachitra sacrifices herself to the whims of a playboy, played by Kamal Haasan in order to save her friend  from his exploits.
negative 0.77866256
What happens next, who marries whom, is the crux of a dramatic climax.
neutral 0.84232396
The movie turns the entire concept of romantic movies by its head - the concept being that the hero does not always have to get the girl.
neutral 0.6040655
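The label printed after each sentence is the class with the highest softmax probability. To illustrate that step in isolation, here is a minimal pure-Python sketch; the functions `softmax_probs` and `predict_label` and the logits are made up for the example and are not outputs of the model.

```python
import math

# Same label order as dict_labels in the cell above.
LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def softmax_probs(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax_probs(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one sentence (not produced by the model).
example_logits = [-1.0, 2.0, 0.5]
label, score = predict_label(example_logits)
print(label, round(score, 3))
```

A score close to 1 means the model strongly prefers one class, while scores near 0.5, as for some sentences above, indicate a less decisive prediction.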


In [10]:
# Second possibility: sentence-by-sentence emotion classification

#The emotions are anger, disgust, fear, joy, neutral, sadness and surprise

classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True)

for t in df_test_clean['summary'].sample(1):
    for sentence in nltk.sent_tokenize(t):
        print(sentence)
        out = classifier(sentence)[0]
        best_emotion_dict = max(out, key=lambda x: x['score'])
        best_label = best_emotion_dict['label']
        best_score = best_emotion_dict['score']
        print(best_label, best_score)    


An extremist right-wing candidate is elected to the French presidency, sparking riots in Paris.
anger 0.5820716023445129
Hoping to escape Paris but needing cash, Alex, Tom, Farid, the pregnant Yasmine, and her brother Sami take advantage of the chaos to pull off a robbery.
fear 0.4334038197994232
Sami is shot and the group splits up: Alex and Yasmine take Sami to a hospital, and Tom and Farid take the money to a family-run inn near the border.
sadness 0.7484657764434814
Innkeepers Gilberte and Klaudia claim their rooms are free and seduce the two men.
neutral 0.6540887951850891
At the hospital, the emergency room staff report Sami's injury to the police.
sadness 0.7390162944793701
Sami insists Yasmine run before the police catch her.
neutral 0.45305579900741577
His dying wish is that Yasmine not have an abortion.
sadness 0.9689406156539917
Alex and Yasmine flee, phoning their friends for directions to the inn.
fear 0.48608505725860596
Tom and Farid give them directions but soon after a