# Text Classification

In the `embedding` notebook we built a graph from the similarity between clusters. This time
we'll let a model predict the probability that each plot belongs to a given class.

## Imports

In [1]:
import pandas as pd
import numpy as np

from transformers import pipeline
from scripts.helpers import get_list_of_genres, get_str_of_genres

## Data Processing

In [2]:
# specify the column names 
column_names = ['wikipedia_id', 'freebase_id', 'name', 'release_date', 'box_office_revenue', 'runtime', 'languages', 'countries', 'genres']
movie_metadata_df = pd.read_table('../data/raw/movie.metadata.tsv', names=column_names)
movie_metadata_df.head(5)

Unnamed: 0,wikipedia_id,freebase_id,name,release_date,box_office_revenue,runtime,languages,countries,genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


In [3]:
movie_metadata_df['genres_list'] = movie_metadata_df.genres.apply(get_list_of_genres)

In [4]:
movie_metadata_df.head(5)

Unnamed: 0,wikipedia_id,freebase_id,name,release_date,box_office_revenue,runtime,languages,countries,genres,genres_list
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...","[Thriller, Science Fiction, Horror, Adventure,..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...","[Mystery, Biographical film, Drama, Crime Drama]"
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...","[Crime Fiction, Drama]"
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...","[Thriller, Erotic thriller, Psychological thri..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",[Drama]
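
The helpers `get_list_of_genres` and `get_str_of_genres` come from `scripts.helpers` and are not shown in this notebook. Below is a minimal sketch of what they might look like, assuming each `genres` cell is a JSON object mapping Freebase IDs to genre names (as in the output above). The `_sketch` function names are hypothetical, and the lowercase, comma-joined string format is a guess based on the lowercase entries used for filtering later on.

```python
import json

def get_list_of_genres_sketch(genres_json):
    """Hypothetical stand-in: return the genre names stored in a JSON-encoded cell."""
    return list(json.loads(genres_json).values())

def get_str_of_genres_sketch(genres_json):
    """Hypothetical stand-in: join the genre names into one lowercase string."""
    return ", ".join(get_list_of_genres_sketch(genres_json)).lower()

# a genres cell in the same shape as the rows shown above
example = '{"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "Drama"}'
print(get_list_of_genres_sketch(example))
print(get_str_of_genres_sketch(example))
```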


In [5]:
# let's find all the unique genres there are
unique_genres = set(movie_metadata_df.genres_list.explode().tolist())
print(f'We have {len(unique_genres)} different genres.')

We have 364 different genres.


In [6]:
# as there are too many genres, let's make a test with only 3 of them
selected_genres = ['action', 'nature', 'comedy']

### Plots

As we want to classify plots, let's read them in and merge them with `movie_metadata_df`.

In [7]:
plot_column_names = ['wikipedia_id', 'plot']
plot_df = pd.read_csv('../data/raw/plot_summaries.txt', sep="\t", names=plot_column_names) 
plot_df.head(5)

Unnamed: 0,wikipedia_id,plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


In [8]:
# let's now merge the DataFrames
merged_df = movie_metadata_df.merge(plot_df, on='wikipedia_id', how='inner')
merged_df.shape

(42204, 11)

In [9]:
# to filter by genre, let's generate a temporary genre-string column
merged_df['genres_string'] = merged_df.genres.apply(get_str_of_genres)

# and keep the films that match at least one of the three selected genres
regex_pattern = "|".join(selected_genres)
genre_mask = merged_df.genres_string.str.contains(regex_pattern)
filtered_df = merged_df[genre_mask]
filtered_df.shape

(18297, 12)
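
Note that `str.contains` treats the joined pattern as a regular expression and matches substrings, so a film is kept if any of the three terms appears anywhere in its genre string. The sketch below illustrates this with the standard `re` module. The genre strings are made up for illustration, and `re.escape` is an added precaution that the cell above does not use.

```python
import re

selected_genres = ['action', 'nature', 'comedy']
# re.escape is a precaution in case a genre name ever contains regex metacharacters
regex_pattern = "|".join(re.escape(g) for g in selected_genres)

# made-up genre strings, for illustration only
genre_strings = [
    "action/adventure, thriller",
    "romantic comedy",
    "drama",
    "Comedy",
]
# re.search matches anywhere in the string, like str.contains with a regex
matches = [bool(re.search(regex_pattern, s)) for s in genre_strings]
print(matches)
```

The last string is not matched because the search is case-sensitive. `str.contains` accepts `case=False` if that is not wanted.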

In [10]:
# we are left with ~18,000 plots, let's filter out the ones that don't have revenue information
filtered_df = filtered_df[~np.isnan(filtered_df.box_office_revenue)]
filtered_df.shape

(4527, 12)

In [11]:
# and now let's try classification with just 30 movies
final_df = filtered_df.sample(30)

## Modelling

We're using the zero-shot classification approach described in this [paper](https://arxiv.org/pdf/1909.00161.pdf) with the `bart-large-mnli` model, which can be downloaded [here](https://huggingface.co/facebook/bart-large-mnli). To download the model, run the following in the `models` directory:

```
git lfs install
git clone https://huggingface.co/facebook/bart-large-mnli
```
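
Zero-shot classification with an NLI model treats the plot as the premise and turns each candidate label into a hypothesis (the `transformers` pipeline defaults to the template "This example is {}."). In the single-label setting, the entailment scores of all labels are then normalised with a softmax so that they sum to one. Below is a minimal pure-Python sketch of that last step, using made-up entailment logits rather than real model output.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ['action', 'nature', 'comedy']
# one hypothesis per candidate label, built from the default template
hypotheses = [f"This example is {label}." for label in candidate_labels]

# made-up entailment logits, one per hypothesis (illustration only)
entailment_logits = [2.0, -1.0, 0.5]
scores = softmax(entailment_logits)

for hypothesis, score in zip(hypotheses, scores):
    print(f"{hypothesis!r}: {score:.3f}")
```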

In [12]:
classifier = pipeline("zero-shot-classification", model="../models/bart-large-mnli")

In [17]:
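probabilities = {}
candidate_labels = selected_genres  # classify against the genres chosen above

# run zero-shot classification on each sampled plot, keyed by the DataFrame index
for idx, row in final_df.iterrows():
    probabilities[idx] = classifier(row['plot'], candidate_labels)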

In [20]:
probabilities
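
Each value in `probabilities` is a dictionary returned by the pipeline with the keys `sequence`, `labels`, and `scores`, where the labels are sorted by descending score. The sketch below extracts the top genre per movie from a hand-written stand-in result. The index `101`, the plot text, and the scores are invented for illustration, and `top_label` is a hypothetical helper.

```python
# hand-written stand-in for the pipeline output; all values are invented
probabilities_example = {
    101: {
        'sequence': 'A retired agent is pulled back in for one last mission.',
        'labels': ['action', 'comedy', 'nature'],
        'scores': [0.81, 0.14, 0.05],
    },
}

def top_label(result):
    """Return the (label, score) pair with the highest score."""
    return max(zip(result['labels'], result['scores']), key=lambda pair: pair[1])

# map each movie index to its best-scoring genre
top_genres = {idx: top_label(res) for idx, res in probabilities_example.items()}
print(top_genres)
```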
