## U.Group RDSO AI/ML Notebook

This notebook demonstrates the AI and ML features implemented in the U.Group RDSO submission website. 

### Data Processing Pipeline
The general outline of our data processing pipeline is as follows:

* Scrape movie data from various freely available sources (e.g. OMDB, MovieTweetings)
* Clean and parse the data
* Join the various data sources - most include the IMDB ID, which can be used as a unique key
* Parse the data fields into standardized formats across the application
* Join together multiple descriptive text fields into a single `description` field
* Clean, preprocess, and tokenize the `description` field in preparation for vectorization

The output of this processing is the raw scraped data in the `data/` subfolder and a pickle file containing a pandas `DataFrame` of the processed data, ready for machine learning.
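The join, merge, and tokenize steps above can be sketched with pandas as follows. This is a minimal illustration, not the code in `rdso.movies`: the toy rows, the column names (`imdbID`, `Plot`, `summary`), and the `simple_tokenize` helper are all assumptions made for the example.

```python
import re

import pandas as pd

# Hypothetical toy inputs standing in for two scraped sources.
omdb = pd.DataFrame({
    "imdbID": ["tt0095016", "tt0000001"],
    "Title": ["Die Hard", "Example Film"],
    "Plot": ["A cop fights robbers in a tower.", "A short example plot."],
})
tweets = pd.DataFrame({
    "imdbID": ["tt0095016", "tt0000002"],
    "summary": ["Christmas heist goes wrong.", "Unmatched row."],
})

# Inner join on the shared IMDB ID keeps only movies present in both sources.
merged = omdb.merge(tweets, on="imdbID", how="inner")

# Concatenate the descriptive text fields into a single `description` column.
merged["description"] = merged["Plot"].fillna("") + " " + merged["summary"].fillna("")


def simple_tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical stand-in for the project's cleaning and tokenizing step.
merged["tokens"] = merged["description"].apply(simple_tokenize)
print(merged[["imdbID", "description", "tokens"]])
```

An inner join drops movies that appear in only one source. A left join would keep them, at the cost of missing text fields that then need filling.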

In [14]:
from rdso import movies, vectorize, plot
from gensim.models.doc2vec import Doc2Vec
import numpy as np
import pathlib
import pandas as pd

Dockerfile        __pycache__/           movies.py   requirements.txt
Dockerfile.train  api.py                 omdb_json/  tests/
__init__.py       feature_extraction.py  plot.py     vectorize.py


In [18]:
cwd = pathlib.Path(".").resolve()
data_dir = cwd / "data"
movies_df_file = data_dir / "movies_df.pkl"
movies_df = pd.read_pickle(str(movies_df_file))
print(f"Loaded {len(movies_df)} movies") 

Loaded 17351 movies


In [20]:
movies_df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Title | [REC] 3: Genesis | Just Let Go | Kaagaz Ke Fools | Brush with Danger | Stereo |
| Year | 2012 | 2015 | 2015 | 2015 | 2014 |
| Rated | R | PG-13 | | Not Rated | |
| Released | 30 Mar 2012 | 09 Oct 2015 | 24 Apr 2015 | 19 Sep 2015 | 15 May 2014 |
| Runtime | 80 min | 106 min | 109 min | 90 min | 98 min |
| Genre | Horror, Thriller | Drama | Comedy | Action, Drama, Thriller | Crime, Thriller |
| Director | Paco Plaza | Christopher S. Clark, Patrick Henry Parker | Anil Chaudhary | Livi Zheng | Maximilian Erlenwein |
| Writer | Luiso Berdejo (screenplay), Paco Plaza (screen... | Christopher S. Clark, Vance Mellen (screenplay... | Anil Chaudhary | Ken Zheng, Ken Zheng | Maximilian Erlenwein (screenplay) |
| Actors | Leticia Dolera, Diego Martín, Ismael Martínez,... | Henry Ian Cusick, Brenda Vaccaro, Sam Sorbo, P... | Vinay Pathak, Mugdha Godse, Raima Sen, Amit Behl | Ken Zheng, Livi Zheng, Nikita Breznikov, Norma... | Jürgen Vogel, Moritz Bleibtreu, Petra Schmidt-... |
| description | The action now takes place miles away from the... | After surviving a drunk driving accident that ... | Revolving around a middle-class family, Kaagaz... | A painter, a fighter, both artists in their ow... | The appearance of a mysterious, hooded man evo... |


### Data Science AI/ML Pipeline
* Train a Doc2Vec model on our cleaned & tokenized movie descriptions
* Use movie genres from OMDB as categories to create clusters
* Find the centroid of each cluster
* Measure the mean and standard deviation of the distances between each movie and its cluster centroid, and record these as our metric for model performance. A higher-performing model will produce tighter clusters, with a smaller mean and standard deviation.
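The centroid and distance metric described above can be sketched with NumPy. This is an illustration under assumptions, not the implementation of `vectorize.get_genre_distance_metrics`: it assumes Euclidean distance, and the function name `cluster_distance_metrics` and the toy 2-D vectors are hypothetical.

```python
import numpy as np


def cluster_distance_metrics(vectors):
    """Return (centroid, mean distance, standard deviation) for one cluster."""
    vectors = np.asarray(vectors, dtype=float)
    # The centroid is the element-wise mean of all vectors in the cluster.
    centroid = vectors.mean(axis=0)
    # Euclidean distance from each vector to the centroid (an assumption here).
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return centroid, distances.mean(), distances.std()


# Hypothetical 2-D document vectors for a single genre.
genre_vectors = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
centroid, mean_dist, stdev_dist = cluster_distance_metrics(genre_vectors)
print(centroid, mean_dist, stdev_dist)
```

With this definition, a cluster containing a single vector has a mean and standard deviation of exactly zero, so a zero score does not by itself indicate a tight cluster.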

Doc2Vec lets you tag each document before training. The tag is stored alongside the resulting document vector, so a vector can later be looked up by its tag. We create a tagged corpus using the IMDB IDs that are common to our datasets.

In [3]:
movies_labels = list(movies_df["film_id"])
movie_tokens = movies_df["movie_tokens"].tolist()
tagged_corpus = vectorize.TaggedLineDocument(movie_tokens, movies_labels)

Now we can train the Doc2Vec model:

In [4]:
d2v_model = vectorize.train_doc2vec(tagged_corpus)

We now have a collection of tagged document vectors. We can use these to go through each of the genres in our `movies_df` data and find the centroid.

In [8]:
genre_metrics, genre_centroids = vectorize.get_genre_distance_metrics(d2v_model, movies_df)
for genre, metrics in genre_metrics.items():
    print(f'{genre}: mean {metrics["mean"]:0.4f}, standard deviation {metrics["stdev"]:0.4f}')

Documentary: mean 0.8740, standard deviation 0.3419
Comedy: mean 0.8888, standard deviation 0.2797
Action: mean 0.8590, standard deviation 0.2594
Animation: mean 0.8356, standard deviation 0.2708
Crime: mean 0.8589, standard deviation 0.2826
Drama: mean 0.8755, standard deviation 0.2602
Short: mean 0.8682, standard deviation 0.3933
Horror: mean 0.8574, standard deviation 0.3041
Mystery: mean 0.9434, standard deviation 0.4701
Biography: mean 0.8938, standard deviation 0.2857
Thriller: mean 0.8943, standard deviation 0.2824
Fantasy: mean 0.9223, standard deviation 0.5597
Adventure: mean 0.8591, standard deviation 0.2941
Sci-Fi: mean 0.7786, standard deviation 0.2226
Family: mean 0.8521, standard deviation 0.3740
Adult: mean 0.0000, standard deviation 0.0000
Romance: mean 0.8767, standard deviation 0.2822
News: mean 0.8314, standard deviation 0.2249
Musical: mean 0.9085, standard deviation 0.1229
History: mean 0.7476, standard deviation 0.1873
Music: mean 0.8547, standard deviation 0.2092

If we have a text description of a movie that is not in our dataset, we can preprocess, clean, and tokenize the text, then pass the tokens to the Doc2Vec model to infer a vector and find the genre whose centroid is closest.

In [12]:
die_hard_imdbID = "tt0095016"
die_hard_description = "NYPD cop John McClane goes on a Christmas vacation to visit his wife Holly in Los Angeles where she works for the Nakatomi Corporation. While they are at the Nakatomi headquarters for a Christmas party, a group of robbers led by Hans Gruber take control of the building and hold everyone hostage, with the exception of John, while they plan to perform a lucrative heist. Unable to escape and with no immediate police response, John is forced to take matters into his own hands."
die_hard_tokens = movies.nlp_clean(die_hard_description)
print(f"Die Hard movie tokens: {die_hard_tokens[:6]}, ...")

Die Hard movie tokens: ['police', 'hands', 'escape', 'nakatomi', 'gruber', 'goes'], ...


In [13]:
# Infer a vector for the unseen description, then find the nearest genre centroid.
die_hard_vectors = d2v_model.infer_vector(die_hard_tokens)
genre_distances = {}
min_distance = float("inf")  # start unbounded instead of at an arbitrary cap
min_genre = None
for genre, centroid in genre_centroids.items():
    genre_distance = np.linalg.norm(die_hard_vectors - centroid)
    genre_distances[genre] = genre_distance
    if genre_distance < min_distance:
        min_distance = genre_distance
        min_genre = genre
print(min_genre, min_distance)

Reality-TV 4.6758966
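
The cell above reports only the single closest genre. To see how close the runners-up are, the distances can be sorted into a ranking. The sketch below is self-contained and uses made-up 2-D centroids; the `rank_genres` helper is hypothetical and not part of the `rdso` package. In the notebook, the inferred vector and `genre_centroids` would take the place of the toy inputs.

```python
import numpy as np


def rank_genres(vector, centroids):
    """Return (genre, distance) pairs sorted from nearest to farthest."""
    vector = np.asarray(vector, dtype=float)
    distances = {
        genre: float(np.linalg.norm(vector - np.asarray(centroid, dtype=float)))
        for genre, centroid in centroids.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Hypothetical centroids; real ones have the model's vector dimensionality.
toy_centroids = {"Action": [1.0, 0.0], "Comedy": [0.0, 1.0], "Drama": [3.0, 4.0]}
ranking = rank_genres([0.9, 0.1], toy_centroids)
for genre, distance in ranking:
    print(f"{genre}: {distance:0.4f}")
```

A ranking also shows when the top few genres are nearly tied, which a single minimum hides.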
