
This demo contains the following:

* Setting up Python environment - importing libraries and first look at the raw dataset 

* Import dataset to ArangoDB

* Preprocessing raw data

    * Using ArangoQL
    
    * Connecting with Python using PyArango

* Data exploration with the features of ArangoDB.
    
    * Graph visualization
    
    * ArangoSearch example
    
    * K-shortest path example
    
    * Pruned search

* Machine Learning tasks
    
    * Movie similarity based on plots using Tensorflow. 
    
    * Genre classification based on plots using -
    
        * scikit-learn
        
        * Tensorflow


# Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

We are working with Movie data scraped from Wikipedia: [link](https://www.kaggle.com/jrobischon/wikipedia-movie-plots)

The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

1. Release Year - Year in which the movie was released
2. Title - Movie title
3. Origin/Ethnicity - Origin of movie (i.e. American, Bollywood, Tamil, etc.)
4. Director - Director(s) (comma separated, null values)
5. Cast - Main actor and actresses (comma separated, null values)
6. Genre - Movie Genre(s) (unknown values)
7. Wiki Page - URL of the Wikipedia page from which the plot description was scraped
8. Plot - Long form description of movie plot

Read csv file:

In [2]:
df = pd.read_csv("wiki_movie_plots_deduped.csv")

In [3]:
df.sample(10)

| Unnamed: 0 | Release Year | Title | Origin/Ethnicity | Director | Cast | Genre | Wiki Page | Plot |
|---|---|---|---|---|---|---|---|---|
| 18279 | 1934 | Get Your Man | British | George King | Dorothy Boyd, Sebastian Shaw | comedy | https://en.wikipedia.org/wiki/Get_Your_Man_(19... | A determined young woman sets up an elaborate ... |
| 2131 | 1936 | Satan Met a Lady | American | William Dieterle | Bette Davis, Warren William, Alison Skipworth | comedy, drama | https://en.wikipedia.org/wiki/Satan_Met_a_Lady | Private detective Ted Shane returns to work wi... |
| 21818 | 1984 | Next of Kin | Canadian | Atom Egoyan | Patrick Tierney, Arsinée Khanjian | drama | https://en.wikipedia.org/wiki/Next_of_Kin_(198... | Twenty-three-year-old Peter Foster is an only ... |
| 26942 | 2013 | Goliyon Ki Rasleela Ram-Leela | Bollywood | Sanjay Leela Bhansali | Ranveer Singh, Deepika Padukone, Richa Chadda,... | romance/drama | https://en.wikipedia.org/wiki/Goliyon_Ki_Rasle... | In the fictional Gujarati village Ranjaar, inf... |
| 29264 | 1962 | Bale Pandiya | Tamil | B. R. Panthulu | Sivaji Ganesan, M. R. Radha, Devika | unknown | https://en.wikipedia.org/wiki/Bale_Pandiya_(19... | Pandiya is a young man who leads a troubled li... |
| 31056 | 2011 | Seedan | Tamil | Subramaniya Siva |  | drama | https://en.wikipedia.org/wiki/Seedan | Mahalakshmi (Ananya) is a servant at the resid... |
| 32925 | 1965 | Sword of the Beast | Japanese | Gosha, HideoHideo Gosha | Mikijiro Hira | action, drama | https://en.wikipedia.org/wiki/Sword_of_the_Beast | Gennosuke is a rebel samurai on the run, havin... |
| 417 | 1921 | Tol'able David | American | Henry King | Richard Barthelmess | drama | https://en.wikipedia.org/wiki/Tol%27able_David | David Kinemon, youngest son of West Virginia t... |
| 33425 | 2004 | Ultraman: The Next | Japanese | Kazuya Konaka |  | unknown | https://en.wikipedia.org/wiki/Ultraman_(2004_f... | First Lieutenant Shunichi Maki of the Japan Ai... |
| 2668 | 1939 | She Married a Cop | American | Sidney Salkow | Jean Parker, Phil Regan | comedy | https://en.wikipedia.org/wiki/She_Married_a_Cop | A couple of cops, Jimmy Duffy and partner Joe,... |


We can see that columns like `Cast` (also `Director` and `Genre`) contain multiple values that might be separated by a comma, space or slash etc. It will require some preprocessing. 
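To make the problem concrete, here is a minimal Python sketch of splitting one multi-valued cell into separate names. The separator list and the helper name `split_values` are assumptions chosen for illustration from the sample rows above; they are not the cleaning rules used later in the AQL queries.

```python
import re

# Separators seen in the sample rows (assumption: this list is not exhaustive).
SEPARATORS = re.compile(r",|;|/|\s+and\s+|\s*&\s*")

def split_values(raw):
    """Split a multi-valued cell into a list of trimmed, non-empty names."""
    # Missing values (None, or NaN coming from pandas) are not strings.
    if not isinstance(raw, str):
        return []
    return [part.strip() for part in SEPARATORS.split(raw) if part.strip()]

print(split_values("Dorothy Boyd, Sebastian Shaw"))
print(split_values("romance/drama"))
```

The same idea is applied inside the database in the next sections, where the cleaning is done with AQL string functions instead.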

First we will learn how to import the data into ArangoDB, preprocess it and build a knowledge graph from it for better interpretation.

# Import data to ArangoDB

Create a new database:

    db._createDatabase("arangoml", {}, [{ username: "root", passwd: "", active: true}])

Import the data into ArangoDB:
1. Go to the directory that contains the dataset.
2. Open a terminal and run the following command:

        arangoimport --file "wiki_movie_plots_deduped.csv" --type csv --server.database arangoml --create-collection --collection "movies"


# Preprocess dataset


### Using ArangoQL

1. We want to store columns such as cast and director as documents in their own collections, because the raw data is highly unstructured. This requires some processing first. For example, to store all cast members in a `cast` collection, we first need to clean the original `Cast` values (ideally comma-separated names), which contain unwanted characters and words. We handle these and extract the unique actors and actresses from the raw dataset as follows:

        let casts_data = (
        for i in movies
            filter i['Cast'] != null
            let casts = substitute(
                i['Cast'], 
                ["'",']','[','"','\r\n',')','(','; ',' and ',' & ','/','Cast: ','.'],
                ['', '', '',', ', ', ', '', '', ', ', ', ', ', ', ', ', '','']
            )
            for j in split(casts, ",")
                let nj = substitute(trim(j),[' '],['_'])
                filter trim(j)!=''
                return distinct nj)

        for i in casts_data
            insert {'_key':i} in cast options {ignoreErrors: true}
        
    We can execute the same kind of query for the Director, Origin and Genre columns.
    Director:
    
         let directors_data = (
         for i in movies
             filter i['Director'] != null
             let director = substitute(
                 i['Director'], 
                 ["'",']','[','"','\r\n',')','(','; ',' and ',' & ','/','Director: ','Directors: ','.'],
                 ['', '', '',', ', ', ', '', '', ', ', ', ', ', ', ', ', '', '','']
             )
             for j in split(director, ",")
                 let nj = substitute(trim(j),[' '],['_'])
                 filter trim(j)!=''
                 return distinct nj)

         for i in directors_data
             insert {'_key':i} in director options {ignoreErrors: true}
             
     Origin:
            
            let origin_data = (
             for i in movies
                 filter i['Origin/Ethnicity'] != null
                 let origin = substitute(
                     i['Origin/Ethnicity'], 
                        ["'",']','[','"','\r\n',')','(','; ',' and ',' & ','/','-','_',' ','.'],
                        ['', '', '',', ', ', ', '', '', ', ', ', ', ', ', ', ', ',', ',', ',','']
                        )
                 for j in split(origin, ",")
                     let nj = substitute(trim(j),[' '],['_'])
                     filter trim(j)!=''
                     return distinct nj)

             for i in origin_data
                 insert {'_key':i} in origin options {ignoreErrors: true}
             
     Genre:
              
             let genre_data = (
             for i in movies
                 filter i['Genre'] != null
                 let genre = substitute(
                     i['Genre'], 
                    ["'",']','[','"','\r\n',')','(','; ',' and ',' & ','/','-','_',' ','.'],
                    ['', '', '',', ', ', ', '', '', ', ', ', ', ', ', ', ', ',', ',', ',','']
                    )
                 for j in split(genre, ",")
                     let nj = substitute(trim(j),[' '],['_'])
                     filter trim(j)!=''
                     return distinct nj)

             for i in genre_data
                 insert {'_key':i} in genre options {ignoreErrors: true}
         

2. We create a `movie` collection that stores movie-specific information: release year, title and plot. We also add edges between each movie and its cast members, director(s), origin and genre (the collections created by the previous queries). To insert data into the `movie` collection, execute the following query:

            for i in movies
                let id_split = split(i['Wiki Page'],"/")
                let id = substitute(id_split[length(id_split)-1],["#"],["_"])
                insert {_key:id, year:i['Release Year'], title:i['Title'], plot:i['Plot']} 
                    into movie 
                    options { overwrite: true, ignoreErrors: true }


    
    To add edges to cast members or directors, append the following to the query above:
    
            let casts = substitute(
                i['Cast'], 
                ["'",']','[','"','\r\n',')','(','; ',' and ',' & ','/','Director: ','Directors: ','Cast: ','.'],
                ['', '', '',', ', ', ', '', '', ', ', ', ', ', ', ', ', '', '', '','']
                )
            for j in split(casts, ",")
                let nj = substitute(trim(j),[' '],['_'])
                filter trim(j)!=''
                insert {_from: concat("movie/",id), _to:concat("cast/",nj), label:'had as a cast'} 
                    into conn 
                    options {ignoreErrors: true}

    Similarly, to add edges to Genre/Origin:
    
        for i in movies
            let id_split = split(i['Wiki Page'],"/")
            let id = substitute(id_split[length(id_split)-1],["#"],["_"])
            let genre = substitute(
                i['Genre'], 
                ["'",']','[','"','\r\n',')','(','; ',' and ',' & ','/','-','_',' ','.'],
                ['', '', '',', ', ', ', '', '', ', ', ', ', ', ', ', ', ',', ',', ',','']
                )
            for j in split(genre, ",")
                let nj = substitute(trim(j),[' '],['_'])
                filter trim(j)!=''
                insert {_from: concat("movie/",id), _to:concat("genre/",nj), label:'genre'} 
                    into conn 
                    options {ignoreErrors: true}
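For readers who find the AQL dense, the key derivation and edge construction above can be sketched in plain Python. The helper names `movie_key`, `node_key` and `cast_edges` are hypothetical, and the sketch assumes the cast string has already been cleaned into comma-separated names.

```python
def movie_key(wiki_url):
    """Last path segment of the Wikipedia URL, with '#' replaced by '_'."""
    return wiki_url.split("/")[-1].replace("#", "_")

def node_key(name):
    """Trim a name and replace spaces with underscores, as in the AQL above."""
    return name.strip().replace(" ", "_")

def cast_edges(wiki_url, cast):
    """Build one edge document per cast member of a movie."""
    source = "movie/" + movie_key(wiki_url)
    return [
        {"_from": source, "_to": "cast/" + node_key(name), "label": "had as a cast"}
        for name in cast.split(",")
        if name.strip()  # skip empty fragments such as a trailing comma
    ]

edges = cast_edges(
    "https://en.wikipedia.org/wiki/Satan_Met_a_Lady",
    "Bette Davis, Warren William, Alison Skipworth",
)
for edge in edges:
    print(edge)
```

Each dictionary has the same shape as the documents inserted into the `conn` edge collection by the queries above.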

### Using Python 

Another way to insert nodes and edges is by using Python. For this, we connect to ArangoDB using PyArango. 

In [4]:
from pyArango.connection import Connection
conn = Connection(username="root", password="")
db = conn["arangoml"]
def exec(db, aql):
	output = db.AQLQuery(aql, rawResults=True, batchSize=1000)
	return np.array(output)

Here we just need to use `exec()` and provide database variable `db` with corresponding `aql` query for execution. It’s that easy.
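As a hedged illustration of how such a helper might be called with bind parameters, the sketch below uses a stand-in database object so that it runs without a server. `FakeDB` and `run_aql` are hypothetical names, the canned row is made up for the demo, and the query assumes that `year` is stored as a number in the `movie` collection. With a real connection you would pass the `db` object created above instead.

```python
class FakeDB:
    """Hypothetical stand-in for a pyArango database, used only for this demo."""

    def __init__(self):
        self.calls = []

    def AQLQuery(self, aql, rawResults=True, batchSize=1000, bindVars=None):
        # Record the call and return a canned row instead of contacting a server.
        self.calls.append((aql, bindVars))
        return [{"title": "Next of Kin", "year": 1984}]

def run_aql(db, aql, bind_vars=None):
    """Run an AQL query and return the raw results as a plain list."""
    cursor = db.AQLQuery(aql, rawResults=True, batchSize=1000, bindVars=bind_vars or {})
    return list(cursor)

query = "FOR m IN movie FILTER m.year == @year LIMIT 5 RETURN {title: m.title, year: m.year}"
fake_db = FakeDB()
rows = run_aql(fake_db, query, {"year": 1984})
print(rows)
```

Passing values through bind parameters rather than string concatenation keeps the query text fixed; the name `run_aql` also avoids shadowing Python's built-in `exec`.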

# Data Exploration


### Graph

Create a graph named `movies` in ArangoDB with all the node collections and the edge collection created in the previous section. 

![ex2](screenshots/pic3.png)

![ex2](screenshots/pic2.png)

Now that we can clearly see the connections with the descriptions, we perform graph exploration techniques that are available in ArangoDB for answering different types of research questions.


### Search movies containing given phrase in its plot
We do this by using ArangoSearch, a new feature in ArangoDB 3.5. To learn how it works, refer to [this](https://www.arangodb.com/arangodb-training-center/search/arangosearch/) blog.

We link the view named `search_` with the raw `movies` collection (whose attribute names the query below uses) to index the `Plot` column, and execute the following query. 

    for i in search_
        SEARCH PHRASE(i.Plot,'batman and robin', 'text_en')
        SORT TFIDF(i) desc
        limit 5
        return [i.Title, i['Release Year']]

### Search with specific Genre combinations
We use another new feature K_SHORTEST_PATHS ([details](https://www.arangodb.com/docs/stable/aql/graphs-kshortest-paths.html)) for this:

    FOR p IN ANY K_SHORTEST_PATHS 'genre/comedy' TO 'genre/horror'
      GRAPH 'movies'
          LIMIT 3
          RETURN [p.vertices[*]._key]

So we are able to find some movies that have the flavours of both comedy and horror. Let’s do a similar search with war and horror.


    FOR p IN ANY K_SHORTEST_PATHS 'genre/war' TO 'genre/horror'
      GRAPH 'movies'
          LIMIT 3
          RETURN [p.vertices[*]._key]

Here, we can observe that there is just one movie in the database, titled `Below`, that combines war and horror (its path contains only 3 vertices). The other two results are longer paths that merely connect movies through their shared `American` origin.
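The intuition behind these results can be sketched on a toy graph. The code below is a simplified breadth-first enumeration of simple paths, not ArangoDB's implementation; apart from `movie/Below`, the vertices and all edges are made up for illustration.

```python
from collections import deque

# Toy undirected graph; only 'movie/Below' is taken from the results above.
EDGES = [
    ("movie/Below", "genre/war"),
    ("movie/Below", "genre/horror"),
    ("movie/Below", "origin/American"),
    ("movie/X", "genre/war"),
    ("movie/X", "origin/American"),
    ("movie/Y", "genre/horror"),
    ("movie/Y", "origin/American"),
]

def neighbours(edges):
    """Adjacency lists for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def k_shortest_paths(edges, start, goal, k):
    """Return up to k simple paths from start to goal, shortest first."""
    adj = neighbours(edges)
    queue = deque([[start]])
    found = []
    while queue and len(found) < k:
        path = queue.popleft()
        if path[-1] == goal:
            found.append(path)
            continue
        for nxt in adj.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no repeated vertex)
                queue.append(path + [nxt])
    return found

paths = k_shortest_paths(EDGES, "genre/war", "genre/horror", 3)
for p in paths:
    print(len(p), p)
```

The shortest path has three vertices and passes through a single movie that carries both genres; the longer paths have five vertices and go through the shared origin vertex, which mirrors the observation above.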

### Using pruned traversal on graph

Now we look for `American action` movies by pruning on `genre` edges during graph traversal. Pruning stops the traversal from continuing past a matching edge, which improves query performance and reduces the overhead generated by the query.

    FOR v, e, p IN 1..3 ANY 'origin/American' GRAPH 'movies'
          PRUNE e.label == 'genre'
          FILTER v._key=='action'
          LIMIT 5
          RETURN p.vertices[1]._key

Without the PRUNE clause, the above query returns the same results but takes ~5 minutes.
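Why pruning helps can be illustrated with a small Python traversal. This is a simplified sketch on a made-up graph (the vertex names and the `origin` edge label are assumptions), and it only approximates AQL semantics: a pruned vertex is still emitted, but the traversal does not continue beyond it.

```python
# Toy graph as (from, to, label) triples; all names are made up for illustration.
EDGES = [
    ("movie/A", "origin/American", "origin"),
    ("movie/B", "origin/American", "origin"),
    ("movie/A", "genre/action", "genre"),
    ("movie/B", "genre/drama", "genre"),
    ("movie/C", "genre/action", "genre"),
    ("movie/D", "genre/drama", "genre"),
]

def traverse(edges, start, max_depth, prune=None):
    """Traverse edges in any direction and collect (vertex, label, path) tuples."""
    adj = {}
    for a, b, label in edges:
        adj.setdefault(a, []).append((b, label))
        adj.setdefault(b, []).append((a, label))
    results, stack = [], [(start, [start])]
    while stack:
        vertex, path = stack.pop()
        for nxt, label in adj.get(vertex, []):
            if nxt in path:
                continue
            new_path = path + [nxt]
            results.append((nxt, label, new_path))
            stop = prune is not None and prune(nxt, label, new_path)
            # Continue below this vertex only if not pruned and depth allows it.
            if not stop and len(new_path) - 1 < max_depth:
                stack.append((nxt, new_path))
    return results

full = traverse(EDGES, "origin/American", 3)
pruned = traverse(EDGES, "origin/American", 3, prune=lambda v, label, p: label == "genre")

def hits(results):
    """Movies (second vertex on the path) whose path ends at genre/action."""
    return [p[1] for v, label, p in results if v == "genre/action"]

print(len(full), len(pruned), hits(full), hits(pruned))
```

Both traversals find the same movie, but the pruned one never walks from a genre vertex back out to every other movie of that genre, which is where most of the work goes on a large graph.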

# Machine Learning

We are going to perform mainly two ML tasks:

1. Movie similarity based on plots - using Tensorflow. 
    - Content-based recommendation of movies.
2. Genre classification based on plots - using scikit-learn and Tensorflow. 
    - Predicting appropriate genres for data with null/unknown values.

## 1. Movie recommendation based on plots

In [5]:
from tqdm import tqdm_notebook
import tensorflow as tf
import tensorflow_hub as hub
from nltk import sent_tokenize
from scipy import spatial
from operator import itemgetter
import re

Apply basic regex tools to clean movie plots.

In [6]:
def clean_plot(text_list):
    clean_list = []
    for sent in text_list:
        # Raw strings avoid invalid-escape warnings for the backslashes below
        sent = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-.:;<=>?@[\]^`{|}~"""), '',sent)
        sent = sent.replace('[]','')
        sent = re.sub(r'\d+',' ',sent)
        sent = sent.lower()
        clean_list.append(sent)
    return clean_list

Find plot embeddings: (takes some time ~ 5 minutes)

In [7]:
plot_emb_list = []
with tf.Graph().as_default():
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    messages = tf.placeholder(dtype=tf.string, shape=[None])
    output = embed(messages)
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        for plot in tqdm_notebook(df['Plot']):
            sent_list = sent_tokenize(plot)
            clean_sent_list = clean_plot(sent_list)
            sent_embed = session.run(output, feed_dict={messages: clean_sent_list})
            plot_emb_list.append(sent_embed.mean(axis=0).reshape(1,512))            

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [8]:
df['embeddings'] = plot_emb_list
df.to_pickle('./df_embed.pkl')

In [9]:
def similar_movie(movie_name,topn=5):
    plot = df[df['Title']==movie_name]['Plot'].values[0]
    with tf.Graph().as_default():
        embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
        messages = tf.placeholder(dtype=tf.string, shape=[None])
        output = embed(messages)
        with tf.Session() as session:
            session.run([tf.global_variables_initializer(), tf.tables_initializer()])
            sent_list = sent_tokenize(plot)
            clean_sent_list = clean_plot(sent_list)
            sent_embed2 = (session.run(output, feed_dict={messages: clean_sent_list})).mean(axis=0).reshape(1,512)
            similarities, titles = [],[movie_name]
            for tensor,title in zip(df['embeddings'],df['Title']):
                if title not in titles:
                    cos_sim = 1 - spatial.distance.cosine(sent_embed2,tensor)
                    similarities.append((title,cos_sim))
                    titles.append(title)
            return sorted(similarities,key=itemgetter(1),reverse=True)[1:topn+1]

In [11]:
similar_movie('Superman',10)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



[('Batman v Superman: Dawn of Justice', 0.933032214641571),
 ('Superman Returns', 0.9193736910820007),
 ('Superman: Unbound', 0.9133195281028748),
 ('Superman IV: The Quest for Peace', 0.9110839366912842),
 ('Justice League: The Flashpoint Paradox', 0.9093491435050964),
 ('Man of Steel', 0.9017841219902039),
 ('Justice League', 0.8930277824401855),
 ('Megamind', 0.8920153379440308),
 ('Superman III', 0.8913161754608154),
 ('Hulk', 0.8826043605804443)]

We can see that, based on the plots, these are the top movies the model recommends as similar to the `Superman` movie. 
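The two operations behind this recommendation, averaging sentence vectors into one plot vector and ranking by cosine similarity, can be sketched without TensorFlow. The 3-dimensional vectors and titles below are made up; the real embeddings have 512 dimensions, and the helper names are hypothetical.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length sentence vectors into one plot vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, catalogue, topn=2):
    """Rank catalogue entries by cosine similarity to the query vector."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in catalogue.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Two made-up "sentence embeddings" pooled into one plot vector: [0.5, 0.5, 0.0]
plot_vec = mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
catalogue = {"close": [1.0, 1.0, 0.0], "far": [0.0, 0.0, 1.0], "mid": [1.0, 0.0, 0.0]}
ranking = most_similar(plot_vec, catalogue, topn=2)
print(ranking)
```

Because cosine similarity ignores vector length, a plot with many sentences and a plot with few can still be compared after mean pooling.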


## 2. Genre Prediction based on plot

### 2.1 Using simpler tools (scikit-learn)

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, accuracy_score

Use the tokenizer and remove unnecessary symbols/expressions.

In [13]:
new_plots = []
for plot in tqdm_notebook(df['Plot']):
    sent_list = sent_tokenize(plot)
    clean_sent_list = clean_plot(sent_list)
    new_plots.append(clean_sent_list[0])
df_new = df.copy()
df_new['clean plot'] = new_plots

Split data into train and test.

In [14]:
train_df = df_new[df_new['Genre']!='unknown'][['Title','clean plot','Genre']]
test_df = df_new[df_new['Genre']=='unknown'][['Title','clean plot','Genre']]
train_df['genre_new'] = [x.replace(' ',',').replace('_',',').replace('-',',').split(',') for x in train_df['Genre'].values]
test_df['genre_new'] = [x.replace(' ',',').replace('_',',').replace('-',',').split(',') for x in test_df['Genre'].values]

Remove stopwords from plots.

In [15]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# function to remove stopwords
def remove_stopwords(text):
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)
train_df['clean_plot_new'] = train_df['clean plot'].apply(remove_stopwords)
test_df['clean_plot_new'] = test_df['clean plot'].apply(remove_stopwords)

Apply binarizer for multi-label classification for Genre.

In [16]:
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(train_df['genre_new'])

# transform target variable
y = multilabel_binarizer.transform(train_df['genre_new'])
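What the binarizer produces can be sketched in a few lines of plain Python. This is an illustration of multi-hot encoding with hypothetical helper names, not scikit-learn's implementation; like `MultiLabelBinarizer`, it orders the classes alphabetically.

```python
def fit_classes(label_lists):
    """Collect the sorted set of all labels seen in the training data."""
    return sorted({label for labels in label_lists for label in labels})

def to_multi_hot(labels, classes):
    """Encode one movie's genre list as a 0/1 vector over all classes."""
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in classes]

def from_multi_hot(row, classes):
    """Decode a 0/1 vector back into a tuple of genre names."""
    return tuple(c for c, bit in zip(classes, row) if bit)

genres = [["comedy", "drama"], ["drama"], ["western"]]
classes = fit_classes(genres)
y_demo = [to_multi_hot(g, classes) for g in genres]
print(classes)
print(y_demo)
print(from_multi_hot(y_demo[0], classes))
```

Each column of the encoded matrix becomes one independent yes/no target, which is what the one-vs-rest classifier below is trained on.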

Compute TF-IDF features of the plots.

In [17]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain, xval, ytrain, yval = train_test_split(train_df['clean_plot_new'], y, test_size=0.2, random_state=9)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)

Define Logistic Regression model and train it.

In [18]:
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)
clf.fit(xtrain_tfidf, ytrain)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='warn', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

Because logistic regression is a rather simple model and the data is complicated, we lower the threshold on the predicted probabilities from 0.5 to 0.2.

In [19]:
y_pred_prob = clf.predict_proba(xval_tfidf)
y_pred_new = (y_pred_prob >= 0.2).astype(int)
print("Accuracy:" ,accuracy_score(yval, y_pred_new))
print("F1-score:" ,f1_score(yval, y_pred_new, average="micro"))

Accuracy: 0.12619336920673493
F1-score: 0.3614946739559263
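The gap between the two numbers is easier to read once the metrics are spelled out: for multi-label targets, `accuracy_score` counts a sample as correct only if every label matches, while micro-averaged F1 pools true and false positives over all labels. A minimal sketch with made-up probabilities and hypothetical helper names:

```python
def threshold(probs, t):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[1 if p >= t else 0 for p in row] for row in probs]

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)

def micro_f1(y_true, y_pred):
    """F1 computed from label-level counts pooled over all samples."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

# Made-up example: 3 movies, 3 genres
y_true = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
probs = [[0.6, 0.3, 0.1], [0.1, 0.9, 0.25], [0.1, 0.1, 0.4]]

for t in (0.5, 0.2):
    pred = threshold(probs, t)
    print(t, subset_accuracy(y_true, pred), micro_f1(y_true, pred))
```

In this toy case the lower threshold recovers labels that a 0.5 cut-off misses, at the cost of one false positive; whether that trade-off helps on the real data has to be checked on the validation split.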


Make predictions ...

In [20]:
def infer_tags(q):
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)

In [21]:
i=0
while i<5: 
    k = xval.sample(1).index[0]
    pred = infer_tags(xval.loc[k])
    if pred!=[()]:
        # .loc replaces the deprecated .ix indexer (removed in pandas 1.0)
        print("Movie:\t\t",train_df['Title'].loc[k])
        print("Predicted genre: ", pred)
        print("Actual genre: ",train_df['genre_new'].loc[k])
        i+=1

Movie:		 The Nanny Diaries
Predicted genre:  [('drama',)]
Actual genre:  ['comedy', 'drama']
Movie:		 Sasural
Predicted genre:  [('drama',)]
Actual genre:  ['family', 'drama']
Movie:		 You Can't Cheat an Honest Man
Predicted genre:  [('comedy',)]
Actual genre:  ['comedy']
Movie:		 A Mother's Story
Predicted genre:  [('drama',)]
Actual genre:  ['drama']
Movie:		 Bad Company
Predicted genre:  [('drama',)]
Actual genre:  ['drama']


In [22]:
xtest_tfidf = tfidf_vectorizer.transform(test_df['clean_plot_new'])
y_test_pred_prob = clf.predict_proba(xtest_tfidf)
y_test_pred_new = (y_test_pred_prob >= 0.2).astype(int)

In [23]:
i=0
while i<5: 
    k = test_df['clean plot'].sample(1).index[0]
    # .loc replaces the deprecated .ix indexer (removed in pandas 1.0)
    pred = infer_tags(test_df['clean plot'].loc[k])
    if pred!=[()]:
        print("Movie:\t\t\t",test_df['Title'].loc[k])
        print("Predicted genre:\t", pred)
        i+=1

Movie:			 The Bellboy
Predicted genre:	 [('comedy',)]
Movie:			 Vaigasi Poranthachu
Predicted genre:	 [('drama',)]
Movie:			 Avan (dubbed from Hindi)
Predicted genre:	 [('drama',)]
Movie:			 Ek Ruka Hua Faisla
Predicted genre:	 [('drama',)]
Movie:			 Anubandham
Predicted genre:	 [('drama',)]


### 2.2 Using Deep Learning (Tensorflow)

In [24]:
from nltk.corpus import stopwords
import nltk
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import Conv1D, GlobalMaxPool1D, Dropout, concatenate
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

Using TensorFlow backend.


In [25]:
# new_plots = []
# for plot in tqdm_notebook(df['Plot']):
#     sent_list = sent_tokenize(plot)
#     clean_sent_list = clean_plot(sent_list)
#     new_plots.append(clean_sent_list[0])
# df_new = df.copy()
# df_new['clean plot'] = new_plots

Apply the same preprocessing as in the previous case: remove unnecessary symbols/expressions and stopwords from `clean plot`.

In [26]:
# train_df = df_new[df_new['Genre']!='unknown'][['Title','clean plot','Genre']]
# test_df = df_new[df_new['Genre']=='unknown'][['Title','clean plot','Genre']]

In [27]:
# from nltk.corpus import stopwords
# stop_words = set(stopwords.words('english'))

# # function to remove stopwords
# def remove_stopwords(text):
#     no_stopword_text = [w for w in text.split() if not w in stop_words]
#     return ' '.join(no_stopword_text)
# train_df['clean_plot_new'] = train_df['clean plot'].apply(lambda x: remove_stopwords(x))
# test_df['clean_plot_new'] = test_df['clean plot'].apply(lambda x: remove_stopwords(x))
# train_df['genre_new'] = [x.replace(' ',',').replace('_',',').replace('-',',').split(',') for x in train_df['Genre'].values]
# test_df['genre_new'] = [x.replace(' ',',').replace('_',',').replace('-',',').split(',') for x in test_df['Genre'].values]


In [28]:
maxlen = 200
max_features = 20000
encoder = MultiLabelBinarizer()
encoder.fit_transform(train_df['genre_new'])
y_train = encoder.transform(train_df['genre_new'])
y_test = encoder.transform(test_df['genre_new'])
num_classes = len(encoder.classes_)

Train tokenizer on movie plots of training data.

In [29]:
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_df['clean_plot_new']))
# train data
list_tokenized_train = tokenizer.texts_to_sequences(train_df['clean_plot_new'])
X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
# test data
list_tokenized_test = tokenizer.texts_to_sequences(test_df['clean_plot_new'])
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)
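To show what the model actually receives, here is a plain-Python sketch of the two steps: mapping words to integer ids by frequency, then forcing every sequence to the same length. It mirrors the Keras defaults as I understand them (padding and truncation both happen at the front); the helper names are hypothetical and the texts are made up.

```python
from collections import Counter

def build_index(texts):
    """Assign ids starting at 1, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode(text, index):
    """Replace each known word by its id; unknown words are dropped."""
    return [index[word] for word in text.split() if word in index]

def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen ids and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

index = build_index(["ship sinks", "ship sails home"])
short = pad_pre(encode("ship sails", index), 4)
long_ = pad_pre([1, 2, 3, 4, 5, 6], 4)
print(index, short, long_)
```

With `maxlen = 200`, plots longer than 200 tokens lose their beginning and shorter ones are filled with zeros on the left.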

Define 1D-CNN Model here

In [30]:
def build_model(conv_layers = 2, max_dilation_rate = 3):
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Dropout(0.25)(x)
    x = Conv1D(2*embed_size, 
                   kernel_size = 3)(x)
    prefilt_x = Conv1D(2*embed_size, 
                   kernel_size = 3)(x)
    out_conv = []
    for dilation_rate in range(max_dilation_rate):
        x = prefilt_x
        for i in range(3):
            x = Conv1D(32*2**(i), 
                       kernel_size = 3, 
                       dilation_rate = dilation_rate+1)(x)    
        out_conv += [Dropout(0.5)(GlobalMaxPool1D()(x))]
    x = concatenate(out_conv, axis = -1)    
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(num_classes, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])

    return model
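The model ends in a sigmoid layer with one independent output per genre. For multi-label targets of this kind, a per-label binary cross-entropy is a common choice of loss. The sketch below compares textbook definitions of the two losses on a made-up prediction; it does not reproduce Keras internals, and the function names are hypothetical.

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Textbook form: only the positive labels contribute to the loss."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_cross_entropy(y_true, y_pred):
    """Mean of per-label losses: positives and negatives both contribute."""
    terms = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)

# Made-up case: one true genre, but the model says "yes" to every genre.
y_true = [1, 0, 0]
y_pred = [0.99, 0.99, 0.99]

cat_loss = categorical_cross_entropy(y_true, y_pred)
bin_loss = binary_cross_entropy(y_true, y_pred)
print(round(cat_loss, 4), round(bin_loss, 4))
```

In this example the categorical form barely penalises predicting every genre, while the binary form does. Trying `loss='binary_crossentropy'` in `model.compile` may therefore be worth an experiment, although results on this dataset would need to be verified.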

In [31]:
model = build_model()
model.summary()

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 200)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 128)     2560000     input_1[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 200, 128)     0           embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 198, 256)     98560       dropout_1[0][0]                  
____________________________________________________________________________________________
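The sequence lengths in the summary follow from the convolution settings: with stride 1 and the default `'valid'` padding, a `Conv1D` layer shortens the sequence by `dilation_rate * (kernel_size - 1)`. The sketch below applies that rule to the layers defined in `build_model`; only the first value (198) appears in the summary shown above, the rest are computed from the formula.

```python
def conv1d_valid_length(length, kernel_size=3, dilation_rate=1):
    """Output length of a Conv1D layer with stride 1 and 'valid' padding."""
    return length - dilation_rate * (kernel_size - 1)

maxlen = 200
after_first = conv1d_valid_length(maxlen)      # conv1d_1 in the summary
prefilt = conv1d_valid_length(after_first)     # second convolution (prefilt_x)

# Three parallel branches, each with three convolutions at a fixed dilation rate.
branch_lengths = {}
for dilation_rate in (1, 2, 3):
    length = prefilt
    for _ in range(3):
        length = conv1d_valid_length(length, dilation_rate=dilation_rate)
    branch_lengths[dilation_rate] = length

# Each branch ends with 32 * 2**2 = 128 filters; global max pooling keeps one
# value per filter, so concatenating the three branches gives 3 * 128 features.
concat_features = 3 * 32 * 2 ** 2

print(after_first, prefilt, branch_lengths, concat_features)
```

Higher dilation rates shrink the sequence faster because each kernel spans a wider window of tokens.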

In [32]:
batch_size = 128
epochs = 5

file_path="weights.hdf5"

checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
early = EarlyStopping(monitor="val_loss", mode="min", patience=4)

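callbacks_list = [checkpoint, early] 
# Pass the callbacks so that checkpointing and early stopping take effect
model.fit(X_t, y_train, 
          batch_size=batch_size, 
          epochs=epochs, 
          validation_split=0.2,
          callbacks=callbacks_list)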

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 23042 samples, validate on 5761 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1723d1dd8>

In [33]:
y_pred = model.predict(X_t)

In [34]:
y_pred_new = (y_pred >= 0.5).astype(int)
# Decode with the encoder fitted for this model; .loc replaces the removed .ix indexer
y_pred_genre = encoder.inverse_transform(y_pred_new)
for ind,i in enumerate(train_df.index[:20]):
    print("\nMovie:\t\t",train_df['Title'].loc[i])
    print("Predicted genre: ", y_pred_genre[ind])
    print("Actual genre: ",train_df['genre_new'].loc[i])


Movie:		 The Great Train Robbery
Predicted genre:  ('crime', 'drama', 'western')
Actual genre:  ['western']

Movie:		 The Suburbanite
Predicted genre:  ('comedy', 'drama')
Actual genre:  ['comedy']

Movie:		 Dream of a Rarebit Fiend
Predicted genre:  ('', '/', 'adventure', 'animated', 'animation', 'comedy', 'drama', 'family', 'fantasy', 'horror', 'short')
Actual genre:  ['short']

Movie:		 From Leadville to Aspen: A Hold-Up in the Rockies
Predicted genre:  ('comedy', 'drama', 'film')
Actual genre:  ['short', 'action/crime', 'western']

Movie:		 Kathleen Mavourneen
Predicted genre:  ('', 'adventure', 'animated', 'animation', 'comedy', 'drama', 'family', 'fantasy', 'horror', 'short')
Actual genre:  ['short', 'film']

Movie:		 Daniel Boone
Predicted genre:  ('', 'animated', 'animation', 'comedy', 'drama', 'family', 'fantasy', 'horror', 'short')
Actual genre:  ['biographical']

Movie:		 How Brown Saw the Baseball Game
Predicted genre:  ('comedy', 'drama')
Actual genre:  ['comedy']

Movie:

As we can see from these samples (which are predictions on the training data), this model recovers more of the actual genres than the previous one. The genres of a movie are identifiable from the plot summaries provided in the data. We can use these predictions in place of `unknown` or missing genre information for a movie.