##### <h1 style="text-align:center;"> Movie Genres Classification from their Poster Image using CNNs</h1>

<h3 style="text-align:center;">- Final Project -</h3>
<h5 style="text-align:center;">author: Davide Iacobelli, 254435 </h5>

# Introduction 
### Problem Description
This project aims to perform ***movie genre classification based only on movie poster images.*** <br>
For viewers, a movie poster is one of the first impressions of a movie's content and genre. Humans use cues such as color, objects, and the expressions on the actors' faces to quickly guess the genre (horror, comedy, animation, etc.). <br>
If humans can more or less predict the genre of a movie just by looking at its poster, then we can assume that the poster has characteristics that a machine learning algorithm could use to predict its genre. 

### Proposed Approach
To do that, a _Deep Neural Network_ (**Convolutional Neural Network**) is built to classify a given movie poster image into genres. Since a movie may belong to multiple genres, this is a _multi-label image classification problem_. 

First of all, we use a publicly available **IMDB dataset** (source: https://www.kaggle.com/neha1703/movie-genre-from-its-poster) collected from the most famous movie website (https://www.imdb.com/). <br>
Using the IMDB link of each movie (available in this dataset), we apply a **web scraping approach** to retrieve its poster image from the IMDB movie page and save it locally. 
Once this is done, we can finally build our Convolutional Neural Network to classify the movie genre based on the poster characteristics.

**Note:** since even a human can easily make mistakes in this task, our initial goal is to recognize correctly ***at least half of the movies***.

# Step 1: _Webscraping_
 
First of all, we take the original IMDB dataset and, using the information it contains (namely the IMDB link and IMDB id of each movie), we apply a web scraping approach to retrieve the movie poster images. 
For this task we use ***BeautifulSoup***, a Python library for parsing HTML that is widely used for web scraping. <br>
Since all movie pages on the IMDB website share the same structure (see figure below), we can get the poster link of each film simply by visiting its IMDB page and reading the _src_ attribute of the `<img>` tag that holds the poster. 
Once we have all poster links, we add them to our dataset.
<br><br>

![alt text](./imdb_screen.png)
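
To make the idea concrete, here is a minimal sketch of the extraction step on an inline HTML fragment. It is only an illustration: it uses the standard-library `html.parser` module instead of BeautifulSoup so that it runs without extra dependencies, the `PosterParser` class is a hypothetical helper, and the sample markup is an assumption (the real IMDB page may be structured differently).

```python
from html.parser import HTMLParser


class PosterParser(HTMLParser):
    """Collect the src of the first <img> found inside <div class="poster">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0       # nesting depth inside the poster div (0 = outside)
        self.poster_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif "poster" in (attrs.get("class") or "").split():
                self.div_depth = 1
        elif tag == "img" and self.div_depth > 0 and self.poster_src is None:
            self.poster_src = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


# Hypothetical page fragment; the real IMDB markup may differ.
sample_html = """
<div class="header"><img src="https://example.com/logo.png" alt="logo"></div>
<div class="poster">
  <a href="#"><img alt="Some Movie Poster" src="https://example.com/poster.jpg"></a>
</div>
"""

parser = PosterParser()
parser.feed(sample_html)
poster_src = parser.poster_src
print(poster_src)
```

The image in the header is ignored because it is not inside the poster block; only the `src` of the poster image is kept.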



In [1]:
import numpy as np
import pandas as pd
import glob
import scipy.misc
import imageio
import skimage
from tqdm import tqdm
import requests  
import re
from bs4 import BeautifulSoup  
from urllib.request import urlretrieve
import ast 
import matplotlib.pyplot as plt 


savelocation = 'imdb_posters/'

In [None]:
movie = pd.read_csv("movies_metadata.csv")

In [None]:
len(movie)

In [None]:
movie['imdb_link'] = ["https://www.imdb.com/title/"+str(x) for x in movie['imdb_id']]

In [None]:
imdbURLS = movie['imdb_link'].tolist()
imdbIDS = movie['imdb_id'].tolist()
records = []
missing_ids = []  # ids of movies whose page has no poster block

# Iterate over links and ids together so each id matches its own page.
for x, imdbID in tqdm(zip(imdbURLS, imdbIDS), total=len(imdbURLS)):
    # web scraping
    r = requests.get(x)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class': 'poster'})
    if results:
        first_result = results[0]
        postername = first_result.find('img')['alt']
        imgurl = first_result.find('img')['src']
        records.append((x, postername, imgurl))
    else:
        missing_ids.append(imdbID)

# Drop the movies without a poster in a single pass.
movie = movie[~movie['imdb_id'].isin(missing_ids)]


In [None]:
poster_df = pd.DataFrame(records, columns=['imdb_link', 'postername', 'poster_link'])

In [None]:
df_movietotal = pd.merge(movie, poster_df, on='imdb_link')

In [None]:
df_movietotal.to_csv('movie_metadataWithPoster.csv', sep='\t')

# Step 2: _Posters Download_

Once the _web scraping step_ is completed, our dataset also contains the poster image links. <br>
So now we are able to ***download poster images*** using those links. <br>
Before doing that, we apply a simple data cleaning step to the dataset: we drop all entries without a defined genre. 

Then we download every poster from its corresponding link and save it _using the IMDB id of the related movie as its file name_. In this way we preserve the relationship between movies and their poster images. 

**Note:** some images may get corrupted during the download, or may not be found at all. For these reasons, we check for corrupted images after the download, and we also drop from the dataset the rows of movies whose poster was not found. 
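
The data cleaning step mentioned above relies on the format of the raw "genres" column. The sketch below assumes, as the cells that follow suggest, that each cell is a stringified list of dicts with a `name` key; the sample rows (including the numeric ids) and the `genres_to_string` helper are hypothetical.

```python
import ast


def genres_to_string(raw):
    """Turn a stringified list of {'id', 'name'} dicts into 'Name1|Name2|'."""
    try:
        entries = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return ""          # unparsable cell: treat it as "no genre"
    return "".join(entry["name"] + "|" for entry in entries)


# Hypothetical rows mimicking the raw "genres" column.
raw_rows = [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[]",
]
parsed = [genres_to_string(r) for r in raw_rows]
kept = [g for g in parsed if g]   # drop entries without a defined genre
print(parsed)
print(kept)
```

An empty genre list becomes an empty string, which is exactly what the cleaning step looks for when it drops rows.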


In [None]:
df_movietotal = pd.read_csv("movie_metadataWithPoster.csv", sep='\t')

In [None]:
genres = []
for entry in df_movietotal["genres"]:
    list_genres = ""
    for genre in ast.literal_eval(entry):
        list_genres = list_genres + genre["name"] + "|"
    genres.append(list_genres)
df_movietotal["genres"] = genres

In [None]:
df_movietotal['genres'].replace('', np.nan, inplace=True)
df_movietotal.dropna(inplace=True)

In [None]:
df_poster = df_movietotal[['imdb_id','poster_link']]

In [None]:
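not_found = []  # index labels of rows whose poster could not be downloaded
for index, row in tqdm(df_poster.iterrows(), total=len(df_poster)):
    url = row['poster_link']
    if "https://m.media-amazon.com/" in str(url):
        imdb_id = row['imdb_id']  # avoid shadowing the built-in id()
        jpgname = savelocation + imdb_id + '.jpg'
        try:
            urlretrieve(url, jpgname)
        except OSError:
            # dead link or failed download (URLError is a subclass of OSError)
            not_found.append(index)
    else:
        not_found.append(index)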

In [None]:
from os import listdir
from PIL import Image
   
for filename in listdir(savelocation):
    if filename.endswith('.jpg'):
        try:
            img = Image.open(savelocation+filename) # open the image file
            img.verify() # verify that it is, in fact an image
        except (IOError, SyntaxError) as e:
            print('Bad file:', filename) # print out the names of corrupt files

In [None]:
# not_found holds index labels (from iterrows), not positions, so drop by label.
df_movietotal.drop(index=not_found, inplace=True)

In [None]:
columns_to_drop = []
for i in df_movietotal.columns:
    if "Unnamed" in i:
        columns_to_drop.append(i)

In [None]:
df_movietotal.drop(columns_to_drop, axis=1, inplace=True)

In [None]:
df_movietotal.to_csv('movie_metadataWithPoster.csv', sep='\t')

# Step 3: _Dataset Manipulation_

Before building our machine learning model, we manipulate the dataset obtained from the previous steps in order to _improve the model's performance_. <br>
First of all, we remove movies whose genres include _"TV Movie"_ or _"Foreign"_, since it is reasonable to assume that no poster characteristics identify them. 

Once this is done, we keep a copy of the whole dataset, which consists of **19103 entries**: too many for the hardware we have. So we draw a _random sample with fraction 0.1 of it_ (`frac=.1` in the code).

Since the original dataset (and therefore also the sample) is **heavily imbalanced with respect to genres** (genres like "Comedy" and "Drama" occur very often, while others are rare), we add entries that were not picked in the random sample in order to reduce this imbalance. In this way we tackle the imbalance problem through an ***oversampling approach***, which consists in adding instances of the less frequent classes in order to balance the dataset. <br>
This step is necessary since _imbalance in the training set can negatively affect the model's performance_. <br>
Thus, for each genre with a low number of occurrences in the random sample, we draw from the remaining dataset (original minus random sample) a number of instances that ***balances the genre with respect to the most frequent one***. <br>
Once this procedure is done for each imbalanced genre, our sample contains a more balanced number of movies per genre (remember that each movie can belong to several genres, so the counts cannot be matched exactly) and has a total size of 8453 movies. <br>
So finally we obtain a **more balanced sample of movies** that we can use in our model (and whose size is suitable for the available hardware).
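
The oversampling idea can be sketched on toy data as follows. This is a simplified illustration in plain Python, not the notebook's pandas code: the movie dicts, the `oversample` helper and the fixed random seed are all assumptions made for the example.

```python
import random


def count_genre(movies, genre):
    """Number of movies whose genre list contains `genre`."""
    return sum(genre in m["genres"] for m in movies)


def oversample(sample, pool, genres, rng):
    """Add movies from `pool` until each genre approaches the most frequent one."""
    target = max(count_genre(sample, g) for g in genres)
    balanced = list(sample)
    for g in genres:
        missing = target - count_genre(balanced, g)
        candidates = [m for m in pool if g in m["genres"]]
        if missing > 0 and candidates:
            # Sampling with replacement: the same movie may be added twice.
            balanced.extend(rng.choices(candidates, k=missing))
    return balanced


# Hypothetical toy data: each movie is a dict with a list of genres.
sample = [{"id": i, "genres": ["Drama"]} for i in range(6)]
sample += [{"id": 6, "genres": ["Western"]}]
pool = [{"id": 100 + i, "genres": ["Western"]} for i in range(3)]
pool += [{"id": 200, "genres": ["Western", "War"]}]

genres = ["Drama", "Western", "War"]
balanced = oversample(sample, pool, genres, random.Random(0))
for g in genres:
    print(g, count_genre(sample, g), "->", count_genre(balanced, g))
```

Because a movie can carry several genres, filling up "War" also adds "Western" movies, so "Western" ends up above the target. The same effect explains why the genre counts printed below are closer to each other but not identical.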

In [25]:
df_movietotal = pd.read_csv("movie_metadataWithPoster.csv", sep='\t')

In [26]:
df_movietotal = df_movietotal[~df_movietotal["genres"].str.contains("TV Movie")]
df_movietotal = df_movietotal[~df_movietotal["genres"].str.contains("Foreign")]

In [27]:
df_movietotal_copy = df_movietotal.copy()

In [28]:
len(df_movietotal)

19103

In [29]:
df_movietotal = df_movietotal.sample(frac=.1)
df_movietotal_copy = pd.concat([df_movietotal, df_movietotal_copy]).drop_duplicates(keep=False)

In [30]:
label_dict = {"word2idx": {}, "idx2word": []}
idx = 0
genre_per_movie = df_movietotal["genres"].apply(lambda x: str(x).split("|")[:-1])
for l in [g for d in genre_per_movie for g in d]:
    if l in label_dict["idx2word"]:
        pass
    else:
        label_dict["idx2word"].append(l)
        label_dict["word2idx"][l] = idx
        idx += 1
n_classes = len(label_dict["idx2word"])

In [31]:
def genre_count(df, label_dict):
    max_genre = 0
    for label in label_dict["idx2word"]:
        occurrences = len((df[df['genres'].str.contains(label)]))
        print(label, occurrences)
        if occurrences > max_genre:
            max_genre = occurrences
    return max_genre

In [32]:
max_genre = genre_count(df_movietotal, label_dict)

Crime 242
Drama 896
Thriller 462
Western 56
Animation 70
Comedy 584
Family 129
Adventure 220
Science Fiction 181
Fantasy 122
Horror 286
Mystery 155
Action 381
War 75
Documentary 106
History 70
Romance 339
Music 88


In [33]:
# IMBALANCE: OVERSAMPLING SOLUTION
df_movietotal_copy = df_movietotal_copy[~df_movietotal_copy["genres"].str.contains("Comedy")]
df_movietotal_copy = df_movietotal_copy[~df_movietotal_copy["genres"].str.contains("Drama")]
    
for label in label_dict["idx2word"]:
    if label not in ["Drama", "Comedy"]:
        len_genre = len(df_movietotal[df_movietotal['genres'].str.contains(label)])
        df_genre = df_movietotal_copy[df_movietotal_copy['genres'].str.contains(label)]
        #df_genre['genres'] = [label+"|" for i in range (0, len(df_genre))]    
        if (max_genre - len_genre) > 0:
            if len_genre > 3000:
                param = 0
            elif len_genre > 2000:
                param = 0.3
            elif len_genre > 1000:
                param = 0.5
            else:
                param = 0.9
            df_class_over = df_genre.sample(int((max_genre-len_genre)*param)+1, replace=True)
            df_movietotal = pd.concat([df_movietotal, df_class_over], axis=0)

print('Random over-sampling:')
print(genre_count(df_movietotal, label_dict))

Random over-sampling:
Crime 1107
Drama 896
Thriller 1876
Western 1022
Animation 1119
Comedy 584
Family 1137
Adventure 1468
Science Fiction 1153
Fantasy 989
Horror 1222
Mystery 903
Action 2152
War 995
Documentary 1361
History 842
Romance 957
Music 839
2152


In [35]:
len(df_movietotal)

8453

In [38]:
print(genre_count(df_movietotal, label_dict))

Crime 1107
Drama 896
Thriller 1876
Western 1022
Animation 1119
Comedy 584
Family 1137
Adventure 1468
Science Fiction 1153
Fantasy 989
Horror 1222
Mystery 903
Action 2152
War 995
Documentary 1361
History 842
Romance 957
Music 839
2152


# Step 4: _Poster Images Preprocessing and Final Dataset Construction_

Before starting with the Convolutional Neural Network, we need to preprocess the images in order to build a final dataset that can be used to train our CNN. <br> 
In the following cells we define the functions used for preprocessing. These functions resize the poster images so that they all have the same size, which matches the input size of our CNN. We read all poster images using the Python library _imageio_; each image is returned as a numpy array that carries a dict of metadata in its ‘meta’ attribute.

Furthermore, in this step we ***one-hot encode*** our target variable (_"genres"_). Since a movie can have several genres, each target vector may contain more than one 1. 

Finally we obtain our final dataset, which has as **_X variable_ the poster image numpy arrays** (obtained by processing each image) and as **_Y variable_ the target variable "genres", one-hot encoded**.
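
As an illustration of the encoding, the sketch below builds the genre-to-index mapping and the 0/1 target vectors for three made-up genre strings. It uses plain Python lists rather than numpy, and the helper names `build_label_index` and `encode` are hypothetical.

```python
def build_label_index(genre_strings):
    """Map each genre name to a column index, in order of first appearance."""
    word2idx = {}
    for s in genre_strings:
        for name in s.split("|")[:-1]:   # the trailing "|" leaves an empty last item
            word2idx.setdefault(name, len(word2idx))
    return word2idx


def encode(genre_string, word2idx):
    """Return a 0/1 vector with a 1 for every genre the movie belongs to."""
    vector = [0] * len(word2idx)
    for name in genre_string.split("|")[:-1]:
        vector[word2idx[name]] = 1
    return vector


# Hypothetical genre strings in the "Name1|Name2|" format used by the dataset.
genre_strings = ["Comedy|Drama|", "Horror|", "Drama|Horror|Thriller|"]
word2idx = build_label_index(genre_strings)
encoded = [encode(s, word2idx) for s in genre_strings]
print(word2idx)
print(encoded)
```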

In [39]:
image_glob = glob.glob(savelocation + "*.jpg")
img_dict = {}


def get_id(filename):
    index_s = filename.rfind("/") + 1
    index_f = filename.rfind(".jpg")
    return filename[index_s:index_f]

In [40]:
for fn in image_glob:
    try:
        img_dict[get_id(fn)] = imageio.imread(fn)
    except Exception:
        # skip files that imageio cannot read (e.g. corrupted downloads)
        pass

In [41]:
def show_img(id):
    title = df_movietotal[df_movietotal["imdb_id"] == id]["original_title"].values[0]
    genre = df_movietotal[df_movietotal["imdb_id"] == id]["genres"].values[0]
    plt.imshow(img_dict[id])
    plt.title("{} \n {}".format(title, genre))

In [42]:
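from skimage.transform import resize


def preprocess(img, size=(150, 101, 3)):
    # resize() converts integer images to floats in [0, 1] by default;
    # preserve_range=True keeps 0-255 so the scaling below maps to [-1, 1].
    img = resize(img, size, preserve_range=True)
    img = img.astype(np.float32)
    img = (img / 127.5) - 1.
    return img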

In [43]:
def prepare_data(data, img_dict, label_dict, size=(150, 101, 3)):
    print("Generation dataset...")
    dataset = []
    y = []
    ids = []
    n_samples = len(img_dict)
    print("got {} posters".format(n_samples))
    for k in img_dict:
        if k in data["imdb_id"].values:
            G = data[data["imdb_id"] == k]["genres"].values
            for g in G: 
                g = g.split("|")[:-1]
                img = preprocess(img_dict[k], size)
                if img.shape != tuple(size):  # compare with the requested size
                    continue
                l = np.sum([np.eye(n_classes, dtype="uint8")[label_dict["word2idx"][s]] 
                                                            for s in g], axis=0)
                y.append(l)
                dataset.append(img)
                ids.append(k)
    print("DONE")
    print(len(dataset))
    return dataset, y, ids

In [45]:
df_movietotal = df_movietotal[['genres', 'id', 'original_title', 'imdb_id', 'title']]

In [46]:
SIZE = (150, 101, 3)
dataset, y, ids =  prepare_data(df_movietotal, img_dict, label_dict, size=SIZE)

Generation dataset...
got 19587 posters


  warn("The default mode, 'constant', will be changed to 'reflect' in "
  warn("Anti-aliasing will be enabled by default in skimage 0.15 to "


DONE
8453


# Step 5: _Convolutional Neural Network (using Keras Framework)_

### Model Construction 
Finally, we are ready to build our model, namely a Convolutional Neural Network. For this purpose we use **Keras**, a Python deep learning framework. <br>

The Keras model type we use is Sequential, the easiest way to build a model, since it lets us build the model layer by layer. <br>
The core of the model consists of 4 __Conv2D layers__. These are **convolution layers** that process our **input images**, which are treated as 2-dimensional matrices of pixels. <br>
The first of these has **32** filters, the second and the last have **64** filters, while the third has **128** filters. After each pair of Conv2D layers the model also applies a **MaxPooling2D** layer and a **Dropout** layer. <br>
The first layer also takes an input shape, which is the shape of each input image.

Since we are modelling a Convolutional Neural Network, we also need **filter matrices**. These matrices are fundamental in CNNs, since _the model multiplies a patch of pixels element-wise with a filter matrix (or ‘kernel’) and sums up the products_. Then the filter slides over to the next pixel and repeats the same process until all the image pixels have been covered. 

In our case the size of the filter matrix is 3, which means we will have a 3x3 filter matrix for each Conv2D layer. 
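
The sliding-window operation described above can be sketched in a few lines of plain Python. This is a single-channel toy version with made-up values: a real Conv2D layer works on several channels at once, learns its filter values during training, and adds a bias term.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 single-channel "image" and a 3x3 filter.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-6, -6], [-6, -6]]
```

A 3x3 filter without padding shrinks each side of the image by 2, which is why the 4x4 input produces a 2x2 feature map.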

Between the Conv2D layers and the Dense layers there is a **Flatten layer**, which turns the convolutional feature maps into a single vector that the Dense layers can process.
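
To get a feeling for the size of the flattened vector, we can follow the image shape through the layer sequence used in the model code (two pairs of 3x3 convolutions, each pair followed by 2x2 max pooling). The sketch assumes the Keras defaults of no padding and stride 1 for the convolutions; the helper functions are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output length of an unpadded, stride-1 convolution along one axis."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output length of non-overlapping pooling (any remainder is dropped)."""
    return size // pool


height, width = 150, 101          # input size used in this notebook
layers = ["conv", "conv", "pool", "conv", "conv", "pool"]
shapes = []
for layer in layers:
    step = conv_out if layer == "conv" else pool_out
    height, width = step(height), step(width)
    shapes.append((layer, height, width))
    print(layer, height, width)

last_filters = 64                 # filters in the fourth Conv2D layer
flattened = height * width * last_filters
print("flattened features:", flattened)  # 47872
```

Under these assumptions the feature maps shrink from 150x101 to 34x22, so the Flatten layer hands 34 * 22 * 64 = 47872 values to the first Dense layer.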

The last two layers are **Dense** layers, one with **128** units and one (the last one, namely the output layer) with **18** units, since we have 18 output classes. 

The activation of the last layer is **sigmoid**, since we are dealing with a _multi-label classification problem_: each genre gets its own independent probability.
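
The following sketch contrasts sigmoid with softmax on made-up scores, and computes the binary cross-entropy that goes with sigmoid outputs. The numbers are hypothetical and the functions are simplified plain-Python versions, not the Keras implementations.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, y_prob):
    """Mean of the per-label binary cross-entropy terms."""
    terms = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, y_prob)
    ]
    return sum(terms) / len(terms)


# Hypothetical raw scores (logits) for three genres of one poster.
logits = [2.0, 1.5, -3.0]
y_true = [1, 1, 0]                 # the movie belongs to the first two genres

sigmoid_probs = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)
loss = binary_crossentropy(y_true, sigmoid_probs)

print([round(p, 3) for p in sigmoid_probs])
print([round(p, 3) for p in softmax_probs])
print(round(loss, 3))
```

With sigmoid, both true genres can receive a high probability at the same time. Softmax forces the probabilities to sum to 1, so the genres would compete with each other, which does not fit a multi-label problem.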

### Training the Model and Predict Genres
Once the model is built, we are ready to train it. We **train** on the first **4000** instances (due to hardware limitations). <br>
According to the Keras log, the **training works reasonably well**, reaching a **loss of 0.39**. <br> 
Finally, we predict the genres of **100 test** instances and evaluate the results. <br>
To **evaluate the prediction**, we take the 3 genres with the highest predicted scores for each movie and check them against the movie's real genres: each of the 3 that is a real genre counts as 1, and each that is not counts as 0. The final score is the total count divided by the number of test movies. 
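
The evaluation rule can be sketched on two made-up movies as follows. The labels, the scores and the helper names are hypothetical, and the sketch uses plain Python lists instead of numpy.

```python
def top3_hits(y_true, scores):
    """Count how many of the 3 highest-scoring genres are true genres."""
    top3 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
    return sum(y_true[i] == 1 for i in top3)


def mean_top3_hits(y_true_rows, score_rows):
    hits = [top3_hits(t, s) for t, s in zip(y_true_rows, score_rows)]
    return sum(hits) / len(hits)


# Hypothetical labels and predicted scores for two movies and four genres.
y_true_rows = [[1, 1, 0, 0], [1, 0, 0, 0]]
score_rows = [[0.9, 0.8, 0.1, 0.7], [0.1, 0.2, 0.9, 0.3]]

score = mean_top3_hits(y_true_rows, score_rows)
print(score)  # 1.0
```

The first movie scores 2 hits and the second 0, giving a mean of 1.0. Note that this score is the average number of hits per movie, so it can exceed 1 and is not an accuracy in the strict sense.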

In [47]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization

Using TensorFlow backend.


In [48]:
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(SIZE[0], SIZE[1], 3)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(18, activation='sigmoid'))

In [49]:
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adagrad(),
              metrics=['accuracy'])

In [50]:
n = 4000
model.fit(np.array(dataset[: n]), np.array(y[: n]), batch_size=16, epochs=5,
          verbose=1, validation_split=0.1)


Train on 3600 samples, validate on 400 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fca51898ba8>

In [51]:
n_test = 100
X_test = dataset[n:n + n_test]
y_test = y[n:n + n_test]

In [52]:
pred = model.predict(np.array(X_test))

In [63]:
def accuracy_score(y_test, pred):
    value = 0
    for i in range(0, len(pred)):
        first3_index = np.argsort(pred[i])[-3:]
        correct = np.where(y_test[i] == 1)[0]
        for j in first3_index:
            if j in correct:
                value += 1
    print(value/len(pred))
        

In [64]:
accuracy_score(y_test, pred)

0.64


# Considerations about the Results

***In the end, a score of 0.64 is reached, which is satisfying considering our initial goal of reaching at least 0.5.***