## Part 1 - Movie Poster Emotion Analysis


### Introduction

There are many ways to decide whether you would like to watch a movie. Here is a list of common ones:

 1. Read movie critics
 2. Watch trailers
 3. Friends' recommendations
 ...

But sometimes, strangely enough, you might find yourself making this decision based on a quick glance at the movie poster. Something about the poster grabs your initial attention, something that makes you at least a little curious to read the movie description.

What is that something? Is it really important? Do we really pick a movie based on its cover?

These are the questions I had when I thought about the modern way of selecting movies. Maybe the single best question to summarize this analysis is:

Do we really select a movie to watch based on its poster/cover?

or even

Does the **"Facial Expression"** on a movie poster matter to its **"popularity"**?


I decided to answer this question by collecting relevant data and posters. My intention is to continue this project and refine it as I gain experience with ***deep-learning*** and image ***Emotion Detection***. There are endless ways to approach this question. For my first attempt, however, I decided to study one of the most common features of movie posters: the **Human Face** and the **Human Facial Expression**. The initial idea is to understand the importance of human faces on movie posters. Are we more likely to watch a movie when there is a human face on the cover? What about the expression? Is that important? Are "**Happy**" movies more popular than **"Sad"** movies?

In this analysis, the `Vote_Count` variable is assumed to be a valid proxy for viewers' interest and is therefore used as the response variable.

### Data and methods

#### Movie Data:
There are a few ways to collect movie-related statistics, including several publicly available datasets on Kaggle. I was particularly inspired by the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata). It is a nice dataset with the structure and framework that I was seeking. It doesn't, however, come with posters, and its dataframe needs cleaning. That is why I decided to "scrape" my own data from the [TMDB](https://www.themoviedb.org/) website. Many thanks to the organizers of this website for their openness to developers.

Several Python wrappers are available for the TMDB API, from which I used [tmdbsimple](https://github.com/celiao/tmdbsimple/) (can be installed from GitHub).

In [2]:
import tmdbsimple as tmdb #TMDB API
import urllib.request #URL handling modules
import pandas as pd
import re # Regular Expressions

In [None]:
tmdb.API_KEY = 'USER_SPECIFIC_Key'

Using the above API, I retrieved a list of ~4400 movies from TMDB. Various pieces of information are available for each title, of which I stored the following:

 1. Title
 2. Budget
 3. Genre
 4. Original Language
 5. Popularity
 6. Revenue
 7. Runtime
 8. User Rating
 9. Vote Count
 10. Poster URL
 11. Movie ID, and
 12. Release Date

Finally, by passing the URL of each poster to my code, I was able to download the posters for all of the retrieved titles.

In [None]:
search = tmdb.Search()

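# Now that we have the IDs of all of our movies, we will proceed to download their posters from TMDB.
# `m_id` (the list of TMDB movie IDs) and `outdir` (the poster folder path) are assumed to be defined beforehand.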
mdic = {}
mdb = {}

title = []
budget = []
genre1 = []
org_lan = []
pop = []
rev = []
run = []
rate = []
v_count = []
poster = []
poster_url = []
r_date = []
m_ids = []

for ids in m_id:
    mv = tmdb.Movies(ids)
    mv.info()
    # Sanitized title (runs of non-alphanumeric characters replaced by a space); not used further below
    tt = re.sub(r"[^a-zA-Z0-9]+"," " , mv.title)
    m_ids.append(mv.id)
    title.append(mv.title)
    budget.append(mv.budget)
    genre1.append(mv.genres[0]['name'])
    org_lan.append(mv.original_language)
    pop.append(mv.popularity)
    rev.append(mv.revenue)
    run.append(mv.runtime)
    rate.append(mv.vote_average)
    v_count.append(mv.vote_count)
    poster.append(mv.poster_path)
    r_date.append(mv.release_date)  # the TMDB field is named release_date
    # The absolute url
    url = 'https://image.tmdb.org/t/p/original'
    im_url = url + mv.poster_path
    poster_url.append(im_url)
    # Let's download the poster and save it in the movie poster folder
    urllib.request.urlretrieve(im_url, outdir + str(ids) + '.jpg')
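
The download loop above concatenates the base URL with `mv.poster_path`, which fails for any title whose poster path is empty (TMDB can return no poster for a movie). A minimal sketch of a guard is shown below. The helper name `build_poster_url` is hypothetical and not part of tmdbsimple; the `original` size segment is the one used in this notebook, and `w500` is given only as an example of a smaller size.

```python
from typing import Optional

# Base of the image URL used in the loop above; the size segment follows it.
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p/"


def build_poster_url(poster_path: Optional[str], size: str = "original") -> Optional[str]:
    """Return the absolute poster URL, or None when the title has no poster.

    Hypothetical helper for illustration; `poster_path` is the value of
    `mv.poster_path`, which starts with a slash when present.
    """
    if not poster_path:
        return None
    return TMDB_IMAGE_BASE + size + poster_path


print(build_poster_url("/abc123.jpg"))
print(build_poster_url("/abc123.jpg", size="w500"))
print(build_poster_url(None))
```

In the loop, a `None` result could be used to skip the `urlretrieve` call for that title instead of raising an error halfway through ~4400 downloads.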

Finally, I was able to construct a dataframe containing the above information.

In [None]:
# Construct the dataframe
m_df = pd.DataFrame({'Movie_ID':m_ids,
                     'Title':title,
                     'Budget':budget,
                     'Original_Language':org_lan,
                     'popularity': pop,
                     'release_date': r_date,
                     'Revenue' :rev,
                     'Runtime' :run,
                     'Rating':rate,
                     'Vote_Count':v_count,
                     'Poster_url':poster_url})

The constructed dataframe looks like this:

In [8]:
# Movie DataFrame
# Since this is not a "live" snippet, I'll be importing the downloaded dataframe from HDD.
m_df = pd.read_csv('/home/fyousef/face_rec/MOVIE_5000/movies_TMDB.csv')
m_df.head(5)

| Unnamed: 0 | Movie_ID | Title | Budget | Original_Language | popularity | release_date | Revenue | Runtime | Rating | Vote_Count | Poster_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 868 | Tsotsi | 3000000 | af | 2.504169 | 8/18/2005 | 9879971 | 94.0 | 7.0 | 137 | https://image.tmdb.org/t/p/original/6ylcfUctX2... |
| 1 | 17654 | District 9 | 30000000 | en | 63.13678 | 8/5/2009 | 210819611 | 112.0 | 7.3 | 5066 | https://image.tmdb.org/t/p/original/axFmCRNQsW... |
| 2 | 1725 | West Side Story | 6000000 | en | 23.431117 | 10/18/1961 | 43656822 | 152.0 | 7.3 | 727 | https://image.tmdb.org/t/p/original/zRQhCSREdR... |
| 3 | 7347 | Elite Squad | 4000000 | pt | 23.604936 | 10/12/2007 | 0 | 115.0 | 8.0 | 896 | https://image.tmdb.org/t/p/original/soOOLcNFRH... |
| 4 | 12405 | Slumdog Millionaire | 15000000 | en | 59.258969 | 5/12/2008 | 377910544 | 120.0 | 7.7 | 5195 | https://image.tmdb.org/t/p/original/gWE4R4DjcU... |


The dataframe was finally saved to HDD.

In [None]:
# Save the dataframe
m_df.to_csv('movies_TMDB.csv', sep=",")
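
The `Unnamed: 0` column in the preview of the stored file is the dataframe index, which `to_csv` writes by default and `read_csv` then reads back as an unnamed column. The following sketch shows the difference on a two-row toy dataframe (the toy data reuses two rows from the preview; `io.StringIO` stands in for a file on disk):

```python
import io

import pandas as pd

df = pd.DataFrame({"Movie_ID": [868, 17654], "Title": ["Tsotsi", "District 9"]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# With index=False only the real columns are written.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving, or `index_col=0` when reading, are two ways to avoid carrying the extra column through the analysis.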

#### Poster Analysis

The second part of the data gathering was spent on performing **Face Recognition** and **Emotion Detection** on the images. Face detection and emotion analysis are relatively new areas that rely on complex algorithms, and many of these techniques require knowledge of **deep-learning** and **machine-learning**. I used the [`face_recognition`](https://github.com/ageitgey/face_recognition) module from GitHub for facial recognition. It is a simple yet efficient tool for quickly finding faces in the movie posters. The caveat is that its ability to detect faces depends on the image resolution: the module finds fewer faces in small posters than in comparable larger (1200x2000 pixel) posters. Therefore, one should be willing to sacrifice computation time for accurate detection (provided that higher-resolution posters are available).

I also used the newly distributed [`EmoPy`](https://github.com/thoughtworksarts/EmoPy) module for facial emotion recognition. This package uses **TensorFlow** and deep-learning to detect the emotion of each face in the poster, and it offers several detection levels. Processing time is significant for this module. The code for face recognition and emotion detection is given further below. This work yields two final parameters:

 1. Face Score (a float between 1 and 3 for each movie poster representing the general facial expression). 1 means angry, 2 is a middle state representing calmness, and 3 means happy. This value is averaged over all the detected faces in the movie poster.
 2. Face Count (the number of faces detected at the available poster resolution)
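
As a worked example of the Face Score, the sketch below averages per-face scores using the mapping applied by the scoring function in this notebook (anger = 1, calm = 2, happiness = 3). The names `EMOTION_SCORES` and `face_score` are illustrative only and are not the functions used for the actual run.

```python
import math

# Label-to-score mapping: anger = 1, calm = 2, happiness = 3.
EMOTION_SCORES = {"anger": 1, "calm": 2, "happiness": 3}


def face_score(emotions):
    """Average the per-face scores; return NaN when no scored face is present."""
    scores = [EMOTION_SCORES[e] for e in emotions if e in EMOTION_SCORES]
    if not scores:
        return math.nan
    return sum(scores) / len(scores)


# One happy face and two calm faces: (3 + 2 + 2) / 3
print(round(face_score(["happiness", "calm", "calm"]), 6))  # 2.333333
# A poster with no detected faces has no score.
print(face_score([]))  # nan
```

A poster with three faces scoring 3, 2 and 2 therefore gets a Face Score of about 2.33 and a Face Count of 3.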

In [None]:
from PIL import Image
import face_recognition
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
import pickle
import pandas as pd  # alias must be pd; "np" would shadow numpy

# The EmoPy Package for emotion detection
from EmoPy.src.fermodel import FERModel
from pkg_resources import resource_filename

I also wrote two functions to automate the face recognition and emotion detection for the ~4400 movie posters. 

In [None]:
# Face_emotion_function
def f_location(f_l):
    emos = []
    for face_location in f_l:
        # Print the location of each face in this image
        top, right, bottom, left = face_location
        print("A face is located at pixel location Top: {}, Left: {}, Bottom: {}, Right: {}"
              .format(top, left, bottom, right))

        # Crop the face region from `image`, the global array loaded by the calling loop
        face_image = image[top:bottom, left:right]
        pil_image = Image.fromarray(face_image)
        # Save the Image to HDD for Emotion Recognition
        pil_image.save('IM_obj.jpg')
        target_emotions = ['calm','anger','happiness']
        # Emotion Recognition Model setup
        model = FERModel(target_emotions, verbose=True)
        # Making predictions
        model.predict('IM_obj.jpg')
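        # Read the predicted emotion label back from emot.txt
        with open('emot.txt', 'r') as file:
            # Strip surrounding whitespace so the label compares cleanly in emo_score
            emo = file.read().strip()
        emos.append(emo)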
    return emos

In [None]:
# Emotion_summary_function or f_score function
def emo_score(emos):
    score  = []
    for n in emos:
        if n == 'anger':
            score.append(1)
        elif n == 'calm':
            score.append(2)
        elif n == 'happiness':
            score.append(3)
    try:
        return sum(score) / len(score)
    except ZeroDivisionError:
        # No scored faces in this poster: return NaN explicitly
        return float('nan')

The following code section uses the above functions to summarize and store our metrics in a few lists. 

In [None]:
# The code below does two things:
# 1) Gets the file name (which is the movie ID)
# 2) Reads each image in the directory into the face_recognition package
# `im_list` is assumed to be the directory holding the pickled lists of poster paths
m_id = [] # Movie ID
m_fs = [] # Face Score
m_nf = [] # Number of faces

for mlist in glob.glob(os.path.join(im_list, '*.txt')):
    print(mlist)
    with open(mlist, 'rb') as f:
        lst = pickle.load(f)
        for item in lst:
            m_id.append(os.path.splitext(os.path.basename(item))[0])
            print(item)
            image = face_recognition.load_image_file(item)
            face_locations = face_recognition.face_locations(image, number_of_times_to_upsample=0, model="cnn")
            print("I found {} face(s) in this photograph.".format(len(face_locations)))
            emo = f_location(face_locations)
            fs = emo_score(emo)
            m_fs.append(fs)
            m_nf.append(len(face_locations))

print(m_id)
print(m_fs)
print(m_nf)
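
Note that the IDs collected in `m_id` above are strings taken from the poster file names, whereas `Movie_ID` in the movie dataframe is numeric. If the two tables are later matched on the movie ID (an assumption about the next step), the IDs need a common type. A minimal sketch with a hypothetical helper and a made-up path:

```python
import os


def movie_id_from_path(path):
    """Return the integer movie ID encoded in a poster file name such as '868.jpg'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem)


# 'posters/868.jpg' is an illustrative path, not the actual folder layout.
print(movie_id_from_path("posters/868.jpg"))  # 868
```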

#### Note
Again, since this is not "live" code and the poster processing (~16 hrs for ~4400 posters) was performed separately, we will be using the stored data.

In [7]:
# Reading the emotion data frame
e_df = pd.read_csv('/home/fyousef/face_rec/MOVIE_5000/movies_emo_score.csv')
e_df.head(5)

| Unnamed: 0 | Movie_ID | Face_score | Face_count |
|---|---|---|---|
| 0 | 38970 | 2.333333 | 3 |
| 1 | 38985 | 1.0 | 1 |
| 2 | 39013 | 3.0 | 1 |
| 3 | 3902 |  | 0 |
| 4 | 39037 | 3.0 | 3 |
