# Imports

In [62]:
import numpy as np
import tensorflow as tf
import cv2
import os
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Data Exploration
## So first before we go onto the model, let's look at the data
Since our input X is going to be extremely large with almost 300 movie trailers, lets instead load in the average of each movie trailer and get a feel for our data

In [122]:
avgtrailers = np.zeros((300,144,256,3))
counter = 0
for filename in os.listdir('MovieTrailers'):
    cap = cv2.VideoCapture('MovieTrailers' + '/' + filename)
    trailer = []
    while cap.isOpened():
        ret, frame = cap.read()
        if ret == True:
            trailer.append(frame)
        else:
            break
    ## We add the np.uint8 to be able to visualize it with opencv
    avgtrailer = np.around(np.mean(trailer,axis = 0)).astype(np.uint8)
    # we add this line to force all averages to be the same size of 144x256
    avgtrailer = cv2.resize(avgtrailer,(256,144)).astype(np.uint8)
    cap.release()
    avgtrailers[counter] = avgtrailer
    counter = counter + 1

Let's Learn a little bit about X, from the code segment below, we can see we have 300 values and each picture is rgb and 144x256 pixels in size

In [123]:
print("Number of Examples = " + str(avgtrailers.shape[0]))
print("Size of Picture  = " + str(avgtrailers[0].shape))

Number of Examples = 300
Size of Picture  = (144, 256, 3)


Next we're going to load in the movie id, movie name, trailerurl, and the Genre (our label)
Our Genres can be one of three categories : Action, Horror, Comedy

In [127]:
label_dataframe = pd.read_csv("MovieTrailerData.csv")
label_dataframe.head()

Unnamed: 0,MovieId,MovieName,TrailerUrl,Genre
0,1,Extraction,https://www.youtube.com/watch?v=L6P3nI6VnlY,Action
1,2,The Gentleman,https://www.youtube.com/watch?v=Ify9S7hj480,Action
2,3,Code 8,https://www.youtube.com/watch?v=PrX1JJ5dduA,Action
3,4,Avengers: Endgame,https://www.youtube.com/watch?v=TcMBFSGVi1c,Action
4,5,Star Wars: The Rise of Skywalker,https://www.youtube.com/watch?v=8Qn_spdM5Zg,Action


Just to make it a little easier to work with let's get the genre and turn it into a matrix of one-hot vectors where:
- Action = [1,0,0]
- Comedy = [0,1,0]
- Horror = [0,0,1]

In [128]:
labels = label_dataframe["Genre"].to_numpy()
labels.reshape(labels.shape[0],1)
label_encoder = LabelEncoder()
vec = label_encoder.fit_transform(labels)
labels = tf.keras.utils.to_categorical(vec)
labels.shape

(300, 3)

## Let's Separate the Test/Dev/Test
One thing we have to keep in mind is that we have a relatively small dataset, so what we ar egoing to do is do a 60/20/20 test/dev/test split. In addition we are going to make sure each of the sets get the respective amount of each of the categories. Then after we split them, we are going to shuffle them once more

In [129]:
#initialize split vars and sets
train_percent =  .6
dev_percent = .2
test_percent = .2
training_x = np.zeros((int(train_percent * avgtrailers.shape[0]),144,256,3))
dev_x = np.zeros((int(dev_percent * avgtrailers.shape[0]),144,256,3))
test_x = np.zeros((int(test_percent * avgtrailers.shape[0]),144,256,3))

#Split the data by category first
Action_Examples = avgtrailers[0:100]
Horror_Examples = avgtrailers[100:200]
Comedy_Examples = avgtrailers[200:300]

#shuffle
np.random.seed(42)
np.random.shuffle(Action_Examples)
np.random.shuffle(Horror_Examples)
np.random.shuffle(Comedy_Examples)

#initialize splitting variables
train_set_length = int(train_percent * Action_Examples.shape[0])
dev_set_length = int(dev_percent * Action_Examples.shape[0])
test_set_length = int(test_percent * Action_Examples.shape[0])
train_set_end = train_set_length
dev_set_end = training_set_end + dev_set_length
test_set_end = dev_set_end + test_set_length

#split the categories into the sets equally
training_x[0:train_set_length] = Action_Examples[0:train_set_end]
training_x[train_set_length :train_set_length * 2] = Horror_Examples[0:train_set_end]
training_x[train_set_length*2:train_set_length * 3] = Comedy_Examples[0:train_set_end]

dev_x[0:dev_set_length] = Action_Examples[train_set_end: dev_set_end]
dev_x[dev_set_length :dev_set_length * 2] = Horror_Examples[train_set_end: dev_set_end]
dev_x[dev_set_length*2:dev_set_length * 3] = Comedy_Examples[train_set_end: dev_set_end]

test_x[0:test_set_length] = Action_Examples[dev_set_end: test_set_end]
test_x[test_set_length:test_set_length * 2] = Horror_Examples[dev_set_end: test_set_end]
test_x[test_set_length*2:test_set_length * 3] = Comedy_Examples[dev_set_end: test_set_end]

#split the labels
y_Action = labels[0:100]
y_Horror = labels[100:200]
y_Comedy = labels[200:300]

training_y = np.zeros((int(train_percent * avgtrailers.shape[0]),3))
dev_y = np.zeros((int(dev_percent * avgtrailers.shape[0]),3))
test_y = np.zeros((int(test_percent * avgtrailers.shape[0]),3))

training_y[0:train_set_length] = y_Action[0:train_set_end]
training_y[train_set_length :train_set_length * 2] = y_Horror[0:train_set_end]
training_y[train_set_length*2:train_set_length * 3] = y_Comedy[0:train_set_end]

dev_y[0:dev_set_length] = y_Action[train_set_end: dev_set_end]
dev_y[dev_set_length :dev_set_length * 2] = y_Horror[train_set_end: dev_set_end]
dev_y[dev_set_length*2:dev_set_length * 3] = y_Comedy[train_set_end: dev_set_end]

test_y[0:test_set_length] = y_Action[dev_set_end: test_set_end]
test_y[test_set_length:test_set_length * 2] = y_Horror[dev_set_end: test_set_end]
test_y[test_set_length*2:test_set_length * 3] = y_Comedy[dev_set_end: test_set_end]

In [130]:
avgtrailers[3]

array([[[ 88.,  94.,  99.],
        [ 88.,  95.,  99.],
        [ 88.,  95.,  99.],
        ...,
        [ 84.,  88.,  92.],
        [ 84.,  88.,  92.],
        [ 83.,  87.,  91.]],

       [[ 88.,  95.,  99.],
        [ 88.,  95.,  99.],
        [ 88.,  95.,  99.],
        ...,
        [ 85.,  89.,  93.],
        [ 85.,  90.,  93.],
        [ 84.,  88.,  91.]],

       [[ 89.,  95., 100.],
        [ 89.,  96., 100.],
        [ 89.,  96., 100.],
        ...,
        [ 86.,  90.,  93.],
        [ 86.,  90.,  93.],
        [ 84.,  88.,  91.]],

       ...,

       [[ 76.,  84.,  92.],
        [ 76.,  84.,  91.],
        [ 76.,  84.,  91.],
        ...,
        [ 83.,  91.,  96.],
        [ 83.,  91.,  97.],
        [ 83.,  91.,  96.]],

       [[ 75.,  83.,  91.],
        [ 75.,  83.,  91.],
        [ 75.,  83.,  91.],
        ...,
        [ 82.,  90.,  96.],
        [ 83.,  91.,  96.],
        [ 82.,  90.,  95.]],

       [[ 75.,  83.,  91.],
        [ 75.,  83.,  91.],
        [ 75.,  

## Let's see a couple of datapoints
Let's look at the first couple of datapoints, lets see the average of "Avengers: Endgame" (Action) , "I See You" (Horror), and "Jumanji: The Next Level" (Comedy)

In [131]:
cv2.imshow('frame', avgtrailers[3])
cv2.waitKey(0)
cv2.imwrite('NoteBookData/Avengers.jpg', avgtrailers[3])
cv2.imshow('frame',avgtrailers[158])
cv2.waitKey(0)
cv2.imwrite('NoteBookData/ISeeYou.jpg', avgtrailers[158])
cv2.imshow('frame',avgtrailers[202])
cv2.waitKey(0)
cv2.imwrite('NoteBookData/Jumanji.jpg', avgtrailers[202])
cv2.destroyAllWindows()

## Avengers: Endgame
![Avengers: Endgame](NoteBookData/Avengers.jpg)
## I See You
![I See You](NoteBookData/ISeeYou.jpg)
## Jumanji: The Next Level
![Jumanji: The Next Level](NoteBookData/Jumanji.jpg)

So what do we notice, Well if you squint you can definitely see the rating for the movie I See You since the rating was kept on for a larger chunk of the trailer, we also see for the Jumanji average, we can see the ending title card. What can we do to remedy this? Do we even need to remedy this? Let's first try and build a simple logistic regression model to predict our genre

# Simple Logistic Model (Without Pre-processing out Beginning and Ending Title/Rating/Producer Cards)
## What will our model look like?
Well since our pictures are 144x256x3, why don't we just make a logististic regression unit with 144x256x3 = 110,592 units?