# Imports

In [72]:
import numpy as np
import tensorflow as tf
import cv2
import os
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Data Exploration
## Before building the model, let's look at the data
Since our input X would be extremely large with about 300 movie trailers, let's instead load the average frame of each trailer to get a feel for the data.

In [9]:
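avgtrailers = []
# Note: os.listdir returns files in arbitrary order, so list indices
# only line up with the CSV rows if the file order happens to match
for filename in os.listdir('MovieTrailers'):
    cap = cv2.VideoCapture(os.path.join('MovieTrailers', filename))
    trailer = []
    # Read frames until the video ends (cap.read() then returns ret == False)
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        trailer.append(frame)
    cap.release()
    # Average over the frame axis, then round back to 8-bit pixel values
    avgtrailer = np.around(np.mean(trailer, axis=0)).astype(np.uint8)
    avgtrailers.append(avgtrailer)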
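Holding every frame of a trailer in a list can use a lot of memory. As a minimal sketch of an alternative, the average can be accumulated one frame at a time. The helper name `running_mean_frame` and the synthetic frames are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def running_mean_frame(frames):
    """Average an iterable of equally sized frames without storing them all."""
    total = None
    count = 0
    for frame in frames:
        if total is None:
            # Accumulate in float64 so the uint8 pixel values cannot overflow
            total = np.zeros(frame.shape, dtype=np.float64)
        total += frame
        count += 1
    if count == 0:
        raise ValueError("no frames to average")
    return np.around(total / count).astype(np.uint8)

# Synthetic stand-in for decoded video frames (hypothetical data)
rng = np.random.default_rng(0)
fake_frames = [rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8) for _ in range(10)]
avg = running_mean_frame(iter(fake_frames))
print(avg.shape, avg.dtype)
```

In the loop above, the same idea would mean adding each `frame` to a running total instead of appending it to `trailer`.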

Let's learn a little about X. From the output below, we can see that we have 300 examples and that each averaged frame is a 144x256 image with three color channels (OpenCV stores them in BGR order rather than RGB).

In [82]:
print("Number of Examples = " + str(len(avgtrailers)))
print("Size of Picture  = " + str(avgtrailers[0].shape))

Number of Examples = 300
Size of Picture  = (144, 256, 3)
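To work with all examples at once, the list of averaged frames can be stacked into a single array. The sketch below uses a small synthetic list in place of `avgtrailers` (an assumption, so that it runs on its own) and also shows how a BGR image, the channel order OpenCV uses, can be flipped to RGB for libraries that expect it.

```python
import numpy as np

# Hypothetical stand-in for avgtrailers: 5 averaged frames instead of 300
avg_frames = [np.full((144, 256, 3), i, dtype=np.uint8) for i in range(5)]

# Stack into one array of shape (num_examples, height, width, channels)
X = np.stack(avg_frames)
print(X.shape)

# A pure-blue image in BGR order: the blue channel comes first
bgr = np.zeros((144, 256, 3), dtype=np.uint8)
bgr[..., 0] = 255

# Reversing the last axis converts BGR to RGB
rgb = bgr[..., ::-1]
print(rgb[0, 0])
```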


Next we load the movie ID, movie name, trailer URL, and genre (our label).
Each genre is one of three categories: Action, Horror, or Comedy.

In [19]:
label_dataframe = pd.read_csv("MovieTrailerData.csv")
label_dataframe.head()

| Unnamed: 0 | MovieId | MovieName | TrailerUrl | Genre |
|---|---|---|---|---|
| 0 | 1 | Extraction | https://www.youtube.com/watch?v=L6P3nI6VnlY | Action |
| 1 | 2 | The Gentleman | https://www.youtube.com/watch?v=Ify9S7hj480 | Action |
| 2 | 3 | Code 8 | https://www.youtube.com/watch?v=PrX1JJ5dduA | Action |
| 3 | 4 | Avengers: Endgame | https://www.youtube.com/watch?v=TcMBFSGVi1c | Action |
| 4 | 5 | Star Wars: The Rise of Skywalker | https://www.youtube.com/watch?v=8Qn_spdM5Zg | Action |


To make the genre a little easier to work with, let's turn it into a matrix of one-hot vectors, where:
- Action = [1,0,0]
- Comedy = [0,1,0]
- Horror = [0,0,1]

In [78]:
labels = label_dataframe["Genre"].to_numpy()
# LabelEncoder maps each genre to an integer, with classes sorted alphabetically
label_encoder = LabelEncoder()
vec = label_encoder.fit_transform(labels)
# Convert the integer classes to one-hot rows
labels = tf.keras.utils.to_categorical(vec)
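To see why the mapping comes out as Action, Comedy, Horror, here is a minimal NumPy-only sketch of the same two steps (integer encoding, then one-hot). It is an illustration with a made-up four-element `genres` array, not the notebook's actual encoder; `np.unique` sorts the class names alphabetically, as `LabelEncoder` does.

```python
import numpy as np

# Hypothetical sample of genre labels
genres = np.array(["Action", "Horror", "Comedy", "Action"])

# Sorted unique classes, plus the integer index of each label
classes, indices = np.unique(genres, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes), dtype=np.float32)[indices]

print(classes)
print(one_hot)
```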

## Let's see a few datapoints
Let's look at the average frame of "Avengers: Endgame" (Action), "I See You" (Horror), and "Jumanji: The Next Level" (Comedy).

In [79]:
# Show each averaged frame (press any key to continue) and save it to disk
examples = [(3, 'Avengers'), (158, 'ISeeYou'), (202, 'Jumanji')]
for index, name in examples:
    cv2.imshow('frame', avgtrailers[index])
    cv2.waitKey(0)
    cv2.imwrite('NoteBookData/' + name + '.jpg', avgtrailers[index])
cv2.destroyAllWindows()

## Avengers: Endgame
![Avengers: Endgame](NoteBookData/Avengers.jpg)
## I See You
![I See You](NoteBookData/ISeeYou.jpg)
## Jumanji: The Next Level
![Jumanji: The Next Level](NoteBookData/Jumanji.jpg)

So what do we notice? If you squint, you can see the rating card in the "I See You" average, since the rating stayed on screen for a large chunk of the trailer. In the Jumanji average, we can also see the ending title card. What can we do to remedy this? Do we even need to remedy this? Let's first try building a simple logistic regression model to predict the genre.
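For reference, one possible remedy, should it turn out to be needed, is to drop a fraction of frames from the start and end of each trailer before averaging. This is only a sketch: the helper `trim_frames`, the `skip_fraction` parameter, and its default of 0.1 are assumptions, not something the notebook defines.

```python
import numpy as np

def trim_frames(frames, skip_fraction=0.1):
    """Drop the first and last skip_fraction of frames (hypothetical helper)."""
    if not 0 <= skip_fraction < 0.5:
        raise ValueError("skip_fraction must be in [0, 0.5)")
    n = len(frames)
    skip = int(n * skip_fraction)
    return frames[skip:n - skip]

# Synthetic trailer: frame i is filled with the value i
frames = [np.full((2, 2, 3), i, dtype=np.uint8) for i in range(20)]

trimmed = trim_frames(frames, skip_fraction=0.1)
trimmed_avg = np.around(np.mean(trimmed, axis=0)).astype(np.uint8)
print(len(trimmed), trimmed_avg.shape)
```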

# Simple Logistic Model (Without Removing Beginning and Ending Title/Rating/Producer Cards)
## What will our model look like?
Since our pictures are 144x256x3, why don't we just build a logistic regression model that takes all 144x256x3 = 110,592 pixel values as inputs?
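As a rough sketch of what such a model computes, each image can be flattened into a vector of 110,592 values and passed through one linear layer followed by a softmax over the three genres. This assumes the multi-class (softmax) form of logistic regression, which the notebook has not confirmed at this point; the random images, zero-initialized weights, and scaling by 255 are placeholders for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

num_features = 144 * 256 * 3  # 110,592 inputs per image
num_classes = 3               # Action, Comedy, Horror

# Hypothetical batch of 4 images standing in for the averaged trailers
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(4, 144, 256, 3), dtype=np.uint8)

# Flatten each image to one row and scale pixel values to [0, 1]
X_flat = X.reshape(len(X), -1).astype(np.float32) / 255.0

# Untrained parameters: one weight per (pixel value, class) pair plus a bias per class
W = np.zeros((num_features, num_classes), dtype=np.float32)
b = np.zeros(num_classes, dtype=np.float32)

probs = softmax(X_flat @ W + b)
print(X_flat.shape, probs.shape)
```

With three output classes, the model has 110,592 x 3 weights plus 3 biases, so the number of parameters is large compared with roughly 300 training examples.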