## Preprocessing and Dataset Statistics

In [None]:
import pandas as pd
import numpy as np
import re
import sklearn
from sklearn import cluster
import matplotlib.pyplot as plt
import seaborn as sb
import matplotlib as g
from sklearn.preprocessing import OneHotEncoder
import random
import scipy
import scipy.signal

**Importing the Dataset**

In [None]:
movie = pd.read_pickle('cleaned_movie_set.pkl')
movie.head()

**Acquiring Genre Correlations**

We want to find genre correlations to see which genre categories are the most highly correlated.

In [None]:
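df = movie
# Absolute pairwise correlations between the numeric (genre) columns
c = df.corr().abs()
# Flatten the matrix into a Series indexed by (genre, genre) pairs
s = c.unstack()
so = s.sort_values(ascending=False, kind="quicksort")
# Skip the leading self-correlations (all 1.0), then take every other
# entry because each pair appears twice, as (A, B) and (B, A)
genre_corr = so[28::2]
genre_corr[0:28]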
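
The positional slice `so[28::2]` depends on the number of genre columns and on both copies of each pair sorting next to each other. As a hedged alternative, the sketch below keeps only the upper triangle of the correlation matrix, so each pair appears once and self-correlations are excluded. The helper name `top_correlated_pairs` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n most correlated column pairs, each pair listed once."""
    corr = frame.corr().abs()
    # Boolean mask that is True strictly above the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # where() blanks the diagonal and lower triangle; stack() flattens the
    # matrix to a Series and dropna() removes the blanked entries.
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy one-hot genre table (made-up values, for illustration only).
toy = pd.DataFrame({
    'Action':   [1, 1, 0, 0, 1, 0],
    'Thriller': [1, 1, 0, 0, 0, 0],
    'Comedy':   [0, 0, 1, 1, 0, 1],
})
pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

Because the mask is built from the matrix shape, the same function works unchanged after genres are dropped later in the notebook.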

**Acquiring Cumulative Sums and Label Counts**

We want to find what share of the dataset each genre accounts for.

In [None]:
counts = [] #getting the counts per genre
df = movie
genres = df.columns[2::]
for i in genres:
    counts.append(df[i].value_counts().to_dict())
#creating counts table
counts_df = pd.DataFrame.from_dict(counts)
counts_df[2] = genres
counts_table = counts_df.drop([0], axis=1)
counts_table = counts_table.rename(columns = {1:'counts', 2:'genre'})

# Use sort instead of sort_values if error
# counts_sort = counts_table.sort('counts', ascending=False)
counts_sort = counts_table.sort_values('counts', ascending=False)

#creating column of cumulative sums
sumcol = counts_sort['counts'].sum()
cumsum = counts_sort['counts']/sumcol
counts_sort['cumsum'] = cumsum
counts_sort

We see here that the `cumsum` value for Drama (its share of all genre labels) is much higher than that of the other genres, and that the other genres have few instances.
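
The counting cell above is repeated several times in this notebook, and its `cumsum` column holds each genre's share of all labels rather than a running total. The sketch below is one possible way to wrap the logic in a function that reports both quantities. It assumes the genre columns are 0/1 indicators, so a column sum equals the number of positive labels; the names `genre_shares`, `share`, and `running_share` are hypothetical.

```python
import pandas as pd

def genre_shares(frame, genre_cols):
    """Count positive labels per genre and express them as shares of all labels."""
    # For 0/1 indicator columns, the column sum is the number of positives.
    counts = frame[genre_cols].sum().sort_values(ascending=False)
    table = counts.rename('counts').rename_axis('genre').reset_index()
    # Share of all genre labels (what the notebook stores in 'cumsum').
    table['share'] = table['counts'] / table['counts'].sum()
    # True running total of the shares, largest genre first.
    table['running_share'] = table['share'].cumsum()
    return table

# Toy data (made-up values, for illustration only).
toy = pd.DataFrame({
    'Title':  ['a', 'b', 'c', 'd'],
    'Drama':  [1, 1, 1, 0],
    'Comedy': [0, 1, 0, 1],
    'Horror': [0, 0, 0, 1],
})
shares = genre_shares(toy, ['Drama', 'Comedy', 'Horror'])
print(shares)
```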

**Displaying Correlation Matrix**

We want to plot a heatmap to visualize the correlation matrix.

In [None]:
df = movie
genres = df.columns[1:].tolist()

In [None]:
df_genres = df.drop(df.columns[0],axis = 1)

In [None]:
# sb.palplot(sb.color_palette("hls", 7))
R = np.corrcoef(df_genres,rowvar=False)
genre_heatmap = sb.heatmap(R,cmap="Blues",xticklabels=True, yticklabels=True)
genre_heatmap.set_xticklabels(genres, rotation=90)
genre_heatmap.set_yticklabels(genres)
plt.show()

As one can see, few categories are highly correlated, so we cannot group any categories together.

**Removing Irrelevant Label Genres**

We removed the genres that make up less than 4% of the dataset as there is not enough data to use them.

In [None]:
# Genres whose share of all labels is below 4%
garb = counts_sort.loc[counts_sort['cumsum'] < .04, 'genre'].tolist()
garb

In [None]:
clean_df = df #clean dataset without garbage labels
clean_df = clean_df.drop(garb, axis=1) #removing the unwanted genres from our dataset
#clean_df = clean_df.drop(clean_df.columns[0], axis=1)
clean_df.head()

**New Correlations after Removing Excess Labels**

We want to check the correlation between the genres again.

In [None]:
#New Correlations after removing excess labels
c = clean_df.corr().abs()
s = c.unstack()
so = s.sort_values(ascending=False, kind="quicksort")
genre_corr = so[9::2]
genre_corr[1:12]

**Dropping Non-Contextual Genres**

Since Drama is correlated with many other genres, we decided to treat it as a 'secondary' genre rather than a primary one, so it is not meaningful to include it in our dataset. Documentary and Adventure also add little: Documentary is correlated with Drama, and Adventure is essentially the same as Action. The code below also drops Mystery. After removing these, we check the cumulative sums again.

In [None]:
sixgenres = clean_df.drop(['Adventure', 'Documentary', 'Mystery','Drama'], axis=1)
sixgenres['sample']=0
sixgenres.head()

**Checking Cumulative Sums after Removing Non-Contextual Genres**

In [None]:
genres = sixgenres.columns[1:].tolist()
counts = []
for i in genres:
    counts.append(sixgenres[i].value_counts().to_dict())
counts_df = pd.DataFrame.from_dict(counts)
counts_df[2] = genres
counts_table = counts_df.drop([0], axis=1)
counts_table = counts_table.rename(columns = {1:'counts', 2:'genre'})
# Use sort instead of sort_values if error
# counts_sort = counts_table.sort('counts', ascending=False)
counts_sort = counts_table.sort_values('counts', ascending=False)

sumcol = counts_sort['counts'].sum()
cumsum = counts_sort['counts']/sumcol
counts_sort['cumsum'] = cumsum
counts_sort

As we can see, Comedy makes up 32% of our dataset. Because Comedy has different features and properties than Drama, we decided to keep it and downsample it instead.

**Downsampling Comedy to Balance Label Classes**

In [None]:
col_list = ['Action','Comedy','Crime','Horror','Romance','Thriller']
sixgenres['Empty'] = sixgenres[col_list].sum(axis = 1)
df = sixgenres[sixgenres.Empty != 0]
df.head()

In [None]:
comedy = df[df['Comedy'] == 1]
np.random.seed(1)  # call the function; assigning to it would not seed the generator
b = np.random.choice(comedy.Title,size = 4000, replace = False)
df2 = comedy[comedy['Title'].isin(b)]

In [None]:
df3 = df[(df['Title'].isin(df2.Title) & (df['Comedy'] == 1)) | (df['Comedy'] == 0)]
df4 = df3.drop(df3.columns[-2:],axis= 1)
df4.head(10)
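
The cells above sample `Title` values and then filter with `isin`, which keeps every row whose title was drawn; if titles repeat, more than 4000 comedy rows may survive. A minimal sketch of an alternative, assuming 0/1 label columns, samples rows directly with `DataFrame.sample` and a fixed `random_state`. The helper name `downsample_class` and the toy data are hypothetical.

```python
import pandas as pd

def downsample_class(frame, label, n, seed=1):
    """Keep at most n rows where `label` == 1 and every row where it is 0."""
    positive = frame[frame[label] == 1]
    negative = frame[frame[label] == 0]
    # Sample rows (not titles) so duplicate titles cannot inflate the result.
    kept = positive.sample(n=min(n, len(positive)), random_state=seed)
    # Restore the original row order after concatenation.
    return pd.concat([kept, negative]).sort_index()

# Toy data (made-up values, for illustration only).
toy = pd.DataFrame({
    'Title':  ['t%d' % i for i in range(10)],
    'Comedy': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    'Horror': [0, 0, 1, 0, 0, 0, 1, 1, 0, 1],
})
balanced = downsample_class(toy, 'Comedy', n=3)
print(balanced['Comedy'].value_counts())
```

Passing `random_state` makes the draw reproducible without touching NumPy's global random state.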

**Double-Checking Class Balance**

In [None]:
genres = df4.columns[1:].tolist()
counts = []
for i in genres:
    counts.append(df4[i].value_counts().to_dict())
counts_df = pd.DataFrame.from_dict(counts)

counts_df[2] = genres
counts_table = counts_df.drop([0], axis=1)
counts_table = counts_table.rename(columns = {1:'counts', 2:'genre'})

# Use sort instead of sort_values if error:
# counts_sort = counts_table.sort('counts', ascending=False)
counts_sort = counts_table.sort_values('counts', ascending=False)

#counts_sort
sumcol = counts_sort['counts'].sum()
cumsum = counts_sort['counts']/sumcol
counts_sort['cumsum'] = cumsum
counts_sort

Downsampling gave us this new dataset, which is much more balanced and efficient for our neural nets and models (thinning our data this way greatly increased the speed at which the training set was processed). We finished processing the data to have 6 classes, which will help model accuracy and speed. The resulting dataset has a little over 17 thousand values.

**Pickling the New DataFrame**

In [None]:
df4.to_pickle("Top6.pkl")