ECE324 
Classifying Movie Genres


# Data gathering

The code below only shows how we obtained the data from Kaggle. Because the data CSVs are already in our repository, it does not need to be run.

The Kaggle setup follows the code from this guide: https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

In [1]:
# set up kaggle to get data
# only needs to be run once, using your personal kaggle api key (kaggle.json must be uploaded to files first)
! pip install -q kaggle
! mkdir -p ~/.kaggle # -p avoids an error if the directory already exists
! cp /kaggle.json ~/.kaggle/ # contains personal kaggle api key
! chmod 600 ~/.kaggle/kaggle.json
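
The `chmod 600` step above restricts the key file to its owner. As a minimal sketch, assuming the key lives at `~/.kaggle/kaggle.json` as in the cell above, the check below confirms from Python that the file exists and is not readable by other users. The helper name `kaggle_credentials_ok` is hypothetical and is not part of the Kaggle package.

```python
import stat
from pathlib import Path

def kaggle_credentials_ok(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Hypothetical helper: True if the key file exists and only its owner can access it."""
    path = Path(path)
    if not path.is_file():
        return False
    mode = stat.S_IMODE(path.stat().st_mode)  # permission bits only, e.g. 0o600
    return mode & 0o077 == 0  # no permissions granted to group or others

# Prints True only after kaggle.json has been copied and chmod 600 applied
print(kaggle_credentials_ok())
```

If this prints `False`, rerunning the copy and `chmod` lines of the setup cell is the first thing to try.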

In [2]:
# Movies dataset
! kaggle datasets download -d rounakbanik/the-movies-dataset -f movies_metadata.csv
# Netflix dataset
! kaggle datasets download -d satpreetmakhija/netflix-movies-and-tv-shows-2021
# IMDB Movies dataset
! kaggle datasets download -d harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Downloading movies_metadata.csv.zip to /content
 82% 10.0M/12.2M [00:00<00:00, 26.3MB/s]
100% 12.2M/12.2M [00:00<00:00, 33.5MB/s]
Downloading netflix-movies-and-tv-shows-2021.zip to /content
  0% 0.00/1.07M [00:00<?, ?B/s]
100% 1.07M/1.07M [00:00<00:00, 34.2MB/s]
Downloading imdb-dataset-of-top-1000-movies-and-tv-shows.zip to /content
  0% 0.00/175k [00:00<?, ?B/s]
100% 175k/175k [00:00<00:00, 41.8MB/s]


The code below loads the data used for training and testing our models.

In [3]:
# unzip files 
! unzip movies_metadata.csv.zip
! unzip netflix-movies-and-tv-shows-2021.zip
! unzip imdb-dataset-of-top-1000-movies-and-tv-shows.zip

Archive:  movies_metadata.csv.zip
  inflating: movies_metadata.csv     
Archive:  netflix-movies-and-tv-shows-2021.zip
  inflating: netflixData.csv         
Archive:  imdb-dataset-of-top-1000-movies-and-tv-shows.zip
  inflating: imdb_top_1000.csv       


Extract the relevant data from the CSVs


In [11]:
import pandas as pd 
import ast
import json

def extract_genre(x):
  """
  Helper function for extracting genre names from the Kaggle Movies dataset.
  Takes a list of dictionaries such as [{'id': 12, 'name': 'Adventure'}, ...]
  and returns a list of names such as ['Adventure', ...].
  """
  return [genre['name'] for genre in x]

# Movies dataset (movies_metadata.csv)
# we want features: 'genres' (label), 'overview' (input)
df_1 = pd.read_csv('movies_metadata.csv', low_memory=False) # low_memory is set to False to avoid dtype warning
# print(df_1.columns)
input_1 = df_1['overview']
output_1 = df_1['genres']

# data in kaggle dataset is a string of the form: "[{'id': 12, 'name': 'Adventure'}, ...]"
# we want to extract the names of the genres into a list of the form: ['Adventure', ...]
# ast.literal_eval parses the string directly; swapping quotes for json.loads would break on any value containing an apostrophe
output_1 = output_1.apply(ast.literal_eval) # make the string into a list of dictionaries
output_1 = output_1.apply(extract_genre) # get the genre names from the dictionaries

# print(len(input_1.index)) # size of dataset
# print(output_1[1]) # example genre output

# Netflix dataset (netflixData.csv)
# features: 'Genres' (label), 'Description' (input)
df_2 = pd.read_csv('netflixData.csv')
# print(df_2.columns)
input_2 = df_2['Description']
output_2 = df_2['Genres']

output_2 = output_2.str.split(', ') # make genres a list
# print(len(input_2.index)) # size of dataset
# print(output_2[2]) # example genre output

# IMDB dataset (imdb_top_1000.csv)
# features: 'Genre' (label), 'Overview' (input)
df_3 = pd.read_csv('imdb_top_1000.csv')
# print(df_3.columns)
input_3 = df_3['Overview']
output_3 = df_3['Genre']

output_3 = output_3.str.split(', ') # make genres a list
# print(len(input_3.index)) # size of dataset
# print(output_3[1]) # example genre output

# Human classification dataset (human_classification_training.csv)
# features: 'Genre' (label), 'Synopsis' (input)
df_4 = pd.read_csv('human_classification_training.csv')
# print(df_4.columns)
input_4 = df_4['Synopsis']
output_4 = df_4['Genre']

output_4 = output_4.str.split(', ') # make genres a list
# print(len(input_4.index)) # size of dataset
# print(output_4[1]) # example genre output

45466
['Adventure', 'Fantasy', 'Family']
5967
['Documentaries', 'International Movies']
1000
['Crime', 'Drama']
15
['Romance', 'Comedy']
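
The genre labels arrive in two formats: a stringified list of dictionaries (Movies dataset) and a comma-separated string (Netflix, IMDB, and human classification datasets). The sketch below shows one way to turn a single value of each format into a plain list of names. The function names and sample strings are made up for illustration, and `ast.literal_eval` is just one possible parser for the first format.

```python
import ast

def parse_genre_names(raw):
    """Turn a string like "[{'id': 12, 'name': 'Adventure'}]" into ['Adventure']."""
    records = ast.literal_eval(raw)  # evaluates Python literals only, never arbitrary code
    return [record["name"] for record in records]

def split_genres(raw):
    """Turn a string like 'Crime, Drama' into ['Crime', 'Drama']."""
    return [genre.strip() for genre in raw.split(",")]

# Made-up sample values in the two formats described above
sample_movies = "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}]"
sample_imdb = "Crime, Drama"

print(parse_genre_names(sample_movies))
print(split_genres(sample_imdb))
```

Both helpers return a list of strings, so labels from all four datasets end up in the same shape before they are combined.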


Split into training, validation, and test data

In [15]:
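# combine all inputs together
# ignore_index=True gives the combined data one unique index instead of repeating each dataset's own row labels
inputs = pd.concat([input_1, input_2, input_3, input_4], ignore_index=True)
outputs = pd.concat([output_1, output_2, output_3, output_4], ignore_index=True)
# print(outputs[5]) # example of what the output genre data looks like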
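
One detail worth keeping in mind when combining Series with `pd.concat`: by default each Series keeps its own row labels, so a label such as `5` can refer to one row per source dataset. The sketch below uses two toy Series with hypothetical values to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Toy stand-ins for two of the per-dataset Series (hypothetical values)
a = pd.Series(["plot A0", "plot A1"])
b = pd.Series(["plot B0", "plot B1"])

# Default: each Series keeps its own 0..n-1 labels, so labels repeat
kept = pd.concat([a, b])
print(kept.index.tolist())   # [0, 1, 0, 1]
print(len(kept.loc[0]))      # 2, because label 0 matches one row from each Series

# ignore_index=True relabels the combined rows 0..N-1
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3]
print(fresh.loc[2])          # plot B0
```

`train_test_split` shuffles rows by position, so the split itself works either way. The difference only matters when rows are later looked up by label.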

In [14]:
# get train validate test split of data
from sklearn.model_selection import train_test_split

# 80% train : 10% validation : 10% test
X_train, X_test, y_train, y_test = train_test_split(inputs, outputs,
    test_size=0.2, shuffle = True, random_state = 8)

# get validation split from test dataset
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, 
    test_size=0.5, random_state= 8) 
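
The two-step split can be checked on stand-in data. The sketch below repeats the same two calls on a list of integers with as many rows as the four datasets printed above (45466 + 5967 + 1000 + 15) and prints the size and share of each part. The stand-in data and the variable names `X_rest` and `y_rest` are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the four datasets combined
n_samples = 45466 + 5967 + 1000 + 15
X = list(range(n_samples))
y = list(range(n_samples))

# Step 1: hold out 20% of the rows
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=8)

# Step 2: split the held-out 20% in half, giving validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=8)

for name, part in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(name, len(part), round(len(part) / n_samples, 3))
```

Holding out 20% and then halving it is what produces the 80% : 10% : 10% proportions, and the three parts do not share any rows.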
