# Project Name
By Alec Plante, Deanna Hedges, Raul Cortez, Sunny Sanchez, Zachary Mitchell

Movie id meanings: https://www.themoviedb.org/talk/5daf6eb0ae36680011d7e6ee

### Import Libraries


In [2]:
import pandas as pd
import numpy as np
import sqlite3


### Unzip Data
This section unzips the files in the zippedData folder and writes them to a new data folder.

In [18]:
import gzip
import shutil
import zipfile

# extract the im.db zip file (extractall also creates data/ if it does not exist)
with zipfile.ZipFile('zippedData/im.db.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')

# files stored as .gz archives in zippedData
gz_files = ['bom.movie_gross.csv',
            'rt.movie_info.tsv',
            'rt.reviews.tsv',
            'tmdb.movies.csv',
            'tn.movie_budgets.csv']

# decompress each .gz file into the data folder
for name in gz_files:
    with gzip.open(f'zippedData/{name}.gz', 'rb') as f_in:
        with open(f'data/{name}', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

### Import Data and connect to Database

In [3]:
# import data as pandas DataFrames
movieGross = pd.read_csv('data/bom.movie_gross.csv')
tmdbMovies = pd.read_csv('data/tmdb.movies.csv')
movieBudgets = pd.read_csv('data/tn.movie_budgets.csv')
movieInfo = pd.read_csv('data/rt.movie_info.tsv', sep = '\t')
reviews = pd.read_csv('data/rt.reviews.tsv', sep = '\t', encoding= 'latin1')


In [4]:
# Connect to sql database
conn = sqlite3.connect('data/im.db')

### Data Exploration

#### tmdbMovies

In [34]:
# start by looking at the first 5 rows of data
tmdbMovies.head()

,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


At first glance, we can see an extra column, Unnamed: 0, that duplicates the index. It should be removed.

In [35]:
# View the Column names
tmdbMovies.columns

Index(['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title',
       'popularity', 'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

In [36]:
# Drop 'Unnamed: 0' as it contains the same information as the index
tmdbMovies.drop('Unnamed: 0', axis = 1, inplace = True)

In [37]:
# View the Column names again to confirm that changes were made
tmdbMovies.columns

Index(['genre_ids', 'id', 'original_language', 'original_title', 'popularity',
       'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')
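
As an alternative to dropping the column after loading, the duplicated index could be consumed at read time. The sketch below is illustrative only: it assumes the first column of the file is the unnamed index column seen above, and it uses a small in-memory sample (the hypothetical `sample_csv`) rather than the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of tmdb.movies.csv:
# the first, unnamed column duplicates the row index.
sample_csv = io.StringIO(
    ",id,title,vote_average\n"
    "0,12444,Harry Potter and the Deathly Hallows: Part 1,7.7\n"
    "1,10191,How to Train Your Dragon,7.7\n"
)

# index_col=0 tells pandas to use the first column as the index,
# so no 'Unnamed: 0' column is created.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

With this approach the later `drop` call would not be needed, at the cost of assuming the file layout in advance.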

After removing the unneeded column, the data types should be reviewed to ensure that we are able to work with the table.

In [38]:
# View the information about each column
tmdbMovies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 1.8+ MB


A few columns should be investigated:
- genre_ids should be a list
- release_date should be datetime

In [43]:
# Check the type of each Column
print(f"genre_ids type: {type(tmdbMovies['genre_ids'].iloc[1])}\nrelease_date type: {type(tmdbMovies['release_date'].iloc[1])}")

genre_ids type: <class 'str'>
release_date type: <class 'str'>


Both are strings, which cannot be analyzed directly. We must convert genre_ids to lists and release_date to datetimes.

Let's start with the genre_ids:

In [93]:
# View the values in genre_ids and check for null values:
print(tmdbMovies.genre_ids.value_counts())
print(f"There are {tmdbMovies['genre_ids'].isna().sum()} null values")
# There are no NA values, and every value looks like a list literal, so we can proceed by parsing each string into a list

[99]                               3700
[]                                 2479
[18]                               2268
[35]                               1660
[27]                               1145
                                   ... 
[35, 18, 10751, 14, 10749, 878]       1
[53, 80, 18, 10770]                   1
[14, 35, 878, 10751]                  1
[878, 28, 16, 12]                     1
[80, 9648, 53, 18]                    1
Name: genre_ids, Length: 2477, dtype: int64
There are 0 null values
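
The counts above include 2,479 rows whose value is `[]`. These are not nulls, so `isna()` does not flag them; once parsed, they become empty lists. A minimal sketch of that parsing with `ast.literal_eval`, using hypothetical sample strings (`raw_values`) instead of the real column:

```python
import ast

# Hypothetical sample of raw genre_ids strings, including the empty case
raw_values = ["[12, 14, 10751]", "[99]", "[]"]

# literal_eval parses each string as a Python literal without executing code
parsed = [ast.literal_eval(v) for v in raw_values]

# Empty lists mark movies with no genre information
empty_count = sum(1 for p in parsed if len(p) == 0)

print(parsed)
print(f"{empty_count} of {len(parsed)} sample rows have no genre")
```

Whether the empty rows should be kept or filtered out is a decision for the later analysis; the sketch only shows how they can be counted.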


In [85]:
# Convert genre_ids from strings into lists
# ast.literal_eval safely parses a string containing a Python literal
import ast
tmdbMovies.genre_ids = tmdbMovies.genre_ids.map(ast.literal_eval)
# confirm the type of one converted element
print(type(tmdbMovies.genre_ids.iloc[1]))

<class 'list'>


In [91]:
# make sure that rows are of type list
for i in tmdbMovies['genre_ids']:
    assert isinstance(i, list), "ERROR: element is not a list"
print("all rows in genre_ids column are of type list :^)")

all rows in genre_ids column are of type list :^)


The genre_ids in tmdbMovies are numeric codes, which are not informative on their own. We should create a new column that holds the genre names. The dictionary mapping each id to its name is listed below.

In [102]:
genre_ids_dict={28:'Action',
                12:'Adventure',
                16:'Animation',
                35:'Comedy',
                80:'Crime',
                99:'Documentary',
                18:'Drama',
                10751:'Family',
                14:'Fantasy',
                36:'History',
                27:'Horror',
                10402:'Music',
                9648:'Mystery',
                10749:'Romance',
                878:'Science Fiction',
                10770:'TV Movie',
                53:'Thriller',
                10752:'War',
                37:'Western'}

In [120]:
# Create a new column 'genres' that lists each movie's genres as strings
# (ids missing from genre_ids_dict are left unchanged)
tmdbMovies['genres'] = tmdbMovies['genre_ids'].map(lambda ids: [genre_ids_dict.get(i, i) for i in ids])
tmdbMovies.head()

,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count,genres
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,"[Adventure, Fantasy, Family]"
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,"[Fantasy, Adventure, Animation, Family]"
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368,"[Adventure, Action, Science Fiction]"
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,"[Animation, Comedy, Family]"
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186,"[Action, Science Fiction, Adventure]"


In [122]:
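# print the type explicitly; a notebook cell only displays its last expression
print(type(tmdbMovies['genres'].iloc[1]))
tmdbMovies['genres'].iloc[1]

<class 'list'>
['Fantasy', 'Adventure', 'Animation', 'Family']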

Now, the release_date column needs to be converted to a datetime.

In [98]:
# Investigate the types of values in the release date column
print(tmdbMovies['release_date'].value_counts())
# make sure there are no NA values
print(f"There are {tmdbMovies['release_date'].isna().sum()} null values")

2010-01-01    269
2011-01-01    200
2012-01-01    155
2014-01-01    155
2013-01-01    145
             ... 
2019-03-07      1
2011-07-30      1
1995-04-07      1
2016-07-20      1
2016-07-13      1
Name: release_date, Length: 3433, dtype: int64
There are 0 null values
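
The values shown above all appear to follow the YYYY-MM-DD format. Before converting, a strict parse can confirm that no row deviates from it. The sketch below assumes that format and uses a hypothetical sample Series (`dates`) with one deliberately malformed entry:

```python
import pandas as pd

# Hypothetical sample; the last entry is deliberately malformed
dates = pd.Series(["2010-11-19", "1995-11-22", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Rows that failed to parse can then be inspected by hand
bad_rows = dates[parsed.isna()]
print(bad_rows.tolist())
```

Applied to the real column, an empty `bad_rows` would suggest the plain `pd.to_datetime` call is safe.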


In [99]:
# convert the column to datetimes
tmdbMovies['release_date'] = pd.to_datetime(tmdbMovies['release_date'])
# confirm the type of one converted element
type(tmdbMovies['release_date'].iloc[1])

pandas._libs.tslibs.timestamps.Timestamp

In [100]:
# make sure that release_date is of type datetime
tmdbMovies.dtypes

genre_ids                    object
id                            int64
original_language            object
original_title               object
popularity                  float64
release_date         datetime64[ns]
title                        object
vote_average                float64
vote_count                    int64
dtype: object
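
The cleaning steps in this section (index column, list parsing, date parsing) could also be folded into the initial load. This is a sketch under the assumption that the file layout matches what was explored above; it reads a hypothetical in-memory sample (`sample_csv`) rather than data/tmdb.movies.csv.

```python
import ast
import io
import pandas as pd

# Hypothetical two-row sample with the same relevant columns as the real file
sample_csv = io.StringIO(
    ',genre_ids,id,release_date\n'
    '0,"[12, 14, 10751]",12444,2010-11-19\n'
    '1,[],862,1995-11-22\n'
)

sample = pd.read_csv(
    sample_csv,
    index_col=0,                                # use the unnamed first column as the index
    converters={"genre_ids": ast.literal_eval}, # parse list strings into lists
    parse_dates=["release_date"],               # parse dates while reading
)
print(sample.dtypes)
print(sample["genre_ids"].tolist())
```

Doing the conversions at load time keeps the notebook shorter, while the step-by-step version above makes each check easier to inspect.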