# A. Project Name:  IMDb Successful Movie.
- **Student Name:** Eduardo Galindez.
- **Coding Dojo Bootcamp:** Data Science.
  - **Stack:** Data Enrichment.
- **Date:** September 13th, 2022.

# B. Project Objective
For this project, we have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, we will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

# C. Project Statement


### Specifications:

Our stakeholder only wants we to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (not from documentary genre).
- Include only movies that were released 2000 - 2021 (include 2000 and 2021).
- Include only movies that were released in the United States.

### Deliverable:

After filtering out movies that do not meet the stakeholder's specifications:

- Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature.
- Save each file to a compressed csv file "Data/" folder inside your repository.
- Commit your changes to your repository in GitHub desktop and Publish repository / Push Changes.
- Submit the link to your repository.

# D. Project Development

## 1.- Libraries

In [1]:
# Imports.
import numpy as np
import pandas as pd
import glob
import os
os.makedirs('Data',exist_ok=True)

## 2.-  Loading Data

### 2.1.- Mount and loading: Tittle Akas Dataset

In [2]:
# Load data.
title_akas_df = pd.read_csv('Data/title.akas.tsv.gz',
                             sep='\t', low_memory=False,
                            chunksize=100_000)

In [3]:
# The first row # of the next chunk is stored under ._currow.
title_akas_df._currow

0

In [4]:
# Get the first df chunk from the reader.
title_akas_temp_df = title_akas_df.get_chunk()
title_akas_temp_df

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
...,...,...,...,...,...,...,...,...
99995,tt0022441,2,ハンガリアン・ダンス5番,JP,ja,imdbDisplay,\N,0
99996,tt0022441,3,Studie Nr. 7,DE,\N,imdbDisplay,\N,0
99997,tt0022441,4,Studie Nr. 7,ES,\N,imdbDisplay,\N,0
99998,tt0022441,5,Étude n°7,FR,\N,imdbDisplay,\N,0


In [5]:
# Programatically saving an fname using the chunk #.
chunk_num2=1
fname2= f'Data/title_akas_chunk_{chunk_num2:03d}.csv.gz'
fname2

'Data/title_akas_chunk_001.csv.gz'

In [6]:
# Save tittle_akas_temp_df to disk using the fname2.
title_akas_temp_df.to_csv(fname2, compression='gzip')

# Incrementing chunk_num by 1 for the next file.
chunk_num2+=1

In [7]:
# Read the csv without the index.
pd.read_csv(fname2, index_col=0, low_memory=False)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
...,...,...,...,...,...,...,...,...
99995,tt0022441,2,ハンガリアン・ダンス5番,JP,ja,imdbDisplay,\N,0
99996,tt0022441,3,Studie Nr. 7,DE,\N,imdbDisplay,\N,0
99997,tt0022441,4,Studie Nr. 7,ES,\N,imdbDisplay,\N,0
99998,tt0022441,5,Étude n°7,FR,\N,imdbDisplay,\N,0


In [8]:
# Constructing the Loop.
chunk_num2 = 1
title_akas_df = pd.read_csv('Data/title.akas.tsv.gz',
                             sep='\t', low_memory=False,
                            chunksize=100_000)

In [9]:
for title_akas_temp_df in title_akas_df:
    
    #### COMBINED WORKFLOW FROM ABOVE
    ## Keep only US movies.
    title_akas_temp_df = title_akas_temp_df[(title_akas_temp_df['region']=='US')]
    
    ## Replace "\N" with np.nan
    title_akas_temp_df.replace({'\\N':np.nan},inplace=True)
       
    ### Saving chunk to disk
    fname2= f'Data/title_akas_chunk_{chunk_num2:03d}.csv.gz'
    title_akas_temp_df.to_csv(fname2, compression='gzip')
    print(f"- Saved {fname2}")
    
    chunk_num2+=1

title_akas_df.close()

- Saved Data/title_akas_chunk_001.csv.gz
- Saved Data/title_akas_chunk_002.csv.gz
- Saved Data/title_akas_chunk_003.csv.gz
- Saved Data/title_akas_chunk_004.csv.gz
- Saved Data/title_akas_chunk_005.csv.gz
- Saved Data/title_akas_chunk_006.csv.gz
- Saved Data/title_akas_chunk_007.csv.gz
- Saved Data/title_akas_chunk_008.csv.gz
- Saved Data/title_akas_chunk_009.csv.gz
- Saved Data/title_akas_chunk_010.csv.gz
- Saved Data/title_akas_chunk_011.csv.gz
- Saved Data/title_akas_chunk_012.csv.gz
- Saved Data/title_akas_chunk_013.csv.gz
- Saved Data/title_akas_chunk_014.csv.gz
- Saved Data/title_akas_chunk_015.csv.gz
- Saved Data/title_akas_chunk_016.csv.gz
- Saved Data/title_akas_chunk_017.csv.gz
- Saved Data/title_akas_chunk_018.csv.gz
- Saved Data/title_akas_chunk_019.csv.gz
- Saved Data/title_akas_chunk_020.csv.gz
- Saved Data/title_akas_chunk_021.csv.gz
- Saved Data/title_akas_chunk_022.csv.gz
- Saved Data/title_akas_chunk_023.csv.gz
- Saved Data/title_akas_chunk_024.csv.gz
- Saved Data/tit

- Saved Data/title_akas_chunk_201.csv.gz
- Saved Data/title_akas_chunk_202.csv.gz
- Saved Data/title_akas_chunk_203.csv.gz
- Saved Data/title_akas_chunk_204.csv.gz
- Saved Data/title_akas_chunk_205.csv.gz
- Saved Data/title_akas_chunk_206.csv.gz
- Saved Data/title_akas_chunk_207.csv.gz
- Saved Data/title_akas_chunk_208.csv.gz
- Saved Data/title_akas_chunk_209.csv.gz
- Saved Data/title_akas_chunk_210.csv.gz
- Saved Data/title_akas_chunk_211.csv.gz
- Saved Data/title_akas_chunk_212.csv.gz
- Saved Data/title_akas_chunk_213.csv.gz
- Saved Data/title_akas_chunk_214.csv.gz
- Saved Data/title_akas_chunk_215.csv.gz
- Saved Data/title_akas_chunk_216.csv.gz
- Saved Data/title_akas_chunk_217.csv.gz
- Saved Data/title_akas_chunk_218.csv.gz
- Saved Data/title_akas_chunk_219.csv.gz
- Saved Data/title_akas_chunk_220.csv.gz
- Saved Data/title_akas_chunk_221.csv.gz
- Saved Data/title_akas_chunk_222.csv.gz
- Saved Data/title_akas_chunk_223.csv.gz
- Saved Data/title_akas_chunk_224.csv.gz
- Saved Data/tit

In [10]:
# Get list of files that match a pattern.
q2 = "Data/title_akas_chunk*.csv.gz"
chunked_files2 = sorted(glob.glob(q2))

# Showing the first 5 
chunked_files2[:5]

['Data\\title_akas_chunk_001.csv.gz',
 'Data\\title_akas_chunk_002.csv.gz',
 'Data\\title_akas_chunk_003.csv.gz',
 'Data\\title_akas_chunk_004.csv.gz',
 'Data\\title_akas_chunk_005.csv.gz']

In [11]:
# Loading all files as df and appending to a list.
df_list2 = []
for file2 in chunked_files2:
    title_akas_temp_df = pd.read_csv(file2, index_col=0, low_memory=False)
    df_list2.append(title_akas_temp_df)
    
# Concatenating the list of dfs into 1 combined.
title_akas_df_combined = pd.concat(df_list2)
title_akas_df_combined

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
...,...,...,...,...,...,...,...,...
33172903,tt9916702,1,Loving London: The Playground,US,,,,0.0
33172940,tt9916720,10,The Demonic Nun,US,,tv,,0.0
33172942,tt9916720,12,The Nun 2,US,,imdbDisplay,,0.0
33172959,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0.0


In [12]:
# Loading and Concatenating the list of dfs with 1 line.
title_akas_df_combined = pd.concat([pd.read_csv(file2, index_col=0, low_memory=False) for file2 in chunked_files2])
title_akas_df_combined

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
...,...,...,...,...,...,...,...,...
33172903,tt9916702,1,Loving London: The Playground,US,,,,0.0
33172940,tt9916720,10,The Demonic Nun,US,,tv,,0.0
33172942,tt9916720,12,The Nun 2,US,,imdbDisplay,,0.0
33172959,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0.0


In [13]:
# Saving the final combined dataframe.
final_fname2 ='Data/title_akas_combined.csv.gz'
title_akas_df_combined.to_csv(final_fname2, compression='gzip', index=False)

In [14]:
# Rename our final dataset.
title_akas_ready_df = pd.read_csv(final_fname2, low_memory=False)
title_akas_ready_df

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
...,...,...,...,...,...,...,...,...
1348742,tt9916702,1,Loving London: The Playground,US,,,,0.0
1348743,tt9916720,10,The Demonic Nun,US,,tv,,0.0
1348744,tt9916720,12,The Nun 2,US,,imdbDisplay,,0.0
1348745,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0.0


### 2.2.- Mount and loading: Tittle Basics Dataset
- In order to do this step without having memory issues, let's follow the procedure suggested [here](https://github.com/coding-dojo-data-science/data-enrichment-loading-large-files-with-low-ram).

In [15]:
# Load data.
title_basics_df_reader = pd.read_csv('Data/title.basics.tsv.gz',
                                      sep='\t', low_memory=False,
                                      chunksize=100_000)

In [16]:
# The first row # of the next chunk is stored under ._currow.
title_basics_df_reader._currow

0

In [17]:
# Get the first df chunk from the reader.
title_basics_temp_df = title_basics_df_reader.get_chunk()
title_basics_temp_df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
99995,tt0102313,movie,The Linguini Incident,The Linguini Incident,0,1991,\N,108,"Comedy,Crime"
99996,tt0102314,movie,Liniya smerti,Liniya smerti,0,1993,\N,99,"Crime,Drama"
99997,tt0102315,movie,Listen Up: The Lives of Quincy Jones,Listen Up: The Lives of Quincy Jones,0,1990,\N,115,"Documentary,Music"
99998,tt0102316,movie,Little Man Tate,Little Man Tate,0,1991,\N,99,Drama


In [18]:
# Checking the updated ._currow.
title_basics_df_reader._currow

100000

In [19]:
# Programatically saving an fname using the chunk #.
chunk_num=1
fname= f'Data/title_basics_chunk_{chunk_num:03d}.csv.gz'
fname

'Data/title_basics_chunk_001.csv.gz'

In [20]:
# Save temporary dataframe to disk using the fname.
title_basics_temp_df.to_csv(fname, compression='gzip')

# Incrementing chunk_num by 1 for the next file.
chunk_num+=1

In [21]:
# Reading fname without the index.
pd.read_csv(fname, index_col=0)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
99995,tt0102313,movie,The Linguini Incident,The Linguini Incident,0,1991,\N,108,"Comedy,Crime"
99996,tt0102314,movie,Liniya smerti,Liniya smerti,0,1993,\N,99,"Crime,Drama"
99997,tt0102315,movie,Listen Up: The Lives of Quincy Jones,Listen Up: The Lives of Quincy Jones,0,1990,\N,115,"Documentary,Music"
99998,tt0102316,movie,Little Man Tate,Little Man Tate,0,1991,\N,99,Drama


Let's code the way to load & mount the whole dataset.

In [22]:
# Constructing the Loop.
chunk_num = 1

for title_basics_temp_df in title_basics_df_reader:
    
    #### COMBINED WORKFLOW FROM ABOVE
    ## Replace "\N" with np.nan.
    title_basics_temp_df.replace({'\\N':np.nan},inplace=True)

    ## Eliminate movies that are null for runtimeMinute & genres.
    title_basics_temp_df = title_basics_temp_df.dropna(subset=['runtimeMinutes',
                                                                 'genres'])
        
    ## Keep only titleType==Movie.
    title_basics_temp_df = title_basics_temp_df[(title_basics_temp_df['titleType']=='movie')]
    
    ## Keep startYear 2000-2022
    ### Convert startyear to numeric for slicing.
    title_basics_temp_df['startYear'] = title_basics_temp_df['startYear'].astype(float)
    ### Let's code the filter.
    title_basics_temp_df = title_basics_temp_df[(title_basics_temp_df['startYear']>=2000)\
                                                  &(title_basics_temp_df['startYear']<2022)]
    
    ## Eliminate movies that include "Documentary" in genre.
    is_documentary = title_basics_temp_df['genres'].str.contains('Documentary',case=False)
    title_basics_temp_df = title_basics_temp_df[~is_documentary]
    
    ## Keep only US movies.
    ### Create the filter.
    keep_US_movies = title_basics_temp_df['tconst'].isin(title_akas_ready_df['titleId'])
    ### Apply the filter to the dataset.
    title_basics_temp_df = title_basics_temp_df[keep_US_movies]
    
    ## Saving chunk to disk.
    fname= f'Data/title_basics_chunk_{chunk_num:03d}.csv.gz'
    title_basics_temp_df.to_csv(fname, compression='gzip')
    print(f"- Saved {fname}")
    
    chunk_num+=1

title_basics_df_reader.close()

- Saved Data/title_basics_chunk_001.csv.gz
- Saved Data/title_basics_chunk_002.csv.gz
- Saved Data/title_basics_chunk_003.csv.gz
- Saved Data/title_basics_chunk_004.csv.gz
- Saved Data/title_basics_chunk_005.csv.gz
- Saved Data/title_basics_chunk_006.csv.gz
- Saved Data/title_basics_chunk_007.csv.gz
- Saved Data/title_basics_chunk_008.csv.gz
- Saved Data/title_basics_chunk_009.csv.gz
- Saved Data/title_basics_chunk_010.csv.gz
- Saved Data/title_basics_chunk_011.csv.gz
- Saved Data/title_basics_chunk_012.csv.gz
- Saved Data/title_basics_chunk_013.csv.gz
- Saved Data/title_basics_chunk_014.csv.gz
- Saved Data/title_basics_chunk_015.csv.gz
- Saved Data/title_basics_chunk_016.csv.gz
- Saved Data/title_basics_chunk_017.csv.gz
- Saved Data/title_basics_chunk_018.csv.gz
- Saved Data/title_basics_chunk_019.csv.gz
- Saved Data/title_basics_chunk_020.csv.gz
- Saved Data/title_basics_chunk_021.csv.gz
- Saved Data/title_basics_chunk_022.csv.gz
- Saved Data/title_basics_chunk_023.csv.gz
- Saved Dat

In [23]:
# Using glob to get list of files that match a pattern.
q = 'Data/title_basics_chunk*.csv.gz'
chunked_files = sorted(glob.glob(q))

# Showing the first 5 
chunked_files[:5]

['Data\\title_basics_chunk_001.csv.gz',
 'Data\\title_basics_chunk_002.csv.gz',
 'Data\\title_basics_chunk_003.csv.gz',
 'Data\\title_basics_chunk_004.csv.gz',
 'Data\\title_basics_chunk_005.csv.gz']

In [24]:
# Combining Many Files.

## Loading all files as df and appending to a list.
df_list = []
for file in chunked_files:
    title_basics_temp_df = pd.read_csv(file, index_col=0)
    df_list.append(title_basics_temp_df)
    
## Concatenating the list of dfs into 1 combined.
title_basics_df_combined = pd.concat(df_list)
title_basics_df_combined

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
106074,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery"
110445,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance"
110508,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama"
111819,tt0114447,movie,The Silent Force,The Silent Force,0,2001.0,,90,Action
113254,tt0115937,movie,Consequence,Consequence,0,2000.0,,91,Drama
...,...,...,...,...,...,...,...,...,...
9221012,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama
9221021,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller"
9221060,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020.0,,84,Thriller
9221105,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History"


In [25]:
# Loading and Concatenating the list of dfs with 1 line
title_basics_df_combined = pd.concat([pd.read_csv(file, index_col=0) for file in chunked_files])
title_basics_df_combined

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
106074,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery"
110445,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance"
110508,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama"
111819,tt0114447,movie,The Silent Force,The Silent Force,0,2001.0,,90,Action
113254,tt0115937,movie,Consequence,Consequence,0,2000.0,,91,Drama
...,...,...,...,...,...,...,...,...,...
9221012,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama
9221021,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller"
9221060,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020.0,,84,Thriller
9221105,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History"


In [26]:
# Saving the final combined dataframe
final_fname ='Data/title_basics_combined.csv.gz'
title_basics_df_combined.to_csv(final_fname, compression='gzip', index=False)

In [27]:
# Rename our final dataset.
title_basics_ready_df = pd.read_csv(final_fname)
title_basics_ready_df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery"
1,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance"
2,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama"
3,tt0114447,movie,The Silent Force,The Silent Force,0,2001.0,,90,Action
4,tt0115937,movie,Consequence,Consequence,0,2000.0,,91,Drama
...,...,...,...,...,...,...,...,...,...
79853,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama
79854,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller"
79855,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020.0,,84,Thriller
79856,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History"


### 2.3.- Mount and loading: Tittle Ratings Dataset

In [28]:
# Load data.
title_ratings_df = pd.read_csv('Data/title.ratings.tsv.gz',
                                sep='\t', low_memory=False,
                               chunksize=100_000)

In [29]:
# The first row # of the next chunk is stored under ._currow.
title_ratings_df._currow

0

In [30]:
# Get the first df chunk from the reader.
title_ratings_temp_df = title_ratings_df.get_chunk()
title_ratings_temp_df

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1910
1,tt0000002,5.8,256
2,tt0000003,6.5,1713
3,tt0000004,5.6,169
4,tt0000005,6.2,2527
...,...,...,...
99995,tt0139417,5.4,52
99996,tt0139418,5.0,123
99997,tt0139419,6.9,13
99998,tt0139420,8.7,6


In [31]:
# Programatically saving an fname using the chunk #.
chunk_num3=1
fname3= f'Data/title_ratings_chunk_{chunk_num3:03d}.csv.gz'
fname3

'Data/title_ratings_chunk_001.csv.gz'

In [32]:
# Save tittle_akas_temp_df to disk using the fname2.
title_ratings_temp_df.to_csv(fname3, compression='gzip')

## incrementing chunk_num by 1 for the next file.
chunk_num3+=1

In [33]:
# Read the csv without the index.
pd.read_csv(fname3, index_col=0, low_memory=False)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1910
1,tt0000002,5.8,256
2,tt0000003,6.5,1713
3,tt0000004,5.6,169
4,tt0000005,6.2,2527
...,...,...,...
99995,tt0139417,5.4,52
99996,tt0139418,5.0,123
99997,tt0139419,6.9,13
99998,tt0139420,8.7,6


In [34]:
# Constructing the Loop.
chunk_num3 = 1

for title_ratings_temp_df in title_ratings_df:
      
    ## Replace "\N" with np.nan
    title_ratings_temp_df.replace({'\\N':np.nan}, inplace=True)
    
    ## Keep only US movies.
    ### Create the filter.
    keep_US_movies = title_ratings_temp_df['tconst'].isin(title_akas_ready_df['titleId'])
    ### Apply the filter to the dataset.
    title_ratings_temp_df = title_ratings_temp_df[keep_US_movies]
       
    ### Saving chunk to disk
    fname3= f'Data/title_ratings_chunk_{chunk_num3:03d}.csv.gz'
    title_ratings_temp_df.to_csv(fname3, compression='gzip')
    print(f"- Saved {fname3}")
    
    chunk_num3+=1

title_ratings_df.close()

- Saved Data/title_ratings_chunk_001.csv.gz
- Saved Data/title_ratings_chunk_002.csv.gz
- Saved Data/title_ratings_chunk_003.csv.gz
- Saved Data/title_ratings_chunk_004.csv.gz
- Saved Data/title_ratings_chunk_005.csv.gz
- Saved Data/title_ratings_chunk_006.csv.gz
- Saved Data/title_ratings_chunk_007.csv.gz
- Saved Data/title_ratings_chunk_008.csv.gz
- Saved Data/title_ratings_chunk_009.csv.gz
- Saved Data/title_ratings_chunk_010.csv.gz
- Saved Data/title_ratings_chunk_011.csv.gz
- Saved Data/title_ratings_chunk_012.csv.gz


In [35]:
# Get list of files that match a pattern.
q3 = "Data/title_ratings_chunk*.csv.gz"
chunked_files3 = sorted(glob.glob(q2))

# Showing the first 5 
chunked_files3[:5]

['Data\\title_akas_chunk_001.csv.gz',
 'Data\\title_akas_chunk_002.csv.gz',
 'Data\\title_akas_chunk_003.csv.gz',
 'Data\\title_akas_chunk_004.csv.gz',
 'Data\\title_akas_chunk_005.csv.gz']

In [36]:
# Loading all files as df and appending to a list.
df_list3 = []
for file3 in chunked_files3:
    title_ratings_temp_df = pd.read_csv(file3, index_col=0, low_memory=False)
    df_list3.append(title_ratings_temp_df)
    
# Concatenating the list of dfs into 1 combined.
title_ratings_df_combined = pd.concat(df_list3)
title_ratings_df_combined

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
...,...,...,...,...,...,...,...,...
33172903,tt9916702,1,Loving London: The Playground,US,,,,0.0
33172940,tt9916720,10,The Demonic Nun,US,,tv,,0.0
33172942,tt9916720,12,The Nun 2,US,,imdbDisplay,,0.0
33172959,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0.0


In [37]:
# Loading and Concatenating the list of dfs with 1 line.
title_ratings_df_combined = pd.concat([pd.read_csv(file3, index_col=0, low_memory=False) for file3 in chunked_files3])
title_ratings_df_combined

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
...,...,...,...,...,...,...,...,...
33172903,tt9916702,1,Loving London: The Playground,US,,,,0.0
33172940,tt9916720,10,The Demonic Nun,US,,tv,,0.0
33172942,tt9916720,12,The Nun 2,US,,imdbDisplay,,0.0
33172959,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0.0


In [38]:
# Saving the final combined dataframe.
final_fname3 ='Data/title_ratings_combined.csv.gz'
title_ratings_df_combined.to_csv(final_fname3, compression='gzip', index=False)

In [39]:
# Rename our final dataset.
title_ratings_ready_df = pd.read_csv(final_fname3, low_memory=False)
title_ratings_ready_df

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
...,...,...,...,...,...,...,...,...
1348742,tt9916702,1,Loving London: The Playground,US,,,,0.0
1348743,tt9916720,10,The Demonic Nun,US,,tv,,0.0
1348744,tt9916720,12,The Nun 2,US,,imdbDisplay,,0.0
1348745,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0.0


### 2.4.- Data dictionary

- The data dictionary can be found [here](https://www.imdb.com/interfaces/).
- The dictionary per dataset downloaded is:

1. **title.basics.tsv.gz:** Contains the following information for titles:
    - tconst (string) - alphanumeric unique identifier of the title
    - titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
    - primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
    - originalTitle (string) - original title, in the original language
    - isAdult (boolean) - 0: non-adult title; 1: adult title
    - startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
    - endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
    - runtimeMinutes – primary runtime of the title, in minutes
genres (string array) – includes up to three genres associated with the title.

2. **title.akas.tsv.gz:** Contains the following information for titles:
    - titleId (string) - a tconst, an alphanumeric unique identifier of the title
    - ordering (integer) – a number to uniquely identify rows for a given titleId
    - title (string) – the localized title
    - region (string) - the region for this version of the title
    - language (string) - the language of the title
    - types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
    - attributes (array) - Additional terms to describe this alternative title, not enumerated
    - isOriginalTitle (boolean) – 0: not original title; 1: original title

3. **title.ratings.tsv.gz:** Contains the IMDb rating and votes information for titles
    - tconst (string) - alphanumeric unique identifier of the title
    - averageRating – weighted average of all the individual user ratings
    - numVotes - number of votes the title has received

## 3.- Data Understanding

## 4.- Data Cleaning

In [40]:
## Title Basics: Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below).

## Ratings: Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below).

# E. Conclusions

- Xxxxx