### `tn.movie_budgets.csv`
In this section we prepare our data for analysis. In the previous section, we explored the data to understand it better. We found that the `tn.movie_budgets.csv` file was mostly clean, with no null values or duplicates. However, the data contains numerical values stored as objects. 

To ensure the data is appropriate for analysis, we are going to convert the numerical values to integers to enable proper calculations, aggregations, and statistical analysis.
Additionally, the `release_date` column contains dates stored as objects, so we are going to convert it to a datetime data type.

This process will involve:
- Data reformatting

- Data conversion

- Renaming

In [50]:
# Importing the necessary libraries for analysis.

import pandas as pd
import numpy as np
import sqlite3
import string  # imported without an alias so the built-in str is not shadowed
import seaborn as sns
import matplotlib.pyplot as plt


In [51]:
# Reading the file into the variable 'movie_budgets'.
movie_budgets = pd.read_csv("Data/tn.movie_budgets.csv.gz", compression= 'gzip', delimiter= ',', encoding= 'latin-1', index_col= False)

movie_budgets

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0


#### Data conversion
In this section we are going to convert numerical data and dates stored as objects to integers and dates respectively.

In [52]:
# Confirming the data type of each column.

movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


The above code confirms our findings in the previous section.

In [53]:
# Before converting the numerical columns to integers, we first strip the dollar sign ($) and remove the commas.
# Both are non-numeric characters, so the values cannot be parsed as numbers while they are present.

columns_to_strip = ['production_budget', 'domestic_gross', 'worldwide_gross']

movie_budgets[columns_to_strip] = movie_budgets[columns_to_strip].apply(lambda x: x.str.strip('$'))

movie_budgets[columns_to_strip] = movie_budgets[columns_to_strip].apply(lambda x: x.str.replace(',', ''))



In [54]:
# Validation that the code has worked

movie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


In [55]:
# Converting production_budget, domestic_gross and worldwide_gross to integers.
# We use pd.to_numeric() followed by .astype('Int64') to convert them.
# We convert the columns one at a time because pd.to_numeric() only accepts one-dimensional input such as a Series, list, tuple or array.

movie_budgets['production_budget'] = pd.to_numeric(movie_budgets['production_budget'], errors='coerce').astype('Int64')

movie_budgets['domestic_gross'] = pd.to_numeric(movie_budgets['domestic_gross'], errors='coerce').astype('Int64')

movie_budgets['worldwide_gross'] = pd.to_numeric(movie_budgets['worldwide_gross'], errors='coerce').astype('Int64')
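
The three conversions above follow the same pattern, so they could also be wrapped in a helper and applied column by column. The following is a minimal sketch on a made-up two-row sample. The helper name `clean_currency` and the `sample` DataFrame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample that mimics the currency formatting in the dataset.
sample = pd.DataFrame({
    "production_budget": ["$425,000,000", "$7,000"],
    "domestic_gross": ["$760,507,625", "$0"],
})

money_cols = ["production_budget", "domestic_gross"]


def clean_currency(col: pd.Series) -> pd.Series:
    """Remove '$' and ',' from a string column and return nullable integers."""
    stripped = (
        col.str.replace("$", "", regex=False)
           .str.replace(",", "", regex=False)
    )
    # errors='coerce' turns unparseable entries into missing values instead of raising.
    return pd.to_numeric(stripped, errors="coerce").astype("Int64")


# DataFrame.apply hands the helper one column (a Series) at a time,
# which fits pd.to_numeric's one-dimensional input requirement.
cleaned = sample[money_cols].apply(clean_currency)
print(cleaned.dtypes)
```

Applying a helper like this keeps the stripping and the conversion in one place, at the cost of hiding the individual steps that the cells above show explicitly.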

In [56]:
# Stripping the dollar sign makes it hard to tell which currency the amounts are in.
# In this code we add the currency symbol to the column names instead.

movie_budgets.rename(columns={'production_budget': 'production_budget($)', 'domestic_gross': 'domestic_gross($)', 'worldwide_gross': 'worldwide_gross($)'}, inplace=True)

In [57]:
# The next step is converting release_date to datetime. This is essential for year-on-year (YoY) analysis of the revenues and for further time-based analysis.

movie_budgets['release_date'] = pd.to_datetime(movie_budgets['release_date'])
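
When the date strings share a single layout, as the `Dec 18, 2009` style values in this file appear to, passing an explicit `format` makes the parsing rule visible. This is a minimal sketch on hypothetical sample values. The `dates` series and the use of `errors="coerce"` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: two well-formed dates and one malformed entry.
dates = pd.Series(["Dec 18, 2009", "Jun 7, 2019", "not a date"])

# %b = abbreviated month name, %d = day of the month, %Y = four-digit year.
# errors='coerce' converts entries that do not match the format to NaT.
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

print(parsed)
print("unparseable entries:", parsed.isna().sum())
```

Counting the `NaT` values after the conversion is a quick way to check whether any rows failed to parse.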

In [58]:
# Validating that the changes we've made have taken effect.
movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    5782 non-null   int64         
 1   release_date          5782 non-null   datetime64[ns]
 2   movie                 5782 non-null   object        
 3   production_budget($)  5782 non-null   Int64         
 4   domestic_gross($)     5782 non-null   Int64         
 5   worldwide_gross($)    5782 non-null   Int64         
dtypes: Int64(3), datetime64[ns](1), int64(1), object(1)
memory usage: 288.1+ KB


In [59]:
movie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget($),domestic_gross($),worldwide_gross($)
0,1,2009-12-18,Avatar,425000000,760507625,2776345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


We now have a cleaned dataset with appropriate data types and easy-to-interpret column names.

### `im.db` data preparation
In this section, we are going to clean the `im.db` database. In the previous section, we explored the database and realized that several tables contained unusable data ranging from null values to duplicates. In this section we are going to go table by table, exploring what else needs cleaning or formatting.

First, we are going to take a look at the database's entity relationship diagram (ERD).

![movie data erd](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v3/main/movie_data_erd.jpeg)

The diagram shows how the tables relate to one another through the `movie_id` and `person_id` keys.

In [60]:
# Connecting to the database and storing the connection in the `conn` variable 
conn = sqlite3.connect('Data/im.db')

# Reading the sqlite_master table to get the lay of the database
tables = pd.read_sql("""SELECT *
                      FROM sqlite_master""", conn)
tables

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."
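
The same `sqlite_master` lookup can be narrowed to table names only, and SQLite's `PRAGMA table_info` lists the columns of a single table. The sketch below runs against a throwaway in-memory database so that it is self-contained. The `conn_demo` connection and its two tables are stand-ins, not the real `im.db`.

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database that stands in for im.db.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
conn_demo.execute("CREATE TABLE directors (movie_id TEXT, person_id TEXT)")

# sqlite_master holds one row per schema object; keep only the tables.
table_names = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    conn_demo,
)["name"].tolist()

# PRAGMA table_info returns one row per column of the named table.
director_columns = pd.read_sql(
    "PRAGMA table_info(directors)", conn_demo
)["name"].tolist()

print(table_names)
print(director_columns)

conn_demo.close()
```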


#### `movie_basics`
After exploring this table in the previous section, we found that it contains 146144 rows and 6 columns. Out of the 6 columns, 3 contain null values. In this section we are going to clean this data by dropping or replacing null values depending on the relevance of the column. We will also explore further to uncover any data quality issues that were missed in the previous section. 


In [61]:
# Using the sqlite3 library, we are going to query the movie_basics table, selecting everything.
movie_basics = pd.read_sql("""SELECT *
                               FROM movie_basics""", conn)
movie_basics

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


In [62]:
# This code gives us the overview of the table, showing us the columns, datatypes and how many null values the table has.
movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


The output above confirms what the initial exploration pointed out. It also reveals that the `start_year` column is stored as an integer (`int64`), although it represents a date.

In the next few cells we are going to:
- deal with the missing values 

- perform data conversion.

In [63]:
# We are aware of the columns that have missing values, but to make good decisions, we look at the count of missing values in each column.
# This code checks for null values in each column and sums them up. It then filters out columns with no missing values and returns those with null values.
missing = movie_basics.isnull().sum()

missing[missing > 0].sort_values(ascending=False)

runtime_minutes    31739
genres              5408
original_title        21
dtype: int64

As mentioned before, we decide how to deal with missing values based on the relevance of each column and the number of values missing. The output above shows that only three columns have missing values. All three are crucial for the analysis ahead, so we are going to replace the missing values in some columns and drop the affected rows in others.

`runtime_minutes`

This column contains 31739 missing values. We are going to replace them with the median of the column, because the median is less likely than the mean to be skewed by extreme runtimes, and filling the gaps prevents the loss of valuable data.

In [64]:
# This code replaces all the missing values in the `runtime_minutes` column with the median of the column.
# We use this method because runtime entails information that is crucial to our analysis.
# Filling the gaps also keeps the other columns of these rows, which dropping the rows would discard.
# Assigning the result back avoids relying on inplace=True on a column selection, which newer pandas versions may not propagate to the DataFrame.

movie_basics['runtime_minutes'] = movie_basics['runtime_minutes'].fillna(movie_basics['runtime_minutes'].median())
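
To illustrate why the median is the safer fill value when a column has outliers, here is a minimal sketch on hypothetical runtimes. The `runtimes` series and its values are made up for the example.

```python
import pandas as pd

# Hypothetical runtimes with one missing value and one very long film.
runtimes = pd.Series([90.0, 100.0, None, 110.0, 300.0], name="runtime_minutes")

# Both statistics skip missing values by default.
median_runtime = runtimes.median()
mean_runtime = runtimes.mean()

# Fill the gap with the median, which the 300-minute outlier barely moves.
filled = runtimes.fillna(median_runtime)

print("median:", median_runtime)
print("mean:", mean_runtime)
print(filled.tolist())
```

In this sample the single long film pulls the mean well above the typical runtime, while the median stays close to the bulk of the values.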

`original_title`

This column contains only 21 missing values. While it would be acceptable to drop those rows, replacing the null values with the corresponding data from the `primary_title` column seems more appropriate.

In [65]:
# Most of the rows in the table contain the same entries in both the `original_title` and the `primary_title`.
# This code simply fills the null values in the `original_title` column with corresponding data from the `primary_title` column.
 
movie_basics['original_title'] = movie_basics['original_title'].fillna(movie_basics['primary_title'])
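
Passing a Series to `.fillna()` fills each gap with the value at the same index label in that Series, which is what makes the column-to-column fill work. A minimal sketch, using a hypothetical two-row `titles` DataFrame:

```python
import pandas as pd

# Hypothetical sample: the second row has no original title.
titles = pd.DataFrame({
    "primary_title": ["Sunghursh", "6 Gunn"],
    "original_title": ["Sunghursh", None],
})

# Each missing original_title takes the primary_title from the same row.
titles["original_title"] = titles["original_title"].fillna(titles["primary_title"])

print(titles)
```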

`genres`

This column contains 5408 missing values. There is no reliable way to infer a movie's genre from the other columns, so we drop these rows. We use the .dropna() method with the `subset` argument to specify the column in which null values should trigger a drop.

In [66]:
# For the `genres` column, we drop the rows with null values because there is no way to replace them, and they make up only about 3.7% of the table.

movie_basics = movie_basics.dropna(subset= ['genres'])

In [67]:
# Validation that we have dealt with all the null values.

missing = movie_basics.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

Series([], dtype: int64)

Next, we convert `start_year` from an integer to a datetime. When years are stored as plain integers, pandas cannot perform date-specific operations such as calculating time differences, extracting date components, or resampling time series data. Converting to datetime enables these functionalities.

In [79]:
# The format '%Y' tells pandas to read each value as a calendar year; without it, integers are interpreted as nanoseconds since 1970-01-01.
# .assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised when assigning a column on a filtered DataFrame.
movie_basics = movie_basics.assign(start_year=pd.to_datetime(movie_basics['start_year'].astype(str), format='%Y'))


In [69]:
movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140736 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   movie_id         140736 non-null  object        
 1   primary_title    140736 non-null  object        
 2   original_title   140736 non-null  object        
 3   start_year       140736 non-null  datetime64[ns]
 4   runtime_minutes  140736 non-null  float64       
 5   genres           140736 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 7.5+ MB
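
A `datetime64[ns]` dtype alone does not guarantee that the years were read correctly. When `pd.to_datetime` receives integers without a `format`, it treats them as nanoseconds since 1970-01-01. A minimal sketch of the difference, using a hypothetical `years` series:

```python
import pandas as pd

# Hypothetical sample of integer years, as stored in start_year.
years = pd.Series([2013, 2019])

# Without a format, each integer is read as a nanosecond offset from the epoch.
naive = pd.to_datetime(years)

# With format='%Y', each value is parsed as a calendar year (January 1st).
parsed = pd.to_datetime(years.astype(str), format="%Y")

print(naive.dt.year.tolist())
print(parsed.dt.year.tolist())
```

Checking `.dt.year` after the conversion is a quick way to confirm that the values still match the original years.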


#### `directors`
Prior investigation found that this table contains 127639 duplicate records. In this section we explore the table further to determine whether these rows are true duplicates or only appear repeated because the table contains nothing but ids.

This process will include;
- Dropping duplicates

First, we are reading the table into the `directors` dataframe using pandas read_sql() function. 

In [70]:
# Using the sqlite3 library, we are going to query the directors table, selecting everything.
directors = pd.read_sql("""SELECT *
                           FROM directors""", conn)

directors

Unnamed: 0,movie_id,person_id
0,tt0285252,nm0899854
1,tt0462036,nm1940585
2,tt0835418,nm0151540
3,tt0835418,nm0151540
4,tt0878654,nm0089502
...,...,...
291169,tt8999974,nm10122357
291170,tt9001390,nm6711477
291171,tt9001494,nm10123242
291172,tt9001494,nm10123248


In [None]:
# This code helps us understand the structure of our table

directors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163535 entries, 0 to 291173
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   movie_id   163535 non-null  object
 1   person_id  163535 non-null  object
dtypes: object(2)
memory usage: 3.7+ MB


To confirm that the data contains duplicates, we use the .duplicated() method to flag every row that repeats an earlier row and the .sum() method to count the flagged rows.

In [71]:
# This code counts the number of duplicated records in the table.
# We are using the .duplicated() method to identify the duplicated records and the .sum() method to count them.

duplicates = directors.duplicated().sum()
duplicates

127639

To decide the next course of action, we examine the duplicate records further to make sure we will not be dropping data that is not, in fact, duplicated. We select the duplicated records and store them in the `all_duplicate_rows` variable.

In [72]:
# We are selecting every row that has at least one exact copy and storing these rows in a variable for further inspection.

all_duplicate_rows = directors[directors.duplicated(keep=False)]
print(all_duplicate_rows)

         movie_id   person_id
2       tt0835418   nm0151540
3       tt0835418   nm0151540
8       tt0996958   nm2286991
9       tt0996958   nm2286991
10      tt0999913   nm0527109
...           ...         ...
291160  tt8992390   nm0504267
291161  tt8992390   nm0504267
291162  tt8992390   nm0504267
291167  tt8999892  nm10122247
291168  tt8999892  nm10122247

[182316 rows x 2 columns]


The output above confirms that the table truly contains duplicate records. The next course of action is deleting the duplicates. Dropping them is essential because duplicate data negatively impacts data quality, analysis, and overall business operations. It can lead to inaccuracies in reporting and skewed insights, and it ultimately hinders informed decision-making. 

In [73]:
# To prevent the duplicates from negatively impacting the data, we are dropping them using the .drop_duplicates() function.

directors = directors.drop_duplicates()

In [74]:
# Finally, to validate the result, we use the .duplicated() and .sum() methods to count the duplicated records, confirming that the number is 0

duplicates = directors.duplicated().sum()
duplicates

0
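
The two counts reported above differ because `.duplicated()` flags only the extra copies by default, while `keep=False` flags every row that belongs to a duplicated group. A minimal sketch on a hypothetical `pairs` DataFrame:

```python
import pandas as pd

# Hypothetical sample: tt1 appears three times, tt3 twice, tt2 once.
pairs = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt1", "tt2", "tt3", "tt3"],
    "person_id": ["nm1", "nm1", "nm1", "nm2", "nm3", "nm3"],
})

# Default keep='first': the first occurrence is not flagged, later copies are.
extra_copies = pairs.duplicated().sum()

# keep=False: every member of a duplicated group is flagged.
all_copies = pairs.duplicated(keep=False).sum()

# drop_duplicates keeps the first occurrence of each distinct row.
deduped = pairs.drop_duplicates()

print(extra_copies, all_copies, len(deduped))
```

The number of rows removed by `drop_duplicates()` matches the default `.duplicated().sum()` count, not the `keep=False` count.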

#### `known_for`
According to the previous section, this table is mostly clean and has no data quality issues.

In [None]:
# Using the sqlite3 library, we are going to query the known_for table, selecting everything.

known_for = pd.read_sql("""SELECT *
                           FROM known_for""", conn)

known_for

Unnamed: 0,person_id,movie_id
0,nm0061671,tt0837562
1,nm0061671,tt2398241
2,nm0061671,tt0844471
3,nm0061671,tt0118553
4,nm0061865,tt0896534
...,...,...
1638255,nm9990690,tt9090932
1638256,nm9990690,tt8737130
1638257,nm9991320,tt8734436
1638258,nm9991320,tt9615610


In [None]:
# Checking for other data quality issues.

known_for.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638260 entries, 0 to 1638259
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   person_id  1638260 non-null  object
 1   movie_id   1638260 non-null  object
dtypes: object(2)
memory usage: 25.0+ MB


The data is clean and ready for analysis.

#### `movie_akas`
From the previous section, we found that this table, which contains 331703 rows and 8 columns, has null values in 5 columns. In this section we are going to deal with the null values and also explore the data further to make sure there are no other quality issues.

This process will include;
- Dropping rows/ columns

In the code below, we are querying the movie_akas table to access everything for inspection.

In [None]:
# Using the sqlite3 library, we are going to query the movie_akas table, selecting everything.

movie_akas = pd.read_sql("""SELECT *
                            FROM movie_akas""", conn)

movie_akas

Unnamed: 0,movie_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0
...,...,...,...,...,...,...,...,...
331698,tt9827784,2,Sayonara kuchibiru,,,original,,1.0
331699,tt9827784,3,Farewell Song,XWW,en,imdbDisplay,,0.0
331700,tt9880178,1,La atención,,,original,,1.0
331701,tt9880178,2,La atención,ES,,,,0.0


In [None]:
# We are trying to understand the structure of the table.

movie_akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331703 entries, 0 to 331702
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   movie_id           331703 non-null  object 
 1   ordering           331703 non-null  int64  
 2   title              331703 non-null  object 
 3   region             278410 non-null  object 
 4   language           41715 non-null   object 
 5   types              168447 non-null  object 
 6   attributes         14925 non-null   object 
 7   is_original_title  331678 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 20.2+ MB


In [None]:
# From the code above, as well as the previous section, we see that the table contains several null values.
# This code checks for null values in each column and finds their percentage. It then filters out columns with no missing values and returns those with null values.

missing = movie_akas.isnull().mean()*100

missing[missing > 0].sort_values(ascending=False)

attributes           95.500493
language             87.423991
types                49.217523
region               16.066481
is_original_title     0.007537
dtype: float64

The `attributes` column is missing about 95.5% of its records, the `language` column about 87.4%, and the `types` column almost half. As a rule of thumb, when a column is missing this much data, it is better to drop the whole column than to drop the affected rows and lose all the valuable data they hold in other columns.

So, in the next cell, we use the .drop() method to drop these columns.

In [None]:
# In this code we are dropping the columns with a large share of missing values.
# That includes the attributes, language and types columns.
# We are using the .drop() method with the axis=1 argument to specify that columns, not rows, are being dropped

movie_akas = movie_akas.drop(['attributes', 'language', 'types'], axis=1)
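
The column selection can also be derived from the missing-value percentages instead of being typed by hand. The sketch below assumes a cutoff of 40%, a hypothetical value chosen only for illustration, and runs on a made-up `akas` DataFrame.

```python
import pandas as pd

# Hypothetical sample: region is 25% missing, attributes is 75% missing.
akas = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "region": ["BG", None, "BR", "FR"],
    "attributes": [None, None, None, "short title"],
})

# Share of missing values per column, expressed as a percentage.
missing_pct = akas.isnull().mean() * 100

# Assumed cutoff for this example; the right value depends on the analysis.
threshold = 40
sparse_cols = missing_pct[missing_pct > threshold].index.tolist()

# Drop only the columns whose missing share exceeds the cutoff.
trimmed = akas.drop(columns=sparse_cols)

print(sparse_cols)
print(trimmed.columns.tolist())
```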

In [None]:
# This code confirms that we have dropped the three columns

movie_akas.head()

Unnamed: 0,movie_id,ordering,title,region,is_original_title
0,tt0369610,10,Джурасик свят,BG,0.0
1,tt0369610,11,Jurashikku warudo,JP,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,0.0
4,tt0369610,14,Jurassic World,FR,0.0


In [99]:
# For the rest of the columns containing null values, we are going to drop the rows because there is no way to fill them.
# We are using the .dropna() function.

movie_akas = movie_akas.dropna(subset= ['is_original_title', 'region'])

In [None]:
# This code validates that no null values remain in the table.

missing = movie_akas.isnull().sum()

missing[missing > 0].sort_values(ascending=False)

Series([], dtype: int64)
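
The `subset` argument limits the null check to the listed columns, so a row is dropped only when one of those columns is empty. A minimal sketch on a hypothetical `rows` DataFrame, where the `note` column is made up to show that nulls outside the subset are ignored:

```python
import pandas as pd

# Hypothetical sample with gaps in different columns.
rows = pd.DataFrame({
    "title": ["A", "B", "C"],
    "region": ["BG", None, "BR"],
    "is_original_title": [0.0, 1.0, None],
    "note": [None, "x", "y"],
})

# Rows B and C are dropped because region or is_original_title is missing.
# Row A is kept even though its note is missing, since note is not in the subset.
kept = rows.dropna(subset=["region", "is_original_title"])

print(kept)
```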

#### `movie_ratings`
This table has information on movie ratings and number of votes. It contains 73856 rows and 3 columns, and has no null values or duplicated records. 

In [None]:
movie_ratings = pd.read_sql("""SELECT *
                               FROM movie_ratings""", conn)

movie_ratings

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
...,...,...,...
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5


In [85]:
movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


#### `persons`
This table contains the biographies of all persons in the film industry, including actors and directors. It has 606648 rows and 5 columns. Out of the 5 columns, 3 contain null values.

In [86]:
persons = pd.read_sql("""SELECT *
                         FROM persons""", conn)
persons

Unnamed: 0,person_id,primary_name,birth_year,death_year,primary_profession
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator"
...,...,...,...,...,...
606643,nm9990381,Susan Grobes,,,actress
606644,nm9990690,Joo Yeon So,,,actress
606645,nm9991320,Madeline Smith,,,actress
606646,nm9991786,Michelle Modigliani,,,producer


In [87]:
persons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   person_id           606648 non-null  object 
 1   primary_name        606648 non-null  object 
 2   birth_year          82736 non-null   float64
 3   death_year          6783 non-null    float64
 4   primary_profession  555308 non-null  object 
dtypes: float64(2), object(3)
memory usage: 23.1+ MB


In [89]:
missing = persons.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

death_year            599865
birth_year            523912
primary_profession     51340
dtype: int64

#### `principals`
This table contains information about the job categories of individuals in the film industry and, for some, the characters they have played. It stores data in 1028186 rows and 6 columns. It has null values in 2 columns and no duplicate records.

In [88]:
principals = pd.read_sql("""SELECT *
                            FROM principals""", conn)
principals

Unnamed: 0,movie_id,ordering,person_id,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"
...,...,...,...,...,...,...
1028181,tt9692684,1,nm0186469,actor,,"[""Ebenezer Scrooge""]"
1028182,tt9692684,2,nm4929530,self,,"[""Herself"",""Regan""]"
1028183,tt9692684,3,nm10441594,director,,
1028184,tt9692684,4,nm6009913,writer,writer,


In [90]:
principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   movie_id    1028186 non-null  object
 1   ordering    1028186 non-null  int64 
 2   person_id   1028186 non-null  object
 3   category    1028186 non-null  object
 4   job         177684 non-null   object
 5   characters  393360 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


In [91]:
missing = principals.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

job           850502
characters    634826
dtype: int64

#### `writers`
This table links movie writers to the movies they wrote. It contains 255873 rows and 2 columns. It also has 77521 duplicated records.

In [92]:
writers = pd.read_sql("""SELECT *
                         FROM writers""", conn)
writers

Unnamed: 0,movie_id,person_id
0,tt0285252,nm0899854
1,tt0438973,nm0175726
2,tt0438973,nm1802864
3,tt0462036,nm1940585
4,tt0835418,nm0310087
...,...,...
255868,tt8999892,nm10122246
255869,tt8999974,nm10122357
255870,tt9001390,nm6711477
255871,tt9004986,nm4993825


In [93]:
duplicates = writers.duplicated().sum()
duplicates

77521