# DATA UNDERSTANDING
In this section, we introduce the data we are going to use and explore it for insights before we begin our analysis. The section includes a description and summary of each dataset so that we know what we are getting into.

In this repository, under the `Data` directory, we have 6 datasets, namely `im.db`, `bom.movie_gross.csv`, `tn.movie_budgets.csv`, `rt.reviews.tsv`, `rt.movie_info.tsv`, and `tmdb.movies.csv`.
We are going to explore them one by one, focusing on their sources, structures and formats, why they are suitable, and whether they are in good shape for analysis.
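
Before loading anything, it can help to confirm that every file is where we expect it. The sketch below is only an illustration: the helper name `find_missing_files` is hypothetical, and the `.gz` suffixes are an assumption taken from the paths used in the loading cells of this section.

```python
from pathlib import Path

# File names as listed above. The ".gz" suffix on the delimited files is an
# assumption based on the paths used by the loading cells in this section.
EXPECTED_FILES = [
    "im.db",
    "bom.movie_gross.csv.gz",
    "tn.movie_budgets.csv.gz",
    "rt.reviews.tsv.gz",
    "rt.movie_info.tsv.gz",
    "tmdb.movies.csv.gz",
]


def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Report which of the expected files cannot be found under the Data directory.
missing_files = find_missing_files("Data")
print("Missing files:", missing_files or "none")
```

Running a check like this first makes a wrong relative path show up as a short list of names rather than as an error halfway through the notebook.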

The analysis tools we will be using in this section include:
- Pandas
- SQLite3


In [69]:
# Importing necessary libraries.

import pandas as pd
import sqlite3

## rt.reviews.tsv
### Data source
The `rt.reviews.tsv.gz` dataset is sourced from the Rotten Tomatoes website, a film and television review aggregator and user community. The data contains movie reviews and ratings from various movie critics dating from the year 2000 to 2018.

### Why is the data suitable for the project?
* The data contains movie reviews, ratings, user comments, and critical reception, which help predict box office performance and overall success. This information also reveals audience preferences for genres, themes, and storylines, which is important to help studios tailor their content to meet market demand and avoid oversaturation in certain areas.

* User reviews and comments provide valuable insights into audience perceptions of specific films and the overall cinematic landscape. Most importantly, this information is crucial for attracting investors and making informed decisions about green-lighting projects.

In [70]:
# We are reading the data into the 'rt_reviews' variable using the pandas .read_csv() function so that we can access the data as a DataFrame

rt_reviews = pd.read_csv("Data/rt.reviews.tsv.gz", compression= 'gzip', delimiter= '\t', encoding= 'latin-1', index_col= False)
rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [71]:
# This code will print out the shape and the data's columns.

print(f'The data has a {rt_reviews.shape} structure.')
print('`rt_reviews` has 8 columns namely:', rt_reviews.columns)

The data has a (54432, 8) structure.
`rt_reviews` has 8 columns namely: Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')


In [72]:
# We will be using the .info() function to explore the data further.
# this function will show us the columns' datatypes, count of non null values, allowing us to identify columns with null values.

rt_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [73]:
# Other than the .info() function, .describe() function also gives a summary of the dataset.
# This function gives us the statistical summary of the data.

rt_reviews.describe(include= 'all')

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
count,54432.0,48869,40915,54432,51710,54432.0,54123,54432
unique,,48682,186,2,3496,,1281,5963
top,,Parental Content Review,3/5,fresh,Emanuel Levy,,eFilmCritic.com,"January 1, 2000"
freq,,24,4327,33035,595,,673,4303
mean,1045.706882,,,,,0.240594,,
std,586.657046,,,,,0.427448,,
min,3.0,,,,,0.0,,
25%,542.0,,,,,0.0,,
50%,1083.0,,,,,0.0,,
75%,1541.0,,,,,0.0,,


In [74]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = rt_reviews.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

rating       13517
review        5563
critic        2722
publisher      309
dtype: int64

In [75]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = rt_reviews.duplicated().sum()
duplicates

9

### Data summary
* The dataset contains structured data stored in a TSV (tab-separated values) file containing both non-categorical and categorical data. 

* The dataset contains data in 54432 rows and 8 columns namely, 'id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher', and 'date'. All the columns are well labeled and easy to interpret.

* Out of the 8 columns, 4 contain null values, ranging from 309 (`publisher`) to 13,517 (`rating`). The dataset also contains 9 duplicated rows.

* Six of the eight columns are stored as object (string) data, so little quantitative analysis is possible until columns such as `rating` (for example `3/5`) and `date` are converted.
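
As an illustration of what such a conversion could look like, the sketch below turns fraction-style ratings such as `3/5` into a number between 0 and 1. The function name `fraction_to_score` is hypothetical, the sample values are made up, and treating every value that does not match the `x/y` pattern as missing is an assumption rather than a decision taken in this project.

```python
import pandas as pd


def fraction_to_score(value):
    """Convert a string such as '3/5' to a float; return NaN for anything else."""
    if not isinstance(value, str) or "/" not in value:
        return float("nan")
    numerator, denominator = value.split("/", 1)
    try:
        numerator, denominator = float(numerator), float(denominator)
    except ValueError:
        # Either side of the slash is not a number.
        return float("nan")
    if denominator == 0:
        return float("nan")
    return numerator / denominator


# Made-up sample values, not rows from the real file.
sample = pd.Series(["3/5", None, "4/4", "not a score"])
scores = sample.apply(fraction_to_score)
print(scores)
```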

### Data Quality Issues
- Incompleteness - 4 columns contain missing values. Missing values in data can lead to inaccuracies and misleading results, which subsequently leads to inaccurate insights.

- Duplicate records - The dataset contains 9 duplicated records, which could skew the results as certain records provide the same information twice.
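
The same three checks (shape, missing values, duplicates) are repeated for every dataset in this section, so they could be wrapped in a small helper. This is a minimal sketch: the function name `summarize` and the toy DataFrame are hypothetical and only mimic the layout of `rt_reviews`.

```python
import pandas as pd


def summarize(df):
    """Collect the shape, per-column missing counts and duplicate-row count."""
    missing = df.isnull().sum()
    return {
        "shape": df.shape,
        "missing": missing[missing > 0].sort_values(ascending=False).to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }


# Toy data: rows 0 and 1 are identical, and two columns contain gaps.
toy = pd.DataFrame({
    "id": [3, 3, 5, 5],
    "review": ["good", "good", None, "bad"],
    "rating": ["3/5", "3/5", None, None],
})
summary = summarize(toy)
print(summary)
```

Calling `summarize(rt_reviews)` would then give the figures quoted in the data summary in one step.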

### Next steps
The data is mostly clean and has a lot of information to offer; however, a few columns need some cleaning. The cleaning involves:

* Handling missing values and duplicated records.
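
One possible way to carry out this step is sketched below on a toy DataFrame. The policy shown (drop exact duplicates, drop rows without a review, label unknown critics) is an assumption for illustration only; the actual choice depends on how each column is used later.

```python
import pandas as pd

# Toy data shaped like rt_reviews; the values are made up.
toy = pd.DataFrame({
    "id": [3, 3, 3, 5],
    "review": ["sharp", "sharp", None, "dull"],
    "rating": ["3/5", "3/5", "2/5", None],
    "critic": ["A. Critic", "A. Critic", "B. Critic", None],
})

cleaned = (
    toy.drop_duplicates()                 # remove rows that repeat another row exactly
       .dropna(subset=["review"])         # assumed policy: a row without review text is dropped
       .fillna({"critic": "Unknown"})     # assumed policy: keep the row, label the critic
       .reset_index(drop=True)
)
print(cleaned)
```

Missing `rating` values are left untouched here, since filling them in would invent scores that critics never gave.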

## rt.movie_info.tsv
### Data source
The `rt.movie_info.tsv` data is extracted from the Rotten Tomatoes website and contains information about movies and television series, including their genres, ratings and synopses, as well as each movie's director, writer, release date and runtime.

### Why is the data suitable for the project?
* This data allows us to analyze the impact of writers and directors on previous films' success providing valuable information for casting decisions. Additionally, real-time review data can track a film's performance after its release, allowing studios to adapt marketing strategies and address potential issues. 

* Comparing a film's performance against similar titles in terms of review scores and box office success allows studios to identify areas for improvement and best practices.


In [76]:
# We are reading the data into the 'rt_movies' variable using the pandas .read_csv() function so that we can access the data as a DataFrame

rt_movies = pd.read_csv("Data/rt.movie_info.tsv.gz", compression= 'gzip', delimiter= '\t', encoding= 'latin-1', index_col= False)
rt_movies.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [77]:
# This code will print out the shape and the data's columns.

print(f'The data has a {rt_movies.shape} structure.')
print('`rt_movies` has 12 columns namely:', rt_movies.columns)

The data has a (1560, 12) structure.
`rt_movies` has 12 columns namely: Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')


In [78]:
# We will be using the .info() function to explore the data further.
# this function will show us the columns' datatypes, count of non null values, allowing us to identify columns with null values.

rt_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [79]:
# Other than the .info() function, .describe() function also gives a summary of the dataset.
# This function gives us the statistical summary of the data.

rt_movies.describe(include= 'all')

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
count,1560.0,1498,1557,1552,1361,1111,1201,1201,340,340.0,1530,494
unique,,1497,6,299,1125,1069,1025,717,1,336.0,142,200
top,,A group of air crash survivors are stranded in...,R,Drama,Steven Spielberg,Woody Allen,"Jan 1, 1987","Jun 1, 2004",$,600000.0,90 minutes,Universal Pictures
freq,,2,521,151,10,4,8,11,340,2.0,72,35
mean,1007.303846,,,,,,,,,,,
std,579.164527,,,,,,,,,,,
min,1.0,,,,,,,,,,,
25%,504.75,,,,,,,,,,,
50%,1007.5,,,,,,,,,,,
75%,1503.25,,,,,,,,,,,


In [80]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = rt_movies.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

box_office      1220
currency        1220
studio          1066
writer           449
dvd_date         359
theater_date     359
director         199
synopsis          62
runtime           30
genre              8
rating             3
dtype: int64

In [81]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = rt_movies.duplicated().sum()
duplicates

0

### Data summary
*  The dataset contains structured data stored in a TSV file containing both numerical and categorical data.

* The dataset contains data in 1560 rows and 12 columns namely, 'id', 'synopsis', 'rating', 'genre', 'director', 'writer', 'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime' and 'studio'. All the columns are well labeled and easy to interpret.

* Out of the 12 columns, 11 contain null values, ranging from 3 to 1,220 per column. The dataset contains no duplicated rows.

* All columns except `id` are stored as object (string) data, including `box_office`, `runtime` and the two date columns, so they need type conversion before any quantitative analysis.

### Data Quality Issues
- Incompleteness - 11 columns contain missing values. Missing values in data can lead to inaccuracies and misleading results, which subsequently leads to inaccurate insights.

- Inaccuracy - Several columns are stored in the wrong data type (for example, `runtime`, `box_office` and the date columns are stored as strings), which can lead to incorrect results if they are analyzed as they are.
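
To put the missing-value counts in proportion, they can be divided by the total number of rows. The sketch below uses only the counts printed by the cells above; the variable names are illustrative.

```python
# Row count and missing-value counts copied from the outputs above.
TOTAL_ROWS = 1560
missing_counts = {
    "box_office": 1220,
    "currency": 1220,
    "studio": 1066,
    "writer": 449,
    "dvd_date": 359,
    "theater_date": 359,
    "director": 199,
    "synopsis": 62,
    "runtime": 30,
    "genre": 8,
    "rating": 3,
}

# Share of rows with a missing value, in percent, rounded to one decimal place.
missing_share = {
    column: round(100 * count / TOTAL_ROWS, 1)
    for column, count in missing_counts.items()
}

for column, share in missing_share.items():
    print(f"{column:<13}{share:>6}%")
```

Seen this way, `box_office` and `currency` are missing in roughly 78% of rows, while `rating` and `genre` are almost complete.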

### Next steps
The data is mostly clean and has a lot of information to offer; however, a few columns need some cleaning. The cleaning involves:

* Handling missing values.

* Data type conversion.
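
The type conversion could look roughly like the sketch below, shown on a toy DataFrame that copies the formats visible in the `head()` output. The thousands separator in `box_office` is an assumption, since the raw strings are not shown above.

```python
import pandas as pd

# Toy rows in the formats shown by rt_movies.head().
toy = pd.DataFrame({
    "theater_date": ["Oct 9, 1971", "Aug 17, 2012", None],
    "box_office": [None, "600,000", None],   # comma separator is an assumption
    "runtime": ["104 minutes", "108 minutes", "200 minutes"],
})

converted = toy.assign(
    # "Oct 9, 1971" -> Timestamp; missing dates become NaT.
    theater_date=pd.to_datetime(toy["theater_date"], format="%b %d, %Y"),
    # Strip any thousands separators, then parse as a number (NaN stays NaN).
    box_office=pd.to_numeric(toy["box_office"].str.replace(",", "", regex=False)),
    # "104 minutes" -> 104
    runtime=pd.to_numeric(toy["runtime"].str.replace(" minutes", "", regex=False)),
)
print(converted.dtypes)
```

Because `box_office` contains missing values, it comes out as a float column rather than an integer one.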

## tmdb.movies.csv
### Data source
The `tmdb.movies.csv` data is extracted from TheMovieDB, a community-built movie and TV database. It contains information on the popularity, genres and ratings of movies.

### Why the data is suitable for the project
* This dataset is suitable for the project because it allows analysis of past movie performance data, including popularity scores and audience ratings, to create models that predict the potential success of upcoming films. Also, community reviews and ratings reveal audience tastes and preferences, which can help studios tailor their content to appeal to specific demographics and genres.

* By understanding audience preferences and geographical trends, studios can tailor their distribution strategies, including release dates, theater selection, and marketing efforts. It facilitates collaboration between stakeholders by providing a shared platform for accessing information and discussing potential projects. 

* The database provides a wealth of data for market research, allowing studios to understand trends, competitor activities, and potential opportunities.

In [82]:
# We are reading the data into the 'tmdb_movies' variable using the .read_csv() pandas function to be able to access the data in form of a dataframe

tmdb_movies = pd.read_csv("Data/tmdb.movies.csv.gz", compression= 'gzip', delimiter= '\t', encoding= 'latin-1', index_col= False)
tmdb_movies.head()

Unnamed: 0,",genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count"
0,"0,""[12, 14, 10751]"",12444,en,Harry Potter and ..."
1,"1,""[14, 12, 16, 10751]"",10191,en,How to Train ..."
2,"2,""[12, 28, 878]"",10138,en,Iron Man 2,28.515,2..."
3,"3,""[16, 35, 10751]"",862,en,Toy Story,28.005,19..."
4,"4,""[28, 878, 12]"",27205,en,Inception,27.92,201..."


In [83]:
# This code will print out the shape and the data's columns.

print(f'The data has a {tmdb_movies.shape} structure.')
print('`tmdb_movies` columns:', tmdb_movies.columns)

The data has a (26517, 1) structure.
`tmdb_movies` columns: Index([',genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count'], dtype='object')


In [84]:
# We will be using the .info() function to explore the data further.
# this function will show us the columns' datatypes, count of non null values, allowing us to identify columns with null values.

tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 1 columns):
 #   Column                                                                                                Non-Null Count  Dtype 
---  ------                                                                                                --------------  ----- 
 0   ,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count  26517 non-null  object
dtypes: object(1)
memory usage: 207.3+ KB


In [85]:
# Other than the .info() function, .describe() function also gives a summary of the dataset.
# This function gives us the statistical summary of the data.

tmdb_movies.describe(include= 'all')

Unnamed: 0,",genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count"
count,26517
unique,26517
top,"18171,[27],404961,en,Ghosthunters,3.69,2016-07..."
freq,1


In [86]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = tmdb_movies.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

Series([], dtype: int64)

In [87]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = tmdb_movies.duplicated().sum()
duplicates

0

### Data summary
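* The data is stored in a comma-separated (CSV) file containing both numerical and categorical data.

* The file has 26517 rows. Its header lists 9 named columns, namely 'genre_ids', 'id', 'original_language', 'original_title', 'popularity', 'release_date', 'title', 'vote_average', and 'vote_count', preceded by an unnamed index column.

* Because the file was read with a tab delimiter, each row was loaded as a single string column (shape `(26517, 1)`). The checks above therefore only show that no row is empty and that no two rows are identical; missing values in individual fields have not been checked yet.

### Data Quality
No duplicated rows were found, but the file must be re-read with `delimiter=','` before missing values and data types can be assessed column by column.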
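
A minimal sketch of reading this layout with a comma delimiter is shown below. It uses an in-memory sample so that it runs on its own: the header is the one printed above, while the two data rows are placeholders rather than records from the real file. Treating the unnamed first column as the index and parsing `genre_ids` into Python lists are assumptions about how the data will be used.

```python
import ast
import io

import pandas as pd

# Header copied from the output above; the two rows are made-up placeholders.
sample_csv = io.StringIO(
    ",genre_ids,id,original_language,original_title,popularity,"
    "release_date,title,vote_average,vote_count\n"
    '0,"[12, 14, 10751]",1,en,Example One,10.5,2010-01-01,Example One,7.0,100\n'
    '1,"[28, 878]",2,en,Example Two,8.25,2011-06-15,Example Two,6.5,50\n'
)

tmdb_sample = pd.read_csv(
    sample_csv,
    delimiter=",",                 # the file is comma-separated, not tab-separated
    index_col=0,                   # the unnamed first column is a row index
    parse_dates=["release_date"],
)

# genre_ids arrives as text such as "[12, 14, 10751]"; turn it into a list of ints.
tmdb_sample["genre_ids"] = tmdb_sample["genre_ids"].apply(ast.literal_eval)

print(tmdb_sample.shape)
print(tmdb_sample.dtypes)
```

With the real file, the same arguments would be passed to `pd.read_csv` together with `compression='gzip'`, after which the missing-value and duplicate checks can be repeated per column.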

## tn.movie_budgets.csv
### Data source
The `tn.movie_budgets.csv` dataset is extracted from 'The Numbers' box office revenue tracking website. The data contains the production budget, domestic gross as well as international gross of movies, providing a performance analysis of the movies since their release dates.

### Why the data is suitable for the project
* Data from a film revenue tracker website is crucial for movie studio stakeholders as it provides insights into a film's financial performance, helping with decision-making and planning throughout the film production process. This data enables better investment decisions, more effective marketing strategies, and improved content planning. 

* Analyzing past performance data, including box office sales, streaming numbers, and merchandise sales, allows studios to predict the potential profitability of upcoming films. This helps in securing investment and making informed decisions about production budgets and marketing campaigns. 

* Revenue trackers provide a clear picture of the financial risks and potential rewards associated with different film projects, allowing studios to allocate resources more effectively.

In [88]:
# We are reading the data into the 'movie_budgets' variable using the .read_csv() pandas function to be able to access the data in form of a dataframe

movie_budgets = pd.read_csv("Data/tn.movie_budgets.csv.gz", compression= 'gzip', delimiter= '\t', encoding= 'latin-1', index_col= False)
movie_budgets.head()

Unnamed: 0,"id,release_date,movie,production_budget,domestic_gross,worldwide_gross"
0,"1,""Dec 18, 2009"",Avatar,""$425,000,000"",""$760,5..."
1,"2,""May 20, 2011"",Pirates of the Caribbean: On ..."
2,"3,""Jun 7, 2019"",Dark Phoenix,""$350,000,000"",""$..."
3,"4,""May 1, 2015"",Avengers: Age of Ultron,""$330,..."
4,"5,""Dec 15, 2017"",Star Wars Ep. VIII: The Last ..."


In [89]:
# This code will print out the shape and the data's columns.

print(f'The data has a {movie_budgets.shape} structure.')
print('`movie_budgets` columns:', movie_budgets.columns)

The data has a (5782, 1) structure.
`movie_budgets` columns: Index(['id,release_date,movie,production_budget,domestic_gross,worldwide_gross'], dtype='object')


In [90]:
# We will be using the .info() function to explore the data further.
# this function will show us the columns' datatypes, count of non null values, allowing us to identify columns with null values.

movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 1 columns):
 #   Column                                                                  Non-Null Count  Dtype 
---  ------                                                                  --------------  ----- 
 0   id,release_date,movie,production_budget,domestic_gross,worldwide_gross  5782 non-null   object
dtypes: object(1)
memory usage: 45.3+ KB


In [91]:
# Other than the .info() function, .describe() function also gives a summary of the dataset.
# This function gives us the statistical summary of the data.

movie_budgets.describe(include= 'all')

Unnamed: 0,"id,release_date,movie,production_budget,domestic_gross,worldwide_gross"
count,5782
unique,5782
top,"45,""Nov 15, 2013"",The Christmas Candle,""$7,000..."
freq,1


In [92]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = movie_budgets.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

Series([], dtype: int64)

In [93]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = movie_budgets.duplicated().sum()
duplicates

0

### Data summary
* The dataset contains structured data stored in a CSV file containing dates, titles and monetary values.

* The file contains 5782 rows. Its header lists 6 columns, namely 'id', 'release_date', 'movie', 'production_budget', 'domestic_gross', and 'worldwide_gross', all well labeled and interpretable; because the file was read with a tab delimiter, the output above shows them as a single string column.

* According to the data, the most recent movie is Moonfall, with a release date of 2020-12-31, while the oldest is The Birth of a Nation, released on 1915-02-08.

* The movie with the highest production budget is Avatar, with a budget of $425,000,000, whereas My Date With Drew had the lowest budget at $1,100.

* The movie with the highest domestic gross was Star Wars Ep. VII: The Force Awakens, with $936,662,225. The film with the highest worldwide gross was Avatar, with $2,776,345,279.

### Data quality issues
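- Inaccuracies - The date and monetary columns are stored as strings, with dollar signs and commas in the monetary values, so they cannot be used in calculations as they are.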

### Next steps
As noted from the dataset, the columns' data types need to be formatted as follows:

 1. release_date needs to be converted to a datetime object

 2. production_budget needs to be converted to an int, with the dollar signs and commas removed

 3. domestic_gross needs to be converted to an int, with the dollar signs and commas removed

 4. worldwide_gross needs to be converted to an int, with the dollar signs and commas removed
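
The four steps above can be sketched as follows. The example reads a small in-memory sample with the header shown earlier; the film names and most amounts are placeholders, not records from the real file. Using `int64` is a deliberate choice, because a worldwide gross above roughly 2.1 billion would not fit into a 32-bit integer.

```python
import io

import pandas as pd

# Header copied from the output above; the rows are made-up placeholders.
sample_csv = io.StringIO(
    "id,release_date,movie,production_budget,domestic_gross,worldwide_gross\n"
    '1,"Dec 18, 2009",Example Film A,"$425,000,000","$1,000","$2,500"\n'
    '2,"May 1, 2015",Example Film B,"$1,100","$0","$0"\n'
)
budgets = pd.read_csv(sample_csv, delimiter=",")

# Step 1: "Dec 18, 2009" -> datetime
budgets["release_date"] = pd.to_datetime(budgets["release_date"], format="%b %d, %Y")

# Steps 2 to 4: strip "$" and "," and convert to 64-bit integers.
money_columns = ["production_budget", "domestic_gross", "worldwide_gross"]
for column in money_columns:
    budgets[column] = (
        budgets[column]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype("int64")
    )

print(budgets.dtypes)
```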

## bom.movie_gross
### Data source
The `bom.movie_gross.csv` dataset is sourced from 'Box Office Mojo' (BOM), a box office reporting and analysis website. The dataset contains domestic and international box office revenues.

### Why the data is suitable for the project
* Analyzing gross data can help studios identify areas where they can streamline processes and optimize resource allocation. For example, data on production costs, marketing expenses, and distribution channels can be used to identify opportunities for cost savings and efficiency improvements. 

* By monitoring the returns on different film projects, studios can track their overall profitability and make adjustments to their production and marketing strategies as needed. 

* Data from revenue trackers allows studios to compare their performance against that of other films and companies, providing insights into their competitive standing and areas for improvement.

* The data empowers stakeholders to make better decisions, reduce risks, and optimize their chances of success in the competitive film industry. 

* By tracking the performance of different marketing channels and release strategies, studios can optimize their efforts to maximize reach and revenue. 

In [94]:
# We are reading the data into the 'movie_gross' variable using the pandas .read_csv() function so that we can access the data as a DataFrame

movie_gross = pd.read_csv("Data/bom.movie_gross.csv.gz", compression= 'gzip', delimiter= ',', encoding= 'latin-1', index_col= False)
movie_gross.head()

In [None]:
# This code will print out the shape and the data's columns.

print(f'The data has a {movie_gross.shape} structure.')
print('`movie_gross` columns:', movie_gross.columns)

In [None]:
# We will be using the .info() function to explore the data further.
# this function will show us the columns' datatypes, count of non null values, allowing us to identify columns with null values.

movie_gross.info()

In [None]:
# Other than the .info() function, .describe() function also gives a summary of the dataset.
# This function gives us the statistical summary of the data.

movie_gross.describe(include= 'all')

In [None]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = movie_gross.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = movie_gross.duplicated().sum()
duplicates

### Data summary
* The dataset contains structured data stored in a CSV file containing numerical and categorical data.

* The row count, column count and column names still have to be recorded from the exploration cells above, which have not produced results yet.

### Data quality issues

### Next steps

## im.db database
### Data source
The `im.db` data is extracted from the 'Internet Movie Database' (IMDB). The database contains information relating to films, TV series, production crew and their biographies, as well as movie ratings.

### Why the data is suitable for the project
Data from film database websites provides valuable insights for movie studio stakeholders, enabling them to make more informed decisions across various aspects of film production and distribution. This data helps with everything from identifying emerging trends and predicting box office success to guiding script development and optimizing marketing campaigns.

* Historical data on previous films, including ratings, reviews, and box office performance, can be used to predict the potential success of new releases. It also helps us understand which factors, such as actor popularity, director reputation, and genre trends, positively influence box office results.

* Data can provide insights into audience preferences and expectations, which can help inform the development of compelling and engaging scripts. Studios can use data to identify successful narrative structures, character archetypes, and thematic elements that resonate with audiences. 

* Data can be used to target specific demographics and tailor marketing messages to resonate with different audience segments. Studios can also optimize showtime schedules and distribution strategies based on past performance data and predictions. 

* Studios can use data to personalize recommendations and create more engaging user experiences on their streaming platforms and websites. This can lead to increased user engagement and satisfaction, ultimately driving revenue.

*  The database can help studios identify promising actors, directors, and other talent by tracking their past performances and reviews. 

In [None]:
# Connecting to the database and storing the connection in the `conn` variable
conn = sqlite3.connect('Data/im.db')

# In this code we read sqlite_master into the `table` variable to get the lay of the database
table = pd.read_sql("""SELECT *
                      FROM sqlite_master""", conn)
table

Then we are going to take a peek into the database's ERD, to understand how the tables relate to each other.

![movie data erd](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v3/main/movie_data_erd.jpeg)
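
Since the same missing-value and duplicate checks are applied to each table in turn, they could also be run in a loop over the table names stored in `sqlite_master`. The sketch below is only an illustration: `profile_tables` is a hypothetical helper, and it is demonstrated on a small in-memory database whose tables and columns are made-up stand-ins for `im.db`.

```python
import sqlite3

import pandas as pd


def profile_tables(conn):
    """Return one row per table: row count, missing cells and duplicate rows."""
    names = pd.read_sql(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
    )["name"]
    rows = []
    for name in names:
        # Loads the whole table into memory, which may be slow for very large tables.
        df = pd.read_sql(f'SELECT * FROM "{name}"', conn)
        rows.append({
            "table": name,
            "rows": len(df),
            "missing_cells": int(df.isnull().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows)


# Hypothetical in-memory stand-in for im.db; table and column names are assumptions.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)")
demo.executemany(
    "INSERT INTO movie_ratings VALUES (?, ?, ?)",
    [("tt1", 7.5, 100), ("tt1", 7.5, 100), ("tt2", None, 5)],
)
demo.execute("CREATE TABLE directors (movie_id TEXT, person_id TEXT)")
demo.executemany("INSERT INTO directors VALUES (?, ?)", [("tt1", "nm1"), ("tt2", "nm2")])

profile = profile_tables(demo)
print(profile)
```

With the real database, `profile_tables(conn)` would produce one overview table in place of a separate pair of cells per table.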


In [None]:
# This code queries the database to access the table

movie_basics = pd.read_sql("""SELECT *
                               FROM movie_basics""", conn)
movie_basics

In [None]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = movie_basics.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = movie_basics.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

directors = pd.read_sql("""SELECT *
                           FROM directors""", conn)
directors

In [None]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = directors.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = directors.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

known_for = pd.read_sql("""SELECT *
                           FROM known_for""", conn)
known_for

In [None]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = known_for.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = known_for.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

movie_akas = pd.read_sql("""SELECT *
                            FROM movie_akas""", conn)
movie_akas

In [None]:
# .info() gives us a summary of the data, but to check missing values more closely we use .isnull() to count the missing values in each column.
# This code sums the null values in each column, then filters out the columns with no missing values and returns those that have some.

missing = movie_akas.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the duplicated records in the table.
# We use .duplicated() to identify the duplicated records and .sum() to count them.

duplicates = movie_akas.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

movie_ratings = pd.read_sql("""SELECT *
                               FROM movie_ratings""", conn)
movie_ratings

In [None]:
# The .info() function gives a general summary of the data; to inspect missing values more closely, we use .isnull() to count the missing values in each column.
# The code sums the null values per column, then keeps only the columns that have at least one missing value.

missing = movie_ratings.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the number of duplicated records in the table.
# We use the .duplicated() function to flag the duplicated records and the .sum() function to count them.

duplicates = movie_ratings.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

persons = pd.read_sql("""SELECT *
                         FROM persons""", conn)
persons

In [None]:
# The .info() function gives a general summary of the data; to inspect missing values more closely, we use .isnull() to count the missing values in each column.
# The code sums the null values per column, then keeps only the columns that have at least one missing value.

missing = persons.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the number of duplicated records in the table.
# We use the .duplicated() function to flag the duplicated records and the .sum() function to count them.

duplicates = persons.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

principals = pd.read_sql("""SELECT *
                            FROM principals""", conn)
principals

In [None]:
# The .info() function gives a general summary of the data; to inspect missing values more closely, we use .isnull() to count the missing values in each column.
# The code sums the null values per column, then keeps only the columns that have at least one missing value.

missing = principals.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the number of duplicated records in the table.
# We use the .duplicated() function to flag the duplicated records and the .sum() function to count them.

duplicates = principals.duplicated().sum()
duplicates

In [None]:
# This code queries the database to access the table

writers = pd.read_sql("""SELECT *
                         FROM writers""", conn)
writers

In [None]:
# The .info() function gives a general summary of the data; to inspect missing values more closely, we use .isnull() to count the missing values in each column.
# The code sums the null values per column, then keeps only the columns that have at least one missing value.

missing = writers.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# This code counts the number of duplicated records in the table.
# We use the .duplicated() function to flag the duplicated records and the .sum() function to count them.

duplicates = writers.duplicated().sum()
duplicates

### Data summary
* The dataset is structured data stored in a database with 8 tables: `movie_basics`, `directors`, `known_for`, `movie_akas`, `movie_ratings`, `persons`, `principals`, and `writers`.

### Table exploration
1. `movie_basics` - This table contains general movie information. It contains 146144 rows and 6 columns. Out of the 6 columns, 3 contain null values.

2. `directors` - This table links movie directors and the movies they directed. The data is stored in 291174 rows and 2 columns. It also contains 127639 duplicate records.

3. `known_for` - This table links people in the film industry and the movies they are known for. It has 1638260 rows and 2 columns, with no null values or duplicate records.

4. `movie_akas` - This table has information on the different names of the movies worldwide. It has 331703 rows and 8 columns. 5 out of the 8 columns contain null values.

5. `movie_ratings` - This table has information on movie ratings and number of votes. It contains 73856 rows and 3 columns, and has no null values or duplicated records.

6. `persons` - This table contains the biographies of all persons in the film industry, including actors and directors. It has 606648 rows and 5 columns. Out of the 5 columns, 3 contain null values.

7. `principals` - This table contains information about the job categories of individuals in the film industry and, for some, the characters they have played in past movies. It stores data in 1028186 rows and 6 columns. It has null values in 2 columns and no duplicate records.

8. `writers` - This table links movie writers and the movies they wrote. It contains 255873 rows and 2 columns. It also has 77521 duplicated records.
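
The per-table counts listed above (rows, columns, columns with nulls, duplicate rows) could also be gathered in a single pass instead of one cell per table. The sketch below is only an illustration: `summarize_tables` is a hypothetical helper that is not part of the notebook, and it runs against a small in-memory database with made-up rows and assumed column names rather than the real `im.db`.

```python
import sqlite3

import pandas as pd


def summarize_tables(conn, table_names):
    """Return one row of summary statistics per table (hypothetical helper)."""
    rows = []
    for name in table_names:
        # Table names come from a trusted, hard-coded list, so an f-string is acceptable here.
        df = pd.read_sql(f'SELECT * FROM "{name}"', conn)
        rows.append({
            "table": name,
            "rows": len(df),
            "columns": df.shape[1],
            "columns_with_nulls": int(df.isnull().any().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("table")


# Toy in-memory database standing in for im.db (values and column names are assumptions).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE directors (movie_id TEXT, person_id TEXT);
    INSERT INTO directors VALUES ('tt1', 'nm1'), ('tt1', 'nm1'), ('tt2', 'nm2');
    CREATE TABLE persons (person_id TEXT, primary_name TEXT, birth_year INTEGER);
    INSERT INTO persons VALUES ('nm1', 'A', NULL), ('nm2', 'B', 1970);
""")

summary = summarize_tables(demo_conn, ["directors", "persons"])
print(summary)
```

With the real database, passing the existing `conn` and the eight table names should produce a table comparable to the list above, although loading every table in full may be slow for the larger ones.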

### Data Quality Observations
Most tables have a clean schema with clear foreign key references. However, the database contains incomplete data and duplicated records.
- Joins between tables like movie_basics, movie_ratings, and persons are feasible and useful for analysis.
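
The duplicated records reported for `directors` and `writers` are exact repeats of a movie-person pair, so one possible way to handle them before joining is `drop_duplicates()`. This is a minimal sketch on a toy DataFrame, not the cleaning step the project necessarily uses; the frame `directors_demo` and its column names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the directors link table (column names are assumed).
directors_demo = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2", "tt2"],
    "person_id": ["nm1", "nm1", "nm2", "nm3"],
})

# Keep the first occurrence of each exact duplicate row and renumber the index.
directors_clean = directors_demo.drop_duplicates().reset_index(drop=True)

removed = len(directors_demo) - len(directors_clean)
print(f"Removed {removed} duplicate row(s); {len(directors_clean)} remain.")
```

Rows that share a `movie_id` but differ in `person_id` (a movie with two directors) are kept, since only fully identical rows count as duplicates here.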

### Next Steps / Ideas for Deeper Analysis
* Join movie_basics with movie_ratings to find top-rated movies.

* Identify the most prolific directors or actors by linking directors, principals, and persons.

* Investigate trends by release year, genre, and average rating.

* Clean or supplement the known_for table if necessary.
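
As a starting point for the first idea, joining `movie_basics` with `movie_ratings`, the sketch below runs the join on a toy in-memory database. The column names (`movie_id`, `primary_title`, `start_year`, `averagerating`, `numvotes`) and the `MIN_VOTES` threshold are assumptions for illustration and should be checked against the actual tables. It uses a plain `sqlite3` cursor so that it runs on its own; the same query string could presumably be passed to `pd.read_sql` as in the cells above.

```python
import sqlite3

# Toy in-memory database standing in for im.db (values are made up).
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('tt1', 'Alpha', 2015), ('tt2', 'Beta', 2016), ('tt3', 'Gamma', 2017);
    INSERT INTO movie_ratings VALUES
        ('tt1', 7.9, 12000), ('tt2', 9.5, 12), ('tt3', 8.4, 50000);
""")

# Hypothetical threshold so that a handful of votes cannot dominate the ranking.
MIN_VOTES = 1000

query = """
    SELECT b.primary_title, b.start_year, r.averagerating, r.numvotes
    FROM movie_basics AS b
    JOIN movie_ratings AS r ON b.movie_id = r.movie_id
    WHERE r.numvotes >= ?
    ORDER BY r.averagerating DESC, r.numvotes DESC
    LIMIT 10
"""
top_rated = conn_demo.execute(query, (MIN_VOTES,)).fetchall()

for title, year, rating, votes in top_rated:
    print(title, year, rating, votes)
```

Filtering on the vote count is a design choice rather than a requirement: without it, a movie with a very high rating from only a few voters would rank above widely rated titles.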