# DATA UNDERSTANDING
In this section, we introduce the data we are going to use and explore it for insights before we begin our analysis. The section includes a description and summary of each dataset so that we know what we are working with.

In this repository, under the `Data` folder, we have six datasets: `im.db`, `bom.movie_gross.csv`, `tn.movie_budgets.csv`, `rt.reviews.tsv`, `rt.movie_info.tsv`, and `tmdb.movies.csv`.
We are going to explore them one by one, focusing on their sources, structures, and formats, why they are suitable for the project, and whether they are in good enough shape for analysis.

The analysis tools we will be using in this section are:
- Pandas
- SQLite3


In [2]:
# Importing necessary libraries.

import pandas as pd
import sqlite3

## rt.reviews.tsv
### Data source
The `rt.reviews.tsv` dataset (stored compressed as `rt.reviews.tsv.gz`) is sourced from the Rotten Tomatoes website, a film and television review aggregator and user community. The data contains movie reviews and ratings from various movie critics, dated from 2000 to 2018.

### Why is the data suitable for the project?
* The data contains critic reviews, ratings, and fresh/rotten verdicts, which capture a film's critical reception and can help predict box office performance and overall success. This information also reveals audience preferences for genres, themes, and storylines, which helps studios tailor their content to market demand and avoid oversaturation in certain areas.

* Reviews and comments provide valuable insight into how specific films, and the wider cinematic landscape, are perceived. This information is also important for attracting investors and making informed decisions about green-lighting projects.

In [4]:
# Read the data into the `rt_reviews` variable with the pandas read_csv() function so that we can access it as a DataFrame.

rt_reviews = pd.read_csv("Data/rt.reviews.tsv.gz", compression='gzip', delimiter='\t', encoding='latin-1', index_col=False)
rt_reviews.head()

| | id | review | rating | fresh | critic | top_critic | publisher | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | A distinctly gallows take on contemporary fina... | 3/5 | fresh | PJ Nabarro | 0 | Patrick Nabarro | November 10, 2018 |
| 1 | 3 | It's an allegory in search of a meaning that n... | NaN | rotten | Annalee Newitz | 0 | io9.com | May 23, 2018 |
| 2 | 3 | ... life lived in a bubble in financial dealin... | NaN | fresh | Sean Axmaker | 0 | Stream on Demand | January 4, 2018 |
| 3 | 3 | Continuing along a line introduced in last yea... | NaN | fresh | Daniel Kasman | 0 | MUBI | November 16, 2017 |
| 4 | 3 | ... a perverse twist on neorealism... | NaN | fresh | NaN | 0 | Cinema Scope | October 12, 2017 |


In [5]:
# This code prints the shape of the data and its column names.

print(f'The data has a {rt_reviews.shape} structure.')
print('`rt_reviews` has 8 columns namely:', rt_reviews.columns)

The data has a (54432, 8) structure.
`rt_reviews` has 8 columns namely: Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')


In [6]:
# We will use the .info() method to explore the data further.
# It shows each column's datatype and count of non-null values, which lets us identify columns with null values.

rt_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [7]:
# Besides .info(), the .describe() method also summarizes the dataset.
# With include='all' it reports summary statistics for both the numeric and the object columns.

rt_reviews.describe(include= 'all')

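| | id | review | rating | fresh | critic | top_critic | publisher | date |
|---|---|---|---|---|---|---|---|---|
| count | 54432.0 | 48869 | 40915 | 54432 | 51710 | 54432.0 | 54123 | 54432 |
| unique | NaN | 48682 | 186 | 2 | 3496 | NaN | 1281 | 5963 |
| top | NaN | Parental Content Review | 3/5 | fresh | Emanuel Levy | NaN | eFilmCritic.com | January 1, 2000 |
| freq | NaN | 24 | 4327 | 33035 | 595 | NaN | 673 | 4303 |
| mean | 1045.706882 | NaN | NaN | NaN | NaN | 0.240594 | NaN | NaN |
| std | 586.657046 | NaN | NaN | NaN | NaN | 0.427448 | NaN | NaN |
| min | 3.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN |
| 25% | 542.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN |
| 50% | 1083.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN |
| 75% | 1541.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN |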


In [8]:
# Count the null values in each column and show only the columns that have any, largest first.

missing = rt_reviews.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

rating       13517
review        5563
critic        2722
publisher      309
dtype: int64
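
The raw counts above are easier to judge as a share of the 54,432 rows; `rating`, for instance, is missing in roughly a quarter of them (13,517 of 54,432). The following is a minimal sketch of such a summary. The helper name `missing_summary` and the toy frame are hypothetical and stand in for `rt_reviews`; they are not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values (an assumption, not the real data).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rating": ["3/5", None, None, "4/5"],
    "critic": ["A", "B", None, "D"],
})
print(missing_summary(toy))
```

On the real data, the same helper could be called as `missing_summary(rt_reviews)`.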

In [9]:
# Count the rows that are exact duplicates of an earlier row.

duplicates = rt_reviews.duplicated().sum()
duplicates

9

### Data summary
* The dataset contains structured data stored in a TSV (tab-separated values) file containing both non-categorical and categorical data. 

* The dataset contains data in 54432 rows and 8 columns namely, 'id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher', and 'date'. All the columns are well labeled and easy to interpret.

* Out of the 8 columns, only 4 contain null values, ranging from 309 (`publisher`) to 13,517 (`rating`). The dataset also contains 9 duplicated rows.

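* Six of the eight columns have the `object` datatype; the only numeric columns are `id` and `top_critic`. Little quantitative analysis is therefore possible on the data as loaded; text columns such as `rating` (for example `3/5`) and `date` would first need converting.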
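
Both `rating` (shown as `3/5` in the preview) and `date` (for example "November 10, 2018") are loaded as text. The sketch below shows one possible way to make them usable for numeric work. It assumes fraction-style ratings only, since that is the only format visible above; any other value among the 186 unique ratings is returned as NaN rather than guessed. The helper name `fraction_to_score`, the new column `rating_score`, and the toy frame are hypothetical.

```python
import pandas as pd


def fraction_to_score(value):
    """Convert a 'numerator/denominator' rating such as '3/5' to a float in [0, 1].

    Missing values and anything not matching that pattern are returned as NaN.
    """
    if not isinstance(value, str) or value.count("/") != 1:
        return float("nan")
    num, den = value.split("/")
    try:
        num, den = float(num), float(den)
    except ValueError:
        return float("nan")
    # Reject impossible fractions instead of producing misleading scores.
    if den <= 0 or num < 0 or num > den:
        return float("nan")
    return num / den


# Toy frame with made-up ratings (an assumption, not the real data).
toy = pd.DataFrame({
    "rating": ["3/5", None, "not a fraction", "8/10"],
    "date": ["November 10, 2018", "May 23, 2018", "January 4, 2018", "October 12, 2017"],
})
toy["rating_score"] = toy["rating"].apply(fraction_to_score)
# The format string matches dates written like "November 10, 2018".
toy["date"] = pd.to_datetime(toy["date"], format="%B %d, %Y")
print(toy)
```

Returning NaN for unrecognized ratings keeps the conversion conservative: rows can be inspected and handled later instead of being silently mapped to an arbitrary number.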

### Data Quality Issues
- Incompleteness - 4 columns contain missing values. Missing values can distort summary statistics and lead to inaccurate or misleading insights.

- Duplicate records - The dataset contains 9 duplicated records, which could skew the results as certain records provide the same information twice.

### Next steps
The data is mostly clean and has a lot of information to offer; however, a few columns need some cleaning. The cleaning involves:

* Handling missing values and duplicated records.
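
As a rough illustration of this step, the sketch below drops exact duplicate rows and labels missing `critic` and `publisher` values as "Unknown". That fill strategy is an assumption made for illustration only; the actual cleaning choices (for example, whether to drop rows without a `rating` or `review`) are left open here. The helper name `basic_clean` and the toy frame are hypothetical.

```python
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and label missing critic/publisher values."""
    cleaned = df.drop_duplicates().copy()
    for col in ("critic", "publisher"):
        if col in cleaned.columns:
            # Assumed strategy: keep the row and mark the gap explicitly.
            cleaned[col] = cleaned[col].fillna("Unknown")
    return cleaned.reset_index(drop=True)


# Toy frame with one duplicated row and some gaps (made-up values).
toy = pd.DataFrame({
    "id": [3, 3, 3, 5],
    "review": ["Good", "Good", None, "Bad"],
    "rating": ["3/5", "3/5", None, "1/5"],
    "critic": ["A", "A", None, "B"],
    "publisher": ["X", "X", "Y", None],
})
cleaned = basic_clean(toy)
print(len(toy), "rows before,", len(cleaned), "rows after")
print(cleaned)
```

Missing `rating` and `review` values are deliberately left untouched, since filling them in would invent information.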

## rt.movie_info.tsv
### Data source
The `rt.movie_info.tsv` data is extracted from the Rotten Tomatoes website and contains information about movies and television series, including their genres, ratings, and synopses, as well as each title's director, writer, release date, and runtime.

### Why is the data suitable for the project?
* This data allows us to analyze the impact of writers and directors on previous films' success, providing valuable information for hiring decisions.

* Real-time review data can track a film's performance after its release, allowing studios to adapt marketing strategies and address potential issues. 

* Comparing a film's performance against similar titles in terms of review scores and box office success allows studios to identify areas for improvement and best practices.
