# DATA UNDERSTANDING
In this section, we introduce the data we are going to use and explore it for insights before we begin our analysis. The section includes a description and summary of the data so that we know what we are getting into.

In this repository, under the file path `Data`, we have six datasets: `im.db`, `bom.movie_gross.csv`, `tn.movie_budgets.csv`, `rt.reviews.tsv`, `rt.movie_info.tsv`, and `tmdb.movies.csv`.
We are going to explore them one by one, focusing on their sources, structures, and formats, why they are suitable, and whether they are in good enough shape for analysis.

The analysis tools we will be using in this section include:
- Pandas
- SQLite3


In [1]:
# Importing necessary libraries.

import pandas as pd
import sqlite3

## rt.reviews.tsv
### Data source
The `rt.reviews.tsv.gz` dataset is sourced from the Rotten Tomatoes website, a film and television review aggregator and user community. The data contains movie reviews and ratings from various movie critics dating from the year 2000 to 2018.

### Why this data is suitable
* The data contains movie reviews, ratings, user comments, and critical reception, which help predict box office performance and overall success. This information also reveals audience preferences for genres, themes, and storylines, which is important to help studios tailor their content to meet market demand and avoid oversaturation in certain areas.

* User reviews and comments provide valuable insights into audience perceptions of specific films and the overall cinematic landscape. Most importantly, this information is crucial for attracting investors and making informed decisions about green-lighting projects.

In [None]:
# Read the data into the `rt_reviews` variable with the pandas .read_csv() function so that we can access it as a DataFrame.

rt_reviews = pd.read_csv("rt.reviews.tsv.gz", compression='gzip', delimiter='\t', encoding='latin-1', index_col=False)
rt_reviews.head()

In [None]:
# This code prints the shape of the data and its column names.

print(f'The data has a {rt_reviews.shape} structure.')
print(f'`rt_reviews` has {len(rt_reviews.columns)} columns, namely:', list(rt_reviews.columns))

In [None]:
# We use the .info() function to explore the data further.
# It shows each column's datatype and count of non-null values, which lets us identify columns with null values.

rt_reviews.info()

In [None]:
# Besides .info(), the .describe() function also summarizes the dataset.
# With include='all', it gives summary statistics for every column, not only the numeric ones.

rt_reviews.describe(include= 'all')

In [None]:
# Count the null values per column and show only the columns that have any, largest first.
missing = rt_reviews.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

In [None]:
# Count the rows that are exact duplicates of an earlier row.
duplicates = rt_reviews.duplicated().sum()
duplicates

### Data summary
* The dataset is structured data stored in a TSV (tab-separated values) file and contains both categorical and non-categorical columns.

* The dataset has 54,432 rows and 8 columns: `id`, `review`, `rating`, `fresh`, `critic`, `top_critic`, `publisher`, and `date`. All the columns are well labeled and easy to interpret.

* Of the 8 columns, 4 contain null values, with counts ranging from about 300 to 13,517. The dataset also contains 9 duplicated rows.

* Most columns have the `object` datatype, so the dataset offers little scope for quantitative analysis in its current form.

### Data Quality Issues
- Incompleteness - 4 columns contain missing values. Missing values can make results inaccurate or misleading, which in turn leads to unreliable insights.

- Duplicate records - The dataset contains 9 duplicated records, which could skew the results because the same information is counted more than once.
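
The two issues above can be quantified in one place. The sketch below is only illustrative: `quality_report` is a hypothetical helper, and `sample` is a small made-up DataFrame standing in for `rt_reviews`, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for `rt_reviews`; the values are made up for illustration.
sample = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "review": ["Great", "Great", None, "Dull"],
    "rating": ["3/5", "3/5", None, None],
})


def quality_report(df):
    """Return missing counts and percentages per column, plus the duplicate-row count."""
    missing = df.isnull().sum()
    report = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values, largest first.
    report = report[report["missing"] > 0].sort_values("missing", ascending=False)
    return report, int(df.duplicated().sum())


report, n_duplicates = quality_report(sample)
print(report)
print(f"Duplicated rows: {n_duplicates}")
```

Calling `quality_report(rt_reviews)` on the real data should give the same kind of table, with the share of missing values shown next to the raw counts.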

### Next steps
The data is mostly clean and has a lot of information to offer. However, a few columns need some cleaning, which involves:

* Handling missing and duplicated records.
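
As a rough sketch of what this cleaning step could look like, the code below drops exact duplicates and then deals with missing values. The choices made here are assumptions rather than decisions taken in this notebook: rows without review text are dropped, and a missing `critic` is filled with the placeholder `"Unknown"`. `clean_reviews` is a hypothetical helper, and `sample` is a made-up DataFrame standing in for `rt_reviews`.

```python
import pandas as pd

# Hypothetical stand-in for `rt_reviews`; the values are made up for illustration.
sample = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "review": ["Great", "Great", None, "Dull"],
    "critic": ["A. Critic", "A. Critic", "B. Critic", None],
})


def clean_reviews(df):
    """Drop duplicate rows, then handle missing values under the stated assumptions."""
    # Remove exact duplicate rows, then (assumption) drop rows with no review text.
    cleaned = df.drop_duplicates().dropna(subset=["review"]).copy()
    # Assumption: an unnamed critic is kept and labeled with a placeholder.
    cleaned["critic"] = cleaned["critic"].fillna("Unknown")
    return cleaned.reset_index(drop=True)


cleaned = clean_reviews(sample)
print(cleaned)
```

Whether to drop or fill a column depends on how that column is used later in the analysis, so these defaults would need revisiting once the modelling questions are fixed.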