# Introduction 

## Description 
This notebook contains an analysis of hidden trends between some basic information regarding a particular movie or show and the rating/popularity said movie or show receives on IMDb or TMDB ([link to dataset](https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies)). Moreover, this notebook uses various models to predict the popularity a movie or show receives on TMDB. 

Note that this notebook was created primarily to analyze the trends in a dataset, whatever information it may regard; essentially, this was intended to be a practice exercise. However, I found some appealing and unexpected patterns in this movie/show dataset that prompted me to release my work on Kaggle.

## Executive Summary 
This summary briefly captures the gist of the information discussed. It is recommended that you look through the whole notebook, or even conduct some research of your own, to discover all the patterns/trends that lie hidden. 

The analysis of this dataset confirmed some trends that one might expect. The thriller genre was the most popular, and shows from Korea, Japan, and China were highly successful in terms of popularity.
While confirming many obvious trends, the analysis also sheds light on some unexpected nuances. For example, there was no clear connection between the score a movie received on TMDB and its popularity on TMDB. The rating on IMDb and the popularity on TMDB exhibited a similar relationship as well. Countries such as Colombia and Poland were chosen quite frequently as locations for production and were associated with higher popularity as well. Additionally, shows for younger audiences, rated either `TV-G`, `TV-Y`, or `TV-Y7`, received high popularity ratings on TMDB. These shows scored higher (on average) on IMDb *and* TMDB.

While subtle, these trends may have some implications for the average movie/show director, especially one interested in optimizing popularity. To achieve such an ambitious goal, a director may want to create a thriller show for younger audiences. As the show gains popularity, the director should continue to add more seasons to it.

# Exploration Objectives 



This section is primarily concerned with identifying trends between two or three features from the dataset. Notice that many of the relationships are between a feature and a scoring metric. Primarily, the scoring metrics utilized will be the rating on IMDb, the rating on TMDB, and the popularity on TMDB. One potential metric, namely the number of votes on IMDb, is not used.

In [3]:
# Load modules, data + initialize configuration
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

PALETTE = 'magma'
sns.set_theme(style="white")

In [4]:
plt.rcParams["font.family"] = "monospace"

In [5]:
df = pd.read_csv("../input/netflix-tv-shows-and-movies/titles.csv")

In [6]:
df.head()

## Release Year vs Score 
This section describes the relationship between the release year and each of the scoring metrics. Moreover, I have grouped the release years into decades as this might yield interesting results.

In [7]:
# Keep rows where `imdb_score` and `tmdb_score` are not null
df_cleaned = df[(~df['imdb_score'].isnull()) & (~df['tmdb_score'].isnull())]

df_1 = df_cleaned.copy()

# Group the release year by decade (e.g., 1987 -> 1980)
def get_decade(val):
  return (int(val) // 10) * 10

df_1['decade'] = df_1['release_year'].map(get_decade)
df_1['decade'].value_counts()

In [9]:
# Aggregate the mean `imdb_score` and `tmdb_score` per decade for plotting
decade_score_relations = df_1.groupby('decade').aggregate({"imdb_score": "mean", "tmdb_score": "mean"})

In [10]:
sns.barplot(x=decade_score_relations.index, y=decade_score_relations["imdb_score"], palette=PALETTE)

In [11]:
sns.regplot(x=decade_score_relations.index, y=decade_score_relations['imdb_score']).set(xlabel='Decade', ylabel='Score (IMDB)', title='Decade vs Score (IMDB)')

There seems to be a very slight downward trend in the IMDb scores from decade to decade.

Note: the values plotted for each decade are the **means** of the `imdb_score` values for that decade.
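
To put a number on a trend like this rather than eyeballing a regression line, one option is to fit a straight line with `np.polyfit` and report the slope alongside the Pearson correlation. The sketch below uses made-up per-decade means (not values from this dataset) purely to illustrate the calculation.

```python
import numpy as np

# Hypothetical per-decade mean scores (illustrative only, not from the dataset)
decades = np.array([1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020])
mean_scores = np.array([7.1, 7.0, 7.0, 6.9, 6.8, 6.7, 6.6, 6.5])

# Least-squares line: slope is the change in score per year
slope, intercept = np.polyfit(decades, mean_scores, deg=1)

# Pearson correlation between decade and mean score
r = np.corrcoef(decades, mean_scores)[0, 1]

# Multiply the per-year slope by 10 to express it per decade
print(f"slope per decade: {slope * 10:.3f}, Pearson r: {r:.3f}")
```

With the real data, `decade_score_relations.index` and `decade_score_relations['imdb_score']` could take the place of the two arrays. Keep in mind that a fit over a handful of decade means says little about individual titles.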

Consider the following plots, which do not group by decade; rather, they use the original year values.

In [12]:
sns.scatterplot(x=df_1['release_year'], y=df_1['imdb_score'], hue=df_1['type'])
plt.show()
sns.regplot(x=df_1['release_year'], y=df_1['imdb_score'])

Although this plot is incredibly messy, the regression line still has a visibly negative slope. The reason for the downward trend in IMDb scores is hard to pin down, but there are some plausible explanations.

One possible explanation is based on the technology available at the time. Early films did not have the advanced video editors, CGI, or postprocessing effects that films of today so eagerly indulge in. Thus, early directors would need to create an extraordinarily intriguing plot. Yes, directors of today do place an emphasis on plot; however, they have technology that can, for the average viewer, make up for a weak plot. 

Nevertheless, this is only the trend for IMDb; the complete picture may suggest something different.

In [13]:
# Plot relationship between `decade` and `tmdb_score`, once again using the average values per decade
sns.regplot(x=decade_score_relations.index, y=decade_score_relations['tmdb_score'])

In [14]:
# Plot relationship between `release year` and `tmdb_score`
sns.scatterplot(x=df_1['release_year'], y=df_1['tmdb_score'], hue=df_1['type'])
plt.show()
sns.regplot(x=df_1['release_year'], y=df_1['tmdb_score'])

Interestingly enough, the `tmdb_score` values increase as time progresses.

This stark difference might be attributed to the ages of the users on IMDb versus TMDB. Younger people tend to prefer today's movies, while older folks may prefer movies from the '90s, '80s, or even the '70s.

There is one final metric that we have not looked at, namely the popularity on TMDB.

In [15]:
popularity_info = df_1.groupby('decade').aggregate({"tmdb_popularity": "mean"})

In [16]:
sns.barplot(x=popularity_info.index, y=popularity_info['tmdb_popularity'], palette=PALETTE).set(xlabel="Decade", ylabel="Popularity (TMDB)", title="Decade vs Popularity (TMDB)")

There is a slight upward trend in the popularity of movies. This suggests that, as time progresses, the movies (on average) will receive higher popularity scores.

The '80s had the movies with the highest average TMDB popularity. The next cell prints some of these movies; see if you recognize the titles.

In [17]:
df_1[df_1['decade'] == 1980].sort_values(by='tmdb_popularity', ascending=False)['title'].head(10)

Moreover, here are the titles with the highest TMDB popularity overall:

In [18]:
df.sort_values(by='tmdb_popularity', ascending=False)['title'].head(10)

### Conclusion 
As time progresses, the ratings of movies and shows on IMDb and TMDB, when averaged, remain roughly the same. The popularity, on the other hand, has been especially high for certain periods, yet it nonetheless follows a slight upward trend.

## Age Certification vs Score 

This next section is about the trends that the different age certifications (e.g., `TV-G`, `TV-14`, `R`, `PG-13`, etc.) exhibit with rating and popularity.

In [19]:
# Count the number of null values in the dataset
df['age_certification'].isnull().sum()

In [21]:
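# Fill null values with the mode; assign back instead of using a chained
# `inplace=True`, which may not update `df` in recent pandas versions
df['age_certification'] = df['age_certification'].fillna(df['age_certification'].mode()[0])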

In [22]:
df_2 = df[(~df['imdb_score'].isnull()) & (~df['tmdb_score'].isnull())]

In [23]:
age_score = df_2.groupby("age_certification").aggregate({"imdb_score": "mean", "tmdb_score": "mean", "tmdb_popularity": "mean"})

In [24]:
# Visualize IMDb/TMDB scores
plt.figure(figsize=(12, 6))
sns.barplot(x=age_score.index, y=age_score['imdb_score'], palette=PALETTE).set(xlabel="Age Certification", ylabel="Score (IMDb)", title="Age Cert. vs. Score (IMDb)")
plt.show()
plt.figure(figsize=(12, 6))
sns.barplot(x=age_score.index, y=age_score['tmdb_score'], palette=PALETTE).set(xlabel="Age Certification", ylabel="Score (TMDB)", title="Age Cert. vs Score (TMDB)")

The distribution of average scores for each certification is fairly uniform, something that I was not expecting.

In [25]:
# Visualize popularities of each age certification
plt.figure(figsize=(12, 6))
sns.barplot(x=age_score.index, y=age_score['tmdb_popularity'], palette=PALETTE).set(xlabel='Age Certification', ylabel='TMDB Popularity', title='Age Cert. vs. Popularity')
plt.figure(figsize=(12, 6))
cert_sorted_pop = age_score.sort_values(by='tmdb_popularity', ascending=False)
sns.barplot(x=cert_sorted_pop.index, y=cert_sorted_pop['tmdb_popularity'], palette=PALETTE).set(xlabel='Age Certification', ylabel='TMDB Popularity', title='Age Cert. vs. Popularity (sorted)')

Interestingly enough, `TV-G` is the most popular age certification among the shows and `R` is the most popular age certification among the movies.

To see how much variation and nuance there is within a particular category, we can plot a histogram.

In [26]:
certifications = ['G', 'NC-17', 'PG', 'PG-13', 'R', 'TV-14', 'TV-G', 'TV-MA', 'TV-PG', 'TV-Y', 'TV-Y7']
for i in certifications:
  cert_values = df_2[df_2['age_certification'] == i]

  sns.histplot(cert_values['imdb_score'], kde=True).set(xlabel=i, title=f"Distribution of {i} w.r.t score (IMDb)")
  plt.show()

Although some are skewed slightly, the distributions within each certification are mostly normal.
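
As a rough numeric companion to the histograms, one could compute the sample skewness of the scores within each certification; values near zero suggest a roughly symmetric distribution. The sketch below runs on synthetic data with two made-up certification labels (`A` and `B`), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic scores for two made-up certification labels (not from the dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age_certification": ["A"] * 500 + ["B"] * 500,
    "imdb_score": np.concatenate([
        rng.normal(6.5, 1.0, size=500),
        rng.normal(7.0, 0.8, size=500),
    ]),
})

# Sample skewness per group: values near 0 suggest a roughly symmetric shape
skew_by_cert = demo.groupby("age_certification")["imdb_score"].apply(lambda s: s.skew())
print(skew_by_cert.round(2))
```

Applied to `df_2`, the same `groupby` would give one skewness value per certification. Skewness alone does not establish normality, so it is best read together with the plots.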

### Conclusion 
The age certification did not seem to affect rating as one might expect; the average scores were quite uniform across certifications. Moreover, within each certification, the distribution of scores was roughly normal. The intriguing part of this mini-exploration was the surprising popularity of `TV-G` shows on TMDB. This popularity is likely due to their ability to keep young children occupied while parents tend to important matters. Additionally, some shows aimed at kids include subtle adult jokes. 

## Genre vs. Score 

Next, we will explore how the genre of a movie can impact the score it receives on IMDb or TMDB.

Unfortunately, the organization of the data makes the `genres` column somewhat difficult to work with. Instead of a single genre value, we are given a list of genres that has been converted into a string.

Before we attempt to clean/parse the column, however, let us take a peek at some of the values that constitute this column.

In [27]:
df['genres'].value_counts()

Uh-oh. The `value_counts` method usually returns a nice `pd.Series` that has each value and its respective count. Due to the structure of the data, the method cannot parse the values in the most fitting manner. So, we must do the parsing ourselves.

In [28]:
# Parse genre column and count how often each genre appears
genres = {}

def get_genres(row):
  # Drop the surrounding brackets, then split on commas
  parsed = (str(row)[1:-1]).split(",")

  # Strip whitespace and the surrounding quotes from each genre name
  parsed = [g.strip()[1:-1] for g in parsed]

  # Count every occurrence, including the first one
  for g in parsed:
    genres[g] = genres.get(g, 0) + 1

  return row

df['genres'] = df['genres'].map(get_genres)
genres

Now we get a glimpse of the data.

Note: the `''` corresponds to no genre.
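
As an alternative to slicing the string by hand, the standard library's `ast.literal_eval` can turn a stringified list back into a real list, and `collections.Counter` can do the tallying. This is only a sketch: it assumes each value is a valid Python list literal such as `"['drama', 'crime']"`, and the sample values below are made up for illustration.

```python
import ast
from collections import Counter

# Hypothetical sample values in the assumed "stringified list" format
sample = ["['drama', 'crime']", "['comedy']", "[]", "['drama']"]

def parse_genres(value):
    """Turn a stringified list into a real list of genre names."""
    return ast.literal_eval(value)

# Tally every genre across all rows
counts = Counter(g for value in sample for g in parse_genres(value))

# First listed genre per row, with 'none' for empty lists
first_genre = [(parse_genres(v) or ["none"])[0] for v in sample]

print(counts)
print(first_genre)
```

One difference from the manual approach is that an empty list contributes nothing to the counts here, rather than producing an `''` entry.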

We will now modify the `get_genres` function to transform the original column. We will only use the first element of each value in the `genres` column.

In [29]:
def transform_genres(row):
  parsed = (str(row)[1:-1]).split(",")

  # Strip whitespace and the surrounding quotes from each genre name
  parsed = [g.strip()[1:-1] for g in parsed]

  # Keep only the first listed genre; the counts were already tallied by
  # `get_genres`, so they are not updated again here
  return parsed[0] if parsed[0] != '' else 'none'

In [30]:
# Perform the transformation
df['genres_transformed'] = df['genres'].map(transform_genres)
df['genres_transformed'].value_counts()

Now we get a clean list of all the genres.

Before we observe the relationship between the genre and the score, let us visualize the distribution of genres.

In [31]:
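# Shorten the label for plotting; assign back rather than using a chained `inplace=True`
df['genres_transformed'] = df['genres_transformed'].replace(to_replace='documentation', value='doc')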
plt.figure(figsize=(20,10))
sns.histplot(df['genres_transformed'])

Next, we will visualize the relationship between a genre and the average score movies of that genre received.

In [32]:
df_3 = df[(~df['imdb_score'].isnull()) & (~df['tmdb_score'].isnull())]
df_3.head()

In [33]:
genre_vs_score = df_3.groupby("genres_transformed").aggregate({"imdb_score": "mean", "tmdb_score": "mean", "tmdb_popularity": "mean"})
genre_vs_score.head()

Note: I included the `tmdb_popularity` column in this grouping as it prevents us from having to include it at a later point.

In [35]:
plt.figure(figsize=(18,9))
sns.barplot(x=genre_vs_score.index, y=genre_vs_score['imdb_score'], palette=PALETTE).set(xlabel="Genre", ylabel="Score (IMDb)", title="Genre vs. Score (IMDb)")

In [37]:
plt.figure(figsize=(18,9))
sns.barplot(x=genre_vs_score.index, y=genre_vs_score['tmdb_popularity'], palette=PALETTE).set(xlabel="Genre", ylabel="Popularity (TMDB)", title="Genre vs. Popularity (TMDB)")

To see which genre scored the highest on the performance metrics, we can sort the data and then redraw the bar graphs.

In [38]:
sorted_imdb = genre_vs_score.sort_values(by="imdb_score", ascending=False)
plt.figure(figsize=(18, 9))
sns.barplot(x=sorted_imdb.index, y=sorted_imdb['imdb_score'], palette=PALETTE).set(xlabel='Genre', ylabel='IMDb Rating', title='Genre vs. Rating (IMDb)')

In [39]:
sorted_tmdb = genre_vs_score.sort_values(by="tmdb_score", ascending=False)
plt.figure(figsize=(18, 9))
sns.barplot(x=sorted_tmdb.index, y=sorted_tmdb['tmdb_score'], palette=PALETTE).set(xlabel='Genre', ylabel='TMDB Rating', title='Genre vs. Rating (TMDB)')

In [40]:
sorted_popularity = genre_vs_score.sort_values(by='tmdb_popularity', ascending=False)
plt.figure(figsize=(18,9))
sns.barplot(x=sorted_popularity.index, y=sorted_popularity['tmdb_popularity'], palette=PALETTE).set(xlabel='Genre', ylabel='Popularity (TMDB)', title='Genre vs. Popularity (TMDB)')



Within each scoring metric, the genres show relatively little variation. Yes, there are one- or two-point differences between the average genre and the highest-scoring one, but the distribution is nonetheless fairly uniform.

However, the popularity of the genres varies far more than their scores do. One plausible reason is that a movie is rated relative to other movies of its genre; rating a sci-fi movie based on the quality of action movies would be nonsensical. In other words, for people rating movies, popularity is not a significant factor; rather, the plot, characters, and acting are more decisive.


### Conclusion 
  
Genre **does not** necessarily influence the rating a movie gets. The **popularity** of a particular movie or show, however, is affected by the genre.

### Additional Notes 

 - The visualizations presented in this section take into account the trend between genre and rating for both movies and shows. In section 4.2, the shows will be analyzed in isolation. 

## Analysis of Shows 

The previous explorations conducted have been focused on the combined set of movies and shows. However, in this exploration, the aim is to thoroughly analyze the various nuances of the shows that are present in this database.

This heatmap of the correlation tells us that there is very little correlation between `seasons` and either of the scoring metrics (`imdb_score`, `tmdb_score`). The relatively high correlation between `seasons` and `tmdb_popularity` aligns with what one might expect. After all, most people prefer shows with more seasons to shows with a lower number of seasons.
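
For readers who want to reproduce a correlation matrix like the one behind this heatmap, `DataFrame.corr` computes pairwise Pearson correlations between numeric columns. The sketch below uses synthetic data in which popularity is deliberately constructed to grow with `seasons`; the column names follow this notebook, but the values are invented and the resulting numbers are not the dataset's.

```python
import numpy as np
import pandas as pd

# Synthetic shows: popularity is built to depend on seasons, scores are independent noise
rng = np.random.default_rng(0)
n = 200
seasons = rng.integers(1, 10, size=n)
shows = pd.DataFrame({
    "seasons": seasons,
    "imdb_score": rng.normal(7, 1, size=n),
    "tmdb_score": rng.normal(7, 1, size=n),
    "tmdb_popularity": seasons * 5 + rng.normal(0, 5, size=n),
})

# Pairwise Pearson correlations between all numeric columns
corr = shows.corr()
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would then draw the matrix as a heatmap.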

Let us plot out the relationships between `seasons` and any of the scoring metrics.

These regression plots confirm the conclusions drawn from the heatmap.

#### Conclusion 

The data suggest that an increased number of seasons corresponds to increased popularity. This seems like an accurate conclusion; however, it confuses correlation with causation: a show producer/director would typically only add more seasons *if the show is already popular*. This caveat applies only to the popularity metric; we can make reasonable conclusions from the data based on the other metrics.

For the `imdb_score` values, an increased number of seasons corresponds to a slight increase in score. In contrast, the `tmdb_score` values show very little correlation between an increased number of seasons and the overall score. Extrapolating from this, one might speculate that TMDB users take plot and other story-like features into consideration to a greater extent than IMDb users do.

#### TMDB Ratings for Shows vs. IMDb ratings for Shows 

Since the TMDB ratings and IMDb ratings are both recorded on a ten-point scale, we can compare them to see which site is more critical of movies.

This small difference indicates little to nothing about the methodology utilized by users of IMDb and TMDB to rate movies.
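
A comparison of this kind can be as simple as subtracting one score column from the other and looking at the mean of the differences. The following sketch uses four made-up rows to show the calculation; the figures are illustrative and are not the dataset's results.

```python
import pandas as pd

# Hypothetical scores for four shows (illustrative values only)
shows = pd.DataFrame({
    "imdb_score": [7.2, 6.5, 8.1, 5.9],
    "tmdb_score": [7.0, 6.9, 8.0, 6.3],
})

# Positive values mean TMDB rated the show higher than IMDb did
diff = shows["tmdb_score"] - shows["imdb_score"]

print(f"mean IMDb: {shows['imdb_score'].mean():.3f}")
print(f"mean TMDB: {shows['tmdb_score'].mean():.3f}")
print(f"mean difference (TMDB - IMDb): {diff.mean():.3f}")
```

Looking at the spread of `diff` (for example with `diff.describe()`) alongside its mean helps judge whether a small average gap is meaningful.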

### Genres vs. Score 

In section 3, we observed some of the trends between genre and rating. Recall that, in section 3, we incorporated both movies and shows into our visualization of the data. However, the trends between genre and rating may differ for movies and shows.

In [47]:
plt.figure(figsize=(18,9))
sns.barplot(x=genres_ratings_shows.index, y=genres_ratings_shows["tmdb_score"], palette=PALETTE).set(xlabel="Genre", ylabel="Rating (TMDB)", title="Genre vs. Rating (shows only)")

It might help to arrange these in descending order to see the most highly rated genres for both IMDb and TMDB.

Notice that the western genre and the history genre are both present in the top five highest scores for IMDb and TMDB.

#### Conclusion
Generally, the genre does not affect the rating a particular show receives. Moreover, it is hard to deduce a proper metric for scoring shows solely based on genre as other factors, such as plot, length, and acting, prove to be crucial to an average viewer's method for rating a particular show. Nonetheless, the distribution of average scores for genres is uniform and indicates that genre is not necessarily a primary metric for deducing the rating a show receives. This finding is expected as the people who rate shows of a particular genre compare that show to other shows of that same genre. This regularizes the methodology used by IMDb or TMDB users to judge genres, thus resulting in an overall negligible difference in rating.

However, there was a considerable difference in popularity. Most people, it seems, enjoyed watching shows that belonged to the thriller, family, or science fiction genres. It is also important to note that these genres were not necessarily the highest scoring among the various genres. This indicates that the mean rating of shows of a particular genre does not dictate the genre's overall popularity.

## Production Country vs. Score

It appears that the `production_countries` column takes on a structure similar to that of genres. This means that we will have to transform the column such that the production country can easily be identified.
As with the transformation of the `genres` column, the transformation of the `production_countries` will entail the selection of the first country present in each value of the column.

Before we begin analyzing the relationship between the production countries and the performance of those countries on each of the scoring metrics, let us observe the distribution of production countries. 

To do this, I will split up the data into two sections for the sake of visualization. Viewing all ninety-one distinct values in one bar plot/histogram ended up being quite messy.

The top four countries are those that we might expect to be the leading producers of movies/shows: the U.S., India, Great Britain, and Japan. The distribution of the first section is extremely skewed, as the top five or six countries have an overwhelming lead in terms of the number of movies/shows produced. The second section, however, shows a somewhat uniform distribution; granted, the `count` values of the countries in the second section differ by only one production.

Well, those are certainly some strange results! Cuba, the Democratic Republic of the Congo, and Afghanistan were the top three scorers. These high scores (and the relatively low scores of the U.S., Great Britain, and India) stem from the number of movies/shows present for each country. An average computed from only a handful of titles is easily swayed by one or two highly rated ones, and high scorers such as Cuba had a small number of shows present in the database. To see a perhaps more accurate picture, countries with a low number of titles must be filtered out.
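
One way to apply such a filter is to compute both the mean score and the title count per country, then keep only countries above a minimum count. The sketch below uses a tiny invented table, and the threshold `MIN_TITLES` is a hypothetical choice for illustration, not a value taken from this notebook.

```python
import pandas as pd

# Tiny invented table: one country has a single, very highly rated title
titles = pd.DataFrame({
    "country": ["US", "US", "US", "US", "IN", "IN", "IN", "CU"],
    "imdb_score": [6.0, 7.0, 6.5, 6.5, 6.0, 7.0, 6.5, 9.0],
})

MIN_TITLES = 3  # hypothetical threshold; tune to the dataset

# Mean score and number of titles per country
stats = titles.groupby("country")["imdb_score"].agg(["mean", "count"])

# Drop countries with too few titles before ranking by mean score
filtered = stats[stats["count"] >= MIN_TITLES].sort_values("mean", ascending=False)

print(stats)
print(filtered)
```

In the unfiltered `stats`, the single-title country tops the ranking; after filtering, only countries with enough titles to give a steadier average remain.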

### IMDb Ratings

These new-and-improved plots showcase an increased score for movies/shows of prominent Asian countries.

Next, we will examine the TMDB ratings as well as the TMDB popularity

### TMDB Ratings

### TMDB Popularity 

Another surprising batch of results: Colombia and Poland were the two most popular countries of production, and, as expected, Japan and Korea both showed up in the top five. The popularity of Colombia and Poland may be due to the cost of producing in those countries; presumably, Colombia and Poland have relatively low costs for producing movies or shows.

The act of traveling to other countries to shoot a movie or a show is not new. For example, in India, some advertising agencies traveled to Malaysia or to South Africa to shoot their ads for the sole reason of production cost.

### Conclusion
From this mini-exploration, we uncovered some surprising findings regarding the production countries of the movies in this database. Specifically, we discovered that Colombia and Poland have surprisingly high popularity on TMDB. Additionally, our exploration into the patterns between production country and rating aligned with what we might expect: movies/shows from Japan, Korea, and China (primarily anime and similar titles) were rated highly and were quite popular. 

# Modeling!

Now that we have done all this data exploration, it is time to put it to good use. In this section, we will utilize our knowledge of the trends in this dataset to predict the popularity that a particular movie or show might receive on TMDB.

## Data Cleaning 