<a href="https://colab.research.google.com/github/erickbordam/data-projects/blob/main/netflix_analysis_ds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
title: "Netflix Dataset Analysis"
author: "Erick Borda"
date: "2023-11-14"
output: pdf_document
---

# **Netflix Dataset Analysis**

## **1. Identified problem.**

The central challenge is to comprehend the dynamic content creation landscape within the streaming industry. Through a detailed analysis of trends encompassing content duration, ratings, and international distribution, this study seeks to offer crucial insights into the ever-changing preferences of audiences, the intricate dynamics of regional content, and the strategic choices made by both content creators and streaming platforms. This knowledge becomes pivotal in adapting content strategies, customizing offerings to diverse audience segments, and refining content creation processes to align with the evolving demands of the global streaming market.

## **2. Research Questions.**

This project focuses on the following questions:

1. How does the duration of content (movies and TV shows) vary across different content ratings, and are there significant differences in duration between content rated for different audiences?
2. How does the movie duration trend change before and after the year 2000?
3. How does content production change over time? Is there any variation in popular genres over the years?
4. What kind of TV shows/movies are streamed most on Netflix?


## **3. Null and Alternative Hypotheses.**

* How does the duration of content (movies and TV shows) vary across different content ratings, and are there significant differences in duration between content rated for different audiences?
  - **H0:** There is no significant difference in the mean duration of content (movies and TV shows) among different content ratings.
  - **H1:** There is a significant difference in the mean duration of content (movies and TV shows) among different content ratings.
* How does the movie duration trend change before and after the year 2000?
  - **H0:** There is no significant difference in the mean duration of movies before and after the year 2000.
  - **H1:** There is a significant difference in the mean duration of movies before and after the year 2000.
* How does the content production change over time? Is there any variation in popular genres over the years?
  - **H0:** There is no significant difference in the proportion of content production across different years, nor is there any variation in the popularity of genres over the years.
  - **H1:** There is a significant difference in the proportion of content production across different years, or the popularity of genres varies over the years.
* What kind of TV shows/movies are streamed most on Netflix?
  - **H0:** There is no significant difference in the popularity of different genres or types of TV shows/movies streamed on Netflix.
  - **H1:** There is a significant difference in the popularity of different genres or types of TV shows/movies streamed on Netflix.

## **4. Importing Libraries.**

Use Kaleido to export Plotly images

In [77]:
%pip install -U kaleido



In [69]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import shapiro
from scipy.stats import kruskal
import plotly.express as px
import numpy as np
from scipy import stats
import plotly.graph_objects as go
from scipy.stats import skew
from scipy.stats import ttest_ind

Importing the dataset

In [70]:
df = pd.read_csv("Netflix dataset.csv")

Exploring the dataset.

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   country       8790 non-null   object
 5   date_added    8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB


**Dataset Dimensions:**
- 8790 entries (titles)
- 10 columns, each serving as a feature that provides specific information about the titles


**Key Features:**
* **show_id:** Unique identifier for each title.
* **type:** Indicates whether the entry is a Movie or a TV Show.
* **title:** The title of the TV show or movie.
* **director:** The director(s) responsible for the content.
* **country:** The country or countries associated with the title.
* **date_added:** Presumably the date when the title was added to Netflix.
* **release_year:** The year the content was originally released.
* **rating:** The content rating assigned to the title.
* **duration:** The duration of the title, either in minutes or specified for TV shows as seasons.
* **listed_in:** The genre or genres to which the title is assigned.


Descriptive statistics, including measures of central tendency and dispersion, summarize the characteristics of the dataset.

In [72]:
descriptive_stats = df.describe(include='all')
descriptive_stats

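| | show_id | type | title | director | country | date_added | release_year | rating | duration | listed_in |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 8790 | 8790 | 8790 | 8790 | 8790 | 8790 | 8790.0 | 8790 | 8790 | 8790 |
| unique | 8790 | 2 | 8787 | 4528 | 86 | 1713 | | 14 | 220 | 513 |
| top | s1 | Movie | 9-Feb | Not Given | United States | 1/1/2020 | | TV-MA | 1 Season | Dramas, International Movies |
| freq | 1 | 6126 | 2 | 2588 | 3240 | 110 | | 3205 | 1791 | 362 |
| mean | | | | | | | 2014.183163 | | | |
| std | | | | | | | 8.825466 | | | |
| min | | | | | | | 1925.0 | | | |
| 25% | | | | | | | 2013.0 | | | |
| 50% | | | | | | | 2017.0 | | | |
| 75% | | | | | | | 2019.0 | | | |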


## **5. Data cleaning.**

### **5.1. Identifying missing values.**

In [73]:
# Count missing values per column, then sum them for the total
missing_per_column = df.isnull().sum()
total_missing = missing_per_column.sum()
print('Total missing values: ', total_missing)

Total missing values:  0


### **5.2. Identifying duplicated values.**

In [74]:
# Check for duplicate data
duplicates = df.duplicated()

# Summarize duplicate data
total_duplicates = sum(duplicates)
print('Total duplicates: ', total_duplicates)

Total duplicates:  0


### **5.3. Identifying Outliers.**

In [79]:
# Box plot of release years to inspect outliers
fig = px.box(df, y='release_year', labels={'release_year': 'Release Year'},
             title='Boxplot for Release Year')
fig.show(renderer="svg")

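The outliers are older titles: half of the release years fall between 2013 and 2019, so the comparatively few releases from earlier decades (back to 1925) lie far below the lower whisker. They reflect the catalog's strong skew toward recent content and do not by themselves indicate data errors.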
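
By default, the box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. With the quartiles reported in the descriptive statistics (2013 and 2019), the IQR is 6 and the lower fence is 2013 - 1.5 × 6 = 2004. The sketch below applies the same rule to a hypothetical toy series standing in for `df['release_year']`; the helper name `iqr_fences` is an illustration and not part of the original notebook, and Plotly's own quartile computation may differ slightly from the pandas one.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data standing in for df['release_year']
years = pd.Series([1925, 1990, 2013, 2015, 2017, 2018, 2019, 2019, 2020, 2021])

lower, upper = iqr_fences(years)
outliers = years[(years < lower) | (years > upper)]
print(f"Fences: {lower} to {upper}")
print("Flagged as outliers:", outliers.tolist())
```

On the real data, the same call would be `iqr_fences(df['release_year'])`.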

## **6. Enhancing the dataset.**

**Standardized Column Names:**
The `type` and `duration` columns are split into separate columns for movies and TV shows, and `rating` is renamed to `content_rating`.


In [None]:
enhance_df = df.copy()

In [None]:
# Flag each title as a movie or a TV show
enhance_df['movie'] = (enhance_df['type'] == 'Movie')
enhance_df['tv_show'] = (enhance_df['type'] == 'TV Show')

# Split 'duration' into seasons (TV shows) and minutes (movies)
enhance_df['duration_seasons'] = enhance_df['duration'].apply(lambda x: int(x.split()[0]) if 'Season' in x else None)
enhance_df['duration_min'] = enhance_df['duration'].apply(lambda x: int(x.split()[0]) if 'min' in x else None)

# Drop the original 'duration' column; 'type' is kept because later cells filter on it
enhance_df = enhance_df.drop(['duration'], axis=1)
# Fill the non-applicable duration with 0 (movies get 0 seasons, TV shows get 0 minutes)
enhance_df['duration_min'] = enhance_df['duration_min'].fillna(0)
enhance_df['duration_seasons'] = enhance_df['duration_seasons'].fillna(0)
# Rename 'rating' to the more descriptive 'content_rating'
enhance_df = enhance_df.rename(columns={'rating': 'content_rating'})
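
The split of `duration` into minutes and seasons can also be done without `apply`. The sketch below is one possible alternative, shown on a hypothetical four-row sample rather than the real dataset: it extracts the number and the unit with a regular expression and leaves the non-applicable duration missing instead of filling it with 0.

```python
import pandas as pd

# Hypothetical sample mirroring the format of the 'duration' column
sample = pd.DataFrame({'duration': ['90 min', '1 Season', '3 Seasons', '125 min']})

# Capture the leading number and the unit in one pass
parts = sample['duration'].str.extract(r'^(?P<value>\d+)\s+(?P<unit>min|Seasons?)$')
value = pd.to_numeric(parts['value'])

# Keep the value only where the unit matches; other rows stay missing (NaN)
sample['duration_min'] = value.where(parts['unit'] == 'min')
sample['duration_seasons'] = value.where(parts['unit'].str.startswith('Season', na=False))
print(sample)
```

Keeping missing values has a practical effect: pandas aggregations such as `mean` skip them by default, whereas zeros would pull movie statistics down whenever TV shows are not filtered out first.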

**Release Year Transformation:**
`release_year` is already stored as `int64` in this dataset. `pd.to_numeric` is applied as a safeguard, so any non-numeric value would be coerced to a missing value.


In [None]:
enhance_df['release_year'] = pd.to_numeric(enhance_df['release_year'] , errors='coerce')

**Date Added Transformation:**
The `date_added` column is converted to datetime format.


In [None]:
#Fixing data type to date for date_added column
enhance_df['date_added'] = pd.to_datetime(enhance_df['date_added'])

**Special Character Removal:**
Titles and director names have undergone preprocessing to remove special characters.


In [None]:
# Define a function to clean and trim a text value
def clean_and_trim_value(value):
    # Remove non-ASCII characters
    value = value.encode('ascii', 'ignore').decode('utf-8')
    # Remove special characters using a regular expression
    value = re.sub(r'[^\w\s-]', '', value)
    # Trim leading and trailing spaces
    value = value.strip()
    return value

In [None]:
# Apply clean_and_trim_value to the 'title' and 'director' columns
enhance_df['title'] = enhance_df['title'].apply(clean_and_trim_value)
enhance_df['director'] = enhance_df['director'].apply(clean_and_trim_value)

In [None]:
enhance_df.info()

## **7. Exploratory Data Analysis**

### **7.1. Categorical data.**

#### 7.1.1. Title.

Most frequent values for title

In [None]:
# Calculate value counts
value_counts = enhance_df['title'].value_counts().head(10)
# Creating a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Title'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

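The chart shows that some titles are blank. These likely come from the cleaning step, which empties titles made up entirely of non-ASCII or special characters, and they need to be removed.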
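
A minimal sketch of removing blank titles follows. It uses a hypothetical five-row frame rather than `enhance_df`, and treats both empty and whitespace-only strings as blank; whether dropping these rows is the right choice for the analysis is an assumption.

```python
import pandas as pd

# Hypothetical titles after the cleaning step; '' stands for a title
# that consisted only of characters the cleaning step removed
titles = pd.DataFrame({'title': ['Dark', '', 'Roma', '   ', 'Okja']})

# Treat empty or whitespace-only strings as blank
is_blank = titles['title'].str.strip() == ''
print('Blank titles:', int(is_blank.sum()))

# Keep only the rows with a non-blank title
titles_clean = titles.loc[~is_blank].reset_index(drop=True)
print(titles_clean)
```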

#### 7.1.2. Director.

Most frequent values for director

In [None]:
# Calculate value counts
value_counts = enhance_df['director'].value_counts().head(10)
# Creating a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Director'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

#### 7.1.3. Type.

Most frequent values for type

In [None]:
# Calculate value counts
value_counts = enhance_df['type'].value_counts().head(10)
# Creating a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Type'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

#### 7.1.4. Country.

Most frequent values for country

In [None]:
# Calculate value counts
value_counts = enhance_df['country'].value_counts().head(10)
# Creating a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Country'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

#### 7.1.5. Content rating.

Most frequent content ratings

In [None]:
# Calculate value counts
value_counts = enhance_df['content_rating'].value_counts().head(10)
# Creating a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Content rating'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

Content rating for Movies

In [None]:
# Filter the DataFrame by 'type' equals 'movie' and retrieve the first 10 'content_rating' values
filtered_values = enhance_df.loc[enhance_df['type'] == 'Movie', 'content_rating'].head(10)
filtered_values

Content rating counts for Movies

In [None]:
# Filter the DataFrame to movies and calculate value counts for 'content_rating'
value_counts = enhance_df[enhance_df['type'] == 'Movie']['content_rating'].value_counts().head(10)
print(value_counts)
# Create a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values,
             labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Content rating' for 'Movie' Type")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

Content rating counts for TV Shows

In [None]:
# Filter the DataFrame to TV shows and calculate value counts for 'content_rating'
value_counts = enhance_df[enhance_df['type'] == 'TV Show']['content_rating'].value_counts().head(10)
print(value_counts)
# Create a Plotly bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values,
             labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Top 10 Frequency Counts in Column 'Content rating' for 'TV Show' Type")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

### **7.2. Numeric Data.**

#### 7.2.1. Date added column (Year).

In [None]:
# Make sure 'date_added' is a datetime column and extract the year
enhance_df['date_added'] = pd.to_datetime(enhance_df['date_added'])
enhance_df['year'] = enhance_df['date_added'].dt.year

# Histogram of the dates added, colored by year
fig = px.histogram(enhance_df, x='date_added', color='year', nbins=20,
                   labels={'date_added': 'Date Added', 'count': 'Frequency'},
                   title='Histogram of Dates Added per Year')
fig.show()

#### 7.2.2. Release Year.

In [None]:
# Histogram of release years, colored by year
fig = px.histogram(df, x='release_year', color='release_year', nbins=20,
                   labels={'release_year': 'Release Year', 'count': 'Frequency'},
                   title='Histogram of Release Years')
fig.show()

#### 7.2.3. Duration Minutes.

In [None]:
# Keep durations above 0 (movies only) and count how often each duration occurs
filtered_values = enhance_df[enhance_df['duration_min'] > 0]['duration_min'].value_counts()

# Creating a Plotly bar chart
fig = px.bar(x=filtered_values.index, y=filtered_values.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Frequency Counts in Column 'Duration Minutes'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

#### 7.2.4. Duration Seasons.

In [None]:
# Keep season counts above 0 (TV shows only) and count how often each value occurs
filtered_values = enhance_df[enhance_df['duration_seasons'] > 0]['duration_seasons'].value_counts()

# Creating a Plotly bar chart
fig = px.bar(x=filtered_values.index, y=filtered_values.values, labels={'x': "Unique Values", 'y': "Frequency Count"},
             title="Frequency Counts in Column 'Duration Seasons'")
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels
fig.show()

## **8. Normality Test.**

To assess dataset normality, the following steps are employed:

-   **Data Plots:** Visual inspection through histograms and Q-Q plots serves as a preliminary assessment. Departures from a bell-shaped histogram or from a straight line in the Q-Q plot may indicate non-normality.

-   **Skewness Test:** Skewness quantifies data asymmetry. Positive values indicate a longer right tail and negative values a longer left tail. Values close to zero indicate symmetry, which is necessary but not sufficient for normality.

-   **Shapiro-Wilk Test:** This formal statistical test takes normality as its null hypothesis. A p-value below the significance level (0.05 here) leads to rejecting normality. SciPy notes that the p-value may be inaccurate for samples larger than 5000, which applies to this dataset of 8790 rows, and in large samples even small deviations tend to produce a significant result.

Taken together, these steps indicate whether each variable is reasonably close to a normal distribution or deviates from it.
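
The numeric part of these steps can be sketched as a small helper. The function name `normality_summary` and the synthetic samples are assumptions made for illustration; they are not part of the original notebook, which runs the same SciPy calls cell by cell.

```python
import numpy as np
from scipy.stats import shapiro, skew

def normality_summary(values, alpha=0.05):
    """Return skewness, the Shapiro-Wilk p-value, and whether normality is rejected."""
    values = np.asarray(values, dtype=float)
    p_value = shapiro(values).pvalue
    return {
        'skewness': float(skew(values)),
        'p_value': float(p_value),
        'reject_normality': bool(p_value < alpha),
    }

# Synthetic data for illustration only (not the Netflix columns)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=100, scale=15, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

print(normality_summary(normal_sample))
print(normality_summary(skewed_sample))
```

On the notebook's data, a call such as `normality_summary(enhance_df['release_year'])` would return the same two quantities that sections 8.1.2 and 8.1.3 print separately.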

### **8.1. Release year.**

#### **8.1.1. Data Plots.**

In [None]:
# Assuming enhance_df contains your data
# Histogram using Plotly Express
histogram = px.histogram(enhance_df, x='release_year', nbins=30, title='Histogram of Release year')
histogram.update_traces(marker=dict(color='green', line=dict(color='black', width=1)))
histogram.update_xaxes(title='Release year')
histogram.update_yaxes(title='Frequency')
histogram.show()

# QQ Plot using Plotly
qq_data = stats.probplot(enhance_df['release_year'], dist="norm", fit=False)

qq_plot = go.Figure()
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=qq_data[1], mode='markers', name='Data', marker=dict(color='blue')))
# Fit a line to the QQ plot
slope, intercept = np.polyfit(qq_data[0], qq_data[1], 1)
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=(np.array(qq_data[0]) * slope + intercept), mode='lines', name='Fit', line=dict(color='red')))

qq_plot.update_layout(title='QQ Plot of Release year', xaxis_title='Theoretical quantiles', yaxis_title='Sample quantiles')
qq_plot.show()

The plots indicate that the release year values are not normally distributed.

#### **8.1.2. Skewness Test.**

In [None]:
# Calculate skewness
skewness_value = skew(enhance_df['release_year'])

# Describe the distribution based on the skewness value
if abs(skewness_value) < 0.5:
    description = "approximately symmetric"
elif skewness_value < 0:
    description = "left-skewed"
else:
    description = "right-skewed"

print(f"Release Year data is {description}, value: {skewness_value}")

#### **8.1.3. Shapiro-Wilk Test.**

In [None]:
# Perform the Shapiro-Wilk test
shapiro_test_result = shapiro(enhance_df['release_year'])

# Descriptive assessment based on the Shapiro-Wilk test result
if shapiro_test_result.pvalue > 0.05:
    description = "normally distributed"
else:
    description = "non-normally distributed"

print(f"Release year data is {description}, p-value: {shapiro_test_result.pvalue}")

### **8.2. Date added (Year).**

#### **8.2.1. Data Plots.**

In [None]:
# Assuming enhance_df contains your data
# Histogram using Plotly Express
histogram = px.histogram(enhance_df, x='year', nbins=30, title='Histogram of Date added (year)')
histogram.update_traces(marker=dict(color='green', line=dict(color='black', width=1)))
histogram.update_xaxes(title='Date added (year)')
histogram.update_yaxes(title='Frequency')
histogram.show()

# QQ Plot using Plotly
qq_data = stats.probplot(enhance_df['year'], dist="norm", fit=False)

qq_plot = go.Figure()
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=qq_data[1], mode='markers', name='Data', marker=dict(color='blue')))
# Fit a line to the QQ plot
slope, intercept = np.polyfit(qq_data[0], qq_data[1], 1)
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=(np.array(qq_data[0]) * slope + intercept), mode='lines', name='Fit', line=dict(color='red')))

qq_plot.update_layout(title='QQ Plot of Date added (year)', xaxis_title='Theoretical quantiles', yaxis_title='Sample quantiles')
qq_plot.show()

#### **8.2.2. Skewness Test.**

In [None]:
# Calculate skewness
skewness_value = skew(enhance_df['year'])

# Describe the distribution based on the skewness value
if abs(skewness_value) < 0.5:
    description = "approximately symmetric"
elif skewness_value < 0:
    description = "left-skewed"
else:
    description = "right-skewed"

print(f"Date added (Year) data is {description}, value: {skewness_value}")

#### **8.2.3. Shapiro-Wilk Test.**

In [None]:
# Perform the Shapiro-Wilk test
shapiro_test_result = shapiro(enhance_df['year'])

# Descriptive assessment based on the Shapiro-Wilk test result
if shapiro_test_result.pvalue > 0.05:
    description = "normally distributed"
else:
    description = "non-normally distributed"

print(f"Date added (year) data is {description}, p-value: {shapiro_test_result.pvalue}")

### **8.3. Duration Min (Movies)**

#### **8.3.1. Data Plots.**

In [None]:
# Assuming enhance_df contains your data
# Histogram using Plotly Express
movies = enhance_df[enhance_df['movie'] == True]
histogram = px.histogram(movies, x='duration_min', nbins=30, title='Histogram of Duration Min (Movies)')
histogram.update_traces(marker=dict(color='green', line=dict(color='black', width=1)))
histogram.update_xaxes(title='Duration Min (Movies)')
histogram.update_yaxes(title='Frequency')
histogram.show()

# QQ Plot using Plotly
qq_data = stats.probplot(movies['duration_min'], dist="norm", fit=False)

qq_plot = go.Figure()
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=qq_data[1], mode='markers', name='Data', marker=dict(color='blue')))
# Fit a line to the QQ plot
slope, intercept = np.polyfit(qq_data[0], qq_data[1], 1)
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=(np.array(qq_data[0]) * slope + intercept), mode='lines', name='Fit', line=dict(color='red')))

qq_plot.update_layout(title='QQ Plot of Duration Min (Movies)', xaxis_title='Theoretical quantiles', yaxis_title='Sample quantiles')
qq_plot.show()

#### **8.3.2. Skewness Test.**

In [None]:
# Calculate skewness for movies only; TV shows have duration_min set to 0
movie_durations = enhance_df.loc[enhance_df['movie'], 'duration_min']
skewness_value = skew(movie_durations)

# Describe the distribution based on the skewness value
if abs(skewness_value) < 0.5:
    description = "approximately symmetric"
elif skewness_value < 0:
    description = "left-skewed"
else:
    description = "right-skewed"

print(f"Duration Min (Movies) data is {description}, value: {skewness_value}")

#### **8.3.3. Shapiro-Wilk Test.**

In [None]:
# Perform the Shapiro-Wilk test on movies only; TV shows have duration_min set to 0
shapiro_test_result = shapiro(enhance_df.loc[enhance_df['movie'], 'duration_min'])

# Descriptive assessment based on the Shapiro-Wilk test result
if shapiro_test_result.pvalue > 0.05:
    description = "normally distributed"
else:
    description = "non-normally distributed"

print(f"Duration min (Movies) data is {description}, p-value: {shapiro_test_result.pvalue}")

### **8.4. Duration Seasons (TV Shows)**

#### **8.4.1. Data Plots.**

In [None]:
# Assuming enhance_df contains your data
# Histogram using Plotly Express
tv_show = enhance_df[enhance_df['duration_seasons'] > 0]
histogram = px.histogram(tv_show, x='duration_seasons', nbins=30, title='Histogram of Duration Seasons (Tv Shows)')
histogram.update_traces(marker=dict(color='green', line=dict(color='black', width=1)))
histogram.update_xaxes(title='Duration Seasons (Tv Shows)')
histogram.update_yaxes(title='Frequency')
histogram.show()

# QQ Plot using Plotly
qq_data = stats.probplot(tv_show['duration_seasons'], dist="norm", fit=False)

qq_plot = go.Figure()
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=qq_data[1], mode='markers', name='Data', marker=dict(color='blue')))
# Fit a line to the QQ plot
slope, intercept = np.polyfit(qq_data[0], qq_data[1], 1)
qq_plot.add_trace(go.Scatter(x=qq_data[0], y=(np.array(qq_data[0]) * slope + intercept), mode='lines', name='Fit', line=dict(color='red')))

qq_plot.update_layout(title='QQ Plot of Duration Seasons (Tv Shows)', xaxis_title='Theoretical quantiles', yaxis_title='Sample quantiles')
qq_plot.show()

#### **8.4.2. Skewness Test.**

In [None]:
# Calculate skewness for TV shows only; movies have duration_seasons set to 0
tv_seasons = enhance_df.loc[enhance_df['tv_show'], 'duration_seasons']
skewness_value = skew(tv_seasons)

# Describe the distribution based on the skewness value
if abs(skewness_value) < 0.5:
    description = "approximately symmetric"
elif skewness_value < 0:
    description = "left-skewed"
else:
    description = "right-skewed"

print(f"Duration Seasons (TV Shows) data is {description}, value: {skewness_value}")

#### **8.4.3. Shapiro-Wilk Test.**

In [None]:
# Perform the Shapiro-Wilk test on TV shows only; movies have duration_seasons set to 0
shapiro_test_result = shapiro(enhance_df.loc[enhance_df['tv_show'], 'duration_seasons'])

# Descriptive assessment based on the Shapiro-Wilk test result
if shapiro_test_result.pvalue > 0.05:
    description = "normally distributed"
else:
    description = "non-normally distributed"

print(f"Duration Seasons (TV Shows) data is {description}, p-value: {shapiro_test_result.pvalue}")

The statistical tests indicate that Release Year, Date Added (Year), Duration in Minutes for movies, and Duration in Seasons for TV shows all deviate from a normal distribution. This matters for the choice of statistical methods: tests that assume normality may be unreliable for these variables, so methods that do not rely on that assumption are preferable for the analyses that follow.
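
As an illustration of a method that does not assume normality, the sketch below compares two groups with the Mann-Whitney U test and, for reference, with Welch's t-test. The two right-skewed samples are synthetic and chosen only for the example; they are not taken from the dataset, and the Mann-Whitney U test is a suggested alternative rather than the test used in the original notebook.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(42)

# Synthetic right-skewed samples (an assumption, not dataset values)
group_a = rng.lognormal(mean=4.5, sigma=0.3, size=300)
group_b = rng.lognormal(mean=4.7, sigma=0.3, size=300)

# Rank-based test: makes no normality assumption
u_result = mannwhitneyu(group_a, group_b, alternative='two-sided')

# Welch's t-test for comparison: compares means, no equal-variance assumption
t_result = ttest_ind(group_a, group_b, equal_var=False)

print("Mann-Whitney U p-value:", u_result.pvalue)
print("Welch t-test p-value:", t_result.pvalue)
```

With samples as large as the ones in this dataset, both tests often agree, but the rank-based test remains valid when the distributions are strongly skewed.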

## **9. Visualization.**

In [None]:
# Mean movie duration per content rating, with the standard deviation as error bars
movie_stats = (
    enhance_df[enhance_df['movie']]
    .groupby('content_rating')['duration_min']
    .agg(['mean', 'std'])
    .reset_index()
)
movies_bar = px.bar(
    movie_stats,
    x='content_rating',
    y='mean',
    error_y='std',
    title='Bar Plot: Mean Movie Duration Across Content Ratings',
    labels={'mean': 'Duration (minutes)', 'content_rating': 'Content Rating'}
)
movies_bar.update_traces(opacity=0.7)
movies_bar.update_layout(xaxis_title='Content Rating', yaxis_title='Duration (minutes)', showlegend=False)
movies_bar.show()

# Mean number of seasons per content rating for TV shows, with the standard deviation as error bars
tv_stats = (
    enhance_df[enhance_df['tv_show']]
    .groupby('content_rating')['duration_seasons']
    .agg(['mean', 'std'])
    .reset_index()
)
tv_bar = px.bar(
    tv_stats,
    x='content_rating',
    y='mean',
    error_y='std',
    title='Bar Plot: Mean TV Show Duration Across Content Ratings',
    labels={'mean': 'Duration (Seasons)', 'content_rating': 'Content Rating'}
)
tv_bar.update_traces(opacity=0.7)
tv_bar.update_layout(xaxis_title='Content Rating', yaxis_title='Duration (Seasons)', showlegend=False)
tv_bar.show()

These charts compare the average duration across content ratings. The rating counts in section 7.1.5 show that content rated TV-MA and TV-14 is the most prevalent, which suggests that content creators focus substantially on mature audiences and cater to the preferences of older viewers.

In [None]:

# Keep movies only; TV shows have duration_min set to 0 and would distort the comparison
movies_by_period = enhance_df[enhance_df['movie']].copy()

# Categorize movies as released before 2000 or in 2000 and after
movies_by_period['year_category'] = movies_by_period['release_year'].apply(lambda year: 'Before 2000' if year < 2000 else '2000 and after')

# Box plot comparing movie durations for the two periods
box_plot = px.box(movies_by_period, x='year_category', y='duration_min',
                  title='Comparison of Movie Durations Before and After the Year 2000',
                  labels={'duration_min': 'Duration (minutes)', 'year_category': 'Year Group'})
box_plot.show()

The box plot suggests a difference in movie durations before and after the year 2000, visible in the median durations and in the spread. A plot alone cannot establish statistical significance; the null hypothesis is tested formally in section 10.2.

In [None]:
# Assuming df contains the required data
content_distribution = enhance_df.groupby(['year', 'type'])['show_id'].count().unstack().reset_index()

fig = px.bar(content_distribution, x='year', y=['Movie', 'TV Show'],
             title='Content Distribution Over Time', labels={'value': 'Number of Titles'}
             )

fig.update_layout(xaxis_title='Year Added', yaxis_title='Number of Titles', legend_title='Type')
fig.update_xaxes(tickangle=45, tickmode='linear')
fig.show()

In [None]:
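# Work on movies only so that TV genres do not leak into the movie categories
movies_df_copy = enhance_df[enhance_df['movie']].copy()

# One indicator column per genre found in 'listed_in'
genre_dummies = movies_df_copy['listed_in'].str.get_dummies(sep=', ')

category_priorities = {
    'Action & Adventure': ['Action & Adventure', 'Anime Features'],
    'Romance': ['Dramas', 'Classic Movies'],
    'Comedy': ['Comedies'],
    'Horror': ['Horror Movies'],
    'Kid Friendly': ['Children & Family Movies']
}

# A title belongs to a category if it carries any of the category's genres
for category, genres in category_priorities.items():
    movies_df_copy[category] = genre_dummies[genres].max(axis=1)

# 'Other' flags titles that carry at least one genre not mapped above
mapped_genres = [genre for genres in category_priorities.values() for genre in genres]
other_genres = genre_dummies.columns.difference(mapped_genres)
movies_df_copy['Other'] = genre_dummies[other_genres].max(axis=1)

category_counts = movies_df_copy.groupby('year')[list(category_priorities.keys()) + ['Other']].sum().reset_index()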

fig = px.line(category_counts, x='year', y=list(category_counts.columns)[1:],
              title='Number of Productions Over the Years by Category', labels={'value': 'Number of Productions'},
              color_discrete_sequence=px.colors.qualitative.Plotly)

fig.update_layout(xaxis_title='Year', yaxis_title='Number of Productions', legend_title='Category')
fig.show()

In [None]:
# Work on TV shows only so that movie genres do not leak into the TV categories
tv_shows_df_copy = enhance_df[enhance_df['tv_show']].copy()

# One indicator column per genre found in 'listed_in'
genre_dummies_tv = tv_shows_df_copy['listed_in'].str.get_dummies(sep=', ')

category_priorities_tv = {
    'Action & Adventure': ['TV Action & Adventure'],
    'Thrillers': ['TV Thrillers', 'TV Mysteries', 'Crime TV Shows'],
    'Comedy': ['TV Comedies', 'Stand-Up Comedy & Talk Shows'],
    'International': ['International TV Shows', 'British TV Shows', 'Korean TV Shows', 'Spanish-Language TV Shows', 'Anime Series'],
    'Romance': ['Classic & Cult TV', 'Romantic TV Shows']
}

# A title belongs to a category if it carries any of the category's genres
for category, genres in category_priorities_tv.items():
    tv_shows_df_copy[category] = genre_dummies_tv[genres].max(axis=1)

# 'Other' flags titles that carry at least one genre not mapped above
mapped_genres_tv = [genre for genres in category_priorities_tv.values() for genre in genres]
other_genres_tv = genre_dummies_tv.columns.difference(mapped_genres_tv)
tv_shows_df_copy['Other'] = genre_dummies_tv[other_genres_tv].max(axis=1)

category_counts_tv = tv_shows_df_copy.groupby('year')[list(category_priorities_tv.keys()) + ['Other']].sum().reset_index()

fig = px.line(category_counts_tv, x='year', y=list(category_counts_tv.columns)[1:],
              title='Number of Productions Over the Years by Category (TV Shows)', labels={'value': 'Number of Productions'},
              color_discrete_sequence=px.colors.qualitative.Plotly)

fig.update_layout(xaxis_title='Year', yaxis_title='Number of Productions', legend_title='Category')
fig.show()

The line graphs of movies and TV shows per year by genre category show a stable pattern over the years.
Notably, the action and adventure category has remained consistently popular in both TV shows and movies, which suggests sustained audience interest in it.


In [None]:
# Count titles per type; naming both columns explicitly keeps this working across pandas versions
content_type_counts = enhance_df['type'].value_counts().rename_axis('type').reset_index(name='count')

fig = px.pie(content_type_counts, values='count', names='type',
             title='Distribution of Movies and TV Shows',
             color_discrete_sequence=['#D10000', '#F08080'])
fig.update_traces(textinfo='percent+label', pull=[0.1, 0], hole=0.3)
fig.show()

The pie chart of the distribution between movies and TV shows shows that movies dominate: approximately 69.7% of the titles on the platform are movies. This may reflect a deliberate strategy to cater to a larger audience or to a prevalent demand for cinematic content over TV shows, although the composition of the catalog alone does not establish the cause.
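
The 69.7% figure can be checked from the descriptive statistics in section 4, which report 8790 titles, two values for `type`, and 6126 movies. The short calculation below uses only those reported counts; the variable names are illustrative.

```python
# Counts reported by df.describe(include='all') in section 4
total_titles = 8790
movie_count = 6126
tv_show_count = total_titles - movie_count  # 'type' has only two values

# Shares as percentages, rounded to one decimal place
movie_share = round(100 * movie_count / total_titles, 1)
tv_show_share = round(100 * tv_show_count / total_titles, 1)
print(f"Movies: {movie_share}%, TV shows: {tv_show_share}%")
```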

## **10. Validating Hypotheses and Addressing Research Questions.**
To decide whether to reject the null hypotheses stated in section 3, we use the Kruskal-Wallis test, a non-parametric test, because the data do not follow a normal distribution. The following code runs the tests:
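
Before applying the test to the dataset, the sketch below shows how it behaves on three synthetic groups, one of which has clearly larger values. The groups and the helper `decide` are assumptions made for illustration. The helper phrases the outcome as "reject" or "fail to reject", since a non-significant result does not prove that the null hypothesis is true.

```python
import numpy as np
from scipy.stats import kruskal

def decide(p_value, alpha=0.05):
    """Phrase the outcome of a test at significance level alpha."""
    return 'reject H0' if p_value < alpha else 'fail to reject H0'

rng = np.random.default_rng(1)

# Three synthetic, right-skewed groups; the third has larger typical values
groups = [
    rng.exponential(scale=60, size=200),
    rng.exponential(scale=60, size=200),
    rng.exponential(scale=200, size=200),
]

# Kruskal-Wallis compares the groups based on the ranks of their values
result = kruskal(*groups)
print("H statistic:", result.statistic)
print("p-value:", result.pvalue, "->", decide(result.pvalue))
```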

### **10.1. Research question #1.**

In [None]:
# Filtered DataFrames for movies and TV shows
movies_df = enhance_df[enhance_df['type'] == 'Movie']
tv_shows_df = enhance_df[enhance_df['type'] == 'TV Show']
alpha = 0.05  # Significance level

# Kruskal-Wallis test for movie duration across content ratings
movie_result = kruskal(*[group['duration_min'] for name, group in movies_df.groupby('content_rating')])
print("Kruskal-Wallis test p-value for movie duration across content ratings:", movie_result.pvalue)

# Decide whether to reject the null hypothesis
if movie_result.pvalue < alpha:
    print("Reject the null hypothesis: There are significant differences in movie durations across content ratings.")
else:
    print("Fail to reject the null hypothesis: There is no evidence of significant differences in movie durations across content ratings.")

# Kruskal-Wallis test for TV show duration across content ratings
tv_show_result = kruskal(*[group['duration_seasons'] for name, group in tv_shows_df.groupby('content_rating')])
print("Kruskal-Wallis test p-value for TV show duration across content ratings:", tv_show_result.pvalue)

# Decide whether to reject the null hypothesis, using the TV show result
if tv_show_result.pvalue < alpha:
    print("Reject the null hypothesis: There are significant differences in TV show durations across content ratings.")
else:
    print("Fail to reject the null hypothesis: There is no evidence of significant differences in TV show durations across content ratings.")

### **10.2. Research Question #2.**

In [None]:
# Filter movies before and after the year 2000
movies_before_2000 = enhance_df[(enhance_df['type'] == 'Movie') & (enhance_df['release_year'] < 2000)]
movies_after_2000 = enhance_df[(enhance_df['type'] == 'Movie') & (enhance_df['release_year'] >= 2000)]

# Perform t-test for comparing movie duration before and after 2000
t_stat, p_value = ttest_ind(movies_before_2000['duration_min'], movies_after_2000['duration_min'])
print("T-test p-value for movie duration before and after 2000:", p_value)

# Decide whether to reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in movie durations before and after 2000.")
else:
    print("Fail to reject the null hypothesis: There is no evidence of a significant difference in movie durations before and after 2000.")

### **10.3. Research Question #3.**

No hypothesis test is applied to the variation in popular genres over time. This question is addressed descriptively, by observing trends in the data and identifying changing patterns in the plots of section 9, so the hypotheses stated for it in section 3 remain untested.
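
If a formal test were wanted, one possible option (not used in the original analysis) is a chi-square test of independence between year and genre category. The sketch below runs it on a hypothetical table of counts; the numbers are invented for illustration and do not come from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts of titles per year (rows) and genre category (columns)
counts = pd.DataFrame(
    {'Comedy': [30, 45, 60], 'Horror': [10, 30, 70], 'Kid Friendly': [20, 25, 30]},
    index=[2017, 2018, 2019],
)

# H0: the genre mix is the same in every year
chi2, p_value, dof, expected = chi2_contingency(counts)
print("chi-square:", chi2)
print("degrees of freedom:", dof)
print("p-value:", p_value)
```

One caveat applies to this dataset: a title listed under several genres would be counted in several columns, which violates the independence assumption of the test, so each title would first need to be assigned to a single category.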

### **10.4. Research Question #4.**

The dataset contains no viewing or streaming figures, so hypothesis testing on what is streamed most is not applicable here. Instead, the question is answered by finding the most frequently occurring categories in the catalog.
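
Finding the most frequent categories can be sketched as follows. Because `listed_in` holds several comma-separated genres per title, the example splits the column and counts each genre separately. The four rows are a hypothetical sample in the format of the dataset, not real records.

```python
import pandas as pd

# Hypothetical rows in the format of the 'type' and 'listed_in' columns
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie'],
    'listed_in': ['Dramas, International Movies', 'Comedies, Dramas',
                  'TV Comedies', 'Dramas'],
})

# One row per (title, genre) pair, then count how often each genre occurs
genres = sample.assign(genre=sample['listed_in'].str.split(', ')).explode('genre')
genre_counts = genres['genre'].value_counts()
print(genre_counts.head(3))
```

Counting the full `listed_in` strings instead, as `describe` does, would report the most common combination of genres ("Dramas, International Movies" in this dataset) rather than the most common single genre.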