<a href="https://colab.research.google.com/github/aprocking158/Data-Analysis/blob/main/NETFLIX_DATA_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - NETFLIX DATA ANALYSIS



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual


# **Project Summary -** NETFLIX


* **Subscriber Base:** Netflix has a vast subscriber base and has reported more than 200 million subscribers worldwide.
* **Content Library:** Netflix offers a diverse range of content, including movies, TV shows, documentaries, and original productions. Its library includes licensed content from various studios as well as Netflix Originals.
* **Recommendation Algorithms:** Netflix uses algorithms to recommend content to its users based on their viewing history, ratings, and preferences. These algorithms analyze large amounts of data to personalize the viewing experience for each subscriber.
* **Data Analytics:** Netflix relies heavily on data analytics to understand viewer behavior, preferences, and trends. This data informs content acquisition, production decisions, and marketing strategies.
* **Content Spending:** Netflix invests heavily in content creation and acquisition. It allocates a significant portion of its budget to original content, with the aim of attracting and retaining subscribers.
* **Global Expansion:** Netflix has expanded to numerous countries, tailoring its content offerings and strategies to local markets. This expansion has been supported by data-driven insights into regional preferences and consumption habits.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**"Given a dataset containing information about movies and TV shows, including features such as title, director, cast, country, release year, rating, duration, and genre, the objective is to analyze viewer preferences and trends in the entertainment industry. This analysis aims to uncover insights such as popular genres, regional variations in content preferences, trends over time, and factors influencing viewer ratings. Additionally, the goal is to build predictive models to forecast viewer ratings or predict the success of new content releases based on various features. The insights derived from this analysis and modeling can inform content creators, streaming platforms, and production houses to make informed decisions regarding content creation, distribution, and marketing strategies."**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
netflix=pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
netflix.head(10)

### Dataset Rows & Columns count

In [None]:
netflix.shape

In [None]:
netflix.columns

In [None]:
df = netflix

In [None]:
df.shape

### Dataset Information

In [None]:
# Column names, non-null counts and dtypes
df.info()

#### Duplicate Values

In [None]:
netflix.duplicated().sum()

In [None]:
df = netflix.copy()

#### Missing Values/Null Values

In [None]:
netflix.isnull().sum()

In [None]:
df=df.dropna()
df.shape

Convert Date Time format

In [None]:
# Strip stray leading/trailing whitespace (e.g. " August 4, 2017") before parsing
df['date_added'] = df['date_added'].str.strip()
df['date_added'] = pd.to_datetime(df['date_added'], format="%B %d, %Y", errors='coerce')
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
# Rows whose date could not be parsed end up with year 0
df['year_added'] = df['year_added'].fillna(0).astype(int)


In [None]:
df.head(10)

### What did you know about your dataset?


The dataset contains information about TV shows and movies available on a streaming platform. Each row represents one title and includes the show or movie ID, type (TV show or movie), title, director (if applicable), cast, country of origin, date added to the platform, release year, rating, duration, categories or genres, and a brief description.

## ***2. Understanding Your Variables***

In [None]:
# Dataset
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
netflix.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
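# Write your code to make your dataset analysis ready.
# Keep the cleaned DataFrame from the steps above: rebuilding it from `netflix`
# would discard the dropped rows and the derived date columns.
df = df.reset_index(drop=True)
print(df.head())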

### What all manipulations have you done and insights you found?


Here are the manipulations performed on the dataset:

* Loaded the Dataset: The dataset was loaded into a pandas DataFrame for further analysis.
* Checked for Duplicates and Missing Values: Duplicates were counted with `duplicated().sum()` and missing values per column with `isnull().sum()`. Rows containing missing values were then removed with `dropna()`.
* Converted Date Column: The 'date_added' column was converted from string to datetime type to facilitate further analysis.
* Extracted Year and Month: Year and month were extracted from the 'date_added' column and stored in separate columns named 'year_added' and 'month_added'.
* Kept the Original Date Column: The 'date_added' column remains in the DataFrame alongside 'year_added' and 'month_added'.
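The date-handling steps described above can be sketched on a tiny example. This is a minimal illustration, not the notebook's exact pipeline: the three-row `sample` frame and the helper name `add_date_parts` are hypothetical, and only the `date_added` format is taken from the code above.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real Netflix file
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": [" August 4, 2017", "January 1, 2020", None],
})


def add_date_parts(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with parsed dates plus year_added and month_added columns."""
    # Drop rows without a date, then work on a copy so the input is untouched
    out = frame.dropna(subset=["date_added"]).copy()
    # Strip stray whitespace so every value matches the "%B %d, %Y" format
    out["date_added"] = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    out["year_added"] = out["date_added"].dt.year
    out["month_added"] = out["date_added"].dt.month
    return out


wrangled = add_date_parts(sample)
print(wrangled)
```

Stripping whitespace for the whole column avoids having to patch individual values one at a time.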

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 10))
# Pass the column by name so seaborn counts the titles per type
sns.countplot(x='type', data=netflix)
plt.title('Type')
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is useful for comparing categories, such as the number of TV shows versus movies or the number of titles per country.

##### 2. What is/are the insight(s) found from the chart?

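The chart compares the number of movie titles with the number of TV show titles in the catalog. If movies form the majority, the catalog is weighted towards movies. The dataset contains no viewing data, so the chart describes what is offered rather than what is watched.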

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the composition of the catalog can lead to a positive business impact for the streaming platform. Here's how:

Content Acquisition and Production: Knowing which type of content dominates the catalog helps the platform decide where to invest next, whether by acquiring rights to more titles of the dominant type or by filling gaps in the other. This targeted investment can broaden the audience and increase viewership, ultimately driving subscription revenue.

#### Chart - 2
Rating of shows and movies

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 8))
# Draw the plot once and keep the axes so the labels can be adjusted
ax = sns.countplot(x='rating', data=netflix)
# Rotate the rating labels so they do not overlap
ax.tick_params(axis='x', rotation=90)
plt.title('Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is suitable for showing how the titles are distributed across the rating categories.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the Catalog: By visualizing the distribution of content (maturity) ratings, the platform can see which ratings are most common among its titles. This understanding can guide decisions related to content acquisition, production, and promotion, so that the catalog matches the audiences the platform wants to serve.

Targeted Advertising: Insights into the distribution of ratings can also be valuable for advertisers looking to target specific audience segments. Advertisers may prefer to place ads alongside content rated for their target demographic, which can lead to higher advertising revenue for the platform.

#### Chart - 3 Relation between Type and Rating

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,8))
sns.countplot(x='rating',hue='type',data=netflix)
plt.title('Relation between Type and Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot with hue differentiation was selected because it effectively visualizes the relationship between two categorical variables, here the rating and the type of content.

##### 2. What is/are the insight(s) found from the chart?

Ratings by Content Type: This chart shows how the ratings are distributed within movies and within TV shows, which makes it possible to see whether a rating is more common for one type of content than the other.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from visualizing the relationship between the type of content (TV show or movie) and its rating can help create a positive business impact for the streaming platform.

Content Strategy Optimization: Understanding the distribution of ratings for TV shows and movies can inform content acquisition and production strategies. For example, if certain ratings are more prevalent in TV shows than in movies (or vice versa), the platform can focus on acquiring or producing content that fills the gaps. This targeted approach increases the likelihood of offering content that resonates with viewers, leading to higher engagement and satisfaction.
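The counts behind a grouped count plot can also be read as a table. The sketch below uses `pd.crosstab` on a small made-up frame; the five sample rows are hypothetical and only the column names `type` and `rating` come from the dataset.

```python
import pandas as pd

# Hypothetical sample; only the column names match the Netflix dataset
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-MA", "PG", "TV-MA", "TV-14", "TV-MA"],
})

# Rows are ratings, columns are content types, cells are title counts
rating_by_type = pd.crosstab(sample["rating"], sample["type"])

# Share of each rating within a content type (each column sums to 1)
rating_share = pd.crosstab(sample["rating"], sample["type"], normalize="columns")

print(rating_by_type)
print(rating_share.round(2))
```

Normalizing by column makes the two content types comparable even when one of them has many more titles.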

#### Chart - 4 Pie-chart for the Type: Movie and TV Shows

In [None]:
# Chart - 4 visualization code
size = netflix['type'].value_counts()
# Take the labels from the data so they always match the order of the counts
labels = size.index
colors = plt.cm.Wistia(np.linspace(0, 1, len(size)))
# Pull every slice except the first slightly away from the centre
explode = [0] + [0.1] * (len(size) - 1)
plt.figure(figsize=(9, 9))
plt.pie(size, labels=labels, colors=colors, explode=explode, shadow=True, startangle=90)
plt.title('Distribution of Type', fontsize=25)
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Comparison of Categories: A pie chart is effective for comparing the proportions of a small number of categories. In this case, it shows at a glance the relative share of movies and TV shows in the platform's content library.

Clear Visualization: The colors and labels in the pie chart distinguish the categories, so it is easy to see which category dominates or whether the distribution is relatively balanced.

##### 2. What is/are the insight(s) found from the chart?

Proportion of Content Types: The pie chart reveals the relative proportion of movies and TV shows in the platform's content library. Viewers can see at a glance whether one type of content dominates over the other or if the distribution is relatively balanced.
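The proportions shown by a pie chart can be computed directly. A minimal sketch with a hypothetical four-title series follows; `value_counts(normalize=True)` returns shares, and reading the labels from the index keeps them aligned with the counts.

```python
import pandas as pd

# Hypothetical sample of the 'type' column
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

counts = types.value_counts()                     # absolute number of titles
shares = types.value_counts(normalize=True) * 100  # percentage of the catalog
labels = list(counts.index)                        # labels in the same order as counts

print(counts)
print(shares.round(1))
```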

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Marketing and Promotion: Insights into the distribution of content types can inform targeted marketing and promotional efforts. For example, if TV shows are underrepresented compared to movies, the platform may choose to promote its TV show catalog more prominently to attract viewers interested in episodic content.

#### Chart - 5 Pie-chart for Rating

In [None]:
# Chart - 5 visualization code
netflix['rating'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,figsize=(10,8))
plt.show()

#### Chart - 6 WordCloud

In [None]:
# Chart - 6 visualization code
from wordcloud import WordCloud

Country

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

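# Split multi-country entries (e.g. "United States, India") and count each country
country_counts = df['country'].dropna().str.split(',').explode().str.strip().value_counts()
country_counts = country_counts[country_counts.index != '']

# Generate the word cloud from the counts so multi-word names stay intact
wordcloud = WordCloud(width=1920, height=1080).generate_from_frequencies(country_counts.to_dict())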

# Plot the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


##### 1. Why did you pick the specific chart?

Visual Representation of Frequency: Word clouds effectively display the frequency of words or, in this case, countries in a dataset. The size of each country name is proportional to its frequency, making it easy to identify the most commonly represented countries.
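For the size of a name to reflect how often a country appears, the counting has to treat each country as one unit. The sketch below contrasts counting single words with counting comma-separated names; the four sample values are hypothetical, and the naive word count only approximates what a word cloud's own tokenizer would do with space-joined text.

```python
from collections import Counter

# Hypothetical values of the 'country' column; some titles list several countries
country_values = ["United States", "United States, India", "India", "United Kingdom"]

# Naive approach: join on spaces and count single words
word_counts = Counter(" ".join(country_values).replace(",", "").split())

# Splitting on commas keeps each country name whole
country_counts = Counter(
    name.strip() for value in country_values for name in value.split(",")
)

print(word_counts.most_common())
print(country_counts.most_common())
```

In the naive count, "United" is inflated because it is shared by two different countries, while the comma-based count attributes each appearance to the right country.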

##### 2. What is/are the insight(s) found from the chart?

Content Origin: The most frequently represented countries show where the catalog's titles are produced. Because the 'country' column records the country of production rather than where viewers are located, it is only an indirect signal of the audience, but it can still inform marketing and campaigns tailored to specific regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Marketing and Outreach: By understanding which countries are most represented in the dataset, businesses can tailor their marketing and outreach efforts to specific regions. This targeted approach can improve the effectiveness of marketing campaigns and increase engagement with potential customers in those countries.

#### Chart - 7 Cast in the Show

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

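# Split the comma-separated cast lists and count how often each person appears
cast_counts = df['cast'].dropna().str.split(',').explode().str.strip().value_counts()
cast_counts = cast_counts[cast_counts.index != '']

# Generate the word cloud from the counts so full names stay intact
wordcloud = WordCloud(width=1920, height=1080).generate_from_frequencies(cast_counts.to_dict())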

# Plot the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


##### 1. Why did you pick the specific chart?

Visual Representation of Frequency: Word clouds effectively display the frequency of words or, in this case, cast members in a dataset. The size of each cast member's name is proportional to their frequency, making it easy to identify the most frequently appearing cast members.

##### 2. What is/are the insight(s) found from the chart?

Prolific Actors/Actresses: Cast members with larger names appear in more titles in the dataset. This measures how prolific they are within the catalog rather than how popular they are with viewers, but it can still help identify actors or actresses who may have a significant following.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Content Acquisition and Production: Identifying frequently appearing cast members can inform content acquisition and production decisions. Production companies and streaming platforms can leverage the popularity of these actors/actresses to attract audiences and increase viewership of their productions.

#### Chart - 8 Directors

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

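# Split entries with several directors and count how often each director appears
director_counts = df['director'].dropna().str.split(',').explode().str.strip().value_counts()
director_counts = director_counts[director_counts.index != '']

# Generate the word cloud from the counts so full names stay intact
wordcloud = WordCloud(width=1920, height=1080).generate_from_frequencies(director_counts.to_dict())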

# Plot the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


##### 1. Why did you pick the specific chart?

Word clouds are visually appealing and engaging, making them suitable for presentations, reports, and data visualization dashboards. The use of varying font sizes and colors enhances the visual impact of the chart and draws viewers' attention to the most prominent terms.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Content Acquisition and Production: Identifying frequently appearing directors can inform content acquisition and production decisions. Production companies and streaming platforms can leverage the popularity and track record of these directors to attract audiences and increase viewership of their productions.

Audience Engagement and Retention: Featuring well-known directors in promotional materials and marketing campaigns can increase audience engagement and retention. Fans of these directors are more likely to watch content associated with their favorite filmmakers, leading to higher viewer satisfaction and loyalty.

#### Chart - 9 Categories

In [None]:
# Chart - 9 visualization code
plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
                          background_color='white',
                          width=1920,
                          height=1080
                         ).generate(" ".join(df.listed_in))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('category.png')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Exclude non-numeric columns (include='number' keeps every integer and float dtype)
numeric_df = df.select_dtypes(include='number')

# Compute the correlation matrix
corr = numeric_df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap='coolwarm', annot=True, fmt=".2f", linewidths=0.5)

# Add title
plt.title('Correlation Heatmap')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Comprehensive Overview: A correlation heatmap provides a comprehensive overview of the pairwise correlations between all numerical variables in a dataset. This allows viewers to quickly identify which variables are positively correlated, negatively correlated, or uncorrelated with each other.
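The numbers a correlation heatmap displays come from a correlation matrix, and the mask hides the redundant half of it. The sketch below shows both on a small made-up frame; the values are hypothetical and only the column names mirror the numeric columns created earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with the same names as in the notebook
sample = pd.DataFrame({
    "release_year": [2015, 2016, 2017, 2018, 2019],
    "year_added": [2016, 2017, 2018, 2019, 2020],
    "month_added": [12, 9, 6, 3, 1],
})

# Pairwise Pearson correlations between the numeric columns
corr = sample.corr()

# True marks cells to hide: the diagonal and everything above it
mask = np.triu(np.ones(corr.shape, dtype=bool))

print(corr.round(2))
print(mask)
```

Because the matrix is symmetric, the lower triangle carries all the information. With a single numeric column the mask would hide the only cell, so the heatmap needs at least two numeric columns to show anything.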

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

* Hypothetical Statement 1: The duration of movies has increased over the years.
* Hypothetical Statement 2: There is a significant difference in the duration of movies compared to TV shows.
* Hypothetical Statement 3: The addition of content to Netflix has become more frequent in recent years.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Statement: The duration of movies has increased over the years.

* Null Hypothesis (H0): Movie duration has no linear relationship with release year (the regression slope is zero).
* Alternative Hypothesis (H1): The duration of movies has increased over the years (the regression slope is positive).

#### 2. Perform an appropriate statistical test.

In [None]:
import statsmodels.api as sm
import pandas as pd

# Filter the dataset for movies; copy so the assignments below do not touch df
movies_df = df[df['type'] == 'Movie'].copy()

# Convert release_year to numerical values
movies_df['release_year'] = movies_df['release_year'].astype(int)

# Extract numeric part from duration (minutes) and convert to numerical values
movies_df['duration'] = movies_df['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Fit a linear regression model, skipping rows with a missing duration
X = sm.add_constant(movies_df['release_year'])
y = movies_df['duration']
model = sm.OLS(y, X, missing='drop').fit()

# Perform hypothesis test for the coefficient of release_year
print(model.summary())


##### Which statistical test have you done to obtain P-Value?

The p-value comes from the t-test on the coefficient of 'release_year' in an ordinary least squares (OLS) linear regression, as reported by `model.summary()`. It is the probability of observing a relationship at least as strong as the one in the sample, assuming there is no true relationship between release year and movie duration in the population. The summary reports a two-sided p-value, whereas H1 is directional.

##### Why did you choose the specific statistical test?

Linear regression is appropriate when we want to investigate the relationship between one independent variable (predictor) and one dependent variable (response) that is assumed to be continuous. In this case, we are exploring how the release year (independent variable) may be related to the duration of movies (dependent variable).
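A directional hypothesis such as "durations have increased" calls for a one-sided p-value. The sketch below shows one way to obtain it from a simple linear regression with `scipy.stats.linregress`. The data is synthetic, generated with a known upward trend, so the numbers say nothing about the Netflix dataset.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: a true upward trend of 0.5 minutes per year plus noise
rng = np.random.default_rng(0)
years = np.arange(2000, 2020)
durations = 90 + 0.5 * (years - 2000) + rng.normal(0, 2, size=years.size)

result = linregress(years, durations)

# linregress reports a two-sided p-value for "slope != 0".
# For H1 "slope > 0", halve it when the estimated slope is positive.
if result.slope > 0:
    one_sided_p = result.pvalue / 2
else:
    one_sided_p = 1 - result.pvalue / 2

print(f"slope={result.slope:.3f}, one-sided p={one_sided_p:.4g}")
```

The same conversion applies to the two-sided p-value that the OLS summary reports for a coefficient.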

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothetical Statement 2: There is a significant difference in the duration of movies compared to TV shows.

* Null Hypothesis (H0): The mean duration of movies equals the mean duration of TV shows.
* Alternative Hypothesis (H1): The mean duration of movies differs from the mean duration of TV shows.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Extract the numeric part of 'duration' for every title.
# Caveat: movies are recorded in minutes and TV shows in seasons, so the units differ.
duration_num = df['duration'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)

# Split the durations into the two groups named in the hypothesis
group1_duration = duration_num[df['type'] == 'Movie'].dropna()
group2_duration = duration_num[df['type'] != 'Movie'].dropna()

# Perform independent samples t-test (Welch's version, unequal variances)
t_statistic, p_value = ttest_ind(group1_duration, group2_duration, equal_var=False)

# Print the p-value
print("P-value:", p_value)


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
from sklearn.impute import SimpleImputer

# Instantiate SimpleImputer with strategy='mean'
imputer = SimpleImputer(strategy='mean')

# Fit the imputer to the dataset
imputer.fit(X)

# Transform and replace missing values in the dataset
X_imputed = imputer.transform(X)


#### What all missing value imputation techniques have you used and why did you use those techniques?

In missing value imputation, several techniques can be employed based on the nature of the data and the missingness mechanism. The code above uses `SimpleImputer` with `strategy='mean'`.

Mean/Median/Mode Imputation:

* Technique: Replace missing values with the mean, median, or mode of the column.
* Rationale: This technique is simple and keeps every row. Mean imputation preserves the column mean but reduces its variance. The median is more robust for skewed numerical variables, and the mode suits categorical variables with a dominant category.
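The three strategies can be compared side by side with plain pandas. This is a minimal sketch on a hypothetical four-row frame, shown as an alternative to `SimpleImputer`; the column names mirror the dataset but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value per column
sample = pd.DataFrame({
    "duration": [90.0, 100.0, np.nan, 140.0],
    "rating": ["PG", "TV-MA", "TV-MA", np.nan],
})

# Numerical column: compare mean and median imputation
mean_filled = sample["duration"].fillna(sample["duration"].mean())
median_filled = sample["duration"].fillna(sample["duration"].median())

# Categorical column: fill with the most frequent category (the mode)
mode_filled = sample["rating"].fillna(sample["rating"].mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

The mean is pulled towards the largest value (140) while the median is not, which is why the median is often preferred for skewed columns.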

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Demonstrated on synthetic data; the Netflix columns are not used here
np.random.seed(0)
data = pd.DataFrame({
    'Value': np.random.normal(loc=100, scale=20, size=100)
})
# Introduce outliers
data.iloc[0:5, 0] = 300
data.iloc[95:100, 0] = 50

# Visualize the data distribution before outlier treatment
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.hist(data['Value'], bins=20, color='skyblue', edgecolor='black')
plt.title('Data Distribution Before Outlier Treatment')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Outlier treatment: Winsorization
from scipy.stats.mstats import winsorize
data['Value_winsorized'] = winsorize(data['Value'], limits=[0.05, 0.05])

# Visualize the data distribution after Winsorization
plt.figure(figsize=(8, 6))
plt.hist(data['Value_winsorized'], bins=20, color='lightgreen', edgecolor='black')
plt.title('Data Distribution After Winsorization')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

The code above demonstrates winsorization on synthetic data: with `limits=[0.05, 0.05]`, the lowest 5% and the highest 5% of values are replaced by the nearest remaining value, which limits the influence of extremes without dropping rows. More generally, the choice of outlier treatment technique depends on the nature of the data, the presence of outliers, the objectives of the analysis, and the assumptions underlying the statistical methods used. It is essential to evaluate the impact of the treatment on the dataset's characteristics and choose the technique that best suits the context of the analysis.
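A common alternative to winsorization is the interquartile range (IQR) rule, which flags values far outside the middle half of the data. The sketch below is an illustration on six hypothetical duration values, not a treatment applied in this notebook.

```python
import pandas as pd

# Hypothetical movie durations in minutes, with one extreme value
values = pd.Series([90, 95, 100, 105, 110, 300], name="duration")

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Capping keeps every row but pulls outliers back to the fence values
capped = values.clip(lower=lower, upper=upper)

print(is_outlier.tolist())
print(capped.tolist())
```

Flagging and capping are separate decisions: the flags can be inspected first, and capping applied only if the extreme values turn out to be errors rather than genuine long titles.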

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
import pandas as pd

# Sample DataFrame with categorical columns (toy data, kept separate from the Netflix df)
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green'],
    'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small'],
    'Shape': ['Square', 'Circle', 'Triangle', 'Square', 'Circle']
}
sample_df = pd.DataFrame(data)

# One-hot encoding (for nominal categorical variables)
df_onehot = pd.get_dummies(sample_df, columns=['Color', 'Size', 'Shape'])

# Label encoding (integer codes in alphabetical order, so Small < Medium < Large is not preserved)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
sample_df['Size_label_encoded'] = label_encoder.fit_transform(sample_df['Size'])

# Ordinal encoding (for ordinal categorical variables with predefined order)
size_order = ['Small', 'Medium', 'Large']
sample_df['Size_ordinal_encoded'] = sample_df['Size'].apply(lambda x: size_order.index(x))

# Display the encoded DataFrames
print("One-hot encoded DataFrame:")
print(df_onehot)
print("\nLabel and ordinal encoded DataFrame:")
print(sample_df)


#### What all categorical encoding techniques have you used & why did you use those techniques?

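In the provided code snippet, I demonstrated the use of three categorical encoding techniques on sample data: one-hot encoding, label encoding, and ordinal encoding. Here's a brief overview of each technique:

* One-hot encoding creates one binary column per category and suits nominal variables that have no natural order.
* Label encoding assigns each category an integer code in alphabetical order, so the codes carry no meaningful ranking.
* Ordinal encoding assigns integer codes that follow a predefined order, such as Small, Medium, Large.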
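The difference between alphabetical codes and order-preserving codes is easy to miss, so here is a small pandas-only sketch. The three-row frame is hypothetical, and `pd.Categorical` is used here as a stand-in for the scikit-learn encoders.

```python
import pandas as pd

# Hypothetical sample with one nominal and one ordinal column
sample = pd.DataFrame({
    "Color": ["Red", "Blue", "Green"],
    "Size": ["Small", "Large", "Medium"],
})

# One-hot encoding: one indicator column per color
onehot = pd.get_dummies(sample["Color"], prefix="Color")

# Ordinal encoding: codes follow the explicit order Small < Medium < Large
size_order = ["Small", "Medium", "Large"]
sample["Size_ordinal"] = pd.Categorical(
    sample["Size"], categories=size_order, ordered=True
).codes

# Alphabetical codes, which is what a label encoder would assign
sample["Size_alphabetical"] = sample["Size"].astype("category").cat.codes

print(onehot)
print(sample)
```

With alphabetical codes, "Large" receives the smallest code, so a model that treats the codes as magnitudes would see the sizes in the wrong order.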

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The dataset includes information about a diverse range of movies and TV shows from different countries, genres, and release years. It records details such as the title, director, cast, country, release year, rating, duration, and genre of each movie or TV show.

* The dataset appears to be well-structured and comprehensive, providing ample information for analysis and modeling tasks.
* Further exploration and analysis of the dataset could uncover valuable insights regarding trends in movie and TV show preferences, popular genres, regional preferences, and more.
* However, to draw more robust conclusions and insights, a deeper analysis of the entire dataset, including additional features and a larger sample size, would be necessary.
* Overall, the dataset presents an opportunity for conducting various analyses and building predictive models to understand viewer preferences and trends in the entertainment industry.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***