<a href="https://colab.research.google.com/github/fadakenitin/Netflix-Movies-and-TV-Shows-Clustering-Unsupervised-/blob/main/Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows Clustering






##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name -** Nitin Fadake


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/fadakenitin/Netflix-Movies-and-TV-Shows-Clustering-Unsupervised-

# **Problem Statement**


The objective of this project is to develop a machine learning model that can effectively cluster Netflix movies and TV shows based on their content attributes. The clustering algorithm should be able to group similar titles together, enabling improved recommendation systems and enhancing the user experience by providing personalized suggestions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import io

### Dataset Loading

In [None]:
data=pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')


### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

data.shape

In [None]:
data.columns

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isna().sum()

In [None]:
# Visualizing the missing values

import missingno as msno

# Visualize missing values using a matrix
msno.matrix(data)



### What did you know about your dataset?

The dataset contains 7,787 rows and 12 columns. Here are the details of the columns with null values:

- director: 2,389 null values
- cast: 718 null values
- country: 507 null values
- date_added: 10 null values
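
The null counts listed above can also be reported as a share of all rows, which makes columns easier to compare. The following is a minimal sketch on a toy frame; the `toy` data and the `null_summary` name are assumptions for illustration and are not the Netflix dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only)
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["x", "y", np.nan, "z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Count and percentage of nulls per column
null_count = toy.isna().sum()
null_summary = pd.DataFrame({
    "null_count": null_count,
    "null_pct": (null_count / len(toy) * 100).round(2),
})

# Keep only columns that actually have nulls, largest first
null_summary = null_summary[null_summary["null_count"] > 0]
null_summary = null_summary.sort_values("null_count", ascending=False)
print(null_summary)
```

Applied to the notebook's `data` frame, the same few lines would give the percentage behind each count in the list.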

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

data.columns

In [None]:
# Dataset Describe

data.describe(include='all').T

### Variables Description



1. show_id: An identifier for each Netflix movie or TV show in the dataset.
2. type: Indicates whether the entry is a "TV Show" or a "Movie".
3. title: The title of the Netflix movie or TV show.
4. director: The director(s) of the movie or TV show.
5. cast: The actors/actresses involved in the movie or TV show.
6. country: The country of production or origin of the movie or TV show.
7. date_added: The date when the movie or TV show was added to Netflix.
8. release_year: The year when the movie or TV show was originally released.
9. rating: The content rating assigned to the movie or TV show.
10. duration: The duration of the movie or TV show (in minutes for movies, or the number of seasons for TV shows).
11. listed_in: The categories or genres to which the movie or TV show belongs. This field may contain multiple categories separated by commas.
12. description: A brief description or synopsis of the movie or TV show.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

data.nunique()

In [None]:
data.dtypes

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Changing the dtype of the date_added column
data['date_added']=pd.to_datetime(data['date_added'])

In [None]:
# Extract day , month and year from the 'date_added'
# There are some null values in 'date_added', so fillna(0) is used to fill them before casting to int.

data['day_added'] = data['date_added'].dt.day.fillna(0).astype(int)
data['month_added'] = data['date_added'].dt.month.fillna(0).astype(int)
data['year_added'] = data['date_added'].dt.year.fillna(0).astype(int)

In [None]:
# All the information has been extracted from the 'date_added' column, so there is no need to retain it
data.drop('date_added',axis=1,inplace=True)

In [None]:
data.dtypes

In [None]:
 # removing unnecessary column

data.drop(['show_id'],axis=1,inplace=True)

In [None]:
data.columns

In [None]:
'''Some movies / TV shows were filmed in multiple countries or have multiple genres associated with them.
To simplify the analysis, we keep only the primary (first-listed) country where the respective movie / TV show was filmed,
and likewise only its primary genre.'''

# Define the function to extract first i.e. primary element from the string
def extract_first_element(string):
    if isinstance(string, str):
        return string.split(',')[0]
    else:
        return string


In [None]:
data['country']=data['country'].apply(extract_first_element)
data['listed_in']=data['listed_in'].apply(extract_first_element)

In [None]:
data.head()

In [None]:
data.shape

In [None]:
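# Extract the number once as a float Series; match 'Season' so that "1 Season" is captured as well as "2 Seasons"
duration_value = data['duration'].str.extract(r'(\d+)', expand=False).astype(float)
data['duration_min'] = duration_value.where(data['duration'].str.contains('min', na=False))
data['duration_seasons'] = duration_value.where(data['duration'].str.contains('Season', na=False))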

In [None]:
data.drop('duration',axis=1,inplace=True)

### What all manipulations have you done and insights you found?

1. Converted the 'date_added' column to datetime format using pd.to_datetime()  function.
2. Extracted the day, month, and year from the 'date_added' column and created separate columns 'day_added', 'month_added', and 'year_added'.
3. Dropped the 'date_added' column using the data.drop() function, as it is no longer needed for analysis.
4. Removed the 'show_id' column using the data.drop() function, as it is unnecessary.
5. Extracted the primary country and primary genre from the 'country' and 'listed_in' columns, respectively, using the extract_first_element() function.
6. Transformed the 'duration' column by extracting the numeric value and converting it to float. Created two new columns, 'duration_min' for movie durations in minutes and 'duration_seasons' for TV show durations in seasons.
7. Dropped the original 'duration' column using the data.drop() function.

* Insights from these manipulations :

- The dataset now includes separate columns for day, month, and year of content addition on Netflix, which can be used for temporal analysis.
- The 'country' column now contains only the primary country where the content was filmed, simplifying the analysis of country-wise distribution.
- The 'listed_in' column now represents the primary genre of the content, allowing genre-based analysis.
- The 'duration_min' and 'duration_seasons' columns provide separate duration information for movies and TV shows, respectively.
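
The duration split described in step 6 can be sketched on a few made-up values. The `toy` frame below is an assumption for illustration; the pattern `Season` is used so that both "1 Season" and "3 Seasons" match.

```python
import pandas as pd

# Made-up duration strings in the same style as the dataset
toy = pd.DataFrame({"duration": ["90 min", "1 Season", "3 Seasons"]})

# Pull out the leading number as a numeric Series
value = pd.to_numeric(toy["duration"].str.extract(r"(\d+)", expand=False))

# Keep the number only in the column that matches its unit; the other column stays NaN
toy["duration_min"] = value.where(toy["duration"].str.contains("min", na=False))
toy["duration_seasons"] = value.where(toy["duration"].str.contains("Season", na=False))
print(toy)
```

Each row ends up with exactly one of the two columns populated, which is what allows movies and TV shows to be analysed separately later.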

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

* Hypothesis 1: The distribution of movie durations on Netflix follows a normal distribution.
*  Hypothesis 2: The average ratings of movies and TV shows on Netflix are significantly different.
* Hypothesis 3: The average release years of movies and TV shows on Netflix are significantly different.





### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The movie durations on Netflix follow a normal distribution.

Alternative Hypothesis (HA): The movie durations on Netflix do not follow a normal distribution.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# from scipy import stats

# # Assuming the DataFrame 'data' has a 'duration_min' column containing the movie durations

# # Extract the movie durations as a separate series
# movie_durations = data[data['type'] == 'Movie']['duration_min']

# # Perform the Shapiro-Wilk test
# statistic, p_value = stats.shapiro(movie_durations)

# # Set the significance level
# alpha = 0.05

# # Compare the p-value with the significance level
# if p_value < alpha:
#     print("Reject the null hypothesis")
#     print("The movie durations on Netflix do not follow a normal distribution.")
# else:
#     print("Fail to reject the null hypothesis")
#     print("The movie durations on Netflix follow a normal distribution.")

##### Which statistical test have you done to obtain P-Value?

The Shapiro-Wilk test

##### Why did you choose the specific statistical test?

The specific statistical test chosen, which is the Shapiro-Wilk test, was selected based on the nature of the hypothesis being tested and the assumptions of the test.

In this case, the hypothesis being tested is whether the movie durations on Netflix follow a normal distribution. The data consists of a sample of movie durations.

The Shapiro-Wilk test is a widely used normality test. Its null hypothesis is that the sample was drawn from a normal distribution, and it is sensitive to departures from normality. It is best suited to small and moderate sample sizes: for samples larger than 5,000 observations, SciPy's `stats.shapiro` warns that the reported p-value may not be accurate, and with large samples even minor deviations from normality tend to be flagged as significant.
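
To make the decision rule concrete, here is a minimal sketch of the Shapiro-Wilk test on two synthetic samples. The samples are generated for illustration only and say nothing about the actual movie durations; the 0.05 threshold mirrors the `alpha` used in the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples (NOT the Netflix data): one normal, one clearly right-skewed
normal_sample = rng.normal(loc=100, scale=15, size=500)
skewed_sample = rng.exponential(scale=30, size=500)

alpha = 0.05
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    # H0: the sample comes from a normal distribution
    statistic, p_value = stats.shapiro(sample)
    verdict = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{name}: W={statistic:.3f}, p={p_value:.4f} -> {verdict}")
```

The skewed sample is rejected decisively, while the verdict for the normal sample can still go either way in about 5% of random draws, which is exactly what a significance level of 0.05 means.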

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): The average ratings of movies and TV shows on Netflix are equal.

* Alternative Hypothesis (HA): The average ratings of movies and TV shows on Netflix are significantly different.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# from scipy import stats

# # Assuming the DataFrame 'data' has a 'rating' column containing the ratings data

# # Convert the 'rating' column to numeric data type
# # (caution: non-numeric rating labels become NaN with errors='coerce')
# data['rating'] = pd.to_numeric(data['rating'], errors='coerce')

# # Split the ratings into two groups: movies and TV shows
# movies_ratings = data[data['type'] == 'Movie']['rating']
# tvshow_ratings = data[data['type'] == 'TV Show']['rating']

# # Perform a two-sample t-test
# t_stat, p_value = stats.ttest_ind(movies_ratings, tvshow_ratings, equal_var=False)

# # Set the significance level
# alpha = 0.05

# # Compare the p-value with the significance level
# if p_value < alpha:
#     print("Reject the null hypothesis")
#     print("The average ratings of movies and TV shows on Netflix are significantly different.")
# else:
#     print("Fail to reject the null hypothesis")
#     print("There is no significant difference in the average ratings of movies and TV shows on Netflix.")


##### Which statistical test have you done to obtain P-Value?

The two-sample t-test

##### Why did you choose the specific statistical test?

The specific statistical test chosen, which is the two-sample t-test, was selected based on the nature of the hypothesis being tested and the characteristics of the data.

In this case, the hypothesis being tested is whether the average ratings of movies and TV shows on Netflix are equal or significantly different. The data consists of independent samples of ratings from movies and TV shows.

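The two-sample t-test is commonly used when comparing the means of two independent groups. It assesses whether the observed difference in means is statistically significant or could plausibly have occurred by chance. The code above passes `equal_var=False`, which selects Welch's variant: it assumes approximately normal data (or large samples) but does not require the two groups to have equal variances. Note that the test needs a numeric variable, whereas the 'rating' column holds content-rating labels; `pd.to_numeric(..., errors='coerce')` turns such labels into NaN, so the ratings would first have to be mapped to an ordinal scale, or compared with a test designed for categorical data.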
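
Because 'rating' is stored as a label rather than a number, one alternative to the t-test is a chi-square test of independence between 'type' and 'rating'. This is a swapped-in technique, not the test used in this notebook, and the sketch below runs on made-up rows; the rating labels in `toy` are assumptions for illustration.

```python
import pandas as pd
from scipy import stats

# Toy data (illustrative only): content type and its rating label
toy = pd.DataFrame({
    "type":   ["Movie"] * 6 + ["TV Show"] * 6,
    "rating": ["PG-13", "PG-13", "R", "R", "R", "TV-MA",
               "TV-MA", "TV-MA", "TV-MA", "TV-14", "TV-14", "PG-13"],
})

# Contingency table of counts: rows are types, columns are rating labels
table = pd.crosstab(toy["type"], toy["rating"])

# H0: the rating distribution is the same for movies and TV shows
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")
```

With only twelve rows the expected counts are too small for the result to be trustworthy; the point is the shape of the workflow, which would apply unchanged to `data['type']` and `data['rating']`.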

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): The average release years of movies and TV shows on Netflix are equal.

* Alternative Hypothesis (HA): The average release years of movies and TV shows on Netflix are significantly different.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# import pandas as pd
# from scipy import stats

# # Separate the release years of movies and TV shows
# movie_release_years = data[data['type'] == 'Movie']['release_year']
# tvshow_release_years = data[data['type'] == 'TV Show']['release_year']

# # Perform a two-sample t-test assuming equal variances
# t_stat, p_value = stats.ttest_ind(movie_release_years, tvshow_release_years, equal_var=True)

# # Set the significance level
# alpha = 0.05

# # Check if the p-value is less than the significance level
# if p_value < alpha:
#     print("Reject the null hypothesis")
#     print("The average release years of movies and TV shows on Netflix are significantly different.")
# else:
#     print("Fail to reject the null hypothesis")
#     print("The average release years of movies and TV shows on Netflix are equal.")


##### Which statistical test have you done to obtain P-Value?

Two-sample t-test

##### Why did you choose the specific statistical test?

A two-sample t-test assuming equal variances was chosen because we are comparing the average release years of two independent groups: movies and TV shows on Netflix. The t-test is suitable for comparing means between two groups, and the equal-variance (pooled) form is appropriate only when the two populations have similar variances; if that is doubtful, Welch's variant (`equal_var=False`) is the safer choice.
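
Whether the equal-variance assumption holds can be checked before choosing the form of the t-test. The sketch below uses Levene's test for that purpose on synthetic release years; the data, the group sizes, and the 0.05 cut-off are assumptions for illustration and not results from the Netflix dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic release years (NOT the Netflix data) with deliberately different spreads
movie_years = rng.normal(loc=2012, scale=9, size=400)
show_years = rng.normal(loc=2016, scale=4, size=200)

# Levene's test: H0 is that both groups have equal variance
levene_stat, levene_p = stats.levene(movie_years, show_years)

# Use the pooled t-test only when equal variances are plausible; otherwise Welch's
equal_var = bool(levene_p >= 0.05)
t_stat, p_value = stats.ttest_ind(movie_years, show_years, equal_var=equal_var)
print(f"Levene p={levene_p:.4g}, equal_var={equal_var}, t={t_stat:.2f}, p={p_value:.4g}")
```

Here the spreads differ enough that Levene's test rejects equal variances, so the comparison falls back to Welch's t-test.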

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
data.isna().sum()

In [None]:
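# Note: the right-hand side lists 'country' before 'cast', so this also swaps those two columns; the swap is reverted under "Removing Punctuations"
data[['director','cast','country']]=data[['director','country','cast']].fillna('Unknown')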

In [None]:
# Fill null values in the 'rating' column with the mode (most frequent value)
data['rating'] = data['rating'].fillna(data['rating'].mode()[0])

In [None]:
# The 10 rows with a missing 'date_added' were filled with 0 earlier, so dropna() would not catch them; drop them via the 0 placeholder
data = data[data['year_added'] != 0].reset_index(drop=True)


In [None]:
data['duration_min'] = data['duration_min'].fillna(0)
data['duration_seasons'] = data['duration_seasons'].fillna(0)

In [None]:
data.isna().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Two missing value imputation techniques have been used:

* Using 'Unknown' for categorical variables:

 The 'director', 'cast', and 'country' columns are filled with the string 'Unknown'. This approach is commonly used when dealing with categorical variables, where missing values are replaced with a value that indicates the information is unknown or unavailable.

* Using mode for the 'rating' column:

 The missing values in the 'rating' column are filled with the mode, i.e. the most frequent value in the column. Because 'rating' is a categorical label rather than a number, the mean and median are not defined for it, and the mode is the usual choice.


 * Reason:

 These techniques were chosen based on their simplicity and common usage. Filling missing categorical variables with a distinct category like 'Unknown' preserves the existing data structure while indicating that the information is missing. For a low-cardinality categorical variable like 'rating', using the mode is a practical approach when a single value dominates the distribution, making it a reasonable estimate for missing values. In addition, 'duration_min' and 'duration_seasons' are filled with 0 in the rows where they do not apply.
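
The two strategies described above can be sketched on a toy frame. The values below are made up for illustration; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "director": ["A", np.nan, "B"],
    "rating": ["TV-MA", "TV-MA", np.nan],
})

# Free-text categorical column: flag missing entries with an explicit label
toy["director"] = toy["director"].fillna("Unknown")

# Low-cardinality label column: fill with the most frequent label
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])
print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a column selection, avoids the chained-assignment warnings that recent pandas versions emit.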

### 2. Handling Outliers



##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
data.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

From this data sample, which includes columns like 'title', 'director', 'cast', 'country', 'rating', 'listed_in', and 'description', it appears that the data does not contain any contractions. Therefore, there is no need to expand contractions in this particular dataset.

In [None]:
# Expand Contraction
data.head()

#### 2. Lower Casing

In [None]:
# Lower Casing
# Lowercase the text in all columns
data = data.apply(lambda x: x.str.lower() if x.dtype == 'object' else x)

In [None]:
data.head()

#### 3. Removing Punctuations

In [None]:
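# Swap 'cast' and 'country' back: the fillna cell under "Handling Missing Values" assigned them in crossed order
data[['cast','country']]=data[['country','cast']]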

In [None]:
# Remove Punctuations
import pandas as pd
import string

def remove_punctuation(text):
    """
    Helper function to remove punctuation from a given text.
    """
    return text.translate(str.maketrans("", "", string.punctuation))

# Apply to the text columns of the DataFrame 'data'
data['title'] = data['title'].apply(remove_punctuation)
data['director'] = data['director'].apply(remove_punctuation)
data['cast'] = data['cast'].apply(remove_punctuation)
data['description'] = data['description'].apply(remove_punctuation)


In [None]:
data.head()

#### 4. Removing URLs & Removing Words That Contain Digits

In [None]:
# Remove URLs & remove words that contain digits
import re
def remove_urls(text):
    """
    Helper function to remove URLs from a given text.
    """
    return re.sub(r"http\S+|www\S+|https\S+", "", text)

def remove_digits(text):
    """
    Helper function to remove every word that contains at least one digit from a given text.
    """
    return ' '.join(word for word in text.split() if not any(c.isdigit() for c in word))
data['description'] = data['description'].apply(remove_urls)
data['description'] = data['description'].apply(remove_digits)
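
The cleaning steps so far (lower-casing, punctuation, URLs, digit-bearing words) can be combined into a single helper. The following is a sketch using only the standard library; `clean_text` and the sample sentence are hypothetical names introduced for illustration, and URLs are removed before punctuation here so that the `://` part is still intact when the pattern runs.

```python
import re
import string

def clean_text(text):
    """Lowercase, then drop URLs, punctuation, and any token containing a digit."""
    text = text.lower()
    # Remove URLs first, while '://' and '.' are still present
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Strip all ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop tokens such as '1999' or '2nd'; split() also collapses extra whitespace
    return " ".join(w for w in text.split() if not any(c.isdigit() for c in w))

sample = "In 1999, Neo's world changes! See https://example.com for 2nd part."
print(clean_text(sample))  # in neos world changes see for part
```

A single function like this could be applied with `data['description'].apply(clean_text)`, which keeps the order of the steps explicit in one place.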

#### 5. Removing Stopwords & Removing White spaces

In [None]:
data.head()

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download the NLTK stopwords corpus
nltk.download('stopwords')

# Get the English stopwords (named stop_words so the imported `stopwords` module is not shadowed on re-runs)
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a text
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Remove stopwords from the text columns
text_columns = ['title', 'director', 'cast', 'country', 'listed_in', 'description']
for column in text_columns:
    data[column] = data[column].apply(remove_stopwords)

In [None]:
# Remove White spaces

# Assuming the DataFrame is stored in a variable called 'data'
data['title'] = data['title'].str.strip()
data['director'] = data['director'].str.strip()
data['cast'] = data['cast'].str.strip()
data['country'] = data['country'].str.strip()
data['rating'] = data['rating'].str.strip()
data['listed_in'] = data['listed_in'].str.strip()
data['description'] = data['description'].str.strip()


In [None]:
data.head()

#### 6. Rephrase Text

In [None]:
# Rephrase Text


#### 7. Tokenization

In [None]:
# Tokenization
import nltk
from nltk.tokenize import word_tokenize


# Download the necessary NLTK resources
nltk.download('punkt')

# Define the columns to tokenize
text_columns = ['title', 'director', 'cast', 'country', 'listed_in', 'description']

# Tokenize the text in the specified columns
for column in text_columns:
    data[column] = data[column].apply(lambda x: word_tokenize(str(x)))

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
data.head()

In [None]:
# Inspect the tokenized text columns ('Dataset' was undefined; the DataFrame is named 'data')
data[['cast', 'country', 'listed_in', 'description']].head()

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the necessary NLTK resources
nltk.download('punkt')

# Initialize the PorterStemmer
stemmer = PorterStemmer()

# Define the columns to normalize
text_columns = ['title', 'director', 'cast', 'country', 'listed_in', 'description']

# Normalize the text in the specified columns using stemming
for column in text_columns:
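    # After tokenization each cell is a list of tokens; stem those directly instead of re-tokenizing str(list)
    data[column] = data[column].apply(lambda x: ' '.join(stemmer.stem(word) for word in (x if isinstance(x, list) else word_tokenize(str(x)))))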

# Print the updated DataFrame
data.head()
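
Stemming, as used in the cell above, reduces each word to a crude root form. The following is a small sketch with NLTK's `PorterStemmer`; the helper name `stem_tokens` is hypothetical, and it accepts either a token list (the state of the columns after the tokenization step) or a plain string.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(value):
    """Stem a list of tokens, or a whitespace-separated string, into one string."""
    tokens = value if isinstance(value, list) else str(value).split()
    return " ".join(stemmer.stem(token) for token in tokens)

print(stem_tokens(["running", "detectives"]))
print(stem_tokens("running detectives"))
```

Handling the list case explicitly matters because calling `str()` on a token list produces text with brackets and quotes, which would then be tokenized as if they were words.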

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Define the text column to vectorize
text_column = 'description'

# Initialize the vectorizer (choose either CountVectorizer or TfidfVectorizer)
vectorizer = CountVectorizer()  # For Bag-of-Words (BoW) representation
# vectorizer = TfidfVectorizer()  # For TF-IDF representation

# Fit and transform the text data using the vectorizer
vectorized_text = vectorizer.fit_transform(data[text_column])

# Convert the vectorized text to a DataFrame
df_vectorized = pd.DataFrame(vectorized_text.toarray(), columns=vectorizer.get_feature_names_out())

# Print the vectorized text DataFrame
df_vectorized.head()

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting


##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data



### 6. Data Scaling

In [None]:
# Scaling your data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Define the columns to scale
numeric_columns = ['release_year', 'duration_min']

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Scale the numeric columns
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])
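
`MinMaxScaler` rescales each column independently to the range [0, 1] using (x - min) / (max - min). A minimal sketch on made-up values (the `toy` frame is an assumption for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({
    "release_year": [1990, 2005, 2020],
    "duration_min": [60.0, 90.0, 120.0],
})

# Each column is mapped to [0, 1]: its minimum becomes 0 and its maximum becomes 1
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled)
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values toward one end, which is worth checking before relying on this scaler.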

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)



##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

from sklearn.cluster import KMeans
wcss=[]
for i in range(1,31):
  kmeans = KMeans(n_clusters=i,init='k-means++',random_state=33)
  kmeans.fit(df_vectorized)
  wcss_iter = kmeans.inertia_
  wcss.append(wcss_iter)

number_clusters = range(1,31)
plt.figure(figsize=(10,5))
plt.plot(number_clusters,wcss)
plt.title('The Elbow Method - KMeans clustering')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
# Fit the Algorithm

# Predict on the model

In [None]:
from sklearn.metrics import silhouette_score

range_n_clusters = range(2,31)
silhouette_avg = []
for num_clusters in range_n_clusters:
  # initialize kmeans
  kmeans = KMeans(n_clusters=num_clusters,init='k-means++',random_state=33)
  kmeans.fit(df_vectorized)
  cluster_labels = kmeans.labels_

  # silhouette score
  silhouette_avg.append(silhouette_score(df_vectorized, cluster_labels))

plt.figure(figsize=(10,5))
plt.plot(range_n_clusters,silhouette_avg)
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')
plt.show()
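
The silhouette loop above picks the number of clusters with the highest average silhouette score. The same selection logic is sketched below on synthetic, well-separated data, where the correct answer is known in advance; the blob centers and the range of k are assumptions for illustration and unrelated to the Netflix features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (NOT the Netflix features)
X, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 10], [10, -10]],
                  cluster_std=1.0, random_state=0)

# Average silhouette score for each candidate k (the score needs at least 2 clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=33).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real text features the curve is usually much flatter than on this toy data, so the maximum is best read together with the elbow plot rather than on its own.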

In [None]:
kmeans = KMeans(n_clusters=4,init='k-means++',random_state=40)
kmeans.fit(df_vectorized)

In [None]:

df_vectorized['kmeans_cluster'] = kmeans.labels_

In [None]:
kmeans_distortion = kmeans.inertia_
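# Score on the features only: 'kmeans_cluster' was appended above and would leak the labels into the distances
kmeans_silhouette_score = silhouette_score(df_vectorized.drop(columns='kmeans_cluster'), kmeans.labels_)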


In [None]:
print((kmeans_distortion,kmeans_silhouette_score))

In [None]:
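# 'type' lives in `data`, not in the vectorized matrix, so build a small frame for plotting
cluster_plot_df = pd.DataFrame({'kmeans_cluster': kmeans.labels_, 'type': data['type'].to_numpy()})
plt.figure(figsize=(10,5))
q = sns.countplot(x='kmeans_cluster',data=cluster_plot_df, hue='type')
plt.title('Number of movies and TV shows in each cluster - Kmeans Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')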

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
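
Saving and reloading a fitted model, as the two steps above ask for, can be sketched with the standard-library `pickle` module. The tiny model, the points, and the file name `kmeans_model.pkl` below are all assumptions for illustration; in the notebook the fitted `kmeans` object (and the fitted vectorizer, which is needed to transform new descriptions) would be saved instead.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Tiny stand-in model (NOT the notebook's fitted model)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Hypothetical file name inside a temporary directory
model_path = Path(tempfile.mkdtemp()) / "kmeans_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load the model back from disk
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# Sanity check: the restored model assigns unseen points exactly like the original
new_points = np.array([[0.2, 0.1], [4.9, 5.2]])
print(restored.predict(new_points))
```

Pickle files should only be loaded from trusted sources and with the same scikit-learn version that wrote them, since the format is neither secure nor guaranteed to be stable across versions.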

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***