<a href="https://colab.research.google.com/github/ayushambhore/Ted-Talk-Views-Predictions-ML-Regression/blob/master/Ted_Talk_Views_Predictions_ML_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name : Ayush Ambhore**

# **Project Summary -**

The objective of this project is to build a predictive model that can accurately predict the number of views for videos uploaded on the TED website. The dataset used for this analysis consists of 4005 rows and 19 columns, with no duplicate values but with some missing values.

To prepare the data for modeling, several data cleaning and preprocessing steps were performed. The missing values in the "all_speakers" column were dropped since it indicated that only one speaker was present. The missing values in the "occupations" and "about_speakers" columns were replaced with "NA" to indicate the absence of information. The single missing value in the "recorded_date" column was dropped, and the missing values in the "comments" column were replaced with zeros.

Exploratory data analysis was conducted to gain insights into the dataset. Various charts and visualizations were created to explore the most popular videos, speakers, events, languages, and topics. The analysis revealed interesting findings such as the most popular video being "Do schools kill creativity?" and the top speaker by total views being Alex Gendler. Alex Gendler also had the highest number of talks delivered, while Amy Cuddy had the highest average views per video. Richard Dawkins received the most comments on his videos. The most popular event was found to be TED-Ed. The density plots showed that most videos had between 100 and 250 available languages, and the majority of videos had 2 to 10 topics.

Feature engineering was performed to derive new features from the existing data. The "recorded_date" and "published_date" columns were converted to the datetime format. Two new columns, "video_age_day" and "avg_daily_views," were created to capture the age of the video and the average daily views. The average views of the speaker were also calculated and mapped to a new column. Unnecessary columns were dropped, and outlier treatment was performed.

After feature engineering and data preprocessing, the dataset was divided into target variables and feature variables. VIF (Variance Inflation Factor) and multicollinearity checks were conducted to ensure the absence of high correlation among the features.

For the implementation of the predictive model, various machine learning algorithms such as linear regression, lasso, ridge, and elastic net were used. Grid search was employed for hyperparameter tuning to optimize the model performance. However, no significant improvement was observed even after hyperparameter tuning.

In conclusion, this project successfully built a predictive model to estimate the number of views for videos uploaded on the TED website. The data was cleaned, explored, and preprocessed to derive meaningful insights and create new features. Various machine learning algorithms were implemented, and though hyperparameter tuning did not yield significant improvements, the model can still serve as a valuable tool for predicting video views. The project highlights the importance of data cleaning, feature engineering, and exploratory data analysis in building effective predictive models.

# **GitHub Link -**

https://github.com/ayushambhore/Ted-Talk-Views-Predictions-ML-Regression

# **Problem Statement**


The primary objective of this project is to develop a predictive model capable of estimating the number of views for videos uploaded on the TED website. By analyzing various factors related to the videos, we aim to create a model that can provide valuable insights and predictions regarding the popularity of TED talks.

By leveraging the available data, we will explore patterns and relationships that can help us understand the key determinants of video views. We will perform data preprocessing, feature engineering, and exploratory data analysis to gain meaningful insights into the dataset.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import math

from datetime import datetime, timedelta
import calendar

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


from sklearn.model_selection import GridSearchCV

import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud,ImageColorGenerator

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
url = 'https://raw.githubusercontent.com/ayushambhore/Ted-Talk-Views-Predictions-ML-Regression/master/data_ted_talks.csv'
df = pd.read_csv(url)

### Dataset First View

In [None]:
# Dataset First Look
df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f' Row count = {df.shape[0]}\n Column count = {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values
print(df.isnull().sum())

In [None]:
print(df.isnull().sum().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False);

### What did you know about your dataset?

The dataset describes TED talk videos. It has 4005 rows and 19 columns, no duplicate rows, and 1685 missing values in total.
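The per-column counts and the overall total can be gathered into one summary table. The sketch below is only an illustration: `missing_summary` is a hypothetical helper name, and the toy frame reuses some of the dataset's column names with invented values.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "comments": [10.0, np.nan, np.nan, 4.0],
    "occupations": ["writer", None, "artist", "poet"],
    "views": [100, 200, 300, 400],
})

summary = missing_summary(toy)
total_missing = int(summary["missing"].sum())
print(summary)
print("Total missing:", total_missing)
```

Calling `missing_summary(df)` on the real dataframe should give the same numbers as `df.isnull().sum()`, with a percentage column added.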

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

1. talk_id : Unique ID
2. title : Title of the video
3. speaker_1: Main speaker in the video
4. all_speakers : name of all speakers in the video
5. occupations: occupations of the main speaker
6. about_speakers: description about the speakers
7. views : total views on the video
8. recorded_date: the date on which the video was recorded
9. published_date : the date on which the video was published
10. event: event name
11. native_lang : the language in which the video was recorded
12. available_lang : available languages for the video
13. comments : total comments on the video
14. duration : duration of the video
15. topics : topics related to the video
16. related_talks : ID and the name of the related TED video
17. url : url link of the video
18. description : description of the video
19. transcript: transcript of the video

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

### **Null values**

As we saw earlier, the all_speakers column has 4 missing values, occupations has 522, about_speakers has 503, recorded_date has 1, and comments has 655.

1. all_speakers has 4 missing values; a missing value here suggests that only one speaker was present.
So we will drop the rows with NaN values in the all_speakers column.

In [None]:
#dropping the NaN rows of all_speakers column
df = df.dropna(subset= ['all_speakers'])

2. The occupations and about_speakers columns have 522 and 503 missing values respectively,
so we will replace these NaN values with an 'NA' placeholder.

In [None]:
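# Replace NaN values with the placeholder string "{0: 'NA'}"
# (plain assignment instead of a chained inplace fillna, which newer pandas versions may not apply)
df['occupations'] = df['occupations'].fillna(str({0: 'NA'}))

In [None]:
df['about_speakers'] = df['about_speakers'].fillna(str({0: 'NA'}))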

3. recorded_date has only 1 missing value, so we will drop that row.

In [None]:
#dropping the NaN row of recorded_date column
df = df.dropna(subset= ['recorded_date'])

4. The comments column has the most missing values, which can mean one of two things: either no one has commented on the video or comments are disabled. In both cases we will replace these NaN values with 0.

In [None]:
# Replace NaN values with 0
df['comments'] = df['comments'].fillna(0)

#### Checking our final data

In [None]:
# Missing Values/Null Values
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False);

Now our data has no missing values, so it is clean and we are ready for the next steps.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: **Barplot** on top 10 most popular TED talk videos

In [None]:
#sorting the dataset with respect to views
top10_most_views = df.sort_values(['views'],ascending=False).head(10)

In [None]:
#plotting  barplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.5)
plt.title('top 10 most popular TED talk videos',fontsize = 20)

sns.barplot(data= top10_most_views, x='views',y= 'title',
                    palette= "tab10")

plt.xlabel('Views');

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric value across the levels of a categorical variable; here it shows the views for each of the top 10 talk titles.

##### 2. What is/are the insight(s) found from the chart?

As we can see, the most popular video is 'Do schools kill creativity?'.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

New content creators should note that viewers are most interested in the topics of these 10 talks, so they could make more content on them.

#### Chart - 2:  **Barplot** on top 10 most popular speakers on TED talk videos

In [None]:
# Group by speaker_1 and sum only the views column
# (summing every column would also try to add up the text columns)
top_speakers_wrt_views = df.groupby('speaker_1', as_index=False)['views'].sum()
# Sort by views in descending order and keep the top 10 speakers
top_speakers_wrt_views = top_speakers_wrt_views.sort_values('views', ascending=False).head(10)

In [None]:
# plotting barplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('top 10 most popular speakers on TED talk videos',fontsize = 20)

sns.barplot(data= top_speakers_wrt_views, x='views',y= 'speaker_1',
                    palette= "tab10")

plt.xlabel('Views')
plt.ylabel('Speaker');

##### 1. Why did you pick the specific chart?

A bar plot provides a clear visual representation of the top 10 speakers.

##### 2. What is/are the insight(s) found from the chart?

As we can see, the most popular speaker by total views is Alex Gendler.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The TED organization should approach these speakers more as they are bringing the most views to their videos.

#### Chart - 3 : **Countplot** on top 10 speakers with highest number of talks they delivered.

In [None]:
# count of talks delivered by the top 10 speakers
speaker_vs_frequency = df['speaker_1'].value_counts().head(10)
speaker_vs_frequency

In [None]:
# plotting countplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('top 10 speakers with highest number of talks they delivered',fontsize = 20)

sns.countplot(y='speaker_1',
              data=df,
              order=speaker_vs_frequency.index,
              palette="rocket")

sns.color_palette("rocket")
plt.xlabel('Count')
plt.ylabel('Speakers');

##### 1. Why did you pick the specific chart?

A countplot shows the counts of observations in each categorical bin using bars. It can be thought of as a histogram across a categorical, instead of quantitative, variable.

##### 2. What is/are the insight(s) found from the chart?

Alex Gendler has delivered the highest number of talks.

#### Chart - 4: **Barplot** on top 10 speakers with respect to average views

In [None]:
#grouping the dataframe with respect to speaker_1
speakers_vs_avgviews= df.groupby('speaker_1',as_index=False)['views']
# taking mean of views
speakers_vs_avgviews= speakers_vs_avgviews.mean()
#sorting values with respect to mean of views in descending order
speakers_vs_avgviews = speakers_vs_avgviews.sort_values(['views'],ascending = False).head(10)

In [None]:
y_param1 =speakers_vs_avgviews['speaker_1']
x_param1 =speakers_vs_avgviews['views']

In [None]:
# plotting barplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)

graph = sns.barplot(x = x_param1,
                    y = y_param1 ,
                    linewidth=2,
                    color='#ff5252')
graph.set_title("Top 10 speakers with respect to average views");

##### 1. Why did you pick the specific chart?

The bar plot here shows the mean views for each speaker.

##### 2. What is/are the insight(s) found from the chart?

Amy Cuddy has the highest average views among all the speakers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The TED organization should approach these speakers more as they have the highest average views.

#### Chart - 5: **Barplot** on Top 10 speakers with highest number of comments

In [None]:
# Group by speaker_1 and sum only the comments column
# (summing every column would also try to add up the text columns)
top_speakers_wrt_comments = df.groupby('speaker_1', as_index=False)['comments'].sum()
# Sort by comments in descending order and keep the top 10 speakers
top_speakers_wrt_comments = top_speakers_wrt_comments.sort_values('comments', ascending=False).head(10)

In [None]:
# plotting barplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('Top 10 speakers with highest number of comments',fontsize = 20)

sns.barplot(data= top_speakers_wrt_comments, x='comments',y= 'speaker_1',
                    palette= "tab10")

plt.xlabel('comments')
plt.ylabel('speaker');

##### 1. Why did you pick the specific chart?

A bar plot provides a clear visual representation of the top 10 speakers with respect to comments.

##### 2. What is/are the insight(s) found from the chart?

Richard Dawkins has the most comments on his videos.

#### Chart - 6: **Barplot** on popular events

In [None]:
# Group the dataframe by event
top_talk_event = df.groupby(['event'], as_index=False)
# Total views and number of talks (count of talk_id) per event
top_talk_event = top_talk_event.agg({'views': 'sum', 'talk_id': 'count'})
# Sort by total views and keep the top 8 events
top_talk_event = top_talk_event.sort_values('views', ascending=False).reset_index()[:8]
# Overwrite talk_id with the average views per talk for each event
top_talk_event['talk_id'] = top_talk_event['views'] / top_talk_event['talk_id']

In [None]:
# plotting barplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('popularity of events',fontsize = 20)

sns.barplot(data= top_talk_event, x='views',y= 'event',
                    palette= "tab10")

plt.xlabel('Views in 10^9');

##### 1. Why did you pick the specific chart?

Bar charts suit categorical or nominal data because each bar stands for one discrete category. Here each bar is an event, and its length is the total number of views.

##### 2. What is/are the insight(s) found from the chart?

As we can clearly see, the most popular event is TED-Ed.

#### Chart - 7: **Density plot** on available languages

In [None]:
# New column holding len() of each available_lang entry.
# available_lang is read from the CSV as text, so len(x) is the length of that text.
df['number_of_lang'] = df['available_lang'].apply(len)

In [None]:
# plotting densityplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('Density plot on available languages',fontsize = 20)

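# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df['number_of_lang'], kde=True, stat='density')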
plt.xlabel('Number of Languages');

##### 1. Why did you pick the specific chart?


Density Plot is the continuous and smoothed version of the Histogram estimated from the data. It is estimated through Kernel Density Estimation. In this method Kernel (continuous curve) is drawn at every individual data point and then all these curves are added together to make a single smoothened density estimation.
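The description above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not seaborn's implementation: `gaussian_kde_1d` is a hypothetical helper, the bandwidth is picked by hand, and the sample values are invented. The kernels are averaged rather than simply summed so that the result integrates to 1.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian kernel per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z has shape (len(grid), len(samples)): scaled distance from each grid point to each sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


samples = [2.0, 3.0, 3.5, 7.0]            # invented values
grid = np.linspace(-5.0, 15.0, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# Riemann-sum estimate of the area under the curve (should be close to 1)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve.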

##### 2. What is/are the insight(s) found from the chart?

We can conclude from the chart that most videos have a number_of_lang value between 100 and 250.
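Because `available_lang` is read from the CSV as text, `len(x)` returns the number of characters in that text rather than the number of languages. A minimal sketch of counting list items instead, assuming each entry is the text of a Python-style list (the rows below are invented for illustration):

```python
import ast

import pandas as pd

# Invented rows that mimic a list stored as text
toy = pd.DataFrame({"available_lang": ["['en', 'es', 'fr']", "['en']"]})

# len() on the raw text counts characters
toy["n_chars"] = toy["available_lang"].apply(len)

# literal_eval turns the text into a real list, so len() counts items
toy["n_langs"] = toy["available_lang"].apply(lambda s: len(ast.literal_eval(s)))

print(toy[["n_chars", "n_langs"]])
```

`ast.literal_eval` only accepts Python literals, so it is also a safer choice than `eval` for parsing such columns.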

#### Chart - 8: **Density plot** on number of topics

In [None]:
# Parse the topics text into a real list, then store how many topics each talk has
import ast
df['topics'] = df['topics'].apply(ast.literal_eval)
df['num_of_topics'] = df['topics'].apply(len)

In [None]:
# plotting densityplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('Density plot on Number of Topics',fontsize = 20)

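# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df['num_of_topics'], kde=True, stat='density')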
plt.xlabel('Number of topics');

##### 1. Why did you pick the specific chart?


Density Plot is the continuous and smoothed version of the Histogram estimated from the data. It is estimated through Kernel Density Estimation. In this method Kernel (continuous curve) is drawn at every individual data point and then all these curves are added together to make a single smoothened density estimation.

##### 2. What is/are the insight(s) found from the chart?

We can see that most videos have around 2 to 10 topics.

#### Chart - 9: Most frequent words used in the title with **word cloud**

In [None]:
#creating a string to add the words found in titles
text = " ".join(topic for topic in df.title.astype(str))

# generate the word cloud from the text
wordcloud = WordCloud( width=1024, height=720).generate(text)

plt.title('Most frequent words used in the title',fontsize = 12)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0);

##### 1. Why did you pick the specific chart?

A word cloud is a visualisation method that displays how frequently words appear in a given body of text by making the size of each word proportional to its frequency. All the words are then arranged in a cluster or cloud.
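The frequency counting behind a word cloud can be illustrated with the standard library alone. This is a sketch of the idea only: the `wordcloud` package does its own tokenising and stop-word filtering, and the titles and the tiny stop-word list below are invented.

```python
import re
from collections import Counter

titles = [                                   # invented titles, for illustration only
    "How the world works",
    "A life in the world of ideas",
    "Life lessons",
]
stopwords = {"a", "the", "in", "of", "how"}  # tiny hypothetical stop-word list

words = []
for title in titles:
    # lowercase the title and pull out runs of letters
    for word in re.findall(r"[a-z']+", title.lower()):
        if word not in stopwords:
            words.append(word)

frequencies = Counter(words)
print(frequencies.most_common(2))
```

In a word cloud, each word would then be drawn at a size proportional to its count in `frequencies`.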

##### 2. What is/are the insight(s) found from the chart?

We can see that 'world' and 'life' are the most frequently occurring words in the titles.

#### Chart - 10: Popular Topic tags with **word cloud**

In [None]:
#creating a string to add the words found in topics
text = " ".join(topic for topic in df.topics.astype(str))

# generate the word cloud from the text
wordcloud = WordCloud(width=1024, height=720).generate(text)

plt.imshow(wordcloud, interpolation='bilinear')

plt.title('Popular Topic tags',fontsize = 12)

plt.axis("off")
plt.margins(x=0, y=0);

##### 1. Why did you pick the specific chart?

A word cloud is a visualisation method that displays how frequently words appear in a given body of text by making the size of each word proportional to its frequency. All the words are then arranged in a cluster or cloud.

##### 2. What is/are the insight(s) found from the chart?

We can see that technology, science, global issues, and TED-Ed are the most frequently occurring words in the topics.

#### Chart - 11: **Barplot** on top 10 most frequent speakers occupation.

In [None]:
# Value counts of speaker occupations; [1:] skips the most frequent entry,
# which is expected to be the placeholder used for missing occupations
top_10_speaker_occ = df['occupations'].value_counts()[1:].head(10)

In [None]:
# plotting barplot
plt.figure(figsize=(10,6))
sns.set(font_scale=1.3)
plt.title('Top 10 most frequent speaker occupations',fontsize = 20)

# counts on the x-axis, occupation labels on the y-axis
sns.barplot(x = top_10_speaker_occ.values,
            y = top_10_speaker_occ.index,
                    palette= "tab10")

plt.xlabel('Count',fontsize = 15)
plt.ylabel('Occupation',fontsize = 15);

##### 1. Why did you pick the specific chart?

Bar charts suit categorical or nominal data because each bar stands for one discrete category. Here each bar is an occupation, and its length is how often that occupation appears in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Writer is the most frequent occupation among the speakers.

#### Chart - 12 - Correlation Heatmap

In [None]:
plt.figure(figsize=(12,8))
sns.set(font_scale=1)
# numeric_only=True keeps text and list columns out of the correlation matrix
heatmap = sns.heatmap(df.corr(numeric_only=True), cmap="YlGnBu", annot=True)

heatmap.set_title('Correlation Heatmap',
                  fontdict={'fontsize':25},
                  pad=12);

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.
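Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. A minimal sketch of that calculation for one pair of variables, using a hypothetical helper `pearson` and invented numbers:

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))


views = [100, 200, 300, 400, 500]        # invented values
comments = [12, 18, 33, 41, 48]          # rises with views
duration = [900, 700, 650, 400, 300]     # falls as views rise

r_views_comments = pearson(views, comments)
r_views_duration = pearson(views, duration)
print(r_views_comments, r_views_duration)
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.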

## ***5. Feature Engineering & Data Pre-processing***

### 1. Data manipulation

Let's look at the columns of our dataset.

In [None]:
df.columns

Let's drop the unnecessary columns that will not have any effect on our model building.

In [None]:
# Drop unnecessary columns from the dataframe
df.drop( columns = ['talk_id', 'title',
                    'about_speakers','event', 'url',
                    'description','transcript','native_lang',
                    'related_talks', 'available_lang', 'topics',
                    'all_speakers', 'occupations'],
          inplace = True)

In [None]:
df.columns

Before going further, we should convert recorded_date and published_date to datetime format.

In [None]:
# Change the format of recorded_date and published_date to datetime format
df['recorded_date'] = pd.to_datetime(df['recorded_date'], format = '%Y-%m-%d')
df['published_date'] = pd.to_datetime(df['published_date'], format = '%Y-%m-%d')

Creating a new column video_age_day as it will be required for further data processing.

In [None]:
# Create a new column for video_age_day: days from publication to one day after the latest published_date
df['video_age_day'] = (df['published_date'].max() + timedelta(days=1) - df['published_date']).dt.days

Creating one more column for the average daily views, as this will be a key feature in predicting the views of the video.

In [None]:
# Create a column for average daily views
df['avg_daily_views'] = df['views'] / df['video_age_day']

Obtaining each speaker's average views across their videos and mapping it to a new column.

In [None]:
# Get the average views of each speaker and store it in a new column
df['speaker_1_avg_views'] = df.groupby('speaker_1')['views'].transform('mean')

Now let's drop the remaining columns, as they are not very relevant for our model.

In [None]:
# Drop remaining columns
df= df.drop(columns= ['recorded_date', 'published_date', 'speaker_1','number_of_lang' ,'num_of_topics','video_age_day'])

In [None]:
df.info()

### 2. Handling Outliers

In [None]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

outliers_count = ((df < Q1 - 1.5 * IQR) | (df > Q3 + 1.5 * IQR)).sum(axis=0)

outliers_count


We can see that outliers are present, so we should treat them.

In [None]:
# Handle outliers using IQR method: cap values at the lower and upper fences
lower_threshold = Q1 - 1.5 * IQR
upper_threshold = Q3 + 1.5 * IQR

df = df.clip(lower=lower_threshold, upper=upper_threshold, axis=1)

In [None]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

outliers_count = ((df < Q1 - 1.5 * IQR) | (df > Q3 + 1.5 * IQR)).sum(axis=0)

outliers_count
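
The IQR treatment used above can be traced on a single small array. The sketch below uses invented values with one obvious outlier; the fences are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and values outside them are capped at the nearest fence.

```python
import numpy as np

# Invented values; 120 is the outlier
values = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 120.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Count the values outside the fences, then cap them at the fences
n_outliers = int(((values < lower_fence) | (values > upper_fence)).sum())
capped = np.clip(values, lower_fence, upper_fence)

print(lower_fence, upper_fence, n_outliers)
print(capped)
```

Capping keeps every row in the data but limits the influence of extreme values. Note that the quartiles of the capped data can differ slightly from the original ones, which is why the outlier count is recomputed after treatment.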

### 3. Assigning Target variable and feature variables.

In [None]:
y = df['views'] #dependent variable

In [None]:
X = df.drop(columns='views') #independent variables

In [None]:
X.info()

### 4. Checking VIF


Variance Inflation Factor (VIF)
Variance inflation factor measures how much the behavior (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables. Variance inflation factors allow a quick measure of how much a variable is contributing to the standard error in the regression.
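
In formula form, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the other features. The sketch below computes this with NumPy on synthetic data. It is an illustration of the textbook definition with an intercept in each auxiliary regression; `vif` is a hypothetical helper, not the statsmodels function, which works on the columns exactly as they are passed in.

```python
import numpy as np


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        result.append(1.0 / (1.0 - r_squared))
    return np.array(result)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                 # independent of a
c = a + 0.1 * rng.normal(size=200)       # nearly a copy of a, so a and c are collinear

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The independent column `b` gets a VIF close to 1, while the two nearly identical columns get large values, which is the pattern the "under 10" rule of thumb is meant to catch.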

In [None]:
# Check VIF (Variance Inflation Factor) to detect multicollinearity
selected_features = ['comments', 'duration', 'avg_daily_views', 'speaker_1_avg_views']

df_selected = df[selected_features]

vif = pd.DataFrame()
vif["Feature"] = df_selected.columns
vif["VIF"] = [variance_inflation_factor(df_selected.values, i) for i in range(df_selected.shape[1])]

print(vif)

We are good to go, as all the VIF values are under 10.

### 5. Checking multicollinearity

In [None]:
# Check multicollinearity using correlation heatmap
plt.figure(figsize=(12,8))
sns.set(font_scale=1)
heatmap = sns.heatmap(X.corr(), cmap="YlGnBu", annot=True)

heatmap.set_title('Correlation Heatmap',
                  fontdict={'fontsize':25},
                  pad=12);

## ***6. ML Model Implementation***

Train-test split

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)

### ML Model - 1

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - Linear Regression
reg = LinearRegression().fit(X_train, y_train)

In [None]:
# Calculate model score on the training data
reg.score(X_train, y_train)

In [None]:
# Make predictions on the test data
y_pred = reg.predict(X_test)

In [None]:
# Calculate evaluation metrics: MSE, RMSE, MAE
MSE  = mean_squared_error(y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred)
print("MAE :" ,MAE)

In [None]:
# Calculate R-Square and Adjusted R-Square
r2 = r2_score(y_test, y_pred)
print("R2 :" ,r2)
ar2= 1-(1-r2_score(y_test, y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",ar2)
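
The same five metrics are recomputed for every model in this notebook, so they could be wrapped in one function. The sketch below is a NumPy-only illustration with a hypothetical helper name, `regression_metrics`, and invented numbers. Adjusted R2 follows the formula used above: 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of test rows and p the number of features.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 as a dictionary."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred
    mse = float(np.mean(errors**2))
    ss_res = float(np.sum(errors**2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean())**2))     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(errors))),
        "R2": r2,
        "Adjusted R2": adj_r2,
    }


y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]   # invented values
y_pred = [2.0, 5.0, 8.0, 9.0, 12.0, 13.0]
metrics = regression_metrics(y_true, y_pred, n_features=2)
for name, value in metrics.items():
    print(name, ":", value)
```

With the notebook's variables, the call would look like `regression_metrics(y_test, y_pred, X_test.shape[1])`.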

In [None]:
# Create a dataframe to store the evaluation metrics
error_metric_regression=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross-Validation & Hyperparameter Tuning
reg_cross = LinearRegression()
# `positive` is a boolean flag (force non-negative coefficients or not), so search over True/False
parameters = {'positive': [True, False]}
reg_cross_x = GridSearchCV(reg_cross, parameters, scoring='neg_mean_squared_error', cv=3)
reg_cross_x.fit(X_train, y_train)

In [None]:
# Print the best hyperparameter value and the negative mean squared error
print("The best fit positive value is found out to be :" ,reg_cross_x.best_params_)
print("\nUsing ",reg_cross_x.best_params_, " the negative mean squared error is: ", reg_cross_x.best_score_)

In [None]:
# Make predictions on the test data using the model with the best hyperparameter value
y_pred_reg = reg_cross_x.predict(X_test)

In [None]:
# Calculate evaluation metrics for the tuned model and Print them
MSE  = mean_squared_error(y_test, y_pred_reg)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_reg)
print("MAE :" ,MAE)

r2 = r2_score(y_test, y_pred_reg)
print("R2 :" ,r2)
ar2 = 1-(1-r2_score(y_test, y_pred_reg))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics for the tuned model
error_metric_regression_x=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred_reg)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used grid search (GridSearchCV), which evaluates every combination of the supplied parameter values with cross-validation and keeps the best-scoring combination.
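What grid search does can be written out as an explicit loop. The sketch below is a NumPy-only illustration on synthetic data, not scikit-learn's implementation: `fit_ridge` and `cv_mse` are hypothetical helpers, the model is a closed-form ridge regression, and the folds are simple contiguous splits.

```python
import itertools

import numpy as np


def fit_ridge(X, y, alpha, fit_intercept):
    """Closed-form ridge regression; the intercept (if any) is not penalised."""
    if fit_intercept:
        x_mean, y_mean = X.mean(axis=0), y.mean()
    else:
        x_mean, y_mean = np.zeros(X.shape[1]), 0.0
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def cv_mse(X, y, alpha, fit_intercept, n_folds=3):
    """Mean validation MSE over contiguous folds."""
    all_idx = np.arange(len(y))
    scores = []
    for val_idx in np.array_split(all_idx, n_folds):
        train_idx = np.setdiff1d(all_idx, val_idx)
        coef, intercept = fit_ridge(X[train_idx], y[train_idx], alpha, fit_intercept)
        pred = X[val_idx] @ coef + intercept
        scores.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(scores))


# Synthetic data: a linear signal with an offset of 4 and a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=120)

grid = {"alpha": [0.01, 1.0, 100.0], "fit_intercept": [True, False]}
results = []
# Try every combination in the grid and record its cross-validated error
for alpha, fit_intercept in itertools.product(grid["alpha"], grid["fit_intercept"]):
    results.append(((alpha, fit_intercept), cv_mse(X, y, alpha, fit_intercept)))

best_params, best_mse = min(results, key=lambda item: item[1])
print(best_params, best_mse)
```

The cost grows with the product of the list lengths (here 3 x 2 = 6 combinations, each fitted once per fold), which is why grid search is usually kept to a few parameters with short value lists.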

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No improvement.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - Lasso Regression
lasso  = Lasso(alpha=0.1 , max_iter= 3000)
lasso.fit(X_train, y_train)

In [None]:
# Calculate model score on the training data
lasso.score(X_train, y_train)

In [None]:
# Make predictions on the test data
y_pred_lasso = lasso.predict(X_test)

In [None]:
# Calculate evaluation metrics: MSE, RMSE, MAE
MSE  = mean_squared_error(y_test, y_pred_lasso)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_lasso)
print("MAE :" ,MAE)

# Calculate R-Square and Adjusted R-Square
r2 = r2_score(y_test, y_pred_lasso)
print("R2 :" ,r2)
ar2 = 1-(1-r2_score(y_test, y_pred_lasso))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics
error_metric_lasso=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred_lasso)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross-Validation & Hyperparameter Tuning
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,1000]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X_train, y_train)

In [None]:
# Print the best hyperparameter value and the negative mean squared error
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
# Make predictions on the test data using the model with the best hyperparameter value
y_pred_lasso_x = lasso_regressor.predict(X_test)

In [None]:
# Calculate evaluation metrics for the tuned model and Print them
MSE  = mean_squared_error(y_test, y_pred_lasso_x)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_lasso_x)
print("MAE :" ,MAE)

r2 = r2_score(y_test, y_pred_lasso_x)
print("R2 :" ,r2)
ar2 = 1-(1-r2_score(y_test, y_pred_lasso_x))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics for the tuned model
error_metric_lasso_x=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred_lasso_x)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used grid search (GridSearchCV), which evaluates every combination of the supplied parameter values with cross-validation and keeps the best-scoring combination.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No improvement
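
One way to back up an answer like this is to put the baseline and tuned metrics side by side. The sketch below is illustrative only: `metric_changes` is a hypothetical helper, and the numbers are placeholders rather than results from this notebook. In the notebook itself the inputs would presumably come from `error_metric_lasso['Values']` and `error_metric_lasso_x['Values']`.

```python
# Hypothetical helper (not part of the original notebook):
# compare evaluation metrics before and after hyperparameter tuning.
def metric_changes(before, after):
    """Return after - before for every metric present in both dicts."""
    return {name: after[name] - before[name] for name in before if name in after}


# Placeholder numbers for illustration only (NOT results from this notebook).
baseline = {"R-Square": 0.75, "RMSE": 2.0, "MAE": 1.5}
tuned = {"R-Square": 0.75, "RMSE": 2.0, "MAE": 1.25}

changes = metric_changes(baseline, tuned)
for name, delta in changes.items():
    print(f"{name}: {delta:+.4f}")
```

For R-Square a positive change is an improvement, while for MSE, RMSE, and MAE a negative change is an improvement.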

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - Ridge Regression
ridge = Ridge(alpha=0.1)

In [None]:
# Fit the Ridge model on the training data
ridge.fit(X_train, y_train)

In [None]:
# Calculate model score on the training data
ridge.score(X_train, y_train)

In [None]:
# Make predictions on the test data
y_pred_r = ridge.predict(X_test)

In [None]:
# Calculate evaluation metrics: MSE, RMSE, MAE
MSE  = mean_squared_error(y_test, y_pred_r)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_r)
print("MAE :" ,MAE)

# Calculate R-Square and Adjusted R-Square
r2 = r2_score(y_test, y_pred_r)
print("R2 :" ,r2)
ar2 = 1 - (1 - r2) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics
error_metric_ridge=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred_r)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Cross-Validation & Hyperparameter Tuning
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100,6000,7000,8000]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_train,y_train)

In [None]:
# Print the best hyperparameter value and the negative mean squared error
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
# Make predictions on the test data using the model with the best hyperparameter value
y_pred_ridge = ridge_regressor.predict(X_test)

In [None]:
# Calculate evaluation metrics for the tuned model
MSE  = mean_squared_error(y_test, y_pred_ridge)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_ridge)
print("MAE :" ,MAE)

r2 = r2_score(y_test, y_pred_ridge)
print("R2 :" ,r2)
ar2 = 1 - (1 - r2) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics for the tuned model
error_metric_ridge_x=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values for the tuned model
plt.scatter(y_test, y_pred_ridge)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used grid search (`GridSearchCV`), which exhaustively evaluates every candidate value of `alpha` with 3-fold cross-validation and keeps the value with the best score (here, the highest negative mean squared error).

### ML Model - 4

In [None]:
# ML Model - ElasticNet Regression
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)

In [None]:
# Fit the ElasticNet model on the training data
elasticnet.fit(X_train, y_train)

In [None]:
# Calculate model score on the training data
elasticnet.score(X_train, y_train)

In [None]:
# Make predictions on the test data
y_pred_en = elasticnet.predict(X_test)

In [None]:
# Calculate evaluation metrics: MSE, RMSE, MAE
MSE  = mean_squared_error(y_test, y_pred_en)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_en)
print("MAE :" ,MAE)

# Calculate R-Square and Adjusted R-Square
r2 = r2_score(y_test, y_pred_en)
print("R2 :" ,r2)
ar2 = 1 - (1 - r2) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics
error_metric_elastic=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred_en)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Cross-Validation & Hyperparameter Tuning
elastic = ElasticNet()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20],'l1_ratio':[0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]}
elastic_regressor = GridSearchCV(elastic, parameters, scoring='neg_mean_squared_error',cv=5)
elastic_regressor.fit(X_train, y_train)

In [None]:
# Print the best hyperparameter values and the negative mean squared error
print("The best fit alpha and l1_ratio values are found out to be :" ,elastic_regressor.best_params_)
print("\nUsing ",elastic_regressor.best_params_, " the negative mean squared error is: ", elastic_regressor.best_score_)

In [None]:
# Make predictions on the test data using the model with the best hyperparameter values
y_pred_elastic = elastic_regressor.predict(X_test)

In [None]:
# Calculate evaluation metrics for the tuned model
MSE  = mean_squared_error(y_test, y_pred_elastic)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAE = mean_absolute_error(y_test, y_pred_elastic)
print("MAE :" ,MAE)

r2 = r2_score(y_test, y_pred_elastic)
print("R2 :" ,r2)
ar2 = 1 - (1 - r2) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1))
print("Adjusted R2 : ",ar2)

In [None]:
# Create a dataframe to store the evaluation metrics
error_metric_elastic_x=pd.DataFrame({'Values':[r2,ar2,MSE,RMSE,MAE]},index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

In [None]:
# Create a scatter plot to visualize predicted vs actual values
plt.scatter(y_test, y_pred_elastic)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

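I used grid search (`GridSearchCV`), which exhaustively evaluates every combination of the candidate `alpha` and `l1_ratio` values with 5-fold cross-validation and keeps the combination with the best score (here, the highest negative mean squared error).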
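
To make "every combination" concrete, the sketch below enumerates the ElasticNet grid in plain Python. It is a simplified illustration of what an exhaustive search has to cover, not scikit-learn's internal implementation; the variable names (`param_grid`, `cv_folds`, `combinations`) are chosen here for illustration, and the candidate values are copied from the search above.

```python
from itertools import product

# Candidate values copied from the ElasticNet grid search in this notebook.
param_grid = {
    "alpha": [1e-15, 1e-13, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 20],
    "l1_ratio": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Build every pairing of one alpha with one l1_ratio.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(combinations)      # 13 alphas * 8 l1_ratios
n_cv_fits = n_candidates * cv_folds   # one model fit per candidate per fold

print(n_candidates)  # 104
print(n_cv_fits)     # 520
```

The fit count grows multiplicatively with each added hyperparameter, which is the main cost of an exhaustive search. By default `GridSearchCV` also refits the best candidate once on the full training set after the search.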

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

In [None]:
# all evaluation metrics of all the models clubbed together
Model_Summary=pd.DataFrame({'Linear Regression':error_metric_regression['Values'],
                            'Lasso':error_metric_lasso['Values'],
                            'Ridge':error_metric_ridge['Values'],
                            'Elastic':error_metric_elastic['Values']},
                           index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

pd.set_option('display.float_format', lambda x: '%.6f' % x)
Model_Summary.index.name = 'Metrics'


In [None]:
Model_Summary

In [None]:
# all evaluation metrics of all the models clubbed together after hyper parameter tuning
Model_Summary_HT =pd.DataFrame({'Linear Regression':error_metric_regression_x['Values'],
                            'Lasso':error_metric_lasso_x['Values'],
                            'Ridge':error_metric_ridge_x['Values'],
                            'Elastic':error_metric_elastic_x['Values']},
                           index=['R-Square','Adj. R-Square','MSE','RMSE','MAE'])

pd.set_option('display.float_format', lambda x: '%.6f' % x)

In [None]:
Model_Summary_HT

For a positive business impact, R2 was chosen as the main evaluation metric, as it produced the best results in each of the regression models. The adjusted R2 values show no indication of overfitting.
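
The adjusted R2 used throughout this notebook follows the formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of test samples and p the number of features. As a minimal sketch, the repeated expression could be wrapped in a small helper; `adjusted_r2` is a hypothetical name, and the inputs below are placeholders rather than values from this notebook.

```python
# Hypothetical helper (not part of the original notebook).
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    denom = n_samples - n_features - 1
    if denom <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1 - (1 - r2) * ((n_samples - 1) / denom)


# Placeholder inputs for illustration only (NOT results from this notebook).
example = adjusted_r2(0.80, n_samples=100, n_features=5)
print(round(example, 4))  # 0.7894
```

In the notebook this would be called as `adjusted_r2(r2, X_test.shape[0], X_test.shape[1])`. Because the penalty grows with the number of features relative to the number of samples, adjusted R2 falls further below R2 as more predictors are added without a matching gain in fit.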

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We should choose the Lasso model, as it has the lowest RMSE value among all the models, and it works better after hyperparameter tuning.
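
The selection rule described here (pick the model with the lowest RMSE) can be sketched as follows. The model names and RMSE values are generic placeholders, not results from this notebook, and `rank_by_rmse` is a hypothetical helper.

```python
# Hypothetical helper (not part of the original notebook):
# order model names from lowest to highest RMSE.
def rank_by_rmse(rmse_by_model):
    return sorted(rmse_by_model, key=rmse_by_model.get)


# Placeholder values for illustration only (NOT results from this notebook).
rmse_by_model = {"model_a": 1.30, "model_b": 1.10, "model_c": 1.20}

ranking = rank_by_rmse(rmse_by_model)
best_model = ranking[0]
print(ranking)     # ['model_b', 'model_c', 'model_a']
print(best_model)  # model_b
```

With the summary tables built above, the pandas equivalent would be along the lines of `Model_Summary_HT.loc['RMSE'].idxmin()`, assuming the table has been constructed with the metric names as its index.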

# **Conclusion**

* The project aimed to build a predictive model to estimate the views of videos uploaded on the TED website.
* The dataset was cleaned and preprocessed, including handling missing values and converting date columns to datetime format.
* Exploratory data analysis provided insights into popular videos, speakers, events, languages, and topics.
* Feature engineering was performed to create new features, such as video age and average daily views.
* Machine learning algorithms including linear regression, lasso, ridge, and elastic net were implemented.
* The R2 metric was chosen for evaluating the models, as it consistently produced the best results in each of the regression models.
* There was no indication of overfitting as shown by the adjusted R2 values.
* The Lasso model was selected as the best model, as it had the lowest RMSE (Root Mean Square Error) value among all the models.
* Hyperparameter tuning further improved the performance of the Lasso model.
* The selected model can have a positive business impact by accurately predicting the number of views for TED videos.
* The project emphasizes the importance of data cleaning, feature engineering, and exploratory data analysis in building effective predictive models.





### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***