# Enhance your brand using YouTube


## 📖 Background
You're a data scientist at a global marketing agency that helps some of the world's largest companies enhance their online presence.

Your new project is exciting: identify the most effective YouTube videos to promote your clients’ brands.

Forget simple metrics like views or likes; your job is to dive deep and discover who really connects with audiences through innovative content analysis.

## 💾 The Data

The data for this competition is stored in two tables, `videos_stats` and `comments`.

### `videos_stats.csv`
This table contains aggregated data for each YouTube video:
- **Video ID**: A unique identifier for each video.
- **Title**: The title of the video.
- **Published At**: The publication date of the video.
- **Keyword**: The main keyword or topic of the video.
- **Likes**: The number of likes the video has received.
- **Comments**: The number of comments on the video.
- **Views**: The total number of times the video has been viewed.

### `comments.csv`
This table captures details about comments made on YouTube videos:
- **Video ID**: The identifier for the video the comment was made on (matches **Video ID** in the `videos_stats` table).
- **Comment**: The text of the comment.
- **Likes**: How many likes this comment has received.
- **Sentiment**: The sentiment score ranges from 0 (negative) to 2 (positive), indicating the tone of a comment.
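
As a quick sanity check on the two tables described above, the sketch below builds tiny stand-in frames and verifies the expected columns, the 0 to 2 sentiment range, and that every comment points at a known video. The rows are made up for illustration (they are not taken from the real files), and `check_columns` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

VIDEO_COLUMNS = ["Video ID", "Title", "Published At", "Keyword", "Likes", "Comments", "Views"]
COMMENT_COLUMNS = ["Video ID", "Comment", "Likes", "Sentiment"]

def check_columns(df, expected):
    """Return the expected column names that are missing from df."""
    return [col for col in expected if col not in df.columns]

# Made-up rows that mirror the documented schema
toy_videos = pd.DataFrame({
    "Video ID": ["vid_a", "vid_b"],
    "Title": ["First video", "Second video"],
    "Published At": ["23/08/2022", "24/08/2022"],
    "Keyword": ["tech", "tech"],
    "Likes": [10.0, 20.0],
    "Comments": [1.0, 2.0],
    "Views": [100.0, 400.0],
})
toy_comments = pd.DataFrame({
    "Video ID": ["vid_a", "vid_b", "vid_b"],
    "Comment": ["Great!", "Meh", "Not for me"],
    "Likes": [3, 0, 1],
    "Sentiment": [2, 1, 0],
})

missing_video_cols = check_columns(toy_videos, VIDEO_COLUMNS)
missing_comment_cols = check_columns(toy_comments, COMMENT_COLUMNS)

# Sentiment should only take the values 0, 1, or 2
sentiment_ok = toy_comments["Sentiment"].isin([0, 1, 2]).all()

# Comments whose Video ID has no match in the video table
orphan_comments = ~toy_comments["Video ID"].isin(toy_videos["Video ID"])

print(missing_video_cols, missing_comment_cols, sentiment_ok, orphan_comments.sum())
```

The same checks can be pointed at the real `videos_stats` and `comments` frames once they are loaded.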

# Preparing Data

## Connect Drive

## Install Libraries

In [2]:
!pip install nrclex

Collecting nrclex
  Downloading NRCLex-4.0-py3-none-any.whl (4.4 kB)
INFO: pip is looking at multiple versions of nrclex to determine which version is compatible with other requirements. This could take a while.
  Downloading NRCLex-3.0.0.tar.gz (396 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nrclex
  Building wheel for nrclex (setup.py) ... done
  Created wheel for nrclex: filename=NRCLex-3.0.0-py3-none-any.whl size=43309 sha256=05f735e69112c66c656e41112c4de4f50fe0ff324a721a413e1522d5baf01a42
  Stored in directory: /root/.cache/pip/wheels/d2/10/44/6abfb1234298806a145fd6bcaec8cbc712e88dd1cd6cb242fa
Successfully built nrclex
Installing collected packages: nrclex
Successfully installed nrclex-3.0.0


## Importing Libraries

In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nrclex import NRCLex
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Import necessary libraries
import os
import re
import string
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from tqdm import tqdm
from sklearn.model_selection import train_test_split

import nltk
from nltk.corpus import stopwords
from nltk import download as nltk_download
from nrclex import NRCLex
import spacy
from spacy.util import compounding, minibatch

import plotly.express as px
import plotly.figure_factory as ff
from plotly import graph_objs as go

# Set up Jupyter notebook display settings
%matplotlib inline
warnings.filterwarnings("ignore")

# Download necessary NLTK data
nltk_download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Importing Data

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
print("Youtube Data before cleaning:")
videos_stats = pd.read_csv('/content/drive/MyDrive/workspace/videos_stats.csv')
videos_stats.head()

Youtube Data before cleaning:


Unnamed: 0,Title,Video ID,Published At,Keyword,Likes,Comments,Views
0,Apple Pay Is Killing the Physical Wallet After...,wAZZ-UWGVHI,23/08/2022,tech,3407.0,672.0,135612.0
1,The most EXPENSIVE thing I own.,b3x28s61q3c,24/08/2022,tech,76779.0,4306.0,1758063.0
2,My New House Gaming Setup is SICK!,4mgePWWCAmA,23/08/2022,tech,63825.0,3338.0,1564007.0
3,Petrol Vs Liquid Nitrogen | Freezing Experimen...,kXiYSI7H2b0,23/08/2022,tech,71566.0,1426.0,922918.0
4,Best Back to School Tech 2022!,ErMwWXQxHp0,08/08/2022,tech,96513.0,5155.0,1855644.0


In [8]:
print("Comment Data before cleaning:")
comments = pd.read_csv('/content/drive/MyDrive/workspace/comments.csv')
comments.head()

Comment Data before cleaning:


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/workspace/comments.csv'

## Cleaning Data

### Cleaning Video Data

In [None]:
# Create a copy of the DataFrame to work on
vs_clean = videos_stats.copy()

# 1. Handle missing values:
#    fill numeric columns with the median and text columns with the mode
for column in vs_clean.columns:
    if vs_clean[column].dtype == 'object':
        vs_clean[column] = vs_clean[column].fillna(vs_clean[column].mode()[0])
    else:
        vs_clean[column] = vs_clean[column].fillna(vs_clean[column].median())

# 2. Convert 'Published At' to datetime (dates are day/month/year, e.g. 23/08/2022)
vs_clean['Published At'] = pd.to_datetime(vs_clean['Published At'], format='%d/%m/%Y', errors='coerce')

# 3. Remove duplicates
vs_clean.drop_duplicates(inplace=True)

# 4. Trim outliers in 'Likes': keep rows strictly between the 1st and 99th percentiles
q_low = vs_clean['Likes'].quantile(0.01)
q_hi = vs_clean['Likes'].quantile(0.99)
vs_clean = vs_clean[(vs_clean['Likes'] > q_low) & (vs_clean['Likes'] < q_hi)]

# Display the cleaned data & Save the cleaned data back to a new CSV file
print("Youtube Data after cleaning:")
vs_clean.to_csv('videos_stats_cleaned.csv', index=False)
vs_df_clean = pd.read_csv('videos_stats_cleaned.csv')
vs_df_clean

### Cleaning Comments Data

In [None]:
# Create a copy of the DataFrame to work on
cm_clean = comments.copy()

# 1. Handling missing values
cm_clean.dropna(subset=['Comment'], inplace=True)  # Drop rows where 'Comment' is missing
cm_clean['Likes'] = cm_clean['Likes'].fillna(0)  # Fill missing 'Likes' with 0
cm_clean['Sentiment'] = cm_clean['Sentiment'].fillna(cm_clean['Sentiment'].median())

# 2. Convert data types
cm_clean['Likes'] = cm_clean['Likes'].astype(int)
cm_clean['Sentiment'] = cm_clean['Sentiment'].astype(int)

# 3. Data normalization
cm_clean['Comment'] = cm_clean['Comment'].str.strip()  # Remove extra spaces and newlines

# 4. Validate 'Sentiment' values
cm_clean = cm_clean[(cm_clean['Sentiment'] >= 0) & (cm_clean['Sentiment'] <= 2)]

# Save the cleaned data back to a new CSV file
cm_clean.to_csv('comments_cleaned.csv', index=False)

# Reload the file that was just written
cm_df_clean = pd.read_csv('comments_cleaned.csv',
                 lineterminator='\n')

cm_df_clean


# Exploratory Data Analysis of YouTube Trends
## 1. Validating data types

In [None]:
videos_stats.info()

## 2. Validating numerical data

In [None]:
videos_stats.select_dtypes("number")

### Separate Year, Month, Day

In [None]:
# Work on a copy so the raw table stays untouched
vs_df = videos_stats.copy()
# 'Published At' is a day/month/year string such as 23/08/2022
published = pd.to_datetime(vs_df['Published At'], format='%d/%m/%Y')
vs_df['Year'] = published.dt.year
vs_df['Month'] = published.dt.month
vs_df['Day'] = published.dt.day
vs_df

In [None]:
vs_df.select_dtypes("number")

## Merging Data

In [None]:
# Right join: keep every video in vs_clean and attach its comments from cm_clean.
# Both tables have a 'Likes' column, so pandas adds suffixes:
# Likes_x = likes on the comment, Likes_y = likes on the video.
merged_df = pd.merge(cm_clean, vs_clean, on='Video ID', how='right', indicator=True)
# Display the merged DataFrame
print(merged_df)

In [None]:
merged_df= merged_df.drop_duplicates(subset=['Comment'], keep='first')
merged_df = merged_df.reset_index(drop=True)
merged_df

In [None]:
# Create a boolean series where True represents null values in 'Video ID'
null_video_id = merged_df['Video ID'].isnull()
rows_with_null_video_id = merged_df[null_video_id]
print(rows_with_null_video_id)
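
A complementary check uses the `_merge` column that `indicator=True` adds to the merged frame. The sketch below runs the same kind of right join on two made-up frames (the IDs and numbers are illustrative only) and lists the videos that received no matching comment.

```python
import pandas as pd

# Made-up stand-ins for the comments and video tables
toy_comments = pd.DataFrame({
    "Video ID": ["vid_a", "vid_a"],
    "Comment": ["Nice", "Helpful"],
    "Likes": [2, 5],
})
toy_videos = pd.DataFrame({
    "Video ID": ["vid_a", "vid_b"],
    "Likes": [100, 50],
    "Views": [1000, 800],
})

toy_merged = pd.merge(toy_comments, toy_videos, on="Video ID", how="right", indicator=True)

# Shared non-key columns get suffixes: Likes_x (comment likes), Likes_y (video likes)
print(toy_merged.columns.tolist())

# Rows marked 'right_only' are videos that have no matching comment
videos_without_comments = toy_merged.loc[
    toy_merged["_merge"] == "right_only", "Video ID"
].tolist()
print(videos_without_comments)
```

In a right join the key column comes from the video table, so `Video ID` itself is never null; the `Comment` column is what ends up missing for such rows.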

## Engagement Metrics
To calculate an engagement rate, divide the total number of engagements (likes, comments, shares, and so on) by the total number of views, then multiply by 100. The formula applies to both individual videos and entire channels.

This dataset records only likes and comments, so the rate used here is:

Engagement Rate (%) = (Total Number of Likes + Total Number of Comments) / Views * 100

The cells below compute this rate per video, then compare basic engagement metrics (views, likes, and comments) to identify which types of content are most popular in each industry.
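
Before applying the formula to the real data, here is a small worked example on made-up numbers. It also shows one way to handle videos with zero views, which would otherwise produce an infinite rate; treating them as missing is an assumption of this sketch, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Made-up video statistics, for illustration only
toy = pd.DataFrame({
    "Video ID": ["vid_a", "vid_b", "vid_c"],
    "Likes": [50, 10, 5],
    "Comments": [10, 0, 1],
    "Views": [1000, 200, 0],   # vid_c has no views yet
})

# Treat zero views as missing so the division yields NaN instead of inf
views = toy["Views"].replace(0, np.nan)
toy["Engagement Rate"] = (toy["Likes"] + toy["Comments"]) / views * 100

print(toy)
```

For `vid_a` this gives (50 + 10) / 1000 * 100, an engagement rate of 6%.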

In [None]:
# Counting the number of comments per video
comments_count = merged_df.groupby('Title').size().reset_index(name='Comment Count')
comments_count

In [None]:
# Counting comments per video
comments_count = merged_df.groupby('Video ID').size().reset_index(name='comments_count')

# Merging comment counts into the main DataFrame
merged_df = pd.merge(merged_df, comments_count, on='Video ID', how='left')

In [None]:
merged_df['Engagement Rate'] = ((merged_df['Likes_y'] + merged_df['comments_count']) / merged_df['Views']) * 100

# Grouping by 'Keyword' and calculating average engagement rate
keyword_engagement = merged_df.groupby('Keyword')['Engagement Rate'].mean()

# Sort values from smallest to largest for plotting
keyword_engagement_sorted = keyword_engagement.sort_values()

In [None]:
import matplotlib.pyplot as plt

# Determine the total number of bars
n_bars = len(keyword_engagement_sorted)

# The Series is sorted ascending, so the last bars have the highest engagement.
# Color list: blue for the top 5, red for the bottom 5, green for the rest
colors = ['blue' if i >= n_bars - 5 else 'red' if i < 5 else 'green' for i in range(n_bars)]

# Increase the figure height in the figsize attribute for better visualization
plt.figure(figsize=(10, 12))
plt.barh(keyword_engagement_sorted.index, keyword_engagement_sorted.values, color=colors, height=0.4)
plt.ylabel('Keyword')
plt.xlabel('Average Engagement Rate (%)')
plt.title('Average Engagement Rate by Keyword, Sorted')
plt.show()


In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Creating a dictionary from the sorted keywords and their values
keyword_frequencies = {keyword: value for keyword, value in zip(keyword_engagement_sorted.index, keyword_engagement_sorted.values)}

# Creating the word cloud with maximum font size adjusted for visibility
wordcloud = WordCloud(width=800, height=400, max_font_size=100, background_color='white').generate_from_frequencies(keyword_frequencies)

# Displaying the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # No axis for this plot
plt.show()


In [None]:
unique_keywords = merged_df['Keyword'].unique()
unique_keywords

## Keyword-based Classification

Each keyword is assigned to one of the following industries:

- Technology
- Business & Finance
- Gaming
- Media & Entertainment
- Sports
- Education
- Lifestyle & Leisure

In [None]:
industry_map = {
    "Technology": [
        "tech", "apple", "google", "computer science", "data science", "machine learning"
    ],
    "Business & Finance": [
        "business", "finance", "crypto", "interview", "news"
    ],
    "Gaming": [
        "gaming", "tutorial", "nintendo", "xbox", "minecraft", "game development"
    ],
    "Media & Entertainment": [
        "movies", "marvel", "mrbeast", "cnn", "mukbang", "reaction"
    ],
    "Sports": [
        "sports", "chess", "cubes"
    ],
    "Education": [
        "how-to", "history", "literature", "education", "math", "chemistry", "biology", "physics", "sat"
    ],
    "Lifestyle & Leisure": [
        "food", "bed", "animals", "trolling", "asmr", "music", "lofi"
    ]
}

In [None]:
industry_map_data = [(keyword, industry) for industry, keywords in industry_map.items() for keyword in keywords]

# Create the DataFrame
industry_map_df = pd.DataFrame(industry_map_data, columns=['Keyword', 'Industry'])
industry_map_df.head()

In [None]:
merged_df = pd.merge(merged_df, industry_map_df, on='Keyword', how='left')
merged_df
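
Because the merge above is a left join, any keyword that is missing from the map keeps its rows but gets `NaN` as its industry and silently drops out of later `groupby('Industry')` results. The sketch below shows one way to spot such keywords, using a trimmed-down, made-up map in which "gardening" is deliberately left unmapped.

```python
import pandas as pd

# Trimmed-down stand-in for industry_map; "gardening" is deliberately left unmapped
toy_map = {"Technology": ["tech", "apple"], "Gaming": ["gaming"]}
toy_map_df = pd.DataFrame(
    [(kw, industry) for industry, kws in toy_map.items() for kw in kws],
    columns=["Keyword", "Industry"],
)

toy_rows = pd.DataFrame({"Keyword": ["tech", "gaming", "gardening", "apple"]})
toy_labeled = pd.merge(toy_rows, toy_map_df, on="Keyword", how="left")

# A left merge keeps unmapped keywords but leaves their Industry as NaN
unmapped = sorted(toy_labeled.loc[toy_labeled["Industry"].isna(), "Keyword"].unique())
print(unmapped)

# Each keyword should appear in the map only once, or the merge duplicates rows
has_duplicate_keywords = toy_map_df["Keyword"].duplicated().any()
print(has_duplicate_keywords)
```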

In [None]:
# Average engagement rate per industry, sorted ascending for plotting
industry_engagement_sorted = merged_df.groupby('Industry')['Engagement Rate'].mean().sort_values(ascending=True)

# Use the default bar color here: the `colors` list above was sized for the keyword chart
# Increase the figure height in the figsize attribute for better visualization
plt.figure(figsize=(10, 12))
plt.barh(industry_engagement_sorted.index, industry_engagement_sorted.values, height=0.4)
plt.ylabel('Industry')
plt.xlabel('Average Engagement Rate (%)')
plt.title('Average Engagement Rate by Industry, Sorted')
plt.show()


In [None]:
# merged_df has one row per comment, so video-level columns repeat for every comment.
# Keep one row per video before summing to avoid counting a video's stats many times.
video_level = merged_df.drop_duplicates(subset='Video ID')

# Group by 'Industry' and 'Keyword' and aggregate the metrics
keyword_popularity = video_level.groupby(['Industry', 'Keyword']).agg({
    'Likes_y': 'sum',           # Total video likes (Likes_x holds comment likes)
    'Comments': 'sum',          # Total comments
    'Views': 'sum',             # Total views
    'Engagement Rate': 'mean'   # Mean engagement rate
})

# Sort the DataFrame by total likes in descending order
keyword_popularity = keyword_popularity.sort_values(by='Likes_y', ascending=False)

keyword_popularity


In [None]:
# Flatten the (Industry, Keyword) index into regular columns
keyword_popularity_df = keyword_popularity.reset_index()

keyword_popularity_df

In [None]:
# Calculate the mean engagement rate for each industry and sort the industries by it
industry_engagement = keyword_popularity.groupby('Industry')['Engagement Rate'].mean().sort_values(ascending=False)

# Plot settings
plt.figure(figsize=(12, 10))
plt.title('Keyword Popularity by Industry')
plt.xlabel('Engagement Rate')
plt.ylabel('Keyword')

# Iterate over each industry sorted by overall popularity and plot a horizontal bar chart for keyword popularity
for industry in industry_engagement.index:
    data = keyword_popularity.loc[industry].sort_values(by='Engagement Rate', ascending=False)
    plt.barh(data.index.get_level_values('Keyword'), data['Engagement Rate'], label=industry)

# Add legend
plt.legend()

# Show plot
plt.show()


In [None]:
# Calculate the mean views for each industry and sort the industries by it
industry_view = keyword_popularity.groupby('Industry')['Views'].mean().sort_values(ascending=False)

# Plot settings
plt.figure(figsize=(12, 10))
plt.title('Keyword Views by Industry')
plt.xlabel('Views')
plt.ylabel('Keyword')

# Iterate over each industry, sorted by mean views, and plot a horizontal bar chart of keyword views
for industry in industry_view.index:
    data = keyword_popularity.loc[industry].sort_values(by='Views', ascending=False)
    plt.barh(data.index.get_level_values('Keyword'), data['Views'], label=industry)

# Add legend
plt.legend()

# Show plot
plt.show()


# 2. Sentiment Analysis of Video Comments

The dataset scores each comment's sentiment on a scale from 0 to 2. Because this score only separates negative from positive comments, it is not detailed enough to identify specific emotions and opinions. We therefore first look at the proportion of positive and negative comments, then investigate the emotions expressed in the comments.


## 2.1 Emotional landscape

## 2.2 First Element - Opinion


**Sentiment EDA**

In [None]:
merged_df.Sentiment.value_counts()

In [None]:
temp = merged_df.groupby('Sentiment').count()['Comments'].reset_index().sort_values(by='Comments',ascending=False)
temp.style.background_gradient(cmap='Greens')

In [None]:
# Group by 'Sentiment' and count the 'Comments', then reset the index and sort
temp = merged_df.groupby('Sentiment').count()['Comments'].reset_index().sort_values(by='Comments', ascending=False)

# Add a new column with sentiment words
sentiment_mapping = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
temp['Sentiment_Words'] = temp['Sentiment'].map(sentiment_mapping)
temp

**Opinion** is one of the first elements that goes into a sentiment analysis system.

Opinion has three divisions: positive, neutral, and negative.

In this dataset:
- Positive = 2
- Neutral = 1
- Negative = 0
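
The mapping above can be turned into labeled percentage shares in a few lines. This is a minimal sketch on made-up sentiment codes; `value_counts(normalize=True)` returns shares that sum to 1, which avoids dividing by the frame length by hand.

```python
import pandas as pd

sentiment_mapping = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Made-up sentiment codes, for illustration only
toy_sentiment = pd.Series([2, 2, 1, 0, 2], name="Sentiment")

labels = toy_sentiment.map(sentiment_mapping)

# normalize=True returns shares that sum to 1; multiply by 100 for percentages
share_pct = labels.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```

Note that `value_counts` skips missing values by default, so the shares are relative to comments that actually have a sentiment score.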


In [None]:
sentiment_percentage = merged_df.Sentiment.value_counts()/len(merged_df)
sentiment_percentage

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Sentiment',data=merged_df)

In [None]:
fig = go.Figure(go.Funnelarea(
    text =temp.Sentiment_Words,
    values = temp.Comments,
    title = {"position": "top center", "text": "Funnel-Chart of Sentiment Distribution"}
    ))
fig.show()

Overall, positive comments are the most common across videos:

- Positive comments: 62%
- Neutral comments: 25.1%
- Negative comments: 12.9%

Next, let's find out which industries and keywords receive the most positive comments.



In [None]:
# Group by 'Industry' and 'Keyword' and aggregate the metrics
keyword_sentiment = merged_df.groupby(['Industry', 'Keyword']).agg({
    'Sentiment': 'mean'  # Calculate mean sentiment
})

# Sort the DataFrame by 'Sentiment' in descending order
keyword_sentiment = keyword_sentiment.sort_values(by=['Industry', 'Sentiment'], ascending=[False, False])

# Display the sorted DataFrame
keyword_sentiment.head()

In [None]:
# Group by 'Industry' and 'Keyword' and calculate the mean sentiment
keyword_sentiment = merged_df.groupby(['Industry', 'Keyword']).agg({
    'Sentiment': 'mean'  # Calculate mean sentiment
}).reset_index()

# Categorize sentiments
def categorize_sentiment(sentiment):
    if 0 <= sentiment < 0.666:
        return 'Negative'
    elif 0.666 <= sentiment < 1.332:
        return 'Neutral'
    elif 1.332 <= sentiment <= 2:
        return 'Positive'

keyword_sentiment['Sentiment Category'] = keyword_sentiment['Sentiment'].apply(categorize_sentiment)


#Sort value
keyword_sentiment = keyword_sentiment.sort_values(by='Sentiment', ascending=False)


# Set plot style
sns.set(style="whitegrid")

# Create a bar plot
plt.figure(figsize=(14, 8))
bar_plot = sns.barplot(data=keyword_sentiment, x='Sentiment', y='Keyword', hue='Sentiment Category', palette='coolwarm')

# Improve the layout
plt.title('Keyword Sentiment by Industry, Categorized', fontsize=18)
plt.xlabel('Average Sentiment', fontsize=14)
plt.ylabel('Keyword', fontsize=14)
plt.legend(title='Sentiment Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()

# Show plot
plt.show()

In [None]:
industry_sentiment_sum = keyword_sentiment.groupby('Industry')['Sentiment'].sum().sort_values(ascending=False)

# Plot settings
plt.figure(figsize=(12, 10))
plt.title('Keyword Sentiment by Industry')
plt.xlabel('Sentiment')
plt.ylabel('Keyword')

# Iterate over each industry, sorted by summed sentiment, and plot a horizontal bar chart of keyword sentiment
for industry in industry_sentiment_sum.index:
    # Filter data for the current industry
    data_sentiment = keyword_sentiment[keyword_sentiment['Industry'] == industry].sort_values(by='Sentiment', ascending=False)

    # Plot horizontal bar chart
    plt.barh(data_sentiment['Keyword'], data_sentiment['Sentiment'], label=industry)

# Add legend
plt.legend()
# Show plot
plt.show()

## Second Element: Emotion
The categorical model of emotion analysis places a person's emotions into six basic categories: anger, fear, disgust, joy, sadness, and surprise. Specific words are linked to relevant emotion tags and used to detect both related and unrelated emotions ([Reference](https://www.delve.ai/blog/emotion-analysis#:~:text=The%20categorical%20model%20of%20emotion,both%20related%20and%20unrelated%20emotions.)).

This notebook scores comments with [NRCLex](https://pypi.org/project/NRCLex/). In addition to the six categories above, the analysis below also tracks anticipation and trust.
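
To make the word-to-emotion-tag idea concrete, here is a minimal sketch of lexicon-based scoring in plain Python. The four-word lexicon and the `toy_emotion_frequencies` helper are hypothetical and exist only for illustration; the real NRC lexicon is far larger, and this is not NRCLex's implementation.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration; the real NRC lexicon is far larger
TOY_LEXICON = {
    "love": ["joy", "trust"],
    "great": ["joy"],
    "scary": ["fear"],
    "hate": ["anger", "disgust"],
}

def toy_emotion_frequencies(text):
    """Share of emotion tags triggered by the words in text (empty dict if none)."""
    words = text.lower().split()
    hits = Counter(tag for word in words for tag in TOY_LEXICON.get(word, []))
    total = sum(hits.values())
    return {tag: count / total for tag, count in hits.items()} if total else {}

scores = toy_emotion_frequencies("I love this great video")
print(scores)
```

Here "love" triggers joy and trust and "great" triggers joy, so joy accounts for two of the three tag hits and trust for one.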



In [None]:
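# Function to calculate emotion scores
def get_emotion_scores(comment):
    # Videos without comments have NaN here after the right join; return no scores for them
    if not isinstance(comment, str):
        return {}
    return NRCLex(comment).affect_frequencies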

# Create a new DataFrame to work with
comment_df = merged_df.copy()

# Apply function to the comments column
comment_df.loc[:, 'Emotion Scores'] = comment_df['Comment'].apply(get_emotion_scores)

# Extract specific emotions and add them to the DataFrame
emotion_columns = ['fear', 'anger', 'anticipation', 'trust', 'surprise', 'sadness', 'disgust', 'joy']

for emotion in emotion_columns:
    comment_df.loc[:, emotion] = comment_df['Emotion Scores'].apply(lambda x: x.get(emotion.lower(), 0))

# Drop intermediate column
comment_df = comment_df.drop(columns=['Emotion Scores'])

# Display the DataFrame
comment_df.head()


In [None]:
# Average each emotion score by keyword
keyword_avg_df = comment_df.groupby('Keyword')[emotion_columns].mean()

# Calculate the sum of emotion scores for each keyword
keyword_avg_df['Total'] = keyword_avg_df.sum(axis=1)

# Sort the DataFrame based on the total emotion scores
keyword_avg_df = keyword_avg_df.sort_values(by='Total', ascending=False).drop(columns='Total').reset_index()

# Create a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(keyword_avg_df.set_index('Keyword'), annot=True, cmap="coolwarm", cbar=True, linewidths=0.5, linecolor='gray')
plt.title("Average Emotion Scores by Keyword", fontsize=16)
plt.xlabel("Emotions", fontsize=14)
plt.ylabel("Keywords", fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Average each emotion score by industry
industry_avg_df = comment_df.groupby('Industry')[emotion_columns].mean()

# Calculate the sum of emotion scores for each industry
industry_avg_df['Total'] = industry_avg_df.sum(axis=1)

# Sort the DataFrame based on the total emotion scores
industry_avg_df = industry_avg_df.sort_values(by='Total', ascending=False).drop(columns='Total').reset_index()

# Create a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(industry_avg_df.set_index('Industry'), annot=True, cmap="coolwarm", cbar=True, linewidths=0.5, linecolor='gray')
plt.title("Average Emotion Scores by Industry", fontsize=16)
plt.xlabel("Emotions", fontsize=14)
plt.ylabel("Industry", fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:

# Calculate the overall average of each emotion
average_emotions = comment_df[emotion_columns].mean().sort_values()

# Plot the bar graph
plt.figure(figsize=(10, 6))
average_emotions.plot(kind='bar')
plt.title('Overall Average of Each Emotion (Sorted)')
plt.xlabel('Emotions')
plt.ylabel('Average Score')
plt.xticks(rotation=45)
plt.show()

## 2.3 Identifying trends

### Time Series

In [None]:
# Choose an emotion to visualize (e.g., 'anger')
chosen_emotion = 'anger'

# Resample data by year and calculate average emotion score
resampled_data = comment_df.resample('Y', on='Published At')[chosen_emotion].mean()

# Create time series plot
plt.figure(figsize=(10, 6))
plt.plot(resampled_data.index.year, resampled_data.values, marker='o', linestyle='-')
plt.xlabel('Year')
plt.ylabel(f'Average {chosen_emotion.capitalize()} Score')
plt.title(f'Average Yearly {chosen_emotion.capitalize()} Score Over Time')
plt.grid(True)
plt.show()

In [None]:
# Choose an emotion to visualize (e.g., 'anger')
chosen_emotion = 'anger'

# Resample data by quarter and calculate average emotion score
resampled_data = comment_df.resample('Q', on='Published At')[chosen_emotion].mean()

# Create time series plot (use the full timestamp so quarters in the same year do not overlap)
plt.figure(figsize=(10, 6))
plt.plot(resampled_data.index, resampled_data.values, marker='o', linestyle='-')
plt.xlabel('Quarter')
plt.ylabel(f'Average {chosen_emotion.capitalize()} Score')
plt.title(f'Average Quarterly {chosen_emotion.capitalize()} Score Over Time')
plt.grid(True)
plt.show()
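
As a small illustration of the quarterly aggregation used above, the sketch below averages an emotion score per calendar quarter on made-up data. It groups by `dt.to_period("Q")` rather than calling `resample`; this is an alternative chosen for the sketch, and unlike `resample` it does not create rows for quarters that have no videos.

```python
import pandas as pd

# Made-up publication dates and anger scores, for illustration only
toy_ts = pd.DataFrame({
    "Published At": pd.to_datetime(
        ["15/01/2022", "20/02/2022", "05/05/2022", "30/08/2022"], format="%d/%m/%Y"
    ),
    "anger": [0.10, 0.30, 0.20, 0.40],
})

# Group by calendar quarter; each label keeps both the year and the quarter
quarterly = toy_ts.groupby(toy_ts["Published At"].dt.to_period("Q"))["anger"].mean()
print(quarterly)
```

The two first-quarter videos are averaged into a single value, while the second and third quarters each keep their one score.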

# 3. Development of a Video Ranking Model


In [None]:
train_df, test_df = train_test_split(merged_df, test_size=0.3, random_state=42)

# Print the shapes of the resulting datasets
print("Training Data Shape:", train_df.shape)
print("Testing Data Shape:", test_df.shape)
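
One possible starting point for the ranking itself, sketched under loud assumptions: each video is summarized by its engagement rate and the mean sentiment of its comments, both are rescaled to the 0 to 1 range, and a weighted sum gives the score. The per-video numbers, the `min_max` helper, and the equal weights are all hypothetical choices made for illustration, not results or settings from this analysis.

```python
import pandas as pd

# Made-up per-video summary; in practice this would be aggregated from merged_df
toy_videos = pd.DataFrame({
    "Video ID": ["vid_a", "vid_b", "vid_c"],
    "Engagement Rate": [6.0, 2.0, 5.0],   # percent
    "Mean Sentiment": [1.0, 1.6, 2.0],    # on the 0-2 scale
})

def min_max(series):
    """Rescale a Series to [0, 1]; a constant Series maps to 0."""
    span = series.max() - series.min()
    return (series - series.min()) / span if span else series * 0.0

# Hypothetical weights, chosen only for illustration
W_ENGAGEMENT, W_SENTIMENT = 0.5, 0.5

toy_videos["Score"] = (
    W_ENGAGEMENT * min_max(toy_videos["Engagement Rate"])
    + W_SENTIMENT * min_max(toy_videos["Mean Sentiment"])
)

# Highest score first
ranked = toy_videos.sort_values("Score", ascending=False).reset_index(drop=True)
print(ranked[["Video ID", "Score"]])
```

Rescaling first keeps one metric from dominating just because it has a larger numeric range; the weights can then be tuned to favor engagement or sentiment for a given client.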

## 💪 Competition challenge

Create a report that covers the following:

1. **Exploratory Data Analysis of YouTube Trends:**
   - Conduct an initial analysis of YouTube video trends across different industries. This analysis should explore basic engagement metrics such as views, likes, and comments and identify which types of content are most popular in each industry.

2. **Sentiment Analysis of Video Comments:**
   - Perform a sentiment analysis on video comments to measure viewer perceptions. This task involves basic processing of text data and visualizing sentiment trends across various video categories.

3. **Development of a Video Ranking Model:**
   - Create a simple model that uses sentiment analysis results and traditional engagement metrics to rank videos. This model should help identify potentially valuable videos for specific industry sectors.

4. **Strategic Recommendation for E-Learning Collaboration:**
   - Use your model’s findings to identify YouTube videos that would be particularly effective for an **E-Learning platform focused on Data and AI skills**. Include recommendations for **three specific videos**, briefly explaining why each is ideal for promoting your E-Learning platform.

## 🧑‍⚖️ Judging criteria

| CATEGORY | WEIGHTING | DETAILS                                                              |
|:---------|:----------|:---------------------------------------------------------------------|
| **Recommendations** | 35%       | <ul><li>Clarity of recommendations - how clear and well presented the recommendation is.</li><li>Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?</li><li>Number of relevant insights found for the target audience.</li></ul>       |
| **Storytelling**  | 35%       | <ul><li>How well the data and insights are connected to the recommendation.</li><li>How the narrative and whole report connects together.</li><li>Balancing making the report in-depth enough but also concise.</li></ul> |
| **Visualizations** | 20% | <ul><li>Appropriateness of visualization used.</li><li>Clarity of insight from visualization.</li></ul> |
| **Votes** | 10% | <ul><li>Up voting - most upvoted entries get the most points.</li></ul> |

## ✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- **Remove redundant cells** like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights.
- Try to include an **executive summary** of your recommendations at the beginning.
- Check that all the cells run without error.

## ⌛️ Time is ticking. Good luck!