# Project 3 - Sentiment Analysis of Elon Musk using YouTube Comments

## Part 1 - Data Extraction, Cleaning and Exploratory Data Analysis (EDA)

## 1. Introduction and Problem Statement

Public image refers to 'the character or attitudes that most people think' a person or organisation has ([source](https://www.ldoceonline.com/dictionary/public-image)). In most cases, a good public image is essential for:
1. Building trust,
2. Attracting customers, and
3. Fostering positive relationships with stakeholders.

As such, individuals and organisations need to keep their finger on the pulse of public sentiment about them at all times. In this project, I focus on public sentiment towards Elon Musk, a well-known and polarising figure of the past few years. In this scenario, I am part of a data science-backed PR agency that helps Elon navigate the murky waters of media scrutiny and public perception, ensuring his actions and communications align with his goals.


The problem statement is:
**<center>Can we create an effective classification model to identify the level of negative sentiment in text comments?</center>**

The goal is to help individuals and companies understand public sentiment around them so that they can build their PR strategy. This will be done by creating a sentiment analysis model whose dependent variable is a binary label indicating whether a comment made about them is negative (where '1' refers to a negative comment and '0' refers to a non-negative comment).

Success will be measured by the model's recall and accuracy scores. The higher these scores, the better the model is at predicting public sentiment around an individual or company. Between the two, we will prioritise recall, so that as few negative comments as possible are missed. This leaves us with more data to look through if we want to dive deeper into the reasons behind negative sentiment.
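
To make the two metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def recall(tp: int, fn: int) -> float:
    """Share of truly negative comments that the model flags as negative."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all comments that the model labels correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 100 negative comments (80 caught, 20 missed)
# and 900 non-negative comments (850 correct, 50 wrongly flagged).
tp, fn, fp, tn = 80, 20, 50, 850

print(f"recall   = {recall(tp, fn):.2f}")
print(f"accuracy = {accuracy(tp, tn, fp, fn):.2f}")
```

Prioritising recall means accepting some extra false positives (`fp`) in exchange for fewer missed negative comments (`fn`).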

The primary audience will be individuals and companies of interest, while the secondary audience will be existing PR companies that we can explore partnerships with.

## 2. Data Extraction

In [1]:
import googleapiclient.discovery
import pandas as pd
import html
from bs4 import BeautifulSoup

from langdetect import detect

import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import re
import pickle

To train our model, we will be using comments from the YouTube video ['Joe Rogan Experience #1169 - Elon Musk'](https://www.youtube.com/watch?v=ycPr5-27vSI). This is because:
1. It is the most viewed video about Elon Musk (with approximately 67 million views, and 140,000 comments)
2. The video was published in September 2018, so our model can learn from comments spanning almost five years, from shortly after publication until August 2023.

Due to YouTube API limitations, we will pull 50,000 comments from this video, which should be more than enough to train our model.

In [2]:
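# YouTube API Credentials
import os

api_service_name = "youtube"
api_version = "v3"
# Read the key from an environment variable rather than hardcoding it in the notebook
# (the variable name YOUTUBE_API_KEY is an assumption; use whatever name you have set)
DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]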

In [3]:
# Using YouTube API to pull 50000 comments from a video
def get_video_comments(video_id, max_results=100, max_comments=50000):
    youtube = googleapiclient.discovery.build(api_service_name, api_version, developerKey=DEVELOPER_KEY)
    
    comments = []
    next_page_token = None
    total_comments_fetched = 0
    
    while total_comments_fetched < max_comments:
        remaining_comments_to_fetch = max_comments - total_comments_fetched
        comments_to_fetch = min(max_results, remaining_comments_to_fetch)
        
        request = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=comments_to_fetch,
            pageToken=next_page_token
        )
        
        response = request.execute()
        
        # Get the video information (title and upload date) for the first batch of comments
        if total_comments_fetched == 0:
            video_response = youtube.videos().list(
                part="snippet",
                id=video_id
            ).execute()
            
            video_title = video_response['items'][0]['snippet']['title']
            video_published_at = video_response['items'][0]['snippet']['publishedAt']
        
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']
            comments.append([
                comment['authorDisplayName'],
                comment['publishedAt'],
                comment['likeCount'],
                comment['textDisplay'],
                video_title,  # Adding video title for each comment
                video_published_at  # Adding video upload date for each comment
            ])
        
        # Count the comments actually returned, which can be fewer than requested
        total_comments_fetched += len(response['items'])
        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break
    
    return comments

# List of video IDs for which you want to fetch comments
video_ids = ["ycPr5-27vSI"]

# Initialize an empty list to store all comments from multiple videos
all_comments = []

for video_id in video_ids:
    comments_for_video = get_video_comments(video_id)
    all_comments.extend(comments_for_video)

# Create a DataFrame from all the comments
df = pd.DataFrame(all_comments, columns=['author', 'published_at', 'like_count', 'text', 'video_title', 'video_published_at'])

# Display the first 10 rows of the DataFrame
df.head(10)

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
0,Ronnie Howard,2023-08-08T12:46:01Z,0,I see I ribbit playing out in the next 50 years,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
1,Daham Kumarapathirana,2023-08-08T02:09:05Z,0,67 Million Joe Rogan&#39;s Watched this video 😅,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
2,Beau Johnson,2023-08-07T09:54:50Z,0,"Joe, giant carbon filters are called trees the...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
3,ProphetKilo,2023-08-06T23:22:43Z,0,The Sun is not actually on fire in the same wa...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
4,Khadulau,2023-08-06T22:35:45Z,0,"If anyone cares to look, Elon Musk wiki still ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
5,Preeti Singh,2023-08-06T10:55:29Z,0,Correction with ACID THEORY: It was not Buddhi...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
6,Overcomecross98,2023-08-05T21:55:21Z,2,I feel like he would be really annoying to tal...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
7,Jesse,2023-08-05T16:27:07Z,0,&quot;running an engine with no resistance&quo...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
8,supertiaj,2023-08-05T00:48:09Z,0,I would not consent to this AI crap,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
9,Kilraven Wild Song Artist,2023-08-04T06:48:30Z,0,"Amen guys. All we need is love, love, love. Gr...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z


In [5]:
# Saving the file as an Original DataFrame so we do not need to run the above function many times
with open('../data/00_Original_DataFrame.pickle', 'wb') as file:
    pickle.dump(df, file)

In [6]:
# Load the file
with open('../data/00_Original_DataFrame.pickle', 'rb') as file:
    df = pickle.load(file)

In [7]:
# How many comments did we get?
df.shape

(49999, 6)
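
A paginated API can return fewer items on a page than were requested, so a loop that tracks progress by the requested page size can stop slightly short of its target. That may be one reason the frame has 49,999 rows rather than 50,000, though this is not confirmed here. The sketch below counts the items actually received. `fetch_all` and `make_fake_api` are hypothetical helpers, and the fake API is an in-memory stand-in rather than the real YouTube client.

```python
from typing import Callable, Optional


def fetch_all(fetch_page: Callable[[Optional[str], int], dict],
              page_size: int = 100, max_items: int = 50000) -> list:
    """Collect up to max_items, tracking progress by items actually received."""
    items = []
    token = None
    while len(items) < max_items:
        want = min(page_size, max_items - len(items))
        page = fetch_page(token, want)
        items.extend(page["items"])
        token = page.get("nextPageToken")
        if not token:
            break
    return items


def make_fake_api(total: int, short_page_at: int):
    """Hypothetical stand-in for the API: one page comes back one item short."""
    state = {"served": 0, "calls": 0}

    def fetch_page(token, want):
        state["calls"] += 1
        n = min(want, total - state["served"])
        if state["calls"] == short_page_at:
            n = max(n - 1, 0)  # simulate a page with fewer items than requested
        start = state["served"]
        state["served"] += n
        more = state["served"] < total
        return {"items": list(range(start, start + n)),
                "nextPageToken": "next" if more else None}

    return fetch_page


comments = fetch_all(make_fake_api(total=1000, short_page_at=2),
                     page_size=100, max_items=500)
print(len(comments))
```

Because the loop condition depends on `len(items)`, the short page is made up by one extra small request instead of leaving the result one item short.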

## 3. Data Cleaning

We will start off with data cleaning - focusing on null values, duplicated values, data types, HTML-encoded entities, and foreign language.

### 3.1 Handling Null Values

In [8]:
# Looking for columns with null values
df.isnull().sum()

author                0
published_at          0
like_count            0
text                  0
video_title           0
video_published_at    0
dtype: int64

There are no null values.

### 3.2 Handling Duplicated Values

In [9]:
# Counting comments whose text duplicates an earlier comment
df['text'].duplicated().sum()

947

In [10]:
df[df['text'].duplicated()]

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
46,SolaraProject,2023-07-29T09:41:53Z,0,Elon: &quot;Be nicer to each other&quot;<br>Jo...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
711,Maria,2023-01-20T05:38:18Z,0,So what’s the statue of limitations on capital...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
788,Loan Wroblewski,2022-12-30T00:04:10Z,3,"<a href=""https://www.youtube.com/watch?v=ycPr5...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
1575,Güneş türkyılmaz,2022-11-15T07:21:12Z,0,💯💯,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
1696,They Wright,2022-11-09T20:28:21Z,0,You two have smoked way too much weed..,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
...,...,...,...,...,...,...
49581,Izzy G.,2018-10-12T19:08:31Z,0,"<a href=""https://www.youtube.com/watch?v=ycPr5...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
49704,Macbeth and her demon 🥀👹,2018-10-10T19:24:31Z,0,Big pit,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
49932,Mayne Maybe,2018-10-07T11:18:07Z,0,Wtf,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z
49944,Levi H.,2018-10-07T06:57:14Z,0,Business magnet,Joe Rogan Experience #1169 - Elon Musk,2018-09-07T08:12:43Z


There are 947 rows where the comment text repeats an earlier comment. Let us drop them (while keeping the first occurrence).

In [11]:
df.drop_duplicates(subset='text', keep='first', inplace=True)
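
To illustrate what keeping the first occurrence means, here is a small pure-Python sketch of the same idea. The helper `drop_duplicate_texts` and the author names are hypothetical; the notebook itself relies on pandas for this step.

```python
def drop_duplicate_texts(rows):
    """Keep only the first row seen for each distinct comment text."""
    seen = set()
    kept = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            kept.append(row)
    return kept


# Hypothetical rows: "Wtf" appears three times, so two rows count as repeats
rows = [
    {"author": "a", "text": "Wtf"},
    {"author": "b", "text": "Big pit"},
    {"author": "c", "text": "Wtf"},
    {"author": "d", "text": "Wtf"},
]

kept = drop_duplicate_texts(rows)
print(len(rows) - len(kept), "repeats dropped;", len(kept), "rows kept")
```

The duplicate count refers to repeats only, not to every row that shares its text with another row, which is why 49,999 rows minus 947 repeats leaves 49,052 rows.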

### 3.3 Handling Data Types

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49052 entries, 0 to 49998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   author              49052 non-null  object
 1   published_at        49052 non-null  object
 2   like_count          49052 non-null  int64 
 3   text                49052 non-null  object
 4   video_title         49052 non-null  object
 5   video_published_at  49052 non-null  object
dtypes: int64(1), object(5)
memory usage: 2.6+ MB


We have two columns that are meant to be datetimes:
1. 'published_at'
2. 'video_published_at'

Let us change the type.

In [13]:
df['published_at'] = pd.to_datetime(df['published_at'])
df['video_published_at'] = pd.to_datetime(df['video_published_at'])
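
For reference, the raw strings are ISO 8601 timestamps in which the trailing `Z` marks UTC. The sketch below parses two values taken from the output above using only the standard library. It assumes the exact format seen there (no fractional seconds), and `parse_youtube_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_youtube_timestamp(value: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


comment_time = parse_youtube_timestamp("2023-08-08T12:46:01Z")
video_time = parse_youtube_timestamp("2018-09-07T08:12:43Z")

# With real datetimes, the gap between upload and comment is a simple subtraction
age = comment_time - video_time
print(age.days, "days between upload and comment")
```

Once the columns are real datetimes, this kind of arithmetic (and grouping by month or year) becomes straightforward, which is the main reason for converting them.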

Additionally, the 'like_count' should be an integer. Let us investigate.

In [14]:
df['like_count'].unique()

array([    0,     2,     1,     3,     4,    14,     9,    12,    10,
         142,     6,     7,    26,    20,     8,    31,  1019,    11,
           5,    99,    33,    97,    17,    13,   700,    27,   164,
          86,    19,   388,    68,    47,   277,    18,    37,   107,
          25,    42,    65,    58,    41,   160,    70,   281,    28,
         505,    55,    30,    21,    59,   439,    15,    46,   199,
          62,   376,   829,    94,   206,    52,    40,    64,    88,
          60,    44,   136,  1436,    35,   244,    23,    73,    32,
         131,    76,   621,   146,   370,   139,    16,   280,   116,
          34,   898,    24,    50,   144,   121,    22,  1717,   519,
          91,   375,   323,    84,   343,   532,  1439,   127,   129,
         190,   340,  1486,   620,    79,    45,   231,   655,    29,
          49,    36,    63,    38,    57,    71,   237,    48,   265,
         622,   883,   103,    39,   102,   378,    89,   156,   273,
          74,   204,

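There does not seem to be any issue with the column. `df.info()` above already reports it as int64, so the cast below only makes the type explicit.

In [15]:
# astype returns a new Series, so assign it back for the cast to take effect
df['like_count'] = df['like_count'].astype(int)
df['like_count']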

0          0
1          0
2          0
3          0
4          0
        ... 
49993      0
49994    156
49995      0
49997      0
49998      0
Name: like_count, Length: 49052, dtype: int64

In [16]:
# Final data types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49052 entries, 0 to 49998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   author              49052 non-null  object             
 1   published_at        49052 non-null  datetime64[ns, UTC]
 2   like_count          49052 non-null  int64              
 3   text                49052 non-null  object             
 4   video_title         49052 non-null  object             
 5   video_published_at  49052 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](2), int64(1), object(3)
memory usage: 2.6+ MB


### 3.4 Handling HTML-encoded entities

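As we saw earlier, the comments contain HTML-encoded entities (such as `&#39;` and `&quot;`) and HTML tags (such as `<br>` and `<a href=...>`), which make them hard to read. We will use BeautifulSoup to strip the markup and decode the entities.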
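
To show what this clean-up involves, here is a rough standard-library sketch. It is a simplified alternative for illustration, not the method used in this notebook: `clean_comment_html` is a hypothetical helper, and the regular expressions assume only simple tags like those seen in the comments.

```python
import html
import re


def clean_comment_html(text: str) -> str:
    """Turn <br> into newlines, drop other tags, then decode HTML entities."""
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities last, so an encoded '&lt;' is not mistaken for a real tag
    return html.unescape(text)


print(clean_comment_html("67 Million Joe Rogan&#39;s Watched this video"))
print(clean_comment_html("A &amp; B<br>C"))
```

Replacing `<br>` with a newline keeps the words on either side of a line break from being fused together once the tag is removed.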

In [17]:
df['text'] = df['text'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
df


Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
0,Ronnie Howard,2023-08-08 12:46:01+00:00,0,I see I ribbit playing out in the next 50 years,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
1,Daham Kumarapathirana,2023-08-08 02:09:05+00:00,0,67 Million Joe Rogan's Watched this video 😅,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
2,Beau Johnson,2023-08-07 09:54:50+00:00,0,"Joe, giant carbon filters are called trees the...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
3,ProphetKilo,2023-08-06 23:22:43+00:00,0,The Sun is not actually on fire in the same wa...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
4,Khadulau,2023-08-06 22:35:45+00:00,0,"If anyone cares to look, Elon Musk wiki still ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
...,...,...,...,...,...,...
49993,Arm Wrestler,2018-10-06 19:01:01+00:00,0,I rather be pessimistic and WRONG,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
49994,EL34XYZ,2018-10-06 18:54:27+00:00,156,"Wow, jaw dropping interview. It's fascinating ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
49995,River,2018-10-06 18:52:02+00:00,0,"A day later joe “Elon weird, I don’t understan...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00
49997,Gaz Potts,2018-10-06 18:42:25+00:00,0,I watched all this earlier I couldn't stop it ...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00


### 3.5 Handling Foreign Languages

Finally, we will handle comments written in languages other than English.

In [18]:
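# Creating an 'is_english' column
def is_english(text):
    try:
        lang = detect(text)
        return 1 if lang == 'en' else 0
    except Exception:
        # detect() raises when it finds no usable language features (e.g. emoji-only text);
        # such comments are kept by treating them as English
        return 1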

df['is_english'] = df['text'].apply(is_english)

In [19]:
df

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at,is_english
0,Ronnie Howard,2023-08-08 12:46:01+00:00,0,I see I ribbit playing out in the next 50 years,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
1,Daham Kumarapathirana,2023-08-08 02:09:05+00:00,0,67 Million Joe Rogan's Watched this video 😅,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
2,Beau Johnson,2023-08-07 09:54:50+00:00,0,"Joe, giant carbon filters are called trees the...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
3,ProphetKilo,2023-08-06 23:22:43+00:00,0,The Sun is not actually on fire in the same wa...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
4,Khadulau,2023-08-06 22:35:45+00:00,0,"If anyone cares to look, Elon Musk wiki still ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
...,...,...,...,...,...,...,...
49993,Arm Wrestler,2018-10-06 19:01:01+00:00,0,I rather be pessimistic and WRONG,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
49994,EL34XYZ,2018-10-06 18:54:27+00:00,156,"Wow, jaw dropping interview. It's fascinating ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
49995,River,2018-10-06 18:52:02+00:00,0,"A day later joe “Elon weird, I don’t understan...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
49997,Gaz Potts,2018-10-06 18:42:25+00:00,0,I watched all this earlier I couldn't stop it ...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1


In [20]:
df['is_english'].value_counts()

1    44315
0     4737
Name: is_english, dtype: int64

In [21]:
df[df['is_english']==0].head(50)

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at,is_english
18,Cooper6006,2023-08-02 04:53:26+00:00,0,Elon seems 100% human here! 😅,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
21,Cooper6006,2023-08-02 04:38:09+00:00,0,Sniper vision.,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
27,David McFarland,2023-08-02 02:27:04+00:00,0,Wow,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
30,YELLOW FF,2023-08-01 18:04:21+00:00,0,https://youtube.com/@qadargaming,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
42,Junhao Liang,2023-07-29 23:41:37+00:00,0,Hahaha,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
49,Kid,2023-07-28 02:10:04+00:00,0,Sudy needed,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
57,Gabe Moralez,2023-07-24 18:10:13+00:00,0,Elon bro… ur scaring me like frfr,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
58,ZeeHamada,2023-07-24 03:02:19+00:00,0,elon musk finna beat mark zuckerberg,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
104,On Point,2023-07-07 04:46:18+00:00,0,Dope,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0
125,gbeatsmacedonia,2023-06-26 11:13:29+00:00,0,1:10:00 bch,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,0


Based on the first 50 rows, a good majority of the flagged comments are actually in English, with only a couple of rows in Spanish and Portuguese. Many of the flagged rows are simply timestamps in the video, YouTube links, or slang.

Even if we drop all of the flagged rows, we still have over 44,000 comments to train our model. So, let us drop them.
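
Before dropping the flagged rows, it can be useful to tally roughly why each one was flagged. The sketch below is one hypothetical way to do that with simple rules. `flag_reason`, the regular expressions, and the two-word cut-off for "short" are all assumptions made for illustration; the sample texts are taken from the output above.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def flag_reason(text: str, short_words: int = 2) -> str:
    """Rough guess at why a comment may have been flagged as non-English."""
    if URL_RE.search(text):
        return "link"
    if TIMESTAMP_RE.search(text):
        return "timestamp"
    if len(text.split()) <= short_words:
        return "short"
    return "other"


flagged = [
    "https://youtube.com/@qadargaming",
    "1:10:00 bch",
    "Wow",
    "Sniper vision.",
    "elon musk finna beat mark zuckerberg",
]

print(Counter(flag_reason(text) for text in flagged))
```

A tally like this shows how many of the dropped rows are links, timestamps, or very short comments rather than genuinely foreign-language comments.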

In [22]:
df.drop(df[df['is_english'] == 0].index, inplace=True)
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at,is_english
0,Ronnie Howard,2023-08-08 12:46:01+00:00,0,I see I ribbit playing out in the next 50 years,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
1,Daham Kumarapathirana,2023-08-08 02:09:05+00:00,0,67 Million Joe Rogan's Watched this video 😅,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
2,Beau Johnson,2023-08-07 09:54:50+00:00,0,"Joe, giant carbon filters are called trees the...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
3,ProphetKilo,2023-08-06 23:22:43+00:00,0,The Sun is not actually on fire in the same wa...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
4,Khadulau,2023-08-06 22:35:45+00:00,0,"If anyone cares to look, Elon Musk wiki still ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
...,...,...,...,...,...,...,...
44310,Arm Wrestler,2018-10-06 19:01:01+00:00,0,I rather be pessimistic and WRONG,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
44311,EL34XYZ,2018-10-06 18:54:27+00:00,156,"Wow, jaw dropping interview. It's fascinating ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
44312,River,2018-10-06 18:52:02+00:00,0,"A day later joe “Elon weird, I don’t understan...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1
44313,Gaz Potts,2018-10-06 18:42:25+00:00,0,I watched all this earlier I couldn't stop it ...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1


## 4. Feature Engineering

To get more details of the comments, let us create a couple of columns: comment length and comment word count.

In [23]:
df['comment_length'] = df['text'].str.len()

In [24]:
df['comment_word_count'] = df['text'].apply(lambda x: len(x.split()))
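
As a quick check of what these two features measure, the sketch below applies the same logic to two comments from the data. `comment_features` is a hypothetical helper used only for illustration.

```python
def comment_features(text: str) -> dict:
    """Character count and whitespace-separated word count for one comment."""
    return {
        "comment_length": len(text),               # counts every character, spaces included
        "comment_word_count": len(text.split()),   # splits on any run of whitespace
    }


for text in [
    "I see I ribbit playing out in the next 50 years",
    "I rather be pessimistic and WRONG",
]:
    print(comment_features(text), "<-", text)
```

Note that `len` counts Unicode code points, so an emoji can count as one or several characters, and `split()` treats punctuation attached to a word as part of that word.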

In [25]:
df

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at,is_english,comment_length,comment_word_count
0,Ronnie Howard,2023-08-08 12:46:01+00:00,0,I see I ribbit playing out in the next 50 years,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,47,11
1,Daham Kumarapathirana,2023-08-08 02:09:05+00:00,0,67 Million Joe Rogan's Watched this video 😅,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,43,8
2,Beau Johnson,2023-08-07 09:54:50+00:00,0,"Joe, giant carbon filters are called trees the...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,69,13
3,ProphetKilo,2023-08-06 23:22:43+00:00,0,The Sun is not actually on fire in the same wa...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,450,81
4,Khadulau,2023-08-06 22:35:45+00:00,0,"If anyone cares to look, Elon Musk wiki still ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,95,17
...,...,...,...,...,...,...,...,...,...
44310,Arm Wrestler,2018-10-06 19:01:01+00:00,0,I rather be pessimistic and WRONG,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,33,6
44311,EL34XYZ,2018-10-06 18:54:27+00:00,156,"Wow, jaw dropping interview. It's fascinating ...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,328,55
44312,River,2018-10-06 18:52:02+00:00,0,"A day later joe “Elon weird, I don’t understan...",Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,67,13
44313,Gaz Potts,2018-10-06 18:42:25+00:00,0,I watched all this earlier I couldn't stop it ...,Joe Rogan Experience #1169 - Elon Musk,2018-09-07 08:12:43+00:00,1,186,37


In [26]:
# Saving the file as pickle
with open('../data/01_Cleaned_Data.pickle', 'wb') as file:
    pickle.dump(df, file)