### LSE Data Analytics Online Career Accelerator 

# DA301: Advanced Analytics for Organisational Impact

## Practical activity: Sentiment analysis using Python

You are part of a data analytics team at a global company, FutureProof, whose product line includes a range of innovative cybersecurity solutions. The marketing manager is considering using ChatGPT to generate content, with the aim of strengthening the brand’s presence on YouTube and other social media platforms. The campaign will require creating engaging social media content, scheduling updates to social media channels, and moderating and responding to comments. However, the CEO has reservations about using both ChatGPT and YouTube. You have been asked to research the sentiment towards ChatGPT on YouTube. You will start by retrieving comments about ChatGPT directly from YouTube. To do this, you will need the YouTube API key that you created through your Google Cloud account.

In this activity, you’ll pre-process the extracted comments and perform sentiment analysis on them. You’ll work with the NLTK VADER class, which scores words as positive, neutral, or negative, and then assign each comment a sentiment estimate. You will:

- access the API in Python and query YouTube for key phrases
- customise the query and join results from the query in a Pandas DataFrame
- apply some pre-processing and perform sentiment analysis
- use the polarity score function and identify related words
- visualise the output to help the business decide whether to use ChatGPT.


## 1. Prepare your workstation

In [None]:
# If needed, install the libraries.
# !pip install google-api-python-client

In [None]:
# Import the necessary libraries
import googleapiclient.discovery
import os
import json
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

# Locate and read the key from your .env file.
API_key = os.getenv('YouTube_API_key')


## 2. Retrieve comments from the defined video

In [None]:
# Build the YouTube API client
youtube = googleapiclient.discovery.build('youtube', 'v3', developerKey=API_key)

# Request the comment threads for the video
comment_response = youtube.commentThreads().list(
    part='snippet,replies',
    maxResults=100,
    videoId='40Kp_fa8vIw'
).execute()

# Get the comments (each item in the response is one comment thread)
comments = comment_response.get('items', [])

# Print the comments
for comment in comments:
    print(comment['snippet']['topLevelComment']['snippet']['textDisplay'])
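
The `commentThreads().list()` response is a dictionary whose `items` list holds one entry per comment thread. As a minimal sketch, the snippet below pulls the top-level comment text out of a hand-written stand-in for that response, so it runs without an API key. The name `mock_response`, its two comments, and the helper `extract_comment_text` are all invented for illustration.

```python
# Hand-written stand-in for a commentThreads().list() response.
# The comment text below is invented purely for illustration.
mock_response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "ChatGPT is amazing"}}}},
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Not convinced by this"}}}},
    ]
}


def extract_comment_text(response):
    """Return the display text of every top-level comment in a response."""
    # .get() with a default keeps the function safe if 'items' is missing.
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]


comment_texts = extract_comment_text(mock_response)
for text in comment_texts:
    print(text)
```

In the notebook, the same helper could be called on `comment_response` instead of the mock dictionary.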

## 3. Create a DataFrame

In [None]:
import pandas as pd

# Create a list of comments


# Iterate over the comments and add them to the list


# Create a DataFrame


# Print the DataFrame


# View shape of output


In [None]:
# Determine values of output.


# View results.
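
One possible way to approach the DataFrame steps is sketched below, assuming the comment text has already been collected into a Python list. The list contents, the names `comment_list`, `df` and `comment_counts`, and the column name `comments` are hypothetical placeholders. Reading "determine values of output" as a frequency count per distinct comment is also an assumption.

```python
import pandas as pd

# Placeholder comments; in the notebook these would come from the API response.
comment_list = ["ChatGPT is amazing", "Not convinced by this", "ChatGPT is amazing"]

# Build a single-column DataFrame from the list.
df = pd.DataFrame(comment_list, columns=["comments"])

# View the shape as (rows, columns).
print(df.shape)

# Count how often each distinct comment appears.
comment_counts = df["comments"].value_counts()
print(comment_counts)
```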


## 4. Pre-processing comments

In [None]:
# Import nltk and the required resources.

# Create a variable to store the stopwords.


In [None]:
# The results will change every time the code is executed. Let's review the first 15.

# Print the first 15 comments without stop words


In [None]:
# Look at one comment
# Based on the results of the previous cell, select a comment in English that contains keywords suitable for text analysis

# Set the index of the comment to be returned


In [None]:
# Split up each comment into individual words


# View results.


In [None]:
# Get a list of all English words so we can exclude anything that doesn't appear in the list.


# View results.


In [None]:
# Some pre-processing:
# -- let's take every word
# -- let's convert it to lowercase
# -- only include the word if it is alphabetic and in the list of English words, but is not a stopword.

df3 = [[y.lower() for y in x if y.lower() not in stop_words and y.isalpha() and y.lower() in all_english_words] for x in df2]
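
The list comprehension above expects three inputs that earlier cells are meant to define: `df2` (each comment split into tokens), `stop_words`, and `all_english_words`. A minimal sketch of how those pieces fit together follows. To keep it runnable offline, it uses tiny hand-picked sets in place of the NLTK stop word and word-list corpora, and `str.split()` in place of a proper tokeniser. Those substitutions, and the example comments, are assumptions for illustration only.

```python
# Tiny stand-ins for the NLTK corpora (assumption: the notebook would load
# much larger lists, e.g. from nltk.corpus.stopwords and nltk.corpus.words).
stop_words = {"is", "the", "a", "this", "by"}
all_english_words = {"amazing", "really", "useful", "tool"}

# Invented example comments.
raw_comments = ["ChatGPT is AMAZING!!", "Really useful tool 123"]

# Stand-in tokenisation: split on whitespace only, so punctuation
# stays attached to the neighbouring word.
df2 = [comment.split() for comment in raw_comments]

# Same filter as in the activity: lowercase, alphabetic only,
# in the English word list, and not a stop word.
df3 = [
    [
        y.lower()
        for y in x
        if y.lower() not in stop_words
        and y.isalpha()
        and y.lower() in all_english_words
    ]
    for x in df2
]

print(df3)
```

With whitespace splitting, `AMAZING!!` is dropped because the punctuation is still attached and `isalpha()` fails, and `ChatGPT` is dropped because it is not in the word list. A tokeniser that separates punctuation from words would keep more of the sentiment-bearing vocabulary.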

In [None]:
# Let's have a look at the same comment as above



## 5. Perform sentiment analysis

In [None]:
# Import the prebuilt rules and values of the VADER lexicon.


In [None]:
# Import the VADER class SentimentIntensityAnalyzer.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create a variable sia to store an instance of the SentimentIntensityAnalyzer class.


In [None]:
# Use a dictionary comprehension to join every cleaned comment into a single string,
# then run the polarity score function on that string.
# Each result is a dictionary of four values (neg, neu, pos and compound).

df_polarity = {" ".join(_) : sia.polarity_scores(" ".join(_)) for _ in df3}

In [None]:
# Convert the dictionary of results to a Pandas DataFrame.
# The index is the cleaned comment.
# We can see some of the highly positive words.


# View the DataFrame.


In [None]:
# With the non-alphabetic tokens (emojis, handles and hashtags) and the stopwords removed,
# some of the most positive comments are single words.

# Get the top 5 most positive cleaned comments


In [None]:
# Get the top 5 most negative cleaned comments related to ChatGPT


In [None]:
# The describe() method on the compound column summarises its distribution (count, mean, standard deviation, min, quartiles and max).

polarity['compound'].describe()
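
As a sketch of the conversion and ranking steps in this section, the snippet below turns a dictionary shaped like `df_polarity` into a DataFrame and pulls out the highest and lowest compound scores. The scores are made-up placeholders, not VADER output, and `mock_polarity` is a hypothetical name. Only the layout (keys `neg`, `neu`, `pos`, `compound`) mirrors what `polarity_scores` returns.

```python
import pandas as pd

# Made-up scores in the same layout as VADER's polarity_scores() output.
mock_polarity = {
    "great tool": {"neg": 0.0, "neu": 0.2, "pos": 0.8, "compound": 0.6},
    "terrible idea": {"neg": 0.7, "neu": 0.3, "pos": 0.0, "compound": -0.5},
    "video": {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
}

# orient='index' makes each cleaned comment a row and each score a column.
polarity = pd.DataFrame.from_dict(mock_polarity, orient="index")

# Highest and lowest compound scores (n=2 here because the sample is tiny).
most_positive = polarity.nlargest(2, "compound")
most_negative = polarity.nsmallest(2, "compound")

print(most_positive)
print(most_negative)
print(polarity["compound"].describe())
```

With real data, replacing `2` with `5` gives the top and bottom five. `sort_values('compound')` followed by `head(5)` is an equivalent alternative.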

## 6. Visualise the output

In [None]:
# Sometimes the best way to understand the data is to plot it.
# In the data sampled here, many of the values are 0.
# There are fewer negative values than positive ones, but the negative values are highly negative.

%matplotlib inline
import matplotlib.pyplot as plt

# Create a boxplot. This is a good way to see how many values sit on the edges as outliers.


In [None]:
# Create a barplot.


In [None]:
# You can also create a histogram:
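
A minimal sketch of the boxplot and histogram follows, using a short list of made-up compound scores in place of `polarity['compound']`. The variable names and the choice of 10 bins are assumptions. The `Agg` backend line is only there so the sketch runs outside a notebook; inside the notebook it can be dropped, since `%matplotlib inline` already handles rendering.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the sketch runs outside a notebook.
import matplotlib.pyplot as plt

# Made-up compound scores standing in for polarity['compound'].
compound_scores = [0.0, 0.0, 0.0, 0.6, 0.4, -0.8, 0.0, 0.3]

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Boxplot: points beyond the whiskers are drawn as outliers.
ax_box.boxplot(compound_scores)
ax_box.set_title("Compound score (boxplot)")

# Histogram: range=(-1, 1) covers the full span of the compound score.
counts, bin_edges, _ = ax_hist.hist(compound_scores, bins=10, range=(-1, 1))
ax_hist.set_title("Compound score (histogram)")
ax_hist.set_xlabel("Compound score")
ax_hist.set_ylabel("Number of comments")

fig.tight_layout()
```

Fixing the histogram range to -1 to 1 keeps the axis comparable between runs, which helps because the comments returned by the API can change each time the notebook is executed.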


## 7. Summarise findings

Type a summary of the analysis.

Conclusion: