![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=NLP-sentiment-analysis/sentiment-analysis-youtube-comments.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>


# Callysto’s Weekly Data Visualization

## Sentiment Analysis on YouTube Comments

### Recommended Grade levels: 6-12
<br>

### Instructions
#### “Run” the cells to see the graphs
Click “Cell” and select “Run All”.<br> This will import the data and run all the code, so you can see this week's data visualization. Scroll to the top after you’ve run the cells.<br> 

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don’t need to do any coding to view the visualizations**.
The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

# Question

Comments on **YouTube** give audiences a way to share their opinions about the videos they watch. To what extent, then, do the comments reflect the content of the videos? Think of the "like" button on YouTube videos: is the content of the comments correlated with the number of "likes"? Perhaps the more "likes" a video has, the more positive its comments are.

<img src="https://images.unsplash.com/photo-1585231059875-65414c63f26d?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=3086&q=80" alt="A beautiful sunset" width="30%">

### Goal

How do comments reflect on the original videos? We will investigate the emotions contained in the comments using [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Sentiment analysis is a natural language processing (NLP) technique to observe the emotional tone of text.

First, we start by extracting comments from selected YouTube videos. We will use a radar chart to depict underlying emotions in collected YouTube comments. Then we will perform sentiment analysis with two popular NLP libraries: [TextBlob](https://textblob.readthedocs.io/en/dev/) and [nltk](https://www.nltk.org/).

This notebook will examine the top 10 most viewed tutorial videos on popular programming languages: **C, Java, and Python**. The top 300 comments from each video will be collected and further examined through sentiment analysis to see which language is perceived to be the most positive.

# Gather

### Code:

The code below will import the Python libraries we need to gather and organize the data to answer our question.

In [None]:
%pip install -r requirements.txt
import pyodide_http
pyodide_http.patch_all()
import time
import pandas as pd
import numpy as np
import re
from IPython.display import YouTubeVideo

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')
from textblob.blob import TextBlob
try:
    import gcld3
except ImportError:
    !pip install gcld3
    import gcld3

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go
print('Libraries imported')

### Data

Data was extracted by using the [YouTube API](https://developers.google.com/youtube/v3). If you want to collect your own data, replace `API_KEY` in the code block below to do a keyword search for videos by title. For example, searching for the keyword **Python** will return the top 50 most viewed videos with the word **Python** in their title. 

In [None]:
developer_key = "API_KEY"
keyword = "Python"

try:
    # The discovery submodule must be imported explicitly before build() can be used
    import googleapiclient.discovery
    youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=developer_key)
    vid_ids = youtube.search().list(
        part="id",
        type='video',
        regionCode="US",
        order="viewCount",
        q=f"{keyword}",
        maxResults=50,
        fields="items(id(videoId))").execute()

    df = pd.DataFrame(columns=["id", "viewCount", "commentCount", "likeCount"])

    for item in vid_ids['items']:
        vidId = item['id']['videoId']
        run = youtube.videos().list(
            part="statistics,contentDetails",
            id=vidId,
            fields="items(statistics," + "contentDetails(duration))").execute()
        
        statistics = ["viewCount", "commentCount", "likeCount"]
        statistics_ls = [vidId]
        
        for statistic in statistics:
            try:
                filtered = run["items"][0]["statistics"][statistic]
            except (KeyError, IndexError):
                filtered = '0'
            statistics_ls.append(filtered)
        df.loc[len(df)] = statistics_ls
        
    display(df)
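except Exception as error:
    # Without a valid API key (or the Google API client library) the search is skipped
    print(f"Skipping the live YouTube search: {error}")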

For your convenience, we extracted data with the keyword **Python**, saved it as a CSV file, and loaded it in the cell below. 

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/Youtube_API_data.csv")
df

#### Comment on the data
The extracted data contains 50 rows and 4 columns. Videos with comments disabled are excluded from the search results. 

### Collect Comments

With the data we extracted from the YouTube API, we can now collect comments as raw text in preparation for sentiment analysis. Two methods can be employed to collect video comments: **web scraping** and the **YouTube API's built-in features**. Each has its pros and cons:

| Methodology | Pros | Cons |
|:-------------|:--------------|:--------------|
| Web Scraping| Can sort comments by the number of likes, most liked first. |Takes longer to collect comments. |
| YouTube API| Takes less time to collect comments. |Cannot sort comments by the number of likes; comments will be collected from most recent to the oldest. |


#### Web Scraping YouTube Data

Web scraping extracts YouTube comments directly from the website. The code below shows an example of how you can use Python to web scrape YouTube comments. In this example, the videos identified with the YouTube API are web scraped for their comments.

If you would like to run the cell, change it to a code cell.

The code block above is set to collect **300 most liked comments** from a video. A dataframe will be displayed as an outcome, consisting of columns including `User`, `Date`, and `Comment`.

We extracted the data, saved it as a CSV file, and loaded it in the cell below. 

In [None]:
web_scraped = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/python_comments_df.csv")
print(web_scraped.shape)
web_scraped.head()

#### Importing Comments with the YouTube API

If you have an `API_KEY`, you can use the code below to search for videos and get their comments with the YouTube API. The example code block is set to collect **approximately the 500 most recent comments** from a video. A dataframe will be displayed as the outcome, with columns including `User`, `LikeCount`, and `Comment`.
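
As a rough illustration of that collection step, here is a minimal sketch. It assumes `youtube` is an authenticated client like the one built in the search cell earlier. The function name `collect_recent_comments` and its default limit are our own choices, not part of the original notebook.

```python
def collect_recent_comments(youtube, video_id, max_comments=500):
    """Collect up to max_comments top-level comments, newest first."""
    rows = []
    page_token = None
    while len(rows) < max_comments:
        response = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            order="time",        # most recent comments first
            maxResults=100,      # the API returns at most 100 threads per page
            textFormat="plainText",
            pageToken=page_token,
        ).execute()
        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            rows.append({
                "User": snippet["authorDisplayName"],
                "LikeCount": snippet["likeCount"],
                "Comment": snippet["textDisplay"],
            })
        # Stop when there are no more pages to request
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return rows[:max_comments]
```

The returned list of dictionaries can be turned into a dataframe with `pd.DataFrame(rows)`.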

For your convenience, we extracted the data, saved it as a CSV file, and loaded it in the cell below. 

In [None]:
api_implementation = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/API_implementation.csv")
print(api_implementation.shape)
api_implementation

#### Comment on the Data

Although both methodologies have pros and cons, we have used the **web scraping** method to collect data for the rest of the notebook. To save time, we have prepared pre-collected data on three keywords: **C programming, Java programming, and Python programming**.

In [None]:
print("Displaying C programming keyword outcomes.")
C = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/c_comments_df.csv")
display(C.head().iloc[:,0:4])

print("Displaying Java programming keyword outcomes.")
java = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/java_comments_df.csv")
display(java.head())

print("Displaying Python programming keyword outcomes.")
python = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/python_comments_df.csv")
display(python.head())

# Organize

Before proceeding to sentiment analysis, we want to filter for English comments and remove stopwords from the gathered comments. **Stopwords** are common, general-purpose words that could interfere with the outcomes of sentiment analysis. For example, words such as "a", "an", "the", "in" are considered to be stopwords. For more details on stopwords, refer to the [official documentation](https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default) provided by `nltk`. 

<br>
<img src="https://images.unsplash.com/photo-1558443957-d056622df610?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=3132&q=80" alt="A beautiful sunset" width="70%">

### Filter for English Comments

In the cell below, we filter for English comments only and remove all non-English comments. We will be using the `NNetLanguageIdentifier` of the `gcld3` library to calculate the probability of each comment being recognized as English. We keep comments with a probability value above 0.9 and remove all other comments. We will use the Python dataframe as an example.

In [None]:
def english_detection(text):
    # Let the detector read the whole comment, however long it is
    max_num_bytes = len(text)
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=max_num_bytes)
    # Run detection once and reuse the result for both values
    result = detector.FindLanguage(text=text)
    return result.language, result.probability

def filter_for_comments(df):
    before_filtering = df.shape[0]
    df["Language"] = ""
    df["Probability"] = ""
    for ind, row in df.iterrows():
        language, probability = english_detection(row["Comment"])
        df.at[ind, "Language"] = language
        df.at[ind, "Probability"] = probability
    df = df[(df["Language"] == "en") & (df["Probability"] > 0.9)]
    after_filtering = df.shape[0]
    total_removal = before_filtering - after_filtering
    print(f"Total number of rows removed: {total_removal}.")
    return df

python = filter_for_comments(python)
print(python.shape)
python.head()

From the filtering process, 327 comments have been removed from the dataframe and we are left with 2673 comments. 
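
The filtering rule itself is simple enough to sketch without `gcld3`. The example below applies the same two conditions (language code `en` and probability above 0.9) to a few hypothetical detector outputs. The sample tuples and the helper name `keep_confident_english` are made up for illustration.

```python
def keep_confident_english(detections, threshold=0.9):
    """Return the comments detected as English with probability above threshold."""
    return [
        comment
        for comment, language, probability in detections
        if language == "en" and probability > threshold
    ]

# Hypothetical detector outputs: (comment, language code, probability)
sample_detections = [
    ("Great tutorial, thank you!", "en", 0.99),
    ("Muy buen video", "es", 0.97),
    ("ok", "en", 0.55),
]

kept = keep_confident_english(sample_detections)
print(kept)
```

Very short comments can be dropped even when they are English, because the detector has too little text to be confident.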

### Remove `Stopwords`
We will start by extending the existing list of stopwords provided by the `nltk` library. The list of extended stopwords is derived from this [GitHub repository](https://gist.github.com/sebleier/554280). 

In [None]:
extend_stopwords = ["0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "a1", "a2", "a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad", "added", "adj", "ae", "af", "affected", "affecting", "affects", "after", "afterwards", "ag", "again", "against", "ah", "ain", "ain't", "aj", "al", "all", "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "ao", "ap", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "ar", "are", "aren", "arent", "aren't", "arise", "around", "as", "a's", "aside", "ask", "asking", "associated", "at", "au", "auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bi", "bill", "biol", "bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "c1", "c2", "c3", "ca", "call", "came", "can", "cannot", "cant", "can't", "cause", "causes", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "changes", "ci", "cit", "cj", "cl", "clearly", "cm", "c'mon", "cn", "co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn", "couldnt", "couldn't", "course", "cp", "cq", "cr", "cry", "cs", "c's", "ct", "cu", "currently", "cv", "cx", "cy", "cz", "d", "d2", "da", "date", "dc", "dd", "de", "definitely", "describe", "described", "despite", "detail", "df", "di", "did", "didn", "didn't", 
"different", "dj", "dk", "dl", "do", "does", "doesn", "doesn't", "doing", "don", "done", "don't", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "e2", "e3", "ea", "each", "ec", "ed", "edu", "ee", "ef", "effect", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "empty", "en", "end", "ending", "enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "f2", "fa", "far", "fc", "few", "ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows", "for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore", "fy", "g", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going", "gone", "got", "gotten", "gr", "greetings", "gs", "gy", "h", "h2", "h3", "had", "hadn", "hadn't", "happens", "hardly", "has", "hasn", "hasnt", "hasn't", "have", "haven", "haven't", "having", "he", "hed", "he'd", "he'll", "hello", "help", "hence", "her", "here", "hereafter", "hereby", "herein", "heres", "here's", "hereupon", "hers", "herself", "hes", "he's", "hh", "hi", "hid", "him", "himself", "his", "hither", "hj", "ho", "home", "hopefully", "how", "howbeit", "however", "how's", "hr", "hs", "http", "hu", "hundred", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "i'd", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "i'll", "im", "i'm", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest", "into", "invention", "inward", "io", "ip", "iq", "ir", "is", 
"isn", "isn't", "it", "itd", "it'd", "it'll", "its", "it's", "itself", "iv", "i've", "ix", "iy", "iz", "j", "jj", "jr", "js", "jt", "ju", "just", "k", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "know", "known", "knows", "ko", "l", "l2", "la", "largely", "last", "lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "let's", "lf", "like", "liked", "likely", "line", "little", "lj", "ll", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "m2", "ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn", "mightn't", "mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu", "much", "mug", "must", "mustn", "mustn't", "my", "myself", "n", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily", "necessary", "need", "needn", "needn't", "needs", "neither", "never", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "no", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "oa", "ob", "obtain", "obtained", "obviously", "oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones", "only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "other", "others", "otherwise", "ou", "ought", "our", "ours", "ourselves", "out", "outside", "over", "overall", "ow", "owing", "own", "ox", "oz", "p", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly", "pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po", "poorly", "possible", "possibly", "potentially", "pp", "pq", "pr", "predominantly", "present", 
"presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps", "pt", "pu", "put", "py", "q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "r2", "ra", "ran", "rather", "rc", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "research-articl", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "s2", "sa", "said", "same", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "sf", "shall", "shan", "shan't", "she", "shed", "she'd", "she'll", "shes", "she's", "should", "shouldn", "shouldn't", "should've", "show", "showed", "shown", "showns", "shows", "si", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so", "some", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified", "specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "sy", "system", "sz", "t", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te", "tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "thats", "that's", "that've", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "there's", "thereto", "thereupon", "there've", "these", "they", "theyd", "they'd", "they'll", "theyre", "they're", "they've", 
"thickv", "thin", "think", "third", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn", "to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "t's", "tt", "tv", "twelve", "twenty", "twice", "two", "tx", "u", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ut", "v", "va", "value", "various", "vd", "ve", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt", "vu", "w", "wa", "want", "wants", "was", "wasn", "wasnt", "wasn't", "way", "we", "wed", "we'd", "welcome", "well", "we'll", "well-b", "went", "were", "we're", "weren", "werent", "weren't", "we've", "what", "whatever", "what'll", "whats", "what's", "when", "whence", "whenever", "when's", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "where's", "whereupon", "wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "who'll", "whom", "whomever", "whos", "who's", "whose", "why", "why's", "wi", "widely", "will", "willing", "wish", "with", "within", "without", "wo", "won", "wonder", "wont", "won't", "words", "world", "would", "wouldn", "wouldnt", "wouldn't", "www", "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "y2", "yes", "yet", "yj", "yl", "you", "youd", "you'd", "you'll", "your", "youre", "you're", "yours", "yourself", "yourselves", "you've", "yr", "ys", "yt", "z", "zero", "zi", "zz",]
stopwords_ls = set(stopwords.words("english"))
stopwords_ls.update(extend_stopwords)

def remove_stopwords(comment):
    # Split the comment into tokens, then keep only the tokens that are not stopwords
    comment_tk = word_tokenize(comment)
    filtered_comment = [c for c in comment_tk if c.lower() not in stopwords_ls]
    full_comment = " ".join(filtered_comment)
    return full_comment

python["Comment"] = python["Comment"].apply(lambda x: remove_stopwords(x))
python.head()

Notice how the `Comment` column changed after removing the stopwords. Common words that convey little information have been removed from the gathered comments.  
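
To see the effect on a single sentence, here is a small self-contained sketch. It uses a tiny hand-picked stopword set and a simple regular-expression tokenizer instead of the `nltk` list and `word_tokenize`, so its output will not match the notebook's exactly. The names `SAMPLE_STOPWORDS` and `strip_stopwords` are ours.

```python
import re

# A tiny hand-picked stopword set, for illustration only
SAMPLE_STOPWORDS = {"this", "is", "a", "the", "and", "i", "it", "very"}

def strip_stopwords(comment, stopwords=SAMPLE_STOPWORDS):
    # Split into word-like tokens, then drop the ones found in the stopword set
    tokens = re.findall(r"[A-Za-z']+", comment)
    kept = [token for token in tokens if token.lower() not in stopwords]
    return " ".join(kept)

before = "This is a very clear tutorial and I love it"
after = strip_stopwords(before)
print(after)  # clear tutorial love
```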

# Explore

We will use the `TextBlob` and `nltk` libraries to perform sentiment analysis. Both are popular for processing textual data and well suited to natural language processing (NLP) tasks. Although both libraries share the same goal, analyzing the emotional tone of text, they report different metrics. `TextBlob` assigns a **polarity** score ranging from -1 (most negative) to 1 (most positive), and a **subjectivity** score ranging from 0 (most objective) to 1 (most subjective). `nltk` calculates scores for **positive**, **negative**, **neutral**, and **compound** emotions. In this notebook, we will use the **subjectivity** scores provided by `TextBlob` and the four `nltk` metrics listed above. 
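
To give a feel for where the **compound** value comes from: VADER, the analyzer behind `SentimentIntensityAnalyzer`, sums the word-level scores of a text and squashes that sum into the range -1 to 1. The sketch below reproduces only that squashing step. The function name `normalize_compound` and the sample inputs are ours, and the constant 15 reflects our reading of VADER's reference implementation, so treat it as an assumption.

```python
import math

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded summed score into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Larger sums move toward -1 or 1 but never reach them
for raw in (-4.0, 0.0, 1.5, 4.0):
    print(raw, round(normalize_compound(raw), 4))
```

Because of this squashing, a compound score near 0 indicates a text whose positive and negative words roughly cancel out or that contains few emotional words at all.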

### Sentiment Analysis with `TextBlob`

In [None]:
def find_subjectivity(comment):
    return TextBlob(comment).sentiment.subjectivity

python["Subjectivity"] = python["Comment"].apply(find_subjectivity)
python.head()

### Sentiment Analysis with `nltk`

In [None]:
def sentiment_analysis_nltk(df):
    sid = SentimentIntensityAnalyzer()

    sentiment_df = pd.DataFrame()

    sentiment_scores = df["Comment"].apply(sid.polarity_scores)
    sentiment_df["VidID"] = df["VidID"]
    sentiment_df["Negative"] = sentiment_scores.apply(lambda x: x['neg'])
    sentiment_df["Positive"] = sentiment_scores.apply(lambda x: x['pos'])
    sentiment_df["Neutral"] = sentiment_scores.apply(lambda x: x['neu'])
    sentiment_df["Compound"] = sentiment_scores.apply(lambda x: x['compound'])
    display(sentiment_df.head())

    sum_df = sentiment_df.groupby(["VidID"]).agg(["mean"])
    sum_df = sum_df.reset_index()
    sum_df = sum_df.sort_values(by=[("Positive", "mean")], ascending=False)
    sum_df.columns = sum_df.columns.droplevel(1)
    display(sum_df.head())
    
    print("Negative score average: ", sentiment_df["Negative"].mean())
    print("Positive score average: ", sentiment_df["Positive"].mean())
    print("Neutral score average: ", sentiment_df["Neutral"].mean())
    print("Compound score average: ", sentiment_df["Compound"].mean())

    return sentiment_df, sum_df

sentiment_df, sum_df = sentiment_analysis_nltk(python)

#### Identifying the most *positive* videos from the video list

By rearranging the dataframe derived from the cell above, we can calculate the average positive score for each video. This way, we can quickly identify which of the 10 videos has the **most positively rated** comments on average.
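
The idea of "group by video, then average" can be sketched in plain Python. The rows below are hypothetical per-comment scores, and `mean_positive_by_video` is a name we made up. The notebook itself does the same job with pandas `groupby`.

```python
from collections import defaultdict
from statistics import mean

def mean_positive_by_video(rows):
    """Average the Positive score of each video's comments."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["VidID"]].append(row["Positive"])
    return {vid: mean(values) for vid, values in scores.items()}

# Hypothetical per-comment scores
sample_rows = [
    {"VidID": "vid_a", "Positive": 0.5},
    {"VidID": "vid_a", "Positive": 0.25},
    {"VidID": "vid_b", "Positive": 0.75},
]

averages = mean_positive_by_video(sample_rows)
best_video = max(averages, key=averages.get)
print(averages, best_video)
```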

In [None]:
def most_positive_video(sum_df):
    df_melted = pd.melt(sum_df, id_vars=["VidID"], value_vars=["Negative", "Positive", "Neutral", "Compound"],
                 var_name="Type", value_name="Mean_Score")
    df_melted = df_melted.sort_values(by=["VidID"])
    display(df_melted.head())

    top_positive_video = sum_df.sort_values(by=["Positive"], ascending=False).iloc[0:1]
    top_positive_video = top_positive_video["VidID"].values[0]
    print(f"Video with the ID {top_positive_video} is most positively rated on average based on the comments.")

    return df_melted, top_positive_video
            
positive_video, top_positive_video = most_positive_video(sum_df)
YouTubeVideo(f"{top_positive_video}", width=600)

# Interpret
Let's go back to the roots of this problem and think about what we want to analyze with our cleaned data. Here, we will investigate two topics: 
1. Which of the three programming languages is **most positively perceived** by the audience via comments? 
2. What does being *positively perceived* really mean? Does it mean more positively perceived languages are considered relatively **easy** compared to the other two?

We start by building a reproducible workflow from the functions we have written so far. With this workflow, we can repeat what we did for Python on Java and C as well. This way, we don't have to rewrite the functions - we can simply call the `main` function!

### Compare `Positive` scores across three programming languages

In [None]:
# One shared figure; main() adds a trace to it for each keyword
fig = go.Figure()

def load_df(title):
    df = pd.read_csv(f"https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/NLP-sentiment-analysis/data/{title.lower()}_comments_df.csv")
    return df

def main(title):
    df = load_df(title)
    df = df[df.Comment.map(lambda x: isinstance(x,str))]
    df = filter_for_comments(df)
    df["Comment"] = df["Comment"].apply(lambda x: remove_stopwords(x))
    sentiment_df, sum_df = sentiment_analysis_nltk(df)
    sentiment_df["Subjectivity"] = df["Comment"].apply(find_subjectivity)
    display(sentiment_df)

    sum_df = sentiment_df.groupby(["VidID"]).agg(["mean"])
    sum_df = sum_df.reset_index()
    sum_df = sum_df.sort_values(by=[("Positive", "mean")], ascending=False)
    sum_df.columns = sum_df.columns.droplevel(1)
    display(sum_df)
    
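    _, top_positive_video = most_positive_video(sum_df)
    # Melt every score column, including Subjectivity, so each radar axis has a value
    sum_df_melt = pd.melt(sum_df, id_vars=["VidID"], var_name="Type", value_name="Mean_Score")
    sum_df_melt["Mean_Score"] = pd.to_numeric(sum_df_melt["Mean_Score"], errors='coerce')
    vid_sum = sum_df_melt.groupby(["Type"])["Mean_Score"].mean().reset_index()
    display(vid_sum.head())
    vid_sum_list = vid_sum["Mean_Score"].tolist()
    
    # Take the axis labels from the grouped data so they always line up with the values
    categories = vid_sum["Type"].tolist()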
    fig.add_trace(go.Scatterpolar(
          r=vid_sum_list,
          theta=categories,
          fill='toself',
          name=title))

keywords = [ "C", "Java", "Python"]

for keyword in keywords:
    print(f"Now showing results for keyword {keyword}.")
    main(keyword)
    if keyword != keywords[-1]:
        print("--" * 30)
fig.show()

From the radar chart, we notice that Python is most positively perceived across all collected comments. Java follows closely behind Python, and C is notably less positively perceived than the other two. 

### Explore connotation behind `Positivity` of sentiment analysis

What does being positively perceived really mean? Could it mean more *positive* languages are *easier* to follow or learn? As our next step, we will count how many of the collected comments mention the word "easy". We will also count how many mention the words "difficult", "challenging", or "hard". 

In [None]:
def count_easy_difficult(df):
    easy_count = 0
    difficult_count = 0
    df["Comment"] = df["Comment"].astype(str)
    for comment in df["Comment"].tolist():
        if "easy" in comment:
            easy_count += 1
        # Check each word separately; ("a" or "b") in text would only ever test "a"
        if any(word in comment for word in ("difficult", "challenging", "hard")):
            difficult_count += 1
    return easy_count, difficult_count

count_df = pd.DataFrame(columns = ["Keyword", "Easy", "Difficult"])

for keyword in keywords:
    df = load_df(keyword)
    easy_count, difficult_count = count_easy_difficult(df)
    count_df.loc[len(count_df)] = [keyword, easy_count, difficult_count]
    
# Name the melted columns for what they hold: the word group and its count
count_df_melt = pd.melt(count_df, id_vars=["Keyword"], value_vars=["Easy", "Difficult"],
                        var_name="Word", value_name="Count")
display(count_df_melt)

count_fig = px.bar(count_df_melt, x="Keyword", y="Count", color="Word", barmode="group", text_auto=True)
count_fig.update_layout(title_text='Word Count per Keyword', title_x=0.5)
count_fig.show()

From our analysis above, it looks like Python has the highest count for the word "easy," whereas C has the highest count for the words "difficult" or "challenging." Since this finding aligns with our hypothesis, it suggests that being *positively perceived* may be related to how easy the programming language is perceived to be.
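
The counting cell above relies on substring checks, which are case-sensitive and also match inside longer words (for example, "hard" inside "hardware"). One possible refinement, shown here as a sketch on made-up comments, is to match whole words regardless of case with regular expressions. The pattern names and `count_mentions` are hypothetical.

```python
import re

# \b marks a word boundary, so "hardly" and "hardware" do not count as "hard"
EASY_PATTERN = re.compile(r"\beasy\b", re.IGNORECASE)
DIFFICULT_PATTERN = re.compile(r"\b(difficult|challenging|hard)\b", re.IGNORECASE)

def count_mentions(comments):
    """Count the comments that mention each word group at least once."""
    easy = sum(1 for comment in comments if EASY_PATTERN.search(comment))
    difficult = sum(1 for comment in comments if DIFFICULT_PATTERN.search(comment))
    return easy, difficult

sample_comments = [
    "Easy to follow!",
    "Pointers are hard",
    "I hardly understood the hardware part",
    "Not easy, quite challenging",
]

print(count_mentions(sample_comments))  # (2, 2)
```

Note that the last sample comment says "Not easy" yet still counts toward "easy", so word counts like these ignore negation.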

# Communicate
Below we will reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have? These writing prompts can help you reflect.

**Cause and effect**
- In your opinion, what does having positive comments mean? How might this reflect on the contents of a video?
- Beyond the three programming languages, try coming up with your own keywords and comparing the results! 

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)