## Data Collection

The main data collection runs [`update.py`](https://github.com/cjunwon/Youtube-Data-Analysis/blob/main/update.py) from a Flask server, [`server.py`](https://github.com/cjunwon/Youtube-Data-Analysis/blob/main/server.py), which is exposed through ngrok and triggered on a schedule by Invictify.

This pipeline keeps the AWS RDS MySQL database current with the latest channel statistics, refreshing at whatever interval the Invictify schedule is configured for.
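
As a rough illustration of the trigger mechanism described above, the sketch below shows a minimal Flask app with a single route that a scheduler could call over HTTP. The `/update` path and the `run_update` function are hypothetical placeholders chosen for this example; they are not taken from the actual `server.py` or `update.py`.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def run_update():
    """Hypothetical stand-in for the work performed by update.py.

    A real implementation would call the YouTube API and write the
    results to the database; here it only returns a status payload.
    """
    return {"status": "ok"}


# Hypothetical route: a scheduler would send an HTTP request to this URL
# (through the public tunnel address) each time an update is due.
@app.route("/update", methods=["GET", "POST"])
def update():
    return jsonify(run_update())
```

Under these assumptions the app could be started locally with `flask --app server run`, and the scheduler only needs the public URL of the `/update` route.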

In [77]:
from MySQL_DB_connect_functions import *
from MySQL_DB_update_functions import *
from youtube_api_functions import *

In [78]:
# Only the last assignment takes effect; the outputs below are for The New York Times.
# channel_id_list = ['UCIRYBXDze5krPDzAEOxFGVA']  # TheGuardian
channel_id_list = ['UCqnbDFdCpuN8CMEg0VuEBqA']  # NYTimes

In [79]:
# Using youtube_api_functions.py:
youtube_obj = build_yt_API_object()  # builds the YouTube API client object

In [80]:
video_df = create_video_df(youtube_obj, channel_id_list, 500) # store API data into pandas df
processed_video_df = clean_video_df(video_df) # run df through cleaning function
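
The output below contains derived columns such as `publishDayName`, `durationSecs`, and `tagCount`. The implementation of `clean_video_df` is not shown here, so the following is only a sketch of how such columns could be derived from raw API fields. The raw column names (`duration`, `tags`) and the helper `iso_duration_to_seconds` are assumptions made for this example.

```python
import re

import pandas as pd

# Matches ISO 8601 durations of the form PT#H#M#S (hours, minutes, seconds).
# Durations that include a day component are not handled in this sketch.
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso_duration_to_seconds(value: str) -> float:
    """Convert a duration such as 'PT1M59S' to seconds; NaN if unparseable."""
    match = _DURATION_RE.match(value)
    if match is None:
        return float("nan")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return float(hours * 3600 + minutes * 60 + seconds)


# Toy input with assumed raw column names
raw = pd.DataFrame({
    "publishedAt": ["2022-09-16T14:00:00Z", "2022-09-13T09:00:00Z"],
    "duration": ["PT1M59S", "PT20M33S"],
    "tags": [["film", "review"], None],
})

cleaned = raw.copy()
published = pd.to_datetime(cleaned["publishedAt"])
cleaned["publishDayName"] = published.dt.day_name()
cleaned["durationSecs"] = cleaned["duration"].apply(iso_duration_to_seconds)
# Videos without tags get a count of 0
cleaned["tagCount"] = cleaned["tags"].apply(
    lambda tags: len(tags) if isinstance(tags, list) else 0
)
```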

In [81]:
processed_video_df.head()

Unnamed: 0,video_id,channelTitle,title,description,publishedAt,viewCount,likeCount,favoriteCount,commentCount,caption,publishDayName,durationSecs,tagCount
0,FJvnn0qRWb4,The New York Times,Watch Warriors in Training in ‘The Woman King’...,The intensity of the action epic “The Woman Ki...,2022-09-16,9607.0,303.0,0.0,0.0,1,Friday,119.0,11
1,O5d1Cclk5QA,The New York Times,"I Sold the French Laundry. Then It Became ""The...","When my father died, he held disappointment in...",2022-09-13,72091.0,3102.0,0.0,218.0,1,Tuesday,1233.0,21
2,ccVH_1g-kCY,The New York Times,OB-GYNs Confront Legal Impact of Abortion Bans...,As anti-abortion laws take effect across the U...,2022-09-12,11967.0,617.0,0.0,318.0,1,Monday,379.0,33
3,0bw7rJ2eZaA,The New York Times,The Legacy of Elizabeth II: The Media Queen,"Queen Elizabeth II, the world’s longest-servin...",2022-09-08,56998.0,2061.0,0.0,377.0,1,Thursday,486.0,19
4,m5aWtcx02ZI,The New York Times,Jonathan Pie: Welcome to Britain. Everything i...,So Liz Truss will be Britain’s next prime mini...,2022-09-06,903494.0,45406.0,0.0,4980.0,1,Tuesday,401.0,15


In [82]:
selected_df = processed_video_df.query('viewCount > 5000 & commentCount > 500')
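
The `query` filter above keeps only videos that pass both thresholds. For readers less familiar with `query` strings, here is a small sketch on made-up data showing that the same selection can be written as a boolean mask. The toy frame is an assumption for illustration only.

```python
import pandas as pd

# Toy data; only the column names mirror the real frame
videos = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "viewCount": [9607.0, 903494.0, 4000.0],
    "commentCount": [0.0, 4980.0, 900.0],
})

# Filter with a query string, as in the cell above
by_query = videos.query("viewCount > 5000 & commentCount > 500")

# Equivalent filter with an explicit boolean mask
by_mask = videos[(videos["viewCount"] > 5000) & (videos["commentCount"] > 500)]
```

Both forms drop rows where either count is missing, because comparisons against `NaN` evaluate to `False`.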

In [83]:
len(selected_df)

245

In [84]:
selected_df.head()

Unnamed: 0,video_id,channelTitle,title,description,publishedAt,viewCount,likeCount,favoriteCount,commentCount,caption,publishDayName,durationSecs,tagCount
4,m5aWtcx02ZI,The New York Times,Jonathan Pie: Welcome to Britain. Everything i...,So Liz Truss will be Britain’s next prime mini...,2022-09-06,903494.0,45406.0,0.0,4980.0,1,Tuesday,401.0,15
9,eMGqvUZjkH8,The New York Times,Kill Your Lawn | NYT Opinion,"Seen from above, it’s not the undulating rows ...",2022-08-09,87002.0,3644.0,0.0,581.0,1,Tuesday,321.0,20
14,RilwnjDwTOc,The New York Times,We Debunk the Latest Corporate Climate Lie | N...,"Finally, corporations are jumping into action ...",2022-07-14,283005.0,13279.0,0.0,1587.0,1,Thursday,288.0,17
17,tjIgYs81mB8,The New York Times,How I Had an Abortion at Home in Texas | NYT O...,This is the true story of a 27-year-old Texas ...,2022-06-29,286001.0,8838.0,0.0,7759.0,1,Wednesday,475.0,16
20,Oo_FM3mjBCY,The New York Times,How China’s Surveillance Is Growing More Invas...,"A New York Times analysis of over 100,000 gove...",2022-06-22,258469.0,9804.0,0.0,1542.0,1,Wednesday,867.0,44


In [85]:
video_ids = list(selected_df['video_id'])

In [86]:
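import pandas as pd  # explicit import; otherwise pd is only available via the star imports above

# Fetch one DataFrame of comments per video, then concatenate once.
# DataFrame.append was removed in pandas 2.0, and repeated appends copy the frame each time.
comment_frames = [get_video_comments(youtube_obj, video) for video in video_ids]
all_comments_df = pd.concat(comment_frames, ignore_index=True)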

In [87]:
all_comments_df.head()

Unnamed: 0,video_id,comment_id,comment,date
0,m5aWtcx02ZI,UgxufqRCtXh89VL-mfp4AaABAg,"Though we walk through the valley of death, so...",2022-09-15T23:18:04Z
1,m5aWtcx02ZI,UgwUecmtXM7o_sFXgo14AaABAg,"As a citizen of burgerland, I can confirm Brit...",2022-09-15T22:42:25Z
2,m5aWtcx02ZI,Ugydc96B5N4nXidOG0h4AaABAg,Really boring. Yet another 'I hate the British...,2022-09-15T22:17:28Z
3,m5aWtcx02ZI,UgzxQz_m_QuKjo82Jy94AaABAg,very surprised yall had him on.\nvery progress...,2022-09-15T21:45:34Z
4,m5aWtcx02ZI,UgykCYZTg5fHmL0FcGZ4AaABAg,The New York Times taking the opportunity to p...,2022-09-15T21:26:09Z


In [88]:
all_comments_df['comment'][3]

'very surprised yall had him on.\nvery progressive.'

## Comment Data Cleaning

In [89]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # one-time download of the lexicon the analyzer needs
sia = SentimentIntensityAnalyzer()

In [90]:
def preprocess(comments):
    """Replace newlines in a Series of comment strings with spaces."""
    return comments.str.replace("\n", " ", regex=False)

all_comments_df['comment'] = preprocess(all_comments_df['comment'])
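
The `preprocess` step above only replaces `\n`. If comments also contain carriage returns, tabs, or runs of spaces, a slightly broader normalization may be useful. The sketch below is one possible variant, not part of the original pipeline; `normalize_whitespace` is a hypothetical helper name.

```python
import pandas as pd


def normalize_whitespace(comments: pd.Series) -> pd.Series:
    """Collapse any run of whitespace to a single space and trim the ends."""
    return comments.str.replace(r"\s+", " ", regex=True).str.strip()


sample = pd.Series([
    "very surprised yall had him on.\nvery progressive.",
    "line one\r\n\r\nline two  ",
])
normalized = normalize_whitespace(sample)
```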

In [91]:
all_comments_df

Unnamed: 0,video_id,comment_id,comment,date
0,m5aWtcx02ZI,UgxufqRCtXh89VL-mfp4AaABAg,"Though we walk through the valley of death, so...",2022-09-15T23:18:04Z
1,m5aWtcx02ZI,UgwUecmtXM7o_sFXgo14AaABAg,"As a citizen of burgerland, I can confirm Brit...",2022-09-15T22:42:25Z
2,m5aWtcx02ZI,Ugydc96B5N4nXidOG0h4AaABAg,Really boring. Yet another 'I hate the British...,2022-09-15T22:17:28Z
3,m5aWtcx02ZI,UgzxQz_m_QuKjo82Jy94AaABAg,very surprised yall had him on. very progressive.,2022-09-15T21:45:34Z
4,m5aWtcx02ZI,UgykCYZTg5fHmL0FcGZ4AaABAg,The New York Times taking the opportunity to p...,2022-09-15T21:26:09Z
...,...,...,...,...
24495,aLk5znZljTI,UgzaLfcXUI-MKhaZnvd4AaABAg,Nike supports a COMMUNIST Chinese party.,2021-06-29T03:59:00Z
24496,aLk5znZljTI,UgxcLr95KEBOpfCikwZ4AaABAg,Lmfao,2021-06-29T03:58:05Z
24497,aLk5znZljTI,UgxJtnKZGDFF3Mr4neJ4AaABAg,Are these females really complaining about cho...,2021-06-28T17:35:49Z
24498,aLk5znZljTI,UgxBrrJZLE2juVsxW5Z4AaABAg,Go somewhere else than... im sure there are di...,2021-06-28T13:22:54Z


In [92]:
all_comments_df['comment'][3]

'very surprised yall had him on. very progressive.'

In [93]:
# Score each comment with VADER, then keep the compound score (normalized to the range -1 to 1)
all_comments_df['vader_sentiment'] = all_comments_df['comment'].apply(sia.polarity_scores)
all_comments_df['vader_comp_sentiment'] = all_comments_df['vader_sentiment'].apply(lambda scores: scores['compound'])

In [94]:
all_comments_df['vader_comp_sentiment'].mean()

0.00024345306122447468
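
A mean compound score close to zero can hide a mix of clearly positive and clearly negative comments, so it can help to bucket the scores. The sketch below uses the thresholds commonly suggested for VADER (compound of at least 0.05 is positive, at most -0.05 is negative, otherwise neutral). The `label_compound` helper and the sample scores are made up for illustration.

```python
import pandas as pd


def label_compound(score: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores standing in for all_comments_df['vader_comp_sentiment']
scores = pd.Series([0.6249, -0.4404, 0.0, 0.03])
labels = scores.apply(label_compound)
label_counts = labels.value_counts()
```

Applied to the real column, `value_counts()` would show how the comments split across the three labels.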

In [95]:
# Build one summary row per video: mean VADER compound score plus view and like counts.
# Note: this fetches the comments from the API again, although they are already in all_comments_df.
summary_rows = []
for video in video_ids:
    comment_data = get_video_comments(youtube_obj, video)
    compound = comment_data['comment'].apply(lambda x: sia.polarity_scores(x)['compound'])

    # .item() raises if video_id does not match exactly one row
    is_video = processed_video_df['video_id'] == video
    vid_title = processed_video_df.loc[is_video, 'title'].item()
    vid_viewcount = processed_video_df.loc[is_video, 'viewCount'].item()
    vid_likecount = processed_video_df.loc[is_video, 'likeCount'].item()

    # Named summary_row rather than dict, to avoid shadowing the built-in
    summary_row = {'video_id': video,
                   'vid_title': vid_title,
                   'avg_comp_sentiment': compound.mean(),
                   'vid_viewcount': vid_viewcount,
                   'vid_likecount': vid_likecount}
    summary_rows.append(summary_row)

# DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once
video_comp_sentiments = pd.DataFrame(summary_rows)
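
Because the per-comment scores are already stored in `all_comments_df`, the same per-video summary could also be computed without calling the API a second time. The sketch below shows that approach with `groupby` and `merge` on small made-up frames; the data values are assumptions, while the column names follow the frames used above.

```python
import pandas as pd

# Made-up stand-ins for all_comments_df and processed_video_df
comments = pd.DataFrame({
    "video_id": ["v1", "v1", "v2"],
    "vader_comp_sentiment": [0.5, -0.1, 0.2],
})
videos = pd.DataFrame({
    "video_id": ["v1", "v2"],
    "title": ["First video", "Second video"],
    "viewCount": [1000.0, 2000.0],
    "likeCount": [10.0, 20.0],
})

# Mean compound score per video
avg_sentiment = (
    comments.groupby("video_id", as_index=False)["vader_comp_sentiment"]
    .mean()
    .rename(columns={"vader_comp_sentiment": "avg_comp_sentiment"})
)

# Attach title, view count, and like count, using the summary column names
summary = avg_sentiment.merge(
    videos[["video_id", "title", "viewCount", "likeCount"]],
    on="video_id",
    how="left",
).rename(columns={
    "title": "vid_title",
    "viewCount": "vid_viewcount",
    "likeCount": "vid_likecount",
})
```

This variant scores whatever text is in the comment frame, so it would use the preprocessed comments rather than the raw ones fetched inside the loop.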
    

In [96]:
video_comp_sentiments

Unnamed: 0,video_id,vid_title,avg_comp_sentiment,vid_viewcount,vid_likecount
0,m5aWtcx02ZI,Jonathan Pie: Welcome to Britain. Everything i...,0.030583,903494.0,45406.0
1,eMGqvUZjkH8,Kill Your Lawn | NYT Opinion,0.118600,87002.0,3644.0
2,RilwnjDwTOc,We Debunk the Latest Corporate Climate Lie | N...,-0.014841,283005.0,13279.0
3,tjIgYs81mB8,How I Had an Abortion at Home in Texas | NYT O...,-0.189609,286001.0,8838.0
4,Oo_FM3mjBCY,How China’s Surveillance Is Growing More Invas...,0.028863,258469.0,9804.0
...,...,...,...,...,...
240,zkwbIlrFhgM,Did Iran Attack Ships in the Gulf? What the Ev...,-0.126641,175639.0,2361.0
241,VLwDvK0j4DQ,Why American Diabetics Go to Mexico and Craigl...,0.311231,470137.0,16256.0
242,LYs5HMky1qY,Why Biden’s First Run for President Failed | N...,-0.313644,1594422.0,22272.0
243,S7jnzOMxb14,The Stonewall You Know Is a Myth. And That’s O...,0.138696,974239.0,50722.0


In [97]:
import plotly.io as pio
pio.renderers

Renderers configuration
-----------------------
    Default renderer: 'vscode'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']

In [98]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [116]:
fig = make_subplots(specs=[[{'secondary_y': True}]])

# View counts on the primary (left) y-axis
fig.add_trace(
    go.Scatter(
        x=video_comp_sentiments['avg_comp_sentiment'],
        y=video_comp_sentiments['vid_viewcount'],
        name='View Count',
        mode='markers',
        marker_color='Blue',
        text=video_comp_sentiments['vid_title'],
        hovertemplate='<b>%{text}</b> <br>Sentiment Value: %{x} <br>View Count: %{y}',
    ),
    secondary_y=False,
)

# Like counts on the secondary (right) y-axis
fig.add_trace(
    go.Scatter(
        x=video_comp_sentiments['avg_comp_sentiment'],
        y=video_comp_sentiments['vid_likecount'],
        name='Like Count',
        mode='markers',
        marker_color='Red',
        text=video_comp_sentiments['vid_title'],
        hovertemplate='<b>%{text}</b> <br>Sentiment Value: %{x} <br>Like Count: %{y}',
    ),
    secondary_y=True,
)

fig.update_layout(
    title="<b>YouTube View & Like Counts vs. Sentiment Scores</b>",
    xaxis_title="Sentiment Score"
    # legend_title="Legend Title"
)

fig.update_yaxes(title_text="View Counts", secondary_y=False)
fig.update_yaxes(title_text="Like Counts", secondary_y=True)

fig.show(renderer='vscode')