# Exploratory Data Analysis

For the first few sections, we explore the results scraped on March 10.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import Counter
import wordcloud
import glob

In [None]:
data = pd.read_csv('../data/video_clean.csv', thousands = ',', encoding = 'utf-8')
data[0:5]

## Frequency of Words / Tokens 

We combine the tokens from all titles into one list and count the frequency of every key word in a single pass.

In [None]:
# Split each cleaned title into a list of tokens
string = data.clean_text.str.split()

# Flatten the per-title token lists into a single list
text = [token for tokens in string for token in tokens]
text[0:5]
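
The flatten-and-count step can be sketched on toy data as follows. The titles here are invented for illustration, and `itertools.chain.from_iterable` is only one possible way to flatten the token lists; it avoids copying a growing list on every iteration.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Invented, already-cleaned titles standing in for data.clean_text
toy = pd.Series([
    "coronavirus news update",
    "coronavirus outbreak china",
    "news china coronavirus",
])

# Split each title into tokens, then flatten all token lists into one list
tokens = list(chain.from_iterable(toy.str.split()))

# Count every token in a single pass
toy_counts = Counter(tokens)
print(toy_counts.most_common(3))
```

`Counter.most_common()` returns `(token, count)` pairs sorted by descending count, which is the shape the following cells work with.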

The tokens are shown below in descending order of frequency.

In [None]:
count1 = Counter(text)

most_c = count1.most_common().copy()
most_c

"Coronavirus" has the highest frequency (since it was our search term), followed by "News", "Outbreak", and "China". As the event develops, the ranking will probably change. We look at this in the later sections.

We removed single-letter tokens because they carry no meaning. We did not remove two-letter tokens wholesale, because some of them are abbreviations that can be confused with ordinary words, such as `LA` (Los Angeles) and `la` (Spanish for "the").

The token set clearly contains some Spanish and other non-English words.

In [None]:
# Iterate in reverse so that deleting an entry does not shift the indices still to be visited
for i in reversed(range(len(most_c))):
    if len(most_c[i][0]) == 1:
        print(most_c[i], i)
        del most_c[i]
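
The same filter can also be written without deleting from the list while indexing into it. A minimal sketch on made-up counts (the pairs below are not taken from the data set):

```python
# Made-up (token, count) pairs in the shape returned by Counter.most_common()
pairs = [("coronavirus", 120), ("news", 40), ("a", 12), ("la", 9), ("x", 3)]

# Keep every pair whose token is not a single character
filtered = [(tok, n) for tok, n in pairs if len(tok) != 1]
print(filtered)
```

Building a new list keeps the original order and leaves two-letter tokens such as `la` in place for manual review.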

In [None]:
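# List the two-letter tokens with their positions for manual inspection
for i in reversed(range(len(most_c))):
    if len(most_c[i][0]) == 2:
        print(most_c[i], i)
# Remove the entry at index 13 (a hard-coded position that is only valid for this scrape)
del most_c[13]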

In [None]:
# Skip the top token (the search term) and keep the next 29 tokens
most_c_50 = most_c[1:30]

Since `coronavirus` has the highest frequency, we exclude it and focus on the frequency of the other tokens in the rest of the analysis.

In [None]:
plt.figure(figsize=(13,15))
names, values = zip(*most_c_50)  
plt.barh(names,values)
plt.xticks(rotation=90)
plt.show()

## Channels vs. Subscribers

In [None]:
data1 = data[['channel','subscriber']].copy()
data1 = data1.drop_duplicates().copy()
data1 = data1.sort_values(by='subscriber',ascending = False).copy()
data1[:10]

The 10 channels with the most subscribers are shown above.

In [None]:
plt.figure(figsize=(20,10))
plt.barh(data1.channel[0:10],data1.subscriber[0:10])
plt.xlabel('Subscribers')
plt.show()

We noticed a variety of uploaders, including:
* Traditional International News Stations
* Personal vlogger ('Luisito Comunica')
* New Media
* Talk shows

## Channels vs. Views
Which channels have the most-viewed videos?

In [None]:
data1 = data[['channel','view']].copy()
data1 = data1.drop_duplicates().copy()
data1 = data1.sort_values(by='view',ascending = False).copy()
data1[:10]

The channels that gain the most views are varied, but concentrate on:
* Self-Media (Doctor Mike)
* Local and International News Stations (ABC, CNN, Guardian)
* Talk Shows (Last Week Tonight, SNL, Daily Show)

Talk shows may attract views more easily.
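
The table above sorts `(channel, view)` pairs, so it effectively ranks single videos. To rank channels by their combined views instead, one option is to aggregate per channel first. A minimal sketch on invented numbers (the channel names and counts are placeholders):

```python
import pandas as pd

# Invented rows: one row per video, with its channel and view count
toy = pd.DataFrame({
    "channel": ["A", "B", "A", "C"],
    "view": [100, 250, 300, 50],
})

# Sum views over all videos of each channel, then rank the channels
total_views = (
    toy.groupby("channel")["view"]
       .sum()
       .sort_values(ascending=False)
)
print(total_views)
```

Here channel `A` ranks first on total views even though its best single video has fewer views than the top video of `B` would need to beat it on its own, which is the difference between the two rankings.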

## Who posts the most videos about coronavirus?

We look at the 30 most active uploaders in the first scrape and group all remaining channels as "Others".

In [None]:
# Keep the 30 channels with the most videos and label the rest 'Others' to keep the plot readable
top_channels = data['channel'].value_counts()[0:30].index
data['channel1'] = data['channel'].where(data['channel'].isin(top_channels), 'Others')

In [None]:
plt.figure(figsize=(11,11))
data['channel1'].value_counts().plot(kind='pie')

We noticed that US and international news stations posted a large proportion of the videos.

## Who provided the content with the longest total duration?

In [None]:
tot_len = data.groupby('channel', as_index=False).agg({"length": "sum"}).copy()
tot_len = tot_len.sort_values('length',ascending = False)
tot_len

In [None]:
plt.figure(figsize=(20,10))
plt.barh(tot_len['channel'][0:20],tot_len['length'][0:20])
plt.xlabel('Length of Videos')

* Extremely long videos are mostly replays of live streams.
* This ranking differs from the rankings by subscribers and by views. We suspect that how attractive a video is to watch depends on the channel and its number of subscribers, since the quality and content of videos vary. The recommendation system of the search engine might also affect the number of views.

## Trend of Views at Different Time

For a specific video, we are interested in how its view count increases over time. Because we have search results from repeated scrapes over a time range, we can track the change in a video's views, likes, and other statistics, keeping in mind that the rate of increase may differ greatly between videos.

In [None]:
import os

filenames = glob.glob("../data/clean/*.csv")
# Strip the directory and the extension to keep only the file stem
filenamelist = [os.path.splitext(os.path.basename(f))[0] for f in filenames]
filenamelist = np.sort(filenamelist)
filenamelist

In [None]:
name = '../data/clean/'+ filenamelist[-1] + '.csv'
videodata = pd.read_csv(name, thousands = ',', encoding = 'utf-8')
videodata.head()

In [None]:
# get the most viewed video title
title = videodata.loc[videodata['view'].idxmax(),'clean_text']
title

In [None]:
columns = list(videodata.columns)
columns.append('filetime')
columns

In [None]:
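frames = []
for i in filenamelist:
    name = '../data/clean/' + i + '.csv'
    videodata = pd.read_csv(name, thousands=',', encoding='utf-8')
    sample = videodata[videodata['clean_text'] == title]
    if len(sample) == 0:
        # The video does not appear in this scrape; report the file and skip it
        print(i)
        continue

    # Parse the scrape time from the file name
    date = int(i.split('_')[1][2:4])  # day of month
    hour = int(i.split('_')[2][0:2])
    minute = int(i.split('_')[2][2:4])

    # Hours elapsed since 00:00 on the 12th
    time = (date - 12) * 24 + hour + minute / 60
    frames.append(sample.assign(filetime=time))

# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
videos = pd.concat(frames) if frames else pd.DataFrame(columns=columns)
videos.shape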
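
The time arithmetic in the loop above can be isolated in a small helper. This is a sketch only: the helper name `hours_since_day` and the stem layout `<prefix>_<MMDD>_<HHMM>` (for example `clean_0313_0630`) are assumptions inferred from the slicing in the loop, not confirmed file names.

```python
def hours_since_day(stem, origin_day=12):
    """Convert a file stem to hours elapsed since 00:00 on `origin_day`.

    Assumes the (hypothetical) layout '<prefix>_<MMDD>_<HHMM>'.
    """
    parts = stem.split("_")
    day = int(parts[1][2:4])      # last two digits of MMDD
    hour = int(parts[2][0:2])     # first two digits of HHMM
    minute = int(parts[2][2:4])   # last two digits of HHMM
    return (day - origin_day) * 24 + hour + minute / 60


print(hours_since_day("clean_0313_0630"))
```

Because only the day of the month is used, this arithmetic is valid only while all scrapes fall within the same month.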

In [None]:
videos.head()

In [None]:
np.log(videos['view'].values.astype(int))

In [None]:
views = videos['view'].values.astype(int)
logview = np.log(views)
like = videos['like'].values.astype(int)
likeoview = like / views
plt.scatter(videos['filetime'].values, logview, alpha=0.5)

plt.title('Change of views over time', size=13)
plt.xlabel('Time in hours', size=10)
plt.ylabel('log(views)', size=10)

This figure shows the natural logarithm of the view count over time (in hours) for one video, titled 'serious coronavirus infectious disease expert michael osterholm explains joe rogan'. The rate of increase appears to be approximately stable.
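
If the points on the log scale fall close to a straight line, $\log v(t) = a + b\,t$, then the views grow roughly exponentially, $v(t) = e^{a} e^{b t}$, and the slope $b$ can be estimated with a least-squares fit. The sketch below uses synthetic data with an invented growth rate, not the scraped values:

```python
import numpy as np

# Synthetic data: views growing at a continuous rate of 0.02 per hour (invented numbers)
t = np.arange(0, 48, 4, dtype=float)   # hours
v = 1_000_000 * np.exp(0.02 * t)       # view counts

# Fit log(v) = intercept + rate * t with ordinary least squares
rate, intercept = np.polyfit(t, np.log(v), deg=1)

# Time needed for the view count to double at this rate
doubling_hours = np.log(2) / rate
print(rate, doubling_hours)
```

On real data the same fit could be applied to `videos['filetime']` and `logview`; the quality of the fit would indicate how well a constant growth rate describes the video.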

In [None]:
plt.scatter(videos['filetime'].values, like, alpha=0.5)

plt.title('Change of likes over time', size=13)
plt.xlabel('Time in hours', size=10)
plt.ylabel('Counts', size=10)

We see a similar pattern in the increase of likes over time.

In [None]:
plt.scatter(videos['filetime'].values, likeoview, alpha=0.5)

plt.title('Change of the like-to-view ratio over time', size=13)
plt.xlabel('Time in hours', size=10)
plt.ylabel('Likes / views', size=10)

Unsurprisingly, the ratio of likes to views (the like ratio) is relatively stable over time, which suggests that viewers' attitude toward the quality and content of the video stays similar.

## Trend of Frequency of Tokens

In [None]:
def wordmap(datafilename):
    """
    Produce a word cloud for one scrape.
    """
    print(datafilename)
    data2 = pd.read_csv(datafilename, thousands=',', encoding='utf-8')
    # Flatten the per-title token lists into a single list of tokens
    text2 = [tok for tokens in data2.clean_text.str.split() for tok in tokens]

    count2 = Counter(text2)
    # Drop single-letter tokens, which carry no meaning
    most_c_2 = [(tok, n) for tok, n in count2.most_common() if len(tok) != 1]
    # Skip the top token (the search term) and draw the next 59 tokens
    wc2 = wordcloud.WordCloud(width=1000, max_words=500, height=500, background_color="white",
                              collocations=True, relative_scaling=0.2).generate_from_frequencies(dict(most_c_2[1:60]))
    plt.figure(figsize=(20, 10))
    plt.imshow(wc2, interpolation="bilinear")
    plt.axis("off")
    plt.show()

In [None]:
print('First Scrape on March 10')
wordmap('../data/video_clean.csv')

In [None]:
wordmap('../data/clean/' + filenamelist[0]+ '.csv')

In [None]:
wordmap('../data/clean/' + filenamelist[4]+ '.csv')

In [None]:
wordmap('../data/clean/' + filenamelist[8]+ '.csv')

In [None]:
wordmap('../data/clean/' + filenamelist[12]+ '.csv')

In [None]:
wordmap('../data/clean/' + filenamelist[16]+ '.csv')

We tried to see how the event evolved over these days, from March 10 to March 15.
* Most of the key words, such as "News", "Outbreak" and "COVID(-19)", remained at high frequency. These words are essential parts of the titles of videos about coronavirus.
* As Trump gave more speeches and issued more orders, the frequency of "Trump" generally increased.
* As China brought the pneumonia under control, the frequency of "China" and "Wuhan" actually decreased.
* As community spread became more severe in multiple locations in European countries and North America, some new region names appeared in the newer word maps.
* Many regions declared a state of "Emergency", so this token appeared.
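
The word maps show these shifts visually. To put a number on them, one could compute the share of a token among all tokens in each scrape. The sketch below is an illustration on invented titles; `token_share` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def token_share(titles, token):
    """Fraction of all tokens in `titles` that equal `token`."""
    counts = Counter(tok for title in titles for tok in title.split())
    total = sum(counts.values())
    return counts[token] / total if total else 0.0


# Invented cleaned titles for two scrapes
scrapes = {
    "early": ["coronavirus china wuhan", "coronavirus news china"],
    "late": ["coronavirus trump emergency", "trump news coronavirus"],
}

for label, titles in scrapes.items():
    print(label, token_share(titles, "china"), token_share(titles, "trump"))
```

Using a share rather than a raw count makes scrapes with different numbers of videos comparable.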