# Timeseries analysis

This notebook generates the data sets with the total number of views and videos per topic, across the different fields that may be covered by channels tagged ``News & Politics``. It then creates the graphs used in the data story.

If you installed the environment from ```environment.yml```, you do not need any additional packages. Otherwise, install the following:
- ```pip install nltk```
- ```pip install plotly```
- ```pip install pandas```

The NLTK stop word list is a separate download; if it is not already present, run ```nltk.download('stopwords')``` once.


In [1]:
import pandas as pd
from nltk.corpus import stopwords
import os
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings

warnings.filterwarnings('ignore')

## Generate the data sets

Note that this part uses the original data set, which is very large and therefore not included in the repository. The code expects it at ```'../original_data/yt_metadata_en.jsonl.gz'```. It generates processed data sets that are saved in the following directories, one per topic assessed:
- ```datasets/timeseries/sports```
- ```datasets/timeseries/trends```
- ```datasets/timeseries/tech```
- ```datasets/timeseries/health```
- ```datasets/timeseries/politics```

The graphs are saved in the following directories:
- ```datasets/figures/sports```
- ```datasets/figures/trends```
- ```datasets/figures/tech```
- ```datasets/figures/health```
- ```datasets/figures/politics```


### Description of method

To generate the data, we looked for the presence of keywords in the title and description of each video. For example, for the World Cup we looked for *'world cup', 'football', 'soccer'*, then grouped the matching videos by upload month and computed the number of videos and the total number of views per month.
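
The matching and monthly grouping described above can be sketched on a toy table. This is only an illustration under assumptions: the four rows and their values are made up, and the stop word removal done in the real pipeline is left out for brevity.

```python
import pandas as pd

# Toy stand-in for the video metadata (all values are made up for illustration)
videos = pd.DataFrame({
    "title": ["World Cup final highlights", "Budget vote",
              "Soccer transfer news", "Football talk"],
    "description": ["All the goals", "Parliament debate",
                    "Rumours", "World cup preview"],
    "upload_date": ["2018-07-15", "2018-07-20", "2018-08-02", "2018-08-10"],
    "view_count": [1000, 200, 300, 50],
})
keywords = ["world cup", "football", "soccer"]

# Combine title and description, lower-case them, and match any keyword
text = (videos["title"] + ": " + videos["description"]).str.lower()
mask = text.str.contains("|".join(keywords))

# Keep the matching videos and reduce each upload date to its month
matched = videos[mask].copy()
matched["upload_date"] = pd.to_datetime(matched["upload_date"]).dt.to_period("M")

# Count the videos and sum the views per month
monthly = matched.groupby("upload_date").agg(
    num_videos=("view_count", "size"),
    total_views=("view_count", "sum"),
)
print(monthly)
```

Here the "Budget vote" row is dropped, July 2018 ends up with one video and August 2018 with two.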

In [2]:
def get_data(data_file, saving_path, list_of_words, theme='', chunksize=10**6):
    stop_words = stopwords.words('english')
    for i,chunk in enumerate(pd.read_json(data_file, lines=True, chunksize=chunksize)):
        #chunk = chunk.loc[chunk['categories'] == 'News & Politics']
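        chunk = chunk[chunk.categories.isin(['News & Politics'])].copy()
        # cast both columns to str so that a missing description does not turn video_info into NaN
        chunk["video_info"] = chunk['title'].astype(str) +": "+ chunk["description"].astype(str)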
        # drop these columns to conserve space    
        chunk = chunk.drop(['title'],  axis=1)
        chunk = chunk.drop(['description'], axis=1)

        chunk['video_info'] = chunk['video_info'].str.lower()

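        # Exclude stop words with a list comprehension and pandas.Series.apply.
        # Note: this also strips stop words such as 'how', 'to' and 'of' from the text,
        # so a keyword that contains them (e.g. 'how to') cannot match afterwards.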
        chunk['video_info'] = chunk['video_info'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

        # convert upload_date to a monthly period so the videos can be grouped by month
        chunk['upload_date'] = pd.to_datetime(chunk['upload_date'], format='%Y-%m-%d').dt.to_period('M')

        # look for the specified words in video titles and descriptions
        msk = chunk.video_info.str.contains('|'.join(list_of_words), case=False, na=False)

        # only write a file when at least one video in the chunk matches
        if msk.any():
            chunk = chunk[msk]

            num_videos = chunk.groupby(by='upload_date')['upload_date'].agg('count')
            total_view = chunk.groupby(by='upload_date')['view_count'].sum()
            df = pd.concat(
                {'num_videos': num_videos,
                'total_views': total_view},
                axis=1)

            compression_opts = dict(method='zip',archive_name=f'{theme}_{i}.csv') 
            df.to_csv(saving_path+'{}_{}.zip'.format(theme, i), compression=compression_opts)

        # report progress every 10 chunks
        if i%10 == 0:
            print(i)

In [3]:
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'

In [4]:
def check_create_dir(path):
    # If the folder doesn't exist, create it.
    if not os.path.isdir(path):
        os.makedirs(path)
        print("created folder : ", path)
    else:
        print(path, "folder already exists.")

In [5]:
# sum the monthly video and view counts of two partial results
def merge_sum_df(df1,df2):
    return df1.groupby('upload_date').sum().add(df2.groupby('upload_date').sum(), fill_value=0).reset_index()
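
As a quick illustration of what this merge does, the sketch below applies the same expression to two hypothetical per-chunk results. The month labels and counts are made up; the column names follow the CSV files written by `get_data`.

```python
import pandas as pd

# Two hypothetical per-chunk aggregates (made-up numbers)
chunk_a = pd.DataFrame({"upload_date": ["2018-07", "2018-08"],
                        "num_videos": [1, 2],
                        "total_views": [1000, 350]})
chunk_b = pd.DataFrame({"upload_date": ["2018-08", "2018-09"],
                        "num_videos": [3, 1],
                        "total_views": [600, 40]})

# Same expression as in merge_sum_df: months present in both chunks are summed,
# months present in only one chunk are kept thanks to fill_value=0
merged = (
    chunk_a.groupby("upload_date").sum()
    .add(chunk_b.groupby("upload_date").sum(), fill_value=0)
    .reset_index()
)
print(merged)
```

The overlapping month (2018-08) gets 5 videos and 950 views, while the other two months are carried over unchanged. Because the two inputs cover different months, the summed columns may come back as floats.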

In [6]:
def plot(df_, path_to_save, title=''):
    fig = make_subplots(specs=[[{"secondary_y": True}]])

    # Add traces
    fig.add_trace(
        go.Scatter(x=df_['upload_date'], y=df_['num_videos'], name="Number of videos uploaded for {}".format(title)),
        secondary_y=False,
    )

    fig.add_trace(
        go.Scatter(x=df_['upload_date'], y=df_['total_views'], name="Number of views for {}".format(title)),
        secondary_y=True,
    )

    # Add figure title
    fig.update_layout(
        title_text="Number of videos uploaded and number of views over time"
    )

    # Set x-axis title
    fig.update_xaxes(title_text="Date")

    # Set y-axes titles
    fig.update_yaxes(title_text="<b>primary</b> Number of videos", secondary_y=False)
    fig.update_yaxes(title_text="<b>secondary</b> Number of views", secondary_y=True)

    fig.show()
    fig.write_html(path_to_save + '/num_views_videos_{}.html'.format(title))

In [7]:
def get_file_names(path_files):
    file_names = []
    for path in os.listdir(path_files):
        # check if current path is a file
        if os.path.isfile(os.path.join(path_files, path)):
            if path[0]=='.':
                continue
            file_names.append(path)
    return file_names


def create_df(path,file_names):
    df_ = pd.read_csv(path+'/'+file_names[0], compression='zip')
    for i,file in enumerate(file_names):
        if i==0:
            continue
        df = pd.read_csv(path+'/'+file, compression='zip')
        df_ = merge_sum_df(df_,df)
    return df_

# Sport events

Plot the graphs for sports events.

### World cup - Football

In [8]:
PATH_TO_SAVE_IMAGES = '../datasets/figures/sports/'
list_words = ['world cup', 'soccer', 'football']
path_football = '../datasets/sport/world_cup/'
file_names = get_file_names(path_football)
df_football = create_df(path_football,file_names)
plot(df_football, PATH_TO_SAVE_IMAGES, 'World Cup')

### NBA - Basketball

In [9]:
path = '../datasets/sport/NBA/'
list_words = ['basketball', 'NBA']
file_names = get_file_names(path)
df_nba = create_df(path,file_names)
plot(df_nba, PATH_TO_SAVE_IMAGES, 'NBA')

### Olympic games

In [10]:
path = '../datasets/sport/Olympics/'
list_words = ['olympic', 'olympic games', 'IOC']
file_names = get_file_names(path)
df_olympics = create_df(path,file_names)
plot(df_olympics, PATH_TO_SAVE_IMAGES, 'Olympics')

### Tennis

In [11]:
path = '../datasets/sport/tennis/'
list_words = ['tennis', 'grand slam', 'Roland Garros', 'Australian Open', 'US open', 'Wimbledon']
file_names = get_file_names(path)
df_tennis = create_df(path,file_names)
plot(df_tennis, PATH_TO_SAVE_IMAGES, 'Tennis')

### MLB - Baseball

In [12]:
path = '../datasets/sport/baseball/'
list_words = ['baseball', 'MLB']
file_names = get_file_names(path)
df_baseball = create_df(path,file_names)
plot(df_baseball, PATH_TO_SAVE_IMAGES, 'Baseball')

In [13]:
list_df = [df_football, df_nba, df_olympics, df_tennis, df_baseball]
list_titles = ['World cup', 'NBA', 'Olympics', 'Tennis', 'Baseball']

In [14]:
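def create_complete_df(list_df, list_titles):
    # Index every topic by month, then align the topics on the union of all months.
    # Assigning columns into an empty DataFrame would keep only the months of the first topic.
    indexed = [df.set_index('upload_date') for df in list_df]
    df_num_videos = pd.concat(
        {title: df['num_videos'] for title, df in zip(list_titles, indexed)}, axis=1).sort_index()
    df_num_views = pd.concat(
        {title: df['total_views'] for title, df in zip(list_titles, indexed)}, axis=1).sort_index()
    return df_num_views, df_num_videos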
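
To compare topics on one graph, each topic's monthly series has to sit side by side in a table indexed by month. The sketch below shows one way to do that alignment with `pd.concat`, which keeps the union of all months. The two small tables are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly totals for two topics (made-up numbers)
flu = pd.DataFrame({"upload_date": ["2014-01", "2014-02"],
                    "total_views": [10, 20]})
ebola = pd.DataFrame({"upload_date": ["2014-02", "2014-03"],
                      "total_views": [5, 7]})

# One column per topic; a month missing for a topic becomes NaN
views = pd.concat(
    {"Flu": flu.set_index("upload_date")["total_views"],
     "Ebola": ebola.set_index("upload_date")["total_views"]},
    axis=1,
).sort_index()
print(views)
```

Months that only one topic covers stay in the table with `NaN` for the other topic, so no part of a series is silently dropped before plotting.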

In [15]:
df_num_views, df_num_videos = create_complete_df(list_df, list_titles)

In [16]:
# join plots on a single graph for comparison
def multi_plot(df, path_to_save, type='', title='', addAll = True):
    fig = go.Figure()

    for column in df.columns.to_list():
        fig.add_trace(
            go.Scatter(
                x = df.index,
                y = df[column],
                name = column
            )
        )

    button_all = dict(label = 'All',
                      method = 'update',
                      args = [{'visible': df.columns.isin(df.columns),
                               'title': 'All',
                               'showlegend':True}])

    def create_layout_button(column):
        return dict(label = column,
                    method = 'update',
                    args = [{'visible': df.columns.isin([column]),
                             'title': column,
                             'showlegend': True}])

    fig.update_layout(
        updatemenus=[go.layout.Updatemenu(
            active = 0,
            buttons = ([button_all] * addAll) + list(df.columns.map(lambda column: create_layout_button(column)))
            )
        ])
    fig.update_xaxes(title_text="Date")
    fig.update_yaxes(title_text=type)
    
    fig.show()
    fig.write_html(path_to_save+'/{}_{}.html'.format(type, title))
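
The dropdown buttons rely on a visibility mask with one boolean per trace, in column order. A minimal sketch of how those masks come out of `df.columns.isin`, using a made-up two-column table:

```python
import pandas as pd

# Made-up table with one column per topic, as passed to multi_plot
df = pd.DataFrame({"Flu": [1, 2], "Ebola": [3, 4]})

# "All" button: every trace is visible
visible_all = df.columns.isin(df.columns).tolist()

# Button for a single topic: only the matching trace is visible
visible_ebola = df.columns.isin(["Ebola"]).tolist()

print(visible_all, visible_ebola)
```

Since the traces are added in the same order as the columns, the position of each `True` lines up with the trace it should show.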

In [17]:
multi_plot(df_num_videos, PATH_TO_SAVE_IMAGES, type='Number of videos', title = 'sport_events')

In [18]:
multi_plot(df_num_views, PATH_TO_SAVE_IMAGES, type='Total number of views', title = 'sport_events')

# Trends

Generate data and plots for different trending topics on YouTube.

In [90]:
theme = 'fortnite'
PATH_TO_SAVE = '../datasets/trends/'+theme+'/'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['fortnite']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/trends/fortnite/ folder already exists.
../datasets/figures/tech folder already exists.


In [None]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [91]:
theme = 'clash_of_clans'
PATH_TO_SAVE = '../datasets/trends/'+theme+'/'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['clash of clans', 'coc']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/trends/clash_of_clans/ folder already exists.
../datasets/figures/tech folder already exists.


In [16]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

0
10
20
30
40
50
60
70


In [92]:
theme = 'gangnam_style'
PATH_TO_SAVE = '../datasets/trends/'+theme+'/'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['psy', 'gangnam style']

check_create_dir(PATH_TO_SAVE)

../datasets/trends/gangnam_style/ folder already exists.


In [33]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [93]:
theme = 'how_to'
PATH_TO_SAVE = '../datasets/trends/'+theme+'/'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['how to']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/trends/how_to/ folder already exists.
../datasets/figures/tech folder already exists.


In [36]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [94]:
theme = 'pauls_brothers'
PATH_TO_SAVE = '../datasets/trends/'+theme+'/'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['jake paul', 'logan paul']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/trends/pauls_brothers/ folder already exists.
../datasets/figures/tech folder already exists.


In [38]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [95]:
PATH_TO_SAVE_IMAGES = '../datasets/figures/trends/'
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/figures/trends/ folder already exists.


### Clash of clans

In [96]:
path = '../datasets/trends/clash_of_clans/'
file_names = get_file_names(path)
df_coc = create_df(path,file_names)
plot(df_coc, PATH_TO_SAVE_IMAGES, 'Clash of clans')

### Fortnite

In [97]:
path = '../datasets/trends/fortnite/'
file_names = get_file_names(path)
df_fortnite = create_df(path,file_names)
plot(df_fortnite, PATH_TO_SAVE_IMAGES, 'fortnite')

### Paul brothers

In [98]:
path = '../datasets/trends/pauls_brothers/'
file_names = get_file_names(path)
df_pauls = create_df(path,file_names)
plot(df_pauls, PATH_TO_SAVE_IMAGES, 'Pauls brothers')

### Gangnam style

In [99]:
path = '../datasets/trends/gangnam_style/'
file_names = get_file_names(path)
df_gs = create_df(path,file_names)
plot(df_gs, PATH_TO_SAVE_IMAGES, 'Gangnam style')

### How to videos

In [100]:
path = '../datasets/trends/how_to/'
file_names = get_file_names(path)
df_howto = create_df(path,file_names)
plot(df_howto, PATH_TO_SAVE_IMAGES, 'How to')

In [101]:
list_df = [df_coc, df_fortnite, df_gs, df_howto, df_pauls]
list_titles = ['Clash of clans', 'Fortnite', 'Gangnam style', 'How to - videos', 'Paul Brothers']

In [102]:
df_num_views, df_num_videos = create_complete_df(list_df, list_titles)

In [103]:
multi_plot(df_num_videos, PATH_TO_SAVE_IMAGES, type='Number of videos', title = 'trending_events')

In [104]:
multi_plot(df_num_views, PATH_TO_SAVE_IMAGES, type='Total number of views', title = 'trending_events')

# Health

Generate data and graphs for health events.

In [105]:
theme = 'flu'
PATH_TO_SAVE = '../datasets/health/'+theme+'/'
PATH_TO_SAVE_IMAGES = '../datasets/figures/health'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['flu', 'influenza']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/health/flu/ folder already exists.
../datasets/figures/health folder already exists.


In [67]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [106]:
theme = 'ebola'
PATH_TO_SAVE = '../datasets/health/'+theme+'/'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['ebola', 'ebola virus disease', 'evd', 'ebola hemorrhagic fever', 'ehf', 'ebolaviruse', 'ebolaviruses']

check_create_dir(PATH_TO_SAVE)

../datasets/health/ebola/ folder already exists.


In [69]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

### Flu

In [107]:
path = '../datasets/health/flu/'
file_names = get_file_names(path)
df_flu = create_df(path,file_names)
plot(df_flu, PATH_TO_SAVE_IMAGES, 'Flu')

### Ebola

In [51]:
path = '../datasets/health/ebola/'
file_names = get_file_names(path)
df_ebola = create_df(path,file_names)
plot(df_ebola, PATH_TO_SAVE_IMAGES, 'Ebola')

In [108]:
list_df = [df_flu, df_ebola]
list_titles = ['Flu', 'Ebola']

In [109]:
df_num_views, df_num_videos = create_complete_df(list_df, list_titles)

In [110]:
multi_plot(df_num_videos, PATH_TO_SAVE_IMAGES, type='Number of videos', title = 'Health')

In [111]:
multi_plot(df_num_views, PATH_TO_SAVE_IMAGES, type='Total number of views', title = 'Health')

# Politics

Generate data and graphs for political events.

In [112]:
theme = 'us_elections'
PATH_TO_SAVE = '../datasets/politics/'+theme+'/'
PATH_TO_SAVE_IMAGES = '../datasets/figures/politics'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['us elections', 'obama', 'trump', 'biden', 'bush', 'mccain', 'palin', 'clinton', 'romney']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/politics/us_elections/ folder already exists.
../datasets/figures/politics folder already exists.


In [77]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [113]:
theme = 'climate_change'
PATH_TO_SAVE = '../datasets/politics/'+theme+'/'

DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['global warming', 'carbon dioxide', 'co2', 'greenhouse effect', 'greenhouse gas', 'ozone', 'climate change']

check_create_dir(PATH_TO_SAVE)


../datasets/politics/climate_change/ folder already exists.


In [79]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

### US elections

In [114]:
path = '../datasets/politics/us_elections'
file_names = get_file_names(path)
df_us = create_df(path,file_names)
plot(df_us, PATH_TO_SAVE_IMAGES, 'US elections')

### Climate change

In [115]:
path = '../datasets/politics/climate_change'
file_names = get_file_names(path)
df_cc = create_df(path,file_names)
plot(df_cc, PATH_TO_SAVE_IMAGES, 'Climate change')

In [116]:
list_df = [df_us, df_cc]
list_titles = ['US elections', 'Climate change']

In [117]:
df_num_views, df_num_videos = create_complete_df(list_df, list_titles)

In [118]:
multi_plot(df_num_views, PATH_TO_SAVE_IMAGES, type='Total number of views', title = 'politics')

In [119]:
multi_plot(df_num_videos, PATH_TO_SAVE_IMAGES, type='Number of videos', title = 'politics')

# Technology

Generate data and graphs for tech events

In [120]:
theme = 'iphone'
PATH_TO_SAVE = '../datasets/tech/'+theme+'/'
PATH_TO_SAVE_IMAGES = '../datasets/figures/tech'
DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['iphone', 'apple', 'smartphone']

check_create_dir(PATH_TO_SAVE)
check_create_dir(PATH_TO_SAVE_IMAGES)

../datasets/tech/iphone/ folder already exists.
../datasets/figures/tech folder already exists.


In [58]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

In [121]:
theme = 'electric_car'
PATH_TO_SAVE = '../datasets/tech/'+theme+'/'

DATA_FILE = '../original_data/yt_metadata_en.jsonl.gz'
list_words = ['electric car', 'tesla', 'plug-in']

check_create_dir(PATH_TO_SAVE)


../datasets/tech/electric_car/ folder already exists.


In [60]:
get_data(DATA_FILE,PATH_TO_SAVE,list_words, theme)

### iPhone

In [122]:
path = '../datasets/tech/iphone'
file_names = get_file_names(path)
df_iphone = create_df(path,file_names)
plot(df_iphone, PATH_TO_SAVE_IMAGES, 'iPhone')

### Electric cars

In [123]:
path = '../datasets/tech/electric_car'
file_names = get_file_names(path)
df_ec = create_df(path,file_names)
plot(df_ec, PATH_TO_SAVE_IMAGES, 'Electric Car')

In [124]:
list_df = [df_ec, df_iphone]
list_titles = ['Electric cars', 'iPhone']

In [125]:
df_num_views, df_num_videos = create_complete_df(list_df, list_titles)

In [126]:
multi_plot(df_num_views, PATH_TO_SAVE_IMAGES, type='Total number of views', title = 'Technology')

In [127]:
multi_plot(df_num_videos, PATH_TO_SAVE_IMAGES, type='Number of videos', title = 'Technology')