Clear all variables from the current environment (in case some cells have already been run) to start from a clean state.

In [1]:
%reset

Import all necessary packages.

In [2]:
import networkx as nx
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import os
import shutil
from datetime import datetime
from dateutil import parser

Replace the value below with the path to the root folder of the project.

In [3]:
rootdir_path = '/home/andreistoica12/research-internship'

Replace the value below with the path to the folder where the dataset is stored.

In [4]:
data_path = '/home/andreistoica12/research-internship/data/covaxxy-csv'

Create two subfolders to store important files and graphs, respectively. If they already exist from a previous run of the project, delete them together with their contents and recreate them empty, so that they only hold files and graphs relevant to the current state of the project.

In [5]:
files_path = os.path.join(rootdir_path, 'files')
if os.path.exists(files_path):
    shutil.rmtree(files_path)
os.makedirs(files_path)

graphs_path = os.path.join(rootdir_path, 'graphs')
if os.path.exists(graphs_path):
    shutil.rmtree(graphs_path)
os.makedirs(graphs_path)

In [6]:
covaxxy_graphs_path = os.path.join(graphs_path, 'covaxxy')
if os.path.exists(covaxxy_graphs_path):
    shutil.rmtree(covaxxy_graphs_path)
os.makedirs(covaxxy_graphs_path)

In [7]:
covaxxy_longitudinal_analysis_graphs = os.path.join(covaxxy_graphs_path, 'longitudinal-analysis')
if os.path.exists(covaxxy_longitudinal_analysis_graphs):
    shutil.rmtree(covaxxy_longitudinal_analysis_graphs)
os.makedirs(covaxxy_longitudinal_analysis_graphs)
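
The cells above repeat the same delete-and-recreate pattern for each folder. As a minimal sketch, that pattern could be wrapped in a small helper. The name `reset_dir` is hypothetical and not part of the notebook, and the demonstration runs in a temporary directory rather than in the project folder.

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete `path` if it exists, then recreate it as an empty directory."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path


# Demonstration in a throwaway temporary directory (not the project folder).
with tempfile.TemporaryDirectory() as tmp_root:
    demo_path = os.path.join(tmp_root, "graphs", "covaxxy")
    reset_dir(demo_path)
    # Leave a stale file behind, as a previous run would.
    open(os.path.join(demo_path, "old.png"), "w").close()
    # Resetting again removes the stale file and leaves an empty folder.
    reset_dir(demo_path)
    remaining = os.listdir(demo_path)
```

With such a helper, each folder in the notebook would need a single call, for example `reset_dir(os.path.join(rootdir_path, 'files'))`.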

A list of the data files currently needed for my analysis.

In [8]:
file_list = os.listdir(data_path)

In [9]:
file_list

['tweet_ids--2021-03-02.csv',
 'tweet_ids--2021-03-03.csv',
 'tweet_ids--2021-03-05.csv',
 'tweet_ids--2021-03-04.csv',
 'tweet_ids--2021-03-01.csv']

For the sake of simplicity and consistency, I will store all data in chronological order, so I sort the list of file names from the start.

In [10]:
file_list.sort(key=lambda date: datetime.strptime(date, "tweet_ids--%Y-%m-%d.csv"))

In [11]:
file_list

['tweet_ids--2021-03-01.csv',
 'tweet_ids--2021-03-02.csv',
 'tweet_ids--2021-03-03.csv',
 'tweet_ids--2021-03-04.csv',
 'tweet_ids--2021-03-05.csv']
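
As a side note on the sort above: because all file names share the same prefix and use zero-padded `YYYY-MM-DD` dates, a plain alphabetical sort should give the same order as the date-based sort. The sketch below checks this on the five file names listed earlier. Sorting by the parsed date remains the safer choice if the naming scheme ever changes.

```python
from datetime import datetime

# The file names as returned by os.listdir() in the notebook.
names = [
    "tweet_ids--2021-03-02.csv",
    "tweet_ids--2021-03-03.csv",
    "tweet_ids--2021-03-05.csv",
    "tweet_ids--2021-03-04.csv",
    "tweet_ids--2021-03-01.csv",
]

# Sort by the date parsed from the file name (the notebook's approach).
by_date = sorted(names, key=lambda name: datetime.strptime(name, "tweet_ids--%Y-%m-%d.csv"))

# Sort alphabetically; equivalent here because the dates are zero-padded.
by_name = sorted(names)

same_order = by_date == by_name
```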

I parse the dates of the tweets from the file names and convert them into datetime objects. This makes it easy to get the day, month and year, since they are attributes of datetime objects.

In [12]:
keys_datetime = [ datetime.strptime(key, "tweet_ids--%Y-%m-%d.csv") for key in file_list ]

In [13]:
keys_datetime

[datetime.datetime(2021, 3, 1, 0, 0),
 datetime.datetime(2021, 3, 2, 0, 0),
 datetime.datetime(2021, 3, 3, 0, 0),
 datetime.datetime(2021, 3, 4, 0, 0),
 datetime.datetime(2021, 3, 5, 0, 0)]

Ultimately, I will store each .csv file as a pandas DataFrame in a dictionary whose keys are a simplified form of the date. Here, I format the datetime objects into such simple strings.

In [14]:
keys = [ "{day}-{month}-{year}".format(day=key.day, month=key.month, year=key.year) for key in keys_datetime ]

In [15]:
keys

['1-3-2021', '2-3-2021', '3-3-2021', '4-3-2021', '5-3-2021']
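
The two steps above (parsing the file name, then formatting the key) can be sketched as a single function. The name `date_key` is hypothetical. The sketch also shows that `strftime("%d-%m-%Y")` would produce zero-padded keys such as `01-03-2021`, which differ from the unpadded keys used in this notebook.

```python
from datetime import datetime


def date_key(file_name):
    """Turn 'tweet_ids--2021-03-01.csv' into the unpadded key '1-3-2021'."""
    date = datetime.strptime(file_name, "tweet_ids--%Y-%m-%d.csv")
    return f"{date.day}-{date.month}-{date.year}"


example_key = date_key("tweet_ids--2021-03-01.csv")

# For comparison only: strftime zero-pads day and month.
padded_key = datetime(2021, 3, 1).strftime("%d-%m-%Y")
```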

To read the data, I need to pass the path of each file to the read_csv() function. The order of the file paths needs to be consistent with the order of the dates in the keys.

In [16]:
paths = [ os.path.join(data_path, file) for file in file_list ]

In [17]:
paths

['/home/andreistoica12/research-internship/data/covaxxy-csv/tweet_ids--2021-03-01.csv',
 '/home/andreistoica12/research-internship/data/covaxxy-csv/tweet_ids--2021-03-02.csv',
 '/home/andreistoica12/research-internship/data/covaxxy-csv/tweet_ids--2021-03-03.csv',
 '/home/andreistoica12/research-internship/data/covaxxy-csv/tweet_ids--2021-03-04.csv',
 '/home/andreistoica12/research-internship/data/covaxxy-csv/tweet_ids--2021-03-05.csv']

Here, I will build the dictionary where the keys represent the formatted simple date and the values are dataframes corresponding to each file.

In [18]:
days = dict()
for i in range(len(file_list)):
    days[keys[i]] = pd.read_csv(paths[i], index_col=0)
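
Since the keys and the paths must stay aligned, one way to make that pairing explicit is to `zip` the two lists. The following is a minimal sketch, not the notebook's own code: `load_days` is a hypothetical helper, and the demonstration writes two tiny CSV files with made-up values to a temporary directory instead of reading the real dataset.

```python
import os
import tempfile

import pandas as pd


def load_days(keys, paths):
    """Map each key to the DataFrame read from the path at the same position."""
    if len(keys) != len(paths):
        raise ValueError("keys and paths must have the same length")
    return {key: pd.read_csv(path, index_col=0) for key, path in zip(keys, paths)}


# Demonstration with small placeholder files (illustrative values only).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_keys = ["1-3-2021", "2-3-2021"]
    demo_paths = []
    for key in demo_keys:
        path = os.path.join(tmp_dir, f"{key}.csv")
        pd.DataFrame({"tweet_id": [1, 2], "followers_count": [10, 20]}).to_csv(path)
        demo_paths.append(path)
    demo_days = load_days(demo_keys, demo_paths)
```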

In [19]:
days['1-3-2021']

Unnamed: 0,created_at,tweet_id,author_id,text,followers_count,following_count
0,2021-03-01T00:01:56.000Z,1366176845561962503,14914686,@UK_Centrist @_PhB @RolandBakerIII @RicardLope...,639,349
1,2021-03-01T00:01:57.000Z,1366176846895738883,2402490445,"RT @THE_Russell: Berijiklian: ""There may be a ...",1215,4924
2,2021-03-01T00:01:57.000Z,1366176847822811145,56147198,RT @YvetteCooperMP: Cases of the Brazil varian...,1304,589
3,2021-03-01T00:01:57.000Z,1366176848225464323,1252300308165857280,RT @OfficialKat: Cannot wait for the vaccine. ...,154,267
4,2021-03-01T00:01:57.000Z,1366176848284057600,190474968,New vaccination appointments available tomorro...,1410,1602
...,...,...,...,...,...,...
606463,2021-03-01T03:27:26.000Z,1366228560290009091,1199985222432878592,"@goppiaziz At least, vaccinated person can get...",66,119
606464,2021-03-01T03:27:26.000Z,1366228561250701317,172453839,RT @POTUS: The more people that get vaccinated...,798,4999
606465,2021-03-01T03:27:26.000Z,1366228561389051905,47777004,RT @OfficialKat: Cannot wait for the vaccine. ...,2269,1650
606466,2021-03-01T03:27:27.000Z,1366228561724645376,1422286680,@Haugmoen It seems to be easier now that I hav...,290,293


In order to calculate the distribution of the tweets per hour, I will parse the "created_at" column, extract the hour and store it in a separate column in each dataframe. I will place it next to the "created_at" column so that it can be easily verified. The data originates from the Twitter API, so it comes in the standard ISO 8601 format, which can be easily parsed using the parser module from the dateutil package.

Note: the cell below runs for approximately 2 minutes 30 seconds on my machine (~25-30 seconds for each file).

In [23]:
for key, day in days.items():
    if 'hour' not in day.columns:
        hours = []
        for time in day.loc[:,"created_at"]:
            hour = parser.parse(time).hour
            hours.append(hour)
        day.insert(1, "hour", hours, True)
        print(key + " - added 'hour' column")
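
As a possible alternative to parsing each timestamp in a Python loop, pandas can parse the whole column at once with `pd.to_datetime` and expose the hour through the `.dt` accessor. This is a sketch under the assumption that every value in "created_at" is an ISO 8601 string like the ones shown above. It is usually considerably faster than row-by-row parsing, but it has not been timed on this dataset. The helper name `add_hour_column` is hypothetical.

```python
import pandas as pd


def add_hour_column(frame):
    """Insert an 'hour' column right after 'created_at' if it is missing."""
    if "hour" not in frame.columns:
        # Parse the whole column in one call; utc=True keeps the 'Z' (UTC) times as UTC.
        timestamps = pd.to_datetime(frame["created_at"], utc=True)
        position = frame.columns.get_loc("created_at") + 1
        frame.insert(position, "hour", timestamps.dt.hour)
    return frame


# Two timestamps taken from the sample output above.
demo = pd.DataFrame({
    "created_at": ["2021-03-01T00:01:56.000Z", "2021-03-01T03:27:26.000Z"],
    "tweet_id": [1366176845561962503, 1366228560290009091],
})
add_hour_column(demo)
```

Calling the helper a second time on the same dataframe leaves it unchanged, which mirrors the `'hour' not in day.columns` guard in the cell above.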


In [24]:
days['1-3-2021']

Unnamed: 0,created_at,hour,tweet_id,author_id,text,followers_count,following_count
0,2021-03-01T00:01:56.000Z,0,1366176845561962503,14914686,@UK_Centrist @_PhB @RolandBakerIII @RicardLope...,639,349
1,2021-03-01T00:01:57.000Z,0,1366176846895738883,2402490445,"RT @THE_Russell: Berijiklian: ""There may be a ...",1215,4924
2,2021-03-01T00:01:57.000Z,0,1366176847822811145,56147198,RT @YvetteCooperMP: Cases of the Brazil varian...,1304,589
3,2021-03-01T00:01:57.000Z,0,1366176848225464323,1252300308165857280,RT @OfficialKat: Cannot wait for the vaccine. ...,154,267
4,2021-03-01T00:01:57.000Z,0,1366176848284057600,190474968,New vaccination appointments available tomorro...,1410,1602
...,...,...,...,...,...,...,...
606463,2021-03-01T03:27:26.000Z,3,1366228560290009091,1199985222432878592,"@goppiaziz At least, vaccinated person can get...",66,119
606464,2021-03-01T03:27:26.000Z,3,1366228561250701317,172453839,RT @POTUS: The more people that get vaccinated...,798,4999
606465,2021-03-01T03:27:26.000Z,3,1366228561389051905,47777004,RT @OfficialKat: Cannot wait for the vaccine. ...,2269,1650
606466,2021-03-01T03:27:27.000Z,3,1366228561724645376,1422286680,@Haugmoen It seems to be easier now that I hav...,290,293


The final distribution is made up of the sum of all individual days' distributions. I save a figure in the graphs/ folder for each day, as well as an overall distribution.

In [25]:
# Start from zero counts for every hour observed on the first day.
final_distribution = pd.Series(0, index=days['1-3-2021'].loc[:,'hour'].sort_values(ascending=True).unique())
for key, day in days.items():
    # Tweets per hour for this day, ordered by hour.
    distribution = day.loc[:, "hour"].value_counts().sort_index()
    # fill_value=0 avoids NaN when an hour is present in only one of the two Series.
    final_distribution = final_distribution.add(distribution, fill_value=0)
    axes = distribution.plot(kind='bar')
    figure_path = os.path.join(covaxxy_longitudinal_analysis_graphs, f"{key}_distribution.png")
    axes.figure.savefig(figure_path)
    plt.close()
axes = final_distribution.plot(kind='bar')
figure_path = os.path.join(covaxxy_longitudinal_analysis_graphs, "overall_distribution.png")
axes.figure.savefig(figure_path)
plt.close()
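
The overall distribution can also be computed without a running sum, by concatenating the "hour" columns of all days and counting once. The sketch below assumes each dataframe already has an integer "hour" column. `overall_hourly_distribution` is a hypothetical name, and the demo frames contain made-up values. Reindexing on `range(24)` makes hours with no tweets appear explicitly as zero.

```python
import pandas as pd


def overall_hourly_distribution(frames):
    """Count tweets per hour (0-23) across all given dataframes."""
    hours = pd.concat([frame["hour"] for frame in frames], ignore_index=True)
    # Hours that never occur are filled in with a count of zero.
    return hours.value_counts().reindex(range(24), fill_value=0)


# Illustrative frames only; in the notebook this would be days.values().
demo_frames = [
    pd.DataFrame({"hour": [0, 0, 3]}),
    pd.DataFrame({"hour": [3, 23]}),
]
overall = overall_hourly_distribution(demo_frames)
```

In the notebook this would correspond to a call such as `overall_hourly_distribution(days.values())`, followed by the same `plot(kind='bar')` and `savefig` steps as above.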
