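Clear all variables from the current environment (in case some cells have already been run) to start from a clean state. Note that %reset asks for confirmation before deleting anything; %reset -f skips the prompt.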

In [148]:
%reset

Import all necessary packages.

In [149]:
import networkx as nx
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import os
import shutil
from datetime import datetime
from dateutil import parser
import json

Replace with the path to the root folder of the project.

In [150]:
rootdir_path = '/home/andreistoica12/research-internship'

Replace with the path to the folder where we store the dataset.

In [151]:
data_path = '/home/andreistoica12/research-internship/data/PhemeDataset'

Create two subfolders to store important files and graphs, respectively. If they already exist from a previous run of the project, delete them together with their contents and recreate them empty, so that they only hold files and graphs produced by the current run.

In [152]:
files_path = os.path.join(rootdir_path, 'files')
if os.path.exists(files_path):
    shutil.rmtree(files_path)
os.makedirs(files_path)

graphs_path = os.path.join(rootdir_path, 'graphs')
if os.path.exists(graphs_path):
    shutil.rmtree(graphs_path)
os.makedirs(graphs_path)
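
The delete-and-recreate step above is written out twice, once per folder. As a minimal sketch, the same logic could live in a small helper; `reset_dir` is a hypothetical name that does not appear elsewhere in this notebook, and the demonstration runs in a temporary folder so that it does not touch the project.

```python
import os
import shutil
import tempfile

def reset_dir(dir_path):
    """Delete dir_path with all its contents if it exists, then recreate it empty."""
    if os.path.exists(dir_path):
        shutil.rmtree(dir_path)
    os.makedirs(dir_path)
    return dir_path

# Demonstration inside a throwaway parent folder
with tempfile.TemporaryDirectory() as parent:
    target = os.path.join(parent, "graphs")
    os.makedirs(target)
    # Made-up leftover file standing in for output of a previous run
    open(os.path.join(target, "old_plot.png"), "w").close()
    reset_dir(target)
    remaining = os.listdir(target)

print(remaining)
```

With such a helper, each of the two folders could be prepared with a single call, for example `files_path = reset_dir(os.path.join(rootdir_path, 'files'))`.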

For now, I analyse only the Charlie Hebdo shooting event. An event folder contains one subfolder per story, where a story is a source tweet together with the reactions to it. I therefore save the path to the event folder.

In [153]:
charlie_hebdo_event_path = os.path.join(data_path, "threads", "en", "charliehebdo")

Here, I define a function that reads a tweet's JSON file into a dictionary and parses the date stored under the "created_at" key. It returns the hour of that date as an integer between 0 and 23.

In [154]:
def tweet_hour(tweet_path):
    with open(tweet_path) as f:
        tweet = json.load(f)
    date = parser.parse(tweet['created_at'])
    return date.hour
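
As a quick check of what `tweet_hour` extracts, the sketch below parses a single timestamp string. It assumes that "created_at" follows the layout of Twitter's v1.1 API; the sample value is made up for illustration, and the standard library's `datetime.strptime` stands in for `dateutil.parser` so that the snippet has no third-party dependency. `hour_from_created_at` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

# Hypothetical sample value, assuming the "created_at" layout of Twitter's v1.1 API
sample_created_at = "Wed Jan 07 11:06:08 +0000 2015"

# Weekday, month, day, time, UTC offset, year
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def hour_from_created_at(created_at):
    """Return the hour (0-23) of a Twitter-style timestamp string."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT).hour

sample_hour = hour_from_created_at(sample_created_at)
print(sample_hour)
```

The hour is the one written in the timestamp itself, under the offset it carries (`+0000` in this sample), not the local time of the person who posted the tweet.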

I define a function that returns the path to the source tweet, given the story path. It takes the first entry listed in the story's "source-tweets" folder.

In [155]:
def source_tweet_path(story_path):
    source_dir_path = story_path + "/source-tweets"
    source_path = source_dir_path + "/" + os.listdir(source_dir_path)[0]
    
    return source_path

I define a function to return a list of all reactions' paths.

In [156]:
def reaction_tweets_paths(story_path):
    reactions_paths_list = []
    reactions_dir_path = story_path + "/reactions"
    for reaction_name in os.listdir(reactions_dir_path):
        reaction_path = reactions_dir_path + "/" + reaction_name
        reactions_paths_list.append(reaction_path)
        
    return reactions_paths_list
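
`os.listdir` returns every entry of a folder, in arbitrary order and including hidden files. If a copy of the dataset contains such extra entries (an assumption worth checking), both `source_tweet_path` and `reaction_tweets_paths` would pick them up. Below is a minimal sketch of a stricter listing, assuming that tweet files end in `.json`; `list_json_files` is a hypothetical helper and the file names are made up.

```python
import os
import tempfile

def list_json_files(dir_path):
    """Return the sorted paths of the visible .json files in dir_path."""
    return sorted(
        os.path.join(dir_path, name)
        for name in os.listdir(dir_path)
        if name.endswith(".json") and not name.startswith(".")
    )

# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["2.json", "1.json", "._1.json", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()
    found_names = [os.path.basename(path) for path in list_json_files(tmp_dir)]

print(found_names)
```

Sorting also makes the result reproducible between runs, which matters when only the first entry is used.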

I define a function that collects the posting hours of all tweets in a story (the source tweet and every reaction) in a list.

In [157]:
def hours_list_story(story_path):
    hours = []
    source_path = source_tweet_path(story_path)
    hour = tweet_hour(source_path)
    hours.append(hour)
    reactions_paths_list = reaction_tweets_paths(story_path)
    for reaction_path in reactions_paths_list:
        hour = tweet_hour(reaction_path)
        hours.append(hour)
    
    return hours

Here, I define a function that returns a pandas Series with the number of tweets posted in each hour of the day for the event whose path is given as input. I convert the list of hours to a pandas Series because this makes it easy to compute the distribution.

In [158]:
def time_distribution_event(event_path):
    hours = []
    for story_id in os.listdir(event_path):
        story_path = event_path + "/" + story_id
        hours.extend(hours_list_story(story_path))
    hours_series = pd.Series(hours)
    # Count the tweets per hour and order the result by hour of day
    distribution = hours_series.value_counts().sort_index()
    
    return distribution
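
Note that `value_counts()` only lists hours that occur at least once, so hours without tweets are missing from the result rather than shown as zero. The sketch below uses made-up hours and the standard library's `Counter` instead of pandas to show one way of getting a count for every hour from 0 to 23; `hourly_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def hourly_counts(hours):
    """Return a list of 24 counts, one per hour of the day (index = hour)."""
    counts = Counter(hours)
    return [counts.get(hour, 0) for hour in range(24)]

# Made-up sample: three tweets at 11h, one at 12h, one at 23h
sample_hours = [11, 12, 11, 23, 11]
sample_counts = hourly_counts(sample_hours)

print(sample_counts[11], sample_counts[12], sample_counts[0])
```

With pandas, `distribution.reindex(range(24), fill_value=0)` should give the same zero-filled result while keeping the Series type used by the plotting function.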


The following function plots, as a bar chart, the hourly distribution of the tweets posted about a specific event and saves the figure in the graphs folder.

In [159]:
def plot_event_distribution(event_name, distribution):
    axes = distribution.plot(kind='bar')
    figure_path = os.path.join(graphs_path, f"{event_name}_distribution.png")
    axes.figure.savefig(figure_path)
    plt.close()

In [160]:
distribution = time_distribution_event(charlie_hebdo_event_path)

In [161]:
plot_event_distribution("charlie_hebdo", distribution)