# Event detection in Twitter dataset
The goal of this project is to detect past events in Switzerland using a dataset of tweets. The dataset contains 28 million tweets, most of them sent from Switzerland. This notebook walks you through our project and explains our methodology. It is split into the following parts:


1. **Extracting hashtags and data cleaning**<br>
We split the tweets
2. **Data structure manipulation**<br>

3. **Improving quality of detected event**<br>

4. **Exporting data**<br>


To make our code more readable, we moved some functions to the file _"helpFunctions.py"_.<br>
Each time we use one of these functions, we explicitly write: **# EXTERNAL FUNCTION: name_of_the_function**

#### Importing useful libraries:

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn import metrics
import matplotlib.pyplot as plt
from functools import reduce
import math
from collections import Counter
import csv
from datetime import datetime
import re
import copy
import json
from datetime import timedelta
from operator import itemgetter
from helpFunctions import *
%matplotlib inline

Helper functions for checkpoints (a message printed with the current time) and for number formatting:

In [2]:
def pr(strToPrint):
    '''
    Print the current time (HH:MM:SS), followed by the string passed as argument.
    :param strToPrint: Regular string to print
    '''
    print(str(datetime.now().time())[:8] + ' ' + strToPrint)

def strNb(nb):
    '''
    Format a large number as a string, with '.' as the thousands separator.
    :param nb: The number to format
    '''
    return '{0:,}'.format(nb).replace(',', '.')

## 0 - Importing data

Importing a sample of the dataset (useful for testing):

In [3]:
pickle_filename = os.path.join('data','head_100k_pickle.pkl')
tw = pd.read_pickle(pickle_filename)

Preparing the column headers and the file name:

In [4]:
columns_header = ['id', 'userId', 'createdAt', 'text', 'longitude', 'latitude', 'placeId',
                  'inReplyTo', 'source', 'truncated', 'placeLatitude', 'placeLongitude', 'sourceName', 'sourceUrl',
                 'userName', 'screenName', 'followersCount', 'friendsCount', 'statusesCount',
                 'userLocation']
filename = os.path.join('data', 'twex.tsv')

Importing the whole dataset:

In [5]:
pr('Starting to read file... (3 min)')
# Uncomment the next two lines to load the whole dataset instead of the 100k sample loaded above.
# tw = pd.read_csv(filename, sep='\t', encoding='utf-8', escapechar='\\', names=columns_header,
#                       quoting=csv.QUOTE_NONE, na_values='N', header=None)
pr('File is loaded.')

20:05:32 Starting to read file... (3 min)
20:05:32 File is loaded.


In [6]:
print('The dataset contains {} tweets.'.format(strNb(len(tw))))

The dataset contains 100.000 tweets.


In [7]:
print('First rows of dataset:')
tw.head(2)

First rows of dataset:


,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,9514097914,17341000.0,2010-02-23 05:55:51,Guuuuten Morgen! :-),7.43926,46.9489,,,197,,,,TwitBird,http://www.nibirutech.com,Tilman Jentzsch,blickwechsel,586,508.0,9016.0,"Bern, Switzerland"
1,9514846412,7198280.0,2010-02-23 06:22:40,Still the best coffee in town — at La Stanza h...,8.53781,47.3678,,,550,,,,Gowalla,http://gowalla.com/,Nico Luchsinger,halbluchs,1820,703.0,4687.0,"Zurich, Switzerland"


## 1 - Extracting hashtags and data cleaning

### 1.1 - Extracting hashtags

To detect events, we decided to concentrate only on hashtags. Therefore, we extract the hashtags and keep only the tweets that contain at least one (the others contain no useful information for us). We perform this operation before the data cleaning because it is the one that reduces our data size the most, so all further operations run faster.<br>
To do so, we examine the _text_ field of each tweet, extract its hashtags and put them in a new column (as a list of hashtags per tweet).

In [8]:
# EXTERNAL FUNCTION: extract_hashtags(text)

pr('Extracting hashtags... (2 min)')

tw['hashtag'] = tw.text.apply(lambda x: extract_hashtags(str(x))) # Getting the list of hashtags into a new column
twh = tw[tw.hashtag.apply(len) > 0] # Dropping the rows (tweets) that contain no hashtags.

pr('We have extracted {} rows with hashtags out of the {} rows of our initial dataframe.'.format(strNb(len(twh)),strNb(len(tw))))

20:05:32 Extracting hashtags... (2 min)
20:05:33 We have extracted 19.719 rows with hashtags out of the 100.000 rows of our initial dataframe.


In [9]:
print('Examples of tweets (with only text and hashtag column):')
twh[['text', 'hashtag']].head(3)

Examples of tweets (with only text and hashtag column):


,text,hashtag
8,"Magic spells run off after midnight, I guess s...",[fb]
10,"Limitas of public transportation! No taxi, rai...",[yam]
15,"So, Feierabend. Jetzt #24 und später #VfB. — a...","[24, vfb]"
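
The extraction step can be sketched as follows. The real `extract_hashtags` lives in _helpFunctions.py_ and is not shown in this notebook, so `extract_hashtags_sketch` and its regular expression are assumptions: a hashtag is taken to be `#` followed by word characters, and hashtags are lowercased (as the `#VfB` example above suggests).

```python
import re

# Hypothetical stand-in for extract_hashtags (the real one is in helpFunctions.py).
# Assumption: a hashtag is '#' followed by word characters, stored in lowercase.
HASHTAG_PATTERN = re.compile(r'#(\w+)')

def extract_hashtags_sketch(text):
    '''Return the list of lowercased hashtags (without the '#') found in text.'''
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

print(extract_hashtags_sketch('So, Feierabend. Jetzt #24 und später #VfB.'))  # ['24', 'vfb']
print(extract_hashtags_sketch('Guuuuten Morgen! :-)'))                        # []
```

Tweets for which the returned list is empty are the ones dropped from the dataframe.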


### 1.2 - Data cleaning

We will now clean the remaining dataset. The creation date is important for our event detection, and in order to provide visualizations to the other team, we need location information for every tweet. Therefore, we drop the rows that contain NA values in the _createdAt_, _latitude_ or _longitude_ columns.

In [10]:
pr("Dropping rows with NA values (location and creation date).")
# .copy() gives an independent dataframe, so that later column assignments do not act on a view of twh
tw1 = twh.dropna(axis=0, how='any', subset=['createdAt', 'longitude', 'latitude']).copy()
pr('The data have been reduced from {} tweets to {} tweets.'.format(strNb(len(twh)), strNb(len(tw1))))

20:05:33 Dropping rows with NA values (location and creation date).
20:05:33 The data have been reduced from 19.719 tweets to 15.043 tweets.


The latitude and longitude values should be in _float_ format to analyze them correctly later.

In [11]:
tw1['latitude'] = tw1['latitude'].astype(float)
tw1['longitude'] = tw1['longitude'].astype(float)

Some dates in our initial data might not be convertible to pandas datetimes. Let's check that everything is okay.

In [12]:
pr('Starting to examine dates...')
# errors='coerce' turns values that cannot be parsed into NaT instead of raising an error
datetime_serie = pd.to_datetime(tw1['createdAt'], errors='coerce')
dateNotConvertible = datetime_serie[pd.isnull(datetime_serie)]
pr('There are {} dates that cannot be transformed.'.format(len(dateNotConvertible)))

20:05:33 Starting to examine dates...
20:05:33 There are 0 dates that cannot be transformed.
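
To illustrate what this check would catch, here is a minimal sketch on made-up values (the series `raw_dates` is hypothetical and not part of our dataset). It relies on `pd.to_datetime` with `errors='coerce'`, which returns `NaT` for entries that cannot be parsed.

```python
import pandas as pd

# Made-up series: two valid timestamps and one malformed entry.
raw_dates = pd.Series(['2010-02-23 05:55:51', 'not a date', '2010-02-23 06:22:40'])

# Unparseable values become NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, errors='coerce')
not_convertible = raw_dates[parsed.isnull()]

print(len(not_convertible))  # 1
```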


It looks good! Now, we will transform our data into a time series. To do so, the best way is to use the creation date of each row as the index of the dataframe.

In [13]:
pr('Converting to datetime...')
tw5 = tw1.copy()
tw5['createdAt'] = pd.to_datetime(tw1['createdAt'])
pr('Setting up new indices...')
tw5.index = tw5['createdAt']
pr('Deleting old "createdAt" column...')
del tw5['createdAt']
pr('Done. The updated dataframe:')
tw5.head(1)

20:05:33 Converting to datetime...
20:05:33 Setting up new indices...
20:05:33 Deleting old "createdAt" column...
20:05:33 Done. The updated dataframe:


createdAt,id,userId,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation,hashtag
2010-02-23 09:59:41,9519737890,14657900.0,"Magic spells run off after midnight, I guess s...",6.1387,46.175,,,1,,,,Twitter for iPhone,http://twitter.com/#!/download/iphone,Javier Belmonte,vichango,167,277.0,2885.0,"Geneva, Switzerland",[fb]


## 2 - Data structure manipulation

We want to have the data in the right format to process them. Therefore we will apply several operations to the dataframes.

### 2.1 - One hashtag = one row

Some people put more than one hashtag in a tweet. A tweet that contains several hashtags will be analyzed, for each of its hashtags, the same way as if it contained only that one. So, to process the data more easily, we build a dataframe with **one row per hashtag**: the rows that contain more than one hashtag are duplicated.

This is done by going through the dataframe and building, in parallel, a list of rows (with 1 hashtag per row) that need to be added to the old dataframe. These rows are collected in a list and appended to the dataframe afterwards.

In [14]:
# EXTERNAL FUNCTIONS: reset_added_hashtag_rows_list(), multiplyHashtagRows(row), get_added_hashtag_rows_list()

pr('Multiplying the hashtag rows... (around 10 min)')

# Initial reset of the list of added rows to start the processing
reset_added_hashtag_rows_list()
tw5_1 = tw5.copy()

# multiplyHashtagRows (applied to each row) returns the first hashtag and stores one extra row per remaining hashtag.
tw5_1['hashtag'] = tw5.apply(multiplyHashtagRows, args=[tw5.columns,], axis=1)

# Retrieve the list of added rows.
addedHashtagsRowsList = get_added_hashtag_rows_list()

pr('Finished! {} rows will be added to the dataframe!'.format(len(addedHashtagsRowsList)))

20:05:33 Multiplying the hashtag rows... (around 10 min)
20:05:35 Finished! 6812 rows will be added to the dataframe!


We will create a new dataframe with the additional rows and append it to the old one.

In [15]:
pr('Starting to make the new dataframe with additional rows and append it to the original dataframe..')

addedHashtagsDf = pd.DataFrame(addedHashtagsRowsList)
addedHashtagsDf.set_index(['createdAt'], inplace=True)

# Concatenate the original rows and the added rows
tw6 = pd.concat([tw5_1, addedHashtagsDf])

pr('Done. Original dataframe size was: {} - New dataframe size is: {}'.format(strNb(len(tw5_1)),strNb(len(tw6))))

20:05:35 Starting to make the new dataframe with additional rows and append it to the original dataframe..
20:05:35 Done. Original dataframe size was: 15.043 - New dataframe size is: 21.855


In [16]:
print('Examples of tweets (with only text and hashtag column):')
tw6[['text', 'hashtag']].head(3)

Examples of tweets (with only text and hashtag column):


createdAt,text,hashtag
2010-02-23 09:59:41,"Magic spells run off after midnight, I guess s...",fb
2010-02-23 11:28:27,"Limitas of public transportation! No taxi, rai...",yam
2010-02-23 17:47:11,"So, Feierabend. Jetzt #24 und später #VfB. — a...",24
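
For comparison, the same "one row per hashtag" layout can be obtained with pandas' built-in `DataFrame.explode`. This is an alternative technique, not the method used in this notebook (which goes through the external function `multiplyHashtagRows`); the toy dataframe below is made up.

```python
import pandas as pd

# Made-up tweets, with the hashtags already extracted as lists.
toy = pd.DataFrame({
    'text': ['Jetzt #24 und #VfB', 'Hello #fb'],
    'hashtag': [['24', 'vfb'], ['fb']],
})

# explode() repeats each row once per element of its list and keeps the other columns.
one_per_row = toy.explode('hashtag')
print(one_per_row)
```

Unlike our approach, which appends the extra rows at the end of the dataframe, `explode` keeps the duplicated rows next to each other.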


### 2.2 - Grouping per hashtag and merging per day

First, we group the tweets by hashtag.

In [17]:
pr('Grouping by hashtag.')
tw6['numberOfTweets'] = 1 ## We make a column number of tweets that will be useful later
gp = tw6.groupby('hashtag')
pr('Done')

20:05:35 Grouping by hashtag.
20:05:35 Done


Then, we merge all the tweets with the same hashtag that were posted on the same day.

For the column containing the tweet text, we join the texts with a special delimiter so that the individual tweets can be recovered later. For the longitude and latitude, we chose the median rather than the mean, because we noticed that some tweets in the data were not located in Switzerland at all, but very far from it. The median avoids a value strongly biased by such outliers.

In [18]:
delimiter = '_$$$_'
str_join = lambda x: delimiter.join(x)

def aggDate(df):
    '''
    Group the rows of a dataframe (indexed by datetime) by day and aggregate their content.
    '''
    # df.index.date gives the calendar day of each timestamp of the index
    temp = df.groupby(df.index.date)
    groupedDf = temp.agg({  'text' : str_join, ## Merged text
                            'longitude' : 'median', ## Median of the longitude
                            'latitude' : 'median', ## Median of the latitude
                            'hashtag' : lambda x: x.iloc[0], ## The name of the hashtag
                            'numberOfTweets' : 'count', ## Number of tweets during the day
                            'userId' : 'nunique'}) ## Number of unique users
    # After the aggregation, the userId column holds the number of unique users per day
    return groupedDf
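
The choice of the median over the mean for the coordinates can be illustrated with a few made-up longitudes (the values below are not taken from our dataset): four tweets around Zurich and one sent from far away.

```python
from statistics import mean, median

# Made-up longitudes: four tweets around Zurich and one sent from far away.
longitudes = [8.53, 8.54, 8.55, 8.52, -74.0]

print(round(mean(longitudes), 2))    # -7.97, pulled far to the west by the outlier
print(round(median(longitudes), 2))  # 8.53, stays in the Zurich area
```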

To better manipulate the data, we create a dictionary, with each key representing a hashtag. The dictionary value corresponding to the hashtag will be a dataframe that contains tweets grouped by day.<br>
The form will be: {**'hashtag'** : _dataframe containing all tweets of corresponding hashtag grouped per day_}

In [19]:
pr('Putting hashtags in dictionary... (around 50 min)')
dictionary = {}

# Printing variables
count = 0
lengp = len(gp)
printingValue = max(1, lengp // 10) # At least 1, to avoid a modulo by zero with fewer than 10 hashtags

for hashtag, df in gp:
    # Grouping per date
    dictionary[hashtag] = aggDate(df)
    
    # Printing information
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/lengp*100))
        
pr('Finished operations! Dictionary with {} different hashtags.'.format(len(dictionary)))

20:05:35 Putting hashtags in dictionary... (around 50 min)
20:05:40 10%
20:05:44 20%
20:05:49 30%
20:05:54 40%
20:05:58 50%
20:06:03 60%
20:06:08 70%
20:06:12 80%
20:06:17 90%
20:06:22 100%
20:06:22 Finished operations! Dictionary with 9410 different hashtags.


In [20]:
print('Example of the dictionary entry "{}":'.format(list(dictionary.keys())[5]))
dictionary[list(dictionary.keys())[5]].head()

Example of the dictionary entry "horror":


,hashtag,longitude,numberOfTweets,latitude,text,userId
2010-08-09,horror,9.35622,1,47.4195,El rey de la montaña - Unsichtbare Gefahr #fil...,1
2011-11-03,horror,8.53639,1,47.3662,Ich glaube GfK hat das erste Mal eine PP Präse...,1
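
The structure of this dictionary can be reproduced on a tiny, made-up dataframe. The sketch below only keeps two of the aggregated columns, and the names `toy` and `per_hashtag` are ours.

```python
import pandas as pd

# Made-up tweets (one row per hashtag), indexed by creation time.
toy = pd.DataFrame(
    {'hashtag': ['fb', 'fb', 'fb', 'yam'],
     'userId': [1, 2, 1, 3],
     'numberOfTweets': [1, 1, 1, 1]},
    index=pd.to_datetime(['2010-02-23 09:59', '2010-02-23 11:28',
                          '2010-02-24 08:00', '2010-02-23 17:47']))

per_hashtag = {}
for hashtag, df in toy.groupby('hashtag'):
    # One row per calendar day for this hashtag.
    per_hashtag[hashtag] = df.groupby(df.index.date).agg(
        {'numberOfTweets': 'count', 'userId': 'nunique'})

print(per_hashtag['fb'])
```

For `fb`, this gives two rows: 2 tweets from 2 different users on the first day, and 1 tweet from 1 user on the second day.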


## 3 - Event detection

### 3.1 - Parameters to define events

Parameters that define events:

In [21]:
## Parameters of an event:
MIN_TOT_NB_TWEETS = 20 ## The hashtag must appear at least this number of times over all tweets to be considered.
MIN_NB_DAYS_WITH_HASHTAGS = 3 ## The hashtag must appear on more than this number of different days to be considered.
MIN_NB_TWEETS_DURING_EVENT = 7 ## To be considered an event, the hashtag must appear at least this nb of times during the day.
THRESHOLD_ANOMALY_FACTOR = 2.5 ## The number of occurrences of a hashtag during a single day must reach the mean plus this
                               ## FACTOR multiplied by the std to be considered as an event.
MAX_DURATION_OF_EVENT = timedelta(days=30) ## The maximum duration we consider an event can last
MIN_DURATION_BEFORE_NEW_EVENT = timedelta(days=304) ## (= 10 months) The min time that should pass before an event can happen
                                                    ## again and still be considered as an event (e.g. Christmas is an event
                                                    ## each year)
MIN_NUMBER_DIFFERENT_USER = 2 # To state that an event occurred, a minimum number of different users should have tweeted about it

The helper function that detects recurrent events, which should be removed, is the external function _isSpecificEventListIllegal()_ used in section 3.2.

### 3.2 - Events detection

The following function is applied to each row (one day of a given hashtag) and returns true if that day should be considered as an event.

In [30]:
def isEvent(row, threshold):
    minNbTweet = max(threshold, MIN_NB_TWEETS_DURING_EVENT)
    return row.numberOfTweets >= minNbTweet and row.userId >= MIN_NUMBER_DIFFERENT_USER
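
As a worked example of the threshold passed to `isEvent` (the mean plus `THRESHOLD_ANOMALY_FACTOR` times the standard deviation, as described in the parameters), here is a minimal sketch on made-up daily counts. It leaves out the check on the number of different users, and it uses `statistics.stdev`, which is the sample standard deviation like the default of pandas' `Series.std()`.

```python
from statistics import mean, stdev

THRESHOLD_ANOMALY_FACTOR = 2.5   # same values as in the parameter cell
MIN_NB_TWEETS_DURING_EVENT = 7

# Made-up daily tweet counts for one hashtag: nine quiet days and one burst.
daily_counts = [1, 1, 2, 1, 1, 2, 1, 1, 1, 12]

# Sample standard deviation (ddof=1), as in pandas' Series.std().
threshold = mean(daily_counts) + THRESHOLD_ANOMALY_FACTOR * stdev(daily_counts)
min_nb_tweets = max(threshold, MIN_NB_TWEETS_DURING_EVENT)

event_days = [i for i, n in enumerate(daily_counts) if n >= min_nb_tweets]
print(round(threshold, 2))  # 10.88
print(event_days)           # [9]
```

Only the last day, with 12 tweets, exceeds the threshold of about 10.88.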

Main loop to detect events according to all the above parameters.<br>
The external function _isSpecificEventListIllegal()_ examines the time series of potential events for a given hashtag and detects recurrent events. If the events are too close to each other, they are categorized as recurrent events and removed.

In [31]:
# EXTERNAL FUNCTION: isSpecificEventListIllegal(detectedEventDateList, max_event_duration, min_duration_before_new_event)

pr('Starting to compute {} different hashtags to detect event. (4 min)'.format(len(dictionary)))

# Printing variables
nbOfEventDetected = 0
count = 0
printingValue = max(1, len(dictionary) // 10)

# Going through all items of dictionary
for h, df in dictionary.items():
    
    # Printing information
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/len(dictionary)*100))

    # By default no day is an event. Every dataframe gets an 'event' column, which later cells rely on.
    df['event'] = False
        
    # Making the different tests corresponding to the above parameters to detect events
    if len(df) > MIN_NB_DAYS_WITH_HASHTAGS and df['numberOfTweets'].sum() >= MIN_TOT_NB_TWEETS:
        
        threshold = df['numberOfTweets'].mean() + THRESHOLD_ANOMALY_FACTOR * df['numberOfTweets'].std()
        df['event'] = df.apply(isEvent, args=[threshold,], axis=1)

        # Remove recurrent events:
        detectedEventDf = df[df['event']]
        if len(detectedEventDf) > 2 and isSpecificEventListIllegal(detectedEventDf.index, MAX_DURATION_OF_EVENT, MIN_DURATION_BEFORE_NEW_EVENT):
            df['event'] = False
                
        # Printing counts updated
        nbOfEventDetected += len(df[df['event']])
            
pr('Done. {} events were detected.'.format(strNb(nbOfEventDetected)))

20:12:25 Starting to compute 9410 different hashtags to detect event. (4 min)
20:12:25 10%
20:12:25 20%
20:12:25 30%
20:12:25 40%
20:12:25 50%
20:12:25 60%
20:12:25 70%
20:12:25 80%
20:12:25 90%
20:12:25 100%
20:12:25 Done. 7 events were detected.
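
The recurrence check is implemented in _helpFunctions.py_ and is not shown here. The sketch below is therefore only a guess at the kind of rule it may apply, and `is_recurrent_sketch` is a hypothetical name: two consecutive detections are treated as suspicious when they are too far apart to belong to the same event, yet too close to be a new yearly occurrence.

```python
from datetime import date, timedelta

def is_recurrent_sketch(event_dates, max_event_duration, min_duration_before_new_event):
    '''
    Hypothetical rule (an assumption, not the code of isSpecificEventListIllegal):
    return True if two consecutive detections are separated by more than
    max_event_duration but by less than min_duration_before_new_event.
    '''
    ordered = sorted(event_dates)
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if max_event_duration < gap < min_duration_before_new_event:
            return True
    return False

# Made-up detections: one hashtag peaking every two months, one peaking every Christmas.
bimonthly = [date(2012, 1, 2), date(2012, 3, 5), date(2012, 5, 7)]
yearly = [date(2011, 12, 25), date(2012, 12, 25), date(2013, 12, 25)]

print(is_recurrent_sketch(bimonthly, timedelta(days=30), timedelta(days=304)))  # True
print(is_recurrent_sketch(yearly, timedelta(days=30), timedelta(days=304)))     # False
```

Under this assumed rule, the hashtag that peaks every two months would be discarded, while the yearly one would be kept.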


In [None]:
# pr('Starting to compute {} dict items to detect event. (4 min)'.format(len(dictionary)))
# nbOfEventDetected = 0
# count = 0
# printingValue = int(len(dictionary) / 10)
# for [h,df] in dictionary.items():
#     count += 1
#     if count % printingValue == 0:
#         pr("{:.0f}%".format(count/len(dictionary)*100))
#     df['event'] = False
#     if len(df) > MIN_NB_DAYS_WITH_HASHTAGS:
#         if df['numberOfTweets'].sum() > MIN_TOT_NB_TWEETS:
#             threshold = df['numberOfTweets'].mean() + THRESHOLD_ANOMALY_FACTOR * df['numberOfTweets'].std()
#             df['event'] = df.numberOfTweets.apply(lambda x: x >= threshold and x >= MIN_NB_TWEETS_DURING_EVENT)
            
#             ## Remove recurrent events:
#             detectedEventDf = df[df['event']]
#             if len(detectedEventDf) > 2 and isSpecificEventListIllegal(detectedEventDf.index):
#                 df['event'] = False
#             nbOfEventDetected += len(df[df['event']])
# pr('Finished! Number of events detected = {}'.format(nbOfEventDetected))

### 3.3 - Merging close events and grouping into a single event dataframe

The function below merges events that are too "close" to each other to be considered individually.

In [None]:
def mergeCloseEvents(rowsList):
    '''
    Take a list of dictionaries, where each dictionary is a "row" of the event df (a detected event).
    Detect the events that are close to each other and merge them together.
    Return: the processed list of events.
    '''

    def areCloseEvents(event1, event2):
        '''
        Return True if the dates of the 2 events are considered "close".
        '''
        return abs(event1['date'] - event2['date']) < MAX_DURATION_OF_EVENT

    def mergeCloseEventsSublist(closeEventList):
        '''
        Applied to each sublist of close events. Merge all its events into one unique event.
        The event consists of the total number of tweets, the concatenation of the tweet texts and the mean
        of longitude/latitude. A meanDate is defined as a weighted mean of all dates.
        The final date is the one in closeEventList that is closest to this mean date.
        We did this to keep the meaning of the date if it had some, and not have some meaningless "mean-date".
        '''
        latitude = 0
        longitude = 0
        numberOfTweets = 0
        text = ""
        originalDate = closeEventList[0]['date']
        dateDiff = timedelta(days=0)
        first = True
        for tweet in closeEventList:
            longitude += tweet['longitude']
            latitude += tweet['latitude']
            numberOfTweets += tweet['numberOfTweets']
            if first:
                text = tweet['text']
                first = False
            else:
                text += delimiter + tweet['text']
                dateDiff = dateDiff + (tweet['date'] - originalDate) * tweet['numberOfTweets']

        ## It is multiplied by 2 then subtracted to round correctly to the nearest day
        meanDate = originalDate + 2* dateDiff / numberOfTweets - dateDiff / numberOfTweets
        latitude = latitude / len(closeEventList)
        longitude = longitude / len(closeEventList)

        ## We are going to detect the event the closest to the mean date
        minSelectedDate = closeEventList[0]['date']
        minDistance = abs(closeEventList[0]['date'] - meanDate)
        for tweet in closeEventList:
            if abs(tweet['date'] - meanDate) < minDistance:
                minSelectedDate = tweet['date']
                minDistance = abs(tweet['date'] - meanDate) # Keep track of the smallest distance found so far

        return {'date': minSelectedDate, 'hashtag': closeEventList[0]['hashtag'], 'text': text,
                    'longitude': longitude, 'latitude':latitude, 'numberOfTweets': numberOfTweets, }

    ############ -----  MAIN METHOD  ----- ############

    ## If the list is big enough, go through the list, form an export list and merge the elements that need to be.
    if len(rowsList) < 2:
        return rowsList
    else:
        sortedRowsList = sorted(rowsList, key=itemgetter('date'))
        exportedEventList = []
        ## This goes through the *sorted* list and tracks the first and last indices of each run of events
        ## that should be merged.
        lastEventWasClose = False
        firstItem = -1
        for i in range(0, len(sortedRowsList)-1):
            if areCloseEvents(sortedRowsList[i], sortedRowsList[i+1]):
                if not lastEventWasClose: # So it is the first pair of a sublist of close events in the whole list
                    firstItem = i
                    lastEventWasClose = True
            else:
                if lastEventWasClose: # So the sublist has just ended.
                    exportedEventList.append(mergeCloseEventsSublist(sortedRowsList[firstItem:i+1]))
                    lastEventWasClose = False
                else: # The element is by itself, let's append it
                    exportedEventList.append(sortedRowsList[i])
        if lastEventWasClose: # If there were events to merge till the last elem of list
            exportedEventList.append(mergeCloseEventsSublist(sortedRowsList[firstItem:len(sortedRowsList)]))
        else:
            exportedEventList.append(sortedRowsList[len(sortedRowsList)-1])

    return exportedEventList
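
The core idea of the merge, namely cutting the sorted detections into runs in which two consecutive dates are less than `MAX_DURATION_OF_EVENT` apart, can be sketched in a few lines. The helper `group_close_dates` and the dates below are ours and only illustrate the grouping; the sketch does not merge the texts or the coordinates.

```python
from datetime import date, timedelta

def group_close_dates(dates, max_gap):
    '''Split the sorted dates into runs in which consecutive dates are less than max_gap apart.'''
    runs = []
    for d in sorted(dates):
        if runs and d - runs[-1][-1] < max_gap:
            runs[-1].append(d)   # close to the previous date: same run
        else:
            runs.append([d])     # far from the previous date: start a new run
    return runs

# Made-up detection days for one hashtag.
detections = [date(2012, 6, 1), date(2012, 6, 3), date(2012, 6, 20), date(2013, 1, 10)]
runs = group_close_dates(detections, timedelta(days=30))
for run in runs:
    print(run[0], '->', run[-1], '({} detections)'.format(len(run)))
```

Here the three detections of June 2012 form one run, which would be merged into a single event, and the detection of January 2013 stays on its own.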

This function will be applied to each row of each dataframe. If a row is detected as an event, it is added to localRowsList, which is used to build a general dataframe of all the events.

In [None]:
localRowsList = []
def applyToMakeEventDf(row):
    if row.event:
        rowToAdd = {'date': row.name, 'hashtag': row.hashtag, 'text': row.text,
                    'longitude': row.longitude, 'latitude':row.latitude, 'numberOfTweets': row.numberOfTweets, }
        global localRowsList
        localRowsList.append(rowToAdd)

In [None]:
eventRowsList = []
localRowsList = []
count = 0
printingValue = max(1, len(dictionary) // 10)

pr('Starting to make event df with {} dataframes. (around 6 min)'.format(len(dictionary)))
for h, df in dictionary.items():
    # Reset the list filled by applyToMakeEventDf for this hashtag
    localRowsList = []
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/len(dictionary)*100))
        
    df.apply(applyToMakeEventDf, axis=1)
    mergedList = mergeCloseEvents(localRowsList) # merging close events
    eventRowsList += mergedList

pr('Making new dataframe.')
new_events = pd.DataFrame(eventRowsList)
new_events.set_index(['date'], inplace=True)
pr('Finished! Dataframe with {} rows'.format(len(new_events)))

In [None]:
print('Events dataframe:')
new_events.head(50)

In [None]:
print('Linked dataframe of all days:')
dictionary[new_events.iloc[0].hashtag].head(10)

## 4 - Exporting data

As we worked with another team, we needed a way to communicate our detections to them. We used a JSON file containing all the information.

In [None]:
total_number_of_events = len(new_events)
print('There are {} events.'.format(total_number_of_events))

In [None]:
e_df = new_events.copy()
e_df['date'] = e_df.index
e_df.index = range(len(e_df))
e_df.head(1)

We now convert the dates to the formats needed for the JSON (milliseconds since the Unix epoch):

In [None]:
epoch_dt = datetime(1970, 1, 1)
def to_utc(date):
    # Milliseconds between the Unix epoch and midnight of the given date
    d_dt = datetime.combine(date, datetime.min.time())
    return int((d_dt - epoch_dt).total_seconds()*1000)

In [None]:
def convert_to_unix_time(record):
    # Milliseconds between the Unix epoch and the first day of the month of the record
    datetime_index = pd.DatetimeIndex([datetime(record['year'], record['month'], 1)])
    unix_time_index = datetime_index.astype(np.int64) // 10**6
    return unix_time_index[0]

In [None]:
pr('Converting dates...')
e_df['year'] = e_df['date'].apply(lambda x: x.year)
e_df['month'] = e_df['date'].apply(lambda x: x.month)
e_df['utc_date'] = e_df['date'].apply(lambda x: to_utc(x))
e_df['unix_time'] = e_df.apply(convert_to_unix_time, axis=1)
pr('Done.')
e_df.head(1)
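
As a quick sanity check of the date format, converting a day into milliseconds since the Unix epoch can be sketched with the standard library alone. The helper `date_to_epoch_ms` is ours and assumes that days are interpreted as midnight UTC.

```python
from datetime import date, datetime, timezone

def date_to_epoch_ms(day):
    '''Milliseconds between 1970-01-01 00:00 UTC and midnight UTC of the given day.'''
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return int(midnight.timestamp() * 1000)

print(date_to_epoch_ms(date(1970, 1, 2)))   # 86400000, one day after the epoch
print(date_to_epoch_ms(date(2010, 2, 23)))  # 1266883200000
```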

The generation of a JSON is easier from a dictionary than from a dataframe. Also, the other team we worked with asked us to group events by months.

In [None]:
# Grouping by months
e_gb_month = e_df.groupby(e_df.unix_time)

In [None]:
# Generation of the dictionary for the final JSON
pr('Making event list...')
months = []
for month, df in e_gb_month:
    days = []
    for i in range (len(df)):
        ht = df.iloc[i]['hashtag']
        lat = df.iloc[i]['latitude']
        lon = df.iloc[i]['longitude']
        t_num = df.iloc[i]['numberOfTweets']
        tweets = df.iloc[i]['text'].split(delimiter)
        date = df.iloc[i]['utc_date']
        
        data_unit = { 'name': ht
                    , 'latitude' : lat
                    , 'longitude' : lon
                    , 'tweets' : tweets
                    , 'number_of_tweets' : str(t_num)
                    , 'date' : int(date)}
        days.append(data_unit)
    
    curr_month = {'date': int(month), 'data' : days}
    months.append(curr_month)

final_events = {'events' : months}
pr('Done.')

Creation of the final JSON

In [None]:
exportFilename = 'export_twitter_events_' + datetime.now().strftime("%Y-%m-%d_%Hh%Mmin%S") + \
'_' + str(total_number_of_events)+ '_events.json'
exportPath =  os.path.join('data', exportFilename)

pr('Exporting to json...')
with open(exportPath, 'w') as f:
    json.dump(final_events, f)
pr('Export done. File "{}" has been created.'.format(exportFilename))
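
For reference, the exported structure looks like the following sketch. The keys are the ones built in the export cells above, while the event itself (hashtag, coordinates, tweet texts, dates) is made up.

```python
import json

# Made-up event, following the keys built in the export cells.
final_events_example = {
    'events': [
        {'date': 1264982400000,          # first day of the month (2010-02-01), in ms since the epoch
         'data': [
             {'name': 'fb',
              'latitude': 46.175,
              'longitude': 6.1387,
              'tweets': ['first tweet text', 'second tweet text'],
              'number_of_tweets': '2',   # stored as a string, as in the export
              'date': 1266883200000}]}]}  # day of the event (2010-02-23), in ms since the epoch

# Round trip through JSON to check that the structure is serializable.
encoded = json.dumps(final_events_example)
decoded = json.loads(encoded)
print(decoded['events'][0]['data'][0]['name'])  # fb
```

Each entry of `events` is one month, and its `data` list holds the events detected during that month.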