# Event detection in Twitter dataset
The goal of this project is to detect past events in Switzerland using a dataset of tweets. The dataset contains 28 million tweets, most of them sent from Switzerland. This notebook walks you through our project and explains our methodology. It is split into the following parts:

1. **Data cleaning and hashtags extraction**<br>
We split the tweets
2. **Event detection**<br>

3. **Improving quality of detected events**<br>

4. **Exporting data**<br>


#### Importing useful libraries:

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn import metrics
import matplotlib.pyplot as plt
from functools import reduce
import math
from collections import Counter
import csv
from datetime import datetime
import re
import copy
import json
from datetime import timedelta
from operator import itemgetter
from helpFunctions import *
%matplotlib inline

Helper functions for checkpoints (a printed message prefixed with the current time):

In [2]:
def now():
    return str(datetime.now().time())[:8]
def pr(strToPrint):
    print(now() + ' '+ strToPrint)
    
def strNb(nb):
    return '{0:,}'.format(nb).replace(',', '.')

## 0 - Importing data

Importing a sample of the dataset:

In [3]:
pickle_filename = os.path.join('data','head_100k_pickle.pkl')
tw = pd.read_pickle(pickle_filename)

Importing the whole dataset:

In [4]:
columns_header = ['id', 'userId', 'createdAt', 'text', 'longitude', 'latitude', 'placeId',
                  'inReplyTo', 'source', 'truncated', 'placeLatitude', 'placeLongitude', 'sourceName', 'sourceUrl',
                 'userName', 'screenName', 'followersCount', 'friendsCount', 'statusesCount',
                 'userLocation']

filename = os.path.join('data', 'sample.tsv')#'twex.tsv')
pr('Starting to read file... (3 min)')
# tw = pd.read_csv(filename, sep='\t', encoding='utf-8', escapechar='\\', names=columns_header,
#                       quoting=csv.QUOTE_NONE, na_values='N', header=None)

pr('File is loaded!')
# Audio(url=sound_file, autoplay=True)

19:44:10 Starting to read file... (3 min)
19:44:10 File is loaded!


In [5]:
print('The dataset contains {} tweets.'.format(strNb(len(tw))))

The dataset contains 1.000.000 tweets.


In [6]:
print('First rows of dataset:')
tw.head(2)

First rows of dataset:


Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,9514097914.0,17341045.0,2010-02-23 05:55:51,Guuuuten Morgen! :-),7.43926,46.9489,,,197.0,,,,,,,,,,,
1,,,,TwitBird,http://www.nibirutech.com,Tilman Jentzsch,blickwechsel,586.0,508.0,9016.0,"Bern, Switzerland",,,,,,,,,


## 1 - Data cleaning and extracting hashtags

In [7]:
# def extract_hashtags(text):
#     ht_list = re.findall(r"#(\w+)", text)
#     non_empty_hts = list(filter((lambda ht: ht != []), ht_list))
#     lowerCharList = [ht.lower() for ht in non_empty_hts]
#     return lowerCharList

We are going to extract all the hashtags of the "text" cell in each tweet and put them in a new column (in the form of a list of hashtags per tweet).
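The `extract_hashtags` helper is imported from `helpFunctions`, which is not shown in this notebook. A minimal sketch consistent with the commented-out version above might look like this (the sample text is adapted from a tweet that appears later in the notebook):

```python
import re

def extract_hashtags(text):
    """Return the lowercased hashtags found in `text`, without the leading '#'."""
    # \w+ matches letters, digits and underscores, so '#VfB.' yields 'vfb'.
    return [ht.lower() for ht in re.findall(r"#(\w+)", text)]

sample = "So, Feierabend. Jetzt #24 und später #VfB."
print(extract_hashtags(sample))
```

Lowercasing at extraction time means that `#VfB` and `#vfb` are later counted as the same hashtag.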

In [8]:
pr('Extracting hashtags... (2 min)')
tw['hashtag'] = tw.text.apply(lambda x: extract_hashtags(str(x))) # Getting the list of hashtags into a new column
twh = tw.loc[tw.hashtag.apply(lambda x: len(x) != 0)] # dropping the rows (tweets) that contain no hashtags.
pr('We have found {} rows with hashtags.'.format(strNb(len(twh))))

19:44:10 Extracting hashtags... (2 min)
19:44:18 We have found 79.200 rows with hashtags.


In [9]:
print('Examples of tweets (with only text and hashtag column):')
twh[['text', 'hashtag']].head(2)

Examples of tweets (with only text and hashtag column):


Unnamed: 0,text,hashtag
16,"Magic spells run off after midnight, I guess s...",[fb]
20,"Limitas of public transportation! No taxi, rai...",[yam]


## Cleaning data and making date index

We drop tweets that have no value for text or createdAt, as both are mandatory inputs for our model.

In [10]:
tw1 = twh.dropna(axis=0, how='any', subset=['text', 'createdAt'])
print('The data have been reduced from {} tweets to {} tweets.'.format(len(twh), len(tw1)))

The data have been reduced from 79200 tweets to 79200 tweets.


In [11]:
pr('Removing bad dates...')
twhCleanDate = tw1[tw1['createdAt'].str.len() == 19]
pr('Finished.')

19:44:18 Removing bad dates...
19:44:18 Finished.


In [12]:
pr('Starting to examine dates...')
import warnings
warnings.filterwarnings('ignore')
datetime_serie = pd.to_datetime(twhCleanDate['createdAt'], errors='coerce')  # unparseable dates become NaT
dateNotConvertible = datetime_serie[pd.isnull(datetime_serie)]
warnings.filterwarnings('default')
pr('There are {} dates that cannot be transformed.'.format(len(dateNotConvertible)))

19:44:18 Starting to examine dates...
19:44:18 There are 0 dates that cannot be transformed.
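The two date checks above (a length filter followed by a parse attempt) can be illustrated on a toy series. This is only a sketch: the three sample strings and the explicit `format` argument are assumptions made for the illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy input: one valid timestamp, one misplaced field, one impossible date.
raw = pd.Series(['2010-02-23 05:55:51', 'TwitBird', '2010-13-45 00:00:00'])

# Step 1: keep only strings with the expected 'YYYY-MM-DD HH:MM:SS' length.
well_sized = raw[raw.str.len() == 19]

# Step 2: try to parse; with errors='coerce', invalid values become NaT.
parsed = pd.to_datetime(well_sized, format='%Y-%m-%d %H:%M:%S', errors='coerce')
n_bad = int(parsed.isna().sum())
print('{} of {} remaining dates cannot be parsed.'.format(n_bad, len(well_sized)))
```

The length filter is cheap and removes rows where another field ended up in `createdAt`, while the parse step catches strings that have the right shape but are not real dates.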


In [13]:
pr('Starting copy...') # (to avoid transformation problems)
tw5 = twhCleanDate.copy()
pr('Converting to datetime...')
tw5['createdAt'] = pd.to_datetime(twhCleanDate['createdAt'])
pr('Setting up new indices...')
tw5.index = tw5['createdAt']
pr('Deleting old "createdAt" column...')
del tw5['createdAt']
pr('Done!')
tw5.head(2)

19:44:18 Starting copy...
19:44:18 Converting to datetime...
19:44:18 Setting up new indices...
19:44:18 Deleting old "createdAt" column...
19:44:18 Done!


Unnamed: 0_level_0,id,userId,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation,hashtag
createdAt,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
2010-02-23 09:59:41,9519737890,14657884.0,"Magic spells run off after midnight, I guess s...",6.1387,46.175,,,1.0,,,,,,,,,,,,[fb]
2010-02-23 11:28:27,9521789689,9962022.0,"Limitas of public transportation! No taxi, rai...",6.33641,46.4631,,,550.0,,,,,,,,,,,,[yam]


In [14]:
tw5['hashtag'][:6]

createdAt
2010-02-23 09:59:41               [fb]
2010-02-23 11:28:27              [yam]
2010-02-23 17:47:11          [24, vfb]
2010-02-23 18:19:03    [iphoneography]
2010-02-23 18:31:46     [partnermonth]
2010-02-24 06:09:23      [insider, fb]
Name: hashtag, dtype: object

## Let's put one hashtag per row

We will build a dataframe in which each row holds exactly one hashtag. To do so, we go through the dataframe and, in parallel, build a list of extra rows (one hashtag per row) that need to be added to the original dataframe.

In [15]:
addedHashtagsRowsList = []
def multiplyHashtagRows(row, columns):
    '''
    Examine each row. If there are multiple hashtags, it will return the first one.
    (so the first one will replace the list of hashtags in the df). Then for all the next ones,
    it will make a copy of the row in the addedHashtagsRowsList, (in a dictionary format).
    So this dictionary can in the end be transformed in a DF and added to the original DF.
    (The speed is increased a lot by doing it this way!)
    '''
    htList = row.hashtag
    if len(htList) > 1:
        ## Making the dictionary
        addedHashtag = {}
        addedHashtag['createdAt'] = row.name #the df index
        for col in columns:
            addedHashtag[col] = row[col]
        ## Copying the dict for each hashtag
        i = 1
        while i < len(htList) :
            deepCopy = copy.deepcopy(addedHashtag)
            deepCopy['hashtag'] = htList[i]
            addedHashtagsRowsList.append(deepCopy)
            i+=1
    return htList[0] # return the first hashtag

In [16]:
addedHashtagsRowsList = []
tw5_1 = tw5.copy()
pr('Multiplying the hashtag rows... (around 10 min)')
tw5_1['hashtag'] = tw5.apply(multiplyHashtagRows, args=[tw5.columns,], axis=1)
pr('Finished! {} rows will be added to the dataframe!'.format(len(addedHashtagsRowsList)))

19:44:19 Multiplying the hashtag rows... (around 10 min)
19:44:32 Finished! 40780 rows will be added to the dataframe!


In [17]:
pr('Starting to make the new dataframe with additionnal rows..')
addedHashtagsDf = pd.DataFrame(addedHashtagsRowsList)
addedHashtagsDf.set_index(['createdAt'], inplace=True)
pr('Starting to append the two df... Old df size = {}'.format(len(tw5_1)))
tw6 = pd.concat([tw5_1, addedHashtagsDf])  # DataFrame.append was removed in pandas 2.0
pr('Done! New df size = {}'.format(len(tw6)))

19:44:32 Starting to make the new dataframe with additionnal rows..
19:44:32 Starting to append the two df... Old df size = 79162
19:44:32 Done! New df size = 119942
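On recent pandas versions (0.25 and later), `DataFrame.explode` offers a more compact way to get one hashtag per row. This is not what the notebook used; the sketch below only shows the idea on a toy dataframe whose contents are made up for the illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'text': ['a #24 #vfb', 'b #fb'], 'hashtag': [['24', 'vfb'], ['fb']]},
    index=pd.to_datetime(['2010-02-23 17:47:11', '2010-02-23 09:59:41']),
)

# Each list element gets its own row; the other columns and the index are repeated.
one_per_row = toy.explode('hashtag')
print(one_per_row)
```

Unlike the approach above, `explode` keeps the rows of a tweet next to each other instead of appending the extra rows at the end.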


In [18]:
print('Example hashtag:')
tw6[tw6['hashtag'] == addedHashtagsRowsList[0]['hashtag']].head(3)

Example hashtag:


Unnamed: 0_level_0,followersCount,friendsCount,hashtag,id,inReplyTo,latitude,longitude,placeId,placeLatitude,placeLongitude,screenName,source,sourceName,sourceUrl,statusesCount,text,truncated,userId,userLocation,userName
createdAt,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
2011-07-24 16:03:13,,,vfb,95162125584039936,9.501397e+16,,,e401fb8eb4e7595a,,,,14.0,,,,"@mikstweed Wie lange ist es her, dass der #vfb...",,921241.0,,
2010-02-23 17:47:11,,,vfb,9535390586,,47.6463,9.1657,,,,,550.0,,,,"So, Feierabend. Jetzt #24 und später #VfB. — a...",,921241.0,,
2010-12-01 17:25:56,,,vfb,10021795398684672,,46.9499,7.47071,e38a1a641d02f8db,,,,1.0,,,,"So, und jetzt Flachmann suchen und dann ab ins...",,120433.0,,


In [19]:
tw6.hashtag.head()

createdAt
2010-02-23 09:59:41               fb
2010-02-23 11:28:27              yam
2010-02-23 17:47:11               24
2010-02-23 18:19:03    iphoneography
2010-02-23 18:31:46     partnermonth
Name: hashtag, dtype: object

## Grouping per hashtags per day

To let the other team build their visualisation, every tweet needed location information. We therefore did not process tweets without a longitude or latitude.

In [20]:
tw6.dropna(subset=['longitude'], inplace=True)
tw6.dropna(subset=['latitude'], inplace=True)

In [21]:
tw6.latitude = tw6.latitude.apply(float)
tw6.longitude = tw6.longitude.apply(float)

Merging the information of all tweets containing a particular hashtag turned out to be really useful for detecting an event. For columns of string type, we join the values with a delimiter, as below.

In [22]:
delimiter = '_$$$_'
str_join = lambda x: delimiter.join(x)

The following function takes a dataframe, groups its rows by day and aggregates the content of each day:

In [23]:
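def aggDate(df):
    '''
    Group the rows of a single-hashtag dataframe by calendar day and aggregate each column.
    '''
    # df.index.date yields one datetime.date per row, so rows are grouped by day.
    temp = df.groupby(df.index.date)
    groupedDf = temp.agg({  'text' : str_join,
                            'longitude' : np.median,
                            'latitude' : np.median,
                            'hashtag' : lambda x: x.iloc[0], ## the first occurrence
                            'numberOfTweets' : 'count',
                            'userId' : pd.Series.nunique})
    # Note: after aggregation, the userId column holds the number of distinct users per day.
    return groupedDf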
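To illustrate what this per-day aggregation produces, here is a small self-contained sketch. The toy tweets are invented, and the named-aggregation syntax (pandas 0.25 and later) as well as the `distinctUsers` column name are choices made for this example rather than the notebook's own code.

```python
import pandas as pd

delimiter = '_$$$_'
toy = pd.DataFrame(
    {
        'text': ['goal!', 'what a match', 'next day'],
        'longitude': [8.5, 8.7, 6.1],
        'latitude': [47.3, 47.5, 46.2],
        'userId': [1, 2, 1],
        'numberOfTweets': [1, 1, 1],
    },
    index=pd.to_datetime(['2010-06-16 18:00:00', '2010-06-16 20:30:00',
                          '2010-06-17 09:00:00']),
)

# Group by calendar day: texts are joined, coordinates take the median,
# tweets are counted and distinct users are counted.
daily = toy.groupby(toy.index.date).agg(
    text=('text', lambda s: delimiter.join(s)),
    longitude=('longitude', 'median'),
    latitude=('latitude', 'median'),
    numberOfTweets=('numberOfTweets', 'count'),
    distinctUsers=('userId', 'nunique'),
)
print(daily)
```

The first day collapses two tweets from two users into a single row, which is the shape the event detection step works on.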

In [24]:
pr('Making column number of tweets')
tw6['numberOfTweets'] = 1
pr('Starting group by hastag...')
gp = tw6.groupby('hashtag')
pr('Starting to put hashtag in dictionary... (around 50 min)')

count = 0
lengp = len(gp)
printingValue = max(1, lengp // 10)  # at least 1, to avoid a modulo by zero on small samples
dictionary = {}
for hashtag, df in gp:
    dictionary[hashtag] = aggDate(df)
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/lengp*100))
pr('Finished operations! Dictionary with {} different hashtags.'.format(len(dictionary)))

19:44:32 Making column number of tweets
19:44:32 Starting group by hastag...
19:44:32 Starting to put hashtag in dictionary... (around 50 min)
19:44:44 10%
19:44:55 20%
19:45:07 30%
19:45:18 40%
19:45:29 50%
19:45:40 60%
19:45:50 70%
19:46:01 80%
19:46:12 90%
19:46:23 100%
19:46:23 Finished operations! Dictionary with 27040 different hashtags.


In [25]:
# Creation of a dictionary {'hashtag' : 'dataframe containing all tweets with the hashtag'}
print('Dictionary with hashtags dataframes:')
dictionary[list(dictionary.keys())[5]].head(4)

Dictionary with hashtags dataframes:


Unnamed: 0,numberOfTweets,longitude,text,userId,latitude,hashtag
2011-03-29,1,7.77983,gibts eigentlich auch richtige #news auf @JOIZ...,1.0,47.2102,news
2012-01-27,1,6.64852,#le12h30 vu de la #régie. #rsr #rts #radio #jo...,1.0,46.5335,news
2012-04-24,1,9.07426,Si comincia! #percompiereilmiodovere #como #ne...,1.0,45.8138,news
2012-05-09,1,7.35289,"""@Gazzetta_it: ""Italiani? Sempre più in sovrap...",1.0,47.7567,news


## Event detection (with elimination of recurrent events)

Parameters that define events:

In [26]:
## Parameters of an event:
MIN_TOT_NB_TWEETS = 20 ## The hashtag must occur more than this number of times in total to be considered.
MIN_NB_DAYS_WITH_HASHTAGS = 3 ## The hashtag must appear on more than this number of different days to be considered.
MIN_NB_TWEETS_DURING_EVENT = 7 ## To be considered an event, the hashtag must occur more than this number of times during the day.
THRESHOLD_ANOMALY_FACTOR = 2.5 ## The occurrence count of a hashtag during a single day must exceed the mean by this FACTOR
                             ## multiplied by the std to be considered an anomaly.
MAX_DURATION_OF_EVENT = timedelta(days=30) ## The maximum duration we allow for a single event
MIN_DURATION_BEFORE_NEW_EVENT = timedelta(days=304) ## (= 10 months) The minimum time that must pass before an event can happen
                                                    ## again and still be considered an event (e.g. Christmas is an event
                                                    ## each year)
MIN_NUMBER_DIFFERENT_USER = 2 # To state that an event occurred, at least this number of different users must have tweeted about it
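The anomaly rule behind `THRESHOLD_ANOMALY_FACTOR` can be tried out on a toy series of daily counts. The counts below are made up for the illustration; note that pandas' `std()` computes the sample standard deviation (`ddof=1`).

```python
import pandas as pd

THRESHOLD_ANOMALY_FACTOR = 2.5
MIN_NB_TWEETS_DURING_EVENT = 7

# Hypothetical number of tweets per day for one hashtag.
daily_counts = pd.Series([1, 2, 1, 2, 1, 3, 2, 1, 2, 40])

# A day is anomalous when its count exceeds mean + factor * std ...
threshold = daily_counts.mean() + THRESHOLD_ANOMALY_FACTOR * daily_counts.std()

# ... and it only counts as an event if it also exceeds the absolute minimum.
is_event = (daily_counts > threshold) & (daily_counts > MIN_NB_TWEETS_DURING_EVENT)
print('threshold = {:.2f}'.format(threshold))
print(is_event)
```

The absolute minimum matters for rare hashtags: with very low daily counts, a day with only a handful of tweets could otherwise exceed the statistical threshold.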

Helper functions to detect recurrent events that should be removed:

In [27]:
# def isSpecificEventListIllegal(detectedEventDateList):
#     '''
#     Return true if the list of dates contains illegal tuples of events, i.e. if the event is recurrent,
#     which would mean it is not a real event.
#     '''
#     def datesAreIllegal(date1, date2, date3):
#         '''
#         Return true if the 3 dates are not to be considered as regular events.
#         '''
#         ## Return if the difference is too small to be considered as 2 different events
#         def diffIsSmall(timeDiff):  
#             return timeDiff < MAX_DURATION_OF_EVENT

#         ## Return true if the difference is not big enough to be an annual event.
#         def isDiffSuspect(timeDiff):
#             return timeDiff < MIN_DURATION_BEFORE_NEW_EVENT   

#         diff1 = abs(date1 - date2)
#         diff2 = abs(date2 - date3)
#         diff3 = abs(date3 - date1)

#         ## The difference is too small, it must be the same event
#         if diffIsSmall(diff1) or diffIsSmall(diff2) or diffIsSmall(diff3):
#             return False

#         ## If there are at least 2 out of 3 suspect difference, then the dates are illegal
#         if isDiffSuspect(diff1):
#             return isDiffSuspect(diff2) or isDiffSuspect(diff3)
#         else:
#             return isDiffSuspect(diff2) and isDiffSuspect(diff3)
    
#     ## MAIN FUNCTION : ##
#     # Go through the list of events and try all "triples" to see if any of them is illegal. This is a quick
#     # implementation. Its complexity is O(k^3), with k being the size of the list. We apply this function
#     # to n lists, so the overall complexity is O(n*k^3). However, each list is expected to be small, so k can be
#     # treated as a constant and the overall complexity is O(n).
#     for i in range(len(detectedEventDateList) - 2):
#         for j in range(i + 1, len(detectedEventDateList) - 1):
#             for k in range(j + 1, len(detectedEventDateList)):
#                 if datesAreIllegal(detectedEventDateList[i], detectedEventDateList[j], detectedEventDateList[k]):
#                     return True
#     return False
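The recurrence rule can be restated compactly with `itertools.combinations`. The function below is a re-implementation sketch of the commented-out helper, not the code from `helpFunctions`; the name `is_recurrent` and the sample dates are assumptions made for this example.

```python
from datetime import date, timedelta
from itertools import combinations

MAX_DURATION_OF_EVENT = timedelta(days=30)
MIN_DURATION_BEFORE_NEW_EVENT = timedelta(days=304)

def is_recurrent(dates):
    """Return True if some triple of dates looks like a recurring (non-annual) pattern."""
    for d1, d2, d3 in combinations(sorted(dates), 3):
        diffs = [d2 - d1, d3 - d2, d3 - d1]
        # Dates closer than one event duration belong to the same event: skip the triple.
        if any(diff < MAX_DURATION_OF_EVENT for diff in diffs):
            continue
        # Two or more gaps shorter than ~10 months: the hashtag recurs too often.
        if sum(diff < MIN_DURATION_BEFORE_NEW_EVENT for diff in diffs) >= 2:
            return True
    return False

yearly = [date(2010, 12, 25), date(2011, 12, 25), date(2012, 12, 25)]
bimonthly = [date(2011, 1, 1), date(2011, 3, 1), date(2011, 5, 1)]
print(is_recurrent(yearly), is_recurrent(bimonthly))
```

Annual events such as Christmas pass the check, while a hashtag that peaks every couple of months is treated as recurrent and discarded.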

Main loop that detects events according to all our criteria:

In [28]:
pr('Starting to compute {} dict items to detect event. (4 min)'.format(len(dictionary)))
nbOfEventDetected = 0
count = 0
printingValue = max(1, len(dictionary) // 10)  # at least 1, to avoid a modulo by zero on small samples
for [h,df] in dictionary.items():
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/len(dictionary)*100))
    df['event'] = False
    if len(df) > MIN_NB_DAYS_WITH_HASHTAGS:
        if df['numberOfTweets'].sum() > MIN_TOT_NB_TWEETS:
            # Note: this only checks the number of distinct users on the first day the hashtag appears.
            if df['userId'].iloc[0] >= MIN_NUMBER_DIFFERENT_USER:
                threshold = df['numberOfTweets'].mean() + THRESHOLD_ANOMALY_FACTOR * df['numberOfTweets'].std()
                df['event'] = df.numberOfTweets.apply(lambda x: x > threshold and x > MIN_NB_TWEETS_DURING_EVENT)
            
            ## Remove recurrent events:
            detectedEventDf = df[df['event']]
            if len(detectedEventDf) > 2 and isSpecificEventListIllegal(detectedEventDf.index):
                df['event'] = False
            nbOfEventDetected += len(df[df['event']])
pr('Finished! Number of events detected = {}'.format(nbOfEventDetected))

19:46:23 Starting to compute 27040 dict items to detect event. (4 min)
19:46:24 10%
19:46:24 20%
19:46:25 30%
19:46:26 40%
19:46:27 50%
19:46:28 60%
19:46:28 70%
19:46:29 80%
19:46:30 90%
19:46:31 100%
19:46:31 Finished! Number of events detected = 8


In [29]:
# pr('Starting to compute {} dict items to detect event. (4 min)'.format(len(dictionary)))
# nbOfEventDetected = 0
# count = 0
# printingValue = int(len(dictionary) / 10)
# for [h,df] in dictionary.items():
#     count += 1
#     if count % printingValue == 0:
#         pr("{:.0f}%".format(count/len(dictionary)*100))
#     df['event'] = False
#     if len(df) > MIN_NB_DAYS_WITH_HASHTAGS:
#         if df['numberOfTweets'].sum() > MIN_TOT_NB_TWEETS:
#             threshold = df['numberOfTweets'].mean() + THRESHOLD_ANOMALY_FACTOR * df['numberOfTweets'].std()
#             df['event'] = df.numberOfTweets.apply(lambda x: x >= threshold and x >= MIN_NB_TWEETS_DURING_EVENT)
            
#             ## Remove recurrent events:
#             detectedEventDf = df[df['event']]
#             if len(detectedEventDf) > 2 and isSpecificEventListIllegal(detectedEventDf.index):
#                 df['event'] = False
#             nbOfEventDetected += len(df[df['event']])
# pr('Finished! Number of events detected = {}'.format(nbOfEventDetected))

## Merging close events and grouping into single event dataframe

The following function merges events that are too "close" to each other in time to be treated as separate events.

In [30]:
# def mergeCloseEvents(rowsList):
#     '''
#     Take a list of dictionary, where each dictionary is a "row" of the event df, which contained detected events.
#     It will process the list to detect event that are close and merge them together.
#     Return : the processed list of event.
#     '''
    
#     def areCloseEvents(event1, event2):
#         '''
#         Return true if the 2 event dates are defined as "close"
#         '''
#         return abs(event1['date'] - event2['date']) < MAX_DURATION_OF_EVENT
        
#     def mergeCloseEventsSublist(closeEventList):
#         '''
#         This will be applied to each sublist of close events. It merges all events into one unique event.
#         The merged event holds the total number of tweets, the concatenation of the tweet texts and the mean
#         of longitude/latitude. A meanDate is computed as the mean of all dates, weighted by the number of tweets.
#         The final date is the one in closeEventList that is closest to this mean date.
#         We did this to keep the meaning of the date if it had one, and to avoid a meaningless "mean date".
#         '''
#         latitude = 0
#         longitude = 0
#         numberOfTweets = 0
#         text = ""
#         originalDate = closeEventList[0]['date']
#         dateDiff = timedelta(days=0)
#         first = True        
#         for tweet in closeEventList:
#             longitude += tweet['longitude']
#             latitude += tweet['latitude'] 
#             numberOfTweets += tweet['numberOfTweets']
#             if first:
#                 text = tweet['text']
#                 first = False
#             else:
#                 text += delimiter + tweet['text']
#                 dateDiff = dateDiff + (tweet['date'] - originalDate) * tweet['numberOfTweets']

#         ## Weighted mean date. Note that 2 * x - x reduces to x, so this is equivalent to
#         ## originalDate + dateDiff / numberOfTweets (no rounding to the day takes place).
#         meanDate = originalDate + 2* dateDiff / numberOfTweets - dateDiff / numberOfTweets        
#         latitude = latitude / len(closeEventList)
#         longitude = longitude / len(closeEventList)
        
#         ## We are going to detect the event the closest to the mean date
#         minSelectedDate = closeEventList[0]['date']
#         minDistance = abs(closeEventList[0]['date'] - meanDate)
#         for tweet in closeEventList:
#             if abs(tweet['date'] - meanDate) < minDistance:
#                 minSelectedDate = tweet['date']
#                 minDistance = abs(tweet['date'] - meanDate)  # keep track of the best distance so far
        
#         return {'date': minSelectedDate, 'hashtag': closeEventList[0]['hashtag'], 'text': text,
#                     'longitude': longitude, 'latitude':latitude, 'numberOfTweets': numberOfTweets, }
    
#     ############ -----  MAIN METHOD  ----- ############
    
#     ## If the list is big enough, go through the list and form an export list and merge elements that needs to.
#     if len(rowsList) < 2:
#         return rowsList
#     else:
#         firstLastPosOfItemsToMerge = []
#         sortedRowsList = sorted(rowsList, key=itemgetter('date')) 
#         exportedEventList = []
#         ## This goes through the *sorted* list and add the pair of indices (first indice and last indice) where events 
#         ## that should be merged appear.
#         lastEventWasClose = False
#         firstItem = -1
#         for i in range(0, len(sortedRowsList)-1):
#             if areCloseEvents(sortedRowsList[i], sortedRowsList[i+1]):
#                 if not lastEventWasClose: # So it is the first pairs of the sublist of close events in the whole list
#                     firstItem = i
#                     lastEventWasClose = True
#             else:
#                 if lastEventWasClose: # So the list has just ended.
#                     exportedEventList.append(mergeCloseEventsSublist(sortedRowsList[firstItem:i+1]))
#                     lastEventWasClose = False
#                 else: # The element is by itself, let's append it
#                     exportedEventList.append(sortedRowsList[i])  
#         if lastEventWasClose: # If there were events to merge till the last elem of list
#             exportedEventList.append(mergeCloseEventsSublist(sortedRowsList[firstItem:len(sortedRowsList)]))
#         else:
#             exportedEventList.append(sortedRowsList[len(sortedRowsList)-1])
    
#     return exportedEventList
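The core of the merging step is to split a sorted list of event dates into runs whose consecutive gaps are shorter than `MAX_DURATION_OF_EVENT`. The sketch below shows only this grouping on plain dates; the helper name `cluster_close_dates` and the sample dates are assumptions made for the illustration, and the merging of texts and coordinates is left out.

```python
from datetime import date, timedelta

MAX_DURATION_OF_EVENT = timedelta(days=30)

def cluster_close_dates(dates, max_gap=MAX_DURATION_OF_EVENT):
    """Group dates so that consecutive dates in a cluster are less than max_gap apart."""
    clusters = []
    for d in sorted(dates):
        if clusters and d - clusters[-1][-1] < max_gap:
            clusters[-1].append(d)   # close to the previous date: same event
        else:
            clusters.append([d])     # far from the previous date: new event
    return clusters

detected = [date(2012, 4, 21), date(2012, 4, 22), date(2012, 6, 30), date(2011, 3, 25)]
for cluster in cluster_close_dates(detected):
    print(cluster)
```

Because only consecutive dates are compared, a long chain of nearby days ends up in a single cluster even if its first and last day are more than 30 days apart.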

This function will be applied to each dataframe. If a row is detected as an event, it is added to localRowsList, which is then used to build a general dataframe of all the events.

In [31]:
localRowsList = []
def applyToMakeEventDf(row):
    if row.event:
        rowToAdd = {'date': row.name, 'hashtag': row.hashtag, 'text': row.text,
                    'longitude': row.longitude, 'latitude':row.latitude, 'numberOfTweets': row.numberOfTweets, }
        global localRowsList
        localRowsList.append(rowToAdd)

In [32]:
eventRowsList = []
localRowsList = []
count = 0
printingValue = max(1, len(dictionary) // 10)  # at least 1, to avoid a modulo by zero on small samples

pr('Starting to make event df with {} dataframes. (around 6 min)'.format(len(dictionary)))
for h, df in dictionary.items():
    # Reset the module-level list that applyToMakeEventDf fills for the current hashtag.
    localRowsList = []
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/len(dictionary)*100))
        
    df.apply(applyToMakeEventDf, axis=1)
    mergedList = mergeCloseEvents(localRowsList) # merging close events
    eventRowsList += mergedList

pr('Making new dataframe.')
new_events = pd.DataFrame(eventRowsList)
new_events.set_index(['date'], inplace=True)
pr('Finished! Dataframe with {} rows'.format(len(new_events)))

19:46:31 Starting to make event df with 27040 dataframes. (around 6 min)
19:46:32 10%
19:46:34 20%
19:46:35 30%
19:46:37 40%
19:46:39 50%
19:46:40 60%
19:46:42 70%
19:46:44 80%
19:46:45 90%
19:46:47 100%
19:46:47 Making new dataframe.
19:46:47 Finished! Dataframe with 8 rows


In [33]:
print('Events dataframe:')
new_events.head(50)

Events dataframe:


Unnamed: 0_level_0,hashtag,latitude,longitude,numberOfTweets,text
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2012-04-22,mélenchon,46.5243,6.8658,64,JL #Mélenchon votera à 10:00 dans le Xème arro...
2012-05-16,france2,46.3706,6.49121,9,Et trop beau chien au passage ;) #France2 #CLA...
2010-06-16,sui,47.3788,8.53803,25,"#sui wonderful_$$$_en plus du goal, j'aurai mi..."
2012-04-19,lesanges4,46.1573,6.09798,10,"JE VAIS MANGER DEVANT LA TELE.., EN REGARDANT ..."
2012-04-21,placeaupeuple,46.5243,6.86581,40,@steffie1985 Dis toi que tu auras encore 2 sem...
2012-05-06,forzajuve,45.83195,9.226965,16,Parola del giorno ANSIA. Per me e altri milion...
2011-06-08,ios5,47.485,8.29721,9,@AndreBonhote einstellungen gemacht? #ios5 ht...
2011-03-25,ipad2,47.3751,8.53896,41,@greezer der @ThBenkoe hat's schon #iPad2 ht...


In [34]:
print('Linked dataframe of all days:')
dictionary[new_events.iloc[0].hashtag].head(10)

Linked dataframe of all days:


Unnamed: 0,numberOfTweets,longitude,text,userId,latitude,hashtag,event
2012-04-19,17,6.86592,Ce soir #Mélenchon au #19:30. Message aux Suis...,2.0,46.5244,mélenchon,False
2012-04-20,54,6.86579,"@didierpg26 Pardon, mais g loupé un épisode, e...",1.0,46.5243,mélenchon,False
2012-04-21,57,6.8658,"Pour les indécis, ne vous posez pas la questio...",1.0,46.5243,mélenchon,False
2012-04-22,64,6.8658,JL #Mélenchon votera à 10:00 dans le Xème arro...,2.0,46.5243,mélenchon,True
2012-04-23,43,6.86581,"@pierremoscovici @RMCinfo ""#Mélenchon nous so...",2.0,46.5243,mélenchon,False
2012-04-24,12,6.86581,#RadioBastille Rejoignez nous sur #RéseauFDG f...,1.0,46.5243,mélenchon,False
2012-04-25,1,6.86606,@GerlebGg62 @espritpassion Critiquer la postur...,1.0,46.5242,mélenchon,False
2012-05-01,2,6.865765,Ce que vient de tisser aujourd'hui #Sarkozy au...,1.0,46.52425,mélenchon,False
2012-05-02,2,6.865835,#Mélenchon sort du corps de #Hollande _$$$_RT ...,1.0,46.5243,mélenchon,False
2012-05-04,2,6.86568,#Mélenchon à #Stalingrad : sur le pupitre est ...,1.0,46.5243,mélenchon,False


## Exporting data

As we worked with another team, we needed a way to hand over our detected events. We used a JSON file containing all the information.

In [35]:
total_number_of_events = len(new_events)
print('There are {} events.'.format(total_number_of_events))

There are 8 events.


In [36]:
e_df = new_events.copy()
e_df['date'] = e_df.index
e_df.index = range(len(e_df))
e_df.head(1)

Unnamed: 0,hashtag,latitude,longitude,numberOfTweets,text,date
0,mélenchon,46.5243,6.8658,64,JL #Mélenchon votera à 10:00 dans le Xème arro...,2012-04-22


We now convert the dates to Unix timestamps in milliseconds, the format used in the JSON:

In [37]:
# epoch_dt = datetime(1970, 1, 1)
# def to_utc(date):
#     d_dt = datetime.combine(date, datetime.min.time())
#     return int((d_dt - epoch_dt).total_seconds()*1000)

In [38]:
# def convert_to_unix_time(record):
#     datetime_index = pd.DatetimeIndex([datetime(record['year'], record['month'], 1)])
#     unix_time_index = datetime_index.astype(np.int64) // 10**6
#     return unix_time_index[0]

In [39]:
pr('Converting dates...')
e_df['year'] = e_df['date'].apply(lambda x: x.year)
e_df['month'] = e_df['date'].apply(lambda x: x.month)
e_df['utc_date'] = e_df['date'].apply(lambda x: to_utc(x))
e_df['unix_time'] = e_df.apply(convert_to_unix_time, axis=1)
pr('Done.')
e_df.head(1)

19:46:47 Converting dates...
19:46:47 Done.


Unnamed: 0,hashtag,latitude,longitude,numberOfTweets,text,date,year,month,utc_date,unix_time
0,mélenchon,46.5243,6.8658,64,JL #Mélenchon votera à 10:00 dans le Xème arro...,2012-04-22,2012,4,1335052800000,1333238400000
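The two timestamp columns are milliseconds since the Unix epoch: `utc_date` for midnight UTC of the event day and `unix_time` for the first day of its month. The sketch below reproduces both values with the standard library only; the function names are chosen for this example and are not the helpers from `helpFunctions`.

```python
from datetime import date, datetime, timezone

def to_epoch_ms(d):
    """Milliseconds since 1970-01-01 for midnight UTC of the given date."""
    midnight_utc = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(midnight_utc.timestamp() * 1000)

def month_start_epoch_ms(d):
    """Milliseconds since the epoch for the first day of the date's month."""
    return to_epoch_ms(date(d.year, d.month, 1))

event_day = date(2012, 4, 22)
print(to_epoch_ms(event_day))           # 1335052800000
print(month_start_epoch_ms(event_day))  # 1333238400000
```

Using an explicit UTC timezone keeps the result independent of the machine's local time settings.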


Generating JSON is easier from a dictionary than from a dataframe. In addition, the other team asked us to group events by month.

In [40]:
# Grouping by months
e_gb_month = e_df.groupby(e_df.unix_time)

In [41]:
# Generation of the dictionary for the final JSON
pr('Making event list...')
months = []
for month, df in e_gb_month:
    days = []
    for i in range (len(df)):
        ht = df.iloc[i]['hashtag']
        lat = df.iloc[i]['latitude']
        lon = df.iloc[i]['longitude']
        t_num = df.iloc[i]['numberOfTweets']
        tweets = df.iloc[i]['text'].split(delimiter)
        date = df.iloc[i]['utc_date']
        
        data_unit = { 'name': ht
                    , 'latitude' : lat
                    , 'longitude' : lon
                    , 'tweets' : tweets
                    , 'number_of_tweets' : str(t_num)
                    , 'date' : int(date)}
        days.append(data_unit)
    
    curr_month = {'date': int(month), 'data' : days}
    months.append(curr_month)

final_events = {'events' : months}
pr('Done.')

19:46:47 Making event list...
19:46:47 Done.
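The nested structure built above is a list of months, each holding a list of events. The sketch below rebuilds it for a toy dataframe using `itertuples` and prints the resulting JSON; the two rows are illustrative (the second tweet text is invented), not actual export content.

```python
import json
import pandas as pd

delimiter = '_$$$_'
toy = pd.DataFrame({
    'hashtag': ['sui', 'ipad2'],
    'latitude': [47.3788, 47.3751],
    'longitude': [8.53803, 8.53896],
    'numberOfTweets': [25, 41],
    'text': ['#sui wonderful' + delimiter + 'second tweet', 'single tweet'],
    'utc_date': [1276646400000, 1301011200000],   # 2010-06-16 and 2011-03-25
    'unix_time': [1275350400000, 1298937600000],  # 2010-06-01 and 2011-03-01
})

months = []
for month, group in toy.groupby('unix_time'):
    data = [
        {
            'name': row.hashtag,
            'latitude': float(row.latitude),
            'longitude': float(row.longitude),
            'tweets': row.text.split(delimiter),
            'number_of_tweets': str(row.numberOfTweets),
            'date': int(row.utc_date),  # plain int: NumPy integers are not JSON serializable
        }
        for row in group.itertuples(index=False)
    ]
    months.append({'date': int(month), 'data': data})

payload = json.dumps({'events': months}, ensure_ascii=False, indent=2)
print(payload)
```

Converting NumPy scalars with `int()` and `float()` before dumping avoids serialization errors, and `ensure_ascii=False` keeps accented hashtags readable in the output.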


Creation of the final JSON:

In [42]:
exportFilename = 'export_twitter_events_' + datetime.now().strftime("%Y-%m-%d_%Hh%Mmin%S") + \
'_' + str(total_number_of_events)+ '_events.json'
exportPath =  os.path.join('data', exportFilename)

pr('Exporting to json...')
with open(exportPath, 'w') as f:
     json.dump(final_events, f)
pr('Export done. File "{}" has been created.'.format(exportFilename))

19:46:47 Exporting to json...
19:46:47 Export done. File "export_twitter_events_2017-02-03_19h46min47_8_events.json" has been created.
