# Event detection on a Twitter dataset
The goal of this project is to detect past events in Switzerland using a dataset of tweets. The dataset contains about 28 million tweets, most of them sent from Switzerland. This notebook walks through the project and explains our methodology. It is split into 5 parts:


1. **Extracting hashtags and data cleaning**<br>
The data are cleaned. The hashtags in all the tweets are found and extracted.
2. **Data structure manipulation**<br>
The tweets are grouped per hashtag and discretized per day.
3. **Event detection**<br>
Events are detected. The detection is then improved by removing recurrent events and merging close events.
4. **Exporting the data**<br>
The data are exported in JSON format so that another group can build a visualization from them.
5. **Conclusion**<br>
A small conclusion on the project.

**Note:** To make our code more readable, we moved some functions to the file _"utils.py"_. Each time we use one of these functions, we explicitly write: *# EXTERNAL FUNCTION: name_of_the_function()*

Importing libraries:

In [1]:
import pandas as pd
import numpy as np
import os
import re
import copy
import json
import csv
from datetime import datetime, timedelta # datetime is needed by pr() and by the export file name
from utils import * # File where some functions were exported

Useful printing methods that will be used throughout the project:

In [2]:
def pr(strToPrint):
    '''
    Print the current time, followed by the string passed as argument.
    :param strToPrint: Regular string to print
    '''
    print(str(datetime.now().time())[:8] + ' '+ strToPrint)
    
def strNb(nb):
    '''
    Format a large number as a string, with '.' as thousands separator
    :param nb: A high number to print
    '''
    return '{0:,}'.format(nb).replace(',', '.')

## 0 - Importing data

Importing a sample of the dataset (useful for testing):

In [3]:
# pickle_filename = os.path.join('data','head_100k_pickle.pkl')
# tw = pd.read_pickle(pickle_filename)

Preparing columns headers and file name.

In [4]:
columns_header = ['id', 'userId', 'createdAt', 'text', 'longitude', 'latitude', 'placeId',
                  'inReplyTo', 'source', 'truncated', 'placeLatitude', 'placeLongitude', 'sourceName', 'sourceUrl',
                 'userName', 'screenName', 'followersCount', 'friendsCount', 'statusesCount',
                 'userLocation']
filename = os.path.join('data', 'twex.tsv')

Importing the whole dataset:

In [5]:
pr('Starting to read file... (3 min)')
tw = pd.read_csv(filename, sep='\t', encoding='utf-8', escapechar='\\', names=columns_header,
                      quoting=csv.QUOTE_NONE, na_values='N', header=None)
pr('File is loaded.')

01:13:50 Starting to read file... (3 min)




01:16:21 File is loaded.


In [6]:
print('The dataset contains {} tweets.'.format(strNb(len(tw))))

The dataset contains 27.632.392 tweets.


In [7]:
print('First rows of dataset:')
tw.head(2)

First rows of dataset:


,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,9514097914,17341000.0,2010-02-23 05:55:51,Guuuuten Morgen! :-),7.43926,46.9489,,,197,,,,TwitBird,http://www.nibirutech.com,Tilman Jentzsch,blickwechsel,586,508.0,9016.0,"Bern, Switzerland"
1,9514846412,7198280.0,2010-02-23 06:22:40,Still the best coffee in town — at La Stanza h...,8.53781,47.3678,,,550,,,,Gowalla,http://gowalla.com/,Nico Luchsinger,halbluchs,1820,703.0,4687.0,"Zurich, Switzerland"


## 1 - Extracting hashtags and data cleaning

### 1.1 - Extracting hashtags

To detect events, we decided to concentrate only on hashtags. We therefore extract the hashtags and keep only the tweets that contain at least one (the others carry no useful information for us). We run this step before the data cleaning because it reduces the data size the most, so all further operations execute faster.<br>
To do so, we are going to examine the _text_ field of each tweet, extract its hashtags and put them in a new column (in the form of a list of hashtags per tweet).

In [8]:
# EXTERNAL FUNCTION: extract_hashtags(text)

pr('Extracting hashtags... (2 min)')

tw['hashtag'] = tw.text.apply(lambda x: extract_hashtags(str(x))) # Putting the list of hashtags into a new column
twh = tw.loc[tw.hashtag.apply(lambda x: len(x) != 0)] # Dropping the rows (tweets) that contain no hashtags (.ix was removed from pandas)

pr('We have extracted {} rows with hashtags out of the {} rows of our initial dataframe.'.format(strNb(len(twh)),strNb(len(tw))))

01:16:21 Extracting hashtags... (2 min)
01:18:11 We have extracted 3.875.280 rows with hashtags out of the 27.632.392 rows of our initial dataframe.

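The extraction step described above can be sketched as follows. This is only a minimal illustration: the real `extract_hashtags()` lives in _utils.py_ and is not shown in this notebook, so the regular expression, the lowercasing and the name `extract_hashtags_sketch` are assumptions, chosen to be consistent with the sample output (for instance `#VfB` appearing as `vfb`).

```python
import re

# Hypothetical stand-in for extract_hashtags() from utils.py (not shown in the notebook).
# Assumption: a hashtag is '#' followed by word characters, stored in lowercase.
HASHTAG_PATTERN = re.compile(r'#(\w+)')

def extract_hashtags_sketch(text):
    """Return the lowercased hashtags found in a tweet, in order of appearance."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

print(extract_hashtags_sketch('So, Feierabend. Jetzt #24 und später #VfB.'))
print(extract_hashtags_sketch('Guuuuten Morgen! :-)'))
```

Tweets for which the function returns an empty list are the ones dropped from the dataframe.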

In [9]:
print('Examples of tweets (with only text and hashtag column):')
twh[['text', 'hashtag']].head(3)

Examples of tweets (with only text and hashtag column):


,text,hashtag
8,"Magic spells run off after midnight, I guess s...",[fb]
10,"Limitas of public transportation! No taxi, rai...",[yam]
15,"So, Feierabend. Jetzt #24 und später #VfB. — a...","[24, vfb]"


### 1.2 - Data cleaning

We will now clean the remaining dataset. The creation date is essential for our event detection, and the other team needs a location for every tweet in order to visualize it. Therefore we drop the rows that contain NA values in the _createdAt_, _latitude_ or _longitude_ columns.

In [10]:
pr("Dropping rows with NA values (location and creation date).")
# Drop every row missing a creation date or a coordinate, in a single pass
tw1 = twh.dropna(axis=0, how='any', subset=['createdAt', 'longitude', 'latitude'])
pr('The data have been reduced from {} tweets to {} tweets.'.format(strNb(len(twh)), strNb(len(tw1))))

01:18:11 Dropping rows with NA values (location and creation date).
01:18:16 The data have been reduced from 3.875.280 tweets to 2.107.021 tweets.


The latitude and longitude values should be in _float_ format to analyze them correctly later.

In [11]:
tw1['latitude'] = tw1['latitude'].astype(float)
tw1['longitude'] = tw1['longitude'].astype(float)

Some dates in our initial data might not be convertible to pandas datetimes. Let's check that everything is okay.

In [12]:
pr('Starting to examine dates...')
# convert_objects() was removed from pandas; with errors='coerce', unparseable dates become NaT
datetime_serie = pd.to_datetime(tw1['createdAt'], errors='coerce')
dateNotConvertible = datetime_serie[pd.isnull(datetime_serie)]
pr('There are {} dates that cannot be transformed.'.format(len(dateNotConvertible)))

01:18:18 Starting to examine dates...
01:18:19 There are 0 dates that cannot be transformed.

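The check above can be illustrated on a tiny, made-up sample. This is a sketch assuming a recent pandas version, where `pd.to_datetime(..., errors='coerce')` turns unparseable values into `NaT`; it also separates values that were missing from the start from values that were present but malformed.

```python
import pandas as pd

# Made-up sample: one valid timestamp, one malformed string, one missing value
sample = pd.Series(['2010-02-23 05:55:51', 'not a date', None])

# Unparseable and missing values both become NaT
parsed = pd.to_datetime(sample, errors='coerce')

# Keep only the values that were present but could not be parsed
malformed = parsed.isna() & sample.notna()
print('Malformed dates:', int(malformed.sum()))
```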

It looks good! Now, we will turn our data into a time series. To do so, the best way is to use the creation date of each row as the index of the dataframe.

In [13]:
pr('Converting to datetime...')
tw5 = tw1.copy()
tw5['createdAt'] = pd.to_datetime(tw1['createdAt'])
pr('Setting up new indices...')
tw5.index = tw5['createdAt']
pr('Deleting old "createdAt" column...')
del tw5['createdAt']
pr('Done. The updated dataframe:')
tw5.head(1)

01:18:19 Converting to datetime...
01:18:21 Setting up new indices...
01:18:21 Deleting old "createdAt" column...
01:18:21 Done. The updated dataframe:


createdAt,id,userId,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation,hashtag
2010-02-23 09:59:41,9519737890,14657900.0,"Magic spells run off after midnight, I guess s...",6.1387,46.175,,,1,,,,Twitter for iPhone,http://twitter.com/#!/download/iphone,Javier Belmonte,vichango,167,277.0,2885.0,"Geneva, Switzerland",[fb]


## 2 - Data structure manipulation

We want to have the data in the right format to process them. Therefore we will apply several operations to the dataframes.

### 2.1 - One hashtag = one row

Some people put more than one hashtag in a tweet. Such a tweet is analyzed once for each of its hashtags, exactly as if it contained only that hashtag. So, to process the data more easily, we build a dataframe with **one row per hashtag**, duplicating the rows that contain more than one hashtag.

This is done by going through the dataframe while building, in parallel, a list of rows (one hashtag per row) that need to be added to the old dataframe. These rows are collected in a list and appended to the dataframe afterwards.

In [14]:
# EXTERNAL FUNCTIONS: reset_added_hashtag_rows_list(), multiplyHashtagRows(row), get_added_hashtag_rows_list()

pr('Multiplying the hashtag rows... (around 10 min)')

# Initial reset of the list of rows to start the processing
reset_added_hashtag_rows_list()
tw5_1 = tw5.copy()

# The function multiplyHashtagRows (applied to each row) returns the first hashtag and adds all the other hashtags to the list.
tw5_1['hashtag'] = tw5.apply(multiplyHashtagRows, args=[tw5.columns,], axis=1)

# This returns the list with the added rows.
addedHashtagsRowsList = get_added_hashtag_rows_list()

pr('Finished! {} rows will be added to the dataframe!'.format(strNb(len(addedHashtagsRowsList))))

01:18:21 Multiplying the hashtag rows... (around 10 min)
01:23:52 Finished! 2.265.316 rows will be added to the dataframe!

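As a point of comparison, recent pandas versions (0.25 and later) offer `DataFrame.explode`, which produces one row per list element directly. The sketch below is an alternative to the helper functions from _utils.py_, not a description of them, and it uses a made-up two-tweet dataframe.

```python
import pandas as pd

# Made-up dataframe with a list of hashtags per tweet, indexed by creation date
toy = pd.DataFrame(
    {
        'text': ['Jetzt #24 und später #VfB', 'Hello #fb'],
        'hashtag': [['24', 'vfb'], ['fb']],
    },
    index=pd.to_datetime(['2010-02-23 17:47:11', '2010-02-23 09:59:41']),
)

# One row per hashtag: the other columns and the index value are repeated
one_per_row = toy.explode('hashtag')
print(one_per_row)
```

With this approach the duplicated rows stay next to the original tweet instead of being appended at the end, which makes no difference for the grouping by hashtag that follows.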

We will create a new dataframe with the additional rows and merge it with the old one.

In [15]:
pr('Making the new dataframe of additional rows and appending it to the original dataframe...')

addedHashtagsDf = pd.DataFrame(addedHashtagsRowsList)
addedHashtagsDf.set_index(['createdAt'], inplace=True)

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tw6 = pd.concat([tw5_1, addedHashtagsDf])

pr('Done: Original dataframe size was: {} - New dataframe size is: {}'.format(strNb(len(tw5_1)),strNb(len(tw6))))

01:23:52 Making the new dataframe of additional rows and appending it to the original dataframe...
01:24:04 Done: Original dataframe size was: 2.107.021 - New dataframe size is: 4.372.337


In [16]:
print('Examples of tweets (with only text and hashtag column):')
tw6[['text', 'hashtag']].head(3)

Examples of tweets (with only text and hashtag column):


createdAt,text,hashtag
2010-02-23 09:59:41,"Magic spells run off after midnight, I guess s...",fb
2010-02-23 11:28:27,"Limitas of public transportation! No taxi, rai...",yam
2010-02-23 17:47:11,"So, Feierabend. Jetzt #24 und später #VfB. — a...",24


### 2.2 - Grouping per hashtag and merging per day

First, we group the tweets by hashtag.

In [17]:
pr('Grouping by hashtag.')
tw6['numberOfTweets'] = 1 ## We make a "number of tweets" column that will be useful later
gp = tw6.groupby('hashtag')
pr('Done')

01:24:07 Grouping by hashtag.
01:24:07 Done


Then, we will merge all the tweets with the same hashtag that happened during a particular day.

For the column containing the tweet text, we join the texts with a special delimiter so that the individual tweets can be recovered later. For the longitude and latitude we chose the median rather than the mean, because we noticed some tweets in the data that were located very far from Switzerland. The median avoids a value strongly biased by such extreme points.

In [18]:
delimiter = '_$$$_'
str_join = lambda x: delimiter.join(x)

def aggDate(df):
    '''
    Group the rows of a dataframe (indexed by datetime) by day and aggregate their content.
    '''
    temp = df.groupby(df.index.date) ## One group per calendar day
    groupedDf = temp.agg({  'text' : str_join, ## Merged text
                            'longitude' : np.median, ## Median of the longitude
                            'latitude' : np.median, ## Median of the latitude
                            'hashtag' : lambda x: x.iloc[0], ## The name of the hashtag
                            'numberOfTweets' : 'count', ## Number of tweets during the day
                            'userId' : pd.Series.nunique}) ## Number of unique users
    # Note: after aggregation, the userId column holds the number of unique users
    return groupedDf

To better manipulate the data, we create a dictionary, with each key representing a hashtag. The dictionary value corresponding to the hashtag will be a dataframe that contains tweets grouped by day.<br>
The form will be: {**'hashtag'** : _dataframe containing all tweets of corresponding hashtag grouped per day_}

In [19]:
pr('Putting hashtags in dictionary... (around 50 min)')
dictionary = {}

# Printing variables
count = 0
lengp = len(gp)
printingValue = int(lengp / 10)

for hashtag, df in gp:
    # Grouping per date
    dictionary[hashtag] = aggDate(df)
    
    # Printing information
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/lengp*100))
        
pr('Finished operations! Dictionary with {} different hashtags.'.format(len(dictionary)))

01:24:07 Putting hashtags in dictionary... (around 50 min)
01:30:07 10%
01:35:42 20%
01:41:11 30%
01:46:47 40%
01:52:22 50%
01:57:55 60%
02:03:21 70%
02:08:51 80%
02:14:20 90%
02:19:45 100%
02:19:45 Finished operations! Dictionary with 607601 different hashtags.

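The per-day aggregation and the choice of the median can be illustrated on a small made-up dataframe. This is a sketch using named aggregation (pandas 0.25 and later) rather than the exact call in `aggDate`; the column name `uniqueUsers` and all values are assumptions for the example.

```python
import pandas as pd

DELIMITER = '_$$$_'

# Made-up tweets for one hashtag; the third one is located far from Switzerland
toy = pd.DataFrame(
    {
        'text': ['a #x', 'b #x', 'c #x', 'd #x'],
        'longitude': [6.10, 6.20, -74.00, 8.50],
        'latitude': [46.20, 46.30, 40.70, 47.40],
        'userId': [1, 2, 2, 3],
    },
    index=pd.to_datetime(['2014-05-02 08:00', '2014-05-02 12:00',
                          '2014-05-02 20:00', '2014-05-03 09:00']),
)

# One row per calendar day
daily = toy.groupby(toy.index.date).agg(
    text=('text', lambda s: DELIMITER.join(s)),
    longitude=('longitude', 'median'),
    latitude=('latitude', 'median'),
    numberOfTweets=('text', 'count'),
    uniqueUsers=('userId', 'nunique'),
)
print(daily)

# For comparison: the mean longitude of the first day is pulled far to the west
print(toy.loc['2014-05-02', 'longitude'].mean())
```

On the first day the median longitude stays at 6.10, inside Switzerland, whereas the mean is dragged to a negative value by the single distant tweet.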

In [20]:
print('Example of the dictionary entry "{}":'.format(list(dictionary.keys())[5]))
dictionary[list(dictionary.keys())[5]].head()

Example of the dictionary entry "1reedition":


,numberOfTweets,longitude,userId,latitude,hashtag,text
2014-05-02,2,7.25455,1,46.0904,1reedition,Départ de la Patrouilles des jeunes! #pdg2014 ...


## 3 - Event detection

In this section we detect events using the hashtags. We start by defining the parameters that characterize an event; their final values were found by testing many combinations. In the second part we run the code that goes through the tweets to detect events and that removes recurrent events. In the third part, "close" events are merged together and all events are grouped in a single dataframe.

### 3.1 - Parameters to define events

Parameters that define events:

In [21]:
## Parameters of an event:
MIN_TOT_NB_TWEETS = 20 ## The hashtag must appear at least this number of times over all tweets to be considered.
MIN_NB_DAYS_WITH_HASHTAGS = 3 ## The hashtag must appear on at least this number of different days to be considered.
MIN_NB_TWEETS_DURING_EVENT = 7 ## To be considered an event, the hashtag must appear at least this number of times during the day.
THRESHOLD_ANOMALY_FACTOR = 2.5 ## The number of occurrences of a hashtag during a single day must reach the mean plus this FACTOR
                             ## multiplied by the std to be considered an event.
MAX_DURATION_OF_EVENT = timedelta(days=30) ## The maximum duration we allow for a single event
MIN_DURATION_BEFORE_NEW_EVENT = timedelta(days=304) ## (= 10 months) The minimum time that must pass before an event can happen
                                                    ## again and still be considered an event (e.g. Christmas is an event
                                                    ## each year)
MIN_NUMBER_DIFFERENT_USER = 2 # To state that an event occurred, at least this number of different users must have tweeted about it

The helper functions that detect the recurrent events to remove are defined in _utils.py_.

### 3.2 - Events detection

The following function is applied to each row (one day of a given hashtag) and returns True if that day should be considered an event.

In [22]:
def isEvent(row, threshold):
    minNbTweet = max(threshold, MIN_NB_TWEETS_DURING_EVENT)
    return row.numberOfTweets >= minNbTweet and row.userId >= MIN_NUMBER_DIFFERENT_USER

Main method to detect events according to all the above parameters.<br>
The external function _isSpecificEventListIllegal()_ examines the time series of potential events for a given hashtag and detects recurrent events. If the events are too close to each other, they are categorized as recurrent events and removed.

In [23]:
# EXTERNAL FUNCTION: isSpecificEventListIllegal(detectedEventDateList, max_event_duration, min_duration_before_new_event),

pr('Starting to compute {} different hashtags to detect event. (4 min)'.format(len(dictionary)))

# Printing variables
nbOfEventDetected = 0
count = 0
printingValue = int(len(dictionary) / 10)

# Going through all items of dictionary
for h, df in dictionary.items():
    
    # Printing information
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/len(dictionary)*100))
    
    # Initializing event column
    df['event'] = False
    
    # Making the different tests corresponding to the above parameters to detect event
    if len(df) > MIN_NB_DAYS_WITH_HASHTAGS and df['numberOfTweets'].sum() >= MIN_TOT_NB_TWEETS:
        
        threshold = df['numberOfTweets'].mean() + THRESHOLD_ANOMALY_FACTOR * df['numberOfTweets'].std()
        df['event'] = df.apply(isEvent, args=[threshold,], axis=1)          

        # Remove recurrent events:
        detectedEventDf = df[df['event']]
        if len(detectedEventDf) > 2 and isSpecificEventListIllegal(detectedEventDf.index, MAX_DURATION_OF_EVENT, MIN_DURATION_BEFORE_NEW_EVENT):
            df['event'] = False
                
        # Printing counts updated
        nbOfEventDetected += len(df[df['event']])
            
pr('Done. {} events were detected.'.format(strNb(nbOfEventDetected)))

02:19:46 Starting to compute 607601 different hashtags to detect event. (4 min)
02:20:12 10%
02:20:47 20%
02:21:13 30%
02:21:39 40%
02:22:04 50%
02:22:30 60%
02:22:55 70%
02:23:21 80%
02:23:47 90%
02:24:12 100%
02:24:12 Done. 4.138 events were detected.

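The anomaly test used above can be checked by hand on a small example. The sketch below uses the parameter values from section 3.1 and made-up daily counts for one hashtag; it relies on the standard library only, where `statistics.stdev` computes the sample standard deviation, which is also what the pandas `Series.std()` call returns by default.

```python
from statistics import mean, stdev

THRESHOLD_ANOMALY_FACTOR = 2.5
MIN_NB_TWEETS_DURING_EVENT = 7
MIN_NUMBER_DIFFERENT_USER = 2

# Made-up daily counts for one hashtag: nine quiet days and one peak
tweets_per_day = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]
users_per_day = [1, 1, 1, 1, 1, 1, 1, 1, 1, 6]

# Same formula as in the detection loop: mean + factor * std
threshold = mean(tweets_per_day) + THRESHOLD_ANOMALY_FACTOR * stdev(tweets_per_day)

# Same test as isEvent(): enough tweets (relative and absolute) and enough distinct users
min_nb_tweets = max(threshold, MIN_NB_TWEETS_DURING_EVENT)
is_event = [n >= min_nb_tweets and u >= MIN_NUMBER_DIFFERENT_USER
            for n, u in zip(tweets_per_day, users_per_day)]

print(round(threshold, 2), is_event)
```

Here the mean is 2 and the sample standard deviation is about 3.16, so the threshold is about 9.91 and only the last day, with 11 tweets from 6 users, is flagged.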

### 3.3 - Merging close events and grouping into single event dataframe

We now group all the detected events into a single dataframe.<br>
In this process, we examine each event series in order to merge events that are too "close" to each other to be considered individually. For example, an event may have happened on the 24th of January, with additional peaks shortly before and after it (say on the 18th and the 25th of January, because people talked about it beforehand and afterwards). These days should be merged together. The strategy applied is described in detail in the external file _utils.py_.

In [24]:
eventsWithSpecificHashtagRowList = []

def applyToMakeEventDf(row):
    '''
    This function is applied to each row of each hashtag dataframe.
    If a row is flagged as an event, it is added to eventsWithSpecificHashtagRowList, which is
    used to build a dataframe of all the events.
    '''
    if row.event:
        rowToAdd = {'date': row.name, 'hashtag': row.hashtag, 'text': row.text,
                    'longitude': row.longitude, 'latitude':row.latitude, 'numberOfTweets': row.numberOfTweets, }
        global eventsWithSpecificHashtagRowList
        eventsWithSpecificHashtagRowList.append(rowToAdd)

This is the main method that will put all events in a single dataframe and merge the events that need to be merged.

In [25]:
# EXTERNAL FUNCTION: mergeCloseEvents(rowsList, max_event_duration, delimiter)

allEventsRowsList = []
eventsWithSpecificHashtagRowList = []

# Print values
count = 0
printingValue = int(len(dictionary) / 10)

pr('Starting to make event df with {} dataframes. (around 6 min)'.format(len(dictionary)))
for h, df in dictionary.items():
    # Print information
    count += 1
    if count % printingValue == 0:
        pr("{:.0f}%".format(count/len(dictionary)*100))
    
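    # Initializing the event list of rows (this loop runs at module level, so no "global" statement is needed;
    # declaring it after the assignment above is a SyntaxError in recent Python versions)
    eventsWithSpecificHashtagRowList = []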

    # Detecting events for a hashtag
    df.apply(applyToMakeEventDf, axis=1) 
    
    # Merging close events if needed
    mergedList = mergeCloseEvents(eventsWithSpecificHashtagRowList, MAX_DURATION_OF_EVENT, delimiter)
    
    # Adding events to the all events list
    allEventsRowsList += mergedList

pr('Making new dataframe.')

new_events = pd.DataFrame(allEventsRowsList)
new_events.set_index(['date'], inplace=True)

pr('Done. Dataframe with {} events.'.format(strNb(len(new_events))))

02:24:12 Starting to make event df with 607601 dataframes. (around 6 min)
02:24:44 10%
02:25:16 20%
02:25:48 30%
02:26:20 40%
02:26:52 50%
02:27:23 60%
02:27:55 70%
02:28:27 80%
02:28:59 90%
02:29:31 100%
02:29:31 Making new dataframe.
02:29:31 Done. Dataframe with 3.156 events.

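The merging strategy itself is implemented in `mergeCloseEvents()` in _utils.py_ and is not shown in this notebook. The sketch below is therefore only one plausible rule, under explicit assumptions: event days falling within `max_event_duration` of the first day of a group are folded into that group, the merged event keeps the date and location of the day with the most tweets, tweet counts are summed and texts are joined with the delimiter. The function names and the sample rows are hypothetical.

```python
from datetime import date, timedelta

def fold_group_sketch(group, delimiter):
    """Collapse a group of event rows into one row (assumed rule: keep the peak day)."""
    peak = max(group, key=lambda r: r['numberOfTweets'])
    folded = dict(peak)  # date and location of the busiest day
    folded['numberOfTweets'] = sum(r['numberOfTweets'] for r in group)
    folded['text'] = delimiter.join(r['text'] for r in group)
    return folded

def merge_close_events_sketch(rows, max_event_duration, delimiter):
    """Hypothetical stand-in for mergeCloseEvents(): merge event days that are close in time."""
    merged, group = [], []
    for row in sorted(rows, key=lambda r: r['date']):
        # Start a new group when the row is too far from the first day of the current group
        if group and row['date'] - group[0]['date'] > max_event_duration:
            merged.append(fold_group_sketch(group, delimiter))
            group = []
        group.append(row)
    if group:
        merged.append(fold_group_sketch(group, delimiter))
    return merged

# Made-up event days for one hashtag, following the January example in the text
rows = [
    {'date': date(2014, 1, 18), 'hashtag': 'demo', 'text': 'before', 'numberOfTweets': 8},
    {'date': date(2014, 1, 24), 'hashtag': 'demo', 'text': 'main', 'numberOfTweets': 30},
    {'date': date(2014, 1, 25), 'hashtag': 'demo', 'text': 'after', 'numberOfTweets': 9},
    {'date': date(2015, 1, 24), 'hashtag': 'demo', 'text': 'next year', 'numberOfTweets': 25},
]
result = merge_close_events_sketch(rows, timedelta(days=30), '_$$$_')
for event in result:
    print(event['date'], event['numberOfTweets'], event['text'])
```

Under these assumptions the three January 2014 days collapse into a single event dated on the 24th, while the 2015 occurrence stays separate.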

All detected events are now in a single dataframe. We can look at its first rows.

In [26]:
print('Events dataframe:')
new_events.head(10)

Events dataframe:


date,hashtag,latitude,longitude,numberOfTweets,text
2013-03-05,laferrari,46.23355,6.11895,8,#LaFerrari new name of Maranello's car #SIAG h...
2013-04-03,lrts,46.288,6.16644,7,Ça m'manque #LRTs_$$$_Elle a tout dit ! #LRTS_...
2013-06-16,lrts,46.2199,6.1462,7,MDRRRRRRR :( #lrts_$$$_Merci Aurevoir. #lrts_$...
2014-07-13,lrts,46.6097,6.14846,7,J'ai toujours trouvé que c'était Zayn qui avai...
2012-10-15,colorado,45.953875,9.17606,20,Guardando #colorado perché c'è bisogno di ride...
2015-02-27,colorado,45.7888,9.068445,8,#Colorado #riderenoncostaniente guardando Colo...
2014-03-23,mun25000,47.2357,6.02586,17,#Mun25000 [Info] Le taux de participation offi...
2013-06-25,سويسرا,47.1159,9.25662,26,#سويسرا #ورد @ lugano http://t.co/d6It7t6P3a_$...
2014-08-13,سويسرا,46.679794,7.917126,247,#سويسرا\n#جبال_الالب\n#المسافرون_العرب\n#انترل...
2015-08-26,سويسرا,46.68675,7.885123,41,#سويسرا #ص #اوربا#انترلاكن #ليتربون #صباحية_سو...


In [27]:
print('Example of the original tweets for the word "{}" that was detected as an event:'.format(new_events.iloc[0].hashtag))
dictionary[new_events.iloc[0].hashtag].head(10)

Example of the original tweets for the word "laferrari" that was detected as an event:


,numberOfTweets,longitude,userId,latitude,hashtag,text,event
2013-03-05,8,6.11895,7,46.23355,laferrari,#LaFerrari new name of Maranello's car #SIAG h...,True
2013-03-06,4,6.11418,2,46.234,laferrari,#Ferrari #LaFerrari bel nome! #genevamotorshow...,False
2013-03-07,1,6.10877,1,46.2316,laferrari,"#GenevaMotorShow, in this pic you can almost m...",False
2013-03-10,2,6.116535,2,46.2341,laferrari,Scarsa #LaFerrari #FerrariGeneva2013 #Siag @ 2...,False
2013-03-11,1,6.11861,1,46.2342,laferrari,#salongeneve#ferrari#laferrari#italia @ Geneva...,False
2013-03-12,1,6.11967,1,46.2337,laferrari,A Genève!! Avui toca Saló internacional de l'A...,False
2013-03-13,2,6.117995,2,46.23155,laferrari,@Tuittolo @kiara969 questa si che è meglio di ...,False
2013-03-15,1,6.13733,1,46.203,laferrari,Soddisfazioni #Motorshow #Geneva\n#LaFerrari @...,False
2013-03-16,1,6.11858,1,46.2345,laferrari,La più bella #Ferrari dopo la #250 ! #bellimai...,False
2013-03-18,1,6.09381,1,46.2204,laferrari,The #LaFerrari F70 http://t.co/FeuVizUF3Y #Fer...,False


## 4 - Exporting the data

In [28]:
total_number_of_events = len(new_events)
print('There are {} events to export.'.format(total_number_of_events))

There are 3156 events to export.


For this project, we worked with another team that built a visualization of the events we detected. We therefore agreed with them to export all this information to a JSON file in a common format; its structure is the one of the dictionary built in the cells below.

So, we first create columns with the year, the month and the Unix time of every event.

In [29]:
# EXTERNAL FUNCTIONS: convert_to_unix_time(record), to_utc(date)

pr('Converting dates.')
e_df = new_events.copy()
e_df['date'] = e_df.index
e_df.index = [i for i in range (len(e_df))]
e_df['year'] = e_df['date'].apply(lambda x: x.year)
e_df['month'] = e_df['date'].apply(lambda x: x.month)
e_df['utc_date'] = e_df['date'].apply(lambda x: to_utc(x))
e_df['unix_time'] = e_df.apply(convert_to_unix_time, axis=1)
pr('Done.')

02:29:31 Converting dates.
02:29:32 Done.

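The helpers `to_utc()` and `convert_to_unix_time()` come from _utils.py_ and are not shown here. A minimal sketch of what such conversions could look like, assuming that an event date is mapped to midnight UTC of that day and that a month is identified by midnight UTC of its first day (both function names below are hypothetical):

```python
import calendar
from datetime import date

def to_utc_sketch(day):
    """Unix time (seconds) of midnight UTC on the given day."""
    return calendar.timegm(day.timetuple())

def month_unix_time_sketch(year, month):
    """Unix time (seconds) of midnight UTC on the first day of the month."""
    return calendar.timegm(date(year, month, 1).timetuple())

event_day = date(2013, 3, 5)
day_ts = to_utc_sketch(event_day)
month_ts = month_unix_time_sketch(event_day.year, event_day.month)

# Whole days elapsed between the start of the month and the event day
print((day_ts - month_ts) // 86400)
```

Using one Unix time per month gives a single numeric key that can be used to group all the events of that month.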

We create a dictionary with the same structure as the JSON that will be exported. This makes the JSON generation very simple.

In [30]:
# Grouping events by month (each month is identified by its unix_time value)
e_gb_month = e_df.groupby(e_df.unix_time)

In [31]:
pr('Generation of the dictionary for the final JSON...')

months = []
for month, df in e_gb_month:
    days = []
    for i in range (len(df)):
        ht = df.iloc[i]['hashtag']
        lat = df.iloc[i]['latitude']
        lon = df.iloc[i]['longitude']
        t_num = df.iloc[i]['numberOfTweets']
        tweets = df.iloc[i]['text'].split(delimiter) # We split the tweet text
        date = df.iloc[i]['utc_date']
        
        data_unit = { 'name': ht
                    , 'latitude' : lat
                    , 'longitude' : lon
                    , 'tweets' : tweets
                    , 'number_of_tweets' : str(t_num)
                    , 'date' : int(date)}
        days.append(data_unit)
    
    curr_month = {'date': int(month), 'data' : days}
    months.append(curr_month)

final_events = {'events' : months}
pr('Dictionary has been generated.')

02:29:32 Generation of the dictionary for the final JSON...
02:29:35 Dictionary has been generated.

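The resulting structure can be illustrated with a single event. The values are illustrative: the hashtag, coordinates and tweet count come from the first row of the events dataframe shown earlier, while the tweet texts are placeholders and the two Unix times assume that dates are expressed as midnight UTC, which depends on the external helpers.

```python
import json

# One month containing one event, following the keys used in the generation loop
example = {
    'events': [
        {
            'date': 1362096000,  # assumed: 2013-03-01 00:00 UTC, identifies the month
            'data': [
                {
                    'name': 'laferrari',
                    'latitude': 46.23355,
                    'longitude': 6.11895,
                    'tweets': ['placeholder tweet 1', 'placeholder tweet 2'],
                    'number_of_tweets': '8',  # stored as a string, as in the notebook
                    'date': 1362441600,  # assumed: 2013-03-05 00:00 UTC
                }
            ],
        }
    ]
}

serialized = json.dumps(example, indent=2)
print(serialized)
```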

Creation of the final JSON from the dictionary.

In [32]:
# Filename with the date
exportFilename = 'export_twitter_events_' + datetime.now().strftime("%Y-%m-%d_%Hh%Mmin%S") + \
'_' + str(total_number_of_events)+ '_events.json'
exportPath =  os.path.join('export', exportFilename)

pr('Exporting to json...')
with open(exportPath, 'w') as f:
     json.dump(final_events, f)
pr('Export done. File "{}" has been created.'.format(exportFilename))

02:29:35 Exporting to json...
02:29:36 Export done. File "export_twitter_events_2017-02-05_02h29min35_3156_events.json" has been created.


## 5 - Conclusion

We were very satisfied with the number of events detected. German, Italian and French tweets were correctly located in the corresponding parts of Switzerland. Big events such as sports games and local events such as the white dinner in Basel were correctly detected. The model was also able to detect smaller events, such as conferences at EPFL.

The visualization provided by the other team was great and made our results much easier to understand.

Overall, this was a very interesting project to work on, with an amazing dataset, and we are happy to have had this opportunity.