# Applied Data Analysis

## Fall Semester 2016

Team members:
* Stylianos Agapiou (stylianos.agapiou@epfl.ch)
* Athanasios Giannakopoulos (athanasios.giannakopoulos@epfl.ch)
* Dimitrios Sarigiannis (dimitrios.sarigiannis@epfl.ch)

## Project Description

We are given a dataset containing tweets in Switzerland starting from 2010. 

The first goal of the project is to analyse the data and reconstruct the mobility flows of the users. More concretely, we try to gain insight into high-frequency migration patterns within Swiss territory. The implementation is given in [mobility_patterns](mobility_patterns.ipynb). We also analyse data aggregated over all years in the [mobility_patterns_aggregated](mobility_patterns_aggregated.ipynb) notebook.

The second task of the project is to detect events. Here, we focus on the dates and locations of such events. The implementation is given in [event_detection](event_detection.ipynb).

The third and final part of the project deals with sentiment analysis. Here, we focus on the tweets linked to events (as detected in the [event_detection](event_detection.ipynb) notebook) and perform a sentiment analysis for each event. The implementation of the sentiment analysis is given in the [sentiment_analysis](sentiment_analysis.ipynb) notebook.


<a name="users_std"></a>
## Event Detection

This notebook focuses on event detection in Switzerland. Specifically, we perform the following tasks using the given dataset:

* we try to determine the location and date of an event using two methods:
    1. the first method uses DBSCAN for event detection
    2. the second method uses a heuristic approach
* we determine the number of users per event. People often tweet about the same subject several times during one day. Our methods are agnostic to that, so such tweets register as an event (many tweets from the same place). However, by counting the number of users per event, we can identify events created by spammers (i.e. a single user) and, if necessary, filter them out.

* we measure the standard deviation of the timestamps of the tweets that describe the same event. A small standard deviation means that the tweets were sent within a short time frame, whereas a large one means that they are spread throughout the day. An event typically lasts a few hours, so a small standard deviation is more likely to indicate a real event.

In order to detect events, several assumptions are taken into account. These assumptions are described in detail later in the report.

In [1]:
%matplotlib inline
from libraries import *
from utils_event_detection import *

### Data Loading

Since our dataset contains tweets from 2010 to 2016, we decide to do a yearly based analysis. This approach gives us the following advantages:
* the yearly analysis reveals how people use Twitter as time evolves (e.g. social networks become more popular, so it is probable that people used Twitter more in 2013 compared to 2010)
* we detect events on a yearly basis

**NOTE:** For 2016, the <code>tweets_with_text_2016.csv</code> file contains a line that prevents the dataset from being loaded successfully. Due to lack of time we could not investigate the problem, so we pass the <code>skiprows=[2588468]</code> argument to <code>read_csv()</code> to skip that line.
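As a minimal illustration of that workaround on toy data (not the real file), <code>skiprows</code> takes 0-indexed line numbers, so the offending line is dropped before parsing. The sample content and the position of the bad line are assumptions made for this example.

```python
import io
import pandas as pd

# toy pipe-separated content; line index 2 (0-indexed) stands in for the bad line
raw = (
    "1|10|2016-01-01 10:00:00|first tweet\n"
    "2|11|2016-01-01 11:00:00|second tweet\n"
    "3|12|2016-01-01 12:00:00|bad|line|with|extra|fields\n"
    "4|13|2016-01-01 13:00:00|fourth tweet\n"
)

# skip the problematic line by its 0-indexed position
sample = pd.read_csv(io.StringIO(raw), sep='|', header=None, skiprows=[2])
sample.columns = ['tweetId', 'userId', 'createdAt', 'text']
print(sample['tweetId'].tolist())
```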

In [2]:
# year to be analyzed
year = '2010'
# file name
file_name = '../../data/tweets_with_text_' + year + '.csv'
# loading data (treating NaN values and datetime format)
data = pd.read_csv(file_name, sep='|', na_values=['\\N'], header=None, parse_dates=[2])
# give proper names to columns
data.columns = ['tweetId', 'userId', 'createdAt', 'text']
# set tweet ID as index
data.set_index('tweetId', inplace=True)
# display dataframe
data.head()

| tweetId | userId | createdAt | text |
|---|---|---|---|
| 9514097914 | 17341045 | 2010-02-23 05:55:51 | Guuuuten Morgen! :-) |
| 9514846412 | 7198282 | 2010-02-23 06:22:40 | Still the best coffee in town — at La Stanza h... |
| 9516574359 | 14657884 | 2010-02-23 07:34:25 | It has been a week or so.. and today I just co... |
| 9516952605 | 14703863 | 2010-02-23 07:51:47 | Getting ready.. http://twitpic.com/14v8gz |
| 9517198943 | 14393717 | 2010-02-23 08:02:57 | Un peu de réconfort liquide en take away après... |


We also load the materialized dataframe from the [mobility_patterns](mobility_patterns.ipynb) file. This contains geolocated information about every tweet.

In [3]:
file_name = '../../data/processed_tweets_' + year + '.csv'
tweets = pd.read_csv(file_name, sep='|', index_col='tweetId', 
                     parse_dates=[2])

Using index matching, we put the tweet text in the <code>tweets</code> dataframe and drop any rows that do not contain text.

In [4]:
# create text column in the tweets dataframe
tweets['text'] = data['text']
# drop NaN rows
tweets = tweets.dropna(subset=['text'])
# display dataframe
tweets.head()

| tweetId | userId | createdAt | longitude | latitude | atWork | hourOfTweet | text |
|---|---|---|---|---|---|---|---|
| 9514097914 | 17341045 | 2010-02-23 05:55:51 | 7.43926 | 46.9489 | False | 5 | Guuuuten Morgen! :-) |
| 9514846412 | 7198282 | 2010-02-23 06:22:40 | 8.53781 | 47.3678 | False | 6 | Still the best coffee in town — at La Stanza h... |
| 9516574359 | 14657884 | 2010-02-23 07:34:25 | 6.13396 | 46.1951 | False | 7 | It has been a week or so.. and today I just co... |
| 9517198943 | 14393717 | 2010-02-23 08:02:57 | 6.63254 | 46.5199 | True | 8 | Un peu de réconfort liquide en take away après... |
| 9517916537 | 13535402 | 2010-02-23 08:35:39 | 8.5301 | 47.3152 | True | 8 | I'm at Online PC Magazin in Adliswil http://go... |


<a name="requirements"></a>
### Data Preprocessing

We try to detect events using the tweet text and the geolocated information. Our analysis is based on the following assumptions:

* events should happen during one day and all tweets that describe one particular event should have the same date.
* the tweet text should contain at least one hashtag that may describe the event.
* in order to detect an event, we need at least 5 tweets with the same hashtag on a particular day.
* we are interested in detecting events using geolocated information, therefore the tweets about an event should be posted from approximately the same location. We assume that events take place in a small area. To detect events, we reduce the accuracy of the GPS location, as described [here](https://en.wikipedia.org/wiki/Decimal_degrees).

We start by finding the day of each tweet using the given timestamp. Then we find all hashtags for each tweet. In case the tweet text does not contain any hashtag, we assume that the tweet does not refer to any event and therefore remove it.
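The <code>keep_hashtags</code> helper used in the next cell lives in <code>utils_event_detection</code> and is not shown here. A hypothetical sketch of what it might do, assuming hashtags are whitespace-separated tokens starting with <code>#</code> and that a missing value is returned when none is found:

```python
def keep_hashtags_sketch(text):
    """Return the set of whitespace-separated tokens that start with '#'.

    Hypothetical stand-in for the notebook's keep_hashtags helper; returns
    None when no hashtag is found so that dropna() can remove the row.
    """
    hashtags = {token for token in text.split()
                if token.startswith('#') and len(token) > 1}
    return hashtags if hashtags else None


print(keep_hashtags_sketch('So, Feierabend. Jetzt #24 und später #VfB.'))
print(keep_hashtags_sketch('Guuuuten Morgen! :-)'))
```

Splitting on whitespace keeps trailing punctuation (e.g. <code>#VfB.</code>), which is why a separate normalisation step is still needed afterwards.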

In [5]:
# find day of tweet
tweets['dayOfTweet'] = tweets['createdAt'].dt.date
# find hashtags for each tweet
tweets['hashtags'] = tweets['text'].apply(lambda row: keep_hashtags(row))
# remove rows without hashtags
tweets.dropna(subset=['hashtags'], inplace=True)
# display dataframe
tweets.head()

| tweetId | userId | createdAt | longitude | latitude | atWork | hourOfTweet | text | dayOfTweet | hashtags |
|---|---|---|---|---|---|---|---|---|---|
| 9519737890 | 14657884 | 2010-02-23 09:59:41 | 6.1387 | 46.175 | True | 9 | Magic spells run off after midnight, I guess s... | 2010-02-23 | {#fb} |
| 9521789689 | 9962022 | 2010-02-23 11:28:27 | 6.33641 | 46.4631 | True | 11 | Limitas of public transportation! No taxi, rai... | 2010-02-23 | {#yam} |
| 9535390586 | 921241 | 2010-02-23 17:47:11 | 9.1657 | 47.6463 | True | 17 | So, Feierabend. Jetzt #24 und später #VfB. — a... | 2010-02-23 | {#VfB., #24} |
| 9536575795 | 14260616 | 2010-02-23 18:19:03 | 8.2601 | 47.4576 | False | 18 | Greetings ! http://tinyurl.com/ycrdlhq #iPhone... | 2010-02-23 | {#iPhoneography} |
| 9537030723 | 14542024 | 2010-02-23 18:31:46 | 8.51865 | 47.3703 | False | 18 | Lexmark launches Evernote SmartSolution! Send ... | 2010-02-23 | {#partnermonth} |


It is possible that a row contains more than one hashtag. Here, we create duplicates for such rows. We wish to have one hashtag per row.
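The next cell does this with an explicit loop. As an alternative sketch, assuming a pandas version that provides <code>DataFrame.explode</code> (0.25 or later, which may be newer than the one used for this notebook), the same reshaping can be written without iterating over rows. The toy values are made up.

```python
import pandas as pd

# toy frame with a set of hashtags per tweet (values are made up)
toy = pd.DataFrame({
    'userId': [921241, 14260616],
    'hashtags': [{'#VfB.', '#24'}, {'#iPhoneography'}],
})

# explode() needs list-likes; sorting the set makes the row order deterministic
toy['hashtag'] = toy['hashtags'].apply(sorted)
one_per_row = (toy.drop(columns='hashtags')
                  .explode('hashtag')
                  .reset_index(drop=True))
print(one_per_row)
```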

In [6]:
pairs = []
# iterate through all rows
for _, row in tweets.iterrows():
    # for each value in the hashtags, create one new row
    for hashtag in row['hashtags']:
        # append the new row to the list
        pairs.append((row['userId'], row['createdAt'], row['longitude'], 
                      row['latitude'], row['dayOfTweet'], hashtag))
# create a new dataframe from the pairs list and give meaningful names to the columns
df = pd.DataFrame(pairs, columns=['userId', 'createdAt', 'longitude', 'latitude', 'dayOfTweet', 'hashtag'])

Finally, the hashtags should be preprocessed to avoid mismatches due to case sensitivity. Therefore, everything is converted to lower case and punctuation is removed.
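The <code>hashtag_preprocess</code> helper is defined in <code>utils_event_detection</code> and not shown here. A hypothetical sketch that reproduces the behaviour visible in the output (e.g. <code>#VfB.</code> becoming <code>#vfb</code>), assuming the leading <code>#</code> is re-added after punctuation removal:

```python
import string

# same translation table as in the notebook: every punctuation character maps to None
translator = str.maketrans({key: None for key in string.punctuation})


def hashtag_preprocess_sketch(hashtag, translator):
    """Lower-case a hashtag and strip punctuation, keeping a single leading '#'.

    Hypothetical stand-in for hashtag_preprocess from utils_event_detection.
    """
    # '#' is itself part of string.punctuation, so it is removed and re-added
    return '#' + hashtag.translate(translator).lower()


print(hashtag_preprocess_sketch('#VfB.', translator))  # prints "#vfb"
```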

In [7]:
# create a dictionary using a comprehension - this maps every character from string.punctuation to None
# initialize a translation object from it.
translator = str.maketrans({key: None for key in string.punctuation})
# process hashtags
df['hashtag'] = df.apply(lambda row: hashtag_preprocess(row['hashtag'], translator), axis=1)
# display dataframe
df.head()

|  | userId | createdAt | longitude | latitude | dayOfTweet | hashtag |
|---|---|---|---|---|---|---|
| 0 | 14657884 | 2010-02-23 09:59:41 | 6.1387 | 46.175 | 2010-02-23 | #fb |
| 1 | 9962022 | 2010-02-23 11:28:27 | 6.33641 | 46.4631 | 2010-02-23 | #yam |
| 2 | 921241 | 2010-02-23 17:47:11 | 9.1657 | 47.6463 | 2010-02-23 | #vfb |
| 3 | 921241 | 2010-02-23 17:47:11 | 9.1657 | 47.6463 | 2010-02-23 | #24 |
| 4 | 14260616 | 2010-02-23 18:19:03 | 8.2601 | 47.4576 | 2010-02-23 | #iphoneography |


## Detecting Events
<a name="dbscan"></a>
### Machine Learning Approach: Event detection using DBSCAN
We group our data based on day of tweet and hashtag. We use the <code>size</code> operation to count the number of tweets on a particular day with a particular hashtag.

In [8]:
# groupby dayOfTweet and hashtag, then find the size of each group
df_grouped = df.groupby(by=['dayOfTweet', 'hashtag']).size()
# give a meaningful name to the Series
df_grouped.rename('numOfTweets', inplace=True)
# display Series
df_grouped.head()

dayOfTweet  hashtag       
2010-02-23  #24               1
            #fb               1
            #iphoneography    1
            #partnermonth     1
            #vfb              1
Name: numOfTweets, dtype: int64

<a name="df"></a>
We keep only candidate events that have at least 5 tweets, i.e. **there are at least 5 tweets on the same day with the same hashtag**.
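The next cell attaches the group size to every row through the <code>fill_num_of_tweets</code> helper. As a sketch of an alternative, the same count can be broadcast back to the rows with <code>groupby(...).transform('size')</code>. The toy data below is made up.

```python
import pandas as pd

# toy data: one hashtag/day pair with 5 tweets, another with 1 (made-up values)
toy = pd.DataFrame({
    'dayOfTweet': ['2010-03-07'] * 5 + ['2010-03-08'],
    'hashtag': ['#fb'] * 5 + ['#yam'],
})

min_tweets = 5
# broadcast the size of each (day, hashtag) group back to its rows
toy['numOfTweets'] = toy.groupby(['dayOfTweet', 'hashtag'])['hashtag'].transform('size')
# keep only candidate events that reach the threshold
candidates = toy[toy['numOfTweets'] >= min_tweets]
print(candidates)
```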

In [9]:
# set threshold for min number of tweets per event
min_tweets = 5
# find how many tweets were posted on a particular day with a particular hashtag
df = df.apply(lambda row: fill_num_of_tweets(row, df_grouped), axis=1)
# remove those that do not exceed the threshold value
df = df[df['numOfTweets'] >= min_tweets]
# display dataframe
df.head()

|  | userId | createdAt | longitude | latitude | dayOfTweet | hashtag | numOfTweets |
|---|---|---|---|---|---|---|---|
| 105 | 14657884 | 2010-03-07 10:15:03 | 6.14474 | 46.1958 | 2010-03-07 | #fb | 6 |
| 109 | 17341045 | 2010-03-07 12:18:19 | 7.41919 | 46.9377 | 2010-03-07 | #fb | 6 |
| 114 | 17341045 | 2010-03-07 17:43:44 | 7.41919 | 46.9377 | 2010-03-07 | #fb | 6 |
| 116 | 13743402 | 2010-03-07 20:48:23 | 10.3167 | 47.7273 | 2010-03-07 | #fb | 6 |
| 117 | 14657884 | 2010-03-07 22:51:16 | 6.14474 | 46.1958 | 2010-03-07 | #fb | 6 |


In order to explain the information contained in the <code>df</code> dataframe, we give the following example taken from the sub-dataset of year **2010**. ![title](../../data/figures/df_example.png)

Here, we see that there are 6 tweets with the hashtag #fb posted on 07.03.2010 by 5 different users (5 unique user IDs).

So far, we have satisfied 3 out of the 4 [requirements](#requirements) described above. The $4^{th}$ requirement is that tweets are posted from approximately the same location. 

In order to do that, we pass the geolocation information for possible events into DBSCAN. In case the tweets are posted from the same location, DBSCAN will create a cluster. Otherwise, these tweets are considered to be noise and we say that no event is detected. 

We start by grouping tweets based on common day and hashtag. The coordinates of each sub-group are fed into DBSCAN. The output of the cell below lists all events detected using DBSCAN, together with their date and descriptive hashtag. We assume that events take place in a small area (e.g. stadium, conference, festival, etc.) and we reduce the accuracy to approximately 
* 6.4 square kilometers (accuracy equal to 0.01 for DBSCAN and 2 for the heuristic)
* 640 square meters (accuracy equal to 0.001 for DBSCAN and 3 for the heuristic)

in order to also absorb inaccuracies in the GPS location measurements. **This parameter significantly affects the number of events detected**.

**REMINDER:** We are interested in detecting events using geolocated information and not detecting events in general. Some of the events detected by DBSCAN may not correspond to real events. However, the algorithm detects them because they fulfil all our requirements.
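The <code>detect_event_dbscan</code> helper is not shown in this notebook. The following is only a sketch of the clustering step for a single (day, hashtag) group, assuming scikit-learn's <code>DBSCAN</code> with the accuracy used as <code>eps</code>, <code>min_tweets</code> used as <code>min_samples</code>, and plain Euclidean distance in degrees. The function name and the sample points are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_locations_sketch(coords, eps, min_samples):
    """Cluster (latitude, longitude) pairs and return one centre per cluster.

    Hypothetical helper: distances are plain Euclidean distances in degrees,
    which is only a rough approximation over small areas.
    """
    coords = np.asarray(coords, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(coords).labels_
    # label -1 marks noise, i.e. tweets that do not belong to any cluster
    return [tuple(coords[labels == label].mean(axis=0))
            for label in sorted(set(labels)) if label != -1]


# five made-up tweets around one venue plus one far-away tweet
points = [(47.3830, 8.5360), (47.3831, 8.5361), (47.3829, 8.5359),
          (47.3832, 8.5362), (47.3830, 8.5358), (46.1940, 6.1540)]
centres = cluster_locations_sketch(points, eps=0.001, min_samples=5)
print(centres)
```

The five nearby points form one cluster, while the isolated point is labelled as noise and produces no event.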

In [10]:
# reduce GPS accuracy
accuracy = 0.001
# group by day of tweet and hashtag
df_grouped = df.groupby(by=['dayOfTweet', 'hashtag'])
# find events
list_of_events_dbscan = detect_event_dbscan(df_grouped, accuracy, min_tweets)

Date:  2010-03-25 	 Location:  ('47.365', '8.539') 	 Hashtags:  #bosw
Date:  2010-04-19 	 Location:  ('47.366', '8.545') 	 Hashtags:  #sechselaeuten
Date:  2010-05-29 	 Location:  ('47.162', '8.291') 	 Hashtags:  #esc
Date:  2010-06-05 	 Location:  ('47.667', '9.170') 	 Hashtags:  #bcbs10
Date:  2010-06-24 	 Location:  ('47.411', '8.552') 	 Hashtags:  #swisscrmforum
Date:  2010-06-24 	 Location:  ('47.411', '8.551') 	 Hashtags:  #swisscrmforum
Date:  2010-10-07 	 Location:  ('46.289', '7.972') 	 Hashtags:  #fb
Date:  2010-10-15 	 Location:  ('47.567', '7.597') 	 Hashtags:  #gotthard
Date:  2010-10-20 	 Location:  ('47.383', '8.536') 	 Hashtags:  #tedxzh
Date:  2010-12-03 	 Location:  ('47.588', '9.624') 	 Hashtags:  #ff
Date:  2010-12-06 	 Location:  ('46.194', '6.154') 	 Hashtags:  #tedxgeneva
Date:  2010-12-08 	 Location:  ('47.364', '8.535') 	 Hashtags:  #iabc
Date:  2010-12-08 	 Location:  ('47.364', '8.535') 	 Hashtags:  #switzerland


Now, we have detected possible events. We try to find evidence that these are indeed events. This is indicated by
* the number of users that posted for a particular event
* the standard deviation of the timestamps of the events

The reasoning is given [here](#users_std).

We create a new dataframe that contains only users that made tweets for the events we detected so far.

In [11]:
# initialize dataframe
new_df = pd.DataFrame()
# iterate through list of events
for item in list_of_events_dbscan:
    # find all records that match the day of tweet and the hashtag for each detected event
    temp = df[(df['dayOfTweet'] == item[0]) & (df['hashtag'] == item[1])]
    # concatenate to new dataframe
    new_df = pd.concat([new_df, temp])
# keep only the columns needed for counting users
new_df = new_df[['dayOfTweet', 'hashtag', 'userId']]
# display dataframe
new_df.head()

|  | dayOfTweet | hashtag | userId |
|---|---|---|---|
| 284 | 2010-03-25 | #bosw | 15402923 |
| 285 | 2010-03-25 | #bosw | 15402923 |
| 287 | 2010-03-25 | #bosw | 15402923 |
| 289 | 2010-03-25 | #bosw | 15402923 |
| 290 | 2010-03-25 | #bosw | 15402923 |


Now, we wish to determine how many different users posted for each detected event.

In [12]:
# group by hashtag and dayOfTweet and count unique user IDs
new_df = new_df.groupby(by=['hashtag', 'dayOfTweet'])['userId'].nunique()
# give meaningful name to column
new_df.rename('usersPerHashtag', inplace=True)
# convert to dataframe (needed for later)
new_df = new_df.to_frame()

We define a spammer threshold. If a hashtag is posted by fewer users than the spammer threshold, we have evidence that the detected event may not be a true event. The algorithm nevertheless detected it as an event, since it met all the aforementioned [requirements](#requirements).

In [13]:
# define a spammer threshold
spammer_threshold = 2
# find if events are spam or not
new_df['spamEvent'] = new_df['usersPerHashtag'] < spammer_threshold
# display dataframe
new_df.head()

| hashtag | dayOfTweet | usersPerHashtag | spamEvent |
|---|---|---|---|
| #bcbs10 | 2010-06-05 | 1 | True |
| #bosw | 2010-03-25 | 1 | True |
| #esc | 2010-05-29 | 4 | False |
| #fb | 2010-10-07 | 3 | False |
| #ff | 2010-12-03 | 2 | False |


There is one final thing left to do: we calculate the standard deviation of the timestamps of each detected event. To do so, we gather the timestamps of each event, convert them to the number of seconds since midnight and calculate the standard deviation. 

A small standard deviation means that the tweets were sent within a short time frame, whereas a large standard deviation means that they are spread throughout the day. An event typically lasts a few hours, so a small standard deviation is more likely to indicate a real event.
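The <code>std_of_events</code> helper is not shown here, so the exact formula and unit it uses are unknown. The sketch below only illustrates the idea, assuming the population standard deviation of seconds since midnight. The function names and timestamps are hypothetical.

```python
from datetime import datetime
from statistics import pstdev


def seconds_since_midnight(timestamp):
    """Convert a datetime to the number of seconds elapsed since 00:00:00."""
    return timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second


def time_spread_sketch(timestamps):
    """Population standard deviation (in seconds) of tweet times within one day."""
    return pstdev(seconds_since_midnight(ts) for ts in timestamps)


# made-up timestamps: a burst within ten minutes vs. tweets spread over the day
burst = [datetime(2010, 10, 20, 14, m) for m in (0, 2, 5, 7, 10)]
spread = [datetime(2010, 10, 20, h) for h in (8, 11, 14, 18, 22)]
print(time_spread_sketch(burst), time_spread_sketch(spread))
```

The burst yields a much smaller value than the tweets spread over the day, which is the signal used to judge whether a detected event looks real.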

In [14]:
# find std of events
std_dict = std_of_events(df, new_df)
# reset index
new_df.reset_index(inplace=True)
# initialize column
new_df['std'] = np.nan
# fill std for each event
new_df = new_df.apply(lambda row: fill_std(row, std_dict), axis=1)
# sort dataframe by number of users, then by std (descending)
new_df.sort_values(by=['usersPerHashtag', 'std'], inplace=True, ascending=False)
# reset indexes
new_df.reset_index(inplace=True, drop=True)
# set locations of events
new_df['approxLocation'] = new_df.apply(lambda row: set_event_location(row, list_of_events_dbscan), axis=1)
# save dataframe
file_name = '../../data/detected_events_dbscan_' + year + '.csv'
new_df.to_csv(file_name, sep='|')
# display dataframe
new_df.head()

|  | hashtag | dayOfTweet | usersPerHashtag | spamEvent | std | approxLocation |
|---|---|---|---|---|---|---|
| 0 | #tedxzh | 2010-10-20 | 5 | False | 178.505886 | (47.383, 8.536) |
| 1 | #gotthard | 2010-10-15 | 4 | False | 159.250786 | (47.567, 7.597) |
| 2 | #esc | 2010-05-29 | 4 | False | 61.585952 | (47.162, 8.291) |
| 3 | #fb | 2010-10-07 | 3 | False | 257.948167 | (46.289, 7.972) |
| 4 | #tedxgeneva | 2010-12-06 | 2 | False | 153.596919 | (46.194, 6.154) |


 <a name="non_spam_dbscan"></a>
Now, we print all events that are not classified as spam. In the section [below](#non_spam_heuristic), you can find the list of non spam events detected using the heuristic approach.

In [15]:
# list of non spam event for DBSCAN
dbscan_non_spam = []
for _, event in new_df.iterrows():
    spam = event['spamEvent']
    date = event['dayOfTweet']
    hashtag = event['hashtag']
    # print those that are not spam
    if not spam:
        # append to list and print event
        dbscan_non_spam.append((date, hashtag))
        print('Date: ', date, '\t', 'Hashtag: ', hashtag)

Date:  2010-10-20 	 Hashtag:  #tedxzh
Date:  2010-10-15 	 Hashtag:  #gotthard
Date:  2010-05-29 	 Hashtag:  #esc
Date:  2010-10-07 	 Hashtag:  #fb
Date:  2010-12-06 	 Hashtag:  #tedxgeneva
Date:  2010-12-03 	 Hashtag:  #ff


### Heuristic Approach

We also follow a second approach for detecting events. This approach does not involve any machine learning algorithm. On the contrary, it is based on a heuristic that reduces the accuracy of the GPS coordinates. Tweets that are posted from approximately the same location should have the same longitude and latitude after the accuracy reduction. 

**The accuracy reduction should be consistent with the accuracy used in the DBSCAN model**. For example, an accuracy of 0.01 (or 0.001) in DBSCAN corresponds to an accuracy of 2 (or 3) decimal places in the heuristic.
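The <code>reduce_location_accuracy</code> helper is not shown in this notebook. A hypothetical sketch of the rounding step, assuming the approximate location is stored as a formatted <code>(latitude, longitude)</code> string as in the outputs of this notebook:

```python
def approx_location_sketch(latitude, longitude, accuracy):
    """Format a coordinate pair with `accuracy` decimal places.

    Hypothetical stand-in for reduce_location_accuracy; tweets whose rounded
    coordinates coincide fall into the same grid cell.
    """
    return '({:.{p}f}, {:.{p}f})'.format(latitude, longitude, p=accuracy)


print(approx_location_sketch(46.1958, 6.14474, 3))  # prints "(46.196, 6.145)"
```

Two tweets are then treated as coming from the same place exactly when their formatted strings are equal.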

We work with the <code>df</code> dataframe. This dataframe was last modified [here](#df) and does not contain any information produced by DBSCAN. However, we drop the <code>numOfTweets</code> column, since we are going to recompute it in a different way.

In [16]:
# remove unnecessary column
df.drop('numOfTweets', axis=1, inplace=True)
# define accuracy according to DBSCAN's respective value
accuracy = 3
# reduce the accuracy
df = df.apply(lambda row: reduce_location_accuracy(row, accuracy), axis=1)
# display dataframe
df.head()

|  | userId | createdAt | longitude | latitude | dayOfTweet | hashtag | approxLocation |
|---|---|---|---|---|---|---|---|
| 105 | 14657884 | 2010-03-07 10:15:03 | 6.14474 | 46.1958 | 2010-03-07 | #fb | (46.196, 6.145) |
| 109 | 17341045 | 2010-03-07 12:18:19 | 7.41919 | 46.9377 | 2010-03-07 | #fb | (46.938, 7.419) |
| 114 | 17341045 | 2010-03-07 17:43:44 | 7.41919 | 46.9377 | 2010-03-07 | #fb | (46.938, 7.419) |
| 116 | 13743402 | 2010-03-07 20:48:23 | 10.3167 | 47.7273 | 2010-03-07 | #fb | (47.727, 10.317) |
| 117 | 14657884 | 2010-03-07 22:51:16 | 6.14474 | 46.1958 | 2010-03-07 | #fb | (46.196, 6.145) |


Our heuristic detects events using the reduced geolocated accuracy embedded in the <code>approxLocation</code> column. Thus, we group by day of tweet, hashtag and reduced location and try to detect events.

In [17]:
# group by and count tweets per index
df_grouped = df.groupby(by=['dayOfTweet', 'approxLocation', 'hashtag']).size()
# give column meaningful name
df_grouped = df_grouped.rename('numOfTweets')

An event is considered to take place if at least 5 tweets are posted with the same hashtag, on the same day, from the same location. The remaining records are filtered out.

In [18]:
# filtering out rows with less than 5 tweets
df_grouped = df_grouped[df_grouped >= min_tweets]
# convert to frame (used later)
df_grouped = df_grouped.to_frame()
# display dataframe
df_grouped.head()

| dayOfTweet | approxLocation | hashtag | numOfTweets |
|---|---|---|---|
| 2010-03-22 | (47.054, 8.311) | #4to7 | 6 |
| 2010-03-25 | (47.365, 8.539) | #bosw | 7 |
| 2010-05-29 | (47.162, 8.291) | #esc | 10 |
| 2010-06-03 | (47.382, 8.536) | #twitref | 6 |
| 2010-06-05 | (47.667, 9.170) | #bcbs10 | 6 |


Now, we join the two aforementioned dataframes (<code>df</code> and <code>df_grouped</code>) and create a new one that contains the <code>numOfTweets</code> column. For each row, this column gives the number of tweets that share its date, hashtag and approximate location.

In [19]:
# join the two dataframes
joined_df = pd.merge(df, df_grouped, how='inner', left_on=['dayOfTweet', 'approxLocation', 'hashtag'], 
                     right_index=True)
# keep necessary columns
joined_df = joined_df[['userId', 'dayOfTweet', 'approxLocation', 'hashtag', 'numOfTweets']]
# drop duplicate rows
joined_df.drop_duplicates(inplace=True)
# display dataframe
joined_df.head()

|  | userId | dayOfTweet | approxLocation | hashtag | numOfTweets |
|---|---|---|---|---|---|
| 250 | 15596215 | 2010-03-22 | (47.054, 8.311) | #4to7 | 6 |
| 284 | 15402923 | 2010-03-25 | (47.365, 8.539) | #bosw | 7 |
| 1024 | 1079931 | 2010-05-29 | (47.162, 8.291) | #esc | 10 |
| 1149 | 6848912 | 2010-06-03 | (47.382, 8.536) | #twitref | 6 |
| 1189 | 10517692 | 2010-06-05 | (47.667, 9.170) | #bcbs10 | 6 |


From now on, we follow exactly the same procedure as in the [DBSCAN](#dbscan) section. More concretely, we find the number of users per event, flag potential spam events and finally estimate the standard deviation of the timestamps of each event.

In [20]:
# find users per event
users_per_hashtag = joined_df.groupby(by=['hashtag', 'dayOfTweet', 'approxLocation']).size()
# detect potential spam events
joined_df = joined_df.apply(lambda row: spam_events(row, users_per_hashtag, spammer_threshold), axis=1)
# display dataframe
joined_df.head()

|  | userId | dayOfTweet | approxLocation | hashtag | numOfTweets | spamEvent | usersPerHashtag |
|---|---|---|---|---|---|---|---|
| 250 | 15596215 | 2010-03-22 | (47.054, 8.311) | #4to7 | 6 | True | 1 |
| 284 | 15402923 | 2010-03-25 | (47.365, 8.539) | #bosw | 7 | True | 1 |
| 1024 | 1079931 | 2010-05-29 | (47.162, 8.291) | #esc | 10 | True | 1 |
| 1149 | 6848912 | 2010-06-03 | (47.382, 8.536) | #twitref | 6 | True | 1 |
| 1189 | 10517692 | 2010-06-05 | (47.667, 9.170) | #bcbs10 | 6 | True | 1 |


Once the users per event and the potential spam events are determined, we continue by determining the standard deviation of the timestamps for each event.

In [21]:
# estimate std for each event
std_dict = std_of_events(df)
# fill std value to the dataframe
event_detection = joined_df.apply(lambda row: fill_std(row, std_dict), axis=1)
# sort dataframe by number of users, then by std (descending)
event_detection.sort_values(by=['usersPerHashtag', 'std'], inplace=True, ascending=False)
# drop unnecessary columns
event_detection.drop('userId', axis=1, inplace=True)
# drop duplicates
event_detection.drop_duplicates(subset=['dayOfTweet', 'approxLocation', 'hashtag'], inplace=True)
# save dataframe
file_name = '../../data/detected_events_heuristic_' + year + '.csv'
event_detection.to_csv(file_name, sep='|')
# display dataframe
event_detection.head()

|  | dayOfTweet | approxLocation | hashtag | numOfTweets | spamEvent | usersPerHashtag | std |
|---|---|---|---|---|---|---|---|
| 3872 | 2010-10-20 | (47.383, 8.536) | #tedxzh | 5 | False | 3 | 150.121354 |
| 5361 | 2010-12-10 | (47.561, 9.637) | #wikileaks | 6 | True | 1 | 409.51946 |
| 3454 | 2010-10-07 | (46.289, 7.972) | #fb | 5 | True | 1 | 196.806522 |
| 1189 | 2010-06-05 | (47.667, 9.170) | #bcbs10 | 6 | True | 1 | 140.369478 |
| 1589 | 2010-06-24 | (47.412, 8.552) | #swisscrmforum | 7 | True | 1 | 130.419359 |


<a name="non_spam_heuristic"></a>
Here, we create a list of all the events detected using the heuristic, and print those that are not spam events. The respective list of detected non spam events using DBSCAN is given [here](#non_spam_dbscan).

In [22]:
# list of all events
list_of_events_heuristic = []
# list of non spam events for heuristic
heuristic_non_spam = []
for _, event in event_detection.iterrows():
    spam = event['spamEvent']
    date = event['dayOfTweet']
    hashtag = event['hashtag']
    approxLocation = event['approxLocation']
    list_of_events_heuristic.append((date, hashtag, approxLocation))
    # print those that are not spam
    if not spam:
        # append to list and print
        heuristic_non_spam.append((date, hashtag))
        print('Date: ', date, '\t', 'Hashtag: ', hashtag)

Date:  2010-10-20 	 Hashtag:  #tedxzh


### Comparing the two Methods

Here, we provide a comparison of the two methods, given their results. The analysis is done using:
* the full list of events
* the reduced list of events that remains after filtering out spam events

In [23]:
analyse_performance(list_of_events_dbscan, list_of_events_heuristic)

Number of events detected with DBSCAN =  12
Number of events detected with heuristic =  20
---------------------------------------------
The two methods found 11 events in common
Common events:
('2010-06-05', '#bcbs10')
('2010-12-06', '#tedxgeneva')
('2010-12-08', '#iabc')
('2010-10-20', '#tedxzh')
('2010-06-24', '#swisscrmforum')
('2010-12-08', '#switzerland')
('2010-05-29', '#esc')
('2010-12-03', '#ff')
('2010-10-15', '#gotthard')
('2010-03-25', '#bosw')
('2010-10-07', '#fb')
---------------------------------------------
Found only by DBSCAN:
('2010-04-19', '#sechselaeuten')
---------------------------------------------
Found only by heuristic:
('2010-12-10', '#wikileaks')
('2010-11-30', '#jmstv')
('2010-11-19', '#fb')
('2010-11-19', '#mannechoched')
('2010-12-01', '#jmstv')
('2010-03-22', '#4to7')
('2010-06-03', '#twitref')
('2010-11-03', '#smgzh')
('2010-12-06', '#onerepulic')


Now, we do the same analysis by using only events that are flagged as non spam.
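The <code>analyse_performance</code> helper is not shown in this notebook. A hypothetical sketch of such a comparison using set operations, assuming events are matched on their (date, hashtag) pair only:

```python
def compare_events_sketch(events_a, events_b):
    """Split two event lists into common, only-in-A and only-in-B sets.

    Hypothetical stand-in for analyse_performance; events are matched on
    (date, hashtag) only, so any extra tuple fields are ignored.
    """
    keys_a = {(str(event[0]), event[1]) for event in events_a}
    keys_b = {(str(event[0]), event[1]) for event in events_b}
    return keys_a & keys_b, keys_a - keys_b, keys_b - keys_a


# a few events taken from the outputs of this notebook
dbscan_events = [('2010-10-20', '#tedxzh'), ('2010-04-19', '#sechselaeuten')]
heuristic_events = [('2010-10-20', '#tedxzh', '(47.383, 8.536)'),
                    ('2010-12-10', '#wikileaks', '(47.561, 9.637)')]
common, only_a, only_b = compare_events_sketch(dbscan_events, heuristic_events)
print(len(common), len(only_a), len(only_b))
```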

In [24]:
analyse_performance(dbscan_non_spam, heuristic_non_spam)

Number of events detected with DBSCAN =  6
Number of events detected with heuristic =  1
---------------------------------------------
The two methods found 1 events in common
Common events:
('2010-10-20', '#tedxzh')
---------------------------------------------
Found only by DBSCAN:
('2010-10-15', '#gotthard')
('2010-05-29', '#esc')
('2010-10-07', '#fb')
('2010-12-06', '#tedxgeneva')
('2010-12-03', '#ff')
---------------------------------------------
Found only by heuristic:



### Commenting on the Results

In many cases, the heuristic approach captures more events than DBSCAN. However, once spam events are filtered out, DBSCAN does a better job and retains more events. Here, we give a possible explanation.

**Why does DBSCAN capture fewer events when no spam filtering is used?**

DBSCAN searches a circular area around each point to detect neighbors and form clusters, whereas the heuristic uses square areas. Therefore, the area covered by the heuristic is bigger and may include more points, and thus more probable events. Hence, the heuristic is able to capture more events when no spam filtering is used. Events are flagged as spam if only one user tweets about them.

**Why does DBSCAN capture more events when spam filtering is used?**

With DBSCAN, once a cluster is formed, it keeps expanding as long as there are data points close to it, i.e. nearby data points are merged into a single cluster. The heuristic approach cannot merge neighboring clusters. It can happen that two different users tweet about the same event but form two different clusters. DBSCAN merges them into a single cluster containing two users, whereas the heuristic creates one cluster per user. Thus, the event is flagged as spam by the heuristic approach and as non-spam by DBSCAN.
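This effect can be reproduced on a small made-up scenario. The sketch below assumes scikit-learn's <code>DBSCAN</code> with <code>eps=0.001</code> and a rounding to 3 decimal places for the heuristic. The users and coordinates are hypothetical.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN

# made-up scenario: two users tweet 5 times each, about 0.0004 degrees apart
tweets = [('userA', 47.3650, 8.5393)] * 5 + [('userB', 47.3650, 8.5397)] * 5

# heuristic: rounding to 3 decimals puts the two users into different cells
cells = defaultdict(set)
for user, lat, lon in tweets:
    cells['({:.3f}, {:.3f})'.format(lat, lon)].add(user)

# DBSCAN: all ten points lie within eps of each other and form one cluster
coords = np.array([(lat, lon) for _, lat, lon in tweets])
labels = DBSCAN(eps=0.001, min_samples=5).fit(coords).labels_
users_in_cluster = {user for (user, _, _), label in zip(tweets, labels) if label == 0}

print({cell: len(users) for cell, users in cells.items()})
print(len(set(labels.tolist())), len(users_in_cluster))
```

The heuristic ends up with two cells of one user each (both flagged as spam with a threshold of 2), while DBSCAN yields a single cluster with two users.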

## Visualizing Events

Finally, we provide a visualization of the non spam events detected using DBSCAN.

In [25]:
# coordinates of the events
coord = new_df['approxLocation'].values.tolist()
# visualize non spam events
event_map = create_event_map(year, coord, new_df['hashtag'].tolist(), 
                             new_df['spamEvent'].tolist(), new_df['usersPerHashtag'].tolist())
event_map