# Applied Data Analysis
## Fall semester 2016

Team members:
* Stylianos Agapiou (stylianos.agapiou@epfl.ch)
* Athanasios Giannakopoulos (athanasios.giannakopoulos@epfl.ch)
* Dimitrios Sarigiannis (dimitrios.sarigiannis@epfl.ch)

## Project description

We are given a dataset containing tweets in Switzerland starting from 2010. 

The first goal of the project is to analyse the data and reconstruct the mobility flows of the users. More concretely, we try to get insights into high-frequency migration patterns in Swiss territory. The implementation is given in [mobility_patterns](mobility_patterns.ipynb).

The second task of the project is to detect events. Here, we focus on the dates and locations of such events. The implementation is given in [event_detection](event_detection.ipynb).


<a name="users_std"></a>
## Event Detection

This notebook focuses on event detection in Switzerland. Specifically, we perform the following tasks using the given dataset:

* We try to determine the location and date of an event using two methods:
    1. the first method uses DBSCAN for event detection
    2. the second method uses a heuristic approach
* We determine the number of users per event. A large number of distinct users is a strong indication of a real event, and counting them serves as a counter-measure against spammers. People often tweet about the same subject several times during one day. Our methods are agnostic to that, so such tweets are reported as an event. However, by counting the number of users per event, we can identify events created by spammers (i.e. one user) and, if necessary, filter these events out.
* Finally, we measure the standard deviation of the time of the tweets that describe the same event. A small standard deviation means that the tweets are sent in a short timeframe, whereas a large standard deviation means that the tweets are spread throughout the day. An event may last for a few hours, thus a small standard deviation is more likely to reveal an event.

In order to detect events, several assumptions are taken into account. These assumptions are described in detail later in the report.

In [1]:
%matplotlib inline
from libraries import *
from utils_event_detection import *

In [2]:
# parse timestamps of the form 'YYYY-MM-DD HH:MM:SS'
dateparse = lambda date: pd.to_datetime(date, format='%Y-%m-%d %H:%M:%S')

### Data loading

Since our dataset contains tweets from 2010 to 2016, we decide to perform a year-by-year analysis. This approach gives us the following advantages:
* the yearly analysis reveals how people use Twitter as time evolves (e.g. social networks become more popular, so it is probable that people used Twitter more in 2013 compared to 2010)
* we detect events on a yearly basis

In [3]:
# year to be analyzed
year = '2011'
# file name
file_name = '../../data/tweets_with_text_' + year + '.csv'
# loading data (treating NaN values and datetime format)
data = pd.read_csv(file_name, sep='|', na_values=['\\N'], header=None, parse_dates=[2], date_parser=dateparse)
# give proper names to columns
data.columns = ['tweetId', 'userId', 'createdAt', 'text']
# set tweet ID as index
data.set_index('tweetId', inplace=True)
# display dataframe
data.head()

tweetId,userId,createdAt,text
20992627776159744,15017105,2011-01-01 00:00:07,Viel Glück und Erfolg zum Neuen Jahr! Denk vor...
20994063574499328,994621,2011-01-01 00:05:49,270 GB via Online #Backup Service #crashplan h...
20999004200370176,5908302,2011-01-01 00:25:27,@filonia isch eh frau schwanger wenn sie en gu...
20999904390291457,13994272,2011-01-01 00:29:02,@thegooroo thx ;)
21001985117388800,13448872,2011-01-01 00:37:18,@natts ^_^ Happy New Year to you too! =)


We also load the materialized dataframe from the [mobility_patterns](mobility_patterns.ipynb) file. This contains geolocated information about every tweet.

In [4]:
file_name = '../../data/processed_tweets_' + year + '.csv'
tweets = pd.read_csv(file_name, sep='|', index_col='tweetId', 
                     parse_dates=[2], date_parser=dateparse)

Using index matching, we put the tweet text in the <code>tweets</code> dataframe and drop any rows that do not contain text.

In [5]:
# create text column in the tweets dataframe
tweets['text'] = data['text']
# drop NaN rows
tweets = tweets.dropna(subset=['text'])
# display dataframe
tweets.head()

tweetId,userId,createdAt,longitude,latitude,atWork,hourOfTweet,text
20992627776159744,15017105,2011-01-01 00:00:07,9.45234,47.3597,False,0,Viel Glück und Erfolg zum Neuen Jahr! Denk vor...
20994063574499328,994621,2011-01-01 00:05:49,8.50139,47.403,False,0,270 GB via Online #Backup Service #crashplan h...
20999004200370176,5908302,2011-01-01 00:25:27,8.40702,47.4242,False,0,@filonia isch eh frau schwanger wenn sie en gu...
20999904390291457,13994272,2011-01-01 00:29:02,7.47791,47.1382,False,0,@thegooroo thx ;)
21004302428405760,11855622,2011-01-01 00:46:30,8.27334,47.3663,False,0,Happy New Year ! (@ Büchel's) http://4sq.com/g...


<a name="requirements"></a>
### Data Preprocessing

We try to detect events using the tweet text and the geolocated information. Our analysis is based on the following assumptions:

* events should happen during one day and all tweets that describe one particular event should have the same date
* the tweet text should contain at least one hashtag that may describe the event
* in order to detect an event, we need at least 5 tweets with the same hashtag on a particular day
* we are interested in detecting events using geolocated information, therefore the tweets that describe an event should be posted from approximately the same location. We assume that events take place in a small area. To detect events, we reduce the accuracy of the GPS location, as described [here](https://en.wikipedia.org/wiki/Decimal_degrees).

We start by finding the day of each tweet using the given timestamp. Then we find all hashtags for each tweet. In case the tweet text does not contain any hashtags, we assume that the respective tweet does not refer to any event and therefore we remove it.

In [6]:
# find day of tweet
tweets['dayOfTweet'] = tweets.apply(lambda row: parse_day_of_tweet(row['createdAt']), axis=1)
# find hashtags for each tweet
tweets['hashtags'] = tweets['text'].apply(lambda row: keep_hashtags(row))
# remove rows without hashtags
tweets.dropna(subset=['hashtags'], inplace=True)
# display dataframe
tweets.head()

tweetId,userId,createdAt,longitude,latitude,atWork,hourOfTweet,text,dayOfTweet,hashtags
20992627776159744,15017105,2011-01-01 00:00:07,9.45234,47.3597,False,0,Viel Glück und Erfolg zum Neuen Jahr! Denk vor...,2011-01-01,"{#2011, #fb}"
20994063574499328,994621,2011-01-01 00:05:49,8.50139,47.403,False,0,270 GB via Online #Backup Service #crashplan h...,2011-01-01,"{#crashplan, #Backup}"
21045134586028033,16002522,2011-01-01 03:28:45,8.52493,47.3742,False,3,Happy New Year allerseits an diesem ersten #bi...,2011-01-01,"{#binaryday, #010111}"
21129290997309440,2129311,2011-01-01 09:03:10,8.53733,47.3794,False,9,@sicher at least you slept... #headinghome,2011-01-01,{#headinghome}
21144723334889472,14201145,2011-01-01 10:04:29,6.13827,45.9435,False,10,"2K10 was a fantastic yeAR, but it ll be nothin...",2011-01-01,"{#ARevolution, #hpsc}"
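
The helper `keep_hashtags` lives in `utils_event_detection` and is not shown in this notebook. The following is a minimal sketch of what such a helper could look like, assuming that a hashtag is any whitespace-separated token starting with `#`. The name `extract_hashtags` and the tokenisation rule are assumptions made for illustration, not the project's actual code.

```python
def extract_hashtags(text):
    """Return the set of tokens that start with '#', or None if there are none.

    Assumption: hashtags are whitespace-separated tokens. Returning None lets
    a later dropna(subset=['hashtags']) remove tweets without hashtags.
    """
    hashtags = {token for token in text.split()
                if token.startswith('#') and len(token) > 1}
    return hashtags or None


print(extract_hashtags("270 GB via Online #Backup Service #crashplan"))
print(extract_hashtags("@thegooroo thx ;)"))
```

With this rule, trailing punctuation and letter case are preserved, so `#lift11` and `#Lift11` would be counted as two different hashtags.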


It is possible that a row contains more than one hashtag. Since we wish to have exactly one hashtag per row, we duplicate such rows, once for each hashtag.

In [7]:
pairs = []
# iterate through all rows
for _, row in tweets.iterrows():
    # for each value in the hashtags, create one new row
    for hashtag in row["hashtags"]:
        # append the new row to the list
        pairs.append((row['userId'], row['createdAt'], row['longitude'], 
                      row['latitude'], row['dayOfTweet'], hashtag))
# create a new dataframe from the pairs list and give meaningful names to the columns
df = pd.DataFrame(pairs, columns=['userId', 'createdAt', 'longitude', 'latitude', 'dayOfTweet', 'hashtag'])
# display dataframe
df.head()

index,userId,createdAt,longitude,latitude,dayOfTweet,hashtag
0,15017105,2011-01-01 00:00:07,9.45234,47.3597,2011-01-01,#2011
1,15017105,2011-01-01 00:00:07,9.45234,47.3597,2011-01-01,#fb
2,994621,2011-01-01 00:05:49,8.50139,47.403,2011-01-01,#crashplan
3,994621,2011-01-01 00:05:49,8.50139,47.403,2011-01-01,#Backup
4,16002522,2011-01-01 03:28:45,8.52493,47.3742,2011-01-01,#binaryday


## Detecting events
<a name="dbscan"></a>
### Machine Learning Approach: Event detection using DBSCAN
We are done with the data preprocessing and start working on event detection. To achieve that, we group our data based on day of tweet and hashtag. We use the <code>size</code> operation to count the number of tweets on a particular day with a particular hashtag.

In [8]:
# groupby dayOfTweet and hashtag, then find the size of each group
df_grouped = df.groupby(by=['dayOfTweet', 'hashtag']).size()
# give meaningful name to the Series
df_grouped.rename('numOfTweets', inplace=True)
# display Series
df_grouped.head()

dayOfTweet  hashtag     
2011-01-01  #010111         2
            #2011           2
            #ARevolution    1
            #Backup         1
            #Endomondo.     1
Name: numOfTweets, dtype: int64

<a name="df"></a>
We keep only candidate events that have at least 5 tweets, i.e. **there are at least 5 tweets on the same day with the same hashtag**.

In [9]:
# set threshold for min number of tweets
min_tweets = 5
# find how many tweets were posted on a particular day with a particular hashtag
df = df.apply(lambda row: fill_num_of_tweets(row, df_grouped), axis=1)
# remove those that do not exceed the threshold value
df = df[df['numOfTweets'] >= min_tweets]
# display dataframe
df.head()

index,userId,createdAt,longitude,latitude,dayOfTweet,hashtag,numOfTweets
100,8614392,2011-01-04 06:13:41,7.56667,47.5833,2011-01-04,#fb,5
110,8614392,2011-01-04 12:30:11,8.54177,47.3705,2011-01-04,#fb,5
115,8614392,2011-01-04 20:09:49,8.54177,47.3705,2011-01-04,#fb,5
116,8614392,2011-01-04 21:51:32,7.56667,47.5833,2011-01-04,#fb,5
118,8614392,2011-01-04 22:26:45,7.56667,47.5833,2011-01-04,#fb,5
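
The counting and thresholding step above can be sketched without pandas as follows. This is only an illustration of the logic (count tweets per day and hashtag, keep the pairs that reach the threshold). The function name `frequent_pairs` and the sample rows are hypothetical.

```python
from collections import Counter


def frequent_pairs(rows, min_tweets=5):
    """Count tweets per (day, hashtag) and keep pairs with >= min_tweets tweets.

    rows: iterable of (day, hashtag) tuples, one per tweet/hashtag row.
    """
    counts = Counter((day, hashtag) for day, hashtag in rows)
    return {pair: n for pair, n in counts.items() if n >= min_tweets}


# hypothetical sample: five '#fb' rows and two '#2011' rows on the same day
rows = [("2011-01-04", "#fb")] * 5 + [("2011-01-04", "#2011")] * 2
print(frequent_pairs(rows))
```

In pandas, a vectorised alternative to the row-wise `apply` would be something like `df.groupby(['dayOfTweet', 'hashtag'])['hashtag'].transform('size')`, which should give the same per-row count.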


In order to explain the information contained in the <code>df</code> dataframe, we give the following example taken from the sub-dataset of year **2010**. ![title](../../data/figures/df_example.png)

Here, we see that there are 6 tweets with the hashtag #fb posted on 07.03.2010 by 5 different users (5 unique user IDs).

So far, we have satisfied 3 out of the 4 [requirements](#requirements) described above. The $4^{th}$ point requires tweets to be posted from approximately the same location.

In order to do that, we pass the geolocation information for possible events into DBSCAN. In case the tweets are posted from the same location, DBSCAN will create a cluster. Otherwise, these tweets are considered to be noise and we say that no event is detected. 

**Recall** that we are interested in detecting events using geolocated information, not in detecting events in general.

We start by grouping tweets based on common day and hashtag. The coordinates of each sub-group are fed into DBSCAN. The output of the next cell lists all events detected using DBSCAN, their respective dates, and their descriptive hashtags. We assume that events take place in a small area (e.g. stadium, conference, festival, etc.) and we reduce the accuracy to approximately
* 6.4 square kilometers (accuracy equal to 0.01 and 2 for DBSCAN and the heuristic respectively)
* 640 square meters (accuracy equal to 0.001 and 3 for DBSCAN and the heuristic respectively)

in order to also account for inaccuracies in the GPS location measurements. **This parameter significantly affects the number of events detected**.

In [10]:
# reduce GPS accuracy
accuracy = 0.001
# group by day of tweet and hashtag
df_grouped = df.groupby(by=['dayOfTweet', 'hashtag'])
# find events
list_of_events_dbscan = detect_event_dbscan(df_grouped, accuracy, min_tweets)

Date:  2011-01-26 	 Cluster ID:  0 	 Total Clusters on 2011-01-26:  1 	 Hashtags:  #gdi
Date:  2011-02-02 	 Cluster ID:  0 	 Total Clusters on 2011-02-02:  1 	 Hashtags:  #lift11
Date:  2011-02-03 	 Cluster ID:  0 	 Total Clusters on 2011-02-03:  1 	 Hashtags:  #lift11
Date:  2011-02-04 	 Cluster ID:  0 	 Total Clusters on 2011-02-04:  1 	 Hashtags:  #Lift11
Date:  2011-02-08 	 Cluster ID:  0 	 Total Clusters on 2011-02-08:  1 	 Hashtags:  #pokeRT
Date:  2011-03-09 	 Cluster ID:  0 	 Total Clusters on 2011-03-09:  1 	 Hashtags:  #g_ch
Date:  2011-03-20 	 Cluster ID:  0 	 Total Clusters on 2011-03-20:  1 	 Hashtags:  #swlau
Date:  2011-03-24 	 Cluster ID:  0 	 Total Clusters on 2011-03-24:  2 	 Hashtags:  #smmk11
Date:  2011-03-24 	 Cluster ID:  1 	 Total Clusters on 2011-03-24:  2 	 Hashtags:  #smmk11
Date:  2011-03-25 	 Cluster ID:  0 	 Total Clusters on 2011-03-25:  1 	 Hashtags:  #applestoreZH
Date:  2011-03-25 	 Cluster ID:  0 	 Total Clusters on 2011-03-25:  1 	 Hashtags:  #iPad2
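
The helper `detect_event_dbscan` is defined in `utils_event_detection` and is not shown here. To make the clustering idea concrete, the following is a minimal pure-Python sketch of DBSCAN applied to the coordinates of one (day, hashtag) group. It assumes plain Euclidean distance on decimal degrees and that `eps` plays the role of `accuracy` and `min_samples` the role of `min_tweets`. These are assumptions for illustration, and the project may well rely on a library implementation instead.

```python
from collections import deque
from math import hypot

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each (lat, lon) point with a cluster ID, or NOISE (-1)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # all points within eps of point i (the point itself included)
        return [j for j in range(n)
                if hypot(points[i][0] - points[j][0],
                         points[i][1] - points[j][1]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            expansion = neighbours(j)
            if len(expansion) >= min_samples:  # j is a core point
                queue.extend(expansion)
    return labels


# hypothetical group: five tweets close together and two far away
group = [(47.3700, 8.5420), (47.3701, 8.5421), (47.3702, 8.5419),
         (47.3699, 8.5420), (47.3700, 8.5422),
         (47.5833, 7.5667), (46.2200, 6.1380)]
print(dbscan(group, eps=0.001, min_samples=5))
```

A group in which every point is labelled as noise yields no event, while each distinct non-negative label corresponds to one detected cluster.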


Now, we have detected possible events. We try to find evidence that these are indeed events. This is indicated by
* the number of users that posted about a particular event
* the standard deviation of the timestamps of the tweets of each event

The reasoning is given [here](#users_std).

We create a new dataframe that contains only the users who tweeted about the events we have detected so far.

In [11]:
# initialize dataframe
new_df = pd.DataFrame()
# iterate through list of events
for item in list_of_events_dbscan:
    # find all records that match the day of tweet and the hashtag for each detected event
    temp = df[(df['dayOfTweet'] == item[0]) & (df['hashtag'] == item[1])]
    # concatenate to new dataframe
    new_df = pd.concat([new_df, temp])
# keep only the necessary columns
new_df = new_df[['dayOfTweet', 'hashtag', 'userId']]
# display dataframe
new_df.head()

index,dayOfTweet,hashtag,userId
658,2011-01-26,#gdi,8614392
671,2011-01-26,#gdi,17007639
673,2011-01-26,#gdi,17007639
676,2011-01-26,#gdi,17007639
677,2011-01-26,#gdi,17007639


Now, we wish to determine how many different users posted for each detected event.

In [12]:
# group by hashtag and dayOfTweet and count unique user IDs
new_df = new_df.groupby(by=['hashtag', 'dayOfTweet'])['userId'].nunique()
# give meaningful name to column
new_df.rename('usersPerHashtag', inplace=True)
# convert to dataframe (needed for later)
new_df = new_df.to_frame()

We define a spammer threshold. If a hashtag is posted by fewer users than the spammer threshold, then we have evidence that the detected event may not be a true event. Nevertheless, the algorithm detected it as an event since it met all the aforementioned [requirements](#requirements).

In [13]:
# define a spammer threshold
spammer_threshold = 2
# initialize column
new_df['spamEvent'] = False
# find if events are spam or not
new_df = new_df.apply(lambda row: is_spam_event(row, spammer_threshold), axis=1)
# display dataframe
new_df.head()

hashtag,dayOfTweet,usersPerHashtag,spamEvent
#11-ICML,2011-06-15,1,True
#11-ICML,2011-06-16,1,True
#11ICML,2011-06-16,1,True
#18,2011-08-12,1,True
#Android,2011-08-02,2,False
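
The spam check can be sketched as follows: collect the distinct users per (day, hashtag) event and flag the event when their number is below the threshold. The comparison `<` is inferred from the output above, where an event with 2 users is not flagged for a threshold of 2. The function name `flag_spam_events` and the sample user IDs are hypothetical.

```python
from collections import defaultdict


def flag_spam_events(rows, spammer_threshold=2):
    """Map each (day, hashtag) event to (number of distinct users, is_spam).

    rows: iterable of (day, hashtag, user_id) tuples, one per tweet.
    """
    users = defaultdict(set)
    for day, hashtag, user_id in rows:
        users[(day, hashtag)].add(user_id)
    return {event: (len(ids), len(ids) < spammer_threshold)
            for event, ids in users.items()}


# hypothetical sample: one event with two users, one with a single user
rows = [("2011-08-02", "#Android", 101), ("2011-08-02", "#Android", 102),
        ("2011-08-02", "#Android", 101),
        ("2011-06-15", "#11-ICML", 201), ("2011-06-15", "#11-ICML", 201)]
print(flag_spam_events(rows))
```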


There is one final thing left to do: we calculate the standard deviation of the timestamps of the detected events. In order to do that, we gather the timestamps of the tweets of each event, convert them to the number of seconds since midnight and calculate the standard deviation.

A small standard deviation means that the tweets are sent in a short timeframe, whereas a large standard deviation means that the tweets are spread throughout the day. An event may last for a few hours, thus a small standard deviation is more likely to reveal an event.

In [14]:
# find std of events
std_dict = std_of_events(df, new_df)
# reset index
new_df.reset_index(inplace=True)
# initialize column
new_df['std'] = np.nan
# fill std for each event
new_df = new_df.apply(lambda row: fill_std(row, std_dict), axis=1)
# sort dataframe
new_df.sort_values(by=['usersPerHashtag', 'std'], ascending=False, inplace=True)
# reset indexes
new_df.reset_index(inplace=True, drop=True)
# save dataframe
file_name = '../../data/detected_events_dbscan_' + year + '.csv'
new_df.to_csv(file_name, sep='|')
# display dataframe
new_df.head()

index,hashtag,dayOfTweet,usersPerHashtag,spamEvent,std
0,#iPad2,2011-03-25,10,False,150.032093
1,#tedxzh,2011-10-04,10,False,141.890477
2,#fec11,2011-09-09,9,False,285.772872
3,#lift11,2011-02-02,9,False,269.037299
4,#bosw,2011-03-31,9,False,141.610126
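
The time-spread measure described above can be sketched as follows. The helper `std_of_events` is not shown in this notebook, so the function names below are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from datetime import datetime
from statistics import pstdev


def seconds_since_midnight(timestamp):
    """Convert a datetime to the number of seconds elapsed since 00:00:00."""
    return timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second


def time_spread(timestamps):
    """Population standard deviation of the tweet times of one event, in seconds."""
    return pstdev([seconds_since_midnight(t) for t in timestamps])


# hypothetical event: two tweets posted two hours apart on the same day
event = [datetime(2011, 3, 25, 10, 0, 0), datetime(2011, 3, 25, 12, 0, 0)]
print(time_spread(event))
```

Because only the time of day is used, this measure is meaningful under the notebook's assumption that all tweets of an event share the same date.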


 <a name="non_spam_dbscan"></a>
Now, we print all events that are not classified as spam. In the section [below](#non_spam_heuristic), you can find a list of non-spam events detected using a heuristic approach.

In [15]:
# list of non-spam events for DBSCAN
dbscan_non_spam = []
for _, event in new_df.iterrows():
    # access the fields by column name instead of by position
    spam = event['spamEvent']
    date = event['dayOfTweet']
    hashtag = event['hashtag']
    # print those that are not spam
    if not spam:
        # append to list and print event
        dbscan_non_spam.append((date, hashtag))
        print('Date: ', date, '\t', 'Hashtag: ', hashtag)

Date:  2011-03-25 	 Hashtag:  #iPad2
Date:  2011-10-04 	 Hashtag:  #tedxzh
Date:  2011-09-09 	 Hashtag:  #fec11
Date:  2011-02-02 	 Hashtag:  #lift11
Date:  2011-03-31 	 Hashtag:  #bosw
Date:  2011-02-03 	 Hashtag:  #lift11
Date:  2011-10-29 	 Hashtag:  #uxcon11
Date:  2011-10-23 	 Hashtag:  #ew11
Date:  2011-02-08 	 Hashtag:  #pokeRT
Date:  2011-05-14 	 Hashtag:  #esc
Date:  2011-02-04 	 Hashtag:  #Lift11
Date:  2011-06-01 	 Hashtag:  #smgzh
Date:  2011-09-07 	 Hashtag:  #smgzh
Date:  2011-03-20 	 Hashtag:  #swlau
Date:  2011-06-21 	 Hashtag:  #pokeRT
Date:  2011-08-12 	 Hashtag:  #fb
Date:  2011-03-27 	 Hashtag:  #fb
Date:  2011-03-24 	 Hashtag:  #smmk11
Date:  2011-06-24 	 Hashtag:  #odch11
Date:  2011-03-25 	 Hashtag:  #applestoreZH
Date:  2011-11-25 	 Hashtag:  #obstech
Date:  2011-03-31 	 Hashtag:  #som11
Date:  2011-08-02 	 Hashtag:  #Android
Date:  2011-03-30 	 Hashtag:  #som11
Date:  2011-08-24 	 Hashtag:  #emex11
Date:  2011-01-26 	 Hashtag:  #gdi
Date:  2011-03-09 	 Hashtag:

### Heuristic Approach

We follow a second approach for detecting events. This approach does not involve any machine learning algorithm. On the contrary, it is a heuristic based on reducing the accuracy of the GPS coordinates: tweets that are posted from approximately the same location should have the same longitude and latitude after the accuracy is reduced. We detect events following this logic and, in that way, try to verify the results obtained using DBSCAN. **The accuracy should be in accordance with the accuracy used in the DBSCAN model**. For example, an accuracy of 0.01 (or 0.001) in DBSCAN corresponds to an accuracy of 2 (or 3) in the heuristic.

We work with the <code>df</code> dataframe. This dataframe was last modified [here](#df) and does not contain any information produced by DBSCAN. However, we drop the <code>numOfTweets</code> column, since we are going to re-evaluate it in a different way.

In [16]:
# remove unnecessary column
df.drop('numOfTweets', axis=1, inplace=True)
# define accuracy according to DBSCAN's respective value
accuracy = 3
# reduce the accuracy
df = df.apply(lambda row: reduce_location_accuracy(row, accuracy), axis=1)
# display dataframe
df.head()

index,userId,createdAt,longitude,latitude,dayOfTweet,hashtag,approxLocation
100,8614392,2011-01-04 06:13:41,7.56667,47.5833,2011-01-04,#fb,"(47.583, 7.567)"
110,8614392,2011-01-04 12:30:11,8.54177,47.3705,2011-01-04,#fb,"(47.370, 8.542)"
115,8614392,2011-01-04 20:09:49,8.54177,47.3705,2011-01-04,#fb,"(47.370, 8.542)"
116,8614392,2011-01-04 21:51:32,7.56667,47.5833,2011-01-04,#fb,"(47.583, 7.567)"
118,8614392,2011-01-04 22:26:45,7.56667,47.5833,2011-01-04,#fb,"(47.583, 7.567)"
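
The heuristic can be sketched as rounding each coordinate to a fixed number of decimals and counting tweets per (day, rounded location, hashtag). The helper `reduce_location_accuracy` is not shown in this notebook, so the functions below are hypothetical and assume that the accuracy is the number of decimals kept.

```python
from collections import Counter


def approx_location(lat, lon, decimals=3):
    """Round a coordinate pair to the given number of decimals."""
    return (round(lat, decimals), round(lon, decimals))


def grid_events(rows, decimals=3, min_tweets=5):
    """Count tweets per (day, rounded location, hashtag); keep frequent ones.

    rows: iterable of (day, hashtag, lat, lon) tuples, one per tweet.
    """
    counts = Counter((day, approx_location(lat, lon, decimals), hashtag)
                     for day, hashtag, lat, lon in rows)
    return {key: n for key, n in counts.items() if n >= min_tweets}


print(approx_location(47.5833, 7.56667))
# two points only 0.0002 degrees apart can still fall into different cells
print(approx_location(47.37, 8.5414), approx_location(47.37, 8.5416))
```

The second example shows one way in which the grid differs from DBSCAN: nearby tweets on opposite sides of a cell boundary are counted separately by the heuristic, whereas a distance-based method would treat them as neighbours.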


Our heuristic detects events using the reduced-accuracy location stored in the <code>approxLocation</code> column. Thus, we group by day of tweet, hashtag and reduced location and try to detect events.

In [17]:
# group by and count tweets per index
df_grouped = df.groupby(by=['dayOfTweet', 'approxLocation', 'hashtag']).size()
# give column meaningful name
df_grouped = df_grouped.rename('numOfTweets')

An event is considered to take place if at least 5 tweets are posted with the same hashtag, on the same day, from the same location. The remaining records are filtered out.

In [18]:
# filtering out rows with less than 5 tweets
df_grouped = df_grouped[df_grouped >= min_tweets]
# convert to frame (used later)
df_grouped = df_grouped.to_frame()
# display dataframe
df_grouped.head()

dayOfTweet,approxLocation,hashtag,numOfTweets
2011-01-26,"(47.302, 8.553)",#gdi,14
2011-02-02,"(46.220, 6.138)",#lift11,9
2011-02-03,"(46.220, 6.138)",#lift11,5
2011-02-04,"(46.220, 6.138)",#Lift11,6
2011-02-06,"(45.943, 6.138)",#annecy2018,5


Now, we join the two aforementioned dataframes (<code>df</code> and <code>df_grouped</code>) and create a new one that contains the <code>numOfTweets</code> column. For each row, this column gives the number of tweets that share its date, hashtag and approximate location.

In [19]:
# join the two dataframes
joined_df = pd.merge(df, df_grouped, how='inner', left_on=['dayOfTweet', 'approxLocation', 'hashtag'], 
                     right_index=True)
# keep necessary columns
joined_df = joined_df[['userId', 'dayOfTweet', 'approxLocation', 'hashtag', 'numOfTweets']]
# drop duplicate rows
joined_df.drop_duplicates(inplace=True)
# display dataframe
joined_df.head()

index,userId,dayOfTweet,approxLocation,hashtag,numOfTweets
676,17007639,2011-01-26,"(47.302, 8.553)",#gdi,14
957,626163,2011-02-02,"(46.220, 6.138)",#lift11,9
976,6316162,2011-02-02,"(46.220, 6.138)",#lift11,9
979,5040991,2011-02-02,"(46.220, 6.138)",#lift11,9
981,634553,2011-02-02,"(46.220, 6.138)",#lift11,9


From now on, we follow exactly the same procedure as in the [DBSCAN](#dbscan) section. More concretely, we find the number of users per event, flag potential spam events and finally estimate the standard deviation of the timestamps of each event.

In [20]:
# find users per event
users_per_hashtag = joined_df.groupby(by=['hashtag', 'dayOfTweet', 'approxLocation']).size()
# detect potential spam events
joined_df = joined_df.apply(lambda row: spam_events(row, users_per_hashtag, spammer_threshold), axis=1)
# display dataframe
joined_df.head()

index,userId,dayOfTweet,approxLocation,hashtag,numOfTweets,spamEvent,usersPerHashtag
676,17007639,2011-01-26,"(47.302, 8.553)",#gdi,14,True,1
957,626163,2011-02-02,"(46.220, 6.138)",#lift11,9,False,6
976,6316162,2011-02-02,"(46.220, 6.138)",#lift11,9,False,6
979,5040991,2011-02-02,"(46.220, 6.138)",#lift11,9,False,6
981,634553,2011-02-02,"(46.220, 6.138)",#lift11,9,False,6


Once the potential spam events and the number of users per event are determined, we continue by determining the standard deviation of the timestamps for each event.

In [21]:
# estimate std for each event
std_dict = std_of_events(df)
# fill std value to the dataframe
event_detection = joined_df.apply(lambda row: fill_std(row, std_dict), axis=1)
# sort dataframe
event_detection.sort_values(by=['usersPerHashtag', 'std'], ascending=False, inplace=True)
# drop unnecessary columns
event_detection.drop('userId', axis=1, inplace=True)
# drop duplicates
event_detection.drop_duplicates(subset=['dayOfTweet', 'approxLocation', 'hashtag'], inplace=True)
# save dataframe
file_name = '../../data/detected_events_heuristic_' + year + '.csv'
event_detection.to_csv(file_name, sep='|')
# display dataframe
event_detection.head()

index,dayOfTweet,approxLocation,hashtag,numOfTweets,spamEvent,usersPerHashtag,std
11924,2011-10-04,"(47.416, 8.561)",#tedxzh,30,False,9,146.395229
957,2011-02-02,"(46.220, 6.138)",#lift11,9,False,6,192.556915
13289,2011-10-29,"(46.005, 8.956)",#uxcon11,13,False,5,189.831813
10316,2011-09-09,"(47.414, 8.549)",#fec11,8,False,4,134.676832
2704,2011-03-25,"(47.375, 8.539)",#iPad2,16,False,4,68.361104


<a name="non_spam_heuristic"></a>
Here, we create a list of all the events detected using the heuristic, and print those that are not spam events. The respective list of non-spam events detected using DBSCAN is given [here](#non_spam_dbscan).

In [22]:
# list of all events
list_of_events_heuristic = []
# list of non-spam events for heuristic
heuristic_non_spam = []
for _, event in event_detection.iterrows():
    # access the fields by column name instead of by position
    spam = event['spamEvent']
    date = event['dayOfTweet']
    hashtag = event['hashtag']
    list_of_events_heuristic.append((date, hashtag))
    # print those that are not spam
    if not spam:
        # append to list and print
        heuristic_non_spam.append((date, hashtag))
        print('Date: ', date, '\t', 'Hashtag: ', hashtag)

Date:  2011-10-04 	 Hashtag:  #tedxzh
Date:  2011-02-02 	 Hashtag:  #lift11
Date:  2011-10-29 	 Hashtag:  #uxcon11
Date:  2011-09-09 	 Hashtag:  #fec11
Date:  2011-03-25 	 Hashtag:  #iPad2
Date:  2011-02-04 	 Hashtag:  #Lift11
Date:  2011-02-08 	 Hashtag:  #pokeRT
Date:  2011-03-31 	 Hashtag:  #bosw
Date:  2011-02-03 	 Hashtag:  #lift11
Date:  2011-03-24 	 Hashtag:  #smmk11
Date:  2011-10-04 	 Hashtag:  #tedxzh
Date:  2011-03-20 	 Hashtag:  #swlau
Date:  2011-11-25 	 Hashtag:  #obstech
Date:  2011-03-09 	 Hashtag:  #g_ch
Date:  2011-03-20 	 Hashtag:  #swlau
Date:  2011-03-22 	 Hashtag:  #DonNorman
Date:  2011-06-01 	 Hashtag:  #smgzh
Date:  2011-06-21 	 Hashtag:  #pokeRT


Finally, we provide a visualization of the detected events on the map of Switzerland.

In [23]:
# coordinates of the events
coord = event_detection['approxLocation'].values.tolist()
# visualize non spam events
event_map = create_event_map(year, coord, event_detection['hashtag'].tolist(), 
                             event_detection['spamEvent'].tolist(), event_detection['numOfTweets'].tolist())
event_map

### Comparing the two Methods

Here, we compare the two methods based on their results. The analysis is done using:
* the full list of events
* the reduced list of events that remains after filtering out the spam events

In [24]:
analyse_performance(list_of_events_dbscan, list_of_events_heuristic)

Number of events detected with DBSCAN =  54
Number of events detected with heuristic =  90
---------------------------------------------
The two methods found 49 events in common
Common events:
('2011-08-02', '#Android')
('2011-10-02', '#letzigrund')
('2011-06-16', '#11ICML')
('2011-08-12', '#fb')
('2011-03-09', '#g_ch')
('2011-07-02', '#oasg')
('2011-08-12', '#18')
('2011-10-02', '#fcz')
('2011-03-20', '#swlau')
('2011-05-27', '#Heimfahrt')
('2011-11-17', '#IGEP')
('2011-01-26', '#gdi')
('2011-07-12', '#moonandstars')
('2011-11-25', '#obstech')
('2011-03-31', '#bosw')
('2011-11-18', '#pir@tage')
('2011-05-27', '#Velo')
('2011-06-15', '#11-ICML')
('2011-09-24', '#missCH')
('2011-10-23', '#ew11')
('2011-06-21', '#pokeRT')
('2011-07-22', '#awesomezh1')
('2011-10-29', '#uxcon11')
('2011-06-18', '#zvilla')
('2011-03-31', '#som11')
('2011-05-07', '#BattesimoPupo')
('2011-06-01', '#smgzh')
('2011-08-25', '#UTMB2011')
('2011-10-04', '#tedxzh')
('2011-11-18', '#iae')
('2011-08-24', '#emex11')
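
The helper `analyse_performance` is not shown in this notebook. The comparison it reports can be sketched with set operations on (date, hashtag) pairs, as below. The function name `compare_event_lists` and the sample events are hypothetical.

```python
def compare_event_lists(events_a, events_b):
    """Compare two lists of (date, hashtag) events.

    Converting to sets removes duplicates, e.g. the same hashtag detected
    at several locations on the same day.
    """
    a, b = set(events_a), set(events_b)
    return {'common': a & b, 'only_a': a - b, 'only_b': b - a}


# hypothetical sample lists
dbscan_events = [("2011-02-02", "#lift11"), ("2011-05-14", "#esc")]
heuristic_events = [("2011-02-02", "#lift11"), ("2011-02-02", "#lift11"),
                    ("2011-03-22", "#DonNorman")]
result = compare_event_lists(dbscan_events, heuristic_events)
print(len(result['common']), "events in common")
```

Events are matched on the exact (date, hashtag) pair, so hashtags that differ only in letter case are treated as different events.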


Now, we repeat the same analysis using only the events that are flagged as non-spam.

In [25]:
analyse_performance(dbscan_non_spam, heuristic_non_spam)

Number of events detected with DBSCAN =  30
Number of events detected with heuristic =  16
---------------------------------------------
The two methods found 15 events in common
Common events:
('2011-09-09', '#fec11')
('2011-02-04', '#Lift11')
('2011-03-20', '#swlau')
('2011-06-21', '#pokeRT')
('2011-10-29', '#uxcon11')
('2011-03-24', '#smmk11')
('2011-06-01', '#smgzh')
('2011-11-25', '#obstech')
('2011-03-31', '#bosw')
('2011-02-08', '#pokeRT')
('2011-03-09', '#g_ch')
('2011-10-04', '#tedxzh')
('2011-02-02', '#lift11')
('2011-02-03', '#lift11')
('2011-03-25', '#iPad2')
---------------------------------------------
Found only by DBSCAN:
('2011-10-23', '#ew11')
('2011-05-14', '#esc')
('2011-09-07', '#smgzh')
('2011-08-12', '#fb')
('2011-03-27', '#fb')
('2011-06-24', '#odch11')
('2011-03-25', '#applestoreZH')
('2011-03-31', '#som11')
('2011-08-02', '#Android')
('2011-03-30', '#som11')
('2011-08-24', '#emex11')
('2011-01-26', '#gdi')
('2011-10-30', '#fcz')
('2011-10-02', '#fcz')
('2011-0

### Commenting on the Results

In general, we see that the heuristic approach captures more events than DBSCAN. However, when only non-spam events are considered, DBSCAN does a better job and detects more events. We attribute this to the following reasons.
DBSCAN searches in a circular area around each point to detect neighbors and form clusters, whereas the heuristic uses square areas around each point. Therefore, the area covered by the heuristic is bigger and may include more points, and thus more candidate events. Hence, when no spam filtering is applied, the heuristic is able to capture more events, many of which are flagged as spam events since only one user tweets about them.