# Applied Data Analysis - Fall 2016
## Twitter-Swisscom Project

### Processing Data

This notebook contains all the operations we performed on the data to prepare it for event detection.
Here is a summary of the steps:


1 - [Load the data](#load_data)

2 - [Remove unused columns](#remove_columns)

3 - [Replace Position](#replace_position)

4 - [Drop NaN in position and text](#drop_nan)

5 - [Remove stop words from text](#remove_stopwords)

6 - [Get the hashtags from the text](#get_hashtags)

7 - [Save the tweets to CSV for event detection](#save_tweets)

In [1]:
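import re       # used by text_process to strip URLs and mentions
import string   # used for string.punctuation in the stop word list

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from utils import *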

### 1 - <a id='load_data'>Load the data</a>

We load the event data that we created during the pre-processing step.

In [2]:
col_event_split = ['id','userId', 'createdAt', 'text', 'longitude', 'latitude', 'placeId', 'inReplyTo', 'placeLatitude', 'placeLongitude']

In [3]:
parse_dates = ['createdAt']

In [4]:
tweets = pd.read_csv('../twitter-swisscom/event/twex_event_corrected.tsv', sep="\t", encoding='utf-8', escapechar='\\', names=col_event_split, parse_dates=parse_dates, na_values='N', header=None)
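The effect of `na_values='N'` and `parse_dates` in the call above can be checked on a small in-memory sample. This is only a sketch: the two rows below are made up and merely follow the same column layout, and `escapechar` is left out because the sample contains no escaped characters.

```python
import io
import pandas as pd

# Two made-up rows in the same tab-separated layout; 'N' marks a missing value.
sample = (
    "1\t10\t2010-02-23 05:55:51\thello\t7.4\t46.9\tN\tN\tN\tN\n"
    "2\t11\t2010-02-23 06:22:40\tN\tN\tN\tabc\tN\t47.3\t8.5\n"
)
cols = ['id', 'userId', 'createdAt', 'text', 'longitude', 'latitude',
        'placeId', 'inReplyTo', 'placeLatitude', 'placeLongitude']

df = pd.read_csv(io.StringIO(sample), sep="\t", names=cols, header=None,
                 na_values='N', parse_dates=['createdAt'])

# 'N' cells are read as NaN and 'createdAt' is parsed as a datetime column.
print(df.dtypes)
print(df.isna().sum())
```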

In [5]:
tweets.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,placeLatitude,placeLongitude
0,9514097914,17341045,2010-02-23 05:55:51,Guuuuten Morgen! :-),7.43926,46.9489,,,,
1,9514846412,7198282,2010-02-23 06:22:40,Still the best coffee in town — at La Stanza h...,8.53781,47.3678,,,,
2,9516574359,14657884,2010-02-23 07:34:25,It has been a week or so.. and today I just co...,6.13396,46.1951,,,,
3,9516952605,14703863,2010-02-23 07:51:47,Getting ready.. http://twitpic.com/14v8gz,8.81749,47.2288,,,,
4,9517198943,14393717,2010-02-23 08:02:57,Un peu de réconfort liquide en take away après...,6.63254,46.5199,,,,


### 2 - <a id='remove_columns'>Remove unused columns</a>

We will not use the 'placeId' and 'inReplyTo' columns for our detection, so we drop them now.

In [6]:
tweets.drop(['placeId', 'inReplyTo'], inplace=True, axis=1)

### 3 - <a id='replace_position'>Replace Position</a>

Two sources of longitude and latitude are available. The 'longitude'/'latitude' columns hold the position of the tweet as reported by the user or client application. The 'placeLongitude'/'placeLatitude' columns are set when the tweet is associated with a place. As the head of the table shows, the place is not always set.

We use the longitude/latitude columns to represent the position of a tweet and, if they are null, fall back to placeLatitude and placeLongitude. If both are null, we have to drop the entry, since a tweet without a position is not useful for event detection.

In [7]:
tweets = tweets.apply(replace_position, axis=1)
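`replace_position` comes from `utils` and is not shown in this notebook. The fallback rule described above could look like the sketch below; `replace_position_sketch` is a hypothetical stand-in, not the original helper, and the demo values are made up. A vectorized variant is included because a row-wise `apply` is slow on millions of rows.

```python
import numpy as np
import pandas as pd

def replace_position_sketch(row):
    """Hypothetical stand-in for utils.replace_position (not the original code)."""
    row = row.copy()
    # Fall back to the place coordinates when the tweet has no own position.
    if pd.isnull(row['longitude']) or pd.isnull(row['latitude']):
        row['longitude'] = row['placeLongitude']
        row['latitude'] = row['placeLatitude']
    return row

# Made-up rows: own position, place position only, no position at all.
demo = pd.DataFrame({
    'longitude':      [7.44, np.nan, np.nan],
    'latitude':       [46.95, np.nan, np.nan],
    'placeLongitude': [np.nan, 8.54, np.nan],
    'placeLatitude':  [np.nan, 47.37, np.nan],
})

row_wise = demo.apply(replace_position_sketch, axis=1)

# Vectorized equivalent: select the rows without a position once, then copy.
vectorized = demo.copy()
missing = vectorized['longitude'].isna() | vectorized['latitude'].isna()
vectorized.loc[missing, 'longitude'] = vectorized.loc[missing, 'placeLongitude']
vectorized.loc[missing, 'latitude'] = vectorized.loc[missing, 'placeLatitude']
```

The third row still has no position after the fallback, so it is the kind of entry that gets dropped in step 4.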

In [8]:
tweets.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeLatitude,placeLongitude
0,9514097914,17341045,2010-02-23 05:55:51,Guuuuten Morgen! :-),7.43926,46.9489,,
1,9514846412,7198282,2010-02-23 06:22:40,Still the best coffee in town — at La Stanza h...,8.53781,47.3678,,
2,9516574359,14657884,2010-02-23 07:34:25,It has been a week or so.. and today I just co...,6.13396,46.1951,,
3,9516952605,14703863,2010-02-23 07:51:47,Getting ready.. http://twitpic.com/14v8gz,8.81749,47.2288,,
4,9517198943,14393717,2010-02-23 08:02:57,Un peu de réconfort liquide en take away après...,6.63254,46.5199,,


We can now drop the 'placeLongitude' and 'placeLatitude' columns, as they no longer give us any additional information.

In [9]:
tweets.drop(['placeLatitude', 'placeLongitude'],inplace=True, axis=1)

### 4 - <a id='drop_nan'>Drop NaN in position and text</a>

Then we drop the rows with NaN values in the longitude and latitude columns, as we need a position to detect an event.

In [10]:
len_before = len(tweets.index)
tweets = tweets.dropna(subset=['longitude', 'latitude'])
len_after = len(tweets.index)
print("Number of tweets before dropping the one without position : ", len_before)
print("Number of tweets after dropping the one without position : ", len_after)
print("Percentage of tweets lost : ", ((len_before - len_after)/len_before)*100)

Number of tweets before dropping the one without position :  15812253
Number of tweets after dropping the one without position :  15812253
Percentage of tweets lost :  0.0


As we also base our event detection on the text field, we don't want NaN values in it, so we drop those rows as well.

In [11]:
len_before = len(tweets.index)
tweets = tweets.dropna(subset=['text'])
len_after = len(tweets.index)
print("Number of tweets before dropping the one without text : ", len_before)
print("Number of tweets after dropping the one without text : ", len_after)
print("Percentage of tweets lost : ", ((len_before - len_after)/len_before)*100)

Number of tweets before dropping the one without text :  15812253
Number of tweets after dropping the one without text :  15659580
Percentage of tweets lost :  0.965536030823691
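The two cells above repeat the same count, drop, report pattern. One way to factor it out is sketched below; `drop_and_report` is a hypothetical helper, not part of `utils`, and the demo rows are made up.

```python
import numpy as np
import pandas as pd

def drop_and_report(df, subset, label):
    """Drop rows with NaN in `subset` and print how many rows were lost."""
    before = len(df)
    cleaned = df.dropna(subset=subset)
    after = len(cleaned)
    # Guard against an empty dataframe to avoid a division by zero.
    lost_pct = (before - after) / before * 100 if before else 0.0
    print(f"Number of tweets before dropping those without {label}: {before}")
    print(f"Number of tweets after dropping those without {label}: {after}")
    print(f"Percentage of tweets lost: {lost_pct:.2f}")
    return cleaned

# Made-up rows: every tweet has a position, one has no text.
demo = pd.DataFrame({
    'text': ['hello', np.nan, 'world', 'again'],
    'longitude': [7.4, 8.5, 6.1, 8.8],
    'latitude': [46.9, 47.3, 46.2, 47.2],
})
demo = drop_and_report(demo, ['longitude', 'latitude'], 'position')
demo = drop_and_report(demo, ['text'], 'text')
```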


We decided to detect an event by its day of occurrence, so we create a new column that holds the day of each tweet.

In [12]:
tweets['day'] = pd.DatetimeIndex(tweets['createdAt']).normalize()
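As an illustration of what the normalization does, the sketch below uses the `.dt` accessor, which is an alternative way to get the same result while keeping the row index aligned. The two timestamps and the index values are made up.

```python
import pandas as pd

# Made-up timestamps with a non-contiguous index, as left behind by dropna.
demo = pd.DataFrame(
    {'createdAt': pd.to_datetime(['2010-02-23 05:55:51', '2010-02-24 23:59:59'])},
    index=[3, 7],
)

# normalize() keeps the date and resets the time part to midnight.
demo['day'] = demo['createdAt'].dt.normalize()

# The new column can then be used to group tweets per day.
per_day = demo.groupby('day').size()
print(per_day)
```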

### 5 - <a id='remove_stopwords'>Remove stop words from the text field</a>

We remove the stop words, along with punctuation and common emoticons, from the tweets' text to keep only the words that can describe an event.

In [13]:
stop_words = stopwords.words('english')
stop_words += stopwords.words('french')
stop_words += stopwords.words('german')
stop_words += stopwords.words('italian')
stop_words += list(string.punctuation)  # each punctuation character becomes one entry
stop_words += ['—','/via','via', 'follow', 'please', 'i\'m', '^_^', ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
':c', ':{', '>:\\', ';(', ':-)', ':)', ';)','[=o)]', ';-)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
'<3']

In [14]:
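def text_process(row):
    """Strip URLs and @ mentions, then split the text and drop the stop words."""
    text = row['text']
    text = re.sub(r"http\S+", "", text)   # remove URLs
    text = re.sub(r"@ \S+", "", text)     # remove "@ place" check-ins
    text = re.sub(r"@\S+", "", text)      # remove @user mentions
    text = text.split()
    # The comparison is done on the lower-cased word.
    text = [word for word in text if word.lower() not in stop_words]
    row['text'] = text
    return row

In addition to the stop words, we also remove URLs and @ mentions, as they are not useful for detecting events.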
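Two details of the filtering step may be worth a second look. Words are lower-cased before the lookup, so upper-case entries in the list such as ':D' or 'XD' appear never to match, and a membership test against a list gets slow with many entries. The sketch below lower-cases the entries and stores them in a set. It is only an illustration: the short stop list stands in for the NLTK lists, `clean_text` is a hypothetical name, and the single pattern `@ ?\S+` is meant as a rough equivalent of the two mention substitutions.

```python
import re
import string

# Assumption: a tiny hand-written stop list stands in for the NLTK lists.
raw_stop_words = ['the', 'at', 'la', 'via', ':D', 'XD', ':-)'] + list(string.punctuation)

# Lower-case every entry so it can match the lower-cased word, and use a set
# for constant-time lookups.
stop_set = {w.lower() for w in raw_stop_words}

URL_RE = re.compile(r"http\S+")
MENTION_RE = re.compile(r"@ ?\S+")   # covers both "@user" and "@ place"

def clean_text(text):
    """Return the list of words left after removing URLs, mentions and stop words."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return [w for w in text.split() if w.lower() not in stop_set]

words = clean_text("Still the best coffee at La Stanza :D @ Zurich http://t.co/x via @bob")
print(words)
```

Because the function takes a string rather than a row, it could be applied to the column directly with `tweets['text'].apply(clean_text)`, which avoids building a Series for every row.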

In [15]:
tweets = tweets.apply(text_process, axis=1)
tweets.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,day
0,9514097914,17341045,2010-02-23 05:55:51,"[Guuuuten, Morgen!]",7.43926,46.9489,2010-02-23
1,9514846412,7198282,2010-02-23 06:22:40,"[Still, best, coffee, town, Stanza]",8.53781,47.3678,2010-02-23
2,9516574359,14657884,2010-02-23 07:34:25,"[week, so.., today, couldn't, focus, Sportif, ...",6.13396,46.1951,2010-02-23
3,9516952605,14703863,2010-02-23 07:51:47,"[Getting, ready..]",8.81749,47.2288,2010-02-23
4,9517198943,14393717,2010-02-23 08:02:57,"[peu, réconfort, liquide, take, away, après, d...",6.63254,46.5199,2010-02-23


### 6 - <a id='get_hashtags'>Get the hashtags from the text</a>

Now that the text is mostly clean, we extract the hashtags from the tweets, because they are very helpful for detecting events. We store them in a new column:

In [16]:
tweets['hashtags'] = tweets['text'].apply(find_hashtags)

We also remove the '#' character from the "text" field.

In [17]:
tweets['text'] = tweets['text'].apply(remove_hashtags)
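`find_hashtags` and `remove_hashtags` are defined in `utils` and are not shown here. A minimal sketch of what such helpers might look like follows, assuming the text is already a list of words and that only the '#' symbol is removed while the word itself is kept. Both function names below are hypothetical.

```python
def find_hashtags_sketch(words):
    """Hypothetical: return the words that start with '#', without the symbol."""
    return [w[1:] for w in words if w.startswith('#') and len(w) > 1]

def remove_hashtags_sketch(words):
    """Hypothetical: strip leading '#' characters and drop words left empty."""
    stripped = [w.lstrip('#') for w in words]
    return [w for w in stripped if w]

# Made-up word list in the shape produced by the stop word step.
words = ['Great', '#concert', '#Zurich', 'tonight', '#']
hashtags = find_hashtags_sketch(words)
text = remove_hashtags_sketch(words)
```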

### 7 - <a id='save_tweets'>Save the tweets to CSV for event detection</a>

We save the current dataframe so that we don't need to re-run the processing every time, as it takes a long time.

In [18]:
tweets.to_csv("../twitter-swisscom/event/tweets_processed.csv", sep=',', encoding='utf-8', index=False)
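One point to keep in mind when reading this file back: the 'text' and 'hashtags' columns hold Python lists, and a CSV stores them as their string representation. The sketch below shows one way to restore the lists on load with `ast.literal_eval`. It uses a made-up one-row dataframe and an in-memory buffer instead of the real file.

```python
import ast
import io
import pandas as pd

# Made-up row with list-valued columns, as produced by the steps above.
demo = pd.DataFrame({'id': [1], 'text': [['best', 'coffee']], 'hashtags': [[]]})

buffer = io.StringIO()
demo.to_csv(buffer, sep=',', index=False)

# Without converters the lists come back as plain strings.
buffer.seek(0)
as_strings = pd.read_csv(buffer)

# literal_eval safely parses the string representation back into a list.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    converters={'text': ast.literal_eval, 'hashtags': ast.literal_eval},
)
```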