In [2]:
import json
import re
from pyspark.sql import *
from pyspark import SparkContext, SQLContext
import pandas as pd
from dateutil.parser import parse

## Handling the filtered tweets

### Some issues we encountered:

1)    A tweet can be retweeted many times, so the same text can appear repeatedly with a retweet marker: `RT @<username>`
    - Resolved by adding a frequency column that counts how often each tweet occurs
    - Even though we separated tweets from retweets, some tweets still appear many times without the retweet marker. It is still important to detect them and not count them several times, since we reckon that simply copying or retweeting a message carries less significance than writing it.
    
2)    Even after removing the retweet marker, some copies of the same tweet have different lengths, which can lead to counting the same tweet separately
    - Resolved by truncating every tweet to a fixed maximum length
    - Or by testing whether one string is contained in another (complicated solution, not adopted)
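
For illustration only, the substring test mentioned in 2) could look like the sketch below. It is not the approach used in this notebook; the helper name `collapse_contained` and the sample texts are hypothetical. The nested loop is quadratic in the number of distinct texts, which is one reason the fixed-length truncation is simpler.

```python
def collapse_contained(texts):
    """Treat a text that is contained in a longer text as the same tweet.

    Returns a dict mapping each kept (longest) text to the number of
    texts it absorbed, itself included.
    """
    counts = {}
    # Visit longer texts first so that shorter variants are absorbed by them.
    for text in sorted(texts, key=len, reverse=True):
        for kept in counts:
            if text in kept:
                counts[kept] += 1
                break
        else:
            # No longer text contains this one: keep it as a new entry.
            counts[text] = 1
    return counts


# Hypothetical sample: the second text is a truncated copy of the first.
sample = [
    "Breaking: explosion reported downtown, police on site",
    "Breaking: explosion reported downtown",
    "Completely different message",
]
result = collapse_contained(sample)
print(result)
```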

In [48]:
# Read the filtered tweets from the .txt files
tweets_raw = pd.read_table(filepath_or_buffer='tweets_terror2.txt', names=["lan","id","date", "user_name", "content"])

In this project, the id and user name of a tweet are not needed, so we keep only the language, the date and the content of the tweet.

In [49]:
tweets_raw = tweets_raw.drop(labels=["id", "user_name"], axis=1)

The dates contained in the tweets have already been converted to GMT+0, so we do not have to worry about time zones and can standardize them directly with `dateutil.parser`.

In [50]:
# Parse the dates so that they all share a uniform datetime format
tweets_raw["date"] = tweets_raw["date"].apply(lambda d: parse(d))
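
As a small sanity check of the parsing step, the sketch below shows `dateutil.parser.parse` turning two differently formatted strings into the same timezone-aware datetime. The sample strings are assumptions made for illustration; the exact date format stored in `tweets_terror2.txt` is not shown in this notebook.

```python
from datetime import timedelta

from dateutil.parser import parse

# Two hypothetical spellings of the same instant, both with a zero UTC offset.
samples = [
    "2013-08-02 12:00:53+00:00",
    "Fri Aug 02 12:00:53 +0000 2013",
]
parsed = [parse(s) for s in samples]

# Both strings resolve to the same instant, and the offset is zero.
print(parsed[0] == parsed[1])
print(parsed[0].utcoffset() == timedelta(0))
```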

In [51]:
tweets = tweets_raw.copy()
tweets["retweet"] =  tweets["content"].map(lambda s : s[0:4] == "RT @") #Is it a retweet?

Here, we need to normalize our tweets to handle issues 1) and 2).

In [52]:

# Maximum length we allow, so that copies of the same tweet are not treated as different tweets

MAX_LEN = 140 - 15 - 10  # Length limit of a tweet minus the maximum user name length
                         # and the other characters added when a retweet is created


def remove_retweet_and_cut(t):
    """
    Remove the leading "RT @<username>" from a tweet that has been flagged as a retweet,
    and truncate the tweet to MAX_LEN characters.
    """
    
    if t["retweet"]:
        # Drop the first two whitespace-separated tokens ("RT" and "@<username>")
        return ' '.join(t["content"].split()[2:])[0:MAX_LEN]
    else:
        return t["content"][0:MAX_LEN]
    

    
#Apply the function we just created    
tweets["content"] =  tweets.apply(remove_retweet_and_cut, axis = 1)


#------------------------- Handling the frequency of a tweet ---------------------


# Map each tweet's content to the number of tweets that share this content.
freq_dict = dict(tweets.groupby("content")["lan"].count())


tweets = tweets.drop_duplicates(subset="content")


tweets["frequency"] = tweets["content"].map(lambda c : freq_dict[c])

We end up with a tidy dataframe of the filtered tweets, with the frequency of each tweet.
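
To make the normalization and frequency steps easier to follow, here is a minimal sketch on a hypothetical four-row dataframe. The toy data and the names `toy`, `normalise` and `unique` are illustrative only, and `value_counts` is used here as a shorter stand-in for the `groupby` plus dict lookup above.

```python
import pandas as pd

MAX_LEN = 140 - 15 - 10  # same truncation length as above

# Hypothetical data: two retweets and one plain copy of the same message.
toy = pd.DataFrame({
    "lan": ["en", "en", "en", "fr"],
    "content": [
        "RT @alice: Stay safe everyone",
        "RT @bob: Stay safe everyone",
        "Stay safe everyone",
        "Bonjour tout le monde",
    ],
})
toy["retweet"] = toy["content"].str.startswith("RT @")


def normalise(row):
    """Strip the "RT @<username>" prefix of retweets, then truncate."""
    if row["retweet"]:
        text = " ".join(row["content"].split()[2:])
    else:
        text = row["content"]
    return text[:MAX_LEN]


toy["content"] = toy.apply(normalise, axis=1)

# Count identical contents, keep one row per content, attach the count.
freq = toy["content"].value_counts()
unique = toy.drop_duplicates(subset="content").copy()
unique["frequency"] = unique["content"].map(freq)
print(unique[["content", "retweet", "frequency"]])
```

Note that `drop_duplicates` keeps the first occurrence by default, so the `retweet` flag of a deduplicated row only says whether the first copy seen was a retweet.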

In [53]:
tweets.sort_values(by="frequency", ascending=False).head()

Unnamed: 0,lan,date,content,retweet,frequency
2153,en,2013-08-02 12:00:53+00:00,Zayn is NOT a terrorist.\nZayn donated for cha...,True,37
769,en,2013-07-16 04:00:12+00:00,The whites agree to stop blaming all Arab's fo...,True,25
8,en,2013-06-15 18:01:53+00:00,Black Crime =Gang Violence. \nArab Crime = Ter...,True,19
3596,fr,2013-07-08 10:40:40+00:00,RTsi arabe a la piscine :\n-Jvais faire la bom...,True,16
3257,es,2012-10-06 14:00:21+00:00,Cuba demanda justicia en el Día de las Víctima...,False,15


In [54]:
# Here are tweets that occur only once
tweets.sort_values(by="frequency", ascending=True).head()

Unnamed: 0,lan,date,content,retweet,frequency
0,fr,2013-06-15 18:00:14+00:00,Pakistan: un double-attentat à la bombe à Quet...,False,1
2609,en,2013-08-02 12:29:20+00:00,@RuckaRuckaAli I love reading how these holy #...,False,1
2610,es,2013-08-02 12:29:21+00:00,#UnDíaComoHoy pero de 1980: en la estación fer...,False,1
2612,en,2013-08-02 12:29:24+00:00,@Harry_Styles I love you. Please follow me my ...,False,1
2614,en,2013-08-02 12:29:27+00:00,@NancyAtwal: You call him a terrorist I call h...,False,1


We see below that the share of retweets is substantial.
Indeed, roughly 1/3 of our filtered tweets are retweets.

In [55]:
tweets["retweet"].sum()/len(tweets.retweet)

0.3538555318500457
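
The ratio above is computed on the deduplicated dataframe. As a hedged illustration with hypothetical data, the sketch below shows that the share of retweets can differ depending on whether it is measured over all rows or over unique contents; the mean of a boolean column is the same quantity as `sum()/len()`.

```python
import pandas as pd

# Hypothetical data: content "a" appears three times, twice as a retweet.
all_rows = pd.DataFrame({
    "content": ["a", "a", "a", "b", "c"],
    "retweet": [True, True, False, False, False],
})

# Share of retweets over every row (2 out of 5).
share_all = all_rows["retweet"].mean()

# Share of retweets after keeping one row per content (1 out of 3).
unique_rows = all_rows.drop_duplicates(subset="content")
share_unique = unique_rows["retweet"].mean()

print(share_all, share_unique)
```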

In [56]:
grp_tweet = tweets.groupby("lan")

In [57]:
grp_tweet["content"].count()

lan
en    2388
es     688
fr      96
it      79
nl      30
Name: content, dtype: int64

We see that, not surprisingly, we have more English tweets than tweets in any other language, with Spanish in second place. This matches our expectation that English and Spanish are the most widely used of these five languages.
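
The counts can also be turned into shares to compare the languages more directly. The sketch below re-enters the counts printed above by hand in a series named `counts` (an illustrative name); on the real dataframe, `tweets["lan"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the output of the previous cell.
counts = pd.Series({"en": 2388, "es": 688, "fr": 96, "it": 79, "nl": 30})

# Fraction of the deduplicated tweets written in each language.
shares = counts / counts.sum()
print(shares.round(3))
```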