# ADA Project : Milestone 2 Notebook

In this notebook, we will introduce you to the dataset that we chose by locally importing a part of in, and store it in a dataframe. Hence, we will be able to have an insight on the work that we will perform on the full dataset, once everything is set.

In [1]:
import json
import re
from pyspark.sql import *
from pyspark import SparkContext, SQLContext

## 1. Twitter dataset data collection, from cluster to dataframe

In this section, we will make some operation with the help of Spark, to access, filter and export the useful tweets from the cluster to our computer.

### A few words about what we noticed for our dataset 

First, the twitter dataset starts from year 2012.
In the date section, the hour has been scaled, so that the tweet time is always relative to GMT+00. This will be of use when we will relate tweet dates and times with the Wikipedia dataframe.

### i) filtering the useful tweets

We start by declaring the Spark Context in order to make the link with the cluster. With Spark installed locally, we are able to query the cluster directly in the notebook.

In [2]:
sc = SparkContext()

In [3]:
text_file = sc.textFile("hdfs:///datasets/tweets-leon")

The idea of our filter is that we want to work with data that is already highly focused on our subject : terrorist  attacks. Moreover, to avoid some bias that could appear from the selection of a subset of the entire Twitter database, we will consider all the most common languages, and select the interesting tweets independently of the language.
Indeed, the filter is composed of lists of words of different weight, for each language. When a tweet is passed through the filter, we will compute the tweet score depending on it's content and, if the score is high enough, select the tweet to be part of our dataframe. Since all selected words are translated, this should harmonize the selection of our data.

**We define below a few helper functions that will be used for our inition filter :** 



In [4]:
from googletrans import Translator

The first helper function is used to translate a word of interest in every selected language :

In [5]:
def translate_word(w,languages):
    
    translator = Translator()


    w_dict = {}
    
    for l in languages:
        t = translator.translate(w,dest=l)
        w_dict[l]=t.text
        
    
    
    return w_dict
    


This function is the heart of the filter. Languages and words are selected and inputed directly into three different lists of different weight. They are translated and put into three dicts, the keys being the language and the values being all words of the selected language, from the selected "importance list".
The function outputs the three dicts, one for each importance.

In [10]:
def words_to_match():
    

    language = 'en'
    
    t1 = ['terror attack', 'terrorist attack','suicide bombing','mass shooting']



    t2 = ['suicide bomber','car bombing','drone bombing','mass execution','improvised explosive device','truck bomb','grenade attack','train bombing']


    t3 = [' ied', 'hijacking','genocide','bomb attack','vehicule attack','assasination','terrorism','weapon','knife','assault rifle','dead','deaths','died','injured','kill','plant','drive-by shooting','hostage','execution']


    hashtag = ['#prayfor','#terrorism','#terrorists','#terrorattack']

    malus_list = ['years ago','year ago', 'months ago','month ago','anniversary']
    
    l = [t1,t2,t3,hashtag,malus_list]
    
    return l

The below function computes the importance of a tweet by assigning specific weights to every tweet. The assignment is done by iterating on all interesting words, looking whether they occur in the tweet content. According to the word's affiliation to one of the three dicts, different weight is incremented. If the total weight of the tweet reaches the threshold value (here 1.0), the filter returns True.

In [12]:
def is_interesting(content,l):
    
    content = content.lower()
    
    lang = content[:2]
    
    
    weight=0.0
    
    
    
    for w in l[0]:
        if w in content:
            weight+=1.0

    for w in l[1]:
        if w in content:
            weight+=0.9

    for w in l[2]:
        if w in content:
            weight+=0.1
             
    for w in l[3]:
        if w in content:
            weight+=0.7
            
            
    for w in l[3]:
        if w in content:
            weight-=0.5
    
    return (weight >= 1)
    

We declare the variable `bds` to be the three filtering dictionnaries. It will serve as an input of our filtering function.

In [13]:
bds = words_to_match()

Next, we actually call spark by filtering the data in the cluster with our filter, to then take a subset of defined size. We proceed to write it to a text file for later use.

In [14]:
terrorism = text_file.filter(lambda t: is_interesting(t,bds)).take(30)

In [30]:
file_t = open('terr.txt','w')
for item in terrorism:
    file_t.write("%s\n" % item)

In [15]:
terrorism

['en\t345965752198762497\tSat Jun 15 18:07:17 +0000 2013\tSangyeH\tRT @AnnieSage: Unbelievable.... @thinkprogress: In the 6 months since Newtown, there have been FOURTEEN mass shootings http://t.co/yfClLGdx…',
 'en\t345965794213117952\tSat Jun 15 18:07:27 +0000 2013\tSR_Brant\tRT @AnnieSage: Unbelievable.... @thinkprogress: In the 6 months since Newtown, there have been FOURTEEN mass shootings http://t.co/yfClLGdx…',
 'en\t345968344391884800\tSat Jun 15 18:17:35 +0000 2013\tkiraababee\tRT @SweaterGawd: I cum faster than the fbi during a terrorist attack - 😳 shoulda kept that to yourself homie',
 'en\t345968730020384768\tSat Jun 15 18:19:07 +0000 2013\tdrgauravn85\t@asma_rehman02 even your feeder USA is agreed your role in terrorist attacks',
 'en\t345969984171823104\tSat Jun 15 18:24:06 +0000 2013\tWatchTVChannels\tQuetta Carnage: 23 killed in terrorist attacks http://t.co/CZm3wp7kki http://t.co/FLreqkpteK',
 'en\t345973100472594432\tSat Jun 15 18:36:29 +0000 2013\tgeoworld_live\tQuett

As an important part of our analysis, we want to have a feel on the percentage of the tweets that we filter. According to our filter's characteristics, the ratio of tweets of interest over all tweets is of 1/3000. Taking a total tweet count of around 18 billion, the total tweets that are interesting for us is of 6 million.

In [20]:
terr = text_file.take(300000)

count=0
for t in terr:
    if is_interesting(t,bds):
        count+=1
count

99

We can see above that we get almost 100 matches for a subset of 300'000 tweets, illustrating the ratio of 1/3000.

In [21]:
terrorism[:5]

['fr\t345963978092072960\tSat Jun 15 18:00:14 +0000 2013\tKazdaliMaaradj\tPakistan: un double-attentat à la bombe à Quetta (sud-ouest) fait au moins 23 morts (nouveau bilan des autorités locales)',
 'en\t345964011382259712\tSat Jun 15 18:00:22 +0000 2013\tSumairaALi4\t#BLA needs to be targetted in INDIA and UK. ISI should get in motion as were in 80s #JudicialTerrorism',
 'en\t345964045007978497\tSat Jun 15 18:00:30 +0000 2013\tMonotheist_\tUnrest in Baluchistan. BLA terrorism there. Baluch demand justice. Foreign and local intelligence are involved. No one dares to anyone',
 'es\t345964057632837632\tSat Jun 15 18:00:33 +0000 2013\texodo3013\t@akatsuky1000 quien ayudo a librar al pueblo de Libia del terrorista #1 en el mundo Omar K  que masacraba a su pueblo con aviones de guerra',
 'es\t345964070274482176\tSat Jun 15 18:00:36 +0000 2013\tCesar_Soto_16\tCon Los Terroristas - Alianza Metal 1°H: http://t.co/SDT9qzSVHz vía @YouTube']