# ADA Project : Milestone 2 Notebook

In this notebook, we will introduce you to the dataset that we chose by locally importing a part of in, and store it in a dataframe. Hence, we will be able to have an insight on the work that we will perform on the full dataset.

In [144]:
import json
import re
from pyspark.sql import *
from pyspark import SparkContext, SQLContext
import numpy as np

## 1. Twitter dataset data collection, from cluster to dataframe

In this section, we will make some operation with the help of Spark, to access, filter and export the useful tweets from the cluster to our computer.

### A few words about what we noticed for our dataset 

First, the twitter dataset starts from year 2012.
In the date section, the hour has been scaled, so that the tweet time is always relative to GMT+00. This will be of use when we will relate tweet dates and times with the Wikipedia dataframe.

### 1.1 Filtering the useful tweets

We start by declaring the Spark Context in order to make the link with the cluster. With Spark installed locally, we are able to query the cluster directly in the notebook.

In [2]:
sc = SparkContext()

In [3]:
text_file = sc.textFile("hdfs:///datasets/tweets-leon")

The idea of our filter is that we want to work with data that is already highly focused on our subject : terrorist  attacks. For the Milestone 2, we implemented a filter that considered different languages. Following the feedback from the TAs, we decided to stick with only English as the language of the tweets and the keywords, to go back to a **more simple, but more precise filter**. When a tweet is passed through the filter, we will compute the tweet score depending on its content and, if the score is high enough, select the tweet to be part of our dataframe. 

**We define below a few helper functions that will be used for our inition filter :** 



This function is the heart of the filter. Five lists are detailed, representing words of different importance. 

In [4]:
def words_to_match():
    

    language = 'en'
    
    t1 = ['terror attack', 'terrorist attack','suicide bombing','mass shooting']



    t2 = ['suicide bomber','car bombing','drone bombing','mass execution','improvised explosive device','truck bomb','grenade attack','train bombing']


    t3 = [' ied', 'hijacking','genocide','bomb attack','vehicule attack','assasination','terrorism','weapon','knife','assault rifle','dead','deaths','died','injured','kill','plant','drive-by shooting','hostage','execution']


    hashtag = ['#prayfor','#terrorism','#terrorists','#terrorattack']

    malus_list = ['years ago','year ago', 'months ago','month ago','anniversary']
    
    l = [t1,t2,t3,hashtag,malus_list]
    
    return l

The below function computes the importance of a tweet by assigning specific weights to every tweet. The assignment is done by iterating on all interesting words, looking whether they occur in the tweet content. According to the word's affiliation to one of the lists, different weight are incremented. If the total weight of the tweet reaches the threshold value (here 1.0), the filter returns True.

In [5]:
def is_interesting(content,l):
    
    content = content.lower()
    
    lang = content[:2]
    
    
    weight=0.0
    
    
    
    for w in l[0]:
        if w in content:
            weight+=1.0

    for w in l[1]:
        if w in content:
            weight+=0.9

    for w in l[2]:
        if w in content:
            weight+=0.1
             
    for w in l[3]:
        if w in content:
            weight+=0.7
            
            
    for w in l[3]:
        if w in content:
            weight-=0.5
    
    return (weight >= 1)
    

We declare the variable `bds` to be the three filtering dictionnaries. It will serve as an input of our filtering function.

In [11]:
bds = words_to_match()

Next, we actually call spark by filtering the data in the cluster with our filter, to then take a subset of defined size. We proceed to write it to a text file for later use.

In [12]:
terrorism = text_file.filter(lambda t: is_interesting(t,bds)).take(10)

In [13]:
file_t = open('tweets_terror3.txt','w')
for item in terrorism:
    file_t.write("%s\n" % item)

In [14]:
terrorism[:5]

['en\t345965752198762497\tSat Jun 15 18:07:17 +0000 2013\tSangyeH\tRT @AnnieSage: Unbelievable.... @thinkprogress: In the 6 months since Newtown, there have been FOURTEEN mass shootings http://t.co/yfClLGdx…',
 'en\t345965794213117952\tSat Jun 15 18:07:27 +0000 2013\tSR_Brant\tRT @AnnieSage: Unbelievable.... @thinkprogress: In the 6 months since Newtown, there have been FOURTEEN mass shootings http://t.co/yfClLGdx…',
 'en\t345968344391884800\tSat Jun 15 18:17:35 +0000 2013\tkiraababee\tRT @SweaterGawd: I cum faster than the fbi during a terrorist attack - 😳 shoulda kept that to yourself homie',
 'en\t345968730020384768\tSat Jun 15 18:19:07 +0000 2013\tdrgauravn85\t@asma_rehman02 even your feeder USA is agreed your role in terrorist attacks',
 'en\t345969984171823104\tSat Jun 15 18:24:06 +0000 2013\tWatchTVChannels\tQuetta Carnage: 23 killed in terrorist attacks http://t.co/CZm3wp7kki http://t.co/FLreqkpteK']

### 1.2 Handling the filtered tweets

#### Some issues we encountered:

1)    Tweets can countain retweet so many times the same tweet can appear with a retweet identification: `RT @<username>`
    - Resolved by adding Frequency parameter for tweet that has been retweet 
    - Even tough we separeted the tweet from the retweet some of the tweets appears many time without the Retweet identification. It is still important to distinguish them and not count them many times since we reckon that simply copying a message or retweeting a message has less significance than creating it.
    
2)    Even if we remove the retweet, some tweets are still the same but have not the same length which can lead to count separetly the same tweet
    - Resolved by putting a fixed max length to all tweet
    - Or by testing if a string is in another (Complicated solution not adopted)

In [408]:
import pandas as pd
from dateutil.parser import parse
import csv

In [409]:
# Read the filtered tweets from the .txt files
tweets_raw = pd.read_csv(delimiter="\t",filepath_or_buffer='tweets_terr.txt', names=["lan","id","date", "user_name", "content"],encoding='utf-8',quoting=csv.QUOTE_NONE)

In [410]:
tweets_raw.head(5)

Unnamed: 0,lan,id,date,user_name,content
0,en,3.459658e+17,Sat Jun 15 18:07:17 +0000 2013,SangyeH,RT @AnnieSage: Unbelievable.... @thinkprogress...
1,en,3.459658e+17,Sat Jun 15 18:07:27 +0000 2013,SR_Brant,RT @AnnieSage: Unbelievable.... @thinkprogress...
2,en,3.459683e+17,Sat Jun 15 18:17:35 +0000 2013,kiraababee,RT @SweaterGawd: I cum faster than the fbi dur...
3,en,3.459687e+17,Sat Jun 15 18:19:07 +0000 2013,drgauravn85,@asma_rehman02 even your feeder USA is agreed ...
4,en,3.4597e+17,Sat Jun 15 18:24:06 +0000 2013,WatchTVChannels,Quetta Carnage: 23 killed in terrorist attacks...


In this project, the id and user name of the tweet is useless, we keep therefore only the language, the date and the content of the tweet.

In [411]:
tweets_raw = tweets_raw.drop(axis= 1, labels=  ["id", "user_name"])
tweets_raw = tweets_raw.dropna()

The date countained in the tweets has been translated into `GMT` 0. So we do not have to worry about translating the date and can directly standarize with the dateutil.parser

In [413]:
#We parse the date to have a uniform 
tweets_raw["date"] = tweets_raw["date"].apply(lambda d: parse(d, ignoretz = True))

In [414]:
tweets = tweets_raw.copy()
tweets["retweet"] =  tweets["content"].map(lambda s : s[0:4] == "RT @") #Is it a retweet?

Here, we need to normalize our tweet to handle 1) and 2)

In [415]:
def remove_http(t):
    content = t.split()
    for w in content:
        
        if "http" in w:
            content.remove(w)
    return " ".join(content)



# Maximum length that we allowed to have in oder to not have different tweet

MAX_LEN = 140 - 15 - 10  # Limit of a tweet minus the maximum user name 
                         # and other charachter added when a retweet is created


def remove_retweet_and_cut(t):
    """
    Function that remove the RT @ in front of a tweet if it has been detected as a retweet, 
    And cut the tweet according to the MAX_LEN parameter
    """
    
    
    if(t["retweet"]): # We take advantage to uniform the tweets by the lower() function
        return ' '.join(t["content"].split()[2:])[0:MAX_LEN].lower()
    else :
        return t["content"][0:MAX_LEN].lower()
    

    
#Apply the functions we just created
tweets["content"] = tweets["content"].map(remove_http)
tweets["content"] =  tweets.apply(remove_retweet_and_cut, axis = 1)


#------------------------- Handling the frequency of a tweet ---------------------


# We create a dict to map the content and the frequency that a tweet with the same content occur.
freq_dict = dict(tweets.groupby("content")["lan"].count())


tweets = tweets.drop_duplicates(subset="content")


tweets["frequency"] = tweets["content"].map(lambda c : freq_dict[c])

We end up with a nice dataframe of the filtered tweets with the frequency of each tweets 

In [417]:
tweets.sort_values(by="frequency", ascending=False).head()

Unnamed: 0,lan,date,content,retweet,frequency
423016,en,2015-01-12 00:36:17,"2,000 civilians killed in a terrorist attack i...",True,3006
352,en,2012-12-17 03:29:35,here they are - all sixty-one mass shootings i...,True,1585
361528,en,2014-12-17 20:14:55,"just heard about the terrorist attack, my hear...",True,1330
418277,en,2015-01-11 14:01:14,chanting &amp; applause in paris as huge crowd...,True,1242
256112,en,2013-09-25 17:17:29,"in 1962, the us government planned terrorist a...",True,1218


In [418]:
#Here are the single tweets
tweets.sort_values(by="frequency", ascending=True).head()

Unnamed: 0,lan,date,content,retweet,frequency
254900,en,2013-09-24 13:40:38,coming up. like bomb shelters here at home in ...,False,1
327762,en,2014-05-06 09:22:31,people should go home for failing to prevent t...,False,1
327763,en,2014-05-06 09:51:24,".@nikkiyanofsky, @un reports over 399 #priceta...",False,1
327764,en,2014-05-06 11:03:08,@gasam101 would have called for the help of co...,True,1
327765,en,2014-05-06 12:07:02,bombs planted near vaishali's polling station ...,False,1


We see below that the ratio of retweet is consequent. 
Indeed, roughly 1/3 of our filtered tweets have been retweeted.

In [421]:
tweets["retweet"].sum()/len(tweets.retweet)

0.30587482267211957

In [422]:
grp_tweet = tweets.groupby("lan")

In [423]:
grp_tweet["content"].count()

lan
de      1017
en    319924
es      1032
fr       693
it       842
nl      1455
Name: content, dtype: int64

We see that, not surprisingly, we have more english tweets than the other languages. Indeed english is the most common widespread language and spanish the second one.

In [424]:
tweets = tweets.drop(axis= 1, labels=  ["lan"])
tweets.to_csv("dataframe_terror.csv", index=False)

## 2. Data from Wikipedia

In this part we scrape data from Wikipedia. We want to access the tables that register the terror attacks that happened at some point in the past. There are some Wikipedia articles (such as https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_January-June_2011) that do exactly that. The data is presented as tables, and all the articles that we need present data in this form.

In [425]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
%matplotlib inline

In [426]:
from datetime import date
import re

In [314]:
# Simple map of month name to its number
month_to_int = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}

# Reversed map
int_to_month = {i: m for m, i in month_to_int.items()}

In [315]:
# The wikipedia URL that every article has in common
base_url = 'https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_'

We show all the articles that we are going to use to find the data

In [316]:
# All specific end of the wikipedia URL, along with the corresponding month numbers of the article
times = {}

for year in range(2011, 2015):
    # For years 2011 to 2014, the articles appear biyearly
    times.update({'January-June_' + str(year): list(range(1, 7))})
    times.update({'July-December_' + str(year): list(range(7, 13))})
    
for year in range(2015, 2018):
    # For years 2015 to 2017, the articles appear monthly
    for month, int_ in month_to_int.items():
        times.update({month + '_' + str(year): [int_]})
        
list(times.keys())


['January-June_2011',
 'July-December_2011',
 'January-June_2012',
 'July-December_2012',
 'January-June_2013',
 'July-December_2013',
 'January-June_2014',
 'July-December_2014',
 'January_2015',
 'February_2015',
 'March_2015',
 'April_2015',
 'May_2015',
 'June_2015',
 'July_2015',
 'August_2015',
 'September_2015',
 'October_2015',
 'November_2015',
 'December_2015',
 'January_2016',
 'February_2016',
 'March_2016',
 'April_2016',
 'May_2016',
 'June_2016',
 'July_2016',
 'August_2016',
 'September_2016',
 'October_2016',
 'November_2016',
 'December_2016',
 'January_2017',
 'February_2017',
 'March_2017',
 'April_2017',
 'May_2017',
 'June_2017',
 'July_2017',
 'August_2017',
 'September_2017',
 'October_2017',
 'November_2017',
 'December_2017']

In [317]:
def to_int(s):
    '''Returns the first integer found in s'''
    i = re.findall('\d+', s)
    return int(i[0]) if len(i) > 0 else float('NaN')

In [318]:
def to_date(s, year):
    '''Returns a date from the datetime library from a string like \'January 1\''''
    l = s.split(' ')
    return date(to_int(year), month_to_int[l[0]], to_int(l[1]))

In [319]:
def wiki_table_to_df(end_url, month_range, base_url=base_url):
    '''Creates a dataframe from the tables available in the wikipedia page'''
    print('Scraping for', end_url)
    r = requests.get(base_url + end_url) # Get request
    soup = BeautifulSoup(r.text, 'lxml') # Parse HTML
    wiki_tables = soup.findAll('table', {'class': 'wikitable sortable'}) # Get tables from the wikipedia page

    table = []

    for month_int, wiki_table in zip(month_range, wiki_tables):
        for row in wiki_table.findAll('tr'):
            elems = row.findAll('td') 
            if len(elems) != 0:
                interesting = [elem.text for elem in elems[:5]]
                 # First element is the day of the month, but we add the name of the month as well in front of it
                interesting[0] = int_to_month[month_int] + ' ' + interesting[0]
                table.append(interesting)
                
    df = pd.DataFrame(table, columns=['date', 'type', 'deaths', 'injuries', 'location'])
    df.date = df.date.apply(lambda s: to_date(s, end_url[-4:])) # Translate the date with the year defined by the end_url arg
    df.deaths = df.deaths.apply(to_int) # Map death number to int
    df.injuries = df.injuries.apply(to_int) # Map injuries number to int
    
    return df

In [378]:
dfs = []

# Get a DataFrame for every article from 2011 to 2017
for time, month_range in times.items():
    dfs.append(wiki_table_to_df(time, month_range))
    
df = pd.concat(dfs)
print('We have {} registered attacks from January 1st, 2011 up to today (November 28th, 2017)'.format(df.shape[0]))

Scraping for January-June_2011
Scraping for July-December_2011
Scraping for January-June_2012
Scraping for July-December_2012
Scraping for January-June_2013
Scraping for July-December_2013
Scraping for January-June_2014
Scraping for July-December_2014
Scraping for January_2015
Scraping for February_2015
Scraping for March_2015
Scraping for April_2015
Scraping for May_2015
Scraping for June_2015
Scraping for July_2015
Scraping for August_2015
Scraping for September_2015
Scraping for October_2015
Scraping for November_2015
Scraping for December_2015
Scraping for January_2016
Scraping for February_2016
Scraping for March_2016
Scraping for April_2016
Scraping for May_2016
Scraping for June_2016
Scraping for July_2016
Scraping for August_2016
Scraping for September_2016
Scraping for October_2016
Scraping for November_2016
Scraping for December_2016
Scraping for January_2017
Scraping for February_2017
Scraping for March_2017
Scraping for April_2017
Scraping for May_2017
Scraping for June_201

Here is what some of the entries of the final result look like

In [379]:
df.iloc[[0, 56, 1033, -1]]

Unnamed: 0,date,type,deaths,injuries,location
0,2011-01-01,Suicide bombing,21.0,97.0,"Alexandria, Egypt"
56,2011-02-13,Raid,7.0,5.0,"Zamboanga, Philippines"
37,2014-11-18,"Shooting, Melee attack",5.0,7.0,"Jerusalem, Israel"
48,2017-12-17,Suicide bombing,1.0,5.0,"Kandahar province, Afghanistan"


We just separate the city from the country here and save our dataframe

In [380]:
def get_city(location):
    if ',' in location :
        city = location.split(', ',1)[0]
        return city
    else : 
        return "Unknown"
    
def get_country(location):
    if ',' in location :
        country = location.split(', ')[-1]
        return country
    else : 
        return location

df['city'] = df['location'].map(get_city)
df['country'] = df['location'].map(get_country)
df["type"] = df["type"].fillna("UnkownType")
df = df.drop(axis= 1, labels=  ["location"])
df.to_csv('attacks.csv', index = False)

## 3. Making sense of the Data

INFO SUR LA DATA : 548847 raw Tweets, env. 350'000 sans les RT, s'arrete le 31 Jan 2016

- reussir a importer dans dataframe
- enlever wiki data apres dernier tweet

In [427]:
tweets = pd.read_csv("dataframe_terror.csv", parse_dates=[0])
terror = pd.read_csv('attacks.csv', parse_dates=[0])

In [428]:
tweets.head()

Unnamed: 0,date,content,retweet,frequency
0,2013-06-15 18:07:17,unbelievable.... @thinkprogress: in the 6 mont...,True,2
1,2013-06-15 18:17:35,i cum faster than the fbi during a terrorist a...,True,1
2,2013-06-15 18:19:07,@asma_rehman02 even your feeder usa is agreed ...,False,1
3,2013-06-15 18:24:06,quetta carnage: 23 killed in terrorist attacks...,False,1
4,2013-06-15 18:36:29,quetta carnage: 23 killed in terrorist attacks,False,2


In [429]:
terror.head()

Unnamed: 0,date,type,deaths,injuries,city,country
0,2011-01-01,Suicide bombing,21.0,97.0,Alexandria,Egypt
1,2011-01-04,Assassination,1.0,0.0,Islamabad,Pakistan
2,2011-01-04,Bombing,4.0,26.0,Abuja,Nigeria
3,2011-01-07,Kidnapping,9.0,,Unknown,Niger
4,2011-01-07,Suicide bombing,17.0,23.0,Spin Boldak,Afghanistan


In [431]:
j = tweets["date"].iloc[2]
terror["type"] = terror["type"].fillna("Unkown")

In [432]:
min_date = max(tweets.date.min(), terror.date.min())
max_date = min(tweets.date.max(), terror.date.max())

terror = terror[(terror["date"]>min_date) & (terror["date"]<max_date )]
tweets = tweets[(tweets["date"]>min_date) & (tweets["date"]<max_date )]

In [461]:
("out" or "in") in ["winqd","in"]

False

In [486]:
t  = tweets.iloc[248243]

overall_score = np.zeros(len(df_terror))


#For ecach event we will compute the hypothetic score matching relating to a particular attack
for index, attack in enumerate(terror.get_values()):
    
    # obviously we start with a score equal to zero
    score = 0
    
    # extracting the attack parameters
    dattack, type_attack, death, injured, city, country = attack
    
    
    #
    score += date_score(t.date, dattack )
    
    #If the tweet countains the exact number of death we already consider as acceptable
    if(not np.isnan(death)):
        if(str(int(death)) in t["content"].split()):
            score +=1
    
    #If the tweet countains the exact number of injured we already consider as acceptable
    if(not np.isnan(injured)):
        if((str(int(injured)) in t["content"].split())):
            score +=1
        
    #If the tweet countains the city of the attack we already consider as acceptable

    if(city.lower() in t["content"]):
        score += 1
    
    #If the tweet countains the country of the attack we don't consider as acceptable,
    #but we increase the score
    if(country.lower() in t["content"]):
        score += 0.7
    
    
    if(type_attack.lower() in t["content"] ):
        score += 1
        
    overall_score[index] = score


    
np.argmax(overall_score)

1059

In [487]:
print("overall score = "+str(overall_score[1059]))
print("tweet = "+t["content"])

t

overall score = 1.52631578947
tweet = @fatahkhairi know the victims of the terror attacks in


date                                       2015-01-12 08:57:04
content      @fatahkhairi know the victims of the terror at...
retweet                                                  False
frequency                                                    1
Name: 253523, dtype: object

In [537]:
terror

Unnamed: 0,date,type,deaths,injuries,city,country
1,2011-01-04,Assassination,1.0,0.0,Islamabad,Pakistan
2,2011-01-04,Bombing,4.0,26.0,Abuja,Nigeria
3,2011-01-07,Kidnapping,9.0,,Unknown,Niger
4,2011-01-07,Suicide bombing,17.0,23.0,Spin Boldak,Afghanistan
5,2011-01-08,Ambush,16.0,14.0,Lawdar,Lahij Yemen
6,2011-01-11,Ambush,10.0,18.0,South Kordofan,Sudan
7,2011-01-11,"Violence, fighting",1.0,4.0,Abidjan,Ivory Coast
8,2011-01-11,"Shooting, riot",1.0,4.0,Samalut,Egypt
9,2011-01-12,Bombing,2.0,7.0,Peshawar,Pakistan
10,2011-01-12,Car bomb,17.0,20.0,Bannu,Pakistan


In [539]:
terror.index

Int64Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
            ...
            1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579],
           dtype='int64', length=1579)

In [570]:
 ## The scores: 
def date_score(dtweet, dattack):
    diff = (dtweet - dattack).days
    
    alpha = 0.1
    
    if diff < 0 :
        return -np.Inf
    else :
        return 1 / (alpha * diff + 1)

    
    
def match_attacks(t, terror):
    
    to_keep_with_all = ['abu','al','del','beit','beni','bir','el','ein','las','la','les','los','new','san','tel','tal']
    
    
    overall_score = np.zeros(len(terror))
    
    #We won't worry the case where the attack date is greater than the date of the tweet
    sub_terror = terror[terror["date"]<t["date"]]
    

    
    #For ecach event we will compute the hypothetic score matching relating to a particular attack
    for index, attack in zip(sub_terror.index,sub_terror.get_values()):

        # obviously we start with a score equal to zero
        score = 0

        # extracting the attack parameters
        dattack, type_attack, death, injured, city, country = attack


        #
        score += date_score(t["date"], dattack )

        #If the tweet countains the exact number of death we already consider as acceptable
        if(not np.isnan(death)):
            if(str(int(death)) in t["content"].split()):
                score +=1

        #If the tweet countains the exact number of injured we already consider as acceptable
        if(not np.isnan(injured)):
            if((str(int(injured)) in t["content"].split())):
                score +=1

        #If the tweet countains the city of the attack we already consider as acceptable

        if(city.lower() in t["content"]):
            score += 1

        #If the tweet countains the country of the attack we don't consider as acceptable,
        #but we increase the score
        if(country.lower() in t["content"]):
            score += 0.7


        if(type_attack.lower() in t["content"] ):
            score += 1

        overall_score[index] = score
    
    if(np.max(overall_score)<1):
        return -1
    
    else:
        return np.argmax(overall_score)

In [574]:
tu = tweets.loc[300000:300100].copy()

%timeit tu["related_attack"] = tu.apply(lambda t: match_attacks(t, terror) , axis = 1)

22.2 s ± 951 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [572]:

print("Approximatively {}  hours to run".format(10 * len(tweets)/3600) )

Approximatively 888.0055555555556  hours to run


In [575]:
tu

Unnamed: 0,date,content,retweet,frequency,related_attack
300000,2015-11-14 19:06:52,"french prosecutor says 129 dead, 352 wounded, ...",True,1,1390
300001,2015-11-14 19:07:59,video: statement by prime minister benjamin ne...,True,2,1390
300002,2015-11-14 19:08:12,made in france movie withdrawn after paris ter...,False,1,1390
300003,2015-11-14 19:08:46,our condolences to former real madrid player @...,True,1,1391
300004,2015-11-14 19:09:00,i didn't see this on thursday. must be no whit...,False,1,1387
300005,2015-11-14 19:09:33,ranting here: yes. just because france had a t...,False,1,1390
300006,2015-11-14 19:10:15,"paris (@ap) - paris prosecutor: 129 dead, 352 ...",True,1,1390
300007,2015-11-14 19:11:05,germany team back home after paris terror atta...,True,2,1390
300008,2015-11-14 19:11:17,"rt the eu didn't create peace in europe,it cau...",True,4,1390
300009,2015-11-14 19:11:38,cal state long beach student among those kille...,True,1,1390


In [579]:
terror.loc[1390]

date        2015-11-13 00:00:00
type                     Attack
deaths                      130
injuries                    368
city                      Paris
country                  France
Name: 1390, dtype: object

In [593]:
print(" \n |".join(list(tweets.loc[300086:300094]["content"])))

paris rocked by huge terror attack day before g20 summit https://t.co/vd4rqatoqf|v 
 |paris is sadly the main target for terrorist attacks. it breaks my heart. 
 |paris terror attacks executed to lock down climate summit conference - coalition against... 
 |arrests made in connection with paris terrorist attacks 
 |it's so easy to blame a religion for the terrorist attacks that are occurring daily. don't be one of those people. 
 |my thoughts are with the victims, families, friends - and with the french people - following paris terrorist attack 
 |why are people blaming a refugee crisis for a terrorist attack?? 
 |mas info aquí theeconomist: could more terrorist attacks lead to israeli-style security s… 
 |briton confirmed killed in paris terror attack named as nick alexander #ripnickalexander https://t.…


In [576]:
terror.loc[1387]

date        2015-11-12 00:00:00
type            Suicide bombing
deaths                       43
injuries                    240
city                     Beirut
country                 Lebanon
Name: 1387, dtype: object

In [578]:
print(tweets.loc[300074]["content"])
tweets.loc[300074]

43 people killed in terrorist attack yesterday in lebanon. and syria where more than 100 people are killed in terro


date                                       2015-11-14 19:39:05
content      43 people killed in terrorist attack yesterday...
retweet                                                   True
frequency                                                    3
Name: 300074, dtype: object

In [530]:
terror.loc[138]

date                2011-04-12 00:00:00
type        Targeted killings/Shootings
deaths                               18
injuries                              0
city                            Karachi
country                        Pakistan
Name: 138, dtype: object

In [502]:
tweets.groupby("related_attack").sum()["frequency"]

### 3.1 Create new DataFrame with Tweets and Wiki data

Create DF : **date, attack type, city, country, real impact, deads, injured, social impact, number of tweets**

- to merge both : match with date of event, and maybe a function that gives a matching score (name of town, country)
- take into account the tweets from the date to a certain amount of days

Remark: Assumption: Each tweet correspond to one Attack identified by the index number

### 3.2 Plot the attacks in a map

- highlight the real impacts and the social impact
- folium ? ideally, map with circles for real impact, and (also circle ?) for social impacts

### 3.3 Other graphs/info

- all the #prayfor : list the different towns
- number of attacks/deaths by country (only taken from **Wikipedia dataset**)
- ranking of the most liked/ignored attack (just divide social impact by real impact)
- same but with country (aggregate social and total impact from before (more relevant than previous point)
- maybe see the rise of ISIS ? find by keyword (ISIS) and check with the timeline (graph ISIS claimed vs time, **Wikipedia dataset**)
- finally, see the fading of reactions over time 


#### 3.3.2 Safety ranking

In [114]:
pd.read_csv?

Avec la partie 3.2 et 3.3, on devrait avoir assez d'info a leur montrer ! 

Et si on galère niveau temps, on se replie sur displayer des trucs sur le dataset wikipedia !

## 4. Do the report