RIP pour

    - Environmental Terrorism
    - Secure Border Initiative

# Data Wrangling

In this notebook we will create the following datasets:

    - 48 Terrorism related search queries [2012-01-01 -> 2014-08-31]
    - 25 Domestic related search queries [2012-01-01 -> 2014-08-31]
    - top 30 MTurk evaluation terrorism related search queries [2012-01-01 -> 2014-08-31]


### Imports

In [13]:
!pip install gtab
import gtab
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np



In [14]:
# Directory to save data
data_path = "data/"

### Create and set GoogleTrendsAnchorBank for queries

In [15]:
def create_and_set_gtab(start_timeframe, end_timeframe, geo, gtab_path = "gtab_data"):
    """Creates and sets a gtab to the required options.
    This functions takes a lot of time if the anchorbank was not yet created.
    It also creates a directory if needed to the gtab_path

    Args:
        start_timeframe (str): start of the timeframe for the queries (included)
        end_timeframe (str): end of the timeframe for the queries (included)
        geo (str): geolocalisation of the search query
        gtab_path (str): path to already existing data

    Returns:
        t (GTAB): GoogleTrendsAnchorBank to use for the queries, consistent with the provided options

    """
    t = gtab.GTAB(dir_path=my_path)
    # Create time frame
    timeframe = start_timeframe + " " + end_timeframe
    
    # Set required options
    t.set_options(pytrends_config={"geo": geo, "timeframe": timeframe})  
    
    # Create anchorbank if it doesn't already exist
    t.create_anchorbank() # takes a while to run since it queries Google Trends. 
    
    # We apply the anchorbank
    t.set_active_gtab(f"google_anchorbank_geo={worldwide}_timeframe={timeframe}.tsv")
    
    return t

In [16]:
#TODO: should we include until present time for an extended analysis?
# Create time frame corresponding to the paper study
start_timeframe = "2012-01-01"
end_timeframe   = "2014-08-31" #TODO: si cest borne inclu (ca a l'air detre le cas) sinon =>"2014-09-01"
timeframe = start_timeframe + " " + end_timeframe

# We choose the worldwide geolocalisation to mimic wikipedia
worldwide = "" # empty string corresponds to worldwide

t = create_and_set_gtab(start_timeframe, end_timeframe, worldwide)

Directory already exists, loading data from it.
Using directory 'gtab_data'
Active anchorbank changed to: google_anchorbank_geo=_timeframe=2019-01-01 2020-08-01.tsv

Start AnchorBank init for region  in timeframe 2012-01-01 2014-08-31...
GTAB with such parameters already exists! Load it with 'set_active_gtab(filename)' or rename/delete it to create another one with this name.
Active anchorbank changed to: google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv



## Create all dataframes

In [32]:
def create_dataframe(search_queries, geo, t):
    """Creates a dataframe concatenating all search queries interest over time data from google
    trends. The data will have as an added collumn the names of the original search and the 
    geolocalisation of the search query.
    The returned dataframe has attributes:
    {date, max_ratio  max_ratio_hi  max_ratio_l, article_name, geo}

    Args:
        search_queries (list[string]): List of all required search queries for the full dataframe
        geo (str): geolocalisation of the search query
        t (GTAB): GoogleTrendsAnchorBank to use for the queries it needs to be consistent with the geo parameter

    Returns:
        dataframe: dataframe concatenating all search queries interest over time data from google
    trends

    """
    # For each terrorism related search query create the corresponding interest over time google trends data
    all_interest_over_time_dfs = [t.new_query(search_query) for search_query in search_queries]
    
    # Remove all queries who failed
    all_interest_over_time_dfs = [df for df in all_interest_over_time_dfs if type(df) != type(-1)]
    
    # Append the name and location to all dataframes
    for i, df in enumerate(all_interest_over_time_dfs):
        # Add the article name collumn 
        df["article_name"] = [search_queries[i]]*len(df)
        # Add the localisation collumn
        df["geo"] = ["worldwide"]*len(df)        
    
    # Concatenate all dfs into one centrale one
    return pd.concat(all_interest_over_time_dfs).reset_index()

In [33]:
def get_topics(domain_name, data_path):
    """Gives the list of all topics corresponding to the search queries for the specific domain.
    
    Args:
         domain_name (str): domain name of the search queries (i.e: terrorism, domestic, top_30_terrorism)
         data_path (str): directory path containing the mapping pickle file 
    
    Returns:
        list of all topics for this domain
    """
    
    mapping_df = pd.read_pickle(data_path+"mapping.pkl")
    
    return mapping_df[mapping_df["domain_name"] == domain_name]["node_equivalent"].tolist()

### Terrorism dataset

48 Terrorism related search queries [2012-01-01 -> 2014-08-31]

In [34]:
# Find all topics for this dataset
terrorism_topic_queries = get_topics("terrorism", data_path)

In [35]:
# Create terrorism dataframe
terrorism_df = create_dataframe(terrorism_topic_queries, worldwide, t)


Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0v74'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/07jq_'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/07jq_'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0gtxdb2'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0d05q4'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0jdd'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/03shp'
New query cal

In [36]:
# View of created df
terrorism_df.head()

Unnamed: 0,date,max_ratio,max_ratio_hi,max_ratio_lo,article_name,geo
0,2012-01-01,2.934609,3.123105,2.757999,/m/0v74,worldwide
1,2012-01-08,3.317384,3.521799,3.125733,/m/0v74,worldwide
2,2012-01-15,3.827751,4.053392,3.616044,/m/0v74,worldwide
3,2012-01-22,3.572568,3.787596,3.370888,/m/0v74,worldwide
4,2012-01-29,3.444976,3.654698,3.24831,/m/0v74,worldwide


In [37]:
# Save dataframe to pickle
terrorism_df.to_pickle(data_path+"terrorism.pkl")

### Domestic dataset

25 Domestic related search queries [2012-01-01 -> 2014-08-31]

In [38]:
# Find all topics for this dataset
domestic_topic_queries = get_topics("domestic", data_path)

In [40]:
# Create dommestic dataframe
dommestic_df = create_dataframe(domestic_topic_queries, worldwide, t)

Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0fytk'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0js8z'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/07xhy'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/038r8p'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/02qtlv'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0y4n5ll'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0f4r5'
New query 

In [41]:
# View of created df
dommestic_df.head()

Unnamed: 0,date,max_ratio,max_ratio_hi,max_ratio_lo,article_name,geo
0,2012-01-01,4.727273,4.960754,4.506877,/m/0fytk,worldwide
1,2012-01-08,4.969697,5.211931,4.741001,/m/0fytk,worldwide
2,2012-01-15,4.727273,4.960754,4.506877,/m/0fytk,worldwide
3,2012-01-22,4.848485,5.086342,4.623939,/m/0fytk,worldwide
4,2012-01-29,4.969697,5.211931,4.741001,/m/0fytk,worldwide


In [42]:
# Save dataframe to pickle
dommestic_df.to_pickle(data_path+"dommestic.pkl")

### Top 30 Terrorism dataset

top 30 MTurk evaluation terrorism related search queries [2012-01-01 -> 2014-08-31]

In [43]:
# Find all topics for this dataset
top_30_terrorism_topic_queries = get_topics("domestic", data_path)

In [44]:
# Create top-30 terrorism dataframe
top_30_terrorism_df = create_dataframe(top_30_terrorism_topic_queries, worldwide, t)

Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0fytk'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0js8z'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/07xhy'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/038r8p'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/02qtlv'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0y4n5ll'
New query calibrated!
Using gtab_data\output\google_anchorbanks\google_anchorbank_geo=_timeframe=2012-01-01 2014-08-31.tsv
New query '/m/0f4r5'
New query 

In [45]:
# View of created df
top_30_terrorism_df.head()

Unnamed: 0,date,max_ratio,max_ratio_hi,max_ratio_lo,article_name,geo
0,2012-01-01,4.727273,4.960754,4.506877,/m/0fytk,worldwide
1,2012-01-08,4.969697,5.211931,4.741001,/m/0fytk,worldwide
2,2012-01-15,4.727273,4.960754,4.506877,/m/0fytk,worldwide
3,2012-01-22,4.848485,5.086342,4.623939,/m/0fytk,worldwide
4,2012-01-29,4.969697,5.211931,4.741001,/m/0fytk,worldwide


In [46]:
# Save dataframe to pickle
top_30_terrorism_df.to_pickle(data_path+"top_30_terrorism.pkl")