[![Alt Right Community](/img/ALT_RIGHT.jpg)](https://www.jstor.org/stable/26984798?seq=1#metadata_info_tab_contents)

# The rise of far-right extremist speech between 2016 and 2020
### Observed through a dataset of quotes from the press, highlighting the evolution of opinions and ideas that shape the past, present, and the future of our society.

$$ \\ $$

Project in Applied Data Analysis (CS-401)

*Team members: Camil Hamdane (SV), Clémentine Lévy-Fidel (SV), Nathan Fiorellino (SV), Nathan Girard (SV)*



# **Section 1 : Wrangling**

## 1. Getting started

### 1.1 Installing and importing libraries



In [None]:
# Installations
!pip install pandas==1.0.5 
!pip install seaborn
!pip install tld
!pip install numpy
!pip install matplotlib
!pip install vaderSentiment

In [1]:
# Imports
import os
import re
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import ast
from urllib.parse import urlparse
import string
from string import punctuation

In [2]:
# Google import
from google.colab import drive
from google.colab import files

### 1.2 Mounting drive

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Data Path 
input_path = '/content/drive/MyDrive/Quotebank/'
data_storage_path = '/content/drive/Shareddrives/Ada-The data collectivists/'

primer_input = "quotes-"
tail_input = ".json.bz2"

primer_output = "df_FF"
tail_output = ".csv"

years = ["2016", "2017", "2018", "2019", "2020"]

## 2. Preprocessing



### 2.1 Extraction of the data

#### a. Data Exploration


We first take a quick look at the data to confirm that we read it correctly and to get an idea of what it looks like. This helps us decide how to preprocess it and, in particular, which filters are relevant.

In [5]:
path_to_file = input_path + primer_input + years[4] + tail_input

In [6]:
path_to_file

'/content/drive/MyDrive/Quotebank/quotes-2020.json.bz2'

In [None]:
df_reader = pd.read_json(path_to_file, lines = True, compression = 'bz2', chunksize = 100)

In [None]:
for chunk in df_reader:
  print("Columns")
  print(chunk.columns)
  print("Columns 1 and 2")
  print(chunk.iloc[0:5,0:2])
  print("Columns 3 and 4")
  print(chunk.iloc[0:5,2:4])
  print("Columns 5 and 6")
  print(chunk.iloc[0:5,4:6])
  print("Columns 7 and 8")
  print(chunk.iloc[0:5,6:8])
  print("Column 9")
  print(chunk.iloc[0:5,8])
  fig, axes = plt.subplots(1, 2, figsize=(15, 5))

  sns.histplot(ax = axes[0], data = chunk['numOccurrences'], bins = 100)
  sns.histplot(ax = axes[1], data = chunk['date'], bins = 100)
  break

In this section, we extract the data and preprocess it so that we can better conduct our analysis. The data exploration led us to choose the following filtering steps:

- Remove quotes with a low numOccurrences, fewer than **10**,
- Remove quotes that have more than **1** speaker,
- Keep only samples with exactly **1** QID,
- Keep quotes whose probability is higher than **0.6**,
- Remove the **"phase"** feature of the dataframe,
- Remove news outlets relaying fewer than a certain number of quotes, as they do not contribute to reflecting global trends.
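
The quote-level filters listed above can be illustrated on a few toy records in plain Python. This is only a sketch: the records and their values are made up, `keep` and `strip_phase` are hypothetical helper names, and it assumes (as the notebook's own code does) that the most probable speaker comes first in `probas`. The outlet filter is not covered here.

```python
# Toy records mimicking Quotebank rows (all values are made up for illustration).
records = [
    {"quoteID": "q1", "numOccurrences": 25, "qids": ["Q1"],
     "probas": [["Speaker A", "0.91"], ["None", "0.09"]], "phase": "E"},
    {"quoteID": "q2", "numOccurrences": 3, "qids": ["Q2"],
     "probas": [["Speaker B", "0.88"], ["None", "0.12"]], "phase": "E"},
    {"quoteID": "q3", "numOccurrences": 40, "qids": ["Q3", "Q4"],
     "probas": [["Speaker C", "0.95"], ["None", "0.05"]], "phase": "E"},
    {"quoteID": "q4", "numOccurrences": 18, "qids": ["Q5"],
     "probas": [["Speaker D", "0.52"], ["None", "0.48"]], "phase": "E"},
]

MIN_OCCURRENCES = 10   # threshold taken from the list above
MIN_PROBABILITY = 0.6  # threshold taken from the list above


def keep(record):
    """Return True if the record passes the quote-level filters."""
    # Assumption: the most probable speaker is the first entry of 'probas'.
    top_probability = float(record["probas"][0][1])
    return (
        record["numOccurrences"] >= MIN_OCCURRENCES
        and len(record["qids"]) == 1
        and top_probability > MIN_PROBABILITY
    )


def strip_phase(record):
    """Return a copy of the record without the 'phase' field."""
    return {key: value for key, value in record.items() if key != "phase"}


filtered = [strip_phase(r) for r in records if keep(r)]
print([r["quoteID"] for r in filtered])  # ['q1']
```

Only `q1` survives: `q2` is too rare, `q3` has two QIDs and `q4` has a top speaker probability below the threshold.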

Because of the huge size of the data and the limited RAM provided by Google Colab, we decided to process the data **per year**. In addition, each year is read in chunks that are processed sequentially to avoid exceeding the available memory. This preprocessing pipeline filters out most of the data, such that **approximately 4%** of the initial data remains at the end. We therefore obtain clean and usable data for further analysis, stored in one file per year named *"df_FF20XX.csv"*.
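
The chunk-by-chunk idea can be sketched with the standard library alone. The notebook itself relies on `pd.read_json(..., chunksize=...)`; the version below is a pandas-free illustration in which `iter_chunks`, `retained_fraction`, the demo file name and the toy filter are all hypothetical.

```python
import bz2
import json
import os
import tempfile
from itertools import islice


def iter_chunks(path, chunksize):
    """Yield lists of at most `chunksize` parsed records from a JSON-lines .bz2 file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        while True:
            lines = list(islice(handle, chunksize))
            if not lines:
                break
            yield [json.loads(line) for line in lines]


def retained_fraction(path, chunksize, keep):
    """Stream the file chunk by chunk; return the kept records and the fraction kept."""
    kept, total = [], 0
    for chunk in iter_chunks(path, chunksize):
        total += len(chunk)
        kept.extend(record for record in chunk if keep(record))
    return kept, (len(kept) / total if total else 0.0)


# Build a small throwaway file so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "quotes-demo.json.bz2")  # hypothetical file name
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        for n in range(1, 101):
            handle.write(json.dumps({"quoteID": f"q{n}", "numOccurrences": n}) + "\n")

    kept, fraction = retained_fraction(
        demo_path, chunksize=30, keep=lambda record: record["numOccurrences"] >= 91
    )

print(len(kept), fraction)  # 10 0.1
```

Only one chunk of raw records is held in memory at a time; the filtered records, which are far fewer, are the only ones accumulated.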

In [61]:
# Constants
min_occurence = 10
acceptable_QID_amount = 1
min_probability = 0.6
remove_columns = ["phase"]

In [91]:
# Definition of helper functions

def clean_chunk(chunk):
    """ 
        Cleans a dataset chunk:
        - keeps quotes with more than `min_occurence` occurrences,
        - removes unattributed quotes (quotes whose most probable speaker is unknown) and quotes whose speaker name is associated with more than 1 QID,
        - replaces 'probas' by the probability of the most probable speaker and keeps only quotes whose speaker probability is greater than 0.6,
        - removes the 'phase' column,
        - expands 'urls' into one row per URL, reduced to its domain, and drops quotes relayed by a domain with fewer than 10 rows in the chunk.
    """          
    chunk_clean = chunk.copy()
    
    # Remove samples with low numOccurrences (strict comparison: quotes with exactly min_occurence occurrences are dropped too)
    chunk_clean = chunk_clean[chunk_clean['numOccurrences'] > min_occurence]

    # Filtering for samples containing exactly one QID
    chunk_clean = chunk_clean[chunk_clean['qids'].map(len) == 1].copy()
    chunk_clean['qids'] = chunk_clean['qids'].apply(lambda qids: qids[0])
    
    # Keep only the probability of the most probable speaker
    if chunk_clean['probas'].dtype != 'float64': # In case the 'probas' column has already been pre-processed
        chunk_clean['probas'] = chunk_clean['probas'].apply(lambda probas: float(probas[0][1]))
    
    # Remove samples with low probability
    chunk_clean = chunk_clean[chunk_clean['probas'] > min_probability]
    
    # Removing phase feature
    if 'phase' in chunk_clean: # In case the 'phase' column has already been pre-processed
        chunk_clean = chunk_clean.drop('phase', axis = 1)

    # Removing irrelevant news outlets
    # Convert string-formatted list into list
    #chunk_clean['urls'] = chunk_clean['urls'].map(lambda x: ast.literal_eval(x))
    # Expand each quote into several rows, one for each URL
    chunk_clean = chunk_clean.explode('urls')
    # Replace each URL by its domain only
    chunk_clean['urls'] = chunk_clean['urls'].map(lambda x: urlparse(x).netloc)

    # Count rows per domain and collect the domains with fewer than 10 of them
    domain_count = chunk_clean[['quoteID', 'urls']].groupby(['urls']).count()
    single_domain = domain_count[domain_count['quoteID'] < 10].index.to_list()
    # Drop every quote relayed by at least one of these domains
    quoteID_single_domain = set(chunk_clean[chunk_clean['urls'].isin(single_domain)]['quoteID'])
    chunk_clean = chunk_clean[~chunk_clean['quoteID'].isin(quoteID_single_domain)]
      
    return chunk_clean
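
The outlet filter at the end of `clean_chunk` can be traced on a toy example. The sketch below is a plain-Python analogue of the `explode` / `netloc` / `groupby` steps, not the notebook's code: the URLs are made up, and the threshold is lowered to 2 (the notebook uses 10) so that the effect is visible on three quotes.

```python
from collections import Counter
from urllib.parse import urlparse

MIN_ROWS_PER_DOMAIN = 2  # toy threshold; the notebook uses 10

# Toy quotes with made-up URLs.
quotes = [
    {"quoteID": "q1", "urls": ["https://www.big-news.example/a", "https://daily.example/x"]},
    {"quoteID": "q2", "urls": ["https://www.big-news.example/b", "https://tiny-blog.example/1"]},
    {"quoteID": "q3", "urls": ["https://daily.example/y"]},
]

# One (quoteID, domain) pair per URL: the analogue of explode followed by netloc.
pairs = [(q["quoteID"], urlparse(url).netloc) for q in quotes for url in q["urls"]]

# Number of rows per domain, then the domains below the threshold.
domain_counts = Counter(domain for _, domain in pairs)
rare_domains = {d for d, count in domain_counts.items() if count < MIN_ROWS_PER_DOMAIN}

# Any quote relayed by at least one rare domain is dropped entirely.
dropped_ids = {quote_id for quote_id, domain in pairs if domain in rare_domains}
kept_pairs = [(quote_id, domain) for quote_id, domain in pairs if quote_id not in dropped_ids]

print(sorted(rare_domains))  # ['tiny-blog.example']
print(kept_pairs)
```

Note that `q2` disappears completely, including its row for the frequent domain, because the filter works on quote IDs rather than on individual rows.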

In [68]:
def describe_chunk(size, i, add_string):
    """
        Print the size of the chunk processed and its number. 
    """
    
    print(f'{add_string} processing of chunk n°{i+1} with {size} rows')
    print("")

In [95]:
def filter_year(path_to_file, chunksize):
  df_reader = pd.read_json(path_to_file, lines = True, compression = 'bz2', chunksize = chunksize)

  # Collect the cleaned chunks and concatenate once: appending to a DataFrame in a loop copies it every time
  cleaned_chunks = []
  for i, chunk in enumerate(df_reader):
    describe_chunk(len(chunk), i, "Beginning")
    chunk = clean_chunk(chunk)
    cleaned_chunks.append(chunk)
    describe_chunk(len(chunk), i, "Ending")
  year_df = pd.concat(cleaned_chunks) if cleaned_chunks else pd.DataFrame()
  return year_df

In [96]:
chunksize = 10000

In [102]:
def handle_year(i):
  path_to_file = input_path + primer_input + years[i] + tail_input
  path_to_output = data_storage_path + primer_output + '/' + primer_output + years[i] + tail_output
  year_df = filter_year(path_to_file,chunksize)
  year_df.to_csv(path_to_output)

In [None]:
handle_year(0) #2016

In [None]:
handle_year(1) #2017

In [None]:
handle_year(2) #2018

In [None]:
handle_year(3) #2019

In [None]:
handle_year(4) #2020

## 3. Selection with Hatebase

### 3.1

In [None]:
#Import dataframes
path_to_file = data_storage_path + primer_output + '/' + primer_output + years[0] + tail_output
df_2016 = pd.read_csv(path_to_file)
# str.lower() returns a new Series, so the result has to be assigned back
df_2016['quotation'] = df_2016['quotation'].str.lower()
df_2016.head()

In [None]:
hate_frame = 'HateBase/hateframe.csv'

In [None]:
hate_path = data_storage_path + hate_frame
hateframe = pd.read_csv(hate_path)
hateset = set(hateframe['Term'])
hateset = [' {} '.format(x).lower() for x in hateset]

In [None]:
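# Scratch cell: relies on `one_hot` and `df_2016_hateful`, which are defined in later cells
a = df_2016_hateful.copy()

for col in one_hot: a[col] = 0
a['quotation'] = a['quotation'].apply(lambda x: f' {x} ')
a['quotation'] = a['quotation'].str.replace(r'[^\w\s]', '', regex=True)
a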

In [None]:
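import re

#One hot vector with hate "domains"
one_hot = list(hateframe.columns)[2:]

# Escape the terms so that characters such as "." or "(" are matched literally
hate_regex = '|'.join(re.escape(term) for term in hateset)


def extract_hate(df_, hateframe):

  out = df_.copy()
  for col in one_hot: out[col] = 0

  # Lowercase (assigning the result back), pad with spaces and strip punctuation
  out['quotation'] = out['quotation'].str.lower()
  out['quotation'] = out['quotation'].apply(lambda x: f' {x} ')
  out['quotation'] = out['quotation'].str.replace(r'[^\w\s]', '', regex=True)

  out = out[out['quotation'].str.contains(hate_regex)].copy()

  # Terms are lowercased in hateset, so compare against lowercased Hatebase terms
  terms = hateframe['Term'].str.lower()
  for index, row in out.iterrows():
    match = re.search(hate_regex, str(row['quotation']))
    if match is not None:
      hateword = match[0].strip()
      print(hateword)
      # Write through out.loc: assigning to `row` only modifies a copy
      domains = hateframe.loc[terms == hateword, one_hot]
      if not domains.empty:
        out.loc[index, one_hot] = domains.iloc[0].values

  return out

test = extract_hate(df_2016_hateful, hateframe)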

  

In [None]:
test

In [None]:
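# Lowercase (assigning the result back), pad with spaces and strip punctuation
df_2016['quotation'] = df_2016['quotation'].str.lower()
df_2016['quotation'] = df_2016['quotation'].apply(lambda x: f' {x} ')
df_2016['quotation'] = df_2016['quotation'].str.replace(r'[^\w\s]', '', regex=True)

# Keep the quotes containing at least one of the space-padded terms
df_2016_hateful = df_2016[df_2016['quotation'].str.contains('|'.join(hateset))]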
df_2016_hateful
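
Padding both the terms and the quotations with spaces makes a term match only as a whole word or phrase. A minimal sketch of this matching, using neutral placeholder terms rather than the Hatebase lexicon (`terms`, `normalise` and `first_term` are hypothetical names):

```python
import re

# Placeholder terms standing in for a lexicon (hypothetical, not taken from Hatebase).
terms = ["cat", "red herring"]

# Pad every term with spaces so that it only matches as a whole word or phrase.
padded_terms = [f" {term.lower()} " for term in terms]
pattern = re.compile("|".join(re.escape(term) for term in padded_terms))


def normalise(quotation):
    """Lowercase, strip punctuation and pad with spaces."""
    stripped = re.sub(r"[^\w\s]", "", quotation.lower())
    return f" {stripped} "


def first_term(quotation):
    """Return the first lexicon term found in the quotation, or None."""
    match = pattern.search(normalise(quotation))
    return match[0].strip() if match else None


print(first_term("The cat sat down."))              # cat
print(first_term("Please concatenate the files."))  # None
print(first_term("That was a red herring!"))        # red herring
```

Without the padding, "concatenate" would count as a match for "cat". One limitation of this approach is that terms containing punctuation can never match, since punctuation is stripped from the quotations.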


In [None]:
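import re
# Named sample_quote rather than string, to avoid shadowing the string module imported above
sample_quote = df_2016_hateful['quotation'].iloc[59]

re.search('|'.join(hateset), sample_quote)[0]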

In [None]:
for i in df_2016_hateful['quotation']: print(i)

In [None]:
for i in range(29): print(df_2016_hateful['quotation'].iloc[i], '\n')
df_2016_hateful

In [None]:
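phrase = 'He was talking about anyone who feels offended by anything he said,'

# Python strings have no .contains method: apply the same normalisation as above and search with re
normalised_phrase = ' {} '.format(re.sub(r'[^\w\s]', '', phrase.lower()))
re.search('|'.join(hateset), normalised_phrase) is not None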

In [None]:
sns.histplot(df_2020['numOccurrences'], log_scale = True)

In [None]:
df_2020.describe()

After this first filtering, the median rises to 23 occurrences per quote, compared with a median of 1 before filtering. Quotes relayed this many times are more likely to be genuine.
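
The shift in the median follows directly from dropping the long tail of rarely repeated quotes. A toy illustration with made-up counts (not the Quotebank values):

```python
from statistics import median

# Made-up occurrence counts with a long tail of rarely repeated quotes.
occurrences = [1, 1, 1, 1, 1, 1, 2, 3, 12, 30, 150]

MIN_OCCURRENCES = 10  # same order of magnitude as the first filtering step
frequent = [n for n in occurrences if n >= MIN_OCCURRENCES]

print(median(occurrences))  # 1
print(median(frequent))     # 30
```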

## 4. Sentiment Analysis


## 5. Populating speaker information

# **Section 2 : Data Analysis**

## 1. Hypotheses