[![Alt Right Community](/img/ALT_RIGHT.jpg)](https://www.jstor.org/stable/26984798?seq=1#metadata_info_tab_contents)

# The rise of far-right extremist speech between 2016 and 2020
### Observed through a dataset of quotes from the press, highlighting the evolution of opinions and ideas that shape the past, present, and the future of our society.


Project in Applied Data Analysis (CS-401)

*Team members: Camil Hamdane (SV), Clémentine Lévy-Fidel (SV), Nathan Fiorellino (SV), Nathan Girard (SV)*



## 1. About the project 

The peace and prosperity that developed countries of the North experienced over the last century have set many nations on a path of constant technological progress and economic growth. This prosperity, however, may be threatened by new global issues that jeopardize not only the economy but also the future of humanity. The recent COVID-19 pandemic has shown once more that the socio-economic system prevailing in most Western democracies has potentially fatal flaws that we need to address before it collapses under its own weight.

While most people agree that the next hundred years are uncertain, we have yet to agree on a solution. Some proposals lean towards a progressive society, while others favor a more conservative approach. While many points of view base their theories on economic and environmental claims, others are based on hate and fear of difference, and on hate and fear of the change we might need to forge a more inclusive society. To better tackle the issues we face, we need to understand how some ideologies gain traction in the public debate and observe how they might shape the minds of citizens.

We are interested here in the **rise of far-right extremist speech** between **2016** and **2020**, observed through a dataset of quotes from the press, highlighting the evolution of opinions and ideas that shape the past, present, and future of our society.

With that in mind, we hope to answer the following questions:
- Has far-right extremist speech been on the rise since 2016?
- How accurately can we identify a trend with the Quotebank dataset when it is put in perspective with right-wing extremist terrorist attacks?
- Can we identify news outlets that spread hate speech more than others and, if so, is this consistent over the years?
- Is it also consistent with their political orientation?

## 2. How to get started

### 2.1 Mount your drive to your notebook

It is possible to mount your Google Drive to Colab if you need additional storage or if you need to use files from it. To do that, run the following code cell (click the play button or use the keyboard shortcut 'Command/Ctrl+Enter'):

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


 

1.   After running the cell, a URL will appear.

2.   Following this URL, you will be redirected to a page where you choose the Google Drive account to mount.

3.   You will then be asked to give Google Drive Stream permission to access the chosen Google account.

4.   After you grant access, an authorization code will be given to you.

5.   Copy the authorization code into the dedicated text box in Colab, under '*Enter your authorization code:*'.

After entering the authorization code, you should get the message '*Mounted at /content/drive*'.

The path to the files from the mounted Drive will then be '/content/drive/MyDrive/'. By opening the Files tab (left sidebar, folder icon), you should also be able to see the accessible files. You can now read the data directly from the Google Drive you mounted following the process above. Make sure you mounted the drive in which you saved the shortcut to the Quotebank data.
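Before reading any data, it can help to confirm that the Quotebank file is actually reachable from the mounted drive. The snippet below is a minimal sketch: the helper name `quotebank_file_exists` is hypothetical, and the path is the one used later in this notebook, which assumes the shortcut sits in a `Quotebank` folder at the root of your drive.

```python
import os

def quotebank_file_exists(path):
    """Return True if `path` points to an existing regular file."""
    return os.path.isfile(path)

# Path used later in this notebook; adjust it to where your shortcut lives.
path_to_file = '/content/drive/MyDrive/Quotebank/quotes-2020.json.bz2'

if not quotebank_file_exists(path_to_file):
    print(f'File not found: {path_to_file}. Check that the drive is mounted.')
```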

### 2.2 Install the required libraries (*OPTIONAL*)

You don't need to perform this step if you have previously installed the libraries ([pandas](https://pandas.pydata.org), [seaborn](https://seaborn.pydata.org), [tld](https://pypi.org/project/tld/)) in your environment.

In [12]:
# Installations
!pip install pandas==1.0.5 
!pip install seaborn
!pip install tld



In [8]:
# Imports
import os
from google.colab import files

import pandas as pd
import numpy as np
import seaborn as sns

ModuleNotFoundError: No module named 'google'
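The `ModuleNotFoundError` above appears when the notebook is run outside Google Colab, where the `google.colab` package is not installed. One possible workaround, sketched below, is to guard the import so that the rest of the notebook can still run locally. The names `IN_COLAB` and `download_if_colab` are hypothetical helpers introduced for this illustration.

```python
try:
    # Only available inside Google Colab
    from google.colab import files
    IN_COLAB = True
except ImportError:
    files = None
    IN_COLAB = False

def download_if_colab(filename):
    """Trigger a browser download in Colab; do nothing elsewhere.

    Returns True if a download was triggered, False otherwise.
    """
    if IN_COLAB:
        files.download(filename)
        return True
    print(f'Not running in Colab; {filename} stays on the local disk.')
    return False
```

With this guard in place, calls to `files.download(...)` can be replaced by `download_if_colab(...)` when working outside Colab.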

## 3. Preprocessing



In [6]:
# Constants 
path_to_file = '/content/drive/MyDrive/Quotebank/quotes-2020.json.bz2'
new_filename = 'processed_quotes-2020'
csv_extension = '.csv'

### 3.1 Extraction of the data

In this section, we extract the data and preprocess it so that a correct analysis can be conducted. This step also makes the data easier for the model to interpret, which should yield better answers to our questions. To this end, the filtering steps are listed below:

- Remove news outlets relaying fewer than XXXXX quotes, as they do not contribute to reflecting global trends,
- Remove quotes with low numOccurrences (fewer than **10**),
- Keep only samples with exactly **1** QID,
- Keep only quotations that mention the keywords we chose to tackle.

Because of the huge size of the data and the limited RAM provided by Google Colab, we decided to process the data **per year**. In addition, each year's data is processed sequentially in chunks to avoid exceeding the storage capacity. This preprocessing pipeline filters out most of the data, such that **approximately 4%** of the initial data remains at the end. We therefore obtain clean and usable data for further analysis, stored in files named *"processed_quotes-20XX.csv"*, one for each year.
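The keyword filtering step listed above is not yet part of the helper functions in this notebook. The following is a minimal sketch of how it could look, assuming the quote text is stored in a `quotation` column. The keyword list and the function name `filter_by_keywords` are hypothetical and serve only as an illustration; they are not the project's actual choices.

```python
import re
import pandas as pd

# Hypothetical keywords, for illustration only; not the project's actual list.
KEYWORDS = ['immigration', 'border wall']

def filter_by_keywords(df, keywords, column='quotation'):
    """Keep rows whose text contains at least one keyword
    (case-insensitive, matched on word boundaries)."""
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
    mask = df[column].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask]

# Small made-up sample to show the behaviour
sample = pd.DataFrame({'quotation': [
    'We need to talk about immigration.',
    'The weather is nice today.',
    'They promised a Border Wall.',
]})
filtered = filter_by_keywords(sample, KEYWORDS)
print(filtered)
```

Matching on word boundaries avoids keeping quotes in which a keyword only appears as part of a longer, unrelated word.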

In [5]:
# Definition of helper functions

def clean_chunk(chunk):
    """
        Cleans a dataset chunk:
        - removes unattributed quotes and quotes whose speaker name is associated
          with more than one QID (only quotes with exactly one QID are kept),
        - replaces 'probas' by the probability of the most probable speaker and
          keeps only quotes for which it is greater than 0.6,
        - removes quotes with fewer than 10 occurrences,
        - removes the 'phase' column.
    """
    # Keep samples with exactly one QID; boolean indexing works whatever the chunk index is
    chunk_clean = chunk[chunk['qids'].map(len) == 1].copy()
    chunk_clean['qids'] = chunk_clean['qids'].apply(lambda qids: qids[0])

    # Replace the list of (speaker, probability) pairs by the top speaker's probability
    if chunk_clean['probas'].dtype != 'float64':
        chunk_clean['probas'] = chunk_clean['probas'].apply(lambda probas: float(probas[0][1]))

    # Keep quotes whose most probable speaker has a probability greater than 0.6
    chunk_clean = chunk_clean[chunk_clean['probas'] > 0.6]

    # Remove samples with fewer than 10 occurrences
    chunk_clean = chunk_clean[chunk_clean['numOccurrences'] >= 10]

    # Drop the 'phase' column if present
    if 'phase' in chunk_clean:
        chunk_clean = chunk_clean.drop('phase', axis=1)

    return chunk_clean

def process_chunk(chunk):
    """
        Print the size of the chunk processed and the names of the columns. 
    """
    
    print(f'Processing chunk with {len(chunk)} rows')
    print(chunk.columns)

In [9]:
# Extraction of the data

df_reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=1000000)
chunk_file_name = new_filename + csv_extension

for i, chunk in enumerate(df_reader):
    process_chunk(chunk)
    chunk = clean_chunk(chunk)

    # Append to the CSV file, writing the header only with the first chunk
    chunk.to_csv(chunk_file_name, mode='a', header=(i == 0))

    # Only the first chunk is processed for now; remove this break to process the whole file
    break

# Download the resulting file from Colab
files.download(chunk_file_name)

# TO-DO: aggregate chunks for 2020, aggregate with years 2016-2019, and export data in bz2 or CSV

ValueError: Expected object or value
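The aggregation mentioned in the TO-DO is still open. As a rough sketch, and assuming the per-year files follow the *"processed_quotes-20XX.csv"* naming described in section 3.1 and sit in the working directory, the yearly files could be combined as follows. The function name `aggregate_years` and the output filename in the usage comment are hypothetical.

```python
import glob
import pandas as pd

def aggregate_years(pattern='processed_quotes-20*.csv'):
    """Concatenate all per-year CSV files matching `pattern` into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f'No file matches {pattern}')
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)

# Example usage (requires the per-year files to exist):
# all_quotes = aggregate_years()
# all_quotes.to_csv('processed_quotes-2016-2020.csv.bz2', index=False, compression='bz2')
```

Sorting the paths keeps the years in chronological order in the combined DataFrame.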

## 4. Processing **[TO BE DONE - P3]**

In [None]:
# Implementation of our models to improve the interpretability of the data

## 5. Visualization **[TO BE DONE - P3]**

In [10]:
# Visualization methods