# Project Milestone 2

Here we will describe the whole pipeline to get all the results we would like to include in the final story (on the final website). We will go through all the different steps and describe in detail the operations needed. 

For the final story we decided to focus on the influence of Brexit. More precisely, we would like to assess how Brexit was perceived and how that perception evolved over the years. The visualizations we aim to provide in the final story are detailed in this [Section](#Results).

## **[Preprocessing steps](#Preprocessing)**

As usual, the first step consists of several substeps that aim at cleaning and transforming the data. By clicking on a task link, you can access the respective pipeline.
- *[Data exploration and Sanity check](#Sanity_check)* : Explore the dataset, check its consistency and get familiar with the different features/information provided.
- *[Data extraction](#extraction)* : Extract the data of interest that will be used to perform the tasks related to each idea.
- *[Data augmentation](#augmentation)* : Augment the data with more features about the quotations, such as the sentiment of the quotation and the nationality of the speaker. These new features will be used to perform the tasks related to each idea.
- *[Data cleaning and merging](#cleaning)* : Perform a final cleaning of the quotations as well as of the speakers and generate the 3 main datasets that will be used for the analysis.
- *[Quotations and speakers clustering](#clustering)* : Cluster the quotations and the speakers according to the quotation vector and the features added in the data augmentation task. This clustering will mainly be used to develop a recommendation tool.

## **[Generate the results for the final story](#Results)**

- [General Statistics](#Statistics) : Explore the dataset and visualize some first graphs for each new feature.  
- [Country map](#Country) : Show how Brexit is perceived depending on the country.
- [Sector map](#Sector) : Show how Brexit is perceived depending on the sector.
- [Visualize speakers evolution](#2Dplot) : Visualize speakers into an embedding space that should reflect the similarities between speakers **[TO BE COMPLETED]**.
- [Recommendation Tool](#Recommandation) : Tool that recommends similar speakers to the one searched by the user **[TO BE COMPLETED]**. 
- [Correlation with stocks](#Stocks) : Study if a correlation exists between remarkable Brexit peaks and the stock actions from companies of the FTSE100 **[TO BE COMPLETED]**.


# Before diving into the code 

To run everything from scratch, make sure to have a `Data` folder containing the following files: 
- The quotebank datasets for each year: `quotes-yyyy.json.bz2`
- The speaker attributes file `speaker_attributes.parquet` as well as the associated lookup table `wikidata_labels_descriptions_quotebank.csv.bz2`

To benefit from checkpoints, download `Brexit_datas` from [Google Drive](https://drive.google.com/drive/folders/12EgO7E97KcNrZtQhjUmkOp5iDF1V7ufR?usp=sharing)


## Import useful libraries and define useful variables

In [None]:
# STANDARD LIBRARIES
from os.path import exists
import bz2 
import json
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import itertools
from datetime import datetime

# Dynamic graphs
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output

# Machine learning libraries
import torch
from sentence_transformers import SentenceTransformer, util
from sklearn.cluster import SpectralClustering
from tsne_torch import TorchTSNE as TSNE
from sklearn.manifold import LocallyLinearEmbedding
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Graph algorithms
import networkx as nx

# import string distances
import stringdist

# Load the lexicon for sentiment analysis
nltk.download('vader_lexicon')

import warnings
warnings.filterwarnings("ignore")

# Data files
PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]

# Columns to analyze for the one-hot vectorization task
columns_to_map = ['nationality', 'gender', 'occupation', 'party', 'academic_degree', 'religion']

# type of the sentence_transformer
sentence_transformer_type = 'all-MiniLM-L6-v2' 

<a id='Preprocessing'></a>

# Preprocessing steps

<a id='Sanity_check'></a>

## Data exploration and Sanity check

We decided to perform the following sanity checks on the original data: 

- We first check that each entry of each quotation has the right type (e.g. `numOccurrences` should be an integer).
- We check that the `probas` sum to 1 (within a tolerance of 0.02).
- We check that `numOccurrences` is greater than or equal to the length of the list containing the urls.
- We check that the `date` is consistent with the year of the dataset the quotation comes from.
- We check that if `qids` is not empty, then a `speaker` is specified.
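
To make the checks above concrete, here is a minimal, self-contained sketch that applies them to a single record. The record is made up for illustration, and `run_checks` is a hypothetical helper, not one of the notebook's functions (the type check is left out for brevity).

```python
# Hypothetical Quotebank-style record, made up for illustration only.
record = {
    "quoteID": "2019-01-01-000001",
    "quotation": "Brexit means Brexit.",
    "speaker": "Jane Doe",
    "qids": ["Q0000001"],
    "date": "2019-01-01 00:00:00",
    "numOccurrences": 2,
    "probas": [["Jane Doe", "0.85"], ["None", "0.15"]],
    "urls": ["http://example.com/a", "http://example.com/b"],
}

def run_checks(rec, year, tol=0.02):
    """Return the names of the sanity checks that `rec` fails."""
    failures = []
    # The speaker probabilities must exist and sum to 1 within `tol`
    proba_sum = sum(float(p) for _, p in rec["probas"])
    if not rec["probas"] or abs(proba_sum - 1.0) > tol:
        failures.append("probas")
    # There cannot be more distinct urls than occurrences
    if len(rec["urls"]) > rec["numOccurrences"]:
        failures.append("numOccurrences")
    # The year of the quotation must match the year of the source file
    if int(rec["date"][:4]) != year:
        failures.append("date")
    # A record with QIDs must also have a speaker
    if rec["qids"] and rec["speaker"] is None:
        failures.append("author-qids")
    return failures

ok = run_checks(record, 2019)
bad = run_checks(dict(record, numOccurrences=1), 2018)
```

Here `ok` is an empty list, while `bad` lists the two checks that the modified record fails.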

In [None]:
# SANITY CHECK FUNCTIONS

def check_type(instance,entry,dtype):
    return type(instance[entry]) == dtype

def check_probas(instance):
    # The probabilities must be present and sum to 1 (up to a small tolerance)
    if len(instance["probas"]) > 0:
        proba_sum = sum(float(potential[1]) for potential in instance["probas"])
        return 0.98 <= proba_sum <= 1.02
    else:
        return False

def check_numOcc(instance):
    return (len(instance["urls"]) <= instance["numOccurrences"])

def check_date(instance,year):
    quotation_year = int(instance["date"][:4])
    return (quotation_year == year)

def check_author_qids(instance):
    if len(instance["qids"]) > 0 and instance["speaker"] is None:
        return False
    else: 
        return True
        

In [None]:
# Define the types for each entry
TYPES = {"quoteID":str,
         "quotation":str,
         "speaker":str,
         "qids":list,
         "date":str,
         "numOccurrences":int,
         "probas":list,
         "urls":list,
         "phase":str}

error_file = "Data/error_file.json.bz2"


if not exists(error_file):
    with bz2.open(error_file, 'wb') as e_file:
        # Loop over the different files that we will read
        for quotebank_data in PATHS_TO_FILE:
            year = int(quotebank_data[-13:-9])
            print("Reading ",quotebank_data," file...")
            # Open the file we want to read
            with bz2.open(quotebank_data, 'rb') as s_file:
                # Loop over the samples
                for instance in s_file:
                    potential_error = ""
                    # Loading a sample
                    instance = json.loads(instance)
                    #### CHECK THE TYPES ####
                    type_ok = True
                    for key, value in TYPES.items():
                        if not check_type(instance,key,value):
                            potential_error += "| Type problem: " + key + " |"
                            type_ok = False
                    # Skip the remaining checks when a type problem exists, as it may affect them
                    if type_ok:
                        #### CHECK THE PROBAS ####
                        if not check_probas(instance):
                            potential_error += "| Probas problem |"
                        #### CHECK THE DATE ####
                        if not check_date(instance,year):
                            potential_error += "| Date problem |"
                        #### CHECK THE NUMOCCURRENCES ####
                        if not check_numOcc(instance):
                            potential_error += "| NumOccurences problem |"
                        #### CHECK THE AUTHOR-QIDS ####
                        if not check_author_qids(instance):
                            potential_error += "| Author-qids problem |"
                    # WRITE INTO THE FILE FOR POTENTIAL ERRORS #
                    if len(potential_error) > 0:
                        instance["error"] = potential_error
                        e_file.write((json.dumps(instance)+'\n').encode('utf-8'))

pd.read_json('Data/error_file.json.bz2',compression="bz2",lines=True).shape                    

<a id='extraction'></a>

## Data extraction

As mentioned previously, we plan to analyze the way Brexit is perceived. Thus, we first need to extract the quotations that discuss Brexit. To do so, we use the following pipeline:

1. Select all the quotations that contain the word Brexit (case-insensitive).
2. Store the resulting dataset in the `Brexit_quotes.json.bz2` file.
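
The two steps above can be tried on a tiny in-memory archive before running them on the full yearly files. This is only a sketch: `filter_quotes` is a hypothetical helper and the three sample quotations are invented.

```python
import bz2
import io
import json

def filter_quotes(src, dst, keyword="brexit"):
    """Copy the JSON lines whose quotation contains `keyword` (case-insensitive)."""
    kept = 0
    for line in src:
        instance = json.loads(line)
        if keyword in instance["quotation"].lower():
            dst.write((json.dumps(instance) + "\n").encode("utf-8"))
            kept += 1
    return kept

# Invented sample data, compressed in memory to mimic a quotes-yyyy.json.bz2 file
samples = [
    {"quotation": "BREXIT is coming"},
    {"quotation": "Nothing to see here"},
    {"quotation": "A hard brexit"},
]
raw = bz2.compress("".join(json.dumps(s) + "\n" for s in samples).encode("utf-8"))

out = io.BytesIO()
with bz2.open(io.BytesIO(raw), "rb") as src:
    n_kept = filter_quotes(src, out)
```

Reading the archive line by line keeps memory usage flat, which matters because each yearly file is large.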


In [None]:
if not exists('Brexit_datas/Brexit_quotes.json.bz2'):
    # Input file
    PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]
    # Output file
    PATH_TO_OUT = 'Brexit_datas/Brexit_quotes.json.bz2'

    # Open the file where we will write
    with bz2.open(PATH_TO_OUT, 'wb') as d_file:
        # Loop over the different files that we will read
        for quotebank_data in PATHS_TO_FILE:
            print("Reading ",quotebank_data," file...")
            # Open the file we want to read
            with bz2.open(quotebank_data, 'rb') as s_file:
                # Loop over the samples
                for instance in s_file:
                    # Loading a sample
                    instance = json.loads(instance)
                    # Extracting the quotation
                    quotation = instance['quotation']
                    # Check if the quotation contains at least one word related to Brexit
                    if "brexit" in quotation.lower():
                        # Writing in the new file
                        d_file.write((json.dumps(instance)+'\n').encode('utf-8'))

quotebank_brexit = pd.read_json('Brexit_datas/Brexit_quotes.json.bz2',compression="bz2",lines=True)
quotebank_brexit.head(2)


<a id='augmentation'></a>

## Data augmentation

When generating the results for the final story, we will need more information than the initial features provide. The analysis requires additional features such as the sentiment carried by the quotation and additional information about the author. To do so, the following pipeline is applied to each quotation:

1. **[Adding features related to the author](#Features_Author)** :  Using the provided file `speaker_attributes.parquet` that was extracted from the Wikidata knowledge base, the following attributes are of interest for each speaker:
    - `occupation`: describes the author's occupation.
    - `party`: identifies the political affiliation of the speaker.
    - `academic_degree`: gives information about the education of the author as well as their alma mater.
    - `nationality` identifies the citizenship(s) of the author.
    - `date_of_birth`: identifies the date of birth of the speaker.
    - `gender`: identifies the gender of the speaker.
    - `ethnic_group`: identifies the ethnic group of the speaker.
    - `religion`: identifies the religion of the speaker. 

    The provided `speaker_attributes.parquet` file contains attributes in terms of QIDs, which are not interpretable by humans. To map the QIDs to meaningful labels, we used the provided file `wikidata_labels_descriptions_quotebank.csv.bz2`.
    
    The aforementioned attributes may not be available for all authors. When it is the case, a NaN value is assigned.

2. **[Adding features issued from a sentiment analysis](#Sentiment_Quote)** : The last feature of interest is the sentiment carried by the quotation. For the sake of simplicity, each quotation is classified into one of three categories: *Negative*, *Neutral* and *Positive*. 
Sentiment analysis can be performed using pretrained deep neural networks, but we decided to use **VADER**, a lexicon- and rule-based tool, for its good performance. NLTK's VADER sentiment analysis tool uses a bag-of-words approach with some simple heuristics. More on it [here](https://github.com/cjhutto/vaderSentiment). 
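
As a small illustration of the three-way classification, the sketch below maps a VADER compound score (a value between -1 and +1) to a label using the ±0.05 thresholds applied in the notebook's `sent_score` function. `label_from_compound` is a hypothetical helper that works on the score alone, so it runs without downloading the lexicon.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to Positive / Negative / Neutral."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

labels = [label_from_compound(score) for score in (0.6, -0.3, 0.01)]
```

Scores close to zero fall into the *Neutral* band, so the width of that band is controlled by the single `threshold` parameter.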

<a id='Features_Author'></a>
***Loading the speaker_attributes.parquet file***

In [None]:
# Load the parquet that contains the information about speakers
df_attributes = pd.read_parquet('Data/speaker_attributes.parquet')

# We are not interested in the aliases, lastrevid, US_congress_bio_ID, candidacy and type.
keep_attributes = ['id','label', 'date_of_birth', 'nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']
# Set the index
df_attributes = df_attributes[keep_attributes].set_index('id')
# Sanity check for the qids
print("Sanity check ok ? : ",df_attributes.index.is_unique)
# Let's have a look
df_attributes.sample(2)

<a id='Features_Author'></a>
***Mapping the QIDs to meaningful labels***

In [None]:
# Create a dictionary to use as a lookup table 
df_map = pd.read_csv('Data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')
# Dictionary where QIDs are keys and values are the corresponding labels
map_dict = df_map.Label.to_dict()

def mapping(QIDs):
    """
    The purpose of this function is to map all the QIDs to their labels, 
    using wikidata_labels_descriptions_quotebank.csv
    """
    
    if QIDs is None:
        return np.nan
    else:
        QIDs_mapped = []
        for QID in QIDs:
            try:
                # If a correspondence exists
                QIDs_mapped.append(map_dict[QID])
            except KeyError:
                # If no correspondence exists
                continue
        # If nothing was extracted
        if len(QIDs_mapped) == 0:
            return np.nan
        # Things extracted
        else:
            return QIDs_mapped



# For each column perform the mapping to transform qids to real value
for column in columns_to_map:
    df_attributes[column] = df_attributes[column].apply(mapping)
    
df_attributes.head(2)
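
The behaviour of the mapping step can be checked on toy data. The sketch below is a simplified stand-in for `mapping`: it uses a hand-written lookup dictionary instead of the real CSV and returns `None` rather than `np.nan` so that it needs no imports. Both choices are assumptions made for the example.

```python
# Toy lookup table: the entries are illustrative and not read from the real CSV.
toy_labels = {"Q145": "United Kingdom", "Q82955": "politician"}

def map_qids(qids, labels):
    """Map a list of QIDs to labels, skipping unknown QIDs; None if nothing maps."""
    if qids is None:
        return None
    mapped = [labels[q] for q in qids if q in labels]
    return mapped or None

known = map_qids(["Q145", "Q_missing"], toy_labels)
nothing = map_qids(["Q_missing"], toy_labels)
```

Unknown QIDs are dropped silently, and an attribute for which no QID could be mapped is treated as missing.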

<a id='Sentiment_Quote'></a>
***Adding sentiment score to each quote***

In [None]:
# Instantiate the analyzer once instead of once per quotation
sid = SentimentIntensityAnalyzer()

def sent_score(quote):
    """The purpose of this function is to use the sentiment analysis tool VADER to find the sentiment associated with a quote."""
    
    sentiment_dict = sid.polarity_scores(quote)
    
    # The Compound score is a metric that calculates the sum of all the lexicon ratings which have been normalized between
    # -1(most extreme negative) and +1 (most extreme positive).
    # positive sentiment : (compound score >= 0.05) 
    # neutral sentiment : (compound score > -0.05) and (compound score < 0.05) 
    # negative sentiment : (compound score <= -0.05)
    # see https://predictivehacks.com/how-to-run-sentiment-analysis-in-python-using-vader/
    # or https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
    
    # decide sentiment as positive, negative and neutral
    if sentiment_dict['compound'] >= 0.05 :
        return "Positive"
 
    elif sentiment_dict['compound'] <= - 0.05 :
        return "Negative" 
 
    else :
        return "Neutral"

# Back up the quotebank dataframe with the sentiment score if the corresponding file doesn't exist
if not exists("Brexit_datas/quotebank_brexit_with_sentiment.json.bz2"):
    quotebank_brexit['sentiment_score'] = quotebank_brexit.quotation.apply(sent_score) 
    quotebank_brexit.to_json("Brexit_datas/quotebank_brexit_with_sentiment.json.bz2")
    
else:
    quotebank_brexit = pd.read_json("Brexit_datas/quotebank_brexit_with_sentiment.json.bz2",compression="bz2")

quotebank_brexit.head(2)

<a id='cleaning'></a>

## Data merging and cleaning

Depending on the different tasks we want to perform we will need to have the dataset in various forms, thus we will generate three types of dataset: 
- `quotebank_brexit`: original dataset [+sentiment score] where duplicated quotations are removed.
- `aug_quotebank_brexit`: dataset with augmented data where both quotations and speakers are cleaned.
- `oneh_quotebank_brexit`: copy of `aug_quotebank_brexit` where categorical values are one-hot encoded.

Thus we first start by [cleaning the quotations](#cleaning_quotation) in the `quotebank_brexit` dataset and then we [clean the speakers](#cleaning_speaker) to be able to merge with augmented data and generate the `aug_quotebank_brexit`. After a [processing](#aug_preprocessing) of the `aug_quotebank_brexit` we will finally [one hot encode](#one_hot_encoding) to generate the `oneh_quotebank_brexit` dataset.
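
Since attributes such as `nationality` hold a list of values per speaker, "one-hot encoding" here most naturally means one binary column per category, with several columns possibly set for the same row. The sketch below illustrates that idea in plain Python on invented rows. It is an assumption about the encoding, not the notebook's implementation.

```python
# Invented list-valued column: each row is a list of categories, or None if missing
rows = [["United Kingdom", "Ireland"], None, ["France"]]

# One column per distinct category, in a stable (sorted) order
categories = sorted({c for r in rows if r for c in r})

# 1 if the row contains the category, 0 otherwise (missing rows are all zeros)
encoded = [[int(r is not None and c in r) for c in categories] for r in rows]
```

With these rows, `categories` is `["France", "Ireland", "United Kingdom"]` and the first row sets two columns at once.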

<a id='cleaning_quotation'></a>

### Cleaning the quotations

We noticed that some quotations were very similar, actually too similar. Sometimes one quotation is nested in another, and sometimes two quotations differ by only one character. Here is an example of such a pair of quotations:
- quoteID: **2018-01-26-042810** - *"I look at Nigel Farage's example. It took 17 years, but Brexit came,"*
- quoteID: **2018-01-26-042811** - *"I look at Nigel Farage's example. It took 17 years, but Brexit came. I don't plan to wait that long"*

We need to remove this kind of *duplicate*. To do so, we followed this pipeline:
- Converting quotations into vectors using [SentenceTransformer](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) deep neural network.
- Computing [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between each pair of quotations
- Removing quotations that are too similar from the dataset
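
The grouping logic of this pipeline can be sketched without any deep learning dependency. The example below uses hand-made 2D vectors in place of SentenceTransformer embeddings and a small union-find in place of the `networkx` connected components used in the notebook. The 0.95 threshold matches the notebook, while `duplicate_groups` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def duplicate_groups(vectors, threshold=0.95):
    """Group indices linked (possibly transitively) by a similarity above `threshold`."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links up to the representative of the group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

# Toy "embeddings": items 0, 1 and 3 point in almost the same direction
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.98, 0.1]]
groups = duplicate_groups(toy_vectors)
# Keep the first quotation of each group and drop the others
to_drop = [i for g in groups for i in g[1:]]
```

Merging pairs transitively matters: if A is similar to B and B to C, all three end up in one group and only one of them is kept.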

In [None]:
# Encode quotation 
if not exists("Brexit_datas/vector_quotes.csv.gz"):
    encoder = SentenceTransformer(sentence_transformer_type)
    # Encode quotations
    quotes_encoded = encoder.encode(quotebank_brexit['quotation'].values, convert_to_numpy=True, show_progress_bar=True)
    # Convert to df
    quotes_df = pd.DataFrame(quotes_encoded, index = quotebank_brexit.index)
    # Add speaker column
    quotes_df["speaker"] = quotebank_brexit["speaker"].values
    # Export into a compressed format
    quotes_df.to_csv("Brexit_datas/vector_quotes.csv.gz")
    
else:
    # Read the file
    quotes_df = pd.read_csv("Brexit_datas/vector_quotes.csv.gz",index_col=0,compression="gzip")

quotes_df.head()

In [None]:
# determine the supported device
def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu') # don't have GPU 
    return device

# convert a df to tensor to be used in pytorch
def df_to_tensor(df):
    device = get_device()
    return torch.from_numpy(df.values).float().to(device)

# compare pairwise similarity
def filter_similar(df):
    # Get the embeddings computed before (drop the non-numeric columns)
    embeddings = df_to_tensor(df.drop(columns=['quoteID', 'speaker'], errors='ignore'))
    # Compute cosine similarity
    cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)
    # Convert to df (move the tensor back to the CPU first)
    score = pd.DataFrame(cosine_scores.cpu().numpy())
    index = set(score[score>0.95].stack().index.tolist())
    index = [(a,b) for (a,b) in index if a != b]
    # multiple tuples can have a common element: need to merge them
    graph = nx.Graph(index)
    index = list(nx.connected_components(graph))
    # map the indices to the quoteID of the quote
    index = [tuple(df.quoteID.iloc[ind] for ind in path) for path in index]
    return index

similar_count = quotes_df.assign(quoteID=quotebank_brexit.quoteID.values).groupby('speaker').apply(filter_similar)

In [None]:
# an example of two similar quotations
print("Yvette Cooper:", similar_count["Yvette Cooper"][0])
quotebank_brexit[(quotebank_brexit.quoteID=='2019-01-27-041304') | (quotebank_brexit.quoteID=='2019-01-27-029003')]

In [None]:
# get indices to drop
def drop_duplicate_quotes(ids):
    return [[quoteID for quoteID in path[1:]] for path in ids]
        
# generate the list of quoteIDs to be removed
to_be_removed = similar_count.apply(drop_duplicate_quotes).values.sum()
to_be_removed = list(itertools.chain.from_iterable(to_be_removed))

quotebank_brexit = quotebank_brexit[~quotebank_brexit.quoteID.isin(to_be_removed)]

print("Number of quotations to be removed: ",len(to_be_removed))

<a id='cleaning_speaker'></a>

### Cleaning the speakers

The `aug_quotebank_brexit` dataset provides information about the speaker, such as the `nationality` and `occupation`. However, the neural network sometimes fails to attribute a speaker and fills the `speaker` entry with a `None` value. These missing values are difficult to handle, as it would require guessing who said the quotation. One could train a classifier on the data where the speaker is mentioned, but this is a tedious task that we are not able to manage. We therefore decided to remove these quotations from the dataset.

Another issue is that several QIDs may exist for one speaker. These QIDs may correspond to Wikipedia pages of the same person in different languages, or to pages of different persons who are homonyms. When several QIDs exist, we check whether all the attributes are identical across the QIDs. If not, we cannot determine which QID is the correct one, so we discard the row from the dataset.
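
The consistency rule for multiple QIDs can be illustrated on a toy attribute table. In this sketch, `pick_consistent_qid` is a hypothetical, simplified helper and the QIDs and attributes are invented. It keeps the first QID only when every candidate found in the table has exactly the same attributes.

```python
# Invented attribute table indexed by QID
toy_attributes = {
    "Q1": {"nationality": ["United Kingdom"], "gender": ["male"]},
    "Q2": {"nationality": ["United Kingdom"], "gender": ["male"]},
    "Q3": {"nationality": ["Ireland"], "gender": ["male"]},
}

def pick_consistent_qid(qids, attributes):
    """Return the first QID if all known candidates agree, otherwise None."""
    if not qids:
        return None
    # Candidates missing from the attribute table cannot contradict the others
    known = [attributes[q] for q in qids if q in attributes]
    if all(a == known[0] for a in known[1:]):
        return qids[0]
    return None

same_person = pick_consistent_qid(["Q1", "Q2"], toy_attributes)
homonyms = pick_consistent_qid(["Q1", "Q3"], toy_attributes)
```

Rows for which the helper returns `None` are the ones that would be discarded.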

In [None]:
def check_consistent_qids(QIDS_original):
    QIDS = QIDS_original.copy()
    if len(QIDS) == 0:
        return pd.NA
    elif len(QIDS) == 1:
        return QIDS_original[0]
    else:
        while len(QIDS) > 1:
            first_idx = QIDS.pop(-1)
            try:
                first = df_attributes.loc[first_idx].fillna(0)
                second_idx = QIDS.pop(-1)
                try:
                    second = df_attributes.loc[second_idx].fillna(0)
                except KeyError:
                    QIDS.append(first_idx)
                    continue
            except KeyError:
                continue
            try: 
                if (first != second).sum() > 0:
                    return pd.NA
            except ValueError:
                return pd.NA
        return QIDS_original[0]

if not exists("Brexit_datas/aug_quotebank.json.bz2"):
    # Remove quotations without an attributed speaker (copy to avoid modifying a view)
    aug_quotebank_brexit = quotebank_brexit[quotebank_brexit.speaker != "None"].copy()

    # Remove speakers with multiple different qids
    aug_quotebank_brexit.qids = aug_quotebank_brexit.qids.apply(check_consistent_qids)
    aug_quotebank_brexit = aug_quotebank_brexit[~aug_quotebank_brexit.qids.isna()]

    # Merge the augmented quotebank brexit with df_attributes on qids
    aug_quotebank_brexit = pd.merge(aug_quotebank_brexit, df_attributes, 'inner', left_on="qids", right_index=True)

    # Export to json to add check points
    aug_quotebank_brexit.to_json("Brexit_datas/aug_quotebank.json.bz2")
else:
    # Read json if it already exists
    aug_quotebank_brexit = pd.read_json("Brexit_datas/aug_quotebank.json.bz2",compression="bz2")

# Let's have a look
print("New shape",aug_quotebank_brexit.shape)
aug_quotebank_brexit.head(2)

### Compute age feature for each speaker

In [None]:
def get_age(birth_date,current_year=datetime.now().year):
    
    if isinstance(birth_date,list) and len(birth_date) > 0:
        birth_year = int(birth_date[0][1:5])
        return current_year - birth_year
    else:
        return pd.NA

aug_quotebank_brexit["Age"] = aug_quotebank_brexit.date_of_birth.apply(get_age)

aug_quotebank_brexit.loc[:,["Age","date_of_birth","speaker"]].sample(5)
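
The slicing in `get_age` relies on the date format: each entry of `date_of_birth` is assumed to look like `+1964-06-19T00:00:00Z`, so characters 1 to 4 hold the year. A minimal sketch with an invented birth date and a fixed reference year (rather than the current year used above, so that the result is reproducible):

```python
def birth_year(date_of_birth):
    """Extract the year from the first date string, or None if unavailable."""
    if isinstance(date_of_birth, list) and len(date_of_birth) > 0:
        # Skip the leading sign, then read the four digits of the year
        return int(date_of_birth[0][1:5])
    return None

# Invented birth date; 2021 is an arbitrary reference year for the example
age = 2021 - birth_year(["+1964-06-19T00:00:00Z"])
```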

### Get unique values of categorical features

In [None]:
unique_values = {}

for col in columns_to_map:
    col_serie = aug_quotebank_brexit[col].copy()
    unique_values[col] = pd.unique(col_serie.apply(pd.Series).stack())
    print(col," : number of different categories = ",len(unique_values[col]))

<a id='aug_preprocessing'></a>

### What about the categorical features added

### Occupation feature

Now that we have added new features, we had a look at their values. We noticed that there are more than 800 different occupations. It would be interesting to classify them into *categories*. The problem is that we do not have any labels for them, and using ML techniques such as pre-trained neural networks would be overkill. We instead followed a semi-manual approach, described below: 
- Identify which words are the most frequent in the `occupation` names and associate them with a label. We call them keywords.
- Match each `occupation` with the labels of any keywords it contains.
- Label the remaining occupations manually.
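
The first two steps of this approach can be sketched on a handful of occupations. The occupation names, the keyword labels and the `categorize` helper below are all invented for the example. In the notebook the labels are entered by hand in a CSV file.

```python
from collections import Counter

# Invented occupation names
occupations = ["television presenter", "radio presenter", "university teacher", "politician"]

# Step 1: count how often each word appears across the occupation names
word_counts = Counter(word for occ in occupations for word in occ.lower().split())

# Hypothetical manual labels attached to the most useful keywords
keyword_labels = {"presenter": ["Media"], "teacher": ["Education"]}

def categorize(occupation, labels):
    """Step 2: collect the categories of every keyword contained in the occupation."""
    found = {cat for kw, cats in labels.items() if kw in occupation.lower() for cat in cats}
    return sorted(found) or None

top_keyword = word_counts.most_common(1)
matched = categorize("Television presenter", keyword_labels)
unmatched = categorize("politician", keyword_labels)
```

Occupations for which the helper returns `None` are the ones left for manual labelling in the third step. Note that the match is a substring test, so a short keyword can match inside a longer, unrelated word.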

***Manage frequent keywords***

In [None]:
# Data frame of the occupations
occupation_df = pd.DataFrame(unique_values["occupation"],columns=["occupation"])

key_words = []

# Loop over the occupations
for occupation in unique_values["occupation"]:
    # Split the occupation string and concatenate
    key_words += occupation.split()

# Convert to a Dataframe
key_words_df = pd.DataFrame(key_words,columns=["occupation"])
# Put all strings to lower
key_words_df.occupation = key_words_df.occupation.str.lower()
# For each keyword, count the number of occurrences and sort in descending order
key_words_df = key_words_df.groupby("occupation").size().reset_index(name="Count").sort_values(by="Count",ascending=False)

# If the classification has not been already done
if not exists("Brexit_datas/occupation_class/occupation_agg.csv"):
    key_words_df.to_csv("Brexit_datas/occupation_class/occupation_agg.csv")

print("Look at the most frequent keywords")
print(key_words_df.head(3))

answer = input("Is the classification of keywords done ?")

if (answer.lower() == "yes"):
    # Get the classified keywords
    key_words_classified = pd.read_csv("Brexit_datas/occupation_class/occupation_agg.csv",index_col=0)
    # Get rid of keywords that have not been classified
    key_words_classified = key_words_classified.loc[~key_words_classified.Category.isna()]
    # Manage the case when several categories have been entered
    key_words_classified.Category = key_words_classified.Category.apply(lambda x: x.split("-"))
    # let's have a look at the table
    print("Look at the output table")
    print(key_words_classified.head(3))
else:
    print("Then please classify the keywords")

***Match occupation and keyword labels***

In [None]:
# Function to check if keywords are contained in an occupation
def check_string_in(occupation):
    # Initialize the final list of the supercategories
    final_list = []
    # Loop over the key_words_classified
    for items in key_words_classified.occupation.items():
        # If the keyword is contained in the occupation
        if items[1] in occupation.lower():
            # Concatenate the supercategories with the existing list
            final_list = final_list + key_words_classified.loc[items[0],"Category"]
    # If no categories return NaN
    if len(final_list) == 0:
        return pd.NA
    # Else return the list without duplicates
    else:
        return list(set(final_list))
        
# Apply the function
occupation_df["Category"] = occupation_df.occupation.apply(check_string_in)

if not exists("Brexit_datas/occupation_class/unclassified_occupation.csv"):
    # Export the occupations that have not been classified
    occupation_df[occupation_df.Category.isna()].to_csv("Brexit_datas/occupation_class/unclassified_occupation.csv")

print("Look at the remaining occupations")
print(occupation_df[occupation_df.Category.isna()].head(3))

answer = input("Is the classification of remaining occupations done ?")

if (answer.lower() == "yes"):
    # Get the remaining occupations classified
    remain_occupations_classified = pd.read_csv("Brexit_datas/occupation_class/unclassified_occupation.csv",index_col=0)
    # Merge with the current data frame
    occupation_final_df = pd.merge(occupation_df,remain_occupations_classified,how="left",on="occupation",suffixes=("","_2"))
    # Split into a list
    occupation_final_df.Category_2 = occupation_final_df.Category_2.apply(lambda x: x.split("-") if type(x) == str else pd.NA)
    # Merge into a single column
    occupation_final_df.loc[~occupation_final_df.Category_2.isna(),"Category"] = occupation_final_df.loc[~occupation_final_df.Category_2.isna(),"Category_2"]
    # Drop the artificial column
    occupation_final_df.drop(columns=["Category_2"],inplace=True)
    # Drop na values that corresponds to unclassifiable jobs such as nazi hunter
    occupation_final_df.dropna(axis=0,inplace=True)
    # Let's have a look
    print("Final data set for the classification of occupations:")
    print(occupation_final_df.head(5))
    # Export to a json file
    if not exists("Brexit_datas/occupation_class/classified_occupation.json"):
        occupation_final_df.to_json("Brexit_datas/occupation_class/classified_occupation.json")
else:
    print("Then please classify the remaining occupations")

***Label remaining occupations***

In [None]:
occupation_final_df = pd.read_json("Brexit_datas/occupation_class/classified_occupation.json").set_index("occupation")

# Let's have a look at the supercategories
print(list(pd.unique(occupation_final_df.Category.apply(pd.Series).stack())))

# Let's replace this into the aug_quotebank dataset
def replace_occupation(occupation):
    # Anything that is not a non-empty list cannot be mapped
    if not isinstance(occupation, list) or len(occupation) == 0:
        return pd.NA
    new_occupation = []
    for job in occupation:
        try:
            new_occupation += occupation_final_df.loc[job,"Category"]
        except KeyError:
            # Occupation without a category: skip it
            continue
    if len(new_occupation) > 0:
        return list(set(new_occupation))
    else:
        return pd.NA

aug_quotebank_brexit.occupation = aug_quotebank_brexit.occupation.apply(replace_occupation)
aug_quotebank_brexit.head(2)
        

### Country feature

In [None]:
def compare_levensthein(country,proposal=5):

    message = country + " Any deviation ?"
    deviation = input(message)

    if len(deviation) > 0:
        country = deviation
    
    value = []
    for existing in list(current_countries.Country):
        leven_distance = stringdist.levenshtein_norm(country,existing)
        value.append(leven_distance)
        
    value = np.array(value)
    closer = current_countries.Country.values[np.argsort(value)]
    closer = closer[:proposal]

    message = country + " potential candidates \n" + " --- ".join(list(closer))

    fine = True
    while fine:
        try:
            idx_to_keep = int(input(message))
            fine = False
        except ValueError:
            print("Please specify an integer")
            fine = True

    
    if idx_to_keep < 0:
        return pd.NA
    else:
        return closer[idx_to_keep]


if not exists("Brexit_datas/country_class/country_final_mapping.csv"):

    # Get the list of existing countries 
    current_countries = pd.read_excel("Brexit_datas/country_class/countries.xlsx")

    # Remove capital letters and special characters
    current_countries.Country = current_countries.Country.str.lower()
    current_countries.Country = current_countries.Country.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

    # Country values that we currently have
    countries_to_map = pd.DataFrame(unique_values["nationality"],columns=["Country"])
    # Remove capital letters and special characters
    countries_to_map.Country = countries_to_map.Country.str.lower()
    countries_to_map.Country = countries_to_map.Country.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    # Drop eventual duplicates
    countries_to_map.drop_duplicates(subset=["Country"],inplace=True)

    # Let's perform a first merge
    countries_to_map = pd.merge(current_countries,countries_to_map,left_on="Country",right_on="Country",how="right")

    # New column holding the matched country name
    countries_to_map.loc[~countries_to_map.ISO.isna(),["real_country"]] = countries_to_map[~countries_to_map.ISO.isna()].Country
    # Fill the remaining countries manually (use .loc so that the assignment is applied to the dataframe)
    missing = countries_to_map.ISO.isna()
    countries_to_map.loc[missing,"real_country"] = countries_to_map.loc[missing,"Country"].apply(compare_levensthein)
    # Remove countries for which no correspondence was found
    countries_to_map = countries_to_map[~countries_to_map.real_country.isna()].drop(columns="ISO")
    # Export the result (the index is written as the first column)
    countries_to_map.to_csv("Brexit_datas/country_class/country_final_mapping.csv")

else:
    # Read the already created csv
    countries_to_map = pd.read_csv("Brexit_datas/country_class/country_final_mapping.csv",index_col=0)

countries_to_map.set_index("Country",inplace=True)
countries_to_map.head()
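
The interactive matching above proposes the closest known country names for each value that could not be merged directly. As an illustration only, ranking candidates by edit distance could look like the sketch below. `levenshtein` and `closest_candidates` are hypothetical helpers written for this example, not the notebook's own `compare_levensthein` implementation, and a plain dynamic-programming Levenshtein distance is assumed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_candidates(name, candidates, proposal=3):
    """Return the `proposal` candidates with the smallest edit distance to `name`."""
    return sorted(candidates, key=lambda c: levenshtein(name, c))[:proposal]


# Toy list of known countries (illustrative values only)
known = ["united kingdom", "united states", "france", "germany"]
suggestions = closest_candidates("unted kingdom", known, proposal=2)
print(suggestions)
```

The user would then pick one of the suggestions by index, or a negative index to discard the value, as in the prompt above.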

In [None]:
def match_category(initial,lookup_table):

    new_values = []

    original = list(lookup_table.index.values)
    if isinstance(initial,list):
        for old_original in initial:
            if old_original.lower() in original:
                new_values.append(lookup_table.loc[old_original.lower(),lookup_table.columns[0]])
        if len(new_values) == 0:
            return pd.NA
        else:
            return new_values
    else:
        return pd.NA

aug_quotebank_brexit.loc[:,"nationality"] = aug_quotebank_brexit.nationality.apply(match_category,lookup_table=countries_to_map)

## Thresholding to remove categories with a low number of occurrences

In [None]:
threshold_maps = {}

# define threshold for each target column
threshold_maps["nationality"] = 80
threshold_maps["party"] = 50
threshold_maps["academic_degree"] = 20
# threshold_maps["ethnic_group"] = 20
threshold_maps["religion"] = 50

# function filtering 
def intersection_test(xlist, unique_set):
    if not isinstance(xlist, list):
        return pd.NA
    if len(xlist) == 0:
        return pd.NA
    new_list = list(set(xlist).intersection(unique_set))
    if len(new_list) > 0:
        return new_list
    else:
        return pd.NA


for col, threshold in threshold_maps.items():
    # get unique quantities inside list objects and their count
    unique_count = aug_quotebank_brexit[col].apply(pd.Series).stack().to_frame().rename(columns={0:"Value"})
    unique_count = unique_count.groupby("Value").size().reset_index(name="Count").sort_values(by="Count",ascending=False)
    # define intersection for the specific unique values
    intersection_unique = lambda xlist: intersection_test(xlist, unique_count[unique_count["Count"] > threshold].Value.values)
    # apply filtering
    aug_quotebank_brexit[col] = aug_quotebank_brexit[col].apply(intersection_unique)
    

In [None]:
# TEST, KEEP THIS
Test = aug_quotebank_brexit["nationality"].apply(pd.Series).stack().to_frame().rename(columns={0:"Value"})
Test = Test.groupby("Value").size().reset_index(name="Count").sort_values(by="Count",ascending=False)
Test[Test.Count > 20]

## Academic degree gathering

We would like to group the academic degrees of the speakers into higher-level categories such as PhD, Professor, Master, Bachelor and so on. As the number of remaining categories for the academic degree is quite low, this can be done manually.

In [None]:
file_academic = "Brexit_datas/academic_degree/original_academic.csv"

if not exists(file_academic):
    Academic_filter = aug_quotebank_brexit["academic_degree"].apply(pd.Series).stack().to_frame().rename(columns={0:"Value"})
    Academic_filter.groupby("Value").count().to_csv(file_academic)

answer = input("Did you gather academic degrees categories into higher categories ?")

if answer.lower() == "yes":
    Academic_filter = pd.read_csv(file_academic).set_index("Value")
    Academic_filter.index = Academic_filter.index.str.lower()
    Academic_filter.head()
    aug_quotebank_brexit.loc[:,"academic_degree"] = aug_quotebank_brexit.academic_degree.apply(match_category,lookup_table=Academic_filter)

In [None]:
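def acadedmic_order(degrees,order):
    # Keep only the highest-ranked degree of a speaker, according to `order`
    best_rank = -1  # avoid shadowing the built-in max()
    new = None
    if isinstance(degrees,list):
        for degree in degrees:
            if best_rank < order[degree]:
                best_rank = order[degree]
                new = degree
        if new is None:
            print("Empty list")
            return pd.NA
        return [new]
    return pd.NA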

Ordered_list = {"Professor":4,"Phd":3,"Master":2,"Bachelor":1,"Other":0}

aug_quotebank_brexit.loc[:,"academic_degree"] = aug_quotebank_brexit.academic_degree.apply(acadedmic_order,order=Ordered_list)

## Religion gathering

We would like to do the same for the religion feature: group the remaining categories manually into higher-level ones.

In [None]:
file_religion = "Brexit_datas/religion/original_religion.csv"

if not exists(file_religion):
    Religion_filter = aug_quotebank_brexit["religion"].apply(pd.Series).stack().to_frame().rename(columns={0:"Value"})
    Religion_filter.groupby("Value").count().to_csv(file_religion)

answer = input("Did you gather religion categories into higher categories ?")

if answer.lower() == "yes":
    Religion_filter = pd.read_csv(file_religion).set_index("Value")
    Religion_filter.index = Religion_filter.index.str.lower()
    Religion_filter.head()
    aug_quotebank_brexit.loc[:,"religion"] = aug_quotebank_brexit.religion.apply(match_category,lookup_table=Religion_filter)

### Ethnic Group gathering

In [None]:
# All the information contained in ethnic group is actually already contained in the other features so I think we can safely drop it
# Ethnic group has been removed from the columns_to_map list
# aug_quotebank_brexit.drop(columns=["ethnic_group"],inplace=True)

<a id='one_hot_encoding'></a>

### One hot encoding

In [None]:
# One hot vectorization of columns containing categorical values
dummy_col = "AAADummy column for the sake"
# Make a copy
oneh_quotebank_brexit = aug_quotebank_brexit.copy()

# Make sure the element is a list of strings (scalars are wrapped in a list, NaN values are left as-is)
def ensure_list(value):
  if isinstance(value, list):
    for i in range(len(value)):
      value[i] = str(value[i])
  elif not pd.isna(value):
    value = [value]
  return value

# Loop over categorical columns
for col in columns_to_map:
  # Get the series
  col_serie = aug_quotebank_brexit[col].copy().apply(ensure_list)
  # Change nan values to a list containing a dummy column
  col_serie[col_serie.isna()] = col_serie[col_serie.isna()].apply(lambda x: [dummy_col])
    
  # One hot vectorize
  categorical_df = pd.get_dummies(col_serie.apply(pd.Series).stack()).groupby(level=0).sum()
    
  # Drop the dummy column
  categorical_df.drop(columns=[dummy_col],inplace=True)
  # Refresh unique values
  unique_values[col] = categorical_df.columns
    
  # Join with quotebank brexit
  oneh_quotebank_brexit = oneh_quotebank_brexit.join(categorical_df,how="left",rsuffix=col[:3])
  print("One hot vectorizing : ",col,
        "| NaN values : ",categorical_df.isna().apply(lambda x: x*1).sum().sum(),
        "| Number of different categories : ",len(categorical_df.columns),
        "| Shape reduced ? ",categorical_df.shape,oneh_quotebank_brexit.shape)
  # Drop the categorical column
  oneh_quotebank_brexit.drop(columns=col,inplace=True)
  # Check for NaN values
  print("Any NA in the final dataframe: ",oneh_quotebank_brexit.isna().apply(lambda x: x*1).sum().sum())

print("Shape of the final data frame",oneh_quotebank_brexit.shape)
print("Any NA in the final dataframe: ",oneh_quotebank_brexit.isna().apply(lambda x: x*1).sum().sum())

In [None]:
oneh_quotebank_brexit.head(5)

<a id='clustering_task'></a>

# Quotations and speakers clustering

The last preprocessing step consists of clustering the quotations as well as the speakers. This clustering will later be used to create a recommendation tool in the context of Brexit. Quotations and speakers that carry similar attributes/ideas will belong to the same cluster. This task can be carried out with the following pipeline:
1. The first step is to convert sentences into vectors. This task can be achieved using the [SentenceTransformer](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) deep neural network. The vector obtained from this operation can then be concatenated with the other existing features (converted to one-hot vectors if necessary) (already done).
2. The second step consists in reducing the dimension of the data before applying the clustering algorithm. This task can be achieved using the [Locally Linear Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html#sklearn.manifold.LocallyLinearEmbedding) algorithm. This algorithm is considered to be an efficient non-linear dimensionality reduction method.
3. The third step is specific to speaker clustering. Indeed, the vectorization of quotes as well as the reduction of dimensionality are only applied to quotes. Thus, we need to perform an **aggregation** to be able to attribute a vector to each speaker. For each speaker, this aggregation can simply be done by taking the mean of the vectors associated with each of their quotations. 
4. The last step consists in performing the clustering operation. This task can be achieved using the [Spectral Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) method.
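
Steps 2 to 4 of the pipeline above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: random vectors stand in for the SentenceTransformer output, and the sizes, `n_neighbors=10`, `n_clusters=3`, the `random_state` values and the rescaling before clustering are choices made for this example, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import SpectralClustering
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)

# Synthetic stand-in for the quote vectors: 120 quotes, 16 dimensions, 12 speakers
n_quotes, n_dims, n_speakers = 120, 16, 12
quote_vectors = rng.normal(size=(n_quotes, n_dims))
speakers = [f"speaker_{i % n_speakers}" for i in range(n_quotes)]

# Step 2: non-linear dimensionality reduction of the quote vectors
embedding = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=0)
reduced = embedding.fit_transform(quote_vectors)

# Step 3: aggregation, one vector per speaker (mean of their quotes)
reduced_df = pd.DataFrame(reduced, columns=["x", "y"])
reduced_df["speaker"] = speakers
speaker_vectors = reduced_df.groupby("speaker").mean()

# Rescale so that the default RBF affinity is informative (assumption of this sketch)
scaled = (speaker_vectors - speaker_vectors.mean()) / speaker_vectors.std()

# Step 4: spectral clustering of the speakers
clustering = SpectralClustering(n_clusters=3, random_state=0).fit(scaled.to_numpy())
speaker_vectors["cluster"] = clustering.labels_
print(speaker_vectors)
```

With random input the clusters carry no meaning; the sketch only shows how the shapes flow from quotes to speakers to cluster labels.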

### Sentiment amplification

In order to calibrate the importance of the *sentiment score* in the quotes vector space, an amplification coefficient `sentiment_amplification` multiplies the integer-coded `sentiment_score`.

This increases the distance between vectors whose sentiment scores differ, so that vectors sharing the same sentiment score tend to be grouped together. A priori, this should make the cluster labelling easier.
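
The effect of the coefficient can be checked on a toy case. In the sketch below, two quotes have identical text features but opposite sentiment; `amplify` is a hypothetical helper introduced for this example, and the feature values are made up.

```python
import numpy as np


def amplify(vectors, sentiment, coefficient):
    """Append the sentiment score, scaled by `coefficient`, as an extra coordinate."""
    return np.column_stack([vectors, coefficient * np.asarray(sentiment)])


# Two toy quotes: same text features, opposite sentiment (+1 and -1)
features = np.array([[0.2, 0.5], [0.2, 0.5]])
sentiment = [1, -1]

plain = amplify(features, sentiment, 1.0)
boosted = amplify(features, sentiment, 10.0)

# Euclidean distance between the two quotes, without and with amplification
dist_plain = np.linalg.norm(plain[0] - plain[1])
dist_boosted = np.linalg.norm(boosted[0] - boosted[1])
print(dist_plain, dist_boosted)
```

The distance between the two quotes grows linearly with the coefficient, so a larger value lets sentiment dominate the other features when neighbours are computed.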

### Locally linear embedding

This algorithm aims at preserving the neighbourhood of each point. The process is as follows: 
- For each point, its nearest neighbors are determined. 
- Each point is then projected into the embedded space such that its neighbors are preserved.

This spectral dimensionality reduction technique is non-linear, and fast and reliable enough to handle big and complex datasets.

In [None]:
"""
    User-defined parameters for the task
"""
# coefficient of amplification of sentiment
sentiment_amplification = 10. 

# Number of clusters to be identified
nb_clusters = 8  

# Restrict dataframe size if emb_debug is True
emb_debug = True
emb_size = 2000

## Prepare dataframe for clustering (keep only involved columns)

In [None]:
def sentiment_to_int(value):
  return int(value == "Positive") - int(value == "Negative")

def standardize(df):
    return (df - df.mean(axis=0)) / df.std(axis=0)
  
columns_to_drop = ["date","quoteID","qids","phase","probas","urls","date_of_birth","label"]

# restrict dataframe to the one needed
cluster_df = oneh_quotebank_brexit.drop(columns=columns_to_drop)

print("Is there any na values ?",cluster_df.isna().sum().sum())

<a id='One-hot'></a>

## Compose the vectorized dataframe

1. Convert the sentiment score into a signed integer: "Positive" = 1, "Negative" = -1, "Neutral" = 0
2. For each column holding speaker information, generate a dummy dataframe (see `pandas.get_dummies`)
3. Concatenate all the obtained dataframes along the columns
4. Standardize the quote vectors and `numOccurrences` column-wise
5. Average all rows matching the same speaker (which is set as index) and amplify `sentiment_score`
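
The steps above can be sketched on a toy dataframe. The speakers, values and the coefficient of `10.0` are made up for the illustration, and only one categorical column is used.

```python
import pandas as pd

# Toy quotations (illustrative values only)
toy = pd.DataFrame({
    "speaker": ["alice", "alice", "bob"],
    "sentiment_score": ["Positive", "Neutral", "Negative"],
    "nationality": ["france", "france", "germany"],
})

# Convert the sentiment label into a signed integer
sentiment_map = {"Positive": 1, "Neutral": 0, "Negative": -1}
toy["sentiment_score"] = toy["sentiment_score"].map(sentiment_map)

# One-hot encode the categorical column
dummies = pd.get_dummies(toy["nationality"], dtype=float)

# Concatenate along the columns
vectorized = pd.concat([toy[["speaker", "sentiment_score"]], dummies], axis=1)

# Average all rows belonging to the same speaker
per_speaker = vectorized.groupby("speaker").mean()

# Amplify the sentiment coordinate
per_speaker["sentiment_score"] *= 10.0
print(per_speaker)
```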

In [None]:
"""
    Compose a vectorized dataframe
"""

# reload quotes vector
quotes_df = pd.read_csv("Brexit_datas/vector_quotes.csv.gz",compression="gzip").drop(columns="speaker")

# normalize quotations vector
quotes_df = standardize(quotes_df)

# Merge the two data frames
cluster_full_df = pd.concat([cluster_df.drop(columns="quotation"), quotes_df.loc[cluster_df.index]],axis = 1).set_index("speaker")

# convert sentiment_score to int format
cluster_full_df['sentiment_score'] = cluster_full_df['sentiment_score'].apply(sentiment_to_int)

# standardize numOccurrences (zero mean, unit variance over the column)
cluster_full_df.numOccurrences = standardize(cluster_full_df.numOccurrences)

# average over the same speaker
cluster_full_df = cluster_full_df.groupby(level=0).agg(np.mean)

#amplify sentiment
cluster_full_df['sentiment_score'] *= sentiment_amplification

cluster_full_df.head()

<a id='TSNE'></a>

## Convert to *pytorch* tensor and apply Locally Linear Embedding aggregation

In [None]:
# Debug: reduce dataset rows
cluster_quotes_df = quotes_df

cluster_oneh_df = cluster_df.drop(columns="quotation").set_index("speaker",drop=True)
cluster_oneh_df['sentiment_score'] = cluster_oneh_df['sentiment_score'].apply(sentiment_to_int)

# Apply t-distributed stochastic neighbor embedding (t-SNE)
# NOT USED: one-hot vectorization gave undefined results
#data_np_emb = TSNE(n_components=2, perplexity=30, n_iter=1000, verbose=True).fit_transform(full_data_tensor) # dim = Nxfinal_dim

# reduce rows if too much
if emb_debug:
    cluster_quotes_df = cluster_quotes_df.iloc[:emb_size]
    cluster_oneh_df = cluster_oneh_df.iloc[:emb_size]
    cluster_full_df = cluster_full_df.iloc[:emb_size]
    
# Apply Locally Linear Embedding on the full dataset
data_np_emb = LocallyLinearEmbedding(n_components=2, max_iter=100).fit_transform(df_to_tensor(cluster_full_df))

# Apply Locally Linear Embedding on the restricted datasets
data_quotes_emb = LocallyLinearEmbedding(n_components=2, max_iter=100).fit_transform(df_to_tensor(cluster_quotes_df))
data_oneh_emb = LocallyLinearEmbedding(n_components=2, max_iter=100).fit_transform(df_to_tensor(cluster_oneh_df))

In [None]:
data_np_reduc = data_np_emb.transpose()
quotes_np_reduc = data_quotes_emb.transpose()
oneh_np_reduc = data_oneh_emb.transpose()

# plot 2D-embedded vectorizations
fig, axis = plt.subplots(1, 3, figsize=[15,8])

axis[0].scatter(data_np_reduc[0], data_np_reduc[1], color='blue')
axis[0].set_title("Full embedding")

axis[1].scatter(quotes_np_reduc[0], quotes_np_reduc[1], color='orange')
axis[1].set_title("Quotes vector embedding")

axis[2].scatter(oneh_np_reduc[0], oneh_np_reduc[1], color='red')
axis[2].set_title("One-hot vector embedding")

for i in range(3):
    axis[i].tick_params(left=False,bottom=False,labelleft=False,labelbottom=False) 

plt.show()

<a id='Clustering'></a>

## Apply the clustering algorithm

In [None]:
"""
    Clustering
""" 

# Apply Clustering
clustering = SpectralClustering(nb_clusters).fit(data_np_emb)

"""
    Visualization
""" 
fig, axis = plt.subplots(1, 2, figsize=(14, 7))

results = data_np_emb.transpose()

# Visualize without clustering
axis[0].scatter(results[0], results[1])
axis[0].set_title('Raw data')


for label in range(nb_clusters):
    # select data by clustering label
    points = data_np_emb[clustering.labels_ == label]
    points = points.transpose()
    # plot data
    axis[1].scatter(points[0], points[1], label=label)
    
axis[1].set_title('Clustered data')
axis[1].legend()
    
for i in range(2):
    axis[i].tick_params(left=False,bottom=False,labelleft=False,labelbottom=False) 
    
plt.show()

<a id='Results'></a>

# Generate the results for the final story

<a id='Statistics'></a>

## General Statistics

Now that we have preprocessed the data, let's look at some basic statistics to explore the dataset in more depth. Let's first look at the distribution of the quotations across time.

In [None]:
fig = px.histogram(quotebank_brexit,x="date")
fig.update_layout(title="Number of quotations about Brexit across time")
fig.show()

Let's see the top 20 countries that are providing the most quotations about Brexit

In [None]:
country_df = oneh_quotebank_brexit.loc[:,unique_values["nationality"]].sum(axis=0).T.to_frame().reset_index()
country_df = country_df.sort_values(by=0,ascending=False)
fig = px.bar(country_df,y=0,x="index",log_y=True)
fig.update_layout(title="Number of quotations about Brexit per country",
                  xaxis_title="Country",yaxis_title="Count")
fig.show()

Let's see the top sectors that are providing the most quotations on Brexit

In [None]:
job_df = oneh_quotebank_brexit.loc[:,unique_values["occupation"]].sum(axis=0).T.to_frame().reset_index()
job_df = job_df.sort_values(by=0,ascending=False)
fig = px.bar(job_df,y=0,x="index",log_y=True)
fig.update_layout(title="Number of quotations about Brexit per sector",
                  xaxis_title="Sector",yaxis_title="Count")
fig.show()

Let's see the top 10 speakers that are providing the most quotations on Brexit

In [None]:
aug_quotebank_brexit.groupby("speaker").size().reset_index(name="count").loc[:,["speaker","count"]].sort_values(by="count",ascending=False).head(10)

In [None]:
def select_by_year(low_year,up_year, col=None):
    year_col = pd.DatetimeIndex(oneh_quotebank_brexit.date).year
    if col is None:
        cols = ["sentiment_score"]
    else:
        cols = list(unique_values[col]) + ["sentiment_score"]
    filter_df = oneh_quotebank_brexit.loc[(year_col >= low_year) & (year_col <= up_year),cols]
    filter_df = filter_df.groupby("sentiment_score").sum()
    count = filter_df.sum(axis=0)
    filter_df = (filter_df * 100/ filter_df.sum(axis=0)).T.reset_index()
    filter_df["count"] = count.values
    return filter_df

### Pie charts [Introduction]

In [None]:
UK_df = oneh_quotebank_brexit[oneh_quotebank_brexit["united kingdom"] == 1]
years = [2015,2016,2017,2018,2019,2020]
annotations = []
stepsize = 1/len(years)
horizontal_spacing = 0.05

fig = make_subplots(rows=1, cols=6, specs=[[{'type':'domain'}]*6],horizontal_spacing=horizontal_spacing)

for i,pie_year in enumerate(years):
    year_col = pd.DatetimeIndex(UK_df.date).year
    filter_df = UK_df.loc[year_col == pie_year,["sentiment_score"]]
    filter_df = filter_df.groupby("sentiment_score").size().reset_index(name="Count")
    pie_trace = go.Pie(labels=filter_df.sentiment_score,values=filter_df.Count,name=str(pie_year))
    fig.add_trace(pie_trace,1,i+1)
    x_value = fig.data[i]["domain"]["x"]
    print((x_value[1] + x_value[0])/2)
    annotations.append(dict(text=str(pie_year),x=(x_value[1] + 2*x_value[0])/3,y=0.5, font_size=20, showarrow=False))

fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(annotations=annotations)


fig.show()


In [None]:
fig.data

<a id='Country'></a>

## Analyze the way Brexit is perceived in European countries

Recall that the goal is to analyze how Brexit is perceived in each European country, based on the sentiment carried by the quotations. Besides, we would like to add a time dimension to this analysis, that is, follow the evolution of the overall feeling towards Brexit. A view of the expected result is given below:

<a id='Sector'></a>

## Analyze the way Brexit is perceived in different sectors

In [None]:
sector_df = select_by_year(2015,2020,"occupation")

fig = px.bar(sector_df, x="index",text="count",y=sector_df.columns[-4:-1])
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',
                  title="Sector analysis",
                  xaxis_title="Sector",yaxis_title="Share of quotations (%)")
fig.show()


***Dynamic visualization with Dash: opening the URL printed by the cell below should display a plot like the following, with a range slider to select the year range:***

![Dash render](Images/dash_sector.png)

In [None]:
app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Graph(id="scatter-plot"),
    html.P("Year:"),
    dcc.RangeSlider(
        id='range-slider',
        min=2015, max=2020, step=1,
        marks={2015: '2015', 2016:'2016',2017:'2017',2018:'2018',2019:'2019',2020: '2020'},
        value=[2015,2020]
    ),
])

@app.callback(
    Output("scatter-plot", "figure"), 
    [Input("range-slider", "value")])
def update_bar_chart(slider_range):
    low, high = slider_range
    sector_df = select_by_year(low,high,"occupation")
    fig = px.bar(sector_df, x="index",text="count",y=sector_df.columns[-4:-1])
    return fig

app.run_server()

<a id='2Dplot'></a>

## Visualize speakers' orientation through a 2D plot

In [None]:
# TO BE DONE

<a id='Recommandation'></a>

## Recommendation tool

In [None]:
# TO BE DONE

<a id='Stocks'></a>

## Correlation with stocks
It would be interesting to investigate a correlation between Brexit and the evolution of the British stock market. To do so, data for the FTSE 100 companies was obtained: (TO BE CONTINUED)
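
One possible way to quantify such a correlation, sketched here on synthetic data only: align the daily number of quotations with the daily return of a price series and compute their Pearson correlation. All values below are made up for the illustration, and the choice of daily returns as the market signal is an assumption of this sketch, not a decision taken in this notebook.

```python
import pandas as pd

dates = pd.date_range("2016-06-20", periods=6, freq="D")

# Synthetic daily number of Brexit quotations (made-up values)
quote_counts = pd.Series([10, 12, 15, 60, 40, 20], index=dates, name="quotes")

# Synthetic closing prices (made-up values)
close = pd.Series([100.0, 101.0, 100.5, 95.0, 96.0, 97.5], index=dates, name="close")

# Daily relative price change; the first day has no previous value and is dropped
returns = close.pct_change().rename("return")
aligned = pd.concat([quote_counts, returns], axis=1).dropna()

# Pearson correlation between quotation volume and daily return
correlation = aligned["quotes"].corr(aligned["return"])
print(round(correlation, 3))
```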

In [None]:
import yfinance as yf

FTSE_companies = pd.read_excel("Brexit_datas/FTSE_100_list.xlsx")
tickers_FTSE = list(FTSE_companies.Ticker)

# Get the data for the FTSE companies
stock_action_FTSE = yf.download(tickers_FTSE,'2015-01-01','2020-08-01')['Adj Close']

stock_action_FTSE = stock_action_FTSE.dropna(axis=1).reset_index()

fig = px.line(stock_action_FTSE,x="Date",y=stock_action_FTSE.columns[1:10],log_y=True)

fig.show()