# Project Milestone 2

Here we describe the whole pipeline that produces the results we would like to include in the final story (on the final website). We go through each step and describe the required operations in as much detail as possible.

For the final story we decided to focus on the influence of Brexit. More precisely, we would like to assess how Brexit was perceived and how this perception evolved over the years. The visualizations we aim to provide in the final story are detailed in this [Section](#Results).

## **[Preprocessing steps](#Preprocessing)**

As usual, the first step consists of several substeps that clean and transform the data. Clicking on a task link takes you to the corresponding pipeline.
- *[Data exploration and Sanity check](#Sanity_check)* : Explore the dataset, check its consistency and get familiar with the different features it provides.
    - Collaborators assigned to that task: ALL.
- *[Data extraction](#extraction)* : Extract the data of interest that will be used to perform the tasks related to each idea.
    - Collaborators assigned to that task: Arnaud.
- *[Data augmentation](#augmentation)* : Augment the data with more features about the quotations, such as the field of the quote, the nationality of the speaker and so on. These new features will be used to perform the tasks related to each idea.
    - Collaborators assigned to that task: Jean & Gaelle.
- *[Quotations and speakers clustering](#clustering)* : Cluster the quotations and the speakers according to a quotation vector and the added features (data augmentation). This clustering will mainly be used to develop a recommendation tool.
    - Collaborators assigned to that task: Raffaele.

## **[Generate the results for the final story](#Results)**

- [General Statistics](#Statistics) : 
- [Country map](#Country) : 
- [Sector map](#Sector) : 
- [Visualize speakers evolution](#2Dplot) :
- [Recommendation Tool](#Recommandation) :
- [Correlation with stocks](#Stocks) :


# Before diving into the code 

Make sure you have a **Data** Folder containing all the quotebank datasets.

## Import useful libraries and define useful constants

In [None]:
import bz2 
import json
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]

<a id='Preprocessing'></a>

# Preprocessing steps

<a id='Sanity_check'></a>

## Data exploration and Sanity check

We decided to perform the following sanity checks on the original data: 

- We first check that each entry of each quotation has the right type (e.g. `numOccurrences` should be an integer).
- We check that the `probas` sum to 1.
- We check that `numOccurrences` is greater than or equal to the length of the list containing the urls.
- We check that the `date` is consistent with the dataset the quotation comes from.

In [None]:
# SANITY CHECK FUNCTIONS

def check_type(instance,entry,dtype):
    return type(instance[entry]) == dtype

def check_probas(instance):
    # TO BE DEFINED
    return None

def check_numOcc(instance):
    # TO BE DEFINED
    return None

def check_date(instance,year):
    # TO BE DEFINED
    return None
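
The three placeholder checks could be filled in along the following lines. This is only a sketch under stated assumptions, not the final implementation: it assumes that `probas` is a list of `[speaker, probability]` pairs whose probabilities may be stored as strings, that `date` is a string starting with a four-digit year, and that a small tolerance (`tol`, a hypothetical parameter) is acceptable when summing rounded probabilities. The `sample` record is toy data.

```python
import math

def check_probas(instance, tol=1e-3):
    """Return True if the speaker probabilities sum to 1 within `tol`."""
    # Assumption: each element of `probas` is a [speaker, probability] pair,
    # and the probability may be stored as a string.
    total = sum(float(proba) for _, proba in instance["probas"])
    return math.isclose(total, 1.0, abs_tol=tol)

def check_numOcc(instance):
    """Return True if `numOccurrences` is at least the number of listed urls."""
    return instance["numOccurrences"] >= len(instance["urls"])

def check_date(instance, year):
    """Return True if the quotation date falls in the year of its source file."""
    # Assumption: `date` is a string that starts with a four-digit year.
    return instance["date"][:4] == str(year)

# Toy record used only to illustrate the calls
sample = {
    "probas": [["Speaker A", "0.8"], ["None", "0.2"]],
    "numOccurrences": 3,
    "urls": ["http://example.com/a", "http://example.com/b"],
    "date": "2019-03-29 10:00:00",
}
print(check_probas(sample), check_numOcc(sample), check_date(sample, 2019))
```

Each function returns a boolean, so it can be plugged into the reading loop in the same way as `check_type`.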

In [None]:
# Define the types for each entry
TYPES = {"quoteID":str,
         "quotation":str,
         "speaker":str,
         "qids":list,
         "date":str,
         "numOccurrences":int,
         "probas":list,
         "urls":list,
         "phase":str}

# One error list per input file (all the files in PATHS_TO_FILE, otherwise the loop below raises a KeyError)
error_dic = {path: [] for path in PATHS_TO_FILE}

# Loop over the different files that we will read
for quotebank_data in PATHS_TO_FILE:
    # Open the file we want to read
    with bz2.open(quotebank_data, 'rb') as s_file:
        # Loop over the samples
        for instance in s_file:
            # Loading a sample
            instance = json.loads(instance)
            #### CHECK THE TYPES ####
            for key, value in TYPES.items():
                if not check_type(instance,key,value):
                    error_dic[quotebank_data].append(instance["quoteID"] + ": " + key + " type problem")

print(error_dic)

<a id='extraction'></a>

## Data extraction

As mentioned previously, we are planning to analyze the influence of Brexit on different sectors as well as the evolution of feelings towards China. To perform such tasks, we first need to extract the quotations that talk about Brexit and the ones that talk about China. To do so we will follow this pipeline:

1. For both Brexit and China, define a neighborhood containing all the words that are closely related to the topic. This neighborhood will be a list of words or expressions that are commonly used to refer to Brexit or China. For instance, for China one could add the expression *"the Middle Kingdom"*, which is often used to refer to China, to the vocabulary neighborhood.
2. For both Brexit and China, select all the quotations in which at least one word/expression from the vocabulary neighborhood appears.
3. Store each resulting dataset in its own file. For Brexit: 
    - `Brexit_quotes.json.bz2`
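
Steps 1 and 2 could be sketched as follows. This is a minimal illustration, not the extraction code itself: the terms in `BREXIT_VOCABULARY` are hypothetical placeholders rather than a vetted neighborhood, and `build_matcher` and `mentions_topic` are helper names introduced here for the example.

```python
import re

# Hypothetical vocabulary neighborhood: placeholder terms, not a vetted list
BREXIT_VOCABULARY = ["brexit", "article 50", "leave the eu"]

def build_matcher(vocabulary):
    """Compile one case-insensitive pattern matching any term as a whole word or phrase."""
    escaped_terms = (re.escape(term) for term in vocabulary)
    return re.compile(r"\b(?:" + "|".join(escaped_terms) + r")\b", re.IGNORECASE)

def mentions_topic(quotation, matcher):
    """Return True if at least one term of the vocabulary appears in the quotation."""
    return matcher.search(quotation) is not None

brexit_matcher = build_matcher(BREXIT_VOCABULARY)
print(mentions_topic("We will trigger Article 50 next week.", brexit_matcher))
print(mentions_topic("The weather is lovely today.", brexit_matcher))
```

Matching on word boundaries is a design choice: with it, a term such as `brexit` does not match inside `Brexiteers`, whereas a plain substring test like `"brexit" in quotation.lower()` does. Dropping the trailing `\b` would restore the substring-like behavior for word prefixes.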


In [None]:
# Input file
PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]
# Output file
PATH_TO_OUT = 'Brexit_datas/Brexit_quotes.json.bz2'

# Open the file where we will write
with bz2.open(PATH_TO_OUT, 'wb') as d_file:
    # Loop over the different files that we will read
    for quotebank_data in PATHS_TO_FILE:
        print("Reading ",quotebank_data," file...")
        # Open the file we want to read
        with bz2.open(quotebank_data, 'rb') as s_file:
            # Loop over the samples
            for instance in s_file:
                # Loading a sample
                instance = json.loads(instance)
                # Extracting the quotation
                quotation = instance['quotation']
                # Check if the quotation contains at least one word related to Brexit
                if "brexit" in quotation.lower():
                    # Writing in the new file
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) 

In [None]:
quotebank_brexit = pd.read_json('Brexit_datas/Brexit_quotes.json.bz2',compression="bz2",lines=True)

In [None]:
quotebank_brexit.sample()

<a id='augmentation'></a>

## Data augmentation

To generate the results for the final story, we will need more information than the initial features provide. The analysis requires additional features such as the topic of the quotation, the sentiment it carries, some information about the author and so on. The main idea is to add new features to the existing dataset, or only to the data of interest. To do so, we will follow this pipeline for each quotation:

1. **Add features related to the author** : The first type of features we can add are those related to the author. The author's Wikipedia page gives access to a lot of information: looking carefully at the fields of the corresponding Wikidata item lets us select some useful features, listed below:
    - `occupation` gives the author's domain.
    - `member of political party` gives the party the author belongs to.
    - `educated at` gives where the author studied.
    - `country of citizenship` gives the nationality of the author.
    
    These fields may not exist for all authors (for instance, not all authors are politicians); in that case we assign a NaN value to the missing field.

2. **Add computed features** : The second type of features we can add are the ones that are directly derived from the initial ones. We selected a bunch of them that will be useful for further analysis:
    - TO BE OPTIONALLY COMPLETED
3. **Add features derived from a sentiment analysis** : The last feature we would like to add is the sentiment carried by the quotation. Initially we were thinking about a binary sentiment classification: 0 if the sentiment is negative, 1 if it is positive. We could further expand that by classifying the quotations into several categories such as *anger*, *sadness*, *factual* and so on.    
Such a text classification task can be performed with pretrained deep neural networks. The XLNet network ([GitHub page](https://github.com/zihangdai/xlnet/) & [Library containing XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)) is close to the state of the art for classification. We therefore plan to use it to determine the sentiment contained in each quotation.

In [None]:
# DO THE DATA AUGMENTATION HERE

#### Load speaker_attributes.parquet file that contains attributes in terms of QIDs.

In [None]:
df_attributes = pd.read_parquet('Data/speaker_attributes.parquet')
df_attributes.head(2)

In [None]:
# we are not interested in the aliases, lastrevid, US_congress_bio_ID, id, candidacy and type.

keep_attributes = ['label', 'date_of_birth', 'nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']
df_attributes = df_attributes[keep_attributes].set_index('label')
df_attributes.head(2)

#### For speaker attributes, map the QIDs to meaningful labels.

In [None]:
# create dictionnary to use it as a lookup table 

df_map = pd.read_csv('Data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')
map_dict = df_map.Label.to_dict()

In [None]:
def mapping(QIDs):
    """The purpose of this function is to map all the QIDs to their labels, using wikidata_labels_descriptions_quotebank.csv"""
    
    if QIDs is None:
        return pd.NA
    else:
        QIDs_mapped = []
        for QID in QIDs:
            try:
                QIDs_mapped.append(map_dict[QID])
            except KeyError:
                continue
        return QIDs_mapped

In [None]:
columns_to_map = ['nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']

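# TO BE IMPROVED: the commented line below does not work. DataFrame.apply with axis=0 passes each whole
# column (a Series) to `mapping`, whereas `mapping` expects the list of QIDs of a single cell.
# The loop below applies it cell by cell instead; it works but is not very clean.
# df_attributes[columns_to_map] = df_attributes[columns_to_map].apply(mapping, axis=0)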
for column in columns_to_map:
    df_attributes[column] = df_attributes[column].apply(mapping)
    
df_attributes.head(2)

#### Add sentiment score to quote.

In [None]:
# Same path as the one written during the data extraction step
quotebank_brexit = pd.read_json('Brexit_datas/Brexit_quotes.json.bz2', compression='bz2', lines=True)
quotebank_brexit.head(2)

In [None]:
# Instantiate the analyzer once: building it inside the function would reload the lexicon for every quote
sid = SentimentIntensityAnalyzer()

def sent_score(quote):
    """Use the VADER sentiment analysis tool to find the sentiment associated with a quote."""
    
    sentiment_dict = sid.polarity_scores(quote)
    
    # The Compound score is a metric that calculates the sum of all the lexicon ratings which have been normalized between
    # -1(most extreme negative) and +1 (most extreme positive).
    # positive sentiment : (compound score >= 0.05) 
    # neutral sentiment : (compound score > -0.05) and (compound score < 0.05) 
    # negative sentiment : (compound score <= -0.05)
    # see https://predictivehacks.com/how-to-run-sentiment-analysis-in-python-using-vader/
    # or https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
    
    # decide sentiment as positive, negative and neutral
    if sentiment_dict['compound'] >= 0.05 :
        return "Positive"
 
    elif sentiment_dict['compound'] <= - 0.05 :
        return "Negative" 
 
    else :
        return "Neutral"

In [None]:
quotebank_brexit['sentiment_score'] = quotebank_brexit['quotation'].apply(sent_score) 
quotebank_brexit.head(2)

#### Merge both dataframes to obtain final dataframe

In [None]:
augmented_quotebank_brexit = pd.merge(quotebank_brexit, df_attributes, 'inner', left_on='speaker', right_index=True)
augmented_quotebank_brexit.head(2)

<a id='clustering'></a>

## Quotations and speakers clustering

The last preprocessing step consists in clustering the quotations as well as the speakers; this clustering will then be used to create a recommendation tool in the context of Brexit. The idea is to first cluster the quotations and then the speakers, such that two quotations/speakers in the same cluster convey similar ideas. This can be done with the following pipeline:
1. The first step is to convert sentences into vectors so that they can be clustered. This can be achieved with the [SentenceTransformer](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) deep neural network. The resulting vector can then be concatenated with the other existing features (converted to one-hot vectors if necessary).
2. \[OPTIONAL STEP\] The second step consists in reducing the dimension of the data before applying the clustering algorithm. This can be achieved with the [t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) (t-SNE) algorithm or the [Locally Linear Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html#sklearn.manifold.LocallyLinearEmbedding) algorithm. These two techniques (especially the first one) are efficient non-linear dimensionality reduction methods.
3. The third step is specific to speaker clustering. The vectorization and the dimensionality reduction are only applied to quotes, so we need to perform an **aggregation** to attribute a vector to each speaker. For each speaker, this aggregation can simply be done by taking the mean of the vectors associated with each of their quotations. 
4. The last step consists in performing the clustering itself. This can be achieved with a [Gaussian Mixture Model](https://scikit-learn.org/stable/modules/mixture.html#mixture) or the [Spectral Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) method.
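
The aggregation of step 3 can be illustrated on toy data with plain Python lists. This is only a sketch: `aggregate_by_speaker` is a hypothetical helper, the two-dimensional vectors stand in for real sentence embeddings, and the actual pipeline would operate on tensors rather than lists.

```python
from statistics import fmean

def aggregate_by_speaker(quote_vectors, speakers):
    """Average the quotation vectors of each speaker, component by component."""
    grouped = {}
    for vector, speaker in zip(quote_vectors, speakers):
        grouped.setdefault(speaker, []).append(vector)
    # zip(*vectors) iterates over the components (columns) of a speaker's vectors
    return {speaker: [fmean(component) for component in zip(*vectors)]
            for speaker, vectors in grouped.items()}

# Toy embeddings: one vector per quotation, aligned with the list of speakers
quote_vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]]
speakers = ["Speaker A", "Speaker A", "Speaker B"]
speaker_vectors = aggregate_by_speaker(quote_vectors, speakers)
print(speaker_vectors)
```

Grouping by speaker name avoids having to track row offsets by hand, at the cost of keeping the speaker label next to every quotation vector.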

In [None]:
# PERFORM CLUSTERING HERE

# These modules must be installed in the Jupyter environment
from sentence_transformers import SentenceTransformer
from sklearn.cluster import SpectralClustering
import torch
from tsne_torch import TorchTSNE as TSNE

# Encode data

encoder = SentenceTransformer('all-MiniLM-L6-v2')

# `quotes` is assumed to be a dictionary with speakers as keys and the list of their quotes as values

# Step 1: encode the sentences of each speaker into a pytorch tensor nxD
# n = number of sentences of the speaker
# D = dimension of a sentence vector
encoded_quotes = [encoder.encode(sentences, convert_to_tensor=True) for sentences in quotes.values()]

# Concatenate the tensors by rows into a single NxD tensor
# N = total number of sentences (samples)
data_tensor = torch.cat(encoded_quotes, 0)

# Step 2: t-distributed stochastic neighbor embedding
final_dim = 20
data_tensor_emb = TSNE(n_components=final_dim, perplexity=30, n_iter=1000).fit_transform(data_tensor) # dim = Nxfinal_dim
# Make sure we work with a tensor, whatever type fit_transform returns
data_tensor_emb = torch.as_tensor(data_tensor_emb)

# Step 3: aggregate by taking the mean over the rows (quotes) of each speaker
speaker_vectors = []
i = 0
for speaker, sentences in quotes.items():
    # Rows i .. i + len(sentences) - 1 hold the quotes of the current speaker
    speaker_vectors.append(torch.mean(data_tensor_emb[i : i + len(sentences)], 0))
    i += len(sentences)

# One row per speaker: dim = n_speakers x final_dim
speaker_tensor = torch.stack(speaker_vectors)

# Step 4: Clustering and actual training

# TODO: vector length normalization?

cluster_model = SpectralClustering()
cluster_model.fit(speaker_tensor.cpu().numpy())

# TODO: visualization

<a id='Results'></a>

# Generate the results for the final story

<a id='Statistics'></a>

## General Statistics

In [None]:
fig, axes = plt.subplots(2,2,figsize=[20,14])

# Plot the number of quotations with respect to the date
sns.histplot(data=augmented_quotebank_brexit,x="date",ax = axes[0,0])

# Plot the number of speakers and the number of quotations per country

# Assumption: the country is taken from the `nationality` attribute of the speaker, which holds a list of labels;
# explode it to get one row per (quotation, country) pair
quotes_by_country = augmented_quotebank_brexit.explode("nationality").rename(columns={"nationality": "country"})

# Count speakers per country
country_data = quotes_by_country.loc[:,["country","speaker"]].drop_duplicates()
country_data = country_data.groupby("country").size().reset_index(name="Speaker Count")

# Add number of quotations per country (merge on the column; join would match against the index)
country_data = country_data.merge(quotes_by_country.groupby("country").size().reset_index(name="Quotation Count"),
                                  on = "country")

sns.barplot(data=country_data,x="country",y="Speaker Count",ax=axes[0,1])
sns.barplot(data=country_data,x="country",y="Quotation Count",ax=axes[1,1])

# Plot the number of 





<a id='Country'></a>

## Analyze the way Brexit is perceived in European countries

Recall that the goal is to analyze the way Brexit is perceived in each European country, based on the sentiment carried by the quotations. We would also like to add a time dimension to this analysis, that is, to follow the evolution of the overall feelings towards Brexit. A view of the expected result is given below:

<a id='Sector'></a>

## Analyze the way Brexit is perceived in different sectors

In [None]:
# SYNTHETIC DATA
df_dic = {}
df_dic["Sector"] = ["Politic","Politic","Politic","Economy","Science","Art"]
df_dic["Sentiment"] = ["Positive","Neutral","Negative","Positive","Positive","Positive"]
df_dic["percentage"] = [40,60,80,80,60,50]

df = pd.DataFrame(df_dic)
df.head()

In [None]:
fig = px.scatter(df, x="Sector", y="Sentiment", color="Sentiment",
                 size='percentage')
fig.show()

<a id='2Dplot'></a>

## Visualize speakers orientation through a 2D plot

<a id='Recommandation'></a>

## Recommendation tool

<a id='Stocks'></a>

## Correlation with stocks