# Project Milestone 2

This notebook describes the full pipeline used to produce the results we plan to include in the final story (on the final website). We go through each step and describe the required operations in as much detail as possible. 

For the final story, we decided to focus on the influence of Brexit. More precisely, we would like to assess how Brexit was perceived and how this perception evolved over the years. All the visualizations we aim to provide in the final story are detailed in this [Section](#Results).

## **[Preprocessing steps](#Preprocessing)**

As usual, the first step consists of several substeps that aim at cleaning and transforming the data. Clicking on a task link takes you to the corresponding part of the pipeline.
- *[Data exploration and Sanity check](#Sanity_check)* : Explore the dataset, check its consistency and get familiar with the different features it provides.
    - Collaborators assigned to that task: ALL.
- *[Data extraction](#extraction)* : Extract the data of interest that will be used to perform the tasks related to each idea.
    - Collaborators assigned to that task: Arnaud.
- *[Data augmentation](#augmentation)* : Augment the data with additional features about the quotations, such as the field of the quotation, the nationality of the speaker and so on. These new features will be used to perform the tasks related to each idea.
    - Collaborators assigned to that task: Jean & Gaelle. 
- *[Data cleaning](#cleaning)* 
- *[Quotations and speakers clustering](#clustering)* : Cluster the quotations and the speakers according to a quotation vector and the added features (data augmentation). This clustering will mainly be used to develop a recommendation tool.
    - Collaborators assigned to that task: Raffaele.

## **[Generate the results for the final story](#Results)**

- [General Statistics](#Statistics) : 
- [Country map](#Country) : 
- [Sector map](#Sector) : 
- [Visualize speakers evolution](#2Dplot) :
- [Recommendation tool](#Recommandation) :
- [Correlation with stocks](#Stocks) :


# Before diving into the code 

Make sure you have a `Data` folder containing the following files: 
- The quotebank datasets for each year: `quotes-yyyy.json.bz2`
- The speaker attributes folder `speaker_attributes.parquet` as well as the associated lookup table `wikidata_labels_descriptions_quotebank.csv.bz2`

## Import useful libraries and define useful variables

In [None]:
# STANDARD LIBRARIES
import bz2 
import json
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Dynamic graphs
import plotly.express as px

# Machine learning libraries
import torch
from sentence_transformers import SentenceTransformer
from sklearn.cluster import SpectralClustering
from tsne_torch import TorchTSNE as TSNE
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load the lexicon for sentiment analysis
nltk.download('vader_lexicon')

# Data files
PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]

<a id='Preprocessing'></a>

# Preprocessing steps

<a id='Sanity_check'></a>

## Data exploration and Sanity check

We decided to perform the following sanity checks on the original data: 

- We first check that each field of each quotation has the expected type (e.g. `numOccurrences` should be an integer).
- We check that the `probas` sum to 1.
- We check that `numOccurrences` is greater than or equal to the length of the list containing the urls.
- We check that the `date` is consistent with the dataset the quotation comes from.
- We check that if `qids` is not empty, then a `speaker` is specified.

In [None]:
# SANITY CHECK FUNCTIONS

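def check_type(instance,entry,dtype):
    # A missing key yields None, which fails the check instead of raising a KeyError
    return isinstance(instance.get(entry), dtype)

def check_probas(instance):
    probas = instance["probas"]
    if len(probas) == 0:
        return False
    # Cast to float in case the probabilities are stored as strings
    proba_sum = sum(float(potential[1]) for potential in probas)
    # Compare with a tolerance (arbitrary choice): rounded probabilities rarely sum to exactly 1
    return bool(np.isclose(proba_sum, 1.0, atol=1e-3))

def check_numOcc(instance):
    # The number of occurrences should be at least the number of urls
    return instance["numOccurrences"] >= len(instance["urls"])

def check_date(instance,year):
    quotation_year = int(instance["date"][:4])
    return (quotation_year == year)

def check_author_qids(instance):
    # If qids exist, a speaker must be specified
    # (assumption: a missing speaker is encoded as None or as the string "None")
    has_speaker = instance["speaker"] not in (None, "None")
    return len(instance["qids"]) == 0 or has_speaker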
        

In [None]:
# Define the types for each entry
TYPES = {"quoteID":str,
         "quotation":str,
         "speaker":str,
         "qids":list,
         "date":str,
         "numOccurrences":int,
         "probas":list,
         "urls":list,
         "phase":str}

error_file = "Data/error_file.json.bz2"

with bz2.open(error_file, 'wb') as e_file:
    # Loop over the different files that we will read
    for quotebank_data in PATHS_TO_FILE:
        print("Reading ",quotebank_data," file...")
        # Year of the dataset, parsed from the file name (quotes-yyyy.json.bz2)
        year = int(quotebank_data.split("quotes-")[1][:4])
        # Open the file we want to read
        with bz2.open(quotebank_data, 'rb') as s_file:
            # Loop over the samples
            for instance in s_file:
                potential_error = ""
                # Loading a sample
                instance = json.loads(instance)
                #### CHECK THE TYPES ####
                for key, value in TYPES.items():
                    if not check_type(instance,key,value):
                        potential_error += "| Type problem: " + key + " |"
                #### CHECK THE PROBAS ####
                if not check_probas(instance):
                    potential_error += "| Probas problem |"
                #### CHECK THE DATE ####
                if not check_date(instance, year):
                    potential_error += "| Date problem |"
                #### CHECK THE NUMOCCURRENCES ####
                if not check_numOcc(instance):
                    potential_error += "| NumOccurrences problem |"
                #### CHECK THE AUTHOR-QIDS ####
                if not check_author_qids(instance):
                    potential_error += "| Author-qids problem |"
                # WRITE INTO THE FILE FOR POTENTIAL ERRORS #
                if len(potential_error) > 0:
                    e_file.write((json.dumps(instance)+'\n').encode('utf-8')) 
                    

<a id='extraction'></a>

## Data extraction

As mentioned previously, we plan to analyze how Brexit is perceived and how it influenced other topics. To do so, we first need to extract the quotations that talk about Brexit, following this pipeline:

1. Define a neighborhood containing the words closely related to Brexit. This neighborhood is a list of words or expressions that are commonly used to refer to Brexit.
2. Select all the quotations in which at least one word/expression from the vocabulary neighborhood appears.
3. Store the resulting dataset in the `Brexit_quotes.json.bz2` file.
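
The extraction cell in this section only tests for the single word "brexit". As a minimal sketch of how a larger neighborhood could be matched, the snippet below compiles the terms into one case-insensitive regular expression. The list `BREXIT_TERMS` and the helper `mentions_brexit` are hypothetical names, and the terms themselves are illustrative assumptions, not the final vocabulary of the project.

```python
import re

# Hypothetical neighborhood: illustrative terms only, not the project's final list
BREXIT_TERMS = ["brexit", "article 50", "leave the eu", "leaving the european union"]

# One case-insensitive pattern; \b prevents matches inside longer words
BREXIT_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BREXIT_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def mentions_brexit(quotation):
    """Return True if the quotation contains at least one term of the neighborhood."""
    return BREXIT_PATTERN.search(quotation) is not None

examples = [
    "Brexit means Brexit.",
    "We will trigger Article 50 next year.",
    "The weather is lovely today.",
]
print([mentions_brexit(q) for q in examples])
```

Because of the word boundaries, derived words such as "Brexiteer" are not matched unless they are added to the list, whereas a plain substring test would match them. Which behavior is preferable depends on how broad the neighborhood should be.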


In [None]:
# Input file
PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]
# Output file
PATH_TO_OUT = 'Brexit_datas/Brexit_quotes.json.bz2'

# Open the file where we will write
with bz2.open(PATH_TO_OUT, 'wb') as d_file:
    # Loop over the different files that we will read
    for quotebank_data in PATHS_TO_FILE:
        print("Reading ",quotebank_data," file...")
        # Open the file we want to read
        with bz2.open(quotebank_data, 'rb') as s_file:
            # Loop over the samples
            for instance in s_file:
                # Loading a sample
                instance = json.loads(instance)
                # Extracting the quotation
                quotation = instance['quotation']
                # Check if the quotation contains at least one word related to Brexit
                if "brexit" in quotation.lower():
                    # Writing in the new file
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) 

In [None]:
quotebank_brexit = pd.read_json('Brexit_datas/Brexit_quotes.json.bz2',compression="bz2",lines=True)
quotebank_brexit.sample(2)

<a id='augmentation'></a>

## Data augmentation

To generate the results for the final story, we will need more information than the initial features provide. The analysis requires additional features such as the topic of the quotation, the sentiment it carries, some information about the author and so on. The main idea is to add new features to the existing dataset, or only to the data of interest. To do so, we apply the following pipeline to each quotation:

1. **Add features related to the author** : The first type of features one can add are the ones related to the author. The author's Wikipedia page gives access to a lot of information: looking carefully at the fields of the associated Wikidata item lets us select the useful features listed below:
    - `occupation` gives the author's domain.
    - `member of political party` gives the party the author belongs to.
    - `educated at` tells where the author studied.
    - `country of citizenship` gives the nationality of the author.
    
    These fields may not exist for all authors (as not all the authors are politicians), so we assign a NaN value when a field is missing for an author.

2. **Add computed features** : The second type of features we can add are the ones that are directly derived from the initial ones. We selected a bunch of them that will be useful for further analysis:
    - TO BE OPTIONALLY COMPLETED
3. **Add features issued from a sentiment analysis** : The last feature we would like to add is the sentiment carried by the quotation. For the sake of simplicity, we classify each quotation into three categories: *Negative*, *Neutral* and *Positive*. 
Such a text classification task could be performed with pretrained deep neural networks, but we decided to use **VADER**, a lexicon- and rule-based sentiment analyzer described [here](https://github.com/cjhutto/vaderSentiment). 

#### Load speaker_attributes.parquet file that contains attributes in terms of QIDs.

In [None]:
# Load the parquet that contains the information about speakers
df_attributes = pd.read_parquet('Data/speaker_attributes.parquet')

# We are not interested in the aliases, lastrevid, US_congress_bio_ID, candidacy and type (id is kept to serve as the index).
keep_attributes = ['id','label', 'date_of_birth', 'nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']
# Set the index
df_attributes = df_attributes[keep_attributes].set_index('id')
# Sanity check for the qids
print("Sanity check ok ? : ",df_attributes.index.is_unique)
# Let's have a look
df_attributes.sample(2)

#### For speaker attributes, map the QIDs to meaningful labels.

In [None]:
# Create a dictionary to use as a lookup table 
df_map = pd.read_csv('Data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')
# Dictionary where QIDs are keys and values are the corresponding labels
map_dict = df_map.Label.to_dict()

def mapping(QIDs):
    """
    Map all the QIDs to their labels, 
    using wikidata_labels_descriptions_quotebank.csv.
    Return NaN if there is no QID or if no label is found.
    """
    
    if QIDs is None:
        return np.nan
    # Keep only the QIDs for which a correspondence exists
    QIDs_mapped = [map_dict[QID] for QID in QIDs if QID in map_dict]
    # If nothing was extracted, return NaN
    return QIDs_mapped if len(QIDs_mapped) > 0 else np.nan

columns_to_map = ['nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']

# For each column perform the mapping to transform qids to real value
for column in columns_to_map:
    df_attributes[column] = df_attributes[column].apply(mapping)
    
df_attributes.head(2)

#### Add sentiment score to quote.

In [None]:
# Instantiate the analyzer once: rebuilding it for every quote would be needlessly slow
sid = SentimentIntensityAnalyzer()

def sent_score(quote):
    """Use the sentiment analysis tool VADER to find the sentiment associated with a quote."""
    
    sentiment_dict = sid.polarity_scores(quote)
    
    # The Compound score is a metric that calculates the sum of all the lexicon ratings which have been normalized between
    # -1(most extreme negative) and +1 (most extreme positive).
    # positive sentiment : (compound score >= 0.05) 
    # neutral sentiment : (compound score > -0.05) and (compound score < 0.05) 
    # negative sentiment : (compound score <= -0.05)
    # see https://predictivehacks.com/how-to-run-sentiment-analysis-in-python-using-vader/
    # or https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
    
    # decide sentiment as positive, negative and neutral
    if sentiment_dict['compound'] >= 0.05 :
        return "Positive"
 
    elif sentiment_dict['compound'] <= - 0.05 :
        return "Negative" 
 
    else :
        return "Neutral"

quotebank_brexit['sentiment_score'] = quotebank_brexit.quotation.apply(sent_score) 
quotebank_brexit.head(2)

<a id='cleaning'></a>

## Data merging and cleaning

Depending on the task we want to perform, we need the dataset in various forms, so we generate three datasets: 
- `quotebank_brexit`: original dataset, cleaned
- `aug_quotebank_brexit`: dataset filtered and augmented with the speaker attributes
- `oneh_quotebank_brexit`: dataset where categorical values are encoded as one-hot vectors

TODO: explain why we get rid of the rows with `None` values and of the quotations with multiple `qids`. In the cell below, we keep only the quotations attributed to exactly one QID before merging them with the speaker attributes.

In [None]:
# Keep only the quotations attributed to exactly one QID (copy to avoid modifying a view)
quotebank_brexit_filter = quotebank_brexit.loc[quotebank_brexit.qids.apply(len) == 1].copy()
quotebank_brexit_filter["qids"] = quotebank_brexit_filter.qids.apply(lambda x : x[0])
aug_quotebank_brexit = pd.merge(quotebank_brexit_filter, df_attributes, 'inner', left_on="qids", right_index=True)
aug_quotebank_brexit.head(2)

#### Merge both dataframes to obtain final dataframe

## One hot vectorization

In [None]:
# One hot vectorization of columns containing categorical values
dummy_col = "AAADummy column for the sake"
oneh_quotebank_brexit = aug_quotebank_brexit.copy()
# Columns that contain categorical values
cate_cols = ["nationality","ethnic_group","occupation","party","academic_degree","religion"]
unique_values = {}
# Columns that contain binary values
binary_cols = ["gender"]

# Loop over categorical columns
for col in cate_cols:
    col_serie = aug_quotebank_brexit[col].copy()
    col_serie.loc[col_serie.isna()] = col_serie.loc[col_serie.isna()].apply(lambda x: [dummy_col])
    print("One hot vectorizing : ",col)
    categorical_df = pd.get_dummies(col_serie.apply(pd.Series).stack()).groupby(level=0).sum()
    categorical_df.drop(columns=dummy_col,inplace=True)
    print("Number of different categories : ",len(categorical_df.columns))
    unique_values[col] = categorical_df.columns
    oneh_quotebank_brexit = oneh_quotebank_brexit.join(categorical_df,how="left",rsuffix=col[:3])
    oneh_quotebank_brexit.drop(columns=col,inplace=True)

print("Shape of the final data frame",oneh_quotebank_brexit.shape)
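
The loop above relies on a dummy placeholder to handle missing values. As a possible alternative, here is a minimal sketch based on `Series.explode`, shown on a made-up toy column. It assumes, as in the augmented dataset, that each cell holds either a list of labels or NaN.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a list-valued attribute column (values are made up)
toy = pd.DataFrame({
    "occupation": [["politician", "lawyer"], np.nan, ["journalist"]],
})

# explode() yields one row per list element and keeps the original index
exploded = toy["occupation"].explode()

# get_dummies ignores NaN by default, so rows with a missing value become all-zero vectors
one_hot = pd.get_dummies(exploded, prefix="occupation", dtype=int).groupby(level=0).sum()
print(one_hot)
```

With this approach no placeholder column has to be created and dropped afterwards, and the column prefix avoids name clashes between attributes.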

<a id='clustering'></a>

## Quotations and speakers clustering

The last preprocessing step consists in clustering the quotations as well as the speakers; this clustering will then be used to create a recommendation tool in the context of Brexit. The idea is to first cluster the quotations and then the speakers, such that two quotations/speakers in the same cluster convey similar ideas. This can be done with the following pipeline:
1. The first step is to convert sentences into vectors so that they can be clustered. This can be achieved using the [SentenceTransformer](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) deep neural network. The resulting vector can then be concatenated with the other existing features (converted to one-hot vectors if necessary).
2. The second step consists in reducing the dimension of the data before applying the clustering algorithm. This can be achieved using the [t-distributed stochastic neighbor embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) algorithm or the [Locally Linear Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html#sklearn.manifold.LocallyLinearEmbedding) algorithm. Both techniques (especially the first one) are efficient non-linear dimensionality reduction methods.
3. The third step is specific to speaker clustering. Indeed, the vectorization and the dimensionality reduction are only applied to quotes. Thus we need to perform an **aggregation** to attribute a vector to each speaker. For each speaker, this aggregation can simply be done by taking the mean of the vectors associated with each of their quotations. 
4. The last step consists in performing the clustering itself. This can be achieved using the [Gaussian Mixture Model](https://scikit-learn.org/stable/modules/mixture.html#mixture) algorithm or the [Spectral Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) method.
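
To make the aggregation of step 3 concrete, here is a minimal sketch on toy data: each speaker receives the mean of the vectors of their quotations. The embeddings and speaker names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy quote embeddings (3 quotes, 2 dimensions) with made-up speakers
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
speakers = ["A", "A", "B"]

# One row per quote, indexed by its speaker
emb_df = pd.DataFrame(embeddings, index=pd.Index(speakers, name="speaker"))

# Mean of the quote vectors of each speaker
speaker_vectors = emb_df.groupby(level="speaker").mean()
print(speaker_vectors)
```

Speaker `A` ends up halfway between their two quotes, while speaker `B`, who has a single quote, keeps that quote's vector.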

In [None]:
"""
    User-defined parameters for the task
"""
sentiment_amplification = 1.5 # coefficient of amplification of sentiment, amplification is applied after normalization
normalize_tensor = True # choose whether to normalize the sentence tensor before clustering
nb_clusters = 8   # Number of clusters to be identified
sentence_transformer_type = 'all-MiniLM-L6-v2' # type of the sentence_transformer

In [None]:
# Define here all useful tools for the task.

def column_dummy_list(df, col):
  """
  Given a column of a dataframe, generate a new dummy dataframe,
  taking into account possible pd.NA and value multiplicity.
  pd.NA corresponds to a null vector.
  """
  # restrict to the specified column and replace missing values by a recognizable ['NA'] list
  # (apply returns a new Series, so the original dataframe is left untouched)
  dfcol = df[col].apply(lambda x: x if isinstance(x, list) else (['NA'] if pd.isna(x) else [x]))

  # get dummy dataframe
  out = dfcol.apply(pd.Series).stack().str.get_dummies().groupby(level=0).sum().add_prefix(col + '_')
  # drop the NA dummy (it only exists if at least one value was missing)
  return out.drop(columns=col + '_NA', errors='ignore')


def sentiment_to_int(value):
  """Convert sentiment_score to Positive = 1, Negative = -1, default = 0."""
  return int(value == "Positive") - int(value == "Negative")


# Tools for the pandas.DataFrame to pytorch.tensor conversion

def get_device():
    """Determine the supported device."""
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu') # no GPU available
    return device

def df_to_tensor(df):
    """Convert a dataframe to a tensor to be used in pytorch."""
    device = get_device()
    return torch.from_numpy(df.values).float().to(device)


def ensure_list(value):
  """Sanity check on data format: all non-missing elements should be lists of strings."""
  if isinstance(value, list):
    value = [str(item) for item in value]
  elif not pd.isna(value):
    value = [value]
  return value

Prepare the dataframe for the one-hot vectorization task

In [None]:
# ensure columns are present
columns_to_map = [item for item in columns_to_map if item in list(aug_quotebank_brexit)]

# restrict dataframe to the one needed
cluster_df = aug_quotebank_brexit.loc[:, ['speaker', 'quotation', 'sentiment_score', *columns_to_map]]

# keep a single row per index value (removes duplicated quotations)
cluster_df = cluster_df.groupby(cluster_df.index).first()

# apply sanity check on the elements format
for label in columns_to_map:
  cluster_df[label] = cluster_df[label].apply(ensure_list)

# visualize an example
cluster_df.loc[cluster_df['speaker'] == 'Laura Huhtasaari'].head(3)

## Step 1: Encode quotations into their corresponding vectorization

In [None]:
# Encode quotation 
encoder = SentenceTransformer(sentence_transformer_type)
quotes_encoded = encoder.encode(cluster_df['quotation'].tolist(), convert_to_numpy=True)

## Step 2: Compose the vectorized dataframe

1. Convert the sentiment score into a signed integer: "Positive" = 1, "Negative" = -1, "Neutral" = 0
2. For each column describing the speaker, generate a dummy dataframe (see `pandas.get_dummies`)
3. Concatenate all the obtained dataframes along the columns
4. Average all rows matching the same speaker (which is set as index)
5. Normalize the dataset by row and amplify `sentiment_score`
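
To illustrate the last item, here is a minimal sketch of the row-wise L2 normalization followed by the sentiment amplification, on a made-up two-speaker table. The feature columns `f0` and `f1` are hypothetical stand-ins for the one-hot and embedding columns.

```python
import numpy as np
import pandas as pd

sentiment_amplification = 1.5  # same value as in the parameter cell

# Toy speaker vectors (made-up values): one sentiment column plus two feature columns
toy = pd.DataFrame(
    {"sentiment_score": [1.0, -1.0], "f0": [2.0, 0.0], "f1": [2.0, 0.0]},
    index=["speaker_a", "speaker_b"],
)

# L2 norm of each row, then divide every row by its own norm
row_norms = np.sqrt((toy ** 2).sum(axis=1))
normalized = toy.div(row_norms, axis=0)

# Amplify the sentiment column after the normalization
normalized["sentiment_score"] *= sentiment_amplification
print(normalized)
```

Since the amplification is applied after the normalization, the rows no longer have unit norm. A row made only of zeros would have a zero norm and produce NaN values, so such rows may need to be handled separately.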

In [None]:
"""
    Compose a vectorized dataframe
"""

# replace sentiment by integer value
info_df = cluster_df.loc[:,['sentiment_score']].copy()

# convert sentiment_score into signed unitary integer
info_df['sentiment_score'] = info_df['sentiment_score'].apply(sentiment_to_int)

# for each column, get the one-hot dummy dataframe 
for column in columns_to_map:
  # get the dummy-encoded column
  dummy_column_df = column_dummy_list(cluster_df, column)
  # concatenate it to info_df along horizontal direction
  info_df = pd.concat([info_df, dummy_column_df], axis = 1)

# replace string quotations by encoded quotations vectors
quotes_df = pd.DataFrame(quotes_encoded, index = info_df.index)
full_df = pd.concat([info_df, quotes_df], axis = 1)
full_df['speaker'] = cluster_df['speaker'].values
full_df.set_index('speaker', drop=True, inplace=True)

# average over the same speaker
full_df = full_df.groupby(level=0).mean()

# normalize dataset by row
if normalize_tensor:
  full_df = full_df.div(np.sqrt(np.square(full_df).sum(axis=1)), axis=0)

# amplify sentiment
full_df['sentiment_score'] *= sentiment_amplification

full_df.head(5)

## Step 3: Convert to a *PyTorch* tensor and apply the t-SNE reduction

In [None]:
# Convert into pytorch tensor
full_data_tensor = df_to_tensor(full_df)

# Apply t-distributed stochastic neighbor embedding (t-SNE)
tsne_dim = 2      # TSNE reduction final dimension, default is 2
data_tensor_emb = TSNE(n_components=tsne_dim, perplexity=30, n_iter=1000).fit_transform(full_data_tensor) # dim = Nxfinal_dim

results = data_tensor_emb.transpose()

# Visualize without clustering
plt.scatter(results[0], results[1])
plt.title("TSNE bidimensional reduction of the speaker vectorization")

plt.show()

## Step 4: Apply the clustering algorithm

In [None]:
# Clustering
clustering = SpectralClustering(n_clusters=nb_clusters).fit(data_tensor_emb)

# Visualization
fig, axis = plt.subplots(1, 2, figsize=(14, 7))

results = data_tensor_emb.transpose()

# Visualize without clustering
axis[0].scatter(results[0], results[1])


for label in range(nb_clusters):
    # select data by clustering label
    points = data_tensor_emb[clustering.labels_ == label]
    points = points.transpose()
    # plot data
    axis[1].scatter(points[0], points[1])
    
plt.show()

<a id='Results'></a>

# Generate the results for the final story

<a id='Statistics'></a>

## General Statistics

In [None]:
aug_quotebank_brexit.columns

In [None]:
fig, axes = plt.subplots(2,2,figsize=[20,14])

# Plot the number of quotations with respect to the date
sns.histplot(data=aug_quotebank_brexit,x="date",ax = axes[0,0])

# Plot the number of speakers and the number of quotations per country

# nationality holds a list of labels: explode it to get one row per (quotation, country)
country_data = aug_quotebank_brexit[["nationality","speaker"]].explode("nationality").dropna(subset=["nationality"])

# Count the quotations and the distinct speakers per country
country_data = country_data.groupby("nationality").agg(**{"Quotation Count": ("speaker", "size"),
                                                          "Speaker Count": ("speaker", "nunique")}).reset_index()

sns.barplot(data=country_data,x="nationality",y="Speaker Count",ax=axes[0,1])
sns.barplot(data=country_data,x="nationality",y="Quotation Count",ax=axes[1,1])

# Plot the number of 





<a id='Country'></a>

## Analyze the way Brexit is perceived in European countries

Recall that the goal is to analyze how Brexit is perceived in each European country, based on the sentiment carried by the quotations. We would also like to add a time dimension to this analysis, that is, to follow the evolution of the overall feeling towards Brexit. A view of the expected result is given below:
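
As a minimal sketch of how the data behind such a map could be computed, the snippet below averages a signed sentiment per country and per year on a made-up toy table. It assumes the column layout of `aug_quotebank_brexit` (a `date` column, a list-valued `nationality` column and a `sentiment_score` label). The mapping `SENTIMENT_TO_INT` is a hypothetical helper.

```python
import pandas as pd

# Toy data mimicking the augmented dataset (all values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-24", "2016-07-01", "2017-03-29", "2017-04-02"]),
    "nationality": [["France"], ["France", "Germany"], ["Germany"], None],
    "sentiment_score": ["Negative", "Positive", "Negative", "Neutral"],
})

# Hypothetical mapping from sentiment label to signed integer
SENTIMENT_TO_INT = {"Positive": 1, "Neutral": 0, "Negative": -1}

per_country = (
    toy.assign(year=toy["date"].dt.year,
               sentiment=toy["sentiment_score"].map(SENTIMENT_TO_INT))
       .explode("nationality")              # one row per (quotation, country)
       .dropna(subset=["nationality"])      # drop quotations without nationality
       .groupby(["nationality", "year"])["sentiment"]
       .mean()
       .unstack("year")                     # countries as rows, years as columns
)
print(per_country)
```

A quotation whose speaker has several nationalities counts once for each of them. Country and year pairs without any quotation appear as NaN in the resulting table.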

<a id='Sector'></a>

## Analyze the way Brexit is perceived in different sectors

In [None]:
# SYNTHETIC DATA
df_dic = {}
df_dic["Sector"] = ["Politics"]*3 + ["Economy"]*3 + ["Science"]*3 + ["Art"]*3
df_dic["Sentiment"] = ["Positive","Neutral","Negative"]*4
df_dic["percentage"] = [40,40,20,50,20,30] + [30,50,20]*2

# df = pd.get_dummies(pd.DataFrame(df_dic))
df = pd.DataFrame(df_dic)
df.head()

In [None]:
fig = px.scatter(df, x="Sector", y="Sentiment", color="Sentiment",
                 size='percentage')
fig.show()
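
The table above is synthetic. As a minimal sketch of how the same layout could be derived from real data, the snippet below computes the percentage of each sentiment per sector with `pd.crosstab`. It assumes that the list-valued `occupation` column stands in for the sector, which is an assumption rather than a decision taken in this notebook, and it runs on made-up toy values.

```python
import pandas as pd

# Toy data: "occupation" stands in for the sector (values are made up)
toy = pd.DataFrame({
    "occupation": [["politician"], ["politician"], ["economist"], ["politician", "economist"]],
    "sentiment_score": ["Positive", "Negative", "Neutral", "Positive"],
})

# One row per (quotation, occupation)
exploded = toy.explode("occupation")

# Row-normalized contingency table, expressed in percent
table = pd.crosstab(exploded["occupation"], exploded["sentiment_score"], normalize="index") * 100

# Long format with the same column names as the synthetic table
df_real = (
    table.reset_index()
         .melt(id_vars="occupation", var_name="Sentiment", value_name="percentage")
         .rename(columns={"occupation": "Sector"})
)
print(df_real)
```

The resulting `df_real` has the columns `Sector`, `Sentiment` and `percentage`, so it could be passed to the same `px.scatter` call as the synthetic table.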

<a id='2Dplot'></a>

## Visualize speakers orientation through a 2D plot

<a id='Recommandation'></a>

## Recommendation tool

<a id='Stocks'></a>

## Correlation with stocks

In [None]:
# CORRELATION WITH STOCK PRICES