# Project Milestone 2

Here we describe the whole pipeline used to produce the results that we would like to include in the final story (on the final website). We go through each step and describe the required operations in as much detail as possible.

For the final story we decided to focus on the influence of Brexit. More precisely, we would like to assess how Brexit was perceived and how this perception evolved over the years. The visualizations we aim to provide in the final story are detailed in this [Section](#Results).

## **[Preprocessing steps](#Preprocessing)**

As usual, the first step consists of several substeps that aim at cleaning and transforming the data. By clicking on a task link, you can access the corresponding pipeline.
- *[Data exploration and Sanity check](#Sanity_check)* : Explore the dataset, check its consistency and get familiar with the different features/information it provides.
    - Collaborators assigned to that task: ALL.
- *[Data extraction](#extraction)* : Extract the data of interest that will be further used to perform the tasks related to each idea.
    - Collaborators assigned to that task: Arnaud.
- *[Data augmentation](#augmentation)* : Perform a data augmentation to get more features about the quotations such as the quote field, the nationality of the speaker and so on. These new features will be further used to perform the tasks related to each idea.
    - Collaborators assigned to that task: Jean & Gaelle. 
- *[Data cleaning](#cleaning)* 
- *[Quotations and speakers clustering](#clustering_task)* : Cluster the quotations and the speakers according to the quotation vector and the added features (data augmentation). This clustering will mainly be used to develop a recommendation tool.
    - Collaborators assigned to that task: Raffaele.

## **[Generate the results for the final story](#Results)**

- [General Statistics](#Statistics) : 
- [Country map](#Country) : 
- [Sector map](#Sector) : 
- [Visualize speakers evolution](#2Dplot) :
- [Recommendation Tool](#Recommandation) :
- [Correlation with stocks](#Stocks) :


# Before diving into the code 

Make sure you have a `Data` folder containing the following files: 
- The quotebank datasets for each year: `quotes-yyyy.json.bz2`
- The speaker attributes folder `speaker_attributes.parquet` as well as the associated lookup table `wikidata_labels_descriptions_quotebank.csv.bz2`

Make sure you have a `Brexit_datas` folder containing the following files available on this Google drive: 
- The quotebank dataset containing brexit quotations: 
- The quotebank dataset containing the brexit quotations with a sentiment analysis
- The quotebank dataset containing the quotes translated into vectors


## Import useful libraries and define useful variables

In [None]:
# STANDARD LIBRARIES
from os.path import exists
import bz2 
import json
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import itertools

# Dynamic graphs
import plotly.express as px
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output

# Machine learning libraries
import torch
from sentence_transformers import SentenceTransformer, util
from sklearn.cluster import SpectralClustering
from tsne_torch import TorchTSNE as TSNE
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Graph algorithms
import networkx as nx

# Load the lexicon for sentiment analysis
nltk.download('vader_lexicon')

import warnings
warnings.filterwarnings("ignore")

# Data files
PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]

<a id='Preprocessing'></a>

# Preprocessing steps

<a id='Sanity_check'></a>

## Data exploration and Sanity check

We decided to perform the following sanity checks on the original data: 

- We first check that each field of each quotation has the expected type (e.g. `numOccurrences` should be an integer).
- We check that the `probas` sum to 1.
- We check that `numOccurrences` is greater than or equal to the length of the list containing the urls.
- We check that the `date` is consistent with the dataset the quotation comes from.
- We check that if `qids` is non-empty then a `speaker` is specified.
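
One caveat for the probability check: the values are floating-point numbers (possibly serialized as strings), so testing their sum for exact equality with 1 can reject valid records. The sketch below shows a tolerant version of the check. The helper name `probas_sum_to_one`, the tolerance `1e-3` and the example record are assumptions made for illustration, not part of the dataset specification.

```python
import math

def probas_sum_to_one(probas, tol=1e-3):
    """Return True if the speaker probabilities sum to 1 within `tol`.

    `probas` is a list of [speaker, probability] pairs; the probability
    may be a number or a numeric string (hypothetical helper).
    """
    # An empty list cannot describe a probability distribution
    if len(probas) == 0:
        return False
    total = sum(float(probability) for _, probability in probas)
    # Tolerant comparison instead of `total == 1`
    return math.isclose(total, 1.0, abs_tol=tol)

# Made-up record, for illustration only
example = [["Speaker A", "0.7"], ["None", "0.2"], ["Speaker B", "0.1"]]
print(probas_sum_to_one(example))
```

The tolerance should be chosen according to the precision with which the probabilities are stored.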

In [None]:
# SANITY CHECK FUNCTIONS

def check_type(instance,entry,dtype):
    return type(instance[entry]) == dtype

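def check_probas(instance):
    # An empty list of probabilities is considered invalid
    if len(instance["probas"]) == 0:
        return False
    # Probabilities may be stored as strings, hence the float conversion
    proba_sum = sum(float(potential[1]) for potential in instance["probas"])
    # Compare with a tolerance (assumed value) to absorb rounding errors
    return bool(np.isclose(proba_sum, 1.0, atol=1e-3))

def check_numOcc(instance):
    # The number of occurrences should be at least the number of urls
    return instance["numOccurrences"] >= len(instance["urls"])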

def check_date(instance,year):
    quotation_year = int(instance["date"][:4])
    return (quotation_year == year)

def check_author_qids(instance):
    if len(instance["qids"]) > 0 and instance["speaker"] is None:
        return False
    else: 
        return True
        

In [None]:
# Define the types for each entry
TYPES = {"quoteID":str,
         "quotation":str,
         "speaker":str,
         "qids":list,
         "date":str,
         "numOccurrences":int,
         "probas":list,
         "urls":list,
         "phase":str}

error_file = "Data/error_file.json.bz2"


if not exists(error_file):
    with bz2.open(error_file, 'wb') as e_file:
        # Loop over the different files that we will read
        for quotebank_data in PATHS_TO_FILE:
            print("Reading ",quotebank_data," file...")
            # Year covered by the file, read from its name (quotes-yyyy.json.bz2)
            year = int(quotebank_data.split('-')[-1][:4])
            # Open the file we want to read
            with bz2.open(quotebank_data, 'rb') as s_file:
                # Loop over the samples
                for instance in s_file:
                    potential_error = ""
                    # Loading a sample
                    instance = json.loads(instance)
                    #### CHECK THE TYPES ####
                    for key, value in TYPES.items():
                        if not check_type(instance,key,value):
                            potential_error += "| Type problem: " + key + " |"
                    #### CHECK THE PROBAS ####
                    if not check_probas(instance):
                        potential_error += "| Probas problem |"
                    #### CHECK THE DATE ####
                    if not check_date(instance, year):
                        potential_error += "| Date problem |"
                    #### CHECK THE NUMOCCURENCES ####
                    if not check_numOcc(instance):
                        potential_error += "| NumOccurences problem |"
                    #### CHECK THE AUTHOR-QIDS ####
                    if not check_author_qids(instance):
                        potential_error += "| Author-qids problem |"
                    # WRITE INTO THE FILE FOR POTENTIAL ERRORS #
                    if len(potential_error) > 0:
                        e_file.write((json.dumps(instance)+'\n').encode('utf-8'))                     

<a id='extraction'></a>

## Data extraction

As mentioned previously, we plan to analyze how Brexit is perceived and how it influenced other things. To do so, we first need to extract the quotations that talk about Brexit, following this pipeline:

1. Define a vocabulary neighborhood: a list of words or expressions that are commonly used to refer to Brexit.
2. Select all the quotations in which at least one word/expression from the vocabulary neighborhood appears. In the code below, the neighborhood is reduced to the single word "brexit", matched case-insensitively.
3. Store the resulting dataset in the `Brexit_quotes.json.bz2` file.
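
If the neighborhood is later extended beyond a single word, the selection step could be sketched as follows. The vocabulary shown here is a purely illustrative assumption (it is not the list used in this project), and `mentions_brexit` is a hypothetical helper name.

```python
import re

# Hypothetical vocabulary neighborhood: these terms are illustrative
# assumptions, not the list used in the project.
BREXIT_VOCABULARY = ["brexit", "article 50", "leave the eu"]

# One case-insensitive pattern matching any expression of the vocabulary
BREXIT_PATTERN = re.compile(
    "|".join(re.escape(term) for term in BREXIT_VOCABULARY), re.IGNORECASE
)

def mentions_brexit(quotation):
    """Return True if the quotation contains at least one vocabulary term."""
    return BREXIT_PATTERN.search(quotation) is not None

print(mentions_brexit("We will trigger Article 50 next year."))
print(mentions_brexit("The weather is lovely today."))
```

Compiling a single pattern avoids looping over the vocabulary for every quotation, which matters when several million quotations are scanned.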


In [None]:
if not exists('Brexit_datas/Brexit_quotes.json.bz2'):
    # Input file
    PATHS_TO_FILE = ['Data/quotes-20%d.json.bz2' % i for i in range(15,21)]
    # Output file
    PATH_TO_OUT = 'Brexit_datas/Brexit_quotes.json.bz2'

    # Open the file where we will write
    with bz2.open(PATH_TO_OUT, 'wb') as d_file:
        # Loop over the different files that we will read
        for quotebank_data in PATHS_TO_FILE:
            print("Reading ",quotebank_data," file...")
            # Open the file we want to read
            with bz2.open(quotebank_data, 'rb') as s_file:
                # Loop over the samples
                for instance in s_file:
                    # Loading a sample
                    instance = json.loads(instance)
                    # Extracting the quotation
                    quotation = instance['quotation']
                    # Check if the quotation contains at least one word related to Brexit
                    if "brexit" in quotation.lower():
                        # Writing in the new file
                        d_file.write((json.dumps(instance)+'\n').encode('utf-8'))

quotebank_brexit = pd.read_json('Brexit_datas/Brexit_quotes.json.bz2',compression="bz2",lines=True)
quotebank_brexit.head(2)


<a id='augmentation'></a>

## Data augmentation

To generate the results for the final story, we will need more information than the initial features provide. The later analysis requires additional features such as the sentiment carried by the quotation and information about the author. To obtain them, the following pipeline is applied to each quotation:

1. **[Adding features related to the author](#Features_Author)** :  Using the provided file `speaker_attributes.parquet` that was extracted from the Wikidata knowledge base, the following attributes are of interest for each speaker:
    - `occupation`: describes the author's occupation.
    - `party`: identifies the political affiliation of the speaker.
    - `academic_degree`: gives information about the education of the author as well as their alma mater.
    - `nationality` identifies the citizenship(s) of the author.
    - `date_of_birth`: identifies the date of birth of the speaker.
    - `gender`: identifies the gender of the speaker.
    - `ethnic_group`: identifies the ethnic group of the speaker.
    - `religion`: identifies the religion of the speaker. 

    The provided `speaker_attributes.parquet` file stores attributes as QIDs, which are not human-readable. To map the QIDs to meaningful labels, we used the provided file `wikidata_labels_descriptions_quotebank.csv.bz2`.
    
    The aforementioned attributes may not be available for all authors. When it is the case, a NaN value is assigned.

2. **[Adding features issued from a sentiment analysis](#Sentiment_Quote)** : The last feature of interest is the sentiment carried by the quotation. For the sake of simplicity, each quotation is classified into one of three categories: *Negative*, *Neutral* and *Positive*. 
Sentiment analysis can be performed with pretrained deep neural networks, but we decided to use **VADER** for its good performance. VADER is not a neural network: NLTK's VADER sentiment analysis tool is lexicon- and rule-based, using a bag-of-words approach with some simple heuristics. More on it [here](https://github.com/cjhutto/vaderSentiment). 
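
The three-way classification reduces to thresholding VADER's compound score, which lies between -1 and +1. A minimal sketch, assuming the commonly used cut-offs of 0.05 and -0.05 (the function name `label_from_compound` is hypothetical):

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    # Scores strictly between -threshold and +threshold
    return "Neutral"

for score in (0.6, 0.0, -0.3):
    print(score, label_from_compound(score))
```

Keeping the threshold as a parameter makes it easy to widen the *Neutral* band if too many weakly polarized quotations end up labeled.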

<a id='Features_Author'></a>
#### 1.1 Loading the speaker_attributes.parquet file:

In [None]:
# Load the parquet that contains the information about speakers
df_attributes = pd.read_parquet('Data/speaker_attributes.parquet')

# We are not interested in aliases, lastrevid, US_congress_bio_ID, candidacy and type; `id` is kept only to serve as index.
keep_attributes = ['id','label', 'date_of_birth', 'nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']
# Set the index
df_attributes = df_attributes[keep_attributes].set_index('id')
# Sanity check for the qids
print("Sanity check ok ? : ",df_attributes.index.is_unique)
# Let's have a look
df_attributes.sample(2)

<a id='Features_Author'></a>
#### 1.2 Mapping the QIDs to meaningful labels:

In [None]:
# Create a dictionary to use it as a lookup table 
df_map = pd.read_csv('Data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')
# Dictionary where QIDs are keys and values are the corresponding labels
map_dict = df_map.Label.to_dict()

def mapping(QIDs):
    """
    The purpose of this function is to map all the QIDs to their labels, 
    using wikidata_labels_descriptions_quotebank.csv
    """
    
    if QIDs is None:
        return np.nan
    else:
        QIDs_mapped = []
        for QID in QIDs:
            try:
                # If a correspondence exists
                QIDs_mapped.append(map_dict[QID])
            except KeyError:
                # If no correspondence exists
                continue
        # If nothing was extracted
        if len(QIDs_mapped) == 0:
            return np.nan
        # Things extracted
        else:
            return QIDs_mapped

columns_to_map = ['nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'religion']

# For each column perform the mapping to transform qids to real value
for column in columns_to_map:
    df_attributes[column] = df_attributes[column].apply(mapping)
    
df_attributes.head(2)

<a id='Sentiment_Quote'></a>
#### 2. Adding sentiment score to each quote:

In [None]:
# Instantiate the analyzer once: rebuilding it for every quote is slow
sid = SentimentIntensityAnalyzer()

def sent_score(quote):
    """The purpose of this function is to use the sentiment analysis tool VADER to find the sentiment associated with a quote."""
    
    sentiment_dict = sid.polarity_scores(quote)
    
    # The Compound score is a metric that calculates the sum of all the lexicon ratings which have been normalized between
    # -1(most extreme negative) and +1 (most extreme positive).
    # positive sentiment : (compound score >= 0.05) 
    # neutral sentiment : (compound score > -0.05) and (compound score < 0.05) 
    # negative sentiment : (compound score <= -0.05)
    # see https://predictivehacks.com/how-to-run-sentiment-analysis-in-python-using-vader/
    # or https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
    
    # decide sentiment as positive, negative and neutral
    if sentiment_dict['compound'] >= 0.05 :
        return "Positive"
 
    elif sentiment_dict['compound'] <= - 0.05 :
        return "Negative" 
 
    else :
        return "Neutral"

if not exists("Brexit_datas/quotebank_brexit_with_sentiment.json.bz2"):
    quotebank_brexit['sentiment_score'] = quotebank_brexit.quotation.apply(sent_score) 
    quotebank_brexit.to_json("Brexit_datas/quotebank_brexit_with_sentiment.json.bz2")
    
else:
    quotebank_brexit = pd.read_json("Brexit_datas/quotebank_brexit_with_sentiment.json.bz2",compression="bz2")

quotebank_brexit.head(2)

In [None]:
sentence_transformer_type = 'all-MiniLM-L6-v2' # type of the sentence_transformer

# Encode quotation 
if not exists("Brexit_datas/vector_quotes.csv.gz"):
    encoder = SentenceTransformer(sentence_transformer_type)
    quotes_encoded = encoder.encode(quotebank_brexit['quotation'].values, convert_to_numpy=True)
    quotes_df = pd.DataFrame(quotes_encoded, index = quotebank_brexit.index)
    quotes_df["speaker"] = quotebank_brexit["speaker"].values
    quotes_df.to_csv("Brexit_datas/vector_quotes.csv.gz")
    
else:
    quotes_df = pd.read_csv("Brexit_datas/vector_quotes.csv.gz",index_col=0,compression="gzip")

quotes_df.head()

In [None]:
# determine the supported device
def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu') # don't have GPU 
    return device

# convert a df to tensor to be used in pytorch
def df_to_tensor(df):
    device = get_device()
    return torch.from_numpy(df.values).float().to(device)

# compare pairwise similarity
def filter_similar(df):
    # Keep only the embedding columns (drop the identifiers if present)
    embeddings = df_to_tensor(df.drop(columns=['quoteID', 'speaker'], errors='ignore'))
    cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)
    # Move the scores back to the CPU before converting to numpy
    score = pd.DataFrame(cosine_scores.cpu().numpy())
    index = set(score[score>0.95].stack().index.tolist())
    index = [(a,b) for (a,b) in index if a != b]
    # multiple tuples can have common element: need to merge them
    graph = nx.Graph(index)
    index = list(nx.connected_components(graph))
    # map the positional indices to the quoteID of the quotes
    index = [tuple(df.quoteID.iloc[ind] for ind in path) for path in index]
    return index

#.get_group('AC Grayling')
similar_count = quotes_df.assign(quoteID=quotebank_brexit.quoteID.values).groupby('speaker').apply(filter_similar)

In [None]:
# an example of two similar quotations
print("Yvette Cooper:", similar_count["Yvette Cooper"][0])
quotebank_brexit[(quotebank_brexit.quoteID=='2019-01-27-041304') | (quotebank_brexit.quoteID=='2019-01-27-029003')]

In [None]:

# get indices to drop
def drop_duplicate_quotes(ids):
    return [[quoteID for quoteID in path[1:]] for path in ids]
        
to_be_removed = similar_count.apply(drop_duplicate_quotes).values.sum()
to_be_removed = list(itertools.chain.from_iterable(to_be_removed))

quotebank_brexit = quotebank_brexit[~quotebank_brexit.quoteID.isin(to_be_removed)]

print("Number of quotations to be removed: ",len(to_be_removed))

<a id='cleaning'></a>

## Data merging and cleaning

Depending on the task we want to perform, we need the dataset in various forms. We therefore generate three datasets: 
- `quotebank_brexit`: the original dataset, cleaned
- `aug_quotebank_brexit`: the dataset filtered and augmented with the additional data
- `oneh_quotebank_brexit`: the dataset where categorical values are encoded as one-hot vectors

The next subsection explains why we discard the rows whose speaker is `None` and the rows whose speaker has several inconsistent qids.

### Remove quotations without an identified speaker [only augmented quotebank]

The `aug_quotebank_brexit` dataset carries a lot of information about the speaker, such as the `nationality` or the `occupation`. However, the neural network sometimes failed to identify a speaker and filled the `speaker` entry with a `None` value. These missing values are difficult to handle, as it would require guessing who said the quotation. One could think of training a classifier on the data where the speaker is mentioned, but this is a tedious task that we are not able to manage. So, unfortunately, we chose the last resort and removed these rows from `aug_quotebank_brexit`.

Another issue is that several qids are sometimes listed for a single speaker. We interpreted these multiple values as multiple Wikidata entries that may point to the same person, for instance in different languages. They could also point to different persons, since homonyms may exist. So when several qids are listed, we check that the attributes are identical for all of them; if they are not, we cannot determine which qid is the right one and, unfortunately, we discard the row.
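
The consistency rule for multiple qids can be sketched in plain Python as follows. This is a simplified illustration on a made-up attribute table: the identifiers `Q1` to `Q3`, their values and the helper name `resolve_qids` are all hypothetical.

```python
# Toy attribute table keyed by QID; all identifiers and values are made up.
ATTRIBUTES = {
    "Q1": {"nationality": ("United Kingdom",), "gender": ("male",)},
    "Q2": {"nationality": ("United Kingdom",), "gender": ("male",)},
    "Q3": {"nationality": ("Ireland",), "gender": ("male",)},
}

def resolve_qids(qids, attributes=ATTRIBUTES):
    """Return one representative QID, or None when the QIDs are ambiguous.

    QIDs missing from the attribute table are ignored for the comparison.
    """
    if not qids:
        return None
    known = [qid for qid in qids if qid in attributes]
    # All known QIDs must carry exactly the same attributes
    if any(attributes[qid] != attributes[known[0]] for qid in known[1:]):
        return None
    return qids[0]

print(resolve_qids(["Q1", "Q2"]))  # identical attributes: the row is kept
print(resolve_qids(["Q1", "Q3"]))  # conflicting nationality: the row is dropped
```

Returning the first qid when all candidates agree is an arbitrary but harmless choice, since the attributes merged afterwards are identical by construction.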

In [None]:
def check_consistent_qids(QIDS_original):
    QIDS = QIDS_original.copy()
    if len(QIDS) == 0:
        return pd.NA
    elif len(QIDS) == 1:
        return QIDS_original[0]
    else:
        while len(QIDS) > 1:
            first_idx = QIDS.pop(-1)
            try:
                first = df_attributes.loc[first_idx].fillna(0)
                second_idx = QIDS.pop(-1)
                try:
                    second = df_attributes.loc[second_idx].fillna(0)
                except KeyError:
                    QIDS.append(first_idx)
                    continue
            except KeyError:
                continue
            try: 
                if (first != second).sum() > 0:
                    return pd.NA
            except ValueError:
                return pd.NA
        return QIDS_original[0]

# Remove quotations whose speaker is "None" (copy to avoid modifying a view)
aug_quotebank_brexit = quotebank_brexit[quotebank_brexit.speaker != "None"].copy()

# Remove speakers with multiple different qids
aug_quotebank_brexit.qids = aug_quotebank_brexit.qids.apply(check_consistent_qids)
aug_quotebank_brexit = aug_quotebank_brexit[~aug_quotebank_brexit.qids.isna()]

# Merge the augmented quotebank brexit with df_attributes on qids
aug_quotebank_brexit = pd.merge(aug_quotebank_brexit, df_attributes, 'inner', left_on="qids", right_index=True)

# Let's have a look
print("New shape",aug_quotebank_brexit.shape)
aug_quotebank_brexit.head(2)

### Cleaning quotation duplicates

We identified duplicated quotations, or at least quotations that are contained in one another, as shown in the example below.

In [None]:
print(quotebank_brexit.loc[quotebank_brexit.quoteID == "2018-01-26-042810","quotation"].values)
print(quotebank_brexit.loc[quotebank_brexit.quoteID == "2018-01-26-042811","quotation"].values)
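
Near-duplicates like these can be grouped by thresholding the cosine similarity between quotation vectors and merging every chain of similar quotations into one group. The sketch below illustrates that idea in plain Python on toy 2-D vectors. The function names and the toy vectors are assumptions, and the 0.95 threshold simply mirrors the value used earlier in this notebook.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def group_near_duplicates(vectors, threshold=0.95):
    """Group indices whose vectors are linked by a similarity above `threshold`."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links up to the root of the group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every sufficiently similar pair
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) > threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups that actually contain duplicates
    return [members for members in groups.values() if len(members) > 1]

toy_vectors = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
print(group_near_duplicates(toy_vectors))
```

Within each group, one quotation can be kept and the others dropped. The pairwise loop is quadratic, so it is only practical per speaker or on small batches.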

## What about the added categorical features

Let's have a look at the number of distinct values for each categorical feature we added. 

In [None]:
unique_values = {}

for col in columns_to_map:
    col_serie = aug_quotebank_brexit[col].copy()
    unique_values[col] = pd.unique(col_serie.apply(pd.Series).stack())
    print(col," : number of different categories = ",len(unique_values[col]))

## Identify Big sectors

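We noticed that there are more than 800 different occupations, so we would like to group them into *supercategories*. To do so we proceeded as follows: 

1. Split every occupation into keywords and count how often each keyword occurs.
2. Manually assign supercategories to the frequent keywords (in the exported `occupation_agg.csv` file).
3. Give each occupation the supercategories of the keywords it contains.
4. Manually classify the occupations that remain unclassified.

Manage frequent keywords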
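
The keyword approach can be illustrated on a toy example: count the words appearing in the occupations, then give each occupation the categories of the classified keywords it contains. The occupations, categories and the helper name `categories_for` below are made up for illustration.

```python
from collections import Counter

# Made-up occupations and keyword-to-category table, for illustration only
occupations = ["political scientist", "computer scientist", "film actor"]
keyword_categories = {"scientist": ["Science"], "actor": ["Arts"]}

# Count how often each lower-cased word appears across occupations
keyword_counts = Counter(
    word for occupation in occupations for word in occupation.lower().split()
)

def categories_for(occupation):
    """Collect the categories of every classified keyword found in the occupation."""
    found = set()
    for keyword, categories in keyword_categories.items():
        if keyword in occupation.lower():
            found.update(categories)
    return sorted(found)

print(keyword_counts.most_common(1))
print({occupation: categories_for(occupation) for occupation in occupations})
```

Occupations for which the function returns an empty list are the ones that still need to be classified by hand.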

In [None]:
# Data frame of the occupations
occupation_df = pd.DataFrame(unique_values["occupation"],columns=["occupation"])

key_words = []

# Loop over the occupations
for occupation in unique_values["occupation"]:
    # Split the occupation string and concatenate
    key_words += occupation.split()

# Convert to a Dataframe
key_words_df = pd.DataFrame(key_words,columns=["occupation"])
# Put all strings to lower
key_words_df.occupation = key_words_df.occupation.str.lower()
# For each key word count the number of occurences and sort by descending
key_words_df = key_words_df.groupby("occupation").size().reset_index(name="Count").sort_values(by="Count",ascending=False)

# If the classification has not been already done
if not exists("Brexit_datas/occupation_class/occupation_agg.csv"):
    key_words_df.to_csv("Brexit_datas/occupation_class/occupation_agg.csv")

print("Look at the most frequent keywords")
print(key_words_df.head(3))

answer = input("Is the classification of keywords done ?")

if (answer.lower() == "yes"):
    # Get the classified keywords
    key_words_classified = pd.read_csv("Brexit_datas/occupation_class/occupation_agg.csv",index_col=0)
    # Get rid of keywords that have not been classified
    key_words_classified = key_words_classified.loc[~key_words_classified.Category.isna()]
    # Manage the case when several categories have been entered
    key_words_classified.Category = key_words_classified.Category.apply(lambda x: x.split("-"))
    # let's have a look at the table
    print("Look at the output table")
    print(key_words_classified.head(3))
else:
    print("Then please classify the keywords")


In [None]:
# Function to check if keywords are contained in an occupation
def check_string_in(occupation):
    # Initialize the final list of the supercategories
    final_list = []
    # Loop over the key_words_classified (Series.iteritems was removed in pandas 2.0)
    for items in key_words_classified.occupation.items():
        # If the keyword is contained in the occupation
        if items[1] in occupation.lower():
            # Concat the supercategories with the existing list
            final_list = final_list + key_words_classified.loc[items[0],"Category"]
    # If no categories return NaN
    if len(final_list) == 0:
        return pd.NA
    # Else return the list without duplicates
    else:
        return list(set(final_list))
        
# Apply the function
occupation_df["Category"] = occupation_df.occupation.apply(check_string_in)

if not exists("Brexit_datas/occupation_class/unclassified_occupation.csv"):
    # Export the occupations that have not been classified
    occupation_df[occupation_df.Category.isna()].to_csv("Brexit_datas/occupation_class/unclassified_occupation.csv")

print("Look at the remaining occupations")
print(occupation_df[occupation_df.Category.isna()].head(3))

answer = input("Is the classification of remaining occupations done ?")

if (answer.lower() == "yes"):
    # Get the remaining occupations classified
    remain_occupations_classified = pd.read_csv("Brexit_datas/occupation_class/unclassified_occupation.csv",index_col=0)
    # Merge with the current data frame
    occupation_final_df = pd.merge(occupation_df,remain_occupations_classified,how="left",on="occupation",suffixes=("","_2"))
    # Split into a list
    occupation_final_df.Category_2 = occupation_final_df.Category_2.apply(lambda x: x.split("-") if type(x) == str else pd.NA)
    # Merge into a single column
    occupation_final_df.loc[~occupation_final_df.Category_2.isna(),"Category"] = occupation_final_df.loc[~occupation_final_df.Category_2.isna(),"Category_2"]
    # Drop the artificial column
    occupation_final_df.drop(columns=["Category_2"],inplace=True)
    # Drop na values that corresponds to unclassifiable jobs such as nazi hunter
    occupation_final_df.dropna(axis=0,inplace=True)
    # Let's have a look
    print("Final data set for the classification of occupations:")
    print(occupation_final_df.head(5))
    # Export to a json file
    if not exists("Brexit_datas/occupation_class/classified_occupation.json"):
        occupation_final_df.to_json("Brexit_datas/occupation_class/classified_occupation.json")
else:
    print("Then please classify the remaining occupations")

In [None]:
occupation_final_df = pd.read_json("Brexit_datas/occupation_class/classified_occupation.json").set_index("occupation")

# Let's have a look at the supercategories
print(list(pd.unique(occupation_final_df.Category.apply(pd.Series).stack())))

# Let's replace this into the aug_quotebank dataset
def replace_occupation(occupation):
    # Missing values and empty lists cannot be mapped
    if not isinstance(occupation, list) or len(occupation) == 0:
        return pd.NA
    new_occupation = []
    for job in occupation:
        try:
            new_occupation += occupation_final_df.loc[job,"Category"]
        except KeyError:
            continue
    # Return the supercategories without duplicates, or NA if none was found
    if len(new_occupation) > 0:
        return list(set(new_occupation))
    return pd.NA

aug_quotebank_brexit.occupation = aug_quotebank_brexit.occupation.apply(replace_occupation)
aug_quotebank_brexit.head(2)
        

## One hot encoding

In [None]:
# One hot vectorization of columns containing categorical values
dummy_col = "AAADummy column for the sake"
# Make a copy
oneh_quotebank_brexit = aug_quotebank_brexit.copy()

# Make sure each non-missing value is a list of strings
def ensure_list(value):
  if isinstance(value, list):
    for i in range(len(value)):
      value[i] = str(value[i])
  elif not pd.isna(value):
    value = [value]
  return value

# Loop over categorical columns
for col in columns_to_map:
  # Get the serie
  col_serie = aug_quotebank_brexit[col].copy().apply(ensure_list)
  # Change nan values to a list containing a dummy column
  col_serie[col_serie.isna()] = col_serie[col_serie.isna()].apply(lambda x: [dummy_col])
  # One hot vectorize
  categorical_df = pd.get_dummies(col_serie.apply(pd.Series).stack()).groupby(level=0).sum()
  # Drop the dummy column (it only exists if the column had missing values)
  categorical_df.drop(columns=[dummy_col],inplace=True,errors="ignore")
  # Refresh unique values
  unique_values[col] = categorical_df.columns
  # Join with quotebank brexit
  oneh_quotebank_brexit = oneh_quotebank_brexit.join(categorical_df,how="left",rsuffix=col[:3])
  print("One hot vectorizing : ",col,
        "| NaN values : ",categorical_df.isna().apply(lambda x: x*1).sum().sum(),
        "| Number of different categories : ",len(categorical_df.columns),
        "| Shape reduced ? ",categorical_df.shape,oneh_quotebank_brexit.shape)
  # Drop the categorical column
  oneh_quotebank_brexit.drop(columns=col,inplace=True)
  # Check for NaN values
  print("Any NA in the final dataframe: ",oneh_quotebank_brexit.isna().apply(lambda x: x*1).sum().sum())

print("Shape of the final data frame",oneh_quotebank_brexit.shape)
print("Any NA in the final dataframe: ",oneh_quotebank_brexit.isna().apply(lambda x: x*1).sum().sum())

In [None]:
print(oneh_quotebank_brexit.quotation.isna().sum())
print(oneh_quotebank_brexit.numOccurrences.isna().sum())
print(oneh_quotebank_brexit.speaker.isna().sum())

<a id='clustering_task'></a>

# Quotations and speakers clustering

The last preprocessing step consists of clustering the quotations as well as the speakers. This clustering will then be used to create a Recommendation Tool in the context of Brexit. The idea is to first cluster the quotations and then the speakers, such that two quotations/speakers in the same cluster convey similar ideas. This can be done with the following pipeline:
1. The first step is to convert sentences into vectors to be able to further perform the clustering. This task can be achieved using the [SentenceTransformer](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) deep neural network. The vector obtained from this operation can then be concatenated with the other existing features (converted to one-hot vectors if necessary).
2. The second step consists of reducing the dimension of the data before applying the clustering algorithm. This task can be achieved using the [t-distributed stochastic neighbor embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) algorithm or the [Locally Linear Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html#sklearn.manifold.LocallyLinearEmbedding) algorithm. These two techniques (especially the first one) are efficient non-linear dimensionality reduction methods.
3. The third step is specific to speaker clustering. Indeed, the vectorization and the dimensionality reduction are only applied to quotes. Thus we need to perform an **aggregation** to attribute a vector to each speaker. For each speaker, this aggregation can simply be done by taking the mean of the vectors associated with each of their quotations. 
4. The last step consists of performing the clustering operation. This task can be achieved using the [Gaussian Mixture Model](https://scikit-learn.org/stable/modules/mixture.html#mixture) algorithm or the [Spectral Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) method.
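
The aggregation of step 3 can be sketched in plain Python as a component-wise mean of each speaker's quotation vectors. The function name `mean_vector_per_speaker` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def mean_vector_per_speaker(speakers, vectors):
    """Average the quotation vectors of each speaker, component by component."""
    grouped = defaultdict(list)
    for speaker, vector in zip(speakers, vectors):
        grouped[speaker].append(vector)
    return {
        speaker: [sum(column) / len(rows) for column in zip(*rows)]
        for speaker, rows in grouped.items()
    }

# Toy data: speaker "A" has two quotations, speaker "B" has one
speakers = ["A", "B", "A"]
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
print(mean_vector_per_speaker(speakers, vectors))
```

With a pandas dataframe indexed by speaker, the same result is obtained with a group-by followed by a mean.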

In [None]:
"""
    User-defined parameters for the task
"""
sentiment_amplification = 1.5 # coefficient of amplification of sentiment, amplification is applied after normalization
normalize_tensor = True # choose whether to normalize the sentence tensor before clustering
nb_clusters = 8   # Number of clusters to be identified

## Step 1: Prepare dataframe for the one-hot vectorization task

In [None]:
def sentiment_to_int(value):
  return int(value == "Positive") - int(value == "Negative")
  
columns_to_drop = ["date","quoteID","qids","phase","probas","urls","date_of_birth","label"]

# restrict dataframe to the one needed
cluster_df = oneh_quotebank_brexit.drop(columns=columns_to_drop)

print("Is there any na values ?",cluster_df.isna().sum().sum())

<a id='One-hot'></a>

## Step 2: Compose the vectorized dataframe

1. Convert the sentiment score into a signed integer: "Positive" = 1, "Negative" = -1, "Neutral" = 0
2. For each column describing the speaker, generate a dummy dataframe (see `pandas.get_dummies`)
3. Concatenate all the obtained dataframes along the columns
4. Average all rows belonging to the same speaker (which is set as index)
5. Standardize the quotation vectors and `numOccurrences`, then amplify `sentiment_score`
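
Because a speaker attribute can hold several values at once (for example two nationalities), the dummy columns are in practice multi-hot rather than strictly one-hot. A minimal sketch of that encoding in plain Python, with a hypothetical helper `multi_hot` and made-up values:

```python
def multi_hot(rows):
    """Encode lists of categories as 0/1 vectors over the sorted vocabulary.

    A missing value (None) becomes an all-zero vector.
    """
    vocabulary = sorted({category for row in rows if row for category in row})
    encoded = [
        [int(row is not None and category in row) for category in vocabulary]
        for row in rows
    ]
    return vocabulary, encoded

# Made-up attribute values: two nationalities, a missing value, one nationality
nationalities = [["France", "United Kingdom"], None, ["United Kingdom"]]
columns, matrix = multi_hot(nationalities)
print(columns)
print(matrix)
```

Encoding missing values as all-zero rows keeps every speaker in the dataset without introducing an artificial category.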

In [None]:
"""
    Compose a vectorized dataframe
"""

# Read the saved index back as index, otherwise it becomes an extra feature column
quotes_df = pd.read_csv("Brexit_datas/vector_quotes.csv.gz",index_col=0,compression="gzip").drop(columns="speaker")

# normalize quotations vector
quotes_df = (quotes_df - quotes_df.mean(axis=0))/quotes_df.std(axis=0)

# Merge the two data frames
cluster_full_df = pd.concat([cluster_df.drop(columns="quotation"), quotes_df.loc[cluster_df.index]],axis = 1).set_index("speaker")

# convert sentiment_score to int format
cluster_full_df['sentiment_score'] = cluster_full_df['sentiment_score'].apply(sentiment_to_int)

# standardize the numOccurrences column (zero mean, unit variance)
cluster_full_df.numOccurrences = (cluster_full_df.numOccurrences - cluster_full_df.numOccurrences.mean()) / cluster_full_df.numOccurrences.std()

# average over the same speaker
cluster_full_df = cluster_full_df.groupby(level=0).mean()

#amplify sentiment
cluster_full_df['sentiment_score'] *= sentiment_amplification

cluster_full_df.head()

<a id='TSNE'></a>

## Step 3: Convert to *pytorch* tensor and apply TSNE dimensionality reduction

In [None]:
# Convert into pytorch tensor
full_data_tensor = df_to_tensor(cluster_full_df)

# Apply t-distributed stochastic neighbor embedding (TSNE)
tsne_dim = 2 # TSNE reduction final dimension, default is 2
data_tensor_emb = TSNE(n_components=tsne_dim, perplexity=30, n_iter=1000).fit_transform(full_data_tensor) # shape = (N, tsne_dim)

# Transpose so that each row holds one embedding dimension
data_np_reduc = data_tensor_emb.transpose()

# Visualize without clustering
plt.scatter(data_np_reduc[0], data_np_reduc[1])
plt.title("TSNE bidimensional reduction of the speaker vectorization")

plt.show()

In [None]:
plt.scatter(data_np_reduc[0], data_np_reduc[1])
plt.title("TSNE bidimensional reduction of the speaker vectorization")

plt.show()

<a id='Clustering'></a>

## Step 4: Apply the clustering algorithm

In [None]:
"""
    Clustering
""" 

# Apply Clustering
clustering = SpectralClustering(nb_clusters).fit(data_tensor_emb)

"""
    Visualization
""" 
fig, axis = plt.subplots(1, 2, figsize=(14, 7))

results = data_tensor_emb.transpose()

# Visualize without clustering
axis[0].scatter(results[0], results[1])


for label in range(nb_clusters):
    # select data by clustering label
    points = data_tensor_emb[clustering.labels_ == label]
    points = points.transpose()
    # plot data
    axis[1].scatter(points[0], points[1])
    
plt.show()

<a id='Results'></a>

# Generate the results for the final story

<a id='Statistics'></a>

## General Statistics

In [None]:
fig = px.histogram(quotebank_brexit,x="date")
fig.update_layout(title="Number of quotations about Brexit across time")
fig.show()

In [None]:
country_df = oneh_quotebank_brexit.loc[:,unique_values["nationality"]].sum(axis=0).T.to_frame().reset_index()
country_df = country_df.sort_values(by=0,ascending=False).iloc[:20]
fig = px.bar(country_df,y=0,x="index",log_y=True)
fig.update_layout(title="Number of quotations about Brexit by nationality (top 20)")
fig.show()

In [None]:
ethnic_df = oneh_quotebank_brexit.loc[:,unique_values["ethnic_group"]].sum(axis=0).T.to_frame().reset_index()
ethnic_df = ethnic_df.sort_values(by=0,ascending=False).iloc[:20]
fig = px.bar(ethnic_df,y=0,x="index",log_y=True)
fig.update_layout(title="Number of quotations about Brexit by ethnic group (top 20)")
fig.show()

In [None]:
occupation_df = oneh_quotebank_brexit.loc[:,unique_values["occupation"]].sum(axis=0).T.to_frame().reset_index()
occupation_df = occupation_df.sort_values(by=0,ascending=False)
fig = px.bar(occupation_df,y=0,x="index",log_y=True)
fig.update_layout(title="Number of quotations about Brexit by occupation")
fig.show()

In [None]:
aug_quotebank_brexit.groupby("speaker").size().reset_index(name="count").loc[:,["speaker","count"]].sort_values(by="count",ascending=False).head(5)

<a id='Country'></a>

## Analyze the way Brexit is perceived in European countries

Recall that the goal is to analyze how Brexit is perceived in each European country, based on the sentiment carried by the quotations. We would also like to add a time dimension to this analysis, i.e. to follow how the overall feeling towards Brexit evolves. A view of the expected result is given below:
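
Independently of the figure, the aggregation behind such a view could look like the following sketch: the mean sentiment per country and per year. The toy dataframe and its values are assumptions for illustration; the real data stores nationalities as one-hot columns, so the actual computation may differ.

```python
import pandas as pd

# Toy data (hypothetical): one row per quotation, sentiment already encoded as -1 / 0 / 1.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-24", "2016-07-01", "2019-01-15", "2019-03-29"]),
    "nationality": ["France", "France", "France", "Germany"],
    "sentiment_score": [-1, 0, 1, -1],
})

# Extract the year, then average the sentiment per (country, year) pair.
toy["year"] = toy["date"].dt.year
sentiment_by_country_year = (
    toy.groupby(["nationality", "year"])["sentiment_score"]
    .mean()
    .unstack("year")  # countries as rows, years as columns
)

print(sentiment_by_country_year)
```

Countries with no quotation in a given year show up as `NaN`, which a map can then render as "no data".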

<a id='Sector'></a>

## Analyze the way Brexit is perceived in different sectors

In [None]:
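def select_by_year(low_year,up_year):
    """Per occupation, percentage of quotations for each sentiment between low_year and up_year (inclusive), plus the total count."""
    year_col = pd.DatetimeIndex(oneh_quotebank_brexit.date).year
    cols = list(unique_values["occupation"]) + ["sentiment_score"]
    # keep only the quotations within the year range
    sector_df = oneh_quotebank_brexit.loc[(year_col >= low_year) & (year_col <= up_year),cols]
    # sum the one-hot occupation columns per sentiment (rows: sentiment, columns: occupation)
    sector_df = sector_df.groupby("sentiment_score").sum()
    # total per occupation
    count = sector_df.sum(axis=0)
    # convert to percentages, one row per occupation
    sector_df = (sector_df * 100/ sector_df.sum(axis=0)).T.reset_index()
    sector_df["count"] = count.values
    return sector_df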

sector_df = select_by_year(2015,2020)

fig = px.bar(sector_df, x="index",text="count",y=sector_df.columns[-4:-1])
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()


In [None]:
app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Graph(id="scatter-plot"),
    html.P("Year:"),
    dcc.RangeSlider(
        id='range-slider',
        min=2015, max=2020, step=1,
        marks={2015: '2015', 2016:'2016',2017:'2017',2018:'2018',2019:'2019',2020: '2020'},
        value=[2015,2020]
    ),
])

@app.callback(
    Output("scatter-plot", "figure"), 
    [Input("range-slider", "value")])
def update_bar_chart(slider_range):
    low, high = slider_range
    sector_df = select_by_year(low,high)
    fig = px.bar(sector_df, x="index",text="count",y=sector_df.columns[-4:-1])
    return fig

app.run_server()

<a id='2Dplot'></a>

## Visualize speakers' orientation through a 2D plot

<a id='Recommandation'></a>

## Recommendation tool

<a id='Stocks'></a>

## Correlation with stocks

In [None]:
# CORRELATION WITH STOCKS ACTIONS