# Project Milestone 2

This notebook describes the whole pipeline that produces the results we would like to include in the final story (on the final website). We go through each step and describe the required operations in as much detail as possible.

## **[Preprocessing steps](#Preprocessing)**

As usual, the first step consists of several substeps that aim at cleaning and transforming the data. Click on a task link to access the corresponding pipeline.
- *[Data exploration and Sanity check](#Sanity_check)* : Explore the dataset, check its consistency and get familiar with the different features and information it provides.
- *[Data augmentation](#augmentation)* : Augment the data with more features about the quotations, such as the field of the quote, the nationality of the speaker and so on. These new features will then be used to perform the tasks related to each idea.
- *[Data extraction](#extraction)* : Extract the data of interest that will then be used to perform the tasks related to each idea.
- *[Quotations and speakers clustering](#clustering)* : Cluster the quotations and the speakers according to a quotation vector and the added features (data augmentation).

## **[Generate the results for the final story](#Results)**

Among the ideas proposed by each member of the group, we decided to select 3 to tell the data story and inform readers about interesting elements that can be found in the dataset. An overview of the 3 ideas is given below. Click on an idea link to access the corresponding pipeline and expected outcome.

- *[Analysis of the way Brexit is perceived](#Brexit)* : The first idea consists in visualizing the way Brexit is perceived across different fields as well as across the different countries in Europe. **JEAN COULD YOU COMPLETE WITH YOUR INITIAL DESCRIPTION ? (PUT SOMETHING APPEALING)**

- *[Evolution of overall feelings towards China](#China)* : The second idea is very close to the previous one: we would like to perform a similar analysis for China. **SINCE THE TWO IDEAS ARE REALLY SIMILAR WHY NOT MERGE THEM AND ADD ANOTHER IDEA TO THE LIST** + **JEAN COULD YOU COMPLETE WITH YOUR INITIAL DESCRIPTION ? (PUT SOMETHING APPEALING)**

- *[Speakers/quotes Recommendation tool](#Recommandation)* : Imagine you are a fan of a set of politicians or scientists, and you would like to know whether other people share the same style or the same ideas. Using the clusters presented above, we could recommend speakers with similar tags/attributes. The user enters their preferred speakers and receives speakers "close" to the ones they entered. Furthermore, the clustering allows us to measure the closeness between two speakers or two quotations.

<a id='Preprocessing'></a>

# Preprocessing steps

<a id='Sanity_check'></a>

## Data exploration and Sanity check

Define which exploration and sanity checks we need to perform:

- Sanity check 1
- Sanity check 2
- ...

In [1]:
# PERFORM THE DIFFERENT SANITY CHECKS

<a id='augmentation'></a>

## Data augmentation

When we generate the results for the final story, we will need more information than the initial features provide. The analysis requires access to other features such as the topic of the quotation, the sentiment it carries, some information about the author and so on. The main idea is to add new features either to the whole dataset or only to the data of interest. To do so, we will apply the following pipeline to each quotation:

1. **Add features related to the author** : The first type of features we can add are those related to the author. The author's Wikipedia page gives access to a lot of information; looking carefully at the fields of the corresponding Wikidata item lets us select the useful features listed below:
    - `occupation` gives the author's domain.
    - `member of political party` gives the party to which the author belongs.
    - `educated at` gives the place where the author studied.
    - `country of citizenship` gives the nationality of the author.
    
    These fields may not exist for all authors (for instance, not all authors are politicians), so we assign a NaN value whenever a field is missing for an author.

2. **Add computed features** : The second type of features we can add are those derived directly from the initial ones. We selected several that will be useful for further analysis:
    - TO BE OPTIONALLY COMPLETED
3. **Add features derived from a sentiment analysis** : The last feature we would like to add is the sentiment carried by the quotation. Initially we were thinking about a binary sentiment classification: 0 if the sentiment is negative, 1 if it is positive. We could later extend this by classifying the quotations into several categories such as *anger*, *sadness*, *factual* and so on.    
Such a text classification task can be performed using pretrained deep neural networks. The XLNet network ([GitHub page](https://github.com/zihangdai/xlnet/) & [Library containing XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)) is close to the state of the art for text classification. We therefore plan to use it to determine the sentiment contained in each quotation.
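
As an illustration of step 1, here is a minimal sketch of how the author fields could be attached to a quotation, with NaN standing in for missing fields. The `author_info` lookup table, the `speaker` and `quotation` keys and the example values are all hypothetical; in practice the attributes would come from Wikidata.

```python
import math

# Wikidata properties named in step 1.
AUTHOR_FIELDS = [
    "occupation",
    "member of political party",
    "educated at",
    "country of citizenship",
]


def add_author_features(quote, author_info):
    """Return a copy of `quote` extended with the author fields.

    `author_info` maps a speaker name to a dict of attributes. This layout
    is an assumption made for the sketch, not the real data format.
    """
    attributes = author_info.get(quote["speaker"], {})
    augmented = dict(quote)
    for field in AUTHOR_FIELDS:
        # A field that is absent for this author becomes NaN.
        augmented[field] = attributes.get(field, math.nan)
    return augmented


# Hypothetical example data.
author_info = {
    "Speaker A": {
        "occupation": "politician",
        "country of citizenship": "United Kingdom",
    },
}
quote = {"speaker": "Speaker A", "quotation": "An example quotation."}
augmented = add_author_features(quote, author_info)
```

Returning a copy keeps the original quotation untouched, which makes it easier to rerun the augmentation with a different set of fields.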

In [2]:
# DO THE DATA AUGMENTATION HERE

<a id='extraction'></a>

## Data extraction

As mentioned previously, we plan to analyze the influence of Brexit on different sectors as well as the evolution of feelings towards China. To perform these tasks, we first need to extract the quotations that talk about Brexit and the ones that talk about China. To do so, we will follow this pipeline:

1. For both Brexit and China, define a vocabulary neighborhood containing the words that are closely related to the topic. This neighborhood is a list of words or expressions that are commonly used to refer to Brexit or China. For instance, for China one could add the expression *"the Middle Kingdom"*, which is often used to refer to China.
2. For both Brexit and China, select all the quotations in which at least one word/expression from the vocabulary neighborhood appears.
3. Store the two new datasets in the following files: 
    - `Brexit.json`
    - `China.json`
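
The three steps above can be sketched as follows. This is only a minimal illustration: the neighborhood lists, the `quotation` key and the example quotations are assumptions, and whole-word, case-insensitive matching is one possible matching rule among others.

```python
import json
import re


def build_matcher(neighborhood):
    """Compile one case-insensitive pattern matching any term as a whole word."""
    alternatives = "|".join(re.escape(term) for term in neighborhood)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def extract(quotes, neighborhood):
    """Keep the quotations that contain at least one term of the neighborhood."""
    matcher = build_matcher(neighborhood)
    return [q for q in quotes if matcher.search(q["quotation"])]


def save_quotes(quotes, path):
    """Store a list of quotations as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)


# Step 1: illustrative (hypothetical) vocabulary neighborhoods.
brexit_terms = ["Brexit", "Article 50", "leave the EU"]
china_terms = ["China", "Chinese", "Beijing", "the Middle Kingdom"]

# Hypothetical quotations.
quotes = [
    {"quotation": "Brexit will reshape our trade policy."},
    {"quotation": "Trade with the Middle Kingdom keeps growing."},
    {"quotation": "The weather was lovely today."},
]

# Step 2: select the quotations for each topic.
brexit_quotes = extract(quotes, brexit_terms)
china_quotes = extract(quotes, china_terms)
```

Step 3 would then amount to calling `save_quotes(brexit_quotes, "Brexit.json")` and `save_quotes(china_quotes, "China.json")`.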


In [3]:
# DO THE EXTRACTION HERE

<a id='clustering'></a>

## Quotations and speakers clustering

The last preprocessing step consists in clustering the quotations as well as the speakers. This clustering will then be used to create a [Recommendation Tool](#Recommandation). The idea is to first cluster the quotations and then the speakers, such that two quotations/speakers in the same cluster carry similar ideas. This can be done with the following pipeline:
1. The first step is to convert sentences into vectors so that the clustering can be performed. This task can be achieved using the [SentenceTransformer](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) deep neural network. The resulting vector can then be concatenated with the other existing features (converted to one-hot vectors if necessary).
2. \[OPTIONAL STEP\] The second step consists in reducing the dimension of the data before applying the clustering algorithm. This task can be achieved using the [t-distributed Stochastic Neighbor Embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) algorithm or the [Locally Linear Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html#sklearn.manifold.LocallyLinearEmbedding) algorithm. These two techniques (especially the first one) are widely used non-linear dimensionality reduction methods.
3. The third step is specific to speaker clustering. The vectorization and the dimensionality reduction are only applied to quotes, so we need to perform an **aggregation** to attribute a vector to each speaker. For each speaker, this aggregation can simply be done by taking the mean of the vectors associated with each of their quotations.
4. The last step consists in performing the clustering itself. This task can be achieved using the [Gaussian Mixture Model](https://scikit-learn.org/stable/modules/mixture.html#mixture) algorithm or the [Spectral Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) method.
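
The aggregation of step 3 can be sketched in plain Python as below. The two-dimensional vectors and the speaker names are made up for readability; real sentence embeddings have several hundred dimensions. The cosine similarity at the end is one possible way to measure the closeness between two speakers and is our assumption, not a decision made in this pipeline.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def speaker_vectors(quote_vectors):
    """Aggregate (speaker, vector) pairs into one mean vector per speaker."""
    grouped = defaultdict(list)
    for speaker, vector in quote_vectors:
        grouped[speaker].append(vector)
    return {speaker: mean_vector(vs) for speaker, vs in grouped.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical quotation vectors, one (speaker, vector) pair per quotation.
quote_vectors = [
    ("Speaker A", [1.0, 0.0]),
    ("Speaker A", [0.0, 1.0]),
    ("Speaker B", [1.0, 1.0]),
    ("Speaker C", [-1.0, 0.0]),
]
vectors = speaker_vectors(quote_vectors)
similarity_ab = cosine_similarity(vectors["Speaker A"], vectors["Speaker B"])
similarity_ac = cosine_similarity(vectors["Speaker A"], vectors["Speaker C"])
```

With these toy values, Speaker A ends up closer to Speaker B than to Speaker C, which is the kind of ordering the recommendation tool would rely on.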

However, a question arises: how much data should we use to perform such a clustering? Clustering the whole dataset seems infeasible.

In [4]:
# PERFORM CLUSTERING HERE

<a id='Results'></a>

# Generate the results for the final story

<a id='Brexit'></a>

## Analysis of the way Brexit is perceived

Recall that the goal is to analyze the way Brexit is perceived in each European country and in each sector (e.g. the economy), based on the sentiment carried by the quotations. We would also like to add a time dimension to this analysis, that is, to follow the evolution of the overall feelings towards Brexit. A view of the expected result is given below:

Sector Analysis | Country Analysis
- | -
![alt text](Images/brexit_bubbles.png "Sector analysis") | ![alt text](Images/brexit_expected_outcomes.png "Country analysis")
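
To give an idea of the aggregation behind these two views, here is a minimal sketch that averages the binary sentiment per country and per sector over time. The `country`, `sector`, `month` and `sentiment` keys and the example records are hypothetical; the actual pipeline may group and score the quotations differently.

```python
from collections import defaultdict


def mean_sentiment_by_group(quotes, key_fields):
    """Average the sentiment of the quotations sharing the same key fields."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of sentiments, count]
    for q in quotes:
        key = tuple(q[field] for field in key_fields)
        totals[key][0] += q["sentiment"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical augmented quotations (sentiment: 0 = negative, 1 = positive).
quotes = [
    {"country": "France", "sector": "economy", "month": "2016-06", "sentiment": 1},
    {"country": "France", "sector": "economy", "month": "2016-06", "sentiment": 0},
    {"country": "Germany", "sector": "politics", "month": "2016-07", "sentiment": 0},
]

by_country = mean_sentiment_by_group(quotes, ["country", "month"])
by_sector = mean_sentiment_by_group(quotes, ["sector", "month"])
```

Each resulting value is the share of positive quotations for one group and one month, which could be mapped to a color or a bubble size in the plots.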

### *Pipeline*


In [None]:
# BREXIT ANALYSIS

<a id='China'></a>

## Evolution of overall feelings towards China

Recall that the goal is to analyze the evolution of the feelings towards China based on the sentiment carried by the quotations. As with Brexit, this analysis will be conducted for each country and sector. A view of the expected result is given below:

**NEED TO MODIFY THE IMAGES BY PUTTING CHINA INSTEAD OF BREXIT**

Sector Analysis | Country Analysis
- | -
![alt text](Images/brexit_bubbles.png "Sector analysis") | ![alt text](Images/brexit_expected_outcomes.png "Country analysis")

### *Pipeline*

In [None]:
# CHINA ANALYSIS

<a id='Recommandation'></a>

## Recommendation tool

**I WILL COMPLETE THAT PART LATER BECAUSE, CURRENTLY, I AM COMPLETELY OUT OF SERVICE**

### *Pipeline*

In [None]:
# RECOMMENDATION TOOL