# 1. Load a subset of quotations from politicians

In [1]:
import pandas as pd
from src.data_wrangling.load_data import load_political_quotes
quotes = []
for batch in load_political_quotes(country=['United States of America'], political_alignment=['right-wing'],
                                   year=[2020], chunksize=20000):
    quotes.append(batch)

In [2]:
politician_quotes = pd.concat(quotes, axis=0, ignore_index=False)
politician_quotes = politician_quotes[['quotation', 'speaker', 'qid']]
politician_quotes

quoteID,quotation,speaker,qid
2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,Q367796
2020-03-19-000276,[ These ] actions will allow households who ha...,Ben Carson,Q816459
2020-01-20-000982,a host of other protections,Debbie Lesko,Q16731415
2020-03-19-002801,All immigration to the US should be halted due...,Laura Ingraham,Q266863
2020-03-24-004650,And they are working towards delivering their ...,Mike Pompeo,Q473239
...,...,...,...
2020-01-20-084503,"who is out to discover if a mythic superhero, ...",Sylvester Stallone,Q40026
2020-01-04-047023,Worshippers. They were praying and this maniac...,Donald Trump,Q22686
2020-01-28-113424,"yep, it was true, every word of it, so get ove...",Mick Mulvaney,Q1235731
2020-04-07-071722,You can come to a polling place and do it safe...,Robin Vos,Q7352841


# 2. Using Top2Vec for topic extraction
Our basic idea for extracting topics from the quotes was as follows:
1) Use a pretrained embedding to embed the quotes into a semantic space. Our idea was to use word2vec for each word and average the word vectors over each quote.
2) Probably reduce the dimensionality of the embedding. Too many dimensions could degrade the clustering result and make the clustering too computationally expensive.
3) Cluster the lower-dimensional embedding and use the clusters as topics.
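
Step 1 of this idea can be illustrated with a minimal sketch. The word vectors below are made-up toy values standing in for a pretrained word2vec model, and `embed_quote` is a hypothetical helper name, not part of this project. Steps 2 and 3 (dimensionality reduction and clustering) are not shown.

```python
# Toy word vectors standing in for a pretrained word2vec model (made-up values).
TOY_VECTORS = {
    "tax": [1.0, 0.0, 0.0],
    "cuts": [0.8, 0.2, 0.0],
    "border": [0.0, 1.0, 0.0],
    "wall": [0.0, 0.8, 0.2],
}


def embed_quote(quote, vectors):
    """Average the vectors of all known words in the quote.

    Returns None if none of the words has a vector.
    """
    known = [vectors[word] for word in quote.lower().split() if word in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# "now" has no vector, so only "tax" and "cuts" contribute to the average.
quote_vector = embed_quote("Tax cuts now", TOY_VECTORS)
print(quote_vector)
```

Words without a vector are simply skipped here. A real pipeline would also need proper tokenization and a policy for quotes in which no word is known.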

We found an existing tool called [Top2Vec](https://github.com/ddangelov/Top2Vec) which does essentially this and offers some convenience features.
The main differences are:
- It uses a doc2vec model for the embedding, which is trained on the input data. We will probably replace it with another embedding.
- It reassigns "noise" documents/quotes to the closest cluster.
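
To make the second point concrete, here is a minimal sketch of assigning a document to its closest topic by cosine similarity. It only illustrates the idea and is not Top2Vec's actual code; the function names and the two-dimensional vectors are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_to_closest_topic(doc_vector, topic_vectors):
    """Return the index of the topic vector most similar to the document vector."""
    return max(
        range(len(topic_vectors)),
        key=lambda i: cosine_similarity(doc_vector, topic_vectors[i]),
    )


# Two hypothetical topic vectors and one document that the clustering left as noise.
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
noise_doc = [0.2, 0.9]
closest = assign_to_closest_topic(noise_doc, topic_vectors)
print(closest)
```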

We decided to use this instead of coding the pipeline ourselves, since it already exists, uses an indexed data structure to speed things up, and allows saving the entire trained model. We will probably adapt this implementation to our needs.

## Example usage of Top2Vec on quote data from 2020

### Imports

In [3]:
import pandas as pd

from top2vec import Top2Vec

### Configure Top2Vec
Here we configure Top2Vec and prepare the data. Top2Vec expects the documents and their ids as lists.

In [4]:
documents_for_top2vec = politician_quotes['quotation'].tolist()
ids_for_top2vec = politician_quotes.index.tolist()

Here we configure the dimensionality reduction (UMAP) and the clustering (HDBSCAN) steps.

In [5]:
umap_args = {'n_neighbors': 15,
             'n_components': 15,
             'metric': 'cosine'}
hdbscan_args = {'min_cluster_size': 5,
                'metric': 'euclidean',
                'cluster_selection_method': 'eom'  # if a pickling error occurs, also set 'core_dist_n_jobs': 1
               }

The speed option selects a preset configuration for doc2vec. Here we used the quickest preset, but we could also modify this later in the Top2Vec code manually to get optimal results.

### Execute the pipeline (Doc2Vec, UMAP, HDBSCAN, AssignToTopics)

In [None]:
model = Top2Vec(documents_for_top2vec, document_ids=ids_for_top2vec, speed='fast-learn',
                umap_args=umap_args, hdbscan_args=hdbscan_args, workers=8)

2021-11-12 22:37:01,748 - top2vec - INFO - Pre-processing documents for training
2021-11-12 22:37:15,516 - top2vec - INFO - Creating joint document/word embedding


... and save the model for later.

In [None]:
model.save("2020-doc2vec-fast2")

## A quick look at the results:

How many topics did we find?

In [None]:
model.get_num_topics()

In [None]:
for i in range(model.get_num_topics()):
    model.generate_topic_wordcloud(i)

# 3. Sentiment analysis with TextBlob

We want to know whether the quotes have a positive or a negative intention. In the following section we perform a sentiment analysis with TextBlob. TextBlob is a Python library for Natural Language Processing (NLP). It uses the Natural Language Toolkit (NLTK) to achieve its tasks and can be used for complex analysis of textual data.

In [None]:
import pandas as pd
from textblob import TextBlob

### Extract some quotes from a single cluster topic


In [None]:
quotations, quotation_scores, quotation_ids = model.search_documents_by_topic(topic_num=0, num_docs=20)

In [None]:
# Example DataFrame; to be replaced by a DataFrame of the quotes for one topic (filtering)
df = politician_quotes.loc[quotation_ids, :]

In [None]:
from src.sentiment_analysis import get_subjectivity, get_polarity, get_sentiment
# Add the polarity score and the derived sentiment label to the DataFrame
df['polarity'] = df['quotation'].apply(get_polarity)
df['analysis'] = df['polarity'].apply(get_sentiment)
df.head()
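
The helpers imported from `src.sentiment_analysis` are not shown in this notebook. A minimal sketch of what they might look like follows, assuming that polarity is taken from TextBlob's `sentiment.polarity` and that the label is derived by thresholding at zero. The label strings and the threshold are assumptions, not the project's confirmed implementation.

```python
def get_polarity(text):
    """Return TextBlob's polarity score for the text, a float in [-1, 1]."""
    # Imported lazily so that get_sentiment can be used without TextBlob installed.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def get_sentiment(polarity):
    """Map a polarity score to a label (assumed thresholds: below, at, above zero)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Labels for a few hand-picked polarity values.
labels = [get_sentiment(p) for p in (0.5, 0.0, -0.3)]
print(labels)
```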

In [None]:
# Count positive, negative and neutral quotes
tb_counts = df.analysis.value_counts()
print(tb_counts)

### Visualize the results

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15,7))
plt.title("Polarity", color = 'g')
plt.pie(tb_counts.values, labels = tb_counts.index,  autopct='%1.1f%%')
plt.legend()
plt.show()

In [None]:
plt.title('Distribution of polarity values')
plt.hist(df['polarity'])  # the column is named 'polarity' (lowercase) above
plt.xlabel('Polarity in the range of -1 = negative, to 1 = positive')
plt.ylabel('Occurrence of the value')
plt.show()

### Load the data

In [None]:
import pandas as pd
import scripts.word2vec as w2v

In [None]:
# Load the quotes file (download it first if it is not present)
df = pd.read_csv('quotes-2020-politicians.csv.gz', compression='gzip')

In [None]:
df.head()

In [None]:
# Extend the DataFrame with the quote vectors

w2v.extend_dataframe(df, 'quotation', 'quotation_vector')

In [None]:
df.head()