# Contents
3. [Sentiment Analysis](#3.-Sentiment-Analysis)  
    3.1. [Merge years](#3.1.-Merge-years)   
    3.2. [Sentiment analysis](#3.2.-Sentiment-analysis)

# **3. Sentiment Analysis**

The goal of this section is to associate an opinion with each quote belonging to a certain topic. To quantify the opinion expressed in a quote, we use sentiment analysis, a natural language processing technique that assigns a sentiment score to a text. The sentiment or *polarity score* is a scalar between -1 and 1, where **-1 reflects a strongly negative sentiment, 1 a strongly positive sentiment and 0 a neutral opinion**.

After a comparison with Flair and TextBlob (cf. `Milestone2/SentimentAnalysis_exploration`), we decided to use **VADER**, a lexicon- and rule-based sentiment analysis tool developed at Georgia Tech and specifically attuned to sentiments expressed in social media. In brief, VADER assigns a valence score to each word of a sentence that appears in its lexicon, adjusts these scores with heuristic rules (e.g. for negation, intensifiers and punctuation), and obtains the final *compound* score by summing the adjusted valences and normalizing the sum to the range [-1, 1] [[1]](https://ojs.aaai.org/index.php/ICWSM/article/view/14550).
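
To make the aggregation step concrete, the sketch below mimics the normalization that, to our understanding, VADER's reference implementation applies to the summed valences: a sum $x$ is mapped to $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$. The function name `normalize` and the word valences are assumptions chosen for illustration; they are not taken from the VADER lexicon.

```python
import math

def normalize(score_sum, alpha=15.0):
    """Map an unbounded sum of valences to the interval [-1, 1]."""
    norm = score_sum / math.sqrt(score_sum * score_sum + alpha)
    # Clip to guard against rounding slightly outside [-1, 1]
    return max(-1.0, min(1.0, norm))

# Hypothetical, already rule-adjusted word valences for one sentence
valences = [1.9, 0.0, -0.4, 2.5]
compound = normalize(sum(valences))
print(round(compound, 4))
```

Because of the square root in the denominator, long sentences with many mildly positive words saturate towards 1 instead of growing without bound.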

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install vaderSentiment
!pip install fastparquet



In [4]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow.parquet as pq
import pyarrow as pa
import time

In [5]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...




In [6]:
preprocess_folder = '/content/drive/MyDrive/ADA/Processed/'
sentiment_folder = '/content/drive/MyDrive/ADA/Sentiment/'

In [7]:
import datetime
import pytz
def printts(*objects):
    print(datetime.datetime.now(pytz.timezone('Europe/Zurich')).strftime("%d %b %Y %H:%M:%S"), ":", *objects)

## 3.1. Merge years

In [8]:
def merge_df():
  '''
  Load the preprocessed DataFrame of each year from 2015 to 2020 and merge
  them into a single DataFrame with a shuffled row order and a unique index.
  '''

  # Create list of preprocessed DataFrames per year
  df_years = []
  for filename in sorted(os.listdir(preprocess_folder), reverse=True):
    processpath = os.path.join(preprocess_folder, filename)
    printts(f'Reading {filename}...')
    df_year = pd.read_parquet(processpath)
    df_years.append(df_year)

  # Concatenate the processed years into one single dataframe
  printts('Combining years...')
  df = pd.concat(df_years)
  del df_year
  del df_years

  # Shuffle dataframe
  df = df.sample(frac=1, random_state=42)

  # Set a unique string index: q0, q1, q2, ...
  df.index = [f'q{i}' for i in range(len(df))]
  # df = df.reset_index(drop=True)

  printts('Merging done')
  return df
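
The merge logic can be checked on toy data without access to the Drive folder. The sketch below is an illustration only: the two small DataFrames and their columns are made up, and they stand in for the per-year parquet files.

```python
import pandas as pd

# Hypothetical stand-ins for two preprocessed yearly DataFrames
df_2019 = pd.DataFrame({'quotation': ['a', 'b'], 'year': [2019, 2019]})
df_2020 = pd.DataFrame({'quotation': ['c', 'd'], 'year': [2020, 2020]})

# Concatenate, shuffle reproducibly, then assign the q0, q1, ... index
toy = pd.concat([df_2020, df_2019])
toy = toy.sample(frac=1, random_state=42)
toy.index = [f'q{i}' for i in range(len(toy))]

print(toy)
```

Shuffling before the index is assigned means the labels `q0`, `q1`, ... carry no information about the year a quote comes from.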

## 3.2. Sentiment analysis

In [11]:
# Create the VADER analyzer
analyzer = SentimentIntensityAnalyzer()

def get_vader_compound_score(sentence):
  # Apply VADER analyzer and get compound score
  return analyzer.polarity_scores(sentence)['compound']
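
The compound score is continuous, but it is often convenient to bin it into three classes. The sketch below uses the thresholds of ±0.05 that are conventionally recommended for VADER; the helper `label_from_compound` is hypothetical and is not used elsewhere in this notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Bin a compound score in [-1, 1] into a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical compound scores, not produced by the analyzer above
for score in (0.72, -0.31, 0.0):
    print(score, label_from_compound(score))
```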

In [None]:
df = merge_df()

08 Dec 2021 16:02:01 : Reading quotes-2020.parquet.gzip...
08 Dec 2021 16:02:05 : Reading quotes-2019.parquet.gzip...
08 Dec 2021 16:02:19 : Reading quotes-2018.parquet.gzip...
08 Dec 2021 16:02:43 : Reading quotes-2017.parquet.gzip...
08 Dec 2021 16:03:09 : Reading quotes-2016.parquet.gzip...
08 Dec 2021 16:03:20 : Reading quotes-2015.parquet.gzip...
08 Dec 2021 16:03:33 : Combining years...
08 Dec 2021 16:04:15 : Merging done


For convenience, we will again process our filtered DataFrame in chunks.

In [13]:
def process_chunk(chunk, rpath):
  '''
  Compute the VADER compound score of each quote in the chunk and add the
  result to the parquet dataset at rpath.
  '''
  printts('Predicting VADER compound scores ...')
  # Build a new one-column DataFrame instead of assigning to `chunk`, which is
  # a slice of df and would raise a SettingWithCopyWarning
  sentiment = chunk.quotation.apply(get_vader_compound_score).to_frame('sentiment')

  # Create a pyarrow table from the DataFrame (the index is kept)
  table = pa.Table.from_pandas(sentiment)

  # Each call adds a new gzip-compressed file to the dataset directory rpath,
  # so successive chunks accumulate in the same dataset
  printts(f'Writing chunk to {rpath}...')
  pq.write_to_dataset(table, root_path=rpath, compression='gzip')
  printts('Writing done')
  print('-------------------------')

In [10]:
df_path = os.path.join(sentiment_folder, 'df_politicians_sentiment_only.parquet.gzip')
chunksize = 1_000_000

if not os.path.exists(df_path):
  # Chunk boundaries: 0, chunksize, 2*chunksize, ..., len(df)
  indx = np.concatenate([np.arange(0, len(df), chunksize), [len(df)]]).astype('int')
  for ii in range(len(indx) - 1):
    chunk = df.iloc[indx[ii]:indx[ii+1]]
    process_chunk(chunk, df_path)
else:
  print('Found file ' + df_path)

Found file /content/drive/MyDrive/ADA/Sentiment/df_politicians_sentiment_only.parquet.gzip
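
The chunk boundaries used in the loop above can be inspected in isolation. This is a minimal sketch with a hypothetical helper `chunk_bounds` and a made-up row count; it reproduces the same `np.arange` plus `np.concatenate` construction.

```python
import numpy as np

def chunk_bounds(n_rows, chunksize):
    """Return the row offsets 0, chunksize, 2*chunksize, ..., n_rows."""
    starts = np.arange(0, n_rows, chunksize)
    return np.concatenate([starts, [n_rows]]).astype(int)

# Hypothetical DataFrame length of 2.5 million rows
bounds = chunk_bounds(2_500_000, 1_000_000)
slices = list(zip(bounds[:-1], bounds[1:]))

for start, stop in slices:
    print(f'rows {start} to {stop - 1}')
```

The last chunk is simply shorter than the others, and no empty chunk is produced when the number of rows is an exact multiple of the chunk size.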
