# VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis method attuned to sentiments expressed in social media. The model scores the proportions of the text that are positive, neutral, and negative. Then, combining the lexicon's valence scores with grammatical rules, it computes the 'compound score', a normalized, weighted composite score of the text's overall sentiment that ranges from -1 (most negative) to +1 (most positive).

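VADER is also available through NLTK; this notebook uses the standalone `vaderSentiment` package. VADER takes into consideration textual nuances and informal phrases and, according to its documentation, can even be applied to non-English sentences by translating them first.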
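
To make the compound score concrete, the sketch below maps a compound value to a coarse label. The ±0.05 cut-offs follow the convention suggested in the vaderSentiment documentation and are an assumption here, not something this notebook uses. The function name `label_from_compound` is a hypothetical helper introduced for illustration and is not part of the library.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label.

    Assumption: the +/-0.05 threshold follows the convention suggested in the
    vaderSentiment documentation; adjust it for your own data.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Example compound values, one per label.
for score in (-0.997, 0.0, 0.9999):
    print(score, label_from_compound(score))
```

With long articles such as the ones analysed here, almost every text ends up far outside the neutral band, so the label alone says little about intensity.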

In [81]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tqdm.auto import tqdm
import os
import pandas as pd

## Load articles' titles and content

In [71]:

titles_path = "Feuille de calcul sans titre - results.csv"
html_dir = "plaintext_articles"

def load_titles(file_path=titles_path):
    """
    Load the titles of Wikispeedia pages

    Input args:
        file_path: path to the CSV file which contains the list of Wikipedia articles
    Output:
        titles: pandas DataFrame with a single 'title' column listing the Wikipedia pages used in Wikispeedia
    """
    titles = pd.read_csv(file_path, header=None)
    titles.columns = ['title']
    return titles

def load_html(titles, dir=html_dir):
    """
    Load the text of Wikispeedia pages

    Input args:
        titles: pandas DataFrame of Wikipedia titles for which to retrieve the text content
        dir: path to the directory containing the plain-text Wikipedia articles
    Output:
        titles: the input DataFrame with an added 'content' column (empty string where the file is missing)
    """
    content = []
    for title in titles.title:
        path = os.path.join(dir, title + ".txt")
        if os.path.exists(path):
            with open(path, 'r', encoding='utf-8') as file:
                lines = file.readlines()
                # Skip the first line of the file and keep the rest as the article text
                content.append("".join(lines[1:]))
        else:
            print("WARNING: file", path, "is missing")
            # Append an empty string so 'content' stays aligned with the titles
            content.append("")
    titles['content'] = content
    return titles

In [72]:
articles = load_titles()
len(articles)

35

In [75]:
content = load_html(articles)
content

| | title | content |
|---|---|---|
| 0 | 1755_Lisbon_earthquake | \n1755 Lisbon earthquake\n\n2007 Schools Wikip... |
| 1 | 1896_Summer_Olympics | \n1896 Summer Olympics\n\n2007 Schools Wikiped... |
| 2 | 1997_Pacific_hurricane_season | \n1997 Pacific hurricane season\n\n2007 School... |
| 3 | Actinium | \nActinium\n\n2007 Schools Wikipedia Selection... |
| 4 | Barracuda | \nBarracuda\n\n2007 Schools Wikipedia Selectio... |
| 5 | Basketball | \nBasketball\n\n2007 Schools Wikipedia Selecti... |
| 6 | Bath_School_disaster | \nBath School disaster\n\n2007 Schools Wikiped... |
| 7 | Chicago | \nChicago\n\n2007 Schools Wikipedia Selection.... |
| 8 | Chocolate | \nChocolate\n\n2007 Schools Wikipedia Selectio... |
| 9 | Diamond | \nDiamond\n\n2007 Schools Wikipedia Selection.... |


## Analysing the compound score

In [76]:
scores = []
analyzer = SentimentIntensityAnalyzer()
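# Iterate over the article texts; passing the column gives tqdm a known length
for text in tqdm(content['content']):
    vs = analyzer.polarity_scores(text)
    # Keep only the compound score of each article
    scores.append(vs['compound'])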



In [77]:
content['score'] = scores
content

| | title | content | score |
|---|---|---|---|
| 0 | 1755_Lisbon_earthquake | \n1755 Lisbon earthquake\n\n2007 Schools Wikip... | -0.997 |
| 1 | 1896_Summer_Olympics | \n1896 Summer Olympics\n\n2007 Schools Wikiped... | 0.9999 |
| 2 | 1997_Pacific_hurricane_season | \n1997 Pacific hurricane season\n\n2007 School... | -0.9998 |
| 3 | Actinium | \nActinium\n\n2007 Schools Wikipedia Selection... | -0.8792 |
| 4 | Barracuda | \nBarracuda\n\n2007 Schools Wikipedia Selectio... | 0.9944 |
| 5 | Basketball | \nBasketball\n\n2007 Schools Wikipedia Selecti... | 0.9999 |
| 6 | Bath_School_disaster | \nBath School disaster\n\n2007 Schools Wikiped... | -0.9998 |
| 7 | Chicago | \nChicago\n\n2007 Schools Wikipedia Selection.... | 0.9999 |
| 8 | Chocolate | \nChocolate\n\n2007 Schools Wikipedia Selectio... | 0.9996 |
| 9 | Diamond | \nDiamond\n\n2007 Schools Wikipedia Selection.... | 1.0 |


As the scores show, the model separates the articles cleanly into two clusters, positive and negative, with a squared error of 0.2. However, the scores are extreme: almost all of the values shown lie close to -1 or +1, which makes it impractical to infer how strongly negative or positive each article is. This could be attributed to the prevalence of neutral words, leading to smaller differences between the negative and positive counterparts.
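
One factor that likely contributes to these extreme values is the normalization step. As far as the vaderSentiment source shows, the summed valence x of a text is mapped to x / sqrt(x² + α) with α = 15, so the compound score approaches ±1 quickly as the sum grows, and long articles accumulate large sums. The sketch below reimplements that formula locally (it is not imported from the library, and the input sums are made-up illustrative values) to show the saturation.

```python
import math


def normalize(valence_sum, alpha=15):
    """Map an unbounded valence sum into (-1, 1), as VADER's compound score does.

    Reimplemented here for illustration; alpha=15 is assumed to match the
    library default.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Illustrative (made-up) valence sums, from a short post to a long article.
for valence_sum in (1, 5, 20, 100, -100):
    print(f"{valence_sum:>5} -> {normalize(valence_sum):+.4f}")
```

Under this assumption, once the summed valence is large in magnitude, doubling it barely moves the compound score, which is consistent with full-length articles clustering near the two ends of the scale.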

In [80]:
content.drop(columns='content').to_csv('results.csv',index=False,header=False)