In [1]:
file_names = ['1755_Lisbon_earthquake','1896_Summer_Olympics','1997_Pacific_hurricane_season','Actinium','Barracuda','Basketball','Bath_School_disaster','Chicago','Chocolate','Diamond','Dice','Drinking_water','Duchenne_muscular_dystrophy','Geography_of_Ireland','Giraffe','Gunpowder','Osama_bin_Laden','Palm_oil','Peace','Pellagra','Phishing','Plant','Plato','Pneumonia','Poison_gas_in_World_War_I','Politics','Pollution','Red_Kite','Rice','Rio_de_Janeiro','Romeo_and_Juliet','Rugby_World_Cup','Rwandan_Genocide','Santa_Claus','Scooby-Doo']
len(file_names)

35

### MODEL DESCRIPTION
Pattern is a multipurpose library that covers NLP, data mining, and machine learning. It also includes sentiment analysis functionality, which suits our task.
The `sentiment` function in the `pattern.text.en` module calculates the sentiment of a given text. It accepts a sentence, which can be a string, Synset, word, or document, and returns a (polarity, subjectivity) tuple, with polarity between -1.0 and +1.0 and subjectivity between 0.0 and 1.0. Polarity describes the emotional leaning of the text (negative to positive), while subjectivity describes how opinionated the text is (objective to subjective).

In our usage, the input is a string holding an entire article loaded from the Wikispeedia dataset. The function first tokenizes the text into words (punctuation, spaces, and abbreviations are handled at this stage), then lowercases each word, because the dictionary lookup is case-insensitive. Next, it scores each word by consulting the predefined sentiment [dictionary](https://github.com/clips/pattern/blob/master/pattern/text/en/en-sentiment.xml) (modifiers and negations are also taken into account at this point). Finally, it returns the average of the word scores as the sentiment of the text.
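
The steps above can be illustrated with a minimal, self-contained sketch. It is not Pattern's implementation: `TOY_LEXICON`, `TOY_INTENSIFIERS`, `NEGATIONS`, and `toy_polarity` are hypothetical names, the scores are made up rather than taken from `en-sentiment.xml`, and the handling of modifiers and negations is deliberately simplified.

```python
import re

# Hypothetical toy lexicon: word -> polarity in [-1.0, 1.0].
# These values are made up for illustration, not taken from en-sentiment.xml.
TOY_LEXICON = {"good": 0.7, "great": 0.8, "bad": -0.7, "terrible": -1.0}
# Hypothetical intensifiers that scale the next scored word.
TOY_INTENSIFIERS = {"very": 1.3, "slightly": 0.5}
# Hypothetical negations that flip the sign of the next scored word.
NEGATIONS = {"not", "never"}


def toy_polarity(text):
    """Average the polarity of the lexicon words found in `text`."""
    # Tokenize and lowercase in one step.
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate, scale = False, 1.0
    for token in tokens:
        if token in NEGATIONS:
            negate = True
        elif token in TOY_INTENSIFIERS:
            scale = TOY_INTENSIFIERS[token]
        elif token in TOY_LEXICON:
            # Apply the pending intensifier and keep the score in [-1, 1].
            score = max(-1.0, min(1.0, TOY_LEXICON[token] * scale))
            scores.append(-score if negate else score)
            # Modifiers apply to a single scored word only.
            negate, scale = False, 1.0
        # Words outside the lexicon are skipped and do not enter the average.
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("The service was not very good"))
print(toy_polarity("Good food but terrible weather"))
```

In this sketch, the first sentence comes out at about -0.91 (the intensified score of "good", flipped by "not"), while the second averages one positive and one negative word and lands near -0.15, which shows how mixed text is pulled toward zero.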

Pattern is a classic, well-known non-commercial library for sentiment analysis. It has been around for over 10 years and has 8.6k stars on GitHub. It provides detailed results (polarity and subjectivity) for many mainstream languages, and its API is fast and easy to use, with other NLP preprocessing tools built in. More details can be found in the [official documentation](https://digiasset.org/html/pattern.html) and [repository](https://github.com/clips/pattern).




### METHOD & RESULTS
Each selected article is loaded and fed into the `sentiment` function, and its polarity is printed. The results are shown below.

In [8]:
from pattern.text.en import sentiment

for file_name in file_names:
    # Load the plain-text version of the article.
    with open(f'data/plaintext_articles/{file_name}.txt', 'r', encoding='utf-8') as file:
        data = file.read()
    # Only the polarity is reported; the subjectivity score is not used here.
    polarity, _subjectivity = sentiment(data)
    print(f"{file_name}: {polarity}")

1755_Lisbon_earthquake: 0.0816923282902664
1896_Summer_Olympics: 0.11173071331653418
1997_Pacific_hurricane_season: 0.051366249491249474
Actinium: 0.03542682926829269
Barracuda: 0.1132213321465658
Basketball: 0.08600218021995364
Bath_School_disaster: 0.010679336219336222
Chicago: 0.10564842500695483
Chocolate: 0.07674311830989919
Diamond: 0.12359236785162714
Dice: -0.0001423413188119043
Drinking_water: 0.12155415214866433
Duchenne_muscular_dystrophy: 0.051342562953478464
Geography_of_Ireland: 0.06101678376268537
Giraffe: 0.04892030793508625
Gunpowder: 0.01667841269841271
Osama_bin_Laden: 0.04482218734525007
Palm_oil: 0.10377811870669014
Peace: 0.07109719189365207
Pellagra: 0.013132859204287773
Phishing: 0.03130031080031078
Plant: 0.09383068133068131
Plato: 0.16051446416831025
Pneumonia: 0.05341062158293232
Poison_gas_in_World_War_I: 0.08464027042373903
Politics: 0.10395405509821987
Pollution: 0.08715013543960917
Red_Kite: 0.035212025919573085
Rice: 0.08478642004761416
Rio_de_Janeiro: 0

The MSE between the predicted polarities and the human-labeled results (ternary values in {-1, 0, 1}) is around **0.7**.
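
As a rough illustration of how such a score can be computed, here is a minimal sketch. The three polarities are copied from the output above, but `example_labels` is a hypothetical set of labels made up for this example (the actual human labels are not shown in this notebook), and `mean_squared_error` is a helper written for the sketch, so the resulting number is not the reported MSE.

```python
def mean_squared_error(predictions, labels):
    """Mean of the squared differences between two equal-length sequences."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        raise ValueError("cannot compute the MSE of empty sequences")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(predictions)


# Polarities copied from the printed results.
polarities = {
    "Chocolate": 0.07674311830989919,
    "Dice": -0.0001423413188119043,
    "Pellagra": 0.013132859204287773,
}
# Hypothetical human labels in {-1, 0, 1}; made up for illustration only.
example_labels = {"Chocolate": 1, "Dice": 0, "Pellagra": -1}

names = sorted(polarities)
mse = mean_squared_error(
    [polarities[name] for name in names],
    [example_labels[name] for name in names],
)
print(f"MSE on the toy sample: {mse:.3f}")
```

One thing the formula makes visible: when every prediction is close to zero, each squared error is close to the squared label, so the MSE approaches the fraction of articles whose label is -1 or 1.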

### DISCUSSION
This model generally performs worse than the other models we tested. One possible reason is that it relies on a fixed sentiment dictionary, which may be biased across domains. We also observe that the model tends to return a very small value close to zero (mostly between 0 and about 0.16 in the results above) regardless of the article. A likely reason is that Wikipedia articles are written to be neutral and objective: they generally present topics from multiple sides and rarely express strong emotion. However, within the scope of this project, we care more about the sentiment associated with the article's subject (the word itself) than about the way the article is written. The model weighs all the words on the page evenly and may therefore produce results that differ from what we expect.

This is not the optimal model for our task; an ML-based approach is generally more flexible and powerful in this case.