# Unsupervised Sentiment Analysis by Lexicon Induction

The aim of this small project is to show how the Pointwise Mutual Information (PMI) measure can be used for *lexicon induction from texts*. This task aims at automatically building a lexicon (a list of words) that can be used in various information extraction applications.

We focus here on sentiment analysis. Given a text, we want to determine if it is positive or negative towards a given target (e.g. "a very good movie" is positive while "a bad movie" is negative). To this end, we'll proceed as follows:

  1. Take reviews as input.
  2. Extract from this corpus a set of phrases following some fixed lexico-syntactic patterns.
  3. Estimate the polarity orientation (PO) of these phrases using PMI.
  4. Infer the polarity orientation of a review by averaging the scores of all the phrases it contains.
  5. Evaluate the accuracy of the algorithm by comparing the computed scores with the gold scores (the so-called star ratings).

Let's illustrate the algorithm with the following example.

**Movie Name**:  Pearl Harbor

**Review =** During this period I had a sick feeling, knowing what was coming, knowing what was part of our history

**Author rating** = Positive

**Pattern**=[ADJ NN] (adjective followed by a noun)

**Retrieved phrases**=sick feeling

**Polarity orientation of the phrase**= -8.308

**Polarity orientation of the review**= - 0.378


This unsupervised approach is called the **Turney algorithm**. It was proposed in 2002 by Peter Turney, a Canadian researcher from Ottawa. For more details, see the paper: https://aclanthology.org/P02-1053.pdf. Note that we use a slightly modified version of Turney's algorithm here (step 3 does not require querying any external search engine).

Our aim is to implement this algorithm and evaluate it, first on an English movie review dataset, then on a French dataset. 


 # I. Take reviews as input

We'll use the IMDB Dataset. You can download it from here: https://huggingface.co/datasets/imdb

  * Take a random split of 3000 positive and 3000 negative reviews from the train portion of the dataset. This will be used to compute the polarity orientation of phrases. 

  * Take a random split of 500 positive and 500 negative reviews from the test portion of the dataset. This will be used for evaluating the algorithm.

In [None]:
pip install datasets

In [3]:
from datasets import load_dataset
from pprint import pprint

In [4]:
dataset = load_dataset('imdb')  # more info on the load_dataset function: https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html
# you can also download only the train/test/unsupervised portion of the dataset (see the documentation)


[Download and dataset-preparation progress output omitted. The log reports 25000 train, 25000 test, and 50000 unsupervised examples.]

In [5]:
dataset["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}

In [6]:
print(dataset['train'][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be
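The balanced random splits described at the start of this section can be sketched in plain Python as follows. This is only one possible approach: the helper name `balanced_sample`, the fixed seed, and the toy records are assumptions made for illustration. The label convention (0 = neg, 1 = pos) follows the `ClassLabel` shown above.

```python
import random

def balanced_sample(records, n_per_class, seed=0):
    """Return n_per_class randomly chosen records per label (0 = neg, 1 = pos)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {0: [], 1: []}
    for rec in records:
        by_label[rec["label"]].append(rec)
    sample = []
    for label in (0, 1):
        # rng.sample draws without replacement
        sample.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(sample)  # avoid having all negatives first
    return sample

# Toy stand-in for dataset["train"] (hypothetical records, not real reviews)
toy = [{"text": f"review {i}", "label": i % 2} for i in range(20)]
train_sample = balanced_sample(toy, n_per_class=3)
```

Since iterating over a split yields dictionaries with `text` and `label` keys, the same helper should work as `balanced_sample(dataset["train"], 3000)` and `balanced_sample(dataset["test"], 500)`.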


 # II. Extract from this corpus a set of phrases

Let's start from the following patterns (feel free to add more interesting patterns, see Table 1 in Turney's paper):

  * ADJ NN (adjective followed by a noun) 
  * NN ADJ (noun followed by an adjective) 
  * ADJ ADJ (adjective followed by an adjective)
  * ADV ADJ (adverb followed by an adjective)

You can use the spaCy part-of-speech tagger (https://spacy.io/usage/spacy-101) to parse your data and get the POS of each token (available through the `pos_` attribute).

Using the pandas library is highly recommended to analyze and manipulate your data (https://pandas.pydata.org/docs/index.html).
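As a minimal sketch of the extraction step, the function below scans a list of `(word, tag)` pairs and keeps every pair of adjacent tokens whose tags match one of the patterns. It assumes Universal POS tags, as returned by spaCy's `pos_` attribute, so the noun tag is `NOUN` rather than `NN`. The names `PATTERNS` and `extract_phrases` are illustrative, and the tags in the example are assigned by hand rather than produced by a tagger.

```python
# The four patterns listed above, written with Universal POS tags (assumption).
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "ADJ"), ("ADJ", "ADJ"), ("ADV", "ADJ")}

def extract_phrases(tagged_tokens, patterns=PATTERNS):
    """Return the lowercased two-word phrases whose tag pair is in `patterns`."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            phrases.append((w1.lower(), w2.lower()))
    return phrases

# With spaCy, the tagged tokens could be obtained along these lines:
#   nlp = spacy.load("en_core_web_sm")
#   tagged = [(t.text, t.pos_) for t in nlp(text)]
# Here the tags are hand-assigned so that the sketch runs without a model.
tagged = [("I", "PRON"), ("had", "VERB"), ("a", "DET"),
          ("sick", "ADJ"), ("feeling", "NOUN")]
phrases = extract_phrases(tagged)
```

On this fragment of the Pearl Harbor review, the only match is the ADJ + noun pair `("sick", "feeling")`.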


 # III. Compute the polarity orientation of the extracted phrases

Now that we have the phrases, we need to compute their polarity orientation based on PMI, by computing their co-occurrence score with positive vs. negative seed words. These seed words are arbitrarily fixed in advance, like *excellent* and *poor*.

Let V+ (V-) be the set of positive (negative) seed words. The polarity orientation PO of a phrase p is computed with the following formula:

$$ PO(p)= \sum_{(w \in V+)} PMI(p, w) - \sum_{(w \in V-)} PMI(p, w)$$


such that:

$$PMI(a,b)=\log_2\left(\frac{P(a,b)}{P(a) \cdot P(b)}\right)$$


and:

$$P(w)=\frac{Freq(w)}{TotalWordCount}$$
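The formulas above can be sketched in code as follows. Two points are assumptions, since the text does not fix them: the joint count behind P(p, w) is taken to be the number of reviews containing both the phrase and the seed word, normalised by the same total word count, and a small constant `eps` is added to every count to avoid taking the logarithm of zero. The function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def pmi(joint, freq_a, freq_b, total, eps=0.01):
    """log2(P(a,b) / (P(a) * P(b))), each count smoothed by eps (assumption)."""
    p_ab = (joint + eps) / total
    p_a = (freq_a + eps) / total
    p_b = (freq_b + eps) / total
    return math.log2(p_ab / (p_a * p_b))

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def polarity_orientation(phrase, reviews, pos_seeds, neg_seeds):
    """PO(p) = sum of PMI(p, w) over V+ minus sum of PMI(p, w) over V-."""
    total = sum(len(r) for r in reviews)  # TotalWordCount
    word_freq = Counter(w for r in reviews for w in r)
    phrase_freq = sum(bigrams(r).count(phrase) for r in reviews)

    def score(seed):
        # Assumed joint count: reviews containing both the phrase and the seed.
        joint = sum(1 for r in reviews if seed in r and phrase in bigrams(r))
        return pmi(joint, phrase_freq, word_freq[seed], total)

    return sum(score(w) for w in pos_seeds) - sum(score(w) for w in neg_seeds)

# Toy corpus of tokenised reviews (made up for illustration)
reviews = [
    ["an", "excellent", "film", "with", "great", "acting"],
    ["a", "poor", "film", "with", "bad", "acting"],
]
po_great = polarity_orientation(("great", "acting"), reviews, ["excellent"], ["poor"])
po_bad = polarity_orientation(("bad", "acting"), reviews, ["excellent"], ["poor"])
```

This version recounts the corpus for every phrase, which is fine for a toy example. On 6000 reviews, it is better to precompute the counts once (for instance in a pandas DataFrame).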

Compute the PO of the phrases extracted in II. using the training dataset.

**Output ==> List of positive/negative seed words + a list of phrases together with their polarity orientation (give only some of them, not all!)**






 # IV. Compute the polarity orientation of a review

The PO of a review r is the average PO of all the phrases it contains.

$$ PO(r)= Average_{(p \in r)} PO(p)$$
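A minimal sketch of this averaging step is given below. It assumes that phrase scores are stored in a dictionary keyed by `(word1, word2)` tuples, and that a review with no scored phrase gets a neutral score of 0.0. Both choices are assumptions, not part of the original algorithm description. The score for "sick feeling" is taken from the example in the introduction; the other value is made up.

```python
def review_polarity(review_phrases, phrase_po):
    """Average the PO of the phrases of a review that have a known score."""
    scores = [phrase_po[p] for p in review_phrases if p in phrase_po]
    if not scores:
        return 0.0  # assumption: no scored phrase means a neutral review
    return sum(scores) / len(scores)

phrase_po = {
    ("sick", "feeling"): -8.308,  # value from the Pearl Harbor example
    ("great", "acting"): 4.0,     # illustrative value, not a computed score
}
score = review_polarity([("sick", "feeling"), ("great", "acting")], phrase_po)
```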


Compute the PO of each review in the dataset and evaluate the predicted orientation by comparing it with the gold annotations.
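The evaluation can be sketched as below, assuming that a review is predicted positive when its PO is strictly greater than a threshold of 0, and that gold labels follow the dataset convention (0 = neg, 1 = pos). The function name `evaluate` and the toy scores are illustrative. The confusion matrix is stored as a counter over `(gold, predicted)` pairs.

```python
from collections import Counter

def evaluate(gold_labels, review_scores, threshold=0.0):
    """Return the confusion counts over (gold, predicted) pairs and the accuracy."""
    preds = [1 if s > threshold else 0 for s in review_scores]
    confusion = Counter(zip(gold_labels, preds))
    correct = confusion[(0, 0)] + confusion[(1, 1)]
    accuracy = correct / len(gold_labels)
    return confusion, accuracy

# Toy gold labels and review scores (made up for illustration)
gold = [1, 1, 0, 0]
scores = [2.5, -0.4, -1.2, 0.3]
confusion, accuracy = evaluate(gold, scores)
```

Here `confusion[(1, 0)]` counts the positive reviews predicted as negative, and `confusion[(0, 1)]` the negative reviews predicted as positive.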

**Output==> Confusion matrix + Accuracy of the algorithm**

 # V. Discussions


 * Change the list of seed words. Do you observe any improvement?
 * Extend the patterns to deal with other types of phrases. Does this impact the accuracy?
 * What are the main conclusions of this project?

**Output==> Answer to each question (new accuracies, etc.)**

 # VI. Apply the same algorithm to a French sentiment dataset

  * Download the French movie reviews dataset from https://huggingface.co/datasets/
  * Define a set of seed positive/negative words
  * Define a set of interesting patterns to look at
  * Compute the polarity of the reviews using Turney's algorithm and display its accuracy (you are asked here to use a small subset of the initial dataset)

**Output==> Confusion matrix + Accuracy of the algorithm**