# Data cleaning

Load and check data quality

In [None]:
import pandas as pd

In [None]:
data = pd.read_excel('../data/sentences_with_sentiment.xlsx')
data.head()

In [None]:
len(data)

Drop the ID column since it contains no useful information

In [None]:
data = data.drop('ID', axis=1)

Labels seem to be already one-hot encoded. Let's ensure the encoding is valid 

In [None]:
all(data.loc[:, ['Positive', 'Negative', 'Neutral']].sum(axis=1) == 1)
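If the check above returned `False`, the next question would be which rows break the encoding. Note also that a row sum of 1 does not by itself guarantee binary values (for example 2, -1, 0 also sums to 1). The following is a minimal sketch on a small made-up frame; the `toy` rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the rows are invented for illustration.
toy = pd.DataFrame({
    'Sentence': ['a', 'b', 'c'],
    'Positive': [1, 0, 1],
    'Negative': [0, 0, 1],
    'Neutral':  [0, 1, 0],
})

label_cols = ['Positive', 'Negative', 'Neutral']

# A valid one-hot row contains only 0/1 values and sums to exactly 1.
is_binary = toy[label_cols].isin([0, 1]).all(axis=1)
sums_to_one = toy[label_cols].sum(axis=1) == 1

# Rows that violate either condition.
invalid_rows = toy[~(is_binary & sums_to_one)]
```

On the real data the same two masks could be built from `data` instead of `toy`.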

Check ```Sentence``` column for duplicates

In [None]:
pd.set_option('display.max_rows', 300)
dup = data['Sentence'].duplicated()
dup[dup]

Are the same rows also duplicates when the labels are taken into account?

In [None]:
dup_labels = data[['Sentence', 'Positive', 'Negative', 'Neutral']].duplicated()
dup_labels[dup_labels]

In [None]:
# Index.equals also copes with indexes of different lengths, where == would raise
dup[dup].index.equals(dup_labels[dup_labels].index)

Yes, it would seem so. 

In principle, duplicated sentences could represent opinions given by different experts, but since the labels are identical as well, this does not seem to be the case judging from this sample. The more likely explanation is that each duplicated value represents a common phrase that is _actually_ repeated across various samples.
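The reasoning above can also be checked more directly by counting how many distinct label combinations each repeated sentence has. This is only a sketch on invented rows (the `toy` frame is an assumption, not the real data): a count above 1 would indicate that the same sentence received conflicting labels.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Sentence': ['x is safe', 'more data needed', 'x is safe', 'x is safe'],
    'Positive': [1, 0, 1, 1],
    'Negative': [0, 1, 0, 0],
    'Neutral':  [0, 0, 0, 0],
})

# keep=False marks every member of a duplicate group, not only the repeats.
in_dup_group = toy['Sentence'].duplicated(keep=False)

# Number of distinct (sentence, labels) rows per repeated sentence.
label_variants = (
    toy[in_dup_group]
    .drop_duplicates()
    .groupby('Sentence')
    .size()
)
```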

Now we can check the label distribution

In [None]:
print('Positive samples', len(data[data['Positive'] == 1]))
print('Negative samples', len(data[data['Negative'] == 1]))
print('Neutral samples', len(data[data['Neutral'] == 1]))

The class distribution is clearly skewed towards positive sentiment. In addition, a sizeable share of the samples are neutral, which could be problematic since classifiers will probably have a hard time picking up subtle differences.

While we're at it, let's produce a quick naive baseline for classification accuracy:

In [None]:
len(data[data['Positive'] == 1]) / len(data)

By always predicting the largest class, we should expect on average 60 % accuracy (unweighted). Any further classifier should aim to at least outperform this baseline.
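The same baseline can be written as an explicit "always predict the majority class" rule, which makes it easier to compare against later classifiers. A minimal sketch, assuming one-hot label columns as above; the `toy` frame and its 3/1/1 class split are invented for illustration.

```python
import pandas as pd

# Invented one-hot labels: three positive, one negative, one neutral.
toy = pd.DataFrame({
    'Positive': [1, 1, 1, 0, 0],
    'Negative': [0, 0, 0, 1, 0],
    'Neutral':  [0, 0, 0, 0, 1],
})

# Recover a single class label per row from the one-hot columns.
y_true = toy[['Positive', 'Negative', 'Neutral']].idxmax(axis=1)

# The baseline always predicts the most frequent class.
majority_class = y_true.value_counts().idxmax()
baseline_accuracy = (y_true == majority_class).mean()
```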

## Data exploration

### Unigram frequency analysis

Let's build some intuition about the data by listing the most common words. The process involves building corpora of the sentences representing the three labels, filtering out known English stop words and punctuation, and finally computing the frequency distributions for the individual corpora. Throughout this process the excellent nltk library is used.

In [None]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string

Start by creating the corpora of Positive, Negative and Neutral labels respectively and tokenizing those

In [None]:
corp_pos = word_tokenize(' '.join(data.loc[data['Positive'] == 1, 'Sentence']).lower())
corp_neg = word_tokenize(' '.join(data.loc[data['Negative'] == 1, 'Sentence']).lower())
corp_neutr = word_tokenize(' '.join(data.loc[data['Neutral'] == 1, 'Sentence']).lower())

In [None]:
corp_pos[:15]

Filter out known stop words. Note that the stop word list needs to be downloaded before this step:

```
>>> import nltk
>>> nltk.download('stopwords')
```

A base stop word list can then be obtained from ```nltk.corpus.stopwords.words('english')```. Below, we tweak this base list further in an attempt to reduce the noise caused by meaningless words such as 'a', 'the', 'it' etc. while still keeping acceptable discriminative power between classes.

In [None]:
sw = [
    'i',
    'me',
    'my',
    'myself',
    'we',
    'our',
    'ours',
    'ourselves',
    'you',
    "you're",
    "you've",
    "you'll",
    "you'd",
    'your',
    'yours',
    'yourself',
    'yourselves',
    'he',
    'him',
    'his',
    'himself',
    'she',
    "she's",
    'her',
    'hers',
    'herself',
    'it',
    "it's",
    'its',
    'itself',
    'they',
    'them',
    'their',
    'theirs',
    'themselves',
    'what',
    'which',
    'who',
    'whom',
    'this',
    'that',
    "that'll",
    'these',
    'those',
    'am',
    'is',
    'are',
    'was',
    'were',
    'be',
    'been',
    'being',
    'have',
    'has',
    'had',
    'having',
    'do',
    'does',
    'did',
    'doing',
    'a',
    'an',
    'the',
    'and',
    'but',
    'if',
    'or',
    'as',
    'of',
    'at',
    'by',
    'for',
    'with',
    'about',
    'into',
    'through',
    'during',
    'to',
    'from',
    'in',
    'out',
    'on',
    'off',
    'then',
    'once',
    'here',
    'there',
    'when',
    'where',
    'why',
    'how',
    'both',
    'each',
    'other',
    'such',
    'own',
    'so',
    's',
    't',
    'can',
    'will',
    'just',
    'now',
    'd',
    'll',
    'm',
    'o',
    're',
    've',
    'y',
]

In [None]:
corp_pos = [t for t in corp_pos if t not in sw]
corp_neg = [t for t in corp_neg if t not in sw]
corp_neutr = [t for t in corp_neutr if t not in sw]

Remove punctuation

In [None]:
corp_pos = [t for t in corp_pos if t not in string.punctuation]
corp_neg = [t for t in corp_neg if t not in string.punctuation]
corp_neutr = [t for t in corp_neutr if t not in string.punctuation]
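One caveat with the filter above: `string.punctuation` is a single string, so `t not in string.punctuation` is a substring test. Multi-character punctuation tokens such as the `` `` `` and `''` that `word_tokenize` emits for double quotes, or `...`, are not substrings of it and may survive. A possible alternative, sketched here with a hypothetical helper `is_punctuation` and invented tokens, is to test whether every character of a token is punctuation.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Invented tokens for illustration.
tokens = ['safety', '``', 'data', "''", ',', '...', 'well-tolerated']
kept = [t for t in tokens if not is_punctuation(t)]
```

Tokens that merely contain punctuation, such as hyphenated words, are kept.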

Then compute the frequency distributions

In [None]:
fd_pos = FreqDist(corp_pos)
fd_neg = FreqDist(corp_neg)
fd_neutr = FreqDist(corp_neutr)

In [None]:
top_ten = pd.DataFrame({
    'Positive': [w[0] for w in fd_pos.most_common(10)],
    'Pos_rate': [w[1] / len(data[data['Positive'] == 1]) for w in fd_pos.most_common(10)],
    'Negative': [w[0] for w in fd_neg.most_common(10)],
    'Neg_rate': [w[1] / len(data[data['Negative'] == 1]) for w in fd_neg.most_common(10)],
    'Neutral': [w[0] for w in fd_neutr.most_common(10)],
    'Neutr_rate': [w[1] / len(data[data['Neutral'] == 1]) for w in fd_neutr.most_common(10)]
}, index=range(1,11))

In [None]:
top_ten

'Safety' and 'data' seem to be very common words in both the Positive and Negative corpora, although the proportions in the negative case are considerably higher. Words like 'should', 'further' and 'limited' seem like obvious predictors for the negative class. In the neutral class the word 'studies' is the most common one, with 'safety' and 'data' ranking lower. It is hence possible to hypothesize the following distinction:

* Many negative and positive comments tend to be **argumentative** about why the given data does or does not show evidence of product safety. When safety concerns are present, the authors tend to be more explicit in their wording about 'data' and 'safety'
* Neutral comments tend to be **descriptive** with respect to the procedures followed when conducting and reporting the given study/studies

This could prove to be a useful feature in a one-vs-all classification approach. Obviously the dataset here is very limited, so the general applicability of these findings is of course questionable.

In [None]:
list(top_ten['Neutral'])
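The rates in the table above divide the total number of occurrences of a word by the number of sentences in the class, so a word repeated within one sentence is counted more than once. If the share of sentences containing a word is of more interest, a per-sentence count can be used instead. The sketch below contrasts the two on invented, already tokenized sentences; it uses `collections.Counter` rather than `FreqDist` purely to stay self-contained.

```python
from collections import Counter

# Toy tokenized sentences, invented for illustration.
sentences = [
    ['safety', 'data', 'safety', 'profile'],
    ['clinical', 'data'],
    ['studies', 'conducted'],
]

# Token rate: total occurrences divided by the number of sentences.
token_counts = Counter(tok for sent in sentences for tok in sent)
token_rate = {w: c / len(sentences) for w, c in token_counts.items()}

# Document frequency: share of sentences containing the word at least once.
doc_counts = Counter(tok for sent in sentences for tok in set(sent))
doc_freq = {w: c / len(sentences) for w, c in doc_counts.items()}
```

Here 'safety' has a token rate of 2/3 but appears in only one of the three sentences.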

### Bigram and trigram analysis

A similar approach can be used for sequences of two and three words

In [None]:
from nltk import bigrams, trigrams

In [None]:
bi_fd_pos = FreqDist(list(bigrams(corp_pos)))
bi_fd_neg = FreqDist(list(bigrams(corp_neg)))
bi_fd_neutr = FreqDist(list(bigrams(corp_neutr)))

top_ten_bi = pd.DataFrame({
    'Positive': [w[0] for w in bi_fd_pos.most_common(10)],
    'Pos_rate': [w[1] / len(data[data['Positive'] == 1]) for w in bi_fd_pos.most_common(10)],
    'Negative': [w[0] for w in bi_fd_neg.most_common(10)],
    'Neg_rate': [w[1] / len(data[data['Negative'] == 1]) for w in bi_fd_neg.most_common(10)],
    'Neutral': [w[0] for w in bi_fd_neutr.most_common(10)],
    'Neutr_rate': [w[1] / len(data[data['Neutral'] == 1]) for w in bi_fd_neutr.most_common(10)]
}, index=range(1,11))

top_ten_bi

In [None]:
tri_fd_pos = FreqDist(list(trigrams(corp_pos)))
tri_fd_neg = FreqDist(list(trigrams(corp_neg)))
tri_fd_neutr = FreqDist(list(trigrams(corp_neutr)))

top_ten_tri = pd.DataFrame({
    'Positive': [w[0] for w in tri_fd_pos.most_common(10)],
    'Pos_rate': [w[1] / len(data[data['Positive'] == 1]) for w in tri_fd_pos.most_common(10)],
    'Negative': [w[0] for w in tri_fd_neg.most_common(10)],
    'Neg_rate': [w[1] / len(data[data['Negative'] == 1]) for w in tri_fd_neg.most_common(10)],
    'Neutral': [w[0] for w in tri_fd_neutr.most_common(10)],
    'Neutr_rate': [w[1] / len(data[data['Neutral'] == 1]) for w in tri_fd_neutr.most_common(10)]
}, index=range(1,11))

top_ten_tri

These analyses suggest that the evaluators frequently reuse some phrases word for word when describing limitations in the drug evaluation procedure. For instance, the phrase (shown here with stop words removed)

```chmp considers following measures```

appears a total of four times (11 %) in the negative class, but not a single time in the positive class.

From a statistical point of view the dataset is probably too small to train efficiently on trigram-based features. Bigram features could offer some useful information, since at least the phrases 'safety profile' and 'clinical data' are repeated in a non-negligible portion of the Positive examples.
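One thing to keep in mind when reading these tables: each corpus was built by joining all sentences of a class into one token list, and stop words and punctuation were removed before the n-grams were formed. An n-gram can therefore span the end of one sentence and the start of the next. If that is unwanted, the n-grams can be formed sentence by sentence. The sketch below uses a hypothetical helper `ngrams_per_sentence` and invented tokens.

```python
from collections import Counter

def ngrams_per_sentence(tokenized_sentences, n):
    """Yield n-grams as tuples without crossing sentence boundaries."""
    for tokens in tokenized_sentences:
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# Toy tokenized sentences, invented for illustration.
sentences = [
    ['safety', 'profile', 'acceptable'],
    ['clinical', 'data', 'limited'],
]

bigram_counts = Counter(ngrams_per_sentence(sentences, 2))
```

With this approach the pair ('acceptable', 'clinical'), which would arise from concatenating the two sentences, is never counted.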

### Semantic lexicon

A semantic lexicon is a collection of words and phrases associated with a specific sentiment (Positive/Neutral/Negative). While some open source semantic lexicons are available, the best results could arguably be obtained with hand-curated lexicons.

To demonstrate the concept, the phrases below were gathered manually by examining the provided Excel file. This is a 'poor man's semantic lexicon' in the sense that we only include the low-hanging fruit, i.e. phrases that are clearly repeated many times throughout the data.

During the manual gathering process, the labels were hidden in order to keep the evaluation as objective as possible.

In [None]:
positives = [
    'based on the',
    'bioequivalence',
    'bioequivalent',
    'biosimilarity',
    'accepted by the chmp',
    'comparable',
    'these objectives have been met',
    'the available safety data are considered supportive'
]

negatives = [
    'should be provided',
    'data are considered very limited',
    'chmp considers the following measures',
]

phrases = positives + negatives
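How the phrase list is used afterwards is not shown here. One plausible use, sketched below under that assumption, is to turn each phrase into a binary feature that indicates whether it occurs in a sentence. The helper `lexicon_features`, the shortened `toy_*` lists and the example sentence are hypothetical and only illustrate the idea.

```python
# Shortened lists for illustration; the names are chosen to avoid clashing
# with the full lists defined above.
toy_positives = ['bioequivalent', 'comparable']
toy_negatives = ['should be provided', 'data are considered very limited']
toy_phrases = toy_positives + toy_negatives

def lexicon_features(sentence, phrases):
    """Return one 0/1 indicator per phrase using case-insensitive substring matching."""
    text = sentence.lower()
    return [int(p in text) for p in phrases]

# Invented example sentence.
features = lexicon_features(
    'Further data should be provided by the applicant.', toy_phrases
)
```

Plain substring matching is the simplest choice; it would also match a phrase inside a longer word, so token-based matching may be preferable for short entries.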