# Data cleaning

Load and check data quality

In [1]:
import pandas as pd

In [2]:
data = pd.read_excel('../data/sentences_with_sentiment.xlsx')
data.head()

|   | ID | Sentence | Positive | Negative | Neutral |
|---|----|----------|----------|----------|---------|
| 0 | 1 | The results in 2nd line treatment show an ORR ... | 1 | 0 | 0 |
| 1 | 2 | The long duration of response and high durable... | 1 | 0 | 0 |
| 2 | 3 | The median OS time in the updated results exce... | 0 | 0 | 1 |
| 3 | 4 | Therefore, the clinical benefit in 2nd line tr... | 1 | 0 | 0 |
| 4 | 5 | The data provided in 1st line, although prelim... | 1 | 0 | 0 |


In [3]:
len(data)

266

Drop the ID column since it contains no useful information

In [4]:
data = data.drop('ID', axis=1)

Labels seem to be already one-hot encoded. Let's ensure the encoding is valid 

In [5]:
all(data.loc[:, ['Positive', 'Negative', 'Neutral']].sum(axis=1) == 1)

True
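
The row-sum check above would also pass for a row such as `(2, -1, 0)`, so a slightly stricter validation can be useful. The following is a minimal sketch on a toy frame (the `toy` data and the variable names are illustrative assumptions, not part of the dataset) that additionally requires every label to be 0 or 1:

```python
import pandas as pd

# Toy frame standing in for the real data; the last row is deliberately invalid.
toy = pd.DataFrame({
    'Positive': [1, 0, 2],
    'Negative': [0, 0, -1],
    'Neutral':  [0, 1, 0],
})
label_cols = ['Positive', 'Negative', 'Neutral']

# Every label value must be 0 or 1 ...
only_binary = toy[label_cols].isin([0, 1]).all(axis=1)
# ... and exactly one label must be set per row.
one_active = toy[label_cols].sum(axis=1) == 1

valid = only_binary & one_active
print(valid.tolist())
```

The last toy row sums to 1 and would pass the sum-only check, but it is rejected once the values are required to be binary.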

Check ```Sentence``` column for duplicates

In [6]:
pd.set_option('display.max_rows', 300)
dup = data['Sentence'].duplicated()
dup[dup]

19     True
134    True
136    True
137    True
138    True
139    True
140    True
141    True
142    True
143    True
144    True
146    True
147    True
148    True
149    True
150    True
151    True
152    True
153    True
154    True
155    True
157    True
158    True
159    True
160    True
161    True
162    True
163    True
164    True
165    True
Name: Sentence, dtype: bool

Are the same rows still duplicates when the labels are taken into account as well?

In [7]:
dup_labels = data[['Sentence', 'Positive', 'Negative', 'Neutral']].duplicated()
dup_labels[dup_labels]

19     True
134    True
136    True
137    True
138    True
139    True
140    True
141    True
142    True
143    True
144    True
146    True
147    True
148    True
149    True
150    True
151    True
152    True
153    True
154    True
155    True
157    True
158    True
159    True
160    True
161    True
162    True
163    True
164    True
165    True
dtype: bool

In [8]:
all(dup[dup].index == dup_labels[dup_labels].index)

True

Yes, it would seem so.

In principle, duplicated sentences could represent opinions given by different experts, but since the labels are identical as well, this does not seem to be the case judging from this sample. The more likely explanation is that each duplicated value represents a common phrase that is _actually_ repeated across various samples.
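
Note that `duplicated()` by default flags only the second and later occurrences of a value. To see how many copies each repeated sentence has, including the first one, `keep=False` can be used. A small sketch on toy data (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    'Sentence': ['a b c', 'd e f', 'a b c', 'a b c'],
    'Positive': [1, 0, 1, 1],
})

# keep=False marks every copy of a repeated value, including the first occurrence
all_copies = toy[toy['Sentence'].duplicated(keep=False)]

# Number of rows per repeated sentence
copies_per_sentence = all_copies.groupby('Sentence').size()
print(copies_per_sentence)
```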

Now we can check the label distribution

In [9]:
print('Positive samples', len(data[data['Positive'] == 1]))
print('Negative samples', len(data[data['Negative'] == 1]))
print('Neutral samples', len(data[data['Neutral'] == 1]))

Positive samples 160
Negative samples 36
Neutral samples 70


The class distribution is clearly skewed towards positive sentiment. In addition, a sizeable portion of the samples (70 of 266) is neutral, which could be problematic since classifiers will probably have a hard time picking up the subtle differences between neutral and the other classes.

While we're at it, let's produce a quick naive baseline for classification accuracy:

In [10]:
len(data[data['Positive'] == 1]) / len(data)

0.6015037593984962

By simply using the largest class as a prediction each time, we should expect on average 60 % accuracy (non-weighted). Any further classifiers should aim to at least outperform this metric.
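
The same baseline can be reproduced from the class counts alone. The sketch below uses the counts printed earlier and also shows, as an aside, what the always-predict-the-majority strategy would score on a class-balanced metric (mean per-class recall), which is an addition for comparison and not something computed in this notebook:

```python
from collections import Counter

# Class counts as printed in the label distribution cell
counts = Counter({'Positive': 160, 'Negative': 36, 'Neutral': 70})

majority_label, majority_count = counts.most_common(1)[0]
total = sum(counts.values())

# Plain accuracy of always predicting the majority class
baseline_accuracy = majority_count / total

# Mean per-class recall of the same predictor: recall is 1 for the
# majority class and 0 for every other class.
balanced_accuracy = 1 / len(counts)

print(majority_label, round(baseline_accuracy, 4), round(balanced_accuracy, 4))
```

On a balanced metric the naive baseline drops to one third, which is worth keeping in mind when comparing classifiers on such a skewed dataset.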

## Data exploration

### Unigram frequency analysis

Let's try to build some intuition about the data by listing the most common words. The process involves building corpora of the sentences representing the three labels, filtering out known English stop words and punctuation, and finally computing the frequency distributions for the individual corpora as well as the composite corpus. Throughout this process the excellent nltk library is used.

In [11]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string

Start by creating the corpora of Positive, Negative and Neutral labels respectively and tokenizing those

In [12]:
corp_pos = word_tokenize(' '.join(data.loc[data['Positive'] == 1, 'Sentence']).lower())
corp_neg = word_tokenize(' '.join(data.loc[data['Negative'] == 1, 'Sentence']).lower())
corp_neutr = word_tokenize(' '.join(data.loc[data['Neutral'] == 1, 'Sentence']).lower())

In [13]:
corp_pos[:15]

['the',
 'results',
 'in',
 '2nd',
 'line',
 'treatment',
 'show',
 'an',
 'orr',
 'of',
 '33',
 '%',
 'with',
 'some',
 'patients']

Filter out known stopwords. Notice that before this operation stopwords need to be downloaded using:

```
>>> import nltk
>>> nltk.download('stopwords')
```

Then, a base stop word list can be obtained from ```nltk.corpus.stopwords.words('english')```. Below, we further tweak this base list in an attempt to reduce the noise caused by meaningless words such as 'a', 'the', 'it' etc. while still keeping acceptable discriminative power between classes.

In [14]:
sw = [
    'i',
    'me',
    'my',
    'myself',
    'we',
    'our',
    'ours',
    'ourselves',
    'you',
    "you're",
    "you've",
    "you'll",
    "you'd",
    'your',
    'yours',
    'yourself',
    'yourselves',
    'he',
    'him',
    'his',
    'himself',
    'she',
    "she's",
    'her',
    'hers',
    'herself',
    'it',
    "it's",
    'its',
    'itself',
    'they',
    'them',
    'their',
    'theirs',
    'themselves',
    'what',
    'which',
    'who',
    'whom',
    'this',
    'that',
    "that'll",
    'these',
    'those',
    'am',
    'is',
    'are',
    'was',
    'were',
    'be',
    'been',
    'being',
    'have',
    'has',
    'had',
    'having',
    'do',
    'does',
    'did',
    'doing',
    'a',
    'an',
    'the',
    'and',
    'but',
    'if',
    'or',
    'as',
    'of',
    'at',
    'by',
    'for',
    'with',
    'about',
    'into',
    'through',
    'during',
    'to',
    'from',
    'in',
    'out',
    'on',
    'off',
    'then',
    'once',
    'here',
    'there',
    'when',
    'where',
    'why',
    'how',
    'both',
    'each',
    'other',
    'such',
    'own',
    'so',
    's',
    't',
    'can',
    'will',
    'just',
    'now',
    'd',
    'll',
    'm',
    'o',
    're',
    've',
    'y',
]

In [15]:
corp_pos = [t for t in corp_pos if t not in sw]
corp_neg = [t for t in corp_neg if t not in sw]
corp_neutr = [t for t in corp_neutr if t not in sw]

Remove punctuation

In [16]:
corp_pos = [t for t in corp_pos if t not in string.punctuation]
corp_neg = [t for t in corp_neg if t not in string.punctuation]
corp_neutr = [t for t in corp_neutr if t not in string.punctuation]
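
One caveat: `t not in string.punctuation` is a substring test against a single string, so multi-character tokens such as the quote markers ``` `` ``` and `''` emitted by `word_tokenize` are not removed. The sketch below is a possible alternative, not the filtering actually used for the results in this notebook; the helper name `filter_tokens` is hypothetical. It uses sets for faster lookups and drops any token consisting only of punctuation characters:

```python
import string

def filter_tokens(tokens, stop_words):
    """Drop stop words and tokens made up solely of punctuation characters."""
    stop_set = set(stop_words)            # set membership is O(1) on average
    punct_set = set(string.punctuation)
    kept = []
    for tok in tokens:
        if tok in stop_set:
            continue
        # True for '%', ',', '``', "''" but not for 'ct-p10'
        if all(ch in punct_set for ch in tok):
            continue
        kept.append(tok)
    return kept

tokens = ['the', 'orr', 'of', '33', '%', ',', '``', 'ct-p10', "''"]
print(filter_tokens(tokens, ['the', 'of']))
```

Hyphenated terms such as `ct-p10` are kept because they contain non-punctuation characters. Switching to this filter would likely change the counts reported below slightly.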

Then check out the freqdists

In [17]:
fd_pos = FreqDist(corp_pos)
fd_neg = FreqDist(corp_neg)
fd_neutr = FreqDist(corp_neutr)

In [18]:
top_ten = pd.DataFrame({
    'Positive': [w[0] for w in fd_pos.most_common(10)],
    'Pos_rate': [w[1] / len(data[data['Positive'] == 1]) for w in fd_pos.most_common(10)],
    'Negative': [w[0] for w in fd_neg.most_common(10)],
    'Neg_rate': [w[1] / len(data[data['Negative'] == 1]) for w in fd_neg.most_common(10)],
    'Neutral': [w[0] for w in fd_neutr.most_common(10)],
    'Neutr_rate': [w[1] / len(data[data['Neutral'] == 1]) for w in fd_neutr.most_common(10)]
}, index=range(1,11))

In [19]:
top_ten

|    | Positive | Pos_rate | Negative | Neg_rate | Neutral | Neutr_rate |
|----|----------|----------|----------|----------|---------|------------|
| 1 | safety | 0.29375 | safety | 0.472222 | studies | 0.3 |
| 2 | data | 0.28125 | data | 0.388889 | safety | 0.242857 |
| 3 | study | 0.20625 | patients | 0.333333 | study | 0.214286 |
| 4 | efficacy | 0.1875 | study | 0.25 | ct-p10 | 0.171429 |
| 5 | clinical | 0.175 | should | 0.222222 | efficacy | 0.157143 |
| 6 | patients | 0.16875 | treatment | 0.194444 | data | 0.157143 |
| 7 | considered | 0.1625 | limited | 0.194444 | patients | 0.142857 |
| 8 | treatment | 0.15 | further | 0.166667 | dose | 0.142857 |
| 9 | profile | 0.14375 | address | 0.166667 | insulin | 0.142857 |
| 10 | product | 0.13125 | efficacy | 0.166667 | product | 0.128571 |
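
The rate columns are token occurrences divided by the number of sentences in the class, so they measure occurrences per sentence rather than the share of sentences containing the word. A compact way to express that calculation is sketched below with `collections.Counter`, which `FreqDist` subclasses; the helper name `top_n_rates` and the toy tokens are assumptions for illustration:

```python
from collections import Counter

def top_n_rates(tokens, n_sentences, n=10):
    """Return (token, occurrences per sentence) for the n most common tokens."""
    return [(tok, count / n_sentences)
            for tok, count in Counter(tokens).most_common(n)]

toy_tokens = ['safety', 'data', 'safety', 'study', 'safety', 'data']
print(top_n_rates(toy_tokens, n_sentences=4, n=2))
# [('safety', 0.75), ('data', 0.5)]
```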


'Safety' and 'data' are very common words in both the Positive and Negative corpora, although their rates in the Negative corpus are noticeably higher. Words like 'should', 'further' and 'limited' seem like obvious predictors for the negative class. In the Neutral class, 'studies' is the most common word, with 'safety' and 'data' ranking lower. It is hence possible to hypothesize the following distinction:

* Many negative and positive sentences tend to be **argumentative** about why the given data does or does not show evidence of product safety. With safety concerns present, the authors tend to be more explicit in their wording about 'data' and 'safety'
* Neutral comments tend to be **descriptive** with respect to the procedures followed when conducting and reporting the given study/studies

This could prove to be a useful feature in a one-vs-all classification approach. Obviously the dataset here is very limited, so the general applicability of these findings is of course questionable.
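
For a one-vs-all setup, the one-hot columns first need to be turned into a binary target per class. A minimal sketch on toy data follows; the column names `label` and `is_neutral` are hypothetical and not used elsewhere in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sentence': ['s1', 's2', 's3'],
    'Positive': [1, 0, 0],
    'Negative': [0, 1, 0],
    'Neutral':  [0, 0, 1],
})
label_cols = ['Positive', 'Negative', 'Neutral']

# With a valid one-hot encoding, idxmax returns the name of the active column
toy['label'] = toy[label_cols].idxmax(axis=1)

# Binary target for a Neutral-vs-rest classifier
toy['is_neutral'] = (toy['label'] == 'Neutral').astype(int)

print(toy[['Sentence', 'label', 'is_neutral']])
```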

In [38]:
list(top_ten['Neutral'])

['studies',
 'safety',
 'study',
 'ct-p10',
 'efficacy',
 'data',
 'patients',
 'dose',
 'insulin',
 'product']

### Bigram and trigram analysis

A similar approach can be used for sequences of two and three words

In [20]:
from nltk import bigrams, trigrams

In [21]:
bi_fd_pos = FreqDist(list(bigrams(corp_pos)))
bi_fd_neg = FreqDist(list(bigrams(corp_neg)))
bi_fd_neutr = FreqDist(list(bigrams(corp_neutr)))

top_ten_bi = pd.DataFrame({
    'Positive': [w[0] for w in bi_fd_pos.most_common(10)],
    'Pos_rate': [w[1] / len(data[data['Positive'] == 1]) for w in bi_fd_pos.most_common(10)],
    'Negative': [w[0] for w in bi_fd_neg.most_common(10)],
    'Neg_rate': [w[1] / len(data[data['Negative'] == 1]) for w in bi_fd_neg.most_common(10)],
    'Neutral': [w[0] for w in bi_fd_neutr.most_common(10)],
    'Neutr_rate': [w[1] / len(data[data['Neutral'] == 1]) for w in bi_fd_neutr.most_common(10)]
}, index=range(1,11))

top_ten_bi

|    | Positive | Pos_rate | Negative | Neg_rate | Neutral | Neutr_rate |
|----|----------|----------|----------|----------|---------|------------|
| 1 | (safety, profile) | 0.125 | (chmp, considers) | 0.111111 | (insulin, glargine) | 0.085714 |
| 2 | (clinical, data) | 0.06875 | (considers, following) | 0.111111 | (safety, profile) | 0.057143 |
| 3 | (ct-p10, mabthera) | 0.05625 | (following, measures) | 0.111111 | (reference, products) | 0.057143 |
| 4 | (efficacy, data) | 0.04375 | (necessary, address) | 0.111111 | (safety, data) | 0.057143 |
| 5 | (reference, product) | 0.04375 | (measures, necessary) | 0.083333 | (pivotal, studies) | 0.057143 |
| 6 | (safety, data) | 0.0375 | (address, missing) | 0.083333 | (et, al) | 0.057143 |
| 7 | (comparable, between) | 0.0375 | (address, issues) | 0.083333 | (efficacy, safety) | 0.057143 |
| 8 | (between, ct-p10) | 0.0375 | (issues, related) | 0.083333 | (medicinal, product) | 0.042857 |
| 9 | (bioequivalence, study) | 0.0375 | (although, dataset) | 0.083333 | (overall, safety) | 0.042857 |
| 10 | (film-coated, tablets) | 0.0375 | (dataset, afl) | 0.083333 | (profile, ct-p10) | 0.042857 |


In [22]:
tri_fd_pos = FreqDist(list(trigrams(corp_pos)))
tri_fd_neg = FreqDist(list(trigrams(corp_neg)))
tri_fd_neutr = FreqDist(list(trigrams(corp_neutr)))

top_ten_tri = pd.DataFrame({
    'Positive': [w[0] for w in tri_fd_pos.most_common(10)],
    'Pos_rate': [w[1] / len(data[data['Positive'] == 1]) for w in tri_fd_pos.most_common(10)],
    'Negative': [w[0] for w in tri_fd_neg.most_common(10)],
    'Neg_rate': [w[1] / len(data[data['Negative'] == 1]) for w in tri_fd_neg.most_common(10)],
    'Neutral': [w[0] for w in tri_fd_neutr.most_common(10)],
    'Neutr_rate': [w[1] / len(data[data['Neutral'] == 1]) for w in tri_fd_neutr.most_common(10)]
}, index=range(1,11))

top_ten_tri

|    | Positive | Pos_rate | Negative | Neg_rate | Neutral | Neutr_rate |
|----|----------|----------|----------|----------|---------|------------|
| 1 | (between, ct-p10, mabthera) | 0.0375 | (chmp, considers, following) | 0.111111 | (overall, safety, profile) | 0.042857 |
| 2 | (based, efficacy, data) | 0.025 | (considers, following, measures) | 0.111111 | (safety, profile, ct-p10) | 0.042857 |
| 3 | (data, considered, supportive) | 0.025 | (following, measures, necessary) | 0.083333 | (profile, ct-p10, appeared) | 0.042857 |
| 4 | (mg, film-coated, tablets) | 0.025 | (measures, necessary, address) | 0.083333 | (ct-p10, appeared, roughly) | 0.042857 |
| 5 | (2nd, line, treatment) | 0.01875 | (necessary, address, issues) | 0.083333 | (appeared, roughly, similar) | 0.042857 |
| 6 | (biosimilarity, ct-p10, mabthera) | 0.01875 | (address, issues, related) | 0.083333 | (roughly, similar, reference) | 0.042857 |
| 7 | (ct-p10, mabthera, considered) | 0.01875 | (although, dataset, afl) | 0.083333 | (similar, reference, product) | 0.042857 |
| 8 | (mabthera, considered, demonstrated) | 0.01875 | (dataset, afl, patients) | 0.083333 | (reference, product, although) | 0.042857 |
| 9 | (considered, demonstrated, based) | 0.01875 | (afl, patients, updated) | 0.083333 | (product, although, pooled) | 0.042857 |
| 10 | (demonstrated, based, efficacy) | 0.01875 | (patients, updated, data) | 0.083333 | (although, pooled, incidences) | 0.042857 |


From these analyses it can be determined that there seem to be some phrases the evaluators frequently use word-for-word when describing limitations in the drug evaluation procedure. For instance, the phrase

```chmp considers following measures```

appears a total of four times (11 %) in the negative class, but not one single time in the positive class. 

From a statistical point of view, the dataset is probably too small to train effectively on trigram-based features. Bigram features could offer some useful information, since at least the phrases 'safety profile' and 'clinical data' are repeated in a non-negligible portion of Positive examples.
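
Because each class corpus above is built by joining all of its sentences into one string, the resulting n-grams can straddle sentence boundaries. If that is undesirable, n-grams can be counted sentence by sentence instead. The following is a minimal pure-Python sketch of that variant; the function names and toy sentences are illustrative assumptions, and this is not how the tables above were produced:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

def count_ngrams_per_sentence(tokenized_sentences, n):
    """Count n-grams within each sentence, never across sentence boundaries."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(ngrams(tokens, n))
    return counts

sentences = [
    ['safety', 'profile', 'acceptable'],
    ['clinical', 'data', 'safety', 'profile'],
]
bi = count_ngrams_per_sentence(sentences, 2)
print(bi.most_common(1))
# [(('safety', 'profile'), 2)]
```

With a single joined corpus, the pair `('acceptable', 'clinical')` would also be counted as a bigram; here it is not, because the two words belong to different sentences.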

### TODO migrate this to somewhere else

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.todense()

matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 1, 0, 1, 0, 2, 1, 0, 1],
        [1, 0, 0, 0, 1, 0, 1, 1, 0],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [34]:
vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}