# Data cleaning

Load the data and check its quality

In [1]:
import pandas as pd

In [2]:
data = pd.read_excel('../data/sentences_with_sentiment.xlsx')
data.head()

Unnamed: 0,ID,Sentence,Positive,Negative,Neutral
0,1,The results in 2nd line treatment show an ORR ...,1,0,0
1,2,The long duration of response and high durable...,1,0,0
2,3,The median OS time in the updated results exce...,0,0,1
3,4,"Therefore, the clinical benefit in 2nd line tr...",1,0,0
4,5,"The data provided in 1st line, although prelim...",1,0,0


In [3]:
len(data)

266

Drop the ID column since it contains no useful information

In [4]:
data = data.drop('ID', axis=1)

The labels appear to be one-hot encoded already. Let's verify the encoding by checking that the labels in every row sum to 1.

In [5]:
all(data.loc[:, ['Positive', 'Negative', 'Neutral']].sum(axis=1) == 1)

True
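
The row-sum check also passes for rows whose labels add up to 1 without being binary (for example 2, -1, 0). A slightly stricter variant is sketched below on a small made-up frame; `toy` and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the last row sums to 1 but is not a valid one-hot row.
toy = pd.DataFrame({
    'Positive': [1, 0, 2],
    'Negative': [0, 0, -1],
    'Neutral':  [0, 1, 0],
})
labels = toy[['Positive', 'Negative', 'Neutral']]

sums_to_one = labels.sum(axis=1) == 1          # the row-sum check
is_binary = labels.isin([0, 1]).all(axis=1)    # every label is 0 or 1
valid = sums_to_one & is_binary
print(valid.tolist())
```

On the real data the same two masks could be computed from `data` instead of `toy`.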

Check the `Sentence` column for duplicates

In [6]:
pd.set_option('display.max_rows', 300)
dup = data['Sentence'].duplicated()
dup[dup]

19     True
134    True
136    True
137    True
138    True
139    True
140    True
141    True
142    True
143    True
144    True
146    True
147    True
148    True
149    True
150    True
151    True
152    True
153    True
154    True
155    True
157    True
158    True
159    True
160    True
161    True
162    True
163    True
164    True
165    True
Name: Sentence, dtype: bool

Are the same rows still duplicates when the labels are taken into account as well?

In [7]:
dup_labels = data[['Sentence', 'Positive', 'Negative', 'Neutral']].duplicated()
dup_labels[dup_labels]

19     True
134    True
136    True
137    True
138    True
139    True
140    True
141    True
142    True
143    True
144    True
146    True
147    True
148    True
149    True
150    True
151    True
152    True
153    True
154    True
155    True
157    True
158    True
159    True
160    True
161    True
162    True
163    True
164    True
165    True
dtype: bool

In [8]:
# Index.equals also handles indexes of different lengths, where == would raise
dup[dup].index.equals(dup_labels[dup_labels].index)

True

Yes, they are. In principle, duplicated sentences could represent opinions given by different experts, but since the labels are identical as well, that does not appear to be the case in this sample. At this stage we'll consider it safe to discard the duplicated rows, since they add no obvious value.

In [9]:
data = data.drop(dup[dup].index)

In [10]:
len(data)

236
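
Since the duplicated sentences also carry identical labels, the mask-and-drop steps could probably be collapsed into a single `DataFrame.drop_duplicates` call. The sketch below compares both approaches on a made-up frame (`toy` is hypothetical); it relies on `keep='first'` being the default for both `duplicated` and `drop_duplicates`.

```python
import pandas as pd

# Hypothetical toy frame with one repeated sentence.
toy = pd.DataFrame({
    'Sentence': ['a b c', 'd e f', 'a b c'],
    'Positive': [1, 0, 1],
    'Negative': [0, 1, 0],
    'Neutral':  [0, 0, 0],
})

# Mask-based approach: mark repeats of 'Sentence' and drop them by index.
dup = toy['Sentence'].duplicated()
by_mask = toy.drop(dup[dup].index)

# One-step alternative; the first occurrence of each sentence is retained.
by_method = toy.drop_duplicates(subset='Sentence')

print(by_mask.equals(by_method))
```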

In [11]:
data.head()

Unnamed: 0,Sentence,Positive,Negative,Neutral
0,The results in 2nd line treatment show an ORR ...,1,0,0
1,The long duration of response and high durable...,1,0,0
2,The median OS time in the updated results exce...,0,0,1
3,"Therefore, the clinical benefit in 2nd line tr...",1,0,0
4,"The data provided in 1st line, although prelim...",1,0,0


Now we can check the label distribution

In [75]:
print('Positive samples', len(data[data['Positive'] == 1]))
print('Negative samples', len(data[data['Negative'] == 1]))
print('Neutral samples', len(data[data['Neutral'] == 1]))

Positive samples 140
Negative samples 32
Neutral samples 64
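
Because the labels are one-hot encoded, column sums give the same counts and column means give the class shares directly. A minimal sketch on a made-up frame (`toy` is hypothetical, not the dataset):

```python
import pandas as pd

# Hypothetical one-hot labels for five samples.
toy = pd.DataFrame({
    'Positive': [1, 1, 0, 0, 1],
    'Negative': [0, 0, 1, 0, 0],
    'Neutral':  [0, 0, 0, 1, 0],
})

counts = toy.sum()    # number of samples per label
shares = toy.mean()   # fraction of samples per label
print(counts)
print(shares)
```

With the real frame, the same two calls would be applied to `data[['Positive', 'Negative', 'Neutral']]`.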


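The class distribution is clearly skewed towards positive sentiment: 140 of the 236 samples (about 59%) are positive, against 64 neutral (about 27%) and 32 negative (about 14%). The sizeable neutral share could be problematic, since classifiers will probably have a hard time picking up the subtle differences between neutral sentences and the other two classes.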

## Data exploration

Let's build some intuition about the data by listing the most common words. The process involves building corpora of the sentences for each of the three labels, filtering out known English stop words and punctuation, and finally computing frequency distributions for the individual corpora as well as the composite corpus. The excellent nltk library is used throughout.

In [76]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string

Start by creating a corpus for each of the Positive, Negative and Neutral labels and tokenizing it

In [77]:
corp_pos = word_tokenize(' '.join(data.loc[data['Positive'] == 1, 'Sentence']).lower())
corp_neg = word_tokenize(' '.join(data.loc[data['Negative'] == 1, 'Sentence']).lower())
corp_neutr = word_tokenize(' '.join(data.loc[data['Neutral'] == 1, 'Sentence']).lower())

In [78]:
corp_pos[:15]

['the',
 'results',
 'in',
 '2nd',
 'line',
 'treatment',
 'show',
 'an',
 'orr',
 'of',
 '33',
 '%',
 'with',
 'some',
 'patients']

Filter out known stopwords. Note that the stopword list must be downloaded before this step:

```
>>> import nltk
>>> nltk.download('stopwords')
```

In [79]:
# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
corp_pos = [t for t in corp_pos if t not in stop_words]
corp_neg = [t for t in corp_neg if t not in stop_words]
corp_neutr = [t for t in corp_neutr if t not in stop_words]

Remove punctuation

In [80]:
corp_pos = [t for t in corp_pos if t not in string.punctuation]
corp_neg = [t for t in corp_neg if t not in string.punctuation]
corp_neutr = [t for t in corp_neutr if t not in string.punctuation]
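
Note that `t not in string.punctuation` is a substring test against one long string, so only tokens that happen to be a contiguous piece of that string are removed; a multi-character token such as `'...'` is kept. A stricter variant is sketched below. The helper name `is_punct` and the sample tokens are made up for illustration, and whether the stricter filter changes the frequency lists in this notebook has not been checked.

```python
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Made-up sample tokens, including multi-character punctuation.
tokens = ['orr', '%', '...', '``', "''", '2nd']

substring_filter = [t for t in tokens if t not in string.punctuation]
strict_filter = [t for t in tokens if not is_punct(t)]

print(substring_filter)  # keeps '...', '``' and "''"
print(strict_filter)     # keeps only 'orr' and '2nd'
```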

Then compute the frequency distributions

In [81]:
fd_pos = FreqDist(corp_pos)
fd_neg = FreqDist(corp_neg)
fd_neutr = FreqDist(corp_neutr)
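
For the composite corpus mentioned at the start of this section, the per-class counts could simply be added together. `FreqDist` subclasses `collections.Counter`, so the sketch below uses plain `Counter` objects with made-up tokens to stay runnable without NLTK; the assumption is that the same `+` applies to `fd_pos`, `fd_neg` and `fd_neutr`.

```python
from collections import Counter

# Made-up token lists standing in for the three per-label corpora.
fd_a = Counter(['safety', 'data', 'safety'])
fd_b = Counter(['safety', 'limited'])
fd_c = Counter(['study', 'data'])

# Adding counters sums the counts word by word.
fd_all = fd_a + fd_b + fd_c
print(fd_all.most_common(2))
```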

In [82]:
fd_pos.most_common(20)

[('safety', 41),
 ('data', 39),
 ('study', 29),
 ('clinical', 27),
 ('patients', 25),
 ('efficacy', 23),
 ('treatment', 22),
 ('considered', 21),
 ('profile', 19),
 ('product', 17),
 ('bioequivalence', 15),
 ('studies', 14),
 ('support', 14),
 ('overall', 14),
 ('subjects', 14),
 ('sma', 14),
 ('results', 12),
 ('application', 12),
 ('mg', 12),
 ('rate', 11)]

In [83]:
fd_neg.most_common(20)

[('safety', 14),
 ('data', 11),
 ('patients', 9),
 ('study', 8),
 ('treatment', 7),
 ('period', 6),
 ('studies', 6),
 ('combination', 6),
 ('chmp', 5),
 ('address', 5),
 ('efficacy', 5),
 ('limited', 5),
 ('considers', 4),
 ('following', 4),
 ('measures', 4),
 ('necessary', 4),
 ('additional', 4),
 ('related', 4),
 ('provided', 4),
 ('term', 4)]

In [84]:
fd_neutr.most_common(20)

[('studies', 19),
 ('study', 15),
 ('safety', 13),
 ('efficacy', 11),
 ('patients', 10),
 ('dose', 10),
 ('insulin', 10),
 ('data', 9),
 ('difference', 8),
 ('clinical', 8),
 ('product', 7),
 ('additional', 7),
 ('compared', 7),
 ('frc', 7),
 ('infections', 6),
 ('glargine', 6),
 ('lixisenatide', 6),
 ('provided', 5),
 ('related', 5),
 ('reactions', 5)]