# Load data

Load the data and check its quality.

In [None]:
import pandas as pd

In [None]:
data = pd.read_excel('../data/sentences_with_sentiment.xlsx')
data.head()

In [None]:
len(data)

Drop the ID column since it contains no useful information

In [None]:
data = data.drop('ID', axis=1)

The labels appear to be one-hot encoded already. Let's verify that each row has exactly one label set.

In [None]:
all(data.loc[:, ['Positive', 'Negative', 'Neutral']].sum(axis=1) == 1)
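
Note that the row-sum check alone would also pass for a row such as (2, -1, 0). A slightly stricter sketch is shown below; it uses a small made-up frame (`toy`, hypothetical values) in place of the real data and additionally verifies that every label is either 0 or 1:

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Positive': [1, 0, 0, 2],
    'Negative': [0, 1, 0, -1],
    'Neutral':  [0, 0, 1, 0],
})

label_cols = ['Positive', 'Negative', 'Neutral']
labels = toy[label_cols]

# Every entry must be 0 or 1 ...
is_binary = labels.isin([0, 1]).all(axis=1)
# ... and exactly one label must be set per row
sums_to_one = labels.sum(axis=1) == 1

# A row is validly one-hot encoded only if both conditions hold
valid_rows = is_binary & sums_to_one
print(valid_rows)
```

In the toy frame the last row sums to 1 but is not a valid one-hot vector, so only the combined check flags it.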

Check the `Sentence` column for duplicates

In [None]:
pd.set_option('display.max_rows', 300)
dup = data['Sentence'].duplicated()
dup[dup]

Are these rows still duplicates when the labels are taken into account as well?

In [None]:
dup_labels = data[['Sentence', 'Positive', 'Negative', 'Neutral']].duplicated()
dup_labels[dup_labels]

In [None]:
# Index.equals also handles index sets of different lengths, where an elementwise == would raise
dup[dup].index.equals(dup_labels[dup_labels].index)

Yes, it would seem so. In principle, duplicated sentences could represent opinions given by different experts, but since the labels are identical as well, that does not appear to be the case in this sample. At this stage we'll consider it safe to discard the duplicated rows, since they add no obvious value.
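
As a complementary check, one could look directly for sentences that occur with more than one label combination. The sketch below is only an illustration on a made-up frame (`toy`, hypothetical values), not part of the original analysis:

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Sentence': ['a', 'a', 'b', 'b', 'c'],
    'Positive': [1, 1, 1, 0, 0],
    'Negative': [0, 0, 0, 1, 0],
    'Neutral':  [0, 0, 0, 0, 1],
})
label_cols = ['Positive', 'Negative', 'Neutral']

# Keep one row per distinct (sentence, labels) combination
distinct = toy.drop_duplicates(subset=['Sentence'] + label_cols)

# Count how many different label combinations each sentence has
n_label_sets = distinct.groupby('Sentence').size()

# Sentences labelled in more than one way
conflicting = n_label_sets[n_label_sets > 1].index.tolist()
print(conflicting)
```

An empty `conflicting` list on the real data would support the conclusion that the duplicates carry no disagreement between annotators.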

In [None]:
data = data.drop(dup[dup].index)
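
For reference, the same result can be obtained with `DataFrame.drop_duplicates`. The sketch below compares the two approaches on a made-up frame (`toy`, hypothetical values); both keep the first occurrence of each sentence:

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Sentence': ['a', 'b', 'a', 'c', 'b'],
    'Positive': [1, 0, 1, 0, 0],
    'Negative': [0, 1, 0, 0, 1],
    'Neutral':  [0, 0, 0, 1, 0],
})

# Approach used above: drop the rows flagged by duplicated()
toy_dup = toy['Sentence'].duplicated()
by_index = toy.drop(toy_dup[toy_dup].index)

# Equivalent one-liner: keep the first row for each sentence
by_method = toy.drop_duplicates(subset='Sentence')

print(by_index.equals(by_method))
```

Either way the original index labels are preserved, so the remaining rows are no longer numbered consecutively; `reset_index(drop=True)` can be applied if a contiguous index is needed.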

In [None]:
len(data)

In [None]:
data.head()

Now we can check the label distribution

In [None]:
print('Positive samples', len(data[data['Positive'] == 1]))
print('Negative samples', len(data[data['Negative'] == 1]))
print('Neutral samples', len(data[data['Neutral'] == 1]))

The class distribution is clearly skewed towards positive sentiment.
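
To quantify the skew, the counts can be turned into proportions. The following sketch uses a made-up frame (`toy`, hypothetical values), so the numbers it prints say nothing about the real data:

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Positive': [1, 1, 0, 0],
    'Negative': [0, 0, 1, 0],
    'Neutral':  [0, 0, 0, 1],
})
label_cols = ['Positive', 'Negative', 'Neutral']

# With valid one-hot labels, the column sums are the per-class counts
counts = toy[label_cols].sum()

# Share of each class among all rows
shares = counts / len(toy)

# Collapse the one-hot columns into a single label column
label = toy[label_cols].idxmax(axis=1)

print(counts)
print(shares)
```

The single `label` column is convenient for later steps such as stratified splitting, assuming the one-hot encoding has been validated first.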