# Data Cleaning

## Introduction

This notebook goes through a necessary step of any data science project: data cleaning. Data cleaning is a time-consuming and often unenjoyable task, yet a very important one. Keep in mind: "garbage in, garbage out". Feeding dirty data into a model gives us meaningless results.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll load a CSV file of emotion-labeled phrases
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

In [0]:
import pandas as pd

df=pd.read_csv('Emotion Phrases.csv')
df.head()

Unnamed: 0,Emotions,Phrases
0,joy,On days when I feel close to my partner and ot...
1,fear,Every time I imagine that someone I love or I ...
2,anger,When I had been obviously unjustly treated and...
3,sadness,When I think about the short time that we live...
4,disgust,At a gathering I found myself involuntarily si...


When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common non-sensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

In [0]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
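
To see what each step of the first cleaning round does, the sketch below applies the same four substitutions one at a time to a made-up sentence. The sample text is an assumption for illustration, not a row from the dataset.

```python
import re
import string

# Hypothetical sample text, chosen to trigger every cleaning step
sample = "I felt [laughs] 100% HAPPY on Day1, really!"

steps = [
    ("lowercase", lambda t: t.lower()),
    ("drop [bracketed] text", lambda t: re.sub(r"\[.*?\]", "", t)),
    ("drop punctuation", lambda t: re.sub("[%s]" % re.escape(string.punctuation), "", t)),
    ("drop words containing digits", lambda t: re.sub(r"\w*\d\w*", "", t)),
]

cleaned = sample
for name, step in steps:
    cleaned = step(cleaned)
    print(f"{name:30} -> {cleaned!r}")

# The substitutions leave runs of spaces behind; collapsing them is optional,
# because word-level tokenizers ignore repeated whitespace anyway.
normalized = " ".join(cleaned.split())
print(normalized)
```

Note that whole words are removed when they contain a digit (`day1` disappears entirely), which may or may not be what you want for a given dataset.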

In [0]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(df.Phrases.apply(round1))
data_clean

Unnamed: 0,Phrases
0,on days when i feel close to my partner and ot...
1,every time i imagine that someone i love or i ...
2,when i had been obviously unjustly treated and...
3,when i think about the short time that we live...
4,at a gathering i found myself involuntarily si...
...,...
9189,yucky
9190,zeal
9191,zealous
9192,zest


In [0]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    text = re.sub('á','',text)
    return text

round2 = lambda x: clean_text_round2(x)

In [0]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.Phrases.apply(round2))
data_clean

Unnamed: 0,Phrases
0,on days when i feel close to my partner and ot...
1,every time i imagine that someone i love or i ...
2,when i had been obviously unjustly treated and...
3,when i think about the short time that we live...
4,at a gathering i found myself involuntarily si...
...,...
9189,yucky
9190,zeal
9191,zealous
9192,zest


In [0]:
data_clean['Emotion']=df['Emotions']
# data_clean.Phrases.loc['anger']

In [0]:
data_clean.head()

Unnamed: 0,Phrases,Emotion
0,on days when i feel close to my partner and ot...,joy
1,every time i imagine that someone i love or i ...,fear
2,when i had been obviously unjustly treated and...,anger
3,when i think about the short time that we live...,sadness
4,at a gathering i found myself involuntarily si...,disgust


**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:
* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...
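
The two follow-up edits listed above can be sketched in plain Python. The suffix-stripping function here is a deliberately crude stand-in for a real stemmer (in practice a library stemmer or lemmatizer would be used), and the sample sentence is made up; both are assumptions for illustration.

```python
def toy_stem(word):
    """Strip a few common suffixes. A crude illustration, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "sing" stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigrams(tokens):
    """Join each pair of neighbouring tokens into a single term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


tokens = "thank you for cheering and we cheer too".split()

stems = [toy_stem(t) for t in tokens]
print(stems)            # 'cheering' and 'cheer' now share the stem 'cheer'
print(bigrams(tokens))  # includes 'thank you' as a single term
```

scikit-learn's CountVectorizer can also emit bi-grams directly through its `ngram_range` parameter, so the hand-written `bigrams` helper is only there to show the idea.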

## Organizing The Data

I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
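
To make the row and column layout concrete, here is a minimal hand-rolled document-term matrix in plain Python. The three documents and the tiny stop-word set are made up for illustration; the token pattern mirrors CountVectorizer's default of keeping only words with two or more characters.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only
docs = ["joy and zest", "zest for life", "the fear of the dark"]
stop_words = {"and", "for", "the", "of"}


def tokenize(text):
    """Lowercase, keep words of 2+ characters, then drop stop words."""
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [w for w in words if w not in stop_words]


counts = [Counter(tokenize(doc)) for doc in docs]

# Columns: the sorted vocabulary. Rows: one per document.
vocab = sorted(set().union(*counts))
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Most entries are zero even in this tiny example. With thousands of phrases and roughly ten thousand distinct words, the real matrix is far sparser, as the all-zero corners of the output below show.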

In [0]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.Phrases)
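# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 need it instead of get_feature_names_out()
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())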
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aa,abandoned,abdomen,abdominal,abhor,abhorr,abhorred,abhorrence,abhorrent,abilities,ability,able,abnormal,abomin,abominable,abominably,abominate,abomination,aboriginal,aborted,abortion,aboveboard,abroad,abrupt,abruptely,abruptly,absailing,abscence,absence,absent,absentminded,absentmindedness,absolutely,absurd,abuse,abused,abusing,abusive,abut,academic,...,yielding,york,young,younger,youngest,youngish,youngsters,youngstters,yournals,youth,youths,yr,yrs,yucki,yucky,yugoslavia,yukky,zaire,zalu,zambezi,zambia,zcbc,zeal,zealand,zealander,zealous,zeeland,zemba,zero,zesco,zest,zestfulness,zhu,zigzagging,zip,zipper,zomba,zombies,zone,zoophiliac
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9189,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9190,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9191,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9192,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [0]:
data=data_dtm.transpose()
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,9154,9155,9156,9157,9158,9159,9160,9161,9162,9163,9164,9165,9166,9167,9168,9169,9170,9171,9172,9173,9174,9175,9176,9177,9178,9179,9180,9181,9182,9183,9184,9185,9186,9187,9188,9189,9190,9191,9192,9193
aa,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
abandoned,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
abdomen,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
abdominal,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
abhor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# Derive the row count from the data instead of hard-coding 9194
data_dtm.index=range(len(data_dtm))

In [0]:
data_dtm['zzemotion']=df['Emotions']
data_dtm.head()

Unnamed: 0,aa,abandoned,abdomen,abdominal,abhor,abhorr,abhorred,abhorrence,abhorrent,abilities,ability,able,abnormal,abomin,abominable,abominably,abominate,abomination,aboriginal,aborted,abortion,aboveboard,abroad,abrupt,abruptely,abruptly,absailing,abscence,absence,absent,absentminded,absentmindedness,absolutely,absurd,abuse,abused,abusing,abusive,abut,academic,...,york,young,younger,youngest,youngish,youngsters,youngstters,yournals,youth,youths,yr,yrs,yucki,yucky,yugoslavia,yukky,zaire,zalu,zambezi,zambia,zcbc,zeal,zealand,zealander,zealous,zeeland,zemba,zero,zesco,zest,zestfulness,zhu,zigzagging,zip,zipper,zomba,zombies,zone,zoophiliac,zzemotion
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,joy
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,fear
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,anger
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,sadness
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,disgust


# Modeling

### Importing libraries

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np

In [0]:
names=list(data_dtm.columns[:-1])
len(names)

10084

In [0]:
x=data_dtm[names]
y=data_dtm['zzemotion']

In [0]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=12)

In [0]:
# models = []
# models.append(('LOR', LogisticRegression(solver='liblinear', multi_class='ovr')))
# models.append(('LDA', LinearDiscriminantAnalysis()))
# models.append(('KNN', KNeighborsClassifier(n_neighbors=5,metric='euclidean')))
# models.append(('CART', DecisionTreeClassifier(criterion='gini')))
# models.append(('DTC',DecisionTreeClassifier(criterion='entropy')))
# models.append(('NB', GaussianNB()))
# # models.append(('SVM', SVC(gamma='auto')))
# results = []
# names = []
# for name, model in models:
# 	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
# 	cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
# 	results.append(cv_results)
# 	names.append(name)
# 	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

## Logistic Regression

In [0]:
model_lor = LogisticRegression(solver='liblinear', multi_class='ovr')
model_lor.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
y_pred = model_lor.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(model_lor.score(x_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.53


In [0]:
confusion_matrix_lor = confusion_matrix(y_test, y_pred)
print(confusion_matrix_lor)
print(classification_report(y_test,y_pred))

[[173  34  21  28 150  14  29   0]
 [ 47 171  18  19  59  13  20   0]
 [ 23  15 215   8  97  17  14   0]
 [ 59   9  22 128  45  21  37   0]
 [ 11   7  15  11 425  19  12   0]
 [ 27  14  14  14 108 212  19   0]
 [ 46  24  17  30  41  23 141   0]
 [  1   0   0   0  22   0   0   0]]
              precision    recall  f1-score   support

       anger       0.45      0.39      0.41       449
     disgust       0.62      0.49      0.55       347
        fear       0.67      0.55      0.60       389
       guilt       0.54      0.40      0.46       321
         joy       0.45      0.85      0.59       500
     sadness       0.66      0.52      0.58       408
       shame       0.52      0.44      0.47       322
    surprise       0.00      0.00      0.00        23

    accuracy                           0.53      2759
   macro avg       0.49      0.45      0.46      2759
weighted avg       0.55      0.53      0.52      2759
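
Each number in the classification report can be recovered from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below copies the logistic regression matrix printed above and recomputes accuracy, precision, and recall in plain Python; the variable names are illustrative.

```python
labels = ["anger", "disgust", "fear", "guilt", "joy", "sadness", "shame", "surprise"]

# Logistic regression confusion matrix from the output above (rows = true, columns = predicted)
cm = [
    [173, 34, 21, 28, 150, 14, 29, 0],
    [47, 171, 18, 19, 59, 13, 20, 0],
    [23, 15, 215, 8, 97, 17, 14, 0],
    [59, 9, 22, 128, 45, 21, 37, 0],
    [11, 7, 15, 11, 425, 19, 12, 0],
    [27, 14, 14, 14, 108, 212, 19, 0],
    [46, 24, 17, 30, 41, 23, 141, 0],
    [1, 0, 0, 0, 22, 0, 0, 0],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

recall, precision = {}, {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                      # how often this class truly occurs
    predicted_count = sum(row[i] for row in cm)  # how often the model predicts it
    recall[label] = cm[i][i] / true_count
    # A class that is never predicted has no defined precision; report 0.0, as scikit-learn does
    precision[label] = cm[i][i] / predicted_count if predicted_count else 0.0

print(f"accuracy = {accuracy:.2f}")
print(f"joy: precision = {precision['joy']:.2f}, recall = {recall['joy']:.2f}")
print(f"surprise: predicted {sum(row[7] for row in cm)} times")
```

Two things stand out. The model predicts joy 947 times although only 500 test phrases are joy, which explains its high recall and low precision. It never predicts surprise at all, which is why that row shows 0.00 throughout. For context, always guessing the largest class (joy, 500 of 2759) would score about 0.18 accuracy.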





#### ROC CURVE

In [0]:
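import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize

# ROC curves are defined for binary problems, so this 8-class model is scored one-vs-rest:
# each emotion in turn is treated as the positive class against all the others.
classes = model_lor.classes_
y_score = model_lor.predict_proba(x_test)
y_test_bin = label_binarize(y_test, classes=classes)
logit_roc_auc = roc_auc_score(y_test, y_score, multi_class='ovr', labels=classes)
plt.figure()
for i, label in enumerate(classes):
    fpr, tpr, thresholds = roc_curve(y_test_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=label)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (one-vs-rest, macro AUC = %0.2f)' % logit_roc_auc)
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()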

## Linear Discriminant Analysis

In [0]:
model_lda=LinearDiscriminantAnalysis()
model_lda.fit(x_train,y_train)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [0]:
y_pred = model_lda.predict(x_test)
print('Accuracy of linear discriminant analysis classifier on test set: {:.2f}'.format(model_lda.score(x_test, y_test)))

Accuracy of linear discriminant analysis classifier on test set: 0.27


In [0]:
confusion_matrix_lda = confusion_matrix(y_test, y_pred)
print(confusion_matrix_lda)
print(classification_report(y_test,y_pred))

[[ 66  41  37  49  47 156  44   9]
 [ 45  78  42  41  49  54  34   4]
 [ 30  41 101  35  48  94  31   9]
 [ 25  38  36  72  51  44  51   4]
 [ 25  25  41  46 154 169  30  10]
 [ 29  21  33  36  50 202  34   3]
 [ 43  33  43  43  44  35  78   3]
 [  1   0   0   0   1  19   2   0]]
              precision    recall  f1-score   support

       anger       0.25      0.15      0.19       449
     disgust       0.28      0.22      0.25       347
        fear       0.30      0.26      0.28       389
       guilt       0.22      0.22      0.22       321
         joy       0.35      0.31      0.33       500
     sadness       0.26      0.50      0.34       408
       shame       0.26      0.24      0.25       322
    surprise       0.00      0.00      0.00        23

    accuracy                           0.27      2759
   macro avg       0.24      0.24      0.23      2759
weighted avg       0.28      0.27      0.27      2759



## K Neighbors Classifier

In [0]:
model_knn=KNeighborsClassifier(n_neighbors=5,metric='euclidean')
model_knn.fit(x_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [0]:
y_pred = model_knn.predict(x_test)
print('Accuracy of k-nearest neighbors classifier on test set: {:.2f}'.format(model_knn.score(x_test, y_test)))

Accuracy of k-nearest neighbors classifier on test set: 0.25


In [0]:
confusion_matrix_knn = confusion_matrix(y_test, y_pred)
print(confusion_matrix_knn)
print(classification_report(y_test,y_pred))

## Decision Tree Classifier (Gini)

In [0]:
model_dtc=DecisionTreeClassifier(criterion='gini')
model_dtc.fit(x_train,y_train)

In [0]:
y_pred = model_dtc.predict(x_test)
print('Accuracy of decision tree (Gini) classifier on test set: {:.2f}'.format(model_dtc.score(x_test, y_test)))

In [0]:
confusion_matrix_dtc = confusion_matrix(y_test, y_pred)
print(confusion_matrix_dtc)
print(classification_report(y_test,y_pred))

## Decision Tree Classifier (Entropy)

In [0]:
model_cart=DecisionTreeClassifier(criterion='entropy')
model_cart.fit(x_train,y_train)
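
The two decision trees differ only in the impurity measure used to choose splits. As a rough illustration, the sketch below computes both measures for a list of class counts. The function names are illustrative, and the counts are the test-set class supports from the classification reports above.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 - sum(p_k^2). Zero for a pure node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)


# Class supports from the reports above (anger ... surprise)
supports = [449, 347, 389, 321, 500, 408, 322, 23]

print(f"gini    = {gini(supports):.3f}")
print(f"entropy = {entropy(supports):.3f} bits")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed. A tree picks the split that most reduces the weighted impurity of the resulting child nodes, so the two criteria often, but not always, lead to similar trees.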

In [0]:
y_pred = model_cart.predict(x_test)
print('Accuracy of decision tree (entropy) classifier on test set: {:.2f}'.format(model_cart.score(x_test, y_test)))

In [0]:
confusion_matrix_cart = confusion_matrix(y_test, y_pred)
print(confusion_matrix_cart)
print(classification_report(y_test,y_pred))

## Naive Bayes

In [0]:
model_nb=GaussianNB()
model_nb.fit(x_train,y_train)

In [0]:
y_pred = model_nb.predict(x_test)
print('Accuracy of naive Bayes classifier on test set: {:.2f}'.format(model_nb.score(x_test, y_test)))

In [0]:
confusion_matrix_nb = confusion_matrix(y_test, y_pred)
print(confusion_matrix_nb)
print(classification_report(y_test,y_pred))