In [45]:
%run Import_Library.ipynb

In [None]:
%run text_functions.ipynb

In [83]:
%run evaluation_metrics.ipynb

**Introduction**

- Term Frequency–Inverse Document Frequency (TF-IDF): A machine learning model cannot work with raw text, so the text must first be converted into numbers, typically a sparse term-document matrix. A term-document matrix is a two-dimensional matrix whose rows are terms and whose columns are documents, so entry (i, j) holds the frequency of term i in document j. (The matrices built later in this notebook are the transpose: documents are rows and terms are columns.)

- For each entry, the term frequency (tf) counts how many times term i appears in document j. The inverse document frequency (idf) is derived from the number of documents in the corpus that contain term i: the more documents contain the term, the smaller its idf.

- The tf-idf score is the product of these two metrics (tf*idf). An entry's tf-idf score increases when term i appears frequently in document j and decreases when the term also appears in many other documents. In other words, idf is a cross-document normalization that puts less weight on common terms and more weight on rare terms.
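
To make the tf*idf product concrete, here is a minimal from-scratch sketch on a toy corpus. The three documents are hypothetical, and the formula is the plain textbook idf = log(N / df); library implementations usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): already lower-cased, tokens separated by spaces
docs = [
    "free entry win prize",
    "ok lar joking",
    "win free prize now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Unsmoothed textbook idf; an assumption, not what any particular library computes
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf(doc_tokens):
    """Return {term: tf * idf} for one tokenized document."""
    tf = Counter(doc_tokens)
    return {term: count * idf[term] for term, count in tf.items()}

scores = [tfidf(doc) for doc in tokenized]
print(scores[0])
```

In the first document, "entry" (found in one document) scores higher than "free" (found in two), which is the down-weighting of common terms described above.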

**Data Preparation**
- Convert the values in the label (target) column to binary
- Feature selection: len_text, digits, non_alpha_char, processed_text
- Randomly shuffle the data with the selected features to reduce ordering bias.
- Split the dataset into training and test subsets with an 80:20 ratio
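
The shuffle and 80:20 split can be sketched in plain Python as below. This is only an illustration: `shuffle_and_split` and the toy `rows`/`labels` are hypothetical names, not the helpers used in this notebook. The key point is that one permutation is applied to both the features and the labels so that each row stays paired with its label.

```python
import random

def shuffle_and_split(rows, labels, test_size=0.2, seed=123):
    """Shuffle rows and labels with one shared permutation, then split them."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # single permutation for both sequences
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical toy data: 10 messages, every fifth one is spam
rows = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

x_tr, x_te, y_tr, y_te = shuffle_and_split(rows, labels)
print(len(x_tr), len(x_te))
```

Fixing the seed makes the split reproducible between runs, which helps when comparing models.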

In [47]:
from joblib import load

data = load('data.lib')

data.head()

Unnamed: 0,label,text,len_text,digits,non_alpha_char,processed_text
0,ham,"Go until jurong point, crazy.. Available only ...",111,0,28,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,29,0,11,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,25,33,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,49,0,16,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,0,14,nah dont think go usf life around though


In [48]:
# Dense matrices store every entry, including the zeros. Sparse matrices store only
# the nonzero entries (plus their positions), so a mostly-zero matrix that would be
# too large to hold in memory in dense form compresses well. The trade-off is that
# sparse formats support fewer operations, and some algorithms require dense input.

In [49]:
seed = 123
# Optional: convert the label column to binary (spam = 1, ham = 0).
# Left disabled here, so the labels stay as the strings 'ham' and 'spam'.
#data['label'] = np.where(data['label']=='spam', 1, 0)
# Select relevant features for training the model
features = ['len_text', 'digits', 'non_alpha_char', 'processed_text']
# Feature variables
X_text = data['processed_text'].values

X_len_text = data['len_text'].values

X_digits = data['digits'].values

X_non_alpha_char = data['non_alpha_char'].values
# Target variable
Y = data['label'].values
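# Shuffle every feature array and the labels with the same seed so that rows stay
# aligned (this assumes myshuffle applies a seeded, in-place permutation).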
myshuffle(seed, X_text)

myshuffle(seed, X_len_text)

myshuffle(seed, X_digits)

myshuffle(seed, X_non_alpha_char)

myshuffle(seed, Y)

In [50]:
print(type(X_text))

<class 'numpy.ndarray'>


In [51]:
features = ['len_text', 'digits', 'non_alpha_char', 'processed_text']
target = 'label'

x = data[features]
y = data[target]

x_train, x_val, y_train, y_val = train_test_split(x
                                                  , y
                                                  , shuffle = True
                                                  , test_size = 0.2)


In [52]:
x_train.head()

Unnamed: 0,len_text,digits,non_alpha_char,processed_text
4437,148,40,29,private 2003 account statement 07753741225 sho...
4182,83,1,26,err cud im go 8pm havent get way contact
4485,146,24,27,win urgent mobile number award 2000 prize guar...
3664,23,0,6,dad back ph
5210,31,0,7,great office today


In [53]:
y_train.head()

4437    spam
4182     ham
4485    spam
3664     ham
5210     ham
Name: label, dtype: object

In [54]:
x_val.head()

Unnamed: 0,len_text,digits,non_alpha_char,processed_text
3266,74,0,25,happy sad one thing past good morning
579,116,0,39,let pool money together buy bunch lotto ticket...
4680,70,0,15,im inside officestill fill formsdon know leave
4117,25,0,5,izzit still rain
3835,42,0,7,hang brother family


In [55]:
y_val.head()

3266    ham
579     ham
4680    ham
4117    ham
3835    ham
Name: label, dtype: object

We will need to convert the texts into numerical vectors:
- Tokenization: Divide the texts into words or smaller sub-texts, which enables good generalization of the relationship between the texts and the labels. This determines the "vocabulary" of the dataset (the set of unique tokens present in the data).
- Vectorization: Define a good numerical measure to characterize these texts.
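
The two steps can be sketched from scratch as follows. The internals of `ngram_vectorize` are not shown in this notebook, so this is not its implementation; `ngrams`, `build_vocabulary` and `count_vector` are hypothetical helpers that use unigrams plus bigrams and raw counts as the numerical measure.

```python
def ngrams(text, n_min=1, n_max=2):
    """Tokenize on whitespace and return all n-grams for n_min <= n <= n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(texts):
    """Map each unique n-gram in the training texts to a column index."""
    vocab = {}
    for text in texts:
        for gram in ngrams(text):
            vocab.setdefault(gram, len(vocab))
    return vocab

def count_vector(text, vocab):
    """Vectorize one text as n-gram counts; unseen n-grams are ignored."""
    vec = [0] * len(vocab)
    for gram in ngrams(text):
        if gram in vocab:
            vec[vocab[gram]] += 1
    return vec

# Two processed messages taken from the training sample shown above
train_texts = ["dad back ph", "great office today"]
vocab = build_vocabulary(train_texts)
print(count_vector("dad back office", vocab))
# -> [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
```

The vocabulary is built from the training texts only, so the bigram "back office" in the new message has no column and is dropped.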

In [56]:
x_train_text = x_train['processed_text']

x_val_text = x_val['processed_text']

x_train_tfidf, x_val_tfidf = ngram_vectorize(train_texts = x_train_text.values
                                             , train_labels = y_train.values
                                             , val_texts = x_val_text.values)

**Add new features to the sparse matrix built from the training-set text**

In [57]:
x_train_new = x_train_tfidf

for feature in ['len_text', 'digits', 'non_alpha_char']:
    x_train_new = add_feature(x_train_new, x_train[feature].values.reshape(1, -1))

x_train_new

<4457x7699 sparse matrix of type '<class 'numpy.float64'>'
	with 57153 stored elements in Compressed Sparse Row format>
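
The matrix above has 4457 × 7699 ≈ 34.3 million cells but only 57,153 stored values, roughly 0.17% of them. As a rough illustration of how the Compressed Sparse Row layout achieves this, the sketch below builds the three CSR arrays by hand; `to_csr` and `csr_row` are hypothetical helpers written for this example, not SciPy functions.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row

dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
# -> [3, 1, 2] [2, 0, 3] [0, 1, 1, 3]
```

Only three values and their column positions are stored for the twelve cells, and the all-zero row costs a single repeated entry in `indptr`.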

**Add new features to the sparse matrix built from the test (validation) set text**

In [58]:
x_val_new = x_val_tfidf

for feature in ['len_text', 'digits', 'non_alpha_char']:
    x_val_new = add_feature(x_val_new, x_val[feature].values.reshape(1, -1))

x_val_new

<1115x7699 sparse matrix of type '<class 'numpy.float64'>'
	with 12311 stored elements in Compressed Sparse Row format>

In [60]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
# Signature: accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

clf = GaussianNB()
# GaussianNB requires dense input, so the sparse matrices are converted with .toarray()
clf.fit(x_train_new.toarray(), y_train)

y_preds = clf.predict(x_val_new.toarray())

round(accuracy_score(y_val, y_preds), 2)

0.91

In [61]:
print(clf.classes_)

['ham' 'spam']


**Evaluation Metrics**
- Accuracy can be misleading when the dataset is imbalanced, that is, when the number of observations per class varies greatly. The confusion matrix and the classification report help evaluate the performance for each class.
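
A quick way to see why accuracy alone can mislead here: the validation set holds 962 ham and 153 spam messages (the support counts in the classification report), so a classifier that always answers "ham" is already right most of the time. The helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Class sizes taken from the validation set in this notebook
y_val_like = ["ham"] * 962 + ["spam"] * 153
print(round(majority_baseline_accuracy(y_val_like), 3))
# -> 0.863
```

Against that baseline, the model's 0.91 accuracy is a smaller improvement than it first appears, and it says nothing about how many spam messages are caught.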

**1) Confusion Matrix**

- A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised machine learning algorithm.

- Following scikit-learn's `confusion_matrix` convention, each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class. Taking spam as the positive class, every prediction falls into one of four cases:

    - 1) True Positive (TP): The class was positive and predicted positive.
    - 2) True Negative (TN): The class was negative and predicted negative.
    - 3) False Negative (FN): The class was positive but predicted negative.
    - 4) False Positive (FP): The class was negative but predicted positive.

In [86]:
df_confusion_matrix =  make_confusion_matrix(confusion_matrix(y_val, y_preds)
                                              , columns = ['Predicted Ham', 'Predicted Spam']
                                              , index = ['Actual Ham', 'Actual Spam'])
df_confusion_matrix

Unnamed: 0,Predicted Ham,Predicted Spam
Actual Ham,878,84
Actual Spam,13,140


- The diagonal elements:
    - Show the number of correct classifications for each class: 878 for ham and 140 for spam.
    - Correct predictions = True Positive (TP) + True Negative (TN) = 140 + 878 = 1018

- The off-diagonal elements:
    - Show the misclassifications: 84 ham messages were predicted as spam and 13 spam messages were predicted as ham. The row sums (962 and 153) match the per-class support in the classification report.
    - Misclassifications = False Positive (FP) + False Negative (FN) = 84 + 13 = 97
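
The four cells can also be counted directly from the label lists. The sketch below assumes spam is the positive class and rebuilds label lists that reproduce the counts in the table (878, 84, 13, 140); `confusion_counts` is a hypothetical helper, not the `make_confusion_matrix` function used above.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Reconstructed labels: 962 actual ham followed by 153 actual spam
y_true = ["ham"] * 962 + ["spam"] * 153
y_pred = ["ham"] * 878 + ["spam"] * 84 + ["ham"] * 13 + ["spam"] * 140

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts, round(accuracy, 2))
```

The resulting accuracy, 1018 / 1115, rounds to the 0.91 reported earlier.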

**2) Classification Report**

The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes. 

**Precision – What percent of your positive predictions were correct?**

- Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

- Precision = Accuracy of positive predictions.

- Precision = TP/(TP + FP)

**Recall – What percent of the positive cases did you catch?**

- Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives (TP) and false negatives (FN).

- Recall: Fraction of positives that were correctly identified.
- Recall = TP/(TP+FN)

**F1 score – How well does the classifier balance precision and recall?**

- The F1 score is the harmonic mean of precision and recall; the best score is 1.0 and the worst is 0.0. Because it combines precision and recall, the F1 score of a minority class is often lower than the overall accuracy, which is dominated by the majority class. As a rule of thumb, compare classifier models on the weighted average of F1 rather than on global accuracy.

- F1 Score = 2*(Recall * Precision) / (Recall + Precision)
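
As a worked check of the three formulas, the sketch below plugs in the counts from the confusion matrix, once with spam as the positive class and once with ham. `precision_recall_f1` is a hypothetical helper; the zero-division guards are a choice made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Spam as positive: 140 caught, 84 ham flagged by mistake, 13 spam missed
spam = precision_recall_f1(tp=140, fp=84, fn=13)
# Ham as positive: the same table read from the other class's point of view
ham = precision_recall_f1(tp=878, fp=13, fn=84)

print([round(v, 3) for v in spam])
# -> [0.625, 0.915, 0.743]
print([round(v, 3) for v in ham])
# -> [0.985, 0.913, 0.948]
```

Spam recall is high (most spam is caught), but spam precision is only 0.625 because 84 legitimate messages are flagged as spam.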

In [62]:
from sklearn.metrics import classification_report
# In scikit-learn's binary confusion matrix C (rows = actual, columns = predicted),
# the count of true negatives is C[0, 0],
# false negatives is C[1, 0],
# true positives is C[1, 1]
# and false positives is C[0, 1].
print(classification_report(y_val, y_preds, target_names = ['ham', 'spam']))

              precision    recall  f1-score   support

         ham       0.99      0.91      0.95       962
        spam       0.62      0.92      0.74       153

    accuracy                           0.91      1115
   macro avg       0.81      0.91      0.85      1115
weighted avg       0.94      0.91      0.92      1115
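
The two summary rows differ only in how the per-class scores are combined: the macro average treats both classes equally, while the weighted average weights each class by its support. A small sketch for the F1 column, using the counts from the confusion matrix (the helper names are hypothetical):

```python
def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)

def weighted_average(values, supports):
    """Mean over classes, weighted by the number of true instances per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# F1 = 2*TP / (2*TP + FP + FN), an equivalent form of the harmonic-mean formula
f1_ham = 2 * 878 / (2 * 878 + 13 + 84)
f1_spam = 2 * 140 / (2 * 140 + 84 + 13)
supports = [962, 153]  # ham, spam

print(round(macro_average([f1_ham, f1_spam]), 2))
# -> 0.85
print(round(weighted_average([f1_ham, f1_spam], supports), 2))
# -> 0.92
```

The weighted figure is pulled toward the ham score because ham makes up most of the validation set.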




