In [None]:
%run Import_Library.ipynb

In [None]:
%run text_functions.ipynb

In [52]:
%run evaluation_metrics.ipynb

**Introduction**
- There are two main training algorithms for word2vec: Continous Bag of Word (CBOW) and Skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context. Generally, the skip-gram method can have a better performance compared with CBOW method, for it can capture two semantics for a single word
- The Word2Vec Skip-gram model, for example, takes in pairs (word1, word2) generated by moving a window across text data, and trains a 1-hidden-layer neural network based on the synthetic task of given an input word, giving us a predicted probability distribution of nearby words to the input


**Data Preparation**

- Convert values in predicted column to binary
- Feature selection: len_text, digits, non_alpha_char, processed_text
- Shuffle randomly the data with selected features in order to reduce the bias.
- Split dataset into train and test subsets with the ratio 80:20

In [37]:
from joblib import load

data = load('data.lib')

data.head()

Unnamed: 0,label,text,len_text,digits,non_alpha_char,processed_text
0,ham,"Go until jurong point, crazy.. Available only ...",111,0,28,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,29,0,11,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,25,33,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,49,0,16,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,0,14,nah dont think go usf life around though


In [40]:
#print(type(X_text))

def split_data(data, features, target, test_size=0.2):
    '''
    Shuffle and split data into trainset and testset
    
    '''
    X = data[features]
    Y = data[target]
    
    x_train, x_val, y_train, y_val = train_test_split(X , Y , shuffle = True, test_size = test_size)
    
    return x_train, x_val, y_train, y_val

In [41]:
features = ['len_text', 'digits', 'non_alpha_char', 'processed_text']

target = 'label'

data[target] = np.where(data[target]=='spam', 1, 0)

x_train, x_val, y_train, y_val = split_data(data, features, target, test_size = 0.2)

In [43]:
def get_doc_vec(words, w2vec_model):
    """
    Function to take a document as a list of words and return the document vector
    
    Arg:
        - words: a list of words, for example: ["w1", "w2", "w3"]
        - w2vec_model: vector of vocabularies
        
    """
    good_words = []
    for word in words:
        # Words not in the original model will fair
        try:
            if word in w2vec_model:
                good_words.append(word)
        except:
            print(word, 'does not exist in the model.')
            continue
    # If no words are in the original model
    # print("good_words: {}".format(good_words))
    if len(good_words) == 0:
        return None
    # Return the mean of the vectors for all the good words
    return w2vec_model[good_words].mean(axis=0)

**Train Word2Vec Model with the corpus**
- 1) Set the input as the entire corpus, the model will build the set of vocabulary
- 2) Create multi-dimensional vector from each document in trainset and testset

In [44]:
corpus = data['processed_text'].str.split()

size_vect = 100
size_window = 30
ch_sg = 1 # The default training algorithm is skip-gram
min_word_cnt = 10
# build the model with the entire corpus
model = gensim.models.word2vec.Word2Vec(corpus
                                        , min_count = min_word_cnt
                                        , size = size_vect
                                        , window = size_window
                                        , iter = 1000
                                        , sg = ch_sg
                                       , workers = 5)

x_train['w2vec'] = x_train['processed_text'].apply(lambda sent : get_doc_vec(sent, model))

x_val['w2vec'] = x_val['processed_text'].apply(lambda sent : get_doc_vec(sent, model))

**Let’s try to understand the parameters used in training the model**
- size: The number of dimensions of the embeddings and the default is 100.
- window: The maximum distance between a target word and words around it. The default window is 5.
- min_count: The minimum count of words to consider when training the model; words with occurrence less than this count will be ignored. The default for min_count is 5.
- workers: The number of partitions during training and the default workers is 3.
- sg: The training algorithm, either CBOW(0) or skip gram(1). The default training algorithm is CBOW.

In [45]:
x_train.head()

Unnamed: 0,len_text,digits,non_alpha_char,processed_text,w2vec
742,62,0,15,well little time thing good time ahead,"[-0.47330487, -0.47312462, -0.5221913, -0.0153..."
3602,28,0,6,jay tell already,"[-0.08387119, -0.5157668, -0.02518104, 0.05996..."
5531,60,0,13,compliment away system side,"[-0.4088728, -0.70502704, 0.115013175, 0.24147..."
3693,46,0,10,movie laptop,"[-0.417161, -1.1797577, 0.039054543, 0.5233974..."
4586,158,26,28,u secret admirer look 2 make contact ufind rre...,"[-0.15474895, -0.63778716, 0.33145007, 0.17539..."


**Remarks**: we can't fit the data into machine learning algorithm because the data structure is not good.

- Normalise the data
- Addind new features
- Remove the Nan Values

In [46]:
def w2v_normalization(x_data, y_data, size_vector):
    '''
    x_data: either x_train or x_val
    y_data: either y_train or y_val
    '''
    # Data Normalization
    x_np_vecs = np.zeros((len(x_data), size_vector))
    for i, vec in enumerate(x_data['w2vec']):
        x_np_vecs[i, :] = vec

    # Combine the full dataframe with the labels
    x_data_w2v = pd.DataFrame(data = x_np_vecs
                              , index = x_data.index)
    
    # Add new features
    x_data_w2v = x_data_w2v.join(x_data[['len_text', 'digits', 'non_alpha_char']])
    
    # Join train data with label data in order to remove NaN values
    x_data_w2v = x_data_w2v.join(y_data)
    
    x_data_w2v = x_data_w2v.dropna()
    
    y_data_train = x_data_w2v['label']
    
    x_data_train = x_data_w2v.drop(columns=['label'])
      
    return x_data_train, y_data_train


In [47]:
x_train_w2v, y_train = w2v_normalization(x_train
                                         , y_train
                                         , size_vect)

x_val_w2v, y_val = w2v_normalization(x_val
                                     , y_val
                                     , size_vect)


In [48]:
x_train_w2v.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,93,94,95,96,97,98,99,len_text,digits,non_alpha_char
742,-0.473305,-0.473125,-0.522191,-0.015343,-0.311092,0.425733,0.560416,0.491478,0.096124,-0.249853,...,0.1264,0.263273,-0.740204,-0.215985,0.521319,-0.123448,-0.411699,62,0,15
3602,-0.083871,-0.515767,-0.025181,0.05996,-0.029355,0.48392,0.456451,0.707481,-0.198413,-0.26207,...,0.130765,0.207721,-0.780777,-0.299856,0.4191,-0.007276,-0.455397,28,0,6


In [49]:
x_val_w2v.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,93,94,95,96,97,98,99,len_text,digits,non_alpha_char
3366,0.075215,-0.461036,0.207333,0.172093,0.015896,0.375577,0.299629,0.408385,0.015985,0.114019,...,-0.151495,0.032117,-0.448398,-0.264202,0.262411,-0.151933,-0.05464,22,0,5
3090,-0.541918,-0.28151,0.121438,0.022833,-0.474408,0.717466,0.197144,0.138646,0.155885,-0.054211,...,0.055396,-0.148287,-0.098618,-0.270382,0.403873,-0.35779,0.000982,50,0,12


**Select the best feature and fit the data to Bayes Classifier**

- When we convert all of the texts in a dataset into word uni+bigram tokens, we may end up with tens of thousands of tokens. Not all of these tokens/features contribute to label prediction. So we can drop certain tokens, for instance those that occur extremely rarely across the dataset. We can also measure feature importance (how much each token contributes to label predictions), and only include the most informative tokens.

- There are many statistical functions that take features and the corresponding labels and output the feature importance score. Two commonly used functions are f_classif and chi2. 



In [54]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
def select_best_features_and_fit(x_train, y_train, x_val):
    '''
    return predicted values
    '''
    
    selector = SelectKBest(f_classif, k = min(TOP_K, x_train.shape[1]))
    
    selector.fit(x_train, y_train)

    x_train_selector = selector.transform(x_train).astype('float32')

    x_val_selector = selector.transform(x_val).astype('float32')

    clf = GaussianNB()
    # A sparse matrix was passed, but dense data is required.
    clf.fit(x_train_selector, y_train)

    y_preds = clf.predict(x_val_selector)

    return y_preds


In [55]:
y_preds = select_best_features_and_fit(x_train_w2v, y_train, x_val_w2v)


In [56]:
print(round(accuracy_score(y_val, y_preds), 2))

0.94


**Evaluation Metrics**
-  The accuracy will yield misleading results if the data set is unbalanced; that is, when the numbers of observations in different classes vary greatly. Confusion Matrix and Classification Report will help us to evaluate the performance for each class.

**1) Confustion Matrix**

- A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an supervised machine learning algorithm.

- Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. The diagonal elements show the number of correct classifications for each class meanwhile the off-diagonal elements provides the mis-classifications for each class.

- There are four ways to check if the predictions are right or wrong:

    - 1) True Positive (TP): The class was positive and predicted positive.
    - 2) True Negative (TN): The class was negative and predicted negative.
    - 3) False Negative (FN): The class was positive but predicted negative
    - 4) False Positive (FP) : The case was negative but predicted positive


In [57]:
df_confusion_matrix =  make_confusion_matrix(confusion_matrix(y_val, y_preds)
                                              , columns = ['Actual Ham', 'Actual Spam']
                                              , index = ['Predicted Ham', 'Predicted Spam'])
df_confusion_matrix

Unnamed: 0,Actual Ham,Actual Spam
Predicted Ham,896,56
Predicted Spam,16,147


- The correction of prediction = TP + TN = 896 + 147 = 1043

- The mis-classification = FN + FP = 84 + 13 = 72

**2) Classification Report**
- The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes.

- **Precision – What percent of your predictions were correct ?**

    - Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

    - Precision = Accuracy of positive predictions.

    - Precision = TP/(TP + FP)

- **Recall – What percent of the positive cases did you catch ?**

    - Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives (TP) and false negatives (FN).

    - Recall: Fraction of positives that were correctly identified.

    - Recall = TP/(TP+FN)

- **F1 score – What percent of positive predictions were correct?**

    - The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

    - F1 Score = 2*(Recall * Precision) / (Recall + Precision)

In [58]:
from sklearn.metrics import classification_report

print (classification_report(y_val , y_preds, target_names = ['ham', 'spam']))

              precision    recall  f1-score   support

         ham       0.98      0.94      0.96       952
        spam       0.72      0.90      0.80       163

    accuracy                           0.94      1115
   macro avg       0.85      0.92      0.88      1115
weighted avg       0.94      0.94      0.94      1115

