---
# Convolutional Neural Networks for text classification
---
<br>
<center><h3> Abstract </h3></center>
 Text classification is one of the most common NLP tasks, and there are plenty of approaches to it. It is easy to overlook that techniques made famous in other areas of ML can also be useful for text analysis. In this notebook we apply CNNs, originally designed for and widely applied to image analysis, to text classification.

***Motivation:***
- CNNs are faster to train than LSTM models.
- CNNs are translation invariant, which means they can recognize patterns in the text no matter where they appear.
- CNNs are also efficient in terms of representation with a large vocabulary.
- Convolutional Filters learn good representations automatically, without needing to represent the whole vocabulary.
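The translation-invariance point can be illustrated with a toy example. The sketch below is plain Python, not the Keras model used later in this notebook; the filter weights and the input sequences are made-up values chosen only for illustration. It slides one filter over a sequence and keeps the global maximum, so the same pattern produces the same feature wherever it appears.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence (no padding) and return the feature map."""
    width = len(kernel)
    return [
        sum(sequence[start + offset] * kernel[offset] for offset in range(width))
        for start in range(len(sequence) - width + 1)
    ]


def global_max_pool(feature_map):
    """Keep only the strongest response of the filter."""
    return max(feature_map)


kernel = [1.0, -1.0, 1.0]                      # hypothetical filter weights
pattern_early = [0, 3, 1, 3, 0, 0, 0, 0, 0]    # pattern 3, 1, 3 near the start
pattern_late = [0, 0, 0, 0, 0, 3, 1, 3, 0]     # same pattern near the end

feature_early = global_max_pool(conv1d_valid(pattern_early, kernel))
feature_late = global_max_pool(conv1d_valid(pattern_late, kernel))
print(feature_early, feature_late)
```

Both sequences give the same pooled feature because the pooling step discards the position of the strongest match.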

***When to use it?:***
- When there is no strong dependence between a word and the words that appeared long before it in the sequence.

***Note***: In this notebook we put a special focus on computational performance and tried to avoid the extra cost of repeating tasks. Feel free to contact us if you have any questions.


---
# Index
---
___1. Introduction___
> - <a href='#1.1'>1.1 Data set Description</a>

___2. Preprocessing___
> - <a href='#2.1'>2.1 Data cleaning</a>
> - <a href='#2.2'>2.2 Data Preparation and Analysis</a>

___3. Feature extraction___
> - <a href='#3.1'>3.1 BOW representation</a>
> - <a href='#3.2'>3.2 FastText representation</a>

___<a href='#4.'>4. Model Design</a>___

___5. Training and Testing___
> - <a href='#5.1'>5.1 CNN+BOW representation</a>
> - <a href='#5.2'>5.2 CNN+FastText representation</a>

<a id='1.1'></a>
## Data set Description
We worked with the 20 Newsgroups data set, which can be found at
> [20 news groups dataset](http://qwone.com/~jason/20Newsgroups/)

This data set is also available as a Kaggle dataset.

The data is divided into two folders, one for the training set and one for the test set; each subdirectory in the bundle represents a newsgroup.

To simplify data manipulation we constructed a pandas dataframe with the following structure:

**ID | Document | label**


In [None]:
import pandas as pd #database manipulation
import numpy as np #math library

<a id='2.1'></a>
## Preprocessing
In the original 20 Newsgroups data set we would have to remove headers, footers and quotes, but this preprocessing has already been done in the 20 Newsgroups v3 version by the data set uploader.
For this reason we only applied some additional preprocessing to the text:
>
- **Remove weird characters** (if they exist).
- **Separate contractions**: contractions are split from the preceding word. Since the most common words are removed later anyway, this only needs to be approximate; the rules in the code below map each suffix to its expanded form, for example **doesn't becomes does not**.
- **Remove the footer lines**.
- **Remove long words**: we established a maximum length of 13 letters per word, which we took as the typical maximum length of an English word; a word that exceeds this limit is probably a spelling error. Note that the regex in the code below uses a stricter threshold of 10 characters.
- **Remove emails and links**: we removed emails and links because they do not contribute important information to the problem.
 >

In [None]:
import re
#these are our cleaning rules
cleaningOptions = {
    '[A-Za-z0-9_-]{10,}':'',#remove long tokens (10 or more characters)
    #expand contractions
    "\'m":" am",
    "\'s":" is",
    "\'ve":" have",
    "n\'t":" not",
    "\'re":" are",
    "\'d":" had",
    "\'ll":" will",
    #delete double spaces and repeated sequences of "?", "!", "#", "=", "-", "_", ".", "*" and "^"
    r'\s{2,}|\?{2,}|\!{2,}|#{2,}|={2,}|-{2,}|_{2,}|\.{2,}|\*{2,}|\^{2,}':'',
    #separate symbols from words
    '(':' ( ',
    '/':' / ',
    ')':' ) ',
    '?':' ? ',
    '¿':' ¿ ',
    ']':' ] ',
    '[':' [ ',
    '}':' } ',
    '{':' { ',
    '<':' < ',
    '"':' " ',
    '>':' > ',
    ',':' , ',
    '!':' ! ',
    '.':' . ',
    ':':' : ',
    '-':' - ',
    #delete emails
    "[A-Za-z0-9_-]*@[A-Za-z0-9._-]*\s?":"",
    #delete links
    "https?://[A-Za-z0-9./-]+":"",
}

def escapePattern(pattern):
    """Helper function to build our regex"""
    if len(pattern)==1:
        pattern=re.escape(pattern)
    return pattern

def compileCleanerRegex(cleaningOptions=None):
    """Given a dictionary of rules, construct the regular expression that detects their patterns."""
    return re.compile("(%s)" % "|".join(map(escapePattern,cleaningOptions.keys())))

replacementDictRegex = compileCleanerRegex(cleaningOptions)

In [None]:
def cleaning_text(text,cleaningOptions=None,replacementDictRegex=None,encode_format="utf-8-sig",decode_format="ascii",option="ignore"):
    """Cleaning function for text.

    Given a text, this function applies the cleaning rules defined in a
    dictionary, using a regex to detect the patterns, and removes non-ASCII characters.

    Args:
        text (str): The text we want to clean.
        cleaningOptions (dict): The rules to be applied for the cleaning.
        replacementDictRegex (regex): The regular expression that detects the patterns
            defined in the cleaning options; it is compiled with
            compileCleanerRegex(cleaningOptions).
        encode_format (str): Encoding applied to the incoming text, utf-8-sig by default,
            which fits most cases.
        decode_format (str): Encoding used to decode the text again
            (the encode/decode round trip removes unwanted characters).
        option (str): Error handling passed to decode; "ignore" drops undecodable characters.
    Returns:
        The cleaned text after applying the cleaning options.
    """
    #punctuation removal is already performed by the Keras tokenizer, so it is skipped here
    #removing weird (non-ASCII) characters
    text = text.encode(encode_format).decode(decode_format,option)
    #literal rules are looked up by the matched text; matches of regex rules
    #(long words, repeated symbols, emails, links) are not dictionary keys and are deleted
    return replacementDictRegex.sub(lambda mo:cleaningOptions.get(mo.group(1), ""), text)

__Let's test the cleaning function:__

In [None]:
oidDescriptionStr="""I'm a  nicewhirrrclickwhirrr"Clam" test: ({}(. hi jij ... ,,)\1+) https://www.kaggle.com/criscastromaya/cnn-for-text-classification((it's)((((djcriz5@gemail.com )))()((isn't)(---____--)) Control"""
print(cleaning_text(oidDescriptionStr,cleaningOptions,replacementDictRegex))
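To make the mechanism behind the cleaning function easier to follow, here is a minimal sketch of dictionary-driven regex replacement. It uses a reduced, hypothetical rule set (`rules`) with literal keys only, and the helper name `apply_rules` is an assumption made for this illustration; it is not the notebook's full cleaner.

```python
import re

# Hypothetical, reduced rule set: each literal key maps to its replacement.
rules = {"n't": " not", "'re": " are", "(": " ( ", ")": " ) "}

# Escape every key so that characters such as "(" are matched literally,
# then join the keys into a single alternation.
pattern = re.compile("|".join(re.escape(key) for key in rules))


def apply_rules(text):
    """Replace every match with its dictionary value in a single pass."""
    # Fall back to "" (deletion) if the matched text is not a key.
    return pattern.sub(lambda mo: rules.get(mo.group(0), ""), text)


cleaned = apply_rules("they're sure (it isn't)")
print(cleaned.split())
```

A single compiled alternation scans the text once, which is cheaper than calling `re.sub` once per rule.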

Then we collected the paths of all the files

In [None]:
import glob,string
path = '../input/20newsgroups/20news-bydate-v3/*/*/*.txt'
#list files
files=glob.glob(path)

Afterwards we constructed pandas dataframes for the training and test sets

In [None]:
import codecs
from tqdm import tqdm
def contructDataframe(file_list,cleaningOptions=cleaningOptions,replacementDictRegex=replacementDictRegex):
    """
    Construct the training and test pandas dataframes in the format | ID | text | label |
    (the ID is the dataframe index), applying the cleaning function to every document.
    Args:
        file_list (list[str]): the paths of the files to be cleaned and stored in the dataframes.
        cleaningOptions (dict): The rules to be applied for the cleaning.
        replacementDictRegex (regex): The regular expression that detects the patterns
            defined in the cleaning options; it is compiled with
            compileCleanerRegex(cleaningOptions).
    Returns:
        training_df, testing_df (pandas.DataFrame): the training and test sets as pandas dataframes.
    """
    train=[]
    test=[]
    mode="r"
    encoding="utf-8"
    e_option="ignore"
    for file in tqdm(file_list):
        #the context manager closes each file after reading it
        with codecs.open(file, mode, encoding, e_option) as f:
            text = f.read()
        #the parent directory name is the newsgroup label
        row = (cleaning_text(text,cleaningOptions,replacementDictRegex),file.split("/")[-2])
        if "20news-bydate-test" in file:
            test.append(row)
        else:
            train.append(row)
    return pd.DataFrame(train,columns=['text','label']),pd.DataFrame(test,columns=['text','label'])

In [None]:
df_train,df_test=contructDataframe(files)

<a id='2.2'></a>
### Data Preparation and Analysis
***As a sanity check, let's see whether there is any missing data or evident errors***

In [None]:
print("Train: ",df_train.isnull().values.any()," Test: ",df_test.isnull().values.any())

***We'll also look at the distribution of the classes***

In [None]:
df_train.groupby(df_train.label).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Training set)",color='red')


In [None]:
df_test.groupby(df_test.label).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Test set)")

Luckily the training and test sets are pretty well balanced. But we still have to check the state of the data,
so we now look at the size of the texts, counted in tokens.

In [None]:
#df_train[df_train.text.str.split(" ").apply(len)==df_train.text.str.split(" ").apply(len).max()]
max_l=df_train.text.str.split(" ").apply(len).max()
min_l=df_train.text.str.split(" ").apply(len).min()
print(f"There is something wrong with the dataset: the biggest document contains {max_l} tokens and the smallest document contains {min_l} tokens")

The gap between the biggest and the smallest document is huge. Consequently, we should visualize the distribution of document lengths and also check the extreme values.

In [None]:
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
df_train['doc_len'] = df_train.text.apply(lambda words: len(words.split()))
df_test['doc_len'] = df_test.text.apply(lambda words: len(words.split()))

def plot_doc_lengths(dataframe):
    #reference line at the mean plus one standard deviation of the document lengths
    max_seq_len = np.round(dataframe.doc_len.mean() + dataframe.doc_len.std()).astype(int)
    sns.distplot(tuple(dataframe.doc_len), hist=True, kde=True, label='Document lengths')
    plt.axvline(x=max_seq_len, color='k', linestyle='--', label=f'Mean + std of document lengths: {max_seq_len}')
    plt.title('Document lengths')
    plt.legend()
    plt.show()
    #report the extremes of the dataframe that was passed in, not always df_train
    print(f"The biggest document contains {dataframe['doc_len'].max()} words and the smallest {dataframe['doc_len'].min()} words")
plot_doc_lengths(df_train)

Then we looked at the smallest and the biggest documents, just to see what is wrong.

In [None]:
df_train[df_train.doc_len==df_train['doc_len'].max()]

In [None]:
df_train[df_train.doc_len==df_train['doc_len'].min()].tail(2)

As we can see, there are many empty texts.
After examining the data set, we decided to delete the entries with 10 tokens or fewer and those with 3250 tokens or more, because outside this range base64 strings and other unwanted noise start to appear.

In [None]:
df_train=df_train[(10<df_train.doc_len)&(3250>df_train.doc_len)]
##also I'll do the same for the test set
df_test=df_test[(10<df_test.doc_len)&(3250>df_test.doc_len)]


Let's see how this filtering altered the data distribution.

In [None]:
#df_train.sort_values(by=['doc_len'])
df_train.groupby(df_train.label).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Train set)",color='red')

In [None]:
df_test.groupby(df_test.label).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Test set)")

In [None]:
plot_doc_lengths(df_train)

<a id='3.1'></a>
# Feature extraction
In this section we transform our text data into a numerical representation. For this experiment we use a bag-of-words representation as implemented by the Keras tokenizer (sparse) and a skip-gram based pre-trained model, the Facebook fastText representation. I highly recommend the following article by Dipanjan (DJ) Sarkar to go deeper into this subject: [ A hands-on intuitive approach to Deep Learning Methods for Text Data ](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)

Before starting with the feature extraction we split our ***training data*** into two parts: ***training*** and ***validation***.

In [None]:
from sklearn.model_selection import train_test_split
SEED = 200
X_train, X_validation, y_train, y_validation = train_test_split(df_train.text, df_train.label, test_size=0.2, random_state=3,stratify= df_train.label)

In [None]:
X_train.groupby(y_train).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Train set)",color='red')

In [None]:
X_validation.groupby(y_validation).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Validation set)",color='green')

In [None]:
df_test.text.groupby(df_test.label).size().reset_index(name="counts").plot.bar(x='label',title="Samples per each class (Test set)")

Then we transformed our data into a bag-of-words based representation.
Basically, the Keras tokenizer builds a dictionary from the texts it is fitted on (here the training split) and represents each document as a sequence of integers, assigning a number to each word according to its frequency in the texts.
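The frequency-based indexing can be sketched in a few lines of plain Python. This is a simplified illustration and not the Keras implementation: the toy documents are made up, and lowercasing and punctuation filtering are left out. Index 0 is left unused so that it can serve as the padding value.

```python
from collections import Counter

docs = ["the cat sat", "the cat ran", "the dog"]  # toy corpus, for illustration only

# Count how often every word appears across all documents.
counts = Counter(word for doc in docs for word in doc.split())

# The most frequent word gets index 1, the next one index 2, and so on.
word_index = {
    word: index
    for index, (word, _) in enumerate(counts.most_common(), start=1)
}

# Each document becomes the sequence of the indices of its words.
sequences = [[word_index[word] for word in doc.split()] for doc in docs]
print(word_index)
print(sequences)
```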

In [None]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=500000)
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)

In [None]:
print(f"Original document: {X_train.values[0]} \nNumerical representation: {sequences_train[0]}")

In [None]:
sequences_validation = tokenizer.texts_to_sequences(X_validation)
sequences_test = tokenizer.texts_to_sequences(df_test.text.values)

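Then we inspect what the tokenizer has learned and delete the 10 most frequent and the 10 least frequent words from its vocabulary. Note that the sequences above were generated before this deletion, so it only affects later vocabulary lookups (for example when the embedding matrix is built).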

In [None]:
# Citation
# --------
# DL4NLP lab by Oier Lopez de Lacalle
# https://www.researchgate.net/profile/Oier_Lopez_de_Lacalle2

# Recover the word index that was created with the tokenizer
word_index = tokenizer.word_index
print('Found {} unique tokens.\n'.format(len(word_index)))
word_count = tokenizer.word_counts
print("Show the most frequent word index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=True)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    del tokenizer.index_word[tokenizer.word_index[word]]
    del tokenizer.index_docs[tokenizer.word_index[word]]
    del tokenizer.word_index[word]
    del tokenizer.word_docs[word]
    del tokenizer.word_counts[word]
    if i == 9: 
        print('')
        break
print("Show the least frequent word index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=False)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    del tokenizer.index_word[tokenizer.word_index[word]]
    del tokenizer.index_docs[tokenizer.word_index[word]]
    del tokenizer.word_index[word]
    del tokenizer.word_docs[word]
    del tokenizer.word_counts[word]
    if i == 9: 
        print('')
        break

In [None]:
# Recover the word index that was created with the tokenizer
word_index = tokenizer.word_index
print('Found {} unique tokens.\n'.format(len(word_index)))
word_count = tokenizer.word_counts
print("Show the most frequent word index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=True)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 9: 
        print('')
        break
print("Show the least frequent word index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=False)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 9: 
        print('')
        break

These vectors have different lengths, so we need to pad the sequences in order to feed them to our model.
We know from our previous data analysis that the most useful documents have a mean of about 500 words, so we limit the document length to 600.
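As a rough illustration of what padding to a fixed length does, the sketch below pads at the front and truncates from the front, which mirrors the default behaviour of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`). The function name `pad_sequence` is a hypothetical helper written for this example.

```python
def pad_sequence(seq, maxlen, value=0):
    """Return a list of exactly maxlen items built from seq."""
    if len(seq) > maxlen:
        # Too long: drop items from the beginning, keep the last maxlen.
        seq = seq[len(seq) - maxlen:]
    # Too short: fill the beginning with the padding value.
    return [value] * (maxlen - len(seq)) + list(seq)


short_doc = [4, 8, 15]
long_doc = [1, 2, 3, 4, 5, 6]
print(pad_sequence(short_doc, 5))
print(pad_sequence(long_doc, 4))
```

With front truncation, documents longer than the limit keep their final tokens, so the beginning of a long post is what gets discarded.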

In [None]:
max_length=600

In [None]:
from keras.preprocessing import sequence
x_train=sequence.pad_sequences(sequences_train,maxlen=max_length)
x_validation=sequence.pad_sequences(sequences_validation,maxlen=max_length)
x_test=sequence.pad_sequences(sequences_test,maxlen=max_length)
print(f"Train set shape: {x_train.shape}\nValidation set shape: {x_validation.shape}\nTest set shape: {x_test.shape}")

We also need to transform our labels into something the model can consume, so we used the scikit-learn LabelBinarizer to produce a one-hot encoded representation.
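The idea behind one-hot encoding and its inverse can be sketched without scikit-learn. The helpers `one_hot` and `decode` below are hypothetical names used only for this illustration, and the three labels are just a small sample of newsgroup names.

```python
labels = ["sci.space", "rec.autos", "sci.space", "talk.politics.misc"]

# The class list is the sorted set of distinct labels.
classes = sorted(set(labels))


def one_hot(label, classes):
    """Vector with a 1 at the position of the label and 0 elsewhere."""
    return [1 if label == name else 0 for name in classes]


def decode(scores, classes):
    """Map a score vector back to the class with the highest score."""
    best = max(range(len(scores)), key=lambda position: scores[position])
    return classes[best]


encoded = [one_hot(label, classes) for label in labels]
print(classes)
print(encoded[0])
print(decode([0.1, 0.7, 0.2], classes))
```

Decoding by taking the position of the highest score is how softmax probabilities are turned back into a class name.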

In [None]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
y_train_categorical=encoder.fit_transform(y_train.values.reshape(-1, 1))
y_validation_categorical=encoder.transform(y_validation.values.reshape(-1, 1))
y_test_categorical=encoder.transform(df_test.label.values.reshape(-1, 1))

In [None]:
print(f"Train set labels: {y_train_categorical.__len__()}\nValidation set labels: {y_validation_categorical.__len__()}\nTest set labels: {y_test_categorical.__len__()}")

<a id='4.'></a>
## Model Design

This architecture is composed of the following layers:
>
- **Embedding Layer**: This layer provides a dense representation of words and their relative meanings, which is used to find relationships between words and their context. We set the dimension to 100 and the input length to `max_length` (600, chosen above from the document length analysis). The vocabulary size matches the vocabulary learned by the tokenizer on the training set. For the fastText integration we use a dimension of 300, matching our pre-trained embedding, and the weights of this layer are set from an embedding matrix.
- **Convolutional Layer**: This layer tries to find patterns in the sentences by applying filters, which generate feature maps. This first layer has 64 filters of size 7 and uses ReLU as activation function. It pads the input so that the output feature maps have the same length as the input.
- **Max pooling layer**: This layer selects the most important features from the feature maps generated by the conv layer. It uses a pool size of 2; since the code does not set a stride, Keras uses a stride equal to the pool size.
- **Convolutional Layer**: This layer finds patterns in the pooled feature maps and generates new feature maps. It also has 64 filters of size 7, uses ReLU as activation function and pads the input so that the output feature maps have the same length.
- **Global Max pooling layer**: This layer selects the most important feature from each feature map generated by the conv layer.
- **Dropout**: This layer is used to improve the generalization of the model. In this case it drops 50% of the units of the previous layer during training, which forces the weights to be more evenly distributed.
- **Dense layer**: To learn some additional information, and to avoid applying dropout directly to the output layer, we added a dense layer of 32 neurons. We also used L2 regularization, also known as weight decay, which forces the weights to decay towards zero (but not exactly zero). I highly recommend this [article about generalization in DL models](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/).
- **Output Layer (dense)**: The final layer has 20 neurons, one per class. We used the softmax function to map a probability to each class.
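To see how the sequence length changes through the stack, the following sketch does the shape arithmetic by hand. It takes `max_length = 600`, 64 filters and a pool size of 2 from the code in this notebook, and assumes the standard output length formula for pooling without padding; the helper names are hypothetical.

```python
def same_conv_length(length):
    """A convolution with 'same' padding keeps the sequence length."""
    return length


def max_pool_length(length, pool_size, stride=None):
    """Output length of max pooling without padding."""
    stride = pool_size if stride is None else stride  # default stride = pool size
    return (length - pool_size) // stride + 1


max_length = 600
num_filters = 64

after_conv1 = same_conv_length(max_length)
after_pool = max_pool_length(after_conv1, pool_size=2)
after_conv2 = same_conv_length(after_pool)
# Global max pooling keeps one value per filter, whatever the length is.
after_global_pool = num_filters

print(after_conv1, after_pool, after_conv2, after_global_pool)
```

Under these assumptions the dense layers always receive 64 features per document, regardless of how long the padded input is.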

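We also used the binary cross-entropy loss function and a custom Adam optimizer to learn the parameters and decrease the loss. Note that for single-label multi-class problems with a softmax output, categorical cross-entropy is the more common choice; with binary cross-entropy the accuracy reported by Keras is typically computed per output unit, so it tends to look higher than the actual classification accuracy.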
>
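For readers who want to compare the two loss functions numerically, here is a small sketch in plain Python. The target and prediction vectors are made-up values for a three-class toy case, and the formulas are the textbook definitions rather than the Keras implementations.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Penalizes only the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


def binary_crossentropy(y_true, y_pred):
    """Treats every output unit as an independent yes/no decision and averages."""
    terms = [
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, y_pred)
    ]
    return -sum(terms) / len(terms)


y_true = [0, 1, 0]          # one-hot target, toy example
y_pred = [0.2, 0.7, 0.1]    # softmax-like output, toy example

cce = categorical_crossentropy(y_true, y_pred)
bce = binary_crossentropy(y_true, y_pred)
print(round(cce, 4), round(bce, 4))
```

In this toy case the binary variant is the smaller number, because the many easy "not this class" decisions are averaged in together with the single decision that matters.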

In [None]:
from keras.layers import *
from keras import Sequential,optimizers,regularizers
from keras_sequential_ascii import keras2ascii

class CNNtext(Sequential):
    """
    This class extends keras.Sequential in order to build our
    model according to the designed architecture.
    """
    #params for the convolutional layers
    __num_filters = 64
    __weight_decay = 1e-4
    #optimizers
    __adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    def __init__(self,max_length,number_of_classes,embedding_matrix=None,vocab_size=None,tokenizer=None):
        #creating the model heritance from Keras.sequencial
        super().__init__()
        #params for the embedding layer
        self.__embedding_dim=100 if embedding_matrix is None else embedding_matrix.shape[1]
        #self.__vocab_size=vocab_size if tokenizer is None else tokenizer.word_index.__len__()+1
        self.__vocab_size=vocab_size if tokenizer is None else max(tokenizer.index_word.keys())+1
        try:
            self.__max_length=max_length
            self.__number_of_classes=number_of_classes 
        except NameError as error:
            print("Error ",error," must be defined.")
            
        #defining layers
        #This layer learns an embedding; vocab_size is the size of the vocabulary learned by our tokenizer
        #the embedding dimension is chosen by ourselves, in this case a dimension of 100
        #the input length is the maximum length of the documents we will use
        if embedding_matrix is None:
            self.add(Embedding(self.__vocab_size,
                               self.__embedding_dim,
                               input_length=self.__max_length,trainable=True))
        else:
            self.add(Embedding(embedding_matrix.shape[0],
                               embedding_matrix.shape[1],
                               weights=[embedding_matrix],
                               input_length=self.__max_length,
                               trainable=False))
        #then we apply a 1D conv layer that should apply filters to the sequence and generate features maps.
        self.add(Conv1D(self.__num_filters, 7, activation='relu', padding='same'))
        #then we will get the most important features using a max pooling layer
        self.add(MaxPooling1D(2))
        #afterwards we apply a conv 1D layer to learn new features form the previous results
        self.add(Conv1D(self.__num_filters, 7, activation='relu', padding='same'))
        #we select again the most important features
        self.add(GlobalMaxPooling1D())
        #then we apply dropout to improve the generalization
        self.add(Dropout(0.5))
        #then we will pass the results into a dense layer that will also learn some internal representation and we also use the l2 regularization
        self.add(Dense(32, activation='relu', kernel_regularizer=regularizers.l2(self.__weight_decay)))
        #for the final layer we will use softmax to obtain the probabilities of each class.
        self.add(Dense(self.__number_of_classes, activation='softmax'))  
        #to compute the loss function we use binary_crossentropy
        #(categorical_crossentropy is the more common choice for single-label multi-class problems)
        #we also use the adam optimizer to learn the parameters (weights)
        #and minimize the loss function.
        self.compile(loss='binary_crossentropy', optimizer=self.__adam, metrics=['accuracy'])     

<a id='5.1'></a>
## Training and testing (CNN+BOW)

We use early stopping, which monitors the validation loss and stops the training when the loss stops improving.
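The patience logic can be sketched as follows. This is a simplified illustration of the idea and not the Keras implementation; the function name `stopping_epoch` and the loss values are made up, while `min_delta=0.01` and `patience=4` match the callback configured in the next cell.

```python
def stopping_epoch(val_losses, min_delta=0.01, patience=4):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # The loss improved by more than min_delta: remember it, reset the counter.
            best = loss
            wait = 0
        else:
            # No sufficient improvement in this epoch.
            wait += 1
            if wait >= patience:
                return epoch
    return None


toy_losses = [1.0, 0.8, 0.795, 0.80, 0.81, 0.82, 0.5]
print(stopping_epoch(toy_losses))
```

In the toy run the last value would have been an improvement, but training stops before reaching it; this is the trade-off a small patience value makes.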

In [None]:
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=4, verbose=1)
callbacks_list = [early_stopping]

We define the batch size and the number of epochs. The number of epochs is only an upper bound for fitting the training set; the actual number is decided by the condition established in the callback.

In [None]:
#training params
batch_size = 150
num_epochs = 20

In [None]:
tokenizer.num_words

In [None]:
CNN_BOW=CNNtext(max_length,
              encoder.classes_.__len__(),
              tokenizer=tokenizer)

In [None]:
keras2ascii(CNN_BOW)

In [None]:
hist = CNN_BOW.fit(x_train, y_train_categorical,
                 batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list,
                 validation_data=(x_validation,y_validation_categorical),
                 shuffle=True)

We checked the performance using the test set

In [None]:
loss, accuracy = CNN_BOW.evaluate(x_test,encoder.transform(df_test.label.values), verbose=1)
print('Accuracy: %f' % (accuracy*100),'loss: %f' % (loss*100))

In [None]:
def plot_model_perfomance(hist,name):
    plt.style.use('fivethirtyeight')
    plt.figure(1)
    plt.plot(hist.history['loss'], lw=2.0, color='b', label='train')
    plt.plot(hist.history['val_loss'], lw=2.0, color='r', label='val')
    plt.title(name)
    plt.xlabel('Epochs')
    plt.ylabel('Cross-Entropy Loss')
    plt.legend(loc='upper right')
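    plt.figure(2)
    #older Keras versions store the metric as 'acc', newer ones as 'accuracy'
    acc_key = 'acc' if 'acc' in hist.history else 'accuracy'
    plt.plot(hist.history[acc_key], lw=2.0, color='b', label='train')
    plt.plot(hist.history['val_' + acc_key], lw=2.0, color='r', label='val')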
    plt.title(name)
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend(loc='upper left')
    plt.show()

In [None]:
plot_model_perfomance(hist,'CNN BOW')

Now we will construct the confusion matrices from the predictions for the test, validation and training sets
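As a reminder of what a confusion matrix contains, the sketch below builds one by hand for a toy three-class case. Rows are true classes and columns are predicted classes, which is the same convention scikit-learn uses; the helper `confusion_counts` and the toy labels are assumptions made for this illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, classes):
    """Matrix where entry [i][j] counts samples of class i predicted as class j."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [
        [pair_counts[(true, predicted)] for predicted in classes]
        for true in classes
    ]


classes = ["a", "b", "c"]
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "a"]

matrix = confusion_counts(y_true, y_pred, classes)
# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(y_true)
print(matrix)
print(accuracy)
```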

In [None]:
bow_predict_y_test = CNN_BOW.predict(x_test,verbose=1)
bow_predict_y_train = CNN_BOW.predict(x_train,verbose=1)
bow_predict_y_validation = CNN_BOW.predict(x_validation,verbose=1)

In [None]:
bow_predict_y_test= encoder.inverse_transform(bow_predict_y_test)
bow_predict_y_train= encoder.inverse_transform(bow_predict_y_train)
bow_predict_y_validation= encoder.inverse_transform(bow_predict_y_validation)

In [None]:
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y=None,y_predict=None,classes=None,name=None):
    plt.figure(figsize=(30, 30))
    sns.heatmap(confusion_matrix(y,y_predict), 
                xticklabels=classes,
                yticklabels=classes)
    plt.title(name)
    plt.show()

This is the confusion matrix for the test set.

In [None]:
plot_confusion_matrix(df_test.label.values,bow_predict_y_test,encoder.classes_,'Test accuracy CNN BOW')

This is the confusion matrix for the validation set.

In [None]:
plot_confusion_matrix(y_validation,bow_predict_y_validation,encoder.classes_,'Validation accuracy CNN BOW')

This is the confusion matrix for the train set.

In [None]:
plot_confusion_matrix(y_train,bow_predict_y_train,encoder.classes_,'Train accuracy CNN BOW')

<a id='3.2'></a>
## FastText integration 
In this section we use the fastText embeddings and see how our results are affected.
We are using a pretrained [fastText embedding](https://www.kaggle.com/facebook/fasttext-english-word-vectors-including-subwords#wiki-news-300d-1M-subword.vec) from Kaggle.

First we build the embedding matrix for our vocabulary

In [None]:
def read(file=None,embed_dim=300,threshold=None, vocabulary=None):
    #one row per word index of the tokenizer (row 0 stays zero for padding), or `threshold` rows if given
    embedding_matrix= np.zeros((max(vocabulary.index_word.keys())+1, embed_dim)) if threshold is None else np.zeros((threshold, embed_dim))
    #embedding_matrix= np.zeros((vocabulary.word_index.__len__()+1, embed_dim)) if threshold is None else np.zeros((threshold, embed_dim))
    #words_not_found collects the fastText entries that are absent from our vocabulary
    words_not_found=[]
    matching=[]
    #the context manager closes the file even if parsing fails
    with codecs.open(file, encoding='utf-8') as f:
        for line in tqdm(f):
            vec = line.rstrip().rsplit(' ')
            word=vec[0].lower()
            if word in vocabulary.word_index:
                matching.append(word)
                #copy the pre-trained vector into the row of this word's index
                embedding_matrix[vocabulary.word_index[word]]= np.asarray(vec[1:], dtype='float32')
            else:
                words_not_found.append(word)
    return embedding_matrix,words_not_found,matching
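The way the embedding matrix is filled can be followed on a tiny in-memory example. The three lines in `vec_lines`, the 3-dimensional vectors and the `word_index` dictionary are all made up for illustration; the real file has 300 dimensions per word.

```python
# Toy stand-in for the lines of a .vec file: a word followed by its vector.
vec_lines = ["the 0.1 0.2 0.3", "Cat 0.4 0.5 0.6", "zebra 0.7 0.8 0.9"]

# Toy stand-in for tokenizer.word_index.
word_index = {"the": 1, "cat": 2, "dog": 3}
embed_dim = 3

# One row per index; row 0 stays zero and is used for padding.
matrix = [[0.0] * embed_dim for _ in range(max(word_index.values()) + 1)]

for line in vec_lines:
    parts = line.rstrip().split(" ")
    word = parts[0].lower()
    if word in word_index:
        # Copy the pre-trained vector into the row of this word.
        matrix[word_index[word]] = [float(value) for value in parts[1:]]

print(matrix)
```

Words of the vocabulary that have no pre-trained vector ("dog" in this toy case) keep an all-zero row, so the model cannot distinguish them from padding when the layer is frozen.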

In [None]:
embedding_matrix,words_not_found,match= read("../input/fasttext-english-word-vectors-including-subwords/wiki-news-300d-1M-subword.vec",vocabulary=tokenizer)

In [None]:
print(f"{len(words_not_found)} fastText entries not present in our vocabulary")

In [None]:
embedding_matrix.shape

<a id='5.2'></a>
Now it is time to build the model; this time we set the weights of the embedding layer using the embedding matrix built from the fastText vectors.

In [None]:
CNN_fastText=CNNtext(max_length,
                     encoder.classes_.__len__(),
                     embedding_matrix=embedding_matrix,
                     tokenizer=tokenizer)

In [None]:
keras2ascii(CNN_fastText)

In [None]:
hist = CNN_fastText.fit(x_train, y_train_categorical,
                 batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list,
                 validation_data=(x_validation,y_validation_categorical),
                 shuffle=True)

In [None]:
plot_model_perfomance(hist,'CNN FastText')

In [None]:
ft_predict_y_test = CNN_fastText.predict(x_test,verbose=1)
ft_predict_y_train = CNN_fastText.predict(x_train,verbose=1)
ft_predict_y_validation = CNN_fastText.predict(x_validation,verbose=1)

In [None]:
ft_predict_y_test= encoder.inverse_transform(ft_predict_y_test)
ft_predict_y_train= encoder.inverse_transform(ft_predict_y_train)
ft_predict_y_validation= encoder.inverse_transform(ft_predict_y_validation)

In [None]:
loss, accuracy = CNN_fastText.evaluate(x_test,encoder.transform(df_test.label.values), verbose=1)
print('Accuracy: %f' % (accuracy*100),'loss: %f' % (loss*100))

In [None]:
plot_confusion_matrix(df_test.label.values,ft_predict_y_test,encoder.classes_,'Test accuracy CNN FastText')

In [None]:
plot_confusion_matrix(y_validation,ft_predict_y_validation,encoder.classes_,'Validation accuracy CNN FastText')

In [None]:
plot_confusion_matrix(y_train,ft_predict_y_train,encoder.classes_,'Train accuracy CNN FastText')

### Conclusions

CNNs are very useful for recognizing text patterns, and their properties allow us to design very strong models for NLP tasks. Comparing the FastText embeddings with the embeddings learned by the embedding layer in the first approach, we see that the confusion matrix results are better for the first approach. However, looking at the behaviour of the loss and accuracy, the results are better with the FastText embeddings, because the gap between validation and training is smaller.