### Libraries Required

Libraries for dataset preparation, feature engineering, and model training.

In [1]:
# if you want to run it in colab
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/My Drive/Deliverable

# !pip install xlsxwriter

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Deliverable


In [1]:
import pandas as pd
import numpy as np
import sklearn
import keras
from sklearn import metrics
from sklearn import model_selection ###use for train test split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, LabelEncoder # Use sklearn utility to convert label strings to numbered index
from keras.preprocessing import text, sequence ## for text preprocessing
from keras import layers, models, optimizers ##keras model making function
from keras.utils import to_categorical #for preprocessing
from keras.utils import plot_model

Using TensorFlow backend.


#### Libraries Version

In [2]:
print ('Numpy tested version: ',np.__version__)
print ('pandas tested version: ',pd.__version__)
print ('keras tested version: ',keras.__version__)
print ('sklearn tested version: ',sklearn.__version__)

Numpy tested version:  1.16.2
pandas tested version:  0.24.2
keras tested version:  2.2.4
sklearn tested version:  0.21.1


## Dataset preparation

1. Loading the dataset
2. Counting the articles per category

In [3]:
data = pd.read_excel("dataset/Dataset0.xlsx")
data.head()

Unnamed: 0,Category,Article
0,politics,From: thomasr@cpqhou.se.hou.compaq.com (G. Tho...
1,politics,ukip s secret weapon by any measure new york...
2,politics,From: hambidge@bms.com\nSubject: Re: Gun Contr...
3,sports,Pakistan on revenge mission\n\nPakistan's cric...
4,politics,From: garrett@Ingres.COM\nSubject: Re: Return ...


In [4]:
data['Category'].value_counts()

politics          3626
Food & drink      2543
Technology        2134
sports            1963
Promotion          876
entertainment      541
Family             297
travel             297
education          133
style & beauty      74
Name: Category, dtype: int64

## Feature Engineering

Raw text data will be transformed into feature vectors and new features will be created using the existing dataset. 

1. Tokenizing
2. Word Embeddings as features

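### Train Test Splitting
The dataset is split into training, validation, and test sets (50% / 25% / 25%) so that the classifier can be trained and then evaluated on unseen articles. Because shuffle = False is passed, the rows are split in file order and random_state has no effect.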

In [5]:
np.random.seed(99)

In [6]:
train_X, valid_X, train_Y, valid_Y = model_selection.train_test_split(data['Article'].values, data['Category'].values, shuffle = False, test_size = 0.5, random_state = 99)

In [7]:
valid_X, test_X, valid_Y,test_Y = model_selection.train_test_split(valid_X, valid_Y, shuffle = False, test_size = 0.5, random_state = 99)
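
With shuffling turned off, the two calls above amount to slicing the data in file order. The sketch below reproduces that arithmetic in plain Python, assuming the 12,484 rows implied by the category counts shown earlier. The helper name `split_in_order` is hypothetical; it mirrors the behaviour of train_test_split for this case rather than calling scikit-learn.

```python
import math

def split_in_order(items, test_size):
    """Split a sequence without shuffling: the head is train, the tail is test."""
    n_test = math.ceil(len(items) * test_size)
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

rows = list(range(12484))                          # stand-in for the articles
train, rest = split_in_order(rows, test_size=0.5)  # first half for training
valid, test = split_in_order(rest, test_size=0.5)  # remaining half split in two

print(len(train), len(valid), len(test))
```

The training size of 6242 agrees with the 4993 + 1249 samples reported by fit later in the notebook. Since the rows are not shuffled, any ordering by category in the spreadsheet carries over into the three splits.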

### Tokenizing

Tokenizing turns each text into a sequence of integers, where each integer is the index of a token in a dictionary.

Commands:
1. text.Tokenizer()
        a. Keras provides the Tokenizer class for preparing text documents for deep learning. It is fit once and can 
        then be reused to prepare multiple text documents, which makes it a good choice for larger projects.
        b. The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text 
        documents.




2. tokenizer.texts_to_sequences   > Converts each text into a list of word indices
    
        With the default Tokenizer settings, each text goes through four steps:

        a. Converts text to lowercase (lower=True).
        b. Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n').
        c. Splits words by space (split=" ").
        d. Replaces each word with its index from the fitted tokenizer. 

3. sequence.pad_sequences(maxlen=5000) 
   
        a. This function transforms a list of 'num_samples' sequences (lists of integers) into a 2D Numpy array of shape 
        '(num_samples, num_timesteps)'. 'num_timesteps' is either the 'maxlen' argument if provided, or the length of the 
        longest sequence otherwise.
        b. Sequences shorter than 'num_timesteps' are padded with zeros. With the default padding='pre', the zeros are 
        added at the start of the sequence.
        c. Sequences longer than 'num_timesteps' are truncated to fit. With the default truncating='pre', tokens are 
        removed from the start of the sequence.
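
To make the three commands concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helpers `to_words`, `fit_word_index`, and `pad_pre` are hypothetical names, and the sketch only assumes the Keras defaults (lowercasing, the default filter string, frequency-ranked indices starting at 1, and padding='pre' / truncating='pre').

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(doc):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return doc.lower().translate(table).split()

def fit_word_index(docs):
    """Map each word to an integer rank (1 = most frequent); 0 is kept for padding."""
    counts = Counter(word for doc in docs for word in to_words(doc))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of long sequences."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]

docs = ["The cat sat.", "The cat sat on the mat!"]
toy_index = fit_word_index(docs)
sequences = [[toy_index[w] for w in to_words(d)] for d in docs]
padded = pad_pre(sequences, maxlen=5)

print(toy_index)
print(padded)
```

The short document gains two leading zeros, while the long one loses its first token. With maxlen=5000, as used in this notebook, most articles are padded rather than truncated, and index 0 never refers to a real word.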

In [8]:
# create a tokenizer 
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(data['Article'])# fit the tokenizer on the documents
word_index = tokenizer.word_index

In [9]:
# convert text to sequence of tokens and pad them to ensure equal length vectors
train_x = sequence.pad_sequences(tokenizer.texts_to_sequences(train_X),maxlen=5000)
valid_x = sequence.pad_sequences(tokenizer.texts_to_sequences(valid_X),maxlen=5000)
test_x = sequence.pad_sequences(tokenizer.texts_to_sequences(test_X),maxlen=5000)
# article_data = sequence.pad_sequences(tokenizer.texts_to_sequences(data['Article']))

### Word Embedding

A word embedding represents words and documents as dense vectors. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Word embeddings can be trained on the input corpus itself or taken from pre-trained models such as GloVe, fastText, and Word2Vec. Here we use pre-trained 300-dimensional vectors loaded from the file wiki-news-300d-1M.vec, which is the file name of the English fastText release rather than a Word2Vec model.

Steps:

1. Loading the pre-trained word embeddings
2. Transforming text documents into sequences of tokens and padding them
3. Creating a mapping between tokens and their respective embeddings

In [11]:
embeddings_index = {}
embedding_matrix = np.zeros((len(word_index) + 1, 300))

In [12]:
%%time
# load the pre-trained word-embedding vectors. these weights were trained on a large number of words.
with open('Model/wiki-news-300d-1M.vec', encoding="utf8") as f:
    for line in f:
        values = line.rstrip().split(' ')
        if len(values) != 301:  # skip the "<count> <dim>" header line and any malformed row
            continue
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

Wall time: 3min 3s


In [13]:
# create token-embedding mapping
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
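
The loading and mapping steps can be tried on a tiny in-memory example. The sketch below assumes a .vec-style text layout (a "count dim" header followed by one word and its values per line). The three-dimensional vectors and the `toy_word_index` dictionary are made up for illustration.

```python
import io
import numpy as np

# Tiny stand-in for a .vec file: a "<count> <dim>" header, then one word per line.
vec_text = "2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
dim = 3

embeddings = {}
for line in io.StringIO(vec_text):
    parts = line.rstrip().split(" ")
    if len(parts) != dim + 1:      # skips the header line
        continue
    embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

toy_word_index = {"cat": 1, "fish": 2, "dog": 3}   # hypothetical tokenizer output
matrix = np.zeros((len(toy_word_index) + 1, dim))  # row 0 is reserved for padding
for word, i in toy_word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        matrix[i] = vector

print(matrix)
```

Words missing from the pre-trained file ("fish" here) keep an all-zero row, so the network sees them the same way it sees padding.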

### Categorical Variable Handling

Our target values are category names, so we need to convert them to numbers before the model can use them. For this we use a label encoder, followed by one-hot encoding.

In [14]:
# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(data['Category'])
train_y0 = encoder.transform(train_Y)
valid_y0 = encoder.transform(valid_Y)
test_y0 = encoder.transform(test_Y)

In [15]:
num_classes = len(encoder.classes_)  # count every category, even one missing from the unshuffled training slice
train_y = to_categorical(train_y0, num_classes)
valid_y = to_categorical(valid_y0, num_classes)
test_y = to_categorical(test_y0, num_classes)
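
The two encoding steps can be illustrated with NumPy alone. This is a sketch of the idea, not the code used above: np.unique stands in for LabelEncoder (both sort the class names), and indexing an identity matrix stands in for to_categorical. The four labels are a made-up sample.

```python
import numpy as np

labels = np.array(["sports", "politics", "travel", "politics"])

# Sorted unique classes plus the index of each label within them.
classes, encoded = np.unique(labels, return_inverse=True)

# One-hot: pick row `k` of the identity matrix for class index `k`.
one_hot = np.eye(len(classes))[encoded]

print(classes)
print(encoded)
print(one_hot)
```

Each row of the one-hot array contains a single 1 in the column of its class, which is the target format expected by the categorical cross-entropy loss.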

## Model Building

The final step in the text classification framework is to train a classifier using the features created in the previous step. Many different machine learning models could be used for this. We will use a CNN to learn deep features of the text.

In Convolutional neural networks, convolutions over the input layer are used to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters and combines their results.

![model](model.png)


Layer functions:
1. layers.Input((features_numbers, )):
    Creates the input tensor. Its length is the number of features per sample, here the padded sequence length. 


2. embedding_layer = layers.Embedding():
    The embedding layer maps each word index to a dense vector, so that relations between words are expressed in a 
    vector space.
        a. Basically, our neural network captures the underlying structure of the inputs (our sentences) and encodes 
        relations between words in our vocabulary as vectors of a chosen dimension (let's say 2), learned by optimization.
        b. One such structure is how often each word appears together with other words from our vocabulary
        (in a very naive approach we can calculate it by hand).
        c. This co-occurrence frequency is only one of many underlying structures that a neural network can capture.
        d. You can find the intuition on the [youtube link](https://www.youtube.com/watch?v=kw9R0nD69OU) explaining the
        word embeddings. 
        e. Word-embedding techniques such as word2vec try to capture the full meaning of words in the resulting embedding, 
        whereas the embedding layer in a supervised network might not learn such a semantically rich and general 
        representation. It is therefore often useful to initialize the embedding layer with weights learned on a big corpus.


3. layers.SpatialDropout1D():
    This layer performs the same function as Dropout, but it drops entire 1D feature maps instead of individual 
    elements. For an embedded document of shape (timesteps, embedding_dim), it zeroes whole embedding dimensions 
    across all time steps during training, rather than individual values or individual words.


4. layers.Convolution1D():
    This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) 
    dimension to produce a tensor of outputs.
    

5. layers.GlobalMaxPool1D():
    Performs max pooling over the whole time dimension: for each convolution filter, only the maximum value across 
    all positions is kept and passed to the next layer.


6. layers.Dense(): Passes the data through a layer of densely connected neurons.


7. layers.Dropout(): Randomly drops the given fraction of the layer's inputs during training.
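
The convolution and pooling steps are easier to follow with explicit shapes. The NumPy sketch below is an illustration under simplifying assumptions (one document, random weights, "valid" padding, ReLU activation), not the Keras implementation. All sizes are toy values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, emb_dim, n_filters, kernel = 7, 4, 2, 3

x = rng.normal(size=(timesteps, emb_dim))          # one embedded document
w = rng.normal(size=(kernel, emb_dim, n_filters))  # convolution filters
b = np.zeros(n_filters)

# "valid" convolution: slide a window of `kernel` tokens along the sequence.
windows = np.stack([x[t:t + kernel] for t in range(timesteps - kernel + 1)])
conv = np.maximum(np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b, 0)  # ReLU

pooled = conv.max(axis=0)   # global max pooling: strongest response per filter

print(conv.shape, pooled.shape)
```

With the settings used in this notebook (5000 tokens, 100 filters, kernel size 3), the same arithmetic gives a 4998 x 100 convolution output, which global max pooling reduces to 100 values per article.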



In [16]:
def create_cnn():
    # Add an Input Layer
    input_layer = layers.Input((train_x.shape[1], ))

    # Add the word embedding Layer (pre-trained weights, kept frozen)
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable = False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(num_classes, activation="softmax")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy')
    
    return model

### Training the model

In [17]:
classifier = create_cnn() #calling the model.

In [18]:
# fit the training dataset on the classifier
hist=classifier.fit(train_x, train_y,epochs=5,verbose=1,shuffle=True,validation_split=0.2)

Train on 4993 samples, validate on 1249 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [19]:
# predict the labels on the validation dataset
predictions = classifier.predict(valid_x)

predictions = predictions.argmax(axis=-1) # index of the most probable class for each article

accuracy=metrics.accuracy_score(valid_y0, predictions)
print ('Accuracy on validation data: ',  accuracy)

Accuracy on validation data:  0.9333546940083307
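
The accuracy computation itself is small enough to check by hand. The sketch below uses made-up softmax outputs for four articles and three classes. It shows the same argmax-then-compare logic in NumPy and does not rely on the trained classifier.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])
true_idx = np.array([0, 2, 1, 1])

pred_idx = probs.argmax(axis=-1)           # most probable class per row
accuracy = (pred_idx == true_idx).mean()   # fraction of exact matches

print(pred_idx, accuracy)
```

Three of the four predictions match, giving 0.75. Because the categories in this dataset are heavily imbalanced, a single accuracy figure can hide weak performance on the small classes.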


In [20]:
# save the trained weights (not the full model) so that they can be reloaded next time.
classifier.save_weights('Model/model_trained_weights.hd5')

### Predicting and testing the model:

In [22]:
# Here's how to generate a prediction on individual examples
text_labels = encoder.classes_ 
tested_index=np.random.randint(0,len(test_x),size=10)

In [23]:
for i in tested_index:
    prediction = classifier.predict(np.array([test_x[i]]))
    predicted_label = text_labels[np.argmax(prediction)]
    print(i)
    print("..." + "\n" + test_X[i][:150] + "\n" "...")
    print("Actual label: " + text_labels[test_y0[i]])
    print("Predicted label: " + predicted_label + "\n")

2632
...
From: margoli@watson.ibm.com (Larry Margolis)
Subject: Re: I thought commercial Advertising was Not allowed
Distribution: na
News-Software: IBM OS/2 P
...
Actual label: politics
Predicted label: Technology

1296
...
From: romdas@uclink.berkeley.edu (Ella I Baff)
Subject: GETTING AIDS FROM ACUPUNCTURE NEEDLES
Organization: University of California, Berkeley
Lines: 
...
Actual label: Technology
Predicted label: Technology

3026
...
From: aldridge@netcom.com (Jacquelin Aldridge)
Subject: Re: Candida(yeast) Bloom, Fact or Fiction
Organization: NETCOM On-line Communication Services 
...
Actual label: Technology
Predicted label: Technology

1505
...
For Pastry: Sift sugar and flour into bowl and cut in chilled butter. Blend in cold water until it forms a ball and set aside to rest.
Preheat oven to
...
Actual label: Food & drink
Predicted label: Food & drink

334
...
From: turpin@cs.utexas.edu (Russell Turpin)
Subject: Re: Science and methodology (was: Homeopathy ... tradition?)
Orga

## Testing on New Data File

In [0]:
classifier.load_weights('Model/model_trained_weights.hd5') # Reloading the saved weights.

In [26]:
new_test_data=pd.read_excel('dataset/Dataset0.xlsx')
new_test_data.head()

Unnamed: 0,Category,Article
0,politics,From: thomasr@cpqhou.se.hou.compaq.com (G. Tho...
1,politics,ukip s secret weapon by any measure new york...
2,politics,From: hambidge@bms.com\nSubject: Re: Gun Contr...
3,sports,Pakistan on revenge mission\n\nPakistan's cric...
4,politics,From: garrett@Ingres.COM\nSubject: Re: Return ...


In [0]:
# convert text to sequence of tokens and pad them to ensure equal length vectors 
new_article_data = sequence.pad_sequences(tokenizer.texts_to_sequences(new_test_data['Article']),maxlen=train_x.shape[1])

In [0]:
prediction=classifier.predict(new_article_data)##predicting on new data
predicted_label = text_labels[np.argmax(prediction,axis=1)]

In [29]:
new_test_data['predicted labels']=predicted_label ## saving the result on dataframe
new_test_data.head()

Unnamed: 0,Category,Article,predicted labels
0,politics,From: thomasr@cpqhou.se.hou.compaq.com (G. Tho...,politics
1,politics,ukip s secret weapon by any measure new york...,politics
2,politics,From: hambidge@bms.com\nSubject: Re: Gun Contr...,politics
3,sports,Pakistan on revenge mission\n\nPakistan's cric...,sports
4,politics,From: garrett@Ingres.COM\nSubject: Re: Return ...,politics


In [0]:
new_test_data.to_excel('Result/result_data.xlsx',engine='xlsxwriter') ##saving the excel file.
plot_model(classifier, to_file='Model/model.png', show_shapes=True)