# Deep learning for multiclass text classification
#### Applied Text Mining - Utrecht Summer School

Welcome to the sixth practical of the course “Applied Text Mining”. In this practical, we will apply various deep learning models to multiclass classification. We will work with the 20 Newsgroups data set from the sklearn library and train deep learning models on it using the Keras library. 

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, and it has become a popular data set for experiments in text applications of machine learning techniques.

Keras is a deep learning and neural networks API written by François Chollet. It is capable of running on top of TensorFlow (Google), Theano or CNTK (Microsoft).

### Let's get started!

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
from tensorflow import keras
import pandas as pd
import json
import numpy as np

### Exercises

1\. **Load the train and test subsets of the 20 Newsgroups data set from sklearn datasets. Remove the headers, footers and quotes from the news articles when loading the data sets. Use 321 for random_state. To get faster execution times for this practical, we will work on a partial data set with only 5 of the 20 available categories ('rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', 'sci.med').**

In [2]:
categories = ['rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), 
                                  categories=categories, shuffle=True, random_state=321)
# type(twenty_train)

In [4]:
twenty_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), 
                                 categories=categories, shuffle=True, random_state=321)

2\. **Find out the number of news articles in the train and test sets.**

In [None]:
twenty_train.target_names

In [None]:
twenty_train.filenames.shape

In [None]:
twenty_test.filenames.shape

3\. **Convert the train and test data sets to separate dataframes.**

In [5]:
import pandas as pd
df_train = pd.DataFrame(list(zip(twenty_train.data, twenty_train.target)), columns=['text', 'label'])
df_train.head()

Unnamed: 0,text,label
0,\nDr. cheghadr bA namakand! They just wait un...,4
1,\n\n\n\n\n:) No...I was one of the lucky ones....,2
2,\n\n[After a small refresh Hasan got on the tr...,4
3,Before getting excited and implying that I am ...,4
4,I have posted disp135.zip to alt.binaries.pict...,0


In [6]:
df_test = pd.DataFrame(list(zip(twenty_test.data, twenty_test.target)), columns=['text', 'label'])
df_test.head()

Unnamed: 0,text,label
0,"hi all, Ive applied for the class of 93 at qui...",2
1,:In article <enea1-270493135255@enea.apple.com...,2
2,"\nI don't know the answer the to this one, alt...",0
3,\n\nWe here at IBM have the same problem with ...,0
4,\nI was at an Adobe seminar/conference/propaga...,0


In [7]:
tfidf = TfidfVectorizer(binary=True)

def tfidf_features(txt, flag):
    if flag == "train":
        x = tfidf.fit_transform(txt)
    else:
        x = tfidf.transform(txt)
    x = x.astype('float16')
    return x 

X_train = tfidf_features(df_train.text.values, flag="train")
X_test = tfidf_features(df_test.text.values, flag="test")

In [8]:
# Encode the class labels as integers, then one-hot encode them (the targets fed to our neural network)

lb = LabelEncoder()
y = lb.fit_transform(df_train.label.values)
y_train = keras.utils.to_categorical(y)

In [9]:
y_train.shape

(2941, 5)

In [10]:
y = lb.transform(df_test.label.values)
y_test = keras.utils.to_categorical(y)
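
To make the label encoding step concrete, the following minimal sketch reproduces what `keras.utils.to_categorical` does for integer class labels, using only NumPy. The `one_hot` helper and the toy labels are illustrative assumptions and are not part of the practical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer class ids into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

toy_labels = [4, 2, 4, 0]        # made-up labels in the range 0..4
encoded = one_hot(toy_labels, num_classes=5)
print(encoded.shape)             # one row per label, one column per class
print(encoded.argmax(axis=1))    # argmax recovers the original labels
```

Applied to the 2941 training labels and 5 categories, the same operation gives an array with the `(2941, 5)` shape reported for `y_train`.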


In [None]:
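# Run this cell after the model has been trained, so that `history` exists.
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'], label='train')
if 'val_accuracy' in history.history:  # only recorded when validation data is passed to fit()
    plt.plot(history.history['val_accuracy'], label='validation')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'], label='train')
if 'val_loss' in history.history:  # only recorded when validation data is passed to fit()
    plt.plot(history.history['val_loss'], label='validation')
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()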

4\. **To feed predictive deep learning models with text data, one first needs to turn the text into vectors of numerical values suitable for statistical analysis. Use the CountVectorizer to create a document-term matrix.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(df_train.text.values)
X_train.shape

In [None]:
X_train.nnz / float(X_train.shape[0])


In [None]:
X_test = count_vect.transform(df_test.text.values)

In [None]:
count_vect.vocabulary_

The extracted vectors are very sparse, with an average of 111 non-zero components per sample in a space of more than 37,000 dimensions (less than 0.3% non-zero features).
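
The idea behind a document-term matrix and its sparsity can be sketched without sklearn. The toy corpus, the whitespace tokenizer and the helper name `build_document_term_matrix` below are assumptions for illustration only; `CountVectorizer` uses a more elaborate tokenization and stores the result as a SciPy sparse matrix.

```python
from collections import Counter

def build_document_term_matrix(docs):
    """Hypothetical helper: count word occurrences per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary maps each distinct token to a column index.
    distinct = sorted({tok for tokens in tokenized for tok in tokens})
    vocabulary = {tok: j for j, tok in enumerate(distinct)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for tok, count in Counter(tokens).items():
            matrix[i][vocabulary[tok]] = count
    return matrix, vocabulary

docs = [
    "the goalie saved the puck",
    "the doctor prescribed medicine",
    "render the image",
]
matrix, vocabulary = build_document_term_matrix(docs)

n_docs, n_terms = len(matrix), len(vocabulary)
# Number of non-zero cells, the dense-list analogue of `X_train.nnz`.
nnz = sum(1 for row in matrix for value in row if value != 0)
print("shape:", (n_docs, n_terms))
print("average non-zero entries per document:", nnz / n_docs)
print("density:", nnz / (n_docs * n_terms))
```

On the real data, `X_train.nnz / X_train.shape[0]` plays the role of the per-document average computed here.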

5\. **Build a simple feed-forward neural network with Keras: one hidden dense layer with 10 ReLU units, followed by a softmax output layer with 5 units (one per category). Compile the model, train it for 20 epochs on the training set, and evaluate it on the test set.**

In [12]:
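from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

input_dim = X_train.shape[1]  # Number of features

model = Sequential()
# Hidden layer with 10 ReLU units, followed by a softmax over the 5 categories
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))

# Note: with one-hot targets and a softmax output, 'categorical_crossentropy'
# is the more usual loss; 'binary_crossentropy' is kept here because the
# results shown below were obtained with it.
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()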


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                371490    
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55        
Total params: 371,545
Trainable params: 371,545
Non-trainable params: 0
_________________________________________________________________
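
The parameter counts in the summary can be checked by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. The sketch below assumes 37,148 input features, a value inferred from the summary itself (371,490 = (37,148 + 1) × 10), and `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 37_148  # assumption: inferred from the summary, not stated in the practical
hidden = dense_params(n_features, 10)  # first Dense layer
output = dense_params(10, 5)           # softmax output layer
print(hidden, output, hidden + output)
```

Almost all parameters sit in the first layer, because its size grows linearly with the vocabulary.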


In [15]:
history = model.fit(X_train, y_train, epochs=20, batch_size=512)
model.save_weights("model.h5")
print("Saved model to disk")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Saved model to disk


In [17]:
accr = model.evaluate(X_test,y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.211
  Accuracy: 0.891
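
To see what the reported loss measures, here is a minimal sketch that computes two cross-entropy variants for a single made-up prediction. The probabilities and the helper names are assumptions for illustration. The element-wise (binary) variant averages over the five outputs, which is how Keras defines `binary_crossentropy`, the loss the model above was compiled with. The categorical variant only looks at the probability assigned to the true class.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability of the true class (y_true is one-hot)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    """Mean of the per-output binary cross-entropies."""
    terms = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)

y_true = [0, 0, 1, 0, 0]                 # one-hot target: class 2
y_pred = [0.05, 0.05, 0.80, 0.05, 0.05]  # made-up softmax output

print("categorical:", round(categorical_crossentropy(y_true, y_pred), 4))
print("binary:     ", round(binary_crossentropy(y_true, y_pred), 4))
```

Because the two losses are on different scales, loss values obtained with one should not be compared directly with values obtained with the other.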


In [18]:
df_train.label.shape

(2941,)

In [None]:
X_train.shape

In [None]:
df_train.label.values

7\. **Evaluate the quality of the KMeans clustering with 5 clusters using the sklearn metrics for clustering: homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, silhouette_score.**

8\. **Repeat the steps in questions 4 and 5 with 2 and 10 as the number of clusters. Compare the visualisations (for example, with a plot). Apply the Elbow method to find the optimal k.**

9\. **Manually create a dataframe with two news articles x and z, and predict cluster labels for this new data set with the optimal KMeans model.**

10\. **Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster unlabeled data points. Like K-means clustering, hierarchical clustering groups together data points with similar characteristics. Apply hierarchical clustering with Ward linkage.**

11\. **Plot the dendrogram of the hierarchical clustering.**

12\. **Train a Non-negative Matrix Factorization (NMF) model for topic modeling on a tf document-term matrix of the BBCSport dataset.**

13\. **Show top 10 words per topic with their probabilities.**

14\. **Use the NMF model to identify the topic of a new text: "Robben open to playing at Euro 2020 with Netherlands after bagging two assists for Groningen". Simply call the transform function of the model and it will give you a score for each topic. Choose the topic with the highest score to determine the text's topic.**

15\. **Train a Latent Dirichlet Allocation model for topic modeling and repeat the steps in 12, 13 and 14.**

16\. **Compare the performance of the LDA model versus the NMF model.**