# Practical 2: Text classification
#### Ayoub Bagheri
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />

In this practical, we are going to create a text classification pipeline. We will work with the well-known 20 Newsgroups data set, which can be loaded through the sklearn library.

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, and it has become a popular data set for experiments in text applications of machine learning techniques.

Today we will use the following libraries. Take care to have them installed!

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
import pandas as pd
import numpy as np

### Let's get started!

1\. **Use the code below to load the train and test subsets of the 20 Newsgroups data set from sklearn datasets. Remove the headers, footers and quotes from the news articles when loading the data sets. Use 321 for random_state. To get faster execution times in this practical, we will work on a partial data set with only 5 of the 20 available categories: ('rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', 'sci.med').**

In [2]:
categories = ['rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
twenty_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), 
                                  categories=categories, shuffle=True, random_state=321)
# type(twenty_train)

In [4]:
twenty_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), 
                                 categories=categories, shuffle=True, random_state=321)

2\. **Find out how many news articles there are in the train and test sets.**

In [5]:
twenty_train.target_names

['comp.graphics',
 'rec.sport.hockey',
 'sci.med',
 'soc.religion.christian',
 'talk.politics.mideast']

In [6]:
twenty_train.filenames.shape

(2941,)

In [7]:
twenty_test.filenames.shape

(1958,)
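
Beyond the total counts, it can be useful to check how the articles are spread over the categories. The following is a minimal sketch, assuming a label array shaped like `twenty_train.target`; the `target` and `target_names` values below are made up for illustration and are not the real data.

```python
import numpy as np

# Hypothetical stand-ins for twenty_train.target and twenty_train.target_names
target = np.array([4, 2, 4, 4, 0, 1, 3, 0])
target_names = ['comp.graphics', 'rec.sport.hockey', 'sci.med',
                'soc.religion.christian', 'talk.politics.mideast']

# np.bincount counts how often each integer label occurs
counts = np.bincount(target, minlength=len(target_names))
class_counts = dict(zip(target_names, counts.tolist()))
print(class_counts)
```

Replacing the toy arrays with `twenty_train.target` and `twenty_train.target_names` should give the per-category counts for the real training set.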

3\. **Convert the train and test sets to dataframes.**

In [8]:
df_train = pd.DataFrame(list(zip(twenty_train.data, twenty_train.target)), columns=['text', 'label'])
df_train.head()

| | text | label |
|---|---|---|
| 0 | `\nDr. cheghadr bA namakand! They just wait un...` | 4 |
| 1 | `\n\n\n\n\n:) No...I was one of the lucky ones....` | 2 |
| 2 | `\n\n[After a small refresh Hasan got on the tr...` | 4 |
| 3 | `Before getting excited and implying that I am ...` | 4 |
| 4 | `I have posted disp135.zip to alt.binaries.pict...` | 0 |


In [9]:
df_test = pd.DataFrame(list(zip(twenty_test.data, twenty_test.target)), columns=['text', 'label'])
df_test.head()

| | text | label |
|---|---|---|
| 0 | `hi all, Ive applied for the class of 93 at qui...` | 2 |
| 1 | `:In article <enea1-270493135255@enea.apple.com...` | 2 |
| 2 | `\nI don't know the answer the to this one, alt...` | 0 |
| 3 | `\n\nWe here at IBM have the same problem with ...` | 0 |
| 4 | `\nI was at an Adobe seminar/conference/propaga...` | 0 |
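
The `label` column holds integer codes, which are indices into `target_names`. The sketch below shows one way to add a readable category name next to each code. The `data`, `target` and `target_names` values are hypothetical stand-ins for the attributes of `twenty_train`, and the `label_name` column is an illustrative addition that the rest of the practical does not use.

```python
import pandas as pd

# Hypothetical stand-ins for twenty_train.data / .target / .target_names
data = ['text about hockey', 'text about medicine', 'text about graphics']
target = [1, 2, 0]
target_names = ['comp.graphics', 'rec.sport.hockey', 'sci.med']

df = pd.DataFrame({'text': data, 'label': target})

# Each integer label is an index into target_names
df['label_name'] = df['label'].map(lambda i: target_names[i])
print(df)
```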


4\. **To feed text data to classification models, you first need to turn the text into vectors of numerical values suitable for statistical analysis. Use TfidfVectorizer with the binary representation (binary=True) to create document-term matrices for the train and test sets (name them X_train and X_test). We built similar document-term matrices in the previous practical.**

In [10]:
# A function for transforming train or test into tfidf features
def tfidf_features(txt, flag):
    if flag == "train":
        x = tfidf.fit_transform(txt)
    else:
        x = tfidf.transform(txt)
    x = x.astype('float16')
    return x 

tfidf = TfidfVectorizer(binary=True)
X_train = tfidf_features(df_train.text.values, flag="train")
X_test = tfidf_features(df_test.text.values, flag="test")

# With CountVectorizer and without the function
# from sklearn.feature_extraction.text import CountVectorizer
# count_vect = CountVectorizer()
# X_train = count_vect.fit_transform(df_train.text.values)
# X_test = count_vect.transform(df_test.text.values)
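
To see what `binary=True` changes, here is a small sketch on a made-up two-document corpus (the `docs` and `probe` strings are illustrative only). With `binary=True`, the term frequency of every word that occurs is set to 1 before the idf weighting and normalization, so repeating a word within a document should not change that document's vector.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["puck puck puck goal", "goal medicine"]  # toy corpus
probe = ["puck puck puck goal", "puck goal"]     # same words, different counts

binary_tfidf = TfidfVectorizer(binary=True).fit(docs)
plain_tfidf = TfidfVectorizer().fit(docs)

bin_rows = binary_tfidf.transform(probe).toarray()
plain_rows = plain_tfidf.transform(probe).toarray()

# With binary=True both probe documents get the same vector;
# with raw term counts the repeated word shifts the weights
same_with_binary = np.allclose(bin_rows[0], bin_rows[1])
same_without_binary = np.allclose(plain_rows[0], plain_rows[1])
print(same_with_binary, same_without_binary)
```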

In [11]:
X_train.nnz / float(X_train.shape[0])

111.5678340700442

The extracted vectors are very sparse, with an average of about 111 non-zero components per sample in a space of more than 37,000 dimensions (around 0.3% non-zero features).
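
The two sparsity figures can be computed directly from a sparse matrix. The sketch below uses a tiny hand-made matrix as a stand-in for `X_train`, so the numbers it prints are not those of the real data.

```python
from scipy.sparse import csr_matrix

# Toy sparse matrix standing in for X_train (3 documents, 5 terms)
X = csr_matrix([[0.0, 0.5, 0.0, 0.0, 0.5],
                [0.0, 0.0, 0.0, 1.0, 0.0],
                [0.3, 0.0, 0.0, 0.0, 0.0]])

n_docs, n_terms = X.shape
avg_nonzero = X.nnz / n_docs          # average non-zero entries per document
density = X.nnz / (n_docs * n_terms)  # fraction of all entries that are non-zero
print(avg_nonzero, density)
```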

In [12]:
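# Average number of non-zero features per test document:
# divide by the number of test rows (1958), not the number of train rows
round(X_test.nnz / X_test.shape[0], 2)

113.84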

In [13]:
# tfidf.vocabulary_

In [14]:
df_train.label.values

array([4, 2, 4, ..., 0, 4, 4], dtype=int64)

5\. **Create y_train and y_test objects from the df_train.label.values and df_test.label.values, respectively.**

In [15]:
y_train = df_train.label.values

In [16]:
y_train.shape

(2941,)

In [17]:
y_test = df_test.label.values

6\. **Select at least two of the following classifiers and train two models on the data set.**
    - [K-Nearest Neighbor classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
    - [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
    - [Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
    - [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

In [18]:
knn = KNeighborsClassifier(n_neighbors=3)
model = knn.fit(X_train, y_train)

knn = KNeighborsClassifier(n_neighbors=10)
model2 = knn.fit(X_train, y_train)

knn = KNeighborsClassifier(n_neighbors=100)
model3 = knn.fit(X_train, y_train)
print('accuracy with 3 neighbours:', model.score(X_test, y_test),
      '\naccuracy with 10 neighbours:', model2.score(X_test, y_test), 
      '\naccuracy with 100 neighbours:', model3.score(X_test, y_test))

accuracy with 3 neighbours: 0.2170582226762002 
accuracy with 10 neighbours: 0.20684371807967314 
accuracy with 100 neighbours: 0.8600612870275791


In [19]:
nb = MultinomialNB(alpha=1)
model = nb.fit(X_train, y_train)

nb = MultinomialNB(alpha=10)
model2 = nb.fit(X_train, y_train)
print('accuracy with alpha=1:', model.score(X_test, y_test),
      '\naccuracy with alpha=10:', model2.score(X_test, y_test))

accuracy with alpha=1: 0.8263534218590398 
accuracy with alpha=10: 0.6634320735444331


In [20]:
svm = LinearSVC(C=1.0)
model = svm.fit(X_train, y_train)

svm = LinearSVC(C=0.1)
model2 = svm.fit(X_train, y_train)
print('accuracy with default regularization:', model.score(X_test, y_test), 
      '\naccuracy with more regularization:', model2.score(X_test, y_test))

accuracy with default regularization: 0.8973442288049029 
accuracy with more regularization: 0.8810010214504597


In [21]:
tree = DecisionTreeClassifier(max_depth=5)
model = tree.fit(X_train, y_train)

tree = DecisionTreeClassifier(max_depth=None)
model2 = tree.fit(X_train, y_train)
print('accuracy with maximum tree depth 5:', model.score(X_test, y_test), 
      '\naccuracy with unlimited tree depth:', model2.score(X_test, y_test))

accuracy with maximum tree depth 5: 0.48621041879468846 
accuracy with unlimited tree depth: 0.6307456588355465


In [22]:
rfc = RandomForestClassifier(n_estimators=3)
model = rfc.fit(X_train, y_train)

rfc = RandomForestClassifier(n_estimators=20)
model2 = rfc.fit(X_train, y_train)
print('accuracy with 3 trees:', model.score(X_test, y_test), 
      '\naccuracy with 20 trees:', model2.score(X_test, y_test))

accuracy with 3 trees: 0.5097037793667007 
accuracy with 20 trees: 0.7293156281920327
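
The cells above repeat the same fit-and-score pattern for each classifier. One possible way to tidy this up is to loop over a dictionary of candidate models. The sketch below does so on a tiny made-up corpus, so that it runs without downloading the data; the texts, labels and dictionary keys are all illustrative assumptions, and the scores on such a small set say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): 0 = hockey, 1 = medicine
train_texts = ["the goalie saved the puck",
               "hockey players skate on ice",
               "the doctor prescribed medicine",
               "patients take medicine for disease"]
train_labels = [0, 0, 1, 1]
test_texts = ["the puck hit the goalie", "medicine for the patients"]
test_labels = [0, 1]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_texts)  # fit the vocabulary on train only
X_te = vectorizer.transform(test_texts)       # reuse that vocabulary for test

candidates = {
    'naive_bayes': MultinomialNB(),
    'linear_svm': LinearSVC(),
    'decision_tree': DecisionTreeClassifier(random_state=0),
}

# fit() returns the estimator itself, so fit and score can be chained
scores = {name: clf.fit(X_tr, train_labels).score(X_te, test_labels)
          for name, clf in candidates.items()}
print(scores)
```

Keeping every model in one dictionary also avoids reusing variable names such as `model` and `model2` across cells.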


7\. **Using a [Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier), we can combine multiple classifiers, an approach known as ensemble learning. Do we get better results when we combine the classifiers?**

In [23]:
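# knn, nb, svm and tree refer to the estimators created last in step 6
# (n_neighbors=100, alpha=10, C=0.1 and max_depth=None); voting is 'hard' by default
vc = VotingClassifier(estimators=[('knn', knn), ('nb', nb), ('svm', svm), ('tree', tree)])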
vc.fit(X_train, y_train)
vc.score(X_test, y_test)

0.8621041879468846
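
By default, `VotingClassifier` uses hard voting: each classifier predicts a label and the most frequent label wins. The idea can be sketched with plain NumPy on made-up predictions (the `preds` array below is hypothetical and not the output of the models above).

```python
import numpy as np

# Hypothetical predictions of three classifiers for four documents
preds = np.array([[0, 1, 2, 2],   # classifier A
                  [0, 1, 1, 2],   # classifier B
                  [1, 1, 2, 0]])  # classifier C

# For each document (one column), pick the most frequent label;
# argmax resolves ties in favour of the smallest label
majority = np.array([np.bincount(col).argmax() for col in preds.T])
print(majority)
```

A single strong classifier can be outvoted by several weaker ones, which is one possible reason why an ensemble does not always beat its best member.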

8\. **To make building a text classifier easier, we can use the Pipeline class from sklearn. Create a pipeline with TfidfVectorizer and your best classifier from step 6.**

In [24]:
text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

9\. **Fit the pipeline on your training set.**

In [25]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', TfidfVectorizer()), ('clf', LinearSVC())])
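
Because the vectorizer is part of the pipeline, both `fit` and `predict` take raw strings and no separate document-term matrix is needed. Here is a minimal sketch on a made-up corpus; the texts, labels and the name `toy_clf` are assumptions for illustration, not part of the practical's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training set (made up for illustration): 0 = hockey, 1 = medicine
train_texts = ["the goalie saved the puck",
               "hockey players skate on ice",
               "the doctor prescribed medicine",
               "patients take medicine for disease"]
train_labels = [0, 0, 1, 1]

toy_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
toy_clf.fit(train_texts, train_labels)

# Raw strings go straight into predict(); the pipeline vectorizes them first
new_texts = ["the puck hit the goalie", "medicine for the patients"]
predicted_labels = toy_clf.predict(new_texts)
print(predicted_labels)
```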

10\. **Compute the accuracy on the test set.**

In [26]:
predicted = text_clf.predict(twenty_test.data)
acc = np.mean(predicted == twenty_test.target)
acc

0.8953013278855976
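
The accuracy computed with `np.mean` is the fraction of predictions that match the true labels, which is what `metrics.accuracy_score` should return as well. A small sketch with made-up label arrays:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

acc_manual = np.mean(y_pred == y_true)                # share of matching entries
acc_sklearn = metrics.accuracy_score(y_true, y_pred)  # same quantity via sklearn
print(acc_manual, acc_sklearn)
```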

11\. **Can you also compute precision, recall and f1?**

In [27]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

         comp.graphics       0.81      0.95      0.88       389
      rec.sport.hockey       0.94      0.93      0.94       399
               sci.med       0.91      0.83      0.87       396
soc.religion.christian       0.91      0.88      0.90       398
 talk.politics.mideast       0.92      0.88      0.90       376

              accuracy                           0.90      1958
             macro avg       0.90      0.90      0.90      1958
          weighted avg       0.90      0.90      0.90      1958
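
The per-class figures in such a report can be derived from the confusion matrix. The sketch below does this for made-up labels (the `y_true` and `y_pred` arrays are hypothetical): precision divides the correct predictions for a class by everything predicted as that class, recall divides them by everything that truly belongs to that class, and f1 is the harmonic mean of the two.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for three classes
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
tp = np.diag(cm)                  # correctly classified documents per class

precision = tp / cm.sum(axis=0)   # column sums: everything predicted as the class
recall = tp / cm.sum(axis=1)      # row sums: everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```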

