# Text Classification using Scikit Learn
### Diving into Machine Learning, NLP, and Text Classification with Scikit Learn

In this notebook, I perform text classification with Scikit Learn on the popular `20NewsGroups` dataset.

#### How Does Supervised Machine Learning Work?
1. We train the machine on well-labeled data, i.e. data already tagged with the correct answer.
2. We then give the machine a new set of examples. Using what it learned from the labeled training examples, the supervised learning algorithm predicts a label for each new example. In essence, it judges how closely the new data matches the labeled examples it has seen.
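
As a minimal sketch of these two steps, the toy classifier below "trains" by counting which words appear under each label, then labels a new sentence by the best word overlap. The data, labels, and function names are made up for illustration, and this is not the method used later in this notebook.

```python
from collections import Counter, defaultdict

# Step 1: labeled training examples (made-up data for illustration).
training_examples = [
    ("the engine and the brakes need service", "autos"),
    ("new tires improve how the car handles", "autos"),
    ("the rocket reached orbit around the moon", "space"),
    ("the telescope observed a distant galaxy", "space"),
]

def train(examples):
    """Count how often each word occurs under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts

def predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)

model = train(training_examples)

# Step 2: ask the trained model about an unseen example.
print(predict(model, "my car needs new brakes"))
```

Real classifiers, such as the ones used later in this notebook, follow the same train-then-predict pattern but learn statistically sounder decision rules.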

## 1. Understanding the Dataset - 20NewsGroups

I am going to use the popular `20NewsGroups` dataset available in the Scikit Learn library. It is widely used in beginner tutorials, including the ones I have referred to.

The popular 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

* The tutorial I referred to: [towardsdatascience.com](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a).
* You can read more about this dataset here: http://qwone.com/~jason/20Newsgroups/

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
training_data = fetch_20newsgroups(subset='train', shuffle=True)

In [3]:
len(training_data.data)

11314

In [4]:
testing_data = fetch_20newsgroups(subset='test', shuffle=True)

In [5]:
len(testing_data.data)


7532

### Target Names (Categories of Our Data)

* Our data is organized into 20 different newsgroups, each corresponding to a different topic. 
* Some of the newsgroups are very closely related to each other, while others are highly unrelated.

In [6]:
len(training_data.target_names)

20

In [7]:
training_data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
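
In the loaded dataset, each entry of `training_data.target` is an integer index into this list of names. A small sketch of that mapping follows, using a shortened stand-in category list and made-up labels rather than the real dataset:

```python
from collections import Counter

# Stand-in values: a shortened category list and made-up labels.
target_names = ["alt.atheism", "comp.graphics", "rec.autos", "sci.space"]
target = [2, 3, 2, 0, 1, 2]  # one integer label per document

# Translate the first document's integer label into its newsgroup name.
first_label = target_names[target[0]]
print(first_label)

# Count how many documents fall into each category.
docs_per_category = Counter(target_names[i] for i in target)
print(docs_per_category)
```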

### Observation Samples

In [8]:
print(training_data.data[0])
print(testing_data.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu


 I am a little confused on all of the models of the 88-89 bon

We printed the first observation from both the training and the testing data. As we can see, they share the same format: email-style headers followed by the message body. Both posts also ask about cars, so a well-trained classifier should ideally place them in the same category.

## 2. Extracting Text Features

* **What do we have?**  - Text files/datasets, which are collections of words.
* **What do we need?**  - To run ML algorithms, we need to convert the text into numerical feature vectors.

### Converting to Document-Term Matrix using Bag of Words Model

#### What is the `Bag of Words` Model?
The Bag of Words model is a simplifying representation of text, in which a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model).

#### What are we going to do?
1. Segment each text file into words (for English, by splitting on spaces).
2. Count the occurrences of each word in each document.
3. Assign each word an integer id.
4. Each unique word in our dictionary will correspond to a feature (descriptive feature).
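
The steps above can be sketched in plain Python on two made-up sentences. This is only an illustration of the idea. `CountVectorizer`, used below, tokenizes differently (for example, it handles punctuation) and stores the result as a sparse matrix.

```python
from collections import Counter

documents = [
    "the car is fast",
    "the car is red and the road is long",
]

# Step 1: segment each document into lowercase words.
tokenized = [doc.lower().split() for doc in documents]

# Steps 3 and 4: give every unique word an integer id (one feature per word).
vocabulary = {}
for tokens in tokenized:
    for word in tokens:
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: count occurrences of each word in each document.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    row = [0] * len(vocabulary)
    for word, count in counts.items():
        row[vocabulary[word]] = count
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` is one document and each column is one word, which is the same layout as the matrix that `CountVectorizer` produces.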

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

The Scikit Learn library provides the `CountVectorizer` class, which converts a collection of text documents to a matrix of token counts.

In [10]:
count_vect = CountVectorizer()
tds_count = count_vect.fit_transform(training_data.data)

The `fit_transform` method learns the vocabulary dictionary and then returns the document-term matrix.

In [11]:
tds_count.shape

(11314, 130107)

The `tds_count` variable is a sparse matrix with shape [n_samples, n_features].
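
To make the [n_samples, n_features] layout easier to see, here is a small sketch that runs `CountVectorizer` on three made-up sentences instead of the full dataset. The names `toy_docs`, `toy_vect`, and `toy_counts` are only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the car is fast", "the car is red", "space is big"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Rows are documents, columns are vocabulary words.
print(toy_counts.shape)
# Mapping from each word to its column index.
print(toy_vect.vocabulary_)
# Dense view of the sparse count matrix (fine for tiny data only).
print(toy_counts.toarray())
```

On the real training data, the same call yields 11314 rows (documents) and 130107 columns (distinct tokens).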

### TF-IDF Transformation

#### TF (Term Frequency) 
* The issue with just counting the occurrences of words in each document is that it gives more weight to longer documents than to shorter ones.
* To avoid this, we can use term frequencies (TF), i.e. counts adjusted for document length:

$$
        \frac{\text{Count of Word in document}}{\text{Total words in document}}
$$
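
A quick worked example of this formula in plain Python, on made-up text: a document and a ten-times-longer copy of it have very different raw counts but identical term frequencies. The helper name `term_frequencies` is hypothetical.

```python
from collections import Counter

def term_frequencies(text):
    """Return each word's count divided by the document's total word count."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

short_doc = "car car road"
long_doc = "car car road " * 10  # same proportions, ten times longer

print(term_frequencies(short_doc))
print(term_frequencies(long_doc))
```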


#### TF-IDF (Term Frequency - Inverse Document Frequency)
* We can also reduce the weight of very common words (such as `the`, `is`, `an`, etc.), which occur in many documents of a collection and therefore carry little information.
* This is done through TF-IDF, i.e. term frequency times inverse document frequency.

You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
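
The sketch below computes a textbook variant of TF-IDF by hand on a made-up corpus, taking the inverse document frequency as the logarithm of (number of documents / number of documents containing the word). Note that `TfidfTransformer` with its default settings uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this simplified version.

```python
import math
from collections import Counter

corpus = [
    "the car is fast",
    "the rocket is fast",
    "the car is red",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: (count / total words) * log(n_docs / doc_freq)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[1])
print(weights)
```

Words that appear in every document, such as `the`, get a weight of zero here, while the rarer `rocket` is weighted highest in its document.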

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

In [13]:
tfidf_transformer = TfidfTransformer()
tds_tfidf = tfidf_transformer.fit_transform(tds_count)
tds_tfidf.shape

(11314, 130107)

## 3. Prediction using Machine Learning Algorithms

In [29]:
from sklearn.pipeline import Pipeline
import numpy as np

* We will now use the `Pipeline` class of the Scikit Learn library, which lets us run multiple steps such as text feature extraction, TF-IDF transformation, and classification one after the other, much more easily.
* The NumPy library will help us compare our predictions with the labels of the test dataset.

### Naive Bayes

In [31]:
from sklearn.naive_bayes import MultinomialNB

nb_classifier = Pipeline([
                        ('vect', CountVectorizer()), 
                        ('tfidf', TfidfTransformer()), 
                        ('nb_classifier', MultinomialNB())
])

nb_classifier = nb_classifier.fit(training_data.data, training_data.target)

In [32]:
prediction = nb_classifier.predict(testing_data.data)
np.mean(prediction == testing_data.target)

0.7738980350504514
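
The expression `np.mean(prediction == testing_data.target)` computes the accuracy, i.e. the fraction of test documents that were labeled correctly. A small sketch with made-up label arrays shows how it works:

```python
import numpy as np

# Made-up predictions and true labels for five documents.
predicted = np.array([3, 1, 7, 7, 0])
actual = np.array([3, 1, 7, 2, 0])

matches = predicted == actual   # array of True/False, one per document
accuracy = np.mean(matches)     # True counts as 1, False as 0

print(matches)
print(accuracy)
```

Four of the five made-up predictions match, so the accuracy is 0.8. The value above therefore means that Naive Bayes labeled about 77.39% of the test documents correctly.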

### SGD Classifier

In [53]:
from sklearn.linear_model import SGDClassifier

sdgc_classifier = Pipeline([
                        ('vect', CountVectorizer()), 
                        ('tfidf', TfidfTransformer()), 
                        ('sdgc_clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42))
])

sdgc_classifier = sdgc_classifier.fit(training_data.data, training_data.target)

In [54]:
prediction = sdgc_classifier.predict(testing_data.data)
np.mean(prediction == testing_data.target)

0.8238183749336165

## 4. Improving Prediction

### Removing Stop Words
* Stop words are usually the most common words in a language.
* We can filter them out if they are not relevant to the underlying problem.

Read more on [Wikipedia](https://en.wikipedia.org/wiki/Stop_words).
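
As a small illustration on made-up sentences, the sketch below compares the vocabulary that `CountVectorizer` learns with and without its built-in English stop word list. The variable names are only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the car is fast", "the rocket is in orbit"]

with_stop_words = CountVectorizer().fit(toy_docs)
without_stop_words = CountVectorizer(stop_words="english").fit(toy_docs)

# Vocabulary including common words such as "the" and "is".
print(sorted(with_stop_words.vocabulary_))
# Vocabulary after filtering the built-in English stop words.
print(sorted(without_stop_words.vocabulary_))
```

The filtered vocabulary keeps the topic-bearing words and drops the common ones, which leaves the classifier with fewer uninformative features.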

In [56]:
nb_sw_classifier = Pipeline([
                        ('vect', CountVectorizer(stop_words='english')), 
                        ('tfidf', TfidfTransformer()), 
                        ('nb_classifier', MultinomialNB())
])
nb_sw_classifier = nb_sw_classifier.fit(training_data.data, training_data.target)

In [57]:
prediction = nb_sw_classifier.predict(testing_data.data)
np.mean(prediction == testing_data.target)

0.8169144981412639

As we can see, the accuracy of the NB algorithm has improved to 81.69% from 77.39% earlier.

In [60]:
sdgc_sw_classifier = Pipeline([
                        ('vect', CountVectorizer(stop_words='english')), 
                        ('tfidf', TfidfTransformer()), 
                        ('sdgc_clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42))
])
sdgc_sw_classifier = sdgc_sw_classifier.fit(training_data.data, training_data.target)

In [61]:
prediction = sdgc_sw_classifier.predict(testing_data.data)
np.mean(prediction == testing_data.target)

0.8224907063197026

However, the performance of the SGD classifier remained almost the same, decreasing slightly to 82.25% from 82.38% earlier.