**Natural Language Processing - A Classical Approach**

This notebook is inspired by the following notebooks/pages:

- [Auto Quiz](https://github.com/cloudxlab/ml/tree/master/projects/autoquiz)
- [Scikit-learn documentation](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

# Create a Quiz using TextBlob

We will start by importing the required modules/packages:

In [1]:
from textblob import TextBlob
import random
import re

The following is the text we will use to create the quiz. It is taken from the Wikipedia article on World War II (https://en.wikipedia.org/wiki/World_War_II) and is assigned here to the variable `ww2`.

However, you can use any text file you like by loading it as follows:

`f = open('path/filename.txt')`<br>
`text = f.read()`

Note the triple quotes: they are used to define a multi-line string.

In [2]:
ww2 = '''
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. It involved the vast majority of the world's countries—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of total war, the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources.

World War II was the deadliest conflict in human history, marked by 50 million to 85 million fatalities, most of which were civilians in the Soviet Union and China. It included massacres, the deliberate genocide of the Holocaust, strategic bombing, starvation, disease and the first use of nuclear weapons in history.[1][2][3][4]

The Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937,[5] but the world war is generally said to have begun on 1 September 1939[6] with the invasion of Poland by Nazi Germany and subsequent declarations of war on Germany by France and the United Kingdom. Supplied by the Soviet Union, from late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. The war continued primarily between the European Axis powers and the coalition of the United Kingdom and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, the aerial Battle of Britain, the Blitz bombing campaign, and the Balkan Campaign, as well as the long-running Battle of the Atlantic. On 22 June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis military forces into a war of attrition. In December 1941, Japan attacked the United States and European colonies in the Pacific Ocean, and quickly conquered much of the Western Pacific.

The Axis advance halted in 1942 when Japan lost the critical Battle of Midway, and Germany and Italy were defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. In 1943, with a series of German defeats on the Eastern Front, the Allied invasion of Sicily and the Allied invasion of Italy which brought about Italian surrender, and Allied victories in the Pacific, the Axis lost the initiative and undertook strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained all of its territorial losses and invaded Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in South Central China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.

The war in Europe concluded with an invasion of Germany by the Western Allies and the Soviet Union, culminating in the capture of Berlin by Soviet troops, the suicide of Adolf Hitler and the subsequent German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 August and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings and the Soviet invasion of Manchuria, Japan formally surrendered on 2 September 1945. Thus ended the war in Asia, cementing the total victory of the Allies.

World War II changed the political alignment and social structure of the world. The United Nations (UN) was established to foster international co-operation and prevent future conflicts. The victorious great powers—China, France, the Soviet Union, the United Kingdom, and the United States—became the permanent members of the United Nations Security Council.[7] The Soviet Union and the United States emerged as rival superpowers, setting the stage for the Cold War, which lasted for the next 46 years. Meanwhile, the influence of European great powers waned, while the decolonisation of Africa and Asia began. Most countries whose industries had been damaged moved towards economic recovery. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and to create a common identity.[8]
'''

Next we’ll convert our text to a `TextBlob` object.

In [3]:
ww2b = TextBlob(ww2)

In [4]:
ww2b.sentences[0]

Sentence("
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier.")

In [5]:
ww2b.sentences[0].tags

[('World', 'NNP'),
 ('War', 'NNP'),
 ('II', 'NNP'),
 ('often', 'RB'),
 ('abbreviated', 'VBN'),
 ('to', 'TO'),
 ('WWII', 'NNP'),
 ('or', 'CC'),
 ('WW2', 'NNP'),
 ('also', 'RB'),
 ('known', 'VBN'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('Second', 'NNP'),
 ('World', 'NNP'),
 ('War', 'NNP'),
 ('was', 'VBD'),
 ('a', 'DT'),
 ('global', 'JJ'),
 ('war', 'NN'),
 ('that', 'WDT'),
 ('lasted', 'VBD'),
 ('from', 'IN'),
 ('1939', 'CD'),
 ('to', 'TO'),
 ('1945', 'CD'),
 ('although', 'IN'),
 ('related', 'JJ'),
 ('conflicts', 'NNS'),
 ('began', 'VBD'),
 ('earlier', 'RBR')]

Now we are ready to apply different methods on our text.

**Go Back to the Slides**

Now we will build, for each sentence, a dictionary whose keys are part-of-speech tags and whose values are lists of the words carrying that tag:

`{part-of-speech: [word1, word2]}`

We are basically grouping the words based on the parts-of-speech.

In [6]:
sposs = {}

for sentence in ww2b.sentences:
    poss = {}
    sposs[sentence.string] = poss
    for t in sentence.tags:
        tag = t[1]
        if tag not in poss:
            poss[tag] = []
        poss[tag].append(t[0])

In [7]:
sposs['\nWorld War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier.']

{'NNP': ['World', 'War', 'II', 'WWII', 'WW2', 'Second', 'World', 'War'],
 'RB': ['often', 'also'],
 'VBN': ['abbreviated', 'known'],
 'TO': ['to', 'to'],
 'CC': ['or'],
 'IN': ['as', 'from', 'although'],
 'DT': ['the', 'a'],
 'VBD': ['was', 'lasted', 'began'],
 'JJ': ['global', 'related'],
 'NN': ['war'],
 'WDT': ['that'],
 'CD': ['1939', '1945'],
 'NNS': ['conflicts'],
 'RBR': ['earlier']}
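
The same grouping can be written a little more compactly with `collections.defaultdict`. The sketch below is only an illustration: it works on a hand-copied subset of the `(word, tag)` pairs shown earlier instead of a live `TextBlob` object, and the helper name `group_by_tag` is hypothetical.

```python
from collections import defaultdict

# A few (word, tag) pairs copied by hand from the tagger output above.
tagged = [
    ("World", "NNP"), ("War", "NNP"), ("II", "NNP"),
    ("was", "VBD"), ("a", "DT"), ("global", "JJ"), ("war", "NN"),
    ("lasted", "VBD"), ("from", "IN"), ("1939", "CD"),
    ("to", "TO"), ("1945", "CD"),
]

def group_by_tag(pairs):
    """Return {tag: [words...]}, keeping words in their original order."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)

groups = group_by_tag(tagged)
print(groups["NNP"])
print(groups["CD"])
```

With a real blob, `sentence.tags` could be passed to the helper directly, since it yields the same kind of `(word, tag)` pairs.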

Next, we will write a helper that replaces a word in a string with a blank.

In [8]:
def replaceIC(word, sentence):
    insensitive_hippo = re.compile(re.escape(word), re.IGNORECASE)
    return insensitive_hippo.sub('__________________', sentence)

In [9]:
replaceIC("hippo", "this is hippo")

'this is __________________'
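
Because `replaceIC` matches the word anywhere in the sentence, it also blanks the word when it occurs inside a longer token, such as `II` inside `WWII`. One possible refinement, shown here as a sketch that is not part of the original notebook, is to anchor the pattern on word boundaries with `\b`. The function name `replace_whole_word` is hypothetical.

```python
import re

BLANK = "__________________"

def replace_whole_word(word, sentence):
    """Blank out `word` only where it appears as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.sub(BLANK, sentence)

text = "World War II (often abbreviated to WWII or WW2)"
print(replace_whole_word("II", text))
```

Note that `\b` only helps for words that start and end with a letter, digit, or underscore; a word with leading or trailing punctuation would need a different boundary rule.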

The `removeWord` function tries to create a blank in a sentence. It first tries to pick a proper noun (`NNP`) at random; if the sentence has no proper noun, it picks a common noun (`NN`) at random instead.

In [10]:
def removeWord(sentence, poss):
    words = None
    if 'NNP' in poss:
        words = poss['NNP']
    elif 'NN' in poss:
        words = poss['NN']
    else:
        print("NN and NNP not found")
        return (None, sentence, None)
    if len(words) > 0:
        word = random.choice(words)
        replaced = replaceIC(word, sentence)
        return (word, sentence, replaced)
    else:
        print("words are empty")
        return (None, sentence, None)

Finally, we will iterate over the sentences and create blank spaces using the `removeWord` function.

In [11]:
for sentence in sposs.keys():
    poss = sposs[sentence]
    (word, osentence, replaced) = removeWord(sentence, poss)
    if replaced is None:
        print("Found none for ")
        print(sentence)
    else:
        print(replaced)
        print ("\n===============")
        print(word)
        print ("===============")
        print("\n")


World War __________________ (often abbreviated to WW__________________ or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier.

II


It involved the vast majority of the world's countries—including all of the great powers—eventually forming two opposing military alliances: the Allies and the __________________.

Axis


It was the most widespread __________________ in history, and directly involved more than 100 million people from over 30 countries.

war


In a state of total __________________, the major participants threw their entire economic, industrial, and scientific capabilities behind the __________________ effort, erasing the distinction between civilian and military resources.

war


World War II was the deadliest conflict in human history, marked by 50 million to 85 million fatalities, most of which were civilians in the Soviet __________________ and China.

Union


It included massacres, the de

**Go Back to the Slides**

# Find Related Posts using Scikit-Learn

We will start by importing the `CountVectorizer` class and initializing it.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

The `min_df` parameter determines how `CountVectorizer` treats rare words:

- If it is set to an integer, all words that occur in fewer than that number of documents will be dropped
- If it is a fraction, all words that occur in less than that fraction of the documents in the dataset will be dropped

The `max_df` parameter works in a similar manner, but drops words that occur in too many documents.
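
To illustrate the idea of a document-frequency threshold, here is a plain-Python sketch. It is not scikit-learn's implementation: the helper names `tokenize` and `build_vocabulary` are hypothetical, and the tokenizer simply lowercases the text and keeps tokens of two or more word characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often overall).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # Integers are absolute document counts, floats are fractions of the corpus.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

docs = ["How to format my hard disk", "Hard disk format problems"]
print(build_vocabulary(docs, min_df=1))
print(build_vocabulary(docs, min_df=2))
```

With `min_df=2`, only the terms shared by both sentences survive, which is why a high threshold on a tiny corpus can remove most of the vocabulary.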

In [13]:
content = ["How to format my hard disk", " Hard disk format problems "]

In [14]:
X = vectorizer.fit_transform(content)
vectorizer.get_feature_names()

['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']

In [15]:
X.toarray()

array([[1, 1, 1, 1, 1, 0, 1],
       [1, 1, 1, 0, 0, 1, 0]])

In [16]:
print(X.toarray().transpose())

[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]


This means that the first sentence contains all the words except "problems", while the second contains all but "how", "my", and "to". These are exactly the two columns of the transposed array printed above. From `X`, we can extract a feature vector for each document and use it to compare documents with each other.
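
To make the counting step concrete, the following plain-Python sketch rebuilds the same count matrix by hand. It assumes a tokenizer that lowercases the text and keeps tokens of at least two characters; the helper names are hypothetical and this is not how scikit-learn is implemented internally.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(docs):
    """Return (sorted vocabulary, list of count rows), one row per document."""
    vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

content = ["How to format my hard disk", " Hard disk format problems "]
vocabulary, rows = count_matrix(content)
print(vocabulary)
for row in rows:
    print(row)
```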

Now we will apply this on a toy dataset. Let's consider 5 toy posts and find the similarity with a given post.

In [17]:
post1 = "This is a toy post about machine learning. Actually, it contains not much interesting stuff."
post2 = "Imaging databases can get huge."
post3 = "Most imaging databases save images permanently."
post4 = "Imaging databases store images."
post5 = "Imaging databases store images. Imaging databases store images. Imaging databases store images."

Now let's fit our **vectorizer** on these posts.

In [18]:
posts = [post1, post2, post3, post4, post5]

X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape

print("#samples: %d, #features: %d" % (num_samples, num_features))

#samples: 5, #features: 24


We provided 5 posts, and together they contain 24 distinct words.

In [19]:
X_train.toarray()

array([[1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
        1, 1],
       [0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
        0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0],
       [0, 0, 0, 0, 3, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
        0, 0]])

In [20]:
vectorizer.get_feature_names()

['about',
 'actually',
 'can',
 'contains',
 'databases',
 'get',
 'huge',
 'images',
 'imaging',
 'interesting',
 'is',
 'it',
 'learning',
 'machine',
 'most',
 'much',
 'not',
 'permanently',
 'post',
 'save',
 'store',
 'stuff',
 'this',
 'toy']

In [21]:
print(len(vectorizer.get_feature_names()))

24


In [22]:
new_post = "imaging databases "

In [23]:
new_post_vec = vectorizer.transform([new_post])
new_post_vec.toarray()

array([[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0]])

Next we measure how far apart two posts are by computing the **Euclidean distance** between their count vectors. Before doing so, we normalize each vector to unit length so that a post is not penalized merely for being longer or more repetitive. The `scipy.linalg` module provides a function called `norm()`, which computes the Euclidean norm of a vector (its length). The `dist_norm` function below divides each vector by its norm and then returns the norm of the difference between the two normalized vectors.

In [24]:
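import scipy as sp
import scipy.linalg  # make sure the linalg submodule is loaded

def dist_norm(v1, v2):
    # Scale each sparse row vector to unit length, then measure the Euclidean distance
    v1_normalized = v1/sp.linalg.norm(v1.toarray())
    v2_normalized = v2/sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())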

In [25]:
import sys
import scipy as sp
best_doc = None
best_dist = sys.maxsize
best_i = None
for i, post in enumerate(posts):
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s"%(i, d, post))
    if d<best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases can get huge.
=== Post 2 with dist=0.92: Most imaging databases save images permanently.
=== Post 3 with dist=0.77: Imaging databases store images.
=== Post 4 with dist=0.77: Imaging databases store images. Imaging databases store images. Imaging databases store images.
Best post is 3 with dist=0.77
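
Posts 3 and 4 tie because normalization removes the effect of repeating the same text: repeating a post scales its count vector but does not change its direction. The plain-Python sketch below reproduces the normalized distance without SciPy to show this. The helper names are hypothetical, and the tokenizer (lowercase, tokens of two or more characters) is an assumption chosen to mirror the vocabulary above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

def unit_vector(text):
    """Map each token to its count divided by the vector's Euclidean length."""
    counts = Counter(tokenize(text))
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / length for term, c in counts.items()}

def normalized_distance(text_a, text_b):
    """Euclidean distance between the unit-length count vectors of two texts."""
    a, b = unit_vector(text_a), unit_vector(text_b)
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

new_post = "imaging databases"
post4 = "Imaging databases store images."
post5 = post4 + " " + post4 + " " + post4
print(round(normalized_distance(post4, new_post), 2))
print(round(normalized_distance(post5, new_post), 2))
```

The sketch assumes non-empty texts; an empty text would have zero length and cannot be normalized.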


**Go Back to the Slides**

## Load the file contents and the categories

The dataset is called "Twenty Newsgroups". Here is its [official website](http://qwone.com/~jason/20Newsgroups/).

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:

In [26]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

We can now load the list of files matching those categories as follows:

In [27]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(data_home='/cxldata/scikit_learn_data', subset='train', categories=categories, shuffle=True, random_state=42)

The returned dataset is a `scikit-learn` "bunch": a simple holder object whose fields can be accessed either as Python dict keys or as object attributes. For instance, `target_names` holds the list of the requested category names:

In [28]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:

In [29]:
len(twenty_train.data)

2257

In [30]:
len(twenty_train.filenames)

2257

Let’s print the first lines of the first loaded file:

In [31]:
twenty_train.data[0]

'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n'

In [32]:
print("\n".join(twenty_train.data[0].split("\n")[:3]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton


In [33]:
print(twenty_train.target_names[twenty_train.target[0]])

comp.graphics


For speed and space efficiency reasons `scikit-learn` loads the target attribute as an array of integers that corresponds to the index of the category name in the `target_names` list. The category integer id of each sample is stored in the `target` attribute:

In [34]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

It is possible to get back the category names as follows:

In [35]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
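
As a small illustration of the integer encoding, the sketch below decodes the ten target ids shown above and counts how many of them fall into each category. Both lists are copied by hand from the outputs above, so the snippet runs without downloading the dataset.

```python
from collections import Counter

target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
targets = [1, 1, 3, 3, 3, 3, 3, 2, 2, 2]

# Decode each integer id into its category name, then tally the names.
decoded = [target_names[t] for t in targets]
per_category = Counter(decoded)
print(per_category)
```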


## Extract feature vectors suitable for machine learning

Now we will apply the bag-of-words approach, starting by tokenizing the text with scikit-learn.

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

`CountVectorizer` supports counts of **N-grams** of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:

In [37]:
count_vect.vocabulary_['algorithm']

4690

In [38]:
count_vect.vocabulary_[u'the']

32142

The index of a word in the vocabulary is the column of `X_train_counts` that holds that word's counts across the whole training corpus.
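
The N-grams mentioned above can be pictured with a short sketch. The helper `word_ngrams` is hypothetical and only mimics the idea behind an `ngram_range` such as `(1, 2)`, which asks for single words and pairs of adjacent words; it is not scikit-learn's own code.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with n between the two bounds, inclusive."""
    n_min, n_max = ngram_range
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["hard", "disk", "format", "problems"]
print(word_ngrams(tokens, (1, 2)))
```

Each distinct n-gram becomes its own feature, so widening the range grows the vocabulary quickly.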

**Go Back to the Slides**

## Train a linear model to perform categorization

In [39]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

In [40]:
TfidfTransformer?

In [41]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

In [42]:
result = tfidf_transformer.transform(count_vect.transform(['This is a cow', 'That is a goat']))
print(result)

  (0, 32270)	0.18781918880023565
  (0, 18474)	0.1570845575454142
  (0, 10071)	0.9695609285130992
  (1, 32139)	0.1529256887887155
  (1, 18474)	0.1447113185933946
  (1, 15515)	0.977584967140687


In [43]:
for s in 'this is a cow'.split():
    if s in count_vect.vocabulary_:
        print(s, count_vect.vocabulary_[s])
    else:
        print('Skipping ', s)

this 32270
is 18474
Skipping  a
cow 10071
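
The cells above convert raw counts into tf-idf weights. As a rough sketch of what that involves, the plain-Python code below applies the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which matches the `smooth_idf=True` setting shown in this notebook's pipeline output, and then scales each row to unit L2 length. The helper name `tfidf_rows` is hypothetical, and the input is the small count matrix from the hard-disk example earlier, not the newsgroup data.

```python
import math

def tfidf_rows(count_rows):
    """Turn rows of raw term counts into L2-normalized tf-idf rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        result.append([x / norm for x in weighted])
    return idf, result

# Count rows for the two hard-disk sentences
# (columns: disk, format, hard, how, my, problems, to).
counts = [[1, 1, 1, 1, 1, 0, 1],
          [1, 1, 1, 0, 0, 1, 0]]
idf, rows = tfidf_rows(counts)
print([round(w, 3) for w in idf])
print([round(x, 3) for x in rows[0]])
```

Terms that occur in every document get the smallest idf, so words that are rare across the corpus end up with larger weights than common ones.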


## Training a Classifier

We’ll start with a naïve Bayes classifier, which provides a nice baseline for this task.

In [44]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

The multinomial variant of naïve Bayes is one of the most suitable for word-count tasks.

To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call `transform` instead of `fit_transform` on the transformers, since they have already been fit to the training set:

In [45]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [46]:
predicted = clf.predict(X_new_tfidf)

In [47]:
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
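
For intuition about what the classifier is doing, here is a minimal multinomial naïve Bayes sketch in plain Python. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it works on raw token counts instead of tf-idf weights, uses Laplace smoothing with `alpha=1.0`, and the tiny training set and the function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for tokens in docs for tok in tokens}
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(docs)),
            # Laplace-smoothed log-likelihood of every vocabulary word.
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom)
                         for w in vocab},
        }
    return model

def predict_nb(model, tokens):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        params = model[label]
        # Words never seen in training are ignored.
        return params["log_prior"] + sum(
            params["log_like"][w] for w in tokens if w in params["log_like"])
    return max(model, key=score)

# Toy training data, invented for this illustration.
train_docs = [["god", "faith", "church"], ["faith", "prayer"],
              ["opengl", "gpu", "render"], ["gpu", "pixel", "image"]]
train_labels = ["religion", "religion", "graphics", "graphics"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["gpu", "render", "fast"]))
print(predict_nb(model, ["god", "prayer"]))
```

Working with log-probabilities avoids numerical underflow when many small word probabilities are multiplied together.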


In order to make the vectorizer => transformer => classifier chain easier to work with, `scikit-learn` provides a `Pipeline` class that behaves like a compound classifier:

In [48]:
from sklearn.pipeline import Pipeline

text_clfnb = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

The names `vect`, `tfidf` and `clf` (classifier) are arbitrary. We will use them to perform grid search for suitable hyperparameters below. We can now train the model with a single command:

In [49]:
text_clfnb.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

Now let us evaluate the model's performance on the test set:

In [50]:
import numpy as np

twenty_test = fetch_20newsgroups(data_home='/cxldata/scikit_learn_data', subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predictednb = text_clfnb.predict(docs_test)
np.mean(predictednb == twenty_test.target)

0.8348868175765646

We achieved 83.49% accuracy!

In [51]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predictednb, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



## Use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

Now we’ll use a different classifier and compute the performance metrics:

In [52]:
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])

In [53]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                ('clf',
                 SGDClassifier(alpha=0.001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='hinge',
                      

In [54]:
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.9101198402130493

We achieved 91% accuracy!

In [55]:
from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.80      0.87       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.94      0.89      0.91       396
soc.religion.christian       0.90      0.95      0.93       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502



In [56]:
print(metrics.classification_report(twenty_test.target, predictednb, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



In [57]:
metrics.confusion_matrix(twenty_test.target, predicted)

array([[256,  11,  16,  36],
       [  4, 380,   3,   2],
       [  5,  35, 353,   3],
       [  5,  11,   4, 378]])

As expected, the confusion matrix shows that posts from the atheism newsgroup are misclassified as Christianity (36 posts) far more often than as computer graphics (11 posts).
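
The precision, recall, and accuracy figures in the classification report can be recomputed from this confusion matrix. The sketch below does so in plain Python, assuming the usual layout in which rows are true classes and columns are predicted classes; the helper name `per_class_scores` is hypothetical.

```python
confusion = [[256, 11, 16, 36],
             [4, 380, 3, 2],
             [5, 35, 353, 3],
             [5, 11, 4, 378]]
names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

def per_class_scores(matrix):
    """Return a list of (precision, recall) pairs, one per class."""
    scores = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum
        scores.append((true_pos / predicted_k, true_pos / actual_k))
    return scores

scores = per_class_scores(confusion)
for name, (precision, recall) in zip(names, scores):
    print("%-24s precision=%.2f recall=%.2f" % (name, precision, recall))

# Accuracy is the share of samples on the diagonal.
accuracy = sum(confusion[k][k] for k in range(4)) / sum(map(sum, confusion))
print("accuracy=%.4f" % accuracy)
```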

**Parameter tuning using grid search**

Since there are different parameters which we can choose, we’ll apply grid search to find the best parameters:

In [58]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

Here we’ll be applying grid search for the parameters - `ngram_range`, `use_idf` and `alpha`. If we have multiple CPU cores at our disposal, we can tell the grid searcher to try these eight parameter combinations in parallel with the `n_jobs` parameter.
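
To see where the eight combinations come from, the grid can be enumerated by hand with `itertools.product`. This sketch only lists the candidate settings; it does not train anything and is not how `GridSearchCV` is implemented.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# Cartesian product of all candidate values, one dict per combination.
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[name] for name in names))]

print(len(combinations))
for combo in combinations:
    print(combo)
```

With 5-fold cross-validation, each of these combinations is fitted five times, so the search cost grows with both the grid size and the number of folds.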

In [59]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)

In [60]:
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

Now we will predict and find the best score. The result of calling `fit` on a `GridSearchCV` object is a classifier that we can use to `predict`:

In [61]:
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

The object’s `best_score_` and `best_params_` attributes store the best mean score and the parameter setting corresponding to that score:

In [62]:
gs_clf.best_score_

0.9175000000000001

In [63]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)


**Go Back to the Slides**

In [64]:
from sklearn.ensemble import RandomForestClassifier

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=200, min_samples_split=4)),
])

text_clf.fit(twenty_train.data, twenty_train.target)

predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))

from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

0.8135818908122503
                        precision    recall  f1-score   support

           alt.atheism       0.94      0.64      0.76       319
         comp.graphics       0.71      0.95      0.82       389
               sci.med       0.90      0.69      0.78       396
soc.religion.christian       0.81      0.93      0.87       398

              accuracy                           0.81      1502
             macro avg       0.84      0.81      0.81      1502
          weighted avg       0.84      0.81      0.81      1502



## TODO: Apply deep neural networks
