see https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
training_data = fetch_20newsgroups(subset='train', 
                                   categories=['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'], 
                                   shuffle=True, 
                                   random_state=42)

## Explore the data

How much training data do we have (i.e. how many posts)?

In [3]:
len(training_data.data)

2257

What does the data look like?

In [4]:
print(training_data.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



The training data also records which category (newsgroup) each post came from.

In [5]:
print(training_data.target_names)

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


In [6]:
for target in training_data.target[:5]:
    print(target, training_data.target_names[target])

1 comp.graphics
1 comp.graphics
3 soc.religion.christian
3 soc.religion.christian
3 soc.religion.christian


### Tokenize text 

Turn each post into a vector of word counts.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
training_data_counts = vectorizer.fit_transform(training_data.data)
len(vectorizer.get_feature_names())

35788

So the training data contains 35788 distinct tokens. Many of them are not real words, as the first few entries show.
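To make the counting step concrete, here is a minimal pure-Python sketch of what a bag-of-words vectorizer does. It assumes lowercasing and a tokenizer that keeps runs of two or more word characters, which mirrors `CountVectorizer`'s default token pattern. The helper names `tokenize` and `count_words` and the two toy posts are made up for illustration; this is not how scikit-learn is implemented internally.

```python
import re
from collections import Counter

# Assumption: tokens are runs of 2+ word characters, lowercased,
# similar to CountVectorizer's default behaviour.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Split a document into lowercase tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

def count_words(docs):
    """Return (sorted vocabulary, one {column index: count} dict per document)."""
    vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # Only non-zero counts are stored, like a sparse matrix row.
        rows.append({index[tok]: n for tok, n in counts.items()})
    return vocabulary, rows

# Hypothetical toy posts, not taken from the newsgroup data.
toy_posts = ["The City University is in the city.", "A city plotter"]
vocabulary, rows = count_words(toy_posts)
print(vocabulary)
print(rows[0])
```

Each row stores only the columns with a non-zero count, which is the same idea as the sparse `(row, column) count` output printed further down. Note that the single-letter word "A" is dropped by the two-character minimum.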

In [8]:
vectorizer.get_feature_names()[:10]

['00',
 '000',
 '0000',
 '0000001200',
 '000005102000',
 '0001',
 '000100255pixel',
 '00014',
 '000406',
 '0007']

Let's look at the word counts for the first post.

In [9]:
training_data_counts[0]

<1x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 73 stored elements in Compressed Sparse Row format>

In [10]:
print(training_data_counts[0])

  (0, 230)	1
  (0, 12541)	1
  (0, 3166)	1
  (0, 14085)	1
  (0, 20459)	1
  (0, 35416)	1
  (0, 3062)	1
  (0, 2326)	2
  (0, 177)	2
  (0, 31915)	1
  (0, 33572)	1
  (0, 9338)	1
  (0, 26175)	1
  (0, 4378)	1
  (0, 17556)	1
  (0, 32135)	1
  (0, 15837)	1
  (0, 9932)	1
  (0, 32270)	1
  (0, 18474)	1
  (0, 27836)	1
  (0, 5195)	1
  (0, 12833)	2
  (0, 25337)	1
  (0, 25361)	1
  :	:
  (0, 5201)	1
  (0, 12051)	1
  (0, 587)	1
  (0, 20253)	1
  (0, 33597)	2
  (0, 32142)	5
  (0, 23915)	1
  (0, 16082)	1
  (0, 16881)	1
  (0, 25663)	1
  (0, 23122)	1
  (0, 17302)	2
  (0, 19780)	2
  (0, 16916)	2
  (0, 32493)	4
  (0, 17366)	1
  (0, 9805)	2
  (0, 31077)	1
  (0, 9031)	3
  (0, 21661)	3
  (0, 33256)	2
  (0, 4017)	2
  (0, 8696)	4
  (0, 29022)	1
  (0, 14887)	1


Word #8696 occurred 4 times. What word is that?

In [11]:
vectorizer.get_feature_names()[8696]

'city'

In [12]:
print(training_data.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [13]:
vectorizer.get_feature_names()[8690:8700]

['citizen',
 'citizens',
 'citizenship',
 'citr',
 'citrate',
 'citrus',
 'city',
 'city_________________________________________________',
 'civic',
 'civil']

Note that we are not stemming words: we have both `citizen` and `citizens`. We also have `city` followed by a long run of underscores, because the default tokenizer treats underscores as word characters. In short, our data could really use some cleaning up, which we're not going to bother with today.

### Normalize frequencies

If a document is really long, it will in general have higher word counts, so it might seem more relevant than a shorter document that uses key words relatively more often. To counter this, we usually normalize the counts by the length of the document in some fashion.

Also, if a word occurs in most of the documents, it probably doesn't have much information value. So we want to reduce its score.

A common way to do both of these is to calculate the Term Frequency Inverse Document Frequency (TFIDF).
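As a rough illustration of both effects, the sketch below computes TF-IDF by hand for a tiny count matrix. It assumes the smoothed IDF formula and L2 row normalization that `TfidfTransformer` documents for its default settings; the function name `tfidf` and the toy counts are hypothetical and only meant to show the arithmetic.

```python
import math

def tfidf(count_rows):
    """Turn rows of raw counts into L2-normalised TF-IDF rows.

    Assumption: smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1,
    followed by scaling each row to unit Euclidean length.
    """
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in count_rows if row[t] > 0) for t in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[t])) + 1 for t in range(n_terms)]
    result = []
    for row in count_rows:
        weighted = [row[t] * idf[t] for t in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        result.append([w / norm if norm else 0.0 for w in weighted])
    return result

# Hypothetical counts for three documents over the terms ["the", "city", "opengl"].
toy_counts = [
    [2, 2, 0],
    [4, 0, 0],
    [1, 0, 1],
]
toy_tfidf = tfidf(toy_counts)
for row in toy_tfidf:
    print([round(w, 3) for w in row])
```

In the first toy document "the" and "city" both occur twice, but "the" appears in every document and so ends up with the lower weight. Every row is scaled to unit length, so a long document does not score higher just for being long.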

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
training_counts_tfidf = tfidf_transformer.fit_transform(training_data_counts)

In [15]:
print(training_counts_tfidf[0])

  (0, 35416)	0.1348710554299733
  (0, 35312)	0.0312703097833574
  (0, 34775)	0.034481472140846715
  (0, 34755)	0.043341654399042764
  (0, 33915)	0.0999409997803694
  (0, 33597)	0.06567578043186388
  (0, 33572)	0.09313007554599557
  (0, 33256)	0.11819702490105698
  (0, 32493)	0.07283773941616518
  (0, 32391)	0.12806013119559947
  (0, 32270)	0.023871142738151236
  (0, 32142)	0.08865416253721688
  (0, 32135)	0.04910237380446671
  (0, 32116)	0.10218403421141944
  (0, 31915)	0.08631915131162177
  (0, 31077)	0.016797806021219684
  (0, 30623)	0.0686611288079694
  (0, 29022)	0.1348710554299733
  (0, 28619)	0.047271576160535234
  (0, 27836)	0.06899050810672397
  (0, 26175)	0.08497460943470851
  (0, 25663)	0.034290706362898604
  (0, 25361)	0.11947938145690981
  (0, 25337)	0.04935883383975408
  (0, 24677)	0.09796250319482307
  :	:
  (0, 14676)	0.07691883385947053
  (0, 14281)	0.13635772403701527
  (0, 14085)	0.06666452137859726
  (0, 12833)	0.125601499991304
  (0, 12541)	0.1348710554299733
  (0, 

## Classification

### Naive Bayes classifier

In [16]:
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(training_counts_tfidf, training_data.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Quick predictions

In [17]:
testing_data = ['doctors are cool', 'OpenGL on the GPU is fast']
testing_counts = vectorizer.transform(testing_data)
testing_counts_tfidf = tfidf_transformer.transform(testing_counts)

predicted = nb_classifier.predict(testing_counts_tfidf)

for doc, category in zip(testing_data, predicted):
    print('%r => %s' % (doc, training_data.target_names[category]))

'doctors are cool' => sci.med
'OpenGL on the GPU is fast' => comp.graphics


### Pipeline

We can string the vectorizer, TFIDF, and classifier together in an sklearn Pipeline, to create a new classifier that works directly with the raw data. 

In [18]:
from sklearn.pipeline import Pipeline

nb_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

nb_classifier.fit(training_data.data, training_data.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

### Evaluate on a real test set

In [19]:
import numpy as np

testing_data = fetch_20newsgroups(subset='test', categories=training_data.target_names, shuffle=True, random_state=42)
testing_data_predictions = nb_classifier.predict(testing_data.data)
np.mean(testing_data_predictions == testing_data.target) 

0.8348868175765646

So we guess right 83% of the time. Not terrible! Let's get some more metrics.

In [20]:
from sklearn import metrics

print(metrics.classification_report(testing_data.target, 
                                    testing_data_predictions, 
                                    target_names=training_data.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

             micro avg       0.83      0.83      0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



\begin{align}
\text{precision of category} & = \frac {\text{number from that category guessed correctly}} 
                                        {\text{number we thought were from that category}} \\
                             & = \text{of the ones we thought were that category, what proportion were we correct about} \\
                             \\
\text{recall of category} &= \frac {\text{number from that category guessed correctly}} 
                                        {\text{total number from that category}} \\
                          &= \text{what proportion of that category did we guess right}
\end{align}


So we want the precision and recall to both be 1. 

In this case, alt.atheism has a precision of 0.97, so when we say a post is from alt.atheism, we're right 97% of the time.

On the other hand, the recall is 0.6, which means that only 60% of the alt.atheism posts got correctly labelled.

Note that soc.religion.christian has the opposite problem: high recall (0.99) but low precision (0.65). It would appear that we are classifying too many religion-related posts as soc.religion.christian.

A **confusion matrix** shows us how many things got classified as other things.

In [32]:
import pandas as pd

# metrics.confusion_matrix puts the true classes on the rows
# and the predicted classes on the columns
row_index = pd.MultiIndex.from_tuples([('actual', x) for x in training_data.target_names])
col_index = pd.MultiIndex.from_tuples([('predicted', x) for x in training_data.target_names])
confusion_matrix = metrics.confusion_matrix(testing_data.target, testing_data_predictions)

pd.DataFrame(confusion_matrix, index=row_index, columns=col_index)

| actual \ predicted | alt.atheism | comp.graphics | sci.med | soc.religion.christian |
|---|---|---|---|---|
| alt.atheism | 192 | 2 | 6 | 119 |
| comp.graphics | 2 | 347 | 4 | 36 |
| sci.med | 2 | 11 | 322 | 61 |
| soc.religion.christian | 2 | 2 | 1 | 393 |


So we can see that of the posts that were actually alt.atheism, more than a third (119 of 319) were predicted to be soc.religion.christian.
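The per-category precision and recall can be recomputed by hand from the Naive Bayes confusion matrix. The sketch below hard-codes the counts shown above and relies on the `sklearn.metrics.confusion_matrix` convention that entry `[i][j]` counts posts whose true class is `i` and whose predicted class is `j`. The helper name `precision_recall` is made up for this illustration.

```python
# Confusion matrix from the Naive Bayes run.
# Entry [i][j] = number of posts with true class i and predicted class j.
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
cm = [
    [192, 2, 6, 119],
    [2, 347, 4, 36],
    [2, 11, 322, 61],
    [2, 2, 1, 393],
]

def precision_recall(cm, k):
    """Return (precision, recall) for class index k."""
    true_positives = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum
    return true_positives / predicted_as_k, true_positives / actually_k

scores = {name: precision_recall(cm, k) for k, name in enumerate(labels)}
# Accuracy is the diagonal (correct guesses) over the total number of posts.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for name, (precision, recall) in scores.items():
    print(f"{name:>22}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Precision divides by a column sum (everything we labelled as that category), while recall divides by a row sum (everything that really belongs to it). The values agree with the classification report after rounding.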

## Better data munging

Let's see if we can improve the data a bit. 

In [33]:
from nltk.stem.snowball import EnglishStemmer

In [35]:
stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = CountVectorizer(analyzer=stemmed_words)

In [40]:
nb_classifier = Pipeline([
    ('vect', stem_vectorizer),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

nb_classifier.fit(training_data.data, training_data.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=<function stemmed_words at 0x123a0b158>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=Non...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [41]:
testing_data_predictions = nb_classifier.predict(testing_data.data)
np.mean(testing_data_predictions == testing_data.target) 

0.8288948069241012

In [42]:
# metrics.confusion_matrix puts the true classes on the rows
# and the predicted classes on the columns
row_index = pd.MultiIndex.from_tuples([('actual', x) for x in training_data.target_names])
col_index = pd.MultiIndex.from_tuples([('predicted', x) for x in training_data.target_names])
confusion_matrix = metrics.confusion_matrix(testing_data.target, testing_data_predictions)

pd.DataFrame(confusion_matrix, index=row_index, columns=col_index)

| actual \ predicted | alt.atheism | comp.graphics | sci.med | soc.religion.christian |
|---|---|---|---|---|
| alt.atheism | 183 | 2 | 7 | 127 |
| comp.graphics | 2 | 348 | 7 | 32 |
| sci.med | 3 | 10 | 322 | 61 |
| soc.religion.christian | 2 | 2 | 2 | 392 |


Surprisingly, the results are slightly worse than with the non-stemmed words (82.9% accuracy versus 83.5%) :(

### Use a better classifier!

Let's try a slightly better classifier: a linear support vector machine, trained here with stochastic gradient descent (`SGDClassifier` with hinge loss).

In [52]:
from sklearn.linear_model import SGDClassifier

svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=10, tol=0.01)),
])

svm_clf.fit(training_data.data, training_data.target)  

testing_data_predictions = svm_clf.predict(testing_data.data)
np.mean(testing_data_predictions == testing_data.target) 

0.9121171770972037

In [53]:
# metrics.confusion_matrix puts the true classes on the rows
# and the predicted classes on the columns
row_index = pd.MultiIndex.from_tuples([('actual', x) for x in training_data.target_names])
col_index = pd.MultiIndex.from_tuples([('predicted', x) for x in training_data.target_names])
confusion_matrix = metrics.confusion_matrix(testing_data.target, testing_data_predictions)

pd.DataFrame(confusion_matrix, index=row_index, columns=col_index)

| actual \ predicted | alt.atheism | comp.graphics | sci.med | soc.religion.christian |
|---|---|---|---|---|
| alt.atheism | 261 | 10 | 12 | 36 |
| comp.graphics | 4 | 380 | 2 | 3 |
| sci.med | 7 | 36 | 349 | 4 |
| soc.religion.christian | 5 | 11 | 2 | 380 |


We can see that we are not over-predicting soc.religion.christian nearly as badly as we were using Naive Bayes.

In [56]:
from sklearn import metrics

print(metrics.classification_report(testing_data.target, 
                                    testing_data_predictions, 
                                    target_names=training_data.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.94      0.82      0.88       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.96      0.88      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

             micro avg       0.91      0.91      0.91      1502
             macro avg       0.92      0.91      0.91      1502
          weighted avg       0.92      0.91      0.91      1502

