# Email classification - 20 Newsgroups dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

In [2]:
# load the train and test splits
traind = datasets.fetch_20newsgroups(subset='train')
testd = datasets.fetch_20newsgroups(subset='test')

In [32]:
xtr = traind.data
ytr = traind.target
xts = testd.data
yts = testd.target
cnames = traind.target_names

In [18]:
print(len(xtr),len(ytr))
print(len(xts),len(yts))
print(len(cnames))

11314 11314
7532 7532
20


In [8]:
print(traind.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [14]:
print(ytr[150],cnames[ytr[150]])
print(xtr[150])

1 comp.graphics
From: weston@ucssun1.sdsu.edu (weston t)
Subject: graphical representation of vector-valued functions
Organization: SDSU Computing Services
Lines: 13
NNTP-Posting-Host: ucssun1.sdsu.edu

gnuplot, etc. make it easy to plot real valued functions of 2 variables
but I want to plot functions whose values are 2-vectors. I have been 
doing this by plotting arrays of arrows (complete with arrowheads) but
before going further, I thought I would ask whether someone has already
done the work. Any pointers??

thanx in advance


Tom Weston                    | USENET: weston@ucssun1.sdsu.edu
Department of Philosophy      | (619) 594-6218 (office)
San Diego State Univ.         | (619) 575-7477 (home)
San Diego, CA 92182-0303      | 



In [16]:
ytr[:20]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4,  8, 19,  4, 14,  6,  0,  1,
        7, 12,  5])

In [17]:
cnames

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [24]:
print(xtr[1802])

From: gtoal@gtoal.com (Graham Toal)
Subject: Re: text of White House announcement and Q&As on clipper chip encryption
Lines: 14

	Actually, many of us have noted this. We have noted that the program
	started at least 4 years ago, that the contracts with VLSI Technology
	and Microtoxin were let at least 14 months ago, that production of the
	chips is well underway, and so forth.

	Nobody I know has claimed Clinton intitiated the program. But he chose
	to go ahead with it.

Perhaps the NSA realised that *no-one* would even contemplate falling for
the dual-escrow bluff while under the Bush administration and *had* to
wait for a Democrat govt to con into promoting this because people *might*
just believe they were honest.  (Didn't work, did it? :-) )

G



In [48]:
import re
import spacy
nlp = spacy.load("en_core_web_sm")



## Text Cleaning

In [49]:
# use regex for text cleaning, then lemmatize with spaCy
def transform_data(xd):
    for i in range(len(xd)):
        doc = xd[i]
        doc = re.sub(r"[\w.]+@[\w.]+", "", doc)  # remove email addresses
        doc = re.sub(r"[0-9]+", "", doc)  # remove every run of digits
        doc = re.sub("_", "", doc)  # remove underscores
        doc = nlp(doc)
        doc = " ".join([w.lemma_ for w in doc])  # replace each token with its lemma
        xd[i] = doc  # note: overwrites the input list in place
    return xd

In [50]:
xtr = transform_data(xtr)
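
To see what the three regex substitutions in `transform_data` do on their own, here is a minimal sketch that skips the spaCy lemmatization step and returns new strings instead of overwriting the input list. The names `strip_noise` and `clean_corpus` are hypothetical helpers for illustration, not part of the notebook.

```python
import re

# Same patterns as transform_data, written as raw strings and compiled once.
EMAIL_RE = re.compile(r"[\w.]+@[\w.]+")
DIGITS_RE = re.compile(r"[0-9]+")


def strip_noise(text):
    """Remove email addresses, digit runs, and underscores from one document."""
    text = EMAIL_RE.sub("", text)   # emails first, while their digits are intact
    text = DIGITS_RE.sub("", text)  # then every remaining run of digits
    return text.replace("_", "")


def clean_corpus(docs):
    """Return a new list of cleaned documents; the input list is left untouched."""
    return [strip_noise(d) for d in docs]


raw = ["From: weston@ucssun1.sdsu.edu (weston t)\nLines: 13"]
cleaned = clean_corpus(raw)
print(repr(cleaned[0]))
print(repr(raw[0]))
```

Returning a new list keeps `traind.data` available in its raw form, which can help when comparing cleaned and uncleaned text later.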

## Vectorization

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(lowercase=True,stop_words='english',min_df=20,max_df=0.90)
vec.fit(xtr)
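# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print(len(vec.get_feature_names_out()))
print(vec.get_feature_names_out())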

6563


In [57]:
xtr2 = vec.transform(xtr).toarray()

In [58]:
xtr2.shape

(11314, 6563)
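
The vocabulary size of 6563 is driven largely by the `min_df` and `max_df` settings: an integer `min_df` is a minimum document count, and a float `max_df` is a maximum fraction of documents. The sketch below illustrates both filters on a made-up four-document corpus, using `min_df=2` instead of the notebook's `min_df=20` so the effect is visible at this scale. The corpus and the names `toy_docs`, `toy_vec`, and `kept_terms` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus: "car" is in every document, "engine", "hockey" and
# "ice" are in two documents each, and the remaining words appear only once.
toy_docs = [
    "car engine wheel",
    "car engine road",
    "car hockey ice puck",
    "car hockey ice goal",
]

# min_df=2    -> keep terms found in at least 2 documents
# max_df=0.90 -> drop terms found in more than 90% of documents
toy_vec = TfidfVectorizer(min_df=2, max_df=0.90)
toy_matrix = toy_vec.fit_transform(toy_docs)

kept_terms = sorted(toy_vec.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Here "car" is dropped by `max_df` and the single-occurrence words are dropped by `min_df`, leaving one column per surviving term.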

In [63]:
xtr[0]

'from :   ( where be my thing ) \n subject : WHAT car be this ! ? \n Nntp - Posting - host : rac.wam.umd.edu \n Organization : University of Maryland , College Park \n line : \n  I be wonder if anyone out there could enlighten I on this car I see \n the other day . it be a -door sport car , look to be from the late / \n early . it be call a Bricklin . the door be really small . in addition , \n the front bumper be separate from the rest of the body . this be \n all I know . if anyone can tellme a model name , engine spec , year \n of production , where this car be make , history , or whatever info you \n have on this funky look car , please e - mail . \n\n thank , \n - IL \n    ---- bring to you by your neighborhood Lerxst ---- \n\n\n\n\n'

## Classification

In [64]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(xtr2,ytr)

MultinomialNB()

In [65]:
from sklearn import metrics
# note: xts is vectorized as raw text here; transform_data was applied only to xtr
xts2 = vec.transform(xts).toarray()
ypred = model.predict(xts2)
print(metrics.classification_report(yts,ypred))

              precision    recall  f1-score   support

           0       0.74      0.66      0.70       319
           1       0.64      0.69      0.67       389
           2       0.67      0.76      0.71       394
           3       0.63      0.68      0.65       392
           4       0.77      0.78      0.78       385
           5       0.78      0.75      0.76       395
           6       0.71      0.82      0.76       390
           7       0.84      0.85      0.85       396
           8       0.89      0.92      0.90       398
           9       0.88      0.93      0.90       397
          10       0.91      0.97      0.94       399
          11       0.91      0.90      0.90       396
          12       0.73      0.55      0.63       393
          13       0.91      0.76      0.83       396
          14       0.82      0.91      0.86       394
          15       0.74      0.89      0.81       398
          16       0.66      0.91      0.77       364
          17       0.93    

In [66]:
metrics.accuracy_score(yts,ypred)

0.7857142857142857
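
The classification report above identifies classes only by integer index. `classification_report` accepts a `target_names` argument, so passing a list such as `cnames` should label each row with its newsgroup name. A small sketch with made-up labels and predictions follows; `names`, `y_true`, and `y_pred` are illustrative values, not results from this notebook.

```python
from sklearn import metrics

# Three class names borrowed from the dataset; the labels and predictions
# below are invented for illustration.
names = ["alt.atheism", "comp.graphics", "sci.space"]
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Text report with one row per class name instead of per integer label.
print(metrics.classification_report(y_true, y_pred, target_names=names))

# The same numbers as a nested dict, convenient for programmatic checks.
report = metrics.classification_report(
    y_true, y_pred, target_names=names, output_dict=True
)
print(report["sci.space"]["recall"])
```

In the notebook itself the equivalent call would be `metrics.classification_report(yts, ypred, target_names=cnames)`.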

In [67]:
metrics.accuracy_score(ytr,model.predict(xtr2))

0.9183312709916918
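
The notebook calls `.toarray()` before fitting, which materializes a dense 11314 x 6563 matrix. `MultinomialNB` also accepts the sparse matrix that `TfidfVectorizer` produces, and wrapping both steps in a pipeline ensures the same fitted vectorizer is applied to training and new text. The sketch below shows that arrangement on a made-up toy corpus; `toy_train`, `toy_labels`, and `clf` are assumed names, and the vectorizer uses default settings instead of the notebook's parameters.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up two-class corpus: class 0 is about cars, class 1 about hockey.
toy_train = [
    "car engine wheel",
    "engine car road",
    "hockey ice puck",
    "ice hockey goal",
]
toy_labels = [0, 0, 1, 1]

# The pipeline vectorizes raw strings and feeds the sparse result to the model.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(toy_train, toy_labels)

# The intermediate features stay sparse; no .toarray() call is needed.
features = clf.named_steps["tfidfvectorizer"].transform(toy_train)
print(sparse.issparse(features))

# Predict directly from raw text.
toy_pred = clf.predict(["car road", "hockey goal"])
print(toy_pred)
```

Keeping the features sparse mainly reduces memory use; whether it changes the scores reported above has not been checked here.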

In [71]:
TfidfVectorizer?