##### Tokenization 
Consider the following text version of a post to an online learning forum in a statistics course.

Thanks John!<br /><br /><font size="3">

&quot;Illustrations and demos will be

provided for students to work through on

their own&quot;</font>.

Do we need that to finish project? If yes,

where to find the illustration and demos? 

Thanks for your help.<img title="smiles"

alt="smiles" src="\url{http://lms.statistics.

com/pix/smartpix.php/statistics_com_1/s/smil

ey.gif}" \><br /><br />


In [1]:
%matplotlib inline

from pathlib import Path

from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
import nltk
from nltk import word_tokenize          
from nltk.stem.snowball import EnglishStemmer 
import matplotlib.pylab as plt
from dmba import printTermDocumentMatrix, classificationSummary, liftChart

nltk.download('punkt')

no display found. Using non-interactive Agg backend


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Cobbadmin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Question 1 (3 points) Identify 10 non-word tokens in the passage.

In [2]:
text = ['Thanks John!',
        'Illustrations and demos will be provided for students to work through on their own.',
        ' Do we need that to finish project?',
        'If yes, where to find the illustration and demos?',
        ' Thanks for your help.smiles']

# Learn features based on text
count_vect = CountVectorizer()
counts = count_vect.fit_transform(text)

printTermDocumentMatrix(count_vect, counts)

               S1  S2  S3  S4  S5
and             0   1   0   1   0
be              0   1   0   0   0
demos           0   1   0   1   0
do              0   0   1   0   0
find            0   0   0   1   0
finish          0   0   1   0   0
for             0   1   0   0   1
help            0   0   0   0   1
if              0   0   0   1   0
illustration    0   0   0   1   0
illustrations   0   1   0   0   0
john            1   0   0   0   0
need            0   0   1   0   0
on              0   1   0   0   0
own             0   1   0   0   0
project         0   0   1   0   0
provided        0   1   0   0   0
smiles          0   0   0   0   1
students        0   1   0   0   0
thanks          1   0   0   0   1
that            0   0   1   0   0
the             0   0   0   1   0
their           0   1   0   0   0
through         0   1   0   0   0
to              0   1   1   1   0
we              0   0   1   0   0
where           0   0   0   1   0
will            0   1   0   0   0
work            0   1   0   0   0

In [3]:
stopWords = sorted(ENGLISH_STOP_WORDS)
ncolumns = 6
nrows = 30

# Print the first ncolumns * nrows stop words in a fixed-width grid
print('First {} of {} stopwords'.format(ncolumns * nrows, len(stopWords)))
for i in range(0, len(stopWords[:(ncolumns * nrows)]), ncolumns):
    print(''.join(word.ljust(13) for word in stopWords[i:(i+ncolumns)]))

First 180 of 318 stopwords
a            about        above        across       after        afterwards   
again        against      all          almost       alone        along        
already      also         although     always       am           among        
amongst      amoungst     amount       an           and          another      
any          anyhow       anyone       anything     anyway       anywhere     
are          around       as           at           back         be           
became       because      become       becomes      becoming     been         
before       beforehand   behind       being        below        beside       
besides      between      beyond       bill         both         bottom       
but          by           call         can          cannot       cant         
co           con          could        couldnt      cry          de           
describe     detail       do           done         down         due          
during       each        

### Removing non-word tokens and stop words from the passage

In [4]:
# Create a custom tokenizer that uses NLTK for tokenizing and stemming
# (drops punctuation, other non-alphabetic tokens, and stop words)
class LemmaTokenizer(object):
    def __init__(self):
        self.stemmer = EnglishStemmer()
        self.stopWords = set(ENGLISH_STOP_WORDS)

    def __call__(self, doc):
        return [self.stemmer.stem(t) for t in word_tokenize(doc) 
                if t.isalpha() and t not in self.stopWords]

# Learn features based on text
count_vect = CountVectorizer(tokenizer=LemmaTokenizer())
counts = count_vect.fit_transform(text)

print("Stemmed word tokens remaining in the passage:")
printTermDocumentMatrix(count_vect, counts)

Stemmed word tokens remaining in the passage:
         S1  S2  S3  S4  S5
demo      0   1   0   1   0
finish    0   0   1   0   0
illustr   0   1   0   1   0
john      1   0   0   0   0
need      0   0   1   0   0
project   0   0   1   0   0
provid    0   1   0   0   0
student   0   1   0   0   0
thank     1   0   0   0   1
work      0   1   0   0   0
yes       0   0   0   1   0
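
The matrix above keeps only stemmed word tokens. To list the non-word tokens themselves (HTML tags, character entities, and punctuation), one option is a regular expression applied to the raw post. The following is a minimal sketch rather than a definitive tokenizer: the names `raw_post` and `NON_WORD` are illustrative, the pattern is one reasonable choice among many, and the image URL is assumed to be the result of joining the broken lines in the passage.

```python
import re

# Raw forum post from the top of this section, with line breaks joined.
# Assumption: the URL is the broken lines of the passage rejoined.
raw_post = (
    'Thanks John!<br /><br /><font size="3">'
    '&quot;Illustrations and demos will be provided for students '
    'to work through on their own&quot;</font>. '
    'Do we need that to finish project? If yes, '
    'where to find the illustration and demos? '
    'Thanks for your help.<img title="smiles" alt="smiles" '
    'src="http://lms.statistics.com/pix/smartpix.php/statistics_com_1/s/smiley.gif" \\>'
    '<br /><br />'
)

# Match, in order of preference: a whole HTML tag, a character entity
# such as &quot;, or a single punctuation character.
NON_WORD = re.compile(r'<[^>]+>|&\w+;|[^\w\s]')

non_word_tokens = NON_WORD.findall(raw_post)
for token in non_word_tokens:
    print(repr(token))
```

Matching whole tags first keeps attributes such as `title="smiles"` inside the tag token instead of letting them leak into the word tokens.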


### Question 2 (2 points) Suppose this passage constitutes a document to be classified, but you are not certain of the business goal of the classification task. Identify material (at least 20% of the terms) that, in your judgment, could be discarded fairly safely without knowing that goal.


In [5]:
# Apply CountVectorizer and TfidfTransformer sequentially
count_vect = CountVectorizer()
tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)
counts = count_vect.fit_transform(text)
tfidf = tfidfTransformer.fit_transform(counts)

printTermDocumentMatrix(count_vect, tfidf)

                     S1        S2        S3        S4        S5
and            0.000000  1.916291  0.000000  1.916291  0.000000
be             0.000000  2.609438  0.000000  0.000000  0.000000
demos          0.000000  1.916291  0.000000  1.916291  0.000000
do             0.000000  0.000000  2.609438  0.000000  0.000000
find           0.000000  0.000000  0.000000  2.609438  0.000000
finish         0.000000  0.000000  2.609438  0.000000  0.000000
for            0.000000  1.916291  0.000000  0.000000  1.916291
help           0.000000  0.000000  0.000000  0.000000  2.609438
if             0.000000  0.000000  0.000000  2.609438  0.000000
illustration   0.000000  0.000000  0.000000  2.609438  0.000000
illustrations  0.000000  2.609438  0.000000  0.000000  0.000000
john           2.609438  0.000000  0.000000  0.000000  0.000000
need           0.000000  0.000000  2.609438  0.000000  0.000000
on             0.000000  2.609438  0.000000  0.000000  0.000000
own            0.000000  2.609438  0.000000  0.000000  0.000000

In [6]:
tfidf

<5x31 sparse matrix of type '<class 'numpy.float64'>'
	with 37 stored elements in Compressed Sparse Row format>
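
The TF-IDF values printed above can be checked by hand. With `smooth_idf=False` and `norm=None`, scikit-learn weights a term as its count times `ln(N / df) + 1`, where `N` is the number of documents and `df` is the number of documents containing the term. The sketch below reproduces two of the table entries; the helper name `tfidf_weight` is hypothetical and not part of scikit-learn.

```python
import math

def tfidf_weight(term_count, n_docs, doc_freq):
    """Unnormalized TF-IDF with unsmoothed idf: count * (ln(N / df) + 1)."""
    return term_count * (math.log(n_docs / doc_freq) + 1)

# "john" occurs once, in one of the five sentences
print(round(tfidf_weight(1, 5, 1), 6))
# "and" occurs once per sentence, in two of the five sentences
print(round(tfidf_weight(1, 5, 2), 6))
```

A term that appears in a single sentence gets the largest weight, regardless of whether it is a content word or a function word.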

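#### TF-IDF yields high values for terms that occur frequently in a document but are rare overall, and near-zero values for terms that are absent from a document or present in most documents. High TF-IDF values therefore mark the informative terms, so the material that can be discarded fairly safely is the low-information material: the HTML markup and the common function words (for example "and", "be", "for", "the", "to"), which together make up well over 20% of the terms. With only five short sentences, TF-IDF alone does not separate these (a stop word that occurs in a single sentence scores as high as a content word), so a stop-word list is the more reliable guide here.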

### Question 3 (3 points) Suppose the classification task is to predict whether this post requires the attention of the instructor, or whether a teaching assistant might suffice. Identify the 20% of the terms that you think might be most helpful in that task.



In [7]:
corpus = []
label = []
# The passage carries no instructor/teaching-assistant labels, so every
# sentence receives the placeholder label 0; duplicate sentences are skipped.
for sentence in text:
    if sentence in corpus:
        continue
    label.append(0)
    corpus.append(sentence)
preprocessor = CountVectorizer(tokenizer=LemmaTokenizer(), encoding='latin1')
preprocessedText = preprocessor.fit_transform(corpus)

# TF-IDF and latent semantic analysis
tfidfTransformer = TfidfTransformer()
tfidf = tfidfTransformer.fit_transform(preprocessedText)

# Extract concepts using LSA (TruncatedSVD defaults to 2 components)
svd = TruncatedSVD()
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

lsa_tfidf = lsa.fit_transform(tfidf)

# split dataset into 60% training and 40% test set
Xtrain, Xtest, ytrain, ytest = train_test_split(lsa_tfidf, label, test_size=0.4, random_state=42)

# fit logistic regression on the training set
# (raises the ValueError below because every label is 0)
logit_reg = LogisticRegression(solver='lbfgs')
logit_reg.fit(Xtrain, ytrain)

# print confusion matrix and accuracy
classificationSummary(ytest, logit_reg.predict(Xtest))

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
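
The error above arises because `label` contains only the value 0, so there is nothing for the classifier to separate. A check like the following could run before fitting; it is a minimal sketch, and the helper name `has_two_classes` is hypothetical (not part of scikit-learn or dmba).

```python
def has_two_classes(labels):
    """Return True when the labels contain at least two distinct classes."""
    return len(set(labels)) >= 2

# Labels as produced for the five sentences of the passage
label = [0, 0, 0, 0, 0]

if not has_two_classes(label):
    print("Cannot fit a classifier: only one class present:", set(label))
```

Training a model for this task would require a set of posts already marked as needing the instructor or a teaching assistant, which the passage alone does not provide.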

In [8]:
tfidf

<5x11 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

### Question 4 (3 points) What aspect of the passage is most problematic from the standpoint of simply using a bag-of-words approach, as opposed to an approach in which meaning is extracted?

#### The quoted sentence, "Illustrations and demos will be provided for students to work through on their own," is the most problematic part of the passage for a bag-of-words approach. The poster is quoting course material in order to ask about it, but a bag of words counts its terms as if they were the poster's own statement, so the concepts to which those terms should map are not obvious.
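
As an illustration of how word counts alone can lose meaning, the sketch below compares a question from the post with a reordered version that reads as a statement. The reordered sentence and the helper name `bag_of_words` are made up for this example; the tokenizer is a deliberately simple lowercase letter match, not the one used earlier.

```python
from collections import Counter
import re

def bag_of_words(sentence):
    """Count lowercase alphabetic tokens, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

question = "Do we need that to finish project?"     # from the post
statement = "We do need that to finish project."    # hypothetical reordering

# Both sentences produce the same counts, so a bag-of-words model
# cannot tell the question from the statement.
print(bag_of_words(question) == bag_of_words(statement))
```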
