### Assignment 9: Implementing Text Mining

#### The problems in this assignment are based on the exercises of Chapter 20 in Data Mining for Business Analytics.

#### Scenario Consider the case of a website that caters to the needs of a specific farming community, and carries classified ads intended for that community. Anyone, including robots, can post an ad via a web interface, and the site owners have problems with ads that are fraudulent, spam, or simply not relevant to the community. They have provided a file with 4143 ads, each ad in a row, and each ad labeled as either −1 (not relevant) or 1 (relevant). The goal is to develop a predictive model that can classify ads automatically.

#### Data For more information about the data set, see https://archive.ics.uci.edu/ml/datasets/Farm+Ads. Each ad includes the words on the ad creative and the words from the landing page. Each word from the creative is given the prefix 'ad-'. Title ('title-') and header ('header-') HTML markup is marked in the same way in the text of the landing page. Stemming and stop word removal have already been applied.

#### Data preprocessing Open the file farm-ads.csv, and briefly review some of the relevant and non-relevant ads to get a flavor for their contents.

#### Following the example in the chapter, preprocess the data in Python.

In [1]:
# Import Required Packages
%matplotlib inline

from pathlib import Path

from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
import nltk
from nltk import word_tokenize          
from nltk.stem.snowball import EnglishStemmer 
import matplotlib.pylab as plt
from dmba import printTermDocumentMatrix, classificationSummary, liftChart

nltk.download('punkt')

no display found. Using non-interactive Agg backend


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Cobbadmin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
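# Read in data
# Note: read_csv treats the first line as a header by default, so the first ad
# becomes the column names shown below; header=None would keep it as a data row.
farm_ads_df = pd.read_csv('farm-ads.csv')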
farm_ads_df.head(10)

Unnamed: 0,-1,ad-abdominal ad-aortic ad-aneurysm ad-doctorfinder ad-help ad-patient ad-local ad-physician ad-treat ad-aaa ad-www ad-findtheaaanswer ad-org page found
0,-1,ad-abdominal ad-aortic ad-aneurysm ad-million...
1,-1,ad-absorbent ad-oil ad-snar ad-factory ad-dir...
2,-1,ad-acid ad-reflux ad-relief ad-top ad-treatme...
3,-1,ad-acid ad-reflux ad-symptom ad-acid ad-reflu...
4,-1,ad-addiction ad-treatment ad-detox ad-opioid ...
5,-1,ad-adenocarcinoma ad-treatment ad-learn ad-le...
6,-1,ad-adhd ad-sign ad-top ad-ten ad-adhd ad-symp...
7,-1,ad-adhd ad-treatment ad-adult ad-information ...
8,-1,ad-adorable ad-farm ad-mural ad-easy ad-paint...
9,-1,ad-adult ad-hepatitis ad-recently ad-diagnose...


In [8]:
farm_ads_df.tail(10)

Unnamed: 0,-1,ad-abdominal ad-aortic ad-aneurysm ad-doctorfinder ad-help ad-patient ad-local ad-physician ad-treat ad-aaa ad-www ad-findtheaaanswer ad-org page found
4132,1,ad-yorkie ad-puppy ad-sale ad-toy ad-teacup a...
4133,1,ad-yorkie ad-puppy ad-sale ad-yorkie ad-toy a...
4134,1,ad-zealand ad-travel ad-plan ad-zealand ad-tr...
4135,1,ad-zupreem ad-pet ad-food ad-price ad-guarant...
4136,1,ad-zupreem ad-pet ad-food ad-price ad-guarant...
4137,1,ad-zupreem ad-pet ad-food ad-price ad-guarant...
4138,1,ad-zupreem ad-pet ad-food ad-zupreem ad-anima...
4139,1,ad-zupreem ad-pet ad-food ad-zupreem ad-anima...
4140,1,ad-zupreem ad-pet ad-food ad-zupreem ad-anima...
4141,1,ad-zupreem ad-pet ad-food ad-zupreem ad-anima...
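
The column header in both previews is itself the text of an ad, which suggests the file has no header row. The sketch below shows one way to read such a file so that the first ad stays in the data. It is an illustration under assumptions: the column names `label` and `text` are hypothetical, and two made-up rows stand in for farm-ads.csv.

```python
import io
import pandas as pd

# Two made-up rows standing in for farm-ads.csv (label, ad text), no header row.
sample_csv = io.StringIO(
    "-1,ad-acid ad-reflux ad-relief page found\n"
    "1,ad-yorkie ad-puppy ad-sale farm puppy\n"
)

# header=None keeps the first row as data; the column names are our own choice.
ads = pd.read_csv(sample_csv, header=None, names=["label", "text"])

print(ads.shape)
print(ads["label"].value_counts())
```

With the real file, the same call on `'farm-ads.csv'` should return one row per ad, and `ads["text"]` can be passed to a vectorizer directly.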


### Question 1 (5 points) Based on the analysis of the document, create a term-document matrix. Examine the term-document matrix. Is it sparse or dense? Look at the first row of the term-document matrix and determine the meaning of the non-zero elements.

In [11]:
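count_vect = CountVectorizer(token_pattern='[a-zA-Z!:)]+')
# Note: iterating over a DataFrame yields its column labels, so only two strings
# are vectorized: S1 is the header '-1' and S2 is the first ad (read as a header).
# Passing the text column, farm_ads_df.iloc[:, 1], would vectorize every ad.
counts = count_vect.fit_transform(farm_ads_df)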

printTermDocumentMatrix(count_vect, counts)

                 S1  S2
aaa               0   1
abdominal         0   1
ad                0  13
aneurysm          0   1
aortic            0   1
doctorfinder      0   1
findtheaaanswer   0   1
found             0   1
help              0   1
local             0   1
org               0   1
page              0   1
patient           0   1
physician         0   1
treat             0   1
www               0   1


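##### The term-document matrix is sparse: each ad uses only a small fraction of the full vocabulary, so most entries are zero.

##### Looking at the first row of the term-document matrix, each non-zero element is the number of times that row's term occurs in the document for that column.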
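
Sparsity can be checked numerically by comparing the number of stored non-zero entries with the total number of cells. The following is a minimal sketch on made-up documents (the `docs` list is not taken from farm-ads.csv). Note that scikit-learn stores the matrix as documents by terms, the transpose of the printed layout.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents in the style of the ads; not taken from farm-ads.csv.
docs = [
    "ad-acid ad-reflux ad-relief page found",
    "ad-yorkie ad-puppy ad-sale farm puppy sale",
    "ad-pet ad-food ad-price pet food store",
]

# Pass an iterable of strings (one per ad), not the DataFrame itself.
vectorizer = CountVectorizer(token_pattern=r"[a-zA-Z]+")
counts = vectorizer.fit_transform(docs)   # scipy sparse matrix: documents x terms

n_docs, n_terms = counts.shape
density = counts.nnz / (n_docs * n_terms)
print(f"{n_docs} documents, {n_terms} terms, {density:.1%} of cells non-zero")

# Non-zero entries of the first document: term -> number of occurrences.
index_to_term = {j: t for t, j in vectorizer.vocabulary_.items()}
first = counts[0]
first_doc_counts = {index_to_term[j]: int(first[0, j]) for j in first.nonzero()[1]}
print(first_doc_counts)
```

On a corpus of several thousand ads the density is expected to be far lower than in this toy case, because the vocabulary grows much faster than the length of any single ad.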

### Question 3 (8 points) Using logistic regression, partition the data (60% training, 40% validation, random_state=1), and develop a model to classify the documents as ‘relevant’ or ‘non-relevant.’ Comment on its efficacy.

In [12]:
stopWords = sorted(ENGLISH_STOP_WORDS)
ncolumns = 6
nrows = 30

print('First {} of {} stopwords'.format(ncolumns * nrows, len(stopWords)))
for i in range(0, len(stopWords[:(ncolumns * nrows)]), ncolumns):
    print(''.join(word.ljust(13) for word in stopWords[i:(i+ncolumns)]))

First 180 of 318 stopwords
a            about        above        across       after        afterwards   
again        against      all          almost       alone        along        
already      also         although     always       am           among        
amongst      amoungst     amount       an           and          another      
any          anyhow       anyone       anything     anyway       anywhere     
are          around       as           at           back         be           
became       because      become       becomes      becoming     been         
before       beforehand   behind       being        below        beside       
besides      between      beyond       bill         both         bottom       
but          by           call         can          cannot       cant         
co           con          could        couldnt      cry          de           
describe     detail       do           done         down         due          
during       each        

In [14]:
# Create a custom tokenizer that uses NLTK for tokenizing and stemming
# (drops stop words and any token that is not purely alphabetic,
# including hyphenated tokens such as 'ad-help')
class LemmaTokenizer(object):
    def __init__(self):
        self.stemmer = EnglishStemmer()
        self.stopWords = set(ENGLISH_STOP_WORDS)

    def __call__(self, doc):
        return [self.stemmer.stem(t) for t in word_tokenize(doc) 
                if t.isalpha() and t not in self.stopWords]

# Learn features based on text
count_vect = CountVectorizer(tokenizer=LemmaTokenizer())
counts = count_vect.fit_transform(farm_ads_df)

printTermDocumentMatrix(count_vect, counts)

      S1  S2
page   0   1
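
Only `page` survives in this output. A likely reason is the `t.isalpha()` filter: a token that contains a hyphen, such as `ad-help`, is not purely alphabetic, and `found` is on the stop word list. The sketch below illustrates the filter with plain string methods and shows a hypothetical regular expression (the name `PREFIXED_TOKEN` is an assumption, not part of the chapter's code) that would keep prefixed tokens whole.

```python
import re

text = "ad-abdominal ad-aortic ad-help page found"

# str.isalpha() is False for any token containing a hyphen,
# so whitespace-separated tokens such as 'ad-help' fail the filter.
kept = [t for t in text.split() if t.isalpha()]
print(kept)    # ['page', 'found']

# Hypothetical alternative: a pattern that keeps prefixed tokens whole.
# It could be passed to CountVectorizer through its token_pattern argument.
PREFIXED_TOKEN = r"[A-Za-z]+(?:-[A-Za-z]+)*"
tokens = re.findall(PREFIXED_TOKEN, text)
print(tokens)  # ['ad-abdominal', 'ad-aortic', 'ad-help', 'page', 'found']
```

Whether the `ad-` prefix should be kept or stripped is a modeling choice; keeping it lets the classifier treat words from the ad creative and words from the landing page as different predictors.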


In [16]:
# Apply CountVectorizer and TfidfTransformer sequentially
count_vect = CountVectorizer()
tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)
counts = count_vect.fit_transform(farm_ads_df)
tfidf = tfidfTransformer.fit_transform(counts)

printTermDocumentMatrix(count_vect, tfidf)

                  S1         S2
aaa              0.0   1.693147
abdominal        0.0   1.693147
ad               0.0  22.010913
aneurysm         0.0   1.693147
aortic           0.0   1.693147
doctorfinder     0.0   1.693147
findtheaaanswer  0.0   1.693147
found            0.0   1.693147
help             0.0   1.693147
local            0.0   1.693147
org              0.0   1.693147
page             0.0   1.693147
patient          0.0   1.693147
physician        0.0   1.693147
treat            0.0   1.693147
www              0.0   1.693147
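
The values in this output can be reproduced by hand. With `smooth_idf=False` and `norm=None`, scikit-learn computes the inverse document frequency as ln(N / df) + 1 and multiplies it by the raw term count. Here N = 2 (S1 and S2) and every listed term occurs in one document only.

```python
import math

n_docs = 2        # S1 and S2 in the output above
doc_freq = 1      # each listed term occurs in S2 only

idf = math.log(n_docs / doc_freq) + 1
print(round(idf, 6))        # 1.693147

# 'ad' occurs 13 times in S2, so its weight is 13 * idf
print(round(13 * idf, 6))   # 22.010913
```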


In [24]:
# Step 1: build the corpus, one document per ad.
# The ad text is in the second column; the labels are in the first.
corpus = farm_ads_df.iloc[:, 1].astype(str).tolist()

# Step 2: preprocessing (tokenization, stemming, and stopwords)
class LemmaTokenizer(object):
    def __init__(self):
        self.stemmer = EnglishStemmer()
        self.stopWords = set(ENGLISH_STOP_WORDS)
    def __call__(self, doc):
        return [self.stemmer.stem(t) for t in word_tokenize(doc) 
                if t.isalpha() and t not in self.stopWords]

preprocessor = CountVectorizer(tokenizer=LemmaTokenizer(), encoding='latin1')
preprocessedText = preprocessor.fit_transform(corpus)

# Step 3: TF-IDF and latent semantic analysis
tfidfTransformer = TfidfTransformer()
tfidf = tfidfTransformer.fit_transform(preprocessedText)

# Extract 20 concepts using LSA (truncated SVD followed by row normalization)
svd = TruncatedSVD(20)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

lsa_tfidf = lsa.fit_transform(tfidf)
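
The remaining steps of Question 3 (partitioning and logistic regression) could look roughly like the sketch below. It is a sketch under assumptions, not the chapter's solution: the texts and labels are made up, `n_components=2` replaces the 20 concepts because the toy corpus is tiny, and `stratify=labels` is added only so that both classes appear in the small training set.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up ads and labels (1 = relevant, -1 = not relevant); placeholders only.
texts = [
    "tractor part sale farm equipment", "cattle feed supplement farm",
    "hay bale barn farm supply", "poultry feed chicken coop farm",
    "horse saddle stable farm tack", "acid reflux relief treatment symptom",
    "addiction treatment detox clinic", "hepatitis diagnose treatment doctor",
    "arthritis pain relief treatment", "migraine symptom doctor clinic",
]
labels = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

# 60% training, 40% validation, as the question specifies.
train_X, valid_X, train_y, valid_y = train_test_split(
    texts, labels, test_size=0.4, random_state=1, stratify=labels)

# TF-IDF -> LSA concepts -> row normalization -> logistic regression.
# Fitting the whole pipeline on the training texts keeps validation data
# out of the vocabulary and the SVD.
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=1),
    Normalizer(copy=False),
    LogisticRegression(),
)
model.fit(train_X, train_y)

pred = model.predict(valid_X)
print(confusion_matrix(valid_y, pred, labels=[-1, 1]))
print("validation accuracy:", accuracy_score(valid_y, pred))
```

On the assignment data, the efficacy comment would be based on the validation confusion matrix and accuracy, compared against the share of the majority class.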

### Question 4 (3 points) Why use the concept-document matrix, and not the term-document matrix, to provide the predictor variables?

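#### The two matrices are not simply transposes of each other. The term-document matrix has one dimension per distinct term, which typically means thousands of sparse, highly correlated predictors. Latent semantic analysis (singular value decomposition) keeps every document but replaces the terms with a small number of concepts (20 here), each a weighted combination of terms. Using the concept-document matrix therefore gives the classifier far fewer, denser predictors, which reduces overfitting and computation while retaining most of the information in the text.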
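
The size difference between the two representations can be seen by comparing matrix shapes before and after the SVD step. This is a small illustration on made-up documents, with 2 concepts instead of 20. scikit-learn lays both matrices out as documents by columns, so the reduction shows up in the number of columns.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; placeholders for the ad texts.
docs = [
    "tractor part sale farm equipment",
    "cattle feed supplement farm",
    "acid reflux relief treatment symptom",
    "addiction treatment detox clinic",
    "hay bale barn farm supply",
]

tfidf = TfidfVectorizer().fit_transform(docs)
concepts = TruncatedSVD(n_components=2, random_state=1).fit_transform(tfidf)

# Same number of rows (documents); far fewer columns (predictors).
print("term matrix:   ", tfidf.shape)
print("concept matrix:", concepts.shape)
```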