# Text Mining

#### Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile
import nltk
from nltk.stem.snowball import SnowballStemmer, EnglishStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from dmba import classificationSummary

SEED = 42

In [2]:
#nltk.download('punkt_tab')
#nltk.download('stopwords')

### Example 1: Lecture

**Remove punctuation from sample text.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [3]:
s = 'The technician was resolving technical issues quickly.'


<h3 style="color:teal"> Expected Output: </h3>

['The', 'technician', 'was', 'resolving', 'technical', 'issues', 'quickly']
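
One possible way to reach this output is sketched below using only the standard library. The notebook imports `word_tokenize`, which requires the `punkt_tab` download; here `str.split` is swapped in as a stand-in tokenizer. That is an assumption, and it is adequate for this sentence only because the words are separated by spaces once punctuation is gone.

```python
from string import punctuation

s = 'The technician was resolving technical issues quickly.'

# Delete every ASCII punctuation character, then split on whitespace.
# str.split is an assumed stand-in for nltk's word_tokenize.
no_punct = s.translate(str.maketrans('', '', punctuation))
tokens = no_punct.split()
print(tokens)
```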

**Remove stop words from sample text.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
stop_words = stopwords.words('english')


<h3 style="color:teal"> Expected Output: </h3>

['technician', 'resolving', 'technical', 'issues', 'quickly']

**Stem words from sample text.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
stemmer = SnowballStemmer("english")


<h3 style="color:teal"> Expected Output: </h3>

['technician', 'resolv', 'technic', 'issu', 'quick']
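
A minimal sketch of the stemming step, assuming the input is the token list shown in the stop-word step above. `SnowballStemmer("english")` works without any NLTK data download as long as its `ignore_stopwords` option is left at the default.

```python
from nltk.stem.snowball import SnowballStemmer

# Tokens left after punctuation and stop-word removal (copied from the previous step).
words = ['technician', 'resolving', 'technical', 'issues', 'quickly']

stemmer = SnowballStemmer("english")
# Reduce each word to its stem; stems need not be dictionary words.
stems = [stemmer.stem(w) for w in words]
print(stems)
```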

### Example 2: Lecture

**Conduct all preprocessing tasks on multiple sentences.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
def prepare_text(s):

    return " ".join(words)

In [None]:
sentences = ['The technician was resolving technical issues quickly.',
             'The engineer resolved several technical problems efficiently.']


<h3 style="color:teal"> Expected Output: </h3>

['technician resolv technic issu quick',
 'engin resolv sever technic problem effici']
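
The three steps from Example 1 can be combined roughly as follows. This is a sketch rather than the lecture's own solution: a regular expression stands in for `word_tokenize` plus the punctuation filter, and a tiny hand-written stop-word set is used as a fallback if the NLTK `stopwords` corpus has not been downloaded. Both choices are assumptions.

```python
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    # Corpus not downloaded: minimal stand-in, sufficient for this demo only.
    stop_words = {'the', 'was'}

stemmer = SnowballStemmer("english")

def prepare_text(s):
    # Lowercase and keep alphabetic runs only, which also drops punctuation.
    tokens = re.findall(r'[a-z]+', s.lower())
    # Remove stop words, then stem what remains.
    words = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(words)

sentences = ['The technician was resolving technical issues quickly.',
             'The engineer resolved several technical problems efficiently.']
prepared = [prepare_text(s) for s in sentences]
print(prepared)
```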

**Create Bag-Of-Words count matrix.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

|   | effici | engin | issu | problem | quick | resolv | sever | technic | technician |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |


**Create TF-IDF matrix.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

|   | effici | engin | issu | problem | quick | resolv | sever | technic | technician |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.693147 | 0.0 | 1.693147 | 1.0 | 0.0 | 1.0 | 1.693147 |
| 1 | 1.693147 | 1.693147 | 0.0 | 1.693147 | 0.0 | 1.0 | 1.693147 | 1.0 | 0.0 |


# Problem 20.3: Classifying Classified Ads Submitted Online
Consider the case of a website that caters to the needs of a specific farming community, and carries classified ads intended for that community.  Anyone, including robots, can post an ad via a web interface, and the site owners have problems with ads that are fraudulent, spam, or simply not relevant to the community.  They have provided a file with 4143 ads, each ad in a row, and each ad labeled as either -1 (not relevant) or 1 (relevant).
The goal is to develop a predictive model that can classify ads automatically.

**Read in the `farm-ads.csv` data and display the first five rows of the relevant and non-relevant ads.**

**Following the example in the chapter, preprocess the data in Python and create a term-document matrix and a concept matrix. Limit the number of concepts to 20.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

2210 relevant ads
      relevance                                               text
1933          1   ad-abdominal ad-aortic ad-aneurysm ad-million...
1934          1   ad-ac ad-montana ad-ranch ad-horse ad-cattle ...
1935          1   ad-acai ad-pure ad-product ad-amazon title-ac...
1936          1   ad-acclaim ad-website ad-builder ad-design ad...
1937          1   ad-acclaim ad-website ad-builder ad-design ad...
1933 non-relevant ads
   relevance                                               text
0         -1   ad-abdominal ad-aortic ad-aneurysm ad-doctorf...
1         -1   ad-abdominal ad-aortic ad-aneurysm ad-million...
2         -1   ad-absorbent ad-oil ad-snar ad-factory ad-dir...
3         -1   ad-acid ad-reflux ad-relief ad-top ad-treatme...
4         -1   ad-acid ad-reflux ad-symptom ad-acid ad-reflu...
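
The split-and-display step might look like the sketch below. The helper name `show_top_ads` and the small `demo_ads` frame are hypothetical; in the notebook the frame would come from `farm-ads.csv` and would need the columns `relevance` and `text` shown above. How the file is loaded (header row, column names) is not shown in this notebook, so loading is left out.

```python
import pandas as pd

def show_top_ads(ads, n=5):
    """Print the count and first n rows of each class; return both subsets."""
    relevant = ads[ads['relevance'] == 1]
    non_relevant = ads[ads['relevance'] == -1]
    print(f'{len(relevant)} relevant ads')
    print(relevant.head(n))
    print(f'{len(non_relevant)} non-relevant ads')
    print(non_relevant.head(n))
    return relevant, non_relevant

# Tiny made-up frame standing in for the real data.
demo_ads = pd.DataFrame({
    'relevance': [-1, -1, 1],
    'text': ['ad-acid ad-reflux', 'ad-absorbent ad-oil', 'ad-montana ad-ranch'],
})
relevant, non_relevant = show_top_ads(demo_ads)
```

The logistic regression cell later in the notebook expects the full frame to be named `farm_ads`.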


**Preprocess data and create TF-IDF matrix**

Use a custom token pattern that matches alphabetic characters and dashes (-).

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

(4143, 58047)


**Reduce dimensions using truncated singular value decomposition (SVD)**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

(4143, 20)
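
The reduction step can be sketched with the `TruncatedSVD`, `Normalizer`, and `make_pipeline` imports from the top of the notebook. The toy corpus is made up and has only ten terms, so two concepts are used here instead of the 20 the exercise asks for. Normalizing the rows after the SVD is an assumption based on those imports.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

SEED = 42
N_CONCEPTS = 2  # the exercise uses 20; this toy corpus is too small for that

# Hypothetical stand-in for the ad texts.
sample_ads = [
    'ad-montana ad-ranch ad-horse ad-cattle',
    'ad-ranch ad-horse ad-cattle ad-feed',
    'ad-acid ad-reflux ad-relief ad-treatment',
    'ad-acid ad-reflux ad-symptom ad-relief',
]

counts = CountVectorizer(token_pattern=r'[a-zA-Z-]+').fit_transform(sample_ads)
tfidf = TfidfTransformer().fit_transform(counts)

# Project documents onto N_CONCEPTS latent concepts, then scale rows to unit length.
svd = TruncatedSVD(n_components=N_CONCEPTS, random_state=SEED)
lsa = make_pipeline(svd, Normalizer(copy=False))
lsa_sample = lsa.fit_transform(tfidf)
print(lsa_sample.shape)
```

The logistic regression cell later in the notebook expects the reduced matrix for the full data set to be named `lsa_tfidf`.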


## Solution 20.3.c

Partition the data (60% training, 40% validation) and use logistic regression to develop a model that classifies the documents as 'relevant' or 'non-relevant'.

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
# split dataset into 60% training and 40% validation set
train_X, valid_X, train_y, valid_y = train_test_split(lsa_tfidf,
                                                      farm_ads.relevance, 
                                                      test_size=0.4, 
                                                      random_state=SEED)

# run logistic regression model on training
logit_reg = linear_model.LogisticRegression(solver='lbfgs')
logit_reg.fit(train_X, train_y)

# print confusion matrix and accuracy
classificationSummary(valid_y, logit_reg.predict(valid_X), 
                      class_names=logit_reg.classes_)

<h3 style="color:teal"> Expected Output: </h3>

Confusion Matrix (Accuracy 0.7829)

       Prediction
Actual  -1   1
    -1 551 242
     1 118 747
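
The reported accuracy can be recovered from the confusion matrix itself, and the per-class rates show where the errors fall: 242 non-relevant ads are labeled relevant, against 118 relevant ads labeled non-relevant. A short check, with the matrix values copied from the output above:

```python
import numpy as np

# Rows are actual classes (-1, 1); columns are predicted classes (-1, 1).
cm = np.array([[551, 242],
               [118, 747]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
# Share of actual relevant ads (class 1) predicted as relevant.
recall_relevant = cm[1, 1] / cm[1].sum()
# Share of actual non-relevant ads (class -1) predicted as non-relevant.
recall_non_relevant = cm[0, 0] / cm[0].sum()

print(f'accuracy            {accuracy:.4f}')
print(f'recall (relevant)   {recall_relevant:.4f}')
print(f'recall (non-relev.) {recall_non_relevant:.4f}')
```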
