<h1 align=center><font size=5>Text Classification with NLTK</font></h1>

We will use the Bag of Words (BoW) model for text classification.

### Data <a id='data'></a>

We use the 20 newsgroups dataset, loaded through scikit-learn. The loader is documented at the following URL: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset="train",
                                      remove=("headers", "footers", "quotes"),
                                      categories=categories,
                                      shuffle=True, random_state=123)

newsgroups_test = fetch_20newsgroups(subset="test",
                                     remove=("headers", "footers", "quotes"),
                                     categories=categories,
                                     shuffle=True, random_state=123)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [None]:

X_train = newsgroups_train.data
y_train = newsgroups_train.target


X_test = newsgroups_test.data
y_test = newsgroups_test.target

In [None]:
!pip install --user -U nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-cp36-none-any.whl size=1434676 sha256=0fb78391db9e0fca05ec2333e47f949467078b1e1bb80df7f0db1d17ef6f4824
  Stored in directory: /root/.cache/pip/wheels/ae/8c/3f/b1fe0ba04555b08b57ab52ab7f86023639a526d8bc8d384306
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.5


In [None]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

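from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
# Custom stopwords can be appended here. Note that 'the' is already part of
# NLTK's English list, so this line only shows the mechanism.
stopwords_list.extend(['the'])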


from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

import string

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Clean data





In [None]:
nltk.download('punkt')
import re
from nltk.tokenize import RegexpTokenizer

def clean_text(text):
    """Lowercase, tokenize, filter, and lemmatize a document; return one string."""
    text = text.lower().strip()
    tokens = word_tokenize(text)
    # Keep purely alphabetic tokens; this also drops punctuation and numbers,
    # so no separate string.punctuation filter is needed.
    tokens = [token for token in tokens if token.isalpha()]
    # Remove English stopwords.
    tokens = [token for token in tokens if token not in stopwords_list]
    # Drop tokens shorter than 4 characters.
    tokens = [token for token in tokens if len(token) >= 4]
    # Lemmatize each token as a verb, e.g. "using" -> "use".
    tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]
    return ' '.join(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
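
To make the filtering stages of `clean_text` easier to follow, here is a minimal standard-library sketch of the same idea. It is an approximation, not the function above: a regular expression stands in for `word_tokenize`, the stopword set `TOY_STOPWORDS` is a small hypothetical list rather than NLTK's, and lemmatization is left out entirely.

```python
import re

# Hypothetical, tiny stopword set for illustration; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "with", "for", "this", "that", "from"}

def clean_text_sketch(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Approximate the filtering stages of clean_text using only the standard library."""
    text = text.lower().strip()
    # Stand-in for word_tokenize + isalpha: keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text)
    # Remove stopwords.
    tokens = [t for t in tokens if t not in stopwords]
    # Drop tokens shorter than min_len characters.
    tokens = [t for t in tokens if len(t) >= min_len]
    return " ".join(tokens)

sample = "Is the .3ds file format for Autodesk's 3D Animation Studio available?"
print(clean_text_sketch(sample))
# file format autodesk animation studio available
```

Short words such as "is", and fragments such as the "s" split off from "Autodesk's", are removed by the length filter rather than by the stopword list, which is why the toy stopword set can stay so small here.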


### Apply the helper function to the training and test texts.


In [None]:
X_train_cleaned = [clean_text(text) for text in X_train]
X_test_cleaned = [clean_text(text) for text in X_test]

### Print the first 5 samples of the training data before and after cleaning.

In [None]:
print(X_train[:5])
print(X_train_cleaned[:5])

['Is the ".3ds" file format for Autodesk\'s 3D Animation Studio available?\n\nThanks,\nGary', '\n[...stuff deleted...]\n\nComputers are an excellent example...of evolution without "a" creator.\nWe did not "create" computers.  We did not create the sand that goes\ninto the silicon that goes into the integrated circuits that go into\nprocessor board.  We took these things and put them together in an\ninteresting way. Just like plants "create" oxygen using light through \nphotosynthesis.  It\'s a much bigger leap to talk about something that\ncreated "everything" from nothing.  I find it unfathomable to resort\nto believing in a creator when a much simpler alternative exists: we\nsimply are incapable of understanding our beginnings -- if there even\nwere beginnings at all.  And that\'s ok with me.  The present keeps me\nperfectly busy.', "I am trying to configure Zsoft's PC Paintbrush IV+ for use with my\nLogitech Scanman 32 (hand scanner), but I can't get Paintbrush to\nacknowledge the s

### Bag of Words Model <a id='bow'></a>

In this part, we consider the bag of words (BoW) model for text classification.
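
As a rough illustration of what a bag of words representation is, the sketch below builds one by hand with the standard library. The toy corpus and the helper `to_bow` are hypothetical and are not part of scikit-learn; the point is only that each document becomes a vector of word counts over a shared vocabulary, with word order discarded.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the 20 newsgroups data).
docs = [
    "paintbrush scanner paintbrush",
    "scanner help",
    "space shuttle launch",
]

# Vocabulary: every distinct word, sorted alphabetically and mapped to a column index.
vocabulary = {
    word: idx
    for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))
}

def to_bow(doc, vocabulary):
    """Return the count vector of doc over vocabulary; unknown words are ignored."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

bow_matrix = [to_bow(d, vocabulary) for d in docs]
print(vocabulary)
# {'help': 0, 'launch': 1, 'paintbrush': 2, 'scanner': 3, 'shuttle': 4, 'space': 5}
print(bow_matrix)
# [[0, 0, 2, 1, 0, 0], [1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 1]]
```

Words that are not in the vocabulary are silently dropped, which mirrors what happens when a fitted vectorizer transforms test documents containing unseen words.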

### Convert the training/test text data to BoW.

In [None]:
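from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# Training set: learn the vocabulary and build the count matrix
X_train_cv = cv.fit_transform(X_train_cleaned)
# Test set: reuse the vocabulary learned from the training set
X_test_cv = cv.transform(X_test_cleaned)
# Cleaned text of the third training document and its sparse count vector
print(X_train_cleaned[2])
print(X_train_cv[2])
# Convert the sparse matrices to dense arrays (this uses much more memory)
X_train_cv = X_train_cv.toarray()
X_test_cv = X_test_cv.toarray()
X_train_cv.shape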

try configure zsoft paintbrush logitech scanman hand scanner paintbrush acknowledge scanner anybody use paintbrush scanner help thank luis nobrega
  (0, 15405)	1
  (0, 16344)	1
  (0, 15926)	1
  (0, 3082)	1
  (0, 17319)	1
  (0, 11039)	3
  (0, 8962)	1
  (0, 13555)	1
  (0, 6663)	1
  (0, 13556)	3
  (0, 144)	1
  (0, 761)	1
  (0, 6856)	1
  (0, 9042)	1
  (0, 10446)	1


(2034, 17334)

In [None]:
print(X_train_cv[2]) 

[0 0 0 ... 0 0 0]
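
The two printouts above show the same row in sparse and dense form. Each sparse entry reads as `(row, column) count`, and the dense print shows only zeros because almost every column is zero. The sketch below, in plain Python rather than SciPy, expands the column/count pairs copied from the sparse printout into a dense row of the 17334 features reported by `X_train_cv.shape`.

```python
# Column index -> count for one document, copied from the sparse printout above.
sparse_row = {
    15405: 1, 16344: 1, 15926: 1, 3082: 1, 17319: 1,
    11039: 3, 8962: 1, 13555: 1, 6663: 1, 13556: 3,
    144: 1, 761: 1, 6856: 1, 9042: 1, 10446: 1,
}
n_features = 17334  # number of columns reported by X_train_cv.shape

# Expand to a dense row: every column not listed in sparse_row stays 0.
dense_row = [0] * n_features
for col, count in sparse_row.items():
    dense_row[col] = count

nonzero = sum(1 for value in dense_row if value != 0)
total_tokens = sum(dense_row)
print(nonzero, total_tokens)
# 15 19
print(f"fraction of non-zero columns: {nonzero / n_features:.6f}")
```

The 15 non-zero columns correspond to the 15 distinct words in the cleaned document, and the counts sum to its 19 tokens, so the dense row carries no extra information beyond the sparse one.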


In [None]:
X_train_cv.shape

(2034, 17334)

### Table of feature vectors for the corpus of the training data.

In [None]:
import pandas as pd
docs = pd.DataFrame(X_train_cv)
docs.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,17294,17295,17296,17297,17298,17299,17300,17301,17302,17303,17304,17305,17306,17307,17308,17309,17310,17311,17312,17313,17314,17315,17316,17317,17318,17319,17320,17321,17322,17323,17324,17325,17326,17327,17328,17329,17330,17331,17332,17333
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
