### (1) Text Classification
- Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.
- Reference : https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

1. Loading the data set in Jupyter.
2. Extracting features from text files.
3. Running ML algorithms.
4. Grid Search for parameter tuning.
5. Useful tips and a touch of NLTK.

## 1. Loading the "20 Newsgroups" data set in Jupyter.

In [1]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [2]:
print(twenty_train.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [3]:
twenty_train.target_names   # prints all the categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
import numpy as np
from pprint import pprint

pprint(twenty_train.target)
pprint(np.unique(twenty_train.target))

array([7, 4, 4, ..., 3, 1, 8])
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])


In [5]:
print("\n".join(twenty_train.data[0].split("\n")[:3]))   # prints the first three lines of the first document

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


## 2. Extracting features from text files.

To run machine learning algorithms, we first need to convert the text files into <strong>"numerical feature vectors"</strong>. This example uses the <strong>"bag of words"</strong> model. Briefly, we segment each text file into words (for English, by splitting on whitespace), <strong>"assign each word an integer id"</strong>, and <strong>"count the number of times each word occurs in each document"</strong>. <strong>"Each unique word"</strong> in the dictionary corresponds to one <strong>"feature"</strong> <u>(descriptive feature)</u>.
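As a rough illustration of the bag-of-words idea, the sketch below builds a vocabulary and per-document counts by hand. The two sample sentences and the whitespace tokenizer are assumptions made for illustration; CountVectorizer uses its own tokenizer and stores the result as a sparse matrix.

```python
from collections import Counter

# Two toy documents (made up for illustration).
docs = ["the car is fast", "the car is the best car"]

# Split on whitespace; a real vectorizer tokenizes more carefully.
tokenized = [doc.lower().split() for doc in docs]

# Assign each unique word an integer id (alphabetical order).
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}


def count_vector(tokens, vocabulary):
    """Return the word counts of one document, ordered by word id."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]


# Rows are documents, columns are word ids: a small document-term matrix.
doc_term_matrix = [count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
print(doc_term_matrix)
```

Each row has one column per vocabulary word, which is why the real matrix in this notebook has as many columns as there are distinct tokens in the training set.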

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

In [7]:
X_train_counts[0]

<1x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 89 stored elements in Compressed Sparse Row format>

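fit_transform learns the vocabulary dictionary and returns a document-term matrix of shape [n_samples, n_features].

Raw counts give more weight to longer documents than to shorter ones, so the counts are usually rescaled:

- <strong>TF</strong> (term frequency) : count(word) / total number of words in the document

- <strong>TF-IDF</strong> (term frequency times inverse document frequency) : additionally reduces the weight of words that occur in many documents, such as 'the', 'is', 'an'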
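The two weightings can be worked through by hand on a toy corpus. This is a minimal sketch using the textbook definition idf = log(N / df); the counts are made up, and scikit-learn's TfidfTransformer by default uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math

# Toy term counts per document (made up for illustration).
counts = [
    {"the": 2, "car": 1},
    {"the": 1, "engine": 1},
    {"the": 1, "car": 2, "engine": 1},
]
n_docs = len(counts)


def term_frequency(doc):
    """TF: count of each word divided by the total number of words in the document."""
    total = sum(doc.values())
    return {word: count / total for word, count in doc.items()}


def inverse_document_frequency(word):
    """Textbook IDF: log(N / df), where df is the number of documents containing the word."""
    df = sum(1 for doc in counts if word in doc)
    return math.log(n_docs / df)


tfidf = [
    {word: tf * inverse_document_frequency(word) for word, tf in term_frequency(doc).items()}
    for doc in counts
]

for row in tfidf:
    print(row)
```

A word such as "the", which occurs in every document, gets an IDF of log(1) = 0 under this definition, so it contributes nothing to the weighted vector.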

In [8]:
# Transform a count matrix to a normalized tf or tf-idf representation
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) # pass in the output of CountVectorizer
X_train_tfidf.shape

(11314, 130107)

## 3. Running ML algorithms.

- a. Naive Bayes : 
https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes

In [9]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
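To make the classifier less of a black box, here is a minimal hand-written sketch of multinomial Naive Bayes with additive smoothing on raw word counts. The toy documents, labels, and helper names are assumptions for illustration; this is not scikit-learn's implementation, and the notebook feeds TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

# Toy training data as (tokens, label) pairs, made up for illustration.
train = [
    (["goal", "match", "team"], "sport"),
    (["team", "win", "goal"], "sport"),
    (["cpu", "memory", "disk"], "tech"),
    (["cpu", "team"], "tech"),
]
alpha = 1.0  # additive (Laplace) smoothing strength

vocab = sorted({word for tokens, _ in train for word in tokens})
class_docs = Counter(label for _, label in train)   # documents per class
word_counts = defaultdict(Counter)                  # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)


def log_posterior(tokens, label):
    """Unnormalized log P(label | tokens) = log prior + sum of log word likelihoods."""
    log_prob = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for word in tokens:
        if word in vocab:  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            log_prob += math.log(smoothed)
    return log_prob


def predict(tokens):
    """Pick the class with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))


print(predict(["goal", "team"]))
print(predict(["cpu", "disk"]))
```

The smoothing term alpha keeps a word that never occurred in a class from driving that class's probability to zero.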

###### Building a pipeline : we can write less code and do all of the above, by building a pipeline as follows:

In [10]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

###### Performance of NB Classifier : test the performance of the NB classifier on test set

In [11]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
pred = text_clf.predict(twenty_test.data)
print(np.round(np.mean(pred == twenty_test.target), 5)*100, "%")

77.39 %


In [12]:
dict(zip(*np.unique((pred == twenty_test.target), return_counts=True)))

{False: 1703, True: 5829}

- b. Support Vector Machine(SVM) : https://scikit-learn.org/stable/modules/svm.html

If you want to fit <strong>a large-scale linear classifier without copying a dense NumPy C-contiguous double-precision array as input</strong>, the scikit-learn documentation suggests using the <strong>SGDClassifier class</strong> instead.

In [13]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss = 'hinge', 
                                                   penalty = 'l2',
                                                   alpha = 1e-3,
                                                   random_state = 42))])
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)
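With loss = 'hinge', SGDClassifier fits a linear SVM by stochastic gradient descent. As a small worked illustration of the hinge loss itself, the sketch below evaluates max(0, 1 - y * score) for a few binary examples; the labels in {-1, +1} and the scores are made-up values, and the function name is hypothetical.

```python
def hinge_loss(label, score):
    """Hinge loss for one example; label is -1 or +1, score is the raw linear output."""
    return max(0.0, 1.0 - label * score)


# Made-up (label, score) pairs for illustration.
examples = [(+1, 2.5), (+1, 0.3), (-1, -1.0), (-1, 0.4)]

for label, score in examples:
    print(label, score, hinge_loss(label, score))
```

Examples that are classified correctly with a margin of at least 1 incur no loss; the others are penalized in proportion to how far they fall short of that margin.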

In [14]:
pred_svm = text_clf_svm.predict(twenty_test.data)
print(np.round(np.mean(pred_svm == twenty_test.target), 5)*100, "%")

82.408 %


## 4. Grid Search for parameter tuning.

- a. Naive Bayes

In [17]:
# find optimal performance among various parameters
from sklearn.model_selection import GridSearchCV

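# parameter names are prefixed with the pipeline step name and a double underscore,
# e.g. 'clf__alpha' is the alpha parameter of the step named 'clf'
param = {'vect__ngram_range':[(1, 1), (1, 2)],   # unigrams only, or unigrams plus bigrams
         'tfidf__use_idf':(True, False),
         'clf__alpha': (1e-2, 1e-3)}             # 1/100, 1/1000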
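A grid search evaluates every combination of the listed values. The sketch below enumerates those combinations with itertools.product and splits each key into its step name and parameter name; it only illustrates the bookkeeping and is not how GridSearchCV is implemented internally.

```python
from itertools import product

param = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# One candidate setting per combination of values: 2 * 2 * 2 = 8 in total.
names = sorted(param)
candidates = [dict(zip(names, values)) for values in product(*(param[name] for name in names))]

print(len(candidates))
for candidate in candidates:
    # "vect__ngram_range" means: parameter "ngram_range" of the step named "vect".
    by_step = {tuple(key.split("__", 1)): value for key, value in candidate.items()}
    print(by_step)
```

Every added parameter multiplies the number of candidates, and each candidate is fitted once per cross-validation fold, so the grid is best kept small.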

In [18]:
gs_clf = GridSearchCV(text_clf, param, n_jobs = -1)      # n_jobs = -1 : use all available CPU cores
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [22]:
# best mean cross-validated score (measured on held-out folds of the training data, not on the test set) and its parameters

print(np.round(gs_clf.best_score_ * 100, 2), '%')
print(gs_clf.best_params_)

91.58 %
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
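The score printed here is best_score_, the mean cross-validated accuracy of the best parameter setting: the training data is split into folds, and each fold is scored by a model fitted on the remaining folds. It is measured on held-out parts of the training set, not on the test set, so the two numbers need not agree. The sketch below shows how a plain k-fold split partitions sample indices; it is a simplification (scikit-learn uses stratified folds for classifiers), and the helper name is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (training, validation) index lists for contiguous, near-equal folds."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        yield training, validation
        start = stop


# Every sample lands in exactly one validation fold.
for training, validation in k_fold_indices(10, 5):
    print("train:", training, "validate:", validation)
```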


###### Performance of NB Classifier after Grid Search

In [27]:
text_clf_final = Pipeline([('vect', CountVectorizer(ngram_range = (1, 2))),
                           ('tfidf', TfidfTransformer(use_idf = True)),
                           ('clf', MultinomialNB(alpha = 0.001))])
text_clf_final = text_clf_final.fit(twenty_train.data, twenty_train.target)

pred_final = text_clf_final.predict(twenty_test.data)

In [28]:
print(np.round(np.mean(pred_final == twenty_test.target), 4)*100, "%")

83.62 %


##### 77.39% => 83.62%

- b. Support Vector Machine(SVM)

In [23]:
param_svm = {'vect__ngram_range':[(1, 1), (1, 2)],
             'tfidf__use_idf':(True, False),
             'clf-svm__alpha':(1e-2, 1e-3)}
gs_clf_svm = GridSearchCV(text_clf_svm, param_svm, n_jobs = -1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

In [24]:
print(np.round(gs_clf_svm.best_score_ * 100, 2), '%')
print(gs_clf_svm.best_params_)

90.52 %
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


###### Performance of SVM Classifier after Grid Search

In [29]:
text_clf_svm_final = Pipeline([('vect', CountVectorizer(ngram_range = (1, 2))),
                               ('tfidf', TfidfTransformer(use_idf = True)),
                               ('clf-svm', SGDClassifier(loss = 'hinge', 
                                                         penalty = 'l2',
                                                         alpha = 1e-3,
                                                         random_state = 42))])
text_clf_svm_final = text_clf_svm_final.fit(twenty_train.data, twenty_train.target)

pred_svm_final = text_clf_svm_final.predict(twenty_test.data)

In [32]:
print(np.round(np.mean(pred_svm_final == twenty_test.target), 3)*100, "%")

83.5 %


##### 82.408% => 83.5%

## 5. Useful tips and a touch of NLTK.

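(1) <strong>Removing <u>stop words</u></strong> : stop words ('the', 'then', etc.) can be removed from the data when they are not useful for the underlying problem. In most text classification problems they carry little information about the category, so removing them is reasonable.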
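What stop-word removal does to a single document can be sketched in a few lines. The stop list here is a tiny hand-picked set assumed for illustration; CountVectorizer(stop_words='english') uses a much longer built-in list.

```python
# A tiny hand-picked stop list (an assumption for illustration only).
STOP_WORDS = {"the", "then", "is", "a", "an", "of"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop any token found in the stop list."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The engine of the car is a V8"))
```

Fewer tokens means fewer features, which shrinks the vocabulary the classifier has to learn from.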

<strong>Let’s see if removing stop words increases the accuracy.</strong>

- a. Naive Bayes

In [34]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [36]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
pred = text_clf.predict(twenty_test.data)
print(np.round(np.mean(pred == twenty_test.target), 4)*100, "%")

81.69 %


##### 77.39% => 81.69%

In [37]:
# find optimal performance among various parameters
from sklearn.model_selection import GridSearchCV

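# parameter names are prefixed with the pipeline step name and a double underscore,
# e.g. 'clf__alpha' is the alpha parameter of the step named 'clf'
param = {'vect__ngram_range':[(1, 1), (1, 2)],   # unigrams only, or unigrams plus bigrams
         'tfidf__use_idf':(True, False),
         'clf__alpha': (1e-2, 1e-3)}             # 1/100, 1/1000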

In [38]:
gs_clf = GridSearchCV(text_clf, param, n_jobs = -1)      # n_jobs = -1 : use all available CPU cores
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [39]:
# best mean cross-validated score (measured on held-out folds of the training data, not on the test set) and its parameters

print(np.round(gs_clf.best_score_ * 100, 2), '%')
print(gs_clf.best_params_)

91.29 %
{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


###### Performance of NB Classifier after Grid Search

In [42]:
text_clf_final = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range = (1, 2))),
                           ('tfidf', TfidfTransformer(use_idf = True)),
                           ('clf', MultinomialNB(alpha = 0.01))])
text_clf_final = text_clf_final.fit(twenty_train.data, twenty_train.target)

pred_final = text_clf_final.predict(twenty_test.data)

In [45]:
print(np.round(np.mean(pred_final == twenty_test.target), 5)*100, "%")

83.006 %


##### 81.69% => 83.006%

- b. Support Vector Machine(SVM)

In [46]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss = 'hinge', 
                                                   penalty = 'l2',
                                                   alpha = 1e-3,
                                                   random_state = 42))])
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

In [47]:
pred_svm = text_clf_svm.predict(twenty_test.data)
print(np.round(np.mean(pred_svm == twenty_test.target), 5)*100, "%")

82.262 %


##### 82.408% => 82.262%

In [48]:
param_svm = {'vect__ngram_range':[(1, 1), (1, 2)],
             'tfidf__use_idf':(True, False),
             'clf-svm__alpha':(1e-2, 1e-3)}
gs_clf_svm = GridSearchCV(text_clf_svm, param_svm, n_jobs = -1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

In [49]:
print(np.round(gs_clf_svm.best_score_ * 100, 2), '%')
print(gs_clf_svm.best_params_)

90.29 %
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


###### Performance of SVM Classifier after Grid Search

In [50]:
text_clf_svm_final = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range = (1, 2))),
                               ('tfidf', TfidfTransformer(use_idf = True)),
                               ('clf-svm', SGDClassifier(loss = 'hinge', 
                                                         penalty = 'l2',
                                                         alpha = 1e-3,
                                                         random_state = 42))])
text_clf_svm_final = text_clf_svm_final.fit(twenty_train.data, twenty_train.target)

pred_svm_final = text_clf_svm_final.predict(twenty_test.data)

In [51]:
print(np.round(np.mean(pred_svm_final == twenty_test.target), 3)*100, "%")

83.2 %


##### 82.408% => 83.2%

(2) <strong>fit_prior = False</strong> : when fit_prior is set to False, MultinomialNB uses a uniform class prior instead of learning the priors from the class frequencies in the training data.
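The difference between the two settings can be worked out by hand. This sketch computes the log priors for a made-up, imbalanced label list, once from the class frequencies and once as a uniform prior; the variable names are hypothetical and the code only mirrors the idea, not scikit-learn's internals.

```python
import math
from collections import Counter

# Made-up training labels: class 0 is three times as frequent as class 1.
labels = [0, 0, 0, 1]
classes = sorted(set(labels))

# Learned prior: log of each class's share of the training labels.
counts = Counter(labels)
learned_log_prior = [math.log(counts[c] / len(labels)) for c in classes]

# Uniform prior: every class gets the same probability, 1 / n_classes.
uniform_log_prior = [math.log(1 / len(classes)) for _ in classes]

print(learned_log_prior)
print(uniform_log_prior)
```

With a uniform prior the prediction depends only on the word likelihoods, which can help when the class proportions in the training set are not representative.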

In [52]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(fit_prior = False))])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [53]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
pred = text_clf.predict(twenty_test.data)
print(np.round(np.mean(pred == twenty_test.target), 4)*100, "%")

82.14 %


##### 81.69% => 82.14%

(3) <strong>Stemming</strong> : stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. 

(e.g., a stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish".)
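The idea can be illustrated with a deliberately crude suffix stripper. This is a toy written for illustration, not the Snowball algorithm; the suffix list and the minimum stem length are arbitrary assumptions, and a real stemmer applies far more careful rules.

```python
# Toy suffix list (an assumption for illustration; not the Snowball rules).
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["fishing", "fished", "fisher", "fish"]])
```

All four words map to the same token, so they end up sharing a single feature column instead of four.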

NLTK comes with several stemmers that reduce words to their root form.

In [55]:
# Snowball stemmer works very well for English language
import nltk

nltk.download()   # opens the interactive downloader; the 'stopwords' corpus is needed for ignore_stopwords = True

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [56]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords = True)

In [57]:
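class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer that stems every token produced by the default analyzer."""
    def build_analyzer(self):
        analyzer = super().build_analyzer()   # default preprocessing, tokenizing and stop-word filtering
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]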

In [59]:
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior = False))])
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

In [60]:
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
print(np.round(np.mean(predicted_mnb_stemmed == twenty_test.target)*100,3), '%')

81.678 %
