### Text Classification with Naive Bayes and K-Means Clustering


In [1]:
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
#from collections import Counter
 
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
 
# Cleaning the text 
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])  # lower-case. stop words removal
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)         # punctuation
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) # lemmatization
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    return y

def getCleanText(doc):
    clean_lines = []
    for line in doc:
        line = line.strip()
        cleaned = clean(line)
        cleaned = ' '.join(cleaned)
        clean_lines.append(cleaned)
    
    return clean_lines

### Text Classification with Naive Bayes

Naive Bayes is a supervised classification technique: it learns from a labeled training set in which each document comes with its correct category. Here the training set is two categories from the scikit-learn 20 newsgroups dataset. The trained model is then used to predict the category of new, unseen documents.
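To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the standard library only. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names `train_nb` and `predict_nb`, the whitespace tokenizer, the add-one smoothing constant, and the toy documents are all hypothetical and chosen for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()      # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, doc, alpha=1.0):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = math.log(n_label / n_docs)             # log prior
        total = sum(word_counts[label].values())       # words seen in this class
        for token in doc.lower().split():
            if token in vocab:                         # skip words never seen in training
                smoothed = (word_counts[label][token] + alpha) / (total + alpha * len(vocab))
                score += math.log(smoothed)            # add-alpha smoothed likelihood
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set, not taken from 20 newsgroups
toy_docs = ["pitcher threw the ball",
            "home run in the ninth inning",
            "pray for your neighbor",
            "faith and prayer in church"]
toy_labels = ["baseball", "baseball", "religion", "religion"]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "the pitcher hit a home run"))
print(predict_nb(toy_model, "faith in your church"))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log likelihoods can simply be added.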

In [2]:
# load data
from sklearn.datasets import fetch_20newsgroups

cat = ['soc.religion.christian', 'rec.sport.baseball']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=cat, 
                                  remove=('headers', 'footers', 'quotes'),
                                  shuffle=True, random_state=42)


list(twenty_train.target_names)

# category names of the first 10 training documents
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
 

soc.religion.christian
soc.religion.christian
rec.sport.baseball
soc.religion.christian
rec.sport.baseball
rec.sport.baseball
rec.sport.baseball
soc.religion.christian
rec.sport.baseball
soc.religion.christian


### Tokenizing text

`CountVectorizer` handles tokenization and can optionally filter stop words (through its `stop_words` parameter; none are removed by default). It builds a dictionary of features and transforms documents into feature vectors of term counts.

However, longer documents will have higher average count values than shorter ones. The tf-idf weight is a statistical measure of how important a word is to a document within a corpus: importance increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word occurs across the corpus. Term frequency (TF) measures how often a word occurs in a document; inverse document frequency (IDF) measures how rare the word is across all documents.
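The weighting described above can be sketched by hand. The snippet below uses the textbook form `tf * log(N / df)`; the function name `tf_idf` and the three-sentence corpus are hypothetical. scikit-learn's `TfidfTransformer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus
toy_corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
toy_weights = tf_idf(toy_corpus)
for term, weight in sorted(toy_weights[0].items()):
    print(term, round(weight, 3))
```

In this toy corpus "the" occurs in every document, so its IDF is `log(3 / 3) = 0` and it contributes nothing, while "barks" occurs in only one document and receives the largest weight in the first document.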


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# vectorization
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


# training: naive bayes for classification
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, 
                          twenty_train.target) # labels


# predicting
docs_new = ['Take me out to the ball park', 
            'Jesus says love thy neighbor', 
            'Baseball is ninety percent mental. The other half is physical.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
    

'Take me out to the ball park' => rec.sport.baseball
'Jesus says love thy neighbor' => soc.religion.christian
'Baseball is ninety percent mental. The other half is physical.' => rec.sport.baseball


The classifier assigned all three new documents to the expected category.

### Text Clustering with K-Means
K-means is an unsupervised learning algorithm: there are no labels with right or wrong answers.
It groups unlabelled points into clusters by repeatedly assigning each point to its nearest cluster centre (centroid)
and then moving each centroid to the mean of the points assigned to it.

In this case, I have created a small corpus of quotes about dogs and Google, without any target or label.
It is used to fit the K-means model, which then assigns new documents to one of the learned clusters.
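As a rough illustration of how K-means works, the sketch below runs the assign-then-update loop on a handful of 2-D points using only the standard library. The function name `kmeans_2d`, the points, and the hand-picked starting centroids are assumptions made for this example; scikit-learn's `KMeans` works on the high-dimensional tf-idf vectors and chooses its own starting centroids (`k-means++`).

```python
def kmeans_2d(points, centroids, n_iter=10):
    """Plain K-means on 2-D points with caller-supplied initial centroids."""
    labels = []
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        labels = []
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            nearest = dists.index(min(dists))
            labels.append(nearest)
            clusters[nearest].append((x, y))
        # update step: each centroid moves to the mean of its members
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean_x = sum(px for px, _ in members) / len(members)
                mean_y = sum(py for _, py in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid in place
        if new_centroids == centroids:      # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
              (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
toy_labels, toy_centroids = kmeans_2d(toy_points, [(0.0, 0.0), (10.0, 10.0)])
print(toy_labels)
print(toy_centroids)
```

With these starting centroids the first three points end up in cluster 0 and the last three in cluster 1, and the loop stops as soon as the centroids no longer move.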
 


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

docs = ["Time spent with dogs is never wasted",
"Dogs choose us, we don't own them.",
"It is impossible to keep a straight face in the presence of one or more puppies.",
"A dog is the only thing on earth that loves you more than you love yourself.",
"No matter how you're feeling, a little dog gonna love you.",
"There's a saying, when the dog looks at you, the dog is not thinking what kind of a person you are",
"If you can spell 'Nietzsche' without Google, you deserve a cookie.",
"Google Translate app is incredible.",
"If you open 100 tabs in google you get a smiley face.",
"Best dog photo I've ever taken.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome.",
"I use google maps regularly"]

clean_docs = getCleanText(docs)

# vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(clean_docs)

# train k-means model
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)


print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()

for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    
    
print("\nTop prediction:")
new_docs = ["Dogs are man's best friend", 
            "Cat and dog videos on you-tube are the best",
            "Google chrome is the worst",
            "Google is taking over the world"]

# prediction
pred = model.predict(vectorizer.transform(new_docs))
print(pred)

Top terms per cluster:
Cluster 0:
 dog
 love
 choose
 time
 wasted
 spent
 best
 ive
 taken
 photo
Cluster 1:
 google
 map
 regularly
 use
 feedback
 impressed
 incredible
 app
 translate
 tab

Top prediction:
[0 0 1 1]


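The model appears to have grouped the new documents sensibly: the first two (about dogs) were assigned to cluster 0, while
the last two (about Google) were assigned to cluster 1. Cluster numbers are arbitrary, and because `KMeans` is created without a fixed `random_state`, they can differ between runs.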
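Because cluster ids carry no meaning of their own, comparing two runs should ignore the numbering. The helper below is a small sketch of that check; the name `same_partition` is hypothetical and it is not part of scikit-learn.

```python
def same_partition(labels_a, labels_b):
    """True if two labelings group the items identically, ignoring cluster ids."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # each id in one labeling must map to exactly one id in the other
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

print(same_partition([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, ids swapped
print(same_partition([0, 0, 1, 1], [0, 1, 1, 0]))  # different grouping
```

The first call prints `True` and the second prints `False`: swapping the ids leaves the grouping unchanged, whereas moving a document to the other group does not.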

References:<br>
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html <br>
https://hackernoon.com/finding-the-most-important-sentences-using-nlp-tf-idf-3065028897a3
