
#### Preface
This notebook explores the same four categories from the 20 Newsgroups data set. The same supervised and unsupervised algorithms are run below; however, I dropped LDA from this notebook because its results were consistently far worse than baseline. One difference from the other notebook is that the supervised classifier is not tuned with any grid search optimization.

This was more of a curiosity than anything else. Mikolov's models have been hugely influential, and I respect his leaning towards parsimony wherever possible. It is interesting to me how far NLP has come since his seminal papers on Word2Vec, Doc2Vec and fastText. It is also interesting that simpler count-based models like TF and TF-IDF still do OK on simple classification tasks.

The embeddings are derived from Gensim's Doc2Vec implementation of the original paragraph vector model. I am using the PV-DBOW model for the paragraph vectors, with skip-gram word vectors trained alongside them. The concatenation flag (`dm_concat=1`) is also set, although Gensim documents it as a PV-DM option, so it likely has no effect in DBOW mode. I avoided hierarchical softmax and favoured negative sampling (20 random noise words). The model was trained over just five epochs, with the learning rate (alpha) starting at Gensim's default and lowered by 0.001 after each epoch.
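
To make the learning-rate schedule concrete, here is a minimal sketch of the starting alpha for each pass of the manual training loop further down, which subtracts 0.001 from `alpha` after every epoch. It assumes the model starts from 0.025, which is Gensim's default `alpha` for Doc2Vec; `alpha_schedule` is a hypothetical helper written for illustration and is not part of the notebook or of Gensim.

```python
# Assumptions: 0.025 is Gensim's default Doc2Vec alpha; the 0.001 step and
# the five epochs mirror the training loop used later in this notebook.
START_ALPHA = 0.025
STEP = 0.001
EPOCHS = 5


def alpha_schedule(start=START_ALPHA, step=STEP, epochs=EPOCHS):
    """Return the starting learning rate for each pass of the manual loop."""
    # Round to hide floating-point noise such as 0.024000000000000004
    return [round(start - i * step, 6) for i in range(epochs)]


schedule = alpha_schedule()
print(schedule)  # [0.025, 0.024, 0.023, 0.022, 0.021]
```

These are only the values at the start of each pass. Because `min_alpha` is not pinned until after the first pass, the first epoch may additionally decay towards Gensim's default `min_alpha` while it runs.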


Results are below.


#### References:
Blei, Ng, Jordan (2003): Latent Dirichlet Allocation.
https://ai.stanford.edu/~ang/papers/nips01-lda.pdf

Gensim, PyData Berlin (2016):
https://github.com/RaRe-Technologies/movie-plots-by-genre

Greene, O’Callaghan, Cunningham (2014): How Many Topics? Stability Analysis for Topic Models.
https://arxiv.org/abs/1404.4606

Li (2018): Multi-Class Text Classification with Doc2Vec & Logistic Regression.
https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4

Le, Mikolov (2014): Distributed Representations of Sentences and Documents.
https://cs.stanford.edu/~quocle/paragraph_vector.pdf

#### Aug 2019

In [1]:
#Dependencies
import multiprocessing
import re
from pprint import pprint

import gensim
import nltk
import numpy as np
import pandas as pd
import spacy
from gensim.models import Doc2Vec, word2vec
from gensim.models.doc2vec import TaggedDocument
from nltk.corpus import stopwords
from sklearn import metrics, utils
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from tqdm import tqdm

tqdm.pandas(desc="progress-bar")
 

In [2]:
#Specify topic categories used in analysis
cats = ['rec.sport.baseball', 'rec.sport.hockey', 'talk.politics.guns', 'talk.politics.mideast']
target_names = ['Baseball', 'Hockey', 'Guns', 'Middle East']

#Scaler to map embedding values (which can be negative) into the range [0, 1] for the models
scaler = MinMaxScaler(feature_range=(0, 1), copy=True)


#Get the data
def get_data():
    global df
    data_tr = fetch_20newsgroups(subset='train', categories=cats, shuffle=True, random_state=1,
                                 remove=('headers', 'footers', 'quotes'))
    data_te = fetch_20newsgroups(subset='test', categories=cats, shuffle=True, random_state=1,
                                 remove=('headers', 'footers', 'quotes'))
    df_train = pd.DataFrame(data_tr.data, columns=['comment'])
    df_train['labels'] = data_tr.target
    df_test = pd.DataFrame(data_te.data, columns=['comment'])
    df_test['labels'] = data_te.target
    df = pd.concat([df_train, df_test])

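#Pre-processing: clean the raw text
def cleanText(text):
    text = re.sub(r'\|\|\|', r' ', text)       #replace '|||' separators with a space
    text = re.sub(r'http\S+', r'<URL>', text)  #mask URLs with a placeholder token
    text = text.lower()
    #Note: this strips every 'x' character, including those inside ordinary words
    text = text.replace('x', '')
    return text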

#Tokenize
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

 
#Train doc2vec model with 400 features
def train_model():
    global model_dbow
    cores = multiprocessing.cpu_count()
    #PV-DBOW model (dm=0) with a 400-dimension feature space. Word vectors are trained alongside the
    #paragraph vectors using skip-gram (dbow_words=1) with a window of five words. hs=0 turns off
    #hierarchical softmax in favour of negative sampling with twenty noise words fed to the model at a time.
    #Note: dm_concat=1 is kept from the original call, but Gensim documents it as a PV-DM (dm=1) option.
    model_dbow = Doc2Vec(dm=0, dm_concat=1, vector_size=400, dbow_words=1, window=5, negative=20, hs=0, workers=cores)
    model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])
    #Train for just five epochs, lowering alpha by 0.001 after each one
    for epoch in range(5):
        model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs=1)
        model_dbow.alpha -= 0.001
        model_dbow.min_alpha = model_dbow.alpha

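#Generate embeddings: infer a vector for each tagged doc and pair it with its label
#Note: infer_vector's `steps` argument was renamed `epochs` in Gensim 4
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, embeddings = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, embeddings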

#Generate tagged docs
def run_tags():
    global train_tagged, test_tagged
    train, test = train_test_split(df, test_size=0.6, shuffle=False, random_state=42)
    train_tagged = train.apply(
        lambda r: TaggedDocument(words=tokenize_text(r['comment']), tags=[r.labels]), axis=1)
    test_tagged = test.apply(
        lambda r: TaggedDocument(words=tokenize_text(r['comment']), tags=[r.labels]), axis=1)

#Train the Doc2Vec model and generate document embedding vectors for the models to use as predictors
def call_dbow():
    global y_train, X_train,y_test, X_test
    run_tags()
    train_model()
    y_train, X_train = vec_for_learning(model_dbow, train_tagged)
    y_test, X_test = vec_for_learning(model_dbow, test_tagged)
                   
#Run supervised classifier
def run_supervised_classifier(): 
    clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5000, tol=1e-6)
    X_train_sc= scaler.fit_transform(X_train)
    X_test_s=scaler.transform(X_test)
    clf.fit(X_train_sc, y_train)
    y_pred=clf.predict(X_test_s)
    print('')
    print('Supervised classifier metrics')
    print(classification_report(y_test, y_pred, target_names=target_names))   
                    
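#Run NMF model
def run_nnmf():
    nnmf = NMF(n_components=4, random_state=1, init='nndsvd', max_iter=500, alpha=0.0,l1_ratio=0.0)
    #NMF needs non-negative input, so scale the embeddings to [0, 1] using the training data
    X_train_sc= scaler.fit_transform(X_train)
    X_test_s=scaler.transform(X_test)
    #Test values outside the training range can scale below zero, so take absolute values
    X_test_sc=np.absolute(X_test_s)
    nnmf_train= nnmf.fit_transform(X_train_sc)
    nnmf_test = nnmf.transform(X_test_sc)
    #Predict the component with the largest weight. Component order is arbitrary, so a component
    #index is not guaranteed to line up with the label index of the same number
    nnmf_pred= nnmf_test.argmax(axis=1)
    print('')
    print("NMF classification metrics")
    print(classification_report(y_test, nnmf_pred, target_names=target_names))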
                  
#Main function to compare each of the models        
def compare_models():
    get_data()
    df['comment'] = df['comment'].apply(cleanText)
    #call_dbow() builds the tagged docs itself, so run_tags() is not called separately here
    call_dbow()
    run_supervised_classifier()
    run_nnmf()

In [3]:
#Call main function
compare_models()     

100%|██████████| 1537/1537 [00:00<00:00, 711471.72it/s]
100%|██████████| 1537/1537 [00:00<00:00, 299273.26it/s]
100%|██████████| 1537/1537 [00:00<00:00, 886868.24it/s]
100%|██████████| 1537/1537 [00:00<00:00, 984671.64it/s]
100%|██████████| 1537/1537 [00:00<00:00, 899866.73it/s]
100%|██████████| 1537/1537 [00:00<00:00, 1131983.36it/s]



Supervised classifier metrics
              precision    recall  f1-score   support

    Baseball       0.59      0.95      0.73       604
      Hockey       0.92      0.73      0.81       603
        Guns       0.94      0.65      0.77       525
 Middle East       0.92      0.79      0.85       574

   micro avg       0.78      0.78      0.78      2306
   macro avg       0.84      0.78      0.79      2306
weighted avg       0.84      0.78      0.79      2306


NMF classification metrics
              precision    recall  f1-score   support

    Baseball       0.40      0.87      0.55       604
      Hockey       0.03      0.02      0.02       603
        Guns       0.94      0.54      0.68       525
 Middle East       0.55      0.32      0.41       574

   micro avg       0.43      0.43      0.43      2306
   macro avg       0.48      0.44      0.42      2306
weighted avg       0.47      0.43      0.41      2306
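
One caveat when reading the NMF scores: `argmax` returns a component index, and NMF does not order its components to match the label encoding, so a reasonable clustering can still score poorly if its components are numbered differently from the labels. Whether that contributes to the low Hockey scores above is not something this run can tell us. The sketch below shows one way to check, by brute-forcing the component-to-label mapping that maximises agreement (feasible here because four components give only 24 permutations). The helper `best_component_mapping` and the toy data are hypothetical and are not taken from the notebook's run.

```python
from collections import Counter
from itertools import permutations


def best_component_mapping(y_true, components, n_classes):
    """Find the component -> label mapping that best agrees with y_true."""
    # Count how often each (component, true label) pair occurs
    counts = Counter(zip(components, y_true))
    best_perm, best_hits = None, -1
    for perm in permutations(range(n_classes)):
        # perm[c] is the label assigned to component c under this candidate
        hits = sum(counts[(c, perm[c])] for c in range(n_classes))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return dict(enumerate(best_perm)), best_hits / len(y_true)


# Toy data, made up for illustration: components 0 and 1 are swapped
# relative to the labels, and one label-3 document lands in component 2.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
components = [1, 1, 0, 0, 2, 2, 3, 2]

mapping, accuracy = best_component_mapping(y_true, components, n_classes=4)
remapped = [mapping[c] for c in components]
print(mapping, accuracy)
```

Because the mapping is chosen using true labels, it would need to be fitted on the training split and then applied to the test predictions before calling `classification_report`, otherwise the test scores would be flattered.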

