#### Preface

The code below explores how far one might lean on an unsupervised topic model to classify a document it hasn't seen. 


The two unsupervised NLP algorithms chosen are Latent Dirichlet Allocation(LDA) and a non-negative matrix factorization algorithm(NMF). For the sake of comparison I also ran a generic supervised classification algorithm on the same data. 
 
Four categories were chosen from the 20_newsgroup data set for training and testing. They were as follows:

1. Baseball
2. Hockey
3. Guns 
4. Middle East 

The unsupervised algorithms had the number of topics manually set to four. This is simplistic given the likely non-exclusivity of topics across categories in any document. Both models trained on just the text data from the four categories but had no idea what category each document was. 

They are then tested on unseen documents and asked to predict what category each document is. For the sake of simplistically each category is regarded as having one dominant topic. Each time the unsupervised model produces a probability for the four topics per document the document is assigned the topic/category with the highest probability.  

This is asking a lot from both unsupervised models but the NMF model did better than baseline but definitely worse than the supervised algorithm. LDA results were volatile which is not too surprising given its design.  

Results are below. 

#### References:

Greene, O’Callaghan, Cunningham(2014): 'How Many Topics? Stability Analysis for Topic Models'
https://arxiv.org/abs/1404.4606

Blei, Ng, Jorden(2003): Latent Dirichlet Allocation,   
https://ai.stanford.edu/~ang/papers/nips01-lda.pdf

#### July 2019

In [1]:
#Dependencies 
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np 
import re, nltk, spacy, gensim
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline 
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

In [2]:
#Specify topic categories used in analysis
cats = ['rec.sport.baseball', 'rec.sport.hockey','talk.politics.guns', 'talk.politics.mideast']
target_names = ['Baseball', 'Hockey', 'Guns','Middle East']

#Get Spacy 
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 5110014

#Regex to do some pre spacy cleaning, end line, emails and single quotatins removed
def clean_data(self):
    global cleaned
    cleaning = self
    cleaned = [re.sub('\s+', ' ',sent)for sent in cleaning]
    cleaned = [re.sub('\S*@\S*\s?', '', sent) for sent in cleaning]
    cleaned = [re.sub("\'", "", sent) for sent in cleaning]

#Tokenize      
def sent_to_words(self):
    for sentence in self:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))   
         
#Lemmatize with Spacy        
def lemmatization(texts):  
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ for token in doc]))                          
    return texts_out     

#Fetch and process train data
def train_data(): 
    global tr_data,data_tr       
    data_tr = fetch_20newsgroups(subset='train', categories=cats,shuffle=True, random_state=1,
                             remove=('headers','footers', 'quotes'))     
    clean_data(data_tr.data)         
    data_words = list(sent_to_words(cleaned))
    tr_data = lemmatization(data_words)
    
#Fetch and process test data    
def test_data(): 
    global te_data ,data_te     
    data_te = fetch_20newsgroups(subset='test', categories=cats,shuffle=True, random_state=1,
                             remove=('headers','footers', 'quotes'))     
    clean_data(data_te.data)         
    data_words = list(sent_to_words(cleaned))
    te_data = lemmatization(data_words)

#Fetch data and do some cleaning    
def get_data():
    train_data()
    test_data()
    
#Get tfidf vectors    
def get_tfidf():    
    global tfidf_train,tfidf_test
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=3000,ngram_range=(1, 1),
                                   stop_words='english')
    tfidf_train = vectorizer.fit_transform(tr_data)
    tfidf_test = vectorizer.transform(te_data) 

#get term frequency vectorizor
def get_tfs():
    global tf_train,tf_test
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=3000,ngram_range=(1, 1),stop_words='english') 
    tf_train = tf_vectorizer.fit_transform(tr_data)
    tf_test = tf_vectorizer.transform(te_data)
    
#Vectorize data    
def vectorize_data():
    get_tfidf()
    get_tfs()

#Fetch clean and vectorized data
def read_prep_data():
    get_data()
    vectorize_data()

#Supervised classifier 
def run_supervised_classifier():
    parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'tfidf__max_features': (1000,2000,3000),
               'clf__alpha': (1e-2, 1e-3)}
    svm_clf = Pipeline([('tfidf', TfidfVectorizer()),('clf', SGDClassifier(loss='hinge',
         random_state=42,max_iter=500,tol=1e-3))])
    gs_clf = GridSearchCV(svm_clf, parameters, cv=5, iid=False, n_jobs=-1)
    gs_clf = gs_clf.fit(tr_data,data_tr.target)
    y_pred= gs_clf.predict(te_data)       
    print('')
    print("Supervised classification metrics")
    print(classification_report(data_te.target, y_pred, target_names=target_names))
     
    
#Latent Dirichlet Allocation (LDA model)           
def run_lda():      
    lda_train = LatentDirichletAllocation(n_components=4, max_iter=100,
                                learning_method='online',random_state=0) 
    lda_train.fit_transform(tf_train)
    lda_test= lda_train.transform(tf_test)
    lda_pred= lda_test.argmax(axis=1)
    print('')
    print("LDA classification metrics")        
    print(classification_report(data_te.target, lda_pred, target_names=target_names))
      
                    
#Non-negative matrix factorization model  (NMF model)        
def run_nnmf():
    nnmf = NMF(n_components=4, random_state=1,
          beta_loss='kullback-leibler', solver='mu', init='nndsvda', max_iter=100, alpha=0.0,
          l1_ratio=0.0)
    nnmf_train= nnmf.fit_transform(tfidf_train)
    nnmf_test = nnmf.transform(tfidf_test)
    nnmf_pred= nnmf_test.argmax(axis=1)    
    print('')
    print("NMF classification metrics")
    print(classification_report(data_te.target, nnmf_pred, target_names=target_names))
                  
            
#Baseline check on four labels in the unseen test data
def check_baseline(): 
    ave_label_pc=[]
    for i in range(4):
        ave_label_pc.append(list(data_te.target).count(i)/data_te.target.shape[0])   
        m= np.array(ave_label_pc).mean()
    print("Baseline check(average percentage of the four labels in test data) :", m)
    print('')   
        
#Run unsupervised models
def run_unsupervised_models():
    run_lda() 
    run_nnmf()  

#Main function compares supervised against the two unsupervised models(LDA, NMF)         
def model_comparison():
    read_prep_data()  
    check_baseline()   
    run_supervised_classifier()
    run_unsupervised_models()          

In [3]:
#Run main function to compare the models
model_comparison()

Baseline check(average percentage of the four labels in test data) : 0.25


Supervised classification metrics
              precision    recall  f1-score   support

    Baseball       0.82      0.87      0.84       397
      Hockey       0.91      0.89      0.90       399
        Guns       0.85      0.85      0.85       364
 Middle East       0.88      0.85      0.86       376

   micro avg       0.86      0.86      0.86      1536
   macro avg       0.86      0.86      0.86      1536
weighted avg       0.86      0.86      0.86      1536


LDA classification metrics
              precision    recall  f1-score   support

    Baseball       0.15      0.31      0.20       397
      Hockey       0.47      0.61      0.53       399
        Guns       0.36      0.14      0.20       364
 Middle East       0.02      0.00      0.00       376

   micro avg       0.27      0.27      0.27      1536
   macro avg       0.25      0.26      0.23      1536
weighted avg       0.25      0.27      0.24    