#### Preface

The code below explores how far one might lean on an unsupervised topic model to classify a document it hasn't seen. 


The two unsupervised NLP algorithms chosen are Latent Dirieclet Allocation(LDA) and a non-negative matrix factorization algorithm(NMF). For the sake of comparison I also ran a generic supervised classification algorithm on the same data. 
 
Four categories were chosen from the 20_newsgroup data set for training and testing. They were as follows(baseball,hockey,guns and lastly the Middle East). 

The baseline was aproximately .25 for the test data as each category was aproximately 25% of the entire test data set.

The unsupervised algorithms had the number of topics manually set to four. This is simplistic given the likely non-exclusivity of topics across categories in any document. Both models trained on just the text data from the four categories but had no idea what category each document was. 

They are then tested on unseen documents and asked to predict what category each document is. For purposes of simplisticly each category is regared as having one dominant topic. Each time the unsupervised model produces a probability for the four topics per document the document is assigned the topic/category with the highest probabily.  

This is asking alot from both unsupervised models but the NMF model did better than baseline but definitely worse than the supervised algorithm. LDA results were volatile which is not too surprsing given its design. I am a huge fan of Andrew Ng who co-designed the original LDA model but this is probably one of the least favorite NLP models to work with.  

Results are below. 

#### References:

Greene,O’Callaghan,Cunningham,(2014): 'How Many Topics? Stability Analysis for Topic Models'
https://arxiv.org/abs/1404.4606

Blei, Ng, Jorden(2003): Latent Dirichlet ALlocation,   
https://ai.stanford.edu/~ang/papers/nips01-lda.pdf

#### July 2019

In [1]:
#Dependencies 
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np 
import re, nltk, spacy, gensim
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline 

In [2]:
#Specify topic categories used in analysis
cats = ['rec.sport.baseball', 'rec.sport.hockey','talk.politics.guns', 'talk.politics.mideast']  

#Get Spacy 
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 5110014

#Regex to do some pre spacy cleaning, end line, emails and single quotatins removed
def clean_data(self):
    global cleaned
    cleaning = self
    cleaned = [re.sub('\s+', ' ',sent)for sent in cleaning]
    cleaned = [re.sub('\S*@\S*\s?', '', sent) for sent in cleaning]
    cleaned = [re.sub("\'", "", sent) for sent in cleaning]

#Tokenize      
def sent_to_words(self):
    for sentence in self:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))   
         
#Lemmatize with Spacy        
def lemmatization(texts):  
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ for token in doc]))
                          
    return texts_out     

#Fetch and process train data
def train_data(): 
    global tr_data,data_tr  
     
    data_tr = fetch_20newsgroups(subset='train', categories=cats,shuffle=True, random_state=1,
                             remove=('headers','footers', 'quotes'))     
    clean_data(data_tr.data)         
    data_words = list(sent_to_words(cleaned))
    tr_data = lemmatization(data_words)
    
#Fetch and process test data    
def test_data(): 
    global te_data ,data_te     
    data_te = fetch_20newsgroups(subset='test', categories=cats,shuffle=True, random_state=1,
                             remove=('headers','footers', 'quotes'))     
    clean_data(data_te.data)         
    data_words = list(sent_to_words(cleaned))
    te_data = lemmatization(data_words)

#Fetch data and do some cleaning    
def get_data():
    train_data()
    test_data()
    
#Get tfidf vectors    
def get_tfidf():    
    global tfidf_train,tfidf_test
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=3000,ngram_range=(1, 1),
                                   stop_words='english')
    tfidf_train = vectorizer.fit_transform(tr_data)
    tfidf_test = vectorizer.transform(te_data) 

#get term frequency vectorizor
def get_tfs():
    global tf_train,tf_test
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=3000,ngram_range=(1, 1),stop_words='english') 
    tf_train = tf_vectorizer.fit_transform(tr_data)
    tf_test = tf_vectorizer.transform(te_data)
    
#Vectorize data    
def vectorize_data():
    get_tfidf()
    get_tfs()

#Fetch clean and vectorized data
def read_prep_data():
    get_data()
    vectorize_data()

#Pipeline with grid search that optimizes for accuracy across both the transformer and the classifier. 
def run_supervised_classifier():
    parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'tfidf__max_features': (1000,2000,3000),
               'clf__alpha': (1e-2, 1e-3)}
    svm_clf = Pipeline([('tfidf', TfidfVectorizer()),('clf', SGDClassifier(loss='hinge',
         random_state=42,max_iter=500,tol=1e-3))])
    gs_clf = GridSearchCV(svm_clf, parameters, cv=5, iid=False, n_jobs=-1)
    gs_clf = gs_clf.fit(tr_data,data_tr.target)
    print("Best parameters from supervised grid search classifier :")
    print(gs_clf.best_params_)
    print('')
    print("Supervised classifier accuracy on test data :",gs_clf.score(te_data,data_te.target))
    
#Produces accuracy of unsupervised models
def unsup_acc(self): 
    global d
    pred= self.argmax(axis=1)
    pred.shape
    act= data_te.target
    acc_score= pred - act  
    corr=[] 
    for i in acc_score:
        if i ==0:
            corr.append(i)
    c= np.array(corr)
    d= (c.shape[0]/act.shape[0]) 
         
#Latent Dirichlet Allocation (LDA model)           
def run_lda():      
    lda_train = LatentDirichletAllocation(n_components=4, max_iter=1000,
                                learning_method='online',random_state=0) 
    lda_train.fit_transform(tf_train)
    lda_test= lda_train.transform(tf_test)
    unsup_acc(lda_test)
    print('')
    print('LDA accuracy :',d)
                
#Non-negative matrix factorization model  (NMF model)        
def run_nnmf():
    nnmf = NMF(n_components=4, random_state=1,
          beta_loss='kullback-leibler', solver='mu', init='nndsvda', max_iter=1000, alpha=.2,
          l1_ratio=0.5)
    nnmf_train= nnmf.fit_transform(tfidf_train)
    nnmf_test = nnmf.transform(tfidf_test) 
    unsup_acc(nnmf_test)
    print('')
    print('NMF accuracy :', d)
                      
            
#Baseline check on four labels in the unseen test data
def check_baseline(): 
    ave_label_pc=[]
    for i in range(4):
        ave_label_pc.append(list(data_te.target).count(i)/data_te.target.shape[0])   
        m= np.array(ave_label_pc).mean()
    print("Baseline check(average percentage of the four labels in test data) :", m)
    print('')   
        
#Run unsupervised models
def run_unsupervised_models():
    run_lda() 
    run_nnmf()  

#Main function compares supervised against the two unsupervised models(LDA, NMF)         
def model_comparison():
    read_prep_data()  
    check_baseline()   
    run_supervised_classifier()
    run_unsupervised_models()          

In [3]:
#Run main function to compare the models
model_comparison()

Baseline check(average percentage of the four labels in test data) : 0.25

Best parameters from supervised grid search classifier :
{'clf__alpha': 0.001, 'tfidf__max_features': 3000, 'tfidf__ngram_range': (1, 1), 'tfidf__use_idf': True}

Supervised classifier accuracy on test data : 0.8626302083333334

LDA accuracy : 0.044921875

NMF accuracy : 0.4290364583333333
