# Classifying Authors with Unsupervised Machine Learning
Raj Prasad
July 2019

[html version](https://daddyprasad5.github.io/predicting_authors_v4.html) - with all the code hidden away for a quick read

[jupyter notebook version](https://github.com/daddyprasad5/thinkful/blob/master/predicting_authors_v4.ipynb) - with all the code exposed in an interactive notebook

"Stylometry" is the science of determining the author of a work whose authorship is uncertain.  I've explored several ways to classify the authorship of american poems.  The dataset contains 10 poems each by 6 authors.   

I created some base features using the poems' text: 

* punctuation frequency
* part of speech frequency
* stop word frequency
* poem length
* ratio of unique words
* tf/idf glover vectors

Then I created several models of different types, trained on different feature vectors. A random-guessing machine would be correct 16% of the time.

|Model | Feature set | Correct-classification rate
|---|---|---
|kmeans | base features | 40%
|spectral clustering | base features | 40%
|logistic regression | base features | 67%
|logistic regression | kmeans & spectral classifications | 6%
|logistic regression | base features + kmeans & spectral classifiations | 67%
|logistic regression | 10 features chosen by scikit learn's recursive feature elimination | 63%

The best model is the logistic regression using the base features directly.  The unsupervised learning features were weaker as solo classifiers and added no value to the logistic regression when included as input features. 

In [1]:
#imports
import os
import pandas as pd
import spacy
import numpy as np
from collections import defaultdict, Counter
from sklearn.model_selection import train_test_split

In [2]:
#read poems, display some of the base data

def listdir_nohidden(path):
    for f in os.listdir(path):
        if not f.startswith('.'):
            yield f

directory = "poems/"
authors, titles, poems = ([] for i in range(3))
token_counts = defaultdict(Counter)  #used later in creating features
author_int = 0
nlp = spacy.load('en')     

for author in listdir_nohidden(directory):
    for title in listdir_nohidden(directory+author):
        with open(directory+author+"/"+title, 'r') as myfile:
            doc = nlp(myfile.read())
            authors.append(author)
            titles.append(title[:-4])
            poems.append(doc.text)
            for token in doc:
                if token.is_stop:
                    token_counts[token.pos][token.orth] += 1

# Convert the author strings into numbers, create some dicts for translations
unique_authors = set(authors)
author_index = range(len(unique_authors))
author_dict = dict(zip(unique_authors, author_index))
rev_author_dict = dict(zip(author_index, unique_authors))
                
poems_df = pd.DataFrame({"author": [author_dict[a] for a in authors], "title": titles, "poem": poems})
poems_df.head()


Unnamed: 0,author,title,poem
0,3,"SIENA MI FE', DISFECEMI MAREMMA","AMONG the pickled foetuses and bottled bones,\..."
1,3,the_age_demanded,VIDE POEM II.\n\nFOR this agility chance found...
2,3,ODE_POUR _ELECTION_DE_SON_SEPULCHRE,"FOR three years, out of key with his time,\nHe..."
3,3,medallion,LUINI in porcelain!\nThe grand piano\nUtters a...
4,3,1920 (MAUBERLEY) I,"I.\n\nTURNED from the ""eau-forte\n..."


In [3]:
#poems by author
print(rev_author_dict)
poems_df.author.value_counts()

{0: 'robert_frost', 1: 'ts_eliot', 2: 'ralph_waldo_emerson', 3: 'ezra_pound', 4: 'edgar_allen_poe', 5: 'walt_whitman'}


5    10
4    10
3    10
2    10
1    10
0    10
Name: author, dtype: int64

In [4]:
#set baseline for evaluating model prediction
baseline = int(poems_df.author.value_counts().max()
            / poems_df.author.value_counts().sum() * 100) 

print("guessing the highest probability class all the time would result in a {}% success rate".format(baseline))



guessing the highest probability class all the time would result in a 16% success rate


In [5]:
#create some features based on punctuation frequency, parts of speech frequency, 
from collections import Counter
import spacy,en_core_web_sm
nlp = en_core_web_sm.load()

#create features: count parts of speech occurrences

pos = ["adv", "conj", "noun","pron","propn","verb"]
periods, words, unique_words, commas, exclamations, semicolons, colons, advs, conjs, nouns, prons, propns, verbs = ([] for i in range(13))

    
for poem in poems_df.poem: 
    
    tokens = nlp(poem)
    
    #count punctuations
    comma = poem.count(',') 
    exclamation = poem.count('!')
    semicolon= poem.count(';')
    colon = poem.count(':')
    period = poem.count('.')
    
    #get percent of all words for each part of speech
    c = Counter(([token.pos_ for token in tokens]))
    adv = conj = noun = pron = propn = verb = 0
    sbase = sum(c.values())
    for el, cnt in c.items():
        val = (100.0* cnt)/sbase
        if el == "ADV": adv = val
        elif el == "CONJ": conj = val
        elif el == "NOUN": noun = val
        elif el == "PRON": pron = val
        elif el == "PROPN": propn = val
        elif el == "VERB": verb = val
        
    #append to feature lists
    words.append(sbase)
    unique_words.append(len(c) / sbase)
    advs.append(adv)
    conjs.append(conj)
    nouns.append(noun)
    prons.append(pron/sbase)
    propns.append(propn/sbase)
    verbs.append(verb/sbase)
    commas.append(comma/sbase)
    exclamations.append(exclamation/sbase)
    semicolons.append(semicolon/sbase)
    colons.append(colon/sbase)
    periods.append(period/sbase)

#add feature lists to the dataframe
X=pd.DataFrame()
X["adv_percent"] = advs
X["conj_percent"] = conjs
X["noun_percent"] = nouns
X["propn_percent"] = propns
X["verb_percent"] = verbs
X["words_per_poem"] = words
X["unique_words_rate"] = unique_words
X["commas_rate"] = commas
X["exclamations_rate"] = exclamations
X["semicolons_rate"] = semicolons
X["colons_rate"] = colons
X["periods_rate"] = periods
# for word in top_common_words:
#     X_bag_of_words_temp["top_word_{}_freq".format(word)] = tcw_freqs[word]

#create features sets to later create multiple features for regression
punctuation_fs = ["commas_per_poem", "exclamations_per_poem", "semicolons_per_poem", 
                  "colons_per_poem", "periods_per_poem"]
pos_fs = ["adv_percent", "conj_percent", "noun_percent", "propn_percent", "verb_percent"] 
words_fs = ["words_per_poem", "unique_words_per_poem"]


In [6]:
#use PCA and tf/idf to create another set of features

from gensim.corpora import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.matutils import sparse2full
from sklearn import linear_model, datasets, metrics
from sklearn.decomposition import PCA

def keep_token(t):
    return (t.is_alpha and 
            not (t.is_space or t.is_punct or 
                 t.is_stop or t.like_num))

def lemmatize_doc(doc):
    return [ t.lemma_ for t in doc if keep_token(t)]

docs = [lemmatize_doc(nlp(doc)) for doc in poems_df["poem"].values]
docs_dict = Dictionary(docs)
docs_dict.filter_extremes(no_below=3, no_above=0.2)
docs_dict.compactify()

#create tf/idf matrix
docs_corpus = [docs_dict.doc2bow(doc) for doc in docs]
model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
docs_tfidf  = model_tfidf[docs_corpus]
docs_vecs   = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])

#get the Glove embedding vector for each TF-IDF term.
tfidf_emb_vecs = np.vstack([nlp(docs_dict[i]).vector for i in range(len(docs_dict))])

#get a TF-IDF weighted Glove vector summary of each document
docs_emb = np.dot(docs_vecs, tfidf_emb_vecs) 

#create pca components
pca = PCA(5) #5 components explains ~90% of the variance
docs_pca = pca.fit_transform(docs_emb)

#start a features dataframe
X_tf_idf = pd.DataFrame()
pca_dim = ([[r[col] for r in docs_pca] for col in range(len(docs_pca[0]))])
for i, pca_dim in enumerate(docs_pca.transpose()): 
    X["tf_idf_PCA_{}".format(i)] = pca_dim


In [37]:
#define Xs, Y, and train-test splits for clustering

Y = poems_df.author
poems_train, poems_test, X_train, X_test, Y_train, Y_test = train_test_split(poems_df, X, Y, test_size=0.25, random_state=5)


In [39]:
#build a "burrows" word frequency featureset 

#get all the tokens for the training set and test set
tokens_counts = []
train_docs = []
test_docs = []
for poem in poems_train.poem:
    doc = nlp(poem)
    train_docs.append(doc)
    for token in doc:
        if token.is_stop:
            token_counts[token.pos][token.orth] += 1
for poem in poems_test.poem:
    doc = nlp(poem)
    test_docs.append(doc)
    for token in doc:
        if token.is_stop:
            token_counts[token.pos][token.orth] += 1
            
#get a list of the top 10 most common words in the training set only
common_words = dict()
for pos_id, counts in sorted(token_counts.items()):
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        w = doc.vocab.strings[orth_id].lower()
        if w in common_words: 
            common_words[w] = common_words[w] + count
        else: 
            common_words[w] = count

top_common_words = sorted(common_words, key=common_words.__getitem__, reverse=True)[:10]
print("top common words: ", top_common_words)

#build a feature for frequency of each top word, in the training and test sets only
X_bag_of_words_train = pd.DataFrame()
X_bag_of_words_test = pd.DataFrame()
tcw_freqs_train = dict()
tcw_freqs_test = dict()
for word in top_common_words: 
    tcw_freqs_train[word] = []
    tcw_freqs_test[word] = []
for doc in train_docs:
    sbase = len(doc)
    cw_counts = Counter(token.text.lower() for token in doc if token.text.lower() in top_common_words)
    for word in top_common_words:
        tcw_freqs_train[word].append(cw_counts[word] / sbase * 100)
for doc in test_docs:
    sbase = len(doc)
    cw_counts = Counter(token.text.lower() for token in doc if token.text.lower() in top_common_words)
    for word in top_common_words:
        tcw_freqs_test[word].append(cw_counts[word] / sbase * 100)

for word in top_common_words: 
    X_bag_of_words_train["top_word_{}_freq".format(word)] = tcw_freqs_train[word]
    X_bag_of_words_test["top_word_{}_freq".format(word)] = tcw_freqs_test[word]
X_bag_of_words_train.index = poems_train.index
X_bag_of_words_test.index = poems_test.index
        
#build a dictionary (author) of dictionaries (frequency distributions of the most common words) for training set only
tw_freq_cols = ["top_word_{}_freq".format(word) for word in top_common_words]
cols = tw_freq_cols.copy()
cols.append("author")
burrows = pd.concat([poems_train, X_bag_of_words_train], axis=1).loc[:, cols]
burrows = burrows.groupby(by="author").mean()

#using just the training set only, find the mean of means and the std of means for each feature (common word), then 
#find the zscore of each author for each feature and add to the training set and test set
bow_fs = []
for col in tw_freq_cols:
    u_of_us = burrows[col].values.mean()
    std_of_means = burrows[col].values.std()
    X_train["z_{}".format(col)] = (X_bag_of_words_train[col] - u_of_us) / std_of_means
    X_test["z_{}".format(col)] = (X_bag_of_words_test[col] - u_of_us) / std_of_means
    bow_fs.append("z_{}".format(col))
X_test.head()




top common words:  ['the', 'and', 'of', 'in', 'a', 'i', 'to', 'with', 'that', 'is']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,adv_percent,conj_percent,noun_percent,propn_percent,verb_percent,words_per_poem,unique_words_rate,commas_rate,exclamations_rate,semicolons_rate,colons_rate,periods_rate,tf_idf_PCA_0,tf_idf_PCA_1,tf_idf_PCA_2,tf_idf_PCA_3,tf_idf_PCA_4,z_top_word_the_freq,z_top_word_and_freq,z_top_word_of_freq,z_top_word_in_freq,z_top_word_a_freq,z_top_word_i_freq,z_top_word_to_freq,z_top_word_with_freq,z_top_word_that_freq,z_top_word_is_freq
31,7.746479,0,15.492958,0.034715,0.069431,142,0.077465,0.049296,0.007042,0.0,0.0,0.028169,-42.627598,10.355962,8.339205,5.58776,4.795969,-1.45275,-1.695428,-1.573685,-2.227953,0.280649,-1.624403,-2.707871,-6.330648,2.735418,1.536585
42,3.338898,0,19.699499,0.00864,0.018395,599,0.023372,0.051753,0.0,0.026711,0.001669,0.018364,60.046505,6.144707,-3.516366,6.666318,0.234833,1.522923,-0.374475,0.545024,-2.366904,-0.669635,0.377213,-0.339616,1.828311,0.765425,-1.680499
34,4.372624,0,17.30038,0.006144,0.019517,526,0.026616,0.096958,0.034221,0.007605,0.0,0.007605,28.609455,14.834477,-12.066852,0.018187,-1.930606,-0.155028,-0.453922,-3.051876,-1.288748,-1.107541,-1.054551,-0.010942,-3.233552,-0.142379,-0.377763
52,2.150538,0,16.129032,0.161868,0.046248,93,0.129032,0.129032,0.064516,0.043011,0.0,0.021505,-49.930099,5.630881,0.033307,5.004325,0.818642,-1.376453,0.889302,7.477792,-0.813327,-0.271212,-1.624403,2.376661,-6.330648,-2.100753,-1.680499
56,5.306122,0,18.77551,0.009996,0.033319,245,0.053061,0.15102,0.008163,0.028571,0.0,0.028571,4.633615,2.254403,2.037948,4.091328,2.725576,-0.775043,0.945481,-3.85552,-1.80056,-0.700108,-1.624403,-0.777824,10.292537,-2.100753,0.184097


In [None]:
#function to calculate the best predict-class to actual-class alignments for any clustering model
import itertools

def best_class_alignments(y_pred, y): 
    y_pred = np.array(y_pred)
    y = np.array(y)
    ct = pd.crosstab(y_pred, y)
    actual_classes = ct.columns
    pred_classes = ct.index
    pred_perms = list(itertools.permutations(pred_classes))
    scores = []
    for pred_perm in list(pred_perms):
        score = 0
        for i, val in enumerate(pred_perm):
            score += ct.iloc[i][val]
        scores.append(score)
    best_score = max(scores)
    best_perm = (pred_perms[scores.index(best_score)])
    class_dict = dict()
    for i, perm in enumerate(best_perm):
        class_dict[perm] = actual_classes[i]
    return (class_dict, best_score/len(y_pred))


In [42]:
#functions to fit clustering models

from sklearn.cluster import KMeans
from sklearn.cluster import SpectralClustering

def cluster_kmeans(fvs_train, y_train, k=6, max_iter=300, n_init=20, tol=.0001, verbose=True):
    #fits on the train data and predicts on all the data
    km = KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=max_iter,
                n_clusters=k, n_init=n_init, n_jobs=None, precompute_distances='auto',
                random_state=46, tol=0.0001, verbose=0)
    y_pred = km.fit_predict(fvs_train)
    if verbose:
        ct = pd.crosstab(y_pred, y_train, margins=True)
        ct.columns = cols
        print("k-means clustering:")
        print(ct)
    return km, y_pred


def cluster_spectral(fvs_train, y_train, k=6, gamma=1.0, n_init=100, n_neighbors=10, verbose=True):
    # Declare and fit the model.
    sc = SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
                            eigen_solver=None, eigen_tol=0.0, gamma=gamma, kernel_params=None,
                            n_clusters=k, n_init=n_init, n_jobs=None, n_neighbors=n_neighbors,
                            random_state=46)
    y_pred = sc.fit_predict(fvs_train)
    if verbose:
        ct = pd.crosstab(y_pred, y_train, margins=True)
        ct.columns = cols
        print("spectral clustering:")
        print(ct)
    return sc, y_pred

In [43]:
#try several different clustering methods and compare

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

auths = [rev_author_dict[index] for index in range(len(unique_authors))]
cols = auths.copy()
cols.append("total")

km, km_ypred = cluster_kmeans(X_train, Y_train, len(unique_authors))
print("best class alignment and percent accurate classification: \n", best_class_alignments(km_ypred, Y_train))

sc, sc_ypred = cluster_spectral(X_train, Y_train, len(unique_authors))
print("best class alignment and percent accurate classification: \n", best_class_alignments(sc_ypred, Y_train))

k-means clustering:
       robert_frost  ts_eliot  ralph_waldo_emerson  ezra_pound  edgar_allen_poe  walt_whitman  total
row_0                                                                                               
0                 6         1                    0           5                4             2     18
1                 0         0                    0           0                0             1      1
2                 0         1                    4           2                1             3     11
3                 0         1                    0           0                1             0      2
4                 0         0                    1           1                2             0      4
5                 2         3                    2           0                0             2      9
All               8         6                    7           8                8             8     45
best class alignment and percent accurate classification: 
 ({0: 0, 5: 



spectral clustering:
       robert_frost  ts_eliot  ralph_waldo_emerson  ezra_pound  edgar_allen_poe  walt_whitman  total
row_0                                                                                               
0                 0         0                    0           0                0             1      1
2                 0         1                    0           0                0             0      1
5                 8         5                    7           8                8             7     43
All               8         6                    7           8                8             8     45
best class alignment and percent accurate classification: 
 ({5: 0, 2: 1, 0: 2}, 0.2)


  n_init=n_init)


kmeans performs "better" as defined by author identification precision, giving an 20% lift in precision over the base success rate (guessing most populous class all the time), while spectral clustering performs only marginally better than base. 

In [44]:
#grid-search of k-means
from sklearn import metrics

def kmeans_grid_search(fvs_train, y_train, k, max_iters, n_inits, tols):
    scores=[]
    kms=[]
    clusterss=[]
    for max_iter in max_iters:
        for n_init in n_inits:
            for tol in tols:
                km, clusters = cluster_kmeans(fvs_train, y_train, k=k, max_iter=max_iter, n_init=n_init, tol=tol, verbose=False)
                #scores.append(metrics.silhouette_score(fvs_train, clusters))
                best_class, score = best_class_alignments(clusters, y_train)
                scores.append(score)
                kms.append(km)
                clusterss.append(clusters)
    return (kms, clusterss, scores)

kms, clusterss, scores = kmeans_grid_search(X_train, Y_train, len(unique_authors), 
                                           max_iters=[100,300,500,1000], 
                                           n_inits=[1,20,50,100], 
                                          tols=[1, 0, .1, .01, .001, .0001, .00001, .000001, .0000001])
            

In [45]:
#count of unique correct-classification ratios
print("counts of each unique correct classification ratio")
print(pd.Series(scores).value_counts())

counts of each unique correct classification ratio
0.377778    72
0.355556    36
0.311111    36
dtype: int64


The tuning did not improve kmeans much.

In [46]:
# take a look at the best tuned kmeans
km = kms[scores.index(max(scores))]
km_y_pred = km.fit_predict(X_train)
ct = pd.crosstab(km_y_pred, Y_train)
print("k-means clustering on training data:")
print(ct)
class_dict, best_score = best_class_alignments(pd.Series(km_y_pred), Y_train)
print("best score", best_score)
print("class dict", class_dict)

km = kms[scores.index(max(scores))]
km_y_test_pred = km.fit_predict(X_test)
ct = pd.crosstab(km_y_test_pred, Y_test)
print("k-means clustering on test data:")
print(ct)
class_dict, best_score = best_class_alignments(pd.Series(km_y_test_pred), Y_test)
print("best score", best_score)
print("class dict", class_dict)

k-means clustering on training data:
author  0  1  2  3  4  5
row_0                   
0       6  1  0  5  4  2
1       0  1  0  0  1  0
2       0  0  4  2  1  3
3       0  0  0  0  0  1
4       0  0  1  1  2  0
5       2  4  2  0  0  2
best score 0.37777777777777777
class dict {0: 0, 3: 1, 2: 2, 5: 3, 4: 4, 1: 5}
k-means clustering on test data:
author  0  1  2  3  4  5
row_0                   
0       1  0  1  1  1  1
1       0  0  1  0  0  0
2       0  2  0  0  0  0
3       0  1  0  0  0  0
4       0  0  1  0  1  0
5       1  1  0  1  0  1
best score 0.4
class dict {0: 0, 2: 1, 1: 2, 3: 3, 4: 4, 5: 5}


oddly, kmeans performs better on test data than on the training data. 

In [47]:
#tuning spectral clustering
from sklearn.cluster import SpectralClustering
from sklearn import metrics

#def cluster_spectral(fvs_train, y_train, k, gamma=1.0, n_init=100, n_neighbors=10, verbose=True)

def spectral_grid_search(fvs_train, y_train, k, gammas, n_inits, n_neighbors=10):
    scores=[]
    scs=[]
    clusterss=[]
    for gamma in gammas:
        for n_init in n_inits:
            for n_neighbor in n_neighbors:
                sc, clusters = cluster_spectral(fvs_train, y_train, k, gamma=gamma, n_init=n_init, n_neighbors=n_neighbor, verbose=False)
                #scores.append(metrics.silhouette_score(fvs_train, clusters))
                best_class, score = best_class_alignments(clusters, y_train)
                scores.append(score)
                scs.append(sc)
                clusterss.append(clusters)
    return (scs, clusterss, scores)

scs, clusterss, scores = spectral_grid_search(X_train, Y_train, len(unique_authors),
                                              gammas=[.01, .1, 9, 1, 10],
                                              n_inits=[1, 20, 50, 100],
                                              n_neighbors=[3, 7, 10, 15])

  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)


  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)


  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)
  n_init=n_init)




In [52]:
# take a look at the best tuned spectral clustering model
sc = scs[scores.index(max(scores))]
sc_y_pred = sc.fit_predict(X_train)
ct = pd.crosstab(sc_y_pred, Y_train)
print("spectral clustering on training data:")
print(ct)
class_dict, best_score = best_class_alignments(pd.Series(sc_y_pred), Y_train)
print("best score", best_score)
print("class dict", class_dict)

sc_y_test_pred = sc.fit_predict(X_test)
ct = pd.crosstab(sc_y_test_pred, Y_test)
print("spectral clustering on test data:")
print(ct)
class_dict, best_score = best_class_alignments(pd.Series(sc_y_test_pred), Y_test)
print("best score", best_score)
print("class dict", class_dict)





spectral clustering on training data:
author  0  1  2  3  4  5
row_0                   
0       0  0  0  0  0  1
1       4  2  1  1  5  3
2       2  4  6  5  3  3
3       1  0  0  0  0  1
4       0  0  0  1  0  0
5       1  0  0  1  0  0
best score 0.3111111111111111
class dict {1: 0, 4: 1, 2: 2, 5: 3, 3: 4, 0: 5}
spectral clustering on test data:
author  0  1  2  3  4  5
row_0                   
0       1  0  3  1  1  0
1       1  0  0  0  1  1
2       0  1  0  0  0  0
3       0  1  0  1  0  1
4       0  1  0  0  0  0
5       0  1  0  0  0  0




best score 0.4
class dict {2: 0, 0: 1, 1: 2, 3: 3, 4: 4, 5: 5}


tuning made a big difference for the spectral clustering model.oddly, spectral clustering performs better on test data than on the training data. 

In [53]:
#create a feature set made up of the kmeans and spectral clustering cluster assignments

X_cl_train = pd.DataFrame({"kmeans": km_y_pred, "spectral": sc_y_pred})
X_cl_test = pd.DataFrame({"kmeans": km_y_test_pred, "spectral": sc_y_test_pred})

X_cl_train.head()


Unnamed: 0,kmeans,spectral
0,2,2
1,4,4
2,5,2
3,2,3
4,5,1


In [54]:
###### fit on the base features (used for clustering) for comparison

#logistic_base = linear_model.LogisticRegression(solver='newton-cg', tol=30, multi_class='multinomial')
logistic_base = linear_model.LogisticRegressionCV(max_iter=1000, multi_class="multinomial")

logistic_base.fit(X_train, Y_train)
Y_pred_base = logistic_base.predict(X_test)

print("Logistic regression using underlying features directly:\n%s\n" % (
    metrics.classification_report(Y_pred_base, Y_test)))

#fit the logistic model on the clusters-as-features and on the underlying features and compare.
from sklearn import linear_model, datasets, metrics

#fit on the clusters-as-features
#logistic = linear_model.LogisticRegression(solver='newton-cg', tol=1, multi_class='multinomial')
logistic = linear_model.LogisticRegressionCV(max_iter=1000, multi_class="multinomial")

logistic.fit(X_cl_train, Y_train)
Y_pred = logistic.predict(X_cl_test)

print("Logistic regression using unsupervised features only:\n%s\n" % (
    metrics.classification_report(Y_test, Y_pred)))

#fit on all the features - base and the unsupervised features

X_combined_train = pd.concat([X_cl_train.reset_index(drop=True), X_train.reset_index(drop=True)],axis=1)
X_combined_test = pd.concat([X_cl_test.reset_index(drop=True), X_test.reset_index(drop=True)],axis=1)

#logistic_base = linear_model.LogisticRegression(solver='newton-cg', tol=30, multi_class='multinomial')
logistic_base = linear_model.LogisticRegressionCV(max_iter=1000, multi_class="multinomial")

logistic_base.fit(X_combined_train, Y_train)
Y_pred_combined = logistic_base.predict(X_combined_test)

print("Logistic regression using both base and unsupervised features:\n%s\n" % (
    metrics.classification_report(Y_pred_combined, Y_test)))




Logistic regression using underlying features directly:
              precision    recall  f1-score   support

           0       1.00      0.40      0.57         5
           1       0.50      1.00      0.67         2
           2       0.67      0.67      0.67         3
           3       0.50      0.33      0.40         3
           4       0.50      1.00      0.67         1
           5       0.00      0.00      0.00         1

   micro avg       0.53      0.53      0.53        15
   macro avg       0.53      0.57      0.50        15
weighted avg       0.67      0.53      0.54        15




  'precision', 'predicted', average, warn_for)


Logistic regression using unsupervised features only:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         3
           3       0.00      0.00      0.00         2
           4       0.25      0.50      0.33         2
           5       0.17      0.50      0.25         2

   micro avg       0.13      0.13      0.13        15
   macro avg       0.07      0.17      0.10        15
weighted avg       0.06      0.13      0.08        15






Logistic regression using both base and unsupervised features:
              precision    recall  f1-score   support

           0       1.00      0.40      0.57         5
           1       0.50      1.00      0.67         2
           2       0.67      0.67      0.67         3
           3       0.50      0.33      0.40         3
           4       0.50      1.00      0.67         1
           5       0.00      0.00      0.00         1

   micro avg       0.53      0.53      0.53        15
   macro avg       0.53      0.57      0.50        15
weighted avg       0.67      0.53      0.54        15






In [65]:
# Let's reduct the number of features - simplify the logistic regression model
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# create a base classifier used to evaluate a subset of attributes
model = linear_model.LogisticRegressionCV(max_iter=5000, multi_class="multinomial", cv=5)
#model.fit(X_combined_train, Y_train)
#print(model.scores_)
#Y_pred_train = model.predict(X_combined_train)
#Y_pred_test = model.predict(X_combined_test)
rfe = RFE(model, 10)
rfe = rfe.fit(X_combined_train, Y_train)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

rfe.estimator_.fit(X_combined_train, Y_train)
Y_pred_train = rfe.estimator_.predict(X_combined_train)
Y_pred_test = rfe.estimator_.predict(X_combined_test)

print("Logistic regression using a subset of the features:\n%s\n" % (
    metrics.classification_report(Y_train, Y_pred_train)))


print("Logistic regression using a subset of the features:\n%s\n" % (
    metrics.classification_report(Y_test, Y_pred_test)))





[ True False  True False False False False False False False False False
 False False False False  True  True False  True  True  True False  True
  True  True False False False]
[ 1 10  1 20  4 12 18 11 15 13 17 16 19 14  9  7  1  1  3  1  1  1  8  1
  1  1  6  5  2]
Logistic regression using a subset of the features:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       1.00      1.00      1.00         6
           2       1.00      1.00      1.00         7
           3       1.00      1.00      1.00         8
           4       1.00      1.00      1.00         8
           5       1.00      1.00      1.00         8

   micro avg       1.00      1.00      1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45


Logistic regression using a subset of the features:
              precision    recall  f1-score   support

           0       0.40      

In [66]:
#these 10 top features perform mearly as well as the full set...
included = pd.Series(rfe.ranking_ == True)
X_combined_train.columns[included.index[included]]
impt_features = pd.DataFrame({"cols": X_combined_train.columns, "include": rfe.ranking_ == True})
impt_features = impt_features[impt_features.include == True]
impt_features.cols


0                  kmeans
2             adv_percent
16           tf_idf_PCA_2
17           tf_idf_PCA_3
19    z_top_word_the_freq
20    z_top_word_and_freq
21     z_top_word_of_freq
23      z_top_word_a_freq
24      z_top_word_i_freq
25     z_top_word_to_freq
Name: cols, dtype: object