In [66]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [188]:
from lib import jsoncorpus, datastuff
import gensim
import traceback
import numpy as np
import matplotlib.pyplot as plt
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.cross_validation
import sklearn.ensemble
import sklearn.svm
from lib.scikitComponents import *
import itertools
import collections

# LDA

## Data preprocessing

In [107]:
# Load in the previously created corpus and dictionary of scraped sites
dictionary, corpus, meta_corpus, dmoz_data = jsoncorpus.load_or_create('docs/sites.jl')

Done.


In [162]:
# Build the set of all distinct category paths (each path is a tuple of topic names)
allcategories = set(tuple(t) for t in dmoz_data['dmoz_categories'])
# Build the set of all top-level topics
topcategories = set(topic[0] for topic in dmoz_data['dmoz_categories'])
# Pair each URL with its category path
meta = list(zip(dmoz_data['urls'], dmoz_data['dmoz_categories']))
# For each site, list its cumulative category prefixes up to max_depth,
# e.g. ['a', 'b'] -> ['a', 'a; b']
heirarchal_categories = lambda max_depth: [
    ['; '.join(topics[:ti+1]) for ti in range(min(len(topics), max_depth))]
    for topics in dmoz_data['dmoz_categories']]
# Top-level category of each site, in the same order as the corpus
top_categories = [x[0] for x in heirarchal_categories(1)]

## Define a classification pipeline for the corpus data

The Pipeline() object chains together the components defined in lib.scikitComponents, so that they can serve as the feature-extraction stage of a scikit-learn classification pipeline.

In [219]:
def lda_feature_pipeline(num_topics):
    return sklearn.pipeline.Pipeline([
            ('lda_model', LDAModel(dictionary, num_topics)),
            ('matrix_builder', TopicMatrixBuilder(num_topics))
            ])
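The source of lib.scikitComponents is not shown here, so the following is only a sketch of what a step like TopicMatrixBuilder plausibly has to do: gensim's LDA model returns each document's topics as a sparse list of (topic_id, weight) pairs, while scikit-learn classifiers expect a dense matrix with one row per document. The function name `topics_to_matrix` is hypothetical.

```python
import numpy as np

def topics_to_matrix(doc_topics, num_topics):
    """Turn sparse per-document topic lists into a dense 2-D array.

    doc_topics: iterable of [(topic_id, weight), ...] lists, one per document.
    Topics missing from a document's list are left at weight 0.
    """
    doc_topics = list(doc_topics)
    matrix = np.zeros((len(doc_topics), num_topics))
    for row, topics in enumerate(doc_topics):
        for topic_id, weight in topics:
            matrix[row, topic_id] = weight
    return matrix

# Two toy documents over three topics (made-up weights, for illustration only)
example = [[(0, 0.9), (2, 0.1)], [(1, 1.0)]]
dense = topics_to_matrix(example, num_topics=3)
print(dense)
```

Each row of the result is the fixed-length feature vector that the classifier at the end of the pipeline would receive.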

Define a classifier (an ensemble of extremely randomized decision trees), and chain the preprocessing step to the classifier.

In [220]:
classifier = sklearn.ensemble.ExtraTreesClassifier(random_state=0, n_estimators=100, oob_score=True, bootstrap=True, n_jobs=4)
#classifier = sklearn.svm.SVC()
clf = sklearn.pipeline.Pipeline([
    ('preprocess_meta', lda_feature_pipeline(len(topcategories))),
    ('classification', classifier)])


## Process the dmoz categories
In this case, we will just turn the top-level category of each site into an integer index.

In [221]:
# Map each top-level category name to an integer class label
encoder = sklearn.preprocessing.LabelEncoder().fit(top_categories)
classes = encoder.transform(top_categories)
print classes

[7 7 7 ..., 3 3 3]
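To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea: sort the distinct labels, then use each label's position as its integer code. The helper names (`fit_labels`, `encode`, `decode`) and the toy labels are illustrative assumptions, not part of the notebook's library.

```python
def fit_labels(labels):
    # Sorted distinct labels; a label's position in this list is its code
    return sorted(set(labels))

def encode(known_classes, labels):
    index = dict((label, i) for i, label in enumerate(known_classes))
    return [index[label] for label in labels]

def decode(known_classes, codes):
    # Inverse mapping: integer code back to the category name
    return [known_classes[code] for code in codes]

toy_labels = ['sports', 'arts', 'sports', 'news']
known_classes = fit_labels(toy_labels)
codes = encode(known_classes, toy_labels)
print(known_classes)
print(codes)
print(decode(known_classes, codes))
```

Because the codes are just positions in a sorted list, the integers carry no ordering meaning for the classifier; they only need to be consistent between training and prediction.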


## Run the classifier

We can use either the meta corpus or the full body corpus here, just by replacing a single argument.

The data is split into training and test sets, and the whole pipeline is then fit on the training set. Because the LDA step is part of the pipeline, the LDA model is generated ***only*** from the training set, not the test set.
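The point about fitting only on the training set can be illustrated with a toy stand-in for the LDA step. The class below is hypothetical and far simpler than LDA: it learns a vocabulary in fit() and counts words in transform(). Anything that appears only in the test documents is never seen while fitting, so it cannot influence the learned features.

```python
class VocabularyCounts(object):
    """Toy transformer: learns a vocabulary in fit(), counts words in transform()."""

    def fit(self, docs):
        # Only the documents passed here determine the feature space
        self.vocab_ = sorted(set(word for doc in docs for word in doc))
        return self

    def transform(self, docs):
        # Words outside the learned vocabulary are silently ignored
        return [[doc.count(word) for word in self.vocab_] for doc in docs]

train_docs = [['goal', 'match'], ['match', 'score']]
test_docs = [['goal', 'recipe']]

vectorizer = VocabularyCounts().fit(train_docs)    # fit on training data only
train_features = vectorizer.transform(train_docs)
test_features = vectorizer.transform(test_docs)    # 'recipe' is dropped
print(vectorizer.vocab_)
print(test_features)
```

Wrapping the feature extractor and classifier in one pipeline, as the notebook does, gives this behaviour automatically: calling fit on the pipeline fits every step on the training split, and scoring only calls transform on the test split.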

In [222]:
#sklearn.cross_validation.cross_val_score(clf, meta_corpus, classes)

In [223]:
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(np.array(meta_corpus), classes)

In [224]:
model = clf.fit(X_train, y_train)
print model.score(X_test, y_test)

0.213333333333


In [225]:
category_scores = collections.defaultdict(float)
category_counts = collections.defaultdict(float)
for real, pred in sorted(zip(encoder.inverse_transform(y_test), encoder.inverse_transform(model.predict(X_test)))):
    category_scores[real] += 1 if real == pred else 0
    category_counts[real] += 1 

Find the score for each category, i.e. the fraction of test sites in that category that were classified correctly.

Note that if the classifier assigned categories uniformly at *random*, the expected score for each category would be $\frac{1}{\textrm{num categories}}$. Instead, it is actually quite high in some cases, indicating a moderate amount of success (given how naive this approach is!)

In [251]:
print "Category       Proportion   Score"
print "---------------------------------"
for cat, proportion, score in [(k, category_counts[k]/len(y_test), category_scores[k] / category_counts[k]) for k in category_counts.keys()]:
    print "{:<15} {:<7.4}      {:<5.4}".format(cat, proportion, score)

Category       Proportion   Score
---------------------------------
arts            0.06         0.25 
regional        0.006667      0.0  
shopping        0.05         0.1  
reference       0.06667      0.225
business        0.08667      0.1731
kids and teens  0.02         0.25 
computers       0.1017       0.1311
recreation      0.06833      0.07317
sports          0.1517       0.6154
society         0.07833      0.02128
health          0.05         0.1333
home            0.05333      0.125
games           0.03667      0.1364
news            0.06167      0.1892
science         0.1083       0.1692
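The random-guess baseline mentioned above can be checked against the printed table. The sketch below copies the proportions and scores from that output, computes the uniform baseline of one over the number of categories, and lists which categories beat it. It also computes a second reference point that is an addition here, not something the notebook reports: the accuracy of always predicting the most common category.

```python
# (proportion of test set, per-category score), copied from the table above
results = {
    'arts': (0.06, 0.25), 'regional': (0.006667, 0.0),
    'shopping': (0.05, 0.1), 'reference': (0.06667, 0.225),
    'business': (0.08667, 0.1731), 'kids and teens': (0.02, 0.25),
    'computers': (0.1017, 0.1311), 'recreation': (0.06833, 0.07317),
    'sports': (0.1517, 0.6154), 'society': (0.07833, 0.02128),
    'health': (0.05, 0.1333), 'home': (0.05333, 0.125),
    'games': (0.03667, 0.1364), 'news': (0.06167, 0.1892),
    'science': (0.1083, 0.1692),
}

# Expected per-category score when guessing uniformly at random
uniform_baseline = 1.0 / len(results)

# Accuracy of always predicting the largest category (assumed extra baseline)
majority_label = max(results, key=lambda k: results[k][0])
majority_baseline = results[majority_label][0]

above_uniform = sorted(k for k, (_, score) in results.items()
                       if score > uniform_baseline)

print("uniform baseline:  {:.4f}".format(uniform_baseline))
print("majority baseline: {:.4f} ({})".format(majority_baseline, majority_label))
print("{} of {} categories beat the uniform baseline".format(
    len(above_uniform), len(results)))
```

Comparing against both baselines is a little stricter than the uniform one alone, since the test set is not evenly distributed across categories.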
