In [66]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [173]:
from lib import jsoncorpus, datastuff
import gensim
import traceback
import numpy as np
import matplotlib.pyplot as plt
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.cross_validation
import sklearn.ensemble
from lib.scikitComponents import *
import itertools
import collections

# LDA

## Data preprocessing

In [107]:
# Load in the previously created corpus and dictionary of scraped sites
dictionary, corpus, meta_corpus, dmoz_data = jsoncorpus.load_or_create('docs/sites.jl')

Done.


In [162]:
# Build the set of all distinct topic paths (tuples, so they are hashable)
allcategories = set(tuple(t) for t in dmoz_data['dmoz_categories'])
# Build the set of all distinct top-level topics
topcategories = set(topic[0] for topic in dmoz_data['dmoz_categories'])
# Link topics to URLs
meta = list(zip(dmoz_data['urls'], dmoz_data['dmoz_categories']))
# Represent each site's topics as cumulative paths, e.g. ['a', 'a; b'], up to max_depth levels
def heirarchal_categories(max_depth):
    return [['; '.join(topics[:ti + 1]) for ti in range(min(len(topics), max_depth))]
            for topics in dmoz_data['dmoz_categories']]
# Top categories
top_categories = [x[0] for x in heirarchal_categories(1)]
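
To make the cumulative-path representation concrete, here is a minimal sketch of the same transformation on a tiny in-memory list. The function name `hierarchical_paths` and the sample categories are hypothetical and chosen only for illustration; the real data comes from `dmoz_data['dmoz_categories']`.

```python
def hierarchical_paths(categories, max_depth):
    """Turn each topic list into cumulative '; '-joined paths, at most max_depth deep."""
    return [
        ['; '.join(topics[:i + 1]) for i in range(min(len(topics), max_depth))]
        for topics in categories
    ]

# Hypothetical sample data (not taken from the scraped corpus)
sample = [['sports', 'soccer', 'clubs'], ['news', 'weather']]

paths_depth_2 = hierarchical_paths(sample, 2)
# The top-level category is the first path when max_depth is 1
top_level = [p[0] for p in hierarchical_paths(sample, 1)]

print(paths_depth_2)
print(top_level)
```

With `max_depth=1` every site keeps a single path, which is why taking element `[0]` yields the top-level category.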

## Define a classification pipeline for the corpus data

The Pipeline() object chains together the components defined in lib.scikitComponents, so that they can be used as steps in a scikit-learn classification pipeline.

In [144]:
def lda_feature_pipeline(num_topics):
    return sklearn.pipeline.Pipeline([
            ('lda_model', LDAModel(dictionary, num_topics)),
            ('matrix_builder', TopicMatrixBuilder(num_topics))
            ])
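
The chaining idea can be sketched without scikit-learn. The two step classes and the `fit_transform_chain` helper below are hypothetical stand-ins, not the `LDAModel` or `TopicMatrixBuilder` components from lib.scikitComponents; they only illustrate the convention that each step exposes `fit` and `transform`, and that the output of one step feeds the next.

```python
class Scale(object):
    """Hypothetical step: divide every value by the maximum seen during fit."""
    def fit(self, X, y=None):
        self.max_ = float(max(X))
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class Threshold(object):
    """Hypothetical step: map values to 1 if they are at least 0.5, else 0."""
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [1 if x >= 0.5 else 0 for x in X]


def fit_transform_chain(steps, X):
    """Fit each (name, step) pair in order, passing transformed data along."""
    for name, step in steps:
        X = step.fit(X).transform(X)
    return X


steps = [('scale', Scale()), ('threshold', Threshold())]
result = fit_transform_chain(steps, [1, 2, 4])
print(result)
```

scikit-learn's Pipeline follows the same pattern, with the final step usually being an estimator rather than a transformer.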

Define a classifier (an ensemble of extremely randomized decision trees), and chain the preprocessing step to the classifier.

In [145]:
trees = sklearn.ensemble.ExtraTreesClassifier(random_state=0, n_estimators=100, oob_score=True, bootstrap=True, n_jobs=4)
clf = sklearn.pipeline.Pipeline([
    ('preprocess_meta', lda_feature_pipeline(len(topcategories))),
    ('classification', trees)])


## Process the dmoz categories
In this case, we will just turn the top category into an integer index.

In [185]:
encoder = sklearn.preprocessing.LabelEncoder().fit(top_categories)
classes = encoder.transform(top_categories)
print classes

[7 7 7 ..., 3 3 3]
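
As a rough illustration of what the label encoding does, the sketch below reproduces it in plain Python: the distinct labels are sorted and each label is replaced by its position in that sorted list. The category list is an assumption, taken from the fifteen names that appear in the per-category results at the end of this notebook; the real encoder is fit on `top_categories`.

```python
# Assumed category names (copied from the per-category results in this notebook)
categories = ['arts', 'business', 'computers', 'games', 'health', 'home',
              'kids and teens', 'news', 'recreation', 'reference', 'regional',
              'science', 'shopping', 'society', 'sports']

# Sorted distinct labels define the index of each class
class_names = sorted(set(categories))
to_index = {name: i for i, name in enumerate(class_names)}

# Hypothetical labels for three sites
labels = ['news', 'news', 'games']
encoded = [to_index[name] for name in labels]
decoded = [class_names[i] for i in encoded]

print(encoded)
print(decoded)
```

Under that assumption, 'news' maps to 7 and 'games' to 3, which is consistent with the values printed above.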


## Run the classifier

We can use either the meta corpus or the full body corpus here, just by replacing a single argument.

The data is split into training and test sets, and the pipeline is then fit to the training set. Because the LDA model is a step in the pipeline, it is generated ***only*** from the training set, not the test set.
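
The point about fitting only on the training set can be shown with a toy preprocessing step. The `Vocabulary` class and the sample documents below are hypothetical; they stand in for the LDA step, which likewise learns its state from whatever data `fit` receives.

```python
class Vocabulary(object):
    """Hypothetical preprocessing step that learns the set of known words."""
    def fit(self, docs):
        self.words_ = sorted(set(word for doc in docs for word in doc))
        return self

    def transform(self, docs):
        known = set(self.words_)
        # Words never seen during fit are dropped
        return [[word for word in doc if word in known] for doc in docs]


# Hypothetical tokenised documents
train_docs = [['goal', 'match'], ['stock', 'market']]
test_docs = [['match', 'election']]

vocab = Vocabulary().fit(train_docs)       # learned from training data only
test_features = vocab.transform(test_docs)  # test data is only transformed

print(vocab.words_)
print(test_features)
```

Because 'election' appears only in the test document, it never influences the fitted state, so the test score is not inflated by information leaking from the test set.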

In [None]:
#sklearn.cross_validation.cross_val_score(clf, meta_corpus, classes)

In [165]:
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(np.array(meta_corpus), classes)

In [186]:
model = clf.fit(X_train, y_train)
print model.score(X_test, y_test)

0.246666666667


In [180]:
category_scores = collections.defaultdict(float)
category_counts = collections.defaultdict(int)
for real, pred in sorted(zip(encoder.inverse_transform(y_test), encoder.inverse_transform(model.predict(X_test)))):
    category_scores[real] += 1 if real == pred else 0
    category_counts[real] += 1 

Find the score for each category. 

Note that if the classifier assigned *random* categories, the expected score would be $\frac{1}{\textrm{num categories}}$. In some cases the score is well above that, indicating a moderate amount of success (given how naive this approach is!)

In [183]:

{k: category_scores[k] / category_counts[k] for k in category_counts.keys()}

{u'arts': 0.1794871794871795,
 u'business': 0.2777777777777778,
 u'computers': 0.2553191489361702,
 u'games': 0.0,
 u'health': 0.1724137931034483,
 u'home': 0.15151515151515152,
 u'kids and teens': 0.0625,
 u'news': 0.1891891891891892,
 u'recreation': 0.05405405405405406,
 u'reference': 0.13333333333333333,
 u'regional': 0.0,
 u'science': 0.18518518518518517,
 u'shopping': 0.125,
 u'society': 0.0,
 u'sports': 0.5086206896551724}
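
The scores above can be compared with the chance level directly. This is a small sketch that copies the printed values (rounded to four decimals) and assumes the number of categories equals the fifteen keys shown; it is not part of the original analysis.

```python
# Per-category scores copied from the output above, rounded to four decimals
scores = {
    'arts': 0.1795, 'business': 0.2778, 'computers': 0.2553, 'games': 0.0,
    'health': 0.1724, 'home': 0.1515, 'kids and teens': 0.0625, 'news': 0.1892,
    'recreation': 0.0541, 'reference': 0.1333, 'regional': 0.0,
    'science': 0.1852, 'shopping': 0.125, 'society': 0.0, 'sports': 0.5086,
}

# Chance level for uniformly random guesses: 1 / number of categories
chance = 1.0 / len(scores)

above = sorted(name for name, score in scores.items() if score > chance)
below = sorted(name for name, score in scores.items() if score <= chance)

print('chance level: %.4f' % chance)
print('above chance:', above)
print('at or below chance:', below)
```

With a chance level of about 0.067, ten of the fifteen categories score above it, while games, kids and teens, recreation, regional and society do not.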