The question here is whether we can determine which patch a game was played on. This could be used to tell users that their playstyle is outdated, or to produce static stats by learning which features are decisive in making that determination.

In [13]:
%matplotlib inline
import numpy as np
from sklearn import svm, metrics
from sklearn import linear_model as lmod
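try:
    from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases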

In [14]:
patch = np.load('../datasets/np/champ_99_version_feature_10000.npy')
items = np.load('../datasets/np/champ_99_items_feature_10000.npy')
wins = np.load('../datasets/np/champ_99_stats.winner_feature_10000.npy').reshape((-1, 1))
features = np.hstack((items, wins))
features.shape

(4097, 268)

The `patch` information is the output. It is trivial to reformat it into the customary -1/1 labels, with patch 5.11 mapped to -1 and every other patch to 1.

In [15]:
p511 = patch == '5.11'
output = np.ones(patch.shape)
output[p511] = -1

In [16]:
FS = lmod.RandomizedLogisticRegression()
FS.fit(features, output)

RandomizedLogisticRegression(C=1, fit_intercept=True,
               memory=Memory(cachedir=None), n_jobs=1, n_resampling=200,
               normalize=True, pre_dispatch='3*n_jobs', random_state=None,
               sample_fraction=0.75, scaling=0.5, selection_threshold=0.25,
               tol=0.001, verbose=False)

In [17]:
FS.get_support().nonzero()

(array([  3,   9,  12,  17,  25,  39,  67,  68, 139, 144, 145, 146]),)

So on a single-champion basis (with an admittedly small dataset, 255 samples total) no features were selected. That is not so good.

**EDIT** After a second run with an order of magnitude more data, RLR extracts two features of import.
**EDIT** After fixing a dataset generation bug, RLR extracts 12 features of import!
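
`RandomizedLogisticRegression` was deprecated and later removed from scikit-learn, so the cell above will not run on recent releases. The following is a minimal sketch of the underlying idea (stability selection: refit an L1-penalised logistic regression on random subsamples with randomly rescaled columns, then keep the features that are picked often). It is not the library's implementation; the function name `stability_selection`, the synthetic data, and the choice of `C` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def stability_selection(X, y, n_resampling=200, sample_fraction=0.75,
                        scaling=0.5, C=1.0, threshold=0.25, seed=0):
    """Hypothetical helper: fraction of resamples in which each feature
    receives a non-zero L1-logistic coefficient."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # Draw a subsample of rows without replacement.
        rows = rng.choice(n_samples, size=n_sub, replace=False)
        # Randomly shrink about half of the columns by `scaling`.
        weights = np.where(rng.rand(n_features) < 0.5, 1.0, scaling)
        model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        model.fit(X[rows] * weights, y[rows])
        counts += (model.coef_.ravel() != 0)
    scores = counts / n_resampling
    return scores, scores >= threshold


# Synthetic stand-in for the item features: only columns 0 and 1 carry signal.
rng = np.random.RandomState(42)
X = rng.randn(400, 20)
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.randn(400) > 0, 1, -1)

scores, support = stability_selection(X, y, n_resampling=50, C=0.1)
print(support.nonzero())
```

On the real data, `support` would play the same role as `FS.get_support()`.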

In [18]:
M = svm.LinearSVC()
train_feat, test_feat, train_out, test_out = train_test_split(features, output, test_size=0.33)
M.fit(train_feat, train_out)
pred = M.predict(test_feat)
fpr, tpr, _ = metrics.roc_curve(test_out, pred)
auc = metrics.auc(fpr, tpr)
print(auc)

0.700511913276


Only slightly better than random. Let's see how things look using only the RLR features.
**EDIT** Much better after fixing a dataset generation bug.
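
One caveat about the AUC above: `roc_curve` is given the hard -1/1 predictions, so the resulting curve has only a single operating point. Passing the continuous output of `decision_function` instead lets the curve sweep over all thresholds. A minimal sketch on synthetic data (the `make_classification` dataset and all variable names are assumptions, not the champion data):

```python
import numpy as np
from sklearn import metrics, svm
from sklearn.datasets import make_classification

# Synthetic binary problem, relabelled to -1/1 like `output` in the notebook.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
y = 2 * y - 1
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

clf = svm.LinearSVC()
clf.fit(X_train, y_train)

# ROC from hard labels: at most one interior point on the curve.
fpr_hard, tpr_hard, _ = metrics.roc_curve(y_test, clf.predict(X_test))
auc_hard = metrics.auc(fpr_hard, tpr_hard)

# ROC from signed distances to the hyperplane: one point per threshold.
scores = clf.decision_function(X_test)
fpr_soft, tpr_soft, _ = metrics.roc_curve(y_test, scores)
auc_soft = metrics.auc(fpr_soft, tpr_soft)

print(len(fpr_hard), auc_hard)
print(len(fpr_soft), auc_soft)
```

The two numbers need not agree, which is worth keeping in mind when comparing the AUC values reported in this notebook with figures computed elsewhere.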

In [21]:
support = FS.get_support()
M = svm.LinearSVC()
M.fit(train_feat[:, support], train_out)
pred = M.predict(test_feat[:, support])
fpr, tpr, _ = metrics.roc_curve(test_out, pred)
auc = metrics.auc(fpr, tpr)
print(auc)
print(metrics.r2_score(test_out, pred))

0.70717855358
0.00185621474977


No significant improvement (other than in training time).

In [20]:
output[output == -1] = 0
train_feat, test_feat, train_out, test_out = train_test_split(features, output, test_size=0.33)
L = lmod.LogisticRegression()
L.fit(train_feat, train_out)
score = L.score(test_feat, test_out)
print(score)
pred = L.predict(test_feat)
fpr, tpr, _ = metrics.roc_curve(test_out, pred)
print(metrics.auc(fpr, tpr))
print(metrics.r2_score(test_out, pred))

0.758314855876
0.692803598201
-0.0528807025059
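
A note on reading `r2_score` here: for a two-class target with hard predictions, R² reduces to `1 - error_rate / (p * (1 - p))`, where `p` is the fraction of one class in the test set. It can therefore sit near zero, or go negative, even when accuracy is well above one half. The sketch below checks that identity on made-up labels (the class balance and flip rate are arbitrary assumptions):

```python
import numpy as np
from sklearn import metrics

rng = np.random.RandomState(0)

# Made-up 0/1 labels with an imbalanced class split.
y_true = (rng.rand(1000) < 0.35).astype(float)

# Made-up predictions: flip roughly a quarter of the labels.
flip = rng.rand(1000) < 0.24
y_pred = np.where(flip, 1 - y_true, y_true)

error_rate = np.mean(y_true != y_pred)
p = y_true.mean()

r2_closed_form = 1 - error_rate / (p * (1 - p))
r2_sklearn = metrics.r2_score(y_true, y_pred)

print(1 - error_rate, r2_closed_form, r2_sklearn)
```

The same relation holds for the -1/1 encoding used earlier, since both the squared errors and the label variance scale by the same factor of four.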
