** Classifications **

For each app, the data contains a list of tf-idf values. tf-idf is short for term frequency–inverse document frequency: a word's weight grows with the number of times it appears in a document and shrinks with how common the word is across the corpus.
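
As a rough illustration of that definition, the sketch below computes tf-idf by hand on a three-document toy corpus. The corpus, the helper name `tf_idf`, and the unsmoothed `log(N / df)` weighting are assumptions made for this example; the preprocessing that produced this dataset may well have used a different variant.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each "document" is a tokenized app description.
corpus = [
    "fun puzzle game for kids".split(),
    "racing game with fast cars".split(),
    "track your daily budget".split(),
]

def tf_idf(term, doc, docs):
    """Unsmoothed tf-idf: term frequency times log inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# "game" occurs in two of the three documents and "puzzle" in only one,
# so "puzzle" gets the larger weight within the first description.
game_score = tf_idf("game", corpus[0], corpus)
puzzle_score = tf_idf("puzzle", corpus[0], corpus)
print(game_score, puzzle_score)
```

Both words appear once in the first description, so the difference in score comes entirely from the inverse document frequency term.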



In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


** Load the Pre Processed Data into sparse matrix for memory efficiency **

Total Rows = 20104
Total Cols = 13626
Memory for a dense matrix of doubles = 8 bytes * 20104 * 13626 = 2,191,496,832 bytes ≈ 2.19 GB (2.04 GiB)
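
The arithmetic above can be checked with a few lines of Python. The sketch also estimates what a CSR matrix would cost, but the 1% density and the 4-byte index width are assumptions chosen for illustration; the actual density of this dataset is not stated here.

```python
rows, cols = 20104, 13626
bytes_per_double = 8

# Dense storage: one double per cell.
dense_bytes = rows * cols * bytes_per_double
print("dense: {0:.2f} GB = {1:.2f} GiB".format(dense_bytes / 1e9, dense_bytes / 2.0**30))

def csr_bytes(nnz, n_rows, index_bytes=4):
    """Approximate CSR footprint: values + column indices + row pointers."""
    return nnz * (bytes_per_double + index_bytes) + (n_rows + 1) * index_bytes

# Hypothetical density of 1% (an assumption, not a measured figure).
nnz = int(0.01 * rows * cols)
print("CSR at 1% density: {0:.1f} MB".format(csr_bytes(nnz, rows) / 1e6))
```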

With only 8 GB of RAM, a dense matrix of this size would likely exhaust memory once a classification algorithm starts making working copies, so memory use needs to be reduced. The data is therefore first loaded into a list-of-lists (LIL) sparse matrix, which supports cheap incremental assignment, and then converted to a compressed sparse row (CSR) matrix for efficient arithmetic operations.
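
A minimal sketch of that two-step pattern on a tiny matrix is shown below. The shape and values are made up for the example; only `lil_matrix`, `tocsr`, and the CSR attributes `data`, `indices`, and `indptr` come from SciPy.

```python
from scipy.sparse import lil_matrix

# Small stand-in for the 20104 x 13626 matrix (shape and values are made up).
m = lil_matrix((4, 6), dtype='double')
m[0, 1] = 0.5      # LIL makes element-by-element assignment cheap
m[2, 4] = 1.5
m[3, 0] = 2.0

csr = m.tocsr()    # CSR is better suited to row slicing and matrix products

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
dense_bytes = csr.shape[0] * csr.shape[1] * 8
print(csr.nnz, sparse_bytes, dense_bytes)

# Row selection by index list, as used later to pick training and test rows.
selected = csr[[0, 2]].toarray()
print(selected)
```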

In [33]:
import csv
from scipy.sparse import lil_matrix

total_rows = 20104
total_cols = 13626

app_data = lil_matrix((total_rows, total_cols), dtype='double')
app_names_with_labels = []
app_name_to_label = dict({})
with open('./training/data_with_comma.csv','r') as dest_f:
    data_iter = csv.reader(dest_f, 
                           delimiter = ",")
    
    for rowidx, row in enumerate(data_iter):
        app_names_with_labels.append((row[0],row[1]))
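        # The remaining fields alternate between a column index and its tf-idf value.
        for colidx, val in zip(*[iter(row[2:])]*2):
            # csv yields strings, so convert the index before using it.
            app_data[rowidx, int(colidx)] = float(val)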
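
The loading loop relies on pairing up consecutive fields of each CSV row. The sketch below isolates that step on a single hypothetical row, assuming the layout is app name, label, then alternating column index and value; the row contents and the helper name `index_value_pairs` are invented for illustration.

```python
# Hypothetical row in the assumed layout: name, label, then (column, value) pairs.
row = ["com.example.app", "Games", "12", "0.31", "407", "0.08", "9001", "0.55"]

def index_value_pairs(fields):
    """Group a flat [idx, val, idx, val, ...] list into (int, float) pairs."""
    it = iter(fields)
    # zip pulls from the same iterator twice per step, so items pair up in order.
    return [(int(idx), float(val)) for idx, val in zip(it, it)]

pairs = index_value_pairs(row[2:])
print(pairs)   # [(12, 0.31), (407, 0.08), (9001, 0.55)]
```

Note that `zip` silently drops a trailing field that has no partner, so a malformed row with an odd number of fields would lose its last entry rather than raise an error.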

** Cosine Similarity **

If two apps belong to the same category, they most likely share a similar set of words. For example, apps in the games category will tend to contain the word "game".

We can use cosine similarity to match each app in the test set with the most similar app in the training set, and predict that app's label.
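
To make the idea concrete, here is a small sketch of nearest-match prediction by cosine similarity using plain NumPy. The vectors, labels, and the helper name `cosine` are hypothetical; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on the sparse matrix.

```python
import numpy as np

def cosine(a, b):
    """cos(theta) = a.b / (|a| |b|) for two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tf-idf rows over a four-word vocabulary.
train = np.array([
    [0.9, 0.1, 0.0, 0.0],   # labelled "Games"
    [0.0, 0.0, 0.7, 0.6],   # labelled "Finance"
])
train_labels = ["Games", "Finance"]
query = np.array([0.5, 0.2, 0.0, 0.1])

# Score the query against every training row and take the best match's label.
scores = [cosine(row, query) for row in train]
predicted = train_labels[int(np.argmax(scores))]
print(scores, predicted)
```

Because cosine similarity normalizes by vector length, a short description and a long one with the same word proportions score as identical.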

In [84]:

import numpy as np
import sklearn.metrics.pairwise as smp
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])

kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(csr_app_data):
    train_data = csr_app_data[train_idx]
    train_labels = app_labels[train_idx]
    test_data = csr_app_data[test_idx] 
    test_labels = app_labels[test_idx]
    # Compare every test row with every training row in one call, then
    # predict the label of the most similar training row.
    similarities = smp.cosine_similarity(test_data, train_data)
    predict_labels_arr = train_labels[similarities.argmax(axis=1)]

    (precision, recall, fscore, support) = precision_recall_fscore_support(test_labels, predict_labels_arr)

    print("Overall Precision = {0} Recall {1}".format(sum(precision)/precision.size, sum(recall)/recall.size))



Overall Precision = 0.547080129026 Recall 0.539810685965
Overall Precision = 0.522065276247 Recall 0.52112201216
Overall Precision = 0.556398032519 Recall 0.552126370994
Overall Precision = 0.535557038314 Recall 0.536068591557
Overall Precision = 0.528987153983 Recall 0.520720674724
Overall Precision = 0.541531890467 Recall 0.540921180314
Overall Precision = 0.541837802659 Recall 0.537664577136
Overall Precision = 0.541692573604 Recall 0.532820614929
Overall Precision = 0.536379481561 Recall 0.529833819689
Overall Precision = 0.555440131103 Recall 0.553121555571
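
The "Overall" figures printed above are the unweighted mean of the per-class scores, often called the macro average. The pure-Python sketch below shows that calculation on made-up labels; the helper name `macro_precision_recall` and the label data are assumptions, and the code is only intended to mirror what `sum(precision)/precision.size` computes.

```python
from collections import Counter

def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)
    # A class that is never predicted (or never present) contributes 0.
    precisions = [tp[c] / pred_count[c] if pred_count[c] else 0.0 for c in classes]
    recalls = [tp[c] / true_count[c] if true_count[c] else 0.0 for c in classes]
    return sum(precisions) / len(classes), sum(recalls) / len(classes)

# Hypothetical labels: "Tools" is never predicted, so it scores 0 on both metrics.
y_true = ["Games", "Games", "Games", "Finance", "Tools"]
y_pred = ["Games", "Games", "Finance", "Finance", "Games"]
precision, recall = macro_precision_recall(y_true, y_pred)
print(precision, recall)
```

Since every class counts equally, a rare category that is always misclassified pulls the macro average down as much as a common one would.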


** Using KNN Algorithm **

Using KNN with Euclidean distance, accuracy is highest with 1 neighbour and drops as more neighbours are added. The model may be overfitting.
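
The sketch below shows, under made-up data, how the number of neighbours changes a KNN prediction: with k=1 the single closest point decides, while a larger k lets the surrounding majority outvote it. The points, labels, and the helper name `knn_predict` are hypothetical and stand in for `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows closest to query (Euclidean)."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points: one "Tools" point sits among the "Games" points.
train_x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [5.0, 5.0]])
train_y = ["Games", "Games", "Games", "Tools", "Tools"]
query = np.array([0.45, 0.45])

for k in (1, 3):
    print(k, knn_predict(train_x, train_y, query, k))
```

With a single neighbour the classifier follows individual training points very closely, which is one reason k=1 is often associated with overfitting.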

In [34]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])
train_mask = np.random.uniform(0, 1, 20104) <= 0.95  # True for roughly 95% of rows (training)
train_data = csr_app_data[train_mask]
train_labels = app_labels[train_mask]
test_data = csr_app_data[~train_mask]
test_labels = app_labels[~train_mask]
print("Total test data {0}".format(len(test_labels)))
for n in range(1, 10, 2):
    clf = KNeighborsClassifier(n_neighbors=n) 
    clf.fit(train_data, train_labels)
    preds = clf.predict(test_data)
    accuracy = np.where(preds==test_labels, 1, 0).sum() / float(len(test_labels))
    print("Neighbors: %d, Accuracy: %f" % (n, accuracy))


Total test data 1057
Neighbors: 1, Accuracy: 0.417219
Neighbors: 3, Accuracy: 0.343425
Neighbors: 5, Accuracy: 0.336802
Neighbors: 7, Accuracy: 0.320719
Neighbors: 9, Accuracy: 0.324503


The results when predicting with 20,000 training rows and 104 test rows:
Neighbors: 1, Accuracy: 0.500000
Neighbors: 3, Accuracy: 0.432692
Neighbors: 5, Accuracy: 0.451923
Neighbors: 7, Accuracy: 0.461538
Neighbors: 9, Accuracy: 0.413462
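
A random mask such as `np.random.uniform(0, 1, n) <= 0.95` gives only approximately the requested fraction and a different split on every run. One possible way to get an exact, repeatable hold-out such as 20,000 training and 104 test rows is sketched below; the helper name `holdout_indices` and the seed value are assumptions, not part of the original experiment.

```python
import numpy as np

def holdout_indices(n_rows, n_test, seed=0):
    """Return (train_idx, test_idx) with exactly n_test held-out rows."""
    rng = np.random.RandomState(seed)   # fixed seed makes the split repeatable
    order = rng.permutation(n_rows)
    return np.sort(order[n_test:]), np.sort(order[:n_test])

train_idx, test_idx = holdout_indices(20104, 104)
print(len(train_idx), len(test_idx))   # 20000 104
```

The returned index arrays can be used directly to select rows from a CSR matrix and from the label array.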


** Multinomial Naive Bayes Classification ** 

In [57]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])
train_mask = np.random.uniform(0, 1, 20104) <= 0.99  # True for roughly 99% of rows (training)
train_data = csr_app_data[train_mask]
train_labels = app_labels[train_mask]
test_data = csr_app_data[~train_mask]
test_labels = app_labels[~train_mask]
print("Total test data {0}".format(len(test_labels)))
clf = MultinomialNB()
preds = clf.fit(train_data, train_labels).predict(test_data)
accuracy = np.where(preds==test_labels, 1, 0).sum() / float(len(test_labels))
print("Total accuracy {0}".format(accuracy))

Total test data 190
Total accuracy 0.631578947368


** Logistic Regression **

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])
train_mask = np.random.uniform(0, 1, 20104) <= 0.99  # True for roughly 99% of rows (training)
train_data = csr_app_data[train_mask]
train_labels = app_labels[train_mask]
test_data = csr_app_data[~train_mask]
test_labels = app_labels[~train_mask]
print("Total test data {0}".format(len(test_labels)))
clf = linear_model.LogisticRegression(C=1e5)
preds = clf.fit(train_data, train_labels).predict(test_data)
accuracy = np.where(preds==test_labels, 1, 0).sum() / float(len(test_labels))
print("Total accuracy {0}".format(accuracy))

** Support Vector Machines **

In [94]:
from sklearn import linear_model
import numpy as np
from sklearn import svm
csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])
train_mask = np.random.uniform(0, 1, 20104) <= 0.90  # True for roughly 90% of rows (training)
train_data = csr_app_data[train_mask]
train_labels = app_labels[train_mask]
test_data = csr_app_data[~train_mask]
test_labels = app_labels[~train_mask]
print("Total test data {0}".format(len(test_labels)))
clf = svm.LinearSVC()
preds = clf.fit(train_data, train_labels).predict(test_data)
accuracy = np.where(preds==test_labels, 1, 0).sum() / float(len(test_labels))
print("Total accuracy {0}".format(accuracy))

Total test data 1959
Total accuracy 0.661051556917


** Decision Tree **

In [95]:
from sklearn import linear_model
import numpy as np
from sklearn.tree import DecisionTreeClassifier
csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])
train_mask = np.random.uniform(0, 1, 20104) <= 0.99  # True for roughly 99% of rows (training)
train_data = csr_app_data[train_mask]
train_labels = app_labels[train_mask]
test_data = csr_app_data[~train_mask]
test_labels = app_labels[~train_mask]
print("Total test data {0}".format(len(test_labels)))
clf = DecisionTreeClassifier(random_state=0)
preds = clf.fit(train_data, train_labels).predict(test_data)
accuracy = np.where(preds==test_labels, 1, 0).sum() / float(len(test_labels))
print("Total accuracy {0}".format(accuracy))

Total test data 194
Total accuracy 0.417525773196


** Cosine Similarity with Truncated SVD **

In [89]:
from sklearn.decomposition import TruncatedSVD
import numpy as np
import sklearn.metrics.pairwise as smp
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])

kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(csr_app_data):
    train_data = csr_app_data[train_idx]
    train_labels = app_labels[train_idx]
    test_data = csr_app_data[test_idx] 
    test_labels = app_labels[test_idx]
    svd = TruncatedSVD(n_components=200, random_state=42)
    reduced_train_data = svd.fit_transform(train_data)
    # Project the whole test fold at once, then predict the label of the
    # most similar training row for each test row.
    reduced_test_data = svd.transform(test_data)
    similarities = smp.cosine_similarity(reduced_test_data, reduced_train_data)
    predict_labels_arr = train_labels[similarities.argmax(axis=1)]

    (precision, recall, fscore, support) = precision_recall_fscore_support(test_labels, predict_labels_arr)

    print("Overall Precision = {0} Recall {1}".format(sum(precision)/precision.size, sum(recall)/recall.size))


Overall Precision = 0.554671136325 Recall 0.552625701271
Overall Precision = 0.544333224976 Recall 0.545808353114
Overall Precision = 0.553490457444 Recall 0.55381985925
Overall Precision = 0.56021611717 Recall 0.557676388879
Overall Precision = 0.562168112744 Recall 0.558990193287
Overall Precision = 0.5513285504 Recall 0.557144044696
Overall Precision = 0.563780594994 Recall 0.565238296047
Overall Precision = 0.563819750508 Recall 0.562721201641
Overall Precision = 0.566468569446 Recall 0.571718906401
Overall Precision = 0.576093341728 Recall 0.582484041026
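
The cell above fits the SVD on the training fold only and then reuses the same projection for the test fold. The sketch below illustrates that idea with an exact SVD from NumPy on small random data. The shapes, data, and number of components are hypothetical, and scikit-learn's `TruncatedSVD` uses a different (randomized) solver by default, so this shows the concept rather than its implementation.

```python
import numpy as np

rng = np.random.RandomState(42)
train = rng.rand(50, 20)   # hypothetical dense stand-in for the training fold
test = rng.rand(5, 20)     # hypothetical test fold
k = 3                      # number of components to keep

# SVD of the training data only: train = U * S * Vt
u, s, vt = np.linalg.svd(train, full_matrices=False)
components = vt[:k]                       # top-k right singular vectors, shape (k, n_features)

reduced_train = train @ components.T      # equals u[:, :k] * s[:k]
reduced_test = test @ components.T        # same projection, learned from train only

print(reduced_train.shape, reduced_test.shape)   # (50, 3) (5, 3)
```

Fitting the projection on the training fold alone keeps information from the test fold out of the learned components.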


** Decision Tree with Truncated SVD **

In [90]:
from sklearn.decomposition import TruncatedSVD
import numpy as np
import sklearn.metrics.pairwise as smp
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier

csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])

kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(csr_app_data):
    train_data = csr_app_data[train_idx]
    train_labels = app_labels[train_idx]
    test_data = csr_app_data[test_idx] 
    test_labels = app_labels[test_idx]
    svd = TruncatedSVD(n_components=200, random_state=42)
    reduced_train_data = svd.fit_transform(train_data)
    predict_labels = []
    reduced_test_data = svd.transform(test_data)
    clf = DecisionTreeClassifier(random_state=42)
    preds = clf.fit(reduced_train_data, train_labels).predict(reduced_test_data)

    (precision, recall, fscore, support) = precision_recall_fscore_support(test_labels, preds)

    print("Overall Precision = {0} Recall {1}".format(sum(precision)/precision.size, sum(recall)/recall.size))


Overall Precision = 0.430767444421 Recall 0.419534708071
Overall Precision = 0.420894184762 Recall 0.419155387966
Overall Precision = 0.416830715602 Recall 0.421482987822
Overall Precision = 0.43380656711 Recall 0.424024198404
Overall Precision = 0.424615345529 Recall 0.422128800823
Overall Precision = 0.440756297092 Recall 0.439595674659
Overall Precision = 0.440507938935 Recall 0.433141657722
Overall Precision = 0.435774615425 Recall 0.429264895803
Overall Precision = 0.44809568877 Recall 0.446198930497
Overall Precision = 0.427699936332 Recall 0.433958775722


** Linear SVM with Truncated SVD **

In [92]:
from sklearn.decomposition import TruncatedSVD
import numpy as np
import sklearn.metrics.pairwise as smp
from sklearn.metrics import precision_recall_fscore_support
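from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm  # needed for svm.LinearSVC below

csr_app_data = app_data.tocsr()
app_labels = np.array([ val[1] for val in app_names_with_labels ])

kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(csr_app_data):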
    train_data = csr_app_data[train_idx]
    train_labels = app_labels[train_idx]
    test_data = csr_app_data[test_idx] 
    test_labels = app_labels[test_idx]
    svd = TruncatedSVD(n_components=500, random_state=42)
    reduced_train_data = svd.fit_transform(train_data)
    predict_labels = []
    reduced_test_data = svd.transform(test_data)
    clf = svm.LinearSVC()
    preds = clf.fit(reduced_train_data, train_labels).predict(reduced_test_data)
    accuracy = np.where(preds==test_labels, 1, 0).sum() / float(len(test_labels))
    print("Total accuracy {0}".format(accuracy))

    (precision, recall, fscore, support) = precision_recall_fscore_support(test_labels, preds)

    print("Overall Precision = {0} Recall {1}".format(sum(precision)/precision.size, sum(recall)/recall.size))

Total accuracy 0.636499254102
Overall Precision = 0.636397613381 Recall 0.634406484389
Total accuracy 0.633515663849
Overall Precision = 0.628182638846 Recall 0.629988379777
Total accuracy 0.621581302834
Overall Precision = 0.619555827851 Recall 0.620463390865
Total accuracy 0.633515663849
Overall Precision = 0.639366213906 Recall 0.635518105163
Total accuracy 0.628855721393
Overall Precision = 0.631239262478 Recall 0.623369289271
Total accuracy 0.642786069652
Overall Precision = 0.624283351653 Recall 0.635799867056
Total accuracy 0.642288557214
Overall Precision = 0.63849601999 Recall 0.637248098321
Total accuracy 0.647263681592
Overall Precision = 0.651363791723 Recall 0.636791599146
Total accuracy 0.635820895522
Overall Precision = 0.637401943493 Recall 0.637253355252
Total accuracy 0.63631840796
Overall Precision = 0.642025732986 Recall 0.640805784006
