### ABOUT THIS SCRIPT AND THE APPROACH

Given a CSV containing a list of products, where each product has a title and a description, use machine learning to cluster the products into separable and distinct categories. Category labels may not be known for all products, and manually labeled products might be misclassified into the wrong categories.

The following uses an unsupervised learning approach to define the categories. We start by choosing k clusters, where k is the number of labeled categories in the training data. This choice of k is only a starting point and would need to be adjusted.
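
One possible way to adjust k, instead of fixing it to the number of labeled categories, is to scan a few candidate values and compare silhouette scores. The following is a minimal sketch on a small made-up corpus. The `docs` list, the candidate range, and the helper name `best_k_by_silhouette` are illustrative assumptions and not part of the original script.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the product text.
docs = [
    "hp laser printer ink cartridge",
    "canon printer color ink scanner",
    "epson printer ink print",
    "dell lcd monitor 19 inch screen",
    "samsung monitor lcd 24 inch screen",
    "acer monitor screen hdmi",
    "netgear wireless router modem",
    "linksys router wireless dual band",
    "docsis cable modem router",
]


def best_k_by_silhouette(X, candidates, random_state=0):
    """Fit KMeans for each candidate k and return (best_k, scores)."""
    scores = {}
    for k in candidates:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(X, km.labels_)
    best_k = max(scores, key=scores.get)
    return best_k, scores


X = TfidfVectorizer().fit_transform(docs)
best_k, scores = best_k_by_silhouette(X, candidates=range(2, 6))
print(best_k, scores)
```

On real data the silhouette curve is often flat or noisy, so it is probably best treated as one hint alongside a manual look at the top terms per cluster.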

Then the text, including title and description, is vectorized: each product's combined text is converted into a vector of features derived from the words used.
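
To make the vectorization step concrete, here is a small sketch using `TfidfVectorizer` on three made-up product strings (the strings are illustrative and not taken from products.csv). Each row of the resulting matrix is one product and each column is one vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up product text (title and description already concatenated).
texts = [
    "HP printer with new ink",
    "Dell monitor 19 inch LCD",
    "HP laptop with Windows",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# One row per product, one column per distinct term kept by the vectorizer.
vocabulary = sorted(vectorizer.vocabulary_)
print(X.shape)
print(vocabulary)
```

Terms are lowercased and common English stop words such as "with" are dropped, so products that share a brand or product word end up with overlapping non-zero columns.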

Finally, we compare the k clusters found on the training data to the manually labeled categories to measure how well the machine-learned categories fit them. If run on the provided training set without modification, you'll notice that k-means can't reliably pick out the same sub-categories within the Apple products; however, it does a good job of matching the labeled categories for printers, monitors, and software.
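
As a rough illustration of how fit against the manual labels can be measured, the sketch below scores a hypothetical cluster assignment with homogeneity, completeness, and V-measure from `sklearn.metrics`. The label and cluster arrays are invented for the example. It shows that merging two labeled sub-categories into one cluster lowers homogeneity while completeness stays at 1.

```python
from sklearn import metrics

# Invented manual labels and cluster ids for six products.
manual = ["printer", "printer", "monitor", "monitor", "apple laptop", "apple ipad"]
clusters = [0, 0, 1, 1, 2, 2]  # both Apple sub-categories land in cluster 2

# Homogeneity: does each cluster contain only one manual category?
h = metrics.homogeneity_score(manual, clusters)
# Completeness: do all members of a manual category share one cluster?
c = metrics.completeness_score(manual, clusters)
# V-measure: harmonic mean of the two.
v = metrics.v_measure_score(manual, clusters)

print("Homogeneity: %0.3f" % h)
print("Completeness: %0.3f" % c)
print("V-measure: %0.3f" % v)
```

These scores ignore the actual cluster ids, so cluster 0 does not have to correspond to any particular category name.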

In [6]:
# EMAIL: alton@frontanalytics.com

# Adapted from:
# http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Lars Buitinck <L.J.Buitinck@uva.nl>
# License: BSD 3 clause

from __future__ import print_function

from sklearn.datasets.base import Bunch
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np
import pandas

In [7]:
####################################################################
# Data loader
def load_data(path_to_csv):
    """
    Loads the text data into memory using the Bunch dataset structure.
    Note that on larger corpora, memory safe CorpusReaders should be used.
    """
    
    # import the data
    df = pandas.read_csv(path_to_csv)
    
    # remove rows where there are nulls in the column of interest
    df = df[pandas.notnull(df['description'])]
    df = df[pandas.notnull(df['name'])]
    
    # pre-process the product's text
    #data = df['description'] # just the description
    
    # combine all words in both description and title
    data = df['description'].str.cat(df['name'], sep=' ')
    
    target = df['category']
    
    return Bunch(
        data=data,
        raw_df = df,
        target=target,
        target_names=frozenset(target),
    )

In [8]:
#############################################################################
# load data

path_to_csv = "products.csv"
dataset = load_data(path_to_csv)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()

4881 documents
12 categories



In [9]:
dataset.data.head()

0    Very clean Microsoft surface RT Tablet/Laptop ...
1    This is my third attempt at posting this ad so...
2        2 available. Both new. Dell lap top bag - NEW
3    I custom built my wife's PC many years ago, bu...
4    New WD My Passport Ultra 1TB external hard dri...
Name: description, dtype: object

In [10]:
###############################
# Set parameters for the following options
use_hashing = False
use_idf = False
n_features = 1000000 # max number of features
n_components = False
minibatch = False
verbose = True

In [11]:
labels = dataset.target
true_k = np.unique(labels).shape[0]

print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if use_hashing:
    if use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=n_features,
                                   stop_words='english', non_negative=True,
                                   norm=None, binary=False)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=n_features,
                                       stop_words='english',
                                       non_negative=False, norm='l2',
                                       binary=False)
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=use_idf)
X = vectorizer.fit_transform(dataset.data)

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

Extracting features from the training dataset using a sparse vectorizer
done in 0.583210s
n_samples: 4881, n_features: 8838



In [12]:
if n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    # n_components is the notebook-level setting defined above
    svd = TruncatedSVD(n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)

    print("done in %fs" % (time() - t0))

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()

In [13]:
###############################################################################
# Do the actual clustering

if minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=verbose)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

print()

Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=12, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=True)
Initialization complete
Iteration  0, inertia 7806.127
Iteration  1, inertia 4319.798
Iteration  2, inertia 4239.722
Iteration  3, inertia 4212.953
Iteration  4, inertia 4204.528
Iteration  5, inertia 4197.837
Iteration  6, inertia 4190.284
Iteration  7, inertia 4187.951
Iteration  8, inertia 4186.844
Iteration  9, inertia 4186.319
Iteration 10, inertia 4185.518
Iteration 11, inertia 4184.717
Iteration 12, inertia 4184.674
Iteration 13, inertia 4184.641
Iteration 14, inertia 4184.595
Iteration 15, inertia 4184.543
Iteration 16, inertia 4184.442
Iteration 17, inertia 4184.353
Iteration 18, inertia 4184.311
Iteration 19, inertia 4184.262
Iteration 20, inertia 4184.207
Iteration 21, inertia 4184.110
Iteration 22, inertia 4183.763
Iteration 23, inertia 4183.044
Iteration 24, inertia 4182.894
Iter

In [16]:
if not use_hashing:
    print("Top terms per cluster:")

    if n_components:
        original_space_centroids = svd.inverse_transform(km.cluster_centers_)
        order_centroids = original_space_centroids.argsort()[:, ::-1]
    else:
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]

    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()

Top terms per cluster:
Cluster 0: apple power mac text great supply adapter mouse computer keyboard
Cluster 1: core drive intel windows ram gb hard computer 10 text
Cluster 2: ipad case air new mini condition apple text 16gb gb
Cluster 3: hp new text ink windows printer computer condition black used
Cluster 4: laptop text great windows new condition hp used ram dell
Cluster 5: router wireless modem cable text netgear linksys dual link docsis
Cluster 6: macbook pro 13 new core condition air apple intel gb
Cluster 7: printer hp new ink color text great print scanner used
Cluster 8: new text used brand computer keyboard great 801 condition box
Cluster 9: monitor lcd dell great computer text 19 inch condition screen
Cluster 10: tablet samsung galaxy new tab text condition 10 case screen
Cluster 11: sale items zzzpawnshop new 801 com available layaway main pawn


# How well does our k-means match the pre-defined categories?

In [17]:
scored = km.predict(X)
pandas.crosstab(scored, dataset.target)

| row_0 | Android Tablets and Accessories | Apple Hardware and Accessories | Apple Laptops | Apple iPads and Accessories | Desktop Hardware and Accessories | Desktops | Laptop Hardware and Accessories | Laptops | Monitors | Palm | Printers | Software |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 57 | 75 | 14 | 43 | 9 | 11 | 6 | 8 | 2 | 1 | 1 |
| 1 | 6 | 5 | 99 | 0 | 97 | 301 | 28 | 193 | 1 | 0 | 0 | 10 |
| 2 | 5 | 7 | 1 | 286 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 1 | 0 | 19 | 44 | 7 | 51 | 15 | 0 | 112 | 0 |
| 4 | 1 | 8 | 11 | 3 | 10 | 3 | 122 | 317 | 0 | 0 | 0 | 1 |
| 5 | 0 | 7 | 0 | 0 | 187 | 3 | 46 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 23 | 263 | 0 | 1 | 0 | 9 | 29 | 0 | 0 | 0 | 2 |
| 7 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0 | 0 | 348 | 0 |
| 8 | 64 | 32 | 51 | 31 | 599 | 91 | 118 | 129 | 50 | 4 | 81 | 115 |
| 9 | 0 | 3 | 2 | 0 | 27 | 17 | 3 | 2 | 241 | 0 | 0 | 0 |
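
One way to read a cross-tabulation like this is to take, for each cluster, the category with the highest count and the share of the cluster it accounts for. The sketch below does that on a small invented table. The counts and the names `ct` and `summarize_clusters` are assumptions for illustration and are not values from the notebook.

```python
import pandas

# Invented cluster ids and manual categories for eight products.
scored = [0, 0, 0, 1, 1, 1, 1, 2]
category = ["Printers", "Printers", "Monitors",
            "Monitors", "Monitors", "Monitors", "Software",
            "Software"]

# Rows are clusters, columns are manual categories, cells are counts.
ct = pandas.crosstab(pandas.Series(scored, name="cluster"),
                     pandas.Series(category, name="category"))


def summarize_clusters(ct):
    """For each cluster (row), report the dominant category and its share."""
    return pandas.DataFrame({
        "top_category": ct.idxmax(axis=1),
        "share": ct.max(axis=1) / ct.sum(axis=1),
    })


summary = summarize_clusters(ct)
print(summary)
```

A cluster with a share close to 1 lines up with a single manual category, while a low share suggests a catch-all cluster or categories that the text features do not separate well.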
