## LS COE Clustering

The goal of this analysis is to group a set of items into categories using their descriptions.  Conceptually, we will put all items with "similar" descriptions into the same group.  

### Import Data

In [1]:
import pandas as pd
import numpy as np

raw_data = pd.read_csv('c:/users/aclark/Desktop/My Box Files/LifeScience_COE_DataCollection/Clustering/lscoeItems.csv')

### Vectorize the data

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

##################### Vectorize the inputs #############################

tfidf = TfidfVectorizer(analyzer='char_wb', lowercase=True, ngram_range=(2,2))
Vectorized_Descr = tfidf.fit_transform(raw_data['cleandescr'])
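
To make the vectorization step concrete, here is a minimal sketch of the same character-bigram TF-IDF settings applied to three made-up descriptions. The strings and the `toy_*` names are assumptions for illustration and are not taken from lscoeItems.csv.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions, for illustration only.
toy_descr = ["pipette tip 200ul", "pipet tips 200 ul", "nitrile gloves large"]

# Same settings as above: character bigrams taken inside word boundaries.
toy_tfidf = TfidfVectorizer(analyzer='char_wb', lowercase=True, ngram_range=(2, 2))
toy_matrix = toy_tfidf.fit_transform(toy_descr)

# TfidfVectorizer L2-normalizes each row by default, so the dot product
# of two rows is their cosine similarity.
toy_sim = (toy_matrix @ toy_matrix.T).toarray()
print(np.round(toy_sim, 2))
```

The two pipette descriptions share most of their bigrams, so they score closer to each other than either does to the gloves description. That is the sense of "similar" used in this analysis.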


### Attempt 1: Use Mini-Batch KMeans to cluster into 20 categories

While I don't expect this to work well, it should be fast and provide an answer.  It will serve as the baseline for the other tests and methods.  The input data is scrubbed using the following regexp (in Postgres):

regexp_replace(regexp_replace(regexp_replace(lscoe."Item Description", '[\t?&"\\/)(\*\]\[]','','g'), '\s+',' ','g'),'^\s+','','g')

See the process document in the clustering folder for details and the SQL Code.
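
For readers without access to the database, the scrubbing above can be approximated in Python. This is a sketch under the assumption that Python's `re` treats the bracket expression the same way Postgres does. `clean_description` is a hypothetical helper name, not part of the original process.

```python
import re

def clean_description(text):
    """Approximate the three nested regexp_replace calls shown above."""
    # 1. Drop tabs and the characters ? & " \ / ) ( * ] [
    text = re.sub(r'[\t?&"\\/)(*\]\[]', '', text)
    # 2. Collapse each run of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # 3. Strip leading whitespace only (trailing whitespace is kept, as in the SQL)
    return re.sub(r'^\s+', '', text)

print(clean_description('  Tube (50ml)   [sterile] & "PCR"/grade'))
```

Note that removing `/` joins the words on either side of it, so `"PCR"/grade` becomes `PCRgrade`.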

In [20]:
from sklearn.cluster import MiniBatchKMeans


####################

mbk = MiniBatchKMeans(init='k-means++', n_clusters=10, compute_labels=True)
km = mbk.fit(Vectorized_Descr)


################  Writeout results ##############################

#Convert to dataframe and pair up with the original data
labels = pd.DataFrame(km.predict(Vectorized_Descr))
labels.columns = ['label']
                      
outframe = pd.merge(raw_data,labels, right_index=True, left_index=True)


#write out the csv file
outframe.to_csv('c:/users/aclark/Desktop/My Box Files/LifeScience_COE_DataCollection/Clustering/mbk.csv')


### Attempt 2: Spectral Clustering using Cosine Similarity

This method doesn't work because of the size of the data.  Spectral clustering on a 108k x 108k matrix needs more memory than the 10 GB of RAM available.  I will try modifying the query to pull only the items that make up the top 80% of spend.
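
A rough back-of-the-envelope check of the memory problem, assuming the 108k x 108k matrix ends up stored densely as 64-bit floats (an assumption; the actual storage depends on the scikit-learn internals):

```python
n_items = 108000        # approximate number of descriptions quoted above
bytes_per_value = 8     # one float64 entry

# A dense n x n matrix needs n^2 entries.
dense_bytes = n_items ** 2 * bytes_per_value
print(dense_bytes / 1e9)   # size in GB
```

Under that assumption a single copy of the matrix is on the order of 90 GB, well beyond 10 GB, before counting any working copies made during the eigendecomposition.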

In [3]:
from sklearn.cluster import SpectralClustering

#Calculate the cosine similarity matrix (TF-IDF rows are L2-normalized, so the dot product is the cosine)
CosMatrix = (Vectorized_Descr * Vectorized_Descr.T)

#Unsupervised clustering on the precomputed similarity matrix
spec = SpectralClustering(n_clusters=20, affinity='precomputed')
Cluster = spec.fit(CosMatrix)

#Append the cluster labels to a copy of the original items
outframe = raw_data.copy()
outframe['Labels'] = Cluster.labels_

outframe.to_csv('C:/Users/aclark/Desktop/My Box Files/LifeScience_COE_DataCollection/Clustering/spectral.csv')


MemoryError: 
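
The plan above is to restrict the data to the top 80% of spend in the query itself. The same cut could also be sketched in pandas. The `spend` column name and the toy rows below are hypothetical, since the real column names are not shown in this notebook.

```python
import pandas as pd

# Hypothetical data: 'spend' is an assumed column name.
toy = pd.DataFrame({
    'cleandescr': ['item a', 'item b', 'item c', 'item d', 'item e'],
    'spend': [10, 60, 5, 15, 10],
})

# Rank items by spend, then keep those inside the first 80% of cumulative spend.
ranked = toy.sort_values('spend', ascending=False)
cum_share = ranked['spend'].cumsum() / ranked['spend'].sum()
top_items = ranked[cum_share <= 0.80]
print(top_items)
```

This keeps only rows whose cumulative share stays at or below the threshold; whether the item that crosses the 80% line should also be kept is a judgment call.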

### Attempt 3:  Linear SVM using Greg's Data From 2008

Using Greg's classifications from the 2008 GLS RFP, I will reformulate this as a supervised learning problem: train a linear SVM on Greg's labeled data, then use it to classify the COE data.

In [42]:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

#Import Greg's data, a 2-dimensional array with labels in the first column and descriptions in the second.
SVMData = pd.read_csv('c:/users/aclark/Desktop/My Box Files/LifeScience_COE_DataCollection/Clustering/TopItems.csv')

#Vectorize the entire population of data 
#tfidf = TfidfVectorizer(analyzer='word', lowercase=True, ngram_range=(1,1), decode_error='ignore', max_df=.3)
tfidf = CountVectorizer(analyzer='char_wb', binary=True, max_df=.2, decode_error='ignore', ngram_range=(4,4))

Vectorized_Descr = tfidf.fit_transform(SVMData['cleandescr'])


print('vectorized data')

#Split out Greg's data (the first 8137 rows) from the COE data
gX = Vectorized_Descr[0:8137]
gy = SVMData['label'][0:8137]

# Take one portion of the vectorized descriptions as a training set and the rest as a testing set.
X_train, X_test, Y_train, Y_test = train_test_split(gX, gy, test_size=.65, random_state=10)

print('Starting Classification')

svc = LinearSVC(C=10000, class_weight='balanced', dual=False, fit_intercept=False)
svc.fit(X_train,Y_train)

scores = svc.score(X_test,Y_test)
print(scores)

vectorized data
Starting Classification
0.760680529301


<6102x121501 sparse matrix of type '<type 'numpy.float64'>'
	with 120015 stored elements in Compressed Sparse Row format>
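
The train-then-label workflow can also be sketched end to end without the hard-coded row offset. This is a minimal illustration, assuming unlabeled COE rows carry a missing `label`. The toy rows and category names are invented, and the `Pipeline` fits the vectorizer on the labeled rows only, which differs from the cell above, where it is fitted on the whole file.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical rows; the last two stand in for unlabeled COE items.
toy = pd.DataFrame({
    'cleandescr': ['pipette tips 200ul', 'pipette tips 1000ul',
                   'nitrile gloves large', 'nitrile gloves small',
                   'pipette tips 10ul', 'nitrile gloves medium'],
    'label': ['plastics', 'plastics', 'gloves', 'gloves', None, None],
})

is_labeled = toy['label'].notnull()

# Character 4-gram presence features feeding a linear SVM.
model = make_pipeline(
    CountVectorizer(analyzer='char_wb', binary=True, ngram_range=(4, 4)),
    LinearSVC(),
)
model.fit(toy.loc[is_labeled, 'cleandescr'], toy.loc[is_labeled, 'label'])

# Assign a category to every unlabeled description.
predicted = model.predict(toy.loc[~is_labeled, 'cleandescr'])
print(list(predicted))
```

Selecting rows by whether a label is present avoids relying on the labeled rows always being the first 8137 in the file.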

### Attempt 4: Random Forest Classifier

In [19]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

#Import Greg's data, a 2-dimensional array with labels in the first column and descriptions in the second.
RFCData = pd.read_csv('c:/users/aclark/Desktop/My Box Files/LifeScience_COE_DataCollection/Clustering/TopItems.csv')

print('vectorizing data')
#Vectorize the entire population of data 
tfidf = TfidfVectorizer(analyzer='word', lowercase=True, ngram_range=(1,1), max_df=.2, decode_error='ignore')
Vectorized_Descr = tfidf.fit_transform(RFCData['cleandescr'])

#Split out Greg's data (the first 8137 rows) from the COE data
gX = Vectorized_Descr[0:8137]
gy = RFCData['label'][0:8137]

# Take one portion of the vectorized descriptions as a training set and the rest as a testing set.
X_train, X_test, Y_train, Y_test = train_test_split(gX, gy, test_size=.40, random_state=10)

print('Starting Classification')
RFC = RandomForestClassifier(n_estimators=25, n_jobs=1)
RFC.fit(X_train.toarray(),Y_train)

print('Scoring Results')
scores = RFC.score(X_test.toarray(),Y_test)
labels = pd.DataFrame(RFC.predict(Vectorized_Descr[8137:].toarray()))

print('training accuracy')
print(RFC.score(X_train.toarray(),Y_train))


print('Writing Results')
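#Reset the index so the COE rows line up with the 0-based index of the predicted labels
outframe = pd.merge(RFCData[8137:].reset_index(drop=True), labels, right_index=True, left_index=True)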

outframe.to_csv('c:/users/aclark/Desktop/My Box Files/LifeScience_COE_DataCollection/Clustering/RFCResults.csv')

print(scores)

vectorizing data
Starting Classification
Scoring Results
training accuracy
0.996108152397
Writing Results
0.720430107527
