### K-means

1. Load dataframe of Seattle listings and extract lemmatized host descriptions
2. Vectorize lemmatized host descriptions
3. Extract features with SVD
4. Cluster using kmeans
5. Assign each listing a cluster
6. Pickle dataframe

In [None]:
import pickle

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt
%matplotlib inline

#### Load Seattle listings dataframe and extract lemmatized host descriptions

In [None]:
with open('../data/processed/s_listings.pkl', 'rb') as picklefile:
    s_listings = pickle.load(picklefile)

In [None]:
host_lemmas = s_listings['host_lemmas']

#### Create TFIDF vector of lemmatized host descriptions
In an attempt to include only meaningful words:
* Minimum document frequency set to 10: given word must appear in at least 10 host descriptions
* Token pattern returns words with 2 or more letters
* Unigrams and bigrams (`ngram_range=(1,2)`)
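
To illustrate how the document-frequency cutoff and the token pattern prune the vocabulary, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the `toy_` names are assumptions for illustration only; `min_df` is lowered to 2 because the toy corpus has only three documents, and the default unigram setting is used to keep the output short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the Seattle host descriptions
toy_docs = [
    "i love travel and coffee",
    "we love hiking and travel",
    "a host who love coffee",
]

# min_df=2: keep only terms that appear in at least 2 documents
# token_pattern: keep only lowercase words of 2 or more letters
toy_tfidf = TfidfVectorizer(min_df=2, token_pattern=r"\b[a-z][a-z]+\b")
toy_tfidf.fit(toy_docs)

# Surviving vocabulary, sorted alphabetically
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)
```

Single-letter tokens such as "i" and "a" never match the pattern, and terms found in only one sentence ("hiking", "host") fall below the document-frequency cutoff.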

In [None]:
tfidf = TfidfVectorizer(ngram_range=(1, 2),
                        min_df=10,
                        token_pattern=r"\b[a-z][a-z]+\b")
#fit the vocabulary and build the tfidf matrix in one step
x = tfidf.fit_transform(host_lemmas)

In [None]:
#terms in tfidf vocabulary
#(scikit-learn < 1.0 only provides get_feature_names())
features = tfidf.get_feature_names_out()
print(len(features))

In [None]:
#10 terms with lowest idf (most common across host descriptions)
top = tfidf.idf_.argsort()[:10].tolist()
[(features[i], tfidf.idf_[i]) for i in top]

In [None]:
#10 terms with highest idf (rarest across host descriptions)
bottom = tfidf.idf_.argsort()[::-1].tolist()[:10]
[(features[i], tfidf.idf_[i]) for i in bottom]

#### Extract features with SVD

In [None]:
svd = TruncatedSVD(n_components=650,
                   random_state=16)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

In [None]:
x_svd = lsa.fit_transform(x)

In [None]:
sum(svd.explained_variance_ratio_)

Because it takes 650 components (out of my original 1063 features) to explain over 90% of the variance in my data, SVD offers little dimensionality reduction here. I'll cluster on the original TF-IDF features instead, which preserves interpretability when clustering with kmeans.

This also implies that my data does not have a strong structure that can be explained by a handful of components.
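
One way to check how many components a given variance threshold requires is to take the cumulative sum of `explained_variance_ratio_`. The sketch below runs on a synthetic low-rank matrix rather than the TF-IDF matrix; the data, the `toy_` names and the 0.90 threshold variable are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the tfidf matrix: rank-5 structure plus small noise
rng = np.random.RandomState(16)
toy_x = rng.rand(200, 5).dot(rng.rand(5, 50)) + 0.01 * rng.randn(200, 50)

toy_svd = TruncatedSVD(n_components=10, random_state=16)
toy_svd.fit(toy_x)

# Running total of the variance explained by the first 1, 2, ... components
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

threshold = 0.90
reached = cumulative >= threshold
# Smallest number of components reaching the threshold (None if never reached)
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print(n_needed)
```

On data with a strong low-dimensional structure, `n_needed` is small relative to the number of features; needing 650 of 1063 is the opposite situation.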

#### Cluster using kmeans

In [None]:
k=12
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=16)
kmeans.fit(x)
clusters = kmeans.labels_.tolist()

#### Interpretation
* Show top 5 words (features in each cluster centroid with highest TFIDF)
* Show host closest to centroid of cluster (either smallest pairwise cosine distance or largest pairwise cosine similarity between host and cluster centroid)
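
The two ways of finding the closest host are interchangeable because cosine distance is defined as 1 minus cosine similarity. A minimal sketch on random stand-in data (the `toy_` arrays are assumptions, not the actual TF-IDF matrix or centroids):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

rng = np.random.RandomState(16)
toy_points = rng.rand(20, 6)     # hypothetical stand-in for tfidf rows
toy_centroids = rng.rand(3, 6)   # hypothetical stand-in for cluster centers

similarity = cosine_similarity(toy_points, toy_centroids)
distance = pairwise_distances(toy_points, toy_centroids, metric='cosine')

# For each centroid (column), the row index of the closest point
closest_by_similarity = similarity.argmax(axis=0)
closest_by_distance = distance.argmin(axis=0)
print(closest_by_similarity, closest_by_distance)
```

Both lookups search over all points, so the selected row is the point nearest each centroid overall.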

In [None]:
#"coordinates" of cluster centers (tfidf vectors)
centroids = kmeans.cluster_centers_

#indexes of features in descending order by tfidf value
ordered_centroids = centroids.argsort()[:, ::-1]

#hosts closest to centroids by either:
#1. smallest pairwise cosine distance
#center_hosts = pd.DataFrame(pairwise_distances(x, centroids, metric='cosine')).idxmin().tolist()
#2. largest pairwise cosine similarity
center_hosts = pd.DataFrame(cosine_similarity(x, centroids)).idxmax().tolist()

In [None]:
#number of words to print
n = 5

for i in range(k):
    print('Cluster %d' % i)
    print('TOP %d WORDS:' % n)
    for index in ordered_centroids[i, :n]:
        print(features[index])
    print('REPRESENTATIVE HOST:')
    print(s_listings['abouts'].iloc[center_hosts[i]])
    print('')

#### Assign clusters to each listing

In [None]:
clusters = kmeans.labels_

In [None]:
s_listings['kmeans'] = clusters

In [None]:
s_listings['kmeans'].value_counts(sort=False)

#### Pickle dataframe

In [None]:
km = s_listings['kmeans']

In [None]:
with open('../data/interim/kmeans.pkl', 'wb') as picklefile:
    pickle.dump(km, picklefile)

#### Check inertia and silhouette score for various numbers of clusters (just for kicks)

In [None]:
inertias = {}
silhouettes = {}
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=16)
    #score on the same tfidf matrix that was clustered, not on the
    #distance-to-centroid matrix returned by fit_transform
    clusters = kmeans.fit_predict(x)
    inertias[k] = kmeans.inertia_
    silhouettes[k] = silhouette_score(x, clusters)

In [None]:
plt.scatter(list(inertias.keys()), list(inertias.values()));

In [None]:
plt.scatter(list(silhouettes.keys()), list(silhouettes.values()));

Low, erratic scores: not good, but not surprising given my data.