# KMeans clustering Medium Articles

Read in your CSV with the text column, if you haven't already.

In [1]:
import pandas as pd
%matplotlib inline
from dirtyclean import clean
import numpy as np
import glob

In [3]:
df = pd.read_csv("medium_2015.csv", encoding = "ISO-8859-1")

In [4]:
df.head()

Unnamed: 0,link,min_read,filename,author,publications,description,date,followers,claps,people_clapped,high_text,images,body,tag
0,https://www.google.com/url?q=https://m.signalv...,2,PRESS RELEASE: BASECAMP VALUATION TOPS $100 BI...,Jason Fried,Signal v. Noise,Founder & CEO at Basecamp. Co-author of Gettin...,1-Dec-15,205000,6222,5328,In order to determine the valuation of compani...,Yes,"Basecamp is now a $100 billion dollar company,...",Startup
1,https://medium.com/@damjancvetkovdimitrov/thes...,5,These photos are why IÕm trapped in Tokyo fore...,Damjan Cvetkov-Dimitrov,,I never anticipate ants.. professionally. I al...,25-Nov-15,2200,6340,5607,,Yes,"Tokyo, holy excrements from small undefined cr...",Sci-Fy
2,https://medium.com/the-year-of-the-looking-gla...,2,Average Manager vs. Great Manager,Julie Zhuo,The Year of the Looking Glass,Product design VP,11-Aug-15,179000,8700,7034,,Yes,Sketches,Management
3,https://medium.com/interactive-mind/mobile-201...,7,Mobile:2015 UI/UX Trends,Onur Oral,Interactive Mind,Designer of things.,31-Jul-15,2300,4814,4766,Apps which have well-done micro-interactions c...,Yes,"Whether on an app screen, a web browser, or a ...",
4,https://medium.com/firm-narrative/want-a-bette...,5,Want a Better Pitch? Watch This.,Andy Raskin,,Helping leaders tell strategic stories.,,20000,7447,6226,"ÒIt lacksÊoomph,Ó she said. ÒThe information i...",Yes,"Three weeks ago, the CMO of a San Francisco st...",Startup
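The stray `Õ`, `Ò`, `Ó` and `Ê` characters in the output above look like curly quotes and non-breaking spaces that were decoded with the wrong codec. One plausible explanation (an assumption, not something the CSV itself confirms) is that the file was saved as Mac OS Roman but read as ISO-8859-1. The sketch below round-trips two strings from the table to check that guess, using only Python's built-in codecs:

```python
# Strings copied from the `filename` and `high_text` columns above
garbled_title = "IÕm trapped in Tokyo"
garbled_quote = "ÒIt lacksÊoomph,Ó she said."

def repair(text):
    """Undo an ISO-8859-1 decode and re-decode the bytes as Mac OS Roman."""
    return text.encode("latin-1").decode("mac_roman")

repaired_title = repair(garbled_title)
repaired_quote = repair(garbled_quote)

print(repaired_title)
print(repaired_quote)
```

If the repaired strings read correctly, it may be worth trying `encoding="mac_roman"` in `pd.read_csv` instead of `"ISO-8859-1"`, so that tokens like `ówhat` do not end up in the vocabulary later on.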


In [5]:
df.shape

(15, 14)

In [6]:
df.dtypes

link              object
min_read           int64
filename          object
author            object
publications      object
description       object
date              object
followers          int64
claps              int64
people_clapped     int64
high_text         object
images            object
body              object
tag               object
dtype: object

In [7]:
df["body"] = df["body"].apply(clean)

In [8]:
df.head()

Unnamed: 0,link,min_read,filename,author,publications,description,date,followers,claps,people_clapped,high_text,images,body,tag
0,https://www.google.com/url?q=https://m.signalv...,2,PRESS RELEASE: BASECAMP VALUATION TOPS $100 BI...,Jason Fried,Signal v. Noise,Founder & CEO at Basecamp. Co-author of Gettin...,1-Dec-15,205000,6222,5328,In order to determine the valuation of compani...,Yes,Basecamp is now a billion dollar company accor...,Startup
1,https://medium.com/@damjancvetkovdimitrov/thes...,5,These photos are why IÕm trapped in Tokyo fore...,Damjan Cvetkov-Dimitrov,,I never anticipate ants.. professionally. I al...,25-Nov-15,2200,6340,5607,,Yes,Tokyo holy excrements from small undefined cre...,Sci-Fy
2,https://medium.com/the-year-of-the-looking-gla...,2,Average Manager vs. Great Manager,Julie Zhuo,The Year of the Looking Glass,Product design VP,11-Aug-15,179000,8700,7034,,Yes,Sketches,Management
3,https://medium.com/interactive-mind/mobile-201...,7,Mobile:2015 UI/UX Trends,Onur Oral,Interactive Mind,Designer of things.,31-Jul-15,2300,4814,4766,Apps which have well-done micro-interactions c...,Yes,Whether on an app screen a web browser or a we...,
4,https://medium.com/firm-narrative/want-a-bette...,5,Want a Better Pitch? Watch This.,Andy Raskin,,Helping leaders tell strategic stories.,,20000,7447,6226,"ÒIt lacksÊoomph,Ó she said. ÒThe information i...",Yes,Three weeks ago the CMO of a San Francisco sta...,Startup


## Vectorize your documents

What are the options when creating a `TfidfVectorizer`?



Let's think about:
* **ngram_range**: Do we just want single words? Or more? `(1, 2)` is one- and two-word phrases, etc.
* **max_features**: Can it make things faster? An integer, `1` and up; only that many of the most frequent terms are kept
* **max_df**: Should we ignore words that show up too often? `0.0`-`1.0` for a proportion of documents, OR an integer for absolute document counts
* **min_df**: Should we ignore words that show up too rarely? `0.0`-`1.0` for a proportion of documents, OR an integer for absolute document counts
* **vocabulary**: Only care about certain words
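To get a feel for what `max_df` and `min_df` do, here is a minimal pure-Python sketch of document-frequency filtering. It is not scikit-learn's implementation; the function name `filter_by_document_frequency` and the toy documents are made up for illustration, and tokenizing is just a whitespace split.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.0, max_df=1.0):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, count in doc_freq.items()
                  if min_df <= count / n_docs <= max_df)

# Hypothetical toy corpus
toy_docs = [
    "the startup raised money",
    "the design of the app",
    "the startup shipped the app",
    "the morning routine",
]

kept = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.9)
print(kept)
```

Here "the" is dropped because it appears in every document, and words that appear in only one of the four documents fall below the `min_df` cutoff.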

Also... how many documents do we have?

In [15]:
df.shape

(10, 12)

In [9]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words

# Vectorize and save into a new dataframe
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
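                      max_df=0.9,   # ignore terms that appear in more than 90% of documents
                      min_df=0.15,  # ignore terms that appear in fewer than 15% of documents
                      use_idf=True)

# Fit on the 'body' column of our dataframe
matrix = vec.fit_transform(df['body'])

# Then turn it into a new dataframe
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn removed)
results = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())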

CPU times: user 1.48 s, sys: 215 ms, total: 1.69 s
Wall time: 2.25 s


In [10]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
# We dropped the TextBlob tokenizer because it is slow; the function below is no longer passed to the vectorizer
def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words

# Vectorize and save into a new dataframe
vec = TfidfVectorizer(stop_words='english',
                      max_df=0.9,  # ignore terms that appear in more than 90% of documents
                      use_idf=True)

# Fit on the 'body' column of our dataframe
matrix = vec.fit_transform(df['body'])

# Then turn it into a new dataframe
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn removed)
results = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())

CPU times: user 44 ms, sys: 1.63 ms, total: 45.6 ms
Wall time: 48 ms


In [11]:
results.head()

Unnamed: 0,aanandamayee,abandon,abc,ability,able,abruptly,absent,absolutely,abstraction,abundance,...,ówhat,ówhen,ówhy,ôdepthõ,ôexperimentsõ,ôflat,ôfreeconomicsõ,ôtangibleõ,ôtoo,ôwowõ
0,0.052774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.052774,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04564,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.040293,0.016298,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.023201,0.023201,0.023201,0.0,0.023201,0.023201,0.023201
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.037382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
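What do the numbers in that table mean? Roughly: a term scores high in a row when it is frequent in that document and rare across the others. The sketch below reproduces the idea in plain Python, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization that `TfidfVectorizer` uses with its default settings. The helper name `tfidf_rows` and the toy documents are made up, and tokenizing is a plain whitespace split rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical toy corpus
vocab, rows = tfidf_rows(["design web design", "web startup"])
for row in rows:
    print(dict(zip(vocab, (round(w, 3) for w in row))))
```

In the second toy document, "startup" outweighs "web" even though both occur once, because "web" also appears in the other document.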


> ...Try it without the TextBlob tokenizer

## Cluster your documents

In [12]:
%%time
from sklearn.cluster import KMeans

# How many clusters?
number_of_clusters = 3
km = KMeans(n_clusters=number_of_clusters)

print("Fitting", number_of_clusters, "clusters using a", matrix.shape, "matrix")

# Let's fit it!
km.fit(matrix)

Fitting 3 clusters using a (15, 3067) matrix
CPU times: user 173 ms, sys: 24 ms, total: 197 ms
Wall time: 247 ms
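To make the fit step a little less opaque, here is a bare-bones sketch of the assign-then-update loop (Lloyd's algorithm) that k-means is built on. It is a toy version under simplifying assumptions, not scikit-learn's `KMeans`, which by default picks its starting centroids more carefully (k-means++). The function names and the six 2-D points are made up for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means: returns (labels, centroids) after n_iter rounds."""
    rng = random.Random(seed)
    # Start from k randomly chosen points (copied so the input is not modified)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Two obvious groups of hypothetical points
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In the notebook, each "point" is one row of the TF-IDF matrix, so the centroids live in the same space as the vocabulary. That is what lets the next section read the top terms off each centroid.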


## See what they look like

In [13]:
print("Top terms per cluster:")
# Sort each centroid's term weights from highest to lowest
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names_out()
for i in range(number_of_clusters):
    # Keep the eight highest-weighted terms for this cluster
    top_words = [terms[ind] for ind in order_centroids[i, :8]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: sketches valuation basecamp musk billion bhatnagar company powerwall
Cluster 1: work want people books time say morning think
Cluster 2: today design web internet did tokyo time day
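The `argsort()[:, ::-1]` step can look cryptic. For each centroid it produces the column indices ordered from the largest weight to the smallest, and those indices are then used to look up the terms. Here is the same idea for a single centroid in plain Python; the `top_terms` helper, the terms and the weights are made up for illustration.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest weights in one centroid."""
    # Indices of the weights, ordered from largest to smallest
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]

# Hypothetical vocabulary and centroid weights
example_terms = ["app", "design", "startup", "tokyo"]
example_centroid = [0.10, 0.45, 0.05, 0.30]

best = top_terms(example_centroid, example_terms)
print(best)
```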


## Push the category back to the original dataframe

In [14]:
df['category'] = km.labels_
df

Unnamed: 0,link,min_read,filename,author,publications,description,date,followers,claps,people_clapped,high_text,images,body,tag,category
0,https://www.google.com/url?q=https://m.signalv...,2,PRESS RELEASE: BASECAMP VALUATION TOPS $100 BI...,Jason Fried,Signal v. Noise,Founder & CEO at Basecamp. Co-author of Gettin...,1-Dec-15,205000,6222,5328,In order to determine the valuation of compani...,Yes,Basecamp is now a billion dollar company accor...,Startup,0
1,https://medium.com/@damjancvetkovdimitrov/thes...,5,These photos are why IÕm trapped in Tokyo fore...,Damjan Cvetkov-Dimitrov,,I never anticipate ants.. professionally. I al...,25-Nov-15,2200,6340,5607,,Yes,Tokyo holy excrements from small undefined cre...,Sci-Fy,2
2,https://medium.com/the-year-of-the-looking-gla...,2,Average Manager vs. Great Manager,Julie Zhuo,The Year of the Looking Glass,Product design VP,11-Aug-15,179000,8700,7034,,Yes,Sketches,Management,0
3,https://medium.com/interactive-mind/mobile-201...,7,Mobile:2015 UI/UX Trends,Onur Oral,Interactive Mind,Designer of things.,31-Jul-15,2300,4814,4766,Apps which have well-done micro-interactions c...,Yes,Whether on an app screen a web browser or a we...,,2
4,https://medium.com/firm-narrative/want-a-bette...,5,Want a Better Pitch? Watch This.,Andy Raskin,,Helping leaders tell strategic stories.,,20000,7447,6226,"ÒIt lacksÊoomph,Ó she said. ÒThe information i...",Yes,Three weeks ago the CMO of a San Francisco sta...,Startup,0
5,https://medium.com/building-asana/work-hard-li...,5,"Work Hard, Live Well",Dustin Moskovitz,Building Asana,,20-Aug-15,16000,5169,4799,The research is clear: beyond ~40Ð50 hours per...,Yes,Amazon isnÕt the only company burning out thei...,Tech,1
6,https://byrslf.co/you-re-only-23-stop-rushing-...,4,YouÕre only 23. Stop rushing life.,Susie Pan,Life Tips.,"Life wanderer, world traveler, serial entrepre...",9-Dec-15,3100,5259,5055,"As long as I have learned something, and that ...",Yes,I asked my CEO today Òwhat can I do to be bett...,Personal Development,1
7,https://medium.com/the-mission/never-tell-peop...,3,Never Tell People What You Do,Bruce Kasanoff,Life Learning,Social media ghostwriter. LinkedIn Influencer.,9-Sep-15,3200,4942,4784,"When you say what you want, you give others th...",Yes,ItÕs a simple question and youÕve probably ans...,Advice,1
8,https://medium.com/message/you-are-not-late-b3...,4,You Are Not Late,Kevin Kelly,The Message,"Senior Maverick at Wired, Cool Tools maven, au...",27-Jul-14,15500,5894,5255,ÊThere has never been a better time in the who...,Yes,Can you imagine how awesome it would have been...,IMHO,2
9,https://medium.com/@tommauchline/15-things-i-l...,3,15 things I learnt about Islam and British val...,Thomas Mauchline,,Builds digital campaigns around people and dat...,6-Dec-15,975,5164,4569,muslims like all British people get flustered ...,Yes,No mosque has enough parking and muslim men lo...,Muslim,1
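A quick sanity check is to see how many articles landed in each cluster and which hand-assigned tags ended up together. On the real dataframe, `df['category'].value_counts()` and a `groupby` on `category` would do this. The sketch below does the same thing with the values copied from the table above, so it runs without the CSV. Cluster numbers can differ between runs, since KMeans starts from random centroids.

```python
from collections import Counter, defaultdict

# Values copied from the `category` and `tag` columns of the table above;
# None stands in for the one missing tag.
categories = [0, 2, 0, 2, 0, 1, 1, 1, 2, 1]
tags = ["Startup", "Sci-Fy", "Management", None, "Startup", "Tech",
        "Personal Development", "Advice", "IMHO", "Muslim"]

cluster_sizes = Counter(categories)

tags_by_cluster = defaultdict(list)
for category, tag in zip(categories, tags):
    tags_by_cluster[category].append(tag)

for cluster in sorted(tags_by_cluster):
    print(cluster, cluster_sizes[cluster], tags_by_cluster[cluster])
```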


## Be pleased

In [None]:
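# Idea for later: extend scikit-learn's built-in English stop words with custom
# ones and pass the result as `stop_words=` to the vectorizer:
# from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# ['said', 'thee', 'ye'] + list(ENGLISH_STOP_WORDS)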