# KMeans clustering of Medium articles

Read in your CSV with the text column, if you haven't already.

In [None]:
import pandas as pd
%matplotlib inline
from dirtyclean import clean
import numpy as np
import glob

In [None]:
df = pd.read_csv("top_10.csv")

In [11]:
df.head()

Unnamed: 0,author,body,filename,followers,description,claps,people_clapped,min_read,high_text,author_date,date,images
0,Josh Spector,"You’re busy, so I’ll keep this quick.\nFollowi...",3.The Two Minutes It Takes To Read This Will I...,27000,Writer. Strategist. For The Interested newslet...,25191,17953,2,Delete the word “that.”,Apr 2017,"Jul 22, 2016",Yes
1,Trent Lapinski,rump Is What Happens When You Nominate A Cheat...,"8.Dear Democrats, Read This If You Do Not Unde...",13400,Tech Entrepreneur. Journalist. Technologist. C...,10885,9981,5,"This is the problem with America today, the te...",,"Nov 9, 2016",Yes
2,Shem Magnezi,"That’s right, I said it.\n\nFuck your startup ...",2.Fuck You Startup World,6200,Doing what I love,22661,17316,5,You should celebrate any day that you don’t ha...,,"Oct 11, 2016",Yes
3,Tobias Stone,It seems we’re entering another of those stupi...,1.History tells us what may happen next with B...,27000,"Writing about politics, history, and society. ...",18983,17092,7,We need to find a way to bridge from our close...,Aug 2017,"Jul 23, 2016",Yes
4,Max Braun,When I couldn’t buy a smart mirror and made on...,10.My Bathroom Mirror Is Smarter Than Yours,7400,Inevitable technology. Lately robots at X.,10353,9506,3,Maybe I’ll post a more detailed making-of with...,,"Jan 30, 2016",Yes


In [12]:
df.shape

(10, 12)

In [14]:
df.dtypes

author            object
body              object
filename          object
followers          int64
description       object
claps              int64
people_clapped     int64
min_read           int64
high_text         object
author_date       object
date              object
images            object
dtype: object

In [None]:
df["body"] = df["body"].apply(clean)

In [9]:
df.head()

0    4
1    6
2    7
3    8
dtype: int32

## Vectorize your documents

What are the options when creating a `TfidfVectorizer`?


Let's think about:
* **ngram_range**: Do we just want single words? Or more? `(1,2)` is one- and two-word phrases, etc.
* **max_features**: Keep only the top N terms by frequency, which can make things faster. Any integer from `1` up
* **max_df**: Should we ignore words that show up too often? `0.0`-`1.0` for a proportion of documents, OR an integer for absolute document counts
* **min_df**: Should we ignore words that show up too rarely? `0.0`-`1.0` for a proportion of documents, OR an integer for absolute document counts
* **vocabulary**: Only care about certain words
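
To make the `min_df` / `max_df` thresholds concrete, here is a minimal pure-Python sketch of the filtering idea. It is not scikit-learn's implementation: the helper `document_frequency_filter` and the toy documents are made up for illustration, and tokenizing by whitespace is a simplifying assumption.

```python
from collections import Counter

def document_frequency_filter(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within [min_df, max_df].

    Floats are read as a proportion of documents, integers as absolute counts.
    (Hypothetical helper, written only to illustrate the thresholds.)
    """
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often overall)
    df_counts = Counter()
    for doc in docs:
        df_counts.update(set(doc.lower().split()))

    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(term for term, count in df_counts.items() if low <= count <= high)

toy_docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]

# "the" is in 4/4 documents (above 90%), "dog" and "bird" are in only 1 (below 2)
print(document_frequency_filter(toy_docs, min_df=2, max_df=0.9))
# → ['cat', 'flew', 'sat']
```

With only 10 documents, as in this notebook, a proportion such as `0.15` works out to 1.5 documents, so a term has to appear in at least 2 articles to survive.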

Also... how many documents do we have?

In [15]:
df.shape

(10, 12)

In [17]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words

# Vectorize and save into a new dataframe
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
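                      max_df=0.9,   # ignore terms that appear in more than 90% of documents
                      min_df=0.15,  # ignore terms that appear in fewer than 15% of documents
                      use_idf=True)

# Fit from the 'body' column of our dataframe
matrix = vec.fit_transform(df['body'])

# Then turn it into a new dataframe
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
results = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())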

CPU times: user 1.02 s, sys: 17.5 ms, total: 1.03 s
Wall time: 1.08 s


In [18]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
# We dropped the TextBlob tokenizer from the vectorizer below because it is slow
def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words

# Vectorize and save into a new dataframe
vec = TfidfVectorizer(stop_words='english',
                      max_df=0.9,  # ignore terms that appear in more than 90% of documents
                      use_idf=True)

# Fit from the 'body' column of our dataframe
matrix = vec.fit_transform(df['body'])

# Then turn it into a new dataframe
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
results = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())

CPU times: user 52.6 ms, sys: 1.51 ms, total: 54.1 ms
Wall time: 58.8 ms


In [19]:
results.head()

Unnamed: 0,000,10,100,12,130,140,15,150,16,17,...,yesterday,york,young,youtube,yoy,zero,zoom,zooming,zuck,zucks
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.021363,0.024028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.024028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.033828,0.0,0.022742,0.0,0.022742,0.0,0.0,0.019333,...,0.0,0.0,0.0,0.0,0.022742,0.0,0.0,0.0,0.022742,0.022742
3,0.0,0.0,0.030136,0.0,0.0,0.0,0.0,0.0,0.0,0.017223,...,0.0,0.015068,0.0,0.0,0.0,0.0,0.017223,0.02026,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


> ...Try it without the TextBlob tokenizer

## Cluster your documents

In [23]:
%%time
from sklearn.cluster import KMeans

# How many clusters?
number_of_clusters=3
km = KMeans(n_clusters=number_of_clusters)

print("Fitting", number_of_clusters, "clusters using a", matrix.shape, "matrix")

# Let's fit it!
km.fit(matrix)

Fitting 3 clusters using a (10, 2922) matrix
CPU times: user 165 ms, sys: 4.13 ms, total: 169 ms
Wall time: 169 ms
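
KMeans starts from random centroids, so cluster numbers (and sometimes cluster membership) can change between runs. A minimal sketch of pinning the result with `random_state`, using made-up toy points rather than the article matrix; the seed `42` and `n_init=10` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of points (illustrative only)
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Same seed and same data on both runs
km_a = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)
km_b = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# The two fits assign identical labels
print((km_a.labels_ == km_b.labels_).all())
```

The same two arguments could be passed to the `KMeans(n_clusters=number_of_clusters)` call above if you want the cluster numbers to stay stable when re-running the notebook.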


## See what they look like

In [24]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names_out()
for i in range(number_of_clusters):
    # Take the 8 highest-weighted terms for this cluster's centroid
    top_words = [terms[ind] for ind in order_centroids[i, :8]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: people trump just news mirror need use way
Cluster 1: sentence ross word sentences example short nerds adds
Cluster 2: fuck fucking shit solution goddamn know just want


## Push the category back to the original dataframe

In [25]:
df['category'] = km.labels_
df

Unnamed: 0,author,body,filename,followers,description,claps,people_clapped,min_read,high_text,author_date,date,images,category
0,Josh Spector,"You’re busy, so I’ll keep this quick.\nFollowi...",3.The Two Minutes It Takes To Read This Will I...,27000,Writer. Strategist. For The Interested newslet...,25191,17953,2,Delete the word “that.”,Apr 2017,"Jul 22, 2016",Yes,1
1,Trent Lapinski,rump Is What Happens When You Nominate A Cheat...,"8.Dear Democrats, Read This If You Do Not Unde...",13400,Tech Entrepreneur. Journalist. Technologist. C...,10885,9981,5,"This is the problem with America today, the te...",,"Nov 9, 2016",Yes,0
2,Shem Magnezi,"That’s right, I said it.\n\nFuck your startup ...",2.Fuck You Startup World,6200,Doing what I love,22661,17316,5,You should celebrate any day that you don’t ha...,,"Oct 11, 2016",Yes,2
3,Tobias Stone,It seems we’re entering another of those stupi...,1.History tells us what may happen next with B...,27000,"Writing about politics, history, and society. ...",18983,17092,7,We need to find a way to bridge from our close...,Aug 2017,"Jul 23, 2016",Yes,0
4,Max Braun,When I couldn’t buy a smart mirror and made on...,10.My Bathroom Mirror Is Smarter Than Yours,7400,Inevitable technology. Lately robots at X.,10353,9506,3,Maybe I’ll post a more detailed making-of with...,,"Jan 30, 2016",Yes,0
5,Hillary Clinton [parody],"What the fuck is your problem, America??\nI’m ...",6.Let Me Remind You Fuckers Who I Am,12500,"45th President of the United States, patriarch...",12081,11415,4,“Oh but what about your eeeemaaaaillls???” Shu...,,"Jul 25, 2016",Yes,2
6,Jose Aguinaga,No JavaScript frameworks were created during t...,4.How it feels to learn JavaScript in 2016,8700,Web Engineer.,33715,18783,13,"I need to display data on a page, not perform ...",May 2017,"Oct 3, 2016",Yes,0
7,Jose Aguinaga,It’s easier to fool people than to convince th...,7.How Technology is Hijacking Your Mind — from...,15100,"Co-founder, Center for Humane Technology // Ex...",21025,12638,16,"We need our smartphones, notifications screens...",Jul 2017,"May 18, 2016",Yes,0
8,David Hopkins,I want to discuss a popular TV show my wife an...,5.How a TV Sitcom Triggered the Downfall of We...,9800,I write a little bit of everything—short stori...,18489,15529,6,I see Kim Kardashian’s ass at the top of CNN.c...,Sep 2017,"Mar 22, 2016",Yes,1
9,Tristan de Montebello,Almost every day I sit down in a coffee shop i...,9.What are people working on in coffee shops,2600,I teach adult beginners how to learn guitar in...,10395,9720,5,It felt absolutely amazing to connect with all...,,"May 10, 2016",Yes,0
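
Once the labels are in the dataframe, it can be easier to read the clusters by listing the titles in each one. A small sketch on a made-up dataframe (the `toy` frame and its values are hypothetical; only the column names `filename` and `category` mirror the ones above):

```python
import pandas as pd

# Hypothetical stand-in for df, with just the two columns we need
toy = pd.DataFrame({
    "filename": ["a", "b", "c", "d"],
    "category": [1, 0, 0, 1],
})

# Collect the filenames that fell into each cluster
titles_by_cluster = toy.groupby("category")["filename"].apply(list)
print(titles_by_cluster)
```

On the real data the equivalent call would be `df.groupby('category')['filename'].apply(list)`.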


## Be pleased

In [None]:
#['said', 'thee', 'ye'] + list(stop_words.ENGLISH_STOP_WORDS)
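
The commented-out line above hints at extending the built-in stop word list. A minimal sketch of that idea, assuming a recent scikit-learn where the list is imported from `sklearn.feature_extraction.text`; the toy sentences and the name `vec_custom` are made up for illustration:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Extra words to drop, on top of scikit-learn's built-in English list
custom_stop_words = ['said', 'thee', 'ye'] + list(ENGLISH_STOP_WORDS)

# Toy documents (illustrative only)
toy_docs = ["Ye said the mirror was smart", "Thee said the startup was doomed"]

vec_custom = TfidfVectorizer(stop_words=custom_stop_words)
vec_custom.fit(toy_docs)

# The custom words no longer appear in the learned vocabulary
print(sorted(vec_custom.vocabulary_))
```

Passing `stop_words=custom_stop_words` to the vectorizer used for the articles would remove those words before clustering.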