# Machine learning: k-mean clusters

### author: bhavesh patel

#### We will use Wikipedia data on famous people.  The data set contains about 59,000 people.  We will group similar people together and identify who is most similar to a given person.

#### This is unsupervised learning, because we don't have a label for each group.  That is, the data does not tell us who is a sports figure, a politician, an actor, etc.

#### In supervised learning, we extract features from the training data and pass them to the machine learning model, and the training labels are also passed to the learning algorithm.  In this case we have no labels to pass, so the problem is unsupervised.

#### We will use the TF-IDF document representation.  At a high level, words that appear in almost every document (e.g. "the", "a") should get less weight than words that appear in only a few documents.

#### TF (term frequency) is the number of times a word appears in a document.
#### IDF (inverse document frequency) takes the log of the inverse of the fraction of documents that contain the word:
#### IDF = log(number of docs / (1 + number of docs using word))
#### When a word appears in almost every document, the ratio approaches 1 and the IDF approaches log 1 = 0.
#### When a word appears in only a few documents, the ratio is large, so the IDF is large.
#### The TF-IDF weight of a word in a document is TF multiplied by IDF.
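The weighting described above can be sketched in plain Python, without GraphLab. This is a minimal illustration that assumes the formula exactly as written here (raw count times log(N / (1 + n_w))); it is not necessarily what GraphLab computes internally. The function name `tf_idf` and the four toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    TF is the raw count of the word in the document; IDF is
    log(N / (1 + n_w)), where n_w is the number of documents using the word.
    """
    n_docs = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {w: tf * math.log(n_docs / (1 + doc_freq[w])) for w, tf in c.items()}
        for c in counts
    ]

# Hypothetical toy corpus, already lowercased and stripped of punctuation.
docs = [
    "the senator signed the act",
    "the actor starred in the film",
    "the striker scored",
    "the senator met the actor",
]
weights = tf_idf(docs)
```

In the first document, "signed" (found in one document) outweighs "senator" (found in two), which outweighs "the" (found in all four). On a corpus this small, the `1 +` in the denominator makes the weight of a word found in every document slightly negative; on a large corpus it is simply close to zero.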


In [1]:
import graphlab

In [2]:
# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)


In [3]:
people = graphlab.SFrame('people_wiki.gl/')

In [4]:
people.head()

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


##### The data has the URI for each person, the name of the person, and text about the person.  Let's retrieve the data for Obama.

In [5]:
people[people['name'] == 'Barack Obama']['text']

dtype: str
Rows: ?
['barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campa

In [6]:
# Let's find out how big our data set is.

len(people)

59071

In [7]:
# Now let's explore the words used in Obama's article.  We need the frequency of each word,
# which will be used in the TF-IDF computation.
# nv stands for name-value pair.

obama = people[people['name'] == 'Barack Obama']
obama['word_nv'] = graphlab.text_analytics.count_words(obama['text'])


In [8]:
print(obama['word_nv'])

[{'operations': 1, 'represent': 1, 'office': 2, 'unemployment': 1, 'is': 2, 'doddfrank': 1, 'over': 1, 'unconstitutional': 1, 'domestic': 2, 'named': 1, 'ending': 1, 'ended': 1, 'proposition': 1, 'seats': 1, 'graduate': 1, 'worked': 1, 'before': 1, 'death': 1, '20': 2, 'taxpayer': 1, 'inaugurated': 1, 'obamacare': 1, 'civil': 1, 'mccain': 1, 'to': 14, '4': 1, 'policy': 2, '8': 1, 'has': 4, '2011': 3, '2010': 2, '2013': 1, '2012': 1, 'bin': 1, 'then': 1, 'his': 11, 'march': 1, 'gains': 1, 'cuba': 1, 'californias': 1, '1992': 1, 'new': 1, 'not': 1, 'during': 2, 'years': 1, 'continued': 1, 'presidential': 2, 'husen': 1, 'osama': 1, 'term': 3, 'equality': 1, 'prize': 1, 'lost': 1, 'stimulus': 1, 'january': 3, 'university': 2, 'rights': 1, 'gun': 1, 'republican': 2, 'rodham': 1, 'troop': 1, 'withdrawal': 1, 'involvement': 3, 'response': 3, 'where': 1, 'referred': 1, 'affordable': 1, 'attorney': 1, 'school': 3, 'senate': 3, 'house': 2, 'national': 2, 'creation': 1, 'related': 1, 'hawaii': 1,

In [9]:
# Let's sort the counts to make them easier to read.  GraphLab's stack function
# turns the dictionary into a table with one (word, count) row per entry.

obama_word_table = obama[['word_nv']].stack('word_nv', new_column_name=['word','count'])

In [10]:
obama_word_table.head()

word,count
cuba,1
relations,1
sought,1
combat,1
ending,1
withdrawal,1
state,1
islamic,1
by,1
gains,1


In [11]:
obama_word_table.sort('count', ascending=False)

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
he,7
a,7
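For readers without GraphLab, the count, stack, and sort steps above can be approximated with the standard library. This is only a sketch: the sample sentence is invented, and `Counter` on whitespace-split text is an assumed stand-in for `count_words`, not a description of its exact tokenization.

```python
from collections import Counter

# Hypothetical sample text, already lowercased and stripped of punctuation.
text = "the president signed the act and the senate passed the act"

# Like count_words: a {word: count} mapping for one document.
word_counts = Counter(text.split())

# Like stack + sort: one (word, count) row per entry, highest count first.
rows = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)

for word, count in rows[:2]:
    print(word, count)
```

As in the Obama table, the most frequent entries are dominated by filler words such as "the", which is the motivation for moving from raw counts to TF-IDF.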


In [12]:
# To find nearest neighbours, we need word counts for all people.
# nv = name-value pair.

people['word_nv'] = graphlab.text_analytics.count_words(people['text'])

In [13]:
people.head()

URI,name,text,word_nv
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'currently': 1, 'less': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'show' ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, 'both' ..."


In [14]:
# now compute TF-IDF for each person.

tfidf_for_wiki_people = graphlab.text_analytics.tf_idf(people['word_nv'])

In [15]:
tfidf_for_wiki_people.head()

dtype: dict
Rows: 10
[{'selection': 3.836578553093086, 'carltons': 7.0744723837970485, 'being': 1.7938099524877322, '2005': 1.6425861253275964, 'coach': 5.444264118987054, 'its': 1.6875948402695313, 'before': 2.9935647453367427, '21': 2.797250863489293, 'northern': 3.310021742836038, 'bullants': 7.489987827758714, 'to': 0.23472468840899618, 'perth': 5.051601193605607, 'sydney': 3.5981675296480873, '2014': 2.2073995783446634, 'has': 0.428497539744039, '2011': 1.7023470901042916, '2013': 1.9545642372230505, 'division': 2.7906099979103978, 'his': 0.7878343656409721, 'rules': 3.8272034844276295, 'assistant': 2.5220702633476124, 'spanned': 5.531174273867493, 'early': 1.929422753652229, 'game': 2.4168995190159084, 'five': 2.2137301792754096, 'during': 1.3174651479035495, 'continued': 2.720588055069447, '44game': 9.887883100557085, 'kangaroos': 20.726873835958425, 'twice': 3.3301582227950113, 'round': 2.897933583948961, 'the': 0.0027426017494956603, 'parade': 5.510031837293684, 'born': 0.2681

In [16]:
# let's add another column to people SFrame to store tfidf value.

people['tfidf_value'] = tfidf_for_wiki_people

In [17]:
people.head()

URI,name,text,word_nv,tfidf_value
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ...","{'selection': 3.836578553093086, ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ...","{'precise': 6.44320060695519, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ...","{'just': 2.7007299687108643, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ...","{'all': 1.6431112434912472, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ...","{'they': 1.8993401178193898, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'currently': 1, 'less': 1, 'being': 1, ...","{'currently': 1.637088969126014, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'show' ...","{'exclusive': 10.455187230695827, ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ...","{'taxi': 6.0520214560945025, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ...","{'houston': 3.935505942157149, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, 'both' ...","{'phenomenon': 5.750053426395245, ..."


In [18]:
obama

URI,name,text,word_nv
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...,"{'operations': 1, 'represent': 1, 'offi ..."


In [19]:
# Now let's find the TF-IDF values for Obama.
# First select the Obama row again so that it includes the new tfidf_value column.
obama = people[people['name']=='Barack Obama']


In [20]:
# now let's get the tf-idf value for Obama.
obama[['tfidf_value']].stack('tfidf_value', new_column_name=['word','tfidf_value']).sort('tfidf_value', ascending=False)

word,tfidf_value
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


In [21]:
# This makes more sense for the Obama document itself.  With raw counts, common words such as
# "the" and "a" ranked highest.  TF-IDF reduces their importance because the IDF factor,
# log(number of docs / (1 + number of docs using word)), is close to zero for words found in almost every document.

In [22]:
# let's find out if Obama is closer to Clinton or Bill Gates.

clinton = people[people['name'] == 'Bill Clinton']

gates = people[people['name'] == 'Bill Gates']

In [23]:
# Now let's use cosine distance to see whether Obama is closer to Clinton or Gates (smaller means closer).
# [0] extracts the single dictionary from the one-row column.

graphlab.distances.cosine(obama['tfidf_value'][0], clinton['tfidf_value'][0])

0.8339854936884276

In [24]:
graphlab.distances.cosine(obama['tfidf_value'][0], gates['tfidf_value'][0])

0.9900304363196061

In [25]:
# From a distance point of view, Obama is closer to Clinton (0.83) than to Gates (0.99).
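
To make the comparison concrete, here is a minimal sketch of cosine distance on sparse word-weight dictionaries, taken as 1 minus the cosine similarity. It is an independent illustration, not GraphLab's implementation; the helper name `cosine_distance` and the three toy vectors are hypothetical, and both inputs are assumed to be non-empty with at least one non-zero weight.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse {word: weight} dicts."""
    # Only words present in both dicts contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF style vectors, not taken from the data set.
politician_a = {"senate": 3.0, "law": 2.0, "president": 4.0}
politician_b = {"senate": 2.0, "president": 5.0, "governor": 1.0}
founder = {"software": 4.0, "company": 3.0, "president": 1.0}

d_ab = cosine_distance(politician_a, politician_b)
d_af = cosine_distance(politician_a, founder)
print(d_ab < d_af)
```

Two documents that share heavily weighted words end up with a small distance, while documents that share only a lightly weighted word stay close to 1, which is the pattern seen in the Clinton and Gates results.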

In [26]:
# now let's build K nearest neighbour model for document retrieval.

knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf_value'],label='name')

In [27]:
# now find out who is similar to Obama?

knn_model.query(obama)

query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5
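
Note that the distance reported here for Bill Clinton (0.8139) differs from the cosine distance computed earlier (0.8340), so the model, created with default settings, is evidently not using cosine distance. The reported values also look like ratios of small integers (for example, 0.794117647059 is 27/34). One possibility, and it is only an assumption here, is a Jaccard-style set distance over the dictionary keys. The sketch below shows a brute-force nearest-neighbour query with a pluggable distance function; the names `jaccard_distance`, `nearest`, and the toy corpus are hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| over the words (keys) only; weights are ignored."""
    keys_a, keys_b = set(a), set(b)
    return 1.0 - len(keys_a & keys_b) / len(keys_a | keys_b)

def nearest(query, corpus, distance, k=5):
    """Brute force: score every document, return the k closest (name, distance) pairs."""
    scored = [(name, distance(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical word-weight vectors, not taken from the data set.
corpus = {
    "politician_a": {"senate": 3.0, "law": 2.0, "president": 4.0},
    "politician_b": {"senate": 2.0, "president": 5.0, "governor": 1.0},
    "founder": {"software": 4.0, "company": 3.0, "president": 1.0},
}

neighbours = nearest(corpus["politician_a"], corpus, jaccard_distance, k=2)
print(neighbours)
```

As in the query output above, the query document itself comes back first with distance 0, followed by the document that shares the most words with it. Passing a different distance function, such as a cosine distance, changes the ranking criterion without changing the search.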


In [28]:
# who is similar to jolie?

jolie = people[people['name'] == 'Angelina Jolie']

knn_model.query(jolie)

query_label,reference_label,distance,rank
0,Angelina Jolie,0.0,1
0,Brad Pitt,0.784023668639,2
0,Julianne Moore,0.795857988166,3
0,Billy Bob Thornton,0.803069053708,4
0,George Clooney,0.8046875,5


In [29]:
# who is similar to scarlett?
scarlett = people[people['name'] == 'Scarlett Johansson']

knn_model.query(scarlett)

query_label,reference_label,distance,rank
0,Scarlett Johansson,0.0,1
0,Jennifer Aniston,0.79,2
0,Jennifer Connelly,0.809210526316,3
0,"Robert Downey, Jr.",0.811965811966,4
0,Chlo%C3%AB Sevigny,0.8125,5
