# K-Means Clustering for Text

K-means clustering is a type of unsupervised learning, which is used when we have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable `k`. 

The algorithm works iteratively to assign each data point to one of `k` groups based on the features that are provided. Data points are clustered based on *feature similarity*. 
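To make the iterative idea concrete, here is a minimal sketch of the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its members). The function name `kmeans_sketch`, the toy points, and the fixed iteration count are assumptions made for illustration; this is not how `sklearn` implements it internally.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy k-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels, centroids

# Two well-separated blobs of 2-D points (made-up data)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other. Which group gets label `0` depends on the random starting centroids.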

In this lab, we will look at how to separate 45,000 movies into 20 groups using their synopses. Let's load the data using the following code.

```python
import pandas as pd
df = pd.read_csv('movie_overview.csv')
```


In [1]:
import pandas as pd
df = pd.read_csv('movie_overview.csv')

Now that we have our data, the next step is clustering. Before we begin, we need to construct features by turning the text into numbers, similar to what we did in the supervised learning chapter. This time we will not do it manually; instead, we use the **TF-IDF** vectorizer from `sklearn`. We first construct a vectorizer with English stopwords, then construct the feature set.

```python
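from sklearn.feature_extraction.text import TfidfVectorizer

# Extract the overviews as documents and convert them to Unicode strings
documents = df['overview'].values.astype("U")

vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(documents)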
```
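To see what the vectorizer produces, here is a small sketch on three made-up synopses. The names `toy_documents`, `toy_vectorizer`, and `toy_features` are assumptions used only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up synopses, purely for illustration
toy_documents = [
    "A young wizard attends a school of magic",
    "A teacher inspires students at a high school",
    "Soldiers fight in a war",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_features = toy_vectorizer.fit_transform(toy_documents)

# One row per document, one column per distinct non-stopword term
print(toy_features.shape)
# The learned vocabulary maps each term to its column index
print(sorted(toy_vectorizer.vocabulary_))
```

Stopwords such as "of" and "at" do not appear in the vocabulary, while a term like "school" that occurs in two documents gets a single shared column.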


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Extract overview as documents and convert to Unicode for processing
# You may learn more about Unicode here: https://unicode.org/
documents = df['overview'].values.astype("U") 

vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(documents)

Once we have our features, we can analyze them using K-means clustering. Let's run the algorithm to create `k=20` clusters.

```python
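from sklearn.cluster import KMeans

k = 20
# A single k-means++ initialization (n_init=1) keeps the run short
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(features)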
```

Depending on the processing power of your machine, this will take roughly 30-40 seconds.

In [3]:
k = 20
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(features)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=20, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
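Because the model above uses `n_init=1` and `random_state=None`, each run starts from a different random initialization, so cluster numbers and memberships may change between runs. The sketch below, on made-up toy documents, shows how fixing `random_state` makes two runs agree. The seed value `42` is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, purely for illustration
toy_documents = [
    "war soldiers army battle",
    "army soldiers war general",
    "school students teacher class",
    "teacher school students exam",
]
toy_features = TfidfVectorizer().fit_transform(toy_documents)

# The seed value 42 is arbitrary; any fixed integer gives repeatable runs
first = KMeans(n_clusters=2, n_init=1, random_state=42).fit(toy_features)
second = KMeans(n_clusters=2, n_init=1, random_state=42).fit(toy_features)

# Identical seeds lead to identical labels
print((first.labels_ == second.labels_).all())
```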

Once the clustering is done, we can inspect the results. First, let's assign the cluster number back to our dataframe.

```python
df['cluster'] = model.labels_
```

Now we can save the cluster results by writing each cluster to its own CSV file.

```python
clusters = df.groupby('cluster')

# Write each cluster's titles and overviews to its own CSV file
for cluster, group in clusters:
    group[['title', 'overview']].to_csv('cluster' + str(cluster) + '.csv', index_label='id')
```

In [7]:
# Assign the cluster label back to the dataframe
df['cluster'] = model.labels_

In [10]:
df.head(10)

| Unnamed: 0 | id | title | overview | cluster |
|---|---|---|---|---|
| 0 | 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 14 |
| 1 | 1 | Jumanji | When siblings Judy and Peter discover an encha... | 14 |
| 2 | 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... | 12 |
| 3 | 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | 11 |
| 4 | 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... | 0 |
| 5 | 5 | Heat | Obsessive master thief, Neil McCauley leads a ... | 2 |
| 6 | 6 | Sabrina | An ugly duckling having undergone a remarkable... | 13 |
| 7 | 7 | Tom and Huck | A mischievous young boy, Tom Sawyer, witnesses... | 12 |
| 8 | 8 | Sudden Death | International action superstar Jean Claude Van... | 14 |
| 9 | 9 | GoldenEye | James Bond must unmask the mysterious head of ... | 14 |
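A quick way to get an overview of the result is to count how many movies landed in each cluster. The sketch below uses a small made-up dataframe (`toy_df` is an assumption standing in for `df`); the same `value_counts()` call should work on the real `cluster` column.

```python
import pandas as pd

# Toy stand-in for the movie dataframe after labels were assigned
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'cluster': [14, 14, 12, 11, 14],
})

# Number of movies per cluster, largest cluster first
cluster_sizes = toy_df['cluster'].value_counts()
print(cluster_sizes)
```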


In [None]:
# Output each cluster to its own CSV file

clusters = df.groupby('cluster')

for cluster, group in clusters:
    group[['title', 'overview']].to_csv('cluster' + str(cluster) + '.csv', index_label='id')

If you inspect the results, the groupings look reasonable, but it is hard to tell why and how the algorithm grouped these movies together. What we can do is check the centroid of each cluster. Once we know the centroid, we know which movies are close to it, and that helps us understand the similarities between these movies.
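One way to find the movies closest to each centroid is `KMeans.transform()`, which returns the distance from every document to every centroid. The sketch below runs on made-up toy titles and documents (`toy_titles`, `toy_documents`, and `toy_model` are assumptions for illustration); on the real data you would use `model`, `features`, and `df['title']` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles and synopses, purely for illustration
toy_titles = ["War Film A", "War Film B", "School Film A", "School Film B"]
toy_documents = [
    "war soldiers army battle",
    "army soldiers war general",
    "school students teacher class",
    "teacher school students exam",
]
toy_features = TfidfVectorizer().fit_transform(toy_documents)
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)

# transform() returns each document's distance to every centroid
distances = toy_model.transform(toy_features)

for c in range(2):
    # Indices of documents sorted from nearest to farthest from centroid c
    nearest = np.argsort(distances[:, c])[:2]
    print("Cluster %d:" % c, [toy_titles[i] for i in nearest])
```

Each cluster should list the two films that share its vocabulary, since they sit nearest to that cluster's centroid.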

We can use the code below to extract the highest-weighted terms of each centroid (i.e., the most representative terms for each cluster).

```python
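# Sort each centroid's term weights from highest to lowest
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# Use get_feature_names() instead on scikit-learn versions before 1.0
terms = vectorizer.get_feature_names_out()

for i in range(k):
    print("Cluster %d:" % i)
    # Print the 10 highest-weighted terms of this centroid
    for j in order_centroids[i, :10]:
        print(' %s' % terms[j])
    print('------------')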
```

In [11]:
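print("Cluster centroids: \n")
# Sort each centroid's term weights from highest to lowest
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# Use get_feature_names() instead on scikit-learn versions before 1.0
terms = vectorizer.get_feature_names_out()

for i in range(k):
    print("Cluster %d:" % i)
    for j in order_centroids[i, :10]:
        print(' %s' % terms[j])
    print('------------')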

Cluster centroids: 

Cluster 0:
 wife
 husband
 man
 life
 daughter
 son
 ex
 home
 young
 death
------------
Cluster 1:
 story
 true
 tells
 based
 life
 love
 world
 film
 set
 young
------------
Cluster 2:
 man
 young
 life
 father
 old
 finds
 world
 death
 time
 girl
------------
Cluster 3:
 new
 york
 city
 life
 young
 world
 home
 years
 friends
 finds
------------
Cluster 4:
 school
 high
 students
 teacher
 student
 girl
 friends
 girls
 new
 group
------------
Cluster 5:
 war
 world
 ii
 civil
 american
 soldiers
 german
 army
 soldier
 story
------------
Cluster 6:
 old
 year
 boy
 life
 father
 mother
 girl
 years
 family
 daughter
------------
Cluster 7:
 town
 small
 sheriff
 local
 new
 life
 young
 girl
 home
 people
------------
Cluster 8:
 friend
 best
 life
 friends
 help
 girl
 young
 old
 man
 new
------------
Cluster 9:
 film
 documentary
 directed
 director
 life
 feature
 movie
 based
 films
 world
------------
Cluster 10:
 nan
 ݣ1890
 frazier
 fraw
 fray
 fray
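Cluster 10 is dominated by the token `nan`, which likely comes from movies with a missing overview: `astype("U")` converts a missing value into the literal string `'nan'`. A minimal sketch of one way to handle this, assuming you would rather drop those rows before vectorizing (`toy_df` and `cleaned` are made-up names for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing overview (hypothetical data)
toy_df = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'overview': ['A heist goes wrong', np.nan, 'Two friends travel'],
})

# astype("U") turns the missing value into the literal string 'nan'
raw_documents = toy_df['overview'].values.astype("U")
print(raw_documents)

# Dropping rows without an overview first avoids a 'nan' token
cleaned = toy_df.dropna(subset=['overview']).reset_index(drop=True)
documents = cleaned['overview'].values.astype("U")
print(documents)
```

If you drop rows this way, assign the cluster labels back to the cleaned dataframe rather than the original one, so that the row counts match.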

# Conclusion

This is a classic example of how to use K-means clustering to group similar texts. You can apply the same code to other text datasets (e.g., tweets, Facebook status updates, customer reviews). All you need to do is replace the filename and change the number of clusters `k`.
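As a rough sketch of that reuse, the vectorizing and clustering steps can be wrapped in one function. The function name `cluster_texts`, its `seed` parameter, and the sample reviews are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_texts(texts, k, seed=0):
    """Vectorize texts with TF-IDF and return one k-means label per text."""
    vectorizer = TfidfVectorizer(stop_words='english')
    features = vectorizer.fit_transform(texts)
    model = KMeans(n_clusters=k, max_iter=100, n_init=1, random_state=seed)
    return model.fit_predict(features)

# Made-up customer reviews standing in for a new dataset
reviews = [
    "fast delivery and great packaging",
    "delivery was fast, packaging intact",
    "battery died after two days",
    "battery life is terrible",
]
labels = cluster_texts(reviews, k=2)
print(labels)
```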