# Text Clustering

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
import re
from typing import Dict, List

import jsonpickle
import numpy as np
import pandas as pd
from joblib import dump
from nltk.corpus import stopwords
from sklearn import set_config
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

In [3]:
%aimport src.clean.clean_data
from src.clean.clean_data import TextCleaner

%aimport src.workflows.clustering_utils
from src.workflows.clustering_utils import run_clustering_trials

## Overview

Process StackExchange posts using text pre-processing techniques and perform clustering on processed data with the `KMeans` clustering algorithm.

This notebook requires a single `.parquet` file of StackExchange posts, stored at `data/raw/text_clustering_data.parquet.gzip`. This file is created by `1_get_data.ipynb`.

## User Inputs

In [4]:
# datasets to load
topics = ["biology", "cooking", "crypto", "diy", "robotics", "travel"]

# Data
num_samples_to_use = 86000
raw_data_filepath = "data/raw/text_clustering_data.parquet.gzip"

# Clustering
n_clusters = 6
kmeans_random_state = 42

In [5]:
all_stop_words = set(stopwords.words("english"))
manual_stop_words = [
    # HTML tag, attribute and URL fragments
    "http",
    "href",
    "jpg",
    "imgur",
    "com",
    "img",
    "alt",
    "li",
    "ul",
    "ol",
    "src",
    "em",
    "en",
    "rel",
    "nofollow",
    "blockquote",
    "www",
    "png",
]

# Manually add to stop words
for manual_stop_word in manual_stop_words:
    all_stop_words.add(manual_stop_word)

In [6]:
def clean(pipe: Pipeline, df: pd.DataFrame) -> List:
    """Clean text data."""
    corpus = pipe.fit_transform(df).tolist()
    return [corpus, pipe]


def train(pipe: Pipeline, corpus: List) -> Pipeline:
    _ = pipe.fit(corpus)
    return pipe


def get_top_10_terms(pipe: Pipeline, n_clusters: int) -> Dict:
    print("Top 10 words per cluster:")

    # For each cluster centroid, sort feature indices by descending weight
    order_centroids = pipe.named_steps["clusterer"].cluster_centers_.argsort()[:, ::-1]

    # Get the vocabulary terms, indexed by feature position
    terms = pipe.named_steps["vectorizer"].get_feature_names_out()

    # Print top 10 words per cluster
    d_top_ten = {}
    for i in range(n_clusters):
        t10_terms = []
        print(f"Cluster {i}: ", end="")
        for ind in order_centroids[i, :10]:
            t10_terms.append(terms[ind])
        print(", ".join(t10_terms))
        d_top_ten[f"cluster_{i}"] = t10_terms
    return d_top_ten


def get_cluster_numbers(pipe: Pipeline) -> List:
    cluster_numbers = pipe.named_steps["clusterer"].labels_.tolist()
    return cluster_numbers


def get_cluster_posts(
    df: pd.DataFrame,
    cluster_numbers: List,
    cluster_num: int,
    num_docs_to_read: int = 5,
) -> Dict:
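    d_cluster_articles = {}
    # Attach the cluster numbers, then keep the first few posts of one cluster
    df_with_clusters = df.assign(cluster=cluster_numbers)
    df_selected = df_with_clusters[df_with_clusters["cluster"] == cluster_num].iloc[
        :num_docs_to_read
    ]
    # Series.iteritems() was removed in pandas 2.0; items() replaces it
    for k, article in df_selected["content"].items():
        d_cluster_articles[k] = article
    return d_cluster_articles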

## Load Raw Data

In [7]:
%%time
df = pd.read_parquet(raw_data_filepath, engine="auto").sample(n=num_samples_to_use)
df.head()

Unnamed: 0,id,title,content,tags,topic
15061,2661,What meteorological conditions do I need to co...,<p>It's always difficult to decide what clothe...,gear clothing weather-and-climate,diy
65232,33841,New device checking rules for flights to/from USA,"<p>I see the TSA have brought in <a href=""http...",air-travel airport-security tsa,diy
6061,64471,"Riding motorcycles in Medellin, Colombia with ...",<p>I'm planning to visit Medellin and Bogota s...,driving-licenses motorcycles colombia,diy
11890,16020,How to make kidney beans tender?,<p>The way I currently cook kidney beans is to...,cooking-time beans,biology
33566,39339,Can I have multiple valid ESTAs in different p...,<p>I'm a citizen of country X and have a valid...,usa passports esta dual-nationality,diy


## Pre-Processing

The following text cleaning tasks are performed:
- cleaning
  - lowercase
  - remove HTML tags (`<abbr></abbr>`, `<link>`, `<head></head>`, etc.)
    - these are left over from web scraping
  - keep letters, numbers and underscores (drop everything else, e.g. `$`, `%`, `^`, `&`)
  - remove whitespace: spaces, new lines and tab characters (` `, `\n`, `\t`)
  - remove punctuation (`,`, `.`, `!`, etc.)
- tokenization
  - splits each document into a list of its words (tokens)
  - `['the big dog ...']` becomes `['the', 'big', 'dog', ...]`
- remove stop words
  - eg. *the*, *and*, etc.
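The implementation of `TextCleaner` is not shown in this notebook, so the following is only a minimal sketch of the steps listed above. The function name `clean_text` and the tiny `STOP_WORDS` set are hypothetical and exist purely for illustration.

```python
import re

# Hypothetical stop-word subset, for illustration only
STOP_WORDS = {"the", "and", "a", "is", "to", "of", "in"}


def clean_text(raw: str) -> list:
    """Lowercase, strip HTML tags, drop punctuation, tokenize, remove stop words."""
    text = raw.lower()
    # replace HTML tags with a space so adjacent words do not merge
    text = re.sub(r"<[^<>]*>", " ", text)
    # keep letters, digits, underscores and whitespace; drop everything else
    text = re.sub(r"[^a-z0-9_\s]", " ", text)
    # split() tokenizes on any run of spaces, tabs or new lines
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_text("<p>The big dog is in the <em>garden</em>, barking!</p>"))
```

Replacing tags and punctuation with a space, rather than an empty string, avoids gluing neighbouring words together.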

In [8]:
%%time
pipe_clean = Pipeline([("cleantext", TextCleaner("content"))])
corpus, pipe_clean_trained = clean(pipe_clean, df)

CPU times: user 30.1 s, sys: 188 ms, total: 30.3 s
Wall time: 30.3 s


The resulting list of tokens represents the text corpus that encapsulates our documents.

## Converting Text to Numbers

### Vectorization

We now need to convert our text tokens into numbers, since ML algorithms operate on numeric input. This process is called vectorization, and its output is one vector of numbers per document.

Each number in the vector corresponds to one word from the vocabulary built over all the documents in the text corpus (data).

### Approach

To do this, we will use a technique called [TFIDF](https://monkeylearn.com/blog/what-is-tf-idf/), which associates every token with a number representing how relevant the token is to a document. Documents with similar relevant words (tokens) will have similar word vectors, which an ML algorithm can be trained on.

Briefly
- first count word occurrences by document
- then give a higher weighting to words that occur frequently within a document but not frequently within the entire corpus (all the documents), since these words are assumed to contain more meaning in relation to a given document
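The two steps above can be sketched in plain Python. This is an illustrative sketch, not the code this notebook runs: the toy `docs` list and the `tfidf` helper are made up, and the weighting formula `tf * (ln((1 + n) / (1 + df)) + 1)` followed by L2 normalisation is assumed to match the default settings of scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (illustrative only)
docs = [
    ["visa", "passport", "travel"],
    ["wire", "breaker", "wire"],
    ["visa", "travel", "days"],
]


def tfidf(docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    df = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        raw = {
            token: count * (math.log((1 + n_docs) / (1 + df[token])) + 1)
            for token, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({token: w / norm for token, w in raw.items()})
    return weights


for doc_weights in tfidf(docs):
    print({token: round(w, 3) for token, w in doc_weights.items()})
```

In the first document, *passport* receives a higher weight than *visa* because it appears in fewer documents of the corpus.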

In [9]:
vectorizer = TfidfVectorizer(
    max_df=0.85,  # proportion: ignore tokens in more than 85% of documents
    min_df=15,  # absolute count: ignore tokens in fewer than 15 documents
    stop_words=all_stop_words,  # stop words are also removed during cleaning
    ngram_range=(1, 1),  # unigrams
)

**Notes**
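1. `max_df`
   - a float is interpreted as a proportion of the documents
   - if a token is in more than 85% of the documents, then it probably has little meaning in the context of topics (cooking, travel, etc.)
   - e.g. a word might appear in many documents across many topics but carry no real meaning for separating the topics
2. `min_df`
   - an integer is interpreted as an absolute number of documents
   - the token must be in at least 15 documents
   - e.g. a misspelling that appears in only a handful of posts is too rare to help define a cluster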
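In `TfidfVectorizer`, a float value for `max_df` or `min_df` is read as a proportion of documents and an integer as an absolute document count. The following sketch mimics that filtering on a toy corpus. The helper `filter_vocabulary` and the sample documents are hypothetical, and the thresholds are scaled down to suit four documents.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df].

    Integers are treated as absolute document counts and floats as
    proportions of the corpus size.
    """
    n_docs = len(docs)
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    # document frequency: number of documents that contain each token
    df = Counter(token for doc in docs for token in set(doc))
    return sorted(token for token, count in df.items() if low <= count <= high)


docs = [
    ["would", "visa", "passport"],
    ["would", "wire", "breaker"],
    ["would", "visa", "schengen"],
    ["would", "wire", "outlet"],
]
print(filter_vocabulary(docs, min_df=2, max_df=0.85))
```

Here *would* is dropped because it occurs in every document, while *passport*, *breaker*, *schengen* and *outlet* are dropped because each occurs in only one.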

### Result of TFIDF Vectorization

The result of TFIDF vectorization will be numeric features (`X`) that can be used by an ML model.

## ML Algorithm for Clustering

We will use the `KMeans` algorithm to cluster the documents after vectorization.

In [10]:
kmeans_random_state = 42

In [11]:
est = KMeans(
    n_clusters=n_clusters,
    max_iter=500,
    n_init=10,
    random_state=kmeans_random_state,
)

**Notes**
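1. `KMeans` is sensitive to its randomly chosen initial centroids and can converge to a poor local optimum, so `n_init` is set to a value larger than 1. The algorithm is run `n_init` times with different centroid seeds and the run with the lowest inertia (within-cluster sum of squared distances) is kept.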
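To illustrate why several initialisations help, here is a small pure-Python sketch that runs k-means from several random starts and keeps the run with the lowest inertia. It is not scikit-learn's implementation: the helpers `kmeans_once` and `kmeans_best_of`, the plain random initialisation and the toy 2-D points are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(pts) for coords in zip(*pts))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a single random start; returns (inertia, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # assignment step: index of the nearest center for every point
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # update step: move each center to the mean of its members
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # keep the previous center if a cluster ends up empty
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return inertia, centers


def kmeans_best_of(points, k, n_init=10, seed=42):
    """Repeat k-means n_init times and keep the lowest-inertia run."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
inertia, centers = kmeans_best_of(points, k=3, n_init=10)
print(round(inertia, 3), centers)
```

A single run can place two starting centroids in the same group of points and get stuck there. Keeping the best of several runs makes that outcome much less likely.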

Define an ML pipeline with the
- text vectorizer
- ML clustering algorithm

In [12]:
pipe = Pipeline(
    [
        ("vectorizer", vectorizer),
        ("clusterer", est),
    ]
)
pipe

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(max_df=0.85, min_df=15,
                                 stop_words={'a', 'about', 'above', 'after',
                                             'again', 'against', 'ain', 'all',
                                             'alt', 'am', 'an', 'and', 'any',
                                             'are', 'aren', "aren't", 'as',
                                             'at', 'be', 'because', 'been',
                                             'before', 'being', 'below',
                                             'between', 'blockquote', 'both',
                                             'but', 'by', 'can', ...})),
                ('clusterer',
                 KMeans(max_iter=500, n_clusters=6, random_state=42))])

### ML Training

Train the pipeline on the entire corpus we created earlier.

In [13]:
%%time
pipe = train(pipe, corpus)

CPU times: user 1min 9s, sys: 121 ms, total: 1min 9s
Wall time: 10.7 s


### Inspect Top Tokens in Each Cluster

We will now inspect the clusters that our ML model has learned during training.

We'll first get the top 10 words in each cluster

In [14]:
d_top_ten = get_top_10_terms(pipe, n_clusters)

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random
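Each cluster centroid holds one TFIDF weight per vocabulary term, and the listing above ranks terms by that weight. The sketch below shows the same ranking idea without NumPy. The `top_terms` helper, the five-term vocabulary and the centroid weights are invented for illustration and are not taken from the trained model.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest centroid weights, highest first."""
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]


# Hypothetical vocabulary and centroid weights (one row per cluster)
terms = ["breaker", "passport", "visa", "water", "wire"]
centroids = [
    [0.01, 0.40, 0.55, 0.02, 0.00],
    [0.35, 0.00, 0.01, 0.03, 0.60],
]
for i, centroid in enumerate(centroids):
    print(f"Cluster {i}: {', '.join(top_terms(centroid, terms))}")
```

A cluster whose top terms are all generic words (such as *would*, *like*, *one*) tends to be a catch-all for posts that did not fit the more specific clusters.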


## Assign Names to Clusters

### Assign Cluster Numbers to Raw Data

In [15]:
cluster_numbers = get_cluster_numbers(pipe)

### Read the Documents in a Cluster

Read the text of several documents in a given cluster and decide what topic they cover. Compare this with the top 10 words for the cluster. With all this in mind, suggest a name for the cluster.

#### Cluster 2

In [16]:
q = 2
_ = get_top_10_terms(pipe, n_clusters)
d_selected_cluster_posts = get_cluster_posts(df, cluster_numbers, 2, 5)
print()
for k, post in d_selected_cluster_posts.items():
    post_without_html = re.sub(r"\<[^<>]*\>", "", post).replace("\n", "").strip()
    print(f"Cluster = {q}, Raw data index = {k:,}\n{post_without_html}\n")

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 2, Raw data index = 81,864
I live in the Midwest of the United States, and we not-too-rarely get a large amount of rain coming down at one time both in the spring and sometimes in the fall. Last night was no different (aside from random earthquakes..) than usual, but after I went to bed, my smoke alarm went off. I quickly determined that there was in fact no fire, so I went to take the smoke alarm off of its mounting and when I removed it, I found that there was a decent amount of water in 

**Suggested Cluster Label = Biological Studies** (some posts are talking about home repair)

#### Cluster 3

In [17]:
q = 3
_ = get_top_10_terms(pipe, n_clusters)
d_selected_cluster_posts = get_cluster_posts(df, cluster_numbers, 3, 5)
print()
for k, post in d_selected_cluster_posts.items():
    post_without_html = re.sub(r"\<[^<>]*\>", "", post).replace("\n", "").strip()
    print(f"Cluster = {q}, Raw data index = {k:,}\n{post_without_html}\n")

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 3, Raw data index = 63,096
I bought a new house about a month ago, and have since noticed some plumbing issues in the laundry room:laundry tub drains slowlyoccasional "musty" smell, which seems to come from the drainsI did some research online and found that lack of drain ventilation could cause both of these problems. So I checked the plumbing and it looks like both the washing machine and adjacent wash basin drain into the same pipe. As far as I can tell, there is no ventilation on these 

**Suggested Cluster Label = Home Repairs, Home Renovations**

#### Cluster 1

In [18]:
q = 1
_ = get_top_10_terms(pipe, n_clusters)
d_selected_cluster_posts = get_cluster_posts(df, cluster_numbers, 1, 5)
print()
for k, post in d_selected_cluster_posts.items():
    post_without_html = re.sub(r"\<[^<>]*\>", "", post).replace("\n", "").strip()
    print(f"Cluster = {q}, Raw data index = {k:,}\n{post_without_html}\n")

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 1, Raw data index = 60,884
I'm trying to put my RV into the breaker box I've got a 30 amp breaker double pole I've got10/2 wire, do I use a single pole or double pole 30 amp breaker hand my RV plug is a 4 prong plug how do I make this work

Cluster = 1, Raw data index = 62,339
I just installed 6 new 4" halo can lights. When they are on I can smell a slight odor coming from them in the attic. Is this normal? I felt the light housing in the attic and is just warm. I also felt each wire going 

**Suggested Cluster Label = Personal Electronic Devices, Home Automation and Home Electricity**

#### Cluster 0

In [19]:
q = 0
_ = get_top_10_terms(pipe, n_clusters)
d_selected_cluster_posts = get_cluster_posts(df, cluster_numbers, q, 5)
print()
for k, post in d_selected_cluster_posts.items():
    post_without_html = re.sub(r"\<[^<>]*\>", "", post).replace("\n", "").strip()
    print(f"Cluster = {q}, Raw data index = {k:,}\n{post_without_html}\n")

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 0, Raw data index = 33,566
I'm a citizen of country X and have a valid United States ESTA in my X passport.  Recently, I became of citizen of Y and now have a Y passport as well, which I would like to use for future travel to the US.  Both countries qualify for the Visa Waiver Program.  Can I have ESTAs in both passports at the same time?The official CBP site seems contradictory, first implying that I should get a new ESTA:  If you obtain a new passport or change your name, gender or countr

**Suggested Cluster Label = International Travel**

#### Cluster 4

In [20]:
q = 4
_ = get_top_10_terms(pipe, n_clusters)
d_selected_cluster_posts = get_cluster_posts(df, cluster_numbers, q, 5)
print()
for k, post in d_selected_cluster_posts.items():
    post_without_html = re.sub(r"\<[^<>]*\>", "", post).replace("\n", "").strip()
    print(f"Cluster = {q}, Raw data index = {k:,}\n{post_without_html}\n")

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 4, Raw data index = 15,061
It's always difficult to decide what clothes to take with you when you go to a new place. Usually I check Wikipedia to get an idea of approximate climate and weather conditions, but it often doesn't have much data besides average temperatures. Other factors, such as wind and precipitation, might enter into the equation, and the same temperature can "feel like" differently in two distinct locations.Perhaps not the same factors are important whether one travels to t

**Suggested Cluster Label = Food and Meals** (some posts are talking about home repair and others are talking about international travel)

#### Cluster 5

In [21]:
q = 5
_ = get_top_10_terms(pipe, n_clusters)
d_selected_cluster_posts = get_cluster_posts(df, cluster_numbers, q, 5)
print()
for k, post in d_selected_cluster_posts.items():
    post_without_html = re.sub(r"\<[^<>]*\>", "", post).replace("\n", "").strip()
    print(f"Cluster = {q}, Raw data index = {k:,}\n{post_without_html}\n")

Top 10 words per cluster:
Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 5, Raw data index = 42,634
I have designed an SQL aggregate function in Oracle that bitwise XORs all MD5 sums of the values stored in a column.For example, if my table is:+-----+----------+---------+| Key | Sequence |  Value  |+-----+----------+---------+|   1 |        1 | 'Hello' ||   1 |        2 | 'World' ||   2 |        1 | '1234'  ||   3 |        0 | (empty) ||   4 |        1 | 'Hello' ||   4 |        3 | 'World' |+-----+----------+---------+I can run the following query in Oracle:with

**Suggested Cluster Label = Encryption, Cryptographic Block Cyphers, Padding Oracle Attack**

We'll summarize these cluster labels below

In [28]:
d_cluster_label_names = {
    5: "Encryption, Cryptographic Block Cyphers, Padding Oracle Attack",
    4: "Food and Meals",
    2: "Biological Studies",
    0: "International Travel Experiences",
    3: "Home Repairs, Home Renovations",
    1: "Personal Electronic Devices, Home Automation and Home Electricity",
}
df_cluster_label_names = (
    pd.DataFrame.from_dict(
        d_cluster_label_names,
        orient="index",
    )
    .reset_index()
    .rename(
        columns={0: "Suggested Label", "index": "Cluster Number"},
    )
)
df_cluster_label_names

Unnamed: 0,Cluster Number,Suggested Label
0,5,"Encryption, Cryptographic Block Cyphers, Paddi..."
1,4,Food and Meals
2,2,Biological Studies
3,0,International Travel Experiences
4,3,"Home Repairs, Home Renovations"
5,1,"Personal Electronic Devices, Home Automation a..."


### Assign Cluster Names as Column in Raw Data

Assign cluster names to the raw data and show a sample of posts with their content and suggested cluster name.

In [29]:
df["cluster_name"] = list(map(d_cluster_label_names.get, cluster_numbers))
with pd.option_context("display.max_colwidth", 100):
    display(df[["content", "cluster_name"]].sample(15))

Unnamed: 0,content,cluster_name
33128,<p>I have an old house which probably dates back to 1920s or so. Most walls are plaster and the ...,"Home Repairs, Home Renovations"
3999,"<p>I have trouble cooking the <a href=""http://www.st-hubert.com/epicerie/produits/categorie-sauc...",Food and Meals
8879,<p>I have heard that offspring can't grow taller than either of their parents but I've also hear...,Food and Meals
49908,<p>In some literature it is written that the private key should be chosen random from </p>\n\n<p...,"Encryption, Cryptographic Block Cyphers, Padding Oracle Attack"
31946,<p>There doesn't appear to be PKCS#7 or CMS support in pyCrypto. I'd appreciate a recommendation...,"Encryption, Cryptographic Block Cyphers, Padding Oracle Attack"
22859,<p>How do you dial in the right amount of thickness vs soft melt in your mouth style?</p>\n\n<p>...,Food and Meals
64769,<p>Do I need to adjust the oven temperature in a roast duck recipe if I want to put more than on...,Food and Meals
56031,<p>I would like to know if there is a good source that combines Slam problem with vision. From m...,Food and Meals
49549,"<p>In a couple of weeks I will be going to Edinburgh, Scotland. I've been told that the weather ...",Food and Meals
51833,<p>I made some pasta dough this morning and put it in the fridge. When I ran the first batches t...,Food and Meals


## Export to Disk for Deployment

Export the Pipeline so it can be used during deployment

In [24]:
# dump(pipe, "cluster_v1.joblib")

## Iterating over End-to-End Workflow

In [25]:
pipe_clean = Pipeline([("cleantext", TextCleaner("content"))])

vectorizer = TfidfVectorizer(
    max_df=0.85,  # proportion: ignore tokens in more than 85% of documents
    min_df=15,  # absolute count: ignore tokens in fewer than 15 documents
    stop_words=all_stop_words,  # stop words are also removed during cleaning
    ngram_range=(1, 1),  # unigrams
)
kmeans_random_state = 42
est = KMeans(
    n_clusters=n_clusters,
    max_iter=500,
    n_init=10,
    random_state=kmeans_random_state,
)
pipe = Pipeline(
    [
        ("vectorizer", vectorizer),
        ("clusterer", est),
    ]
)

param_grid = {
    "vectorizer__max_df": [0.85, 0.75],
    "vectorizer__min_df": [15, 50],
    "clusterer__max_iter": [500, 750],
    "clusterer__n_init": [10],
    "clusterer__n_clusters": [6],
}
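The grid above expands to every combination of the listed values. As a rough sketch of what `ParameterGrid` produces, the same expansion can be written with `itertools.product`. The helper name `expand_grid` is hypothetical, and sorting the keys alphabetically is an assumption chosen to match the order in which the parameters appear in the training log.

```python
from itertools import product

param_grid = {
    "vectorizer__max_df": [0.85, 0.75],
    "vectorizer__min_df": [15, 50],
    "clusterer__max_iter": [500, 750],
    "clusterer__n_init": [10],
    "clusterer__n_clusters": [6],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = sorted(grid)  # alphabetical key order
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combos = expand_grid(param_grid)
print(len(combos))  # 2 * 2 * 2 * 1 * 1 = 8 combinations
print(combos[0])
```

Only the parameters with more than one candidate value multiply the number of trials, so this grid yields eight pipeline configurations.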

In [26]:
%%time
state = run_clustering_trials(
    jsonpickle.encode(pipe_clean),
    jsonpickle.encode(pipe),
    df.drop(columns=["cluster_name"], errors="ignore").to_json(orient='split'),
    list(ParameterGrid(param_grid)),
    5,
)

16:31:10.761 | Beginning flow run 'adept-gharial' for flow 'Run through complete Clustering Workflow'...
16:31:10.762 | Starting task runner `SequentialTaskRunner`...
16:31:16.952 | Beginning subflow run 'brainy-capuchin' for flow 'Preprocess Text Data'...
16:31:16.952 | Starting task runner `SequentialTaskRunner`...
16:31:23.294 | Cleaning...
16:31:54.304 | Done
16:31:54.785 | Shutting down task runner `SequentialTaskRunner`...
16:31:57.169 | Subflow run 'brainy-capuchin' finished in state Completed(message=None, type=COMPLETED)
16:32:04.009 | Beginning subflow run 'tiny-beetle' for flow 'Cluster Data'...
16:32:04.010 | Starting task runner `DaskTaskRunner`...
16:32:04.010 | Creating a new Dask cluster with `distributed.deploy.local.LocalCluster`
16:32:04.750 | The Dask dashboard is available at http://127.0.0.1:8787/status
16:32:08.088 | Submitting task run 'cluster_data-902e7ee1-0' to task runner...


  {'task': <prefect.tasks.Task object at 0x7f7bc2977 ... ait_for': None}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good


16:32:12.228 | Submitting task run 'cluster_data-902e7ee1-1' to task runner...
16:32:13.392 | Training with {'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__max_df': 0.85, 'vectorizer__min_df': 15}...
16:32:15.743 | Submitting task run 'cluster_data-902e7ee1-2' to task runner...
16:32:16.908 | Training with {'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__max_df': 0.85, 'vectorizer__min_df': 50}...
16:32:19.326 | Submitting task run 'cluster_data-902e7ee1-3' to task runner...
16:32:20.565 | Training with {'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__max_df': 0.75, 'vectorizer__min_df': 15}...
16:32:23.006 | Submitting task run 'cluster_data-902e7ee1-4' to task runner...
16:32:24.331 | Training with {'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__max_df': 0.75, 'vectorizer__min_df': 50}...
16:32:27.019

In [27]:
dfs_str, posts_str = [state.result()[k].result() for k in range(2)]
df_flow_output = pd.read_json(dfs_str, orient="split")
with pd.option_context("display.max_colwidth", 100):
    display(df_flow_output)
print(posts_str)

Unnamed: 0,index,content,num_clusters,cluster,top_10_tokens,params_str
0,33566,"<p>I'm a citizen of country X and have a valid United States ESTA in my X passport. Recently, I...",6,0,"[visa, passport, uk, schengen, us, travel, days, need, transit, visit]","{'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
1,48613,<p>I am not sure this is the right place to ask this question but assuming any traveler might ha...,6,0,"[visa, passport, uk, schengen, us, travel, days, need, transit, visit]","{'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
2,55030,<p>My old passport have 2-year valid visa of DRCongo and it is expire on 04.06.2016. Now I am on...,6,0,"[visa, passport, uk, schengen, us, travel, days, need, transit, visit]","{'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
3,49361,<p>I am flying from London to Central America and back this summer on a bit of an extended holid...,6,0,"[visa, passport, uk, schengen, us, travel, days, need, transit, visit]","{'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
4,55078,<p>My wife has a Schengen visa issued by the Italian consulate in the UK.</p>\n\n<p>It expires a...,6,0,"[visa, passport, uk, schengen, us, travel, days, need, transit, visit]","{'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
...,...,...,...,...,...,...
25,15061,<p>It's always difficult to decide what clothes to take with you when you go to a new place. Usu...,6,5,"[would, like, one, know, use, get, time, make, find, way]","{'clusterer__max_iter': 750, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
26,65232,"<p>I see the TSA have brought in <a href=""http://www.nytimes.com/2014/07/08/us/new-tsa-rules-for...",6,5,"[would, like, one, know, use, get, time, make, find, way]","{'clusterer__max_iter': 750, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
27,6061,<p>I'm planning to visit Medellin and Bogota sometime in the near future. I've a U.S. DL which e...,6,5,"[would, like, one, know, use, get, time, make, find, way]","{'clusterer__max_iter': 750, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."
28,11890,<p>The way I currently cook kidney beans is to soak them overnight. But still they have to be co...,6,5,"[would, like, one, know, use, get, time, make, find, way]","{'clusterer__max_iter': 750, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__m..."


Cluster 0: visa, passport, uk, schengen, us, travel, days, need, transit, visit
Cluster 1: wire, switch, light, wires, box, breaker, fan, outlet, ground, black
Cluster 2: water, hot, heater, pressure, valve, tank, shower, cold, pipe, house
Cluster 3: wall, house, floor, door, would, wood, concrete, like, paint, room
Cluster 4: would, like, one, know, use, get, time, make, find, way
Cluster 5: key, encryption, message, hash, public, aes, cipher, keys, data, random

Cluster = 2, Raw data index = 81,864, Hyper-Parameters = {'clusterer__max_iter': 500, 'clusterer__n_clusters': 6, 'clusterer__n_init': 10, 'vectorizer__max_df': 0.85, 'vectorizer__min_df': 15}
I live in the Midwest of the United States, and we not-too-rarely get a large amount of rain coming down at one time both in the spring and sometimes in the fall. Last night was no different (aside from random earthquakes..) than usual, but after I went to bed, my smoke alarm went off. I quickly determined that there was in fact no fire

## Summary

Accomplished
1. Used ML to cluster StackExchange posts using the text of each post
2. Read through the posts in each cluster and assigned a name to each cluster
   - most clusters' posts covered a single topic

## Future Work

Looking Forward
1. Iterate over more hyper-parameters of the text vectorizer (TFIDF)
2. Extend the text pre-processing (e.g. stemming)