# Clustering Newsgroups data using K-means
* We'll use the 20 Newsgroups dataset from scikit-learn.
* We will use all data from four categories, 'alt.atheism', 'talk.religion.misc', 'comp.graphics', and 'sci.space', as an example.

# Step 1: Loading and Preprocessing the data
* We'll load the data and clean it up a bit with the preprocessing techniques we saw in the previous chapter (remove numbers, lemmatize the words, remove names)

In [2]:
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

# Defining our categories (the ones we'll use to fetch the data)
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space'
]

groups = fetch_20newsgroups(subset = 'all', categories=categories)

# Getting our labels and label names
labels = groups.target
label_names = groups.target_names

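# Removing names and lemmatizing
# Names are lowercased here because each document is lowercased before filtering;
# otherwise capitalized entries such as "John" would never match a lowercased token
all_names = set(name.lower() for name in names.words())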
lemmatizer = WordNetLemmatizer()

# An empty list to store our cleaned data
data_cleaned = []

for doc in groups.data:
    doc = doc.lower()
    doc_cleaned = " ".join(lemmatizer.lemmatize(word) for word in doc.split() if word.isalpha() and word not in all_names)
    data_cleaned.append(doc_cleaned)
    

In [3]:
# Converting the cleaned text data into count vectors
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(stop_words="english", max_features = None, 
                              max_df = 0.5, min_df = 2)
data = count_vector.fit_transform(data_cleaned)


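Note that we don't limit max_features; instead we set max_df and min_df, the maximum and minimum document frequency. The document frequency of a word is the number of documents (samples) in the dataset that contain it. A float such as max_df = 0.5 is read as a fraction of the documents, so words that appear in more than half of them are dropped. An integer such as min_df = 2 is read as an absolute count, so words that appear in fewer than two documents are dropped.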
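To make the two thresholds concrete, here is a minimal sketch that computes document frequencies by hand on a toy corpus and applies the same settings (a fractional maximum of 0.5 and an absolute minimum of 2). The corpus and the variable names are made up for illustration; this is plain Python that mimics the filtering rule, not a call into CountVectorizer itself.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "space shuttle launch",
    "space orbit launch",
    "graphics image render",
    "graphics image space",
    "religion belief space",
    "religion atheism debate",
]

# Document frequency: in how many documents does each word appear at least once?
# set(...) makes sure a word is counted once per document.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

n_docs = len(docs)
max_df = 0.5  # float: upper bound as a fraction of the documents
min_df = 2    # int: lower bound as an absolute number of documents

# Keep words that are neither too common nor too rare
kept = sorted(
    word for word, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)
```

In this toy corpus "space" appears in four of the six documents, so it exceeds the 0.5 ceiling and is dropped, while words that occur in a single document fall below the minimum of two.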

# Step 2: Clustering the data
* We'll try clustering the cleaned data as is. Keep in mind, however, that at this point data (the output of CountVectorizer) contains only raw term counts (term frequency), which might give us misleading results.
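One way raw counts can mislead a distance-based method like K-means is document length. The sketch below is an assumption-laden toy example, not part of the notebook's pipeline: it uses three hand-written count vectors and plain Python to compare Euclidean distances before and after scaling each vector to unit length (the kind of L2 normalization that TfidfVectorizer applies by default).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

# Hypothetical count vectors; columns are counts of "space", "orbit", "graphics"
short_space = [1, 1, 0]      # short post about space
long_space = [10, 10, 0]     # long post about space, same word proportions
short_graphics = [0, 0, 2]   # short post about graphics

# With raw counts, the two space posts are far apart because of length alone
raw_same_topic = euclidean(short_space, long_space)
raw_diff_topic = euclidean(short_space, short_graphics)

# After length normalization, the two space posts coincide
norm_same_topic = euclidean(l2_normalize(short_space), l2_normalize(long_space))
norm_diff_topic = euclidean(l2_normalize(short_space), l2_normalize(short_graphics))

print("raw counts :", round(raw_same_topic, 3), round(raw_diff_topic, 3))
print("normalized :", round(norm_same_topic, 3), round(norm_diff_topic, 3))
```

With raw counts the short space post sits closer to the graphics post than to the long space post; after normalization the ordering flips, which is closer to what we would want from topic clusters.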

In [None]:
from sklearn.cluster import KMeans

k = 4
kmeans = KMeans(n_clusters=k, random_state = 42, n_init = 'auto')