## Understanding how to cluster text data

- This demo is a simplified version of the original scikit-learn demo located here: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

### Let us get some sample text data

In this case I will get it through scikit-learn.
- The sklearn.datasets module provides the 'fetch_20newsgroups' function,
- which loads the filenames and data from the 20 newsgroups dataset.
- Logging is also useful for debugging the data fetch, so I have added the logging import.

In [27]:
from sklearn.datasets import fetch_20newsgroups
import logging
import numpy as np
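A note on the logging setup: a logger left at the default WARNING threshold drops INFO messages, which is where progress messages usually go (the original scikit-learn demo calls `logging.basicConfig(level=logging.INFO, ...)` for this reason). The sketch below shows the effect with a throwaway logger. The logger name `demo.fetch` and the message text are made up for illustration and are not what scikit-learn emits.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("demo.fetch")
logger.propagate = False  # keep the demo isolated from the root logger

# Capture log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

# At WARNING (the root logger's default threshold), INFO messages are dropped.
logger.setLevel(logging.WARNING)
logger.info("downloading dataset")

# Lowering the threshold to INFO lets progress messages through.
logger.setLevel(logging.INFO)
logger.info("downloading dataset")

captured = stream.getvalue()
print(captured)
```

Only the second call ends up in `captured`, so if the fetch appears silent, passing `level=logging.INFO` to `logging.basicConfig` is worth trying.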

### Keeping things simple: I will use just 4 categories of the data

In [6]:
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

In [10]:
!curl "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz" -o 20news-bydate.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3661  100  3661    0     0   172k      0 --:--:-- --:--:-- --:--:--  188k


In [20]:
!ls -al

total 4
drwxr-xr-x 2 nbuser nbuser      0 Jan  1  1970 .
drwx------ 7 nbuser nbuser   4096 Nov 11 13:17 ..
drwxr-xr-x 2 nbuser nbuser      0 Jul 27 18:24 .ipynb_checkpoints
-rw-r--r-- 1 nbuser nbuser  40960 Aug 23 00:24 .~Learning Feature Selection in Scikit Learn.ipynb
-rw-r--r-- 1 nbuser nbuser   3661 Nov 11 14:07 20news-bydate.tar.gz
-rw-r--r-- 1 nbuser nbuser   3394 Aug 22 19:12 Basic Into to K-Nearest Neighbors Algorithm in Scikit Learn.ipynb
-rw-r--r-- 1 nbuser nbuser   1311 Nov 11 18:07 DemoSimpleCaching.py
-rw-r--r-- 1 nbuser nbuser   8668 Nov 11 18:04 Demo_Clustering_Text_Data.ipynb
-rw-r--r-- 1 nbuser nbuser   4997 Aug 16 00:26 Exploration_Of_Collaborative_Filtering.ipynb
-rw-r--r-- 1 nbuser nbuser  27828 Jul 27 18:24 Exploration_Of_Logistic_Regression_Activation_Functions_Demo_Plot.ipynb
-rw-r--r-- 1 nbuser nbuser  40960 Aug 22 17:44 Learning Feature Selection in Scikit Learn.ipynb
-rw-r--r-- 1 nbuser nbuser   3490 Aug  3 08:54 Natural_Language_Processing_Intro_w

### Importing my simple API for reading the previously cached (pickled) newsgroup data

In [21]:
%run -n DemoSimpleCaching.py
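The contents of DemoSimpleCaching.py are not shown in this notebook. As a rough idea of what a pickle-backed cache with the same constructor arguments and a `read()` method could look like, here is a minimal sketch. The class name `PickleCacheSketch`, the `write()` method and the `.pkl` file extension are all assumptions made for illustration, not the author's implementation.

```python
import os
import pickle
import tempfile


class PickleCacheSketch:
    """Hypothetical stand-in for SimpleCache; not the author's implementation."""

    def __init__(self, cache_file_folder, cache_name):
        # Assumed layout: one pickle file per cache name inside the folder.
        self.path = os.path.join(cache_file_folder, cache_name + ".pkl")

    def write(self, obj):
        # Create the folder if needed, then serialise the object to disk.
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        with open(self.path, "wb") as fh:
            pickle.dump(obj, fh)

    def read(self):
        # Load the previously pickled object back into memory.
        with open(self.path, "rb") as fh:
            return pickle.load(fh)


# Round trip in a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as folder:
    cache = PickleCacheSketch(cache_file_folder=folder, cache_name="news_group_data")
    cache.write({"data": ["first document"], "target": [0]})
    restored = cache.read()

print(restored)
```

Caching the fetched dataset this way avoids downloading it again on every run. Pickle files should only be loaded from sources you trust.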

In [23]:
# Show log messages describing what is happening behind the scenes.
logging.basicConfig()

print("Loading 20 newsgroups dataset for categories:")
print(categories)

#dataset = fetch_20newsgroups(subset='all', categories=categories,
#                             shuffle=True, random_state=42,download_if_missing=False)
document_data_folder = r'./document_data'
sc = SimpleCache(cache_file_folder=document_data_folder,cache_name='news_group_data')
dataset = sc.read()

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']


### Some basic facts about the newsgroup dataset

In [46]:
labels = dataset.target
unique_labels = np.unique(labels)
true_k = unique_labels.shape[0]

print("There are: {} documents".format(len(dataset.data)))
print("Total categories are: {}".format(len(dataset.target_names)))
print("Unique labels are: {}".format(unique_labels))
print("A sample of the News document data is shown here (for the sample, n = 1 out of 3387): \n\n {}".
      format(dataset.data[0][:300]))

There are: 3387 documents
Total categories are: 4
Unique labels are: [0 1 2 3]
A sample of the News document data is shown here (for the sample, n = 1 out of 3387): 

 From: healta@saturn.wwc.edu (Tammy R Healy)
Subject: Re: who are we to judge, Bobby?
Lines: 38
Organization: Walla Walla College
Lines: 38

In article <1993Apr14.213356.22176@ultb.isc.rit.edu> snm6394@ultb.isc.rit.edu (S.N. Mozumder ) writes:
>From: snm6394@ultb.isc.rit.edu (S.N. Mozumder )
>Subject


### Understanding Term Frequency - Inverse Document Frequency (tf-idf)
- This is a numerical measure of how important a word is to a document within a corpus (a collection of documents).

- Further details can be found here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
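
To make the idea concrete, here is a minimal pure-Python sketch of the textbook variant, tf(t, d) × log(N / df(t)), on a tiny made-up corpus. The corpus and the helper name `tf_idf` are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalises each document vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Tiny made-up corpus, already lowercased and split on whitespace below.
corpus = [
    "the rocket reached orbit",
    "the image was rendered",
    "the rocket image was sharp",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(term, doc):
    """Textbook tf-idf: relative term frequency times log(N / df)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term])
    return tf * idf


# Score every term of the first document.
scores = {term: round(tf_idf(term, docs[0]), 3) for term in docs[0]}
print(scores)
```

A word such as "the", which appears in every document, scores zero, while words found in only one document ("reached", "orbit") score highest. This is the property that makes tf-idf useful for telling documents apart.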