# Incremental agglomerative clustering
Incremental agglomerative clustering takes clusters found in earlier data, maps new records to those clusters, and creates new clusters for records that match none of them. Agglomerative clustering is a bottom-up approach: every data point starts in its own cluster, and the pair of clusters with the smallest distance between them is merged repeatedly. The incremental variant is useful for temporal text data such as news, social media posts, and chats, which keep growing over time, so there is no endpoint to wait for before starting the analysis. This implementation is based on the following paper:

* X. Dai, Q. Chen, X. Wang and J. Xu, "Online topic detection and tracking of financial news based on hierarchical clustering," 2010 International Conference on Machine Learning and Cybernetics, 2010, pp. 3341-3346, doi: 10.1109/ICMLC.2010.5580677.

## Steps
Overall pipeline: take the first set of records, vectorize the text with tf-idf, and run agglomerative hierarchical clustering (sklearn). For each following interval, update the tf-idf vectorizer, run agglomerative hierarchical clustering on the new data, and map or merge the new clusters with the candidate clusters, which are the clusters identified in the immediately preceding interval.
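The first stage of this pipeline (tf-idf followed by agglomerative clustering) can be sketched as below. This is a minimal illustration, not the code behind `find_clusters`: the function name `cluster_texts`, its parameters, and the threshold value are hypothetical. It assumes Euclidean distance on L2-normalized tf-idf vectors (which orders pairs the same way as cosine distance) and average linkage; the repository may use different settings.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_texts(texts, distance_threshold=1.0, max_features=400):
    """Return one integer cluster label per text (hypothetical helper)."""
    # TfidfVectorizer L2-normalizes each row by default.
    vectorizer = TfidfVectorizer(max_features=max_features)
    vectors = vectorizer.fit_transform(texts).toarray()  # dense input is required

    # n_clusters=None lets the distance threshold decide when merging stops.
    model = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        linkage="average",
    )
    return model.fit_predict(vectors)


sample_texts = [
    "health care costs",
    "health care reform",
    "stock market rally",
    "stock market crash",
]
sample_labels = cluster_texts(sample_texts, distance_threshold=1.2)
```

With this threshold, the two "health care" texts share a label and the two "stock market" texts share another, because texts with no words in common sit at the maximum distance from each other and are never merged.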

Steps to identify topics for a new set of stories, given the candidate topics (the previous set of stories and their topics):

1. Get the set of candidate clusters CTS from the previous set.
2. Get the set of new clusters NTC from the new set using agglomerative clustering (sklearn).
3. Take a cluster Tc from NTC and calculate the similarity between Tc and every cluster ct in the old set CTS. If the maximum similarity, reached for some cluster ct, is not smaller than the threshold θ, consider ct related to Tc.
4. If a related ct was found, combine Tc into ct and rebuild the cluster model. Otherwise, keep Tc as a new cluster.
5. Delete Tc from NTC and repeat from step 3 until NTC is empty.
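The mapping steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the repository's implementation: each cluster is represented by the matrix of its member vectors, its "model" is the mean of those vectors, similarity is cosine similarity between the means, and cluster ids are integers. The names `merge_into_candidates`, `candidates`, and `new_clusters` are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def merge_into_candidates(candidates, new_clusters, theta=0.6):
    """Map each new cluster (Tc) to a candidate cluster (ct) or to a fresh id.

    candidates, new_clusters: dict {int id: 2-D array of member vectors}.
    Returns (assignment, merged), where assignment maps new id -> final id.
    """
    merged = {cid: np.asarray(m, dtype=float) for cid, m in candidates.items()}
    candidate_ids = list(merged)            # only old clusters are compared against
    next_id = max(merged, default=-1) + 1   # first unused id for unmatched clusters
    assignment = {}

    for new_id, members in new_clusters.items():
        members = np.asarray(members, dtype=float)
        tc = members.mean(axis=0)

        # Step 3: find the most similar candidate cluster.
        best_id, best_sim = None, -1.0
        for cid in candidate_ids:
            sim = cosine(tc, merged[cid].mean(axis=0))
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is not None and best_sim >= theta:
            # Step 4: merge Tc into ct; the mean is recomputed on next use.
            merged[best_id] = np.vstack([merged[best_id], members])
            assignment[new_id] = best_id
        else:
            # No related candidate: Tc becomes a new cluster.
            merged[next_id] = members
            assignment[new_id] = next_id
            next_id += 1

    return assignment, merged


old = {0: [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], 1: [[0.0, 1.0, 0.0]]}
new = {10: [[1.0, 0.05, 0.0]], 11: [[0.0, 0.0, 1.0]]}
assignment, merged = merge_into_candidates(old, new, theta=0.6)
```

In this toy input, new cluster 10 points in almost the same direction as candidate 0 and is merged into it, while new cluster 11 is orthogonal to every candidate and receives the fresh id 2.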

## Example
This notebook shows how to use the clustering method on a Kaggle dataset (https://www.kaggle.com/rmisra/news-category-dataset). It is a news category dataset that includes the date, headline, and short description of each story. The example clusters the headlines only, with the dataset sorted by date.

In [2]:
import pandas as pd
from clustering.agglomerative_clusters import find_clusters
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)   # show full headlines; pandas >= 1.0 expects None here instead of -1


In [11]:
df = pd.read_json("sample_data/example_news_data.json", lines=True)   # forward slashes work on every OS
df.sort_values(by='date', inplace=True)         # sort the data by date
df['content'] = df['headline']       # the text to cluster is passed in the 'content' column

In [3]:
# first chunk
df_p1 = df[:5000]
out = find_clusters(df_p1, thresh1=0.6, nfeatures=400)   # cluster the first chunk; no earlier clusters exist yet
len(out['class_prd'].unique())                            # number of clusters found

threshold: 0.6, nfeatures: 400


510

In [4]:
out.loc[out.class_prd==0, ['headline']]          # example cluster: its headlines are all related to "care"

| index | headline |
|---|---|
| 200512 | Staying Bedbug-Free in the Direct Care Industry |
| 200034 | A Guide To Interntional Fabric Care Symbols (PHOTOS) |
| 199474 | Afghan Midwives Address Need For More Skilled Maternal Care |
| 199089 | Post-Natal Care In France: How I Got My Vagina Back In Shape |
| 198415 | The Real Cost of Delaying Your Health Care |
| 197832 | A Care Revolution... in America? |
| 197654 | Health Care Costs And How You Could Be Overspending |
| 197603 | Pajamas, Like Brown Bag Lunches, Mean Someone Cares |
| 197092 | Caring for Your Pet in a Tough Economy |
| 197087 | Pesticides and Personal Care Products Pollute Our Environment |


In [5]:
# second chunk

df_p2 = df[5000:10000]
out2 = find_clusters(df_p2, thresh1=0.6, nfeatures=400, old_samples=out)   # the previous chunk's output supplies the candidate clusters

threshold: 0.6, nfeatures: 400


In [6]:
out2[out2.class_prd==0]['headline']     # same cluster label in the new chunk: headlines related to "care" land in the same cluster as in the previous chunk

index
5084    Health Care Reform Mandate, How Does It Work?                                     
5409    Bethenny Frankel: Mother, SkinnyGirl Mogul, Skin Care Expert?                     
5435    Why We Should Care About Children's Fashion                                       
5493    Irina Shayk Thong Makes An Appearance At Jeffrey Fashion Cares (PHOTOS)           
5564    What Health Care is Like: Seeking Supreme Analogies                               
5584    Health Care Reform: What's At Stake If Obamacare Is Overturned?                   
6034    How Active Should We Be in Our Own Medical Care?                                  
6132    Choosing the TEDMED 20 Great Challenges of Health Care                            
6425    Reinventing Health Care: The Design and Investment of the Millennia               
7094    Losing a Loved One and Confronting the Heartlessness of Health Care Cost Control  
7602    A New Mainstream Health Care?                                               

We can iterate over the whole dataset in small chunks in the same manner. Feel free to check out other clusters and to test with your own data.
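A loop over all chunks could look like the sketch below. The helper `cluster_in_chunks` is hypothetical and not part of the package. It only assumes what the cells above show: the clustering function takes a DataFrame chunk plus keyword arguments, and accepts the previous chunk's output through `old_samples`. A stub stands in for `find_clusters` so the sketch runs on its own.

```python
import pandas as pd


def cluster_in_chunks(df, cluster_fn, chunk_size=5000, **kwargs):
    """Cluster df chunk by chunk, feeding each result into the next call."""
    results = []
    previous = None
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        if previous is None:
            # First chunk: no candidate clusters yet.
            previous = cluster_fn(chunk, **kwargs)
        else:
            # Later chunks: the previous output supplies the candidates.
            previous = cluster_fn(chunk, old_samples=previous, **kwargs)
        results.append(previous)
    return results


def _stub_cluster_fn(chunk, old_samples=None, **kwargs):
    """Stand-in for find_clusters; records whether candidates were passed."""
    out = chunk.copy()
    out["had_previous"] = old_samples is not None
    return out


demo = pd.DataFrame({"content": [f"headline {i}" for i in range(7)]})
demo_results = cluster_in_chunks(demo, _stub_cluster_fn, chunk_size=3)
```

With the real function, the call would presumably be `cluster_in_chunks(df, find_clusters, chunk_size=5000, thresh1=0.6, nfeatures=400)`. Whether `find_clusters` copes with a very small final chunk has not been verified here.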