POC TO FIND WORDS FROM MULTIPLE DOCUMENTS AND CLUSTER THEM TOGETHER BY ANALYZING THE OPTIMAL NUMBER OF CLUSTERS FOR THE DATA:
The dataset has 2 columns: one holds the documents, containing the full text of each talk, and the other holds the author and title.
I preprocessed the documents by removing stopwords and punctuation before model building.
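A minimal sketch of this preprocessing step, using only the Python standard library. The stopword set and the preprocess helper are hypothetical and only for illustration; the actual stopword list used in this POC is not shown here.

```python
import string

# Tiny illustrative stopword set (an assumption); a real run would use a
# fuller list, such as the ones shipped with NLTK or scikit-learn.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}

def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in STOPWORDS)

cleaned = preprocess("The water, and the data, are in space!")
print(cleaned)  # water data space
```

Stripping punctuation this way also removes apostrophes, which would explain terms such as "youre" and "theres" showing up in cluster output.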
I used tf-idf to convert the document column into a vector representation that the computer can work with.
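I looked for the optimal number of clusters in two ways. The first is the elbow method: run KMeans for 15 and then 20 iterations and plot the results to find the elbow. The second is the silhouette method, which scores a clustering by how close each data point is to its own cluster compared with the other clusters.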
BELOW ARE THE ELBOW METHOD RUNS USED FOR 15 AND 20 KMEANS ITERATIONS:
1st: Elbow method (15 iterations to plot)
With 15 iterations I was not sure about the optimal number of clusters, so I tried 20 iterations; this might take some time to execute.
2nd: Elbow method (20 iterations to plot)
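A minimal sketch of how such an elbow curve can be computed, assuming scikit-learn's KMeans and assuming each "iteration" means fitting one more cluster count. The toy documents and the small max_k value are illustrative only; the runs described here went up to 15 and 20.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the talk transcripts.
docs = [
    "women and men around the world",
    "water data from space",
    "countries of the world need water",
    "she said life is love",
    "new data about water in space",
    "percent of countries need help",
    "men and women said things",
    "he said he got love in life",
]
X = TfidfVectorizer().fit_transform(docs)

max_k = 6  # kept small for the toy data
inertias = []
for k in range(1, max_k + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, sse in enumerate(inertias, start=1):
    print(f"k={k}: inertia={sse:.3f}")
```

Plotting inertias against k (for example with matplotlib) gives the elbow curve; the "elbow" is the k after which adding clusters stops reducing the inertia by much.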
3rd: Silhouette method to find optimal clusters.
I used 11 iterations to find which one gives the optimal number of clusters; the results are below:
For n_clusters=2, The Silhouette Coefficient is 0.025221700503161602
For n_clusters=3, The Silhouette Coefficient is 0.025360625895740875
For n_clusters=4, The Silhouette Coefficient is 0.024858191350452728
For n_clusters=5, The Silhouette Coefficient is 0.025151911809533658
For n_clusters=6, The Silhouette Coefficient is 0.024629056036962343
For n_clusters=7, The Silhouette Coefficient is 0.02083552089015085
For n_clusters=8, The Silhouette Coefficient is 0.02414572702868102
For n_clusters=9, The Silhouette Coefficient is 0.022338255264880966
For n_clusters=10, The Silhouette Coefficient is 0.023874614304817
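Output in this format can be produced by a loop like the sketch below, assuming scikit-learn's KMeans and silhouette_score. The toy documents and the reduced range of cluster counts are illustrative assumptions, so the numbers will not match the ones above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up stand-ins for the talk transcripts.
docs = [
    "women and men around the world",
    "water data from space",
    "countries of the world need water",
    "she said life is love",
    "new data about water in space",
    "percent of countries need help",
    "men and women said things",
    "he said he got love in life",
]
X = TfidfVectorizer().fit_transform(docs)

scores = {}
for n_clusters in range(2, 6):  # small range for the toy data
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    scores[n_clusters] = silhouette_score(X, labels)
    print(f"For n_clusters={n_clusters}, "
          f"The Silhouette Coefficient is {scores[n_clusters]}")

best_k = max(scores, key=scores.get)  # cluster count with the highest score
print("Best n_clusters by silhouette:", best_k)
```

The silhouette coefficient ranges from -1 to 1; values near 0 mean the clusters overlap, and values near 1 mean they are well separated.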
Neither result is fully satisfactory (all the silhouette scores are between 0.02 and 0.026, which points to heavily overlapping clusters), but I chose 5 clusters as a mid-level value.
Cluster 0: actually things right youre theres
Cluster 1: new water data space actually
Cluster 2: world percent countries need country
Cluster 3: said life got did love
Cluster 4: women men woman said world