# Foundations of Artificial Intelligence and Machine Learning
## A Program by IIIT-H and TalentSprint
#### To be done in the Lab


In this experiment, we explore how k-Means clustering is able to find patterns and produce groupings without any form of external information to help.

We take the text of the famous Russian novel "War and Peace" by Leo Tolstoy, downloaded from Project Gutenberg (http://www.gutenberg.org/), and extract all sentences from it.

We will perform the following steps:
1. Use the nltk library to break the text into sentences.
2. Build a word2vec representation from these sentences, after removing fluff words (stop words) from them.
3. Run the k-means algorithm on the resulting word vectors.

To run the experiment, execute the following commands first:

1. `!pip3 install gensim`
2. `!pip3 install nltk`
3. `import nltk`
4. `nltk.download('punkt')`


In [None]:
# Importing required packages
import gensim
from gensim.models import Word2Vec
from gensim.models import word2vec
from gensim.models import Phrases
import logging

In [None]:
dataset = "AIML_DS_WAR-And-PEACE_STD.txt"


Now let us read the entire file into a list of lines, converting everything to lowercase and removing leading and trailing whitespace.

In [None]:
# Open the file in a with-block so that it is closed once it has been read
with open(dataset, encoding="utf8") as f:
    wp_text_stage0 = [line.strip().lower() for line in f]
print(wp_text_stage0[4000:4010])

Now we combine the lines into one gigantic string, separated by single spaces.

In [None]:
wp_text_stage1 = ' '.join(wp_text_stage0)

In [None]:
print(len(wp_text_stage1))
print(wp_text_stage1[40000:40200])

We now have one gigantic string. Let us see how to break it into sentences.

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
wp_text_stage2 = sent_tokenize(wp_text_stage1)

In [None]:
print(len(wp_text_stage2))
print(wp_text_stage2[5000:5010])
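
To see why a trained sentence tokenizer is used here rather than a plain split on punctuation, the following minimal sketch splits after every `.`, `!` or `?` that is followed by whitespace. The function `naive_sentence_split` and the sample text are made up for illustration; this is not what `sent_tokenize` does internally.

```python
import re

def naive_sentence_split(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "mr. smith went home. he slept."   # made-up sample text
pieces = naive_sentence_split(sample)
print(pieces)
```

The abbreviation "mr." is cut off as a sentence of its own, so the naive rule finds three pieces where a reader sees two sentences. Tokenizers such as the Punkt model downloaded above are designed to cope with abbreviations of this kind.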

So we have about 26k sentences in the tome. We now take each sentence and clean it up as follows:
 * replace all non-alphanumeric characters with a space
 * split each sentence on whitespace
 * in each sentence, drop words that are shorter than 3 letters or that appear in the list of fluff words

We read the entire contents of the fluff file into a set. As mentioned earlier, a set is much faster than a list for checking membership.
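
The difference between list and set membership can be measured with a small sketch. The 5,000 tokens below are a made-up stand-in for the fluff file, and the timings will vary from machine to machine.

```python
import timeit

# Hypothetical stand-in for the fluff file: 5,000 made-up tokens.
fluff_list = [f"word{i}" for i in range(5000)]
fluff_set = set(fluff_list)

probe = "word4999"  # worst case for the list: the last element

# A list is scanned element by element; a set uses a hash lookup.
list_time = timeit.timeit(lambda: probe in fluff_list, number=2000)
set_time = timeit.timeit(lambda: probe in fluff_set, number=2000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Since every word of every sentence is checked against the fluff collection, this per-lookup cost is paid many times over.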

In [None]:
# Read one fluff word per line into a set; the with-block closes the file afterwards
with open("AIML_DS_STOPWORDS_STD.txt") as f:
    fluff = {line.strip() for line in f}

Replace all non-alphanumeric characters with a space:

In [None]:
import re
only_alnum = re.compile(r"[^\w]+")  # \w matches Unicode letters, digits and the underscore
#only_alnum = re.compile(r"[^a-z0-9]") --> This would also remove accented characters, which are part of many names!

# Replace each run of one or more characters other than Unicode letters, digits and underscore with a single space
def cleanUp(s):
    return only_alnum.sub(" ", s).strip()
wp_text_stage3 = [cleanUp(s) for s in wp_text_stage2]
print(wp_text_stage3[4000:4010])
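
The effect of the two patterns on an accented name can be checked with a small sketch. The sample string is made up for illustration, and the ASCII-only pattern is the commented-out alternative with a `+` added so that both patterns collapse runs of characters.

```python
import re

unicode_aware = re.compile(r"[^\w]+")    # same pattern as only_alnum above
ascii_only = re.compile(r"[^a-z0-9]+")   # the commented-out alternative, with + added

# "\u00f3" is the accented letter o-acute, written as an escape to avoid encoding issues.
name = "prince andrew bolk\u00f3nski!"

kept = unicode_aware.sub(" ", name).strip()
broken = ascii_only.sub(" ", name).strip()

print(kept)
print(broken)
```

The Unicode-aware pattern keeps the name in one piece, while the ASCII-only pattern splits it at the accented letter into two fragments that would then be treated as separate words.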

Now we break each sentence into words, and store these words as a list. We traverse this list and drop the unwanted words. 

In [None]:
def choose_words(s):
    return [w for w in s.split() if len(w) > 2 and w not in fluff]

In [None]:
wp_text_stage4 = [choose_words(sentence) for sentence in wp_text_stage3]
print(wp_text_stage4[4000:4010])

In [None]:
print(len(wp_text_stage4))

We convert the words to a common stem; that is, we do not want to treat "run", "runs" and "running" as separate words.

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
print(stemmer.stem("running"), stemmer.stem("run"), stemmer.stem("runs"), stemmer.stem("runner"))
print(stemmer.stem("guns"), stemmer.stem("gun"), stemmer.stem("gunned"), stemmer.stem("gunning"))

In [None]:
def stem_list(wordlist):
    return [stemmer.stem(word) for word in wordlist]
for n in range(4000, 4010):
    print(wp_text_stage4[n], stem_list(wp_text_stage4[n]))

In [None]:
wp_text_stage5 = [stem_list(s) for s in wp_text_stage4]
print(wp_text_stage5[4000:4010])

Now let us build a word2vec model with this corpus.

In [None]:
num_features = 300    # Word vector dimensionality
min_word_count = 50   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 6           # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

In [None]:
# Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
wp = word2vec.Word2Vec(wp_text_stage5, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

In [None]:
# L2-normalize the word vectors in place (gensim 3.x API; deprecated in gensim 4.x)
wp.init_sims(replace=True)
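
The normalization step matters for the clustering that follows: once every vector has unit length, the squared Euclidean distance between two vectors equals 2 - 2·(cosine similarity), so distance-based clustering groups vectors by direction. A minimal NumPy sketch with made-up 2D vectors (not the model's 300-dimensional ones) illustrates this:

```python
import numpy as np

# Made-up vectors for illustration; rows 0 and 2 point in the same direction.
vectors = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Divide each row by its Euclidean length to obtain unit vectors.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

cosine = float(unit[0] @ unit[1])                   # dot product of unit vectors
sq_dist = float(np.sum((unit[0] - unit[1]) ** 2))   # squared Euclidean distance

print(cosine, sq_dist, 2 - 2 * cosine)
```

Rows 0 and 2 differ only in length, so they become identical after normalization.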

In [None]:
wp.corpus_count

In [None]:
len(wp.wv.vocab)  # gensim 3.x; in gensim 4.x use len(wp.wv.key_to_index)

In [None]:
sorted(wp.wv.vocab)

In [None]:
# Query words, given as stems (e.g. "confus") because the corpus was stemmed before training
words = ["chair","car","man","woman","clean","close","cloud","coat", "confus","danger","daughter","deal","run","walk","count","father","girl","near","neck","spoke","spoken","stand","show","shown"]

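Let us save the word vectors so that we can continue from here later.

In [None]:
# Note: without binary=True this writes the plain-text word2vec format, despite the .bin extension
wp.wv.save_word2vec_format('wp.bin')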

In [None]:
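import numpy as np
# Keep only the query words that are in the vocabulary, so that row i of X is the vector of words[i]
words = [w for w in words if w in wp.wv.vocab]
X = np.array([wp.wv[w] for w in words])
X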

Now let us apply the k-means algorithm to this data.

# k-means
K-means is one of the simplest unsupervised learning algorithms for the well-known clustering problem. It partitions a given data set into a fixed number of clusters (say k), chosen a priori. The main idea is to define k centroids, one for each cluster, assign every point to its nearest centroid, and then recompute each centroid as the mean of its assigned points, repeating until the assignments stop changing.
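
The procedure can be sketched in a few lines of NumPy. This is a plain version of Lloyd's algorithm written for illustration, not scikit-learn's implementation; the function name `kmeans`, its parameters and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; a teaching sketch only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data (made up): two well-separated groups of three points each.
toy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
                [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
labels, centroids = kmeans(toy, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.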

In [None]:
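from sklearn.cluster import KMeans
km = KMeans(n_clusters=4)   # k = 4 clusters; results can vary between runs unless random_state is set
km.fit(X)
y_kmeans = km.predict(X)    # cluster index (0 to 3) for each row of X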

In [None]:
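from sklearn import manifold
# Reduce the word vectors to 2 dimensions with Locally Linear Embedding (LLE) so they can be plotted
lle_data = manifold.LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)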

In [None]:
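import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(lle_data[:, 0], lle_data[:, 1], c=y_kmeans)
# Label each point; this assumes words[i] corresponds to row i of X.
# zip stops at the shorter sequence, so no hard-coded offset is needed.
for word, (x, y) in zip(words, lle_data):
    plt.annotate(word, xy=(x, y))
plt.show()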

The words are divided into four clusters, shown by the four colors, and visualized in 2D using an LLE plot.
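
Besides the plot, it can help to list the words in each cluster. The following is a minimal sketch; `group_by_cluster` is a hypothetical helper, and the labels below are made up for illustration rather than taken from the model. It assumes the names are given in the same order as the rows that were clustered.

```python
from collections import defaultdict

def group_by_cluster(labels, names):
    # Collect the names that share a cluster label; names[i] must belong to labels[i].
    clusters = defaultdict(list)
    for label, name in zip(labels, names):
        clusters[int(label)].append(name)
    return dict(clusters)

# Made-up labels for illustration only.
example_words = ["father", "daughter", "run", "walk", "coat"]
example_labels = [0, 0, 1, 1, 2]

grouped = group_by_cluster(example_labels, example_words)
for cluster_id, members in sorted(grouped.items()):
    print(cluster_id, members)
```

With the notebook's variables, a call such as `group_by_cluster(y_kmeans, words)` would apply, provided that `words` lists exactly the words whose vectors make up the rows of `X`, in the same order.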