## Theory: Introduction to Topic Modelling

In this blog we will learn about topic modelling (or how to identify topics in a given corpus of text).

Topic modelling has become increasingly important over the last decade as our society has become more digitalised. Growing use of online platforms such as social media, combined with advances in storage technology, has sharply increased the amount of unstructured data available to us, and text is one of the most common types.

However, the sheer volume of data that is becoming available makes it difficult for us to find exactly what we're looking for. As an analogy, finding a certain sentence on a page is considerably easier than finding it in an entire book.

So, we need tools and techniques to organise, search and understand these vast quantities of information:
1. Firstly, we need to organise the data, because at the moment most of it is not organised in any manner.
    - ex: text data on social media platforms (Twitter/Instagram) is unorganised
    - ex: text data in customer reviews is unorganised
2. Secondly, we need to be able to search through it.
3. Thirdly, we need to understand the information that is present in the data.

Topic modelling provides us with methods to organise, search and understand textual data by summarising the themes that run through it. It helps us:
1. Discover the hidden topical patterns that are present in the collection
2. Annotate the data, so that it can be used further on based on the topics that have been identified

Topic modelling can also be described as a method for finding a group of words from a collection of documents that best represents the information in the collection. It can also be thought of as a form of text mining, because you're mining a vast amount of textual information. 

To do this, we will use unsupervised machine learning techniques, because our data is unlabelled. These techniques help us cluster/group the data (ex: customer reviews) to identify the main ideas/topics in a corpus of text.

In this blog we will use a real-world corpus of Twitter text. However, note that the same techniques apply to any corpus of text. This corpus was web-scraped directly from Twitter, so it has the characteristics of any real-world corpus: people use their own colloquial language, and there is a lot of noise. So, we will learn how to clean the noise and then cluster the data to identify the main topics that people are talking about. The business can then decide which steps to take based on those topics.

In order to do so, I will be using the NLTK (Natural Language Toolkit) package.

## Theory: Introduction to NLTK (Natural Language Toolkit)

NLTK provides us with tools that help a computer work with natural language. It's a leading library for building Python programs that work with human language data.

For a computer, it's very easy to interpret a programming language, as it follows a specific syntax that the computer can parse without any problems. However, human language is a very unstructured form of data: the same thing can be said in a variety of ways.

- For example, the sentences "I hope you're doing well" and "I hope everything is fine with you" have the same meaning and use similar words, but they are structured differently. Hence, for a computer these are simply two different sentences which could mean different things.

- NLTK is one of the tools that help us clean and preprocess human language data so that it becomes structured enough for a computer to work with.

NLTK provides an easy-to-use interface and a suite of text processing libraries, for things like:
- **Classification**
- **Tokenisation** => splitting text into individual tokens (words and punctuation marks)
- **Stemming** => since the same word can appear with different affixes (mostly suffixes), stemming cuts these off to get the core of the word, which still carries the same meaning (ex: words "stemming", "stemmer" => can be stemmed to the word "stem")
- **Tagging** => each word can be grammatically tagged as an article, a verb, a noun, an adjective, etc.

Finally, the best thing is that NLTK is free and open source, so if you want you can even contribute to it as it is a community driven project. 

Here is the link to [GitHub Repository of NLTK](https://github.com/nltk/nltk)

NLTK is used just like any other Python library; you can import it directly:
```python
import nltk
```
However, you first have to install NLTK itself (for example with `pip install nltk`). On top of the library, NLTK offers a vast collection of corpora and models for a variety of tasks. You usually don't need all of them, and downloading everything would take a very long time.

Therefore, instead of downloading the whole collection, you can download only the parts you need. To do so, NLTK provides a very nice GUI.

<img src="./images/nltk-gui.png" alt="Drawing" style="width: 500px;"/>

If you know the specific resources that you're going to use, go to "Corpora", search for them and download them:

<img src="./images/nltk-gui2.png" alt="Drawing" style="width: 500px;"/>

For example, here I've downloaded the stopwords corpus, because I commonly use it to remove words that essentially don't convey any meaning on their own (such as articles, prepositions and pronouns):

<img src="./images/nltk-gui3.png" alt="Drawing" style="width: 500px;"/>

Furthermore, you can also download models for implementing transfer learning:
<img src="./images/nltk-gui4.png" alt="Drawing" style="width: 500px;"/>
(for example, this model Word2Vec is quite famously used amongst NLP Data Scientists)

Then, once the required resource has been downloaded, you import it like any other Python package:

```python
from nltk.corpus import stopwords
```
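
To make the idea of stop-word removal concrete, here is a minimal sketch in plain Python. The small `STOP_WORDS` set and the sample sentence are hand-written stand-ins (assumptions for illustration only), not NLTK's actual English list, which is much longer.

```python
# Hypothetical, hand-picked stop words for illustration only;
# NLTK's English stop-word list is far larger.
STOP_WORDS = {"i", "am", "the", "a", "an", "is", "to", "from", "any", "but", "not", "when"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

tokens = remove_stop_words("The network is down and I am not able to call")
print(tokens)
```

With a real stop-word list the same filtering step applies; only the contents of the set change.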

In [33]:
import nltk
nltk.download_gui()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


## Practise: Loading and Exploring Twitter Data

Importing libraries

In [34]:
import nltk
import numpy as np
import pandas as pd
import re # regex

Perform configuration of settings

In [35]:
# Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action="ignore", category=DeprecationWarning)

In [36]:
# Display all rows and columns of a DataFrame instead of a truncated version
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

Because this is a real-world corpus of text gathered by web-scraping Twitter, it's more complicated in the sense that there's a lot of noise and many colloquialisms, so it's not a set of grammatically correct phrases like you'd find in literature or articles.

In [37]:
raw_data = pd.read_csv("../data/data.csv", encoding="ISO-8859-1")

In [38]:
raw_data.shape
raw_data.head()

(21047, 4)

|   | username | date | tweet | mentions |
|---|---|---|---|---|
| 0 | shivaji_takey | 10-06-2020 | Please check what happens to this no 940417705... | ['vodafonein'] |
| 1 | sarasberiwala | 10-06-2020 | Network fluctuations and 4G Speed is pathetic.... | ['vodafonein'] |
| 2 | chitreamod | 10-06-2020 | This has been going on since 3rd... this absol... | ['vodafonein'] |
| 3 | sanjan_suman | 10-06-2020 | @VodafoneIN I have done my recharge of 555 on... | ['vodafonein'] |
| 4 | t_nihsit | 10-06-2020 | But when???Still I am not received any call fr... | ['vodafonein'] |


Based on the shape of our DataFrame, we will be working with 21047 tweets in total. This particular Twitter dataset only includes tweets that mention Vodafone, which is a telecom company.

Now, let's perform some sanity checks to see whether our dataset has any issues we need to handle.

First, let's check whether there are any duplicate tweets:

In [39]:
unique_text = raw_data["tweet"].unique()
print(len(unique_text))

21047


Since the number of unique tweets equals the number of rows, we don't have any duplicates.

Let's look in detail at some individual tweets:

In [40]:
raw_data["tweet"][4]

'But when???Still I am not received any call from customer care.Very poor services.'

Just from looking at a few tweets, there seem to be a lot of complaints, which makes sense, as people often go to Twitter to get their complaints noticed more quickly or to raise awareness.

At the same time, we see that a wide variety of topics is mentioned, even in the small number of tweets that we have seen.

We have 21047 tweets that were gathered over a very short period. Given this large number of tweets, it would be almost impossible for a person to go through all of them, figure out the topics that are present and identify which area needs the most attention (maybe it's the service area, the network area or the number porting area).

## Practise: Cleaning the Data with Pattern Removal

So, from the previous exploration we've found out that we're working with tweets addressed to a specific telecommunications company, Vodafone. From exploring just a few tweets, we realised that various topics are present in the Twitter corpus that we will use as the textual input to our unsupervised machine learning model.

Now, we will take the first steps of cleaning this data.

The function below will help us remove certain patterns from the data.

It takes the input text and the pattern we want to remove and, using the regex library, removes every occurrence of that pattern from the text.

In [41]:
def remove_pattern(input_text, pattern):
    # re.sub(pattern, repl, string) replaces every match of the regex
    # pattern with an empty string in a single call. Passing the pattern
    # itself (rather than each matched string) avoids the matched text being
    # re-interpreted as a regex, which breaks on characters like "?" or "(".
    return re.sub(pattern, '', input_text)

Above is a generic function: give it any pattern and any input text, and it returns the input text with the pattern removed.
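
As a quick check of the idea, here is a minimal sketch of pattern removal on a single string. The sample tweet is made up for illustration (it is not taken from the dataset), and this version of the function calls `re.sub` with the pattern directly.

```python
import re

def remove_pattern(input_text, pattern):
    """Return input_text with every regex match of pattern removed."""
    return re.sub(pattern, "", input_text)

# Made-up sample tweet, not taken from the dataset
sample = "@VodafoneIN my internet is down, please help @support"
cleaned = remove_pattern(sample, r"@\w*")
print(repr(cleaned))
```

Note that the spaces around the removed mentions stay in place; they are dealt with in a later cleaning step.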

#### First step will be to remove any @ mentions
All of this data addresses Vodafone anyway, so having @Vodafone doesn't add any useful information. Being present in nearly all the tweets, it would only act as noise for the algorithm.

In [42]:
# np.vectorize is a convenient way of writing a for-loop:
# it applies remove_pattern to every value in the raw_data["tweet"] pandas Series

# the pattern we want to remove is @[\w]*, i.e. any word starting with @
# (a raw string keeps the backslash from being treated as an escape sequence)
raw_data["clean_text"] = np.vectorize(remove_pattern)(raw_data["tweet"], r"@[\w]*")
raw_data["clean_text"] 

0        Please check what happens to this no 940417705...
1        Network fluctuations and 4G Speed is pathetic....
2        This has been going on since 3rd... this absol...
3          I have done my recharge of 555 on 9709333370...
4        But when???Still I am not received any call fr...
5         mere area me vodafone ka network nai aa raha ...
6        Thanks, but I have visited the website, called...
7         \nHi,\n Today my Vodafone cim is deactivated ...
8        Dear Vodafone, I have already responded to you...
9         SIR OUR MARKET AREA ME BILKUL NETWORK NAHI AA...
10          Why the hell Previous plan Deactivated of 2...
11       Vodafone Netwrk is worst ever...Using from so ...
12        9796053999... internet not working, pls assis...
13                        Still waiting for your reply    
14       Reverse migration ka zamana hai dear Alu. Kar ...
15        Unable to access your website and is showing ...
16       Worst customer care, I charged 1072 for intern.

We can see that all @ mentions (such as @VodafoneIN) have been removed, as we wanted.

Step 2: We want to remove everything that is not a letter (the pattern below also keeps the # character, so hashtags survive).

In [43]:
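# regex=True is required in recent pandas versions, where the default is a literal match
raw_data["clean_text"] = raw_data["clean_text"].str.replace("[^a-zA-Z#]", " ", regex=True)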
raw_data["clean_text"]

0        Please check what happens to this no          ...
1        Network fluctuations and  G Speed is pathetic ...
2        This has been going on since  rd    this absol...
3          I have done my recharge of     on           ...
4        But when   Still I am not received any call fr...
5         mere area me vodafone ka network nai aa raha ...
6        Thanks  but I have visited the website  called...
7          Hi   Today my Vodafone cim is deactivated wi...
8        Dear Vodafone  I have already responded to you...
9         SIR OUR MARKET AREA ME BILKUL NETWORK NAHI AA...
10          Why the hell Previous plan Deactivated of  ...
11       Vodafone Netwrk is worst ever   Using from so ...
12                      internet not working  pls assis...
13                        Still waiting for your reply    
14       Reverse migration ka zamana hai dear Alu  Kar ...
15        Unable to access your website and is showing ...
16       Worst customer care  I charged      for intern.

We can see that all numbers and punctuation have been replaced with spaces.

But don't worry about these spaces, because they will be cleaned up later on.

Step 3: Now, we want to make sure that the computer understands that "Please" and "please" mean the same thing. Although we as humans read them as the same word, to the computer they are two different words, because the character codes for uppercase "P" and lowercase "p" are different. So we will lowercase all the sentences:

In [44]:
raw_data["clean_text"] = raw_data["clean_text"].str.lower()
raw_data["clean_text"]

0        please check what happens to this no          ...
1        network fluctuations and  g speed is pathetic ...
2        this has been going on since  rd    this absol...
3          i have done my recharge of     on           ...
4        but when   still i am not received any call fr...
5         mere area me vodafone ka network nai aa raha ...
6        thanks  but i have visited the website  called...
7          hi   today my vodafone cim is deactivated wi...
8        dear vodafone  i have already responded to you...
9         sir our market area me bilkul network nahi aa...
10          why the hell previous plan deactivated of  ...
11       vodafone netwrk is worst ever   using from so ...
12                      internet not working  pls assis...
13                        still waiting for your reply    
14       reverse migration ka zamana hai dear alu  kar ...
15        unable to access your website and is showing ...
16       worst customer care  i charged      for intern.

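Step 4: Now, we want to remove the extra spaces along with very short words, which rarely carry meaning. Splitting each sentence on whitespace and re-joining it with single spaces collapses the gaps, and we only keep words with a length greater than 2.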

In [45]:
raw_data["clean_text"] = raw_data["clean_text"].apply(lambda x: " ".join(w for w in x.split() if len(w) > 2))
raw_data["clean_text"]

0        please check what happens this not woking sinc...
1        network fluctuations and speed pathetic need j...
2        this has been going since this absolutely unpr...
3        have done recharge but haven got perday with u...
4        but when still not received any call from cust...
5        mere area vodafone network nai raha hai bhitol...
6        thanks but have visited the website called you...
7        today vodafone cim deactivated without any inf...
8        dear vodafone have already responded your repl...
9        sir our market area bilkul network nahi raha c...
10       why the hell previous plan deactivated and why...
11       vodafone netwrk worst ever using from many yea...
12                    internet not working pls assist asap
13                            still waiting for your reply
14       reverse migration zamana hai dear alu kar wapa...
15       unable access your website and showing connect...
16       worst customer care charged for internet postp.
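
The four cleaning steps can also be sketched for a single string in plain Python, which makes it easier to see what each step does. The helper name `clean_tweet` is hypothetical; the regex patterns are the ones used in the cells above, and the sample tweet is the one printed earlier in this post.

```python
import re

def clean_tweet(text):
    """Apply the four cleaning steps from this section to one string."""
    text = re.sub(r"@\w*", "", text)                  # step 1: drop @ mentions
    text = re.sub(r"[^a-zA-Z#]", " ", text)           # step 2: keep only letters and '#'
    text = text.lower()                               # step 3: lowercase
    words = [w for w in text.split() if len(w) > 2]   # step 4: drop short words
    return " ".join(words)                            # single spaces between words

tweet = "But when???Still I am not received any call from customer care.Very poor services."
print(clean_tweet(tweet))
# -> but when still not received any call from customer care very poor services
```

A tweet that contains only a mention or only digits comes out as an empty string, which matters for the checks that follow.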

## Practise: Tokenise and Identify Special Instances of Tweets

Now we will tokenise our sentences, i.e. convert each sentence into a list of words:

In [46]:
tokenized_tweet = raw_data["clean_text"].apply(lambda x: x.split())
tokenized_tweet 

0        [please, check, what, happens, this, not, woki...
1        [network, fluctuations, and, speed, pathetic, ...
2        [this, has, been, going, since, this, absolute...
3        [have, done, recharge, but, haven, got, perday...
4        [but, when, still, not, received, any, call, f...
5        [mere, area, vodafone, network, nai, raha, hai...
6        [thanks, but, have, visited, the, website, cal...
7        [today, vodafone, cim, deactivated, without, a...
8        [dear, vodafone, have, already, responded, you...
9        [sir, our, market, area, bilkul, network, nahi...
10       [why, the, hell, previous, plan, deactivated, ...
11       [vodafone, netwrk, worst, ever, using, from, m...
12             [internet, not, working, pls, assist, asap]
13                      [still, waiting, for, your, reply]
14       [reverse, migration, zamana, hai, dear, alu, k...
15       [unable, access, your, website, and, showing, ...
16       [worst, customer, care, charged, for, internet.

Now we will get our data into a format to which a count vectoriser can be applied. The vectoriser expects each document as a single string, so we join the tokens back together. Vectorising is necessary to turn our text into a numeric format which can be fed into a machine learning model.

In [47]:
for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = " ".join(tokenized_tweet[i])
raw_data["clean_text"] = tokenized_tweet

In [48]:
raw_data.head()

|   | username | date | tweet | mentions | clean_text |
|---|---|---|---|---|---|
| 0 | shivaji_takey | 10-06-2020 | Please check what happens to this no 940417705... | ['vodafonein'] | please check what happens this not woking sinc... |
| 1 | sarasberiwala | 10-06-2020 | Network fluctuations and 4G Speed is pathetic.... | ['vodafonein'] | network fluctuations and speed pathetic need j... |
| 2 | chitreamod | 10-06-2020 | This has been going on since 3rd... this absol... | ['vodafonein'] | this has been going since this absolutely unpr... |
| 3 | sanjan_suman | 10-06-2020 | @VodafoneIN I have done my recharge of 555 on... | ['vodafonein'] | have done recharge but haven got perday with u... |
| 4 | t_nihsit | 10-06-2020 | But when???Still I am not received any call fr... | ['vodafonein'] | but when still not received any call from cust... |


Let's check whether the cleaning operations have left any tweets empty. This could happen if a tweet contained only a mention, or only characters such as numbers or punctuation without any letters.

(ex: "@Vodafone" => "", "323232" => "")

In [49]:
raw_data["clean_text"].apply(lambda x: len(x) == 0).sum()

831

Indeed, there are now 831 empty tweets, so we should remove them.

To do so, let's drop duplicates. This drops the empty tweets because they all share the same text, "", and it also drops any other tweets that became identical after cleaning.
For each duplicated string, every copy is removed except the first one.

In [50]:
raw_data.drop_duplicates(subset=["clean_text"], keep="first", inplace=True)
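
The "keep the first copy" behaviour can be mimicked on a plain Python list, as a rough sketch. The helper `drop_duplicates_keep_first` and the sample strings are hypothetical and only illustrate the logic; they are not how pandas implements it.

```python
def drop_duplicates_keep_first(texts):
    """Keep only the first occurrence of each string, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept

# Made-up cleaned tweets, three of which are empty
cleaned = ["network down", "", "still waiting", "", "network down", ""]
deduplicated = drop_duplicates_keep_first(cleaned)
print(deduplicated)

# The first empty string survives deduplication and has to be removed separately
non_empty = [text for text in deduplicated if text]
print(non_empty)
```

This is why a single empty tweet can still remain after dropping duplicates: the first empty string counts as the copy that is kept.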

Whenever rows are removed, the index is left with gaps, so it's very important to reset the index:

In [51]:
raw_data.reset_index(drop=True, inplace=True)

In [52]:
raw_data["clean_text_length"] = raw_data["clean_text"].apply(len)
raw_data.head()

|   | username | date | tweet | mentions | clean_text | clean_text_length |
|---|---|---|---|---|---|---|
| 0 | shivaji_takey | 10-06-2020 | Please check what happens to this no 940417705... | ['vodafonein'] | please check what happens this not woking sinc... | 68 |
| 1 | sarasberiwala | 10-06-2020 | Network fluctuations and 4G Speed is pathetic.... | ['vodafonein'] | network fluctuations and speed pathetic need j... | 78 |
| 2 | chitreamod | 10-06-2020 | This has been going on since 3rd... this absol... | ['vodafonein'] | this has been going since this absolutely unpr... | 56 |
| 3 | sanjan_suman | 10-06-2020 | @VodafoneIN I have done my recharge of 555 on... | ['vodafonein'] | have done recharge but haven got perday with u... | 178 |
| 4 | t_nihsit | 10-06-2020 | But when???Still I am not received any call fr... | ['vodafonein'] | but when still not received any call from cust... | 74 |


In [53]:
(raw_data["clean_text_length"] == 0).sum()

1

As we can see, only one empty tweet is left (the first copy, which `drop_duplicates` kept). So we can look it up individually and remove it manually.

In [54]:
indexes_to_drop = raw_data[raw_data["clean_text_length"] == 0].index
indexes_to_drop

Int64Index([20], dtype='int64')

In [55]:
raw_data.drop(index=indexes_to_drop, inplace=True)

In [56]:
(raw_data["clean_text_length"] == 0).sum()

0

Now we can see that there are no empty tweets left.

In [57]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19761 entries, 0 to 19761
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   username           19761 non-null  object
 1   date               19761 non-null  object
 2   tweet              19761 non-null  object
 3   mentions           19761 non-null  object
 4   clean_text         19761 non-null  object
 5   clean_text_length  19761 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.1+ MB


## Practise: Vectoriser

A vectoriser turns each document into a vector of numbers that records how many times each particular word occurs in that document. Here, each document is a tweet. So, for example, say we had the following two documents/tweets:
- (Hi how are you) => 4 words
- (Hi how do you do) => 5 words

The idea is to have these two sentences in a more structured format.
We see that the 1st sentence has 4 words and the 2nd sentence has 5 words, but most machine learning models need inputs of a fixed length.

So, the vectoriser converts these sentences into a structured format which looks something like this:

- [Hi, how, are, you, do]
- [1, 1, 1, 1, 0]
- [1, 1, 0, 1, 2]

All the possible words will be stored in one array. Then each document/tweet will be represented as an array of integers of the same length as the number of possible words, with the number corresponding to each word indicating the frequency with which that word occurred in the document/tweet.
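
The example above can be reproduced with a few lines of plain Python. This is only a sketch of the counting idea; as far as I know, sklearn's `CountVectorizer` additionally lowercases the text and orders its vocabulary alphabetically, so its columns would appear in a different order.

```python
from collections import Counter

documents = ["Hi how are you", "Hi how do you do"]

# Build the vocabulary in order of first appearance
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in documents]
print(vocabulary)   # ['Hi', 'how', 'are', 'you', 'do']
for vector in vectors:
    print(vector)   # [1, 1, 1, 1, 0] then [1, 1, 0, 1, 2]
```

Both documents end up as vectors of length 5, regardless of how many words they contain.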

The two most widely used types of vectoriser are:
1. **TF-IDF (Term Frequency-Inverse Document Frequency) vectoriser** => it converts a collection of raw documents to a matrix of TF-IDF scores.

    - Besides the number of times a word appears in a document (term frequency), TF-IDF also takes into account the number of documents in which that word appears (document frequency).
    - In our example, the words ("Hi", "how", "you") appear in both documents, so they receive a lower weight than the words that actually distinguish the two documents, ("do", "are"), which receive a higher weight.
    - So, when TF-IDF is applied, these integer counts:
        - [1, 1, 1, 1, 0]
        - [1, 1, 0, 1, 2]

      become floating point values that reflect how important a particular word is in a particular document.

      This is particularly useful when you're doing NLP on longer texts made up of paragraphs. In our case, since our documents are short tweets, we will use the Count Vectoriser.
2. **Count vectoriser** => it simply gives us a matrix of the form we saw in the example above:
        - [Hi, how, are, you, do]
        - [1, 1, 1, 1, 0]
        - [1, 1, 0, 1, 2]
    - The number of features is equal to the vocabulary size found by analysing all the documents. Note that when using `CountVectorizer` from sklearn we don't have to pass in this vocabulary; it builds the vocabulary from the documents we pass in.
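
To see how the TF-IDF weights differ from plain counts, here is a small sketch on the same two documents. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is one common variant and an assumption on my part; sklearn's `TfidfVectorizer` uses a similar formula by default and, as far as I know, also normalises each row, so its numbers will differ.

```python
import math
from collections import Counter

documents = ["hi how are you", "hi how do you do"]
tokenised = [doc.split() for doc in documents]
vocabulary = ["hi", "how", "are", "you", "do"]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in tokens for tokens in tokenised) for word in vocabulary}

# Smoothed inverse document frequency (one common variant, assumed here)
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}

def tfidf_vector(tokens):
    """Term frequency multiplied by inverse document frequency, per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] * idf[word] for word in vocabulary]

for tokens in tokenised:
    print([round(value, 2) for value in tfidf_vector(tokens)])
# [1.0, 1.0, 1.41, 1.0, 0.0]
# [1.0, 1.0, 0.0, 1.0, 2.81]
```

The words shared by both documents ("hi", "how", "you") keep a weight of 1.0, while the distinguishing words ("are", "do") are weighted up.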


In [58]:
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english", min_df=0.0001, max_df=0.7)

Exploring the parameters of CountVectorizer:
- analyzer="word" => the possible string values are "word", "char" and "char_wb". We will use "word" because we want each whole word to be considered as a feature, rather than groups of characters.
- ngram_range=(1,1) => with (1,1) only unigrams are used, meaning that each word on its own becomes a feature in the vocabulary. With (1,2), both unigrams and bigrams are used, so each pair of adjacent words also becomes a feature; (2,2) would give bigrams only.
- stop_words="english" => we don't want the vectoriser to consider any stopwords it finds in the sentences, as they provide little to no information that can be used for clustering. We remove "english" stopwords because our documents are in English.
- min_df=0.0001 => ignore words that are very infrequent. As a float, min_df is a proportion of documents. We set it to a very small value because we do want most of the words to be recorded as part of our vocabulary, but a typo or some random text which doesn't make sense (ex: "I enjoy food c disjkdnskdsk") and occurs in only one tweet out of the thousands we have should be left out.
- max_df=0.7 => ignore words that are very frequent: any word that appears in more than 70% of the tweets is left out of the vocabulary. This catches very common words that the stopword list missed. For example, if the word "the" had not been removed as a stopword, we wouldn't want "the" coming up as a topic.
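
The effect of `min_df` and `max_df` can be sketched in plain Python. The helper `filter_vocabulary` and the four sample documents are hypothetical, and the thresholds follow my understanding of sklearn's behaviour for float values: a word is kept when its document frequency lies between `min_df * n_docs` and `max_df * n_docs` inclusive.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep words whose document frequency lies within the two proportions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))   # count each word once per document
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(word for word, df in doc_freq.items() if low <= df <= high)

# Made-up documents for illustration
docs = [
    "network is slow",
    "network is down",
    "network recharge failed",
    "network is slow again",
]
kept = filter_vocabulary(docs, min_df=0.5, max_df=0.75)
print(kept)   # ['is', 'slow']

# For the settings used in this post: words must appear in at least this many tweets
threshold = 0.0001 * 19761
print(threshold)
```

Here "network" is dropped for being in every document and the one-off words are dropped for being too rare. For our dataset the lower threshold works out to just under 2 tweets, so, under the assumption above, a word that occurs in only a single tweet is excluded.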

Now that we have defined our CountVectorizer and given it all the parameters it needs, we fit it on our clean_text data, which builds up the vocabulary.

In [59]:
count_vec.fit(raw_data["clean_text"])

Then, we will transform this clean_text data to create a matrix

In [60]:
desc_matrix = count_vec.transform(raw_data["clean_text"])
desc_matrix

<19761x6743 sparse matrix of type '<class 'numpy.int64'>'
	with 198134 stored elements in Compressed Sparse Row format>

Let's have a look at what this matrix looks like:

In [61]:
desc_matrix.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

We see that this is a 2-dimensional array, in other words, something like a list of lists, where each inner list corresponds to a document/tweet.

This matrix is based on a vocabulary built from 19761 documents/tweets, hence the number of features is very high and this is a very high-dimensional matrix. We see a lot of zeroes because each document uses only a handful of the features compared to the total number of features.

Let's take a look at the shape of this matrix:

In [62]:
desc_matrix.shape

(19761, 6743)

So, there are 19761 tweets in total, which is fewer than the original 21047 because we dropped tweets that were empty or duplicated after cleaning. From these 19761 tweets/documents the vectoriser has formed 6743 features (since we used unigrams and analyzer="word", this means that our vocabulary contains 6743 words).
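
To get a feel for how sparse this matrix is, we can do a little arithmetic with the numbers reported in the outputs above (the shape and the count of stored elements). This is a back-of-the-envelope sketch, not an extra analysis step.

```python
n_tweets, n_features = 19761, 6743   # shape reported above
stored = 198134                      # stored (non-zero) elements reported above

total_cells = n_tweets * n_features
density = stored / total_cells
avg_words_per_tweet = stored / n_tweets

print(f"{total_cells:,} cells, {density:.4%} of them non-zero")
print(f"about {avg_words_per_tweet:.1f} distinct vocabulary words per tweet")
```

So only a tiny fraction of the cells hold a count, which is why the matrix is stored in a sparse format and why `toarray()` shows almost nothing but zeroes.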

## Theory: Understanding K-Means

Now that we have transformed our unstructured data into a more structured format in the form of a matrix, let's discuss the actual algorithm that we will be using. We will be using the K-Means unsupervised machine learning algorithm for clustering. K-Means is one of the simplest and most popular unsupervised machine learning algorithms.

To put it simply, the objective of K-Means is to group similar datapoints together and discover underlying patterns. To achieve this, K-Means needs to be told in advance how many clusters to look for in the dataset.

A cluster refers to a collection of datapoints aggregated together because of certain similarities.

Let's take a look at how K-Means works with help of some visualisations (using this [amazing website](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/)):

Say this is what our data looks like in a 2-dimensional data space (with an x-axis and a y-axis), where each circle in the diagram represents a single datapoint.

<img src="./images/kmean.png" alt="Drawing" style="width: 500px;"/>

The objective is to be able to group these datapoints. Visually, as humans, we can clearly see that there are 3 different groups, but assume you didn't have this visualisation and all you had was this data in a numeric/matrix format. In that case, how would you even decide how many clusters to use? This matters because the number of clusters is one of the hyperparameters that the K-Means model takes as input.

Let's look at an example and add two centroids to it:
<img src="./images/kmean2.png" alt="Drawing" style="width: 500px;"/>
These centroids are placed completely randomly in this 2-dimensional data space. For our Twitter dataset we have a 6743-dimensional data space, as we have 6743 features in total, but that is very difficult to visualise.

Now, let's look at this 2-dimensional data space and assume that there is a continuous variable on the x-axis and another on the y-axis.

Steps of K-Means:
1. Step 1: K-Means places the centroids randomly.
    - Then, the distance from each datapoint to each centroid is calculated.
    - Then, each datapoint gets assigned to the centroid that is nearest to it.
    - For example, for this datapoint, we calculate the distance to the blue centroid and to the red centroid. We see that the datapoint is closer to the blue centroid than to the red centroid, so it gets assigned to the blue centroid.
<img src="./images/kmean3.png" alt="Drawing" style="width: 500px;"/>
    - This happens for all the datapoints, and in this diagram we can easily see a clear-cut division between the datapoints assigned to the red centroid and those assigned to the blue centroid, depending on which centroid is closer to each datapoint.
2. Step 2: After each datapoint has been assigned to its closest centroid, the new position of each centroid is calculated. For each centroid, we take all the datapoints that have been assigned to it and calculate the mean of each feature across those datapoints. For example, to calculate the new position of the blue centroid, we take all the datapoints assigned to the blue cluster and, since we only have 2 features (x and y), calculate the mean of x and the mean of y across those datapoints.
<img src="./images/kmean4.png" alt="Drawing" style="width: 500px;"/>
    - Those two means then serve as the coordinates of the new position of the blue centroid, so the centroid moves. With the Twitter dataset the same happens, but instead of calculating 2 means for 2 features, we calculate 6743 means for all 6743 features. The same happens for each of the centroids that we have.
<img src="./images/kmean5.png" alt="Drawing" style="width: 500px;"/>

3. Step 3: The previous two steps are repeated until the assignments stop changing, which gives a stable clustering of our data for the number of clusters that we chose. Concretely, in the second pass the datapoints get re-assigned based on the new locations of the centroids (step 1 is repeated: each datapoint picks the centroid that is closest to it, i.e. the one at the shortest distance).

<img src="./images/kmean6.png" alt="Drawing" style="width: 500px;"/>

Now, if we make the next few passes, we notice that our centroids stop moving. This is because K-Means has grouped the datapoints into clusters as well as it can for the number of clusters we gave it, and at this point the algorithm finishes.

<img src="./images/kmean7.png" alt="Drawing" style="width: 500px;"/>
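
The steps above can be sketched in plain Python. This is a minimal illustration under a few assumptions of mine: squared Euclidean distance, starting centroids sampled from the datapoints themselves, and a made-up set of six 2D points. It is not the implementation of any particular library.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_passes=100):
    """Plain K-Means: returns (centroids, labels) once assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_passes):
        # Step 1: assign each datapoint to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:   # nothing changed, so the centroids stop moving
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned datapoints
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:            # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Made-up 2D datapoints forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
rng = random.Random(0)            # fixed seed so the run is repeatable
initial = rng.sample(points, 2)   # random starting centroids
centroids, labels = k_means(points, initial)
print(centroids, labels)
```

The same loop works unchanged for points with 6743 coordinates; only the number of means calculated per centroid grows.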

However, we can easily see that 2 was not a good value for the number of clusters. Let's look again, this time with 3 centroids.

Again, we have the similar set of datapoints: 
<img src="./images/kmean8.png" alt="Drawing" style="width: 500px;"/>

Now, let's randomly assign 3 centroids:
<img src="./images/kmean9.png" alt="Drawing" style="width: 400px;"/>

Let's start updating our centroids:

Pass 1:
<img src="./images/kmean10.png" alt="Drawing" style="width: 400px;"/>
- Now again, the datapoints will get reassigned to the centroid that is closest to them.
<img src="./images/kmean11.png" alt="Drawing" style="width: 400px;"/>
Pass 2:
<img src="./images/kmean12.png" alt="Drawing" style="width: 400px;"/>
- Now again, the datapoints will get reassigned to the centroid that is closest to them.
<img src="./images/kmean13.png" alt="Drawing" style="width: 400px;"/>

We notice that our centroids are moving in the right direction, but very slowly. The number of passes and the time it takes to reach a stable clustering of the datapoints often depend greatly on the initialisation of the centroids.

Pass 10:
<img src="./images/kmean14.png" alt="Drawing" style="width: 400px;"/>
Now, we see that each centroid has found its group of datapoints successfully and any further passes will not make any changes, so our algorithm finishes at this point. In other words, each datapoint is assigned to its respective cluster.

Although this example was performed in a 2-dimensional data space, the same procedure is performed in a higher-dimensional data space, for example the 6743 dimensions of our Twitter dataset. We cannot visualise that as simply as this 2D example, but we will use another visualisation technique to help us figure out what the optimal number of clusters should be.

## Practise: Clustering with 8 centroids

## Practise: Clustering with 2 centroids

## Practise: Clustering with 2 centroids - Word Clouds

## Practise: General Function Homogeneity in Cluster - Finding the optimal Cluster Number