## Theory: Introduction to Topic Modelling

In this blog post we will learn about topic modelling, i.e. how to identify topics in a given corpus of text.

Topic modelling has become increasingly important over the last decade. Our society has become more digitalised, with increased use of online platforms such as social media, and storage technology has advanced alongside it. Together, these have led to a rapid increase in the amount of unstructured data available to us, with text being one of the most common types.

However, this increasingly available data carries so much information that it becomes difficult to find exactly what we're looking for. As an analogy, finding a certain sentence on a page is considerably easier than finding it in an entire book.

So, we need tools and techniques to organise, search and understand these vast quantities of information:
1. Firstly, we need to organise the data, because at the moment most of it is not organised in any manner.
    - ex: text data on social media platforms (Twitter/Instagram) is unorganised
    - ex: text data in customer reviews is unorganised
2. Secondly, we need to be able to search it.
3. Thirdly, we need to understand the information that is present in the data.

Topic modelling provides us with methods to organise, search and understand textual data by summarising the themes that run through it. It helps us:
1. Discover the hidden topical patterns that are present in the collection
2. Annotate the data, so that it can be used further on based on the topics that have been identified

Topic modelling can also be described as a method for finding a group of words from a collection of documents that best represents the information in the collection. It can also be thought of as a form of text mining, because you're mining a vast amount of textual information. 

In order to do this, we will use unsupervised machine learning techniques, because our data is unlabelled. These techniques help us cluster (group) the data (ex: customer reviews) to identify the main ideas/topics in a corpus of text.

In this blog we will use a real-world corpus of Twitter text. However, note that the same techniques apply to any corpus of text. This corpus was web-scraped directly from Twitter, so it has the characteristics of any real-world corpus of text: people use their own colloquial language, and the text contains noise. So, we will learn how to clean the noise and then cluster the data to identify the main topics that people are talking about. A business can then act on those topics, depending on the decision it needs to make.

In order to do so, I will be using the NLTK (Natural Language Toolkit) package.

## Theory: Introduction to NLTK (Natural Language Toolkit)

NLTK provides us with tools that help a computer work with natural language. It is one of the leading libraries for building Python programs that work with human language data.

For a computer, it's very easy to interpret a programming language, because it follows a specific syntax that the computer can parse without any problems. However, human language is a very unstructured form of data: the same thing can be said in a variety of ways and still mean the same thing.

- For example, the sentences "I hope you're doing well" and "I hope everything is fine with you" have the same meaning and share some words, but they are structured differently. Hence, for a computer these two sentences are simply different sentences, which could mean different things.

- NLTK is one of the tools that help us clean and preprocess human language data so that it becomes more structured and easier for computers to work with.
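To make the point about the two example sentences concrete, here is a small sketch in plain Python (no NLTK yet). The word-overlap score is only an illustrative measure chosen for this example, not something NLTK computes for us.

```python
a = "I hope you're doing well"
b = "I hope everything is fine with you"

# A literal comparison sees two completely different strings.
print(a == b)

# Naive "tokenisation": lowercase and split on whitespace.
words_a = set(a.lower().split())
words_b = set(b.lower().split())

# Only the words that match exactly are counted as shared.
shared = words_a & words_b
overlap = len(shared) / len(words_a | words_b)  # Jaccard similarity

print(sorted(shared))
print(round(overlap, 2))
```

Note that "you're" and "you" are treated as unrelated words here, which is exactly the kind of problem that cleaning and preprocessing is meant to reduce.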

NLTK provides quite an easy-to-use interface and has a suite of text processing libraries for things like:
- **Classification**
- **Tokenisation** => separating a text into individual words (tokens), which also makes it easy to drop punctuation
- **Stemming** => since the same word can appear with different prefixes/suffixes, stemming cuts these off to get the core of the word, which still carries the same meaning (ex: the words "stemming" and "stemmer" can be reduced to "stem", though the exact result depends on the stemmer used)
- **Tagging** => each word can be grammatically tagged as an article, a verb, a noun, an adjective, etc.
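The tokenisation and stemming steps from the list above can be sketched as follows. This is a minimal example that assumes NLTK is already installed; `RegexpTokenizer` and `PorterStemmer` are picked here because they need no extra data downloads, and they are not necessarily the tools used later in this post. Tagging is left out because NLTK's tagger needs a separately downloaded model.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

text = "Stemming maps running and runs to run."

# Tokenisation: keep runs of word characters, which drops the punctuation.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Stemming: reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(token, "=>", stem)
```

Different stemmers (and different tokenisers) can give different results for the same input, so it is worth printing a few examples from your own corpus before settling on one.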

Finally, the best thing is that NLTK is free and open source, so you can even contribute to it if you want, as it is a community-driven project.

Here is the link to [GitHub Repository of NLTK](https://github.com/nltk/nltk)

NLTK is used just like any other Python library; you can import it directly:
```python
import nltk
```
However, you first have to install NLTK (for example with `pip install nltk`). It also comes with a vast suite of data packages (corpora and models) for a variety of tasks, which are downloaded separately. You usually don't need all of them, and downloading everything would take a very long time, so it is better to fetch only the ones you need.
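
One way to avoid fetching everything is to download a data package only when it is missing. The helper below is a hypothetical sketch (the name `ensure_resource` is made up for this post); it relies on `nltk.data.find`, which raises `LookupError` when a resource is not present locally, and on `nltk.download`.

```python
import nltk


def ensure_resource(resource_path: str, package: str) -> None:
    """Download a single NLTK data package only if it is not already present."""
    try:
        # Raises LookupError when the resource cannot be found locally.
        nltk.data.find(resource_path)
    except LookupError:
        # Fetch just this one package instead of the whole NLTK data collection.
        nltk.download(package, quiet=True)


# Example usage (needs an internet connection the first time it runs):
# ensure_resource("corpora/stopwords", "stopwords")
```

Which packages you need depends on the tools you end up using, so the `stopwords` package in the usage line is only an example.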

## Practise: Loading and Exploring Twitter Data

## Practise: Cleaning the Data with Pattern Removal

## Practise: Tokenise and Identify Special Instances of Tweets

## Practise: Vectoriser

## Theory: Understanding K-Means

## Practise: Clustering with 8 centroids

## Practise: Clustering with 2 centroids

## Practise: Clustering with 2 centroids - Word Clouds

## Practise: General Function Homogeneity in Cluster - Finding the optimal Cluster Number