## Theory: Introduction to Topic Modelling

In this blog we will learn about topic modelling (or how to identify topics in a given corpus of text).

Topic modelling has become increasingly important over the last decade. As society has become more digitalised, the use of online platforms such as social media has grown. Combined with advances in storage technology, this has produced an exponential increase in the amount of unstructured data available to us, and text is one of the most common types.

However, the more data becomes available, the harder it is to find exactly what we're looking for. As an analogy, finding a certain sentence on a page is considerably easier than finding it in an entire book.

So, we need tools and techniques to organise, search and understand these vast quantities of information.
1. Firstly, we need to organise the data, because at the moment most of it is not organised in any manner.
    - ex: text data on social media platforms (Twitter/Instagram) is unorganised
    - ex: text data in customer reviews is unorganised
2. Secondly, we need to be able to search through it
3. Thirdly, we need to understand the information that is present in the data

Topic modelling provides us with methods to organise, search and understand large collections of text by summarising the themes that run through them. It helps us:
1. Discover the hidden topical patterns that are present in the collection
2. Annotate the data, so that it can be used further on based on the topics that have been identified

Topic modelling can also be described as a method for finding a group of words from a collection of documents that best represents the information in the collection. It can also be thought of as a form of text mining, because you're mining a vast amount of textual information. 

In order to do this, we will use unsupervised machine learning techniques, because our data is unlabelled. These techniques help us cluster/group the data (ex: customer reviews) to identify the main ideas/topics in a corpus of text.

In this blog we will use a real-world corpus of Twitter text. However, note that the same techniques apply to any corpus of text. The corpus was web-scraped directly from Twitter, so it has the characteristics of any real-world corpus of text: people use their own colloquial language and there is a lot of noise. We will learn how to clean the noise and then cluster the data to identify the main topics that people are talking about. The business can then decide which steps to take based on those topics.

In order to do so, I will be using the NLTK (Natural Language Toolkit) package.

## Theory: Introduction to NLTK (Natural Language Toolkit)

NLTK provides us with tools that help a computer work with natural language. It's a leading library for building Python programs that work with human language data.

For a computer, programming languages are easy to interpret because they follow a specific syntax, which lets the computer parse them without any problems. Human language, however, is a very unstructured form of data: the same thing can be said in a variety of ways and still mean the same thing.

- For example, the sentences "I hope you're doing well" and "I hope everything is fine with you" have the same meaning and use similar words, but they are structured differently. Hence, for a computer these are simply two different sentences, which could mean different things.

- NLTK is one of the tools that help us clean and preprocess human language data so that it becomes structured enough for a computer to work with.
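
To make the point about the two sentences concrete, here is a minimal sketch using only the standard library. The helper `rough_tokens` is a hypothetical, deliberately crude tokeniser written for this illustration; it is not part of NLTK.

```python
import re

def rough_tokens(sentence):
    """Lowercase a sentence and return its word-like tokens as a set (crude, for illustration)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

a = "I hope you're doing well"
b = "I hope everything is fine with you"

# A literal comparison only sees two different strings
same_string = (a == b)

# Even at token level, only part of the vocabulary overlaps
shared = rough_tokens(a) & rough_tokens(b)

print(same_string)
print(sorted(shared))
```

A plain string comparison tells the computer nothing about meaning, which is why the text has to be preprocessed into a more structured form first.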

NLTK provides an easy-to-use interface and a suite of text processing libraries for things like:
- **Classification**
- **Tokenisation** => separating the text into individual words (tokens), so that punctuation can be split off and removed
- **Stemming** => since the same word can appear with different affixes, stemming cuts these off (mostly suffixes) to get the core of the word, which still carries the same meaning (ex: the words "stemming" and "stemmer" ideally reduce to "stem"; the exact result depends on the stemmer used)
- **Tagging** => each word can be grammatically tagged as an article, a verb, a noun, an adjective, etc.
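
As a rough sketch of tokenisation and stemming, the snippet below uses `wordpunct_tokenize` and `PorterStemmer`. These were chosen for this illustration because they work without downloading any extra NLTK data; they are not necessarily the tools used later in this blog. Tagging is left out because it needs an additional model download.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "Network fluctuations and 4G Speed is pathetic."

# Regex-based tokeniser: splits runs of word characters from punctuation
tokens = wordpunct_tokenize(text)

# Keep purely alphabetic tokens, which drops punctuation and "4G"
words = [t.lower() for t in tokens if t.isalpha()]

# Reduce each word to its stem with the Porter algorithm
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in words]

print(tokens)
print(stems)
```

Note that stems are not always dictionary words; they only need to be consistent, so that different forms of a word map to the same string.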

Finally, the best thing is that NLTK is free and open source, so if you want you can even contribute to it as it is a community driven project. 

Here is the link to [GitHub Repository of NLTK](https://github.com/nltk/nltk)

You use NLTK just like any other Python library, by importing it directly:
```python
import nltk
```
Before the first use, however, you have to install NLTK. On top of the library itself, NLTK offers a vast suite of data packages (corpora, models and so on) for a variety of tasks. You usually don't need all of them, and downloading everything would take a very long time.

Therefore, instead of downloading the whole collection, you can download only the parts you need onto your machine. To do so, NLTK provides a very nice GUI.

<img src="./images/nltk-gui.png" alt="Drawing" style="width: 500px;"/>

If you know the specific tools that you're going to use, go to "Corpora", search for them and download them:

<img src="./images/nltk-gui2.png" alt="Drawing" style="width: 500px;"/>

For example, here I've downloaded the stopwords corpus, which I commonly use to remove words that convey little meaning on their own (typically articles, prepositions and pronouns):

<img src="./images/nltk-gui3.png" alt="Drawing" style="width: 500px;"/>

Furthermore, you can also download pre-trained models, which can be used for transfer learning:
<img src="./images/nltk-gui4.png" alt="Drawing" style="width: 500px;"/>
(for example, the Word2Vec model shown here is widely used among NLP data scientists)

Then, once the required sub-package has been downloaded, you import it like any other Python package:

```python
from nltk.corpus import stopwords
```
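
To show what stopword removal does, here is a minimal sketch. The function `remove_stopwords` and the tiny `demo_stop_words` set are made up for this illustration so that the snippet runs without any download; the set is not NLTK's actual list.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words (compared in lowercase)."""
    return [t for t in tokens if t.lower() not in stop_words]

# Tiny hand-written set for illustration only; NOT NLTK's real stopword list
demo_stop_words = {"i", "am", "not", "any", "from", "the", "a", "an"}

tokens = ["Still", "I", "am", "not", "received", "any", "call", "from", "customer", "care"]
filtered = remove_stopwords(tokens, demo_stop_words)

print(filtered)
```

Once the stopwords corpus has been downloaded, `set(stopwords.words("english"))` could be passed in place of the demo set.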

In [2]:
import nltk
nltk.download_gui()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


## Practise: Loading and Exploring Twitter Data

Importing libraries

In [3]:
import nltk
import numpy as np
import pandas as pd
import re # regex

Perform configuration of settings

In [4]:
# Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action="ignore", category=DeprecationWarning)

In [5]:
# Display all rows and columns of a DataFrame instead of a truncated version
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

Because this is a real-world corpus of text gathered by web-scraping Twitter, it is more complicated to work with: there is a lot of noise and many colloquialisms, so it is not a set of grammatically correct phrases like you'd find in literature or articles.

In [11]:
raw_data = pd.read_csv("../data/data.csv", encoding="ISO-8859-1")


In [12]:
raw_data.shape
raw_data.head()

(21047, 4)

|   | username | date | tweet | mentions |
|---|---|---|---|---|
| 0 | shivaji_takey | 10-06-2020 | Please check what happens to this no 940417705... | ['vodafonein'] |
| 1 | sarasberiwala | 10-06-2020 | Network fluctuations and 4G Speed is pathetic.... | ['vodafonein'] |
| 2 | chitreamod | 10-06-2020 | This has been going on since 3rd... this absol... | ['vodafonein'] |
| 3 | sanjan_suman | 10-06-2020 | @VodafoneIN I have done my recharge of 555 on... | ['vodafonein'] |
| 4 | t_nihsit | 10-06-2020 | But when???Still I am not received any call fr... | ['vodafonein'] |


Based on the shape of our DataFrame, we will be working with 21047 tweets in total. This particular Twitter dataset only includes tweets that mention Vodafone, a telecom company.

Now, let's perform some sanity checks to see whether our dataset has any issues we need to handle.

First, let's check whether there are any duplicate tweets:

In [15]:
unique_text = raw_data["tweet"].unique()
print(len(unique_text))

21047


Since the number of unique tweets equals the number of rows, we don't have any duplicates.
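
The same kind of sanity check can also be written with pandas' built-in helpers. The sketch below runs on a small made-up DataFrame (`toy`), not on the real data, and adds a check for missing values as one more assumption about what is worth verifying.

```python
import pandas as pd

# Toy stand-in for raw_data, invented for this example
toy = pd.DataFrame({"tweet": ["no network", "slow 4G", "no network", None]})

n_rows = len(toy)
n_unique = toy["tweet"].nunique()                     # distinct non-missing values
n_duplicates = int(toy["tweet"].duplicated().sum())   # rows repeating an earlier tweet
n_missing = int(toy["tweet"].isna().sum())            # rows with no tweet text

print(n_rows, n_unique, n_duplicates, n_missing)
```

One design note: `unique()` counts a missing value as a value of its own, whereas `nunique()` ignores missing values by default, so the two can differ by one when the column contains gaps.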

Let's look in detail at some individual tweets:

In [20]:
raw_data["tweet"][4]

'But when???Still I am not received any call from customer care.Very poor services.'

Just from looking at a few tweets, there seem to be a lot of complaints. This is logical, as people usually go to Twitter to get their complaints noticed more quickly or to raise awareness.

At the same time, we see that a wide variety of topics are mentioned, even in the small number of tweets that we have looked at.

We have 21047 tweets that were gathered over a very short period. Given this large number, it would be almost impossible for a person to go through all the tweets, figure out which topics are present and identify which area most needs addressing (maybe it's the service area, the network area or the number porting area).

## Practise: Cleaning the Data with Pattern Removal

From the previous exploration we know that we're working with tweets addressed to a specific telecommunications company, Vodafone. From just a few tweets we also saw that various topics are present in the Twitter corpus, which will be the textual input to the unsupervised machine learning model that we want to implement.

Now, we will take the first steps in cleaning this data.

The function below will help us remove certain patterns from the data.

It takes the input text and the pattern that we want to remove, and uses the `re` (regex) library to find every occurrence of that pattern in the input.

In [21]:
def remove_pattern(input_text, pattern):
    # re.sub(pattern, repl, string) replaces every match of `pattern`
    # with `repl`; an empty string deletes the matches.
    # Substituting with the pattern itself (rather than with each matched
    # string) avoids treating matched text as a new regular expression.
    return re.sub(pattern, '', input_text)

The above is a generic function: you can give it any pattern and any input text, and it returns the input text with every match of the pattern removed.
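
As a quick illustration of how such a helper behaves, the sketch below applies a single `re.sub` call to a made-up sample string. The name `strip_pattern` and the sample text are hypothetical and exist only for this example.

```python
import re

def strip_pattern(text, pattern):
    """Delete every match of a regex pattern from text (hypothetical helper)."""
    return re.sub(pattern, "", text)

sample = "@VodafoneIN I have done my recharge (555) but no service"

# Remove @ mentions, then remove runs of digits
no_mentions = strip_pattern(sample, r"@\w*")
no_digits = strip_pattern(no_mentions, r"\d+")

print(repr(no_mentions))
print(repr(no_digits))
```

Passing the pattern straight to `re.sub` means the text being cleaned is never interpreted as a regular expression, so characters such as `(` or `?` in a tweet cannot cause surprises.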

#### First step will be to remove any @ mentions
All of this data is addressed to Vodafone anyway, so having @Vodafone in the text doesn't add any useful information. A mention that is present in every tweet would only act as noise for the algorithm.

In [24]:
# np.vectorize is a convenient way of writing a for-loop:
# it applies remove_pattern to every value in the raw_data["tweet"] pandas Series

# The pattern that we want to remove is @[\w]*, i.e. any word starting with @.
# The r"" raw string keeps the backslash in \w intact.
raw_data["clean_text"] = np.vectorize(remove_pattern)(raw_data["tweet"], r"@[\w]*")
raw_data["clean_text"]

0        Please check what happens to this no 940417705...
1        Network fluctuations and 4G Speed is pathetic....
2        This has been going on since 3rd... this absol...
3          I have done my recharge of 555 on 9709333370...
4        But when???Still I am not received any call fr...
5         mere area me vodafone ka network nai aa raha ...
6        Thanks, but I have visited the website, called...
7         \nHi,\n Today my Vodafone cim is deactivated ...
8        Dear Vodafone, I have already responded to you...
9         SIR OUR MARKET AREA ME BILKUL NETWORK NAHI AA...
10          Why the hell Previous plan Deactivated of 2...
11       Vodafone Netwrk is worst ever...Using from so ...
12        9796053999... internet not working, pls assis...
13                        Still waiting for your reply    
14       Reverse migration ka zamana hai dear Alu. Kar ...
15        Unable to access your website and is showing ...
16       Worst customer care, I charged 1072 for intern.

We can see that all @ mentions (such as @VodafoneIN) have been removed, as we wanted.
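
As an alternative sketch, pandas' own string methods can do the same removal without `np.vectorize`. The small `toy` DataFrame below is made up for the example; this is one possible approach, not necessarily the one this blog relies on.

```python
import pandas as pd

# Toy stand-in for raw_data, invented for this example
toy = pd.DataFrame({"tweet": ["@VodafoneIN no network today",
                              "Thanks @VodafoneIN @someone"]})

# Vectorised string method; regex=True treats the pattern as a regular expression
toy["clean_text"] = toy["tweet"].str.replace(r"@\w*", "", regex=True)

print(toy["clean_text"].tolist())
```

Removing a mention leaves the surrounding spaces behind, so a later whitespace clean-up (for example `.str.strip()`) may be worth adding.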

#### Second step will be to remove everything that is not a letter
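
The text above does not show how this step is implemented, so the following is only a minimal sketch of one way it could be done with `re`. The helper name `keep_letters` is hypothetical, and the choice to keep only ASCII letters is an assumption.

```python
import re

def keep_letters(text):
    """Replace every non-letter with a space, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return re.sub(r"\s+", " ", letters_only).strip()

sample = "But when???Still I am not received any call from customer care.Very poor services."
cleaned = keep_letters(sample)

print(cleaned)
```

Non-letters are replaced with a space rather than deleted outright, so that words joined only by punctuation (such as "care.Very") do not get merged into one token.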

## Practise: Tokenise and Identify Special Instances of Tweets

## Practise: Vectoriser

## Theory: Understanding K-Means

## Practise: Clustering with 8 centroids

## Practise: Clustering with 2 centroids

## Practise: Clustering with 2 centroids - Word Clouds

## Practise: General Function Homogeneity in Cluster - Finding the optimal Cluster Number