<h1> NATURAL LANGUAGE PROCESSING (N.L.P.)</h1>

Natural language is the everyday language we humans use to communicate our thoughts and ideas to one another.<br> The task now is to communicate with a machine, and this is what we accomplish using N.L.P. We have all come across scenarios where N.L.P. is used in real time, such as:
> Swiggy/Zomato chatbots.<br>
> Smart replies in Gmail, WhatsApp, LinkedIn, etc.
<br>

N.L.P. is not limited to talking to a machine. It involves a wide variety of tasks like:
- Classification of sentiments
- Information Extraction
- Information Retrieval Systems
- Neural Machine Translation

and many more.


A common misconception is that `we can converse with a machine using a programming language.` This is not true: using a programming language we **instruct** the machine what to do, and **instructing is not the same as conversing**.

As we move towards making human and machine interaction possible, there are certain caveats that we need to address:
1. <u><b>Complexity of Representation</b></u>: Any human language comes with its own grammar rules, sentiment, sarcasm, etc., which makes it very complex to represent.
For e.g.
> in English : I am a student at Innomatics Research Labs<br>
> in French : Je suis étudiante à Innomatics Research Labs<br>

In French the statement carries grammatical gender ("étudiante" is the feminine form), which is not present in the English sentence. This is the complexity of language.

2. <u><b>Ambiguity</b></u>: This problem arises in multiple scenarios, for example when a word has an entirely different meaning depending upon the situation or context in which it is used. For e.g.<br>
- Cows are grazing at the `bank`. {Here the bank is a river bank}
- I went to the `bank` to deposit money. {Here the bank is a financial institution}

Using N.L.P. techniques we can handle these problems up to a certain extent.

<h2>Working with Text Data</h2>

All the steps for using M.L. with text data will be the same as discussed previously with numerical and categorical data.<br>
The only difference will come in how we are going to perform `Data Cleaning` and `Data Transformation`

To perform N.L.P. tasks using python, the most commonly used modules are:
- `nltk` (Natural Language Tool Kit)
- `spacy`

For M.L. we will primarily focus on nltk and for D.L. based N.L.P. tasks we will use spacy.

<h2>N.L.P. Terminologies</h2>

- Each row in an N.L.P. dataset is called a `Document`.
- The entire data, which is a collection of Documents, is called a `Corpus`.

<h1> Data Transformation for Text </h1>

Text has to be transformed into numerical values so that the M.L. algorithms can understand it and give a model as output. We will look at data transformation first and data cleaning afterwards, so that the importance of data cleaning becomes clear.<br>
There are many ways of transforming text into numerical values; however, we will primarily look at:
- Bag of Words (BoW)
- Term Frequency - Inverse Document Frequency (TF-IDF)

The remaining techniques will be explored further in D.L.

<h2>Bag of Words (BoW)</h2>

Bag of Words (BoW) is a popular technique in Natural Language Processing (NLP) used to represent text data. <br>It is a simple and effective way to convert text documents into numerical feature vectors, which can then be used as input to machine learning models. <br>The basic idea behind BoW is to represent a document as a multiset of its words, disregarding grammar and word order.<br>
It works in 2 steps:
- Step 1: Learn the vocabulary (i.e. unique words) from the entire corpus and create the feature vector using the vocabulary.
- Step 2: Count the number of times each feature has appeared and create the Document Term Matrix (D.T.M.).

Document Term Matrix is the numerical representation of the given corpus.
<br>

<u><b>NOTE: AN IMPORTANT POINT TO KEEP IN MIND IS THAT WHEN BoW CREATES THE FEATURE VECTOR IT SORTS ALL THE FEATURES ALPHABETICALLY.</b></u>

<h3>Example for BoW</h3>

Consider the following two sentences:

- Sentence 1: "The cat sat on the mat."
- Sentence 2: "The dog jumped over the fence."

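To create a BoW representation for these sentences, we first lowercase the text and create a vocabulary containing all unique words in both sentences (listed here in order of first appearance for readability; scikit-learn's `CountVectorizer`, used in the code that follows, sorts the vocabulary alphabetically):

- Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "jumped", "over", "fence"]

Next, we represent each sentence as a vector where each element corresponds to the frequency of a word in the vocabulary. Note that "the" occurs twice in each sentence, so its count is 2:

- Sentence 1 BoW: [2, 1, 1, 1, 1, 0, 0, 0, 0]
- Sentence 2 BoW: [2, 0, 0, 0, 0, 1, 1, 1, 1]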
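
The two steps of BoW can be sketched in plain Python without any library. This is only a minimal illustration, not how scikit-learn implements it: the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and the tokenizer (lowercase, keep runs of letters and digits) is an assumption made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_vocabulary(corpus):
    # Step 1: learn the unique words of the whole corpus, sorted alphabetically.
    return sorted({token for doc in corpus for token in tokenize(doc)})

def transform(corpus, vocabulary):
    # Step 2: count how often each vocabulary word occurs in each document.
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows

corpus = ["The cat sat on the mat.", "The dog jumped over the fence."]
vocabulary = fit_vocabulary(corpus)
dtm = transform(corpus, vocabulary)

print(vocabulary)
for i, row in enumerate(dtm, start=1):
    print(f"Sentence {i} BoW: {row}")
```

Because the vocabulary is sorted alphabetically here, "the" ends up in the last column with a count of 2 for both sentences.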

<u><b>NOTE:
- BY NO MEANS IS BoW THE SAME AS ONE HOT ENCODING. O.H.E PURELY USES 0 AND 1 ONLY IN ITS ENCODING WHEREAS BoW USES THE FREQUENCY OF THE WORDS.
- DATA CLEANING IS VERY VERY IMPORTANT FOR BoW AS IF IT IS NOT DONE PROPERLY IT WILL LEAD TO CURSE OF DIMENSIONALITY.</b></u>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample sentences
sentences = [
    "The cat sat on the mat.",
    "The dog jumped over the fence."
]

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the sentences
X = vectorizer.fit_transform(sentences)

print("Document Term Matrix\n",X)
# this output will be in the form of sparse matrix


# Get the vocabulary
vocabulary = vectorizer.get_feature_names_out()
print("\n\n Vocabulary learnt by BoW:",vocabulary)
print()

# Convert to array and print BoW representation
bow_representation = X.toarray()
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1} BoW: {bow_representation[i]}")
print()

# Lets see the DTM in expanded form with vocabulary
df = pd.DataFrame(bow_representation, columns = vocabulary)
df

Document Term Matrix
   (0, 8)	2
  (0, 0)	1
  (0, 7)	1
  (0, 5)	1
  (0, 4)	1
  (1, 8)	2
  (1, 1)	1
  (1, 3)	1
  (1, 6)	1
  (1, 2)	1


 Vocabulary learnt by BoW: ['cat' 'dog' 'fence' 'jumped' 'mat' 'on' 'over' 'sat' 'the']

Sentence 1 BoW: [1 0 0 0 1 1 0 1 2]
Sentence 2 BoW: [0 1 1 1 0 0 1 0 2]



Unnamed: 0,cat,dog,fence,jumped,mat,on,over,sat,the
0,1,0,0,0,1,1,0,1,2
1,0,1,1,1,0,0,1,0,2


<h3>Understanding Sparsity and Sparse Matrix</h3>

- Sparsity: In layman's terms, sparsity is the problem that arises when most of the values in a large matrix are 0 and carry no information. Storing such a matrix directly wastes a lot of memory, and any computation done on it is also costly.
- Sparse Matrix: It is a solution to the above mentioned problem of sparsity. Only the non-zero values are stored, each together with its (row, column) position.<br>
For e.g. in the above output `(0, 8) 2` means that the value at row index 0 and column index 8 is 2.
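
As a rough illustration of the idea, the dense BoW matrix from the example can be converted to (row, column) → value pairs in plain Python. The function names `to_coordinate_form` and `sparsity` are hypothetical, and real sparse-matrix libraries use more compact layouts than a dictionary.

```python
def to_coordinate_form(dense):
    # Keep only the non-zero entries, keyed by their (row, column) position.
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }

def sparsity(dense):
    # Fraction of entries that are zero.
    total = sum(len(row) for row in dense)
    zeros = sum(1 for row in dense for value in row if value == 0)
    return zeros / total

# Dense Document Term Matrix of the two example sentences
dense_dtm = [
    [1, 0, 0, 0, 1, 1, 0, 1, 2],
    [0, 1, 1, 1, 0, 0, 1, 0, 2],
]

sparse_dtm = to_coordinate_form(dense_dtm)
print(sparse_dtm[(0, 8)])   # value stored at row 0, column 8
print(len(sparse_dtm), "stored values instead of", 2 * 9)
print(f"sparsity: {sparsity(dense_dtm):.2f}")
```

With only two short documents the saving is small, but for a real corpus with tens of thousands of vocabulary words almost every entry is zero.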

<h4>Considerations and Edge Cases</h4>

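- Case Sensitivity: A plain token count treats "The" and "the" as two different features. scikit-learn's `CountVectorizer` lowercases the text by default (`lowercase=True`), which is why the vocabulary above contains only lowercase words; with other implementations you may need to convert the text to lowercase yourself.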
- Stop Words: Common words like "the", "and", "is" often don't carry much meaning. Consider removing them from the vocabulary to focus on more informative terms.{We will see more about this in Data Cleaning}
- Tokenization: BoW relies on tokenization to split text into words. Consider different tokenization strategies based on your specific use case.
- Sparse Representation: BoW matrices can be very large and sparse, especially for large vocabularies or datasets. Consider using sparse matrix representations for memory efficiency.
- Handling Out-of-Vocabulary (OOV) Words: Decide how to handle words in test data that are not present in the training vocabulary. They can be ignored or represented separately.
- Word Order: BoW disregards word order and context, which may lead to loss of information in certain tasks like sentiment analysis or machine translation.
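
The out-of-vocabulary case can be sketched as follows, assuming the simplest policy of ignoring unseen words (this is also what a fitted scikit-learn vectorizer does in `transform`). The function name `transform_with_fixed_vocabulary` and the test sentence are made up for this illustration.

```python
import re
from collections import Counter

# Vocabulary learnt from the two training sentences of the BoW example
vocabulary = ["cat", "dog", "fence", "jumped", "mat", "on", "over", "sat", "the"]

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def transform_with_fixed_vocabulary(text, vocabulary):
    counts = Counter(tokenize(text))
    # Words outside the vocabulary get no column, so they are simply dropped.
    vector = [counts[term] for term in vocabulary]
    known = set(vocabulary)
    oov_terms = sorted(term for term in counts if term not in known)
    return vector, oov_terms

vector, oov_terms = transform_with_fixed_vocabulary("The bird sat on the fence.", vocabulary)
print(vector)
print("Ignored OOV words:", oov_terms)
```

The word "bird" never appeared during training, so it contributes nothing to the vector; the model cannot learn anything from it.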

<h2>N-Grams Approach</h2>

The N-gram approach is a technique in Natural Language Processing (NLP) used to capture the structure and context of textual data by considering sequences of N consecutive words (or characters) as one word. N-grams are essentially contiguous sequences of N items (words, characters, etc.) extracted from a text.

Let's see an example:

- Consider the sentence: "The cat sat on the mat."
  - Unigrams (1-grams): ["The", "cat", "sat", "on", "the", "mat"]
  - Bigrams (2-grams): ["The cat", "cat sat", "sat on", "on the", "the mat"]
  - Trigrams (3-grams): ["The cat sat", "cat sat on", "sat on the", "on the mat"]
  - 4-grams: ["The cat sat on", "cat sat on the", "sat on the mat"]
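
The lists above can be generated with a short sliding-window function. This is a minimal sketch; the function name `ngrams` is chosen for this example and the sentence is assumed to be already tokenized.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the tokens and join each window into one string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "cat", "sat", "on", "the", "mat"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with 6 tokens yields 6 - N + 1 N-grams, so there are 5 bigrams and 4 trigrams. If N is larger than the number of tokens the result is an empty list.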

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd  # needed for the DataFrame at the end of this cell

# Sample sentences
sentences = [
    "The cat sat on the mat.",
    "The dog jumped over the fence."
]

# Create CountVectorizer object with ngram_range parameter
vectorizer = CountVectorizer(ngram_range=(1, 2))
# Change ngram_range for different N-grams

# Fit and transform the sentences
X = vectorizer.fit_transform(sentences)
print("Document Term Matrix:\n",X)
print()

# Get the vocabulary
vocabulary = vectorizer.get_feature_names_out()
print("\nVocabulary Learnt:\n",vocabulary)
print()

# Convert to array and print N-gram representation
ngram_representation = X.toarray()

# Lets see the DTM in expanded form with vocabulary
df = pd.DataFrame(ngram_representation, columns = vocabulary)
df

Document Term Matrix:
   (0, 14)	2
  (0, 0)	1
  (0, 12)	1
  (0, 8)	1
  (0, 7)	1
  (0, 15)	1
  (0, 1)	1
  (0, 13)	1
  (0, 9)	1
  (0, 18)	1
  (1, 14)	2
  (1, 2)	1
  (1, 5)	1
  (1, 10)	1
  (1, 4)	1
  (1, 16)	1
  (1, 3)	1
  (1, 6)	1
  (1, 11)	1
  (1, 17)	1


Vocabulary Learnt:
 ['cat' 'cat sat' 'dog' 'dog jumped' 'fence' 'jumped' 'jumped over' 'mat'
 'on' 'on the' 'over' 'over the' 'sat' 'sat on' 'the' 'the cat' 'the dog'
 'the fence' 'the mat']



Unnamed: 0,cat,cat sat,dog,dog jumped,fence,jumped,jumped over,mat,on,on the,over,over the,sat,sat on,the,the cat,the dog,the fence,the mat
0,1,1,0,0,0,0,0,1,1,1,0,0,1,1,2,1,0,0,1
1,0,0,1,1,1,1,1,0,0,0,1,1,0,0,2,0,1,1,0


<b><u>NOTE: IN THE ABOVE CODE IF YOU GIVE THE `ngram_range = (1,2)` THEN THE VOCABULARY WILL CONTAIN 1 GRAM + 2 GRAM VOCABULARY.<br>
HOWEVER IF YOU WANT ONLY 2 GRAM VOCABULARY THEN GIVE `ngram_range=(2,2)`.</u></b>

<h2>ADVANTAGES AND DISADVANTAGES OF BAG OF WORDS</h2>

1. Advantages
  - It is simple to understand and implement like One Hot Encoding *{BoW is replacing 0 & 1 with the count of the feature in the given document}*.
  - It gives a fixed length encoding for any sequence of arbitrary length as long as the vocabulary does not change.
  - Documents with same word/vocabulary will have similar representation. So if two documents have a similar vocabulary, they will be closer to each other in the vector space and vice-versa.

2. Disadvantages
  - 1) The size of the feature vector increases with the size of the vocabulary, which makes sparsity a continuing problem.
      - Solution: It can be tackled by limiting the vocabulary to the most frequent words (for example with the `max_features` parameter of `CountVectorizer`).
  - 2) It does not capture the similarity between different words that mean the same thing i.e. **Semantic Meaning is not captured.**
    - **Is there an algorithm to solve the above 2 problems?**
      - Yes, it is *word2vec algorithm*. {will be covered in D.L.}

  - 3) BoW representation does not have any way to handle O.O.V. (Out of Vocabulary) words.
    - O.O.V. are the new words which were not seen in the corpus that was used to build the vectorizer in the training phase.
  - 4) Word order information is lost in the BoW representation.
    - Solution: One Way to control is to use N-Grams approach.
      - **Is there a way to solve problem 3 and 4 ?**
        - Yes, it is the *BERT Algorithm* (Bidirectional Encoder Representations from Transformers) {will be covered in D.L.}
  - 5) BoW representation suffers from curse of dimensionality.
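
Disadvantage 4 (loss of word order) and its N-gram workaround can be seen on a tiny example. The two sentences are made up for illustration, and the helper names `bow` and `bigram_bow` are hypothetical.

```python
from collections import Counter

def bow(tokens):
    # Unigram Bag of Words: word -> count
    return Counter(tokens)

def bigram_bow(tokens):
    # Bag of bigrams: (word, next word) -> count
    return Counter(zip(tokens, tokens[1:]))

doc_a = "dog bites man".split()
doc_b = "man bites dog".split()

same_unigrams = bow(doc_a) == bow(doc_b)
same_bigrams = bigram_bow(doc_a) == bigram_bow(doc_b)

print("Identical with unigrams:", same_unigrams)
print("Identical with bigrams: ", same_bigrams)
```

With unigrams the two sentences are indistinguishable even though they mean opposite things. Bigrams keep a little local order and separate them, at the price of a larger vocabulary.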

<h2>Term Frequency - Inverse Document Frequency (TF-IDF)</h2>

TF-IDF is another popular technique in Natural Language Processing (NLP) used to represent text data. It reflects how important a word is to a document within a collection of documents. TF-IDF combines two metrics: term frequency (TF), which measures the frequency of a term in a document, and inverse document frequency (IDF), which measures how rare a term is across documents in a corpus.
<br>

The working of TF-IDF can be divided into 2 steps:
- Step 1: Learn the vocabulary (i.e. unique words) from the entire corpus and create the feature vector using the vocabulary.
- Step 2: For each feature in each document compute the TF-IDF value.


$$ \text{TF-IDF}(word_i, doc_j) = TF(word_i, doc_j) \times IDF(word_i, corpus) $$

$$ TF(word_i, doc_j) = \frac{\text{No. of times } word_i \text{ occurs in } doc_j}{\text{Total no. of words in } doc_j} $$

$$ IDF(word_i, corpus) = \log\left(\frac{\text{No. of docs in corpus}}{\text{No. of docs which contain } word_i}\right) $$

The base of the logarithm is a matter of convention; the natural logarithm is a common choice.


Let's take a look at an example and see the working:<br>

Consider a corpus containing three documents:

- Document 1: "The cat sat on the mat."
- Document 2: "The dog jumped over the fence."
- Document 3: "The cat and the dog are friends."

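To compute TF-IDF for the term "cat" in Document 1 using the formulas above:

- Term Frequency (TF): "cat" appears once among the 6 words of Document 1, so TF = 1/6 ≈ 0.167
- Inverse Document Frequency (IDF): log(N / df), where N = 3 is the total number of documents and df = 2 is the number of documents containing the term "cat", so IDF = log(3/2) ≈ 0.405 (natural log)
- TF-IDF = TF * IDF ≈ 0.167 * 0.405 ≈ 0.068

Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so the values it prints will not equal this hand-computed number.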
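
The textbook formulas can be checked with a few lines of plain Python. This is a sketch under assumptions: the natural logarithm is used, the tokenizer is a simple lowercase word split, and the function names are hypothetical.

```python
import math
import re

documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The cat and the dog are friends.",
]

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(term, doc_tokens):
    # Share of the document's words that are equal to the term.
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, tokenized_corpus):
    # log(N / df); raises ZeroDivisionError if the term occurs in no document.
    df = sum(1 for tokens in tokenized_corpus if term in tokens)
    return math.log(len(tokenized_corpus) / df)

tokenized = [tokenize(doc) for doc in documents]

tf_cat = term_frequency("cat", tokenized[0])
idf_cat = inverse_document_frequency("cat", tokenized)
tfidf_cat = tf_cat * idf_cat

print(f"TF = {tf_cat:.3f}, IDF = {idf_cat:.3f}, TF-IDF = {tfidf_cat:.3f}")
```

With this formula a word that occurs in every document, such as "the", gets an IDF of log(3/3) = 0 and therefore a TF-IDF of 0, however often it appears.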

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The cat and the dog are friends."
]

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)
print("Document Term Matrix:\n",X) # Output will be a sparse matrix
print()

# Get the vocabulary
vocabulary = vectorizer.get_feature_names_out()
print("\n Vocabulary learnt:\n",vocabulary)
print()

# Convert to array and print TF-IDF representation
tfidf_representation = X.toarray()

# Lets see the DTM in expanded form with vocabulary
df = pd.DataFrame(tfidf_representation, columns = vocabulary)
df

Document Term Matrix:
   (0, 7)	0.44839402160692654
  (0, 8)	0.44839402160692654
  (0, 10)	0.44839402160692654
  (0, 2)	0.3410152109911944
  (0, 11)	0.5296574648148862
  (1, 4)	0.44839402160692654
  (1, 9)	0.44839402160692654
  (1, 6)	0.44839402160692654
  (1, 3)	0.3410152109911944
  (1, 11)	0.5296574648148862
  (2, 5)	0.42439575294071896
  (2, 1)	0.42439575294071896
  (2, 0)	0.42439575294071896
  (2, 3)	0.32276390910429226
  (2, 2)	0.32276390910429226
  (2, 11)	0.5013099366829596


 Vocabulary learnt:
 ['and' 'are' 'cat' 'dog' 'fence' 'friends' 'jumped' 'mat' 'on' 'over'
 'sat' 'the']



Unnamed: 0,and,are,cat,dog,fence,friends,jumped,mat,on,over,sat,the
0,0.0,0.0,0.341015,0.0,0.0,0.0,0.0,0.448394,0.448394,0.0,0.448394,0.529657
1,0.0,0.0,0.0,0.341015,0.448394,0.0,0.448394,0.0,0.0,0.448394,0.0,0.529657
2,0.424396,0.424396,0.322764,0.322764,0.0,0.424396,0.0,0.0,0.0,0.0,0.0,0.50131
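
The numbers printed by `TfidfVectorizer` differ from the textbook formula because its default settings use the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scale every row to unit L2 length. The plain-Python sketch below reproduces that computation for the three documents; the variable and function names are chosen for this illustration, and the tokenizer (lowercase words of at least two characters) is an approximation of the library's default.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The cat and the dog are friends.",
]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase words of 2+ characters.
    return re.findall(r"[a-z0-9]{2,}", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
doc_freq = {term: sum(1 for tokens in tokenized if term in tokens) for term in vocabulary}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]  # raw count * IDF
    norm = math.sqrt(sum(value * value for value in raw))    # L2 norm of the row
    return [value / norm for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
cat_index = vocabulary.index("cat")
print(round(rows[0][cat_index], 6))  # ~0.341015, the value shown for "cat" in document 0
```

Because of the "+ 1" in the IDF, a word that occurs in every document (here "the") keeps a non-zero weight instead of dropping to 0.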


<h4>Considerations and Edge Cases</h4>

- Normalization: TF-IDF values are often normalized to prevent bias towards longer documents.
- Stop Words: Similar to BoW, consider removing common stop words to improve the quality of TF-IDF representations.
- Tokenization: Like BoW, TF-IDF relies on tokenization. Ensure consistent tokenization strategies across documents.
- Sparse Representation: TF-IDF matrices can also be large and sparse. Consider using sparse matrix representations for efficiency.
- Handling Out-of-Vocabulary (OOV) Words: Decide how to handle OOV words, similar to BoW.
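- Smoothing: Consider using smoothing techniques to handle terms with a document frequency of zero (for example, a term in a fixed vocabulary that never appears in the corpus), which would otherwise cause a division by zero in the IDF formula.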

<h1> Cleaning Text Data</h1>

Text data made available for any N.L.P. task generally contains a lot of unnecessary elements which we need to clean out. The benefit is that we keep only the relevant and important information and discard the rest.

There are a lot of steps which can be performed for a specific N.L.P. task but some of the most common steps are:

1. <b><u>Converting all text data to lower case.</u></b>
2. <b><u>Removing all the special characters.</u></b>
3. <b><u>Removing the stopwords.</u></b>
4. <b><u>Converting every word to its root form.</u></b>

Let's explore these steps in detail.

<h3><b><u>1. Converting All Text Data To Lower Case</u></b></h3>

This is the first cleaning step, and it is performed for a very specific reason: the machine treats `"Adam"` and `"adam"` as two separate strings/words, whereas we know that they are the same thing.<br> To ensure that the machine/model treats them as the same, we convert the text to lower case.

<h3><b><u>2. Removing All The Special Characters</u></b></h3>

Special characters like `.,"";` etc. are removed because these symbols only enhance readability for humans and do not contribute any significant meaning to the sentence.<br>

For e.g.: `I told them, "My name is Adam".`; this sentence is the way in which humans write and are accustomed to reading and interpreting it.<br>
Now after we remove the special characters and convert to lower case:<br>
`i told them my name is adam`; this sentence conveys the same meaning as the previous one.<br> In N.L.P. the meaning and the understanding of the intent are more important than the way it is written.<br>*{In advanced N.L.P. techniques we will see that even the punctuation and presentation of output can be taken care of but that is a topic for later modules.}*

<b><u>NOTE: THE ABOVE TWO STEPS WORK ON CHARACTER LEVEL TOKENS</u></b>
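
Steps 1 and 2 can be sketched with Python's built-in `re` module. The function name `clean_text` is hypothetical, and the pattern assumes English text: anything that is not a lowercase letter, digit or whitespace (including accented letters) is treated as a special character.

```python
import re

def clean_text(text):
    # Step 1: convert everything to lower case.
    lowered = text.lower()
    # Step 2: replace every character that is not a letter, digit or whitespace.
    no_special = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse the extra spaces left behind by the removed characters.
    return " ".join(no_special.split())

cleaned = clean_text('I told them, "My name is Adam".')
print(cleaned)
```

Special characters are replaced by a space rather than deleted so that words joined by punctuation, such as "well-known", do not get fused into one token.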

<h3><b><u>3. Removing Stopwords</u></b></h3>

`Stopwords` are words which enhance the human readability of a sentence; removing them does not change its core meaning. Examples of stopwords: I, me, am, the, etc.<br>
For e.g.
- `i told them my name is adam`. the core meaning of this sentence is that a name was told as adam.
- After removing stopwords, `told name adam`. Here also the core meaning is that a name was told as adam.

When working in NLP, the most important thing is to ensure that the model gets the core meaning of the sentence. After that you can make a sentence readable, translate it, etc., but all of this can only happen if the core meaning is understood correctly.<br>

Another advantage of removing the stopwords is that it helps in countering the `Curse of Dimensionality`: after removing stopwords the total number of words is reduced, which in turn means we need numerical representations for fewer words and therefore fewer dimensions. **All this happens without losing the core crux of the sentence.**
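
Stopword removal can be sketched as a simple filter. The `STOPWORDS` set below is a tiny hand-picked list for this illustration only; libraries such as `nltk` ship much longer lists.

```python
# Tiny illustrative stopword list (an assumption for this example)
STOPWORDS = {"i", "me", "my", "am", "is", "the", "them", "a", "an"}

def remove_stopwords(text):
    # Keep only the words that are not in the stopword list.
    return " ".join(word for word in text.split() if word not in STOPWORDS)

sentence = "i told them my name is adam"
result = remove_stopwords(sentence)

print(result)
print(len(sentence.split()), "words before,", len(result.split()), "words after")
```

The text is assumed to be already lowercased and free of special characters, which is why this step comes after steps 1 and 2.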

<h3><b><u>4. Converting Every Word To Its Root Form</u></b></h3>

In order to convert a given word to its root form the techniques which we will be using are:
- Stemming
- Lemmatization

Reducing words to their root form is important for several reasons:

1. **Text Normalization**: It simplifies and standardizes words, which is crucial for processing and understanding text data.
2. **Reduces Complexity**: By converting words to their base form, it helps in reducing the complexity of the text, making it easier for algorithms to process.
3. **Improves Accuracy**: It can improve the accuracy of various NLP tasks like text classification, information retrieval, and text summarization by reducing the dimensionality of the data.
4. **Facilitates Search**: It allows search algorithms to equate different forms of a word, enhancing the ability to find relevant results.
5. **Efficient Language Processing**: It plays a role in making language processing more efficient and intelligent by bringing words with similar meanings to their root form.

Now, let's understand stemming and lemmatization:
- **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. For example, the stem of "running" and "runs" is "run". An irregular form such as "ran" is typically left unchanged by a stemmer, because no suffix can be chopped off to reach "run".

- **Lemmatization**, on the other hand, usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. For instance, the lemma of "was" is "be", and the lemma of "mice" is "mouse".

- Here's an example sentence: `"The striped bats are hanging on their feet for best."`

- **Stemming** might reduce the sentence to: `"The strip bat are hang on their feet for best."`
- **Lemmatization** might result in: `"The striped bat be hang on their foot for good."`

<b><u>NOTE</u></b>:
- Stemming can produce non-words (for example, the Porter stemmer reduces "studies" to "studi") or a different real word, as in "strip" from "striped", whereas lemmatization attempts to return dictionary words.
- Lemmatization is more sophisticated and considers the context of the word (such as its part of speech) in order to determine its base form.
- Stemming is faster than lemmatization as it does not worry about the human readability of root words.
- The speed of stemming comes at the cost of being very blunt: it simply chops off extra parts to reduce a word to its root form.
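
The difference between the two approaches can be illustrated with toy versions written in plain Python. Both functions are deliberately simplistic and hypothetical: `toy_stem` is not the Porter algorithm, and `LEMMA_TABLE` is a tiny hand-written lookup standing in for the vocabulary and morphological analysis that a real lemmatizer uses.

```python
# Suffixes the toy stemmer chops off, checked in this order
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    # Rule based: remove the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written dictionary standing in for a real lemmatizer's vocabulary
LEMMA_TABLE = {"was": "be", "are": "be", "mice": "mouse", "feet": "foot", "ran": "run"}

def toy_lemmatize(word):
    # Dictionary based: look the word up and fall back to the word itself.
    return LEMMA_TABLE.get(word, word)

for word in ["hanging", "studies", "ran", "mice"]:
    print(f"{word:8} stem: {toy_stem(word):8} lemma: {toy_lemmatize(word)}")
```

The stemmer is fast and needs no dictionary, but it produces the non-word "studi" and cannot relate "ran" to "run". The lookup handles irregular forms, but only those that are present in its table.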

<h4>Errors Related To Stemming</h4>

- Overstemming: This happens when two or more unrelated words result in the same stem.
- Understemming: This happens when two or more related words result in different stems.
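
Both errors can be reproduced with a small, deliberately aggressive suffix stripper. The function `aggressive_stem`, its suffix list and the word groups are assumptions made for this illustration and do not correspond to any real stemmer.

```python
def aggressive_stem(word):
    # Toy rule: chop the first matching suffix, keeping at least 4 characters.
    for suffix in ("ity", "al", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# Overstemming: words with different meanings collapse to one stem
different_meanings = ["universal", "university", "universe"]
overstemmed = {aggressive_stem(word) for word in different_meanings}
print(overstemmed)

# Understemming: closely related words end up with different stems
related_words = ["alumnus", "alumni"]
understemmed = {aggressive_stem(word) for word in related_words}
print(understemmed)
```

The first set contains a single stem although the three words mean different things, while the second set contains two stems although both words refer to the same concept.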

<h4>Variants of Stemming</h4>

- `Porter Stemmer`: It only works for the English language and may or may not be linguistically correct in giving root words.
- `Snowball Stemmer`: It is an upgraded version of the `Porter Stemmer` as it is able to support multiple languages and is much better at getting root forms of words.
- `Lancaster Stemmer`: It utilizes an iterative approach to convert words to their root form. Of the 3 it is the most aggressive stemmer, which often leads to overstemming. The Lancaster Stemmer is also for the English language only.

<h4><b><u>When to use stemming and when to use lemmatization?</u></b></h4>

- Use `stemming` if the task is like: Building Search Engines, Working with information retrieval systems, etc.

- Use `lemmatization` if the task is like: Sentiment Analysis of reviews, Language Modelling, etc.

<b><u>NOTE: THE ABOVE 2 STEPS WORK ON WORD LEVEL TOKENS</u></b>

