## Introduction

In any machine learning task, cleaning or preprocessing the data is as important as model building, if not more so. For unstructured data like text, this step matters even more.

The objective of this kernel is to walk through the various text preprocessing steps with code examples.

Some of the common text preprocessing / cleaning steps are:
* Lower casing
* Removal of Punctuations
* Removal of Stopwords
* Removal of Frequent words
* Removal of Rare words
* Stemming
* Lemmatization
* Removal of emojis
* Removal of emoticons
* Conversion of emoticons to words
* Conversion of emojis to words
* Removal of URLs 
* Removal of HTML tags
* Chat words conversion
* Spelling correction


These are the different kinds of preprocessing we can apply to text data, but we need not apply all of them every time. We need to choose the steps carefully based on our use case, since that choice also plays an important role.

For example, in a sentiment analysis use case we need not remove emojis or emoticons, since they convey important information about the sentiment. We need to make similar decisions for our own use cases.
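
One way to make this choice explicit is to treat each preprocessing step as a small function and pick a list of steps per use case. The following is only a minimal sketch: the step functions and the two step lists (`TOPIC_STEPS`, `SENTIMENT_STEPS`) are hypothetical names introduced here for illustration, not part of the kernel.

```python
import string


def to_lower(text):
    """Convert the text to lower case."""
    return text.lower()


def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def preprocess(text, steps):
    """Apply each step function to the text, in order."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical step selections for two different use cases.
TOPIC_STEPS = [to_lower, strip_punctuation]
SENTIMENT_STEPS = [to_lower]  # keep punctuation so emoticons like ":)" survive

sample = "Great service :) Thanks!"
topic_text = preprocess(sample, TOPIC_STEPS)
sentiment_text = preprocess(sample, SENTIMENT_STEPS)
print(repr(topic_text))
print(repr(sentiment_text))
```

With this layout, adding or dropping a step for a given task is a one-line change to the corresponding list.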

In [2]:
import numpy as np
import pandas as pd
import re
import nltk
# import spacy
import string
pd.options.mode.chained_assignment = None
from PyPDF2 import PdfReader
# full_df = pd.read_csv("../input/customer-support-on-twitter/twcs/twcs.csv", nrows=5000)
# df = full_df[["text"]]
# df["text"] = df["text"].astype(str)
# full_df.head()

pdf = r'C:\Users\chira\OneDrive\Desktop\frshr\LLM_roadmap\future\generali.pdf'
# Read every page of the PDF and concatenate the extracted text
with open(pdf, 'rb') as pdf_file:
    pdf_reader = PdfReader(pdf_file)
    output = ''
    for page in pdf_reader.pages:
        output += page.extract_text()

output

"Below are our details of products\nFeatured products:\nFuture Generali single premium anchor plan\nfuture generali new assured wealth plan\nfuture generali long term income plan\nfuture generali money back super plan\nfuture generali lifetime partner plan\nfuture generali assured income plan\nfuture generali new assure plus\nfuture generali bima advantage plus\nfuture generali dhan vridhi\nfuture generali Heart and health insurance plan\nServices offered:\nTerm Insurance plans\nEndowment Plan\nULIPs (Unit Linked Insurance Plans)\nHealth insurance plans\nchild plans\nretirement plans\nsavings plan\nOffice location *Redirect to branch locator*\nhttps://life.futuregenerali.in/branch-locator/\nThe pricing or premiums for products can be calculated using the calculator under \nthe calculate premium tab in the website\nHow are our products better than competition\n95.04% Individual claims settlement ratio for FY 2022-23\n1.5 Million lives covered\nFuture Generali India Life Insurance Compan

In [3]:
import pandas as pd
df = pd.DataFrame([output], columns=['text'])
df

Unnamed: 0,text
0,Below are our details of products\nFeatured pr...


## Lower Casing

Lower casing is a common text preprocessing technique. The idea is to convert the input text into a single casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

This is especially helpful for text featurization techniques like frequency counts and tf-idf, since it merges variants of the same word, which reduces duplication and gives correct counts / tf-idf values.

It may not be helpful for tasks like part-of-speech tagging (where proper casing gives some information about nouns and so on) and sentiment analysis (where upper casing can signal anger and so on).

By default, lower casing is done by most modern vectorizers and tokenizers, like [sklearn TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [Keras Tokenizer](https://keras.io/preprocessing/text/). So we need to turn it off (`lowercase=False` for `TfidfVectorizer`, `lower=False` for the Keras `Tokenizer`) when our use case needs the original casing.

In [4]:
df["text_lower"] = df["text"].str.lower()
df.head()

Unnamed: 0,text,text_lower
0,Below are our details of products\nFeatured pr...,below are our details of products\nfeatured pr...
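
To see why casing affects counts, the short sketch below counts a handful of tokens with and without lower casing. The token list is made up for illustration; the same effect is visible in the frequent-words output further down this notebook, where `future` and `Future` are counted separately because the lower-cased column is dropped before counting.

```python
from collections import Counter

# Illustrative tokens only, not taken from the PDF.
tokens = "Future Generali plan future generali PLAN".split()

cased_counts = Counter(tokens)
lowered_counts = Counter(token.lower() for token in tokens)

# Without lower casing, "plan" and "PLAN" are two different entries.
print(cased_counts["plan"], lowered_counts["plan"])
```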


## Removal of Punctuations

Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization step that helps treat 'hurray' and 'hurray!' in the same way.

We also need to carefully choose the list of punctuation symbols to exclude depending on the use case. For example, `string.punctuation` in Python contains the following symbols:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

We can add or remove symbols as needed.

In [5]:
# drop the new column created in last cell
df.drop(["text_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["text_wo_punct"] = df["text"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,text,text_wo_punct
0,Below are our details of products\nFeatured pr...,Below are our details of products\nFeatured pr...
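
As a sketch of tailoring the punctuation list, the example below keeps `%`, `.` and `-` so that figures such as `95.04%` and `2022-23` (both appear in the extracted PDF text) stay readable. The choice of characters to keep and the names `KEEP`, `CUSTOM_PUNCT` and `remove_custom_punctuation` are assumptions made for this illustration.

```python
import string

# Assumption: these characters carry meaning in numeric figures, so keep them.
KEEP = "%.-"
CUSTOM_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP)

FULL_TABLE = str.maketrans("", "", string.punctuation)
CUSTOM_TABLE = str.maketrans("", "", CUSTOM_PUNCT)


def remove_custom_punctuation(text):
    """Remove punctuation except the characters listed in KEEP."""
    return text.translate(CUSTOM_TABLE)


sample = "95.04% Individual claims settlement ratio for FY 2022-23!"
full = sample.translate(FULL_TABLE)         # removes every punctuation symbol
custom = remove_custom_punctuation(sample)  # keeps "%", "." and "-"
print(full)
print(custom)
```

Note that keeping `.` also keeps sentence-ending periods, so this trade-off may or may not suit a given task.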


## Removal of stopwords

Stopwords are commonly occurring words in a language, like 'the', 'a' and so on. They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis. In cases like part-of-speech tagging, we should not remove them, as they provide very valuable information about the POS.

Such stopword lists are already compiled for different languages and we can readily use them. For example, the stopword list for the English language from the nltk package can be seen below.


In [6]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

Similarly, we can get the lists for other languages and use them.

In [7]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop
0,Below are our details of products\nFeatured pr...,Below are our details of products\nFeatured pr...,Below details products Featured products Futur...
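
In the output above, `Below` survives even though `below` is in the stopword list, because the list is lower case and the comparison is case sensitive. The sketch below shows one possible adjustment: compare on the lower-cased token, and optionally keep some stopwords that matter for the task. The tiny `STOPWORDS` set and the `KEEP_WORDS` set are stand-ins chosen for illustration; in the notebook the full nltk list would be used instead.

```python
# Small illustrative stopword set; stands in for stopwords.words("english").
STOPWORDS = {"below", "are", "our", "of", "the", "is", "not"}

# Assumption: negations matter for the task at hand, so they are kept.
KEEP_WORDS = {"not"}
ACTIVE_STOPWORDS = STOPWORDS - KEEP_WORDS


def remove_stopwords_ci(text):
    """Remove stopwords, comparing on the lower-cased form of each token."""
    return " ".join(
        word for word in text.split() if word.lower() not in ACTIVE_STOPWORDS
    )


first = remove_stopwords_ci("Below are our details of products")
second = remove_stopwords_ci("The plan is not expensive")
print(first)
print(second)
```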


## Removal of Frequent words

In the previous preprocessing step, we removed the stopwords based on language information. But if we have a domain-specific corpus, we might also have some frequent words that are of little importance to us.

So this step removes the frequent words in the given corpus. If we use something like tf-idf, this is handled automatically, since words that appear in most documents receive low weights.

Let us get the most common words and then remove them in the next step.

In [8]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('plan', 46),
 ('future', 27),
 ('generali', 27),
 ('term', 24),
 ('income', 24),
 ('Generali', 22),
 ('Future', 20),
 ('life', 20),
 ('policy', 19),
 ('tax', 19)]

In [9]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["text_wo_stop"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,Below are our details of products\nFeatured pr...,Below are our details of products\nFeatured pr...,Below details products Featured products Futur...,Below details products Featured products singl...
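
The remark that tf-idf already discounts frequent words can be illustrated with a hand-rolled inverse document frequency. This is a sketch using the textbook formula `log(N / df)` on three made-up documents; libraries such as scikit-learn use smoothed variants, so their exact values differ.

```python
import math
from collections import Counter

# Made-up mini corpus for illustration.
docs = [
    "term plan with tax benefit",
    "income plan with bonus",
    "health plan for family",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(tokenized)

# Textbook idf; a word present in every document gets weight 0.
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word in ("plan", "with", "tax"):
    print(word, round(idf[word], 3))
```

Here `plan` occurs in every document and ends up with zero weight, which is the effect that explicit frequent-word removal tries to achieve by hand.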


## Removal of Rare words

This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.

In [10]:
# Drop the two columns that are no longer needed
df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)

n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,text,text_wo_stopfreq,text_wo_stopfreqrare
0,Below are our details of products\nFeatured pr...,Below details products Featured products singl...,Below details products Featured products singl...


We can combine all the lists of words (stopwords, frequent words and rare words) into a single list and remove them at once.
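
A minimal sketch of that combined approach is shown below. The helper names `build_removal_set` and `remove_words`, and the toy input, are assumptions made for this example; the idea is simply to take the union of the three sets and filter once.

```python
from collections import Counter


def build_removal_set(texts, stopwords, n_frequent=10, n_rare=10):
    """Union of stopwords, the n most frequent and the n rarest words."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    frequent = set(ranked[:n_frequent])
    rare = set(ranked[-n_rare:]) if n_rare > 0 else set()
    return set(stopwords) | frequent | rare


def remove_words(text, words_to_remove):
    """Drop every token that appears in words_to_remove."""
    return " ".join(w for w in text.split() if w not in words_to_remove)


# Toy input for illustration.
texts = ["plan plan plan term term income the the bonus"]
removal = build_removal_set(texts, stopwords={"the"}, n_frequent=1, n_rare=1)
cleaned = remove_words(texts[0], removal)
print(sorted(removal))
print(cleaned)
```

A single pass over the text then replaces the three separate `apply` calls used above.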

## Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (from [Wikipedia](https://en.wikipedia.org/wiki/Stemming)).

For example, if there are two words in the corpus, `walks` and `walking`, then stemming will strip the suffix to make them `walk`. But say in another example we have two words, `console` and `consoling`; the stemmer will remove the suffix and make them `consol`, which is not a proper English word.

There are several types of stemming algorithms available; one of the best known and most widely used is the Porter stemmer. We can use the nltk package for it.
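
Before using the Porter stemmer, the suffix-stripping idea itself can be sketched in a few lines. This toy function is not the Porter algorithm (which applies many more rules and conditions); the suffix list and the `min_stem_len` parameter are arbitrary choices made for this illustration.

```python
# Suffixes are tried in order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["walks", "walking", "console", "consoling"]
stems = {word: naive_stem(word) for word in words}
print(stems)
```

Even this crude version reproduces the behaviour described above: `walks` and `walking` collapse to `walk`, while `console` and `consoling` collapse to the non-word `consol`.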

In [11]:
from nltk.stem.porter import PorterStemmer

# Drop the two columns 
df.drop(["text_wo_stopfreq", "text_wo_stopfreqrare"], axis=1, inplace=True) 

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["text"].apply(lambda text: stem_words(text))
df.head()

Unnamed: 0,text,text_stemmed
0,Below are our details of products\nFeatured pr...,below are our detail of product featur product...


We can see that words like `private` and `propose` have their `e` at the end chopped off due to stemming. This is not intended. What can we do about that? We can use lemmatization in such cases.

Also, this Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer. The supported languages for the Snowball stemmer are:

In [12]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

## Lemmatization

Lemmatization is similar to stemming in reducing inflected words to their word stem, but differs in that it makes sure the root word (also called the lemma) belongs to the language.

As a result, it is generally slower than stemming. So depending on the speed requirement, we can choose to use either stemming or lemmatization.

Let us use the `WordNetLemmatizer` in nltk to lemmatize our sentences.

In [13]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text_lemmatized"] = df["text"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,text_stemmed,text_lemmatized
0,Below are our details of products\nFeatured pr...,below are our detail of product featur product...,Below are our detail of product Featured produ...


We can see that the trailing `e` in `propose` and `private` is retained when we use lemmatization, unlike stemming.

Wait. There is one more thing in lemmatization. Let us try to lemmatize `founded` now.

In [14]:
lemmatizer.lemmatize("founded")

'founded'

It returned `founded` as such, without converting it to the root form `found`. This is because the lemmatization process depends on the POS tag to come up with the correct lemma, and nltk's `WordNetLemmatizer` treats the word as a noun when no tag is given. Now let us lemmatize again by providing the POS tag for the word.

In [15]:
lemmatizer.lemmatize("founded", "v") # v for verb

'found'

Now we are getting the root form `found`. So we also need to provide the POS tag of the word along with the word for the lemmatizer in nltk. Depending on the POS, the lemmatizer may return different results.

Let us take the same example, `founded`, and check the lemma when it is tagged as a verb and as a noun.

In [16]:
print("Word is : founded")
print("Lemma result for verb : ",lemmatizer.lemmatize("founded", 'v'))
print("Lemma result for noun : ",lemmatizer.lemmatize("founded", 'n'))

Word is : founded
Lemma result for verb :  found
Lemma result for noun :  founded


Now let us redo the lemmatization process for our dataset.

In [17]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

df["text_lemmatized"] = df["text"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,text_stemmed,text_lemmatized
0,Below are our details of products\nFeatured pr...,below are our detail of product featur product...,Below be our detail of product Featured produc...


We can now see that `are` got converted to `be` (the output starts with `Below be our detail`), since we provided the POS tag for lemmatization.
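
The `wordnet_map.get(pos[0], wordnet.NOUN)` lookup used above relies on the first letter of each Penn Treebank tag returned by `nltk.pos_tag`. The standalone sketch below reproduces just that mapping without needing nltk; the single-letter codes are the values of `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV`, while the function name `penn_to_wordnet` is introduced here for illustration.

```python
# WordNet's single-letter POS codes.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# First letter of the Penn Treebank tag -> WordNet POS code.
PENN_PREFIX_TO_WORDNET = {"N": NOUN, "V": VERB, "J": ADJ, "R": ADV}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], NOUN)


for tag in ("VBD", "NNS", "JJ", "RB", "IN", "RP"):
    print(tag, "->", penn_to_wordnet(tag))
```

Anything outside the four groups (prepositions, determiners and so on) falls back to noun. One side effect of matching on the first letter only is that the particle tag `RP` is treated as an adverb, which may or may not matter for a given corpus.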

## Removal of Emojis

With growing usage of social media platforms, there is an explosion in the use of emojis in our day-to-day life as well. We may need to remove these emojis for some of our textual analysis.

A helper function to remove emojis from our text can be based on [this code](https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b).
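
A minimal sketch of such a helper follows, using a regular expression over a few Unicode ranges in the spirit of the gist linked above. The set of ranges is an assumption and is not exhaustive: for example, emojis in the U+1F900 block are not covered, so the pattern would need extending for broader coverage.

```python
import re

# Assumed, non-exhaustive set of emoji code point ranges.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)


def remove_emoji(text):
    """Delete every run of characters matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)


# \U0001F525 is the fire emoji, \U0001F602 is face with tears of joy.
cleaned_fire = remove_emoji("game is on \U0001F525\U0001F525")
cleaned_joy = remove_emoji("Hilarious\U0001F602")
print(repr(cleaned_fire))
print(repr(cleaned_joy))
```

It could be applied to the DataFrame in the same way as the other helpers, for instance with `df["text"].apply(remove_emoji)`.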

In [18]:
output_lemmatized = df['text_lemmatized'][0]

In [19]:
output_lemmatized

"Below be our detail of product Featured products: Future Generali single premium anchor plan future generali new assure wealth plan future generali long term income plan future generali money back super plan future generali lifetime partner plan future generali assure income plan future generali new assure plus future generali bima advantage plus future generali dhan vridhi future generali Heart and health insurance plan Services offered: Term Insurance plan Endowment Plan ULIPs (Unit Linked Insurance Plans) Health insurance plan child plan retirement plan saving plan Office location *Redirect to branch locator* https://life.futuregenerali.in/branch-locator/ The pricing or premium for product can be calculate use the calculator under the calculate premium tab in the website How be our product well than competition 95.04% Individual claim settlement ratio for FY 2022-23 1.5 Million life cover Future Generali India Life Insurance Company Limited offer an extensive range of life insuranc

In [20]:
from langchain.docstore.document import Document

# Wrap the lemmatized text in a single LangChain Document
doc = Document(page_content=output_lemmatized, metadata={"source": "local"})
# type(doc)

from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split into chunks of up to 1000 characters, with up to 200 characters of overlap between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents([doc])
# print(documents)

# ## Vector Embedding And Vector Store
from langchain_openai import OpenAIEmbeddings
# from langchain_community.vectorstores import Chroma
# db = Chroma.from_documents(documents,OpenAIEmbeddings())

from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(documents, OpenAIEmbeddings())
db.save_local(r'C:\Users\chira\OneDrive\Desktop\frshr\LLM_roadmap\future\faiss_index')
# db = FAISS.load_local(r"C:\Users\chira\OneDrive\Desktop\frshr\LLM_roadmap\bharti\faiss_index", OpenAIEmbeddings(), allow_dangerous_deserialization = True)

**More to come. Stay tuned!**