### Introduction : 

    In this article series we will go through the text preprocessing techniques and steps used for Natural Language Processing (NLP) problems. Text is one of the most unstructured forms of data available, and human language is complex to deal with; NLP is all about handling that complexity. So let's start by defining what NLP is.

![image005.png](attachment:image005.png)

    NLP is the art of extracting meaningful information from text. Nowadays many organizations deal with huge amounts of text data such as customer reviews, tweets, newsletters and emails, and extract much more information from that text by using NLP and Machine Learning. Some of the real-world applications of NLP are:

    1. Speech recognition – The task of converting voice data to text data.
    2. Sentiment analysis – The task of extracting qualities like attitudes, emotions, etc. The most basic task in sentiment analysis is to classify the polarity of a sentence as positive, negative, or neutral.
    3. Natural language generation – The task of producing text data from some structured data.
    4. Part-of-speech (POS) tagging – The task of tagging the Part of Speech of a particular word in a sentence based on its definition and context.

![1_Uf_qQ0zF8G8y9zUhndA08w.png](attachment:1_Uf_qQ0zF8G8y9zUhndA08w.png)

    Data preprocessing is an essential first step in the pipeline of building an NLP Machine Learning model, and the quality of the results depends on how well the data has been preprocessed.

### Importance of Text Preprocessing : 

    Raw text data might contain unwanted or unimportant text that lowers the accuracy of our results and makes the data hard to understand and analyze. So, proper pre-processing must be done on raw data.

    Consider that you scraped some tweets from Twitter. For example,

    ” I am wayyyy too lazyyy!!! Never got out of bed for the whole 2 days. #lazy_days “

    The sentences “I am wayyyy too lazyyy!!!” and “I am way too lazy” have the same semantic meaning but give us a totally different vibe, right? Depending on how these data get pre-processed, the results also differ. Pre-processing is therefore one of the most important tasks in NLP. It helps us remove the unimportant things from our data and make our data ready for further processing.

    These text preprocessing steps are widely used for dimensionality reduction. When text gets converted to numerical vectors in the vector space model, each word/term is an axis/dimension, and each text/document is represented as a vector in that multi-dimensional space.

    The number of unique words is therefore the number of dimensions.
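
    As a rough illustration of the point above, the following minimal sketch builds a vocabulary from three toy sentences and counts its size with and without lowercasing. The helper names `vocabulary` and `to_vector` and the sample sentences are assumptions made up for this example; they are not part of any library.

```python
from collections import Counter

# Three toy documents that differ only in spelling and casing.
docs = [
    "I am wayyyy too lazyyy",
    "I am way too lazy",
    "i am WAY too lazy",
]

def vocabulary(documents, lowercase=False):
    """Return the sorted list of unique whitespace-separated terms."""
    terms = set()
    for doc in documents:
        if lowercase:
            doc = doc.lower()
        terms.update(doc.split())
    return sorted(terms)

def to_vector(doc, vocab, lowercase=False):
    """Count how often each vocabulary term occurs in one document."""
    if lowercase:
        doc = doc.lower()
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

raw_vocab = vocabulary(docs)                   # casing kept as-is
norm_vocab = vocabulary(docs, lowercase=True)  # casing normalized

# Each unique term is one dimension of the vector space.
print(len(raw_vocab), len(norm_vocab))
print(to_vector(docs[1], norm_vocab, lowercase=True))
```

    Lowercasing alone merges ‘I’/‘i’ and ‘way’/‘WAY’, so the vectors get shorter without losing any meaning.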

![Screenshot%202021-09-06%20at%205.35.09%20PM.png](attachment:Screenshot%202021-09-06%20at%205.35.09%20PM.png)

    To illustrate the importance of text preprocessing, let’s consider a task on sentiment analysis for customer reviews.
    
    Suppose a customer gave the feedback “their customer support service is a nightmare”. A human can clearly identify the sentiment of the review as negative. For a machine, however, it is not that straightforward.
    
    To illustrate this point, I experimented with the Azure text analytics API. Feeding in the same review, the API returns a result of 50%, i.e., neutral sentiment, which is wrong.

![Screenshot%202021-09-06%20at%202.00.30%20PM.png](attachment:Screenshot%202021-09-06%20at%202.00.30%20PM.png)

    However, if we had performed some text preprocessing, in this case just removing some stopwords (explained further below; for now, think of stopwords as very common words that do not help much in our NLP tasks), we would see that the result becomes 16%, i.e., negative sentiment, which is correct.

![Screenshot%202021-09-06%20at%202.00.17%20PM.png](attachment:Screenshot%202021-09-06%20at%202.00.17%20PM.png)

    So as illustrated, text preprocessing, if done correctly, can help to increase the accuracy of NLP tasks.

### General Outline of Text Preprocessing

    So how do we go about doing text preprocessing? Generally, there are 3 main components:
    
    1. Tokenization
    2. Normalization
    3. Noise removal

![Screenshot%202021-09-06%20at%202.10.18%20PM.png](attachment:Screenshot%202021-09-06%20at%202.10.18%20PM.png)

    In a nutshell, tokenization is about splitting strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized into sentences and sentences can be tokenized into words. Normalization aims to put all text on a level playing field, e.g., converting all characters to lowercase. Noise removal cleans up the text, e.g., removing extra whitespace.

### 1 - Tokenization

    Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized. Tokenization is also referred to as text segmentation or lexical analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.

    This may sound like a straightforward process, but it is anything but. How are sentences identified within larger bodies of text? Off the top of your head you probably say "sentence-ending punctuation," and may even, just for a second, think that such a statement is unambiguous.

    Sure, this sentence is easily identified with some basic segmentation rules:

    The quick brown fox jumps over the lazy dog.

    But what about this one:

    Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog.

    Or this one:

    "What is all the fuss about?" asked Mr. Peters.

    And that's just sentences. What about words? Easy, right? Right?

    This full-time student isn't living in on-campus housing, and she's not wanting to visit Hawai'i.
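
    To see why these examples are tricky, here is a minimal sketch of the "obvious" rules: split sentences after sentence-ending punctuation and split words on whitespace. The function names `naive_sentences` and `naive_words` are made up for this illustration; real tokenizers use much richer rules.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def naive_words(sentence):
    """Split on whitespace only; punctuation stays glued to the words."""
    return sentence.split()

easy = "The quick brown fox jumps over the lazy dog."
hard = "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."

print(naive_sentences(easy))  # one sentence, as expected
print(naive_sentences(hard))  # wrongly cut after 'Dr.', 'Col.' and 'Mr.'
print(naive_words("This full-time student isn't living in on-campus housing."))
```

    The second call returns four fragments for what is a single sentence, and the word splitter leaves decisions about “isn't”, “full-time” and the trailing period completely open.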

    Limitations of Tokenization : 
    
    Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited as most of the words are separated from each other by space. Languages such as Chinese and Thai are said to be unsegmented as words do not have clear boundaries. Tokenizing the unsegmented language requires additional lexical and morphological information. Tokenization is also affected by writing systems. Structures of languages can be grouped into three categories:
    
    Isolating: Words do not divide into smaller units. Example: Mandarin
    
    Agglutinative: Words divide into smaller units. Example: Japanese, Tamil
    
    Inflectional: Boundaries between morphemes are not clear and ambiguous in terms of grammatical meaning. Example: Latin

### 2 - Normalization

    Before further processing, text needs to be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.
    
    So basically, Normalization is the process of converting the token into its basic form (morpheme). Inflection is removed from the token to get the base form of the word. It helps in reducing the number of unique tokens and redundancy in the data. It reduces the data dimensionality and removes the variation of a word from the text.

    Normalizing text can mean performing a number of tasks, but for our framework we will approach normalization in 3 distinct steps: 
    (1) stemming, 
    (2) lemmatization, and 
    (3) everything else.

    (1) Stemming

    Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.

    running → run

    (2) Lemmatization: 

    Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma.

    For example, stemming the word "better" would fail to return its citation form (another word for lemma); however, lemmatization would result in the following:
    
    better → good
    
    It should be easy to see why the implementation of a stemmer would be the less difficult feat of the two.
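
    The difference in effort can be sketched with two toy functions. The suffix list and the lookup table below are hypothetical and deliberately tiny; they only illustrate that a stemmer can get by with a few string rules while a lemmatizer needs a dictionary of known word forms.

```python
# Hypothetical, deliberately tiny rule set and lookup table for illustration.
SUFFIXES = ("ning", "ing", "ed", "s")
LEMMA_TABLE = {"better": "good", "running": "run", "ran": "run"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms."""
    return LEMMA_TABLE.get(word, word)

for w in ("running", "better", "ran"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

    The stemmer handles ‘running’ but has no rule that could turn ‘better’ into ‘good’ or ‘ran’ into ‘run’; those mappings only exist because someone put them in the table.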

    (3) Everything else
    
    A clever catch-all, right? Stemming and lemmatization are major parts of a text preprocessing endeavor, and as such they need to be treated with the respect they deserve. These aren't simple text manipulation; they rely on detailed and nuanced understanding of grammatical rules and norms.

    There are, however, numerous other steps that can be taken to help put all text on equal footing, many of which involve the comparatively simple ideas of substitution or removal. They are, however, no less important to the overall process. These include:

        1. set all characters to lowercase
        2. remove numbers (or convert numbers to textual representations)
        3. remove punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
        4. strip white space (also generally part of tokenization)
        5. remove default stop words (general English stop words)
        
        Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content. As a simple example, the following pangram is just as legible if the stop words are removed:

    The quick brown fox jumps over the lazy dog. -->   quick brown fox jumps over  lazy dog
    
        6. remove given (task-specific) stop words
        7. remove sparse terms (not always necessary or helpful, though!)
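
    Most of the items above are demonstrated with code later in this article; sparse term removal is not, so here is a minimal sketch of one possible approach. It assumes the documents are already tokenized, and the function name `remove_sparse_terms` and the `min_docs` threshold are choices made for this example.

```python
from collections import Counter

def remove_sparse_terms(tokenized_docs, min_docs=2):
    """Drop tokens that occur in fewer than `min_docs` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return [[t for t in tokens if doc_freq[t] >= min_docs]
            for tokens in tokenized_docs]

docs = [
    ["great", "phone", "fast", "delivery"],
    ["phone", "battery", "great"],
    ["battery", "died", "phone"],
]
print(remove_sparse_terms(docs))
```

    Terms such as ‘fast’, ‘delivery’ and ‘died’ appear in only one document each and are dropped, which shrinks the vocabulary. Whether that helps depends on the task, since rare words can also be the most informative ones.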
    
    At this point, it should be clear that text preprocessing relies heavily on pre-built dictionaries, databases, and rules. You will be relieved to find that, when we undertake a practical text preprocessing task in the Python ecosystem in our next article, these pre-built support tools are readily available for our use; there is no need to reinvent the wheel.

### 3 - Noise Removal

    Noise removal continues the substitution tasks of the framework. While the first 2 major steps of our framework (tokenization and normalization) were generally applicable as-is to nearly any text chunk or project (barring the decision of which exact implementation was to be employed, or skipping certain optional steps, such as sparse term removal, which simply does not apply to every project), noise removal is a much more task-specific section of the framework.

    Keep in mind again that we are not dealing with a linear process whose steps must be applied in a fixed order. Noise removal can therefore occur before or after the previously outlined sections, or at some point in between.

    How about something more concrete. Let's assume we obtained a corpus from the world wide web, and that it is housed in a raw web format. We can, then, assume that there is a high chance our text could be wrapped in HTML or XML tags. While this accounting for metadata can take place as part of the text collection or assembly process (step 1 of our textual data task framework), it depends on how the data was acquired and assembled. This previous post outlines a simple process for obtaining raw Wikipedia data and building a corpus from it. As we have control of this data collection and assembly process, dealing with this noise (in a reproducible manner) at this time makes sense.

    But this is not always the case. If the corpus you happen to be using is noisy, you have to deal with it. Recall that analytics tasks are often talked about as being 80% data preparation!

    The good thing is that pattern matching can be your friend here, as can existing software tools built to deal with just such pattern matching tasks.

    1. remove text file headers, footers
    2. remove HTML, XML, etc. markup and metadata
    3. extract valuable data from other formats, such as JSON, or from within databases
    
    If you fear regular expressions, this could potentially be the part of text preprocessing in which your worst fears are realized.

    As you can imagine, the boundary between noise removal and data collection and assembly is a fuzzy one, and as such some noise removal must take place before other preprocessing steps. For example, any text required from a JSON structure would obviously need to be extracted prior to tokenization.
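
    As a small illustration of such noise removal, the sketch below pulls a text field out of a JSON record and strips markup from it with a regular expression, using only the standard library. The record, the `body` key and the function names are invented for this example, and the tag pattern is a simplification that will not cope with every kind of HTML.

```python
import html
import json
import re

# A made-up record, as it might come from a scraped or exported data source.
raw_record = '{"id": 17, "body": "<p>Great phone &amp; fast delivery!</p>"}'

def extract_text(record, field="body"):
    """Pull one text field out of a JSON string; `field` is a hypothetical key."""
    return json.loads(record)[field]

def strip_markup(text):
    """Remove anything that looks like a tag, then decode HTML entities."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags).strip()

print(strip_markup(extract_text(raw_record)))
```

    Only after this extraction step does it make sense to tokenize, because otherwise the JSON keys and tags would end up as tokens.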

    There are many libraries and algorithms used to deal with NLP-based problems. The regular expression module (re) is a widely used library for text cleaning. NLTK (Natural Language Toolkit) and spaCy are the next-level libraries used for performing natural language tasks like removing stopwords, named entity recognition, part-of-speech tagging, phrase matching, etc. To summarize:
    
    1. NLTK 
    2. Spacy 
    3. Gensim 
    4. re (regex expression)

    NLTK is an older library that beginners often use for practicing NLP techniques. spaCy is a newer library with more advanced techniques and is mostly used in production environments, so I would encourage you to learn both libraries and experience their power.

                                                .....

    We will be using all of the libraries above, one by one, to perform the same tasks. But before everything else comes loading your data and keeping only the relevant information. Then we will pre-process the text data.

    1. Import the dataset & Libraries.
    2. Dealing with Missing Values.
    3. Labeling the Dataset.
    4. Data Cleaning and text preprocessing.

    1. Import the dataset & Libraries
    
    First step is usually importing the libraries that will be needed in the program. A library is essentially a collection of modules that can be called and used.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
dataset = pd.read_csv('Amazon_Unlocked_Mobile.csv')

    Let’s look at the dataset we loaded, shown below. Here we can see there are 6 features: ‘Product Name’, ‘Brand Name’, ‘Price’, ‘Rating’, ‘Reviews’ and ‘Review Votes’.

In [3]:
dataset.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


    2. Dealing With Missing Values

    In this step we will check for null values in our dataset and replace or drop them as appropriate for the dataset.

In [4]:
dataset.isna().sum()

Product Name        0
Brand Name      65171
Price            5933
Rating              0
Reviews            62
Review Votes    12296
dtype: int64

    We are doing sentiment analysis on this dataset, so we basically need two features: ‘Rating’ and ‘Reviews’. As shown above, ‘Reviews’ has only 62 null values. We will first trim our dataset to these two features and then remove all 62 of those records with the code below.

In [7]:
# Keep only the two columns needed for sentiment analysis.
# .copy() makes an independent DataFrame, which avoids pandas' SettingWithCopyWarning.
dataset = dataset[['Rating', 'Reviews']].copy()
# Drop the rows whose review text is missing.
dataset = dataset.dropna()


In [8]:
dataset.isna().sum()

Rating     0
Reviews    0
dtype: int64

    As we can see, all null values have been removed from our dataset. Let’s create labels according to the rating given by customers.

    3. Labeling The Dataset

    In our dataset the ratings range from 1 to 5. Based on the rating we will create three labels: Positive (for ratings 4 & 5), Neutral (for rating 3) and Negative (for ratings 1 & 2).

![Screenshot%202021-09-06%20at%203.32.51%20PM.png](attachment:Screenshot%202021-09-06%20at%203.32.51%20PM.png)

In [9]:
def labeldatapoints(rating):
    
    if rating >= 4:
        return 'positive'
    elif rating == 3:
        return 'neutral'
    else:
        return 'negative'

# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised.
dataset = dataset.assign(Label=dataset['Rating'].apply(labeldatapoints))


    So, with the code above we created the label according to the rating. Let’s look at the dataset.

In [12]:
dataset.head(10)

Unnamed: 0,Rating,Reviews,Label
0,5,I feel so LUCKY to have found this used (phone...,positive
1,4,"nice phone, nice up grade from my pantach revu...",positive
2,5,Very pleased,positive
3,4,It works good but it goes slow sometimes but i...,positive
4,4,Great phone to replace my lost phone. The only...,positive
5,1,I already had a phone with problems... I know ...,negative
6,2,The charging port was loose. I got that solder...,negative
7,2,"Phone looks good but wouldn't stay charged, ha...",negative
8,5,I originally was using the Samsung S2 Galaxy f...,positive
9,3,It's battery life is great. It's very responsi...,neutral


    4. Data Cleaning And Text Preprocessing.

    We are only considering the ‘Reviews’ feature from the dataset for text preprocessing. I will go through a few steps here to clean the text data; in general the steps depend on the text data and the problem requirements. Here I explain this process step by step.

    Based on the general outline above, we perform a series of steps under each component.

    1. Remove HTML tags

    2. Remove extra whitespaces

    3. Convert accented characters to ASCII characters

    4. Expand contractions

    5. Remove special characters

    6. Lowercase all texts

    7. Convert number words to numeric form

    8. Remove numbers

    9. Remove stopwords

    10. Lemmatization

    Link to full code can be found at bottom of article, but read on to understand the salient steps taken.

    The necessary dependencies are as such.

#### Remove HTML Tags

       If the reviews or texts are web scraped, chances are they will contain some HTML tags. Since these tags are not useful for our NLP tasks, it is better to remove them.

    To do so, we can use BeautifulSoup’s HTML parser as follows:

In [None]:
from bs4 import BeautifulSoup

In [5]:
def strip_html_tags(text):
    """remove html tags from text"""
    
    soup = BeautifulSoup(text, "html.parser")
    
    stripped_text = soup.get_text(separator = " ")
    
    return stripped_text

In [16]:
strip_html_tags('<p>The name of the companies are as follows </br> Tedlers <br></p>')

'The name of the companies are as follows   Tedlers '

#### Remove URLs :

In [17]:
import re 

def clean_urls(text):
    return re.sub(r'http\S+','',text)

In [18]:
clean_urls('The name of my website is https://www.srijeet.com for quite some time now')

'The name of my website is  for quite some time now'

In [20]:
clean_urls('The name of my website is http://www.srijeet.com for quite some time now')

'The name of my website is  for quite some time now'
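
    The pattern above only catches links that start with `http`, and it leaves a double space behind. The following variant is a minimal sketch, not a complete URL matcher: it additionally treats anything starting with `www.` as a link and collapses the leftover whitespace. The function name `clean_urls_and_spaces` is made up for this example.

```python
import re

# Match links that start with http://, https:// or www.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def clean_urls_and_spaces(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

print(clean_urls_and_spaces(
    'The name of my website is https://www.srijeet.com for quite some time now'))
print(clean_urls_and_spaces('Visit www.example.com today'))
```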

#### Convert Accented Characters

    “Would you like to have latté at our café?”

    Words with accent marks like “latté” and “café” can be converted and standardized to just “latte” and “cafe”, else our NLP model will treat “latté” and “latte” as different words even though they are referring to same thing. To do this, we use the module unidecode.

##### Approach 1 :  Using unidecode

In [99]:
!pip install unidecode



In [None]:
import unidecode

In [6]:
def remove_accented_chars(text):
    """remove accented characters from text, e.g. café"""
    text = unidecode.unidecode(text)
    return text

In [104]:
remove_accented_chars('café latté')

'cafe latte'
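
    If installing an extra package is not an option, a similar result can be sketched with Python's built-in `unicodedata` module. The function name `strip_accents` is made up for this example. Note that this approach only handles characters that decompose into a base letter plus an accent mark, so it is narrower than unidecode, which also transliterates other symbols.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('café latté'))
```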

##### Approach 2 :  Using gensim

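    The utils.to_unicode function in the gensim library converts a string (a bytestring in a given encoding, or Unicode) to Unicode. Note that it only decodes the input and does not strip accent marks, which is why the output below is unchanged; for removing accents, gensim provides utils.deaccent.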

In [101]:
import gensim
from gensim import utils

In [107]:
s = " I am wayyyy too lazyyy!!! Never got out of bed for the whole 2 days. #lazy_days "
s = utils.to_unicode(s)
print(s)

 I am wayyyy too lazyyy!!! Never got out of bed for the whole 2 days. #lazy_days 


#### Expand Contractions

    Contractions are shortened words, e.g., don’t and can’t. Expanding such words to “do not” and “cannot” helps to standardize text.
    
    We use the contractions module to expand the contractions.

In [18]:
!pip install contractions


In [11]:
import contractions

In [7]:
def expand_contractions(text):
    """expand shortened words, e.g. don't to do not"""
    text = contractions.fix(text)
    return text

    Note: This step is optional depending on your NLP task, as spaCy’s tokenization and lemmatization functions have a similar effect of expanding contractions such as can’t and don’t. The slight difference is that spaCy will expand “we’re” to “we be” while the contractions module gives “we are”.
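
    To make the idea concrete without any third-party package, here is a minimal dictionary-based sketch. The mapping `CONTRACTION_MAP` is hypothetical and far from complete, and the function name `expand_contractions_simple` is made up; this is not how the contractions module is implemented. The sketch also lowercases every replacement and cannot resolve ambiguous cases (for example, “it’s” can mean “it is” or “it has”).

```python
import re

# Hypothetical, far-from-complete mapping used only for illustration.
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "isn't": "is not",
    "we're": "we are",
    "it's": "it is",
}

# One alternation pattern that matches any key as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_simple(text):
    """Replace each known contraction with its (lowercase) expansion."""
    # Normalize the curly apostrophe so that don’t and don't both match.
    text = text.replace("’", "'")
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions_simple("We're sure it isn't broken"))
print(expand_contractions_simple("don’t panic"))
```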

#### Treatment for Numbers

    There are two steps in our treatment of numbers.

    One of the steps involves the conversion of number words to numeric form, e.g., seven to 7, to standardize text. To do this, we use the word2number module. Sample code as follows:

In [16]:
!pip install word2number



In [None]:
from word2number import w2n

In [12]:
!pip install -U pip setuptools wheel
!pip install -U spacy


# Download the model that is loaded below with spacy.load('en_core_web_md').
!python -m spacy download en_core_web_md

from spacy.lang.en import English
from spacy import displacy   


In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

In [9]:
text = """three cups of coffee"""

doc = nlp(text)

# Replace number words (tagged NUM by spaCy) with their numeric value.
tokens = [w2n.word_to_num(token.text) if token.pos_ == 'NUM' else token for token in doc]

print(tokens) # result: [3, cups, of, coffee]

    The other step is to remove numbers. As you shall see later, we are able to toggle on or off the steps by setting parameters to True or False value. Removing numbers may make sense for sentiment analysis since numbers contain no information about sentiments. However, if our NLP task is to extract the number of tickets ordered in a message to our chatbot, we will definitely not want to remove numbers.
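
    The idea of toggling steps with True/False parameters can be sketched as follows. This is only an illustration of the pattern using three simple steps; the function name `preprocess` and its parameter names are assumptions for this example, not the full pipeline of this article.

```python
import re

def preprocess(text, lowercase=True, remove_numbers=True, collapse_spaces=True):
    """Apply each cleaning step only if its flag is True."""
    if lowercase:
        text = text.lower()
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
    if collapse_spaces:
        text = re.sub(r'\s+', ' ', text).strip()
    return text

# Sentiment analysis: numbers can usually go.
print(preprocess("Book 3 tickets  NOW"))
# Chatbot that must read the ticket count: keep the numbers.
print(preprocess("Book 3 tickets  NOW", remove_numbers=False))
```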

##### Approach 1 : Using re

In [115]:
import re
def remove_nums(text):
    return re.sub(r'[0-9]',' ',text)

In [116]:
remove_nums('3 members were put to cellag99t')

'  members were put to cellag  t'

##### Approach 2 : Using Gensim

In [117]:
import gensim.parsing.preprocessing as gp

In [118]:
s = gp.strip_numeric('3 members were put to cellag99t')
print(s)

 members were put to cellagt


#### Remove Punctuation :

    Remove punctuation if it is not relevant to your analysis.

    Punctuation is basically the set of symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

##### Approach 1 : Using re

In [28]:
import re
import string

def remove_punctuation(text):
    # string.punctuation is the ASCII set !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    return re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)

In [110]:
remove_punctuation('All the team members @batalion89 did awesome job !!! They will be rewarded . <Hurray> #happiness')

'All the team members  batalion89 did awesome job     They will be rewarded    Hurray   happiness'

##### Approach 2 : Using gensim

In [111]:
import gensim.parsing.preprocessing as gp

In [113]:
s = 'All the team members @batalion89 did awesome job !!! They will be rewarded . <Hurray> #happiness'
s = gp.strip_punctuation(s)
s = gp.strip_tags(s)
print(s)

All the team members  batalion89 did awesome job   They will be rewarded    Hurray   happiness


#### Converting to lower-case :

    All words are converted to lower case (or upper case) to avoid duplication, because “Phone” and “phone” would otherwise be considered two separate words.

In [34]:
def convert_lower(text):
    return str(text).lower()

In [37]:
convert_lower('INTRODUCTION :  My 12th class batch was the best. #Happiness')

'introduction :  my 12th class batch was the best. #happiness'

#### Stopwords Removal :

    Stop-words are commonly used words in a language. Examples are ‘a’, ’an’, ’the’, ’is’, ’what’ etc. Stop-words are removed from the text so that we can concentrate on more important words and prevent stop-words from being analyzed. If we search ‘what is text preprocessing’, we want to focus more on ‘text preprocessing’ rather than ‘what is’.

    Stop words can mean different things for different applications. In some applications, removing all stop words, from determiners to prepositions, is appropriate. But in other applications, like sentiment analysis, removing tokens such as not, good, etc. can throw algorithms off track.

    As mentioned earlier, stopwords are very common words. Words like “we” and “are” probably do not help at all in NLP tasks such as sentiment analysis or text classifications. Hence, we can remove stopwords to save computing time and efforts in processing large volumes of text.

##### Approach 1 : Using Spacy 

    We can use spaCy’s inbuilt stopwords, but we should be cautious and modify the stopwords list accordingly. E.g., for sentiment analysis, the word “not” is important in the meaning of a text such as “not good”. However, spaCy includes “not” as a stopword. We therefore modify the stopwords with the following code:

In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

In [None]:
# exclude words from spacy stopwords list
deselect_stop_words = ['no', 'not']
for w in deselect_stop_words:
    nlp.vocab[w].is_stop = False

##### Approach 2 : Using nltk 

    It is also possible to remove stopwords using the Natural Language Toolkit (nltk). You can check its list of stopwords with the following code (the list has to be downloaded once with nltk.download('stopwords')).

In [43]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

    So, these are the stopwords we need to remove. Let’s remove them.

In [44]:
stop_words = set(stopwords.words('english'))

In [54]:
def remove_stop_words(text):
    return ' '.join([words for words in text.split() if words not in stop_words])

In [55]:
remove_stop_words('Why are you telling me the truth ?')

'Why telling truth ?'
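
    Notice that ‘Why’ survived in the output above: the nltk list is all lowercase, and the comparison is case-sensitive. The sketch below shows one way to handle this, and also how to keep negations for sentiment analysis. The stop word set here is a tiny, hypothetical stand-in for the real list, and the function name `remove_stop_words_ci` is made up for this example.

```python
# Hypothetical, tiny stop word list; real lists are much longer.
BASE_STOP_WORDS = {"why", "are", "you", "me", "the", "is", "not", "no"}
KEEP = {"not", "no"}  # negations carry sentiment, so keep them
STOP_WORDS = BASE_STOP_WORDS - KEEP

def remove_stop_words_ci(text):
    """Drop stop words, comparing in lowercase but keeping original casing."""
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words_ci('Why are you telling me the truth ?'))
print(remove_stop_words_ci('The phone is not good'))
```

    Alternatively, lowercasing the whole text before stop word removal has the same effect on the comparison.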

##### Approach 3 : Using Gensim

In [123]:
import gensim.parsing.preprocessing as gp

In [124]:
s = gp.remove_stopwords('Why are you telling me the truth ?')
print(s)

Why telling truth ?


#### Stemming & Lemmatization  : 

    These are the core processes of normalization. The aim of both is the same: reducing the inflectional forms of each word to a common base or root. The two processes differ, however, so let’s see what stemming and lemmatization are.
    
    Stemming usually refers to a crude process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational units (the obtained element is known as the stem).
    
    On the other hand, lemmatization consists in doing things properly with the use of a vocabulary and morphological analysis of words, to return the base or dictionary form of a word, which is known as the lemma.
    
    If we stem the sentence “I saw an amazing thing”, a crude stemmer might return just ‘s’ for ‘saw’, whereas a lemmatizer would return ‘see’, which is the lemma. (The Porter stemmer used below leaves ‘saw’ unchanged and only trims ‘amazing’ to ‘amaz’.)

                        Stemming :  I saw an amazing thing ------> I s an amaz thing.
        
                        Lemmatization :  I saw an amazing thing ------> I see an amazing thing.

    Both techniques can remove important information, but they also help us to normalize our corpus. Because stemming can produce strings that are not real words, lemmatization is the one that is usually applied.
    
    I will show you the difference between both with the help of code and result.

    Let’s look at stemming first.

    A stemmer is easier to build than a lemmatizer, as the latter requires deep linguistic knowledge to construct the dictionaries used to look up the lemma of a word.

##### Stemming : 

    There are different algorithms for stemming but the most common algorithm, which is also known to be empirically effective for English, is Porter’s Algorithm. Porter’s Algorithm consists of 5 phases of word reductions applied sequentially.

    Since stemming follows a crude heuristic approach that chops off the end of the tokens in the hope of correctly transforming into its root form, it may sometimes generate non-meaningful terms. For example, it may convert the token ‘increase’ into ‘increas’, causing the token to lose its meaning.

    Stemming has two types of errors — over-stemming and under-stemming. Over-stemming refers to the problem where two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is the situation where two words with the same stem are not stemmed together. This is also known as a false negative. Light stemming tends to reduce over-stemming errors but increases the under-stemming errors whereas heavy stemming increases over-stemming errors but reduces under-stemming errors.
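
    The trade-off between light and heavy stemming can be illustrated with two toy stemmers. The suffix lists below are hypothetical and chosen only to make both error types visible; they are not the rules of Porter's algorithm, and the names `make_stemmer`, `light`, `heavy` and `conflated` are made up for this sketch.

```python
def make_stemmer(suffixes, min_stem=3):
    """Build a stemmer that strips the first matching suffix from the tuple."""
    def stem(word):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                return word[: -len(suffix)]
        return word
    return stem

light = make_stemmer(("s",))                                  # plural 's' only
heavy = make_stemmer(("ning", "ity", "ing", "es", "e", "s"))  # many suffixes

def conflated(stem, a, b):
    """True if the stemmer maps both words to the same stem."""
    return stem(a) == stem(b)

# Same underlying word: these SHOULD be merged.
print(conflated(light, "running", "runs"), conflated(heavy, "running", "runs"))
# Different meanings: these should NOT be merged.
print(conflated(light, "universe", "university"),
      conflated(heavy, "universe", "university"))
```

    The light stemmer fails to merge ‘running’ and ‘runs’ (under-stemming), while the heavy one merges them but also collapses ‘universe’ and ‘university’ into the same stem (over-stemming).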

##### Approach 1 : NLTK 

In [87]:
from nltk.stem import PorterStemmer

In [88]:
stemming_object = PorterStemmer()

In [89]:
def porter_stemming(list_of_token):
    return [stemming_object.stem(i) for i in list_of_token]

In [90]:
text = 'I saw an amazing thing'
list_of_token = text.split(' ')
list_of_token

['I', 'saw', 'an', 'amazing', 'thing']

     You can also use nltk.word_tokenize() to create the list of words.

In [91]:
porter_stemming(list_of_token)

['I', 'saw', 'an', 'amaz', 'thing']

    Result: As the output shows, ‘amazing’ has been stemmed to ‘amaz’, while the other tokens are unchanged. On other inputs the Porter stemmer produces stems such as ‘compon’ for ‘components’, ‘say’ for ‘says’, ‘peopl’ for ‘people’ and ‘troubl’ for ‘troubling’.

    Some of the resulting stems, such as ‘amaz’, are not real words, and this is the main challenge with stemming. Let’s move on to lemmatization and see the difference in the output.

##### Approach 2 : Using Gensim 

    The stem_text() function returns the Porter-stemmed version of a string. The Porter stemmer is known for its speed and simplicity. Note that stem_text() also lowercases the text, as the output below shows.

In [127]:
import gensim.parsing.preprocessing as gp

In [128]:
s = gp.stem_text('I saw an amazing thing')
print(s)

i saw an amaz thing


#### Lemmatization:

    Lemmatization is similar to stemming. The difference is that lemmatization uses a vocabulary and morphological analysis of words to remove inflections and return the base or dictionary form of a word, also known as the lemma. It analyses the morphology of each word to identify its lemma accurately, and it may use a dictionary such as WordNet for the mapping, or some other rule-based approach.
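
    The idea of combining a dictionary with rules can be sketched in a few lines of plain Python. This is only a toy: VOCAB, IRREGULAR, SUFFIX_RULES and toy_lemmatize are hypothetical names invented for this example, and the tiny word lists stand in for a real resource such as WordNet.

```python
# Toy resources, far smaller than a real lexical database.
VOCAB = {"see", "amaze", "thing", "care", "talk", "eat", "keep"}
IRREGULAR = {"saw": "see", "kept": "keep", "ate": "eat"}
SUFFIX_RULES = [("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a dictionary form for word, or the lowercased word if none is found."""
    word = word.lower()
    # 1. Irregular forms are looked up directly.
    if word in IRREGULAR:
        return IRREGULAR[word]
    # 2. Words that are already dictionary forms are returned as they are.
    if word in VOCAB:
        return word
    # 3. Otherwise try suffix rules, accepting only candidates found in the vocabulary.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB:
                return candidate
    return word


print([toy_lemmatize(w) for w in "I saw an amazing thing".split()])
# ['i', 'see', 'an', 'amaze', 'thing']
```

    Because every candidate is checked against the vocabulary, the function never returns a made-up stem such as ‘amaz’, which is the key difference from a stemmer.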

##### Approach 1 : NLTK 

In [92]:
from nltk.stem import WordNetLemmatizer

In [93]:
lemma_obj = WordNetLemmatizer()

In [94]:
def wordnet_lemmatizing(list_of_token):
    return [lemma_obj.lemmatize(word=w, pos='v') for w in list_of_token]

In [95]:
text = 'I saw an amazing thing'
list_of_token = text.split(' ')
list_of_token

['I', 'saw', 'an', 'amazing', 'thing']

In [96]:
wordnet_lemmatizing(list_of_token)

['I', 'saw', 'an', 'amaze', 'thing']

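    Here the lemmatizer maps ‘amazing’ to its base form ‘amaze’. Note that ‘saw’ is left unchanged, most likely because ‘saw’ is itself a valid verb in WordNet (to saw), so the result depends on the word and on the part-of-speech tag passed in. On other inputs it maps, for example, ‘troubling’ to ‘trouble’, ‘took’ to ‘take’ and ‘payed’ to ‘pay’. So, as opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.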

##### Approach 2 : spaCy 

    Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”. We use spaCy’s lemmatizer to obtain the lemma, or base form, of the words. Sample code:

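In [None]:
import spacy

# Assumes the 'en_core_web_md' model is installed (python -m spacy download en_core_web_md)
nlp = spacy.load('en_core_web_md')

text = """he kept eating while we are talking"""
doc = nlp(text)

# Lemmatizing each token; spaCy v2 returns "-PRON-" for pronouns, so keep their lowercase text instead
mytokens = [word.lemma_ if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
print(mytokens)

# result: ['he', 'keep', 'eat', 'while', 'we', 'be', 'talk']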

    Another method to obtain the base form of a word is stemming. We did not use it in our text preprocessing code, but you can consider stemming if processing speed is of utmost concern. Do take note that stemming is a crude heuristic that chops the ends off words, so the results may not be actual words. E.g., a naive stemmer that simply strips “ing” turns “caring” into “car”.

![Screenshot%202021-09-06%20at%205.49.44%20PM.png](attachment:Screenshot%202021-09-06%20at%205.49.44%20PM.png)

    Though lemmatization is generally more accurate than stemming, neither form of normalization tends to improve English information retrieval (IR) performance in aggregate. In some cases it helps, while in others it hampers performance.

    For more examples of lemmatization in Python, check this blog (https://www.machinelearningplus.com/nlp/lemmatization-examples-python/), and for a detailed explanation of the differences between stemming and lemmatization, check this blog (https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/).

#### Tokenization :

    Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks and other symbols can all be considered tokens. We will use the Natural Language Toolkit (NLTK) library for tokenization.

    Note: If we have data in the form of paragraphs and want to split each paragraph into sentences, we can use nltk.sent_tokenize(paragraph).

    Here we will use the code below to perform word tokenization.

In [38]:
import nltk
from nltk.tokenize import word_tokenize

In [39]:
def nltk_tokenize(text):
    return word_tokenize(text)

In [40]:
nltk_tokenize('All the team members @batalion89 did awesome job !!! They will be rewarded . <Hurray> ')

['All',
 'the',
 'team',
 'members',
 '@',
 'batalion89',
 'did',
 'awesome',
 'job',
 '!',
 '!',
 '!',
 'They',
 'will',
 'be',
 'rewarded',
 '.',
 '<',
 'Hurray',
 '>']

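    The tokenizer splits ‘@batalion89’ into ‘@’ and ‘batalion89’, breaks ‘<Hurray>’ into three tokens and keeps every ‘!’ as a separate token. That is why it is better to clean the text before tokenizing it.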
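
    A rough sketch of cleaning before tokenizing, using only the standard library. The function clean_then_tokenize and its regular expressions are assumptions made for this example (a simple stand-in, not NLTK’s tokenizer), and dropping mentions and tag-like markup entirely is just one possible choice.

```python
import re


def clean_then_tokenize(text):
    """Remove noisy markup first, then split into word and punctuation tokens."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop tag-like markup such as <Hurray>
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"([!?.])\1+", r"\1", text)  # collapse repeated punctuation
    return re.findall(r"\w+|[^\w\s]", text)    # words, or single punctuation marks


sample = 'All the team members @batalion89 did awesome job !!! They will be rewarded . <Hurray> '
print(clean_then_tokenize(sample))
# ['All', 'the', 'team', 'members', 'did', 'awesome', 'job', '!',
#  'They', 'will', 'be', 'rewarded', '.']
```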

#### Removing whitespaces

    After removing the punctuation, we remove the extra whitespace in the text data, as it carries no information and only increases the size of the training set. We collapse runs of whitespace into a single space, keeping only the tokens that contribute to the analysis of the corpus.

##### Approach 1 : Using gensim 

In [121]:
import gensim.parsing.preprocessing as gp

In [119]:
s = gp.strip_multiple_whitespaces('Whatever has happened    should not   have happened   ')
print(s)

Whatever has happened should not have happened 


In [120]:
s

'Whatever has happened should not have happened '
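
    Note that the gensim result above still ends with a single trailing space. If that matters, a plain-Python alternative can collapse and trim in one step. The helper name collapse_whitespace is made up for this sketch.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


s = collapse_whitespace('Whatever has happened    should not   have happened   ')
print(repr(s))  # 'Whatever has happened should not have happened'
```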

### Complete code with gensim 

In [None]:
import pandas as pd
import gensim
from gensim import utils
import gensim.parsing.preprocessing as gp

# folderpath is the path to your CSV file; df['tweets'] is assumed to contain the tweets
df = pd.read_csv(folderpath)

def preprocess_text(s):
    s = utils.to_unicode(s)
    s = s.lower()
    s = gp.strip_tags(s)  # strip tags first, while '<' and '>' are still present
    s = gp.strip_punctuation(s)
    s = gp.strip_numeric(s)
    s = gp.strip_multiple_whitespaces(s)
    s = gp.remove_stopwords(s)
    s = gp.stem_text(s)
    return s

df['tweets'] = df['tweets'].apply(str)  # convert each row of the tweets column to string type
df['tweets'] = df['tweets'].apply(preprocess_text)  # pass each row of the tweets column to preprocess_text()

https://www.analyticsvidhya.com/blog/2021/08/why-must-text-data-be-pre-processed/ 
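
    The order of the steps in such a pipeline can change the result. The sketch below uses two regex helpers, drop_tags and drop_punctuation, which are hypothetical stand-ins written for this example (not gensim’s implementations), to show what happens when punctuation is removed before tags.

```python
import re


def drop_tags(text):
    """Replace anything that looks like an HTML/XML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def drop_punctuation(text):
    """Replace every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)


def squeeze(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


raw = "great coffee<br />really"
tags_first = squeeze(drop_punctuation(drop_tags(raw)))
punct_first = squeeze(drop_tags(drop_punctuation(raw)))

print(tags_first)   # great coffee really
print(punct_first)  # great coffee br really
```

    Once the angle brackets are gone, the tag can no longer be recognised, so the tag name ‘br’ leaks into the text as if it were a word.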

### Complete code with spaCy 

In [None]:
from bs4 import BeautifulSoup
import spacy
import unidecode
from word2number import w2n
import contractions

nlp = spacy.load('en_core_web_md')

# exclude words from spacy stopwords list
deselect_stop_words = ['no', 'not']
for w in deselect_stop_words:
    nlp.vocab[w].is_stop = False


def strip_html_tags(text):
    """remove html tags from text"""
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text(separator=" ")
    return stripped_text


def remove_whitespace(text):
    """remove extra whitespaces from text"""
    text = text.strip()
    return " ".join(text.split())


def remove_accented_chars(text):
    """remove accented characters from text, e.g. café"""
    text = unidecode.unidecode(text)
    return text


def expand_contractions(text):
    """expand shortened words, e.g. don't to do not"""
    text = contractions.fix(text)
    return text


def text_preprocessing(text, accented_chars=True, contractions=True, 
                       convert_num=True, extra_whitespace=True, 
                       lemmatization=True, lowercase=True, punctuations=True,
                       remove_html=True, remove_num=True, special_chars=True, 
                       stop_words=True):
    """preprocess text with default option set to true for all steps"""
    if remove_html == True: #remove html tags
        text = strip_html_tags(text)
    if extra_whitespace == True: #remove extra whitespaces
        text = remove_whitespace(text)
    if accented_chars == True: #remove accented characters
        text = remove_accented_chars(text)
    if contractions == True: #expand contractions
        text = expand_contractions(text)
    if lowercase == True: #convert all characters to lowercase
        text = text.lower()

    doc = nlp(text) #tokenise text

    clean_text = []
    
    for token in doc:
        flag = True
        edit = token.text
        # remove stop words
        if stop_words == True and token.is_stop and token.pos_ != 'NUM': 
            flag = False
        # remove punctuations
        if punctuations == True and token.pos_ == 'PUNCT' and flag == True: 
            flag = False
        # remove special characters
        if special_chars == True and token.pos_ == 'SYM' and flag == True: 
            flag = False
        # remove numbers
        if remove_num == True and (token.pos_ == 'NUM' or token.text.isnumeric()) \
        and flag == True:
            flag = False
        # convert number words to numeric numbers
        if convert_num == True and token.pos_ == 'NUM' and flag == True:
            edit = w2n.word_to_num(token.text)
        # convert tokens to base form
        elif lemmatization == True and token.lemma_ != "-PRON-" and flag == True:
            edit = token.lemma_
        # append tokens edited and not removed to list 
        if edit != "" and flag == True:
            clean_text.append(edit)        
    return clean_text

In [None]:
text = """I'd like to have three cups   of coffee<br /><br />from your Café. #delicious"""

text_preprocessing(text) 
# result: ['like', 'cup', 'coffee', 'cafe', 'delicious']

    To toggle specific steps on or off, set the relevant parameters to True or False. E.g., to keep numbers, set the parameter “remove_num” to False.

In [None]:
# example to not remove numbers
text_preprocessing(text, remove_num=False)

    After this, we can then convert the processed text into something that can be represented numerically. Two main ways of doing so are one-hot encodings and word embedding vectors. We shall explore these in the next article.
    
    Lastly, do note that there are experts (https://twitter.com/peteskomoroch/status/1068318945382297600) who have expressed the view that text preprocessing can negatively impact rather than enhance the performance of deep learning models. Nonetheless, text preprocessing remains crucial for non-deep learning models.
    
    Thanks for reading and I hope the code and article are useful. Please also feel free to comment with any questions or suggestions you may have.

https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79

https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html

https://www.kdnuggets.com/2017/11/framework-approaching-textual-data-tasks.html

https://www.kdnuggets.com/2016/03/data-science-process-rediscovered.html/2

https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html 

https://pypi.org/project/contractions/

https://pypi.org/project/word2number/

https://medium.com/analytics-vidhya/text-preprocessing-for-nlp-natural-language-processing-beginners-to-master-fd82dfecf95

https://towardsdatascience.com/text-preprocessing-in-natural-language-processing-using-python-6113ff5decd8

https://www.nltk.org/

https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/

https://medium.com/analytics-vidhya/text-preprocessing-nlp-basics-430d54016048