# Natural Language Processing with Python: Introduction

Source: https://sanjayasubedi.com.np/nlp/nlp-intro/



__Import Libraries__

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import regex as re
plt.style.use('ggplot')

__NLTK Downloads__

---

## Introduction

NLP Flow:
Text --> Preprocess --> Feature Extraction --> Model
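
To make the flow concrete, here is a minimal sketch using only the standard library. The function names (`preprocess`, `extract_features`, `model`) and the keyword rule inside `model` are placeholders invented for illustration; a real system would use a trained model in the last stage.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(tokens):
    """Turn the tokens into bag-of-words counts."""
    return Counter(tokens)


def model(features):
    """Placeholder 'model': flag the text if it mentions 'warning'."""
    return "alert" if features["warning"] > 0 else "normal"


text = "This warning shouldn't be taken lightly."
tokens = preprocess(text)            # Text -> Preprocess
features = extract_features(tokens)  # Preprocess -> Feature Extraction
label = model(features)              # Feature Extraction -> Model

print(tokens)  # ['this', 'warning', "shouldn't", 'be', 'taken', 'lightly']
print(label)   # alert
```

Each stage only consumes the output of the previous one, so any stage can be swapped out without touching the others.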

## Preprocessing

In this section, I'll introduce some common pre-processing steps. The input is a piece of text: a news article, a search query, instructions for a chat-bot, and so on. We feed this input to a pre-processing step that extracts the tokens (a token can be a word, a phrase, or even a sentence) and cleans the text, for example by fixing spelling mistakes, removing uninformative words (stop-words), or annotating words with their part of speech. What we do in this step depends on the problem we are trying to solve, but for many applications tokenization, stop-word removal, and stemming are fairly common.

__Example Input__

In [3]:
text = "This warning shouldn't be taken lightly."

### Tokenization

In [4]:
print(text.split(' '))



In [8]:
# Raw string so that \p{P} (any Unicode punctuation) reaches the regex module unescaped
clean_text = re.sub(r'\p{P}+', '', text)
print(clean_text)
print(clean_text.split(' '))
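
The two cells above can be reproduced with the standard library alone to see where each approach falls short. This is a sketch: the pattern `[^\w\s]+` is my stand-in for `\p{P}+`, since the built-in `re` module does not support Unicode property classes, and the two patterns are not exactly equivalent.

```python
import re

text = "This warning shouldn't be taken lightly."

# Splitting on spaces leaves the final period attached to 'lightly.'
naive_tokens = text.split(' ')
print(naive_tokens)
# ['This', 'warning', "shouldn't", 'be', 'taken', 'lightly.']

# Removing every character that is neither a word character nor whitespace
# drops the period, but it also drops the apostrophe in "shouldn't"
stripped_tokens = re.sub(r"[^\w\s]+", "", text).split(' ')
print(stripped_tokens)
# ['This', 'warning', 'shouldnt', 'be', 'taken', 'lightly']
```

Neither result is ideal: the first keeps punctuation glued to a word, and the second produces the non-word shouldnt.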



In [10]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print(doc)



In [13]:
type(doc)

spacy.tokens.doc.Doc

In [18]:
print([token for token in doc])



In [15]:
print([token.text for token in doc])



Now the tokenization looks much better. The punctuation is still present, but we can easily remove it. Every token produced by spaCy is of type spacy.tokens.token.Token and has a number of attributes. Several of them start with is_, e.g. is_digit, is_punct, and is_stop, and can be used to determine what kind of token it is.
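
As a small sketch of how the is_punct attribute can be used to drop punctuation: I use `spacy.blank("en")` here on the assumption that a tokenizer-only pipeline is enough, since is_punct is a lexical attribute and should not need the downloaded model. The variable name `words` is my own.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is sufficient,
# because is_punct does not depend on a trained model.
nlp = spacy.blank("en")
doc = nlp("This warning shouldn't be taken lightly.")

# Keep every token that is not punctuation
words = [token.text for token in doc if not token.is_punct]
print(words)
```

The same pattern works with the other is_* attributes, e.g. filtering on `token.is_digit`.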

### Stopword Removal

Stop-words are words that occur frequently but carry little meaning on their own. For example, a, an, and the occur very frequently and can be discarded with little loss of meaning for most NLP tasks.

In [20]:
print([(token.text, token.is_stop) for token in doc])



In [21]:
print([token.text for token in doc if not token.is_stop])
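
The stop-word list itself can also be inspected directly. The sketch below imports spaCy's English list and applies it to a plain Python list; the sample tokens and the name `content_words` are made up for illustration.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# The articles mentioned earlier are all in spaCy's English stop-word list
print({"a", "an", "the"} <= STOP_WORDS)

# Filtering a plain list of (hypothetical) tokens against the same set
tokens = ["the", "cat", "sat", "on", "a", "mat"]
content_words = [t for t in tokens if t not in STOP_WORDS]
print(content_words)
```

Because STOP_WORDS is an ordinary Python set, it can be extended or trimmed when the default list does not suit the task.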



### Stemming

Stemming is the process of reducing words to their root form. For example, the stem of cats is cat and the stem of transportation is transport. The goal is to reduce the size of the vocabulary, because for most applications the distinction between cats and cat is not important. For example, if a user searches for documents containing the word cats but we only have documents containing the word cat, the user gets zero results. If we stem the user's query, we are able to retrieve some results. A popular stemming algorithm is the Porter algorithm. spaCy does not provide stemming, but libraries like NLTK do. Stemming algorithms are mostly rule-based, and their output is not always a valid word.
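
Here is a short sketch using NLTK's PorterStemmer, which needs no extra downloads. The word list is my own choice; studies is included to show a stem that is not a valid word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["cats", "transportation", "studies"]:
    print(word, "->", stemmer.stem(word))

# cats -> cat
# transportation -> transport
# studies -> studi
```

The stem studi is not an English word, but that is fine for retrieval as long as queries and documents are stemmed the same way.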

### Lemmatization

Lemmatization is a more sophisticated alternative to stemming. The part of speech (POS) of each word is determined first, and then different rules are applied for different POS. spaCy provides lemmatization instead of stemming because it generally gives better results, although it is a bit more computationally expensive.

In [24]:
print([(token.text, token.lemma_) for token in nlp('we are meeting tomorrow')])
print([(token.text, token.lemma_) for token in nlp('i am going to a meeting')])

[('we', '-PRON-'), ('are', 'be'), ('meeting', 'meet'), ('tomorrow', 'tomorrow')]
[('i', 'i'), ('am', 'be'), ('going', 'go'), ('to', 'to'), ('a', 'a'), ('meeting', 'meeting')]
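
In the output above, meeting becomes meet when it is used as a verb but stays meeting when it is used as a noun. The toy lookup below illustrates that idea only; the table and the `toy_lemma` helper are hypothetical and are not how spaCy implements its lemmatizer.

```python
# Hypothetical toy table keyed by (word, part of speech).
# It only illustrates that the same word can have different lemmas.
LEMMA_TABLE = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("are", "AUX"): "be",
    ("am", "AUX"): "be",
}


def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemma("meeting", "VERB"))   # meet
print(toy_lemma("meeting", "NOUN"))   # meeting
print(toy_lemma("tomorrow", "NOUN"))  # tomorrow
```

A stemmer has no access to the POS, so it would treat both uses of meeting identically.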


In [25]:
print([token.lemma_ for token in doc])



In [26]:
print([(token.text, token.pos_) for token in doc])



In [27]:
lemmatized = [token.lemma_ for token in doc]
print([(token.text, token.pos_) for token in nlp(' '.join(lemmatized))])

