# what is nlp?

NLP can be defined as a field of computer science that is concerned with enabling computer algorithms to understand, analyze, and generate natural languages.

NLP works at several levels, each corresponding to a different depth at which a machine processes and understands natural language. These levels are as follows:
* Morphological level: This level deals with understanding word structure and word information.
* Lexical level: This level deals with understanding the part of speech of the word.
* Syntactic level: This level deals with analyzing the grammatical structure of a sentence, that is, parsing it.
* Semantic level: This level deals with understanding the actual meaning of a sentence.
* Discourse level: This level deals with understanding the meaning of a sentence beyond just the sentence level, that is, considering the context.
* Pragmatic level: This level deals with using real-world knowledge to understand the sentence.

**Text analytics vs. NLP**
Text analytics is the method of extracting meaningful insights and answering questions from text data, such as those to do with the length of sentences, length of words, word count, and finding words from the text. 
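
As a rough illustration of the surface-level questions text analytics answers, the following sketch computes a few counts with plain Python. The function name `text_stats` and the naive split on full stops are assumptions made for this example only.

```python
def text_stats(text):
    """Return simple surface-level statistics for a piece of text."""
    # Naive sentence split on '.', an assumption that ignores abbreviations
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = text.replace(".", " ").split()
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "longest_word": max(words, key=len),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

stats = text_stats("The quick brown fox jumps over the lazy dog. The dog sleeps.")
print(stats)
```

None of these numbers require any understanding of what the text means, which is the distinction drawn with NLP.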

NLP, on the other hand, helps us in understanding the semantics and the underlying meaning of text, such as the sentiment of a sentence, top keywords in text, and parts of speech for different words. It is not just restricted to text data; voice (speech) recognition and analysis also come under the domain of NLP. 

Two terms recur throughout NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG). An explanation of these terms is provided here:
NLU: NLU refers to a process by which an inanimate object with computing power is able to comprehend spoken language. For example, voice assistants such as Siri and Alexa use techniques such as Speech to Text to answer different questions, including inquiries about the weather, the latest news updates, live match scores, and more.
NLG: NLG refers to a process by which an inanimate object with computing power is able to communicate with humans in a language that they can understand or is able to generate human-understandable text from a dataset. Continuing with the example of Siri or Alexa, ask one of them about the chances of rainfall in your city. It will reply with something along the lines

NLP can be broadly categorized into two types:
* Natural Language Understanding: NLU refers to a process by which a machine is able to comprehend spoken language.
* Natural Language Generation: NLG refers to a process by which a machine is able to communicate with humans in a language that they can understand or is able to generate human-understandable text from a dataset.

# basic text analytics

In [1]:
sentence = 'The quick brown fox jumps over the lazy dog'
sentence

'The quick brown fox jumps over the lazy dog'

In [3]:
# Check whether the word 'quick' belongs to that text 

def find_word(word, sentence):
    return word in sentence
find_word('quick', sentence)

True

In [4]:
# Find out the index value of the word 'fox'

def get_index(word, text):
    return text.index(word)
get_index('fox', sentence)

16

In [5]:
# Find out the rank of the word 'lazy'

get_index('lazy', sentence.split())

7

In [6]:
# Print the third word of the given text

def get_word(text, rank):
    return text.split()[rank]
get_word(sentence, 2)

'brown'

In [7]:
# Print the third word of the given sentence in reverse order

get_word(sentence,2)[::-1]

'nworb'

In [8]:
# Concatenate the first and last words of the given sentence

def concat_words(text):
    """
    Concatenate the first and last
    words of the given text.
    """
    words = text.split()
    first_word = words[0]
    last_word = words[-1]
    return first_word + last_word

concat_words(sentence)

'Thedog'

In [9]:
# Print words at even indices (0, 2, 4, ...), that is, the 1st, 3rd, 5th, ... words

def get_even_position_words(text):
    words = text.split()
    return [words[i] for i in range(len(words)) if i%2 == 0]
 
get_even_position_words(sentence)

['The', 'brown', 'jumps', 'the', 'dog']

In [10]:
# Print the last three letters of the text

def get_last_n_letters(text, n):
    return text[-n:]
get_last_n_letters(sentence,3)

'dog'

In [11]:
# Print the text in reverse order

def get_reverse(text):
    return text[::-1]
get_reverse(sentence)

'god yzal eht revo spmuj xof nworb kciuq ehT'

In [13]:
# Print each word in reverse order, maintaining sequence

def get_word_reverse(text):
    words = text.split()
    return ' '.join([word[::-1] for word in words])

get_word_reverse(sentence)

'ehT kciuq nworb xof spmuj revo eht yzal god'

In [15]:
# Reverse the sequence of words in string

wordList = sentence.split()
' '.join(word for word in wordList[::-1])

'dog lazy the over jumps fox brown quick The'

# various steps in nlp

## tokenization

Tokenization refers to the procedure of splitting a sentence into its constituent parts: the words and punctuation that it is made up of.

In [16]:
from nltk import word_tokenize, download
download(['punkt','averaged_perceptron_tagger','stopwords'])

[nltk_data] Downloading package punkt to /Users/ashaynaik/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The **download** method downloads the requested packages from NLTK. NLTK data contains different corpora and trained models. In the preceding example, we download 'punkt' (tokenizer models), a perceptron tagger used for parts of speech tagging, and the stop word list. As the log output shows, the packages are stored under the nltk_data directory in your home directory, in subdirectories such as tokenizers/, taggers/, and corpora/.

In [17]:
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words

In [18]:
print(get_tokens("I am reading NLP Fundamentals."))

['I', 'am', 'reading', 'NLP', 'Fundamentals', '.']


The **word_tokenize** method is used to split the sentence into words/tokens. 
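
To get a feel for what a tokenizer does, here is a minimal regex-based approximation in plain Python. It is only a sketch and not how **word_tokenize** is implemented; the function name `simple_tokenize` and the pattern are assumptions for illustration, and contractions, URLs, and similar cases are not handled.

```python
import re

def simple_tokenize(text):
    # A run of word characters is one token; any other
    # non-space character (punctuation) is a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I am reading NLP Fundamentals.")
print(tokens)
```

For this simple sentence the result matches the NLTK output shown above, but the two will diverge on harder input.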

## parts of speech (pos) tagging

The **pos_tag** method returns a list of tuples in which every tuple consists of the word followed by the PoS tag.

List of PoS tags: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

In [19]:
from nltk import word_tokenize, pos_tag

def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words

def get_pos(words):
    return pos_tag(words)

words  = get_tokens("I am reading NLP Fundamentals")
get_pos(words)

[('I', 'PRP'),
 ('am', 'VBP'),
 ('reading', 'VBG'),
 ('NLP', 'NNP'),
 ('Fundamentals', 'NNS')]

## stop word removal

In [20]:
from nltk import download
download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
# Check the list of stop words provided for English

stop_words = stopwords.words('english')
print(stop_words[:5])

['i', 'me', 'my', 'myself', 'we']


In [22]:
# Remove stop words from a sentence

def remove_stop_words(sentence, stop_words):
    return ' '.join([word for word in sentence if word not in stop_words])

sentence = "I am learning Python. \
    It is one of the most popular programming languages"

sentence_words = word_tokenize(sentence)

print(remove_stop_words(sentence_words, stop_words))

I learning Python . It one popular programming languages


In [23]:
# Add your own stop words to the stop word list

stop_words.extend(['I', 'It', 'one'])
print(remove_stop_words(sentence_words, stop_words))

learning Python . popular programming languages
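
The words 'I' and 'It' survived the first pass because the stop word list is lowercase and the comparison is case-sensitive. One alternative to extending the list, sketched here with a small hand-written stop list (an assumption, not the NLTK list), is to compare the lowercase form of each token.

```python
def remove_stop_words_ci(words, stop_words):
    # Compare case-insensitively so 'I' and 'It' match 'i' and 'it'
    stop_set = {w.lower() for w in stop_words}
    return [w for w in words if w.lower() not in stop_set]

# Hand-written tokens and stop list, for illustration only
tokens = ['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one',
          'of', 'the', 'most', 'popular', 'programming', 'languages']
small_stop_list = ['i', 'am', 'it', 'is', 'of', 'the', 'most']

filtered = remove_stop_words_ci(tokens, small_stop_list)
print(' '.join(filtered))
```

Using a set also makes each membership check faster than scanning a list.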


## text normalization

Text normalization is a process wherein different variations of text get converted into a standard form. We need to perform text normalization because different surface forms can mean the same thing; for example, "US" and "United States" refer to the same country. There are various ways of normalizing text, such as spelling correction, stemming, and lemmatization.

In [26]:
def normalize(text):
    return text.replace("US", "United States")   \
                .replace("UK", "United Kingdom") \
                .replace("-18", "-2018")

sentence = "I visited the US from the UK on 22-10-18"
normalized_sentence = normalize(sentence)
print(normalized_sentence)

normalized_sentence = normalize('The US and the UK are two superpowers')
print(normalized_sentence)

I visited the United States from the United Kingdom on 22-10-2018
The United States and the United Kingdom are two superpowers
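
Plain `str.replace` also rewrites matches inside longer words, so "USB" would become "United StatesB". A possible variation, sketched below, uses a regular expression with word boundaries. The `REPLACEMENTS` table and the function name `normalize_terms` are assumptions for this example.

```python
import re

# Hypothetical replacement table for illustration
REPLACEMENTS = {"US": "United States", "UK": "United Kingdom"}

def normalize_terms(text, replacements=REPLACEMENTS):
    # \b restricts matches to whole words, so 'USB' is left alone
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in replacements) + r")\b"
    )
    return pattern.sub(lambda match: replacements[match.group(1)], text)

print(normalize_terms("Plug the USB drive in before flying to the US"))
```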


## spelling correction

**autocorrect** is a Python library used to correct the spelling of misspelled words for different languages. It provides a **Speller** class; calling a Speller instance with a word returns the corrected spelling of that word.

In [28]:
from nltk import word_tokenize
from autocorrect import Speller

spell = Speller(lang='en')
spell('Natureal')

'Natural'

In [31]:
sentence = word_tokenize("Ntural Luanguage Processin deals with the art of extracting insightes from Natural Languaes")

def correct_spelling(tokens):
    sentence_corrected = ' '.join([spell(word) for word in tokens])
    return sentence_corrected

print(correct_spelling(sentence))

Natural Language Procession deals with the art of extracting insights from Natural Languages


Most of the wrongly spelled words have been corrected. But the word "Processin" was wrongly converted into "Procession." This happened because "Procession" and "Processing" are each a single edit away from "Processin," so the corrector has no spelling-based reason to prefer one over the other. To rectify this, we need to use other kinds of spelling correctors that are aware of context.
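
The tie between the two candidates can be checked with a small edit-distance function. This is a standard Levenshtein distance written in plain Python as a sketch; it is not taken from the autocorrect library, whose internals are not shown here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(edit_distance("Processin", "Procession"))
print(edit_distance("Processin", "Processing"))
```

Both calls return the same distance, so a corrector that looks only at spelling needs some other signal, such as word frequency or context, to break the tie.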

## stemming

Stemming is the process of converting words into their base forms.

We will be using two algorithms provided by the NLTK library: 
* The Porter stemmer is a rule-based algorithm that transforms words to their base form by removing suffixes from words. 
* The Snowball stemmer is an improvement over the Porter stemmer and is a little bit faster and uses less memory. 

In NLTK, stemming is done by the **stem** method provided by the **PorterStemmer** and **SnowballStemmer** classes.
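
To illustrate the idea of rule-based suffix removal, here is a deliberately tiny suffix stripper. It is a toy and not the Porter or Snowball algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen for this example.

```python
# Toy suffix list; real stemmers apply many ordered, conditional rules
SUFFIXES = ("ing", "ion", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix if at least 3 characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ("production", "battling", "coming", "dog"):
    print(word, "->", toy_stem(word))
```

The toy turns "coming" into "com", whereas a real stemmer has extra rules for cases like this, which is why the library implementations are preferred in practice.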

In [33]:
from nltk import stem

def get_stems(word, stemmer):
    return stemmer.stem(word)

porterStem = stem.PorterStemmer()

production = get_stems("production", porterStem)
coming = get_stems("coming", porterStem)
firing = get_stems("firing",porterStem)
battling = get_stems("battling",porterStem)

# for snowball specify language
stemmer = stem.SnowballStemmer("english")
battling_sb = get_stems("battling", stemmer)

print(production, coming, firing, battling, battling_sb)

product come fire battl battl


## lemmatization

Lemmatization is the process of converting words to their base grammatical form (the lemma), rather than simply stripping suffixes as stemming does. An additional check is made by looking the word up in a dictionary to extract its base form. Getting more accurate results requires some additional information; for example, supplying PoS tags along with the words helps.

In the following, we use **WordNetLemmatizer**, which is an NLTK interface to WordNet (a freely available lexical English database that can be used to generate semantic relationships between words). It provides a **lemmatize** method, which returns the lemma (grammatical base form) of a given word using WordNet.
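
The role of the dictionary lookup and of the PoS tag can be sketched with a toy lemmatizer. The `LEMMA_TABLE` below is a hypothetical mini-dictionary standing in for WordNet, and `toy_lemma` is a name made up for this example.

```python
# Hypothetical mini-dictionary; keys are (word, PoS) pairs
LEMMA_TABLE = {
    ("coming", "v"): "come",
    ("battling", "v"): "battle",
    ("products", "n"): "product",
    ("better", "a"): "good",
}

def toy_lemma(word, pos="n"):
    # Fall back to the word itself when the dictionary has no entry
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemma("coming"))        # treated as a noun: no entry, unchanged
print(toy_lemma("coming", "v"))   # treated as a verb: entry found
```

The same word yields different results depending on the PoS supplied, which is why passing PoS information tends to improve lemmatization.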

In [34]:
from nltk import download
download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [36]:
lemmatizer = WordNetLemmatizer()

def get_lemma(word):
    return lemmatizer.lemmatize(word)

products = get_lemma('products')
production = get_lemma('production')
coming = get_lemma('coming')
battling = get_lemma('battling')

print(products, production, coming, battling)

product production coming battling


## named entity recognition (ner)

NER is the process of extracting important entities, such as person names, place names, and organization names, from some given text. These are usually not present in dictionaries.

Chunking is the process of grouping words together into chunks, which can be further used to find noun groups and verb groups, or can also be used for sentence partitioning.
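
As a minimal illustration of chunking, the sketch below groups consecutive proper-noun tokens into one chunk. The tagged tokens are hand-written for the example (not produced by a tagger), and `chunk_proper_nouns` is a hypothetical helper, much simpler than NLTK's chunkers.

```python
from itertools import groupby

def chunk_proper_nouns(tagged_tokens):
    chunks = []
    # groupby splits the sequence into runs of proper / non-proper tokens
    for is_proper, group in groupby(
        tagged_tokens, key=lambda pair: pair[1].startswith("NNP")
    ):
        if is_proper:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

# Hand-written (word, tag) pairs, assumed for illustration
tagged = [("Donald", "NNP"), ("Trump", "NNP"), ("visited", "VBD"),
          ("Birmingham", "NNP"), (".", ".")]
print(chunk_proper_nouns(tagged))
```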

In [37]:
from nltk import download
from nltk import pos_tag
from nltk import ne_chunk
from nltk import word_tokenize
download('maxent_ne_chunker')
download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /Users/ashaynaik/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [38]:
sentence = "We are reading a book published by Packt which is based out of Birmingham."

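from nltk.tree import Tree

def get_ner(text):
    chunks = ne_chunk(pos_tag(word_tokenize(text)), binary=True)
    # Named entities are subtrees; ordinary tokens are (word, tag) tuples.
    # Checking the type also keeps multi-word entities.
    return [chunk for chunk in chunks if isinstance(chunk, Tree)]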

get_ner(sentence)

[Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]

## word sense disambiguation

Word sense disambiguation is the process of mapping a word to the sense that it carries in a given context.

One of the algorithms to solve word sense disambiguation is the Lesk algorithm. It relies on a large lexical resource in the background (generally WordNet is used) that contains definitions for the possible senses of each word. It takes a word and its context as input and looks for overlap between the context and each definition of the word. The sense whose definition has the highest number of matches with the context is returned.

The **lesk** function from NLTK's **nltk.wsd** module takes a tokenized sentence and the word as input. Its output is a **synset**, which identifies the matched sense. The corresponding definition can be read with the **definition** method of the returned synset.
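
The overlap idea behind Lesk can be sketched in a few lines of plain Python. The `SENSES` inventory below is a hypothetical two-entry stand-in for WordNet, and `simple_lesk` is a simplified illustration rather than NLTK's implementation; for instance, it does not filter stop words.

```python
# Hypothetical sense inventory with hand-written definitions
SENSES = {
    "bank": {
        "financial": "an institution that accepts deposits and keeps savings and money",
        "river": "sloping land beside a body of water such as a river",
    }
}

def simple_lesk(context_sentence, word, senses=SENSES):
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for name, definition in senses[word].items():
        # Count the words shared by the context and this definition
        overlap = len(context & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense

print(simple_lesk("Keep your savings in the bank", "bank"))
print(simple_lesk("He sat on the bank of the river", "bank"))
```

The result depends entirely on which words happen to appear in the definitions, which also explains why the real algorithm can return surprising senses on short sentences.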


In [39]:
from nltk.wsd import lesk
from nltk import word_tokenize

sentence1 = "Keep your savings in the bank"
sentence2 = "It's so risky to drive over the banks of the road"

def get_synset(sentence, word):
    return lesk(word_tokenize(sentence), word)

get_synset(sentence1,'bank')

Synset('savings_bank.n.02')

In [40]:
get_synset(sentence2,'bank')

Synset('bank.v.07')

## sentence boundary detection

The **sent_tokenize** method is used to detect sentence boundaries. 

In [41]:
import nltk
from nltk.tokenize import sent_tokenize

def get_sentences(text):
    return sent_tokenize(text)

get_sentences("We are reading a book. Do you know who is the publisher? It is Packt. Packt is based out of Birmingham.")

['We are reading a book.',
 'Do you know who is the publisher?',
 'It is Packt.',
 'Packt is based out of Birmingham.']

In [42]:
get_sentences("Mr. Donald John Trump is the current president of the USA. Before joining politics, he was a businessman.")

['Mr. Donald John Trump is the current president of the USA.',
 'Before joining politics, he was a businessman.']
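
For contrast, a naive rule that splits after every '.', '?', or '!' followed by whitespace breaks on abbreviations such as "Mr.". The helper `naive_split` below is an assumption written for this comparison, not part of NLTK.

```python
import re

def naive_split(text):
    # Split on whitespace that follows sentence-ending punctuation
    return re.split(r"(?<=[.?!])\s+", text.strip())

parts = naive_split("Mr. Smith arrived. He sat down.")
print(parts)
```

The naive rule treats "Mr." as a sentence of its own, whereas **sent_tokenize** kept the title attached to the name in the output above.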

# activity: preprocessing of raw text

1. Import the necessary libraries.
2. Load the text corpus to a variable.
3. Apply the tokenization process to the text corpus and print the first 20 tokens.
4. Apply spelling correction on each token and print the initial 20 corrected tokens as well as the corrected text corpus.
5. Apply PoS tags to each of the corrected tokens and print them.
6. Remove stop words from the corrected token list and print the initial 20 tokens.
7. Apply stemming and lemmatization to the corrected token list and then print the initial 20 tokens.
8. Detect the sentence boundaries in the given text corpus and print the total number of sentences.

In [43]:
# 1

from nltk import download
download('stopwords')
download('wordnet')
download('averaged_perceptron_tagger')
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from autocorrect import Speller
from nltk.wsd import lesk
from nltk.tokenize import sent_tokenize
from nltk import stem, pos_tag
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ashaynaik/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [45]:
# 2, 3
# Use a context manager so the file is closed after reading
with open("data/file.txt", 'r') as f:
    sentence = f.read()
words = word_tokenize(sentence)
print(words[0:20])

['The', 'reader', 'of', 'this', 'course', 'should', 'have', 'a', 'basic', 'knowledge', 'of', 'the', 'Python', 'programming', 'lenguage', '.', 'He/she', 'must', 'have', 'knowldge']


In [46]:
# 4

spell = Speller(lang='en')

def correct_sentence(words):
    corrected_sentence = ""
    corrected_word_list = []
    for wd in words:
        if wd in string.punctuation:
            # Attach punctuation directly, without a leading space
            corrected_sentence += wd
            corrected_word_list.append(wd)
            continue
        wd_c = spell(wd)
        if wd_c != wd:
            print(wd + " has been corrected to: " + wd_c)
        corrected_sentence += " " + wd_c
        corrected_word_list.append(wd_c)
    return corrected_sentence, corrected_word_list

corrected_sentence, corrected_word_list = correct_sentence(words)

lenguage has been corrected to: language
knowldge has been corrected to: knowledge


In [47]:
sentence

'The reader of this course should have a basic knowledge of the Python programming lenguage.\nHe/she must have knowldge of data types in Python.He should be able to write functions,\n and also have the ability to import and use libraries and packages in Python. Familiarity\n with basic linguistics and probability is assumed although not required to fully\n complete this course.\n'

In [49]:
corrected_sentence

' The reader of this course should have a basic knowledge of the Python programming language. He/she must have knowledge of data types in Python.He should be able to write functions, and also have the ability to import and use libraries and packages in Python. Familiarity with basic linguistics and probability is assumed although not required to fully complete this course.'

In [51]:
print(corrected_word_list[0:20])

['The', 'reader', 'of', 'this', 'course', 'should', 'have', 'a', 'basic', 'knowledge', 'of', 'the', 'Python', 'programming', 'language', '.', 'He/she', 'must', 'have', 'knowledge']


In [52]:
# 5

print(pos_tag(corrected_word_list))

[('The', 'DT'), ('reader', 'NN'), ('of', 'IN'), ('this', 'DT'), ('course', 'NN'), ('should', 'MD'), ('have', 'VB'), ('a', 'DT'), ('basic', 'JJ'), ('knowledge', 'NN'), ('of', 'IN'), ('the', 'DT'), ('Python', 'NNP'), ('programming', 'NN'), ('language', 'NN'), ('.', '.'), ('He/she', 'NNP'), ('must', 'MD'), ('have', 'VB'), ('knowledge', 'NN'), ('of', 'IN'), ('data', 'NNS'), ('types', 'NNS'), ('in', 'IN'), ('Python.He', 'NNP'), ('should', 'MD'), ('be', 'VB'), ('able', 'JJ'), ('to', 'TO'), ('write', 'VB'), ('functions', 'NNS'), (',', ','), ('and', 'CC'), ('also', 'RB'), ('have', 'VBP'), ('the', 'DT'), ('ability', 'NN'), ('to', 'TO'), ('import', 'NN'), ('and', 'CC'), ('use', 'NN'), ('libraries', 'NNS'), ('and', 'CC'), ('packages', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.'), ('Familiarity', 'NN'), ('with', 'IN'), ('basic', 'JJ'), ('linguistics', 'NNS'), ('and', 'CC'), ('probability', 'NN'), ('is', 'VBZ'), ('assumed', 'VBN'), ('although', 'IN'), ('not', 'RB'), ('required', 'VBN'), ('to

In [53]:
# 6

stop_words = stopwords.words('english')
def remove_stop_words(word_list):
    corrected_word_list_without_stopwords = []
    for wd in word_list:
        if wd not in stop_words:
            corrected_word_list_without_stopwords.append(wd)
    return corrected_word_list_without_stopwords

corrected_word_list_without_stopwords = remove_stop_words(corrected_word_list)
print(corrected_word_list_without_stopwords[:20])

['The', 'reader', 'course', 'basic', 'knowledge', 'Python', 'programming', 'language', '.', 'He/she', 'must', 'knowledge', 'data', 'types', 'Python.He', 'able', 'write', 'functions', ',', 'also']


In [55]:
# 7

stemmer = stem.PorterStemmer()
def get_stems(word_list):
    corrected_word_list_without_stopwords_stemmed = []
    for wd in word_list:
        corrected_word_list_without_stopwords_stemmed.append(stemmer.stem(wd))
    return corrected_word_list_without_stopwords_stemmed

corrected_word_list_without_stopwords_stemmed = get_stems(corrected_word_list_without_stopwords)
print(corrected_word_list_without_stopwords_stemmed[:20])

['the', 'reader', 'cours', 'basic', 'knowledg', 'python', 'program', 'languag', '.', 'he/sh', 'must', 'knowledg', 'data', 'type', 'python.h', 'abl', 'write', 'function', ',', 'also']


In [57]:
lemmatizer = WordNetLemmatizer()
def get_lemma(word_list):
    corrected_word_list_without_stopwords_lemmatized = []
    for wd in word_list:
        corrected_word_list_without_stopwords_lemmatized.append(lemmatizer.lemmatize(wd))
    return corrected_word_list_without_stopwords_lemmatized
corrected_word_list_without_stopwords_lemmatized =  get_lemma(corrected_word_list_without_stopwords_stemmed)
print(corrected_word_list_without_stopwords_lemmatized[:20])

['the', 'reader', 'cours', 'basic', 'knowledg', 'python', 'program', 'languag', '.', 'he/sh', 'must', 'knowledg', 'data', 'type', 'python.h', 'abl', 'write', 'function', ',', 'also']


In [58]:
# 8

print(sent_tokenize(corrected_sentence))

[' The reader of this course should have a basic knowledge of the Python programming language.', 'He/she must have knowledge of data types in Python.He should be able to write functions, and also have the ability to import and use libraries and packages in Python.', 'Familiarity with basic linguistics and probability is assumed although not required to fully complete this course.']


# kick starting an nlp project

## data collection

This is the initial phase of any NLP project. The purpose is to collect data as per our requirements. For this, we may either use existing data, collect data from various online repositories, or create our own dataset by crawling the web. In our case, we will collect different email data. To start with, we can even get this data from our personal emails.

## data preprocessing

Once the data is collected, we need to clean it using different preprocessing steps, since clean data is necessary for the later stages to be effective and accurate. Typical steps include:
1. Converting all the text data to lowercase
2. Stop word removal
3. Text normalization, which will include replacing all numbers with some common term and replacing punctuation with empty strings
4. Stemming and lemmatization
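
The first three steps above can be sketched in plain Python as follows. The tiny `STOP_WORDS` set and the placeholder term "num" are assumptions for this example, and stemming and lemmatization are left out to keep the sketch free of external libraries.

```python
import re
import string

# Small hand-written stop list, for illustration only
STOP_WORDS = {"the", "is", "a", "an", "at", "to", "of", "and"}

def preprocess(text):
    text = text.lower()                      # 1. lowercase
    text = re.sub(r"\d+", "num", text)       # 3a. numbers -> common term
    # 3b. replace punctuation with empty strings
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. stop word removal
    return [word for word in text.split() if word not in STOP_WORDS]

cleaned = preprocess("The meeting is at 10, Room 42!")
print(cleaned)
```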

## feature extraction

Machine learning models tend to understand only numeric data. Therefore, it becomes necessary to convert text data into its equivalent numerical form.

To convert every email into its equivalent numerical form, we will create a dictionary of all the unique words in our data and assign a unique index to each word. Then, we will represent every email with a list having a length equal to the number of unique words in the data. The list will have 1 at the indices of words that are present in the email and 0 at the other indices. This is called one-hot encoding (when a whole document is encoded this way, the result is also known as a binary bag-of-words vector).
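
The encoding just described can be sketched as follows. The three example emails and the function names `build_vocabulary` and `encode` are made up for illustration; whitespace splitting stands in for proper tokenization.

```python
def build_vocabulary(documents):
    # Assign each unique word the next free index, in order of appearance
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary

def encode(document, vocabulary):
    # 1 where a vocabulary word occurs in the document, 0 elsewhere
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector

emails = ["meeting at noon", "win a free prize", "free lunch at noon"]
vocabulary = build_vocabulary(emails)
print(vocabulary)
print(encode("free lunch at noon", vocabulary))
```

Each vector has one position per unique word, so the representation grows with the vocabulary and does not record word order or frequency.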

## model development

Once the feature set is ready, we need to develop a suitable model that can be trained to gain knowledge from the data. These models are generally statistical, machine learning-based, deep learning-based, or reinforcement learning-based. In our case, we will build a model that is capable of differentiating between important and unimportant emails.

## model assessment

After developing a model, it is essential to benchmark it. This process of benchmarking is known as model assessment. In this step, we evaluate the performance of our model, often by comparing it to other models, using metrics such as precision, recall, and accuracy. In our case, we will evaluate the newly created model by seeing how well it performs at classifying emails as important and unimportant.
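
The three metrics named above can be computed directly from true and predicted labels. The sketch below uses made-up labels, with 1 standing for "important" and 0 for "unimportant"; the function name `classification_metrics` is an assumption for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        # Of the emails flagged important, how many really were
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the truly important emails, how many were flagged
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all emails classified correctly
        "accuracy": correct / len(pairs),
    }

# Made-up labels for illustration
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```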

## model deployment

This is the final stage for most industrial NLP projects. In this stage, the models are put into production. They are either integrated into an existing system or new products are created by keeping this model as a base. In our case, we will deploy our model to production, so that it can classify emails as important and unimportant in real time.

# end notes

Use Google News API to collect news.