 <div class="alert alert-block alert-info">  
    <h1 id="heading" align="center"> 📌Natural Language Processing - NLP 101 📌</h1>
    <img src="https://litslink.com/wp-content/uploads/2020/07/nlp-illustration.png"/>
 </div>

# Natural Language Processing (NLP)

**NLP** sits at the intersection of computer science, artificial intelligence, information engineering, and human-computer interaction. The field focuses on how to program computers to process and analyze large amounts of natural language data. This is hard to do well, because reading and understanding language is far more complex than it seems at first glance.

# Install Dependencies

In [27]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# Case Conversion

In [28]:
text = 'The quick brown fox jumped over The Big Dog'
text

In [29]:
text.lower()

In [30]:
text.upper()

In [31]:
text.title()

# Tokenization

**Tokenization** is the process of splitting a string of text into a list of tokens. A token is a unit of a larger piece of text: a word is a token in a sentence, and a sentence is a token in a paragraph.

![Tokenization](https://media.geeksforgeeks.org/wp-content/uploads/tokenizer.jpg)
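To make the idea concrete before turning to NLTK, here is a minimal sketch that tokenizes with plain regular expressions. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the rules are far cruder than those of a real tokenizer.

```python
import re

def simple_word_tokenize(text):
    # A token is either a word (allowing internal apostrophes, as in "Isn't")
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

demo = "NLP is fun. Isn't it? Tokens can be words, or sentences!"
print(simple_sent_tokenize(demo))
print(simple_word_tokenize(demo))
```

Rules this simple break on abbreviations, decimals and quotes, which is why the library tokenizers below are preferable in practice.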

In [32]:
sample_text = ("US unveils world's most powerful supercomputer, beats China. " 
               "The US has unveiled the world's most powerful supercomputer called 'Summit', " 
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, "
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")
sample_text

## Sentence tokenizer

In [33]:
nltk.sent_tokenize(sample_text)

**How does sent_tokenize work?**

The `sent_tokenize` function uses an instance of `PunktSentenceTokenizer` from the `nltk.tokenize.punkt` module. This tokenizer comes pre-trained, so it already knows which characters and punctuation marks signal the beginning and end of a sentence.
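The hard part of sentence splitting is deciding whether a period ends a sentence or belongs to an abbreviation. The sketch below is a deliberately simplified illustration of that decision, not the Punkt algorithm itself: it uses a small hand-written abbreviation set (`ABBREVIATIONS` and `split_sentences` are hypothetical names), whereas a trained tokenizer learns such information from data.

```python
import re

# Hypothetical, hand-written abbreviation list used only for this illustration.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundaries: '.', '!' or '?' followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep reading
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith went home. He was tired."))
```

Without the abbreviation check, the same text would be cut after "Dr.", which is the kind of mistake a trained sentence tokenizer is meant to avoid.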

## PunktWord Tokenizer

In [34]:
# It doesn't separate the punctuation from the words.
# from nltk.tokenize import PunktWordTokenizer
# PunktWordTokenizer().tokenize(sample_text)

**N.B.**: `PunktWordTokenizer` is no longer available in NLTK. It was an internal class that was never intended to be public, which is why the name can no longer be imported.

## Word tokenizer

In [35]:
print(nltk.word_tokenize(sample_text))

## WordPunct Tokenizer

In [36]:
# It separates the punctuation from the words.
from nltk.tokenize import WordPunctTokenizer
WordPunctTokenizer().tokenize(sample_text)

## TreebankWord Tokenizer

In [37]:
from nltk.tokenize import TreebankWordTokenizer
  
TreebankWordTokenizer().tokenize(sample_text)

# Spacy Tokenization

## Sentence
## Word

In [38]:
import spacy

# spacy.load('en') won't work so change it to spacy.load('en_core_web_sm')
nlp = spacy.load('en_core_web_sm') 

text_spacy = nlp(sample_text)

In [39]:
[obj.text for obj in text_spacy.sents]

In [40]:
print([obj.text for obj in text_spacy])

# Removing HTML tags & noise

In [41]:
import requests

data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
content = data.text
print(content[2745:3948])

In [42]:
import re
from bs4 import BeautifulSoup

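def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # Drop embedded frames and scripts before extracting the visible text
    for tag in soup(['iframe', 'script']):
        tag.extract()
    stripped_text = soup.get_text()
    # Collapse runs of carriage returns and newlines into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text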

clean_content = strip_html_tags(content)
print(clean_content[1163:1957])

# Removing Accented Characters

In [43]:
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [44]:
s = 'Sómě Áccěntěd těxt'
s

In [45]:
remove_accented_chars(s)
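The function above relies on Unicode NFKD normalization, which splits an accented letter into a base letter plus a combining mark; the ASCII encoding step then discards the marks. A small sketch using only the standard library makes that decomposition visible (the variable names are just for illustration):

```python
import unicodedata

word = "t\u011bxt"  # "těxt", written with an escape to avoid encoding issues
decomposed = unicodedata.normalize("NFKD", word)

# Each accented letter becomes a base letter followed by a combining mark.
for ch in decomposed:
    print(repr(ch), unicodedata.name(ch), unicodedata.combining(ch))

# Dropping the combining marks leaves plain ASCII letters.
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(stripped)
```

Filtering on `unicodedata.combining` is an alternative to the encode/decode round trip; unlike that approach, it keeps non-ASCII characters that have no decomposition.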

# Removing Special Characters, Numbers and Symbols

In [46]:
import re

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text


In [47]:
# Sentence
s = "Well this was fun! See you at 7:30, What do you think!!? #$@@9318@ 🙂🙂🙂"
s

In [48]:
# Removing special characters and digits too
remove_special_characters(s, remove_digits=True)

In [49]:
# Removing only special characters
remove_special_characters(s)

# Expanding Contractions

In [50]:
!pip install contractions
!pip install textsearch
print("installation done")

**What are contractions?**

Contractions are words or combinations of words that are shortened by dropping letters and replacing them by an apostrophe.

Nowadays, as more communication moves online, we talk to others through text messages and posts on social media such as Facebook, Instagram, WhatsApp, Twitter, and LinkedIn. With so many people to talk to, we rely on abbreviations and shortened forms of words when texting.

**For example** I’ll be there within 5 min. Are u not gng there? Am I mssng out on smthng? I’d like to see u near d park.

In English **contractions**, we often drop letters (frequently vowels) from a word. Expanding contractions contributes to text standardization and is useful when working with Twitter data or product reviews, where individual words play an important role in sentiment analysis.

How to expand contractions?

* Using the `contractions` library installed above.
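To show the general idea of dictionary-based expansion, here is a minimal sketch built on a regular expression and a tiny hand-written mapping. `CONTRACTION_MAP` and `expand_contractions` are hypothetical names for this illustration, and nothing here is meant to describe how the `contractions` library works internally.

```python
import re

# Hypothetical mini-dictionary for illustration only.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'll": "I will",
    "we've": "we have",
}

# One alternation of all known contractions, matched case-insensitively.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        original = match.group(0)
        expanded = CONTRACTION_MAP[original.lower()]
        # Preserve a leading capital letter from the original token.
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return pattern.sub(replace, text)

print(expand_contractions("It's late and we've got work, so I can't stay."))
```

A fixed dictionary cannot resolve ambiguous forms such as "I'd" (I would / I had), so a sketch like this is only a starting point.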

In [51]:
s = "Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?"
s

In [52]:
import contractions

# list the first 10 entries of the contractions dictionary
list(contractions.contractions_dict.items())[:10]

In [53]:
# fix the sentence containing contractions.
contractions.fix(s)

In [54]:
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too? 
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.'''
  
# creating an empty list
expanded_words = []    
for word in text.split():
  # using contractions.fix to expand the shortened words
  expanded_words.append(contractions.fix(word))   
    
expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print("--"*80+"\n")
print('Expanded_text: ' + expanded_text)

# Stemming

**Stemming** algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting works on some occasions, but not always.
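As a rough illustration of this cutting, the sketch below strips a handful of suffixes. It is a toy, not the Porter algorithm: `SUFFIXES`, `naive_stem` and the `min_stem_len` parameter are all assumptions made for this example.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumps", "jumped", "sing", "caring"]:
    print(w, "->", naive_stem(w))
```

The first three words collapse to the same stem, while "caring" is cut down to "car", which shows how blind suffix removal can merge unrelated words.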

In [55]:
# Porter Stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped')

In [56]:
ps.stem('lying')

In [57]:
ps.stem('strange')

# Lemmatization

**Lemmatization**, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma. 
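The dictionary lookup can be pictured with a minimal sketch. The tiny `LEMMA_DICT` and the `lookup_lemma` helper are hypothetical; a real lemmatizer combines a much larger lexicon with morphological rules.

```python
# Hypothetical mini-dictionary keyed by (word form, part of speech).
LEMMA_DICT = {
    ("ate", "v"): "eat",
    ("running", "v"): "run",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",
}

def lookup_lemma(word, pos="n"):
    # Fall back to the word itself when the dictionary has no entry.
    return LEMMA_DICT.get((word.lower(), pos), word)

print(lookup_lemma("ate", "v"))
print(lookup_lemma("ate", "n"))
print(lookup_lemma("mice"))
```

Because the key includes the part of speech, the same form can map to a lemma under one tag and stay unchanged under another, which is why supplying the right tag matters.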

In [58]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [59]:
help(wnl.lemmatize)

In [60]:
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('boxes', 'n'))

In [61]:
# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

In [62]:
# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

In [63]:
# ineffective lemmatization
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))
print(wnl.lemmatize('fancier'))

In [64]:
s = 'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

### Tokenize

In [65]:
tokens = nltk.word_tokenize(s)
print(tokens)

In [66]:
lemmatized_text = ' '.join(wnl.lemmatize(token) for token in tokens)
lemmatized_text

### POS Tagging

**What is Part of Speech?**

The **part of speech** explains how a word is used in a sentence. There are eight main parts of speech - **nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections.**

![Part of Speech](https://miro.medium.com/max/345/0*V635bzjWK2n1jBsd.png)

* Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope
* Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is
* Adjective(ADJ)- big, happy, green, young, fun, crazy, three
* Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow
* Preposition (P)- at, on, in, from, with, near, between, about, under
* Conjunction (CON)- and, or, but, because, so, yet, unless, since, if
* Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this
* Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!

Most **POS** are divided into sub-classes. **POS Tagging** simply means labeling words with their appropriate Part-Of-Speech.

[reference](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb)
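As a toy illustration of what labeling words means, the sketch below tags tokens by looking them up in a small lexicon that uses the coarse labels from the list above. `LEXICON` and `lookup_tag` are hypothetical, and unlike a real tagger this approach ignores the surrounding context entirely.

```python
# Hypothetical word-to-tag lexicon built from the examples in the list above.
LEXICON = {
    "daniel": "N", "london": "N",
    "walk": "V", "eat": "V",
    "happy": "ADJ",
    "slowly": "ADV",
    "in": "P", "at": "P",
    "and": "CON",
    "i": "PRO", "they": "PRO",
}

def lookup_tag(tokens, default="N"):
    # Unknown words fall back to the default tag (noun).
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

print(lookup_tag("Daniel and I walk slowly in Paris".split()))
```

A lookup table cannot tell apart the noun and verb uses of a word like "walk", which is one reason practical taggers are statistical and context-aware.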


In [67]:
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)

### Tag conversion to WordNet Tags

In [68]:
from nltk.corpus import wordnet

def pos_tag_wordnet(tagged_tokens):
    tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
    new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN))
                            for word, tag in tagged_tokens]
    return new_tagged_tokens

In [69]:
wordnet_tokens = pos_tag_wordnet(tagged_tokens)
print(wordnet_tokens)

### Chunking

**Chunking** is the process of extracting phrases from unstructured text. Individual tokens may not capture the actual meaning of the text, so it is advisable to treat a phrase such as `"South Africa"` as a single unit instead of the separate words `"South"` and `"Africa"`.
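One simple way to picture chunking is as pattern matching over part-of-speech tags. The sketch below, in plain Python, groups an optional determiner, any number of adjectives and a noun into a noun phrase. The `chunk_noun_phrases` helper is hypothetical and the tags are written by hand, so treat it as an illustration of the idea only.

```python
def chunk_noun_phrases(tagged):
    """Greedy match of: optional DT, any number of JJ, then one NN."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":
            # Pattern matched: join the words from i through j into one chunk.
            chunks.append(" ".join(word for word, _ in tagged[i : j + 1]))
            i = j + 1
        else:
            i += 1  # no noun phrase starts here, move on
    return chunks

# Hand-written Penn Treebank style tags (an assumption; normally a tagger supplies them).
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(chunk_noun_phrases(tagged))
```

A grammar-driven chunker expresses the same pattern declaratively, which makes it easier to add further phrase types.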

In [70]:
sentence = "the little yellow dog barked at the cat"

In [71]:
#Define your grammar using regular expressions
grammar = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')

In [72]:
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tagged

In [73]:
tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)

In [None]:
# tree.draw()
# Left commented out: .draw() opens a separate Tkinter window, which may fail in hosted or headless notebooks

![Chunking](https://miro.medium.com/max/548/1*ZNcznQAcvThZwjF0mPjTqw.png)

* Tree diagram from the above code

### Effective Lemmatization

In [75]:
lemmatized_text = ' '.join(wnl.lemmatize(word, tag) for word, tag in wordnet_tokens)
lemmatized_text

### Let's define a function such that you put all the above steps together so that it does the following

- Function name is __`wordnet_lemmatize_text(...)`__
- Input is a variable __`text`__ which should take in a document (bunch of words)
- Call the earlier defined functions and utilize them
- Return lemmatized text as the output (as a string)

In [76]:
wnl = WordNetLemmatizer()

def wordnet_lemmatize_text(text):
    tagged_tokens = nltk.pos_tag(nltk.word_tokenize(text))
    wordnet_tokens = pos_tag_wordnet(tagged_tokens)
    lemmatized_text = ' '.join(wnl.lemmatize(word, tag) for word, tag in wordnet_tokens)
    return lemmatized_text

### Let's call the function on the below sentence and test it

In [77]:
s

In [78]:
wordnet_lemmatize_text(s)

## Lemmatization with Spacy

In [79]:
import spacy
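# Disable the parser and NER, which are not needed for lemmatization.
# (The older parse=/tag=/entity= keyword arguments are not accepted by recent spaCy versions.)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])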

def spacy_lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [80]:
# the sentence
s

In [81]:
spacy_lemmatize_text(s)

# Stopword Removal

In [82]:
def remove_stopwords(text, is_lower_case=False, stopwords=None):
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [83]:
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words[:10])

In [84]:
s

In [85]:
remove_stopwords(s, is_lower_case=False)

### Let's remove 'the' from the stop_words list, add 'brown' to it, and call the function with this new list

In [86]:
stop_words.remove('the')
stop_words.append('brown')

In [87]:
remove_stopwords(s, is_lower_case=False, stopwords=stop_words)

---

# This is all for this Notebook. 📔


I am just starting out with NLP, so your remarks and suggestions are welcome and appreciated.💡💡💡