<div class="alert alert-block alert-info">
Notebook Author:<br>Felix Gonzalez, P.E. <br> Adjunct Instructor, <br> Division of Professional Studies <br> Computer Science and Electrical Engineering <br> University of Maryland Baltimore County <br> fgonzale@umbc.edu
</div>

<div class="alert alert-block alert-info">
Acknowledgements:<br>Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
<br> https://www.nltk.org/book/
</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os 

import nltk # Natural Language Toolkit
from nltk import word_tokenize, pos_tag # Tokenizer and Parts of Speech Tags
from nltk.tokenize import RegexpTokenizer # Tokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer # Lemmatization and Stemming
from nltk.corpus import stopwords, wordnet # Stopwords and POS tags
#nltk.download('stopwords') # (One time to download 'stopwords')
#nltk.download('punkt') # (One time to download 'punkt')
#nltk.download('averaged_perceptron_tagger') # (One time to download 'averaged_perceptron_tagger')
#nltk.download('wordnet') # (One time to download 'wordnet', used by WordNetLemmatizer)

from sklearn.feature_extraction.text import TfidfVectorizer # Vectorization Functions
from sklearn.metrics import pairwise_distances # Cosine similarity

# Introduction

Various packages allow you to perform natural language processing (NLP). This notebook uses NLTK; other NLP libraries that can be used include Gensim and spaCy.
Lecture 15, 31_Natural_Language_Processing_Introduction.ipynb includes more details on the capabilities of some of these packages and on performing NLP. This notebook will focus on developing a Bag of Words (BoW) model and using the model to perform NLP tasks (e.g., similarity ranking, text clustering, and classification).
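
To make the BoW idea concrete, the following is a minimal sketch that uses only the Python standard library rather than the `TfidfVectorizer` imported above. The three short documents and the helper names (`bow_vector`, `cosine_similarity`) are made up for illustration. Each document becomes a vector of word counts over a shared vocabulary, and cosine similarity is then used to rank the documents against the first one.

```python
from collections import Counter
import math

# Hypothetical toy corpus for illustration; not the notebook's data.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text, vocabulary):
    """Return the word counts of `text`, one entry per vocabulary word."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = [bow_vector(doc, vocabulary) for doc in docs]

# Rank all documents by similarity to the first document.
query = vectors[0]
ranking = sorted(
    range(len(docs)),
    key=lambda i: cosine_similarity(query, vectors[i]),
    reverse=True,
)
print(ranking)
```

The first two documents share "the", "sat", and "on", so the second document ranks right after the query itself, while the third shares no words and scores zero. Note that word order is ignored entirely, which is the defining simplification of a BoW model.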



https://www.nltk.org/



This notebook focuses on the simpler BoW model. Specifically, it covers:
- The BoW model
- Text normalization (https://www.nltk.org/book/ch03.html)

# Data

# Data or Corpus Specific Stopwords



# Text Normalization Functions
[Return to Table of Contents](#Table-of-Contents)

During text normalization, various tasks can be performed, including but not limited to converting text to lower case, removing numbers and special characters, removing stop words, and applying lemmatization and/or stemming, both of which reduce words to a root or base form.

Documentation:
- NLTK Library: https://www.nltk.org/
- NLTK WordNetLemmatizer: https://www.nltk.org/_modules/nltk/stem/wordnet.html
- NLTK PorterStemmer: https://www.nltk.org/howto/stem.html
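
The cleaning steps described above can be traced on a single sentence. This sketch uses only the standard library `re` module; the sample sentence and the tiny `example_stopwords` set are invented for illustration and are not the notebook's stopword list. Lemmatization and stemming are left out here because they depend on NLTK resources.

```python
import re

# Hypothetical inputs for illustration only.
sample = "The 3 Pumps_failed at 10:45 AM; operators RESTARTED them!"
example_stopwords = {"the", "at", "am", "them"}

text = sample.lower()                     # convert to lower case
text = re.sub(r"\d+", " ", text)          # remove numbers
text = re.sub(r"[^\w\s]", " ", text)      # remove special characters
text = text.replace("_", " ")             # remove underscores (\w matches "_")
text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

tokens = text.split()
# Drop one-letter tokens and stopwords.
filtered = [w for w in tokens if len(w) > 1 and w not in example_stopwords]
print(filtered)
```

The underscore needs its own step because the `\w` character class treats it as a word character, so the special-character pattern does not remove it.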

In [None]:
import re

# Normalization of text.
def text_normalization(text, word_reduction_method):
    """Lower-case, clean, tokenize, filter, and then lemmatize or stem a narrative."""
    text = str(text).lower() # Convert narrative to a lower-case string.
    text = re.sub(r"\d+", " ", text) # Remove numbers.
    text = re.sub(r"[^\w\s]", " ", text) # Remove special characters.
    text = text.replace("_", " ") # Remove underscore characters.
    text = re.sub(r"\s+", " ", text) # Replace multiple spaces with a single space.
    tokenizer = RegexpTokenizer(r'\w+') # Tokenizer.
    tokens = tokenizer.tokenize(text) # Tokenize words.
    # Remove words of 1 letter only (can increase to a higher value as needed) and custom stopwords.
    # Note: stopwords_custom is not defined in this cell; it must be defined before calling this function.
    filtered_words = [w for w in tokens if len(w) > 1 and w not in stopwords_custom]
    if word_reduction_method == 'Lemmatization':
        lemmatizer = WordNetLemmatizer()
        # The second argument of lemmatize() is the POS tag.
        reduced_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in filtered_words]
    elif word_reduction_method == 'Stemming':
        stemmer = PorterStemmer() # Stemming can make words unreadable but is faster than lemmatization.
        reduced_words = [stemmer.stem(w) for w in filtered_words]
    else:
        raise ValueError("word_reduction_method must be 'Lemmatization' or 'Stemming'")
    return " ".join(reduced_words) # Join words with a space.

def get_wordnet_pos(word): # Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizer
    """Map POS tag to the first character that lemmatize() accepts."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN) # Default to noun for any other tag.

When performing NLP, you need to decide on the target feature or column. In this case, two columns contain unstructured data: one is the question and the other is the answer. We could also combine both into one column and use that as the target. Here we will use the question as the target. This will allow us to perform various NLP tasks on the target text or category:
- similarity calculations or ranking 
- text classification
- text clustering 

However, note that the questions tend to be very short. Before deciding on the target, it is good practice to explore some statistics, such as the number of words and the length of the text. Note below that normalizing text data sometimes removes all the words, leaving a target text with no words and hence no vector. This will cause issues later on and needs to be addressed (i.e., fixed or removed).
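
A minimal sketch of these checks, using plain Python lists and the standard library: it computes word counts and character lengths, then flags entries whose normalized text is empty. The questions, the `example_stopwords` set, and the stand-in `normalize` function are hypothetical; in the notebook, the normalization would be done by `text_normalization`.

```python
import statistics

# Hypothetical short questions; not the notebook's data.
questions = [
    "What is NLP?",
    "Why?",
    "How does stemming differ from lemmatization?",
]
example_stopwords = {"what", "is", "why", "how", "does", "from"}

def normalize(text):
    """Simplified stand-in for text_normalization:
    lower-case, strip basic punctuation, drop short words and stopwords."""
    tokens = [w.strip("?.,!").lower() for w in text.split()]
    return " ".join(w for w in tokens if len(w) > 1 and w not in example_stopwords)

# Basic statistics on the raw text.
word_counts = [len(q.split()) for q in questions]
char_lengths = [len(q) for q in questions]
print("median words:", statistics.median(word_counts))
print("max characters:", max(char_lengths))

# Indices of questions that have no words left after normalization.
normalized = [normalize(q) for q in questions]
empty_indices = [i for i, text in enumerate(normalized) if not text]
print(empty_indices)
```

Here the second question consists only of a stopword, so nothing survives normalization. Such rows can then be dropped or handled separately before vectorization.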