# Natural Language Processing
  
---

<img src="https://www.dropbox.com/scl/fi/b1vbv4c4m5vikt6s08n62/nlp.png?rlkey=r5t9i1socnr84jk2slvx2pylw&raw=1"  align="center"/>

### Learning Objectives
- Discuss the major tasks involved with natural language processing.
- Discuss, on a low level, the components of natural language processing.
- Identify why natural language processing is difficult.
- Demonstrate text classification.
- Demonstrate common text preprocessing techniques.

### How Do We Use NLP in Data Science?

In data science, we are often asked to analyze unstructured text or make a predictive model using it. Unfortunately, most data science techniques require numeric data. NLP libraries provide a tool set of methods to convert unstructured text into meaningful numeric data.

- **Analysis:** NLP techniques provide tools to allow us to understand and analyze large amounts of text. For example:

    - Analyze the positivity/negativity of comments on different websites.
    - Extract key words from meeting notes and visualize how meeting topics change over time.

- **Vectorizing for machine learning:** When building a machine learning model, we typically must transform our data into numeric features. This process of transforming non-numeric data such as natural language into numeric features is called vectorization. For example:

    - Understanding related words. Using stemming, NLP lets us know that "swim", "swims", and "swimming" all refer to the same base word. This allows us to reduce the number of features used in our model.
    - Identifying important and unique words. Using TF-IDF (term frequency-inverse document frequency), we can identify which words are most likely to be meaningful in a document.
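
As a small illustration of the stemming bullet above, the sketch below runs the three example words through NLTK's `SnowballStemmer`. The choice of stemmer and the `toy_` names are assumptions made for illustration; other stemmers may produce different stems.

```python
from nltk.stem.snowball import SnowballStemmer

# English Snowball stemmer; no corpus download is needed for this.
toy_stemmer = SnowballStemmer("english")

toy_words = ["swim", "swims", "swimming"]
toy_stems = [toy_stemmer.stem(word) for word in toy_words]

# All three surface forms collapse to one stem, so a model would see
# a single feature instead of three.
print(toy_stems)
print(len(set(toy_stems)))
```

Because the stems are identical, a vectorizer applied after stemming would create one column for these words rather than three.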

### What Is Natural Language Processing (NLP)?

- Using computers to process (analyze, understand, generate) natural human languages.
- Making sense of human knowledge stored as unstructured text.
- Building probabilistic models using data about a language.

<img src="https://www.dropbox.com/scl/fi/ceuj0day17rlz5tsywhkg/siri.jpg?rlkey=k93psk3wuuru90s6kmr5l1fyg&raw=1"  align="center"/>

### What does NLU mean?

<img src="https://www.dropbox.com/scl/fi/m11szeidbae8b7syyk9lb/twoway.jpg?rlkey=c2sh2q8tw2wh0owcic57u64so&raw=1" align="center"/>

---

### What Are Some of the Lower-Level Components?

- **Objective:** Discuss, on a low level, the components of natural language processing.

Unfortunately, the NLP programming libraries typically do not provide direct solutions for the high-level tasks above. Instead, they provide low-level building blocks that enable us to craft our own solutions. These include:

- **Tokenization:** Breaking text into tokens (words, sentences, n-grams)
- **Stop-word removal:** a/an/the
- **Stemming and lemmatization:** root word
- **TF-IDF:** word importance
- **Part-of-speech tagging:** noun/verb/adjective
- **Named entity recognition:** person/organization/location
- **Spelling correction:** "New Yrok City"
- **Word sense disambiguation:** "buy a mouse"
- **Segmentation:** "New York City subway"
- **Language detection:** "translate this page"
- **Machine learning:** specialized models that work well with text

### Why is NLP hard?


<img src="https://www.dropbox.com/scl/fi/bthbfwx9n1nqawxhquegp/fan.png?rlkey=483x7v41fq60agosyw71t3t4j&raw=1"  align="right"/>

- **Objective:** Identify why natural language processing is difficult.

Natural language processing requires an understanding of both the language and the world. Several things make this difficult:

- **Ambiguity**:
    - Hospitals Are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English:** text messages
- **Idioms:** "throw in the towel"
- **Newly coined words:** "retweet"
- **Tricky entity names:** "Where is A Bug's Life playing?"
- **World knowledge:** "Mary and Sue are sisters", "Mary and Sue are mothers"

# Introduction to spaCy and NLTK

<a id='textblob_install'></a>

## Install TextBlob, gensim, and swifter

The TextBlob Python library provides a simplified interface for exploring common NLP tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

To proceed with the lesson, first install TextBlob, as explained below. We prefer Anaconda-based installations, since they tend to be tested with our other Anaconda packages.

**To install textblob run:**

> `conda install -c conda-forge textblob`

**Or:**

> `pip install textblob`

> `python -m textblob.download_corpora lite`

**We will also need a few more packages: gensim and swifter.**

In [None]:
!pip install --upgrade textblob spacy 'gensim==4.2.0' swifter

In [None]:
!python -m textblob.download_corpora lite
!python -m spacy download en_core_web_sm

<a id='yelp_rev'></a>

## Reading in the Yelp Reviews

Throughout this lesson, we will use Yelp reviews to practice and discover common low-level NLP techniques.

You should be familiar with these terms, as they are frequently used in NLP:
- **corpus**: a collection of documents (derived from the Latin word for "body")
- **corpora**: plural form of corpus

In this lesson, we will also use Naive Bayes, a model that is very popular for text classification (the "NB" in `MultinomialNB` below). If you are unfamiliar with it, know that it follows the same `fit`/`predict` interface as the other models in scikit-learn. We will look extensively at the mechanics behind Naive Bayes later in the course. However, see the [appendix](#bayes) at the end of this notebook for a quick introduction.

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

import spacy
import gensim
import warnings
import nltk
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words


In [None]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv 'https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0'
fi

In [None]:
!bash get_data.sh

In [None]:
# Read yelp.csv into a DataFrame.
path = './yelp.csv'
yelp = pd.read_csv(path)


In [None]:
# The head of the original data
yelp.head()

# NER and Linguistic Features of spaCy

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

df = pd.DataFrame([], columns=['Text', 'Lemma', 'POS', 'Tag', 'Dep', 'Shape', 'alpha', 'stop'])


for ix, token in enumerate(doc):
    df.loc[ix] = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop]
df



- **Text:** The original word text.
- **Lemma:** The base form of the word.
- **POS:** The simple UPOS part-of-speech tag.
- **Tag:** The detailed part-of-speech tag.
- **Dep:** Syntactic dependency, i.e. the relation between tokens.
- **Shape:** The word shape: capitalization, punctuation, digits.
- **alpha:** Does the token consist of alphabetic characters?
- **stop:** Is the token part of a stop list, i.e. the most common words of the language?




In [None]:
from spacy import displacy
# jupyter=False returns the markup as a string instead of displaying it directly.
html = displacy.render(doc, style="dep", jupyter=False)


In [None]:
import IPython
IPython.display.HTML(html)

In [None]:
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

print([token.lemma_ for token in doc])

# Original doc : "Apple is looking at buying U.K. startup for $1 billion"

In [None]:
import spacy

df = pd.DataFrame([], columns=['Text', 'Initial char', 'End char', 'Entity'])

for ix, ent in enumerate(doc.ents):
    df.loc[ix] = [ent.text, ent.start_char, ent.end_char, ent.label_]
df

In [None]:
text = "I saw The Beatles perform. Who did you see?"
doc1 = nlp(text)

df1 = pd.DataFrame([], columns=['Text', 'Tag', 'POS'])

# Record the default tag and part of speech assigned to each token.
for i, word in enumerate(doc1):
    df1.loc[i] = [word.text, word.tag_, word.pos_]
df1




In [None]:
df2 = pd.DataFrame([], columns=['Text2', 'Tag2', 'POS2'])


# Get the pipeline's attribute ruler and add an exception that tags
# both tokens of "The Beatles" as NNP/PROPN.
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Beatles"
patterns = [[{"LOWER": "the"}, {"TEXT": "Beatles"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler; index selects the token within the match
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Beatles"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Beatles" in "The Beatles"

doc2 = nlp(text)
for i, word in enumerate(doc2):
    df2.loc[i] = [word.text, word.tag_, word.pos_]
df2

In [None]:
# Compare the tags before (df1) and after (df2) adding the rule, side by side.
pd.concat([df1, df2], axis=1)

# Topic Modelling

## Topic Modelling with spaCy

In [None]:
!pip install bertopic

In [None]:
yelp

In [None]:
import spacy
from bertopic import BERTopic

nlp = spacy.load('en_core_web_sm', exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp)
topics, probs = topic_model.fit_transform(yelp['text'])

In [None]:


topic_model.get_topic_info()

## Doing Topic Modelling with LDA

As you proceed through this section, note that text classification is done in the same way as all other classification models. First, the text is vectorized into a set of numeric features. Then, a standard machine learning classifier is applied. NLP libraries often include vectorizers and ML models that work particularly well with text.

> We will refer to each piece of text we are trying to classify as a document.
> - For example, a document could refer to an email, book chapter, tweet, article, or text message.

**Text classification is the task of predicting which category or topic a text sample is from.**

We may want to identify:
- Is an article a sports or business story?
- Does an email have positive or negative sentiment?
- Is the rating of a recipe 1, 2, 3, 4, or 5 stars?

**Predictions are often made by using the words as features and the label as the target output.**
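
A minimal sketch of this idea, assuming a made-up four-review dataset with hypothetical sentiment labels: word counts become the features and a Naive Bayes classifier predicts the label. The `toy_` names and the data are illustrative and are not part of the Yelp workflow in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: documents and their labels.
toy_texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "loved the pizza, great place",
    "awful experience, terrible pizza",
]
toy_labels = ["positive", "negative", "positive", "negative"]

# Step 1 vectorizes the text into word counts; step 2 is a standard classifier.
toy_clf = make_pipeline(CountVectorizer(), MultinomialNB())
toy_clf.fit(toy_texts, toy_labels)

toy_pred = toy_clf.predict(["great pizza and friendly staff", "cold and terrible"])
print(list(toy_pred))
```

The pipeline keeps vectorization and classification together, so new text is transformed with the vocabulary learned from the training documents.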

Starting out, we will make each unique word (across all documents) a single feature. In any given corpus, we may have hundreds of thousands of unique words, so we may have hundreds of thousands of features!

- For a given document, the numeric value of each feature could be the number of times the word appears in the document.
    - So, most features will have a value of zero, resulting in a sparse matrix of features.

- This technique for vectorizing text is referred to as a bag-of-words model.
    - It is called bag of words because the document's structure is lost — as if the words are all jumbled up in a bag.
    - The first step to creating a bag-of-words model is to create a vocabulary of all possible words in the corpus.

> Alternatively, we could make each column an indicator column, which is 1 if the word is present in the document (no matter how many times) and 0 if not. This vectorization could be used to reduce the importance of repeated words. For example, a website search engine would be susceptible to spammers who load websites with repeated words. So, the search engine might use indicator columns as features rather than word counts.

**We need to consider several things to decide if bag-of-words is appropriate.**

- Does order of words matter?
- Does punctuation matter?
- Does upper or lower case matter?

## Demo: Text Processing in scikit-learn

- **Objective:** Demonstrate text classification.

<a id='count_vec'></a>


### Creating Features Using CountVectorizer

- **What:** Converts each document into a set of words and their counts.
- **Why:** To use a machine learning model, we must convert unstructured text into numeric features.
- **Notes:** Relatively easy with English language text, not as easy with some languages.

<img src="https://www.dropbox.com/scl/fi/7abked6w9kvq4az4mi7nm/cvec.png?rlkey=w1hlizux2lbkhna6f6hx4o86u&raw=1"  align="center"/>

In [None]:

# Define X and y.
X = yelp.text
y = yelp.stars

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=99)

In [None]:
X_train[:2]

In [None]:
# Use CountVectorizer to create document-term matrices from X_train and X_test.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
X_train_dtm[0]

In [None]:
# Rows are documents, columns are terms (aka "tokens" or "features", individual words in this situation).
X_train_dtm.shape

In [None]:
# Last 25 features
print(vect.get_feature_names_out()[-25:])

In [None]:
# Show vectorizer vect
vect

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

One common method of reducing the number of features is converting all text to lowercase before generating features. To a computer, `aPPle` is a different token/"word" than `apple`, so converting both to lowercase ensures that fewer features are generated. `CountVectorizer` lowercases text by default (`lowercase=True`); it might be useful to turn this off if capitalization matters.

In [None]:
# Create a CountVectorizer that keeps case (lowercase=False), drops terms that
# appear in fewer than 25 documents (min_df=25), and uses a custom token_pattern,
# then fit it on the training set and transform it.
vect = CountVectorizer(lowercase=False, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+')
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
# vect.get_feature_names_out()[-10:]

<a id='countvectorizer-model'></a>


### Using CountVectorizer for Topic Modelling
![DTM](https://www.dropbox.com/scl/fi/14huaxukr29dhqxxh6gzp/DTM.png?rlkey=5ute46xkcauinbhq83lsdlzyu&raw=1)
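
Before fitting LDA on the full document-term matrix, a toy sketch may help show what the model produces: a topic-by-term weight matrix (`components_`) and, for each document, a mixture over topics. The four made-up documents and the `toy_` names are assumptions for illustration, and with so little data the learned topics are not meaningful.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "pizza pasta cheese pizza",
    "pasta cheese sauce",
    "haircut salon stylist",
    "salon stylist haircut color",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_dtm)

print(toy_lda.components_.shape)    # (number of topics, number of terms)
print(toy_doc_topics.shape)         # (number of documents, number of topics)
print(toy_doc_topics.sum(axis=1))   # each document's topic weights sum to 1
```

The top words of a topic are read from a row of `components_`, while a row of the document-topic matrix says how strongly one document is associated with each topic.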

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

number_of_topics = 10

model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)


In [None]:
model.fit(X_train_dtm)


In [None]:
def display_topics(model, feature_names, no_top_words):
    """Return a DataFrame with the top words and weights for each topic."""
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the highest-weighted terms, in descending order of weight.
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        topic_dict["Topic %d words" % topic_idx] = [str(feature_names[i]) for i in top_indices]
        topic_dict["Topic %d weights" % topic_idx] = ['{:.1f}'.format(topic[i]) for i in top_indices]
    return pd.DataFrame(topic_dict)


In [None]:
no_top_words = 10
display_topics(model, vect.get_feature_names_out(), no_top_words)

### Cleaning the text data and retrying

Both methods failed on the raw text, so let's clean the dataset first and then retry.

In [None]:
import re
nltk.download('stopwords')
my_stopwords = set(nltk.corpus.stopwords.words('english'))  # set for fast lookups
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem

my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'


def preprocess_text(text, should_join=True):
    """Lowercase, strip links and punctuation, remove stop words, and stem."""
    text = ' '.join(word.lower() for word in textblob_tokenizer(text))
    text = re.sub(r'http\S+', '', text)  # remove http links
    text = re.sub(r'bit\.ly/\S+', '', text)  # remove bitly links
    text = text.replace('[link]', '')  # remove literal "[link]" placeholders
    text = re.sub('[' + my_punctuation + ']+', ' ', text)  # remove punctuation
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    text = re.sub(r"[^a-zA-Z.,&!?]+", r" ", text)  # keep only letters and basic punctuation
    text_token_list = [word for word in text.split(' ')
                       if word not in my_stopwords]  # remove stopwords
    text_token_list = [word_rooter(word) if '#' not in word else word
                       for word in text_token_list]  # apply word rooter (stemmer)
    text = ' '.join(text_token_list)
    if should_join:
        return ' '.join(gensim.utils.simple_preprocess(text))
    else:
        return gensim.utils.simple_preprocess(text)

In [None]:
import swifter
processed_reviews = yelp['text'].swifter.apply(preprocess_text)

In [None]:
processed_reviews = processed_reviews.rename('review')

In [None]:
processed_reviews.head()

In [None]:
yelp = pd.concat([yelp, processed_reviews], axis=1)

In [None]:
yelp

In [None]:
X = yelp.review
y = yelp.stars

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=99)

In [None]:
X.shape

In [None]:
vect = CountVectorizer(lowercase=False, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+')
X_train_dtm = vect.fit_transform(X_train)
model.fit(X_train_dtm)


In [None]:
no_top_words = 10
display_topics(model, vect.get_feature_names_out(), no_top_words)

Much better! Let's see how BERTopic does on the cleaned text.

In [None]:
topics, probs = topic_model.fit_transform(yelp['review'])


In [None]:
topic_model.get_topic_info()

# Now you do it
<img src="https://www.dropbox.com/scl/fi/yukrb2nsze8zku5a5sj43/bbc.jpg?rlkey=jxwx6ghoge5vd4bj6z8ze4dbc&raw=1" width="400" height="400" align="center"/>

<img src="https://www.dropbox.com/scl/fi/q6sedc6g1aika01rvzec8/hands_on.jpg?rlkey=qk7bpiwwqkds648x8kmcx2ucq&raw=1" width="100" height="100" align="right"/>

Do topic modeling on the BBC dataset! In the next section, we will learn how to classify the categories as well.

In [None]:
%%writefile get_data_capstone.sh
if [ ! -f bbc.csv ]; then
  wget -O bbc.csv 'https://www.dropbox.com/scl/fi/lfa2ryv86uqd3y988irfw/bbc.csv?rlkey=vtwdf6g8sejhkf75p7o36ev00&dl=0'
fi

In [None]:
!bash get_data_capstone.sh

In [None]:
pd.read_csv('./bbc.csv')