# Introduction to NLP
---

Data comes in many different forms: time stamps, sensor readings, images, categorical labels, and so much more. But text is still some of the most valuable data out there for those who know how to use it.

## NLP with spaCy

In this note, I'll use the spaCy library to take on some of the most important tasks in working with text. spaCy relies on language-specific models that come in different sizes. You can load a model with `spacy.load`.

For example, here's how you would load the English language model:

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

***Note***: *Before you can load spaCy's `en_core_web_sm`, you first need to download the model by running the following command in the command prompt:*
```bash
python -m spacy download en_core_web_sm
```

*Alternatively, you can run `!python -m spacy download en_core_web_sm` in a Jupyter notebook cell.*
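If you are not sure whether the model is already installed, a small guard can help avoid a confusing error at load time. The sketch below relies on `spacy.util.is_package`; the helper name `load_model_or_none` is made up for this note and is not part of spaCy.

```python
import spacy
from spacy.util import is_package


def load_model_or_none(name="en_core_web_sm"):
    """Load a spaCy model if it is installed; otherwise return None.

    Hypothetical helper for illustration, not a spaCy function.
    """
    if is_package(name):
        return spacy.load(name)
    print(f"Model '{name}' is not installed. Run: python -m spacy download {name}")
    return None


maybe_nlp = load_model_or_none()
```

If `maybe_nlp` is `None`, run the download command above and try again.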

## Tokenizing

With the model loaded, you can process text like this:

In [2]:
doc = nlp("Tea is healthy and calming, don't you think?")

for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?


Calling `nlp` on a string, as in `nlp('some_text')`, returns a `Doc` object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.

Iterating through a document gives you token objects. Each of these tokens comes with additional information. In most cases, the important ones are `token.lemma_` and `token.is_stop`.
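To get a feel for what a token object carries, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer and needs no model download; the variable names are my own.

```python
import spacy

# A blank English pipeline has only the tokenizer, so no model download is needed.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Tea is healthy and calming, don't you think?")

print(len(demo_doc))      # number of tokens in the document
print(demo_doc[0].text)   # a Doc can be indexed; .text gives the plain string

for token in demo_doc:
    # token.i is the token's position in the document;
    # is_alpha and is_punct are boolean flags on each token.
    print(token.i, token.text, token.is_alpha, token.is_punct)
```

Attributes that depend on a trained or rule-based component, such as `token.lemma_`, need a full model like `en_core_web_sm` rather than the blank pipeline used here.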

## Text Preprocessing

A few preprocessing steps can improve how we model text. The first is lemmatizing. The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word "walking", you convert it to "walk".

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".
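If you are curious which words spaCy treats as English stopwords, the default set can be inspected directly. This is a small sketch using `spacy.lang.en.stop_words.STOP_WORDS`; the example word list is made up.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# The exact size of the set can differ between spaCy versions.
print(len(STOP_WORDS))

# Made-up word list to show the filtering idea on plain strings.
words = ["the", "tea", "is", "healthy", "and", "calming"]
kept = [w for w in words if w not in STOP_WORDS]
print(kept)
```

In practice you would usually rely on `token.is_stop` rather than checking this set by hand.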

`token.lemma_` returns the token's lemma, while `token.is_stop` returns a boolean True if the token is a stopword (and False otherwise). For example:

In [3]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calming		False
,		,		False
do		do		True
n't		n't		True
you		you		True
think		think		False
?		?		False


Why do lemmas and stopwords matter? Language data has a lot of noise mixed in with informative content. In the sentence above, the important words are "tea", "healthy" and "calming". Removing stopwords might help the predictive model focus on relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form (the verb forms "calming", "calms" and "calmed" would all change to "calm"). Note that in the output above "calming" keeps its form, most likely because spaCy tagged it as an adjective there, and the lemma depends on the part of speech.
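The effect on vocabulary size can be illustrated with a toy example. The lookup table and stopword set below are hand-written for illustration only; spaCy derives lemmas and stopwords from its own data, not from these dictionaries.

```python
from collections import Counter

# Hypothetical hand-written lookups, for illustration only.
TOY_LEMMAS = {"calms": "calm", "calmed": "calm", "calming": "calm", "teas": "tea"}
TOY_STOPWORDS = {"is", "and", "the", "me", "it"}

words = "tea calms me and the tea calmed it and teas is calming".split()

# Count raw surface forms, then count again after dropping stopwords
# and mapping each remaining word to its toy lemma.
raw_counts = Counter(words)
clean_counts = Counter(
    TOY_LEMMAS.get(w, w) for w in words if w not in TOY_STOPWORDS
)

print(len(raw_counts), "distinct words before")
print(len(clean_counts), "distinct words after")
print(clean_counts)
```

Fewer distinct words with higher counts each is what gives a downstream model more signal per word.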

## Practical example

Let's apply these techniques to the [A Million News Headlines](https://www.kaggle.com/therohk/million-headlines) dataset, a corpus of over one million news headlines published by the ABC (Australian Broadcasting Corporation).

In [4]:
import pandas as pd
import numpy as np
import re  # For preprocessing
import multiprocessing
from time import time  # To time our operations

text_data = pd.read_csv('../../datasets/news-headlines/abcnews-date-text.csv')
text_data.head()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


### Cleaning

For each news headline, we lemmatize the words and remove stopwords and non-alphabetic characters:

In [5]:
def cleaning(doc):
    """Lemmatize a spaCy Doc and drop its stopwords.

    Returns the cleaned text, or None if fewer than three words remain.
    """
    txt = [token.lemma_ for token in doc if not token.is_stop]

    # Sentences of only one or two words add very little to training
    # (e.g. for Word2Vec), so anything shorter than 3 words is discarded.
    if len(txt) > 2:
        return ' '.join(txt)
    return None

We can take advantage of spaCy's `nlp.pipe()` method to speed up the cleaning process:

In [6]:
# Generator of lowercased headlines with runs of non-alphabetic characters
# (anything except letters and apostrophes) replaced by a single space
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in text_data['headline_text'])

t = time()
# Leave one core free, but always use at least one process
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000,
                                         n_process=max(1, multiprocessing.cpu_count() - 1))]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 8.29 mins
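The regular expression used in the cleaning step above can be checked in isolation before running it over a million rows. A minimal sketch, where the sample headline is made up:

```python
import re

# Same pattern as in the cleaning step: any run of characters that is not
# a letter or an apostrophe is replaced by a single space.
NON_ALPHA = re.compile(r"[^A-Za-z']+")

sample = "Air NZ's 2003 strike: 50% pay-rise!"  # made-up headline
cleaned = NON_ALPHA.sub(" ", sample).lower()
print(repr(cleaned))
```

Digits and punctuation disappear, while the apostrophe is kept so that contractions still reach the tokenizer intact. A trailing space can remain where the string ended in punctuation.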


Put the results in a new column for easy comparison:

In [8]:
# Store the cleaned text in a new column
text_data['clean_text'] = txt
text_data.head()

| | publish_date | headline_text | clean_text |
|---|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... | aba decide community broadcasting licence |
| 1 | 20030219 | act fire witnesses must be aware of defamation | act fire witness aware defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit | g call infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise | air nz staff aust strike pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers | air nz strike affect australian traveller |
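
Because `cleaning` returns nothing for headlines with fewer than three remaining words, `clean_text` will hold missing values for those rows. One possible way to drop them before modelling is sketched below on a toy frame; the short headline "it is on" is made up, and whether to drop or keep such rows is a choice that depends on your task.

```python
import pandas as pd

# Toy frame standing in for text_data; the middle row mimics a headline
# that was too short after cleaning, so its clean_text is missing.
toy = pd.DataFrame({
    "headline_text": [
        "air nz strike to affect australian travellers",
        "it is on",
        "air nz staff in aust strike for pay rise",
    ],
    "clean_text": [
        "air nz strike affect australian traveller",
        None,
        "air nz staff aust strike pay rise",
    ],
})

# Keep only rows that still have cleaned text, then renumber the index.
usable = toy.dropna(subset=["clean_text"]).reset_index(drop=True)
print(len(toy), "rows before,", len(usable), "rows after")
```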
