# Walking Through an NLP Pipeline with spaCy

So far we've covered quite a lot: preprocessing text, bag-of-words (BoW) models, topic models and word embeddings. In this notebook we'll take a more practical approach and walk through a workflow that applies to most of the NLP problems you're likely to face.

## Quick Set Up

We're going to be using a library called spaCy later in this notebook. To make things easier, we'll download and install the model now.

If you're running locally, run both cells below one after the other.

On Colab, run the following cell and then, once it has finished, restart the runtime from the menu above.

In [0]:
! python -m spacy download en_core_web_md

Then once you've done the above steps, run the cell below:

In [0]:
import spacy
nlp = spacy.load('en_core_web_md')

## Obtaining Data

Obtaining data is one of the most important, and most often overlooked, parts of a project. Real-world data is messy, and converting it into a usable form is harder still. Often you'll write your own spiders to scrape content from the web. At other times the data will be in PDF form and you'll need to convert it to plain text. Here are a few tools you can make use of:
1. Web scraping: Beautiful Soup, Scrapy
2. PDF to text: if you're on Linux/macOS, install `poppler-utils` and use `pdftotext`; on Windows, use `xpdf`. In our experience, the pure-Python alternatives tend not to work as well

### Web scraping with Beautiful Soup

Beautiful Soup is a powerful library that makes scraping web pages easy and intuitive. We're not going to use it for scraping annual reports, so we've just included a link to a quick walkthrough here. If you want to look at Beautiful Soup in more detail, there are plenty of articles and courses dedicated to it, e.g.: https://towardsdatascience.com/step-by-step-tutorial-web-scraping-wikipedia-with-beautifulsoup-48d7f2dfa52d

### Converting Pdfs to Text

As mentioned earlier, you can use `pdftotext`, which is a command-line utility. It is used as follows: `pdftotext {PDF-file} {text-file}`

You can find instructions to install here: https://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/


You can obtain xpdf for Windows and use the same commands. Find the xpdf binaries here: http://gnuwin32.sourceforge.net/packages/xpdf.htm
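
If you would rather stay inside Python, the command above can be wrapped with the standard library. This is only a minimal sketch: the helper names `build_pdftotext_command` and `pdf_to_text` and the file name `letter.pdf` are made up for illustration, and it assumes `pdftotext` is already installed and on your `PATH`.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for `pdftotext {PDF-file} {text-file}`."""
    return ["pdftotext", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path, txt_path=None):
    """Convert a PDF with the pdftotext CLI and return the extracted text.

    Assumption: pdftotext (from poppler-utils or xpdf) is on PATH.
    """
    pdf_path = Path(pdf_path)
    # Default to writing the text file next to the PDF
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils or xpdf")
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    # Assumption: the output is (mostly) UTF-8; undecodable bytes are replaced
    return txt_path.read_text(encoding="utf-8", errors="replace")


# Show the command that would be run for a hypothetical file
print(build_pdftotext_command("letter.pdf", "letter.txt"))
```

With a real file in place, `pdf_to_text("letter.pdf")[:50]` would give you the first 50 characters of the extracted text.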

#### **Exercise**: Download the PDF below, import it and print the first 50 characters

In [0]:
pdf_url = ["https://int.nyt.com/data/documenthelper/641-read-the-open-letter-to-amazon/0a112e77301026beb86d/optimized/full.pdf"]

# Import your processed pdf here
# Print the first 50 characters

##### *Example Answer*

In [0]:
'Open Letter to Amazon Chief Executive Jeff Bezos\n\n'

Is your output the same?

# Defining your problem

Once you have the data, the next step is to pin down the problem you're trying to solve. 

*In fact, in a real life data science problem, this should be your first step. By really understanding what you are trying to achieve, you will be able to work far more efficiently on the problem. It's all well and good to go download all the data you can and start applying models to it, but without understanding what's important you won't be able to produce a good solution.*

For a text problem your approach will fall into one of two categories:

*   Using a supervised approach
*   Using an unsupervised approach


**Supervised Approach**
This is the ideal approach when we want to predict something from a body of text that we have labelled, or can label. It also requires enough text to train our own model on. 

If this is the case, you simply convert your corpus into word vectors and train a model.

**Unsupervised Approach**
In a lot of situations this isn't the case. More often than not we either won't have enough text for our model, or can't (realistically) label it. 

In this notebook we'll look at how you can solve problems of this type. This is only a basic outline, but from here you can add more depth and detail as you become more confident. 

As a quick recap:
1. Identify what problem it is that you're solving.
2. If you can create labelled (supervised) data -> create BoW representations and train anything from logistic regression to LightGBM over them
3. If not, you'll need to create an NLP pipeline using pre-trained models
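
As a reminder of what the BoW representation in step 2 looks like, here is a minimal standard-library sketch. The function names and the three sample texts are invented for illustration, and the tokenisation is a plain whitespace split. In practice you would more likely use a library vectoriser and feed the vectors to a classifier.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercased word to a column index (alphabetical)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in a single text."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector


# Made-up mini corpus, one "document" per string
texts = ["oil prices rose", "oil output fell", "prices fell again"]
vocabulary = build_vocabulary(texts)
vectors = [bow_vector(text, vocabulary) for text in texts]

print(sorted(vocabulary))  # column order of the vectors
for text, vector in zip(texts, vectors):
    print(vector, "<-", text)
```

Each row is one document and each column one vocabulary word. Together with labels, a matrix like this is what a supervised model would be trained on.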

Since supervised problems are more or less straightforward, we're going to focus on the unsupervised approach. 

For the rest of this notebook we're going to be working on an annual report for a company called Tullow Oil. We've already converted it to text, so you don't have to. The code to download the report is below.

#### About spaCy

First introduced in the previous notebook, spaCy is an *Industrial-Strength Natural Language Processing Library*. It takes care of most of the underlying NLP tasks so that we can focus on the bigger picture. This lets us spend more of our time trying to solve the (business) problem we face, rather than writing our own functions for common tasks.

Some of the functionality spaCy has:
*   Tokenisation
*   Lemmatization
*   Named Entity Recognition
*   Part of Speech
*   NLP Models





#### Downloading models


The library itself doesn't come with models, so you'll need to download them separately. For English there are three main ones: `en_core_web_sm`, `en_core_web_md` and `en_core_web_lg`, in increasing order of size and complexity: small, medium and large. You can install any of them using `python -m spacy download en_core_web_xx`, replacing `xx` with `sm`, `md` or `lg`.

We don't need to do this now, as this was the setup step at the beginning of the notebook.
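
If you are unsure which models are already available in your environment, one way to check is sketched below. It relies on the fact that the downloaded models are installed as ordinary Python packages. The helper name `is_installed` is ours, not part of spaCy.

```python
from importlib.util import find_spec

# The three English models mentioned above
MODELS = ("en_core_web_sm", "en_core_web_md", "en_core_web_lg")


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return find_spec(package_name) is not None


for name in MODELS:
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, run: python -m spacy download", name)
```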

# Using spaCy Doc Objects

Using the downloaded models you can convert your text into a spaCy doc object. 

When you run the function `nlp()`, a spaCy doc object is created from the given data.

The `nlp()` function runs the spaCy NLP pipeline and saves all the metadata to the doc object. For example, the object will contain computed values for:

*   Tokenisation
*   Lemmatization
*   Part of Speech
*   Dependency Parsing
*   Named Entity Recognition

That's a lot to remember and explain at once, so we're going to work through each one with an example. The first two you should remember from the previous notebook. The last three we will cover below:

In [0]:
# First, let's download that annual report we were talking about

! wget https://auquan-public-data.s3.ap-south-1.amazonaws.com/Tullow_Oil.txt


In [0]:
# Next we need to read the data

with open('Tullow_Oil.txt', 'r', encoding='latin-1') as f:
  data = f.read()


In [0]:
# Now convert it into a SpaCy Doc
doc = nlp(data)

### **1. Breaking into sentences:**

In most cases, it is better to deal with text formatted as sentences instead of the whole document at once.

We've seen how to do this with NLTK and it's just as easy in SpaCy:

In [0]:
sents = list(doc.sents)
print(sents[:20])

As you can see, there are a few stray characters and other errors. spaCy doesn't work perfectly, but that's the case with any NLP algorithm.
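
A light clean-up pass can remove some of this noise before further processing. The sketch below is one possible approach rather than a spaCy feature: it works on plain strings (for example `sent.text` for each sentence), the function names and the `min_letters` threshold are our own choices, and the sample strings are made up.

```python
import re


def clean_sentence(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def clean_sentences(texts, min_letters=3):
    """Clean each sentence and drop those with fewer than min_letters letters."""
    cleaned = []
    for text in texts:
        sentence = clean_sentence(text)
        # Count alphabetic characters so page numbers and stray symbols are dropped
        if sum(ch.isalpha() for ch in sentence) >= min_letters:
            cleaned.append(sentence)
    return cleaned


# Made-up examples of the kind of fragments a converted PDF can contain
raw = ["Example Oil plc\nAnnual Report ", "\n\n", "- 12 -", "Revenue   increased."]
print(clean_sentences(raw))
# ['Example Oil plc Annual Report', 'Revenue increased.']
```

With a spaCy doc you could call `clean_sentences(sent.text for sent in doc.sents)`.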

### **2. Getting all tokens and their root forms:**

Again, we've seen this with NLTK and now we will do it with SpaCy.

In spaCy, the `nlp()` call we ran above already took care of tokenization and lemmatization in one go. Each token exposes its root form as `token.lemma_`.

In [0]:
for token in doc:
    # Only print tokens longer than 5 characters to keep the output manageable
    if len(token.text.strip()) > 5:
        print('token -> root: %s -> %s' % (token, token.lemma_))

#### **Exercise:** Split, tokenise and lemmatize this doc using SpaCy

In [0]:
! wget https://auquan-public-data.s3.ap-south-1.amazonaws.com/Facebook.txt

##### *Example Solution (look when finished)*

In [0]:
# Do exactly what we did for Tullow Oil
# Read the file
# Convert it to a spaCy doc
# Get all the tokens and their lemmas

### **3. Get POS (Part Of Speech) tags for each of the tokens, like noun, verb, adverb etc.**

Despite its name, part-of-speech (POS) tagging is an incredibly useful technique that classifies words as nouns, verbs, adverbs, adjectives etc. 

The significance of this might not be instantly apparent, but it is hugely powerful. By focusing on verbs we can understand what's happening in a sentence. Nouns can tell us who or what is doing something in the sentence. A more advanced use case is to build a parse tree of the sentence(s) to understand their syntactic structure.

In general, knowing the grammatical category of words is useful in downstream tasks as it helps in understanding the semantics behind the word. For example, `play` means completely different things as a noun and as a verb.
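
To make the `play` example concrete, the sketch below groups word counts by POS tag. It is written against simple stand-in objects with hand-assigned tags so that it runs without a model. Any object exposing `.text` and `.pos_`, as spaCy tokens do, should work the same way. The function name `count_words_by_pos` is ours.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def count_words_by_pos(tokens):
    """Return {POS tag: Counter of lowercased words carrying that tag}."""
    counts = defaultdict(Counter)
    for token in tokens:
        counts[token.pos_][token.text.lower()] += 1
    return counts


# Hand-tagged stand-ins for the sentence "They play in the play"
tagged = [("They", "PRON"), ("play", "VERB"), ("in", "ADP"),
          ("the", "DET"), ("play", "NOUN")]
tokens = [SimpleNamespace(text=text, pos_=pos) for text, pos in tagged]

by_pos = count_words_by_pos(tokens)
print("play as VERB:", by_pos["VERB"]["play"])
print("play as NOUN:", by_pos["NOUN"]["play"])
```

The same surface form ends up in two different buckets, which is exactly the distinction a plain word count would lose.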

Again, this is super easy to implement in spaCy:

In [0]:
for token in doc:
    print('token: %s, POS: %s'%(token, token.pos_))

#Warning: Long Output

#### **Exercise**: Return the top 20 most frequently occurring nouns in the Facebook report

In [0]:
#First you'll need to do PoS on your doc

#Next you'll have to create a frequency count of each word. What does this tell us?

##### *Example Answer*

In [0]:
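from collections import Counter

def getCount(doc, k):
  """Return the k most frequent nouns in doc as (word, count) pairs."""
  # Keep only the tokens that were tagged as nouns
  nouns = [token.text for token in doc if token.pos_ == 'NOUN']
  return Counter(nouns).most_common(k)

# Example call, assuming facebook_doc is the Doc you built from Facebook.txt:
# getCount(facebook_doc, 20)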

### **4. Get all noun chunks**

Noun chunks can be thought of as a noun and the words that describe it. For example: 'the short man' would be a single noun chunk. These chunks help separate different nouns from each other, so a lot of times we are interested in the whole chunk as this gives us more info. 

Here's how you can obtain noun chunks:

In [0]:
for chunk in doc.noun_chunks:
    print(chunk)

#Really long output!

#### **Exercise**: Create a list of noun chunks for facebook....

##### *Example Solution (look when finished)*

In [0]:
# Use doc.noun_chunks for getting the noun chunks
# filter out the ones that're too short or have punctuations etc
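
One possible way to do the filtering step is sketched below. It works on plain strings (for example `chunk.text` for each noun chunk), and the function name, the `min_chars` threshold and the de-duplication are our own choices rather than anything spaCy provides. Note that rejecting all punctuation also drops hyphenated chunks, which you may or may not want.

```python
import string


def filter_chunks(chunk_texts, min_chars=4):
    """Keep chunks that are long enough, punctuation-free and not yet seen."""
    kept = []
    seen = set()
    for text in chunk_texts:
        # Normalise internal whitespace and newlines
        chunk = " ".join(text.split())
        if len(chunk) < min_chars:
            continue
        if any(ch in string.punctuation for ch in chunk):
            continue
        key = chunk.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(chunk)
    return kept


# Made-up chunk texts, including a duplicate and some noise
sample = ["the short man", "it", "the  short\nman", "revenue (", "The Board"]
print(filter_chunks(sample))
# ['the short man', 'The Board']
```

With a spaCy doc you could call `filter_chunks(chunk.text for chunk in doc.noun_chunks)`.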

### **5. Get all named entities**

Another common and really useful task is finding all the named things in a text. Named entities might include people, geographical locations, organizations, references to money, references to quantities etc. 

The spaCy doc contains the named entities and their classes, so you can access both (as shown below).

Once again, spaCy has built-in support for this, so it takes just a couple of lines of code:

In [0]:
# Here we're printing the named entity and the type of that entity
for ent in doc.ents:
    print('Entity: %s, Label: %s'%(ent.text, ent.label_))

You can find more details here: https://spacy.io/usage/linguistic-features
        
Keep in mind that you generally want to start with the small spaCy model (`en_core_web_sm`). If its output contains too many errors, then try the medium or large model. 

Using the smaller model allows you to experiment faster (as it takes less time to load and run). On a more conceptual level, it is generally true that if something can be explained with a simpler model, then that is preferable to a larger model. In finance this is especially true because overfitting is such a problem.

Now let's build on this a bit more.

In the next example we will use these bits and pieces to try to extract some more interesting information from the annual report. 

To begin, let's try to get all the names of people:

In [0]:
people = []
for ent in doc.ents:
    if (ent.label_ == 'PERSON') and len(ent.text) > 5:
        print(ent.text)
        people.append(ent.text)

Next, let's step it up and try something non-trivial: finding out who the CEO of the firm is.

This is going to be a multi-step process:
1. Break your text into sentences
2. Look at each sentence and find if there are mentions of specific keywords, `CEO` and `Chief Executive Officer` in our case
3. If there is, get all the names of people from the sentence
4. If most of the names refer to the same person, then that person is probably the CEO


Additional: to be more certain, you can also look at how close to your keyword the name occurs. The closer it is, the more likely it is to be the CEO's name. 

So, let's begin by getting the names of the people mentioned in every sentence that contains either `ceo` or `chief executive officer`:

In [0]:
# look at each sentence
for sent in doc.sents:
  text = sent.text.lower()
  # skip sentences that mention none of the tags we're interested in
  # (checking all tags at once avoids printing a name twice for one sentence)
  if not any(tag in text for tag in ['ceo', 'chief executive officer']):
    continue
  # get all the named entities in the sentence
  for ent in sent.ents:
    # if the entity is a person, print it
    if (ent.label_ == 'PERSON') and len(ent.text) > 5:
      print(ent.text)

Well, there we have it: a simple working strategy.
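
Step 4 of the strategy, picking the name that occurs most often, can be sketched with a `Counter`. The version below is an illustration rather than a definitive implementation: it runs on hand-written stand-in sentences with invented names, and `rank_candidates` is a hypothetical helper. It only assumes that each sentence has `.text` and `.ents`, and each entity `.text` and `.label_`, as spaCy's objects do.

```python
from collections import Counter
from types import SimpleNamespace

KEYWORDS = ("ceo", "chief executive officer")


def rank_candidates(sentences, keywords=KEYWORDS, min_name_length=6):
    """Count PERSON entities in sentences that mention any keyword."""
    votes = Counter()
    for sent in sentences:
        text = sent.text.lower()
        if not any(keyword in text for keyword in keywords):
            continue
        for ent in sent.ents:
            if ent.label_ == "PERSON" and len(ent.text) >= min_name_length:
                votes[ent.text] += 1
    # Most frequently mentioned name first
    return votes.most_common()


def person(name):
    """Build a stand-in for a PERSON entity."""
    return SimpleNamespace(text=name, label_="PERSON")


# Invented sentences and names, standing in for doc.sents
sentences = [
    SimpleNamespace(text="Alex Example, Chief Executive Officer, said output rose.",
                    ents=[person("Alex Example")]),
    SimpleNamespace(text="Jordan Sample chairs the board.",
                    ents=[person("Jordan Sample")]),
    SimpleNamespace(text="The CEO, Alex Example, met Jordan Sample.",
                    ents=[person("Alex Example"), person("Jordan Sample")]),
]

ranking = rank_candidates(sentences)
print(ranking)
# [('Alex Example', 2), ('Jordan Sample', 1)]
```

With a real document you would pass `doc.sents` instead of the stand-ins. The proximity idea mentioned earlier is not implemented here.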

#### **Exercise**: Try to obtain the names of COO and CFO for Facebook using a similar strategy

In [0]:
# Add the steps

##### *Example Solution (look when finished)*

In [0]:
# Use the same strategy as you used for CEO, if it doesn't work see if you can add more keywords and make it work, 
# or add some other rule to find out sentences that might mention that title and person 
# Your answers should be:
# COO: Sheryl Sandberg
# CFO: David Wehner


### **Final Exercise**

Now let's look at all the statements that talk about money. If we collect all such sentences, we can pick out the ones that talk about the financials we're interested in. For example, if a sentence mentions a sum of money and also says `Adj EBITDAX` (and probably contains no other relevant noun chunk), then we can be fairly sure that the amount is the `Adj EBITDAX` value.

In [0]:
for sent in doc.sents:
  for ent in sent.ents:
    if ent.label_ in ['MONEY']:
      print(sent)
      break

**Exercise: Using the above sentences and whatever you've learnt so far, try to extract Revenue and Adjusted EBITDAX**

Make sure that your approach is general enough to work for reports from different companies. Don't worry if you're unable to make it work for all of them; it's a difficult problem given all the different formats.
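
One possible starting point is to pair each keyword with the money entities found in the same sentence. The sketch below is deliberately naive and runs on invented stand-in sentences and amounts. The helper `find_money_for_keywords` is hypothetical, and it only assumes the `.text`, `.ents` and `.label_` attributes that spaCy sentences and entities expose.

```python
from types import SimpleNamespace


def find_money_for_keywords(sentences, keywords):
    """Return (keyword, [money texts], sentence text) for matching sentences."""
    matches = []
    for sent in sentences:
        amounts = [ent.text for ent in sent.ents if ent.label_ == "MONEY"]
        if not amounts:
            continue
        lowered = sent.text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                matches.append((keyword, amounts, sent.text))
    return matches


def entity(text, label):
    """Build a stand-in for a named entity."""
    return SimpleNamespace(text=text, label_=label)


# Invented sentences and figures, standing in for doc.sents
sentences = [
    SimpleNamespace(text="Revenue for the year was $10 million.",
                    ents=[entity("$10 million", "MONEY")]),
    SimpleNamespace(text="The board met in March.",
                    ents=[entity("March", "DATE")]),
    SimpleNamespace(text="Adjusted EBITDAX rose to $4 million from $3 million.",
                    ents=[entity("$4 million", "MONEY"), entity("$3 million", "MONEY")]),
]

matches = find_money_for_keywords(sentences, ["revenue", "adjusted ebitdax"])
for keyword, amounts, text in matches:
    print(keyword, "->", amounts)
```

A sentence with several amounts, like the last one, still needs a rule for choosing between them, which is where the noun chunks and the distance to the keyword can help.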

##### *Example Solution (look when finished)*

In [0]:
# Revenue: $1723m 
# NET DEBT $3.5Bn