# Extracting Insights from Text Using spaCy's Pipeline

## NLP Day, November 7th 2019

http://nlpday.ml/

## Preparations

First, we need to import all the necessary libraries:

In [73]:
import pandas as pd
import spacy

spaCy’s models can be installed as Python packages. This means that they’re a component of your application, just like any other module. They’re versioned and can be defined as a dependency in your requirements.txt. Models can be installed from a download URL or a local directory, manually or via pip. Their data can be located anywhere on your file system.

https://spacy.io/usage/models

For this exercise, we will use the `en_core_web_sm` model, which can be installed with the command:
`python -m spacy download en_core_web_sm`

In [74]:
# Load the model by package name; the 'en' shortcut link was removed in spaCy v3.
nlp = spacy.load('en_core_web_sm')

These are the default fields we want to print:

In [75]:
FIELDS = ['text', 'lemma', 'tag', 'pos', 'head', 'dep', 'ent_iob', 'ent_type', 'idx']

Let's write some helper functions for printing a sentence after spaCy has parsed it:

In [76]:
def token_data(token):
    # https://spacy.io/api/token
    return dict(
        idx=token.idx,
        i=token.i,
        text=token.text,
        norm=token.norm_,
        head=token.head.i,
        dep=token.dep_,
        lemma=token.lemma_,
        pos=token.pos_,
        tag=token.tag_,
        ent_type=token.ent_type_,
        ent_iob=token.ent_iob_,
    )

def json_tree(parsed_sentence):
    return [token_data(t) for t in parsed_sentence]
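
Note that `token_data` returns every attribute, while the `FIELDS` list defined earlier is not used by these helpers. One possible way to narrow the output down to those fields is sketched below. The names `select_fields` and `format_rows` are hypothetical, and the sample dict is hand-written in the shape returned by `token_data` rather than produced by spaCy.

```python
FIELDS = ['text', 'lemma', 'tag', 'pos', 'head', 'dep', 'ent_iob', 'ent_type', 'idx']

def select_fields(rows, fields=FIELDS):
    """Keep only the requested keys of each token dict, in the order given by `fields`."""
    return [{name: row[name] for name in fields if name in row} for row in rows]

def format_rows(rows, fields=FIELDS):
    """Render token dicts as tab-separated lines, preceded by a header line."""
    lines = ['\t'.join(fields)]
    for row in rows:
        lines.append('\t'.join(str(row.get(name, '')) for name in fields))
    return '\n'.join(lines)

# Hand-written sample in the shape returned by token_data (not real spaCy output).
sample = [dict(idx=0, i=0, text='Apple', norm='apple', head=1, dep='compound',
               lemma='Apple', pos='PROPN', tag='NNP', ent_type='ORG', ent_iob='B')]

print(format_rows(select_fields(sample)))
```

With a real parse, the same call would be `format_rows(select_fields(json_tree(parsed_sentence)))`.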

To parse a sentence with spaCy, call the loaded model with a string. Simple as that!
The returned `Doc` object is structured as described at https://spacy.io/api/doc#attributes. It contains a sequence of tokens, each with the attributes listed at https://spacy.io/api/token#attributes. The data available for each token depends on the components in the pipeline. In our case, we are using the default pipeline, which is:

1) Tokenizer - https://spacy.io/api/tokenizer

2) Tagger - https://spacy.io/api/tagger

3) Dependency Parser - https://spacy.io/api/dependencyparser

4) Named Entity Recognition - https://spacy.io/api/entityrecognizer

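5) Text Categorizer - https://spacy.io/api/textcategorizer (an optional component; the pretrained `en_core_web_sm` model does not include it, which you can confirm with `nlp.pipe_names`)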


For more details:
https://spacy.io/usage/processing-pipelines

Let's explore a simple sentence!

In [77]:
parsed_sentence = nlp('Apple Inc is a great company!')

In [78]:
from pprint import pprint
pprint(json_tree(parsed_sentence))

[{'dep': 'compound',
  'ent_iob': 'B',
  'ent_type': 'ORG',
  'head': 1,
  'i': 0,
  'idx': 0,
  'lemma': 'Apple',
  'norm': 'apple',
  'pos': 'PROPN',
  'tag': 'NNP',
  'text': 'Apple'},
 {'dep': 'nsubj',
  'ent_iob': 'I',
  'ent_type': 'ORG',
  'head': 2,
  'i': 1,
  'idx': 6,
  'lemma': 'Inc',
  'norm': 'inc',
  'pos': 'PROPN',
  'tag': 'NNP',
  'text': 'Inc'},
 {'dep': 'ROOT',
  'ent_iob': 'O',
  'ent_type': '',
  'head': 2,
  'i': 2,
  'idx': 10,
  'lemma': 'be',
  'norm': 'is',
  'pos': 'VERB',
  'tag': 'VBZ',
  'text': 'is'},
 {'dep': 'det',
  'ent_iob': 'O',
  'ent_type': '',
  'head': 5,
  'i': 3,
  'idx': 13,
  'lemma': 'a',
  'norm': 'a',
  'pos': 'DET',
  'tag': 'DT',
  'text': 'a'},
 {'dep': 'amod',
  'ent_iob': 'O',
  'ent_type': '',
  'head': 5,
  'i': 4,
  'idx': 15,
  'lemma': 'great',
  'norm': 'great',
  'pos': 'ADJ',
  'tag': 'JJ',
  'text': 'great'},
 {'dep': 'attr',
  'ent_iob': 'O',
  'ent_type': '',
  'head': 2,
  'i': 5,
  'idx': 21,
  'lemma': 'company',
  'no

# Data Exploration

Let's load the data we have and explore it:

In [79]:
products_df = pd.read_csv('../../data/data.csv')

In [80]:
products_df.head()

| | sentence | company | product |
|---|---|---|---|
| 0 | British Telecom is planning to launch Qumu as ... | British Telecom | Qumu |
| 1 | Also in this quarter, the first WattUp enabled... | Delight | WattUp |
| 2 | In September, SEAT will expand its SUV range w... | SEAT | Tarraco |
| 3 | The new Jetta being launched now by SEAT and t... | SEAT | Jetta |
| 4 | The employee from Fiat contributed personally ... | Fiat | Fiorino |


We can see that each line contains:

1) A sentence about a company that released (or is going to release) a new product

2) The company name

3) The product name

In [81]:
num_rows = len(products_df)
print(f'Number of rows: {num_rows}')

Number of rows: 100


**We have very few rows in our data set** :(

Unfortunately, a statistical model or a neural network may not work well with so little data (but we encourage you to try!), so we will use a rule-based approach built on spaCy's pipeline.

We can also see that not all rows contain both a company and a product, so our rule-based model should handle those cases gracefully:

In [82]:
num_rows_without_company_or_product = products_df.isnull().sum()
print(f'Number of rows without a company or a product\n{num_rows_without_company_or_product}')

Number of rows without a company or a product
sentence     0
company     11
product     10
dtype: int64


In [83]:
nan_rows = products_df[pd.isnull(products_df).any(axis=1)]

In [84]:
nan_rows.head()

| | sentence | company | product |
|---|---|---|---|
| 66 | Let's say, we'll announce the product when we ... | | |
| 67 | We've got the product. | | |
| 68 | As I mentioned, we launched some new products. | | |
| 69 | I knowthey just launched the Dealership Finance. | | Dealership Finance |
| 70 | There is getting new product launched, somethi... | | |


# Rule-Based Model

Now, let's begin with the most fun part - our model!

The goal is to write NLP rules, based on spaCy's annotations, that extract a company name and its newly released product from a given sentence (or `None` if one is not present in the sentence), guided by the examples we have in the training data.

We will start with simple rules and extend them to more complicated ones in order to push our accuracy higher.

## Evaluation

We will use **accuracy** as the evaluation metric. Each row is worth one point, split evenly: 0.5 for the company name and 0.5 for the product name.

In [85]:
def accuracy_score(companies_true, products_true, companies_predicted, products_predicted):
    correct_companies = [1 if company_true == company_predicted else 0
                            for company_true, company_predicted in zip(companies_true, companies_predicted)]
    correct_products = [1 if product_true == product_predicted else 0
                        for product_true, product_predicted in zip(products_true, products_predicted)]
    
    return float((sum(correct_companies) + sum(correct_products))) / (len(correct_companies) + len(correct_products)) * 100.0

Let's test our `accuracy_score` method above:

In [86]:
test_score = accuracy_score(['Apple', 'Amazon'], ['iPhone', 'AWS'], ['Google', 'Amazon'], ['iPhone', 'AWS'])
assert test_score == 75
print(f'Test score: {test_score}')

Test score: 75.0
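
One caveat when applying this metric to the data above: pandas reads empty CSV cells as `NaN`, and `NaN` compares unequal to everything, including `None` and itself. A rule that correctly returns `None` for a sentence without a company would therefore be counted as wrong. Below is a minimal sketch of one way around this, assuming missing predictions are represented as `None`. The names `normalize` and `accuracy_score_with_missing` are hypothetical and not part of the original notebook.

```python
import math

def normalize(value):
    """Map NaN (how pandas represents an empty cell) and None to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

def accuracy_score_with_missing(companies_true, products_true,
                                companies_predicted, products_predicted):
    """Same 0.5 + 0.5 scoring as accuracy_score, but a missing label matches a missing prediction."""
    pairs = (list(zip(companies_true, companies_predicted))
             + list(zip(products_true, products_predicted)))
    correct = sum(1 for true, predicted in pairs
                  if normalize(true) == normalize(predicted))
    return correct / len(pairs) * 100.0

nan = float('nan')
# The second row has no company in the data (NaN) and the rule predicts None.
score = accuracy_score_with_missing(['Apple', nan], ['iPhone', 'AWS'],
                                    ['Apple', None], ['iPhone', 'AWS'])
print(f'Score with missing-value handling: {score}')
```

Whether a missing label should earn credit at all is a design choice; this sketch treats "nothing to extract, nothing extracted" as a correct answer.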


## 1st Model: Using POS only

**TODO:** Write rules to extract company name and product using the `tag_` and `pos_` fields
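
As a possible starting point (a sketch, not a reference solution), runs of adjacent proper nouns can be merged into candidate names. The helper name `proper_noun_chunks` is hypothetical, and the token dicts are abbreviated copies of the first five tokens from the parsed example earlier. Deciding which candidate is the company and which is the product is left to the rules you write.

```python
def proper_noun_chunks(tokens):
    """Group runs of adjacent PROPN tokens into candidate names."""
    chunks, current = [], []
    for token in tokens:
        if token['pos'] == 'PROPN':
            current.append(token['text'])
        elif current:
            # A non-proper-noun token closes the run collected so far.
            chunks.append(' '.join(current))
            current = []
    if current:
        chunks.append(' '.join(current))
    return chunks

# Abbreviated token dicts, copied from the "Apple Inc is a great company!" output.
tokens = [
    {'text': 'Apple', 'pos': 'PROPN', 'tag': 'NNP'},
    {'text': 'Inc', 'pos': 'PROPN', 'tag': 'NNP'},
    {'text': 'is', 'pos': 'VERB', 'tag': 'VBZ'},
    {'text': 'a', 'pos': 'DET', 'tag': 'DT'},
    {'text': 'great', 'pos': 'ADJ', 'tag': 'JJ'},
]
print(proper_noun_chunks(tokens))
```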

## 2nd Model: Using NER

**TODO:** Building on the earlier rules, now change them or add new ones using the `ent_type_` and `ent_iob_` fields
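
One way to use these two fields together is to merge the `B`/`I`-tagged tokens into whole entity spans before applying any rule. The sketch below does that on token dicts in the `json_tree` format. `entity_spans` is a hypothetical helper, and the sample tokens are abbreviated from the parsed example earlier.

```python
def entity_spans(tokens):
    """Merge B/I-tagged tokens into (text, label) spans; any other tag closes the open span."""
    spans, words, label = [], [], None
    for token in tokens:
        iob = token['ent_iob']
        if iob == 'B' or (iob == 'I' and not words):
            # Start a new entity, flushing the previous one if it is still open.
            if words:
                spans.append((' '.join(words), label))
            words, label = [token['text']], token['ent_type']
        elif iob == 'I':
            words.append(token['text'])
        else:
            if words:
                spans.append((' '.join(words), label))
            words, label = [], None
    if words:
        spans.append((' '.join(words), label))
    return spans

# Abbreviated token dicts, copied from the "Apple Inc is a great company!" output.
tokens = [
    {'text': 'Apple', 'ent_iob': 'B', 'ent_type': 'ORG'},
    {'text': 'Inc', 'ent_iob': 'I', 'ent_type': 'ORG'},
    {'text': 'is', 'ent_iob': 'O', 'ent_type': ''},
]
spans = entity_spans(tokens)
organizations = [text for text, label in spans if label == 'ORG']
print(organizations)
```

A rule could then treat `ORG` spans as company candidates. Which labels are reliable for product names on this data is something to check against the training examples rather than assume.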

## 3rd Model: Using Dependency Tree Arcs

**TODO:** Building on the earlier rules, now change them or add new ones using the `dep_` and `head` fields
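
To illustrate how `dep` and `head` combine, the sketch below looks for a trigger verb and reads off its subject and direct object. Everything here is an assumption made for illustration: the helper name `subject_and_object`, the set of trigger lemmas, and the hand-written token dicts for the made-up sentence "Acme launched Widget" (they are not spaCy output). As in `token_data`, `head` holds the index `i` of the head token.

```python
LAUNCH_LEMMAS = {'launch', 'release', 'announce'}  # assumed trigger verbs

def subject_and_object(tokens):
    """Return (subject_text, object_text) for the first trigger verb, or (None, None)."""
    for verb in tokens:
        if verb['lemma'] not in LAUNCH_LEMMAS:
            continue
        # Children are the tokens whose head index points at the verb.
        # The root token is its own head, so it is excluded explicitly.
        children = [t for t in tokens
                    if t['head'] == verb['i'] and t['i'] != verb['i']]
        subject = next((t['text'] for t in children if t['dep'] == 'nsubj'), None)
        obj = next((t['text'] for t in children if t['dep'] == 'dobj'), None)
        return subject, obj
    return None, None

# Hand-written, hypothetical token dicts for "Acme launched Widget".
tokens = [
    {'i': 0, 'text': 'Acme', 'lemma': 'Acme', 'dep': 'nsubj', 'head': 1},
    {'i': 1, 'text': 'launched', 'lemma': 'launch', 'dep': 'ROOT', 'head': 1},
    {'i': 2, 'text': 'Widget', 'lemma': 'Widget', 'dep': 'dobj', 'head': 1},
]
company, product = subject_and_object(tokens)
print(company, product)
```

Real sentences in the data use passives, prepositional phrases, and multi-word names, so a rule like this would need to follow more arc types and collect whole subtrees rather than single tokens.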