
# Chapter 3: NLP Tasks and Applications

## NLP Tasks

### Natural Language Dataset

#### Explore the AG Dataset

In [7]:
# Import libraries
import pandas as pd
import os

# Get current working directory
# cwd = os.getcwd()

# Import AG Dataset
data = pd.read_csv('/content/drive/MyDrive/notebooks/applied-natural-language-processing-in-the-enterprise/data/ag_dataset/train.csv')
# Normalize column names: underscores instead of spaces, all lowercase
# (pd.read_csv already returns a DataFrame, so no extra conversion is needed)
data.columns = data.columns.str.replace(" ", "_").str.lower()
data['class_name'] = data['class_index'].map({1: 'World',
                                              2: 'Sports',
                                              3: 'Business',
                                              4: 'Sci_Tech'})

# View data
data

| | class_index | title | description | class_name |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Business |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Business |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Business |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Business |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Business |
| ... | ... | ... | ... | ... |
| 119995 | 1 | Pakistan's Musharraf Says Won't Quit as Army C... | KARACHI (Reuters) - Pakistani President Perve... | World |
| 119996 | 2 | Renteria signing a top-shelf deal | Red Sox general manager Theo Epstein acknowled... | Sports |
| 119997 | 2 | Saban not going to Dolphins yet | The Miami Dolphins will put their courtship of... | Sports |
| 119998 | 2 | Today's NFL games | PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ... | Sports |


In [8]:
# Count observations by class
data.class_name.value_counts()

Business    30000
Sci_Tech    30000
Sports      30000
World       30000
Name: class_name, dtype: int64
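
The counts above show four equally sized classes. As a minimal sketch using only the standard library, with the counts copied from that output into a hypothetical `class_counts` dictionary, the share of each class can be checked like this:

```python
# Counts copied from the value_counts() output above
# (class_counts is a hypothetical name used only for this illustration)
class_counts = {'Business': 30000, 'Sci_Tech': 30000, 'Sports': 30000, 'World': 30000}

total = sum(class_counts.values())

# Share of the training set that each class represents
class_shares = {name: count / total for name, count in class_counts.items()}

for name, share in class_shares.items():
    print(f'{name}: {share:.0%}')
```

On the DataFrame itself, `data.class_name.value_counts(normalize=True)` should give the same shares directly.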

In [9]:
# View a sample of titles
for i in range(10):
  print('Title of Article', i)
  print(data.loc[i, 'title'])
  print('\n')

Title of Article 0
Wall St. Bears Claw Back Into the Black (Reuters)


Title of Article 1
Carlyle Looks Toward Commercial Aerospace (Reuters)


Title of Article 2
Oil and Economy Cloud Stocks' Outlook (Reuters)


Title of Article 3
Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)


Title of Article 4
Oil prices soar to all-time record, posing new menace to US economy (AFP)


Title of Article 5
Stocks End Up, But Near Year Lows (Reuters)


Title of Article 6
Money Funds Fell in Latest Week (AP)


Title of Article 7
Fed minutes show dissent over inflation (USATODAY.com)


Title of Article 8
Safety Net (Forbes.com)


Title of Article 9
Wall St. Bears Claw Back Into the Black




In [10]:
# View descriptions
for i in range(10):
  print('Description of Article', i)
  print(data.loc[i, 'description'])
  print('\n')

Description of Article 0
Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


Description of Article 1
Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.


Description of Article 2
Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.


Description of Article 3
Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.


Description of Article 4
AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.


Description o

In [11]:
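# Clean up the text
# Note: DataFrame.applymap is deprecated as of pandas 2.1; DataFrame.map replaces it there
cols = ['title', 'description']
data[cols] = data[cols].applymap(lambda x: x.replace("\\", ' '))    # backslashes separate words in the raw text
data[cols] = data[cols].applymap(lambda x: x.replace("#36;", '$'))  # likely residue of the HTML entity &#36; ("$")
data[cols] = data[cols].applymap(lambda x: x.replace("  ", ' '))    # replace double spaces with single spaces
data[cols] = data[cols].applymap(lambda x: x.strip())               # trim leading and trailing whitespace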

# Write data to csv
data.to_csv('/content/drive/MyDrive/notebooks/applied-natural-language-processing-in-the-enterprise/data/ag_dataset/train_prepared.csv', index = False)
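
The cleanup steps above can also be sketched as one helper that works on a single plain string, which makes them easy to try without loading the dataset. This is only an illustration: `clean_text` and `sample` are hypothetical names, and unlike the cell above, the helper collapses any run of spaces with a regular expression rather than replacing double spaces in a single pass.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper that mirrors the notebook's cleanup steps on one string."""
    text = text.replace("\\", " ")      # backslashes separate words in the raw text
    text = text.replace("#36;", "$")    # likely residue of the HTML entity for a dollar sign
    text = re.sub(r" {2,}", " ", text)  # collapse runs of two or more spaces (assumption: stricter than the cell above)
    return text.strip()                 # trim leading and trailing whitespace


# Shortened from the raw description of Article 2 shown earlier, with trailing spaces added
sample = "Reuters - Soaring crude prices plus worries\\about the economy  "
print(clean_text(sample))
```

The same function could be passed to `applymap` in place of the four separate lambdas, if a single pass over the columns is preferred.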


### NLP Task #1: Named Entity Recognition

#### Perform Inference Using the Original spaCy Model

In [1]:
!pip install -U spacy[cuda110,transformers,lookups]==3.0.3
!pip install -U spacy-lookups-data==1.0.0
!pip install cupy-cuda110==8.5.0
!python -m spacy download en_core_web_trf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-trf==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.0.0/en_core_web_trf-3.0.0-py3-none-any.whl (459.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [4]:
# Import spacy and load language model
import spacy
nlp = spacy.load('en_core_web_trf')

In [5]:
# View metadata of the model
import pprint
pp = pprint.PrettyPrinter(indent = 4)
pp.pprint(nlp.meta)

{   'author': 'Explosion',
    'components': [   'transformer',
                      'tagger',
                      'parser',
                      'ner',
                      'attribute_ruler',
                      'lemmatizer'],
    'description': 'English transformer pipeline (roberta-base). Components: '
                   'transformer, tagger, parser, ner, attribute_ruler, '
                   'lemmatizer.',
    'disabled': [],
    'email': 'contact@explosion.ai',
    'labels': {   'attribute_ruler': [],
                  'lemmatizer': [],
                  'ner': [   'CARDINAL',
                             'DATE',
                             'EVENT',
                             'FAC',
                             'GPE',
                             'LANGUAGE',
                             'LAW',
                             'LOC',
                             'MONEY',
                             'NORP',
                             'ORDINAL',
                             

In [12]:
# Print NER results for the first nine descriptions
for i in range(9):
  print('Article', i)
  print(data.loc[i, 'description'])
  print('Text Start End Label')
  doc = nlp(data.loc[i, 'description'])
  # doc.ents holds entity spans (not single tokens), each with character offsets and a label
  for ent in doc.ents:
    print(ent.text, ent.start_char,
          ent.end_char, ent.label_)
    print('\n')

Article 0
Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again.
Text Start End Label
Reuters 0 7 ORG


Article 1
Reuters - Private investment firm Carlyle Group, which has a reputation for making well-timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market.
Text Start End Label
Reuters 0 7 ORG


Carlyle Group 34 47 ORG


Article 2
Reuters - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
Text Start End Label
Reuters 0 7 ORG


next week 134 143 DATE


summer 168 174 DATE


Article 3
Reuters - Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday.
Text Start End Label
Reuters 0 7 ORG


Iraq 86 90 GPE


Saturday 186 194 DATE
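
The `Start` and `End` columns in this output are character offsets into the description string (spaCy's `start_char` and `end_char`), with the end offset exclusive. A minimal sketch, using only values copied from the Article 3 output above and the hypothetical variable names `text` and `entities`, shows that slicing the string with those offsets recovers each entity:

```python
# Cleaned description of Article 3, copied from the output above
text = ("Reuters - Authorities have halted oil export flows from the main "
        "pipeline in southern Iraq after intelligence showed a rebel militia "
        "could strike infrastructure, an oil official said on Saturday.")

# (text, start_char, end_char, label) rows copied from the NER output above
entities = [
    ("Reuters", 0, 7, "ORG"),
    ("Iraq", 86, 90, "GPE"),
    ("Saturday", 186, 194, "DATE"),
]

for ent_text, start, end, label in entities:
    # Slicing with the character offsets recovers the entity text
    assert text[start:end] == ent_text
    print(f'{label:<5} {text[start:end]}')
```

Because the offsets refer to characters rather than tokens, they can be used to highlight or extract entities in the original string without re-running the model.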

