## Practice NLP on an Article from the NY Times

In [3]:
from urllib import request
from bs4 import BeautifulSoup
from bs4.element import Comment

In [4]:
# Latest Coronavirus article
url = 'https://www.nytimes.com/2020/07/16/health/coronavirus-vaccine-novavax.html?action=click&module=Top%20Stories&pgtype=Homepage'

In [9]:
from nlp_toolkit import nlp_toolkit as ntk

### Parsing the article into Python

The first step is to open the URL with ```request.urlopen(url).read()```, which returns the page's ```html``` markup as a ```bytes``` object.

In [8]:
html = request.urlopen(url).read()
print(html[:300])

b'<!DOCTYPE html>\n<html lang="en-US" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">How Novavax Won $1.6 Billion to Make a Coronavirus Vaccine - The New York Times</title>\n    <meta data-rh="true" itemprop="inLanguage" content="en-US"/><meta data-rh="'


Parse it into a ```BeautifulSoup``` object, then strip out the text that belongs to any of the following:

- style
- script
- head
- title
- meta
- document
- comments.

Both steps are handled by the ```text_from_html()``` wrapper function from ```nlp_toolkit```.
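The internals of ```text_from_html()``` are not shown in this notebook. The following is a minimal sketch of how such a wrapper could work, assuming it follows the common ```BeautifulSoup``` pattern of filtering text nodes by their parent tag. The names ```tag_visible``` and ```text_from_html_sketch``` are hypothetical and are not the toolkit's actual API.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Tags whose text is not part of the visible page content.
# '[document]' is the name BeautifulSoup gives to the root object.
HIDDEN_TAGS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def tag_visible(element):
    """Return True if a text node belongs to visible page content."""
    if element.parent.name in HIDDEN_TAGS:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html_sketch(body):
    """Hypothetical stand-in for ntk.text_from_html(): keep visible text only."""
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)  # every text node, comments included
    visible = filter(tag_visible, texts)
    # Drop whitespace-only nodes and join the rest with single spaces
    return ' '.join(t.strip() for t in visible if t.strip())


# Toy page, not the NY Times article
sample = (b'<html><head><title>T</title><style>p{}</style></head>'
          b'<body><!-- hidden --><p>Hello</p><script>var x;</script>'
          b'<p>world</p></body></html>')
print(text_from_html_sketch(sample))  # Hello world
```

Filtering on the parent tag keeps navigation links and captions, since those sit in ordinary visible elements.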

In [10]:
text = ntk.text_from_html(html)
print(text[:300])

Sections SEARCH Skip to content Skip to site index Health Today’s Paper Health | How a Struggling Company Won $1.6 Billion to Make a Coronavirus Vaccine https://nyti.ms/2ChPu4q The Coronavirus Outbreak live Latest Updates Maps and Cases Drug and Treatment Tracker Business Updates Advertisement Co


### How many times are COVID-19 and vaccines mentioned?

In [12]:
import re

In [14]:
covid_matcher = re.finditer('coronavirus|covid', text)

matches = [match.start() for match in covid_matcher]
print('There are {} mentions of COVID'.format(len(matches)))

There are 17 mentions of COVID


In [15]:
vaccine_matcher = re.finditer('vaccine', text)
vaccine_matches = [match.start() for match in vaccine_matcher]
print('There are {} mentions of vaccine'.format(len(vaccine_matches)))

There are 53 mentions of vaccine
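Both patterns above are case-sensitive, so capitalized forms such as 'Coronavirus' or 'Vaccine' in headlines are not included in these counts. As a hedged sketch, a case-insensitive count could look like the following, where ```count_mentions``` is a hypothetical helper and not part of ```nlp_toolkit```:

```python
import re


def count_mentions(pattern, text):
    """Count non-overlapping, case-insensitive matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text, flags=re.IGNORECASE))


# Toy sentence, not taken from the article
sample_text = 'Coronavirus cases rose. The covid vaccine and the COVID-19 Vaccine.'
print(count_mentions('coronavirus|covid', sample_text))  # 3
print(count_mentions('vaccine', sample_text))            # 2
```

On the article this would be called as ```count_mentions('coronavirus|covid', text)```. The resulting counts would likely differ from the case-sensitive ones reported above.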


### Tokenization to sentences

In [11]:
from nltk import sent_tokenize, word_tokenize, WordNetLemmatizer, PorterStemmer, ngrams, pos_tag
from nltk.corpus import stopwords
from collections import Counter

In [16]:
sentences = sent_tokenize(text)
for sentence in sentences[:10]:
    print(sentence)
    print()

Sections SEARCH Skip to content Skip to site index Health Today’s Paper Health | How a Struggling Company Won $1.6 Billion to Make a Coronavirus Vaccine https://nyti.ms/2ChPu4q The Coronavirus Outbreak live Latest Updates Maps and Cases Drug and Treatment Tracker Business Updates Advertisement Continue reading the main story Supported by Continue reading the main story How a Struggling Company Won $1.6 Billion to Make a Coronavirus Vaccine Novavax just received the Trump administration’s largest vaccine contract.

In the Maryland company’s 33-year history, it has never brought a vaccine to market.

The coronavirus vaccine Novavax, a small biotech company, has developed is now in safety trials.

Results are expected this month.

Credit... Andrew Caballero-Reynolds/Agence France-Presse — Getty Images By  Katie Thomas and Megan Twohey July 16, 2020 Updated 12:22 p.m.

ET In late February, as the coronavirus spread around the world, Dr. Richard Hatchett, the head of an international nonpro

Looking at the 'sentences' here, we can see that parsing with ```BeautifulSoup``` picked up the navigation links that precede the title along with the title itself. Image captions and photo credits were also parsed through.

The article body itself starts at sentence 6, which still carries a stray 'ET' left over from the timestamp.
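One way to act on this observation is to slice off the leading sentences by hand. The sketch below assumes the cut-off index is chosen by inspecting the output, as done above. ```trim_preamble``` is a hypothetical helper, and the sample list is a toy stand-in for the real ```sentences```.

```python
def trim_preamble(sentences, first_body_index, residue_prefix=''):
    """Drop sentences before the article body and strip a leftover prefix."""
    body = list(sentences[first_body_index:])
    if body and residue_prefix and body[0].startswith(residue_prefix):
        body[0] = body[0][len(residue_prefix):]
    return body


# Toy stand-in for the tokenized sentences printed above
sample_sentences = [
    'Sections SEARCH Skip to content',
    'Results are expected this month.',
    'ET In late February, the coronavirus spread.',
    'A second body sentence.',
]
body = trim_preamble(sample_sentences, first_body_index=2, residue_prefix='ET ')
print(body[0])  # In late February, the coronavirus spread.
```

For this article the call would be ```trim_preamble(sentences, 5, 'ET ')```, since sentence 6 has index 5. The index is specific to this page and would need to be rechecked for any other article.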

### Named Entity Recognition

We will use a pretrained model to find named entities. For this, we will use ```spacy``` and its small English pipeline, ```en_core_web_sm```, to do the text analysis.

In [19]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [20]:
# Parse the text into the nlp object from spacy
page = nlp(text)

In [25]:
# Extract organizations mentioned in the article
orgs = [entity.text for entity in page.ents if entity.label_ == 'ORG']
print(set(orgs))

{'the Defense Department', 'the White House', 'France-Presse', 'Glassdoor', 'Catalent', 'Trump', 'Phalanx Investment Partners', 'The Biomedical Advanced Research and Development Authority', 'Image Novavax', 'the National Security Council', 'KPMG', 'the University of New Mexico', 'the British Journal of Sports Medicine', 'the Gates Foundation', 'the American Council on Exercise', 'The New York Times', 'the Food and Drug Administration', 'BARDA', 'the Coalition for Epidemic Preparedness Innovations', 'the Department of Health and Human Services', 'Novavax', 'George Washington University Law School', 'H.H.S.', 'Virginia Tech', 'Getty Images CEPI', 'Doctors Without Borders', 'The New York Times Company', 'House', 'CEPI', 'The Coronavirus Outbreak', 'the World Health Organization', 'Megan Twohey'}


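The entity recognizer made some errors: 'Trump' and the reporter 'Megan Twohey' were labeled as organizations, and entries such as 'Image Novavax' and 'Getty Images CEPI' merged leftover page text into the entity name.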
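Some of these errors can be filtered after the fact. The sketch below is one possible approach, not a standard recipe. It assumes the entities are first collected as ```(text, label)``` pairs, for example with ```[(e.text, e.label_) for e in page.ents]```. It then drops ORG entries that the recognizer also labeled as PERSON elsewhere, or that appear in a hand-written exclusion list. ```clean_orgs``` is a hypothetical helper.

```python
def clean_orgs(entities, exclude=()):
    """Return a sorted list of de-duplicated ORG names.

    entities: iterable of (text, label) pairs
    exclude:  names known to be mislabeled as organizations
    """
    entities = list(entities)
    excluded = {name.lower() for name in exclude}
    # Names that were also tagged as people somewhere in the text
    people = {text.lower() for text, label in entities if label == 'PERSON'}
    orgs = set()
    for text, label in entities:
        if label != 'ORG':
            continue
        name = text.strip()
        # Merge variants such as 'the White House' and 'White House'
        if name.startswith('the '):
            name = name[len('the '):]
        if name.lower() in excluded or name.lower() in people:
            continue
        orgs.add(name)
    return sorted(orgs)


# Toy entity list, not the real spaCy output
sample_entities = [
    ('Trump', 'ORG'), ('Trump', 'PERSON'), ('the White House', 'ORG'),
    ('Novavax', 'ORG'), ('Megan Twohey', 'ORG'), ('Novavax', 'ORG'),
]
print(clean_orgs(sample_entities, exclude=['Megan Twohey']))
# ['Novavax', 'White House']
```

The PERSON cross-check only helps when the recognizer labels the same name correctly somewhere else in the text. Merged entries such as 'Getty Images CEPI' would still need cleaner input text.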