# Natural Language Processing

Natural language processing (NLP) is the study of how computers can analyze and understand human language, such as written text.

We will use spaCy, a Python library for NLP, together with `en_core_web_sm`, a small model trained on an English corpus.

In [1]:
# install spacy.
#%conda install -c conda-forge spacy

In [2]:
# Contains English tokenizer, tagger, parser, NER and word vectors.
#%conda install -c conda-forge spacy-model-en_core_web_sm

In [3]:
import spacy
import en_core_web_sm

English = en_core_web_sm.load()

In [4]:
doc = English('European regulators have fined Microsoft about $730 million '
              'for failing to honor an agreement to give users a choice of Internet browser.')
[(ent.text, ent.label_) for ent in doc.ents]

[('European', 'NORP'), ('Microsoft', 'ORG'), ('about $730 million', 'MONEY')]

```
TYPE         DESCRIPTION
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FAC          Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LAW          Named documents made into laws.
LANGUAGE     Any named language.
DATE         Absolute or relative dates or periods.
TIME         Times smaller than a day.
PERCENT      Percentage, including “%”.
MONEY        Monetary values, including unit.
QUANTITY     Measurements, as of weight or distance.
ORDINAL      “first”, “second”, etc.
CARDINAL     Numerals that do not fall under another type.
```
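
With these label types in hand, the `(text, label)` pairs from the earlier output can be regrouped by entity type. The following is a minimal sketch that copies that output in as literal data, so it runs without spaCy; `entities` and `by_label` are illustrative names that do not appear in the notebook.

```python
from collections import defaultdict

# Pairs copied from the output shown earlier; in the notebook they would come
# from [(ent.text, ent.label_) for ent in doc.ents].
entities = [('European', 'NORP'), ('Microsoft', 'ORG'), ('about $730 million', 'MONEY')]

# Map each label to the list of entity texts that carry it.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
# {'NORP': ['European'], 'ORG': ['Microsoft'], 'MONEY': ['about $730 million']}
```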

In [5]:
spacy.displacy.render(doc, jupyter=True, style='ent')

In [6]:
#%pip install newspaper3k

In [7]:
url = 'https://thenextweb.com/security/2019/09/10/us-court-says-scraping-a-site-without-permission-isnt-illegal/'

In [8]:
from newspaper import Article

In [9]:
article = Article(url)

In [10]:
article.download()

In [11]:
article.parse()

In [12]:
article.title

'US court says scraping a site without permission isn’t illegal'

In [13]:
article.authors

['Ivan Mehta']

In [14]:
article.publish_date

datetime.datetime(2019, 9, 10, 0, 0)

In [15]:
article.text

"An appeals court situated in California, US, today said it’s not illegal to scrape data from public websites without any prior approval. Web scraping refers to the process of collecting large troves of data with the use of web crawlers – scripts designed to lift information from web pages.\n\nThe ruling comes after a legal dispute between LinkedIn and data analytics firm HiQ. LinkedIn sent a cease-and-desist letter to HiQ, demanding it to stop scraping the site. In response, the data analytics company counter-sued in hopes of blocking LinkedIn from interfering.\n\nThe company argued that it blocked HiQ from scraping the data to protect its users’ privacy. On the flip side, the data analytics company said LinkedIn started blocking its scraping requests only after it launched its own analytics tool.\n\nThe court banned the Microsoft-owned company from blocking HiQ’s attempts to scrape data from publicly available profiles on the platform.\n\nBIG NEWS: 9th Circuit holds that scraping a p

In [16]:
article.top_img

'https://img-cdn.tnwcdn.com/image/tnw?filter_last=1&fit=1280%2C640&url=https%3A%2F%2Fcdn0.tnwcdn.com%2Fwp-content%2Fblogs.dir%2F1%2Ffiles%2F2019%2F09%2Fsocial-media-1432985_1920-1.jpg&signature=18274de15a38e676eafd47aa6eb38a3a'

In [18]:
# Collapse newlines into single spaces so adjacent paragraphs do not run together.
text = ' '.join(article.text.split())
doc = English(text)
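
How the newlines are removed matters, because `article.text` separates paragraphs with `\n\n`. Deleting them outright fuses the last word of one paragraph with the first word of the next, which can confuse the tokenizer. A minimal sketch on a short excerpt of the article text shown earlier; `excerpt`, `deleted` and `collapsed` are illustrative names, not part of the notebook.

```python
# Two adjacent paragraphs from the article text, separated by a blank line.
excerpt = ("scripts designed to lift information from web pages.\n\n"
           "The ruling comes after a legal dispute between LinkedIn "
           "and data analytics firm HiQ.")

# Deleting newlines glues the two paragraphs together.
deleted = excerpt.replace('\n', '')

# Splitting on any whitespace and re-joining leaves exactly one space at each break.
collapsed = ' '.join(excerpt.split())

print('pages.The' in deleted)      # True
print('pages. The' in collapsed)   # True
```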

In [20]:
spacy.displacy.render(doc, jupyter=True, style='ent')
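
The rendered markup is convenient for reading, but for a full article a frequency table of entities is often easier to scan. The sketch below assumes the entities are extracted as `(text, label)` pairs, as was done for the Microsoft sentence. `most_common_entities` is a hypothetical helper, and the sample pairs are made up for illustration; they are not actual model output for this article.

```python
from collections import Counter

def most_common_entities(pairs, label, n=3):
    """Return the n most frequent entity texts that carry the given label."""
    counts = Counter(text for text, ent_label in pairs if ent_label == label)
    return counts.most_common(n)

# Illustrative pairs only, NOT real spaCy output for the article.
sample_pairs = [('LinkedIn', 'ORG'), ('HiQ', 'ORG'), ('LinkedIn', 'ORG'),
                ('California', 'GPE'), ('LinkedIn', 'ORG'), ('HiQ', 'ORG')]

print(most_common_entities(sample_pairs, 'ORG'))
# [('LinkedIn', 3), ('HiQ', 2)]
```

In the notebook, the pairs would instead be built with `[(ent.text, ent.label_) for ent in doc.ents]`.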