# AI - Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos, etc.
- Even in transcripts
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Artificial Intelligence to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.
- The use of ```large language models```!

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse. (<a href="http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories">More info</a>)
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- spaCy vs. NLTK
- NLTK launched in 2001, spaCy in 2015.
- NLTK has grown large and complex, and common tasks often take many separate steps.
- spaCy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- spaCy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

However, sentiment analysis in journalism can be problematic. Be extra wary of NLP's use for news analysis. AI can easily misinterpret the sentiment in this sentence:

"It is a great movie if you have the taste and sensibilities of a five-year-old boy."

It's best to stick to the following types of analysis:

- Mentions of a word or concept (who said something...when and how many times?)
- Frequency of target terms or topics (how often were keywords used in speeches, transcripts, etc)
- Words over time (a timeline that shows frequency of words over time)
- Missing words (really a flip of words over time to show how people stopped using certain concepts or terms)
- Key people, places, companies (identify proper nouns and places for reporting)
- Comparisons (for example financial disclosures over time...which stocks were added or removed over the years)
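
Several of the analysis types above (frequency of target terms, words over time, missing words) come down to counting. The sketch below uses only Python's standard library. The speeches, years, and target terms are invented for illustration, and the simple regular expression is an assumption that would need adjusting for real transcripts.

```python
import re
from collections import Counter

# Hypothetical sample data: (year, transcript text) pairs invented for illustration.
speeches = [
    (2019, "Taxes are too high. We will cut taxes and create jobs."),
    (2020, "Jobs, jobs, jobs. The economy needs jobs."),
    (2021, "Inflation is rising. Inflation hurts families more than taxes."),
]

# Target terms to track, written in lowercase.
target_terms = ["taxes", "jobs", "inflation"]

def count_terms(text, terms):
    """Count how often each target term appears as a whole word (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {term: counts[term] for term in terms}

# Build a simple timeline: one dictionary of counts per year.
timeline = {year: count_terms(text, target_terms) for year, text in speeches}

for year, counts in timeline.items():
    print(year, counts)
```

A term whose count drops to zero in later years (here, "jobs" in 2021) is the "missing words" pattern described above.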

# Installing spaCy (if installation is needed)

## Step 1. Install spaCy

If this is the first time you are using spaCy on this computer, you must first install it with either ```!conda install``` or ```!pip install```:

### TURN OFF FOR COLAB
Run for ANACONDA

In [None]:
conda install -c conda-forge spacy

### TURN OFF FOR ANACONDA
Run for Colab

In [None]:
## COLAB pip install
# !pip install -U spacy


In [None]:
## import library.
import spacy

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model (if installation is needed)


### ANACONDA ONLY

In [None]:
conda install -c conda-forge spacy-model-en_core_web_sm

### COLAB ONLY

In [None]:
# !python -m spacy download en_core_web_trf

In [None]:
## import that language model
import en_core_web_sm

# Import libs

In [7]:
## import library
import pandas as pd
import spacy
import glob

In [2]:
## import that language model
## https://spacy.io/usage/models
import en_core_web_sm

In [3]:
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 1.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
# !pip install spacy-transformers



In [10]:
import en_core_web_trf

### Load the English model into an ```nlp``` pipeline

In [13]:
## build nlp pipeline (a function that will tokenize, parse and run NER for us)
nlp = en_core_web_sm.load()

In [22]:
## accuracy check: load the larger transformer model to compare results
## (requires the spacy-transformers package)
nlp2 = spacy.load("en_core_web_trf")



In [14]:
## what type of object is nlp
type(nlp)

## Step 3. Text analysis

In [15]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle, Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." \
Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.
'''

In [16]:
## CALL the text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle, Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [17]:
type(text)

str

In [18]:
## PRINT the text
print(text)

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle, Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.



### Tokenize our text

- Tokenizing is always the first step in text analysis.
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
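
To see these points on a small scale, here is a minimal sketch that uses a blank English pipeline, which contains only the tokenizer and needs no model download. The sample sentence and the names `nlp_blank` and `sample_doc` are chosen for illustration so they do not overwrite the notebook's own `nlp` and `doc`.

```python
import spacy

# A blank English pipeline has a tokenizer but no tagger, parser, or NER.
nlp_blank = spacy.blank("en")

sample = "Microsoft bought Skype for $8.5 billion."
sample_doc = nlp_blank(sample)

for token in sample_doc:
    # token.idx is the character offset of the token in the original string
    print(f"{token.text!r:12} idx={token.idx:<3} punct={token.is_punct} space={token.is_space}")

# Each token remembers its trailing whitespace, so the original text can be rebuilt exactly.
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)
```

Because each token keeps its position and trailing whitespace, nothing from the original text is lost when it is split into units.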

In [19]:
## let's run the nlp function and create a spacy doc
doc = nlp(text)

In [23]:
doc2 = nlp2(text)

In [21]:
## CALL doc
doc

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle, Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.

In [None]:
## what type of data is it?
type(doc)

spacy.tokens.doc.Doc

In [None]:
## show each token
for each_word in doc:
  print(each_word)
  print(type(each_word))
  print("-----")

On
<class 'spacy.tokens.token.Token'>
-----
May
<class 'spacy.tokens.token.Token'>
-----
10
<class 'spacy.tokens.token.Token'>
-----
,
<class 'spacy.tokens.token.Token'>
-----
2011
<class 'spacy.tokens.token.Token'>
-----
,
<class 'spacy.tokens.token.Token'>
-----
Microsoft
<class 'spacy.tokens.token.Token'>
-----
announced
<class 'spacy.tokens.token.Token'>
-----
its
<class 'spacy.tokens.token.Token'>
-----
acquisition
<class 'spacy.tokens.token.Token'>
-----
of
<class 'spacy.tokens.token.Token'>
-----
 
<class 'spacy.tokens.token.Token'>
-----
Skype
<class 'spacy.tokens.token.Token'>
-----
Technologies
<class 'spacy.tokens.token.Token'>
-----
,
<class 'spacy.tokens.token.Token'>
-----
creator
<class 'spacy.tokens.token.Token'>
-----
of
<class 'spacy.tokens.token.Token'>
-----
the
<class 'spacy.tokens.token.Token'>
-----
 
<class 'spacy.tokens.token.Token'>
-----
VoIP
<class 'spacy.tokens.token.Token'>
-----
 
<class 'spacy.tokens.token.Token'>
-----
service
<class 'spacy.tokens.token

### Parts of speech



In [26]:
## print all parts of speech words
for token in doc:
  print(f"{token.text} ---> {token.pos} ---> {token.pos_}")

On ---> 85 ---> ADP
May ---> 96 ---> PROPN
10 ---> 93 ---> NUM
, ---> 97 ---> PUNCT
2011 ---> 93 ---> NUM
, ---> 97 ---> PUNCT
Microsoft ---> 96 ---> PROPN
announced ---> 100 ---> VERB
its ---> 95 ---> PRON
acquisition ---> 92 ---> NOUN
of ---> 85 ---> ADP
  ---> 103 ---> SPACE
Skype ---> 96 ---> PROPN
Technologies ---> 96 ---> PROPN
, ---> 97 ---> PUNCT
creator ---> 92 ---> NOUN
of ---> 85 ---> ADP
the ---> 90 ---> DET
  ---> 103 ---> SPACE
VoIP ---> 96 ---> PROPN
  ---> 103 ---> SPACE
service ---> 92 ---> NOUN
  ---> 103 ---> SPACE
Skype ---> 96 ---> PROPN
, ---> 97 ---> PUNCT
for ---> 85 ---> ADP
$ ---> 99 ---> SYM
8.5 ---> 93 ---> NUM
billion ---> 93 ---> NUM
. ---> 97 ---> PUNCT
Microsoft ---> 96 ---> PROPN
is ---> 87 ---> AUX
headquartered ---> 100 ---> VERB
near ---> 85 ---> ADP
Seattle ---> 96 ---> PROPN
, ---> 97 ---> PUNCT
Washington ---> 96 ---> PROPN
while ---> 98 ---> SCONJ
Skype ---> 96 ---> PROPN
remains ---> 100 ---> VERB
in ---> 85 ---> ADP
Palo ---> 96 ---> PROPN
Alto

## Step 4. Named Entity Recognition (NER)

#### spaCy easily returns the words that matter to us, such as the names of companies, people, places, artworks, numbers, etc.

- ```doc.ents``` ------------> All entities found in a spaCy doc object.

- ```ent.text``` ------------> The actual text of the entity.

- ```ent.label``` ------------> A numeric ID for the entity's category.

- ```ent.label_``` ------------> The entity's category as a readable string (ORG, GPE, etc.).

- ```spacy.explain(ent.label_)``` ---------> A description of the category.
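
The attributes listed above can be tried out without downloading a model. The sketch below is an assumption-laden stand-in: it uses a blank pipeline with spaCy's rule-based `entity_ruler` and two hand-written patterns instead of a trained model, so it only finds the exact words it is told about. The names `nlp_rules` and `rules_doc` are illustrative.

```python
import spacy

# Assumption: rule-based patterns stand in for a trained NER model,
# so this runs without any model download.
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "GPE", "pattern": "Seattle"},
])

rules_doc = nlp_rules("Microsoft is headquartered near Seattle.")

for ent in rules_doc.ents:
    # text, numeric label ID, readable label, and the label's description
    print(ent.text, ent.label, ent.label_, spacy.explain(ent.label_))
```

A trained model such as `en_core_web_sm` exposes the same attributes, but it predicts entities from context rather than matching a fixed list.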




In [27]:
## call text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle, Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [28]:
## find all entities
for word in doc.ents:
  print(word)

May 10, 2011
Microsoft
Skype Technologies
VoIP
Skype
$8.5 billion
Microsoft
Seattle
Washington
Skype
Palo Alto
California
Sandeep Junnarkar
Wikipedia
Paris
France
The Hudson River
Mahicantuck
two
315 miles
the Atlantic Ocean
Mt. Mercy
New York


In [29]:
## find all entities with their label
for word in doc.ents:
  print(f"{word} ---> {word.label_}")

May 10, 2011 ---> DATE
Microsoft ---> ORG
Skype Technologies ---> ORG
VoIP ---> LOC
Skype ---> ORG
$8.5 billion ---> MONEY
Microsoft ---> ORG
Seattle ---> GPE
Washington ---> GPE
Skype ---> ORG
Palo Alto ---> GPE
California ---> GPE
Sandeep Junnarkar ---> PERSON
Wikipedia ---> GPE
Paris ---> GPE
France ---> GPE
The Hudson River ---> LOC
Mahicantuck ---> PERSON
two ---> CARDINAL
315 miles ---> QUANTITY
the Atlantic Ocean ---> LOC
Mt. Mercy ---> LOC
New York ---> GPE


In [30]:
## find all entities with their label and label descriptors
for word in doc.ents:
  print(f"{word} ---> {word.label_} ---> {spacy.explain(word.label_)}")

May 10, 2011 ---> DATE ---> Absolute or relative dates or periods
Microsoft ---> ORG ---> Companies, agencies, institutions, etc.
Skype Technologies ---> ORG ---> Companies, agencies, institutions, etc.
VoIP ---> LOC ---> Non-GPE locations, mountain ranges, bodies of water
Skype ---> ORG ---> Companies, agencies, institutions, etc.
$8.5 billion ---> MONEY ---> Monetary values, including unit
Microsoft ---> ORG ---> Companies, agencies, institutions, etc.
Seattle ---> GPE ---> Countries, cities, states
Washington ---> GPE ---> Countries, cities, states
Skype ---> ORG ---> Companies, agencies, institutions, etc.
Palo Alto ---> GPE ---> Countries, cities, states
California ---> GPE ---> Countries, cities, states
Sandeep Junnarkar ---> PERSON ---> People, including fictional
Wikipedia ---> GPE ---> Countries, cities, states
Paris ---> GPE ---> Countries, cities, states
France ---> GPE ---> Countries, cities, states
The Hudson River ---> LOC ---> Non-GPE locations, mountain ranges, bodies o

In [31]:
## find all entities with their label and label descriptors
for word in doc2.ents:
  print(f"{word} ---> {word.label_} ---> {spacy.explain(word.label_)}")

May 10, 2011 ---> DATE ---> Absolute or relative dates or periods
Microsoft ---> ORG ---> Companies, agencies, institutions, etc.
Skype Technologies ---> ORG ---> Companies, agencies, institutions, etc.
Skype ---> ORG ---> Companies, agencies, institutions, etc.
$8.5 billion ---> MONEY ---> Monetary values, including unit
Microsoft ---> ORG ---> Companies, agencies, institutions, etc.
Seattle ---> GPE ---> Countries, cities, states
Washington ---> GPE ---> Countries, cities, states
Skype ---> ORG ---> Companies, agencies, institutions, etc.
Palo Alto ---> GPE ---> Countries, cities, states
California ---> GPE ---> Countries, cities, states
Sandeep Junnarkar ---> PERSON ---> People, including fictional
Wikipedia ---> ORG ---> Companies, agencies, institutions, etc.
Paris ---> GPE ---> Countries, cities, states
France ---> GPE ---> Countries, cities, states
the Mona Lisa ---> WORK_OF_ART ---> Titles of books, songs, etc.
Louvre ---> FAC ---> Buildings, airports, highways, bridges, etc.
T

### Create a CSV that holds all the organizations/companies in a document

In [35]:
## find all entities and all entity labels,
## and place each in its own list using list comprehension

entities = [ word.text for word in doc2.ents ]
ent_labels = [ word.label_ for word in doc2.ents ]
entities
ent_labels  ## only the last expression in a cell is displayed

['DATE',
 'ORG',
 'ORG',
 'ORG',
 'MONEY',
 'ORG',
 'GPE',
 'GPE',
 'ORG',
 'GPE',
 'GPE',
 'PERSON',
 'ORG',
 'GPE',
 'GPE',
 'WORK_OF_ART',
 'FAC',
 'LOC',
 'LOC',
 'CARDINAL',
 'FAC',
 'QUANTITY',
 'LOC',
 'FAC',
 'GPE']

In [37]:
## Turn the two lists into a list of (label, entity) tuples using a for loop
my_data = []

for item in zip(ent_labels, entities):
  my_data.append(item)

In [38]:
my_data

[('DATE', 'May 10, 2011'),
 ('ORG', 'Microsoft'),
 ('ORG', 'Skype Technologies'),
 ('ORG', 'Skype'),
 ('MONEY', '$8.5 billion'),
 ('ORG', 'Microsoft'),
 ('GPE', 'Seattle'),
 ('GPE', 'Washington'),
 ('ORG', 'Skype'),
 ('GPE', 'Palo Alto'),
 ('GPE', 'California'),
 ('PERSON', 'Sandeep Junnarkar'),
 ('ORG', 'Wikipedia'),
 ('GPE', 'Paris'),
 ('GPE', 'France'),
 ('WORK_OF_ART', 'the Mona Lisa'),
 ('FAC', 'Louvre'),
 ('LOC', 'The Hudson River'),
 ('LOC', 'Mahicantuck'),
 ('CARDINAL', 'two'),
 ('FAC', 'Mahicantuck'),
 ('QUANTITY', '315 miles'),
 ('LOC', 'the Atlantic Ocean'),
 ('FAC', 'Mt. Mercy'),
 ('GPE', 'New York')]

In [40]:
df = pd.DataFrame(my_data, columns=["Label", "Word"])
df

Unnamed: 0,Label,Word
0,DATE,"May 10, 2011"
1,ORG,Microsoft
2,ORG,Skype Technologies
3,ORG,Skype
4,MONEY,$8.5 billion
5,ORG,Microsoft
6,GPE,Seattle
7,GPE,Washington
8,ORG,Skype
9,GPE,Palo Alto


In [48]:
## creating a function
def find_ent(tokenized_text):
  '''
  This function takes a spaCy doc and returns a dataframe of entities, labels, and explanations.
  parameter: tokenized_text (a spaCy doc; the text must already have been run through the nlp pipeline)
  '''
  ent_list = []
  if tokenized_text.ents:
    for word in tokenized_text.ents:
      temp_dict = {"word": word.text,
                   "label": word.label_,
                   "meaning": spacy.explain(word.label_)}
      ent_list.append(temp_dict)
  else:
    ## a plain string has no .ents at all, so this branch means the doc had no entities
    print("No entities were found in this text.")
  return pd.DataFrame(ent_list)

In [52]:
df = find_ent(doc2)
df

Unnamed: 0,word,label,meaning
0,"May 10, 2011",DATE,Absolute or relative dates or periods
1,Microsoft,ORG,"Companies, agencies, institutions, etc."
2,Skype Technologies,ORG,"Companies, agencies, institutions, etc."
3,Skype,ORG,"Companies, agencies, institutions, etc."
4,$8.5 billion,MONEY,"Monetary values, including unit"
5,Microsoft,ORG,"Companies, agencies, institutions, etc."
6,Seattle,GPE,"Countries, cities, states"
7,Washington,GPE,"Countries, cities, states"
8,Skype,ORG,"Companies, agencies, institutions, etc."
9,Palo Alto,GPE,"Countries, cities, states"


In [53]:
df.query("label == 'PERSON'")

Unnamed: 0,word,label,meaning
11,Sandeep Junnarkar,PERSON,"People, including fictional"


In [55]:
df.query("label == 'ORG'")

Unnamed: 0,word,label,meaning
1,Microsoft,ORG,"Companies, agencies, institutions, etc."
2,Skype Technologies,ORG,"Companies, agencies, institutions, etc."
3,Skype,ORG,"Companies, agencies, institutions, etc."
5,Microsoft,ORG,"Companies, agencies, institutions, etc."
8,Skype,ORG,"Companies, agencies, institutions, etc."
12,Wikipedia,ORG,"Companies, agencies, institutions, etc."


In [51]:
## Turn the two lists into a list of dictionaries using
## a dictionary literal inside a list comprehension

data_lc = [ {"label": label, "entity": entity} for (label, entity) in zip(ent_labels, entities) ]
data_lc

[{'label': 'DATE', 'entity': 'May 10, 2011'},
 {'label': 'ORG', 'entity': 'Microsoft'},
 {'label': 'ORG', 'entity': 'Skype Technologies'},
 {'label': 'ORG', 'entity': 'Skype'},
 {'label': 'MONEY', 'entity': '$8.5 billion'},
 {'label': 'ORG', 'entity': 'Microsoft'},
 {'label': 'GPE', 'entity': 'Seattle'},
 {'label': 'GPE', 'entity': 'Washington'},
 {'label': 'ORG', 'entity': 'Skype'},
 {'label': 'GPE', 'entity': 'Palo Alto'},
 {'label': 'GPE', 'entity': 'California'},
 {'label': 'PERSON', 'entity': 'Sandeep Junnarkar'},
 {'label': 'ORG', 'entity': 'Wikipedia'},
 {'label': 'GPE', 'entity': 'Paris'},
 {'label': 'GPE', 'entity': 'France'},
 {'label': 'WORK_OF_ART', 'entity': 'the Mona Lisa'},
 {'label': 'FAC', 'entity': 'Louvre'},
 {'label': 'LOC', 'entity': 'The Hudson River'},
 {'label': 'LOC', 'entity': 'Mahicantuck'},
 {'label': 'CARDINAL', 'entity': 'two'},
 {'label': 'FAC', 'entity': 'Mahicantuck'},
 {'label': 'QUANTITY', 'entity': '315 miles'},
 {'label': 'LOC', 'entity': 'the Atlan

In [54]:
df_lc = pd.DataFrame(data_lc)
df_lc

Unnamed: 0,label,entity
0,DATE,"May 10, 2011"
1,ORG,Microsoft
2,ORG,Skype Technologies
3,ORG,Skype
4,MONEY,$8.5 billion
5,ORG,Microsoft
6,GPE,Seattle
7,GPE,Washington
8,ORG,Skype
9,GPE,Palo Alto


In [None]:
## the previous lists hold all entities.
## let's narrow them down to the orgs/companies


In [None]:
## What data types are these?


### Deduplicate?

If you need to deduplicate the results, you can do so by using ```unique()``` in Pandas.

But perhaps you want to keep the duplicates to uncover a pattern in how often terms are used and when.
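
A minimal sketch of both options, assuming a small hand-made table shaped like the entity dataframes built earlier. The dataframe `ents_df` and its rows are illustrative, not taken from a real run.

```python
import pandas as pd

# Hypothetical slice of an entity table like the ones built earlier.
ents_df = pd.DataFrame({
    "word": ["Microsoft", "Skype Technologies", "Skype", "Microsoft", "Skype"],
    "label": ["ORG", "ORG", "ORG", "ORG", "ORG"],
})

# unique() returns each distinct value once, in order of first appearance.
unique_orgs = ents_df["word"].unique()
print(unique_orgs)

# value_counts() keeps the frequency information that deduplication throws away.
org_counts = ents_df["word"].value_counts()
print(org_counts)
```

Which one to use depends on the story: a list of companies to look up calls for `unique()`, while a question like "who gets mentioned most?" calls for `value_counts()`.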


### Export instead

In [None]:
## import pandas
import pandas as pd

In [None]:
## use pandas to write to a csv file
## (all_orgs is the list of org dictionaries from the narrowing-down cell above)
filename = "test-entities-1.csv"
df = pd.DataFrame(all_orgs) ## we turn our list of dicts into a dataframe, which we'll call df
df.to_csv(filename, encoding='utf-8', index=False)


### Create a function to process entities

In [None]:

## function to find entities
def show_entities(my_text):
  '''
  my_text must be a spaCy doc object; already run through the nlp pipeline
  '''
  each_token = "Token"
  entity_type = "Entity"
  entity_def = "Entity Defined"
  ## header row uses the same column widths and spacing as the entity rows
  print(f"{each_token:{30}} {entity_type:{15}} {entity_def}")
  if my_text.ents:
    for word in my_text.ents:
      print(f"{word.text:{30}} {word.label_:{15}} {str(spacy.explain(word.label_))}")
  else:
    print("There are no entities in this text")


In [None]:
## show entities in my english sentence


## Specialized function to capture entity types

In [None]:
## create function to return list of dictionaries of entities and entity labels


In [None]:
## test it to find orgs


## Install other languages
#### Other languages can be found at https://spacy.io/usage/models

#### Disclaimer: Language models are built by open source communities. English and German are the most advanced language models.

### Spanish language model

### ANACONDA ONLY

In [None]:
conda install -c conda-forge spacy-model-es_core_news_sm

### COLAB ONLY

In [None]:
# !python -m spacy download es_core_news_sm


In [None]:
## import the library and create nlp pipeline
import es_core_news_sm
nlp = es_core_news_sm.load()

In [None]:
### Sample Spanish Text (sorry!)
### Note the space before each backslash: without it, the words on either side of the line break get glued together.
stext = """
El 10 de mayo de 2011, Microsoft anunció la adquisición de Skype Technologies, \
creador del servicio de VoIP Skype, por 8.500 millones de dólares. Microsoft tiene \
su sede cerca de Seattle, Washington, mientras que Skype permanece en Palo Alto, \
California. Sandeep Junnarkar obtuvo esto de Wikipedia. Pero preferiría ir a París, \
Francia, a ver la Mona Lisa en el Louvre. El río Hudson realmente debería llamarse por \
su nombre nativo original, Mahicantuck, que significa "el río \
que fluye en dos direcciones". Mahicantuck fluye por 315 millas hacia el Océano Atlántico \
desde su origen en Mt. Mercy, el pico más alto del estado de Nueva York.
"""

In [None]:
## tokenize and show parts of speech for each token


In [None]:
## show the tokens


In [None]:
## show entities


## More NLP:

- Text summarization
- Word frequency
- Context around words
- Surprise ending?