# AI - Natural Language Processing
## Part 2 - Functionalize NLP for entities


# ONLY IF NEEDED

## Step 1. Install spaCy

If this is the first time you are using spaCy on this computer, you must first install it with either ```!conda install``` or ```!pip install```:

### TURN OFF FOR COLAB
Run for ANACONDA

In [1]:
conda install -c conda-forge spacy

ValueError: The python kernel does not appear to be a conda environment.  Please use ``%pip install`` instead.

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


### ANACONDA ONLY

In [None]:
conda install -c conda-forge spacy-model-en_core_web_sm

# Import libs

In [1]:
import pandas as pd
import spacy
import glob
import en_core_web_sm

## Import hearings
Download <a href="https://drive.google.com/file/d/1EUYLeHpHAAW2MGsrT6_jov9cJ-IuDLg-/view?usp=sharing">this Senate hearing</a> and turn it into a spaCy doc.

Create a spreadsheet with columns for the entity, the label, and its meaning.

(Remember, you will also have to draw on elements from previous weeks' lessons to accomplish this.)

In [5]:
pip install icecream

Collecting icecream
  Downloading icecream-2.1.3-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting colorama>=0.3.9 (from icecream)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting executing>=0.3.1 (from icecream)
  Downloading executing-2.1.0-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting asttokens>=2.0.1 (from icecream)
  Downloading asttokens-2.4.1-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Downloading asttokens-2.4.1-py2.py3-none-any.whl (27 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading executing-2.1.0-py2.py3-none-any.whl (25 kB)
Installing collected packages: executing, colorama, asttokens, icecream
Successfully installed asttokens-2.4.1 colorama-0.4.6 executing-2.1.0 icecream-2.1.3


In [7]:
from icecream import ic

In [18]:
# Find the .txt files in the working directory
target_files = glob.glob('*.txt')
target_files

['senate-hearing.txt']

In [19]:
# Read each text file into a list of lines
# (all_text holds the lines of the last file read)
for target_file in target_files:
  with open(target_file, "r") as my_text:
    all_text = my_text.readlines()
    print(all_text)



In [30]:
type(all_text)

list

In [33]:
text = " ".join(all_text)
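
The cells above read the file with `readlines()` and then join the lines with a space. Because `readlines()` keeps the trailing `\n` on every line, joining with `" "` puts an extra leading space on each line after the first. That space is visible in the `doc` output further down. A minimal sketch of an alternative, assuming the files are UTF-8 encoded: `read_texts` is a hypothetical helper, not part of the original notebook, that reads each matching file as one string and keeps the text of every file rather than only the last one.

```python
from pathlib import Path


def read_texts(folder=".", pattern="*.txt"):
    """Return {filename: full text} for every file in `folder` matching `pattern`.

    Hypothetical helper; assumes the files are UTF-8 encoded.
    """
    texts = {}
    for path in sorted(Path(folder).glob(pattern)):
        # read_text() returns the whole file as a single string
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts


# Why the join separator matters: readlines() keeps each trailing "\n"
lines = ["first line\n", "second line\n"]
joined_with_space = " ".join(lines)  # adds a space at the start of line two
joined_plain = "".join(lines)        # reproduces the file contents exactly
```

With the file from this lesson in the working directory, `read_texts()["senate-hearing.txt"]` would give a string that can be passed to `nlp` in place of `text`.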

In [34]:
text



In [37]:
nlp = spacy.load('en_core_web_sm')

In [38]:
doc = nlp(text)

In [39]:
doc

[Senate Hearing 118-22]
 [From the U.S. Government Publishing Office]
 
 
                                                         S. Hrg. 118-22
 
                    IMPLEMENTING IIJA: PERSPECTIVES ON
           THE DRINKING WATER AND WASTEWATER INFRASTRUCTURE ACT
 
 
                                 HEARING
 
                                BEFORE THE
 
                               COMMITTEE ON
                       ENVIRONMENT AND PUBLIC WORKS
 
                           UNITED STATES SENATE
 
                     ONE HUNDRED EIGHTEENTH CONGRESS
 
                              FIRST SESSION
 
                                __________
 
                              MARCH 15, 2023
 
                                __________
 
   Printed for the use of the Committee on Environment and Public Works
   
 [GRAPHIC NOT AVAILABLE IN TIFF FORMAT]  
 
 
         Available via the World Wide Web: http://www.govinfo.gov
         
                                __________
 
            

In [40]:
type(doc)

spacy.tokens.doc.Doc

In [41]:
for word in doc.ents:
  print(word)

Senate
118-22
the U.S. Government Publishing Office
118
PERSPECTIVES
ONE HUNDRED
FIRST
MARCH 15, 2023
the Committee on Environment and Public Works
the World Wide Web: http://www.govinfo.gov
         
                                
52
512
PDF                  
WASHINGTON
2023
ONE HUNDRED
Delaware
West Virginia
Ranking
BENJAMIN L. CARDIN
Maryland
KEVIN CRAMER
North Dakota
 BERNARD
Vermont             
Rhode Island
Oklahoma
 JEFF MERKLEY
Oregon
Nebraska
EDWARD J. MARKEY
Massachusetts      
BOOZMAN
Arkansas
 
STABENOW
Michigan
ROGER WICKER
Mississippi
 
KELLY
Arizona
Alaska
 ALEX
California             LINDSEY O. GRAHAM
South Carolina
Pennsylvania
Courtney Taylor
Democratic
Adam Tomlinson
Republican
MARCH 15,
Thomas R.
U.S.
the State of Delaware
Hon
Shelley Moore
U.S.
the State of West 
   
Virginia
3
Fox
Radhika
Office of Water
Environmental Protection Agency
6
9
Carper
Response
Fetterman
23
Capito
Response
Cramer
28
Lummis
Mullin
30
Sullivan
Randy E.
Philadelphia Water 
   Department


In [42]:
for word in doc.ents:
  print(f"{word} ---> {word.label_}--->{spacy.explain(word.label_)}")

Senate ---> ORG--->Companies, agencies, institutions, etc.
118-22 ---> CARDINAL--->Numerals that do not fall under another type
the U.S. Government Publishing Office ---> ORG--->Companies, agencies, institutions, etc.
118 ---> CARDINAL--->Numerals that do not fall under another type
PERSPECTIVES ---> ORDINAL--->"first", "second", etc.
ONE HUNDRED ---> CARDINAL--->Numerals that do not fall under another type
FIRST ---> ORDINAL--->"first", "second", etc.
MARCH 15, 2023 ---> DATE--->Absolute or relative dates or periods
the Committee on Environment and Public Works ---> ORG--->Companies, agencies, institutions, etc.
the World Wide Web: http://www.govinfo.gov
         
                                 ---> WORK_OF_ART--->Titles of books, songs, etc.
52 ---> CARDINAL--->Numerals that do not fall under another type
512 ---> CARDINAL--->Numerals that do not fall under another type
PDF                   ---> FAC--->Buildings, airports, highways, bridges, etc.
WASHINGTON ---> GPE--->Countries, 

In [43]:
entities = [word.text for word in doc.ents]
ent_labels = [word.label_ for word in doc.ents]
entities
ent_labels

['ORG',
 'CARDINAL',
 'ORG',
 'CARDINAL',
 'ORDINAL',
 'CARDINAL',
 'ORDINAL',
 'DATE',
 'ORG',
 'WORK_OF_ART',
 'CARDINAL',
 'CARDINAL',
 'FAC',
 'GPE',
 'CARDINAL',
 'CARDINAL',
 'GPE',
 'GPE',
 'ORG',
 'PERSON',
 'GPE',
 'PERSON',
 'GPE',
 'FAC',
 'GPE',
 'GPE',
 'GPE',
 'GPE',
 'PERSON',
 'PERSON',
 'GPE',
 'GPE',
 'GPE',
 'GPE',
 'PERSON',
 'GPE',
 'GPE',
 'GPE',
 'PERSON',
 'ORG',
 'GPE',
 'GPE',
 'PERSON',
 'NORP',
 'PERSON',
 'NORP',
 'DATE',
 'PERSON',
 'GPE',
 'ORG',
 'PERSON',
 'PERSON',
 'GPE',
 'GPE',
 'GPE',
 'CARDINAL',
 'PERSON',
 'PERSON',
 'ORG',
 'ORG',
 'CARDINAL',
 'CARDINAL',
 'PERSON',
 'ORG',
 'PERSON',
 'CARDINAL',
 'PERSON',
 'ORG',
 'PERSON',
 'CARDINAL',
 'PERSON',
 'PERSON',
 'CARDINAL',
 'PERSON',
 'PERSON',
 'ORG',
 'PERSON',
 'PERSON',
 'ORG',
 'CARDINAL',
 'PERSON',
 'CARDINAL',
 'PERSON',
 'ORG',
 'ORG',
 'CARDINAL',
 'CARDINAL',
 'PERSON',
 'CARDINAL',
 'PERSON',
 'NORP',
 'ORG',
 'DATE',
 'PERSON',
 'PERSON',
 'DATE',
 'CARDINAL',
 'PERSON',
 'PERSON
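
A long list of labels like the one above is hard to read at a glance. One way to summarize it is to count how often each label occurs, sketched here with `collections.Counter` from the standard library. The `sample_labels` list is only the first nine labels from the output above, used as stand-in data. In the notebook you would pass `ent_labels` instead.

```python
from collections import Counter

# Stand-in data: the first nine labels printed above.
# In the notebook, use ent_labels instead.
sample_labels = ["ORG", "CARDINAL", "ORG", "CARDINAL", "ORDINAL",
                 "CARDINAL", "ORDINAL", "DATE", "ORG"]

label_counts = Counter(sample_labels)

# most_common() sorts from the most frequent label to the least frequent
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```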

In [44]:
def find_ent(tokenized_text):
  '''
  Takes a spaCy Doc and returns a dataframe of entities, labels, and explanations.
  param tokenized_text: text that has already been run through the nlp pipeline (a spaCy Doc)
  '''
  ent_list = []
  if tokenized_text.ents:
    for word in tokenized_text.ents:
      temp_dict = {"word" : word.text,
                 "label" : word.label_,
                 "meaning" : spacy.explain(word.label_)}
      ent_list.append(temp_dict)
  else:
    print("Your text must first be tokenized for me to find entities")
  return pd.DataFrame(ent_list)

In [45]:
df = find_ent(doc)
df

Unnamed: 0,word,label,meaning
0,Senate,ORG,"Companies, agencies, institutions, etc."
1,118-22,CARDINAL,Numerals that do not fall under another type
2,the U.S. Government Publishing Office,ORG,"Companies, agencies, institutions, etc."
3,118,CARDINAL,Numerals that do not fall under another type
4,PERSPECTIVES,ORDINAL,"""first"", ""second"", etc."
...,...,...,...
1425,Philadelphia,GPE,"Countries, cities, states"
1426,West Virginia,GPE,"Countries, cities, states"
1427,the years,DATE,Absolute or relative dates or periods
1428,Green \n Bay,LOC,"Non-GPE locations, mountain ranges, bodies of ..."
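
Some entities in the table carry line breaks and runs of spaces from the hearing's layout, for example `Green \n Bay`. A minimal sketch of one way to tidy them: `clean_entity` is a hypothetical helper, not part of the original notebook, that collapses any run of whitespace into a single space. It could be applied to `word.text` inside `find_ent`, or to the `word` column afterwards.

```python
def clean_entity(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no argument splits on any run of whitespace
    return " ".join(text.split())


# Stand-in examples shaped like entities from the output above
raw_entities = ["Green \n Bay", "PDF                  ", "Philadelphia Water \n   Department"]
cleaned = [clean_entity(entity) for entity in raw_entities]
print(cleaned)
```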


In [51]:
filename = 'senate_hearing2.csv'
df.columns = ['word','label','meaning']
df.to_csv(filename, index=False)
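
Several values in the `meaning` column contain commas, such as "Companies, agencies, institutions, etc.", so they have to be quoted in the CSV to stay in one column. A quick round trip is one way to check this. The sketch below uses the standard-library `csv` module on two stand-in rows and a temporary file, purely for illustration. It is not the notebook's own export, which uses `df.to_csv`.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in rows copied from the top of the dataframe output
sample_rows = [
    {"word": "Senate", "label": "ORG",
     "meaning": "Companies, agencies, institutions, etc."},
    {"word": "118-22", "label": "CARDINAL",
     "meaning": "Numerals that do not fall under another type"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample_entities.csv"  # hypothetical filename

    # Write the rows; fields containing commas are quoted automatically
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["word", "label", "meaning"])
        writer.writeheader()
        writer.writerows(sample_rows)

    # Read the file back and confirm the columns survived intact
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows_read = list(reader)

print(header)
print(rows_read[0]["meaning"])
```

The same check on the real export would be to open `senate_hearing2.csv` and confirm that it has exactly the three columns `word`, `label`, and `meaning`.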