# **arlis-python-nlp**
Natural language processing libraries for text extraction and indexing, designed specifically for the articles in the [Artificial Intelligence Incident Database](https://incidentdatabase.ai/) (AIID).


---


## **Introduction**
This GitHub repository is for those who are involved in filling out the required [taxonomy fields](https://incidentdatabase.ai/taxonomy/cset) for the articles in AIID. As of writing this documentation, many of the taxonomy fields have been filled in manually; our program focuses on automating this process. The taxonomy fields that our program automates are Full description of the incident, Sector of deployment, Location, and Named Entities. If our work continues in the future, we will focus on automating even more taxonomy fields. To use our program, you import our libraries and call our functions to get these taxonomy fields. This is explained further in the following section.

### **Usage** 
This section will explain how to use our program to extract taxonomy fields for an article using our example program, [example.py](https://github.com/UMD-ARLIS/arlis-python-nlp/blob/main/example.py).


---


### **Before using our program**
Our program uses our own library (arlis-python-nlp) as well as NLTK. Both arlis-python-nlp and NLTK must be installed and imported for our program to run. Below are two blocks that show how to install and import the required libraries. (Note: this does not include information on how to install NLTK.) The first block, with five lines of code, is what you would run in your terminal (without the leading `!`).

In [None]:
#https://huggingface.co/joeddav/xlm-roberta-large-xnli?text=Can+you+please+amend+the+invoice+to+reflect+true+capital+expenditure+and+anticipated+revenue%3F&candidate_labels=Medical%2Cbusiness%2Cfinance&multi_class=true

# CUDA 11.1
# Site to visit for later version: https://pytorch.org/get-started/previous-versions/
!pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers
!pip install sentencepiece
!pip install newspaper3k
!python -m spacy download en_core_web_lg
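
Before moving on, it can help to confirm that everything is importable. The snippet below is a minimal sketch, not part of arlis-python-nlp: the helper name `missing_modules` is our own, and the list assumes the usual import names for the packages used in this README (for example, newspaper3k is imported as `newspaper`, and the spaCy model is installed as an importable package named `en_core_web_lg`).

```python
from importlib.util import find_spec

# Import names (not pip names) for the packages this README relies on.
# This list is an assumption based on the install and import cells.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "sentencepiece",
    "newspaper",
    "nltk",
    "pandas",
    "spacy",
    "en_core_web_lg",
]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If the check reports missing modules, rerun the matching install command and try again.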

After installing the required packages on your machine, import the libraries and define the functions as follows:

In [12]:
import transformers
from transformers import pipeline

from newspaper import Article
import nltk
nltk.download('punkt')
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_lg')

classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

#Location Finder
def locationFinder(text):
    gpe = [] # countries, cities, states
    loc = [] # non gpe locations, mountain ranges, bodies of water
    doc = nlp(text)
    for ent in doc.ents:
      if (ent.label_ == 'GPE'):
        gpe.append(ent.text)
      elif (ent.label_ == 'LOC'):
        loc.append(ent.text)
    return gpe, loc

#Extracts the people and organizations involved
def nameEntityFinder(paragraph):
    doc = nlp(paragraph)
    nameEntityDict = {}
    nameEntityDict_v2 = {}
    for ent in doc.ents:
      nameEntityDict[ent.text] = ent.label_
      
    for (key, value) in nameEntityDict.items():
        if value == 'PERSON' or value == 'ORG':
            nameEntityDict_v2[key] = value
    return nameEntityDict_v2

#Returns sector of deployment
def get_Sector_of_Deployment(text):
  sectorDeployment = ['Information and communication', 'Arts, entertainment and recreation', 'Transportation and storage', 'Public administration and defence', 'Administrative and support service activities', 'Human health and social work activities', 'Education', 'Professional, scientific and technical activities', 'Financial and insurance activities', 'Wholesale and retail trade', 'Activities of households as employers', 'Accommodation and food service activities']
  vector = classifier(text, sectorDeployment)
  return vector['labels'][0]

#Return sector of infrastructure
def get_infrastructure_sector(text):
  infrastructureSector = ['Transportation', 'Healthcare and public health', 'Government facilities', 'Communications', 'Food and agriculture', 'Critical manufacturing', 'Nuclear', 'Financial services', 'Information technology']
  vector = classifier(text, infrastructureSector)
  return vector['labels'][0]

#Return harm type
def get_harm_type(text):
  harmType = ['Harm to social or political systems', 'Harm to civil liberties', 'Harm to physical health/safety', 'Psychological harm', 'Financial harm', 'Harm to physical property', 'Harm to intangible property', 'Other:Harm to publicly available information', 'Other:Reputational harm; False incarceration', 'Other:Reputational harm', 'Other:Privacy', 'Other', 'Other:Reputational harm/social harm (libel and defamation)']
  vector = classifier(text, harmType)
  return vector['labels'][0]

#Returns pandas dataframe of all functions
def totalFunctions(url):
  article = Article(url)
  article.download()

  article.parse()  # extract the article text, authors, and metadata
  article.nlp()    # compute article.keywords and article.summary
  df = pd.DataFrame()
  df['Function']=['Keywords', 'Author', 'Article Summary', 'Harm Type', 'Sector of Deployment', 'Sector of Infrastructure', 'Named Entities']
  df['Result']=[article.keywords, article.authors, article.summary, get_harm_type(article.text), get_Sector_of_Deployment(article.text), get_infrastructure_sector(article.text), nameEntityFinder(article.text)]
  return df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
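
The `locationFinder` function defined above accepts any string, such as `article.text` once an article has been parsed, and returns two lists: `gpe` and `loc`. Both lists contain every mention, so the same place can appear many times. One possible way to condense them when filling in the Location field is to count the mentions. This is only a sketch: `most_common_locations` is a hypothetical helper, and `sample_gpe` is made-up data standing in for a real `gpe` list.

```python
from collections import Counter


def most_common_locations(names, top_n=3):
    """Return up to `top_n` (name, count) pairs, most frequent first."""
    cleaned = [name.strip() for name in names if name.strip()]
    return Counter(cleaned).most_common(top_n)


# Made-up stand-in for: gpe, loc = locationFinder(article.text)
sample_gpe = ["US", "Pittsburgh", "US", "California", "US", "Pittsburgh"]

print(most_common_locations(sample_gpe, top_n=2))
# prints [('US', 3), ('Pittsburgh', 2)]
```

The most frequently mentioned place is not necessarily where the incident happened, so the result is best treated as a suggestion to review.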




---


### **Creating the Article Summary Object**
This step is specific to the URL or link you want to use.

To pass a website link to the code, create an article summary object. In the code block below, the Article object is named article, and the article link goes inside the parentheses of Article. The article must then be downloaded, parsed, and run through the nlp function, as shown below. These commands produce no output, but they are required before using the functions that follow.

In [None]:
article = Article('https://www.theguardian.com/technology/2015/jul/08/women-less-likely-ads-high-paid-jobs-google-study')
article.download()
article.parse()
article.nlp()

### **How to use the functions**
There are a total of eight useful functions that can retrieve information from an article. Seven of them each retrieve one item: the keywords, the full description of the article, the article's authors, the named entities in the article, the sector of deployment, the sector of infrastructure, and the harm type (the first three are attributes of the Article object rather than function calls). The last function pulls all of this information together and returns the output in a dataframe. Here, we use a [2015 article from The Guardian](https://www.theguardian.com/technology/2015/jul/08/women-less-likely-ads-high-paid-jobs-google-study) to show how our functions work.

Here is how to extract keywords from an article:

In [None]:
article.keywords

['likely',
 'jobs',
 'highpaid',
 'abuse',
 'substance',
 'google',
 'users',
 'group',
 'ad',
 'sites',
 'shown',
 'shows',
 'adverts',
 'ads',
 'researchers',
 'study',
 'women']

The following line of code gives the user a summary or full description of the article:

In [None]:
article.summary

'Female job seekers are much less likely to be shown adverts on Google for highly paid jobs than men, researchers have found.\nTheir 17,370 fake profiles only visited jobseeker sites and were shown 600,000 adverts which the team tracked and analysed.\nThe adverts shown to the control group did not include any rehabilitation services.\nThe Watershed site was included in the top 100 substance abuse sites list, which was used as the experimental list of sites to visit by the automated system.\nGoogle has said that it prohibits the targeting of adverts within its “sensitive category policy”, which includes health issues such as substance abuse.'

This attribute returns the article's authors:

In [None]:
article.authors

['Samuel Gibbs']

This function lists the named entities that are labeled as people (PERSON) or organizations (ORG):

In [None]:
nameEntityFinder(article.text)

{'AdFisher': 'PERSON',
 'Carnegie Mellon': 'ORG',
 'Google': 'ORG',
 'Watershed': 'ORG'}
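
When reviewing the result, it can be easier to read the entities grouped by label. The following is a minimal sketch that works on the dictionary shown above; `group_by_label` is a hypothetical helper and not part of the library.

```python
from collections import defaultdict


def group_by_label(entities):
    """Turn {entity_text: label} into {label: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for text, label in entities.items():
        grouped[label].append(text)
    # Sort each group so the output is stable and easy to scan.
    return {label: sorted(texts) for label, texts in grouped.items()}


# The dictionary returned by nameEntityFinder(article.text) above.
entities = {
    "AdFisher": "PERSON",
    "Carnegie Mellon": "ORG",
    "Google": "ORG",
    "Watershed": "ORG",
}

print(group_by_label(entities))
# prints {'PERSON': ['AdFisher'], 'ORG': ['Carnegie Mellon', 'Google', 'Watershed']}
```

Grouping makes questionable labels easier to spot, such as `AdFisher` appearing under `PERSON`, which is worth checking against the article.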

This function returns the primary economic sector in which the AI system(s) involved in the incident were operating.

In [13]:
get_Sector_of_Deployment(article.text)

'Administrative and support service activities'

This function returns the field that indicates whether the incident caused harm to any of the economic sectors designated by the U.S. government as critical infrastructure:

In [None]:
get_infrastructure_sector(article.text)

'Information technology'

This function will indicate the type(s) of harm caused or nearly caused by the incident:

In [None]:
get_harm_type(article.text)

'Harm to physical property'
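
The three classification functions return only the top label, `vector['labels'][0]`. The dictionary returned by the zero-shot classification pipeline also carries a `scores` list aligned with `labels`, ordered from highest to lowest score. Looking at the scores is one way to judge how confident the classifier is before accepting the top label. The sketch below assumes that result shape; `labels_with_scores` is a hypothetical helper, and the scores in `sample_result` are invented for illustration (truncated to three labels), not real output for this article.

```python
def labels_with_scores(result, min_score=0.0, top_k=None):
    """Return (label, score) pairs from a zero-shot result dictionary.

    Pairs below `min_score` are dropped; `top_k` limits how many are kept.
    """
    pairs = list(zip(result["labels"], result["scores"]))
    kept = [(label, score) for label, score in pairs if score >= min_score]
    return kept[:top_k] if top_k is not None else kept


# Invented example in the shape returned by classifier(text, harmType).
sample_result = {
    "sequence": "article text goes here",
    "labels": [
        "Harm to physical property",
        "Harm to civil liberties",
        "Financial harm",
    ],
    "scores": [0.21, 0.19, 0.17],
}

for label, score in labels_with_scores(sample_result, top_k=2):
    print(f"{score:.2f}  {label}")
```

When the top scores are close together, as in this invented example, the top label alone says little, and a manual check is advisable.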

Finally, we can combine all seven functions above into one function that returns all the outputs as a dataframe. The dataframe can then be exported as a CSV file or any other file type as needed.

In [None]:
totalFunctions('https://www.theguardian.com/technology/2015/jul/08/women-less-likely-ads-high-paid-jobs-google-study')

| | Function | Result |
| --- | --- | --- |
| 0 | Keywords | [shows, women, shown, ad, adverts, users, abus... |
| 1 | Author | [Samuel Gibbs] |
| 2 | Article Summary | Female job seekers are much less likely to be ... |
| 3 | Harm Type | Harm to physical property |
| 4 | Sector of Deployment | Administrative and support service activities |
| 5 | Sector of Infrastructure | Information technology |
| 6 | Named Entities | {'Carnegie Mellon': 'ORG', 'AdFisher': 'PERSON... |


To save the above dataframe in a csv file, use the code below:

In [None]:
df = totalFunctions('https://www.theguardian.com/technology/2015/jul/08/women-less-likely-ads-high-paid-jobs-google-study')
df.to_csv('ArticleSummaryDataFrame.csv', encoding='utf-8', index=False)
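
A CSV file stores every value as text, so the lists and dictionaries in the Result column come back as strings when the file is read again. One possible way to recover them is `ast.literal_eval` from the standard library. The sketch below is an assumption about how you might read the file back, not part of the library: `parse_result` is a hypothetical helper, and `sample_csv` is a shortened, hand-written excerpt in the same Function/Result layout.

```python
import ast
import csv
import io


def parse_result(value):
    """Convert a stringified list or dict back into a Python object."""
    if value.startswith(("[", "{")):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # leave text that only looks like a literal
    return value


# Hand-written excerpt in the same layout as ArticleSummaryDataFrame.csv.
sample_csv = (
    "Function,Result\n"
    "Author,['Samuel Gibbs']\n"
    "Harm Type,Harm to physical property\n"
    "Named Entities,\"{'Google': 'ORG', 'Watershed': 'ORG'}\"\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
rows = {row["Function"]: parse_result(row["Result"]) for row in reader}

print(rows["Author"])
# prints ['Samuel Gibbs']
```

To read the real file, replace `io.StringIO(sample_csv)` with `open('ArticleSummaryDataFrame.csv', newline='', encoding='utf-8')`.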



---


## **Conclusion** 
This library can extract key information; however, it is important to know that its output may not always be correct. If you are using this library for an important task, it is highly recommended to check the results the code provides. The table below lists the names and contact information of the creators of this library. Please feel free to contact us with any questions.



| Name | Email |
| --- | --- |
|Ujwal Gupta | ugupta12@umd.edu|
|Marcus Hill | mhill128@umd.edu|
|Ayushi Saxena | asaxena1@umd.edu|