## This script scrapes the first 10 headlines from the Artificial Intelligence topic page of C4ISRNET and places them in a pandas DataFrame. It then uses the spaCy NLP library to run named entity recognition on a sample string and visualizes the entities in a way that non-technical audiences can understand. Last, it manually labels the CSET acronym as an organization and AI as a product in a second sample string. This labels only that one document and does not change what the spaCy model recognizes in other text.

# Requirements:
#### The following Python packages are required for this script:
##### - spaCy (along with its small English model, en_core_web_sm)
##### - BeautifulSoup4
##### - pandas
##### - requests
##### - html5lib (the parser passed to BeautifulSoup)

### If you do not have these, run the commands below to install them.

In [3]:
#The following lines use pip from the command line to install spaCy (an NLP library) and its small English model
#They also install beautifulsoup4 (an HTML parsing library used for web scraping), html5lib (the parser used below),
#requests (to fetch the page), and pandas for general data manipulation
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install beautifulsoup4 html5lib requests
!pip install pandas

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
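
Before installing anything, it can help to check which packages are already importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the list assumes the usual import names for these packages (`bs4` is the import name of beautifulsoup4).

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]

# Import names used by this notebook (bs4 is the import name of beautifulsoup4)
required = ["spacy", "bs4", "pandas", "requests", "html5lib"]
print("Missing packages:", missing_packages(required) or "none")
```

Only the packages reported as missing need a `pip install`.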


In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

#Set URL to the C4ISRNET - artificial intelligence topic page
url = "https://www.c4isrnet.com/artificial-intelligence/"

#Send the request; stop early on a timeout or an HTTP error status
r1 = requests.get(url, timeout=10)
r1.raise_for_status()
news = r1.content

#Parse the page with BeautifulSoup, using the html5lib parser
soup = BeautifulSoup(news, "html5lib")

#For this, I inspected the HTML of the webpage and identified the relevant elements to scrape,
#i.e. the headlines
headlines = soup.find_all("h4", class_ = " headline")
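
The parsing step above can only be checked against the live page. As a sketch, the same extraction can be wrapped in a function that takes raw HTML, so it can be tried on a small sample first. The function name `extract_headlines` and the sample markup are made up for illustration, and the real page structure may differ. It uses the CSS selector `h4.headline`, which matches on the class token, and the built-in `html.parser` so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

def extract_headlines(html, limit=10):
    """Return the text of up to `limit` <h4 class="headline"> elements."""
    soup = BeautifulSoup(html, "html.parser")
    # select() matches on the class token, so extra classes or stray spaces
    # in the class attribute do not matter
    tags = soup.select("h4.headline")
    return [tag.get_text(strip=True) for tag in tags[:limit]]

# Made-up markup standing in for the real page
sample_html = """
<div>
  <h4 class="headline"><a href="/a">The 3 major security threats to AI</a></h4>
  <h4 class=" headline featured"> War is Coding </h4>
  <h4 class="byline">Not a headline</h4>
</div>
"""
print(extract_headlines(sample_html))
```

Against the live site, the HTML would come from `requests.get(url, timeout=10)` followed by `raise_for_status()`, with the response text passed to the function.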

In [7]:
#Determines number of headlines
article_length = len(headlines)

#Print to the user how many headlines collected
print('There are', article_length, 'headlines collected.')

There are 10 headlines collected.


In [8]:
#Extract the text of each headline into a list
headline_list = [headline.get_text() for headline in headlines]

#Create Dataframe from headline_list
headline_df = pd.DataFrame(headline_list, columns = ['Headlines'])

#Look at Dataframe with collected headlines
#We can expand the BeautifulSoup scraping code to also save article URLs for research
headline_df

Unnamed: 0,Headlines
0,"Tomorrow Wars Volume 1 Issue 6: Labor, Force"
1,The 3 major security threats to AI
2,Can DARPA CREATE an AI for unmanned-unmanned t...
3,Who should manage the Pentagon’s AI data? DARP...
4,4 intel challenges from a former Central Comma...
5,The Pentagon’s AI center is poised for a break...
6,How the Pentagon is tackling deepfakes as a na...
7,Tomorrow Wars Volume 1 Issue 5: War is Coding
8,Now is the time to double down on artificial i...
9,4 big problems the intelligence community face...
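
The comment above mentions also saving article URLs. The following is a rough sketch of that extension, assuming each headline `<h4>` wraps an `<a href>` link. That is a guess about the page markup, not something confirmed here. The function name `headlines_with_urls` and the sample HTML are hypothetical.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

def headlines_with_urls(html, base_url):
    """Build a DataFrame of headline text and absolute article URL."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.select("h4.headline"):
        link = tag.find("a", href=True)  # None when the headline has no link
        rows.append({
            "Headlines": tag.get_text(strip=True),
            "URL": urljoin(base_url, link["href"]) if link else None,
        })
    return pd.DataFrame(rows, columns=["Headlines", "URL"])

# Made-up markup; the real page structure is an assumption
sample_html = """
<h4 class="headline"><a href="/artificial-intelligence/story-one/">Story one</a></h4>
<h4 class="headline">Story without a link</h4>
"""
df = headlines_with_urls(sample_html, "https://www.c4isrnet.com/artificial-intelligence/")
print(df)
```

`urljoin` turns relative links into absolute ones, so the URL column can be used directly for follow-up requests.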


In [None]:
#Import the spaCy NLP library and the displaCy visualizer, and load the small English model
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

#Sample text from C4ISR.net article
text = u"At the heart of the tensions between the Pentagon and Silicon Valley are two high-profile petitions. The first was signed and circulated by workers at Google opposed to the company’s participation in developing image processing algorithms for Project Maven. The second was circulated among workers at Microsoft in protest of the company adapting augmented reality tool HoloLens for the battlefield."

#Run the spaCy pipeline on the text string above and store the result in doc
doc = nlp(text)

#Create for loop to identify entities in sample text string
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

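#Visualize entity recognition with the displaCy visualizer
#This is more useful when presenting to a less technical audience
#Note: displacy.serve starts a local web server and blocks this cell until it is interrupted;
#inside a notebook, displacy.render(doc, style="ent") shows the same markup inline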
displacy.serve(doc, style="ent")

Pentagon 41 49 ORG
Silicon Valley 54 68 LOC
two 73 76 CARDINAL
first 105 110 ORDINAL
Google 151 157 ORG
Project Maven 243 256 ORG
second 262 268 ORDINAL
Microsoft 301 310 ORG

Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...
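
Because `displacy.serve` blocks until the server is stopped, it can be more convenient to get the markup as a string with `displacy.render`. The sketch below uses displaCy's manual mode, so it needs no trained model. The entities are passed as character offsets, like the ones printed above. The helper `char_span` is hypothetical, and only the first sentence of the sample text is used.

```python
from spacy import displacy

text = ("At the heart of the tensions between the Pentagon and Silicon Valley "
        "are two high-profile petitions.")

def char_span(text, phrase, label):
    """Locate `phrase` in `text` and return a displaCy manual-mode entity dict."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

# Labels copied from the entity output above; offsets are recomputed from the text
manual_doc = {
    "text": text,
    "ents": [
        char_span(text, "Pentagon", "ORG"),
        char_span(text, "Silicon Valley", "LOC"),
    ],
    "title": None,
}

# jupyter=False forces render() to return the HTML instead of displaying it
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:200])
```

The returned string can be written to an `.html` file and shared without running a server.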



In [None]:
#Import the Span class, which marks a slice of a Doc by token index
from spacy.tokens import Span

#Below is a sample string that we will use later for entity recognition
doc2 = nlp(u"""CSET will deliver nonpartisan analysis
            and advice to the U.S. and international policy
            and academic community on AI and other emerging technologies.""")

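#Below we manually set the entities on this one Doc: CSET as an organization (ORG)
#and AI as a product (PRODUCT). This does not change spaCy's model or how other text is processed.
#Span(doc2, start, end) takes token indices, and the whitespace runs created by the line breaks
#in the string above count as tokens, which is why AI sits at index 19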
doc2.ents = [Span(doc2, 0,1, label=doc2.vocab.strings[u"ORG"]),
            Span(doc2, 19,20, label=doc2.vocab.strings[u"PRODUCT"])]


#Visualize updated entity recognition
displacy.serve(doc2, style="ent")

Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...
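
Assigning to `doc2.ents` labels only that one document. One way to have spaCy tag CSET and AI in every text it processes is a rule-based entity ruler. The sketch below assumes spaCy v3, since the pipe is constructed differently in v2. It uses a blank English pipeline so that it runs without the trained model, and the variable names are illustrative.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components (assumes spaCy v3)
nlp_rules = spacy.blank("en")

# The entity ruler tags every match of these patterns in any text it processes
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "CSET"},
    {"label": "PRODUCT", "pattern": "AI"},
])

doc3 = nlp_rules("CSET will deliver nonpartisan analysis and advice on AI "
                 "and other emerging technologies.")
found = [(ent.text, ent.label_) for ent in doc3.ents]
print(found)
```

With `en_core_web_sm`, the ruler can be added with `before="ner"` so that its matches take precedence over the statistical entity recognizer.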

