# Introduction

This notebook introduces the text processing pipeline and will get you up and running for development quickly. The hope is that these classes can be extended easily to accommodate new features, such as a `Sentence.tags()` method that returns classification tags or a `Sentence.similar()` method that returns similar sentences from other documents.
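
As a rough illustration of the kind of extension meant here, the sketch below adds a `tags()` method through subclassing. Everything in it is an assumption: the `Sentence` class is a small stand-in that mirrors only `text()`, and `TaggedSentence`, `KEYWORD_TAGS`, and the keyword matching are hypothetical placeholders for a real classifier, not part of the pipeline.

```python
class Sentence:
    """Stand-in for the pipeline's Sentence class (only text() is mirrored)."""

    def __init__(self, text):
        self._text = text

    def text(self):
        return self._text


class TaggedSentence(Sentence):
    """Hypothetical extension that adds a tags() method."""

    # Assumed keyword-to-tag mapping, for illustration only.
    KEYWORD_TAGS = {
        "infrastructure": "critical-infrastructure",
        "growth": "economy",
    }

    def tags(self):
        """Return the sorted tags whose keyword occurs in the sentence text."""
        lowered = self.text().lower()
        return sorted(
            {tag for keyword, tag in self.KEYWORD_TAGS.items() if keyword in lowered}
        )


sentence = TaggedSentence(
    "Encouraging ICT growth and protecting critical information infrastructures."
)
print(sentence.tags())  # ['critical-infrastructure', 'economy']
```

Subclassing keeps the existing class untouched while experimenting; a method that proves useful could later be moved into `Sentence` itself.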

## Building a Corpus

The `Corpus` object is the main container for the documents. The first time you instantiate this class, you need to specify `from_file_only=False` to kick off text extraction from the PDFs. Once a PDF is converted, a text version is saved and loaded in subsequent instantiations. If you leave this argument out, the corpus loads only those documents that already have a text version. This is useful because text extraction times out on a few PDFs and fails to generate a text version for them (an unresolved problem), so after the first call, using `from_file_only=True` (the default) speeds up development.
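
The extract-once, load-afterwards behavior described above can be pictured with the minimal sketch below. It is not the pipeline's implementation: `load_document_text` and `extract_text_from_pdf` are hypothetical names, the extraction step is a placeholder, storing the text next to the PDF with a `.txt` suffix is an assumption, and so is reusing an existing text file even when `from_file_only=False`.

```python
import tempfile
from pathlib import Path


def extract_text_from_pdf(pdf_path):
    """Placeholder for the slow PDF extraction step (hypothetical)."""
    return f"extracted text of {pdf_path.name}"


def load_document_text(pdf_path, from_file_only=True):
    """Return cached text if present; extract and cache only when allowed."""
    text_path = pdf_path.with_suffix(".txt")  # assumed cache location
    if text_path.exists():
        return text_path.read_text(encoding="utf-8")
    if from_file_only:
        return None  # no text version yet, so this document is skipped
    text = extract_text_from_pdf(pdf_path)
    text_path.write_text(text, encoding="utf-8")
    return text


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "Kenya_2014.pdf"
    first = load_document_text(pdf)                          # nothing cached yet
    second = load_document_text(pdf, from_file_only=False)   # extract and save
    third = load_document_text(pdf)                          # served from the text file

print(first, second == third)  # None True
```

With this pattern, a document whose extraction failed never gets a text file, so the default mode simply skips it instead of retrying the slow step on every run.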

In [1]:
from cybersecurity_nlp.pipelines.corpus import Corpus
# corp = Corpus(from_file_only=False) # Run pdf text extraction, need to do this on the first call!
corp = Corpus() # Load docs from text files

2018-07-29 11:53:36,232:INFO: Parsing key file
2018-07-29 11:53:36,236:INFO: 97 documents found in key file
2018-07-29 11:53:36,240:INFO: Reading Kenya_2014_GOK-national-cybersecurity-strategy.pdf from text file
2018-07-29 11:53:36,244:INFO: Reading Korea_RepublicOf_2011_KOR_NCSS_2011.pdf from text file
2018-07-29 11:53:36,251:INFO: Reading Russia_2000.pdf from text file
2018-07-29 11:53:36,258:INFO: Reading Australia_2009_AG%20Cyber%20Security%20Strategy%20-%20for%20website.pdf from text file
2018-07-29 11:53:36,265:INFO: Reading Australia_2016_Cyber-Strategy.pdf from text file
2018-07-29 11:53:36,269:INFO: Reading Latvia_2014_LVA_CSS_2014-2018.pdf from text file
2018-07-29 11:53:36,272:INFO: Reading Rwanda%20NCSS%20NICI_III.pdf from text file
2018-07-29 11:53:36,277:INFO: Reading Austria_2013_130415_strategie_cybersicherheit_en_web.pdf from text file
2018-07-29 11:53:36,280:INFO: Reading Lithuania_2011_EIS%28KS%29PP_796_2011-06-29_EN_PATAIS.pdf from text file
2018-07-29 11:53:36,285:

## Documents and Sentences

With the `Corpus` object, you can easily get `Document` objects and `Sentence` objects from those documents. The examples below walk through how you can do this.

In [2]:
documents = corp.documents()

In [3]:
documents[0]

<cybersecurity_nlp.pipelines.document.Document at 0x1206b2048>

In [4]:
print("ID:", documents[0].id())
print("Country:", documents[0].country())
print("Year:", documents[0].year())
print("URL:", documents[0].url())

ID: 1
Country: Kenya
Year: 2014
URL: https://www.itu.int/en/ITU-D/Cybersecurity/Documents/National_Strategies_Repository/Kenya_2014_GOK-national-cybersecurity-strategy.pdf


In [5]:
sentences = documents[0].sentences()

In [6]:
sentences[10]

<cybersecurity_nlp.pipelines.sentence.Sentence at 0x1206c0e48>

In [7]:
print(sentences[10].text())

The Strategy defines Kenya's cybersecurity vision, key objectives, and ongoing commitment to support national priorities by encouraging ICT growth and aggressively protecting critical information infrastructures.


In [8]:
print(sentences[10].is_bad())

False
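
One common use of `is_bad()` is to drop flagged sentences before further processing. The sketch below shows that filter on a stand-in class, since the notebook does not say how the pipeline decides that a sentence is bad: here the flag is simply passed in by hand, and the example sentences are made up for illustration.

```python
class Sentence:
    """Stand-in mirroring only the text() and is_bad() methods used above."""

    def __init__(self, text, bad=False):
        self._text = text
        self._bad = bad  # the real pipeline decides this; here it is passed in

    def text(self):
        return self._text

    def is_bad(self):
        return self._bad


# Illustrative sentences, not taken from the corpus.
sentences = [
    Sentence("The Strategy defines a national cybersecurity vision."),
    Sentence("(garbled extraction output)", bad=True),
    Sentence("Critical information infrastructures must be protected."),
]

# Keep only the sentences that are not flagged as bad.
clean = [s for s in sentences if not s.is_bad()]
print(len(clean))  # 2
```

With the real objects, the same list comprehension should apply directly to the result of `documents[0].sentences()`.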


In [9]:
documents[3].key_terms()

[('cyber security strategy', 0.38162230548737924),
 ('cyber security', 0.10147222501118881),
 ('australian government', 0.09161004175164054),
 ('national security', 0.021988220002407053),
 ('critical infrastructure', 0.016835651290452034),
 ('information', 0.016792057807714436),
 ('australian', 0.0160499103577135),
 ('security', 0.013668806400426194),
 ('national interest', 0.013567926860251562),
 ('security policy', 0.012427159465926079),
 ('government', 0.009959535259817225),
 ('system', 0.009855367670856669),
 ('business', 0.009796623306084884),
 ('cyber', 0.008633985829629013),
 ('digital economy', 0.008510761354943011),
 ('security strategy', 0.00849379325432838),
 ('internet', 0.008203318721244756),
 ('good practice', 0.007794338508209304),
 ('threat', 0.007779643561479822),
 ('private sector', 0.007237800704051011)]
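
Because `key_terms()` returns a list of `(term, score)` tuples, as the output above shows, the result can be post-processed with plain Python. The sketch below is one way to do that; `top_terms`, `min_score`, and `multiword_only` are hypothetical names introduced for illustration, and the input is a hand-copied subset of the output above rather than a live call.

```python
def top_terms(key_terms, min_score=0.01, multiword_only=False):
    """Filter (term, score) pairs by score and, optionally, by word count."""
    selected = []
    for term, score in key_terms:
        if score < min_score:
            continue
        if multiword_only and len(term.split()) < 2:
            continue
        selected.append(term)
    return selected


# Subset of the key terms printed above for documents[3].
key_terms = [
    ("cyber security strategy", 0.38162230548737924),
    ("cyber security", 0.10147222501118881),
    ("information", 0.016792057807714436),
    ("system", 0.009855367670856669),
]

print(top_terms(key_terms))
# ['cyber security strategy', 'cyber security', 'information']
print(top_terms(key_terms, multiword_only=True))
# ['cyber security strategy', 'cyber security']
```

The default threshold of 0.01 is arbitrary; the steep drop in scores after the first few terms suggests tuning it per document.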