### Introducing Knowledge Graphs
In complex fields, such as science and medicine, the sheer amount of data and literature available on specific topics is hard to overstate. The same goes for knowledge management in established companies and industries where, over time, institutional knowledge in the form of textual information builds up, becoming too large to sensibly disseminate. In both of these cases, a knowledge graph may help to alleviate the issues associated with too much disparate information.

The quality of text data has a large impact on the preparation of data for knowledge graph ingestion.

#### Cleaning the data for our knowledge graph

Knowledge graphs typically contain relationships that represent commonalities between related documents, and they are built up using the content of those documents' text. For this reason, a large part of knowledge graph construction is cleaning and preparing that text for later graph creation.

Let’s begin by taking a look at the raw abstract data in 20k_abstracts.txt.

In [None]:
## Import the raw abstract data
import csv
with open('./data/20k_abstracts.txt') as c:
    reader = csv.reader(c, delimiter='\t')
    data = [line for line in reader if line != []]

With our data loaded, we can now convert it into a different format. We want to concatenate
the sentences of each abstract into one whole abstract (at present, the sentences are on
different lines) and remove unnecessary data.

Now, we can open our for loop and add some logic. We know the sentences begin from the
second line, so we can slice our list of lists to process and begin looping through lines from
data[1:]. We now need to use an if statement to process different lines in a different way.
Where a line’s length is only 1, we know that this just contains a reference number that we
don’t need. Because we start our loop from the second line of the raw data, we also know that
a line of length 1 means that the sentences we have seen before make up one whole abstract.
Applying this logic to our loop, when we see len(line) == 1, we can use append() to
append an abstract to our clean data list, and initialize a new empty abstract string. If we have a
line that is not of length 1, we can add the current line’s second element to the abstract string to
build up a complete abstract from several lines of text and ignore the sentence type annotation:

In [None]:
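clean_data = []
abstract = ''

for line in data[1:]:
    if len(line) == 1:
        # A lone reference number marks the end of the previous abstract
        clean_data.append(abstract.strip())
        abstract = ''
    else:
        # line[0] is the sentence type annotation; line[1] is the sentence text
        abstract += ' ' + line[1]

# The last abstract is not followed by a reference number, so append it here
if abstract:
    clean_data.append(abstract.strip())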

Lastly, we just need to write our data back to the file so that we don’t need to do this cleaning
step every time our knowledge graph is constructed later. We can use the csv module again
to write each abstract to a separate line. We can also take this opportunity to give each abstract
a sequentially increasing integer ID, which igraph will require later, by using enumerate():
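
The write step described above could look like the following minimal sketch. The helper name `write_clean_abstracts` and the output path in the example call are assumptions made for illustration, not names taken from the original data or code; the only inputs it relies on are the `clean_data` list built in the previous cell and the standard `csv` module.

```python
import csv

def write_clean_abstracts(abstracts, path):
    """Write one abstract per row, prefixed with a sequential integer ID.

    `path` is a hypothetical output location chosen by the caller.
    """
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # enumerate() yields (0, first abstract), (1, second abstract), ...
        for abstract_id, abstract in enumerate(abstracts):
            writer.writerow([abstract_id, abstract])

# Example call (the file name is an assumption):
# write_clean_abstracts(clean_data, './data/20k_abstracts_clean.csv')
```

Writing through `csv.writer` rather than joining strings by hand means any commas or quotes inside an abstract are escaped for us, so the file can be read back with `csv.reader` without splitting an abstract across columns.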

#### Ingesting data into a Knowledge Graph
There are a few things to consider before jumping straight into creating a knowledge graph from our cleaned abstract data. First, we must consider the structure we are aiming to produce. We will then need to process our abstracts to extract terms of interest. Finally, once we have the terms, we can create a list of edges to import into igraph.
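
To make these steps concrete, here is a rough sketch of one possible structure: each abstract is linked to every term of interest that appears in it, giving a list of `(abstract_id, term)` edges. This is only an illustration under stated assumptions. The function name `abstract_term_edges`, the toy abstracts, the hand-picked term set, and the naive whitespace tokenization are all hypothetical stand-ins, and real term extraction would need something more careful.

```python
def abstract_term_edges(abstracts, terms_of_interest):
    """Return (abstract_id, term) pairs for every term found in an abstract.

    Assumes terms are single lowercase words; matching is done on
    whitespace-separated, lowercased tokens (a deliberately naive choice).
    """
    edges = []
    for abstract_id, text in enumerate(abstracts):
        tokens = set(text.lower().split())
        # sorted() keeps the edge order deterministic
        for term in sorted(terms_of_interest):
            if term in tokens:
                edges.append((abstract_id, term))
    return edges

# Toy data, invented purely for illustration
example_abstracts = [
    'Aspirin reduced pain in patients',
    'Placebo had no effect on pain',
]
example_terms = {'aspirin', 'pain', 'placebo'}

edges = abstract_term_edges(example_abstracts, example_terms)
print(edges)
# [(0, 'aspirin'), (0, 'pain'), (1, 'pain'), (1, 'placebo')]
```

In this sketch, two abstracts that share a term (here, `pain`) become connected through that term's node, which is the kind of commonality a knowledge graph is meant to expose.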

Getting ingestion into the knowledge graph right is crucial, and this all stems from how you
design your graph schema, both conceptually and practically.