### Introducing Knowledge Graphs
In complex fields such as science and medicine, the sheer amount of data and literature available on specific topics is hard to overstate. The same goes for knowledge management in established companies and industries, where institutional knowledge in the form of textual information builds up over time until there is too much of it to disseminate sensibly. In both of these cases, a knowledge graph may help to alleviate the issues associated with too much disparate information.

The quality of the text data largely determines how much preparation is needed before it can be ingested into a knowledge graph.

#### Cleaning the data for our knowledge graph

Knowledge graphs typically contain relationships that represent commonalities between related documents, and they are built up from the text within those documents. For this reason, a large part of knowledge graph construction is cleaning and preparing that text for later graph creation.

Let’s begin by taking a look at the raw abstract data in 20k_abstracts.txt.

In [1]:
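# Import the raw abstract data
import csv

# newline='' is the mode the csv module recommends for file objects;
# QUOTE_NONE keeps quotation marks in the text from being parsed as field quoting
with open('./data/20k_abstracts.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    # Skip the blank lines that separate abstracts
    data = [line for line in reader if line]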

With our data loaded, we can now convert it into a more useful format. At present, each sentence of an abstract sits on its own line, so we want to concatenate the sentences into one whole abstract and remove the data we do not need.

In [10]:
data[:3]

[['###24290286'],
 ['BACKGROUND',
  'IgE sensitization to Aspergillus fumigatus and a positive sputum fungal culture result are common in patients with refractory asthma .'],
 ['BACKGROUND',
  'It is not clear whether these patients would benefit from antifungal treatment .']]
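The sample above also shows that the sentences arrive pre-tokenised, with a space before the closing period ('asthma .'). If that spacing is unwanted later on, it could be tidied with a couple of regular expressions. The following is only a minimal sketch: `tidy_spacing` is a hypothetical helper, not part of the original notebook, and the set of punctuation marks it handles is an assumption.

```python
import re

def tidy_spacing(text):
    """Remove the spaces that tokenisation leaves around punctuation.

    Hypothetical helper; the punctuation set below is an assumption.
    """
    # Drop whitespace before closing punctuation: 'asthma .' becomes 'asthma.'
    text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
    # Drop whitespace after an opening parenthesis: '( n' becomes '(n'
    text = re.sub(r'\(\s+', '(', text)
    # Collapse any remaining runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

sample = 'It is not clear whether these patients would benefit from antifungal treatment .'
print(tidy_spacing(sample))
```

A pattern this simple will also touch things like spaced-out decimals, so it is worth checking on a few abstracts before applying it to the whole file.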

In [None]:
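clean_data = []
abstract = ''

# Skip the first ID line, then collect sentences until the next ID line
for line in data[1:]:
    if len(line) == 1:
        # An ID line (e.g. '###24290286') marks the start of the next abstract
        clean_data.append(abstract.strip())
        abstract = ''
    else:
        # A sentence line holds [section label, sentence text]
        abstract += ' ' + line[1]

# The final abstract has no ID line after it, so append it here
if abstract:
    clean_data.append(abstract.strip())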
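The loop above discards the identifier on each '###' line. If it is useful to trace an abstract back to its source later, the same grouping could be keyed on that identifier instead. This is a sketch under assumptions rather than the notebook's own method: `group_abstracts` is a hypothetical function name, and the second abstract in `sample_rows` (ID and text) is made up purely for illustration.

```python
def group_abstracts(rows):
    """Map each abstract identifier to its concatenated sentences."""
    sentences_by_id = {}
    current_id = None
    for row in rows:
        if len(row) == 1 and row[0].startswith('###'):
            # ID line: start a new abstract, dropping the '###' prefix
            current_id = row[0].lstrip('#')
            sentences_by_id[current_id] = []
        elif current_id is not None and len(row) >= 2:
            # Sentence line: keep the text, ignore the section label
            sentences_by_id[current_id].append(row[1])
    return {key: ' '.join(parts) for key, parts in sentences_by_id.items()}

sample_rows = [
    ['###24290286'],
    ['BACKGROUND', 'IgE sensitization to Aspergillus fumigatus and a positive sputum fungal culture result are common in patients with refractory asthma .'],
    ['BACKGROUND', 'It is not clear whether these patients would benefit from antifungal treatment .'],
    ['###00000000'],  # made-up ID, for illustration only
    ['OBJECTIVE', 'A placeholder sentence .'],  # made-up text
]

grouped = group_abstracts(sample_rows)
print(list(grouped))
```

Because each abstract is stored as soon as its ID line is seen, the last abstract in the file needs no special handling after the loop.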