# Create a Knowledge Graph from Text

In this project, you’ll create a knowledge graph from text. A knowledge graph is a structured map of information that links entities, facts, and 
related concepts. This representation makes data easier to relate and retrieve, so computers can quickly navigate and interpret facts 
and concepts.


## Task 1: Import Libraries

To create the knowledge graph from text and display it, you’ll need some libraries. In this task, import the following libraries that will be used in
the project:

1. wikipedia: This is used to obtain textual data from Wikipedia pages.

2. re: This is used for text preprocessing.

3. requests: This is used for obtaining and processing API responses.

4. spacy: This is used for tasks related to natural language processing.

5. spacy_transformers: This adds transformer-based pipeline components to spaCy, which some coreference resolution setups rely on.

6. displacy from spacy: This is used to visualize dependencies in a sentence.

7. Matcher from spacy.matcher: This is used for rule-based matching of token patterns.

8. networkx: This is used to create a knowledge graph.

9. Network from pyvis.network: This is used to display an interactive graph.

   

In [3]:
import wikipedia as wp 
import re
import requests
import spacy
import spacy_transformers
from spacy import displacy
from spacy.matcher import Matcher
import networkx as nx
from pyvis.network import Network

## Task 2: Load the Data

The data for this project is English textual data obtained from Wikipedia through the wikipedia library.

To complete this task:

1. Set the language of the API request using the wikipedia.set_lang() function.

2. Obtain the data of a Wikipedia page using the wikipedia.page(<title>).content command.

3. Print the data to get a gist of it.

    

In [2]:
# Set the language of the response
wp.set_lang("en")

# Obtain and store the data
title = " 'New York City' "
data = wp.page(title).content

# View the data
print(data)

New York, often called New York City or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. New York is a global center of finance and commerce, culture, technology, entertainment and media, academics and scientific output, the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.
With an estimated population in 2023 of 8,258,035 distributed over 300.46 square miles (778.2 km2), the city is the most densely populated major city in the United States. New York City has more than double the population of Los Angeles, the nation's second-most populous city. New York is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the U.S. by both population and urban area. With more than 20.1 milli

## Task 3: Preprocess the Data

In this task, you’ll preprocess the data to prepare it for the later tasks. To complete this task:

1. Convert all the data to lowercase.

2. Remove line breaks, including blank lines.

3. Remove selected punctuation marks (@, #, :, &, and ") from the data.

4. Remove the text enclosed in the parentheses to simplify the text.

5. Remove the headings that are enclosed in == and ===.

6. Delete any text following the See Also heading.

7. Print the text to verify the results.



In [3]:
# Convert the data to lowercase and remove newline characters
data = data.lower().replace('\n', "")

# Remove everything after the "see also" heading, certain punctuation marks, headings, and any text within parentheses
# (a raw string keeps the regex escapes such as \( from being treated as string escapes)
data = re.sub(r'== see also ==.*|[@#:&"]|===.*?===|==.*?==|\(.*?\)', '', data)

# View the data
print(data)

new york, often called new york city or nyc, is the most populous city in the united states, located at the southern tip of new york state on one of the world's largest natural harbors. the city comprises five boroughs, each coextensive with a respective county. new york is a global center of finance and commerce, culture, technology, entertainment and media, academics and scientific output, the arts and fashion, and, as home to the headquarters of the united nations, international diplomacy.with an estimated population in 2023 of 8,258,035 distributed over 300.46 square miles , the city is the most densely populated major city in the united states. new york city has more than double the population of los angeles, the nation's second-most populous city. new york is the geographical and demographic center of both the northeast megalopolis and the new york metropolitan area, the largest metropolitan area in the u.s. by both population and urban area. with more than 20.1 million people in
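
To see what each alternative in the pattern contributes, the same substitution can be tried on a short sample. This is a minimal sketch: the sample string is invented for illustration and is not taken from the Wikipedia page.

```python
import re

# Invented sample text that mimics the layout of a Wikipedia page (not real page content)
sample = (
    "The City (pop. 100) is big.\n"
    "== History ==\n"
    "It was founded: long ago.\n"
    "=== Early years ===\n"
    "Growth & trade followed.\n"
    "== See also ==\n"
    "List of cities"
)

# Same first step as the notebook cell: lowercase and drop newline characters
text = sample.lower().replace("\n", "")

# Same alternatives as the notebook cell, written as a raw string:
#   == see also ==.*   everything from the "see also" heading onward
#   [@#:&"]            selected punctuation marks
#   ===.*?===, ==.*?== sub-headings and headings
#   \(.*?\)            text in parentheses
pattern = r'== see also ==.*|[@#:&"]|===.*?===|==.*?==|\(.*?\)'
cleaned = re.sub(pattern, "", text)

print(cleaned)
```

Because newline characters are replaced with an empty string, sentences from adjacent lines run together, which is also visible in the notebook output (`diplomacy.with`). Replacing newlines with a space instead would likely keep the sentences apart, at the cost of some doubled spaces.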

## Task 4: Recognize Named Entities

Named entity recognition (NER) is an important NLP task. In this task, you’ll perform NER on the preprocessed text. To complete this task, do the 
following:

1. Load a language model.

2. Apply the language model to the text document.

3. Display the sentences from the document, along with the labeled named entities.



In [4]:
# Load a language model
nlp = spacy.load('en_core_web_lg')

# Apply the language model to the text
doc = nlp(data)

# Display the entities in the doc
displacy.render(doc, style="ent", jupyter=True)

## Task 5: Compute Coreference Clusters

To make the document more coherent, you need to link pronouns and other repeated mentions back to the entities they refer to, i.e., you must do coreference resolution.


To complete this task:
    
1. Add a coreference resolution component in the NLP pipeline.
    
2. Pass the data to the pipeline.
    
3. Print resolved coreferences.
    

In [5]:
# Add the coreference resolution component in the pipeline
nlp.add_pipe('coreferee')

# Pass the data to the language model 
doc = nlp(data)

# Print resolved coreferences, if any
doc._.coref_chains.print()

0: york(1), york(54), city(110), city(117), york(146), york(161)
1: nyc(9), city(15)
2: states(19), city(41), states(121)
3: city(125), city(205), city(217), its(219)
4: 20.1(183), its(187), its(195)
5: world(210), world(255)
6: states(232), city(252), city(261)
7: .(273), city(288)
8: population(280), its(290)
9: amsterdam(294), amsterdam(314)
10: city(322), city(327)
11: ii(343), his(348)
12: city(368), city(381), its(389), city(416), city(434)
13: manhattan(392), manhattan(412)
14: world(422), world(437), world(455), world(488)
15: york(445), york(470), city(491)
16: metropolitan(471), their(506)
17: area(472), its(474), it(478)
18: world(496), world(543), world(558)
19: city(519), city(535), city(540), city(562)
20: york(595), york(604), city(630)
21: duke(602), duke(617)
22: kingdom(637), it(641)
23: algonquians(663), their(669)
24: harbor(709), harbor(755), harbor(786)
25: verrazzano(720), he(722)
26: area(725), it(730)
27: captain(742), he(769), he(803)
28: gomes(744), company(8

## Task 6: Resolve Coreferences

In this task, you’ll update the text by resolving coreferences. To complete this task:

1. Iterate through the document one token at a time.

2. If the token lies in a coreference chain, replace it with its resolution in the text; otherwise, keep the token as it is.

3. Print the document afterward to verify the coreference resolution.

    

In [6]:
resolved_data = ""
for token in doc:
    resolved_coref = doc._.coref_chains.resolve(token)
    if resolved_coref:
        resolved_data += " " + " and ".join(r.text for r in resolved_coref)
    elif token.dep_ == "punct":
        resolved_data += token.text
    else:
        resolved_data += " " + token.text
print(resolved_data)

 new york, often called new york city or nyc, is the most populous nyc in the united states, located at the southern tip of new york state on one of the world 's largest natural harbors. the states comprises five boroughs, each coextensive with a respective county. new york is a global center of finance and commerce, culture, technology, entertainment and media, academics and scientific output, the arts and fashion, and, as home to the headquarters of the united nations, international diplomacy.with an estimated population in 2023 of 8,258,035 distributed over 300.46 square miles, the york is the most densely populated major york in the united states. new york city has more than double the population of los angeles, the nation 's second- most populous city. new york is the geographical and demographic center of both the northeast megalopolis and the new york metropolitan area, the largest metropolitan area in the u.s . by both population and urban area. with more than 20.1 million peop
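
The replacement loop can also be illustrated without spaCy. The sketch below uses plain Python lists; the token list and the `antecedents` mapping are invented stand-ins for the document tokens and for what `doc._.coref_chains.resolve(token)` returns, not actual Coreferee output.

```python
import string

# Hypothetical tokens and a hypothetical mapping from a token's index to the
# words it refers to (a stand-in for the coreference chains)
tokens = ["peter", "and", "mary", "left", ".", "they", "were", "tired", "."]
antecedents = {5: ["peter", "mary"]}

resolved = ""
for i, tok in enumerate(tokens):
    if i in antecedents:
        # Replace the referring token with its antecedent(s), joined by "and"
        resolved += " " + " and ".join(antecedents[i])
    elif tok in string.punctuation:
        # Attach punctuation directly to the preceding word
        resolved += tok
    else:
        resolved += " " + tok

print(resolved)
```

As in the notebook cell, every word is prefixed with a space, so the result starts with a leading space; calling `.strip()` on the final string would remove it.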

## Task 7: Extract Relationships

In a knowledge graph, nodes represent entities (subjects and objects), and edges represent the relationships between them. In this task, you’ll create a function, 
extract_relationship(), that receives a sentence as an argument and does the following:

1. Declares variables for the two entities (the subject and the object).

2. Iterates through the sentence one noun chunk at a time.

3. Stores the first noun chunk as the subject and the last noun chunk as the object.

4. When two noun chunks are found, it returns them along with the text between them as the relationship.

5. If there are fewer than two noun chunks in the sentence, it returns a tuple of None values.

N.B.: Coding grammatical rules has its limitations. This project has chosen a straightforward method, which is just one of the many approaches to 
capturing relationships. You can explore alternative techniques, such as creating complete graphs or linear chains between noun chunks,
employing a deep learning approach, or utilizing more advanced NLP tools.
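
As one illustration of the linear-chain alternative mentioned above, the sketch below links each noun chunk to the next one and uses the tokens between them as the relation. It works on plain token lists; the function name `chain_triples`, the sample tokens, and the hand-written chunk offsets are assumptions made for this example (in the notebook, the chunks would come from `doc.noun_chunks`).

```python
def chain_triples(tokens, chunks):
    """Link consecutive noun chunks.

    chunks holds (start, end) token offsets, with end exclusive.
    Returns a list of (subject, object, relation) tuples.
    """
    triples = []
    for (s1, e1), (s2, e2) in zip(chunks, chunks[1:]):
        subject = " ".join(tokens[s1:e1])
        obj = " ".join(tokens[s2:e2])
        # The relation is whatever lies between the two chunks
        relation = " ".join(tokens[e1:s2])
        triples.append((subject, obj, relation))
    return triples


# Hypothetical sentence with hand-marked noun chunks
tokens = ["the", "dutch", "founded", "a", "trading", "post", "on", "manhattan"]
chunks = [(0, 2), (3, 6), (7, 8)]

for triple in chain_triples(tokens, chunks):
    print(triple)
```

For this sample, the chain produces two short triples, whereas pairing only the first and last noun chunks would produce a single triple with the longer relation "founded a trading post on".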



In [7]:
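def extract_relationship(sentence):
    """Return (subject, object, relation) for a sentence, or (None, None, None)."""
    doc = nlp(sentence)

    # Track the first and the last noun chunk in the sentence
    first, last = None, None
    for chunk in doc.noun_chunks:
        if first is None:
            first = chunk
        else:
            last = chunk

    # The relation is the text between the two noun chunks
    if first is not None and last is not None:
        relation = str(doc[first.end:last.start]).strip()
        return (first.text.strip(), last.text.strip(), relation)

    return (None, None, None)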

## Task 8: Create a Graph

In this task, you’ll plot the knowledge graph displaying the relationships. Typically, the subject and the object are nodes, connected by an 
edge that represents the relationship (usually a verb).

To complete the task, follow these steps:

1. Create an object of the Network class and pass the notebook=True argument to enable output embeddings.

2. Create an empty networkx graph.

3. Loop through all the sentences. For each sentence:

   - Add the subject and object as nodes.

   - Create an edge between the nodes with the relationship displayed as the title. The print_five_words() helper function is provided to break long titles into short lines.

4. Pass the networkx object to the from_nx() function of the Network object.

5. Display the graph using the show() function of the pyvis library.



In [9]:
# A helper function that splits text into lines of five words each for better readability.
def print_five_words(sentence):
    words = sentence.split()
    return '\n'.join(' '.join(words[i:i+5]) for i in range(0, len(words), 5))

In [10]:
# Parse the resolved text
graph_doc = nlp(resolved_data)

# Create an empty directed graph
nx_graph = nx.DiGraph()

for sent in graph_doc.sents:
    # Skip very short sentences
    if len(sent) > 3:
        subj, obj, relation = extract_relationship(sent.text)

        # Add nodes and edges to the graph
        if subj and obj:
            nx_graph.add_node(subj, size=5)
            nx_graph.add_node(obj, size=5)
            nx_graph.add_edge(subj, obj, weight=1, title=print_five_words(relation), arrows="to")

# Create a Network object, load the networkx graph, and display it
g = Network(notebook=True, cdn_resources='in_line')
g.from_nx(nx_graph)
g.show("example.html")

example.html


## Task 9: List the Related Entities

Now that the graph is created, you’ll extract information from it by passing a node and listing the edges to its neighboring nodes.

In [11]:
print(nx_graph.edges(['manhattan']))

[('manhattan', 'important universities'), ('manhattan', "the nation 's 360 largest counties"), ('manhattan', 'the second department')]


In [12]:
print(nx_graph.edges(['new york']))

[('new york', "the world 's largest natural harbors"), ('new york', 'the united states'), ('new york', 'the u.s'), ('new york', 'new york'), ('new york', 'asia'), ('new york', "new york 's broad- spectrum high technology sphere"), ('new york', 'public and commercial financial support'), ('new york', 'china'), ('new york', 'poll divisions'), ('new york', 'new jersey'), ('new york', 'a car')]


In [18]:
print(nx_graph.edges(['the harlem river']))

[('the harlem river', 'the bronx')]


In [19]:
print(nx_graph.edges(['queens']))

[('queens', 'the annual u.s'), ('queens', 'fort tilden'), ('queens', 'the world'), ('queens', 'professional programs')]


In [20]:
print(nx_graph.edges(['the world']))

[]
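
The empty list for 'the world' follows from the graph being directed: on a `DiGraph`, `edges()` reports only the edges that start at the given node, and 'the world' appears above only as a target. A small sketch of how incoming edges could be listed as well, using a toy graph built by hand for illustration (its node names are borrowed from the outputs above):

```python
import networkx as nx

# Toy directed graph; edges are added by hand for illustration
toy = nx.DiGraph()
toy.add_edge("queens", "the world")
toy.add_edge("queens", "fort tilden")

# edges() lists only the edges that start at the given node
out_edges = list(toy.edges(["the world"]))
print(out_edges)

# in_edges() lists the edges that end at the given node
in_edges = list(toy.in_edges(["the world"]))
print(in_edges)

# predecessors() and successors() return the neighboring nodes directly
preds = list(toy.predecessors("the world"))
succs = list(toy.successors("queens"))
print(preds, succs)
```

If direction does not matter for a lookup, `nx_graph.to_undirected()` returns an undirected copy in which `edges()` covers both directions.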
