# Introduction

This notebook builds on the previous two.  Here we will populate our database from Wikidata.  To do so, we start with the same starter paragraph from Wikipedia, use `spacy` to extract the named entities, and then use those entities to query Wikidata with a bot called `Pywikibot`.  (See the `README.md` for information on how to create the token you will need for this bot.)

The steps we will follow below are:

1. Get Wikipedia entry for the target
2. Use `spacy` to identify the named entities
3. Use `spacy` to clean the text of the named entities
4. Get the Q-codes for all entities (subjects)
5. For a given list of claims (verbs) associated with the subjects, get all targets (objects)
6. Connect to Neo4j
7. Get all P31 claims (_"instance of"_) for all nodes to create node labels
8. Add the nodes and properties to the graph
9. Add the edges to the graph

Once these steps are completed, we will then move to the next notebook where we will do some basic data science / machine learning.

In [1]:
%matplotlib inline

import json
import pprint
import re
import time
import urllib

from tqdm import tqdm

from neo4j import GraphDatabase

import numpy as np
import pandas as pd
import wikipedia

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

import pywikibot
from pywikibot.data import api

print(spacy.__version__)

3.0.3


In [2]:
non_nc = spacy.load('en_core_web_md')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('merge_noun_chunks')

print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'merge_noun_chunks']


# We start by getting the Wikipedia summary paragraph for our target search term, Barack Obama

In [3]:
text = wikipedia.summary('barack obama')
doc = nlp(text)
text

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician and attorney who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, Obama was the first African-American  president of the United States. He previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004.\nObama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. Obama received national attention 

# We can use `displacy` to visualize the named entities in the text, which will be the starting nodes for our graph.

Note that you will see some obvious errors below, but the named entity recognition (NER) model in `spacy` is still well suited for this task.

In [None]:
spacy.displacy.serve(doc, style='ent')




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...



# Let's review some of the detected entities

In [None]:
ent_ignore_ls = ['DATE']
ner_list = []

for el in doc.ents:
    if el.label_ not in ent_ignore_ls:
        #print(el, el.label_)
        if el.text not in ner_list:
            ner_list.append(el.text)

ner_list[0:5]

# Text cleaning

Even the extracted entities can contain dirty text, so we still want to remove special characters and stop words.  By the end of the next two cells, you can see our remaining list of named entities.  This will be our starter list for querying Wikidata.

In [None]:
def remove_special_characters(text):
    
    regex = re.compile(r'[\n\r\t]')
    clean_text = regex.sub(" ", text)
    
    return clean_text


def remove_stop_words_and_punct(text, print_text=False):
    
    result_ls = []
    rsw_doc = non_nc(text)
    
    for token in rsw_doc:
        if print_text:
            print(token, token.is_stop)
            print('--------------')
        if not token.is_stop and not token.is_punct and not token.is_space:
            result_ls.append(str(token))
    
    result_str = ' '.join(result_ls)

    return result_str

In [None]:
node_text_ls = []

for el in ner_list:
    clean_text = remove_special_characters(el)
    no_sw = remove_stop_words_and_punct(clean_text)
    if no_sw not in node_text_ls:
        node_text_ls.append(no_sw)

node_text_ls
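
To see what the two cleaning steps do without loading a spaCy model, here is a minimal stand-alone sketch. It assumes a tiny hand-picked stop-word set (`TOY_STOP_WORDS`) and a naive regex tokenizer, both hypothetical stand-ins for spaCy's `is_stop`, `is_punct`, and `is_space` checks, so results on real text will differ from the cells above.

```python
import re

# Hypothetical, tiny stop-word set; spaCy's real list is much larger.
TOY_STOP_WORDS = {"the", "a", "an", "of", "in", "and"}


def toy_remove_special_characters(text):
    # Same regex as above: newlines, carriage returns and tabs become spaces.
    return re.sub(r"[\n\r\t]", " ", text)


def toy_remove_stop_words_and_punct(text):
    # Naive tokenizer: runs of word characters are tokens; punctuation and
    # whitespace are dropped. This only approximates spaCy's tokenization.
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t.lower() not in TOY_STOP_WORDS]
    return " ".join(kept)


def toy_clean(entities):
    # Clean each entity and keep the first occurrence of each cleaned string.
    # Unlike the cell above, empty results are skipped (an added assumption).
    cleaned = []
    for ent in entities:
        result = toy_remove_stop_words_and_punct(toy_remove_special_characters(ent))
        if result and result not in cleaned:
            cleaned.append(result)
    return cleaned


print(toy_clean(["the United States", "United States", "the\nDemocratic Party"]))
```

The first two inputs collapse to the same string once the stop word is removed, which is why the de-duplication check runs on the cleaned text rather than on the raw entity.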

## These are some helper functions for interfacing with Wikidata

We are also establishing the bot connection to the site here.

In [None]:
def getItems(site, itemtitle):
    params = { 'action' :'wbsearchentities' , 'format' : 'json' , 'language' : 'en', 'type' : 'item', 'search': itemtitle}
    request = api.Request(site=site,**params)
    return request.submit()

def getItem(site, wdItem, token):
    request = api.Request(site=site,
                          action='wbgetentities',
                          format='json',
                          ids=wdItem)    
    return request.submit()

def prettyPrint(variable):
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(variable)

# Connect to Wikidata (see README.md for how to create the token file)
with open('.wiki_api_token') as f:
    token = f.read()
wikidata = pywikibot.Site('wikidata', 'wikidata')
site = wikidata  # same site object, kept under both names used below

# Confirmation that we are able to connect the bot to Wikidata

In [None]:
itempage = pywikibot.ItemPage(wikidata, "Q76")  # Q76 is Barack Obama
itempage

# Now we are going to start scraping Wikidata with our bot

First, we are going to take all of our named entities and identify them in Wikidata.  This is done by matching each entity to a Wikidata Q-code, which is the identifier Wikidata uses to index all entities.  As you will see, not all of the entities are found in Wikidata, likely because the text includes modifiers before the actual entity (ex: _Republican nominee_ John McCain).  But we will still be OK. :)

In [None]:
item_ls = []
i = 0

for el in node_text_ls:
    #itempage = pywikibot.ItemPage(wikidata, el)
    #print(el, itempage)
    wikidataEntries = getItems(site, el)
    try:
        tup = (wikidataEntries['search'][0]['id'], el)
        item_ls.append(tup)
    except (KeyError, IndexError):
        # No search hit came back for this entity text
        i += 1
        print('Missing entry', i, 'for', el)
    #item_ls.append(tup)
    
dedup_item_ls = []

for item in item_ls:
    if item not in dedup_item_ls:
        dedup_item_ls.append(item)
        
dedup_item_ls
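
The lookup above keeps only the first search hit for each entity. The sketch below shows that selection logic on hand-written dictionaries shaped like a `wbsearchentities` response (a `search` list whose entries carry an `id`). The sample payloads and the helper name `first_q_code` are illustrative assumptions, not captured API output.

```python
def first_q_code(response):
    # Return the Q-code of the top search hit, or None when there are no hits.
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


# Hand-written stand-ins for API responses (not real output).
fake_responses = {
    "Barack Hussein Obama II": {"search": [{"id": "Q76", "label": "Barack Obama"}]},
    "Republican nominee John McCain": {"search": []},
}

found, missing = [], []
for name, response in fake_responses.items():
    q_code = first_q_code(response)
    if q_code is None:
        missing.append(name)
    else:
        found.append((q_code, name))

print(found)
print(missing)
```

Taking the top hit is a simple heuristic: it trusts Wikidata's search ranking, so an ambiguous entity name can resolve to the wrong item.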

# How do we get the verbs?

In Wikidata, these are called "claims" or "statements", and each one is identified by a property ID (a P-value).  There are thousands of different P-values.  I have gone through and picked a set that I thought might be particularly interesting for this dataset.  This list should absolutely be customized to the application/graph.

### Note

This process can take several minutes, depending on the size of your starter list and the amount of traffic hitting Wikidata at any given time.  You might even hit timeout errors.  They will eventually resolve themselves.  Grab a cup of coffee.  For Barack Obama's entity list, this takes around 10-12 minutes or so.

In [None]:
%%time
p_dc = {'P6': 'head_of_government',
        'P17': 'country',
        'P19': 'place_of_birth',
        'P22': 'father',
        'P25': 'mother', 
        'P26': 'spouse',
        'P27': 'country_of_citizenship',
        'P30': 'continent',
        'P31': 'instance_of',
        'P35': 'head_of_state',
        'P36': 'capital',
        'P37': 'official_language',
        'P39': 'position_held',
        'P40': 'child',
        'P69': 'educated_at',
        'P101': 'field_of_work',
        'P102': 'member_of_political_party',
        'P106': 'occupation',
        'P108': 'employer',
        'P150': 'contains_administrative_territorial_entity',
        'P159': 'headquarters_location',
        'P166': 'award_received',
        'P172': 'ethnic_group',
        'P361': 'part_of',
        'P463': 'member_of',
        'P551': 'residence',
        'P607': 'conflict',
        'P793': 'significant_event',
        'P1344': 'participated_in',
        'P1813': 'short_name',
        'P1906': 'office_held_by_head_of_state',
        'P2388': 'office_held_by_head_of_the_organization',
        'P2670': 'has_parts_of_the_class'
       }

full_node_tup_ls = []

for el in tqdm(dedup_item_ls):  # de-duplicated list, so each Q-code is fetched once
    itempage = pywikibot.ItemPage(wikidata, el[0])
    itemdata = itempage.get()
    source_node = itemdata['labels']['en']
    #print(el, source_node)

    for key in p_dc.keys():
        #print(source_node, key, p_dc[key])
        #print(itemdata['claims'])
        try:
            for i in itemdata['claims'][key]:
                target = i.getTarget()
                #print(target.id)
                tup = (source_node, el[0], key, p_dc[key], target.labels['en'], target.id)
                if tup not in full_node_tup_ls:
                    full_node_tup_ls.append(tup)
        except Exception:
            # Property missing, or the target is not an item with an English label
            continue

#full_node_tup_ls
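
If timeouts interrupt the loop above, one option is to wrap each Wikidata call in a retry helper. This is a minimal sketch of retrying with exponential backoff. The names `with_retries` and `flaky_fetch` and all parameters are hypothetical; in the notebook you would pass something like `itempage.get` instead of the simulated call.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    # Call func(); on failure wait base_delay, 2*base_delay, ... and try again.
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return {"labels": {"en": "example"}}


delays = []
# Record the requested delays instead of actually sleeping.
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as an argument keeps the helper easy to try out without waiting; with the default `time.sleep` it would pause between attempts.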

In [None]:
df = pd.DataFrame(full_node_tup_ls, columns=['source_name', 'source_q', 'rel_p', 'rel_name', 'target_name', 'target_q'])
df.head()

In [None]:
df.shape

# Connecting to Neo4j

As before, we will connect to Neo4j with the usual class.  We will also set up a uniqueness constraint on the Q-value stored in each node's `id` property, since this has many potential benefits, particularly as the graph gets larger.

In [None]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [None]:
# If you are using a Sandbox instance, you will want to use the following (commented) line.  
# If you are using a Docker container for your DB, use the uncommented line.
# conn = Neo4jConnection(uri="bolt://some_ip_address:7687", user="neo4j", pwd="some_password")

conn = Neo4jConnection(uri="bolt://neo4j:7687", user="neo4j", pwd="kgDemo")

In [None]:
conn.query('CREATE CONSTRAINT q_value IF NOT EXISTS ON (n:Node) ASSERT n.id IS UNIQUE')

In [None]:
source_df = df[['source_name', 'source_q']].drop_duplicates()
source_df.columns = ['name', 'id']
target_df = df[['target_name', 'target_q']].drop_duplicates()
target_df.columns = ['name', 'id']
all_nodes_df = pd.concat([source_df, target_df]).drop_duplicates()
all_nodes_df.shape

# Some helper functions

The functions below are responsible for populating the graph.  To make the graph richer, we want to give each node a descriptive label.  We will use the Wikidata claim _"instance of"_ (P31) for this.  So, for example, Barack Obama is an instance of a human whereas the United States is an instance of a "sovereign state."

In [None]:
def get_p31(row):
    # P31 corresponds to "instance of"
    
    itempage = pywikibot.ItemPage(wikidata, row)
    itemdata = itempage.get()
    try:
        target = itemdata['claims']['P31'][0].getTarget()
        target.get()
        return target.labels['en']
    except Exception:
        # No P31 claim, or the target has no English label
        return 'Unknown'
    

def add_nodes(rows, batch_size=10000):
    # Adds nodes to the Neo4j graph as a batch job.

    query = '''UNWIND $rows AS row
               MERGE (:Node {name: row.name, id: row.id, type: row.node_label})
               RETURN count(*) as total
    '''
    return insert_data(query, rows, batch_size)


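def add_edges(rows, batch_size=50000):
    # Adds edges of a single relationship type to the Neo4j graph as a batch job.
    # The type is read from the rows themselves (all rows must share one rel_name)
    # instead of relying on a global variable set by the calling loop.
    edge = rows['rel_name'].iloc[0]

    query = """UNWIND $rows AS row
               MATCH (src:Node {id: row.source_q}), (tar:Node {id: row.target_q})
               CREATE (src)-[:%s]->(tar)
               RETURN count(*) as total
    """ % edge
    
    return insert_data(query, rows, batch_size)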


def insert_data(query, rows, batch_size = 10000):
    # Function to handle the updating the Neo4j database in batch mode.

    total = 0
    batch = 0
    start = time.time()
    result = None

    while batch * batch_size < len(rows):

        res = conn.query(query, parameters={'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        try:
            total += res[0]['total']
        except (TypeError, IndexError, KeyError):
            # Query failed (res is None) or returned no 'total' column
            total += 0
        batch += 1
        result = {"total":total, "batches":batch, "time":time.time()-start}
        print(result)

    return result
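
The `while` loop in `insert_data` walks the rows in fixed-size windows. As a small illustration of that slicing arithmetic on a plain list (the helper name `iter_batches` is hypothetical and not part of the notebook):

```python
def iter_batches(rows, batch_size):
    # Yield consecutive slices of at most batch_size rows, like insert_data does.
    batch = 0
    while batch * batch_size < len(rows):
        yield rows[batch * batch_size:(batch + 1) * batch_size]
        batch += 1


rows = [{"id": f"Q{i}"} for i in range(7)]
sizes = [len(chunk) for chunk in iter_batches(rows, batch_size=3)]
print(sizes)
```

The last batch is simply shorter when the row count is not a multiple of the batch size, and an empty input produces no batches at all.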

In [None]:
%%time
all_nodes_df['node_label'] = all_nodes_df['id'].map(get_p31)
all_nodes_df.head()

In [None]:
add_nodes(all_nodes_df)

In [None]:
edge_ls = df['rel_name'].unique().tolist()
#edge_ls

In [None]:
for edge in edge_ls:
    print(edge)
    y = df[df['rel_name'] == edge]
    #print(y.shape)
    add_edges(y)
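
The loop above issues one query per relationship name because the relationship type is spliced into the Cypher text rather than passed as a `$` parameter. Since that is string interpolation, it is worth checking the name before building the query. A minimal sketch, assuming relationship names should look like plain identifiers (`build_edge_query` and the regex are illustrative choices, not part of the notebook):

```python
import re

# Assumption: valid relationship names are letters, digits and underscores,
# not starting with a digit (true for every value in p_dc above).
_REL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_edge_query(rel_name):
    # Refuse anything that could change the structure of the Cypher statement.
    if not _REL_NAME.fullmatch(rel_name):
        raise ValueError(f"unsafe relationship name: {rel_name!r}")
    return (
        "UNWIND $rows AS row\n"
        "MATCH (src:Node {id: row.source_q}), (tar:Node {id: row.target_q})\n"
        f"CREATE (src)-[:{rel_name}]->(tar)"
    )


print(build_edge_query("place_of_birth"))
```

The row data itself still travels through the `$rows` parameter, so only the relationship name needs this kind of check.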

In [None]:
y = all_nodes_df['node_label'].value_counts()
print(y[0:5])

# Conclusion

At this point we have populated our database.  You should get 1312 nodes and 1622 relationships (once deduping in Cypher is completed and all nodes are attributed to the proper labels determined by P31).  We will do some things in Cypher (see `cypher_queries/queries.cql` and follow along with the "Method 2" section).  Once those are done, we can proceed to the final notebook where we will show how to do some basic ML on the graph.