An Introduction to Natural Language Processing with Python for SEOs

We are going to learn practical NLP while building a simple knowledge graph from scratch. 

We are going to extract useful facts automatically from the Search Engine Journal (SEJ) XML sitemaps.

To keep things simple and fast, we will pull the article headlines from the URLs in the XML sitemaps.

We will extract entities of interest and their relationships from the headlines.

Finally, we will build a powerful knowledge graph and visualize the most popular relationships.

Here is the technical plan:

1. We will fetch all SEJ XML sitemaps
2. We will parse the URLs to extract the headlines from their slugs
3. We will extract entity pairs from the headlines 
4. We will extract corresponding relationships
5. We will build a knowledge graph and a simple form to visualize specific relationships

Resources to learn more



## Fetching all SEJ XML sitemaps

We recently did a webinar with Elias from The Money Supermarket and learned about his wonderful Python library for marketers: advertools.

Some of my older articles no longer work with newer library versions. Elias gave me a good idea: print the library versions, so it is easier to reproduce a setup where the code works.

In [1]:
%%capture
!pip install advertools

In [2]:
import advertools as adv
print(adv.__version__)

0.13.5


In [3]:
sitemap_url = "https://www.searchenginejournal.com/sitemap_index.xml" #@param {type:"string"}


Let's download the sitemap index

In [4]:
df = adv.sitemap_to_df(sitemap_url)

2024-02-15 12:31:47,438 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap3.xml
2024-02-15 12:31:47,446 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap2.xml
2024-02-15 12:31:47,461 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap.xml
2024-02-15 12:31:47,487 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap4.xml
2024-02-15 12:31:47,496 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap7.xml
2024-02-15 12:31:47,565 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap8.xml
2024-02-15 12:31:47,593 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchenginejournal.com/post-sitemap6.xml
2024-02-15 12:31:47,618 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.searchengin

One cool feature of the package is that it downloads every sitemap linked from the index and returns all of them in a single dataframe.
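
To make that feature a little less magical, here is a rough standard-library sketch of the first step: reading a sitemap index and listing the child sitemaps it links to. The XML string is a hand-written sample in the sitemaps.org index format (the two URLs are taken from the log above), and `list_child_sitemaps` is a hypothetical helper. This is not how advertools implements `sitemap_to_df`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the sitemaps.org index format (not fetched from the site)
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.searchenginejournal.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://www.searchenginejournal.com/post-sitemap2.xml</loc></sitemap>
</sitemapindex>"""

# Sitemap files use this XML namespace, so lookups must be namespace-aware
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_child_sitemaps(index_xml):
    """Return the <loc> URL of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


child_sitemaps = list_child_sitemaps(SAMPLE_INDEX)
print(child_sitemaps)
```

Each child sitemap would then be downloaded and parsed in the same way to collect the page URLs.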

---



In [5]:
df

Unnamed: 0,loc,lastmod,image,image_loc,sitemap,sitemap_last_modified,sitemap_size_mb,download_date,news,news_publication,publication_name,publication_language,news_publication_date,news_title
0,https://www.searchenginejournal.com/dick-chene...,2006-02-13 04:24:36+00:00,,,https://www.searchenginejournal.com/post-sitem...,2024-02-15 10:29:51+00:00,0.155574,2024-02-15 10:31:47.462626+00:00,,,,,,
1,https://www.searchenginejournal.com/internet-e...,2006-02-13 06:02:56+00:00,,,https://www.searchenginejournal.com/post-sitem...,2024-02-15 10:29:51+00:00,0.155574,2024-02-15 10:31:47.462626+00:00,,,,,,
2,https://www.searchenginejournal.com/launch-fir...,2006-02-13 06:55:50+00:00,,,https://www.searchenginejournal.com/post-sitem...,2024-02-15 10:29:51+00:00,0.155574,2024-02-15 10:31:47.462626+00:00,,,,,,
3,https://www.searchenginejournal.com/search-eng...,2006-02-13 07:32:03+00:00,,,https://www.searchenginejournal.com/post-sitem...,2024-02-15 10:29:51+00:00,0.155574,2024-02-15 10:31:47.462626+00:00,,,,,,
4,https://www.searchenginejournal.com/targetcom-...,2006-02-13 15:33:15+00:00,,,https://www.searchenginejournal.com/post-sitem...,2024-02-15 10:29:51+00:00,0.155574,2024-02-15 10:31:47.462626+00:00,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24206,https://www.searchenginejournal.com/author/ann...,2022-08-20 13:44:45+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
24207,https://www.searchenginejournal.com/author/bre...,2022-08-20 13:44:45+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
24208,https://www.searchenginejournal.com/author/cla...,2022-08-20 13:44:45+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
24209,https://www.searchenginejournal.com/author/adc...,2022-08-20 13:44:45+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,


Look how simple it is to filter the articles and pages modified this year. We have 1550 articles.

In [6]:
df[df["lastmod"] > '2023-12-31']

Unnamed: 0,loc,lastmod,image,image_loc,sitemap,sitemap_last_modified,sitemap_size_mb,download_date,news,news_publication,publication_name,publication_language,news_publication_date,news_title
1899,https://www.searchenginejournal.com/,2024-02-14 21:39:34+00:00,,,https://www.searchenginejournal.com/post-sitem...,2024-02-15 09:55:52+00:00,0.134017,2024-02-15 10:31:47.480078+00:00,,,,,,
18449,https://www.searchenginejournal.com/tree-of-th...,2024-01-01 05:55:28+00:00,\n\t\t\t,https://www.searchenginejournal.com/wp-content...,https://www.searchenginejournal.com/post-sitem...,2024-02-15 09:45:55+00:00,0.390761,2024-02-15 10:31:49.167702+00:00,,,,,,
18450,https://www.searchenginejournal.com/john-muell...,2024-01-01 08:36:10+00:00,\n\t\t\t,https://www.searchenginejournal.com/wp-content...,https://www.searchenginejournal.com/post-sitem...,2024-02-15 09:45:55+00:00,0.390761,2024-02-15 10:31:49.167702+00:00,,,,,,
18451,https://www.searchenginejournal.com/stock-phot...,2024-01-01 08:41:16+00:00,\n\t\t\t,https://www.searchenginejournal.com/wp-content...,https://www.searchenginejournal.com/post-sitem...,2024-02-15 09:45:55+00:00,0.390761,2024-02-15 10:31:49.167702+00:00,,,,,,
18452,https://www.searchenginejournal.com/google-on-...,2024-01-01 08:49:50+00:00,\n\t\t\t,https://www.searchenginejournal.com/wp-content...,https://www.searchenginejournal.com/post-sitem...,2024-02-15 09:45:55+00:00,0.390761,2024-02-15 10:31:49.167702+00:00,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23215,https://www.searchenginejournal.com/author/lee...,2024-01-29 06:30:34+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
23216,https://www.searchenginejournal.com/author/ale...,2024-01-18 12:22:51+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
23217,https://www.searchenginejournal.com/author/and...,2024-01-04 12:45:29+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
23218,https://www.searchenginejournal.com/author/don...,2024-01-04 12:38:18+00:00,,,https://www.searchenginejournal.com/author-sit...,2024-02-15 10:06:16+00:00,0.126728,2024-02-15 10:31:51.013028+00:00,,,,,,
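
One detail in the filter is worth a second look. Assuming the string `'2023-12-31'` is read as midnight at the start of that day, `>` also lets through pages modified during December 31, 2023. The standard-library sketch below shows the difference on three timestamps: the first one is made up for illustration, and the other two are taken from the outputs above.

```python
from datetime import datetime, timezone

lastmods = [
    datetime.fromisoformat("2023-12-31 18:00:00+00:00"),  # made up: last day of 2023
    datetime.fromisoformat("2024-01-01 05:55:28+00:00"),  # from the filtered output
    datetime.fromisoformat("2022-08-20 13:44:45+00:00"),  # from the author rows
]

# Assumption: this is the instant the string '2023-12-31' stands for
loose_cutoff = datetime(2023, 12, 31, tzinfo=timezone.utc)
strict_cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)

loose = [d for d in lastmods if d > loose_cutoff]     # keeps the Dec 31 row
strict = [d for d in lastmods if d >= strict_cutoff]  # 2024 rows only

print(len(loose), len(strict))
```

In the notebook, `df[df["lastmod"] >= '2024-01-01']` would be the stricter version of the same filter.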


In [7]:
df[df["lastmod"] > '2020-01-01'][["loc", "lastmod"]].to_csv("latest_articles.csv", index=False)

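We can export this list to a CSV file. I will include only the URL and the last modification timestamp. Note that the export uses a wider window than the filter above: everything modified after January 1, 2020.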

If you click on the file folder to the left, you will see the file and you can right click to download it.

## Extract headlines from the URLs

The advertools library has a function to split URLs into their components (`url_to_df`), but let's do it manually to get familiar with the process.

In [8]:
example_url="https://www.searchenginejournal.com/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/" #@param {type:"string"} 


In [9]:
from urllib.parse import urlparse
import re

In [10]:
u = urlparse(example_url)

In [11]:
u

ParseResult(scheme='https', netloc='www.searchenginejournal.com', path='/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/', params='', query='', fragment='')

Here we get a named tuple, ```ParseResult```, with a breakdown of the URL components.

We are interested in the ```path```.

We are going to use a simple regex to split it on the `/` and `-` characters.

In [12]:
slug = re.split("[/-]", u.path)

slug

['',
 'google',
 'be',
 'careful',
 'relying',
 'on',
 '3rd',
 'parties',
 'to',
 'render',
 'website',
 'content',
 '376547',
 '']

Next, we can convert it back to a string

In [13]:
headline = " ".join(slug)
headline

' google be careful relying on 3rd parties to render website content 376547 '

The slugs contain a page identifier that is useless for us. We will remove it with a regex.

In [14]:
headline = re.sub(r"\d{6}", "", headline)  # raw string avoids an invalid-escape warning
headline


' google be careful relying on 3rd parties to render website content  '

Next, strip the whitespace at both ends.

In [15]:
headline = headline.strip()
headline

'google be careful relying on 3rd parties to render website content'

In [16]:
# re.match anchors at the start of the string; "autho" does not match, so nothing is printed
if re.match("author|category", "autho stephen kenwright"):
  print("Ok")

We can convert this code to a function and create a new column in our dataframe

In [17]:
def get_headline(url):

  u = urlparse(url)

  if len(u.path) > 1:

    slug = re.split("[/-]", u.path)

    new_headline = re.sub(r"\d{6}", "", " ".join(slug)).strip()

    #skip author and category pages
    if not re.match("author|category", new_headline):
      return new_headline

  return ""




In [18]:
#let's test it
get_headline(example_url)

'google be careful relying on 3rd parties to render website content'
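
The `\d{6}` pattern assumes six-digit page IDs. The headline list printed later in this notebook includes `facebook pages to redesign good for the advertiser 8875`, which suggests some IDs are shorter. The sketch below is a hypothetical variant, `headline_from_url`, that drops a trailing all-digit path segment of any length. The second test URL is my guess at the original URL shape, reconstructed from that headline, and the author URL is a placeholder.

```python
from urllib.parse import urlparse


def headline_from_url(url):
    """Hypothetical variant of get_headline (not the notebook's function)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Drop the last path segment only when it is all digits (the page ID)
    if segments and segments[-1].isdigit():
        segments.pop()
    # Skip the home page plus author and category pages
    if not segments or segments[0] in ("author", "category"):
        return ""
    words = []
    for segment in segments:
        words.extend(w for w in segment.split("-") if w)
    return " ".join(words)


print(headline_from_url(
    "https://www.searchenginejournal.com/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/"
))
# Assumed URL shape, reconstructed from the headline list
print(headline_from_url(
    "https://www.searchenginejournal.com/facebook-pages-to-redesign-good-for-the-advertiser/8875/"
))
# Placeholder author URL
print(repr(headline_from_url("https://www.searchenginejournal.com/author/example-author/")))
```

Checking the first path segment, rather than matching `author|category` against the start of the headline, also avoids skipping an article whose slug merely begins with one of those words.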

In [19]:
import pandas as pd

#skip the home page and headers
new_df = pd.read_csv("latest_articles.csv", names=["url", "lastmod"], skiprows=2)

In [20]:
new_df

Unnamed: 0,url,lastmod
0,https://www.searchenginejournal.com/google-how...,2020-01-06 11:23:40+00:00
1,https://www.searchenginejournal.com/track-offl...,2020-01-07 12:45:50+00:00
2,https://www.searchenginejournal.com/do-ugly-si...,2020-01-06 21:55:09+00:00
3,https://www.searchenginejournal.com/youtubes-c...,2020-01-07 04:53:50+00:00
4,https://www.searchenginejournal.com/googles-jo...,2020-01-07 11:16:20+00:00
...,...,...
8331,https://www.searchenginejournal.com/author/ann...,2022-08-20 13:44:45+00:00
8332,https://www.searchenginejournal.com/author/bre...,2022-08-20 13:44:45+00:00
8333,https://www.searchenginejournal.com/author/cla...,2022-08-20 13:44:45+00:00
8334,https://www.searchenginejournal.com/author/adc...,2022-08-20 13:44:45+00:00


Let's create a new column named headline

In [21]:
new_df["url"]

0       https://www.searchenginejournal.com/google-how...
1       https://www.searchenginejournal.com/track-offl...
2       https://www.searchenginejournal.com/do-ugly-si...
3       https://www.searchenginejournal.com/youtubes-c...
4       https://www.searchenginejournal.com/googles-jo...
                              ...                        
8331    https://www.searchenginejournal.com/author/ann...
8332    https://www.searchenginejournal.com/author/bre...
8333    https://www.searchenginejournal.com/author/cla...
8334    https://www.searchenginejournal.com/author/adc...
8335    https://www.searchenginejournal.com/author/bre...
Name: url, Length: 8336, dtype: object

This will run the function on each URL.

In [22]:
new_df["url"].apply(lambda x: get_headline(x))

0                          google how to seo podcast site
1         track offline conversions microsoft advertising
2                                      do ugly sites sell
3       youtubes coppa changes begin today possibly af...
4       googles john mueller our seos have it harder t...
                              ...                        
8331                                                     
8332                                                     
8333                                                     
8334                                                     
8335                                                     
Name: url, Length: 8336, dtype: object

In [23]:
new_df["headline"] = new_df["url"].apply(lambda x: get_headline(x))

In [24]:
new_df

Unnamed: 0,url,lastmod,headline
0,https://www.searchenginejournal.com/google-how...,2020-01-06 11:23:40+00:00,google how to seo podcast site
1,https://www.searchenginejournal.com/track-offl...,2020-01-07 12:45:50+00:00,track offline conversions microsoft advertising
2,https://www.searchenginejournal.com/do-ugly-si...,2020-01-06 21:55:09+00:00,do ugly sites sell
3,https://www.searchenginejournal.com/youtubes-c...,2020-01-07 04:53:50+00:00,youtubes coppa changes begin today possibly af...
4,https://www.searchenginejournal.com/googles-jo...,2020-01-07 11:16:20+00:00,googles john mueller our seos have it harder t...
...,...,...,...
8331,https://www.searchenginejournal.com/author/ann...,2022-08-20 13:44:45+00:00,
8332,https://www.searchenginejournal.com/author/bre...,2022-08-20 13:44:45+00:00,
8333,https://www.searchenginejournal.com/author/cla...,2022-08-20 13:44:45+00:00,
8334,https://www.searchenginejournal.com/author/adc...,2022-08-20 13:44:45+00:00,


In [25]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8336 entries, 0 to 8335
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   url       8336 non-null   object
 1   lastmod   8336 non-null   object
 2   headline  8336 non-null   object
dtypes: object(3)
memory usage: 195.5+ KB


## Extracting named entities



Let's get our list of headlines and join them into a single text document.

In [26]:
new_df["headline"].tolist()[:10]

['google how to seo podcast site',
 'track offline conversions microsoft advertising',
 'do ugly sites sell',
 'youtubes coppa changes begin today possibly affecting creator revenue',
 'googles john mueller our seos have it harder than others',
 'seo struggles',
 'challenge google ads trademark disputes',
 'best amp wordpress plugins',
 'chrome push notification blocker',
 'google assistant now has 500 million users worldwide']

In [27]:
text = "\n".join([x for x in new_df["headline"].tolist() if len(x) > 0])

In [28]:
with open("text.txt", "w") as f:
  f.write(text)

In [29]:
print(text)

google how to seo podcast site
track offline conversions microsoft advertising
do ugly sites sell
youtubes coppa changes begin today possibly affecting creator revenue
googles john mueller our seos have it harder than others
seo struggles
challenge google ads trademark disputes
best amp wordpress plugins
chrome push notification blocker
google assistant now has 500 million users worldwide
track organic conversions
seo reports which metrics matter how to use them well
google optimization score ppc
twitter rolls out a new ad unit in the explore tab
b2b paid advertising greg finn podcast
instagram user growth drops to lower levels than previously expected
female marketers
facebook pages to redesign good for the advertiser 8875
quality raters guidelines ranking signals
stepping into the unknown
facebook is preparing to launch a desktop redesign to all users
duckduckgo is now a default search engine option on android in the eu
actionable tasking seo answer important questions
biggest link b

## Introducing spaCy

https://spacy.io/usage/spacy-101

https://spacy.io/usage/examples

https://spacy.io/api/annotation

In [30]:
!pip install spacy



In [31]:
import spacy
from spacy import displacy

In [32]:
# !python -m spacy download en
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [33]:
nlp = spacy.load("en_core_web_sm")


In [34]:
doc = nlp(text)


In [35]:
displacy.render(doc, style="ent", jupyter=True)

## Building a Knowledge Graph

1. https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/

2. https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/?utm_source=blog&utm_medium=how-to-build-knowledge-graph-text-using-spacy
3. https://www.kaggle.com/shivamb/spacy-text-meta-features-knowledge-graphs



Let’s get the dependency tags 

In [36]:
for tok in doc[:100]:
  print(tok.text, "...", tok.dep_)

google ... nsubj
how ... advmod
to ... aux
seo ... nsubj
podcast ... compound
site ... compound

 ... dep
track ... compound
offline ... amod
conversions ... dobj
microsoft ... compound
advertising ... dobj

 ... dep
do ... aux
ugly ... amod
sites ... nsubj
sell ... ccomp

 ... dep
youtubes ... compound
coppa ... compound
changes ... dobj
begin ... conj
today ... npadvmod
possibly ... advmod
affecting ... advcl
creator ... compound
revenue ... compound

 ... dep
googles ... dobj
john ... compound
mueller ... dobj
our ... poss
seos ... nsubj
have ... ccomp
it ... nsubj
harder ... ccomp
than ... mark
others ... compound

 ... dep
seo ... nsubj
struggles ... poss

 ... dep
challenge ... compound
google ... compound
ads ... compound
trademark ... compound
disputes ... dobj

 ... dep
best ... amod
amp ... nmod
wordpress ... amod
plugins ... compound

 ... dep
chrome ... compound
push ... compound
notification ... compound
blocker ... compound

 ... dep
google ... compound
assistant ... nsub

The rule can be something like this: extract the subject and the object along with their modifiers and compound words, skipping any punctuation marks between them.

In [37]:
from spacy.matcher import Matcher 
from spacy.tokens import Span 
import networkx as nx

import matplotlib.pyplot as plt
from tqdm import tqdm

To build a knowledge graph, the most important things are the nodes and the edges between them.

The main idea is to go through a sentence and extract the subject and the object as they are encountered.

In [38]:
def get_entities(sent):
  ## chunk 1
  ent1 = ""
  ent2 = ""

  prv_tok_dep = ""    # dependency tag of previous token in the sentence
  prv_tok_text = ""   # previous token in the sentence

  prefix = ""
  modifier = ""

  #############################################################
  
  for tok in nlp(sent):
    ## chunk 2
    # if token is a punctuation mark then move on to the next token
    if tok.dep_ != "punct":
      # check: token is a compound word or not
      if tok.dep_ == "compound":
        prefix = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          prefix = prv_tok_text + " "+ tok.text
      
      # check: token is a modifier or not
      if tok.dep_.endswith("mod"):
        modifier = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          modifier = prv_tok_text + " "+ tok.text
      
      ## chunk 3
      if "subj" in tok.dep_:  # nsubj, nsubjpass, csubj, csubjpass
        ent1 = modifier +" "+ prefix + " "+ tok.text
        prefix = ""
        modifier = ""
        prv_tok_dep = ""
        prv_tok_text = ""      

      ## chunk 4
      if "obj" in tok.dep_:  # dobj, pobj
        ent2 = modifier +" "+ prefix +" "+ tok.text
        
      ## chunk 5  
      # update variables
      prv_tok_dep = tok.dep_
      prv_tok_text = tok.text
  #############################################################

  return [ent1.strip(), ent2.strip()]

In [39]:
get_entities("the film had 200 patents")


['film', '200  patents']
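
The double space in `'200  patents'` comes from joining an empty `prefix` between the modifier and the token. The sketch below replays the same rule on hand-written `(text, dependency tag)` pairs, so it runs without spaCy. The tags are my assumption of what the parser assigns to this sentence, and `trace_entities` and `squeeze` are hypothetical helper names.

```python
def trace_entities(tagged_tokens):
    """Replay the get_entities rule on (text, dependency tag) pairs."""
    ent1 = ent2 = prefix = modifier = ""
    prev_dep = prev_text = ""
    for text, dep in tagged_tokens:
        if dep == "punct":
            continue
        if dep == "compound":
            # chain consecutive compound words
            prefix = f"{prev_text} {text}" if prev_dep == "compound" else text
        if dep.endswith("mod"):
            modifier = f"{prev_text} {text}" if prev_dep == "compound" else text
        if "subj" in dep:
            ent1 = f"{modifier} {prefix} {text}"
            prefix = modifier = ""
        if "obj" in dep:
            ent2 = f"{modifier} {prefix} {text}"
        prev_dep, prev_text = dep, text
    return [ent1.strip(), ent2.strip()]


def squeeze(entity):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(entity.split())


# Assumed tags for "the film had 200 patents" (hand-written, not parser output)
tagged = [("the", "det"), ("film", "nsubj"), ("had", "ROOT"),
          ("200", "nummod"), ("patents", "dobj")]

raw_pair = trace_entities(tagged)
clean_pair = [squeeze(e) for e in raw_pair]
print(raw_pair)
print(clean_pair)
```

Applying something like `squeeze` to each entity before building the graph keeps `'200  patents'` and `'200 patents'` from becoming two different nodes.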

In [40]:
for t in [x for x in new_df["headline"].tolist() if len(x) > 0][:100]:
  print(get_entities(t))

['', 'how podcast site']
['', '']
['ugly  sites', '']
['youtubes coppa changes', 'possibly creator revenue']
['googles john seos', 'harder  others']
['seo', '']
['', '']
['', '']
['', 'notification blocker']
['google assistant', '500 million 500 users']
['', 'organic  conversions']
['seo metrics', 'how  them']
['', '']
['twitter', 'new explore tab']
['b2b', 'advertising greg']
['instagram user growth', 'lower  levels']
['', '']
['', 'facebook advertiser']
['', 'ranking quality raters signals']
['', 'unknown']
['facebook', 'desktop users']
['duckduckgo', 'now search engine eu']
['tasking  seo', 'important  questions']
['', '']
['', 'how  it']
['', '']
['', '']
['google ads', 'smart simulator bidding']
['', '']
['', 'python seo spreadsheets']
['', '']
['', '']
['2020 core update', '']
['parallel  tracking', 'video campaigns']
['pinterest', 'users']
['', 'organic google serps']
['', '']
['seos', 'high linkedins data']
['google', 'new search results']
['tony  wright', 'better seo podcast']

In [41]:
entity_pairs = []

for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0]):
  entity_pairs.append(get_entities(i))

100%|██████████████████████████████████████| 7173/7173 [00:15<00:00, 474.08it/s]


In [42]:
entity_pairs[10:20]

[['', 'organic  conversions'],
 ['seo metrics', 'how  them'],
 ['', ''],
 ['twitter', 'new explore tab'],
 ['b2b', 'advertising greg'],
 ['instagram user growth', 'lower  levels'],
 ['', ''],
 ['', 'facebook advertiser'],
 ['', 'ranking quality raters signals'],
 ['', 'unknown']]

Our hypothesis is that the predicate is actually the main verb in a sentence.

In [43]:
def get_relation(sent):
    doc = nlp(sent)

    # Matcher class object
    matcher = Matcher(nlp.vocab)

    # define the pattern: the ROOT token, optionally followed by
    # a preposition, an agent, and an adjective
    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': "?"},
               {'DEP': 'agent', 'OP': "?"},
               {'POS': 'ADJ', 'OP': "?"}]

    matcher.add("matching_1", patterns=[pattern])

    matches = matcher(doc)
    # keep the last match returned by the matcher
    k = len(matches) - 1

    span = doc[matches[k][1]:matches[k][2]]

    return span.text

In [44]:
get_relation("John completed the task")


'completed'

In [None]:
relations = [get_relation(i) for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0])]


 25%|█████████▋                            | 1828/7173 [00:04<00:11, 458.69it/s]

In [None]:
entity_pairs[10:20]

In [None]:
relations[10:20]

In [None]:
pd.Series(relations).value_counts()[4:50]

In [None]:
!pip install --upgrade matplotlib networkx

Let's build the knowledge graph

In [None]:
# extract subject
source = [i[0] for i in entity_pairs]

# extract object
target = [i[1] for i in entity_pairs]

kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})
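
Many of the extracted pairs have an empty subject or object, and every one of those would attach to a single blank node in the graph. Here is a small sketch of one way to drop them, using plain lists. The pairs come from the `entity_pairs[10:20]` output above; the relation strings are placeholders I made up, since the real ones are not shown, and `complete_triples` is a hypothetical helper.

```python
sample_pairs = [
    ['', 'organic  conversions'],
    ['seo metrics', 'how  them'],
    ['', ''],
    ['twitter', 'new explore tab'],
]
# Placeholder relations, invented for illustration only
sample_relations = ["track", "matter", "marketers", "rolls"]


def complete_triples(pairs, relations):
    """Keep (source, edge, target) triples where both entities are non-empty."""
    triples = []
    for (source, target), edge in zip(pairs, relations):
        if source and target:
            # Collapse repeated spaces left over from joining empty parts
            triples.append((" ".join(source.split()), edge, " ".join(target.split())))
    return triples


triples = complete_triples(sample_pairs, sample_relations)
print(triples)
```

In the notebook, the same filter could be applied directly to the dataframe with `kg_df[(kg_df['source'] != '') & (kg_df['target'] != '')]`.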

In [None]:
# create a directed-graph from a dataframe
G=nx.from_pandas_edgelist(kg_df, "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

# G is the graph built in the previous cell
fig, ax = plt.subplots(figsize=(12, 12))

# Calculate node positions
pos = nx.spring_layout(G)

# Draw the graph with specified node positions, and edge colormap
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos, ax=ax)

plt.show()

With thousands of edges, the full graph is hard to read, so let's use only a few important relations to visualize a graph. I will take one relation at a time.

In [None]:
def display_graph(relation):
    G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == relation], "source", "target", 
                                edge_attr=True, create_using=nx.MultiDiGraph())
    
    fig, ax = plt.subplots(figsize=(12, 12))  # Create a new figure and axis
    pos = nx.spring_layout(G, k=0.5)  # k regulates the distance between nodes
    
    # Draw the graph with specified node positions, and edge colormap
    nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos=pos, ax=ax)
    
    plt.show()

Now, when I run ```display_graph("launches")```, I get the graph at the beginning of the article.

In [None]:
display_graph("launches")

In [None]:
display_graph("introduces new")

In [None]:
display_graph("reveals")

In [None]:
display_graph("explains")

In [None]:
display_graph("improve")

In [None]:
display_graph("expands")

In [None]:
pd.Series(relations).value_counts()[4:50].to_dict()


In [None]:
relation_list = list(pd.Series(relations).value_counts()[4:50].to_dict().keys())
relation_list


In [None]:
relation_choice = "are"  #@param ['launches', 'is',  'on', 'makes', 'adds', 'improve', 'explains', 'optimize','get','use','update','introduces', 'expands','says','find','are','reveals','introduces new', 'make', 'know' ]

display_graph(relation_choice)

In [None]:
display_graph('moved')