<a href="https://colab.research.google.com/github/bcollister01/plb_linkee/blob/main/Linkee_Keyword_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Input phrases we are testing: Tom Hanks, Las Vegas, Olympics, Harry Potter, Cereal, Kitchen, Busted, Trees, Insurance companies, Hurricanes 

In [None]:
%%capture
!pip install wikipedia
!pip install yake
!pip install --upgrade ecommercetools
!pip install pattern
!pip install textacy

# Import packages
import wikipedia
import re
import yake
import nltk
import numpy as np
import pandas as pd
from ecommercetools import seo
from pattern.text.en import singularize, pluralize
#For Google Knowledge Graph API
import requests
import urllib.parse
import json
from requests_html import HTML
from requests_html import HTMLSession
#if scraping paragraphs from first few webpages
from bs4 import BeautifulSoup
#For question generation
import spacy
import textacy
!python -m spacy download en_core_web_sm
nltk.download('averaged_perceptron_tagger')
nlp = spacy.load("en_core_web_sm")

# **Creating mini functions to split up tasks**

Most of the functions below are helpers that fetch text or clean up our results. The two main ones are find_text and keyword_extract.

## Functions for finding all versions of words

We want to make sure that we don't have two very similar keywords e.g. sweet and sweets. So we'll need to check keywords against singular/plural forms.

In [None]:
def ending_pluralize(noun):
  '''Return most appropriate plural of input word.'''
  if re.search('[sxz]$', noun):
      return re.sub('$', 'es', noun)
  elif re.search('[^aeioudgkprt]h$', noun):
      return re.sub('$', 'es', noun)
  elif re.search('y$', noun):
      return re.sub('y$', 'ies', noun)
  else:
      return noun

def add_s_pluralize(noun):
  '''Naively add s to end of input word to create plural'''
  return noun + 's'

def tidy_input(input):
  '''Take input word and tidy it up to create a list of options.
  
  We have a few different pluralize functions just to account for any
  misspellings online/words created when punctuation removed.
  '''

  input_words = input.split()

  #Add singular forms of plurals and plural forms of singles 
  singles = [singularize(plural) for plural in input_words]
  plurals1 = [pluralize(single) for single in singles]
  plurals2 = [ending_pluralize(single) for single in singles]
  plurals3 = [add_s_pluralize(single) for single in singles]
  input_words = input_words + singles + plurals1 + plurals2 + plurals3

  input_words = input_words + [word.lower() for word in input_words]
  #If you want capitalized words as well
  input_words = input_words + [word[0].upper() + word[1:] for word in input.split()]
  input_words = input_words + [word.upper() for word in input_words]

  input_words = list(set(input_words))
  
  return(input_words)
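
To make the variant generation concrete, here is a minimal, dependency-free sketch of how the regex-based helper behaves on a few words. The `surface_forms` function is hypothetical (it is not part of the pipeline) and skips the `pattern` calls, so it only approximates what `tidy_input` returns.

```python
import re

def ending_pluralize(noun):
    """Same rules as the helper above: add 'es' or swap 'y' for 'ies'."""
    if re.search('[sxz]$', noun):
        return noun + 'es'
    elif re.search('[^aeioudgkprt]h$', noun):
        return noun + 'es'
    elif re.search('y$', noun):
        return re.sub('y$', 'ies', noun)
    return noun

def surface_forms(word):
    """Hypothetical helper: collect case and plural variants of one word."""
    forms = {word, ending_pluralize(word), word + 's'}
    forms |= {f.lower() for f in forms} | {f.upper() for f in forms}
    return forms

for w in ['box', 'church', 'city', 'path']:
    print(w, '->', ending_pluralize(w))

print(sorted(surface_forms('sweet')))
```

The rules are deliberately loose: `path` is returned unchanged and `day` would become `daies`. That is acceptable here because the variants are only used to filter out near-duplicates of the input, not shown to the user.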


## Functions for working with Google Knowledge Graph

We explored a few different methods for getting text to run through the keyword generation model. The Google Knowledge Graph was the most reliable of these, but some work is needed to interact with the API. If the Knowledge Graph tells us what sort of entity we are dealing with, we can also adapt our search term slightly to potentially get better results.

In [None]:

#Next few functions sourced from https://practicaldatascience.co.uk/data-science/how-to-access-the-google-knowledge-graph-search-api

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)

def get_knowledge_graph(api_key, query):
    """Return a Google Knowledge Graph for a given query.

    Args: 
        api_key (string): Google Knowledge Graph API key. 
        query (string): Term to search for.

    Returns:
        response (object): Knowledge Graph response object in JSON format.
    """ 
        
    endpoint = 'https://kgsearch.googleapis.com/v1/entities:search'
    params = {
        'query': query,
        'limit': 10,
        'indent': True,
        'key': api_key,
    }

    url = endpoint + '?' + urllib.parse.urlencode(params)    
    response = get_source(url)
    
    return json.loads(response.text)
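
As a quick check of what `get_knowledge_graph` requests, the snippet below builds the same URL without sending it. The key is a placeholder, not a real credential.

```python
from urllib.parse import urlencode

endpoint = 'https://kgsearch.googleapis.com/v1/entities:search'
params = {
    'query': 'Tom Hanks',
    'limit': 10,
    'indent': True,
    'key': 'YOUR_API_KEY',  # placeholder, supply your own key
}

# urlencode escapes spaces as '+' and stringifies non-string values
url = endpoint + '?' + urlencode(params)
print(url)
```

Printing the URL like this is a cheap way to debug a failed request before looking at the API response itself.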

def get_knowledge_graph_df(input):
  """
  Uses Google's knowledge graph to generate Pandas DataFrame of entities 
  deemed most similar to input searched. DataFrame includes categorization
  of entity, title, short description and URL (usually to Wikipedia).
  You will need to have set up an API key in Google Cloud Console to get this
  to work (it's free to do and you can do 100k requests a day I believe.)
  https://console.cloud.google.com/apis 
  Args:
    input (string): Final Linkee answer

  Returns:
    knowledge_graph_df (Pandas DataFrame): info on Knowledge Graph results
  """
  threshold=0.2
  api_key = "YOUR_API_KEY" #Removed for demo, supply your own key here
  knowledge_graph_json = get_knowledge_graph(api_key, input)
  knowledge_graph_df = pd.json_normalize(knowledge_graph_json, record_path='itemListElement')
  #Only using scores if knowledge graph actually returns something
  if len(knowledge_graph_df) > 0:
    max_score = max(knowledge_graph_df['resultScore'])
    knowledge_graph_df = knowledge_graph_df.loc[knowledge_graph_df['resultScore']>threshold*max_score]
  return knowledge_graph_df
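
The relative score cut-off can be illustrated without pandas or an API key. The sketch below uses a made-up response (the names and scores are hypothetical, only the nesting mirrors the API's JSON) and a hypothetical helper `filter_by_relative_score` that applies the same rule as `get_knowledge_graph_df`.

```python
# Hypothetical response: invented entities and scores, shaped like the API's JSON
response = {
    'itemListElement': [
        {'result': {'name': 'Entity A', '@type': ['Person', 'Thing']}, 'resultScore': 1000.0},
        {'result': {'name': 'Entity B', '@type': ['Movie', 'Thing']}, 'resultScore': 250.0},
        {'result': {'name': 'Entity C', '@type': ['Thing']}, 'resultScore': 150.0},
    ]
}

def filter_by_relative_score(items, threshold=0.2):
    """Keep items scoring above threshold * best score (empty list stays empty)."""
    if not items:
        return []
    max_score = max(item['resultScore'] for item in items)
    return [item for item in items if item['resultScore'] > threshold * max_score]

kept = filter_by_relative_score(response['itemListElement'])
print([item['result']['name'] for item in kept])
```

With a threshold of 0.2 the cut-off here is 200, so only the first two entities survive. Using a relative rather than absolute threshold matters because raw scores vary widely between queries.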

def classify_input(knowledge_graph_df):
  """Classify the input word/phrase as a certain category 
  to improve search results. Acts as failsafe if initial search
  of input fails.

  Args:
    knowledge_graph_df: Return of get_knowledge_graph_df

  Returns:
    category (string): Category of input
  
  """
  entity_types = knowledge_graph_df['result.@type']
  #Skip a leading SportsTeam result, but only if a second result exists
  if "SportsTeam" in entity_types.iloc[0] and len(entity_types) > 1:
    entity_tags = entity_types.iloc[1]
  else:
    entity_tags = entity_types.iloc[0]
  #return(entity_tags)
  if ("Movie" in entity_tags) or ("MovieSeries" in entity_tags):
    category = "Movie"
  elif ("TVEpisode" in entity_tags) or ("TVSeries" in entity_tags):
    category = "TV"
  elif ("VideoGame" in entity_tags) or ("VideoGameSeries" in entity_tags):
    category = "VideoGame"
  elif ("Book" in entity_tags) or ("BookSeries" in entity_tags):
    category = "Book"
  elif "Person" in entity_tags:
    category = "Person"
  elif ("MusicAlbum" in entity_tags) or ("MusicGroup" in entity_tags) or ("MusicRecording" in entity_tags):
    category = "Music"
  elif ("Place" in entity_tags) or ("AdministrativeArea" in entity_tags):
    category = "Place" 
  else:
    category = "Thing"

  return(category)
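
The if/elif chain above is a priority-ordered lookup, so it can also be written as data. This is only a sketch of that alternative: `CATEGORY_RULES` and `category_from_tags` are hypothetical names, and the function assumes the entity tags arrive as a list of type strings.

```python
# Ordered (category, types) pairs; earlier entries win, as in the if/elif chain
CATEGORY_RULES = [
    ('Movie', {'Movie', 'MovieSeries'}),
    ('TV', {'TVEpisode', 'TVSeries'}),
    ('VideoGame', {'VideoGame', 'VideoGameSeries'}),
    ('Book', {'Book', 'BookSeries'}),
    ('Person', {'Person'}),
    ('Music', {'MusicAlbum', 'MusicGroup', 'MusicRecording'}),
    ('Place', {'Place', 'AdministrativeArea'}),
]

def category_from_tags(entity_tags):
    """Return the first category whose types overlap the entity's tags."""
    tags = set(entity_tags)
    for category, types in CATEGORY_RULES:
        if tags & types:
            return category
    return 'Thing'

print(category_from_tags(['Thing', 'Person']))
print(category_from_tags(['Movie', 'Person']))
print(category_from_tags(['Thing']))
```

Keeping the rules in a list makes the priority order explicit and lets new categories be added without touching the control flow.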

def tailored_search(category, input):
  """Change the search to get better keywords for input, based on its category

  Args:
    category (string): Category of input
    input (string): Final Linkee answer

  Returns:
    search_input (string): Search term to use to find keywords
  
  """
  if category == "Movie" or category == "TV" or category == "Book":
    search_input = input + " " + category + " information"
  elif category == "Place":
    search_input = input + " location"
  else:
    search_input = input
  return(search_input)


def collect_urls(knowledge_graph_df):
  """Collect the urls from the knowledge graph to give more options to scrape 
  from.

  Args:
    knowledge_graph_df: Return of get_knowledge_graph_df

  Returns:
    list of urls (string): urls found
  
  """
  if 'result.detailedDescription.url' in knowledge_graph_df.columns:
    knowledge_graph_df = knowledge_graph_df[knowledge_graph_df['result.detailedDescription.url'].notna()]
    urlList = knowledge_graph_df['result.detailedDescription.url'].tolist()
  else:
    urlList = []
  return urlList

def get_wiki_links(urlList):
  '''Extract the URLs linking to Wikipedia from a list of URLs'''
  url_wiki = [url for url in urlList if "wiki" in url]
  return url_wiki


def get_wiki_text(url_wiki, keep_words=10000):
    '''
    Takes a list of Wikipedia urls and scrapes the text of each page.
    Input:
      url_wiki - a list of Wikipedia urls
      keep_words - the number of words to keep (approx up to paragraph)
    Output:
      text_comb - the text extracted from the paragraphs until word limit reached
    '''
    text_comb = ''
    total_words = 0
    # print('url_wiki',url_wiki)
    for url in url_wiki:
      wiki_term = url.split('/wiki/')[1]
      print(f"Looking at wiki page for: {wiki_term}")
      try:
        text_wiki = (wikipedia.page(wiki_term, auto_suggest = False)).content
      except KeyError: #fullurl errors can be caused by unicode or other symbols
        text_wiki = (wikipedia.page(wiki_term, auto_suggest = True)).content
      #This will drop headers surrounded by ==
      text_wiki = re.sub(r'==.*?==+', '', text_wiki)
      paras = text_wiki.split('\n\n')
      word_count = len(paras[0].split()) #number of words in 1st paragraph
      remaining_words = keep_words - total_words
      j = 0
      text = paras[0]
      while word_count < remaining_words and j<len(paras)-1:
        j += 1
        para_text = paras[j]
        word_count = word_count + len(para_text.split())
        text = text + ' ' + para_text
      #Drop new line \n clutter
      text = text.replace('\n', '')
      text_comb = text_comb + text # change if want more than one
      total_words = total_words + word_count
      if total_words >= keep_words:
        break  # break out of for loop when we have enough words
    return text_comb
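
The header stripping and word budget are easier to follow on a tiny input. The sketch below isolates that logic in a hypothetical `take_paragraphs` function and runs it on an invented three-paragraph string instead of a real Wikipedia page.

```python
import re

def take_paragraphs(content, keep_words):
    """Strip '== Header ==' lines, then add paragraphs until the word budget is met."""
    content = re.sub(r'==.*?==+', '', content)
    paras = content.split('\n\n')
    text = paras[0]
    word_count = len(paras[0].split())
    j = 0
    while word_count < keep_words and j < len(paras) - 1:
        j += 1
        word_count += len(paras[j].split())
        text = text + ' ' + paras[j]
    return text.replace('\n', ''), word_count

sample = (
    "First paragraph has five words.\n\n"
    "== History ==\n"
    "Second paragraph adds four.\n\n"
    "Third paragraph is skipped."
)
text, count = take_paragraphs(sample, keep_words=8)
print(text)
print(count)
```

The budget is approximate: the loop always finishes the paragraph that crosses the limit, so a budget of 8 yields 9 words here rather than cutting mid-sentence.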


def wiki_autosuggest(input, keep_words = 10000):
  ''' Gets text from Wikipedia using whichever page is autosuggested
      Input: input - original input word
             keep_words - number of words to keep
      Output: 
             text that has been cleaned
  '''
  # Get text from single wikipedia page using auto-suggest
  try:
    text_wiki = (wikipedia.page(input, auto_suggest = True)).content
  except Exception as err:
    print(err.args)
    raise ValueError(f'No urls found for {input}') 
  #if no exception raised clean up text
  #This will drop headers surrounded by ==
  text_wiki = re.sub(r'==.*?==+', '', text_wiki)
  paras = text_wiki.split('\n\n')
  word_count=len(paras[0].split()) #number of words in 1st paragraph
  j=0
  text = paras[0]
  while word_count < keep_words and j<len(paras)-1:
    j += 1
    para_text = paras[j]
    word_count = word_count + len(para_text.split())
    text = text + ' ' + para_text
  #Drop new line \n clutter
  text = text.replace('\n', '')
  return text


## Main functions

These functions find text on the subject we want and then try to extract keywords related to it.

In [None]:


def find_text(input): 
  '''  
  Finds text related to input that can be used for
  keyword extraction. This function attempts to clean
  up the relevant text.
  '''
  knowledge_graph_df = get_knowledge_graph_df(input)
  if len(knowledge_graph_df) == 0:
    print("nothing found using knowledge graph, trying wiki")
    text = wiki_autosuggest(input)
  else:
    urlList = collect_urls(knowledge_graph_df)
    url_wiki = get_wiki_links(urlList)
    if len(url_wiki) >= 1:
      keep = min(len(url_wiki), 3)
      url_wiki = url_wiki[0:keep]
      text = get_wiki_text(url_wiki)
    else: 
      #Use the knowledge graph categories to find wikipedia url
      print(" No wiki urls: 1st pass")
      category = classify_input(knowledge_graph_df)
      search_input = tailored_search(category, input)
      print(f"Searching for urls with input {search_input}")
      urlList = collect_urls(get_knowledge_graph_df(search_input))
      url_wiki = get_wiki_links(urlList)
      if len(url_wiki) >= 1:
        keep = min(len(url_wiki), 3)
        url_wiki = url_wiki[0:keep]
        text = get_wiki_text(url_wiki)
      else:
        print(" No wiki urls: 2nd pass")
        text = wiki_autosuggest(input)
  
  #Text Cleaning
  text = re.sub(r"\'", '', text) #Get rid of \'
  text = re.sub(r"\\xa0...", '', text) #Get rid of \\xa0...
  text = re.sub(r"\\n", ' ', text) #Get rid of \\n
  text = re.sub(r"\\u200e", ' ', text) #Get rid of \\u200e
  text = re.sub(r"[\"\'\“\”\[\]\)\(\•\▽\❖\†]+", '', text)
  text = re.sub(r"logo", '', text)
  text = re.sub(r"[Vv]iew \d+ more rows", '', text) #Get rid of [Vv]iew \d+ more rows
  text = re.sub(r"\d+ hours ago", '', text)
  text = re.sub(r"[-·—,.;:@#?!$+-]+", ' ', text) 
  text = re.sub(r"U S ", "US ", text)

  text = ' '.join(text.split()) #Single spacing

  return text
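
To see what the cleaning step does to a sentence, the sketch below applies a subset of the substitutions from `find_text` to an invented string. The `clean` function is hypothetical, and its first pattern only covers the ASCII quotes and brackets from the original character class.

```python
import re

def clean(text):
    text = re.sub(r"[\"'\[\]()]+", '', text)         # drop quotes and brackets
    text = re.sub(r"[-·—,.;:@#?!$+-]+", ' ', text)   # punctuation becomes a space
    text = re.sub(r"U S ", "US ", text)              # re-join a split "U.S."
    return ' '.join(text.split())                    # collapse to single spaces

print(clean('The U.S. release (1979) was a "hit"; critics agreed.'))
print(clean('Spider-Man'))
```

One side effect worth knowing about is that hyphenated names are split into separate words, which is why a keyword such as 'Spider Man' can appear without its hyphen in the results.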


def keyword_extract(text, ngram_size):
  '''Extract keywords/phrases of up to ngram_size words using YAKE'''
  #Initialise extractor
  language = "en"
  max_ngram_size = ngram_size
  deduplication_threshold = 0.3
  numOfKeywords = 100
  custom_kw_extractor = yake.KeywordExtractor(lan=language, 
                                              n=max_ngram_size, 
                                              dedupLim=deduplication_threshold, 
                                              top=numOfKeywords, features=None)
  
  #Run extractor on text and get out words/phrases
  yake_output = custom_kw_extractor.extract_keywords(text)
  words, scores = zip(*yake_output)
  words = list(words)
  scores = list(scores)
  words = [re.sub(r"[,.;@#?!$]+", ' ', i) for i in words]
  return(words,scores)
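
One edge case in `keyword_extract` is that `zip(*yake_output)` cannot be unpacked into two names when YAKE returns nothing. A minimal sketch of a guard is below. `split_keywords` is a hypothetical helper, and the pairs are made up in the (keyword, score) shape that the code above unpacks.

```python
def split_keywords(yake_output):
    """Split (keyword, score) pairs into two lists; empty input gives two empty lists."""
    if not yake_output:
        return [], []
    words, scores = zip(*yake_output)
    return list(words), list(scores)

# Made-up pairs for illustration; with YAKE, a lower score means a more relevant keyword
mock_output = [('Vietnam War', 0.01), ('Marlon Brando', 0.03), ('film', 0.20)]
words, scores = split_keywords(mock_output)
print(words)
print(scores)
print(split_keywords([]))
```

Returning two empty lists lets downstream filtering run unchanged on very short or empty texts.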



## Filtering our keywords

We check there are no repeats in our candidate keywords. We also want to select keywords which are proper nouns, since these are the words we are most likely to be able to generate questions from in the next stage of this project.

In [None]:

def answer_keyword_compare(keywords_list, input_words):
  '''Remove candidate keywords that contain input words'''

  keywords_list = [x for x in keywords_list if not any(i in input_words for i in x.split())]
  return keywords_list
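
A small example shows how the answer filter behaves. The function is repeated so the snippet runs on its own, and the word list is a hand-written stand-in for what `tidy_input` would produce.

```python
def answer_keyword_compare(keywords_list, input_words):
    """Remove candidate keywords that contain input words."""
    return [x for x in keywords_list if not any(i in input_words for i in x.split())]

# Hand-written stand-in for tidy_input('Tom Hanks')
input_words = ['Tom', 'Hanks', 'tom', 'hanks', 'TOM', 'HANKS']
candidates = ['Tom Hanks', 'Forrest Gump', 'Hanks family', 'Academy Award']

print(answer_keyword_compare(candidates, input_words))
```

The comparison is an exact, case-sensitive match on whole words, which is why `tidy_input` generates the case and plural variants up front.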

def remove_non_noun_full_keywords(keywords_list):
  '''
  Only retain keywords/keyphrases that are proper nouns.
  '''
  #Each keyword/phrase is passed to the tagger as a single token
  pos = nltk.pos_tag(keywords_list)
  #NNP = singular proper noun, NNPS = plural proper noun
  new_keyword_list = [word for word, tag in pos if tag in ('NNP', 'NNPS')]
  return new_keyword_list

def select_keywords(words2):
  '''
  Selects the keywords/phrases to use for question generation. Ensures that 
  keyword phrases do not overlap each other.
  '''
  #Nothing to select from if filtering removed every candidate
  if not words2:
    return []
  words3 = []
  words3.append(words2[0])
  del words2[0]
  for i in range(len(words2)):
    #If at any point, we only have 4 candidate keywords left, use them all
    if len(words2) + len(words3) <= 4:
      words3 = words3 + words2
      break
    test_words = words2[0].lower().split()
    singles = [singularize(plural) for plural in test_words]
    plurals1 = [pluralize(single) for single in singles]
    plurals2 = [ending_pluralize(single) for single in singles]
    plurals3 = [add_s_pluralize(single) for single in singles]
    test_words = list(set(test_words + singles + plurals1 + plurals2 + plurals3))
    previous_words = words3.copy()
    previous_words = [word for phrase in previous_words for word in phrase.split()]
    previous_words = [x.lower() for x in previous_words]
    if len(test_words) + len(previous_words) == len(list(set(test_words + previous_words))):
      words3.append(words2[0])
    del words2[0]
  return(words3)
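
The core idea of `select_keywords` is a greedy pass that rejects any phrase sharing a word with a phrase already kept. The sketch below is a simplified, hypothetical version (`select_non_overlapping`): it leaves out the plural variants and the rule that keeps everything once four or fewer candidates remain.

```python
def select_non_overlapping(phrases):
    """Greedy pass: keep a phrase only if it shares no lowercase word with those kept."""
    kept, seen = [], set()
    for phrase in phrases:
        words = set(phrase.lower().split())
        if seen.isdisjoint(words):
            kept.append(phrase)
            seen |= words
    return kept

print(select_non_overlapping(['Academy Award', 'Award Winner', 'Golden Globe', 'golden age']))
```

Because the pass is greedy, the order of the input matters: earlier (higher-ranked) phrases always win over later ones that overlap with them.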

## Final function

At the moment, the end user only needs to call this function, which calls all the other functions we have created and returns a keyword list. In the next stage, the output of this function will act as the input for the question generation stage.

In [None]:
def linkee_keywords(input):
  """
  Main pipeline function which takes input and generates list of keywords.
  """
  answer_list = tidy_input(input)

  #Keyword extraction
  text = find_text(input)
  
  #Potentially here combine 1-gram, 2-gram, 3-gram results
  #Run YAKE once and unpack both outputs rather than extracting twice
  words, scores = keyword_extract(text, 2)
  words_df = pd.DataFrame({'words': words, 'scores': scores})
  #YAKE scores are lower-is-better, so sort ascending and keep the result
  words_df = words_df.sort_values(by=['scores'])
  words_df = words_df[:100]

  #Cleaning up returned keywords
  words2 = answer_keyword_compare(words, answer_list)

  words2 = remove_non_noun_full_keywords(words2)

  final_keywords = select_keywords(words2)

  return(final_keywords)

We have a few pre-run examples of results. While some results appear to be quite vague/not useful, the only ones which will be kept on the final card will be those we can generate questions for. Therefore, the next stage of this project will likely bring the added benefit of improving our keyword list. 


In [None]:
%%time
linkee_keywords('Apocalypse Now')[:6] 

Looking at wiki page for: Apocalypse_Now
Looking at wiki page for: Ride_of_the_Valkyries
CPU times: user 27.3 s, sys: 542 ms, total: 27.9 s
Wall time: 30.8 s


['Vietnam War',
 'Ford Coppola',
 'Colonel Kurtz',
 'Marlon Brando',
 'George Lucas',
 'John Milius']

In [None]:
%%time
linkee_keywords('Tom Hanks')[:6] 

Looking at wiki page for: Tom_Hanks
Looking at wiki page for: Forrest_Gump
CPU times: user 24.7 s, sys: 471 ms, total: 25.2 s
Wall time: 27.8 s


['Forrest Gump',
 'Academy Award',
 'American Film',
 'Motion Picture',
 'Golden Globe',
 'South Carolina']

In [None]:
%%time
linkee_keywords('Eddie Murphy')[:6]

Looking at wiki page for: Eddie_Murphy
CPU times: user 14.7 s, sys: 261 ms, total: 15 s
Wall time: 16.8 s


['Hills Cop',
 'Night Live',
 'Nutty Professor',
 'Academy Award',
 'Supporting Actor',
 'Paramount Pictures']

In [None]:
%%time
linkee_keywords('Allstate')[:6]

Looking at wiki page for: Allstate
CPU times: user 18.5 s, sys: 278 ms, total: 18.8 s
Wall time: 20.5 s


['Sears Roebuck',
 'Wrigley Field',
 'Dennis Haysbert',
 'Solutions Private',
 'National General',
 'Northbrook Illinois']

In [None]:
%%time
linkee_keywords('Green Goblin')[:6]

Looking at wiki page for: Green_Goblin
Looking at wiki page for: Harry_Osborn
CPU times: user 22.7 s, sys: 436 ms, total: 23.1 s
Wall time: 26.3 s


['Spider Man',
 'Harry Osborn',
 'American Son',
 'Parker Industries',
 'Gabriel Stacy',
 'Formula Norman']

In [None]:
%%time
linkee_keywords('Emmerdale')[:6]

Looking at wiki page for: Emmerdale
CPU times: user 15.3 s, sys: 213 ms, total: 15.6 s
Wall time: 17.4 s


['Scottish Television',
 'Tom King',
 'ITV regions',
 'Jack Sugden',
 'Yorkshire Dales',
 'Episodes originally']