# Search analysis

Here I wanted to answer the following questions, answers that I found are beneath each one

Worth mentioning that the search data consists of 506,992 individual searches, all on the 10th/11th Nov 2020 - so might not be representative of wider trends but should be enough to make some high level inferences

#### What percentage of searches have nouns or verbs or both?

Nouns and verbs: 111059 (24%)
Just nouns: 305836 (68%)
Just verbs: 33486 (7%)

I suspect the true number of verbs is lower as spacy can mislabel some words, eg "fund" can be a verb or noun, in the context of government its a noun and spacy could have mislabelled words like that.

Thus, though, we can see that almost exactly 2/3 of searches consist entirely of nouns. This is interesting and makes intuitive sense. 

"Here be opinions" -> the 'bad thing' though is that it could mean that users are making non specific, vague searches - from which it is hard to tell (even as a human) what they want. Eg if someone wants to sign in the their universal credit account and searches for "universal credit" then they might well get the right page but no search engine could correctly infer from the search term why that person wanted that page - other people searching for the same term could well want to apply or just get more information about it. Thus, to my mind, if we can increase the number of people making more specific searches then they would (likely) get better results and it would also be easier for us to improve search (as we'd have a better idea of user intent).


#### What percentage of searches have constrictively detected SVO triples?

I've used a _very_ constrictive method of detecting SVO triples here (ie that it must follow good grammar rules etc.) which we can be pretty sure is a bad assumption. I deliberately shied away from assuming that a search like "apply passport" means that passport is the object and apply is the verb - even though it's a pretty reasonable assumption. This is because I wanted to see how well the approach of being sure that two words are related is. Thus, 16% of 1000 searches had a triple, in theory it could be as high as 24% (the percentage of searches with a verb and noun). This is higher than I initially thought


#### What percentage of searches have entities in them?

39%. I suspect that the true value is higher as govNER has been trained on GOV.UK content which has correct casings and thus it doesn't always detect entities in strings like "universal credit" or misspellings. A simple way to get around this is to try getting a list of all unique entities from the Knowledge Graph, lowercasing them and doing a simple string match or Levenstein distance (to account for typos).

In [2]:
import pandas as pd
import spacy
import os
from py2neo import Graph
import sys
import os

os.environ['MODEL_FILE_PATH'] = '../../govuk-knowledge-graph/data'
sys.path.append("../../govuk-language-model")
from sagemaker.container.govner.govner import GovNER
ner = GovNER()
nlp = spacy.load("en_core_web_sm")

Loaded govNER v0.1
# of params: 108321038:


In [3]:
search_queries_df = pd.read_csv("../data/raw/search_queries.csv")
search_queries_df = search_queries_df.drop_duplicates(subset=['search_term', 'session_id'])
search_queries_df.head()

Unnamed: 0.1,Unnamed: 0,search_term,session_id,search_timestamp
0,0,uniform,37285030730931021621605022105,2020-11-10 15:28:36+00:00
3,3,gateway,74520221399677225591605004160,2020-11-10 11:23:39+00:00
4,4,new style jsa,53773750642097425841605019611,2020-11-10 14:49:52+00:00
5,5,sick leave,27471907663691553091605019886,2020-11-10 15:10:30+00:00
6,6,v5c,36817976659622799321605029629,2020-11-10 17:34:40+00:00


In [12]:
class SOV:
    def __init__(self):
        self.subject = None
        self.object = None
        self.verb = None
        
    def cypher_subject(self):
        return self._cypher_safe(self.subject)

    def cypher_object(self):
        return self._cypher_safe(self.object)

    def cypher_verb(self):
        return self._cypher_safe(self.verb)

    def _cypher_safe(self, token):
        if token is None:
            return ""
        if type(token) is list: 
            text = ''.join([t.text_with_ws for t in token])
        else:
            text = token.text
        text = text.lower()
        text = text.strip()
        return text.replace("'", "")


class Title:
    
    def __init__(self, title, nlp):
        self.nlp = nlp
        self.title = title
        self.triples = []
    
    def subject_object_triples(self):
        if any(self.triples):
            return self.triples
        self.triples = self._get_triples_for_title()
        return self.triples
    
    def _verbs(self):
        return ["VERB", "AUX"]
    
    def _cypher_safe(self, words):
         return [word.replace("'", "") for word in words]
        
    def _is_object_of_prepositional_phrase(self, token):
        # Finds objects of prepositional phrases
        # eg "Apply online for a UK passport", "Apply for this licence"
        if token.dep_ == "pobj" and token.head.dep_ == "prep" and token.head.head.pos_ in self._verbs():
            triple = SOV()
            triple.verb = token.head.head
            triple.object = [token]
            # experiment
            triple.subject = []
            reversed_lefts = list(token.lefts) or []
            reversed_lefts.reverse()# or []
            print(reversed_lefts)
            if reversed_lefts:
                for left in reversed_lefts:
                    print(f"left text: {left.text}")
                    print(f"left dep: {left.dep_}")
                    if left.dep_ == "poss":
                        triple.subject.append(left)
            # end experiment
            compound_lefts = self._compound_left_compounds(token)
            if any(compound_lefts):
                compound_lefts.reverse()
                print(compound_lefts)
                triple.object = compound_lefts + triple.object
            return [triple]

    def _is_object(self, token):
        # Finds simple objects
        # eg "Get a passport for your child"
        # TODO: should probably extract "for your child" bit as a modifier of some kind
        if token.dep_ == "dobj" and token.head.pos_ in self._verbs():
            triple = SOV()
            triple.verb = token.head.head
            triple.object = [token]
            compound_lefts = self._compound_left_compounds(token)
            if any(compound_lefts):
                compound_lefts.reverse()
                print(compound_lefts)
                triple.object += compound_lefts
            return [triple]

    def _compound_left_compounds(self, token):
        print(f"compounded lefts for token: {token.text}")
        compounded_lefts = []
        reversed_lefts = list(token.lefts) or []
        reversed_lefts.reverse()# or []
        print(reversed_lefts)
        if reversed_lefts:
            for left in reversed_lefts:
                print(f"left text: {left.text}")
                print(f"left dep: {left.dep_}")
                if left.dep_ == "compound":
                    compounded_lefts.append(left)
                    compounded_lefts += self._compound_left_compounds(left)
                else:
                    break
        return compounded_lefts
    
    def _find_triples(self, token, debug=False):
        is_object_of_prepositional_phrase = self._is_object_of_prepositional_phrase(token)
        if is_object_of_prepositional_phrase:
            if debug:
                print("is_object_of_prepositional_phrase")
            return is_object_of_prepositional_phrase
        is_object = self._is_object(token)
        if is_object:
            if debug:
                print("is_object")
            return is_object

    def _to_nltk_tree(self, node):
        if node.n_lefts + node.n_rights > 0:
            return Tree(node.orth_, [self._to_nltk_tree(child) for child in node.children])
        else:
            return node.orth_

    def _get_triples_for_title(self, debug=False):
        doc = self.nlp(self.title)
        if debug:
            [self._to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]
        triples = []
        for token in doc:
            if debug:
                print(f"text: {token.text}")
                print(f"dep: {token.dep_}")
                print(f"head dep: {token.head.dep_}")
                print(f"head head pos: {token.head.head.pos_}")
                print(f"lefts: {list(token.lefts)}")
                print()
            subject_object_triples = self._find_triples(token, debug)
            if subject_object_triples:
                triples += subject_object_triples
        return triples


### Find searches with SVOs and/or entities

This is ridiculously computationally expensive so I've limited it to 1000 searches

In [13]:
triples = []
searches_with_entities = []

for _index, row in search_queries_df[0:1000].iterrows():
    try:
        title = Title(row['search_term'], nlp)
        if any(title.subject_object_triples()):
            triples.append(title)
        else:
            entities = ner.entities(row['search_term'])
            if any(entities):
                searches_with_entities.append([entities, row['search_term']])
    except:
        next

[]
compounded lefts for token: uk
[]
compounded lefts for token: license
[]
compounded lefts for token: test
[a]
left text: a
left dep: det
[small]
left text: small
left dep: amod
compounded lefts for token: business
[small]
left text: small
left dep: amod
compounded lefts for token: grant
[your]
left text: your
left dep: poss
compounded lefts for token: grant
[]
compounded lefts for token: code
[tax, my]
left text: tax
left dep: compound
compounded lefts for token: tax
[]
left text: my
left dep: poss
[tax]
[]
compounded lefts for token: article
[]
compounded lefts for token: penalty
[car, my]
left text: car
left dep: compound
compounded lefts for token: car
[]
left text: my
left dep: poss
[car]
compounded lefts for token: uk
[]
compounded lefts for token: spain
[]
compounded lefts for token: training
[]
[covid]
left text: covid
left dep: amod
compounded lefts for token: fund
[covid]
left text: covid
left dep: amod
compounded lefts for token: payment
[advance, an]
left text: advance
le

compounded lefts for token: car
[the]
left text: the
left dep: det
compounded lefts for token: grant
[coronavirus]
left text: coronavirus
left dep: compound
compounded lefts for token: coronavirus
[]
[coronavirus]
compounded lefts for token: loans
[]
compounded lefts for token: select
[]
[]
compounded lefts for token: home
[]
[family, a]
left text: family
left dep: compound
left text: a
left dep: det
compounded lefts for token: member
[family, a]
left text: family
left dep: compound
compounded lefts for token: family
[]
left text: a
left dep: det
[family]
compounded lefts for token: insurance
[national]
left text: national
left dep: amod
compounded lefts for token: record
[insurance, national]
left text: insurance
left dep: compound
compounded lefts for token: insurance
[]
left text: national
left dep: amod
[insurance]
compounded lefts for token: contributations
[insurance, national, my]
left text: insurance
left dep: compound
compounded lefts for token: insurance
[]
left text: nationa

compounded lefts for token: homes
[]
compounded lefts for token: iht
[]
compounded lefts for token: loan
[]
compounded lefts for token: support
[]
compounded lefts for token: people
[]
compounded lefts for token: jsa
[tor]
left text: tor
left dep: compound
compounded lefts for token: tor
[]
[tor]
compounded lefts for token: debt
[]


In [16]:
print(f"number of searches (out of 1000) with triples in it: {len(triples)}")
print(f"number of searches (out of 1000) with an entity in it: {len(searches_with_entities)}")

number of searches (out of 1000) with triples in it: 160
number of searches (out of 1000) with an entity in it: 391


### Can the KG return content that matches an SVO?

Yes is the answer! It actually has some really good results (requires SVO triples to be inserted into the graph with the extract_subject_verb_object_from_titles notebook)

In [76]:

host = os.environ.get('REMOTE_NEO4J_URL')
user = os.environ.get('NEO4J_USER')
password = os.environ.get('NEO4J_PASSWORD')
graph = Graph(host=host, user='neo4j', password = password, secure=True)
has_result = []
for triple in triples:
    result = graph.run('MATCH ({name: "' + triple.subject_object_triples()[0].cypher_verb() + '"})-[:HAS_VERB|HAS_OBJECT|HAS_SUBJECT]-(n:Action)-[:HAS_VERB|HAS_OBJECT|HAS_SUBJECT]-({name: "' + triple.subject_object_triples()[0].cypher_object() + '"}) WITH n MATCH (n)-[:TITLE_MENTIONS]-(c:Cid) return c.name').data()
    if any(result):
        has_result.append([result, triple])

In [77]:
len(has_result)

334

In [90]:
for result in has_result:
    print()
    print(result[1].title)
    print(result[0])



when to apply settlement: refugee or humanitarian protection
[{'c.name': '/turkish-worker-business-person-settlement'}]

register a company
[{'c.name': '/register-as-an-overseas-company'}]

if a child is born in uk , can his parents live in uk
[{'c.name': '/apply-citizenship-born-uk'}]

'find out about money taken off your universal credit payments'.
[{'c.name': '/find-court-money'}]

find out about money taken off your universal credit payments
[{'c.name': '/find-court-money'}]

find a job
[{'c.name': '/find-a-job'}, {'c.name': '/find-teaching-job'}]

find out about money taken of my universal credit
[{'c.name': '/find-court-money'}]

find out about money taken of my universal credit payments
[{'c.name': '/find-court-money'}]

find out about money taken off your universal credit payments'.
[{'c.name': '/find-court-money'}]

find out about money taken off my tax credits
[{'c.name': '/find-court-money'}]

check my status
[{'c.name': '/check-immigration-status'}]

check my status
[{'c.n

### What percentage of searches have nouns, verbs or both?

In [5]:
just_nouns = 0
just_verbs = 0
nouns_and_verbs = 0
just_noun_sents = []
just_verb_sents = []
nouns_and_verbs_sents = []
for _index, row in search_queries_df[0:100].iterrows():
    try:
        doc = nlp(row['search_term'])
    except TypeError:
        next
    has_verb = False
    has_noun = False
    for token in doc:
        if token.pos_ == "VERB":
              has_verb = True
        if token.pos_ == "NOUN":
              has_noun = True
    if has_verb and has_noun:
        nouns_and_verbs += 1
        nouns_and_verbs_sents.append(row['search_term'])
        next
    if has_verb and not has_noun:
        just_verbs += 1
        just_verb_sents.append(row['search_term'])
        next
    if has_noun and not has_verb:
        just_nouns += 1
        just_noun_sents.append(row['search_term'])
        
print(nouns_and_verbs)
print(just_nouns)
print(just_verbs)