# Identifying Business Content

The purpose of this notebook is to be able to algorithmically find/define business content.
By this, we mean content that is primarily business orientated and/or relates mostly to the lifecycle of a business - which can include self employed people, tradespeople, charities, non profits etc etc.

As a team, we decided that using a probabilistic approach (ie we can be 80% sure this is business content) allows for the fact that some content may or may not be "primarily" business oriented. For example, it might be a piece of content on a non business topic that includes a paragraph or a section for business users - in which case it should be included but is not "primarily" business oriented.

Further to this, we decided that in a tradeoff between precision and recall, we want to prioritise recall.

# Approach 1

### Broad strokes

The broad idea here is to use regex(s) to find heavily business oriented content, find common entities/n-grams in this content, widen the net and then use this expanded set (which is, hopefully, still fairly precise) as a way of training a model (of some sort, tbd) to then assign probabilities to unseen content.

To start with, I experimented with using a regex on the Knowledge Graph (KG) to see how much content I would get with various regexes. As I suspected, I tried a regex on the KG with just 'business' as a keyword and there was at least one result that uses the word business but there was absolutely no business intent (https://www.gov.uk/government/news/not-long-left-to-have-your-say-on-the-local-water-environment to be precise) so this regex casts the net too wide

The query was:
```
MATCH (n:Cid) WHERE toLower(n.text) CONTAINS 'business' OR toLower(n.description) CONTAINS 'business' OR toLower(n.title) CONTAINS 'business' RETURN n.title as title, n.name AS slug
```
and it had 78689 results

When I tried looking more specifically for the term 'your business' I got 3109 results and when I looked for 'your company' I got 1015. There is a difference of 74,000 results there! 

I tried using a regex for "your business|charity|company" to look at title, description and body text. It was picking up some content aimed at employees where it mentioned "your company" as the entity you work for.

Thus, I used a regex looking for "your business" in body text and your "business|charity|company" in descriptions and titles as a starting point.

In [None]:
import os
from py2neo import Graph
import nltk
from nltk.util import ngrams
host = os.environ.get('REMOTE_NEO4J_URL')
user = os.environ.get('NEO4J_USER')
password = os.environ.get('NEO4J_PASSWORD')

In [None]:

graph = Graph(host=host, user='neo4j', password = password, secure=True)

result = graph.run("MATCH (n:Cid) WHERE toLower(n.text) =~ '.*your business' OR toLower(n.description) =~ '.*your business*|.*your charity.*|.*your company.*' OR toLower(n.title) =~ '.*your business*|.*your charity.*|.*your company.*' RETURN n.title as title, n.name AS slug, n.text as text, n.description as description").data()
print(f"{len(result)} results")

From eyeballing the results, they look pretty sensible. Thus, lets try finding common n grams:

In [None]:
texts = []
for r in result:
    text = str(r['title']) + " " + str(r['description']) + " " + str(r['text'])
    texts.append(text)

In [26]:
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(" ".join(data)), num)
    return [ ' '.join(grams) for grams in n_grams]

bigrams = extract_ngrams(texts, 2)
trigrams = extract_ngrams(texts, 3)

In [29]:
counts = {}
for bigram in bigrams:
    if not bigram in counts:
        counts[bigram] = 0
    counts[bigram] += 1
    
for trigram in trigrams:
    if not trigram in counts:
        counts[trigram] = 0
    counts[trigram] += 1

In [36]:
sorted_counts = sorted(counts.items(), key=lambda item: item[1], reverse= True)
sorted_counts

[('’ s', 1499),
 ('your charity', 1130),
 ('of the', 981),
 ('charity ’', 685),
 ('charity ’ s', 678),
 ('. You', 535),
 ('. The', 520),
 ('in the', 512),
 ('your company', 493),
 ('the charity', 486),
 ('need to', 450),
 ('. If', 446),
 ('to the', 431),
 ('your charity ’', 380),
 (', you', 372),
 ('you ’', 353),
 ('. This', 348),
 ('If you', 337),
 ('if you', 335),
 ('You can', 297),
 ('for the', 291),
 ('on the', 290),
 ('’ t', 289),
 (', and', 268),
 (', the', 259),
 ('Corporation Tax', 253),
 ('governing document', 252),
 ('. It', 248),
 ('you can', 244),
 ('with the', 233),
 (', or', 231),
 ('you need', 228),
 ('’ re', 227),
 ('. If you', 227),
 ('as a', 223),
 ('and the', 219),
 ('You must', 213),
 ('of your', 212),
 ('to be', 211),
 ('the commission', 209),
 ('’ ll', 205),
 ('able to', 203),
 ('Companies House', 199),
 ('to your', 190),
 ('the company', 189),
 ('for example', 189),
 ('your business', 188),
 ('such as', 184),
 ('company ’', 183),
 ('you ’ re', 183),
 ('company ’ 

# Work paused

I have (temporarily) paused work on this in order to work on another project. The bi/trigrams have some promising entries but a lot of them are clearly just stop words. One could try:

* filtering out stop words
* doing tf-idf compared to known non business content to see which words are more important in business content
* using govner to find common entities in business content (from n-grams it seems that things like Corporation Tax, Companies House etc would be good entities). There may be some nouns that it misses though, (for example, 'accounting period' which might not be detected (haven't tried it though))