by: **Christian Mantilla**  
LinkedIn: https://www.linkedin.com/in/cmanti/  
github: https://github.com/chris14jan  

# Scope

**- CFC Insight Technical Challenge**  
This technical challenge is designed to take up to 2 hours.
Your submission should be uploaded to a GitHub repository and include your Python
code, a requirements.txt file and any setup or running instructions.
All code should be written for Python version >=3.6. You are allowed to use external
libraries and binaries if you choose.

**- The Challenge**  
Produce a program that:
1. Scrapes the index webpage hosted at `cfcunderwriting.com`.
2. Writes a list of *all externally loaded resources* (e.g. images/scripts/fonts not hosted
on cfcunderwriting.com) to a JSON output file.
3. Enumerates the page's hyperlinks and identifies the location of the "Privacy Policy"
page.
4. Uses the privacy policy URL identified in step 3 and scrapes the page's content.
Produces a case-insensitive word frequency count for all of the visible text on the page.
Your frequency count should also be written to a JSON output file.

**- Assessment Criteria**  
Treat the challenge as you would a real project so keep in mind readability,
performance, code quality and comments.

**- Submission**  
Please send back, by email, the URL to the GitHub repository that you have checked
your work into.

**- Follow Up**  
Successful entries will be discussed at interview. We would love to understand what
you would do if you had more time, and how you would expand the solution e.g.
Adding support for international languages or making the solution more robust.

# Imports

In [1]:
from bs4 import BeautifulSoup as bs
import json
import re
import requests
import os
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
import nltk

# Parse Links Function
Create a function that fetches the HTML of the index page and parses it to extract every link.  

Ideally, the function would save the HTML first to reduce the number of GET requests sent to the server.  
Saving timestamped HTML files could also make the scraping functions easier to maintain.
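
The caching idea above could be sketched as follows. This is a minimal sketch, not part of the submitted solution: the helper names `save_html_snapshot` and `load_latest_snapshot`, the `html_snapshots` folder and the file naming scheme are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def save_html_snapshot(html, directory="html_snapshots", prefix="index"):
    """Write `html` to a timestamped file and return its path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    # UTC timestamp with microseconds, so file names sort chronologically
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = folder / f"{prefix}_{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path


def load_latest_snapshot(directory="html_snapshots", prefix="index"):
    """Return the newest saved HTML for `prefix`, or None if nothing is cached."""
    snapshots = sorted(Path(directory).glob(f"{prefix}_*.html"))
    if not snapshots:
        return None
    return snapshots[-1].read_text(encoding="utf-8")
```

A scraping function could call `load_latest_snapshot()` first and only fall back to `requests.get(url).text` (followed by `save_html_snapshot`) when it returns `None`.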

In [39]:
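def parse_links():
    """
    Fetch the cfcunderwriting.com index page and return every link found in it.

    Collects `href` values from <link> and <a> tags, `src` values from
    <script> and <iframe> tags, and the inline `style` of <div class="img">
    elements (which may carry a background-image URL).
    """
    url = "https://www.cfcunderwriting.com"
    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed

    response = requests.get(url).text
    soup = bs(response, parser)

    links = []
    for link in soup.find_all(('link', 'a'), {'href':True}):
        links.append(link['href'])

    for link in soup.find_all(("script", 'iframe'), {'src':True}):
        links.append(link['src'])

    # Require a style attribute so that link['style'] cannot raise a KeyError
    for link in soup.find_all('div', {'class':'img', 'style':True}):
        links.append(link['style'])

    return links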

parse_links()

['https://www.cfcunderwriting.com/en-gb/',
 'https://www.cfcunderwriting.com/en-gb/',
 'https://www.cfcunderwriting.com/en-us/',
 'https://www.cfcunderwriting.com/en-ca/',
 'https://www.cfcunderwriting.com/en-au/',
 'https://www.cfcunderwriting.com/en-gb/',
 'https://fonts.googleapis.com/css?family=Montserrat:300,400,500,600,700',
 'https://cdnjs.cloudflare.com/ajax/libs/cookieconsent2/3.1.0/cookieconsent.min.css',
 'https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.1.3/css/bootstrap.min.css',
 '/css/dist/bundle.css?v=20210825.1',
 '/apple-touch-icon.png',
 '/favicon-32x32.png',
 '/favicon-16x16.png',
 '/site.webmanifest',
 '/safari-pinned-tab.svg',
 '#',
 'javascript:;',
 '/en-gb/claims/',
 '/en-gb/contact/',
 '/en-gb/careers/',
 '#',
 '/locations/switch/?isocode=GB',
 '/locations/switch/?isocode=US',
 '/locations/switch/?isocode=CA',
 '/locations/switch/?isocode=AU',
 '/en-gb/',
 'javascript:;',
 '#',
 '/locations/switch/?isocode=GB',
 '/locations/switch/?isocode=US',
 '/loc

# Return External Links Function
Create a function that returns only the external links from the list of all links.

In [None]:
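def external_links(links, soup=None):
    """
    Return the links that are not hosted on cfcunderwriting.com.

    Social-media hyperlinks are removed from the result. If no parsed `soup`
    of the index page is supplied, the page is fetched to locate them.
    """
    if soup is None:
        # Parse the index page here so the function does not rely on a notebook-level `soup`
        soup = bs(requests.get("https://www.cfcunderwriting.com").text, 'html.parser')

    external = []
    for link in links:
        # Skip anything that mentions the cfcunderwriting.com domain
        if re.search(r"cfcunderwriting\.com", link, re.I):
            continue

        if re.match(r'^(http|https)://', link, re.I):
            external.append(link)

        elif re.match(r"^background-image: url\('.*'", link, re.I):
            # Pull the quoted URL out of the inline CSS declaration
            external.append(re.search(r"'(.*?)'", link).group(1))

    soup_social = soup.find('div', {'class':'social'})
    if soup_social is not None:
        for link in soup_social.find_all(('link', 'a'), {'href':True}):
            # Only remove links that were actually collected, avoiding a ValueError
            if link['href'] in external:
                external.remove(link['href'])

    return {"external_links":external}, len(external)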

external_links(parse_links())
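
As an alternative to the regex test above, the hostname of each link can be compared against the site domain with `urllib.parse`, so that a URL which merely mentions `cfcunderwriting.com` in its query string is still treated as external. The sketch below also writes the result to a JSON file, as the challenge requires. The names `is_external` and `write_external_links`, and the output file name, are assumptions for illustration; inline `style` strings would still need their URL extracted before being passed in.

```python
import json
from urllib.parse import urlparse

SITE_DOMAIN = "cfcunderwriting.com"  # domain treated as "internal" (assumption)


def is_external(link, site_domain=SITE_DOMAIN):
    """Return True if `link` is an absolute URL hosted on another domain."""
    host = urlparse(link).hostname
    if host is None:
        # Relative paths, '#', 'javascript:;' and similar have no hostname
        return False
    host = host.lower()
    return not (host == site_domain or host.endswith("." + site_domain))


def write_external_links(links, path="external_links.json"):
    """Write the de-duplicated external links to a JSON file and return them."""
    external = sorted({link for link in links if is_external(link)})
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"external_links": external}, fh, indent=2)
    return external
```

Comparing hostnames rather than searching the whole string is a design choice: it is stricter about what counts as "hosted on" the site, at the cost of needing absolute URLs.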

# Enumerate All Hyperlinks Function
Create a function that enumerates all hyperlinks hosted on the cfcunderwriting.com domain.

The easiest, non-ideal solution was to explicitly exclude the known external domains.  
The function could be improved to exclude every domain other than cfcunderwriting.com.

In [7]:
def enumerate_hyperlinks(links):
    """
    Return a dict mapping an index to each unique hyperlink.

    Links on known external resource domains are skipped, relative
    `/en-gb/` links are made absolute, and duplicates are dropped while
    keeping first-seen order.
    """
    hyperlinks = []
    for link in links:
        # Explicitly exclude known external resource domains
        if re.search(r"(cloudflare|googleapis|google)\.com", link, re.I):
            continue

        # Keep site links, absolute URLs and relative links to localised pages
        if re.search(r"cfcunderwriting\.com", link, re.I) or re.match(r"^(https?://|/en-)", link, re.I):
            hyperlinks.append(link)

    # A dict keeps insertion order and drops duplicate URLs
    unique_links = {}
    for link in hyperlinks:
        if re.match(r"^/en-gb", link, re.I):
            link = link.replace('/en-gb/', 'https://www.cfcunderwriting.com/en-gb/')
        unique_links[link] = None

    return dict(enumerate(unique_links))

enumerate_hyperlinks(parse_links())

{0: 'https://www.cfcunderwriting.com/en-gb/',
 1: 'https://www.cfcunderwriting.com/en-us/',
 2: 'https://www.cfcunderwriting.com/en-ca/',
 3: 'https://www.cfcunderwriting.com/en-au/',
 4: 'https://www.cfcunderwriting.com/en-gb/claims/',
 5: 'https://www.cfcunderwriting.com/en-gb/contact/',
 6: 'https://www.cfcunderwriting.com/en-gb/careers/',
 7: 'https://www.cfcunderwriting.com/en-gb/platform/',
 8: 'https://www.cfcunderwriting.com/en-gb/products/class/contingency/',
 9: 'https://www.cfcunderwriting.com/en-gb/products/class/cyber/',
 10: 'https://www.cfcunderwriting.com/en-gb/products/class/environmental-liability/',
 11: 'https://www.cfcunderwriting.com/en-gb/products/class/excess-umbrella/',
 12: 'https://www.cfcunderwriting.com/en-gb/products/class/intellectual-property/',
 13: 'https://www.cfcunderwriting.com/en-gb/products/class/kidnap-ransom/',
 14: 'https://www.cfcunderwriting.com/en-gb/products/class/management-liability/',
 15: 'https://www.cfcunderwriting.com/en-gb/products/
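
The improvement suggested above, keeping only links on cfcunderwriting.com instead of listing domains to exclude, could look roughly like this. It is a sketch under assumptions: `enumerate_internal_links` is a hypothetical name, `links` is assumed to hold only `<a href>` values, and relative links are resolved against the site root with `urljoin`.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.cfcunderwriting.com"
SITE_DOMAIN = "cfcunderwriting.com"


def enumerate_internal_links(links, base_url=BASE_URL, site_domain=SITE_DOMAIN):
    """Resolve links against `base_url` and enumerate the unique internal ones."""
    seen = {}
    for link in links:
        if not link or link.startswith("#"):
            continue  # empty or in-page anchors point nowhere new
        absolute = urljoin(base_url, link)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. 'javascript:;' or 'mailto:' links
        host = (parts.hostname or "").lower()
        if host == site_domain or host.endswith("." + site_domain):
            seen.setdefault(absolute, None)  # dict keeps first-seen order
    return dict(enumerate(seen))
```

Because every relative path is made absolute, the indices this returns would differ from those produced by `enumerate_hyperlinks`.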

# Parse Privacy Policy Page Function
Create a function that parses the privacy policy page.

First, obtain the Privacy Policy URL from the list of hyperlinks.  
The easy solution is to look up the hyperlink manually by its index.  
The ideal solution is to find the link that contains `privacy` and/or `policy` using a regex or equivalent.

In [8]:
hyperlinks = enumerate_hyperlinks(parse_links())
hyperlinks[67]

'https://www.cfcunderwriting.com/en-gb/support/privacy-policy/'
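
Rather than relying on index 67, which will shift whenever the index page changes, the URL can be located by pattern. A minimal sketch, assuming the enumerated dict format returned by `enumerate_hyperlinks` and assuming the URL itself contains "privacy-policy" or a close variant (`find_privacy_policy_url` and `PRIVACY_PATTERN` are hypothetical names):

```python
import re

# Matches "privacy-policy", "privacy_policy", "privacy policy" or "privacypolicy"
PRIVACY_PATTERN = re.compile(r"privacy[-_ ]?policy", re.IGNORECASE)


def find_privacy_policy_url(hyperlinks):
    """Return the first URL that looks like a privacy policy page, else None."""
    for url in hyperlinks.values():
        if PRIVACY_PATTERN.search(url):
            return url
    return None
```

Matching on the URL is only one option. Matching on the anchor text ("Privacy Policy") may be more robust, but it would require `parse_links` to keep the text of each `<a>` tag.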

In [31]:
def parse_privacy_page():
    """
    Fetch the privacy policy page and return its text as a list of strings.

    Each <h2> and <p> element inside <main> becomes one lower-cased string
    with numbers, newlines and non-breaking spaces replaced by spaces.
    """
    # Index 67 is hard-coded: it is where the privacy policy link sat when this was run
    url = enumerate_hyperlinks(parse_links())[67]
    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed

    response = requests.get(url).text
    soup = bs(response, parser)

    section_texts = []
    main_section = soup.find("main")

    for section in main_section.find_all(["h2", "p"]):
        section_text = section.text.replace('\n',' ').lower()
        # Strip integers and dotted numbers such as section numbering
        section_text = re.sub(r"(-?\d+)((\.(-?\d+))+)?", ' ', section_text)
        section_text = section_text.replace('\xa0',' ')
        section_texts.append(section_text)

    return section_texts

parse_privacy_page()

[' . our approach',
 '  this privacy policy (the “policy”) sets out how we cfc underwriting limited (registered company number  ) headquartered at second floor,   gracechurch street, london ec v  aa and any of our subsidiaries or holding companies (together referred to as “cfc underwriting”, “our” or “we”) process the personal data of our customers, brokers and website visitors in the european union (“users”).',
 '  if you have any questions about this policy, please contact our data protection officer (“dpo”) by clicking here.',
 ' ',
 ' . what information do we collect',
 'personal data',
 '  we will collect personal data when you obtain a quote for one of our products of services, or in the course of providing you with one of our products of services. we will also collect personal data when you register with us or provide your information through our website. the types of information we collect may include:',
 '  information you provide us in your insurance application, including na
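
The challenge asks for all of the visible text on the page, while the function above only reads `<h2>` and `<p>` elements inside `<main>`. One possible way to widen this, sketched here with the standard library's `html.parser` rather than BeautifulSoup, is to collect every text node that is not inside a tag whose content is never rendered. The class name, function name and the set of skipped tags are assumptions, and elements hidden through CSS are not detected.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside tags whose content is never rendered."""

    HIDDEN_TAGS = {"script", "style", "noscript", "template", "head", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of `html` as a single space-separated string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```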

# Word Frequency Function - SKL
Create a function that determines the word frequency of the privacy policy page using scikit-learn's `CountVectorizer`.

In [32]:
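def word_frequency_counter_skl():
    """
    Count word frequencies on the privacy policy page with CountVectorizer.
    """
    corpus = parse_privacy_page()
    cv = CountVectorizer()
    cv_fit = cv.fit_transform(corpus)
    counts = cv_fit.toarray().sum(axis=0)
    # vocabulary_ maps each term to its column index; it avoids
    # get_feature_names(), which was removed in scikit-learn 1.2.
    # int() converts NumPy integers so the result can be serialised with json.
    word_frequencies = {word:int(counts[i]) for word, i in sorted(cv.vocabulary_.items())}
    return {'word_frequencies': word_frequencies}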

word_frequency_counter_skl()

{'word_frequencies': {'aa': 1,
  'able': 1,
  'about': 8,
  'above': 2,
  'abroad': 1,
  'abuses': 2,
  'access': 2,
  'accordance': 3,
  'account': 2,
  'acting': 1,
  'activities': 1,
  'adaptive': 1,
  'address': 2,
  'addresses': 4,
  'adequately': 1,
  'adjust': 2,
  'administer': 2,
  'advanced': 1,
  'advisors': 1,
  'affect': 1,
  'after': 1,
  'afterwards': 1,
  'against': 2,
  'aggregated': 1,
  'all': 1,
  'alleged': 1,
  'allow': 1,
  'also': 6,
  'an': 8,
  'analyse': 1,
  'analysis': 1,
  'analytics': 1,
  'and': 48,
  'anonymised': 2,
  'anonymous': 1,
  'another': 1,
  'any': 17,
  'applicable': 6,
  'application': 2,
  'approach': 1,
  'appropriate': 1,
  'apps': 1,
  'are': 13,
  'as': 13,
  'ask': 1,
  'asked': 1,
  'assess': 4,
  'assist': 1,
  'associated': 1,
  'at': 4,
  'authorities': 1,
  'authority': 3,
  'based': 1,
  'basis': 1,
  'be': 9,
  'been': 2,
  'before': 1,
  'behalf': 1,
  'being': 2,
  'believe': 1,
  'below': 1,
  'benefit': 1,
  'better': 1,
  

The function could be improved to exclude the following tokens, which are not actual words:  
ec  
ip  
sar  
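
A possible way to apply such an exclusion list, sketched with only the standard library, is shown below. It also writes the result to a JSON file, as the challenge requires. The names `NON_WORDS`, `word_frequencies` and `write_word_frequencies`, and the output file name, are assumptions for illustration, and the tokenisation (runs of letters only) will not match `CountVectorizer` exactly.

```python
import json
import re
from collections import Counter

# Hypothetical exclusion list, taken from the tokens noted above
NON_WORDS = {"ec", "ip", "sar"}


def word_frequencies(texts, exclude=NON_WORDS):
    """Count case-insensitive alphabetic words in `texts`, skipping `exclude`."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of Unicode letters only
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if word not in exclude:
                counts[word] += 1
    return dict(counts)


def write_word_frequencies(texts, path="word_frequencies.json"):
    """Write the frequency count to a JSON file and return it."""
    frequencies = word_frequencies(texts)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"word_frequencies": frequencies}, fh, indent=2, ensure_ascii=False)
    return frequencies
```

A fixed exclusion list only covers tokens that have already been spotted. Checking tokens against a dictionary word list would generalise better, at the cost of an extra dependency.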


# Word Frequency Function - NLTK
Alternative function that calculates word frequencies with NLTK:

In [33]:
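def word_frequency_counter_nltk():
    """
    Count word frequencies on the privacy policy page with NLTK.

    word_tokenize needs the NLTK tokenizer data (e.g. nltk.download('punkt')).
    """
    corpus = ' '.join(parse_privacy_page())
    words = nltk.tokenize.word_tokenize(corpus)
    counts = nltk.FreqDist(words)

    # Keep purely alphabetic tokens, dropping punctuation and numbers
    word_frequencies = {word: freq for word, freq in counts.items() if word.isalpha()}
    return {'word_frequencies': word_frequencies}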

word_frequency_counter_nltk()

{'word_frequencies': {'our': 54,
  'approach': 1,
  'this': 9,
  'privacy': 6,
  'policy': 7,
  'the': 59,
  'sets': 1,
  'out': 6,
  'how': 3,
  'we': 44,
  'cfc': 4,
  'underwriting': 4,
  'limited': 1,
  'registered': 1,
  'company': 2,
  'number': 2,
  'headquartered': 1,
  'at': 4,
  'second': 1,
  'floor': 1,
  'gracechurch': 1,
  'street': 1,
  'london': 1,
  'ec': 1,
  'v': 1,
  'aa': 1,
  'and': 46,
  'any': 17,
  'of': 53,
  'subsidiaries': 2,
  'or': 49,
  'holding': 1,
  'companies': 1,
  'together': 1,
  'referred': 1,
  'to': 111,
  'as': 13,
  'process': 6,
  'personal': 32,
  'data': 47,
  'customers': 3,
  'brokers': 1,
  'website': 18,
  'visitors': 4,
  'in': 29,
  'european': 1,
  'union': 1,
  'users': 3,
  'if': 9,
  'you': 59,
  'have': 14,
  'questions': 2,
  'about': 8,
  'please': 6,
  'contact': 7,
  'protection': 6,
  'officer': 3,
  'dpo': 1,
  'by': 14,
  'clicking': 1,
  'here': 2,
  'what': 1,
  'information': 33,
  'do': 6,
  'collect': 11,
  'will': 11

# Word Frequency Function Comparisons

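Summary of the comparison of the two word counting functions:  
**In a single timed run, the scikit-learn function was slightly faster (919 ms vs 984 ms wall time) and returned fewer non-existent words.**  
Both timings include the HTTP requests made by `parse_privacy_page()`, so the difference may not reflect the counting step alone.  
`CountVectorizer`'s default token pattern also ignores single-character tokens (such as the `v` in the NLTK output), which is one reason its total count is lower.  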

Compare the total word counts of each frequency counter function:

In [38]:
print("(skl word count, ", "nltk word count)")
sum(word_frequency_counter_skl()['word_frequencies'].values()), sum(word_frequency_counter_nltk()['word_frequencies'].values())

(skl word count,  nltk word count)


(1986, 2005)
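
The totals alone do not show which tokens account for the difference. For example, `and` is counted 48 times by the scikit-learn function and 46 times by the NLTK one in the outputs above. A small helper such as the following could list every token whose count differs (a sketch; `compare_frequencies` is a hypothetical name, and it expects the inner `word_frequencies` dicts):

```python
def compare_frequencies(first, second):
    """Return {token: (count_in_first, count_in_second)} where the counts differ."""
    differences = {}
    for token in sorted(set(first) | set(second)):
        a, b = first.get(token, 0), second.get(token, 0)
        if a != b:
            differences[token] = (a, b)
    return differences
```

It could be called as `compare_frequencies(word_frequency_counter_skl()['word_frequencies'], word_frequency_counter_nltk()['word_frequencies'])`.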

Compare the speed of each word frequency counter function:

In [44]:
%%time
word_frequency_counter_skl()

CPU times: user 222 ms, sys: 321 µs, total: 222 ms
Wall time: 919 ms



In [45]:
%%time
word_frequency_counter_nltk()

CPU times: user 368 ms, sys: 0 ns, total: 368 ms
Wall time: 984 ms

