# Scottish Widows Document Scraping

For the literature library search:  
https://adviser.scottishwidows.co.uk/literature-library.html

For specific search criteria, for example *guides*:  
https://adviser.scottishwidows.co.uk/literature-library.html?n=1000&filter=swe:literaturelibrary/contenttype/guides
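
The filtered URL above is the base page plus two query parameters. As a minimal sketch, it can be assembled with the standard library. The helper name `build_search_url` is hypothetical, and reading `n` as the number of results per page is an assumption taken from the notebook's own comment about setting the number of items to 1000, not from any documented API of the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://adviser.scottishwidows.co.uk/literature-library.html"

def build_search_url(content_type, n=1000):
    """Return the literature-library URL filtered by content type (hypothetical helper)."""
    params = {
        "n": n,  # assumed: number of results shown on one page
        "filter": f"swe:literaturelibrary/contenttype/{content_type}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL form shown above
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"

print(build_search_url("guides"))
```

Other content types would presumably follow the same pattern, but only `guides` is confirmed by this notebook.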

In [18]:
!pip install --quiet PyPDF2
!pip install --quiet pycryptodome

Collecting pycryptodome
  Downloading pycryptodome-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Downloading pycryptodome-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Installing collected packages: pycryptodome
Successfully installed pycryptodome-3.20.0


In [19]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [2]:
import io, os
from urllib.parse import urlparse

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import PyPDF2


pd.options.display.max_rows = 100
pd.options.display.max_columns = 100


## PDF Content Extraction

In [2]:
# Use PyPDF2 to read a PDF from a URL

pdf_url = "https://adviser.scottishwidows.co.uk/assets/literature/docs/42365.pdf"
#pdf_url = "https://adviser.scottishwidows.co.uk//assets/literature/docs/27316.pdf"

response = requests.get(url=pdf_url)

assert response.status_code == requests.codes.ok


pdf_reader = PyPDF2.PdfReader( io.BytesIO(response.content) )

print(f"Total pages: {len(pdf_reader.pages)}")

for i, page in enumerate(pdf_reader.pages):
    page_text = page.extract_text()
    print(f"Page {i+1}: {page_text}")


Total pages: 3
Page 1:  
 
 
Which trust form should I use?  
 
For life assurance (i.e. non pension) contracts which are already set up and on risk, Scottish Widows 
currently offers a choice of four trusts. To help you choose which trust is most appropriate for your 
needs, a brief description of each trust and where i t may be used is given below.   
 
Placing a policy under trust usually means you are giving up all rights to the benefits under a policy, 
although, in a few very specific situations it ’s possible to retain certain benefits and our range of 
trusts takes this  into account.   
 
Rememb er a trust is a legal document.  If you ’re in any doubt as to which trust is most suitable 
for your policy and your requirements, please seek advice from your financial or legal 
adviser.    
 
If your policy is a regular premium policy it may be what is known as a “qualifying policy”. 
Placing a qualifying policy under trust can have tax implications and advice should always be 
sou

In [3]:
def get_pdf_pages(pdf_url):
    """Extract content of a pdf file page by page and return in a DataFrame with page_number and page_text columns"""
    
    url_parsed = urlparse(pdf_url)
    if url_parsed.scheme in ('file', ''): # possibly a local file
        assert os.path.exists(url_parsed.path)
        pdf_file = url_parsed.path
    else: # possibly a remote url, need to fetch it first
        response = requests.get(url=pdf_url)
        assert response.status_code == requests.codes.ok
        pdf_file = io.BytesIO(response.content)

    pdf_reader = PyPDF2.PdfReader( pdf_file )
        
    return pd.DataFrame([
            {"page_number": i+1, "page_text": page.extract_text()} 
            for i, page in enumerate(pdf_reader.pages)
        ])
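
The scheme test in `get_pdf_pages` relies on how `urlparse` splits its input: a bare path has an empty scheme, while a web address has `http` or `https`. The sketch below shows that behaviour with the standard library only. `is_local_path` is a hypothetical helper name, not part of the notebook, and the `/tmp` path is made up for the demonstration.

```python
from urllib.parse import urlparse

def is_local_path(pdf_url):
    """Mirror the scheme check used in get_pdf_pages (hypothetical helper)."""
    return urlparse(pdf_url).scheme in ("file", "")

for candidate in (
    "../data/56036.pdf",
    "file:///tmp/56036.pdf",  # illustrative path only
    "https://adviser.scottishwidows.co.uk/assets/literature/docs/42365.pdf",
):
    parsed = urlparse(candidate)
    print(f"{candidate!r}: scheme={parsed.scheme!r}, path={parsed.path!r}, "
          f"local={is_local_path(candidate)}")
```

One caveat to keep in mind: a Windows path with a drive letter would likely be parsed with the drive letter as its scheme, so this check is best suited to POSIX-style paths.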


In [4]:
# local_file = "../data/56036.pdf"
# df = get_pdf_pages(local_file)

df = get_pdf_pages(pdf_url)

df.head()              

|   | page_number | page_text |
|---|---|---|
| 0 | 1 | \n \n \nWhich trust form should I use? \n \n... |
| 1 | 2 | \n \n \n4. The Gift trust (creating fixed int... |
| 2 | 3 | \n \n \nPlease tick one of the boxes below to... |


## Search the literature library

In [5]:
# Default search: 10 items per page
search_url = "https://adviser.scottishwidows.co.uk/literature-library.html?filter=swe:literaturelibrary/contenttype/guides#search"

# Show all results on one page by setting the number of items (n) to 1000
search_url = "https://adviser.scottishwidows.co.uk/literature-library.html?n=1000&filter=swe:literaturelibrary/contenttype/guides"


In [6]:
search_response = requests.get(url=search_url)

soup = BeautifulSoup(search_response.content, "html.parser")

#search_response.content
print(soup.title)

<title>Literature | For Advisers | Scottish Widows</title>


### All the links in a page

In [7]:
print(soup.title.string)

# all links in the page
nb_links = len(soup.find_all('a'))
print(f"There are {nb_links} links in this page.\n")

# text from the page
#print(soup.get_text())

_ = [print(a) for a in soup.find_all('a')]

Literature | For Advisers | Scottish Widows
There are 574 links in this page.

<a href="https://www.scottishwidows.co.uk/" target="_self">visit our personal site</a>
<a aria-label="Continue to our website" class="positive btn btn-primary" href="#">
<span class="btn-text">Continue to our website</span>
</a>
<a accesskey="0" data-selector="skip-link-item-c_122_masthead_copy" data-tealium-event="Internal Click" data-tealium-narrative="Accessibility statement [Accesskey '0']" href="/accessibility.html">
<span class="btn-text">Accessibility statement [Accesskey '0']</span>
<span class="sr-only">Go to Accessibility statement</span>
</a>
<a accesskey="S" data-selector="skip-link-item-c_122_masthead_copy" data-tealium-event="Internal Click" data-tealium-narrative="Skip to Content [Accesskey 'S']" href="#main">
<span class="btn-text">Skip to Content [Accesskey 'S']</span>
<span class="sr-only">Skip to main content</span>
</a>
<a accesskey="N" data-selector="skip-link-item-c_122_masthead_copy" d

In [8]:
# all the resulting pdf files are in the anchor elements with "title" class
download_links = soup.find_all(class_="title")

print(f"Total downloadable links: {len(download_links)}")

download_links[0]

Total downloadable links: 223


<a class="title" download="" href="/assets/literature/docs/42365.pdf" target="_blank">
					Which trust form should I use?
				</a>

In [9]:
print(download_links[0].get("href"))
print(download_links[0].string.strip())

/assets/literature/docs/42365.pdf
Which trust form should I use?


In [10]:
pdf_uris = [download_link.get("href") for download_link in download_links]
# Each href already starts with "/", so the base URL must not end with one
_ = [print("https://adviser.scottishwidows.co.uk" + uri) for uri in pdf_uris]

https://adviser.scottishwidows.co.uk/assets/literature/docs/42365.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/53075.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/41848sp.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/56737.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/E2103.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/fsaSWplcFSAReturn2007.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/27316.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/28742a.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/56241.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/26489.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/54575.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/24696.pdf
https://adviser.scottishwidows.co.uk/assets/literature/docs/51084.pdf
https://adviser.scottishwidows.co.uk/assets/literature/do

In [11]:
def get_all_pdf_links(entry_page_url):
    """Extract all PDF links from a URL and return a DataFrame with title and url columns"""
    
    response = requests.get(url=entry_page_url)
    soup = BeautifulSoup(response.content, "html.parser")

    download_links = soup.find_all(class_="title")
    
    df = pd.DataFrame([
        {"title": pdf_link.string.strip(), "url": "https://adviser.scottishwidows.co.uk" + pdf_link.get("href")} 
        for pdf_link in download_links 
    ])
    
    return df
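
`get_all_pdf_links` builds absolute URLs by prefixing a fixed host, which works as long as every `href` is root-relative, like `/assets/literature/docs/42365.pdf`. A possible alternative, sketched here under the assumption that some links could be relative or already absolute, is `urllib.parse.urljoin`, which resolves each `href` against the page it was found on. The helper name `absolute_url` is illustrative.

```python
from urllib.parse import urljoin

PAGE_URL = "https://adviser.scottishwidows.co.uk/literature-library.html"

def absolute_url(href, page_url=PAGE_URL):
    """Resolve an href found on page_url into an absolute URL (hypothetical helper)."""
    return urljoin(page_url, href)

# Root-relative href, as seen in the search results
print(absolute_url("/assets/literature/docs/42365.pdf"))
# An already-absolute href is returned unchanged
print(absolute_url("https://adviser.scottishwidows.co.uk/assets/literature/docs/27316.pdf"))
```

Resolving against the page URL also avoids having to remember whether the base string ends with a slash.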


In [12]:
pdf_urls_df = get_all_pdf_links(search_url)
pdf_urls_df.tail()

|   | title | url |
|---|---|---|
| 218 | CPA (S1) Charges Leaflet | https://adviser.scottishwidows.co.uk/assets/li... |
| 219 | Scottish Widows Weekly Cash Fund Report | https://adviser.scottishwidows.co.uk/assets/li... |
| 220 | PPA (S2) Charges sheet | https://adviser.scottishwidows.co.uk/assets/li... |
| 221 | Product charges - REA series 2 | https://adviser.scottishwidows.co.uk/assets/li... |
| 222 | Halifax monthly plan charges - life | https://adviser.scottishwidows.co.uk/assets/li... |


In [13]:
pdf_urls_df.shape

(223, 2)

### Validating the end points

In [14]:
def check_url_exist(url):
    """Check whether the URL endpoint exists (responds with status 200)"""
    
    response = requests.get(url=url)
    
    return response.status_code == requests.codes.ok

def clean_pdf_urls(urls_df):
    """Remove all rows with invalid URLs from urls_df"""
    
    exist = urls_df["url"].apply(check_url_exist) # ToDo: do it in parallel
    
    return urls_df.loc[exist]

def is_encrypted(pdf_url):
    """Download the PDF at pdf_url and report whether it is encrypted"""
    response = requests.get(url=pdf_url)
    assert response.status_code == requests.codes.ok
    
    pdf_reader = PyPDF2.PdfReader( io.BytesIO(response.content) )

    return pdf_reader.is_encrypted

print(is_encrypted(pdf_url))

clean_pdf_urls(pdf_urls_df[0:10])

False


|   | title | url |
|---|---|---|
| 0 | Deed of Appointment of Additional Trustees | https://adviser.scottishwidows.co.uk/assets/li... |
| 1 | Deed of Nomination and Retirement of Protector | https://adviser.scottishwidows.co.uk/assets/li... |
| 2 | Scottish Widows Protect – The illnesses we cover | https://adviser.scottishwidows.co.uk/assets/li... |
| 3 | Scottish Widows Bank Premier Team Flyer | https://adviser.scottishwidows.co.uk/assets/li... |
| 4 | FSCS SWUTM flyer | https://adviser.scottishwidows.co.uk/assets/li... |
| 5 | Your guide to making withdrawals from your Col... | https://adviser.scottishwidows.co.uk/assets/li... |
| 6 | HBOS Important Notes For Applications | https://adviser.scottishwidows.co.uk/assets/li... |
| 7 | KFD COMPANY PENSIONBUILDER | https://adviser.scottishwidows.co.uk/assets/li... |
| 8 | GSHP Joining Guide | https://adviser.scottishwidows.co.uk/assets/li... |
| 9 | HIFML value assessment report | https://adviser.scottishwidows.co.uk/assets/li... |
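
The `ToDo` in `clean_pdf_urls` suggests running the checks in parallel. Each check spends most of its time waiting on the network, so a thread pool is one plausible way to do that. This is a sketch, not the notebook's implementation: `check_many` is a hypothetical helper, `max_workers=8` is an arbitrary choice, and the checker is passed in so that `check_url_exist` (or any other callable returning a bool) can be used.

```python
from concurrent.futures import ThreadPoolExecutor

def check_many(urls, checker, max_workers=8):
    """Apply checker(url) to every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(checker, urls))

# Stand-in checker so the example runs without network access (assumption for the demo)
def fake_checker(url):
    return url.endswith(".pdf")

demo_urls = [
    "https://adviser.scottishwidows.co.uk/assets/literature/docs/42365.pdf",
    "https://adviser.scottishwidows.co.uk/literature-library.html",
]
print(check_many(demo_urls, fake_checker))
```

In the notebook this could be used as `exist = check_many(urls_df["url"], check_url_exist)` followed by `urls_df.loc[exist]`. A modest worker count keeps the load on the site reasonable.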


In [15]:
pdf_urls_df[0:10].loc[lambda _s: _s.url.apply(check_url_exist)]

|   | title | url |
|---|---|---|
| 0 | Deed of Appointment of Additional Trustees | https://adviser.scottishwidows.co.uk/assets/li... |
| 1 | Deed of Nomination and Retirement of Protector | https://adviser.scottishwidows.co.uk/assets/li... |
| 2 | Scottish Widows Protect – The illnesses we cover | https://adviser.scottishwidows.co.uk/assets/li... |
| 3 | Scottish Widows Bank Premier Team Flyer | https://adviser.scottishwidows.co.uk/assets/li... |
| 4 | FSCS SWUTM flyer | https://adviser.scottishwidows.co.uk/assets/li... |
| 5 | Your guide to making withdrawals from your Col... | https://adviser.scottishwidows.co.uk/assets/li... |
| 6 | HBOS Important Notes For Applications | https://adviser.scottishwidows.co.uk/assets/li... |
| 7 | KFD COMPANY PENSIONBUILDER | https://adviser.scottishwidows.co.uk/assets/li... |
| 8 | GSHP Joining Guide | https://adviser.scottishwidows.co.uk/assets/li... |
| 9 | HIFML value assessment report | https://adviser.scottishwidows.co.uk/assets/li... |


In [16]:
#%%timeit -n 1 -r 1

df1 = clean_pdf_urls(pdf_urls_df)

(pdf_urls_df.shape, df1.shape)

((223, 2), (219, 2))

In [17]:
#%%timeit -n 1 -r 1

df2 = df1.url.apply(is_encrypted)

df2.shape

(219,)

In [None]:
# some files are encrypted: select the rows where is_encrypted returned True (df2 is that boolean Series)
df1.loc[df2]

## Collect all the PDF Contents

In [18]:
for index, row in pdf_urls_df.iloc[0:2].iterrows():
    print(f"-------------{index}----")
    print(get_pdf_pages(row.url).assign(title=row.title))


-------------0----
   page_number                                          page_text  \
0            1  OF ADDITIONAL TRUSTEES\nfor use with the Absol...   
1            2  1WHAT THIS FORM IS FOR\nYou have been provided...   
2            3  21. DEED OF APPOINTMENT\nThis Deed of Appointm...   
3            4  32. BACKGROUND\n2.1 This deed is supplemental ...   
4            5  44. DATA PRIVACY NOTICE (CONTINUED)\n• from an...   
5            6  55. SIGNATURES\nThis and the preceding pages a...   
6            7  65. SIGNATURES (CONTINUED)\n1st Additional tru...   
7            8  5. SIGNATURES (CONTINUED)\n2nd Additional trus...   

                                        title  
0  Deed of Appointment of Additional Trustees  
1  Deed of Appointment of Additional Trustees  
2  Deed of Appointment of Additional Trustees  
3  Deed of Appointment of Additional Trustees  
4  Deed of Appointment of Additional Trustees  
5  Deed of Appointment of Additional Trustees  
6  Deed of Appointment 

In [19]:

df = pd.concat(
    [
        get_pdf_pages(row.url).assign(title=row.title)
        for index, row in pdf_urls_df.iloc[0:10].iterrows() if check_url_exist(row.url)
    ],
    axis=0, 
    ignore_index=True)

print(df.shape)
df.head()

(84, 3)


|   | page_number | page_text | title |
|---|---|---|---|
| 0 | 1 | OF ADDITIONAL TRUSTEES\nfor use with the Absol... | Deed of Appointment of Additional Trustees |
| 1 | 2 | 1WHAT THIS FORM IS FOR\nYou have been provided... | Deed of Appointment of Additional Trustees |
| 2 | 3 | 21. DEED OF APPOINTMENT\nThis Deed of Appointm... | Deed of Appointment of Additional Trustees |
| 3 | 4 | 32. BACKGROUND\n2.1 This deed is supplemental ... | Deed of Appointment of Additional Trustees |
| 4 | 5 | 44. DATA PRIVACY NOTICE (CONTINUED)\n• from an... | Deed of Appointment of Additional Trustees |


In [20]:
df.tail()

|   | page_number | page_text | title |
|---|---|---|---|
| 79 | 18 | 18\nRetail Funds\n • • • • • • • • • • • • • •... | HIFML value assessment report |
| 80 | 19 | 19\nSophie O’Connor – \nIndependent AFM Chair ... | HIFML value assessment report |
| 81 | 20 | 20\nActive investing\nActive investing takes a... | HIFML value assessment report |
| 82 | 21 | 21\nIndex tracking\nAn index fund is construct... | HIFML value assessment report |
| 83 | 22 | 60039 (11/21)HBOS Investment Fund Managers Lim... | HIFML value assessment report |


In [21]:
#%%timeit -n 1 -r 1 # takes a few minutes to run (about 145 seconds in the run recorded below)
import time

start = time.time()
guides_df = pd.concat(
    [
        get_pdf_pages(row.url).assign(title=row.title)
        for index, row in pdf_urls_df.iterrows() if check_url_exist(row.url)
    ],
    axis=0, 
    ignore_index=True)
print(time.time() - start)

print(guides_df.shape)

guides_df.tail()

144.84666967391968
(2900, 3)


|   | page_number | page_text | title |
|---|---|---|---|
| 2895 | 1 | The following charges apply to your account fr... | PPA (S2) Charges sheet |
| 2896 | 2 | Where applicable, the charge for with-profits ... | PPA (S2) Charges sheet |
| 2897 | 1 | Policy charge (also known as a plan charge)\nT... | Product charges - REA series 2 |
| 2898 | 2 | Additional fund charge\nFor an initial period ... | Product charges - REA series 2 |
| 2899 | 1 | Monthly plan \ncharges for 2024.\nThe followi... | Halifax monthly plan charges - life |
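
The cell above requests every PDF twice: once in `check_url_exist` and again inside `get_pdf_pages`. One way to avoid the second request, sketched here as an assumption rather than the notebook's approach, is to attempt the extraction directly and record any failures. `load_all` is a hypothetical helper; the loader is injected so that `get_pdf_pages` can be passed in, and the `example.com` URLs are placeholders.

```python
def load_all(rows, loader):
    """Call loader(url) once per (title, url) pair.

    Returns (loaded, failed): loaded holds (title, result) pairs and
    failed holds (url, error message) pairs for calls that raised.
    """
    loaded, failed = [], []
    for title, url in rows:
        try:
            loaded.append((title, loader(url)))
        except Exception as exc:  # includes the AssertionError raised on a bad status code
            failed.append((url, f"{type(exc).__name__}: {exc}"))
    return loaded, failed

# Stand-in loader so the example runs offline (assumption for the demo)
def fake_loader(url):
    if "missing" in url:
        raise AssertionError("status code was not ok")
    return ["page 1 text", "page 2 text"]

loaded, failed = load_all(
    [("Guide A", "https://example.com/a.pdf"),
     ("Guide B", "https://example.com/missing.pdf")],
    fake_loader,
)
print(len(loaded), "loaded;", len(failed), "failed")
```

With the notebook's objects, the rows could come from `zip(pdf_urls_df.title, pdf_urls_df.url)`, and the frames could be combined with `pd.concat([pages.assign(title=title) for title, pages in loaded], ignore_index=True)`.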


In [22]:
guides_df.memory_usage(deep=True)

Index               128
page_number       23200
page_text      10408098
title            308960
dtype: int64

In [4]:
all_guides_file = "scottish_widows_all_guides.pq"

In [26]:
guides_df.to_parquet(all_guides_file)

In [7]:
os.stat(all_guides_file)

os.stat_result(st_mode=33188, st_ino=1048781, st_dev=2064, st_nlink=1, st_uid=1001, st_gid=1002, st_size=2535292, st_atime=1706190879, st_mtime=1706190879, st_ctime=1706190879)
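
Comparing the `memory_usage` output with `st_size` shows how much smaller the Parquet file is than the in-memory frame. The arithmetic below uses only the numbers printed above and decodes `st_mtime` as UTC. The comparison is rough, since the gap probably reflects both Python string overhead in memory and Parquet's encoding and compression on disk.

```python
from datetime import datetime, timezone

# Values copied from the outputs above
memory_bytes = 128 + 23200 + 10408098 + 308960  # Index + page_number + page_text + title
file_bytes = 2535292                             # st_size
mtime = 1706190879                               # st_mtime, seconds since the Unix epoch

print(f"in memory: {memory_bytes / 1e6:.1f} MB")
print(f"on disk:   {file_bytes / 1e6:.1f} MB")
print(f"ratio:     {memory_bytes / file_bytes:.1f}x")
print("written:  ", datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat())
```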

In [8]:
df2 = pd.read_parquet(all_guides_file)
df2.head()

|   | page_number | page_text | title |
|---|---|---|---|
| 0 | 1 | OF ADDITIONAL TRUSTEES\nfor use with the Absol... | Deed of Appointment of Additional Trustees |
| 1 | 2 | 1WHAT THIS FORM IS FOR\nYou have been provided... | Deed of Appointment of Additional Trustees |
| 2 | 3 | 21. DEED OF APPOINTMENT\nThis Deed of Appointm... | Deed of Appointment of Additional Trustees |
| 3 | 4 | 32. BACKGROUND\n2.1 This deed is supplemental ... | Deed of Appointment of Additional Trustees |
| 4 | 5 | 44. DATA PRIVACY NOTICE (CONTINUED)\n• from an... | Deed of Appointment of Additional Trustees |


In [9]:
df2.shape

(2900, 3)

In [None]:
pd.testing.assert_frame_equal(df2, guides_df)

## Scratch

In [None]:
#try_url = "https://adviser.scottishwidows.co.uk//assets/literature/docs/42365.pdf"
#try_url = "https://adviser.scottishwidows.co.uk//assets/literature/docs/fsaSWplcFSAReturn2007.pdf"
#try_url = "https://adviser.scottishwidows.co.uk//assets/literature/docs/27316.pdf"
#try_url = "https://adviser.scottishwidows.co.uk//assets/literature/docs/28742a.pdf"
#try_url = "https://adviser.scottishwidows.co.uk//assets/literature/docs/56241.pdf"

#try_url = "https://adviser.scottishwidows.co.uk/assets/literature/docs/52125.pdf"
#try_url = "https://adviser.scottishwidows.co.uk/assets/literature/docs/56696.pdf"
try_url = "https://adviser.scottishwidows.co.uk/assets/literature/docs/56036.pdf"
    
get_pdf_pages(try_url)