# Investigate: Non-English Attachments on English pages
This notebook explores GOV.UK pages whose attachments are not in English but are marked as English, the default language setting.

This is part of the accessibility work to ensure compliance with WCAG. These attachments currently fail WCAG because the screen-reading software that visually impaired people use to read GOV.UK pages will announce them as English. A user may therefore download an attachment that is not actually in English and then have to download another one, which makes the page less accessible.

## Approach
The approach is to identify a column in the pre-processed content store that contains the attachments. We find attachments by looking at the *attachment* element of the HTML code and the title associated with it. There are then two directions we can take:
1. Detect language of attachment via its title
     + The easiest method
     + Less reliable, because attachment titles are typically short and often contain abbreviations. Language detection is less effective the less text it has to scan, just as a person cannot reliably guess the language of a very short piece of text.
1. Detect language of contents of attachment
     + Harder, as the attachments need to be read in bulk
     + Attachments come in many formats, such as `.pdf`, `.doc`, `.csv` and `.html`, so a variety of readers is needed
     + More accurate, as there is more text to work with

We discard Option 2 because downloading all the attachments and reading their contents would be very slow.

In [None]:
import os
import time

import pandas as pd
import numpy as np

from bs4 import BeautifulSoup

from langdetect import detect_langs

# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
def extract_attachment_titles(html):
    """Extracts all the attachment titles from GOV.UK pages
    
    :param html: String of the HTML code for the GOV.UK page being passed in
    :return: list of all the attachment titles that were extracted from GOV.UK page
    
    """
    
    # pass html into BeautifulSoup class to apply methods on it
    soup = BeautifulSoup(html, 'html.parser')
    
    # initialise list to store results
    list_title = []
    
    # extract all text from `h2` element with class description `title` 
    # nested in `div` element with class description `attachment-details`
    for text in soup.find_all('div', class_ = 'attachment-details'):
        for title in text.find_all('h2', class_ = 'title'):
            list_title.append(title.get_text())
    
    return list_title
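
As a quick check of the function above, the sketch below runs the same lookup on a small, hypothetical HTML snippet. The markup and titles are made up for illustration and only mimic the `attachment-details` / `title` structure the function assumes; real GOV.UK pages may differ.

```python
from bs4 import BeautifulSoup

# hypothetical HTML, made up to mimic the structure the function looks for
sample_html = """
<section class="attachment">
  <div class="attachment-details">
    <h2 class="title">Annual report 2019</h2>
  </div>
</section>
<section class="attachment">
  <div class="attachment-details">
    <h2 class="title">Adroddiad blynyddol 2019</h2>
  </div>
</section>
<h2 class="title">Not an attachment</h2>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# same selection as `extract_attachment_titles`:
# `h2.title` elements nested in `div.attachment-details`
sample_titles = [title.get_text()
                 for div in soup.find_all('div', class_='attachment-details')
                 for title in div.find_all('h2', class_='title')]

print(sample_titles)
```

The `h2` outside any `attachment-details` block is not picked up, which is the behaviour we want.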

In [None]:
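def func_detectlangs(text):
    """Detects language of each text in a list, moving onto next text if an error is thrown
    
    :param text: A list of strings to detect the language of
    :return: A list with one entry per string, holding the languages detected and the
        confidence score associated to each, or None if detection failed for that string
    
    """

    results = []
    for txt in text:
        # handle errors per title so one failure does not discard the whole list
        try:
            results.append(detect_langs(txt))
        except Exception:
            results.append(None)
    return results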

## Load Data
The data used in this notebook is all the content that existed on GOV.UK on 6th August 2020.

Due to the sheer size of the data, we need to pre-specify the column headings, their dtypes, and which columns are dates so that the import:
- Works
- Runs relatively quickly

In [None]:
# create dictionaries and headers to specify dtype and date columns
dict_header = {'base_path':object,
               'content_id':object,
               'title':object,
               'description':object,
               'publishing_app':object,
               'document_type':object,
               'details':object,  
               'text':object,
               'organisations':object,  
               'taxons':object,
               'step_by_steps':object,
               'details_parts':object,  
               'first_published_at':object,
               'public_updated_at':object,
               'updated_at':object,
               'finder':object,
               'facet_values':object,  
               'facet_groups':object,
               'has_brexit_no_deal_notice':bool,
               'withdrawn':bool,
               'withdrawn_at':object,
               'withdrawn_explanation':object}
list_header_date = ['first_published_at',
                    'public_updated_at',
                    'updated_at',
                    'withdrawn_at']

# load data
df = pd.read_csv(filepath_or_buffer='../data/preprocessed_content_store_200820.csv.gz',
                 compression='gzip',
                 encoding='utf-8',
                 sep='\t',
                 header=0,
                 names=list(dict_header.keys()),
                 dtype=dict_header,
                 parse_dates=list_header_date)

In [None]:
del dict_header, list_header_date

To find webpages with attachments, we assume the following (based on a few example cases):
1. They have a non-empty list in the `'attachments': [...]` element

This is not perfect: we still pick up pages that don't have any attachments. Note that the check only matches the opening `'attachments': [`, so it also matches empty lists.

In [None]:
# have attachments in `details` column, under 'attachments'
# raw string avoids invalid escape sequences in the regex
df['details_attachment_exists'] = df['details'].str.contains(r"'attachments': \[", na=False)
df_attachment = df.query('details_attachment_exists == True')
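
One possible reason for pages without attachments slipping through is that the pattern only matches the opening of the list, so an empty `'attachments': []` also passes. The sketch below uses the standard `re` module on made-up `details` strings to compare the current pattern with a stricter, hypothetical one that requires something other than `]` after the bracket. Whether empty lists actually explain the stray pages has not been checked against the data.

```python
import re

# made-up `details` strings for illustration only
details_with = "{'attachments': [{'title': 'Form A'}], 'body': '<p>text</p>'}"
details_empty = "{'attachments': [], 'body': '<p>text</p>'}"
details_none = "{'body': '<p>text</p>'}"

# pattern used in this notebook: matches the start of the list, empty or not
pattern_current = re.compile(r"'attachments': \[")
# hypothetical stricter pattern: the first non-space character after `[` must not be `]`
pattern_strict = re.compile(r"'attachments': \[\s*[^\]\s]")

for details in (details_with, details_empty, details_none):
    print(bool(pattern_current.search(details)),
          bool(pattern_strict.search(details)))
```

The stricter pattern could be passed to `str.contains` in the same way as the current one.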

In [None]:
df_attachment[['title', 'publishing_app', 'document_type', 'details', 'text', 'public_updated_at']].sample(n = 5, random_state = 42)

***

## Extracting link titles
Let's extract the file names from the URLs so that we can start detecting the language. We will do this in two main stages:
1. Extract the urls from the HTML code
1. Extract the file names and extensions from the urls
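
The second stage can be sketched with the standard library alone. The URL below is a hypothetical example, not taken from the data, and `split_file_name` is an illustrative helper name rather than something defined elsewhere in this notebook.

```python
import posixpath
from urllib.parse import unquote, urlparse


def split_file_name(url):
    """Return (file name, extension) for the last path segment of a URL."""
    # keep only the path (drops query string and fragment) and decode %-escapes
    path = unquote(urlparse(url).path)
    file_name, extension = posixpath.splitext(posixpath.basename(path))
    # lower-case the extension so `.PDF` and `.pdf` are treated alike
    return file_name, extension.lower()


# hypothetical attachment URL for illustration
example_url = 'https://example.org/uploads/file/123/Welsh_language_scheme.PDF?download=1'
print(split_file_name(example_url))
```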

Some example webpages to test are:
- [MMR](https://www.gov.uk/government/publications/measles-mumps-and-rubella-lab-confirmed-cases-in-england-2019)
- [Dart Charge Bulletin](https://www.gov.uk/government/publications/dart-charge-bulletin-3-advice-for-foreign-hgv-drivers)
- [Tribunal decisions](https://www.gov.uk/employment-tribunal-decisions/miss-r-youd-v-elton-community-centre-2404942-2017)
    + This example does not work because the `title` needs to be taken from the `'attachments': [... 'title': ...]` part of the `details` field. This part seems to be in JSON format and sits outside of the HTML code. We have not yet managed to get it treated as a Python dictionary.

***

```python
import ast

test = df.query('base_path == "/employment-tribunal-decisions/miss-r-youd-v-elton-community-centre-2404942-2017"').iloc[0]['details']
soup = BeautifulSoup(test, 'html.parser')
print(soup.prettify())

# attempt to parse `details` as a Python literal instead of JSON:
# swapping single for double quotes breaks on apostrophes inside strings, and
# `json.loads(json.dumps(...))` only returns the original string unchanged.
# Assumes the string is a valid Python literal, which is not verified here.
test = ast.literal_eval(test)

isinstance(test, dict)
```
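
If the `details` field turns out to be the string form of a Python dictionary (single-quoted keys, as the `'attachments': [...]` snippet suggests), `ast.literal_eval` from the standard library may be able to parse it where `json.loads` cannot. The sketch below works on a made-up string. The `'url'` and `'body'` keys are assumptions, and the approach has not been tested on the real data.

```python
import ast

# made-up `details` string; the apostrophe would break a naive quote swap
details_str = (
    "{'attachments': [{'title': \"Claimant's judgment\", "
    "'url': '/media/abc/judgment.pdf'}], 'body': '<p>text</p>'}"
)

# parse the Python-literal string into a real dictionary
details = ast.literal_eval(details_str)

# collect the title of every attachment, tolerating a missing 'attachments' key
titles = [attachment.get('title') for attachment in details.get('attachments', [])]
print(titles)
```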

***

In [None]:
# extract attachment titles from the `details` column
df_extract = df_attachment[['base_path', 'text', 'details']].copy()
df_extract['attachment_title'] = df_extract['details'].apply(extract_attachment_titles)

# keep only pages where at least one attachment title was found
df_extract = df_extract[df_extract['attachment_title'].map(len) > 0]

# save a sample for manual inspection, to check the extraction worked correctly
df_extract.sample(n = 1000, random_state = 42).to_csv('../data/sample_attachments.csv')

## Language Detection
Let's apply language detection on our attachment titles now.

In [None]:
test = df_extract.iloc[:1000].copy()

# get length of each attachment list so we can see if language detection works
test['attachment_list_length'] = test['attachment_title'].apply(len)

In [None]:
%%time
test['attachment_lang'] = test['attachment_title'].apply(func_detectlangs)

In [None]:
test.query('attachment_list_length > 1')

In [None]:
test[test['base_path'].str.contains('/government/publications/foi-responses-publish')].iloc[0,]['attachment_lang']

Language detection performs poorly on abbreviations and short titles, which makes sense given how little text there is to go on.
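
One way to act on this would be to skip titles that give the detector too little to work with. The sketch below is a minimal filter using only the standard library. `has_enough_text`, the `min_words` threshold of 4 and the example titles are all hypothetical, and the threshold would need tuning against inspected samples.

```python
import re


def has_enough_text(title, min_words=4):
    """Return True if the title has at least `min_words` non-abbreviation words."""
    # alphabetic tokens only, so reference numbers and dates are ignored
    words = re.findall(r"[^\W\d_]+", title)
    # drop all-uppercase tokens, which are likely to be abbreviations
    words = [word for word in words if not word.isupper()]
    return len(words) >= min_words


# made-up titles for illustration
example_titles = ['FOI 2019/123',
                  'Ymateb i gais rhyddid gwybodaeth',
                  'HMRC annual report and accounts 2019']
print([has_enough_text(title) for title in example_titles])
```

Titles that fail the check could be left unclassified or sent for manual review rather than passed to `detect_langs`.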