# Investigate: Non-English Attachments on English pages
This notebook explores attachments on GOV.UK pages that are not in English but are marked as English, which is the default choice.

This is part of the Accessibility work to ensure compliance with WCAG. These attachments currently fail WCAG because screen-reading software, which visually impaired people use to read GOV.UK pages, will announce them as English. The person then downloads an attachment that is not actually in English and has to download another one, so the page is less accessible.

In particular, this notebook lays the code foundations so that the check can go into the report runner.

## Postulation
The approach this notebook takes is to identify a column in the pre-processed content store that contains the attachments. We define an attachment as text that ends in a file extension such as `.pdf`. There are two directions that we can then take:
1. Detect language of attachment via the name
     + Easiest method
     + Less reliable because names of attachments are typically short, and language detection is less effective when it has less text to scan
1. Detect language of contents of attachment
     + Harder as you need to read the attachments in bulk
     + Attachments come in many formats such as `.pdf`, `.doc`, `.csv`, `.html`, so we need a variety of ways to read the contents
     + More accurate because there is more text to work with
     
Option (2.) is on the back burner for now because we would need the full paths to the attachments, and downloading every attachment to read its contents would be very slow. A path could be constructed along these lines:
```
str('govuk/' + base_path + file_name)
```
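
For option (1.), file names tend to join words with hyphens, underscores and digits, and end in an extension, none of which helps language detection. A minimal sketch of how a name could be tidied before detection, using only the standard library. `filename_to_text` is a hypothetical helper and the file name is made up; neither is part of this notebook's pipeline.

```python
import os
import re


def filename_to_text(file_name):
    """Turn an attachment file name into space-separated words.

    Hypothetical helper, for illustration only.
    """
    # drop the extension, e.g. 'report-2020.pdf' -> 'report-2020'
    stem, _extension = os.path.splitext(file_name)
    # treat hyphens, underscores and digits as word separators
    words = re.split(r'[-_\d]+', stem)
    # remove empty strings left by leading, trailing or repeated separators
    return ' '.join(word for word in words if word)


print(filename_to_text('canllaw-treth-incwm_2020.pdf'))
```

The cleaned string could then be passed to a language detector in place of the raw name, although very short names are still likely to give unreliable results.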

In [None]:
import os
import time

import pandas as pd
import numpy as np
from multiprocessing import Pool

from langdetect import detect_langs

In [None]:
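from langdetect.lang_detect_exception import LangDetectException

def func_detectlangs(text):
    """Detects language of a text, returning None if detection fails so the next text can be processed
    
    :param text: A string to detect the language of
    :return: A list of the languages detected and the confidence score associated with each, or None if detection fails
    
    """

    try:
        return detect_langs(text)
    except (LangDetectException, TypeError):
        # no detectable features (e.g. empty or numeric-only text) or a non-string value such as NaN
        return None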

In [None]:
def func_detectlang_df(df):
    """Apply func_detectlangs() function on dataframe column
    
    :param df: A dataframe with `text` column to apply func_detectlangs() on
    :return: A dataframe with extra column `text_lang` that identifies the language of the text passed in and the level of confidence
    
    """
    df['text_lang'] = df['text'].apply(func_detectlangs)
    return df

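def func_detectlang_pool(df, func, n_cores):
    """Parallelises the func_detectlang_df function
    
    :param df: A dataframe to pass into `func`
    :param func: A function to apply to dataframe
    :param n_cores: Number of cores to parallelise on
    :return: A dataframe of the results from each core, concatenated back together
    """
    
    # split dataframe into one chunk per core
    df_split = np.array_split(df, n_cores)

    # context manager cleans up the worker processes, even if `func` raises an error
    with Pool(processes = n_cores) as p:
        df = pd.concat(p.map(func, df_split))
    
    return df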

In [None]:
# number of cores to use, leaving one free but never dropping below one
n_cores = max(1, (os.cpu_count() or 1) - 1)

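# file attachment extensions, joined into one regex with the dot escaped
# (a multi-line string would put its whitespace and newlines into the pattern)
list_extension = ['chm', 'csv', 'diff', 'doc', 'docx', 'dot', 'dxf', 'eps',
                  'gif', 'gml', 'ics', 'jpg', 'kml', 'odp', 'ods', 'odt', 'pdf',
                  'png', 'ppt', 'pptx', 'ps', 'rdf', 'ris', 'rtf', 'sch', 'txt',
                  'vcf', 'wsdl', 'xls', 'xlsm', 'xlsx', 'xlt', 'xml', 'xsd', 'xslt',
                  'zip']
file_attachment = r'\.(?:' + '|'.join(list_extension) + r')\b'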
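
In a regular expression an unescaped `.` matches any character, and whitespace inside the pattern is matched literally, so a pattern of file extensions is easiest to get right when it is built from a list. A rough sketch with the standard `re` module, using a shortened, illustrative list of extensions. `list_extension_demo`, `pattern_demo` and the two sample strings are made up for this example.

```python
import re

# shortened, illustrative list of extensions
list_extension_demo = ['csv', 'doc', 'docx', 'pdf', 'txt']

# escape the dot, group the alternatives and end on a word boundary
pattern_demo = r'\.(?:' + '|'.join(re.escape(ext) for ext in list_extension_demo) + r')\b'

unescaped = re.compile('.pdf|.txt')
escaped = re.compile(pattern_demo)

text_without_file = 'download the pdf version'
text_with_file = '<a href="/uploads/form.pdf">Form</a>'

# the unescaped dot matches the space before 'pdf', so this is a false positive
print(bool(unescaped.search(text_without_file)))
# the escaped pattern needs a literal '.' and only matches the real file link
print(bool(escaped.search(text_without_file)))
print(bool(escaped.search(text_with_file)))
```

The same pattern string can be passed to `Series.str.contains`, which treats its argument as a regular expression by default.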

## Load Data
The data used is all the content that existed on GOV.UK on 6th August 2020.

Due to the sheer size of the data, we need to pre-specify the column headings, their data types and which columns are dates to make the import process:
- Work
- Work relatively quickly
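
A small sketch of the idea on an in-memory, tab-separated sample rather than the real file: column types go through `dtype` and date columns through `parse_dates`. The two rows are invented for illustration, and `sample_tsv`, `dict_dtype` and `df_small` are hypothetical names.

```python
import io

import pandas as pd

# invented sample with the same tab-separated layout, but only three columns
sample_tsv = (
    "base_path\twithdrawn\tupdated_at\n"
    "/example-page\tFalse\t2020-08-05 09:30:00\n"
    "/another-page\tTrue\t2020-08-06 10:00:00\n"
)

# declare the types up front so pandas does not have to infer them
dict_dtype = {'base_path': object, 'withdrawn': bool}

df_small = pd.read_csv(io.StringIO(sample_tsv),
                       sep='\t',
                       header=0,
                       dtype=dict_dtype,
                       parse_dates=['updated_at'])

print(df_small.dtypes)
```

Declaring types saves pandas from inferring them across a very large file, which is where most of the time and memory tends to go.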

In [None]:
# create dictionaries and headers to specify dtype and date columns
dict_header = {'base_path':object,
               'content_id':object,
               'title':object,
               'description':object,
               'publishing_app':object,
               'document_type':object,
               'details':object,  
               'text':object,
               'organisations':object,  
               'taxons':object,
               'step_by_steps':object,
               'details_parts':object,  
               'first_published_at':object,
               'public_updated_at':object,
               'updated_at':object,
               'finder':object,
               'facet_values':object,  
               'facet_groups':object,
               'has_brexit_no_deal_notice':bool,
               'withdrawn':bool,
               'withdrawn_at':object,
               'withdrawn_explanation':object}
list_header_date = ['first_published_at',
                    'public_updated_at',
                    'updated_at',
                    'withdrawn_at']

# load data
df = pd.read_csv(filepath_or_buffer='../data/content_store/preprocessed_content_sotre_060820.csv.gz',
                 compression='gzip',
                 encoding='utf-8',
                 sep='\t',
                 header=0,
                 names=list(dict_header.keys()),
                 dtype=dict_header,
                 parse_dates=list_header_date)

In [None]:
del dict_header, list_header_date

In [None]:
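# quick check of how `str.contains` behaves with a regex of alternatives (dots escaped so they match literally)
test = pd.DataFrame(data = {'column': ['this is a .pdf and we also have a .txt', 'this only has .pdf', 'this only has .txt', 'this has nothing']})
test['column'].str.contains(r'\.pdf|\.txt', na = False)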

In [None]:
# flag the pages that have attachments in the `details` column
#df.apply(lambda col: col.str.contains('.pdf', na = False), axis = 1)
df['details_attachment_exists'] = df['details'].str.contains(file_attachment, na = False)
df_attachment = df.query('details_attachment_exists == True')

In [None]:
df_attachment[['title', 'publishing_app', 'document_type', 'details', 'text', 'public_updated_at']].sample(n = 5, random_state = 42)

***

From previewing the data, we can spot patterns for extracting `.pdf` files. They appear to exist in either `src=` or `html=`.

In [None]:
df_attachment['details'].sample(n = 1000, random_state = 42).to_csv('../data/sample_details.csv')

***

## Preprocessing
Let's extract the file names from the URLs so that we can start detecting the language. We will do this in two main stages:
1. Extract the urls from the HTML code
1. Extract the file names and extensions from the urls
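
The two stages can be sketched on a small, made-up HTML snippet. To keep the sketch free of dependencies it swaps in the standard library's `html.parser` for BeautifulSoup. `HrefCollector` and `html_demo` are hypothetical names used only here.

```python
import posixpath
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


# invented snippet standing in for one value of the `details` column
html_demo = (
    '<p><a href="/government/uploads/annual-report.pdf">Report</a> '
    '<a href="/government/uploads/data-tables.csv">Data</a></p>'
)

# stage 1: extract the URLs from the HTML code
collector = HrefCollector()
collector.feed(html_demo)

# stage 2: keep the last part of each URL path, i.e. file name and extension
file_names = [posixpath.basename(link) for link in collector.links]
print(file_names)
```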

In [None]:
from bs4 import BeautifulSoup

In [None]:
# get smaller cut to work with
test = df_attachment[['details']].sample(n = 1000, random_state = 42).copy()

In [None]:
def get_links(text):
    """Extracts the href part of HTML code
    
    :param text: A string of HTML code that we want to extract the href attribute from
    :return: A list of links and other things we have extracted from the href attribute
    
    """
    list_links = []
    soup = BeautifulSoup(text, "html.parser")
    for a in soup.find_all('a', href = True):
        link = a['href']
        list_links.append(link)
    return list_links

In [None]:
# replace the raw HTML in `details` with the list of links it contains
test['details'] = test['details'].apply(get_links)

In [None]:
pd.options.display.max_colwidth = 1000
test['details']

In [None]:
def extract_filename(list_text):
    """Extracts the last part of a URL path string, including the file name and extension
    
    :param list_text: A list of strings to extract last part from, e.g. everything after '/'
    :return: A list of the same length as list_text, but with the last parts kept e.g. everything after '/'
    
    """
    file_name = [os.path.split(text)[1] for text in list_text]
    return file_name
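
Links can carry query strings, fragments or percent-encoded characters, which `os.path.split` keeps as part of the name. A possible variant, assuming that behaviour is unwanted, parses the URL first with `urllib.parse`. `extract_filename_from_url` is a hypothetical helper and the URL is made up; it is not the function used in the rest of this notebook.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def extract_filename_from_url(url):
    """Return the decoded file name at the end of a URL path.

    Hypothetical helper, for illustration only.
    """
    # keep only the path, dropping any '?query' and '#fragment'
    path = urlsplit(url).path
    # take the last path segment and decode escapes such as '%20'
    return unquote(posixpath.basename(path))


print(extract_filename_from_url('https://example.com/uploads/ffurflen%20gais.pdf?download=1#page=2'))
```

Decoding the name matters for the language detection step, since `%20` and similar escapes would otherwise be scanned as if they were text.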

In [None]:
test['attachment_name'] = test['details'].apply(extract_filename)

In [None]:
test.head(5)