This notebook visualizes the results of the sliding-window QA model approach on the Modern Slavery Dataset. The approach itself is run in the notebook 'QA-sliding window.ipynb'. The idea is to use a pretrained QA model (one trained on SQuAD v2, so that it can return a "no span found" result) to ask questions of the documents in a zero-shot configuration. Since most documents in the dataset are longer than the model's maximum input length, a sliding-window approach is used: after the entire document is tokenized, the QA model is run on successive windows, each offset from the previous one by stride=128 tokens (roughly a quarter of the window size). All spans returned by the QA model are recorded in a new dataframe (df_with_segments.parquet).
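
The windowing step described above can be sketched as follows. This is only an illustration, not the code in SlidingWindowTransformersQA.py: the function name `sliding_windows` is hypothetical, and `window_size=512` is an assumption inferred from the stride being about a quarter of the window (the real window may be smaller once the question tokens are accounted for).

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    token_ids: List[int],
    window_size: int = 512,  # assumed value, not confirmed by the notebook
    stride: int = 128,       # stride stated in the text above
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (start_offset, window) pairs that together cover token_ids."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not token_ids:
        return
    start = 0
    while True:
        yield start, token_ids[start:start + window_size]
        # Stop once the current window reaches the end of the document.
        if start + window_size >= len(token_ids):
            break
        start += stride


# Example: a 1000-token document split into overlapping windows.
doc_tokens = list(range(1000))
window_starts = [start for start, _ in sliding_windows(doc_tokens)]
```

Because consecutive windows overlap heavily, the same answer span can be returned from several windows, so a real pipeline would likely need to deduplicate spans after mapping them back to document offsets.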

This viewer is used to display the spans found by the QA model to investigate whether the model is providing useful filtering. As such, it is important to be able to visualize not only the spans, but also the text that the model ignored. Therefore, the entire documents are printed and the spans that were found are highlighted. The model was run with 6 potentially useful questions. Here, each question is assigned a different color as shown in the legend below. Where multiple questions returned the same span, that span is printed multiple times, once in each color.
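
As a rough illustration of the highlighting scheme, the sketch below colors a span once per question that returned it. It is not the implementation of `print_doc_with_highlights`; the helper names are hypothetical, and it assumes the standard ANSI background codes 41 to 46 (red, green, yellow, blue, magenta, cyan) are assigned to the six questions in order.

```python
from typing import List

ANSI_RESET = "\x1b[0m"
FIRST_BACKGROUND_CODE = 41  # ANSI red background; 42..46 follow in order


def highlight(text: str, question_index: int) -> str:
    """Wrap text in black-on-color ANSI codes for the given question (0-based)."""
    code = FIRST_BACKGROUND_CODE + question_index
    return f"\x1b[30;{code}m{text}{ANSI_RESET}"


def render_span(span_text: str, question_indices: List[int]) -> str:
    """Repeat the span once per question that returned it, each in its own color."""
    return "".join(highlight(span_text, i) for i in question_indices)


# A span returned by the first and third questions is printed twice.
print(render_span("we provide training for our staff.", [0, 2]))
```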

In [1]:
import pandas as pd

# the custom functions used in this notebook are defined in SlidingWindowTransformersQA.py:
from SlidingWindowTransformersQA import get_tokenizer, print_legend, print_doc_with_highlights

In [2]:
model_name = 'twmkn9/albert-base-v2-squad2' # Model selection criteria detailed in 'QA-sliding window.ipynb'
tokenizer = get_tokenizer(model_name)

# Read in both the df with the string versions of the spans and the df with the tokens and token classes, concatenating:
df = pd.concat(
    [pd.read_parquet('df_with_segments.parquet'), pd.read_parquet('df_token_classes.parquet')],
    axis=1,
)
questions = [col for col in df.columns if '?' in col]

# view the resulting df:
df

Unnamed: 0,ID,source,TEXT,Is there training provided?,Is there training already in place?,Has training been done?,Is training planned?,Is training in development?,What kind of training is provided?,tokens,token classes
0,0,labeled,Modern Slavery Statement\n\nUa\n\n> Responsibi...,[we will also aim to develop training for our ...,[],[we will also aim to develop training for our ...,[we will also aim to develop training for our ...,[we will also aim to develop training for our ...,[],"[773, 9822, 3331, 13, 3786, 13, 1, 4024, 13, 1...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
1,1,labeled,Burton's Biscuit Company (a trading name of Bu...,[the board carries out a strategic risk assess...,[training relevant to other members of staff i...,[training relevant to other members of staff i...,[brc global standards (a manufacturing certifi...,"[no discrimination is practised., training rel...",[],"[9759, 22, 18, 20947, 237, 13, 5, 58, 5205, 20...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
2,2,labeled,MODERN SLAVERY ACT STATEMENT\nOUR BUSINESS Zal...,[all factories must provide a recent audit don...,[training the code of conduct is part of our <...,[training the code of conduct is part of our <...,"[all work shall be voluntary,, the training is...",[training the code of conduct is part of our <...,[<unk>compliance basics<unk> training],"[773, 9822, 601, 3331, 318, 508, 13, 10662, 13...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
3,3,labeled,MENU\nHOME\nU.K. MODERN SLAVERY ACT STATEMENT\...,"[no forced labor prison, indentured, bonded, i...",[],"[in 2017, we also conducted purchasing practic...","[no forced labor prison, indentured, bonded, i...",[any associate who contracts a factory that us...,[],"[11379, 213, 287, 9, 197, 9, 773, 9822, 601, 3...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
4,4,labeled,Modern Slavery Act Statement\nIntroduction fro...,[we pay all employees at least the national li...,[],[we provide training for our staff.],[we provide training for our staff.],[we provide training for our staff.],[],"[773, 9822, 601, 3331, 3445, 37, 14, 903, 1452...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
...,...,...,...,...,...,...,...,...,...,...,...
976,326,hidden,CECP Advisors LLP Modern Slavery Act Statement...,[training with respect to the msa policy will ...,[],[training with respect to the msa policy will ...,[training with respect to the msa policy will ...,"[training and communication as necessary,]",[training with respect to the msa policy will],"[23943, 306, 6721, 18, 13, 211, 306, 773, 9822...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
977,327,hidden,Modern Slavery Act Transparency Statement\n201...,[we firmly believe the work we have started re...,"[training training on ethical buying, social c...",[not be subjected to harsh or inhumane treatme...,[workers should not be subjected to harsh or i...,[not be subjected to harsh or inhumane treatme...,"[ethical buying, social compliance and factory...","[773, 9822, 601, 19668, 3331, 690, 8, 1053, 84...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
978,328,hidden,MENU\n\n0333 2203 121\nBOOK A ROOM\n\nAnti Sla...,[management at all levels are responsible for ...,[],[management at all levels are responsible for ...,[any person working for our business or as a s...,[],[adequate and regular training],"[11379, 713, 20165, 1024, 3601, 13, 12586, 360...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
979,329,hidden,"We have placed cookies on your computer, as th...",[greater internal training will be given to re...,[],[greater internal training will be given to re...,[greater internal training will be given to re...,[greater internal training will be given to re...,[greater internal training will],"[95, 57, 1037, 19396, 27, 154, 1428, 15, 28, 5...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."


In [3]:
print_legend(questions)

****************************************************************************************************
                                               LEGEND                                               
****************************************************************************************************
Is there training provided? (red)
Is there training already in place? (green)
Has training been done? (gold)
Is training planned? (blue)
Is training in development? (purple)
What kind of training is provided? (aqua)
****************************************************************************************************



In [4]:
# Let's investigate the first 10 documents:
for row_id in range(10):
    print('*'*100)
    print(f'row_id = {row_id}')
    print_doc_with_highlights(text = df.loc[row_id,'TEXT'],
                              tokens = df.loc[row_id,'tokens'],
                              token_classes = df.loc[row_id,'token classes'],
                              questions = questions,
                              tokenizer = tokenizer
                             )

****************************************************************************************************
row_id = 0

Modern Slavery Statement

Ua

> Responsibility > Modern Slavery Statement
Anti-Slavery and Human Trafficking Statement for the year ending 31 March 2018
Introduction from the Board
AO World plc (“AO”) fully supports the governm

MODERN SLAVERY ACT STATEMENT
OUR BUSINESS Zalando is a leading European online fashion platform for women, men and children serving 15 European markets. Our operating activities include development, sourcing, marketing and retail. Via our online store we sell apparel, shoes and accessories from different international brands as well as private label products. The latter are managed, sourced and sold by Zalando SE’s wholly-owned subsidiary, zLabels GmbH. zLabels goods are produced in 27 countries across the world. In 2016, we sourced our products from 278 suppliers with 464 factories. We do not own any of the production facilities. In order to sell products, Zalando has built a complex online platform, providing partners with services including IT intelligence or fulfilment. All transportation services from the countries of production to the customer as well as the return process are provided by business partners.
Zalando’s supply chain
The core of Zalando’s operations is conducted in G

MENU
HOME
U.K. MODERN SLAVERY ACT STATEMENT
U.K. MODERN SLAVERY STATEMENT FOR FISCAL 2017
This statement is made pursuant to Section 54 of the U.K. Modern Slavery Act and the California Transparency in Supply Chains Act and outlines the efforts L Brands has taken and is continuing to take to ensure that forced labor is not occurring within our supply chain.
Forced labor includes prison, indentured, bonded, involuntary or slave labor or labor obtained through human trafficking. L Brands has a zero-tolerance policy regarding forced labor. We are committed to operating ethically and with integrity and transparency in all business dealings and to putting effective systems and controls in place to safeguard against any form of forced labor taking place within our supply chain.
OUR BUSINESS
More than stores, more than products ... L Brands is a family of brands. Our brands are world-renowned; they are household names. Through Victoria's Secret, PINK, Bath & Body Works, La Senza and Henri Ben

CONTENTS

PRIME MINISTER FOREWORD

1

EXECUTIVE SUMMARY

2

SECTION 1

STRUCTURE, BUSINESS AND SUPPLY CHAINS 

4

SECTION 2

POLICIES IN RELATION TO MODERN SLAVERY

7

SECTION 3

RISK ASSESSMENT AND DUE DILIGENCE

10

SECTION 4

TRAINING AND AWARENESS RAISING

23

SECTION 5

GOALS AND KEY PERFORMANCE INDICATORS (KPIs)

25

1

GOVERNMENT MODERN SLAVERY STATEMENT

PRIME MINISTER FOREWORD

Around the world, something in the region of 40 million innocent men, women and even children have been forced into various forms of modern slavery.
Many are here in the UK. Still more are abroad. All are victims of a vile business that has no place in the last century, let alone this one.
Those behind such crimes, these traders in human misery, must and will be ruthlessly hunted down and brought to justice.
And, while that happens, we should absolutely not be lining their pockets with British taxpayers’ money.
That may sound like a statement of the blindingly obvious.
But with complex and ofte

Jones Lang LaSalle Incorporated
Modern Slavery and Human Trafficking
01 April 2017

Table of Contents
Introduction .... 3 About JLL.... 4
Our Policies on Slavery and Human Trafficking .... 5 Due Diligence Processes for Slavery and Human Trafficking .... 6 Slavery and Human Trafficking Risks in JLL’s Business.... 7 Our Effectiveness in Combatting Slavery and Human Trafficking .... 7 Board of Directors Approval .... 7

COPYRIGHT © JONES LANG LASALLE IP, INC. 2017. All Rights Reserved

2

Introduction
As a company that carries out a portion of its business in the United Kingdom, Jones Lang LaSalle Incorporated and its subsidiaries (“JLL”) approves and issues this Modern Slavery and Human Trafficking Statement under Section 54(1) of the UK Modern Slavery Act 2015 (MSA). JLL carries on business in the UK through its UK based affiliates, including Jones Lang LaSalle Limited and LaSalle Investment Management, but our responsibilities and commitment to uphold the principles of the Modern Sla

# Observations:
On the whole, the model does surprisingly well at detecting the parts of a document that discuss training. However, its performance is not yet good enough to classify documents on its own. Namely:
- The model is returning a number of irrelevant spans. For example:
    - In doc 1 there are 3 spans returned that have nothing to do with training. However, none of these are returned for more than one question.
    - In doc 2, like in doc 1, some spans are returned that have nothing to do with training. However, one span (the last, gold span) is related to training despite only being picked up by one question. This means that we cannot exclude spans only detected by one question.
    - In doc 3, there are some spans getting picked up by multiple questions that do not have to do with training.
    - Many other docs also have spans picked up that are not related to training: doc 4, doc 6 (quite a few here), doc 8, doc 9
- The model sometimes picks up a span, but not a large-enough one. For example:
    - In doc 3, there is an entire section about training (starting with the header "TRAINING" which was picked up by the purple question). However, only a few subspans of this section are returned. There is lots of important and relevant text that is skipped.
    - In doc 4, a span is picked up by the red question: "provide training where required and", but in order to understand the nature of this training, we need this text from earlier in the same sentence: "we have either already taken, or intend to take the following steps:"
    - In doc 6, a section was flagged, but the model failed to flag these adjacent passages: "A further 455 government commercial staff, public sector and third sector organisations have attended CCS and Home Office modern slavery training events to raise awareness of modern slavery in public procurement and practical steps to prevent it." and "In addition to training government commercial staff, we have worked to upskill government suppliers to help them to take effective steps to prevent modern slavery in the supply chains of the goods and services they provide government"
    - In doc 9, this section was not flagged: "A specific off-the-shelf training on identifying signs of human trafficking has been developed for additional targeted implementation expectations for Bechtel colleagues in groups such as Human Resources, Procurement and Ethics and Compliance who may have a greater exposure to these areas of risk in their work. This slide deck also serves as a training for voluntary use during safety awareness briefings, project kick-off briefings, orientations, team seminars and other similar circumstances." However, the sentence before it, which also discusses training, was flagged.
- The model also sometimes fails completely to flag spans that discuss training, although this was only observed in doc 3. For example, these spans were not flagged, nor were they near any other flagged spans:
    - "IPS has been supporting global compliance for more than two decades, enabling improvement in labor standards and workplace conditions, supply chain security, trade compliance and brand protection in our supply chain through monitoring, remediation, capacity building and training."
    - "Some of these stakeholders help us to manage risk through programs that enhance the rule of law (through training and capacity building) and other collaborative activities."
    - "L Brands’ Supplier Code of Conduct, our Ethics Hotline, training and the compliance standards listed in this statement help to prevent the use of forced labor in our supply chain. Our Master Sourcing Agreement, IPS Compliance Guidebook, due diligence, monitoring, remediation and training ensure that our suppliers are aware of our policies and implement our standards in their processes to minimize the risk of forced labor."
- The model does not do a good job of discriminating between training already in place and training in development. For example:
    - In doc 0, the model does an ok job of detecting that training is currently in development (blue and purple questions) but not yet in place (green and aqua questions). However, the red and gold questions probably should not have flagged here.
    - In doc 1, "is being rolled out" probably should not flag for the green question ("Is training already in place?")
    - In doc 8, "We are working towards incorporating specific anti-slavery and anti-human-trafficking content into our ethics training for employees and for suppliers in high risk sectors and geographies." probably should not flag the gold question ("Has training been done?")
- Doc 5 did not display the highlights in the correct place. It has a number of flagged spans for various questions in the df that are not showing when printed. None of the other 9 documents had a problem with this.
- The aqua question ("What kind of training is provided?") seems to return shorter spans that are frequently missing important context (for example, in doc 9) and that are usually also picked up by at least one of the other questions. There does not appear to be any difference in performance among the other 5 questions; none of them returned noticeably fewer false positives.

# Recommendations / Next Steps:
- The model returns extraneous spans and misses important ones. Since this was just a proof of concept using a relatively small pre-trained model, the next step should be to re-run the analysis using a larger, slower model.
- Since a common problem seems to be that the spans are too short, a method could be implemented to expand all flagged spans. For example, each span could be expanded to include the sentence immediately preceding and immediately following the flagged span.
- The question "What kind of training is provided?" can be dropped.
- Many false positive spans are returned. This is not necessarily a problem if another classifier is applied on top of the returned spans. The model is still useful for shortening the documents ahead of classification, which helps both machine learning classifiers and human labelers (a human can label a document much more easily when only a subset of it has to be read).
    - One approach could be to use this model to filter the documents, then have humans label some of the filtered documents, then train a machine learning model on the filtered documents to predict the human labels. ML models to be considered include a sequence-classification transformer, a 1-d CNN, or even something simpler like a bag-of-words classifier.
- Since the model does not do a good job of discriminating between training already in place and training in development, this task will likely need to fall to the subsequent classifier mentioned in the previous bullet.
- As the task of discriminating between types of training will be pushed to subsequent models, there is no need to maintain separate spans by question. These can be compressed into a 1-d tensor in which the class of each token is 1 if any of the questions flags it as 1.
- An investigation needs to be conducted into why doc 5 did not display correctly. I suspect this is likely an issue with the token_start_locations() function due to an inconsistency in converting string to token back to string.
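
Two of the recommendations above, collapsing the per-question token classes and expanding flagged spans by one sentence on each side, could be prototyped along these lines. This is a sketch under stated assumptions, not existing project code: both function names are hypothetical, the 'token classes' entry is assumed to hold one 0/1 list per question, spans are assumed to be character offsets into the original text, and the regex sentence splitter is deliberately naive.

```python
import re
from typing import List, Sequence, Tuple

# Naive sentence pattern: text up to ., ! or ? (or a newline).
# Abbreviations such as "U.K." will be split incorrectly.
_SENTENCE_RE = re.compile(r"[^.!?\n]+[.!?]*")


def merge_token_classes(token_classes: Sequence[Sequence[int]]) -> List[int]:
    """Return 1 for each token flagged by at least one question, else 0.

    Assumes token_classes holds one equal-length 0/1 sequence per question.
    """
    return [int(any(column)) for column in zip(*token_classes)]


def expand_span(text: str, start: int, end: int, n_context: int = 1) -> Tuple[int, int]:
    """Grow a character span to whole sentences plus n_context neighbors per side."""
    sentences = [m.span() for m in _SENTENCE_RE.finditer(text)]
    # Indices of the sentences that overlap the flagged span.
    hits = [i for i, (s, e) in enumerate(sentences) if s < end and e > start]
    if not hits:
        return start, end
    first = max(hits[0] - n_context, 0)
    last = min(hits[-1] + n_context, len(sentences) - 1)
    return sentences[first][0], sentences[last][1]


# Toy example with three questions over four tokens.
merged = merge_token_classes([[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]])

text = ("We audit suppliers. We provide training for our staff. "
        "Policies are reviewed yearly. Contact us.")
span_start = text.index("provide training")
new_start, new_end = expand_span(text, span_start, span_start + len("provide training"))
expanded = text[new_start:new_end].strip()
```

A proper sentence tokenizer would be a safer choice than the regex for statements that contain abbreviations, bullet fragments, or table-of-contents lines.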