# Prepare labelled Q/A data with SageMaker
In this notebook we'll prepare a dataset for supervised fine-tuning with Amazon SageMaker. Specifically, we'll take ten recent papers from the Amazon Science community and send them to a Mechanical Turk workforce through SageMaker Ground Truth.

### Step 1. Download the data
First, let's take a look at the raw paper data. The papers have already been converted to plain-text files, without any of the original images, and uploaded to an S3 bucket. Let's download them locally and take a look.

In [1]:
!mkdir -p data

In [6]:
s3_path = 's3://dist-train/amazon-science/Amazon Science Training Data'

In [10]:
!aws s3 sync '{s3_path}' data

In [99]:
!ls data

five.txt  four.txt  nine.txt  one.txt  seven.txt  ten.txt  three.txt  two.txt


### Step 2. Extract the abstracts
Next, let's pull just the abstracts out of these papers. Fortunately, in these 10 samples the abstract always ends where the next heading begins, and that heading is either `Introduction` or `INTRODUCTION`. We'll split each paper on that word and keep everything before it: the paper title, author names, and abstract.

In [94]:
import os

def check_abstracts(abstracts):
    # print each file path followed by the character length of its extracted text
    for path, record in abstracts.items():
        print(path)
        print(record['length'])

def papers_etl(local_data_path):
    # maps each file path to its extracted text (title, authors, abstract) and length
    abstracts = {}

    for file in os.listdir(local_data_path):

        # skip anything that's not a text file
        if not file.endswith('.txt'):
            continue

        fp = f'{local_data_path}/{file}'

        # read the whole paper and close the file afterwards
        with open(fp) as f:
            data = f.read()

        # split on the first heading spelling found; papers with neither are skipped
        for marker in ('Introduction', 'INTRODUCTION'):
            if marker in data:
                abstract = data.split(marker)[0].replace('\n', ' ')
                abstracts[fp] = {'abstract': abstract, 'length': len(abstract)}
                break

    return abstracts

abstracts = papers_etl('data')
check_abstracts(abstracts)

data/nine.txt
1910
data/five.txt
1797
data/ten.txt
1284
data/four.txt
1554
data/two.txt
1215
data/one.txt
1425
data/seven.txt
2046
data/three.txt
1497
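
Because `papers_etl` silently skips any paper that contains neither spelling of the heading, it can be worth checking which text files produced no entry. The helper below is a minimal sketch of that check; `find_skipped` is a hypothetical name that is not part of the original notebook, and it assumes the dictionary keys have the same `'{local_data_path}/{file}'` form used above.

```python
import os

def find_skipped(local_data_path, abstracts):
    """Return the .txt files under local_data_path that have no entry in abstracts."""
    # Build the same keys papers_etl uses: '<local_data_path>/<file name>'
    expected = {
        f'{local_data_path}/{name}'
        for name in os.listdir(local_data_path)
        if name.endswith('.txt')
    }
    # Anything expected but absent from the dictionary was skipped during extraction
    return sorted(expected - set(abstracts))
```

Calling `find_skipped('data', abstracts)` after the cell above would return an empty list if every downloaded paper was split successfully.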


In [96]:
abstracts['data/nine.txt']['abstract'][:300]

'LEVERAGING CONFIDENCE MODELSFOR IDENTIFYING CHALLENGING DATA SUBGROUPS IN SPEECH MODELSAlkis Koudounas♣, Eliana Pastor♣, Vittorio Mazzia♡, Manuel Giollo♡,Thomas Gueudre♡, Elisa Reale♡, Giuseppe Attanasio♢, Luca Cagliero♣,Sandro Cumani♣, Luca de Alfaro♠, Elena Baralis♣, Daniele Amberti♡♣Politecnico d'
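
The split logic can also be written with a single regular expression that accepts either spelling of the heading. This is only a sketch of an alternative, not the notebook's own method: `extract_front_matter` and `INTRO_RE` are hypothetical names, and unlike the cell above it cuts at whichever spelling appears first in the text rather than preferring `Introduction`.

```python
import re

# Matches either spelling of the heading used in the sample papers.
INTRO_RE = re.compile(r'Introduction|INTRODUCTION')

def extract_front_matter(text):
    """Return the text before the first heading match, or None if there is none."""
    match = INTRO_RE.search(text)
    if match is None:
        return None
    # Keep everything before the heading and flatten newlines to spaces
    return text[:match.start()].replace('\n', ' ')
```

Returning `None` for papers without the heading makes the skipped case explicit to the caller instead of dropping the paper silently.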

### Step 3. Format into a manifest file
We'll send these abstracts to the Mechanical Turk workforce using SageMaker Ground Truth, which provides a managed interface for setting up the job, including a nice page for question answering. Ground Truth needs an input manifest file, which is simply a JSON Lines file with one JSON object per abstract. Let's create that now.

In [65]:
!pip install jsonlines

Collecting jsonlines
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-4.0.0


In [97]:
import jsonlines

with jsonlines.open('abstracts.manifest', 'w') as writer:

    for data in abstracts.values():

        # SageMaker Ground Truth expects inline text under the key "source"
        writer.write({'source': data['abstract']})
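
Before uploading, it can help to read the manifest back and confirm that every line parses as a JSON object with a non-empty `source` string. The function below is a minimal sketch using only the standard library; `validate_manifest` is a hypothetical helper, not part of the original notebook or of any SageMaker API.

```python
import json

def validate_manifest(path):
    """Check each non-empty line is a JSON object with a non-empty 'source' string.

    Returns the number of records; raises ValueError on the first bad line.
    """
    count = 0
    with open(path, encoding='utf-8') as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)  # raises a ValueError subclass on invalid JSON
            if not isinstance(record, dict):
                raise ValueError(f'line {line_number}: expected a JSON object')
            source = record.get('source')
            if not isinstance(source, str) or not source.strip():
                raise ValueError(f'line {line_number}: missing or empty "source"')
            count += 1
    return count
```

For example, `validate_manifest('abstracts.manifest')` should return the same number as `len(abstracts)`.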

In [98]:
!aws s3 cp abstracts.manifest '{s3_path}/labelling-job/'

upload: ./abstracts.manifest to s3://dist-train/amazon-science/Amazon Science Training Data/labelling-job/abstracts.manifest


### Step 4. Start a labelling job!
Next, we'll navigate to the SageMaker Ground Truth labelling job page in the AWS console and create a job that uses the manifest we just uploaded as its input.