# Pre-process the data before running the customization job

>We recommend running this notebook with:
>- kernel: `Data Science 3.0` or `Python 3`
>- instance size: `ml.t3.medium` or larger

## Prepare the dataset

Our training dataset is based on Latham & Watkins' glossary [The Book of Jargon - European Capital Markets and Bank Finance](https://www.lw.com/en/book-of-jargon/boj-european-capital-markets-and-bank-finance), a PDF file containing 1,625 European capital markets (CapMarkets) and bank finance terms.
Let's first download the glossary from Latham & Watkins' website.

In [12]:
url = "https://www.lw.com/admin/Upload/Documents/Books%20of%20Jargon/Book-of-Jargon-European-Capital-Markets-and-Bank-Finance-2nd-Edition.pdf"
filename = "capmarkets-jargon" # needed for downstream .jsonl file
filename_pdf = "capmarkets-jargon.pdf"

import requests

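def download_file(url, filename):
    """Stream the file at `url` to the local path `filename`."""
    # Send a GET request; stream the body so the whole file is never held in memory
    response = requests.get(url, stream=True, timeout=60)

    # Raise an exception if the request failed
    response.raise_for_status()

    # Open the target file in binary write mode
    with open(filename, 'wb') as file:
        # Write the contents of the response to the file in chunks
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                file.write(chunk)

# Download the glossary PDF; later cells read it from filename_pdf
download_file(url, filename_pdf)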
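
A successful HTTP status does not guarantee that the saved file is really a PDF: a server can answer with an HTML error page instead. The following is a minimal sketch of a sanity check that relies only on the fact that PDF files begin with the bytes `%PDF-`. The helper name `looks_like_pdf` and the throwaway sample files are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header

def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC

# Demonstration on two throwaway files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n...")
    with open(bad, "wb") as f:
        f.write(b"<html>Not found</html>")
    good_result = looks_like_pdf(good)
    bad_result = looks_like_pdf(bad)

print(good_result, bad_result)
```

In the notebook, the same check could be applied to `filename_pdf` before any text is extracted.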

Before we can customize a model, we need to prepare our CapMarkets dataset for training by converting it to the `JSONL` format, in which every line is a standalone JSON object. Here's what such a file looks like:
```json
{"input": "<input text>"}
{"input": "<input text>"}
{"input": "<input text>"}
```
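
To make the format concrete, here is a small sketch that writes and reads JSONL with Python's standard `json` module. The sample records and the helper names `to_jsonl` and `from_jsonl` are hypothetical. Note that `json.dumps` escapes quotes and line breaks inside a value, which keeps each record on a single line.

```python
import json

# Hypothetical records; the second one contains quotes and a line break
records = [
    {"input": "First term: its definition."},
    {"input": 'Second term: a definition with "quotes"\nand a line break.'},
]

def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)

def from_jsonl(text):
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]

jsonl_text = to_jsonl(records)
round_trip = from_jsonl(jsonl_text)
print(jsonl_text)
```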

Since our dataset is in PDF format, we need to install the [PyPDF2](https://pypi.org/project/PyPDF2/) library to extract text from the PDF.

In [None]:
!pip install PyPDF2

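Now, let's define a visitor function that PyPDF2 calls for each text fragment on a PDF page. A fragment set in the glossary's bold font starts a new JSONL record, and every other fragment is appended to the current record.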

In [None]:
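import json

terms_counter = 0
parts = []

# Construct JSONL from one PDF page: PyPDF2 calls this visitor for every text fragment
def visitor_body(text, cm, tm, fontDict, fontSize):
    global terms_counter

    # Ignore empty and single-character fragments
    if len(text) > 1:
        # Escape quotes, backslashes and line breaks so each record stays valid JSON on one line
        escaped = json.dumps(text)[1:-1]
        # Terms are set in bold; a bold fragment starts a new record
        if fontDict['/BaseFont'] == '/IYOEDH+CandidaStd-Bold':
            # close previous term, if there's one
            if parts:
                parts.append("\"}")
                parts.append("\n")

            terms_counter = terms_counter + 1

            parts.append("{\"input\":" + " \"" + escaped)
        else:
            parts.append(escaped)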
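
To see how the font-based grouping works without opening a PDF, here is a minimal self-contained sketch that feeds hand-made `(text, font)` fragments through similar logic. The function name `fragments_to_jsonl`, the `/Regular` font name and the sample fragments are hypothetical. Unlike the visitor above, this sketch drops any text that appears before the first bold term and builds each line with `json.dumps`.

```python
import json

BOLD_FONT = "/IYOEDH+CandidaStd-Bold"  # bold font name used by the visitor above

def fragments_to_jsonl(fragments):
    """Group (text, base_font) fragments into one JSON line per bold term."""
    records = []
    current = None
    for text, base_font in fragments:
        if len(text) <= 1:
            continue  # skip empty and single-character fragments
        if base_font == BOLD_FONT:
            # A bold fragment closes the previous record and starts a new one
            if current is not None:
                records.append(current)
            current = text
        elif current is not None:
            # Any other fragment extends the record that is currently open
            current += text
    if current is not None:
        records.append(current)
    return "\n".join(json.dumps({"input": record}) for record in records)

# Hypothetical fragments standing in for what PyPDF2 would pass to the visitor
sample = [
    ("Term A", BOLD_FONT),
    (": first definition. ", "/Regular"),
    ("Term B", BOLD_FONT),
    (': second "quoted" definition.', "/Regular"),
]
jsonl_text = fragments_to_jsonl(sample)
print(jsonl_text)
```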


Next, we iterate over the PDF pages that contain definitions, convert them to JSONL, and save the output as `capmarkets-jargon.jsonl`.

In [None]:
from PyPDF2 import PdfReader

pdfReader = PdfReader(filename_pdf)
# Zero-based page indices of the pages that contain definitions; end_page is exclusive
start_page = 3
end_page = 186
# Reset the counter; visitor_body increments it once per term
terms_counter = 0

# clear all old items, if any
parts.clear()
for i in range(start_page, end_page):
    page = pdfReader.pages[i]
    page.extract_text(visitor_text=visitor_body)

# close the last term
if parts:
    parts.append("\"}\n")

print("Terms processed:", terms_counter)

output = "".join(parts)
with open(filename + ".jsonl", "w", encoding="utf-8") as outfile:
    outfile.write(output)
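
Because the JSONL file is assembled from string fragments, it is worth checking that every line parses before uploading it. The following is a sketch of such a check; the helper name `validate_jsonl` and the sample file are assumptions made for illustration. It returns the number of valid records together with a list of `(line_number, problem)` pairs.

```python
import json
import os
import tempfile

def validate_jsonl(path, key="input"):
    """Return (valid_record_count, errors) for the JSONL file at `path`."""
    count = 0
    errors = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            # Each record must be an object with a string value under `key`
            if not isinstance(record, dict) or not isinstance(record.get(key), str):
                errors.append((line_no, f"missing string field '{key}'"))
                continue
            count += 1
    return count, errors

# Demonstration on a throwaway file: one valid line and two broken ones
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"input": "Term A: definition"}\n')
        f.write('{"input": "Term B: unterminated\n')
        f.write('{"text": "wrong key"}\n')
    count, errors = validate_jsonl(sample_path)

print(count, errors)
```

In the notebook, calling `validate_jsonl(filename + ".jsonl")` should return an empty error list and a count that matches the number of terms processed.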


Finally, let's upload the generated JSONL file to S3 so that Bedrock can access it.

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

bucket = sagemaker_session.default_bucket()

In [None]:
# Specify the path to the training data
train_data_path = filename + ".jsonl"

# Upload the training data to the specified S3 key prefix 'PreProcessed'
s3_train_data = sagemaker_session.upload_data(path=train_data_path, key_prefix='PreProcessed')

# Print a message indicating the successful upload
print(f"Uploaded {train_data_path} to {s3_train_data}")