## Load Data from S3

To load data from Amazon S3 into an Amazon SageMaker notebook, follow these steps:

- Set Up AWS Credentials: Ensure the IAM role attached to your SageMaker notebook instance has permission to access the S3 bucket.
- Install Required Libraries: Ensure `boto3` and `pandas` are installed.
- Initialize the S3 Client: Use `boto3` to interact with S3.
- Download Data from S3: Use the `download_file` method to download the data.

In [None]:
# Install required libraries if not already installed
!pip install boto3 pandas --quiet

In [None]:
# Import libraries
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')

# S3 bucket name
bucket_name = "<BUCKET_NAME>"

# File keys and local file names
files = [
    {'key': 'dataset-jsonl/dev1.jsonl', 'local_file': 'dev1.jsonl'},
    {'key': 'dataset-jsonl/dev2.jsonl', 'local_file': 'dev2.jsonl'}
]

# Download files from S3
for file in files:
    s3.download_file(bucket_name, file['key'], file['local_file'])
    print(f"Downloaded {file['key']} to {file['local_file']}")
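
Before converting the downloaded files, it can help to confirm that each one is well-formed JSONL with the expected keys. The following is a minimal sketch, not part of the original notebook: `inspect_jsonl` is a hypothetical helper, and it assumes every record should carry the `prompt` and `completion` keys described in the next section.

```python
import json

def inspect_jsonl(path, required_keys=("prompt", "completion")):
    """Count non-blank records and collect line numbers that fail basic checks."""
    total = 0
    bad_lines = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)  # line is not valid JSON
                continue
            # A usable record is a JSON object containing every required key
            if not isinstance(record, dict) or any(k not in record for k in required_keys):
                bad_lines.append(line_number)
    return total, bad_lines
```

For example, `total, bad_lines = inspect_jsonl('dev1.jsonl')` returns the number of records in the file and the line numbers that need attention before conversion.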

### Converting JSONL Format for Conversation Data
This notebook transforms JSONL files from a simple prompt-completion format to a structured conversation format.

#### Input Format
```json
{
    "prompt": "value of prompt key",
    "completion": "value of completion key"
}
```

#### Output Format
```json
{
    "conversationTurns": [{
        "referenceResponses": [{
            "content": [{
                "text": "value from completion key"
            }]
        }],
        "prompt": {
            "content": [{
                "text": "value from prompt key"
            }]
        }
    }]
}
```

Let's implement the transformation:

In [None]:
import json
import random

# Function to transform a single record
def transform_record(record):
    return {
        "conversationTurns": [
            {
                "referenceResponses": [
                    {
                        "content": [
                            {
                                "text": record["completion"]
                            }
                        ]
                    }
                ],
                "prompt": {
                    "content": [
                        {
                            "text": """You're given a radiology report findings to generate a concise radiology impression from it.

A Radiology Impression is the radiologist's final concise interpretation and conclusion of medical imaging findings, typically appearing at the end of a radiology report.
\n Follow these guidelines when writing the impression:
\n- Use clear, understandable language avoiding obscure terms.
\n- Number each impression.
\n- Order impressions by importance.
\n- Keep impressions concise and shorter than the findings section.
\n- Write for the intended reader's understanding.\n
Findings: \n""" + record["prompt"]
                        }
                    ]
                }
            }
        ]
    }

# Read records from the input file, sample them, and write the transformed records
def convert_file(input_file_path, output_file_path, sample_size=1000):
    # First, read all records into a list, skipping blank lines
    records = []
    with open(input_file_path, 'r', encoding='utf-8') as input_file:
        for line in input_file:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    # Randomly sample up to sample_size records; random.sample raises
    # ValueError if asked for more records than are available
    random.seed(42)  # Set the seed first for reproducibility
    sampled_records = random.sample(records, min(sample_size, len(records)))

    # Write the sampled and transformed records to the output file
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        for record in sampled_records:
            transformed_record = transform_record(record)
            output_file.write(json.dumps(transformed_record) + '\n')

In [None]:
# Usage
input_file_path = '<INPUT_FILE_NAME>.jsonl'  # Replace with your input file path
output_file_path = '<OUTPUT_FILE_NAME>.jsonl'  # Replace with your desired output file path
convert_file(input_file_path, output_file_path)
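
After conversion, a quick structural check can catch records that do not match the output format described above. This is a minimal sketch under stated assumptions: `is_valid_conversation` and `count_invalid_records` are hypothetical helpers, not part of the original notebook, and they only verify that each turn has a text prompt and a text reference response.

```python
import json

def is_valid_conversation(record):
    """Return True if the record follows the conversationTurns layout."""
    try:
        turns = record["conversationTurns"]
        if not isinstance(turns, list) or not turns:
            return False
        for turn in turns:
            prompt_text = turn["prompt"]["content"][0]["text"]
            reference_text = turn["referenceResponses"][0]["content"][0]["text"]
            if not isinstance(prompt_text, str) or not isinstance(reference_text, str):
                return False
    except (KeyError, IndexError, TypeError):
        # A missing key, empty list, or wrong container type means a bad layout
        return False
    return True

def count_invalid_records(path):
    """Count the non-blank lines in a JSONL file that fail the layout check."""
    invalid = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip() and not is_valid_conversation(json.loads(line)):
                invalid += 1
    return invalid
```

Calling `count_invalid_records(output_file_path)` should return `0` for a file written by `convert_file`.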

#### Upload the Transformed Files Back to S3

In [None]:
# File paths and S3 keys for the transformed files.
# Use one entry per transformed file and give each entry its own names;
# identical entries upload the same file twice.
transformed_files = [
    {'local_file': '<OUTPUT_FILE_NAME>.jsonl', 'key': '<FOLDER_NAME>/<OUTPUT_FILE_NAME>.jsonl'},
    {'local_file': '<OUTPUT_FILE_NAME>.jsonl', 'key': '<FOLDER_NAME>/<OUTPUT_FILE_NAME>.jsonl'}
]

# Upload files to S3
for file in transformed_files:
    s3.upload_file(file['local_file'], bucket_name, file['key'])
    print(f"Uploaded {file['local_file']} to s3://{bucket_name}/{file['key']}")
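
To confirm that an upload completed, one option is to compare the size S3 reports for the object with the size of the local file. The sketch below is an illustration rather than part of the original notebook: `upload_matches_local` is a hypothetical helper that expects a boto3 S3 client such as the `s3` object created earlier, and it relies on the `ContentLength` field returned by the client's `head_object` call.

```python
import os

def upload_matches_local(s3_client, bucket, key, local_file):
    """Return True if the S3 object's size equals the local file's size."""
    # head_object fetches object metadata without downloading the body
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"] == os.path.getsize(local_file)
```

In the upload loop this could be called as `upload_matches_local(s3, bucket_name, file['key'], file['local_file'])`. If the object does not exist, `head_object` raises an error instead of returning, so a size match is a lightweight check and not a content checksum.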