# Pretraining a model on Arcee Cloud

In this notebook, you will learn how to run continuous pretraining of a model on Arcee Cloud. In this example, we'll train a Llama3-8B model on the Energy domain.

In order to run this demo, you need a Starter account on Arcee Cloud. Please see our [pricing](https://www.arcee.ai/pricing) page for details.

The Arcee documentation is available at [docs.arcee.ai](https://docs.arcee.ai/deployment/start-deployment).

## Prerequisites

Please [sign up](https://app.arcee.ai/account/signup) to Arcee Cloud and create an [API key](https://docs.arcee.ai/getting-arcee-api-key/getting-arcee-api-key).

Then, please update the cell below with your API key. Remember to keep this key safe, and **DON'T COMMIT IT to one of your repositories**.

In [None]:
%env ARCEE_API_KEY=YOUR_API_KEY

Create a new Python environment (optional but recommended) and install [arcee-python](https://github.com/arcee-ai/arcee-python).

In [None]:
# Uncomment the next three lines to create a virtual environment
#!pip install -q virtualenv
#!virtualenv -q arcee-cloud
#!source arcee-cloud/bin/activate

%pip install -q arcee-py

In [None]:
import arcee
from IPython.display import Image

## Preparing our dataset

We need a dataset that holds relevant knowledge about the Energy domain. Arcee Cloud can ingest data in a variety of formats, like PDF, JSON, XML, TXT, HTML, and CSV. Please check the [documentation](https://docs.arcee.ai/continuous-pretraining/upload-pretraining-data) for an up-to-date list of supported formats.


We assembled a collection of about 300 PDF reports from the [International Energy Agency](https://www.iea.org/analysis?type=report) and the [Energy Reports](https://www.sciencedirect.com/journal/energy-reports) journal. The total size of the dataset is 1.5GB and 16 million tokens. Please note that this is probably too small for efficient pretraining. For real-life applications, we recommend using at least 100 million tokens.

For convenience, we have stored the dataset in this Google Drive [folder](https://drive.google.com/drive/folders/1DX5hIuVfykHqz2gwLTu4MR9R6TTAxiEO?usp=sharing). However, please note that Arcee Cloud requires training datasets to be stored in Amazon S3, so we also uploaded the dataset to a "customer" bucket defined below. You will be able to use this bucket to run the rest of this notebook, but you won't be able to list its content. In real life, you would of course use your own S3 bucket.

In [None]:
dataset_bucket_name = "juliensimon-datasets"
dataset_name = "energy-pdf"
dataset_s3_uri=f"s3://{dataset_bucket_name}/{dataset_name}"
print(f"Dataset S3 URI: {dataset_s3_uri}")

The training code in Arcee Cloud runs in one of Arcee's AWS accounts. 

We need to allow this account to access the data stored in the bucket above (which is attached to a different AWS account). 

This setup is called "cross-account access" and it requires adding a policy to the bucket, allowing the Arcee account to read the data it stores. 

You'll find more information about cross-account access and bucket policies in the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html). 

If you're unfamiliar with the process, or don't have the AWS permissions required, please contact your AWS administrator.

Here is the bucket policy applied to the "customer" bucket. 

It gives Arcee's AWS account `812782781539` read and list permissions on the "customer" bucket. When working with your own bucket, you would need to update the `Resource` section with your bucket and prefixes. Then, you would apply this bucket policy to your bucket, using either the AWS console or one of the AWS SDKs.
    
    
    import boto3
    import json

    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::812782781539:root"
                },
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetObjectAttributes",
                    "s3:GetObjectTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::juliensimon-datasets",
                    "arn:aws:s3:::juliensimon-datasets/*"
                ]
            },
        ]
    }

    policy_string = json.dumps(bucket_policy)

    boto3.client('s3').put_bucket_policy(Bucket="juliensimon-datasets", Policy=policy_string)
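
If you need to adapt the policy above to your own bucket, the following minimal sketch builds the same policy document from a bucket name and an optional prefix. The helper `make_bucket_policy`, the bucket name `my-training-data`, and the idea of restricting object access to a single prefix are assumptions for illustration, not part of the Arcee documentation. The sketch only builds and prints the JSON string; applying it still requires `put_bucket_policy()` or the AWS console, as shown above.

```python
import json

# Arcee's AWS account ID, as given in this notebook.
ARCEE_ACCOUNT_ID = "812782781539"


def make_bucket_policy(bucket_name, prefix="", account_id=ARCEE_ACCOUNT_ID):
    """Build a read-only cross-account bucket policy (hypothetical helper)."""
    # Object-level resource: the whole bucket, or only keys under the prefix.
    prefix = prefix.strip("/")
    object_path = f"{prefix}/*" if prefix else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetObjectAttributes",
                    "s3:GetObjectTagging",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/{object_path}",
                ],
            }
        ],
    }


# "my-training-data" is a placeholder bucket name, not a real bucket.
policy_string = json.dumps(make_bucket_policy("my-training-data", prefix="energy-pdf"))
print(policy_string)
```

Generating the policy from a function keeps the bucket name in one place, which avoids the easy mistake of updating the `Resource` entries but not the `Bucket` argument of `put_bucket_policy()`.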


## Uploading our dataset

Now that Arcee Cloud can read the training dataset, let's upload it with the `upload_corpus_folder()` API.

In [None]:
help(arcee.upload_corpus_folder)

In [None]:
model_name = "meta-llama/Meta-Llama-3-8B"

In [None]:
response = arcee.upload_corpus_folder(
    corpus=dataset_name,
    s3_folder_url=dataset_s3_uri,
    tokenizer_name=model_name,
    block_size=8192  # see max_position_embeddings in https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json
)
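
To get a feel for what `block_size` means for a dataset of this size, here is a back-of-the-envelope sketch. It assumes the corpus is split into fixed-length blocks of `block_size` tokens, which is an assumption about how the corpus is processed rather than something stated in this notebook. The helper `count_full_blocks` is hypothetical.

```python
def count_full_blocks(total_tokens, block_size):
    """Number of complete blocks of block_size tokens (hypothetical helper)."""
    return total_tokens // block_size


block_size = 8192

# The 16-million-token dataset used in this notebook.
print(count_full_blocks(16_000_000, block_size))   # 1953

# The 100-million-token minimum recommended for real-life applications.
print(count_full_blocks(100_000_000, block_size))  # 12207
```

Under that assumption, the demo dataset yields fewer than two thousand training sequences, which illustrates why it is on the small side for pretraining.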

In [None]:
from time import sleep

while True:
    response = arcee.corpus_status(dataset_name)
    if response["processing_state"] == "processing":
        print("Upload is in progress. Waiting 30 seconds before checking again.")
        sleep(30)
    else:
        print(response)
        break
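
The loop above runs until the state changes, however long that takes. If you'd rather give up after a while, a polling helper with a timeout could look like the sketch below. The helper `wait_while` and the stand-in `fake_status` function are hypothetical, and the `"completed"` state is a made-up value used only so the example runs without an Arcee account.

```python
import time


def wait_while(get_status, key, busy_value, interval=30, timeout=3600, sleep=time.sleep):
    """Poll get_status() until response[key] differs from busy_value.

    Raises TimeoutError if the state is still busy_value once `timeout`
    seconds have elapsed. Hypothetical helper, not part of the Arcee SDK.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = get_status()
        if response[key] != busy_value:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{key} still {busy_value!r} after {timeout} seconds")
        sleep(interval)


# Stand-in for a status call: reports "processing" twice, then a made-up final state.
_states = iter(["processing", "processing", "completed"])


def fake_status():
    return {"processing_state": next(_states)}


result = wait_while(fake_status, "processing_state", "processing", interval=0)
print(result)
```

With the real API, the same helper could be called as `wait_while(lambda: arcee.corpus_status(dataset_name), "processing_state", "processing")`.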
    

## Pretraining our model

Once the dataset has been uploaded, we can launch training with the `start_pretraining()` API.

In [None]:
help(arcee.start_pretraining)

In [None]:
pretraining_name=f"{model_name}-{dataset_name}"

In [None]:
response = arcee.start_pretraining(
    pretraining_name=pretraining_name,
    corpus=dataset_name,
    base_model=model_name
)

In the Arcee Cloud console, we can see the training job has started. After a few minutes, you should see the training loss decreasing, indicating that the model is learning how to correctly predict the tokens present in your dataset.

In [None]:
Image("model_pretraining_01.png")

## Deploying our trained model

Once training is complete, we can deploy and test the pretrained model. The model hasn't been aligned, so chances are it's not going to generate anything really useful. However, we should still check that the model is able to generate properly.

As part of the Arcee Cloud free tier, model deployment is free of charge and the endpoint will be automatically shut down after 2 hours.

Deployment should take 5-7 minutes. Please see the model deployment sample notebook for details.

In [None]:
deployment_name = f"{model_name}-{dataset_name}"

In [None]:
response = arcee.start_deployment(deployment_name=deployment_name, pretraining=pretraining_name)

In [None]:
while True:
    response = arcee.deployment_status(deployment_name)
    if response["deployment_processing_state"] == "pending":
        print("Deployment is in progress. Waiting 60 seconds before checking again.")
        sleep(60)
    else:
        print(response)
        break

Once the model endpoint is up and running, we can prompt the model with a domain-specific question.

In [None]:
query = "Is solar a good way to achieve net zero?"

response = arcee.generate(deployment_name=deployment_name, query=query)
print(response["text"])

## Stopping our deployment

When we're done working with our model, we should stop the deployment to save resources and avoid unwanted charges.

The `stop_deployment()` API only requires the deployment name.

In [None]:
arcee.stop_deployment(deployment_name=deployment_name)
arcee.deployment_status(deployment_name)

This concludes the model pretraining demonstration. Thank you for your time!

If you'd like to know more about using Arcee Cloud in your organization, please visit the [Arcee website](https://www.arcee.ai), or contact [sales@arcee.ai](mailto:sales@arcee.ai).
