# Iris Dataset - Simple DataFrames Analysis

In this notebook, we walk through an example of using BanyanDataFrames to
perform some simple data analysis on iris.csv.

In [10]:
# To re-create the environment elsewhere, either copy the Project.toml and Manifest.toml
# into another directory, or copy this notebook into another directory and uncomment
# and run the following code.

# using Pkg
# Pkg.activate("./")
# Pkg.add(url="https://github.com/banyan-team/banyan-julia.git", rev="v0.1.3", subdir="Banyan")
# Pkg.add(url="https://github.com/banyan-team/banyan-julia.git", rev="v0.1.3", subdir="BanyanDataFrames")

## Configuring

Banyan is used to perform some data analysis on the Iris Dataset. To run this
example, please ensure that you have a Banyan account and an AWS account.

To set your AWS credentials, run the second cell below. To set your Banyan
credentials, run the third cell below.

You must pass your User ID and API Key to the `configure` function in order
to authenticate. You can find this information on the Account page of the
Banyan Dashboard. After running this block, your credentials will be saved
in `$HOME/.banyan/banyanconfig.toml` and will be read from that file in the
future.
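
To check whether credentials have already been saved, you can look for the
file mentioned above. The following is a minimal sketch that uses only Julia's
standard library. It assumes that `homedir()` resolves to the same location as
`$HOME` on your system, and it only tests that the file exists without reading
or validating its contents.

```julia
# Path taken from the text above: $HOME/.banyan/banyanconfig.toml
# Assumption: homedir() matches $HOME on this machine.
config_path = joinpath(homedir(), ".banyan", "banyanconfig.toml")

if isfile(config_path)
    println("Found saved Banyan credentials at $(config_path)")
else
    println("No saved credentials at $(config_path); run `configure` first")
end
```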

In [7]:
# Import packages
using Banyan
using BanyanDataFrames

In [11]:
# Run the following to configure the AWS CLI. If you have already configured
# the AWS CLI with the credentials for the account you have configured with
# your Banyan account, you can skip this step.

print("Enter AWS_ACCESS_KEY_ID: \n")
ENV["AWS_ACCESS_KEY_ID"] = readline()
print("Enter AWS_SECRET_ACCESS_KEY: \n")
ENV["AWS_SECRET_ACCESS_KEY"] = readline()
print("Enter AWS_DEFAULT_REGION: \n")
ENV["AWS_DEFAULT_REGION"] = readline()

In [12]:
print("Please enter your User ID: \n")
user_id = readline()
print("Please enter your API Key: \n")
api_key = readline()

# Configures Banyan client library with your Banyan credentials
configure(user_id=user_id, api_key=api_key)

## Creating a cluster

For this example, you can either use an existing cluster or create a new cluster.
Run the following code block and enter either the name of an existing cluster
or the name you would like to use for a new cluster.

In the cell below, you can change `instance_type` to create a cluster with a
different EC2 instance type, for example one with more memory or threads,
and you can change `max_num_nodes` to set the maximum number of nodes
you would like to have allocated for your cluster at any given time.

In [13]:
print("Cluster name for existing cluster or new cluster: ")
cluster_name = readline()
println(cluster_name)
clusters = get_clusters()
println("You have $(length(filter(c -> last(c).status == :running, clusters))) running clusters")
if !(haskey(clusters, cluster_name) && clusters[cluster_name].status == :running)
    println("Creating new cluster $(cluster_name)")
    create_cluster(
        name=cluster_name,
        instance_type="t3.large",
        max_num_nodes=4,
    )
else
    println("Using existing cluster $(cluster_name)")
end

In [14]:
# Create a job
job_id = create_job(
    cluster_name = cluster_name,
    nworkers = 2,
    print_logs = true
)

## Performing computation on a dataframe

The following code computes the average petal length for each species for
flowers with a petal length less than 6.0.

Note that the API for performing various operations on dataframes is the same as that
of the DataFrames library.
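
For comparison, here is a minimal local sketch of the same filter, group, and
aggregate steps written against DataFrames.jl directly. The four-row table is
made-up illustrative data, not the contents of iris.csv, and no Banyan cluster
is involved, so there is no `collect` step at the end.

```julia
using DataFrames
using Statistics

# Small made-up table for illustration only (not the real Iris data)
df = DataFrame(
    species = ["setosa", "setosa", "virginica", "virginica"],
    petal_length = [1.4, 1.6, 5.5, 6.4],
)

# Keep rows with a petal length less than 6.0
df_sub = filter(row -> row.petal_length < 6.0, df)

# Group by species, then compute the mean petal length per group.
# DataFrames.jl names the result column `petal_length_mean`.
avg = combine(groupby(df_sub, :species), :petal_length => mean)
println(avg)
```

The call shapes (`filter`, `groupby`, `combine`) are the same as in the Banyan
cell; the difference is that here the result is an ordinary in-memory
`DataFrame`.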

In [15]:
# Perform computation

using Statistics

# ...
iris = read_csv("https://raw.githubusercontent.com/banyan-team/banyan-julia/v0.1.3/BanyanDataFrames/res/iris.csv")

# Filters the rows based on whether the petal length is less than 6.0
iris_sub = filter(row -> row.petal_length < 6.0, iris)

# Groups the rows by the species type
gdf = groupby(iris_sub, :species)

# Computes average petal length for each group and stores in a new
# DataFrame. Collects the result back to the client and materializes in
# the variable `avg_pl`.
avg_pl = collect(combine(gdf, :petal_length => mean))


## Cleaning up resources

After running your desired computation, we suggest that you destroy your running
job so that you are not charged for resources when you are not using them. You
may choose to keep your cluster running.
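
If you want cleanup to happen even when the computation throws an error, one
option is a `try`/`finally` wrapper. The sketch below is an assumption about
how you might structure this, not part of Banyan: `with_cleanup` is a
hypothetical helper, and `fake_destroy` and the `"job-123"` ID are stand-ins so
that the example runs without a cluster.

```julia
# Hypothetical helper (not part of Banyan): runs `work(resource)` and
# always calls `cleanup(resource)` afterwards, even if `work` throws.
function with_cleanup(work, resource, cleanup)
    try
        return work(resource)
    finally
        cleanup(resource)
    end
end

# Stand-ins so the sketch runs without a cluster
destroyed = String[]
fake_destroy(id) = push!(destroyed, id)

# The do-block becomes the `work` argument
result = with_cleanup("job-123", fake_destroy) do id
    "computed on $id"
end
println(result)
println(destroyed)
```

With a real job, the same shape would be `with_cleanup(job_id, destroy_job) do id ... end`,
using the `job_id` and `destroy_job` from this notebook.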

Run the following code block to destroy the job that you created in this example.

In [None]:
# Destroy job
destroy_job(job_id)