# Abalone

Here we will load a dataset of physical measurements of abalones (sea snails). <a href="https://archive.ics.uci.edu/dataset/1/abalone" target="_blank" rel="noopener">Follow this link</a> for details about the dataset.

To execute queries and upload data to the Exasol database, we will use the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

## Prerequisites

Before using this notebook, complete the following step:
1. [Configure the sandbox](../sandbox_config.ipynb).

## Setup

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

## Download data

First, we will load the data into a pandas DataFrame. Each column holds one feature, and we name the columns after the variables listed in the Variable Table section of the dataset description.

In [None]:
from urllib.request import urlopen
import tempfile
from zipfile import ZipFile
from contextlib import ExitStack
import pandas as pd
from stopwatch import Stopwatch

stopwatch = Stopwatch()

DATA_URL = "https://archive.ics.uci.edu/static/public/1/abalone.zip"
DATA_FILE = "abalone.data"

with ExitStack() as stack:
    # Close the HTTP response together with the other resources
    resp = stack.enter_context(urlopen(DATA_URL))
    f = stack.enter_context(tempfile.TemporaryFile())
    f.write(resp.read())
    print(f"Downloading the data took: {stopwatch}")

    f.seek(0)
    z = stack.enter_context(ZipFile(f))
    f = stack.enter_context(z.open(DATA_FILE, "r"))
    # The data file has no header row; header=None keeps the first record as data
    df = pd.read_csv(f, header=None)

column_def = [
    ('Sex', 'CHAR(1)'),                  # M, F, and I (infant)
    ('Length', 'DECIMAL(4,3)'),          # longest shell measurement (mm)
    ('Diameter', 'DECIMAL(4,3)'),        # perpendicular to length (mm)
    ('Height', 'DECIMAL(4,3)'),          # with meat in shell (mm)
    ('Whole_weight', 'DECIMAL(5,4)'),    # whole abalone (grams)
    ('Shucked_weight', 'DECIMAL(5,4)'),  # weight of meat (grams)
    ('Viscera_weight', 'DECIMAL(5,4)'),  # gut weight (after bleeding) (grams)
    ('Shell_weight', 'DECIMAL(4,3)'),    # after being dried (grams)
    ('Rings', 'INT')                     # +1.5 gives the age in years
]
df.columns = [name for name, _ in column_def]

print(df.head())
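
The naming step can be tried out without downloading anything. The following is a minimal sketch, assuming a comma-separated layout without a header row; the rows in `SAMPLE` are illustrative only, and the names `SAMPLE`, `COLUMN_NAMES`, `sample_df` and `Age_years` are introduced here for the example. It also derives an age column using the rule noted in the column comments (rings + 1.5).

```python
import io

import pandas as pd

# Illustrative rows in a headerless, comma-separated layout (assumption:
# same column order as the variable description of the dataset).
SAMPLE = """\
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
"""

COLUMN_NAMES = [
    "Sex", "Length", "Diameter", "Height", "Whole_weight",
    "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings",
]

# header=None treats the first line as data; names= assigns the column names.
sample_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMN_NAMES)

# Ring count plus 1.5 gives the age in years.
sample_df["Age_years"] = sample_df["Rings"] + 1.5

print(sample_df[["Sex", "Rings", "Age_years"]])
```

The sketch works on its own `sample_df`, so running it does not alter `df`.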

## Upload data into DB

Let's split the data randomly into train and test sets. We will then create two tables, `ABALONE_TRAIN` and `ABALONE_TEST`, and load the datasets into them.

In [None]:
from sklearn.model_selection import train_test_split
from exasol.connections import open_pyexasol_connection

# Split the data into train and test sets
df_train, df_test = train_test_split(df, test_size=0.2)

train_table = 'ABALONE_TRAIN'
test_table = 'ABALONE_TEST'
column_desc = [' '.join(c) for c in column_def]

stopwatch = Stopwatch()

# Create an Exasol connection
with open_pyexasol_connection(sb_config, compression=True) as conn:

    # Create tables
    sql = f'CREATE OR REPLACE TABLE "{sb_config.SCHEMA}"."{train_table}"({", ".join(column_desc)})'
    conn.execute(query=sql)
    sql = f'CREATE OR REPLACE TABLE "{sb_config.SCHEMA}"."{test_table}" LIKE "{sb_config.SCHEMA}"."{train_table}"'
    conn.execute(query=sql)

    # Import data into Exasol
    conn.import_from_pandas(df_train, (sb_config.SCHEMA, train_table))
    print(f"Imported {conn.last_statement().rowcount()} rows into {train_table}.")
    conn.import_from_pandas(df_test, (sb_config.SCHEMA, test_table))
    print(f"Imported {conn.last_statement().rowcount()} rows into {test_table}.")

print(f"Importing the data took: {stopwatch}")
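
To see what the generated DDL looks like without connecting to a database, the statements can be built and printed on their own. This is only a sketch: the schema name `MY_SCHEMA` is hypothetical, and `demo_columns` is a shortened stand-in for the full column list.

```python
# Hypothetical schema name, used only for this preview.
demo_schema = "MY_SCHEMA"

# Shortened stand-in for the full list of (name, SQL type) pairs.
demo_columns = [
    ("Sex", "CHAR(1)"),
    ("Length", "DECIMAL(4,3)"),
    ("Rings", "INT"),
]

# Each (name, type) pair becomes "name type".
demo_desc = [" ".join(c) for c in demo_columns]

# The train table is created from the column list; the test table copies its structure.
create_train = f'CREATE OR REPLACE TABLE "{demo_schema}"."ABALONE_TRAIN"({", ".join(demo_desc)})'
create_test = f'CREATE OR REPLACE TABLE "{demo_schema}"."ABALONE_TEST" LIKE "{demo_schema}"."ABALONE_TRAIN"'

print(create_train)
print(create_test)
```

Note that the schema and table names are quoted while the column names are not, so Exasol stores the column names in upper case (`SEX`, `LENGTH`, and so on).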