# CENVAL-ARC Workshop: Hands-on Demo

This hands-on demo uses scikit-learn to build a classification model.

The purpose of this demo is to show that, with a common environment, individuals can start coding and running analyses quickly. This is useful in both instructional and research settings, where a consistent environment avoids problems such as mismatched package versions and per-machine setup.

This consistency comes from using software containers and common infrastructure.
In this case, we are using an official Project Jupyter container image, [SciPy notebook](https://github.com/jupyter/docker-stacks/tree/main/images/scipy-notebook), which has several common scientific Python packages pre-installed.

Using the NRP, instructors and researchers can build and bring their own software containers to enable instruction and research collaboration.
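
One quick way to confirm that everyone is working in the same environment is to print the versions of the packages this demo relies on. The following is a minimal sketch, assuming the packages are installed as they are in the SciPy notebook image; the `versions` dictionary is our own illustrative name.

```python
import sys

import numpy as np
import pandas as pd
import sklearn

# Collect the interpreter and package versions used in this demo.
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If two participants see different versions here, they are not running the same container image.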

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

## Let's load the Iris dataset

Iris is a small, classic dataset from Fisher (1936) and one of the earliest known datasets used for evaluating classification methods.

[Iris - UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/53/iris)

In [2]:
X, y = load_iris(return_X_y=True)

In [3]:
print(X.shape)

(150, 4)


In [4]:
print(y.shape)

(150,)
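
The shapes above mean 150 samples with 4 features each, and one label per sample. As a minimal sketch, calling `load_iris()` without `return_X_y=True` returns an object that also carries the feature names and class names; the variable name `iris` is our own choice.

```python
from sklearn.datasets import load_iris

# Without return_X_y=True, load_iris returns a Bunch holding data and metadata.
iris = load_iris()

print(iris.data.shape)      # feature matrix: samples x features
print(iris.target.shape)    # one integer label per sample
print(iris.feature_names)   # names of the four measurements
print(iris.target_names)    # species name for each integer label
```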


## Use Pandas to make the dataset easier to work with

Pandas allows us to construct a dataframe from the dataset.

This allows us to add labels so that we can understand the data better at a glance.

In [7]:
feature_columns = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]

df = pd.DataFrame(X, columns=feature_columns)
df["class"] = y
df.head()

|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
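
The integer class labels can also be made readable. The sketch below maps each label to its species name using the `target_names` that ship with the dataset; the `species` column and the `label_to_name` dictionary are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
feature_columns = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]

df = pd.DataFrame(iris.data, columns=feature_columns)
df["class"] = iris.target

# Build a lookup from integer label to species name, e.g. 0 -> "setosa".
label_to_name = {label: str(name) for label, name in enumerate(iris.target_names)}

# Add a human-readable column alongside the integer label.
df["species"] = df["class"].map(label_to_name)
print(df.head())
```

Keeping the integer `class` column next to `species` leaves the numeric labels available for modeling.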


## Let's explore this dataset a bit

Things we might want to know before modeling:
- Basic statistics (min, max, mean, std. dev.)
- Number of unique classes
- Balance of data between classes
- Missing values

In [8]:
df.describe()

|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [10]:
df["class"].unique()

array([0, 1, 2])

In [9]:
df["class"].value_counts()

class
0    50
1    50
2    50
Name: count, dtype: int64

In [11]:
# Total number of missing (NaN) values across all columns
df.isna().sum().sum()

0

In [12]:
# Count empty strings per column, which isna() does not flag as missing
(df == "").sum()

sepal_length_cm    0
sepal_width_cm     0
petal_length_cm    0
petal_width_cm     0
class              0
dtype: int64
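
The statistics above describe the dataset as a whole. Before modeling, it can also help to see how the features differ between classes. The following is a minimal sketch using a pandas `groupby`; the name `class_means` is our own.

```python
import pandas as pd
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
feature_columns = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]

df = pd.DataFrame(X, columns=feature_columns)
df["class"] = y

# Mean of every feature within each class (one row per class).
class_means = df.groupby("class").mean()
print(class_means)
```

Features whose per-class means are far apart are likely to be useful for separating the classes.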

## Visualizing our dataset