# Welcome to the Dark Art of Coding:
## Introduction to Machine Learning
Data handling

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Explore the datatypes that scikit-learn uses
* Explore several techniques to prep data for use in scikit-learn
* Examine techniques to manipulate OR review the data


# Overview: Data Handling
---

Generally, scikit-learn uses several of the most popular datatypes found in the Python data science ecosystem:

* numpy arrays
* scipy sparse matrices
* pandas DataFrames


To train a scikit-learn model, all you need is a `2D array` (also called a `matrix` and often labeled `X`) for the input variables and a `1D array` (often labeled `y`) for the target variable.

## Features

The 2D array `X` holds the features of your dataset as columns and holds individual samples of the data as rows.

Each column is a separate feature. In the examples below a feature could be the price of a soda OR could be the lengths of specific beetle body parts.

Each row in the examples below is a specific example of a soda OR a beetle.

### Features example one: Soda sizes

|Soda size (oz)|
|:---|
|12|
|16|
|20|
|24|

### Features example two: Beetle sizes

|Head (mm)|Thorax (mm)|Abdomen (mm)| 
|:---|:---|:---|
|4|6|6|
|6|10|9|
|4|6|7|
|7|11|9|
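
As a rough sketch of the row/column layout, the beetle table above can be stored in a numpy array. The variable names here (`first_beetle`, `thorax_lengths`) are only illustrative.

```python
import numpy as np

# Beetle measurements from the table above:
# one row per beetle (sample), one column per body part (feature)
X = np.array([[4, 6, 6],
              [6, 10, 9],
              [4, 6, 7],
              [7, 11, 9]])

n_samples, n_features = X.shape   # rows are samples, columns are features

first_beetle = X[0]        # one row: every feature for a single sample
thorax_lengths = X[:, 1]   # one column: a single feature across all samples

print(n_samples, n_features)
print(first_beetle)
print(thorax_lengths)
```

Indexing with a single integer selects a row, while `[:, 1]` selects the second column across every row.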

## Targets

The 1D array `y` holds specific target values OR categorization labels.

In the examples below a target value could be the price of a specific soda OR could be the category of beetle that has body parts with specific lengths.

### Target example one: Soda costs

In this 1D target array, there are as many prices as there are rows in the 2D features array.

```
[0.50, 0.65, 0.70, 0.80]
```

### Target example two: Beetle classification

In this 1D target array, there are as many beetle categorizations as there are rows in the 2D features array.

* 0 = checkered beetle
* 1 = diving beetle

```
[0, 1, 0, 1] 
```
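
A minimal sketch of how the target array lines up with the features matrix, reusing the beetle values from above. The `label_names` dictionary is a hypothetical helper added only for readability.

```python
import numpy as np

# Features: one row per beetle
X = np.array([[4, 6, 6],
              [6, 10, 9],
              [4, 6, 7],
              [7, 11, 9]])

# Targets: one label per row of X (0 = checkered beetle, 1 = diving beetle)
y = np.array([0, 1, 0, 1])

# The number of labels must match the number of rows in X
assert X.shape[0] == y.shape[0]

# Hypothetical lookup table, used only to print readable labels
label_names = {0: "checkered beetle", 1: "diving beetle"}
for row, label in zip(X, y):
    print(row, "->", label_names[int(label)])
```

The position in `y` is what ties a label to its sample: `y[2]` describes the beetle stored in `X[2]`.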

# Creating arrays for use in scikit-learn
---

There are many ways to create `2D` and `1D` arrays. Here we will show you several quick examples from `numpy` and `pandas` that you will see in this tutorial and/or in examples in books/online.

We will not dive deeply into the details. We leave it up to the student to explore tools like `numpy`, `scipy`, and `pandas` to better understand what is happening under the hood.

## numpy

In [70]:
import numpy as np

x = np.array([12, 16, 20, 14, 18])             # coffee sizes
y = np.array([2.95, 3.65, 4.15, 3.25, 4.20])   # coffee prices

Remember that the features matrix must be in a 2D format and the target array must be in a 1D format, so let's look at the current arrangement using the `.shape` attribute associated with numpy arrays:

In [71]:
x

array([12, 16, 20, 14, 18])

In [72]:
x.shape

(5,)

In [73]:
y

array([2.95, 3.65, 4.15, 3.25, 4.2 ])

In [74]:
y.shape

(5,)

As we can see above, both of these are simply 1D arrays of length **five**, so each shape tuple holds a single value: 5. The trailing comma appears because `.shape` returns a `tuple`, and Python writes a one-element tuple as `(5,)`.
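
To make the shape tuple a little more concrete, here is a small sketch that also uses the `.ndim` attribute, which reports the number of dimensions directly. The name `n_items` is only illustrative.

```python
import numpy as np

x = np.array([12, 16, 20, 14, 18])

print(x.ndim)          # number of dimensions
print(x.shape)         # a tuple with one entry per dimension
print(len(x.shape))    # the tuple's length matches x.ndim

# Unpacking a one-element tuple also needs the trailing comma
(n_items,) = x.shape
print(n_items)
```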

To convert the `x` array from a 1D array to a 2D array, there are several techniques you will see:

### Numpy technique one:

In [75]:
X_numpy_one = x[:, np.newaxis]        # np.newaxis increases dimensionality
X_numpy_one                           #     by one

array([[12],
       [16],
       [20],
       [14],
       [18]])

### Numpy technique two:

In [76]:
X_numpy_two = x[:, None]              # np.newaxis is actually an alias for
X_numpy_two                           #     None, so None also works

array([[12],
       [16],
       [20],
       [14],
       [18]])

In [77]:
X_numpy_one.shape

(5, 1)

In [78]:
X_numpy_two.shape

(5, 1)

### Numpy technique three:

In [79]:
X_numpy_three = x.reshape(len(x), 1)
X_numpy_three

array([[12],
       [16],
       [20],
       [14],
       [18]])
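
A variation you may also see in books and online is `reshape(-1, 1)`, where `-1` asks numpy to infer the number of rows. This is a small sketch rather than a fourth official technique, and `X_inferred` is a name chosen just for this example.

```python
import numpy as np

x = np.array([12, 16, 20, 14, 18])

# -1 tells numpy to work out the row count from the array's length
X_inferred = x.reshape(-1, 1)
print(X_inferred.shape)

# The result matches the other techniques shown above
assert np.array_equal(X_inferred, x.reshape(len(x), 1))
assert np.array_equal(X_inferred, x[:, np.newaxis])

# The original array keeps its 1D shape
print(x.shape)
```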

### 2D numpy data:

If the data is already a 2D array/matrix you don't need to do anything:

In [80]:
X_beetle = np.array([[4, 6, 6], 
                     [6, 10, 9], 
                     [4, 6, 7],
                     [7, 11, 9]])
X_beetle

array([[ 4,  6,  6],
       [ 6, 10,  9],
       [ 4,  6,  7],
       [ 7, 11,  9]])

In [81]:
X_beetle.shape

(4, 3)

## pandas

In [82]:
import pandas as pd

x_pandas = pd.Series([12, 16, 20, 14, 18])
y_pandas = pd.Series([2.95, 3.65, 4.15, 3.25, 4.20])

You might presume that a pandas Series can serve directly as a 2D features matrix, but a Series is one-dimensional, as we will see when we look at the shape.

In [83]:
x_pandas

0    12
1    16
2    20
3    14
4    18
dtype: int64

In [84]:
x_pandas.shape

(5,)

In [85]:
y_pandas.shape

(5,)

Again, both of these are simply 1D arrays of length **five**.

To convert the `x_pandas` 1D array to a 2D array, we can convert the pandas Series to a DataFrame:

### Pandas technique one:

In [86]:
X_pandas_one = x_pandas.to_frame('size')
X_pandas_one

   size
0    12
1    16
2    20
3    14
4    18


In [87]:
X_pandas_one.shape

(5, 1)

### Pandas technique two:

In [88]:
X_pandas_two = pd.DataFrame(x_pandas, columns=['size'])

In [89]:
X_pandas_two

   size
0    12
1    16
2    20
3    14
4    18


In [90]:
X_pandas_two.shape

(5, 1)
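
Two related idioms are worth recognizing, sketched here under the assumption that pandas 0.24 or later is installed (for `.to_numpy()`). Selecting a column with single brackets gives back a 1D Series, while double brackets keep a 2D DataFrame. The names `X_frame`, `one_d`, `two_d`, and `X_array` are illustrative only.

```python
import pandas as pd

x_pandas = pd.Series([12, 16, 20, 14, 18])
X_frame = x_pandas.to_frame('size')

one_d = X_frame['size']      # single brackets: a 1D Series
two_d = X_frame[['size']]    # double brackets: a 2D DataFrame

print(one_d.shape)
print(two_d.shape)

# Convert the DataFrame to a plain 2D numpy array
X_array = X_frame.to_numpy()
print(X_array.shape)
```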

# Looking at the data
---

## Numpy

To see just the first few rows of a numpy array:

In [96]:
X_numpy_one[:3]       # first three rows

array([[12],
       [16],
       [20]])
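
The same slicing syntax reaches other parts of an array. The following sketch uses the beetle matrix from earlier, and the variable names are chosen only for this example.

```python
import numpy as np

X_beetle = np.array([[4, 6, 6],
                     [6, 10, 9],
                     [4, 6, 7],
                     [7, 11, 9]])

last_two = X_beetle[-2:]           # last two rows
first_two_cols = X_beetle[:, :2]   # every row, first two columns
every_other = X_beetle[::2]        # every second row, starting at row 0

print(last_two)
print(first_two_cols)
print(every_other)
```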

## Pandas

In [98]:
X_pandas_one.head(3)

   size
0    12
1    16
2    20


In [99]:
X_pandas_one[:3]

   size
0    12
1    16
2    20
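
Beyond `.head()` and slicing, a few other DataFrame tools are handy for a quick review. This is a brief sketch, not an exhaustive list. The names `last_rows`, `middle_rows`, and `summary` are only illustrative.

```python
import pandas as pd

X_pandas_one = pd.Series([12, 16, 20, 14, 18]).to_frame('size')

last_rows = X_pandas_one.tail(2)        # last two rows
middle_rows = X_pandas_one.iloc[1:4]    # rows at positions 1, 2, and 3
summary = X_pandas_one.describe()       # count, mean, std, min, quartiles, max

print(last_rows)
print(middle_rows)
print(summary)
```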


It can be really useful to take a look at the features matrix and target array of the training data. 

* In the raw form
* In a visualization tool
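
As one possible way to visualize the data, the sketch below draws a scatter plot of the coffee sizes and prices from earlier. It assumes matplotlib is installed. The axis labels and title are placeholders chosen for this example, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.array([12, 16, 20, 14, 18])             # coffee sizes
y = np.array([2.95, 3.65, 4.15, 3.25, 4.20])   # coffee prices

fig, ax = plt.subplots()
ax.scatter(x, y)                  # one dot per sample
ax.set_xlabel("size")             # placeholder label
ax.set_ylabel("price")            # placeholder label
ax.set_title("Coffee price by size")
```

In a notebook the figure is typically displayed as the cell's output. In a script, `fig.savefig(...)` or `plt.show()` would be needed to see it.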

# Gotchas
---

# Deep Dive
---

# How to learn more: tips and hints
---

# References
---

Below are references that may assist you in learning more:
    
|Title (link)|Comments|
|---|---|
|[General API Reference](https://scikit-learn.org/stable/modules/classes.html)||