# Overview

We use data to train a machine learning (ML) model and to evaluate how well it performs. A common practice is to split the available dataset into training and test datasets.

* Use training data to train the model.
* Use test data to evaluate the trained model on data it has not seen.

In [57]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

## Sample Dataset

We use the scikit-learn (aka `sklearn`) library for simple machine learning in Python. It comes with a few [sample (toy) datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) that can be used to experiment with or explore the various algorithms in the library. Here are some of them:

* [The iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) - dataset for classification.
* [The diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) - dataset for regression.
* [The digits dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) - dataset for classification.
* [The wine dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html) - dataset for classification.
* [The breast cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) - dataset for classification.

This notebook uses the wine dataset. Let's load it into a pandas `DataFrame` object.

In [58]:
wine = load_wine()
df = pd.DataFrame(data=wine['data'], columns=wine['feature_names'])
print(df.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                  
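The feature table above does not include the class labels, which `load_wine` returns separately under `wine['target']`. The following is a minimal sketch of inspecting them, assuming the same `load_wine` call as in the cell above. The names `target`, `class_names`, and `class_counts` are introduced here only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Class labels are integer codes; target_names maps each code to a readable name.
target = pd.Series(wine['target'], name='target')
class_names = list(wine['target_names'])

# Count how many samples fall into each class.
class_counts = target.value_counts().sort_index()
print(class_names)
print(class_counts)
```

Knowing how many samples each class has is useful later, because a random split can leave a small class under-represented in one of the subsets.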

### EDA

Let's perform exploratory data analysis (EDA) before manipulating the data. Here we check the column names, the data types, and the dimensions of the dataset.

In [59]:
print('Columns:')
print(df.columns)
print('Dtypes:')
print(df.dtypes)
print('Dimension:')
print(df.shape)

Columns:
Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'],
      dtype='object')
Dtypes:
alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280/od315_of_diluted_wines    float64
proline                         float64
dtype: object
Dimension:
(178, 13)
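Beyond names, types, and shape, a quick check for missing values and a look at summary statistics are common next steps. The sketch below is one possible way to do this with pandas, not part of the original notebook. The names `missing_per_column` and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(data=wine['data'], columns=wine['feature_names'])

# Count missing values per column; a non-zero count would call for cleaning.
missing_per_column = df.isna().sum()
print(missing_per_column.sum())

# Summary statistics (count, mean, std, min, quartiles, max), one row per feature.
summary = df.describe().T
print(summary[['mean', 'std', 'min', 'max']])
```

The summary makes differences in scale easy to spot. For example, `magnesium` values are around 100 while `hue` values are close to 1, which matters for models that are sensitive to feature scale.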


## Split Data into Training and Test Data

Now we can split the data into training and test datasets. We use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function in `sklearn`.

| Argument     | Description |
|--------------|-------------|
| train_size   | Proportion of the dataset to include in the training dataset. As a fraction, the value should be between 0.0 and 1.0. |
| random_state | Controls the shuffling applied to the data before applying split. See [here](https://scikit-learn.org/stable/glossary.html#term-random_state) for details. |
| shuffle      | If shuffle=True (the default), shuffles the data before splitting. |
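To illustrate what `random_state` controls, the sketch below runs the same split twice and checks that both runs select the same rows. It also uses `stratify`, a `train_test_split` argument that is not listed in the table above, to keep class proportions similar across the subsets. Using the labels from `wine['target']` as `y` is an assumption made for this example.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
df = pd.DataFrame(data=wine['data'], columns=wine['feature_names'])
y = pd.Series(wine['target'], name='target')

# Same random_state -> the same rows land in the training set on every run.
X_a, _, _, _ = train_test_split(df, y, train_size=0.66, random_state=42)
X_b, _, _, _ = train_test_split(df, y, train_size=0.66, random_state=42)
print(X_a.index.equals(X_b.index))

# stratify=y keeps the class proportions roughly equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, train_size=0.66, random_state=42, stratify=y
)
print(y.value_counts(normalize=True).sort_index())
print(y_train.value_counts(normalize=True).sort_index())
```

Fixing `random_state` makes results reproducible across runs, which helps when comparing models on the same split.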

In [60]:
# Features come from the DataFrame; class labels come from wine['target'].
X_train, X_test, y_train, y_test = train_test_split(df, wine['target'], train_size=0.66, random_state=42)
print('X train dimension: ', X_train.shape)
print('X test  dimension: ', X_test.shape)

X train dimension:  (117, 13)
X test  dimension:  (61, 13)
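With the data split, a model can be fit on the training portion and scored on the test portion. The sketch below is one way this might look. The choice of a standardized logistic regression pipeline is an assumption made for illustration and not a model prescribed by this notebook. The labels are taken from `wine['target']`.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
df = pd.DataFrame(data=wine['data'], columns=wine['feature_names'])

X_train, X_test, y_train, y_test = train_test_split(
    df, wine['target'], train_size=0.66, random_state=42
)

# Assumed model: standardize features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)  # the model only ever sees the training data

# score() returns mean accuracy; the test score estimates performance on unseen data.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f'train accuracy: {train_accuracy:.3f}')
print(f'test accuracy:  {test_accuracy:.3f}')
```

Comparing the two scores is the point of the split: a training accuracy much higher than the test accuracy suggests the model has overfit the training data.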
