# Train & Test Datasets (this notebook is being edited)

## Reference
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/datasets/index.html
- https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
- https://machinelearningmastery.com/time-series-datasets-for-machine-learning/
- https://plot.ly/python/time-series/
- https://www.plotly.express/plotly_express/

## Table of Contents
1. Setup
1. Introduction
1. Datasets

## Setup

In [1]:
!pip install plotly==4.0.0



Import the `pandas`, `numpy` and `sklearn` packages. In addition, import the `train_test_split` function, which we use to create train and test datasets.

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk

from sklearn.model_selection import train_test_split

Display the version numbers of the numpy, pandas and scikit-learn packages:

In [3]:
print('numpy  :',np.__version__)
print('pandas :',pd.__version__)
print('sklearn:',sk.__version__)

numpy  : 1.16.4
pandas : 0.24.2
sklearn: 0.21.3


Note that these version numbers may not be identical to the versions documented in the references provided above.

## Introduction

There are two basic types of datasets:
1. Cross-sectional, the rows of which __are not related__ to each other in time. ([Wikipedia](https://en.wikipedia.org/wiki/Cross-sectional_data))
2. Time series, the rows of which __are related__ to each other in time. ([Wikipedia](https://en.wikipedia.org/wiki/Time_series))

When creating a supervised learning model, the goal is to predict well on future, unseen data.
This means that the model should be:
- trained on one batch of data (the train dataset)
- evaluated on another batch of data (the test dataset)

Scikit-learn provides a function called `train_test_split` to separate a cross-sectional dataset into distinct train and test datasets. This function will be demonstrated below. In addition, the code for a function which splits time series datasets into train and test datasets will be defined and demonstrated. 
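To make the difference between the two dataset types concrete, here is a minimal sketch of a chronological split. It passes `shuffle=False` to `train_test_split` so that every test row comes after every train row. The made-up daily series and the names `series`, `ts_train` and `ts_test` are assumptions for illustration only, and this is not necessarily the time series function defined later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily series, used only for illustration
dates = pd.date_range("2019-01-01", periods=10, freq="D")
series = pd.Series(np.arange(10), index=dates)

# shuffle=False keeps the original row order:
# the first 7 rows become the train set, the last 3 rows the test set
ts_train, ts_test = train_test_split(series, test_size=3, shuffle=False)

# True when all train dates precede all test dates
print(ts_train.index.max() < ts_test.index.min())
```

With the default `shuffle=True`, rows from the future could end up in the train set, which is harmless for cross-sectional data but misleading for time series.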

## Datasets

First, load the iris dataset for the demonstration.

In [0]:
from sklearn.datasets import load_iris

# Load the dataset once and unpack it into features and target
iris_features, iris_target = load_iris(return_X_y=True)
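Since `pandas` is already imported, it can help to inspect the dataset as a DataFrame before splitting it. The following is a small sketch: the column names come from the dataset's own `feature_names`, while the `iris_df` variable and the `target` column name are choices made here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature column names come from the dataset itself;
# "target" is a column name chosen for this example
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

print(iris_df.shape)
print(iris_df.head())
```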

The `train_test_split` function can split one or more datasets in a single call, provided they all have the same number of rows.
Typically it splits two (the features and the target), for example:

In [0]:
x_train, x_test, y_train, y_test = train_test_split(iris_features,
                                                    iris_target)
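The call above shuffles the rows randomly, so each run produces a different split. As a hedged variation, the sketch below fixes `random_state` to make the split repeatable and passes `stratify` so the three iris classes keep roughly equal proportions in both parts. The seed value 42 and the variable names are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state makes the shuffle repeatable;
# stratify=y keeps the class proportions similar in train and test
xs_train, xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Number of rows per class in each part
print(np.bincount(ys_train), np.bincount(ys_test))
```

Calling the function again with the same `random_state` returns the same rows, which makes results easier to compare between runs.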

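Notice that the two train datasets have the same number of rows. Because `test_size` is not specified, `train_test_split` falls back to its default and holds out 25% of the 150 rows for testing (rounded up to 38), which leaves 112 rows for training.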

In [24]:
x_train.shape, y_train.shape

((112, 4), (112,))

Notice that the two test datasets have the same number of rows.

In [25]:
x_test.shape, y_test.shape

((38, 4), (38,))

Notice that the "x" datasets have the same number of columns (the four iris features), while the "y" datasets are one-dimensional.
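Matching row counts alone do not show that each feature row is still paired with its own target. One way to check this, sketched below, is to pass a third array of row positions through the same call. The `row_ids` helper array and the other variable names are assumptions introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
row_ids = np.arange(len(X))  # hypothetical helper: original row positions

# All three arrays are shuffled and split with the same row selection
X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X, y, row_ids, random_state=0)

# Each split row still matches its original feature row and target
assert np.array_equal(X_tr, X[id_tr])
assert np.array_equal(y_te, y[id_te])

# Train and test share no rows and together cover the whole dataset
print(len(set(id_tr) & set(id_te)), len(id_tr) + len(id_te))
```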