# Train & Test Datasets (this notebook is being edited)

## Reference
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/datasets/index.html
- https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
- https://machinelearningmastery.com/time-series-datasets-for-machine-learning/
- https://plot.ly/python/time-series/
- https://www.plotly.express/plotly_express/

## Table of Contents
1. Setup
1. Introduction

## Setup

In [4]:
!pip install plotly==4.0.0



Import the `pandas` and `numpy` libraries. In addition, import the `train_test_split` function which we use to create train and test datasets.

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk
from sklearn.model_selection import train_test_split

Display the version numbers of the numpy, pandas and scikit-learn packages:

In [14]:
print('numpy  :',np.__version__)
print('pandas :',pd.__version__)
print('sklearn:',sk.__version__)

numpy  : 1.16.4
pandas : 0.24.2
sklearn: 0.21.2


## Introduction

There are two basic types of datasets:
1. Cross-sectional, the rows of which __are not related__ to each other in time. ([Wikipedia](https://en.wikipedia.org/wiki/Cross-sectional_data))
2. Time series, the rows of which __are related__ to each other in time. ([Wikipedia](https://en.wikipedia.org/wiki/Time_series))
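
As a rough illustration of the distinction, the sketch below builds one tiny frame of each kind. The column names and values are made up for the example and are not taken from the datasets used later. Shuffling the cross-sectional rows loses nothing, while the time series only makes sense in date order.

```python
import pandas as pd

# Hypothetical cross-sectional data: each row is an independent observation
cross_sectional_pdf = pd.DataFrame({'height_cm': [170.0, 182.5, 165.2],
                                    'weight_kg': [68.0, 90.1, 55.4]})

# Hypothetical time series: each row is tied to its neighbours by the date index
timeseries_pdf = pd.DataFrame({'sales': [266.0, 145.9, 183.1]},
                              index=pd.to_datetime(['2001-01-01', '2001-02-01', '2001-03-01']))

# Row order carries no information in the cross-sectional frame, so shuffling is harmless
shuffled_pdf = cross_sectional_pdf.sample(frac=1, random_state=0)

# In the time series the order matters: the change from one row to the next is meaningful
monthly_change = timeseries_pdf['sales'].diff()
print(monthly_change)
```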

When creating a supervised learning model, the goal is prediction. The basic process is:
1. Use a _training_ dataset to create the model
2. Make predictions with this model
3. Evaluate those predictions against known outcomes

Beyond that, the goal is to predict well on future, unseen data.
This means that the model should be:
- trained on one batch of data (the train dataset)
- evaluated on another batch of data (the test dataset)
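
The two bullets above can be sketched as follows, assuming the iris data and an arbitrary classifier (`KNeighborsClassifier` is chosen only for illustration and is not part of this notebook). The point is that accuracy is measured separately on rows the model has seen and on rows it has not.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

features, target = load_iris(return_X_y=True)

# Hold out a quarter of the rows; random_state only makes the sketch repeatable
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0)

# The classifier is an arbitrary choice for illustration
model = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)

train_accuracy = model.score(x_train, y_train)  # accuracy on rows the model has seen
test_accuracy = model.score(x_test, y_test)     # accuracy on rows held out from training
print('train accuracy:', train_accuracy)
print('test accuracy :', test_accuracy)
```

The test accuracy is the better guide to how the model might behave on future data, because none of those rows were used for fitting.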

Scikit-learn provides a function called `train_test_split` to separate a cross-sectional dataset into distinct train and test datasets. This function will be demonstrated below. In addition, the code for a function which splits time series datasets into train and test datasets will be defined and demonstrated. 

Note that the version numbers displayed in the Setup section may not be identical to those used in the references provided above.

## Datasets

First, load the iris dataset for the demonstration.

In [52]:
import pandas as pd
from sklearn.datasets import load_iris
iris_features = load_iris().data
iris_target   = load_iris().target
iris_feature_columns = [feature_name.replace(' ','_').replace('(','').replace(')','') 
                        for feature_name in load_iris().get('feature_names')]
iris_features_pdf = pd.DataFrame(data=iris_features,
                                 columns=iris_feature_columns
                                )
iris_target_pdf = pd.DataFrame(data={'species': iris_target}
                              )
iris_pdf = pd.concat([iris_features_pdf, iris_target_pdf],
                     axis='columns',
                     join='inner')
print(iris_pdf.info())
iris_pdf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length_cm    150 non-null float64
sepal_width_cm     150 non-null float64
petal_length_cm    150 non-null float64
petal_width_cm     150 non-null float64
species            150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 5.9 KB
None


| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [72]:
def get_iris_pdf():
  import pandas as pd
  from sklearn.datasets import load_iris
  iris_features = load_iris().data
  iris_target   = load_iris().target
  iris_feature_columns = [feature_name.replace(' ','_').replace('(','').replace(')','') 
                          for feature_name in load_iris().get('feature_names')]
  iris_features_pdf = pd.DataFrame(data=iris_features,
                                   columns=iris_feature_columns
                                  )
  iris_target_pdf = pd.DataFrame(data={'species': iris_target}
                                ).replace(to_replace=[0,1,2], 
                                          value=load_iris().get('target_names')
                                         )
  iris_pdf = pd.concat([iris_features_pdf, iris_target_pdf],
                       axis='columns',
                       join='inner')
  return iris_pdf
get_iris_pdf()

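| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |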


In [73]:
def plot_iris_pdf():
  import plotly.express as px
  import pandas as pd
  fig = px.scatter(get_iris_pdf(), x='sepal_length_cm', y='sepal_width_cm', color='species')
  fig.show()
plot_iris_pdf()

### Boston housing dataset

### Shampoo dataset

In [54]:
def get_shampoo_pdf():
  import pandas as pd
  shampoo_url='https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv'
  return pd.read_csv(shampoo_url)

print(get_shampoo_pdf().info())
get_shampoo_pdf().head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 2 columns):
Month    36 non-null object
Sales    36 non-null float64
dtypes: float64(1), object(1)
memory usage: 656.0+ bytes
None


| | Month | Sales |
|---|---|---|
| 0 | 1-01 | 266.0 |
| 1 | 1-02 | 145.9 |
| 2 | 1-03 | 183.1 |
| 3 | 1-04 | 119.3 |
| 4 | 1-05 | 180.3 |


In [55]:
def plot_shampoo_pdf():
  import plotly.express as px
  import pandas as pd
  fig = px.line(get_shampoo_pdf(), x='Month', y='Sales')
  fig.show()
  
plot_shampoo_pdf()

### Daily temperatures dataset

In [7]:
import pandas as pd
daily_temps_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv'
daily_temps_pdf = pd.read_csv(daily_temps_url)
print(daily_temps_pdf.info())
daily_temps_pdf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 2 columns):
Date    3650 non-null object
Temp    3650 non-null float64
dtypes: float64(1), object(1)
memory usage: 57.1+ KB
None


| | Date | Temp |
|---|---|---|
| 0 | 1981-01-01 | 20.7 |
| 1 | 1981-01-02 | 17.9 |
| 2 | 1981-01-03 | 18.8 |
| 3 | 1981-01-04 | 14.6 |
| 4 | 1981-01-05 | 15.8 |
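
`info()` reports the `Date` column as `object`, which means the dates were read as plain strings. Below is a minimal sketch of one way to turn them into a `DatetimeIndex`, using a few rows copied from the output above instead of downloading the file. The `sample_pdf` name is hypothetical.

```python
import pandas as pd

# First rows of the daily temperatures data, copied from the output above
sample_pdf = pd.DataFrame({'Date': ['1981-01-01', '1981-01-02', '1981-01-03',
                                    '1981-01-04', '1981-01-05'],
                           'Temp': [20.7, 17.9, 18.8, 14.6, 15.8]})

# Parse the strings into timestamps and use them as a sorted index
sample_pdf['Date'] = pd.to_datetime(sample_pdf['Date'])
sample_pdf = sample_pdf.set_index('Date').sort_index()

# With a DatetimeIndex, rows can be selected by date label
print(sample_pdf.loc['1981-01-03':])
```

When reading from a file, passing `parse_dates=['Date']` and `index_col='Date'` to `pd.read_csv` should give a similar result in one step.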


In [0]:
def get_daily_temps_pdf():
  daily_temps_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv'
  return pd.read_csv(daily_temps_url)

In [18]:
def plot_daily_temps_pdf():
  import plotly.express as px
  import pandas as pd
  fig = px.line(get_daily_temps_pdf(), x='Date', y='Temp')
  fig.show()
plot_daily_temps_pdf()

## 2. Introduction

The `train_test_split` function can split one or more datasets in a single call.
Typically it splits two (the features and the target), returning a train and a test part for each, as in the following example:

In [0]:
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target)

Notice that the two train datasets have the same number of rows.

In [0]:
x_train.shape, y_train.shape

Notice that the two test datasets have the same number of rows.

In [0]:
x_test.shape, y_test.shape

Notice that the pair of "x" datasets have identical numbers of columns and that the "y" datasets have only one dimension.
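
By default `train_test_split` shuffles the rows and holds out 25% of them for testing, so the split changes from run to run. The sketch below shows the `test_size`, `random_state` and `stratify` parameters, reloading iris so that the block runs on its own. The 20% test size and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

features, target = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    features, target,
    test_size=0.2,      # hold out 20% of the 150 rows
    random_state=42,    # fix the shuffle so the split is repeatable
    stratify=target,    # keep the class proportions the same in both parts
)

print(x_train.shape, y_train.shape)  # (120, 4) (120,)
print(x_test.shape, y_test.shape)    # (30, 4) (30,)
print(np.bincount(y_test))           # [10 10 10]
```

Because iris has 50 rows per species, stratifying a 30-row test set yields 10 rows of each class.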

## Time series

In [0]:
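def train_test_split_timeseries(pdf, train_pct=0.8, start_test=None, target_column_name=None):
  # Split a time-ordered DataFrame into train and test sets without shuffling.
  # Rows are assumed to be sorted by time; start_test requires a DatetimeIndex.
  assert target_column_name in pdf.columns, "target_column_name must name a column of pdf"
  import numpy as np
  import pandas as pd
  if start_test is not None:
    # Rows dated on or after start_test form the test set
    is_test = pdf.index >= pd.to_datetime(start_test)
  else:
    # Otherwise the first train_pct of the rows form the train set
    is_test = np.arange(len(pdf)) >= int(len(pdf) * train_pct)
  train_pdf, test_pdf = pdf.loc[~is_test], pdf.loc[is_test]
  return {'train_x': train_pdf.drop(columns=target_column_name),
          'train_y': train_pdf.loc[:, [target_column_name]],
          'test_x' : test_pdf.drop(columns=target_column_name),
          'test_y' : test_pdf.loc[:, [target_column_name]],
         }

## Time series datasets

In [15]:
# Build a sample time series: the daily temperatures dataset indexed by date
sample_timeseries_pdf = get_daily_temps_pdf()
sample_timeseries_pdf['Date'] = pd.to_datetime(sample_timeseries_pdf['Date'])
sample_timeseries_pdf = sample_timeseries_pdf.set_index('Date')
splits = train_test_split_timeseries(sample_timeseries_pdf, start_test='1990-01-01', target_column_name='Temp')
{name: split.shape for name, split in splits.items()}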
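
For comparison, `train_test_split` itself can produce a chronological split when shuffling is switched off with `shuffle=False`: the first rows become the train set and the last rows the test set. Below is a minimal sketch on a made-up daily series, assuming the rows are already sorted by date. The values and the `sample_pdf` name are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily series of 10 observations, already in date order
sample_pdf = pd.DataFrame({'Temp': np.arange(10, dtype=float)},
                          index=pd.date_range('1981-01-01', periods=10, freq='D'))

# shuffle=False keeps the original order: first 80% for training, last 20% for testing
train_pdf, test_pdf = train_test_split(sample_pdf, test_size=0.2, shuffle=False)

print(train_pdf.index.max())  # last training date
print(test_pdf.index.min())   # first test date
```

Every training date precedes every test date, which is the property a time series split needs so that the model is never evaluated on data older than what it was trained on.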

In [3]:
# Using graph_objects
import plotly.graph_objects as go

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')

fig = go.Figure([go.Scatter(x=df['Date'], y=df['AAPL.High'])])
fig.show()

__The End__ (test)