# Day 1 - Part C: Data preparation with Numpy and Pandas

Numpy and Pandas are Python libraries that automate most data manipulation tasks for you. This notebook provides an overview of both and guides you through data preparation steps including cleaning, encoding and normalization.

The cell below imports all the libraries used throughout this notebook. Remember to run it every time you open the notebook. Otherwise, the majority of the functions used here will not work!

In [None]:
import numpy as np
from numpy.linalg import inv

import pandas as pd
from sklearn.utils import resample
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# 1. Introduction to Numpy
Numpy adds support for manipulating large, multidimensional arrays. It supports operations such as addition, multiplication, inversion and many others. Here, you will learn how to create numpy arrays and matrices, and how to perform basic operations on them.

For more information refer to: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

## 1.1. Basic numpy operations
In numpy, you can create arrays from lists.

In [None]:
some_array = np.array([1, 2, 3, 4, 5])
print(some_array)

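Numpy arrays can hold different kinds of values: strings, floats, integers etc. Note that all elements of a single array share one type, so the integer 2 in the second example below is stored as a float.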

In [None]:
some_array = np.array(['blue', 'red', 'orange', 'white'])
print(some_array)

some_array = np.array([1.5, 2, 3.6, 4.1, 5.17])
print(some_array)
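
The short sketch below illustrates how numpy picks a single type for an array when the input list mixes kinds of values. The variable names are made up for this illustration.

```python
import numpy as np

# A list of floats and integers becomes an array of floats.
mixed_numbers = np.array([1.5, 2, 3])
print(mixed_numbers.dtype)

# Mixing strings and numbers converts every element to a string.
mixed_types = np.array(['blue', 1, 2.5])
print(mixed_types.dtype)
print(mixed_types)
```

Checking the *dtype* attribute is a quick way to confirm what an array actually stores.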

You can access elements of an array using [ ], just as with Python lists. Indexes start from 0, therefore the code below accesses the ** *2nd* ** element.

In [None]:
some_array[1]

Numpy arrays are mutable, therefore their elements can be changed by assigning a value. Below, the ** *4th* ** element of an array is changed to 15.

In [None]:
print("Before change:")
print(some_array)

some_array[3] = 15

print("\nAfter change:")
print(some_array)

You can assign an existing numpy array to a second variable, but be careful when doing so: modifying the array through one variable will modify it for the other as well!

In [None]:
some_array = np.array([1.5, 2, 3.6, 4.1, 5.17])
some_array2 = some_array

print("Before change:")
print("some_array:", some_array)
print("some_array2:", some_array2)

some_array[3] = -6

print("\nAfter change:")
print("some_array:", some_array)
print("some_array2:", some_array2)

As you can see, elements in **both** arrays changed even though only one of them was modified! This is because *some_array* and *some_array2* are two names referring to the same array in memory. You are not required to know the details, but keep in mind that assigning an existing numpy array to another variable does not copy it, so changes made through one name are visible through the other.
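
If you need an independent array, one option is to request a copy explicitly. The sketch below uses numpy's *copy()* method; the variable names are chosen only for this illustration.

```python
import numpy as np

original = np.array([1.5, 2, 3.6, 4.1, 5.17])
independent = original.copy()  # a separate array holding the same values

original[3] = -6  # only the original is modified

print("original:", original)
print("independent:", independent)
```

After the assignment, only *original* holds -6, while *independent* keeps the value 4.1.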

## 1.2 2D arrays
You can create arrays which are 2-dimensional.

In [None]:
np.array( [ [1, 2, 3], [3, 4, 5] ] )

To create a 2D array, you need to pass a list of rows to the ** *np.array()* ** function, where each row is a list of elements. You are provided with some examples below, but feel free to experiment with sizes!

Matrix 3x2:

In [None]:
np.array( [ [1, 2], [3, 3], [4, 5] ] )

Matrix 1x6:

In [None]:
np.array( [ [1, 2, 3, 3, 4, 5] ] )

Matrix 6x1:

In [None]:
np.array( [ [1], [2], [3], [3], [4], [5] ] )

## 1.3 Basic operations on arrays

Numpy is very efficient with array operations. Given two arrays A and B you can add them, multiply them, or calculate the inverse. Note that the * operator multiplies arrays element by element. Examples are provided below. For more information refer to: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matrix.html

In [None]:
A = np.array([ [1, 2], [0, 4]])
B = np.array([ [4, 0], [3, -1]])

print(A)
print()
print(B)

In [None]:
A + B

In [None]:
A - B

In [None]:
A * B
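
To make the difference between element-wise multiplication and the matrix product concrete, here is a small sketch using the same values of A and B. The @ operator is numpy's matrix product; the result names are chosen for this illustration.

```python
import numpy as np

A = np.array([[1, 2], [0, 4]])
B = np.array([[4, 0], [3, -1]])

elementwise = A * B     # multiplies values at matching positions
matrix_product = A @ B  # rows of A combined with columns of B

print(elementwise)
print()
print(matrix_product)
```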

To calculate the inverse you need the ** *inv* ** function from *numpy.linalg*. It is imported for you in the first cell of this notebook.

In [None]:
inv(A)

## 1.4 Functions for creating custom arrays
Numpy provides functions that generate arrays for you. Examples are provided below.

** *np.zeros( (2, 2) )* ** creates a 2x2 matrix containing only zeros, ** *np.ones( (2, 3) )* ** creates a 2x3 matrix containing only ones, and ** *np.random.rand(3, 2)* ** creates a 3x2 matrix containing random numbers from the interval [0, 1).

You can create 1-dimensional arrays, 2-dimensional matrices, or arrays with even more dimensions.

In [None]:
np.zeros((2, 2))

In [None]:
np.ones((2, 3))

In [None]:
np.random.rand(3,2)

4D matrix 3x2x2x4:

In [None]:
np.random.rand(3, 2, 2, 4)

## 1.5 Slicing arrays
Numpy allows you to slice the original array and work on the slice instead of the whole dataset.

In [None]:
some_array = np.random.rand(3, 3)

print(some_array)

slice_array = some_array[0:2, 1:3]

print()
print(slice_array)

You can specify which rows/columns should be included by using the format *array[ row_start:row_end, column_start:column_end ]*. Rows and columns in numpy (and also in Pandas) are indexed from 0, therefore if you want to include the first and second column you need to use 0:2, for the second and third use 1:3, etc. The notation 1:4 means including the elements at indexes 1, 2 and 3. Note that 4 is not included.
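
As a quick illustration of the slicing order, the sketch below builds a small array with known values, so you can see that the first range selects rows and the second selects columns. The array and variable names are made up for this example.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0 to 11

first_two_rows = grid[0:2, :]      # rows 0 and 1, all columns
last_two_columns = grid[:, 2:4]    # all rows, columns 2 and 3

print(grid)
print()
print(first_two_rows)
print()
print(last_two_columns)
```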

Again, remember that working on the slice will modify the original array too! The slice of an array has different indexing now, therefore the same element in the example below will have [0, 0] location in the slice array, and [0, 1] location in the original array.

In [None]:
print("Before change:")
print("some_array:\n", some_array)
print("slice_array:\n", slice_array)

# modifying the slice also modifies the original array
slice_array[0, 0] = -6

print("\nAfter change:")
print("some_array:\n", some_array)
print("slice_array:\n", slice_array)

print()
print(slice_array[0, 0], some_array[0, 1])

# 2. Introduction to Pandas

Pandas, similarly to numpy, provides support for large, multidimensional arrays. Additionally, it is equipped with many more functions. The following section provides some key information about using the library for data manipulation. You are free to read more here: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

## 2.1 Importing data from files
The ** *read_csv()* ** function reads CSV files; pandas can also read other formats, e.g. Excel files. Below is an example code snippet which reads the file *e_commerce.csv* and loads it into a pandas DataFrame.

Useful arguments of the function:
- ** *sep* ** specifies the separator between cells,
- ** *encoding* ** sets the file encoding. The default encoding is *utf-8*, however you might try *ISO-8859-1* if utf-8 does not work.

Run the code below to examine the first rows of the dataset with the ** *dataFrame.head()* ** function. Try changing the encoding to *utf-8* and see what happens!

In [None]:
dataFrame = pd.read_csv('./datasets/ecommerce_data/e_commerce.csv', sep=',', encoding='ISO-8859-1')
dataFrame.head()

## 2.2 Creating DataFrame from a python dictionary
Alternatively, you can create a Pandas DataFrame from a list of dictionaries, where each dictionary describes one row:

In [None]:
data = [{'color_name': 'black', 'R': 0, 'G': 0, 'B': 0}, {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255}, 
        {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0}, {'color_name': 'blue', 'R': 0, 'G': 0, 'B': 255},
        {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}]

dataFrame = pd.DataFrame(data)
dataFrame

## 2.3 Inspecting the dataset

To inspect the dataset use the ** *head()* ** function. This will show you the first 5 rows of the dataset.

In [None]:
dataFrame.head()

However, the output is displayed only when ** *head()* ** is the last statement of the code block. If you want to print your data frame somewhere in the middle of the code block use:

In [None]:
print(dataFrame.head())

## 2.4 Selecting, modifying and removing rows/columns

### Selecting a column
A good thing about Pandas is that you can refer to specific columns simply by using their names:

In [None]:
dataFrame['B']

In [None]:
dataFrame['color_name']

### Selecting rows
To select a range of rows use the code below. dataFrame[2:4] selects all columns for the rows at positions 2 and 3 (4 is not included).

In [None]:
dataFrame[2:4]

If you want to select specific rows use ** *loc* ** or ** *iloc* **. loc returns the rows that match the given index labels, whereas iloc returns the rows at the given integer positions in the data frame.

In [None]:
dataFrame.loc[[0,2]]
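
The difference between loc and iloc is easier to see when the index labels are not numbers. The sketch below uses a small, made-up data frame whose rows are labelled with color names.

```python
import pandas as pd

colors = pd.DataFrame(
    {'R': [0, 255, 255], 'G': [0, 255, 0]},
    index=['black', 'white', 'red'],  # row labels chosen for this example
)

by_label = colors.loc['white']   # the row whose index label is 'white'
by_position = colors.iloc[2]     # the third row, whatever its label is

print(by_label)
print()
print(by_position)
```

With the default numeric index used elsewhere in this notebook, labels and positions coincide, which is why both attributes can appear to behave the same.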

You can also select the first rows of the dataset, e.g. the first 2, like below:

In [None]:
dataFrame[:2]

Or the last rows:

In [None]:
dataFrame[-2:]

### Selecting multiple columns
You can select multiple columns by passing them as a list of column names, just like below:

In [None]:
column_slice = dataFrame[['B','color_name']]
column_slice.head()

Selecting columns with a list of names returns a new DataFrame rather than a reference to the original, therefore modifying the selection will not change the original dataset. In the example below the selection is modified through .loc, and pandas may display a *SettingWithCopyWarning* because it cannot tell whether you meant to change the original. If you do want to change the original data frame, apply .loc or .iloc to it directly, e.g. *dataFrame.loc[0, 'B'] = 14*.

In [None]:
column_slice = dataFrame[['B','color_name']]

print("\nBefore changes:")
print(column_slice)
print()
print(dataFrame)

column_slice.loc[0, 'B'] = 14

print("\nAfter changes:")
print(column_slice)
print()
print(dataFrame)
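
The following sketch contrasts the two intentions on a small, made-up data frame: changing the original through .loc, and changing an explicit copy of a column selection. The names used here are illustrative only.

```python
import pandas as pd

colors = pd.DataFrame({'color_name': ['black', 'white'], 'B': [0, 255]})

# Intention 1: change the original data frame itself.
colors.loc[0, 'B'] = 14

# Intention 2: work on an independent copy of some columns.
subset = colors[['B']].copy()
subset.loc[1, 'B'] = 1  # the original stays untouched

print(colors)
print()
print(subset)
```

Calling *copy()* states the intention explicitly, which is one way to avoid ambiguity between views and copies.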

You can find more on indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html; and more on slice modification here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

### Column names and indices
You can return column names by using:

In [None]:
dataFrame.columns

You can see the row numbers/names by using:

In [None]:
dataFrame.index

### Conditional selection
You can apply conditional selection by using ** dataFrame[*column_name*] *condition* *value* ** as specified below. The result of such an expression is a Series of True/False values which can be used to display the rows that fulfil the condition. For example, the code below selects all rows that have *R* higher than 100.

In [None]:
select_greater_than_100 = (dataFrame['R'] > 100)
select_greater_than_100

In [None]:
dataFrame[select_greater_than_100]
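
Conditions can also be combined. The sketch below, written against a copy of the color data under the made-up name *colors*, uses & for "and" and | for "or". Each condition needs its own parentheses.

```python
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red', 'blue', 'green'],
    'R': [0, 255, 255, 0, 0],
    'G': [0, 255, 0, 0, 255],
    'B': [0, 255, 0, 255, 0],
})

# rows with a high R value AND a low G value
red_without_green = colors[(colors['R'] > 100) & (colors['G'] < 100)]

# rows with a high G value OR a high B value
green_or_blue = colors[(colors['G'] > 100) | (colors['B'] > 100)]

print(red_without_green)
print()
print(green_or_blue)
```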

### Creating new columns
You can create a new column just by assigning values to a non-existent column name. You can use completely new values, or create the column using values from other columns, as shown below:

In [None]:
dataFrame['color_number'] = 0
dataFrame.head()

In [None]:
dataFrame['color_number'] = 65536 * dataFrame['R'] + 256 * dataFrame['G'] + dataFrame['B']
dataFrame.head()

### Removing columns
You can remove a column by using ** *del* **:

In [None]:
del dataFrame['color_number']
dataFrame.head()

Or simply pop it from the dataset. This operation removes the column from the DataFrame and returns its values as a pandas Series.

In [None]:
new_dataFrame = dataFrame.pop('color_name')
print(new_dataFrame)
print()
print(dataFrame)

Now we add it back to the dataset:

In [None]:
dataFrame['color_name'] = new_dataFrame
dataFrame.head()

# 3. The Data Science Process

The usual data science process contains the following steps:
1. Initial inspection of the dataset
2. Data preparation (cleaning, feature selection, normalization and splitting dataset into train/cross-validation/test)
3. Building the model
4. Evaluating the model performance

Before machine learning can be applied to a dataset, significant effort needs to be put into preparing the dataset for the model. Poorly performed data preparation might result in poor performance or unreliable results. This Jupyter notebook covers initial inspection of the dataset, data cleaning, normalization and splitting the dataset into train/cross-validation/test sets.

We will do so using the example e-commerce dataset.

## 3.1 Data cleaning and initial inspection

In [None]:
dataFrame = pd.read_csv('./datasets/ecommerce_data/e_commerce.csv', sep=',', encoding='ISO-8859-1', dtype={
    'InvoiceNo': str, 
    'StockCode': str,
    'Description': str,
    'Quantity': np.float64,
    'InvoiceDate': str,
    'UnitPrice': np.float64,
    'CustomerID': np.float64,
    'Country': str
})
dataFrame.head()

We can see that the dataset includes 8 columns: *InvoiceNo*, *StockCode*, *Description*, *Quantity*, *InvoiceDate*, *UnitPrice*, *CustomerID*, and *Country*.

Numerical variables include: *Quantity*, *UnitPrice*, and *CustomerID*.

Categorical variables include: *InvoiceNo*, *StockCode*, *Description*, and *Country*. For now they are stored in the data frame as strings.

There is only one datetime type variable: *InvoiceDate*, again stored in the data frame as a string.

You can inspect the types of the variables using:

In [None]:
dataFrame.dtypes

### 3.1.1 Replacing missing values
We will check whether any columns contain missing values.

In [None]:
dataFrame.isnull().any()

We can see that the *Description* and *CustomerID* features contain missing values. The decision about what to do with these depends on the circumstances. The most popular options are:

- remove missing values
- replace them with mean/most common category
- replace with the value of the previous row (forward-fill)
- replace with the value of the next row (backward-fill)

#### Removing missing values
Missing values are removed using the ** *dropna()* ** function. It is important to assign the reduced DataFrame to a variable, otherwise the changes will not have any effect. An example of such behaviour is presented below.


In [None]:
print(dataFrame.shape)

dataFrame.dropna()

print(dataFrame.shape)

As you can see, the size of the DataFrame did not change, which means that the empty rows were **NOT** removed. Below is the corrected version of the code. The dataset has been reduced by almost 67 000 samples.

In [None]:
print(dataFrame.shape)

dataFrame = dataFrame.dropna()

print(dataFrame.shape)

In [None]:
dataFrame.isnull().any()

#### Replacing missing values with the mean
First we will set *Quantity* in the row with index 1 to null, and then observe how it has been replaced.

In [None]:
dataFrame.loc[1, 'Quantity'] = np.nan
dataFrame.head()

Next, we replace the missing values in the *Quantity* column with the mean value of that column.

In [None]:
dataFrame['Quantity'] = (dataFrame['Quantity'].replace(np.nan, dataFrame['Quantity'].mean()))
dataFrame.head()
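
An equivalent and commonly used alternative is the *fillna()* method. The sketch below shows it on a short, made-up Series rather than on the e-commerce data.

```python
import numpy as np
import pandas as pd

quantities = pd.Series([6.0, np.nan, 8.0, 10.0])

# mean() skips missing values, so the mean here is (6 + 8 + 10) / 3
filled = quantities.fillna(quantities.mean())

print(filled)
```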

#### Replacing missing values with previous row (forward-fill)

First, we set *StockCode* and *InvoiceDate* to null in the row with index 7.

In [None]:
dataFrame.loc[7, 'StockCode'] = np.nan
dataFrame.loc[7, 'InvoiceDate'] = np.nan
dataFrame[3:10]

Next, we replace the missing values with the previous value (forward fill). Please note that this method replaces missing values in all columns, using the previous value from the same column. It can be very useful when replacing missing values in time series.

In [None]:
# ffill() is the forward-fill method; fillna(method='ffill') is deprecated in recent pandas versions
dataFrame = dataFrame.ffill()
dataFrame[3:10]

#### Replacing missing values with next row (backward-fill)

In [None]:
dataFrame.loc[7, 'StockCode'] = np.nan
dataFrame.loc[7, 'InvoiceDate'] = np.nan
dataFrame[3:10]

In [None]:
# bfill() is the backward-fill method; fillna(method='bfill') is deprecated in recent pandas versions
dataFrame = dataFrame.bfill()
dataFrame[3:10]

### 3.1.2 Initial inspection and data cleaning

Now we can investigate what the numerical columns look like by using the ** *describe()* ** function.

In [None]:
dataFrame.describe()

It can be observed that the *Quantity* feature can have negative values, which is not what we would expect. These values can be removed from the dataset by applying conditional selection discussed before.

First, we check how many rows have negative *Quantity* values.

In [None]:
len(dataFrame[dataFrame['Quantity'] < 0])

Now, we will find these rows using conditional selection and remove them.

In [None]:
print(dataFrame.shape)

indexes_with_negative_quantity = dataFrame[dataFrame['Quantity'] < 0].index
dataFrame = dataFrame.drop(indexes_with_negative_quantity)

print(dataFrame.shape)

We can observe that the dataset has been reduced by exactly 4098 rows.

Interestingly, there are 13 rows which have a unit price equal to 0. This raises a warning flag, since items are rarely given away for free. We will display these rows.

In [None]:
len(dataFrame[dataFrame['UnitPrice'] == 0])

In [None]:
dataFrame[dataFrame['UnitPrice'] == 0]

However unusual, these items might still be valid entries because they might have been sold as an addition to another item, or bought with a voucher, therefore they will be kept in the dataset.

You can further inspect the dataset by calling functions such as min(), max(), mean(), etc. directly on the data frame.

In [None]:
dataFrame.max()

Note that for strings the minimum and maximum values are evaluated based on the characters included in the string. For example, the letter "A" is considered "smaller" than the letter "Z". Moreover, digit characters e.g. "8" are always considered smaller than letters.
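
The ordering of strings can be checked with plain Python, as in the short sketch below. The list of values is made up for this illustration; note that lowercase letters sort after uppercase ones.

```python
codes = ["Z", "8", "A", "a"]

print(sorted(codes))            # digits first, then uppercase, then lowercase
print(min(codes), max(codes))
```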

In [None]:
dataFrame.min()

Similarly, you can apply functions ** *mode()* ** or ** *mean()* **.

More functions can be found here: https://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats

You can check out the correlation coefficients between columns of numerical values by using *corr()* function. In the next notebook on data visualisation you will learn how to create a correlation heatmap.

In [None]:
# numeric_only=True (available from pandas 1.5) skips the string columns
dataFrame.corr(numeric_only=True)

### Invalid categorical values

If you want to check whether a categorical column contains invalid entries you can use the *value_counts()* function, as below.

In [None]:
dataFrame['Country'].value_counts()

The same function can be applied to numerical columns.

Most of the values seem to be valid entries, however some of the countries are categorized as "Unspecified". It is our decision whether to remove the rows which have this value or not. In this case we will remove them by executing the following code:

In [None]:
print(dataFrame.shape)

dataFrame = dataFrame.drop(dataFrame[dataFrame['Country'] == 'Unspecified'].index)

print(dataFrame.shape)

where *dataFrame[dataFrame['Country'] == 'Unspecified']* selects all the rows that have 'Unspecified' country; ** *.index* ** returns indexes associated with these rows; and ** *dataFrame.drop()* ** removes these indexes from the dataset.

We have removed exactly 72 rows, as specified for 'Unspecified' by value_counts().

## 3.2 Encoding categorical variables

Most machine learning methods do not work directly with categories, so there is a need to transform these into numerical values. There are two options to do it:

- encoding each category by number e.g. Monday - 1, Tuesday - 2 etc.
- creating dummy columns which encode the category. For example, for the day of the week we would create 7 columns, each corresponding to one day of the week.

To demonstrate both approaches, we will switch to the color dataset used at the beginning of the notebook.

In [None]:
data = [{'color_name': 'black', 'R': 0, 'G': 0, 'B': 0}, {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255}, 
        {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0}, {'color_name': 'blue', 'R': 0, 'G': 0, 'B': 255},
        {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}, {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}]

new_dataFrame = pd.DataFrame(data)
new_dataFrame['color_name2'] = new_dataFrame['color_name']
new_dataFrame

### 3.2.1 Numerical encoding of categorical variables

First, we will transform the type of the column to *category*, so that pandas maps each category to a unique code.

In [None]:
new_dataFrame['color_name'] = new_dataFrame['color_name'].astype('category')

Next, we will update the column with corresponding category codes

In [None]:
new_dataFrame['color_name'] = new_dataFrame['color_name'].cat.codes
new_dataFrame

We can see that each color now has a category code assigned: *black* - 0, *white* - 4, *red* - 3, etc.
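
If you want to see which code corresponds to which category, one option is to read the *cat.categories* attribute before replacing the column with codes. The sketch below uses a stand-alone Series with the same color names; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series(['black', 'white', 'red', 'blue', 'green', 'green']).astype('category')

# categories are stored in sorted order; a code is the position in that order
code_to_name = dict(enumerate(names.cat.categories))

print(code_to_name)
print(names.cat.codes.tolist())
```

Keeping such a mapping around makes it possible to translate the codes back into readable names later.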

### 3.2.2 Creating dummy columns (also called One Hot Encoding)

Use the ** *pd.get_dummies()* ** function to expand a specific column into dummy columns. The *columns* parameter specifies the names of the columns that will be expanded using one hot encoding.

In [None]:
new_dataFrame = pd.get_dummies(new_dataFrame, columns=['color_name2'])

In [None]:
new_dataFrame

Now we can see that the *color_name2* column has been expanded to 5 binary columns.

## 3.3 Unbalanced datasets

Datasets often contain unbalanced categorical variables, which means that some values appear more frequently than others. When the unbalanced feature is the target, i.e. the feature that we want to predict using machine learning models, it is a problem because our model will be trained to predict the majority category most of the time. More on this will be presented in the notebook dealing with evaluating the performance of a machine learning model.

There are many different methods to deal with the problem of unbalanced datasets, however in this notebook we will focus on two of them: under-sampling and over-sampling. Under-sampling reduces the size of our original dataset by removing instances of the majority class to match the minority class. Over-sampling adds new data points which are similar to those of the minority class.

Please note that under-sampling and over-sampling might not be the best way to deal with unbalanced datasets, as they can introduce bias. For example, under-sampling will remove instances of the majority class that might be meaningful. However, they still might work in practice in some cases.

Under-sampling and over-sampling will be presented below, however before we do so, we will slightly modify the e-commerce dataset.

Let's assume that we are interested only in transactions which happened in the United Kingdom and France. Below is the code which selects all the rows which have the 'United Kingdom' value in the column *Country*, and those with the 'France' value in the same column.

Both selections are then concatenated together using the ** *pd.concat()* ** pandas function. Concatenation adds one data frame below the other; this implies that both data frames need to have the same columns. More about concatenation (and other merging functions) can be found here: https://pandas.pydata.org/pandas-docs/stable/merging.html

In [None]:
UKData = dataFrame.loc[dataFrame['Country'] == 'United Kingdom']
FranceData = dataFrame.loc[dataFrame['Country'] == 'France']

dataFrame = pd.concat([UKData, FranceData])
dataFrame

Now we can confirm that our dataset contains only entries related to the United Kingdom and France. Moreover, we can observe that there are many more entries related to the UK than to France. This means that this dataset is *unbalanced*, therefore we will be applying under-sampling and over-sampling.

In [None]:
dataFrame['Country'].value_counts()

### 3.3.1 Under-sampling

Under-sampling removes instances of the majority class so that their number matches the number of minority class instances. We will do it using the ** *resample* ** function from the sklearn library. First, we will create slices of the original dataset, where the first slice contains the majority class ("United Kingdom"), and the second slice contains the minority class ("France").

In [None]:
df_majority = dataFrame[dataFrame['Country']=='United Kingdom']
df_minority = dataFrame[dataFrame['Country']=='France']
minority_len = (len(df_minority))

Now, we will run the resample method with parameter *replace* equal to False, and n_samples equal to the number of instances in the minority class. The reduced majority class will be concatenated with the minority class, and we can observe that there is now exactly the same number of instances for majority and minority classes.

In [None]:
df_majority_downsampled = resample(df_majority,replace=False,n_samples=minority_len,random_state=123)
ddown = pd.concat([df_minority, df_majority_downsampled])
ddown['Country'].value_counts()

### 3.3.2 Over-sampling

Over-sampling creates copies of dataset instances to match the required number of samples. Below, we count the number of instances in the majority class, and create additional instances in the minority class to match the majority class.

We can now see that there is the same number of instances for both classes, and it is equal to the initial size of the majority class.

In [None]:
majority_len = (len(df_majority))

df_minority_upsampled = resample(df_minority,replace=True,n_samples=majority_len,random_state=123)
dup = pd.concat([df_majority, df_minority_upsampled])
dup['Country'].value_counts()

## 3.4 Normalization

Many machine learning algorithms require, or at least benefit from, the dataset they are trained on being normalized. Below is an example of normalizing the e-commerce dataset.

First, we will encode the categorical variables. We will use category codes for *InvoiceNo*, *StockCode* and *Description*, and One Hot Encoding for *Country*.

In [None]:
dataFrame.columns

In [None]:
dataFrame['StockCode'] = dataFrame['StockCode'].astype('category')
dataFrame['StockCode'] = dataFrame['StockCode'].cat.codes

dataFrame['Description'] = dataFrame['Description'].astype('category')
dataFrame['Description'] = dataFrame['Description'].cat.codes

dataFrame['InvoiceNo'] = dataFrame['InvoiceNo'].astype('category')
dataFrame['InvoiceNo'] = dataFrame['InvoiceNo'].cat.codes

dataFrame = pd.get_dummies(dataFrame, columns=['Country'])
dataFrame.head()

The *InvoiceDate* field would need to be translated into a timestamp in order to be correctly normalized, however as this is outside the scope of this notebook we will remove it from our data frame. If you are interested, you can read more about timestamps here: https://pandas.pydata.org/pandas-docs/stable/timeseries.html

In [None]:
del dataFrame['InvoiceDate']

We will use the ** *preprocessing* ** package from sklearn to normalize our dataset. We have to normalize column by column. Otherwise the normalization would be computed across different columns, mixing unrelated features.

The *preprocessing.normalize()* function expects a 2-dimensional array and normalizes each row of it, therefore we first reshape each column of the data frame into a single row vector, normalize it, and then reshape the result back into a column and update the data frame.

In [None]:
for column in dataFrame.columns:
    vector = dataFrame[column].values.reshape(1, len(dataFrame[column]))
    normalized_vector = preprocessing.normalize(vector, norm="l2")
    dataFrame[column] = normalized_vector.reshape(len(dataFrame[column]), 1)
dataFrame.head()
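
To see what L2 normalization does to a single column, the sketch below reproduces the calculation with plain numpy on a made-up two-element vector: every value is divided by the Euclidean length of the vector.

```python
import numpy as np

column = np.array([3.0, 4.0])

l2_norm = np.sqrt(np.sum(column ** 2))  # Euclidean length of the vector
normalized = column / l2_norm

print(l2_norm)
print(normalized)
print(np.sum(normalized ** 2))  # squared values of a normalized vector sum to 1
```

Because the whole column is treated as one vector, the individual values become smaller as the number of rows grows.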

## 3.5 Splitting the dataset

In this section we will discuss two ways of splitting the datasets:
- Splitting into train, cross-validation and test sets
- K-fold cross-validation

We will perform the split on the e-commerce dataset, where first we will extract the variable we want to predict - let it be *UnitPrice* in this case.

In [None]:
target = dataFrame.pop('UnitPrice')

### 3.5.1 Splitting into train, cross-validation, and test sets

To split the original dataset into three sets (train, cross-validation and test) we can use the ** *train_test_split()* ** function twice. The first split divides our dataset into (train + cross-validation) and the test set. The second split divides the remaining data into train and cross-validation. If we want 60% training, 20% cross-validation, and 20% test, we should first use 0.2 test size and 0.8 train size. When dividing for the second time we have to take into account that the dataset is now reduced, therefore we have to adjust the ratios to 0.2/0.8 = 0.25 test size and 0.6/0.8 = 0.75 train size, which gives us the train and cross-validation sets.
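
The arithmetic behind the two splits can be checked with a few lines of plain Python. The dataset size of 1000 rows is a hypothetical number chosen to keep the calculation easy to follow.

```python
total_rows = 1000  # hypothetical dataset size

# first split: 20% of all rows go to the test set
test_rows = int(total_rows * 0.2)
remaining_rows = total_rows - test_rows

# second split: 25% of the remaining rows go to cross-validation
cv_rows = int(remaining_rows * 0.25)
train_rows = remaining_rows - cv_rows

print(train_rows, cv_rows, test_rows)
```

The three sizes correspond to 60%, 20% and 20% of the hypothetical dataset.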

In [None]:
# split into (train + cross validation) and test sets
X_x, X_test, y_x, y_test = train_test_split(dataFrame, target, test_size=0.2, train_size=0.8, random_state=43)

# split into train and cross validation
X_train, X_cv, y_train, y_cv = train_test_split(X_x, y_x, test_size=0.25, train_size=0.75, random_state=43)

print(X_train.shape, X_cv.shape, X_test.shape)

### 3.5.2 K-fold

To perform K-fold you can use the ** *KFold()* ** class provided by scikit-learn. The argument ** *n_splits* ** defines how many parts we want to divide the dataset into, and ** *shuffle* ** indicates whether we want to shuffle the dataset before we split it. KFold then returns the positions of the chosen rows, from which we can build our train and test subsets using .iloc. The example wraps each selection in a new DataFrame object; if you plan to modify a subset, calling .copy() on it makes the separation from the original dataset explicit.

In [None]:
kf = KFold(n_splits=4, shuffle=True, random_state=43)

print("Original dataset:")
print(dataFrame.shape)

kfold_count = 0
for train_indexes, test_indexes in kf.split(dataFrame):
    train_dataset = pd.DataFrame(dataFrame.iloc[train_indexes])
    test_dataset = pd.DataFrame(dataFrame.iloc[test_indexes])
    train_target = pd.DataFrame(target.iloc[train_indexes])
    test_target = pd.DataFrame(target.iloc[test_indexes])
    
    kfold_count += 1
    print("Kfold", str(kfold_count) + "-iteration:")  
    
    print("Train:", train_dataset.shape, "Test:", test_dataset.shape)

If you want to read more on splitting the dataset go to: http://scikit-learn.org/stable/modules/cross_validation.html