# Day 1 - Part C: Data cleaning and preprocessing with Numpy and Pandas

Numpy and Pandas are Python libraries that automate most data manipulation tasks for you. This notebook provides an overview of both and guides you through the data manipulation, cleaning and preprocessing steps that are performed as part of the data science process.

# 1. Introduction to Numpy
Numpy adds support for large, multidimensional arrays to Python. It supports operations such as addition, multiplication, inversion and many others.

For more information refer to: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

## 1.1 Importing Numpy

In [10]:
import numpy as np

In numpy, you can create arrays from python lists.

## 1.2 Basic numpy operations

In [11]:
some_array = np.array([1, 2, 3, 4, 5])
print(some_array)

[1 2 3 4 5]


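Arrays can hold different kinds of data: strings, floats, integers etc. All elements of a single array share one data type, however, so mixed inputs are converted to a common type (note how the integer 2 is stored as 2. in the float array below).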

In [22]:
some_array = np.array(['blue', 'red', 'orange', 'white'])
print(some_array)

some_array = np.array([1.5, 2, 3.6, 4.1, 5.17])
print(some_array)

['blue' 'red' 'orange' 'white']
[ 1.5   2.    3.6   4.1   5.17]
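
The data type that numpy chose for an array can be checked through its *dtype* attribute. The following is a minimal sketch; the variable names are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

floats = np.array([1.5, 2, 3.6])      # the integer 2 is converted to a float
strings = np.array(['blue', 'red'])   # stored as a fixed-width string type
mixed = np.array([1, 'red', 2.5])     # numbers are converted to strings

print(floats.dtype)
print(strings.dtype)
print(mixed.dtype)
```

Mixing numbers and strings in one array is therefore rarely useful, because the numbers lose their numeric type.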


You can access elements of an array using [ ], just as with Python lists.

In [23]:
some_array[1]

2.0

Elements of a numpy array are mutable, so they can be changed as follows:

In [24]:
some_array[3] = 15
print(some_array)

[  1.5    2.     3.6   15.     5.17]


You can assign a whole numpy array to another variable. Be careful when doing so, however: the assignment does not copy the data, so modifying the array through one variable modifies it for the other as well!

In [33]:
some_array = np.array([1.5, 2, 3.6, 4.1, 5.17])
some_array2 = some_array

print("Before change:")
print("some_array:", some_array)
print("some_array2:", some_array2)

some_array[3] = -6

print("\nAfter change:")
print("some_array:", some_array)
print("some_array2:", some_array2)

Before change:
some_array: [ 1.5   2.    3.6   4.1   5.17]
some_array2: [ 1.5   2.    3.6   4.1   5.17]

After change:
some_array: [ 1.5   2.    3.6  -6.    5.17]
some_array2: [ 1.5   2.    3.6  -6.    5.17]


As you can see, elements in **both** arrays were modified even though only one of them was changed! This is because the variables *some_array* and *some_array2* are references to the same underlying array, not two separate arrays. We will not discuss references in detail here, and you are not required to know how they work. Keep in mind, however, that assigning a numpy array to a second variable binds both names to the same data, so a change made through either name is visible through the other.
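
If an independent array is needed, numpy arrays offer a *copy()* method. The sketch below contrasts plain assignment with an explicit copy; the names *original*, *alias* and *independent* are chosen for illustration only.

```python
import numpy as np

original = np.array([1.5, 2, 3.6, 4.1, 5.17])
alias = original               # second name for the same array
independent = original.copy()  # separate array with its own data

original[3] = -6

print(alias[3])        # follows the change made through `original`
print(independent[3])  # keeps the value it had when it was copied
```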

## 1.3 2D arrays
You can also create 2-dimensional arrays.

In [39]:
np.array( [ [1, 2, 3], [3, 4, 5] ] )

array([[1, 2, 3],
       [3, 4, 5]])

Here, you pass [ ] to the *np.array()* function and, **inside** the brackets, specify the contents of each row. Rows are separated by commas. The data in each row is again enclosed in [ ] brackets and separated by commas. Some examples are provided below, but feel free to experiment with the dimensions!

In [44]:
np.array( [ [1, 2], [3, 3], [4, 5] ] )

array([[1, 2],
       [3, 3],
       [4, 5]])

In [45]:
np.array( [ [1, 2, 3, 3, 4, 5] ] )

array([[1, 2, 3, 3, 4, 5]])

In [46]:
np.array( [ [1], [2], [3], [3], [4], [5] ] )

array([[1],
       [2],
       [3],
       [3],
       [4],
       [5]])

## 1.4 Basic operations on arrays

Numpy is very efficient with array operations. Given two arrays A and B you can add them, subtract them, multiply them, or calculate the inverse. Note that A * B multiplies the arrays element by element; it is not the matrix product. Examples are provided below. For more information refer to: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matrix.html

In [50]:
A = np.array([ [1, 2], [0, 4]])
B = np.array([ [4, 0], [3, -1]])

print(A)
print()
print(B)

[[1 2]
 [0 4]]

[[ 4  0]
 [ 3 -1]]


In [51]:
A + B

array([[5, 2],
       [3, 3]])

In [52]:
A - B

array([[-3,  2],
       [-3,  5]])

In [53]:
A * B

array([[ 4,  0],
       [ 0, -4]])
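
The result of A * B above is an element-by-element product. For the matrix product, numpy arrays support the *@* operator. A short sketch comparing the two on the same A and B:

```python
import numpy as np

A = np.array([[1, 2], [0, 4]])
B = np.array([[4, 0], [3, -1]])

elementwise = A * B     # multiplies entries at matching positions
matrix_product = A @ B  # rows of A combined with columns of B

print(elementwise)
print(matrix_product)
```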

To calculate the inverse you have to import ***inv***:

In [57]:
from numpy.linalg import inv

inv(A)

array([[ 1.  , -0.5 ],
       [ 0.  ,  0.25]])
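
One way to sanity-check an inverse is to multiply it by the original matrix and compare the result with the identity matrix. The sketch below does this with *np.allclose*, which tolerates small floating-point rounding errors.

```python
import numpy as np
from numpy.linalg import inv

A = np.array([[1, 2], [0, 4]])
A_inv = inv(A)

# A multiplied by its inverse should give the 2x2 identity matrix
is_identity = np.allclose(A @ A_inv, np.eye(2))
print(is_identity)
```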

## 1.5 Functions for creating custom arrays
Numpy provides functions that generate arrays for you, for example matrices containing only 1's or 0's, or random numbers. Examples are provided below.

***np.zeros( (2, 2) )*** creates a 2x2 matrix containing only zeros, ***np.ones( (2, 3) )*** creates a 2x3 matrix (2 rows, 3 columns) containing only ones, and ***np.random.rand(3, 2)*** creates a 3x2 matrix containing random numbers between 0 and 1.

You can create 1-dimensional arrays, 2-dimensional matrices, or arrays with even more dimensions.

In [73]:
np.zeros((2, 2))

array([[ 0.,  0.],
       [ 0.,  0.]])

In [75]:
np.ones((2, 3))

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [76]:
np.random.rand(3,2)

array([[ 0.1046997 ,  0.77646816],
       [ 0.50355231,  0.92834969],
       [ 0.69782807,  0.01370106]])

In [77]:
np.random.rand(3, 2, 2, 4)

array([[[[ 0.07813112,  0.53344132,  0.15166324,  0.71811386],
         [ 0.80316024,  0.77058829,  0.07709759,  0.64650104]],

        [[ 0.63986262,  0.85501332,  0.72759198,  0.25457561],
         [ 0.34507472,  0.96189534,  0.17849875,  0.23220683]]],


       [[[ 0.90199695,  0.82412248,  0.77324594,  0.35296839],
         [ 0.78333253,  0.11753364,  0.63065222,  0.39319407]],

        [[ 0.25404177,  0.09990884,  0.6579875 ,  0.62468771],
         [ 0.8517834 ,  0.31120845,  0.01097131,  0.26747058]]],


       [[[ 0.81845881,  0.15643635,  0.41735095,  0.10469298],
         [ 0.12688916,  0.25467244,  0.69625261,  0.61037712]],

        [[ 0.1611768 ,  0.79555788,  0.22391815,  0.06251261],
         [ 0.2394673 ,  0.86603143,  0.65930057,  0.67977097]]]])
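
Large outputs like the one above are hard to read, so it often helps to look at the *shape* and *ndim* attributes instead. A small sketch (variable names are illustrative):

```python
import numpy as np

zeros = np.zeros((2, 2))
ones = np.ones((2, 3))
random_values = np.random.rand(3, 2)
random_4d = np.random.rand(3, 2, 2, 4)

# shape lists the size along each dimension; ndim is the number of dimensions
print(zeros.shape, ones.shape, random_values.shape, random_4d.shape)
print(random_4d.ndim)
```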

## 1.6 Slicing arrays
Numpy allows you to slice the original array and work on the slice instead of the whole dataset.

In [94]:
some_array = np.ones((3, 3))

print(some_array)

slice_array = some_array[0:2, 1:3]

print()
print(slice_array)

[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]

[[ 1.  1.]
 [ 1.  1.]]


You can specify which rows/columns should be included by using the format *array[ row_start:row_end, column_start:column_end ]*. The end index is not included. Rows and columns in numpy (and also in Pandas) are indexed from 0, so if you want to include the first and second column you need to use 0:2, if the second (index 1) and third (index 2) use 1:3 etc.
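
Because an array of ones looks the same everywhere, the order of the two ranges is easier to see on an array with distinct values. A minimal sketch, using a hypothetical 3x3 array named *grid*:

```python
import numpy as np

# 3x3 array holding 0..8, so every position is recognisable
grid = np.arange(9).reshape(3, 3)

# First range selects rows 0-1, second range selects columns 1-2
block = grid[0:2, 1:3]

print(grid)
print(block)
```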

Again, remember that modifying the slice will modify the original array too! Note that the same element has different indices in the original array and in the slice: below, element [0, 0] of the slice is element [0, 1] of the original array.

In [95]:
print("Before change:")
print("some_array:\n", some_array)
print("slice_array:\n", slice_array)

slice_array[0, 0] = -6

print("\nAfter change:")
print("some_array:\n", some_array)
print("slice_array:\n", slice_array)

Before change:
some_array:
 [[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
slice_array:
 [[ 1.  1.]
 [ 1.  1.]]

After change:
some_array:
 [[ 1. -6.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
slice_array:
 [[-6.  1.]
 [ 1.  1.]]
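
A slice that should not affect the original array can be detached with *copy()*. The sketch below uses *np.shares_memory* to check whether two arrays use the same underlying data; the names *grid*, *view* and *separate* are illustrative.

```python
import numpy as np

grid = np.ones((3, 3))
view = grid[0:2, 1:3]             # slice: shares data with grid
separate = grid[0:2, 1:3].copy()  # copy: has its own data

shares_view = np.shares_memory(grid, view)
shares_copy = np.shares_memory(grid, separate)

separate[0, 0] = -6               # does not touch grid

print(shares_view, shares_copy)
print(grid[0, 1])
```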


# 2. Introduction to Pandas

Pandas, like numpy, provides support for working with large, multidimensional data. Additionally, it is equipped with many more functions. The following section provides some key information about using the library for data manipulation. You can read more here: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

## 2.1 Importing Pandas

In [None]:
import pandas as pd

## 2.2 Importing data from files
The ***read_csv()*** function reads CSV files; pandas can also read other formats, e.g. Excel files. Below is an example code snippet which reads the file *e_commerce.csv* and loads it into a pandas DataFrame.

Useful arguments of the function:
- ***sep*** specifies the separator between cells,
- ***encoding*** sets the file encoding. The default encoding is *utf-8*; you might try *ISO-8859-1* if utf-8 does not work.

Run the code below to examine the first rows of the dataset with the ***dataFrame.head()*** function. Try changing the encoding to *utf-8* and see what happens!

In [9]:
dataFrame = pd.read_csv('./datasets/ecommerce_data/e_commerce.csv', sep=',', encoding='ISO-8859-1')
dataFrame.head()

   InvoiceNo  StockCode                          Description  Quantity     InvoiceDate  UnitPrice  CustomerID         Country
0     536365     85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  12/1/2010 8:26       2.55     17850.0  United Kingdom
1     536365      71053                  WHITE METAL LANTERN         6  12/1/2010 8:26       3.39     17850.0  United Kingdom
2     536365     84406B       CREAM CUPID HEARTS COAT HANGER         8  12/1/2010 8:26       2.75     17850.0  United Kingdom
3     536365     84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  12/1/2010 8:26       3.39     17850.0  United Kingdom
4     536365     84029E       RED WOOLLY HOTTIE WHITE HEART.         6  12/1/2010 8:26       3.39     17850.0  United Kingdom
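
The effect of the *encoding* argument can also be tried without the dataset file. The sketch below builds a tiny, made-up CSV in memory (its single row and the name *raw_bytes* are assumptions for illustration), stores it as ISO-8859-1, and then reads it with both encodings.

```python
import io
import pandas as pd

# Hypothetical one-row file, encoded as ISO-8859-1 rather than UTF-8
raw_bytes = "Description,UnitPrice\nCAFÉ MUG,2.55\n".encode("ISO-8859-1")

# Reading with the wrong encoding fails on the accented character
try:
    pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False

# Reading with the matching encoding succeeds
frame = pd.read_csv(io.BytesIO(raw_bytes), sep=",", encoding="ISO-8859-1")

print(utf8_worked)
print(frame)
```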


## 2.3 Creating DataFrame from a python dictionary
Alternatively, you can create a Pandas DataFrame from a list of dictionaries, where each dictionary describes one row:

In [97]:
data = [{'color_name': 'black', 'R': 0, 'G': 0, 'B': 0}, {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255}, 
        {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0}, {'color_name': 'blue', 'R': 0, 'G': 0, 'B': 255},
        {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}]

dataFrame = pd.DataFrame(data)
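
A DataFrame can also be built from a single dictionary that maps each column name to a list of values. A minimal sketch with a shortened version of the color data (the names *columns* and *frame* are illustrative):

```python
import pandas as pd

# One key per column; every list must have the same length
columns = {
    'color_name': ['black', 'white', 'red'],
    'R': [0, 255, 255],
    'G': [0, 255, 0],
    'B': [0, 255, 0],
}

frame = pd.DataFrame(columns)

print(frame.shape)
print(frame)
```

The order in which columns are displayed may differ between pandas versions, which is why the outputs in this notebook list them alphabetically.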

## 2.4 Inspecting the dataset

To inspect the dataset use the ***head()*** function. This will show you the first 5 rows of the dataset.

In [98]:
dataFrame.head()

     B    G    R color_name
0    0    0    0      black
1  255  255  255      white
2    0    0  255        red
3  255    0    0       blue
4    0  255    0      green


However, this works only when ***head()*** is the last statement in the code block. If you want to display your DataFrame somewhere in the middle of a block, use:

In [99]:
print(dataFrame.head())

     B    G    R color_name
0    0    0    0      black
1  255  255  255      white
2    0    0  255        red
3  255    0    0       blue
4    0  255    0      green


## 2.5 Selecting, modifying and removing rows/columns

### Selecting a column
A good thing about Pandas is that you can refer to a specific column simply by using its name:

In [101]:
dataFrame['B']

0      0
1    255
2      0
3    255
4      0
Name: B, dtype: int64

In [102]:
dataFrame['color_name']

0    black
1    white
2      red
3     blue
4    green
Name: color_name, dtype: object

### Selecting rows
To select a range of rows use the code below. dataFrame[2:4] selects all columns for rows 2 to 3 (4 is not included).

In [103]:
dataFrame[2:4]

     B  G    R color_name
2    0  0  255        red
3  255  0    0       blue


If you want to select specific row indexes use ***iloc***:

In [104]:
dataFrame.iloc[[0,2]]

   B  G    R color_name
0  0  0    0      black
2  0  0  255        red
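
*iloc* selects rows by position, while the related *loc* selects them by index label. With the default 0, 1, 2, ... index the two look alike, so the sketch below uses a hypothetical DataFrame whose index holds color names to make the difference visible.

```python
import pandas as pd

# Illustrative DataFrame with text labels as the row index
frame = pd.DataFrame(
    {'R': [0, 255, 255], 'B': [0, 255, 0]},
    index=['black', 'white', 'red'],
)

by_position = frame.iloc[[0, 2]]         # first and third row
by_label = frame.loc[['black', 'red']]   # rows labelled 'black' and 'red'

print(by_position)
print(by_position.equals(by_label))
```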


You can also select the first rows of the dataset, e.g. the first 2, like below:

In [105]:
dataFrame[:2]

     B    G    R color_name
0    0    0    0      black
1  255  255  255      white


Or the last 2 rows:

In [106]:
dataFrame[-2:]

     B    G  R color_name
3  255    0  0       blue
4    0  255  0      green


### Selecting multiple columns
You can select multiple columns by passing a list of column names, just like below:

In [107]:
column_slice = dataFrame[['B','color_name']]
column_slice.head()

     B color_name
0    0      black
1  255      white
2    0        red
3  255       blue
4    0      green


### Column names and indices
You can return column names by using:

In [108]:
dataFrame.columns

Index(['B', 'G', 'R', 'color_name'], dtype='object')

You can see the row numbers/names by using:

In [109]:
dataFrame.index

RangeIndex(start=0, stop=5, step=1)

### Conditional selection
You can apply conditional selection by using ***dataFrame[column_name] condition value***, as shown below. The result of such an expression is a Series of True/False values, which can be used to select the rows that fulfil the condition. For example, the code below selects all rows that have *R* higher than 100.

In [110]:
select_greater_than_100 = (dataFrame['R'] > 100)
dataFrame[select_greater_than_100]

     B    G    R color_name
1  255  255  255      white
2    0    0  255        red
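
Several conditions can be combined with *&* (and) and *|* (or). Each condition has to be wrapped in parentheses. A sketch on the same color data (the result names are illustrative):

```python
import pandas as pd

data = [{'color_name': 'black', 'R': 0, 'G': 0, 'B': 0},
        {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255},
        {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0},
        {'color_name': 'blue', 'R': 0, 'G': 0, 'B': 255},
        {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}]
frame = pd.DataFrame(data)

# Rows where R is above 100 AND G equals 0
red_without_green = frame[(frame['R'] > 100) & (frame['G'] == 0)]

# Rows where R is above 100 OR B is above 100
red_or_blue = frame[(frame['R'] > 100) | (frame['B'] > 100)]

print(red_without_green['color_name'].tolist())
print(red_or_blue['color_name'].tolist())
```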


### Creating new columns
You can create a new column just by assigning values to a column name that does not exist yet. You can use completely new values, or compute the column from the values of other columns, as shown below:

In [111]:
dataFrame['color_number'] = 65536 * dataFrame['R'] + 256 * dataFrame['G'] + dataFrame['B']
dataFrame.head()

     B    G    R color_name  color_number
0    0    0    0      black             0
1  255  255  255      white      16777215
2    0    0  255        red      16711680
3  255    0    0       blue           255
4    0  255    0      green         65280


### Removing columns
You can remove a column by using ***del***:

In [112]:
del dataFrame['color_number']
dataFrame.head()

     B    G    R color_name
0    0    0    0      black
1  255  255  255      white
2    0    0  255        red
3  255    0    0       blue
4    0  255    0      green


Or simply pop it from the dataset. This operation removes the column from the DataFrame and returns its values as a Series (despite the variable name *new_dataFrame* used below):

In [113]:
new_dataFrame = dataFrame.pop('color_name')
print(new_dataFrame)
print()
print(dataFrame)

0    black
1    white
2      red
3     blue
4    green
Name: color_name, dtype: object

     B    G    R
0    0    0    0
1  255  255  255
2    0    0  255
3  255    0    0
4    0  255    0
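
Both *del* and *pop()* change the DataFrame in place. When the original should stay untouched, the *drop()* method returns a new DataFrame without the listed columns. A minimal sketch on a shortened, illustrative DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'R': [0, 255], 'G': [0, 255],
                      'color_name': ['black', 'white']})

# drop() returns a new DataFrame; frame still has all three columns
without_g = frame.drop(columns=['G'])

# pop() removes the column from frame itself and returns it as a Series
popped = frame.pop('color_name')

print(list(without_g.columns))
print(list(frame.columns))
print(type(popped).__name__)
```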


And now we add it back to the dataset:

In [114]:
dataFrame['color_name'] = new_dataFrame
dataFrame.head()

     B    G    R color_name
0    0    0    0      black
1  255  255  255      white
2    0    0  255        red
3  255    0    0       blue
4    0  255    0      green


# 3. Data Science Process

The usual data science process contains the following steps:
1. Initial inspection of the dataset
2. Data manipulation, cleaning and preprocessing
3. Splitting into train/cross-validation/test samples
4. Building the model
5. Evaluating the model performance

This Jupyter notebook covers the first two steps: initial inspection of the dataset, and data manipulation, cleaning and preprocessing.

We will do so on an example e-commerce dataset.

## 3.1 Initial inspection

In [122]:
dataFrame = pd.read_csv('./datasets/ecommerce_data/e_commerce.csv', sep=',', encoding='ISO-8859-1')
dataFrame.head()

   InvoiceNo  StockCode                          Description  Quantity     InvoiceDate  UnitPrice  CustomerID         Country
0     536365     85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  12/1/2010 8:26       2.55     17850.0  United Kingdom
1     536365      71053                  WHITE METAL LANTERN         6  12/1/2010 8:26       3.39     17850.0  United Kingdom
2     536365     84406B       CREAM CUPID HEARTS COAT HANGER         8  12/1/2010 8:26       2.75     17850.0  United Kingdom
3     536365     84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  12/1/2010 8:26       3.39     17850.0  United Kingdom
4     536365     84029E       RED WOOLLY HOTTIE WHITE HEART.         6  12/1/2010 8:26       3.39     17850.0  United Kingdom


We can see that the dataset includes 8 columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.

In [123]:
dataFrame.describe()

         Quantity  UnitPrice   CustomerID
count    541909.0   541909.0     406829.0
mean      9.55225   4.611114  15287.69057
std    218.081158  96.759853  1713.600303
min      -80995.0  -11062.06      12346.0
25%           1.0       1.25      13953.0
50%           3.0       2.08      15152.0
75%          10.0       4.13      16791.0
max       80995.0    38970.0      18287.0


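For each numerical column you will see the number of non-missing values (count), the mean, the standard deviation, the minimum value, the maximum value and the quartiles. Note that the count for CustomerID is lower than for the other columns, which means that some of its values are missing. If you want to see a specific characteristic, there are dedicated functions for that as well. The examples below were run on the small color DataFrame from section 2.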

In [116]:
dataFrame.max()

B               255
G               255
R               255
color_name    white
dtype: object

In [117]:
dataFrame['B'].min()

0

In [118]:
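# numeric_only=True skips the text column; recent pandas versions raise an error without it
dataFrame.mean(numeric_only=True)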

B    102.0
G    102.0
R    102.0
dtype: float64

In [119]:
dataFrame['B'].mode()

0    0
dtype: int64

etc...

You can inspect the correlation coefficients between numerical columns by using the *corr()* function. The text column is left out by selecting the numerical columns first.

In [120]:
dataFrame[['B', 'G', 'R']].corr()

          B         G         R
B  1.000000  0.166667  0.166667
G  0.166667  1.000000  0.166667
R  0.166667  0.166667  1.000000


## 3.2 Data cleaning
You might want to check whether there are any missing values in the dataset. You can do so with the *isnull()* function:

In [121]:
dataFrame.isnull().any()

B             False
G             False
R             False
color_name    False
dtype: bool

The function *any()* returns True for a column if any of the values in this column is True. In this case it returns False for all columns because there are no empty cells.

We can set one element of the DataFrame to a missing value (NaN) and execute the code again. The query will then return True for the B column because it now contains one missing value.

In [None]:
dataFrame.loc[0, 'B'] = np.nan
dataFrame.head()

In [None]:
dataFrame.isnull().any()

We can remove the rows that contain missing values by running:

In [None]:
dataFrame.dropna()

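The result no longer contains the row corresponding to the *black* color name, because that row contained a null value. Note that *dropna()* returns a new DataFrame; *dataFrame* itself stays unchanged unless you assign the result back to it.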
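
The sketch below illustrates that *dropna()* leaves the original DataFrame as it was. The small DataFrame and the names *frame* and *cleaned* are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'B': [np.nan, 255.0, 0.0],
                      'color_name': ['black', 'white', 'red']})

# dropna() returns a new DataFrame without the rows that contain NaN
cleaned = frame.dropna()

print(len(frame), len(cleaned))
print(cleaned['color_name'].tolist())
```

To keep the cleaned version under the original name, assign the result back, e.g. *frame = frame.dropna()*.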

Alternatively, missing values can be replaced using forward fill, backward fill, or the column mean. Examples of these are presented below.

### Replacing missing values with the mean

In [None]:
# Reloading data and setting the 'R' value in the row with index 2 (the third row) to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the mean of the remaining values in the column
dataFrame['R'] = dataFrame['R'].fillna(dataFrame['R'].mean())
dataFrame.head()

### Replacing missing values with the value of the previous row (forward fill)

In [None]:
# Reloading data and setting the 'R' value in the row with index 2 (the third row) to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the previous value (forward fill)
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
dataFrame = dataFrame.ffill()
dataFrame.head()

### Replacing missing values with the value of the next row (backward fill)

In [None]:
# Reloading data and setting the 'R' value in the row with index 2 (the third row) to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the next value (backward fill)
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions
dataFrame = dataFrame.bfill()
dataFrame.head()
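
To compare the three strategies side by side, the sketch below applies them to a single Series holding the same values as the *R* column with the entry at index 2 missing. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

red = pd.Series([0.0, 255.0, np.nan, 0.0, 0.0])

mean_filled = red.fillna(red.mean())  # mean of the four known values
forward_filled = red.ffill()          # copies the value from index 1
backward_filled = red.bfill()         # copies the value from index 3

print(mean_filled[2], forward_filled[2], backward_filled[2])
```

The three results differ considerably here (63.75, 255.0 and 0.0), which shows that the choice of strategy should depend on what the data represents.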

### Invalid categorical values

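If you want to check whether a column contains invalid entries, you can use the *value_counts()* function, as below. It lists every distinct value together with the number of times it occurs, so misspelled or unexpected categories stand out.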

In [None]:
dataFrame['color_name'].value_counts()

The same applies to numerical values:

In [None]:
dataFrame['B'].value_counts()
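
Once the set of allowed categories is known, invalid entries can also be located directly with *isin()*. The sketch below uses a made-up Series containing one misspelled value; the allowed set *valid_colors* is an assumption for illustration.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'bleu', 'green', 'red'])

# Frequency of every distinct value; rare values are candidates for typos
counts = colors.value_counts()

# Hypothetical list of allowed categories
valid_colors = ['red', 'blue', 'green']

# Keep only the entries that are NOT among the allowed categories
invalid_entries = colors[~colors.isin(valid_colors)]

print(counts)
print(invalid_entries)
```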

## More functions
You can find more functions here: http://pandas.pydata.org/pandas-docs/stable/indexing.html