## Typical steps in predictive data analysis

* Collect, validate, and clean data

* Feature selection and additional data preparation

* Split data into training and test sets

* Train your algorithm

* Estimate its accuracy

In [1]:
# set up our environment
import pandas as pd

def print_file(filename):
    with open(filename) as f:
        print(f.read(),end='')

## Data preparation

### Data collection
* can be non-trivial
* may require combining information from several sources

### Data validation
* ensure the proper format
* no missing values of any kind
* basic data consistency

### Data cleaning
* decide how to fix data
* examples:
  * remove duplicate entries?
  * drop entries with incorrect or missing info?
  * fix entries with incorrect or missing info?
  * fix misspellings?
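
A minimal sketch of the cleaning decisions listed above. The table, the `City` column, and the misspelling `Bostn` are made up for illustration and are not part of the lecture data:

```python
import pandas as pd

# Hypothetical data with a duplicate row, a misspelling, and a missing value
raw = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob', 'Carol'],
    'City': ['Boston', 'Boston', 'Bostn', None],
})

# remove duplicate entries
cleaned = raw.drop_duplicates()
# fix a known misspelling
cleaned = cleaned.assign(City=cleaned['City'].replace('Bostn', 'Boston'))
# drop entries with missing info
cleaned = cleaned.dropna(subset=['City'])
print(cleaned)
```

Whether to drop or to fix an entry depends on the data set; here the row with a missing city is dropped, while the misspelled one is repaired.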

## Sample of data validation

In [2]:
filename = 'data_validation.csv'
print_file(filename)

Name;FavoriteNumber;PPG
Alice;7;13.2
;8;12.7
Carol;13;8.2
Dave;"five"
Jack;32;
;;


In [3]:
data = pd.read_csv(filename, sep=';')
print(data)

    Name FavoriteNumber   PPG
0  Alice              7  13.2
1    NaN              8  12.7
2  Carol             13   8.2
3   Dave           five   NaN
4   Jack             32   NaN
5    NaN            NaN   NaN


In [4]:
# check whether the column types are as expected;
# an unexpected type (e.g. object instead of a number) signals an incorrect entry
data.dtypes

Name               object
FavoriteNumber     object
PPG               float64
dtype: object
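
The `object` type of `FavoriteNumber` says that something is wrong, but not where. One possible way to locate the offending rows is sketched below with `pd.to_numeric`. The file contents are inlined with `io.StringIO` only to keep the example self-contained:

```python
import io
import pandas as pd

# Same contents as data_validation.csv shown above
csv_text = """Name;FavoriteNumber;PPG
Alice;7;13.2
;8;12.7
Carol;13;8.2
Dave;"five"
Jack;32;
;;
"""
data = pd.read_csv(io.StringIO(csv_text), sep=';')

# Coerce to numbers; anything that cannot be parsed becomes NaN
as_number = pd.to_numeric(data['FavoriteNumber'], errors='coerce')

# Rows that had a value, but not a numeric one
bad = data[as_number.isnull() & data['FavoriteNumber'].notnull()]
print(bad)
```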

In [5]:
# count missing entries in a column
column = 'PPG'
missing = pd.isnull(data[column]).sum()
print(f'{missing} missing entries in {column}')

# Find out which rows of the dataset contain nulls.
# Version 1: build the boolean mask once, then scan it row by row
null_mask = pd.isnull(data)
for i in range(null_mask.shape[0]):
    rowisnull = False
    for j in range(null_mask.shape[1]):
        rowisnull = rowisnull or null_mask.iloc[i, j]
    print('row ', i, 'has null status ', rowisnull)

# Version 2: test each cell directly (prints the same result)
for i in range(data.shape[0]):
    rowisnull = False
    for j in range(data.shape[1]):
        rowisnull = rowisnull or pd.isnull(data.iloc[i, j])
    print('row ', i, 'has null status ', rowisnull)


3 missing entries in PPG
row  0 has null status  False
row  1 has null status  True
row  2 has null status  False
row  3 has null status  True
row  4 has null status  True
row  5 has null status  True
row  0 has null status  False
row  1 has null status  True
row  2 has null status  False
row  3 has null status  True
row  4 has null status  True
row  5 has null status  True
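
The nested loops above can also be written without explicit iteration. This sketch uses `any(axis=1)` to reduce each row of the null mask to a single boolean. The file contents are inlined only to keep the example self-contained:

```python
import io
import pandas as pd

# Same contents as data_validation.csv shown above
csv_text = """Name;FavoriteNumber;PPG
Alice;7;13.2
;8;12.7
Carol;13;8.2
Dave;"five"
Jack;32;
;;
"""
data = pd.read_csv(io.StringIO(csv_text), sep=';')

# One boolean per row: True if any column in that row is missing
row_has_null = data.isnull().any(axis=1)
print(row_has_null)
print(row_has_null.sum(), 'rows contain at least one missing value')
```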


## Sample of data cleaning

In [6]:
# drop rows that contain NaNs
data = pd.read_csv(filename, sep=';')
# how='all' drops only the rows in which every value is missing
data = data.dropna(axis=0,how='all')
print(data)
# how='any' drops the rows in which at least one value is missing
data2 = data.dropna(axis=0,how='any')
print(data2)

    Name FavoriteNumber   PPG
0  Alice              7  13.2
1    NaN              8  12.7
2  Carol             13   8.2
3   Dave           five   NaN
4   Jack             32   NaN
    Name FavoriteNumber   PPG
0  Alice              7  13.2
2  Carol             13   8.2


In [7]:
# replace NaNs with specific value
data['PPG'] = data['PPG'].fillna(0)
data['Name'] = data['Name'].fillna('????')
data['FavoriteNumber'] = data['FavoriteNumber']\
                          .fillna(0)
data

    Name FavoriteNumber   PPG
0  Alice              7  13.2
1   ????              8  12.7
2  Carol             13   8.2
3   Dave           five   0.0
4   Jack             32   0.0


In [8]:
def fix(x):
    if x == "five":
        return 5
    return x

data['FavoriteNumber'] = data['FavoriteNumber'].apply(fix)

In [9]:
better_file = "cleaned.csv"
data.to_csv(better_file, sep=";")
print_file(better_file)

;Name;FavoriteNumber;PPG
0;Alice;7;13.2
1;????;8;12.7
2;Carol;13;8.2
3;Dave;5;0.0
4;Jack;32;0.0
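
The empty first column in the header above is the DataFrame index, which `to_csv` writes by default. If the index carries no information, it can be left out with `index=False`. A small sketch on made-up rows, writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Two illustrative rows in the same layout as the cleaned data
data = pd.DataFrame({'Name': ['Alice', 'Carol'],
                     'FavoriteNumber': [7, 13],
                     'PPG': [13.2, 8.2]})

buffer = io.StringIO()                      # in-memory stand-in for a file
data.to_csv(buffer, sep=';', index=False)   # do not write the index column
print(buffer.getvalue(), end='')
```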


## Digression: Conditional expressions and lambda functions

How can we define and pass the function `fix(x)` more concisely?

```python
def fix(x):
    if x == "five":
        return 5
    return x
```

Conditional expression: `<x> if <condition> else <y>`<br>
(in C++: `<condition> ? <x> : <y>`)

In [10]:
"all good" if 2 + 2 == 4 else "what??!!"

'all good'

More concise version:
```python
def fix(x):
    return 5 if x == "five" else x
```

### Lambda functions: `lambda <x> : <expression>`
(in OCaml: `fun <x> -> <expression>`)<br>
(in C++: `[] (<type-of-x> <x>) {return <expression>;}`)

In [11]:
f = lambda x : x * x
f(16)

256

OCaml: `fun x -> x * x`<br>
C++: `[](int x) { return x * x;}`

We don't have to create a named function to perform this step:

In [12]:
data['FavoriteNumber'].apply(lambda x : 5 if x == "five" else x)

0     7
1     8
2    13
3     5
4    32
Name: FavoriteNumber, dtype: object
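
Note that the result is still of type `object`: the other entries were most likely read as strings because of the `"five"` entry, and only the replaced value is an integer. A possible follow-up step, sketched on a stand-alone Series that mimics the column, is to convert everything with `pd.to_numeric`:

```python
import pandas as pd

# Stand-in for the column as read from the file (assumed to hold strings)
favorite = pd.Series(['7', '8', '13', 'five', '32'], name='FavoriteNumber')

fixed = favorite.apply(lambda x: 5 if x == "five" else x)
print(fixed.dtype)     # the values are a mix of strings and one int

numeric = pd.to_numeric(fixed)   # parse the remaining strings as numbers
print(numeric.dtype)
print(numeric.tolist())
```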

## Feature selection and additional data preparation

**Important:** select a subset of available attributes, especially if you have few labeled samples

**Otherwise:** the training algorithm could fit to attributes that carry no useful information


### Additional data preparation (algorithm dependent)

* Normalize the selected features

* Warning for decision trees in `scikit-learn` (applies to Homework 1):
  - The algorithm only works with numerical attributes
  - So you have to convert your data to numbers
  - For instance, replace `"Male"`/`"Female"` with `0`/`1`
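
A minimal sketch of both preparation steps. The table and the column names `Sex` and `Height` are hypothetical, and min-max scaling is only one of several common ways to normalize a feature:

```python
import pandas as pd

# Hypothetical data; column names and values are made up for illustration
people = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Female', 'Male'],
    'Height': [180.0, 165.0, 170.0, 175.0],
})

# Replace the categorical labels with numbers
people['Sex'] = people['Sex'].map({'Male': 0, 'Female': 1})

# Min-max normalization: rescale a feature to the range [0, 1]
h = people['Height']
people['Height'] = (h - h.min()) / (h.max() - h.min())
print(people)
```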

## Validation of results

General idea:
  * split your data at random:
      * training set
      * test set
  * use the training set to train your prediction model
  * use the test set to see how well it performs
  
You can let `train_test_split` choose the relative sizes of the training and test sets (the default is 75/25), or you can set the ratio yourself and see how the results change.
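
As a sketch of how the ratio can be controlled: `test_size` sets the fraction of samples that goes to the test set. The value `0.4` and the use of `random_state=0` (which makes the random split repeatable) are choices made for this example only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Default split: 25% of the samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
default_sizes = (len(X_train), len(X_test))
print(default_sizes)

# Explicit 60/40 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
custom_sizes = (len(X_train), len(X_test))
print(custom_sizes)
```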

In [13]:
# split the data set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
iris = load_iris()
print(iris)
X,y = iris.data,iris.target
X_train,X_test,y_train,y_test = train_test_split(X,y)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [14]:
# train your decision tree
clf = tree.DecisionTreeClassifier(max_leaf_nodes=3)
clf = clf.fit(X_train,y_train)
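
To see what the trained tree actually learned, one option is `tree.export_text`, which prints the decision rules as plain text. This is a sketch: `random_state=0` is added here only so that the split and the tree are repeatable, so the rules may differ from those of the run above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Same model as above: a tree limited to three leaves
clf = tree.DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf = clf.fit(X_train, y_train)

# Text rendering of the learned decision rules
rules = tree.export_text(clf, feature_names=iris.feature_names)
print(rules)
```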

In [15]:
# what was the prediction accuracy?
prediction = clf.predict(X_test)
correct = 0
for i in range(len(y_test)):
    if prediction[i] == y_test[i]:
        correct += 1
correct / len(y_test)


0.9473684210526315
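
The loop above counts matching labels by hand. The same fraction can be obtained more compactly, as sketched below in three equivalent ways. The split uses `random_state=0` for repeatability, so the number printed need not match the value above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = tree.DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf = clf.fit(X_train, y_train)
prediction = clf.predict(X_test)

by_hand = (prediction == y_test).mean()          # fraction of matching labels
by_metric = accuracy_score(y_test, prediction)   # scikit-learn metric function
by_score = clf.score(X_test, y_test)             # predicts and scores in one call
print(by_hand, by_metric, by_score)
```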

### Metrics of accuracy

* How many predictions did you get right?  (the code listed above)
* The size of the error: $ \frac{1}{N}\sum_{i=1}^{N} \left| y_{test,i} - y_{pred,i} \right| $
* The size of the error relative to the range of the output space: $ \frac{\frac{1}{N}\sum_{i=1}^{N} \left| y_{test,i} - y_{pred,i} \right|}{Range} $
* The coefficient of determination (left for a future lecture, but if you are feeling inspired, look it up!)
* Many, many, many others https://scikit-learn.org/stable/modules/model_evaluation.html
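
A small sketch of the error-based metrics listed above, on made-up numeric targets and predictions. Taking the range as the difference between the largest and smallest value in `y_test` is an assumption of this example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up numeric targets and predictions, for illustration only
y_test = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.5, 5.0, 3.0, 7.0])

# Size of the error: mean of the absolute differences
mae = np.abs(y_test - y_pred).mean()
# Error relative to the range of the observed outputs (assumed: max - min)
relative = mae / (y_test.max() - y_test.min())
print(mae, relative)

# The same mean absolute error, computed by scikit-learn
print(mean_absolute_error(y_test, y_pred))
# Coefficient of determination
r2 = r2_score(y_test, y_pred)
print(r2)
```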