# Assignment 2 - data preprocessing and binary classification (due 2 November, 23:59)

  * In this assignment you have to deal with features of various types.
  * Before you build a prediction model on them, they need to be converted into a numerical representation in some way.

> **The assignments are set so as to leave you room for invention. Working out _exactly how_ you will solve the task is an important part of the assignment, and originality and inventiveness will also be graded!**

## Data source

We will be predicting the survival of Titanic passengers.
The training data are in the file **data.csv** and the data for evaluation are in the file **evaluation.csv**.

#### List of features:
* survived - whether the passenger survived, 0 = No, 1 = Yes, the **target variable** you want to predict
* pclass - ticket class, 1 = first, 2 = second, 3 = third
* name - name
* sex - sex
* age - age in years
* sibsp - number of siblings / spouses aboard
* parch - number of parents / children aboard
* ticket - ticket number
* fare - ticket fare
* cabin - cabin number
* embarked - port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
* home.dest - home / destination

## Instructions

**Basic requirements**, for whose (honest) completion you get **8 points**:
  * Load the data from **data.csv** in a Jupyter notebook. Split them in a suitable way into subsets appropriate for training a model.
  * Go through the individual features and transform them into a form suitable for the chosen classification model.
  * You may create new features (based on existing ones) as needed; for example, you can create a feature measuring the length of the name. You may also drop some features entirely.
  * Deal with missing values in some way.
  * Then choose a suitable classification model from the lectures. Find suitable hyperparameters and determine its accuracy on the training set. Also determine its accuracy on the test set.
  * Load the evaluation data from **evaluation.csv**. Compute predictions for these data (the target variable is no longer included). Create a **results.csv** file in which you store these predictions in two columns: ID, predicted survival. Upload this file to the repository.
  * Sample of the first lines of *results.csv*:
  
```
ID,survived
1000,0
1001,1
...
```

**Additional requirements** for possible extra points (you may pick and choose; the maximum for the assignment is 12 points in any case):
  * (up to +4 points) Apply all classification models from the lectures and determine (based on accuracy on the validation set) which one is best. Estimate the accuracy of this best model using cross-validation. Use this model for the predictions on the evaluation data.
  * (up to +4 points) Try some (at least two) non-trivial methods of imputing the missing age values. Focus on how these methods affect the prediction accuracy of the resulting model. Use whichever approach works best for you for the predictions on the evaluation data.

## Submission notes

  * Follow the instructions at https://courses.fit.cvut.cz/BI-VZD/homeworks/index.html.
  * Submit not only the Jupyter notebook but also the _csv_ file with predictions for the evaluation data (`results.csv`).
  * The grader may allow you to finish or fix the assignment and thus earn more points. The first version matters, though, and if it is sloppy you will be penalized for it.

# Solution
My approach can be described in the following steps:
* load the data from `data.csv` and `evaluation.csv` into two dataframes
* remove unneeded columns, map male/female to 1/0 and replace the `embarked` column with a set of dummy columns
* merge `data` and `evaluation` into one dataframe and remove the `ID` and `survived` columns to create a dataframe for age prediction; this gives the regressor more data to learn from
* build a regression model to fill in the unknown `age` values in `data` and `evaluation`
* build a classification model on `data` to predict the `survived` value
* predict `survived` for `evaluation`

In [200]:
### from here on it is yours
import pandas as pd
import numpy as np
import sys
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Retrieve data

In [233]:
data = pd.read_csv('./data.csv')
evaluation = pd.read_csv('./evaluation.csv')
display(data)
data.info()
display(evaluation)
evaluation.info()

Unnamed: 0,ID,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
0,0,1,3,"Dorking, Mr. Edward Arthur",male,19.0,0,0,A/5. 10482,8.0500,,S,"England Oglesby, IL"
1,1,1,2,"Smith, Miss. Marion Elsie",female,40.0,0,0,31418,13.0000,,S,
2,2,0,3,"Hegarty, Miss. Hanora ""Nora""",female,18.0,0,0,365226,6.7500,,Q,
3,3,0,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.5500,,S,
4,4,0,3,"Cacic, Miss. Marija",female,30.0,0,0,315084,8.6625,,S,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,0,3,"Sdycoff, Mr. Todor",male,,0,0,349222,7.8958,,S,
996,996,1,3,"Finoli, Mr. Luigi",male,,0,0,SOTON/O.Q. 3101308,7.0500,,S,"Italy Philadelphia, PA"
997,997,0,3,"Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria ...",female,28.0,1,1,347080,14.4000,,S,"Stanton, IA"
998,998,0,3,"Sivic, Mr. Husein",male,40.0,0,0,349251,7.8958,,S,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         1000 non-null   int64  
 1   survived   1000 non-null   int64  
 2   pclass     1000 non-null   int64  
 3   name       1000 non-null   object 
 4   sex        1000 non-null   object 
 5   age        797 non-null    float64
 6   sibsp      1000 non-null   int64  
 7   parch      1000 non-null   int64  
 8   ticket     1000 non-null   object 
 9   fare       1000 non-null   float64
 10  cabin      226 non-null    object 
 11  embarked   998 non-null    object 
 12  home.dest  554 non-null    object 
dtypes: float64(2), int64(5), object(6)
memory usage: 101.7+ KB


Unnamed: 0,ID,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
0,1000,2,"Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr...",female,24.0,2,1,243847,27.0000,,S,London
1,1001,2,"Christy, Miss. Julie Rachel",female,25.0,1,1,237789,30.0000,,S,London
2,1002,2,"Gale, Mr. Harry",male,38.0,1,0,28664,21.0000,,S,"Cornwall / Clear Creek, CO"
3,1003,3,"McNamee, Mrs. Neal (Eileen O'Leary)",female,19.0,1,0,376566,16.1000,,S,
4,1004,2,"Howard, Mrs. Benjamin (Ellen Truelove Arman)",female,60.0,1,0,24065,26.0000,,S,"Swindon, England"
...,...,...,...,...,...,...,...,...,...,...,...,...
304,1304,3,"Buckley, Miss. Katherine",female,18.5,0,0,329944,7.2833,,Q,"Co Cork, Ireland Roxbury, MA"
305,1305,3,"Bing, Mr. Lee",male,32.0,0,0,1601,56.4958,,S,"Hong Kong New York, NY"
306,1306,3,"Daher, Mr. Shedid",male,22.5,0,0,2698,7.2250,,C,
307,1307,1,"Wick, Mrs. George Dennick (Mary Hitchcock)",female,45.0,1,1,36928,164.8667,,S,"Youngstown, OH"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         309 non-null    int64  
 1   pclass     309 non-null    int64  
 2   name       309 non-null    object 
 3   sex        309 non-null    object 
 4   age        249 non-null    float64
 5   sibsp      309 non-null    int64  
 6   parch      309 non-null    int64  
 7   ticket     309 non-null    object 
 8   fare       308 non-null    float64
 9   cabin      69 non-null     object 
 10  embarked   309 non-null    object 
 11  home.dest  191 non-null    object 
dtypes: float64(2), int64(4), object(6)
memory usage: 29.1+ KB


I thought the following info might be interesting:

In [219]:
print('survival rate in')
for pclass, g in data.groupby('pclass'):
    print('\t\tpclass {} is {:.1f}% (total people {})'.format(pclass, g['survived'].sum()/len(g)*100, len(g)))

survival rate in
		pclass 1 is 63.6% (total people 239)
		pclass 2 is 44.8% (total people 210)
		pclass 3 is 26.7% (total people 551)
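The same per-class summary can also be computed without an explicit loop. The sketch below is only an illustration on a tiny made-up frame (`toy` is hypothetical, not the assignment data); it relies on the fact that the mean of a 0/1 column is the share of ones.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `data` frame.
toy = pd.DataFrame({
    'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate; size counts passengers.
summary = toy.groupby('pclass')['survived'].agg(rate='mean', total='size')
summary['rate'] = summary['rate'] * 100  # express the rate as a percentage
print(summary)
```

Applied to the real frame, the same call should reproduce the percentages printed above.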


* Change male to 1 and female to 0
* Drop name, ticket, cabin, home.dest
* Substitute `embarked` with nominal dummies
* Fill the missing `fare` value with 0

In [234]:
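def shape_frame(df):
    # drop() returns a new frame, so the caller's frame is not modified
    df = df.drop(columns=['name', 'ticket', 'cabin', 'home.dest'])
    # encode sex as a number: male -> 1, female -> 0
    df['sex'] = df['sex'].map({'male': 1, 'female': 0})
    # replace `embarked` with one dummy column per port
    df = pd.get_dummies(data=df, columns=['embarked'], prefix='embarked')
    # the single missing fare (in evaluation) is filled with 0
    df['fare'] = df['fare'].fillna(0)
    return df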

data = shape_frame(data)
evaluation = shape_frame(evaluation)
display(data)
display(evaluation)

Unnamed: 0,ID,survived,pclass,sex,age,sibsp,parch,fare,embarked_C,embarked_Q,embarked_S
0,0,1,3,1,19.0,0,0,8.0500,0,0,1
1,1,1,2,0,40.0,0,0,13.0000,0,0,1
2,2,0,3,0,18.0,0,0,6.7500,0,1,0
3,3,0,3,1,,1,9,69.5500,0,0,1
4,4,0,3,0,30.0,0,0,8.6625,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
995,995,0,3,1,,0,0,7.8958,0,0,1
996,996,1,3,1,,0,0,7.0500,0,0,1
997,997,0,3,0,28.0,1,1,14.4000,0,0,1
998,998,0,3,1,40.0,0,0,7.8958,0,0,1


Unnamed: 0,ID,pclass,sex,age,sibsp,parch,fare,embarked_C,embarked_Q,embarked_S
0,1000,2,0,24.0,2,1,27.0000,0,0,1
1,1001,2,0,25.0,1,1,30.0000,0,0,1
2,1002,2,1,38.0,1,0,21.0000,0,0,1
3,1003,3,0,19.0,1,0,16.1000,0,0,1
4,1004,2,0,60.0,1,0,26.0000,0,0,1
...,...,...,...,...,...,...,...,...,...,...
304,1304,3,0,18.5,0,0,7.2833,0,1,0
305,1305,3,1,32.0,0,0,56.4958,0,0,1
306,1306,3,1,22.5,0,0,7.2250,1,0,0
307,1307,1,0,45.0,1,1,164.8667,0,0,1
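Here both frames happen to contain all three ports, so encoding them separately yields the same dummy columns. If one frame lacked a category, the columns would differ and a model trained on one frame could not predict on the other. A minimal sketch of one way to guard against that, using made-up frames (`train` and `evalu` are hypothetical names, and `dtype=int` is chosen only to keep the printed output as 0/1):

```python
import pandas as pd

# Hypothetical toy frames: the second one happens to lack port 'Q'.
train = pd.DataFrame({'fare': [7.0, 30.0, 8.0], 'embarked': ['S', 'C', 'Q']})
evalu = pd.DataFrame({'fare': [12.0, 50.0], 'embarked': ['S', 'C']})

train_enc = pd.get_dummies(train, columns=['embarked'], prefix='embarked', dtype=int)
eval_enc = pd.get_dummies(evalu, columns=['embarked'], prefix='embarked', dtype=int)

# Give the evaluation frame exactly the training columns, in the same order;
# dummy columns that never occurred there are filled with 0.
eval_enc = eval_enc.reindex(columns=train_enc.columns, fill_value=0)
print(eval_enc)
```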


## Age guessing
First, let's build a separate dataframe without the `ID` and `survived` columns by combining `data` and `evaluation`.

In [235]:
data_age = pd.concat([
    data.drop(columns = ['ID','survived']),
    evaluation.drop(columns = ['ID'])
]).reset_index(drop=True)

display(data_age)

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked_C,embarked_Q,embarked_S
0,3,1,19.0,0,0,8.0500,0,0,1
1,2,0,40.0,0,0,13.0000,0,0,1
2,3,0,18.0,0,0,6.7500,0,1,0
3,3,1,,1,9,69.5500,0,0,1
4,3,0,30.0,0,0,8.6625,0,0,1
...,...,...,...,...,...,...,...,...,...
1304,3,0,18.5,0,0,7.2833,0,1,0
1305,3,1,32.0,0,0,56.4958,0,0,1
1306,3,1,22.5,0,0,7.2250,1,0,0
1307,1,0,45.0,1,1,164.8667,0,0,1


Take the rows with known age, split them into train, validation and test sets, and build a `DecisionTreeRegressor`.

In [236]:
known_age = data_age[data_age['age'].notnull()]

Xtrain, Xrest, ytrain, yrest = train_test_split(
    known_age.drop(columns='age'), known_age['age'], test_size=0.4
)

Xvalidate, Xtest, yvalidate, ytest = train_test_split(
    Xrest, yrest, test_size = 0.4
)

Now it's time to train a model.
First we enumerate all combinations of hyperparameters (the dataset is small, so we can try all of them).

In [237]:
param_grid = {
    'max_depth': range(1,40),
    # note: scikit-learn 1.0 renamed 'mse' to 'squared_error' and 'mae' to 'absolute_error'
    'criterion': ['mse', 'friedman_mse', 'mae'],
    'min_samples_leaf': range(1,10),
    'min_samples_split': range(2, 8)
}
param_comb = ParameterGrid(param_grid)

In [238]:
val_acc = []
for params in param_comb:
    regressor = DecisionTreeRegressor(**params)
    regressor.fit(Xtrain, ytrain)
    val_acc.append(regressor.score(Xvalidate, yvalidate))

In [239]:
best_params = param_comb[np.argmax(val_acc)]
print('best params are\n', best_params, '\nwith R^2 score of {:.2f}'.format(val_acc[np.argmax(val_acc)]))

best params are
 {'min_samples_split': 2, 'min_samples_leaf': 8, 'max_depth': 4, 'criterion': 'mse'} 
with R^2 score of 0.34
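The manual loop over `ParameterGrid` scores every combination on one fixed validation split. A common alternative is scikit-learn's `GridSearchCV`, which scores each combination with cross-validation instead. The sketch below is not what the notebook runs; it uses synthetic data and a deliberately small grid (`small_grid` is a hypothetical name) just to show the shape of the call.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)

# A deliberately small grid; the notebook's grid is much larger.
small_grid = {'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5]}

search = GridSearchCV(DecisionTreeRegressor(random_state=0), small_grid, cv=5)
search.fit(X, y)  # the default score of a regressor is R^2

print(search.best_params_, round(search.best_score_, 2))
```

After fitting, `search.best_estimator_` is already refit on all the data passed to `fit`, so no separate refit step is needed.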


Evaluating the best tree on the test set

In [240]:
age_regressor = DecisionTreeRegressor(**best_params)
age_regressor.fit(Xtrain, ytrain)
ypredicted = age_regressor.predict(Xtest)

In [242]:
age_predictions = pd.DataFrame({'truth':ytest, 'predict':ypredicted})
age_predictions['diff'] = (ytest - ypredicted)
display(age_predictions)

Unnamed: 0,truth,predict,diff
21,34.5,28.015625,6.484375
839,4.0,6.500000,-2.500000
870,21.0,28.015625,-7.015625
984,54.0,35.527778,18.472222
305,19.0,28.015625,-9.015625
...,...,...,...
357,31.0,35.527778,-4.527778
533,55.5,28.015625,27.484375
599,57.0,40.213115,16.786885
1125,28.0,28.015625,-0.015625


In [243]:
age_predictions['diff'].mean()

0.5325884134255565

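The mean signed error is about 0.53 years, but positive and negative errors cancel in that average, so it measures bias rather than accuracy; the validation R^2 of 0.34 shows the fit is weak. I can't say I'm happy with this kind of accuracy, but I think it's enough.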
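To see how far a typical prediction is from the truth, an absolute or squared error is more informative than the signed mean. A small sketch, using only the first four (truth, predict) pairs displayed in the table above rather than the full test set, so the numbers it prints are not the notebook's overall scores:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Four (truth, prediction) pairs copied from the displayed rows.
truth = np.array([34.5, 4.0, 21.0, 54.0])
predicted = np.array([28.015625, 6.5, 28.015625, 35.527778])

signed = np.mean(truth - predicted)                    # opposite signs cancel
mae = mean_absolute_error(truth, predicted)            # typical size of an error
rmse = np.sqrt(mean_squared_error(truth, predicted))   # penalises large misses

print(f'mean signed error: {signed:.2f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}')
```

On the full test set the same two calls would take `ytest` and `ypredicted` as arguments.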
### Predict unknown ages in `data` and `evaluation`

In [314]:
def predict_ages(df):
    # use the same feature columns, in the same order, the regressor was trained on
    features = ['pclass', 'sex', 'sibsp', 'parch', 'fare', 'embarked_C', 'embarked_Q', 'embarked_S']
    ages = df['age'].copy()
    missing = ages.isna()
    # known ages are kept; all missing ones are predicted in a single call
    if missing.any():
        ages[missing] = age_regressor.predict(df.loc[missing, features])
    return ages

In [311]:
data['age'] = predict_ages(data)
evaluation['age'] = predict_ages(evaluation)

In [312]:
data.info()
evaluation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          1000 non-null   int64  
 1   survived    1000 non-null   int64  
 2   pclass      1000 non-null   int64  
 3   sex         1000 non-null   int64  
 4   age         1000 non-null   float64
 5   sibsp       1000 non-null   int64  
 6   parch       1000 non-null   int64  
 7   fare        1000 non-null   float64
 8   embarked_C  1000 non-null   uint8  
 9   embarked_Q  1000 non-null   uint8  
 10  embarked_S  1000 non-null   uint8  
dtypes: float64(2), int64(6), uint8(3)
memory usage: 65.6 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          309 non-null    int64  
 1   pclass      309 non-null    int64  
 2   sex         309 non-null  

As you can see, the ages are now filled in and we can continue.
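A regression tree is only one way to fill the gaps. For comparison, here is a hedged sketch of two other imputers that scikit-learn provides, `SimpleImputer` and `KNNImputer`, run on a made-up frame (`toy` is hypothetical). The features are left unscaled here, so `fare` dominates the neighbour distance; on real data scaling would likely matter.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy frame; NaN marks an unknown age.
toy = pd.DataFrame({
    'pclass': [1, 1, 3, 3, 3],
    'fare':   [80.0, 75.0, 8.0, 7.5, 7.0],
    'age':    [40.0, np.nan, 20.0, 22.0, np.nan],
})

# Median imputation: every gap gets the same value.
median_filled = SimpleImputer(strategy='median').fit_transform(toy)
median_age = pd.DataFrame(median_filled, columns=toy.columns)['age']

# k-NN imputation: each gap gets the mean age of the 2 nearest rows
# that have a known age.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)
knn_age = pd.DataFrame(knn_filled, columns=toy.columns)['age']

print(median_age.tolist())
print(knn_age.tolist())
```

Which imputer helps the final classifier most can only be judged by comparing the resulting validation accuracy.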
## Time to build a model for guessing the `survived` column
### Splitting data
Splitting is pretty straightforward: I'll do the same as I did for the age-guessing regressor.

In [313]:
Xtrain, Xrest, ytrain, yrest = train_test_split(
    data.drop(columns=['survived','ID']), data['survived'], test_size=0.4
)
Xvalidate, Xtest, yvalidate, ytest = train_test_split(
    Xrest, yrest, test_size = 0.4
)
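With 1000 rows, the two successive 40 % hold-outs above leave 600 rows for training, 240 for validation and 160 for testing. A minimal sketch that checks these sizes on placeholder arrays (`X` and `y` are made-up stand-ins, and `random_state` is an addition the notebook does not use; it only makes the split reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the features and the target: 1000 rows.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

# First hold out 40 % of the rows, then split that part again 60/40.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.4, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # prints: 600 240 160
```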

### Selecting better model
In this section I'll tune hyperparameters for a random forest classifier and a decision tree classifier, and select the better one for the final prediction.
#### Decision tree classifier

In [315]:
param_grid = {
    'max_depth': range(1,40),
    'criterion': ['gini','entropy'],
    'min_samples_leaf': range(1,10),
    'min_samples_split': range(2, 8)
}
param_comb = ParameterGrid(param_grid)

val_acc = []
for params in param_comb:
    dt = DecisionTreeClassifier(**params)
    dt.fit(Xtrain, ytrain)
    val_acc.append(dt.score(Xvalidate, yvalidate))
    
best_decision_tree_params = param_comb[np.argmax(val_acc)]
print('best decision tree params are\n', best_decision_tree_params, '\nwith score of {:.2f}'.format(val_acc[np.argmax(val_acc)]))

best_decision_tree = DecisionTreeClassifier(**best_decision_tree_params)
best_decision_tree.fit(Xtrain, ytrain)
print('accuracy on test data is', best_decision_tree.score(Xtest, ytest)*100,'%')

best decision tree params are
 {'min_samples_split': 2, 'min_samples_leaf': 2, 'max_depth': 5, 'criterion': 'entropy'} 
with score of 0.85
accuracy on test data is 78.75 %


#### Random forest classifier

In [323]:
param_grid = {
    'n_estimators': range(40, 110),
    'max_depth': range(1,10),
    'criterion': ['gini','entropy'],
    'min_samples_leaf': range(1,10),
    'min_samples_split': range(2, 6)
}
param_comb = ParameterGrid(param_grid)

val_acc = []
counter=0
for params in param_comb:
    rf = RandomForestClassifier(**params)
    rf.fit(Xtrain, ytrain)
    val_acc.append(rf.score(Xvalidate, yvalidate))
    counter+=1
    sys.stdout.write('\r{:.3f}% done'.format(counter/len(param_comb)*100))

print('\n')
best_random_forrest_params = param_comb[np.argmax(val_acc)]
print('best random forrest params are\n', best_random_forrest_params, '\nwith score of {:.2f}'.format(val_acc[np.argmax(val_acc)]))

100.000% done

best random forrest params are
 {'n_estimators': 56, 'min_samples_split': 4, 'min_samples_leaf': 5, 'max_depth': 3, 'criterion': 'gini'} 
with score of 0.87
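The grid above contains 70 x 9 x 2 x 9 x 4 = 45360 combinations, each of which trains a whole forest. When that is too slow, a random subset of the grid is often good enough. The sketch below shows scikit-learn's `RandomizedSearchCV` on synthetic data; the data, `n_iter=5` and `cv=3` are assumptions chosen to keep the example fast, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary-classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_distributions = {
    'n_estimators': list(range(40, 110)),
    'max_depth': list(range(1, 10)),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': list(range(1, 10)),
    'min_samples_split': list(range(2, 6)),
}

# Try only n_iter randomly chosen combinations instead of all of them.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 2))
```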


In [324]:
best_random_forrest = RandomForestClassifier(**best_random_forrest_params)
best_random_forrest.fit(Xtrain,ytrain)
print('accuracy on test data is', best_random_forrest.score(Xtest,ytest) * 100,'%')

accuracy on test data is 78.75 %


Strangely, both models turned out to have the same accuracy on the test data (78.75 %). I'll choose the random forest, which also scored slightly higher on the validation set (0.87 vs. 0.85), for the final prediction on the data in `evaluation.csv`.
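A tie on a single small test split says little about which model is really better. Cross-validation, which the assignment also suggests, averages the accuracy over several splits. The following is a sketch on synthetic data, not the notebook's result; the hyperparameters loosely echo the best ones found above and `random_state` is an addition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared features (an assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    'decision tree': DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, random_state=0),
    'random forest': RandomForestClassifier(n_estimators=56, max_depth=3, random_state=0),
}

# 5-fold cross-validation: every row is used for testing exactly once.
cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```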
## Predicting data from `evaluation.csv`

In [326]:
evaluation['survived']=best_random_forrest.predict(evaluation.drop(columns=['ID']))
display(evaluation)
evaluation[['ID','survived']].to_csv('results.csv', index=False)

Unnamed: 0,ID,pclass,sex,age,sibsp,parch,fare,embarked_C,embarked_Q,embarked_S,survived
0,1000,2,0,24.0,2,1,27.0000,0,0,1,1
1,1001,2,0,25.0,1,1,30.0000,0,0,1,1
2,1002,2,1,38.0,1,0,21.0000,0,0,1,0
3,1003,3,0,19.0,1,0,16.1000,0,0,1,1
4,1004,2,0,60.0,1,0,26.0000,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
304,1304,3,0,18.5,0,0,7.2833,0,1,0,0
305,1305,3,1,32.0,0,0,56.4958,0,0,1,0
306,1306,3,1,22.5,0,0,7.2250,1,0,0,0
307,1307,1,0,45.0,1,1,164.8667,0,0,1,1


## Check if file was saved

In [327]:
check = pd.read_csv('results.csv')
display(check)

Unnamed: 0,ID,survived
0,1000,1
1,1001,1
2,1002,0
3,1003,1
4,1004,1
...,...,...
304,1304,0
305,1305,0
306,1306,0
307,1307,1
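Beyond displaying the file, a few assertions can confirm that it has the required shape. This is a minimal sketch; it reads an in-memory stand-in (`csv_text`, holding the first three rows shown above) so that it runs without the real file, where the notebook would read `'results.csv'`.

```python
import io

import pandas as pd

# Stand-in for the saved file; in the notebook this would be 'results.csv'.
csv_text = 'ID,survived\n1000,1\n1001,1\n1002,0\n'
check = pd.read_csv(io.StringIO(csv_text))

# Exactly the two required columns, unique IDs, and only 0/1 predictions.
assert list(check.columns) == ['ID', 'survived']
assert check['ID'].is_unique
assert check['survived'].isin([0, 1]).all()
print('results file looks well-formed:', len(check), 'rows')
```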
