### General Information 

### Step 0 - import libraries

### Step 1 - read data

  - List the data you need and how much you need.

  - Find and document where you can get the data.

  - Check how much space it will take.

  - Check legal obligations, and get authorizations if necessary.

  - Get access authorizations.

  - Create a workspace (with enough storage space).

  - Get the data.

  - Convert the data to a format you can easily manipulate (without changing the data itself).

  - Ensure sensitive information is deleted or protected (e.g. anonymized).

  - Check the size and type of data (time series, sample, geographical, etc.).

  - Sample a test set, put it aside, and never look at it (no data snooping!).

### Step 2 - train-test-split

<hr style="border:2px solid black">

*Do the train-test split as early as possible, before any exploration, so that nothing you see in the test data can bias your decisions!*

<hr style="border:2px solid black">
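A minimal sketch of an early, reproducible split, assuming scikit-learn and pandas are available. The toy `df` is made up for illustration; only the column names come from the data description, and `test_size=0.2` and `random_state=42` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; all values are made up for illustration.
df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 2, 1, 3, 2, 3, 1, 3],
    "Age":      [38.0, 22.0, None, 35.0, 54.0, 2.0, 27.0, 14.0, 58.0, 20.0],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})

X = df.drop(columns="Survived")
y = df["Survived"]

# Stratify on the target so both subsets keep the same class balance;
# fix random_state so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

From here on, only `X_train` and `y_train` should be explored; the test set stays untouched until the final evaluation.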

### Step 3 - exploratory data analysis

#### Step 3.0 - workflow

  - Create a copy of the data for exploration (sampling it down to a manageable size if necessary).

  - Create a notebook to keep a record of your data exploration.

  - Study each attribute and its characteristics:

    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Possibly useful for the task?
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)
  - For supervised learning tasks, identify target attribute(s).

  - Visualize the data.

  - Study the correlation between attributes.

  - Study how you would solve the problem manually.

  - Identify the promising transformations you may want to apply.

  - Identify extra data that would be useful.

  - Document what you have learned.
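Part of the checklist above (type, % of missing values, correlation between attributes) can be collected in one small table. The following is a sketch, assuming pandas; the toy `df` and its values are made up, and in practice you would run it on the exploration copy of the training data.

```python
import pandas as pd

# Toy stand-in for the exploration copy; values are made up for illustration.
df = pd.DataFrame({
    "Sex":  ["male", "female", "female", "male"],
    "Age":  [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# One row per attribute: type, share of missing values, number of distinct values.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "pct_missing": df.isna().mean() * 100,
    "n_unique": df.nunique(),
})
print(overview)

# Pairwise Pearson correlation between the numeric attributes only.
corr = df.select_dtypes("number").corr()
print(corr)
```

`df.describe()` and `df.hist()` are natural companions for the distribution and visualization items in the list.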

#### Step 3.1 - general overview

##### **Categorical:**

- `Nominal`

>- <u>Cabin</u> - Cabin number   
>- <u>Embarked</u> - Port of Embarkation ( C = Cherbourg | Q = Queenstown | S = Southampton )

- `Dichotomous`

>- <u>Sex</u> - ( Female | Male )

- `Ordinal`
    
>- <u>Pclass</u> - Ticket class ( 1 = 1st | 2 = 2nd | 3 = 3rd )
    * A proxy for socio-economic status (SES)
        - 1st = Upper
        - 2nd = Middle
        - 3rd = Lower

##### **Numeric:**

- `Discrete`

>- <u>Passenger ID</u>
>- <u>SibSp</u> - # of siblings / spouses aboard the Titanic	
    * sibsp: The dataset defines family relations in this way...
        - Sibling = brother, sister, stepbrother, stepsister
        - Spouse  = husband, wife (mistresses and fiancés were ignored)
>- <u>Parch</u> - # of parents / children aboard the Titanic
    * parch: The dataset defines family relations in this way...
        - Parent = mother, father
        - Child  = daughter, son, stepdaughter, stepson
        - Some children travelled only with a nanny, therefore parch = 0 for them.
>- <u>Survived</u> - ( 0 = Did not survive | 1 = Survived )

- `Continuous`

>- <u>Age</u> - Age in years
    * Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

>- <u>Fare</u> - Passenger fare

##### **Text:**

- <u>Ticket</u> - Ticket number
- <u>Name</u> - Name of passenger

#### Step 3.2 - descriptive statistics

#### Step 3.3 - observe some features in more detail

##### Step 3.3.1 - PassengerId

##### Step 3.3.2 - Pclass

##### Step 3.3.3 - Name

##### Step 3.3.4 - Sex

##### Step 3.3.5 - Age

##### Step 3.3.6 - SibSp

##### Step 3.3.7 - Parch

##### Step 3.3.8 - Ticket

##### Step 3.3.9 - Fare

##### Step 3.3.10 - Cabin

##### Step 3.3.11 - Embarked

##### Step 3.3.12 - Survived

### **Conclusion from the EDA**

<hr style="border:2px solid black">

    - ....
<hr style="border:2px solid black">

### Step 4 - cleaning & scaling

   - *Fix or remove outliers (optional).*
   - *Fill in missing values (e.g. with zero, mean, median ...) or drop their rows (or columns).*
   - *Standardize or normalize features.*
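The imputation and scaling items above can be sketched as follows, assuming scikit-learn. The arrays are made up (two numeric columns standing in for Age and Fare), and median imputation followed by standardization is just one of the options the list mentions. The important detail is that the statistics are learned on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data with gaps; columns stand in for Age and Fare.
X_train = np.array([[22.0, 7.25], [np.nan, 71.28], [26.0, 7.92], [35.0, np.nan]])
X_test = np.array([[np.nan, 8.05]])

# Median imputation followed by standardization (zero mean, unit variance).
cleaner = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# fit_transform learns medians, means and standard deviations on the training set ...
X_train_clean = cleaner.fit_transform(X_train)
# ... transform reuses them on the test set, so no test information leaks in.
X_test_clean = cleaner.transform(X_test)

print(X_train_clean)
print(X_test_clean)
```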

#### Step 4.1 - impute missing values

#### Step 4.2 - scaling

#### Step 4.3 - interpolation

#### Step 4.4 - remove duplicates and outliers

### Step 5 - feature engineering

   - *Discretize continuous features.*
   - *Decompose features (e.g. categorical, date/time, etc.).*
   - *Add promising transformations of features (e.g. log(x), sqrt(x), x^2, etc.).*
   - *Aggregate features into promising new features.*
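Each of the four ideas above can be tried in a line or two of pandas. This is an illustrative sketch: the rows are made up, the bin edges and labels are arbitrary, the "Surname, Title. Given name" layout of `Name` is an assumption, and the derived columns `AgeGroup`, `Title`, `LogFare` and `FamilySize` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows; the "Surname, Title. Given name" layout of Name is an assumption.
df = pd.DataFrame({
    "Age":   [2.0, 22.0, 38.0, 71.0],
    "Fare":  [0.0, 7.25, 71.28, 34.65],
    "SibSp": [1, 1, 0, 0],
    "Parch": [2, 0, 0, 1],
    "Name":  ["Poe, Master. Tim", "Doe, Mr. John", "Roe, Mrs. Jane", "Moe, Mr. Max"],
})

# Discretize a continuous feature (bin edges and labels are illustrative).
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Decompose a text feature: take the part between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Transform a skewed feature; log1p(x) = log(1 + x) is defined for a fare of 0.
df["LogFare"] = np.log1p(df["Fare"])

# Aggregate two features into one: siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

print(df[["AgeGroup", "Title", "LogFare", "FamilySize"]])
```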

#### Step 5.1 - feature extraction, decomposition and transformation

#### Step 5.2 - encoding of categorical features

#### Step 5.3 - discretization of continuous features

#### Step 5.4 - drop features

#### Step 5.5 - sampling strategy in case of imbalanced data

#### Step 5.6 - implement polynomials

### Step 6 - baseline model

#### Step 6.1 - create pipeline for the baseline model 

##### Step 6.1.1 - function and column transformer

##### Step 6.1.2 - set up pipeline with estimators

##### Step 6.1.3 - define the hyperparameter grid

##### Step 6.1.4 - set up the grid search CV
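Steps 6.1.1 to 6.1.4 could be wired together as in the sketch below, assuming scikit-learn. Everything specific here is an assumption chosen for illustration: the synthetic data, the choice of columns, `LogisticRegression` as the baseline estimator, and the small grid over `C`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic training data (random values, for illustration only).
rng = np.random.default_rng(0)
n = 60
X_train = pd.DataFrame({
    "Age": rng.uniform(1, 70, n),
    "Fare": rng.uniform(5, 100, n),
    "Sex": rng.choice(["female", "male"], n),
    "Embarked": rng.choice(["C", "Q", "S"], n),
})
X_train.loc[X_train.index[::10], "Age"] = np.nan  # inject some missing values
y_train = ((X_train["Sex"] == "female") | (rng.random(n) < 0.2)).astype(int)

# 6.1.1 - function and column transformer
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
])

# 6.1.2 - pipeline with estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 6.1.3 - hyperparameter grid (keys follow the "<step name>__<parameter>" pattern)
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# 6.1.4 - grid search with 3-fold cross-validation
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the imputer, scaler and encoder on its own training part only, which avoids leakage between folds.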

#### Step 6.2 - run the baseline model

#### Step 6.3 - evaluate the model

#### Step 6.4 - evaluate the feature importance

#### Step 6.5 - feature selection

### Step 7 - model tuning

#### Step 7.1 - create a pipeline for model tuning

##### Step 7.1.1 - function transformer

##### Step 7.1.2 - column transformer

##### Step 7.1.3 - set up pipeline with estimators

##### Step 7.1.4 - define the hyperparameter grid

##### Step 7.1.5 - set up the grid search CV

#### Step 7.2 - run the tuner model

#### Step 7.3 - evaluate the tuner model

#### Step 7.4 - handle over-/underfitting (e.g. regularization) - if necessary

#### Step 7.5 - optimize the model

### Step 8 - retraining the best model with the whole data set

### Step 9 - pickle the best model
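A minimal sketch of saving and restoring a fitted model with the standard-library `pickle` module. The tiny data set, the `LogisticRegression` stand-in for the best model, and the file name `best_model.pkl` are all hypothetical; a temporary directory is used here only so the example cleans up after itself.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up data standing in for the whole data set.
X_all = np.array([[0.0], [1.0], [2.0], [3.0]])
y_all = np.array([0, 0, 1, 1])

# Stand-in for the best model, refitted on all available data.
best_model = LogisticRegression().fit(X_all, y_all)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.pkl"  # hypothetical file name

    # Serialize the fitted estimator to disk ...
    with open(model_path, "wb") as f:
        pickle.dump(best_model, f)

    # ... and load it back, as a later session would.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should give the same predictions as the original.
same = (restored_model.predict(X_all) == best_model.predict(X_all)).all()
print(same)
```

Only unpickle files from a source you trust, and load the model with the same scikit-learn version that was used to save it.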