## Capstone Week 1 Comprehensive Document

### Milestones Week 1 Checklist

- Understood Process Model
 
- Understood Architectural Decisions Guidelines
 
- Identified Use Case
 
- Identified Data Source
 
- Assigned one concrete technology/framework to each architectural component
 
- Created an Initial Data Exploration Notebook

### Process Model Guidelines

#### Data Cleansing Guidelines

In some process models Data Cleansing is a separate task. It is closely tied to Feature
Creation but also draws on findings from the Initial Data Exploration task. The actual data
transformations are implemented in the Feature Creation asset deliverable; therefore, Data
Cleansing is part of the Feature Creation task in this process model.

While tuning machine learning models, this asset deliverable is touched on a regular basis
anyway because features need to be transformed to increase model performance. Such
iterations often surface issues with the data, which need to be corrected here as well.

The following non-exhaustive list gives you some guidelines:
- Data types

  Do the data types of columns match their content? E.g. is age stored as an integer and not as a string?

- Ranges

  Does the distribution of values in a column make sense? Use statistics (e.g. min, max, mean, standard deviation) and visualizations (e.g. box plot, histogram) for help.


- Emptiness

  Are all values non-null where mandatory? E.g. client IDs


- Uniqueness

  Are duplicates present where undesired? E.g. client IDs


- Set memberships

    Are only allowed values chosen for categorical or ordinal fields? E.g. Female, Male, Unknown


- Foreign key set memberships

  Are only allowed values chosen for a field, i.e. values that exist in the referenced table? E.g. ZIP code


- Regular expressions

  Some fields need to stick to a pattern expressed by a regular expression. E.g. a lower-case character followed by 6 digits


- Cross-field validation

  Some fields can impact validity of other fields. E.g. a male person can’t be pregnant

Please transform your data set accordingly and add all code to the Feature Creation asset
deliverable. Please comply with the naming convention documented in the process model.
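
Several of the checks listed above can be written as small rules that are applied to every record. The following is a minimal sketch in plain Python, not a prescribed implementation: the record layout, the field names (`client_id`, `age`, `gender`, `pregnant`), the plausible age range of 0 to 120 and the helper `check_record` are assumptions made for illustration.

```python
import re

# Hypothetical records; all field names and values are illustrative only.
records = [
    {"client_id": "a123456", "age": "34", "gender": "Female", "pregnant": False},
    {"client_id": "b654321", "age": "-5", "gender": "Male", "pregnant": True},
    {"client_id": "a123456", "age": "41", "gender": "Other", "pregnant": False},
    {"client_id": None, "age": "abc", "gender": "Unknown", "pregnant": False},
]

ALLOWED_GENDERS = {"Female", "Male", "Unknown"}  # set membership
ID_PATTERN = re.compile(r"[a-z]\d{6}")  # a lower-case character followed by 6 digits


def check_record(record, seen_ids):
    """Return the names of all cleansing rules that the record violates."""
    issues = []
    client_id = record["client_id"]
    if client_id is None:
        issues.append("emptiness")  # mandatory field is missing
    else:
        if not ID_PATTERN.fullmatch(client_id):
            issues.append("regular expression")
        if client_id in seen_ids:
            issues.append("uniqueness")  # duplicate client ID
        seen_ids.add(client_id)
    try:
        age = int(record["age"])  # data type: age should parse as an integer
        if not 0 <= age <= 120:  # range: assumed plausible interval
            issues.append("range")
    except ValueError:
        issues.append("data type")
    if record["gender"] not in ALLOWED_GENDERS:
        issues.append("set membership")
    if record["gender"] == "Male" and record["pregnant"]:
        issues.append("cross-field")  # one field invalidates another
    return issues


seen = set()
report = [check_record(record, seen) for record in records]
for record, issues in zip(records, report):
    print(record["client_id"], issues)
```

Collecting rule names per record, instead of failing on the first violation, makes it easier to decide later whether a record should be repaired, imputed or filtered out.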

#### Feature Engineering Guidelines

Feature Creation and Feature Engineering is one of the most important tasks in machine
learning since it hugely impacts model performance. This also holds for deep learning,
although to a lesser extent. Features can be changed or new features can be created from
existing ones.

The following non-exhaustive list gives you some guidelines for feature transformation:

- Imputing

  Some algorithms are very sensitive to missing values. Imputing fills empty fields based on the value distribution of the respective field.


- Imputed time-series quantization
  
  Time series often contain streams with measurements at different timestamps. Therefore, it is beneficial to quantize measurements to a common “heart beat” and impute the corresponding values. This can be done by sampling from the source time series distributions on the respective quantized time steps.
  

- Scaling / Normalizing / Centering 

  Some algorithms are very sensitive to differences in value ranges for individual fields. Therefore, it is best practice to center data around zero and scale values to a standard deviation of one.


- Filtering 
  
  Sometimes imputing values doesn’t perform well; in that case, deleting low-quality records is a better strategy.


- Discretizing
  
  Continuous fields might confuse the model, e.g. a discrete set of age ranges sometimes performs better than continuous values, especially on smaller amounts of data and with simpler models.  
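
Three of the transformations above (imputing, scaling and discretizing) can be sketched on a single numeric column. This is only an illustration in plain Python: the sample values, the choice of the median as the imputed value and the bucket boundaries in the hypothetical helper `age_bucket` are assumptions, not recommendations from the process model.

```python
import statistics

# Hypothetical age column with missing values (None).
ages = [23.0, None, 35.0, 47.0, None, 61.0]

# Imputing: fill empty fields with the median of the observed values.
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)
imputed = [median_age if a is None else a for a in ages]

# Scaling / centering: subtract the mean and divide by the standard deviation.
# This assumes the column is not constant (standard deviation greater than zero).
mean = statistics.fmean(imputed)
std = statistics.pstdev(imputed)
standardized = [(a - mean) / std for a in imputed]


def age_bucket(age):
    """Discretizing: map a continuous age to an assumed set of age ranges."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"


buckets = [age_bucket(a) for a in imputed]

print(imputed)
print([round(z, 3) for z in standardized])
print(buckets)
```

In a real project the mean, standard deviation and median would be computed on the training data only and then reused for validation and test data.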
  
The following non-exhaustive list gives you some guidelines for feature creation:


- One-hot-encoding

  Categorical integer features should be transformed into “one-hot” vectors. In relational terms this adds columns: one column for each distinct category.


- Time-to-Frequency transformation 

  Time-series (and sometimes also sequence data) is recorded in the time domain but can easily be transformed into the frequency domain, e.g. using FFT (Fast Fourier Transform).


- Month-From-Date

  Creating an additional feature containing the month, independent of the rest of the date, captures seasonal aspects. Sometimes further discretization into quarters helps as well.


- Aggregate-on-Target

  Simply aggregating fields on the target variable (or even on other fields) can improve performance, e.g. count the number of data points per ZIP code or take the median of all values by geographical region.

As feature engineering is an art in itself, this list cannot be exhaustive. You are not expected to become an expert in this topic at this point. Most of it you’ll learn by practicing data science on real projects and by talking to peers who might share their secrets and tricks with you.

Please transform your data set accordingly and add all code to the Feature Creation asset deliverable. Please comply with the naming convention documented in the process model.
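
The feature creation techniques listed above can be sketched on a tiny, made-up data set. The rows, the column names (`zip`, `color`, `signup`) and the helper `dft_magnitudes` are assumptions for illustration. The frequency transformation uses a naive discrete Fourier transform instead of an actual FFT to keep the sketch dependency-free; an FFT computes the same result faster.

```python
import cmath
from collections import Counter
from datetime import date

# Hypothetical rows; column names and values are illustrative only.
rows = [
    {"zip": "10115", "color": "red", "signup": date(2021, 1, 15)},
    {"zip": "10115", "color": "blue", "signup": date(2021, 5, 3)},
    {"zip": "80331", "color": "red", "signup": date(2021, 11, 30)},
]

# One-hot encoding: one additional column per distinct category.
categories = sorted({row["color"] for row in rows})
for row in rows:
    for category in categories:
        row[f"color_{category}"] = 1 if row["color"] == category else 0

# Month-from-date, plus a further discretization into quarters.
for row in rows:
    row["month"] = row["signup"].month
    row["quarter"] = (row["month"] - 1) // 3 + 1

# Aggregation: number of data points per ZIP code as a new feature.
zip_counts = Counter(row["zip"] for row in rows)
for row in rows:
    row["zip_count"] = zip_counts[row["zip"]]


def dft_magnitudes(signal):
    """Time-to-frequency: magnitude spectrum via a naive O(n^2) DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


# One period of a sine wave sampled at four points.
magnitudes = dft_magnitudes([0.0, 1.0, 0.0, -1.0])

print(rows[0])
print([round(m, 6) for m in magnitudes])
```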
  
#### Model Definition Guidelines

Now it’s time to start modelling. How you proceed depends on your use case and data
set. For example, in an unsupervised context you can choose between an auto-encoder, PCA or
clustering. In a supervised context you have a choice between different state-of-the-art
machine learning and deep learning algorithms.

However, you are required to follow these guidelines:

- Choose, justify and apply a model performance indicator (e.g. F1 score, true positive rate, within cluster sum of squared error, …) to assess your model and justify the choice of an algorithm
- Implement your model with at least one deep learning and at least one non-deep learning algorithm, then compare and document model performance
- Apply at least one additional iteration in the process model involving at least the feature creation task and record the impact on model performance (e.g. data normalizing, PCA, …)

Depending on the algorithm class and data set size you might choose specific technologies/frameworks to solve your problem. Please document all your decisions in the ADD (Architectural Decisions Document).

Keras optimizers:
https://keras.io/api/optimizers/

Available Keras optimizers:

- SGD
- RMSprop
- Adam
- Adadelta
- Adagrad
- Adamax
- Nadam
- Ftrl

Keras loss functions:
https://keras.io/api/losses/

Available Keras loss functions:

Note that all losses are available both via a class handle and via a function handle. The class handles enable you to pass configuration arguments to the constructor (e.g. `loss_fn = CategoricalCrossentropy(from_logits=True)`), and they perform reduction by default when used in a standalone way.

Probabilistic losses

- BinaryCrossentropy class
- CategoricalCrossentropy class
- SparseCategoricalCrossentropy class
- Poisson class
- binary_crossentropy function
- categorical_crossentropy function
- sparse_categorical_crossentropy function
- poisson function
- KLDivergence class
- kl_divergence function

Regression losses

- MeanSquaredError class
- MeanAbsoluteError class
- MeanAbsolutePercentageError class
- MeanSquaredLogarithmicError class
- CosineSimilarity class
- mean_squared_error function
- mean_absolute_error function
- mean_absolute_percentage_error function
- mean_squared_logarithmic_error function
- cosine_similarity function
- Huber class
- huber function
- LogCosh class
- log_cosh function

Hinge losses for "maximum-margin" classification

- Hinge class
- SquaredHinge class
- CategoricalHinge class
- hinge function
- squared_hinge function
- categorical_hinge function

Once you think you have achieved a decent model performance, save the notebook according
to the process model’s naming convention and proceed to the model training task.

#### Model Training Guidelines

Once your model is defined, it can be trained. This can happen on a single thread or on a
parallel framework like Watson Machine Learning or Apache Spark. In the simplest case,
Model Definition and Model Training are just a couple of LOCs (lines of code) apart. In the
case of Watson Machine Learning or Apache Spark, models might need to be serialized and
transferred to another technology / framework.
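
To picture what serializing and transferring a model can mean, the sketch below trains a toy model, writes its parameters to a JSON string and rebuilds the model from that string. `MeanRegressor` is a hypothetical stand-in, not a real framework class, and JSON is just one assumed exchange format; Watson Machine Learning, Apache Spark and the deep learning frameworks each come with their own serialization mechanisms.

```python
import json


class MeanRegressor:
    """Hypothetical toy model: always predicts the mean of the training targets."""

    def __init__(self, mean=0.0):
        self.mean = mean

    def fit(self, targets):
        # "Training" is a single line here: store the mean of the targets.
        self.mean = sum(targets) / len(targets)
        return self

    def predict(self, n_samples):
        return [self.mean] * n_samples

    def to_json(self):
        # Serialize only the learned parameters, not the code.
        return json.dumps({"model": "MeanRegressor", "mean": self.mean})

    @classmethod
    def from_json(cls, text):
        params = json.loads(text)
        return cls(mean=params["mean"])


model = MeanRegressor().fit([2.0, 4.0, 6.0])
serialized = model.to_json()  # could now be stored or sent elsewhere
restored = MeanRegressor.from_json(serialized)

print(serialized)
print(restored.predict(2))
```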

Please specify and justify the technologies used for model definition and training in the
ADD.

Once you think you have achieved a decent model performance, save the notebook
according to the process model’s naming convention and proceed to the model evaluation
task.

#### Model Evaluation Guidelines

Model evaluation is a critical task in data science. It produces one of the few measures business
stakeholders are interested in. Model performance heavily influences the business impact of a
data science project. Therefore, it is important to set time aside for it as an independent task
in the process model.

So how are models evaluated? In supervised machine learning this is relatively
straightforward since you can always create a ground truth and compare your results
against ground truth.

So, we either split the data into training, validation and test sets to assess model
performance on the test set, or we use cross-validation. This is all explained in Week 2 of the
following Coursera course: https://www.coursera.org/learn/advanced-machine-learning-signalprocessing/
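
Both strategies boil down to partitioning sample indices. The following is a minimal sketch in plain Python; the function names `train_test_split` and `k_fold_indices`, the 20 % test fraction and the fixed seed are assumptions for illustration, and machine learning libraries usually provide equivalent utilities.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the items and split off a hold-out test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


train, test = train_test_split(range(10))
print(len(train), len(test))

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(val_idx)
```

With cross-validation every sample is used for validation exactly once, which gives a more stable performance estimate on small data sets than a single split.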

When a ground truth is available, as in supervised learning (classification and
regression), we need a different evaluation measure than in unsupervised learning
(clustering). Since the measure depends on the type of model we create, the
following non-exhaustive lists can be used as a starting point for further research:

Classification:

- Confusion Matrix
- Accuracy
- Precision
- Recall
- Specificity
- True positive rate
- True negative rate
- False positive rate
- False negative rate
- F1-score
- Gain and Lift
- Kolmogorov-Smirnov
- Area Under ROC
- Gini Coefficient
- Concordant – Discordant ratio
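
Many of the classification measures above are derived from the four cells of the confusion matrix. As a minimal sketch for the binary case (the labels, the sample predictions and the helper names are made up for illustration):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive and truth == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a measure is undefined."""
    return numerator / denominator if denominator else 0.0


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = safe_div(tp + tn, tp + fp + tn + fn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)          # true positive rate
specificity = safe_div(tn, tn + fp)     # true negative rate
f1 = safe_div(2 * precision * recall, precision + recall)

print(tp, fp, tn, fn)
print(accuracy, precision, recall, specificity, f1)
```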

Regression:

- Root Mean Squared Error (RMSE)
- Mean Squared Error
- Mean Absolute Error (MAE)
- R-Squared
- Relative Squared Error
- Relative Absolute Error
- Sum of Differences
- ACF plot of residuals
- Histogram of residuals
- Residual plots against predictors
- Residual plots against fitted values
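
The error-based regression measures above all start from the residuals, i.e. the differences between true and predicted values. A minimal sketch, with made-up values and a hypothetical helper `regression_metrics`:

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared from paired values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    # Assumes the true values are not all identical (ss_tot > 0).
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical true values and predictions.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

The list of residuals computed inside the function is also the input for the residual plots and the histogram of residuals mentioned above.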

Clustering:

- Adjusted Rand index
- Mutual Information
- Homogeneity, completeness
- V-measure
- Fowlkes-Mallows
- Silhouette Coefficient
- Calinski-Harabasz index
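
As an example of a clustering measure that compares cluster assignments against known labels, the adjusted Rand index can be computed from pair counts. This is a sketch under assumptions: the function name `adjusted_rand_index` and the sample labelings are made up, and returning 1.0 for degenerate inputs is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same samples."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumed convention for degenerate input
    # Pairs of samples that share a cluster in both / in one labeling.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(count, 2) for count in contingency.values())
    sum_true = sum(comb(count, 2) for count in Counter(labels_true).values())
    sum_pred = sum(comb(count, 2) for count in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # assumed convention, e.g. both labelings are trivial
    return (sum_cells - expected) / (maximum - expected)


# Same grouping with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Grouping that cuts across the true clusters: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index does not depend on how the clusters are named, which is why the first call reports perfect agreement.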

References:
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

Please choose at least one appropriate model performance measure, justify why you’ve
used it and document how iterative changes in the feature creation task influence it.

#### Model Deployment Guidelines

Model deployment comes in many shapes. The key is that the business
insights that result from the model are made available to stakeholders. This can happen in
various ways. At the simplest level, a PDF report is generated (e.g. using a Jupyter notebook
in Watson Studio) and handed over to business stakeholders. Alternatively, the model is
encapsulated behind a REST API and either made available for consumption by a data
product or sold internally or externally as an API (e.g. by using IBM Watson Machine Learning
or Fabric for Deep Learning).
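
To illustrate the REST API option in its simplest form, the sketch below wraps a scoring function in an HTTP handler from the Python standard library. Everything in it is an assumption for illustration: the `predict` function with its fixed weights stands in for a trained model, the JSON payload shape `{"features": [...]}` and the port are made up, and a managed service such as IBM Watson Machine Learning would replace this code entirely.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_request_body(body):
    """Turn a JSON request body into a JSON response body (both bytes)."""
    payload = json.loads(body)
    result = {"prediction": predict(payload["features"])}
    return json.dumps(result).encode("utf-8")


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly as many bytes as the client announced.
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request_body(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8080):
    """Start the server; blocks until interrupted. Not called in this sketch."""
    HTTPServer(("", port), PredictionHandler).serve_forever()


# The request handling logic can be tried without starting a server.
print(handle_request_body(b'{"features": [2.0, 4.0]}'))
```

Keeping the scoring logic in `handle_request_body`, separate from the HTTP handler, means it can be tested without a running server. A real deployment would additionally need input validation, error handling and authentication.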

Depending on your use case, please choose and implement an appropriate model
deployment option and justify your decisions in the ADD.

### The Lightweight IBM Cloud Garage Method for Data Science
### Architectural Decisions Document (ADD) Template

1	Architectural Components Overview

![image.png](attachment:image.png)

1.1	Data Source

1.1.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.1.2	Justification

Please justify your technology choices here.

1.2	Enterprise Data

1.2.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.2.2	Justification

Please justify your technology choices here.

1.3	Streaming analytics

1.3.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.3.2	Justification

Please justify your technology choices here.

1.4	Data Integration 

1.4.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.4.2	Justification

Please justify your technology choices here.

1.5	Data Repository

1.5.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.5.2	Justification

Please justify your technology choices here.

1.6	Discovery and Exploration 

1.6.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.6.2	Justification

Please justify your technology choices here.

1.7	Actionable Insights

1.7.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.7.2	Justification

Please justify your technology choices here.

1.8	Applications / Data Products

1.8.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.8.2	Justification

Please justify your technology choices here.

1.9	Security, Information Governance and Systems Management

1.9.1	Technology Choice

Please describe the technology you have chosen here and justify the choice below. If this component is not needed, explain why below.

1.9.2	Justification

Please justify your technology choices here.