# DECART Predictive Analytics: Day 1 


#### _“Prediction is very difficult, especially about the future.”_  -Niels Bohr

## Course Overview and Outline

In this part, we will cover the ***main concepts and essential theoretical background*** for core predictive modeling techniques. Most of these concepts are **implemented in Python as modules**, so that a module can be called whenever it is needed instead of being rewritten. In that regard, each piece of code and each concept will be introduced, discussed, and tested. 

We'll have about 6 health care analytics use cases over our four days together. You will have 4 folders, one for each day.  For some days there will be Python source code in a .py file that goes with the notebooks. 

We'll try to cover all content in discussions or by way of exercises.  Predictive analytics is a very broad and multidisciplinary domain.  We don't claim to cover it comprehensively, here.

**INSTRUCTORS**  
Samir Abdelrahman samir.abdelrahman@utah.edu  
Lynd Bacon lynd.bacon@hsc.utah.edu

### Learning Topics and Objectives By Course Day 
1. Day 1:
    - Introduction and challenges
    - Quick review of the [pandas](http://pandas.pydata.org/) package for data science
    - Regression with models linear in their parameters
    - Introduction to Regression with Linear Models
2. Day 2:
    - Regression with Linear Models (cont.)
    - Simple models for some limited dependent measures
    - Regularization
    - Bayesian regression, models with hyperparameters
    - Cluster Analysis
        - partitioning
        - hierarchical
        - model-based
        - selecting the best number of clusters
3. Day 3:
    - Introduction
        - Differences between classification and prediction
        - Model development, validation, and testing
        - Bias vs. variance and underfitting vs. overfitting
        - Classification measures, like the F-measure, AUC, PPV, and NPV
    - Classifiers
        - Logistic regression, decision tree, support vector machine
        - Ensemble methods
        - binary, multi-class, multilabel classifiers
4. Day 4:
    - Classifiers (cont.)
    - Introduction to Deep Learning
        - Fully connected neural networks
        - Convolutional neural networks


# Predictive Analytics (PA)

PA is a multidisciplinary field that uses a combination of statistics and machine learning modeling techniques to predict unknown future events.

<img src="../images/PA.png" height= 75% width=75%>

<sub>
       Figure Reference: http://www.predictiveanalyticstoday.com/what-is-predictive-analytics/
</sub>

# Predictive Modeling (PM)
PM usually is a combination of machine learning and/or simulation techniques to discover patterns from a given dataset of historical records/data points.

<img src="../images/Datasets.png" height= 70% width=70% style="float: center;">


Once models are developed, they may be "embedded" in running production systems.  

![Cook-Zubscek-JACR-2017-Fig-3.png](../images/Cook-Zubscek-JACR-2017-Fig-3.png)

**Validating results** is critically important in almost all ML/predictive analytics applications.

# Main Machine Learning Algorithm Categories 

1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning


# Supervised Learning (Predictive learning)

#### Key Idea:

We have one or more observed dependent variables or _criterion_ measures to be predicted.  The data we have on what's to be predicted may be **[labeled data](https://en.wikipedia.org/wiki/Labeled_data)** about the subject, continuous data, or both. That is, we know some **truth** value or values, and we want to learn how to predict this truth value.  

Generally speaking, we want to determine the best possible function of predictor variables, X, for predicting a dependent criterion, Y: one that minimizes prediction errors when generalized to a population of interest:  

\begin{align*}
Y_i = f(X_i)+\epsilon_i
\end{align*}

We want to minimize the $\epsilon_i$'s.
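As a minimal sketch of what "minimizing the $\epsilon_i$'s" can look like, the block below fits a straight line by ordinary least squares using only the Python standard library. The synthetic data (a true intercept of 2, a true slope of 3, and Gaussian noise) and the helper name `fit_line` are assumptions made for illustration, not part of the course modules.

```python
import random

# Synthetic data (an assumption for illustration): Y = 2 + 3*X + noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(x, y)

# The residuals are our estimates of the epsilon_i.
# Least squares chooses the line that minimizes their sum of squares.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(r ** 2 for r in residuals) / len(residuals)
print(f"intercept={intercept:.2f}, slope={slope:.2f}, MSE={mse:.2f}")
```

Squared error is only one possible way to measure the size of the errors; other loss functions lead to other "best" functions.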

## Approach

1. The goal is to infer a function from training data that relates values on observed outcome/criterion variables to values on variables that predict them.  


2. Each training record is an example or a [vector of features/attributes](https://en.wikipedia.org/wiki/Feature_vector) (predictors) and label (outcome) variables.
3. Once learned, the function can be used to make predictions for new cases, provided that it has been developed so as to _generalize_ adequately.
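The three steps above can be sketched with a deliberately simple classifier. This is an illustrative sketch, not a method prescribed by the course: the two-feature records are synthetic, the class centers are made up, and `fit_centroids` and `predict` are hypothetical helper names. The point is the workflow: learn from training records, then check predictions on records the function has never seen.

```python
import random

rng = random.Random(1)


def make_record(label):
    # Hypothetical two-feature records; the class centers are made up.
    center = (2.0, 2.0) if label == "benign" else (5.0, 5.0)
    return ([rng.gauss(c, 1.0) for c in center], label)


records = [make_record(lbl) for lbl in ["benign", "malignant"] * 100]
rng.shuffle(records)
train, test = records[:150], records[150:]  # hold out 50 records


def fit_centroids(rows):
    """Learn one mean feature vector (centroid) per label from training rows."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [s / counts[lbl] for s in acc] for lbl, acc in sums.items()}


def predict(centroids, features):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))


centroids = fit_centroids(train)

# Accuracy on held-out records is a rough check on generalization.
accuracy = sum(predict(centroids, f) == lbl for f, lbl in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```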

## Types 

One way applications can be differentiated is in terms of the _nature of what's to be predicted_.  In many cases, the variable to be predicted consists either of unordered categories, or of values on the real number line or a space with coordinates in $\mathbf{R}$.  And then there are the "in between" cases.

1. If the label is categorical, then classification. **(classification types?)**.
    1. Male/Female
    1. Student/Teacher
    1. Malignant/Benign 
    1. Others?
2. If the label is continuous on $\mathbf{R}$, then regression.
    1. Systolic blood pressure
    1. BMI
    1. Serum creatinine
    1. Others?
3.  Not *strictly* unordered category labels, or values not spanning all of $\mathbf{R}$. Some examples:  
    1. Truncated values (e.g. at zero) on $\mathbf{R}$
    1. Counts
    1. Rankings
    
**QUESTION** 

Under what circumstances might variables in category 2., above, need to be treated as if they are in category 3?
    

## Challenges

Most of the following pertain to ML/predictive analytics, in general.

1. Enough training data.
    1. Labeling can be expensive!
2. External datasets to validate.
3. Skewed or unbalanced dependent/criterion measures, e.g. rare events
4. Time and memory constraints.
4. Choosing between models
5. Striking the best possible *bias-variance* tradeoff: 

In the context of machine learning, _bias_ and _variance_ can have particular definitions:

**BIAS** = error in approximating the "true state of Nature," real processes or systems.

**VARIANCE** = how much our approximation of $f(X_i),~\hat{f}(X_i)$, varies when applied to samples from the same population.  

In principle, we could minimize _both_ bias and variance with the "best" model and lots of data.  Unfortunately, it's not that easy.  Bias and variance usually change in different ways as a given model is adjusted, e.g. by adding or modifying predictor variables. 
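The two definitions above can be estimated by simulation. The sketch below is an illustration under stated assumptions: the "true state of Nature" is taken to be $f(x) = \sin(x)$, the noise level is made up, and the model is a k-nearest-neighbor average (a technique chosen here for brevity, not one named in these notes). Drawing many training samples from the same population and predicting at one point `x0` gives an estimate of both quantities.

```python
import math
import random
import statistics


def true_f(x):
    return math.sin(x)  # assumed "true state of Nature" for this demo


def knn_predict(train, x0, k):
    """Average the y values of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_and_variance(k, x0=1.5, n_train=50, n_repeats=300, seed=0):
    """Estimate bias and variance of the k-NN prediction at x0 by simulation."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        # A fresh training sample from the same population each time.
        xs = [rng.uniform(0, math.pi) for _ in range(n_train)]
        train = [(x, true_f(x) + rng.gauss(0, 0.3)) for x in xs]
        preds.append(knn_predict(train, x0, k))
    bias = statistics.mean(preds) - true_f(x0)  # systematic error
    variance = statistics.pvariance(preds)      # spread across samples
    return bias, variance


for k in (1, 40):
    bias, variance = bias_and_variance(k)
    print(f"k={k:2d}  bias={bias:+.3f}  variance={variance:.4f}")
```

A flexible model (small `k`) tracks each training sample closely, so it should show little bias but more variance; a rigid model (large `k`) averages over a wide range of `x`, which lowers variance at the cost of bias.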

6. Algorithmic bias
7. Algorithm aversion


# _Unsupervised_ Learning


#### Key idea

**Labels or values** that help us understand the observations we have data for _are not available_. For example, we might want to know if the observations can be summarized as falling into different groups.  We want to use machine learning to *infer* labels or values.

##  Approach

1. Generally speaking, it's the task of inferring a similarity function from unlabeled training data to partition the dataset into subgroups (subpopulations).
2. Inferred labels could be used for measuring the performance of clusters.
3. Inferred labels may be applied to new observations by applying a classifier.
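A minimal sketch of this approach, assuming squared Euclidean distance as the similarity function and plain k-means (Lloyd's algorithm) as the partitioning method. The synthetic two-group data and the helper names `nearest` and `kmeans` are illustrative assumptions. The last line shows an inferred label being applied to a new observation.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen observations
    for _ in range(n_iter):
        labels = [nearest(centroids, p) for p in points]
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Recompute labels so they match the final centroids.
    return centroids, [nearest(centroids, p) for p in points]


# Unlabeled synthetic data: two groups whose membership we pretend not to know.
rng = random.Random(42)
points = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
          + [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(100)])

centroids, labels = kmeans(points, k=2)

# Apply the inferred labels to a new observation.
new_label = nearest(centroids, [5.5, 6.2])
print(centroids, new_label)
```

The cluster indices (0 or 1) are arbitrary; only the grouping they induce carries meaning.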


##  Types 

1. Discovering subpopulations (clustering).
2. Discovering frequent associations and patterns among predictors (association/pattern mining).
3. Using visualization to explore the data and dimensionality reduction to represent predictors (visual analytics; "human" pattern detection)


## Challenges

1. Validation: quantitative vs. qualitative, and intra-cluster (within a cluster) vs. inter-cluster (between clusters).
2. Identifying clusters with complex, multidimensional "shapes"
2. Noise and outliers **(difference?)**.
3. Time and memory constraints.


# Semi-supervised Learning (SSL)
 
 ### What SSL Is
1. Combines unsupervised and supervised learning methods.
2. Used when there are many unlabeled observations and few labeled observations.
3. There are many versions that apply the learning algorithms in different orders.  The aim is to boost overall learning performance. 

### How it Works, in a "Nutshell"
1. Use labeled seeds as initial examples for clustering.
2. Run a clustering algorithm to associate unlabeled cases with the seeds. 
3. Repeat the above two steps to refine and redistribute the clusters.
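The three steps above might be sketched as a "seeded" variant of k-means. This is one of many possible versions, written under assumptions: the seeds keep their labels throughout, distance is squared Euclidean, and the data, labels "A" and "B", and the name `seeded_kmeans` are made up for illustration.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def closest_label(centroids, point):
    """Label of the cluster center nearest to point."""
    return min(centroids,
               key=lambda lbl: sum((p - c) ** 2
                                   for p, c in zip(point, centroids[lbl])))


def seeded_kmeans(seeds, unlabeled, n_iter=10):
    """seeds: dict mapping label -> labeled points; unlabeled: list of points."""
    # Step 1: the labeled seeds define the initial cluster centers.
    centroids = {lbl: centroid(rows) for lbl, rows in seeds.items()}
    for _ in range(n_iter):  # Step 3: repeat to refine the clusters
        # Step 2: attach each unlabeled case to the nearest seeded cluster.
        assigned = [closest_label(centroids, p) for p in unlabeled]
        # Refine each center from its seeds plus its current members.
        for lbl, rows in seeds.items():
            members = rows + [p for p, a in zip(unlabeled, assigned) if a == lbl]
            centroids[lbl] = centroid(members)
    # Final assignment, consistent with the final centers.
    return centroids, [closest_label(centroids, p) for p in unlabeled]


rng = random.Random(7)
seeds = {  # a few labeled observations per class
    "A": [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(3)],
    "B": [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(3)],
}
unlabeled = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
             + [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(100)])

centroids, assigned = seeded_kmeans(seeds, unlabeled)
print({lbl: assigned.count(lbl) for lbl in seeds})
```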

A nice write-up of clustering and semi-supervised clustering methods: 
    
[Semi-supervised clustering methods](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3979639/)

# Reinforcement Learning (RL)

1. A task in which one or more software "agents" interact with an environment in order to maximize some sort of reward.
2. Each agent monitors the environment and learns how to react, ignore, or set up new rules.
3. An agent should run forever unless it is killed.

<img src="../images/Agent.png" height= 35% width=35% style="  right;">




**Thought Experiment: How might _you_ be an RL agent in real life?**


<sub>
       Agent Structure: https://en.wikipedia.org/wiki/Reinforcement_learning 
</sub>
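A minimal sketch of an agent that acts, observes a reward, and updates what it has learned. It assumes the simplest RL setting, a multi-armed bandit with an epsilon-greedy rule; the class name `EpsilonGreedyAgent`, the function `run`, and the success probabilities are hypothetical. The loop is finite here only so that the example terminates.

```python
import random


class EpsilonGreedyAgent:
    """Agent that learns the value of each action from the rewards it receives."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.counts = [0] * n_actions
        self.values = [0.0] * n_actions  # running mean reward per action

    def act(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))  # explore
        # Exploit: pick the action with the highest estimated value.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def learn(self, action, reward):
        # Incremental update of the mean reward for the chosen action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def run(agent, success_probs, n_steps=5000, seed=1):
    """Environment loop: each action pays 1 with its own (hidden) probability."""
    env_rng = random.Random(seed)
    total = 0.0
    for _ in range(n_steps):
        action = agent.act()
        reward = 1.0 if env_rng.random() < success_probs[action] else 0.0
        agent.learn(action, reward)
        total += reward
    return total / n_steps


agent = EpsilonGreedyAgent(n_actions=3)
average_reward = run(agent, success_probs=[0.2, 0.5, 0.8])
best_action = max(range(3), key=lambda a: agent.values[a])
print(f"average reward={average_reward:.2f}, best action={best_action}")
```

The `epsilon` parameter controls how often the agent tries something other than its current best guess, which is how it can discover that an action it initially ignored pays better.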

Richard Sutton and Andrew Barto wrote a seminal book about RL some years ago.  They have a draft of a new edition online at:  

[Reinforcement Learning: An Introduction, 2nd Ed., 2018](https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view)

# A Rumination Exercise

Select one of the following learning algorithms or methods: 
    * Supervised (classification/regression)
    * Unsupervised (clustering/pattern mining)
    * Semi-supervised
    * Reinforcement Learning

Then, describe how you might apply it to one of the following:
    * Predicting length of stay
    * Predicting mortality
    * Predicting chronic kidney disease
    * Defining chronic kidney disease staging
    * Integrating different data resources
    * Developing an interactive decision system used by both health care professionals and patients