# Akamai Survival Analysis 
Caleb Lance Matthew Nicole

## Introduction


![Network Diagram](networkDiagram.png)

## Meta Data

### Range Of Data
Sunday April 10, 21:20:00 EDT 2016 to Friday April 22, 11:45:00 EDT 2016

### Sampling Frequency
Aggregated at 5-minute intervals

### Total Number of Files
5225 Files -- 81 G total 
- x servers 
- y racks

### File Types

- `content`
- `diskload`
- `load`

`content`: Perhaps the most cryptic of the file types. Metrics describe the type of media (videos, pictures, livestreams, etc.) being served at a given time on a given machine.

`diskload`: Metrics pertain to disk I/O.

`load`: Metrics pertain to the stress level of a given machine at a given point in time.

## Variables of Interest
- `rolled`: server process rolled (died) in last 60 seconds
- `flits`: combination of cpu, disk
- `flitcap`: normalizing factor for flits
- `cpu`: average cpu percentage use in last 60 seconds
- `ip`: IP address of the Akamai server that is serving the content
- `conns`: number of user connections that the server is handling
- `rack`:
- `model`: 

## Normalized Roll Counts of Servers

- *Roll Count*: Number of time points at which a server "rolled"
- Raw roll counts would over-represent servers that have more time points in the data
- Thus, we normalize each roll count by total number of time points for each server
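
The normalization described above can be sketched in a few lines. This is only an illustration: the `(ip, rolled)` record layout and the example addresses are assumptions made for the sketch, not the actual file format.

```python
from collections import defaultdict


def normalized_roll_counts(records):
    """Return {ip: rolled time points / total time points} for each server.

    `records` is an iterable of (ip, rolled) pairs, one per time point.
    This layout is hypothetical and chosen only for illustration.
    """
    rolls = defaultdict(int)
    points = defaultdict(int)
    for ip, rolled in records:
        points[ip] += 1
        if rolled:
            rolls[ip] += 1
    return {ip: rolls[ip] / points[ip] for ip in points}


# Made-up example: both servers rolled once, but the second one
# was observed for twice as many time points.
example = [
    ("10.0.0.1", 0), ("10.0.0.1", 1),
    ("10.0.0.2", 0), ("10.0.0.2", 0), ("10.0.0.2", 0), ("10.0.0.2", 1),
]
print(normalized_roll_counts(example))
```

Both servers have a raw roll count of 1, yet the normalized values differ because the second server contributes more time points.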

## Normalized Roll Counts of Servers
![](tableServerRollCounts.PNG)
![](histServerRollCounts.png)

![](overall_content.png)

![](Difference_content.png)

## Survival Analysis
- Identify features that contribute to hardware failure
- Identify hardware that requires the least amount of downtime
- Anticipate outages for a rack during the year

- Survival analysis centers on the time until one or more events happen
- Survival can be defined in many ways depending on the problem. Here, survival means that a given machine went the entire duration of our study without rolling. The rationale for this definition is the assumption that once a machine rolls, it is more likely to roll several more times afterward.
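
Under this definition, each server reduces to a pair: how long it was observed before its first roll, and whether a roll was observed at all. A minimal sketch, assuming one `rolled` flag per 5-minute sample and assuming that the k-th sample marks k intervals of elapsed time (both are assumptions of this sketch):

```python
def time_to_first_roll(rolled_flags, interval_minutes=5):
    """Return (duration_minutes, event_observed) for one server.

    `rolled_flags` is the server's `rolled` value at each sample, in time order.
    If the server never rolls, the duration is the full observation window and
    event_observed is False, meaning the server survived the study.
    """
    for index, rolled in enumerate(rolled_flags):
        if rolled:
            return (index + 1) * interval_minutes, True
    return len(rolled_flags) * interval_minutes, False


# Hypothetical servers: one rolls at its third sample, one never rolls.
print(time_to_first_roll([0, 0, 1, 0]))
print(time_to_first_roll([0, 0, 0, 0]))
```

Only the first roll matters here; later rolls on the same machine are ignored, in line with the definition of survival above.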

## Formulation



**Survival Function**: $S(t)$ of an individual is the probability of *surviving* until at least time $t$

$$ S(t) = P(T>t) $$

- $T$: time of event

- The event is not guaranteed to occur within our study period. When it does not, we only know that $T$ exceeds the observed time; this is known as **right-censoring**

**Hazard Function**: $\lambda(t)$ is the instantaneous rate at which the event occurs at time $t$, given that an individual has reached time $t$. For small $\delta t$, $\lambda(t)\,\delta t$ approximates the probability that the event occurs in the next instant $[t, t + \delta t)$

$$ \lambda(t) = \lim_{\delta t \rightarrow  0} \frac{P(t\leq T < t+\delta t \mid T \geq t)}{\delta t} $$ 

We can relate the hazard function of an individual to their specific survival function:

$$ S(t) = e^{-\int_{0}^{t} \lambda(u)\,du} = e^{-\Lambda(t)}, \qquad \Lambda(t) = \int_{0}^{t} \lambda(u)\,du $$
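
As a numerical sanity check of this relation, the sketch below integrates a hazard function with the trapezoidal rule and exponentiates the negated result. The function name and the constant hazard of 0.1 are illustrative assumptions, not values from the data.

```python
import math


def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral from 0 to t of hazard(u) du).

    The integral (the cumulative hazard) is computed with the trapezoidal rule.
    """
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * width)
    cumulative_hazard = total * width
    return math.exp(-cumulative_hazard)


# With a constant hazard the survival function is exponential:
# S(t) = exp(-0.1 * t) for the made-up rate 0.1.
print(survival_from_hazard(lambda u: 0.1, 5.0), math.exp(-0.5))
```

The two printed values should agree closely, since a constant hazard integrates to the rate times $t$.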


- This approach leads us to survival regression: we want to know the relative impact of a server's hardware, load profile, and number of connections on the time until it first crashes (rolls).

### Cox Proportional Hazards

Covariates act multiplicatively on the hazard function:

$$ \lambda(t \mid X)= \lambda_{0}(t)\exp(\beta_{1}X_{1}+\dots+\beta_{p}X_{p}) $$

- $\lambda_{0}(t)$: the non-parametric baseline hazard function
- $\lambda(t \mid X)$: the hazard at time $t$ for an individual with covariates $X$


Interpreting the coefficients is a matter of examining the **Hazard Ratio**:

$$\frac{\lambda(t \mid X)}{\lambda_{0}(t)} = \exp(\beta_{1}X_{1}+\dots+\beta_{p}X_{p})$$
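
To make the hazard ratio concrete, here is a small sketch that evaluates the exponential of the linear predictor for one covariate vector. The coefficient values are invented for illustration and are not estimates from this study.

```python
import math


def hazard_ratio(coefficients, covariates):
    """Return exp(beta_1 * x_1 + ... + beta_p * x_p), the hazard relative to baseline."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return math.exp(linear_predictor)


# Hypothetical model with two covariates: an indicator for a hardware model
# and a load measure. A coefficient of log(2) on the indicator doubles the hazard.
betas = [math.log(2.0), 0.0]
print(hazard_ratio(betas, [1, 0.7]))  # indicator switched on
print(hazard_ratio(betas, [0, 0.7]))  # baseline hardware
```

Because the baseline hazard cancels in the ratio, the result does not depend on $t$; this is the proportional hazards assumption.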



![](hazardInterpretation.png)

- Cox Proportional Hazards was first published in [ ]  
- The interest is in associating each of the risk factors (predictors) with the outcome; the associations are quantified by the regression coefficients 
- Cox PH gives a semi-parametric method of estimating the hazard function at time $t$ given a baseline hazard that is modified by a set of covariates
- The fitting of these coefficients is beyond the scope of this presentation; it maximizes a partial likelihood using the Newton-Raphson method. 
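
For readers curious about the fitting step, the following is a minimal sketch of Newton-Raphson on the Cox partial likelihood for a single covariate. It assumes no tied event times, and the data are invented; it is not the code used to produce the results in these slides.

```python
import math


def cox_score(durations, events, x, beta):
    """Return (gradient, information) of the partial log-likelihood at beta."""
    gradient = 0.0
    information = 0.0
    for t_i, observed, x_i in zip(durations, events, x):
        if not observed:
            continue  # censored subjects contribute only through risk sets
        # Risk set: every subject still under observation at time t_i.
        risk = [(x_j, math.exp(beta * x_j))
                for t_j, x_j in zip(durations, x) if t_j >= t_i]
        s0 = sum(w for _, w in risk)
        s1 = sum(x_j * w for x_j, w in risk)
        s2 = sum(x_j * x_j * w for x_j, w in risk)
        mean = s1 / s0
        gradient += x_i - mean
        information += s2 / s0 - mean ** 2
    return gradient, information


def fit_cox_single_covariate(durations, events, x, iterations=25):
    """Maximize the partial likelihood with Newton-Raphson, starting at beta = 0."""
    beta = 0.0
    for _ in range(iterations):
        gradient, information = cox_score(durations, events, x, beta)
        if information == 0.0:
            break
        step = gradient / information
        beta += step
        if abs(step) < 1e-10:
            break
    return beta


# Made-up data: x = 1 marks a hypothetical "risky" hardware model.
durations = [2, 3, 4, 5, 6, 7, 8, 9]
events = [1, 1, 1, 1, 1, 0, 1, 0]
x = [1, 1, 0, 1, 0, 1, 0, 0]
beta_hat = fit_cox_single_covariate(durations, events, x)
print(beta_hat, math.exp(beta_hat))
```

The baseline hazard never appears in the computation, which is why the method is called semi-parametric: only the ordering of event times and the risk sets matter.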

![](hazardRatio2.png)

- Pictured here is a plot of the Hazard Ratio for coefficients that were significant at the 5% level.
- Confidence intervals for specific coefficients are displayed in light blue 
- Note that coefficients with error bars overlapping the red line (where hazard ratio = 1) should be interpreted as having no statistically distinguishable effect on survival
- The TSSTcorp server has a hazard rate roughly 44x that of our baseline Toshiba server (very risky)
- The Intel server has a hazard rate roughly 38x that of our baseline Toshiba server (very risky)
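
The reading of the plot can be mimicked numerically. The sketch below converts a coefficient and its standard error into a hazard ratio with a Wald-type 95% interval and checks whether the interval overlaps 1. The Wald construction and the standard errors are assumptions of this sketch; the slides do not state how the plotted intervals were computed.

```python
import math


def hazard_ratio_interval(beta, standard_error, z=1.96):
    """Return (hazard_ratio, lower, upper) using exp(beta +/- z * standard_error)."""
    lower = math.exp(beta - z * standard_error)
    upper = math.exp(beta + z * standard_error)
    return math.exp(beta), lower, upper


def overlaps_no_effect(lower, upper):
    """True when the interval contains a hazard ratio of 1 (the red line)."""
    return lower <= 1.0 <= upper


# A coefficient of log(44) corresponds to a hazard ratio of 44.
# The standard error of 0.5 is hypothetical.
ratio, low, high = hazard_ratio_interval(math.log(44.0), 0.5)
print(ratio, low, high, overlaps_no_effect(low, high))
```

A coefficient near zero with a comparatively large standard error would instead produce an interval that straddles 1.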

![](concordanceIndex2.png)

- Concordance is a "global" index for validating the predictive ability of a survival model


- It can be interpreted as the fraction of all pairs of subjects whose predicted survival times are correctly ordered, among all pairs that can actually be ordered. In other words, it is the probability of concordance between the predicted and the observed survival

The index of concordance is the fraction of pairs in your data where the observation with the higher survival time also has the higher probability of survival predicted by your model. It is closely related to rank correlation measures such as Somers' D.

The index is not calculated for every observation/subject, so the c-index cannot be interpreted as the risk of a subject. High values mean that your model predicts higher probabilities of survival for higher observed survival times.

If you are interested in the risk of a subject in a time period $t$, I think you have to estimate the survival and hazard function for a given set of regressors. My main reference on this subject is Harrell (2001): Regression Modeling Strategies, Springer
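
The pairwise definition of concordance can be sketched directly. This version treats a pair as orderable only when the shorter duration ended in an observed event, counts tied risk scores as half concordant, and uses a predicted risk score (higher means earlier expected failure) in place of a predicted survival probability. Those conventions and the data are assumptions of this sketch.

```python
def concordance_index(durations, events, risk_scores):
    """Fraction of orderable pairs in which the earlier failure has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # Orderable pair: subject i had an observed event strictly before
            # subject j's last observed time.
            if events[i] and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no orderable pairs in the data")
    return concordant / comparable


# Made-up example: the third subject is censored at time 6.
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(durations, events, [0.9, 0.7, 0.2, 0.4]))
```

A model that assigns the same score to everyone lands at 0.5 under this convention, which is the usual reference point for "no better than chance".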

## Conclusions

- When appropriate, server/rack hardware should be exchanged for protective components (those with a hazard ratio below 1)
- Survival analysis provides a means of quantifying risk factors relative to a baseline