# Modeling

We often want to understand the relationship between different sources of information. As an example, we'll consider the historical request rates of a web server and compare them to its CPU usage. We'll try to predict the CPU usage of the server based on the request rates of the different pages. First, some imports:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab
pylab.rcParams['figure.figsize'] = (13.0, 8.0)
%matplotlib inline

In [2]:
import statsmodels

### Data import and inspection

[Pandas](http://pandas.pydata.org/) is a popular library for data wrangling. We'll use it to load and inspect a CSV file that contains the historical request rates and CPU usage of a web server:

In [None]:
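# DataFrame.from_csv was removed in pandas 1.0; these read_csv arguments match its defaults
data = pd.read_csv("data/request_rate_vs_CPU.csv", index_col=0, parse_dates=True)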

The head command allows one to quickly see the structure of the loaded data:

In [None]:
data.head()

We can select the CPU column and plot the data:

In [None]:
data.plot(figsize=(13,8), y="CPU")

Next we plot the request rates, leaving out the CPU column because it has a different unit:

In [None]:
data.drop('CPU', axis=1).plot(figsize=(13,8))

To start modeling the data, we'll work with plain numpy arrays.

We extract the column labels as the request_names for later reference:

In [None]:
request_names = data.drop('CPU', axis=1).columns.values
request_names

We extract the request rates as a 2-dimensional numpy array:

In [None]:
request_rates = data.drop('CPU', axis=1).values

And the CPU usage as a one-dimensional numpy array:

In [None]:
cpu = data['CPU'].values
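
As a minimal sketch of what these extraction steps produce, the example below applies them to a small made-up DataFrame. The column names page_a and page_b and all values are assumptions for illustration only and are not taken from the real CSV file.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are made up.
toy = pd.DataFrame({
    "page_a": [1.0, 2.0, 3.0],
    "page_b": [0.5, 0.0, 1.5],
    "CPU": [10.0, 12.0, 20.0],
})

toy_names = toy.drop("CPU", axis=1).columns.values  # labels of the request columns
toy_rates = toy.drop("CPU", axis=1).values          # 2-D array: one row per time step
toy_cpu = toy["CPU"].values                         # 1-D array: one value per time step

print(toy_names)
print(toy_rates.shape, toy_cpu.shape)
```

Each row of the 2-D array corresponds to one time step and each column to one request type, which is the layout the later regression steps rely on.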

### Simple linear regression

First, we're going to work with the total request rate on the server and compare it to the CPU usage. The numpy function [sum](http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) can be used to calculate the total request rate, provided we select the right direction (axis) for the summation.

In [None]:
# fill in
total_request_rate = 
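
To see how the axis argument changes the result, here is a small sketch on a made-up array (the values are illustrative assumptions, not the server data). Rows play the role of time steps and columns the role of request types.

```python
import numpy as np

# 2 time steps (rows) x 3 request types (columns); values are made up.
rates = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

per_type_total = np.sum(rates, axis=0)  # collapses the rows: one total per column
per_step_total = np.sum(rates, axis=1)  # collapses the columns: one total per row

print(per_type_total)
print(per_step_total)
```

Summing along axis 0 gives one number per request type, while summing along axis 1 gives one number per time step.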

Let's plot the total request rate to check:

In [None]:
plt.figure(figsize=(13,8))
plt.plot(total_request_rate)

We can use [PyPlot's scatter plot](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter) to understand the relation between the total request rate and the CPU usage:

In [None]:
# fill in 
plt.figure(figsize=(13,8))
plt.xlabel("Total request rate")
plt.ylabel("CPU usage")
plt.scatter( ...

There clearly is a strong correlation between the request rate and the CPU usage. Now we'll try to capture this relation using a linear model:

$$ \text{cpu} = c_0 + c_1 \text{total_request_rate} $$

For that we'll use the [scikit-learn](http://scikit-learn.org/stable/) machine learning library for Python and its [least-squares linear regression](http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).

In [None]:
from sklearn import linear_model
model = linear_model.LinearRegression()

Now we need to feed the data to the model to fit it. The model.fit method expects a matrix (a 2-dimensional array with one row per sample), so we need to convert total_request_rate into a matrix with one column. We can use the [reshape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html) method for that:

In [None]:
# fill in
total_request_rate_M = 
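
As an illustration of reshape on a made-up vector (not the exercise data), the sketch below turns a 1-dimensional array into a matrix with a single column. Passing -1 asks numpy to infer that dimension from the array length.

```python
import numpy as np

v = np.arange(5.0)          # shape (5,): a plain 1-D array
v_col = v.reshape(-1, 1)    # shape (5, 1): 5 rows, 1 column

print(v.shape, v_col.shape)
```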

Then we fit our model using the total request rate and the CPU usage.

In [None]:
#fill in
model.fit(...

We can now inspect the coefficient $c_1$ of the model:

In [None]:
model.coef_

And the constant term $c_0$:

In [None]:
model.intercept_

Once the model is trained we can use it to [predict](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) the outcome for a given input (or array of inputs). 

What is the expected CPU usage when we have 50 requests per second? 

In [None]:
# fill in 
model.predict(...
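
The fit, inspect, and predict steps can be sketched on synthetic data where the answer is known in advance. Here the data follows y = 3 + 2x exactly; these numbers and the name toy_model are assumptions chosen for illustration and have nothing to do with the server data.

```python
import numpy as np
from sklearn import linear_model

# Synthetic, noise-free data generated from y = 3 + 2 * x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

toy_model = linear_model.LinearRegression()
toy_model.fit(x.reshape(-1, 1), y)   # fit expects a 2-D input: (n_samples, n_features)

print(toy_model.coef_)        # slope, close to 2
print(toy_model.intercept_)   # constant term, close to 3

# predict also expects a 2-D input, even for a single value.
prediction = toy_model.predict(np.array([[10.0]]))
print(prediction)             # close to 3 + 2 * 10
```

Note that predict returns an array with one entry per input row, so a single query still yields an array of length one.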

Now we plot the linear model together with our data to verify that it captures the relationship correctly (the predict method can accept the whole total_request_rate_M matrix at once).

In [None]:
# fill in
plt.figure(figsize=(15,10))
plt.scatter( ... , ...  , color='black')
plt.plot( ... , ... , color='blue', linewidth=3)
plt.xlabel("Total request rate")
plt.ylabel("CPU usage")

plt.show()

Our model also has a [score](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score) method that returns the coefficient of determination $R^2$, which indicates how well the linear model captures the data. A score of 1 means the model predicts the data perfectly. A score of 0 (or lower) means the model does no better than always predicting the mean CPU usage, in which case it does not make sense to model the data this way.

In [None]:
model.score(...
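
To make the meaning of the score concrete, the sketch below computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand on synthetic noisy data and compares it to the value returned by score. The data, the noise level, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: a linear trend plus Gaussian noise (seeded for repeatability).
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
X = x.reshape(-1, 1)

m = linear_model.LinearRegression().fit(X, y)

residual_ss = np.sum((y - m.predict(X)) ** 2)  # error left after the fit
total_ss = np.sum((y - y.mean()) ** 2)         # error of always predicting the mean
r2_manual = 1.0 - residual_ss / total_ss

print(r2_manual, m.score(X, y))  # the two values should agree
```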

### Multiple linear regression

Now we consider the separate request rates again and build a linear model for that. The model we try to fit takes the form:

$$\text{cpu} = c_0 + c_1 \text{request_rate}_1 + c_2 \text{request_rate}_2 + \ldots + c_n \text{request_rate}_n$$

where the $\text{request_rate}_i$'s correspond to our different requests:


In [None]:
request_names

Now we create a new [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) model.

In [None]:
# fill in
multi_lin_model = 

Next fit the model on the data:

In [None]:
# fill in
multi_lin_model.fit(
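
As a sketch of multiple linear regression on data where the true coefficients are known, the example below generates three synthetic input columns, builds the target from chosen coefficients, and checks that the fit recovers them. All values and names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic inputs: 100 samples, 3 features (seeded for repeatability).
rng = np.random.RandomState(1)
X = rng.uniform(0, 5, size=(100, 3))

# Noise-free target built from known coefficients and a known constant term.
true_coefs = np.array([0.5, 2.0, 1.0])
y = 4.0 + X @ true_coefs

m = LinearRegression().fit(X, y)

print(m.coef_)       # one coefficient per input column, close to true_coefs
print(m.intercept_)  # close to 4
```

With several input columns, coef_ holds one entry per column in the same order as the columns of the input matrix.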

Which request causes most CPU usage, on a per visit basis? ([np.argmax](http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html) finds the index of the greatest element in an array)

In [None]:
#fill in 
heavy_request = 
print(heavy_request)
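
In a linear model, each coefficient is the change in the predicted value per unit of its input, so the largest coefficient marks the most expensive input per unit. A minimal sketch with made-up names and coefficients (not fitted values from the exercise):

```python
import numpy as np

# Hypothetical labels and coefficients, in matching order.
names = np.array(["page_a", "page_b", "page_c"])
coefs = np.array([0.5, 2.0, 1.0])

heaviest = names[np.argmax(coefs)]  # label at the position of the largest coefficient
print(heaviest)
```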

If we want to minimize average CPU usage on this server by diverting the traffic of one webpage to another server, which page should we choose?  
One way to determine this is by using the multi_lin_model.predict method. Another way is by directly using the regression formula. Some functions that might be useful for this:
- [np.mean](http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) can be used to calculate the mean of the values in a matrix, optionally along one axis
- a * b will calculate the element-wise product of two vectors

In [None]:
# fill in
average_rates = 
request_to_move = 
print(request_to_move)
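
One way to reason about this with the regression formula: under the linear model, the average CPU usage equals $c_0$ plus the sum of $c_i$ times the average rate of request $i$, so removing request $i$ lowers the average by $c_i$ times its average rate. The sketch below applies this to made-up rates and coefficients; none of the numbers or names come from the exercise data.

```python
import numpy as np

# Hypothetical labels, coefficients, and request rates (rows are time steps).
names = np.array(["page_a", "page_b", "page_c"])
coefs = np.array([0.5, 2.0, 1.0])
rates = np.array([[10.0, 1.0, 2.0],
                  [12.0, 1.0, 4.0],
                  [14.0, 1.0, 6.0]])

mean_rates = np.mean(rates, axis=0)   # average rate per request type
contribution = coefs * mean_rates     # average CPU attributed to each request type

biggest = names[np.argmax(contribution)]
print(contribution)
print(biggest)
```

In this toy example the request with the largest coefficient (page_b) is not the one with the largest average contribution (page_a), because page_a is requested far more often.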