## Kaggle Competition 

Compete in the [CPU Prediction](http://inclass.kaggle.com/c/model-t4/data) Kaggle competition. It is a regression problem with a fairly large number of features, so Ridge and LASSO might work well. Remember to use models you understand, and do not be swayed into using overly complex algorithms. Since the competition is still ongoing, you can submit your solution and get scored against other data scientists! Use what you have learned in the first 3 weeks of class to work your way up the leaderboard.

__If you want to work with other students who have finished the sprint, feel free to form a team of 3-5 students.__

## Tools: 

Before you dive into regression, algorithms, and testing, talk to your partner/team and devise a strategy for analyzing the data. Work effectively so that you can communicate your findings in a presentation. Use any of the tools we learned this week (here are some suggestions):

<u> Use EDA techniques: </u>

* Visualize the data set and understand your variables. 
* Look for the categorical and continuous regressors. 
* Use faceting or stratification to identify collinearity.
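
The EDA steps above can be sketched roughly as follows. This is a minimal example on synthetic data, not the competition data: the column names mimic the competition's naming style, the values are made up, and `correlated_pairs` is a hypothetical helper with an arbitrary threshold of 0.9.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are invented).
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
df = pd.DataFrame({
    "m_id": rng.choice(list("abc"), size=n),                       # categorical
    "syst_page_fault_rate": base,                                  # continuous
    "syst_page_read_ipo_rate": base + 0.05 * rng.normal(size=n),   # near-duplicate of the above
    "syst_process_count": rng.integers(200, 320, size=n),          # unrelated to the others
})

# Separate categorical from continuous regressors by dtype.
continuous = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
categorical = [c for c in df.columns if c not in continuous]


def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| > threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


pairs = correlated_pairs(df[continuous])
print("categorical:", categorical)
print("continuous:", continuous)
print("highly correlated pairs:", pairs)
```

Pairwise correlation only catches collinearity between two columns at a time; a scatter matrix faceted by `m_id` is a reasonable visual complement.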

<u> Use the big guns:</u> 

* Linear regression
* Ridge regression
* Lasso regression 
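
To see how the three models differ, here is a small sketch on synthetic data where only two of ten features carry signal. It uses scikit-learn's own `StandardScaler` and `make_pipeline` rather than the `regression_tools` transformers imported later in this notebook, and the `alpha` values are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two features influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n_samples)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10.0),   # alpha chosen arbitrarily
    "lasso": Lasso(alpha=0.1),    # alpha chosen arbitrarily
}

coefs = {}
for name, model in models.items():
    # Standardize first so the penalty treats every feature on the same scale.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X, y)
    coefs[name] = pipe.steps[-1][1].coef_

for name, c in coefs.items():
    print(f"{name:>6}: nonzero={np.count_nonzero(c)}, norm={np.linalg.norm(c):.3f}")
```

Ridge shrinks all coefficients toward zero, while LASSO can set some exactly to zero, which is why it is often used when many features are suspected to be irrelevant.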


<u>Check for and address problems in the data:</u>

* Multicollinearity: detect it and reduce it.
* Heteroscedasticity.
* Influential points, high-leverage points, and outliers.
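
Two of these diagnostics can be computed with NumPy alone. The sketch below uses synthetic data with a planted near-collinear column and one planted extreme observation. `variance_inflation_factors` and `leverages` are hypothetical helper names. The variance inflation factor of feature $j$ is $1/(1 - R_j^2)$, which equals the $j$-th diagonal entry of the inverse correlation matrix, and the leverage of observation $i$ is the $i$-th diagonal entry of the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)   # nearly collinear with x0
x2 = rng.normal(size=n)              # unrelated to x0 and x1
x2[0] = 15.0                         # one planted extreme observation
X = np.column_stack([x0, x1, x2])


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), read off the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def leverages(X):
    """Diagonal of the hat matrix for a design with an intercept column."""
    design = np.column_stack([np.ones(len(X)), X])
    q, _ = np.linalg.qr(design)      # H = Q Q^T, so h_ii = squared norm of row i of Q
    return np.sum(q ** 2, axis=1)


vif = variance_inflation_factors(X)
lev = leverages(X)
print("VIF per feature:", np.round(vif, 2))
print("highest-leverage row:", int(np.argmax(lev)))
```

Common rules of thumb flag a VIF above roughly 5 to 10, or a leverage well above the average of $p/n$ (with $p$ the number of design columns), but these cutoffs are conventions rather than hard limits.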

<u> Test your predictions: </u>

* Use k-fold cross-validation to test for overfitting.
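
A minimal sketch of that check, assuming scikit-learn's `KFold` and `cross_val_score` and using RMSE as the error measure (an assumption; the competition's scoring metric is not stated here). The synthetic data has one informative feature and thirty pure-noise features, so an unregularized linear model overfits and the gap between training error and cross-validated error shows it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 1 informative feature plus 30 noise features, few samples.
rng = np.random.default_rng(0)
n_samples, n_noise = 60, 30
signal = rng.normal(size=(n_samples, 1))
noise_features = rng.normal(size=(n_samples, n_noise))
X = np.hstack([signal, noise_features])
y = 2.0 * signal[:, 0] + rng.normal(size=n_samples)

model = LinearRegression()

# In-sample error: optimistic, because the model has seen these rows.
model.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

# Out-of-sample error estimated with 5-fold cross-validation.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse.mean())

print(f"train RMSE: {train_rmse:.3f}")
print(f"5-fold CV RMSE: {cv_rmse:.3f}")
```

A cross-validated error much larger than the training error suggests overfitting. For time-ordered measurements, shuffled folds can be optimistic; this sketch does not address that.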

Good Luck!


In [1]:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

from basis_expansions.basis_expansions import NaturalCubicSpline
from regression_tools.dftransformers import (
    ColumnSelector, Identity,
    FeatureUnion, MapFeature,
    StandardScaler)

from regression_tools.plotting_tools import (
    plot_partial_depenence,
    plot_partial_dependences,
    predicteds_vs_actuals)

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [2]:
raw_data = pd.read_csv('data/train.csv')

In [3]:
raw_data.head()

| Unnamed: 0 | sample_time | m_id | syst_direct_ipo_rate | syst_buffered_ipo_rate | syst_page_fault_rate | syst_page_read_ipo_rate | syst_process_count | syst_other_states | page_page_write_ipo_rate | page_global_valid_fault_rate | ... | tcp_retxto | tcp_kpalv | lla0_pkts_recvpsec | lla0_pkts_sentpsec | llb0_pkts_recvpsec | llb0_pkts_sentpsec | ewc0_pkts_recvpsec | ewc0_pkts_sentpsec | ewd0_pkts_recvpsec | ewd0_pkts_sentpsec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-11-24 00:01:00 | a | 80.48 | 1261.97 | 15.55 | 2.1 | 271 | 12 | 6.23 | 4.67 | ... | 0 | 0 | 464.483 | 463.167 | 67.433 | 89.183 | 382.533 | 327.983 | 380.783 | 412.833 |
| 1 | 2010-11-24 00:01:00 | b | 73.8 | 624.38 | 6.43 | 0.45 | 317 | 10 | 5.47 | 1.1 | ... | 0 | 0 | 292.667 | 429.65 | 70.583 | 97.783 | 255.8 | 350.417 | 407.6 | 270.15 |
| 2 | 2010-11-24 00:01:00 | c | 40.57 | 466.18 | 6.4 | 0.45 | 258 | 22 | 10.9 | 1.1 | ... | 0 | 0 | 359.917 | 404.55 | 16.45 | 21.883 | 356.083 | 266.717 | 231.483 | 276.15 |
| 3 | 2010-11-24 00:01:00 | d | 68.75 | 495.1 | 6.5 | 0.45 | 224 | 10 | 2.58 | 1.1 | ... | 0 | 0 | 378.05 | 367.067 | 21.717 | 29.117 | 262.867 | 297.35 | 316.95 | 286.433 |
| 4 | 2010-11-24 00:01:00 | e | 47.45 | 436.65 | 6.42 | 0.45 | 253 | 10 | 2.48 | 1.1 | ... | 0 | 0 | 374.05 | 369.217 | 24.75 | 34.033 | 368.317 | 293.133 | 292.283 | 308.933 |


This competition is part of a machine learning workshop given at InTraffic.

The goal is to predict the load on the CPUs in a cluster of servers based on the behavior of a series of applications running on these servers.

There are two CPUs in each server. There are seven servers in the cluster. The prediction is for the second CPU.

The dataset consists of a set of variables that were measured over a period of about one month. Measurements were taken at one-minute intervals on each server. Each measurement is usually the average or sum over that one-minute interval, for instance the number of packets received or the average number of IO operations.
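
The sampling scheme described above can be sanity-checked along these lines. This is a sketch on a tiny hand-built frame, not the real file: the server ids `a` through `g` and the four timestamps are hypothetical, and only the column names `sample_time` and `m_id` come from the data.

```python
import pandas as pd

# Tiny stand-in frame: 7 hypothetical servers sampled at 4 consecutive minutes.
servers = list("abcdefg")
times = pd.date_range("2010-11-24 00:01:00", periods=4, freq="min")
df = pd.DataFrame(
    [(str(t), s) for t in times for s in servers],
    columns=["sample_time", "m_id"],
)

# sample_time is read from CSV as text, so parse it into timestamps first.
df["sample_time"] = pd.to_datetime(df["sample_time"])

n_servers = df["m_id"].nunique()

# Time between consecutive samples on each server.
gaps = (
    df.sort_values(["m_id", "sample_time"])
      .groupby("m_id")["sample_time"]
      .diff()
      .dropna()
)
all_one_minute = bool((gaps == pd.Timedelta(minutes=1)).all())

print("servers:", n_servers)
print("every gap is one minute:", all_one_minute)
```

On the real data, gaps longer than one minute would point to missing samples that may need handling before modeling.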

The set is data from a real cluster that is used to control train traffic in a geographical area spanning several cities.

Data fields

sample_time - the date and time the data was sampled.  
m_id - the ID of the server the data was sampled at.  
appxxxx - data about a specific application.  
page_xxx - data on memory usage of the server.  
syst_xxx - data on page fault rate, number of processes, etc.  
state_xxx - data on the state the system is in.  
io_xxx - data about general IO usage, (file IO, direct IO).  
tcp_xxx - data on incoming and outgoing TCP traffic.  
llxxx, ewxxx - data on incoming and outgoing network traffic.  
cpu_01_busy - the variable we are trying to predict.
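
One way to work with these families is to group column names by prefix, as sketched below. `FAMILIES` and `group_columns` are hypothetical names; the prefixes are taken from the field list above, and the sample column names are a handful of those visible in this notebook's output plus the target.

```python
from collections import defaultdict

# Family prefixes taken from the data-field descriptions.
FAMILIES = ("app", "page", "syst", "state", "io", "tcp", "ll", "ew")


def group_columns(columns):
    """Map each family prefix to the column names that start with it."""
    groups = defaultdict(list)
    for col in columns:
        # First matching prefix wins; anything unmatched goes to "other".
        family = next((p for p in FAMILIES if col.startswith(p)), "other")
        groups[family].append(col)
    return dict(groups)


columns = [
    "sample_time", "m_id",
    "syst_direct_ipo_rate", "syst_process_count",
    "page_page_write_ipo_rate", "io_file_open_rate",
    "tcp_retxto", "lla0_pkts_recvpsec", "ewd0_pkts_sentpsec",
    "cpu_01_busy",
]
groups = group_columns(columns)
for family, cols in groups.items():
    print(f"{family}: {cols}")
```

With the real frame, `group_columns(raw_data.columns)` would give per-family column lists that can be explored or modeled separately.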

In [7]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178780 entries, 0 to 178779
Data columns (total 89 columns):
sample_time                     178780 non-null object
m_id                            178780 non-null object
syst_direct_ipo_rate            178780 non-null float64
syst_buffered_ipo_rate          178780 non-null float64
syst_page_fault_rate            178780 non-null float64
syst_page_read_ipo_rate         178780 non-null float64
syst_process_count              178780 non-null int64
syst_other_states               178780 non-null int64
page_page_write_ipo_rate        178780 non-null float64
page_global_valid_fault_rate    178780 non-null float64
page_free_list_size             178780 non-null int64
page_modified_list_size         178780 non-null int64
io_mailbox_write_rate           178780 non-null float64
io_split_transfer_rate          178780 non-null float64
io_file_open_rate               178780 non-null float64
io_logical_name_trans           178780 non-null float64
io_