# Ensemble Learning Using Random Forests
This is a lab session on tree ensembles, in particular Random Forests. We use them to build a classifier for human activity recognition from mobile phone sensor data.

## Read Documentation about the Data Sources

The features selected for this UCI database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix 't' to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz. 
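
The filtering steps described above can be sketched with SciPy. This is only an illustration under assumptions: the median filter kernel size, the use of zero-phase filtering (`filtfilt`), and the order of the 0.3 Hz gravity filter are not given in the text, and the helper names `denoise` and `split_body_gravity` are hypothetical.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

FS = 50.0  # sampling rate in Hz, as stated above


def denoise(raw, kernel_size=3):
    """Median filter, then a 3rd order low-pass Butterworth at 20 Hz.

    kernel_size is an assumed value; the text does not specify it.
    """
    smoothed = medfilt(raw, kernel_size=kernel_size)
    b, a = butter(3, 20.0, btype="low", fs=FS)
    return filtfilt(b, a, smoothed)  # zero-phase filtering (an assumption)


def split_body_gravity(acc, order=3):
    """Separate gravity (below 0.3 Hz) from body acceleration.

    The filter order is an assumed value; the text does not specify it.
    """
    b, a = butter(order, 0.3, btype="low", fs=FS)
    gravity = filtfilt(b, a, acc)
    return acc - gravity, gravity


# Toy signal: constant gravity plus a 2 Hz body movement, 10 seconds long.
t = np.arange(500) / FS
acc = 9.81 + 0.5 * np.sin(2 * np.pi * 2.0 * t)

clean = denoise(acc)
body, gravity = split_body_gravity(clean)
```

Here `gravity` stays close to the constant offset while `body` keeps the 2 Hz oscillation, which mirrors the tGravityAcc / tBodyAcc split.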

Subsequently, the body linear acceleration and angular velocity were differentiated with respect to time to obtain jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). The magnitude of each of these three-dimensional signals was also calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
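
As a rough sketch of these two derivations, the snippet below approximates the time derivative with NumPy finite differences and computes the per-sample Euclidean norm. The exact differentiation scheme used for the dataset is not stated here, so `np.gradient` is an assumption, and the `jerk` and `magnitude` helpers and the toy signal are hypothetical.

```python
import numpy as np

FS = 50.0      # sampling rate in Hz, as stated above
DT = 1.0 / FS  # time between consecutive samples, in seconds


def jerk(signal_xyz):
    """Approximate the time derivative of an (n_samples, 3) signal."""
    return np.gradient(signal_xyz, DT, axis=0)


def magnitude(signal_xyz):
    """Euclidean norm of each 3-axis sample: one value per time step."""
    return np.linalg.norm(signal_xyz, axis=1)


# Toy stand-in for tBodyAcc-XYZ: acceleration growing linearly along X only.
t = np.arange(100) * DT
body_acc = np.column_stack([2.0 * t, np.zeros_like(t), np.zeros_like(t)])

body_acc_jerk = jerk(body_acc)                # analogous to tBodyAccJerk-XYZ
body_acc_mag = magnitude(body_acc)            # analogous to tBodyAccMag
body_acc_jerk_mag = magnitude(body_acc_jerk)  # analogous to tBodyAccJerkMag
```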

Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. (Note the 'f' to indicate frequency domain signals). 
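
To make the time-to-frequency step concrete, here is a minimal sketch that applies a real FFT to one window of a toy signal. The window length of 128 samples is an assumption (it is not stated in the text above), and the toy 5 Hz sine is purely illustrative.

```python
import numpy as np

FS = 50.0     # sampling rate in Hz, as stated above
WINDOW = 128  # assumed window length in samples (hypothetical)

t = np.arange(WINDOW) / FS
window = np.sin(2 * np.pi * 5.0 * t)  # toy time-domain signal at 5 Hz

spectrum = np.abs(np.fft.rfft(window))       # magnitude per frequency bin
freqs = np.fft.rfftfreq(WINDOW, d=1.0 / FS)  # bin frequencies in Hz

dominant_bin = int(np.argmax(spectrum))      # bin with the largest magnitude
dominant_freq = freqs[dominant_bin]

# Magnitude-weighted average frequency of the window.
mean_freq = np.sum(freqs * spectrum) / np.sum(spectrum)
```

The dominant frequency lands on the bin nearest to 5 Hz rather than exactly on 5 Hz, because the bin spacing is `FS / WINDOW`.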

These signals were used to estimate variables of the feature vector for each pattern:  
'-XYZ' is used to denote 3-axial signals in the X, Y and Z directions.

- tBodyAcc-XYZ
- tGravityAcc-XYZ
- tBodyAccJerk-XYZ
- tBodyGyro-XYZ
- tBodyGyroJerk-XYZ
- tBodyAccMag
- tGravityAccMag
- tBodyAccJerkMag
- tBodyGyroMag
- tBodyGyroJerkMag
- fBodyAcc-XYZ
- fBodyAccJerk-XYZ
- fBodyGyro-XYZ
- fBodyAccMag
- fBodyAccJerkMag
- fBodyGyroMag
- fBodyGyroJerkMag

The set of variables that were estimated from these signals are: 

- mean(): Mean value
- std(): Standard deviation
- mad(): Median absolute deviation
- max(): Largest value in array
- min(): Smallest value in array
- sma(): Signal magnitude area
- energy(): Energy measure. Sum of the squares divided by the number of values.
- iqr(): Interquartile range
- entropy(): Signal entropy
- arCoeff(): Autoregression coefficients with Burg order equal to 4
- correlation(): Correlation coefficient between two signals
- maxInds(): Index of the frequency component with largest magnitude
- meanFreq(): Weighted average of the frequency components to obtain a mean frequency
- skewness(): Skewness of the frequency domain signal
- kurtosis(): Kurtosis of the frequency domain signal
- bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.
- angle(): Angle between two vectors.
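
A few of the simpler summary statistics above can be sketched in NumPy as follows. The helper `window_features` is hypothetical, and details such as using the population standard deviation (`ddof=0`) and linear interpolation for the quartiles are assumptions; the dataset's own implementation may differ.

```python
import numpy as np


def window_features(x):
    """Compute a handful of per-window summary statistics for a 1-D signal."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return {
        "mean": x.mean(),
        "std": x.std(),  # population standard deviation (assumption)
        "mad": np.median(np.abs(x - np.median(x))),  # median absolute deviation
        "max": x.max(),
        "min": x.min(),
        "energy": np.sum(x ** 2) / x.size,  # sum of squares / number of values
        "iqr": q75 - q25,                   # interquartile range
    }


features = window_features([1.0, 2.0, 3.0, 4.0, 5.0])
```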

Additional vectors are obtained by averaging the signals within a window sample. These are used by the angle() variable:

- gravityMean
- tBodyAccMean
- tBodyAccJerkMean
- tBodyGyroMean
- tBodyGyroJerkMean
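
The angle between two such averaged vectors can be computed from the normalized dot product. The sketch below is one plausible reading of angle(); the helper name `angle_between`, the toy vectors, and the choice of radians as the unit are assumptions.

```python
import numpy as np


def angle_between(u, v):
    """Angle in radians between two vectors, via the normalized dot product."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding errors pushing the value outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


# Toy stand-ins for gravityMean and tBodyAccMean.
gravity_mean = np.array([0.0, 0.0, 9.81])
body_acc_mean = np.array([1.0, 0.0, 0.0])

theta = angle_between(body_acc_mean, gravity_mean)
```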

## Import the Data and Browse
- Data is located in <project_root>/exercises/data/samsungdata.csv
    - For solutions notebook, the relative file path is '../data/samsungdata.csv'
    - For exercise notebook, the relative path is './data/samsungdata.csv'


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
import pandas as pd
X_raw = pd.read_csv('../data/samsungdata.csv')

### First, take a look at the data and get acquainted
For example, you could do things like
```python
X_raw.shape
```

```python
X_raw.head()
```

For a more detailed overview of your data, look at
```python
X_raw.describe()
```

In particular, browse to see if there are any variables in your data that are NOT numerical sensor measurements to be used for prediction.
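
One possible way to spot such columns is to filter by dtype and to check the number of distinct values per column. The snippet below uses a small hypothetical DataFrame (`toy`) so it runs on its own; the column names and values are made up for illustration. The same calls could be applied to `X_raw`.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one sensor feature, one numeric
# identifier, and one text label.
toy = pd.DataFrame({
    "tBodyAcc-mean()-X": [0.28, 0.27, 0.29],
    "subject": [1, 1, 2],
    "activity": ["standing", "walking", "laying"],
})

# Columns whose dtype is not numeric (text labels, for example).
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values per column: very low counts hint at identifiers or categories.
n_unique = toy.nunique()
```

Note that a dtype check alone does not catch identifiers stored as integers, such as `subject` in this toy frame, so it helps to look at `nunique()` and the column names as well.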

In [None]:
X_raw.head()

In [None]:
X_raw.shape

In [None]:
X_raw.describe()

### Cleaning up the data and getting ready for Machine Learning
We'll do a very crude data cleaning step, just enough to get the data in a usable form.

There are three columns, "Unnamed: 0", "subject" and "activity", that are categorical and/or not useful as predictors.

1. The 'activity' column contains the targets to be used for classification (activity categories). Extract that into a separate variable.
Hint:
```python
truth_har = X_raw['activity']
```
Take a look at the distribution of the activity class labels.

2. Do a similar analysis for the 'subject' column.

3. Remove the 3 columns 'Unnamed: 0', 'subject', and 'activity' from X_raw

In [None]:
# Interesting non-numerical variables to check
subjects = X_raw['subject']
truth_har = X_raw['activity']

In [None]:
sbn.countplot(x=truth_har)

In [None]:
sbn.countplot(x=subjects)

In [None]:
X_raw = X_raw.drop(['Unnamed: 0','subject','activity'], axis=1)

In [None]:
X_raw.describe()

## Build a RF Classifier as a Black Box

In [None]:
import sklearn.ensemble as ens
from sklearn.model_selection import train_test_split

In [None]:
# First, split the data into training and test sets
shuffle_seed = 31459
test_slice = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_raw,truth_har, test_size=test_slice, random_state=shuffle_seed, shuffle=True)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Some initial parameters just to try it out
nfeatures = X_train.shape[1]
# Sample roughly sqrt(nfeatures) features per split; the np.int alias was removed in NumPy 1.24
nf_sample = int(np.round(np.sqrt(nfeatures)))
ntrees = 500
rf = ens.RandomForestClassifier(max_features=nf_sample, n_estimators=ntrees, oob_score=True, n_jobs=-1)

In [None]:
nf_sample

In [None]:
%%time
m1 = rf.fit(X_train,y_train)

In [None]:
m1.oob_score_
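
The out-of-bag (OOB) score estimates accuracy using, for each sample, only the trees that did not see it during training. It is often useful to compare it with accuracy on the held-out split. The sketch below does this on synthetic data from `make_classification` so that it runs on its own; the names `X_toy`, `forest`, and the parameter values are hypothetical, and with the lab data you would use `m1`, `X_test`, and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the activity data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", oob_score=True, random_state=0
)
forest.fit(X_tr, y_tr)

oob_accuracy = forest.oob_score_             # estimated from out-of-bag samples
y_pred = forest.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)  # measured on the held-out split
test_confusion = confusion_matrix(y_te, y_pred)  # rows: true, columns: predicted
```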

## RF Classifier Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
model = ens.RandomForestClassifier(random_state=123)
param_grid = { 
    'n_estimators': [50,100,250,500],
    'max_features': ['sqrt', 'log2'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
}

In [None]:
rf_gridCV = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)

In [None]:
%%time
rf_gridCV.fit(X_train,y_train)

In [None]:
rf_gridCV.best_params_

In [None]:
rf_gridCV.best_score_

In [None]:
rf_gridCV.cv_results_
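
The raw `cv_results_` dictionary is hard to read, so one option is to load it into a DataFrame and sort by rank. The sketch below uses a deliberately small grid on synthetic data so that it runs quickly on its own; `X_toy`, `search`, and the grid values are hypothetical, and the same pattern should apply to `rf_gridCV`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the activity data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=123),
    param_grid={"n_estimators": [10, 25], "max_features": ["sqrt", "log2"]},
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
columns = [
    "param_n_estimators",
    "param_max_features",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the differences between parameter settings are larger than the fold-to-fold variation.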