## Introduction

The goal of this task is to determine the explained variance of an improved random forest model. The new model should use a reduced number of chassis suspension features, chosen from the feature importance output of the original model.

In [1]:
#import most of the Python packages needed for this data analysis
import numpy as np
import sklearn as sk
import sklearn.datasets as skd
import sklearn.ensemble as ske
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Data Wrangling

The data needs to be organized into a pandas dataframe for visual clarity and ease of use. Each row (index) represents one entire field pass of the same machine configuration. The following key performance indicators are based on average values and are included in this initial overview dataframe:

- Chassis roll
- Chassis roll rate
- Chassis pitch
- Chassis pitch rate
- Chassis vertical acceleration
- Chassis longitudinal acceleration
- Chassis lateral acceleration

The model will attempt to predict boom height performance. Because this metric is a continuous value, the task is a regression rather than a classification. The metric for boom height performance is contained within the dataframe as "Boom_L2_R2_Average_std":

- Boom_L2_R2_Average_std

In [6]:
#reads the data into a pandas dataframe and displays the first 10 rows
#this dataset can be found on: https://github.com/badams97/Sprayer_Chassis_Features

initial_df = pd.read_csv('Chassis_Features_Mean_1.csv')
initial_df.head(10)

| Unnamed: 0 | Chassis_Roll_mean | Chassis_Roll_Rate_mean | Chassis_Pitch_mean | Chassis_Pitch_Rate_mean | Chassis_Vert_Accel_mean | Chassis_Long_Accel_mean | Chassis_Lat_Accel_mean | Boom_L2_std | Boom_R2_std | Boom_L2_R2_Average_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.284831 | 0.196373 | 1.515955 | -0.480477 | 9.813797 | 0.153772 | 0.466084 | 168.179845 | 165.326849 | 166.753347 |
| 1 | -2.35701 | 0.252221 | -0.655178 | -0.576797 | 9.810353 | 0.107525 | -0.366866 | 173.627026 | 215.386613 | 194.506819 |
| 2 | 0.621063 | 0.282433 | 1.346402 | -0.502552 | 9.8249 | 0.190726 | 0.249052 | 188.293229 | 234.054875 | 211.174052 |
| 3 | 2.492119 | 0.227887 | 0.00724 | -0.53962 | 9.807254 | 0.074085 | 0.420933 | 176.257845 | 208.258146 | 192.257995 |
| 4 | -2.860568 | 0.237381 | 1.472113 | -0.559221 | 9.807675 | 0.170151 | -0.312717 | 168.987986 | 216.126756 | 192.557371 |
| 5 | 3.440322 | 0.260563 | -0.288552 | -0.552804 | 9.805874 | 0.064734 | 0.51834 | 217.550777 | 195.837967 | 206.694372 |
| 6 | 1.689145 | 0.156441 | 1.430515 | -0.47865 | 9.819644 | 0.13203 | 0.267781 | 200.433252 | 196.479785 | 198.456519 |
| 7 | -1.927387 | 0.226862 | 0.619088 | -0.502623 | 9.818433 | 0.204058 | -0.171645 | 190.650008 | 210.137701 | 200.393854 |
| 8 | 2.189631 | 0.290175 | 0.894358 | -0.456564 | 9.814669 | 0.112426 | 0.418233 | 293.422035 | 288.378679 | 290.900357 |
| 9 | -0.462547 | 0.384812 | 0.150088 | -0.590048 | 9.803622 | 0.117205 | -0.034005 | 284.657096 | 278.674454 | 281.665775 |
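
In the rows displayed above, "Boom_L2_R2_Average_std" appears to be the row-wise mean of "Boom_L2_std" and "Boom_R2_std". This is an observation from the printed values, not a documented definition of the column. A minimal check, using the first three rows copied by hand into a small dataframe (the name `sample` is just for illustration):

```python
import numpy as np
import pandas as pd

# First three rows copied from the head(10) output above
sample = pd.DataFrame({
    "Boom_L2_std": [168.179845, 173.627026, 188.293229],
    "Boom_R2_std": [165.326849, 215.386613, 234.054875],
    "Boom_L2_R2_Average_std": [166.753347, 194.506819, 211.174052],
})

# Row-wise mean of the left and right boom columns
recomputed = sample[["Boom_L2_std", "Boom_R2_std"]].mean(axis=1)

# Compare against the stored target, allowing for rounding in the printed values
matches = np.allclose(recomputed, sample["Boom_L2_R2_Average_std"], rtol=0, atol=1e-5)
print(matches)
```

If the same relationship holds for the full file, "Boom_L2_std" and "Boom_R2_std" carry the target's information directly and would not belong among the input features.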


## Data Reduction

Select which data columns to eliminate based on their feature importance values from the original model.
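
One possible way to do this selection is sketched below on synthetic stand-in data, since the original model's importance values are not reproduced in this notebook. The column names (`feat_a` to `feat_d`, `target`), the 0.10 cutoff, and `n_estimators=100` are all assumptions made for illustration, not values taken from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: only feat_a and feat_b drive the target
demo_df = pd.DataFrame(rng.normal(size=(200, 4)),
                       columns=["feat_a", "feat_b", "feat_c", "feat_d"])
demo_df["target"] = (3.0 * demo_df["feat_a"] + 2.0 * demo_df["feat_b"]
                     + rng.normal(scale=0.1, size=200))

# Stand-in for the "original model": fit on all features
feature_cols = [c for c in demo_df.columns if c != "target"]
original_model = RandomForestRegressor(n_estimators=100, random_state=0)
original_model.fit(demo_df[feature_cols], demo_df["target"])

# Pair each importance value with its column name
importances = pd.Series(original_model.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))

# Keep only features at or above a cutoff (0.10 is a hypothetical choice)
threshold = 0.10
kept = importances[importances >= threshold].index.tolist()

# New dataframe with the kept features plus the target, then split the two apart
reduced_df = demo_df[kept + ["target"]]
X = reduced_df.drop(columns="target")
Y = reduced_df[["target"]]
```

The cutoff is a judgment call. Inspecting the sorted importance values before fixing a threshold is usually safer than picking a number in advance.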

In [7]:
#filter the data and create a new dataframe that contains only the important features and the target (i.e. "Boom_L2_R2_Average_std")



In [8]:
#separate the features from the target into two separate dataframes



## Initial Data Investigation

Perform a brief investigation of the data to extract information such as: (1) dataframe shape, (2) spread of the data, and (3) feature correlation.
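
The three checks, together with the seaborn heatmap used later in this section, can be sketched on a small synthetic dataframe. The data and the names `demo_df`, `feat_a`, `feat_b`, and `feat_c` are made up for illustration. The real notebook would use the reduced dataframe instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
base = rng.normal(size=100)

# Synthetic stand-in data with one correlated pair and one independent column
demo_df = pd.DataFrame({
    "feat_a": base,
    "feat_b": 2.0 * base + rng.normal(scale=0.5, size=100),  # correlated with feat_a
    "feat_c": rng.normal(size=100),                          # independent noise
})

print(demo_df.shape)            # (1) number of rows and columns
print(demo_df.isnull().sum())   # null count per column
print(demo_df.describe())       # (2) count, mean, std, min, quartiles, max

# (3) pairwise Pearson correlation between the columns
corr = demo_df.corr(method="pearson")

# Heatmap of the correlation matrix; fixing vmin/vmax keeps the color scale symmetric
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Pearson correlation (synthetic data)")
fig.tight_layout()
```

Strongly correlated feature pairs show up as saturated off-diagonal cells. Such pairs tend to share importance in a random forest, which is worth keeping in mind when reading importance values.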

In [None]:
_______.shape

In [None]:
#validating that no null values are present in the data
pd.isnull(_______).sum()

In [None]:
#quick look at the spread of the data for each column in the overall dataframe
________.describe()

In [None]:
#checking the correlation between the various features
corr = ___________.corr(method = "pearson")
corr

In [9]:
#produce a correlation heatmap for easy visualization using the seaborn package
import seaborn as sns



## Updated Random Forest Model

By removing misleading data and noise, the new model should hopefully achieve a higher explained variance. Using fewer features also reduces the training time.
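
Whether dropping features actually raises the explained variance depends on the data, so it is worth measuring rather than assuming. The sketch below shows one way to compare a full and a reduced feature set on synthetic stand-in data. The helper `holdout_explained_variance`, the column names, `test_size=0.2`, and `n_estimators=100` are illustrative assumptions and are not the values intended for the blanks in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: two informative features and two pure-noise features
X_full = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=["feat_a", "feat_b", "noise_1", "noise_2"])
y = 3.0 * X_full["feat_a"] + 2.0 * X_full["feat_b"] + rng.normal(scale=0.5, size=300)

def holdout_explained_variance(X, y):
    """Fit a random forest on a train split and score it on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return explained_variance_score(y_test, model.predict(X_test))

# Same split and same seed for both runs, so only the feature set differs
score_full = holdout_explained_variance(X_full, y)
score_reduced = holdout_explained_variance(X_full[["feat_a", "feat_b"]], y)

print(f"all features:     {score_full:.3f}")
print(f"reduced features: {score_reduced:.3f}")
```

Keeping `random_state` fixed in both the split and the model makes the two scores directly comparable. A single split can still be noisy on a small dataset.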

In [10]:
from sklearn.model_selection import train_test_split

In [None]:
#splitting the data into training and test sets with a specified test size and random state
X_train, X_test, Y_train, Y_test = train_test_split(________, _________, test_size = ____, random_state = 0)

In [None]:
print(X_train.shape, Y_train.shape)

In [None]:
reg = ske.RandomForestRegressor(n_estimators = ______, random_state = 0)
reg.fit(X_train, Y_train)

In [None]:
#flatten Y_train into a 1-D array, the shape scikit-learn expects for a single target, then refit
Y_train = np.ravel(Y_train)
reg.fit(X_train, Y_train)

In [None]:
Y_pred = reg.predict(X_test)

In [None]:
#explained variance score
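
The explained variance score compares the variance of the prediction errors with the variance of the true values: 1 - Var(y - y_hat) / Var(y). A minimal sketch, using made-up example values rather than the sprayer data, shows the scikit-learn call next to the same quantity computed by hand:

```python
import numpy as np
from sklearn.metrics import explained_variance_score

# Small made-up example values (not taken from the sprayer dataset)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Library call: 1.0 is a perfect score, lower is worse
score = explained_variance_score(y_true, y_hat)

# Same quantity by hand: 1 - Var(y_true - y_hat) / Var(y_true)
manual = 1.0 - np.var(y_true - y_hat) / np.var(y_true)

print(score, manual)
```

Both values come out to roughly 0.957. In this notebook the equivalent call would take `Y_test` and `Y_pred` as its two arguments.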