<img src="img.jpg">

# Workout Classification
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a [Fitbit](http://www.fitbit.com), [Nike FuelBand](http://www.nike.com/us/en_us/c/nikeplus-fuelband), or [Jawbone Up](https://jawbone.com/up). These types of devices are part of the "quantified self" movement, a group of enthusiasts who regularly take measurements about themselves to improve their health or to find patterns in their behavior. Unfortunately, these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting them.

Usually, people quantify how much exercise they do but not how well they do it. The objective of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell to classify whether an exercise is performed correctly.

This project is structured as follows:
1. [Understanding the Data](#Understanding_the_Data)
    1. [Descriptive and Exploratory Analysis](#Descriptive_and_Exploratory_Analysis)
    2. [Train and Validation Dataset](#Train_and_Validation_Dataset)
2. [Feature Selection/Importance](#Feature_Selection_Importance)
3. [Classification Model](#Classification_Model)
4. [Conclusions and Remarks](#Conclusions_and_Remarks)

## Understanding the Data
<a id='Understanding_the_Data'></a>

This project is possible thanks to the data available [here](http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises). The owners of this dataset have also published a [paper](http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf) that you can check out.

The experiment was designed as follows:
1. The exercises were performed by 6 young (20-28 years old) healthy participants.
2. Each participant did 10 repetitions of "Unilateral Dumbbell Biceps Curl" in 5 different fashions:
    1. _Class A_: Exactly according to the specification (high-quality performance).
    2. _Class B_: Throwing the elbows to the front (mistake).
    3. _Class C_: Lifting the dumbbell only halfway (mistake).
    4. _Class D_: Lowering the dumbbell only halfway (mistake).
    5. _Class E_: Throwing the hips to the front (mistake).
3. Four sensors were used to collect the data (in the glove, armband, lumbar belt, and dumbbell), as shown in the figure below (inspired by the schema proposed by the authors [here](http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf)).
4. Each sensor acts as the origin of a Euclidean coordinate system, so for each sensor it was possible to record its [intrinsic rotations](https://en.wikipedia.org/wiki/Euler_angles) (yaw, pitch, and roll), in addition to the [gyroscope](https://en.wikipedia.org/wiki/Gyroscope), [accelerometer](https://en.wikipedia.org/wiki/Accelerometer), and [magnetometer](https://en.wikipedia.org/wiki/Magnetometer) values for each axis of the coordinate system.
    
<img src="workout_body.png">

More details about the data can be found [here](http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises).
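
To make the yaw, pitch, and roll readings more concrete, the sketch below composes the three rotations into a single rotation matrix. It assumes the common intrinsic z-y'-x'' convention (yaw, then pitch, then roll) with angles in degrees. The dataset description quoted here does not state which convention the sensors use, so the function `rotation_from_euler` and the example angles are hypothetical illustrations only.

```python
import numpy as np

def rotation_from_euler(yaw_deg, pitch_deg, roll_deg):
    """Rotation matrix for intrinsic z-y'-x'' rotations (assumed convention)."""
    yaw, pitch, roll = np.radians([yaw_deg, pitch_deg, roll_deg])
    # Rotation about the z axis (yaw)
    rz = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0, 0.0, 1.0]])
    # Rotation about the y axis (pitch)
    ry = np.array([[np.cos(pitch), 0.0, np.sin(pitch)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(pitch), 0.0, np.cos(pitch)]])
    # Rotation about the x axis (roll)
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(roll), -np.sin(roll)],
                   [0.0, np.sin(roll),  np.cos(roll)]])
    # Intrinsic z-y'-x'' rotations correspond to the product Rz @ Ry @ Rx
    return rz @ ry @ rx

# Example angles in degrees (illustrative values)
R = rotation_from_euler(yaw_deg=-82.8, pitch_deg=41.6, roll_deg=3.7)
print(np.round(R, 3))
```

Whatever the convention, the result should be a proper rotation: `R @ R.T` is the identity and the determinant is 1.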

In [14]:
# Load Libraries
import pandas as pd
import numpy as np

In [15]:
# Load Dataset
curl_variation=pd.read_csv('WearableComputing_weight_lifting_exercises_biceps_curl_variations.csv',low_memory=False)
print('The dataset is composed by {} instances and {} features.'.format(
    curl_variation.shape[0],curl_variation.shape[1]))

The dataset is composed by 39242 instances and 159 features.


In [16]:
# Display the head of our data frame
pd.set_option('display.max_columns',None)
curl_variation.head()

Unnamed: 0,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window,num_window,roll_belt,pitch_belt,yaw_belt,total_accel_belt,kurtosis_roll_belt,kurtosis_picth_belt,kurtosis_yaw_belt,skewness_roll_belt,skewness_roll_belt.1,skewness_yaw_belt,max_roll_belt,max_picth_belt,max_yaw_belt,min_roll_belt,min_pitch_belt,min_yaw_belt,amplitude_roll_belt,amplitude_pitch_belt,amplitude_yaw_belt,var_total_accel_belt,avg_roll_belt,stddev_roll_belt,var_roll_belt,avg_pitch_belt,stddev_pitch_belt,var_pitch_belt,avg_yaw_belt,stddev_yaw_belt,var_yaw_belt,gyros_belt_x,gyros_belt_y,gyros_belt_z,accel_belt_x,accel_belt_y,accel_belt_z,magnet_belt_x,magnet_belt_y,magnet_belt_z,roll_arm,pitch_arm,yaw_arm,total_accel_arm,var_accel_arm,avg_roll_arm,stddev_roll_arm,var_roll_arm,avg_pitch_arm,stddev_pitch_arm,var_pitch_arm,avg_yaw_arm,stddev_yaw_arm,var_yaw_arm,gyros_arm_x,gyros_arm_y,gyros_arm_z,accel_arm_x,accel_arm_y,accel_arm_z,magnet_arm_x,magnet_arm_y,magnet_arm_z,kurtosis_roll_arm,kurtosis_picth_arm,kurtosis_yaw_arm,skewness_roll_arm,skewness_pitch_arm,skewness_yaw_arm,max_roll_arm,max_picth_arm,max_yaw_arm,min_roll_arm,min_pitch_arm,min_yaw_arm,amplitude_roll_arm,amplitude_pitch_arm,amplitude_yaw_arm,roll_dumbbell,pitch_dumbbell,yaw_dumbbell,kurtosis_roll_dumbbell,kurtosis_picth_dumbbell,kurtosis_yaw_dumbbell,skewness_roll_dumbbell,skewness_pitch_dumbbell,skewness_yaw_dumbbell,max_roll_dumbbell,max_picth_dumbbell,max_yaw_dumbbell,min_roll_dumbbell,min_pitch_dumbbell,min_yaw_dumbbell,amplitude_roll_dumbbell,amplitude_pitch_dumbbell,amplitude_yaw_dumbbell,total_accel_dumbbell,var_accel_dumbbell,avg_roll_dumbbell,stddev_roll_dumbbell,var_roll_dumbbell,avg_pitch_dumbbell,stddev_pitch_dumbbell,var_pitch_dumbbell,avg_yaw_dumbbell,stddev_yaw_dumbbell,var_yaw_dumbbell,gyros_dumbbell_x,gyros_dumbbell_y,gyros_dumbbell_z,accel_dumbbell_x,accel_dumbbell_y,accel_dumbbell_z,magnet_dumbbell_x,magnet_dumbbell_y,magnet_dumbbell_z,roll_forearm,pitch_forearm,yaw_forearm,kurtosis_roll_forearm,kurtosis_picth_forearm,kurtosis_yaw_forearm,skewness_roll_forearm,skewness_pitch_forearm,skewness_yaw_forearm,max_roll_forearm,max_picth_forearm,max_yaw_forearm,min_roll_forearm,min_pitch_forearm,min_yaw_forearm,amplitude_roll_forearm,amplitude_pitch_forearm,amplitude_yaw_forearm,total_accel_forearm,var_accel_forearm,avg_roll_forearm,stddev_roll_forearm,var_roll_forearm,avg_pitch_forearm,stddev_pitch_forearm,var_pitch_forearm,avg_yaw_forearm,stddev_yaw_forearm,var_yaw_forearm,gyros_forearm_x,gyros_forearm_y,gyros_forearm_z,accel_forearm_x,accel_forearm_y,accel_forearm_z,magnet_forearm_x,magnet_forearm_y,magnet_forearm_z,classe
0,eurico,1322489729,34670,28/11/2011 14:15,no,1,3.7,41.6,-82.8,3,,,,,,,,,,,,,,,,,,,,,,,,,,2.02,0.18,0.02,-3,-18,22,387,525,-267,132.0,-43.7,-53.6,38,,,,,,,,,,,2.65,-0.61,-0.02,143,30,-346,556,-205,-374,,,,,,,,,,,,,,,,51.23554,11.698847,104.264727,,,,,,,,,,,,,,,,4,,,,,,,,,,,-0.31,0.16,0.08,5,21,37,-471.0,191.0,277.0,-111.0,26.5,138.0,,,,,,,,,,,,,,,,30,,,,,,,,,,,-0.05,-0.37,-0.43,-170.0,155.0,184,-1160.0,1400.0,-876.0,E
1,eurico,1322489729,62641,28/11/2011 14:15,no,1,3.66,42.8,-82.5,2,,,,,,,,,,,,,,,,,,,,,,,,,,1.96,0.14,0.05,-2,-13,16,405,512,-254,129.0,-45.3,-49.0,38,,,,,,,,,,,2.79,-0.64,-0.11,146,35,-339,599,-206,-335,,,,,,,,,,,,,,,,55.824418,9.645819,100.228053,,,,,,,,,,,,,,,,4,,,,,,,,,,,-0.31,0.14,0.07,4,22,35,-472.0,184.0,281.0,-112.0,26.2,138.0,,,,,,,,,,,,,,,,31,,,,,,,,,,,-0.06,-0.37,-0.59,-178.0,164.0,182,-1150.0,1410.0,-871.0,E
2,eurico,1322489729,70653,28/11/2011 14:15,no,1,3.58,43.7,-82.3,1,,,,,,,,,,,,,,,,,,,,,,,,,,1.88,0.08,0.05,-2,-6,8,409,511,-244,125.0,-46.8,-43.7,35,,,,,,,,,,,2.91,-0.69,-0.15,156,44,-307,613,-198,-319,,,,,,,,,,,,,,,,55.469831,6.875244,101.084106,,,,,,,,,,,,,,,,4,,,,,,,,,,,-0.31,0.16,0.05,3,23,37,-468.0,190.0,275.0,-114.0,26.0,137.0,,,,,,,,,,,,,,,,32,,,,,,,,,,,-0.05,-0.27,-0.72,-182.0,172.0,185,-1130.0,1400.0,-863.0,E
3,eurico,1322489729,82654,28/11/2011 14:15,no,1,3.56,44.4,-82.1,1,,,,,,,,,,,,,,,,,,,,,,,,,,1.8,0.03,0.08,-6,-5,7,422,513,-221,120.0,-48.1,-38.1,35,,,,,,,,,,,3.08,-0.72,-0.23,158,52,-305,646,-186,-268,,,,,,,,,,,,,,,,55.94486,11.079297,99.784556,,,,,,,,,,,,,,,,5,,,,,,,,,,,-0.31,0.16,0.07,5,24,38,-469.0,184.0,285.0,-115.0,25.8,137.0,,,,,,,,,,,,,,,,33,,,,,,,,,,,0.02,-0.24,-0.79,-185.0,182.0,188,-1120.0,1400.0,-855.0,E
4,eurico,1322489729,90637,28/11/2011 14:15,no,1,3.57,45.1,-81.9,1,,,,,,,,,,,,,,,,,,,,,,,,,,1.77,0.0,0.13,-4,-9,0,418,508,-208,115.0,-49.1,-31.7,34,,,,,,,,,,,3.2,-0.77,-0.25,163,55,-288,670,-175,-241,,,,,,,,,,,,,,,,55.211739,11.426833,100.422583,,,,,,,,,,,,,,,,4,,,,,,,,,,,-0.31,0.14,0.07,5,23,37,-468.0,189.0,292.0,-117.0,25.5,137.0,,,,,,,,,,,,,,,,34,,,,,,,,,,,0.08,-0.27,-0.82,-188.0,195.0,188,-1100.0,1400.0,-843.0,E


If you explore the table above, you can see that many variables have no assigned values, at least in the first 5 instances. Therefore, let's compute the fraction of NaN values for each feature in our dataset and drop the variables with more than 40% missing values.
> This threshold was chosen because even the most robust estimators (for example, the median) have a breakdown point of 50%; that is, at least 50% of the data in the distribution must be available to make relatively accurate estimates. If we are going to impute the missing values, we need enough information in the variable; otherwise, it is better to drop the variable.
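
The 50% figure mentioned above can be illustrated numerically. The following sketch uses synthetic data (not the workout dataset) and replaces a growing fraction of a sample with an extreme value: the median barely moves while fewer than half of the values are corrupted, whereas the mean is distorted immediately. The names `clean` and `contaminate` and the outlier value are hypothetical.

```python
import numpy as np

# Synthetic sample centered at 10 (illustrative, unrelated to the sensors)
rng = np.random.default_rng(0)
clean = rng.normal(loc=10.0, scale=1.0, size=1001)

def contaminate(values, fraction, outlier=1e6):
    """Return a copy with the given fraction of values replaced by an outlier."""
    corrupted = values.copy()
    n_bad = int(fraction * len(values))
    corrupted[:n_bad] = outlier
    return corrupted

for fraction in (0.0, 0.2, 0.4, 0.6):
    sample = contaminate(clean, fraction)
    print(f'{fraction:.0%} corrupted: mean={np.mean(sample):.1f}, '
          f'median={np.median(sample):.1f}')
```

Once more than half of the values are corrupted, the median follows the outliers as well, which is the intuition behind keeping the missing-value threshold safely below 50%.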

In [17]:
# Compute the fraction of missing values per variable and flag the columns with more than 40% missing
pd.set_option('display.max_rows',None)
columns_out=curl_variation.columns[(curl_variation.isna().sum()/len(curl_variation)>0.4).values]
print('The {}% of the variables have more than 40% of the observations as NaN.'.format(
    round(100*len(columns_out)/curl_variation.shape[1],2)))

The 62.89% of the variables have more than 40% of the observations as NaN.


In [18]:
# Drop the columns with more than 40% of missing data
curl_variation.drop(columns=columns_out,inplace=True)
print('Now our data is composed by {} observations and {} variables'.format(
    curl_variation.shape[0],curl_variation.shape[1]))
curl_variation.head()

Now our data is composed by 39242 observations and 59 variables


Unnamed: 0,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window,num_window,roll_belt,pitch_belt,yaw_belt,total_accel_belt,gyros_belt_x,gyros_belt_y,gyros_belt_z,accel_belt_x,accel_belt_y,accel_belt_z,magnet_belt_x,magnet_belt_y,magnet_belt_z,roll_arm,pitch_arm,yaw_arm,total_accel_arm,gyros_arm_x,gyros_arm_y,gyros_arm_z,accel_arm_x,accel_arm_y,accel_arm_z,magnet_arm_x,magnet_arm_y,magnet_arm_z,roll_dumbbell,pitch_dumbbell,yaw_dumbbell,total_accel_dumbbell,gyros_dumbbell_x,gyros_dumbbell_y,gyros_dumbbell_z,accel_dumbbell_x,accel_dumbbell_y,accel_dumbbell_z,magnet_dumbbell_x,magnet_dumbbell_y,magnet_dumbbell_z,roll_forearm,pitch_forearm,yaw_forearm,total_accel_forearm,gyros_forearm_x,gyros_forearm_y,gyros_forearm_z,accel_forearm_x,accel_forearm_y,accel_forearm_z,magnet_forearm_x,magnet_forearm_y,magnet_forearm_z,classe
0,eurico,1322489729,34670,28/11/2011 14:15,no,1,3.7,41.6,-82.8,3,2.02,0.18,0.02,-3,-18,22,387,525,-267,132.0,-43.7,-53.6,38,2.65,-0.61,-0.02,143,30,-346,556,-205,-374,51.23554,11.698847,104.264727,4,-0.31,0.16,0.08,5,21,37,-471.0,191.0,277.0,-111.0,26.5,138.0,30,-0.05,-0.37,-0.43,-170.0,155.0,184,-1160.0,1400.0,-876.0,E
1,eurico,1322489729,62641,28/11/2011 14:15,no,1,3.66,42.8,-82.5,2,1.96,0.14,0.05,-2,-13,16,405,512,-254,129.0,-45.3,-49.0,38,2.79,-0.64,-0.11,146,35,-339,599,-206,-335,55.824418,9.645819,100.228053,4,-0.31,0.14,0.07,4,22,35,-472.0,184.0,281.0,-112.0,26.2,138.0,31,-0.06,-0.37,-0.59,-178.0,164.0,182,-1150.0,1410.0,-871.0,E
2,eurico,1322489729,70653,28/11/2011 14:15,no,1,3.58,43.7,-82.3,1,1.88,0.08,0.05,-2,-6,8,409,511,-244,125.0,-46.8,-43.7,35,2.91,-0.69,-0.15,156,44,-307,613,-198,-319,55.469831,6.875244,101.084106,4,-0.31,0.16,0.05,3,23,37,-468.0,190.0,275.0,-114.0,26.0,137.0,32,-0.05,-0.27,-0.72,-182.0,172.0,185,-1130.0,1400.0,-863.0,E
3,eurico,1322489729,82654,28/11/2011 14:15,no,1,3.56,44.4,-82.1,1,1.8,0.03,0.08,-6,-5,7,422,513,-221,120.0,-48.1,-38.1,35,3.08,-0.72,-0.23,158,52,-305,646,-186,-268,55.94486,11.079297,99.784556,5,-0.31,0.16,0.07,5,24,38,-469.0,184.0,285.0,-115.0,25.8,137.0,33,0.02,-0.24,-0.79,-185.0,182.0,188,-1120.0,1400.0,-855.0,E
4,eurico,1322489729,90637,28/11/2011 14:15,no,1,3.57,45.1,-81.9,1,1.77,0.0,0.13,-4,-9,0,418,508,-208,115.0,-49.1,-31.7,34,3.2,-0.77,-0.25,163,55,-288,670,-175,-241,55.211739,11.426833,100.422583,4,-0.31,0.14,0.07,5,23,37,-468.0,189.0,292.0,-117.0,25.5,137.0,34,0.08,-0.27,-0.82,-188.0,195.0,188,-1100.0,1400.0,-843.0,E


In [19]:
# Check whether any remaining variable still has NaN values
var_with_nan=curl_variation.isna().sum()[curl_variation.isna().sum()!=0]
print('The remain variables with not assigned value are {} with {} missing values'.format(var_with_nan.index,
                                                                                        var_with_nan.values))

The remain variables with not assigned value are Index(['roll_dumbbell'], dtype='object') with [1] missing values


This piece of information is important and we need to keep it in mind, but for the moment let's look at the data types of our variables and convert the features to the correct ones. After that, we can design a strategy to deal with this missing value.
> Because we have only one missing value, I don't think that dropping this instance will have an impact on the classification model, but I am curious about the class this missing value belongs to and the number of instances we have in that class.

If you print the data type of each variable (use `curl_variation.dtypes`), you can see that there are only 3 data types in our dataset (`int`, `float`, `object`). The code below prints the variables that belong to each type.

In [20]:
# Print out the variables data type
dictionary={types: curl_variation.select_dtypes(types).columns for types in ['object','int','float']}
dictionary

{'object': Index(['user_name', 'cvtd_timestamp', 'new_window', 'classe'], dtype='object'),
 'int': Index(['raw_timestamp_part_1', 'raw_timestamp_part_2', 'num_window',
        'total_accel_belt', 'accel_belt_x', 'accel_belt_y', 'accel_belt_z',
        'magnet_belt_x', 'magnet_belt_y', 'magnet_belt_z', 'total_accel_arm',
        'accel_arm_x', 'accel_arm_y', 'accel_arm_z', 'magnet_arm_x',
        'magnet_arm_y', 'magnet_arm_z', 'total_accel_dumbbell',
        'accel_dumbbell_x', 'accel_dumbbell_y', 'accel_dumbbell_z',
        'total_accel_forearm', 'accel_forearm_z'],
       dtype='object'),
 'float': Index(['roll_belt', 'pitch_belt', 'yaw_belt', 'gyros_belt_x', 'gyros_belt_y',
        'gyros_belt_z', 'roll_arm', 'pitch_arm', 'yaw_arm', 'gyros_arm_x',
        'gyros_arm_y', 'gyros_arm_z', 'roll_dumbbell', 'pitch_dumbbell',
        'yaw_dumbbell', 'gyros_dumbbell_x', 'gyros_dumbbell_y',
        'gyros_dumbbell_z', 'magnet_dumbbell_x', 'magnet_dumbbell_y',
        'magnet_dumbbell_z', 'ro

The dictionary above shows that the variables in the `float` group are correctly typed, but some of the `object` variables are actually `datetime` values and others are categories (factor variables). The `int` group has a similar problem: some of its variables are timestamps. Let's convert these variables to the correct types.

In [21]:
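# Transform to datetime
# cvtd_timestamp is day-first text (e.g. 28/11/2011 14:15), so parse it with an explicit format
curl_variation['cvtd_timestamp']=pd.to_datetime(curl_variation['cvtd_timestamp'],format='%d/%m/%Y %H:%M')
# raw_timestamp_part_1 holds Unix seconds; without unit='s' pandas would read it as nanoseconds
curl_variation['raw_timestamp_part_1']=pd.to_datetime(curl_variation['raw_timestamp_part_1'],unit='s')
# raw_timestamp_part_2 is assumed to be the sub-second part in microseconds
curl_variation['raw_timestamp_part_2']=pd.to_datetime(curl_variation['raw_timestamp_part_2'],unit='us')
# Transform to category
categories=['user_name','new_window','classe']
curl_variation[categories]=curl_variation[categories].astype('category')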
dictionary={types: curl_variation.select_dtypes(types).columns for types in ['object','int','float','category']}
dictionary

{'object': Index([], dtype='object'),
 'int': Index(['num_window', 'total_accel_belt', 'accel_belt_x', 'accel_belt_y',
        'accel_belt_z', 'magnet_belt_x', 'magnet_belt_y', 'magnet_belt_z',
        'total_accel_arm', 'accel_arm_x', 'accel_arm_y', 'accel_arm_z',
        'magnet_arm_x', 'magnet_arm_y', 'magnet_arm_z', 'total_accel_dumbbell',
        'accel_dumbbell_x', 'accel_dumbbell_y', 'accel_dumbbell_z',
        'total_accel_forearm', 'accel_forearm_z'],
       dtype='object'),
 'float': Index(['roll_belt', 'pitch_belt', 'yaw_belt', 'gyros_belt_x', 'gyros_belt_y',
        'gyros_belt_z', 'roll_arm', 'pitch_arm', 'yaw_arm', 'gyros_arm_x',
        'gyros_arm_y', 'gyros_arm_z', 'roll_dumbbell', 'pitch_dumbbell',
        'yaw_dumbbell', 'gyros_dumbbell_x', 'gyros_dumbbell_y',
        'gyros_dumbbell_z', 'magnet_dumbbell_x', 'magnet_dumbbell_y',
        'magnet_dumbbell_z', 'roll_forearm', 'pitch_forearm', 'yaw_forearm',
        'gyros_forearm_x', 'gyros_forearm_y', 'gyros_forearm_z',

Now the variable types look as they should, and we can return to our missing value. The first step is to look at the class this missing value belongs to.

In [22]:
# Look at the class the missing value belongs to
curl_variation.loc[curl_variation['roll_dumbbell'].isna().values,'classe']

8136    A
Name: classe, dtype: category
Categories (5, object): [A, B, C, D, E]

As we can see, this missing value belongs to class A, which has 11159 instances (see below). As discussed before, the best option here is to drop this observation: given the amount of data in this class, one fewer observation will not have an impact on the model.

In [23]:
# Print the number of instances by classes
print(curl_variation['classe'].value_counts())
# Erase the instance with the missing value
curl_variation.dropna(inplace=True)

A    11159
B     7593
E     7214
C     6844
D     6432
Name: classe, dtype: int64


### Descriptive and Exploratory Analysis
<a id='Descriptive_and_Exploratory_Analysis'></a>

So far, we have significantly reduced the dimensionality of our dataset. Now it's time to describe and visualize the remaining data.

In [24]:
curl_variation.describe()

Unnamed: 0,num_window,roll_belt,pitch_belt,yaw_belt,total_accel_belt,gyros_belt_x,gyros_belt_y,gyros_belt_z,accel_belt_x,accel_belt_y,accel_belt_z,magnet_belt_x,magnet_belt_y,magnet_belt_z,roll_arm,pitch_arm,yaw_arm,total_accel_arm,gyros_arm_x,gyros_arm_y,gyros_arm_z,accel_arm_x,accel_arm_y,accel_arm_z,magnet_arm_x,magnet_arm_y,magnet_arm_z,roll_dumbbell,pitch_dumbbell,yaw_dumbbell,total_accel_dumbbell,gyros_dumbbell_x,gyros_dumbbell_y,gyros_dumbbell_z,accel_dumbbell_x,accel_dumbbell_y,accel_dumbbell_z,magnet_dumbbell_x,magnet_dumbbell_y,magnet_dumbbell_z,roll_forearm,pitch_forearm,yaw_forearm,total_accel_forearm,gyros_forearm_x,gyros_forearm_y,gyros_forearm_z,accel_forearm_x,accel_forearm_y,accel_forearm_z,magnet_forearm_x,magnet_forearm_y,magnet_forearm_z
count,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0,39241.0
mean,432.32861,64.926767,0.423194,-10.828182,11.378813,-0.006646,0.040103,-0.131532,-5.680487,30.458347,-73.348793,55.718993,593.863612,-345.204811,17.507481,-4.300552,-0.903398,25.503733,0.05218,-0.261827,0.272734,-61.760276,32.613134,-69.612803,187.122296,159.195077,310.707449,23.90899,-10.820181,1.298672,13.765297,0.167307,0.043658,-0.138068,-28.798068,52.970235,-38.46166,-326.930319,220.317583,44.90786,33.752846,11.077394,18.88012,34.676359,0.146396,0.095781,0.147624,-63.950659,162.983782,-55.498203,-316.632239,379.756021,395.001874
std,247.966716,62.673213,22.400663,94.983657,7.732245,0.207561,0.078789,0.239327,29.686717,28.651489,100.421833,64.612881,35.479939,64.731917,72.754601,30.651157,71.656906,10.465308,1.985491,0.848317,0.551043,182.275357,109.740131,134.067492,442.961375,201.029204,324.957551,69.962704,37.116855,82.466152,10.255085,1.101102,0.549453,1.632752,67.674351,80.854416,109.694685,341.522831,327.303826,139.833592,107.965046,28.203337,103.467807,10.088156,1.827199,3.735154,1.337716,180.941432,199.58229,137.692888,345.894541,507.060197,368.099852
min,1.0,-28.9,-56.2,-180.0,0.0,-1.06,-0.64,-1.57,-120.0,-71.0,-280.0,-55.0,353.0,-627.0,-180.0,-89.1,-180.0,0.0,-6.37,-3.48,-2.33,-428.0,-318.0,-640.0,-584.0,-392.0,-597.0,-154.139304,-149.593648,-153.713729,0.0,-204.0,-2.12,-2.38,-419.0,-189.0,-334.0,-643.0,-3600.0,-262.0,-180.0,-72.5,-180.0,0.0,-339.0,-7.03,-52.0,-498.0,-690.0,-458.0,-1280.0,-906.0,-973.0
25%,222.0,1.1,1.83,-88.2,3.0,-0.03,0.0,-0.2,-21.0,3.0,-162.0,9.0,582.0,-375.0,-32.5,-25.5,-43.6,17.0,-1.3,-0.8,-0.07,-243.0,-53.0,-140.0,-302.0,-2.0,141.0,-17.936064,-41.129108,-77.556958,4.0,-0.03,-0.14,-0.31,-51.0,-8.0,-142.0,-535.0,230.0,-47.0,-2.0,0.0,-70.1,29.0,-0.22,-1.45,-0.18,-181.0,53.0,-181.0,-620.0,8.0,200.0
50%,428.0,114.0,5.32,-11.9,17.0,0.03,0.02,-0.11,-15.0,37.0,-153.0,35.0,601.0,-320.0,0.0,0.0,0.0,27.0,0.1,-0.26,0.25,-47.0,13.0,-45.0,277.0,207.0,448.0,48.32312,-21.186442,-4.73558,10.0,0.13,0.03,-0.13,-8.0,42.0,-2.0,-478.0,311.0,13.0,20.2,9.69,0.0,36.0,0.05,0.03,0.08,-58.0,200.0,-41.0,-385.0,588.0,512.0
75%,647.0,123.0,15.5,12.5,18.0,0.11,0.11,-0.02,-5.0,61.0,27.0,59.0,610.0,-306.0,77.2,11.7,46.3,33.0,1.57,0.14,0.72,83.0,139.0,24.0,633.0,324.0,546.0,67.620276,17.883279,79.602933,20.0,0.35,0.21,0.03,11.0,111.0,38.0,-303.0,391.0,96.0,140.0,28.9,110.0,41.0,0.56,1.65,0.49,74.0,312.0,25.0,-77.0,736.0,652.0
max,864.0,162.0,60.3,180.0,30.0,2.22,0.64,1.62,106.0,164.0,108.0,485.0,675.0,293.0,180.0,88.5,180.0,67.0,4.9,2.84,3.02,437.0,326.0,292.0,785.0,583.0,703.0,154.508902,149.402444,154.952294,58.0,2.39,52.0,317.0,235.0,315.0,319.0,592.0,639.0,453.0,180.0,89.8,180.0,108.0,4.77,516.0,231.0,479.0,923.0,291.0,672.0,1480.0,1090.0


### Train and Validation Dataset
<a id='Train_and_Validation_Dataset'></a>

For background on the difference between training, validation, and test datasets, see [this article](https://machinelearningmastery.com/difference-test-validation-datasets/).
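
As a starting point, a stratified hold-out split could look like the sketch below. It uses a small synthetic frame in place of `curl_variation`; the column names other than `classe`, the 80/20 ratio, and the random seed are assumptions for illustration, not choices made in this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for curl_variation: two numeric features and the class label
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'feature_1': rng.normal(size=500),
    'feature_2': rng.normal(size=500),
    'classe': rng.choice(list('ABCDE'), size=500, p=[0.3, 0.2, 0.2, 0.15, 0.15]),
})

X = toy.drop(columns='classe')
y = toy['classe']

# stratify=y keeps the class proportions nearly identical in both subsets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_train.value_counts(normalize=True).round(2))
print(y_valid.value_counts(normalize=True).round(2))
```

Because consecutive rows within the same `num_window` are likely to be highly correlated, a group-aware split (for example scikit-learn's `GroupShuffleSplit`) may give a more honest validation estimate than a purely random one.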

## Feature Selection/Importance
<a id='Feature_Selection_Importance'></a>
* Remember the scale of the variables!
https://machinelearningmastery.com/calculate-feature-importance-with-python/

In [11]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]
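
Logistic regression coefficients are only comparable when the features share a scale, which is the reminder at the top of this section. One possible way to handle this, sketched below on the same kind of synthetic data, is to standardize inside a pipeline before reading the coefficients. The factors in `scales` are arbitrary and only there to mimic sensors that report in different units.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)
# Put the columns on very different scales (arbitrary factors, for illustration)
scales = np.logspace(-2, 2, num=X.shape[1])
X_mixed_units = X * scales

# Standardize first so that every coefficient refers to a unit-variance feature
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_mixed_units, y)

importance = pipeline.named_steps['logisticregression'].coef_[0]
for index, value in enumerate(importance):
    print(f'Feature {index}: {value:.3f}')
```

With the scaler in place, the magnitude of each coefficient no longer depends on the unit in which the corresponding feature was recorded.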

In [12]:
# permutation feature importance with knn for regression
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.inspection import permutation_importance
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the model
model = KNeighborsRegressor()
# fit the model
model.fit(X, y)
# perform permutation importance
results = permutation_importance(model, X, y, scoring='neg_mean_squared_error')
# get importance
importance = results.importances_mean
importance

array([ 124.40076263,  317.11382479,  134.91476968,   67.98491781,
       9545.27727965, 7839.6205412 ,  895.11855176,  139.34603338,
         81.52569288,   94.45664245])
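
Since the goal of this project is classification, the same idea carries over to a classifier scored by accuracy. The sketch below is an assumption about how that might look, again on synthetic data, with a held-out split and standardized inputs because k-nearest neighbors is distance-based. None of the parameter values come from the project.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 5-class problem standing in for classes A to E
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=5, n_clusters_per_class=1,
                           random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Scale, then classify with k-nearest neighbors
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)

# Importance = mean drop in validation accuracy when one feature is shuffled
results = permutation_importance(model, X_valid, y_valid, scoring='accuracy',
                                 n_repeats=10, random_state=1)
importance = results.importances_mean
for index, value in enumerate(importance):
    print(f'Feature {index}: {value:.3f}')
```

Measuring the importance on held-out data rather than on the training set avoids rewarding features that the model has merely memorized.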

## Classification Model
<a id='Classification_Model'></a>

## Conclusions and Remarks
<a id='Conclusions_and_Remarks'></a>