# __Simple Machine Learning Project: A Walkthrough__
## __Table of Contents__:
### 0.) Dataset Description
### 1.) Data Cleaning
### 2.) Feature Engineering
### 3.) Feature Selection
### 4.) Model Performance

## __Dataset Description__:
For this walkthrough we'll be using the Concrete dataset found [here](https://www.kaggle.com/maajdl/yeh-concret-data). We'll be predicting the compressive strength of concrete given its age and its ingredients. Features include Water (kg), Cement (kg), Age (days), etc.
<blockquote>"Concrete is the most important material in civil engineering. Concrete compressive strength is a highly nonlinear function of age and ingredients." [maajdl](https://www.kaggle.com/maajdl)</blockquote>
### __Run the cell below to check out the structure of the dataset:__

In [None]:
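# Helper functions for this walkthrough live in the project's own module.
from exploratoryDataAnalysis import *
# loadData returns three values; only the third (the DataFrame) is used here.
_,_,df = loadData()
df.head(10)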

## __Data Cleaning__:
Let's have a look at the data! First let's check for any missing values.

In [None]:
df.isna().sum()

Thankfully, there are none. In the wild, our data will almost certainly contain missing values, which we'll either have to impute (replace with estimates) or drop from the data.
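To make the two options concrete, here is a minimal sketch on a small made-up DataFrame (the values and the `toy` name are hypothetical, not taken from the concrete dataset). It uses pandas' `dropna` and `fillna`; median imputation is just one common choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; not the concrete dataset.
toy = pd.DataFrame({
    "Cement": [540.0, np.nan, 332.5, 198.6],
    "Water": [162.0, 162.0, np.nan, 192.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute each gap with the median of its column.
imputed = toy.fillna(toy.median())

print(len(toy), len(dropped))      # rows before and after dropping
print(imputed.isna().sum().sum())  # missing values left after imputing
```

Dropping is simplest but throws away data; imputing keeps every row at the cost of introducing estimated values.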
Next, we'll look at some useful statistics.

In [None]:
summaryStatistics(df)

We can glean from the datatypes portion that we are only dealing with numerical data. But the summary statistics are hard to interpret on their own, so we'll look at the distributions of our features to enrich our understanding.

In [None]:
featureDistributions(df)
boxPlots(df)

Cement, Slag, Ash and others look right skewed (most values sit at the low end, with a long tail toward larger values), and the dots beyond the whiskers of the box plots are outliers. The distributions of our features are not Gaussian (normal)-like. Many Machine Learning algorithms tend to perform better when the features are roughly normally distributed, and the method we will use to detect outliers relies on this assumption as well. So we'll go ahead and transform our features to make them closer to normal by taking their square root. The log transform is also very popular, but it doesn't work when features contain 0's, because the log of 0 is undefined.

In [None]:
from featureEngineering import *
df = transformFeatures(df)
df.head(10)
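The `transformFeatures` helper is not shown in this walkthrough, so the following is only a rough illustration of the idea on synthetic data. It assumes a made-up right-skewed feature with many zeros (loosely like Slag or Ash), compares the skewness before and after a square root, and shows why a plain log fails on zeros. `np.log1p` is mentioned as a common workaround, not as something the walkthrough uses.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature with zeros; not the real data.
rng = np.random.default_rng(0)
values = rng.exponential(scale=50.0, size=1000)
values[:200] = 0.0  # pretend many mixes contain none of this ingredient
feature = pd.Series(values)

# Skewness near 0 means roughly symmetric; large positive means right skewed.
skew_before = feature.skew()
skew_after = np.sqrt(feature).skew()
print(skew_before, skew_after)

# log(0) is -inf, so a plain log transform breaks on zero entries.
with np.errstate(divide="ignore"):
    log_has_inf = np.isinf(np.log(feature)).any()

# log1p(x) = log(1 + x) stays finite at zero.
log1p_is_finite = np.isfinite(np.log1p(feature)).all()
print(log_has_inf, log1p_is_finite)
```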

Now we'll wrap up the data cleaning by removing outliers. We'll do that by taking the Z-score of each feature: (x - mean) / standardDeviation. The Z-score tells us how many standard deviations a value is from the mean. If a value is too far away, in our case more than 2.5 standard deviations, we'll remove that row from our dataset.

In [None]:
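print(len(df))  # number of rows before removing outliers
# Only the third return value (the cleaned DataFrame) is kept.
_,_,df = findAndRemoveOutliers(df)
print(len(df))  # number of rows after removing outliers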

As you can see, we've dropped a few rows from our dataset.
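The internals of `findAndRemoveOutliers` are not shown here, so the following is only a sketch of Z-score filtering under stated assumptions: a row is dropped if any feature lies more than 2.5 standard deviations from its column mean in either direction, and the standard deviation is the sample one that pandas computes by default. The function name `remove_outliers_zscore` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def remove_outliers_zscore(frame: pd.DataFrame, threshold: float = 2.5) -> pd.DataFrame:
    """Keep rows whose features all lie within `threshold` std devs of the mean."""
    # Column-wise Z-score: (x - mean) / standardDeviation
    z = (frame - frame.mean()) / frame.std()
    # A row survives only if every one of its features is within the threshold.
    keep = (z.abs() <= threshold).all(axis=1)
    return frame[keep]

# Hypothetical, roughly normal toy data with one planted outlier.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(0.0, 1.0, size=200),
    "b": rng.normal(10.0, 2.0, size=200),
})
toy.loc[0, "a"] = 25.0  # far from the rest of column "a"

cleaned = remove_outliers_zscore(toy)
print(len(toy), len(cleaned))
```

Note that the mean and standard deviation are themselves pulled by outliers, which is one reason this rule works best when the features are already close to normal.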
## __Feature Engineering__:
Now that our data is clean, we can begin Feature Engineering (technically, our sqrt transform was feature engineering). In our case, this will only involve feature creation. We will combine our features in interesting ways to create new features. In principle, some of these features would be hard for our ML algorithms to find without a little help.
We will take every product and ratio of every pair of features, and add them to our dataset as new features.


In [None]:
df = featureCreation(df)
df.columns.tolist()[15:30]
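Since `featureCreation` is not shown, here is one way the pairwise products and ratios could be built, as a sketch only. The function name `add_pairwise_features`, the column naming scheme, and the choice to compute each ratio in one direction (left / right) are all assumptions; the actual helper may differ. Ratios that divide by zero are marked as missing rather than left as infinities.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def add_pairwise_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Append the product and ratio of every unordered pair of columns."""
    out = frame.copy()
    for left, right in combinations(frame.columns, 2):
        out[f"{left}*{right}"] = frame[left] * frame[right]
        # Division by zero yields inf (or NaN for 0/0); treat both as missing.
        ratio = frame[left] / frame[right]
        out[f"{left}/{right}"] = ratio.replace([np.inf, -np.inf], np.nan)
    return out

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6],
    "Water": [162.0, 228.0, 192.0],
    "Age": [28.0, 270.0, 90.0],
})

expanded = add_pairwise_features(toy)
print(expanded.columns.tolist())
```

With n original features this adds n(n-1) new columns (one product and one ratio per pair), so the feature count grows quickly, which is part of why a feature selection step usually follows.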