<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/HP2%20Wage_per_hour.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo
!ls

In [None]:
# Install seaborn, used for the correlation heatmap below
!pip install seaborn

# **Can a person's hourly wage be predicted from a set of features?**

This model performs regression with a small neural network:<br>
>Multiple inputs<br>
>One float output

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Select TensorFlow 2.x
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import pathlib

import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns


from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

In [None]:
# Read in the data using pandas
dataset = pd.read_csv("hourly_wages.csv")
# Check that the data has been read in properly
dataset.head()


wage_per_hour --> the label<br>
all other columns --> features<br>

In [None]:
dataset.columns

In [None]:
dataset.isna().sum()

In [None]:
dataset["female"].value_counts()

In [None]:
dataset["age"].value_counts()

In [None]:
corr = dataset.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()


In [None]:
corr.columns.values

# **Split the dataset into training and test sets**

The fraction of rows used for training (frac in the cell below) is a hyperparameter that can be adjusted.<br>
Typical values range from 0.5 to 0.95.

In [None]:
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
print("done")
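
As a rough illustration of how the training fraction changes the split, the sketch below applies the same sample/drop pattern to a small made-up DataFrame. The `toy` data and the `split` helper are assumptions for illustration only and are not part of the wage dataset.

```python
import pandas as pd

# Hypothetical stand-in for the wage data: 20 rows, one feature and one label.
toy = pd.DataFrame({"feature": range(20), "label": range(100, 120)})

def split(df, frac, seed=0):
    """Return (train, test) using the same sample/drop pattern as above."""
    train = df.sample(frac=frac, random_state=seed)
    test = df.drop(train.index)
    return train, test

for frac in (0.5, 0.8, 0.95):
    train, test = split(toy, frac)
    # The two parts never overlap and together cover every row.
    assert set(train.index).isdisjoint(test.index)
    print(frac, len(train), len(test))
```

A larger fraction leaves more rows for training but fewer for judging how well the model generalizes.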

# **Separate the label from the features**

In [None]:
train_stats = train_dataset.describe()
train_stats.pop("wage_per_hour")
train_stats = train_stats.transpose()
train_stats

In [None]:
test_stats = test_dataset.describe()
test_stats.pop("wage_per_hour")
test_stats = test_stats.transpose()
test_stats

In [None]:
train_labels = train_dataset.pop('wage_per_hour')
test_labels = test_dataset.pop('wage_per_hour')
print("done")

# **Normalize the data**

In [None]:
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
print("done")
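
To see what this normalization does, the sketch below repeats the same steps on a tiny made-up table. The column names and values are assumptions for illustration. After normalization, each training column should have a mean near 0 and a standard deviation near 1, while the test rows are scaled with the training statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical features; names and values are made up for illustration.
toy_train = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0],
                          "education_yrs": [10.0, 12.0, 14.0, 16.0]})
toy_test = pd.DataFrame({"age": [35.0], "education_yrs": [13.0]})

# Statistics come from the training set only, as in norm() above.
stats = toy_train.describe().transpose()

def norm(x):
    return (x - stats["mean"]) / stats["std"]

normed_train = norm(toy_train)
normed_test = norm(toy_test)

print(normed_train.mean())  # per-column mean after normalization
print(normed_train.std())   # per-column standard deviation after normalization
print(normed_test)          # test row scaled with the training statistics
```

Reusing the training statistics for the test set keeps information about the test rows out of the preprocessing step.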

# **Create the model**
Tune the hyperparameters to improve the model performance. <br>
1. Number of nodes in each layer
2. Number and kinds of layers
3. Activation functions
4. The learning rate in the optimizer (0.0001 to 0.1)


In [None]:
inputs = len(train_dataset.keys())
print("number of inputs to the model = " + str(inputs))

def build_model():
  model = keras.Sequential([
    layers.Dense(8, activation=tf.nn.relu, input_shape=[inputs]),
    #layers.Dropout(0.2),

    #layers.Dense(8, activation=tf.nn.relu),
    #layers.Dropout(0.2),

    layers.Dense(8, activation=tf.nn.relu),
    #layers.Dropout(0.2),
    
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])
  return model

In [None]:
model = build_model()
print("done")

# **Train the Model**
Modify this hyperparameter<br>
1. Number of epochs

In [None]:
# Train silently (verbose=0) and stop early once the validation loss stops improving

model = build_model()
EPOCHS = 1000

# The patience parameter is the number of epochs to wait for an improvement before stopping
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(normed_train_data, train_labels, epochs=EPOCHS,
                    validation_split = 0.2, verbose=0, callbacks=[early_stop])
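
The idea behind the patience parameter can be sketched in plain Python. This is a simplified illustration and not Keras's actual implementation: the `stopping_epoch` helper and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Simplified sketch: stop once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best value, reset the counter
            wait = 0
        else:
            wait += 1     # one more epoch without improvement
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered, ran all epochs

# Made-up validation losses: the best value is reached at epoch 2.
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
print(stopping_epoch(losses, patience=3))
```

With a larger patience, training tolerates longer plateaus before stopping, at the cost of more epochs.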


In [None]:
def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error [wage per hour]')
  plt.plot(hist['epoch'], hist['mean_absolute_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
           label = 'Val Error')
  plt.ylim([0,5])
  plt.legend()

  plt.show()

plot_history(history)

In [None]:
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=1)

print("Testing set Mean Abs Error: ${:5.2f} wage_per_hour".format(mae))
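
For reference, the two metrics reported by the evaluation can be computed by hand. The sketch below uses made-up wages and predictions, not output from the model above.

```python
import numpy as np

# Made-up hourly wages and predictions, for illustration only.
labels = np.array([11.0, 10.0, 8.0])
predictions = np.array([10.0, 12.0, 8.0])

errors = predictions - labels
mae = np.mean(np.abs(errors))  # mean absolute error, in wage units
mse = np.mean(errors ** 2)     # mean squared error, used as the loss above

print("MAE: {:5.2f}".format(mae))
print("MSE: {:5.2f}".format(mse))
```

The mean absolute error is in the same units as the label, which makes it easier to interpret than the squared error.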


Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins), making the data discrete.<br>
One of the main goals of discretization is to reduce a continuous attribute to a small number of intervals, which is why this transformation can improve the performance of tree-based models.<br>
Scikit-learn provides a KBinsDiscretizer class that takes care of this.<br>
The only things you have to specify are the number of bins (n_bins) for each feature and how to encode these bins (ordinal, onehot or onehot-dense).<br>

The optional strategy parameter can be set to one of three values:<br>
>**uniform**, where all bins in each feature have identical widths.<br>
**quantile** (default), where all bins in each feature have the same number of points.<br>
**kmeans**, where all values in each bin have the same nearest center of a 1D k-means cluster.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(n_bins=3, encode='onehot', 
                        strategy='uniform')
disc.fit_transform(normed_train_data)  # returns a sparse one-hot matrix of bin memberships

If the output doesn’t make sense to you, inspect the bin_edges_ attribute of the discretizer (disc) to see how the bins are divided. Then try another strategy and see how the bin edges change accordingly.
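
One way to compare the strategies is to fit each of them on the same feature and print the resulting bin edges. The sketch below uses a small made-up, skewed feature rather than the wage data, and ordinal encoding to keep the output dense.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up, skewed single feature for illustration.
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0]).reshape(-1, 1)

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    disc.fit(values)
    # bin_edges_ holds one array of edges per feature.
    edges[strategy] = disc.bin_edges_[0]
    print(strategy, edges[strategy])
```

With uniform bins the outlier at 10 stretches every bin, whereas the quantile and kmeans strategies place their edges according to where the data actually lies.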