![image.png](attachment:image.png)

<h1 align="center">CA2 - Machine Learning</h1>
<h3 align="center">Lecturer: David McQuaid</h3>
<h3 align="center">Caio Machado de Oliveira</h3>
<h4 align="center">ID: 2020351</h4>
<h4 align="center">May/2024</h4>

In [1]:
#Importing all libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras import backend as k
from tensorflow.keras.callbacks import EarlyStopping


import joblib

# Neural Network - Part 1

In [2]:
# importing dataset
filename = 'BankRecords.csv'
data = pd.read_csv(filename)


In [3]:
data.head()

Unnamed: 0,ID,Age,Experience(Years),Income(Thousands's),Sort Code,Family,Credit Score,Education,Mortgage(Thousands's),Personal Loan,Securities Account,CD Account,Online Banking,CreditCard
0,1,25,1,49,91107,4,1.6,Diploma,0,No,Yes,No,No,No
1,2,45,19,34,90089,3,1.5,Diploma,0,No,Yes,No,No,No
2,3,39,15,11,94720,1,1.0,Diploma,0,No,No,No,No,No
3,4,35,9,100,94112,1,2.7,Degree,0,No,No,No,No,No
4,5,35,8,45,91330,4,1.0,Degree,0,No,No,No,No,Yes


## EDA - Exploratory Data Analysis

### Numerical Values

A first look at the numerical features shows that 'Experience(Years)' contains negative values (its minimum is -3), which are not valid for years of experience and might cause problems in later steps.
The features 'ID' and 'Sort Code' were also identified as not important for this application.
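
As a minimal sketch of this check, the minimum row of `describe()` can be scanned for values below zero. The toy frame below is invented for illustration: only the column names mirror the dataset, and the values are not taken from BankRecords.csv.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Age": [25, 24, 45, 39],
    "Experience(Years)": [1, -1, 19, -2],
    "Mortgage(Thousands's)": [0, 0, 89, 0],
})

# The 'min' row of describe() holds the smallest value of each numeric column.
minimums = toy.describe().loc["min"]

# Any column whose minimum is below zero deserves a closer look.
suspicious = minimums[minimums < 0].index.tolist()
print(suspicious)
```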

In [4]:
# Checking numerical values
data.describe()

Unnamed: 0,ID,Age,Experience(Years),Income(Thousands's),Sort Code,Family,Credit Score,Mortgage(Thousands's)
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,45.3384,20.1046,73.7742,93152.503,2.3964,1.937913,56.4988
std,1443.520003,11.463166,11.467954,46.033729,2121.852197,1.147663,1.747666,101.713802
min,1.0,23.0,-3.0,8.0,9307.0,1.0,0.0,0.0
25%,1250.75,35.0,10.0,39.0,91911.0,1.0,0.7,0.0
50%,2500.5,45.0,20.0,64.0,93437.0,2.0,1.5,0.0
75%,3750.25,55.0,30.0,98.0,94608.0,3.0,2.5,101.0
max,5000.0,67.0,43.0,224.0,96651.0,4.0,10.0,635.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     5000 non-null   int64  
 1   Age                    5000 non-null   int64  
 2   Experience(Years)      5000 non-null   int64  
 3   Income(Thousands's)    5000 non-null   int64  
 4   Sort Code              5000 non-null   int64  
 5   Family                 5000 non-null   int64  
 6   Credit Score           5000 non-null   float64
 7   Education              5000 non-null   object 
 8   Mortgage(Thousands's)  5000 non-null   int64  
 9   Personal Loan          5000 non-null   object 
 10  Securities Account     5000 non-null   object 
 11  CD Account             5000 non-null   object 
 12  Online Banking         5000 non-null   object 
 13  CreditCard             5000 non-null   object 
dtypes: float64(1), int64(7), object(6)
memory usage: 547.0+ KB

In [6]:
data.isnull().sum()

ID                       0
Age                      0
Experience(Years)        0
Income(Thousands's)      0
Sort Code                0
Family                   0
Credit Score             0
Education                0
Mortgage(Thousands's)    0
Personal Loan            0
Securities Account       0
CD Account               0
Online Banking           0
CreditCard               0
dtype: int64

### Categorical Values

It is important to inspect the unique values of each categorical feature to check for misspellings. '.unique()' lists the distinct values of a feature, and '.value_counts()' counts how often each value occurs.
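
The same inspection can also be written as a loop over all non-numeric columns instead of one block per feature. This is only a sketch on a made-up frame; the name `unique_labels` is ours and does not appear elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical toy data with two of the categorical columns.
toy = pd.DataFrame({
    "Age": [25, 45, 39, 35],
    "Education": ["Diploma", "Degree", "Diploma", "Masters"],
    "CreditCard": ["No", "Yes", "No", "No"],
})

# Select every non-numeric column and collect its distinct labels.
categorical_columns = toy.select_dtypes(exclude="number").columns
unique_labels = {col: sorted(toy[col].unique()) for col in categorical_columns}

for col, labels in unique_labels.items():
    print(col, labels)
    print(toy[col].value_counts(), "\n")
```

A misspelled label such as 'Diplom' would show up here as an extra entry with a very low count.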

In [7]:
print(data['Education'].value_counts())
print(data['Education'].unique())

print("\n")
print(data['Personal Loan'].value_counts())
print(data['Personal Loan'].unique())

print("\n")
print(data['Securities Account'].value_counts())
print(data['Securities Account'].unique())

print("\n")
print(data['CD Account'].value_counts())
print(data['CD Account'].unique())

print("\n")
print(data['Online Banking'].value_counts())
print(data['Online Banking'].unique())


print("\n")
print(data['CreditCard'].value_counts())
print(data['CreditCard'].unique())


Education
Diploma    2096
Masters    1501
Degree     1403
Name: count, dtype: int64
['Diploma' 'Degree' 'Masters']


Personal Loan
No     4520
Yes     480
Name: count, dtype: int64
['No' 'Yes']


Securities Account
No     4478
Yes     522
Name: count, dtype: int64
['Yes' 'No']


CD Account
No     4698
Yes     302
Name: count, dtype: int64
['No' 'Yes']


Online Banking
Yes    2984
No     2016
Name: count, dtype: int64
['No' 'Yes']


CreditCard
No     3530
Yes    1470
Name: count, dtype: int64
['No' 'Yes']


## Data Preparation

Preparing the data for a model requires understanding which features are relevant and which may affect the model's accuracy.
The EDA section showed that 'Experience(Years)' contains negative values; the affected rows will be removed because they might affect the model's accuracy.
'ID' and 'Sort Code' were identified as irrelevant for our model and will be dropped.

In [8]:
# Drop the 'ID' and 'Sort Code' columns and rename Income column for convenience
data = data.drop(['ID', 'Sort Code'], axis=1)
data = data.rename(columns={'Income(Thousands\'s)': 'Income'})

Only the 'Experience(Years)' feature contains negative values, so we only need to count the affected rows and remove them from the dataset.
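
A minimal sketch of this count-and-remove step on invented data is shown below. The `reset_index(drop=True)` call is an optional extra that the notebook itself does not use; it simply renumbers the remaining rows.

```python
import pandas as pd

# Hypothetical toy data; one row has an invalid negative experience value.
toy = pd.DataFrame({
    "Age": [25, 24, 45],
    "Experience(Years)": [1, -1, 19],
})

# Boolean mask that is True for valid (non-negative) rows.
mask = toy["Experience(Years)"] >= 0

# Number of rows that will be removed.
removed = int((~mask).sum())

# Keep only the valid rows and renumber the index from 0.
cleaned = toy[mask].reset_index(drop=True)
print(f"Removed {removed} row(s); {len(cleaned)} remain")
```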

In [9]:
# Count the number of negative observations in 'Experience(Years)'
negative_value = (data['Experience(Years)'] < 0).sum()
print(f"Number of negative values in 'Experience(Years)': {negative_value}")

Number of negative values in 'Experience(Years)': 52


In [10]:
# Selecting rows with negative values in 'Experience(Years)' column
negative_rows = data[data['Experience(Years)'] < 0]
# Displaying the selected rows
print("Rows with negative values in 'Experience(Years)' column:")
negative_rows

Rows with negative values in 'Experience(Years)' column:


Unnamed: 0,Age,Experience(Years),Income,Family,Credit Score,Education,Mortgage(Thousands's),Personal Loan,Securities Account,CD Account,Online Banking,CreditCard
89,25,-1,113,4,2.3,Masters,0,No,No,No,No,Yes
226,24,-1,39,2,1.7,Degree,0,No,No,No,No,No
315,24,-2,51,3,0.3,Masters,0,No,No,No,Yes,No
451,28,-2,48,2,1.75,Masters,89,No,No,No,Yes,No
524,24,-1,75,4,0.2,Diploma,0,No,No,No,Yes,No
536,25,-1,43,3,2.4,Degree,176,No,No,No,Yes,No
540,25,-1,109,4,2.3,Masters,314,No,No,No,Yes,No
576,25,-1,48,3,0.3,Masters,0,No,No,No,No,Yes
583,24,-1,38,2,1.7,Degree,0,No,No,No,Yes,No
597,24,-2,125,2,7.2,Diploma,0,No,Yes,No,No,Yes


In [11]:
# Remove rows with negative values in 'Experience(Years)' column
data = data[data['Experience(Years)'] >= 0]

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4948 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    4948 non-null   int64  
 1   Experience(Years)      4948 non-null   int64  
 2   Income                 4948 non-null   int64  
 3   Family                 4948 non-null   int64  
 4   Credit Score           4948 non-null   float64
 5   Education              4948 non-null   object 
 6   Mortgage(Thousands's)  4948 non-null   int64  
 7   Personal Loan          4948 non-null   object 
 8   Securities Account     4948 non-null   object 
 9   CD Account             4948 non-null   object 
 10  Online Banking         4948 non-null   object 
 11  CreditCard             4948 non-null   object 
dtypes: float64(1), int64(5), object(6)
memory usage: 502.5+ KB


### Processing features

### StandardScaler() & OneHotEncoder()

Processing the features plays a fundamental role in our model. StandardScaler and OneHotEncoder work together but have different functions.
StandardScaler is applied to numerical features that have different ranges, rescaling each one so that its mean is 0 and its standard deviation is 1. Another option would be MinMaxScaler, which was tested previously and gave worse accuracy.
OneHotEncoder converts a categorical variable into numerical form by creating a binary column for each category that indicates the presence or absence of that category. With drop='first', as used here, the first category of each feature gets no column of its own and is represented by zeros in the remaining columns.

In [13]:
# Separate the target variable from the processing 
income = data.pop('Income')

In [14]:
# List of categorical and numerical features
categorical_features = ['Education', 'Personal Loan', 'Securities Account', 'CD Account', 'Online Banking', 'CreditCard']
numerical_features = ['Age', 'Experience(Years)', 'Family', 'Credit Score', 'Mortgage(Thousands\'s)']

In [15]:
# Creating transformers for numerical and categorical data
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')  # Use drop='first' to avoid multicollinearity

In [16]:
# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])


In [17]:
# Apply the transformations
data_preprocessed = preprocessor.fit_transform(data)

In [18]:
# Convert the preprocessed data to a DataFrame for better readability
# Getting feature names for the new columns created by OneHotEncoder
encoded_columns = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
all_columns = numerical_features + list(encoded_columns)

In [19]:
# Create a DataFrame
data_model = pd.DataFrame(data_preprocessed, columns=all_columns)

In [20]:
data_model

Unnamed: 0,Age,Experience(Years),Family,Credit Score,Mortgage(Thousands's),Education_Diploma,Education_Masters,Personal Loan_Yes,Securities Account_Yes,CD Account_Yes,Online Banking_Yes,CreditCard_Yes
0,-1.816072,-1.709074,1.400757,-0.192215,-0.556228,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.049224,-0.117679,0.529926,-0.249439,-0.556228,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,-0.579278,-0.471322,-1.211736,-0.535558,-0.556228,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.932648,-1.001787,-1.211736,0.437247,-0.556228,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.932648,-1.090198,1.400757,-0.535558,-0.556228,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4943,-1.462703,-1.532252,-1.211736,-0.020543,-0.556228,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4944,-1.374360,-1.443841,1.400757,-0.878901,0.278590,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4945,1.540939,1.650538,-0.340905,-0.936125,-0.556228,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4946,1.717624,1.738949,0.529926,-0.821677,-0.556228,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [21]:
# Add target variable back to data
data_model['Income'] = income.values

In [22]:
data_model.head()

Unnamed: 0,Age,Experience(Years),Family,Credit Score,Mortgage(Thousands's),Education_Diploma,Education_Masters,Personal Loan_Yes,Securities Account_Yes,CD Account_Yes,Online Banking_Yes,CreditCard_Yes,Income
0,-1.816072,-1.709074,1.400757,-0.192215,-0.556228,1.0,0.0,0.0,1.0,0.0,0.0,0.0,49
1,-0.049224,-0.117679,0.529926,-0.249439,-0.556228,1.0,0.0,0.0,1.0,0.0,0.0,0.0,34
2,-0.579278,-0.471322,-1.211736,-0.535558,-0.556228,1.0,0.0,0.0,0.0,0.0,0.0,0.0,11
3,-0.932648,-1.001787,-1.211736,0.437247,-0.556228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100
4,-0.932648,-1.090198,1.400757,-0.535558,-0.556228,0.0,0.0,0.0,0.0,0.0,0.0,1.0,45


In [23]:
data_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4948 entries, 0 to 4947
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Age                     4948 non-null   float64
 1   Experience(Years)       4948 non-null   float64
 2   Family                  4948 non-null   float64
 3   Credit Score            4948 non-null   float64
 4   Mortgage(Thousands's)   4948 non-null   float64
 5   Education_Diploma       4948 non-null   float64
 6   Education_Masters       4948 non-null   float64
 7   Personal Loan_Yes       4948 non-null   float64
 8   Securities Account_Yes  4948 non-null   float64
 9   CD Account_Yes          4948 non-null   float64
 10  Online Banking_Yes      4948 non-null   float64
 11  CreditCard_Yes          4948 non-null   float64
 12  Income                  4948 non-null   int64  
dtypes: float64(12), int64(1)
memory usage: 502.7 KB


# Training Data

In [24]:
# Separate features and target variable
X = data_model.drop('Income', axis=1)
y = data_model['Income']

In [25]:
# Split the data into training and test sets
X_train, X_test, y_train , y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Neural Network 

A neural network is a subset of machine learning and is known as the backbone of deep learning algorithms. The 'neural' part of the name refers to the way the model mimics the neurons in the brain. A neural network, also called an artificial neural network, is made up of layers of nodes: an input layer, one or more hidden layers, and an output layer (Google Cloud, n.d.). Each node is connected to nodes in the next layer and has an associated weight and threshold value. The image below shows a neural network with five layers: an input layer, three hidden layers, and an output layer (IBM, 2024).

![image.png](attachment:image.png) image: (IBM, 2024)


Two concepts are central to how a neural network is trained: the feed-forward pass and backpropagation. In the feed-forward pass, data flows in one direction only, from input to output. Backpropagation moves in the opposite direction: it takes the total error and propagates it back through the neurons, adjusting the weights and biases of the network in order to minimize that error (IBM, 2024).
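
The two directions can be sketched with a single linear neuron in NumPy. This is a deliberately simplified illustration with invented data, not the Keras model used in this notebook; the learning rate `lr` and the starting weights are arbitrary assumptions.

```python
import numpy as np

# Hypothetical data following y = 2 * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Arbitrary starting weight, bias and learning rate.
w, b, lr = 0.0, 0.0, 0.05

# Feed-forward: compute predictions from input to output.
pred = w * x + b
loss_before = np.mean((pred - y) ** 2)

# Backward pass: gradients of the mean squared error with respect to w and b.
error = pred - y
grad_w = 2 * np.mean(error * x)
grad_b = 2 * np.mean(error)

# Update step: move the parameters against the gradient.
w = w - lr * grad_w
b = b - lr * grad_b

# A second feed-forward pass shows that the error has decreased.
loss_after = np.mean((w * x + b - y) ** 2)
print(loss_before, loss_after)
```

In a real network the same idea is applied layer by layer through the chain rule, and the optimizer repeats the update for many batches and epochs.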

In our model, we used 'ReLU' as the activation function of the hidden layers because it is widely used in regression applications and dealt better with the complexity of the data (Oppermann, 2021). After many training runs to search for the best parameters, we got the best results with the architecture defined in the code below, which has two hidden layers with 128 and 64 units. The output layer must have as many units as there are values to predict, which is one (income) in our case. The activation function used in the output layer of most regression models is 'linear', which is suitable for predicting continuous values. The optimizer is 'Adam', an adaptive variant of stochastic gradient descent, which plays an important role by minimizing the loss function, in our model the Mean Squared Error (MSE). MSE is defined as the average of the squared differences between the predicted and actual values.
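
As a small illustration of the two functions named above, ReLU and MSE can be written by hand in NumPy. The helper names `relu` and `mse_manual` and the sample values are our own assumptions; Keras provides its own implementations of both.

```python
import numpy as np

def relu(z):
    """ReLU keeps positive values and replaces negative values with 0."""
    return np.maximum(0.0, z)

def mse_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

activations = relu(np.array([-2.0, 0.0, 3.5]))
print(activations)

# Hypothetical incomes (in thousands) and predictions.
y_true = [49, 34, 11]
y_pred = [45, 40, 11]
mse_value = mse_manual(y_true, y_pred)
print(mse_value)
```

Because the errors are squared, MSE is expressed in squared units of the target; taking its square root gives the RMSE, which is in the same units as income.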

With that, the model is built and ready to run. After exhaustive training runs, the best results were scores ranging from 63% to 65%.

In [26]:
# Define the Keras Sequential model with tuned hyperparameters
model = Sequential()
# An explicit Input layer replaces input_dim, which triggers a warning in recent Keras versions
model.add(tf.keras.Input(shape=(X_train.shape[1],)))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model with a tuned optimizer
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Stop training when the validation loss has not improved for 10 epochs
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)


In [27]:
# Train the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=[early_stop], verbose=1)

# Evaluate the model
loss = model.evaluate(X_test, y_test, verbose=1)
print(f"Test Loss: {loss}")

Epoch 1/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 6367.0322 - val_loss: 1212.8506
Epoch 2/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1252.4066 - val_loss: 953.0752
Epoch 3/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1024.4225 - val_loss: 884.6487
Epoch 4/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 938.1086 - val_loss: 847.1609
Epoch 5/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 902.2068 - val_loss: 829.5797
Epoch 6/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 864.9871 - val_loss: 810.8610
Epoch 7/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 906.4537 - val_loss: 800.6794
Epoch 8/100
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 868.0255 - val_loss: 794.7

In [62]:
# Evaluate the model
y_pred_keras = model.predict(X_test)
mse_keras = mean_squared_error(y_test, y_pred_keras)
rmse_keras = np.sqrt(mse_keras)
r2_keras = r2_score(y_test, y_pred_keras)

print(f'Keras MSE: {mse_keras}')
print(f'Keras RMSE: {rmse_keras}')
print(f'Keras R²: {r2_keras}')

31/31 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Keras MSE: 737.173612933383
Keras RMSE: 27.150941290006557
Keras R²: 0.6251927875261548


# Machine Learning Regression Model

In this assignment, we explored two commonly used machine learning models: Linear Regression and Random Forest Regressor. Both come from the scikit-learn library, have distinct characteristics, and are used for regression problems, where the goal is to predict a continuous target variable from input features.



## Linear Regression


LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation (scikit-learn developers, 2019).

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables (Amazon Web Services, Inc., n.d.). It was chosen in order to model the data with a different approach and to explore the nuances of the dataset during model building. A Linear Regression model works by fitting a linear equation to the observed data, minimizing the sum of squared residuals between the observed and predicted values.
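
The least-squares idea can be checked on synthetic data by comparing scikit-learn's fit with a direct NumPy solution. The data, coefficients and variable names below are invented for this sketch and are unrelated to the bank dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = 3*x1 - 2*x2 + 5 plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)

# Fit with scikit-learn.
demo_model = LinearRegression().fit(X_demo, y_demo)

# Solve the same least-squares problem directly:
# a column of ones is added so that the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print("sklearn:", demo_model.intercept_, demo_model.coef_)
print("lstsq:  ", beta[0], beta[1:])
```

Both approaches minimize the same sum of squared residuals, so the intercept and coefficients agree up to numerical precision.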


The model is simple to apply and does not demand much hyperparameter tuning.

After applying the model, the result was an R² score of approximately 0.55 (55%).

In [29]:
# Train a Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [64]:
# Evaluate the model
y_pred_linear = linear_model.predict(X_test)
mse_linear = mean_squared_error(y_test, y_pred_linear)
rmse_linear = np.sqrt(mse_linear)
r2_linear = r2_score(y_test, y_pred_linear)

print(f'Linear Regression MSE: {mse_linear}')
print(f'Linear Regression RMSE: {rmse_linear}')
print(f'Linear Regression R²: {r2_linear}')

Linear Regression MSE: 891.3765563848292
Linear Regression RMSE: 29.855930003683174
Linear Regression R²: 0.5467901231113048


## Random Forest Regressor

Random Forest Regressor, on the other hand, is an ensemble learning method that leverages the power of multiple decision trees to improve predictive performance and robustness. The random forest algorithm constructs a multitude of decision trees during training and outputs the mean prediction of the individual trees for regression tasks. This approach helps to mitigate overfitting, enhance model accuracy, and handle datasets with higher dimensionality (IBM, 2023). Random Forest Regressor is particularly effective in capturing complex interactions and non-linear relationships between features, making it a versatile and powerful tool for various regression problems.

Our model performed well, capturing the complexity of the data even though the dataset is not large.
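
The averaging behaviour described above can be verified on a small synthetic problem: the forest's prediction should match the mean of the predictions of its individual trees. The dataset and the names `forest_demo` and `per_tree` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data, unrelated to the bank dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

# A small forest with 10 trees.
forest_demo = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Predict the first five samples with every individual tree.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest_demo.estimators_])

# Average across trees and compare with the forest's own prediction.
manual_mean = per_tree.mean(axis=0)
forest_prediction = forest_demo.predict(X_demo[:5])
print(manual_mean)
print(forest_prediction)
```

Each tree is trained on a different bootstrap sample, so individual trees disagree; averaging them is what reduces the variance of the final prediction.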

In [31]:
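# Generate a synthetic regression dataset with make_regression.
# It is stored under separate names so it does not overwrite the bank data;
# the Random Forest below is trained on X_train / y_train from the earlier split.
X_synth, y_synth = make_regression(
    n_samples=1000,        # Number of samples
    n_features=12,         # Number of features
    n_informative=6,       # Number of informative features
    noise=0.2,             # Standard deviation of Gaussian noise
    bias=0.5,              # Bias term in the underlying linear model
    random_state=42        # Seed for reproducibility
)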

In [32]:
# Create a Random Forest Regressor
model_RF = RandomForestRegressor(n_estimators=100, random_state=42)

In [33]:
# Train the model
model_RF.fit(X_train, y_train)

In [65]:
# Evaluate the model
y_pred_RF = model_RF.predict(X_test)
mse_RF = mean_squared_error(y_test, y_pred_RF)
rmse_RF = np.sqrt(mse_RF)
r2_rf = r2_score(y_test, y_pred_RF)

print(f'Random Forest Regressor MSE: {mse_RF}')
print(f'Random Forest Regressor RMSE: {rmse_RF}')
print(f'Random Forest Regressor R²: {r2_rf}')

Random Forest Regressor MSE: 412.9744567065066
Random Forest Regressor RMSE: 20.321772971532443
Random Forest Regressor R²: 0.7900280175179655


# Model Comparison

Based on our models after training and testing, the Random Forest Regressor performed best compared with Keras and Linear Regression. Although Keras is a neural network model and generally deals well with complex data, the fact that our dataset does not have a large number of observations (about 5,000 in total) may have affected the results, preventing the network from capturing the underlying patterns effectively.
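
For reference, the R² score used in this comparison can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2_manual` and the sample values are our own illustration; the last line only re-derives the Keras RMSE from the MSE reported earlier.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Hypothetical targets and predictions.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
r2_value = r2_manual(y_true, y_pred)
print(r2_value)

# Always predicting the mean gives R² = 0, the baseline a model must beat.
r2_baseline = r2_manual(y_true, [25, 25, 25, 25])
print(r2_baseline)

# RMSE is the square root of MSE (Keras MSE as reported in this notebook).
rmse_from_reported_mse = float(np.sqrt(737.173612933383))
print(rmse_from_reported_mse)
```

An R² of 0.79 therefore means the Random Forest explains about 79% of the variation in income on the test set, against about 63% for Keras and 55% for Linear Regression.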

In [71]:
print(f'Keras MSE: {mse_keras:.2f}')
print(f'Keras R²: {r2_keras*100:.2f}%')


print(f'\nLinear Regression MSE: {mse_linear:.2f}')
print(f'Linear Regression R²: {r2_linear*100:.2f}%')


print(f'\nRandom Forest Regressor MSE: {mse_RF:.2f}')
print(f'Random Forest Regressor R²: {r2_rf*100:.2f}%')


Keras MSE: 737.17
Keras R²: 62.52%

Linear Regression MSE: 891.38
Linear Regression R²: 54.68%

Random Forest Regressor MSE: 412.97
Random Forest Regressor R²: 79.00%


In [35]:
# New customer details
new_customer = {
    'Age': 30,
    'Experience(Years)': 5,
    'Sort Code': 92011,
    'Family': 2,
    'Credit Score': 1.2,
    'Mortgage(Thousands\'s)': 20,
    'Education': 'Degree',
    'Personal Loan': 'No',
    'Securities Account': 'Yes',
    'CD Account': 'No',
    'Online Banking': 'Yes',
    'CreditCard': 'No'
}

# Convert to DataFrame
new_customer_df = pd.DataFrame([new_customer])

# Apply the same preprocessing to the new customer data
new_customer_preprocessed = preprocessor.transform(new_customer_df)

# Predict income using the keras model
new_customer_income_keras = model.predict(new_customer_preprocessed)

# Predict income using the linear model
new_customer_income_linear = linear_model.predict(new_customer_preprocessed)


# Predict income using the Random Forest model
new_customer_income_RF = model_RF.predict(new_customer_preprocessed)

# Print the results
print("Predicted income using keras model:", new_customer_income_keras[0])
print("Predicted income using linear model:", new_customer_income_linear[0])
print("Predicted income using Random Forest model:", new_customer_income_RF[0])



1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
Predicted income using keras model: [44.273777]
Predicted income using linear model: 52.40344773459118
Predicted income using Random Forest model: 52.4




In [72]:
# joblib.dump(linear_model, 'linear_model.pkl')

# joblib.dump(model_RF, 'model_RF.pkl')

# model.save('my_keras_model.h5')

# # Save the preprocessor
# joblib.dump(preprocessor, 'preprocessor.pkl')
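
The commented-out lines above save the models to disk. As a hedged sketch of the full round trip, the example below dumps a small fitted model with joblib and loads it again; the tiny dataset and the temporary directory are assumptions made so that the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2*x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_demo, y_demo)

# Save the model to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_model.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same_predictions = bool(np.allclose(fitted.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

When a saved model is reused later, the fitted preprocessor has to be saved and loaded alongside it, so that new data is scaled and encoded exactly as the training data was.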

# References

IBM (2024). AI vs. machine learning vs. deep learning vs. neural networks | IBM. [online] www.ibm.com. Available at: https://www.ibm.com/think/topics/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks.

Google Cloud. (n.d.). Deep learning vs machine learning. [online] Available at: https://cloud.google.com/discover/deep-learning-vs-machine-learning#:~:text=is%20machine%20learning.- [Accessed 30 May 2024].

Oppermann, A. (2021). Activation Functions in Deep Learning: Sigmoid, tanh, ReLU. [online] KI Tutorials. Available at: https://artemoppermann.com/activation-functions-in-deep-learning-sigmoid-tanh-relu/#:~:text=Activation%20functions%20add%20a%20nonlinear [Accessed 30 May 2024].



Amazon Web Services, Inc. (n.d.). What is Linear Regression? - Linear Regression - AWS. [online] Available at: https://aws.amazon.com/what-is/linear-regression/.

IBM (2023). What is Random Forest? | IBM. [online] www.ibm.com. Available at: https://www.ibm.com/topics/random-forest.




### MODELS

Chollet, F. (2020). The Sequential model. [online] keras.io. Available at: https://keras.io/guides/sequential_model/.


scikit-learn.org. (n.d.). 3.2.4.3.2. sklearn.ensemble.RandomForestRegressor — scikit-learn 0.23.2 documentation. [online] Available at: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.


scikit-learn developers (2019). sklearn.linear_model.LinearRegression — scikit-learn 0.22 documentation. [online] Scikit-learn.org. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.

