# Model Training with TensorFlow

Train a model that predicts whether a file transfer is suspicious or benign, based on its features (attributes).

### 1. Import the required libraries and packages.

You can safely ignore the TensorFlow import warnings.
TensorFlow typically produces these warnings on CPU-only environments where the use of accelerators, such as GPUs or certain CPU instruction sets, is limited or not available.

In [1]:
from typing import List, Dict

import pandas as pd

import tensorflow as tf

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

2024-11-12 21:42:57.245625: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-12 21:42:57.245686: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-12 21:42:57.247717: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-12 21:42:57.260223: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### 2. Load the data into a Pandas dataframe.

In [2]:
data = pd.read_csv('./data/mftinput4.csv')


Split the data into the features (`X`, a dataframe) and the target variable (`y`, a series).

In [5]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

Inspect the features and the target variable.

In [6]:
X.head()

   Entropy  FileAge  CompressionRatio  FileSize  TransferTime  PacketsSize  TransferRate  Age
0        6      148                72        35             0         33.6         0.627   50
1        1       85                66        29             0         26.6         0.351   31
2        8      183                64         0             0         23.3         0.672   32
3        1       89                66        23            94         28.1         0.167   21
4        0      137                40        35           168         43.1         2.288   33


In [7]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

### 3. Divide the data into training and test data sets.

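The `train_test_split` function of scikit-learn splits the data set into random training and test subsets.
With `test_size=0.2`, 20% of the samples are held out for testing, and a fixed `random_state` makes the split repeatable.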
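
As a rough illustration of what a random hold-out split does, the following pure-Python sketch shuffles row indices with a fixed seed and sets a fraction aside for testing. The `holdout_split` helper is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows the idea.

```python
import random


def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    # A seeded generator gives the same shuffle on every run
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for ten samples
train_rows, test_rows = holdout_split(rows, test_size=0.2, seed=0)
print(len(train_rows), len(test_rows))  # prints "8 2"
```

Every sample ends up in exactly one of the two subsets, so the test set stays unseen during training.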

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=0
)

print(f"Number of samples in training set: {X_train.shape[0]}")
print(f"Number of samples in test set: {X_test.shape[0]}")


### 4. Create and train the model.

Define a simple neural network model with the Keras API of TensorFlow.
The network takes eight input features and outputs two values, one for each possible outcome: benign or suspicious.
The final softmax layer turns these values into class probabilities.
The network also defines two hidden layers, with 20 and 10 neurons respectively.

In [None]:
# Seed for reproducible results
tf.random.set_seed(10)
tf.keras.utils.set_random_seed(10)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])
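
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer sizes from the cell above (8 inputs, then 20, 10, and 2 units); the `dense_params` helper is a hypothetical name used only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """A dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [8, 20, 10, 2]  # inputs, hidden layer 1, hidden layer 2, outputs
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # prints "[180, 210, 22] 412"
```

Calling `model.summary()` on the model should report the same per-layer counts.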

Compile the model and define the loss function, the optimizer, and the training epochs.

In [None]:
epochs = 500

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
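
The `sparse_categorical_crossentropy` loss takes integer class labels (0 or 1 here) and penalizes the model according to the probability that it assigns to the true class. The following is a minimal pure-Python sketch of that idea for a single sample, not the Keras implementation; the raw scores are made-up values.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probabilities, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probabilities[true_class])


probs = softmax([1.0, 3.0])  # hypothetical raw scores; class 1 is favored
loss_correct = sparse_categorical_crossentropy(probs, true_class=1)
loss_wrong = sparse_categorical_crossentropy(probs, true_class=0)
print(probs, loss_correct, loss_wrong)
```

The loss is small when the model favors the true class and grows as the model becomes confidently wrong. During training, Keras averages this per-sample value over each batch.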

Train the model.

In [None]:
# verbose=0 trains silently; use verbose=1 to display a progress bar
model.fit(X_train, y_train, epochs=epochs, verbose=0)

### 5. Evaluate the model metrics.

After the model is trained, evaluate the model against the test set.

In [None]:
# Compute the predictions (y_predicted) given the test data (X_test)
y_predicted_probabilities = model.predict(X_test)
y_predicted = tf.argmax(y_predicted_probabilities, axis=1)

# Compare the predicted values for the test set (y_predicted)
# against the expected values (y_test)
print("Classification Report:")
print(classification_report(y_test, y_predicted))

The trained model has an accuracy value of approximately 78%.
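
The classification report also lists precision, recall, and F1 score for each class. As a sketch of how these values relate to the predictions, the hypothetical `binary_metrics` helper below computes them by hand for one class. The labels are made-up toy values, not output from this model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


toy_true = [1, 0, 1, 0, 1, 0, 0, 1]  # toy expected labels
toy_pred = [1, 0, 0, 0, 1, 1, 0, 1]  # toy predicted labels
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

In this toy case, one false positive and one false negative out of eight samples give 0.75 for all four metrics. On real data the values usually differ from each other, which is why accuracy alone can hide weak performance on one class.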

You can improve the score by retraining the model after more sophisticated data engineering or by tweaking the model hyperparameters.
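
One common data engineering step for neural networks is to standardize each feature to zero mean and unit variance, because the columns in this data set have very different ranges. The sketch below is an assumption about what might help, not a step from this notebook. It uses the five `FileAge` values shown in `X.head()`, and the helper names are hypothetical.

```python
import statistics


def fit_standardizer(column):
    """Learn the mean and standard deviation from the training data only."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # avoid dividing by zero for constant columns
    return mean, std


def standardize(column, mean, std):
    """Rescale values with previously learned statistics."""
    return [(value - mean) / std for value in column]


train_file_age = [148.0, 85.0, 183.0, 89.0, 137.0]  # FileAge values from X.head()
mean, std = fit_standardizer(train_file_age)
scaled_train = standardize(train_file_age, mean, std)
scaled_new = standardize([120.0], mean, std)  # hypothetical unseen value
print(scaled_train, scaled_new)
```

The statistics are learned from the training data and reused for new data, so that no information from the test set leaks into training. Whether this improves the score for this data set would need to be checked by retraining.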

### 6. Test the model with sample cases.
Test the model with data from two file transfers: one expected to be suspicious and one expected to be benign.

In [None]:
# Tuple for textual display of the prediction.
# Assumes that Outcome 0 means benign and Outcome 1 means suspicious.
classes = ('Benign', 'Suspicious')


def predict(transfers: List[Dict]):
    # The values must follow the same column order as the training data
    features_as_lists = [list(transfer.values()) for transfer in transfers]
    inputs_array = np.array(features_as_lists)
    prediction_probabilities = model.predict(inputs_array, verbose=0)
    # argmax gets the index of the maximum value in an array
    predictions = [classes[np.argmax(p)] for p in prediction_probabilities]
    return predictions


suspicious_transfer = {
    "Entropy": 6.0,
    "FileAge": 110.0,
    "CompressionRatio": 65.0,
    "FileSize": 15.0,
    "TransferTime": 1.0,
    "PacketsSize": 45.7,
    "TransferRate": 0.627,
    "Age": 50
}

benign_transfer = {
    "Entropy": 0,
    "FileAge": 88.0,
    "CompressionRatio": 60.0,
    "FileSize": 35.0,
    "TransferTime": 1.0,
    "PacketsSize": 45.7,
    "TransferRate": 0.27,
    "Age": 20
}

predictions = predict([suspicious_transfer, benign_transfer])
print(predictions)