
## Group Members
- 520H0511 - Phan Ngọc Hoàng Anh
- 520H0371 - Đặng Nhật Khang



# Pima Indians Diabetes Prediction using Machine Learning and Neural Networks

This notebook analyzes the Pima Indians Diabetes Database and applies several machine learning and neural network models to predict whether a patient has diabetes.



## Data Loading and Initial Analysis

First, we load the data and perform initial exploratory analysis.


In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = 'path_to_your_data.csv' # Replace with your file path
data = pd.read_csv(file_path)

# Display the first few rows
print(data.head())

# Statistical summary
print(data.describe())

# Plot histograms
data.hist(bins=15, figsize=(15, 10))
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='viridis')
plt.title("Correlation Matrix of Features")
plt.show()
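In the commonly distributed version of this dataset, a zero in a physiological measurement such as glucose or BMI is generally understood to mark a missing reading rather than a real value, which can distort the summary statistics and correlations above. The following is a minimal sketch of how those zeros might be counted and masked. It assumes the usual column names (`Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`), which do not appear in this notebook, and uses a small made-up frame called `sample` in place of the loaded `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be the loaded `data` frame.
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero is physiologically implausible (assumed names).
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column to see how much data is effectively missing.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Optionally mark those zeros as NaN so describe() and corr() skip them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

Whether to impute, drop, or keep these rows is a modeling decision; the sketch only makes the extent of the issue visible.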



## Machine Learning Models

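We now train three baseline classifiers: Logistic Regression, a Support Vector Machine (SVM), and Random Forest. Logistic Regression and the SVM are fit on standardized features, while Random Forest is fit on the raw features, since tree-based models do not depend on feature scale.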


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Splitting the dataset
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

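# Logistic Regression (trained on standardized features)
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
y_pred_logreg = logreg.predict(X_test_scaled)
print("Logistic Regression")
print(classification_report(y_test, y_pred_logreg))

# Support Vector Machine (trained on standardized features)
svc = SVC()
svc.fit(X_train_scaled, y_train)
y_pred_svc = svc.predict(X_test_scaled)
print("Support Vector Machine")
print(classification_report(y_test, y_pred_svc))

# Random Forest Classifier (tree-based, so the unscaled features are used)
rf = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest")
print(classification_report(y_test, y_pred_rf))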
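A single 80/20 split yields one score per model, which can vary noticeably on a dataset of this size. As a hedged sketch, the same three classifiers could be compared with 5-fold cross-validation, wrapping the scaler in a pipeline so it is refit on each training fold. The synthetic data from `make_classification` is only a stand-in for `X` and `y`, and `compare_models` is a hypothetical helper, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy for each classifier."""
    models = {
        # Scaling lives inside the pipeline, so each fold is scaled independently.
        "logreg": make_pipeline(StandardScaler(), LogisticRegression()),
        "svc": make_pipeline(StandardScaler(), SVC()),
        # Random Forest is left unscaled, as in the cell above.
        "rf": RandomForestClassifier(random_state=42),
    }
    return {name: cross_val_score(model, X, y, cv=cv).mean()
            for name, model in models.items()}


# Synthetic stand-in data with 8 features; replace with the real X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
scores = compare_models(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```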



## Neural Network Models

Below are implementations of a Feed Forward Neural Network (FFNN) and a Recurrent Neural Network (RNN). Both reuse the standardized train/test splits from the previous cell and require an environment where TensorFlow is installed.

### Feed Forward Neural Network (FFNN):
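This model stacks two hidden dense layers (32 and 16 units, ReLU activation), each followed by 20% dropout for regularization, and ends in a single sigmoid unit for binary classification. Early stopping halts training once the validation loss has gone 5 consecutive epochs without improving.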
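The patience rule behind early stopping can be illustrated without TensorFlow. The function below is a simplified sketch of the idea, not Keras's actual implementation; `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Strict improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the patience budget was never exhausted


# Made-up validation losses: the best value (0.58) is reached at epoch 2.
losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.58, 0.61, 0.63]
print(stopping_epoch(losses, patience=5))
print(stopping_epoch(losses, patience=3))
```

A smaller patience stops sooner but risks halting during a temporary plateau.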

### Recurrent Neural Network (RNN):
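The RNN model uses two LSTM (Long Short-Term Memory) layers, an architecture designed for sequence data. The dataset here is tabular rather than sequential, so each sample is reshaped into a sequence with a single timestep and the recurrent layers see no temporal structure. This model also includes dropout layers for regularization and uses early stopping to end training once the validation loss stops improving.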
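LSTM layers expect three-dimensional input of shape `(samples, timesteps, features)`. The sketch below shows, on a small made-up array named `X_demo`, what adding a timestep axis of length 1 does to a two-dimensional feature matrix.

```python
import numpy as np

# Hypothetical stand-in for the scaled feature matrix: 4 samples, 8 features.
X_demo = np.arange(32, dtype=float).reshape(4, 8)

# Add a timestep axis of length 1: (samples, features) -> (samples, 1, features).
X_demo_rnn = np.reshape(X_demo, (X_demo.shape[0], 1, X_demo.shape[1]))
print(X_demo.shape, "->", X_demo_rnn.shape)

# The values are unchanged; each row is simply wrapped as a one-step sequence.
assert np.array_equal(X_demo_rnn[:, 0, :], X_demo)

# An equivalent spelling using np.newaxis.
assert np.array_equal(X_demo[:, np.newaxis, :], X_demo_rnn)
```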


In [None]:

# Import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np

# Feed Forward Neural Network (FFNN)
model_ffnn = Sequential([
    Dense(32, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

model_ffnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
history_ffnn = model_ffnn.fit(X_train_scaled, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping], verbose=0)
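ffnn_evaluation = model_ffnn.evaluate(X_test_scaled, y_test, verbose=0)  # [loss, accuracy]
print(f"FFNN test loss: {ffnn_evaluation[0]:.4f}, test accuracy: {ffnn_evaluation[1]:.4f}")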

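# Recurrent Neural Network (RNN)
# LSTM layers expect 3-D input (samples, timesteps, features),
# so each row is reshaped into a sequence with a single timestep.
X_train_scaled_rnn = np.reshape(X_train_scaled, (X_train_scaled.shape[0], 1, X_train_scaled.shape[1]))
X_test_scaled_rnn = np.reshape(X_test_scaled, (X_test_scaled.shape[0], 1, X_test_scaled.shape[1]))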

model_rnn = Sequential([
    LSTM(32, input_shape=(1, X_train_scaled.shape[1]), return_sequences=True),
    Dropout(0.2),
    LSTM(16),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

model_rnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_rnn = model_rnn.fit(X_train_scaled_rnn, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping], verbose=0)
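rnn_evaluation = model_rnn.evaluate(X_test_scaled_rnn, y_test, verbose=0)  # [loss, accuracy]
print(f"RNN test loss: {rnn_evaluation[0]:.4f}, test accuracy: {rnn_evaluation[1]:.4f}")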
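The `evaluate` calls above return only loss and accuracy. To compare the networks with the scikit-learn models on the same per-class metrics, the sigmoid outputs could be thresholded into labels and passed to `classification_report`. This is a sketch under assumptions: the `probs` array is made up and stands in for the output of `model_ffnn.predict(X_test_scaled)`, and the 0.5 threshold is a conventional default rather than a tuned value.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up values standing in for model.predict(...) output, shape (n_samples, 1).
probs = np.array([[0.91], [0.12], [0.55], [0.40], [0.78], [0.05]])
y_true = np.array([1, 0, 0, 0, 1, 0])

# Threshold at 0.5 and flatten to a 1-D vector of class labels.
y_pred = (probs >= 0.5).astype(int).ravel()
print(y_pred)

# Same report format as used for the scikit-learn models.
print(classification_report(y_true, y_pred))
```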
