We can train two classifiers, one predicting whether a patient will be hospitalized and one predicting whether they will recover, from the patient's age, gender, symptoms, and test result. The workflow is:

Data Preprocessing: Clean the data, handle missing values, and encode categorical features.
Feature Engineering: Extract useful features and create new ones if necessary.
Model Selection: Choose an appropriate machine learning algorithm.
Training and Evaluation: Train the model and evaluate its performance using metrics like accuracy, precision, recall, and F1-score.
Hyperparameter Tuning: Optimize the model’s parameters for better performance.
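
The encoding, model selection, and hyperparameter tuning steps above can be combined into a single scikit-learn pipeline. What follows is a minimal sketch rather than the method used in the cell below: it runs on randomly generated stand-in data (the column names and category values are assumptions), uses one-hot encoding in place of label encoding, and searches a small, purely illustrative parameter grid.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; column names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 90, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Symptoms": rng.choice(["Cough", "Fever", "Fatigue"], size=n),
    "TestResult": rng.choice(["Positive", "Negative"], size=n),
    "Hospitalized": rng.choice([True, False], size=n),
})

X = df.drop(columns=["Hospitalized"])
y = df["Hospitalized"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# One-hot encode the categorical columns; pass Age through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["Gender", "Symptoms", "TestResult"]),
    ],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

# Illustrative grid; the "model__" prefix targets the pipeline step named "model".
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out accuracy:", round(search.score(X_test, y_test), 3))
```

Because the preprocessing lives inside the pipeline, it is refit on each cross-validation fold, so no information from the validation fold leaks into the encoder.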

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the generated COVID-19 data
data = pd.read_csv('covid19_sample_data.csv')

# Drop the identifier and test-date columns, which are not used as features
data = data.drop(columns=['PatientID', 'DateTested'])

# Encode categorical features as integers, keeping one fitted encoder per column
# so the original labels can be recovered later with inverse_transform
encoders = {}
for column in ['Gender', 'TestResult', 'Symptoms']:
    encoders[column] = LabelEncoder()
    data[column] = encoders[column].fit_transform(data[column])

# Define features and target
X = data.drop(columns=['Hospitalized', 'Recovered'])
y_hospitalized = data['Hospitalized']
y_recovered = data['Recovered']

# Split features and both targets in a single call so their rows stay aligned
X_train, X_test, y_train_hosp, y_test_hosp, y_train_rec, y_test_rec = train_test_split(
    X, y_hospitalized, y_recovered, test_size=0.3, random_state=42
)

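# Scale the features, fitting the scaler on the training set only
# (tree-based models such as random forests do not need scaling, but it is harmless)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)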

# Train the model for hospitalization prediction
model_hosp = RandomForestClassifier(n_estimators=100, random_state=42)
model_hosp.fit(X_train, y_train_hosp)

# Make predictions for hospitalization
y_pred_hosp = model_hosp.predict(X_test)

# Evaluate the hospitalization model
print("Hospitalization Model Evaluation")
print(confusion_matrix(y_test_hosp, y_pred_hosp))
print(classification_report(y_test_hosp, y_pred_hosp))
print("Accuracy:", accuracy_score(y_test_hosp, y_pred_hosp))

# Train the model for recovery prediction
model_rec = RandomForestClassifier(n_estimators=100, random_state=42)
model_rec.fit(X_train, y_train_rec)

# Make predictions for recovery
y_pred_rec = model_rec.predict(X_test)

# Evaluate the recovery model
print("\nRecovery Model Evaluation")
print(confusion_matrix(y_test_rec, y_pred_rec))
print(classification_report(y_test_rec, y_pred_rec))
print("Accuracy:", accuracy_score(y_test_rec, y_pred_rec))


Hospitalization Model Evaluation
[[74 76]
 [75 75]]
              precision    recall  f1-score   support

       False       0.50      0.49      0.49       150
        True       0.50      0.50      0.50       150

    accuracy                           0.50       300
   macro avg       0.50      0.50      0.50       300
weighted avg       0.50      0.50      0.50       300

Accuracy: 0.49666666666666665

Recovery Model Evaluation
[[64 95]
 [73 68]]
              precision    recall  f1-score   support

       False       0.47      0.40      0.43       159
        True       0.42      0.48      0.45       141

    accuracy                           0.44       300
   macro avg       0.44      0.44      0.44       300
weighted avg       0.44      0.44      0.44       300

Accuracy: 0.44
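
The reported metrics follow directly from the two confusion matrices, whose rows are the true labels and whose columns are the predicted labels, both in the order False, True. The sketch below recomputes them in plain Python and compares each accuracy with a majority-class baseline (always predicting the more frequent class). The helper name `summarize` is hypothetical and not part of the notebook.

```python
def summarize(matrix):
    """Derive metrics for the True class from a 2x2 confusion matrix.

    Layout follows scikit-learn: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Accuracy obtained by always predicting the more frequent true class.
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

# Confusion matrices copied from the output above.
hosp = summarize([[74, 76], [75, 75]])
rec = summarize([[64, 95], [73, 68]])

for name, stats in [("Hospitalization", hosp), ("Recovery", rec)]:
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Both models score at or below their majority-class baseline (0.50 for hospitalization, 0.53 for recovery), so neither does better than guessing. That is the result one would expect if the labels in the generated sample data are unrelated to the features.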
