# Diagnosing the Semian Flu

You are given the early data for an outbreak of a dangerous virus that originated in a group of primates kept in a biomedical research lab. The virus has been dubbed the "Semian Flu".

You have the medical records of some patients in `flu.csv`. The data contain two general types of patients, flu patients and healthy patients. This is recorded in the column labeled `flu`: a 0 indicates the absence of the virus and a 1 indicates its presence. Furthermore, scientists have found that there are two strains of the virus, each requiring a different type of treatment. This is recorded in the column labeled `flutype`: a 1 indicates the absence of the virus, a 2 indicates the presence of strain 1, and a 3 indicates the presence of strain 2.

**Your task:** build a model to predict if a given patient has the flu. Your goal is to catch as many flu patients as possible without misdiagnosing too many healthy patients.


In [65]:
from google.colab import drive, files
import os

drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [66]:
PROJECT_FOLDER = "/content/gdrive/MyDrive/Colab Notebooks/MSDS600/Assignment3/"
os.chdir(PROJECT_FOLDER)
print("Current dir: ", os.getcwd())

Current dir:  /content/gdrive/MyDrive/Colab Notebooks/MSDS600/Assignment3


In [67]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

## Part 1: Data Exploration

**Exercise 1: Load the dataset from 'flu.csv' into a pandas DataFrame. Look at the first rows of the dataset.**

In [68]:
df = pd.read_csv("flu.csv")
df.head()

Unnamed: 0,ID,Gender,Age,Race1,Education,MaritalStatus,HHIncomeMid,HomeOwn,Work,Weight,BMI,Pulse,Testosterone,TotChol,PhysActive,AlcoholYear,SmokeNow,PregnantNow,flu,flutype
0,51624,male,34,White,High School,Married,30000.0,Own,NotWorking,87.4,32.22,70.0,,3.49,No,0.0,No,,0,1
1,51630,female,49,White,Some College,LivePartner,40000.0,Rent,NotWorking,86.7,30.57,86.0,,6.7,No,20.0,Yes,,0,1
2,51638,male,9,White,,,87500.0,Rent,,29.8,16.82,82.0,,4.86,,,,,0,1
3,51646,male,8,White,,,60000.0,Own,,35.2,20.64,72.0,,4.09,,,,,0,1
4,51647,female,45,White,College Grad,Married,87500.0,Own,Working,75.7,27.24,62.0,,5.82,Yes,52.0,,,0,1


**Exercise 2: Provide summary statistics of the number of flu patients and the distribution of flu types.**

In [69]:
flu_patients = df[df.flu == 1].shape[0]

In [70]:
df[df.flu ==1].flutype.value_counts()

Unnamed: 0_level_0,count
flutype,Unnamed: 1_level_1
2,227
3,83


In [71]:
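#Reading the strain counts from the data instead of hard-coding them
strain_counts = df[df.flu == 1].flutype.value_counts()
print(f"The total number of flu patients is {flu_patients}, out of all the {df.shape[0]} patients.")
print(f"There are {strain_counts[2]} patients with strain 1 of the flu and {strain_counts[3]} with strain 2.")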

The total number of flu patients is 310, out of all the 5246 patients.
There are 227 patients with strain 1 of the flu and 83 with strain 2.
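
The counts above imply that flu patients make up roughly 5.9% of the records (310 of 5246). As a minimal sketch of how such proportions could be read off directly, the block below uses `value_counts(normalize=True)` on a small made-up DataFrame; the `toy` data are an assumption for illustration and are not taken from `flu.csv`.

```python
import pandas as pd

# Toy stand-in for the `flu` and `flutype` columns (values are made up).
toy = pd.DataFrame({
    "flu":     [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "flutype": [1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
})

# Share of healthy (0) vs. flu (1) patients.
flu_share = toy["flu"].value_counts(normalize=True)

# Share of each strain among flu patients only.
strain_share = toy.loc[toy["flu"] == 1, "flutype"].value_counts(normalize=True)

print(flu_share)
print(strain_share)
```

Looking at proportions rather than raw counts makes the class imbalance visible early, which matters later when choosing evaluation metrics.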


**Exercise 3: What is the average BMI of the 5 individuals with the highest cholesterol level?**


In [72]:
top_5_chol = df.sort_values(by = "TotChol", ascending = False)[["TotChol", "BMI"]].head(5)
top_5_chol

Unnamed: 0,TotChol,BMI
2733,13.65,28.82
3271,12.28,36.5
3377,9.93,19.9
2621,9.9,21.09
892,9.34,28.03


In [73]:
top_5_avg_bmi = top_5_chol.BMI.mean()
top_5_avg_bmi = round(top_5_avg_bmi, 2)
print(f"The average BMI of the 5 patients with the highest cholesterol levels is: {top_5_avg_bmi}")

The average BMI of the 5 patients with the highest cholesterol levels is: 26.87


**Exercise 4: Calculate the average age for healthy and flu patients using pandas `groupby()` method.**

In [74]:
#Checking if there are null values in the Age column
df.Age.isnull().sum()

0

In [75]:
df.groupby("flu").Age.mean()

Unnamed: 0_level_0,Age
flu,Unnamed: 1_level_1
0,34.549635
1,43.493548


So the average age of healthy patients is about 34.5 years, and that of flu patients is about 43.5 years.

## Part 2: Data Preprocessing

**Exercise 5: There are a large number of missing values in the data. Nearly all predictors have some degree of missingness. NaN in the `PregnantNow` column is meaningful and informative, as patients with NaN's in the pregnancy column are males. Replace all missing values in `PregnantNow` with "No".**


In [76]:
#Assigning the result back to the column avoids the chained-assignment warning raised by inplace=True
df["PregnantNow"] = df["PregnantNow"].fillna("No")


In [77]:
#checking the NaNs have been replaced
df.PregnantNow.head()

Unnamed: 0,PregnantNow
0,No
1,No
2,No
3,No
4,No


**Exercise 6: Drop the following columns from the dataframe as they have too many missing values: `SmokeNow`, `AlcoholYear`, `Testosterone`,  `Education`, `MaritalStatus`, `Work`.**

In [78]:
df.drop(columns = ["SmokeNow", "AlcoholYear", "Testosterone", "Education", "MaritalStatus", "Work"], inplace = True)

In [79]:
df.head()

Unnamed: 0,ID,Gender,Age,Race1,HHIncomeMid,HomeOwn,Weight,BMI,Pulse,TotChol,PhysActive,PregnantNow,flu,flutype
0,51624,male,34,White,30000.0,Own,87.4,32.22,70.0,3.49,No,No,0,1
1,51630,female,49,White,40000.0,Rent,86.7,30.57,86.0,6.7,No,No,0,1
2,51638,male,9,White,87500.0,Rent,29.8,16.82,82.0,4.86,,No,0,1
3,51646,male,8,White,60000.0,Own,35.2,20.64,72.0,4.09,,No,0,1
4,51647,female,45,White,87500.0,Own,75.7,27.24,62.0,5.82,Yes,No,0,1


**Exercise 7: For the rest of the variables with NaN, replace the NaN's either with the mode if the variable is categorical (object type) or with the mean if it is numerical (float type).**

In [80]:
df.dtypes

Unnamed: 0,0
ID,int64
Gender,object
Age,int64
Race1,object
HHIncomeMid,float64
HomeOwn,object
Weight,float64
BMI,float64
Pulse,float64
TotChol,float64


In [81]:
df.isnull().sum()

Unnamed: 0,0
ID,0
Gender,0
Age,0
Race1,0
HHIncomeMid,448
HomeOwn,33
Weight,40
BMI,236
Pulse,870
TotChol,909


In [82]:
#Numerical columns: replacing NaN with the column mean
for col in ["HHIncomeMid", "Weight", "BMI", "Pulse", "TotChol"]:
    df[col] = df[col].fillna(df[col].mean())

#Categorical columns: replacing NaN with the column mode
for col in ["HomeOwn", "PhysActive"]:
    df[col] = df[col].fillna(df[col].mode()[0])

In [83]:
#checking there are no more missing values to make sure the replacements happened
df.isnull().sum()

Unnamed: 0,0
ID,0
Gender,0
Age,0
Race1,0
HHIncomeMid,0
HomeOwn,0
Weight,0
BMI,0
Pulse,0
TotChol,0


In [84]:
df.head()

Unnamed: 0,ID,Gender,Age,Race1,HHIncomeMid,HomeOwn,Weight,BMI,Pulse,TotChol,PhysActive,PregnantNow,flu,flutype
0,51624,male,34,White,30000.0,Own,87.4,32.22,70.0,3.49,No,No,0,1
1,51630,female,49,White,40000.0,Rent,86.7,30.57,86.0,6.7,No,No,0,1
2,51638,male,9,White,87500.0,Rent,29.8,16.82,82.0,4.86,Yes,No,0,1
3,51646,male,8,White,60000.0,Own,35.2,20.64,72.0,4.09,Yes,No,0,1
4,51647,female,45,White,87500.0,Own,75.7,27.24,62.0,5.82,Yes,No,0,1
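
The rule in Exercise 7 (mode for categorical columns, mean for numerical ones) can also be driven by each column's dtype rather than by a hand-written list of column names. The following is a minimal sketch on a made-up DataFrame; the `toy` data are an assumption, and the only library helper used is `pd.api.types.is_numeric_dtype`.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical column (values are made up).
toy = pd.DataFrame({
    "BMI": [32.2, np.nan, 16.8, 20.6],
    "HomeOwn": ["Own", "Rent", np.nan, "Own"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numerical column: fill gaps with the mean of the observed values.
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill gaps with the most frequent value.
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

This avoids forgetting a column when the schema changes, at the cost of treating every non-numeric column as categorical.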



**Exercise 8: Encode categorical variables into dummy variables.**



In [85]:
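#Note: .astype(int) casts every column, so the float columns (e.g. BMI, TotChol) are truncated to integers as well
df = pd.get_dummies(df).astype(int)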
df.head()

Unnamed: 0,ID,Age,HHIncomeMid,Weight,BMI,Pulse,TotChol,flu,flutype,Gender_female,...,Race1_Other,Race1_White,HomeOwn_Other,HomeOwn_Own,HomeOwn_Rent,PhysActive_No,PhysActive_Yes,PregnantNow_No,PregnantNow_Unknown,PregnantNow_Yes
0,51624,34,30000,87,32,70,3,0,1,0,...,0,1,0,1,0,1,0,1,0,0
1,51630,49,40000,86,30,86,6,0,1,1,...,0,1,0,0,1,1,0,1,0,0
2,51638,9,87500,29,16,82,4,0,1,0,...,0,1,0,0,1,0,1,1,0,0
3,51646,8,60000,35,20,72,4,0,1,0,...,0,1,0,1,0,0,1,1,0,0
4,51647,45,87500,75,27,62,5,0,1,1,...,0,1,0,1,0,0,1,1,0,0
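
In the output above, the numerical columns appear as whole numbers (for example, a BMI of 32.22 becomes 32), which is a side effect of casting the whole frame with `.astype(int)`. If that truncation is not wanted, one alternative is to pass `dtype=int` to `pd.get_dummies`, which applies only to the generated dummy columns. This is a sketch on made-up data, not the approach used for the results in this notebook, and switching to it would change the numbers reported below.

```python
import pandas as pd

# Toy frame mixing a float column and a categorical column (values are made up).
toy = pd.DataFrame({
    "BMI": [32.22, 30.57, 16.82],
    "Gender": ["male", "female", "male"],
})

# dtype=int controls only the new dummy columns; BMI keeps its float values.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded)
```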


## Part 3: Model Training and Evaluation






**Exercise 9: Split the observations into an approximate 80-20 train-test split and train a logistic regression model.**

In [86]:
df.columns

Index(['ID', 'Age', 'HHIncomeMid', 'Weight', 'BMI', 'Pulse', 'TotChol', 'flu',
       'flutype', 'Gender_female', 'Gender_male', 'Race1_Black',
       'Race1_Hispanic', 'Race1_Mexican', 'Race1_Other', 'Race1_White',
       'HomeOwn_Other', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_No',
       'PhysActive_Yes', 'PregnantNow_No', 'PregnantNow_Unknown',
       'PregnantNow_Yes'],
      dtype='object')

In [87]:
from sklearn.model_selection import train_test_split
#splitting the data into train and test
data_train, data_test = train_test_split(df, train_size = 0.8, random_state = 1)

#selecting the predictors
predictors = ['Age', 'Weight', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black',
       'Race1_Hispanic', 'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'PregnantNow_No']
X_train = data_train[predictors]
y_train = data_train['flu']

X_test = data_test[predictors]
y_test = data_test['flu']
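
Because flu patients are a small minority, a purely random split can leave the test set with a slightly different flu rate than the training set. One option, shown here as a sketch on made-up data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The split used in this notebook is not stratified, so this is an alternative rather than a description of the results reported here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 flu patients out of 100 (values are made up).
toy = pd.DataFrame({
    "Age": np.arange(100),
    "flu": [1] * 10 + [0] * 90,
})

# stratify=toy["flu"] preserves the 10% flu rate in both splits.
train, test = train_test_split(
    toy, train_size=0.8, random_state=1, stratify=toy["flu"]
)

print(train["flu"].mean(), test["flu"].mean())
```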

In [88]:
#Creating a linear logistic regression model
lr_model = LogisticRegression(solver='saga', penalty=None, random_state = 1, class_weight='balanced', max_iter=1000)
lr_model.fit(X_train, y_train)

**Exercise 10: Report the performance of the model on the training set.**


In [89]:
print("Accuracy on training set:", lr_model.score(X_train, y_train.values))


Accuracy on training set: 0.6203527168732126


**Exercise 11: Evaluate the performance of the model on the test set, considering both the overall accuracy and the ability to correctly identify flu patients. In other words, apart from the accuracy, compute the percentage of correctly classified flu patients and the percentage of correctly classified healthy individuals based on this model.**

In [90]:
print("Accuracy on test set:", lr_model.score(X_test, y_test.values))

Accuracy on test set: 0.6095238095238096


In [91]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = lr_model.predict(X_test)

#Computing overall accuracy
accuracy = accuracy_score(y_test, y_pred)

#Computing confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

#Sensitivity (Recall for Flu Patients)
sensitivity = tp / (tp + fn)  # True Positives / (True Positives + False Negatives)

#Specificity (Recall for Healthy Individuals)
specificity = tn / (tn + fp)  # True Negatives / (True Negatives + False Positives)

print(f"Accuracy: {accuracy:.2f}")
print(f"Sensitivity (Flu Patients Correctly Classified): {sensitivity:.2f}")
print(f"Specificity (Healthy Individuals Correctly Classified): {specificity:.2f}")

Accuracy: 0.61
Sensitivity (Flu Patients Correctly Classified): 0.66
Specificity (Healthy Individuals Correctly Classified): 0.61
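
Sensitivity and specificity are the recall of the positive and negative class respectively, so they can also be obtained with `recall_score` by changing `pos_label`. The sketch below checks this on a small made-up set of labels; `y_true` and `y_hat` are assumptions for illustration and are unrelated to the model output above.

```python
from sklearn.metrics import recall_score

# Made-up labels: 3 flu patients (1) and 5 healthy individuals (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 1, 1]

# Recall of class 1 = TP / (TP + FN): 2 of 3 flu patients are caught.
sensitivity = recall_score(y_true, y_hat, pos_label=1)

# Recall of class 0 = TN / (TN + FP): 3 of 5 healthy individuals are cleared.
specificity = recall_score(y_true, y_hat, pos_label=0)

print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
```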


In [92]:
predictors_lst = [
    ['Age', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black', 'Race1_Mexican',
     'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'PregnantNow_No'],
    ['Age', 'Weight', 'Gender_male', 'HomeOwn_Rent', 'PhysActive_No'],
    ['Weight', 'TotChol', 'Pulse', 'Race1_White', 'PregnantNow_No'],
    ['Race1_White', 'Race1_Black', 'Race1_Mexican', 'Race1_Hispanic'],
    ['Gender_female', 'Age'],
    ['HHIncomeMid', 'HomeOwn_Own'],
]

In [93]:
#Trying logistic regression with different sets of predictors

for lst in predictors_lst:
  X_train = data_train[lst]
  X_test = data_test[lst]
  y_train = data_train['flu']
  y_test = data_test['flu']
  lr_model = LogisticRegression(solver='saga', penalty=None, random_state = 1, class_weight='balanced', max_iter=1000)
  lr_model.fit(X_train, y_train)
  print(f"Accuracy on training set for {lst}:", lr_model.score(X_train, y_train.values))
  print(f"Accuracy on test set for {lst}:", lr_model.score(X_test, y_test.values))
  y_pred = lr_model.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
  sensitivity = tp / (tp + fn)
  specificity = tn / (tn + fp)
  print(f"Accuracy for {lst}: {accuracy:.2f}")
  print(f"Sensitivity (Flu Patients Correctly Classified) for {lst}: {sensitivity:.2f}")
  print(f"Specificity (Healthy Individuals Correctly Classified) for {lst}: {specificity:.2f}")
  print()




Accuracy on training set for ['Age', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black', 'Race1_Mexican', 'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'PregnantNow_No']: 0.6484747378455672
Accuracy on test set for ['Age', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black', 'Race1_Mexican', 'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'PregnantNow_No']: 0.6285714285714286
Accuracy for ['Age', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black', 'Race1_Mexican', 'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'PregnantNow_No']: 0.63
Sensitivity (Flu Patients Correctly Classified) for ['Age', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black', 'Race1_Mexican', 'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'PregnantNow_No']: 0.59
Specificity (Healthy Individuals Correctly Classified) for ['Age', 'BMI', 'TotChol', 'Gender_female', 'Race1_Black', 'Race1_Mexican', 'Race1_White', 'HomeOwn_Own', 'HomeOwn_Rent', 'PhysActive_Yes', 'Pr



Accuracy on training set for ['Weight', 'TotChol', 'Pulse', 'Race1_White', 'PregnantNow_No']: 0.5400381315538608
Accuracy on test set for ['Weight', 'TotChol', 'Pulse', 'Race1_White', 'PregnantNow_No']: 0.5104761904761905
Accuracy for ['Weight', 'TotChol', 'Pulse', 'Race1_White', 'PregnantNow_No']: 0.51
Sensitivity (Flu Patients Correctly Classified) for ['Weight', 'TotChol', 'Pulse', 'Race1_White', 'PregnantNow_No']: 0.50
Specificity (Healthy Individuals Correctly Classified) for ['Weight', 'TotChol', 'Pulse', 'Race1_White', 'PregnantNow_No']: 0.51





Accuracy on training set for ['Race1_White', 'Race1_Black', 'Race1_Mexican', 'Race1_Hispanic']: 0.31053384175405147
Accuracy on test set for ['Race1_White', 'Race1_Black', 'Race1_Mexican', 'Race1_Hispanic']: 0.3333333333333333
Accuracy for ['Race1_White', 'Race1_Black', 'Race1_Mexican', 'Race1_Hispanic']: 0.33
Sensitivity (Flu Patients Correctly Classified) for ['Race1_White', 'Race1_Black', 'Race1_Mexican', 'Race1_Hispanic']: 0.76
Specificity (Healthy Individuals Correctly Classified) for ['Race1_White', 'Race1_Black', 'Race1_Mexican', 'Race1_Hispanic']: 0.30





Accuracy on training set for ['Gender_female', 'Age']: 0.6010486177311726
Accuracy on test set for ['Gender_female', 'Age']: 0.5742857142857143
Accuracy for ['Gender_female', 'Age']: 0.57
Sensitivity (Flu Patients Correctly Classified) for ['Gender_female', 'Age']: 0.64
Specificity (Healthy Individuals Correctly Classified) for ['Gender_female', 'Age']: 0.57

Accuracy on training set for ['HHIncomeMid', 'HomeOwn_Own']: 0.9428026692087702
Accuracy on test set for ['HHIncomeMid', 'HomeOwn_Own']: 0.9333333333333333
Accuracy for ['HHIncomeMid', 'HomeOwn_Own']: 0.93
Sensitivity (Flu Patients Correctly Classified) for ['HHIncomeMid', 'HomeOwn_Own']: 0.00
Specificity (Healthy Individuals Correctly Classified) for ['HHIncomeMid', 'HomeOwn_Own']: 1.00





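Based on these results, the best model is the first one we fit (Exercise 9), as it offers the best balance of overall accuracy, sensitivity, and specificity on the test set. Some of the other models did not converge, and the same happened with models using polynomial features that I also tried. Note that the model with the highest accuracy (93%, using `HHIncomeMid` and `HomeOwn_Own`) has a sensitivity of 0.00: it labels essentially every test patient as healthy, so it is of no use for catching flu patients.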
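
One possible contributor to the convergence problems is feature scale: the `saga` solver is only guaranteed to converge quickly when the features are on roughly the same scale, and columns such as `HHIncomeMid` are orders of magnitude larger than the dummy variables. The sketch below, on synthetic data, standardizes the features inside a pipeline before fitting. The data, the variable names, and the use of the default L2 penalty (instead of `penalty=None` as in this notebook) are all assumptions for illustration; whether scaling resolves the issue on `flu.csv` has not been verified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: an age-like column and a much larger income-like column.
rng = np.random.default_rng(1)
age = rng.integers(1, 80, size=500)
income = rng.integers(5_000, 100_000, size=500)
X = np.column_stack([age, income]).astype(float)

# Synthetic label that depends on age plus noise (purely illustrative).
y = (age + rng.normal(0, 15, size=500) > 55).astype(int)

# StandardScaler puts both columns on a comparable scale before saga runs.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", class_weight="balanced",
                       max_iter=1000, random_state=1),
)
model.fit(X, y)

print("Training accuracy:", model.score(X, y))
```

Keeping the scaler inside the pipeline means it is fit on the training data only, so the same transformation is applied consistently at prediction time.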

As computed above, the model's overall test accuracy is 61%; it correctly classifies 66% of flu patients and 61% of healthy individuals. The model is not very accurate, but it is still reasonable given how imbalanced the data are: healthy patients vastly outnumber flu patients. This is why we set `class_weight='balanced'` in the logistic regression; without it, the model classified healthy patients almost perfectly but missed almost all flu patients. Another likely reason for the limited accuracy is that important causes of infection are not captured, and perhaps cannot be captured, in the features. One example is a patient's exposure to the virus (or to someone already infected), which is largely random and difficult to measure.
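
To make the effect of `class_weight='balanced'` concrete: scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies this to the full-dataset counts from Exercise 2 (4936 healthy, 310 flu). The model itself was fit on the 80% training split, so the weights it actually used would differ slightly; the numbers here are only an illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the full-dataset class counts: 4936 healthy (0), 310 flu (1).
y = np.array([0] * 4936 + [1] * 310)

# 'balanced' gives each class the weight n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

print(f"Weight for healthy: {weights[0]:.3f}")
print(f"Weight for flu:     {weights[1]:.3f}")
```

Each flu patient therefore counts roughly sixteen times as much as a healthy patient in the loss, which is what pushes the model to stop ignoring the minority class.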
