## Week 4, Lab 2: Predicting Chronic Kidney Disease in Patients
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus on exploring data, building models, and evaluating the models we build.

There are three links you may find important:
- [A set of chronic kidney disease (CKD) data and other biological factors](./chronic_kidney_disease_full.csv).
- [The CKD data dictionary](./chronic_kidney_disease_header.txt).
- [An article comparing the use of k-nearest neighbors and support vector machines on predicting CKD](./chronic_kidney_disease.pdf).

## Step 1: Define the problem.

Suppose you're working for Mayo Clinic, widely recognized to be the top hospital in the United States. In your work, you've overheard nurses and doctors discuss test results, then arrive at a conclusion as to whether or not someone has developed a particular disease or condition. For example, you might overhear something like:

> **Nurse**: Male 57 year-old patient presents with severe chest pain. FDP _(short for fibrin degradation product)_ was elevated at 13. We did an echo _(echocardiogram)_ and it was inconclusive.

> **Doctor**: What was his interarm BP? _(blood pressure)_

> **Nurse**: Systolic was 140 on the right; 110 on the left.

> **Doctor**: Dammit, it's an aortic dissection! Get to the OR _(operating room)_ now!

> _(intense music playing)_

In this fictitious but [Shonda Rhimes-esque](https://en.wikipedia.org/wiki/Shonda_Rhimes#Grey's_Anatomy,_Private_Practice,_Scandal_and_other_projects_with_ABC) scenario, you might imagine the doctor going through a series of steps like a [flowchart](https://en.wikipedia.org/wiki/Flowchart), or a series of if-this-then-that steps to diagnose a patient. The first steps made the doctor ask what the interarm blood pressure was. Because interarm blood pressure took on the values it took on, the doctor diagnosed the patient with an aortic dissection.

Your goal, as a research biostatistical data scientist at the nation's top hospital, is to develop a medical test that can improve upon our current diagnosis system for [chronic kidney disease (CKD)](https://www.mayoclinic.org/diseases-conditions/chronic-kidney-disease/symptoms-causes/syc-20354521).

**Real-world problem**: Develop a medical diagnosis test that is better than our current diagnosis system for CKD.

**Data science problem**: Develop a medical diagnosis test that reduces both the number of false positives and the number of false negatives.

---

## Step 2: Obtain the data.

### 1. Read in the data.

In [1]:
from sklearn.preprocessing import scale
import warnings; warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import patsy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.under_sampling import NearMiss, EditedNearestNeighbours, TomekLinks
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.pipeline import Pipeline

In [4]:
data = pd.read_csv('~/desktop/dsi/submissions/labs/4.02-lab-classification-model-evaluation-master/4.02-lab-classification-model-evaluation-master/data/chronic_kidney_disease_full.csv')

### 2. Check out the data dictionary. What are a few features or relationships you might be interested in checking out?

Answer: examine the relationship between someone's age, blood pressure, red blood cell count, sodium, potassium, white blood cell count, etc., and whether or not they exhibit symptoms of/have CKD.

In [7]:
data

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,47.0,6700.0,4.9,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,54.0,7800.0,6.2,no,no,no,good,no,no,notckd
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,49.0,6600.0,5.4,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,51.0,7200.0,5.9,no,no,no,good,no,no,notckd


---

## Step 3: Explore the data.

### 3. How much of the data is missing from each column?

In [6]:
data.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

In [10]:
data.dropna()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
9,53.0,90.0,1.020,2.0,0.0,abnormal,abnormal,present,notpresent,70.0,...,29.0,12100.0,3.7,yes,yes,no,poor,no,yes,ckd
11,63.0,70.0,1.010,3.0,0.0,abnormal,abnormal,present,notpresent,380.0,...,32.0,4500.0,3.8,yes,yes,no,poor,yes,no,ckd
14,68.0,80.0,1.010,3.0,2.0,normal,abnormal,present,present,157.0,...,16.0,11000.0,2.6,yes,yes,yes,poor,yes,no,ckd
20,61.0,80.0,1.015,2.0,0.0,abnormal,abnormal,notpresent,notpresent,173.0,...,24.0,9200.0,3.2,yes,yes,yes,poor,yes,yes,ckd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,47.0,6700.0,4.9,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,54.0,7800.0,6.2,no,no,no,good,no,no,notckd
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,49.0,6600.0,5.4,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,51.0,7200.0,5.9,no,no,no,good,no,no,notckd


In [12]:
data.shape

(400, 25)

### 4. Suppose that I dropped every row that contained at least one missing value. (In the context of analysis with missing data, we call this a "complete case analysis," because we keep only the complete cases!) How many rows would remain in our dataframe? What are at least two downsides to doing this?

> There's a good visual on slide 15 of [this deck](https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf) that shows what a complete case analysis looks like if you're interested.

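Answer: The number of remaining rows is given by `data.dropna().shape[0]`. One downside is that we discard every valid value recorded in a row that is missing something elsewhere, which shrinks the sample considerably. A second downside is that the remaining complete cases are not necessarily a random sample of patients: if missingness is related to a patient's condition, the complete cases will be biased.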
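A minimal sketch of counting complete cases, using a small made-up frame named `toy` as a stand-in for `data` (the values are illustrative assumptions, not rows from the CKD file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab's `data` frame (values are illustrative only).
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0],
    "bp": [80.0, np.nan, 80.0, 80.0],
    "class": ["ckd", "ckd", "ckd", "notckd"],
})

# dropna() keeps only the rows with no missing values (the "complete cases").
complete_cases = toy.dropna()
n_complete = complete_cases.shape[0]
share_kept = n_complete / toy.shape[0]

print(f"{n_complete} of {toy.shape[0]} rows are complete ({share_kept:.0%})")
```

On the lab data, the same idea would be `data.dropna().shape[0]`.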

### 5. Thinking critically about how our data were gathered, it's likely that these records were gathered by doctors and nurses. Brainstorm three potential areas (in addition to the missing data we've already discussed) where this data might be inaccurate or imprecise.

Answer: 1. Any columns/features that involve human measurement may contain human error.
      2. A few columns record values that are highly transient and are drastically influenced by the timing of data acquisition.
      3. Some records could have been accidentally modified (adjusted/rounded) incorrectly while being recorded.

---

## Step 4: Model the data.

### 6. Suppose that I want to construct a model where no person who has CKD will ever be told that they do not have CKD. What (very simple, no machine learning needed) model can I create that will never tell a person with CKD that they do not have CKD?

> Hint: Don't think about `statsmodels` or `scikit-learn` here.

Answer: Build a model which informs each patient that they have CKD; the only error which can occur in this situation is a Type I Error, or a False Positive.

### 7. In problem 6, what common classification metric did we optimize for? Did we minimize false positives or negatives?

Answer: We optimized sensitivity: by minimizing false negatives (here, to zero), we maximized sensitivity and, equivalently, minimized the false negative rate.

### 8. Thinking ethically, what is at least one disadvantage to the model you described in problem 6?

Answer: One disadvantage is that the model produces an excess of False Positives: every healthy patient is told they have CKD, which causes needless anxiety and follow-up testing.

### 9. Suppose that I want to construct a model where a person who does not have CKD will never be told that they do have CKD. What (very simple, no machine learning needed) model can I create that will accomplish this?

Answer: The inverse of problem 6: Build a model which informs each patient that they do not have CKD; the only error which can occur is a Type II Error, or a False Negative.

### 10. In problem 9, what common classification metric did we optimize for? Did we minimize false positives or negatives?

Answer: We optimized specificity: by minimizing false positives (here, to zero), we maximized specificity and, equivalently, minimized the false positive rate.

### 11. Thinking ethically, what is at least one disadvantage to the model you described in problem 9?

Answer: The model will produce many False Negatives, preventing actual sick patients from seeking medical assistance. 
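The two trivial models from problems 6 and 9 can be checked numerically. The sketch below uses made-up labels (1 = has CKD, 0 = does not) and a hypothetical helper named `sensitivity_specificity`; neither comes from the lab data.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) with 1 = CKD as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels: three patients with CKD, two without.
y_true = np.array([1, 1, 1, 0, 0])

always_ckd = np.ones_like(y_true)   # problem 6: everyone is told "CKD"
never_ckd = np.zeros_like(y_true)   # problem 9: everyone is told "no CKD"

sens_always, spec_always = sensitivity_specificity(y_true, always_ckd)
sens_never, spec_never = sensitivity_specificity(y_true, never_ckd)

print("always CKD:", sens_always, spec_always)
print("never CKD: ", sens_never, spec_never)
```

Each baseline gets one metric perfect and the other at zero, which is why neither is useful as a diagnostic test on its own.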

### 12. Construct a logistic regression model in `sklearn` predicting class from the other variables. You may scale, select/drop, and engineer features as you wish - build a good model! Make sure, however, that you include at least one categorical/dummy feature and at least one quantitative feature.

> Hint: Remember to do a train/test split!

In [13]:
data.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class'],
      dtype='object')

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   rbc     248 non-null    object 
 6   pc      335 non-null    object 
 7   pcc     396 non-null    object 
 8   ba      396 non-null    object 
 9   bgr     356 non-null    float64
 10  bu      381 non-null    float64
 11  sc      383 non-null    float64
 12  sod     313 non-null    float64
 13  pot     312 non-null    float64
 14  hemo    348 non-null    float64
 15  pcv     329 non-null    float64
 16  wbcc    294 non-null    float64
 17  rbcc    269 non-null    float64
 18  htn     398 non-null    object 
 19  dm      398 non-null    object 
 20  cad     398 non-null    object 
 21  appet   399 non-null    object 
 22  pe

In [15]:
data.drop(['rbc','pc','sod','pot','pcv','wbcc','rbcc'], axis=1, inplace=True)

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   pcc     396 non-null    object 
 6   ba      396 non-null    object 
 7   bgr     356 non-null    float64
 8   bu      381 non-null    float64
 9   sc      383 non-null    float64
 10  hemo    348 non-null    float64
 11  htn     398 non-null    object 
 12  dm      398 non-null    object 
 13  cad     398 non-null    object 
 14  appet   399 non-null    object 
 15  pe      399 non-null    object 
 16  ane     399 non-null    object 
 17  class   400 non-null    object 
dtypes: float64(9), object(9)
memory usage: 56.4+ KB


In [18]:
#Model Construction
data_v1 = data.copy()  # copy, so that cleaning data_v1 does not also modify data

In [19]:
data_v1.replace(np.nan, 0, inplace=True)

In [20]:
data_v1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     400 non-null    float64
 1   bp      400 non-null    float64
 2   sg      400 non-null    float64
 3   al      400 non-null    float64
 4   su      400 non-null    float64
 5   pcc     400 non-null    object 
 6   ba      400 non-null    object 
 7   bgr     400 non-null    float64
 8   bu      400 non-null    float64
 9   sc      400 non-null    float64
 10  hemo    400 non-null    float64
 11  htn     400 non-null    object 
 12  dm      400 non-null    object 
 13  cad     400 non-null    object 
 14  appet   400 non-null    object 
 15  pe      400 non-null    object 
 16  ane     400 non-null    object 
 17  class   400 non-null    object 
dtypes: float64(9), object(9)
memory usage: 56.4+ KB


In [21]:
data_v1['pcc'].value_counts()

notpresent    354
present        42
0               4
Name: pcc, dtype: int64

In [22]:
data_v1['pcc'].replace(0, 'present', inplace=True)

In [23]:
data_v1['pcc'].value_counts()

notpresent    354
present        46
Name: pcc, dtype: int64

In [24]:
data_v1['ba'].value_counts()

notpresent    374
present        22
0               4
Name: ba, dtype: int64

In [25]:
data_v1['ba'].replace(0, 'present', inplace=True)

In [26]:
data_v1['ba'].value_counts()

notpresent    374
present        26
Name: ba, dtype: int64

In [27]:
data_v1['htn'].value_counts()

no     251
yes    147
0        2
Name: htn, dtype: int64

In [30]:
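# Note: the counts below list 2 rows as 'present', not 'yes', so in the run shown
# these missing values had already been filled with 'present' by an earlier cell.
data_v1['htn'].replace(0, 'yes', inplace=True)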

In [31]:
data_v1['htn'].value_counts()

no         251
yes        147
present      2
Name: htn, dtype: int64

In [32]:
data_v1['dm'].value_counts()

no     261
yes    137
0        2
Name: dm, dtype: int64

In [33]:
data_v1['dm'].replace(0, 'yes', inplace=True)

In [34]:
data_v1['dm'].value_counts()

no     261
yes    139
Name: dm, dtype: int64

In [35]:
data_v1['cad'].value_counts()

no     364
yes     34
0        2
Name: cad, dtype: int64

In [36]:
data_v1['cad'].replace(0, 'yes', inplace=True)

In [37]:
data_v1['cad'].value_counts()

no     364
yes     36
Name: cad, dtype: int64

In [40]:
data_v1['appet'].value_counts()

good    317
poor     82
0         1
Name: appet, dtype: int64

In [41]:
data_v1['appet'].replace(0, 'poor', inplace=True)

In [42]:
data_v1['appet'].value_counts()

good    317
poor     83
Name: appet, dtype: int64

In [43]:
data_v1['pe'].value_counts()

no     323
yes     76
0        1
Name: pe, dtype: int64

In [44]:
data_v1['pe'].replace(0, 'yes', inplace=True)

In [45]:
data_v1['pe'].value_counts()

no     323
yes     77
Name: pe, dtype: int64

In [46]:
data_v1['ane'].value_counts()

no     339
yes     60
0        1
Name: ane, dtype: int64

In [47]:
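# Bug: this call targets 'pe' again, so the 0 placeholder in 'ane' is left in place
# (see the counts below). The intended call is data_v1['ane'].replace(0, 'yes', inplace=True).
data_v1['pe'].replace(0, 'yes', inplace=True)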

In [48]:
data_v1['ane'].value_counts()

no     339
yes     60
0        1
Name: ane, dtype: int64

In [49]:
data_v1['class'].value_counts()

ckd       250
notckd    150
Name: class, dtype: int64

In [50]:
data_v1 = pd.get_dummies(data=data_v1, drop_first=True)

In [53]:
data_v1.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'hemo', 'pcc_present',
       'ba_present', 'htn_present', 'htn_yes', 'dm_yes', 'cad_yes',
       'appet_poor', 'pe_yes', 'ane_no', 'ane_yes', 'class_notckd'],
      dtype='object')

In [54]:
data_v1.drop('class_notckd', axis=1, inplace=True)

In [55]:
data_v1['class'] = data['class']

In [56]:
data_v1.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'hemo', 'pcc_present',
       'ba_present', 'htn_present', 'htn_yes', 'dm_yes', 'cad_yes',
       'appet_poor', 'pe_yes', 'ane_no', 'ane_yes', 'class'],
      dtype='object')

In [57]:
data_v1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          400 non-null    float64
 1   bp           400 non-null    float64
 2   sg           400 non-null    float64
 3   al           400 non-null    float64
 4   su           400 non-null    float64
 5   bgr          400 non-null    float64
 6   bu           400 non-null    float64
 7   sc           400 non-null    float64
 8   hemo         400 non-null    float64
 9   pcc_present  400 non-null    uint8  
 10  ba_present   400 non-null    uint8  
 11  htn_present  400 non-null    uint8  
 12  htn_yes      400 non-null    uint8  
 13  dm_yes       400 non-null    uint8  
 14  cad_yes      400 non-null    uint8  
 15  appet_poor   400 non-null    uint8  
 16  pe_yes       400 non-null    uint8  
 17  ane_no       400 non-null    uint8  
 18  ane_yes      400 non-null    uint8  
 19  class   

In [58]:
v1col_list= list(data_v1.columns)

In [71]:
v1_features = [col for col in v1col_list if col != 'class']  # every column except the target

In [73]:
X = data_v1[v1_features]
y = data_v1['class']

In [74]:
poly = PolynomialFeatures(include_bias=False, degree=2)

In [None]:
X_poly = poly.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, random_state = 42)
ss = StandardScaler()

In [None]:
ss.fit(X_train)

In [None]:
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.fit(X_train_sc, y_train)

In [None]:
logreg.score(X_train_sc, y_train)

---

## Step 5: Evaluate the model.

### 13. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your quantitative features.

### 14. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your categorical/dummy features.

In [None]:
logreg.coef_

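Interpretation: The first coefficient, 0.11018577, belongs to `age`. Because the features were standardized before fitting, it describes a one-standard-deviation increase in age rather than one additional year: the log-odds of the modeled class rise by about 0.11, which multiplies the odds by roughly exp(0.11) ≈ 1.12. Two caveats apply. `sklearn` models the log-odds of `logreg.classes_[1]`, which is `notckd` when the string labels are sorted alphabetically, so the sign should be checked against `logreg.classes_` before concluding that older respondents are more likely to have CKD. Also, since `age` appears in squared and interaction terms from `PolynomialFeatures`, this coefficient is not the whole effect of age.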
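A coefficient on the log-odds scale can be turned into an odds ratio by exponentiating it. A minimal sketch using the value quoted above (the variable names are illustrative):

```python
import math

coef = 0.11018577  # coefficient quoted in the interpretation above

# exp(coefficient) is the factor by which the odds are multiplied
# for a one-unit increase in the (scaled) feature.
odds_ratio = math.exp(coef)

print(f"odds ratio: {odds_ratio:.3f}")
```

An odds ratio above 1 means the odds of the modeled class increase with the feature; below 1 means they decrease.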

### 15. Despite being a relatively simple model, logistic regression is very widely used in the real world. Why do you think that's the case? Name at least two advantages to using logistic regression as a modeling technique.

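Answer: Logistic regression's simplicity is a large part of its appeal: it is quick to fit, has few tuning parameters, and its results are easy to explain.
Advantages of Logistic Regression:
      1.  Logistic Regression predicts a continuous quantity, the probability of class membership, rather than only a label.
      2.  Logistic Regression connects regression and classification: it is a linear model for the log-odds, so each coefficient can be interpreted.
      3.  Because Logistic Regression predicts probabilities, the classification threshold can be adjusted to trade sensitivity against specificity.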

### 16. Does it make sense to generate a confusion matrix on our training data or our test data? Why? Generate it on the proper data.

> Hint: Once you've generated your predicted $y$ values and you have your observed $y$ values, then it will be easy to [generate a confusion matrix using sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

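Answer: The confusion matrix should be generated on the test data. The model has already seen the training data, so a confusion matrix built there would overstate its performance; the test data shows how the model behaves on unseen patients.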

In [None]:
predictions = logreg.predict(X_test_sc)

In [None]:
cm = confusion_matrix(y_test, predictions)

In [None]:
cm

In [None]:
# confusion_matrix orders string labels alphabetically, so row/column 0 is 'ckd' and 1 is 'notckd'
cm = pd.DataFrame(cm, columns=['Predicted CKD','Predicted Not CKD'], index=['Actual CKD','Actual Not CKD'])

In [None]:
cm
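Sensitivity and specificity can be read straight off a confusion matrix. The sketch below uses made-up label lists (`y_obs`, `y_hat`) in place of `y_test` and `predictions`, and passes `labels` explicitly so that `ckd` is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up observed and predicted labels, standing in for y_test / predictions.
y_obs = ["ckd", "ckd", "ckd", "notckd", "notckd", "ckd"]
y_hat = ["ckd", "notckd", "ckd", "notckd", "ckd", "ckd"]

# Listing 'notckd' first makes 'ckd' the positive class, so ravel()
# returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_obs, y_hat, labels=["notckd", "ckd"]).ravel()

sensitivity = tp / (tp + fn)  # share of CKD patients correctly flagged
specificity = tn / (tn + fp)  # share of non-CKD patients correctly cleared

print(sensitivity, specificity)
```

Without the `labels` argument, `sklearn` sorts the string labels, which would put `ckd` in the first row and column instead.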

### 17. In this hospital case, we want to predict CKD. Do we want to optimize for sensitivity, specificity, or something else? Why? (If you don't think there's one clear answer, that's okay! There rarely is. Be sure to defend your conclusion!)

Answer: In this hospital case, we want to optimize for **sensitivity** (also called **recall**). Predicting that someone has CKD when they do not (a False Positive/Type I Error) is a mistake, but predicting that someone does not have CKD when they actually do (a False Negative/Type II Error) would be worse.
A False Positive/Type I Error can be remedied within a few months when the patient is evaluated once more.
A False Negative/Type II Error, on the other hand, would result in a sick person going about their life as if they were healthy, potentially causing irrevocable physiological damage.


### 18 (BONUS). Write a function that will create an ROC curve for you, then plot the ROC curve.

Here's a strategy you might consider:
1. In order to even begin, you'll need some fit model. Use your logistic regression model from problem 12.
2. We want to look at all values of your "threshold" - that is, anything where the predicted probability (`.predict_proba()` in `sklearn`) is above your threshold falls in the "positive class," and anything that is below your threshold falls in the "negative class." Start the threshold at 0.
3. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
4. Increment your threshold by some "step." Maybe set your step to be 0.01, or even smaller.
5. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
6. Repeat steps 4 and 5 until you get to the threshold of 1.
7. Plot the values of sensitivity and 1 - specificity.
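The strategy above can be sketched as follows. This is one possible implementation, not the lab's reference solution: `roc_points` and `plot_roc` are hypothetical helper names, and the labels and probabilities are made up (1 = CKD, 0 = not CKD).

```python
import numpy as np

def roc_points(y_true, probs, n_steps=100):
    """Sweep thresholds from 0 to 1; return (1 - specificity, sensitivity) arrays.

    y_true: 1 for the positive class (CKD), 0 otherwise.
    probs:  predicted probability of the positive class.
    """
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    fpr, tpr = [], []
    for threshold in np.linspace(0.0, 1.0, n_steps + 1):
        pred = (probs >= threshold).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        fp = np.sum((pred == 1) & (y_true == 0))
        tpr.append(tp / (tp + fn))  # sensitivity
        fpr.append(fp / (fp + tn))  # 1 - specificity
    return np.array(fpr), np.array(tpr)

def plot_roc(fpr, tpr):
    """Draw the ROC curve together with the chance diagonal."""
    import matplotlib.pyplot as plt  # imported here so roc_points works without it
    plt.plot(fpr, tpr, label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("1 - specificity")
    plt.ylabel("sensitivity")
    plt.legend()
    plt.show()

# Made-up labels and predicted probabilities of CKD.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.35, 0.4, 0.2, 0.7, 0.6, 0.1]

fpr, tpr = roc_points(y_true, probs)
```

Calling `plot_roc(fpr, tpr)` draws the curve. With the model from problem 12, `probs` would be the column of `logreg.predict_proba(X_test_sc)` that corresponds to `ckd` in `logreg.classes_`, and `y_true` would be `(y_test == 'ckd').astype(int)`.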

### 19. Suppose you're speaking with the biostatistics lead at Mayo Clinic, who asks you "Why are unbalanced classes generally a problem? Are they a problem in this particular CKD analysis?" How would you respond?

Answer: Unbalanced classes are generally a problem because the model sees too few examples of the minority class to learn its signal. A model can then score a high accuracy simply by predicting the majority class most of the time, while the minority class is largely overlooked.

In this CKD analysis the imbalance is mild: 250 of the 400 patients (62.5%) have CKD and 150 (37.5%) do not, so it is unlikely to be a serious problem. To check, I would offer to build two versions of the model: one fit on the classes as they are, and one fit after balancing the classes by either 1. **oversampling the minority** (non-CKD cases) or 2. **undersampling the majority** (CKD cases). Comparing the two would show whether the imbalance makes a measurable difference in the results.

In [75]:
data['class'].value_counts()

ckd       250
notckd    150
Name: class, dtype: int64

In [76]:
data_v1['class'].value_counts()

ckd       250
notckd    150
Name: class, dtype: int64

### 20. Suppose you're speaking with a doctor at Mayo Clinic who, despite being very smart, doesn't know much about data science or statistics. How would you explain why unbalanced classes are generally a problem to this doctor?

Answer:

### 21. Let's create very unbalanced classes just for the sake of this example! Generate very unbalanced classes by [bootstrapping](http://stattrek.com/statistics/dictionary.aspx?definition=sampling_with_replacement) (a.k.a. random sampling with replacement) the majority class.

1. The majority class are those individuals with CKD.
2. Generate a random sample of size 200,000 of individuals who have CKD **with replacement**. (Consider setting a random seed for this part!)
3. Create a new dataframe with the original data plus this random sample of data.
4. Now we should have a dataset with around 200,000 observations (200,400, to be exact), of which only about 0.075% are non-CKD individuals.
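The steps above can be sketched with `DataFrame.sample`. The frame `toy` below is a stand-in with the same class counts as the lab data (250 CKD, 150 non-CKD); its single feature is random noise and purely an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the lab's frame: 250 CKD rows and 150 non-CKD rows.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "hemo": rng.normal(12.5, 2.0, size=400),  # illustrative noise, not real values
    "class": ["ckd"] * 250 + ["notckd"] * 150,
})

# Sample 200,000 CKD rows with replacement; random_state fixes the draw.
ckd_rows = toy[toy["class"] == "ckd"]
ckd_boot = ckd_rows.sample(n=200_000, replace=True, random_state=42)

# Original data plus the bootstrapped majority-class rows.
unbalanced = pd.concat([toy, ckd_boot], ignore_index=True)

share_notckd = (unbalanced["class"] == "notckd").mean()
print(unbalanced.shape[0], f"{share_notckd:.4%}")
```

On the lab data, `toy` would be replaced by the cleaned frame, with the rest unchanged.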

### 22. Build a logistic regression model on the unbalanced class data and evaluate its performance using whatever method(s) you see fit. How would you describe the impact of unbalanced classes on logistic regression as a classifier?
> Be sure to look at how well it performs on non-CKD data.
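One way to look at per-class performance is to compute recall separately for each class. The sketch below is not fit on the lab data: it uses a synthetic single feature and a smaller majority class (20,000 rows, chosen only to keep the example quick), so the numbers it prints are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: one noisy feature that is lower, on average, for CKD patients.
n_ckd, n_notckd = 20_000, 150
X = np.concatenate([
    rng.normal(11.0, 2.0, n_ckd),
    rng.normal(14.0, 2.0, n_notckd),
]).reshape(-1, 1)
y = np.array(["ckd"] * n_ckd + ["notckd"] * n_notckd)

# stratify=y keeps the rare class in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = (pred == y_test).mean()
recall_ckd = recall_score(y_test, pred, pos_label="ckd")        # sensitivity for CKD
recall_notckd = recall_score(y_test, pred, pos_label="notckd")  # how often non-CKD is recognized

print(accuracy, recall_ckd, recall_notckd)
```

Comparing `accuracy` with `recall_notckd` shows whether a high overall score is hiding poor performance on the rare class.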

---

## Step 6: Answer the problem.

At this step, you would generally answer the problem! In this situation, you would likely present your model to doctors or administrators at the hospital and show how your model results in reduced false positives/false negatives. Next steps would be to find a way to roll this model and its conclusions out across the hospital so that the outcomes of patients with CKD (and without CKD!) can be improved!