## Week 4, Lab 2: Predicting Chronic Kidney Disease in Patients
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus on exploring data, building models, and evaluating the models we build.

There are three links you may find important:
- [A set of chronic kidney disease (CKD) data and other biological factors](./chronic_kidney_disease_full.csv).
- [The CKD data dictionary](./chronic_kidney_disease_header.txt).
- [An article comparing the use of k-nearest neighbors and support vector machines on predicting CKD](./chronic_kidney_disease.pdf).

## Step 1: Define the problem.

Suppose you're working for Mayo Clinic, widely recognized to be the top hospital in the United States. In your work, you've overheard nurses and doctors discuss test results, then arrive at a conclusion as to whether or not someone has developed a particular disease or condition. For example, you might overhear something like:

> **Nurse**: Male 57 year-old patient presents with severe chest pain. FDP _(short for fibrin degradation product)_ was elevated at 13. We did an echo _(echocardiogram)_ and it was inconclusive.

> **Doctor**: What was his interarm BP? _(blood pressure)_

> **Nurse**: Systolic was 140 on the right; 110 on the left.

> **Doctor**: Dammit, it's an aortic dissection! Get to the OR _(operating room)_ now!

> _(intense music playing)_

In this fictitious but [Shonda Rhimes-esque](https://en.wikipedia.org/wiki/Shonda_Rhimes#Grey's_Anatomy,_Private_Practice,_Scandal_and_other_projects_with_ABC) scenario, you might imagine the doctor going through a series of steps like a [flowchart](https://en.wikipedia.org/wiki/Flowchart), or a series of if-this-then-that steps to diagnose a patient. The first steps made the doctor ask what the interarm blood pressure was. Because interarm blood pressure took on the values it took on, the doctor diagnosed the patient with an aortic dissection.

Your goal, as a research biostatistical data scientist at the nation's top hospital, is to develop a medical test that can improve upon our current diagnosis system for [chronic kidney disease (CKD)](https://www.mayoclinic.org/diseases-conditions/chronic-kidney-disease/symptoms-causes/syc-20354521).

**Real-world problem**: Develop a medical diagnosis test that is better than our current diagnosis system for CKD.

**Data science problem**: Develop a medical diagnosis test that reduces both the number of false positives and the number of false negatives.

---

## Step 2: Obtain the data.

In [177]:
# Imports here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.linear_model import Lasso, LassoCV, LogisticRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

### 1. Read in the data.

In [2]:
data = pd.read_csv('../4.02-lab-classification-model-evaluation-master/chronic_kidney_disease_full.csv')

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
data.head(2)

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,18.0,0.8,,,11.3,38.0,6000.0,,no,no,no,good,no,no,ckd


In [5]:
data['class'].value_counts()

ckd       250
notckd    150
Name: class, dtype: int64

### 2. Check out the data dictionary. What are a few features or relationships you might be interested in checking out?

Answer: I would be interested in the relationship between blood pressure (`bp`), age (`age`), and `class`. Most people know their age and blood pressure, so it would be interesting to see whether certain blood pressure ranges are associated with CKD.

---

## Step 3: Explore the data.

### 3. How much of the data is missing from each column?

In [6]:
data.isna().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

In [7]:
data.shape

(400, 25)

### 4. Suppose that I dropped every row that contained at least one missing value. (In the context of analysis with missing data, we call this a "complete case analysis," because we keep only the complete cases!) How many rows would remain in our dataframe? What are at least two downsides to doing this?

> There's a good visual on slide 15 of [this deck](https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf) that shows what a complete case analysis looks like if you're interested.  
**Note:** You can clean your data below in step 4 when building a model!

**Answer:**
- If I dropped every row that contained at least one missing value, only 158 of the 400 rows would remain, so 242 rows would be dropped.  
- One downside is that we would have a much smaller dataset to train and test our model on, which would likely make the model generalize worse to future data.  
- Another downside is that we would lose the non-missing information in the dropped rows, and if the data are not missing completely at random, the remaining rows may be a biased sample of patients.  
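
To make the cost of a complete case analysis concrete, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are illustrative assumptions, not the CKD data. It compares the share of missing cells with the share of rows that `dropna()` discards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 patients, 3 measurements, one missing cell in 3 of the rows.
toy = pd.DataFrame({
    "age": [48, 7, np.nan, 62, 51],
    "bp":  [80, 50, 70, np.nan, 80],
    "sc":  [1.2, 0.8, 1.1, 3.8, np.nan],
})

n_rows = len(toy)
complete_cases = toy.dropna()                # keep only rows with no missing values
rows_lost = n_rows - len(complete_cases)     # rows discarded by the complete case analysis
cells_missing = int(toy.isna().sum().sum())  # individual values that were actually missing

print(f"rows kept: {len(complete_cases)} of {n_rows}")
print(f"rows lost: {rows_lost}")
print(f"cells missing: {cells_missing} of {toy.size}")
```

In this toy example only 3 of 15 cells are missing, yet 3 of 5 rows are dropped, which is the same pattern that takes the CKD data from 400 rows to 158.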

In [8]:
data.head(1)

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd


In [9]:
data.shape

(400, 25)

In [10]:
data.dropna().shape

(158, 25)

In [11]:
data['su'].unique()

array([ 0.,  3.,  4.,  1., nan,  2.,  5.])

In [13]:
cont_cols = data.describe().columns

In [14]:
data.head(1)

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd


In [15]:
num_cols = ['age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc']

In [16]:
# Fill missing numeric values with each column's median.
for n in num_cols:
    data[n] = data[n].fillna(data[n].median())

In [17]:
nom_cols =  ['sg','al','su','rbc','pc','pcc','ba','htn','dm','cad','appet','pe','ane',]

In [18]:
data['sg'].dtype

dtype('float64')

In [19]:
# Fill missing values in every nominal column with an explicit "Unknown" category.
# 'sg', 'al', and 'su' are stored as floats, so cast to object before adding the string label.
for n in nom_cols:
    data[n] = data[n].astype(object).fillna("Unknown")

In [20]:
data.shape

(400, 25)

In [21]:
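# Drop rows more than 3 standard deviations above the mean (upper tail only).
# The cutoff is recomputed for each column on the rows that remain.
for col in num_cols:
    outlier = abs(data[col].std() * 3) + abs(data[col].mean())
    data.drop(data[data[col] > outlier].index, inplace=True)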

In [22]:
data.shape

(366, 25)

In [23]:
data.describe()

Unnamed: 0,age,bp,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc
count,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0
mean,51.191257,75.68306,135.699454,50.694536,2.313251,138.412568,4.375956,12.767623,39.819672,8140.983607,4.795082
std,17.150311,11.678248,57.302202,36.224932,2.758354,5.797189,0.66561,2.635747,7.785497,2015.709023,0.789852
min,2.0,50.0,22.0,1.5,0.4,111.0,2.5,4.8,14.0,2200.0,2.1
25%,41.25,70.0,100.0,27.0,0.9,136.0,3.925,11.1,35.0,7000.0,4.5
50%,54.5,80.0,121.0,42.0,1.2,138.0,4.4,12.65,40.0,8000.0,4.8
75%,64.0,80.0,140.0,55.0,2.4,141.0,4.8,14.9,45.0,9200.0,5.2
max,90.0,110.0,360.0,202.0,18.0,150.0,7.6,17.8,54.0,15700.0,6.5


### 5. Thinking critically about how our data were gathered, it's likely that these records were gathered by doctors and nurses. Brainstorm three potential areas (in addition to the missing data we've already discussed) where this data might be inaccurate or imprecise.

**Answer:**
1) Data entry errors, such as typing a value incorrectly into the system.  
2) Handwriting that is misread when records are transcribed.  
3) A rushed process that leads to mistakes and sloppy measurements or records.  

---

## Step 4: Model the data.

### 6. Suppose that I want to construct a "model" where no person who has CKD will ever be told that they do not have CKD. What (very simple, no machine learning needed) model can I create that will never tell a person with CKD that they do not have CKD?

> Hint: Don't think about `statsmodels` or `scikit-learn` here.

**Answer:** Tell every person that they have CKD. If the "model" predicts CKD for everyone, no one who has CKD will ever be told that they do not have it.

### 7. In problem 6, what common classification metric did we optimize for? Did we minimize false positives or negatives?

**Answer:** We optimized for sensitivity (recall), which is true positives / (true positives + false negatives). By predicting CKD for everyone, we minimized false negatives: there are none, so sensitivity is 100%.

### 8. Thinking ethically, what is at least one disadvantage to the model you described in problem 6?

**Answer:** Every person who does not have CKD is told that they do (a false positive). This causes unnecessary stress and may lead to follow-up testing or treatment that costs money and carries its own risks.

### 9. Suppose that I want to construct a "model" where no person who does not have CKD will ever be told that they do have CKD. What (very simple, no machine learning needed) model can I create that will accomplish this?

**Answer:** Tell every person that they do not have CKD.

### 10. In problem 9, what common classification metric did we optimize for? Did we minimize false positives or negatives?

**Answer:** We optimized for specificity, which is true negatives / (true negatives + false positives). By predicting "no CKD" for everyone, we minimized false positives: there are none, so specificity is 100%.

### 11. Thinking ethically, what is at least one disadvantage to the model you described in problem 9?

**Answer:** A disadvantage is that every person who actually has CKD is told that they do not have it (a false negative), so they would not receive the treatment they need.
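
One way to read problems 6 through 11 is that both "models" are constant predictions: tell everyone they have CKD, or tell everyone they do not. The sketch below checks what each does to sensitivity and specificity. The labels in `y_true` are made up for illustration, and the helper functions are hypothetical names, not part of any library.

```python
import numpy as np

# Hypothetical labels: 1 = has CKD, 0 = does not have CKD.
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0])

def sensitivity(y_true, y_pred):
    """True positives divided by all actual positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    """True negatives divided by all actual negatives."""
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tn / (tn + fp)

always_ckd = np.ones_like(y_true)   # problem 6: tell everyone they have CKD
never_ckd = np.zeros_like(y_true)   # problem 9: tell everyone they do not have CKD

for name, y_pred in [("always CKD", always_ckd), ("never CKD", never_ckd)]:
    print(name, sensitivity(y_true, y_pred), specificity(y_true, y_pred))
```

Each constant model gets one metric perfect by giving up the other entirely, which is why neither is usable on its own.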

### 12. Construct a logistic regression model in `sklearn` predicting class from the other variables. You may scale, select/drop, and engineer features as you wish - build a good model! Make sure, however, that you include at least one categorical/dummy feature and at least one quantitative feature.

> Hint 1: Remember to do a train/test split!  
> Hint 2: This will require data cleaning first!
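
As an alternative to cleaning the DataFrame by hand, the imputation and encoding steps can live inside the model pipeline so they are learned from the training split only. The following is a sketch under stated assumptions: the DataFrame `toy`, its columns, and its target are synthetic stand-ins rather than the CKD data, and the choice of median and constant imputation is only one reasonable option.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): two numeric features and one categorical feature.
rng = np.random.default_rng(0)
n = 200
hemo = rng.normal(12.5, 2.5, n)
toy = pd.DataFrame({
    "hemo": hemo,
    "age": rng.integers(20, 80, n).astype(float),
    "htn": rng.choice(["yes", "no"], n).astype(object),
})
target = (hemo + rng.normal(0, 1, n) < 12).astype(int)  # 1 = "ckd" in this toy setup

# Knock out some values to mimic missing data.
toy.loc[rng.choice(n, 20, replace=False), "hemo"] = np.nan
toy.loc[rng.choice(n, 10, replace=False), "htn"] = np.nan

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, ["hemo", "age"]),
    ("cat", categorical, ["htn"]),
])
model = Pipeline([("prep", prep), ("lr", LogisticRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, random_state=42, stratify=target
)
model.fit(X_tr, y_tr)               # medians and categories come from X_tr only
test_acc = model.score(X_te, y_te)
print(f"test accuracy: {test_acc:.3f}")
```

Because the imputer is fit inside the pipeline, the test rows never influence the fill values, which avoids a subtle form of leakage.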

In [31]:
data = pd.get_dummies(columns=['cad'], data=data, drop_first=True)

In [254]:
data = pd.get_dummies(columns=['class'], data =data, drop_first=True)

In [259]:
X = data[['age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'wbcc', 'rbcc', 'cad_no', 'cad_yes']]

In [260]:
y = data['class_notckd']

In [261]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [262]:
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(penalty='l1', solver='liblinear'))
])

In [263]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l1', random_state=None,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [264]:
pipe.score(X_train, y_train)

0.9635036496350365

In [265]:
pipe.get_params

<bound method Pipeline.get_params of Pipeline(memory=None,
         steps=[('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l1', random_state=None,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)>

In [266]:
# Instantiate pipeline object.
pipe_2 = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(penalty='l1', solver='liblinear'))
])

In [267]:
pipe_2.get_params().keys()
# estimator.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'sc', 'lr', 'sc__copy', 'sc__with_mean', 'sc__with_std', 'lr__C', 'lr__class_weight', 'lr__dual', 'lr__fit_intercept', 'lr__intercept_scaling', 'lr__l1_ratio', 'lr__max_iter', 'lr__multi_class', 'lr__n_jobs', 'lr__penalty', 'lr__random_state', 'lr__solver', 'lr__tol', 'lr__verbose', 'lr__warm_start'])

In [268]:
pipe_2_params = {
#                 'lr__solver': ['newton-cg', 'lbfgs', 'liblinear'],
                'lr__penalty': ['l2', 'l1'],
                'lr__C': [100, 10, 1.0, 0.1, 0.01, 0.001],
                 'sc__with_mean': [True, False], 
                 'sc__with_std': [True, False],
}

In [269]:
pipe_2_gridsearch = GridSearchCV(pipe_2, # What is the model we want to fit?
                                 pipe_2_params, # What is the dictionary of hyperparameters?
                                 cv=5, # What number of folds in CV will we use?
                                 verbose=1)

In [270]:
pipe_2_gridsearch.fit(X_train, y_train);

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:    2.9s finished


In [271]:
pipe_2_gridsearch.best_score_

0.9490235690235691

In [272]:
pipe_2_gridsearch.score(X_train, y_train)

0.9635036496350365

In [273]:
pipe_2_gridsearch.best_estimator_

Pipeline(memory=None,
         steps=[('sc',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l1', random_state=None,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

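**Note:** The `GridSearchCV` object does not expose `coef_` directly. The fitted coefficients are available through `pipe_2_gridsearch.best_estimator_.named_steps['lr'].coef_`. Below, the model is instead refit manually on a new train/test split.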
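
For reference, coefficients can be read from a fitted grid search by going through the best pipeline it found. The sketch below uses synthetic data from `make_classification` and hypothetical names (`demo_pipe`, `demo_grid`, `x0` to `x4`), so the numbers it prints say nothing about the CKD model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); feature names are hypothetical.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(5)])

demo_pipe = Pipeline([
    ("sc", StandardScaler()),
    ("lr", LogisticRegression()),
])
demo_grid = GridSearchCV(demo_pipe, {"lr__C": [0.01, 0.1, 1.0, 10]}, cv=5)
demo_grid.fit(X_demo, y_demo)

# best_estimator_ is the refit pipeline; named_steps gives access to each step by name.
best_lr = demo_grid.best_estimator_.named_steps["lr"]
coef_demo = pd.DataFrame({
    "column": X_demo.columns,
    "coef": best_lr.coef_[0],
})
print(demo_grid.best_params_)
print(coef_demo)
```

This works because `GridSearchCV` refits the best pipeline on the full training data by default (`refit=True`).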

In [274]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = .33, stratify=y)

In [275]:
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [276]:
lr = LogisticRegression(penalty='l1', solver='liblinear', C = .10)

In [277]:
lr.fit(Z_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [278]:
print(lr.score(Z_train, y_train))
print(lr.score(Z_test, y_test))

0.9428571428571428
0.9256198347107438


In [279]:
X.head(1)

Unnamed: 0,age,bp,bgr,bu,sc,sod,pot,hemo,wbcc,rbcc,cad_no,cad_yes
0,48.0,80.0,121.0,36.0,1.2,138.0,4.4,15.4,7800.0,5.2,1,0


In [280]:
coef_df = pd.DataFrame({
    'column': X.columns,
    'coef'  : lr.coef_[0]
})
coef_df

Unnamed: 0,column,coef
0,age,0.0
1,bp,-0.233302
2,bgr,-0.263646
3,bu,0.0
4,sc,-0.008092
5,sod,0.264616
6,pot,0.0
7,hemo,1.952085
8,wbcc,0.0
9,rbcc,0.367285


---

## Step 5: Evaluate the model.

### 13. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your quantitative features.

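For every one-standard-deviation increase in hemoglobin (`hemo`), holding the other features constant, the log-odds that someone does not have CKD increase by about 1.95; that is, the odds of not having CKD are multiplied by roughly $e^{1.95} \approx 7$. (The unit is one standard deviation because the features were standardized before fitting.)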

### 14. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your categorical/dummy features.

Based on my logistic regression model, the coefficients for `cad_no` and `cad_yes` are zero, so, holding the other features constant, a patient's coronary artery disease status neither increases nor decreases the predicted odds of having CKD.
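
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to explain. The sketch below copies the coefficients from the `coef_df` output above into a new, hypothetically named `coef_df_demo`; because the model was fit on standardized features, each odds ratio is per one standard deviation of the feature.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the coef_df output above (target: class_notckd).
coef_df_demo = pd.DataFrame({
    "column": ["age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "wbcc", "rbcc"],
    "coef": [0.0, -0.233302, -0.263646, 0.0, -0.008092,
             0.264616, 0.0, 1.952085, 0.0, 0.367285],
})

# exp(coefficient) = multiplicative change in the odds of "not CKD"
# for a one-standard-deviation increase in that feature.
coef_df_demo["odds_ratio"] = np.exp(coef_df_demo["coef"])

print(coef_df_demo.sort_values("odds_ratio", ascending=False).round(3))
```

A coefficient of zero maps to an odds ratio of exactly 1 (no change), while the `hemo` coefficient maps to an odds ratio of about 7.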

### 15. Despite being a relatively simple model, logistic regression is very widely used in the real world. Why do you think that's the case? Name at least two advantages to using logistic regression as a modeling technique.

**Answer:**  
1) The coefficients are interpretable: each one describes how a feature changes the log-odds of the outcome, which makes the model easier to explain.  
2) The model fits quickly and requires little computational power.

### 16. Does it make sense to generate a confusion matrix on our training data or our test data? Why? Think about which data is used for model evaluation. Generate it on the proper data.

> Hint: Once you've generated your predicted $y$ values and you have your observed $y$ values, then it will be easy to [generate a confusion matrix using sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [281]:
y_preds = lr.predict(Z_test)

In [282]:
cm = metrics.confusion_matrix(y_test, y_preds)

In [283]:
pd.DataFrame(cm, columns=['predicted CKD', 'predicted NOT CKD'], index=['actual CKD', 'actual NOT CKD'])

Unnamed: 0,predicted CKD,predicted NOT CKD
actual CKD,65,6
actual NOT CKD,3,47
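
The four cells of this confusion matrix are enough to compute the usual classification metrics by hand. The sketch below treats CKD as the positive class, which is an assumption about framing (the fitted target is `class_notckd`, where CKD is coded 0).

```python
# Cell counts copied from the confusion matrix above, with CKD as the positive class.
tp, fn = 65, 6   # actual CKD: predicted CKD / predicted NOT CKD
fp, tn = 3, 47   # actual NOT CKD: predicted CKD / predicted NOT CKD

sensitivity = tp / (tp + fn)                 # share of CKD patients correctly flagged
specificity = tn / (tn + fp)                 # share of non-CKD patients correctly cleared
precision = tp / (tp + fp)                   # share of CKD predictions that are correct
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that are correct

print(f"sensitivity: {sensitivity:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"precision:   {precision:.3f}")
print(f"accuracy:    {accuracy:.3f}")
```

The accuracy computed this way matches the test score of about 0.926 reported by `lr.score(Z_test, y_test)`.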


### 17. In this hospital case, we want to predict CKD. Do we want to optimize for sensitivity, specificity, or something else? Why? (If you don't think there's one clear answer, that's okay! There rarely is. Be sure to defend your conclusion!)

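Answer: We want to optimize for sensitivity, because a false negative (telling someone with CKD that they are healthy) delays treatment, whereas a false positive can usually be caught by follow-up testing. Specificity still matters, since too many false positives would burden healthy patients with unnecessary tests.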

### 18 (BONUS). Write a function that will create an ROC curve for you, then plot the ROC curve.

Here's a strategy you might consider:
1. In order to even begin, you'll need some fit model. Use your logistic regression model from problem 12.
2. We want to look at all values of your "threshold" - that is, anything where `.predict_proba()` gives you a probability above your threshold falls in the "positive class," and anything that is below your threshold falls in the "negative class." Start the threshold at 0.
3. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
4. Increment your threshold by some "step." Maybe set your step to be 0.01, or even smaller.
5. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
6. Repeat steps 4 and 5 until you get to the threshold of 1.
7. Plot the values of sensitivity and 1 - specificity.
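
The strategy above can be sketched as follows. This is one possible implementation, not the lab's official solution: the function names `roc_points` and `plot_roc` are hypothetical, and the model is fit on synthetic data from `make_classification` rather than on the CKD data so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def roc_points(y_true, y_prob, step=0.01):
    """Return (1 - specificity, sensitivity) arrays for thresholds from 0 to 1."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    thresholds = np.linspace(0, 1, int(round(1 / step)) + 1)
    fpr, tpr = [], []
    for t in thresholds:
        y_pred = (y_prob > t).astype(int)   # "above the threshold" means positive
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        tpr.append(tp / (tp + fn))          # sensitivity
        fpr.append(fp / (fp + tn))          # 1 - specificity
    return np.array(fpr), np.array(tpr)

def plot_roc(fpr, tpr):
    """Plot sensitivity against 1 - specificity, with a chance-level diagonal."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label="model")
    ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
    ax.set_xlabel("1 - specificity")
    ax.set_ylabel("sensitivity")
    ax.set_title("ROC curve")
    ax.legend()
    return ax

# Synthetic stand-in data (assumption), used only to produce predicted probabilities.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
demo_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)

y_prob = demo_model.predict_proba(X_te)[:, 1]   # probability of the positive class
fpr, tpr = roc_points(y_te, y_prob)
```

Calling `plot_roc(fpr, tpr)` then draws the curve. With the lab's model, the same functions would take `y_test` and `lr.predict_proba(Z_test)[:, 1]` as inputs.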

### 19. Suppose you're speaking with the biostatistics lead at Mayo Clinic, who asks you "Why are unbalanced classes generally a problem? Are they a problem in this particular CKD analysis?" How would you respond?

**Answer:**  
1. Unbalanced classes are generally a problem because a model can achieve high accuracy by almost always predicting the majority class. For example, if 99% of patients in the data do not have CKD and only 1% do, a model that predicts "no CKD" for everyone is 99% accurate yet never identifies a single CKD patient.  
2. In this particular CKD analysis, unbalanced classes are not a serious problem: the original data contain 250 CKD and 150 non-CKD patients (62.5% vs. 37.5%), so both classes are well represented.

### 20. Suppose you're speaking with a doctor at Mayo Clinic who, despite being very smart, doesn't know much about data science or statistics. How would you explain why unbalanced classes are generally a problem to this doctor?

**Answer:** Imagine that 99 out of every 100 patient records we learn from belong to people without CKD. A model can then look 99% accurate simply by saying that no one has CKD, which means it would miss every patient who actually has the disease.
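
The 99%-versus-1% scenario is easy to check numerically. The sketch below uses a made-up population of 1,000 patients (an assumption for illustration) and a "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical population: 990 patients without CKD (0) and 10 with CKD (1).
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the majority class (no CKD).
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
ckd_found = np.sum((y_pred == 1) & (y_true == 1))
sensitivity = ckd_found / np.sum(y_true == 1)

print(f"accuracy:    {accuracy:.2f}")
print(f"sensitivity: {sensitivity:.2f}")
```

Accuracy is 0.99 while sensitivity is 0, so accuracy alone would hide the fact that every CKD patient is missed.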

### 21. Let's create very unbalanced classes just for the sake of this example! Generate very unbalanced classes by [bootstrapping](http://stattrek.com/statistics/dictionary.aspx?definition=sampling_with_replacement) (a.k.a. random sampling with replacement) the majority class.

1. The majority class are those individuals with CKD.
2. Generate a random sample of size 200,000 of individuals who have CKD **with replacement**. (Consider setting a random seed for this part!). The [`pandas .sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) method may be _very_ useful here!
3. Create a new dataframe with the original data plus this random sample of data.
4. Now we should have a dataset with just over 200,000 observations, of which only about 0.075% are non-CKD individuals.
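
The steps above can be sketched on a miniature table. The DataFrame `toy` and its values are made up, and the sample size is scaled down from 200,000 to 1,000 to keep the example small; filtering to the majority class before calling `.sample()` is one way to make sure only CKD rows are drawn.

```python
import pandas as pd

# Hypothetical miniature of the CKD table: 5 CKD rows and 3 non-CKD rows.
toy = pd.DataFrame({
    "hemo": [9.1, 10.4, 11.2, 8.7, 12.0, 15.1, 14.8, 16.0],
    "class": ["ckd"] * 5 + ["notckd"] * 3,
})

# Steps 1-2: keep only the majority class, then sample it with replacement.
majority = toy[toy["class"] == "ckd"]
boot = majority.sample(n=1000, replace=True, random_state=42)

# Step 3: original data plus the bootstrapped majority rows.
unbalanced = pd.concat([toy, boot], ignore_index=True)

# Step 4: check the class balance of the result.
share_notckd = (unbalanced["class"] == "notckd").mean()
print(unbalanced["class"].value_counts())
print(f"share of non-CKD rows: {share_notckd:.4f}")
```

Only the 3 original non-CKD rows remain in the minority class, so their share shrinks as the bootstrap sample grows.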

In [290]:
new_data = data[['age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'wbcc', 'rbcc', 'cad_no', 'cad_yes', 'class_notckd']]

In [301]:
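# Note: weights='class_notckd' gives zero sampling weight to rows where class_notckd == 0,
# so every bootstrapped row is a non-CKD patient (not a CKD patient, as step 1 describes).
new_df = new_data.sample(n=200_000, random_state=1, weights='class_notckd', replace=True)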

In [302]:
new_df.shape

(200000, 13)

In [303]:
# Relabel the first 15,000 bootstrapped rows as CKD (0).
# Only the label changes; the feature values are still those of non-CKD patients.
new_df.iloc[:15000, new_df.columns.get_loc('class_notckd')] = 0

In [305]:
new_df['class_notckd'].value_counts()

1    185000
0     15000
Name: class_notckd, dtype: int64

In [307]:
new_data.shape

(366, 13)

In [308]:
sample_df = pd.concat([new_data, new_df])

In [309]:
sample_df.shape

(200366, 13)

### 22. What do you expect will be the impact of unbalanced classes on your logistic regression model?

**Answer:** I expect that the model will have a hard time identifying people who have CKD. Because the classes are so unbalanced, the model can score well by predicting that almost everyone does not have CKD.

### 23. Build a logistic regression model on the unbalanced class data and evaluate its performance using whatever method(s) you see fit. 
> Be sure to look at how well it performs on non-CKD data.

In [311]:
sample_df.head(1)

Unnamed: 0,age,bp,bgr,bu,sc,sod,pot,hemo,wbcc,rbcc,cad_no,cad_yes,class_notckd
0,48.0,80.0,121.0,36.0,1.2,138.0,4.4,15.4,7800.0,5.2,1,0,0


In [323]:
type(sample_df)

pandas.core.frame.DataFrame

In [330]:
sample_df.columns[:-1]

Index(['age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'wbcc', 'rbcc',
       'cad_no', 'cad_yes'],
      dtype='object')

In [324]:
sample_df.shape

(200366, 13)

In [333]:
X = sample_df[sample_df.columns[:-1]]
y = sample_df['class_notckd']

In [334]:
X.shape

(200366, 12)

In [335]:
y.shape

(200366,)

In [336]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

In [338]:
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [339]:
logreg1 = LogisticRegression(penalty='l1', solver='liblinear', C = .10)

In [340]:
logreg1.fit(Z_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [341]:
print(logreg1.score(Z_train, y_train))
print(logreg1.score(Z_test, y_test))

0.9239301277514991
0.9249557629194961
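
Accuracy alone says little on unbalanced data, so it helps to look at the confusion matrix and the recall of each class separately. The sketch below does this on synthetic data from `make_classification` with a 95%/5% class split; the data and variable names are assumptions, not the `sample_df` built above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic unbalanced data (assumption): about 95% class 0 and 5% class 1.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

demo_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

accuracy = demo_model.score(X_te, y_te)
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])   # rows: actual, columns: predicted
per_class_recall = recall_score(y_te, y_pred, labels=[0, 1], average=None)
balanced_acc = balanced_accuracy_score(y_te, y_pred)  # mean of the per-class recalls

print(f"accuracy: {accuracy:.3f}")
print(f"recall for class 0 and class 1: {np.round(per_class_recall, 3)}")
print(f"balanced accuracy: {balanced_acc:.3f}")
print(cm)
```

Comparing the recall of the minority class with the overall accuracy shows directly whether a high score comes from the majority class alone.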


### 24. Do the results of your model above align with your expectations of the impact of unbalanced classes on logistic regression? If not, do you have any thoughts on why your model, considering the data, is performing how it is?

**Answer:** Yes, the model performed as I expected. Train and test accuracy are nearly identical (about 0.92), which is roughly the share of the majority class in `sample_df` (185,000 of the 200,000 bootstrapped rows are labeled non-CKD). A model that always predicted the majority class would score about the same, so accuracy alone does not show that the model can identify patients with CKD; a confusion matrix or the sensitivity on the minority class is needed to check that.

---

## Step 6: Answer the problem. (Nothing to do here...except think about it!)

At this step, you would generally answer the problem! In this situation, you would likely present your model to doctors or administrators at the hospital and show how your model results in reduced false positives/false negatives. Next steps would be to find a way to roll this model and its conclusions out across the hospital so that the outcomes of patients with CKD (and without CKD!) can be improved!