### Diabetic Retinopathy
Retinopathy is a medical condition of the eye characterized by damage to the blood vessels within the retina. It can cause changes in vision or vision loss. Diabetic patients are particularly susceptible because prolonged elevated blood sugar levels harm the delicate blood vessels of the eye. Regular eye examinations and careful management of diabetes are crucial for reducing the risk and for detecting retinopathy early enough for effective intervention.

#### Using logistic regression, we will build a risk score model for retinopathy in diabetes patients.
We will also cover the c-index, log transformations, and standardization.


In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load Data

Let's read the data from `retinopathy_lr.csv`.


## Investigate the Dataset

- What are the features?
- Display the first few records to take a quick look.


We see that the features are:

- `Age`: age (years)
- `Systolic_BP`: systolic blood pressure (mmHg)
- `Diastolic_BP`: diastolic blood pressure (mmHg)
- `Cholesterol`: cholesterol (mg/dL)


Let's create `X` and `y`, which store the input variables and the output variable, respectively.
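Here is a minimal sketch of that step on a toy DataFrame. The outcome column name `y` and all the values are assumptions for illustration only; the real file may name its outcome column differently.

```python
import pandas as pd

# Toy stand-in for the retinopathy data; the values are made up.
df = pd.DataFrame({
    "Age": [55.0, 61.2, 47.8, 70.1],
    "Systolic_BP": [120.5, 135.0, 110.2, 142.8],
    "Diastolic_BP": [80.1, 88.4, 75.0, 92.3],
    "Cholesterol": [190.0, 210.5, 175.3, 230.9],
    "y": [0, 1, 0, 1],  # hypothetical name for the outcome column
})

# Inputs: every column except the outcome. Output: the outcome column.
X = df.drop(columns=["y"])
y = df["y"]

print(X.shape, y.shape)
```

Dropping the outcome column by name keeps `X` correct even if the feature columns are reordered.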

Let's split the data into 75% for training and 25% for validation. We used `train_test_split` yesterday.
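As a rough sketch, the split could look like the following. The data here is synthetic, and the variable names and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 patients, 4 features, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# test_size=0.25 holds out 25% of the rows; random_state makes the split repeatable.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_val.shape)
```

Fixing `random_state` is optional, but it makes results comparable from one run to the next.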

In [None]:
from sklearn.model_selection import train_test_split

Let's take a look at the distributions of the features in `X_train`.

In [None]:
# Loop through all features in X_train
# and plot a histogram of each one: X_train[col].hist()

In [None]:
# See the distribution of the output variable

Logistic regression is sensitive to imbalanced class distributions, particularly when one class dominates the other. In such cases the model tends to favor the majority class and becomes worse at recognizing and predicting minority-class instances. Overall accuracy can still look high, but precision and recall for the minority class, which is usually the class of interest, suffer.
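To make the effect of imbalance concrete, here is a small sketch with made-up labels (95 negatives, 5 positives) and a degenerate "model" that always predicts the majority class. It is not the retinopathy data, only an illustration of why the class balance is worth checking.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)

# Recall = true positives / actual positives.
true_pos = np.sum((y_pred == 1) & (y_true == 1))
recall = true_pos / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
```

The accuracy looks strong even though not a single positive case is found, so accuracy alone can hide poor performance on the minority class.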

As we can see, the distributions are generally bell shaped, but with a slight rightward skew. We can reduce the skew by applying the log function to the data. Let's plot the log of the feature variables and see.
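The effect of the log can also be checked numerically. The sketch below uses a synthetic right-skewed feature (drawn from a log-normal distribution, not from the retinopathy data) and a small helper, `sample_skewness`, which is a hypothetical name introduced here for illustration.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment: near 0 if symmetric, positive for a right skew."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic right-skewed, strictly positive feature.
rng = np.random.default_rng(0)
feature = rng.lognormal(mean=4.5, sigma=0.4, size=5000)

skew_before = sample_skewness(feature)
skew_after = sample_skewness(np.log(feature))

print(f"skewness before log: {skew_before:.2f}, after log: {skew_after:.2f}")
```

Note that the log is only defined for strictly positive values, which is a reasonable assumption for measurements such as age, blood pressure, and cholesterol.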

In [None]:
# Plot a histogram of np.log(...) for each feature

In [None]:
# Use np.log to apply the log function to the train set and to the val set

## Scale the Data

Let's now scale (in this case, "standardize") our data so that each feature has a mean of zero and a standard deviation of 1. Recall that a standard normal distribution has a mean of zero and a standard deviation of 1.
Some algorithms, particularly those that are fit by gradient-based optimization or that use regularization, work much better when we scale the data.
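A sketch of standardization with scikit-learn's `StandardScaler`, on synthetic data. The key design choice is to learn the mean and standard deviation from the training set only and then reuse them for the validation set. The array names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with a mean near 50 and a standard deviation near 10.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
X_val = rng.normal(loc=50.0, scale=10.0, size=(50, 4))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data
X_val_scaled = scaler.transform(X_val)          # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
print(X_val_scaled.mean(axis=0).round(3), X_val_scaled.std(axis=0).round(3))
```

The scaled training set has mean 0 and standard deviation 1 by construction. The validation set is only approximately standardized, because it is transformed with the training statistics rather than its own.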

In [None]:
# Standardize the features with StandardScaler
from sklearn.preprocessing import StandardScaler


After transforming the training and validation sets, we expect the training set to be centered at zero with a standard deviation of 1. Let's take a look at the distributions of the transformed training data.

In [None]:
# Optionally use the seaborn library for these plots
# import seaborn as sns

## Create the Model

Now we are ready to build the risk model by training logistic regression with our data.



If you get a warning message about the `solver` parameter, you can set it explicitly with `solver='lbfgs'`.
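A minimal sketch of fitting the model, using synthetic data in place of the preprocessed retinopathy features. The variable name `model` and the way the synthetic outcome is generated are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic standardized features; the outcome depends on the first two columns.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))
logits = 1.5 * X_train[:, 0] - 1.0 * X_train[:, 1]
y_train = (logits + rng.normal(size=300) > 0).astype(int)

# Naming the solver explicitly avoids the warning on versions that emit one.
model = LogisticRegression(solver="lbfgs")
model.fit(X_train, y_train)

print("coefficients:", model.coef_.round(2))
print("training accuracy:", round(model.score(X_train, y_train), 3))
```

The signs of the fitted coefficients should mirror how the synthetic outcome was built: positive for the first feature and negative for the second.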

In [None]:
# Create and fit the model

## Evaluate the Model

In [None]:
# Evaluate on the train set

In [None]:
# Evaluate on the val set

## Evaluate the Model Using the C-index

* The c-index measures the discriminatory power of a risk score.
* Intuitively, a higher c-index means that the model's risk scores agree more often with the actual outcomes when we compare pairs of patients.
* The formula for the c-index is:

$$ \mbox{cindex} = \frac{\mbox{concordant} + 0.5 \times \mbox{ties}}{\mbox{permissible}} $$

* A permissible pair is a pair of patients who have different outcomes.
* A concordant pair is a permissible pair in which the patient with the higher risk score also has the worse outcome.
* A tie is a permissible pair where the patients have the same risk score.
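The three definitions above can be traced on a tiny example. The sketch below uses made-up outcomes and risk scores for four patients and counts each kind of pair directly.

```python
from itertools import combinations

# Hypothetical outcomes and risk scores for four patients.
y_true = [0, 1, 1, 0]
scores = [0.2, 0.8, 0.5, 0.5]

permissible = concordant = ties = 0
for i, j in combinations(range(len(y_true)), 2):
    if y_true[i] == y_true[j]:
        continue  # same outcome: not a permissible pair
    permissible += 1
    if scores[i] == scores[j]:
        ties += 1  # permissible pair with equal risk scores
    elif (scores[i] > scores[j]) == (y_true[i] > y_true[j]):
        concordant += 1  # the higher score goes with the worse outcome

c_index = (concordant + 0.5 * ties) / permissible
print(permissible, concordant, ties, c_index)
```

Of the six possible pairs, four are permissible: three are concordant and one is a tie, which gives (3 + 0.5 × 1) / 4 = 0.875.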

## cindex

* The `cindex` function computes the c-index.
* `y_true` is the array of actual patient outcomes, 0 if the patient does not eventually get the disease, and 1 if the patient eventually gets the disease.
* `scores` is the risk score of each patient.  These provide relative measures of risk, so they can be any real numbers. By convention, they are always non-negative.
* Here is an example of input data and how to interpret it:
```Python
y_true = [0,1]
scores = [0.45, 1.25]
```
* There are two patients. Index 0 of each array is associated with patient 0. Index 1 is associated with patient 1.
* Patient 0 does not have the disease in the future (`y_true` is 0), and based on past information, has a risk score of 0.45.
* Patient 1 has the disease at some point in the future (`y_true` is 1), and based on past information, has a risk score of 1.25.

In [None]:
def cindex(y_true, scores):
    '''
    Input:
    y_true (np.array): a 1-D array of true binary outcomes (values of zero or one)
    scores (np.array): a 1-D array of corresponding risk scores output by the model
    Output:
    c_index
    '''
    n = len(y_true)
    concordant = 0
    permissible = 0
    ties = 0    
    for i in range(n):
        for j in range(i + 1, n):
            if (y_true[i] != y_true[j]):
                permissible = permissible + 1
                if (scores[i] == scores[j]):
                    ties = ties + 1
                    continue

                if y_true[i] == 0 and y_true[j] == 1:
                    if (scores[i] < scores[j]):
                        concordant = concordant + 1

                if y_true[i] == 1 and y_true[j] == 0:
                    if (scores[i] > scores[j]):
                        concordant = concordant + 1

    c_index = (concordant + 0.5 * ties)/permissible    
    return c_index

To get the predicted probabilities, we use the `predict_proba` method. This method returns the model's output *before* it is converted to a binary 0 or 1. For each input case, it returns an array of two values: the probability of the negative case (the patient does not get the disease) and the probability of the positive case (the patient gets the disease).
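The shape of the `predict_proba` output is easiest to see on a small example. The sketch below fits a model on synthetic data (not the retinopathy data); the names `model`, `proba`, and `scores` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the outcome depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver="lbfgs").fit(X, y)

proba = model.predict_proba(X[:5])
print(model.classes_)  # column order of predict_proba: here [0 1]
print(proba.shape)     # one row per patient, one column per class

# Column 1 holds the probability of the positive class, used as the risk score.
scores = proba[:, 1]
print(scores.round(3))
```

Each row sums to 1, so taking column 1 is enough to rank patients by risk.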

In [None]:
# Evaluate on the val set
# scores = model_X.predict_proba(X_val)[:, 1]
# c_index_val = cindex(y_val.values, scores)
# print("c-index on val set is {}".format(c_index_val))

# Good work!