# Logistic Regression

In this notebook you will use GPU-accelerated logistic regression to predict infection risk based on features of our population members.

## Objectives

By the time you complete this notebook you will be able to:

- Use GPU-accelerated logistic regression

## Imports

In [1]:
import cudf
import cuml

import cupy as cp

## Load Data

In [2]:
gdf = cudf.read_csv('./data/pop_2-05.csv', usecols=['age', 'sex', 'infected'])

In [3]:
gdf.dtypes

age         float64
sex         float64
infected    float64
dtype: object

In [4]:
gdf.shape

(58479894, 3)

In [5]:
gdf.head()

,age,sex,infected
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,0.0
4,0.0,0.0,0.0


## Logistic Regression

Logistic regression can be used to estimate the probability of an outcome as a function of some (assumed independent) inputs. In our case, we would like to estimate infection risk based on population members' age and sex.
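
Written out for the two inputs used here, the model takes the form below. This is the standard logistic regression formulation rather than anything specific to cuML, and the symbols are borrowed from the code comments later in this notebook: $x_1$ is age, $x_2$ is sex, $m_1$ and $m_2$ are their coefficients, and $b$ is the intercept.

$$
\text{logit} = x_1 m_1 + x_2 m_2 + b, \qquad p = \frac{e^{\text{logit}}}{e^{\text{logit}} + 1} = \frac{1}{1 + e^{-\text{logit}}}
$$

The logit can be any real number, while $p$ always lies strictly between 0 and 1, which is what lets us read it as a probability.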

Here we create a cuML logistic regression instance `logreg`:

In [6]:
logreg = cuml.LogisticRegression()

## Exercise: Regress Infected Status

The `logreg.fit` method takes 2 arguments: the model's independent variables *X*, and the dependent variable *y*. Fit the `logreg` model using the `gdf` columns `age` and `sex` as *X* and the `infected` column as *y*.

#### Solution

In [9]:
# %load solutions/regress_infected
logreg.fit(gdf[['age', 'sex']], gdf['infected'])


LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=0, solver='qn', handle=<cuml.common.handle.Handle object at 0x7fe87fccd510>)

## Viewing the Regression

After fitting the model, we could use `logreg.predict` to estimate whether someone has more than a 50% chance of being infected. Since the virus has low prevalence in the population (around 1-2% in this dataset), though, individual probabilities of infection are well below 50%, and the model should correctly predict that no one is individually likely to be infected.
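
To see why hard predictions are uninformative here, the sketch below applies a 50% cutoff to a few probabilities. The values and the names `risks`, `THRESHOLD`, and `labels` are made up for illustration, and the cutoff simply mirrors the 50% rule described above rather than calling cuML:

```python
# Illustrative probabilities in the 1-3% range; not taken from the fitted model
risks = [0.011, 0.015, 0.020, 0.025]

# Assumed decision rule: label someone as infected only if risk exceeds 50%
THRESHOLD = 0.5
labels = [1 if r > THRESHOLD else 0 for r in risks]

print(labels)
```

Every label comes out as 0, so the hard predictions say little about who is more or less at risk.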

However, we also have access to the model coefficients at `logreg.coef_` as well as the intercept at `logreg.intercept_`. Both of these values are CUDA device arrays, the same kind we saw earlier when generating the `northing` and `easting` columns:

In [10]:
type(logreg.coef_)

numba.cuda.cudadrv.devicearray.DeviceNDArray

In [11]:
type(logreg.intercept_)

numba.cuda.cudadrv.devicearray.DeviceNDArray

To view these values, we need to use their `copy_to_host` methods, which will return CPU NumPy arrays that we can print.

In [12]:
logreg_coef = logreg.coef_.copy_to_host()
logreg_int = logreg.intercept_.copy_to_host()[0]

print("Coefficients: [age, sex]")
print([logreg_coef[0][0], logreg_coef[1][0]])

print("Intercept:")
print(logreg_int)

Coefficients: [age, sex]
[0.013795661578579912, 0.002492827409630907]
Intercept:
-4.757416365732666
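
One common way to read logistic regression coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of the outcome are multiplied when that input increases by one unit. The sketch below applies this to the values printed above, hard-coded on the CPU; the names `odds_ratio_age` and `odds_ratio_sex` are illustrative:

```python
import math

# Coefficients copied from the output above
coef_age = 0.013795661578579912
coef_sex = 0.002492827409630907

# exp(coefficient) = multiplicative change in odds per one-unit increase
odds_ratio_age = math.exp(coef_age)
odds_ratio_sex = math.exp(coef_sex)

print(f"Odds ratio per year of age: {odds_ratio_age:.4f}")
print(f"Odds ratio for sex = 1 vs sex = 0: {odds_ratio_sex:.4f}")
```

Both ratios are above 1, consistent with the positive signs of the coefficients.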


## Estimate Probability of Infection

As with all logistic regressions, the coefficients and intercept allow us to calculate the logit for each individual; from that, we can calculate the estimated risk of infection as a probability between 0 and 1.

In [13]:
# logit = x1 * m1 + x2 * m2 + b
exp_logit = cp.exp(gdf['age'] * logreg_coef[0][0] + 
                   gdf['sex'] * logreg_coef[1][0] + 
                   logreg_int)

# convert the logit to a probability of infection via the logistic function p = exp(logit) / (exp(logit) + 1)
gdf['risk'] = exp_logit / (exp_logit + 1)

Looking at the original records with their new estimated risks, we can see how estimated risk varies across individuals.

In [14]:
gdf.take(cp.random.choice(gdf.shape[0], size=5, replace=False))

,age,sex,infected,risk
34647609,16.0,1.0,0.0,0.010622
35021746,18.0,1.0,0.0,0.010915
15835550,42.0,0.0,0.0,0.015098
35712523,20.0,1.0,0.0,0.011217
9016082,25.0,0.0,0.0,0.011979
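
As a sanity check, the risk for a single record can be reproduced by hand on the CPU. The sketch below hard-codes the coefficients and intercept printed earlier and uses a hypothetical helper, `infection_risk`, to recompute the risk for a 42-year-old with sex = 0, one of the rows sampled above:

```python
import math

# Fitted values copied from the output printed earlier in this notebook
COEF_AGE = 0.013795661578579912
COEF_SEX = 0.002492827409630907
INTERCEPT = -4.757416365732666

def infection_risk(age, sex):
    """Hypothetical helper: logistic function applied to the linear logit."""
    logit = age * COEF_AGE + sex * COEF_SEX + INTERCEPT
    return math.exp(logit) / (math.exp(logit) + 1)

risk = infection_risk(age=42.0, sex=0.0)
print(round(risk, 6))
```

The result should agree with the `risk` column for that row (0.015098) up to rounding.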


## Exercise: Show Infection Prevalence is Related to Age

The positive coefficient on age suggests that the virus is more prevalent in older people, even when controlling for sex.

For this exercise, show that infection prevalence has some relationship to age by printing the mean `infected` values for the oldest and youngest members of the population when grouped by age:

#### Solution

In [16]:
# %load solutions/risk_by_age
age_groups = gdf[['age', 'infected']].groupby(['age'])
print(age_groups.mean().head())
print(age_groups.mean().tail())


     infected
age          
0.0  0.000000
1.0  0.000889
2.0  0.001960
3.0  0.002715
4.0  0.003586
      infected
age           
86.0  0.023417
87.0  0.023256
88.0  0.024569
89.0  0.024412
90.0  0.025017


## Exercise: Show Infection Prevalence is Related to Sex

Similarly, the positive coefficient on sex suggests that the virus is more prevalent in people with sex = 1 (females), even when controlling for age.

For this exercise, show that infection prevalence has some relationship to sex by printing the mean `infected` values for the population when grouped by sex:

#### Solution

In [18]:
# %load solutions/risk_by_sex
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
sex_groups.mean()


sex,infected
0.0,0.01014
1.0,0.020713


## Using Training and Test Data

cuML gives us a simple method for producing paired training/testing data:

In [19]:
x_train, x_test, y_train, y_test = cuml.train_test_split(
    gdf[['age', 'sex']], gdf['infected'], train_size=0.9)
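
Conceptually, a split like this shuffles the rows and sets aside a fraction of them for testing. The following is a minimal CPU sketch of that idea, not cuML's implementation; `split_rows` and its `seed` parameter are hypothetical names used only for illustration:

```python
import random

def split_rows(rows, train_size=0.9, seed=0):
    """Hypothetical helper: shuffle row indices, then cut at the train fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = list(range(100))  # stand-in for 100 records
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Every record lands in exactly one of the two sets, so the model can be evaluated on rows it never saw during fitting.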

## Exercise: Fit Logistic Regression Model Using Training Data

For this exercise, create a new logistic regression model `logreg`, and fit it with the *X* and *y* training data just created.

#### Solution

In [21]:
# %load solutions/fit_training
logreg = cuml.LogisticRegression()
logreg.fit(x_train, y_train)


LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=0, solver='qn', handle=<cuml.common.handle.Handle object at 0x7fe878352030>)

## Use Test Data to Validate Model

We can now use the same procedure as above to predict infection risk using the test data:

In [22]:
logreg_coef = logreg.coef_.copy_to_host()
logreg_int = logreg.intercept_.copy_to_host()[0]

exp_logit = cp.exp(x_test['age'] * logreg_coef[0][0] + 
                   x_test['sex'] * logreg_coef[1][0] + 
                   logreg_int)

y_test_pred = exp_logit / (exp_logit + 1)

As we saw before, very few people are actually infected in the population, even among the highest-risk groups. As a simple way to check our model, we split the test set into above-average predicted risk and below-average predicted risk, then observe that the prevalence of infections in each group closely matches that group's mean predicted risk.

In [23]:
test_results = cudf.DataFrame()
test_results['infected'] = y_test
test_results['predicted_risk'] = y_test_pred
test_results['high_risk'] = test_results['predicted_risk'] > test_results['predicted_risk'].mean()

risk_groups = test_results.groupby('high_risk')
risk_groups.mean()

high_risk,infected,predicted_risk
False,0.011525,0.011631
True,0.020329,0.020163
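
To put a number on how closely the two columns agree, the sketch below computes the gap between observed prevalence and mean predicted risk for each group, using the values from the table above hard-coded on the CPU. The dictionary layout and variable names are illustrative:

```python
# (observed prevalence, mean predicted risk) copied from the table above
groups = {
    "below-average risk": (0.011525, 0.011631),
    "above-average risk": (0.020329, 0.020163),
}

gaps = {}
for name, (observed, predicted) in groups.items():
    # Absolute difference between what happened and what the model expected
    gaps[name] = abs(observed - predicted)
    print(f"{name}: observed={observed:.6f}, predicted={predicted:.6f}, gap={gaps[name]:.6f}")
```

In both groups the gap is a small fraction of the prevalence itself.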


<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will use a GPU-accelerated k-nearest-neighbors algorithm to locate the nearest road nodes to each hospital.