<a href="https://colab.research.google.com/github/geocarvalho/r-bioinfo-ds/blob/master/statquest/machine-learning/16_logistic_regression_in_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 16 - [Logistic Regression in R](https://www.youtube.com/watch?v=C4N3_XJJ-jU&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=16)

* Open the Heart Disease Dataset


In [21]:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
print(head(data))

  V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
1 63  1  1 145 233  1  2 150  0 2.3   3 0.0 6.0   0
2 67  1  4 160 286  0  2 108  1 1.5   2 3.0 3.0   2
3 67  1  4 120 229  0  2 129  1 2.6   2 2.0 7.0   1
4 37  1  3 130 250  0  0 187  0 3.5   3 0.0 3.0   0
5 41  0  2 130 204  0  2 172  0 1.4   1 0.0 3.0   0
6 56  1  2 120 236  0  0 178  0 0.8   1 0.0 3.0   0


* Name the columns after the names that were listed on the UCI website

In [22]:
colnames(data) <- c(
"age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
"exang", "oldpeak", "slope", "ca", "thal", "hd")
print(head(data))

  age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal hd
1  63   1  1      145  233   1       2     150     0     2.3     3 0.0  6.0  0
2  67   1  4      160  286   0       2     108     1     1.5     2 3.0  3.0  2
3  67   1  4      120  229   0       2     129     1     2.6     2 2.0  7.0  1
4  37   1  3      130  250   0       0     187     0     3.5     3 0.0  3.0  0
5  41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0  0
6  56   1  2      120  236   0       0     178     0     0.8     1 0.0  3.0  0


* The `str()` function, which describes the **str**ucture of the data, tells us that some of the columns are messed up;


In [23]:
str(data)

'data.frame':	303 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : Factor w/ 5 levels "?","0.0","1.0",..: 2 5 4 2 2 2 4 2 3 2 ...
 $ thal    : Factor w/ 4 levels "?","3.0","6.0",..: 3 2 4 2 2 2 2 2 4 4 ...
 $ hd      : int  0 2 1 0 0 0 3 0 2 1 ...


* Right now, **sex** is a number, but it's supposed to be a factor, where 0 represents "Female" and 1 represents "Male";

* **cp** (aka **c**hest **p**ain) is also supposed to be a factor, where levels 1-3 represent different types of pain and 4 represents no chest pain;

* **ca** and **thal** are correctly called factors, but one of the levels is "**?**" when we need it to be **NA**.

In [0]:
# Change the "?"s to NAs
data[data=="?"] <- NA

# Convert the 0s in $sex to F, for female and the 1s to M for male
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"

# Convert the $sex column into a factor, and a bunch of other columns
data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$restecg)
data$slope <- as.factor(data$slope)

* Since the **ca** column originally had a **?** in it, rather than **NA**, R thinks it's a column of strings. We correct that assumption by telling R that it's a column of integers and then we convert it to a factor.

In [0]:
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)

# And the same thing to $thal

data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)

* The last thing we need to do to the data is make **hd** (aka **h**eart **d**isease) a factor that is easy on the eyes. Here I'm using a fancy trick with `ifelse()` to convert the **0**s to "**Healthy**" and the **1**s to "**Unhealthy**";

> We could have done a similar trick for **sex**, but I wanted to show you both ways to convert numbers to words.

In [0]:
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)

* Once we're done fixing up the data, we check that we have made the appropriate changes with `str()`

In [28]:
str(data)

'data.frame':	303 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
 $ cp      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : Factor w/ 4 levels "2","3","4","5": 1 4 3 1 1 1 3 1 2 1 ...
 $ thal    : Factor w/ 3 levels "2","3","4": 2 1 3 1 1 1 1 1 3 3 ...
 $ hd      : Factor w/ 2 levels "Healthy","Unhealthy": 1 2 2 1 1 1 2 1 2 2 ...


* Now we see how many samples (rows of data) have **NA** values.

In [29]:
nrow(data[is.na(data$ca) | is.na(data$thal),])

In [32]:
print(data[is.na(data$ca) | is.na(data$thal),])

    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope   ca thal
88   53   F  3      128  216   0       2     115     2     0.0     1    2 <NA>
167  52   M  3      138  223   0       0     169     0     0.0     1 <NA>    2
193  43   M  4      132  247   1       2     143     2     0.1     2 <NA>    4
267  52   M  4      128  204   1       0     156     0     1.0     2    2 <NA>
288  58   M  2      125  220   0       0     144     0     0.4     2 <NA>    4
303  38   M  3      138  175   0       0     173     0     0.0     1 <NA>    2
           hd
88    Healthy
167   Healthy
193 Unhealthy
267 Unhealthy
288   Healthy
303   Healthy


* If we wanted to, we could impute values for the **NA**s using Random Forest or some other method. However, for this example, we'll just remove these samples.

In [33]:
print(nrow(data))
data <- data[!(is.na(data$ca) | is.na(data$thal)),]
print(nrow(data))

[1] 303
[1] 297


* Now we need to make sure that healthy and disease samples come from each gender (female and male);

* If only male samples have heart disease, we should probably remove all females from the model;

* We do this with the `xtabs()` function. We pass the data and use "model syntax" to select the columns in the data we want to build a table from


In [34]:
xtabs(~ hd + sex, data=data)

           sex
hd            F   M
  Healthy    71  89
  Unhealthy  25 112

* **Healthy** and **Unhealthy** patients are both represented by a lot of female and male samples.

* Now let's verify that all 4 levels of Chest Pain (**cp**) were reported by a bunch of patients

In [35]:
xtabs(~ hd + cp, data=data)

           cp
hd            1   2   3   4
  Healthy    16  40  65  39
  Unhealthy   7   9  18 103

* .. and then we do the same thing for all of the boolean and categorical variables that we're using to predict heart disease

In [36]:
xtabs(~ hd + fbs, data=data)

           fbs
hd            0   1
  Healthy   137  23
  Unhealthy 117  20

In [37]:
xtabs(~ hd + restecg, data=data)

           restecg
hd           0  1  2
  Healthy   92  1 67
  Unhealthy 55  3 79

* For **restecg**, only patients represent level 1. This could, potentially, get in the way of  finding the best fitting line.