## Programming Exam Instructions

- The programming exam is in the form of a Jupyter notebook.
- All questions are auto-graded and you have unlimited attempts.
- In the Rlab Jupyter notebook, you should see the exam information and the detailed tasks in text, as well as code chunks. Some code chunks were pre-written by your instructor. Others need to be completed by students.
- You have to run every single code chunk (including the ones provided by your instructor) in sequence, because some code relies on the successful execution of earlier code chunks.
- Replace the "# your code here" lines in code chunks with your own code.
- There are special code chunks containing "# Test your code in here" lines. Those code cells contain code invisible to students, which automatically grades the assignment after the deadline or when a student clicks on the "Submit Assignment" button.
- You can click on the Validate button in the Jupyter notebook to see if your code passes the tests.
- From the time you start the exam, you will have a full **4 hours** to complete the programming exam.
- You may check your notes when you take the programming exam.
- Sharing exam questions or answers on the web, on social media platforms, or in other outlets is a direct violation of the UNT Code of Conduct and the UNT Policy of Academic Integrity.



# Part 1: Data Preparation
In this programming exam, we will use US birth data. Every year, the US releases a large data set containing information on births recorded in the country. We have a random sample of 1,000 cases from the data set released in 2014. There are 13 variables in the dataset:

- **fage**: Father's age in years.
- **mage**: Mother's age in years.
- **mature**: Maturity status of mother.
- **weeks**: Length of pregnancy in weeks.
- **premie**: Whether the birth was classified as premature (premie) or full-term.
- **visits**: Number of hospital visits during pregnancy.
- **gained**: Weight gained by mother during pregnancy in pounds.
- **weight**: Weight of the baby at birth in pounds.
- **lowbirthweight**: Whether baby was classified as low birthweight (low) or not (not low).
- **sex**: Sex of the baby, female or male.
- **habit**: Status of the mother as a nonsmoker or a smoker.
- **marital**: Whether mother is married or not married at birth.
- **whitemom**: Whether mom is white or not white.

In [152]:
# load the packages
library(ggplot2)
library(dplyr)
library(testthat)
library(caret)
library(recipes)
library(class)
# load the birth data
birth <- read.csv("birth.csv", header = TRUE)
summary(birth)

      fage            mage               mature        weeks      
 Min.   :15.00   Min.   :14.00   mature mom :159   Min.   :21.00  
 1st Qu.:26.00   1st Qu.:24.00   younger mom:841   1st Qu.:38.00  
 Median :31.00   Median :28.00                     Median :39.00  
 Mean   :31.13   Mean   :28.45                     Mean   :38.67  
 3rd Qu.:35.00   3rd Qu.:33.00                     3rd Qu.:40.00  
 Max.   :85.00   Max.   :47.00                     Max.   :46.00  
 NA's   :114                                                      
       premie        visits          gained          weight      
 full term:876   Min.   : 0.00   Min.   : 0.00   Min.   : 0.750  
 premie   :124   1st Qu.: 9.00   1st Qu.:20.00   1st Qu.: 6.545  
                 Median :12.00   Median :30.00   Median : 7.310  
                 Mean   :11.35   Mean   :30.43   Mean   : 7.198  
                 3rd Qu.:14.00   3rd Qu.:38.00   3rd Qu.: 8.000  
                 Max.   :30.00   Max.   :98.00   Max.   :10.620  
  

# Task 1: Data Cleaning [30 points]
In this task, you will work with the **birth** data.

In the **birth** data, complete the following tasks:

- Replace missing values in the **fage** variable with **15** (**fage** will take the value of 15 when missing).
- Replace missing values in the **visits** variable with **11** (**visits** will take the value of 11 when missing).
- Drop observations from the **birth** data when the **habit** variable has a missing value (drop all rows where **habit** is missing).
- Drop observations from the **birth** data when the **gained** variable has a missing value (drop all rows where **gained** is missing).
- Rename **sex** to **gender** (the variable name **sex** will be changed to **gender**).

NOTE: We worked on a similar problem in RLab1
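
The cleaning steps listed above can be sketched on a small made-up data frame. The `toy` data and its values are hypothetical; only the column names mirror **birth**. The sketch assumes the **dplyr** package is available and filters on **habit** and **gained** explicitly, so rows with missing values in other columns would be kept.

```r
library(dplyr)

# Hypothetical toy data; only the column names mirror the birth data
toy <- data.frame(
  fage   = c(30, NA, 25, 40, 33),
  visits = c(10, 12, 9, 8, NA),
  habit  = c("smoker", "nonsmoker", NA, "nonsmoker", "smoker"),
  gained = c(20, 22, 35, NA, 28),
  sex    = c("male", "female", "male", "female", "male")
)

toy_clean <- toy %>%
  mutate(
    fage   = ifelse(is.na(fage), 15, fage),      # fill missing fage with 15
    visits = ifelse(is.na(visits), 11, visits)   # fill missing visits with 11
  ) %>%
  filter(!is.na(habit), !is.na(gained)) %>%      # drop rows missing habit or gained
  rename(gender = sex)                           # rename uses new_name = old_name

toy_clean
```

Unlike `na.omit()`, which drops a row when any column is missing, this version only looks at the two columns named in the task.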






In [153]:
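# Task #1: Prepare the data.
# Your code will pass the test if you complete all the data cleaning tasks
birth <- read.csv("birth.csv", header = TRUE)
# your code here

# Replace missing fage with 15 and missing visits with 11
birth$fage[is.na(birth$fage)] <- 15
birth$visits[is.na(birth$visits)] <- 11

# Drop rows that still contain a missing value. Note that na.omit() checks
# every column, not only habit and gained
birth <- na.omit(birth)

# Rename sex to gender (dplyr::rename uses new_name = old_name)
birth <- rename(birth, gender = sex)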

In [154]:

# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the model summary", {
    expect_equal(round(mean(birth$fage),3),29.349)
    expect_equal(  IQR(birth$visits),3)
    expect_equal( sum(is.na(birth$habit)),0)
    expect_equal( sum(is.na(birth$gained)),0)
    expect_equal( dim(birth)[1],941)
})


print("Passed!")

### END HIDDEN TEST






[1] "Passed!"


# Part 2: Modeling lowbirthweight with logistic regression
- In this part, we will work with the **birth1** data, a subset of the **birth** data that keeps only complete cases. It has 794 rows and 13 columns. Students can get full credit for Task 2 even if they cannot complete Task 1.
- **birth1** is split into the **birth_train** and **birth_test** datasets with the **createDataPartition()** function in the **caret** package.

In this part, your job is to use logistic regression to model **lowbirthweight** using all the predictors in the dataset.
Run the following code chunk first. It loads the original data, drops rows with a missing value, names the result **birth1**, and then splits it into **birth_train** and **birth_test** with the **createDataPartition()** function in the **caret** package.
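
As a rough illustration of what **createDataPartition()** does, the sketch below splits a made-up data frame. The `toy` data, its `id` and `outcome` columns, and the class counts are hypothetical. The function samples within each class of the outcome, so both parts should end up with similar class shares.

```r
library(caret)

# Hypothetical outcome with unbalanced classes, standing in for lowbirthweight
toy <- data.frame(
  id      = 1:100,
  outcome = factor(rep(c("low", "not low"), times = c(20, 80)))
)

set.seed(4230)
idx <- createDataPartition(toy$outcome, p = 0.7, list = FALSE)
toy_train <- toy[idx, ]    # rows selected for training (about 70%)
toy_test  <- toy[-idx, ]   # the remaining rows

# Class shares in each part; stratified sampling keeps them close
prop.table(table(toy_train$outcome))
prop.table(table(toy_test$outcome))
```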



In [163]:
# Run this code chunk first
raw <- read.csv("birth.csv", header = TRUE)   # get the original birth data
birth1 <- raw %>%                             # drop rows with missing values and name the dataset birth1
  filter(complete.cases(.))

set.seed(4230)   # set the random seed
index_data <- createDataPartition(birth1$lowbirthweight, p = 0.7,
                                  list = FALSE)
birth_train <- birth1[index_data, ]
birth_test <- birth1[-index_data, ]



# Task 2 [30 points]
- Using the **birth_train** data, fit a logistic regression of **lowbirthweight** on all the predictors in the dataset with the **glm()** function in R, and name your model **model_logistic**. Use **set.seed(4230)**. If you get the warning message “glm.fit: algorithm did not converge”, just ignore it.
- Calculate the predicted probability of **lowbirthweight** in **birth_test** with the **model_logistic** model. Name the result **logistic_predict**. Use **set.seed(4230)**.

NOTE: We worked on a similar problem in RLab2
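
One detail worth keeping in mind when reading the predicted probabilities: when the response passed to **glm()** is a factor, R treats the first level as "failure" and the other level as "success". With levels "low" and "not low", the fitted probabilities therefore refer to "not low", which would be consistent with the mean of 0.904 expected by the hidden test. The sketch below shows this on simulated data; `x`, `y`, `fit`, and `p` are hypothetical names.

```r
# Hypothetical simulated data; y and x are made-up names
set.seed(1)
x <- rnorm(200)
y <- factor(ifelse(rbinom(200, 1, plogis(1 + x)) == 1, "not low", "low"))

levels(y)   # the first level is the reference ("failure") category

fit <- glm(y ~ x, family = binomial)
p   <- predict(fit, type = "response")

# The fitted probabilities refer to the second level of y,
# so their mean matches the share of that level in the data
mean(p)
mean(y == levels(y)[2])
```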


In [164]:
# Task #2: Logistic regression
# Use set.seed(4230) just before running the glm function
# Use set.seed(4230) just before running the predict function
# You should complete both tasks to pass this test


# your code here
set.seed(4230)
# Fit lowbirthweight on all other columns of birth_train
model_logistic <- glm(lowbirthweight ~ ., family = "binomial", data = birth_train)

set.seed(4230)
# type = "response" returns probabilities instead of log-odds
logistic_predict <- predict(model_logistic, newdata = birth_test, type = "response")

“glm.fit: algorithm did not converge”
“glm.fit: fitted probabilities numerically 0 or 1 occurred”


In [165]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the model summary", {
    expect_equal(round(summary(model_logistic)[8][[1]]),293)
    expect_equal( round(summary(model_logistic)[7][[1]]),544)
    expect_equal( length(logistic_predict),237)
    expect_equal( round(mean(logistic_predict), 3),0.904)
    expect_equal( round(sum(logistic_predict), 1),214.4)
})



print("Passed!")

### END HIDDEN TEST





[1] "Passed!"


# Part 3: Modeling lowbirthweight with knn 
In this part, we will use the k-nearest neighbors (knn) algorithm to classify new observations in **birth_test** based on their proximity to the k most similar observations in **birth_train**.


The following code chunk preprocesses our data for the knn model. Using the **recipe** function in the **recipes** package, it centers and scales the numerical features and one-hot encodes the categorical features, so that there is one dummy variable for each level of a categorical variable. It stores the preprocessed features for the training and test sets in **features_train** and **features_test**, respectively.



In [166]:
# Run this code chunk first before moving on to Task 3
# Preprocess the birth_train data with the recipe function
features_train <- recipe(lowbirthweight ~ ., data = birth_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep(training = birth_train, retain = TRUE) %>%
  juice() %>%
  select(-lowbirthweight)


# Preprocess the birth_test data with the recipe function
features_test <- recipe(lowbirthweight ~ ., data = birth_test) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep(training = birth_test, retain = TRUE) %>%
  juice() %>%
  select(-lowbirthweight)
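
The chunk above preps a separate recipe on each data set, so the test features are centered and scaled with the test set's own means and standard deviations. A common alternative, sketched below on hypothetical toy data, is to prep the recipe once on the training data and apply it to both sets with `bake()`. This is only an illustration: using it in the exam would change **features_test**, and the hidden tests were presumably written against the chunk as provided.

```r
library(recipes)
library(dplyr)

# Hypothetical toy data; column names are made up for illustration
toy_train <- data.frame(
  outcome = factor(c("low", "not low", "not low", "low")),
  weeks   = c(36, 39, 40, 35),
  habit   = factor(c("smoker", "nonsmoker", "nonsmoker", "smoker"))
)
toy_test <- data.frame(
  outcome = factor(c("not low", "low")),
  weeks   = c(38, 34),
  habit   = factor(c("nonsmoker", "smoker"))
)

# Estimate centering, scaling and dummy coding on the training data only
rec <- recipe(outcome ~ ., data = toy_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep(training = toy_train)

# Apply the same trained steps to both data sets
x_train <- bake(rec, new_data = toy_train) %>% select(-outcome)
x_test  <- bake(rec, new_data = toy_test)  %>% select(-outcome)
```

With this approach the test features are expressed on the training scale, so distances between training and test points are computed in the same units.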




# Task 3 [30 points]

Use the **knn** function in the **class** package to predict **lowbirthweight** in the test data with the **knn** method when **k=20**. Call **set.seed(4230)** first and name the predicted test labels **model_knn**.

Please note that the knn() function in the class package requires predictors and labels to be entered separately. More specifically, the predictors need to be a matrix (or data frame) and the labels need to be a vector.

NOTE: We worked on a similar problem in RLab3
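
To illustrate the input format that **knn()** expects, here is a minimal sketch on hypothetical toy data with two well-separated groups. The object names (`x_train`, `x_test`, `y_train`, `pred`) and the value `k = 3` are made up for the example.

```r
library(class)

# Hypothetical toy data: two well-separated groups of four points each
x_train <- as.matrix(data.frame(
  f1 = c(0, 1, 0, 1, 10, 11, 10, 11),
  f2 = c(0, 0, 1, 1, 10, 10, 11, 11)
))
y_train <- factor(rep(c("low", "not low"), each = 4))   # label vector

# Two new points, one near each group
x_test <- matrix(c(0.5, 0.5,
                   10.5, 10.5),
                 ncol = 2, byrow = TRUE)

set.seed(4230)   # knn() breaks ties at random, so set a seed first
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 3)
pred
```

The function returns a factor with one predicted label per row of the test matrix; there is no separate fitted model object to store.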

In [168]:
#  Task 3: knn when k=20
# Use set.seed(4230)
# your code here

# Labels for the training and test sets
y_train <- birth_train$lowbirthweight
y_test <- birth_test$lowbirthweight

set.seed(4230)
model_knn <- knn(train = features_train,
                 test = features_test,
                 cl = y_train,
                 k = 20)



In [169]:
# Test your code in here
### BEGIN HIDDEN TEST

class_error = function(actual, predicted) {
  mean(actual != predicted)
}

test_that("Check the classification error", {
    expect_equal( round(class_error(birth_test$lowbirthweight,model_knn), 2),0.06)})
        
print("Passed!")

### END HIDDEN TEST

[1] "Passed!"



# Task 4 [10 points]


What is the accuracy rate of **model_knn**? Calculate the accuracy rate and name it **accuracy_model_knn**. Your accuracy calculation should be accurate to at least 3 decimal places to pass the test.
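
Accuracy is the share of correctly classified cases, that is, the diagonal of the confusion table divided by the total count. A minimal sketch on hypothetical labels follows; `actual`, `predicted`, and `conf_mat` are made-up names.

```r
# Hypothetical labels; "low" / "not low" mirror the levels of lowbirthweight
lv        <- c("low", "not low")
actual    <- factor(c("low", "low", "not low", "not low", "not low"), levels = lv)
predicted <- factor(c("low", "not low", "not low", "not low", "not low"), levels = lv)

# Rows are predicted labels, columns are actual labels
conf_mat <- table(predicted, actual)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Equivalent shortcut: share of matching labels, i.e. 1 - classification error
accuracy_alt <- mean(predicted == actual)
```

Computing the value from the table or from the label vectors avoids copying counts by hand, which can silently go stale if the model is refit.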

In [170]:
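#  Task 4: calculate the accuracy rate

# your code here
# Confusion table: rows are predicted labels, columns are actual labels
accuracy_table <- table(model_knn, birth_test$lowbirthweight)

# Read the counts from the table, treating "low" as the positive class
TP <- accuracy_table["low", "low"]           # 3 in the original run
FP <- accuracy_table["low", "not low"]       # 0 in the original run
TN <- accuracy_table["not low", "not low"]   # 220 in the original run
FN <- accuracy_table["not low", "low"]       # 14 in the original run

accuracy_model_knn <- (TP + TN) / (TN + TP + FN + FP)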

In [171]:
# Test your code in here
### BEGIN HIDDEN TEST



test_that("Check the accuracy  measure", {
    expect_equal( round(round(accuracy_model_knn, 3)^(-1/round(accuracy_model_knn,2)), 3),1.067)})
        
print("Passed!")

### END HIDDEN TEST

[1] "Passed!"
