# Sorting-Answer 

#### By Dae Hyeun (Isaac) Cheong 

This document contains code that sorts the answers to a free-response question into three categories: 
* **C**: Correct answer, but it still needs to be checked by a human.
* **I**: Incorrect answer, but it still needs to be checked by a human.
* **H**: Answer that can't be classified by this algorithm, so a human needs to score it.

In [143]:
library(knitr)
library(dplyr)
library(readxl)
library(stringr)
raw_data <- read_excel("Items 28 and 30-2.xls")
translated_pc <- read_excel("translated_pc.xlsx")

#### Methodology

To explain how an answer is sorted, I will use the question about water bubbles, which corresponds to question 4 of the science assessment. In the original dataset, the answers to question 4 are stored in the column named "PC".

<img src="scoringrubric.png" alt="rubric" width="700"/>

Here, I will explain the pseudocode of the method.

1. We first identify MC (most likely a correct answer). An answer is considered MC when it contains at least two keywords from Code 10 or Code 11. For each code we make two groups of keywords (What and How), and an answer is classified as MC when it has at least one word from each group.
    * For example, Code 10's What group could be {Bubbles, Gas, They, etc.} and its How group could be {light, less dense, not heavy, etc.}. If the student wrote "Bubbles float because air inside the bubble is lighter than water", the algorithm classifies the answer as MC.

2. We repeat Step 1 for every correct-answer code to finish sorting into the MC category.

3. Next, among all non-MC answers, we identify CC (likely a correct answer). The answer key contains some important keywords that make an answer correct on their own. When an answer contains at least one of these keywords, we assign it to category CC.
    * For example, some keywords for the bubble problem are {less dense, not heavy, oxygen, carbon dioxide, light}. If the student wrote "The thing inside is less dense than water", the answer is classified as CC rather than MC.
    
4. Then we repeat Step 3, this time with keywords that signal a wrong answer, to identify LI (likely an incorrect answer).

5. Any blank answer, or any answer that contains {don't know, no idea}, goes to category I. The rest go to NC (not classified).
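
The steps above can be sketched for a single answer as follows. This is only an illustration in base R, not the code used later in this document: the keyword groups are hypothetical and loosely based on the examples in the steps, the LI words are borrowed from the wrong-answer keywords used in STEP 2 below, and `has_any` and `classify_answer` are names made up for this sketch.

```r
# Hypothetical keyword groups, loosely based on the examples in the steps above.
what_group <- c("BUBBLE", "GAS", "THEY")
how_group  <- c("LIGHT", "LESS DENSE", "NOT HEAVY")
cc_words   <- c("LESS DENSE", "NOT HEAVY", "OXYGEN", "CARBON DIOXIDE", "LIGHT")
li_words   <- c("EVAPORATES", "PRESSURE", "GRAVITY")
idk_words  <- c("DONT KNOW", "NO IDEA")

# TRUE if the answer contains at least one of the words (plain substring match).
has_any <- function(answer, words) {
    any(vapply(words, function(w) grepl(w, answer, fixed = TRUE), logical(1)))
}

# Apply the steps in order: blank, MC, CC, LI, "don't know", then NC.
classify_answer <- function(answer) {
    if (is.na(answer) || trimws(answer) == "") return("I")
    answer <- toupper(answer)
    if (has_any(answer, what_group) && has_any(answer, how_group)) return("MC")
    if (has_any(answer, cc_words)) return("CC")
    if (has_any(answer, li_words)) return("LI")
    if (has_any(answer, idk_words)) return("I")
    "NC"
}

answers <- c("Bubbles float because air inside the bubble is lighter than water",
             "The thing inside is less dense than water",
             "I dont know",
             NA)
labels <- vapply(answers, classify_answer, character(1), USE.NAMES = FALSE)
print(labels)
```

With these hypothetical groups, the two example sentences from the steps land in MC and CC respectively, and the "don't know" and blank answers land in I.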


In [86]:
## This function detects the rows with a completely wrong answer.
## Wrong answer = a "DONT KNOW" response or a blank (NA) response.
## Returns a dataframe of the wrong answers: this will be the "I" (incorrect) group.

detecting_I <- function(df) {
    # Refer to the Response column directly so this works for data frames and tibbles alike
    incorrect <- df %>% filter(is.na(Response) | str_detect(Response, "DONT KNOW"))
    return(incorrect)
}
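
To see what this split does, here is a minimal base-R sketch on a toy data frame. The IDs and responses are made up for illustration, and `grepl` stands in for `str_detect` so the example runs without extra packages. Note that `grepl` returns `FALSE` for a missing response, so the explicit `is.na` check is what sends blanks to the "I" group.

```r
# Toy data with hypothetical IDs; the column names follow this document.
toy <- data.frame(
    Student_ID = c(1, 2, 3, 4),
    Response   = c("I DONT KNOW", NA, "BUBBLES ARE LIGHT", "I DONT KNOW IT"),
    stringsAsFactors = FALSE
)

# A row is "I" when the response is blank (NA) or contains "DONT KNOW".
is_incorrect <- is.na(toy$Response) | grepl("DONT KNOW", toy$Response, fixed = TRUE)

toy_I    <- toy[is_incorrect, ]   # rows that go to the "I" group
toy_rest <- toy[!is_incorrect, ]  # rows that still need to be classified

print(toy_I$Student_ID)
print(toy_rest$Student_ID)
```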

In [89]:
## This function takes two dataframes.
## Returns the bigger dataframe with the rows that also appear in the smaller dataframe removed (matched on Student_ID).

remove_overlap <- function(bigger_df, smaller_df) {
    `%nin%` = Negate(`%in%`)
    return(bigger_df %>% filter(Student_ID %nin% smaller_df$Student_ID))
    }
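
Because rows are matched on `Student_ID` only, every row that shares an ID with the smaller dataframe is dropped. The base-R sketch below shows this on made-up data; the duplicated ID is an assumption added to illustrate the caveat, and nothing here says the real dataset contains duplicates.

```r
# Toy data with a deliberately duplicated Student_ID (103).
bigger <- data.frame(
    Student_ID = c(101, 102, 103, 103),
    Response   = c("A", "B", "C", "D"),
    stringsAsFactors = FALSE
)
smaller <- bigger[3, ]  # a single row with Student_ID 103

# Base-R equivalent of the filter: keep rows whose ID is not in the smaller dataframe.
remaining <- bigger[!(bigger$Student_ID %in% smaller$Student_ID), ]

print(nrow(remaining))  # 2: both rows with ID 103 are removed, not just one
```

If IDs are unique per student, as the name suggests, this behaves like removing exactly the rows of the smaller dataframe.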

In [91]:
## These functions take a dataframe and a keyword pattern (a regular expression such as "GAS|AIR").
## narrower_w_keywords returns the rows whose Response matches the pattern.
## narrower_wo_keywords returns the rows whose Response does not match the pattern.
## They narrow the entire dataset down to smaller subsets.

narrower_w_keywords <- function(df, keywords) {
    # [[ ]] always returns a plain vector, for data frames and tibbles alike
    vector_of_answer <- df[["Response"]]
    rows_w_keywords <- str_detect(vector_of_answer, keywords)
    return(df[rows_w_keywords, ])
    }

narrower_wo_keywords <- function(df, keywords) {
    vector_of_answer <- df[["Response"]]
    rows_wo_keywords <- str_detect(vector_of_answer, keywords, negate = TRUE)
    return(df[rows_wo_keywords, ])
    }
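
One thing to keep in mind is that these patterns match substrings, not whole words. The base-R sketch below uses made-up responses to show the effect. `as_whole_words` is a hypothetical helper, not part of this document's code, that wraps a pattern in word boundaries.

```r
responses <- c("THE AIR RISES", "MY HAIR IS WET", "AIR IS LIGHTER THAN WATER")

# Plain substring matching: "AIR" also matches inside "HAIR".
substring_hits <- grepl("AIR", responses)

# Hypothetical helper: wrap an "A|B|C" pattern so that only whole words match.
as_whole_words <- function(keywords) {
    paste0("\\b(", keywords, ")\\b")
}
whole_word_hits <- grepl(as_whole_words("AIR|GAS"), responses, perl = TRUE)

# The trade-off: a whole-word "LIGHT" no longer matches "LIGHTER".
light_substring  <- grepl("LIGHT", responses)
light_whole_word <- grepl(as_whole_words("LIGHT"), responses, perl = TRUE)

print(substring_hits)
print(whole_word_hits)
print(light_substring)
print(light_whole_word)
```

Substring matching is convenient when one keyword should cover several word forms (LIGHT, LIGHTER), while whole-word matching avoids accidental hits such as HAIR. Which one is better depends on the keyword.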


In [231]:
## This function adds a column "Correction" with the given label (for example "Correct" or "Incorrect")

add_correction <- function(df, result = "Correct") {
    df$Correction <- result
    return(df) 
    }


Let's use these functions to score the water-bubble question.

### STEP 1: Define a dataframe that will contain all incorrect responses using *detecting_I*

In [232]:
# Extract the columns we want and combine them with the translated responses.
# Rename the columns. Make sure the column we want to score is named "Response".
our_interest <- select(raw_data, CODECEN, CUADERNILLO, FORMA, PC)
final_df <- cbind(our_interest, translated_pc)
colnames(final_df) <- c("Code", "Student_ID", "Form", "Original_Response","Response")
head(final_df, 5)

Code,Student_ID,Form,Original_Response,Response
141100009,200645,A,PORQUE LAS BURBUJAS ESTAN LLENAS DE AIRE OCIGENO Y NO DURAN DENTRO DEL AGUA POR ESO.,BECAUSE THE BUBBLES ARE FILLED WITH OXYGEN AIR AND DO NOT LAST IN THE WATER FOR THAT REASON.
141100009,200642,B,LAS BURBUJAS SUBEN EN EL AGUA PORQUE SOPLAMOS.,BUBBLES RISE IN THE WATER BECAUSE WE BLOW.
141100009,200643,A,PORQUE LAS BURBUAS SALEN Y SALEN DEL POPOTE Y COMO NO TIENEN GRAVEDAD SUBEN Y SUBEN FLOTANDO.,BECAUSE THE BUBBLES COME AND COME OUT OF THE STRAW AND AS THEY HAVE NO GRAVITY THEY FLOAT UP AND UP.
141100009,200646,B,PORQUE ESAS BURVUJAS SE EXPANDEN,BECAUSE THOSE BURVETS EXPAND
141100009,200644,B,PORQUE CUANDO UNO SOPLA SE ACEN BURBUJAS Y SE VAN A LAS NUBES Y SE ASE AGUA.,"BECAUSE WHEN ONE BLOWS, BUBBLES BECOME AND THEY GO TO THE CLOUDS AND WATER IS GROWN."


In [233]:
# Define the dataframe that will contain all of the incorrect answers
I <- detecting_I(final_df)

# Update final_df by removing the rows we detected
final_df <- remove_overlap(final_df, I) 

In [234]:
#Check I 
head(I,5)

Code,Student_ID,Form,Original_Response,Response
130800004,200206,B,NO SE,I DONT KNOW
140500001,200154,B,NO LA SE,I DONT KNOW IT
120500071,200220,B,,
20900023,200056,B,,
20900023,200060,B,,


### STEP 2: Add more incorrect answers using *narrower_w_keywords* and *narrower_wo_keywords*

In [235]:
# The answer key says a correct answer should refer to bubbles/gas, so answers that don't include
# these words have a high chance of being incorrect
new_df <- final_df 
keywords = "THEY|GAS|BUBBLES|BUBBLE|CARBON DIOXIDE|AIR|OXYGEN"
incorrect <- narrower_wo_keywords(new_df, keywords)
new_df <- remove_overlap(new_df, incorrect)
I = rbind(I, incorrect)

In [236]:
# According to the answer key, some words frequently appear in wrong answers
keywords = "EVAPORATES|PRESSURE|FORCE|BREATHING|WIND|GRAVITY|WEIGH"
incorrect = narrower_w_keywords(new_df, keywords)
new_df <- remove_overlap(new_df, incorrect)
I = rbind(I, incorrect)

In [237]:
# A common wrong answer contains only "BLOWS" without referring to a gas.
incorrect <- new_df %>% narrower_w_keywords("BLOWS") %>% narrower_wo_keywords("GAS|CARBON DIOXIDE|AIR|OXYGEN")
new_df <- remove_overlap(new_df, incorrect)
I = rbind(I, incorrect)

In [238]:
nrow(I)

### STEP 3: Define a dataframe that will contain all correct responses

In [239]:
# According to the answer key, some words/phrases frequently appear in correct responses
keywords = "LIGHT|LESS DENSE|NOT HEAVY|FLOAT|AIR RISES"
correct = narrower_w_keywords(new_df, keywords) 
new_df <- remove_overlap(new_df, correct)
C = correct

In [240]:
# Blow + GAS is usually a correct answer 
correct = new_df %>% narrower_w_keywords("BLOW") %>% narrower_w_keywords("AIR|OXYGEN|CARBON DIOXIDE|GAS")
new_df <- remove_overlap(new_df, correct)
C = rbind(C, correct)

In [241]:
# AIR + BUBBLE, PUSH, or RISE are common combinations in a correct answer
correct = new_df %>% narrower_w_keywords("AIR") %>% narrower_w_keywords("BUBBLE|PUSH|RISE|EXPELL|OUT")
new_df <- remove_overlap(new_df, correct)
C = rbind(C, correct)

In [242]:
# Students who answered with a scientific term such as OXYGEN or CARBON DIOXIDE have a higher chance of being correct
correct = new_df %>% narrower_w_keywords("OXYGEN|CARBON DIOXIDE")
new_df <- remove_overlap(new_df, correct)
C = rbind(C, correct)

### STEP 4: Assign the unclassified responses to class H, and add a column to each dataframe using *add_correction*

In [243]:
C = add_correction(C) 
I = add_correction(I, "Incorrect")
H = new_df %>% add_correction("H")
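
Once the three groups are labelled, a quick sanity check is to stack them and confirm that no student appears twice and that the group sizes add up. The sketch below shows the idea in base R on made-up data. `C_toy`, `I_toy` and `H_toy` are hypothetical stand-ins for `C`, `I` and `H`, and the counts are not results from the real dataset.

```r
# Hypothetical stand-ins for the labelled C, I and H dataframes.
C_toy <- data.frame(Student_ID = c(1, 2), Correction = "Correct", stringsAsFactors = FALSE)
I_toy <- data.frame(Student_ID = c(3, 4, 5), Correction = "Incorrect", stringsAsFactors = FALSE)
H_toy <- data.frame(Student_ID = 6, Correction = "H", stringsAsFactors = FALSE)

# Stack the three groups into one table.
all_toy <- rbind(C_toy, I_toy, H_toy)

# Each student should appear in exactly one group.
stopifnot(!any(duplicated(all_toy$Student_ID)))

# The group sizes should add up to the total number of rows.
counts <- table(all_toy$Correction)
stopifnot(sum(counts) == nrow(all_toy))
print(counts)
```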

In [245]:
H

Code,Student_ID,Form,Original_Response,Response,Correction
141100009,200642,B,LAS BURBUJAS SUBEN EN EL AGUA PORQUE SOPLAMOS.,BUBBLES RISE IN THE WATER BECAUSE WE BLOW.,H
141100009,200174,B,PORQUE EL MISMO AIRE DE LA PERSONA ASE QUE EL AGUA AGA VURBUJAS,BECAUSE THE SAME AIR OF THE PERSON MAKES THE WATER MAKE VURBUJAS,H
130800004,200696,B,PORQUE SIGUEN EL AIRE,BECAUSE THEY FOLLOW THE AIR,H
130800004,200693,A,A PORQUE SEGUN SE SOPLA SUBEN LAS BURBUJAS SE FORMAN CON EL SOPLO QUE UNO LE DA A LA PAJILLA,"BECAUSE AS YOU BLOW THE BUBBLES RISE, THEY FORM WITH THE BLOW THAT ONE GIVES TO THE STRAW",H
130800004,200695,A,PORQUE CUANDO UNO METE UNA PAJILLA AL UN BASO ECHAN ESPUMITAS COMO JABON PORQUE UNO LA ESTA SOPLANDO,"BECAUSE WHEN YOU PUT A STRAW INTO A BOTTLE, THEY PUT FOAM LIKE SOAP BECAUSE YOU ARE BLOWING IT",H
020900012,200094,B,SUBEN POR EL AIRE DE LA PERSONA Y POR EL POPOTE,THEY GO UP THROUGH THE PERSON'S AIR AND THROUGH THE STRAW,H
020900012,200096,B,PORQUE DE ASE MUCHO AIRE,BECAUSE IT TAKES A LOT OF AIR,H
020900012,200281,A,PORQUE EL AIRE SUDE ACIA ARIDA PERO SI LA PAJILLA ESTA ADAJO,BECAUSE THE AIR SWEATS IT IS ARID BUT IF THE STRAW IS OLD,H
020900012,200099,A,POR QUE CUANDO NOSOTROS SOPLAMOS DENTRO DEL EL VASO COMIENSA A SALIR BURBUJAS SOLO AMADER EL AGUA CON LA POPOTE,BECAUSE WHEN WE BLOW INSIDE THE GLASS BEGINS TO COME OUT BUBBLES ONLY AMADER THE WATER WITH THE STRAW,H
020900001,200062,B,PORQUE EL AGUA NO PUEDE TENER HAYRE DENTRO NI LAS BURBUJAS ESTAR ADENTRO DEL AGUA,BECAUSE THE WATER CANNOT HAVE THERE INSIDE NOR THE BUBBLES BE INSIDE THE WATER,H
