# Sorting-Answer 

#### By Dae Hyeun (Isaac) Cheong 

This document contains code that sorts answers to a free-response question into the following categories:
* **C**: Correct answer, but it still needs to be checked by a human.
* **I**: Incorrect answer, but it still needs to be checked by a human.
* **H**: Answers that this algorithm cannot classify, so a human needs to grade them.

In [85]:
library(knitr)
library(dplyr)
library(readxl)
library(stringr)
raw_data <- read_excel("Items 28 and 30-2.xls")
translated_pc <- read_excel("translated_pc.xlsx")

#### Methodology

To explain how an answer is sorted, I will use the question about water bubbles, which corresponds to question 4 of the science assessment. In the original dataset, the answers to question 4 are in the column named "pc".

<img src="scoringrubric.png" alt="rubric" width="700"/>

Here, I will explain the pseudocode of the method.

1. We start by identifying MC (most likely to be a correct answer). An answer is considered most likely correct when it contains at least two keywords from either Code 10 or Code 11. We split the keywords into two groups (What and How), so an answer is classified as MC when it has at least one word from each group.
    * For example, Code 10's What group will be {Bubbles, Gas, They, etc.} and its How group will be {light, less dense, not heavy, etc.}. If the student wrote "Bubbles float because air inside the bubble is lighter than water", the algorithm will classify the answer as MC.

2. We repeat Step 1 for every Code of a correct answer to finish sorting into the MC category.

3. Next, among all non-MC answers, we identify CC (likely to be a correct answer). The answer key contains some important keywords that, on their own, make an answer correct. When an answer contains at least one of these keywords, we assign it to category CC.
    * For example, some keywords for the bubble problem are {less dense, not heavy, oxygen, carbon dioxide, light}. If the student wrote "The thing inside is less dense than water", the answer will be classified as CC rather than MC.
    
4. Then we repeat Step 3 with the keywords for incorrect answers to identify LI (likely to be an incorrect answer).

5. Any blank answer, or any answer that contains {don't know, no idea}, goes to category I. The rest go to NC (not classified).
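
The five steps can be sketched end to end on a plain character vector. This is a minimal illustration in base R (using `grepl` in place of `str_detect` so it runs without extra packages), not the notebook's actual implementation. The function name `classify_answers` is hypothetical, the keyword lists are abbreviated from the examples in the steps, and `li_words` is a made-up placeholder because the incorrect-answer keywords are not listed here. It also assumes the steps are applied in priority order MC, CC, LI, I, NC.

```r
# Keyword groups as regular expressions. what_words, how_words, cc_words and
# i_words are abbreviated from the examples in the steps; li_words is a
# hypothetical placeholder, since the incorrect-answer keywords are not given.
what_words <- "(BUBBLE|GAS|THEY|AIR)"
how_words  <- "(LIGHT|LESS DENSE|NOT HEAVY)"
cc_words   <- "(LESS DENSE|NOT HEAVY|OXYGEN|CARBON DIOXIDE|LIGHT)"
li_words   <- "(HEAVIER|SINK)"
i_words    <- "(DON'?T KNOW|NO IDEA)"

classify_answers <- function(responses) {
    # Step 5 treats blank answers as I, so flag them before matching.
    blank <- is.na(responses) | trimws(responses) == ""
    text <- toupper(ifelse(blank, "", responses))

    # Assign from lowest to highest priority so that later lines overwrite
    # earlier ones: NC < I < LI < CC < MC.
    category <- rep("NC", length(text))
    category[blank | grepl(i_words, text)] <- "I"
    category[grepl(li_words, text)] <- "LI"
    category[grepl(cc_words, text)] <- "CC"
    category[grepl(what_words, text) & grepl(how_words, text)] <- "MC"
    category
}

answers <- c("Bubbles float because air inside the bubble is lighter than water",
             "The thing inside is less dense than water",
             "They sink",
             "I don't know",
             NA,
             "Because of the water")
classify_answers(answers)
# "MC" "CC" "LI" "I" "I" "NC"
```

Assigning from the lowest to the highest priority keeps the function short: an answer that matches several groups ends up in the highest-priority one.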


In [87]:
our_interest <- select(raw_data, CODECEN, CUADERNILLO, FORMA, PC)
final_df <- cbind(our_interest, translated_pc)
colnames(final_df) <- c("Code", "Student_ID", "Form", "Original_Response","Response")

In [14]:
## This function detects the answers that are most likely to be correct (MC).
## Takes a dataframe as input (the translated text has to be the 5th column).
## Returns a logical vector that is TRUE for each row whose answer contains
## at least one What word and at least one How word.
## What_Words and How_Words must be regular expressions.

What_Words <- "( AIR |BUBBLES|BUBBLE|THEY|OXYGEN| GAS | CARBON |CARBON DIOXIDE)"
How_Words <- "( LIGHT |LESS DENSE|NOT HEAVY|LIGHTER|FLOAT|BLOWING| BLOW | BLOWS )"

detecting_MC <- function(x) {
    ## The answers come from the dataframe passed in as x
    vector_of_answer <- x[[5]]
    all_MC_answer <- str_detect(vector_of_answer, What_Words) & str_detect(vector_of_answer, How_Words)
    return(all_MC_answer)
}
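
Several keywords above are padded with spaces (for example `" AIR "`) so that they do not match inside longer words. A side effect is that a padded keyword cannot match at the very start or end of an answer. The sketch below compares that approach with word boundaries (`\\b`). It is only an illustration on made-up answers, and the variable names are hypothetical.

```r
# Space-padded keywords versus word-boundary keywords.
padded  <- "( AIR | GAS )"
bounded <- "\\b(AIR|GAS)\\b"

# Made-up answers: keyword at the start, in the middle, at the end,
# and hidden inside a longer word.
answers <- c("AIR IS LIGHT", "THE AIR INSIDE", "IT IS A GAS", "HAIR FLOATS")

grepl(padded, answers)
# FALSE TRUE FALSE FALSE

grepl(bounded, answers, perl = TRUE)
# TRUE TRUE TRUE FALSE
```

Both patterns reject "HAIR", but only the word-boundary version finds a keyword that sits at the start or end of the answer.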


In [86]:
## This function detects the rows with a completely wrong answer.
## Wrong answer = "Don't Know" + NA (blank)
## Returns a dataframe with the wrong answers: this will be the "I" (incorrect) group

detecting_I <- function(df) {
    incorrect <- df %>% filter(is.na(Response) | str_detect(Response, "DONT KNOW"))
    return(incorrect)
}
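
The pattern `"DONT KNOW"` only matches the spelling without an apostrophe. Whether that matters depends on how the translated responses were cleaned, which is not shown here. The following base R sketch, on made-up data, compares the literal pattern with an assumed variant that also accepts `DON'T KNOW` and the "no idea" phrase from step 5.

```r
# Hypothetical toy data; the column names follow the document.
toy <- data.frame(
    Student_ID = 1:4,
    Response = c("DONT KNOW", "I DON'T KNOW", NA, "AIR IS LIGHT"),
    stringsAsFactors = FALSE
)

# Literal pattern: misses the spelling with an apostrophe.
strict <- is.na(toy$Response) | grepl("DONT KNOW", toy$Response)

# Assumed variant: optional apostrophe, plus "NO IDEA".
lenient <- is.na(toy$Response) | grepl("DON'?T KNOW|NO IDEA", toy$Response)

toy$Student_ID[strict]
# 1 3

toy$Student_ID[lenient]
# 1 2 3
```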

In [89]:
## This function takes two dataframes.
## Returns the bigger dataframe with the rows that also appear in the smaller dataframe removed
## (rows are matched on Student_ID).

remove_overlap <- function(bigger_df, smaller_df) {
    `%nin%` <- Negate(`%in%`)
    return(bigger_df %>% filter(Student_ID %nin% smaller_df$Student_ID))
    }
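
To show what this overlap removal does, here is a base R sketch of the same idea on made-up data. The helper name `remove_overlap_base` is hypothetical, and the sketch assumes that `Student_ID` identifies each row uniquely. If several rows shared an ID, all of them would be dropped together.

```r
# Base R equivalent: keep the rows whose Student_ID is not in smaller_df.
remove_overlap_base <- function(bigger_df, smaller_df) {
    keep <- !(bigger_df$Student_ID %in% smaller_df$Student_ID)
    bigger_df[keep, , drop = FALSE]
}

# Made-up data: four students, two of whom were already classified.
all_rows <- data.frame(
    Student_ID = c(101, 102, 103, 104),
    Response = c("A", "B", "C", "D"),
    stringsAsFactors = FALSE
)
already_classified <- all_rows[c(2, 4), ]

remaining <- remove_overlap_base(all_rows, already_classified)
remaining$Student_ID
# 101 103
```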

In [91]:
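## These functions take a dataframe and keywords (a regular expression).
## narrower_w_keywords returns the rows whose Response contains the keywords;
## narrower_wo_keywords returns the rows whose Response does not.
## They narrow the entire dataset down to smaller subsets.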

narrower_w_keywords <- function(df, keywords) {
    vector_of_answer <- df[ ,'Response'] 
    rows_w_keywords <- str_detect(vector_of_answer, keywords)
    return(df[rows_w_keywords, ])
    }

narrower_wo_keywords <- function(df, keywords) {
    vector_of_answer <- df[ ,'Response'] 
    rows_wo_keywords <- str_detect(vector_of_answer, keywords, negate = TRUE)
    return(df[rows_wo_keywords, ])
    }
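
One detail to watch with these two helpers is missing responses: `str_detect` returns `NA` for an `NA` input, and indexing a dataframe with `NA` produces a row of `NA`s. The sketch below is a base R illustration on made-up data in which blank responses count as "no keyword", because `grepl` returns `FALSE` for `NA`. The helper name `narrow_by_keywords` and its `keep_matches` argument are hypothetical.

```r
# Split a dataframe by whether Response matches a keyword pattern.
narrow_by_keywords <- function(df, keywords, keep_matches = TRUE) {
    hit <- grepl(keywords, df$Response)  # FALSE for NA responses
    rows <- if (keep_matches) hit else !hit
    df[rows, , drop = FALSE]
}

# Made-up data, including one blank response.
toy <- data.frame(
    Student_ID = 1:4,
    Response = c("IT IS LESS DENSE", "OXYGEN INSIDE", "IT IS WET", NA),
    stringsAsFactors = FALSE
)
cc_words <- "(LESS DENSE|NOT HEAVY|OXYGEN|CARBON DIOXIDE)"

with_kw <- narrow_by_keywords(toy, cc_words)
without_kw <- narrow_by_keywords(toy, cc_words, keep_matches = FALSE)

with_kw$Student_ID
# 1 2

without_kw$Student_ID
# 3 4
```

Together the two subsets cover every row exactly once, so nothing is lost or duplicated when the dataset is narrowed step by step.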


In [21]:
## This function adds another column "Correction" (either "Correct" or "Incorrect")

add_correction <- function(df, result = "Correct") {
    ## Label every row of df with the given result
    df$Correction <- result
    return(df)
    }
