# Sorting-Answer 

#### By Dae Hyeun (Isaac) Cheong 

This document contains the code for an algorithm that sorts free-response question (FRQ) responses into three categories: 
* **C**: Likely correct answers, which should still be checked by a human.
* **I**: Likely incorrect answers, which should still be checked by a human. 
* **H**: Answers that the algorithm cannot classify, so a human needs to grade them. 

In [23]:
library(writexl)
library(knitr)
library(dplyr)
library(readxl)
library(stringr)
raw_data <- read_excel("Items 28 and 30-2.xls")
translated_pc <- read_excel("translated_pc.xlsx")

### Summary 

This algorithm sorts each response into one of three categories (C, I, H) using patterns or keywords that are typical of correct and incorrect answers. The algorithm was fairly successful, with an estimated accuracy of around 80%. A human check at the end is strongly recommended to bring the accuracy as close to 100% as possible.

#### Methodology: Demonstration with Bubble Rising Problem

To explain how the algorithm works, I will use the water bubble question, which corresponds to question 14 of the SAT science assessment. In the original dataset, the student responses to this question are recorded in the column "PC".

<img src="scoringrubric.png" alt="rubric" width="700"/>

Before we start sorting, we define some helper functions to make the process smoother. 
Here is a quick summary of what each function does.

1. **detecting_I**: This function detects the rows with a completely wrong response. A response that is missing (NA) or contains the phrase "DONT KNOW" is considered completely wrong. It returns a dataframe with all of these wrong answers.

2. **remove_overlap**: This function takes two dataframes and returns ONE dataframe: the bigger dataframe with every row that also appears in the smaller dataframe removed. Rows are matched by Student_ID.

3. **narrower_w_keywords**: This function takes a dataframe and keywords, and returns the rows that contain at least one of the keywords. 

4. **narrower_wo_keywords**: This function takes a dataframe and keywords, and returns the rows that contain none of the keywords. 

5. **add_correction**: This function takes a dataframe and returns a dataframe with an extra column named Correction. By default this column contains the string "Correct" for all rows, but you can pass a different value (for example "Incorrect").

In [2]:
## This function detects the rows with a completely wrong answer 
## Wrong answer = missing response (NA) or a response containing "DONT KNOW" 
## Returns the dataframe of wrong answers: this will be the "I" (incorrect) group
## Note: "DONT KNOW" should be changed to its Spanish equivalent if we sort the Spanish responses.

detecting_I <- function(df) {
    # Refer to the Response column directly inside filter() so this also works on tibbles
    incorrect <- df %>% filter(is.na(Response) | str_detect(Response, "DONT KNOW"))
    return(incorrect)
}
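
As a quick illustration, here is a minimal sketch of how detecting_I behaves on a toy dataframe. The data and the names `toy`, `flagged`, and `flagged_loose` are made up for this example; only the column names follow the notebook. It also shows that the literal pattern "DONT KNOW" misses the spelling "DON'T KNOW", which an optional apostrophe in the pattern would catch.

```r
library(dplyr)
library(stringr)

# Mirrors detecting_I from the notebook so the example runs on its own
detecting_I <- function(df) {
    df %>% filter(is.na(Response) | str_detect(Response, "DONT KNOW"))
}

# Hypothetical toy data; only the column names match the notebook
toy <- data.frame(
    Student_ID = c(1, 2, 3, 4),
    Response = c("I DONT KNOW", NA, "BECAUSE AIR IS LIGHT", "DON'T KNOW"),
    stringsAsFactors = FALSE
)

# Rows 1 (contains "DONT KNOW") and 2 (missing) are flagged
flagged <- detecting_I(toy)
print(flagged$Student_ID)

# Assumed variant: an optional apostrophe also flags row 4
flagged_loose <- toy %>%
    filter(is.na(Response) | str_detect(Response, "DON'?T KNOW"))
print(flagged_loose$Student_ID)
```

Whether the looser pattern is appropriate depends on how the translated responses are spelled, so it is worth checking against the real data first.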

In [3]:
## This function takes two dataframes
## Returns the bigger dataframe with the rows that also appear in the smaller dataframe removed
## Rows are matched on Student_ID, so Student_ID is assumed to be unique per row

remove_overlap <- function(bigger_df, smaller_df) {
    `%nin%` <- Negate(`%in%`)
    return(bigger_df %>% filter(Student_ID %nin% smaller_df$Student_ID))
}
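
For comparison, dplyr's `anti_join()` gives the same result as remove_overlap when rows are matched on Student_ID. The sketch below uses made-up data (`all_rows` and `flagged` are hypothetical names) and is only meant to show the equivalence, not to replace the function above.

```r
library(dplyr)

# Same behavior as remove_overlap in the notebook, repeated so the example runs on its own
remove_overlap <- function(bigger_df, smaller_df) {
    bigger_df %>% filter(!(Student_ID %in% smaller_df$Student_ID))
}

# Hypothetical toy data
all_rows <- data.frame(
    Student_ID = c(101, 102, 103, 104),
    Response = c("A", "B", "C", "D"),
    stringsAsFactors = FALSE
)
flagged <- all_rows[all_rows$Student_ID %in% c(102, 104), ]

# Both approaches keep the rows of all_rows whose Student_ID is not in flagged
remaining <- remove_overlap(all_rows, flagged)
via_anti_join <- anti_join(all_rows, flagged, by = "Student_ID")

print(remaining$Student_ID)
print(via_anti_join$Student_ID)
```

Both versions rely on Student_ID being unique: if two students shared an ID, flagging one of them would remove both rows.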

In [4]:
## These functions take a dataframe and keywords, 
## and return the rows with (or without) the keywords
## They narrow the entire dataset down to smaller subsets. 
## Example format of keywords: keywords = "ONE|TWO|THREE"
## Note: the keywords above also match rows that contain "TWOFOLD", since it contains TWO. 
## Note: To avoid this, write "ONE| TWO |THREE" or use regex word boundaries such as "\\bTWO\\b".
## Note: Substring matching also has an advantage: we don't need to worry about an added -s (plural / 3rd-person verb).

narrower_w_keywords <- function(df, keywords) {
    # [[ ]] returns a plain vector for data frames and tibbles alike
    vector_of_answer <- df[["Response"]]
    rows_w_keywords <- str_detect(vector_of_answer, keywords)
    return(df[rows_w_keywords, ])
}

narrower_wo_keywords <- function(df, keywords) {
    vector_of_answer <- df[["Response"]]
    rows_wo_keywords <- str_detect(vector_of_answer, keywords, negate = TRUE)
    return(df[rows_wo_keywords, ])
}
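
The difference between substring matching, space padding, and regex word boundaries can be seen in a small sketch. The example responses are made up; `str_detect()` is the same stringr function used in the functions above.

```r
library(stringr)

# Hypothetical responses for illustration
responses <- c("TWO BUBBLES RISE", "A TWOFOLD EFFECT", "BUBBLES RISE", "THE BUBBLE POPS")

# Plain substring match: also hits "TWOFOLD"
substring_match <- str_detect(responses, "TWO")
print(substring_match)   # TRUE TRUE FALSE FALSE

# Space padding: misses "TWO" at the very start of a response
padded_match <- str_detect(responses, " TWO ")
print(padded_match)      # FALSE FALSE FALSE FALSE

# Word boundaries: match TWO only as a whole word, wherever it appears
boundary_match <- str_detect(responses, "\\bTWO\\b")
print(boundary_match)    # TRUE FALSE FALSE FALSE

# An optional S keeps the tolerance for plurals while still requiring a whole word
plural_match <- str_detect(responses, "\\bBUBBLES?\\b")
print(plural_match)      # TRUE FALSE TRUE TRUE
```

Word boundaries avoid the start-of-string and punctuation problems of space padding, at the cost of having to spell out the word endings you want to accept.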


In [5]:
## This function adds another column, "Correction" (either "Correct" or "Incorrect")

add_correction <- function(df, result = "Correct") {
    df$Correction <- result
    return(df) 
    }


## Demonstration 

### STEP 1: Define a dataframe that will contain all incorrect responses using *detecting_I*

In [6]:
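# Extract the columns we want and combine them with the translated responses 
# Note: cbind() pairs rows by position, so translated_pc must be in the same row order as raw_data
# Rename the columns. Make sure the column we want to grade is named "Response"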
our_interest <- select(raw_data, CODECEN, CUADERNILLO, FORMA, PC)
final_df <- cbind(our_interest, translated_pc)
colnames(final_df) <- c("Code", "Student_ID", "Form", "Original_Response","Response")
head(final_df, 5)

Code,Student_ID,Form,Original_Response,Response
141100009,200645,A,PORQUE LAS BURBUJAS ESTAN LLENAS DE AIRE OCIGENO Y NO DURAN DENTRO DEL AGUA POR ESO.,BECAUSE THE BUBBLES ARE FILLED WITH OXYGEN AIR AND DO NOT LAST IN THE WATER FOR THAT REASON.
141100009,200642,B,LAS BURBUJAS SUBEN EN EL AGUA PORQUE SOPLAMOS.,BUBBLES RISE IN THE WATER BECAUSE WE BLOW.
141100009,200643,A,PORQUE LAS BURBUAS SALEN Y SALEN DEL POPOTE Y COMO NO TIENEN GRAVEDAD SUBEN Y SUBEN FLOTANDO.,BECAUSE THE BUBBLES COME AND COME OUT OF THE STRAW AND AS THEY HAVE NO GRAVITY THEY FLOAT UP AND UP.
141100009,200646,B,PORQUE ESAS BURVUJAS SE EXPANDEN,BECAUSE THOSE BURVETS EXPAND
141100009,200644,B,PORQUE CUANDO UNO SOPLA SE ACEN BURBUJAS Y SE VAN A LAS NUBES Y SE ASE AGUA.,"BECAUSE WHEN ONE BLOWS, BUBBLES BECOME AND THEY GO TO THE CLOUDS AND WATER IS GROWN."


In [7]:
# Define the dataframe that will contain all of the incorrect answers
I <- detecting_I(final_df)

# Update final_df so that it no longer contains the rows we detected. 
final_df <- remove_overlap(final_df, I) 

In [8]:
#Check I 
head(I,5)

Code,Student_ID,Form,Original_Response,Response
130800004,200206,B,NO SE,I DONT KNOW
140500001,200154,B,NO LA SE,I DONT KNOW IT
120500071,200220,B,,
20900023,200056,B,,
20900023,200060,B,,


### STEP 2: Add more incorrect answers using *narrower_w_keywords* and *narrower_wo_keywords*

In [9]:
# Since the answer key says a correct answer should refer to bubbles/gas, answers that don't include
# these words are likely to be incorrect 
keywords = "THEY|GAS|BUBBLE|CARBON DIOXIDE|AIR|OXYGEN"
incorrect <- narrower_wo_keywords(final_df, keywords)
final_df <- remove_overlap(final_df, incorrect)
# Row-bind the new set of incorrect answers to the I that we defined in STEP 1.
I = rbind(I, incorrect)

In [10]:
# According to the answer key, some words appear frequently in wrong answers 
keywords = "EVAPORATE|PRESSURE|FORCE|BREATHING|WIND|GRAVITY|WEIGH|WHY|MIX|STRENGTH"
incorrect = narrower_w_keywords(final_df, keywords)
final_df <- remove_overlap(final_df, incorrect)
I = rbind(I, incorrect)

In [11]:
# A common wrong answer contains only "BLOW" without referring to a gas.
incorrect <- final_df %>% narrower_w_keywords("BLOW") %>% narrower_wo_keywords("GAS|CARBON DIOXIDE|AIR|OXYGEN")
final_df <- remove_overlap(final_df, incorrect)
I = rbind(I, incorrect)

In [12]:
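# BLOW + STRAW without mentioning a gas usually appears in wrong responses.
# Note: the previous cell already removed every BLOW response without a gas keyword,
# so this cell is expected to match no additional rows.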
incorrect <- final_df %>% narrower_w_keywords("BLOW") %>% narrower_w_keywords("STRAW") %>% narrower_wo_keywords("GAS|CARBON DIOXIDE|AIR|OXYGEN")
final_df <- remove_overlap(final_df, incorrect)
I = rbind(I, incorrect)

In [13]:
nrow(I)

### STEP 3: Define a dataframe that will contain all correct responses

In [14]:
# According to the answer key, some words/phrases appear frequently in correct responses 
keywords = "LIGHT|LESS DENSE|NOT HEAVY|FLOAT|AIR RISES"
correct = narrower_w_keywords(final_df, keywords) 
final_df <- remove_overlap(final_df, correct)
# Define a dataframe with only the correct responses.
C = correct

In [15]:
# Blow + GAS is usually a correct answer 
correct = final_df %>% narrower_w_keywords("BLOW") %>% narrower_w_keywords("AIR|OXYGEN|CARBON DIOXIDE|GAS")
final_df <- remove_overlap(final_df, correct)
#Row bind our new set of correct answer to the existing dataframe
C = rbind(C, correct)

In [16]:
# AIR combined with BUBBLE, PUSH, RISE, EXPELL, or OUT is a common pattern in correct answers 
# Note: "EXPELL" matches EXPELLED and EXPELLING but not EXPEL or EXPELS
correct = final_df %>% narrower_w_keywords("AIR") %>% narrower_w_keywords("BUBBLE|PUSH|RISE|EXPELL|OUT")
final_df <- remove_overlap(final_df, correct)
C = rbind(C, correct)

In [17]:
# Students who answer with a scientific term such as OXYGEN or CARBON DIOXIDE have a higher chance of being correct
correct = final_df %>% narrower_w_keywords("OXYGEN|CARBON DIOXIDE")
final_df <- remove_overlap(final_df, correct)
C = rbind(C, correct)

In [18]:
# WATER + PUSH or AIR + PUSH is usually a correct answer 
correct = final_df %>% narrower_w_keywords("WATER|AIR") %>% narrower_w_keywords("PUSH")
final_df <- remove_overlap(final_df, correct)
#Row bind our new set of correct answer to the existing dataframe
C = rbind(C, correct)

### STEP 4: Assign the unclassified responses to class H, then add a column to each dataframe using *add_correction*

In [19]:
C = add_correction(C) 
I = add_correction(I, "Incorrect")

# Label group H as "Incorrect" by default because it makes manual grading easier.
H = final_df %>% add_correction("Incorrect")
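
At this point every response should sit in exactly one of C, I, or H. Below is a minimal sketch of a sanity check for that, assuming a copy of the full dataset was saved before STEP 1 (the notebook overwrites final_df, so `original` here is a hypothetical name, as is `check_partition`). The dataframes are toy stand-ins.

```r
# Hypothetical stand-ins for the notebook's dataframes
original <- data.frame(Student_ID = c(1, 2, 3, 4, 5, 6))
C <- data.frame(Student_ID = c(1, 4))
I <- data.frame(Student_ID = c(2, 3, 6))
H <- data.frame(Student_ID = 5)

# TRUE only if C, I, and H together contain every original ID exactly once
check_partition <- function(original, C, I, H) {
    ids <- c(C$Student_ID, I$Student_ID, H$Student_ID)
    no_duplicates <- anyDuplicated(ids) == 0
    same_size <- length(ids) == nrow(original)
    same_ids <- setequal(ids, original$Student_ID)
    no_duplicates && same_size && same_ids
}

print(check_partition(original, C, I, H))

# A response that lands in two groups makes the check fail
print(check_partition(original, C, rbind(I, C), H))
```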

In [20]:
sample_n(I, 20)

Code,Student_ID,Form,Original_Response,Response,Correction
121200006,200712,B,POR QUE AL SOPOPLAR CON UNA PAGIA EL VIENTO QUE UNO PORPOSPONA ASE QUE EL AGUA SE LEBANTE COMO BURBUJAS SOBLE EL VASO,BECAUSE BY BLOWING WITH A PAGE THE WIND THAT ONE POSTPOSES MAKES THE WATER RISE LIKE BUBBLES SOUND THE GLASS,Incorrect
180100026,200120,B,8,8,Incorrect
130800004,200207,A,PORQUE EL AGUA FLUYE Y LAS ELEBA A LA SURPERFICIE.,BECAUSE THE WATER FLOWS AND LIFTS THEM TO THE SURFACE.,Incorrect
180100186,200594,B,LAS BURBUJAS SUBEN EN EL AGUA PORQUE LA FUERZA QUE LE HACE EL AIRE ABAJO ELLAS SUBEN HACIA ARRIBA,THE BUBBLES RISE IN THE WATER BECAUSE THE FORCE THAT THE AIR MAKES IT DOWN THEY RISE UPWARDS,Incorrect
30100041,200741,A,PORQUE EL BAPOR DEL AGUA ES MUY SUFISIENTE,BECAUSE THE STEAM OF THE WATER IS VERY SUFFICIENT,Incorrect
20900001,200554,B,POR QUE LAS BURBUJAS SON DE HAIRE Y EL HAIRE ESTA ARRIBA DEL AGUA POR ESO LAS BURBUJASSUBEN,BECAUSE THE BUBBLES ARE MADE OF HAIR AND THE HAIR IS ABOVE THE WATER THAT'S WHY THE BUBBLES RISE,Incorrect
160700004,200133,A,POR QUE EN LO QUE SOPLA POR LA PAJILLA EL AIRE CAEN CON FUERZA PRODUCE UN BURBUJITAS PEQUE!AS,"BECAUSE AS IT BLOWS THROUGH THE STRAW, THE AIR FALLS WITH FORCE, PRODUCING SMALL BUBBLES!",Incorrect
141100009,200644,B,PORQUE CUANDO UNO SOPLA SE ACEN BURBUJAS Y SE VAN A LAS NUBES Y SE ASE AGUA.,"BECAUSE WHEN ONE BLOWS, BUBBLES BECOME AND THEY GO TO THE CLOUDS AND WATER IS GROWN.",Incorrect
30100041,200742,B,PORQUE BARANTE EL TIEMPO QUE ABSORVE COMO ENSALA Y DESASALA,BECAUSE IT BARANTE THE TIME THAT ABSORBS AS SALAD AND DESALATION,Incorrect
140500001,200154,B,NO LA SE,I DONT KNOW IT,Incorrect


### STEP 5: Analyze group H to discover other keywords/patterns for incorrect/correct answers

This step is important because we might discover other important keywords or patterns for incorrect and correct answers. For example, I carefully looked through group H and noticed that the word "WHY" appears in many of the wrong answers. I therefore went back to STEP 2 and added "WHY" to the keywords so that more incorrect responses are detected.  
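
One possible way to support this manual inspection is to count how many responses in group H contain each word and then read the most frequent ones. The sketch below uses base R on a made-up stand-in for H; the names `tokens`, `words`, and `word_counts` are hypothetical, and it assumes the responses are upper-case as in the notebook.

```r
# Hypothetical stand-in for group H
H <- data.frame(
    Student_ID = c(1, 2, 3),
    Response = c("WHY THE WATER MOVES", "WHY IT IS HOT", "THE STRAW MOVES"),
    stringsAsFactors = FALSE
)

# Split each response into words and keep each word once per response
tokens <- lapply(strsplit(H$Response, "[^A-Z']+"), unique)
words <- unlist(tokens)
words <- words[nzchar(words)]

# Number of responses that contain each word, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 10))
```

Frequent words still need human judgment: a word is only a useful keyword if it separates correct from incorrect answers, not just because it is common.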

### STEP 6: Export each dataframe to an XLSX file. After a human check, merge the files into one.

In [25]:
dim(I)
dim(C)
dim(H)

In [24]:
write_xlsx(C, 'Correct_Response.xlsx')
write_xlsx(I, 'Incorrect_Response.xlsx')
write_xlsx(H, 'Unsorted_Response.xlsx')
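
Once the files have been checked by a human, they can be merged and compared against the algorithm's labels. The sketch below shows one possible way to do this; it is not necessarily how the accuracy estimate in the Summary was obtained. It assumes the grader adds a column named `Human_Correction` (a hypothetical name), and it uses small in-memory dataframes in place of the reviewed XLSX files.

```r
# Hypothetical reviewed sheets; Human_Correction is an assumed column added by the grader
C_checked <- data.frame(
    Student_ID = c(1, 2),
    Correction = "Correct",
    Human_Correction = c("Correct", "Incorrect"),
    stringsAsFactors = FALSE
)
I_checked <- data.frame(
    Student_ID = c(3, 4),
    Correction = "Incorrect",
    Human_Correction = c("Incorrect", "Incorrect"),
    stringsAsFactors = FALSE
)

# Merge the reviewed groups into one dataframe
merged <- rbind(C_checked, I_checked)

# Share of responses where the algorithm's label matches the human label
agreement <- mean(merged$Correction == merged$Human_Correction)
print(agreement)

# The same share, split by the algorithm's label
by_group <- tapply(merged$Correction == merged$Human_Correction, merged$Correction, mean)
print(by_group)
```

Group H is left out of this comparison because its "Incorrect" label is only a default, not a prediction.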

The question about the use of wood is pretty simple. 

In [None]:
question_10_Part_1 <- select(raw_data, CODECEN, CUADERNILLO, FORMA, "_v28")
question_10_Part_2 <- select(raw_data, CODECEN, CUADERNILLO, FORMA, "_v29")
colnames(question_10_Part_1) <- c("Code", "Student_ID", "Form", "Response")
colnames(question_10_Part_2) <- c("Code", "Student_ID", "Form", "Response")
head(question_10_Part_1, 5)