## Data: Student Exam Answers

The data described here were used to evaluate the machine-scoring method presented in the [Score Exams notebook](Score%20Exams.ipynb). This notebook expands upon the description of data preparation found there.

More than one thousand exam answers were obtained as PDF files containing answers to thirteen free response questions, drawn from six Suffolk University Law School final exams from courses taught by five instructors. Each PDF corresponded to a single student exam. The PDFs were parsed to extract the answers, and their contents were converted into XML of the following format.

```
<EXAM>
    <STUDENT id='ID'>00000000</STUDENT>
    <QUESTION id='Q1'>
        <![CDATA[
            text of written answer to question one
        ]]>
    </QUESTION>
    <QUESTION id='Q2'>
        <![CDATA[
            text of written answer to question two
        ]]>
    </QUESTION>
    <QUESTION id='Q3'>
        <![CDATA[
            text of written answer to question three
        ]]>
    </QUESTION>
</EXAM>
```

These translated files were reviewed by hand and reformatted as needed to correct any formatting errors. Each XML file was then read into a csv file for its associated exam as a single row, with columns corresponding to each of its elements (e.g., ID, Q1, Q2). Additionally, a column stating the number of words contained in each answer was appended to the csv file (e.g., size_Q1, size_Q2). For example:

|ID|Q1|Q2|size_Q1|size_Q2|
|--|--|--|-------|-------|
|00001|text of 1's ans to q1|text of 1's ans to q2|6|6|
|00002|text of 2's ans to q1|text of 2's ans to q2|6|6|
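
As a rough illustration of this step, the sketch below turns one XML string into a single csv-style row with word counts. It is a simplified stand-in for the cells later in this notebook, not the code that produced the data; the helper name `xml_to_row` and the sample ID and answers are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# hypothetical sample exam in the XML format shown above
SAMPLE = """<EXAM>
    <STUDENT id='ID'>00001</STUDENT>
    <QUESTION id='Q1'>
        <![CDATA[
            text of 1's ans to q1
        ]]>
    </QUESTION>
    <QUESTION id='Q2'>
        <![CDATA[
            text of 1's ans to q2
        ]]>
    </QUESTION>
</EXAM>"""

def xml_to_row(xml_string):
    """Return a dict with one entry per element id plus size_* word counts."""
    root = ET.fromstring(xml_string)
    row = {}
    for node in root:
        # collapse whitespace so indentation inside CDATA is not kept
        row[node.attrib["id"]] = re.sub(r"\s+", " ", node.text).strip()
    # count words for every column other than the ID
    for key in [k for k in row if k != "ID"]:
        row["size_%s" % key] = len(row[key].split())
    return row

row = xml_to_row(SAMPLE)
print(row)
```

The ID is kept as a string here so that leading zeros survive.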

Instructors also provided scores for each exam question. These were placed in a csv file for each exam, with the scores on each question keyed to the exam ID. For example:

|ID|Q1|Q2|
|--|--|--|
|00001|96|93|
|00002|83|89|
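
Because the answer file and the score file share the ID column, the two can be lined up with a join. The sketch below is only an illustration using the toy values from the tables above; the helper `join_on_id` and the `_score` suffix are naming choices made here, not something used elsewhere in this project.

```python
import csv
import io

# toy csv contents mirroring the two tables above
TEXTS_CSV = """ID,Q1,Q2,size_Q1,size_Q2
00001,text of 1's ans to q1,text of 1's ans to q2,6,6
00002,text of 2's ans to q1,text of 2's ans to q2,6,6
"""
SCORES_CSV = """ID,Q1,Q2
00001,96,93
00002,83,89
"""

def join_on_id(texts_csv, scores_csv):
    """Pair each answer row with its scores, keyed by the shared ID column."""
    scores = {r["ID"]: r for r in csv.DictReader(io.StringIO(scores_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(texts_csv)):
        score_row = scores.get(row["ID"])
        if score_row is None:
            continue  # no score on file for this exam ID
        merged = dict(row)
        for col, value in score_row.items():
            if col != "ID":
                merged["%s_score" % col] = int(value)
        joined.append(merged)
    return joined

joined = join_on_id(TEXTS_CSV, SCORES_CSV)
for r in joined:
    print(r["ID"], r["Q1_score"], r["Q2_score"])
```

Reading the IDs as strings, as the `csv` module does, avoids losing leading zeros when matching rows.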

In keeping with the wishes of those instructors who provided exams, only three of the exams are available here for review (i.e., [property_instructor_A](https://colarusso.github.io/free-response-scoring/exam_questions/property_instructor_A.docx), [property_instructor_B](https://colarusso.github.io/free-response-scoring/exam_questions/property_instructor_B.docx), and [crim_instructor_E](https://colarusso.github.io/free-response-scoring/exam_questions/crim_instructor_E.docx)). Another two are on file with the author and may be shared upon request, with the assent of their authors. In keeping with its author's wishes, the remaining exam will not be shared. Subject to constraints imposed by the Family Educational Rights and Privacy Act (FERPA), three of the answer sets may be shared upon request. One of these is linked to an exam that requires instructor assent to be shared.

Please note that the exam questions shared at the links above are docx files and have been redacted to exclude the instructor's name and any text other than the scored question prompts (e.g., multiple choice questions).

## Load packages

In [None]:
import csv
import os
import re

import numpy as np
import pandas as pd
import pdftotext


## Parse PDF files

To facilitate retrieval of the exam answers, which were stored in various folders, a list of dictionaries is defined. Each dictionary gives the folder where an exam's files can be found, as well as the names of the relevant columns for later consideration (i.e., the ID and the questions to be scored). To avoid sharing these data publicly, the files are not included in this repository; as seen below, they were located outside of this repository's folder (i.e., `../data/`).

In [None]:
exams = [
            {"folder":"../data/property_instructor_A","columns":["ID","SHORT_ANS","Q1","Q2"]},
            {"folder":"../data/property_instructor_B","columns":["ID","Q1","Q2"]},
            {"folder":"../data/environ_instructor_B","columns":["ID","Q2"]},
            {"folder":"../data/PR_instructor_C","columns":["ID","Q1","Q2"]},
            {"folder":"../data/contracts_instructor_D","columns":["ID","Q1","Q2","Q3"]},
            {"folder":"../data/crim_instructor_E","columns":["ID","Q1","Q2"]}
          ]

The `data` folder contains a folder named for each exam (e.g., `property_instructor_A`). After processing, each exam folder contains three folders and two csv files. To begin with, a file containing the exam IDs and scores (`actual.csv`) is placed in the exam folder. The individual pdf files (one for each student exam) are placed in the `pdfs` folder. The first round of processing places XML translations of the pdf files into the `texts` folder. A copy of these is then placed in the `texts_cleaned` folder, where manual reformatting takes place. After this step, the files are compiled into the `texts.csv` file for use in the [Score Exams notebook](Score%20Exams.ipynb). This process is performed below.

```
data
|---- property_instructor_A
     |---- pdfs
     |    |---- 0001_property_instructor_A.pdf 
     |    |---- 0002_property_instructor_A.pdf 
     |    |---- 0003_property_instructor_A.pdf 
     |---- texts
     |    |---- 0001.xml 
     |    |---- 0002.xml 
     |    |---- 0003.xml 
     |---- texts_cleaned
     |    |---- 0001.xml 
     |    |---- 0002.xml 
     |    |---- 0003.xml 
     |---- actual.csv
     |---- texts.csv     
```
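
The cells below write into `texts` and read from `texts_cleaned`, so those folders need to exist before processing starts. A minimal sketch for creating the layout is shown here; the helper `make_exam_folders` is hypothetical, and a temporary directory stands in for `../data/` so that the example does not touch real data.

```python
import os
import tempfile

SUBFOLDERS = ["pdfs", "texts", "texts_cleaned"]

def make_exam_folders(data_dir, exam_name):
    """Create data_dir/exam_name with the three subfolders described above."""
    exam_dir = os.path.join(data_dir, exam_name)
    for sub in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(os.path.join(exam_dir, sub), exist_ok=True)
    return exam_dir

# a temporary directory is used here in place of ../data/
with tempfile.TemporaryDirectory() as data_dir:
    exam_dir = make_exam_folders(data_dir, "property_instructor_A")
    created = sorted(os.listdir(exam_dir))
    print(created)
```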

In [None]:
def read_pdf (file, rules=None):
    
    # open the pdf file defined in `file`
    with open(file, "rb") as f:
        doc = pdftotext.PDF(f)
        
    # discover student ID
    student_id = re.search(r"ID:\s+(\d+)", doc[0])[1]
    # discover exam type, file name, and exam date from the first page
    exam_type = re.search(r"Exam Name:\s+([^-]*)", doc[0])[1].strip()
    # escape the exam type so it is matched literally
    file_name = re.search(r"%s.(\S*)" % re.escape(exam_type), doc[0])[1].strip()
    exam_date = re.search(r"Exam Date:(.*)", doc[0])[1].strip()
    
    # print details for review
    print("Student ID:",student_id)
    print("Exam Type:",exam_type)
    print("File:",file_name)
    print("Pages",len(doc))
    print("\n")

    # begin construction of XML
    text = "<EXAM>\n<STUDENT id='ID'>%s</STUDENT>\n"%student_id

    # run exam specific rules: for these exams Q1 is opened up front;
    # property_instructor_A opens its questions from headings found in the text
    if rules in ("property_instructor_B", "environ_instructor_B", "PR_instructor_C",
                 "crim_instructor_E", "contracts_instructor_D"):
        text = text + "<QUESTION id='Q1'>\n<![CDATA[\n"
    
    # for each page of the pdf
    i = 0
    for page in doc:
        # for pages other than the first page
        if i>0:
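            # parse text for construction of xml:
            # strip file names, page counts, and the end-of-exam marker
            page = re.sub(r"\s*.*xmdx", " ", page)
            page = re.sub(r"\d+\s?of\s?%s" % len(doc), " ", page)
            page = re.sub(r"\d+\s+of", " ", page)
            page = re.sub(r"of\s?%s" % len(doc), " ", page)
            page = re.sub(r"\s*END OF EXAM\s*", " ", page)

            # strip values found on the first page; they are escaped so that
            # any regex metacharacters in them are matched literally
            page = re.sub(r"%s.%s" % (re.escape(exam_type), re.escape(file_name)), " ", page)
            page = re.sub("ID:", " ", page)
            page = re.sub(re.escape(exam_date), " ", page)
            page = re.sub(re.escape(student_id), " ", page)
            page = re.sub(r"\s*\(Exam Number\)\s*Exam Name:\s*Exam Date:\s*1\)", "", page)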

            # run exam specific rules
            if rules=="property_instructor_A":
                page = re.sub("\n\s*((Short Answer|Part (II[^I]|2)).*)","\n\n<QUESTION id='SHORT_ANS' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)
                page = re.sub("\n\s*(Part (III|3).*)","\n\n]]>\n</QUESTION>\n",page, 0, re.IGNORECASE)
                page = re.sub("\n\s*((Essay\s)?(Question|Quesiton)\s?1.*)","\n\n<QUESTION id='Q1' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)
                page = re.sub("\n\s*((Essay\s)?(Question|Quesiton)\s?2.*)","\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                
            elif rules=="property_instructor_B":
                page = re.sub("\n\s*(\(?2(\.|\)|:))","\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                
            elif rules=="environ_instructor_B":
                page = re.sub("\n\s*(\(?2(\.|\)|:))","\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                
            elif rules=="PR_instructor_C":
                page = re.sub("\n\s*(\(?2(\.|\)|:))","\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                
            elif rules=="crim_instructor_E":
                page = re.sub("\n\s*((Essay|Question)?\s?\(?2(\.|\)|:|\s))","\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                
            elif rules=="contracts_instructor_D":
                page = re.sub("\n\s*((Essay|Question)?\s?\(?2(\.|\)|:|\s))","\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                
                page = re.sub("\n\s*((Essay|Question)?\s?\(?3(\.|\)|:|\s))","\n\n]]>\n</QUESTION>\n<QUESTION id='Q3' tell=\"\\1\">\n<![CDATA[\n",page, 0, re.IGNORECASE)                

            page = re.sub("\r\n \r\n"," ",page)
            page = re.sub("\r\n ","\n\n",page)
            page = re.sub("\r\n"," ",page)

            if (page.strip() not in text):
                text = text + " " + page
                
        i = i + 1

    text = text + "\n\n]]>\n</QUESTION>\n</EXAM>\n"
    
    text = re.sub(" +"," ",text)
    print("\n")
        
    return student_id, text
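
To make the exam-specific rules in `read_pdf` more concrete, the sketch below applies a pattern modeled on the `crim_instructor_E` rule to two made-up page fragments. The sample text and the helper name `split_on_q2` are assumptions for illustration. The second fragment shows one kind of false match (a line that happens to begin with "2") that a hand review of the XML can catch; the `tell` attribute records the text that triggered the split, which makes such cases easier to spot.

```python
import re

# pattern and replacement modeled on the crim_instructor_E rule in read_pdf
Q2_PATTERN = r"\n\s*((Essay|Question)?\s?\(?2(\.|\)|:|\s))"
Q2_REPLACEMENT = "\n\n]]>\n</QUESTION>\n<QUESTION id='Q2' tell=\"\\1\">\n<![CDATA[\n"

def split_on_q2(page):
    """Close the open question and start Q2 wherever a heading-like line is found."""
    return re.sub(Q2_PATTERN, Q2_REPLACEMENT, page, flags=re.IGNORECASE)

# hypothetical page text: a real heading for the second answer
heading = split_on_q2("End of the first answer.\nQuestion 2: Start of the second answer.")
print(heading)

# hypothetical page text: a line that merely starts with the digit 2
false_positive = split_on_q2("The notice arrived late.\n2 weeks passed before any reply.")
print(false_positive)
```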

In [None]:
def clean_text (text,mode=0):
    if mode==0:
        # collapse line breaks and any other runs of whitespace into single spaces
        content = re.sub(r'\s+', ' ', text)
    elif mode==1:
        # replace runs of non-word characters with a space, then lowercase
        content = re.sub(r'[^\w]+', " ", text)
        content = content.lower()
    elif mode==2:
        # lowercase only
        content = text.lower()
    else:
        raise ValueError("mode must be 0, 1, or 2")
    return content

In [None]:
import xml.etree.ElementTree as et
def parse_XML(xml_file): 
    # read one exam's XML file into a single-row DataFrame,
    # with one column per element id (e.g., ID, Q1, Q2)
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    cols = []
    ans = []
    
    for node in xroot: 
        cols.append(node.attrib.get("id"))
        ans.append(clean_text(node.text))
    
    out_df = pd.DataFrame([ans],columns=cols)
    return out_df

In [None]:
# for each exam
for exam in exams:
    print("\n=========================================================")
    print(exam["folder"])
    print("=========================================================\n")
    files = os.listdir("%s/pdfs"%exam["folder"])
    
    # for each individual student's exam
    for file in files:
        # read the exam and create xml; the rules are keyed by the exam's
        # folder name (e.g., property_instructor_A), not its full path
        student_id, text = read_pdf("%s/pdfs/%s"%(exam["folder"],file), rules=os.path.basename(exam["folder"]))
        # write xml to file
        with open("%s/texts/%s.xml"%(exam["folder"],student_id), "w", encoding="utf-8") as text_file:
            text_file.write(text)

## Now copy the contents of `texts` into `texts_cleaned` and clean the data by hand. Then run the cells below.

This will take the cleaned xml files and compile them into the `texts.csv` file with exam ID, question text, and a count of words for each question. 

In [None]:
def word_count(row):
    return len(row.split())

In [None]:
write = 1
for exam in exams:
    print("\n=========================================================")
    print(exam["folder"])
    print("=========================================================\n")
    files = os.listdir("%s/texts_cleaned"%exam["folder"])
    df_tmp = pd.DataFrame()
    i = 1
    for file in files:
        try:
            df_tmp = pd.concat([df_tmp,parse_XML("%s/texts_cleaned/%s"%(exam["folder"],file))])
            print("%s)\t%s"%(i,file))
        except Exception as e:
            # report the file and the reason so it can be fixed by hand
            print("%s)\t%s (error: %s)"%(i,file,e))
        i = i + 1
        
    df_tmp = df_tmp.reset_index(drop=True)
    df_tmp = df_tmp[exam["columns"]].dropna()
    
    print("")
    print("N:",len(df_tmp))

    for column in [x for x in exam["columns"] if x != "ID"]:
        df_tmp["size_%s"%column]= df_tmp[column].apply(word_count)
        print("%s Mean Words:"%column,int(round(df_tmp["size_%s"%column].mean())))
        
    
    display(df_tmp)    
    if write ==1:
        df_tmp.to_csv("%s/texts.csv"%exam["folder"], index=False, encoding="utf-8")            

## Scrub exam answers of student IDs

In [None]:
import shutil
import hashlib
from random import randrange

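def hashme(w,seed):
    # hash a numeric ID together with a seed; IDs that cannot be read as integers become NaN
    try:
        w = int(w)
        w = str(w)+str(seed)
        h = hashlib.md5(str(w).encode('utf-8'))
        return h.hexdigest()
    except (TypeError, ValueError, OverflowError):
        return np.nan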

In [None]:
for exam in exams:
    print("\n=========================================================")
    print(exam["folder"])
    print("=========================================================\n")
    
    shutil.copyfile("%s/texts.csv"%exam["folder"], "%s/texts_redacted.csv"%exam["folder"])
    shutil.copyfile("%s/actual.csv"%exam["folder"], "%s/actual_redacted.csv"%exam["folder"])
    
    df1_tmp = pd.read_csv("%s/texts_redacted.csv"%exam["folder"], encoding="utf-8")
    df2_tmp = pd.read_csv("%s/actual_redacted.csv"%exam["folder"], encoding="utf-8")

    seed = randrange(1000000)
    
    display(df1_tmp)
    df1_tmp["ID"] = df1_tmp["ID"].apply(hashme, args=(seed,))
    display(df1_tmp)
    display(df2_tmp)
    df2_tmp["ID"] = df2_tmp["ID"].apply(hashme, args=(seed,))
    display(df2_tmp)

    df1_tmp.to_csv("%s/texts_redacted.csv"%exam["folder"], index=False, encoding="utf-8")  
    df2_tmp.to_csv("%s/actual_redacted.csv"%exam["folder"], index=False, encoding="utf-8")  
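
Drawing one seed per exam and applying it to both files means that the same ID hashes to the same value in `texts_redacted.csv` and `actual_redacted.csv`, so answers and scores can still be matched after redaction. The short sketch below illustrates that property with made-up IDs; `pseudonymize` is a hypothetical stand-in that mirrors the hashing step, not a replacement for it.

```python
import hashlib

def pseudonymize(student_id, seed):
    """Hash an integer-like ID together with a seed (mirrors the step above)."""
    salted = str(int(student_id)) + str(seed)
    return hashlib.md5(salted.encode("utf-8")).hexdigest()

seed = 12345  # fixed here for illustration; the notebook draws it at random

# made-up IDs as they might appear in texts.csv and actual.csv
text_ids = [1001, 1002]
score_ids = [1002, 1001]

hashed_texts = {pseudonymize(i, seed) for i in text_ids}
hashed_scores = {pseudonymize(i, seed) for i in score_ids}

# same seed: the two files still share identifiers
print(hashed_texts == hashed_scores)
# different seed: the identifiers no longer line up
print(pseudonymize(1001, seed) == pseudonymize(1001, seed + 1))
```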

For exams that are going to be shared, read through the answers to make sure there is no personally identifiable information (PII) in the plain text. If there is, remove it from the dataset.
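
A simple screen can help prioritize this manual read, though it cannot replace it. The sketch below flags answers containing long digit runs or email-like strings; the patterns, the helper name `flag_possible_pii`, and the sample answers are all assumptions made for illustration, and names or other free-text identifiers would not be caught.

```python
import re

# illustrative patterns only: long digit runs (possible IDs) and email-like strings
PII_PATTERNS = {
    "digits": re.compile(r"\b\d{6,}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

def flag_possible_pii(answers):
    """Return (index, label, match) tuples for answers worth a closer look."""
    hits = []
    for idx, answer in enumerate(answers):
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.findall(answer):
                hits.append((idx, label, match))
    return hits

# made-up answers for demonstration
answers = [
    "The easement was created by prescription.",
    "Exam ID 00123456: the tenant has a claim for constructive eviction.",
    "Please email me at student@example.edu with questions.",
]
hits = flag_possible_pii(answers)
for hit in hits:
    print(hit)
```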