This notebook cleans the .txt data on student exam performance for BIO1111 and exports a CSV file that can be analyzed in an R script.
#### Author: Christopher Agard

In [1]:
import string

In [2]:
import pandas as pd
import numpy as np
import os,glob

def cleanSoarExam(data, examNum, fileType='flat',
                  colSpec=[(0,9),(10,21),(22,28),(29,30),(30,32),(32,34),(34,36),(36,38),(39,41),(41,68)],
                  soarSessions=None):
    """Read a fixed-width SOAR exam file and return one row per answer sheet.

    The answer string is split into item_1 ... item_27, and each row is
    labelled 'mine', 'other', or 'key' in the soarType column.
    """
    if fileType != 'flat':
        print("\n{} fileType is not currently supported.".format(fileType))
        return None
    colNames = ['tuid','last','first','middle','unnamed1','unnamed2','unnamed3','soar','ncorrect','item']
    try:
        df = pd.read_fwf(data, colspecs=colSpec, names=colNames)
    except (OSError, ValueError):
        print("\nCould not find file {} or {} was not an acceptable value for colSpec.".format(data, colSpec))
        return None
    df['examNumber'] = examNum
    try:
        # Split the answer string into one column per item (item_1 ... item_27).
        itemNames = ['item_{}'.format(n) for n in range(1, 28)]
        itemData = df['item'].apply(lambda i: pd.Series(list(i)))
        itemData.columns = itemNames
        df = df.merge(itemData, 'outer', left_index=True, right_index=True).drop('item', axis=1)
    except Exception as err:
        print("\nCould not split the item column: {}".format(err))
    # Sections listed in soarSessions are 'mine', the answer-key row is 'key',
    # and everything else (including all rows when soarSessions is empty) is 'other'.
    sessions = [] if soarSessions is None else list(soarSessions)
    df['soarType'] = 'other'
    df.loc[df.soar.isin(sessions), 'soarType'] = 'mine'
    df.loc[df.tuid == 'NNNNNNNNN', 'soarType'] = 'key'
    return df
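
The two core steps of *cleanSoarExam*, reading fixed-width fields and splitting the answer string into one column per item, can be tried on a small in-memory file. This is only a sketch: the two-field layout, the ids, and the answer strings are made up for illustration and do not match the real SOAR column positions.

```python
import io
import pandas as pd

# Hypothetical layout: a 4-character id followed by a 5-character answer string.
raw = "KEY012341\nA00112331\nA00242341\n"
colspecs = [(0, 4), (4, 9)]

# dtype=str keeps the answer string from being parsed as a number.
sheet = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                    names=['sid', 'item'], header=None, dtype=str)

# Expand each answer string into one column per item.
items = sheet['item'].apply(lambda s: pd.Series(list(s)))
items.columns = ['item_{}'.format(n) for n in range(1, items.shape[1] + 1)]
out = sheet.drop(columns='item').join(items)
print(out)
```

Passing `dtype=str` is a choice made for this sketch; the function above relies on the default type inference.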
        

Here we write a function that counts how many students got each item wrong.

In [3]:
# import pandas as pd
# import numpy as np

def nwrong(x, key):
    """
    Count the values in a series that do not match a specified value.

    :param x: pd.Series of responses
    :param key: the correct response; any type, compared as a string
    :return: number of values in x that differ from key
    """
    assert isinstance(x, pd.Series)
    x = x.astype(str)
    key = str(key)
    return x[x != key].count()
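
As a quick check of the counting logic, the sketch below restates the same comparison on a made-up series of responses. The helper name `count_mismatches` and the toy values are hypothetical. Note that a missing response becomes the string `'nan'` after `astype(str)`, so it is counted as wrong.

```python
import numpy as np
import pandas as pd

def count_mismatches(x, key):
    # Same logic as nwrong: compare everything as strings.
    x = x.astype(str)
    return x[x != str(key)].count()

responses = pd.Series(['2', '3', '2', np.nan])  # made-up answers to one item
n_wrong = count_mismatches(responses, key=2)    # an integer key is converted to '2'
print(n_wrong)
```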
    

In [4]:
list(string.ascii_lowercase)[0]

'a'

# Setting up notebook

In [5]:
import pandas as pd
import numpy as np
import os,glob

pd.options.display.max_columns=50

# Getting Exam Data

Here we define the paths for reading exam data and writing results.

In [6]:
homesource = 'S:/Chris/Temple/Biol1111/Fall 2018/Raw Data/Exam 1'
homeoutputFolder = 'S:/Chris/Temple/Biol1111/Fall 2018/Results/Exam 1'
worksource = 'C:/Users/tuh27554/Documents/BIOL1111/Fall 2018/Raw Data/Exam 1'
workoutputFolder = 'C:/Users/tuh27554/Documents/BIOL1111/Fall 2018/Results/Exam 1'

Here we get a list of source data files. For this notebook we will use the \*.txt files.

In [7]:
os.chdir(worksource)
files = glob.glob('*.txt')
print(files)

['V1Raw.txt', 'V2Raw.txt']
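
Because the next cell relies on `files[0]` being version 1 and `files[1]` being version 2, it can help to sort the result of `glob.glob`, whose order is not guaranteed. The sketch below shows one way to list the files without changing the working directory. It uses a temporary folder with empty placeholder files, so nothing here touches the real data.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files standing in for the raw exam files.
    for name in ['V2Raw.txt', 'V1Raw.txt', 'notes.csv']:
        with open(os.path.join(folder, name), 'w') as fh:
            fh.write('')
    # Build the pattern from the folder path and sort for a stable order.
    paths = sorted(glob.glob(os.path.join(folder, '*.txt')))
    names = [os.path.basename(p) for p in paths]

print(names)
```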


And now we read in and clean those data using the *cleanSoarExam* function.

In [8]:
# Each file contains one answer-key row, so subtract 1 to count students.
df1 = cleanSoarExam(files[0],examNum=1,soarSessions=[81])
print('Version 1 of the exam has {} students.'.format(df1.shape[0]-1))
df2 = cleanSoarExam(files[1],examNum=1,soarSessions=[81])
print('Version 2 of the exam has {} students.'.format(df2.shape[0]-1))
os.chdir(workoutputFolder)

Version 1 of the exam has 123 students.
Version 2 of the exam has 122 students.


In [9]:
# df1.head() 

In [10]:
# df2.head() 

# Analyzing Exam Data

## Determining the most problematic items for the class

We need to identify the item columns over which to apply *nwrong*.

In [11]:
itemcols1 = df1.columns[df1.columns.str.contains('item_')]
itemcols2 = df2.columns[df2.columns.str.contains('item_')]

Now we apply the function to count the students who answered each item incorrectly, separately for version 1 and version 2. We print only one of the results here as an example.

In [12]:
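# The answer key is assumed to be the first row of each file, so the first value
# in every item column is the key and the remaining values are student responses.
v1wrong = df1.loc[df1.soarType.isin(['mine','key']),itemcols1].apply(lambda x: nwrong(x=x.iloc[1:],key=x.iloc[0]))
v2wrong = df2.loc[df2.soarType.isin(['mine','key']),itemcols2].apply(lambda x: nwrong(x=x.iloc[1:],key=x.iloc[0]))
v2wrong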

item_1     2
item_2     4
item_3     3
item_4     3
item_5     3
item_6     4
item_7     3
item_8     4
item_9     8
item_10    3
item_11    3
item_12    4
item_13    4
item_14    7
item_15    4
item_16    4
item_17    7
item_18    2
item_19    0
item_20    8
item_21    8
item_22    1
item_23    2
item_24    0
item_25    4
item_26    6
item_27    8
dtype: int64
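
The same count can be written without relying on the position of the key row, by selecting the key explicitly and comparing every student row against it. The sketch below is an alternative under stated assumptions, not the method used in this notebook, and the toy frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'soarType': ['key', 'mine', 'mine', 'mine', 'other'],
    'item_1':   ['1',   '1',    '2',    '2',    '3'],
    'item_2':   ['4',   '3',    '4',    '4',    '4'],
})
item_cols = [c for c in toy.columns if c.startswith('item_')]

# Pull out the key row as a Series indexed by item name.
key_row = toy.loc[toy.soarType == 'key', item_cols].iloc[0]
mine = toy.loc[toy.soarType == 'mine', item_cols]

# Compare each student row with the key column by column, then count mismatches.
wrong = mine.ne(key_row, axis=1).sum()
print(wrong)
```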

Adding these two series together gives the total number wrong on each item. Sorting the resulting series in descending order gives the order in which to discuss the items in class.

In [13]:
totalwrong = v1wrong + v2wrong
orderedwrong = totalwrong.sort_values(ascending=False)
orderedwrong

item_27    21
item_21    20
item_14    15
item_9     13
item_17    12
item_15    10
item_20    10
item_26    10
item_3      9
item_13     9
item_8      8
item_11     8
item_6      8
item_16     7
item_23     6
item_12     6
item_5      6
item_7      6
item_10     5
item_4      4
item_2      4
item_1      4
item_18     4
item_25     4
item_22     2
item_24     1
item_19     0
dtype: int64
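
Adding two series with `+` aligns them on their index labels, which works here because both versions have the same item names. If one version were missing an item, the sum for that item would be NaN. The sketch below, on made-up counts, shows `Series.add` with `fill_value=0` as one way to guard against that.

```python
import pandas as pd

v1 = pd.Series({'item_1': 2, 'item_2': 5})
v2 = pd.Series({'item_1': 2, 'item_2': 3, 'item_3': 4})  # has an extra item

plain = v1 + v2                    # item_3 becomes NaN
total = v1.add(v2, fill_value=0)   # missing counts are treated as 0
total = total.sort_values(ascending=False)
print(total)
```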

Let's also print the letters for the correct answers for each item.

In [14]:

# Each item column holds a single key value; map 1 -> A, 2 -> B, and so on.
v1keys = df1.loc[df1.soarType=='key',itemcols1].apply(lambda x: string.ascii_uppercase[int(x.iloc[0])-1])
v1keys

item_1     B
item_2     D
item_3     B
item_4     B
item_5     C
item_6     A
item_7     D
item_8     B
item_9     C
item_10    B
item_11    C
item_12    D
item_13    D
item_14    C
item_15    A
item_16    C
item_17    A
item_18    A
item_19    A
item_20    C
item_21    B
item_22    C
item_23    B
item_24    A
item_25    C
item_26    A
item_27    D
dtype: object
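
The digit-to-letter conversion is just an index into `string.ascii_uppercase`, shifted by one because the answer codes start at 1. A minimal sketch on a made-up key:

```python
import string
import pandas as pd

key = pd.Series({'item_1': '2', 'item_2': '4', 'item_3': '1'})  # made-up key codes

# Code 1 maps to index 0 ('A'), code 2 to index 1 ('B'), and so on.
letters = key.map(lambda code: string.ascii_uppercase[int(code) - 1])
print(letters)
```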

We do the same for the version 2 answer key.

In [15]:
# The version 2 key is stored in df2.
v2keys = df2.loc[df2.soarType=='key',itemcols2].apply(lambda x: string.ascii_uppercase[int(x.iloc[0])-1])
v2keys

item_1     B
item_2     D
item_3     B
item_4     B
item_5     C
item_6     A
item_7     D
item_8     B
item_9     C
item_10    B
item_11    C
item_12    D
item_13    D
item_14    C
item_15    A
item_16    C
item_17    A
item_18    A
item_19    A
item_20    C
item_21    B
item_22    C
item_23    B
item_24    A
item_25    C
item_26    A
item_27    D
dtype: object

## Evaluating student performance

Now we look at scores by section. To do this we append the two DataFrames; *unnamed1* can be renamed to *version* to keep track of the exam versions.

In [16]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
df = pd.concat([df1, df2])
print(df.shape)
df.to_csv('merged exam 1 results.csv')
# df.head()

(247, 38)
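
If the version should be recorded explicitly rather than through *unnamed1*, one option is to tag each frame before concatenating. This is a sketch on made-up frames; the `version` column name is an assumption.

```python
import pandas as pd

v1 = pd.DataFrame({'tuid': ['a', 'b'], 'ncorrect': [20, 25]})
v2 = pd.DataFrame({'tuid': ['c'], 'ncorrect': [18]})

# Tag each frame with its version, then stack them with a fresh 0..n-1 index.
combined = pd.concat([v1.assign(version=1), v2.assign(version=2)],
                     ignore_index=True)
print(combined)
```

`ignore_index=True` avoids duplicate row labels, which the plain concatenation above keeps.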


We can drop the "key" rows after storing the maximum possible value of ncorrect in a separate variable, *maxscore*.

In [17]:
maxscore = df.ncorrect.max()
df = df.loc[df.soarType!='key',:]
print('maxscore:{}'.format(maxscore))
# df.head()

maxscore:27


We will use this *df* for the rest of our analysis.

In [18]:
scores = df.groupby(['soar']).ncorrect.agg(['min','max','median','mean','count']).sort_values('median')
scores.to_csv('Exam1Score breakdown.csv')
scores

| soar | min | max | median | mean | count |
|---|---|---|---|---|---|
| 21.0 | 12 | 12 | 12.0 | 12.0 | 1 |
| 6.0 | 15 | 15 | 15.0 | 15.0 | 1 |
| 2.0 | 19 | 19 | 19.0 | 19.0 | 1 |
| 71.0 | 7 | 26 | 19.0 | 18.517241 | 29 |
| 75.0 | 9 | 24 | 19.0 | 17.848485 | 33 |
| 83.0 | 12 | 26 | 19.5 | 18.807692 | 26 |
| 84.0 | 11 | 24 | 19.5 | 18.964286 | 28 |
| 81.0 | 12 | 26 | 20.0 | 20.16129 | 31 |
| 87.0 | 12 | 25 | 20.0 | 20.0 | 27 |
| 72.0 | 9 | 26 | 20.5 | 19.21875 | 32 |
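
The per-section summary comes from grouping on the section and passing a list of aggregation names to `agg`. A minimal sketch on made-up scores:

```python
import pandas as pd

toy = pd.DataFrame({'soar': [71, 71, 72, 72, 72],
                    'ncorrect': [10, 20, 18, 22, 26]})

# One row per section, one column per statistic, ordered by median score.
summary = (toy.groupby('soar')['ncorrect']
              .agg(['min', 'max', 'median', 'mean', 'count'])
              .sort_values('median'))
print(summary)
```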


In [19]:
print(df.groupby('soar').ncorrect.quantile(.75))

soar
2.0     19.00
6.0     15.00
21.0    12.00
71.0    22.00
72.0    22.25
73.0    23.00
75.0    20.00
81.0    22.50
83.0    21.75
84.0    21.50
87.0    22.50
Name: ncorrect, dtype: float64


Now let's count the students who scored above the 75th percentile of their own section. This shows how the high scores are distributed within each section.

In [20]:
highAchieverslocal = df.groupby('soar').ncorrect.apply(lambda x: x.loc[x>x.quantile(.75)].count()).reset_index()\
.rename(columns = {'ncorrect':'nAbove75p'})
highAchieverslocal

| | soar | nAbove75p |
|---|---|---|
| 0 | 2.0 | 0 |
| 1 | 6.0 | 0 |
| 2 | 21.0 | 0 |
| 3 | 71.0 | 6 |
| 4 | 72.0 | 8 |
| 5 | 73.0 | 6 |
| 6 | 75.0 | 7 |
| 7 | 81.0 | 8 |
| 8 | 83.0 | 7 |
| 9 | 84.0 | 7 |
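
An equivalent way to get the within-section counts is to compute each student's section cutoff with `transform` and compare directly. This is a sketch on made-up scores, offered as an alternative rather than the approach used above. The comparison is strict, so students who tie with the cutoff are not counted.

```python
import pandas as pd

toy = pd.DataFrame({'soar': ['A'] * 5 + ['B'] * 4,
                    'ncorrect': [10, 12, 14, 16, 18, 20, 20, 20, 20]})

# The 75th percentile of each student's own section, aligned row by row.
cutoff = toy.groupby('soar')['ncorrect'].transform(lambda s: s.quantile(0.75))

# Flag students strictly above their section cutoff and count the flags per section.
above = (toy['ncorrect'] > cutoff).groupby(toy['soar']).sum()
print(above)
```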


Now let's count the students who scored above the 75th percentile of the entire class. This shows how the high scores are distributed across sections.

In [21]:
highAchieversGlobal = df.loc[df.ncorrect>df.ncorrect.quantile(.75)]\
.groupby('soar').ncorrect.count().reset_index()\
.rename(columns = {'ncorrect':'nAbove75p'})
print('The overall median and 75th percentile were {} and {}, \
respectively.'.format(df.ncorrect.median(),df.ncorrect.quantile(.75)))
highAchieversGlobal

The overall median and 75th percentile were 20.0 and 22.0, respectively.


| | soar | nAbove75p |
|---|---|---|
| 0 | 71.0 | 6 |
| 1 | 72.0 | 8 |
| 2 | 73.0 | 11 |
| 3 | 75.0 | 4 |
| 4 | 81.0 | 8 |
| 5 | 83.0 | 4 |
| 6 | 84.0 | 7 |
| 7 | 87.0 | 7 |
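
Filtering before grouping drops any section with no students above the class-wide cutoff, which is why sections 2, 6, and 21 do not appear in the output. If those sections should show up with a count of 0, one option is to group a boolean flag instead. A sketch on made-up scores:

```python
import pandas as pd

toy = pd.DataFrame({'soar': ['A'] * 3 + ['B'] * 4,
                    'ncorrect': [10, 12, 14, 20, 22, 24, 26]})

cutoff = toy['ncorrect'].quantile(0.75)   # class-wide 75th percentile
is_high = toy['ncorrect'] > cutoff

# Filter first: sections without high scorers disappear from the result.
by_filter = toy.loc[is_high].groupby('soar')['ncorrect'].count()

# Group the flag instead: every section is kept, with 0 where none qualify.
by_flag = is_high.groupby(toy['soar']).sum()
print(by_flag)
```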
