This notebook cleans .txt data on student exam performance for BIO1111 and exports a CSV that can be analyzed in an R script.
#### Author: Christopher Agard

In [1]:
import pandas as pd
import numpy as np
import os,glob,string
import plotly
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_config_file(world_readable=True)

def cleanSoarExam(data, examNum, fileType='flat',
                  colSpec=[(0,9),(10,21),(22,28),(29,30),(30,32),(32,34),(34,36),(36,38),(39,41),(41,68)],
                  soarSessions=None):
    """Read a fixed-width exam file and return one row per answer sheet,
    with one column per item and a soarType label ('mine', 'other' or 'key')."""
    if fileType != 'flat':
        raise ValueError("{} fileType is not currently supported.".format(fileType))
    if soarSessions is None:
        soarSessions = []

    names = ['tuid','last','first','middle','unnamed1','unnamed2','unnamed3','soar','ncorrect','item']
    try:
        df = pd.read_fwf(data, colspecs=colSpec, names=names)
    except (OSError, TypeError, ValueError):
        print("\nCould not find file {} or {} was not an acceptable value for colSpec.".format(data, colSpec))
        raise

    df['examNumber'] = examNum
    # Split the answer string into one column per item: item_1 ... item_27.
    itemData = df.item.apply(lambda i: pd.Series(list(i)))
    itemData.columns = ['item_' + str(n) for n in range(1, 28)]
    df = df.merge(itemData, 'outer', left_index=True, right_index=True).drop('item', axis=1)

    # Label each row by section; the answer key has 'NNNNNNNNN' as its ID.
    mine = df.soar.isin(soarSessions)
    df.loc[mine, 'soarType'] = 'mine'
    df.loc[~mine, 'soarType'] = 'other'
    df.loc[df.tuid == 'NNNNNNNNN', 'soarType'] = 'key'
    return df
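
To see how the `colSpec` pairs carve up each record, here is a minimal sketch on an invented two-line file. The four-field layout (ID, SOAR section, number correct, answers) is a shortened, hypothetical stand-in for the real ten-field layout, and both records are made up. Each pair is a half-open `(start, end)` character range.

```python
import io
import pandas as pd

# Two invented fixed-width records: the answer key first, then one student.
raw = (
    "NNNNNNNNN     4 4213\n"
    "915000001 81  3 4211\n"
)

# Hypothetical layout: half-open (start, end) character positions per field.
colspecs = [(0, 9), (10, 12), (13, 15), (16, 20)]
names = ["tuid", "soar", "ncorrect", "item"]

# Keep the ID and the answer string as text so they are not parsed as numbers.
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names,
                 dtype={"tuid": str, "item": str})

# Split the answer string into one column per item, as cleanSoarExam does.
items = df["item"].apply(lambda s: pd.Series(list(s)))
items.columns = ["item_" + str(n) for n in range(1, items.shape[1] + 1)]
df = df.drop(columns="item").join(items)
print(df)
```

The key row has a blank SOAR field, so `soar` is read as missing for that row and the column becomes floating point, which is why the section numbers appear as `81.0` rather than `81` further down.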
        

Here we need to write a function to determine how many students get each item wrong.

In [2]:
# import pandas as pd
# import numpy as np

def nwrong(x, key):
    """Return the number of values in x that do not match key.

    :param x: pd.Series of responses
    :param key: the correct response (anything but None); compared as a string
    """
    assert isinstance(x, pd.Series)
    x = x.astype(str)
    key = str(key)
    return x[x!=key].count()
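
As a quick check of the behaviour, here is a minimal sketch that calls the same logic on five invented responses. Because both sides are converted to strings, an integer key matches string responses.

```python
import pandas as pd

def nwrong(x, key):
    # Same logic as the function above: compare as strings, count mismatches.
    assert isinstance(x, pd.Series)
    x = x.astype(str)
    key = str(key)
    return x[x != key].count()

# Invented responses to one item; the correct answer is 4.
responses = pd.Series(["4", "2", "4", "1", "4"])
n = nwrong(responses, 4)
print(n)  # 2 responses differ from the key
```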
    

In [3]:
list(string.ascii_lowercase)[0]

'a'

# Setting up notebook

In [4]:
import pandas as pd
import numpy as np
import os,glob

pd.options.display.max_columns=50

# Getting Exam Data

Here we define paths for getting exam data and outputting results.

In [5]:
homesource = 'S:/Chris/Temple/Biol1111/Fall 2018/Raw Data/Exam 3'
homeoutputFolder = 'S:/Chris/Temple/Biol1111/Fall 2018/Results/Exam 3'
worksource = 'C:/Users/tuh27554/Documents/BIOL1111/Fall 2018/Raw Data/Exam 3'
workoutputFolder = 'C:/Users/tuh27554/Documents/BIOL1111/Fall 2018/Results/Exam 3'

Here we get a list of source data files. For this notebook we will use the \*.txt files.

In [6]:
os.chdir(worksource)
files = glob.glob('*.txt')
print(files)

['V31Raw.txt', 'V32Raw.txt']


And now we read in and clean those data using the *cleanSoarExam* function.

In [7]:
df1 = cleanSoarExam(files[0],examNum=1,soarSessions=[81])
df1=df1.drop(columns=['last','first','middle'])
print('Version 1 of the exam has {} students.'.format(df1.shape[0]-1))
df2 = cleanSoarExam(files[1],examNum=1,soarSessions=[81])
df2=df2.drop(columns=['last','first','middle'])
print('Version 2 of the exam has {} students.'.format(df2.shape[0]-1))
os.chdir(workoutputFolder)

Version 1 of the exam has 113 students.
Version 2 of the exam has 118 students.


# Analyzing Exam Data

## Determining the most problematic items for the class

We need to identify the item columns over which to apply *nwrong*.

In [8]:
itemcols1 = df1.columns[df1.columns.str.contains('item_')]
itemcols2 = df2.columns[df2.columns.str.contains('item_')]

df1.loc[(df1.soarType=='key') & (df1.item_1!=df1.loc[df1.soarType=='key','df1.item_1'])].ncorrect +1

Now we apply the function to count the students who answered each item incorrectly, for version 1 and version 2 separately.

In [9]:
v1wrong = df1.loc[df1.soarType.isin(['mine','key']),itemcols1].apply(lambda x: nwrong(x=x.iloc[1:],key=x.iloc[0]))
# v1wrong = v1wrong.rename({'item_1':'item_1a'})
v1wrong

item_1      5
item_2      2
item_3      5
item_4     11
item_5      4
item_6      9
item_7      9
item_8     14
item_9     10
item_10    10
item_11     6
item_12     6
item_13    11
item_14     9
item_15     5
item_16     8
item_17     3
item_18     7
item_19     2
item_20     4
item_21     4
item_22     5
item_23     3
item_24     7
item_25     7
item_26     4
item_27     6
dtype: int64
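
The column-wise call relies on the key being the first row of the selection: for each item column, the first value is treated as the correct answer and the remaining values as student responses. A minimal sketch on an invented two-item table (the row labels and answers are made up):

```python
import pandas as pd

def nwrong(x, key):
    # Same logic as the notebook's function.
    x = x.astype(str)
    return x[x != str(key)].count()

# Invented answers: the first row is the key, the rest are students.
answers = pd.DataFrame(
    {"item_1": ["4", "4", "3", "4"],
     "item_2": ["2", "1", "1", "2"]},
    index=["key", "s1", "s2", "s3"],
)

# For every column, compare rows 1.. against row 0 by position.
wrong = answers.apply(lambda col: nwrong(col.iloc[1:], key=col.iloc[0]))
print(wrong)
```

If the key were not the first row, every count would be measured against a student's answers instead, so it is worth confirming the row order before trusting the result.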

In [10]:
v2wrong = df2.loc[df2.soarType.isin(['mine','key']),itemcols2].apply(lambda x: nwrong(x=x.iloc[1:],key=x.iloc[0]))
# v2wrong = v2wrong.rename({'item_1':'item_1b'})
v2wrong

item_1      3
item_2      2
item_3      2
item_4     10
item_5      0
item_6      8
item_7      6
item_8     12
item_9      8
item_10     7
item_11     5
item_12     3
item_13     2
item_14     9
item_15     2
item_16     8
item_17     3
item_18    11
item_19     5
item_20     3
item_21     3
item_22     6
item_23     7
item_24     6
item_25     7
item_26     0
item_27     2
dtype: int64

Adding these two series together gives the total number wrong on each item. Sorting the result in descending order gives us the order in which to discuss the items in class.

In [11]:
totalwrong = v1wrong + v2wrong
orderedwrong = totalwrong.sort_values(ascending=False)
# orderedwrong.loc[~orderedwrong.isna()] = orderedwrong.loc[~orderedwrong.isna()].astype(int)
orderedwrong

item_8     26
item_4     21
item_14    18
item_18    18
item_9     18
item_6     17
item_10    17
item_16    16
item_7     15
item_25    14
item_24    13
item_13    13
item_11    11
item_22    11
item_23    10
item_12     9
item_27     8
item_1      8
item_15     7
item_19     7
item_20     7
item_21     7
item_3      7
item_17     6
item_26     4
item_5      4
item_2      4
dtype: int64
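
Adding two series with `+` aligns them on their index labels, so an item present in only one version would come out as `NaN`. A minimal sketch with invented counts shows that behaviour and one possible way around it, `Series.add` with `fill_value=0`:

```python
import pandas as pd

# Invented counts; item_3 is deliberately missing from the second series.
v1 = pd.Series({"item_1": 5, "item_2": 2, "item_3": 5})
v2 = pd.Series({"item_1": 3, "item_2": 2})

# Plain addition aligns on labels; the unmatched item becomes NaN.
plain = v1 + v2

# Treat a missing item as zero wrong answers, then sort for discussion order.
total = v1.add(v2, fill_value=0).astype(int).sort_values(ascending=False)
print(total)
```

In this notebook both versions have the same 27 item labels, so the plain `+` is enough.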

Let's also print the letters for the correct answers for each item. Remember that for this dataset 4=A, 3=B, 2=C, and 1=D.

In [12]:
keysdict = {4:'A',3:'B',2:'C',1:'D'}
v1keys = df1.loc[df1.soarType=='key',itemcols1]\
.apply(lambda x: list(string.ascii_uppercase)[4-int(x)])
v1keys.to_csv('e3v1keys.csv')
v1keys

item_1     A
item_2     C
item_3     D
item_4     B
item_5     D
item_6     B
item_7     B
item_8     D
item_9     C
item_10    B
item_11    D
item_12    C
item_13    A
item_14    D
item_15    A
item_16    A
item_17    C
item_18    C
item_19    C
item_20    B
item_21    D
item_22    A
item_23    B
item_24    A
item_25    B
item_26    C
item_27    D
dtype: object
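
The conversion from numeric code to letter can also be written as a dictionary lookup with `keysdict` rather than as index arithmetic on the alphabet. A minimal sketch on an invented three-item key row, checking that both approaches agree:

```python
import string
import pandas as pd

keysdict = {4: "A", 3: "B", 2: "C", 1: "D"}

# Invented key row: one numeric code (stored as text) per item.
key_row = pd.Series({"item_1": "4", "item_2": "2", "item_3": "1"})

# Dictionary lookup: convert each code to int, then map it to its letter.
letters = key_row.astype(int).map(keysdict)

# Index arithmetic: code 4 -> position 0 ('A'), code 1 -> position 3 ('D').
by_index = key_row.apply(lambda c: string.ascii_uppercase[4 - int(c)])
print(letters)
```

The dictionary version returns a missing value for any code outside 1 to 4, whereas the arithmetic version would silently return some other letter.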

We do the same for the version 2 key.

In [13]:
v2keys = df2.loc[df2.soarType=='key',itemcols2]\
.apply(lambda x: list(string.ascii_uppercase)[4-int(x)])
v2keys.to_csv('e3v2keys.csv')
v2keys

item_1     B
item_2     C
item_3     D
item_4     D
item_5     D
item_6     D
item_7     B
item_8     D
item_9     C
item_10    D
item_11    B
item_12    C
item_13    A
item_14    B
item_15    A
item_16    C
item_17    C
item_18    A
item_19    C
item_20    B
item_21    D
item_22    A
item_23    D
item_24    D
item_25    B
item_26    C
item_27    C
dtype: object

## Evaluating student performance

Now we look at scores by section. To keep track of the versions we can rename *unnamed1* to *version*, and then combine the two DataFrames.

In [14]:
df = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
print(df.shape)
df.to_csv('merged exam 3 results.csv')

(233, 35)


We can drop the "key" row and store the maximum possible value for *ncorrect* in a separate variable, *maxscore*.

In [15]:
maxscore = df.ncorrect.max()
dfkey = df.loc[df.soarType=='key',:]
df = df.loc[df.soarType!='key',:]
print('maxscore:{}'.format(maxscore))

maxscore:27


df.loc[df.item_1!=dfkey.item_1,'ncorrect'] = df.loc[df.item_1!=dfkey.item_1,'ncorrect'] + 1

We will use this *df* for the rest of our analysis.

In [16]:
def iqr(x):
    return x.quantile(.75) - x.quantile(.25)

def percentile25(x):
    return x.quantile(.25)

def percentile75(x):
    return x.quantile(.75)
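
These helpers are meant to be passed to `agg` alongside the built-in names; the resulting columns take the function names. A minimal sketch on invented scores for two sections (section numbers and scores are made up):

```python
import pandas as pd

def iqr(x):
    return x.quantile(.75) - x.quantile(.25)

def percentile25(x):
    return x.quantile(.25)

def percentile75(x):
    return x.quantile(.75)

# Invented scores: four students in each of two sections.
toy = pd.DataFrame({
    "soar": [81, 81, 81, 81, 72, 72, 72, 72],
    "ncorrect": [10, 14, 18, 22, 6, 12, 16, 26],
})

# Mix built-in aggregations (strings) with the custom functions.
summary = toy.groupby("soar").ncorrect.agg(
    ["count", percentile25, "median", percentile75, iqr]
)
print(summary)
```

With pandas' default linear interpolation, section 81 has quartiles 13 and 19 (IQR 6), and section 72 has quartiles 10.5 and 18.5 (IQR 8).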


In [17]:
df.head()

Unnamed: 0,tuid,unnamed1,unnamed2,unnamed3,soar,ncorrect,examNumber,item_1,item_2,item_3,item_4,item_5,item_6,item_7,item_8,item_9,item_10,item_11,item_12,item_13,item_14,item_15,item_16,item_17,item_18,item_19,item_20,item_21,item_22,item_23,item_24,item_25,item_26,item_27,soarType
1,915493826,,,,,20,1,4,2,1,3,1,4,4,1,2,2,1,2,4,1,4,3,2,1,2,3,1,4,4,4,3,1,1,other
2,915323216,,,,87.0,15,1,4,2,1,1,1,3,4,1,3,2,3,4,1,3,4,4,2,3,2,2,1,3,2,4,3,2,1,other
3,915271724,,,,,23,1,4,2,1,3,1,3,3,1,2,4,1,2,1,1,4,4,2,2,2,3,1,1,3,4,3,2,2,other
4,915501049,31.0,11.0,31.0,75.0,21,1,4,2,1,3,1,4,2,2,3,3,1,2,1,1,4,4,2,1,2,3,1,4,3,4,3,2,1,other
5,915468481,13.0,11.0,36.0,83.0,19,1,4,2,1,4,1,3,3,3,2,2,1,1,1,2,4,3,2,2,2,3,1,4,2,4,3,2,1,other


In [18]:
dfkey.soarType.unique()

array(['key'], dtype=object)

In [19]:
scores = df.groupby(['soar']).ncorrect\
.agg(['count','min',percentile25,percentile75,'max','median',iqr,'mean','std'])\
.sort_values('median', ascending =False)
scores.to_csv('Exam3Score breakdown.csv')
scores

soar,count,min,percentile25,percentile75,max,median,iqr,mean,std
81.0,32,11,15.75,19.25,24,18,3.5,17.5625,3.472821
6.0,1,17,17.0,17.0,17,17,0.0,17.0,
71.0,28,10,14.75,19.0,26,17,4.25,17.107143,3.675135
73.0,20,8,12.0,18.25,25,17,6.25,15.9,4.529436
75.0,30,8,14.25,19.0,23,17,4.75,16.466667,3.441465
83.0,24,12,15.0,17.0,22,16,2.0,16.041667,2.115762
84.0,26,9,14.25,19.0,24,16,4.75,16.384615,3.677792
87.0,23,8,14.0,20.5,26,16,6.5,16.913043,5.186635
72.0,27,6,14.0,19.0,26,15,5.0,15.851852,4.221772


Now let's look at the number of students who scored above the 75th percentile of their own section. This gives us an idea of how the high scores are distributed within each section.

In [20]:
highAchieverslocal = df.groupby('soar').ncorrect.apply(lambda x: x.loc[x>x.quantile(.75)].count())\
.reset_index()\
.rename(columns = {'ncorrect':'nAbove75p'})
highAchieverslocal.soar = highAchieverslocal.soar.astype(int)
highAchieverslocal.soar = highAchieverslocal.soar.astype(str)
highAchieverslocal

Unnamed: 0,soar,nAbove75p
0,6,0
1,71,6
2,72,5
3,73,5
4,75,7
5,81,8
6,83,5
7,84,6
8,87,6


Now let's look at the number of students in each section who scored above the 75th percentile of the entire class. This gives us an idea of how the high scores are distributed across sections.

In [21]:
highAchieversGlobal = df.loc[df.ncorrect>df.ncorrect.quantile(.75)]\
.groupby('soar').ncorrect.count().reset_index()\
.rename(columns = {'ncorrect':'nAbove75p'})
print('The overall median and 75th percentile were {} and {}, \
respectively.'.format(df.ncorrect.median(),df.ncorrect.quantile(.75)))
highAchieversGlobal

The overall median and 75th percentile were 17.0 and 19.0, respectively.


Unnamed: 0,soar,nAbove75p
0,71.0,6
1,72.0,5
2,73.0,3
3,75.0,7
4,81.0,8
5,83.0,1
6,84.0,6
7,87.0,10
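
The two counts answer different questions, and a section with nobody above the class-wide cutoff disappears from the second table altogether (as section 6 does above). A minimal sketch on invented scores for two sections contrasts them; the `reindex` step is one possible way to keep such sections with a count of zero:

```python
import pandas as pd

# Invented scores: section 71 scores low, section 87 scores high.
toy = pd.DataFrame({
    "soar": [71] * 4 + [87] * 4,
    "ncorrect": [10, 12, 14, 16, 18, 20, 22, 24],
})

# Local: each student is compared with the 75th percentile of their own section.
local = toy.groupby("soar").ncorrect.apply(lambda x: (x > x.quantile(.75)).sum())

# Global: every student is compared with the 75th percentile of the whole class.
cutoff = toy.ncorrect.quantile(.75)
global_counts = toy.loc[toy.ncorrect > cutoff].groupby("soar").ncorrect.count()

# Sections with no one above the cutoff are absent; restore them as zero.
global_filled = global_counts.reindex(local.index, fill_value=0)
print(local, global_filled, sep="\n")
```

Here each section has one student above its own cutoff, but both students above the class-wide cutoff of 20.5 are in section 87.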


## Visualization

In [22]:
data = [go.Bar(x = highAchieversGlobal.soar,y = highAchieversGlobal.nAbove75p)]
layout = go.Layout(
    title = 'High Achievers Global',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'SOAR Section',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of High Achievers',
        titlefont = dict(
            size = 18)))

fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Bar of High Achievers Global 2018E31111')

Aw, snap! We didn't get a username with your request.

Don't have an account? https://plot.ly/api_signup

Questions? accounts@plot.ly


PlotlyError: Because you didn't supply a 'file_id' in the call, we're assuming you're trying to snag a figure from a url. You supplied the url, '', we expected it to start with 'https://plot.ly'.
Run help on this function for more information.