## Sources of Assessment



Your grade in EEP153 will be based on four different sources:

1.  Others' rankings of your team's performance on each class project;

2.  Your teammates' rankings of your contribution to each project;

3.  An individual final exam; and

4.  Some extra credit opportunities, including activity on `piazza`.



## Data from Assessments



### Students



Consider the following artificial data on assessment.  We first
generate some data on (imaginary) students.  Each student is
characterized by three things:

-   Name (to identify student; we observe this);
-   Ability (affects performance independent of effort);
-   Effort (affects performance independent of ability).

The following code defines a function we can use to generate some random names for our imaginary students.



In [1]:
import pandas as pd
import numpy as np
import urllib.request
import random

def random_names(n,k=2):
    """Return a list of n random k-part names.

    Borrows from amoodie's amusing idea described at
    https://stackoverflow.com/questions/18834636/random-word-generator-python
    """
    word_url = "http://svnweb.freebsd.org/csrg/share/dict/words?view=co&content-type=text/plain"
    response = urllib.request.urlopen(word_url)
    long_txt = response.read().decode()
    words = long_txt.splitlines()
    # Keep capitalized words, skipping blank lines and all-caps acronyms
    upper_words = [word for word in words if word and word[0].isupper()]
    name_words  = [word for word in upper_words if not word.isupper()]

    # random.choice avoids the off-by-one of random.randint (its upper bound is inclusive)
    names = []
    for i in range(n):
        names.append(' '.join([random.choice(name_words) for j in range(k)]))

    return names

STUDENTS = 40
names = random_names(STUDENTS)
# print(names)

Next, for each student we'll randomly draw an ability and an effort.
Draws are from a normal distribution.



In [1]:
ability = [random.normalvariate(0,1) for name in names]

effort = [random.normalvariate(0,1) for name in names]

With names, ability, and effort all determined, build a `pandas.DataFrame`.



In [1]:
students = pd.DataFrame({'Ability':ability,'Effort':effort}, index=names)

print(students.head())

### Performance



If professors could simply observe ability and effort, grading would
be very easy!  That's not the world we live in, though.  Instead we
have students take tests or complete assignments, where performance is
related to effort and ability, and we try to draw inferences about
the latter from the former.

The following code assigns students to random teams, and generates
scores for their projects.  Note that, e.g., "Team1" means the
assignment to teams for project 1; it's not an identifier for a team.



In [1]:
import numpy as np
# Assign students to random groups and generate project scores.

PROJECTS = 4
TEAMS = 8

for project in range(PROJECTS):
    # Sort students into a random order
    np.random.shuffle(names)

    students = students.join(pd.Series(np.array([[i]*(STUDENTS // TEAMS)
                                                 for i in range(TEAMS)]).flatten(),
                                       index=names,name='Team%d' % (project+1,)))
print(students.head())

Now, performance on each group project is assumed to depend on the
team averages of ability and effort.  Every
student will provide a *ranking* of all *other* teams' projects.



In [1]:
for project in range(1,PROJECTS+1):
    teams = students.groupby('Team%d' % project)
    teamscore = teams[['Ability','Effort']].mean().sum(axis=1) # Team averages
    # Parentheses are needed for the expression to continue onto the next line
    others_evals = (teamscore.values.reshape((-1,1))
                    + np.random.randn(TEAMS,int(STUDENTS*(TEAMS-1)/TEAMS))) # Others' evals
    others_evals = pd.DataFrame(others_evals).rank(ascending=False).mean(axis=1).squeeze() # Average of rankings
    others_evals.name = 'Project%d' % project
    students = students.join(others_evals,on='Team%d' % project)

print(students.filter(regex="Project").head())
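
To see the rank-then-average step in isolation, here is a minimal sketch on a toy example.  The three project qualities, the five evaluators, the noise scale, and the seed are all made-up values for illustration; they are not taken from the simulation above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical "true" quality of three team projects
quality = np.array([2.0, 0.0, -2.0])

# Five evaluators each observe quality plus independent noise (one column each)
noisy = quality.reshape((-1, 1)) + 0.1 * rng.standard_normal((3, 5))

# Rank within each column (1 = best), then average across evaluators
ranks = pd.DataFrame(noisy).rank(ascending=False)
avg_rank = ranks.mean(axis=1)
print(avg_rank)
```

Because the noise here is tiny relative to the gaps in quality, every evaluator agrees on the ordering.  With noisier evaluations the individual rankings would disagree, and averaging is what smooths that disagreement out.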

So far so good; for each team project we have the average of the
rankings assigned by students on other teams.  The second source of
assessment is peer rankings *within* the group.  We assume that one's
teammates provide rankings which depend on ability and effort,
observed with error.



In [1]:
for project in range(1,PROJECTS+1):
    
    peer_evals = students[['Ability','Effort']].sum(axis=1).values.reshape((-1,1))
    peer_evals = peer_evals + np.random.randn(STUDENTS,STUDENTS//TEAMS + 1) # Error in obs.
    peer_evals = pd.DataFrame(peer_evals,index=students.index).rank(ascending=False).mean(axis=1).squeeze() # Average of rankings
    students['Peers%d' % project] = peer_evals

print(students.filter(regex="Peers").head(10))
#print(peer_evals)

Finally, there are also individual assessments, from the final exam and
instructor assessment of contributions on `piazza`.  These are also
related to ability and effort, measured with error.  However, we
assume that contributions on `piazza` depend more on effort than on
ability.



In [1]:
students['Final'] = students[['Ability','Effort']].sum(axis=1) 
students['Final'] = students['Final'] + np.random.randn(STUDENTS)*0.2

# Effort weighted 0.7, ability 0.3
students['piazza'] = students[['Ability','Effort']].dot([.3,.7]) + np.random.randn(STUDENTS)*0.5


Taken all together, this gives us a DataFrame of scores by student.
This is more or less what the data I'll have at the end of the
semester will look like (except that the names will be less silly).



In [1]:
# Note that *lower* rankings are better, so flip sign on scores based on such rankings
Scores = pd.concat([students[['Final','piazza']],
                    -students[['Peers%d' % p for p in range(1,PROJECTS+1)]+['Project%d' % p for p in range(1,PROJECTS+1)]]],axis=1)

print(Scores.head())

## Evaluation



So, how do we turn a set of scores like this into course grades?
There are two steps.  First, we compute the *singular value
decomposition* (SVD) of the matrix of scores; this is a technique
fundamental to the calculation of least squares regressions,
and a popular tool in the recent machine learning
literature.  It's closely related to an approach called "principal
components", which has long been used in "psychometrics", a field concerned with things
like designing and interpreting intelligence tests.  The second step,
mapping the resulting scores into letter grades, is described under
Grade Assignment.

A great feature of the SVD is that it allows us to *simultaneously*
estimate ability+effort for each student, along with a weight for each
assignment that indicates how informative that assignment is.



In [1]:
#!pip install CFEDemands --upgrade
from cfe.estimation import svd_rank1_approximation_with_missing_data as my_svd

# Standardize scores to mean zero, unit standard deviation (helps interpretation)
Scores = Scores - Scores.mean()
Scores = Scores/Scores.std()

xhat,weights,s,grades = my_svd(Scores,return_usv=True)

# Sign from SVD is indeterminate, but weights should be positive
if weights.sum()<0:
    weights = -weights
    grades = - grades
    
weights = weights/weights.sum()

# Normalize grades
grades = (grades-grades.mean())/grades.std()

print(weights)
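
For intuition about what a rank-one approximation does, here is a minimal sketch using plain `numpy.linalg.svd` on a complete (no missing data) matrix of simulated scores.  It is not the `cfe` routine used above, which also handles missing data; the number of assignments, the noise levels, and the seed are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_students, n_assignments = 40, 6
latent = rng.standard_normal(n_students)             # hypothetical ability+effort
noise_sd = np.array([0.2, 0.2, 0.5, 0.5, 1.0, 2.0])  # hypothetical noise per assignment
scores = latent[:, None] + noise_sd * rng.standard_normal((n_students, n_assignments))

# Standardize each column, as in the cell above
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Rank-one approximation: z is approximately outer(student_scores, weights)
u, s, vt = np.linalg.svd(z, full_matrices=False)
student_scores = u[:, 0] * s[0]
weights = vt[0]

# The sign of singular vectors is indeterminate; pick the one with positive weights
if weights.sum() < 0:
    weights, student_scores = -weights, -student_scores

weights = weights / weights.sum()
corr = np.corrcoef(student_scores, latent)[0, 1]
print(weights.round(3), round(corr, 3))
```

In this sketch the noisier assignments should receive smaller weights, which is the sense in which the SVD estimates how informative each assignment is at the same time as it estimates each student's score.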

The right measure of success for this approach is whether the grades we assign provide a good estimate of the sum of ability and effort.



In [1]:
import cufflinks as cf
from plotly.io import write_image
cf.go_offline()

# This is what we want to measure
truth = students[['Ability','Effort']].sum(axis=1) 

df = pd.DataFrame({'Truth':truth,'Grades':grades})
print(df.corr().iloc[0,1])

df.iplot(kind='scatter', mode='markers', symbol='circle-dot',
         x='Truth',y='Grades',
         xTitle='Truth',yTitle='Grades',
         asFigure=False)

#write_image(fig,'grades_vs_truth.png')

## Grade Assignment



We've described above how we'll calculate *scores*; how will these be
turned into letter grades?  Let's start with a description of the
distribution of scores from above; these have been *normalized*, so
that they have a mean of zero and a standard deviation of one, by
construction.



In [1]:
grades.iplot(kind='histogram',
             xTitle='Grade (raw)',
             yTitle='Frequency')

To map into letter grades, we'll use the following device, which
involves anchoring letter grades to the best five students.

-   Let $\bar{x}$ be the median grade among the top five students (i.e.,
    the 3rd highest grade, if we ignore ties).
-   All students with a grade of at least $\bar{x}-1/3$ will receive
    an **A+** (so *at least* the best three students will receive this
    grade, by construction).
-   Remaining students within 2/3 of a standard deviation of $\bar{x}$
    will receive an **A**.
-   Remaining students within one standard deviation of $\bar{x}$
    will receive an **A-**.
-   And so on…
-   …until students with grades more than 4 standard deviations below
    $\bar{x}$ will receive an **F**.
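
The rules above can be sketched as a small function.  This is an illustrative reading of the rules rather than the official grading code: the function name `letter_grades` is hypothetical, each band is taken to be one third of a standard deviation wide, and a score exactly on a boundary is assigned the better grade.

```python
import numpy as np

LETTERS = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]

def letter_grades(scores):
    """Map normalized scores to letters, anchored on the median of the top five."""
    scores = np.asarray(scores, dtype=float)
    xbar = np.sort(scores)[-3]   # third-highest score = median of the top five
    gap = xbar - scores          # distance below the anchor (negative if above it)
    # Band 0 covers gap <= 1/3, band 1 covers (1/3, 2/3], ..., band 12 covers gap > 4
    band = np.clip(np.ceil(3 * gap) - 1, 0, len(LETTERS) - 1).astype(int)
    return [LETTERS[b] for b in band]

example = [2.0, 1.5, 1.0, 0.9, 0.5, 0.0, -1.0, -3.5]
print(letter_grades(example))
```

In the example the anchor is 1.0, so the four scores at or just below it all receive an A+, while the score of -3.5 is more than 4 standard deviations below the anchor and receives an F.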

If scores are normally distributed, then we'd expect $\bar{x}$ to be
about 1.517 standard deviations above the mean, and for the
distribution of grades to be as follows.  (NB: The assumption of
normality is a big assumption! However, if it's satisfied then our
distribution of grades will be close to the distribution reported for
all EEP classes: [http://projects.dailycal.org/grades/](http://projects.dailycal.org/grades/))  

| Normalized Grade Score | Letter | Predicted % |
|---|---|---|
| $\bar{x}-x\leq 1/3$ | A+ | 11.97% |
| $1/3 < \bar{x}-x \leq 2/3$ | A | 7.99% |
| $2/3 < \bar{x}-x \leq 1$ | A- | 10.55% |
| $1 < \bar{x}-x \leq 4/3$ | B+ | 12.49% |
| $4/3 < \bar{x}-x \leq 5/3$ | B | 13.24% |
| $5/3 < \bar{x}-x \leq 2$ | B- | 12.57% |
| $2 < \bar{x}-x \leq 7/3$ | C+ | 10.69% |
| $7/3 < \bar{x}-x \leq 8/3$ | C | 8.15% |
| $8/3 < \bar{x}-x \leq 3$ | C- | 5.56% |
| $3 < \bar{x}-x \leq 10/3$ | D+ | 3.40% |
| $10/3 < \bar{x}-x \leq 11/3$ | D | 1.86% |
| $11/3 < \bar{x}-x \leq 4$ | D- | 0.91% |
| $\bar{x}-x > 4$ | F | 0.64% |
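
Predicted shares like those in the table can be computed from the standard normal CDF.  The sketch below assumes scores are exactly standard normal and takes $\bar{x}=1.517$ from the text; the helper `normal_cdf` is defined here for illustration.  The results should land near the table's values, though small differences are possible depending on the exact value of $\bar{x}$ used.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

XBAR = 1.517  # anchor value quoted in the text
LETTERS = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]

# Lowest score earning each letter except F: xbar - 1/3, xbar - 2/3, ..., xbar - 4
cutoffs = [XBAR - k / 3 for k in range(1, 13)]

shares = {}
upper = 1.0  # CDF value at the top of the current band
for letter, cut in zip(LETTERS, cutoffs):
    lower = normal_cdf(cut)
    shares[letter] = upper - lower
    upper = lower
shares["F"] = upper  # everything below xbar - 4

for letter, share in shares.items():
    print("{:<2} {:6.2%}".format(letter, share))
```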



## Details



How do we know what the expected value of the third highest score is?
For measures of centrality like the mean we have a nice theory
governing their distribution (with scores normalized to unit variance,
the mean will be asymptotically normally distributed with a standard
deviation of $1/\sqrt{N}$, where $N$ is the class size).  Similar
results hold for estimating any *quantile* of the distribution, and
follow from the Central Limit Theorem.

These results *can't* be used for things like the value of the $k$th
highest score as the population gets large.  There is a collection of
theoretical results that *does* obtain, collectively called "Extreme
value theory".  

Instead of heading to the math library, we will cheat.  Let's just
**draw** a large number of samples of scores from a (pseudo-) random
number generator.  Each sample will just be the size of the class
(e.g., 40).  Then for each sample we'll find the third-highest value,
and compute the average across all the samples.

This approach to calculating a statistic is called a "Monte Carlo"
experiment, and is often very effective (even if a bit crude).



In [1]:
import numpy as np

STUDENTS=40

xbar=[]
for i in range(10000):
    x = np.random.randn(STUDENTS)
    x.sort()
    xbar.append(x[-3]) 

print("Estimated value of xbar is: {:.3f}.".format(np.mean(xbar)))
pd.DataFrame({"xbar":xbar}).iplot(kind='histogram',bins=100)
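
As a cross-check on the Monte Carlo estimate, the expected value of an order statistic can also be computed by numerically integrating its density.  The sketch below assumes independent standard normal scores; the function name `expected_kth_highest` and the integration grid are choices made for this illustration.  It relies on the fact that the $k$th highest of $n$ draws has density $k\binom{n}{k}F(x)^{n-k}(1-F(x))^{k-1}f(x)$, where $f$ and $F$ are the normal density and CDF.

```python
import math
import numpy as np

def expected_kth_highest(n, k, lim=8.0, grid=20001):
    """E[k-th highest of n iid standard normal draws], by numerical integration."""
    x = np.linspace(-lim, lim, grid)
    dx = x[1] - x[0]
    pdf = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])
    # Density of the k-th highest order statistic
    coef = k * math.comb(n, k)
    density = coef * cdf**(n - k) * (1.0 - cdf)**(k - 1) * pdf
    # Simple Riemann sum; the integrand is negligible at the grid endpoints
    return float(np.sum(x * density) * dx)

print("Expected third-highest of 40: {:.3f}".format(expected_kth_highest(40, 3)))
```

The result should agree closely with both the simulated average and the value of about 1.517 quoted earlier.  Unlike the Monte Carlo estimate, it does not change from run to run.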