# Purpose

Prize challenges that rely on judges to review submissions can be unwieldy and consume a lot of staff time. This code takes a *very* simple case of keyword selection and develops similarity scores between judges with specific expertise and submissions that require that expertise.

Effectively, we use simple keyword vectors for applications (chosen by the applicants themselves from a limited dropdown list) and use those same keywords to describe the expertise of a group of external judges (whose expertise keywords are chosen by hand by prize staff). By measuring the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between these two groups, we obtain a similarity score that helps speed up the process of matching prize judges with the applications most relevant to their field of knowledge.
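
As a minimal sketch of the score described above, the snippet below computes cosine similarity between one submission's keyword vector and two judges' vectors using only the standard library. The keyword names and the vectors are hypothetical; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on whole DataFrames.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an empty keyword vector
    return dot / (norm_u * norm_v)

# Hypothetical keyword order: [Manufacturing, Photovoltaics, Storage, Software]
submission = [1, 1, 0, 0]
judge_a = [1, 1, 1, 0]  # shares both of the submission's keywords
judge_b = [0, 0, 0, 1]  # shares none of them

score_a = cosine_similarity(submission, judge_a)
score_b = cosine_similarity(submission, judge_b)
print(round(score_a, 4), score_b)
```

A judge who shares more of a submission's keywords scores closer to 1, and a judge with no overlap scores 0.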

# The Data

The input data are prize submission metadata from herox.com, a hosting platform for prizes and challenges.

## THE CONTROL CENTER
Constants that control the assignment behavior for submissions are below this line. Please modify them as needed.

In [None]:
JUDGES_PER_APP = 5
MAX_REVIEWS_PER_JUDGE = 18

#How many special/flagged judges should there be per app?
FLAGGED_JUDGES_PER_APP = 1

OUTPUT_FILEPATH = 'Data/Output_Files/American-Made_0101_autoAssigned_5Judge_18cap.csv'
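
Before running the assignment, it may help to check that the constants above are feasible for the panel size. The helper below is a hypothetical sketch (it is not part of this notebook or of the `assignments` module) that computes the smallest number of judges whose combined review capacity covers every review slot. The submission count of 100 is an assumed figure for illustration.

```python
import math

def min_judges_needed(n_submissions, judges_per_app, max_reviews_per_judge):
    """Smallest panel size whose total review capacity covers every slot."""
    total_slots = n_submissions * judges_per_app
    return math.ceil(total_slots / max_reviews_per_judge)

# Hypothetical: 100 submissions, 5 judges per app, 18-review cap per judge
needed = min_judges_needed(100, judges_per_app=5, max_reviews_per_judge=18)
print(needed)
```

This is a necessary condition only: it ignores the flagged-judge quota and how well judge expertise is spread across keywords.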

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import random

from Submission import Submission
from Judge import Judge

## Read in the submissions data

The goal is a binary keyword matrix. It is first built with submission IDs as column names and keywords as index labels, then transposed so that each row is a submission.

In [None]:
#Keyword data for prize submissions/applications
#column number 255: Unique ID column
s = pd.read_csv('Data/0101HeroXSubs.csv', index_col = 255, 
                encoding = 'latin1')

s = s.loc[:,'Keyword tags'].dropna(how = 'all')

s = s.str.split('\r', expand = True).transpose()
s.index = ['Keyword ' + str(x) for x in range (1,len(s)+1)]
s.index.name, s.columns.name = ('Keywords', 'Submissions')


s = pd.melt(s).dropna()

s.rename(columns = {'value': 'keyword'}, inplace = True)
s['value'] = 1
s = s.pivot_table(index = 'keyword', columns = 'Submissions', values = 'value').fillna(0).astype('int')
s = s.transpose()

s.drop(columns = ['Other or N/A'], inplace = True)

#Make sure the columns are sorted in the same order for submissions as they are for judges
s = s.reindex(columns = sorted(s.columns))

s

In [None]:
#How many projects have no keywords?
s[s.sum(axis = 1) == 0]

## Read in the judges data

In [None]:
j = pd.read_csv('Data/0101JudgePanel.csv', index_col = 'Name',
                dtype = 'str').transpose().dropna(how = 'all', axis =1)

j.columns.name = 'Judge'
j.drop(['Status', 'Title', 'Company', 'Email', 'Notes', 'Garrett Notes'], inplace = True)

j = pd.melt(j).dropna()

j.rename(columns = {'value': 'keyword'}, inplace = True)
j['value'] = 1

j = j.pivot_table(index = 'keyword', columns = 'Judge', values = 'value').fillna(0).astype('int')
j = j.transpose()

#Not worrying about Business keyword yet
j_matched = j.drop(columns = ['Business'])

#Make sure the columns are sorted in the same order for judges as they are for submissions
j_matched = j_matched.reindex(columns = sorted(j_matched.columns))

j_matched

## Check to make sure the keyword pools for both submissions and judges match

**NOTE:** Often there are keyword misspellings or extras.
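
One way to track down such mismatches is to compare the two keyword pools as sets and look for near-duplicates. The sketch below is an assumption-laden illustration: `keyword_mismatches` is a hypothetical helper, the keyword lists are made up, and the `cutoff` of 0.8 is an arbitrary starting point for `difflib.get_close_matches`.

```python
import difflib

def keyword_mismatches(submission_keywords, judge_keywords, cutoff=0.8):
    """Return keywords found in only one pool, plus likely misspelling pairs."""
    only_subs = sorted(set(submission_keywords) - set(judge_keywords))
    only_judges = sorted(set(judge_keywords) - set(submission_keywords))
    likely_typos = {}
    for kw in only_judges:
        # Look for a close spelling among the unmatched submission keywords
        match = difflib.get_close_matches(kw, only_subs, n=1, cutoff=cutoff)
        if match:
            likely_typos[kw] = match[0]
    return only_subs, only_judges, likely_typos

# Made-up keyword pools; in the notebook these would be s.columns and j_matched.columns
sub_pool = ['Manufacturing', 'Photovoltaics', 'Storage']
judge_pool = ['Manufacturing', 'Photovoltiacs', 'Storage', 'Finance']
only_subs, only_judges, likely_typos = keyword_mismatches(sub_pool, judge_pool)
print(only_subs, only_judges, likely_typos)
```

Keywords that appear in `only_judges` without a close match (here, `Finance`) are more likely extras than misspellings.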

In [None]:
#Do both DataFrames have exactly the same keyword columns, in the same order?
s.columns.equals(j_matched.columns)

## Set up the Judge and Submission objects, run the cosine similarity scoring, and assign Judges to Submissions
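
The `assignments` module used below is not shown in this notebook, so the following is only a rough sketch of one possible approach: a greedy pass that gives each submission its highest-scoring judges who still have capacity. The function `greedy_assign`, the judge names, and the scores are all hypothetical, and the sketch ignores the flagged-judge quota and any random tie-breaking.

```python
def greedy_assign(scores, judges_per_app, max_reviews_per_judge):
    """scores: {submission_id: {judge_name: similarity}}.
    Returns {submission_id: [judge_name, ...]} chosen greedily."""
    load = {}         # reviews assigned so far, per judge
    assignments = {}
    for sub_id, judge_scores in scores.items():
        # Best-matching judges first
        ranked = sorted(judge_scores, key=judge_scores.get, reverse=True)
        chosen = []
        for judge in ranked:
            if len(chosen) == judges_per_app:
                break
            if load.get(judge, 0) < max_reviews_per_judge:
                chosen.append(judge)
                load[judge] = load.get(judge, 0) + 1
        assignments[sub_id] = chosen
    return assignments

example_scores = {
    'sub-1': {'Ana': 0.9, 'Ben': 0.5, 'Cy': 0.1},
    'sub-2': {'Ana': 0.8, 'Ben': 0.7, 'Cy': 0.2},
}
result = greedy_assign(example_scores, judges_per_app=2, max_reviews_per_judge=1)
print(result)
```

With a cap of one review per judge, the second submission ends up one judge short, which is the kind of violation worth checking for after any assignment run.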

In [None]:
import assignments

#Run the assignments 100 times and average the assignment count for each judge to see
    #if number of keywords per judge dictates # of assignments
trial_data = []  #per-judge results, accumulated across all trials
for i in range(0,100):
    #Make the flagged Judges
    judges = [Judge(name, MAX_REVIEWS_PER_JUDGE, flag = True) for name in j[j['Business'] != 0].index]

    #Now add in the unflagged Judges
    judges += [Judge(name, MAX_REVIEWS_PER_JUDGE, flag = False) for name in j[j['Business'] == 0].index]

    #Create all the Submissions
    submissions = [Submission(name, JUDGES_PER_APP, FLAGGED_JUDGES_PER_APP) for name in s.index]

    scores = assignments.similarity_scoring(s, j_matched, JUDGES_PER_APP, random_seed = None)
    assignments.make_assignments(scores, submissions, judges)

    #Keep this trial's results so the average below covers every trial
    trial_data += [e.to_df() for e in judges]

#Combine all trials with pd.concat (DataFrame.append was removed in pandas 2.0)
judge_data = pd.concat(trial_data)
avg_judge_data = judge_data.groupby('Judge Name').mean()
avg_judge_data

In [None]:
avg_judge_data['Keyword Count'] = j_matched.transpose().sum()
avg_judge_data.reset_index(inplace = True)
avg_judge_data

In [None]:
%matplotlib inline

import seaborn as sns
#sns.scatterplot(x = 'Keyword Count', y = 'Number of Assignments', data = avg_judge_data)
sns.regplot(x = 'Keyword Count', y = 'Number of Assignments', data = avg_judge_data)

In [None]:
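#Correlate the two numeric columns directly so the 'Judge Name' text column is not involved
round(avg_judge_data['Number of Assignments'].corr(avg_judge_data['Keyword Count']), 4)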

**It looks like our hypothesis is unfounded. Higher keyword counts don't seem to correlate with lower assignment numbers after all,** at least when the results of a number of random trials (100 in this case) are averaged.
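
For reference, the correlation used in this check is the Pearson coefficient. The sketch below computes it by hand on made-up numbers (these are not results from this notebook) to show what the hypothesized trend would look like: if more keywords really meant fewer assignments, `r` would be strongly negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data for illustration only: load falls as keyword count rises
keyword_counts = [1, 2, 3, 4]
avg_assignments = [18, 17, 15, 14]
r = pearson_r(keyword_counts, avg_assignments)
print(round(r, 4))
```

A value near 0 on the real averaged data is what supports rejecting the hypothesis.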

## Check to make sure we don't have any apps with assignment violations
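
A minimal sketch of such a check, assuming the assignments have been collected into a plain dictionary of submission ID to judge names (for example from `sub.id` and `sub.assigned_judges`). `find_violations` is a hypothetical helper and does not verify the flagged-judge quota.

```python
from collections import Counter

def find_violations(assignments, judges_per_app, max_reviews_per_judge):
    """assignments: {submission_id: [judge_name, ...]}."""
    # Submissions with too few or too many judges
    wrong_count = [s for s, js in assignments.items() if len(js) != judges_per_app]
    # Submissions where the same judge appears more than once
    duplicated = [s for s, js in assignments.items() if len(set(js)) != len(js)]
    # Judges assigned more reviews than the cap allows
    load = Counter(j for js in assignments.values() for j in js)
    overloaded = sorted(j for j, n in load.items() if n > max_reviews_per_judge)
    return {'wrong_count': wrong_count, 'duplicated': duplicated, 'overloaded': overloaded}

# Made-up assignments that break each rule once
example = {'sub-1': ['Ana', 'Ben'], 'sub-2': ['Ana', 'Ana'], 'sub-3': ['Ana']}
report = find_violations(example, judges_per_app=2, max_reviews_per_judge=3)
print(report)
```

An empty list under every key would indicate that no submission or judge violates these three rules.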

## Write the optimized assignments to a DataFrame and export to CSV

In [None]:
cols = ['Judge ' + str(x) for x in range(1,JUDGES_PER_APP + 1)]

output = pd.DataFrame(data = None, columns = cols, index = [x.id for x in submissions])

for sub in submissions:
    output.loc[sub.id] = [x.name for x in sub.assigned_judges]
    
output

In [None]:
output.to_csv(OUTPUT_FILEPATH)
#app_assignments.to_csv('Data/Output_Files/American-Made_Solar_Prize_App_Assignments_test_manualAssignments.csv')

## TO DO

1. Allow randomness of judges to be stochastic (no random seed), run X times and plot Judge keyword count vs. average submission count
    * Rachelle worried that she's seeing a trend where higher keyword count = lower assignment load
2. Figure out if we can parse a single cell list of keywords that are separated by \n into keywords as needed
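
For the last item, one possible approach is sketched below in plain Python. `str.splitlines()` splits on `\n`, `\r`, and `\r\n` alike, so the same code would handle either export format. The helper names and the sample cells are hypothetical; in pandas, passing a pattern such as `r'\r\n|\r|\n'` to `Series.str.split` would likely achieve the same thing, though that has not been tested here.

```python
def split_keywords(cell):
    """Split one raw cell into a list of stripped, non-empty keywords."""
    if not isinstance(cell, str):
        return []  # e.g. a missing value
    return [kw.strip() for kw in cell.splitlines() if kw.strip()]

def one_hot(cells):
    """cells: {submission_id: raw keyword cell}.
    Returns (sorted vocabulary, {submission_id: 0/1 vector})."""
    parsed = {sid: split_keywords(cell) for sid, cell in cells.items()}
    vocab = sorted({kw for kws in parsed.values() for kw in kws})
    vectors = {sid: [int(kw in kws) for kw in vocab] for sid, kws in parsed.items()}
    return vocab, vectors

# Made-up cells using different line separators
cells = {'sub-1': 'Storage\nManufacturing', 'sub-2': 'Storage\r\nSoftware\r', 'sub-3': None}
vocab, vectors = one_hot(cells)
print(vocab, vectors)
```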