# Capstone Team Optimization 

**Automatic selection of Capstone teams based on preferences.**

Georgetown students fill out a project interest survey at the start of Foundations, which we then use to attempt to optimize project teams curation. This serves as an intereting icebreaker to get people talking together about potential projects, but also as a mechanism to show optimization techniques in real life. Though obviously this method is more for demonstration purposes, I think it highlights a few key techniques. 

The optimization works as follows:

1. Assign students into random teams. 
2. Compute the _cost_ of those team assignments across the entire cohort (cost function to follow). 
3. Select a random number of swaps between 10 and 100
4. For each swap, switch two members of teams, if resulting _cost_ is less, continue; otherwise revert to original
5. Repeat steps 2-4 until minimum error or maximum searches

So basically this is a random hill climbing type search (or is intended to be). There are a number of ways to improve this function of course, but it's for demonstration only. 

The cost function is as follows:

1. Start with cost = 0 (perfect teams have no cost)
2. Add the square difference of each team's size with the optimal team size 
3. Add the number of unique OS per team - 1 (e.g. same OS is zero cost) 
4. Add cost of missing roles (e.g. don't have a programmer on the team)
5. Add domain alignment cost (similar domains selected is better)
6. Add dataset alignment cost (similar datasets selected is better) 


## Settings 

In [1]:
COHORT = 16      # Set to change the cohort to analyze. 
TEAM_SIZE = 4   # Optimal number of members per team

## Fields and Fixtures 

In [2]:
import os
import csv
import random

from itertools import chain
from collections import defaultdict, Counter

FIXTURES = os.path.join(os.getcwd(),"fixtures")

FIELDS = {'name' : 'Name',
          'email': 'Email',
          'github': 'Github Username',
          'linkedin': 'LinkedIn URL',
          'os': 'What is your preferred operating system?',
          'language': 'What programming languages are you familiar with?',
          'python': 'What is your level of Python proficiency?',
          'sql': 'What is your level of SQL proficiency?',
          'cli': 'What is your proficiency with the command line?',
          'dbs': 'What databases have you used before?',
          'role': 'Which of these roles would you like your primary contribution on the team to be?',
          'coord': 'Would you be willing to be a team coordinator?',
          'project': 'At what level do you feel your overall project should be at?',
          'domains': 'What domains are you interested in?',
          'datasets': 'What types of projects/data sets are you interested in?'
    }

PROG_ROLE  = 'Programmer - focused on the technical implementation'
STATS_ROLE = 'Statistician - focused on modeling and analysis'
DOM_ROLE   = 'Domain Expert - focused on finding novel data products for specific data sets'  

## Data Loading and Parsing

In [3]:
def getCohortPath(cohort=COHORT):
    """
    Returns the path to the Cohort file in the fixtures directory.
    """
    return os.path.join(FIXTURES,"cohort{}-preferences.csv".format(cohort))


def loadData(cohort=COHORT):
    """
    Loads and parses survey data. 
    """
    with open(getCohortPath(cohort), 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            row[FIELDS['language']] = parseMulti(row[FIELDS['language']])
            row[FIELDS['python']] = parseInt(row[FIELDS['python']])
            row[FIELDS['sql']] = parseInt(row[FIELDS['sql']])
            row[FIELDS['cli']] = parseInt(row[FIELDS['cli']])
            row[FIELDS['dbs']] = parseMulti(row[FIELDS['dbs']])
            row[FIELDS['coord']] = parseBool(row[FIELDS['coord']])
            row[FIELDS['project']] = parseInt(row[FIELDS['project']])
            row[FIELDS['domains']] = parseMulti(row[FIELDS['domains']])
            row[FIELDS['datasets']] = parseMulti(row[FIELDS['datasets']])
            yield dict([(field, row[FIELDS[field]]) for field in FIELDS])
                   
def parseBool(s):
    """
    Helper function for parsing yes/no/maybe. 
    """
    try:
        return {'yes': True,
                'no': False,
                'not sure': None,
                'if i have to': None
        }[s.lower()]
    except KeyError: 
        return None

def parseMulti(s):
    """
    Helper function for parsing survey lists (checkboxes). 
    """
    return filter(lambda i: i != '', [i.strip() for i in s.split(',')])

def parseInt(s):
    """
    Helper function for parsing integer fields. 
    """
    try:
        return int(s)
    except ValueError:
        return None

## Teams Collection 

A collection of teams and computation of team cost. 

In [4]:
class Cohort(object):
    
    def __init__(self, cohort=COHORT):
        self.teams = defaultdict(list)
        
        # Assign students to ordered teams.
        students = list(loadData(cohort))
        n_teams  = (len(students) / TEAM_SIZE)
        
        for idx, student in enumerate(students):
            self.teams[(idx + 1) / n_teams].append(student)
    
    def swap(self, source=None, target=None, sidx=None, tidx=None, transfer=False):
        """
        Swaps two students between two teams. If None values are passed,
        then the values are randomly selected. If transfer is true, then simply
        transfer the source to the target, don't swap. 
        """
        if source is None:
            source = random.choice(self.teams.keys())
        
        if target is None:
            target = random.choice(self.teams.keys())
        
        if sidx is None and len(self.teams[source]) > 1:
            sidx = random.randint(0, len(self.teams[source])-1)

        if tidx is None and len(self.teams[target]) > 1:
            tidx = random.randint(0, len(self.teams[target])-1)
        
        if sidx is not None:
            alpha = self.teams[source].pop(sidx)
            self.teams[target].append(alpha)
        
        if not transfer and tidx is not None:
            bravo = self.teams[target].pop(tidx)
            self.teams[source].append(bravo)
    
    def cost(self):
        """
        Computes the cost of the current team make up. 
        """
        cost = 0 # Perfect teams would have a cost of zero. 
        
        # Loop over each team to compute the costs.
        for team, prefs in self.teams.iteritems():
        
            # First add square difference in team size to optimal team size. 
            cost += (len(prefs) - TEAM_SIZE) ** 2
            
            # Add cost of multiple operating systems (1 OS is zero cost)
            cost += (len(set([pref['os'] for pref in prefs])) - 1)
            
            # Add cost of missing roles 
            cost += 3 - len(set([pref['role'] for pref in prefs]))
            
            # Add cost of domain mis-alignment 
            domains = Counter(chain(*[domain for domain in [pref['domains'] for pref in prefs]]))
            domains = domains.most_common(1)
            if domains:
                _, count = domains[0]
                cost += len(prefs) - count 
            else:
                cost += 99
            
            # Add cost of dataset mis-alignment
            datasets = Counter(chain(*[dataset for dataset in [pref['datasets'] for pref in prefs]]))
            datasets = datasets.most_common(1)
            if datasets:
                _, count = datasets[0]
                cost += len(prefs) - count 
            else:
                cost += 99
        
        return cost
    
    def select_coordinator(self, teamno):
        """
        From the people who selected yes to being coordinator, choose random.
        """
        # Filter out people who didn't say yes to coordinator role.
        coords = filter(
            lambda p: p['coord'] in (True, None), 
            self.teams[teamno]
        )

        return random.choice(coords)['name']


    def mean_level(self, teamno, field):
        """
        Compute the mean level of the given numeric field.
        """
        levels = [
            float(pref[field]) if pref[field] else 0.0
            for pref in self.teams[teamno]
        ]

        return sum(levels) / len(levels)
    

    def print_team(self, teamno):
        # Create output structure
        output = []

        # Create Title Header
        title = "Team {} Selection Report".format(teamno)
        output.append(title)
        output.append("-"*len(title))
        output.append("")

        # Print out averages
        output.append(
            "  * Coordinator: {}".format(self.select_coordinator(teamno))
        )
        output.append("")
        output.append(
            "  * Mean Python Level: {}".format(
                self.mean_level(teamno, 'python')
            )
        )
        output.append(
            "  * Mean SQL Level: {}".format(
                self.mean_level(teamno, 'sql')
            )
        )
        output.append(
            "  * Mean CLI Level: {}".format(
                self.mean_level(teamno, 'cli')
            )
        )
        output.append(
            "  * Mean Project Level: {}".format(
                self.mean_level(teamno, 'project')
            )
        )
        output.append("")

        # Print out member names
        output.append("  - Members:")
        output.extend([
            "    + {} ({})".format(pref['name'], pref['email']) 
            for pref in self.teams[teamno]
        ])
        output.append("")

        # Print out domain preferences
        domains = Counter(chain(*[domain for domain in [pref['domains'] for pref in self.teams[teamno]]]))
        output.append("  - Domains:")
        output.extend([
            "    + {}: {}".format(*prefs) 
            for prefs in domains.most_common()
        ])
        output.append("")

        # Print out project preferences
        datasets = Counter(chain(*[dataset for dataset in [pref['datasets'] for pref in self.teams[teamno]]]))
        output.append("  - Project Types:")
        output.extend([
            "    + {}: {}".format(*prefs) 
            for prefs in datasets.most_common()
        ])
        output.append("")

        # Return report string
        return "\n".join(output)

In [5]:
cohort = Cohort()
print cohort.cost()

23


## Optimization

In [6]:
# Random Search Method
cohort = Cohort()

for _ in xrange(5000):
    # 100k searches 
    num_swaps = random.randint(10, 100)
    prob_xfer = 0.25 
    ncohort = Cohort()
    
    for _ in xrange(num_swaps):
        xfer = True if random.random() <= prob_xfer else False 
        ncohort.swap(transfer=xfer)
        if ncohort.cost() < cohort.cost():
            cohort = ncohort
        

# Loop over each team to compute the costs.
print cohort.cost()
for team in cohort.teams:
    print cohort.print_team(team)
    print
    print

11
Team 0 Selection Report
-----------------------

  * Coordinator: Kalev Jaakson

  * Mean Python Level: 5.66666666667
  * Mean SQL Level: 4.33333333333
  * Mean CLI Level: 3.0
  * Mean Project Level: 3.33333333333

  - Members:
    + Jack  Harmon (john.harmon96@gmail.com)
    + Samantha Sadiv (sadiv28@gmail.com)
    + Kalev Jaakson (kj499@georgetown.edu)

  - Domains:
    + Agriculture: 3
    + Sports: 2
    + Transportation: 1
    + Retail/Industry: 1
    + Energy: 1
    + Health Care/Medicine: 1
    + Education: 1
    + Government/Social Data: 1

  - Project Types:
    + Time Series Analysis: 3
    + Visualization/Visual Analytics: 3
    + Regression Analysis: 2
    + Statistical Modeling for Forecasting: 2
    + Clustering or Classification: 2
    + Text Analysis/Natural Language Processing: 2
    + Network Analysis: 1



Team 1 Selection Report
-----------------------

  * Coordinator: Lisa Huynh

  * Mean Python Level: 3.75
  * Mean SQL Level: 4.75
  * Mean CLI Level: 3.75
  * 