# Final Cohorts

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
import random
import matplotlib.pyplot as plt
%matplotlib inline

#### Clean up the raw data

- The similarity info (from the survey) and the registrant doc (from my.harvard) are encoded and laid out differently, so:
    - *For the Registrant Doc*
        - resave the registrant doc as utf-8
        - make sure pd.read_csv treats it as *tab-delimited* (*eye roll emoji*)
        - combine the Last and First columns in info into a single "Last,First" name so we can actually compare the two dataframes
        - join them
        - extract the relevant columns

In [2]:
# Similarity info, from the interest survey
info = pd.read_csv('info copy.txt')

# Actual registrants, from my.harvard
people = pd.read_csv('finalroster copy.txt',delimiter='\t')

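# keep only the first 11 columns (the ones we need);
# .copy() gives us an independent frame, so later edits don't trigger SettingWithCopyWarning
info = info.iloc[:, :11].copy()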

# this student's last name got encoded strangely (because of the accent), so we fix it
info.loc[19,'Last'] = 'Béland'

# take a peek at what the my.harvard export looks like
people.head()

In [3]:
# info has a TON of extra empty lines (like thousands of them)
# so let's drop them
info.dropna(how="all",inplace=True)

# The similarity info has separate fields for first and last name, but the my.harvard export
# has one name field formatted as "Last,First", so build a matching "Name" column.
info["Name"] = info["Last"] + ',' + info["First"]

# once we join the two data sets, we want to be able to identify the people who are actually registered
# so make a 'Reg' field that = T(rue) for registrants
people["Reg"] = "T"

# join the two data sets
both = people.set_index("Name").join(info.set_index("Name"))

# check to see if we retained the right number of registrants post-joining
both[both.Reg == 'T'].shape
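
A shape check confirms that no registrants were lost, but not that every registrant found a matching survey row. As a rough sketch of one way to check that, the toy example below uses `merge` with `indicator=True`. The frames `people_toy` and `info_toy` and all the names in them are made up for illustration; they are not the real files.

```python
import pandas as pd

# Toy stand-ins for the two files; all names and values here are made up.
people_toy = pd.DataFrame({"Name": ["Doe,Jane", "Roe,Richard", "Poe,Edgar"]})
info_toy = pd.DataFrame({
    "Last": ["Doe", "Roe"],
    "First": ["Jane", "Richard"],
    "Program": ["A", "B"],
})
info_toy["Name"] = info_toy["Last"] + "," + info_toy["First"]

# indicator=True adds a "_merge" column saying where each row came from
merged = people_toy.merge(info_toy, on="Name", how="left", indicator=True)

# registrants who have no matching survey row
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Any name that shows up in `unmatched` is worth checking by hand for encoding or spelling differences, like the accented last name fixed earlier.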

In [4]:
# we had initially set the name field as the index, change it back to numbers
both.reset_index(inplace=True)

# take a peek at what our final dataset looks like
both.head()

In [13]:
# someone was missing, add them in
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
both = pd.concat([both, info.loc[[26]]], ignore_index=True)

# identify the relevant columns; .copy() avoids SettingWithCopyWarning when we add columns later
data = both[["Name","Email Address", "Final Concentration (Abbr)","Years Full-Time Employment","Current Work Environment"]].copy()
new_cols = ['name','email','program','employment','sector']
data.columns = new_cols
data.head()

**FINALLY CLEAN!!**

--------

### Cohort Building Functions
Below are the functions that actually create the cohorts.

**sim_groups**

Despite its name, this function can create groups of similar people, dissimilar people, or a mixed bag. It does this by calculating the cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity) between students, and then ordering students by their similarity to a random "seed" student. For similar groups, we take the first X students in the ordered list. For dissimilar groups, we take the *last* X students from the list. For mixed groups, we take roughly X/2 from the top of the list and X/2 from the bottom. The function returns a dictionary of cohorts, each holding a list of indices that correspond to rows in the dataframe we created above.
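
As a rough illustration of the ranking step described above, here is a minimal sketch in plain NumPy. The vectors are made up, and `cosine_to_seed` is a hypothetical helper written for this example; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

# Toy feature vectors, one row per student (values are made up).
vectors = np.array([
    [1.0, 0.0, 2.0],   # student 0 (the seed)
    [2.0, 0.0, 4.0],   # student 1: same direction as the seed
    [0.0, 3.0, 0.0],   # student 2: orthogonal to the seed
    [1.0, 1.0, 2.0],   # student 3: somewhere in between
])

def cosine_to_seed(mat, seed):
    """Cosine similarity between row `seed` and every row of `mat`."""
    norms = np.linalg.norm(mat, axis=1)
    return mat @ mat[seed] / (norms * norms[seed])

seed = 0
scores = cosine_to_seed(vectors, seed)

# order the other students from most to least similar to the seed
others = [int(i) for i in np.argsort(-scores) if i != seed]
most_similar = others[:1]      # take from the top for a "similar" group
least_similar = others[-1:]    # take from the bottom for a "dissimilar" group
```

Cosine similarity only looks at the direction of the vectors, which is why student 1 counts as a perfect match for the seed even though its values are twice as large.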

**make_soup**

"Soup" here means creating a long-ish vector (or "soup") for each student of all the information we have on them. This vector is then used to calculate the cosine similarity between students in the `sim_groups` function. Additionally, we add a clustering prediction to the information we have. Clusters are based on program, employment, and sector, with double weighting on program and sector. These cluster assignments are **not** cohort assignments; the clusters merely give the cohort builder more information when calculating the cosine similarity so we can hopefully have more precise or "refined" groups. This is all based on hypothesis, so it will be interesting to see the results after the course is finished.
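
To see what "double weighting" means here: k-means compares points by Euclidean distance, so listing a column twice counts its differences twice. The sketch below, on two made-up students, shows that duplicating a column gives the same distance as scaling that column by the square root of 2. The helper names are hypothetical and exist only for this example.

```python
import numpy as np

# Two toy students as (program, employment, sector) codes; values are made up.
a = np.array([0.0, 5.0, 1.0])
b = np.array([2.0, 3.0, 4.0])

def duplicate_cols(v):
    # same column order used for clustering: program, program, employment, sector, sector
    return v[[0, 0, 1, 2, 2]]

def scale_cols(v):
    # scale program and sector by sqrt(2), leave employment unchanged
    return v * np.array([np.sqrt(2), 1.0, np.sqrt(2)])

d_plain = np.linalg.norm(a - b)                                  # no extra weighting
d_dup = np.linalg.norm(duplicate_cols(a) - duplicate_cols(b))    # duplicated columns
d_scaled = np.linalg.norm(scale_cols(a) - scale_cols(b))         # scaled columns
```

Here `d_dup` and `d_scaled` agree, and both are larger than `d_plain` because the program and sector differences now count for more.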


**make_groups**

This function preprocesses the data, then calls `make_soup` and `sim_groups` so a user only has to make one function call. The preprocessing step is label encoding, which converts categorical data (like "program") into numerical data. As a result, the vectors built in `make_soup` are made up entirely of numbers rather than a mix of words and numbers.
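
For readers who have not used label encoding before, here is a small sketch with scikit-learn's `LabelEncoder`. Apart from "EPM", the program labels are placeholders invented for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy program labels; "Prog B" and "Prog C" are placeholders.
programs = pd.Series(["Prog B", "EPM", "Prog C", "EPM"])

le = LabelEncoder()
codes = le.fit_transform(programs)

classes = list(le.classes_)   # sorted unique labels
encoded = list(codes)         # position of each label within classes
decoded = list(le.inverse_transform(codes))   # back to the original labels
```

The integer codes follow the sorted order of the labels, so the numbers themselves carry no meaning beyond telling categories apart.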

For the "balanced" groups, we have a different set of functions.

**get_ratios**

This function finds the distribution of programs among the registrants and returns it as a set of ratios. These ratios are what the balanced cohorts are based on. For example, if 30% of all registrants are in EPM, each balanced cohort will also have roughly 30% of its students in EPM.
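
As a sketch of the same idea on toy data, pandas can compute these ratios directly with `value_counts(normalize=True)`. The roster and program labels below are made up, and `cohort_size` is an arbitrary choice for this example.

```python
import pandas as pd

# Toy roster; program labels are made up.
roster = pd.DataFrame({"program": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"]})

# share of students in each program
ratios = roster["program"].value_counts(normalize=True).to_dict()

# target head count per program for a cohort of a given size,
# truncating to whole students
cohort_size = 6
targets = {prog: int(share * cohort_size) for prog, share in ratios.items()}
total_seats_filled = sum(targets.values())
```

Because the head counts are truncated, the targets here add up to fewer than `cohort_size`, so some seats still have to be filled from the remaining students afterwards.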

**make_balanced_groups**

This function uses the ratios from `get_ratios` to create a set of cohorts that reflect the makeup of the entire registrant class. It also tries to ensure that there are no "lone" students; that is, no cohort should contain a student who does not have at least one other person from their program to connect with. This means that even in the balanced cohorts, the program distribution may vary slightly from the overall distribution.
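
One way to check the "no lone students" goal after the fact is to count students per cohort and program. This is only a sketch on made-up data, and the `toy` frame with its `cohort` column is an assumption made for this example.

```python
import pandas as pd

# Toy cohort assignment; all values are made up.
toy = pd.DataFrame({
    "program": ["A", "A", "B", "A", "B", "C"],
    "cohort":  [1,   1,   1,   2,   2,   2],
})

# number of students per (cohort, program) pair
counts = toy.groupby(["cohort", "program"]).size()

# pairs with exactly one student are "lone" students
lone = counts[counts == 1].index.tolist()
print(lone)
```

Each tuple in `lone` names a cohort and a program that has only one student in that cohort.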

In [80]:
# Build groups of students based on cosine (dis)similarity to a random "seed" student.
# kind can be "sim", "dis", or "mix" (default is "sim")
def sim_groups(num_groups, names_list, soup_mat, og_df, kind = "sim"):
    
    names = names_list.copy()
    soup_df = pd.DataFrame(soup_mat, index = og_df.index)
    
    groups = {}
    
    for g in range(num_groups):
        
        # pick a random seed student from those not yet assigned
        idx = random.choice(list(names.index))
        
        # similarity of every remaining student to every other remaining student
        soup = soup_df.values
        cosine_sim = cosine_similarity(soup,soup)
        cos_df = pd.DataFrame(cosine_sim, index=names.index, columns = names.index)
        
        # rank the other students by similarity to the seed (most similar first);
        # drop the seed itself so it can never be counted twice
        sim_scores = cos_df.loc[idx].drop(idx).sort_values(ascending=False)
        
        # Change the slicing here to change group sizes (14 students + the seed = 15)
        if kind == "sim": 
            student_indices = list(sim_scores.iloc[:14].index)
        elif kind == "dis":
            student_indices = list(sim_scores.iloc[-14:].index)
        elif kind == "mix":
            student_indices = list(sim_scores.iloc[:6].index) + list(sim_scores.iloc[-8:].index)
        else:
            raise ValueError('kind must be "sim", "dis", or "mix"')
        
        group_idx = student_indices+[idx]
        groups["Cohort{}".format(g)] = group_idx
        
        # remove the assigned students from the names list
        names.drop(group_idx,inplace=True)
        
        # and from soup_df, so they cannot be picked again
        soup_df.drop(group_idx,inplace=True)

    # Return the groups
    return groups

In [20]:
# df should have columns 'name', 'program','employment','sector'
def make_soup(data):
    
    df = data.copy()

    # cluster on program, employment, and sector; program and sector are listed
    # twice so that they carry double weight
    km = KMeans(n_clusters = 4)
    clus = km.fit_predict(df[['program','program','employment','sector','sector']])
    df['cluster'] = clus
    
    # one integer vector ("soup") per student
    soup_matrix = df[['program','employment','sector','cluster']].astype(int).values
    
    return soup_matrix

In [21]:
def make_groups(num_groups,frame, kind = "sim"):
    
    df = frame.copy()
    
    le = LabelEncoder()
    prog = le.fit_transform(df.program)
    df['program'] = prog
    
    sec = le.fit_transform(df.sector.astype(str))
    df['sector'] = sec
    
    sm = make_soup(df)
    
    names = df.name
    
    g = sim_groups(num_groups,names,sm,df,kind)
    
    return g
    

In [31]:
def get_ratios(df):
    # share of students in each program, computed from the frame that was passed in
    # (not from the global `data`)
    prog_ratios = {}
    program_list = sorted(list(df.program.unique()))
    for prog in program_list:
        counter = (df.program == prog).sum()
        prog_ratios["{}".format(prog)] = counter/df.shape[0]
    
    return prog_ratios

In [34]:
# based on a dictionary of ratios
def make_balanced_groups(size,df,num_cohorts = 8):
    
    frame = df.copy()
    
    ratios = get_ratios(frame)
    programs = [key for key in ratios]
    
    ideal_count = {}
    for key in ratios:
        ideal_count['{}'.format(key)] = int(ratios[key]*size)
    
    cohorts = {}
    for i in range(num_cohorts):
        cohorts['Cohort{}'.format(i+1)] = []
    
    for cohort in cohorts:
        indeces = []
        random.shuffle(programs)
        for key in programs:
            num = ideal_count[key]
            if frame[frame.program == key].shape[0] == 3:
                num = 3
            elif num == 1:
                num += 1
            elif num == 0:
                num += 2
            if frame[frame.program == key].shape[0] >= num:
                students = list(frame[frame.program == key].sample(n = num).index.values)
                frame.drop(students,inplace=True)
                indeces.append(students)
        cohorts[cohort] = [idx for sublist in indeces for idx in sublist]
    
    if frame.shape[0] > 0:
        add_to_each = int(frame.shape[0]/num_cohorts)
        for cohort in cohorts:
            leftovers = list(frame.sample(n = add_to_each).index.values)
            if len(cohorts[cohort]) < size:
                for i in leftovers:
                    cohorts[cohort].append(i)
                frame.drop(leftovers,inplace=True)
    
    for cohort in cohorts:
        if len(cohorts[cohort]) > size:
            chop = len(cohorts[cohort]) - size
            put_back = cohorts[cohort][-chop:]
            cohorts[cohort] = cohorts[cohort][:-chop]
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            frame = pd.concat([frame, df.loc[put_back]])
#             print(frame.loc[put_back].program)
    
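    if frame.shape[0] > 0:
        for s in list(frame.index.values):
            # put each remaining student into the first cohort that still has room;
            # without the break, the same student would be added to every such cohort
            for cohort in cohorts:
                if len(cohorts[cohort]) < size:
                    cohorts[cohort].append(s)
                    break
            frame.drop(s,inplace=True)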

    return cohorts

____

### Actually making the groups

In [35]:
# we want some balanced groups and some groups based on (dis)similarity
# so let's break up the data
balanced = data.sample(frac=.66)
smaller = data.drop(balanced.index)

In [36]:
# start with making the balanced cohorts
balanced_inds = make_balanced_groups(30,balanced,num_cohorts=4)

# since the function returns indices, we now have to associate those indices with entries in our dataframe
bal_cohort_dfs = {}
for cohort in balanced_inds:
    bal_cohort_dfs['{}'.format(cohort)] = balanced.loc[balanced_inds[cohort]]
    
# check for students who ended up in more than one cohort, so we can fix those by hand.
temp = []
for key in balanced_inds:
    temp.append(balanced_inds[key])
    
find_dupes = [idx for sublist in temp for idx in sublist]
set([x for x in find_dupes if find_dupes.count(x) > 1])
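
The check above calls `list.count` once per element, which is fine at this scale. As an alternative sketch, `collections.Counter` does the same job in a single pass. The `toy_cohorts` dictionary and its index values are made up for this example.

```python
from collections import Counter

# Toy version of the cohort dictionary; the index values are made up.
toy_cohorts = {"Cohort1": [3, 7, 9], "Cohort2": [7, 12], "Cohort3": [9, 15]}

# count how many times each index appears across all cohorts
counts = Counter(idx for members in toy_cohorts.values() for idx in members)

# indices that appear more than once
dupes = {idx for idx, n in counts.items() if n > 1}
print(dupes)
```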

In [57]:
# now what we want to do is make two similar cohorts and two mixed cohorts.
# in order to preserve the original ratios, we'll use make_balanced_groups 
# to split this into two balanced groups
# and then break those into the actual cohorts.
smaller_inds = make_balanced_groups(30,smaller,2)

small_cohort_dfs = {}
for cohort in smaller_inds:
    small_cohort_dfs['{}'.format(cohort)] = smaller.loc[smaller_inds[cohort]]
    
# similar groups
two_small_sim = make_groups(2,small_cohort_dfs['Cohort1'])

# mixed groups
two_small_mix = make_groups(2,small_cohort_dfs['Cohort2'],"mix")

In [86]:
# Now that we have all our cohort assignments, let's merge them into our original dataset
# so we can download it and do whatever we want in Excel, etc.
data["cohort"] = ""

#Cohort 1-4: 30 balanced
for i in range(1,5):
    data.loc[balanced_inds['Cohort{}'.format(i)],"cohort"] = i 

#Cohort 5: 15 similar
data.loc[two_small_sim['Cohort0'],"cohort"] = 5

#Cohort 6: 15 mix
data.loc[two_small_mix['Cohort0'],"cohort"] = 6

#Cohort 7: 15 similar
data.loc[two_small_sim['Cohort1'],"cohort"] = 7

#Cohort 8: 15 mix
data.loc[two_small_mix['Cohort1'],"cohort"] = 8


In [105]:
# Find out who didn't sign up: survey respondents whose name is not on the final roster
didnt_sign_up = info[~info.Name.isin(both.Name)]

# drop three rows by index label (non-inplace, so we don't modify a slice of info)
didnt_sign_up = didnt_sign_up.drop([40,222,146])
# didnt_sign_up.to_csv('DidntEnroll.csv')