# Cleaning Education Data

Author: Andrea Mock

This notebook cleans the data describing each alum's education.
Many alums list their degrees and majors in different formats, so we standardize the entries to make majors easy to compare across people.

In [14]:
# load necessary libraries
import pandas as pd
import json

In [19]:
# load the pickled education data
edu_df = pd.read_pickle('education')

In [62]:
def findProfessors(person):
    """
    identifies if someone states that they are a professor at Wellesley or not
    """
    
    if type(person) == str:
        if ('Professor at Wellesley') in person:
            return True
    return False

In [64]:
# indicator if someone is a professor at Wellesley
isProfWellesley = edu_df['headline'].apply( lambda x: findProfessors(x))

In [66]:
# remove all the professors from our dataset
edu_df = edu_df[~isProfWellesley]

In [1]:
#edu_df.head() # get a sense of how our data currently looks

In [81]:
#sort dataset by names 
edu_sorted = edu_df.sort_values('name').reset_index(drop=True)

In [28]:
# read in dataset with cleaned job titles and companies
jobs_df = pd.read_csv('job_info.csv', index_col = 0)

In [47]:
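def findGradStudents(student):
    """
    identifies if a job title mentions being a phd, grad or other student
    """
    if type(student) == str:
        # test each keyword separately: ('phd' or 'grad' or 'student') is just 'phd'
        if any(word in student.lower() for word in ('phd', 'grad', 'student')):
            return True
    return False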
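
One detail worth spelling out when a title is checked against several keywords: in Python, an expression such as `('phd' or 'grad' or 'student')` evaluates to its first truthy operand, so an `in` test against it only ever looks for `'phd'`. The sketch below shows that behaviour and one way to test every keyword; `GRAD_KEYWORDS` and `has_grad_keyword` are hypothetical names used only for this illustration.

```python
# Hypothetical keyword tuple, mirroring the three terms used for job titles.
GRAD_KEYWORDS = ('phd', 'grad', 'student')

def has_grad_keyword(title):
    """Return True if any keyword occurs in the lower-cased title."""
    if not isinstance(title, str):
        return False
    lowered = title.lower()
    return any(keyword in lowered for keyword in GRAD_KEYWORDS)

# `or` returns its first truthy operand, so this is simply the string 'phd'.
first_truthy = ('phd' or 'grad' or 'student')

print(first_truthy)
print(has_grad_keyword('Graduate Research Assistant'))
print(has_grad_keyword('Software Engineer'))
print(has_grad_keyword(None))
```

Because the check is a plain substring match, a title containing "undergraduate" would also count; whether that is desirable depends on the data.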

In [48]:
tf = jobs_df['title'].apply( lambda x: findGradStudents(x) )
# create a column that is a true false indicator if someone is a grad student
jobs_df['current_grad_student'] = tf

In [79]:
# sort dataset by names
jobs_sorted = jobs_df.sort_values('name').reset_index(drop=True)

In [83]:
# merge the datasets 
merged_df = pd.concat([edu_sorted, jobs_sorted], axis=1, sort=False)
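
`pd.concat(..., axis=1)` pairs rows by index label, which after `reset_index(drop=True)` means by position, so this step relies on both sorted frames listing the same people in the same order. The notebook does not state whether that holds; the toy sketch below (made-up names and values) only illustrates how positional pairing differs from a key-based `merge` on `name` when one frame is missing a person.

```python
import pandas as pd

# Toy stand-ins for edu_sorted and jobs_sorted; all values are invented.
edu_toy = pd.DataFrame({'name': ['Ada', 'Bea', 'Cy'],
                        'major': ['CS', 'Math', 'Physics']})
jobs_toy = pd.DataFrame({'name': ['Ada', 'Cy'],
                         'title': ['Engineer', 'PhD Student']})

# Positional pairing: row i of one frame is placed next to row i of the other.
side_by_side = pd.concat([edu_toy, jobs_toy], axis=1, sort=False)

# Key-based pairing: rows are matched on the value in the 'name' column.
by_name = edu_toy.merge(jobs_toy, on='name', how='left')

print(side_by_side)
print(by_name)
```

In the positional result, Bea's education row ends up beside Cy's job row; in the merged result, Bea simply has a missing title.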

In [89]:
def remove_dup_columns(frame):
    """
    since both datasets overlap, keep the first occurrence of each column name and drop the duplicate columns
    """
    keep_names = set()
    keep_icols = list()
    for icol, name in enumerate(frame.columns):
        if name not in keep_names:
            keep_names.add(name)
            keep_icols.append(icol)
    return frame.iloc[:, keep_icols]
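
As a cross-check, the same "keep the first occurrence of each column name" idea can be sketched with the boolean mask from `Index.duplicated()`. The toy frame below is invented for illustration.

```python
import pandas as pd

# Toy frame with a repeated 'name' column, as produced by a side-by-side concat.
toy = pd.DataFrame([[1, 2, 3, 4]], columns=['name', 'title', 'name', 'url'])

# duplicated() marks every repeat after the first occurrence; ~ inverts the mask.
deduped = toy.loc[:, ~toy.columns.duplicated()]

print(list(deduped.columns))
```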

In [92]:
# create a dataframe that does not have duplicate columns 
no_dup_df = remove_dup_columns(merged_df)

In [94]:
# see if dropping columns was successful
no_dup_df.columns

Index(['name', 'headline', 'summary', 'location', 'current_company_link',
       'skills', 'education', 'jobs', 'education_clean', 'degree',
       'study_range', 'major', 'grad_year', 'grad_school', 'job_cleaned',
       'title', 'company', 'years', 'description', 'url', 'end', 'start',
       'current_grad_student'],
      dtype='object')

In [102]:
# drop columns used for cleaning data
no_dup_df = no_dup_df.drop(['education_clean', 'current_company_link', 'job_cleaned'], axis=1)

In [104]:
def attendedGradSchool(person):
    """
    return true if a person has attended or attends grad school, else false
    """
    if ((person['grad_school'] == True) or (person['current_grad_student'] == True)):
        return True
    return False

In [107]:
# yes no indicator if someone attended grad school
gradSchool = no_dup_df.apply(attendedGradSchool, axis =1 )
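
The row-wise `apply` can also be written as a column-wise comparison. This is a minimal sketch on an invented frame, assuming the two columns hold True/False values with some entries missing; comparing with `.eq(True)` treats missing entries as False.

```python
import pandas as pd

# Invented values; None stands in for a missing entry.
toy = pd.DataFrame({
    'grad_school': [True, False, None, False],
    'current_grad_student': [False, False, True, None],
})

# True where either indicator is True; missing values compare as False.
attended = toy['grad_school'].eq(True) | toy['current_grad_student'].eq(True)

print(attended.tolist())
```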

In [108]:
# add a column for if someone attended gradschool or is attending
no_dup_df['attended_grad_school'] = gradSchool

In [122]:
# drop columns used for determining if someone went to grad school
no_dup_df = no_dup_df.drop(['grad_school', 'current_grad_student'], axis=1)

In [124]:
# save alum data 
no_dup_df.to_pickle('alum_data')

## Part 2: Cleaning majors at Wellesley

After cleaning and merging the data, we continue by determining each person's major(s) at Wellesley.
First, let's look at the majors people listed.

In [128]:
no_dup_df.major.unique()

array(['Media Arts and Sciences with Computer Science concentration',
       'Media Arts and Sciences (Computer Science, Film, and Design) and Political Science',
       None, 'Computer Science', 'Political Science',
       'Physics, Computer Science',
       'Computer Science, Minor in Economics',
       'Computer Science, Mathematics', 'Psychology, Computer Science',
       'Major in Computer Science; Minor in Astronomy',
       'Computer Science and Spanish', 'Cultural Studies',
       'American Studies',
       'Cognitive and Linguistic Science and Computer Science',
       'Biology/Biological Sciences, General',
       'Computer Science, Biological Chemistry',
       'BA, Media Arts and Sciences, Cum Laude', 'Biology and Sociology',
       'Economics and Data Science (Mathematics, Statistics, Computer Science)',
       'Computer Science; American Studies',
       'Chemistry, Computer Science', 'Physics',
       'Majoring in Media Arts and Sciences (Computer Science & Design), Mino

As we can see, many people who pursued the same major listed it differently. Our first step is to split each entry into individual majors on separators such as , or ;
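
A minimal sketch of splitting on punctuation, using entries from the output above. `split_on_punctuation` is a hypothetical helper; unlike the notebook's own splitting function, it splits on both separators at once.

```python
import re

def split_on_punctuation(major_text):
    """Split a major string on commas and semicolons and trim each piece."""
    parts = re.split(r'[,;]', major_text)
    return [part.strip() for part in parts if part.strip()]

print(split_on_punctuation('Physics, Computer Science'))
print(split_on_punctuation('Computer Science; American Studies'))
print(split_on_punctuation('Computer Science'))
```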

In [190]:
import re

In [559]:
def removeWords(words):
    """
    removes 'stopwords' (i.e. words such as major or minor) and replaces long major names with acronyms
    so that majors can be split and analyzed more easily
    """
    
    if type(words) == str:
        words = re.sub(r'\([^)]*\)', '', words)
        words = re.sub(r"\d+", "", words)
        words = re.sub("major", "", words)
        words = re.sub("minor", "", words)
        # remove the longer forms first so that no stray "ing" is left behind
        words = re.sub("Majoring", "", words)
        words = re.sub("Minoring", "", words)
        words = re.sub("Major", "", words)
        words = re.sub("Minor", "", words)
        words = re.sub(" in ", " ", words)
        words = re.sub("Cum Laude", "",words)
        words = re.sub("BA", "",words)
        words = re.sub("Women's and Gender Studies", "WGST", words)
        words = re.sub("Women & Gender Studies", "WGST", words)
        words = re.sub("Women’s and Gender Studies", "WGST", words)
        words = re.sub("Women's Studies", "WGST", words)
        words = re.sub("Media Arts and Sciences", "MAS", words)
        words = re.sub('Media Arts & Sciences', "MAS", words)
        words = re.sub("Media Arts and Science", "MAS", words)
        words = re.sub("Media Arts & Sicences", "MAS", words)
        words = re.sub('Media Arts and Computer Science','MAS', words)
        words = re.sub('Computer Science and Cinema and Media Studies','MAS', words)
        words = re.sub('Media Art & Computer Science','MAS', words)
        words = re.sub('Media Arts and Scienc','MAS', words)
        words = re.sub('Computer Science  and Media Arts','MAS', words)
        words = re.sub('Computer Science and Digital Arts','MAS', words)
        words = re.sub('Double','', words)
        words = re.sub('double','', words)
        words = re.sub('summa cum laude','', words)
        words = re.sub('Cum laude','', words)
        words = re.sub('cum laude','', words)
        words = re.sub(":", "", words)
        #words = re.sub(".", "", words)
        #sep = 'with'
       # words = words.split(sep, 1)[0]
        words = words.strip('.')
        return words
    return ''
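
Two of the ideas in this function can be illustrated in isolation: removing parenthesized text before splitting (so commas inside parentheses do not create extra pieces), and mapping spelling variants to one acronym. The sketch below is not the notebook's implementation; `ALIASES`, `strip_parentheses` and `apply_aliases` are hypothetical names, and the table lists only a few of the variants handled above.

```python
import re

# Hypothetical alias table with a handful of the spellings replaced above.
ALIASES = {
    "Women's and Gender Studies": 'WGST',
    'Women & Gender Studies': 'WGST',
    'Media Arts and Sciences': 'MAS',
    'Media Arts & Sciences': 'MAS',
}

def strip_parentheses(text):
    """Drop any '(...)' group, including commas inside it."""
    return re.sub(r'\([^)]*\)', '', text).strip()

def apply_aliases(text):
    """Replace each known long name with its acronym."""
    for long_name, acronym in ALIASES.items():
        text = text.replace(long_name, acronym)
    return text

print(strip_parentheses('Economics and Data Science (Mathematics, Statistics, Computer Science)'))
print(apply_aliases('BA, Media Arts and Sciences, Cum Laude'))
```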

In [560]:
def splitMajor(majorList):
    """
    split a major on the first separator it contains to determine double majors/major minor combos
    """
    if type(majorList) == str:
        # separators are tried in priority order; only the first one found is used
        for sep in ['|', ',', ';', '&', '+', 'and', 'with', ' - ', '/']:
            if sep in majorList:
                return majorList.split(sep)
        return [majorList] # no separator found, so a single major
    return ['']

In [561]:
# create a column to clean major
clean1 = no_dup_df['major'].apply( lambda x: removeWords(x))
no_dup_df['major_cleaning'] = clean1.apply(lambda x: splitMajor(x))

In [562]:
clean1

0                MAS with Computer Science concentration
1                             MAS  and Political Science
2                                                       
3                                       Computer Science
4                                       Computer Science
                             ...                        
603                               Mathematics, Economics
604                                     Computer Science
605    Russian Language and Literature, Computer Science
606                  Computer Science  Africana Studies 
607                                                 MAS 
Name: major, Length: 608, dtype: object

In [563]:
# after splitting, some entries list a single major while others list two or three items
no_dup_df['major_cleaning'].apply( lambda x: len(x)).value_counts()

1    312
2    281
3     15
Name: major_cleaning, dtype: int64

In [564]:
# take the first listed major and strip surrounding white space
no_dup_df['major1'] = no_dup_df['major_cleaning'].apply(lambda x: x[0].strip())

In [503]:
def getSecondMajor(person):
    """
    if someone has a second major then extract that major and return it 
    """
    if (len(person['major_cleaning']) > 1):
        return person['major_cleaning'][1].strip()
    return '' # return empty string if no second major 

In [504]:
def getThirdMajor(person):
    """
    if someone has a third major then extract that major and return it
    """
    if (len(person['major_cleaning']) > 2):
        return person['major_cleaning'][2].strip()
    return '' # return empty string if no third major 

In [567]:
no_dup_df['major2'] = no_dup_df.apply(getSecondMajor, axis=1)

In [568]:
no_dup_df['major3'] = no_dup_df.apply(getThirdMajor, axis=1)
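
The first, second and third entries could also be extracted in one pass by padding each split list to a fixed width. This is a sketch under the assumption that at most three entries matter; `pad_majors` is a hypothetical helper, not part of the notebook.

```python
def pad_majors(parts, width=3):
    """Trim each piece and pad with empty strings up to `width` entries."""
    cleaned = [part.strip() for part in parts[:width]]
    return cleaned + [''] * (width - len(cleaned))

print(pad_majors(['Computer Science']))
print(pad_majors(['Physics', ' Computer Science']))
```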

In [507]:
def cleanMAS(major):
    """
    if someone is an MAS major, drop other unnecessary info in major description
    """
    if 'MAS' in major:
        return 'MAS'
    else:
        return major

In [508]:
no_dup_df['major1'].value_counts(ascending = True)[:20]

Cognitive and Linguistic Science               1
Chinese Studies                                1
Cross-registered for Computer Science          1
Russian Language and Literature                1
International Relations - Political Science    1
Computer Science and Chinese Language          1
Digital MAS                                    1
Computer Sci.                                  1
Mathematics and Economics                      1
Biology/Biological Sciences                    1
Logic                                          1
Africana Studies                               1
Human-computer interaction                     1
Cognitive Science-Computer Science             1
MAS with an emphasis Computer Science          1
International Relations                        1
Teacher Education Program                      1
Anthropology                                   1
French                                         1
Environmental Studies and MAS                  1
Name: major1, dtype:

In [509]:
# one of our current entries
no_dup_df.iloc[46]['major_cleaning']

['Environmental Studies and MAS',
 ' concentration economics and computer science']

In [632]:
# clean MAS major data (remove any unnecessary info)
no_dup_df['major1'] = no_dup_df['major1'].apply(lambda x: cleanMAS(x))

In [635]:
def cleaning2(major): 
    """
    maps the different spellings of a major to a single, unified name for that major
    """
    
    if ('Computer' in major or 'computer' in major):
        return 'Computer Science'
    elif ('Cognitive' in major or 'Linguistic' in major):
        return 'Cognitive Science'
    elif 'International Relations' in major:
        return 'International Relations'
    elif 'Biolo' in major:
        return 'Biology'
    elif 'Classic' in major:
        return 'Classics'
    elif ('English' in major or 'Writing' in major):
        return 'English'
    elif ('Geology' in major or 'Geoscience' in major):
        return 'Geology'
    elif ('Studio Art' in major or 'Art Studies' in major or 'Fine Art' in major or 'Art' == major):
        return 'Studio Art'
    elif ('Math' in major):
        return 'Math'
    elif ('Education' in major):
        return 'Education'
    elif ('Russian' in major):
        return 'Russian'
    elif ('Logic' in major or 'Philosophy' in major):
        return 'Philosophy'
    elif ('Cinema' in major):
        return 'CAMS'
    elif ('music' in major.lower()):
        return 'Music'
    elif ('Econom' in major):
        return 'Economics'
    elif 'Biochemistry' in major:
        return 'Biochemistry'
    elif 'Environment' in major:
        return 'Environmental Studies'
    elif 'Chinese' in major:
        return 'Chinese'
    elif 'Japanese' in major:
        return 'Japanese'
    return major

In [614]:
# see how many unique majors we have after first round of cleaning
no_dup_df['major1'].apply(lambda x: cleaning2(x)).unique()

array(['MAS', '', 'Computer Science', 'Political Science', 'Physics',
       'Psychology', 'Cultural Studies', 'American Studies',
       'Cognitive Science', 'Biology', 'Economics', 'Chemistry',
       'Classics', 'WGST', 'Math', 'English', 'Neuroscience', 'Geology',
       'History', 'Music', 'Art History', 'Language Studies', 'French',
       'Chinese Studies', 'Architecture', 'Anthropology', 'Philosophy',
       'Environmental Studies', 'History and Religion',
       'Africana Studies', 'Education', 'Comparative Literature',
       'Sociology', 'Studio Art', 'International Relations',
       'Liberal Arts', 'Biochemistry', 'Liberal Arts and Sciences',
       'Russian'], dtype=object)

In [636]:
def cleaning3(major):
    """
    checks if a major is in our listed majors, if yes returns that major else returns an empty string
    """
    
    major_list = ['Africana Studies',
 'American Studies',
 'Anthropology',
 'Arabic',
 'Architecture',
 'Art History',
 'Asian-American Studies',
 'Astronomy',
 'Biochemistry',
 'Biology',
 'CAMS',
 'Chemistry',
 'Chinese',
 'Classics',
 'Cognitive Science',
 'Comparative Literature',
 'Computer Science',
 'Data Science',
 'East Asian Studies',
 'Economics',
 'Education',
 'English',
 'Environmental Studies',
 'French',
 'Geology',
 'German',
 'History',
 'History and Religion',
 'International Relations',
 'Italian Studies',
 'Japanese',
 'Jewish Studies',
 'Latin',
 'MAS',
 'Math',
 'Medieval and Renaissance Studies',
 'Middle Eastern Studies',
 'Music',
 'Neuroscience',
 'Philosophy',
 'Physics',
 'Political Science',
 'Psychology',
 'Russian',
 'Sociology',
 'Spanish',
 'Studio Art',
 'WGST']
    if type(major) == str:
        if major in major_list:
            return major
    return ''
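
The effect of this allow-list step is easiest to see on a few values: anything that is not an exact match for a listed major becomes an empty string. The sketch below uses a deliberately shortened, hypothetical list (`KNOWN_MAJORS`) and a set for the membership test.

```python
# Hypothetical, shortened allow-list; the notebook's full list is longer.
KNOWN_MAJORS = frozenset({'Computer Science', 'Economics', 'Music', 'MAS'})

def keep_known(major):
    """Return the major if it is an exact match for a known name, else ''."""
    if isinstance(major, str) and major in KNOWN_MAJORS:
        return major
    return ''

print(keep_known('Economics'))
print(repr(keep_known('Culture')))
print(repr(keep_known('Music Theory')))
```

Because the match is exact, a value such as 'Music Theory' is only kept if an earlier normalization step has already reduced it to 'Music'.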

In [645]:
no_dup_df['major2'].apply(lambda x: cleaning2(x)).unique()

array(['Computer Science', 'Political Science', '', 'Economics', 'Math',
       'Astronomy', 'Spanish', 'Cognitive Science', 'Biology', 'MAS',
       'Sociology', 'Data Science', 'American Studies',
       'Environmental Studies', 'English', 'WGST', 'Latin', 'Arabic',
       'History', 'Philosophy', 'Physics', 'Geology', 'Psychology',
       'Studio Art', 'Medieval and Renaissance Studies', 'French',
       'Biochemistry', 'Education', 'Chinese', 'Music', 'Neuroscience',
       'Classics', 'Chemistry', 'Italian Studies', 'East Asian Studies',
       'Asian-American Studies', 'German', 'CAMS', 'Japanese',
       'Middle Eastern Studies', 'International Relations',
       'Jewish Studies', 'Art History', 'Anthropology',
       'Africana Studies', 'Comparative Literature'], dtype=object)

In [617]:
no_dup_df['major3'].unique()

array(['', 'Computer Science', 'Studio Art', 'Russian',
       'Ethnicity Studies', 'Economics', 'French', 'Math', 'Culture',
       'Psychology', 'Music Theory', 'Italian Studies'], dtype=object)

In [640]:
# create a list of all majors to use for later
allMajors = []
for col in ['major3', 'major2', 'major1']:
    allMajors += list(no_dup_df[col].apply(lambda x: cleaning3(cleaning2(x))).unique())

In [641]:
# all unique majors in our dataset
uniqueMajors = set(allMajors)
uniqueMajors

{'',
 'Africana Studies',
 'American Studies',
 'Anthropology',
 'Arabic',
 'Architecture',
 'Art History',
 'Asian-American Studies',
 'Astronomy',
 'Biochemistry',
 'Biology',
 'CAMS',
 'Chemistry',
 'Chinese',
 'Classics',
 'Cognitive Science',
 'Comparative Literature',
 'Computer Science',
 'Data Science',
 'East Asian Studies',
 'Economics',
 'Education',
 'English',
 'Environmental Studies',
 'French',
 'Geology',
 'German',
 'History',
 'History and Religion',
 'International Relations',
 'Italian Studies',
 'Japanese',
 'Jewish Studies',
 'Latin',
 'MAS',
 'Math',
 'Medieval and Renaissance Studies',
 'Middle Eastern Studies',
 'Music',
 'Neuroscience',
 'Philosophy',
 'Physics',
 'Political Science',
 'Psychology',
 'Russian',
 'Sociology',
 'Spanish',
 'Studio Art',
 'WGST'}

In [642]:
# clean the major listings 
no_dup_df['major1'] = no_dup_df['major1'].apply(lambda x: cleaning3(cleaning2(x)))
no_dup_df['major2'] = no_dup_df['major2'].apply(lambda x: cleaning3(cleaning2(x)))
no_dup_df['major3'] = no_dup_df['major3'].apply(lambda x: cleaning3(cleaning2(x)))

In [647]:
no_dup_new = no_dup_df.copy()

In [650]:
no_dup_new.columns

Index(['name', 'headline', 'summary', 'location', 'skills', 'education',
       'jobs', 'degree', 'study_range', 'major', 'grad_year', 'title',
       'company', 'years', 'description', 'url', 'end', 'start',
       'attended_grad_school', 'major_cleaning', 'major1', 'major2', 'major3'],
      dtype='object')

In [652]:
# drop columns used for cleaning
no_dup_new = no_dup_new.drop(['major_cleaning', 'major', 'years'], axis=1)

In [654]:
# save cleaned data in a csv file
no_dup_new.to_csv('majors_v2.csv')