# Data exploration and cleaning 

Author: Andrea Mock

The first step after collecting data from LinkedIn is to get familiar with the dataset and clean it so that further information can be extracted.

In [2]:
import json
import pandas as pd

## Part 1: Reading in data and exploring
The first step is to read the data into a dataframe and extract information such as location, name, graduation year, and education. We also look at a few entries to see what information might be useful to examine later on.

In [3]:
# import json file that contains all of the linkedin profiles
with open('mergedData.json') as inFile:
    csJson = json.load(inFile)

In [4]:
# number of entries
len(csJson)

753

In [7]:
#csJson[0] # sample entry - lots of information!

In [8]:
# convert json file to a dataframe
df = pd.DataFrame(csJson)
#df

Since all of the entries are dictionaries we want to get a better sense of what each dictionary contains. 
We are interested in the company, school, location, headline, summary, job experience, educational information, and skills a person has.

In [9]:
csJson[1].keys() # keys in our json dictionary 

dict_keys(['personal_info', 'experiences', 'skills', 'accomplishments', 'interests', 'recommendations'])
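Looking at the keys of one entry only shows the structure of that entry. A minimal sketch of how one might check that every profile carries the same top-level keys is shown here; `sample_profiles` and `count_keys` are hypothetical names standing in for `csJson` and are not part of the original notebook.

```python
from collections import Counter

# Hypothetical stand-in for csJson: a list of profile dictionaries.
sample_profiles = [
    {"personal_info": {"name": "A"}, "experiences": {}, "skills": []},
    {"personal_info": {"name": "B"}, "experiences": {}, "skills": [], "interests": []},
]

def count_keys(profiles):
    """Count how many profiles contain each top-level key."""
    counts = Counter()
    for profile in profiles:
        counts.update(profile.keys())
    return counts

key_counts = count_keys(sample_profiles)

# Keys that appear in fewer profiles than the total are missing somewhere.
missing_somewhere = [k for k, n in key_counts.items() if n < len(sample_profiles)]
print(key_counts)
print(missing_somewhere)
```

If `missing_somewhere` is empty for the real data, direct indexing such as `x['skills']` is safe for every profile.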

In [10]:
csJson[1]['skills'][0] # example of a skills entry

{'name': 'Python', 'endorsements': '7'}

In [11]:
#csJson[0]['personal_info'] # explore what is in the personal-info section

In [12]:
csJson[1]['skills'][0] # skills 

{'name': 'Python', 'endorsements': '7'}

As we can see, alums list multiple skills, so we want to extract the skills for each alum as well as their educational information.

In [13]:
def getSkills(jsonData):
    """
    extracts the skills for each alum and returns them as a list of lists, one inner list per alum
    """
    all_skills = []
    for i in range(len(jsonData)):
        skill_list = []
        for skill in jsonData[i]:
            skill_list.append(skill['name'])
        all_skills.append(skill_list)
    return all_skills

In [14]:
getSkills(df['skills'])[0] # sample skills entry for one alum

['Java',
 'Python (Programming Language)',
 'Customer Service',
 'Mandarin',
 'Microsoft Office',
 'Communication',
 'Programming Languages',
 'Calculus']
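The same extraction can be written as a nested list comprehension. The sketch below is an alternative, not the notebook's method; it assumes that a skills field may be empty or missing and that a skill entry may lack a `name`, which the data above does not confirm. `get_skill_names` is a hypothetical helper.

```python
def get_skill_names(skills_per_alum):
    """Return one list of skill names per alum, skipping entries without a name."""
    return [
        [skill["name"] for skill in (skills or []) if skill.get("name")]
        for skills in skills_per_alum
    ]

# Made-up sample in the same shape as the 'skills' column.
sample = [
    [{"name": "Python", "endorsements": "7"}, {"name": "Java", "endorsements": "2"}],
    [],    # alum with no skills listed
    None,  # alum whose skills field is missing
]
skill_names = get_skill_names(sample)
print(skill_names)
```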

In [15]:
def getEducation(data):
    """
    extracts the educational information from all of the alums data
    """
    
    education_all = []
    for i in range(len(data)):
        education_all.append( data[i]['education'])
    return education_all

In [16]:
getEducation(df['experiences'])[0] # output of our getEducation function

[{'name': 'Wellesley College',
  'degree': "Bachelor's degree",
  'grades': None,
  'field_of_study': 'Computer Science',
  'date_range': '2018 – 2022',
  'activities': 'Computer Science Club, Asian Student Union, Wellesley Democrats'},
 {'name': "St. Andrew's School",
  'degree': 'High School Diploma',
  'grades': 'Magna Cum Laude',
  'field_of_study': None,
  'date_range': '2014 – 2018',
  'activities': 'Co-Head of Student Weekend Activity Group, Founder of Multi-Racial Affinity Group, Writer for Cardinal School Newspaper, Varsity Cross-Country, Varsity Track, Varsity Soccer'}]

In [17]:
def createDataFrame(df):
    """
    creates a dataframe that contains relevant information about wellesley alums including
    name, headline, summary and location
    """
    
    new_df = pd.DataFrame()
    new_df['name'] = df['personal_info'].apply(lambda x: x['name'])
    new_df['headline'] = df['personal_info'].apply(lambda x: x['headline'])
    new_df['summary'] = df['personal_info'].apply(lambda x: x['summary'])
    new_df['location'] = df['personal_info'].apply(lambda x: x['location'])
    new_df['current_company_link'] = df['personal_info'].apply(lambda x: x['current_company_link'])
    new_df['skills'] = getSkills(df['skills'])
    new_df['education'] = df['experiences'].apply(lambda x: x['education'])
    new_df['jobs'] = df['experiences'].apply(lambda x: x['jobs'])
    return new_df

In [18]:
new = createDataFrame(df) # create a new dataframe with only relevant info

In [20]:
#new.head()
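As an alternative to pulling each field out with `apply`, pandas can flatten nested dictionaries in a single call. The following is a sketch under the assumption that the profiles have the nested shape shown earlier; the sample records are made up for illustration.

```python
import pandas as pd

# Made-up records in the same nested shape as the LinkedIn profiles.
sample_profiles = [
    {
        "personal_info": {"name": "Alum A", "headline": "Engineer", "location": "Boston"},
        "skills": [{"name": "Python", "endorsements": "7"}],
    },
    {
        "personal_info": {"name": "Alum B", "headline": "Analyst", "location": "New York"},
        "skills": [],
    },
]

# Nested dictionaries become dotted columns such as 'personal_info.name';
# list-valued fields such as 'skills' are kept as lists.
flat = pd.json_normalize(sample_profiles)

# Drop the 'personal_info.' prefix so the columns match the names used in createDataFrame.
flat = flat.rename(columns=lambda c: c.replace("personal_info.", ""))
print(flat.columns.tolist())
```

The explicit `apply` version in `createDataFrame` makes the selected fields obvious, while `json_normalize` is shorter when most nested fields are wanted.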

## Part 2: Cleaning dataframe
After gathering the most important information from the original JSON file, we want to clean the new dataframe, since some entries are still dictionaries or otherwise hard to read. 
The first step is to gather information on each person's Wellesley education: when they graduated and which major(s) they pursued.

In [21]:
def getWellesleyInfo(edu_data):
    """ 
    gathers a person's Wellesley education entries and returns them as a list of dictionaries
    """
    edu_list = []
    for i in range(len(edu_data)):
        edu_df = pd.DataFrame(edu_data['education'][i]) 
        # keep only the rows whose school name mentions Wellesley
        # ('records' is spelled out because newer pandas versions reject the 'r' shorthand)
        onlyWell = edu_df[edu_df['name'].apply(lambda x: "wellesley" in x.lower())].to_dict('records')
        edu_list.append(onlyWell)
    return edu_list

In [22]:
getWellesleyInfo(new)[0]

[{'name': 'Wellesley College',
  'degree': "Bachelor's degree",
  'grades': None,
  'field_of_study': 'Computer Science',
  'date_range': '2018 – 2022',
  'activities': 'Computer Science Club, Asian Student Union, Wellesley Democrats'}]
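Because each education entry is already a dictionary, the same filter can be done without building a dataframe per person. This is a sketch of that alternative; `wellesley_entries` is a hypothetical helper, and the handling of a missing `name` is an assumption.

```python
def wellesley_entries(education):
    """Return the education entries whose school name mentions Wellesley."""
    return [
        entry for entry in education
        if "wellesley" in (entry.get("name") or "").lower()
    ]

# Made-up education list in the same shape as one row of the 'education' column.
education = [
    {"name": "Wellesley College", "degree": "Bachelor's degree", "date_range": "2018 – 2022"},
    {"name": "St. Andrew's School", "degree": "High School Diploma", "date_range": "2014 – 2018"},
    {"name": None, "degree": None, "date_range": None},
]
only_wellesley = wellesley_entries(education)
print(only_wellesley)
```

Applied to the dataframe, this would be `new['education'].apply(wellesley_entries)`.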

In [23]:
new['education_clean'] = getWellesleyInfo(new) # new column with wellesley education info

In [24]:
# columns with other Wellesley education data: degree, date range, and major
new['degree'] = new['education_clean'].apply(lambda x: x[0]['degree'])
new['study_range'] = new['education_clean'].apply(lambda x: x[0]['date_range'])
new['major'] = new['education_clean'].apply(lambda x: x[0]['field_of_study'])

In [25]:
def split(x):
    """ 
    splits a string on whitespace if it exists, otherwise returns an empty string 
    """
    if x is None:
        return ""
    return x.split()

In [26]:
def gradYear(data):
    """
    given the range of years of study extracts graduation year
    """
    years = split(data)
    if len(years) > 0:
        return years[-1]
    else:
        return ""

In [27]:
# new column that contains graduation year 
new['grad_year'] = new['study_range'].apply(lambda x: gradYear(x)) 

In [28]:
from dateutil import parser

In [29]:
def convertGradYear(yearData):
    """ 
    converts a graduation year string to an integer year if it exists, otherwise returns -1
    """
    if yearData == "":
        return -1 # if there is no data
    else:
        return parser.parse(yearData).year

In [30]:
# converting the type of graduation year to int from string
new['grad_year'] = new['grad_year'].apply(lambda x: convertGradYear(x))
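The three steps above (split the range, take the last token, convert it to an integer) can also be combined into one function. The sketch below is an alternative that checks the last token against a four-digit year pattern instead of using `dateutil`; `grad_year_from_range` is a hypothetical name, and returning -1 for values such as "Present" is an assumption about how those rows should be treated.

```python
import re

# A four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"^(?:19|20)\d{2}$")

def grad_year_from_range(date_range):
    """Return the last year of a 'start – end' range as an int, or -1 if unavailable."""
    if not date_range:
        return -1
    tokens = date_range.split()
    if not tokens:
        return -1
    last = tokens[-1]
    return int(last) if YEAR_PATTERN.match(last) else -1

print(grad_year_from_range("2018 – 2022"))
print(grad_year_from_range("2019 – Present"))
print(grad_year_from_range(None))
```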

In [31]:
# filtering out current students
alums = new[new['grad_year'] < 2021]

In [32]:
alums = alums.reset_index(drop=True)

In [34]:
#alums

## Part 3: Removing non-Wellesley graduates
The current dataset still contains individuals who cross-registered at Wellesley or took an online or summer course, and who therefore should not be included in our Wellesley alums dataset. 
We identify the degree descriptions that indicate someone is not a Wellesley graduate and remove those rows.

In [35]:
# look at degrees and use that to identify students who really are not students at wellesley
alums['degree'].unique()

array(["Bachelor's degree", 'Bachelor of Arts - BA',
       'Bachelor of Arts - BA, Magna Cum Laude',
       'Bachelor of Science - BS', None, 'Media Arts and Sciences',
       'Bachelor’s Degree', 'Bachelor of Arts (B.A.)',
       'Cross-registered Student', 'Bachelors', "Bachelor's Degree",
       "Bachelor's degree (Transferred after first year)",
       "Pursing a Bachelor's Degree", 'Bachelor of Arts (BA)',
       'Comparative Literature', 'Bachelor of Science (B.S.)',
       'Bachelor of Arts', 'Undergraduate', 'BA', 'Post Bac', 'B.A.',
       'Summa Cum Laude, Phi Beta Kappa',
       'Bachelor of Arts (B.A.) with Honors',
       'Candidate for Bachelor of Arts, May 2019', 'B. A.',
       'Bachelor of Science (B.S.) University of Dayton',
       "Bachelor's Degree, Summa Cum Laude (Durant Scholar)",
       'Computer Science', 'Bachelor of Arts Degree',
       'Post-Baccalaureate', 'Summer School', 'BA, Computer Science',
       'B.A', "Bachelor's", 'Post-Baccalaureate Student',
 

In [36]:
# these should be filtered out
filters = ['Post Bac', 'Post-Baccalaureate',
           'Cross-registered Student', 'Summer School',
          'Cross Registration', 'Exchange student','Edx Certificate',
           'Various Management Certifications',
           'Cross-Registering']

In [37]:
# creating a bool series from isin() 
is_filter = alums["degree"].isin(filters) 

In [39]:
# alums data set without people who are not actual graduates
alums_cleaned = alums[~is_filter]
#alums_cleaned.head()
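`isin()` only matches exact strings, so a label such as 'Post-Baccalaureate Student' is not caught by the entry 'Post-Baccalaureate'. One possible alternative is a case-insensitive pattern match. The sketch below uses made-up sample data, and the pattern is an assumption about which labels should count as non-degree entries, not the notebook's actual filter.

```python
import pandas as pd

# Made-up degree labels, similar to the values listed above.
degrees = pd.Series([
    "Bachelor of Arts - BA",
    "Post-Baccalaureate Student",
    "Cross-registered Student",
    "Summer School",
    None,
])

# Hypothetical pattern covering spelling variants of the non-degree labels.
non_degree_pattern = r"post[- ]?bac|cross[- ]?regist|summer school|exchange|certificat"

# na=False treats missing degrees as "not matched" so they are kept.
is_non_degree = degrees.str.contains(non_degree_pattern, case=False, na=False, regex=True)
kept = degrees[~is_non_degree]
print(kept)
```

A pattern match is broader than an exact list, so the matched labels should be reviewed before rows are dropped.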

## Part 4: Extracting grad school information
Some alums attend grad school after graduating from Wellesley. We want to identify these individuals and find out where they went to graduate school. The first step is to check whether someone has Wellesley listed as their most recent education, while keeping in mind that some may have listed their education entries in non-chronological order. 

In [40]:
# break down the data to those who listed wellesley as their first edu entry and those who did not 
edu1 = alums_cleaned['education'].apply(lambda x: x[0]['name'] != "Wellesley College")

In [41]:
edu1

0      False
1      False
2      False
3      False
4      False
       ...  
615     True
616     True
617     True
618     True
619     True
Name: education, Length: 610, dtype: bool
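Checking only the first education entry depends on the order in which people listed their schools. A sketch of an order-independent check is shown below: it compares the years in each entry's `date_range` and flags a person only if a non-Wellesley entry starts in or after the year their Wellesley entry ends. `years_in` and `studied_after_wellesley` are hypothetical helpers, and the comparison rule is an assumption rather than the approach used in this notebook.

```python
import re

def years_in(date_range):
    """Return all four-digit years found in a date_range string, in order."""
    return [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", date_range or "")]

def is_wellesley(entry):
    """True if the entry's school name mentions Wellesley."""
    return "wellesley" in (entry.get("name") or "").lower()

def studied_after_wellesley(education):
    """True if any non-Wellesley entry starts in or after the Wellesley end year."""
    end_years = [
        years_in(e.get("date_range"))[-1]
        for e in education
        if is_wellesley(e) and years_in(e.get("date_range"))
    ]
    if not end_years:
        return False  # no dated Wellesley entry to compare against
    wellesley_end = max(end_years)
    for entry in education:
        if is_wellesley(entry):
            continue
        years = years_in(entry.get("date_range"))
        if years and years[0] >= wellesley_end:
            return True
    return False

# Made-up example: graduate school listed after Wellesley instead of first.
out_of_order = [
    {"name": "Wellesley College", "date_range": "2010 – 2014"},
    {"name": "Yale School of Management", "date_range": "2016 – 2018"},
]
# Made-up example: high school listed first, no later education.
high_school_first = [
    {"name": "St. Andrew's School", "date_range": "2006 – 2010"},
    {"name": "Wellesley College", "date_range": "2010 – 2014"},
]
print(studied_after_wellesley(out_of_order))
print(studied_after_wellesley(high_school_first))
```

Under this rule a cross-registration that took place during the undergraduate years is not counted, because it starts before the Wellesley end year.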

In [42]:
# subdata set with only alums that attended Wellesley and no other institution
edu2 = alums_cleaned['education'].apply(lambda x: len(x) == 1)

In [43]:
a1 = alums_cleaned[edu2 == False] # gather all potential alums who might have attended graduate school

In [65]:
# create new dataframe that explores graduate education
alums_c2 = alums_cleaned.copy()

In [66]:
alums_c2['education'][0][0]['name']!= "Wellesley College"

False

In [67]:
# add column with indicator if someone might have attended graduate school or not
alums_c2['grad_edu'] = edu1

Since some individuals might have included their cross-registration status as an additional educational entry, we do not want to include these students as graduate students as they cross-registered during their undergrad. Thus we filter these students out of our dataset.

In [68]:
filter2 = ['Cross-Registered', 'Cross-Registered Undergraduate']

In [69]:
f2 = alums_c2['education'].apply(lambda x: x[0]['degree'] in filter2)

In [70]:
f2

0      False
1      False
2      False
3      False
4      False
       ...  
615    False
616    False
617    False
618    False
619    False
Name: education, Length: 610, dtype: bool

In [71]:
alums_c2['filter_grad'] = ~f2

In [72]:
# adds a true/false column that indicates whether someone attended graduate school
# (entries that are only cross-registrations are marked False)
alums_c2.loc[~alums_c2['filter_grad'], 'result'] = False
alums_c2.loc[alums_c2['grad_edu'] & alums_c2['filter_grad'], 'result'] = True
alums_c2['result'] = alums_c2['result'].fillna(False) # assign back instead of chained inplace fillna

In [73]:
alums_c2[alums_c2['result'] == True]['result'].count()

248
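The indicator is True only when a person has a non-Wellesley first entry and that entry is not a cross-registration, so it can also be written as a single boolean expression. A small sketch on made-up flags, using the same column names as above:

```python
import pandas as pd

# Made-up flags covering all four combinations.
flags = pd.DataFrame({
    "grad_edu":    [True, True, False, False],   # first entry is not Wellesley
    "filter_grad": [True, False, True, False],   # first entry is not a cross-registration
})

# True only where both conditions hold; every other row is False.
flags["grad_school"] = flags["grad_edu"] & flags["filter_grad"]
print(flags)
```

This produces a plain boolean column directly, with no missing values to fill afterwards.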

In [74]:
# rename our result column to grad_school
alums_c2.columns = ['grad_school' if x=='result' else x for x in alums_c2.columns]

In [75]:
def filterProfessors(row):
    """
    returns false if someone has professor as their title otherwise true
    """
    if 'Professor' in row['title']:
        return False
    return True

In [76]:
alums_c2[alums_c2["grad_edu"] == True]['education'].apply(lambda x: x[0])

9      {'name': 'MIT Sloan School of Management', 'de...
16     {'name': 'Massachusetts Institute of Technolog...
22     {'name': 'General Assembly', 'degree': 'User E...
25     {'name': 'Yale School of Management', 'degree'...
29     {'name': 'Babson College', 'degree': 'Cross-Re...
                             ...                        
615    {'name': 'Georgia State University', 'degree':...
616    {'name': 'University of Tulsa', 'degree': 'Doc...
617    {'name': 'Imperial College London', 'degree': ...
618    {'name': 'The University of Texas at Austin', ...
619    {'name': 'Stanford University', 'degree': 'Ph....
Name: education, Length: 250, dtype: object

In [77]:
# if mit is in education want to check if they attended grad school or just cross-registered etc.
tf = alums_cleaned['education'].apply(lambda x: x[0]['name'] == "Massachusetts Institute of Technology")

In [154]:
alums_cleaned[tf]['education'].apply(lambda x: x[0]['degree'] in filter2)

16      True
43     False
79      True
82     False
117    False
126    False
178    False
202    False
330    False
338    False
341    False
347    False
353    False
370    False
437    False
449    False
462    False
510    False
521    False
548    False
560    False
603    False
Name: education, dtype: bool

In [78]:
alums_c2[tf]['grad_edu'][16]

True

In [56]:
# drop unnecessary columns 
alums_c2 = alums_c2.drop(['grad_edu', 'filter_grad'], axis=1)

In [500]:
#save data
alums_c2.to_pickle("education")

## Part 5: Extract job data
To further examine the jobs that alumnae hold after graduation, we extract the job-related information so that it can be examined separately. 

In [388]:
def getJob(person):
    """
    extracts the current job of an individual 
    """
    if len(person) > 0:
        return person[0]
    else:
        return None

In [396]:
# save current job 
alums_c2['job_cleaned'] = alums_c2['jobs'].apply(lambda x:getJob(x))

In [415]:
def getJobInfo(info,text):
    """
    if a person has provided information for a particular job field return it, otherwise return None
    """
    if info is not None:
        return info[text]
    return None
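Indexing with `info[text]` raises a `KeyError` if a job dictionary lacks the requested field. A sketch of a more tolerant lookup is given below, assuming that some job entries may be missing keys (the data shown does not confirm this); `get_job_field` is a hypothetical helper.

```python
def get_job_field(job, key, default=None):
    """Return job[key] if job is a dictionary containing key, otherwise default."""
    if not isinstance(job, dict):
        return default
    return job.get(key, default)

# Made-up job entries: a full one, one without a company, and a missing job.
jobs_sample = [
    {"title": "Software Engineer", "company": "Gusto"},
    {"title": "Research Assistant"},
    None,
]
companies = [get_job_field(job, "company") for job in jobs_sample]
print(companies)
```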

In [489]:
def jobDf(data):
    """
    creates a dataframe that contains relevant job and employment information of alums
    """
    new_df = pd.DataFrame(data['headline'])
    new_df['name'] = data['name']
    new_df['title'] = data['job_cleaned'].apply(lambda x:  getJobInfo(x, 'title'))
    new_df['company'] = data['job_cleaned'].apply(lambda x:  getJobInfo(x, 'company'))
    new_df['company'] = cleanJobs(new_df['company'])
    new_df['location'] = data['job_cleaned'].apply(lambda x:  getJobInfo(x, 'location'))
    new_df['years'] = data['job_cleaned'].apply(lambda x:  getJobInfo(x, 'date_range'))
    new_df['description'] = data['job_cleaned'].apply(lambda x:  getJobInfo(x, 'description'))
    new_df['url'] = data['job_cleaned'].apply(lambda x:  getJobInfo(x, 'li_company_url'))
    new_df = new_df.reset_index(drop=True)
    return new_df

In [490]:
# create a dataframe only with job information 
company_df = jobDf(alums_c2)

In [492]:
company_df.to_csv('jobs.csv') # save job information in additional csv file

In [1]:
#company_df[company_df['company'] == 'Wellesley College']['headline'].apply(lambda x: x.split('at'))

In [427]:
# look at the unique companies and save them 
jobs = company_df['company'].unique()

In [433]:
jobs[:15] # some jobs currently held 

array(['MIT Media Lab', 'Gusto', 'CarGurus\n        Full-time',
       'Massachusetts Institute of Technology', 'Wellesley College',
       'Peloton Interactive\n        Full-time',
       'Facebook\n        Full-time', 'Intuit\n        Full-time',
       'Cornerstone Research\n        Full-time',
       'TrueNorth\n        Internship', 'InterSystems\n        Full-time',
       'Gaiascope, Inc.\n        Internship', 'Google\n        Full-time',
       'Indeed.com', 'University of Houston CS REU Program'], dtype=object)

In [458]:
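def cleanJobs(jobs):
    """
    keeps only the company name (the text before the first line break) for each job,
    and returns an empty string where no company is given
    """
    job_list = []
    for job in jobs:
        if job is not None:
            split_string = job.split("\n", 1)
            substring = split_string[0]
            job_list.append(substring)
        else:
            job_list.append('')
    return job_list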
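The same cleanup can be done with pandas string methods instead of a loop. This is a sketch of that alternative on made-up values in the same format as the company names shown above.

```python
import pandas as pd

# Made-up company values, including the trailing employment type and a missing entry.
companies = pd.Series([
    "CarGurus\n        Full-time",
    "Gusto",
    None,
])

# Split once at the first line break, keep the part before it,
# and replace missing values with an empty string.
cleaned = companies.str.split("\n", n=1).str[0].fillna("")
print(cleaned.tolist())
```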

In [446]:
# job_list is local to cleanJobs, so rebuild the cleaned company list here
jobs_df = pd.DataFrame(cleanJobs(company_df['company']))

In [450]:
# gather the unique companies
unique_comps=jobs_df[0].value_counts()