DSCI 2012 - Data Wrangling

# Loading Your Own Data

This is a brief "how to get started" notebook for reading in the data file that you created as part of this lab. It uses a fake 1,000-row dataset containing resume-like data.

A few operations toward the end might be interesting to you: filtering on missing values, combining conditions, and averaging by group.

In [11]:
# Import the pandas library to work with data frames.
# An import produces no output unless there is an error,
# so I added a print statement to confirm that it ran.
import pandas as pd
print("Libraries Loaded!")

Libraries Loaded!


In [7]:
# Let's load the data file from your hard drive.
# This is the first time you've loaded local data. The bare file name
# below only works if the file is in the same directory as your notebook.
# Note: the "r" before the file name makes it a raw string, so backslashes
# in Windows-style paths are not treated as escape characters.
resumeDataFrame = pd.read_csv(r'resumes.csv')
        
# let's look at the beginning
resumeDataFrame.head()


Unnamed: 0,id,first_name,last_name,email,gender,python_yn,interview_score,pandas_yn,state,highest_ed
0,1,Fidelio,Phittiplace,fphittiplace0@jalbum.net,Male,False,,False,PA,12
1,2,Percy,Gethins,pgethins1@wunderground.com,Male,False,2.0,True,CA,16
2,3,Wilden,Lacroutz,wlacroutz2@simplemachines.org,Genderfluid,True,7.0,True,NM,8
3,4,Woody,Davidesco,wdavidesco3@bloglovin.com,Male,True,5.0,True,AR,17
4,5,Daffy,Blinckhorne,dblinckhorne4@stanford.edu,Female,False,,True,OK,13
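The `Unnamed: 0` column in the output above usually means the CSV was saved with its row index included. Here is a minimal sketch of one way to handle that, assuming that is what happened with `resumes.csv`. The three-row `csv_text` is a hypothetical stand-in for the real file, not the lab data.

```python
import io
import pandas as pd

# Hypothetical stand-in for resumes.csv: note the empty first header field,
# which is what to_csv() writes when the index is saved along with the data.
csv_text = """,id,first_name,state,highest_ed
0,1,Fidelio,PA,12
1,2,Percy,CA,16
2,3,Wilden,NM,8
"""

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# index_col=0 uses that first column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_extra.columns))
```

With the real file, the equivalent call would be `pd.read_csv(r'resumes.csv', index_col=0)`, if the extra column really is a saved index.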


In [9]:
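# To look at the end, uncomment the next line.
# (A cell only displays the value of its last expression.)
#resumeDataFrame.tail()

# Check the index to confirm how many rows were loaded
resumeDataFrame.index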

RangeIndex(start=0, stop=1000, step=1)
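The `RangeIndex` above runs from 0 up to (but not including) 1000, so the frame has 1,000 rows. A few other ways to check the size, sketched on a small made-up frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Small made-up frame standing in for the resume data
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "state": ["PA", "CA", "NM", "AR"],
})

# len() gives the number of rows
row_count = len(toy)

# .shape gives (rows, columns) as a tuple
n_rows, n_cols = toy.shape

# .tail(n) returns the last n rows (5 if n is omitted)
last_two = toy.tail(2)

print(row_count, n_rows, n_cols)
print(last_two)
```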

In [4]:
# Let's find all the people who haven't been interviewed yet
# (the rows where interview_score is missing)
resumeDataFrame[resumeDataFrame['interview_score'].isna()]


Unnamed: 0,id,first_name,last_name,email,gender,python_yn,interview_score,pandas_yn,state,highest_ed
0,1,Fidelio,Phittiplace,fphittiplace0@jalbum.net,Male,False,,False,PA,12
4,5,Daffy,Blinckhorne,dblinckhorne4@stanford.edu,Female,False,,True,OK,13
15,16,Tally,Baldam,tbaldamf@gravatar.com,Female,True,,True,OK,3
17,18,Lynnea,Monier,lmonierh@plala.or.jp,Female,False,,True,OH,18
19,20,Dru,Duham,dduhamj@tuttocitta.it,Male,False,,False,VA,14
...,...,...,...,...,...,...,...,...,...,...
991,992,Cosette,Stedman,cstedmanrj@squarespace.com,Female,True,,True,TX,10
994,995,Byran,Low,blowrm@vkontakte.ru,Male,True,,True,IA,19
996,997,Lina,McTerlagh,lmcterlaghro@marriott.com,Genderqueer,True,,False,TX,3
997,998,Sybilla,Tandy,standyrp@bluehost.com,Female,True,,False,CA,18
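Sometimes you only want to know how many rows are missing a value, or you want the opposite set. A minimal sketch on a made-up frame (the names and scores here are illustrative, not from the lab file):

```python
import pandas as pd

# Made-up frame with two missing interview scores
toy = pd.DataFrame({
    "first_name": ["Ana", "Ben", "Cy", "Dee"],
    "interview_score": [float("nan"), 2.0, 7.0, float("nan")],
})

# isna() returns True/False per row; summing counts the True values
not_interviewed = toy["interview_score"].isna()
num_missing = int(not_interviewed.sum())

# notna() is the opposite test: rows that do have a score
interviewed = toy[toy["interview_score"].notna()]

print(num_missing)
print(interviewed)
```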


In [5]:
# Let's find the people who have more than a college diploma
# (highest_ed greater than 16)

highlyEducatedFolks = resumeDataFrame.loc[resumeDataFrame['highest_ed'] > 16]
highlyEducatedFolks

Unnamed: 0,id,first_name,last_name,email,gender,python_yn,interview_score,pandas_yn,state,highest_ed
3,4,Woody,Davidesco,wdavidesco3@bloglovin.com,Male,True,5.0,True,AR,17
6,7,Jereme,Borrott,jborrott6@issuu.com,Male,True,2.0,True,CA,20
7,8,Terry,Bettesworth,tbettesworth7@i2i.jp,Male,True,3.0,True,TX,17
10,11,Care,Torbard,ctorbarda@soundcloud.com,Male,True,6.0,True,OH,20
17,18,Lynnea,Monier,lmonierh@plala.or.jp,Female,False,,True,OH,18
...,...,...,...,...,...,...,...,...,...,...
986,987,Lissy,Degoe,ldegoere@wikipedia.org,Female,False,7.0,False,CO,20
989,990,Salvatore,Ganny,sgannyrh@posterous.com,Male,False,,True,MD,18
994,995,Byran,Low,blowrm@vkontakte.ru,Male,True,,True,IA,19
997,998,Sybilla,Tandy,standyrp@bluehost.com,Female,True,,False,CA,18
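The filter above works because the comparison produces a True/False value for every row, and `.loc` keeps the rows marked True. A small sketch that makes the intermediate step visible, using a made-up frame (names and values are illustrative):

```python
import pandas as pd

# Made-up frame standing in for the resume data
toy = pd.DataFrame({
    "first_name": ["Ana", "Ben", "Cy", "Dee"],
    "highest_ed": [12, 18, 16, 20],
})

# The comparison yields one boolean per row.
# Note that exactly 16 is excluded by a strict "greater than".
is_highly_educated = toy["highest_ed"] > 16
print(is_highly_educated.tolist())

# .loc also accepts a list of columns after the row mask
names_only = toy.loc[is_highly_educated, ["first_name", "highest_ed"]]
print(names_only)
```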


In [6]:
# How about those who are highly educated and have not been interviewed yet?

interviewableEducatedFolks = resumeDataFrame.loc[(resumeDataFrame['highest_ed'] > 16) & (resumeDataFrame['interview_score'].isna())]
interviewableEducatedFolks


Unnamed: 0,id,first_name,last_name,email,gender,python_yn,interview_score,pandas_yn,state,highest_ed
17,18,Lynnea,Monier,lmonierh@plala.or.jp,Female,False,,True,OH,18
29,30,Maribel,Shoebottom,mshoebottomt@yellowpages.com,Female,True,,True,TX,17
31,32,Randy,Gokes,rgokesv@twitter.com,Male,True,,False,WI,19
48,49,Bran,Rhydderch,brhydderch1c@accuweather.com,Male,False,,False,OK,18
109,110,Tierney,McAndrew,tmcandrew31@dmoz.org,Female,True,,True,MN,20
125,126,Geoffry,Fontelles,gfontelles3h@xrea.com,Genderfluid,False,,True,CA,17
143,144,Arvy,Guye,aguye3z@dedecms.com,Male,False,,False,DC,17
164,165,Bartlet,Stud,bstud4k@lulu.com,Male,True,,True,NC,19
168,169,Roz,Cluely,rcluely4o@shop-pro.jp,Female,True,,False,IN,20
197,198,Rowney,Kinvig,rkinvig5h@163.com,Male,False,,False,OK,17
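When combining conditions, pandas uses `&` (and), `|` (or), and `~` (not) rather than the Python keywords, and each comparison needs its own parentheses because `&` binds more tightly than `>`. A minimal sketch on a made-up frame (values are illustrative only):

```python
import pandas as pd

# Made-up frame standing in for the resume data
toy = pd.DataFrame({
    "first_name": ["Ana", "Ben", "Cy", "Dee"],
    "highest_ed": [12, 18, 17, 20],
    "interview_score": [float("nan"), float("nan"), 5.0, float("nan")],
})

# Naming each mask keeps the combined filter readable
educated = toy["highest_ed"] > 16
not_interviewed = toy["interview_score"].isna()

# & : both conditions must hold
both = toy[educated & not_interviewed]

# | : at least one condition holds
either = toy[educated | not_interviewed]

# ~ : flips a mask, so this is "educated and already interviewed"
educated_and_interviewed = toy[educated & ~not_interviewed]

print(both["first_name"].tolist())
print(either["first_name"].tolist())
print(educated_and_interviewed["first_name"].tolist())
```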


In [14]:
# let's find the average level of education by state
grouped = resumeDataFrame.groupby('state')
averageEducation = grouped['highest_ed'].mean()
averageEducation

state
AK    12.875000
AL    10.875000
AR    13.142857
AZ    12.863636
CA    10.210526
CO    11.238095
CT     7.222222
DC    11.523810
DE    10.000000
FL     9.592105
GA     9.037037
HI     9.200000
IA     9.000000
ID     6.333333
IL    10.692308
IN    10.789474
KS    12.142857
KY     8.538462
LA    11.736842
MA    10.750000
MD    11.076923
MI     9.136364
MN     9.882353
MO     8.909091
MS    12.666667
MT    11.000000
NC    11.280000
ND    17.000000
NE     8.000000
NH    15.000000
NJ    11.625000
NM    13.800000
NV     8.812500
NY    11.489796
OH    11.368421
OK    12.263158
OR    12.285714
PA    12.428571
SC    13.428571
SD     8.333333
TN    10.473684
TX    11.439024
UT     8.615385
VA    11.682927
WA     9.800000
WI    10.100000
WV     7.666667
WY    10.250000
Name: highest_ed, dtype: float64
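An average can be misleading when a group has very few rows, so it can help to show the row count next to each mean and to sort the result. A sketch of one way to do that, on a made-up frame (the states and values here are illustrative, not the lab data):

```python
import pandas as pd

# Made-up frame: ND has a single row, so its "average" is just that one value
toy = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX", "TX", "ND"],
    "highest_ed": [10, 12, 9, 12, 15, 17],
})

# agg() with a list computes several statistics per group at once;
# sort_values() then orders the states from highest to lowest mean
summary = (
    toy.groupby("state")["highest_ed"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

The same pattern should work on `resumeDataFrame` by swapping it in for `toy`.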