The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary. The second is your actual data. It's important to note that the data column has a label of its own, which can be retrieved using the `.name` attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.
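
As a rough illustration of the two-column picture, the sketch below builds a small Series and pulls out its index, its values, and its name. The series name `'Class'` and the sample labels are made up for the example.

```python
import pandas as pd

# Hypothetical data: student names act as the index, classes are the values.
classes = pd.Series(['Physics', 'Chemistry', 'English'],
                    index=['Alice', 'Jack', 'Molly'],
                    name='Class')

print(classes.index.tolist())   # the labels, much like dictionary keys
print(classes.tolist())         # the data itself, kept in order
print(classes.name)             # the label of the data column
```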

In [1]:
import pandas as pd

In [2]:
students = ['Alice','Jack','Molly']

pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [3]:
numbers = [1,4,5]

In [4]:
pd.Series(numbers)

0    1
1    4
2    5
dtype: int64

In [5]:
# None can be used to mark missing data. In a list of strings, pandas keeps the None as-is
# and uses the dtype object.

students = ['Alice','Jack', None]

pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [7]:
# However, if we create a list of numbers, integers or floats, and put in the None type, 
# pandas automatically converts this to a special floating point value designated as NaN,
# which stands for Not a Number.

numbers = [1,2,None]

pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

In [9]:
# pandas represents NaN as a floating point number, and because integers can be typecast to
# floats, pandas went and converted our integers to floats.
# It is important to stress that None and NaN might be used by the data scientist in the
# same way, to denote missing data, but that underneath these are not represented by pandas in
# the same way.

# NaN is *NOT* equivalent to None, and when we try the equality test, the result is False.

import numpy as np
np.nan == None

False

In [10]:
np.nan == np.nan

False

In [11]:
# NaN is not even equal to itself, as the test above shows. Instead, you need to use special
# functions to test for the presence of not a number, such as the NumPy library's isnan().

np.isnan(np.nan)

True

In [12]:
# So keep in mind when you see NaN, its meaning is similar to None, but it's a
# numeric value and treated differently for efficiency reasons.
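
Because `==` cannot find NaN, a common alternative is `pd.isna()` (or the `.isna()` method), which flags both None and NaN. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

names = pd.Series(['Alice', 'Jack', None])   # strings with one missing entry
scores = pd.Series([1, 2, None])             # None becomes NaN, dtype float64

# isna() reports the missing entries in both cases
print(names.isna().tolist())
print(scores.isna().tolist())

# pd.isna also works on single values
print(pd.isna(None), pd.isna(np.nan))
```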

In [13]:
# Series can be created from dictionary data directly. If you do this, the index is automatically
# assigned to the keys of the dictionary that you provided and not just incrementing integers.

student_scores = {'Alice':'Physics',
                 'Jack':'Chemistry',
                 'Molly':'English'}
s = pd.Series(student_scores)

In [14]:
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [15]:
# pandas set the data type of the series to "object"
# The index, the first column, is also a list of strings.

In [16]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [17]:
# The dtype of object is not just for strings, but for arbitrary objects.
students = [('Alice','Brown'),('Jack','White'),('Molly','Green')]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [18]:
# Separate index creation from the data by passing in the index as a list explicitly to the series

s = pd.Series(['Physics','Chemistry','English'], index = ['Alice','Jack','Molly'])

In [19]:
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [21]:
# When you pass both a dictionary and an index, pandas overrides the automatic index creation
# and uses only, and all of, the index values that you provided. It ignores all dictionary keys
# which are not in your index, and adds None or NaN values for any index value you provide
# which is not in your dictionary's keys.

students_scores = {'Alice':'Physics',
                  'Jack':'Chemistry',
                  'Molly':'English'}

s = pd.Series(students_scores, index = ['Alice','Molly','Sam'])

s

Alice    Physics
Molly    English
Sam          NaN
dtype: object

Querying Series

In [22]:
# To query by numeric location, starting at zero, use the iloc attribute. 
# To query by the index label, use the loc attribute.

students_classes = {'Alice':'Physics',
                   'Jack':'Chemistry',
                   'Molly':'English',
                   'Sam':'History'}

s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [23]:
s.iloc[3]

'History'

In [24]:
s.loc['Molly']

'English'

In [25]:
# Keep in mind that iloc and loc are not methods, they are attributes. So don't use parentheses to
# query them, but square brackets instead, which is called the indexing operator.
# In Python, this calls __getitem__ or __setitem__ for an item depending on the context of its use.

# If you pass in an integer parameter, the operator will behave as if you want it to query via
# the iloc attribute. (Recent pandas releases deprecate this positional fallback when the labels
# are not integers, so s.iloc[3] is the safer spelling.)

s[3]

'History'

In [26]:
# If you pass in an object, it will query as if you wanted to use the loc attribute.

s['Molly']

'English'

In [27]:
class_code = {99:'Physics',
             100:'Chemistry',
             101:'English',
             102:'History'}

s = pd.Series(class_code)

# If we try and call s[0] we get a *key error* because there's no item in the class_code series with
# an index of zero. Instead, we have to call iloc explicitly if we want the first item.

s.iloc[0]

'Physics'
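
The difference between label and position lookups on integer labels can be checked directly. This sketch reuses the class codes from the cell above and catches the KeyError instead of letting it stop the notebook; the variable names are just for illustration.

```python
import pandas as pd

# Integer labels that do not start at zero.
codes = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'})

first_by_position = codes.iloc[0]   # position 0
first_by_label = codes.loc[99]      # label 99

try:
    codes[0]                        # treated as a label lookup, and 0 is not a label
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, first_by_label, lookup_failed)
```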

In [28]:
grades = pd.Series([90, 80, 87, 69])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

81.5


In [29]:
# Vectorization works with most of the functions in the numpy library, including the sum function

total = np.sum(grades)

print(total/len(grades))

81.5


In [30]:
# First, create a big series of random numbers. This is used a lot when demonstrating techniques 
# with pandas.

numbers = pd.Series(np.random.randint(0, 1000, 10000))

# Look at the top five items in that series to make sure they actually seem random.
numbers.head()

0    136
1    585
2      6
3    301
4    799
dtype: int32

In [31]:
len(numbers)

10000

In [36]:
# The IPython interpreter has something called magic functions, which begin with a percentage sign. If you
# type this sign and then hit the Tab key, you can see a list of the available magic functions.

# The function to use is called timeit. It runs the code a few times to determine, on average,
# how long it takes.

In [37]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total/len(numbers)

1.38 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [38]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

70.3 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
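
The same average can also be computed with the Series' own `mean()` method, which is vectorized as well. The sketch below only checks that the loop and the vectorized versions agree. It does not time them, and the seed is an arbitrary choice to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                     # arbitrary seed, for repeatability
numbers = pd.Series(rng.integers(0, 1000, 10000))

# Plain Python loop
total = 0
for number in numbers:
    total += number
loop_mean = total / len(numbers)

# Vectorized alternatives
numpy_mean = np.sum(numbers) / len(numbers)
series_mean = numbers.mean()

print(loop_mean, numpy_mean, series_mean)
```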


In [39]:
# With broadcasting, you can apply an operation to every value in the series, changing the series.

numbers.head()

0    136
1    585
2      6
3    301
4    799
dtype: int32

In [44]:
# Increase everything in the series by 2

numbers += 2
numbers.head()

0    138
1    587
2      8
3    303
4    801
dtype: int32

In [48]:
# pandas does support iterating through a series much like a dictionary, allowing you to unpack
# values easily.

# We can use the items() function, which returns a label and value.
# (iteritems() and set_value() have been removed from recent pandas releases.)
for label, value in numbers.items():
    # for each item which is returned, write the updated value back with .loc
    numbers.loc[label] = value + 2

numbers.head()

0    140
1    589
2     10
3    305
4    803
dtype: int32

In [52]:
%%timeit -n 10

s = pd.Series(np.random.randint(0, 1000, 10000))

for label, value in s.items():
    s.loc[label] = value+2

426 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [53]:
%%timeit -n 10

s = pd.Series(np.random.randint(0, 1000, 10000))

s += 2

345 µs ± 56.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
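
Broadcasting is not limited to `+=`. As a small sketch with made-up values, any arithmetic operator applied to a Series and a single number is applied to every element, and the plain operators return a new Series rather than changing the original.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

bigger = s + 2      # returns a new Series; s is unchanged here
doubled = s * 2     # other arithmetic operators broadcast the same way
s += 2              # updates s itself

print(bigger.tolist(), doubled.tolist(), s.tolist())
```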


In [54]:
# The .loc attribute lets you not only modify data in place, but also add new data as well.
# Keep in mind, indices can have mixed types.

s = pd.Series([1, 2, 3])

s.loc['History'] = 102

In [55]:
s

0            1
1            2
2            3
History    102
dtype: int64

In [56]:
students_classes = pd.Series({'Alice':'Physics',
                             'Jack':'Chemistry',
                             'Molly':'English',
                             'Sam':'History'})

students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [58]:
# Create a series just for some new student Kelly, which lists all of the courses she has 
# taken.

kelly_classes = pd.Series(['Philosophy','Arts','Math'], index = ['Kelly','Kelly','Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [59]:
# Series.append() was removed in pandas 2.0, so we combine the two Series with pd.concat(), which
# appends all of the data in the new Series to the first.

all_students_classes = pd.concat([students_classes, kelly_classes])
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [60]:
# pandas will take the series and try to infer the best data types to use.
# Appending doesn't actually change the underlying Series objects; it returns a new
# series which is made up of the two appended together. This is a common pattern in pandas,
# returning a new object by default instead of modifying in place, and one you should come to
# expect. By printing the original series we can see that it hasn't changed.

students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [62]:
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

The DataFrame data structure is the heart of the pandas library. It's the primary object that you'll be working 
with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional series object, where there's an index and multiple columns of 
content, with each column having a label. In fact, the distinction between a column and a row is really only a 
conceptual distinction, and you can think of the DataFrame itself as simply a two-axis labeled array.
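
One way to see the "multiple labeled columns sharing one index" idea is to build a DataFrame from a dictionary of columns and pull a column back out. This is only a sketch, and the names and scores are made up.

```python
import pandas as pd

# Hypothetical data: each key becomes a column label, each list a column.
df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Molly'],
                   'Score': [84, 70, 93]},
                  index=['school1', 'school2', 'school3'])

print(df.shape)                 # (rows, columns)
print(type(df['Score']))        # a single column is a Series
print(df['Score'].name)         # whose name is the column label
```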

In [63]:
record1 = pd.Series({'Name':'Alice',
                    'Class':'Physics',
                    'Score': 84})

record2 = pd.Series({'Name':'Jack',
                    'Class':'Chemistry',
                    'Score': 70})

record3 = pd.Series({'Name':'Molly',
                    'Class':'English',
                    'Score':93})

In [64]:
df = pd.DataFrame([record1, record2, record3],
                 index = ['school1','school2','school3'])

df.head()

          Name      Class  Score
school1  Alice    Physics     84
school2   Jack  Chemistry     70
school3  Molly    English     93


In [65]:
# An alternative method is to use a list of dictionaries, where each dictionary
# represents a row of data.

students = [{'Name':'Alice',
            'Class':'Physics',
            'Score':84},
           {'Name':'Jack',
           'Class':'Chemistry',
           'Score': 70},
           {'Name':'Molly',
           'Class':'English',
           'Score':93}]
students


[{'Name': 'Alice', 'Class': 'Physics', 'Score': 84},
 {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 70},
 {'Name': 'Molly', 'Class': 'English', 'Score': 93}]

In [70]:
df = pd.DataFrame(students, index = ['School1','School2','School1'])

In [71]:
df

          Name      Class  Score
School1  Alice    Physics     84
School2   Jack  Chemistry     70
School1  Molly    English     93


In [72]:
df.loc['School2']

Name          Jack
Class    Chemistry
Score           70
Name: School2, dtype: object

In [73]:
type(df.loc['School2'])

pandas.core.series.Series

In [74]:
df.loc['School1']

          Name    Class  Score
School1  Alice  Physics     84
School1  Molly  English     93


In [76]:
type(df.loc['School1'])

pandas.core.frame.DataFrame

In [77]:
df.loc['School1','Name']

School1    Alice
School1    Molly
Name: Name, dtype: object

In [78]:
df.T
# transpose the matrix. This pivots all of the rows into columns and all of the columns 
# into rows, and is done with the T attribute.

       School1    School2  School1
Name     Alice       Jack    Molly
Class  Physics  Chemistry  English
Score       84         70       93


In [89]:
# Call .loc on the transpose to get the student names only
df.T.loc['Name']

School1    Alice
School2     Jack
School1    Molly
Name: Name, dtype: object

In [80]:
df['Name']

School1    Alice
School2     Jack
School1    Molly
Name: Name, dtype: object

In [81]:
df.loc['Name']

KeyError: 'Name'

In [82]:
type(df['Name'])

pandas.core.series.Series

In [83]:
# Chain operations together.

df.loc['School1']['Name']

School1    Alice
School1    Molly
Name: Name, dtype: object

In [84]:
print(type(df.loc['School1']))
print(type(df.loc['School1']['Name']))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [85]:
# Chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the DataFrame.
# If you are only reading data this is not a big deal, but if you are changing data it is an
# important distinction and can be a source of error.

# Here's another approach.
df.loc[:,['Name','Score']]

          Name  Score
School1  Alice     84
School2   Jack     70
School1  Molly     93


In [86]:
# The colon means that we want to get all of the rows.
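
To make the copy-versus-view point concrete, here is a small sketch of updating a value with a single `.loc` call rather than a chain. The frame mirrors the student data above, and the new score of 75 is made up.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Molly'],
                   'Score': [84, 70, 93]},
                  index=['School1', 'School2', 'School1'])

# One .loc call with both row and column selectors writes to df itself.
df.loc['School2', 'Score'] = 75

# A chained form such as df.loc['School2']['Score'] = 75 may write to a
# temporary copy instead, so it is best avoided when assigning.
print(df.loc['School2', 'Score'])
```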

In [93]:
# The drop function doesn't change the DataFrame by default

df.drop('School1')

         Name      Class  Score
School2  Jack  Chemistry     70


In [94]:
# Drop has two interesting optional parameters. The first is called inplace, and if it's set to
# True, the DataFrame will be updated in place, instead of a copy being returned. The second is
# axis, which defaults to 0 (rows); set it to 1 to drop a column instead.

copy_df = df.copy()

In [95]:
copy_df.drop('Name', inplace = True, axis = 1)

In [96]:
copy_df

             Class  Score
School1    Physics     84
School2  Chemistry     70
School1    English     93


In [98]:
del copy_df['Class']

In [99]:
copy_df

         Score
School1     84
School2     70
School1     93
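
If you would rather not modify a DataFrame in place, `drop()` can also remove columns and hand back a new object. A minimal sketch on made-up data, using the `columns=` keyword as an alternative to `axis = 1`:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack'],
                   'Class': ['Physics', 'Chemistry'],
                   'Score': [84, 70]})

# drop returns a new DataFrame; columns= is shorthand for axis=1
without_name = df.drop(columns=['Name'])

print(list(without_name.columns))
print(list(df.columns))           # the original is untouched
```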


In [100]:
df['ClassRanking'] = None

In [101]:
df

          Name      Class  Score  ClassRanking
School1  Alice    Physics     84          None
School2   Jack  Chemistry     70          None
School1  Molly    English     93          None


How you can load data from a comma separated file into a DataFrame

In [102]:
# Jupyter notebooks use IPython as the kernel underneath, which provides convenient ways to
# integrate lower level shell commands, which are programs run in the underlying operating system.
# "cat" is one shell command, short for "concatenate", which just outputs the contents of a file. In
# IPython, if we prepend the line with an exclamation mark, it will execute the remainder of the line
# as a shell command. Note that cat is a Unix command, so on Windows the call fails, as it does below.
!cat datasets/Admission_Predict.csv

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [103]:
import pandas as pd

# Pandas makes it easy to turn a CSV into a dataframe, we just call read_csv()
df = pd.read_csv('datasets/Admission_Predict.csv')

In [104]:
df.head()

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [105]:
df = pd.read_csv('datasets/Admission_Predict.csv', index_col = 0)

In [107]:
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [109]:
new_df = df.rename(columns = {'GRE Score':'GRE Score','TOEFL Score':'TOEFL','University Rating':'University Rating',
                             'SOP':'Statement of Purpose','LOR':'Letter of Recommendation',
                             'CGPA':'CGPA','Research':'Reserach','Chance of Admit':'Chance of Admit'})

new_df.head()

Serial No.,GRE Score,TOEFL,University Rating,Statement of Purpose,LOR,CGPA,Reserach,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [110]:
new_df.columns

Index(['GRE Score', 'TOEFL', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Reserach', 'Chance of Admit '],
      dtype='object')

In [112]:
# There is a space right after 'LOR' and a space after 'Chance of Admit' in the column names. 'LOR '
# wasn't renamed because the key we used was just the three characters 'LOR', instead of 'LOR '.

# We can change a column by including the space in the name
new_df = new_df.rename(columns = {'LOR ':'Letter of Recommendation'})
new_df.head()

Serial No.,GRE Score,TOEFL,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Reserach,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [114]:
# Another way is to create some function that does the cleaning and then tell
# rename to apply that function across all of the data. Python comes with a
# handy string function to strip white space called "strip()".

new_df = new_df.rename(mapper = str.strip, axis = 'columns')

new_df.head()

Serial No.,GRE Score,TOEFL,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Reserach,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [115]:
df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit '],
      dtype='object')

In [116]:
# Use the df.columns attribute by assigning to it a list of column names which
# will directly rename the columns. This will directly modify the original 
# dataframe and is very efficient especially when you have a lot of columns
# and you only want to change a few.
cols = list(df.columns)
# Then a little list comprehension
cols = [x.lower().strip() for x in cols]
# Then we just overwrite what is already in the .columns attribute
df.columns = cols
# Take a look at results
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
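
The list comprehension above can also be written with the `.str` accessor on the column index, which applies a string method to every label at once. The sketch below uses a small made-up frame with the same kind of untidy headers rather than the real CSV.

```python
import pandas as pd

# Hypothetical frame with trailing spaces and mixed case in the headers.
df = pd.DataFrame([[337, 4.5, 0.92]],
                  columns=['GRE Score', 'LOR ', 'Chance of Admit '])

# The .str accessor applies string methods to every column label at once.
df.columns = df.columns.str.strip().str.lower()

print(list(df.columns))
```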


Boolean masking is the heart of fast and efficient querying in numpy
and pandas, and it's analogous to bit masking used in other areas of computational science.

A Boolean mask is an array which can be of one dimension like a series, or two dimensions like a 
data frame, where each of the values in the array is either True or False.

In [118]:
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [119]:
# Boolean masks are created by applying operators directly to the pandas Series or DataFrame
# objects. 

admit_mask = df['chance of admit'] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

In [120]:
# The result is a series, since only one column is being operated on, filled with either
# True or False values, which is what the comparison operator returns.

In [121]:
df.where(admit_mask).head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


In [122]:
# The resulting data frame keeps the original indexed values, and only data which met
# the condition was retained.

df.where(admit_mask).dropna().head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


In [123]:
# The returned dataframe now has all of the NaN rows dropped. Notice the index now 
# includes one through four and six, but not five.

df[df['chance of admit'] > 0.7].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
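
The outputs above differ in one detail worth noticing: `where()` shows 337.0 while the indexing operator shows 337. A small sketch on made-up rows suggests why. `where()` keeps every row and fills the failing ones with NaN, which forces integer columns to float, while indexing with the mask simply removes the rows.

```python
import pandas as pd

# Hypothetical scores, indexed from 1 like the admissions data.
df = pd.DataFrame({'gre score': [337, 314, 330],
                   'chance of admit': [0.92, 0.65, 0.90]},
                  index=[1, 2, 3])
mask = df['chance of admit'] > 0.7

kept_shape = df.where(mask)     # same shape, failing rows become NaN
filtered = df[mask]             # failing rows are removed

print(kept_shape['gre score'].dtype, filtered['gre score'].dtype)
print(kept_shape.dropna().index.tolist(), filtered.index.tolist())
```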


In [124]:
# It is much harder to read, but it's also much more common when you're reading
# other people's code, so it's important to be able to understand it. Reviewing
# this indexing operator on dataframe, it now does two things:

# It can be called with a string parameter to project a single column
df['gre score'].head()

Serial No.
1    337
2    324
3    316
4    322
5    314
Name: gre score, dtype: int64

In [125]:
# Send it a list of columns as strings
df[['gre score','toefl score']].head()

Serial No.,gre score,toefl score
1,337,118
2,324,107
3,316,104
4,322,110
5,314,103


In [126]:
# Send it a boolean mask
df[df['gre score'] >320].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
7,321,109,3,3.0,4.0,8.2,1,0.75


In [127]:
# Combine multiple boolean masks, such as multiple criteria for including.
# In bitmasking in other places in CS this is done with "and", if both masks
# must be True for a True value to be in the final mask, or "or" if only one
# needs to be True.

# In pandas, to "and" two boolean series together, use the & operator
# (and | for "or").
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [128]:
# Order of operations! A common error for new pandas users is to try and do boolean comparisons 
# using the & operator but not putting parentheses around the individual terms you are 
# interested in
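
The reason for the parentheses is operator precedence: `&` binds more tightly than `>` or `<`, so without them Python would try to evaluate `0.7 & chance` first. A short sketch with made-up values, showing the parenthesized form with `&`, `|` and `~`:

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.65, 0.80])

# Parenthesize each comparison, then combine with & (and), | (or), ~ (not).
between = (chance > 0.7) & (chance < 0.9)
outside = (chance <= 0.7) | (chance >= 0.9)

print(between.tolist())
print(outside.tolist())
print((~between).equals(outside))
```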

In [129]:
# Another way to do this is to just get rid of the comparison operator completely, and 
# instead use the built in functions which mimic this approach
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)
# gt: greater than; lt: less than

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [130]:
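# Careful: chaining the two calls does NOT give the same answer. Here .lt(0.9) is applied to
# the boolean result of .gt(0.7), and since True counts as 1 and False as 0, the result is True
# exactly where the first comparison was False.
df['chance of admit'].gt(0.7).lt(0.9)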

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool

In [131]:
# You need to be able to read and write all of these, and understand the implications of the route
# you are choosing.

Indices can either be autogenerated or set explicitly. Another way to set an index is to use the `set_index()` function. This function takes a list of columns and promotes those columns to an index.

In [133]:
# The set_index() function is a destructive process, and it doesn't keep the current index.
# If you want to keep the current index, you need to manually create a new column and copy
# into it values from the index attribute.

# We don't want to index the dataframe by serial numbers, but instead by the chance of admit.
# But lets assume we want to keep the serial number for later. 
# Let's preserve the serial number in a new column.

df['Serial Number'] = df.index
# Set the index to another column
df = df.set_index('chance of admit')
df.head()

chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,Serial Number
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5


In [134]:
# When creating a new index from an existing column the index has a name, which
# is the original name of the column
df = df.reset_index()
df.head()

,chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5
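
As an alternative to copying the index into a new column by hand, `reset_index()` can be called first, since it moves the current index into an ordinary column. This is a sketch on a made-up two-row slice, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical slice of the admissions data, indexed by serial number.
df = pd.DataFrame({'gre score': [337, 324],
                   'chance of admit': [0.92, 0.76]},
                  index=pd.Index([1, 2], name='Serial No.'))

# reset_index() moves 'Serial No.' into a column; set_index() then promotes
# 'chance of admit' to be the new index.
reindexed = df.reset_index().set_index('chance of admit')

print(reindexed.index.name)
print(list(reindexed.columns))
```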


In [135]:
# Multi-level indexing. This is similar to composite keys in relational database systems.

df = pd.read_csv('datasets/census.csv')

In [136]:
df.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [137]:
df['SUMLEV'].unique()

array([40, 50], dtype=int64)

In [138]:
# There are only two different values, 40 and 50.
df = df[df['SUMLEV'] == 50]
df.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [139]:
columns_to_keep = ['STNAME','CTYNAME','BIRTHS2010','BIRTHS2011','BIRTHS2012','BIRTHS2013','BIRTHS2014','BIRTHS2015',
                  'POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014',
                  'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
2,Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
3,Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
4,Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
5,Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [140]:
df = df.set_index(['STNAME','CTYNAME'])
df.head()

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [141]:
# Query this dataframe.

# When you use a MultiIndex, you must provide the arguments in order by the level you wish
# to query.

df.loc['Michigan','Washtenaw County']

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

In [143]:
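# If you want to compare two counties, you can pass a list of tuples describing the
# indices we wish to query into loc, one (state, county) tuple per county. Note that the call
# below packs both counties into a single four-element tuple, so only Washtenaw County is returned.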
df.loc[[('Michigan','Washtenaw County',
       'Michigan','Wayne County')]]

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
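
To get both counties back, each one needs its own (state, county) tuple in the list. The sketch below shows that form on a tiny made-up frame. The birth counts are invented, and only the index layout matches the census data.

```python
import pandas as pd

# Hypothetical numbers; only the two-level index mirrors the census frame.
df = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                   'CTYNAME': ['Washtenaw County', 'Wayne County', 'Franklin County'],
                   'BIRTHS': [10, 20, 30]}).set_index(['STNAME', 'CTYNAME'])

# One (state, county) tuple per row we want back.
both = df.loc[[('Michigan', 'Washtenaw County'),
               ('Michigan', 'Wayne County')]]

print(both.index.tolist())
print(both['BIRTHS'].tolist())
```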


Missing values are pretty common in data cleaning activities. Missing values can be there for any number of reasons, and I just want to touch on a few here.

For example, if you are running a survey and a respondent didn't answer a question, the missing value is actually an omission. This kind of missing data is called **Missing at Random (MAR)** if there are other variables that might be used to predict the variable which is missing. If there's no relationship to other variables, then we call this data **Missing Completely at Random (MCAR)**.

In [144]:
# The pandas read_csv() function has a parameter called na_values to let us specify
# the form of missing values. It allows scalar, string, list, or dictionaries to 
# be used.

df = pd.read_csv('datasets/class_grades.csv')
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
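
The cell above does not actually pass `na_values`, so here is a minimal sketch of how it might be used. The file contents and the markers `"-"` and `"unknown"` are assumptions made up for this example; they are not taken from class_grades.csv.

```python
import io
import pandas as pd

# Hypothetical file contents where "-" and "unknown" stand for missing grades.
text = "Prefix,Midterm,Final\n5,64.38,52.5\n8,unknown,48.89\n7,49.38,-\n"

# A list applies the same missing-value markers to every column.
grades = pd.read_csv(io.StringIO(text), na_values=["-", "unknown"])

# A dictionary applies markers per column: here "-" only counts in Final.
per_column = pd.read_csv(io.StringIO(text), na_values={"Final": ["-"]})
```

With the list form both columns are parsed as floats with NaN in the gaps. With the dictionary form the Midterm column keeps the string `"unknown"`, so it stays an object column.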


In [145]:
# We can use the function .isnull() to create a Boolean mask of the whole dataframe. This 
# effectively broadcasts the isnull() function to every cell of data.
mask = df.isnull()
mask.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False
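
A Boolean mask is mostly useful once you summarize it. The following is a small sketch on a toy frame (the name `toy` and its values are made up) that counts the missing cells per column and pulls out the rows that have at least one gap.

```python
import numpy as np
import pandas as pd

# Toy grades with two gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "Prefix": [5, 8, 7],
    "Midterm": [64.38, np.nan, 49.38],
    "Final": [52.5, 48.89, np.nan],
})

mask = toy.isnull()

# True counts as 1, so summing the mask gives missing cells per column.
missing_per_column = mask.sum()

# any(axis=1) is True for rows where at least one cell is missing.
rows_with_missing = toy[mask.any(axis=1)]
```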


In [146]:
df.dropna().head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22
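
By default `dropna()` removes every row that has any missing value, which is why rows 2, 3, and 7 disappeared above. The sketch below, on a made-up toy frame, shows two ways to be less aggressive using the `how` and `subset` parameters.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is partly missing, row 2 is entirely missing.
toy = pd.DataFrame({
    "Assignment": [57.14, np.nan, np.nan],
    "Midterm": [64.38, 49.38, np.nan],
})

# Default how="any": drop a row if any cell is missing.
drop_any = toy.dropna()

# how="all": drop a row only if every cell is missing.
drop_all = toy.dropna(how="all")

# subset: only look at the listed columns when deciding.
drop_midterm = toy.dropna(subset=["Midterm"])
```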


In [147]:
# The filling function, fillna(). This function takes a number of parameters.
df.fillna(0, inplace = True)

df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
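
Filling everything with 0 is only one option. As a sketch, `fillna()` also accepts a dictionary of per-column fill values, and without `inplace = True` it returns a new frame and leaves the original alone. The toy frame and the choice of the column mean as a fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy grades; the values are invented for illustration.
toy = pd.DataFrame({
    "Midterm": [64.0, np.nan, 50.0],
    "Final": [np.nan, 48.0, 80.0],
})

# Fill each column differently: Midterm with its mean, Final with 0.
filled = toy.fillna({"Midterm": toy["Midterm"].mean(), "Final": 0})
```

Whether a mean, a zero, or nothing at all is a sensible fill depends on why the value is missing in the first place.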


In [149]:
# read_csv() also has an na_filter option. Setting it to False turns off missing-value
# detection, which helps if empty strings are actual values of interest. But in practice,
# that is pretty rare.

df = pd.read_csv('datasets/log.csv')
df.head(10)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [150]:
# fillna() also takes a method parameter. The two common values are ffill and bfill. ffill is
# forward filling: it updates an NA value in a particular cell with the value from the previous
# row. bfill is backward filling, which is the opposite of ffill. It fills the missing 
# values with the next valid value.
# It's important to note that your data needs to be sorted in order for this to have the 
# effect you might want. Data which comes from traditional database management systems 
# usually has no order guarantee, just like this data. So be careful.

In [151]:
df = df.set_index('time')
df = df.sort_index()
df.head(10)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [152]:
df = df.reset_index()
df = df.set_index(['time','user'])
df.head(10)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [154]:
# ffill() is equivalent to fillna(method = 'ffill'), which newer pandas versions deprecate.
df = df.ffill()
df.head()

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
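
To see why the sort order matters, here is a small sketch with a made-up log (the name `log` and its values are hypothetical). The same frame is filled before and after sorting, and forward and backward filling are compared.

```python
import numpy as np
import pandas as pd

# Toy log whose rows arrive out of time order.
log = pd.DataFrame({
    "time": [3, 1, 2, 4],
    "volume": [np.nan, 10.0, np.nan, 5.0],
})

# Filling the unsorted data: the first row has no previous row to copy from.
unsorted_fill = log.ffill()

# Sort by time first, then fill.
ordered = log.sort_values("time")
forward = ordered.ffill()   # each gap takes the previous valid value
backward = ordered.bfill()  # each gap takes the next valid value
```

After sorting, forward filling carries 10.0 through times 2 and 3, while backward filling pulls 5.0 back into those same rows.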


In [156]:
df = pd.DataFrame({'A': [1, 3, 2, 1, 4],
                  'B': [2, 5, 6, 7, 3],
                  'C': ['a','b','c','d','e']})
df

Unnamed: 0,A,B,C
0,1,2,a
1,3,5,b
2,2,6,c
3,1,7,d
4,4,3,e


In [157]:
df.replace(1, 100)


Unnamed: 0,A,B,C
0,100,2,a
1,3,5,b
2,2,6,c
3,100,7,d
4,4,3,e


In [161]:
df.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,2,a
1,300,5,b
2,2,6,c
3,100,7,d
4,4,300,e
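
Notice that the list form replaced the 3 in column B as well as the one in column A. If you only want to touch one column, a nested dictionary is one way to do it. This is a sketch on the same numbers, with the hypothetical names `toy` and `only_a`.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 3, 2, 1, 4],
                    "B": [2, 5, 6, 7, 3]})

# Outer key is the column name, inner dict maps old value -> new value.
only_a = toy.replace({"A": {1: 100, 3: 300}})
```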


In [162]:
df = pd.read_csv('datasets/log.csv')
df.head(10)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [163]:
# To replace using a regex, we make the first parameter to replace() the regex pattern we want to match, the 
# second parameter the value we want to emit upon a match, and then we pass in a third parameter "regex = True".

# We want to detect all html pages in the "video" column, that is, values that end with ".html", and
# overwrite them with the keyword "webpage".

df.replace(to_replace = ".*html$", value = "webpage", regex = True)


Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
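
Calling `replace()` on the whole DataFrame applies the pattern to every string column. A slightly tighter sketch is to call it on the one column you care about and to escape the dot so that it only matches a literal ".html". The toy frame and the "advanced.mp4" value below are made up for illustration.

```python
import pandas as pd

# Toy log; the second video name is invented to show a non-match.
log = pd.DataFrame({
    "user": ["cheryl", "bob"],
    "video": ["intro.html", "advanced.mp4"],
})

# Raw string so the backslash reaches the regex engine untouched.
log["video"] = log["video"].replace(r".*\.html$", "webpage", regex=True)
```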


When you use statistical functions on DataFrames, these functions typically ignore missing values. You should be aware of the values that are being excluded. Why you have missing values really matters, depending on the problem you are trying to solve. It might be unreasonable to infer missing values if the data should not exist in the first place.
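
As a quick sketch of what "ignore" means here, the toy Series below (values made up) has one gap. `mean()` skips it by default, `skipna = False` makes the gap propagate, and `count()` tells you how many values were actually used.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1.0, np.nan, 3.0])

# Default: the NaN is skipped, so the mean is taken over two values.
mean_ignoring_nan = scores.mean()

# skipna=False: any NaN makes the result NaN.
mean_with_nan = scores.mean(skipna=False)

# count() reports the number of non-missing values.
values_used = scores.count()
```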

### Example: Manipulating a DataFrame

In [164]:
df = pd.read_csv('datasets/presidents.csv')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


In [167]:
# Start with cleaning up that name into firstname and lastname.

df["First"] = df['President']
df["First"] = df["First"].replace("[ ].*","",regex = True)
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James


In [168]:
del df["First"]

# The apply() function on a dataframe will take some arbitrary function you have written and apply it to 
# either a series (a single column) or dataframe across all rows or columns.

def splitname(row):
    # the row is a single series object which is a single row indexed by column values
    row['First'] = row['President'].split(" ")[0]
    row['Last'] = row['President'].split(" ")[-1]
    return row

df = df.apply(splitname, axis = 'columns')

In [169]:
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
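
Row-wise `apply()` calls a Python function once per row. For a simple split like this, the vectorized `.str` accessor is a possible alternative. This is a sketch on a toy frame rather than the presidents file, and the variable names are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    "President": ["George Washington", "John Adams", "Martin Van Buren"],
})

# str.split gives a list of words per row; .str[i] indexes into each list.
parts = people["President"].str.split(" ")
people["First"] = parts.str[0]
people["Last"] = parts.str[-1]
```

Like `splitname()`, this takes the last word as the last name, so "Martin Van Buren" ends up with "Buren".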


In [170]:
del df['First']
del df['Last']

# Extract takes a regular expression as input and specifically requires you to capture groups
# that correspond to the output columns you are interested in.

# A raw string keeps the backslashes in \w from being treated as string escapes.
pattern = r"(^[\w]*)(?:.* )([\w]*$)"

df["President"].str.extract(pattern).head()

Unnamed: 0,0,1
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


In [175]:
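# Named groups (?P<name>...) become the column names of the result.
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"

# Extract for every row; taking head() here would leave later rows empty when assigned back.
names = df["President"].str.extract(pattern)

In [176]:
names.head()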

Unnamed: 0,First,Last
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


In [178]:
df["First"] = names["First"]
df["Last"] = names["Last"]
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
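
Here is a self-contained sketch of the named-group extraction on a toy frame, joining the result back in one step instead of assigning each column. The frame `people` is made up for this example.

```python
import pandas as pd

people = pd.DataFrame({
    "President": ["George Washington", "John Quincy Adams"],
})

# First word -> First, anything in the middle is skipped, last word -> Last.
pattern = r"(?P<First>^\w*)(?:.* )(?P<Last>\w*$)"

extracted = people["President"].str.extract(pattern)

# join aligns on the index, so each name lands on its own row.
people = people.join(extracted)
```

The non-capturing group `(?:.* )` swallows any middle names, so "John Quincy Adams" becomes John and Adams.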


In [179]:
# Keep only the "Mon D, YYYY" part, which drops footnote markers such as [a].
df["Born"] = df["Born"].str.extract(r"([\w]{3} [\w]{1,2}, [\w]{4})")
df["Born"].head()

0    Feb 22, 1732
1    Oct 30, 1735
2    Apr 13, 1743
3    Mar 16, 1751
4    Apr 28, 1758
Name: Born, dtype: object

In [180]:
df["Born"] = pd.to_datetime(df["Born"])
df["Born"].head()

0   1732-02-22
1   1735-10-30
2   1743-04-13
3   1751-03-16
4   1758-04-28
Name: Born, dtype: datetime64[ns]
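
`pd.to_datetime()` guesses the format here. If you already know it, a possible refinement is to pass `format` explicitly and use `errors = 'coerce'` so that anything unparseable becomes a missing value instead of raising. The Series below is a toy example, and the "not a date" entry is made up.

```python
import pandas as pd

born = pd.Series(["Feb 22, 1732", "Oct 30, 1735", "not a date"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(born, format="%b %d, %Y", errors="coerce")

# Once the column is datetime, the .dt accessor exposes the parts.
years = parsed.dt.year
```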