# DataFrames - Part II

## Memory Optimization

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('data/pandas/employees.csv')
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               855 non-null object
Start Date           1000 non-null object
Last Login Time      1000 non-null object
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    933 non-null object
Team                 957 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


Use `pd.to_datetime()` to convert a column of date strings to the `datetime64` dtype.

In [5]:
df['Start Date'].head()

0     8/6/1993
1    3/31/1996
2    4/23/1993
3     3/4/2005
4    1/24/1998
Name: Start Date, dtype: object

In [6]:
df['Start Date'] = pd.to_datetime(df['Start Date'])
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


When converting a time string that has no date component, `pd.to_datetime()` fills in today's date.

In [8]:
df['Last Login Time'].head()

0    12:42 PM
1     6:53 AM
2    11:17 AM
3     1:00 PM
4     4:47 PM
Name: Last Login Time, dtype: object

In [10]:
df['Last Login Time'] = pd.to_datetime(df['Last Login Time'])
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
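
If only the time of day matters, one option is to parse with an explicit format and keep just the time component. The following is a minimal sketch on a hypothetical three-row sample (the `login` Series is not part of the notebook), assuming the strings follow the `12:42 PM` pattern shown above.

```python
import pandas as pd

# Hypothetical sample mirroring the 'Last Login Time' strings above.
login = pd.Series(['12:42 PM', '6:53 AM', '11:17 AM'])

# An explicit format avoids format guessing; the date part of the
# result is only a placeholder and is discarded on the next line.
parsed = pd.to_datetime(login, format='%I:%M %p')

# Keep only the time of day as datetime.time objects.
times = parsed.dt.time
print(times.tolist())
```

The resulting column has `object` dtype, so this trades the `datetime64` storage for values that no longer depend on the day the notebook was run.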


In [11]:
df['Senior Management'].head()

0     True
1     True
2    False
3     True
4     True
Name: Senior Management, dtype: object

In [12]:
# caution: astype('bool') converts missing values (NaN) to True
df['Senior Management'] = df['Senior Management'].astype('bool')
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
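
The `df.info()` output earlier reports 933 non-null values for `Senior Management`, so the cast to `bool` has to put something in the 67 missing rows. The sketch below, on a hypothetical three-value Series, shows what happens and one possible alternative: the nullable `'boolean'` dtype, which assumes pandas 1.0 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value between True and False.
flags = pd.Series([True, np.nan, False], dtype=object)

# Plain bool has no way to represent "missing", so NaN becomes True.
as_bool = flags.astype('bool')
print(as_bool.tolist())

# The nullable boolean dtype keeps the gap as <NA> (pandas >= 1.0).
as_nullable = flags.astype('boolean')
print(as_nullable.isna().sum())
```

Whether the silent `True` is acceptable depends on the analysis; for a filter such as "is in senior management" it would count unknown rows as managers.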


In [13]:
df['Gender'].head()

0      Male
1      Male
2    Female
3      Male
4      Male
Name: Gender, dtype: object

In [14]:
df['Gender'] = df['Gender'].astype('category')
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               855 non-null category
Start Date           1000 non-null datetime64[ns]
Last Login Time      1000 non-null datetime64[ns]
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    1000 non-null bool
Team                 957 non-null object
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(2)
memory usage: 49.0+ KB
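
The `+` in the reported memory usage means the figure does not count the contents of `object` columns. A rough way to see the effect of the `category` conversion is `memory_usage(deep=True)`. This is a sketch on a synthetic 1000-row Series (`gender_obj` is made up for illustration), not a measurement of the employees file.

```python
import pandas as pd

# Synthetic column: 1000 strings drawn from two distinct values.
gender_obj = pd.Series(['Male', 'Female'] * 500, dtype=object)
gender_cat = gender_obj.astype('category')

# deep=True includes the memory held by the Python string objects.
obj_bytes = gender_obj.memory_usage(deep=True)
cat_bytes = gender_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

The saving comes from storing each distinct value once and a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.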


## Filter a DataFrame Based on a Condition

In [16]:
df = pd.read_csv(
    'data/pandas/employees.csv',
    parse_dates=['Start Date','Last Login Time']
)

df['Senior Management'] = df['Senior Management'].astype('bool')
df['Gender'] = df['Gender'].astype('category')
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services


In [18]:
# filter gender for males
df[df['Gender'] == 'Male'].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal


In [19]:
# filter team for finance
df[df['Team'] == 'Finance'].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
7,,Female,2015-07-20,2018-10-05 10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,2018-10-05 07:13:00,41426,14.543,True,Finance
46,Bruce,Male,2009-11-28,2018-10-05 22:47:00,114796,6.796,False,Finance


In [21]:
# filter for senior management
mask = df['Senior Management']
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
6,Ruby,Female,1987-08-17,2018-10-05 16:20:00,65476,10.012,True,Product


In [22]:
# filter for everyone outside of the marketing team
df[df['Team'] != 'Marketing'].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal


In [23]:
# everyone whose salary is greater than 100k
df[df['Salary'] > 100000].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal
9,Frances,Female,2002-08-08,2018-10-05 06:51:00,139852,7.524,True,Business Development


In [24]:
# everyone whose bonus is less than 1.5%
df[df['Bonus %'] < 1.5].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
15,Lillian,Female,2016-06-05,2018-10-05 06:09:00,59414,1.256,False,Product
58,Theresa,Female,2010-04-11,2018-10-05 07:18:00,72670,1.481,True,Engineering
77,Charles,Male,2004-09-14,2018-10-05 20:13:00,107391,1.26,True,Marketing
175,Willie,Male,1998-02-17,2018-10-05 20:20:00,146651,1.451,True,Engineering


In [26]:
# everyone who started on or before jan 1st 1985
df[df['Start Date'] <= '1985-01-01'].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,1980-08-12,2018-10-05 09:01:00,63241,15.132,True,
12,Brandon,Male,1980-12-01,2018-10-05 01:08:00,112807,17.492,True,Human Resources
18,Diana,Female,1981-10-23,2018-10-05 10:27:00,132940,19.082,False,Client Services
28,Terry,Male,1981-11-27,2018-10-05 18:30:00,124008,13.464,True,Client Services
37,Linda,Female,1981-10-19,2018-10-05 20:49:00,57427,9.557,True,Client Services


## Filter with More Than One Condition (AND)

In [28]:
# verbose
df[(df['Gender'] == 'Male') & (df['Team'] == 'Marketing')].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
21,Matthew,Male,1995-09-05,2018-10-05 02:12:00,100612,13.645,False,Marketing
26,Craig,Male,2000-02-27,2018-10-05 07:45:00,37598,7.757,True,Marketing
74,Thomas,Male,1995-06-04,2018-10-05 14:24:00,62096,17.029,False,Marketing
77,Charles,Male,2004-09-14,2018-10-05 20:13:00,107391,1.26,True,Marketing


Assigning each condition to its own variable makes the combined filter more succinct.

In [27]:
male = df['Gender'] == 'Male'
marketing = df['Team'] == 'Marketing'

df[male & marketing].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
21,Matthew,Male,1995-09-05,2018-10-05 02:12:00,100612,13.645,False,Marketing
26,Craig,Male,2000-02-27,2018-10-05 07:45:00,37598,7.757,True,Marketing
74,Thomas,Male,1995-06-04,2018-10-05 14:24:00,62096,17.029,False,Marketing
77,Charles,Male,2004-09-14,2018-10-05 20:13:00,107391,1.26,True,Marketing
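
The parentheses in the verbose form are required, because `&` binds more tightly than `==` in Python. The sketch below uses a hypothetical four-row frame to show the working form next to the unparenthesized one, which raises; the exact exception type is left open because it may vary with dtypes and pandas version.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Male'],
    'Team': ['Marketing', 'Finance', 'Marketing', 'Marketing'],
})

# Each comparison is parenthesized so it is evaluated before `&`.
both = df[(df['Gender'] == 'Male') & (df['Team'] == 'Marketing')]

# Without parentheses Python evaluates 'Male' & df['Team'] first,
# so the expression fails instead of filtering.
try:
    df[df['Gender'] == 'Male' & df['Team'] == 'Marketing']
    failed = False
except Exception:
    failed = True

print(both.index.tolist(), failed)
```

Naming the masks first, as in the cell above, sidesteps the issue because each comparison is already evaluated before `&` is applied.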


## Filter with More Than One Condition (OR)

In [29]:
sm = df['Senior Management']
date = df['Start Date'] < '1990-01-01'

df[sm | date].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal


In [31]:
# name of robert and team of client services or a start date greater than jun 1 2016
mask1 = df['First Name'] == 'Robert'
mask2 = df['Team'] == 'Client Services'
mask3 = df['Start Date'] > '2016-06-01'

df[(mask1 & mask2) | mask3]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,2018-10-05 06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,2018-10-05 19:47:00,100705,16.961,True,Marketing
387,Robert,Male,1994-10-29,2018-10-05 04:26:00,123294,19.894,False,Client Services
451,Terry,,2016-07-15,2018-10-05 00:29:00,140002,19.49,True,Marketing
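
When `&` and `|` are mixed, `&` is evaluated first, so the grouping in `(mask1 & mask2) | mask3` matches Python's default. A small sketch with made-up boolean Series (`a`, `b`, `c` are illustrative names) shows that moving the parentheses changes the result:

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])
c = pd.Series([False, False, False, True])

explicit = (a & b) | c   # grouping written out
implicit = a & b | c     # same grouping, because & binds tighter than |
regrouped = a & (b | c)  # a different question

print(explicit.tolist())
print(regrouped.tolist())
```

Writing the parentheses out, as the notebook does, is still a reasonable habit since it makes the intended grouping visible.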


## The .isin() Method

In [33]:
# extract legal, sales, or product from team
# brute force
mask1 = df['Team'] == 'Legal'
mask2 = df['Team'] == 'Sales'
mask3 = df['Team'] == 'Product'

df[mask1 | mask2 | mask3].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,2018-10-05 16:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,2018-10-05 15:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,2018-10-05 23:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,2018-10-05 06:09:00,59414,1.256,False,Product


In [34]:
df['Team'].isin(['Legal','Sales','Product']).head()

0    False
1    False
2    False
3    False
4    False
Name: Team, dtype: bool

In [35]:
mask = df['Team'].isin(['Legal','Sales','Product'])

df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,2018-10-05 16:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,2018-10-05 15:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,2018-10-05 23:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,2018-10-05 06:09:00,59414,1.256,False,Product
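
The mask from `.isin()` can be inverted with `~` to select everything outside the list. Note that missing values are not in the list, so they end up in the inverted selection. A minimal sketch on a hypothetical `team` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing team.
team = pd.Series(['Legal', 'Finance', np.nan, 'Sales', 'Product'])
wanted = ['Legal', 'Sales', 'Product']

in_mask = team.isin(wanted)

# ~ flips every value, so the NaN row is selected here.
out_mask = ~in_mask

# Combine with notnull() to exclude rows whose team is unknown.
out_known = ~in_mask & team.notnull()

print(team[out_mask].tolist())
print(team[out_known].tolist())
```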


## The .isnull() and .notnull() Methods

In [37]:
# rows where team value is null
df['Team'].isnull().head()

0    False
1     True
2    False
3    False
4    False
Name: Team, dtype: bool

In [38]:
mask = df['Team'].isnull()

df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
10,Louise,Female,1980-08-12,2018-10-05 09:01:00,63241,15.132,True,
23,,Male,2012-06-14,2018-10-05 16:19:00,125792,5.042,True,
32,,Male,1998-08-21,2018-10-05 14:27:00,122340,6.417,True,
91,James,,2005-01-26,2018-10-05 23:00:00,128771,8.309,False,


In [40]:
mask = df['Gender'].notnull()

df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-10-05 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-10-05 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2018-10-05 16:47:00,101004,1.389,True,Client Services
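
Because `True` counts as 1 when summed, the same methods can report how many values are missing per column, or select rows with no gaps at all. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical sample with gaps in both columns.
df = pd.DataFrame({
    'First Name': ['Thomas', None, 'James'],
    'Team': [None, None, 'Finance'],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows in which every column has a value.
complete_rows = df[df.notnull().all(axis=1)]
print(complete_rows)
```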


## The .between() Method

In [42]:
# salary between 60k and 70k
df['Salary'].between(60000,70000).head()

0    False
1     True
2    False
3    False
4    False
Name: Salary, dtype: bool

In [44]:
df[df['Salary'].between(60000,70000)].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
6,Ruby,Female,1987-08-17,2018-10-05 16:20:00,65476,10.012,True,Product
10,Louise,Female,1980-08-12,2018-10-05 09:01:00,63241,15.132,True,
20,Lois,,1995-04-22,2018-10-05 19:18:00,64714,4.934,True,Legal
41,Christine,,2015-06-28,2018-10-05 01:08:00,66582,11.308,True,Business Development


In [45]:
# bonus between 2% and 5%
df[df['Bonus %'].between(2.0,5.0)].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-10-05 06:53:00,61933,4.17,True,
20,Lois,,1995-04-22,2018-10-05 19:18:00,64714,4.934,True,Legal
40,Michael,Male,2008-10-10,2018-10-05 11:25:00,99283,2.665,True,Distribution
49,Chris,,1980-01-24,2018-10-05 12:13:00,113590,3.055,False,Sales
60,Paula,,2005-11-23,2018-10-05 14:01:00,48866,4.271,False,Distribution


In [46]:
# start date between jan 1st 1991 and jan 1st 1992
df[df['Start Date'].between('1991-01-01','1992-01-01')].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
27,Scott,,1991-07-11,2018-10-05 18:58:00,122367,5.218,False,Legal
75,Bonnie,Female,1991-07-02,2018-10-05 01:27:00,104897,5.118,True,Human Resources
88,Donna,Female,1991-11-27,2018-10-05 13:59:00,64088,6.155,True,Legal
116,,Male,1991-06-22,2018-10-05 20:58:00,76189,18.988,True,Legal
148,Patrick,,1991-07-14,2018-10-05 02:24:00,124488,14.837,True,Sales


In [47]:
# last login time between 8:30 AM and 12:00 PM
df[df['Last Login Time'].between('8:30AM','12:00PM')].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2018-10-05 11:17:00,130590,11.858,False,Finance
7,,Female,2015-07-20,2018-10-05 10:43:00,45906,11.598,True,Finance
10,Louise,Female,1980-08-12,2018-10-05 09:01:00,63241,15.132,True,
18,Diana,Female,1981-10-23,2018-10-05 10:27:00,132940,19.082,False,Client Services
33,Jean,Female,1993-12-18,2018-10-05 09:07:00,119082,16.18,False,Business Development
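
By default `.between()` includes both endpoints, so it is equivalent to combining `>=` and `<=`. A minimal sketch with made-up salaries around the boundaries:

```python
import pandas as pd

# Hypothetical values just below, on, and just above the limits.
salary = pd.Series([59999, 60000, 65000, 70000, 70001])

via_between = salary.between(60000, 70000)
via_ops = (salary >= 60000) & (salary <= 70000)

print(via_between.tolist())
print(via_between.equals(via_ops))
```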


## The .duplicated() Method

In [48]:
df = pd.read_csv(
    'data/pandas/employees.csv',
    parse_dates=['Start Date','Last Login Time']
)

df['Senior Management'] = df['Senior Management'].astype('bool')
df['Gender'] = df['Gender'].astype('category')
df.sort_values('First Name',inplace=True)
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-10-05 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2018-10-05 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2018-10-05 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2018-10-05 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2018-10-05 01:45:00,95327,15.12,False,Distribution


In [49]:
df['First Name'].head()

101    Aaron
327    Aaron
440    Aaron
937    Aaron
137     Adam
Name: First Name, dtype: object

In [51]:
df['First Name'].duplicated().head()

101    False
327     True
440     True
937     True
137    False
Name: First Name, dtype: bool

In [52]:
# keep only rows whose first name appears exactly once
mask = ~df['First Name'].duplicated(keep=False)
df[mask].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
8,Angela,Female,2005-11-22,2018-10-05 06:29:00,95570,18.523,True,Engineering
688,Brian,Male,2007-04-07,2018-10-05 22:47:00,93901,17.821,True,Legal
190,Carol,Female,1996-03-19,2018-10-05 03:39:00,57783,9.129,False,Finance
887,David,Male,2009-12-05,2018-10-05 08:48:00,92242,15.407,False,Legal
5,Dennis,Male,1987-04-18,2018-10-05 01:35:00,115163,10.125,False,Legal
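
The `keep` argument decides which occurrence in a group of repeats is treated as the original. The sketch below compares the three settings on a hypothetical `names` Series:

```python
import pandas as pd

names = pd.Series(['Aaron', 'Aaron', 'Adam', 'Aaron', 'Angela'])

# keep='first' (default): the first occurrence is not marked.
first = names.duplicated()

# keep='last': the last occurrence is not marked.
last = names.duplicated(keep='last')

# keep=False: every member of a repeated group is marked.
every = names.duplicated(keep=False)

# Inverting keep=False leaves the names that occur exactly once.
singletons = names[~every]
print(singletons.tolist())
```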


## The .drop_duplicates() Method

In [53]:
df = pd.read_csv(
    'data/pandas/employees.csv',
    parse_dates=['Start Date','Last Login Time']
)

df['Senior Management'] = df['Senior Management'].astype('bool')
df['Gender'] = df['Gender'].astype('category')
df.sort_values('First Name',inplace=True)
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-10-05 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2018-10-05 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2018-10-05 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2018-10-05 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2018-10-05 01:45:00,95327,15.12,False,Distribution


In [54]:
len(df)

1000

In [55]:
len(df.drop_duplicates())

1000

In [56]:
# keep only the first row for each distinct first name
df.drop_duplicates(subset=['First Name'],keep='first').head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-10-05 10:20:00,61602,11.849,True,Marketing
137,Adam,Male,2011-05-21,2018-10-05 01:45:00,95327,15.12,False,Distribution
300,Alan,Male,1988-06-26,2018-10-05 03:54:00,111786,3.592,True,Engineering
372,Albert,Male,1997-02-01,2018-10-05 16:20:00,67827,19.717,True,Engineering
988,Alice,Female,2004-10-05,2018-10-05 09:34:00,47638,11.209,False,Human Resources
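
Without `subset`, a row is dropped only when every column matches an earlier row, which is why the full employees frame keeps all 1000 rows. A sketch on a hypothetical five-row frame shows how `subset` and `keep` change the result:

```python
import pandas as pd

# Hypothetical sample: row 4 repeats row 2 in every column.
df = pd.DataFrame({
    'First Name': ['Aaron', 'Aaron', 'Adam', 'Angela', 'Adam'],
    'Team': ['Marketing', 'Client Services', 'Distribution',
             'Engineering', 'Distribution'],
})

# All columns compared: only the exact repeat (row 4) is dropped.
full = df.drop_duplicates()

# Only 'First Name' compared, keeping the last row per name.
by_name_last = df.drop_duplicates(subset=['First Name'], keep='last')

# keep=False drops every row whose name occurs more than once.
only_once = df.drop_duplicates(subset=['First Name'], keep=False)

print(full.index.tolist())
print(by_name_last.index.tolist())
print(only_once.index.tolist())
```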


## The .unique() and .nunique() Methods

In [61]:
df = pd.read_csv(
    'data/pandas/employees.csv',
    parse_dates=['Start Date','Last Login Time']
)

df['Senior Management'] = df['Senior Management'].astype('bool')
df['Gender'] = df['Gender'].astype('category')
df.sort_values('First Name',inplace=True)
df.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-10-05 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2018-10-05 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2018-10-05 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2018-10-05 19:39:00,63126,18.424,False,Client Services
137,Adam,Male,2011-05-21,2018-10-05 01:45:00,95327,15.12,False,Distribution


In [62]:
df['Gender'].unique()

[Male, NaN, Female]
Categories (2, object): [Male, Female]

In [63]:
df['Team'].unique()

array(['Marketing', 'Client Services', 'Distribution', 'Product',
       'Human Resources', 'Engineering', 'Finance', 'Business Development',
       'Sales', nan, 'Legal'], dtype=object)

In [64]:
# does not include NaN values
df['Team'].nunique()

10

In [65]:
df['Team'].nunique(dropna=False)

11
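
To see how often each value occurs rather than just how many distinct values there are, `.value_counts()` accepts the same `dropna` argument. A minimal sketch on a hypothetical `team` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing teams.
team = pd.Series(['Legal', 'Sales', np.nan, 'Legal', np.nan, 'Product'])

distinct = team.nunique()                       # NaN ignored
distinct_with_nan = team.nunique(dropna=False)  # NaN counted once

# Frequency of each value, with NaN reported as its own row.
counts = team.value_counts(dropna=False)
print(distinct, distinct_with_nan)
print(counts)
```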