# DataFrame Boolean Indexing

### Objectives
After this lesson you should be able to...
+ Know that boolean indexing with DataFrames is nearly identical to boolean indexing with Series
+ Create criteria from several different columns
+ Pass criteria into the indexing operator
+ Use **`value_counts`** to get exact string names and frequencies
+ Use **`.loc`** to do boolean indexing and column selection simultaneously 
+ Use **`isin`** to test equality against multiple possible values
+ Answer basic questions after subsetting data into groups

### Prepare for this lesson by...
[ALWAYS READ THE DOCUMENTATION BEFORE A LESSON!](http://pandas.pydata.org/pandas-docs/stable/)
+ Read the [Indexing and Selecting](http://pandas.pydata.org/pandas-docs/stable/indexing.html) documentation

## Introduction
Boolean indexing works nearly the same way for DataFrames as it does for Series. You first create a boolean Series and then pass it to the indexing operator. The indexing operator is normally reserved for selecting columns by name, but it also accepts a boolean Series.
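
The two steps can be sketched on a tiny made-up DataFrame before moving to the real data. The column names and values here are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not the employee dataset)
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cal'],
                   'salary': [120000, 45000, 101000]})

# Step 1: a comparison on a column produces a boolean Series
criteria = df['salary'] > 100000

# Step 2: passing that Series to the indexing operator keeps the True rows
high_earners = df[criteria]
print(high_earners)
```

The original index labels (0 and 2 here) are kept on the rows that survive the filter.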

## Employee Dataset
We will be working with the employee dataset. Let's read it in without setting an index.

In [1]:
import pandas as pd
import numpy as np

employee = pd.read_csv('../data/employee.csv')
employee.head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
1,364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
2,1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
3,8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
4,8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


## Boolean Indexing with Series - a review

In [2]:
# use boolean selection to select salaries > 100,000
salary = employee['BASE_SALARY']
criteria = salary > 100000

salary[criteria].head(10)

0      121862.0
8      107962.0
11     180416.0
43     165216.0
66     100791.0
169    120916.0
178    210588.0
186    110881.0
217    102019.0
237    130416.0
Name: BASE_SALARY, dtype: float64

### Returning all the columns
Boolean indexing on a DataFrame returns every column for the rows where the criteria is `True`. We can use the same criteria to select all the employee data for those with salaries greater than 100,000.

In [3]:
criteria = salary > 100000
employee[criteria].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
8,8172,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,Public Works & Engineering-PWE,107962.0,White,Full Time,Male,Active,1993-11-15,2013-01-05
11,3347,"CHIEF PHYSICIAN,MD",Health & Human Services,180416.0,Black or African American,Full Time,Male,Active,1987-05-22,1999-08-28
43,3982,ASSOCIATE EMS PHYSICIAN DIRECTOR,Houston Fire Department (HFD),165216.0,Hispanic/Latino,Full Time,Male,Active,2013-08-31,2013-08-31
66,7369,"PUBLIC HEALTH DENTIST,DDS",Health & Human Services,100791.0,White,Full Time,Female,Active,2015-12-28,2015-12-28


### More interesting boolean logic with DataFrames
Many more interesting questions can now be asked and answered by combining boolean conditions on several columns, in much the same way as in the Series notebook.

In [4]:
# find people who make more than 100,000 and are female
criteria = (employee['BASE_SALARY'] > 100000) & (employee['GENDER'] == 'Female')
employee[criteria].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
66,7369,"PUBLIC HEALTH DENTIST,DDS",Health & Human Services,100791.0,White,Full Time,Female,Active,2015-12-28,2015-12-28
237,2438,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),Admn. & Regulatory Affairs,130416.0,Asian/Pacific Islander,Full Time,Female,Active,2002-05-24,2013-07-20
366,8989,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,Mayor's Office,110000.0,White,Full Time,Female,Active,2014-05-13,2014-05-13
522,3997,DEPUTY ASSISTANT DIRECTOR (EX LVL),Houston Airport System (HAS),110686.0,Black or African American,Full Time,Female,Active,2011-11-07,2011-11-07
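
Notice the parentheses around each comparison in the cell above. Python evaluates `&`, `|` and `~` before `>` and `==`, so without the parentheses the expression would not mean what it looks like and would typically raise an error. Here is a minimal sketch of the three operators on a made-up DataFrame (the data and variable names are assumptions for illustration only).

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'salary': [120000, 45000, 101000, 30000],
                    'gender': ['Female', 'Male', 'Male', 'Female']})

high = toy['salary'] > 100000
female = toy['gender'] == 'Female'

both = toy[high & female]     # and: rows where both conditions are True
either = toy[high | female]   # or: rows where at least one is True
not_high = toy[~high]         # not: rows where the condition is False

# When the comparisons are written inline, wrap each one in parentheses
inline = toy[(toy['salary'] > 100000) & (toy['gender'] == 'Female')]
```

Saving each condition to its own variable first, as the later examples in this lesson do, avoids the precedence problem entirely.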


### Before getting started - get unique values
You will often need to know the exact string value to do boolean indexing. Use either the **`unique`** or **`value_counts`** method to output the unique values, which makes it easier to write the criteria. I prefer **`value_counts`** because it gives the frequencies as well.

In [5]:
employee['DEPARTMENT'].value_counts().head(8)

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Solid Waste Management             43
Fleet Management Department        36
Name: DEPARTMENT, dtype: int64

In [6]:
employee['RACE'].unique() # returns a numpy array

array(['Hispanic/Latino', 'White', 'Black or African American',
       'Asian/Pacific Islander', nan, 'American Indian or Alaskan Native',
       'Others'], dtype=object)

In [7]:
employee['RACE'].value_counts()

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64
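
Notice that **`unique`** lists `nan` for RACE but **`value_counts`** does not, because it drops missing values by default. As a small sketch on made-up data, the `dropna=False` argument brings them back, and an equality test against a string is simply `False` for a missing value.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
race = pd.Series(['White', np.nan, 'White', 'Others'])

counted = race.value_counts()                     # missing value is left out
counted_all = race.value_counts(dropna=False)     # missing value is counted too

is_white = race == 'White'   # the missing row compares as False
n_missing = race.isna().sum()
```

This matters for boolean indexing because rows with a missing value will never match an equality criteria.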

In [8]:
employee['GENDER'].value_counts()

Male      1397
Female     603
Name: GENDER, dtype: int64

In [9]:
employee['POSITION_TITLE'].value_counts().head(15)

SENIOR POLICE OFFICER        220
POLICE OFFICER               184
FIRE FIGHTER                 138
POLICE SERGEANT               98
ENGINEER/OPERATOR             89
CAPTAIN                       47
UTILITY WORKER                40
INSPECTOR                     36
EQUIPMENT WORKER              33
ADMINISTRATIVE ASSISTANT      28
FIELD SUPERVISOR              23
ADMINISTRATIVE SPECIALIST     22
FIRE FIGHTER,PROBATIONARY     21
LABORER                       20
CUSTOMER SERVICE CLERK        20
Name: POSITION_TITLE, dtype: int64

In [10]:
employee['EMPLOYMENT_STATUS'].value_counts()

Active      1991
Inactive       9
Name: EMPLOYMENT_STATUS, dtype: int64

In [11]:
employee['BASE_SALARY'].describe()

count      1886.000000
mean      55767.931601
std       21693.706679
min       24960.000000
25%       40170.000000
50%       54461.000000
75%       66614.000000
max      275000.000000
Name: BASE_SALARY, dtype: float64

## Boolean Indexing examples
Now that we know the unique values we can do some boolean indexing. Here are some more complex examples.

In [12]:
# find the white female police officers

criteria_dept = employee['DEPARTMENT'] == 'Houston Police Department-HPD'
criteria_race = employee['RACE'] == 'White'
criteria_gender = employee['GENDER'] == 'Female'
criteria_all = criteria_dept & criteria_race & criteria_gender

In [13]:
employee[criteria_all].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
136,900,POLICE SERGEANT,Houston Police Department-HPD,81239.0,White,Full Time,Female,Active,1991-02-04,2005-02-12
185,1323,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,White,Full Time,Female,Active,1989-10-05,2005-03-26
227,1147,SENIOR POLICE TELECOMMUNICATOR,Houston Police Department-HPD,46675.0,White,Full Time,Female,Active,2011-03-21,2012-07-07
229,4397,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,White,Full Time,Female,Active,1996-07-29,2009-02-21
343,2282,POLICE OFFICER,Houston Police Department-HPD,60347.0,White,Full Time,Female,Active,2001-12-03,2002-12-03


In [14]:
# find the white female police officers or the Asian male firefighters

criteria_dept1 = employee['DEPARTMENT'] == 'Houston Police Department-HPD'
criteria_race1 = employee['RACE'] == 'White'
criteria_gender1 = employee['GENDER'] == 'Female'
criteria1 = criteria_dept1 & criteria_race1 & criteria_gender1

criteria_dept2 = employee['DEPARTMENT'] == 'Houston Fire Department (HFD)'
criteria_race2 = employee['RACE'] == 'Asian/Pacific Islander'
criteria_gender2 = employee['GENDER'] == 'Male'
criteria2 = criteria_dept2 & criteria_race2 & criteria_gender2

employee[criteria1 | criteria2].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
136,900,POLICE SERGEANT,Houston Police Department-HPD,81239.0,White,Full Time,Female,Active,1991-02-04,2005-02-12
185,1323,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,White,Full Time,Female,Active,1989-10-05,2005-03-26
227,1147,SENIOR POLICE TELECOMMUNICATOR,Houston Police Department-HPD,46675.0,White,Full Time,Female,Active,2011-03-21,2012-07-07
229,4397,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,White,Full Time,Female,Active,1996-07-29,2009-02-21
343,2282,POLICE OFFICER,Houston Police Department-HPD,60347.0,White,Full Time,Female,Active,2001-12-03,2002-12-03


In [15]:
# select all the males who have a salary below 40,000 or above 150,000

criteria_gender = employee['GENDER'] == 'Male'
criteria_salary = (employee['BASE_SALARY'] < 40000) | (employee['BASE_SALARY'] > 150000)
criteria = criteria_gender & criteria_salary

employee[criteria].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
11,3347,"CHIEF PHYSICIAN,MD",Health & Human Services,180416.0,Black or African American,Full Time,Male,Active,1987-05-22,1999-08-28
12,107,CUSTOMER SERVICE REPRESENTATIVE I,Public Works & Engineering-PWE,30347.0,Black or African American,Full Time,Male,Active,2015-11-16,2015-11-16
29,1859,UTILITY WORKER,Public Works & Engineering-PWE,29557.0,Black or African American,Full Time,Male,Active,2014-01-21,2014-01-21
31,7798,FIRE FIGHTER TRAINEE,Houston Fire Department (HFD),28024.0,,Full Time,Male,Active,2016-03-14,2016-03-14
39,3956,INVENTORY MANAGEMENT SUPERVISOR,Public Works & Engineering-PWE,38168.0,Black or African American,Full Time,Male,Active,2008-09-08,2015-04-11


### Use **`isin`** to test for membership in a group
**`isin`** is a powerful Series method that tests whether each value is a member of a given list.

In [16]:
# find all the females in three departments
depts = ['Houston Airport System (HAS)','Parks & Recreation','Solid Waste Management']
criteria = employee['DEPARTMENT'].isin(depts) & (employee['GENDER'] == 'Female')
employee[criteria].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
75,5811,ADMINISTRATIVE AIDE,Houston Airport System (HAS),36296.0,White,Full Time,Female,Active,1999-07-12,2003-08-23
92,9622,CUSTODIAN,Parks & Recreation,26125.0,Black or African American,Full Time,Female,Active,1993-10-02,1993-10-02
99,1810,LABORER,Houston Airport System (HAS),26125.0,Hispanic/Latino,Full Time,Female,Active,2001-11-19,2010-04-17
218,4272,ADMINISTRATIVE AIDE,Solid Waste Management,37045.0,Black or African American,Full Time,Female,Active,2014-04-28,2014-04-28
249,7641,STAFF ANALYST,Solid Waste Management,75041.0,Hispanic/Latino,Full Time,Female,Active,1992-10-21,2015-07-18
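
To see what **`isin`** is doing, it can help to compare it with the longer version written with `|`. This is a minimal sketch on a made-up Series; the department names are shortened placeholders, not the exact strings in the dataset.

```python
import pandas as pd

# Toy data, made up for illustration
dept = pd.Series(['Library', 'Parks', 'Police', 'Library', 'Fire'])

wanted = ['Library', 'Parks']
with_isin = dept.isin(wanted)
with_or = (dept == 'Library') | (dept == 'Parks')

# ~ inverts the test, selecting values that are NOT in the list
not_wanted = dept[~dept.isin(wanted)]
```

Both criteria select the same rows, but the `isin` version stays one line long no matter how many values are in the list.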


## Boolean indexing with .loc and .iloc
Notice how all the examples above select all the columns. You can select the columns you want by using our trusty indexers **`.iloc`** and **`.loc`**. You do this by passing the columns you desire after the **comma**. Typically, you only use **`.loc`** when doing boolean indexing and column selection because columns are easily identified by their string names.

Let's do some of the same boolean indexing as above but select only the columns that are involved in the criteria.

In [17]:
# select all the males who have a salary below 40,000 or above 150,000

criteria_gender = employee['GENDER'] == 'Male'
criteria_salary = (employee['BASE_SALARY'] < 40000) | (employee['BASE_SALARY'] > 150000)
criteria = criteria_gender & criteria_salary

employee.loc[criteria, ['GENDER', 'BASE_SALARY']].head()

Unnamed: 0,GENDER,BASE_SALARY
11,Male,180416.0
12,Male,30347.0
29,Male,29557.0
31,Male,28024.0
39,Male,38168.0


In [18]:
# find the white female police officers or the Asian male firefighters

criteria_dept1 = employee['DEPARTMENT'] == 'Houston Police Department-HPD'
criteria_race1 = employee['RACE'] == 'White'
criteria_gender1 = employee['GENDER'] == 'Female'
criteria1 = criteria_dept1 & criteria_race1 & criteria_gender1

criteria_dept2 = employee['DEPARTMENT'] == 'Houston Fire Department (HFD)'
criteria_race2 = employee['RACE'] == 'Asian/Pacific Islander'
criteria_gender2 = employee['GENDER'] == 'Male'
criteria2 = criteria_dept2 & criteria_race2 & criteria_gender2

employee.loc[criteria1 | criteria2, ['DEPARTMENT', 'RACE', 'GENDER']].head()

Unnamed: 0,DEPARTMENT,RACE,GENDER
136,Houston Police Department-HPD,White,Female
185,Houston Police Department-HPD,White,Female
227,Houston Police Department-HPD,White,Female
229,Houston Police Department-HPD,White,Female
343,Houston Police Department-HPD,White,Female
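
The examples above only use **`.loc`**. For completeness, here is a rough sketch of the same idea with **`.iloc`** on made-up data. It assumes a reasonably recent pandas where `to_numpy` is available; `.iloc` generally rejects a boolean Series, so the criteria is converted to a NumPy array first and the columns are given by integer position.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'dept': ['Police', 'Fire', 'Police'],
                    'gender': ['Female', 'Male', 'Male'],
                    'salary': [60000, 28000, 45000]})

criteria = toy['gender'] == 'Male'

# .loc: boolean Series for the rows, column names after the comma
by_label = toy.loc[criteria, ['dept', 'salary']]

# .iloc: boolean NumPy array for the rows, integer positions after the comma
by_position = toy.iloc[criteria.to_numpy(), [0, 2]]
```

Both selections return the same rows and columns, which is why `.loc` is usually the more readable choice here.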


## Using Boolean Selection to answer questions
Many interesting questions can be answered after filtering the data.

### Do men or women make more money?
Boolean indexing can help answer this question. We first create two new Series of salaries, one filtered for each gender, and then find the mean of each. On average, men make about $5,000 more.

In [19]:
# use .loc to do boolean and column selection
men = employee.loc[employee['GENDER'] == 'Male', 'BASE_SALARY']
women = employee.loc[employee['GENDER'] == 'Female', 'BASE_SALARY']

In [20]:
men.head()

2    45279.0
3    63166.0
4    56347.0
5    66614.0
6    71680.0
Name: BASE_SALARY, dtype: float64

In [21]:
women.head()

0     121862.0
1      26125.0
35     34923.0
36     60258.0
38     67499.0
Name: BASE_SALARY, dtype: float64

In [22]:
men.mean(), women.mean()

(57354.61191749427, 52168.3396880416)

### Do Hispanic men or Black women employees make more money?

In [23]:
criteria_hm = (employee['GENDER'] == 'Male') & (employee['RACE'] == 'Hispanic/Latino')
criteria_bw = (employee['GENDER'] == 'Female') & (employee['RACE'] == 'Black or African American')
hisp_men = employee.loc[criteria_hm, 'BASE_SALARY']
black_women = employee.loc[criteria_bw, 'BASE_SALARY']

In [24]:
hisp_men.mean(), black_women.mean()

(54782.81901840491, 48915.42123287671)

### Is there a better way to answer these last couple of questions?
Yes! The **groupby** method allows for grouping of these categorical variables and will be explained in greater detail in a future lesson.
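
As a small preview, here is a sketch on made-up data showing that grouping by a column gives the same mean as filtering for one value at a time. The column names mirror the employee dataset, but the values are invented.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'GENDER': ['Male', 'Female', 'Male', 'Female'],
                    'BASE_SALARY': [50000.0, 40000.0, 70000.0, 60000.0]})

# Boolean indexing: one filter per group
men_mean = toy.loc[toy['GENDER'] == 'Male', 'BASE_SALARY'].mean()

# groupby: every group in a single call
by_gender = toy.groupby('GENDER')['BASE_SALARY'].mean()
print(by_gender)
```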

### What are the ratios of males to females for the fire department and health and human services?
We first filter the data here and then use **`value_counts`** with the **`normalize`** parameter set to **`True`** which returns relative frequencies.

In [25]:
fd = employee.loc[employee['DEPARTMENT'] == 'Houston Fire Department (HFD)', 'GENDER']
hhs = employee.loc[employee['DEPARTMENT'] == 'Health & Human Services', 'GENDER']

In [26]:
fd.value_counts(normalize=True)

Male      0.945312
Female    0.054688
Name: GENDER, dtype: float64

In [27]:
hhs.value_counts(normalize=True)

Female    0.754545
Male      0.245455
Name: GENDER, dtype: float64
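
The normalized output gives each gender's share of the department rather than a ratio of one to the other. Here is a minimal sketch on made-up data of how the shares relate to the raw counts and how a ratio could be computed from them.

```python
import pandas as pd

# Toy data, made up for illustration
gender = pd.Series(['Male', 'Male', 'Male', 'Female'])

counts = gender.value_counts()                 # raw frequencies
shares = gender.value_counts(normalize=True)   # relative frequencies
manual = counts / counts.sum()                 # same numbers as normalize=True

# ratio of males to females from the raw counts
ratio = counts['Male'] / counts['Female']
```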

### Using boolean indexing to select an employee
Let's practice selecting a single employee by matching on the **`UNIQUE_ID`** column.

In [28]:
employee[employee['UNIQUE_ID'] == 8789]

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
3,8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25


### Selecting a single employee with the index
That was a little cumbersome. Let's improve that selection by taking advantage of the index. Many datasets have a column with an integer that uniquely identifies each row.  In database speak this column is called the table's **primary key**. The primary key allows for easy and direct access to each employee.

In the **employee** table, it appears that the first column, **UNIQUE_ID**, is the primary key. Most good datasets will have a **data dictionary** that describes each column of the table so you won't have to guess what the primary key is. The data dictionary (a.k.a. metadata) for the current dataset can be [found here.](http://data.ohouston.org/dataset/city-of-houston-current-employee-roster/resource/98448c04-e76f-4fa0-8916-12786e6e5883)

### Ensuring uniqueness
The most important aspect of a primary key is its uniqueness. Use the **`is_unique`** Series method.

In [29]:
employee['UNIQUE_ID'].is_unique

True

Use the **`set_index`** method to make the column **`UNIQUE_ID`** the new index.

In [30]:
employee = pd.read_csv('../data/employee.csv')
employee_idx = employee.set_index('UNIQUE_ID')
employee_idx.head()

UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


### Inspecting the new DataFrame output
The index is now meaningful. The index, which previously was just a range beginning at 0, is now the employee ID. **UNIQUE_ID** is now the **name** of the index. The index values are still displayed in **bold**, a reminder that they are part of the index and not a column.

In [31]:
# the index object has a name attribute
employee_idx.index.name

'UNIQUE_ID'

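The name of the index remains just above the index. You can remove it by setting it to **`None`**.

In [32]:
# older pandas versions also accepted `del employee_idx.index.name`,
# which may raise an AttributeError in recent releases
employee_idx.index.name = None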

In [33]:
# The name is gone
employee_idx.head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


### Setting the index on read
When first reading the dataset using **`read_csv`**, use the argument **`index_col`** to pass the name of the column you would like as your index.

In [34]:
employee_idx = pd.read_csv('../data/employee.csv', index_col='UNIQUE_ID')
employee_idx.head()

UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


## Selection with .loc vs boolean indexing
Once the ID is the index, selecting an employee by label with **`.loc`** is much nicer than building a boolean Series.

In [35]:
employee_idx.loc[8789]

POSITION_TITLE                   ENGINEER/OPERATOR
DEPARTMENT           Houston Fire Department (HFD)
BASE_SALARY                                  63166
RACE                                         White
EMPLOYMENT_TYPE                          Full Time
GENDER                                        Male
EMPLOYMENT_STATUS                           Active
HIRE_DATE                               1982-02-08
JOB_DATE                                1991-05-25
Name: 8789, dtype: object

In [36]:
# put the label in a list to return a DataFrame instead of a Series
employee_idx.loc[[8789]]

UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25


## Speed Difference between boolean indexing and selection by index label

In [37]:
%timeit employee[employee['UNIQUE_ID'] == 8789]

512 µs ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [38]:
%timeit employee_idx.loc[8789]

158 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
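
`%timeit` only works inside IPython or Jupyter. If you want to try a similar comparison in a plain Python script, a rough sketch with the standard library's `timeit` module might look like the following. The DataFrame is made up, and the timings you get will differ from the ones above. A plausible explanation for the gap is that the boolean version has to compare every value in the column, while the label lookup goes through the index.

```python
import timeit

import numpy as np
import pandas as pd

# Toy data, made up for illustration: 2,000 rows with a unique integer ID
n = 2000
toy = pd.DataFrame({'UNIQUE_ID': np.arange(n),
                    'BASE_SALARY': np.linspace(30000, 90000, n)})
toy_idx = toy.set_index('UNIQUE_ID')

# Time 200 repetitions of each selection
t_bool = timeit.timeit(lambda: toy[toy['UNIQUE_ID'] == 1500], number=200)
t_loc = timeit.timeit(lambda: toy_idx.loc[1500], number=200)

print(f'boolean indexing: {t_bool:.4f} s for 200 runs')
print(f'.loc by label:    {t_loc:.4f} s for 200 runs')
```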


# Your Turn

In [39]:
# reread the employee dataset before you begin.
employee = pd.read_csv('../data/employee.csv')

### Problem 1
<span  style="color:green; font-size:16px">Select all Asian employees that make more than 100,000 dollars.</span>

In [40]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">What percentage of Asian employees make more than 100,000 dollars?</span>

In [41]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">What is the ratio of males to females? What is the ratio of males to females for those that make more than 100,000? How about for those that make less than 30,000?</span>

In [42]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">What is the distribution of race? What is the distribution of race for those that make over 100,000?</span>

In [43]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Save the two distributions you found in problem 4 to variables. They should be Series. Divide the greater than 100k distribution by the other.</span>

In [44]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Select all females that are part of the Houston police department and all males that are in the Library department. Also select only the DEPARTMENT and GENDER columns.</span>

In [45]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Select all the White, Black and Hispanic employees that are in the Houston police department, Houston fire department and the Parks & Recreation department. Also select only the RACE and DEPARTMENT columns.</span>

In [46]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">What is the most common department for Black females? How about for Hispanic males?</span>

In [47]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Who makes more money, 'Black or African American' Females or White Males?</span>

In [48]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Set the index to be **`UNIQUE_ID`** and save the result to a new variable. Then use **`.loc`** to select employees 2440 and 480 and columns DEPARTMENT through GENDER.</span>

In [49]:
# your code here