# DataFrames II: Filtering Data

In [2]:
import pandas as pd

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` function converts a **Series** to hold datetime values.
- The `format` parameter tells pandas what format the date or time strings are stored in.
- We pass directives designating the segments of the string. For example, `%m` means "month" and `%d` means "day".
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.

In [3]:
employees = pd.read_csv("employees.csv")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


#### Starting with a new dataset, load it into a DataFrame and use descriptive functions to get an overview of aspects such as NaN values and column-wise dtypes.
##### Although some column headers refer to dates and times, those columns still have the `object` dtype.
##### The column-wise non-null counts show that a few columns contain NaN values that need to be accounted for.

In [4]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


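##### Also, using `nunique` to check whether any columns can be declared as category
Columns with only a handful of distinct values (such as `Gender`) are candidates for the `category` dtype. As observed earlier, the date and time columns also need to be converted to their respective dtypes.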

In [5]:
employees.nunique()

First Name           200
Gender                 2
Start Date           972
Last Login Time      720
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64

##### Converting Date & Timestamp columns
More info on [datetime formatting](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)

In [6]:
employees["Start Date"] = pd.to_datetime(employees["Start Date"],format="%m/%d/%Y")

In [7]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  933 non-null    object        
 7   Team               957 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 62.6+ KB


In [8]:
# NOTE: %H is the 24-hour directive, so %p (AM/PM) is matched but ignored here; %I would honor it
employees["Last Login Time"] = pd.to_datetime(employees["Last Login Time"], format="%H:%M %p").dt.time

In [9]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  933 non-null    object        
 7   Team               957 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 62.6+ KB
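One caveat about the time format used above: in `strptime`-style parsing, the `%p` (AM/PM) directive only shifts the hour when the hour is parsed with the 12-hour directive `%I`. With `%H` the marker is matched but has no effect, which is why `1:00 PM` appears as `01:00:00` in the outputs of this notebook. A minimal sketch of the difference, using a small hand-made Series (the three strings are copied from the first rows of the dataset; the variable names are illustrative only):

```python
import datetime as dt
import pandas as pd

logins = pd.Series(["12:42 PM", "6:53 AM", "1:00 PM"])

# %H is the 24-hour directive, so the AM/PM marker does not shift the hour
parsed_24h = pd.to_datetime(logins, format="%H:%M %p").dt.time

# %I is the 12-hour directive, so %p is honored
parsed_12h = pd.to_datetime(logins, format="%I:%M %p").dt.time

print(parsed_24h.tolist())
print(parsed_12h.tolist())
```

The time-based filters later in this notebook were run on the `%H` parsing, so their outputs would differ if the format were switched to `%I`.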


##### Converting the dtype of the Senior Management column to bool

In [10]:
employees["Senior Management"] = employees["Senior Management"].astype(bool)

#### Notice that the DataFrame's memory usage decreases as we convert the dtypes

In [11]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 55.8+ KB
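Note that the non-null count for `Senior Management` rose from 933 to 1000 after the conversion: `astype(bool)` turns missing values into `True`, because `NaN` is truthy. If the missing entries should stay missing, the nullable `"boolean"` dtype is one possible alternative. A small sketch on made-up values (not the notebook's data):

```python
import numpy as np
import pandas as pd

flags = pd.Series([True, False, np.nan], dtype=object)

as_bool = flags.astype(bool)           # NaN becomes True
as_nullable = flags.astype("boolean")  # NaN becomes <NA>

print(as_bool.tolist())
print(as_nullable.isna().sum())
```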


In [12]:
employees["Gender"] = employees["Gender"].astype("category")
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    category      
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 49.1+ KB
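The `+` in `memory usage: 49.1+ KB` signals that `info()` is reporting a lower bound: the strings held in `object` columns are not measured. To see the real effect of a `category` conversion, `memory_usage(deep=True)` can be compared before and after. A rough sketch on synthetic data (the Series below is invented for illustration):

```python
import pandas as pd

# 1000 string values with only two distinct labels
gender = pd.Series(["Male", "Female"] * 500, dtype=object)
as_category = gender.astype("category")

# deep=True also counts the memory held by the Python string objects
object_bytes = gender.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row.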


#### Doing it all in one cell: defining the date formatting in the `read_csv` call itself (the `date_format` parameter requires pandas 2.0 or later)

In [13]:
employees = pd.read_csv("employees.csv",parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees["Last Login Time"] = pd.to_datetime(employees["Last Login Time"],format="%H:%M %p").dt.time
employees["Senior Management"] = employees["Senior Management"].astype(bool)
employees["Gender"] = employees["Gender"].astype("category")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [14]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    category      
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 49.1+ KB


## Filter A DataFrame Based On A Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc)
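As a minimal illustration of the points above, the sketch below builds a Boolean Series (often called a mask) on a tiny made-up DataFrame and uses it to filter. The `staff` frame and its values are hypothetical:

```python
import pandas as pd

staff = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cara"],
    "Salary": [120000, 85000, 101000],
})

# The comparison yields one True/False value per row, aligned on the index
high_paid = staff["Salary"] > 100000
print(high_paid.tolist())

# Passing the mask in square brackets keeps only the rows marked True
print(staff[high_paid])

# True counts as 1, so summing the mask gives the number of matches
print(high_paid.sum())
```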

In [15]:
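# Finding employees in the database based on gender
employees["Gender"] == "Female"               # the Boolean Series (mask) on its own
employees[employees["Gender"] == "Female"]    # rows where the mask is True
employees[employees["Gender"] == "Male"]      # only this last expression is displayed below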

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [16]:
# Find all employees in the Finance team
employees[employees["Team"]=="Finance"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,07:13:00,41426,14.543,True,Finance
46,Bruce,Male,2009-11-28,10:47:00,114796,6.796,False,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,1998-07-27,11:12:00,137144,10.081,False,Finance
954,Joe,Male,1980-01-19,04:06:00,119667,1.148,True,Finance
987,Gloria,Female,2014-12-08,05:08:00,136709,10.331,True,Finance
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


##### If a column's dtype is already boolean, the column itself can be passed as the filter; no comparison is needed

In [17]:
# Employees in senior management
employees[employees["Senior Management"]]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
6,Ruby,Female,1987-08-17,04:20:00,65476,10.012,True,Product
...,...,...,...,...,...,...,...,...
991,Rose,Female,2002-08-25,05:12:00,134505,11.051,True,Marketing
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance
993,Tina,Female,1997-05-15,03:53:00,56450,19.040,True,Engineering
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing


In [18]:
# Find employees with a salary greater than USD 100,000
employees[employees["Salary"]>100000]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
9,Frances,Female,2002-08-08,06:51:00,139852,7.524,True,Business Development
...,...,...,...,...,...,...,...,...
990,Robin,Female,1987-07-24,01:35:00,100765,10.982,True,Client Services
991,Rose,Female,2002-08-25,05:12:00,134505,11.051,True,Marketing
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution


In [19]:
# Employees with a bonus below 5%, highest bonus first
employees[employees["Bonus %"] < 5].sort_values(by=["Bonus %"], ascending=False)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
79,Bonnie,Female,1988-11-13,03:30:00,115814,4.990,False,Product
343,Ronald,Male,2009-02-24,02:09:00,96633,4.990,True,Engineering
204,Willie,Male,2006-06-06,09:45:00,55281,4.935,True,Marketing
20,Lois,,1995-04-22,07:18:00,64714,4.934,True,Legal
840,Lillian,Female,2002-08-26,08:53:00,103854,4.924,True,Distribution
...,...,...,...,...,...,...,...,...
746,Gloria,Female,2004-08-19,10:31:00,46602,1.027,True,Business Development
527,Helen,,1993-12-02,01:42:00,45724,1.022,False,Product
912,Joe,Male,1998-12-08,10:28:00,126120,1.020,False,
652,Willie,Male,2009-12-05,05:39:00,141932,1.017,True,Engineering


In [20]:
# Rows 3 to 5 of the same sorted result (iloc is position-based and excludes the end position)
employees[employees["Bonus %"] < 5].sort_values(by=["Bonus %"], ascending=False).iloc[2:5]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
204,Willie,Male,2006-06-06,09:45:00,55281,4.935,True,Marketing
20,Lois,,1995-04-22,07:18:00,64714,4.934,True,Legal
840,Lillian,Female,2002-08-26,08:53:00,103854,4.924,True,Distribution


In [21]:
# Filtering by dates
employees[employees["Start Date"] < "1985-01-01"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,1980-08-12,09:01:00,63241,15.132,True,
12,Brandon,Male,1980-12-01,01:08:00,112807,17.492,True,Human Resources
18,Diana,Female,1981-10-23,10:27:00,132940,19.082,False,Client Services
28,Terry,Male,1981-11-27,06:30:00,124008,13.464,True,Client Services
37,Linda,Female,1981-10-19,08:49:00,57427,9.557,True,Client Services
...,...,...,...,...,...,...,...,...
982,Rose,Female,1982-04-06,10:43:00,91411,8.639,True,Human Resources
983,John,Male,1982-12-23,10:35:00,146907,11.738,False,Engineering
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
986,Donna,Female,1982-11-26,07:04:00,82871,17.999,False,Marketing


#### Filtering by time of day

In [22]:
# The standard-library datetime module provides the time objects used in the comparisons
import datetime as dt

In [23]:
employees[employees["Last Login Time"] < dt.time(12, 0, 0)]  # before 12:00 (not displayed)
employees[employees["Last Login Time"] > dt.time(10, 0)]     # after 10:00; only this result is displayed

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
13,Gary,Male,2008-01-27,11:40:00,109831,5.831,False,Sales
18,Diana,Female,1981-10-23,10:27:00,132940,19.082,False,Client Services
...,...,...,...,...,...,...,...,...
975,Susan,Female,1995-04-07,10:05:00,92436,12.467,False,Sales
980,Kimberly,Female,2013-01-26,12:57:00,46233,8.862,True,Engineering
982,Rose,Female,1982-04-06,10:43:00,91411,8.639,True,Human Resources
983,John,Male,1982-12-23,10:35:00,146907,11.738,False,Engineering


## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.

In [27]:
# Find all male employees in the Marketing team
is_male_emp = employees["Gender"] == "Male"
is_in_marketing = employees["Team"] == "Marketing"
employees[is_male_emp & is_in_marketing].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
21,Matthew,Male,1995-09-05,02:12:00,100612,13.645,False,Marketing
26,Craig,Male,2000-02-27,07:45:00,37598,7.757,True,Marketing
74,Thomas,Male,1995-06-04,02:24:00,62096,17.029,False,Marketing
77,Charles,Male,2004-09-14,08:13:00,107391,1.26,True,Marketing


In [28]:
employees["Team"].value_counts()

Team
Client Services         106
Finance                 102
Business Development    101
Marketing                98
Product                  95
Sales                    94
Engineering              92
Human Resources          91
Distribution             90
Legal                    88
Name: count, dtype: int64

In [48]:
# Find female employees in the Engineering team with a salary of at least USD 100,000
employees[(employees["Gender"]=="Female") & (employees["Team"]=="Engineering") & (employees["Salary"]>=100000)].head(7)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
30,Christina,Female,2002-08-06,01:19:00,118780,9.096,True,Engineering
113,Tina,Female,2009-06-12,07:16:00,114767,3.711,True,Engineering
122,Christina,Female,2012-04-13,02:04:00,110169,13.892,True,Engineering
138,Ashley,Female,2006-05-25,11:30:00,112238,6.03,True,Engineering
214,Julie,Female,1989-07-23,01:52:00,109588,3.55,False,Engineering
467,Amy,Female,2002-06-19,03:06:00,122897,8.222,True,Engineering
475,Stephanie,Female,1992-11-26,12:54:00,122121,7.937,True,Engineering


## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.

In [52]:
# Find employees who are in senior management OR who started before 1990
is_mngmnt = employees["Senior Management"]
is_before1990 = employees["Start Date"]<"1990-01-01"
employees[is_mngmnt|is_before1990]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance
993,Tina,Female,1997-05-15,03:53:00,56450,19.040,True,Engineering
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance


#### For multiple conditions, it's best to define each Boolean condition first and then combine them in the final query, taking care to place the parentheses correctly
#### NOTE that `(A and B) or C` has a logically different meaning from `A and (B or C)`
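To make the grouping concrete, the sketch below applies both groupings to a three-row, made-up DataFrame (the `staff` frame, its `Year` column, and the condition names `a`, `b`, `c` are hypothetical). Individual comparisons also need their own parentheses when written inline, because `&` and `|` bind more tightly than `==` or `>` in Python.

```python
import pandas as pd

staff = pd.DataFrame({
    "First Name": ["Robert", "Robert", "Tina"],
    "Team": ["Client Services", "Sales", "Sales"],
    "Year": [1994, 2017, 2017],
})

a = staff["First Name"] == "Robert"
b = staff["Team"] == "Client Services"
c = staff["Year"] > 2016

# (Robert in Client Services) OR (anyone hired after 2016)
grouped_left = staff[(a & b) | c]

# Robert AND (in Client Services OR hired after 2016)
grouped_right = staff[a & (b | c)]

print(grouped_left["First Name"].tolist())
print(grouped_right["First Name"].tolist())
```

The first grouping keeps Tina (she matches the date condition alone), while the second drops her because every row must also be named Robert.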

In [56]:
# Employees named Robert in Client Services, OR anyone with a start date after 2016-06-01
con1 = employees["First Name"]=="Robert"
con2 = employees["Team"]=="Client Services"
con3 = employees["Start Date"]>"2016-06-01"


In [57]:
employees[(con1 & con2)|con3]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
387,Robert,Male,1994-10-29,04:26:00,123294,19.894,False,Client Services
451,Terry,,2016-07-15,12:29:00,140002,19.49,True,Marketing


## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.

#### `isin()` is more concise than chaining several `==` comparisons with `|`, and it can also be faster on larger datasets when checking against many values.
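A short sketch, on invented values, showing that `isin` produces the same mask as the chained `==` / `|` form. Missing values simply come out as False in both:

```python
import pandas as pd

team = pd.Series(["Legal", "Finance", "Sales", None, "Product"], dtype=object)

via_isin = team.isin(["Legal", "Sales", "Product"])
via_or = (team == "Legal") | (team == "Sales") | (team == "Product")

print(via_isin.tolist())
print(via_or.tolist())
```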

In [59]:
# Find employees from any of the Legal, Sales or Product teams
employees[employees["Team"].isin(["Legal","Sales","Product"])]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,04:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,03:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,11:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
...,...,...,...,...,...,...,...,...
981,James,Male,1993-01-15,05:19:00,148985,19.280,False,Legal
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
989,Justin,,1991-02-10,04:58:00,38344,3.794,False,Legal
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product


## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.
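A minimal sketch of both methods on a hand-made Series (the names are illustrative). Both `NaN` and `None` count as missing, and `isna` / `notna` are aliases for the same methods:

```python
import numpy as np
import pandas as pd

names = pd.Series(["Maria", np.nan, "Jerry", None])

missing = names.isnull()    # True where the value is missing
present = names.notnull()   # the exact inverse of isnull

print(missing.tolist())
print(present.tolist())
print(missing.sum())        # number of missing entries
```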

In [60]:
# Find all records where the first name is missing but Team info is available
no_name = employees["First Name"].isnull()
team_details = employees["Team"].notnull()
employees[no_name & team_details]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
25,,Male,2012-10-08,01:12:00,37076,18.576,True,Client Services
39,,Male,2016-01-29,02:33:00,122173,7.797,True,Client Services
51,,,2011-12-17,08:29:00,41126,14.009,True,Sales
62,,Female,2007-06-12,05:25:00,58112,19.414,True,Marketing
116,,Male,1991-06-22,08:58:00,76189,18.988,True,Legal
149,,Female,2014-08-17,02:00:00,86230,8.578,True,Distribution
157,,Female,2005-07-27,08:32:00,79536,14.443,True,Product
165,,Female,2014-03-23,01:28:00,59148,9.061,True,Legal
166,,Female,1991-07-09,06:52:00,42341,7.014,True,Sales


## The between Method
- The `between` method returns True if a **Series** value falls within the given range; both endpoints are included by default.

In [None]:
# Salary between USD 80K & USD 90K
employees[(employees["Salary"]>=80000) & (employees["Salary"]<=90000)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
19,Donna,Female,2010-07-22,03:48:00,81014,1.894,False,Product
31,Joyce,,2005-02-20,02:40:00,88657,12.752,False,Product
35,Theresa,Female,2006-10-10,01:12:00,85182,16.675,False,Sales
45,Roger,Male,1980-04-17,11:32:00,88010,13.886,True,Sales
54,Sara,Female,2007-08-15,09:23:00,83677,8.999,False,Engineering
...,...,...,...,...,...,...,...,...
930,Nancy,Female,2001-09-10,11:57:00,85213,2.386,True,Marketing
956,Beverly,Female,1986-10-17,12:51:00,80838,8.115,False,Engineering
963,Ann,Female,1994-09-23,11:15:00,89443,17.940,True,Sales
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal


##### The `between` method expresses the same filter more concisely

In [64]:
employees[employees["Salary"].between(80000,90000)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
19,Donna,Female,2010-07-22,03:48:00,81014,1.894,False,Product
31,Joyce,,2005-02-20,02:40:00,88657,12.752,False,Product
35,Theresa,Female,2006-10-10,01:12:00,85182,16.675,False,Sales
45,Roger,Male,1980-04-17,11:32:00,88010,13.886,True,Sales
54,Sara,Female,2007-08-15,09:23:00,83677,8.999,False,Engineering
...,...,...,...,...,...,...,...,...
930,Nancy,Female,2001-09-10,11:57:00,85213,2.386,True,Marketing
956,Beverly,Female,1986-10-17,12:51:00,80838,8.115,False,Engineering
963,Ann,Female,1994-09-23,11:15:00,89443,17.940,True,Sales
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
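The endpoint behaviour of `between` can be adjusted with its `inclusive` parameter, which accepts `"both"` (the default), `"neither"`, `"left"`, or `"right"` in pandas 1.3 and later. A small sketch on made-up salaries:

```python
import pandas as pd

salary = pd.Series([80000, 85000, 90000, 95000])

both = salary.between(80000, 90000)                          # 80000 <= x <= 90000
neither = salary.between(80000, 90000, inclusive="neither")  # 80000 <  x <  90000
left_only = salary.between(80000, 90000, inclusive="left")   # 80000 <= x <  90000

print(both.tolist())
print(neither.tolist())
print(left_only.tolist())
```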


##### Another usage example - find employees with lowest bonus %

In [67]:
employees["Bonus %"].describe()

count    1000.000000
mean       10.207555
std         5.528481
min         1.015000
25%         5.401750
50%         9.838500
75%        14.838000
max        19.944000
Name: Bonus %, dtype: float64

In [None]:
# finding employees with a bonus % between 1 and 5 (integer and float bounds give the same result)
employees[employees["Bonus %"].between(1,5)]
employees[employees["Bonus %"].between(1.0,5.0)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
19,Donna,Female,2010-07-22,03:48:00,81014,1.894,False,Product
20,Lois,,1995-04-22,07:18:00,64714,4.934,True,Legal
...,...,...,...,...,...,...,...,...
976,Denise,Female,1992-10-19,05:42:00,137954,4.195,True,Legal
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
989,Justin,,1991-02-10,04:58:00,38344,3.794,False,Legal
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing


In [72]:
# find employees who started working in 1995
employees[employees["Start Date"].between("1995-01-01","1995-12-31")].head(7)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
20,Lois,,1995-04-22,07:18:00,64714,4.934,True,Legal
21,Matthew,Male,1995-09-05,02:12:00,100612,13.645,False,Marketing
74,Thomas,Male,1995-06-04,02:24:00,62096,17.029,False,Marketing
80,Gerald,,1995-03-17,12:50:00,137126,15.602,True,Sales
117,Steven,Male,1995-03-01,03:03:00,109095,9.494,False,Finance
132,Carlos,Male,1995-01-04,07:02:00,146670,10.763,False,Human Resources
136,Henry,Male,1995-04-24,04:18:00,43542,19.687,False,Legal


In [75]:
# find employees whose last login time is between 10:00 and 12:00
employees[employees["Last Login Time"].between(dt.time(10, 0), dt.time(12, 0))].sort_values(by=["Last Login Time"])

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
647,Donald,Male,1988-04-06,10:00:00,122920,5.320,False,
739,Carlos,Male,1981-01-25,10:00:00,138598,14.737,False,Sales
72,Bobby,Male,2007-05-07,10:01:00,54043,3.833,False,Product
676,Annie,Female,1992-06-06,10:04:00,138925,9.801,True,Marketing
764,Roger,Male,1988-05-02,10:04:00,115582,15.343,True,Sales
...,...,...,...,...,...,...,...,...
349,Phyllis,Female,2005-11-24,11:57:00,140347,8.723,False,Sales
302,Adam,Male,2007-07-05,11:59:00,71276,5.027,True,Human Resources
607,,Male,1983-10-13,11:59:00,139754,12.740,True,Sales
740,Russell,,2009-05-09,11:59:00,149456,3.533,False,Marketing


#### Using only specific date or time components to filter data
We use the `dt` accessor to work with individual components (year, month, and so on) of a datetime column.

In [77]:
# find employees hired in August 1995 (compare on the year-month period only)
employees[employees["Start Date"].dt.to_period('M')=="1995-08"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
261,Marie,Female,1995-08-06,01:58:00,100308,13.677,False,Product
350,Thomas,,1995-08-31,08:31:00,41549,3.95,False,Sales
525,Steve,Male,1995-08-22,06:58:00,67780,9.54,True,Human Resources
661,Craig,Male,1995-08-21,02:38:00,123876,4.225,False,Engineering
869,Matthew,Male,1995-08-03,05:39:00,135352,7.986,True,Business Development
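An equivalent way to express a year-and-month filter is to compare `dt.year` and `dt.month` separately. The sketch below checks, on three invented dates, that this gives the same mask as the `to_period` approach:

```python
import pandas as pd

start = pd.Series(pd.to_datetime(["1995-08-06", "1995-09-05", "1996-08-31"]))

# Compare the individual components exposed by the dt accessor
in_aug_1995 = (start.dt.year == 1995) & (start.dt.month == 8)

# Compare whole monthly periods instead
same_via_period = start.dt.to_period("M") == pd.Period("1995-08", freq="M")

print(in_aug_1995.tolist())
print(same_via_period.tolist())
```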


## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass False to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a Boolean **Series**: True becomes False, and False becomes True.
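The effect of each `keep` option is easiest to see on a very small Series. The sketch below uses five made-up names:

```python
import pandas as pd

names = pd.Series(["Douglas", "Maria", "Douglas", "Jerry", "Douglas"])

first_kept = names.duplicated(keep="first")  # first Douglas is the non-duplicate
last_kept = names.duplicated(keep="last")    # last Douglas is the non-duplicate
all_marked = names.duplicated(keep=False)    # every Douglas is flagged

# Inverting the keep=False mask leaves only values that occur exactly once
appear_once = names[~all_marked]

print(first_kept.tolist())
print(last_kept.tolist())
print(all_marked.tolist())
print(appear_once.tolist())
```

Missing values are treated as equal to each other, which is why rows with a missing first name show up among the duplicates in the outputs of this notebook.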

In [82]:
# a quick look at how often each first name repeats
(employees["First Name"].value_counts()).to_dict()

{'Marilyn': 11,
 'Jeremy': 10,
 'Todd': 10,
 'Barbara': 10,
 'Irene': 9,
 'Kathy': 9,
 'Steven': 9,
 'Cynthia': 9,
 'Sarah': 9,
 'Rose': 9,
 'Clarence': 8,
 'Carl': 8,
 'Julie': 8,
 'Bobby': 8,
 'Harry': 8,
 'Ruby': 8,
 'Andrea': 8,
 'Gloria': 8,
 'Alice': 8,
 'Justin': 8,
 'Michael': 7,
 'Linda': 7,
 'Thomas': 7,
 'Scott': 7,
 'Peter': 7,
 'Robin': 7,
 'James': 7,
 'Robert': 7,
 'Harold': 7,
 'Russell': 7,
 'Ruth': 7,
 'Marie': 7,
 'Brandon': 7,
 'Shawn': 7,
 'Terry': 6,
 'Kimberly': 6,
 'Lillian': 6,
 'Jerry': 6,
 'Maria': 6,
 'Beverly': 6,
 'Kenneth': 6,
 'Patricia': 6,
 'Shirley': 6,
 'Mary': 6,
 'Debra': 6,
 'Frank': 6,
 'Ernest': 6,
 'Brenda': 6,
 'Stephen': 6,
 'Albert': 6,
 'Deborah': 6,
 'Ralph': 6,
 'Amanda': 6,
 'Ann': 6,
 'Jonathan': 6,
 'Lisa': 6,
 'Randy': 6,
 'Gregory': 6,
 'Victor': 6,
 'Roger': 6,
 'Gerald': 6,
 'Janice': 6,
 'Henry': 6,
 'Bonnie': 6,
 'Lois': 6,
 'Charles': 5,
 'Johnny': 5,
 'Douglas': 5,
 'Christopher': 5,
 'Diana': 5,
 'Stephanie': 5,
 'Louise': 5,


In [83]:
employees[employees["First Name"].duplicated(keep="first")]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
23,,Male,2012-06-14,04:19:00,125792,5.042,True,
25,,Male,2012-10-08,01:12:00,37076,18.576,True,Client Services
32,,Male,1998-08-21,02:27:00,122340,6.417,True,
34,Jerry,Male,2004-01-10,12:56:00,95734,19.096,False,Client Services
39,,Male,2016-01-29,02:33:00,122173,7.797,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [89]:
employees[employees["First Name"].duplicated(keep="last")]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
959,Albert,Male,1992-09-19,02:35:00,45094,5.850,True,Business Development
960,Stephen,Male,1989-10-29,11:34:00,93997,18.093,True,Business Development
970,Alice,Female,1988-09-03,08:54:00,63571,15.397,True,Product
973,Russell,Male,2013-05-10,11:08:00,137359,11.105,False,Business Development


In [85]:
employees[employees["First Name"]=="Douglas"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
217,Douglas,Male,1999-09-03,04:00:00,83341,1.015,True,Client Services
322,Douglas,Male,2002-01-08,06:42:00,41428,14.372,False,Product
667,Douglas,,2009-02-04,02:03:00,104496,14.771,True,Marketing
835,Douglas,Male,2007-08-04,05:23:00,132175,2.28,False,Engineering


In [90]:
employees["First Name"].nunique()

200

## The drop_duplicates Method
- The `drop_duplicates` method returns a new **DataFrame** with duplicate rows removed.
- By default, it will remove a row if *all* of its values are shared with another row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to `subset` parameter to look for duplicates across multiple columns.
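A brief sketch of these options on a four-row, made-up DataFrame (the `staff` frame is hypothetical). The `keep` parameter works the same way as it does for `duplicated`:

```python
import pandas as pd

staff = pd.DataFrame({
    "First Name": ["Douglas", "Douglas", "Maria", "Douglas"],
    "Team": ["Marketing", "Product", "Finance", "Marketing"],
})

# Default: a row is dropped only if every value matches an earlier row
no_exact_repeats = staff.drop_duplicates()

# Look for duplicates in one column only, keeping the first occurrence
one_per_name = staff.drop_duplicates(subset=["First Name"])

# Keep the last occurrence instead
last_per_name = staff.drop_duplicates(subset=["First Name"], keep="last")

# keep=False drops every row whose name is repeated anywhere
only_unique_names = staff.drop_duplicates(subset=["First Name"], keep=False)

print(no_exact_repeats.index.tolist())
print(one_per_name.index.tolist())
print(last_per_name.index.tolist())
print(only_unique_names.index.tolist())
```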

## The unique and nunique Methods
- The `unique` method on a **Series** returns an array of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter of `nunique` configures whether missing (`NaN`) values are excluded from the count (the default) or included.
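A minimal sketch of the difference, on invented values. Note that `unique` includes a missing value in its result, while `nunique` leaves it out unless `dropna=False` is passed:

```python
import numpy as np
import pandas as pd

team = pd.Series(["Finance", "Legal", np.nan, "Finance"])

distinct_values = team.unique()             # distinct values, missing value included
count_default = team.nunique()              # missing value excluded by default
count_with_nan = team.nunique(dropna=False)

print(distinct_values)
print(count_default, count_with_nan)

# On a DataFrame, nunique returns one count per column
frame = pd.DataFrame({"Team": team, "Salary": [1, 2, 3, 1]})
print(frame.nunique())
```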