## DataFrames 2 Module

In [4]:
import pandas as pd

In [10]:
emp = pd.read_csv("employees.csv").dropna(how="all")

In [13]:
emp.shape

(1000, 8)

In [14]:
emp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               855 non-null object
Start Date           1000 non-null object
Last Login Time      1000 non-null object
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    933 non-null object
Team                 957 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 70.3+ KB


Let's reduce memory usage by converting the low-cardinality columns to the category dtype. First, check how many unique values each column has.

In [18]:
emp.nunique()

First Name           200
Gender                 2
Start Date           972
Last Login Time      720
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64

In [24]:
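# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original DataFrame.
emp["Gender"] = emp["Gender"].fillna("No Gender")
emp["Senior Management"] = emp["Senior Management"].fillna("N/A")
emp["Team"] = emp["Team"].fillna("No Team")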

In [27]:
emp["Gender"] = emp["Gender"].astype("category")
emp["Senior Management"] = emp["Senior Management"].astype("category")
emp["Team"] = emp["Team"].astype("category")

In [28]:
emp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               1000 non-null category
Start Date           1000 non-null object
Last Login Time      1000 non-null object
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    1000 non-null category
Team                 1000 non-null category
dtypes: category(3), float64(1), int64(1), object(3)
memory usage: 50.4+ KB
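
The drop from 70.3+ KB to 50.4+ KB comes from storing each distinct label once and replacing the repeated strings with small integer codes. A minimal sketch of that effect on a hypothetical toy column (the data below is made up, not taken from employees.csv):

```python
import pandas as pd

# Hypothetical toy column: four distinct labels repeated many times.
teams = pd.Series(
    ["Finance", "Legal", "Sales", "Marketing"] * 250, name="Team", dtype="object"
)

# deep=True counts the bytes of the Python strings, not just the pointers.
as_object = teams.memory_usage(deep=True)
as_category = teams.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, which is why only Gender, Senior Management and Team were converted here.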


In [30]:
emp["Gender"].value_counts()

Female       431
Male         424
No Gender    145
Name: Gender, dtype: int64

In [35]:
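# pd.to_datetime parses the strings; recent pandas versions reject the unit-less "datetime64" dtype in astype.
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Last Login Time"] = pd.to_datetime(emp["Last Login Time"])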

In [36]:
emp.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2019-02-09 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2019-02-09 06:53:00,61933,4.17,True,No Team
2,Maria,Female,1993-04-23,2019-02-09 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2019-02-09 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2019-02-09 16:47:00,101004,1.389,True,Client Services


In [37]:
emp["Last Login Time"] = pd.to_datetime(emp["Last Login Time"])

In [38]:
emp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               1000 non-null category
Start Date           1000 non-null datetime64[ns]
Last Login Time      1000 non-null datetime64[ns]
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    1000 non-null category
Team                 1000 non-null category
dtypes: category(3), datetime64[ns](2), float64(1), int64(1), object(1)
memory usage: 50.4+ KB
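
When the date strings follow a known layout, it can help to state the format explicitly and decide what happens to values that do not parse. A small sketch, assuming month/day/year strings (the sample values and the format are assumptions, not read from employees.csv):

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately invalid.
raw = pd.Series(["8/6/1993", "3/31/1996", "not a date"])

# format= avoids guessing the layout; errors="coerce" turns bad values into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(parsed)
```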


In [39]:
emp.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2019-02-09 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2019-02-09 06:53:00,61933,4.17,True,No Team
2,Maria,Female,1993-04-23,2019-02-09 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2019-02-09 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2019-02-09 16:47:00,101004,1.389,True,Client Services


In [42]:
emp.sort_values("Start Date").tail()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
239,Lillian,No Gender,2016-05-12,2019-02-09 15:43:00,64164,17.612,False,Human Resources
444,,Male,2016-05-24,2019-02-09 21:17:00,76409,7.008,,Distribution
15,Lillian,Female,2016-06-05,2019-02-09 06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,2019-02-09 19:47:00,100705,16.961,True,Marketing
451,Terry,No Gender,2016-07-15,2019-02-09 00:29:00,140002,19.49,True,Marketing


Generating a Salary Rank for the employees

In [57]:
emp.insert(5, column="Salary Rank", value=emp["Salary"].rank(ascending=False).astype(int))

In [60]:
emp.sort_values(["Salary", "Salary Rank"], ascending= [False, False])

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
644,Katherine,Female,1996-08-13,2019-02-09 00:21:00,149908,1,18.912,False,Finance
429,Rose,Female,2015-05-28,2019-02-09 08:40:00,149903,2,5.630,False,Human Resources
828,Cynthia,Female,2006-07-12,2019-02-09 08:55:00,149684,3,7.864,False,Product
186,,Female,2005-02-23,2019-02-09 21:50:00,149654,4,1.825,,Sales
160,Kathy,Female,2000-03-18,2019-02-09 19:26:00,149563,5,16.991,True,Finance
740,Russell,No Gender,2009-05-09,2019-02-09 11:59:00,149456,6,3.533,False,Marketing
793,Andrea,Female,1999-07-22,2019-02-09 09:25:00,149105,7,13.707,True,Distribution
981,James,Male,1993-01-15,2019-02-09 17:19:00,148985,8,19.280,False,Legal
800,Clarence,Male,1989-08-05,2019-02-09 18:11:00,148941,9,11.517,False,Product
844,Maria,No Gender,1985-06-19,2019-02-09 01:48:00,148857,10,8.738,False,Legal
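
One thing to watch: Salary has 995 unique values across 1000 rows, so a few salaries are tied. By default __.rank()__ gives tied values the average of their positions (for example 2.5), and __.astype(int)__ then truncates that. A short sketch of the alternatives on made-up numbers:

```python
import pandas as pd

# Hypothetical salaries with one tie.
salaries = pd.Series([150000, 120000, 120000, 90000])

# Default method="average": tied values share the mean of their positions.
avg_rank = salaries.rank(ascending=False)

# method="min": tied values share the lowest position, leaving a gap afterwards.
min_rank = salaries.rank(ascending=False, method="min").astype(int)

# method="dense": like "min", but without gaps between ranks.
dense_rank = salaries.rank(ascending=False, method="dense").astype(int)

print(pd.DataFrame({"avg": avg_rank, "min": min_rank, "dense": dense_rank}))
```

Passing an explicit __method__ before converting to int keeps the ranks as whole numbers by construction.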


In [63]:
emp["Team"].value_counts()

Client Services         106
Finance                 102
Business Development    101
Marketing                98
Product                  95
Sales                    94
Engineering              92
Human Resources          91
Distribution             90
Legal                    88
No Team                  43
Name: Team, dtype: int64

In [88]:
whereFemale = emp["Gender"] == "Female"
whereEngr = emp["Team"] == "Engineering"
whereMgmt = emp["Senior Management"]
emp[(whereFemale) & (whereEngr) & (whereMgmt)].sort_values("Salary Rank", ascending = True)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
541,Ruby,Female,1999-05-01,2019-02-09 03:36:00,147362,19,7.851,True,Engineering
948,Ashley,Female,2006-03-31,2019-02-09 13:24:00,142410,54,11.048,True,Engineering
761,Jennifer,Female,2015-03-31,2019-02-09 19:43:00,132084,142,10.006,True,Engineering
633,Andrea,Female,2011-11-17,2019-02-09 14:37:00,123591,208,6.5,True,Engineering
467,Amy,Female,2002-06-19,2019-02-09 03:06:00,122897,214,8.222,True,Engineering
475,Stephanie,Female,1992-11-26,2019-02-09 00:54:00,122121,221,7.937,True,Engineering
30,Christina,Female,2002-08-06,2019-02-09 13:19:00,118780,249,9.096,True,Engineering
608,,Female,1993-10-24,2019-02-09 17:17:00,116236,263,17.274,,Engineering
113,Tina,Female,2009-06-12,2019-02-09 07:16:00,114767,282,3.711,True,Engineering
138,Ashley,Female,2006-05-25,2019-02-09 11:30:00,112238,304,6.03,True,Engineering
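
Note that whereMgmt above is the Senior Management column itself rather than a comparison, and that column also holds the "N/A" filler. Row 608, whose management status is blank, still appears in the result. One way to make the intent explicit is to compare against True, sketched here on a made-up frame (column names match the notebook, the rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows; "N/A" stands in for a missing management flag.
staff = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Female"],
    "Team": ["Engineering", "Engineering", "Engineering", "Sales"],
    "Senior Management": [True, "N/A", True, True],
})

is_female = staff["Gender"] == "Female"
is_engineer = staff["Team"] == "Engineering"
# .eq(True) yields a clean Boolean Series; "N/A" compares as False.
is_manager = staff["Senior Management"].eq(True)

# Combine conditions with & (and), | (or) and ~ (not).
selected = staff[is_female & is_engineer & is_manager]
print(selected)
```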


In [99]:
in1991 = (emp["Start Date"].dt.year == 1991)
emp[in1991]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
27,Scott,No Gender,1991-07-11,2019-02-09 18:58:00,122367,217,5.218,False,Legal
75,Bonnie,Female,1991-07-02,2019-02-09 01:27:00,104897,364,5.118,True,Human Resources
88,Donna,Female,1991-11-27,2019-02-09 13:59:00,64088,736,6.155,True,Legal
116,,Male,1991-06-22,2019-02-09 20:58:00,76189,624,18.988,,Legal
148,Patrick,No Gender,1991-07-14,2019-02-09 02:24:00,124488,196,14.837,True,Sales
166,,Female,1991-07-09,2019-02-09 18:52:00,42341,927,7.014,,Sales
172,Sara,Female,1991-09-23,2019-02-09 18:17:00,97058,436,9.402,False,Finance
220,,Female,1991-06-17,2019-02-09 12:49:00,71945,664,5.56,,Marketing
245,Victor,Male,1991-04-11,2019-02-09 07:44:00,70817,677,17.138,False,Engineering
277,Brenda,No Gender,1991-05-29,2019-02-09 06:32:00,82439,579,19.062,False,Sales


Notice that we filtered on the year alone. The __.dt__ accessor exposes the individual components of a datetime column:<br>
- pandas.Series.dt.year returns the year of each datetime.
- pandas.Series.dt.month returns the month of each datetime.
- pandas.Series.dt.day returns the day of each datetime.
- pandas.Series.dt.time returns the time of each datetime.
- pandas.Series.dt.hour returns the hour of each datetime.
- pandas.Series.dt.minute returns the minute of each datetime.
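
A quick sketch of these attributes side by side, using two made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps for illustration.
stamps = pd.Series(pd.to_datetime(["1991-07-11 18:58:00", "2016-06-05 06:09:00"]))

# Each .dt attribute returns a Series aligned with the original index.
parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month,
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
    "minute": stamps.dt.minute,
    "time": stamps.dt.time,  # datetime.time objects, stored as object dtype
})

print(parts)
```

Because __.dt.time__ produces plain Python time objects, a column converted this way no longer supports the __.dt__ accessor.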

In [105]:
emp["Last Login Time"] = emp["Last Login Time"].dt.time

In [108]:
emp.sort_values("Last Login Time", ascending= False).head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
607,,Male,1983-10-13,23:59:00,139754,85,12.74,,Sales
792,Anne,No Gender,1996-04-18,23:57:00,122762,215,9.564,False,Distribution
66,Nancy,Female,2012-12-15,23:57:00,125250,188,2.672,True,Business Development
349,Phyllis,Female,2005-11-24,23:57:00,140347,76,8.723,False,Sales
930,Nancy,Female,2001-09-10,23:57:00,85213,552,2.386,True,Marketing


### The .isin() Method

The __.isin()__ method checks whether each value in a column matches any of the given values. It takes a list (or other list-like) of values and returns a Boolean Series.

In [113]:
inAnne = emp["First Name"].isin(["Anne", "Nancy"])
emp[inAnne]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
50,Nancy,Female,2000-09-23,08:05:00,94976,458,13.83,True,Engineering
66,Nancy,Female,2012-12-15,23:57:00,125250,188,2.672,True,Business Development
262,Anne,Female,1986-07-16,14:08:00,69134,690,3.723,True,Engineering
292,Anne,Female,2000-03-07,06:45:00,44537,909,18.284,True,Client Services
555,Anne,Female,1996-10-26,20:09:00,71930,665,18.451,True,Product
595,Nancy,Female,1985-05-07,22:20:00,121006,236,3.512,True,Finance
627,Anne,Female,1984-11-21,12:30:00,128305,166,16.636,False,Marketing
792,Anne,No Gender,1996-04-18,23:57:00,122762,215,9.564,False,Distribution
930,Nancy,Female,2001-09-10,23:57:00,85213,552,2.386,True,Marketing
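
__.isin()__ is shorthand for chaining several equality checks with |. A minimal sketch on made-up names, including a missing value:

```python
import pandas as pd

# Hypothetical names; None stands in for a missing first name.
names = pd.Series(["Anne", "Nancy", "Douglas", None])

mask = names.isin(["Anne", "Nancy"])

# Same result as (names == "Anne") | (names == "Nancy").
longhand = (names == "Anne") | (names == "Nancy")

print(names[mask])   # rows matching either name
print(names[~mask])  # everything else, including the missing value
```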


### The .between() Method

In [118]:
mask = emp["Salary"].between(60000, 70000)
emp[mask].head(10)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,06:53:00,61933,756,4.17,True,No Team
6,Ruby,Female,1987-08-17,16:20:00,65476,724,10.012,True,Product
10,Louise,Female,1980-08-12,09:01:00,63241,743,15.132,True,No Team
20,Lois,No Gender,1995-04-22,19:18:00,64714,730,4.934,True,Legal
41,Christine,No Gender,2015-06-28,01:08:00,66582,713,11.308,True,Business Development
47,Kathy,Female,2005-06-22,04:51:00,66820,708,9.0,True,Client Services
57,Henry,Male,1996-06-26,01:44:00,64715,729,15.107,True,Human Resources
59,Irene,Female,1997-05-07,09:32:00,66851,707,11.279,False,Engineering
65,Steve,Male,2009-11-11,23:44:00,61310,758,12.428,True,Distribution
74,Thomas,Male,1995-06-04,14:24:00,62096,754,17.029,False,Marketing
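
By default __.between()__ includes both endpoints, so a salary of exactly 60000 or 70000 would match the mask above. The sketch below uses made-up salaries and assumes a pandas version (1.3 or later) where the _inclusive_ param accepts strings:

```python
import pandas as pd

# Hypothetical salaries sitting on and around the boundaries.
salaries = pd.Series([59999, 60000, 65000, 70000, 70001])

# Default inclusive="both": equivalent to (s >= 60000) & (s <= 70000).
both = salaries.between(60000, 70000)

# inclusive="neither" excludes the two endpoints.
neither = salaries.between(60000, 70000, inclusive="neither")

print(pd.DataFrame({"salary": salaries, "both": both, "neither": neither}))
```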


### The .duplicated() Method


This method checks every value in a column and returns a Boolean Series:<br>
- the first occurrence of a value is not considered a duplicate
- subsequent occurrences of that value are marked _True_ as duplicates
- the __keep__ param controls which occurrence is left unmarked: _"first"_ (the default) leaves the first occurrence as _False_, _"last"_ leaves the last occurrence as _False_, and passing _False_ marks every occurrence of a duplicated value as _True_
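
The three __keep__ settings are easiest to compare on a tiny made-up Series:

```python
import pandas as pd

# Hypothetical names: "Amy" appears three times, "Alan" once.
names = pd.Series(["Amy", "Alan", "Amy", "Amy"])

first = names.duplicated()             # keep="first": first "Amy" is not flagged
last = names.duplicated(keep="last")   # last "Amy" is not flagged
every = names.duplicated(keep=False)   # every "Amy" is flagged

print(pd.DataFrame({"name": names, "first": first, "last": last, "all": every}))
```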

In [127]:
emp[~emp["First Name"].duplicated()].sort_values("First Name").head(10)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Salary Rank,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,10:20:00,61602,757,11.849,True,Marketing
137,Adam,Male,2011-05-21,01:45:00,95327,453,15.12,False,Distribution
53,Alan,No Gender,2014-03-03,13:28:00,40341,953,17.578,True,Finance
372,Albert,Male,1997-02-01,16:20:00,67827,699,19.717,True,Engineering
425,Alice,Female,1986-05-02,01:50:00,51395,840,2.378,True,Finance
542,Amanda,Female,2004-08-01,13:32:00,80803,594,14.077,True,Distribution
467,Amy,Female,2002-06-19,03:06:00,122897,214,8.222,True,Engineering
118,Andrea,Female,2012-01-12,05:43:00,120204,241,9.557,False,Business Development
564,Andrew,Male,1985-03-29,18:57:00,43414,916,7.563,True,Client Services
8,Angela,Female,2005-11-22,06:29:00,95570,451,18.523,True,Engineering


The tilde (~) negates a Boolean Series in pandas, so the expression above keeps only the first occurrence of each first name.

### The .drop_duplicates() method

- This method has a _subset_ param which takes a column label or a list of column labels to consider when identifying duplicates
- It also has a _keep_ param which accepts _"first"_ (the default), _"last"_ or _False_
- Using _False_ removes every row that has a duplicate, including the first and last occurrences
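
A short sketch of these options on a made-up frame (the column names mirror the notebook, the rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows: "Amy" appears three times, twice on the same team.
df = pd.DataFrame({
    "First Name": ["Amy", "Alan", "Amy", "Amy"],
    "Team": ["Sales", "Legal", "Sales", "Finance"],
})

by_name_first = df.drop_duplicates(subset=["First Name"])              # keeps rows 0, 1
by_name_last = df.drop_duplicates(subset=["First Name"], keep="last")  # keeps rows 1, 3
by_name_none = df.drop_duplicates(subset=["First Name"], keep=False)   # keeps row 1 only

# With two columns in subset, a row is a duplicate only if both values repeat.
by_name_team = df.drop_duplicates(subset=["First Name", "Team"])       # keeps rows 0, 1, 3

print(by_name_team)
```

The method returns a new DataFrame and keeps the original index labels, so the surviving rows can still be traced back to the source.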

### The .unique() and .nunique() methods

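We use the __.unique()__ method to get the distinct values in a column. It returns an array-like object rather than a list (a NumPy array for most dtypes, a Categorical for category columns), with values in order of first appearance and NaN included if present.<br>
Note: The __.nunique()__ method does not count NaN as a unique value unless you pass _dropna=False_.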

In [131]:
unqTeam = emp["Team"].unique()

In [133]:
list(unqTeam)

['Marketing',
 'No Team',
 'Finance',
 'Client Services',
 'Legal',
 'Product',
 'Engineering',
 'Business Development',
 'Human Resources',
 'Sales',
 'Distribution']
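
The difference between the two methods, and the effect of _dropna_, can be sketched on a made-up Series with one missing value:

```python
import pandas as pd

# Hypothetical team labels; None stands in for a missing team.
teams = pd.Series(["Marketing", None, "Finance", "Marketing"])

distinct = teams.unique()                     # includes the missing value
n_default = teams.nunique()                   # missing value not counted
n_with_nan = teams.nunique(dropna=False)      # missing value counted

print(list(distinct))
print(n_default, n_with_nan)
```

In this notebook the missing teams were already replaced with "No Team", which is why that label shows up in the list above as an ordinary value.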