___
# **SF Salaries Data Exploration**

by: [Anggi BK](https://github.com/anggibudik/)

Part of a quick exercise (pandas section) for the **Python for Data Science and Machine Learning** Bootcamp by Jose Portilla on Udemy.

The [SF salaries dataset](https://www.kaggle.com/kaggle/sf-salaries) is obtained from Kaggle.
___

In [1]:
import pandas as pd

Read the CSV file into a DataFrame

In [2]:
sal = pd.read_csv('samples/04-Salaries.csv')

In [3]:
sal.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [4]:
# Check the number of entries and the non-null count of each column

sal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
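
The `info()` output above shows that `Notes` and `Status` contain no non-null values and that `Benefits` is missing for roughly a quarter of the rows. A minimal sketch of counting missing values per column directly, assuming a small made-up frame (`toy` and all of its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the salaries table; values are invented.
toy = pd.DataFrame({
    'BasePay': [100.0, np.nan, 300.0],
    'Benefits': [np.nan, np.nan, 50.0],
    'Notes': [np.nan, np.nan, np.nan],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing = toy.isna().sum()
print(missing)

# Columns with no data at all (candidates for dropping).
empty_cols = missing[missing == len(toy)].index.tolist()
print(empty_cols)
```

On the real data the same two calls would be applied to `sal` instead of `toy`.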


**Average BasePay**

In [6]:
sal['BasePay'].mean()

66325.4488404877

**Highest OvertimePay in Dataset**

In [7]:
sal['OvertimePay'].max()

245131.88

**Look Up Information for a Specific Person: JOSEPH DRISCOLL**

*Job Title*

In [13]:
joseph = sal['EmployeeName'] == 'JOSEPH DRISCOLL'
sal[joseph]['JobTitle']

24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

*Total Pay Including Benefits*

In [15]:
sal[joseph]['TotalPayBenefits']

24    270324.91
Name: TotalPayBenefits, dtype: float64
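
The two lookups above filter rows with a boolean mask and then pick a column in a second step. A sketch of doing both in one `.loc` call, which also allows several columns at once; the frame `toy` and its names and amounts are hypothetical:

```python
import pandas as pd

# Hypothetical rows; names, titles and amounts are invented.
toy = pd.DataFrame({
    'EmployeeName': ['ANN LEE', 'BOB RAY', 'ANN LEE'],
    'JobTitle': ['Clerk', 'Nurse', 'Clerk II'],
    'TotalPayBenefits': [100.0, 200.0, 150.0],
})

# Boolean mask: True for every row that belongs to the person.
mask = toy['EmployeeName'] == 'ANN LEE'

# .loc[rows, columns] applies the mask and the column selection together.
selected = toy.loc[mask, ['JobTitle', 'TotalPayBenefits']]
print(selected)
```

If a name appears more than once, as in this toy frame, every matching row is returned.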

**Data of Highest Paid Person**

In [21]:
high_paid = sal['TotalPayBenefits'].max()
sal[sal['TotalPayBenefits'] == high_paid]

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,


In [123]:
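# Alternative way: use .loc with idxmax()
# idxmax() returns the index label, which is what .loc expects;
# argmax() returns an integer position and pairs with .iloc instead
sal.loc[sal['TotalPayBenefits'].idxmax()]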

Id                                                               1
EmployeeName                                        NATHANIEL FORD
JobTitle            GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
BasePay                                                     167411
OvertimePay                                                      0
OtherPay                                                    400184
Benefits                                                       NaN
TotalPay                                                    567595
TotalPayBenefits                                            567595
Year                                                          2011
Notes                                                          NaN
Agency                                               San Francisco
Status                                                         NaN
Name: 0, dtype: object
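
On this DataFrame the index labels happen to equal the row positions, so label-based and position-based lookups return the same row. A small sketch of how `argmax()` and `idxmax()` differ once the index is not 0, 1, 2, ...; the Series `s` is a hypothetical example:

```python
import pandas as pd

# Hypothetical Series whose labels (10, 20, 30) differ from positions (0, 1, 2).
s = pd.Series([5.0, 9.0, 2.0], index=[10, 20, 30])

pos = s.argmax()    # integer position of the maximum
label = s.idxmax()  # index label of the maximum

# Positions go with .iloc, labels go with .loc.
by_pos = s.iloc[pos]
by_label = s.loc[label]
print(pos, label, by_pos, by_label)
```

Pairing `.loc` with `idxmax()` and `.iloc` with `argmax()` keeps the lookup correct after filtering or re-indexing.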

**Data of Lowest Paid Person** (note: the record looks anomalous, since the total pay is negative)

In [125]:
sal.iloc[sal['TotalPayBenefits'].argmin()]

Id                                      148654
EmployeeName                         Joe Lopez
JobTitle            Counselor, Log Cabin Ranch
BasePay                                      0
OvertimePay                                  0
OtherPay                               -618.13
Benefits                                     0
TotalPay                               -618.13
TotalPayBenefits                       -618.13
Year                                      2014
Notes                                      NaN
Agency                           San Francisco
Status                                     NaN
Name: 148653, dtype: object

In [126]:
# Alternative way: filter with a boolean mask

low_paid = sal['TotalPayBenefits'].min()
sal[sal['TotalPayBenefits'] == low_paid]

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
148653,148654,Joe Lopez,"Counselor, Log Cabin Ranch",0.0,0.0,-618.13,0.0,-618.13,-618.13,2014,,San Francisco,
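
Because the lowest total above is negative, it can help to look at all non-positive totals before interpreting a minimum. A hedged sketch on invented rows (only the value -618.13 comes from the output above; `toy` and the other amounts are hypothetical):

```python
import pandas as pd

# Hypothetical rows; only -618.13 is taken from the output above.
toy = pd.DataFrame({
    'EmployeeName': ['A', 'B', 'C', 'D'],
    'TotalPayBenefits': [100.0, 0.0, -618.13, 50.0],
})

# Rows whose total is zero or negative.
non_positive = toy[toy['TotalPayBenefits'] <= 0]
print(non_positive)

# Lowest total among rows with a positive amount.
lowest_positive = toy.loc[toy['TotalPayBenefits'] > 0, 'TotalPayBenefits'].min()
print(lowest_positive)
```

Whether such rows should be excluded depends on the question being asked; this sketch only makes them visible.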


**Mean Annual BasePay of All Employees (2011-2014)**

In [127]:
# Select the column before aggregating so non-numeric columns are never averaged
sal.groupby('Year')['BasePay'].mean()

Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64

*(Breakdown version of the above code):*

In [25]:
annual_df = sal.groupby('Year')

In [28]:
annual_df['BasePay'].mean()

Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64
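
A grouped column can also be summarised with several statistics at once. The sketch below, on a hypothetical frame `toy`, reports the mean together with the number of non-null values behind it, which matters here because `BasePay` has missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the NaN stands in for a missing BasePay.
toy = pd.DataFrame({
    'Year': [2011, 2011, 2012, 2012],
    'BasePay': [100.0, 200.0, 300.0, np.nan],
})

# agg() with a list of function names returns one column per statistic.
# Both 'mean' and 'count' skip NaN values.
summary = toy.groupby('Year')['BasePay'].agg(['mean', 'count'])
print(summary)
```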

**Number of Unique Job Titles**

In [30]:
sal['JobTitle'].nunique()
# len(sal['JobTitle'].unique()) gives the same number, just less directly

2159

**Top 5 Most Common Jobs**

In [43]:
sal['JobTitle'].value_counts()[:5]

Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64

In [None]:
# or, for a more panda-ish method:

In [128]:
sal['JobTitle'].value_counts().head(5)

Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64
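
Raw counts are easier to compare when expressed as shares of all rows. A minimal sketch using `value_counts(normalize=True)` on a hypothetical Series of titles:

```python
import pandas as pd

# Hypothetical job titles, invented for illustration.
titles = pd.Series(['Nurse', 'Clerk', 'Nurse', 'Chief', 'Nurse', 'Clerk'])

# normalize=True returns relative frequencies instead of counts,
# still sorted from most to least common.
shares = titles.value_counts(normalize=True)

top2 = shares.head(2)
print(top2)
```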

**Number of Job Titles Held by Only One Person in 2013**

In [136]:
sum(sal[sal['Year'] == 2013]['JobTitle'].value_counts() == 1)

202

*(Breakdown version of the above code):*

In [137]:
df_2013 = sal[sal['Year'] == 2013]
one_job = df_2013['JobTitle'].value_counts() == 1

In [138]:
sum(one_job)

202
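
The count above comes from summing a boolean Series, where each `True` counts as 1. The same comparison can also be used as a mask to list the titles themselves. A sketch on hypothetical titles:

```python
import pandas as pd

# Hypothetical job titles for a single year, invented for illustration.
titles = pd.Series(['Clerk', 'Nurse', 'Nurse', 'Chief', 'Clerk', 'Judge'])

counts = titles.value_counts()

# Number of titles that occur exactly once.
n_single = int((counts == 1).sum())

# The titles behind that number, via the same boolean mask.
single_titles = sorted(counts[counts == 1].index)
print(n_single, single_titles)
```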

**Number of Entries with "Chief" in the Job Title**

In [110]:
len(sal['JobTitle'][sal['JobTitle'].str.contains('Chief', case=False)])

627

*(Breakdown version of the above code):*

In [118]:
# Grab Job Title Column
chief_df = sal['JobTitle']

In [119]:
# Return all job titles containing the string 'chief', case-insensitive
with_chief = chief_df[chief_df.str.contains('Chief',case=False)]

In [120]:
len(with_chief)

627
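
The figure above counts rows, so a title shared by several employees is counted once per employee. A sketch that separates matching rows from distinct matching titles, and that treats missing titles as non-matches through `na=False`; the Series `titles` is hypothetical:

```python
import pandas as pd

# Hypothetical titles, including a duplicate and a missing value.
titles = pd.Series(['Fire Chief', 'CHIEF OF POLICE', 'Nurse', None, 'Fire Chief'])

# case=False ignores letter case; na=False turns missing titles into False.
mask = titles.str.contains('chief', case=False, na=False)

n_rows = int(mask.sum())           # rows whose title contains 'chief'
n_unique = titles[mask].nunique()  # distinct titles among those rows
print(n_rows, n_unique)
```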

**Correlation between Job-Title String Length and Total Salary (if any)**

In [139]:
sal['title_len'] = sal['JobTitle'].apply(len)

In [143]:
sal[['TotalPayBenefits','title_len']]

Unnamed: 0,TotalPayBenefits,title_len
0,567595.43,46
1,538909.28,31
2,335279.91,31
3,332343.61,36
4,326373.19,44
...,...,...
148649,0.00,9
148650,0.00,12
148651,0.00,12
148652,0.00,12


In [144]:
sal[['TotalPayBenefits','title_len']].corr()

Unnamed: 0,TotalPayBenefits,title_len
TotalPayBenefits,1.0,-0.036878
title_len,-0.036878,1.0


With a correlation coefficient of about -0.037, there is essentially no linear relationship between job-title length and total pay including benefits.
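
For a single pair of columns, `Series.corr` returns the coefficient directly instead of a full matrix. The sketch below, on a hypothetical frame `toy`, also checks the result against NumPy's `corrcoef`, assuming the default Pearson method is wanted:

```python
import numpy as np
import pandas as pd

# Hypothetical titles and amounts, invented for illustration.
toy = pd.DataFrame({
    'JobTitle': ['Clerk', 'Fire Chief', 'Transit Operator', 'Nurse'],
    'TotalPayBenefits': [50.0, 200.0, 80.0, 90.0],
})

# .str.len() is the vectorised counterpart of .apply(len).
toy['title_len'] = toy['JobTitle'].str.len()

# Pearson correlation between the two columns (the default method).
r = toy['title_len'].corr(toy['TotalPayBenefits'])

# Cross-check with NumPy's correlation matrix.
r_np = np.corrcoef(toy['title_len'], toy['TotalPayBenefits'])[0, 1]
print(r, r_np)
```

Pearson correlation only measures linear association, so a value near zero does not rule out other kinds of relationship.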