# Cleaning Human Resources (s2021) Data

In this notebook, we will clean the faculty data from IPEDS and prepare it for analysis.

## Load the Data

We will start by loading the data into a Pandas DataFrame and assigning it to the `HR2021_df` variable.

## Clean the Data
 

Rows with missing data were not removed.

Columns that we do not need are removed. The columns kept in `HR2021_reduced_df` are:

- Year - Data year
- UNITID - Unique institution ID
- FACSTAT - Faculty and tenure status
- ARANK - Academic rank (Professor, Associate Professor, Assistant Professor, Instructor, Lecturer, No academic rank)
- HRTOTLT - Grand total
- HRTOTLM - Grand total men
- HRTOTLW - Grand total women
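
Since no rows are dropped for missing data, it can be worth confirming how many missing values there actually are. The sketch below is only an illustration: `sample_df` is a hypothetical toy frame built from the first rows displayed later in this notebook, not the full file, and the variable names are assumptions.

```python
import pandas as pd

# Toy frame copied from the first rows displayed in this notebook (assumption:
# it stands in for the full HR2021_df, which is not loaded here).
sample_df = pd.DataFrame({
    "Year": [2021, 2021, 2021, 2021, 2021],
    "UNITID": [100654, 100654, 100654, 100654, 100654],
    "FACSTAT": [0, 10, 10, 10, 10],
    "ARANK": [0, 0, 1, 2, 3],
    "HRTOTLT": [208, 205, 39, 36, 84],
    "HRTOTLM": [104, 103, 32, 20, 35],
    "HRTOTLW": [104, 102, 7, 16, 49],
})

# Count missing values per column
missing_per_column = sample_df.isna().sum()

# Count rows that contain at least one missing value
rows_with_missing = int(sample_df.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with at least one missing value:", rows_with_missing)
```

Running the same two checks on the full DataFrame would show whether a `dropna()` step is needed at all.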


## Save the Cleaned Data

Finally, we will save the cleaned data to a new CSV file named `HR2021_reduced.csv` in the `Resources` folder.

## Note about data provided by the IPEDS

This file contains the number of full-time instructional staff on the payroll of the institution as of November 1, by faculty and tenure status, academic rank, race/ethnicity, and gender. The file has multiple records per institution. Each record is uniquely defined by the IPEDS ID (UNITID) and the variable SISCAT, which is the combination of faculty and tenure status, FACSTAT (tenured, on tenure track, and not on tenure track/no tenure system), and academic rank, ARANK (professors, associate professors, etc.). Beginning with 2016, staff who are not on tenure track are further disaggregated by contract length (multi-year, indefinite, annual, and less-than-annual). From 2012 through 2015, the multi-year and indefinite contract lengths were combined into one category, often referred to as multi-year in past documentation. These data are applicable to degree-granting institutions with 15 or more full-time employees and related administrative offices.
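
The statement that each record is uniquely defined by UNITID and SISCAT can be checked directly. The following is a minimal sketch on a hypothetical toy frame (`sample_df`) built from the first rows displayed in this notebook; the names `key_cols`, `duplicate_keys`, and `records_per_institution` are illustrative assumptions.

```python
import pandas as pd

# Toy frame copied from the first rows displayed in this notebook (assumption:
# the same check would be run on the full HR2021_df).
sample_df = pd.DataFrame({
    "UNITID": [100654, 100654, 100654, 100654, 100654],
    "SISCAT": [1, 100, 101, 102, 103],
    "FACSTAT": [0, 10, 10, 10, 10],
    "ARANK": [0, 0, 1, 2, 3],
})

# The pair (UNITID, SISCAT) should identify a record uniquely
key_cols = ["UNITID", "SISCAT"]
duplicate_keys = int(sample_df.duplicated(subset=key_cols).sum())

# Multiple records per institution are expected
records_per_institution = sample_df.groupby("UNITID").size()

print("Duplicate (UNITID, SISCAT) pairs:", duplicate_keys)
print(records_per_institution)
```

A non-zero duplicate count on the full file would suggest the data were altered or concatenated before loading.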

In [6]:
# import dependencies 
import pandas as pd

In [7]:
# Load the data into a Pandas DataFrame and assign it to the HR2021_df variable
HR2021_df = pd.read_csv('Resources/s2021_is.csv', encoding='ISO-8859-1')
HR2021_df.head()

,Year,UNITID,SISCAT,FACSTAT,ARANK,XHRTOTLT,HRTOTLT,XHRTOTLM,HRTOTLM,XHRTOTLW,...,XHRUNKNM,HRUNKNM,XHRUNKNW,HRUNKNW,XHRNRALT,HRNRALT,XHRNRALM,HRNRALM,XHRNRALW,HRNRALW
0,2021,100654,1,0,0,R,208,R,104,R,...,R,2,Z,0,R,16,R,10,R,6
1,2021,100654,100,10,0,R,205,R,103,R,...,R,2,Z,0,R,16,R,10,R,6
2,2021,100654,101,10,1,R,39,R,32,R,...,Z,0,Z,0,R,2,R,2,Z,0
3,2021,100654,102,10,2,R,36,R,20,R,...,Z,0,Z,0,R,3,R,1,R,2
4,2021,100654,103,10,3,R,84,R,35,R,...,R,1,Z,0,R,8,R,4,R,4


In [8]:
# print number of rows and columns currently present before cleaning 
print("Number of rows:", HR2021_df.shape[0])
print("Number of columns:", HR2021_df.shape[1])

Number of rows: 63625
Number of columns: 65


In [9]:
# review data types for all 65 columns
HR2021_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63625 entries, 0 to 63624
Data columns (total 65 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Year      63625 non-null  int64 
 1   UNITID    63625 non-null  int64 
 2   SISCAT    63625 non-null  int64 
 3   FACSTAT   63625 non-null  int64 
 4   ARANK     63625 non-null  int64 
 5   XHRTOTLT  63625 non-null  object
 6   HRTOTLT   63625 non-null  int64 
 7   XHRTOTLM  63625 non-null  object
 8   HRTOTLM   63625 non-null  int64 
 9   XHRTOTLW  63625 non-null  object
 10  HRTOTLW   63625 non-null  int64 
 11  XHRAIANT  63625 non-null  object
 12  HRAIANT   63625 non-null  int64 
 13  XHRAIANM  63625 non-null  object
 14  HRAIANM   63625 non-null  int64 
 15  XHRAIANW  63625 non-null  object
 16  HRAIANW   63625 non-null  int64 
 17  XHRASIAT  63625 non-null  object
 18  HRASIAT   63625 non-null  int64 
 19  XHRASIAM  63625 non-null  object
 20  HRASIAM   63625 non-null  int64 
 21  XHRASIAW  63

In [10]:
# Select the columns to keep in the new DataFrame
keep_cols = ['Year','UNITID','FACSTAT','ARANK','HRTOTLT','HRTOTLM','HRTOTLW']

# Create the new DataFrame with selected columns
HR2021_reduced_df = HR2021_df[keep_cols].copy()

# Review the first 5 rows of the new DataFrame
HR2021_reduced_df.head()

,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
0,2021,100654,0,0,208,104,104
1,2021,100654,10,0,205,103,102
2,2021,100654,10,1,39,32,7
3,2021,100654,10,2,36,20,16
4,2021,100654,10,3,84,35,49
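
In the rows displayed above, the men and women counts add up to the grand total. A quick sanity check for that relationship might look like the sketch below. It assumes, based only on the column descriptions and the displayed rows, that HRTOTLT = HRTOTLM + HRTOTLW; `reduced_sample` is a hypothetical toy frame, not the full `HR2021_reduced_df`.

```python
import pandas as pd

# Toy frame copied from the five rows displayed above
reduced_sample = pd.DataFrame({
    "UNITID": [100654, 100654, 100654, 100654, 100654],
    "FACSTAT": [0, 10, 10, 10, 10],
    "ARANK": [0, 0, 1, 2, 3],
    "HRTOTLT": [208, 205, 39, 36, 84],
    "HRTOTLM": [104, 103, 32, 20, 35],
    "HRTOTLW": [104, 102, 7, 16, 49],
})

# Flag rows where men + women does not equal the grand total
mismatch_mask = (
    reduced_sample["HRTOTLM"] + reduced_sample["HRTOTLW"]
    != reduced_sample["HRTOTLT"]
)
mismatched_rows = reduced_sample[mismatch_mask]

print("Rows where men + women != total:", len(mismatched_rows))
```

Any mismatched rows in the full data would be worth inspecting before analysis rather than silently dropping.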


In [11]:
# Review the last 10 rows of the new DataFrame
HR2021_reduced_df.tail(10)

,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
63615,2021,497046,40,0,8,3,5
63616,2021,497046,40,4,8,3,5
63617,2021,497046,42,0,4,1,3
63618,2021,497046,42,4,4,1,3
63619,2021,497046,43,0,4,2,2
63620,2021,497046,43,4,4,2,2
63621,2021,497277,0,0,9,1,8
63622,2021,497277,50,0,9,1,8
63623,2021,497286,0,0,5,5,0
63624,2021,497286,50,0,5,5,0


In [12]:
# Save the cleaned data to a new CSV file
HR2021_reduced_df.to_csv('Resources/HR2021_reduced.csv', index=False)
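
One way to confirm the save step worked is to read the file back and compare it with the DataFrame in memory. The sketch below writes a hypothetical toy frame (`reduced_sample`, copied from the last rows displayed in this notebook) to a temporary directory instead of `Resources/`, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Toy frame copied from the last rows displayed in this notebook
reduced_sample = pd.DataFrame({
    "Year": [2021, 2021, 2021, 2021],
    "UNITID": [497277, 497277, 497286, 497286],
    "FACSTAT": [0, 50, 0, 50],
    "ARANK": [0, 0, 0, 0],
    "HRTOTLT": [9, 9, 5, 5],
    "HRTOTLM": [1, 1, 5, 5],
    "HRTOTLW": [8, 8, 0, 0],
})

# Write to a temporary folder (assumption: stands in for Resources/)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "HR2021_reduced.csv")
    # index=False keeps the row index out of the file, as in the cell above
    reduced_sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# The reloaded frame should match the original in columns, values, and dtypes
round_trip_ok = reloaded.equals(reduced_sample)
print("Round trip preserved the data:", round_trip_ok)
```

Because `index=False` is used, the reloaded file has no extra index column, which keeps the column list identical to `keep_cols`.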

In [13]:
# Confirm the column count, non-null counts, and data types of the reduced DataFrame
HR2021_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63625 entries, 0 to 63624
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Year     63625 non-null  int64
 1   UNITID   63625 non-null  int64
 2   FACSTAT  63625 non-null  int64
 3   ARANK    63625 non-null  int64
 4   HRTOTLT  63625 non-null  int64
 5   HRTOTLM  63625 non-null  int64
 6   HRTOTLW  63625 non-null  int64
dtypes: int64(7)
memory usage: 3.4 MB
