# Joining 2021 Institutional Characteristics (IS) Data and Human Resources Data 

In this Jupyter Notebook, we use Python to join the cleaned institutional characteristics data in IS2021_reduced.csv with the cleaned human resources data in HR2021_reduced.csv. The data were reported by colleges and are collected and housed by the Integrated Postsecondary Education Data System (IPEDS), which is managed by the National Center for Education Statistics (NCES).

## Load the Data 
- HR2021_reduced.csv
- IS2021_reduced.csv

## Review the Data 

## Column Names 
HR2021_reduced_df.info() 
---  ------   --------------  -----
 0   Year     63625 non-null  int64
 1   UNITID   63625 non-null  int64
 2   FACSTAT  63625 non-null  int64
 3   ARANK    63625 non-null  int64
 4   HRTOTLT  63625 non-null  int64
 5   HRTOTLM  63625 non-null  int64
 6   HRTOTLW  63625 non-null  int64
 

IS2021_reduced_df.info()
---  ------    --------------  -----  
 0   Year      6289 non-null   int64  
 1   UNITID    6289 non-null   int64  
 2   INSTNM    6289 non-null   object 
 3   CITY      6289 non-null   object 
 4   STABBR    6289 non-null   object 
 5   ZIP       6289 non-null   int32  
 6   CONTROL   6289 non-null   int64  
 7   HLOFFER   6289 non-null   int64  
 8   INSTSIZE  6289 non-null   int64  
 9   LONGITUD  6289 non-null   float64
 10  LATITUDE  6289 non-null   float64

## Join the Data 

## Save the Joined Data

Finally, we will save the joined data to a new CSV file named IS_HR_2021.csv.

In [2]:
# import dependencies 
import pandas as pd

In [3]:
# Load the data
hr_data = pd.read_csv("Resources/HR2021_reduced.csv")
is_data = pd.read_csv("Resources/IS2021_reduced.csv")

In [4]:
# Review the data
hr_data.head()

Unnamed: 0,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
0,2021,100654,0,0,208,104,104
1,2021,100654,10,0,205,103,102
2,2021,100654,10,1,39,32,7
3,2021,100654,10,2,36,20,16
4,2021,100654,10,3,84,35,49


In [5]:
# Review the data 
is_data.head()

Unnamed: 0,Year,UNITID,INSTNM,CITY,STABBR,ZIP,CONTROL,HLOFFER,INSTSIZE,LONGITUD,LATITUDE
0,2021,100654,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
1,2021,100663,University of Alabama at Birmingham,Birmingham,AL,35294,1,9,5,-86.799345,33.505697
2,2021,100690,Amridge University,Montgomery,AL,36117,2,9,1,-86.17401,32.362609
3,2021,100706,University of Alabama in Huntsville,Huntsville,AL,35899,1,9,3,-86.640449,34.724557
4,2021,100724,Alabama State University,Montgomery,AL,36104,1,9,2,-86.295677,32.364317


In [6]:
# Join the data
joined_data = pd.merge(hr_data, is_data, on="UNITID")
joined_data.head()

Unnamed: 0,Year_x,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW,Year_y,INSTNM,CITY,STABBR,ZIP,CONTROL,HLOFFER,INSTSIZE,LONGITUD,LATITUDE
0,2021,100654,0,0,208,104,104,2021,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
1,2021,100654,10,0,205,103,102,2021,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
2,2021,100654,10,1,39,32,7,2021,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
3,2021,100654,10,2,36,20,16,2021,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
4,2021,100654,10,3,84,35,49,2021,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
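
Each institution should appear once in the institutional characteristics table and many times in the human resources table, so the join on UNITID is expected to be many-to-one. The sketch below shows one way to confirm that and to spot human resources rows with no matching institution, using the `validate` and `indicator` arguments of `pd.merge`. The small frames, the UNITID 999999, the HRTOTLT values for the last two rows, and the choice of `how="left"` are illustrative assumptions, not part of the notebook's data or method.

```python
import pandas as pd

# Toy stand-ins for hr_data and is_data (some values are made up for illustration)
hr_sample = pd.DataFrame({
    "UNITID": [100654, 100654, 100663, 999999],
    "HRTOTLT": [208, 205, 150, 12],
})
is_sample = pd.DataFrame({
    "UNITID": [100654, 100663],
    "INSTNM": ["Alabama A & M University", "University of Alabama at Birmingham"],
})

# validate raises pandas.errors.MergeError if UNITID repeats in the right frame;
# indicator adds a _merge column recording where each row was found
checked = pd.merge(
    hr_sample,
    is_sample,
    on="UNITID",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Human resources rows that found no institution record
unmatched = checked[checked["_merge"] == "left_only"]
print(checked["_merge"].value_counts())
print(unmatched["UNITID"].tolist())
```

With the default inner join used in this notebook, unmatched rows would be dropped silently, so a check like this is mainly useful before settling on the join type.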


In [7]:
# Review data types of columns 
joined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63625 entries, 0 to 63624
Data columns (total 17 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year_x    63625 non-null  int64  
 1   UNITID    63625 non-null  int64  
 2   FACSTAT   63625 non-null  int64  
 3   ARANK     63625 non-null  int64  
 4   HRTOTLT   63625 non-null  int64  
 5   HRTOTLM   63625 non-null  int64  
 6   HRTOTLW   63625 non-null  int64  
 7   Year_y    63625 non-null  int64  
 8   INSTNM    63625 non-null  object 
 9   CITY      63625 non-null  object 
 10  STABBR    63625 non-null  object 
 11  ZIP       63625 non-null  int64  
 12  CONTROL   63625 non-null  int64  
 13  HLOFFER   63625 non-null  int64  
 14  INSTSIZE  63625 non-null  int64  
 15  LONGITUD  63625 non-null  float64
 16  LATITUDE  63625 non-null  float64
dtypes: float64(2), int64(12), object(3)
memory usage: 8.7+ MB
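
The `info()` output above lists ZIP as an integer column. If the data include ZIP codes that begin with zero, storing them as integers drops the leading zero. A minimal sketch of one possible fix follows, assuming every value is a five-digit US ZIP code; the sample values and the `ZIP_STR` column name are hypothetical.

```python
import pandas as pd

# Hypothetical ZIP values; two of them would normally start with a zero
zips = pd.DataFrame({"ZIP": [35762, 2138, 6511]})

# Convert to text and left-pad with zeros to five characters
zips["ZIP_STR"] = zips["ZIP"].astype(str).str.zfill(5)
print(zips["ZIP_STR"].tolist())
```

Another option would be to read the column as text when loading the CSV (for example with the `dtype` argument of `pd.read_csv`), which only helps if the leading zeros are still present in the file.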


In [8]:
# Summary statistics for the numeric columns in the DataFrame
print(joined_data.describe())

        Year_x         UNITID       FACSTAT         ARANK       HRTOTLT  \
count  63625.0   63625.000000  63625.000000  63625.000000  63625.000000   
mean    2021.0  208733.552692     28.712283      1.979976     70.272157   
std        0.0   91957.196986     14.920225      1.825141    208.377323   
min     2021.0  100654.000000      0.000000      0.000000      1.000000   
25%     2021.0  153384.000000     10.000000      0.000000      5.000000   
50%     2021.0  191533.000000     40.000000      2.000000     18.000000   
75%     2021.0  223922.000000     42.000000      3.000000     57.000000   
max     2021.0  497286.000000     50.000000      6.000000   6655.000000   

            HRTOTLM       HRTOTLW   Year_y           ZIP       CONTROL  \
count  63625.000000  63625.000000  63625.0  63625.000000  63625.000000   
mean      35.397171     34.874986   2021.0  46526.024094      1.610829   
std      113.225757     99.260466      0.0  29723.392802      0.618448   
min        0.000000      0.0

In [9]:
# Returns the number of missing values for each column in the DataFrame
print(joined_data.isnull().sum())

Year_x      0
UNITID      0
FACSTAT     0
ARANK       0
HRTOTLT     0
HRTOTLM     0
HRTOTLW     0
Year_y      0
INSTNM      0
CITY        0
STABBR      0
ZIP         0
CONTROL     0
HLOFFER     0
INSTSIZE    0
LONGITUD    0
LATITUDE    0
dtype: int64


In [10]:
# Returns the number of duplicated rows in the DataFrame
print(joined_data.duplicated().sum())

0


In [11]:
# Save the joined data (this file is overwritten after the Year columns are cleaned up below)
joined_data.to_csv("Resources/IS_HR_2021.csv", index=False)

In [12]:
# Rename the Year_x column to Year
joined_data = joined_data.rename(columns={"Year_x": "Year"})

In [13]:
# Drop the Year_y column
joined_data = joined_data.drop(columns=["Year_y"])
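
The Year_x and Year_y columns appear because both input files carry a Year column that is not part of the join key. As an alternative sketch, joining on both Year and UNITID keeps a single Year column and makes the rename and drop steps unnecessary. This assumes both files cover the same year, as the summary statistics suggest for this data; the small frames below are stand-ins for the real files.

```python
import pandas as pd

# Toy stand-ins for hr_data and is_data
hr_sample = pd.DataFrame({
    "Year": [2021, 2021],
    "UNITID": [100654, 100654],
    "HRTOTLT": [208, 205],
})
is_sample = pd.DataFrame({
    "Year": [2021],
    "UNITID": [100654],
    "INSTNM": ["Alabama A & M University"],
})

# Joining on both keys avoids the _x/_y suffixes on Year
merged = pd.merge(hr_sample, is_sample, on=["Year", "UNITID"])
print(merged.columns.tolist())
```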

In [14]:
# Review data types of columns 
joined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63625 entries, 0 to 63624
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      63625 non-null  int64  
 1   UNITID    63625 non-null  int64  
 2   FACSTAT   63625 non-null  int64  
 3   ARANK     63625 non-null  int64  
 4   HRTOTLT   63625 non-null  int64  
 5   HRTOTLM   63625 non-null  int64  
 6   HRTOTLW   63625 non-null  int64  
 7   INSTNM    63625 non-null  object 
 8   CITY      63625 non-null  object 
 9   STABBR    63625 non-null  object 
 10  ZIP       63625 non-null  int64  
 11  CONTROL   63625 non-null  int64  
 12  HLOFFER   63625 non-null  int64  
 13  INSTSIZE  63625 non-null  int64  
 14  LONGITUD  63625 non-null  float64
 15  LATITUDE  63625 non-null  float64
dtypes: float64(2), int64(11), object(3)
memory usage: 8.3+ MB


In [15]:
# check the dataframe
joined_data.head()

Unnamed: 0,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW,INSTNM,CITY,STABBR,ZIP,CONTROL,HLOFFER,INSTSIZE,LONGITUD,LATITUDE
0,2021,100654,0,0,208,104,104,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
1,2021,100654,10,0,205,103,102,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
2,2021,100654,10,1,39,32,7,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
3,2021,100654,10,2,36,20,16,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
4,2021,100654,10,3,84,35,49,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368


In [16]:
# Save the joined data
joined_data.to_csv("Resources/IS_HR_2021.csv", index=False)

UNITID	Unique identification number of the institution
FACSTAT	Faculty and tenure status
ARANK	Academic rank
- Assigned by the institution; includes Professors, Associate professors, Assistant professors, Instructors, Lecturers, and No academic rank.
HRTOTLT	Grand total
HRTOTLM	Grand total men
HRTOTLW	Grand total women
INSTNM	Institution (entity) name
CITY	City location of institution
STABBR	State abbreviation
ZIP	ZIP code
CONTROL	Control of institution
HLOFFER	Highest level of offering
INSTSIZE	Institution size category
LONGITUD	Longitude location of institution
LATITUDE	Latitude location of institution
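
The definitions above suggest that the grand total (HRTOTLT) should equal the sum of the men (HRTOTLM) and women (HRTOTLW) totals. A minimal sketch of that consistency check follows, using the five rows shown in the `head()` output earlier in the notebook. Whether the relationship holds for every row of the full file is an assumption that would need to be checked against the real data.

```python
import pandas as pd

# The five sample rows shown in the head() output
sample = pd.DataFrame({
    "HRTOTLT": [208, 205, 39, 36, 84],
    "HRTOTLM": [104, 103, 32, 20, 35],
    "HRTOTLW": [104, 102, 7, 16, 49],
})

# Rows where the grand total does not equal men plus women
mismatch = sample[sample["HRTOTLT"] != sample["HRTOTLM"] + sample["HRTOTLW"]]
print(len(mismatch))
```

Running the same comparison on `joined_data` would flag any rows where the three columns disagree.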
