# Cleaning Institutional Characteristics (IS) Data

In this notebook, we clean the Institutional Characteristics (IS) data and the human resources (HR) staff data from IPEDS and prepare them for analysis.

## Load the Data

We will start by loading the data (`hd2020.csv` and `s2020_is_rv.csv`) into Pandas DataFrames and assigning them to the `IS2020_df` and `HR2020_df` variables, respectively.

## Clean the Data

Next, we will clean the data by dropping any duplicate rows, converting the ZIP code column to integers, and removing any leading or trailing whitespace in string columns.
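
Converting ZIP codes to integers discards any leading zeros (for example, `02138` becomes `2138`). If the leading zeros matter, one option is to keep the column as five-character strings instead. The following is a minimal sketch of that alternative; `normalize_zip` is a hypothetical helper name and is not part of the original notebook.

```python
def normalize_zip(raw):
    """Return the five-digit ZIP as a string, dropping any ZIP+4 suffix."""
    # Work on the text form so values already read as numbers are handled too.
    text = str(raw).strip()
    # Keep only the part before the hyphen in ZIP+4 values such as '35294-0110'.
    base = text.split("-")[0]
    # Left-pad with zeros in case leading zeros were lost earlier,
    # then cap the result at five characters.
    return base.zfill(5)[:5]


print(normalize_zip("35294-0110"))  # 35294
print(normalize_zip(2138))          # 02138
```

With pandas, a helper like this could be applied as `IS2020_df['ZIP'].map(normalize_zip)`, which would leave the column as text rather than integers.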

Rows with missing data were not removed.
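
Since rows with missing data stay in the table, it can help to record how many values are missing in each column. The sketch below uses a small stand-in table with made-up values (not the IPEDS file) and the hypothetical names `sample_df` and `missing_counts`.

```python
import pandas as pd

# Tiny stand-in for the IPEDS table; all values here are illustrative only.
sample_df = pd.DataFrame(
    {
        "UNITID": [1, 2, 3],
        "IALIAS": ["Alias A", None, "Alias C"],
        "ADDR": ["1 Main St", "2 Main St", None],
    }
)

# Count missing values per column and keep only the columns that have any.
missing_counts = sample_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Run against the real DataFrame, the same two lines would list only the columns that contain gaps.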

Columns that we do not need are removed. Columns kept for the IS2020_df are:
- UNITID (the unique key for each university)
- INSTNM (institution (entity) name)
- CITY
- STABBR (state abbreviation)
- ZIP (ZIP code)
- CONTROL (institutional control or affiliation: public, private, for-profit, or religious affiliation)
- HLOFFER (filter for 4-year/bachelor's degree or higher (5))
- INSTSIZE (institution size category based on total students enrolled)
- LONGITUD (longitude location of institution)
- LATITUDE (latitude location of institution)
- Year (added by coders to original data set)
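
The `HLOFFER` note above mentions filtering for bachelor's degree or higher using code 5. A minimal sketch of such a filter follows. It assumes that codes of 5 and above mark bachelor's-level or higher offerings and that a `>=` comparison is the intended rule; the rows and the names `sample_df`, `BACHELORS_OR_HIGHER`, and `four_year_df` are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; these UNITID and HLOFFER values are placeholders.
sample_df = pd.DataFrame(
    {
        "UNITID": [1, 2, 3, 4],
        "HLOFFER": [9, 9, 3, 5],
    }
)

# Assumption: 5 is the bachelor's-degree code cited in the column list,
# and larger codes mean higher offerings.
BACHELORS_OR_HIGHER = 5

# Keep institutions at or above the threshold; .copy() avoids chained-assignment warnings later.
four_year_df = sample_df[sample_df["HLOFFER"] >= BACHELORS_OR_HIGHER].copy()
print(four_year_df["UNITID"].tolist())  # [1, 2, 4]
```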

Columns kept for the HR2020_df are:

- Year (added by coders to original data set)
- UNITID (unique id)
- FACSTAT - Faculty and tenure status
- ARANK - Academic rank (Professor, Associate Professor, Assistant Professor, Instructor, Lecturer, No academic rank)
- HRTOTLT - Grand total
- HRTOTLM - Grand total men
- HRTOTLW - Grand total women


## Save the Cleaned Data

Finally, we will save the cleaned data to new CSV files named `IS2020_reduced.csv` and `HR2020_reduced.csv` in the `Resources` folder.
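
The save cells in this notebook pass `index=False` to `to_csv`. The sketch below illustrates why that matters, using an in-memory buffer and a made-up two-row table instead of the real files; the names `sample_df`, `with_index`, and `without_index` are hypothetical.

```python
import io

import pandas as pd

# Made-up rows; only the column names mirror the notebook.
sample_df = pd.DataFrame({"UNITID": [1, 2], "HRTOTLT": [3, 4]})

# Write once with the default row index and once with index=False.
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

# Reading both back shows the extra unnamed column left behind by the index.
cols_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with_index)     # ['Unnamed: 0', 'UNITID', 'HRTOTLT']
print(cols_without_index)  # ['UNITID', 'HRTOTLT']
```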



In [3]:
# import dependencies 
import pandas as pd

In [4]:
# Load the data into a Pandas DataFrame and assign it to the IS2020_df variable
IS2020_df = pd.read_csv('Resources/hd2020.csv', encoding='ISO-8859-1')
IS2020_df.head()

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2020,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,109,1
1,2020,100663,University of Alabama at Birmingham,,Administration Bldg Suite 1070,Birmingham,AL,35294-0110,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,95,1
2,2020,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117-3553,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,126,2
3,2020,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,99,2
4,2020,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104-0271,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,118,1


In [5]:
# print number of rows and columns currently present before cleaning 
print("Number of rows:", IS2020_df.shape[0])
print("Number of columns:", IS2020_df.shape[1])

Number of rows: 6440
Number of columns: 74


In [6]:
# review data types for all 74 columns 
IS2020_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6440 entries, 0 to 6439
Data columns (total 74 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6440 non-null   int64  
 1   UNITID    6440 non-null   int64  
 2   INSTNM    6440 non-null   object 
 3   IALIAS    6439 non-null   object 
 4   ADDR      6439 non-null   object 
 5   CITY      6440 non-null   object 
 6   STABBR    6440 non-null   object 
 7   ZIP       6440 non-null   object 
 8   FIPS      6440 non-null   int64  
 9   OBEREG    6440 non-null   int64  
 10  CHFNM     6440 non-null   object 
 11  CHFTITLE  6440 non-null   object 
 12  GENTELE   6440 non-null   object 
 13  EIN       6440 non-null   int64  
 14  DUNS      6440 non-null   object 
 15  OPEID     6440 non-null   int64  
 16  OPEFLAG   6440 non-null   int64  
 17  WEBADDR   6440 non-null   object 
 18  ADMINURL  6440 non-null   object 
 19  FAIDURL   6440 non-null   object 
 20  APPLURL   6440 non-null   obje

In [7]:
# Drop any duplicate rows
IS2020_df.drop_duplicates(inplace=True)

# print number of rows and columns remaining after dropping duplicates
print("Number of rows:", IS2020_df.shape[0])
print("Number of columns:", IS2020_df.shape[1])

Number of rows: 6440
Number of columns: 74
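
`drop_duplicates()` with no arguments removes only rows that match in every column. Because `UNITID` is described as the unique key, it may also be worth checking for repeated keys. The sketch below shows both checks on a made-up table; `sample_df`, `full_dupes`, and `key_dupes` are hypothetical names.

```python
import pandas as pd

# Made-up rows: UNITID 2 appears twice, but the names differ by a trailing space.
sample_df = pd.DataFrame(
    {
        "UNITID": [1, 2, 2, 3],
        "INSTNM": ["A College", "B College", "B College ", "C College"],
    }
)

# Rows identical to an earlier row in every column.
full_dupes = int(sample_df.duplicated().sum())

# Rows whose UNITID has already appeared earlier in the table.
key_dupes = int(sample_df.duplicated(subset="UNITID").sum())

print(full_dupes, key_dupes)  # 0 1
```

In this toy case the whole-row check finds nothing, while the key-based check flags the repeated `UNITID`.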


In [8]:
# Keep the first five digits of each ZIP code (dropping any ZIP+4 suffix) and convert to integers
# Note: integers cannot store leading zeros, so a ZIP such as 02138 becomes 2138
IS2020_df['ZIP'] = IS2020_df['ZIP'].apply(lambda x: int(str(x).zfill(5)[:5]))
IS2020_df['ZIP'] = IS2020_df['ZIP'].astype(int)

# Review data types for all 74 columns
IS2020_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6440 entries, 0 to 6439
Data columns (total 74 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6440 non-null   int64  
 1   UNITID    6440 non-null   int64  
 2   INSTNM    6440 non-null   object 
 3   IALIAS    6439 non-null   object 
 4   ADDR      6439 non-null   object 
 5   CITY      6440 non-null   object 
 6   STABBR    6440 non-null   object 
 7   ZIP       6440 non-null   int32  
 8   FIPS      6440 non-null   int64  
 9   OBEREG    6440 non-null   int64  
 10  CHFNM     6440 non-null   object 
 11  CHFTITLE  6440 non-null   object 
 12  GENTELE   6440 non-null   object 
 13  EIN       6440 non-null   int64  
 14  DUNS      6440 non-null   object 
 15  OPEID     6440 non-null   int64  
 16  OPEFLAG   6440 non-null   int64  
 17  WEBADDR   6440 non-null   object 
 18  ADMINURL  6440 non-null   object 
 19  FAIDURL   6440 non-null   object 
 20  APPLURL   6440 non-null   obje

In [9]:
IS2020_df.head()

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2020,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,109,1
1,2020,100663,University of Alabama at Birmingham,,Administration Bldg Suite 1070,Birmingham,AL,35294,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,95,1
2,2020,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,126,2
3,2020,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,99,2
4,2020,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,118,1


In [10]:
print(IS2020_df.iloc[400:420])

     Year  UNITID                                   INSTNM  \
400  2020  120342                     Orange Coast College   
401  2020  120403           Otis College of Art and Design   
402  2020  120421                           Oxnard College   
403  2020  120537            Hope International University   
404  2020  120661        Concorde Career College-San Diego   
405  2020  120698                     Palo Alto University   
406  2020  120768                     Pacific Oaks College   
407  2020  120795               Pacific School of Religion   
408  2020  120838                Pacific States University   
409  2020  120865                    Pacific Union College   
410  2020  120883                University of the Pacific   
411  2020  120953                       Palo Verde College   
412  2020  120971                          Palomar College   
413  2020  121044                    Pasadena City College   
414  2020  121150                    Pepperdine University   
415  202

In [9]:
# First attempt: remove the hyphen and last 4 digits from the ZIP code so it can be converted to an integer.
# A later attempt tried to keep the leading zeros of the ZIP code; this is not a concern because latitude and longitude will be used for the map.
# IS2020_df['ZIP'] = IS2020_df['ZIP'].apply(lambda x: int(str(x)[:5]))

In [11]:
# Remove any leading or trailing whitespace in string columns
IS2020_df = IS2020_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [12]:
# checking current df
IS2020_df.head()


Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2020,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,109,1
1,2020,100663,University of Alabama at Birmingham,,Administration Bldg Suite 1070,Birmingham,AL,35294,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,95,1
2,2020,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,126,2
3,2020,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,99,2
4,2020,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,118,1


In [13]:
# checking last 10 lines 
IS2020_df.tail(10)

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
6430,2020,496274,Calvin University - Handlon Campus,,1728 W. Bluewater Highway,Ionia,MI,48846,26,3,...,1,266,-2,26067,Ionia County,2603,-85.105563,42.982404,-2,-2
6431,2020,496283,Provo College-Idaho Falls Campus,,1592 East 17th Street,Idaho Falls,ID,83404,16,7,...,1,292,-2,16019,Bonneville County,1602,-112.00228,43.481,200,2
6432,2020,496292,Platt College-Miller-Motte College-Chattanooga 2,,4180 South Creek Road,Chattangooga,TN,37406,47,5,...,1,174,-2,47065,Hamilton County,4703,-85.240436,35.092311,217,2
6433,2020,496317,Digital Film Academy - Atlanta,,10 Park Place South,Atlanta,GA,30303,13,5,...,1,122,-2,13121,Fulton County,1305,-84.388556,33.754219,-2,-2
6434,2020,496326,Eagle Gate College-Boise Campus,,9300 West Overland Road,Boise,ID,83709,16,7,...,1,147,-2,16001,Ada County,1601,-116.298572,43.59091,202,2
6435,2020,496335,Coastline Beauty College - Hemet,,2627 West Florida Avenue Suite 100,Hemet,CA,92545,6,8,...,1,348,-2,6065,Riverside County,636,-116.9999,33.746,-2,-2
6436,2020,496371,Elite Welding Academy,South Point,1910 County Road One,South Point,OH,45680,39,3,...,1,170,-2,39087,Lawrence County,3906,-82.594354,38.447233,217,2
6437,2020,496380,Medspa Academies - NIMA National Institute of ...,,3993 Howard Hughes Parkway Suite 150,Las Vegas,NV,89169,32,8,...,1,332,-2,32003,Clark County,3201,-115.158153,36.117261,-2,-2
6438,2020,496414,TechSherpas 365,,10213 Wilsky Blvd,Tampa,FL,33625,12,5,...,1,-2,-2,12057,Hillsborough County,1214,-82.565846,28.04245,-1,-1
6439,2020,496423,Zorganics Institute Beauty and Wellness,ZORGANICS INSTITUTE,410 WEST BAKERVIEW ROAD SUITE 112,Bellingham,WA,98226,53,8,...,1,-2,-2,53073,Whatcom County,5302,-122.49472,48.791194,206,2


In [14]:
# Select the columns to keep in the new DataFrame
keep_cols = ['Year', 'UNITID', 'INSTNM', 'CITY', 'STABBR', 'ZIP', 'CONTROL', 'HLOFFER', 'INSTSIZE', 'LONGITUD', 'LATITUDE']

# Create the new DataFrame with selected columns
IS2020_reduced_df = IS2020_df[keep_cols].copy()

# Review the first 5 rows of the new DataFrame
IS2020_reduced_df.head()

Unnamed: 0,Year,UNITID,INSTNM,CITY,STABBR,ZIP,CONTROL,HLOFFER,INSTSIZE,LONGITUD,LATITUDE
0,2020,100654,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
1,2020,100663,University of Alabama at Birmingham,Birmingham,AL,35294,1,9,5,-86.799345,33.505697
2,2020,100690,Amridge University,Montgomery,AL,36117,2,9,1,-86.17401,32.362609
3,2020,100706,University of Alabama in Huntsville,Huntsville,AL,35899,1,9,3,-86.640449,34.724557
4,2020,100724,Alabama State University,Montgomery,AL,36104,1,9,2,-86.295677,32.364317
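
Selecting columns with a list raises a `KeyError` if any requested name is missing from the table. A small guard can report the missing names first. This is a sketch on a made-up one-row table; `wanted_cols`, `missing_cols`, and `present_cols` are hypothetical names.

```python
import pandas as pd

# Made-up one-row table that lacks a ZIP column.
sample_df = pd.DataFrame(
    {
        "Year": [2020],
        "UNITID": [1],
        "INSTNM": ["A College"],
        "ADDR": ["1 Main St"],
    }
)
wanted_cols = ["Year", "UNITID", "INSTNM", "ZIP"]

# Requested columns that are absent from the table.
missing_cols = [col for col in wanted_cols if col not in sample_df.columns]

# Requested columns that exist, kept in the requested order.
present_cols = [col for col in wanted_cols if col in sample_df.columns]
reduced_df = sample_df[present_cols].copy()

print(missing_cols)              # ['ZIP']
print(list(reduced_df.columns))  # ['Year', 'UNITID', 'INSTNM']
```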


In [15]:
# print number of rows and columns in the original DataFrame (unchanged by the column selection)
print("Number of rows:", IS2020_df.shape[0])
print("Number of columns:", IS2020_df.shape[1])

Number of rows: 6440
Number of columns: 74


In [16]:
# review data types for the 11 retained columns
IS2020_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6440 entries, 0 to 6439
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6440 non-null   int64  
 1   UNITID    6440 non-null   int64  
 2   INSTNM    6440 non-null   object 
 3   CITY      6440 non-null   object 
 4   STABBR    6440 non-null   object 
 5   ZIP       6440 non-null   int32  
 6   CONTROL   6440 non-null   int64  
 7   HLOFFER   6440 non-null   int64  
 8   INSTSIZE  6440 non-null   int64  
 9   LONGITUD  6440 non-null   float64
 10  LATITUDE  6440 non-null   float64
dtypes: float64(2), int32(1), int64(5), object(3)
memory usage: 578.6+ KB


In [17]:
# Save the cleaned data to a new CSV file
IS2020_reduced_df.to_csv('Resources/IS2020_reduced.csv', index=False)

In [18]:
IS2020_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6440 entries, 0 to 6439
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6440 non-null   int64  
 1   UNITID    6440 non-null   int64  
 2   INSTNM    6440 non-null   object 
 3   CITY      6440 non-null   object 
 4   STABBR    6440 non-null   object 
 5   ZIP       6440 non-null   int32  
 6   CONTROL   6440 non-null   int64  
 7   HLOFFER   6440 non-null   int64  
 8   INSTSIZE  6440 non-null   int64  
 9   LONGITUD  6440 non-null   float64
 10  LATITUDE  6440 non-null   float64
dtypes: float64(2), int32(1), int64(5), object(3)
memory usage: 578.6+ KB


## Clean the Human Resources (HR) Data

In this section we move on to the human resources/staff data from `s2020_is_rv.csv`.

In [19]:
# Load the data into a Pandas DataFrame and assign it to the HR2020_df variable
HR2020_df = pd.read_csv('Resources/s2020_is_rv.csv', encoding='ISO-8859-1')
HR2020_df.head()

Unnamed: 0,Year,UNITID,SISCAT,FACSTAT,ARANK,XHRTOTLT,HRTOTLT,XHRTOTLM,HRTOTLM,XHRTOTLW,...,XHRUNKNM,HRUNKNM,XHRUNKNW,HRUNKNW,XHRNRALT,HRNRALT,XHRNRALM,HRNRALM,XHRNRALW,HRNRALW
0,2020,100654,1,0,0,R,253,R,137,R,...,R,1,Z,0,R,23,R,16,R,7
1,2020,100654,100,10,0,R,239,R,129,R,...,R,1,Z,0,R,20,R,15,R,5
2,2020,100654,101,10,1,R,52,R,40,R,...,Z,0,Z,0,R,3,R,3,Z,0
3,2020,100654,102,10,2,R,46,R,31,R,...,R,1,Z,0,R,3,R,1,R,2
4,2020,100654,103,10,3,R,94,R,43,R,...,Z,0,Z,0,R,10,R,7,R,3


In [20]:
# print number of rows and columns currently present before cleaning 
print("Number of rows:", HR2020_df.shape[0])
print("Number of columns:", HR2020_df.shape[1])

Number of rows: 63844
Number of columns: 65


In [21]:
# review data types for all 65 columns 
HR2020_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63844 entries, 0 to 63843
Data columns (total 65 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Year      63844 non-null  int64 
 1   UNITID    63844 non-null  int64 
 2   SISCAT    63844 non-null  int64 
 3   FACSTAT   63844 non-null  int64 
 4   ARANK     63844 non-null  int64 
 5   XHRTOTLT  63844 non-null  object
 6   HRTOTLT   63844 non-null  int64 
 7   XHRTOTLM  63844 non-null  object
 8   HRTOTLM   63844 non-null  int64 
 9   XHRTOTLW  63844 non-null  object
 10  HRTOTLW   63844 non-null  int64 
 11  XHRAIANT  63844 non-null  object
 12  HRAIANT   63844 non-null  int64 
 13  XHRAIANM  63844 non-null  object
 14  HRAIANM   63844 non-null  int64 
 15  XHRAIANW  63844 non-null  object
 16  HRAIANW   63844 non-null  int64 
 17  XHRASIAT  63844 non-null  object
 18  HRASIAT   63844 non-null  int64 
 19  XHRASIAM  63844 non-null  object
 20  HRASIAM   63844 non-null  int64 
 21  XHRASIAW  63

In [22]:
# Select the columns to keep in the new DataFrame
keep_cols = ['Year','UNITID','FACSTAT','ARANK','HRTOTLT','HRTOTLM','HRTOTLW']

# Create the new DataFrame with selected columns
HR2020_reduced_df = HR2020_df[keep_cols].copy()

# Review the first 5 rows of the new DataFrame
HR2020_reduced_df.head()

Unnamed: 0,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
0,2020,100654,0,0,253,137,116
1,2020,100654,10,0,239,129,110
2,2020,100654,10,1,52,40,12
3,2020,100654,10,2,46,31,15
4,2020,100654,10,3,94,43,51


In [23]:
# checking last 10 lines 
HR2020_reduced_df.tail(10)

Unnamed: 0,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
63834,2020,496043,42,0,1,1,0
63835,2020,496043,42,3,1,1,0
63836,2020,496043,44,0,40,24,16
63837,2020,496043,44,1,18,15,3
63838,2020,496043,44,2,9,5,4
63839,2020,496043,44,3,13,4,9
63840,2020,496043,45,0,1,0,1
63841,2020,496043,45,4,1,0,1
63842,2020,496326,0,0,3,1,2
63843,2020,496326,50,0,3,1,2
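
One quick sanity check on the reduced HR table is whether the men and women counts add up to the grand total. The sketch below runs that check on the first five rows shown in the `head()` output earlier; it assumes `HRTOTLT` is meant to equal `HRTOTLM + HRTOTLW`, and the names `hr_sample_df` and `mismatch_mask` are hypothetical.

```python
import pandas as pd

# First five rows of the reduced HR table, copied from the head() output.
hr_sample_df = pd.DataFrame(
    {
        "UNITID": [100654] * 5,
        "HRTOTLT": [253, 239, 52, 46, 94],
        "HRTOTLM": [137, 129, 40, 31, 43],
        "HRTOTLW": [116, 110, 12, 15, 51],
    }
)

# Flag rows where men + women does not equal the grand total.
mismatch_mask = (
    hr_sample_df["HRTOTLM"] + hr_sample_df["HRTOTLW"]
) != hr_sample_df["HRTOTLT"]

print("Rows with inconsistent totals:", int(mismatch_mask.sum()))
```

The same mask could be built on `HR2020_reduced_df` to list any rows that need a closer look.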


In [25]:
# Save the cleaned data to a new CSV file
HR2020_reduced_df.to_csv('Resources/HR2020_reduced.csv', index=False)

In [26]:
HR2020_reduced_df.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63844 entries, 0 to 63843
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Year     63844 non-null  int64
 1   UNITID   63844 non-null  int64
 2   FACSTAT  63844 non-null  int64
 3   ARANK    63844 non-null  int64
 4   HRTOTLT  63844 non-null  int64
 5   HRTOTLM  63844 non-null  int64
 6   HRTOTLW  63844 non-null  int64
dtypes: int64(7)
memory usage: 3.4 MB
