# Cleaning Institutional Characteristics (IS) Data

In this notebook, we will clean the Institutional Characteristics data from IPEDS and prepare it for analysis.

## Load the Data

We will start by loading the two source files into Pandas DataFrames: `hd2019.csv` is assigned to the `IS2019_df` variable and `s2019_is_rv.csv` is assigned to the `HR2019_df` variable.

## Clean the Data

Next, we will clean the data by dropping any duplicate rows, converting the ZIP code column to integers, and removing any leading or trailing whitespace in string columns.

Rows with missing data were not removed.
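
The cleaning steps described above can be sketched on a small made-up DataFrame. The values and the name `toy_df` are illustrative only and are not taken from the IPEDS files; this is a minimal sketch of the approach, not the notebook's exact code.

```python
import pandas as pd

# Made-up rows standing in for the IPEDS file (illustrative values only)
toy_df = pd.DataFrame({
    "UNITID": [1, 1, 2],
    "INSTNM": ["  Alpha College ", "  Alpha College ", "Beta University"],
    "IALIAS": [None, None, "BU"],
    "CITY": ["Normal", "Normal", " Montgomery"],
})

# Drop exact duplicate rows (the first two rows are identical)
toy_df = toy_df.drop_duplicates()

# Strip leading/trailing whitespace in the text columns
for col in ["INSTNM", "CITY"]:
    toy_df[col] = toy_df[col].str.strip()

# Rows with missing values (here, a missing IALIAS) are kept
print(toy_df)
```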

Columns that we do not need are removed. The columns kept for `IS2019_df` are:
- UNITID (the unique key for each university)
- INSTNM (institution (entity) name)
- CITY
- STABBR (state abbreviation)
- ZIP (ZIP code)
- CONTROL (institutional control or affiliation: public, private, for-profit, or religious affiliation)
- HLOFFER (used to filter for 4-year / Bachelor's degree or higher (5))
- INSTSIZE (institution size category based on total students enrolled)
- LONGITUD (longitude of the institution)
- LATITUDE (latitude of the institution)
- Year (added by the coders to the original data set)
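
As a rough illustration of how two of these columns might be used downstream, the sketch below checks that UNITID is unique and filters on HLOFFER. It assumes that HLOFFER codes are ordered so that values of 5 and above mean Bachelor's degree or higher; the rows, IDs, and the name `sample_df` are hypothetical.

```python
import pandas as pd

# Hypothetical institutions; IDs and codes are made up for illustration
sample_df = pd.DataFrame({
    "UNITID": [900001, 900002, 900003],
    "INSTNM": ["Example University A", "Example College B", "Example Community College C"],
    "HLOFFER": [9, 5, 3],
})

# UNITID is meant to be a unique key, so each value should appear once
unitid_is_unique = sample_df["UNITID"].is_unique

# Assumption: HLOFFER >= 5 selects Bachelor's degree or higher
four_year_df = sample_df[sample_df["HLOFFER"] >= 5]

print(unitid_is_unique)
print(four_year_df)
```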

The columns kept for `HR2019_df` are:

- UNITID (unique ID)
- FACSTAT - Faculty and tenure status
- ARANK - Academic rank (Professor, Associate Professor, Assistant Professor, Instructor, Lecturer, No academic rank)
- HRTOTLT - Grand total
- HRTOTLM - Grand total men
- HRTOTLW - Grand total women
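
One quick sanity check on these three totals is that the grand total should equal men plus women. The sketch below uses the first five rows of the HR file as they appear in the output later in this notebook; the name `hr_sample_df` is introduced here for illustration.

```python
import pandas as pd

# First five rows of the reduced HR data, copied from the notebook output
hr_sample_df = pd.DataFrame({
    "UNITID": [100654] * 5,
    "FACSTAT": [0, 10, 10, 10, 10],
    "ARANK": [0, 0, 1, 2, 3],
    "HRTOTLT": [242, 242, 49, 50, 98],
    "HRTOTLM": [131, 131, 37, 34, 46],
    "HRTOTLW": [111, 111, 12, 16, 52],
})

# Rows where the grand total does not equal men + women
expected_total = hr_sample_df["HRTOTLM"] + hr_sample_df["HRTOTLW"]
mismatch_df = hr_sample_df[hr_sample_df["HRTOTLT"] != expected_total]

print("Rows with inconsistent totals:", len(mismatch_df))
```

The same comparison could be run on the full DataFrame before saving, to flag any rows that need a closer look.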


## Save the Cleaned Data

Finally, we will save the cleaned data to new CSV files named `IS2019_reduced.csv` and `HR2019_reduced.csv`.
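
When the saved file is read back later, it may help to read ZIP as text, because `read_csv` otherwise infers a numeric type for purely numeric values. The sketch below writes a one-row made-up frame to a temporary directory (not to `Resources/`) and compares the two ways of reading it.

```python
import os
import tempfile

import pandas as pd

# One made-up row with a ZIP code that starts with zeros
reduced_df = pd.DataFrame({"UNITID": [900001], "ZIP": ["00669"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "IS2019_reduced.csv")

    # index=False keeps the row index out of the file
    reduced_df.to_csv(out_path, index=False)

    # Default read: ZIP is inferred as an integer, so the zeros are lost
    as_default = pd.read_csv(out_path)

    # Reading ZIP as text keeps the value exactly as written
    as_text = pd.read_csv(out_path, dtype={"ZIP": str})

print(as_default["ZIP"].iloc[0])
print(as_text["ZIP"].iloc[0])
```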



In [4]:
# import dependencies 
import pandas as pd

In [5]:
# Load the data into a Pandas DataFrame and assign it to the IS2019_df variable
IS2019_df = pd.read_csv('Resources/hd2019.csv', encoding='ISO-8859-1')
IS2019_df.head()

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2019,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,112,1
1,2019,100663,University of Alabama at Birmingham,,Administration Bldg Suite 1070,Birmingham,AL,35294-0110,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,98,1
2,2019,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117-3553,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,129,2
3,2019,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,102,2
4,2019,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104-0271,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,120,1


In [6]:
# print number of rows and columns currently present before cleaning 
print("Number of rows:", IS2019_df.shape[0])
print("Number of columns:", IS2019_df.shape[1])

Number of rows: 6559
Number of columns: 74


In [7]:
# review data types for all 74 columns 
IS2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6559 entries, 0 to 6558
Data columns (total 74 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6559 non-null   int64  
 1   UNITID    6559 non-null   int64  
 2   INSTNM    6559 non-null   object 
 3   IALIAS    6558 non-null   object 
 4   ADDR      6559 non-null   object 
 5   CITY      6559 non-null   object 
 6   STABBR    6559 non-null   object 
 7   ZIP       6559 non-null   object 
 8   FIPS      6559 non-null   int64  
 9   OBEREG    6559 non-null   int64  
 10  CHFNM     6559 non-null   object 
 11  CHFTITLE  6559 non-null   object 
 12  GENTELE   6559 non-null   object 
 13  EIN       6559 non-null   int64  
 14  DUNS      6559 non-null   object 
 15  OPEID     6559 non-null   int64  
 16  OPEFLAG   6559 non-null   int64  
 17  WEBADDR   6559 non-null   object 
 18  ADMINURL  6559 non-null   object 
 19  FAIDURL   6559 non-null   object 
 20  APPLURL   6559 non-null   obje

In [9]:
# Drop any duplicate rows
IS2019_df.drop_duplicates(inplace=True)

# print number of rows and columns present after dropping duplicates
print("Number of rows:", IS2019_df.shape[0])
print("Number of columns:", IS2019_df.shape[1])

Number of rows: 6559
Number of columns: 74


In [8]:
# Convert the ZIP code column to integers, keeping only the first five digits.
# Note: integer conversion cannot retain leading zeros (e.g. 00669 becomes 669).
IS2019_df['ZIP'] = IS2019_df['ZIP'].apply(lambda x: int(str(x).zfill(5)[:5]))
IS2019_df['ZIP'] = IS2019_df['ZIP'].astype(int)

# Review data types for all 74 columns
IS2019_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6440 entries, 0 to 6439
Data columns (total 74 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6440 non-null   int64  
 1   UNITID    6440 non-null   int64  
 2   INSTNM    6440 non-null   object 
 3   IALIAS    6439 non-null   object 
 4   ADDR      6439 non-null   object 
 5   CITY      6440 non-null   object 
 6   STABBR    6440 non-null   object 
 7   ZIP       6440 non-null   int32  
 8   FIPS      6440 non-null   int64  
 9   OBEREG    6440 non-null   int64  
 10  CHFNM     6440 non-null   object 
 11  CHFTITLE  6440 non-null   object 
 12  GENTELE   6440 non-null   object 
 13  EIN       6440 non-null   int64  
 14  DUNS      6440 non-null   object 
 15  OPEID     6440 non-null   int64  
 16  OPEFLAG   6440 non-null   int64  
 17  WEBADDR   6440 non-null   object 
 18  ADMINURL  6440 non-null   object 
 19  FAIDURL   6440 non-null   object 
 20  APPLURL   6440 non-null   obje
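
If leading zeros need to be preserved, one alternative to the integer conversion is to keep ZIP as a five-character string. The following is a minimal sketch of that idea, not the approach used in this notebook; `zip_df`, `ZIP5`, and `ZIP_INT` are names introduced here for illustration, and the sample values follow the ZIP formats visible in the outputs.

```python
import pandas as pd

# Sample ZIP values in the formats that appear in this notebook's outputs
zip_df = pd.DataFrame({"ZIP": ["35762", "35294-0110", "00669-0000"]})

# Keep the first five characters as text so leading zeros survive
zip_df["ZIP5"] = zip_df["ZIP"].astype(str).str[:5].str.zfill(5)

# For contrast: converting to integer drops the leading zeros
zip_df["ZIP_INT"] = zip_df["ZIP5"].astype(int)

print(zip_df)
```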

In [10]:
IS2019_df.head()

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2019,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,112,1
1,2019,100663,University of Alabama at Birmingham,,Administration Bldg Suite 1070,Birmingham,AL,35294-0110,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,98,1
2,2019,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117-3553,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,129,2
3,2019,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,102,2
4,2019,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104-0271,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,120,1


In [11]:
print(IS2019_df.iloc[400:420])

     Year  UNITID                                          INSTNM  \
400  2019  120023  North Orange County Community College District   
401  2019  120069                       North-West College-Pomona   
402  2019  120078                  North-West College-West Covina   
403  2019  120087                     North-West College-Van Nuys   
404  2019  120166             Northwestern Polytechnic University   
405  2019  120184                  Notre Dame de Namur University   
406  2019  120254                              Occidental College   
407  2019  120290                                  Ohlone College   
408  2019  120342                            Orange Coast College   
409  2019  120403                  Otis College of Art and Design   
410  2019  120421                                  Oxnard College   
411  2019  120537                   Hope International University   
412  2019  120661               Concorde Career College-San Diego   
413  2019  120698                 

In [12]:
# Remove any leading or trailing whitespace in string columns
IS2019_df = IS2019_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [13]:
# checking current df
IS2019_df.head()


Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2019,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,112,1
1,2019,100663,University of Alabama at Birmingham,,Administration Bldg Suite 1070,Birmingham,AL,35294-0110,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,98,1
2,2019,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117-3553,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,129,2
3,2019,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,102,2
4,2019,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104-0271,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,120,1


In [14]:
# checking last 10 lines 
IS2019_df.tail(10)

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
6549,2019,494834,Educational Technical College,EDUTEC,"Calle Albizu Campos #65, Interior",Lares,PR,00669-0000,72,9,...,1,-2,-2,72081,Lares Municipio,7298,-66.877248,18.294782,234,2
6550,2019,494843,Fortis College-Landover,,4351 Garden City Drive,Landover,MD,20785-2223,24,2,...,1,548,-2,24033,Prince George's County,2404,-76.866209,38.949983,227,2
6551,2019,494852,Stautzenberger College-Rockford Career College,,1130 South Alpine Road Suite 100,Rockford,IL,61108-3900,17,3,...,1,466,-2,17201,Winnebago County,1716,-89.027291,42.255903,227,2
6552,2019,494861,CUNY Brooklyn College - Feirstein Graduate Sch...,,25 Washington Avenue,Brooklyn,NY,11205-1202,36,2,...,1,408,-2,36047,Kings County,3607,-73.967388,40.698507,-2,-2
6553,2019,494870,Rabbinical Seminary of America - Ma'yan HaTorah,,113-25 Myrtle Avenue,Richmond Hill,NY,11418-1316,36,2,...,1,408,-2,36081,Queens County,3605,-73.836262,40.70069,-2,-2
6554,2019,494889,Baker College - Flint,,1050 West Bristol Road,Flint,MI,48507-5508,26,3,...,1,220,-2,26049,Genesee County,2605,-83.697246,42.975177,-2,-2
6555,2019,494898,WellSpring School of Allied Health-Wichita,,"650 N Carriage Parkway, Ste 55",Wichita,KS,67208-4501,20,4,...,1,556,-2,20173,Sedgwick County,2004,-97.442623,37.673562,228,2
6556,2019,494904,Access Careers-Islandia,,"1930 Veterans Highway, Suite 10",Islandia,NY,11749-1599,36,2,...,1,408,-2,36103,Suffolk County,3602,-73.175617,40.800874,61,2
6557,2019,494913,Franciscan School of Theology - San Diego,,5998 Alcala Park,San Diego,CA,92110-2492,6,8,...,1,-2,-2,6073,San Diego County,652,-117.191964,32.772407,-2,-2
6558,2019,494922,University of Montana (The) - Bitterroot Colle...,,103 South 9th Street,Hamilton,MT,59840-3213,30,7,...,-2,-2,-2,30081,Ravalli County,3000,-114.168368,46.246005,-2,-2


In [15]:
# Select the columns to keep in the new DataFrame
keep_cols = ['Year', 'UNITID', 'INSTNM', 'CITY', 'STABBR', 'ZIP', 'CONTROL', 'HLOFFER', 'INSTSIZE', 'LONGITUD', 'LATITUDE']

# Create the new DataFrame with selected columns
IS2019_reduced_df = IS2019_df[keep_cols].copy()

# Review the first 5 rows of the new DataFrame
IS2019_reduced_df.head()

Unnamed: 0,Year,UNITID,INSTNM,CITY,STABBR,ZIP,CONTROL,HLOFFER,INSTSIZE,LONGITUD,LATITUDE
0,2019,100654,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
1,2019,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,1,9,5,-86.799345,33.505697
2,2019,100690,Amridge University,Montgomery,AL,36117-3553,2,9,1,-86.17401,32.362609
3,2019,100706,University of Alabama in Huntsville,Huntsville,AL,35899,1,9,3,-86.640449,34.724557
4,2019,100724,Alabama State University,Montgomery,AL,36104-0271,1,9,2,-86.295677,32.364317


In [17]:
# print number of rows and columns present after selecting columns
print("Number of rows:", IS2019_reduced_df.shape[0])
print("Number of columns:", IS2019_reduced_df.shape[1])

Number of rows: 6559
Number of columns: 11


In [18]:
# review data types for the 11 remaining columns
IS2019_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6559 entries, 0 to 6558
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6559 non-null   int64  
 1   UNITID    6559 non-null   int64  
 2   INSTNM    6559 non-null   object 
 3   CITY      6559 non-null   object 
 4   STABBR    6559 non-null   object 
 5   ZIP       6559 non-null   object 
 6   CONTROL   6559 non-null   int64  
 7   HLOFFER   6559 non-null   int64  
 8   INSTSIZE  6559 non-null   int64  
 9   LONGITUD  6559 non-null   float64
 10  LATITUDE  6559 non-null   float64
dtypes: float64(2), int64(5), object(4)
memory usage: 614.9+ KB


In [19]:
# Save the cleaned data to a new CSV file
IS2019_reduced_df.to_csv('Resources/IS2019_reduced.csv', index=False)

In [20]:
IS2019_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6559 entries, 0 to 6558
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6559 non-null   int64  
 1   UNITID    6559 non-null   int64  
 2   INSTNM    6559 non-null   object 
 3   CITY      6559 non-null   object 
 4   STABBR    6559 non-null   object 
 5   ZIP       6559 non-null   object 
 6   CONTROL   6559 non-null   int64  
 7   HLOFFER   6559 non-null   int64  
 8   INSTSIZE  6559 non-null   int64  
 9   LONGITUD  6559 non-null   float64
 10  LATITUDE  6559 non-null   float64
dtypes: float64(2), int64(5), object(4)
memory usage: 614.9+ KB


# In this section we move on to the human resources/staff data from s2019_is_rv.csv

In [21]:
# Load the data into a Pandas DataFrame and assign it to the HR2019_df variable
HR2019_df = pd.read_csv('Resources/s2019_is_rv.csv', encoding='ISO-8859-1')
HR2019_df.head()

Unnamed: 0,Year,UNITID,SISCAT,FACSTAT,ARANK,XHRTOTLT,HRTOTLT,XHRTOTLM,HRTOTLM,XHRTOTLW,...,XHRUNKNM,HRUNKNM,XHRUNKNW,HRUNKNW,XHRNRALT,HRNRALT,XHRNRALM,HRNRALM,XHRNRALW,HRNRALW
0,2019,100654,1,0,0,R,242,R,131,R,...,Z,0,Z,0,R,18,R,13,R,5
1,2019,100654,100,10,0,R,242,R,131,R,...,Z,0,Z,0,R,18,R,13,R,5
2,2019,100654,101,10,1,R,49,R,37,R,...,Z,0,Z,0,R,2,R,2,Z,0
3,2019,100654,102,10,2,R,50,R,34,R,...,Z,0,Z,0,R,4,R,2,R,2
4,2019,100654,103,10,3,R,98,R,46,R,...,Z,0,Z,0,R,8,R,5,R,3


In [22]:
# print number of rows and columns currently present before cleaning 
print("Number of rows:", HR2019_df.shape[0])
print("Number of columns:", HR2019_df.shape[1])

Number of rows: 65135
Number of columns: 65


In [23]:
# review data types for all 65 columns 
HR2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65135 entries, 0 to 65134
Data columns (total 65 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Year      65135 non-null  int64 
 1   UNITID    65135 non-null  int64 
 2   SISCAT    65135 non-null  int64 
 3   FACSTAT   65135 non-null  int64 
 4   ARANK     65135 non-null  int64 
 5   XHRTOTLT  65135 non-null  object
 6   HRTOTLT   65135 non-null  int64 
 7   XHRTOTLM  65135 non-null  object
 8   HRTOTLM   65135 non-null  int64 
 9   XHRTOTLW  65135 non-null  object
 10  HRTOTLW   65135 non-null  int64 
 11  XHRAIANT  65135 non-null  object
 12  HRAIANT   65135 non-null  int64 
 13  XHRAIANM  65135 non-null  object
 14  HRAIANM   65135 non-null  int64 
 15  XHRAIANW  65135 non-null  object
 16  HRAIANW   65135 non-null  int64 
 17  XHRASIAT  65135 non-null  object
 18  HRASIAT   65135 non-null  int64 
 19  XHRASIAM  65135 non-null  object
 20  HRASIAM   65135 non-null  int64 
 21  XHRASIAW  65

In [24]:
# Select the columns to keep in the new DataFrame
keep_cols = ['Year','UNITID','FACSTAT','ARANK','HRTOTLT','HRTOTLM','HRTOTLW']

# Create the new DataFrame with selected columns
HR2019_reduced_df = HR2019_df[keep_cols].copy()

# Review the first 5 rows of the new DataFrame
HR2019_reduced_df.head()

Unnamed: 0,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
0,2019,100654,0,0,242,131,111
1,2019,100654,10,0,242,131,111
2,2019,100654,10,1,49,37,12
3,2019,100654,10,2,50,34,16
4,2019,100654,10,3,98,46,52


In [25]:
# checking last 10 lines 
HR2019_reduced_df.tail(10)

Unnamed: 0,Year,UNITID,FACSTAT,ARANK,HRTOTLT,HRTOTLM,HRTOTLW
65125,2019,494843,45,4,20,5,15
65126,2019,494852,0,0,25,8,17
65127,2019,494852,10,0,25,8,17
65128,2019,494852,10,6,25,8,17
65129,2019,494852,40,0,25,8,17
65130,2019,494852,40,6,25,8,17
65131,2019,494852,41,0,25,8,17
65132,2019,494852,41,6,25,8,17
65133,2019,494852,45,0,25,8,17
65134,2019,494852,45,6,25,8,17


In [26]:
# Save the cleaned data to a new CSV file
HR2019_reduced_df.to_csv('Resources/HR2019_reduced.csv', index=False)

In [27]:
HR2019_reduced_df.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65135 entries, 0 to 65134
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Year     65135 non-null  int64
 1   UNITID   65135 non-null  int64
 2   FACSTAT  65135 non-null  int64
 3   ARANK    65135 non-null  int64
 4   HRTOTLT  65135 non-null  int64
 5   HRTOTLM  65135 non-null  int64
 6   HRTOTLW  65135 non-null  int64
dtypes: int64(7)
memory usage: 3.5 MB
