# Cleaning Institutional Characteristics (IS) Data

In this notebook, we will clean the Institutional Characteristics data from IPEDS and prepare it for analysis.

## Load the Data

We will start by loading the data into a Pandas DataFrame and assigning it to the `IS2021_df` variable.
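
As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of the real `Resources/hd2021.csv` file (the two sample rows are hypothetical stand-ins, and the real call also passes `encoding='ISO-8859-1'`). It also shows one optional variation that the notebook itself does not use: passing `dtype={"ZIP": str}` so that ZIP+4 suffixes and leading zeros survive the load.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Resources/hd2021.csv
sample_csv = io.StringIO(
    "UNITID,INSTNM,ZIP\n"
    "100663,University of Alabama at Birmingham,35294-0110\n"
    "497329,American Institute-Cherry Hill,08002\n"
)

# Reading ZIP as text keeps ZIP+4 suffixes and leading zeros intact
sample_df = pd.read_csv(sample_csv, dtype={"ZIP": str})
print(sample_df["ZIP"].tolist())
```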

## Clean the Data

Next, we will clean the data by dropping any duplicate rows, converting the ZIP code column to integers, and removing any leading or trailing whitespace in string columns.
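
The three cleaning steps can also be sketched as one small function. This is an illustrative variation, not the notebook's exact code: `clean_frame` is a hypothetical helper, the toy rows are made up, whitespace is stripped before duplicates are dropped, and ZIP is kept as five-character text rather than converted to integers.

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace, normalize ZIP to five characters, then drop duplicates."""
    out = df.copy()
    # Strip leading/trailing whitespace in text columns only
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col].dtype):
            out[col] = out[col].str.strip()
    # Keep the first five characters of ZIP as text, left-padding short values
    out["ZIP"] = out["ZIP"].astype(str).str.strip().str[:5].str.zfill(5)
    # Drop duplicates last so rows that differed only by whitespace collapse
    return out.drop_duplicates()

# Made-up rows: the first two differ only by surrounding whitespace
raw = pd.DataFrame({
    "INSTNM": [" Example College ", "Example College", "Sample Institute"],
    "ZIP": ["08002-1234", "08002-1234", "35762"],
})
cleaned = clean_frame(raw)
print(cleaned)
```

Stripping before de-duplicating is a design choice in this sketch: two rows that differ only by stray spaces are treated as the same row.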

Rows with missing data were not removed.
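
Since rows with missing data are kept, it can still be useful to see where the gaps are. A minimal sketch on a toy DataFrame (the values are illustrative, not taken from the file):

```python
import pandas as pd

# Toy frame with one missing alias, standing in for the full data set
toy_df = pd.DataFrame({
    "UNITID": [1, 2, 3],
    "IALIAS": ["AAMU", None, "UAB"],
})

# Count missing values per column without dropping any rows
missing_per_column = toy_df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```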

Columns that we do not need are removed. The columns kept from `IS2021_df` are:
- UNITID (the unique key for each university)
- INSTNM (institution (entity) name)
- CITY
- STABBR (state abbreviation)
- ZIP (ZIP code)
- CONTROL (institutional control or affiliation: public, private, for-profit, or religious affiliation)
- HLOFFER (filter for 4-year/Bachelor's degree or higher (5))
- INSTSIZE (institution size category based on total students enrolled)
- LONGITUD (longitude of the institution)
- LATITUDE (latitude of the institution)
- Year (added by the coders to the original data set)
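
The column selection above can be sketched with a small guard that reports any expected column missing from the input. `select_columns` is a hypothetical helper, and the single sample row reuses the first row shown in the notebook output.

```python
import pandas as pd

KEEP_COLS = ["Year", "UNITID", "INSTNM", "CITY", "STABBR", "ZIP",
             "CONTROL", "HLOFFER", "INSTSIZE", "LONGITUD", "LATITUDE"]

def select_columns(df: pd.DataFrame, keep_cols: list) -> pd.DataFrame:
    """Return a copy with only keep_cols, failing loudly if any are absent."""
    missing = [c for c in keep_cols if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[keep_cols].copy()

# One sample row plus an extra column (IALIAS) that should be dropped
sample = pd.DataFrame([{
    "Year": 2021, "UNITID": 100654, "INSTNM": "Alabama A & M University",
    "IALIAS": "AAMU", "CITY": "Normal", "STABBR": "AL", "ZIP": 35762,
    "CONTROL": 1, "HLOFFER": 9, "INSTSIZE": 3,
    "LONGITUD": -86.568502, "LATITUDE": 34.783368,
}])
reduced = select_columns(sample, KEEP_COLS)
print(list(reduced.columns))
```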


## Save the Cleaned Data

Finally, we will save the cleaned data to a new CSV file named `IS2021_reduced.csv` in the `Resources` folder.
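
A small round-trip check can confirm that the saved file reads back with the same columns. The sketch below writes to an in-memory buffer rather than to disk, and the one-row frame is illustrative; `index=False` keeps pandas from writing the row index as an extra unnamed column.

```python
import io
import pandas as pd

# Illustrative one-row frame standing in for the reduced DataFrame
reduced = pd.DataFrame({"UNITID": [497329], "ZIP": [8002]})

# Write without the index, then read the CSV text back in
buffer = io.StringIO()
reduced.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```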



In [1]:
# import dependencies 
import pandas as pd

In [2]:
# Load the data into a Pandas DataFrame and assign it to the IS2021_df variable
IS2021_df = pd.read_csv('Resources/hd2021.csv', encoding='ISO-8859-1')
IS2021_df.head()

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2021,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,109,1
1,2021,100663,University of Alabama at Birmingham,UAB,Administration Bldg Suite 1070,Birmingham,AL,35294-0110,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,93,1
2,2021,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117-3553,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,127,2
3,2021,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,93,2
4,2021,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104-0271,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,99,1


In [3]:
# print number of rows and columns currently present before cleaning 
print("Number of rows:", IS2021_df.shape[0])
print("Number of columns:", IS2021_df.shape[1])

Number of rows: 6289
Number of columns: 75


In [4]:
# review data types for all 75 columns
IS2021_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6289 entries, 0 to 6288
Data columns (total 75 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6289 non-null   int64  
 1   UNITID    6289 non-null   int64  
 2   INSTNM    6289 non-null   object 
 3   IALIAS    6288 non-null   object 
 4   ADDR      6289 non-null   object 
 5   CITY      6289 non-null   object 
 6   STABBR    6289 non-null   object 
 7   ZIP       6289 non-null   object 
 8   FIPS      6289 non-null   int64  
 9   OBEREG    6289 non-null   int64  
 10  CHFNM     6289 non-null   object 
 11  CHFTITLE  6289 non-null   object 
 12  GENTELE   6289 non-null   object 
 13  EIN       6289 non-null   int64  
 14  DUNS      6289 non-null   object 
 15  OPEID     6289 non-null   int64  
 16  OPEFLAG   6289 non-null   int64  
 17  WEBADDR   6289 non-null   object 
 18  ADMINURL  6289 non-null   object 
 19  FAIDURL   6289 non-null   object 
 20  APPLURL   6289 non-null   obje

In [5]:
# Drop any duplicate rows
IS2021_df.drop_duplicates(inplace=True)

# print number of rows and columns present after dropping duplicates
print("Number of rows:", IS2021_df.shape[0])
print("Number of columns:", IS2021_df.shape[1])

Number of rows: 6289
Number of columns: 75


In [6]:
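# Keep the first five digits of each ZIP code (dropping any ZIP+4 suffix) and convert to integers.
# Note: integers cannot store leading zeros, so they are lost (for example, a New Jersey ZIP shows up as 8002 in the output).
IS2021_df['ZIP'] = IS2021_df['ZIP'].apply(lambda x: int(str(x).zfill(5)[:5]))
IS2021_df['ZIP'] = IS2021_df['ZIP'].astype(int)

# Review data types for all 75 columns
IS2021_df.info()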

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6289 entries, 0 to 6288
Data columns (total 75 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6289 non-null   int64  
 1   UNITID    6289 non-null   int64  
 2   INSTNM    6289 non-null   object 
 3   IALIAS    6288 non-null   object 
 4   ADDR      6289 non-null   object 
 5   CITY      6289 non-null   object 
 6   STABBR    6289 non-null   object 
 7   ZIP       6289 non-null   int32  
 8   FIPS      6289 non-null   int64  
 9   OBEREG    6289 non-null   int64  
 10  CHFNM     6289 non-null   object 
 11  CHFTITLE  6289 non-null   object 
 12  GENTELE   6289 non-null   object 
 13  EIN       6289 non-null   int64  
 14  DUNS      6289 non-null   object 
 15  OPEID     6289 non-null   int64  
 16  OPEFLAG   6289 non-null   int64  
 17  WEBADDR   6289 non-null   object 
 18  ADMINURL  6289 non-null   object 
 19  FAIDURL   6289 non-null   object 
 20  APPLURL   6289 non-null   obje

In [7]:
IS2021_df.head()

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2021,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,109,1
1,2021,100663,University of Alabama at Birmingham,UAB,Administration Bldg Suite 1070,Birmingham,AL,35294,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,93,1
2,2021,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,127,2
3,2021,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,93,2
4,2021,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,99,1


In [8]:
print(IS2021_df.iloc[400:420])

     Year  UNITID                                   INSTNM  \
400  2021  120661        Concorde Career College-San Diego   
401  2021  120698                     Palo Alto University   
402  2021  120768                     Pacific Oaks College   
403  2021  120795               Pacific School of Religion   
404  2021  120838                Pacific States University   
405  2021  120865                    Pacific Union College   
406  2021  120883                University of the Pacific   
407  2021  120953                       Palo Verde College   
408  2021  120971                          Palomar College   
409  2021  121044                    Pasadena City College   
410  2021  121150                    Pepperdine University   
411  2021  121178  Peralta Community College System Office   
412  2021  121257                           Pitzer College   
413  2021  121275                  Platt College-San Diego   
414  2021  121309           Point Loma Nazarene University   
415  202

In [9]:
# First attempt (commented out, kept for reference): tried to keep the leading zeros of the ZIP code.
# Not a concern, since latitude and longitude will be used for the map.
# Remove the hyphen and last 4 digits from the ZIP code so it can be converted to an integer
# IS2021_df['ZIP'] = IS2021_df['ZIP'].apply(lambda x: int(str(x)[:5]))

In [10]:
# First attempt, continued (commented out, kept for reference)
# Convert the ZIP code column to integers
# IS2021_df['ZIP'] = IS2021_df['ZIP'].astype(int)

# review data types for all 75 columns
# IS2021_df.info()

In [11]:
# Remove any leading or trailing whitespace in string columns
IS2021_df = IS2021_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [12]:
# checking current df
IS2021_df.head()


Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
0,2021,100654,Alabama A & M University,AAMU,4900 Meridian Street,Normal,AL,35762,1,5,...,1,290,-2,1089,Madison County,105,-86.568502,34.783368,109,1
1,2021,100663,University of Alabama at Birmingham,UAB,Administration Bldg Suite 1070,Birmingham,AL,35294,1,5,...,1,142,-2,1073,Jefferson County,107,-86.799345,33.505697,93,1
2,2021,100690,Amridge University,Southern Christian University Regions University,1200 Taylor Rd,Montgomery,AL,36117,1,5,...,1,388,-2,1101,Montgomery County,102,-86.17401,32.362609,127,2
3,2021,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,301 Sparkman Dr,Huntsville,AL,35899,1,5,...,1,290,-2,1089,Madison County,105,-86.640449,34.724557,93,2
4,2021,100724,Alabama State University,,915 S Jackson Street,Montgomery,AL,36104,1,5,...,1,388,-2,1101,Montgomery County,107,-86.295677,32.364317,99,1


In [13]:
# checking last 10 lines 
IS2021_df.tail(10)

Unnamed: 0,Year,UNITID,INSTNM,IALIAS,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,...,CBSATYPE,CSA,NECTA,COUNTYCD,COUNTYNM,CNGDSTCD,LONGITUD,LATITUDE,DFRCGID,DFRCUSCG
6279,2021,497240,Shoreline Community College - CNC Machinists P...,,6737 Corson Avenue South Building A,Seattle,WA,98108,53,8,...,1,500,-2,53033,King County,5309,-122.326138,47.542457,-2,-2
6280,2021,497259,Shoreline Community College - Dental Hygiene,,1959 Northeast Pacific Street Magnuson HealthS...,Seattle,WA,98195,53,8,...,1,500,-2,53033,King County,5307,-122.309103,47.650612,-2,-2
6281,2021,497268,Arizona College of Nursing-Salt Lake City,,"434 West Ascension Way, Suite 122",Murray,UT,84123,49,7,...,1,482,-2,49035,Salt Lake County,4904,-111.903426,40.656372,202,2
6282,2021,497277,Arizona College-Glendale,,4425 West Olive Avenue Suite 300,Glendale,AZ,85302,4,6,...,1,429,-2,4013,Maricopa County,408,-112.154175,33.566263,220,2
6283,2021,497286,Universal Technical Institute-West Texas,,301 West Howard Lane,Austin,TX,78753,48,6,...,1,-2,-2,48453,Travis County,4817,-97.660482,30.415828,221,2
6284,2021,497301,Avalon Institute-Las Vegas,,"2650 South Decatur Boulevard Suites 1, 6, 8-10",Las Vegas,NV,89102,32,8,...,1,332,-2,32003,Clark County,3201,-115.206409,36.142355,45,2
6285,2021,497310,Medspa Academies-National Institute of Modern ...,,3993 Howard Hughes Parkway Suite 150,Las Vegas,NV,89169,32,8,...,1,332,-2,32003,Clark County,3201,-115.158173,36.117236,29,2
6286,2021,497329,American Institute-Cherry Hill,,2201 Route 38 8th Floor,Cherry Hill,NJ,8002,34,2,...,1,428,-2,34007,Camden County,3401,-75.015417,39.939428,57,2
6287,2021,497338,Glendale Career College-North-West College-Bak...,,3000 Ming Avenue,Bakersfield,CA,93304,6,8,...,1,-2,-2,6029,Kern County,623,-119.035082,35.339951,221,2
6288,2021,497347,University of Maine - Machias,,116 O'Brien Avenue,Machias,ME,4654,23,1,...,-2,-2,-2,23029,Washington County,2302,-67.458025,44.708803,-2,-2


In [14]:
# Select the columns to keep in the new DataFrame
keep_cols = ['Year', 'UNITID', 'INSTNM', 'CITY', 'STABBR', 'ZIP', 'CONTROL', 'HLOFFER', 'INSTSIZE', 'LONGITUD', 'LATITUDE']

# Create the new DataFrame with selected columns
IS2021_reduced_df = IS2021_df[keep_cols].copy()

# Review the first 5 rows of the new DataFrame
IS2021_reduced_df.head()

Unnamed: 0,Year,UNITID,INSTNM,CITY,STABBR,ZIP,CONTROL,HLOFFER,INSTSIZE,LONGITUD,LATITUDE
0,2021,100654,Alabama A & M University,Normal,AL,35762,1,9,3,-86.568502,34.783368
1,2021,100663,University of Alabama at Birmingham,Birmingham,AL,35294,1,9,5,-86.799345,33.505697
2,2021,100690,Amridge University,Montgomery,AL,36117,2,9,1,-86.17401,32.362609
3,2021,100706,University of Alabama in Huntsville,Huntsville,AL,35899,1,9,3,-86.640449,34.724557
4,2021,100724,Alabama State University,Montgomery,AL,36104,1,9,2,-86.295677,32.364317


In [15]:
# print number of rows and columns in the full DataFrame (unchanged by the column selection)
print("Number of rows:", IS2021_df.shape[0])
print("Number of columns:", IS2021_df.shape[1])

Number of rows: 6289
Number of columns: 75


In [16]:
# review data types for the 11 columns kept in the reduced DataFrame
IS2021_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6289 entries, 0 to 6288
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6289 non-null   int64  
 1   UNITID    6289 non-null   int64  
 2   INSTNM    6289 non-null   object 
 3   CITY      6289 non-null   object 
 4   STABBR    6289 non-null   object 
 5   ZIP       6289 non-null   int32  
 6   CONTROL   6289 non-null   int64  
 7   HLOFFER   6289 non-null   int64  
 8   INSTSIZE  6289 non-null   int64  
 9   LONGITUD  6289 non-null   float64
 10  LATITUDE  6289 non-null   float64
dtypes: float64(2), int32(1), int64(5), object(3)
memory usage: 565.0+ KB


In [17]:
# Save the cleaned data to a new CSV file
IS2021_reduced_df.to_csv('Resources/IS2021_reduced.csv', index=False)

In [19]:
IS2021_reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6289 entries, 0 to 6288
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      6289 non-null   int64  
 1   UNITID    6289 non-null   int64  
 2   INSTNM    6289 non-null   object 
 3   CITY      6289 non-null   object 
 4   STABBR    6289 non-null   object 
 5   ZIP       6289 non-null   int32  
 6   CONTROL   6289 non-null   int64  
 7   HLOFFER   6289 non-null   int64  
 8   INSTSIZE  6289 non-null   int64  
 9   LONGITUD  6289 non-null   float64
 10  LATITUDE  6289 non-null   float64
dtypes: float64(2), int32(1), int64(5), object(3)
memory usage: 565.0+ KB
