## Cleaning Enrollments/Pathways

#### Notes
This dataset will be useful for further insight into applicant demographics.

In [42]:
# importing the necessary modules
import pandas as pd
import numpy as np

In [43]:
df = pd.read_excel('data/ARC Application.xlsx')

df.head()

Unnamed: 0,KY Region,Contact: Auto Id,Contact: Unique ID SSN,Contact: SSN Opt Out,Contact: Mailing State/Province,Contact: County,Contact: Mailing Zip/Postal Code,Contact: Birthdate,Contact: Gender,Disability,...,Displaced Homemaker,Spouse of Armed Forces Reduced Income,Loss of Family Support,Seasonal farm worker?,Contact: First Name,Contact: Last Name,Status,Date Completed,Assessment: Created Date,Contact: Approval Status
0,SOAR,202109-5224,,0,KY,Knox,40906,1981-10-25,Female,No,...,,,,,name,name,Accepted - Prework Complete,2021-08-24,2021-09-10,
1,SOAR,202109-5230,,0,KY,Perry,41773,2000-10-28,Prefer not to say,No,...,,,,,name,name,Accepted - Prework Complete,2021-08-25,2021-09-10,
2,SOAR,202109-5233,,0,KY,Morgan,41472,1986-11-05,Male,Yes,...,,,,,name,name,Accepted - Prework Complete,2021-08-24,2021-09-10,
3,SOAR,202109-5236,,0,KY,Floyd,41669,1978-01-22,Female,No,...,Yes,No,No,No,name,name,Accepted - Prework Complete,2021-08-24,2021-09-10,
4,SOAR,202109-5237,,0,KY,Knox,40906,1992-04-16,Male,No,...,,,,,name,name,Accepted - Prework Complete,2021-08-24,2021-09-10,


#### Preliminary Thoughts
At first glance, this dataset has many columns and many empty values. It should be useful for a demographic view of exactly who is applying to the CODE:You program. Consider merging it with the 'all demographics and programs' DataFrame for further insight into demographics that aren't necessarily recorded there.
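The merge suggested above could look roughly like the sketch below. Everything in it is a toy stand-in: the IDs and the `Pathway` column are invented, and using `Auto ID` as the join key is an assumption, since the 'all demographics and programs' table isn't shown here.

```python
import pandas as pd

# Toy stand-ins for the two tables; real column names and values may differ.
applications = pd.DataFrame({
    "Auto ID": ["A-001", "A-002", "A-003"],
    "County": ["Knox", "Perry", "Morgan"],
    "Disability": ["No", "No", "Yes"],
})
demographics = pd.DataFrame({
    "Auto ID": ["A-001", "A-003", "A-999"],
    "Pathway": ["Pathway X", "Pathway Y", "Pathway X"],  # hypothetical column
})

# A left join keeps every applicant, matched or not.
# indicator=True adds a `_merge` column showing where each row came from.
merged = applications.merge(demographics, on="Auto ID", how="left", indicator=True)

# Applicants with no counterpart in the demographics table.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched["Auto ID"].tolist())
```

A left join is used so that applicants who never enrolled still appear in the result. The `_merge` column makes it easy to count how many of them failed to match.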

In [44]:
check_cols = df.columns

print(check_cols)

Index(['KY Region', 'Contact: Auto Id', 'Contact: Unique ID SSN',
       'Contact: SSN Opt Out', 'Contact: Mailing State/Province',
       'Contact: County', 'Contact: Mailing Zip/Postal Code',
       'Contact: Birthdate', 'Contact: Gender', 'Disability',
       'Primary Disability', 'Contact: Race', 'Contact: Veteran',
       'Employment Status', 'Unemployment Insurance', 'Long-term unemployment',
       'Highest level of education completed', 'Enrolled in school?', 'TANF',
       'Within Two Years of Exhausting TANF?', 'SSI', 'SNAP', 'Homeless',
       'What is your housing situation?', 'Justice Involvement', 'Low Income',
       'English Language Learner', 'Comfortably Read and Write in English?',
       'Cultural barriers?', 'Single Parent', 'Displaced Homemaker',
       'Spouse of Armed Forces Reduced Income', 'Loss of Family Support',
       'Seasonal farm worker?', 'Contact: First Name', 'Contact: Last Name',
       'Status', 'Date Completed', 'Assessment: Created Date',
       

In [45]:
df.count() # seeing how full each column is

KY Region                                 2476
Contact: Auto Id                          2476
Contact: Unique ID SSN                       0
Contact: SSN Opt Out                      2476
Contact: Mailing State/Province           2476
Contact: County                           2475
Contact: Mailing Zip/Postal Code          2476
Contact: Birthdate                        2476
Contact: Gender                           2476
Disability                                2475
Primary Disability                         214
Contact: Race                             2476
Contact: Veteran                          2475
Employment Status                         2475
Unemployment Insurance                    2476
Long-term unemployment                     283
Highest level of education completed      2475
Enrolled in school?                       2475
TANF                                      2476
Within Two Years of Exhausting TANF?       281
SSI                                       2476
SNAP         
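
Raw counts are easier to compare once they are expressed as a share of the rows. A minimal sketch on an invented toy frame (column names borrowed from above, values made up):

```python
import numpy as np
import pandas as pd

# Toy frame: one fairly full column, one sparse column, one empty column.
toy = pd.DataFrame({
    "County": ["Knox", "Perry", None, "Floyd"],
    "Primary Disability": [None, None, None, "Vision"],
    "Contact: Unique ID SSN": [np.nan, np.nan, np.nan, np.nan],
})

# notna() gives True/False per cell; the column mean is the share of filled cells.
fill_rate = toy.notna().mean().sort_values()

print(fill_rate)
```

Sorting puts the emptiest columns first, which is where candidates for dropping tend to show up.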

#### Unique elements

Next, we'll count the unique values in each column to weed through the data further. If a column has one or fewer, it can be dropped and mentioned in the notes.

In [46]:
unique_elements = df.nunique() # finding the unique elements per column

print(unique_elements)

KY Region                                    1
Contact: Auto Id                          2013
Contact: Unique ID SSN                       0
Contact: SSN Opt Out                         2
Contact: Mailing State/Province              3
Contact: County                             59
Contact: Mailing Zip/Postal Code           329
Contact: Birthdate                        1819
Contact: Gender                              6
Disability                                   3
Primary Disability                           8
Contact: Race                               24
Contact: Veteran                             2
Employment Status                            6
Unemployment Insurance                       2
Long-term unemployment                       2
Highest level of education completed        13
Enrolled in school?                          2
TANF                                         2
Within Two Years of Exhausting TANF?         3
SSI                                          2
SNAP         

#### Prep to clean

A few columns contain only one unique value, so they don't need to stay in the dataset. They will be removed and recorded in a note instead:
* 'KY Region': every value is 'SOAR', so only a note is needed.
* 'Seasonal farm worker?': the only recorded answer is 'No', so it will be removed.
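
One caveat when picking these columns: `nunique()` ignores missing values by default, so a column that reads "only 'No'" may really be "'No' or blank". A small sketch on an invented toy frame shows one way to list the single-value columns and check for that case:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two columns noted above, plus one that varies.
toy = pd.DataFrame({
    "KY Region": ["SOAR", "SOAR", "SOAR", "SOAR"],
    "Seasonal farm worker?": [np.nan, "No", "No", np.nan],
    "Gender": ["Female", "Male", "Male", "Female"],
})

n_values = toy.nunique()                     # NaN is not counted
n_with_missing = toy.nunique(dropna=False)   # NaN counts as its own value

# Columns with one or fewer distinct recorded values.
constant_cols = n_values[n_values <= 1].index.tolist()

print(constant_cols)
print(toy["Seasonal farm worker?"].value_counts(dropna=False))
```

Here 'KY Region' is constant under both counts, while 'Seasonal farm worker?' goes from 1 to 2 once blanks are counted. That difference is worth recording in the note before the column is dropped.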

In [47]:
def df_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    columnsToRemove = [ # dropping columns of little consequence to any metric we might be exploring
        'KY Region',
        'Contact: Unique ID SSN',
        'Contact: SSN Opt Out',
        'Contact: Mailing Zip/Postal Code',
        'Seasonal farm worker?',
        'Contact: First Name',
        'Contact: Last Name',
        'Status',
        'Assessment: Created Date',
        'Contact: Approval Status'
    ]

    df = df.rename(columns={
        'Contact: Auto Id': 'Auto ID',
        'Contact: County': 'County',
        'Contact: Mailing State/Province': 'State',
        'Contact: Birthdate': 'Birthdate',
        'Contact: Gender': 'Gender',
        'Contact: Race': 'Race',
        'Contact: Veteran': 'Veteran'
    })

    df['State'] = ( # cleaning up the 'State' column so we can better look at demographics
        df['State']
            .str.strip() # removing whitespace
            .str.upper() # uppercase for ease of removal and change
            .replace({'KY': 'Kentucky'}) # specifically targeting the few cells that're 'KY'
            .str.title() # turning the caps back to title
    )

    df = df.drop(columnsToRemove, axis=1)

    min_pct = 0.01 # minimum share of non-null values a column needs in order to be kept
    df = df.dropna(axis=1, thresh=int(min_pct * len(df))) # precautionary 'thresh'old; runs before the fill below, because afterwards no cell is NaN

    fillElement = "Not Provided"
    df = df.replace(np.nan, fillElement) # changing 'NaN' to "Not Provided"

    df = df.drop_duplicates(subset=["Auto ID", "Date Completed"], keep="first", ignore_index=True) # being mindful that some students come back to take part in other cohorts

    return df
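
The `thresh` argument of `dropna` counts non-null cells, so its position relative to the fill step matters. The toy sketch below (invented data, with an arbitrary threshold of 2) compares the two orderings:

```python
import numpy as np
import pandas as pd

# Toy frame: one full column, one nearly empty column, one entirely empty column.
toy = pd.DataFrame({
    "County": ["Knox", "Perry", "Floyd", "Knox"],
    "Mostly empty": [np.nan, np.nan, np.nan, "x"],
    "Empty": [np.nan, np.nan, np.nan, np.nan],
})

min_non_null = 2  # keep a column only if it has at least this many real values

# Threshold first, then fill: sparse columns are removed.
dropped_first = toy.dropna(axis=1, thresh=min_non_null).replace(np.nan, "Not Provided")

# Fill first, then threshold: every cell now holds a value, so nothing is removed.
filled_first = toy.replace(np.nan, "Not Provided").dropna(axis=1, thresh=min_non_null)

print(list(dropped_first.columns))
print(list(filled_first.columns))
```

Once every blank has become "Not Provided", each column looks completely full to `dropna`, so the threshold can no longer remove anything.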


#### Further Notes
* For demographic purposes, most of the remaining columns are kept.
* Keep the 'Date Completed' column, if only to pair it with the unique identifier 'Auto ID' when removing duplicates, as mentioned in the code above.
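
To illustrate why 'Date Completed' is part of the duplicate check, here is a small sketch on invented rows. Applicant `A-001` appears three times: one exact repeat and one later application.

```python
import pandas as pd

# Toy rows: IDs and dates are made up for illustration.
toy = pd.DataFrame({
    "Auto ID": ["A-001", "A-001", "A-001", "A-002"],
    "Date Completed": ["2021-08-24", "2021-08-24", "2023-01-10", "2021-08-25"],
    "County": ["Knox", "Knox", "Knox", "Perry"],
})

# Same check as in the cleaning function: ID plus date.
deduped = toy.drop_duplicates(
    subset=["Auto ID", "Date Completed"], keep="first", ignore_index=True
)

# For comparison: ID alone would also discard the returning application.
by_id_only = toy.drop_duplicates(subset=["Auto ID"], keep="first", ignore_index=True)

print(len(deduped), len(by_id_only))
```

With both columns in `subset`, only the exact repeat is removed and the 2023 application survives. With the ID alone, it would be lost.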

In [48]:
df_cleaned = df_cleaning(df)

df_cleaned

Unnamed: 0,Auto ID,State,County,Birthdate,Gender,Disability,Primary Disability,Race,Veteran,Employment Status,...,Justice Involvement,Low Income,English Language Learner,Comfortably Read and Write in English?,Cultural barriers?,Single Parent,Displaced Homemaker,Spouse of Armed Forces Reduced Income,Loss of Family Support,Date Completed
0,202109-5224,Kentucky,Knox,1981-10-25,Female,No,N/A - No Disability,White,No,Employed full-time,...,Not Provided,No,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,2021-08-24 00:00:00
1,202109-5230,Kentucky,Perry,2000-10-28,Prefer not to say,No,Not Provided,Prefer Not To Say,Not Provided,"Unemployed, actively seeking work",...,No,Yes,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,2021-08-25 00:00:00
2,202109-5233,Kentucky,Morgan,1986-11-05,Male,Yes,Not Provided,White,No,"Unemployed, actively seeking work",...,No,Yes,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,2021-08-24 00:00:00
3,202109-5236,Kentucky,Floyd,1978-01-22,Female,No,N/A - No Disability,White,No,Employed full-time,...,No,Yes,Not Provided,No,No,Yes,Yes,No,No,2021-08-24 00:00:00
4,202109-5237,Kentucky,Knox,1992-04-16,Male,No,Not Provided,White,No,Employed full-time,...,No,Yes,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,2021-08-24 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2250,202506-23902,Kentucky,Floyd,1979-07-19,Male,No,Not Provided,White,No,Employed full-time,...,No,No,No,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided
2251,202506-23914,Kentucky,Pulaski,1991-06-03,Female,Yes,Not Provided,White,No,"Unemployed, actively seeking work",...,No,Yes,No,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided
2252,202506-23927,Kentucky,Perry,1989-06-12,Male,No,Not Provided,White,No,Employed full-time,...,No,No,No,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided
2253,202506-23959,Kentucky,Laurel,1949-08-19,Female,Yes,Not Provided,White,No,"Unemployed, actively seeking work",...,No,No,No,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided


In [50]:
df_cleaned.to_csv('data/cleaned/cleaned applications.csv', index=False) # saving the cleaned file