## Cleaning Employment Data

In [16]:
# importing the necessary modules
import pandas as pd
import numpy as np
import plotly.express as px

In [17]:
# reading in the df
df = pd.read_excel('data/WORC Employment.xlsx')

df.head()

Unnamed: 0,Auto Id,Full Name,Email,EnrollmentId,Employment History Name,Company Name,Job Title,Start Date,Program: Program Name,Mailing City,Mailing Zip/Postal Code,ATP Placement Type,Salary,Gender,Race,KY Region
0,202203-7853,name name,name@gmail.com,Enrollment-6442,EH-001676,Appalachian Regional Healthcare,Network Coordinator,2023-10-09,Code Kentucky 22-23,Lost Creek,41348,First ATP Placement - New to Tech,16.0,Male,White,SOAR
1,202207-8826,name name,name@gmail.com,Enrollment-6188,EH-001824,MCHC - Mountain Comprehensive Health Corporation,Junior IT systems administrator,2024-02-12,Code Kentucky 22-23,Greys Knob,40808,First ATP Placement - New to Tech,18.0,Female,White,SOAR
2,202306-12150,name name,name@gmail.com,Enrollment-7740,EH-002555,University of Kentucky,Technical Support Specialist II,2024-04-01,Code Kentucky 23-24,Richmond,40475,First ATP Placement - Promotion,25.0,Male,White,SOAR
3,202207-9034,name name,name@gmail.com,Enrollment-6146,EH-002207,Childers oil company,Web developer,2024-04-23,Code Kentucky 22-23,Hazard,41701,First ATP Placement - New to Tech,26.92,Male,White,SOAR
4,202306-12149,name name,name@gmail.com,Enrollment-7701,EH-002294,Code:You,Student Community Coordinator,2024-05-20,Code Kentucky 23-24,Eubank,42567,First ATP Placement - New to Tech,25.48,Female,White,SOAR


#### Preliminary Thoughts (*remove later*)

* There isn't much here worth keeping and the file is small, so we'll drop most of the columns. The 'Auto Id' column is a common identifier, which we'll use to join with the demographics and programs file.
* Many columns overlap with our demographics and projects dataset. Since we're going to merge the two for job placement rates, we can probably remove the bulk of them here.
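
The join described above could look roughly like the sketch below. The two small frames are made-up stand-ins (the IDs `A-1` to `A-3` and the `Gender` values are placeholders, not real records), and treating "has at least one employment row" as "placed" is an assumption about how the placement rate will be defined.

```python
import pandas as pd

# Hypothetical stand-ins for the demographics and employment files.
demographics = pd.DataFrame({
    "Auto Id": ["A-1", "A-2", "A-3"],
    "Gender": ["Male", "Female", "Female"],
})
employment = pd.DataFrame({
    "Auto Id": ["A-1", "A-2"],
    "Job Title": ["Network Coordinator", "Web developer"],
})

# A left join from demographics keeps every participant;
# indicator=True adds a '_merge' column showing where each row matched.
merged = demographics.merge(employment, on="Auto Id", how="left", indicator=True)

# A participant counts as placed if any of their rows matched an employment record.
merged["placed"] = merged["_merge"] == "both"
placement_rate = merged.groupby("Auto Id")["placed"].any().mean()
print(f"Placement rate: {placement_rate:.1%}")
```

Grouping by ID before averaging keeps the rate correct even if one participant has several employment rows, since a left join would otherwise repeat that participant.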

In [18]:
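def df_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Drop PII and redundant columns, rename the ID column, and label missing values."""
    # PII, ancillary IDs, and columns already present in the demographics data
    columns_to_remove = [
        'Full Name',
        'Email',
        'EnrollmentId',
        'Employment History Name',
        'Company Name',
        'Program: Program Name',
        'Mailing City',
        'Mailing Zip/Postal Code',
        'Gender',
        'Race',
        'KY Region',
    ]
    df = df.drop(columns=columns_to_remove)

    # Match the capitalization used for the join key
    df = df.rename(columns={'Auto Id': 'Auto ID'})

    # Label missing values explicitly; note that 'Salary' then mixes numbers and text
    fill_element = 'Not Provided'
    df = df.replace(np.nan, fill_element)

    return df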

#### Programmer's Notes:
* Many columns are redundant between the demographics DataFrame and this one, so we drop them here and merge on 'Auto ID'. The demographic information for each ID then comes from the demographics data.
* Every person in this file is from the SOAR region. Make a note of this.
* Ancillary fields such as 'EnrollmentId' and 'Employment History Name' aren't very informative.
* Removed PII such as 'Full Name' and 'Email'.
* Renamed 'Auto Id' to 'Auto ID'. A small touch, but I feel it looks better.
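
An observation like "every person is from SOAR" can be checked in code before a column is dropped. The helper below is a hypothetical sketch (the name `constant_columns` and the sample frame are illustrative, not part of the notebook); it reports which columns hold a single distinct non-missing value.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> dict:
    """Return {column: value} for columns with exactly one distinct non-missing value."""
    result = {}
    for col in frame.columns:
        values = frame[col].dropna().unique()
        if len(values) == 1:
            result[col] = values[0]
    return result

# Made-up sample shaped like two of the raw columns.
sample = pd.DataFrame({
    "KY Region": ["SOAR", "SOAR", "SOAR"],
    "Gender": ["Male", "Female", "Male"],
})
print(constant_columns(sample))
```

Running this on the raw DataFrame before `df_cleaning` would confirm whether 'KY Region' really is constant across the whole file, not just the first few rows.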

In [19]:
df_cleaned = df_cleaning(df)
display(df_cleaned)

Unnamed: 0,Auto ID,Job Title,Start Date,ATP Placement Type,Salary
0,202203-7853,Network Coordinator,2023-10-09,First ATP Placement - New to Tech,16.0
1,202207-8826,Junior IT systems administrator,2024-02-12,First ATP Placement - New to Tech,18.0
2,202306-12150,Technical Support Specialist II,2024-04-01,First ATP Placement - Promotion,25.0
3,202207-9034,Web developer,2024-04-23,First ATP Placement - New to Tech,26.92
4,202306-12149,Student Community Coordinator,2024-05-20,First ATP Placement - New to Tech,25.48
5,202306-12234,Customer Success Agent,2024-05-20,First ATP Placement - New to Tech,Not Provided
6,202402-14719,Digital Marketer,2024-06-01,First ATP Placement - Already in Tech,18.0
7,202306-12508,EKY Workforce Analyst,2024-07-31,First ATP Placement - New to Tech,19.23
8,202308-13071,Media Research Analyst,2024-08-01,First ATP Placement - Promotion,Not Provided
9,202402-14832,PC Technician,2024-08-09,First ATP Placement - New to Tech,28.0


In [20]:
# Write the cleaned file. Rerunning this cell overwrites the existing file, so be careful with "Run All".
df_cleaned.to_csv('data/cleaned/cleaned_employment.csv')
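
If accidental overwrites are a concern, one option is a small guard around the export. This is only a sketch: `safe_to_csv` and its `overwrite` flag are hypothetical names, and passing `index=False` is an assumption (it avoids writing the row index as an extra unnamed column, which the original cell does not do).

```python
from pathlib import Path
import tempfile

import pandas as pd

def safe_to_csv(frame: pd.DataFrame, path, overwrite: bool = False) -> Path:
    """Write frame to path, refusing to replace an existing file unless overwrite=True."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)  # assumption: the row index is not needed in the file
    return path

# Usage in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cleaned" / "cleaned_employment.csv"
    demo = pd.DataFrame({"Auto ID": ["A-1"], "Salary": [16.0]})
    safe_to_csv(demo, target)          # first write succeeds
    try:
        safe_to_csv(demo, target)      # second write is blocked
    except FileExistsError as err:
        print(err)
```

With a guard like this, "Run All" fails loudly at the export cell instead of silently replacing the cleaned file.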

## **Still in Progress???**