## Classification of Income Being Above or Below/Equal to $50,000 in Adult Census Data

By: Sunsar, Sarah, Emily, Calvin (DSCI 100 003 - Group 23)

Data is from: https://www.kaggle.com/datasets/uciml/adult-census-income

The dataset used for this analysis is derived from the 1994 Census Bureau database. 

The dataset contains a diverse range of numerical and categorical attributes, such as age, hours worked per week, sex, and more. In this project, we will filter and simplify some categories from this dataset to predict whether an individual's annual income falls above or below $50,000.

The question this project will seek to answer is: 

**How do different attributes of a person predict whether their annual income will be above or below $50K?**

# Reading in Data

In [1]:
library(tidyverse)
library(tidymodels)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
-- Attaching packages -------------------------------------- tidymodels 1.1.1 --
v broom        1.0.5     v rsample

In [2]:
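# Read the Adult Census data directly from the project repository
df <- read_csv("https://raw.githubusercontent.com/calvingdu/dsci100-003-23/master/data/adult_census.csv")

# Split into 80% training / 20% testing, stratified by income
df_split <- initial_split(df, prop = 0.8, strata = income)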
df_train <- training(df_split)
df_test <- testing(df_split)

paste0("Training set row count: ", nrow(df_train))
paste0("Testing set row count: ", nrow(df_test))

Rows: 32561 Columns: 15
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): workclass, education, marital.status, occupation, relationship, rac...
dbl (6): age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [3]:
head(df_train, 3)

age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
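
The split in cell 2 is random, so the rows that land in `df_train` change on every run unless a seed is set first. As a minimal sketch of what a seeded, stratified 80/20 split does, here is a hand-rolled version using dplyr on toy data. The seed value, the `toy` tibble, and the `id` column are assumptions for illustration only; `initial_split()` remains the method used in this project.

```r
library(dplyr)

set.seed(2023)  # hypothetical seed; any fixed value makes the split repeatable

# Toy data: 75 rows of "<=50K" and 25 rows of ">50K"
toy <- tibble(
  id = 1:100,
  income = rep(c("<=50K", ">50K"), times = c(75, 25))
)

# Sample 80% of the rows within each income class
toy_train <- toy |>
  group_by(income) |>
  slice_sample(prop = 0.8) |>
  ungroup()

# Everything that was not sampled becomes the test set
toy_test <- anti_join(toy, toy_train, by = "id")

# Both sets keep the original 75/25 class ratio
count(toy_train, income)
count(toy_test, income)
```

Sampling within each class is what keeps the share of `>50K` rows the same in both sets.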


# Tidying/Cleaning The Data & Selecting Columns
We can begin by tidying up the data. Something we noticed immediately is that there are a lot of categorical columns. To tackle this, we plan to group categories into buckets and then turn them into dummy variables. For example, we can turn __sex__ into a dummy variable where 0 = male and 1 = female, bucket values such as paid/unpaid in __workclass__ into a single dummy variable, or make one dummy variable for every unique value. 

These are the columns we plan to use and how we plan to transform them to make them usable for classification: 
- **Age**: No changes (other than scaling/imputation)
- **Workclass**: Make a dummy variable for paid/unpaid
- **Education**: Simplified to a dummy variable for whether the person is a college graduate or not
- **Occupation**: Make a dummy variable for each occupation
- **Relationship**: Simplified to a dummy variable for married or not
- **Sex**: Transformed into a dummy variable
- **Capital Gain**: No changes (other than scaling/imputation)
- **Capital Loss**: No changes (other than scaling/imputation)
- **Hours Per Week**: No changes (other than scaling/imputation)
- **Native Country**: We evaluated which country has the most people with income > 50K (the United States by a large margin) and then make a dummy variable for whether this is the person's native country or not 
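
The code cells that follow do not yet create the Education dummy described in the list above. A minimal sketch of one way to do it is shown here, assuming that "college graduate" means a `Bachelors`, `Masters`, `Doctorate`, or `Prof-school` value in __education__. That cutoff, the `is_college_grad` column name, and the toy rows are assumptions for illustration, not something the dataset defines.

```r
library(dplyr)

# Hypothetical definition of "college graduate" (an assumption)
grad_levels <- c("Bachelors", "Masters", "Doctorate", "Prof-school")

# Toy rows using education labels that appear in the dataset
toy <- tibble(education = c("HS-grad", "Bachelors", "Some-college", "Masters"))

# 1 if the education level is in the graduate list, 0 otherwise
toy <- toy |>
  mutate(is_college_grad = ifelse(education %in% grad_levels, 1, 0))

toy
```

A different cutoff (for example, counting associate degrees) would only require changing `grad_levels`.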

**Dropped Columns & Reasoning**: 
- fnlwgt: Unclear how this relates to income
- education.num: Redundant, since we already have education
- race: We don't believe this is significant to the study, so we remove it to avoid overfitting 
- marital.status: We think it's similar to relationship, so we remove it to avoid overfitting 
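
Dropping __education.num__ is safe only if it carries the same information as __education__. A quick check, sketched here on toy rows copied from the preview of `df_train`, counts how many distinct `education.num` values appear for each `education` level; every count should be 1. The names `toy` and `mapping_check` are illustrative.

```r
library(dplyr)

# Toy rows taken from the first three rows of the training preview
toy <- tibble(
  education = c("HS-grad", "Some-college", "Some-college"),
  education.num = c(9, 10, 10)
)

# Number of distinct numeric codes per education label
mapping_check <- toy |>
  group_by(education) |>
  summarize(n_codes = n_distinct(education.num))

# TRUE when each label maps to exactly one code
all(mapping_check$n_codes == 1)
```

To run the check on the real data, replace `toy` with `df_train`.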


There is also some tidying to do. We can see right away that some values in __workclass__ and __occupation__ are missing, represented as `?`. Since we believe these columns are extremely important for predicting the income category, we remove any rows that are missing either value. 
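
Before dropping rows, it can help to see how many `?` placeholders each column holds. The sketch below does this on toy rows modelled on the preview of `df_train`; `toy` and `missing_counts` are illustrative names.

```r
library(dplyr)

# Toy rows modelled on the first three rows of the training preview
toy <- tibble(
  workclass = c("?", "?", "Private"),
  occupation = c("?", "?", "Prof-specialty"),
  native.country = c("United-States", "United-States", "United-States")
)

# Count the "?" placeholders in every character column
missing_counts <- toy |>
  summarize(across(where(is.character), ~ sum(. == "?")))

missing_counts
```

Running the same summary on `df_train` would also show whether columns other than __workclass__ and __occupation__ contain `?`.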

In [4]:
filtered_df <- df_train |>
  select(age, workclass, education, occupation, relationship, sex,
         capital.gain, capital.loss, hours.per.week, native.country, income) |>
  filter(workclass != "?" & occupation != "?")

Then, we can begin making dummy variables in the data using the choices above.

### Binary Preprocessing

In [5]:
# Helper that adds the binary (0/1) dummy variables; reused later on other data
preprocess_binary <- function(df){
    transformed_df <- df |>
        # 1 if the person is listed as a husband or wife, 0 otherwise
        mutate(is_married = ifelse(relationship == "Husband" | relationship == "Wife", 1, 0)) |>
        # 1 = female, 0 = male
        mutate(sex_dummy = ifelse(sex == "Female", 1, 0)) |>
        # 0 for the workclass values we treat as unpaid, 1 otherwise
        mutate(being_paid = ifelse(workclass != "Self-emp-not-inc" & workclass != "Without-pay" & workclass != "Never-worked", 1, 0)) |>
        # 1 if the native country is the United States, 0 otherwise
        mutate(is_american = ifelse(native.country == "United-States", 1, 0))
    return(transformed_df)
}

binary_df <- preprocess_binary(filtered_df)
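
The `is_american` dummy rests on the observation that the United States has the most people with income above 50K. A sketch of how that count could be produced is shown here on made-up rows; the counts are not from the dataset, and `high_income_by_country` is an illustrative name.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  native.country = c("United-States", "United-States", "Mexico", "Canada", "United-States"),
  income = c(">50K", ">50K", "<=50K", ">50K", "<=50K")
)

# Number of ">50K" rows per country, largest first
high_income_by_country <- toy |>
  filter(income == ">50K") |>
  count(native.country, sort = TRUE)

high_income_by_country
```

Because most rows in the census data are from the United States, comparing the proportion of `>50K` earners per country may be more informative than the raw count.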

### Dummy Variables for Every Occupation

In [6]:
unique_occupations <- as.list(unique(binary_df$occupation))

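# Helper that adds one 0/1 column per occupation; reused later on other data
preprocess_occupation <- function(df, occupations){
    for (occ in occupations) {
        # Lowercase the name and replace punctuation with ".", e.g. "Prof-specialty" -> "prof.specialty"
        col_name <- gsub("[^a-zA-Z0-9 ]", ".", tolower(occ))
        # 1 if the row has this occupation, 0 otherwise
        df[[col_name]] <- as.integer(df$occupation == occ)
    }
    return(df)
}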

occupations_df <- preprocess_occupation(binary_df, unique_occupations)
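
A quick sanity check for this step is that each row has exactly one occupation dummy set to 1. The sketch below repeats the same loop on toy data in base R; the `toy` data frame and `dummy_cols` are illustrative names.

```r
# Toy data with three distinct occupations
toy <- data.frame(
  occupation = c("Prof-specialty", "Craft-repair", "Adm-clerical", "Craft-repair")
)

occupations <- unique(toy$occupation)

for (occ in occupations) {
  # "Prof-specialty" becomes "prof.specialty"
  col_name <- gsub("[^a-zA-Z0-9 ]", ".", tolower(occ))
  toy[[col_name]] <- as.integer(toy$occupation == occ)
}

# Every column except the original one is a dummy
dummy_cols <- setdiff(colnames(toy), "occupation")

# Each row should have exactly one dummy equal to 1
all(rowSums(toy[dummy_cols]) == 1)
```

The same `rowSums()` check can be applied to the occupation columns of `occupations_df`.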

### Removing Original Variables

In [7]:
# Drop the original categorical columns now that they are encoded as dummies
tidy_df <- occupations_df |>
    select(-c(occupation, workclass, education, relationship, sex, native.country))

head(tidy_df)

age,capital.gain,capital.loss,hours.per.week,income,is_married,sex_dummy,being_paid,is_american,prof.specialty,...,craft.repair,farming.fishing,adm.clerical,exec.managerial,handlers.cleaners,machine.op.inspct,protective.serv,tech.support,priv.house.serv,armed.forces
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
41,0,3900,40,<=50K,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
34,0,3770,45,<=50K,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
61,0,2754,25,<=50K,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
51,0,2603,40,<=50K,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
21,0,2603,40,<=50K,1,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
33,0,2603,32,<=50K,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0


In [8]:
# Wrap all of the steps above into one function so they can be reapplied to other data
process_df <- function(df){
    filtered_df <- df |>
        select(age, workclass, education, occupation, relationship, sex,
               capital.gain, capital.loss, hours.per.week, native.country, income) |>
        filter(workclass != "?" & occupation != "?")

    binary_df <- preprocess_binary(filtered_df)

    unique_occupations <- as.list(unique(binary_df$occupation))
    occupations_df <- preprocess_occupation(binary_df, unique_occupations)
    new_cols_df <- select(occupations_df, -c(occupation, workclass, education, relationship, sex, native.country))

    return(new_cols_df)
}

processed_df_train <- process_df(df_train)
# all.equal() returns TRUE or a description of the differences, so wrap it in isTRUE()
print(isTRUE(all.equal(processed_df_train, tidy_df)) && identical(colnames(processed_df_train), colnames(tidy_df)))

[1] TRUE
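
One thing to watch when reusing `process_df()` on `df_test`: it derives the occupation list from whatever data frame it receives, so a split that happens to lack an occupation would end up with fewer dummy columns than the training set. A minimal sketch of one way around this is to compute the list once from the training data and pass it in. `add_occupation_dummies()` and the toy data frames are hypothetical stand-ins, written in base R.

```r
# Hypothetical helper: same logic as preprocess_occupation(), with the
# occupation list always supplied by the caller
add_occupation_dummies <- function(df, occupations) {
  for (occ in occupations) {
    col_name <- gsub("[^a-zA-Z0-9 ]", ".", tolower(occ))
    df[[col_name]] <- as.integer(df$occupation == occ)
  }
  df
}

# Toy split where the test set is missing two occupations
toy_train <- data.frame(occupation = c("Sales", "Tech-support", "Armed-Forces"))
toy_test <- data.frame(occupation = c("Sales", "Sales"))

# Occupation list comes from the training data only
train_occupations <- unique(toy_train$occupation)

train_out <- add_occupation_dummies(toy_train, train_occupations)
test_out <- add_occupation_dummies(toy_test, train_occupations)

# Both data frames now share the same columns
identical(colnames(train_out), colnames(test_out))
```

Occupations absent from the test set simply become all-zero columns, which keeps the two data frames aligned for modelling.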


# Preliminary Exploratory Data Analysis



## Summary Table

In [9]:
# Mean of every predictor within each income class, transposed so each row is one variable
tidy_df |>
    group_by(income) |>
    summarize(across(everything(), ~ mean(., na.rm = TRUE), .names = "mean_{.col}")) |>
    t()

0,1,2
income,<=50K,>50K
mean_age,36.60979,43.92667
mean_capital.gain,144.1656,3991.2340
mean_capital.loss,51.89115,193.75094
mean_hours.per.week,39.36592,45.74179
mean_is_married,0.3307006,0.8486036
mean_sex_dummy,0.3814985,0.1471501
mean_being_paid,0.9201515,0.9062551
mean_is_american,0.8883960,0.9150743
mean_prof.specialty,0.09937787,0.24007839
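
Because `t()` converts the tibble to a character matrix, the means above are stored as text. If numeric values are needed later (for plotting, for example), one alternative is to reshape with tidyr instead. This is a sketch on toy data, with the numbers invented for illustration; `summary_long` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Invented values for illustration only
toy <- tibble(
  income = c("<=50K", "<=50K", ">50K", ">50K"),
  age = c(30, 40, 45, 55),
  hours.per.week = c(35, 45, 40, 50)
)

# One row per variable, one numeric column per income class
summary_long <- toy |>
  group_by(income) |>
  summarize(across(everything(), ~ mean(., na.rm = TRUE))) |>
  pivot_longer(-income, names_to = "variable", values_to = "mean") |>
  pivot_wider(names_from = income, values_from = mean)

summary_long
```

The result has the same layout as the transposed table, but the class columns stay numeric.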


## Summary Visualization

# Methods

# Expected outcomes and significance