# Exploratory Data Analysis

**Owner: Daniel Soukup - Created: 2025.11.01**

The goal of this notebook is to explore the dataset, understand our target column and features (statistical properties, data quality) and plan for preprocessing and modelling.

In [0]:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import numpy as np

import plotly.express as px

import plotly.offline as pyo
pyo.init_notebook_mode()

## Data Loading

Ensure the datasets are available:

In [0]:
client = dataiku.api_client()
project = client.get_project('US_CENSUS_PROJECT')

datasets = project.list_datasets()

for dataset in datasets:
    print(dataset["name"])

Load the datasets:

In [0]:
def load_data_by_name(name: str) -> pd.DataFrame:
    """
    Load dataset by its name.
    """
    mydataset = dataiku.Dataset(name)
    mydataset_df = mydataset.get_dataframe()
    
    return mydataset_df

train_df = load_data_by_name("census_income_learn")
test_df = load_data_by_name("census_income_test")

In [0]:
train_df.shape, test_df.shape

We have ~200K rows for training and ~100K rows for testing, each with 42 columns, the last being our target. Note that the column names are missing.
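Since both files are expected to share the same layout, it may be worth confirming that the two frames agree on columns and dtypes before going further. The following is a minimal sketch; `schema_diff` is a hypothetical helper and the toy frames stand in for `train_df` and `test_df`.

```python
import pandas as pd


def schema_diff(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the columns whose presence or dtype differs between two frames."""
    dtypes = pd.DataFrame(
        {"left": left.dtypes.astype(str), "right": right.dtypes.astype(str)}
    )
    # A column missing on one side shows up as NaN and never equals the other side.
    return dtypes[dtypes["left"] != dtypes["right"]]


# Toy stand-ins for train_df / test_df (hypothetical values, not the census data).
toy_train = pd.DataFrame({"col_0": [25, 40], "col_1": ["a", "b"]})
toy_test = pd.DataFrame({"col_0": [25.0, 40.5], "col_1": ["c", "d"]})

diff = schema_diff(toy_train, toy_test)
print(diff)
```

An empty result would suggest that the same preprocessing can be applied to both frames without per-file special cases.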

In [0]:
train_df.head()

In [0]:
test_df.head()

In [0]:
train_df.info()

**Observations:**

It is clear that we need to do a fair bit of cleaning:
- clarify the missing column names
- the data dictionary lists 40 features, but we see 41 feature columns here (42 including the target)
- there are some missing values in col_11
- we have a high number of object data types that will need processing prior to modelling
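The observations above can be quantified per column. Below is a minimal sketch, assuming a hypothetical helper `summarize_quality` and a toy frame in place of `train_df`; it reports the dtype, the number of missing values and their share for each column.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and missing-value share."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": df.isna().mean() * 100,
        }
    )
    # Show the columns with the most missing values first.
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for train_df (hypothetical values, not the census data).
toy_df = pd.DataFrame(
    {
        "col_0": [25, 40, 31, 58],
        "col_1": ["Private", "Private", "Not in universe", "State government"],
        "col_11": ["All other", None, "Mexican-American", None],
    }
)

quality = summarize_quality(toy_df)
print(quality)
# Number of columns per dtype, to see how many need encoding.
print(toy_df.dtypes.astype(str).value_counts())
```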

Excerpt from the data dict:

- Number of instances data = 199523
    - Duplicate or conflicting instances : 46716
- Number of instances in test = 99762
    - Duplicate or conflicting instances : 20936
- Class probabilities for income-projected.test file
    - Probability for the label '- 50000' : 93.80%
    - Probability for the label '50000+' : 6.20%
- Number of attributes = 40 (continuous : 7 nominal : 33)
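These figures can be compared against the loaded data. The sketch below is one possible way to do so, assuming a hypothetical helper `duplicate_and_class_stats` and toy rows in place of the real frames. Note that the data dictionary counts "duplicate or conflicting" instances, so an exact-duplicate count need not match its numbers.

```python
import pandas as pd


def duplicate_and_class_stats(df: pd.DataFrame, target: str) -> dict:
    """Count exact duplicate rows and compute the share of each target label."""
    shares = df[target].value_counts(normalize=True) * 100
    return {
        "n_rows": len(df),
        # Rows identical to an earlier row in every column.
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Label shares in percent, comparable to the class probabilities above.
        "class_share_pct": shares.round(2).to_dict(),
    }


# Toy stand-in for the census data (hypothetical values and column names).
toy_df = pd.DataFrame(
    {
        "col_0": [30, 30, 45, 22, 60],
        "col_1": ["A", "A", "B", "A", "B"],
        "col_41": ["- 50000", "- 50000", "50000+", "- 50000", "- 50000"],
    }
)

stats = duplicate_and_class_stats(toy_df, target="col_41")
print(stats)
```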

A quick check confirms that there are no exact duplicate columns:

In [0]:
train_df.T.duplicated().sum()

## Column Mapping

Let's look at the high-level stats first:

In [0]:
# numeric columns
train_df.describe()

In [0]:
train_df.select_dtypes("object").describe()

We'll use the data dictionary and number of unique values to map our columns to their proper names. We also need this information to make well-informed decisions on how the columns should be processed for modelling.

In [0]:
pd.set_option('display.max_colwidth', 100) 

unique_values = pd.DataFrame(
    {
        "unique_values": [train_df[col].unique() for col in train_df.columns],
        "num_unique": [train_df[col].nunique() for col in train_df.columns],
        "dtype": [train_df[col].dtype for col in train_df.columns],
    },
    index=train_df.columns,
)
unique_values

Based on this we put together our mapping:

In [0]:
{'col_0': 'col_0', 'col_1': 'col_1', 'col_2': 'col_2', 'col_3': 'col_3', 'col_4': 'col_4', 'col_5': 'col_5', 'col_6': 'col_6', 'col_7': 'col_7', 'col_8': 'col_8', 'col_9': 'col_9', 'col_10': 'col_10', 'col_11': 'col_11', 'col_12': 'col_12', 'col_13': 'col_13', 'col_14': 'col_14', 'col_15': 'col_15', 'col_16': 'col_16', 'col_17': 'col_17', 'col_18': 'col_18', 'col_19': 'col_19', 'col_20': 'col_20', 'col_21': 'col_21', 'col_22': 'col_22', 'col_23': 'col_23', 'col_24': 'col_24', 'col_25': 'col_25', 'col_26': 'col_26', 'col_27': 'col_27', 'col_28': 'col_28', 'col_29': 'col_29', 'col_30': 'col_30', 'col_31': 'col_31', 'col_32': 'col_32', 'col_33': 'col_33', 'col_34': 'col_34', 'col_35': 'col_35', 'col_36': 'col_36', 'col_37': 'col_37', 'col_38': 'col_38', 'col_39': 'col_39', 'col_40': 'col_40', 'col_41': 'col_41'}

In [0]:
col_mapping = {
    "col_0": "age", # matches type and range
    "col_1": "class of worker", # unique values checked with data dict (UVDD)
    "col_2": "detailed industry recode", # UVDD
    "col_3": "detailed occupation recode", # UVDD
    "col_4": "education", # UVDD
    "col_5": "wage per hour", # looks to be at right position, type checks, in cents?
    "col_6": "enroll in edu inst last wk", # UVDD
    "col_7": "marital stat", # UVDD
    "col_8": "major industry code", # UVDD
    "col_9": "major occupation code", # UVDD
    "col_10": "race", # UVDD
    "col_11": "hispanic origin", # UVDD - 10 unique in data dict? values match though
    "col_12": "sex", # UVDD
    "col_13": "member of a labor union", # UVDD
    "col_14": "reason for unemployment", # UVDD
    "col_15": "full or part time employment stat", # UVDD
    "col_16": "capital gains", # data dict check, range ok, dollars?
    "col_17": "capital losses", # data dict check, range ok, dollars?
    "col_18": "dividends from stocks", # data dict check
    "col_19": "tax filer stat", # UVDD
    "col_20": "region of previous residence", # UVDD
    "col_21": "state of previous residence", # UVDD
    "col_22": "detailed household and family stat", # data dict check
    "col_23": "detailed household summary in household", # data dict check
    "col_24": "???", # TODO: unidentified; the data dict lists 40 features but we see 41
    "col_25": "migration code-change in msa", # UVDD
    "col_26": "migration code-change in reg", # UVDD
    "col_27": "migration code-move within reg", # UVDD
    "col_28": "live in this house 1 year ago", # UVDD
    "col_29": "migration prev res in sunbelt", # UVDD
    "col_30": "num persons worked for employer", # value check
    "col_31": "family members under 18", # UVDD
    "col_32": "country of birth mother",  # UVDD
    "col_33": "country of birth self",  # UVDD
    "col_34": "country of birth father",  # UVDD
    "col_35": "citizenship", # UVDD
    "col_36": "own business or self employed", # UVDD
    "col_37": "fill inc questionnaire for veteran's admin", # UVDD
    "col_38": "veterans benefits", # UVDD
    "col_39": "weeks worked in year", # data dict order
    "col_40": "year", # UVDD
    "col_41": "income"
}
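Once the mapping is settled, it can be applied with `DataFrame.rename`. The following is a minimal sketch, assuming a hypothetical helper `apply_col_mapping` that first checks that every column has a mapping and that no two columns end up with the same name; the toy frame and the shortened mapping are illustrative only.

```python
import pandas as pd


def apply_col_mapping(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename columns after checking the mapping is complete and unambiguous."""
    missing = [col for col in df.columns if col not in mapping]
    if missing:
        raise KeyError(f"Columns without a mapping: {missing}")

    new_names = [mapping[col] for col in df.columns]
    duplicated = sorted({name for name in new_names if new_names.count(name) > 1})
    if duplicated:
        raise ValueError(f"Mapping produces duplicate names: {duplicated}")

    # rename returns a new frame; the input frame is left untouched.
    return df.rename(columns=mapping)


# Toy stand-in for train_df with a shortened, illustrative mapping.
toy_df = pd.DataFrame({"col_0": [25, 40], "col_12": ["Female", "Male"]})
toy_mapping = {"col_0": "age", "col_12": "sex"}

renamed = apply_col_mapping(toy_df, toy_mapping)
print(renamed.columns.tolist())
```

Applying the same helper to both the training and test frames would keep their column names in sync.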
