# Exploratory Data Analysis

**Owner: Daniel Soukup - Created: 2025.11.01**

The goal of this notebook is to explore the dataset, understand the target column and the features (statistical properties, data quality), and plan the preprocessing and modelling steps.

In [0]:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import numpy as np

import plotly.express as px

import plotly.offline as pyo
pyo.init_notebook_mode()

## Data Loading

Ensure the datasets are available:

In [0]:
client = dataiku.api_client()
project = client.get_project('US_CENSUS_PROJECT')

datasets = project.list_datasets()

for dataset in datasets:
    print(dataset["name"])

Load the datasets:

In [0]:
def load_data_by_name(name: str) -> pd.DataFrame:
    """
    Load dataset by its name.
    """
    mydataset = dataiku.Dataset(name)
    mydataset_df = mydataset.get_dataframe()
    
    return mydataset_df

train_df = load_data_by_name("census_income_learn")
test_df = load_data_by_name("census_income_test")

In [0]:
train_df.shape, test_df.shape

We have ~200K rows for training and ~100K rows for testing, with 42 columns, the last being our target. Note that the column names are missing.
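Once the real names have been collected from the data dictionary, they could be attached with a small helper along these lines. This is a minimal sketch: `apply_column_names` is a hypothetical helper, and the toy frame and the names `feature_a`, `feature_b`, `target` are placeholders, not the actual census column names. The length check is there so that a mismatch between the dictionary and the file fails loudly instead of silently shifting names.

```python
import pandas as pd


def apply_column_names(df: pd.DataFrame, names: list[str]) -> pd.DataFrame:
    """Return a copy of df with its columns renamed; raise on a count mismatch."""
    if len(names) != df.shape[1]:
        raise ValueError(
            f"Got {len(names)} names for {df.shape[1]} columns; "
            "check the data dictionary against the file."
        )
    if len(set(names)) != len(names):
        raise ValueError("Column names must be unique.")
    renamed = df.copy()
    renamed.columns = names
    return renamed


# Toy stand-in for train_df (placeholder values and names, not the census data)
toy_df = pd.DataFrame(
    [[25, "a", "x"], [40, "b", "y"]],
    columns=["col0", "col1", "col2"],
)
named_df = apply_column_names(toy_df, ["feature_a", "feature_b", "target"])
print(named_df.columns.tolist())
```

Applying the same list of names to both `train_df` and `test_df` would keep the two frames aligned.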

In [0]:
train_df.head()

In [0]:
test_df.head()

In [0]:
train_df.info()

**Observations:**

It is clear that we need to do a fair bit of cleaning:
- the column names are missing and need to be clarified
- the data dictionary lists 40 features, whereas we see 41 columns here (excluding the target)
- there are some missing values in col11
- we have a high number of object-typed columns that will need processing prior to modelling
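The missing values and data types noted above can also be collected into a single table, which is easier to scan than the `info()` output. The sketch below is one possible way to do this with plain pandas; `summarize_columns` is a hypothetical helper and the toy frame only stands in for `train_df`.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and share, and number of distinct values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns with the most missing values come first
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for train_df (made-up values, not the census data)
toy_df = pd.DataFrame({
    "col0": [25, 40, 31, 58],
    "col1": ["a", "a", None, "b"],
    "col2": ["x", "y", "x", "x"],
})
summary = summarize_columns(toy_df)
print(summary)
```

On the real data this would be called as `summarize_columns(train_df)`. Filtering the result on the `dtype` column gives the list of object columns that need encoding.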

Excerpt from the data dictionary:

- Number of instances data = 199523
    - Duplicate or conflicting instances : 46716
- Number of instances in test = 99762
    - Duplicate or conflicting instances : 20936
- Class probabilities for income-projected.test file
    - Probability for the label '- 50000' : 93.80%
    - Probability for the label '50000+' : 6.20%
- Number of attributes = 40 (continuous : 7 nominal : 33)
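The figures in this excerpt can be compared against the loaded data. The sketch below counts exact duplicate rows and computes the class shares of a target column. `duplicate_and_class_summary` is a hypothetical helper and the toy frame is made up. The excerpt does not say how "duplicate or conflicting instances" are counted, so the exact-duplicate count computed here is an assumption and may not reproduce those numbers.

```python
import pandas as pd


def duplicate_and_class_summary(df: pd.DataFrame, target_col: str):
    """Return (row count, exact duplicate row count, class shares of target_col)."""
    n_rows = len(df)
    # Rows identical to an earlier row across all columns
    n_duplicates = int(df.duplicated(keep="first").sum())
    class_share = df[target_col].value_counts(normalize=True)
    return n_rows, n_duplicates, class_share


# Toy stand-in for train_df (made-up values, not the census data)
toy_df = pd.DataFrame({
    "col0": [1, 1, 2, 3],
    "col1": ["a", "a", "b", "c"],
    "target": ["low", "low", "low", "high"],
})
n_rows, n_duplicates, class_share = duplicate_and_class_summary(toy_df, "target")
print(n_rows, n_duplicates)
print(class_share)
```

On the real data, the target would be the last column, for example `duplicate_and_class_summary(train_df, train_df.columns[-1])`.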