#  **"Census Data" Data Mining and Analysis**
The dataset used for this assignment is the Census dataset for predicting whether an individual's annual income exceeds $50K/yr. <br> Source: https://archive.ics.uci.edu/dataset/2/adult

---

## **1. Initial exploration**

>**The following cells perform an initial exploration of the dataset: its missing and duplicated values, shape, and statistics.**

In [None]:
#Import necessary libraries

import pandas as pd
import numpy as np

>**The dataset doesn't have column names by default, so we define the column names before we import the dataset so we can read the data properly**



In [None]:
# Define column names

columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"
]

>**In this dataset, missing values are marked as " ?" (a question mark preceded by a space). To make sure " ?" is read as empty, we pass it to na_values so that pandas treats it as NaN.**

*The cleaning process later on will also convert " ?" to NaN in the cleaned dataset.*

In [None]:
# Load dataset
census_df = pd.read_csv(
    "../data/raw_dataset.csv",
    header=None,    
    names=columns,        # assign the column names
    na_values=' ?'         # treat ' ?' as missing values
)
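
The effect of `na_values=' ?'` can be checked on a tiny in-memory sample. This is a minimal sketch, assuming the raw file separates fields with a comma followed by a space (which is what the " ?" marker suggests); the two rows and the `toy_df` name are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in a comma-plus-space layout (assumed to match the raw file).
raw_text = "39, State-gov, Bachelors\n54, ?, HS-grad\n"

toy_df = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=["age", "workclass", "education"],
    na_values=" ?",  # the leading space must match the field exactly
)

# The ' ?' field becomes NaN; other strings keep their leading space.
print(toy_df["workclass"].isna().tolist())
print(repr(toy_df.loc[0, "workclass"]))
```

Because the marker is matched against the whole field, `na_values='?'` without the space would not catch these values unless the leading whitespace were stripped first.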

In [None]:
# Display basic info

print("=== Dataframe Info ===")
census_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\n")

In [None]:
# Statistical description

print("=== Dataframe Description ===")
print(census_df.describe(include='all'))  # include='all' covers categorical columns too
print("\n")


In [75]:
# Missing values

print("=== Missing Values ===")
print(census_df.isnull().sum())
print("\n")

=== Missing Values ===
age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64
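
Raw counts are easier to judge as a share of the rows: with 32561 rows in total, the counts above are roughly 5.6% for workclass and occupation and 1.8% for native-country. A small helper such as the hypothetical `missing_report` below (not part of the original notebook) could report both figures; it is shown here on a made-up toy frame.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return missing counts and percentages for columns that have any, largest first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", np.nan, "Private", np.nan],
    "occupation": ["Sales", np.nan, "Tech-support", "Sales"],
})

report = missing_report(toy_df)
print(report)
```

On the real data the same call would be `missing_report(census_df)`.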




In [80]:
# Check duplicates

print("=== Duplicate Rows ===")
print(census_df.duplicated().sum())
print("\n")

=== Duplicate Rows ===
24
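
Note that `duplicated()` with its default `keep='first'` flags only the repeats after the first occurrence, so the 24 above is the number of extra copies, not the number of rows involved in duplication. A minimal sketch on made-up rows shows the difference:

```python
import pandas as pd

# Made-up rows: the first three are identical.
toy_df = pd.DataFrame({
    "age": [25, 25, 25, 40],
    "education": ["HS-grad", "HS-grad", "HS-grad", "Masters"],
})

# Default keep="first": the first occurrence is not counted as a duplicate.
extra_copies = toy_df.duplicated().sum()

# keep=False flags every row that belongs to a duplicated group.
rows_in_groups = toy_df.duplicated(keep=False).sum()

# drop_duplicates() keeps one row per group by default.
deduplicated = toy_df.drop_duplicates()

print(extra_copies, rows_in_groups, len(deduplicated))
```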




>**The following cells show the shape, statistics, and sample rows of the raw (uncleaned) dataset.**

In [None]:
# Show the sample rows

print("=== Sample Rows ===")
print(census_df.sample(10))   
print("\n")

In [None]:
#Show the shape of the dataset

print("Dataset shape:", census_df.shape)

Dataset shape: (32561, 15)


>The code below shows the top labels for non-numerical attributes of the dataset

In [None]:
# Summary for categorical columns only

print("\n=== Categorical Summary ===")
print(census_df.describe(include=['object']))


=== Categorical Summary ===
       workclass education       marital-status       occupation relationship  \
count      30725     32561                32561            30718        32561   
unique         8        16                    7               14            6   
top      Private   HS-grad   Married-civ-spouse   Prof-specialty      Husband   
freq       22696     10501                14976             4140        13193   

          race    sex  native-country  income  
count    32561  32561           31978   32561  
unique       5      2              41       2  
top      White   Male   United-States   <=50K  
freq     27816  21790           29170   24720  


>The code below was copied from the Lesson003_Descriptive Statistics Colab code (Iris Dataset)
<br>to calculate the summary statistics of the dataset's numerical attributes

In [None]:
from pandas.api.types import is_numeric_dtype


# Collect one single-row DataFrame of statistics per numeric column
summary_frames = []

# Loop through the numeric columns and calculate statistics
for col in census_df.columns:
    if is_numeric_dtype(census_df[col]):
        mean = census_df[col].mean()
        std = census_df[col].std()
        min_val = census_df[col].min()
        max_val = census_df[col].max()

        mode_values = census_df[col].mode().values

        if len(mode_values) > 1:
            mode = np.array(mode_values)
        else:
            mode = mode_values[0]

        median = census_df[col].median()

        # Create a DataFrame for the current column
        col_summary = pd.DataFrame({'Column': [col], 'Mean': [mean], 'Mode': [mode],  
                                    'Median': [median], 'Standard Deviation': [std], 'Min': [min_val], 'Max': [max_val]})
        summary_frames.append(col_summary)

# Concatenate the list of DataFrames into one summary DataFrame
summary_df = pd.concat(summary_frames, ignore_index=True)

# Display the summary DataFrame
summary_df

           Column           Mean                      Mode    Median  Standard Deviation    Min      Max
0             age      38.581647                        36      37.0           13.640433     17       90
1          fnlwgt  189778.366512  [123011, 164190, 203488]  178356.0       105549.977697  12285  1484705
2   education-num      10.080679                         9      10.0             2.57272      1       16
3    capital-gain    1077.648844                         0       0.0         7385.292085      0    99999
4    capital-loss       87.30383                         0       0.0          402.960219      0     4356
5  hours-per-week      40.437456                        40      40.0           12.347429      1       99
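
A more compact way to build a similar table is `select_dtypes` plus `agg`. This is only a sketch of an alternative, not the lesson's method; the toy frame and the `numeric_df` and `summary` names are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "age": [39, 50, 38, 50],
    "hours-per-week": [40, 13, 40, 40],
    "sex": ["Male", "Male", "Female", "Male"],
})

# Keep only the numeric columns, then compute all statistics in one call.
numeric_df = toy_df.select_dtypes(include="number")
summary = numeric_df.agg(["mean", "median", "std", "min", "max"]).T

# mode() can return several values per column, so store them as a list.
summary["mode"] = [numeric_df[col].mode().tolist() for col in numeric_df.columns]

print(summary)
```

The loop version makes each step explicit, while this form avoids building and concatenating one DataFrame per column.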


## **2. Data Cleaning Process**

>**The following cell blocks are executed to handle missing values, standardize formats, and detect and treat outliers.**