# DAV 6150: Final Project Proposal

**Project Members:** Alwyn Munatsi, Lucia Shumba, Chidochashe Makanga and Bekithemba Nkomo  
**Dataset:** [UCI Adult (Census Income)](https://archive.ics.uci.edu/dataset/2/adult)  
**Project Topic:** Predicting Income Levels Using Machine Learning and Ensemble Modeling: An Analysis of Socioeconomic Determinants in U.S. Census Data


## 1. Introduction

Income inequality remains one of the most persistent social and economic challenges facing modern societies. In the United States, the gap between high-income and low-income earners has widened steadily over the past few decades. According to the U.S. Census Bureau, the nation’s Gini index, a measure of income inequality, rose to 0.494 in 2023, reflecting one of the highest levels recorded since data collection began in 1967. The top 10 percent of households now earn more than eight times what the bottom 10 percent earn annually.

Education and employment factors play critical roles in this divide. Individuals holding at least a bachelor’s degree earn a median annual income roughly 65% higher than those with only a high-school diploma, while gender and occupational disparities further amplify wage differences. For instance, the Bureau of Labor Statistics reported in 2023 that women’s median weekly earnings were 83% of men’s, with even wider gaps in certain industries such as finance and information technology.

These national patterns illustrate that income outcomes are shaped by complex interactions among demographic, educational, and occupational characteristics. Understanding how these variables relate to earning potential is essential for policymakers, economists, and organizations seeking to design equitable compensation structures and targeted workforce-development programs.

By analyzing publicly available census data that captures these individual-level attributes, we can quantify and visualize the underlying structure of income distribution across the population. Such an analysis contributes to evidence-based discussions on social mobility, educational access, and labor-market inequality, all issues of growing importance in today’s data-driven economy.

---

## 2. Research Questions

**Primary RQ 1:**  
* Can demographic, educational, and occupational attributes be used to accurately predict whether an individual earns more than $50,000 per year?

Understanding the relationship between socioeconomic attributes and income level can help organizations, policymakers, and educators make data-informed decisions about workforce development and wage equity. If reliable prediction patterns are identified, human-resources departments and public agencies could leverage similar models to forecast income disparities within specific sectors or regions, ensuring fairer compensation structures and targeted training initiatives.

**Secondary RQ 2:**  
* Which variables most strongly influence the likelihood of earning above $50,000?

Identifying the key determinants of income can highlight which educational, occupational, or demographic factors drive earning potential. Such insights can inform programs that promote upskilling in high-impact areas, help governments evaluate the effectiveness of education funding, and guide employers in creating evidence-based policies to address systemic barriers to upward mobility.

**Secondary RQ 3:**  
* Does an ensemble model outperform individual models in predicting high-income individuals?

Testing an ensemble approach provides practical insight into how hybrid predictive systems can enhance model accuracy in real-world analytics workflows. Businesses and research institutions increasingly rely on ensemble methods for human-resources analytics, credit scoring, and socioeconomic forecasting. Demonstrating that ensemble modeling yields superior performance could validate its broader application in decision-support systems that require high predictive reliability.

### Real-World Relevance  
The results of this research can have tangible applications in both public and private sectors.
* Government agencies can use similar models to forecast income distribution trends and evaluate the impact of education or tax policies.
* Corporations and HR departments can apply predictive insights to ensure pay equity and optimize talent-management strategies.
* Educational institutions and nonprofits can identify which learning paths most effectively enhance earning potential, shaping curriculum design and workforce-development initiatives.

By translating statistical patterns into actionable intelligence, this work contributes to a data-driven understanding of income inequality, empowering decision-makers to promote fairness, opportunity, and sustainable economic growth.

---

## 3. Data to Be Used

The data for this project will be sourced from the UCI Machine Learning Repository, one of the most reputable open-access repositories for research datasets. Specifically, the dataset titled “Adult (Census Income)”, originally extracted from the 1994 U.S. Census Bureau database, will serve as the foundation for this analysis. The dataset can be accessed directly at:
🔗 https://archive.ics.uci.edu/dataset/2/adult

The dataset contains 48,842 records and 15 attributes (14 features plus the income label), capturing demographic, educational, occupational, and financial details for individuals older than 16. The response variable, income, classifies individuals into two categories: “<=50K” and “>50K” annual income brackets.
The features include both numerical variables (e.g., age, hours-per-week, capital-gain, capital-loss) and categorical variables (e.g., workclass, education, marital-status, occupation, race, sex, native-country).

**Data Collection Method**

The dataset will be collected directly by downloading the CSV file from the official UCI repository web page. No web scraping or third-party API calls are required since the dataset is openly available for academic use. The data will then be imported into a Python Jupyter Notebook using the pandas library for cleaning, exploration, and modeling.
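
As a rough illustration of the import step, the sketch below reads an Adult-format file with pandas. It assumes the download is headerless and comma-separated (as the repository's `adult.data` file is) and that missing values are marked with “?”. The helper name `load_adult`, the `ADULT_COLUMNS` list, and the two sample rows are illustrative choices for this example, not part of the dataset.

```python
import io

import pandas as pd

# Column names follow the attribute list on the UCI page; the raw file
# itself is assumed to have no header row.
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def load_adult(source):
    """Read an Adult-format file (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        header=None,
        names=ADULT_COLUMNS,
        na_values="?",          # treat the "?" placeholder as missing
        skipinitialspace=True,  # fields are separated by ", "
    )


# Two illustrative rows in the same layout, used only to demonstrate the call.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, ?, 209642, HS-grad, 9, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
adult_sample = load_adult(sample)

# Count missing values in the two columns that contain "?" above.
print(adult_sample[["workclass", "occupation"]].isna().sum())
```

In the project notebook, the same function would be called with the local path of the downloaded file instead of the in-memory sample.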

---

## 4Ô∏è. Approach

**Data Management Plan**

All data handling and analysis will take place in a Jupyter Notebook using the pandas library. The raw dataset (downloaded CSV) will be stored locally and version-controlled via GitHub. A structured pipeline will ensure full reproducibility:
1. Data Ingestion -> Pre Exploratory Data Analysis -> Data Preparation -> Post Exploratory Data Analysis -> Feature Selection and/or Dimensionality Reduction -> Model Training -> Evaluation -> Ensemble Integration -> Interpretation -> Reporting.

2. Cleaned and encoded intermediate files will be saved in CSV format for consistency.

3. No external database or scraping is needed; all computations occur within Python’s scikit-learn and xgboost frameworks.
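
One possible shape for the Data Preparation stage is sketched below with scikit-learn's `ColumnTransformer`. The choice of imputation strategy, the column subsets, and the toy rows are assumptions made for illustration; the final preprocessing decisions will depend on the EDA findings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative subsets of the Adult columns (assumption, not the final list).
numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["workclass", "education"]

preprocessor = ColumnTransformer([
    # Scale numeric features to zero mean and unit variance.
    ("num", StandardScaler(), numeric_cols),
    # Fill missing categories with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Toy rows for demonstration only; they are not drawn from the dataset.
toy = pd.DataFrame({
    "age": [25, 38, 52, 44],
    "hours-per-week": [40, 50, 45, 20],
    "workclass": ["Private", np.nan, "Self-emp-inc", "Private"],
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
})

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Keeping the preprocessing inside a single transformer object means the identical steps can be refit inside each cross-validation fold, which helps avoid leaking test-fold statistics into training.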

**Statistical and Analytical Methods**

Descriptive statistics (mean, median, mode, variance) will summarize numeric attributes such as age and hours-per-week, while frequency tables and bar charts will describe categorical features such as education and occupation.
Inferential relationships will be examined through correlation analysis and chi-square tests. Model performance will be compared using Accuracy, Precision, Recall, F1-Score, and ROC-AUC, with 5-fold cross-validation for generalization assessment.
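
A chi-square test of independence between a categorical feature and the income label could look like the following sketch, which uses `pandas.crosstab` and `scipy.stats.chi2_contingency`. The counts are invented for illustration and say nothing about the actual dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative counts only; these are not taken from the Adult dataset.
toy = pd.DataFrame({
    "education": ["HS-grad"] * 40 + ["Bachelors"] * 40,
    "income": ["<=50K"] * 30 + [">50K"] * 10 + ["<=50K"] * 15 + [">50K"] * 25,
})

# Rows are education levels, columns are income brackets.
contingency = pd.crosstab(toy["education"], toy["income"])

# For 2x2 tables, chi2_contingency applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

A small p-value would indicate that the feature and the income bracket are unlikely to be independent; it does not by itself measure the strength of the association.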

**Machine Learning Models to Be Constructed**

Four distinct models will be developed:
1. Logistic Regression – establishes a simple linear baseline for binary classification.
2. Random Forest Classifier – ensemble of decision trees that capture non-linear and interaction effects.
3. Support Vector Machine (SVM) – constructs an optimal separating hyperplane for complex boundaries.
4. XGBoost Classifier (required advanced model) – gradient-boosted ensemble that sequentially minimizes classification error, improving accuracy and generalization.

Hyperparameter tuning will be conducted via GridSearchCV.
Each model will be evaluated using identical train/test splits and cross-validated metrics to identify the best individual performer.
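
The comparison described above might be organized roughly as follows. This is a sketch under several assumptions: the data are synthetic (`make_classification`) rather than the preprocessed Adult features, the parameter grid is deliberately tiny, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example depends only on scikit-learn. `xgboost.XGBClassifier` exposes the same fit/predict interface and could be substituted in the project notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# One shared splitter with a fixed seed, so every model sees identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": make_pipeline(StandardScaler(), SVC()),
    # Stand-in for the XGBoost classifier named in the proposal.
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean of each metric across the five folds, per model.
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

# Small illustrative grid; the real search space would be wider.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=cv,
    scoring="f1",
)
grid.fit(X, y)

for name, metrics in results.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
print("best random-forest params:", grid.best_params_)
```

Reusing a single `StratifiedKFold` object with a fixed `random_state` is one way to guarantee that score differences come from the models rather than from different data partitions.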

**Ensemble Model Design ("Weak Learners")**

To satisfy the final-project ensemble requirement, four base learners will be combined using a Voting Classifier: Logistic Regression, K-Nearest Neighbors (KNN), SVM, and Decision Tree. Both hard and soft voting mechanisms will be tested.
The ensemble’s predictions will then be compared with the standalone XGBoost model to evaluate whether hybrid learning provides additional gains. If time allows, a Stacking Classifier with XGBoost as the meta-learner will also be explored to strengthen predictive performance.
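
A minimal sketch of the hard and soft voting setups is shown below, again on synthetic data standing in for the preprocessed Adult features. The hyperparameter values and the helper name `base_learners` are placeholders chosen for illustration, not tuned or final settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def base_learners():
    """Return fresh instances of the four base learners named in the proposal."""
    return [
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
        # probability=True lets SVC expose predict_proba, which soft voting needs.
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ]


# Fit one ensemble per voting mode and record held-out accuracy.
accuracy = {}
for voting in ("hard", "soft"):
    ensemble = VotingClassifier(estimators=base_learners(), voting=voting)
    ensemble.fit(X_train, y_train)
    accuracy[voting] = ensemble.score(X_test, y_test)

print(accuracy)
```

Hard voting takes the majority class label, while soft voting averages predicted probabilities, so every base learner must provide `predict_proba` in the soft case. The same `(name, estimator)` list could be passed to scikit-learn's `StackingClassifier` with a `final_estimator` if the stacking variant is explored.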

**Feasibility and Timeline**

This project is technically and logistically feasible within the remaining duration of the DAV 6150 course. The Adult (Census Income) dataset has approximately 49,000 rows × 15 columns, which is moderate in size and computationally efficient to process on a standard laptop with 8–16 GB RAM.

All preprocessing, modeling, and visualization steps can be performed using open-source Python packages (pandas, scikit-learn, xgboost, matplotlib, seaborn) without the need for specialized hardware or cloud computing resources.

The project’s structure follows a linear and modular workflow, allowing progress to be tracked and validated at each stage. Each module’s deliverable builds upon the previous one, ensuring continuous development rather than last-minute integration.

Risk mitigation strategies include:
* Saving intermediate cleaned and encoded datasets to avoid data-loss or reprocessing delays.
* Using version control (GitHub) to track notebook iterations and parameter adjustments.
* Allocating buffer days between milestones for code debugging and visualization refinements.
* Ensuring reproducibility through consistent random seeds and documented preprocessing pipelines.

Given these considerations, the project is realistic, reproducible, and well-scoped under the current timeline.

**Proposed Timeline (Modules 9-15)**

| **Module / Week**      | **Focus Area**                                    | **Planned Work and Milestones**                                                                                                                                                                                                                                                               |
| ---------------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Module 9 (Week 1)**  | **Proposal Drafting – Concept Development**         | Review potential datasets, confirm selection of the **UCI Adult Dataset**, and draft Introduction, Research Questions, and Approach sections. Submit **Proposal First Draft** for initial feedback.                                                                                           |
| **Module 10 (Week 2)** | **Proposal Revision – EDA Preview**                 | Perform a lightweight exploratory check to confirm data quality (no full modeling yet). Incorporate instructor feedback and refine proposal sections.                                                                                                                                         |
| **Module 11 (Week 3)** | **Finalize Proposal Submission**                   | Submit **Final Project Proposal Notebook** (graded 30 points). Confirm project scope, models, and team roles. This marks the transition from planning -> execution.                                                                                                                         |
| **Module 12 (Week 4)** | **Data Collection & Exploratory Data Analysis**     | Download dataset, handle missing values (“?” -> NaN), conduct full EDA (distribution plots, correlations, feature summaries), and document findings with visualizations.                                                                                                                       |
| **Module 13 (Week 5)** | **Data Preparation & Model Development Phase 1**    | Perform encoding of categorical variables, scaling of numeric features, and feature engineering. Train **Logistic Regression** and **Random Forest** models with cross-validation. Evaluate baseline performance metrics.                                                                     |
| **Module 14 (Week 6)** | **Model Development Phase 2 & Advanced Modeling**   | Train **Support Vector Machine (SVM)** and **XGBoost Classifier** (advanced model). Conduct hyperparameter tuning via GridSearchCV. Compare models using ROC curves, F1-Scores, and feature-importance plots.                                                                                 |
| **Module 15 (Week 7)** | **Ensemble Integration & Presentation Preparation** | Combine four weak learners (Logistic Regression, KNN, SVM, Decision Tree) into a **Voting/Stacking Ensemble**. Compare ensemble vs XGBoost. Finalize notebook documentation and prepare 10-minute presentation. Submit the **Final Notebook** and deliver the **Live Presentation**. |


**Roles and Responsibilities**

This project will be completed by a four-member team: Alwyn Munatsi, Lucia Shumba, Chidochashe Makanga, and Bekithemba Nkomo. Each team member contributes collaboratively while leading specific domains to ensure balanced workload and accountability.

| **Team Member**               | **Primary Responsibilities**                                                                                                                                                                                                                           |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Alwyn Munatsi** | Oversees project coordination, integration, and documentation. Leads **data acquisition**, **feature engineering**, and **XGBoost model implementation**. Manages GitHub version control and ensures deliverable alignment with DAV 6150 requirements. |
| **Lucia Shumba**         | Leads **Exploratory Data Analysis (EDA)** and **visualization**. Designs statistical summaries, correlation maps, and demographic charts. Supports feature engineering and Random Forest tuning.                                                       |
| **Chidochashe Makanga**         | Leads **Logistic Regression** and **Support Vector Machine (SVM)** model development. Conducts **hyperparameter optimization** and comparative evaluation. Assists with ensemble construction and documentation.                                       |
| **Bekithemba Nkomo**          | Leads **Ensemble Model Construction** and **Result Interpretation**. Compares ensemble performance against XGBoost. Designs and delivers **presentation visuals and slides**.                                                           |

**Collaborative Contributions:**

All team members will:
* Participate in proposal drafting and final report writing.
* Review and validate each other’s code and narrative sections.
* Contribute to presentation preparation and rehearsals.
* Ensure code reproducibility and consistency within the shared GitHub repository.

This structure promotes synergy, accountability, and balanced contributions across all technical and analytical tasks.