# Train a Model

## Model Training Overview

To train a machine learning model, we first need **clean and well-prepared data**.

In my repository 👉 **[AI_Learning_DataPrep_SageMaker](https://github.com/VijayBheemineni/AI_Learning_DataPrep_SageMaker)**,  
I analyzed the `adult_data.csv` dataset, performed data cleaning and feature transformations, and split the data into three datasets:

- **Training dataset**
- **Validation dataset**
- **Test dataset**

All three processed CSV files are stored in **Amazon S3** and are used as inputs for model training and evaluation.
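As a minimal sketch of how the three S3 inputs might be referenced later, the snippet below builds one URI per split. The bucket name, prefix, and folder layout are hypothetical placeholders, not the actual locations used in the repository.

```python
# Hypothetical bucket and prefix; replace them with the real S3 location.
BUCKET = "example-bucket"
PREFIX = "adult-income/processed"

SPLITS = ("train", "validation", "test")


def dataset_uri(split: str, bucket: str = BUCKET, prefix: str = PREFIX) -> str:
    """Build the S3 URI of one processed CSV file (assumed folder layout)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"s3://{bucket}/{prefix}/{split}/{split}.csv"


# One URI per dataset, keyed by split name.
uris = {split: dataset_uri(split) for split in SPLITS}
for split, uri in uris.items():
    print(f"{split}: {uri}")
```

Keeping the URIs in one dictionary makes it easy to pass the training and validation locations to the training job and hold the test location back for evaluation.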

---

## Choosing the Machine Learning Algorithm

Once data preparation is complete, the next step is to select an appropriate **machine learning algorithm**.

In this use case, the goal is to predict whether an individual's income is:
- `>50K` or
- `<=50K`

Since the output has only **two possible outcomes**, this is a **binary classification problem**.  
For this reason, I am using the **XGBoost algorithm**, which is well-suited for structured/tabular data and is commonly used for classification problems.
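A binary classifier is trained on numeric labels, so the income strings are typically mapped to `1` and `0` before training. The sketch below assumes the positive class is spelled `>50K` in the raw file; that spelling and the helper name `encode_income` are assumptions for illustration, so adjust them to match the actual data.

```python
def encode_income(labels, positive=">50K"):
    """Map raw income strings to 1 (positive class) or 0 (any other value)."""
    # strip() removes the stray whitespace that often surrounds CSV values
    return [1 if label.strip() == positive else 0 for label in labels]


raw_labels = [" >50K", "<=50K", ">50K ", "<=50K"]
encoded = encode_income(raw_labels)
print(encoded)
```

Because every value other than the positive string becomes `0`, it is worth checking the distinct raw values first so that a misspelled label is not silently counted as the negative class.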

---

## What Happens During Model Training

The objective of model training is to create a model that can make accurate predictions on **new, unseen data**.

- The **training data** contains both input features and the target label (`income`)
- Future data used for predictions **does not contain the target label**

During training:
- The algorithm learns patterns that map input features to the target
- These learned patterns are stored as a **trained ML model**
- This model can then be used to predict income categories for new data
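The train-then-predict cycle described above can be illustrated with a toy example. This is not XGBoost: it is a single-threshold classifier (a decision stump), which is far simpler than the many trees XGBoost combines. The feature (`hours` worked per week) and all values are made up for illustration.

```python
def train_stump(features, labels):
    """Learn the threshold that classifies the training rows most accurately.

    The resulting model predicts 1 when a feature value is above the
    threshold and 0 otherwise.
    """
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(features)):
        correct = sum(
            (1 if x > candidate else 0) == y for x, y in zip(features, labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def predict(threshold, features):
    """Apply the learned threshold to rows that have no label."""
    return [1 if x > threshold else 0 for x in features]


# Training data: input feature plus the known target label.
train_hours = [20, 25, 30, 40, 45, 50, 60]
train_income = [0, 0, 0, 0, 1, 1, 1]

# "Training" stores the learned pattern, here a single number.
model = train_stump(train_hours, train_income)

# New data contains only the feature; the model supplies the label.
new_hours = [35, 55]
predictions = predict(model, new_hours)
print(model, predictions)
```

The learned threshold plays the role of the trained model: it is the only thing needed to label new rows, and the training labels are no longer required at prediction time.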

---

## Hyperparameters and Model Tuning

In addition to selecting an algorithm, we also configure **hyperparameters**.

Hyperparameters:
- Control how the training job runs
- Influence the model's behavior and the learning process
- Have a significant impact on model performance and accuracy

Selecting the right hyperparameter values is an important part of training an effective model.
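As an illustration, hyperparameters for an XGBoost binary classifier are often collected in a dictionary. The names below are existing XGBoost parameters (`num_round` is the spelling used by the SageMaker built-in XGBoost algorithm), but the values are illustrative starting points, not tuned values from this project, and the range checks are sanity checks chosen for this sketch.

```python
hyperparameters = {
    "objective": "binary:logistic",  # output a probability for the positive class
    "eval_metric": "auc",            # metric reported on the validation data
    "num_round": 100,                # number of boosting rounds (trees)
    "max_depth": 5,                  # maximum depth of each tree
    "eta": 0.2,                      # learning rate (shrinkage applied per round)
    "subsample": 0.8,                # fraction of training rows sampled per tree
}


def check_ranges(params):
    """Return a list of problems found by a few basic sanity checks."""
    problems = []
    if params["num_round"] < 1:
        problems.append("num_round must be at least 1")
    if params["max_depth"] < 1:
        problems.append("max_depth must be at least 1")
    if not 0 < params["eta"] <= 1:
        problems.append("eta must be in (0, 1]")
    if not 0 < params["subsample"] <= 1:
        problems.append("subsample must be in (0, 1]")
    return problems


problems = check_ranges(hyperparameters)
print(problems or "hyperparameters look reasonable")
```

A smaller `eta` usually needs a larger `num_round` to reach the same fit, which is one reason these values are tuned together rather than one at a time.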

---


## Task 1: Set Up the Environment


In [None]:
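# Install the plotting libraries. Comments sit on their own lines so that
# they are never passed to pip as arguments.
# matplotlib: low-level plotting library for static plots
%pip install matplotlib
# bokeh: interactive visualization library, pinned to version 2.4.2
%pip uninstall bokeh -y
%pip install bokeh==2.4.2
# seaborn: high-level statistical visualization library built on matplotlib
%pip install seaborn
# Clear the interactive namespace. This does not restart the kernel; restart
# it manually if a newly installed package version is not picked up.
%reset -f

# Import packages
import boto3
import sagemaker

# SageMaker SDK session and the IAM role used by training jobs
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# boto3 session, its region, and a low-level SageMaker API client
boto3_session = boto3.Session()
region = boto3_session.region_name
# Named sagemaker_client so that it does not shadow the imported sagemaker module
sagemaker_client = boto3_session.client('sagemaker')

# Reload edited modules automatically
%load_ext autoreload
%autoreload 2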

# TODO: Local IDE Jupyter Kernel setup and execution steps.
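After the installs, it can be useful to confirm which package versions the kernel actually sees, for example to check that the pinned bokeh version took effect. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the version of each package the setup cell relies on.
for name in ("matplotlib", "bokeh", "seaborn", "boto3", "sagemaker"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a version printed here differs from the one just installed, restarting the kernel is usually the next thing to try.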