# Applied Machine Learning: Starter Problem

In this notebook, we'll go through the steps to prepare a dataset and train a basic machine learning model using the **Iris dataset**. 

## Step 1: Import Libraries

We'll use **pandas** for data manipulation and **scikit-learn** for machine learning. Specifically, we will use:
- `train_test_split` to divide our dataset into training and test sets
- `DecisionTreeClassifier` as our machine learning model
- `metrics` to evaluate model performance

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

## Step 2: Load Dataset

We have preloaded the **Iris dataset**, which contains measurements of different iris flowers along with their species labels.
The columns are:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
- Species (label we want to predict)

In [None]:
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/iris.csv'

dataset = pd.read_csv(address)
dataset.head()

Let's check the unique species in this dataset to understand the labels we want to predict.

In [None]:
dataset['Species'].unique()
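
Beyond listing the labels, it can help to see how many rows each species has. The sketch below is a stand-in rather than the notebook's own data: it assumes scikit-learn's bundled copy of Iris (`load_iris`) in place of the CSV, and the name `demo` is ours. The column name `Species` mirrors the one used above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the CSV: build a DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
demo = pd.DataFrame(iris.data, columns=iris.feature_names)

# Convert the integer targets (0, 1, 2) into species names.
demo["Species"] = iris.target_names[iris.target]

# Rows per species: a quick check of class balance.
species_counts = demo["Species"].value_counts()
print(species_counts)
```

If the counts are roughly equal, plain accuracy is a reasonable first metric for this problem.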

## Step 3: Separate Features and Labels

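We will separate the **input features (X)** from the **labels (y)**. Features are the columns that describe each flower (sepal and petal measurements), and the label is the species. The selection is by zero-based column position, so it assumes the CSV has one extra leading column (such as a row ID) at position 0, which puts the four measurements at positions 1-4 and the species at position 5. If `dataset.head()` shows a different layout, adjust the positions accordingly.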

In [None]:
# Features: zero-based column positions 1-4 (the slice end, 5, is exclusive)
X = dataset.iloc[:, 1:5]
X.head()

In [None]:
# Labels: zero-based column position 5
y = dataset.iloc[:, 5]
y.head()
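
Selecting by position breaks silently if the column order changes, so selecting by name is a common alternative. The following is a minimal sketch on a hypothetical three-row frame: the `Id` column and the exact column names are assumptions for illustration, not taken from the CSV.

```python
import pandas as pd

# Hypothetical miniature frame with a leading ID column, so the features
# sit at positions 1-4 and the label at position 5.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Sepal Length": [5.1, 7.0, 6.3],
    "Sepal Width": [3.5, 3.2, 3.3],
    "Petal Length": [1.4, 4.7, 6.0],
    "Petal Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Position-based selection (the stop index 5 is exclusive).
X_by_position = toy.iloc[:, 1:5]
y_by_position = toy.iloc[:, 5]

# Name-based selection: independent of column order or extra columns.
feature_columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]
X_by_name = toy[feature_columns]
y_by_name = toy["Species"]

# Both approaches pick the same data on this frame.
print(X_by_position.equals(X_by_name))
print(y_by_position.equals(y_by_name))
```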

## Step 4: Split Dataset into Training and Test Sets

We split the dataset so that **70% of the data is used to train the model** and **30% is used to test it**. This allows us to evaluate the model's performance on unseen data.

We also set `random_state=0` to make the split reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
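
To see what the split produces, we can check the resulting shapes. This sketch assumes scikit-learn's bundled 150-row Iris data instead of the CSV, and all variable names are ours. It also shows the optional `stratify` argument of `train_test_split`, which the cell above does not use; it keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: 150 samples, 4 features, integer labels 0-2.
X_demo, y_demo = load_iris(return_X_y=True)

# Same settings as the notebook cell: 30% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape)

# Optional variant: stratify on the labels so each class is
# represented in the test set in proportion to the full data.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
test_class_counts = np.bincount(y_te_s)
print(test_class_counts)
```

With 150 rows, a 30% test size yields 105 training rows and 45 test rows.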

## Step 5: Train Decision Tree Classifier

We will use a **Decision Tree Classifier**, a simple and interpretable machine learning model. It learns rules from the training data to predict the species of iris flowers.

In [None]:
# Fix the seed so the fitted tree is reproducible across runs
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
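
Because a decision tree is a set of if/else rules, the fitted rules can be printed as text. The sketch below assumes scikit-learn's bundled Iris data and trains on all of it purely for illustration. `export_text` is a scikit-learn helper, and the names `tree` and `rules` are ours.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit on the full bundled dataset just to have a tree to inspect.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(iris.data, iris.target)

# Render the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a threshold test on one measurement, and each `class:` line is a leaf holding the predicted label.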

## Step 6: Make Predictions

We use the trained model to predict the species of flowers in the **test set**.

In [None]:
y_predict = clf.predict(X_test)
y_predict
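
The same `predict` call works for a single new flower, as long as its measurements arrive in the same column order used for training. This is a sketch under assumptions: it uses scikit-learn's bundled Iris data, and the measurements in `new_flower` are an illustrative example rather than values from the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = iris.target_names[iris.target]  # species names as strings

model = DecisionTreeClassifier(random_state=0).fit(features, labels)

# One new sample, with the same column names and order as the training data.
new_flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.feature_names)
predicted = model.predict(new_flower)
print(predicted[0])
```

`predict` always returns an array, even for one sample, which is why the example indexes it with `[0]`.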

## Step 7: Evaluate Model Accuracy

We measure how well the model performs by comparing the predicted labels with the actual labels from the test set. The **accuracy** metric gives the fraction of correctly predicted instances.

In [None]:
accuracy = metrics.accuracy_score(y_test, y_predict)
print("Accuracy:", accuracy)
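
Accuracy is the number of matching predictions divided by the total number of predictions. The sketch below checks that by hand on six made-up labels (illustrative, not model output) and adds a confusion matrix, which shows which species get mixed up.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for illustration.
y_true = np.array(["setosa", "setosa", "versicolor",
                   "versicolor", "virginica", "virginica"])
y_pred = np.array(["setosa", "setosa", "versicolor",
                   "virginica", "virginica", "virginica"])

# Accuracy by hand: the mean of the element-wise matches.
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = metrics.accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Rows are true classes, columns are predicted classes.
class_order = ["setosa", "versicolor", "virginica"]
matrix = metrics.confusion_matrix(y_true, y_pred, labels=class_order)
print(matrix)
```

Here five of six predictions match, and the single off-diagonal entry in the matrix is the versicolor flower predicted as virginica.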

## Summary

- Loaded the Iris dataset
- Separated features and labels
- Split data into training and test sets
- Trained a Decision Tree Classifier
- Evaluated the model using accuracy

This completes a basic workflow for a **starter machine learning problem** using scikit-learn.