# (PART) MACHINE LEARNING {-}

# What is the goal of supervised learning?

## Explanation

Supervised learning aims to learn a mapping from input features (X) to a target variable (y), based on labeled training data. The model can then make predictions on new, unseen data.
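
To make "learning a mapping" concrete, here is a minimal sketch that fits a straight line to five labeled points with the closed-form least-squares formulas and then predicts an input the model has not seen. It uses only the standard library; the helper names `fit_line` and `predict` are illustrative and not part of any package.

```python
# Labeled training data: each input x has a known target value.
X = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(X, y)


def predict(x_new):
    """Apply the learned mapping to a new input."""
    return slope * x_new + intercept


print("Learned slope:", slope)
print("Prediction for unseen x = 6:", predict(6))
```

The learned slope and intercept are the "mapping"; calling `predict` on a value outside the training data is the prediction step.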

## Python Code

```python
# Supervised learning example: simple linear regression
from sklearn.linear_model import LinearRegression
import pandas as pd

# Sample data
df = pd.DataFrame({
    "X": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10]
})

model = LinearRegression()
model.fit(df[["X"]], df["y"])
print("Model coefficient:", model.coef_[0])
```


## R Code

```{r}
# Supervised learning example: simple linear regression
df <- data.frame(X = 1:5, y = c(2, 4, 6, 8, 10))
model <- lm(y ~ X, data = df)
summary(model)

```

# How do you split a dataset into training and test sets?

## Explanation

Splitting the data lets you train a model on one portion and evaluate it on another portion that the model has never seen. Because the test set plays no part in training, the score on it estimates how the model will perform on new data.
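
The idea behind a random split can be sketched without any library: shuffle the row indices with a fixed seed, then cut them at the desired fraction. The function name `split_indices` and the row count of 150 (the size of the standard iris dataset) are assumptions for illustration. This is not how `train_test_split` or `createDataPartition` work internally, and the sketch does not stratify by class.

```python
import random


def split_indices(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(150)

print("Training rows:", len(train_idx))
print("Test rows:", len(test_idx))
print("Rows in both sets:", len(set(train_idx) & set(test_idx)))
```

Every row lands in exactly one of the two sets, which is what keeps the test set unseen during training.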

## Python Code

```python
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("data/iris.csv")
X = df.drop("species", axis=1)
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training set size:", len(X_train))
```


## R Code

```{r}
library(caret)
data <- readr::read_csv("data/iris.csv")
set.seed(42)
train_index <- createDataPartition(data$species, p = 0.7, list = FALSE)
train <- data[train_index, ]
test <- data[-train_index, ]
nrow(train)

```

# How do you train a classification model using logistic regression?

## Explanation

Logistic regression is a linear model for classification. It passes a weighted sum of the input features through the logistic (sigmoid) function to estimate class probabilities. In the binary case the target has exactly two classes; multi-class problems use an extension such as multinomial logistic regression.
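
As a rough illustration of how the logistic function turns a linear score into a probability, the sketch below uses hypothetical weights and a bias (a fitted model would learn these from the training data) and applies a 0.5 threshold to obtain a class label. The function names are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen by hand for illustration only.
weights = [0.5, -1.0]
bias = 0.25


def predict_proba(features):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, threshold=0.5):
    """Return 1 if the estimated probability exceeds the threshold, else 0."""
    return 1 if predict_proba(features) > threshold else 0


print("sigmoid(0):", sigmoid(0.0))
print("P(class 1) for [3.0, 0.5]:", predict_proba([3.0, 0.5]))
print("Predicted class for [3.0, 0.5]:", predict_class([3.0, 0.5]))
print("Predicted class for [0.5, 2.0]:", predict_class([0.5, 2.0]))
```

A linear score of zero corresponds to a probability of exactly 0.5, which is why 0.5 is the usual default cut-off between the two classes.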

## Python Code

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset from local path
df = pd.read_csv("data/iris.csv")

# Subset for binary classification (e.g., setosa vs versicolor)
binary_df = df[df["species"].isin(["setosa", "versicolor"])].copy()
binary_df["species"] = binary_df["species"].map({"setosa": 0, "versicolor": 1})

# Split into train/test
X = binary_df.drop("species", axis=1)
y = binary_df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate model
print("Accuracy:", model.score(X_test, y_test))
```

## R Code

```{r}
library(readr)
library(caret)

# Load dataset
df <- read_csv("data/iris.csv")

# Subset for binary classification (setosa vs versicolor)
df_bin <- subset(df, species %in% c("setosa", "versicolor"))

# Convert species to binary factor with correct level order
df_bin$species <- factor(df_bin$species, levels = c("setosa", "versicolor"))

# Split into train/test
set.seed(42)
index <- createDataPartition(df_bin$species, p = 0.7, list = FALSE)
train <- df_bin[index, ]
test <- df_bin[-index, ]

# Fit logistic regression model; with a two-level factor response,
# glm() models the probability of the second level (versicolor)
model <- glm(species ~ sepal_length + sepal_width + petal_length + petal_width,
             data = train, family = "binomial")

# Predict probabilities
pred_probs <- predict(model, newdata = test, type = "response")

# Classify based on threshold
predicted <- factor(ifelse(pred_probs > 0.5, "versicolor", "setosa"),
                    levels = levels(test$species))

# Evaluate
confusionMatrix(predicted, test$species)
```

# How do you evaluate a classification model?

## Explanation

Evaluation metrics for classification include:

- Accuracy: Proportion of correct predictions.
- Precision: Of all predicted positives, how many were actually positive.
- Recall (Sensitivity): Of all actual positives, how many were predicted positive.
- F1-Score: Harmonic mean of precision and recall.
- Confusion Matrix: Table showing correct vs incorrect predictions per class.

These metrics give a more complete view than accuracy alone, especially for imbalanced datasets.
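
The definitions above can be computed directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts for an imbalanced test set (10 positives, 90 negatives) to show how accuracy can look good while recall is poor. The function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: the model finds only 2 of the 10 actual positives.
metrics = classification_metrics(tp=2, fp=1, fn=8, tn=89)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.91, yet recall is only 0.20, which is the kind of gap that accuracy alone would hide.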


## Python Code

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Load and subset iris dataset
df = pd.read_csv("data/iris.csv")

# Use .copy() to avoid SettingWithCopyWarning when modifying DataFrame
binary_df = df[df["species"].isin(["setosa", "versicolor"])].copy()

# Convert class labels to binary (0 = setosa, 1 = versicolor)
binary_df["species"] = binary_df["species"].map({"setosa": 0, "versicolor": 1})

# Split into training and test sets
X = binary_df.drop("species", axis=1)
y = binary_df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)

# Print confusion matrix and full classification report
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
```

## R Code

```{r}
library(readr)
library(caret)

# Load and subset iris dataset
df <- read_csv("data/iris.csv")
df_bin <- subset(df, species %in% c("setosa", "versicolor"))
df_bin$species <- factor(df_bin$species, levels = c("setosa", "versicolor"))

# Split into training and test sets
set.seed(42)
index <- createDataPartition(df_bin$species, p = 0.7, list = FALSE)
train <- df_bin[index, ]
test <- df_bin[-index, ]

# Train logistic regression model
model <- glm(species ~ sepal_length + sepal_width + petal_length + petal_width,
             data = train, family = "binomial")

# Predict probabilities and convert to classes
pred_probs <- predict(model, newdata = test, type = "response")
predicted <- factor(ifelse(pred_probs > 0.5, "versicolor", "setosa"),
                    levels = levels(test$species))

# Evaluate using confusion matrix (includes accuracy, sensitivity, specificity, etc.)
confusionMatrix(predicted, test$species)

```

# How do you train a decision tree for prediction?

## Explanation

Decision trees are flexible models that recursively split the dataset based on feature values to form decision rules. For classification, they predict a class label; for regression, they predict a continuous value. Trees are interpretable and can handle both linear and non-linear patterns.
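
To illustrate what "splitting on feature values" means, the following sketch searches for the single threshold on one feature that gives the purest two groups, measured by Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`). The measurements are hypothetical values chosen for illustration, and the helper names `gini` and `best_threshold` are not part of any library. A real tree repeats this search over every feature and then recurses into each group.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values.

    Returns (threshold, weighted impurity) of the best split, or
    (None, impurity of the unsplit data) if no split improves on it.
    """
    distinct = sorted(set(values))
    best = (None, gini(labels))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical petal lengths (cm) and their class labels.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(petal_length, species)
print("Impurity before splitting:", gini(species))
print("Best threshold:", threshold)
print("Weighted impurity after split:", impurity)
```

The resulting rule ("petal length at or below the threshold means setosa") is the kind of human-readable decision rule that makes trees interpretable.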

## Python Code

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load and subset iris dataset for binary classification
df = pd.read_csv("data/iris.csv")

# Use .copy() to avoid SettingWithCopyWarning
binary_df = df[df["species"].isin(["setosa", "versicolor"])].copy()

# Convert labels to binary (0 = setosa, 1 = versicolor)
binary_df["species"] = binary_df["species"].map({"setosa": 0, "versicolor": 1})

# Split into train/test
X = binary_df.drop("species", axis=1)
y = binary_df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train decision tree classifier
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)

# Predict and evaluate
y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

## R Code

```{r}
library(readr)
library(caret)
library(rpart)

# Load and subset iris dataset for binary classification
df <- read_csv("data/iris.csv")
df_bin <- subset(df, species %in% c("setosa", "versicolor"))
df_bin$species <- factor(df_bin$species, levels = c("setosa", "versicolor"))

# Split into train/test
set.seed(42)
index <- createDataPartition(df_bin$species, p = 0.7, list = FALSE)
train <- df_bin[index, ]
test <- df_bin[-index, ]

# Train decision tree
tree_model <- rpart(species ~ ., data = train, method = "class", control = rpart.control(maxdepth = 3))

# Predict and evaluate
predicted <- predict(tree_model, newdata = test, type = "class")
confusionMatrix(predicted, test$species)

```