# Lecture 7 Activity

This notebook builds on the in-class notes covering the basic scikit-learn workflow. This time we will look at another classic ML dataset, which links the properties of a wine (alcohol content, color, etc.) to its type. There are three categories in total. The goal is to train two kinds of scikit-learn classifiers on this task and see which performs better.

### Steps
1. Load data into `X` (features) and `y` (labels)
2. Split into train/test using `train_test_split`
3. Create a model
4. Fit with `.fit(X_train, y_train)`
5. Predict with `.predict(X_test)`
6. Evaluate with a metric
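
The six steps above can be sketched end to end as follows. This is only a minimal illustration, assuming the wine dataset and a decision tree; the seed of 0 and the variable names `model` and `acc` are arbitrary choices, not values prescribed by this activity.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load features (X) and labels (y)
X, y = load_wine(return_X_y=True, as_frame=True)

# 2. Hold out 25% of the rows for testing (seed 0 is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Create the model, 4. fit it on the training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 5. Predict labels for the held-out rows
y_pred = model.predict(X_test)

# 6. Evaluate with a metric (here: accuracy)
acc = accuracy_score(y_test, y_pred)
print(f"accuracy = {acc:.3f}")
```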

### Set up imports
Import pandas and the needed scikit-learn libraries. 

In [1]:
# import the necessary libraries

import pandas as pd
from sklearn import datasets

### Load the wine dataset

It is built into scikit-learn in a similar way to the iris dataset (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html">hint</a>).

**TODO:** Load the dataset and store:
- `X` as a pandas DataFrame
- `y` as a pandas Series


In [12]:
# Load the wine data into X (DataFrame) and y (Series)
wine = datasets.load_wine()

X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)


In [13]:
assert isinstance(X, pd.DataFrame), 'X should be a pandas DataFrame.'
assert isinstance(y, pd.Series), 'y should be a pandas Series.'
assert X.shape[0] == y.shape[0], 'X and y must have the same number of rows.'
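
As an alternative to building the DataFrame and Series by hand, `load_wine` can return pandas objects directly. The sketch below assumes a scikit-learn version that supports the `as_frame` argument; the names `X_alt` and `y_alt` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_wine

# return_X_y=True gives (features, labels); as_frame=True makes them pandas objects
X_alt, y_alt = load_wine(return_X_y=True, as_frame=True)

print(type(X_alt).__name__, X_alt.shape)
print(type(y_alt).__name__, y_alt.shape)
```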

### Inspect the data

1. Display the first 5 rows of the `X` DataFrame.

2. Print the values in `y`.

In [14]:
# display first 5 rows of X
X.head()


| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
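
The first rows already hint that the features live on very different scales (compare `hue` with `proline`). A quick way to confirm this is to look at the minimum and maximum of a few columns. This is a sketch; the three columns chosen are just examples.

```python
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True, as_frame=True)

# Compare the value ranges of a few features
summary = X.describe().loc[["min", "max"], ["hue", "alcohol", "proline"]]
print(summary)
```

Large differences in scale matter for some models (such as logistic regression) and much less for others (such as decision trees).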


In [15]:
# inspect y by printing it out.
print(y)

0      0
1      0
2      0
3      0
4      0
      ..
173    2
174    2
175    2
176    2
177    2
Length: 178, dtype: int64
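
Because the printed labels are sorted by class, the output above does not show how many wines fall into each category. A small sketch using `value_counts` makes the class balance visible:

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
y = pd.Series(wine.target)

# Number of samples per class, ordered by class label
counts = y.value_counts().sort_index()
print(counts)
```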


### Create training and testing data

Break the data down into training and testing data.

Create `X_train`, `X_test`, `y_train`, `y_test`.

Reserve 75% of the data for training and 25% for testing. Use a random seed of 42.


In [17]:
# Split the data here

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [18]:
assert X_train is not None, 'You must create X_train (and the others too).'
assert X_train.shape[0] + X_test.shape[0] == X.shape[0], 'Train + test rows must equal total rows.'
assert y_train.shape[0] == X_train.shape[0], 'y_train must match X_train rows.'
assert y_test.shape[0] == X_test.shape[0], 'y_test must match X_test rows.'

# Check approximate split ratio (allowing small rounding)
expected_test = int(round(0.25 * X.shape[0]))
assert abs(X_test.shape[0] - expected_test) <= 1, 'Test set size looks incorrect.'

# Check stratification: each class should appear in both train and test
assert set(y.unique()).issubset(set(y_train.unique())), 'A class is missing from y_train. Did you stratify?'
assert set(y.unique()).issubset(set(y_test.unique())), 'A class is missing from y_test. Did you stratify?'
print('✅ Passed: Train/test split looks correct.')

✅ Passed: Train/test split looks correct.
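
The check above asks whether the split was stratified. A plain random split usually contains every class, but it does not guarantee matching class proportions. The sketch below shows one way to request that with the `stratify` argument of `train_test_split`; the comparison table is only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# stratify=y keeps the class proportions (nearly) identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

full = y.value_counts(normalize=True).sort_index()
test = y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full, "test": test}).round(3))
```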


## Create the models

We’ll use **Logistic Regression** and **Decision Tree Classifier** for classification.

**TODO:** Create a `LogisticRegression` model named `logistic_model`. You'll need to import it like this: `from sklearn.linear_model import LogisticRegression`

**TODO:** Create a `DecisionTreeClassifier` model named `tree_model`. You'll need to import the decision tree from scikit-learn as in the notes.


In [21]:
# create the logistic model (we don't need to train it yet)
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()

In [22]:
assert logistic_model is not None, 'You must create a model object named `logistic_model`.'
assert isinstance(logistic_model, LogisticRegression), 'logistic_model should be a LogisticRegression instance.'
print('✅ Passed: Logistic Model created.')

✅ Passed: Logistic Model created.


In [None]:
# create basic decision tree model here. Set the random_state parameter to 42 for reproducibility.
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(random_state=42)

In [26]:
# check decision tree model
assert tree_model is not None, 'You must create a model object named `tree_model`.'
assert isinstance(tree_model, DecisionTreeClassifier), 'tree_model should be a DecisionTreeClassifier instance.'
print('✅ Passed: Decision Tree Model created.')

✅ Passed: Decision Tree Model created.
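
Creating an estimator only stores its settings; nothing is learned until `.fit` is called. The sketch below illustrates this with a separate, hypothetical `demo_tree` object (fitted on the full dataset purely for demonstration), using the `classes_` attribute that scikit-learn classifiers set during fitting.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

demo_tree = DecisionTreeClassifier(random_state=42)

# Before fitting: hyperparameters exist, learned attributes do not
print(demo_tree.get_params()["random_state"])
fitted_before = hasattr(demo_tree, "classes_")

# After fitting: learned attributes such as classes_ are available
demo_tree.fit(X, y)
fitted_after = hasattr(demo_tree, "classes_")

print(fitted_before, fitted_after)
```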


### Fit the models

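Logistic regression may warn that it reached its iteration limit before converging on this unscaled data. You can ignore the warning for this activity.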

In [27]:
# use the training data to fit both models
logistic_model.fit(X_train, y_train)
tree_model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | |
| random_state | |
| max_leaf_nodes | |
| min_impurity_decrease | 0.0 |


In [28]:
assert hasattr(logistic_model, 'classes_'), 'Logistic model does not look fitted yet. Did you call model.fit(...) ?'
assert len(logistic_model.classes_) >= 2, 'Fitted logistic model should have 2+ classes.'
print('✅ Passed: Logistic Model fitted.')

assert hasattr(tree_model, 'classes_'), 'Decision tree model does not look fitted yet. Did you call model.fit(...) ?'
assert len(tree_model.classes_) >= 2, 'Fitted decision tree model should have 2+ classes.'
print('✅ Passed: Decision Tree Model fitted.')

✅ Passed: Logistic Model fitted.
✅ Passed: Decision Tree Model fitted.
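
The convergence warning shown during fitting suggests scaling the data or raising the iteration limit. The sketch below is one possible way to do both, and it is not part of the graded activity: it assumes a `StandardScaler` chained with `LogisticRegression` through `make_pipeline`, and the name `scaled_logistic` and the value `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train, y_train)

scaled_accuracy = scaled_logistic.score(X_test, y_test)
print(f"accuracy with scaling = {scaled_accuracy:.3f}")
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training, because `fit` only ever sees `X_train`.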


### Generate test set predictions using both models

In [31]:
# Predict on the test data with both models
y_pred_logistic = logistic_model.predict(X_test)
y_pred_tree = tree_model.predict(X_test)

In [32]:
# --- Checks (do not edit) ---
assert y_pred_logistic is not None, 'You must set y_pred_logistic.'
assert y_pred_tree is not None, 'You must set y_pred_tree.'

assert len(y_pred_logistic) == len(y_test), 'y_pred_logistic should have the same length as y_test.'
assert len(y_pred_tree) == len(y_test), 'y_pred_tree should have the same length as y_test.'

print('✅ Passed: Predictions computed.')

✅ Passed: Predictions computed.
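
Before reducing predictions to a single score, it can help to see which classes get confused with which. The sketch below builds a confusion matrix for a separately fitted, hypothetical `demo_tree`; it assumes the same 75/25 split settings as the code above.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

demo_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred_demo = demo_tree.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred_demo, labels=[0, 1, 2])
print(cm)

# Correct predictions sit on the diagonal
print("correct:", cm.trace(), "of", cm.sum())
```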


## Evaluate with accuracy

**TODO:** Compute the accuracy of each model as floats named `logistic_accuracy` and `tree_accuracy`.

Then print each with 3 decimal places.


In [35]:
# get the accuracy score for both models
from sklearn.metrics import accuracy_score
logistic_accuracy = accuracy_score(y_test, y_pred_logistic)
tree_accuracy = accuracy_score(y_test, y_pred_tree)
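
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this by hand against `accuracy_score`, using small made-up label arrays (`y_true` and `y_hat` are toy values, not data from this notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_hat = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Fraction of positions where prediction and truth agree (6 of 8 here)
manual = (y_true == y_hat).mean()
library = accuracy_score(y_true, y_hat)

print(f"manual = {manual:.3f}, accuracy_score = {library:.3f}")
```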

In [36]:
# --- Checks (do not edit) ---
assert logistic_accuracy is not None, 'You must set logistic_accuracy.'
assert 0.0 <= logistic_accuracy <= 1.0, 'Accuracy must be between 0 and 1.'

assert tree_accuracy is not None, 'You must set tree_accuracy.'
assert 0.0 <= tree_accuracy <= 1.0, 'Accuracy must be between 0 and 1.'


print(f'✅ Passed: Accuracy = {logistic_accuracy:.3f}')
print(f'✅ Passed: Accuracy = {tree_accuracy:.3f}')

✅ Passed: Accuracy = 0.978
✅ Passed: Accuracy = 0.956
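
With a 25% test split of 178 rows, the test set holds 45 samples, so the gap between 0.978 (44 of 45 correct) and 0.956 (43 of 45 correct) is a single wine. One way to get a steadier comparison is cross-validation. The sketch below is an optional extension, not part of the activity: it assumes 5 folds and, unlike the model above, scales the features before logistic regression so that it converges.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Hypothetical candidates; the scaled pipeline differs from logistic_model above
candidates = {
    "logistic (scaled)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=42),
}

cv_means = {}
for name, model in candidates.items():
    # For classifiers, an integer cv uses stratified folds
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```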
