# Classify Forest Cover Types with Regularized Logistic Regression

---

**Dataset**: *Forest Covertypes* (`sklearn.datasets.fetch_covtype`)
- Multi-class target: 7 forest cover types (classes 1–7)
- 54 numerical features: 10 quantitative cartographic measurements, 4 binary wilderness-area indicators, and 40 binary soil-type indicators
- Real-world dataset used in remote sensing & ecology

---

1. **Load the Forest CoverType Dataset**
   - Use `from sklearn.datasets import fetch_covtype`
   - Load the data and take a random sample of ~10,000 instances for faster training
   - Print the number of samples, features, and target classes

In [1]:
from sklearn.datasets import fetch_covtype
data = fetch_covtype()
X, y = data.data, data.target

In [5]:
print(data['DESCR'])

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

Classes                        7
Samples total             581012
Dimensionality                54
Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`` as pandas
data

In [3]:
y

array([5, 5, 2, ..., 3, 3, 3], shape=(581012,), dtype=int32)

In [2]:
X

array([[2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.804e+03, 1.390e+02, 9.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [2.386e+03, 1.590e+02, 1.700e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.384e+03, 1.700e+02, 1.500e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.383e+03, 1.650e+02, 1.300e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00]], shape=(581012, 54))
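
The cells above load the full 581,012-row arrays but do not yet draw the ~10,000-row sample that step 1 asks for. A minimal sketch of that step follows. The helper name `sample_rows`, the seed, and the synthetic stand-in arrays are assumptions made for illustration; in the notebook the call would be `sample_rows(X, y, n=10_000)` on the arrays returned by `fetch_covtype`.

```python
import numpy as np

def sample_rows(X, y, n, seed=42):
    """Return n rows of X and y drawn uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n, replace=False)
    return X[idx], y[idx]

# Synthetic stand-in with the same layout as the covertype arrays
# (54 columns, labels 1-7); replace with X, y from fetch_covtype().
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50_000, 54))
y_demo = rng.integers(1, 8, size=50_000)

X_small, y_small = sample_rows(X_demo, y_demo, n=10_000)
print("samples: ", X_small.shape[0])
print("features:", X_small.shape[1])
print("classes: ", np.unique(y_small))
```

A uniform sample keeps roughly the original class proportions, so the rarest cover types may end up with only a handful of rows. If that becomes a problem for 5-fold cross-validation, `train_test_split(X, y, train_size=10_000, stratify=y)` from `sklearn.model_selection` is one alternative that keeps the class proportions fixed.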

2. **Preprocess the Data**
   - Standardize features using `StandardScaler`
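
As a sketch of the standardization step, the snippet below scales a small synthetic matrix and then wraps the scaler in a pipeline. The data and variable names are illustrative assumptions, not the covertype sample itself. Placing `StandardScaler` inside a pipeline means the scaling statistics are re-estimated on each training fold during cross-validation, which avoids leaking information from the held-out fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0], size=(500, 3))

# Standalone use: each column ends up with mean ~0 and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))

# Pipeline use: the scaler is fit only on the data passed to .fit(),
# so it can be handed directly to cross-validation utilities.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(model)
```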

3. **Train Multi-class Logistic Regression**
   - Use `LogisticRegression()`
   - Train with 5-fold cross-validation
   - Report accuracy and confusion matrix
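
One way to approach step 3 is sketched below. It uses `make_classification` as a synthetic stand-in so the snippet runs without downloading anything; the sample size, class count, and `max_iter=1000` are assumptions, and the covertype sample would replace `X_demo` and `y_demo`. `cross_val_score` gives per-fold accuracy, while `cross_val_predict` collects out-of-fold predictions that can be fed to `confusion_matrix`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    cross_val_score,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multi-class stand-in for the sampled covertype data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_classes=4, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Per-fold accuracy.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())

# Out-of-fold predictions: every sample is predicted by a model
# that did not see it during training.
y_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)
cm = confusion_matrix(y_demo, y_pred)
print(cm)
```

Rows of the confusion matrix are true classes and columns are predicted classes, so off-diagonal entries show which cover types get confused with each other.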

4. **Apply Regularization**
   - Try different `C` values (e.g., `[0.01, 0.1, 1, 10]`)
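   - Compare **L1** vs **L2** penalties (L1 needs a compatible solver such as `solver='saga'`; the default `lbfgs` solver supports only L2)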
   - Review differences in model sparsity and accuracy
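
The comparison in step 4 could be sketched as the loop below, again on a synthetic stand-in rather than the covertype sample. The `saga` solver is chosen here because it accepts both penalties; `max_iter=2000` and the dataset parameters are assumptions. Newer scikit-learn releases may warn that `penalty` is being phased out in favor of `l1_ratio`, so the exact spelling may need adjusting for the installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)

results = {}
for penalty in ("l1", "l2"):
    for C in (0.01, 0.1, 1, 10):
        clf = LogisticRegression(penalty=penalty, C=C, solver="saga", max_iter=2000)
        model = make_pipeline(StandardScaler(), clf)

        # Accuracy from 5-fold cross-validation.
        acc = cross_val_score(model, X_demo, y_demo, cv=5).mean()

        # Sparsity from a fit on all of the data: count non-zero weights.
        model.fit(X_demo, y_demo)
        n_nonzero = int(np.count_nonzero(model[-1].coef_))

        results[(penalty, C)] = (acc, n_nonzero)
        print(f"penalty={penalty}  C={C:<5}  accuracy={acc:.3f}  non-zero coefs={n_nonzero}")
```

Smaller `C` means stronger regularization. With L1 this tends to drive coefficients exactly to zero, whereas L2 shrinks them without eliminating them, which is the sparsity difference the task asks about.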

5. **Use GridSearchCV**
   - Perform grid search over `C` and `penalty` with 5-fold CV
   - Output best model and hyperparameters
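
A possible shape for the grid search is shown below, under the same assumptions as before: a synthetic stand-in dataset, the `saga` solver so both penalties are valid, and `max_iter=2000`. Because the classifier sits inside a pipeline, its parameters are addressed with the `logisticregression__` prefix that `make_pipeline` generates from the class name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sampled covertype data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=2000),
)

# Step names in make_pipeline are the lowercased class names.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10],
    "logisticregression__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best params:     ", search.best_params_)
print("best CV accuracy:", search.best_score_)
best_model = search.best_estimator_  # refit on all of X_demo by default
```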

6. **(Bonus) Analyze Feature Importance**
   - For L1-regularized models, which features are selected (non-zero coefficients)?
   - What do those features mean?
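
For the bonus question, one hedged approach is to fit an L1-penalized model and list the features whose coefficient is non-zero for at least one class. The sketch below uses a synthetic stand-in with placeholder names (`feature_0`, `feature_1`, ...) and an arbitrary `C=0.05`; both are assumptions. For the real data, the Bunch returned by `fetch_covtype` should expose column names as `data.feature_names` in recent scikit-learn versions, which could replace the placeholder list.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

clf = LogisticRegression(penalty="l1", C=0.05, solver="saga", max_iter=2000)
model = make_pipeline(StandardScaler(), clf).fit(X_demo, y_demo)

# coef_ has one row per class and one column per feature.
coef = model[-1].coef_

# A feature counts as selected if any class gives it a non-zero weight;
# rank selected features by their largest absolute weight across classes.
importance = np.abs(coef).max(axis=0)
selected = [(name, imp) for name, imp in zip(feature_names, importance) if imp > 0]
selected.sort(key=lambda pair: pair[1], reverse=True)

print(f"{len(selected)} of {len(feature_names)} features kept")
for name, imp in selected:
    print(f"{name:<12} max |coef| = {imp:.3f}")
```

Because the features are standardized before fitting, coefficient magnitudes are on a comparable scale, which makes this ranking a rough but usable indication of which inputs the model relies on.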

---

