# What is Machine Learning?

* Machine learning involves building mathematical models to help understand data
* "Learning" enters the fray when we give these models _tunable parameters_ that can be adapted to observed data
* Once these models have been fit to previously seen data, they can be used to predict and understand aspects of newly observed data

## Categories of Machine Learning

* Two main types: supervised and unsupervised learning
* Supervised learning: models that can predict labels based on labeled training data
    * Classification: models that predict labels as two or more discrete categories
    * Regression: models that predict continuous labels
* Unsupervised learning: models that identify structure in unlabeled data
    * Clustering: models that detect and identify distinct groups in the data
    * Dimensionality reduction: models that detect and identify lower-dimensional structure in higher-dimensional data

## Data Representation in Scikit-Learn

* Features matrix (X): Two-dimensional array with shape [n_samples, n_features]
* Target array (y): One-dimensional array with length n_samples
* Data is typically stored in NumPy arrays, Pandas DataFrames, or SciPy sparse matrices
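
As a minimal sketch of this layout, the snippet below builds a small features matrix and target array by hand. The numbers are made up for illustration; only the shapes matter.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows), 3 features (columns)
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
    [6.5, 3.0, 5.2],
])

# One target value per row of X
y = np.array([0, 0, 1, 2, 2])

n_samples, n_features = X.shape
print(X.shape)  # (5, 3)
print(y.shape)  # (5,)
```

The first dimension of `X` must match the length of `y`, since each row of `X` is paired with one entry of `y`.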

## Scikit-Learn's Estimator API

* Consistency: Common interface for all algorithms
* Inspection: Parameters exposed as public attributes
* Composition: Complex tasks as sequences of simpler algorithms
* Sensible defaults: Appropriate default parameter values
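
These principles can be illustrated with a short sketch. The choice of `StandardScaler` and `make_pipeline` here is an assumption made for the example, not something these notes prescribe; any transformer and estimator pair would show the same idea.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sensible defaults: the estimator is usable without any arguments
model = LinearRegression()

# Inspection: hyperparameters are public attributes and appear in get_params()
print(model.fit_intercept)  # True
params = model.get_params()

# Composition: chain a preprocessing step and a model into one estimator
pipe = make_pipeline(StandardScaler(), model)

# Consistency: the composed object offers the same fit/predict interface
has_common_interface = hasattr(pipe, "fit") and hasattr(pipe, "predict")
```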

## Basic Steps for Using Scikit-Learn

1. Choose model class: Import appropriate estimator (e.g., `from sklearn.linear_model import LinearRegression`)
2. Choose hyperparameters: Instantiate model with desired settings
3. Prepare data: Arrange into features matrix (X) and target vector (y)
4. Fit model: Use `model.fit(X, y)` method
5. Apply model
    * For supervised learning: `model.predict(X_new)`
    * For unsupervised learning: `model.transform(X)` or `model.predict(X)`
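
The five steps above can be sketched end to end with `LinearRegression`. The data is a synthetic, noise-free line chosen for illustration, so the fitted parameters are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose model class

# 2. Choose hyperparameters by instantiating the model
model = LinearRegression(fit_intercept=True)

# 3. Prepare data: illustrative points on the line y = 2x - 1
x = np.arange(10, dtype=float)
X = x[:, np.newaxis]   # features matrix, shape (10, 1)
y = 2.0 * x - 1.0      # target vector, shape (10,)

# 4. Fit the model to the data
model.fit(X, y)

# Learned parameters carry a trailing underscore
print(model.coef_, model.intercept_)

# 5. Apply the model to new inputs
X_new = np.array([[10.0], [11.0]])
y_new = model.predict(X_new)
print(y_new)
```

Because the data lies exactly on a line, the fitted slope and intercept should recover 2 and -1 up to floating-point error.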

## Important Utility Functions

* `train_test_split()` - Divides data into training and testing sets
* `accuracy_score()` - Measures prediction accuracy
* `confusion_matrix()` - Shows detailed breakdown of correct/incorrect predictions
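
A small sketch of these three utilities follows. The arrays are hand-made for illustration, and `test_size` and `random_state` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(Xtrain.shape, Xtest.shape)

# Hand-written true and predicted labels, with one mistake (a 1 predicted as 2)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions: 5/6
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are true labels, columns are predicted labels:
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```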

## Supervised Learning Examples

* **Linear Regression:**
    * Learned model parameters are stored in attributes with trailing underscores (e.g., `model.coef_`, `model.intercept_`)

* **Classification** (e.g., Iris dataset with Gaussian Naive Bayes):
```python
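from sklearn.naive_bayes import GaussianNB

model = GaussianNB()            # instantiate with default hyperparameters
model.fit(Xtrain, ytrain)       # fit on the training split
y_model = model.predict(Xtest)  # predict labels for the held-out split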
```
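
For a version that runs on its own, the sketch below loads Iris with `load_iris` and creates the train/test split before fitting. The `random_state` value is an arbitrary choice for reproducibility, and the exact accuracy depends on the split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Features matrix of shape (150, 4) and target array of length 150
X, y = load_iris(return_X_y=True)

# Hold out part of the data to evaluate on unseen samples
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)

model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

# Fraction of held-out samples labeled correctly
acc = accuracy_score(ytest, y_model)
print(acc)
```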

## Unsupervised Learning Examples

* **Dimensionality Reduction** (e.g., PCA):
```python
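from sklearn.decomposition import PCA

model = PCA(n_components=2)  # keep two principal components
model.fit(X)                 # learn the components from X
X_2D = model.transform(X)    # project X onto those components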
```

* **Clustering** (e.g., Gaussian Mixture Model):
```python
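from sklearn.mixture import GaussianMixture  # replaces the removed GMM class

model = GaussianMixture(n_components=3)
model.fit(X)
y_gmm = model.predict(X)  # cluster label for each sample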
```
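
Both unsupervised snippets can be run on the Iris features as a self-contained sketch. The `covariance_type` and `random_state` settings are assumptions made for the example. The cluster labels are arbitrary integers and need not line up with the species labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Only the features are used; the labels are ignored
X, _ = load_iris(return_X_y=True)

# Dimensionality reduction: project 4 features down to 2
pca = PCA(n_components=2)
X_2D = pca.fit_transform(X)  # same result as fit(X) followed by transform(X)
print(X_2D.shape)  # (150, 2)

# Clustering: fit a 3-component Gaussian mixture and assign each sample a cluster
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)
y_gmm = gmm.predict(X)
print(y_gmm[:10])
```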

