## ML vs. Traditional Programming

In traditional programming, a programmer writes explicit rules and instructions for the computer to follow. In contrast, machine learning allows the computer to learn from data and improve its performance over time without being explicitly programmed for every task.

One way to think about the difference is:

-   **Traditional Programming**: input + model → output
-   **Machine Learning**: input + output → model
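The contrast above can be sketched in plain Python. This is a toy illustration rather than a real learning algorithm: the function names, the hand-picked threshold, and the six training examples are all made up for the sake of the example.

```python
# Traditional programming: a person writes the rule by hand.
def is_long_petal_rule(petal_length_cm):
    # Hypothetical hand-picked threshold, for illustration only.
    return int(petal_length_cm > 2.5)


# Machine learning (toy version): the threshold is derived from labeled examples.
def learn_threshold(lengths, labels):
    """Return the midpoint between the largest 'short' and smallest 'long' example."""
    short = [x for x, y in zip(lengths, labels) if y == 0]
    long_ = [x for x, y in zip(lengths, labels) if y == 1]
    return (max(short) + min(long_)) / 2


# Made-up training data: inputs (lengths) and the desired outputs (labels).
lengths = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 1, 1, 1]

# The learned threshold plays the role of the "model".
threshold = learn_threshold(lengths, labels)


def predict(petal_length_cm):
    return int(petal_length_cm > threshold)


print(threshold, predict(1.2), predict(4.8))  # prints: 3.0 0 1
```

In the first function the rule comes from the programmer; in the second, the rule comes from the examples, so changing the data changes the rule without touching the code.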

## Key Concepts in Machine Learning

The following key concepts come up throughout the workshop, so it is useful to introduce them early:

-   **Data**: The foundation of machine learning. Data can be structured (e.g., tables, databases) or unstructured (e.g., text, images).
-   **Observations**: Individual data points or instances in the dataset. In a tabular dataset, observations are the rows that represent different examples.
-   **Features**: Individual measurable properties or characteristics of the data. In a tabular dataset, features are the columns that represent different attributes.
-   **Labels**: The target variable or outcome that the model is trying to predict. This is usually one of the columns in a tabular dataset.
-   **Model**: A mathematical representation of the relationship between features and labels. The model is trained on the data to learn these relationships.
-   **Training**: The process of teaching a machine learning model using a dataset. During training, the model learns to identify patterns in the data.
-   **Testing**: Evaluating the performance of a trained model on a separate dataset to assess its accuracy and generalization capabilities.
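As a rough sketch of how these terms map onto code, the example below uses the Iris dataset that ships with `scikit-learn`. The choice of classifier (`KNeighborsClassifier`), the 25% test split, and the random seed are arbitrary assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Data: X holds the features (one column each), y holds the labels.
X, y = load_iris(return_X_y=True)
n_observations, n_features = X.shape  # rows are observations, columns are features

# Hold out 25% of the observations for testing (split size and seed are arbitrary).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Model and training: fit the classifier on the training portion only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Testing: accuracy on observations the model has not seen during training.
test_accuracy = model.score(X_test, y_test)
print(n_observations, n_features, round(test_accuracy, 2))
```

Keeping the test observations out of `fit` is what makes the reported accuracy an estimate of how the model generalizes rather than how well it memorized the training data.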

## Types of Machine Learning

Machine learning can be broadly categorized into three main types based on the nature of the learning task:

-   **Supervised Learning**: The model is trained on a labeled dataset, where each observation has a corresponding label. The goal is to learn a mapping from features to labels. Examples include **classification** and **regression** tasks.
-   **Unsupervised Learning**: The model is trained on an unlabeled dataset, where the goal is to find patterns or groupings in the data. Examples include **clustering** and **dimensionality reduction** tasks.
-   **Reinforcement Learning**: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The goal is to learn a policy that maximizes cumulative rewards over time.
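The difference between the first two types shows up directly in how a model is fitted. The sketch below is one possible illustration using `scikit-learn`; the specific estimators (`LogisticRegression`, `KMeans`) and their settings are assumptions chosen for brevity. Reinforcement learning is not shown, since it needs an environment to interact with.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are passed to fit() alongside the features.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)
predicted_species = classifier.predict(X[:5])

# Unsupervised: only X is passed; KMeans never sees the labels.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print(predicted_species)
print(cluster_ids[:5])
```

Note that the cluster ids are arbitrary integers: cluster `0` need not correspond to species `0`, because the clustering algorithm has no access to the species labels.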

We will explore these concepts in more detail shortly. For now, the important point is that machine learning relies heavily on data and, in supervised learning, on the relationships between features and labels.

We will also introduce more advanced concepts in later sessions, as needed.

## Practical Demonstration

The `scikit-learn` library, one of the most widely used machine learning libraries in Python, provides a wide range of tools for building and evaluating machine learning models. It includes datasets, preprocessing utilities, and various algorithms for classification, regression, clustering, and more.

We will start by exploring some of the built-in datasets in `scikit-learn`, which are useful for learning and practicing machine learning concepts. These datasets are often used as benchmarks for testing algorithms and understanding their behavior.

The first dataset we will explore is the **Iris dataset**, which is a classic dataset in machine learning. It contains measurements of 150 iris flowers and their corresponding species labels. The dataset has four features (sepal length, sepal width, petal length, and petal width) and three classes, one per species (setosa, versicolor, and virginica).

Import the necessary libraries and load the Iris dataset.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

The `iris` object returned by `load_iris()` is a `Bunch`, a dictionary-like container whose entries can also be accessed as attributes:

In [None]:
print(type(iris))

Let's print the keys of this dictionary:

In [None]:
print(iris.keys())
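As a small sketch of what those keys give access to, the snippet below checks that key access and attribute access return the same array and prints the shapes of the feature matrix and the target vector.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Key access and attribute access return the same underlying object.
same_object = iris["data"] is iris.data

# Feature matrix: one row per observation, one column per feature.
data_shape = iris.data.shape
# Target vector: one integer label per observation.
target_shape = iris.target.shape

print(same_object, data_shape, target_shape)  # prints: True (150, 4) (150,)
print(iris.feature_names)
print(list(iris.target_names))
```

The integer values in `iris.target` index into `iris.target_names`, which is how the species names are recovered from the numeric labels.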

Print the description of the dataset to understand its contents.

In [None]:
print(iris['DESCR'])

Load the features into a pandas DataFrame for easier manipulation and visualization, and add a `species` column that maps each integer target code to its species name.

In [None]:
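import pandas as pd

# One row per flower, one column per feature
iris_df = pd.DataFrame(data=iris['data'],
                       columns=iris['feature_names'])
# Use the integer target codes (0, 1, 2) to look up the species names
iris_df['species'] = iris['target_names'][iris['target']]
print(iris_df.head())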
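As an alternative sketch, recent versions of `scikit-learn` can build the DataFrame directly through the `as_frame=True` argument of `load_iris` (this assumes pandas is installed). The variable name `iris_frame` is just an illustrative choice.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
iris_frame = load_iris(as_frame=True).frame

# The 'target' column holds integer codes (0, 1, 2) rather than species names.
class_counts = iris_frame["target"].value_counts().sort_index()

print(iris_frame.shape)  # prints: (150, 5)
print(class_counts.tolist())  # prints: [50, 50, 50]
```

Unlike the manual construction, this frame stores the label as a numeric `target` column, so the species names still have to be looked up in `target_names` if they are needed.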

Finally, let's save the DataFrame to a CSV file for later use.

In [None]:
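# Write the table without the index column; the target folder must already exist
output_path = '../../data/iris.csv'
iris_df.to_csv(output_path, index=False)
print(f"Iris dataset saved to '{output_path}'")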
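Writing a CSV fails if the target folder does not exist. The sketch below shows one way to guard against that and to confirm the file reads back correctly. It uses a temporary directory and a tiny stand-in DataFrame with made-up values so that it runs anywhere; the folder layout is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in DataFrame; the values are made up for illustration.
df = pd.DataFrame(
    {"sepal length (cm)": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location; create missing parent folders before writing.
    out_path = Path(tmp) / "data" / "iris.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip kept the rows and columns.
    restored = pd.read_csv(out_path)

print(restored.shape)  # prints: (2, 2)
print(list(restored.columns))
```

Passing `index=False` matters here: without it, the row index is written as an extra unnamed column and shows up again when the file is read back.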

## Hands-on Exercises

Explore the California housing dataset:

-   Load the California housing dataset from `scikit-learn` (see `fetch_california_housing`, which downloads the data on first use).
-   Transform the data into a pandas DataFrame for easier manipulation and visualization.
-   Print the first few rows of the dataset, the names of the features and the target variable, and the number of observations and features in the dataset.
-   Save the DataFrame to a CSV file for later use.