
# What is Machine Learning?

**Machine Learning** (ML) is a subfield of **artificial intelligence** (AI) that focuses on developing algorithms and statistical models that enable computers to perform tasks without being given explicit, task-specific rules. Instead, ML systems learn from data: they identify patterns in it and make predictions or decisions based on those patterns.

## ML vs. Traditional Programming

In traditional programming, a programmer writes explicit rules and instructions for the computer to follow. In contrast, machine learning allows the computer to learn from data and improve its performance over time without being explicitly programmed for every task.

One way to think about the difference is:

-   **Traditional Programming**: input + model $\Rightarrow$ output
-   **Machine Learning**: input + output $\Rightarrow$ model
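
To make the contrast concrete, here is a minimal sketch using the Iris dataset that is introduced later in this section. The hand-written rule and its 2.5 cm threshold are illustrative assumptions, not part of the workshop material; the learned version uses a depth-1 decision tree as one simple way to derive a comparable rule from inputs and known outputs.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target


# Traditional programming: a human writes the rule.
# The 2.5 cm threshold is a hand-picked, illustrative value.
def is_setosa_rule(petal_length_cm):
    return petal_length_cm < 2.5


# Machine learning: the rule is derived from inputs and known outputs.
is_setosa = (y == 0)          # class 0 is setosa in this dataset
petal_length = X[:, [2]]      # column 2 is petal length, kept 2-D for scikit-learn
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(petal_length, is_setosa)

# A depth-1 tree stores a single split; its threshold is the learned "rule".
learned_threshold = model.tree_.threshold[0]
print("Learned threshold (cm):", learned_threshold)
```

In both cases the result is a rule of the form "petal length below some value", but only in the second case does the value come from the data.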

## Key Concepts in Machine Learning

A few key concepts are worth introducing early in the workshop:

-   **Data**: The foundation of machine learning. Data can be structured (e.g., tables, databases) or unstructured (e.g., text, images).
-   **Observations**: Individual data points or instances in the dataset. In a tabular dataset, observations are the rows that represent different examples.
-   **Features**: Individual measurable properties or characteristics of the data. In a tabular dataset, features are the columns that represent different attributes.
-   **Labels**: The target variable or outcome that the model is trying to predict. This is usually one of the columns in a tabular dataset.
-   **Model**: A mathematical representation of the relationship between features and labels. The model is trained on the data to learn these relationships.
-   **Training**: The process of teaching a machine learning model using a dataset. During training, the model learns to identify patterns in the data.
-   **Testing**: Evaluating the performance of a trained model on a separate dataset to assess its accuracy and generalization capabilities.
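
The sketch below maps each of these terms onto a few lines of `scikit-learn` code. The choice of logistic regression, the 25% test split, and the variable names are assumptions made for illustration; any classifier would show the same workflow.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data      # features: one row per observation, one column per feature
y = iris.target    # labels: one species code per observation

# Hold out a quarter of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)     # the model
model.fit(X_train, y_train)                   # training
test_accuracy = model.score(X_test, y_test)   # testing on unseen observations

print("Observations x features:", X.shape)
print("Test accuracy:", round(test_accuracy, 3))
```

Scoring the model on observations it did not see during training is what gives an estimate of how well it generalizes.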

## Types of Machine Learning

Machine learning can be broadly categorized into three main types based on the nature of the learning task:

-   **Supervised Learning**: The model is trained on a labeled dataset, where each observation has a corresponding label. The goal is to learn a mapping from features to labels. Examples include **classification** and **regression** tasks.
-   **Unsupervised Learning**: The model is trained on an unlabeled dataset, where the goal is to find patterns or groupings in the data. Examples include **clustering** and **dimensionality reduction** tasks.
-   **Reinforcement Learning**: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The goal is to learn a policy that maximizes cumulative rewards over time.
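
As a rough illustration of the first two types, the sketch below fits a classifier (supervised) and a clustering algorithm (unsupervised) on the same features. The specific estimators, `KNeighborsClassifier` and `KMeans`, and their settings are assumptions chosen for brevity. Reinforcement learning is not shown, since it needs an environment to interact with rather than a fixed dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the labels y are given to the model during fitting.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)
predicted_species = classifier.predict(X[:3])

# Unsupervised: only X is given; the algorithm proposes its own groups.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print("Predicted species codes:", predicted_species)
print("Cluster ids of the same rows:", cluster_ids[:3])
```

Note that cluster ids are arbitrary integers: cluster `1` need not correspond to species code `1`, because the clustering algorithm never sees the labels.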

We will explore these concepts in more detail shortly. For now, the important point is that machine learning relies heavily on data and on the relationships between features and labels.

We will also introduce more advanced concepts in later sessions, as needed.

## Practical Demonstration

The `scikit-learn` library, one of the most popular machine learning libraries in Python, provides a wide range of tools for building and evaluating machine learning models. It includes datasets, preprocessing utilities, and various algorithms for classification, regression, clustering, and more.

We will start by exploring some of the built-in datasets in `scikit-learn`, which are useful for learning and practicing machine learning concepts. These datasets are often used as benchmarks for testing algorithms and understanding their behavior.

The first dataset we will explore is the **Iris dataset**, which is a classic dataset in machine learning. It contains measurements of iris flowers and their corresponding species labels. The dataset has four features (sepal length, sepal width, petal length, and petal width) and three classes (species of iris).

Import the necessary libraries and load the Iris dataset.

In [1]:
from sklearn.datasets import load_iris
data = load_iris()

`load_iris()` returns a `Bunch` object, which is essentially a dictionary whose keys can also be accessed as attributes:

In [2]:
print(type(data))

<class 'sklearn.utils._bunch.Bunch'>
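
A short sketch of what "dictionary-like" means in practice: a `Bunch` is a subclass of `dict`, and each key is also available as an attribute. The variable name `same_object` is only for illustration.

```python
from sklearn.datasets import load_iris

data = load_iris()

# Key access and attribute access return the very same object.
same_object = data['feature_names'] is data.feature_names

print(isinstance(data, dict))
print(same_object)
print(data.data.shape, data.target.shape)
```

Both access styles appear in `scikit-learn` examples; this section uses the key style, e.g. `data['DESCR']`.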


Let's print the keys of this dictionary:

In [3]:
print(data.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


Print the description of the dataset to understand its contents.

In [4]:
print(data['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

Load the data and target variables into a Pandas DataFrame for easier manipulation and visualization.

In [5]:
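import pandas as pd

# Build a DataFrame with one column per feature
df = pd.DataFrame(data=data['data'],
                  columns=data['feature_names'])
# Map the integer class codes (0, 1, 2) to the species names
df['species'] = data['target_names'][data['target']]
df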

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
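
As an alternative, `load_iris` accepts `as_frame=True`, in which case the returned `Bunch` carries a ready-made DataFrame under the `frame` key. The sketch below rebuilds an equivalent table that way; the name `df_alt` is an assumption used to avoid overwriting `df`.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus an integer 'target' column.
df_alt = iris.frame.copy()

# Replace the integer codes with species names, as in the cell above.
df_alt['species'] = iris.target_names[df_alt['target'].to_numpy()]
df_alt = df_alt.drop(columns='target')

print(df_alt['species'].value_counts())
```

This is the same option that the California housing exercise below uses.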


Finally, let's save the DataFrame to a CSV file for later use.

In [6]:
df.to_csv('../../data/iris.csv', index=False)
print("Iris dataset saved to 'iris.csv'")

Iris dataset saved to 'iris.csv'
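
To check that a saved file can be used later, it can be read back with `pd.read_csv`. The sketch below is self-contained and writes to a temporary directory rather than to `../../data/`, which is an assumption made so that it runs regardless of the folder layout.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'iris.csv'
    df.to_csv(csv_path, index=False)     # index=False: do not write the row index
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
print(list(reloaded.columns) == list(df.columns))
```

Without `index=False`, the row index would be written as an extra unnamed first column and would reappear as `Unnamed: 0` when the file is read back.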


## Exercises

Explore the California housing dataset:

-   Load the California housing dataset from `scikit-learn` and transform it into a `pandas.DataFrame`

In [7]:
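from sklearn.datasets import fetch_california_housing

# The first call downloads the dataset and caches it locally
data = fetch_california_housing(as_frame=True)
# With as_frame=True, data.frame holds the features and the target together
df = data.frame
df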

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


-   Print the first few rows of the dataset, the names of the features and the target variable, and the number of observations and features in the dataset.

In [8]:
# Print the first few rows of the dataset
print(df.head())

# Print the names of the features and the target variable
print("Features:", df.columns[:-1].to_list())
print("Target variable:", df.columns[-1])

# Print the number of observations and features (excluding the target variable) in the dataset
print("Number of observations:", df.shape[0])
print("Number of features:", df.shape[1] - 1)

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422  
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Target variable: MedHouseVal
Number of observations: 20640
Number of features: 8


-   Save the DataFrame to a CSV file for later use.
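
One possible sketch for this last step is shown below. To keep it runnable without downloading the dataset, it uses a two-row stand-in built from the first rows printed above; in the notebook, `df` from the previous cells would be used instead. The helper `save_csv`, the file name `california_housing.csv`, and the temporary output folder are all hypothetical choices.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(frame, path):
    """Hypothetical helper: write `frame` to `path`, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Stand-in for the full DataFrame: the first two rows printed above.
df_housing = pd.DataFrame({
    'MedInc': [8.3252, 8.3014],
    'HouseAge': [41.0, 21.0],
    'AveRooms': [6.984127, 6.238137],
    'AveBedrms': [1.023810, 0.971880],
    'Population': [322.0, 2401.0],
    'AveOccup': [2.555556, 2.109842],
    'Latitude': [37.88, 37.86],
    'Longitude': [-122.23, -122.22],
    'MedHouseVal': [4.526, 3.585],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_csv(df_housing, Path(tmp_dir) / 'data' / 'california_housing.csv')
    file_exists = saved_path.exists()

print("Saved:", file_exists)
```

Creating the parent folder first avoids an error from `to_csv` when the target directory does not exist yet.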