<a href="https://colab.research.google.com/github/acmucsd-projects/AI-Tutorial-Resources/blob/main/1%20%7C%20Introduction_to_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Machine Learning**

Contributors: Nolan, Katie

Welcome to the Intro to Machine Learning notebook! This introductory notebook is geared towards those unfamiliar with machine learning or in need of a quick refresher. We'll briefly go over various AI and machine learning concepts, then check out examples of the most basic machine learning models, starting with regression. If you are new to Python or data science, we recommend that you go through this notebook in its entirety, since it will set you up well for everything else.

#### **What is Machine Learning?**
Machine learning, generally speaking, is the idea of **estimating a solution by applying an algorithm to data without explicit instruction**. A question you may have, then, is *What situations strictly require machine learning solutions rather than a well-designed algorithm?*

> Tasks such as reading MRI scans, face recognition, voice recognition, speech recognition, and generative AI are examples of these cases. You could argue that a hand-written algorithm could handle these, but in the case of face recognition (as an example), there are simply too many edge cases: such an algorithm may never exist, or it may demand a gargantuan effort compared to simply training a machine learning model to automate the task.



In general, we can break the machine learning pipeline down into:
```
Data -> Algorithm -> Solution
```
wherein we apply some kind of algorithm to our data to learn some kind of solution. These solutions broadly fall under three categories:
* **Predictions (Supervised Learning)**
  * Classification: Prediction of categories - Is an image a dog or a duck?
  * Regression: Prediction of values - Price of a house given its dimensions?
* **Patterns (Unsupervised Learning)**
  * Discovering structure or underlying patterns in a collection of data (e.g., with [K-Means](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/); the link is an interactive demo)
  * Discovering diseases, groups of customers in market data
* **Decisions (Reinforcement Learning)**
  * Given the state of the world, what is your next action? What rewards or punishments will be incurred?
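
To make the `Data -> Algorithm -> Solution` pipeline concrete, here is a minimal sketch of the regression case in plain Python. The floor areas and prices are made-up numbers for illustration, and `fit_line` and `predict` are hypothetical helper names; ordinary least squares is just one possible choice of algorithm.

```python
# Data: hypothetical (floor area in m^2, price in $1000s) pairs, made up for illustration.
areas = [50.0, 60.0, 80.0, 100.0, 120.0]
prices = [150.0, 180.0, 240.0, 300.0, 360.0]


def fit_line(xs, ys):
    """Algorithm: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Solution: the learned parameters, which we can reuse on inputs we have not seen.
slope, intercept = fit_line(areas, prices)


def predict(area):
    """Predict a price (in $1000s) from a floor area using the fitted line."""
    return slope * area + intercept


print(slope, intercept)
print(predict(90.0))
```

Nothing in the code states the price-per-area rule explicitly; the slope and intercept are estimated from the data, which is the sense in which the solution is learned rather than hand-written.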

## **Seaborn, Pandas, and Data Exploration**

Two of the most fundamental Python libraries for data science are **pandas** and **seaborn**, toolkits for data manipulation and data visualization, respectively.

In [None]:
import pandas as pd

As an example, we'll be exploring the **Iris Dataset**, one of the oldest and most classic benchmarking datasets from [The UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/53/iris), often used for classification tasks. It consists of measurements of iris flowers, where each observation includes four features:
1. **Sepal Length**
2. **Sepal Width**
3. **Petal Length**
4. **Petal Width**

<center>
<img src="https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg" alt="Iris diagram">
</center>

The dataset also contains a column called **species**, which labels each flower as one of three species:
- **Iris setosa**
- **Iris versicolor**
- **Iris virginica**

We can load the dataset using seaborn (the library provides built-in example datasets) and explore it with pandas.

In [None]:
import pandas as pd
import seaborn as sns
iris_df = sns.load_dataset('iris')

`sns.load_dataset` automatically loads in the dataset as a **Pandas DataFrame**, an object that allows us to view and manipulate our data. Data is commonly stored in `.csv` (comma-separated value) files.
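
To get a feel for what a DataFrame is, here is a minimal sketch that builds a tiny one by hand. The values are made up for illustration, and `toy_df` is a hypothetical name; the column names are only chosen to resemble the iris columns.

```python
import pandas as pd

# Hypothetical measurements (in cm), made up for illustration.
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names
print(toy_df.dtypes)         # data type of each column

# Selecting a single column returns a pandas Series.
petal_lengths = toy_df["petal_length"]

# Boolean indexing keeps only the rows where the condition holds.
long_petals = toy_df[toy_df["petal_length"] > 4.0]
print(long_petals)
```

The same attributes and indexing should work on `iris_df`, since it is also a DataFrame.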

If we wanted to load a `filename.csv` file as a dataset instead, we'd call the `read_csv()` function from Pandas to load it as a DataFrame:
```
import pandas as pd
df = pd.read_csv('filename.csv')
```
There are a lot more parameters you can try in the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

We'd recommend experimenting with `skiprows` and `chunksize` to see how they work (they're some of the more commonly used parameters besides the file path).
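
As a starting point, here is a minimal sketch of both parameters. It assumes a made-up CSV whose first line is a note rather than data, and uses `io.StringIO` to stand in for a file on disk so the example runs without any files; `csv_text` and its contents are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the first line is a note, not part of the table.
csv_text = """exported from a made-up tool
sepal_length,species
5.1,setosa
7.0,versicolor
6.3,virginica
5.8,virginica
"""

# skiprows=1 skips the first line, so the header is read from the second line.
df = pd.read_csv(io.StringIO(csv_text), skiprows=1)
print(df)

# chunksize=2 returns an iterator of DataFrames with up to 2 rows each,
# which helps when a file is too large to load into memory at once.
chunks = list(pd.read_csv(io.StringIO(csv_text), skiprows=1, chunksize=2))
for chunk in chunks:
    print(len(chunk))
```

With a real file, you would pass the file path in place of `io.StringIO(csv_text)`.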

In [None]:
iris_df.head()

We can call either `.head()` or `.tail()` on our DataFrame to show the first or last few rows of our dataset, respectively. Both take an optional integer for the number of rows to display (the default is 5). We can also see a general summary of our dataset using `.describe()`. Try it in the cell below.

In [None]:
# Your code here

#### Changelog
<details><summary>click to reveal!</summary>

* 5/1/2025 - Creation [@NolanChai](http://github.com/NolanChai)