<a href="https://colab.research.google.com/github/anyuanay/INFO213/blob/main/INFO213_Week1_GivingComputerAbilityToLearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 213: Data Science Programming 2


## Week 1: Giving Computers the Ability to Learn from Data



**Overview**
- [Learn the fundamentals for using AI tools effectively](#Learn-the-Fundamentals-before-Using-AI)
- [Building intelligent machines to transform data into knowledge](#Building-intelligent-machines-to-transform-data-into-knowledge)
- [Different types of machine learning](#The-three-different-types-of-machine-learning)
    - [Classification for predicting class labels](#Classification-for-predicting-class-labels)
    - [Regression for predicting continuous outcomes](#Regression-for-predicting-continuous-outcomes)
    - [Solving interactive problems with reinforcement learning](#Solving-interactive-problems-with-reinforcement-learning)
    - [Discovering hidden structures with unsupervised learning](#Discovering-hidden-structures-with-unsupervised-learning)
- [An introduction to the basic terminology and notations](#basic-terminology)
    - [Feature Matrix](#feature-matrix)
- [A roadmap for building machine learning systems](#A-roadmap-for-building-machine-learning-systems)
- [Introduction to Scikit-Learn](#Introduction-to-Scikit-Learn)
- [Supervised learning example: Simple linear regression](#supervised-example)

## Learn the Fundamentals before Using AI

- Real-world human-AI synergy in data science workflow


![](https://github.com/anyuanay/INFO213/blob/main/human-AI-synergy-data-science.png?raw=true)

- **You must know the fundamentals in order to effectively use A.I. tools.**

- My suggestion for learning the fundamentals in Colab: turn off the built-in AI assistance.

![](https://github.com/anyuanay/INFO213/blob/main/disable-AI-colab.png?raw=true)


## Different Types of Machine Learning

### Classification for predicting class labels
- Given a set of training data with labels, A and B.
- Each data point is defined by two features $x_1$ and $x_2$.
- Learn a model that separates the data points with different labels.
- Use the model to predict the label of a new data point.

<img src="https://github.com/rasbt/machine-learning-book/raw/main/ch01/figures/01_03.png" width="500px">
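The four steps above can be sketched in a few lines. This is a minimal illustration, not the course's reference solution: the synthetic data, the class centers, and the choice of `LogisticRegression` as the classifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic training data (an assumption for illustration):
# two features per point, class "A" around (-2, -2), class "B" around (2, 2).
X_a = rng.randn(50, 2) + [-2, -2]
X_b = rng.randn(50, 2) + [2, 2]
X_train = np.vstack([X_a, X_b])
y_train = np.array(["A"] * 50 + ["B"] * 50)

# Learn a model that separates the two labels.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Use the model to predict the label of a new data point.
new_point = np.array([[3.0, 2.5]])
predicted_label = clf.predict(new_point)[0]
print(predicted_label)
```

Any other classifier with `fit()` and `predict()` methods could be swapped in without changing the rest of the sketch.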

### Regression for predicting continuous outcomes
- Given a set of data points $(x_i, y_i)$.
- Each $x_i$ is associated with a real value $y_i$.
- Learn a model between $x_i$ and $y_i$.
- Predict the $y$ value of a new value $x$.

<img src="https://github.com/rasbt/machine-learning-book/raw/main/ch01/figures/01_04.png" width="500px">
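As a small worked example of these steps, the sketch below fits a straight line with the closed-form least-squares formulas and then predicts $y$ for a new $x$. The five data points are made up for illustration, and a straight line is only one possible choice of model.

```python
import numpy as np

# Toy data (an assumption for illustration): y is roughly 2 * x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for the line y = slope * x + intercept.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Predict the y value of a new x.
x_new = 6.0
y_new = slope * x_new + intercept
print(slope, intercept, y_new)
```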

### Solving interactive problems with reinforcement learning
- Given an environment with states $s_t$ and actions $a_t$.
- Each action $a_t$ taken in state $s_t$ results in a reward $r_t$ and transitions the agent to a new state $s_{t+1}$.
- Learn a policy $\pi(a \mid s)$ that maps states to actions to maximize cumulative rewards over time.
- Predict the optimal action $a$ for a new state $s$ to achieve long-term success.

<img src="https://github.com/rasbt/machine-learning-book/raw/main/ch01/figures/01_05.png" width="500px">
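To make the state, action, reward loop concrete, here is a minimal sketch using tabular Q-learning, which is one of many reinforcement learning algorithms and is chosen here only for brevity. The environment is hypothetical: a corridor of five states where the agent starts at state 0 and receives a reward of 1 on reaching state 4. The names `step`, `alpha`, `gamma`, and `epsilon` are illustrative.

```python
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
goal = n_states - 1                 # reaching the last state ends an episode
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.RandomState(0)
Q = np.zeros((n_states, n_actions))  # estimated value of each (state, action)

def step(state, action):
    """Hypothetical environment: returns (next_state, reward)."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, goal)
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    while state != goal:
        if rng.rand() < epsilon:
            # Explore: pick a random action.
            action = rng.randint(n_actions)
        else:
            # Exploit: pick a best-known action, breaking ties at random.
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = rng.choice(best)
        next_state, reward = step(state, action)
        # Q-learning update toward reward + discounted best future value.
        target = reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

# Greedy policy: the highest-valued action in each state.
policy = Q.argmax(axis=1)
print(policy[:goal])
```

The learned greedy policy plays the role of $\pi(a \mid s)$ in the list above; here it is deterministic, choosing one action per state.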

### Finding subgroups with clustering
- Given a set of data points without labeled outcomes.
- Each data point is represented by a feature vector $(x_1, x_2)$.
- Learn a model to group the data into $k$ clusters based on similarity.
- Predict the cluster assignment for a new data point $(\hat{x}_1, \hat{x}_2)$.

<img src="https://github.com/rasbt/machine-learning-book/raw/main/ch01/figures/01_06.png" width="500px">
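The clustering steps above might look like the following sketch. It assumes k-means as the clustering algorithm (via Scikit-Learn's `KMeans`) and uses two made-up, well-separated blobs of points; neither choice comes from the figure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Unlabeled synthetic data (an assumption): two blobs, two features per point.
blob_1 = rng.randn(50, 2) + [0, 0]
blob_2 = rng.randn(50, 2) + [8, 8]
X = np.vstack([blob_1, blob_2])

# Group the data into k = 2 clusters based on distance to cluster centers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

# Predict the cluster assignment for a new data point.
new_point = np.array([[7.5, 8.5]])
new_cluster = kmeans.predict(new_point)[0]
print(new_cluster, kmeans.labels_[:5], kmeans.labels_[-5:])
```

The cluster indices (0 or 1) are arbitrary; only the grouping matters, since no labels were given.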

## An introduction to the basic terminology and notations
- Training example: A row in a table representing the dataset and synonymous with an observation, record, instance, or sample (in most contexts, sample refers to a collection of training examples).
- Feature, abbrev. $X$: A column in a data table or data (design) matrix. Synonymous with predictor, variable, input, attribute, or covariate.
- Target, abbrev. $y$: Synonymous with outcome, output, response variable, dependent variable, (class) label, and ground truth.
- Training: Model fitting; for parametric models, this is similar to parameter estimation.

<img src="https://github.com/rasbt/machine-learning-book/raw/main/ch01/figures/01_08.png" width="600px">

### Feature Matrix
- The training data can be thought of as a two-dimensional numerical array or matrix, which we will call the features matrix. By convention, this features matrix is often stored in a variable named $X$.
- The features matrix is assumed to be two-dimensional, with shape `[n_samples, n_features]`, and is most often contained in a NumPy array or a Pandas DataFrame.

- In addition to the feature matrix $X$, we also generally work with a label or target array, which by convention we will usually call $y$. The target array is usually one dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas Series.

![](https://i.imgur.com/UEtv9Dv.png)
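A quick way to check these shape conventions is to build a small table and split it into $X$ and $y$. The column names and values below are hypothetical and serve only to illustrate the layout.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 4 samples, 2 feature columns, 1 target column.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 65.0, 80.0],
    "label": ["A", "A", "B", "B"],
})

X = df[["height_cm", "weight_kg"]]  # DataFrame: shape [n_samples, n_features]
y = df["label"]                     # Series: length n_samples

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)

# The same data as NumPy arrays.
X_array = X.to_numpy()
y_array = y.to_numpy()
print(X_array.ndim, y_array.ndim)  # 2 1
```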

## A roadmap for building machine learning systems

<img src="https://github.com/rasbt/machine-learning-book/raw/main/ch01/figures/01_09.png" width="700px">

# Retrieval Practice

## Introducing Scikit-Learn

There are several Python libraries which provide solid implementations of a range of machine learning algorithms.
One of the best known is [Scikit-Learn](http://scikit-learn.org), a package that provides efficient versions of a large number of common algorithms.

### Basics of the API

Most commonly, the steps in using the Scikit-Learn estimator API are as follows.

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the ``fit()`` method of the model instance.
5. Apply the model to new data:
   - For supervised learning, often we predict labels for unknown data using the ``predict()`` method.
   - For unsupervised learning, we often transform or infer properties of the data using the ``transform()`` or ``predict()`` method.
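The same five steps apply to unsupervised estimators. As a hedged sketch, the example below uses `PCA` to reduce random three-feature data to two components with `transform()`; the data and the choice of `PCA` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA      # 1. choose a class of model

pca = PCA(n_components=2)                  # 2. choose hyperparameters

rng = np.random.RandomState(1)
X = rng.randn(100, 3)                      # 3. features matrix (no target needed)

pca.fit(X)                                 # 4. fit the model to the data

X_2d = pca.transform(X)                    # 5. apply the model: transform the data
print(X.shape, X_2d.shape)
```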

## Supervised learning example: Simple linear regression

As an example of this process, let's consider a simple linear regression—that is, the common case of fitting a line to $(x, y)$ data.
We will use the following simple data for our regression example:

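```
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the synthetic data is reproducible
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)            # 50 x values drawn uniformly from [0, 10)
y = 2 * x - 1 + rng.randn(50)    # line with slope 2, intercept -1, plus Gaussian noise
plt.scatter(x, y);
```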

With this data in place, we can use the recipe outlined earlier. Let's walk through the process:

### 1. Choose a class of model

In Scikit-Learn, every class of model is represented by a Python class.
So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class:

```
from sklearn.linear_model import LinearRegression
```

### 2. Choose model hyperparameters

An important point is that *a class of model is not the same as an instance of a model*.

Once we have decided on our model class, there are still some options open to us.
Depending on the model class we are working with, we might need to answer one or more questions like the following:

- Would we like to fit for the offset (i.e., *y*-intercept)?
- Would we like the model to be normalized?
- Would we like to preprocess our features to add model flexibility?
- What degree of regularization would we like to use in our model?
- How many model components would we like to use?

These are examples of the important choices that must be made *once the model class is selected*.
These choices are often represented as *hyperparameters*, or parameters that must be set before the model is fit to data.
In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation.


For our linear regression example, we can instantiate the ``LinearRegression`` class and specify that we would like to fit the intercept using the ``fit_intercept`` hyperparameter:

```
model = LinearRegression(fit_intercept=True)
model
```
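One way to see the difference between hyperparameters and learned parameters is to inspect the instance before fitting. The sketch below uses the estimator's `get_params()` method; at this point the model has hyperparameters but has not yet learned anything from data.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

# Hyperparameters are available immediately after instantiation.
params = model.get_params()
print(params["fit_intercept"])

# Learned parameters such as coef_ do not exist until fit() is called.
print(hasattr(model, "coef_"))
```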

### 3. Arrange data into a features matrix and target vector

Here our target variable ``y`` is already in the correct form (a length-``n_samples`` array), but we need to massage the data ``x`` to make it a matrix of size ``[n_samples, n_features]``.
In this case, this amounts to a simple reshaping of the one-dimensional array:

```
X = x[:, np.newaxis]
X.shape
```
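An equivalent way to perform this reshaping is `reshape(-1, 1)`, where `-1` asks NumPy to infer the number of rows. The short array below is a stand-in used only to show that both forms produce the same column-shaped result.

```python
import numpy as np

x_small = np.arange(5.0)               # one-dimensional, shape (5,)

X_newaxis = x_small[:, np.newaxis]     # add a second axis of length 1
X_reshape = x_small.reshape(-1, 1)     # same result via reshape

print(x_small.shape, X_newaxis.shape, X_reshape.shape)
print(np.array_equal(X_newaxis, X_reshape))
```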

### 4. Fit the model to your data

Now it is time to apply our model to data.
This can be done with the ``fit()`` method of the model:

```
model.fit(X, y)
```

This ``fit()`` command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore.
In Scikit-Learn, by convention all model parameters that were learned during the ``fit()`` process have trailing underscores; for example in this linear model, we have the following:

```
model.coef_
```

```
model.intercept_
```

These two parameters represent the slope and intercept of the simple linear fit to the data.
Comparing to the data definition, we see that they are very close to the input slope of 2 and intercept of -1.
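As a sanity check, the fitted values can be compared against an independent least-squares fit. The sketch below regenerates the same data and uses `np.polyfit` with degree 1, which is an extra cross-check added here and not part of the Scikit-Learn workflow itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as above.
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

# Fit with Scikit-Learn.
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

# Fit the same line with NumPy; polyfit returns the highest-degree coefficient first.
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(model.coef_[0], slope_np)
print(model.intercept_, intercept_np)
```

Both approaches solve the same ordinary least-squares problem, so the two slopes and the two intercepts should agree up to floating-point error.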

### 5. Predict labels for unknown data

Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set.
In Scikit-Learn, this can be done using the ``predict()`` method.
For the sake of this example, our "new data" will be a grid of *x* values, and we will ask what *y* values the model predicts:

```
xfit = np.linspace(-1, 11)
```

As before, we need to coerce these *x* values into a ``[n_samples, n_features]`` features matrix, after which we can feed it to the model:

```
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
```

Finally, let's visualize the results by plotting first the raw data, and then this model fit:

```
plt.scatter(x, y)
plt.plot(xfit, yfit);
```
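The plot gives a visual check. To put a number on how well the model handles data that was not part of the training set, one common approach is to hold out part of the data before fitting. The sketch below assumes an 80/20 split via `train_test_split` and uses the estimator's `score()` method, which returns $R^2$ for regressors; the split size and seed are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic data as above.
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
X = x[:, np.newaxis]

# Hold out 20% of the samples; the model never sees them during fit().
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)

# R^2 on the held-out data: 1.0 is a perfect fit.
r2 = model.score(X_test, y_test)
print(r2)
```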

# Retrieval Practice