## Data for Supervised Learning
Supervised learning is all about learning to make predictions: given an input $x$ (e.g. home square footage), can we produce an output $\hat{y}$ (e.g. estimated value) that is as close as possible to the actual observed output $y$ (e.g. sale price)? Note that the "hat" above $y$ denotes an estimated or predicted value.

Let's start by generating some artificial data. We'll create a vector of inputs, $X$, and a corresponding vector of target outputs, $Y$. In general, we'll refer to individual examples with a lowercase letter ($x$), and to a vector or matrix containing multiple examples with a capital letter ($X$).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Part 1: Generating a Known Function (15 points)

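Implement the variable `Y` as:

$Y = mX + b + \text{noise}$

Here the noise term is the random perturbation stored in `random_variance` in the code cell that follows the formula.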
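
As a rough illustration of the formula above, the sketch below builds a small noisy line with NumPy. The helper name `make_noisy_line`, its `seed` argument, and the use of `np.random.default_rng` are assumptions made for this example; they are not part of the assignment's `create_1d_data` function.

```python
import numpy as np

def make_noisy_line(num_examples=5, m=2.0, b=1.0, noise_scale=1.0, seed=0):
    """Hypothetical helper: points on y = m*x + b plus uniform noise."""
    rng = np.random.default_rng(seed)  # local generator, reproducible via seed
    x = np.arange(num_examples)
    # Uniform noise drawn from [-noise_scale, noise_scale), one value per example.
    noise = rng.uniform(low=-noise_scale, high=noise_scale, size=x.shape)
    y = m * x + b + noise
    return x, y

x_demo, y_demo = make_noisy_line()
# Difference between each noisy point and the noiseless line 2x + 1.
residuals = y_demo - (2.0 * x_demo + 1.0)
print(residuals)
```

Because the noise is bounded by `noise_scale`, every residual printed here stays within that distance of the underlying line.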

In [None]:
def create_1d_data(num_examples=10, m=2, b=1, random_scale=1):
    """Create X, Y data with a linear relationship with added noise.

    Args:
    num_examples: number of examples to generate
    m: desired slope
    b: desired intercept
    random_scale: add uniform noise between -random_scale and +random_scale

    Returns:
    X and Y, each with shape (num_examples,)
    """
    X = np.arange(num_examples)
    np.random.seed(4)  # consistent random number generation
    random_variance = np.random.uniform(low=-random_scale, high=random_scale, size=X.shape)
    # TODO: build Y = m * X + b + random_variance

    return X, Y

In [None]:
# Create some artificial data using create_1d_data.
X, Y = create_1d_data()
plt.scatter(X, Y)
plt.show()

Explain why the graph does not exactly represent $2X+1$.

*Written answer:*



---
### Part 2: Models for Data (15 points)

A model is a function that takes an input $x$ and produces a prediction $\hat{y}$.
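
To make the definition concrete, here is a minimal sketch of a model written as a plain Python function. The name `linear_model` and the slope and intercept values are hypothetical, chosen only for illustration. The same function handles a single example $x$ or a whole vector $X$, because NumPy applies the arithmetic elementwise.

```python
import numpy as np

def linear_model(x, slope, intercept):
    """Hypothetical model: predict y_hat = slope * x + intercept."""
    return slope * np.asarray(x) + intercept

x_single = 3                        # one example
X_batch = np.array([0, 1, 2, 3])    # several examples at once

y_hat_single = linear_model(x_single, slope=3, intercept=-1)
Y_hat_batch = linear_model(X_batch, slope=3, intercept=-1)

print(y_hat_single)
print(Y_hat_batch)   # same shape as X_batch
```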

Let's consider two possible models for this data:
1. $M_1(x) = x+5$ 
2. $M_2(x) = 2x+1$

Compute the predictions of models $M_1$ and $M_2$ for the values in $X$. These predictions should be vectors of the same shape as $Y$. Then plot the prediction lines of these two models overlaid on the "observed" data $(X, Y)$. Use [plt.plot()](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html) to draw the lines. Note: you should generate only one plot. Make sure to include axis labels, a title, and a legend.

In [None]:
# YOUR CODE HERE



In [None]:
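# Assumes M1 and M2 were computed in the previous cell.
plt.scatter(X, Y, label="Observed Data")

# Pass X explicitly so the lines are drawn at the same x positions as the data.
plt.plot(X, M1, color='r', label="M1 Model Preds")
plt.plot(X, M2, color='g', label="M2 Model Preds")

# Labels are attached to each artist above, so the legend cannot mismatch them.
plt.legend(loc="lower right")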

font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}

plt.title("Comparison of M1 and M2 Models", fontdict = font1)
plt.xlabel("X Input Values", fontdict = font2)
plt.ylabel("Y Output Values", fontdict = font2)

plt.show()

Explain which model appears to perform better. How do you know?
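
A visual judgment can be backed up with a number. The sketch below uses mean squared error, a metric this section has not introduced, so treat it as one possible choice and not the required one. The helper `mean_squared_error` is written here by hand, and the arrays `x_toy` and `y_toy` are made-up values, not the data generated earlier.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up observations for illustration only.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.2, 2.9, 5.1, 6.8])

pred_a = x_toy + 5        # same form as M_1
pred_b = 2 * x_toy + 1    # same form as M_2

mse_a = mean_squared_error(y_toy, pred_a)
mse_b = mean_squared_error(y_toy, pred_b)
print(f"MSE a: {mse_a:.3f}, MSE b: {mse_b:.3f}")
```

A lower value means the predictions sit closer to the observations on average.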

*Written answer:*



---
### Part 3: EDA with Pandas (70 points)

Import sklearn and pandas, then read the iris dataset into a pandas DataFrame: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
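
One possible way to do this is sketched below, assuming scikit-learn 0.23 or newer, where `load_iris` accepts `as_frame=True`. The variable names `iris` and `iris_df` and the added `species` column are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects.
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer "target" column.
iris_df = iris.frame.copy()

# Add a readable label column by mapping integer codes to class names.
iris_df["species"] = pd.Categorical.from_codes(
    iris_df["target"], categories=iris.target_names
)

print(iris_df.shape)
print(iris_df.head())
```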

In [None]:
# YOUR CODE HERE


What is the iris dataset? What are we trying to predict? What type of machine learning problem is this?

*Written answer:*



Perform EDA (Exploratory Data Analysis) on the iris dataset. Print the distribution of each feature (variable). Is there any missing data? Print the label classes, counts, and percentages of the target variable. Note any interesting findings. How might we measure success on this problem (i.e., for a model that predicts the target labels)?
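
The checks described above map onto a few standard pandas calls. The sketch below runs them on a small made-up table; `toy_df` and its column names are hypothetical stand-ins, so the same calls would need to be pointed at the real DataFrame and target column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for a real dataset.
toy_df = pd.DataFrame({
    "length_cm": [5.1, 4.9, np.nan, 6.3, 5.8, 7.1],
    "width_cm": [3.5, 3.0, 3.2, 3.3, 2.7, 3.0],
    "label": ["a", "a", "a", "b", "b", "c"],
})

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
print(toy_df.describe())

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Class counts and percentages for the target column.
label_counts = toy_df["label"].value_counts()
label_percent = toy_df["label"].value_counts(normalize=True) * 100
print(pd.DataFrame({"count": label_counts, "percent": label_percent.round(1)}))
```

Class percentages matter when choosing a success metric: plain accuracy is easier to interpret when the classes are roughly balanced.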

In [None]:
# YOUR CODE HERE


*Written answer:*



In [None]:
# YOUR CODE HERE


---

## Review

* In **Supervised Machine Learning**, we start with data in the form $(X, Y)$, where $X$ contains the inputs and $Y$ contains the output labels.
* A **model** is a function that maps an input $x$ to an output. The model's output is referred to as a **prediction**, denoted by $\hat{y}$, and should be as close as possible to the observed $y$.