# Understanding Machine Learning Concepts

In this lesson, learners will explore fundamental machine learning concepts relevant to SageMaker, including supervised and unsupervised learning.

## Learning Objectives
- Define key machine learning terms
- Explain the differences between supervised and unsupervised learning
- Identify common algorithms used in machine learning

## Why This Matters

Building and deploying a model in Amazon SageMaker starts with knowing what kind of learning problem you have. Supervised and unsupervised learning are the two foundational approaches, and which one applies determines what data you need and which algorithms are candidates.

## Supervised Learning

Supervised learning is a type of machine learning where the model is trained on labeled data. This means that the input data is paired with the correct output, allowing the model to learn the relationship between the two.

### Why It Matters
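Supervised learning covers two main task types: classification, which predicts a discrete category (for example, spam or not spam), and regression, which predicts a continuous value (for example, a house price).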

```python
# Example of Supervised Learning
# Using a linear regression model to predict house prices
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X_train = np.array([[1500], [2000], [2500], [3000]])  # Size in square feet
y_train = np.array([300000, 400000, 500000, 600000])  # Prices

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting a new value
predicted_price = model.predict(np.array([[2200]]))
print(f'Predicted price for 2200 sq ft: ${predicted_price[0]:,.2f}')
```
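
The example above is a regression task. For comparison, here is a minimal sketch of the other supervised task type, classification, using a decision tree. The category labels (`"affordable"`, `"premium"`) and the variable names are made up for illustration and are not part of the original lesson data.

```python
# Sketch: classification with a decision tree
# The labels below are hypothetical and only illustrate a categorical target
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Same feature as the regression example: size in square feet
X_houses = np.array([[1500], [2000], [2500], [3000]])
# Categorical labels instead of numeric prices
y_category = np.array(["affordable", "affordable", "premium", "premium"])

# Create and train the classifier on the labeled examples
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_houses, y_category)

# Predict the category of an unseen house
predicted_label = classifier.predict(np.array([[2200]]))
print(f'Predicted category for 2200 sq ft: {predicted_label[0]}')
```

The workflow is the same as for regression (fit on labeled pairs, then predict), but the output is a class label rather than a number.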

### Micro-Exercise 1
Define supervised learning and provide an example.

```python
# Supervised Learning
# Definition: Supervised learning is a type of machine learning where the model is trained on labeled data.
# Example: Predicting house prices based on size.
```

```python
# Micro-Exercise 1 Starter Code
# Define supervised learning
supervised_learning_definition = "Supervised learning is a type of machine learning where the model is trained on labeled data."
example = "Predicting house prices based on size."
print(f'Definition: {supervised_learning_definition}')
print(f'Example: {example}')
```

## Unsupervised Learning

Unsupervised learning involves training a model on data without labeled responses. The model tries to learn the underlying structure of the data through techniques like clustering and association.

### Why It Matters
Unsupervised learning applies when labels are unavailable or too expensive to collect. Typical uses include customer segmentation (clustering), finding items that tend to occur together (association), and anomaly detection.

```python
# Example of Unsupervised Learning
# Using KMeans clustering to segment customers
from sklearn.cluster import KMeans
import numpy as np

# Sample customer data
X_customers = np.array([[25, 50000], [30, 60000], [22, 45000], [35, 70000], [28, 52000]])  # Age and Income

# Create and fit the model
# n_init and random_state are set explicitly so the result is reproducible across runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_customers)

# Predicting cluster for new customer
# Cluster numbers are arbitrary IDs, not ranks or labels with meaning
new_customer = np.array([[29, 58000]])
cluster = kmeans.predict(new_customer)
print(f'New customer belongs to cluster: {cluster[0]}')
```
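
K-means groups points by Euclidean distance, so a feature with a large numeric range (income in dollars) can outweigh a feature with a small range (age in years). The sketch below uses only the standard library to show the effect on the same customer data. Min-max scaling is one common remedy and is an assumption here, not something the original example applies; `min_max_scale` is a hypothetical helper written for this illustration.

```python
import math

# Same customers as in the clustering example: (age, income)
customers = [(25, 50000), (30, 60000), (22, 45000), (35, 70000), (28, 52000)]

def min_max_scale(rows):
    """Rescale each column to the [0, 1] range (hypothetical helper)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((value - lo) / (hi - lo) for value, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Raw distance between the first two customers: almost entirely income
raw_distance = math.dist(customers[0], customers[1])

# After scaling, age and income contribute on comparable terms
scaled = min_max_scale(customers)
scaled_distance = math.dist(scaled[0], scaled[1])

print(f'Raw distance: {raw_distance:.2f}')
print(f'Scaled distance: {scaled_distance:.3f}')
```

In the raw data, the 5-year age gap is negligible next to the 10,000-dollar income gap, so the clusters found on unscaled features are effectively income bands.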

### Micro-Exercise 2
Identify which algorithms are used for supervised learning.

```python
# Supervised Learning Algorithms
# 1. Linear Regression
# 2. Decision Trees
```

```python
# Micro-Exercise 2 Starter Code
# List of supervised learning algorithms
supervised_learning_algorithms = ["Linear Regression", "Decision Trees"]
for i, algorithm in enumerate(supervised_learning_algorithms, start=1):
    print(f'{i}. {algorithm}')
```

## Main Exercise
Given a dataset, analyze its characteristics and select an appropriate algorithm for the problem at hand. Justify your choice based on the data type and desired outcome.

```python
# Analyze dataset characteristics
# Select an algorithm based on the analysis
# Justification: 
```

```python
# Main Exercise Starter Code
# Example dataset characteristics
dataset_characteristics = "The dataset contains numerical features and a categorical target variable."
selected_algorithm = "Decision Trees"
# The justification ties the choice to the target type, not only to the feature types
justification = "A categorical target makes this a classification task; decision trees predict class labels and accept numerical features without scaling."
print(f'Dataset Characteristics: {dataset_characteristics}')
print(f'Selected Algorithm: {selected_algorithm}')
print(f'Justification: {justification}')
```
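
The selection step in this exercise can be sketched as a small decision rule over two dataset characteristics: whether labels exist, and whether the target is categorical. The function `suggest_algorithm_family` is a hypothetical helper written for this lesson, and the algorithms it names are only the ones introduced above. Real projects weigh many more factors.

```python
def suggest_algorithm_family(has_labels: bool, target_is_categorical: bool = False) -> str:
    """Hypothetical helper: map two dataset characteristics to a task type."""
    if not has_labels:
        # No target column: look for structure instead of predicting a label
        return "clustering (e.g. KMeans)"
    if target_is_categorical:
        # Discrete target: classification
        return "classification (e.g. Decision Trees)"
    # Continuous target: regression
    return "regression (e.g. Linear Regression)"

print(suggest_algorithm_family(has_labels=True, target_is_categorical=True))
print(suggest_algorithm_family(has_labels=True))
print(suggest_algorithm_family(has_labels=False))
```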

## Common Mistakes
- Confusing supervised and unsupervised learning: if the data has no labels, a supervised algorithm cannot be trained on it
- Assuming any algorithm can be applied to any data without considering the data's characteristics

## Recap
In this lesson, we covered the definitions of supervised and unsupervised learning, why each matters, and common algorithms for each. Next, we will look more closely at Amazon SageMaker and how to apply these concepts in practice.