# Part 1: Machine Learning - Introduction

Welcome to the first part of the Artificial Intelligence and Machine Learning module.
In this section, you'll begin your journey into machine learning and train your first models.

Don’t just follow the subject blindly—strive to understand the tools you're using.
Read the documentation, and be prepared to explain:

- What each step does
- Why it’s important
- The advantages and limitations of each tool


## What is Machine Learning?
Machine learning is a branch of computer science focused on building systems that learn from data.
Rather than following fixed rules like traditional programs, ML models uncover hidden patterns in datasets to make predictions or decisions.

### Research Topics
Before getting started, research machine learning until you can answer at least the following questions:

<details>
<summary>- What is the difference between supervised and unsupervised learning?</summary>

The key difference between supervised and unsupervised learning is whether the algorithm is trained on labeled data.

#### Supervised Learning:
- The data used to train the model includes input-output pairs (i.e., it’s labeled).
- The goal is to learn a mapping from inputs to known outputs.
- Common tasks: classification (e.g., spam detection) and regression (e.g., predicting house prices).
- Example: A dataset of images of animals labeled as "cat", "dog", etc. The model learns to associate image features with labels.

#### Unsupervised Learning:
- The data has no labels; the algorithm tries to find patterns or structure in the data.
- Common tasks: clustering (e.g., customer segmentation) and dimensionality reduction (e.g., PCA).
- Example: Given customer data with no labels, the model might group customers with similar behavior into clusters.
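
As a minimal sketch of the contrast, the snippet below fits one supervised and one unsupervised model to the same points. The data is synthetic and purely illustrative, and the choice of `LogisticRegression` and `KMeans` is an assumption made for the example rather than a requirement of this module.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Supervised: the model is given both the inputs X and the labels y.
clf = LogisticRegression().fit(X, y)
supervised_accuracy = clf.score(X, y)

# Unsupervised: the model is given only X and has to find the groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
```

Note that the cluster ids are arbitrary: KMeans may call the first blob `0` or `1`, because it never sees `y`.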
</details>

<details>
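<summary>- What are the types of supervised learning?</summary>

The two main types are classification (predicting a category) and regression (predicting a number). Some examples: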

#### Email Spam Detection
- Input: Email text and metadata
- Output: "Spam" or "Not Spam"
- Type: Classification

#### Loan Approval Prediction
- Input: Applicant financial data
- Output: "Approve" or "Deny"
- Type: Classification

#### House Price Prediction
- Input: Features like size, location, number of rooms
- Output: Predicted price
- Type: Regression

#### Voice Recognition (e.g., Siri, Alexa)
- Input: Audio waveform
- Output: Transcribed text
- Type: Classification / Sequence prediction
</details>

<details>
<summary>- What’s the difference between classification and regression? What output does each produce?</summary>

Both classification and regression are types of supervised learning, but they differ in the kind of output they produce.

#### Classification:
- Goal: Predict a category or class label.
- Output: Discrete values (e.g., "cat", "dog", "spam", "not spam").
- Examples:
  - Email → "spam" or "not spam"
  - Image → "dog", "cat", or "horse"
  - Loan application → "approve" or "deny"
- Think: "Which bucket does this data point belong to?"

#### Regression:
- Goal: Predict a continuous numerical value.
- Output: Real numbers (e.g., 15.2, 420.75).
- Examples:
  - Predict house price: $350,000
  - Predict temperature: 21.5°C
  - Predict stock price: $101.32
- Think: "What number best fits this input?"
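
The difference in output type can be seen directly in code. This is a small sketch on made-up data; the labels `"small"` and `"large"` and the linear relationship are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10, dtype=float).reshape(-1, 1)

# Classification: the target is a discrete label.
y_class = np.array(["small"] * 5 + ["large"] * 5)
clf = LogisticRegression().fit(X, y_class)
label = clf.predict([[7.0]])[0]  # one of the class labels

# Regression: the target is a continuous number.
y_reg = 2.5 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y_reg)
value = reg.predict([[7.0]])[0]  # a real number
```

The classifier can only ever return one of the labels it was trained on, while the regressor can return any real number.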
</details>
- How do you determine whether a problem is classification or regression?
- Are there ML problems that fall outside classification and regression?


<details>
<summary>- What is skewed data, and how can you mitigate its effect?</summary>

#### What is Skewed Data?
Skewed data refers to a distribution whose values are not symmetric: one tail is longer or fatter than the other.
- Right-skewed (positive skew): Tail is on the right (e.g., income distribution).
- Left-skewed (negative skew): Tail is on the left (e.g., age at retirement).

Skew pulls the mean toward the long tail, can hurt models that work best with roughly symmetric inputs (such as linear regression), and can lead to biased conclusions.

#### How to Mitigate Skewed Data
#### Transform the Data
Apply transformations to reduce skewness:
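- Logarithmic: log(x) – for right-skewed data with strictly positive values; use log(1 + x) when zeros are present.
- Square root: sqrt(x) – a milder option for non-negative right-skewed data.
- Box-Cox or Yeo-Johnson – more flexible for both skew types. Box-Cox requires strictly positive values; Yeo-Johnson also accepts zeros and negatives.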
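
A quick way to check whether a transformation helped is to compare skewness before and after. The sketch below uses a synthetic log-normal sample as a stand-in for a right-skewed variable such as income; the variable name and distribution parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal), standing in for something like income.
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=5000))

skew_before = income.skew()           # strongly positive
skew_after = np.log1p(income).skew()  # much closer to zero after log(1 + x)
```

A skewness near zero suggests a roughly symmetric distribution; it is still worth plotting a histogram to confirm.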

#### Use Robust Statistics
Use metrics less sensitive to skew:
- Median instead of mean
- IQR (interquartile range) instead of standard deviation
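
To see why these statistics are called robust, consider a small made-up sample with a single extreme value. The numbers are illustrative only.

```python
import pandas as pd

# Nine typical observations plus one extreme value.
s = pd.Series([20, 22, 23, 25, 26, 27, 28, 30, 31, 400], dtype=float)

mean = s.mean()      # pulled far toward the extreme value
median = s.median()  # stays near the typical values
std = s.std()        # inflated by the extreme value
iqr = s.quantile(0.75) - s.quantile(0.25)  # spread of the middle 50% only
```

Here the mean lands well above every typical observation, while the median and IQR still describe the bulk of the data.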

#### Bin or Categorize
- Convert continuous skewed variables into categories or bins to reduce impact.

#### Outlier Treatment
- Winsorization: Replace extreme values with the value at a chosen percentile (for example, the 5th and 95th).
- Clipping: Cap values at a certain threshold.
- Remove outliers cautiously if they’re truly anomalous.
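
Percentile-based clipping, which amounts to winsorization, can be sketched with pandas as follows. The data and the 5th/95th percentile cutoffs are assumptions for illustration; appropriate thresholds depend on the dataset.

```python
import pandas as pd

# The values 1..99 plus one extreme outlier.
s = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# Cap everything below the 5th and above the 95th percentile.
lower = s.quantile(0.05)
upper = s.quantile(0.95)
clipped = s.clip(lower=lower, upper=upper)
```

No rows are dropped: the outlier is kept but pulled back to the upper threshold.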

#### Use Algorithms That Handle Skew
- Tree-based models such as Decision Trees, Random Forest, and XGBoost are less affected by skewed features, because their splits depend only on the ordering of the values.
</details>



## Predicting Bike Sharing Demand
Explore the dataset to understand distributions, correlations, and feature relationships.
Use appropriate plots and explain your choices: why you chose each graph and what alternatives exist.

### Prepare:
- Gain insights from the data through exploration.
- Clean and adjust the data as necessary.
- Build a preprocessing pipeline using scikit-learn.
- Wrap the pipeline creation into a reusable function that accepts any estimator.
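
One possible shape for such a function is sketched below. The function name `build_pipeline`, the column names (`temp`, `season`, `cnt`), and the choice of `StandardScaler` plus `OneHotEncoder` are assumptions for illustration; pick the preprocessing steps that your own exploration justifies.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(estimator, numeric_features, categorical_features):
    """Return preprocessing + estimator as one scikit-learn Pipeline (hypothetical helper)."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_features),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ]
    )
    return Pipeline(steps=[("preprocess", preprocessor), ("model", estimator)])


# Tiny made-up frame, only to show that the pipeline fits and predicts.
toy = pd.DataFrame({
    "temp": [0.2, 0.5, 0.7, 0.9],
    "season": [1, 2, 3, 4],
    "cnt": [10, 40, 80, 60],
})
pipe = build_pipeline(LinearRegression(), ["temp"], ["season"])
pipe.fit(toy[["temp", "season"]], toy["cnt"])
preds = pipe.predict(toy[["temp", "season"]])
```

Because the estimator is a parameter, the same function can be reused for every model you train, which keeps preprocessing identical across experiments.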

### Model Training
Train models to predict hourly bike rentals using:

- LinearRegression
- RandomForestRegressor
- XGBRegressor

Use GridSearchCV for hyperparameter tuning.
Evaluate your models using multiple performance metrics.
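
The tuning and evaluation loop might look like the following sketch. It runs on synthetic data from `make_regression` rather than the bike sharing dataset, and the parameter grid and the three metrics (MAE, RMSE, R²) are assumptions chosen for illustration, not prescribed values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid; real grids depend on the model and the data.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# Evaluate the best model on held-out data with several metrics.
y_pred = search.best_estimator_.predict(X_test)
metrics = {
    "mae": mean_absolute_error(y_test, y_pred),
    "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    "r2": r2_score(y_test, y_pred),
}
```

Reporting several metrics matters because they penalize errors differently: RMSE weights large errors more heavily than MAE, and R² expresses performance relative to always predicting the mean.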

### Dataset
Download the dataset from the UCI Machine Learning Repository:

👉 Bike Sharing Dataset

## Requirements
You are required to use the following tools:

- uv – Dependency management
- jupyter – Interactive notebook environment
- pandas – Data manipulation
- seaborn – Visualization
- scikit-learn – ML toolkit
- mlflow – Experiment tracking and logging