# 🤖 Scikit-learn: Overview & Common Algorithms
**Scikit-learn** is a Python library used for machine learning. It provides tools for supervised and unsupervised learning, preprocessing, model selection, and evaluation.

This notebook covers:
- What is Scikit-learn?
- Types of tasks (classification, regression, etc.)
- Preprocessing and pipelines
- Commonly used algorithms and when to use them

## 📘 What is Scikit-learn?
Scikit-learn is an open-source machine learning library that simplifies building and evaluating models. Its main areas are:
- **Classification**: Spam detection, disease diagnosis
- **Regression**: Predicting sales, prices, temperature
- **Clustering**: Grouping similar items
- **Preprocessing**: Data cleaning and transformation
- **Model Selection**: Evaluating and tuning models

## 🔍 Types of Learning
### 1. Supervised Learning
- **Classification**: Predict categories (e.g. Yes/No, Spam/Not Spam)
- **Regression**: Predict continuous values (e.g. Price, Marks)
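
As a minimal sketch of the classification case, the cell below trains a `LogisticRegression` model on scikit-learn's bundled Iris dataset. The dataset, the 25% test split, `random_state=42`, and `max_iter=1000` are illustrative assumptions, not recommendations from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Iris: 150 flowers, 4 numeric features, 3 species (labels 0, 1, 2)
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows; stratify keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# max_iter is raised (an assumption) so the solver has room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
```

The same `fit` / `predict` pattern applies to regression; only the estimator and the metric change.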

### 2. Unsupervised Learning
- **Clustering**: Group data without labels (e.g. customer segments)
- **Dimensionality Reduction**: Reduce the number of features (e.g. PCA)
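
Both unsupervised tasks can be sketched on the same data. The cell below ignores the Iris labels, groups the rows with `KMeans`, and projects the four features onto two principal components with `PCA`. The choice of 3 clusters, 2 components, and the random seed are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the features only; the labels are discarded because this is unsupervised
X, _ = load_iris(return_X_y=True)

# Clustering: assign each row to one of 3 groups (3 is an assumed value)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: compress 4 features into 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Cluster ids of first 5 rows:", cluster_ids[:5])
print("Reduced shape:", X_2d.shape)
print("Share of variance kept:", pca.explained_variance_ratio_.sum())
```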

### 3. Preprocessing & Evaluation
- Handle missing values, encode labels, scale data
- Evaluate using accuracy, mean squared error, cross-validation
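
The preprocessing and evaluation steps above can be chained into a single pipeline. The following is a sketch under stated assumptions: missing values are injected artificially into the Iris features (about 5% of entries, purely for demonstration), the imputer uses the column mean, and 5-fold cross-validation is used for scoring.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Blank out roughly 5% of the values to simulate missing data (illustrative only)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

# Each step is fitted on the training folds only, which avoids leaking test data
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),     # handle missing values
    StandardScaler(),                   # scale features to zero mean, unit variance
    LogisticRegression(max_iter=1000),  # final estimator
)

# 5-fold cross-validation returns one accuracy score per fold
scores = cross_val_score(pipe, X_missing, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Wrapping the steps in a pipeline means the imputer and scaler are re-fitted inside every fold, so the evaluation reflects how the model would behave on unseen data.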

## 🧠 Commonly Used Algorithms in Scikit-learn
| Algorithm | Type | Use Case | Module |
|----------|------|----------|--------|
| LinearRegression | Regression | Predict sales, prices | linear_model |
| LogisticRegression | Classification | Binary classification (e.g. pass/fail); also handles multiclass | linear_model |
| DecisionTreeClassifier | Classification | Interpretable if/else-style rules | tree |
| RandomForestClassifier | Classification | Ensemble of trees; usually more accurate than a single tree | ensemble |
| KNeighborsClassifier | Classification | Predicts from the labels of the nearest training points | neighbors |
| KMeans | Clustering | Customer grouping | cluster |
| PCA (Principal Component Analysis) | Dimensionality Reduction | Reduce the number of features | decomposition |

## ✅ Simple Regression Example

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Sample data: hours studied (feature) vs marks scored (target)
X = [[1], [2], [3], [4]]
y = [10, 20, 30, 40]

# With only 4 rows, test_size=0.25 holds out a single sample;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a straight line to the training rows, then predict the held-out row
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Predictions:", predictions)
print("MSE:", mean_squared_error(y_test, predictions))
```

## 📌 When to Use Which Algorithm
| Task | Algorithm | When to Use |
|------|-----------|-------------|
| Binary Classification | LogisticRegression | Two-class outcomes (yes/no, spam/not spam); a fast, interpretable baseline |
| Multiclass Classification | RandomForestClassifier | Multiple categories; generally less prone to overfitting than a single tree |
| Regression | LinearRegression | Predict numeric outcomes |
| Clustering | KMeans | Grouping items with no labels |
| Dimensionality Reduction | PCA | Too many features; compress them into fewer components |
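
When the table does not settle the choice, one practical option is to compare candidates with cross-validation on your own data. The sketch below does this for three classifiers on the Iris dataset; the dataset, the hyperparameters, and the `candidates` and `mean_scores` names are assumptions for illustration, and the ranking may well differ on other data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate models to compare (hyperparameters are illustrative defaults)
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
}

# Score every candidate with the same 5-fold cross-validation
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```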