# Machine Learning Fundamentals

Welcome to this comprehensive guide on Machine Learning fundamentals! This notebook will cover the essential concepts, algorithms, and practical implementations to get you started with ML.

## Table of Contents
1. [Introduction to Machine Learning](#introduction)
2. [Types of Machine Learning](#types)
3. [Data Preprocessing](#preprocessing)
4. [Linear Regression](#linear-regression)
5. [Logistic Regression](#logistic-regression)
6. [Decision Trees](#decision-trees)
7. [K-Nearest Neighbors (KNN)](#knn)
8. [Model Evaluation](#evaluation)
9. [Cross-Validation](#cross-validation)
10. [Feature Engineering](#feature-engineering)

## Setup: Import Required Libraries

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report
# Note: load_boston was removed in scikit-learn 1.2, so it is not imported here
from sklearn.datasets import load_iris, make_classification, make_regression

# Settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("Libraries imported successfully!")

---
<a id='introduction'></a>
## 1. Introduction to Machine Learning

**Machine Learning** is a subset of Artificial Intelligence that enables computers to learn from data without being explicitly programmed.

### Key Concepts:
- **Features (X)**: Input variables used to make predictions
- **Target (y)**: Output variable we want to predict
- **Model**: Mathematical representation that learns patterns from data
- **Training**: Process of teaching the model using data
- **Prediction**: Using the trained model to make forecasts on new data

### The ML Workflow:
1. **Data Collection**: Gather relevant data
2. **Data Preprocessing**: Clean and prepare data
3. **Feature Engineering**: Select and create meaningful features
4. **Model Selection**: Choose appropriate algorithm
5. **Training**: Fit the model to training data
6. **Evaluation**: Assess model performance
7. **Optimization**: Fine-tune parameters
8. **Deployment**: Use model in production
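
The workflow steps above can be sketched end to end in a few lines. This is a minimal illustration on synthetic data, not a production recipe: the dataset, the variable names (`X_demo`, `workflow`, `demo_accuracy`), and the choice of a scaled logistic regression are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: "collect" data (synthetic stand-in for a real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out a test set before any fitting so evaluation uses unseen data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Steps 2-5: preprocessing and model chained in one pipeline,
# fitted on the training portion only
workflow = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
workflow.fit(X_tr, y_tr)

# Step 6: evaluate on the held-out test set
demo_accuracy = accuracy_score(y_te, workflow.predict(X_te))
print(f"Held-out accuracy: {demo_accuracy:.3f}")
```

Chaining the scaler and the model in a pipeline keeps preprocessing and training in one object, which makes the later optimization and deployment steps easier to reason about.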

---
<a id='types'></a>
## 2. Types of Machine Learning

### 2.1 Supervised Learning
Learning from labeled data (input-output pairs)
- **Regression**: Predicting continuous values (e.g., house prices)
- **Classification**: Predicting discrete categories (e.g., spam/not spam)

### 2.2 Unsupervised Learning
Learning from unlabeled data
- **Clustering**: Grouping similar data points
- **Dimensionality Reduction**: Reducing feature space

### 2.3 Reinforcement Learning
Learning through interaction with an environment (rewards/penalties)

**This notebook focuses on Supervised Learning.**

In [None]:
# Visual representation of ML types
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Supervised Learning example
X_super = np.random.randn(50, 2)
y_super = (X_super[:, 0] + X_super[:, 1] > 0).astype(int)
axes[0].scatter(X_super[:, 0], X_super[:, 1], c=y_super, cmap='viridis', s=50)
axes[0].set_title('Supervised Learning\n(Labeled Data)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Unsupervised Learning example
X_unsuper = np.random.randn(50, 2)
axes[1].scatter(X_unsuper[:, 0], X_unsuper[:, 1], c='gray', s=50)
axes[1].set_title('Unsupervised Learning\n(Unlabeled Data)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

# Reinforcement Learning concept
axes[2].text(0.5, 0.7, 'Agent', ha='center', va='center', fontsize=14, 
             bbox=dict(boxstyle='round', facecolor='lightblue'))
axes[2].text(0.5, 0.3, 'Environment', ha='center', va='center', fontsize=14,
             bbox=dict(boxstyle='round', facecolor='lightgreen'))
# Action flows from the agent down to the environment; reward flows back up
axes[2].annotate('Action', xy=(0.3, 0.35), xytext=(0.3, 0.65), ha='center',
                arrowprops=dict(arrowstyle='->', lw=2, color='red'))
axes[2].annotate('Reward', xy=(0.7, 0.65), xytext=(0.7, 0.35), ha='center',
                arrowprops=dict(arrowstyle='->', lw=2, color='blue'))
axes[2].set_title('Reinforcement Learning\n(Agent-Environment)', fontsize=12, fontweight='bold')
axes[2].axis('off')

plt.tight_layout()
plt.show()

---
<a id='preprocessing'></a>
## 3. Data Preprocessing

Data preprocessing is crucial for building effective ML models. Common steps include:

1. **Handling Missing Values**
2. **Feature Scaling**
3. **Encoding Categorical Variables**
4. **Train-Test Split**

In [None]:
# Create sample dataset
data = {
    'age': [25, 30, 35, np.nan, 45, 50, 28, 33, 40, 38],
    'salary': [50000, 60000, 70000, 65000, 80000, 90000, 55000, 62000, 75000, 72000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR', 'Finance'],
    'performance': [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None

### 3.1 Handling Missing Values

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

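# Fill missing values with the column mean
# (assign the result back; in-place fillna on a selected column is unreliable in recent pandas)
df['age'] = df['age'].fillna(df['age'].mean())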

print("\nAfter handling missing values:")
print(df['age'])

### 3.2 Feature Scaling

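Scaling puts features on a comparable range so that large-magnitude features (such as salary) do not dominate distance-based or gradient-based models. In a real project, fit the scaler on the training set only and reuse it to transform the test set.

In [None]:
# Standardization (mean=0, std=1); fit on the full toy dataset here for brevity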
scaler = StandardScaler()
df[['age_scaled', 'salary_scaled']] = scaler.fit_transform(df[['age', 'salary']])

print("Original vs Scaled Features:")
print(df[['age', 'age_scaled', 'salary', 'salary_scaled']].head())
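
The cell above fits the scaler on the whole toy DataFrame. As a hedged sketch of the more careful pattern, the example below fits `StandardScaler` on a training split only and reuses the learned mean and standard deviation on the test split. The synthetic age/salary-like data and the names `X_demo`, `scaler_demo`, and `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data on very different scales (hypothetical age and salary)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[40.0, 60000.0], scale=[10.0, 15000.0], size=(200, 2))
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

# Learn mean and std from the training split only
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# Apply the same training statistics to the test split (no refitting)
X_te_scaled = scaler_demo.transform(X_te)

# Standardization is just z = (x - mean) / std, computed per column
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print("Matches manual z-score:", np.allclose(manual, X_tr_scaled))
```

The test split will generally not have exactly zero mean and unit variance after transformation, which is expected because its statistics were never used.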

### 3.3 Encoding Categorical Variables

In [None]:
# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['department'], prefix='dept')

print("Dataset with Encoded Categories:")
print(df_encoded.head())

### 3.4 Train-Test Split

Split data into training and testing sets to evaluate model performance.

In [None]:
# Prepare features and target
X = df_encoded.drop(['performance', 'age', 'salary'], axis=1)
y = df_encoded['performance']

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

---
<a id='linear-regression'></a>
## 4. Linear Regression

Linear Regression is used for predicting continuous values. It finds the best-fit line through the data.

**Equation**: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$

Where:
- $y$ = target (observed) value
- $\beta_0$ = intercept
- $\beta_i$ = coefficients
- $x_i$ = features
- $\epsilon$ = error term
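
To make the equation concrete, here is a small sketch that estimates $\beta_0$ and $\beta_1$ by ordinary least squares with NumPy alone. The synthetic data (true intercept 5, true slope 2) and the names `A` and `beta_hat` are assumptions chosen for the example; scikit-learn's `LinearRegression` solves the same least-squares problem.

```python
import numpy as np

# Synthetic data: y = 5 + 2x + noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 5.0 + 2.0 * x + rng.normal(0, 2, size=100)

# Design matrix: a column of ones (for the intercept) next to the feature
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution minimizes the sum of squared residuals ||A @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
beta_0_hat, beta_1_hat = beta_hat

print(f"Estimated intercept: {beta_0_hat:.2f}")
print(f"Estimated slope:     {beta_1_hat:.2f}")
```

The estimates should land near the true values of 5 and 2, with the gap driven by the noise term $\epsilon$.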

In [None]:
# Generate sample regression data
np.random.seed(42)
X_reg = np.random.rand(100, 1) * 10
y_reg = 2 * X_reg + 5 + np.random.randn(100, 1) * 2

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = lr_model.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Model Coefficients: {lr_model.coef_[0][0]:.2f}")
print(f"Model Intercept: {lr_model.intercept_[0]:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.4f}")

In [None]:
# Visualize Linear Regression
plt.figure(figsize=(10, 6))
plt.scatter(X_test_reg, y_test_reg, color='blue', alpha=0.6, label='Actual')
plt.plot(X_test_reg, y_pred_reg, color='red', linewidth=2, label='Predicted')
plt.xlabel('Feature (X)', fontsize=12)
plt.ylabel('Target (y)', fontsize=12)
plt.title('Linear Regression: Actual vs Predicted', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

---
<a id='logistic-regression'></a>
## 5. Logistic Regression

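Logistic Regression is a classification algorithm, usually introduced for binary problems (scikit-learn's implementation also handles multiclass targets). It models the probability that an instance belongs to the positive class.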

**Sigmoid Function**: $P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + ... + \beta_nx_n)}}$

Output is between 0 and 1 (probability).
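
As a quick illustration of the sigmoid formula, the sketch below turns a linear score into a probability and then into a class label. The coefficient values and sample points are made up for the example, and the 0.5 threshold is the conventional default rather than a requirement.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a two-feature model
beta_0 = -1.0
beta = np.array([2.0, -0.5])

# Three example points, one per row
x = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [3.0, 0.0],
])

z = beta_0 + x @ beta          # linear scores: [-1.0, 0.5, 5.0]
p = sigmoid(z)                 # P(y=1 | x) for each point
labels = (p >= 0.5).astype(int)

for score, prob, label in zip(z, p, labels):
    print(f"score={score:5.2f}  P(y=1)={prob:.3f}  predicted class={label}")
```

A score of 0 maps to a probability of exactly 0.5, so the decision boundary is the set of points where the linear part equals zero.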

In [None]:
# Generate classification data
X_class, y_class = make_classification(
    n_samples=200, n_features=2, n_redundant=0, n_informative=2,
    random_state=42, n_clusters_per_class=1
)

# Split data
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_class, y_train_class)

# Predictions
y_pred_class = log_reg.predict(X_test_class)
y_pred_proba = log_reg.predict_proba(X_test_class)

# Evaluate
accuracy = accuracy_score(y_test_class, y_pred_class)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_class, y_pred_class))

In [None]:
# Visualize Decision Boundary
def plot_decision_boundary(X, y, model, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap='viridis', s=50)
    plt.xlabel('Feature 1', fontsize=12)
    plt.ylabel('Feature 2', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.colorbar()
    plt.show()

plot_decision_boundary(X_test_class, y_test_class, log_reg, 
                      'Logistic Regression Decision Boundary')

---
<a id='decision-trees'></a>
## 6. Decision Trees

Decision Trees are versatile models that can be used for both classification and regression. They split the data based on features to create a tree-like structure.

**Advantages**:
- Easy to understand and interpret
- Handles both numerical and categorical data in principle (scikit-learn's trees expect categorical features to be numerically encoded first)
- Requires little data preprocessing

**Disadvantages**:
- Prone to overfitting
- Can be unstable with small variations in data
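
One way to see the interpretability advantage is to print a fitted tree as plain if/else rules. The sketch below uses scikit-learn's `export_text` on a shallow tree trained on the Iris dataset; the depth of 2 and the names `iris_demo`, `tree_demo`, and `rules` are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the rules stay readable
iris_demo = load_iris()
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_demo.fit(iris_demo.data, iris_demo.target)

# Render the learned splits as indented text rules
rules = export_text(tree_demo, feature_names=list(iris_demo.feature_names))
print(rules)
```

Limiting `max_depth` is also a simple guard against the overfitting listed under disadvantages, at the cost of a coarser decision boundary.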

In [None]:
# Train Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_classifier.fit(X_train_class, y_train_class)

# Predictions
y_pred_dt = dt_classifier.predict(X_test_class)

# Evaluate
dt_accuracy = accuracy_score(y_test_class, y_pred_dt)
print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_class, y_pred_dt))
print("\nClassification Report:")
print(classification_report(y_test_class, y_pred_dt))

In [None]:
# Visualize Decision Tree boundary
plot_decision_boundary(X_test_class, y_test_class, dt_classifier, 
                      'Decision Tree Decision Boundary')

In [None]:
# Feature Importance
feature_importance = dt_classifier.feature_importances_
feature_names = [f'Feature {i+1}' for i in range(X_class.shape[1])]

plt.figure(figsize=(8, 5))
plt.barh(feature_names, feature_importance, color='teal')
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Importance in Decision Tree', fontsize=14, fontweight='bold')
plt.show()

---
<a id='knn'></a>
## 7. K-Nearest Neighbors (KNN)

KNN is a simple, instance-based learning algorithm. It classifies a data point based on the majority class of its k nearest neighbors.

**Key Points**:
- Non-parametric (no assumptions about data distribution)
- Lazy learner (no explicit training phase; the model stores the training data and defers computation to prediction time)
- Distance-based algorithm

**Common Distance Metrics**:
- Euclidean Distance: $d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
- Manhattan Distance: $d = \sum_{i=1}^{n}|x_i - y_i|$
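
Both distance formulas are one-liners in NumPy. The two points below are arbitrary values picked so the arithmetic is easy to check by hand.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16 + 0) = 5.0

# Manhattan: sum of the absolute differences
manhattan = np.sum(np.abs(a - b))           # 3 + 4 + 0 = 7.0

print(f"Euclidean distance: {euclidean}")
print(f"Manhattan distance: {manhattan}")
```

Because both metrics add up raw coordinate differences, features on larger scales dominate the result, which is why KNN usually benefits from feature scaling.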

In [None]:
# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_class, y_train_class)

# Predictions
y_pred_knn = knn.predict(X_test_class)

# Evaluate
knn_accuracy = accuracy_score(y_test_class, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_class, y_pred_knn))

In [None]:
# Visualize KNN boundary
plot_decision_boundary(X_test_class, y_test_class, knn, 
                      'KNN Decision Boundary (k=5)')

In [None]:
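# Finding optimal K value
# Note: scoring candidates on the test set is for illustration only;
# choosing k this way makes the reported test accuracy optimistic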
k_values = range(1, 21)
accuracies = []

for k in k_values:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_class, y_train_class)
    y_pred_temp = knn_temp.predict(X_test_class)
    accuracies.append(accuracy_score(y_test_class, y_pred_temp))

plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o', linewidth=2, markersize=8)
plt.xlabel('K Value', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN: Finding Optimal K Value', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(k_values)
best_k = k_values[np.argmax(accuracies)]
plt.axvline(x=best_k, color='red', linestyle='--', label=f'Best K = {best_k}')
plt.legend()
plt.show()

print(f"Best K value: {best_k} with accuracy: {max(accuracies):.4f}")
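
The loop above selects k by test-set accuracy. A common alternative, sketched here under the same synthetic-data settings, is to choose k with cross-validation on the training split and touch the test split only once at the end. The names `k_grid`, `cv_means`, `best_k_cv`, and `final_knn` are introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same kind of synthetic two-feature data as used earlier in this notebook
X_k, y_k = make_classification(
    n_samples=200, n_features=2, n_redundant=0, n_informative=2,
    random_state=42, n_clusters_per_class=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_k, y_k, test_size=0.2, random_state=42)

# Score each candidate k with 5-fold cross-validation on the training split only
k_grid = list(range(1, 21))
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    for k in k_grid
]
best_k_cv = k_grid[int(np.argmax(cv_means))]

# Refit with the chosen k and evaluate once on the untouched test split
final_knn = KNeighborsClassifier(n_neighbors=best_k_cv).fit(X_tr, y_tr)
test_acc = final_knn.score(X_te, y_te)

print(f"Best k by cross-validation: {best_k_cv}")
print(f"Test accuracy with that k:  {test_acc:.4f}")
```

`GridSearchCV`, already imported in the setup cell, automates the same search and refit.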

---
<a id='evaluation'></a>
## 8. Model Evaluation

Proper model evaluation is crucial to understand performance and avoid overfitting.

### 8.1 Regression Metrics
- **Mean Squared Error (MSE)**: Average squared difference between predictions and actual values
- **Root Mean Squared Error (RMSE)**: Square root of MSE
- **R² Score**: Proportion of variance explained by the model (at most 1; it can be negative when the model fits worse than always predicting the mean)
- **Mean Absolute Error (MAE)**: Average absolute difference
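
The four regression metrics can be computed by hand in a few lines, which helps show how they relate. The tiny arrays below are invented values chosen so the results are easy to verify; the `_demo` names are used only in this example.

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true_demo - y_pred_demo            # [1, 0, -1, -2]

mse_demo = np.mean(errors ** 2)               # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_demo = np.sqrt(mse_demo)                 # same units as the target
mae_demo = np.mean(np.abs(errors))            # (1 + 0 + 1 + 2) / 4 = 1.0

# R² compares the model's squared error with that of predicting the mean
ss_res = np.sum(errors ** 2)                                   # 6
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)       # 20
r2_demo = 1.0 - ss_res / ss_tot                                # 0.7

print(f"MSE:  {mse_demo:.3f}")
print(f"RMSE: {rmse_demo:.3f}")
print(f"MAE:  {mae_demo:.3f}")
print(f"R²:   {r2_demo:.3f}")
```

These should agree with `mean_squared_error` and `r2_score` from `sklearn.metrics` on the same inputs.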

### 8.2 Classification Metrics
- **Accuracy**: Proportion of correct predictions
- **Precision**: True Positives / (True Positives + False Positives)
- **Recall (Sensitivity)**: True Positives / (True Positives + False Negatives)
- **F1-Score**: Harmonic mean of Precision and Recall
- **Confusion Matrix**: Table showing correct and incorrect predictions
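
The classification metrics above all derive from four counts in a binary confusion matrix. The sketch below computes them by hand for a small invented set of labels; the `_demo` names are assumptions for this example.

```python
import numpy as np

y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Count the four confusion-matrix cells
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))   # 2
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))   # 2
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))   # 1
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))   # 5

accuracy_demo = (tp + tn) / (tp + tn + fp + fn)             # 7 / 10
precision_demo = tp / (tp + fp)                             # 2 / 3
recall_demo = tp / (tp + fn)                                # 2 / 4
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"Accuracy:  {accuracy_demo:.3f}")
print(f"Precision: {precision_demo:.3f}")
print(f"Recall:    {recall_demo:.3f}")
print(f"F1-score:  {f1_demo:.3f}")
```

Here accuracy looks reasonable at 0.7 while recall is only 0.5, which is why accuracy alone can be misleading on imbalanced classes.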

In [None]:
# Load Iris dataset for comprehensive evaluation
from sklearn.datasets import load_iris
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Split data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# Train model
dt_iris = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_iris.fit(X_train_iris, y_train_iris)
y_pred_iris = dt_iris.predict(X_test_iris)

# Confusion Matrix
cm = confusion_matrix(y_test_iris, y_pred_iris)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title('Confusion Matrix - Iris Dataset', fontsize=14, fontweight='bold')
plt.show()

print("\nDetailed Classification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))

---
<a id='cross-validation'></a>
## 9. Cross-Validation

Cross-validation is a technique to assess model performance more reliably by training and testing on different subsets of data.

**K-Fold Cross-Validation**:
1. Split data into K folds
2. Train on K-1 folds, test on remaining fold
3. Repeat K times
4. Average the results

**Benefits**:
- More reliable performance estimate
- Every sample is used for both training and testing (in different folds)
- Reduces variance in performance metrics
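
The four K-fold steps listed above can be written out as an explicit loop. This sketch uses scikit-learn's `KFold` with shuffling on the Iris data and a KNN classifier; those choices, and the names `X_cv`, `y_cv`, and `fold_scores`, are assumptions for illustration. Note that `cross_val_score` with an integer `cv` uses stratified folds for classifiers, so its numbers can differ slightly from this loop.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# Step 1: split into K folds (shuffle because Iris rows are ordered by class)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X_cv):
    # Step 2: train on K-1 folds, test on the remaining fold
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))
    # Step 3: the loop repeats this for each of the K folds

# Step 4: average the per-fold results
mean_score = float(np.mean(fold_scores))
print("Fold accuracies:", np.round(fold_scores, 3))
print(f"Mean accuracy: {mean_score:.4f}")
```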

In [None]:
# Perform 5-Fold Cross-Validation
from sklearn.model_selection import cross_val_score

# Different models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

results = {}

for name, model in models.items():
    cv_scores = cross_val_score(model, X_iris, y_iris, cv=5, scoring='accuracy')
    results[name] = cv_scores
    print(f"{name}:")
    print(f"  CV Scores: {cv_scores}")
    print(f"  Mean Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print()

In [None]:
# Visualize Cross-Validation Results
plt.figure(figsize=(10, 6))
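model_names = list(models.keys())
bp = plt.boxplot([results[name] for name in model_names], patch_artist=True)
# Set tick labels separately: boxplot's `labels` keyword is deprecated in recent Matplotlib releases
plt.xticks(np.arange(1, len(model_names) + 1), model_names)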

for patch in bp['boxes']:
    patch.set_facecolor('lightblue')

plt.ylabel('Accuracy', fontsize=12)
plt.title('Cross-Validation Results Comparison', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

---
<a id='feature-engineering'></a>
## 10. Feature Engineering

Feature engineering is the process of creating new features or transforming existing ones to improve model performance.

**Common Techniques**:
1. **Creating Interaction Features**: Combining multiple features
2. **Polynomial Features**: Creating polynomial combinations
3. **Binning**: Converting continuous variables to categorical
4. **Feature Selection**: Choosing most relevant features
5. **Domain-Specific Features**: Using domain knowledge
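
Two of the techniques above, interaction features and binning, take only a line each in pandas. The small `people` DataFrame, the bin edges, and the group labels below are invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data
people = pd.DataFrame({
    'age': [22, 35, 47, 58, 63],
    'salary': [30000, 52000, 61000, 75000, 48000],
})

# Interaction feature: the product of two existing features
people['age_x_salary'] = people['age'] * people['salary']

# Binning: convert continuous age into ordered categories
# Intervals are right-inclusive by default: (0, 30], (30, 50], (50, 120]
people['age_group'] = pd.cut(
    people['age'], bins=[0, 30, 50, 120], labels=['young', 'mid', 'senior']
)

print(people)
```

Whether an engineered feature helps depends on the model and the data, so it is worth checking each one against a cross-validated baseline.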

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

# Example: Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_iris_poly = poly.fit_transform(X_iris[:5])  # Transform first 5 samples for display

print("Original Features (first 5 samples):")
print(X_iris[:5])
print(f"Original shape: {X_iris[:5].shape}")
print("\nPolynomial Features (degree=2):")
print(X_iris_poly)
print(f"New shape: {X_iris_poly.shape}")
print(f"\nFeature names: {poly.get_feature_names_out(iris.feature_names)}")

In [None]:
# Feature Selection: Select K Best
selector = SelectKBest(score_func=f_classif, k=2)
X_iris_selected = selector.fit_transform(X_iris, y_iris)

# Get feature scores
feature_scores = pd.DataFrame({
    'Feature': iris.feature_names,
    'Score': selector.scores_
}).sort_values('Score', ascending=False)

print("Feature Importance Scores:")
print(feature_scores)

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(feature_scores['Feature'], feature_scores['Score'], color='coral')
plt.xlabel('F-Score', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Importance for Iris Classification', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---
## Summary and Next Steps

### What We Covered:
1. Introduction to Machine Learning concepts
2. Types of ML (Supervised, Unsupervised, Reinforcement)
3. Data Preprocessing techniques
4. Linear Regression for continuous predictions
5. Logistic Regression for binary classification
6. Decision Trees for interpretable models
7. K-Nearest Neighbors for instance-based learning
8. Model Evaluation metrics and techniques
9. Cross-Validation for robust performance assessment
10. Feature Engineering to improve model performance

### Next Steps in Your ML Journey:
1. **Advanced Algorithms**: Random Forests, Gradient Boosting (XGBoost, LightGBM), Support Vector Machines
2. **Deep Learning**: Neural Networks, CNNs, RNNs
3. **Unsupervised Learning**: K-Means, DBSCAN, PCA, t-SNE
4. **Time Series Analysis**: ARIMA, LSTM
5. **Natural Language Processing**: Text classification, sentiment analysis
6. **Computer Vision**: Image classification, object detection
7. **Model Deployment**: Flask, FastAPI, Docker, Cloud platforms
8. **MLOps**: Model versioning, monitoring, CI/CD

### Resources:
- **Books**: "Hands-On Machine Learning" by Aurélien Géron, "Pattern Recognition and Machine Learning" by Christopher Bishop
- **Online Courses**: Coursera, fast.ai, deeplearning.ai
- **Practice**: Kaggle competitions, UCI ML Repository
- **Documentation**: scikit-learn, TensorFlow, PyTorch

### Practice Projects:
1. House price prediction
2. Customer churn prediction
3. Image classification
4. Sentiment analysis
5. Recommendation systems

In [None]:
# Closing message
print("Congratulations on completing the ML Fundamentals notebook!")
print("Keep practicing and exploring more advanced topics.")
print("Remember: The key to mastering ML is consistent practice and experimentation!")