
# Module 1: Introduction to Machine Learning

**Learning Objectives**
- Define machine learning and explain its significance.
- Identify the main types of machine learning (supervised, unsupervised).
- Understand the role of key Python libraries (scikit-learn, pandas, TensorFlow, NumPy).



## What is Machine Learning?

Machine Learning (ML) is about teaching computers to learn patterns from data and use those patterns to make predictions or decisions.  
Unlike traditional programming, where humans write explicit rules, ML systems infer the rules from examples.

Mathematically, we can think of ML as finding a function:

$$\hat{y} = f_\theta(x)$$

where:  
- $x$: the input features (e.g., hours studied, attendance).  
- $\hat{y}$: the model’s prediction (e.g., whether a student passes or fails).  
- $\theta$: the learned parameters (the model's internal settings that determine how inputs map to predictions).

**Example**: If $x$ is the number of hours a student studied, $\hat{y}$ might be the probability that the student passes. The function $f_\theta$ could be a logistic regression model.
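
To make the notation concrete, here is a minimal sketch of $f_\theta$ as a one-feature logistic model. The parameter values `theta0` and `theta1` are made up for illustration; they are not learned from any data in this module.

```python
import numpy as np

def f_theta(x, theta0, theta1):
    """Logistic model: map hours studied to a pass probability."""
    z = theta0 + theta1 * x          # linear score
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes the score into (0, 1)

# Hypothetical parameters, chosen by hand for illustration only
theta0, theta1 = -4.0, 0.5

hours = np.array([2, 8, 14])
y_hat = f_theta(hours, theta0, theta1)
print(y_hat)
```

With these hand-picked values, 8 hours of study gives a score of 0 and therefore a predicted probability of 0.5. Training a model amounts to choosing $\theta$ from data instead of by hand.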



## Why Machine Learning Matters

- Automates complex tasks (e.g., fraud detection, medical diagnosis).  
- Supports data-driven decisions.  
- Powers personalization (e.g., Netflix, Spotify).  
- Can improve efficiency and accuracy on repetitive, data-heavy tasks.  

For students: ML is the engine behind course recommendation systems, budgeting apps, and even predictive text in your messages.



## Supervised Learning Example

In supervised learning, models are trained on labeled data.  
We can formalize this as:

$$y = f_\theta(x) + \epsilon$$

where:  
- $y$: the true outcome (e.g., pass = 1, fail = 0).  
- $x$: input features (study hours, attendance).  
- $f_\theta$: the model’s prediction rule.  
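- $\epsilon$: error term (captures randomness or noise).

This additive form fits numeric outcomes most directly. For a binary outcome such as pass/fail, a classifier like logistic regression instead models the probability that $y = 1$ given $x$.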

**Example**: Predicting whether a student passes based on study habits.


```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Create synthetic dataset: study habits
np.random.seed(42)
n = 50
hours_studied = np.random.randint(0, 20, n)
attendance_rate = np.random.randint(50, 100, n)

# Rule for pass/fail (hidden ground truth): weighted score plus noise, thresholded at 40
passed = (hours_studied * 0.5 + attendance_rate * 0.3 + np.random.normal(0, 5, n)) > 40
passed = passed.astype(int)

df = pd.DataFrame({
    "hours_studied": hours_studied,
    "attendance_rate": attendance_rate,
    "passed": passed
})

# Split dataset: 70% for training, 30% held out for testing
X = df[["hours_studied", "attendance_rate"]]
y = df["passed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```


Output:

```text
Accuracy: 0.9333333333333333
Confusion Matrix:
 [[14  0]
 [ 1  0]]
```



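**Interpretation**  
- The coefficients in $f_\theta$ show how study hours and attendance shift the log-odds of passing, and therefore the predicted probability.  
- The confusion matrix compares actual outcomes (rows) with predicted outcomes (columns).  
- A false negative is a student who actually passed but was predicted to fail; a false positive is the reverse.  
- In the output above, the model predicts "fail" for all 15 test students: 14 predictions are correct and the one student who passed is missed. The high accuracy therefore mostly reflects how few students in the test set passed.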
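
As a small sketch, the four cells of the confusion matrix printed above can be unpacked by hand. The array is copied from that output; scikit-learn places actual classes on rows and predicted classes on columns, so `ravel()` yields true negatives, false positives, false negatives, and true positives in that order.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[14, 0],
               [1, 0]])

tn, fp, fn, tp = cm.ravel()

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / cm.sum()
# Recall: share of students who actually passed that the model identified
recall = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print("Recall (pass class):", recall)
```

Looking at recall next to accuracy is one way to check how a model handles the rarer class.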



## Unsupervised Learning Example

In unsupervised learning, the model finds hidden patterns without labeled outcomes.  

One common method is **clustering**, formalized as:  

$$\text{Cluster}(x_i) = \arg\min_k \; \| x_i - \mu_k \|^2$$

where:  
- $x_i$: a data point (e.g., a student’s expenses).  
- $\mu_k$: the center of cluster $k$.  
- The algorithm assigns each student to the closest cluster center.
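
The assignment rule above can be sketched directly in NumPy. The data points and cluster centers below are hypothetical values picked by hand; in K-means the centers are not fixed but are repeatedly recomputed as the mean of the points assigned to them.

```python
import numpy as np

# Hypothetical data: each row is one student's (rent, groceries)
X = np.array([[450.0, 160.0],
              [1100.0, 380.0],
              [800.0, 250.0]])

# Hypothetical cluster centers mu_k, fixed by hand for illustration
centers = np.array([[500.0, 200.0],
                    [1000.0, 350.0]])

# Squared Euclidean distance from every point to every center,
# resulting shape: (n_points, n_clusters)
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Assignment step: index of the nearest center for each point
assignments = sq_dist.argmin(axis=1)
print(assignments)
```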

**Example**: Grouping students by monthly cost-of-living.


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic dataset: student monthly expenses
np.random.seed(42)
n = 50
rent = np.random.randint(400, 1200, n)
groceries = np.random.randint(150, 400, n)
transport = np.random.randint(50, 200, n)
wifi = np.random.randint(30, 100, n)

X = pd.DataFrame({
    "rent": rent,
    "groceries": groceries,
    "transport": transport,
    "wifi": wifi
})

# KMeans clustering into 3 groups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Reduce the 4 expense features to 2D for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Color each student by assigned cluster
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis")
plt.title("Student Cost-of-Living Clusters")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
```



**Interpretation**  
- Each color marks a group of students with similar expenses.  
- Clusters might be read as, for example (the data here is randomly generated, so these labels are illustrative):  
  - **High-rent group**: students paying a lot for housing.  
  - **Balanced group**: average spending.  
  - **Frugal group**: low rent and fewer expenses.  
- On real data, groupings like these could guide budgeting strategies or student support programs.
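
K-means relies on Euclidean distance, so a feature with a large numeric range (here, rent) can dominate the grouping. One common option, sketched below as an assumption and not as part of the original example, is to standardize the features first and then read the cluster centers back in the original units. The data is regenerated with the same recipe as above so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same synthetic recipe as the clustering example above
np.random.seed(42)
n = 50
X = pd.DataFrame({
    "rent": np.random.randint(400, 1200, n),
    "groceries": np.random.randint(150, 400, n),
    "transport": np.random.randint(50, 200, n),
    "wifi": np.random.randint(30, 100, n),
})

# Standardize each column to mean 0 and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

# Convert the centers back to the original units for interpretation
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=X.columns,
)
print(centers.round(1))
print(pd.Series(labels).value_counts().sort_index())
```

Each row of `centers` describes the average monthly spending in one cluster, which is usually easier to interpret than the PCA scatter plot alone.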



## Quick Tour of Python Tools

### pandas
Used for handling tabular data and preprocessing before modeling.


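```python
import pandas as pd

# Create DataFrame with a missing value
df_tools = pd.DataFrame({
    "hours": [10, 8, None, 15],
    "quiz_avg": [80, 75, 60, 90]
})
print("Original DataFrame:\n", df_tools)

# Fill the missing value with the column mean.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about or ignore.
df_tools["hours"] = df_tools["hours"].fillna(df_tools["hours"].mean())
print("\nAfter filling missing value:\n", df_tools)
```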



### scikit-learn  
We used it for classification and clustering above.  
It provides ready-to-use implementations for most ML algorithms.
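
One reason scikit-learn is convenient is that its estimators share the same `fit` / `predict` interface, so models can be swapped with little code change. A minimal sketch on a tiny made-up dataset (the numbers are illustrative and not taken from the examples above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up data: hours studied -> passed (1) or failed (0)
X = [[0], [1], [2], [3], [8], [9], [10], [11]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                                  # same training call for every estimator
    predictions[name] = model.predict([[1], [10]])   # same prediction call
    print(name, predictions[name])
```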



### NumPy
Enables efficient numerical computations.


```python
import numpy as np

# Z-score normalization with NumPy
scores = np.array([70, 75, 80, 85, 90])
# scores.std() uses the population standard deviation (ddof=0) by default
z_scores = (scores - scores.mean()) / scores.std()
print("Scores:", scores)
print("Z-scores:", z_scores)
```



### TensorFlow  
A deep learning library for neural networks.  
We will explore it in detail later (Modules 8–9).



## Applications of ML

- **Healthcare**: disease diagnosis, medical image analysis.  
- **Finance**: fraud detection, credit scoring.  
- **Marketing**: customer segmentation, recommendation systems.  
- **NLP**: sentiment analysis, chatbots.  



## Wrap-up

- ML lets systems learn from data instead of following hand-written rules.  
- Supervised learning predicts outcomes from labeled data.  
- Unsupervised learning finds hidden patterns in unlabeled data.  
- Python libraries like pandas, scikit-learn, NumPy, and TensorFlow are the backbone of ML practice.  

**Next:** Supervised Learning in detail (Classification).



## References

- [Microsoft Learn: Introduction to Machine Learning](https://learn.microsoft.com/en-us/training/modules/introduction-to-machine-learning/)  
- [Google ML Crash Course](https://developers.google.com/machine-learning/crash-course)  
