## Key terms:

#### **Artificial Intelligence (AI)**
Artificial Intelligence is a major field in computer science focused on **creating intelligent agents** capable of **perceiving their environment, learning from experience, reasoning about information and making decisions** to achieve specific goals. It aims to **replicate or simulate human cognitive abilities** such as learning, problem-solving, planning, and language understanding.

---



#### **Deep Learning (DL)**
Deep Learning is a subfield of Machine Learning (and therefore of Artificial Intelligence) that uses **artificial neural networks (ANNs)**, which are loosely inspired by the structure and function of the human brain.

These networks contain **multiple layers**, which allow them to automatically learn increasingly complex features from large amounts of data. Deep learning excels at tasks such as **image recognition, speech processing, natural language understanding and autonomous systems**, where traditional ML methods may struggle.
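To make the idea of stacked layers concrete, here is a minimal sketch in plain Python. The weights are hand-picked for illustration (a real network would learn them from data), and the names `step`, `dense` and `tiny_network` are made up for this example. The first layer detects two simple features (OR and AND), and the second layer combines them into something neither could express alone (XOR).

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def dense(inputs, weights, biases):
    """One fully connected layer: each unit takes a weighted sum of all inputs."""
    return [step(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]

# Hand-picked (not learned) weights, chosen only for illustration.
HIDDEN_W, HIDDEN_B = [[1, 1], [1, 1]], [-0.5, -1.5]   # units behave like OR and AND
OUTPUT_W, OUTPUT_B = [[1, -1]], [-0.5]                # "OR but not AND" = XOR

def tiny_network(x1, x2):
    hidden = dense([x1, x2], HIDDEN_W, HIDDEN_B)   # layer 1: simple features
    return dense(hidden, OUTPUT_W, OUTPUT_B)[0]    # layer 2: combines them

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", tiny_network(a, b))
```

A single layer of this kind cannot compute XOR, which is one small reason why depth matters.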

---


#### **Transfer Learning (TL)**
Transfer learning is a technique in which a model that has already been trained on a large dataset is reused for a new, related task. Instead of training a new model from scratch, which would take a significant amount of time and data, we take a pre-trained model, keep what it has already learned and fine-tune it for the new task.

This works because machine learning models (especially deep learning models) learn general patterns in their layers, and these patterns can be reused for other tasks with some adjustment. For instance, a model trained on images of cars learns shapes, parts, curves and textures. These are not specific to cars; they also appear in other objects such as animals and furniture.

*Common Use Cases:*
- Image classification (CNN-based models)  
- Natural language processing (BERT, GPT, etc.)  
- Audio and speech recognition  
- Medical image analysis  
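The "keep what was learned, fine-tune the rest" idea can be sketched without any deep learning library. In the toy example below, `PRETRAINED_W` is a hypothetical stand-in for layers trained earlier on a large dataset; it is frozen, and only a small logistic-regression "head" is trained on the new task. All numbers are made up, and real transfer learning would use an actual pre-trained network.

```python
import math

# Stand-in for a pre-trained layer: hypothetical weights that we treat as
# already learned on some earlier, larger task.
PRETRAINED_W = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen base: its weights are reused as-is and never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=200, lr=0.5):
    """Fine-tuning step: fit only a small logistic-regression head on top."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = extract_features(x)
            pred = sigmoid(sum(w * f for w, f in zip(head_w, feats)) + head_b)
            error = pred - y  # gradient of the log loss w.r.t. the head's input
            head_w = [w - lr * error * f for w, f in zip(head_w, feats)]
            head_b -= lr * error
    return head_w, head_b

def predict(x, head_w, head_b):
    feats = extract_features(x)
    score = sigmoid(sum(w * f for w, f in zip(head_w, feats)) + head_b)
    return 1 if score >= 0.5 else 0

# Small labeled dataset for the new task (made-up numbers).
new_x = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
new_y = [1, 1, 0, 0]

head_w, head_b = train_head(new_x, new_y)
print("Predictions on the new task:", [predict(x, head_w, head_b) for x in new_x])
```

Because only the head is trained, far fewer parameters need to be learned, which is why a small dataset can be enough.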

---


#### **Machine Learning (ML)**
Machine Learning is a **subfield of AI** that enables systems to learn patterns from data **without being explicitly programmed** to perform a task. Instead of following hard-coded instructions as in traditional programming, ML models **improve their performance automatically** by analyzing data, identifying patterns and making predictions or decisions based on that experience.

**ML has the following steps/framework:**  
ML follows an **iterative cycle** (meaning we often repeat steps multiple times to improve accuracy, performance, and general reliability).


### **1. Problem Definition**
This step is crucial because it determines the **architecture, approach, and type of model** we will use. It involves defining:

- **The goal of the project**  
- **What we want the model to predict or detect**  
- **The type of output we want/need**

*Sample approach:*

- Are we predicting a number? → *Regression*  
- Are we predicting a category? → *Classification*  
- Are we grouping similar items? → *Clustering (unsupervised learning)*

*Types of learning approaches include:*

- **Supervised learning**  
  - The model is trained on labeled data (every training example includes both the input and correct output). The model learns the relationship between them and uses this knowledge to make predictions on new data.
  - Task examples:
    - Classification (predicting categories)
    - Regression (predicting numerical values)  
  - Eg.: Predicting house prices, predicting whether an email is spam etc.

- **Unsupervised learning**  
  - The data has no labels, so no correct answers are provided and the model must find hidden structures or patterns by itself. In this kind of learning, the model groups similar data points or reduces data complexity. Data is generally considered complex if it has a large number of features, contains noise (random or irrelevant information), has overlapping groups/clusters, or includes unusual data points (outliers) that can confuse the model or distort clusters.
  - Task examples:
    - Clustering (grouping similar items together)
    - Dimensionality Reduction (simplifying large datasets)
  - Eg.: Customer segmentation (model takes purchase behavior and groups customers automatically into categories like "frequent buyers", "bargain seekers" or "premium customers").

- **Semi-supervised learning**  
  - Uses a mix of labeled and unlabeled data, generally a small amount of labeled data and a larger amount of unlabeled data.
  - It is helpful when labeling is costly, slow or requires domain experts, as in the following example uses:
    - Medical imaging
    - Speech recognition etc.
  - It often performs better than purely unsupervised learning and is typically much cheaper than labeling everything for fully supervised learning.

- **Reinforcement learning**  
  - Here, an agent learns by interacting with an environment and receiving rewards or penalties based on its actions. There is no fixed dataset, since the agent learns by trial and error. The goal is to maximize the total reward over time.
  - The agent makes decisions step by step, learns which moves yield the highest rewards and gradually becomes better at achieving its goal.
  - Eg.: Game-playing AI, robot navigation.
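The difference between the first two approaches is easiest to see side by side. The sketch below uses made-up numbers and plain Python: `fit_line` is an ordinary least-squares fit that learns from inputs *and* their labels (supervised), while `two_means` is a tiny 1-D version of k-means with k=2 that receives only raw values (unsupervised). Both function names are chosen for this example only.

```python
# Supervised: every input comes with the correct output (a label).
sizes = [50, 80, 110, 140]       # input: floor area in m^2 (made-up data)
prices = [150, 240, 330, 420]    # label: price in thousands (made-up data)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(sizes, prices)
print("Predicted price for 100 m^2:", slope * 100 + intercept)

# Unsupervised: only raw values, no labels (made-up monthly spend per customer).
spend = [12, 15, 14, 95, 102, 98]

def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2: assign to the nearest centre, then re-average."""
    centres = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        centres = [sum(g) / len(g) for g in groups]
    return groups

low, high = two_means(spend)
print("Groups found without labels:", low, high)
```

Note that the clustering code never sees names like "bargain seekers" or "premium customers"; a person has to interpret the groups afterwards.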


---

### **2. Data Collection & Understanding**
This step involves gathering the data needed for training and understanding its quality, structure, and limitations.

*Types of data include:*

- **Structured data**  
  - Highly organized data that follows a clearly defined format, usually tables with rows and columns. Since it is well-organized, it is easy to store, search, filter and analyze using traditional tools.
  - Eg.: Excel sheets, CSV files, SQL tables.
  - Structured data is simple, clean and predictable, making it the easiest form of data for traditional machine learning.

- **Unstructured data**  
  - This type of data is more "free-form", meaning it does NOT have a fixed format or predefined structure, which makes it hard to place into rows and columns. It is harder to process because it is richer and more complex.
  - Eg.: images, audio recordings, videos, raw text.
  - Most of the data we generate today through phones, cameras, websites etc. is unstructured. Deep learning excels at extracting meaning from this kind of data.
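A small sketch of why the distinction matters in practice, using made-up data and plain Python. With structured rows we can compute on a column straight away; with raw text we first have to impose a structure. Here that structure is a simple word count (a "bag of words"), and `bag_of_words` is a helper written for this example.

```python
from collections import Counter

# Structured: every record has the same named fields, like a table row.
structured_rows = [
    {"name": "Alice", "age": 24},
    {"name": "Bob", "age": 30},
]
average_age = sum(row["age"] for row in structured_rows) / len(structured_rows)
print("Average age:", average_age)

# Unstructured: raw text has no columns, so we must build features ourselves.
review = "Great phone, great camera, poor battery"

def bag_of_words(text):
    """Lower-case the text, drop commas and full stops, and count each word."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return Counter(words)

word_counts = bag_of_words(review)
print(word_counts)
```

Word counts throw away word order and context, which is part of why deep learning models tend to do better on text, images and audio.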


---

### **3. Evaluation (Defining Success)**
In this step, we ask **"How do we know our model is good?"** and decide what "good" means for our model. Different problems require different evaluation methods, so choosing the right metric is essential to judge performance properly.

Metrics are measurement tools we use to evaluate how good a model is. They tell us how close the model's predictions are to the true values and how reliable they are.

Eg.: (these are covered in more detail in the next sections of Lesson 2)

- **Regression (predicting numbers)**  
  - Metrics:
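    - MAE (Mean Absolute Error) – average absolute difference between predicted and actual values
    - MSE (Mean Squared Error) – average of the squared errors, so it penalizes larger mistakes more heavily
    - RMSE (Root Mean Squared Error) – square root of MSE, so it is in the same units as the target; widely used for forecasting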

- **Classification (predicting categories)**  
  - Metrics:
    - Accuracy – share of all predictions that were correct
    - Precision – how many predicted positives were actually positive
    - Recall – how many actual positives we identified
    - F1 score – harmonic mean of precision and recall
    - Confusion matrix – table of predicted vs. actual classes

- **Forecasting (time series)**  
  - MAE  
  - MSE  
  - RMSE  
  - MAPE (Mean Absolute Percentage Error)
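The metrics listed above can be computed by hand in a few lines. This is a minimal sketch in plain Python with made-up numbers; the function names are chosen for this example, and in practice a library would normally be used instead.

```python
import math

def regression_metrics(actual, predicted):
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    # MAPE is undefined when an actual value is 0; this sketch assumes none are.
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "MAPE": mape}

def classification_metrics(actual, predicted, positive=1):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": {"TP": tp, "FP": fp, "FN": fn, "TN": tn},
    }

reg = regression_metrics([100, 200, 400], [110, 190, 360])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(reg)
print(clf)
```

In the regression example, one large miss (40) pulls RMSE above MAE, which shows how squaring penalizes big errors more.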


---

### **4. Feature Engineering & Selection**
The columns that are fed into our model and used to make predictions (since models learn from them) are called **"features"** *(taken from Lesson3b)*.

Feature engineering is the process of using domain knowledge to select, transform and create variables (features) from raw data to improve the model's performance. It generally includes the following:

- Creating new useful features  
- Removing irrelevant ones  
- Scaling or transforming data  
- Selecting the most important features

Eg.: A heart disease prediction model's features may include, among others:

- Age
- Weight
- Sex
- Blood Pressure
- Chest Pain Type
- Cholesterol Level
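Three of the operations above (scaling, transforming a category, and creating a new feature) can be sketched in plain Python. The patient records are made up, the helper names are chosen for this example, and the cut-off of 240 for the "high cholesterol" flag is only an illustrative threshold, not a clinical recommendation.

```python
patients = [  # made-up records, for illustration only
    {"age": 40, "cholesterol": 180, "chest_pain": "typical"},
    {"age": 60, "cholesterol": 240, "chest_pain": "atypical"},
    {"age": 50, "cholesterol": 300, "chest_pain": "none"},
]

def min_max_scale(values):
    """Rescale numbers to the range 0..1 (assumes the values are not all equal)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def one_hot(values):
    """Turn a category into one 0/1 column per distinct value (sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

scaled_age = min_max_scale([p["age"] for p in patients])
pain_columns = one_hot([p["chest_pain"] for p in patients])
# A new, derived feature: illustrative "high cholesterol" flag (threshold assumed).
high_chol = [1 if p["cholesterol"] >= 240 else 0 for p in patients]

for row in zip(scaled_age, high_chol, pain_columns):
    print(row)
```

Scaling keeps a feature with large numbers (such as cholesterol) from dominating one with small numbers, and one-hot encoding lets a model use a text category as numeric input.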


---

### **5. Modelling**
In this step, we choose the **type of algorithm** and train the model using the features and data.
Different models work better for different problems. Example model choices:

**For Tabular Data:**
- Linear Regression  
- Random Forest  
- XGBoost  
- Decision Trees  

**For Images:**
- CNNs (Convolutional Neural Networks)

**For Text / Sequences:**
- RNNs  
- LSTMs  
- Transformers


---

### **6. Experimentation**
This step involves testing and improving the model by:
- Trying different algorithms  
- Adjusting hyperparameters  
- Collecting more data  
- Adding / removing features  
- Improving preprocessing

ML development is **iterative**, so we repeat this step until the model's performance is good enough for the goal defined in step 1.
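As a minimal sketch of "adjusting hyperparameters", the example below tries several values of k for a small k-nearest-neighbours classifier and keeps the one with the best accuracy on held-out validation data. The data is made up (including one deliberately noisy training point), and the function names are chosen for this example.

```python
from collections import Counter

# Made-up 1-D data as (feature, label) pairs; (4.4, "B") is a deliberately noisy point.
train = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (4.4, "B"),
         (6, "B"), (7, "B"), (8, "B"), (9, "B")]
validation = [(1.5, "A"), (4.3, "A"), (6.5, "B"), (8.5, "B")]

def knn_predict(train_points, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train_points, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train_points, held_out, k):
    hits = sum(1 for x, label in held_out if knn_predict(train_points, x, k) == label)
    return hits / len(held_out)

# Experiment: same data, same algorithm, different hyperparameter values.
results = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, "-> best k:", best_k)
```

With k=1 the noisy point misleads the model; a slightly larger k smooths it out. Evaluating on data the model was not trained on is what makes the comparison meaningful.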

---
**Ethics:**
- One major challenge is anonymising data.
  Even when we remove names and contact details, identifying a person is still surprisingly easy. For example, a patient can often be uniquely identified using just postcode, birth date and gender.
  So "anonymous" data is rarely truly anonymous.

- Another ethical concern is that ML models can end up discriminating against certain groups.
  The algorithm itself doesn’t 'want' to discriminate, but if the data contains biased patterns, the model will learn and repeat those biases.
  A classic example is loan applications.
  If the model bases its decisions on features such as gender, religion or race, the outcome is discriminatory, even if the model is 'just using the data'.

Because of these risks, we must always ask:

1. Who is allowed to access the data?
   Data often contains highly sensitive information, so access must be controlled.

2. For what purpose was the data collected?
   Using data for purposes the user didn’t agree to is unethical (and sometimes illegal).

3. What conclusions can we legitimately draw from it?
   Predictions seem precise, but they are still based on probability, noise and incomplete information.

This is why all ML results need to come with caveats or disclaimers.
Statistical arguments alone are never enough; they must be supported by context, ethical consideration and an understanding of the limitations of the data.
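To make the anonymisation point concrete, the sketch below counts how many "anonymised" records are still unique on postcode, birth date and gender. The records are entirely made up, and `unique_records` is a helper written for this example; it is a rough check, not a complete privacy analysis.

```python
from collections import Counter

# Made-up "anonymised" records: names removed, quasi-identifiers kept.
records = [
    {"postcode": "AB1 2CD", "birth_date": "1980-03-14", "gender": "F", "diagnosis": "asthma"},
    {"postcode": "AB1 2CD", "birth_date": "1980-03-14", "gender": "F", "diagnosis": "diabetes"},
    {"postcode": "AB1 2CD", "birth_date": "1975-11-02", "gender": "M", "diagnosis": "flu"},
    {"postcode": "EF3 4GH", "birth_date": "1992-07-30", "gender": "F", "diagnosis": "migraine"},
]
QUASI_IDENTIFIERS = ("postcode", "birth_date", "gender")

def unique_records(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    def combo(row):
        return tuple(row[k] for k in keys)
    counts = Counter(combo(row) for row in rows)
    return [row for row in rows if counts[combo(row)] == 1]

exposed = unique_records(records, QUASI_IDENTIFIERS)
print(f"{len(exposed)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
```

Anyone who knows those three facts about a person whose record is unique could link that record, including the diagnosis, back to them.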


---

### Data
Data today is generated constantly because electronic devices are everywhere. Everything we do on a device produces some form of data: clicks, messages, sensor readings, app usage etc.

However, the amount of data generated is growing much faster than our ability to understand it. Machine learning helps bridge this gap, making it easier for us to make sense of the massive amount of information hidden inside the data.

**Real-world data**, unlike the data we have worked with so far, is almost never clean or complete. It usually has:
- Incomplete combinations:
    The number of actual instances is much smaller than the total possible attribute combinations.

- Sparse data:
    Many values are zero or missing.

- Missing values:
    Attributes may not be recorded for every example.

- Noise:
    Human errors, device inaccuracies, wrong labels.

- Non-deterministic behavior:
    In real life, decisions are often based on probability, not fixed rules.
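As a small illustration of the missing-values problem, the sketch below counts the gaps in a made-up sensor log and fills them with the mean of the recorded values. It uses plain Python with `None` marking a missing value, and the helper names are chosen for this example. Mean imputation is only one simple option; pandas offers `isna()` and `fillna()` for the same kind of work on DataFrames.

```python
readings = [  # made-up sensor log; None marks a value that was never recorded
    {"temp": 21.5, "humidity": 40},
    {"temp": None, "humidity": 42},
    {"temp": 22.5, "humidity": None},
    {"temp": 23.0, "humidity": 44},
]

def missing_counts(rows):
    """Count how many rows lack a value for each attribute."""
    return {key: sum(1 for row in rows if row[key] is None) for key in rows[0]}

def impute_mean(rows, key):
    """Return new rows where missing values of `key` are replaced by the mean."""
    known = [row[key] for row in rows if row[key] is not None]
    mean = sum(known) / len(known)
    return [{**row, key: mean if row[key] is None else row[key]} for row in rows]

print(missing_counts(readings))
cleaned = impute_mean(readings, "temp")
print([row["temp"] for row in cleaned])
```

The original `readings` list is left untouched, so the raw data can still be inspected after cleaning.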

As explained in step 2 of the ML section above, data can be divided into two types: structured data and unstructured data.

---
## Python Basics

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skewnorm
```

```python
# 1. Hello World
print("Hello, Google Colab!")

# 2. Variables & Data Types
x = 10          # integer
y = 3.14        # float
name = "Sunny"  # string
is_phd = True   # boolean

print("x:", x, type(x))
print("y:", y, type(y))
print("name:", name, type(name))
print("is_phd:", is_phd, type(is_phd))

# 3. Lists
fruits = ["apple", "banana", "cherry"]
print("Fruits:", fruits)
print("First fruit:", fruits[0])  # indexing starts at 0

# 4. Loops
for fruit in fruits:
    print("I like", fruit)

# 5. If-Else
age = 21
if age >= 18:
    print("You are an adult.")
else:
    print("You are a minor.")

# 6. Functions
def greet(person):
    return f"Hello {person}, welcome to Python!"

print(greet("Sunny"))

# 7. Dictionaries
student = {"name": "Sunny", "age": 27, "field": "AI"}
print("Student info:", student)
print("Name:", student["name"])

# 8. Using NumPy (numerical operations); np was imported in the first cell
arr = np.array([1, 2, 3, 4, 5])
print("NumPy array:", arr)
print("Mean:", arr.mean())

# 9. Using Pandas (data tables); pd was imported in the first cell
data = {"Name": ["Alice", "Bob", "Charlie"],
        "Age": [24, 30, 22]}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)
```

Output:

```text
Hello, Google Colab!
x: 10 <class 'int'>
y: 3.14 <class 'float'>
name: Sunny <class 'str'>
is_phd: True <class 'bool'>
Fruits: ['apple', 'banana', 'cherry']
First fruit: apple
I like apple
I like banana
I like cherry
You are an adult.
Hello Sunny, welcome to Python!
Student info: {'name': 'Sunny', 'age': 27, 'field': 'AI'}
Name: Sunny
NumPy array: [1 2 3 4 5]
Mean: 3.0
Pandas DataFrame:
      Name  Age
0    Alice   24
1      Bob   30
2  Charlie   22
```
