# Week 2: Machine Learning Essentials
[back](../README.md)

### Introduction

Welcome to Week 2 of the Agentic AI program. This week focuses on bridging the gap between basic Python programming and intelligent agent behavior by introducing machine learning fundamentals. You will explore how to train models that enable agents to make predictions, classify data, and reason about the world.

**Key Characteristics of ML Models:**
- **Learning from data**: Models improve their performance by analyzing examples.
- **Predictive intelligence**: Trained models can make predictions on new data.

For AI agents, machine learning is essential. It enables agents to:
- Predict optimal actions based on historical outcomes.
- Classify incoming requests for appropriate routing.
- Rank search results by relevance.
- Estimate resource requirements for tasks.

### Learning Outcomes

By the end of this week, you will be able to:
- Understand the ML workflow.
- Save and load trained models using pickle.
- Create reusable ML tools.

---

## Core Concepts

### The ML Workflow

Machine learning projects typically follow a standard workflow:

![ML Workflow](w2_mlworkflow.png)


### Common Python Frameworks for Machine Learning

Here are some popular Python frameworks used in machine learning:

- **scikit-learn**: A simple and efficient library for data mining and machine learning, ideal for beginners and traditional ML tasks.
- **PyTorch**: A flexible and dynamic deep learning framework, widely used for research and production.
- **TensorFlow**: An end-to-end open-source platform for machine learning, known for its scalability and support for production-grade models.
- **XGBoost**: An optimized gradient boosting library designed for speed and performance, often used in competitive machine learning.
- **LightGBM**: A gradient boosting framework that is highly efficient and scalable, particularly for large datasets.

This week, we will use **scikit-learn** to develop and deploy ML models. Its simplicity and efficiency make it an excellent choice for understanding the fundamentals of the ML workflow.

### Correlation and Causation

**Correlation** refers to a statistical relationship between two variables, where changes in one variable are associated with changes in another. However, correlation does not imply causation.

**Causation** means that one variable directly affects another, establishing a cause-and-effect relationship.

#### Example 1: Correlation (No Causation)
- **Observation**: Students who attend a reputed college tend to score higher on exams.
- **Explanation**: Although college reputation and exam scores are correlated, other factors such as student ability, prior knowledge, and exam difficulty also play a role. Correlation alone does not prove that attending a reputed college causes higher scores.

#### Example 2: Causation
- **Observation**: Providing additional tutoring sessions improves student performance.
- **Explanation**: Controlled experiments can establish that tutoring directly impacts performance, demonstrating a causal relationship.

Understanding the distinction between correlation and causation is critical in machine learning to avoid making incorrect assumptions about relationships in data.
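
The distinction can be illustrated with a small simulation. The sketch below uses purely synthetic data (the variable names, noise levels, and sample size are assumptions chosen for illustration, not measurements). A hidden factor, `ability`, drives both `college_reputation` and `exam_score`, so the two are strongly correlated even though neither causes the other.

```python
import numpy as np

# Synthetic data only: none of these numbers come from a real study.
rng = np.random.default_rng(0)
n = 5000

# Hidden confounder: student ability drives both variables below.
ability = rng.normal(0.0, 1.0, n)
college_reputation = ability + rng.normal(0.0, 0.5, n)
exam_score = ability + rng.normal(0.0, 0.5, n)

# The two observed variables are strongly correlated...
raw_corr = np.corrcoef(college_reputation, exam_score)[0, 1]

# ...but once ability is subtracted out, only independent noise remains.
resid_corr = np.corrcoef(college_reputation - ability, exam_score - ability)[0, 1]

print(f"Correlation (raw): {raw_corr:.2f}")
print(f"Correlation (ability removed): {resid_corr:.2f}")
```

In this simulation the raw correlation is high while the correlation after removing the confounder is close to zero, which is the pattern to watch for before reading a causal story into a heatmap.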



### Introduction to Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship, where changes in the independent variable(s) result in proportional changes in the dependent variable.

#### Key Features:
- **Simple Linear Regression**: Involves one independent variable.
- **Multiple Linear Regression**: Involves two or more independent variables.

Linear regression is widely used for predictive modeling and understanding relationships between variables.
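
As a compact reference, the two variants can be written as follows. The symbols are standard notation introduced here for illustration: $y$ is the dependent variable, $x_j$ are the independent variables, $\beta_0$ is the intercept, $\beta_j$ are the coefficients, and $\varepsilon$ is the error term.

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$

$$
y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon
$$

In scikit-learn's `LinearRegression`, the fitted $\beta_0$ is exposed as `intercept_` and the fitted $\beta_1, \dots, \beta_p$ as `coef_`. The model is fitted by ordinary least squares, that is, by choosing the values that minimize the sum of squared differences between observed and predicted $y$.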

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import warnings
import pickle

warnings.filterwarnings("ignore")

In [None]:
from sklearn.linear_model import LinearRegression
# Read the Salary_Data.csv file
salary_data = pd.read_csv('Salary_Data.csv')
Y = salary_data['Salary']
X = salary_data[['YearsExperience']]
# Create and fit the linear regression model
salarymodel = LinearRegression()
salarymodel.fit(X, Y)

# Display the coefficients
print("Intercept:", salarymodel.intercept_)
print("Coefficient:", salarymodel.coef_)
print("Model R^2 Score:", salarymodel.score(X, Y))
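
To make the printed intercept and coefficient concrete, here is a minimal sketch on made-up toy data (the values follow `y = 2x + 1` and are not taken from `Salary_Data.csv`; the names `X_toy`, `y_toy`, and `toy_model` are hypothetical). It shows that a prediction from `predict()` is the intercept plus the coefficient times the feature value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data that follows y = 2x + 1 exactly (not from Salary_Data.csv).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# A prediction is just intercept + coefficient * feature value.
x_new = 6.0
manual = toy_model.intercept_ + toy_model.coef_[0] * x_new
from_predict = toy_model.predict(np.array([[x_new]]))[0]

print("Intercept:", round(toy_model.intercept_, 6))
print("Coefficient:", round(toy_model.coef_[0], 6))
print("Manual prediction:", round(manual, 6))
print("predict() result:", round(from_predict, 6))
```

The same reading applies to the salary model: the coefficient is the estimated change in salary per additional year of experience, and the intercept is the estimated salary at zero years.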

Machine learning (ML) models play a pivotal role in enhancing the intelligence of AI agents. By integrating ML models, AI agents can analyze data, make predictions, and adapt to dynamic environments. Here's how ML models can be utilized within AI agents:

1. **Training the Model**: Begin by training the ML model on a relevant dataset. For instance, in the case of predicting salaries based on years of experience, the model is trained using historical data to learn the relationship between the input features and the target variable.

2. **Saving the Model**: Once trained, the model is serialized and saved using tools like `pickle`. This ensures that the model can be reused without retraining, saving computational resources.

Because the pickle file stores the fitted model, an AI agent can load it at startup and make predictions or decisions in real time without retraining.

In [None]:

with open('salarymodel.pkl', 'wb') as file:
    pickle.dump(salarymodel, file)

print("Model saved to salarymodel.pkl")

In [None]:
# Load the pickled model
with open('salarymodel.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Example: Predict salary for a given number of years of experience
example_years_experience = [[float(input("Enter the number of years of experience: "))]]
predicted_salary = loaded_model.predict(example_years_experience)

print(f"Predicted Salary for {example_years_experience[0][0]} years of experience: {predicted_salary[0]}")
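
The save-and-load steps above can be wrapped into a reusable tool that an agent could call. The following is a minimal sketch under stated assumptions: the toy data, the temporary file path, and the function name `predict_salary` are all hypothetical and stand in for the real salary model. Note that `pickle.load` can execute arbitrary code, so only load pickle files from sources you trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data standing in for the salary dataset (hypothetical values).
years = np.array([[1.0], [3.0], [5.0], [7.0]])
salary = np.array([40000.0, 60000.0, 80000.0, 100000.0])
toy_model = LinearRegression().fit(years, salary)

# Save the fitted model to a temporary file.
model_path = os.path.join(tempfile.mkdtemp(), "toy_salary_model.pkl")
with open(model_path, "wb") as file:
    pickle.dump(toy_model, file)


def predict_salary(years_experience, path=model_path):
    """Hypothetical agent tool: load the pickled model and return one prediction."""
    with open(path, "rb") as file:
        model = pickle.load(file)
    return float(model.predict([[years_experience]])[0])


print(predict_salary(4.0))
```

This version reloads the file on every call to keep the sketch short; a tool that is called often would more likely load the model once and reuse it.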


## Student Performance Case Study


The Student Performance dataset, available on **Kaggle**, contains data on various factors influencing student academic performance. This dataset provides a practical example for understanding how multiple variables contribute to academic success and can be used to train regression models that predict exam scores based on the provided features.

**Dataset Variables:**
- **Hours Studied**: Total number of hours spent studying by each student.
- **Previous Scores**: Scores obtained by students in previous tests.
- **Extracurricular Activities**: Indicates whether the student participates in extracurricular activities (Yes/No).
- **Sleep Hours**: Average number of hours of sleep the student had per day.
- **Sample Question Papers Practiced**: Number of sample question papers the student practiced.
- **Performance Index**: A measure of the overall performance of each student. The performance index ranges from 10 to 100, with higher values indicating better performance.

Download link: [Student Performance Dataset](https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/data#:~:text=calendar_view_week-,Student_Performance,-.csv)

### The ML Workflow

- **Goal**: Predict the student Performance Index from the other dataset variables.
- **Explore and prepare**: Analyze and preprocess the dataset to ensure data integrity and quality.
- **Train and model**: Use scikit-learn to train regression models on the dataset.
- **Deploy**: Save the trained model to integrate it into an application.
- **Monitor and manage**: Evaluate the model's performance over time and update it as needed.



In [None]:
# Read the Student_Performance.csv file
student_data = pd.read_csv('Student_Performance.csv')
# Display the first few rows of the dataset
print(student_data.head())


In [None]:
# Correct column names by removing spaces
student_data.columns = student_data.columns.str.replace(' ', '')

# Convert 'ExtracurricularActivities' to numerical values (Yes -> 1, No -> 0)
student_data['ExtracurricularActivities'] = student_data['ExtracurricularActivities'].map({'Yes': 1, 'No': 0})

# Display the updated dataset
print(student_data.head())

In [None]:
# Find duplicates
duplicates = student_data[student_data.duplicated()]
print(f"Number of duplicate rows: {len(duplicates)}")

# Drop duplicates
student_data = student_data.drop_duplicates()

In [None]:
# Check for rows with null values in the dataset
rows_with_null = student_data[student_data.isnull().any(axis=1)]
print(f"Number of rows with null values: {len(rows_with_null)}")
# Describe student_data using statistics
student_data.describe()

In [None]:
# Compute the correlation matrix
correlation_matrix = student_data.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sn.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()

In [None]:
# See unique values
print(student_data["SleepHours"].unique())


In [None]:

def count_plot(column_name, data, hue=None, rotation=0):
    """
    Draw a seaborn count plot for one column.

    1) Input: name of a column with a small number of distinct values
       (categorical, or discrete numeric such as SleepHours)
    2) Output: count plot with unique values on the x-axis and frequency on the y-axis
    3) Bar labels show the frequency of each unique value above its bar
    """
    # Create the figure inside the function so every call uses the same size
    plt.figure(figsize=(15, 6))
    graph = sn.countplot(x=column_name, data=data, hue=hue, order=data[column_name].value_counts().index)
    for container in graph.containers:
        graph.bar_label(container)

    plt.xticks(rotation=rotation)
    plt.show()

# Call the function with the student DataFrame
count_plot(column_name="SleepHours", data=student_data)


In [None]:
count_plot(column_name="SampleQuestionPapersPracticed", data=student_data)

In [None]:
# Split the data into independent variables (X) and the dependent variable (y)
X = student_data.drop("PerformanceIndex", axis=1)
y = student_data["PerformanceIndex"]

### Case 1

In [None]:
# Import the train_test_split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [None]:
# Consider PreviousScores as correlation is significant
X1 = X[['PreviousScores']]
y1 = y
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=40)
# The random_state parameter ensures reproducibility of the train-test split.
# With random_state=40, the split always produces the same training and testing sets
# when the code is run multiple times, making the results consistent and comparable.
print(X1_train.shape, X1_test.shape, y1_train.shape, y1_test.shape)
# Create a LinearRegression model and fit it on the training data
case1model = LinearRegression()
case1model.fit(X1_train, y1_train)
# Print the coefficients, intercept, and training-set R^2 score of the model
print("Coefficients:", case1model.coef_)
print("Intercept:", case1model.intercept_)
print("Training R^2 Score:", case1model.score(X1_train, y1_train))
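
The effect of `random_state` can be checked directly. The sketch below uses made-up toy arrays (`X_toy` and `y_toy` are hypothetical, not the student dataset) and splits them twice with the same seed to confirm that both calls select the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up toy data: ten samples with one feature each.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) * 2

# Two independent calls that share the same seed.
_, X_test_a, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=40)
_, X_test_b, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=40)

same_split = np.array_equal(X_test_a, X_test_b)
print("Same test rows with the same seed:", same_split)
print("Test set size:", len(X_test_a))
```

Using one seed for every experiment, as the Case 1 and Case 2 cells do, means the models are compared on identical test rows.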




In [None]:
# Now, consider only HoursStudied and PreviousScores as features given their correlation
X2 = X[['HoursStudied', 'PreviousScores']]
y2 = y
# Split the data into training and testing sets
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=40)
case2model = LinearRegression()
case2model.fit(X2_train, y2_train)
# Print the coefficients, intercept, and training-set R^2 score of the model
print("Coefficients:", case2model.coef_)
print("Intercept:", case2model.intercept_)
print("Training R^2 Score:", case2model.score(X2_train, y2_train))



In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Predict the target values for the test data
y1_pred = case1model.predict(X1_test)
y2_pred = case2model.predict(X2_test)
# Calculate R² score
r1 = r2_score(y1_test, y1_pred)
r2 = r2_score(y2_test, y2_pred)
print(f"R² Score for Case 1 Model: {r1}")
print(f"R² Score for Case 2 Model: {r2}")
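
The cell above imports `mean_absolute_error` and `mean_squared_error` alongside `r2_score`. As a minimal sketch of how the three metrics relate, the example below applies them to made-up values (`y_true` and `y_hat` are hypothetical and are not results from the student dataset). MAE and RMSE are expressed in the units of the target, while R² is unitless.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration (not results from the student dataset).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.0])

mae = mean_absolute_error(y_true, y_hat)   # average absolute error
mse = mean_squared_error(y_true, y_hat)    # average squared error
rmse = np.sqrt(mse)                        # back in the units of y
r2_toy = r2_score(y_true, y_hat)           # share of variance explained

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R^2:  {r2_toy:.4f}")
```

The same calls could be applied to `y1_test` and `y1_pred`, or `y2_test` and `y2_pred`, to compare the two cases on error size as well as on R².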

