# Assignment 2 Notebook Overview

> Note 1: for answers with Python, display both the code and the results clearly. You may explain 
answers with comments, Markdown cells, or the print() function.

> Note 2: for answers with manual calculation, please display all calculation steps clearly. 

> Note 3: round all numerical answers to 2 decimal places.    

## 0. Environment Setup and First Look

Load helper libraries, style the plotting backend, and pull the shared `smoking.csv` dataset for reuse across later sections.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score  

### First Look

In [4]:
# Load the assignment dataset (ships with the notebook repository)
# ------------------------------------------------------------------
from pathlib import Path                 # required for DATA_PATH below

DATA_PATH = Path("smoking.csv")          # update if you store the CSV elsewhere
smoking_df = pd.read_csv(DATA_PATH)      # load the CSV into a pandas DataFrame

# ------------------------------------------------------------------
# Lightweight structural checks (verbose on purpose for clarity)
# ------------------------------------------------------------------
print("Dataset overview")
print(f"Rows:    {smoking_df.shape[0]}")
print(f"Columns: {smoking_df.shape[1]}")
print("Column names, non-null counts, and dtypes:")
smoking_df.info()  # prints directly and returns None, so no display() wrapper
print("Summary stats for numeric + categorical")
display(smoking_df.describe(include="all").T)  # include="all" also covers non-numeric columns
print("First 5 rows")
display(smoking_df.head())  # first 5 rows

## 1. Question 1 - Logistic Regression Walkthrough

Suppose we collect data for a group of students in a course with variables $X_1$ = hours studied, $X_2$ = cumulative GPA, and $y = 1$ if the student receives an A in this course and $y = 0$ otherwise. We trained a logistic regression model on this dataset to predict whether a student receives an A from features $X_1$ and $X_2$; the estimated parameters are $w_0 = -7$, $w_1 = 0.05$, $w_2 = 1$. Answer the following questions with either Python or manual calculation.

In [None]:
w0 = -7.0   # intercept term
w1 = 0.05   # coefficient for X1 = hours studied
w2 = 1.0    # coefficient for X2 = cumulative GPA

### (a) What is the probability that a student who studies for 40 hours and has a cumulative GPA of 3.5 gets an A in the class?   

In [None]:
hours_example = 40        # X1 from the problem statement
gpa_example = 3.5         # X2 from the problem statement

# Compute the predicted score using the linear model
predicted_score = w0 + w1 * hours_example + w2 * gpa_example

# Compute the predicted probability using the logistic function (sigmoid)
predicted_probability = 1 / (1 + np.exp(-predicted_score))

print("Part (a)")
print(f"  The predicted probability that a student who studies for {hours_example} hours and has a cumulative GPA of {gpa_example} gets an A in the class is approximately {predicted_probability:.2f}")

Part (a)
  The predicted probability that a student who studies for 40 hours and has a cumulative GPA of 3.5 gets an A in the class is approximately 0.18
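
For readers who prefer the manual route, the same result can be worked by hand from the parameters given in the problem statement. This is a sketch using the standard logistic function, with intermediate values rounded to 2 decimal places:

```latex
\begin{aligned}
z &= w_0 + w_1 X_1 + w_2 X_2 = -7 + 0.05(40) + 1(3.5) = -1.5 \\
P(y = 1 \mid X_1 = 40,\ X_2 = 3.5) &= \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{1.5}} \approx \frac{1}{1 + 4.48} \approx 0.18
\end{aligned}
```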


### (b) How many hours would the student in step (a) need to study to have a 50% chance of getting an A in the class? 

In [None]:
# Solve w0 + w1 * hours + w2 * GPA = target_logit for hours when GPA stays at 3.5
# (logit = 0 corresponds to P = 0.5).
target_probability = 0.50
target_logit = np.log(target_probability / (1 - target_probability))  # equals 0.0
hours_for_half = (target_logit - w0 - w2 * gpa_example) / w1

print("Part (b)")
print(f"  Hours needed for 50% chance (same GPA): {hours_for_half:.2f}")

Part (b)
  Hours needed for 50% chance (same GPA): 70.00
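
As a sanity check, the answer can be plugged back into the model to confirm that it yields a probability of 0.50. The sketch below is self-contained and uses only the parameters from the problem statement; the helper names `hours_for_probability` and `predict_probability` are illustrative and not part of the assignment.

```python
import math

# Parameters from the problem statement.
w0, w1, w2 = -7.0, 0.05, 1.0


def hours_for_probability(target_p, gpa):
    """Hours of study at which the model predicts probability target_p (illustrative helper)."""
    target_logit = math.log(target_p / (1 - target_p))
    return (target_logit - w0 - w2 * gpa) / w1


def predict_probability(hours, gpa):
    """Logistic-regression probability of an A for the given inputs (illustrative helper)."""
    z = w0 + w1 * hours + w2 * gpa
    return 1 / (1 + math.exp(-z))


hours_needed = hours_for_probability(0.50, gpa=3.5)
check_p = predict_probability(hours_needed, gpa=3.5)

print(f"Hours needed for 50% chance: {hours_needed:.2f}")
print(f"Probability at that many hours: {check_p:.2f}")
```

Writing the solver in terms of a target probability rather than hard-coding a logit of 0 makes it easy to ask the same question for other thresholds.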


### (c) What are the odds ratio and log-odds for the student in (a)? And what does the odds ratio mean?

In [None]:
# Odds (p / (1 - p)) equal exp(log-odds) for logistic regression outputs.
odds_a = np.exp(predicted_score)

print("Part (c)")
print(f"  Odds (p/(1-p)) for these inputs: {odds_a:.2f}")
print(f"  Log-odds: {predicted_score:.2f}")
print(f"  Interpretation: For every 1 student predicted to earn an A, ~{(1 / odds_a):.2f} students are predicted not to (given these inputs).")

Part (c)
  Odds (p/(1-p)) for these inputs: 0.22
  Log-odds: -1.50
  Interpretation: For every 1 student predicted to earn an A, ~4.48 students are predicted not to (given these inputs).
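
The cell above reports the odds for a single student. The term "odds ratio" is also commonly used for the ratio of two odds, for example the multiplicative change in the odds when one feature increases by one unit while the other is held fixed. In a logistic regression this equals the exponential of that feature's coefficient. The sketch below, which assumes that reading of the question, computes both ratios from the stated parameters and cross-checks one of them against the odds at 40 and 41 hours.

```python
import math

# Parameters from the problem statement.
w0, w1, w2 = -7.0, 0.05, 1.0


def odds(hours, gpa):
    """Odds p / (1 - p) of an A, which equal exp(log-odds) (illustrative helper)."""
    return math.exp(w0 + w1 * hours + w2 * gpa)


odds_ratio_per_hour = math.exp(w1)       # one extra hour studied, GPA held fixed
odds_ratio_per_gpa_point = math.exp(w2)  # one extra GPA point, hours held fixed

# Cross-check: ratio of the odds at 41 hours to the odds at 40 hours (GPA 3.5).
empirical_ratio = odds(41, 3.5) / odds(40, 3.5)

print(f"Odds ratio per extra hour studied: {odds_ratio_per_hour:.2f}")
print(f"Odds ratio per extra GPA point:    {odds_ratio_per_gpa_point:.2f}")
print(f"Odds(41 h) / Odds(40 h):           {empirical_ratio:.2f}")
```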


### (d) Visualize the model (the linear hyperplane) with a simple line plot where $X_1$ is on the $x$-axis and $X_2$ on the $y$-axis. Also indicate the region for A grade (positive, $y = 1$) and the region for non-A grade (negative) in the figure. (Hint: you can draw the plot manually or with the help of any software.)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Decision boundary: w0 + w1*X1 + w2*X2 = 0  =>  X2 = -(w0 + w1*X1) / w2
# With w0 = -7, w1 = 0.05, w2 = 1 this is X2 = 7 - 0.05*X1.
x1 = np.linspace(0, 140, 200)  # hours studied; the boundary reaches X2 = 0 at X1 = 140
x2 = -(w0 + w1 * x1) / w2      # GPA on the boundary, where P = 0.5

# Create the plot
plt.figure(figsize=(10, 8))
plt.plot(x1, x2, 'b-', label='Decision Boundary (P = 0.5)')

# Fill regions: above the line the logit is positive (A), below it is negative (non-A)
plt.fill_between(x1, x2, 8, alpha=0.3, color='green', label='A Grade (y=1)')
plt.fill_between(x1, x2, 0, alpha=0.3, color='red', label='Non-A Grade (y=0)')

# Add labels and title
plt.xlabel('X₁ (hours studied)')
plt.ylabel('X₂ (cumulative GPA)')
plt.title('Linear Decision Boundary for Grade Classification')
plt.grid(True)
plt.legend()

# Set axis limits to the plotted range
plt.xlim(0, 140)
plt.ylim(0, 8)

plt.show()
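
For the parameters in the problem statement, the decision boundary is the line $w_0 + w_1 X_1 + w_2 X_2 = 0$, that is, $X_2 = 7 - 0.05 X_1$. Points above it fall in the A region and points below it in the non-A region. The sketch below checks which side a few points fall on; the sample points other than the student from part (a) are made up for illustration, and `predicted_region` is a hypothetical helper that assumes a 0.5 probability threshold.

```python
# Parameters from the problem statement.
w0, w1, w2 = -7.0, 0.05, 1.0


def predicted_region(hours, gpa):
    """Return the region a point falls in, assuming a 0.5 threshold (logit > 0 means A)."""
    score = w0 + w1 * hours + w2 * gpa
    return "A (y=1)" if score > 0 else "non-A (y=0)"


# (hours studied, cumulative GPA); the first pair is the student from part (a),
# the others are illustrative values chosen to land on either side of the line.
sample_points = [(40, 3.5), (100, 3.5), (20, 2.0)]

regions = [predicted_region(hours, gpa) for hours, gpa in sample_points]
for (hours, gpa), region in zip(sample_points, regions):
    print(f"hours={hours}, GPA={gpa}: {region}")
```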