# Assignment 3 - Logistic Regression

Objective:

In this assignment, you will implement logistic regression using Python to solve a binary classification problem. You will explore how logistic regression works, evaluate its performance using different metrics, and interpret the results.


Instructions:

Complete the tasks below. Submit your code (in a Jupyter notebook or Python script) and include explanations for each task where appropriate.


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt

## Part 1: Logistic Regression Concepts (15 points)

**Question: Write a brief description of logistic regression and its purpose. Explain why it is used for classification tasks and how it differs from linear regression.**

YOUR ANSWER:

**Question: What is a sigmoid function?**

YOUR ANSWER:


**Question: How does logistic regression handle classification problems (i.e. what is a threshold)?**

YOUR ANSWER:

## Part 2: Logistic Regression with Sklearn (55 points)

Use the Titanic Dataset from the Seaborn library, which provides information on passengers such as age, gender, class, and whether they survived or not. The goal is to predict whether a passenger survived (1) or did not survive (0) based on these features.

### A. Load the data
You may need to read up on the dataset to understand which column is the target variable and which columns are the feature variables.


In [3]:
import seaborn as sns
df = sns.load_dataset('titanic')
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


**Question: What is the Titanic dataset?**

YOUR ANSWER:

### Train/Test Split the Data (5 points)


Split the data into training and testing sets. Use the training set to perform EDA (exploratory data analysis) and to fit the model. Predict on the test set and evaluate model performance.
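As a rough illustration of the split described above, here is a minimal sketch on toy data. The names `X_demo` and `y_demo`, the 80/20 ratio, the `random_state`, and the use of `stratify` are all assumptions chosen for the example, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and target (hypothetical values)
X_demo = pd.DataFrame({"feature_a": np.arange(10), "feature_b": np.arange(10) * 2.0})
y_demo = pd.Series([0, 1] * 5, name="label")

# test_size, random_state, and stratify are illustrative choices
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_train_demo.shape, X_test_demo.shape)
```

Fixing `random_state` makes the split reproducible between runs, and `stratify` keeps the class balance similar in both sets.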

In [4]:
X = df[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = df['survived']

In [None]:
# YOUR CODE HERE

### B. Preprocessing (10 points)

1. Handle missing data

2. Apply one-hot encoding

A template has been provided for you. Modify this code so that it runs fit/transform on the training set and **only transform** on the test set.
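The fit-on-train, transform-on-test pattern can be sketched on a small toy example. The frames `train_demo` and `test_demo`, the column names, and the `reindex` step used to align the one-hot columns are assumptions made for illustration; they are one possible approach, not the required solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the train and test features
train_demo = pd.DataFrame({"age": [20.0, np.nan, 40.0], "port": ["S", "C", "S"]})
test_demo = pd.DataFrame({"age": [np.nan, 30.0], "port": [np.nan, "C"]})

num_imputer_demo = SimpleImputer(strategy="mean")
cat_imputer_demo = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training set only, then reuse them on the test set
train_demo[["age"]] = num_imputer_demo.fit_transform(train_demo[["age"]])
test_demo[["age"]] = num_imputer_demo.transform(test_demo[["age"]])

train_demo["port"] = cat_imputer_demo.fit_transform(train_demo[["port"]]).ravel()
test_demo["port"] = cat_imputer_demo.transform(test_demo[["port"]]).ravel()

# One-hot encode, then align the test columns to the training columns
train_enc = pd.get_dummies(train_demo, columns=["port"])
test_enc = pd.get_dummies(test_demo, columns=["port"]).reindex(
    columns=train_enc.columns, fill_value=0
)

print(test_enc)
```

Here the missing test age is filled with the training mean (30.0) and the missing test port with the most frequent training value ("S"), so no information from the test set leaks into the preprocessing.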

In [8]:
from sklearn.impute import SimpleImputer

# Step 1: Handle missing values
# Define the imputers used below
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

# Impute numerical features ('age', 'fare') with the mean
X.loc[:, ['age', 'fare']] = num_imputer.fit_transform(X[['age', 'fare']])

# Impute categorical features ('embarked') with the most frequent value
X['embarked'] = cat_imputer.fit_transform(X[['embarked']]).ravel()

# Step 2: Convert categorical variables into numeric using one-hot encoding
X = pd.get_dummies(X, columns=['sex', 'embarked'], drop_first=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['embarked'] = cat_imputer.fit_transform(X[['embarked']]).ravel()


### C. Fit a Logistic Regression Model (5 points)
Create a logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# YOUR CODE HERE


### D. Score & Evaluate the Model (5 points)
Run predict on the fitted model. Pass the predictions to a scoring function.
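To show what the scoring functions return, here is a small sketch using hand-written labels rather than real model output. The lists `y_true_demo` and `y_pred_demo` are made-up values for illustration only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (not from the Titanic data)
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

# Accuracy: fraction of predictions that match the true labels (3 of 5 here)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Confusion matrix: rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(acc_demo)
print(cm_demo)
```

In the confusion matrix, the off-diagonal entries count the false positives (top right) and false negatives (bottom left), which helps when describing the situations in which a model was wrong.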

In [None]:
# YOUR CODE HERE


### E. Visualize the Results (10 points)
This part is open to your own approach.

In [None]:
# YOUR CODE HERE


**Question (5 points):** How did your model perform? In what situations was it wrong?

YOUR ANSWER:

**Question (5 points):** What are the weights and bias of your model? What is the formula?

YOUR ANSWER:

In [None]:
# YOUR CODE HERE

### F. Grid Search over Hyperparameters (15 points)

Use GridSearchCV to find the best hyperparameters (e.g., solver, regularization strength). Select the best model from the grid search object.
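A minimal sketch of a grid search over logistic regression hyperparameters follows. It uses synthetic data from `make_classification`; the grid values, `cv=5`, `scoring="accuracy"`, and `max_iter=1000` are illustrative assumptions, not recommended settings for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the real training set)
X_gs_demo, y_gs_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values are illustrative; C is the inverse regularization strength
param_grid_demo = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["lbfgs", "newton-cg"],
}

grid_demo = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid_demo,
    cv=5,
    scoring="accuracy",
)
grid_demo.fit(X_gs_demo, y_gs_demo)

# The refit model with the best cross-validated score
best_model_demo = grid_demo.best_estimator_
print(grid_demo.best_params_, round(grid_demo.best_score_, 3))
```

By default `GridSearchCV` refits the best parameter combination on the full data passed to `fit`, so `best_estimator_` can be used directly for prediction.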

In [9]:
# YOUR CODE HERE

### G. Analyze Feature Importance (10 points)

Analyze which features are most important in predicting survival. This part is up to you.
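One possible approach, sketched here on synthetic data, is to rank features by the magnitude of their coefficients. Standardizing the features first (an assumption of this sketch) puts the coefficients on a comparable scale; the feature names and data below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names
X_fi_demo, y_fi_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names_demo = [f"feature_{i}" for i in range(4)]

# Scale features so that coefficient magnitudes are comparable
X_fi_scaled = StandardScaler().fit_transform(X_fi_demo)
model_fi_demo = LogisticRegression().fit(X_fi_scaled, y_fi_demo)

# Sort coefficients by absolute value, largest first
importance_demo = pd.Series(model_fi_demo.coef_[0], index=feature_names_demo)
importance_demo = importance_demo.sort_values(key=np.abs, ascending=False)

print(importance_demo)
```

The sign of each coefficient indicates the direction of the effect on the log-odds, while the absolute value indicates its strength relative to the other scaled features.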


**Question (5 points):** Which features were most important in predicting survival?

YOUR ANSWER:

## Part 3: The Sigmoid Function (25 points)

The purpose of this part is to build a deeper understanding of activation functions, specifically the sigmoid function used in logistic regression and neural networks. You will implement a custom sigmoid function in Python and use it to compute outputs based on provided weights and bias values.

Review of the Sigmoid Function.

The sigmoid function maps any real-valued input to the range (0, 1), making it useful for binary classification tasks.
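For reference, the standard logistic sigmoid is usually written as below, where $z$ is the weighted sum of the inputs plus the bias. The symbols $w_i$, $x_i$, and $b$ follow the weights, inputs, and bias used in this part.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
$$

Since $e^{-z} > 0$ for every real $z$, the denominator is always greater than 1, which is why the output stays strictly between 0 and 1.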

Your task:
* Code the formula of the sigmoid function in the return statement of the function below.
* Test the function. Match the expected output.
* Modify the inputs to match the parameters of your trained model above. Test it on the first row of the test set.

Don't forget to answer the questions.

In [10]:
import numpy as np

def sigmoid(weights, bias, inputs):
    """
    Custom implementation of the sigmoid function.

    Parameters:
    weights (list or np.array): The weights for each input feature.
    bias (float): The bias term.
    inputs (list or np.array): The input features.

    Returns:
    float: The sigmoid output in the range (0, 1).
    """
    # Compute the weighted sum: z = w1*x1 + w2*x2 + ... + wn*xn + b
    z = np.dot(weights, inputs) + bias

    # Apply the sigmoid function
    # YOUR CODE HERE - FINISH THE FORMULA
    return


### Test the function

Given the following weights, bias, and input features, use your custom sigmoid function to compute the output.

Expected Output: 0.3543
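As a sanity check on the intermediate step, the weighted sum `z` for the test values can be computed on its own. This sketch only covers the `np.dot` step and uses `_demo` names (an assumption) to avoid clashing with the variables in the cell below.

```python
import numpy as np

# Same values as in the test cell
weights_demo = np.array([0.2, -0.5, 0.3])
bias_demo = 0.4
inputs_demo = np.array([1.5, 2.0, -1.0])

# z = 0.2*1.5 + (-0.5)*2.0 + 0.3*(-1.0) + 0.4 = 0.3 - 1.0 - 0.3 + 0.4
z_demo = float(np.dot(weights_demo, inputs_demo) + bias_demo)

print(round(z_demo, 4))
```

The result is approximately -0.6. A negative `z` corresponds to a sigmoid output below 0.5, which is consistent with the expected output of 0.3543.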

In [12]:
weights = [0.2, -0.5, 0.3]
bias = 0.4
inputs = [1.5, 2.0, -1.0]
sigmoid(weights, bias, inputs)

0.35434369377420455

**Question (5 points):** If the output of your sigmoid function is 0.7, what does this value represent in terms of a binary classification task (e.g., predicting if a student passed or failed an exam)?

YOUR ANSWER:



**Question (5 points):** What happens when the weighted sum z becomes very large (e.g., positive 10) or very small (e.g., negative 10)? How does the sigmoid function behave in such cases?

YOUR ANSWER:

### Modify the Inputs

Change the weights and bias values to match those of your model trained in Part 2. Make the inputs match the first row of the test set.
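A fitted scikit-learn `LogisticRegression` stores its weights in `coef_` and its bias in `intercept_`. The sketch below, on synthetic data with hypothetical `_demo` names, shows how to pull them out and check the weighted sum against the model's own `decision_function`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a preprocessed training set
X_wb_demo, y_wb_demo = make_classification(
    n_samples=100, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
model_wb_demo = LogisticRegression().fit(X_wb_demo, y_wb_demo)

# For a binary problem, coef_ has shape (1, n_features) and intercept_ has shape (1,)
w_demo = model_wb_demo.coef_[0]
b_demo = model_wb_demo.intercept_[0]
x_demo = X_wb_demo[0]  # first row of the data

# The manual weighted sum should match the model's decision_function
z_manual = np.dot(w_demo, x_demo) + b_demo
z_sklearn = model_wb_demo.decision_function(X_wb_demo[:1])[0]

print(z_manual, z_sklearn)
```

If your test set is a pandas DataFrame, the first row can be taken with `.iloc[0]`, and its column order must match the order of the columns the model was fitted on.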

In [None]:
# YOUR CODE HERE

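weights = ...  # replace with your model's weights
bias = ...     # replace with your model's bias
inputs = ...   # replace with the first row of the test set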
sigmoid(weights, bias, inputs)

**Question (5 points):** Does the probabilistic output for this row match the probabilistic output from your sklearn model? If not, what are the different values?

YOUR ANSWER:

In [None]:
# YOUR CODE HERE
