## Logistic Regression

In this notebook, we walk step by step through fitting a logistic regression model to the Titanic dataset. We cover the following steps:

1. Loading and inspecting the dataset.
2. Preprocessing: Handling missing values and encoding categorical data.
3. Splitting the data into training and testing sets.
4. Training a logistic regression model.
5. Evaluating the model's performance.
6. Examining and interpreting the model's coefficients and predicted probabilities.

## **Step 1: Load and Inspect the Data**
We'll use the Titanic dataset, a popular dataset for predicting passenger survival based on various features.

In [10]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Load Titanic dataset in seaborn
df = sns.load_dataset('titanic')

# Inspect the dataset
print(df.head())
# The model below uses only the columns up to and including fare;
# the columns that follow fare are not used

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


## **Step 2: Preprocess the Data**
Before training our model, we must prepare the dataset. This step involves two key processes:

### 2.1 Handle Missing Values
scikit-learn's `LogisticRegression` does not accept missing values, so we drop the rows where `age` is missing (177 of 891). `embarked` is missing two values, but it is not used as a feature here, so those rows are kept.
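
A minimal sketch of `dropna(subset=...)` on a toy frame (hypothetical values, not the Titanic data): only rows missing a value in the listed columns are dropped, and gaps in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing age and one missing embarked
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'embarked': ['S', 'C', None, 'S'],
    'survived': [0, 1, 1, 1],
})

# Count missing values per column before cleaning
missing_before = toy.isna().sum()
print(missing_before)

# Drop only the rows where 'age' is missing; gaps in other columns are kept
cleaned = toy.dropna(subset=['age'])
print(cleaned)
```

Checking `isna().sum()` before and after is a quick way to confirm that the cleaning step removed exactly the rows intended.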

### 2.2 Encode Categorical Variables
scikit-learn models require numerical inputs, so the categorical feature we use (`sex`) must be converted into numeric form using one-hot encoding.

- Why encoding?
    - Converts categorical labels into numeric values that the model can interpret.
    - Avoids misinterpretation of categorical variables as numeric variables (e.g., treating "male" as numerically greater or less than "female").
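
As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` to a toy frame (hypothetical values, not the Titanic data), with and without `drop_first=True`. The `dtype=int` argument is used here only so the indicator columns print as 0/1.

```python
import pandas as pd

# Toy data (hypothetical values), not the Titanic dataset
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'age': [22, 38, 26],
})

# One indicator column per category: sex_female and sex_male
full = pd.get_dummies(toy, columns=['sex'], dtype=int)

# drop_first=True drops the first category (female), leaving only sex_male
encoded = pd.get_dummies(toy, columns=['sex'], drop_first=True, dtype=int)

print(full)
print(encoded)
```

With two categories, one indicator column carries all the information: `sex_male = 0` already means female, so keeping both columns would make them perfectly collinear.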

In [None]:
# Handling missing values
df.info()
# age is missing 177 of the 891 observations
df = df.dropna(subset = ['age'])
# drops every row where age is missing


# Encoding categorical variables
# drop_first = True drops one dummy column (sex_female) to avoid the "dummy variable trap"
df = pd.get_dummies(df, columns = ['sex'], drop_first = True)

# Define features and target
features = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_male']
X = df[features]
y = df['survived']

# Preview the cleaned dataset
print(X.head())
print(y.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
   pclass   age  sibsp  parch     fare  sex_male
0       3  22.0      1      0 

## **Step 3: Split the Data**
We divide our dataset into two subsets:
- *Training Set*: To build our logistic regression model.
- *Testing Set*: To evaluate the performance of our model.

In [12]:
from sklearn.model_selection import train_test_split
# Split dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 42)
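
One optional variation, sketched here on synthetic stand-in data rather than the Titanic frame, is to pass `stratify` so that both subsets keep the same share of each class. The names `X_demo` and `y_demo` are hypothetical, and stratification is an assumption of this sketch, not something the split above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class (hypothetical)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # sizes of the two subsets
print(y_tr.mean(), y_te.mean())  # share of positives in each subset
```

Printing the shapes and class shares after any split is a cheap sanity check that the test set is the expected size and not badly imbalanced.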


## **Step 4: Train the Logistic Regression Model**
We'll now build and train our logistic regression model using the training data.

In [13]:
from sklearn.linear_model import LogisticRegression
# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
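
The warning above suggests two remedies: raising `max_iter` or scaling the features. The sketch below shows both together on synthetic stand-in data with features on very different scales. The data and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical, and this is one possible fix rather than the one used to produce the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features on very different scales (hypothetical data)
rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=300)
fare = rng.uniform(0, 500, size=300)
X_demo = np.column_stack([age, fare])
# Synthetic binary target loosely tied to fare, with noise
y_demo = (fare + rng.normal(0, 100, size=300) > 250).astype(int)

# Standardize the features, then fit; max_iter is also raised as a safety margin
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = scaled_model.named_steps['logisticregression'].n_iter_[0]
print(n_iter)
```

Applied to this notebook, the same pipeline would be fitted with `X_train` and `y_train`. Note that coefficients fitted on standardized features are expressed per standard deviation rather than per original unit, so they would not match coefficients from an unscaled fit.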


## **Step 5: Evaluate the Model**
To understand how well our model predicts survival, we use the following evaluation metrics:
- *Accuracy*: Proportion of correct predictions.
- *Confusion Matrix*: Breakdown of predictions into true positives, true negatives, false positives, and false negatives.
- *Classification Report*: Precision, recall, and F1-score.
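
To make the definitions concrete, the metrics listed above can be computed by hand from the four confusion-matrix counts. The counts below are hypothetical and are not results from the Titanic model.

```python
# Hypothetical confusion-matrix counts (not results from the Titanic model)
tp, tn, fp, fn = 40, 70, 15, 18

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted survivors, share who survived
recall = tp / (tp + fn)                     # of actual survivors, share the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```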

In [None]:


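from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on test data
y_pred = model.predict(X_test)

# Calculate accuracy
print('Accuracy:', accuracy_score(y_test, y_pred))


# Generate confusion matrix
print(confusion_matrix(y_test, y_pred))

# Display classification report
print(classification_report(y_test, y_pred))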


## **Step 6: Examining and Interpreting the Model**
After training our logistic regression model, let's examine the coefficients to understand their meaning in more detail.
- Each coefficient is the change in the log-odds of the target (survival in this case) for a one-unit increase in that feature, holding the other features constant.
- Positive coefficients increase the odds of the outcome (survival).
- Negative coefficients decrease the odds of the outcome (survival).
- The magnitude of a coefficient indicates how strongly the feature moves the log-odds per unit, so magnitudes are directly comparable only between features measured on similar scales.

### 6.1 Interpreting Coefficients

The fitted coefficients are summarized below. A positive coefficient raises the log-odds (and thus the probability) of survival; a negative coefficient lowers them.



| Feature       | Coefficient | Impact on Survival Probability | Explanation                                                              |
|---------------|-------------|--------------------------------|--------------------------------------------------------------------------|
| `pclass`      | -1.21       | Decreases                      | Lower passenger classes (higher numeric values) decrease survival odds.  |
| `age`         | -0.04       | Slight negative impact         | Older passengers have slightly lower odds of survival.                   |
| `sibsp`       | -0.35       | Negative impact                | Having more siblings/spouses aboard decreases survival odds.             |
| `parch`       | -0.05       | Slight negative impact         | Having more parents/children aboard slightly reduces survival odds.      |
| `fare`        | 0.002       | Very slight positive impact    | Paying higher fares slightly increases survival odds.                    |
| `sex_male`    | -2.61       | Strong negative impact         | Being male greatly decreases the probability of survival.                |
| **Intercept** | 5.47        | Baseline log-odds              | Log-odds of survival when every feature equals 0 (a female passenger with `pclass`, `age`, `sibsp`, `parch`, and `fare` all set to 0); a reference point rather than a realistic passenger. |
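
As a worked example, the coefficients in the table can be turned into odds ratios with `exp(coefficient)` and into a survival probability with the logistic function. The sketch below copies the rounded values from the table; the two passengers and the helper `survival_probability` are hypothetical and chosen only for illustration.

```python
import math

# Coefficients and intercept copied (rounded) from the table above
coef = {'pclass': -1.21, 'age': -0.04, 'sibsp': -0.35,
        'parch': -0.05, 'fare': 0.002, 'sex_male': -2.61}
intercept = 5.47

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratios = {name: math.exp(value) for name, value in coef.items()}

def survival_probability(passenger):
    """Apply the logistic (sigmoid) function to the linear predictor."""
    log_odds = intercept + sum(coef[name] * passenger[name] for name in coef)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two hypothetical passengers chosen for illustration
woman_first = {'pclass': 1, 'age': 30, 'sibsp': 0, 'parch': 0, 'fare': 50, 'sex_male': 0}
man_third = {'pclass': 3, 'age': 30, 'sibsp': 0, 'parch': 0, 'fare': 8, 'sex_male': 1}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
print(f"woman, first class: {survival_probability(woman_first):.3f}")
print(f"man, third class:   {survival_probability(man_third):.3f}")
```

An odds ratio below 1 means the odds shrink as the feature increases. For `sex_male`, `exp(-2.61)` is roughly 0.07, so the model puts the odds of survival for a man at about 7% of those for an otherwise identical woman.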


In [None]:
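# Extract coefficients and intercept
coefficients = pd.Series(model.coef_[0], index = features)
intercept = model.intercept_[0]

# Display coefficients
print(coefficients)
print('Intercept:', intercept)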


### 6.2 Understanding the `predict_proba` Function
Logistic regression outputs a probability between 0 and 1 for each class.
- `predict_proba()` returns one row per observation and two columns, ordered as in `model.classes_`:
    - Probability of class 0 (not survived).
    - Probability of class 1 (survived).
- The two probabilities in each row sum to 1.
- Probabilities support more informed decisions than binary predictions alone, for example by applying a threshold other than the default 0.5.
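
The behaviour described above can be checked on a small synthetic problem. In this sketch the data, the name `clf`, and the 0.7 threshold are all hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data (hypothetical): larger x makes class 1 more likely
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_demo = (x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(x, y_demo)
proba = clf.predict_proba(x)  # shape (n_samples, 2)

print(clf.classes_)  # column order of predict_proba
print(proba[:5])

# Each row sums to 1; predict() picks the class with the higher probability
row_sums = proba.sum(axis=1)
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

# A custom, stricter threshold (0.7 is an arbitrary illustration)
strict_labels = (proba[:, 1] >= 0.7).astype(int)
print(strict_labels.sum(), clf.predict(x).sum())
```

Raising the threshold above 0.5 labels fewer observations as class 1, which typically trades recall for precision.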

In [None]:
# Get predicted probabilities for the test set
probabilities = model.predict_proba(X_test)

# Display probabilities for first 5 test observations
print(probabilities[:5])
