[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28IX%29%20-%20Task%209%20-%20Train%20Logistic%20Model.ipynb)

This notebook provides a mini-tutorial on training and evaluating a logistic regression model on the Titanic training data.

In [None]:
%%time
import datetime
print("Current date and time: ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
First, import the Python packages we need. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include these lines at the top of all my notebooks.

In [None]:
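# Options reference: http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)      # show every column
pd.set_option('display.max_colwidth', 250)      # show up to 250 characters per cell
pd.set_option('display.max_info_columns', 500)  # let .info() list up to 500 columns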

### Read in the Titanic Training Data

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train.head(2)

# Train and Evaluate a Logistic Regression Model 
Now we are ready to train our model. That is the focus of Task 9 in the fourth coding assignment, which has the following requirements:

- Train a **logistic regression model** using the training data.  
- The predictor variables (`X`) used in this model are **Age, Female, and Fare**.  
- Use the trained model to make **predictions** on the test set.  
- Evaluate the model’s performance using the **accuracy** score.  
- *Hint*: Make sure to include the proper import statements from `sklearn`.


### What is Logistic Regression?  
Logistic Regression is a commonly used machine learning algorithm for **binary classification** problems, meaning it predicts one of two possible outcomes.  

For the Titanic dataset, the model will predict:  
- **`1` (Survived)**  
- **`0` (Did not survive)**  

Since our target variable (`Survived`) is binary, logistic regression is a suitable choice.
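
To make the link between a numeric score and a binary outcome concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression applies to a weighted sum of the features. The intercept, weights, and passenger values are made-up numbers chosen for illustration; they are not fitted coefficients.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and weights for (Age, Female, Fare); NOT fitted values
intercept = -1.0
weights = [-0.01, 2.0, 0.01]
passenger = [30.0, 1.0, 50.0]  # age 30, female, fare 50

# Linear score: intercept + sum of (weight * feature value)
score = intercept + sum(w * x for w, x in zip(weights, passenger))
probability = sigmoid(score)

# Classify as 1 (Survived) when the probability exceeds 0.5
prediction = 1 if probability > 0.5 else 0
print(round(probability, 3), prediction)
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is the natural cutoff between the two classes.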

---

### Step 1: Import Required Libraries  
We need to import:  
- **`LogisticRegression`** from `sklearn.linear_model` – to create the model.  
- **`accuracy_score` and `classification_report`** from `sklearn.metrics` – to evaluate performance.  


```python
from sklearn.linear_model import LogisticRegression  
from sklearn.metrics import accuracy_score, classification_report
```

---

### Step 2: Train the Logistic Regression Model  

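We train the model using the **training dataset (`X_train`, `y_train`)**, which is produced by the `train_test_split` call in the code walkthrough at the end of this notebook.  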

```python
model = LogisticRegression()  
model.fit(X_train, y_train)  
```

### What’s Happening Here?
- `LogisticRegression()` initializes the model.  
- `.fit(X_train, y_train)` trains the model on our features (`Age`, `Female`, `Fare`) and labels (`Survived`).  
- The model **learns patterns** from the training data to predict survival.  

---

### Step 3: Make Predictions  

Now, we use the trained model to predict survival on the **validation set (`X_val`)**.  

```python
val_predictions = model.predict(X_val)  
```

- `model.predict(X_val)` generates predictions for unseen data.  
- The output is an array of `0`s and `1`s, representing survival predictions.  
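
Behind each `0` or `1` is an estimated probability. The sketch below, which uses a tiny made-up one-feature dataset rather than the Titanic data, shows how `predict_proba` exposes that probability and how a 0.5 cutoff reproduces the labels returned by `predict` for a binary model. The names `X_toy`, `y_toy`, and `toy_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset (one feature), purely for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# Column 1 of predict_proba holds the estimated probability of class 1
proba_class1 = toy_model.predict_proba(X_toy)[:, 1]

# Thresholding at 0.5 gives the same labels as .predict() for a binary model
manual_labels = (proba_class1 > 0.5).astype(int)
print(manual_labels)
print(toy_model.predict(X_toy))
```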

---

### Step 4: Evaluate Model Performance  

We assess the model’s accuracy using `accuracy_score`:  

```python
print("Accuracy:", accuracy_score(y_val, val_predictions), '\n')  
```

- **Accuracy** = (Correct Predictions / Total Predictions)  
- A higher accuracy means the model correctly predicts survival more often.  
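
The accuracy formula is simple enough to compute by hand. This sketch uses made-up labels and predictions for eight passengers (not results from the model) to show the calculation that `accuracy_score` performs.

```python
import numpy as np

# Made-up labels and predictions for eight passengers (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

correct = (y_true == y_pred).sum()   # number of matching entries
accuracy = correct / len(y_true)     # correct predictions / total predictions
print(correct, accuracy)
```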

For a more detailed analysis, we can also print a **classification report**:  

```python
print(classification_report(y_val, val_predictions))  
```

- This report provides:
  - **Precision** (How many predicted survivors actually survived?)  
  - **Recall** (How many actual survivors were correctly identified?)  
  - **F1-score** (Balance of precision and recall).  
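
As a rough illustration of where these numbers come from, the sketch below computes precision, recall, and F1 for the "survived" class from made-up labels and predictions. The data are invented for the example; `classification_report` reports the same quantities for each class.

```python
# Made-up labels and predictions (illustration only); 1 = survived
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors predicted as survivors
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-survivors predicted as survivors
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors the model missed

precision = tp / (tp + fp)  # share of predicted survivors who actually survived
recall = tp / (tp + fn)     # share of actual survivors the model identified
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```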

---

### Interpreting the Results  

#### Good Model Performance If:
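- Accuracy is **above 70%** (the exact figure varies with the features used), which is clearly better than always predicting `0`: about 62% of passengers in `train.csv` did not survive.  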
- Precision and recall scores are **balanced** across both classes.  

#### Potential Issues If:
- The model predicts **the same class for every passenger** (all `0`s or all `1`s) → It may not be learning properly.  
- Accuracy is **too low (<60%)** → Consider adding more features or tuning the model.  
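
Both warning signs can be checked with a few lines of code. The sketch below defines two hypothetical helpers, `majority_baseline_accuracy` and `predicts_single_class`, and runs them on made-up arrays; in this notebook you would pass `y_val` and `val_predictions` instead.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most common class in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

def predicts_single_class(predictions):
    """True when every prediction is the same label (all 0s or all 1s)."""
    return len(np.unique(predictions)) == 1

# Made-up validation labels and predictions (illustration only)
y_example = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
pred_example = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

print(majority_baseline_accuracy(y_example))
print(predicts_single_class(pred_example))
```

A model whose accuracy is no higher than the majority baseline has not learned anything useful from the features, even if the raw number looks respectable.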

---

### Conclusion  
Logistic regression is a **simple but effective** model for predicting survival. Now that we’ve trained and evaluated our first model, we can refine it in the next steps to improve performance!

---

Below I will walk you through the code. First, we need to fill in the missing values in `Age`, create the `Female` indicator, define `X` and `y`, and generate the train/validation split. 

In [None]:
train['Age'] = train['Age'].fillna(train["Age"].median())
train["Female"] = train["Sex"].map({"female": 1, "male": 0})

In [None]:
X = train[['Age', 'Female', 'Fare']]
y = train['Survived'] 

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for validation; random_state makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

With the split in place, we import the model class and the evaluation functions.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

Next, we *initialize* the model (we are calling it simply `model`) and *train* the model on our three features (`Age`, `Female`, `Fare`) and our outcome variable (`Survived`).

After the model is fit, we generate predictions on the validation set (the 20% of held-out data) and save the predictions in `val_predictions`.

In [None]:
# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

val_predictions = model.predict(X_val) 

Lastly, we evaluate model performance using `accuracy_score` and `classification_report`.

In [None]:
print("Accuracy:", accuracy_score(y_val, val_predictions), '\n')
print(classification_report(y_val, val_predictions))  

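<br>We can also print the model *coefficients* to understand the influence of each variable. Each coefficient is the change in the log-odds of survival associated with a one-unit increase in that variable, holding the other variables constant. Look especially at the *sign* (negative or positive) of each coefficient in order to ascertain whether `Age`, `Female`, and `Fare` are positively or negatively related to the likelihood of survival. Because the three variables are measured on different scales, the raw magnitudes are not directly comparable.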

In [None]:
print("Model coefficients (Based on Age, Female, and Fare):\n")
print('\tIntercept:', model.intercept_[0])
print('\tAge coefficient:', model.coef_[0][0])
print('\tFemale coefficient:', model.coef_[0][1])
print('\tFare coefficient:', model.coef_[0][2], '\n')
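
Exponentiating a logistic regression coefficient turns it into an odds ratio, which is often easier to read than the raw value. The sketch below uses a hypothetical helper, `odds_ratios`, and placeholder coefficients that are not the output of the model above; in this notebook you would pass `model.coef_[0]` instead.

```python
import numpy as np

def odds_ratios(coefficients, names):
    """Return {name: exp(coefficient)} for each feature."""
    return {name: float(np.exp(c)) for name, c in zip(names, coefficients)}

# Placeholder coefficients for illustration only (NOT fitted values)
example_coefs = np.array([-0.5, 1.0, 0.0])
ratios = odds_ratios(example_coefs, ['Age', 'Female', 'Fare'])

for name, ratio in ratios.items():
    # ratio > 1: higher odds of survival; ratio < 1: lower odds; ratio == 1: no change
    print(f'{name}: {ratio:.3f}')
```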