# Logistic-Regression Practice

We will use a version of the famous Titanic data set that requires very little cleaning.

In [4]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

Read in the data set.

In [7]:
t_df = pd.read_csv('titanic_data.csv', index_col='PassengerId')
t_df = t_df.dropna()

Remove columns that don't make reasonable numeric predictors.

In [10]:
t_df.drop(columns=['Name', 'Cabin', 'Ticket'], inplace=True)

Convert the remaining columns to use numeric labels.

In [13]:
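# Assign the result back; calling replace(..., inplace=True) on a selected
# column is chained assignment and may not update t_df in recent pandas.
t_df['Sex'] = t_df['Sex'].map({'male': 1, 'female': 0})
t_df['Embarked'] = t_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})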
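
Integer codes for `Embarked` impose an ordering (S < C < Q) that the ports do not really have. One common alternative is one-hot encoding, sketched below on a small made-up frame. The name `toy` and its contents are hypothetical and are not taken from `titanic_data.csv`; one level is dropped so the indicators are not collinear with the constant term added later.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                    'Embarked': ['S', 'C', 'Q', 'S']})

# Binary column: a single 0/1 indicator is enough
toy['Sex'] = toy['Sex'].map({'male': 1, 'female': 0})

# Three unordered categories: one indicator per level, minus a reference level
toy = pd.get_dummies(toy, columns=['Embarked'], drop_first=True, dtype=int)
print(toy)
```

With `drop_first=True` the first level in sorted order (`C` here) becomes the reference category, so the remaining coefficients are read relative to it.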

Extract the dependent and independent variables.

In [16]:
X = t_df.drop(columns=['Survived'])
y = t_df['Survived']

Split training and test sets.

Note that we are _practicing to learn_, not creating a product, so we do not distinguish between validation and test sets.

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
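
The split above is random, so each run produces different sets and a different fitted model. If repeatable results are wanted, a possible variation is to pass `random_state` and, optionally, `stratify` to keep the class balance the same in both sets. The sketch below uses made-up arrays (`X_toy`, `y_toy` are hypothetical), and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# random_state fixes the shuffle; stratify preserves the share of each class
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te))        # set sizes
print(y_tr.mean(), y_te.mean())    # share of positives in each set
```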

### Run everything up to this point and check the variable explorer for the following.
#### Do you have distinct training and test sets for the independent and dependent variables? Put the answer in your Jupyter notebook. Include the sizes of the sets in cardinality and percentage.

#### Look at the two training sets and at least one test set to verify they contain what you expect.
Are there any issues? Put the answer in your Jupyter notebook. Include an explanation or discussion if necessary.
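
If you prefer to run these checks in code rather than in the variable explorer, something like the following sketch could work. It runs on a made-up data set (`X_toy` and `y_toy` are hypothetical names); the same checks should carry over to `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the Titanic file
X_toy = pd.DataFrame({'a': range(10), 'b': range(10, 20)})
y_toy = pd.Series([0, 1] * 5, name='label')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0)

# Cardinality and percentage of each set
total = len(X_toy)
for name, part in [('train', X_tr), ('test', X_te)]:
    print(f'{name}: {len(part)} rows ({len(part) / total:.0%})')

# No row should appear in both sets, and each X must line up with its y
disjoint = X_tr.index.intersection(X_te.index).empty
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)
print('disjoint:', disjoint, 'aligned:', aligned)
```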

In [32]:
logmodel = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=False)
print(logmodel.summary())

                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  128
Model:                          Logit   Df Residuals:                      120
Method:                           MLE   Df Model:                            7
Date:                Tue, 29 Oct 2024   Pseudo R-squ.:                  0.3126
Time:                        13:27:59   Log-Likelihood:                -57.049
converged:                       True   LL-Null:                       -82.996
Covariance Type:            nonrobust   LLR p-value:                 6.124e-09
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.2842      1.372      3.124      0.002       1.596       6.972
Pclass        -0.3167      0.507     -0.624      0.532      -1.311       0.678
Sex           -2.9603      0.602     -4.917      0.0

### Are there any predictors that are not statistically significant in the conventional sense?
Put the answer in your Jupyter notebook.

A variable is conventionally statistically significant if its _p_ value is less than 0.05. (Do you know why?)

### What variable is particularly strong in predicting survival?
Put the answer in your Jupyter notebook.

### What does a negative coefficient imply and why?
Put the answer in your Jupyter notebook.
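
As a starting point for this question, recall that logit coefficients are on the log-odds scale, so exponentiating a coefficient gives an odds ratio for a one-unit increase in that predictor. The sketch below does this for the two coefficients visible in the summary output earlier in this notebook; the dictionary `coefs` is just a hand-copied illustration, and your own values will differ because the split is random.

```python
import math

# Coefficients copied by hand from the printed summary (log-odds scale)
coefs = {'Pclass': -0.3167, 'Sex': -2.9603}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)   # multiplicative change in the odds per unit
    print(f'{name}: coef={beta:+.4f}, odds ratio={odds_ratio:.3f}')
```

An odds ratio below 1 means the odds of the outcome coded 1 shrink as the predictor increases; remember how `Sex` was coded when interpreting it.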

### Based on your discussion, first think about which other variable ought to be a decent predictor.

### Next, check the report output to see if that was the case.
Enter what variable you thought might be a good predictor and whether that turned out to be the case.

## Next, we will learn about the quality of our predictions on the test set.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Form our predictions, convert continuous [0, 1] predictions to binary
predictions = logmodel.predict(sm.add_constant(X_test))
bin_predictions = [1 if x >= 0.5 else 0 for x in predictions]

# We can now assess the accuracy and print out the confusion matrix
print(accuracy_score(y_test, bin_predictions))
print(confusion_matrix(y_test, bin_predictions))
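
To read the confusion matrix, it helps to know that scikit-learn puts the true labels on the rows and the predicted labels on the columns, so a binary matrix is laid out as `[[TN, FP], [FN, TP]]`. The sketch below unpacks the four cells and derives a few metrics from them; `y_true` and `y_pred` are made-up lists, not output from the Titanic model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that were found

print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```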

## Discussion

### There is another way to evaluate our model... for a variety of thresholds.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, predictions)
roc_auc = roc_auc_score(y_test, predictions)

plt.plot(fpr, tpr, label='ROC Curve (area = %0.3f)' % roc_auc)
plt.legend(loc='lower right')  # without a legend the label above is never drawn
plt.title('ROC Curve (area = %0.3f)' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
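
Each point on an ROC curve is the (false positive rate, true positive rate) pair obtained at one decision threshold. The sketch below computes a few such points by hand so you can see what `roc_curve` does across all thresholds. The arrays `y_true` and `scores` and the helper `rates_at` are hypothetical and are not part of the Titanic model.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def rates_at(threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    pred = scores >= threshold
    tpr_t = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr_t = (pred & (y_true == 0)).sum() / (y_true == 0).sum()
    return fpr_t, tpr_t

# Lower thresholds catch more positives but also raise more false alarms
for t in (0.25, 0.5, 0.75):
    fpr_t, tpr_t = rates_at(t)
    print(f'threshold={t:.2f}  FPR={fpr_t:.2f}  TPR={tpr_t:.2f}')
```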