## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 18
---------------------------------------

GOALS:

1. Practice Logistic Regression
2. Interpret Logistic Regression Results

----------------------------------------------------------


This homework has **1 Exercise** and **1 Challenge Exercise**

### Important Information

- Email: [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
- Office Hours: Duke 209 <a href="https://joannabieri.com/schedule.html"> Click Here for Joanna's Schedule</a>


### Announcements

**Come to Lab!** If you need help we are here to help!

### Day 18 Assignment - same drill.


1. Make sure to **Pull** any new content from the class repo, then **Copy** it over into your working directory.
2. Open the file Day##-HW.ipynb and start doing the problems.
    * You can do these problems as you follow along with the lecture notes and video.
3. Get as far as you can before class.
4. Submit what you have so far **Commit** and **Push** to Git.
5. Take the daily check in quiz on **Canvas**.
6. Come to class with lots of questions!


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression 
from sklearn import metrics

### Data: A collection of Emails

- Emails for the first three months of 2012 for an email account
- Data from 3921 emails and 21 variables on them
- Outcome: whether the email is spam or not
- Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc.


Data Information: https://www.openintro.org/data/index.php?data=email

This lab follows the Data Science in a Box unit "Unit 4 - Deck 6: Logistic regression" by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.

In [2]:
file_name = 'data/email.csv'
DF = pd.read_csv(file_name)

In [3]:
DF

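| | spam | to_multiple | from | cc | sent_email | time | image | attach | dollar | winner | ... | viagra | password | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess | number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T06:16:41Z | 0 | 0 | 0 | no | ... | 0 | 0 | 11.370 | 202 | 1 | 0 | 0 | 0 | 0 | big |
| 1 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T07:03:59Z | 0 | 0 | 0 | no | ... | 0 | 0 | 10.504 | 202 | 1 | 0 | 0 | 0 | 1 | small |
| 2 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T16:00:32Z | 0 | 0 | 4 | no | ... | 0 | 0 | 7.773 | 192 | 1 | 0 | 0 | 0 | 6 | small |
| 3 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T09:09:49Z | 0 | 0 | 0 | no | ... | 0 | 0 | 13.256 | 255 | 1 | 0 | 0 | 0 | 48 | small |
| 4 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T10:00:01Z | 0 | 0 | 0 | no | ... | 0 | 2 | 1.231 | 29 | 0 | 0 | 0 | 0 | 1 | none |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3916 | 1 | 0 | 1 | 0 | 0 | 2012-03-31T00:03:45Z | 0 | 0 | 0 | no | ... | 0 | 0 | 0.332 | 12 | 0 | 0 | 0 | 0 | 0 | small |
| 3917 | 1 | 0 | 1 | 0 | 0 | 2012-03-31T14:13:19Z | 0 | 0 | 1 | no | ... | 0 | 0 | 0.323 | 15 | 0 | 0 | 0 | 0 | 0 | small |
| 3918 | 0 | 1 | 1 | 0 | 0 | 2012-03-30T16:20:33Z | 0 | 0 | 0 | no | ... | 0 | 0 | 8.656 | 208 | 1 | 0 | 0 | 0 | 5 | small |
| 3919 | 0 | 1 | 1 | 0 | 0 | 2012-03-28T16:00:49Z | 0 | 0 | 0 | no | ... | 0 | 0 | 10.185 | 132 | 0 | 0 | 0 | 0 | 0 | small |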


**Exercise 1** Logistic Regression with ONE explanatory variable.

Choose another variable from the data set to use as your explanatory variable and create a Logistic Regression model to predict if an email is spam or not. You should do all of the following:

1. Say what variable you are using to predict spam messages (do some analysis, at minimum a value_counts()). Why do you think this is a good variable to use in predicting if an email is spam?
2. Create and fit a Logistic Regression model.
3. Show the results: intercept, coefficient, basic confusion matrix prediction.
4. What do you think the decision cutoff should be? Update the cutoff and redo the confusion matrix.
5. Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.


**Exercise 2 - challenge** Logistic Regression with MORE THAN ONE explanatory variable.

Try redoing the analysis, but this time add a few more explanatory variables. Again, do some analysis of the variables you are choosing and state why they are a good choice. Then answer questions 1-5 again.

In [4]:
DF.columns


Index(['spam', 'to_multiple', 'from', 'cc', 'sent_email', 'time', 'image',
       'attach', 'dollar', 'winner', 'inherit', 'viagra', 'password',
       'num_char', 'line_breaks', 'format', 're_subj', 'exclaim_subj',
       'urgent_subj', 'exclaim_mess', 'number'],
      dtype='object')

In [7]:
DF['exclaim_mess'].value_counts()


exclaim_mess
0      1435
1       733
2       507
4       190
3       128
       ... 
75        1
89        1
139       1
148       1
947       1
Name: count, Length: 73, dtype: int64

I chose exclaim_mess (the number of exclamation marks in the email message) as my predictor because spam emails often use exclamation marks to grab people's attention, while normal emails use them less often.
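One way to check this reasoning is to compare exclamation-mark counts for spam and non-spam emails directly. The sketch below uses only the nine rows visible in the DF preview above (the name `preview` is made up for illustration), so it shows the mechanics of the comparison and is far too small a sample to support any conclusion about the full data set.

```python
import pandas as pd

# The nine rows shown in the DF preview (spam label and exclaim_mess only).
# This is NOT the full data set; it only illustrates the comparison.
preview = pd.DataFrame({
    'spam':         [0, 0, 0, 0, 0, 1, 1, 0, 0],
    'exclaim_mess': [0, 1, 6, 48, 1, 0, 0, 5, 0],
})

# Summarize exclamation marks separately for non-spam (0) and spam (1)
summary = preview.groupby('spam')['exclaim_mess'].agg(['count', 'mean', 'median'])
print(summary)
```

On the full data, the same call would be `DF.groupby('spam')['exclaim_mess'].agg(['count', 'mean', 'median'])`. If the spam group does not show clearly higher counts, the variable may be a weaker predictor than expected.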

In [9]:
X = DF['exclaim_mess'].values.reshape(-1,1)
y = DF['spam']

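# Fit a linear regression of spam (0/1) on exclaim_mess; the coefficient, intercept, and score below come from this linear fit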
LM = LinearRegression()
LM.fit(X,y)


In [11]:
LM.coef_

array([2.60442421e-05])

In [12]:
LM.intercept_

np.float64(0.09342708895722703)

In [15]:
LM.score(X,y)

2.1183502742827542e-05

A good cutoff would be 0.1: if the predicted spam probability is over 10%, we label the email as spam.
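Applying a cutoff like this could look roughly like the sketch below. It runs on synthetic stand-in data (an assumption, not the email data), and the names `X_demo`, `y_demo`, and `demo_model` are made up for illustration. The key step is `predict_proba`, which returns the predicted probability for each class, so the cutoff can be chosen by hand instead of relying on the 0.5 used by `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Synthetic stand-in data (assumption): about 10% "spam", with spam
# tending to have a larger count in the single explanatory variable.
rng = np.random.default_rng(0)
n = 1000
y_demo = (rng.random(n) < 0.1).astype(int)
X_demo = rng.poisson(lam=np.where(y_demo == 1, 4, 2)).reshape(-1, 1)

demo_model = LogisticRegression()
demo_model.fit(X_demo, y_demo)

# Column 1 of predict_proba is the predicted probability of class 1 (spam)
prob_spam = demo_model.predict_proba(X_demo)[:, 1]

# Label as spam whenever the predicted probability exceeds the cutoff
cutoff = 0.1
pred_custom = (prob_spam > cutoff).astype(int)

# Compare the default (0.5) confusion matrix with the custom-cutoff one
cnf_default = metrics.confusion_matrix(y_demo, demo_model.predict(X_demo))
cnf_custom = metrics.confusion_matrix(y_demo, pred_custom)
print(cnf_default)
print(cnf_custom)
```

With the homework data, `X_demo` and `y_demo` would be replaced by the `X` and `y` built from `DF`. Lowering the cutoff can only move predictions from "not spam" to "spam", so the second column of the confusion matrix can grow but never shrink.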

In [18]:
model = LogisticRegression()
model.fit(X, y)

cnf_matrix = metrics.confusion_matrix(y, model.predict(X))
cnf_matrix


array([[3554,    0],
       [ 367,    0]])

With the default cutoff of 0.5, the model predicts every email as “not spam.” This results in 3554 true negatives and 367 false negatives, meaning the model fails to identify a single spam message. There are no false positives, but this is not a good outcome: detecting spam is more important than avoiding occasional false alarms. A false negative means a spam email slipped through to the inbox, while a false positive would mean a normal email was incorrectly marked as spam. Because all spam emails were missed, the cutoff must be lowered. If exclamation marks are more common in spam, a lower cutoff (such as 0.1) lets the model flag messages with heavier use of exclamation marks as potentially spam. This should reduce false negatives and improve spam detection, at the cost of some additional false positives.
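The false negative and false positive rates can be read off the confusion matrix directly. The sketch below re-enters the matrix printed above by hand and uses scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cnf = np.array([[3554, 0],
                [367, 0]])

# For a 2x2 matrix, ravel() gives the cells in this order
tn, fp, fn, tp = cnf.ravel()

# Share of actual spam emails that the model missed
false_negative_rate = fn / (fn + tp)

# Share of actual non-spam emails that were wrongly flagged
false_positive_rate = fp / (fp + tn)

# Share of all emails classified correctly
accuracy = (tn + tp) / cnf.sum()

print(false_negative_rate, false_positive_rate, round(accuracy, 4))
```

The accuracy is about 0.9064 even though no spam is caught, which is why accuracy alone is misleading when only about 9% of the emails are spam.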

In [19]:
DF[['exclaim_mess','dollar','urgent_subj']].describe()
DF[['exclaim_mess','dollar','urgent_subj']].value_counts()

exclaim_mess  dollar  urgent_subj
0             0       0              1274
1             0       0               645
2             0       0               428
4             0       0               166
3             0       0                93
                                     ... 
13            1       0                 1
              10      0                 1
              12      0                 1
14            5       0                 1
1236          44      0                 1
Name: count, Length: 318, dtype: int64

In [33]:
DF_model2 = DF[['exclaim_mess','urgent_subj','dollar','spam']]

# Get the variables
X = DF_model2['exclaim_mess'].values.reshape(-1,1)
y = DF_model2['spam']

# Fit a linear regression of spam (0/1) on exclaim_mess
LM2 = LinearRegression()
LM2.fit(X,y)

# Save the predicted values to the data frame (use LM2, the model fit in this cell)
DF_model2['prediction'] = LM2.predict(X)

DF_model2
LM2

In [34]:
DF_model2

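| | exclaim_mess | urgent_subj | dollar | spam | prediction |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0.093427 |
| 1 | 1 | 0 | 0 | 0 | 0.093453 |
| 2 | 6 | 0 | 4 | 0 | 0.093583 |
| 3 | 48 | 0 | 0 | 0 | 0.094677 |
| 4 | 1 | 0 | 0 | 0 | 0.093453 |
| ... | ... | ... | ... | ... | ... |
| 3916 | 0 | 0 | 0 | 1 | 0.093427 |
| 3917 | 0 | 0 | 1 | 1 | 0.093427 |
| 3918 | 5 | 0 | 0 | 0 | 0.093557 |
| 3919 | 0 | 0 | 0 | 0 | 0.093427 |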


In [35]:
LM2.coef_

array([2.60442421e-05])

In [39]:
LM2.score(X,y)

2.1183502742827542e-05

In [37]:
LM2.intercept_

np.float64(0.09342708895722703)

In [38]:
model2 = LogisticRegression()
model2.fit(X, y)

cnf_matrix2 = metrics.confusion_matrix(y, model2.predict(X))
cnf_matrix2

array([[3554,    0],
       [ 367,    0]])

For the multivariable logistic regression model, I chose exclaim_mess, urgent_subj, and dollar as predictors because spam emails often contain many exclamation marks, urgency-related subject lines, and monetary references. Note that X in the cell above is built from exclaim_mess alone, so the coefficient, score, and confusion matrix shown match the single-variable model; urgent_subj and dollar need to be included in X for the model to use them. With the default cutoff of 0.5, the confusion matrix is [[3554, 0], [367, 0]], meaning the model classified every email as “not spam,” resulting in 3554 true negatives, 367 false negatives, and no true positives or false positives. This happens because the predicted probabilities are very small, so nothing reaches the spam threshold. To make the model useful, the cutoff must be lowered, for example to around 0.09 (the intercept of the linear fit), which allows the model to flag some messages as spam. Lowering the cutoff reduces false negatives (spam messages missed by the model) at the cost of introducing some false positives, where normal emails are incorrectly marked as spam.
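A minimal sketch of a fit that uses all three predictors is given below. It runs on synthetic stand-in data (an assumption, not the email data), and the names `DF_demo`, `predictors`, and `model_multi` are made up for illustration. The cutoff is set to the share of spam in the data, which is one possible starting point and not a rule.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Synthetic stand-in data (assumption): about 10% spam, with spam tending
# to have more exclamation marks, urgent subjects, and dollar signs.
rng = np.random.default_rng(1)
n = 2000
spam = (rng.random(n) < 0.1).astype(int)
DF_demo = pd.DataFrame({
    'exclaim_mess': rng.poisson(np.where(spam == 1, 5, 2)),
    'urgent_subj': (rng.random(n) < np.where(spam == 1, 0.2, 0.01)).astype(int),
    'dollar': rng.poisson(np.where(spam == 1, 3, 1)),
    'spam': spam,
})

# Pass ALL the explanatory variables as columns of X
predictors = ['exclaim_mess', 'urgent_subj', 'dollar']
X_multi = DF_demo[predictors]
y_multi = DF_demo['spam']

model_multi = LogisticRegression()
model_multi.fit(X_multi, y_multi)

# One coefficient per predictor, on the log-odds scale
coef_table = pd.Series(model_multi.coef_[0], index=predictors)
print(coef_table)

# Use the share of spam in the data as the decision cutoff
base_rate = y_multi.mean()
prob_multi = model_multi.predict_proba(X_multi)[:, 1]
pred_multi = (prob_multi > base_rate).astype(int)

cnf_multi = metrics.confusion_matrix(y_multi, pred_multi)
print(cnf_multi)
```

With the homework data, `X_multi` would be `DF[['exclaim_mess', 'urgent_subj', 'dollar']]` and `y_multi` would be `DF['spam']`. A quick check that the model really is multivariable is that `model_multi.coef_` has three entries instead of one.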