### MATH 3375 Project 4 - Logistic Regression

For this project, we will use a data set with several features of individual e-mails to predict whether the e-mail should be classified as 'spam'. The data set was obtained from the Machine Learning Repository at UC Irvine. 

Below is documentation related to the data set. 

    | SPAM E-MAIL DATABASE ATTRIBUTES 
    |
    | 48 continuous real [0,100] attributes of type word_freq_WORD 
    | = percentage of words in the e-mail that match WORD,
    | i.e. 100 * (number of times the WORD appears in the e-mail) / 
    | (total number of words in e-mail).  A "word" in this case is any 
    | string of alphanumeric characters bounded by non-alphanumeric 
    | characters or end-of-string.
    | Example: word_freq_credit indicates the percentage of words 
    |          in the email that are 'credit'
    |
    | 6 continuous real [0,100] attributes of type char_freq_#
    | = percentage of characters in the e-mail that match a specific 
    | character, given by the table below:

|feature|character description|character|
|-------|---------------------|:---------:|
| char_freq_1 | semicolon | **;** |
| char_freq_2 | left parenthesis | **(** |
| char_freq_3 | left bracket | **[** |
| char_freq_4 | exclamation mark | **!** |
| char_freq_5 | dollar sign | **\$** |
| char_freq_6 | hashtag/pound sign | **#** |


    | i.e. 100 * (number of CHAR occurrences)/(total characters in e-mail)
    | Example: char_freq_4 indicates the percentage of characters 
    |          in the email that are an exclamation mark
    |
    | 1 continuous real [1,...] attribute: capital_run_length_average
    | = average length of uninterrupted sequences of capital letters
    |
    | 1 continuous integer [1,...] attribute: capital_run_length_longest
    | = length of longest uninterrupted sequence of capital letters
    |
    | 1 continuous integer [1,...] attribute: capital_run_length_total
    | = sum of length of uninterrupted sequences of capital letters
    | = total number of capital letters in the e-mail
    |
    | 1 nominal {0,1} class attribute: is_spam
    | = denotes whether the e-mail was considered spam (1) or not (0), 
    | i.e. unsolicited commercial e-mail.  
    |
    | For more information, see file 'spambase.DOCUMENTATION' at the
    | UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
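
To make the percentage definitions above concrete, here is a minimal sketch that computes one word frequency and one character frequency for a made-up e-mail string. The text, the lower-casing step, and the choice to count spaces as characters are assumptions for illustration; the documentation does not say how the original features handled case or whitespace.

```r
# Toy e-mail text (made up for illustration; not taken from the data set)
email <- "Free credit! Check your credit score now!"

# A "word" is a run of alphanumeric characters, so split on everything else.
# Lower-casing first is an assumption, not something the documentation states.
words <- strsplit(tolower(email), "[^[:alnum:]]+")[[1]]
words <- words[words != ""]

# Analogue of word_freq_credit: percentage of words equal to "credit"
word_freq_credit <- 100 * sum(words == "credit") / length(words)

# Analogue of char_freq_4: percentage of all characters that are "!"
# (spaces are counted as characters here, which is also an assumption)
chars <- strsplit(email, "")[[1]]
char_freq_excl <- 100 * sum(chars == "!") / length(chars)

word_freq_credit
char_freq_excl
```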






In [None]:
spam_data <- read.csv("spambase.csv")
head(spam_data)

## Tasks

### Exploratory Data Analysis

##### 1. Use code to find the answers to the following questions.

**Type your answers in this cell.**  

_For an answer to receive credit, the code cell(s) below must have supporting code to show how your answers were obtained._

* How many rows are in the data set?
    <br>Answer:
    
    
* OTHER than the response variable, how many features (columns) are in the data set?
    <br>Answer:
    
    
* What proportion of the data points in the data set are classified as spam?
    <br>Answer:
    
    
* Does the word 'money' show up more frequently in spam e-mails than in non-spam e-mails? (_HINT: Use a plot to visualize this_.)
    <br>Answer:
    

In [None]:
#Put code for Exercise 1 in this cell. You may add additional cells if you like.
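
As a rough illustration of the kinds of calls that can answer the questions above, the sketch below runs on a small simulated data frame rather than on `spam_data`. The data frame `toy` and all of its values are hypothetical; only the column-naming pattern mirrors the documentation.

```r
set.seed(1)

# Simulated stand-in for the real data (all values are made up)
toy <- data.frame(
  word_freq_money = c(rexp(40, rate = 2), rexp(60, rate = 10)),
  char_freq_4     = runif(100),
  is_spam         = rep(c(1, 0), times = c(40, 60))
)

n_rows     <- nrow(toy)               # number of rows
n_features <- ncol(toy) - 1           # columns other than the response
prop_spam  <- mean(toy$is_spam == 1)  # proportion of rows labelled spam

# Side-by-side boxplots of one feature, split by class
boxplot(word_freq_money ~ is_spam, data = toy,
        xlab = "is_spam", ylab = "word_freq_money")

# Group means as a numeric companion to the plot
tapply(toy$word_freq_money, toy$is_spam, mean)
```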


### Partitioning Data

##### 2. Partition the data set into training and test sets.

Partition the data set into a training set (80% of the data) and a test set (20% of the data).  Note that you should train your model(s) _**using ONLY the training data**.  We are saving the test data set to evaluate performance of the models._

In [None]:
#Put code for Exercise 2 in this cell. You may add additional cells if you like.
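
One common way to make an 80/20 split in base R is to sample row indices at random. The sketch below does this on a simulated data frame; the names `toy`, `train_set`, and `test_set` and the seed value are placeholders, not requirements of the assignment.

```r
set.seed(3375)  # any fixed seed makes the split reproducible

# Simulated stand-in data (values are made up)
toy <- data.frame(x = rnorm(50), is_spam = rbinom(50, 1, 0.4))

n         <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))  # 80% of row indices

train_set <- toy[train_idx, ]   # rows used to fit the model
test_set  <- toy[-train_idx, ]  # held-out rows used only for evaluation
```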



### Create a Model

##### 3. Create a Logistic Regression Model with Five Predictors

Create a Logistic Regression model to predict whether an e-mail is spam. The model should meet the following guidelines:

* The model should use only FIVE of the features in the data set as predictors
* At least TWO of the features should be from the set of **word_freq_xxxx** features
* At least ONE of the features should be from the set of **char_freq_x** features
* At least ONE of the features should be from the set of **capital_run_length_xxxx** features

Beyond the guidelines above, the choice of predictors is up to you.  Show your model summary after you have created the model.

In [None]:
#Put solution to Exercise 3 in this cell. You may add additional cells if you like.
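
For orientation, a logistic regression in base R is typically fitted with `glm()` and `family = binomial`. The sketch below uses two simulated predictors (`x1`, `x2`) and a simulated response `y`, all hypothetical; a model for this exercise would instead use five columns chosen from the training set.

```r
set.seed(42)
n <- 200

# Simulated predictors and a binary response generated from a known logistic model
toy   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

# family = binomial requests logistic regression (logit link by default)
fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Coefficient estimates, standard errors, z values, and p-values
summary(fit)
```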


### Interpret the Model

##### 3a. Answer the following questions about the model.

(Type your answers in this cell.)

* Which predictors are significant, and at what significance level?
    <br>Answer:
    
    
* Use the model coefficients in your summary to complete the following equation for using the model to predict the **_probability_** that an e-mail is spam. Note that the equation is written in LaTeX.

$$P(Y=1|x_1,x_2,x_3,x_4,x_5) = $$
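
For reference, a logistic regression model with five predictors has the general form sketched below, where $\beta_0,\dots,\beta_5$ stand in for the intercept and coefficient estimates reported in your own model summary (the symbols are placeholders, not values from any fitted model):

$$P(Y=1|x_1,x_2,x_3,x_4,x_5) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5)}}$$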

### Predictions

##### 4. Generate Probability Predictions

Use your model to generate probability predictions for records **_in the test set_**. Display a few rows showing the predicted probability **_and_** the value of the **is_spam** response variable for each record. (Suggestion: Create a dataframe and then show only the first few rows, NOT the entire dataframe.) 

In [None]:
#Put solution to Exercise 4 in this cell. You may add additional cells if you like.
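
A minimal sketch of probability prediction with `predict()` follows, again on simulated data with hypothetical names (`toy`, `x`, `results`). The key detail is `type = "response"`, which returns probabilities instead of log-odds.

```r
set.seed(7)
n <- 200

# Simulated data with one predictor and a binary response (made up)
toy         <- data.frame(x = rnorm(n))
toy$is_spam <- rbinom(n, 1, plogis(0.3 + 1.5 * toy$x))
train_set   <- toy[1:160, ]
test_set    <- toy[161:200, ]

# Fit on the training rows only
fit <- glm(is_spam ~ x, data = train_set, family = binomial)

# Predicted probabilities for the held-out rows, next to the true labels
results <- data.frame(
  prob    = predict(fit, newdata = test_set, type = "response"),
  is_spam = test_set$is_spam
)
head(results)
```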


##### 5. Explore Classification Thresholds

Test several thresholds for classification, where probability > threshold results in a POSITIVE (spam) classification, and any other probability results in a NEGATIVE (non-spam) classification.  **_Use at least the values 0.3, 0.4, 0.5, 0.6, and 0.7_** as thresholds. You may also use others if you choose. 

For each threshold, you should do the following:

* Generate binary (0/1) predictions for the **_test_** data set.
* Create and display an ROC plot. 
* Display the Area Under the Curve (AUC) value for the plot.

In [None]:
#Put solution to Exercise 5 in this cell. You may add additional cells if you like.
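
The steps above can be sketched in base R as follows, using made-up probabilities and labels rather than real model output. When the predictor is a 0/1 classification, the ROC curve has a single interior point, so the area under it reduces to (sensitivity + specificity) / 2; the sketch uses that shortcut. Packages such as pROC offer ready-made ROC and AUC functions and may be a more convenient choice.

```r
# Made-up predicted probabilities and true labels, for illustration only
prob   <- c(0.10, 0.35, 0.45, 0.55, 0.65, 0.80, 0.20, 0.90)
actual <- c(0,    0,    1,    0,    1,    1,    0,    1)

thresholds <- c(0.3, 0.4, 0.5, 0.6, 0.7)

auc_by_threshold <- sapply(thresholds, function(t) {
  pred <- as.integer(prob > t)  # binary prediction at this threshold
  tpr  <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # sensitivity
  fpr  <- sum(pred == 1 & actual == 0) / sum(actual == 0)  # 1 - specificity

  # ROC curve for a 0/1 predictor: (0,0) -> (fpr, tpr) -> (1,1)
  plot(c(0, fpr, 1), c(0, tpr, 1), type = "l",
       xlab = "False positive rate", ylab = "True positive rate",
       main = paste("Threshold =", t))
  abline(0, 1, lty = 2)  # chance line for reference

  # Trapezoid area under the three-point curve
  (tpr + 1 - fpr) / 2
})
names(auc_by_threshold) <- thresholds
auc_by_threshold
```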


##### 5a. Identify the Best Threshold

Based on your results above, which threshold is the most suitable for predicting classification? (Give your answer in this cell with a VERY brief justification.)

### Final Model Metrics

##### 6. Create a Confusion Matrix 

Using the threshold you selected above, create and display a confusion matrix for the classifications predicted with that threshold for the **_test_** data set.


In [None]:
#Put solution to Exercise 6 in this cell. You may add additional cells if you like.
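
One way to build a confusion matrix in base R is to cross-tabulate predicted and actual classes with `table()`. The sketch below uses made-up values, and the threshold of 0.5 is only a placeholder for whichever value you selected.

```r
# Made-up predicted probabilities and true labels, for illustration only
prob      <- c(0.10, 0.35, 0.45, 0.55, 0.65, 0.80, 0.20, 0.90)
actual    <- c(0, 0, 1, 0, 1, 1, 0, 1)
threshold <- 0.5  # placeholder; substitute the threshold you chose

predicted <- as.integer(prob > threshold)

# Fixing the factor levels keeps the table 2 x 2 even if a class is never predicted
conf_mat <- table(
  Predicted = factor(predicted, levels = c(0, 1)),
  Actual    = factor(actual,    levels = c(0, 1))
)
conf_mat
```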


##### 7. Compute Model Metrics

Using the confusion matrix you created in Exercise 6, compute the following metrics.  Write your answers in this cell.

* Accuracy
    <br>Answer:
    
    
* Sensitivity
    <br>Answer:
    
    
* Specificity
    <br>Answer:
    
    
* Precision
    <br>Answer:
    
    

In [None]:
#Put solution to Exercise 7 in this cell. You may add additional cells if you like.
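
The four metrics follow directly from the cell counts of a confusion matrix. The sketch below uses made-up counts (`TP`, `FP`, `FN`, `TN` are hypothetical values, not results from the spam data) to show the standard definitions.

```r
# Hypothetical confusion-matrix counts, for illustration only
TP <- 30  # spam correctly classified as spam
FP <- 5   # non-spam incorrectly classified as spam
FN <- 10  # spam incorrectly classified as non-spam
TN <- 55  # non-spam correctly classified as non-spam

accuracy    <- (TP + TN) / (TP + FP + FN + TN)  # share of all e-mails classified correctly
sensitivity <- TP / (TP + FN)                   # share of actual spam that was caught
specificity <- TN / (TN + FP)                   # share of actual non-spam left alone
precision   <- TP / (TP + FP)                   # share of predicted spam that really is spam

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision)
```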


##### 8. Create a Calibration Plot

Using the **_test data set_**, create a calibration plot.

In [None]:
#Put solution to Exercise 8 in this cell. You may add additional cells if you like.
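
A calibration plot compares predicted probabilities with observed outcome rates. One simple construction, sketched below on simulated predictions, groups the probabilities into bins and plots the mean prediction in each bin against the observed proportion of positives. The ten equal-width bins and all variable names are assumptions; other binning schemes are equally reasonable.

```r
set.seed(11)

# Simulated predictions that are well calibrated by construction (illustration only)
prob   <- runif(500)
actual <- rbinom(500, 1, prob)

# Ten equal-width probability bins
bins <- cut(prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

mean_pred <- tapply(prob, bins, mean)    # average predicted probability per bin
obs_rate  <- tapply(actual, bins, mean)  # observed proportion of positives per bin

plot(mean_pred, obs_rate, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability",
     ylab = "Observed proportion of positives")
abline(0, 1, lty = 2)  # points near this line indicate good calibration
```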


##### 8a. Evaluate the Quality of Your Model

Type your answers in this cell.

1. Interpret the calibration plot. What does it tell you?
2. Interpret the metrics you computed in Exercise 7. What do they tell you?
3. Based on both the metrics AND the calibration plot, what is your assessment of the model overall? Explain.

