# Homework Week 1

## Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a
classification model would be appropriate. List some (up to 5) predictors that you might use.

### Summary
A good application of a classification model is at my current job at ExecOnline, an EdTech startup that provides MBA courses to large enterprises. Part of our business is forecasting how many students will enroll in each of our programs each semester. Enrollment is decided by the enterprise rather than by the student, so the model would classify enterprises by their characteristics and behaviors. Different types of companies have different enrollment tendencies. For example, a company like Citigroup enrolls students in all 4 semesters and puts at least 25 people through, while a company like Akamai Technologies typically participates only once a year with around 8 students. Classifying enterprises into groups would help us predict their behavior based on how those groups have participated in the past.

### Predictors
* **Company Size (Revenue):** Company size is very important in determining which group an enterprise fits into. Companies with much larger revenue typically have more money to spend on professional development and tend to have larger contracts with more consistent student enrollment.
* **Industry:** Industry is also a useful characteristic for classifying enterprises, because behavior varies by industry. For example, the finance and banking industry has large, well-established companies that have focused on developing new leaders for a long time, while industries like technology, with much newer companies like Twitter, are just dipping their toes into professional development and are still in the experimentation phase.
* **Average Student Enrollment:** How many students an enterprise has put through each semester in the past is a strong indicator of how it will participate in the future. This predictor goes hand in hand with characteristics like company size because it can tell us where a client is in its relationship with us (whether it is a new or old client). Newer clients are usually testing the waters and behave differently from 3-year-old clients who know how they are going to use us.
* **Participation Frequency:** The number of semesters in which an enterprise participates each year is another useful behavioral predictor. It sheds light on how an enterprise has structured its professional development program. An enterprise that participates in all 4 semesters is likely focused on upskilling many new leaders, for example to counter employee turnover, while an enterprise that participates once a year is more likely focused on promoting and training new leaders it hopes to retain.

In [1]:
# Load necessary libraries
library(dplyr)
library(kernlab)
library(kknn)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Question 2.2.1
In this problem I use the support vector machine function ksvm to find a good classifier for the given credit card data, then analyze how well each C value classifies the data points. As a first step, I wrote a loop that tests C values of increasing magnitude, from 10^1 to 10^10.

In [3]:
# Read data in and convert to matrix for use in ksvm
# (read.csv already returns a data frame, so the deprecated tbl_df() wrapper is not needed)
data <- read.csv("credit_card_data-headers.txt", header = TRUE, sep = "\t")
data <- as.matrix(data)

for (i in 1:10) {
  c_test_val <- 10^i
  # Create ksvm model with test C value
  model <- ksvm(data[,1:10],data[,11],type="C-svc",kernel="vanilladot",C=c_test_val,scaled=TRUE)
  
  # Get model predictions
  pred <- predict(model,data[,1:10])
  
  # Get model prediction accuracy
  pred_accuracy = sum(pred == data[,11]) / nrow(data)
  print(pred_accuracy)
}

 Setting default kernel parameters  
[1] 0.8639144
 Setting default kernel parameters  
[1] 0.8639144
 Setting default kernel parameters  
[1] 0.8623853
 Setting default kernel parameters  
[1] 0.8623853
 Setting default kernel parameters  
[1] 0.8639144
 Setting default kernel parameters  
[1] 0.6253823
 Setting default kernel parameters  
[1] 0.5458716
 Setting default kernel parameters  
[1] 0.6636086
 Setting default kernel parameters  
[1] 0.8027523
 Setting default kernel parameters  
[1] 0.4923547


Prediction accuracy is essentially flat, between about 0.862 and 0.864, for C values from 10 through 10^5, and it drops sharply and becomes erratic once C exceeds 10^5. Additionally, model fitting slows down significantly at higher C values, so for the sake of speed and prediction accuracy I will keep the originally suggested value of 100.
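To make the comparison across C values easier to read, the accuracies printed by the loop can be collected into a table. The snippet below is a minimal sketch that copies the ten printed values by hand; the names `results` and `near_best` and the 0.005 tolerance are illustrative choices, not part of the original analysis.

```r
# Accuracies copied from the loop output, for C = 10^1 through 10^10
c_values <- 10^(1:10)
accuracy <- c(0.8639144, 0.8639144, 0.8623853, 0.8623853, 0.8639144,
              0.6253823, 0.5458716, 0.6636086, 0.8027523, 0.4923547)
results <- data.frame(C = c_values, accuracy = accuracy)

# C values whose accuracy is within 0.005 of the best observed accuracy
# (the tolerance is an arbitrary, illustrative threshold)
near_best <- results$C[results$accuracy >= max(results$accuracy) - 0.005]
print(near_best)
```

Under this tolerance, the C values from 10 through 10^5 all qualify, which matches the flat region in the printed output.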

In [4]:
# call ksvm. Final column is response variable
model <- ksvm(data[,1:10],data[,11],type="C-svc",kernel="vanilladot",C=100,scaled=TRUE)

# Get model predictions
pred <- predict(model,data[,1:10])

# see what fraction of the model’s predictions match the actual classification
sum(pred == data[,11]) / nrow(data)

 Setting default kernel parameters  


In [11]:
# calculate a1…am
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])

# calculate a0
a0 <- -model@b

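# Evaluate a1*x1 + ... + am*xm + a0 for each support vector (scaled data);
# the classifier's boundary is where this expression equals 0
line_equation <- as.vector(model@xmatrix[[1]] %*% a + a0)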

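Multiplying each row of xmatrix (the support vectors) by its entry in coef and summing down the columns gives the linear combination of data points that defines a1,...,am. I used the xmatrix attribute because the model stores these data points in scaled form, so the resulting coefficients apply to the scaled predictors.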
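The relationship between the support vectors, their coefficients, and a1,...,am can be checked on a tiny example. The sketch below uses made-up support vectors, coefficients, and offset (all hypothetical, not taken from the fitted model) and assumes the same sign convention as above, where a0 is the negative of the stored offset b. It shows that the explicit linear form gives the same value as summing each coefficient times the dot product of its support vector with the point.

```r
# Hypothetical support vectors (one per row) in scaled predictor space
sv <- matrix(c( 1.0,  0.5,
               -0.5,  1.5,
                0.0, -1.0), ncol = 2, byrow = TRUE)

# Hypothetical coefficients (one per support vector) and offset b
coefs <- c(0.8, -0.3, -0.5)
b <- 0.2

# Weight each support vector by its coefficient, then sum down the columns
a <- colSums(sv * coefs)
a0 <- -b

# A new point, assumed to be already scaled
x_new <- c(0.4, -0.2)

# Decision value from the explicit linear form a1*x1 + a2*x2 + a0
linear_form <- sum(a * x_new) + a0

# Decision value from the expansion over support vectors
kernel_form <- sum(coefs * (sv %*% x_new)) - b

print(c(linear_form, kernel_form))
```

Both expressions agree, which is why the column sums of the weighted support vectors can be read as the coefficients of the separating hyperplane.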

## Question 2.2.2
In this problem I am using the k-nearest-neighbors classification function kknn to classify data points in the supplied credit data set. I wrote a loop to test different k values to identify which one would produce the highest prediction accuracy.

In [6]:
# Read data in again as df
# (read.csv already returns a data frame, so the deprecated tbl_df() wrapper is not needed)
data <- read.csv("credit_card_data-headers.txt", header = TRUE, sep = "\t")

# Create df for match ratios
match_ratios = data.frame(k = integer(15),
                          matches = integer(15))

# Loop through k values
for (k_val in 1:15) {
  
  # populate k values into match_ratio df
  match_ratios$k[k_val] <- k_val
  
  # Loop through each row of the dataframe to fit a model to each one
  for (i in 1:nrow(data)) {
    kknn_model = kknn(R1~A1+A2+A3+A8+A9+A10+A11+A12+A14+A15,
                      data[-i,],
                      data[i,],
                      k=k_val,
                      distance = 2,
                      kernel = "optimal",
                      scale = TRUE)
    
    # Compare prediction with true value
    fit_val <- fitted.values(kknn_model)
    actual_val <- data[i, 11]
    
    # If you find a match add to the count in the match_ratios df
    if(round(fit_val) == actual_val) {
      match_ratios$matches[k_val] <- match_ratios$matches[k_val] + 1
    }
  }
}

# Create prediction ratios
match_ratios$prediction_accuracy <- match_ratios$matches / nrow(data)
match_ratios

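| k | matches | prediction_accuracy |
|---|---------|---------------------|
| 1 | 533 | 0.8149847 |
| 2 | 533 | 0.8149847 |
| 3 | 533 | 0.8149847 |
| 4 | 533 | 0.8149847 |
| 5 | 557 | 0.851682 |
| 6 | 553 | 0.8455657 |
| 7 | 554 | 0.8470948 |
| 8 | 555 | 0.8486239 |
| 9 | 554 | 0.8470948 |
| 10 | 556 | 0.8501529 |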
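The leave-one-out idea behind the loop (predict each row from all the other rows, then count matches) can be sketched in base R. This is a simplified illustration under stated assumptions: it uses simulated two-cluster toy data rather than the credit card data, and a plain unweighted majority vote rather than kknn's kernel-weighted vote. The function name `loo_accuracy` is hypothetical.

```r
# Toy data only: two 2-D clusters with 0/1 labels (not the credit card data)
set.seed(1)
n_per_class <- 20
x <- rbind(matrix(rnorm(2 * n_per_class, mean = 0), ncol = 2),
           matrix(rnorm(2 * n_per_class, mean = 4), ncol = 2))
y <- rep(c(0, 1), each = n_per_class)

# Leave-one-out accuracy for an unweighted k-nearest-neighbors vote
loo_accuracy <- function(x, y, k) {
  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  diag(d) <- Inf           # a point may not be its own neighbor
  hits <- 0
  for (i in seq_len(nrow(x))) {
    nn <- order(d[i, ])[1:k]    # indices of the k nearest other points
    pred <- round(mean(y[nn]))  # majority vote; an exact 0.5 tie rounds to 0
    hits <- hits + (pred == y[i])
  }
  hits / nrow(x)
}

# Accuracy for k = 1..5
acc <- sapply(1:5, function(k) loo_accuracy(x, y, k))
print(acc)
```

Setting the diagonal of the distance matrix to `Inf` plays the same role as passing `data[-i,]` as the training set: the held-out point never votes for itself.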


In [7]:
# Find highest prediction accuracy
highest_accuracy <- max(match_ratios$prediction_accuracy)

# Find row with highest prediction accuracy
which(match_ratios$prediction_accuracy == highest_accuracy)


The highest prediction accuracy occurs at k values of 12 and 15. Since both produce the same accuracy, I will choose the lower value, k = 12.
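Choosing the smallest k among tied maxima can be written directly. The sketch below uses a short, hypothetical accuracy vector (illustrative values only, not the results above) to show that `which()` returns every tied position, while taking the minimum, or calling `which.max()`, returns the first one.

```r
# Hypothetical accuracies for k = 1..6 (illustrative values only)
acc <- c(0.81, 0.85, 0.84, 0.85, 0.83, 0.82)

# Every k tied for the highest accuracy
best_k_all <- which(acc == max(acc))

# Prefer the smallest k among the ties
best_k <- min(best_k_all)

print(best_k_all)
print(best_k)
```

`which.max(acc)` gives the same answer as `min(which(acc == max(acc)))` because it returns the first position of the maximum.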