In [None]:
.libPaths('/gpfs/projects/datascience/shared/R/Data4ML')
library('randomForest')

# Classification Algorithms

All machine learning is trying to approximate a function that maps some inputs to outputs.


Classification is a type of machine learning task that aims to predict which class an example belongs to from a set of features. Inputs vary, but the output is typically the probability that an example belongs to each class.

Examples
* From pixels in an image, predict whether it's a dog or a cat
* From social media text, determine if a statement is positive or negative. 
* From measurements of an animal, determine its species


Two main categories
* Binary classification - You have only two classes: Positive/Negative, True/False, Cat/Dog
  * Target values can be a 0 or a 1
  * Target values can also be a string label such as Dog/Cat

* Multiclass Classification - You have several classes: Cat/Dog/Bird, 0/1/2/3/4/5/6...
  * Target values can be expressed in several ways
      * A string with a class's label
      * A number labeling the class
      * A one-hot encoded vector, which is a vector with length equal to the number of classes, with exactly one entry set to one and the rest set to zero.
          * e.g., for three classes
          * The first class is represented as  [1,0,0]
          * The second class is represented as  [0,1,0]
          * The third class is represented as  [0,0,1]
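
As a minimal sketch of the one-hot encoding described above, the base R snippet below turns a factor of class labels into a matrix with one row per example. The labels are made up for illustration and are not part of any dataset used in this notebook.

```R
# Hypothetical labels for three classes (illustration only)
labels <- factor(c("cat", "dog", "bird", "dog"),
                 levels = c("cat", "dog", "bird"))

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing by the integer class codes yields one row per example
one_hot <- diag(nlevels(labels))[as.integer(labels), ]
colnames(one_hot) <- levels(labels)
one_hot
```

Each row contains exactly one 1, in the column of that example's class.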






# Example Datasets
Let's look at some example datasets to see if they are AI-Ready

* Are there clear data splits?
* Are there missing values?
* Is there at least one clear target?
* Is the dataset Tidy? 
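
A few of these checks can be sketched in base R. The data frame below is a made-up stand-in (its column names and values are assumptions for illustration), and the checks are a starting point rather than a complete test of AI-readiness.

```R
# Toy data frame standing in for a real dataset (values are made up)
toy <- data.frame(
  weight  = c(4.1, 30.2, NA, 25.0, 3.8),
  height  = c(23, 60, 25, NA, 22),
  species = factor(c("cat", "dog", "cat", "dog", "cat"))
)

# Are there missing values? Count them per column
na_counts <- colSums(is.na(toy))
na_counts

# Is there a clear target, and how balanced are its classes?
class_counts <- table(toy$species)
class_counts

# Partial tidy check: one row per example, one column per variable
str(toy)
```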

In [None]:
install.packages('ISLR')

In [None]:
library('ISLR')

# Binary Data 
Which kind of orange juice did a customer buy: Minute Maid or Citrus Hill?

In [None]:
head(OJ)

## Question: is this dataset AI Ready?

# Khan Gene Data

* Divided into: xtrain, xtest, ytrain, ytest
* Inputs: Gene expression values for 2308 genes from tissue samples
* Target: 4 types of small round blue cell tumors




In [None]:
View(Khan)

## Question: is this dataset AI Ready?

In [None]:
library("titanic")



In [None]:
head(titanic_train)
head(titanic_test)

## Question: is this dataset AI Ready?

## It can be hard to find AI Ready datasets 
* That's okay, because most are close enough that we can fix them.

* We're going to take an existing dataset of wine rankings and try a classifier on it
    * Make a test/train split
    * Define a target
    * Check for missing values!
    * Train an ML model!

# Create a Wine-dataset
Read in data from two CSV files and combine them in R (we could do the same in bash).


In [None]:
red<-read.csv('../data/wine/winequality-red.csv',header=TRUE,sep=';')
white<-read.csv('../data/wine/winequality-white.csv',header=TRUE,sep=';')
#Let's combine both datasets and add a column to distinguish red vs white
red$isred <-1
white$isred <-0

In [None]:
head(red)


In [None]:
all=merge(red,white,all=TRUE)
#Check for missing values (is.null() only tests whether the object itself is NULL)
anyNA(all)

## Split the data into 80% train and 20% test
Use the `sample` function to randomly select row indices.
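
Before applying this to the wine data, here is a minimal sketch of an index-based 80/20 split on a toy data frame. The toy data and the seed value are assumptions for illustration; `set.seed` is only there so the split can be repeated.

```R
set.seed(42)  # arbitrary seed, only so the split is repeatable

# Toy data frame standing in for the combined wine table
toy <- data.frame(id = 1:50, value = rnorm(50))

# Pick 80% of the row indices at random, without replacement
n_train <- floor(0.8 * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), size = n_train, replace = FALSE)

# Negative indexing keeps every row that was not selected for training
toy_train <- toy[train_index, ]
toy_test  <- toy[-train_index, ]

# Every row lands in exactly one of the two splits
nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))
```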


In [None]:
?sample

In [None]:
# Randomly Select training indices
train_index <- sample(seq_len(nrow(all)),size=floor(nrow(all)*.8), replace=FALSE )


#Split the data
wine_train <- all[train_index,]
wine_test <- all[-train_index,]


# Classification Algorithms

There are several kinds of classification algorithms
* Logistic Regression 
* Support Vector Machine
* Random Forest
* Deep Neural Networks
* Lots more

It's common to try several to see which one works best; however, there are a few things to think about:
* Not all algorithms do multi-class classification
* Some algorithms are computationally expensive
* Some work better on bigger datasets
* We are working with structured data, but generally, different kinds of data will need different kinds of algorithms

You'll have to balance the tradeoffs for your problem.

Today we will start with Random Forests, which can be used for both binary and multi-class classification. **For now, just think of it as a kind of program that learns from examples.**

## There are some differences in how algorithms work in the two classification cases

* The output for binary classification is a single number between 0 and 1: the probability that the example belongs to the first class (or whichever class is labeled 1/True). The probability that it belongs to the second class is one minus that number.

* The output for multi-class classification is one number **per** class, giving the probability that the example belongs to that class. **The sum of all these numbers must equal one.**

* Binary classification is technically a special case of multi-class classification (with just two classes), but you will commonly see the two treated separately.
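
To make the "sums to one" property concrete, here is a small sketch with made-up numbers. It assumes class probabilities are obtained by normalising per-class vote counts, which is one common way a forest of trees produces them; the vote counts themselves are hypothetical.

```R
# Hypothetical vote counts from 10 trees for 3 examples and 3 classes
votes <- matrix(c(8, 1, 1,
                  2, 5, 3,
                  0, 0, 10),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("cat", "dog", "bird")))

# Divide each row by its total so the entries become class probabilities
probs <- votes / rowSums(votes)
probs
rowSums(probs)

# Binary case: one number is enough, the other follows from it
p_first  <- 0.7
p_second <- 1 - p_first
```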

## Probability to Predictions
Classification models often output probabilities, but to calculate things like accuracy you need to turn each probability into a predicted class:
* Binary classification: predict True if the probability is > 0.5, otherwise False
* Multi-class classification: predict the class with the greatest probability

**You don't have to follow these rules** - you can use your own thresholds if you want to be more confident in your predictions.
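
These rules can be sketched in a few lines of base R. All probabilities and labels below are made up for illustration, and the 0.9 threshold is just one example of a stricter cut-off.

```R
# Binary case: hypothetical probabilities of the positive class and true labels
p_pos <- c(0.92, 0.40, 0.55, 0.10)
truth <- c(TRUE, TRUE, FALSE, FALSE)

pred_default <- p_pos > 0.5   # the usual rule
pred_strict  <- p_pos > 0.9   # only predict TRUE when very confident

# Accuracy is the fraction of predictions that match the true labels
accuracy_default <- mean(pred_default == truth)
accuracy_strict  <- mean(pred_strict == truth)

# Multi-class case: pick the column with the largest probability in each row
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("cat", "dog", "bird")))
pred_class <- colnames(probs)[max.col(probs, ties.method = "first")]
pred_class
```

Changing the threshold trades one kind of error for another, so the best value depends on which mistakes matter more for your problem.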


# Random Forests

In [None]:
?randomForest

There are lots of options, but the simplest usage is to give this function an x and a y.
https://www.rdocumentation.org/packages/randomForest/versions/4.7-1/topics/randomForest
```R
# S3 method for formula
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
# S3 method for default
randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
             max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             weights=NULL,
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
# S3 method for randomForest
print(x, ...)
```

In [None]:
# Let's get our x's and y's
# In this case let's try to predict if a wine is red or white

x<-subset(wine_train, select=-isred)
y<-as.factor(wine_train$isred)

xtest<-subset(wine_test, select=-isred)
ytest<-as.factor(wine_test$isred)




In [None]:
rf=randomForest(x,y,xtest,ytest,ntree=1000,importance=TRUE,keep.forest=TRUE)
rf

# Predictions

In [None]:
#Predict Classes Red or White
outputs=predict(rf,xtest,type='response')
head(outputs)

#Predict Probabilities Red or White
cont_outputs=predict(rf,xtest,type='prob')
head(cont_outputs)


# Check the quality of predictions with a histogram

If everything worked well, the red wines should have probabilities near one and whites near zero

In [None]:
options(repr.plot.width=20,repr.plot.height=10)
ax<-pretty(0:1,n=50)
hRed<-hist(cont_outputs[ytest==1,2],plot=FALSE,breaks=ax)
hWhite<-hist(cont_outputs[ytest==0,2],plot=FALSE,breaks=ax)

c1=rgb(1,.1,.2,alpha=.80)
c2=rgb(1,1.,.2,alpha=.80)

plot(hWhite,col=c2,xlab='P(Red|X)')
plot(hRed, col=c1,add=TRUE)
legend(.8,600,legend=c('Whites','Reds'),fill=c(c2,c1))


In [None]:
#List red wines with the highest probability of being white wines
index<-sort(cont_outputs[ytest==1,1],decreasing=TRUE)
index[1:10]
# Let's look at the most 'white-like' red
# (look it up by row name, since the row number depends on the random split)
all[names(index)[1],]


# Importance

A good next question is why there are some outliers, and how the model decides between what is a red wine and what is a white wine.

* This is another thing to think about when selecting a model: how easy is it to get meaningful information about how it makes decisions?

* Random forests often report **Mean Decrease in Accuracy**, which measures how much worse the classifier gets when the values of a variable are randomly shuffled, so that the variable no longer carries information. If accuracy drops a lot, the variable is considered important.
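
The idea behind this kind of importance score can be sketched by hand. The example below is a simplified illustration, not the `randomForest` implementation: it uses simulated data, a logistic regression as a stand-in classifier, and measures accuracy on the training data rather than on out-of-bag samples. All names in it are hypothetical.

```R
set.seed(1)  # arbitrary seed for repeatability

# Simulated data: x1 drives the class, x2 is pure noise
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x1))

# Stand-in classifier (logistic regression, not a random forest)
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Fraction of rows where the thresholded prediction matches the label
accuracy <- function(d) {
  mean((predict(fit, newdata = d, type = "response") > 0.5) == (d$y == 1))
}
baseline <- accuracy(dat)

# Shuffle one column at a time and record how far accuracy falls
drop_in_accuracy <- sapply(c("x1", "x2"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(shuffled)
})
drop_in_accuracy
```

Shuffling the informative column should hurt accuracy far more than shuffling the noise column, which is the signal an importance table summarises.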

In [None]:

importance(rf)

# Question

Is a higher total sulfur dioxide more or less likely in red wine? 



In [None]:

ax<-pretty(0:500,n=100)

hRed<-hist(red$'total.sulfur.dioxide',plot=FALSE,breaks=ax)
hWhite<-hist(white$'total.sulfur.dioxide',plot=FALSE,breaks=ax)

c1=rgb(1,.1,.2,alpha=.80)
c2=rgb(1,1.,.2,alpha=.80)

plot(hWhite,col=c2)
plot(hRed, col=c1,add=TRUE)


# Exercise 
We had one red wine classified as exceptionally white-like. Using the importance values and the histogram plots you make, can you find out why this red wine was classified as very likely a white wine?

In [None]:
"Your Code"


# Quality Exercise

Now let's try to predict the quality of wine as evaluated by a panel.

* Ratings in the dataset go from 3-9
* How could we turn this into a **binary** classification problem?


In [None]:
"Your Code"

Now train your own random forest and see if you can determine the most important factors influencing wine quality.

In [None]:
"Your Code"

# How much data do you need?

This is a trick question - let's see what happens when we train with different dataset sizes.

In [None]:
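# Test error rate of a random forest for increasing training-set sizes
ntree <- 500
x_axis <- seq(from=10,to=nrow(x),by=100)
y_data <- numeric(length(x_axis))
for (i in seq_along(x_axis)){
    nd <- x_axis[i]
    # Note: a very small subset may contain only one class, which can make randomForest stop with an error
    rf_nd=randomForest(x[1:nd,],y[1:nd],xtest,ytest,ntree=ntree)
    # Test-set error after the final tree (column 1 is the overall error)
    y_data[i]<-rf_nd$test$err.rate[ntree,1]
}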

In [None]:
plot(x_axis,y_data,log='yx',xlab='# of Examples',ylab="Error Rate")

# Significance

Permutation tests can be used to estimate whether the importance values from a random forest are significant, and there is a package for that!
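
The logic of a permutation test can be sketched in base R. This is an illustration of the general idea only, not how `rfPermute` is implemented: the data are simulated, and a simple difference in class means is used as a hypothetical stand-in for a random forest importance score.

```R
set.seed(2)  # arbitrary seed for repeatability

# Simulated data: `signal` differs between the classes, `noise` does not
n <- 100
y <- rep(c(0, 1), each = n)
signal <- rnorm(2 * n, mean = 2 * y)
noise  <- rnorm(2 * n)

# Stand-in "importance" score: absolute difference in class means
score <- function(x, labels) {
  abs(mean(x[labels == 1]) - mean(x[labels == 0]))
}

perm_p_value <- function(x, labels, n_rep = 200) {
  observed <- score(x, labels)
  # Shuffling the labels breaks any real link between x and the class,
  # which gives the scores we would expect to see by chance alone
  null_scores <- replicate(n_rep, score(x, sample(labels)))
  # Fraction of shuffled scores at least as large as the observed one
  (sum(null_scores >= observed) + 1) / (n_rep + 1)
}

p_signal <- perm_p_value(signal, y)
p_noise  <- perm_p_value(noise, y)
c(signal = p_signal, noise = p_noise)
```

A small p-value means the observed score is rarely matched when the labels are shuffled, so it is unlikely to be due to chance.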


In [None]:
 install.packages('rfPermute',lib='/gpfs/projects/datascience/jsearcy/R/x86_64-conda-linux-gnu-library/4.1')

In [None]:
library('rfPermute')


In [None]:
rf_ptest<-rfPermute(x,y,num.cores=4,num.rep=200,ntree=100)

# While you're waiting
Try opening a new terminal and running
```bash
top -u <your user name>
```

In [None]:
importance(rf_ptest)