The Analytics Edge, an MITx course, is a good starting point: it is relatively basic but still worth taking. There are also more advanced master's-level courses, such as Introduction to Analytics Modeling, a GTx course that covers analytics modeling in more breadth, and UMich's Data Science and Predictive Analytics course. Still, neither of these courses has the mathematical rigor of, say, Shalizi's or Wasserman's courses, which as of this writing are still available online on .edu websites.

This repo is dedicated to my work from courses in analytics.

The Analytics Edge (in the AnalyticsEdge folder)

For the R code, look in the AnalyticsEdge folder (quick reference below).

This course is taught in R and is relatively easy. However, it is still worth taking because it is very practical and contains plenty of interesting data sets. Below I list some useful points for future reference. I recommend this course to anyone who is interested in analytics.

An Introduction to Analytics


table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
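A minimal sketch of table() with one and with two factors. The `grade` and `gender` vectors are hypothetical toy data, not from the course.

```r
# Hypothetical toy data: two categorical variables for six students
grade  <- c("A", "B", "A", "C", "B", "A")
gender <- c("F", "M", "M", "F", "F", "M")

# One factor: counts per level
counts <- table(grade)

# Two factors: contingency table of counts for each combination of levels
cross <- table(grade, gender)
print(cross)
```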


tapply()

Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
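The grouping behaviour described above is what tapply() provides in base R. A small sketch, assuming hypothetical `delay` and `carrier` vectors:

```r
# Hypothetical toy data: delay in minutes for flights of two carriers
delay   <- c(10, 25, 5, 40, 15)
carrier <- c("AA", "UA", "AA", "UA", "AA")

# Mean delay within each carrier group
meanDelay <- tapply(delay, carrier, mean)
print(meanDelay)
```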

Linear Regression

library(caTools)  # provides sample.split()

# 70/30 train/test split on the outcome variable
spl = sample.split(data$y, SplitRatio = 0.7)
train = subset(data, spl == TRUE)
test = subset(data, spl == FALSE)
model = lm(y ~ ., data = train)
pred = predict(model, newdata = test)

# out-of-sample R squared
SSE = sum((test$y - pred)^2)
SST = sum((test$y - mean(train$y))^2)
RSquared = 1 - SSE / SST


subset()

Return subsets of vectors, matrices or data frames which meet conditions.
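As an illustration, subset() can filter rows and pick columns in one call. This sketch uses the built-in mtcars data set; the chosen thresholds are arbitrary.

```r
# mtcars ships with base R; keep efficient four-cylinder cars and three columns
efficient <- subset(mtcars, mpg > 25 & cyl == 4, select = c(mpg, cyl, wt))
print(efficient)
```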


step()

Select a formula-based model by AIC.
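A short sketch of AIC-based selection with step(), using the built-in mtcars data. The starting set of predictors is an arbitrary choice for illustration.

```r
# Start from a model with several predictors
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Stepwise selection by AIC; trace = 0 suppresses the progress log
reduced <- step(full, trace = 0)
print(formula(reduced))
```

step() only moves to a new model when the AIC decreases, so the selected model never has a higher AIC than the starting one.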


na.omit returns the object with incomplete cases removed.
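For example, on a hypothetical data frame with missing values in two different columns:

```r
# Hypothetical toy data with NAs in different rows
df <- data.frame(x = c(1, NA, 3, 4),
                 y = c("a", "b", NA, "d"))

# Rows 2 and 3 each contain an NA, so only rows 1 and 4 remain
clean <- na.omit(df)
print(clean)
```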

coredata() in zoo

Generic functions for extracting the core data contained in a (more complex) object and replacing it.

It can be used to avoid explicitly transforming data types such as using as.vector().

mice() in mice

Generates Multivariate Imputations by Chained Equations (MICE)

complete() in mice

Takes an object of class mids, fills in the missing data, and returns the completed data in a specified format.

Logistic Regression

library(ROCR)  # prediction(), performance()

LogModel = glm(formula, data = train, family = "binomial")
pred = predict(LogModel, newdata = test, type = "response")  # predicted probabilities
predObj = prediction(pred, test$class)
auc = as.numeric(performance(predObj, "auc")@y.values)
perf = performance(predObj, "tpr", "fpr")
plot(perf) # ROC curve

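Definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / N, where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives, and N is the total number of observations.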
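A small sketch of computing these three measures from a confusion matrix in base R. The `actual` and `predicted` vectors and the 0.5 threshold are hypothetical.

```r
# Hypothetical outcomes (1 = positive) and predicted probabilities
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
predicted <- c(0.9, 0.4, 0.7, 0.2, 0.6, 0.1, 0.3, 0.8, 0.2, 0.4)
threshold <- 0.5

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(actual, predicted >= threshold)
TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"]
FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"]

sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
accuracy    <- (TP + TN) / sum(cm)
```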

Interpretation of AUC

  • The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
  • The AUC of a model has the following nice interpretation: given a random patient from the dataset who actually received poor care, and a random patient from the dataset who actually received good care, the AUC is the percentage of the time that our model will classify which is which correctly.
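The pairwise interpretation above can be checked directly in base R: compare every positive case with every negative case and count how often the positive one gets the higher score (ties counted as one half). The data below are hypothetical.

```r
# Hypothetical outcomes (1 = positive) and predicted probabilities
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
predicted <- c(0.9, 0.4, 0.7, 0.2, 0.6, 0.1, 0.3, 0.8, 0.2, 0.4)

pos <- predicted[actual == 1]
neg <- predicted[actual == 0]

# One entry per (positive, negative) pair: 1 if ranked correctly, 0.5 for a tie
pairScore <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc <- mean(pairScore)
print(auc)
```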

Sometimes the accuracy does not improve much over a baseline prediction (always predicting the majority class), but the model is still valuable because it gives more information and different cutoffs can be used in different situations. The quality of the model is mainly evaluated by AUC.


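Trees

library(rpart)  # rpart()
library(ROCR)   # prediction()
library(caret)  # train(), trainControl()

# classification tree; minbucket is the minimum number of observations per leaf
tree = rpart(formula, data = train, method = "class", minbucket = 25)
pred = predict(tree, newdata = test)         # matrix of class probabilities
predObj = prediction(pred[, 2], test$class)  # second column: probability of the positive class

# cross-validation
numFolds = trainControl(method = "cv", number = 10)
cpGrid = expand.grid(.cp = seq(0.01,0.5,0.01))
train(formula, data = train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)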

method = "class" should be removed for regression trees. type = "class" can be added in the predict() function if we want to predict the labels when the cutoff is 0.5.
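A sketch of the difference between the prediction types, assuming the rpart package is installed (it provides the kyphosis data set used here). The choice of data sets and predictors is only for illustration.

```r
library(rpart)  # also provides the kyphosis data set

# Classification tree: method = "class"
clsTree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
probs  <- predict(clsTree, newdata = kyphosis)                  # one probability column per class
labels <- predict(clsTree, newdata = kyphosis, type = "class")  # factor of predicted labels

# Regression tree: method left out, rpart uses "anova" for a numeric outcome
regTree <- rpart(mpg ~ wt + hp, data = mtcars)
regPred <- predict(regTree, newdata = mtcars)                   # numeric predictions
```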

Text Analytics

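library(tm)         # corpus handling and document-term matrices
library(SnowballC)  # stemming backend used by stemDocument

corpus = Corpus(VectorSource(data))
# content_transformer() keeps the corpus structure intact, so no PlainTextDocument step is needed
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
spdtm = removeSparseTerms(dtm, 0.95)  # drop terms that are absent from more than 95% of documents
dataSparse = as.data.frame(as.matrix(spdtm))
names(dataSparse) = make.names(names(dataSparse))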


ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

It is convenient to use and can save some code.
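For instance, a vectorized recoding that would otherwise need a loop or several assignments. The `age` vector and the group labels are hypothetical.

```r
# Hypothetical ages
age <- c(15, 42, 67, 30)

# Nested ifelse() recodes every element in one expression
group <- ifelse(age >= 65, "senior", ifelse(age >= 18, "adult", "minor"))
print(group)
```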


nchar takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x.

Note that length() does not give the length of a string: it returns the number of elements in the vector. Use nchar() to count characters.
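A quick comparison of the two functions on a hypothetical character vector:

```r
# Hypothetical character vector with three elements
tweets <- c("hello", "R is fun", "")

charCounts <- nchar(tweets)   # number of characters in each element
nElements  <- length(tweets)  # number of elements in the vector
```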


Clustering

library(caret)      # preProcess()
library(flexclust)  # as.kcca()

# standardization (the scaling is learned on the training set and reused on the test set)
preproc = preProcess(train)
normTrain = predict(preproc, train)
normTest = predict(preproc, test)

# prediction with k-means
km = kmeans(normTrain, centers = k)
km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)


rect.hclust()

Draws rectangles around the branches of a dendrogram highlighting the corresponding clusters. First the dendrogram is cut at a certain level, then a rectangle is drawn around selected branches. It helps visualize the sizes of clusters.
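A minimal sketch with rect.hclust() on the built-in USArrests data. The choice of four clusters and of Ward linkage is an assumption made for illustration.

```r
# Hierarchical clustering on standardized data
distances <- dist(scale(USArrests), method = "euclidean")
clust <- hclust(distances, method = "ward.D")

# Plot the dendrogram, then outline the branches that form 4 clusters
plot(clust, labels = FALSE)
boxes <- rect.hclust(clust, k = 4, border = "red")

# cutree() gives the matching cluster assignment for each observation
groups <- cutree(clust, k = 4)
print(table(groups))
```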


image()

Creates a grid of colored or gray-scale rectangles with colors corresponding to the values in z. This can be used to display three-dimensional or spatial data, also known as images. This is a generic function.
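A small sketch of image() with a gray-scale palette. The matrix `z` is synthetic, made up only to have something to draw.

```r
# Synthetic 20 x 20 intensity matrix
z <- outer(1:20, 1:20, function(i, j) sin(i / 3) * cos(j / 3))

# Draw it as a gray-scale image with 256 shades
image(z, axes = FALSE, col = grey(seq(0, 1, length.out = 256)))
```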


dim()

Retrieve or set the dimension of an object.

It can be used to transform a vector into a matrix.
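For example, assigning to dim() reshapes a vector in place. Values are filled column by column.

```r
v <- 1:6
dim(v) <- c(2, 3)  # v is now a 2 x 3 matrix, filled column-wise
print(v)
```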


split divides the data in the vector x into the groups defined by f.

A list is returned, which can then often be processed with lapply() or sapply().
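A sketch of the split-then-apply pattern on the built-in mtcars data:

```r
# Split fuel efficiency by number of cylinders: a list with one vector per group
mpgByCyl <- split(mtcars$mpg, mtcars$cyl)

# Apply a function to each group and simplify the result to a named vector
meanMpg <- sapply(mpgByCyl, mean)
print(meanMpg)
```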


Visualization

ggplot2 documentation


ggmap plots the raster object produced by get_map().

factor() can be used to reorder factor levels.
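A short sketch of reordering levels, which matters for plots because axes follow the level order. The `days` vector is hypothetical.

```r
# Hypothetical day names
days <- c("Sunday", "Monday", "Tuesday", "Monday")

# Default: levels are sorted alphabetically
byDefault <- factor(days)

# Explicit levels put the days in calendar order
byCalendar <- factor(days, ordered = TRUE,
                     levels = c("Sunday", "Monday", "Tuesday"))
```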

strptime() can be used to extract time and date from a string.
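For example, parsing a timestamp string with an explicit format. The string and its format are hypothetical; `tz = "UTC"` is set only to keep the result independent of the local time zone.

```r
# Parse month/day/two-digit-year plus hour:minute
stamp <- strptime("12/31/12 23:15", format = "%m/%d/%y %H:%M", tz = "UTC")

hour   <- stamp$hour          # hour of day
minute <- stamp$min           # minute
year   <- stamp$year + 1900   # POSIXlt stores years since 1900
month  <- stamp$mon + 1       # POSIXlt months are 0-based
```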


weekdays(), months(), quarters(), julian()

Extract the weekday, month or quarter, or the Julian time (days since some origin).
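A minimal sketch of these extractors on a single hypothetical date. Weekday and month names depend on the locale, so their exact spelling can differ between machines.

```r
d <- as.Date("2012-12-31")

dayName   <- weekdays(d)             # locale-dependent weekday name
monthName <- months(d)               # locale-dependent month name
quarter   <- quarters(d)             # "Q1" to "Q4"
daysSince <- as.numeric(julian(d))   # days since 1970-01-01 by default
```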

Linear Optimization & Integer Optimization

Solver in Excel


