# Preliminaries (Analytics Terminology and Analytics Timeline)

Unfortunately, business analytics, data analytics, data science, and machine learning are overlapping disciplines that don't always share the same terminology. The purpose of this discussion is to develop a common set of terminology that we will use in QSO370/QSO570 and to discuss the analytics process as a whole.

## A Reminder On Terminology

Bookmark pages 9 - 11 for some additional terminology used by our book, but below you'll find most of the terminology we'll use in our class. Some of this is a recap from our first notebook, but you will notice some new concepts here as well.

### The Basics

+ We use the term **analytics** to refer broadly to business analytics, data analytics, data science, and machine learning.
+ The term **data mining** refers to any act of extracting useful information from data.
+ A **data frame** is a table which holds data in the form of rows and columns where *rows* correspond to individual records and *columns* correspond to measured variables. Think of an Excel spreadsheet.
+ The columns (or variables) contained in a data frame are often called **features**, while the rows (or records) are often called **observations**.
+ A learning application is said to be **supervised** if the data frame contains an objective column which we are interested in *predicting* or *explaining* using the other features of the dataset. The variable corresponding to this objective column is often called a **response** variable. All remaining columns of the data frame are referred to as **predictors**. Applications without a response variable are called **unsupervised**.
+ A predictor is said to be **numerical** if measurements like the *mean* and *standard deviation* are meaningful for that predictor. Predictors for which *counts* are the best summary are referred to as **categorical**.
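
The vocabulary above can be illustrated with a small sketch. The `orders` records, their column names, and the `summarize` helper are all hypothetical, invented only for illustration; a real project would more likely use a data frame library than a list of dictionaries.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical data frame: each dict is one observation (row),
# each key is one feature (column).
orders = [
    {"region": "East", "units": 12, "returned": "no"},
    {"region": "West", "units": 7, "returned": "yes"},
    {"region": "East", "units": 9, "returned": "no"},
    {"region": "South", "units": 15, "returned": "no"},
    {"region": "West", "units": 4, "returned": "yes"},
]

# In a supervised application, one column is the response and the rest are predictors.
response = "returned"
predictors = [col for col in orders[0] if col != response]

def summarize(rows, column):
    """Summarize a column by mean/sd if numerical, by counts if categorical."""
    values = [row[column] for row in rows]
    if all(isinstance(v, (int, float)) for v in values):
        return {"type": "numerical", "mean": mean(values), "sd": stdev(values)}
    return {"type": "categorical", "counts": dict(Counter(values))}

for col in predictors:
    print(col, summarize(orders, col))
```

Here `units` is treated as numerical because a mean and standard deviation are meaningful for it, while `region` is summarized by counts.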

### Model Classes, Bias, and Variance

+ A **model class** is a description of the type of model we are building. For example, in this class we will encounter **linear regression**, **logistic regression**, and **trees** (as well as some others depending on your interests throughout the course).
  + **Question:** Why are there so many different classes of model?
    + The different classes of models make different underlying assumptions about the data. Choosing a model whose assumptions are close to satisfied will result in a well-performing model, while choosing a model for an inappropriate scenario can result in disaster.
  + The level of **flexibility** of a model is measured by how responsive the model is to a movement in a particular training observation. Adding predictors and allowing for higher-order terms (curved relationships or interactions between predictors) both increase the flexibility of a model.
  + Models with low flexibility are said to be high in **bias** (they are biased against more complex relationships in the data) -- models with high bias typically *underfit*. Models with high levels of flexibility are also high in **variance** (they react strongly to small changes in the training data and may therefore be unstable) -- models with high variance typically *overfit*.
    + Given the above, we have a **bias-variance tradeoff** (decreasing bias increases variance, resulting in increased risk of overfitting).
    + Training error (the error estimate obtained by applying our model to the data it was built on) is lowered by low-bias, high-variance models (highly flexible models).
    + Test error (our estimate for true prediction error) is high in both extremes (low-bias, high-variance and also high-bias, low-variance). Solving the bias-variance tradeoff problem means identifying an appropriate level of model flexibility.
    + Overfitting is a problem for us because it causes us to underestimate our expected prediction error rates. Underfitting is a problem because we've built a model that doesn't predict or explain as well as a model which is appropriately fit (there is more juice that can be squeezed from the data).
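
The relationship between flexibility, training error, and test error can be explored with a small simulation. This is a minimal sketch under assumptions that are not part of the notes: the data are simulated from a sine curve plus noise, flexibility is controlled by polynomial degree, and the helper names (`fit_polynomial`, `mse`, `simulate`) are made up for illustration.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via the normal equations."""
    m = degree + 1
    # Augmented system [X^T X | X^T y] for the monomial basis 1, x, x^2, ...
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    coefs = [0.0] * m
    for i in reversed(range(m)):
        rhs = A[i][m] - sum(A[i][j] * coefs[j] for j in range(i + 1, m))
        coefs[i] = rhs / A[i][i]
    return coefs

def mse(coefs, xs, ys):
    """Mean squared error of the fitted polynomial on (xs, ys)."""
    preds = [sum(c * x ** p for p, c in enumerate(coefs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

rng = random.Random(370)

def simulate(n):
    """Simulated data (an assumption): a sine curve plus Gaussian noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = simulate(30)
test_x, test_y = simulate(200)

results = {}
for degree in (1, 3, 9):  # increasing flexibility
    coefs = fit_polynomial(train_x, train_y, degree)
    results[degree] = (mse(coefs, train_x, train_y), mse(coefs, test_x, test_y))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, test MSE {results[degree][1]:.3f}")
```

Because each higher-degree model contains the lower-degree ones, training error cannot increase as the degree grows. Test error has no such guarantee, which is why it is the quantity to watch when choosing a level of flexibility.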

### Model-Building Frameworks

+ A **model-building framework** is a strategy for model construction. There are many, and we will see a few:
  + The **validation set approach** involves splitting all available data into a *training* and *test* set. The *test* set is sometimes called a *validation* set or *hold-out* set.
  + The **train-test-safe approach** involves splitting the available data into three sets rather than two. The *safe* data is used exactly once to validate our <u>final model</u> before passing it on to higher management or to a deployment team which will implement the model.
    + While they are reasonable approaches, the validation approaches above are sensitive to the data included in the training and test sets. Different training and test sets lead to different models -- but how different? How much confidence can we have in these models if they are so sensitive to training and test data?
  + **Cross validation** breaks the available data up into $k$ folds (typical values of $k$ are 5, 10, and $n$), and each fold takes a turn as the *test* set. That is, in $k$-fold cross validation, we build $k$ models (which we can then aggregate), and compute $k$ estimates for the true prediction error (which again we can aggregate). Aggregating coefficients affords us opportunities to explore uncertainty associated with model coefficients as well as predictions, computing standard error and confidence interval estimates for both.
    + Utilizing a *safe* set is still recommended with cross validation.
  + **Bootstrap Aggregation** (or more commonly, **bagging**) involves *bootstrapping* hundreds or thousands of new datasets from your available data, building a model of a particular class on each of these bootstrapped samples, and then using some method (possibly averaging or voting) for making predictions.
    + A bootstrapped sample is a sample containing the same number of observations as our available data, built from our sample data by randomly selecting rows *with replacement*. 
      + Consider a dataset with 100 observations. A bootstrapped sample is a new dataset built using rows of the original dataset. Some of the original dataset's rows may be included many times, while others may be taken no times.
      + A nice feature of bootstrapping is that we expect roughly 37% of the rows (about $1/e$ for reasonably large datasets) to be left out of any given bootstrapped sample. This leaves us with a built-in test dataset (commonly called **out of bag**) which we can use to approximate true prediction error.
  + **Random Forest** is a technique that is often applied to tree-based models but can be applied in other scenarios as well. The goal of random forest is to de-emphasize dominant predictors and to give combinations of lesser predictors a chance to produce a potentially superior model. In random forest, we typically bootstrap samples and allow only some subset of the predictors to be used by the model constructed on each bootstrapped dataset.
    + In addition to giving lesser predictors a chance, this random forest approach leads to a collection of models which are less correlated with one another than those produced by bagging alone. This provides an advantage over bagging -- aggregating less-correlated models gives a larger reduction in variance (in the bias/variance sense) relative to a single tree.
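
Two of the resampling ideas above, $k$-fold cross validation and bootstrapping, can be sketched using row indices alone. This is an illustration under assumptions, not a full modeling workflow: the 100-row dataset is represented only by its row numbers, no model is actually fit, and the choices of $k = 5$ and 1000 bootstrap samples are arbitrary.

```python
import random

rng = random.Random(570)
n = 100
rows = list(range(n))  # row indices standing in for a 100-observation dataset

# k-fold cross validation: shuffle once, then deal the rows into k folds.
k = 5
shuffled = rows[:]
rng.shuffle(shuffled)
folds = [shuffled[i::k] for i in range(k)]

for test_fold in folds:
    held_out = set(test_fold)
    train_rows = [r for r in shuffled if r not in held_out]  # the other k-1 folds
    # A model would be fit on train_rows and evaluated on test_fold here.

# Bootstrapping: draw n rows with replacement; rows never drawn are "out of bag".
oob_fractions = []
for _ in range(1000):
    sample = [rng.choice(rows) for _ in range(n)]
    out_of_bag = set(rows) - set(sample)
    oob_fractions.append(len(out_of_bag) / n)

avg_oob = sum(oob_fractions) / len(oob_fractions)
print(f"average out-of-bag fraction: {avg_oob:.3f}")
```

The chance that a given row is never drawn in $n$ draws with replacement is $(1 - 1/n)^n$, which is close to $1/e \approx 0.37$ for moderately large $n$, so the simulated average should land near that value.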

## The Data Mining Process

This section outlines the data mining process. We leave out the data collection aspects of data mining and assume that we have data available to us in the form of a table (or tables).

### Different Types of Data Mining

+ Exploratory Data Analysis
  + When we don't seek to build a predictive model and rather just look to analyze trends in our existing data, we are engaging in exploratory analyses. Sometimes EDA is the entire data mining task, while other times it is just the beginning of a predictive/learning task.
+ Statistical Learning / Machine Learning
  + Supervised learning applications
    + A **classification** task seeks to predict a categorical response variable. That is, we group objects into different classes. Classes may be a positive/negative flag (in the case of medical diagnoses or sales conversion) or extend to more than two groups (species classifications or customer segmentation within a market).
    + A **regression** task works to predict a numerical response variable. That is, we want to predict sales or revenue in dollars, customer lifetime values, etc.
  + Unsupervised applications
    + **Clustering** is much like classification in that we organize our observations/records into different clusters (groups), but it is unknown to us what the groups are or even how many groupings should be present.
    + **Association Rules** (also: market-basket analysis, or recommendation systems) look at transactional data and develop "*what goes with what*" rules to upsell or recommend products.
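
To make the "*what goes with what*" idea concrete, here is a minimal sketch that scores candidate rules on a tiny, made-up set of transactions. The measures used, *support* (how often items appear together) and *confidence* (how often the consequent appears given the antecedent), are standard in market-basket analysis but are not defined in these notes; the transactions and function names are hypothetical.

```python
# Hypothetical transactional data: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

for antecedent, consequent in [({"bread"}, {"milk"}), ({"diapers"}, {"beer"})]:
    print(
        f"{sorted(antecedent)} -> {sorted(consequent)}: "
        f"support {support(antecedent | consequent):.2f}, "
        f"confidence {confidence(antecedent, consequent):.2f}"
    )
```

A rule with high support and high confidence is a candidate for a recommendation, though in practice thresholds for both are chosen by the analyst.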

### The Process

There are multiple paradigms for the data mining process (CRISP-DM: cross-industry standard process for data mining; SEMMA: sample, explore, modify, model, assess), but what these all come down to is that data mining is an iterative process. What appears below is a slight variation on the process that appears in Chapter 2 of our book.
1. Develop an understanding of the purpose of the data mining project.
  + How will the stakeholder use and be affected by the results?
  + Is the analysis to be done once? Or is it ongoing?
  + **Beware of analytics as a solution in search of a problem.**
2. Obtain the dataset to be used in the analysis
  + This may involve querying and random sampling from a large database -- if analytics interests you, you should take a course in SQL as well as NoSQL.
  + You may also need to stitch (join) datasets together. 
3. (**differs from our text**) For inference tasks (anything beyond just describing the existing data) partition the data into *training*, *test*, and *safe* sets.
4. Make a plan to explore, clean, and pre-process data
  + Use the training data to identify the state of your entire dataset.
  + Develop a plan to tidy your data. Tabular data is *tidy* if it is organized such that (i) *every row corresponds to one observation/record* and (ii) *every column corresponds to a single measured variable*.
  + Decide how missing data should be handled. Should missing values stay missing or be imputed? Do the missing values actually provide information? If missing data are to be imputed or if information is to be extracted from patterns in missing data, we may only use information from the *training set* when deciding how to impute (for example, replace all missing values in a column with the average value of that column <u>from the training data</u>).
      + Remember that if we are building a model, we must somehow remove missing values -- either by deleting rows/columns or by imputation.
  + Are the data reasonable? Are the values as expected?
5. Decide if/how to engineer new features, and whether it is necessary to engage in variable selection and dimension reduction.
6. Perform the tidying, feature engineering, and variable selection tasks on the original dataset.
  + Note that any feature engineering involving calculated summary values (means, standard deviations, etc.) should use summary values from the *training set* only -- this includes normalization (standardization) and also advanced techniques such as *Principal Components Analysis*.
  + Once these tasks have been completed, re-split the data into *training*, *test*, and *safe* sets.
7. Determine the data mining task. Is the goal classification, prediction, clustering, association, a hypothesis test, something else?
  + Write the objective as a question which is answerable with your available data.
8. Choose and use the algorithms to perform the task. 
  + This is iterative and involves testing many competing models (potentially models of different classes) using the *test* data.
9. Interpret the results to answer the question. 
  + Choose a final (*best*) model and re-estimate the true prediction error on the *safe* dataset. 
10. Deploy the model 
  + Apply your final model to new data, making real predictions that help to inform decision making.
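
The partitioning and pre-processing steps in the process above can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: a single simulated numeric column with about 10% missing values, a 60/20/20 split into *training*, *test*, and *safe* sets, and a hypothetical `prepare` helper. The point it shows is that the imputation value and the standardization constants come from the training rows only and are then reused on the other two sets.

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)

# Hypothetical numeric column with roughly 10% of values missing (None).
values = [rng.gauss(50, 10) if rng.random() > 0.1 else None for _ in range(200)]

# Partition row indices into training, test, and safe sets (60/20/20 is an assumption).
indices = list(range(len(values)))
rng.shuffle(indices)
train_idx, test_idx, safe_idx = indices[:120], indices[120:160], indices[160:]

# Summary values come from the observed training rows only.
train_observed = [values[i] for i in train_idx if values[i] is not None]
train_mean = mean(train_observed)
train_sd = stdev(train_observed)

def prepare(idx):
    """Impute missing values with the training mean, then standardize with training statistics."""
    imputed = [values[i] if values[i] is not None else train_mean for i in idx]
    return [(v - train_mean) / train_sd for v in imputed]

train, test, safe = prepare(train_idx), prepare(test_idx), prepare(safe_idx)
print(f"training mean {train_mean:.2f}, training sd {train_sd:.2f}")
```

Computing the mean and standard deviation on all 200 rows instead would let information from the test and safe sets leak into the model-building process, which is what the training-only rule is meant to prevent.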