# An Introduction and Overview of Chapter 1

This notebook begins with talking points for Chapter 1 of *Data Mining for Business Analytics with Python*. The last section of the notebook describes what you can expect from our QSO370/QSO570 course at Southern New Hampshire University for the Fall 2020 semester.

### Topics addressed

This notebook addresses the following topics:
+ What is business analytics?
+ Sample applications
+ What are data mining, business analytics, data analytics, data science, statistical learning, and machine learning?
+ Common terminology
+ Techniques -- why are there so many?
+ The importance of assumptions
+ What to expect from our course

## What is business analytics?

Simply put, business analytics is the process of utilizing data to make informed business decisions. This process includes the use of summary statistics (averages, ranges, standard deviations, etc.) and data visualization, as well as advanced techniques such as the construction of descriptive and predictive models.
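
As a small illustration of the summary statistics mentioned above, here is a minimal sketch using only Python's standard library. The `daily_sales` figures are made up for this example and do not come from the text; in practice we would typically compute these with a package such as pandas.

```python
import statistics

# Hypothetical daily sales figures (invented for illustration only).
daily_sales = [120, 135, 150, 110, 160]

average = statistics.mean(daily_sales)            # the "typical" day
sales_range = max(daily_sales) - min(daily_sales)  # spread from lowest to highest day
std_dev = statistics.stdev(daily_sales)           # sample standard deviation

print(f"average: {average}")
print(f"range: {sales_range}")
print(f"standard deviation: {std_dev:.2f}")
```

Even three numbers like these start to inform decisions: the average suggests a typical day, while the range and standard deviation describe how much day-to-day variation to plan for.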

With the advent of the web and the availability of cheap and powerful computing resources, companies are often sitting on vast stores of data. Business analytics seeks to gain insights from the available data -- the ability to make truly informed decisions is so valuable to businesses that it has led to data being called the *new oil*. Analytics and other adjacent fields such as data science and machine learning present truly exciting opportunities -- welcome.

## Sample applications

The applications of analytics, data science, and artificial intelligence are countless. We've all heard of the pursuit of the "self-driving car" which may revolutionize travel and shipping. Insurance companies have been using analytics for rate-setting since the 1930s. Entertainment companies like Netflix and Disney use analytics to determine whether a movie or series is likely to be successful before agreeing to start filming -- there is a reason for all of the recent *live action* Disney re-releases. Facebook, Google, and other technology platforms use web history to engage in targeted advertising. Travel websites have used analytics to determine which types of customers can be presented with higher rates and still purchase hotel reservations or airline tickets.

With all of these applications, it is clear that significant ethical questions exist in the realms of analytics and data science. Should companies like [23andMe](https://www.23andme.com/) or [Fitbit](https://www.fitbit.com/us/home) be allowed to share your data with your primary healthcare provider so that they are better aware of any of your elevated health risks? Should these same companies be allowed to sell your data to insurance companies who may use this added information to adjust your insurance rates or even refuse to cover you? I recommend that anyone seriously interested in an analytics or artificial intelligence role pursue several courses in ethics.

## What's the difference between all of these titles?

As you've been reading, you've likely noticed that there are lots of terms and titles used to describe people who use data to better understand processes, make informed decisions, or to deliver products. It is useful to understand the differences between the various roles.
+ **Machine Learning and Statistical Learning** are typically umbrella terms which include all of data science, predictive analytics, and artificial intelligence. In my view, machine learning concerns applications in which we are training a model to make predictions or take actions, but we do not necessarily care to know why a particular prediction/action was made, while statistical learning concerns applications in which we are training a model so that *we* can learn how the inputs are related to a target output. I would describe machine learning as being primarily concerned with producing accurate predictions while statistical learning is concerned with interpretable models.
+ **Data Science** is concerned with the entire flow of an analytics project. A data scientist can be involved in anything from data collection and storage, to cleaning data, conducting exploratory data analysis, building and assessing various models, deploying models for real-time use, as well as monitoring and updating those models. A data scientist may be expected to know and use a variety of tools, including SQL and NoSQL for database management, Apache Spark for interfacing with *big-data*, R or Python for conducting analysis and building models, and other analytics tools.
+ **Data and Business Analytics** typically focuses on the exploratory data analysis, modeling, and reporting tasks associated with analytics projects. Data and business analysts are typically responsible for *in-house* insights and reporting, but may not be responsible for overseeing any data collection or storing and might not deploy any models for real-time prediction. According to current job boards, both analysts and scientists should be familiar with a suite of tools including SQL and at least one of R or Python. Analysts and scientists should also be familiar with some advanced mathematics and certainly with statistics (see the section on *The Importance of Assumptions*).

## Common terminology

There's a lot of terminology that we'll be utilizing in our course. Here are a few definitions to get us started.
+ A **Data Frame** is a table which stores data -- think of a data frame as a *csv* file (like an Excel spreadsheet with just a single tab).
+ A **feature** is a column in a data frame (also called a **variable**).
+ A **response variable** is a variable whose values we are interested in predicting.
+ A **predictor variable** is a variable which may be used to predict the value of a *response variable*. Note that predictor variables are not necessarily *useful*.
+ A **numerical variable** is a variable for which a metric like the mean, median, or standard deviation "makes sense".
+ A **categorical variable** (or **factor**) is a variable which serves to group observations into two or more categories.
    + There are some cases in which a variable may be neither numerical nor categorical. In these cases, we typically have an unstructured column like free-response text or a timestamp, which requires additional pre-processing before it can be used.
+ A **case**, **record**, or **observation** is a single unit on which variables have been measured. These may represent products, customers, individual sales, etc.
+ A data frame is **tidy** if every row corresponds to a single observation and every column denotes a single measured value of a particular variable.
+ The term **algorithm** has multiple meanings in analytics, and so we will use the term **model class** to describe the type of model being built (a regression model, decision tree, support vector machine, etc.) and the term **model-building framework** to describe the algorithm being used to train a model (validation-set approach, cross-validation, bagging, boosting, etc.).
+ **Regression** is a modeling task in which we are trying to predict a numerical response variable.
+ **Classification** is a modeling task in which we are trying to predict a categorical response variable (that is, we try to predict group membership).
+ **Supervised Learning** is a modeling task in which a response column is known. Since the true responses are known, the response column *supervises* the fitting procedure. Note that all regression tasks must be supervised.
+ **Unsupervised Learning** is a modeling task in which no response column is provided. Such tasks include clustering (identifying how many *groups* are present in a dataset), recommendation engines (Netflix uses your watch history to recommend new movies or shows based off of what other users with similar histories and habits have watched), and dimensionality reduction (attempting to encode a very wide dataset with many features into a data frame with fewer columns while losing minimal information -- say maybe thousands of columns down to ten or twenty).
+ The **training data** is a subset of data which is used for exploratory data analysis, data visualization, and model construction.
+ A **holdout set** is a subset of data which is unseen by the analyst or model during the training procedure (this also includes the initial exploratory and visualization phases). We will typically have two separate holdout sets. 
    + A **test set** is used to assess and compare models. The test set can be utilized multiple times and often helps to tune models. 
    + A **safe** (or **validation set**) is used to evaluate the final model after it has been chosen. The **safe** data provides one last check on expected model performance before a model is deployed or reported on. 
    + The **test** and **safe** sets are important because they simulate brand new, unseen data. Performance metrics computed on the training data are biased, since the model knew the answers in those cases.
+ **Model Validation** typically involves computing error rates or error metrics for our models.
    + The **training error** is a measurement of errors made by the model when predicting the response for observations which the model was trained on.
    + The **test error** (or **validation error**) is a measurement of errors made by the model when predicting the response for observations which were unseen by the model during the fitting procedure.
    + Test error is typically a better indicator of the true expected prediction error rates than training error. Since the model "knows the answers" in the case of the training data, the model has an unfair advantage in making those predictions and so the training error will be too optimistic. 
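
To make the training, test, and safe sets (and the gap between training and test error) more concrete, here is a minimal sketch using only Python's standard library. Everything in it is an assumption made for illustration: the data are simulated, the 60/20/20 split is just one common choice, and the one-nearest-neighbor model was picked because it memorizes its training data and so makes the "model knows the answers" problem easy to see. In our course we will typically rely on packages rather than writing these steps by hand.

```python
import random

random.seed(370)  # fixed seed so the shuffle and split are reproducible

# Simulated observations: x is a predictor, y is a numerical response.
# The relationship y = 3x + noise is an assumption made for this sketch.
observations = []
for _ in range(100):
    x = random.uniform(0, 10)
    y = 3 * x + random.gauss(0, 2)
    observations.append((x, y))

# Shuffle, then carve out training (60%), test (20%), and safe (20%) sets.
random.shuffle(observations)
train = observations[:60]
test = observations[60:80]
safe = observations[80:]  # left untouched until a final model has been chosen


def predict_1nn(x_new, training_data):
    """Predict y using the response of the closest training observation."""
    nearest = min(training_data, key=lambda obs: abs(obs[0] - x_new))
    return nearest[1]


def mean_absolute_error(data, training_data):
    """Average absolute gap between actual and predicted responses."""
    errors = [abs(y - predict_1nn(x, training_data)) for x, y in data]
    return sum(errors) / len(errors)


train_error = mean_absolute_error(train, train)  # model has seen these answers
test_error = mean_absolute_error(test, train)    # model has not seen these

print(f"training error: {train_error:.2f}")
print(f"test error:     {test_error:.2f}")
```

Because each training observation is its own nearest neighbor, the training error here is zero, which would suggest a perfect model. The test error gives a more honest picture of how the model behaves on observations it has never seen.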

## An overview of techniques -- why so many?

Simple linear regression, multiple linear regression, curvi-linear regression, logistic regression, tree-based models, support vector machines, linear and quadratic discriminant analysis, nearest neighbors, neural networks, deep learning -- why are there so many techniques? As a simple initial answer, not all models are well-suited for both regression and classification tasks. That answer is too simplistic, however -- otherwise we would have one type of model for regression tasks and another type of model for classification.

There are several additional reasons for having so many model types. We should consider the type of output we want -- yes, we care whether the response is numerical or categorical, but even further -- if we have a classification task, are we interested in just a class prediction (we predict *Class A*) or are we interested in the *propensity* (likelihood) of the observation belonging to each class (we predict a 51% probability of belonging to *Class A*, a 45% probability of belonging to *Class B*, and a 4% probability of belonging to *Class C*)? Additionally, we need to consider the assumptions made by each model class (see the next section) as well as the requirements for training these models -- do we have enough data? -- do we have too much data? -- how long will training take? -- do we have enough computing power?
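
The difference between a class prediction and a set of propensities can be sketched in a few lines of Python. This is only an illustration: the probabilities are the ones from the example above, typed in by hand rather than produced by a fitted model, and the variable names are my own.

```python
# Propensities for a single observation (the numbers from the example above).
propensities = {"Class A": 0.51, "Class B": 0.45, "Class C": 0.04}

# A class prediction keeps only the most likely class...
predicted_class = max(propensities, key=propensities.get)

# ...while the propensities also reveal how close the call was.
ranked = sorted(propensities.values(), reverse=True)
margin = ranked[0] - ranked[1]

print(f"predicted class: {predicted_class}")
print(f"margin over the runner-up: {margin:.2f}")
```

The class prediction alone says *Class A*, but the propensities show that *Class B* was nearly as likely. Whether that extra information matters depends on the business decision being made, which is one reason to prefer some model classes over others.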

## The importance of assumptions

Being aware of the assumptions made by each model is crucial to the success of any analytics project. If you use a model class which makes assumptions that are not plausible for your data you will, at best, get less than optimal results during training and validation and, at worst, not realize there is a problem until your model begins performing poorly after deployment. While you can technically *do* analytics without understanding the mathematics or statistics going on under the hood in each of these models, it is not responsible to do so. We will take care throughout our course to highlight and understand the assumptions we are making about our data when we choose a particular model. As you continue to learn about new model classes beyond QSO370/570, please make sure to understand the theoretical underpinnings for these models so that you are prepared to wield your modeling powers responsibly!

## What to expect from QSO370/QSO570

In QSO370/QSO570, we explore *predictive analytics*. I'll make the following assumptions in this course.
+ You have prior exposure to working with data in a tabular format -- for example, you've used spreadsheet software.
+ You have exposure to basic statistics which includes exploratory data analysis and inferential techniques (including hypothesis testing and the construction and interpretation of confidence intervals).
+ You have some experience with data visualization.
+ You are genuinely interested in analytics and modeling with data.

Our course will cover the following content corresponding to chapters in the text *Data Mining for Business Analytics* (Shmueli et al.):
+ Chapter 2: An Overview of the Analytics Process
+ Chapter 5: Evaluating Predictive Performance
+ Chapter 6: Linear Regression
+ Chapter 9: Classification and Regression Trees
+ Chapter 10: Logistic Regression
+ Chapter 13: Ensembles and Uplift

You'll notice that we are not covering Chapter 3 (Data Visualization) or Chapter 4 (Dimension Reduction). I assume that you have had exposure to *data visualization* in QSO250 as well as through your basic statistics foundation. If you have not had that exposure, I will provide a supplementary document that you can work through in order to gain some background. While we will not explicitly cover *dimension reduction* in our course, we will discuss the utility of dimension reduction as we consider drawbacks associated with model complexity.

You may also notice that we are skipping many of the model classes that are presented in our text. Unfortunately, there is just not time to cover them all. I've chosen to cover linear regression models and regression trees for regression scenarios, and logistic regression and classification trees for classification scenarios. If you are interested in further exposure to statistical models, you might consider taking MAT300 (Regression Analysis) and MAT434 (Statistical Learning and Classification) as well as other advanced courses in business analytics.

In QSO370/QSO570, you won't just be *learning about* the aspects of analytics projects and predictive modeling -- you'll actually be getting your hands dirty while implementing the techniques we discuss. We'll be making heavy use of Python, which is one of the most popular tools for analytics, data science, and machine learning today. Python is a general purpose coding language with a large community of users developing packages for machine learning applications. While Python is indeed a coding language, I'll assume that you have never written a line of code before, and we will build from the ground up.

For the last third of our course, you'll get to focus on an area (or areas, in the case of graduate students) of specialization that you are most interested in. Undergraduates (students in QSO370) should complete at least one of the following specializations, while graduate students (students in QSO570) should complete at least two. Each specialization includes a significant capstone project -- graduate students need to complete only one capstone project, not both. Please choose the specialization or specializations which are most interesting to you. Note that, although the *Time Series* specialization includes three chapters rather than two, the workload is the same as for the other two specializations.
+ Specialization I: *Unsupervised Learning*
    + Chapter 14: Association Rules and Collaborative Filtering
    + Chapter 15: Cluster Analysis
+ Specialization II: *Time Series*
    + Chapter 16: Handling Time Series
    + Chapter 17: Regression-Based Forecasting
    + Chapter 18: Smoothing Methods
+ Specialization III: *Unstructured Data*
    + Chapter 19: Social Network Analysis
    + Chapter 20: Text Mining

## Closing

Welcome to QSO370/QSO570 (Predictive Analytics). I hope that this first chapter of *Data Mining for Business Analytics* and this short overview have piqued your interest in data analytics. We are embarking on an exciting journey in which you will both learn and apply advanced techniques for modeling with data. This course covers powerful and "state of the art" techniques -- I hope you will enjoy it.