---
title: "Introduction to Data Science"
subtitle: "IN2004B: Generation of Value with Data Analytics"
author: 
  - name: Alan R. Vazquez
    affiliations:
      - name: Department of Industrial Engineering
format: 
  revealjs:
    chalkboard: false
    multiplex: false
    footer: "Tecnologico de Monterrey"
    logo: IN2004B_logo.png
    css: style.css
    slide-number: True
    html-math-method: mathjax
editor: visual
jupyter: python3
---


## Agenda

</br>

1.  Data Science
2.  Supervised and Unsupervised Learning

# Data Science

## Data Science is ...

a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from vast amounts of structured and unstructured [data]{style="color:purple;"}.

. . .

:::::: center
::::: columns
::: {.column width="40%"}
![](images/clipboard-2707460623.png){fig-align="center" width="377"}
:::

::: {.column width="50%"}
![](images/clipboard-3242954590.png){fig-align="center" width="485"}
:::
:::::
::::::

## Similar concepts

</br>

::: incremental
-   **Data mining** is a process of discovering patterns in large [data sets]{style="color:purple;"} using methods at the intersection of statistics and database systems.

-   **Predictive modeling** is the process of developing a model so that we can understand and quantify the accuracy of the model's prediction in yet-to-be-seen future [data sets]{style="color:purple;"}.

-   **Statistical learning** refers to a set of tools (statistical models and data mining methods) for modeling and understanding complex [data sets]{style="color:purple;"}.
:::

## In 2004...

Hurricane Frances battered the Caribbean and threatened to directly affect Florida's Atlantic coast.

. . .

::::::: center
:::::: columns
::: {.column width="30%"}
![](images/clipboard-123225538.png){width="194"}
:::

::: {.column width="30%"}
![](https://c8.alamy.com/compes/ccn9gk/ft-pierce-9-6-04-un-clubouse-danados-por-el-huracan-frances-en-ocean-village-en-hutchinson-island-el-lunes-el-complejo-tambien-recibio-algunos-danos-a-techos-pisos-de-tierra-y-algunas-unidades-fueron-danadas-por-las-tormentas-foto-por-aguas-lannis-el-palm-beach-post-no-para-su-distribucion-fuera-de-cox-ccn9gk.jpg){width="268"}
:::

::: {.column width="30%"}
![](images/clipboard-679480994.png){width="280"}
:::
::::::
:::::::

. . .

Residents headed for higher ground, but in Arkansas, Walmart executives saw a big opportunity for one of their newest data-driven weapons: ***predictive technology***.

## 

</br>

::::::: center
:::::: columns
:::: {.column width="70%"}
::: {style="font-size: 90%;"}
A week before the storm made landfall, Linda M. Dillman, Walmart's chief information officer, pressured her staff to create forecasts based on what had happened when Hurricane Charley hit the area several weeks earlier.

<br/>

Backed by trillions of bytes of purchase history stored in Walmart's data warehouse, the company could "start predicting what's going to happen, rather than waiting for it to happen," as she put it.
:::
::::

::: {.column width="30%"}
![](images/clipboard-3960378011.png){width="224"}

<br/>

![](images/clipboard-4213549697.png){width="516"}
:::
::::::
:::::::

## The result

::::::: columns
:::: {.column width="50%"}
The New York Times reported

::: {style="font-size: 80%;"}
> *"... Experts analyzed the data and found that stores would indeed need certain products, not just the typical flashlights."*
:::
::::

:::: {.column width="50%"}
Dillman said

::: {style="font-size: 80%;"}
> *"We didn't know in the past that strawberry Pop-Tarts increase their sales, like seven times their normal sales rate, before a hurricane."*
:::
::::
:::::::

[![](images/clipboard-3670330051.png){fig-align="center" width="529"}](https://www.nytimes.com/2004/11/14/business/yourmoney/what-walmart-knows-about-customers-habits.html)

## Cross-Industry Standard Process (CRISP) for Data Science

![](images/clipboard-4096324521.png){fig-align="center"}

## CRISP Model

</br>

-   **Business Understanding**: What does the business need?

-   **Data Understanding**: What data do we have or need? Is it clean?

-   **Data Preparation**: How do we organize the data for modeling?

-   **Modeling**: What modeling techniques should we apply?

-   **Evaluation**: Which model best meets business objectives?

-   **Deployment**: How do stakeholders access the results?

## Business understanding

</br></br>

-   Business understanding refers to defining the business problem you are trying to solve.

-   The goal is to reframe the business problem as a data science problem.

-   Reframing the problem and designing a solution is often an iterative process.

## Problems in Data Science

</br>

[**Classification**]{style="color:blue;"} (or class probability estimation) attempts to predict, for each individual in a population, which of a (small) set of classes that individual belongs to. For example, "Among all T-Mobile customers, which ones are likely to respond to a given offer?"

. . .

[**Regression**]{style="color:green;"} attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. For example, "How much will a given customer use the service?"

## 

</br></br></br>

**Clustering** attempts to group individuals in a population based on their similarity, but not for any specific purpose. For example, "Do our customers form natural groups or segments?"
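
## {.scrollable}

The three problem types just described differ mainly in the target they use. The sketch below is only an illustration: the customer records are hypothetical, the nearest-neighbor rule is a stand-in for a real model, and the clustering step simply splits the sorted ages at their largest gap.

```python
# Hypothetical customer records (all values invented for illustration).
customers = [
    {"age": 25, "monthly_minutes": 320, "responded": "yes"},
    {"age": 31, "monthly_minutes": 280, "responded": "yes"},
    {"age": 52, "monthly_minutes": 90, "responded": "no"},
    {"age": 60, "monthly_minutes": 60, "responded": "no"},
]

def nearest(age):
    """Return the known customer whose age is closest to `age`."""
    return min(customers, key=lambda c: abs(c["age"] - age))

new_age = 27

# Classification: the target is a category (will the customer respond?).
predicted_class = nearest(new_age)["responded"]

# Regression: the target is a number (how much will the customer use the service?).
predicted_minutes = nearest(new_age)["monthly_minutes"]

# Clustering: no target; split the sorted ages at the largest gap.
ages = sorted(c["age"] for c in customers)
gaps = [b - a for a, b in zip(ages, ages[1:])]
cut = gaps.index(max(gaps)) + 1
segments = [ages[:cut], ages[cut:]]

print(predicted_class, predicted_minutes, segments)
```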

## Discussion

</br>

-   Often, reframing the problem and designing a solution is an iterative process.

-   The initial formulation may not be complete or optimal, so multiple iterations may be necessary to formulate an acceptable solution.

-   [The key to success is the analyst's creativity in framing the business problem as one or more data science problems.]{style="color:brown;"}

## Data understanding I

</br>

::: incremental
-   If the goal is to solve a business problem, data constitutes the raw material available from which the solution will be built.

-   The available data rarely matches the problem.

-   For example, historical data is often collected for purposes unrelated to the current business problem or without any explicit purpose.
:::

## Data understanding II

</br>

-   Data costs vary. Some data will be available for free, while other data will require effort to obtain.

::: incremental
-   A key part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is justified.

-   Even after acquiring all the data sets, compiling them may require additional effort.
:::

## Example

</br>

- In the 1980s, credit cards had uniform pricing — companies lacked the systems for mass differential pricing.

- By 1990, Richard Fairbank and Nigel Morris saw that IT could power predictive models to customize offers (pricing, credit limits, low introductory rates, cash back, loyalty points).

- Signet Bank’s strategy: model **profitability**, not just probability of default, since a small fraction of customers generate most profits.  

##

</br></br></br>


- **Problem:** They lacked data on how different credit terms affected profitability.

. . . 

- **Solution:** Acquire data at a cost — run experiments offering varied terms to different customers. Losses from these offers were considered **investments in data**.


## What happened?

As expected, Signet's number of bad accounts skyrocketed.

The losses continued for several years while data scientists worked to build predictive models from the data, evaluate them, and implement them to improve profits.

Because the company viewed these losses as investments in data, they persisted despite complaints from stakeholders.

Eventually, Signet's credit card business turned around and became so profitable that it was spun off to separate it from the bank's other operations, which were now overshadowing the success of its consumer lending business.


## Richard Fairbank and Nigel Morris

![](images/ruchardnigel.jpg){fig-align="center" width="512" height="286"}

::::: columns
::: {.column width="50%"}
Founders of
:::

::: {style="font-size: 50%;"}
![](images/Capital_One_logo.png){fig-align="center" width="323"}
:::
:::::

## Most used data science tools

1.  Python
2.  R
3.  SAS
4.  Excel
5.  Power BI
6.  Tableau
7.  Apache Spark

<https://hackr.io/blog/top-data-analytics-tools>

## Other tools

-   RapidMiner (<https://rapidminer.com/products/studio/>)

-   JMP (<https://www.jmp.com/es_mx/home.html>)

-   Minitab (<https://www.minitab.com/es-mx/products/minitab/>)

-   Trifacta (<https://www.trifacta.com/>)

-   BigML (<https://bigml.com/>)

-   MLBase (<http://www.mlbase.org/>)

-   Google Cloud AutoML (<https://cloud.google.com/automl/>)

# Supervised and Unsupervised Learning

## Terminology

</br></br>

-   [**Predictors**]{style="color:darkblue;"}. They are represented using the notation $X_1$ for the first predictor, $X_2$ for the second predictor, ..., and $X_p$ for the *p*-th predictor.

-   $\boldsymbol{X} = (X_1, X_2, \ldots, X_p)$ represents a whole collection of $p$ predictors.

-   [**Response**]{style="color:darkred;"}. $Y$ represents the response variable, which we will attempt to predict.
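
## {.scrollable}

To make this notation concrete, here is a small sketch with a hypothetical data set of $n = 4$ observations and $p = 2$ predictors. The column names and values are invented. Each element of `X` holds one observation of $(X_1, X_2)$, and `y` holds the corresponding values of $Y$.

```python
# Hypothetical table: two predictors and one response per observation.
rows = [
    # (income, age, spending)
    (42.0, 25, 3.1),
    (55.5, 38, 4.0),
    (61.2, 47, 4.4),
    (38.9, 22, 2.7),
]

# Predictors: X[i] = (X_1, X_2) for observation i.
X = [(income, age) for income, age, _ in rows]

# Response: y[i] is the value of Y we would like to predict.
y = [spending for _, _, spending in rows]

n, p = len(X), len(X[0])
print(f"n = {n} observations, p = {p} predictors")
```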

## Types of learning

</br></br>

In data science (and machine learning), there are two main types of learning:

-   [Supervised learning]{style="color:blue;"}

-   [Unsupervised learning]{style="color:green;"}

## 

![](images/clipboard-2564234206.png)

## Supervised learning...

</br>

includes algorithms that learn by example. That is, we provide the supervised algorithm with a data set with known [predictor and response values]{style="color:blue;"}. The algorithm must find a way to determine the responses from the predictors.

Since we have the correct (true) responses, the algorithm can identify patterns in the data, learn from its mistakes, and make better predictions of the responses.

The algorithm is *trained* to reach a high level of accuracy and performance for predicting the responses.

## Mathematically 

</br>

We want to establish the following relationship

$$ 
Y = f(\boldsymbol{X}) + \epsilon,
$$

where $f$ is a function of the predictors and $\epsilon$ is a natural (random) error.

. . . 

- $f(\boldsymbol{X})$ represents the true relationship between the response ($Y$) and predictors ($\boldsymbol{X}$). 

. . . 

- However, $f(\boldsymbol{X})$ is unknown and very complex! 
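
## {.scrollable}

The relationship $Y = f(\boldsymbol{X}) + \epsilon$ can be illustrated by simulation. In this sketch the linear form of `f`, the noise level, and the sample size are all assumptions chosen for illustration; in a real problem $f$ is not known.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def f(x):
    """Assumed 'true' relationship; unknown in real problems."""
    return 2.0 + 0.5 * x

xs = [random.uniform(0, 10) for _ in range(200)]
eps = [random.gauss(0, 1) for _ in xs]      # random error with mean zero
ys = [f(x) + e for x, e in zip(xs, eps)]    # Y = f(X) + epsilon

# The observed responses scatter around f(X); the errors average out near zero.
mean_error = sum(y - f(x) for x, y in zip(xs, ys)) / len(xs)
print(round(mean_error, 3))
```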

## 

</br></br></br>

A supervised algorithm attempts to construct an approximation $\hat{f}(\boldsymbol{X})$ to the true function $f(\boldsymbol{X})$ using available data on the predictors and response. 

Ideally, $\hat{f}(\boldsymbol{X})$ is also interpretable, although this is not always the case.
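
## {.scrollable}

One simple way to build an approximation $\hat{f}$ is to fit a straight line by least squares. The sketch below assumes a single predictor and a linear form; the data values are hypothetical. A line is an example of an interpretable $\hat{f}$, since the slope is the estimated change in the response per unit change in the predictor.

```python
# Hypothetical training data (values invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares estimates of the slope and intercept.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

def f_hat(x):
    """Fitted approximation to the unknown f."""
    return intercept + slope * x

print(f"f_hat(x) = {intercept:.2f} + {slope:.2f} x")
```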

## Two data sets

In supervised learning, there are two main types of data:

-   [**Training data**]{style="color:darkblue;"} is data used by the supervised algorithm to construct $\hat{f}(\boldsymbol{X})$.

-   [**Test data**]{style="color:darkgreen;"} is data NOT used in the algorithm's training process, but is used to evaluate the quality of $\hat{f}(\boldsymbol{X})$.

![](images/training_test.jpeg){fig-align="center" width="720"}
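
## {.scrollable}

The roles of the two data sets can be sketched as follows. Everything here is an assumption made for illustration: the simulated data, the 75/25 split, the least-squares line used as $\hat{f}$, and mean squared error as the quality measure.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Hypothetical data generated from an assumed relationship plus noise.
data = [(x, 1.0 + 2.0 * x + random.gauss(0, 1)) for x in range(20)]

# Hold out 25% of the observations as test data.
random.shuffle(data)
n_test = len(data) // 4
test, train = data[:n_test], data[n_test:]

# Train: fit a least-squares line using the training data only.
n = len(train)
x_bar = sum(x for x, _ in train) / n
y_bar = sum(y for _, y in train) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum(
    (x - x_bar) ** 2 for x, _ in train
)
intercept = y_bar - slope * x_bar

# Evaluate: mean squared prediction error on the unseen test data.
test_mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
print(f"train size = {len(train)}, test size = {len(test)}, test MSE = {test_mse:.2f}")
```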

## Popular supervised algorithms

![](images/clipboard-4052653044.png){fig-align="center"}

## 

![](images/clipboard-2622609818.png){fig-align="center"}

## Unsupervised learning...

</br>

studies [data of the predictors]{style="color:green;"} ($\boldsymbol{X}$) to identify patterns. [There are no responses.]{style="color:green;"} 

An unsupervised algorithm identifies correlations and relationships by analyzing the available data. With no responses to guide it, the algorithm must organize the data set in some way that describes its structure.

In technical terms, we want the algorithm to say something about the joint probability distribution of the predictors $P(X_1, X_2, \ldots, X_p)$. 
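
## {.scrollable}

As a minimal sketch of an unsupervised algorithm, the code below runs a bare-bones k-means on hypothetical observations of two predictors. No response is used. The number of groups ($k = 2$), the starting centers, and the fixed number of iterations are assumptions, and the sketch does not handle empty groups.

```python
# Hypothetical observations of two predictors (X_1, X_2); values invented.
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]

def closest(point, centers):
    """Index of the center nearest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))

def mean(group):
    """Coordinate-wise average of a list of points."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))

# k-means with k = 2, started from the first and last observations.
centers = [points[0], points[-1]]
for _ in range(10):
    # Assign each point to its nearest center, then recompute the centers.
    labels = [closest(p, centers) for p in points]
    centers = [
        mean([p for p, lab in zip(points, labels) if lab == k]) for k in range(2)
    ]

print(labels)
```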

## Popular Unsupervised Algorithms

![](images/clipboard-1531600765.png){fig-align="center"}

## Let's play with supervised models.

</br></br>

1.  <https://quickdraw.withgoogle.com/>

2.  <https://tenso.rs/demos/rock-paper-scissors/>

3.  <https://teachablemachine.withgoogle.com/>

# [Return to main page](https://alanrvazquez.github.io/TEC-IN2004B/)