In [1]:
# To run this notebook as a reveal.js presentation, run the following command in the notebook's folder:
# `jupyter-nbconvert --to slides data-analytics-big-picture.ipynb --reveal-prefix=reveal.js --post serve`

# DATA ANALYTICS ▶THE BIG PICTURE
[franckalbinet@gmail.com](mailto:franckalbinet@gmail.com)

# I. FOR WHAT PURPOSE?

![data-science-process.png](img/data-science-process.png)
https://en.wikipedia.org/wiki/Data_science

## ▸ Why should IoT & Data Science be considered together?
___

1. Data Science without data is an empty shell
2. Collecting tons of data as an end in itself is pointless
3. The volume, velocity and variety of data collected by IoT open new opportunities for Data Science
4. Because there is a lot of hype around both fields

## ▸ IoT, Data Science, Human ecosystem

* **IoT augments**, mimics this **sensing process**
* **Data Science** augments, mimics this **representation process**
* **Humans define objectives** and goals, give meaning and interpret based on current knowledge


## ▸ The value chain and the oil analogy

* **"Data is the new oil"** is repeated ad nauseam in the media, and the analogy is often carried further
* **Upstream**: raw data is collected (data supply)
* **Midstream**: process of refining, cleaning, pre-processing, analyzing and modeling data
* **Downstream**: delivering data products to end users
* **Pipeline**: the value chain as a whole


## ▸ AN EXAMPLE OF AN AIR QUALITY MONITORING SYSTEM

## ▸ IoT network senses

* Particulate matter (PM)
* CO2
* SO2
* ...


## ▸ Analytics pipeline fetches secondary data

* Social networks
* Meteorological forecasts
* Crowd-sourced data


*Example of crowd-sourced data: https://blog.safecast.org/*



## ▸ Analytics pipeline "stirs" data

* Performs **spatial interpolation of measurements** with associated uncertainty
* Performs **Sentiment Analysis** on social network data
* **Harnesses** Meteorological **forecasts**
* And **outputs diagnosis / prognosis**
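
As a rough illustration of the spatial interpolation step, here is a minimal sketch using inverse distance weighting (IDW). This is a deliberately simple stand-in: unlike geostatistical methods, plain IDW does not return an uncertainty estimate. The station coordinates, readings and the `idw_interpolate` helper are hypothetical.

```python
import math

def idw_interpolate(stations, target, power=2.0):
    """Estimate a value at `target` from (x, y, value) station tuples
    using inverse distance weighting (IDW)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with a station: return its measurement.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical particulate matter readings: (x, y, value)
stations = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0), (0.0, 2.0, 20.0)]
estimate = idw_interpolate(stations, target=(1.0, 0.0))
print(f"Estimated value at (1, 0): {estimate:.1f}")
```

Closer stations weigh more; the `power` parameter controls how quickly a station's influence fades with distance.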

## ▸ Decision Makers ... 

* **Interpret** Data Science pipeline outputs in a wider context
* Control road traffic, enforce usage restrictions, ...
* Provide feedback to improve the whole value chain if required


## ▸ Why do Decision Makers have the last word?

1. They understand the overall context: risk, uncertainty, unmodelled parameters, ...
2. They can decide when to inject additional domain knowledge into the pipeline
3. They are the bosses!


# II. HOW ▸ ARCHITECTURAL CONSIDERATIONS

## ▸ Once a task is at hand

* **What** type of analytics, algorithms, ...?
* **When** to perform calculation?
* **Where** to perform calculation and store data [if required]?


## ▸ Use case and domain knowledge first

* What is the use case?
* What decision do we want to make?
* Which data will support the decision?
* What are the resources (financial, human, technical)?

*As opposed to the “let’s collect everything we can, then we will see what we can do with it” syndrome*


## ▸ Storing or not storing

* Store, persist, accrue data over time in a **"Data Lake"**
* Discard data once harnessed by the analytics pipeline: **"Stream Analytics"**
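
To make the "Stream Analytics" option concrete, here is a minimal sketch that keeps a running mean and variance (Welford's algorithm) while discarding each reading after use. The `RunningStats` class and the `sensor_stream` generator are hypothetical stand-ins for a real sensor feed.

```python
class RunningStats:
    """Running mean and variance (Welford's algorithm) without storing readings."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two readings.
        return self._m2 / (self.count - 1) if self.count > 1 else float("nan")

def sensor_stream():
    """Hypothetical stand-in for an IoT sensor feed."""
    yield from [21.0, 22.0, 23.0, 24.0]

stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # nothing but the summary is kept after this call

print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant however long the stream runs, which is the trade-off against a Data Lake: the summary survives, the raw readings do not.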


## ▸ Delegating or centralizing

* Computing load can be balanced/spread across the network topology
* When computing takes place at the edge of the network, this is called **Edge Computing**
* Both simple and complex computing tasks can be performed at the edge

[http://www.nvidia.co.uk/object/embedded-systems-dev-kits-modules-uk.html](http://www.nvidia.co.uk/object/embedded-systems-dev-kits-modules-uk.html)

## ▸ Learning vs. predicting


* **Training phases** of algorithms might require considerable resources (Neural Networks)
* But once trained, the **prediction phase** requires few resources

## ▸ Quiz

1. There is a common architecture standard for IoT/Data Science technical implementations
2. Big data infrastructure is required most of the time
3. Training an artificial Neural Network is resource intensive; predicting with it is not
4. Targeted use cases drive the type and quantity of data to be collected


# III. HOW ▸ ALGORITHMIC CONSIDERATIONS

## ▸ Data taxonomy

* **QUANTITATIVE** vs **QUALITATIVE**
* **STRUCTURED** vs **UNSTRUCTURED**

## ▸ Keep data alive

* There is **no ideal structure** for data
* All **depends on purpose** of use
* BUT in all cases, every cleaning or transformation step should be **reproducible and automated**


## ▸ Data wrangling, munging

> "… is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics"

https://en.wikipedia.org/wiki/Data_wrangling

## ▸ What is "raw", primary data?
___

* **Data collected from a source** (for instance a sensor, data entry clerk, …)
* Has **not been subject to prior processing**
* **BUT IS RELATIVE**: data is constantly re-aligned with analysis and predictive goals

* Ex: temperature sensors sending data might have a systematic bias that could be corrected based on other stations; an error-detection mechanism (checksum) might reveal that some of the transmitted data are corrupted and need to be removed, …
* Data entry clerks and spelling mistakes, …
* Prior processing: automatically by a program, manually by an analyst, researcher, …
* Data pre-processed (cleaned, reshaped, …) for one purpose might not be suitable for another. In reality, data should be considered a living entity rather than something set in stone, constantly re-aligned to the needs (as reflected by the variety of data wrangling techniques, the versatility of tools, and the Big Data tool ecosystem)

## ▸ Some typical data cleaning tasks

* Handling **missing data**
* Handling **duplicate data**
* String manipulation (to fix typos, ...)
* ...
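
The three cleaning tasks above can be sketched with pandas as follows. The toy sensor log, the column names and the `clean_readings` helper are assumptions made for illustration; dropping rows with missing values is only one possible policy (imputing them is another).

```python
import pandas as pd

def clean_readings(raw):
    """Reproducible cleaning step: the same function can be re-run on new data."""
    return (
        raw
        # String manipulation: strip stray spaces and normalize case
        .assign(station=lambda df: df["station"].str.strip().str.upper())
        # Duplicate data: keep the first occurrence of identical rows
        .drop_duplicates()
        # Missing data: here we simply drop rows without a measurement
        .dropna(subset=["pm25"])
        .reset_index(drop=True)
    )

# Hypothetical raw log with a typo, a duplicate row and a missing value
raw = pd.DataFrame({
    "station": ["A", "A", "b ", "B", "C"],
    "pm25": [12.0, 12.0, None, 30.0, 18.0],
})

cleaned = clean_readings(raw)
print(cleaned)
```

Wrapping the steps in a function keeps the cleaning automated rather than a one-off manual edit.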


## ▸ Some typical data preparation tasks

* Replacing values, binning, indexing, combining, reshaping, ...
* Joining, merging, pivoting, stacking with quantitative data exploration in mind
* Performing feature engineering
* ...
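
A few of these preparation tasks (reshaping, joining, binning) might look like this in pandas. The tables, the `zone` metadata and the bin thresholds are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical long-format readings: one row per station and pollutant
readings = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "pollutant": ["pm25", "so2", "pm25", "so2"],
    "value": [12.0, 3.0, 40.0, 8.0],
})
# Hypothetical station metadata
stations = pd.DataFrame({"station": ["A", "B"], "zone": ["north", "south"]})

# Reshaping: one row per station, one column per pollutant
wide = readings.pivot(index="station", columns="pollutant", values="value").reset_index()

# Joining: attach station metadata
merged = wide.merge(stations, on="station", how="left")

# Binning: turn a quantitative variable into ordered categories
# (the 25 and 50 thresholds are arbitrary, for illustration only)
merged["pm25_level"] = pd.cut(merged["pm25"], bins=[0, 25, 50], labels=["low", "high"])

print(merged)
```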

## ▸ In reality ...
___

* Data cleaning and preparation is **often reported to take up 80% or more** of a data scientist’s time
* SO, the whole process **MUST BE automated and reproducible**


## ▸ Exploratory Data Analysis [EDA]


## ▸ EDA motivation

* To get acquainted with and **gain insights** into a "never-seen" dataset
* To **check assumptions** required by modeling approaches
* To identify **potential correlations** between variables
* To identify **potential similarities and clusters**
* ...

## ▸ EDA as detective work

* In its tabular form, data is **not really conducive to interpretation**
* The quest is to **find the "angle of view"** that will suddenly reveal unexpected properties
* But it is a **subtle game** and requires careful interpretation and critical thinking


## ▸ Does correlation imply causation?
___

![correlation.png](img/correlation.png)


https://www.xkcd.com/552/



## ▸ Why quantitative AND graphical EDA?

![anscombe-quartet.png](img/anscombe-quartet.png)
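
The quantitative side of the argument can be checked with a short sketch: the four Anscombe datasets have nearly identical means and correlations, even though their plots look very different. Only the standard library is used; the `pearson` helper is written here for illustration.

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs)
    spread_y = sum((y - my) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Anscombe's quartet (datasets I to III share the same x values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {name: (mean(ys), pearson(xs, ys)) for name, (xs, ys) in quartet.items()}
for name, (y_mean, r) in summary.items():
    print(f"{name}: mean(y) = {y_mean:.2f}, r = {r:.3f}")
```

Summary statistics alone cannot tell these datasets apart; the plots can.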


## ▸  Identifying clusters/patterns


![kmeans-final-clustering.png](img/kmeans-final-clustering.png)

https://github.com/rasbt/python-machine-learning-book
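
For readers curious about what sits behind such a clustering figure, here is a minimal k-means sketch in NumPy. It is not the code from the referenced book: the toy points are hypothetical and the initial centroids are chosen by hand for reproducibility, since k-means can converge to a poor local optimum depending on where it starts.

```python
import numpy as np

def assign_labels(points, centroids):
    """Index of the nearest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(points, initial_centroids, n_iter=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign_labels(points, centroids)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
    return assign_labels(points, centroids), centroids

# Hypothetical toy data: two well-separated groups of four points
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
])

# Hand-picked starting centroids: the first and the last point
labels, centroids = kmeans(points, initial_centroids=points[[0, -1]])
print(labels)
print(centroids)
```

Library implementations typically add smarter initialization and several restarts to reduce the dependence on the starting centroids.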

## ▸  Spatializing:  # measurements Safecast - World


![safecast-world.png](img/safecast-world.png)

https://www.kaggle.com/franckalbinet/safecast-exploratory-data-analysis-part-i

## ▸  Spatializing:  # measurements Safecast - Japan


![safecast-jpn.png](img/safecast-jpn.png)

https://www.kaggle.com/franckalbinet/safecast-exploratory-data-analysis-part-i

## ▸ Predicting with Machine/Deep Learning

## ▸  Explanatory versus predictive goals


* In many disciplines, there is a near-exclusive use of **statistical modeling for causal explanation**;
* With the **assumption** that models with high explanatory power inherently have high predictive power;
* In Machine and Deep Learning, the **focus is instead on predictive power**, monitored through chosen evaluation metrics.

## ▸ Finding the "best" line 

![regression-which-line.png](img/regression-which-line.png)


## ▸ The best line with residuals

![regression-best-line-errors.png](img/regression-best-line-errors.png)
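
One common way to define the "best" line is ordinary least squares: pick the slope and intercept that minimize the sum of squared residuals. The sketch below assumes that criterion and uses hypothetical data points.

```python
import numpy as np

# Hypothetical noisy measurements
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Least-squares fit of a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: vertical distances between the data and the fitted line
residuals = y - (slope * x + intercept)
sum_squared_residuals = float(np.sum(residuals ** 2))

# Any other line, e.g. y = 2x + 1, gives a larger sum of squared residuals
other_line_error = float(np.sum((y - (2.0 * x + 1.0)) ** 2))

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"best line: {sum_squared_residuals:.3f}, other line: {other_line_error:.3f}")
```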


## ▸ Model capacity: the best line

![best-line-really.png](img/best-line-really.png)

 

## ▸ Fiddling with model capacity

![polynomial-1-2-100.png](img/polynomial-1-2-100.png)
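
Model capacity can be explored by fitting polynomials of increasing degree and comparing errors on data used for fitting with errors on data held out. This is a sketch on synthetic data: the quadratic signal, the noise level and the degrees 1, 2 and 9 are assumptions (a degree of 100, as in the figure, would need many more points than this toy set).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)  # hypothetical quadratic signal plus noise

# Alternate points between a training set and a validation set
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

errors = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_error, val_error) in errors.items():
    print(f"degree {degree}: train MSE = {train_error:.4f}, validation MSE = {val_error:.4f}")
```

The training error can only go down as the degree grows; the validation error is what reveals whether the extra capacity helps on unseen points.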

 

## ▸  That's all good but how to choose the right model capacity?


![machine_learning.png](img/machine_learning.png)


https://xkcd.com/1838/



## ▸ Regularization as a "wood chisel"

* Each hypothesis space is more or less prone to overfitting
* When overfitting, ideally we would like a "knob" that allows us to **regularize** the model
* The **sequence Overfit -> Regularize** with training/validation/test datasets is our **"wood chisel"** to find and fine-tune the proper model
* Let's talk about that tonight at Adriatico?

https://en.wikipedia.org/wiki/Occam%27s_razor
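
One possible "knob" is an L2 penalty (ridge regression) on the coefficients of a high-degree polynomial. The sketch below assumes that choice: it fits a degree-9 polynomial to synthetic data with several penalty values and keeps the one with the lowest validation error. The data, the penalty grid and the `ridge_polyfit` helper are hypothetical; for brevity the intercept is penalized too and no separate test set is held out.

```python
import numpy as np

def ridge_polyfit(x, y, degree, penalty):
    """Least-squares polynomial fit with an L2 penalty on the coefficients."""
    X = np.vander(x, degree + 1)  # columns ordered from highest to lowest power
    # Augmented system: minimizes ||X w - y||^2 + penalty * ||w||^2
    X_aug = np.vstack([X, np.sqrt(penalty) * np.eye(degree + 1)])
    y_aug = np.concatenate([y, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)  # hypothetical quadratic signal plus noise
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

results = {}
for penalty in (0.0, 1e-3, 1e-1, 10.0):
    w = ridge_polyfit(x_train, y_train, degree=9, penalty=penalty)
    val_mse = float(np.mean((np.polyval(w, x_val) - y_val) ** 2))
    results[penalty] = (float(np.linalg.norm(w)), val_mse)

# Turn the knob: keep the penalty with the lowest validation error
best_penalty = min(results, key=lambda p: results[p][1])

for penalty, (norm, val_mse) in results.items():
    print(f"penalty {penalty:g}: ||w|| = {norm:.3f}, validation MSE = {val_mse:.4f}")
print("selected penalty:", best_penalty)
```

A larger penalty shrinks the coefficients toward zero, trading flexibility for stability; the validation set decides how far to turn the knob.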


## ▸ Deep Neural Networks can virtually fit anything


![tf-deep-neural-net-to-the-rescue.png](img/tf-deep-neural-net-to-the-rescue.png)


http://playground.tensorflow.org/ (with settings used: https://goo.gl/6BJ83E)


## ▸ Object Detection - YOLO example

![yolo.png](img/yolo.png)

https://pjreddie.com/darknet/yolo/

# IV. THE TOOLS & RESOURCES

> **"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."**

-- Brian W. Kernighan, co-author of The C Programming Language and the "K" in "AWK"


## ▸ Python [possible] ecosystem for Data Analytics

![tools.png](img/tools.png)

## ▸ Outstanding Python reference books [companion notebooks]

![python-reference-books.png](img/python-reference-books.png)

* https://github.com/wesm/pydata-book
* https://github.com/ageron/handson-ml
* https://github.com/fchollet/deep-learning-with-python-notebooks


## ▸ Outstanding R reference books [free books]

![R-reference-books.png](img/R-reference-books.png)

* http://r4ds.had.co.nz/
* http://adv-r.had.co.nz/
* http://r-pkgs.had.co.nz/

## ▸ Outstanding online courses

* https://www.coursera.org/learn/machine-learning
* https://www.deeplearning.ai/
* http://www.fast.ai/
* https://www.datacamp.com/courses
* ...

## ▸ A diversity of approaches but all good!

![dl-learning-resources.png](img/dl-learning-resources.png)

# V. HANDS-ON SESSIONS INTRO.

## ▸ In medias res

1. Install a minimal technology stack
2. You will be given a series of resources on NumPy, Pandas, ...
3. We will define achievable data analysis objectives together
4. The sessions will take the form of a Hackathon

*Form balanced teams!*

## ▸ Questions

* Who has previous experience in NumPy and Pandas?
* Who has previous experience in Git and GitHub?


## ▸ Supporting notebooks

* `resources/python-language-essentials-for-data-science.ipynb`
* `resources/data-cleaning-and-preparation.ipynb`
* more coming ...

Toolbox install: `hands-on-sessions/README.md`