# BIOS 823: Statistical Programming for Big Data


## Overview

The topics below are intended to give you a mental map of the knowledge you need to thrive as a data scientist, one that you can continue to build on long after you graduate. This course focuses mostly on practical skills for data analysis. Algorithmic, statistical and machine learning theory will be reviewed where appropriate but is not emphasized. This does not mean that theory is unimportant; I am assuming that you have learned or will learn it in other courses. There are probably too many topics to cover realistically in one semester, so some will be touched on only briefly and others will be introduced via homework assignments.

## Learning objectives

- [ ] Develop a conceptual understanding of the data ecosystem in health care enterprises
    - [ ] From normalized database to data warehouse
    - [ ] From data warehouse to data lake
    - [ ] From extract-transform-load (ETL) to workflow management
    - [ ] From SQL to NoSQL
    - [ ] From workstation to cluster
    - [ ] From BI to data science
    - [ ] Data engineers and data scientists
- [ ] Develop general skills to perform analysis of health care data
    - [ ] Basic Python data science stack
        - [ ] Functional programming style with `operator`, `functools`, `itertools` and `toolz`
        - [ ] Numerics with `numpy`, `scipy`, `pandas`, `statsmodels`
        - [ ] Graphics with `matplotlib`, `seaborn`
        - [ ] Machine learning with `scikit-learn` and `tensorflow` 
    - [ ] File formats
        - [ ] Delimited files
        - [ ] JSON
        - [ ] XML
        - [ ] HDF
        - [ ] Avro
        - [ ] Parquet
    - [ ] Relational databases and SQL
        - [ ] Basic relational theory
        - [ ] Normalization and de-normalization (star schema)
        - [ ] Basic queries
        - [ ] Sub-queries
        - [ ] Joins
        - [ ] Aggregate and window functions
        - [ ] User-defined functions (UDF)
    - [ ] NoSQL databases
        - [ ] From ACID to BASE
        - [ ] Key-value database example: Redis
        - [ ] Document database example: MongoDB
        - [ ] Graph database example: Neo4j
        - [ ] Column family database example: HBase
- [ ] Develop "Big data" skills
    - [ ] Multi-core and distributed computing
        - [ ] Parallel, concurrent, distributed
        - [ ] Synchronous and asynchronous
        - [ ] Simple parallel programs with `concurrent.futures`, `multiprocessing`, `joblib`
        - [ ] The Hadoop base ecosystem: `HDFS`, `YARN`, `MapReduce`
        - [ ] Distributed computing with `dask`
        - [ ] Distributed computing with Spark
        - [ ] Spark SQL
        - [ ] Spark MLlib
        - [ ] Spark Streaming
        - [ ] Spark GraphFrames
- [ ] Develop skills for analysis of specific data types (local and distributed)
    - [ ] Application 1: Analysis of structured data with `pandas`, `dask`, `spark` and `datashader`
    - [ ] Application 2: Analysis of free text with `nltk`, `spacy` and `spark-nlp`
    - [ ] Application 3: Analysis of time series data with `statsmodels` and `prophet`
    - [ ] Application 4: Analysis of genome data with `toolz`, `biopython` and `adam`
    - [ ] Application 5: Analysis of image data with `scikit-image` and `tensorflow`
    - [ ] Application 6: Analysis of network data with `networkx`, `neo4j` and `graphframes`
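
To give a flavor of the functional programming style listed among the objectives above, here is a minimal sketch that uses only the standard library modules `operator`, `functools` and `itertools`. The `stays` records and the `total_by_key` helper are hypothetical names invented for this illustration; they are not course materials.

```python
from functools import reduce
from itertools import groupby
from operator import add, itemgetter

# Hypothetical records for illustration: (ward, length_of_stay_in_days)
stays = [("icu", 4), ("general", 2), ("icu", 7), ("general", 3), ("general", 1)]


def total_by_key(pairs):
    """Sum the second element of each pair, grouped by the first element."""
    # groupby only merges adjacent items, so sort by the key first.
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: reduce(add, map(itemgetter(1), group), 0)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


totals = total_by_key(stays)
print(totals)
```

The same grouping could be written with an explicit loop; the point of the sketch is that small, composable functions (`itemgetter`, `add`, `reduce`, `groupby`) can express it without mutable state.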

## Class Schedule (likely to evolve)

### Overview and Review

1. [ ] Introduction
2. [ ] Foundations I (Python functional style)
3. [ ] Foundations II (Python data science stack)

### Data storage and retrieval

1. [ ] File types for data storage and ETL
2. [ ] SQL database and warehouse schemas
3. [ ] SQL: Creation and manipulation
4. [ ] SQL: Basic queries and sub-queries
5. [ ] SQL: Window queries and UDFs
6. [ ] Key-value and document databases
7. [ ] Column-family and graph databases
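
As a small preview of the window query and UDF sessions, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `visits` table, its rows, and the `cost_band` function are assumptions made up for this example, and the window function requires a Python build linked against SQLite 3.25 or later.

```python
import sqlite3

# In-memory database with a hypothetical table of patient visits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient TEXT, day INTEGER, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("a", 1, 100.0), ("a", 2, 50.0), ("b", 1, 200.0), ("b", 3, 25.0)],
)


def cost_band(cost):
    """Hypothetical user-defined function: label a cost as 'high' or 'low'."""
    return "high" if cost >= 100 else "low"


# Register the Python function so it can be called from SQL with one argument.
conn.create_function("cost_band", 1, cost_band)

# The window function computes a running total of cost per patient.
rows = conn.execute(
    """
    SELECT patient,
           day,
           SUM(cost) OVER (PARTITION BY patient ORDER BY day) AS running_cost,
           cost_band(cost) AS band
    FROM visits
    ORDER BY patient, day
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike a `GROUP BY` aggregate, the window function keeps one output row per input row, which is why each visit still appears alongside its running total.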

Midterm I (10%)

### Distributed computing

1. [ ] The Hadoop ecosystem, HDFS and YARN
2. [ ] MapReduce and other tools
3. [ ] Dask data frames
4. [ ] Dask arrays and bags
5. [ ] Setting up Dask on AWS
6. [ ] Spark basics
7. [ ] Spark machine learning
8. [ ] Spark and streaming data
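
The MapReduce programming model that appears in this part can be illustrated without a cluster. The following is a single-process sketch of the map, shuffle and reduce phases for a word count. It is not Hadoop code, and the function names and sample documents are hypothetical choices for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data big ideas", "data lake data warehouse"]
counts = word_count(docs)
print(counts)
```

In a real distributed system the map and reduce calls would run on different machines and the shuffle would move data across the network; here everything runs in one process so that the data flow is easy to follow.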

Midterm II (10%)

### Data analysis

1. [ ] Structured data, including geographical data
2. [ ] Free text
3. [ ] Images
4. [ ] Time series
5. [ ] Graphs and networks
6. [ ] Genomics

Final Exam (30%)


## Homework

- [ ] Functional programming in Python
- [ ] The Python data science stack
- [ ] File formats
- [ ] SQL
- [ ] NoSQL
- [ ] Parallel and asynchronous programming
- [ ] Distributed computing with Dask
- [ ] Distributed computing with Spark
- [ ] Structured and geographical data
- [ ] Time series/free text data
- [ ] Genome/image data
- [ ] Network data


## Exams

- All exams require programming
- Exams (or parts of them) may be closed-book

### Midterm I

- Covers parts 1-2

### Midterm II

- Covers parts 3-4

### Final exam

- Covers all parts