## 01 Big Data
### How big is big data
- Rule of thumb: the more experience someone has working with data/modern techniques, the larger a dataset has to be to qualify as 'big data'
- **Big data** for our purposes is data that is too large to work with on a local machine
    - This does not mean a database that is too large to clone to a local machine
    - Rather, the subset of data you actually want to work with is itself too large to handle locally
- At least gigabytes of data:
    - Gigabytes: millions of rows
    - Terabytes: billions of rows
    - Petabytes/exabytes: most likely require specialized machines
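
The row-count rule of thumb above can be sanity-checked with some quick arithmetic. This is only a rough sketch: the 1,000 bytes-per-row figure and both function names are assumptions made for illustration, and real row widths vary widely.

```python
# Assumed average row width; purely illustrative, real tables vary widely.
BYTES_PER_ROW = 1_000


def estimated_size_bytes(rows, bytes_per_row=BYTES_PER_ROW):
    """Rough dataset size: number of rows times average bytes per row."""
    return rows * bytes_per_row


def human_readable(num_bytes):
    """Format a byte count using decimal units (1 KB = 1,000 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


millions = human_readable(estimated_size_bytes(1_000_000))      # millions of rows
billions = human_readable(estimated_size_bytes(1_000_000_000))  # billions of rows
```

Under that assumed row width, a million rows lands at about a gigabyte and a billion rows at about a terabyte, which matches the scale listed above.
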
### Issues of big data
- Accessing big data: methods used previously won't work, including single-job SQL queries
- Data science vs. data engineering: not all companies (esp. smaller ones) differentiate these the same way, or at all
    - Engineering covers designing a storage system
    - Science is concerned with accessing, manipulating, analyzing, and interpreting
- Understanding big data: visualization and summary statistics still apply, but implementing them can change
    - Outliers: could be thousands or more, not just a few
    - Visualization: keep in mind...
        - Visualizations are simplifications; they inherently gloss over some data
        - The more data a visualization leaves out, the more the analysis can be affected
- Training: there are huge advantages to training models on big data
    - Some models (e.g., neural networks) require large amounts of data to perform well
    - But:
        - Can't work with data locally
        - May be too slow to work with all at once for modeling purposes
        - Need to have methods for efficiently working with big data
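
As one illustration of how summary statistics "still apply" while their implementation changes, the sketch below computes a mean and variance in a single pass over chunks, so the full dataset never has to sit in memory at once. It uses Welford's online algorithm, which is a choice made here and not something these notes prescribe. `RunningStats` and `iter_chunks` are hypothetical names, and the ten-element list is a toy stand-in for a large column.

```python
class RunningStats:
    """Running count, mean, and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold a single value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def iter_chunks(values, chunk_size):
    """Yield successive chunks, standing in for batches read from remote storage."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


data = [float(x) for x in range(1, 11)]  # toy stand-in for a large column
stats = RunningStats()
for chunk in iter_chunks(data, chunk_size=3):
    for value in chunk:
        stats.update(value)
```

Only the three running values are kept between chunks, so memory use stays constant no matter how many rows pass through.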

## 02 Hadoop and Big Data Storage
- Hadoop for our purposes refers to the larger environment of Hadoop-based software used for storing, moving, and analyzing big data under a unified framework
- Components of Hadoop discussed here are only part of the picture
    - Focus here is on the parts that are useful for model building and analytics
    - Much of the infrastructural backend won't be covered
### Key Components
- Hadoop (or a Hadoop cluster) has four core components
1. Common: shared utilities and libraries; typically handled by engineering
2. YARN: scheduling and resourcing tool
3. HDFS (Hadoop Distributed File System): distributed data store with fast access tools
    - Data is distributed across many machines/drives instead of a single one
    - Advantages include speed of access & stability (not dependent on a single machine)
4. MapReduce: data processing tool for distributed systems
    - Basis of the Hadoop project as a whole
    - Pig: tool for pulling raw data from HDFS
        - Functionally equivalent to MapReduce, with a more intuitive, query-based interface
        - Navigating the data can still be a challenge, but that difficulty comes from the internal data model and structures
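
To make the MapReduce idea more concrete, here is a minimal single-machine sketch of the classic word-count pattern in plain Python. It does not use Hadoop at all: the function names and sample documents are made up for illustration, and a real cluster would run the map and reduce steps on many machines, with the framework handling the shuffle between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


# Each document stands in for a block of data stored on a different machine.
documents = ["big data is big", "data is distributed"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each `map_phase` call only needs its own document and each `reduce_phase` call only needs its own key, both steps can be spread across machines, which is what makes the pattern suitable for distributed storage.
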
### Other Pieces
- Data scientists usually won't work directly with the core components above (except Pig)
- Data is most often set up with a querying layer
    - Hive is the most common for Hadoop
    - Could also be Presto or a Pig script
- Hive: allows the use of SQL-like tools and structures on large datasets in HDFS
    - Speed: HiveQL queries are slow, partly because of data size but also because Hive itself is generally slow
        - Hive queries can take hours or days
        - Faster query engines (Presto) and database types (Redshift, Vertica) exist but are costly and require more hardware
    - Syntax: differs slightly around joins and datetimes (as is also the case between SQL dialects)
    - Other differences are minor from a user perspective
        - Most differences are engineering related, stemming from the nature of working on a distributed database
        - Working with distributed data works and feels similar to local work

## 03 Distributed Computing and Spark
- Using standard tools on big data translates to slow analysis & modeling
- Big data also has its own tools/techniques

### Multicore Computing
- **Parallelization**: packages like scikit-learn (`sklearn`) have an `n_jobs` option that sets the number of cores to use when training a model
    - Trains on multiple cores simultaneously
    - Theoretically, processing time is divided by the number of cores (2 cores instead of 1 = half the time to train)
- Some models are easier to parallelize than others, but most can be parallelized in some way
    - Random forest: different cores generate different trees
    - Boosted trees: each tree depends on the one before it, so the parallelism happens within a tree (e.g., evaluating candidate splits on different cores)
    - SVM: doesn't parallelize well; it is memory intensive because it must keep track of the points near the margin
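
The "divide by the number of cores" figure is a best case. One common way to reason about the gap is Amdahl's law, which is not covered in these notes and is brought in here only as an illustration: if only part of the training work can run in parallel, the serial remainder caps the speedup. The function name and the example fractions below are assumptions.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    cores: number of cores working on the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Fully parallel work on 2 cores: the ideal case of half the training time.
ideal = amdahl_speedup(1.0, 2)

# Assumed example: 90% of the work parallelizes, spread over 8 cores.
realistic = amdahl_speedup(0.9, 8)
```

With everything parallel, two cores give exactly a 2x speedup. With 10% of the work stuck on one core, eight cores give a little under 5x instead of 8x, which is one reason models that parallelize cleanly (like random forests) benefit most from extra cores.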

### Spark
- Training/running models locally is often out of the question
    - Too much data
    - Too long to train
- Spark: distributed computing tool from the Apache suite, built up around Hadoop
    - PySpark: Python interface to Spark, making it easy to translate Python code to Spark
        - Looks nearly identical to Python
        - Requires infrastructure to be set up before it can run
    - There are also Spark versions of IPython/Jupyter notebooks and scikit-learn

## 04 Where's the code?
- No cluster code (Hadoop, Hive, or Spark) has been written for this section, for three reasons:
    1. Functionally, the code to build these systems is generally outside the scope of data science
    2. Hundreds of cores/drives are required to use Hadoop or Spark as intended
        - Typically provided through cloud computing such as AWS
    3. Data science is more concerned with using these systems rather than architecting them
        - Stacks are largely unstandardized and have a large range of possible tools and implementations
        - Until working with a specific big data implementation, focus on understanding why the tools exist and what they do
        - Learn to use specific tools as needed rather than abstractly