# Data Science at Scale

## What do we mean by "scale"?

* Scale is determined by
    * Size of data
    * Capacity of hardware

## Big Data is

* data you can't open in Excel
* data you can't fit in RAM
* data you can't fit on a single machine

## A data scientist operates on many scales

* Can't open in Excel $\rightarrow$ use `Pandas` and chunking
* Can't fit in RAM $\rightarrow$ use a database or stream the file
* Can't fit on a single machine $\rightarrow$ use Hadoop and `PySpark`
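To make the middle case concrete, here is a minimal sketch of streaming a CSV file one row at a time with Python's standard `csv` module, keeping only running totals in memory. The toy data, the column names `Publisher` and `Height`, and the helper name `streaming_mean_by_group` are assumptions made for illustration; the `io.StringIO` object stands in for a file too large to load at once.

```python
import csv
import io
from collections import defaultdict


def streaming_mean_by_group(lines, group_col, value_col):
    """Compute the mean of value_col per group_col, reading one row at a time."""
    totals = defaultdict(float)  # running sum per group
    counts = defaultdict(int)    # running row count per group
    for row in csv.DictReader(lines):
        totals[row[group_col]] += float(row[value_col])
        counts[row[group_col]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical toy data standing in for a file too large for RAM.
toy_csv = io.StringIO(
    "name,Publisher,Height\n"
    "Hero A,Marvel Comics,180\n"
    "Hero B,Marvel Comics,200\n"
    "Hero C,DC Comics,190\n"
)

means = streaming_mean_by_group(toy_csv, "Publisher", "Height")
print(means)
```

With a real file, the same function could take an open file handle (`open(path, newline='')`) instead of the `StringIO` object. Reading with `pd.read_csv(..., chunksize=...)` follows the same idea, except that each step yields a small `DataFrame` rather than a single row.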

## Example - Average Super Hero Height - Pandas

In [2]:
!pip install  dfply


In [8]:
!pip install pyspark


In [4]:
import pandas as pd
from dfply import *

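# Load the hero data and list the two publishers we want to compare
heroes = pd.read_csv('./data/heroes_information.csv')
major_publisher = ['Marvel Comics', 'DC Comics']

# dfply pipeline: filter rows, group by publisher, then average the heights
(heroes >> 
   filter_by(X.Publisher.isin(major_publisher)) >>
   group_by(X.Publisher) >>
   summarise(mean_height = mean(X.Height)))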

|   | Publisher | mean_height |
|---|---|---|
| 0 | DC Comics | 91.072093 |
| 1 | Marvel Comics | 142.756443 |
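The same filter, group, and aggregate steps can also be written in plain `pandas` without `dfply`. The sketch below is one possible equivalent, not the course's own code; it runs on a small hypothetical table (the rows and the `name` column are made up) so that it does not depend on `heroes_information.csv`.

```python
import io

import pandas as pd

# Hypothetical toy data; the real example reads heroes_information.csv.
toy_csv = io.StringIO(
    "name,Publisher,Height\n"
    "Hero A,Marvel Comics,180\n"
    "Hero B,Marvel Comics,200\n"
    "Hero C,DC Comics,190\n"
    "Hero D,Other Publisher,150\n"
)
heroes_toy = pd.read_csv(toy_csv)
major_publisher = ["Marvel Comics", "DC Comics"]

result = (
    heroes_toy[heroes_toy["Publisher"].isin(major_publisher)]  # filter rows
    .groupby("Publisher", as_index=False)                      # group by publisher
    .agg(mean_height=("Height", "mean"))                       # average the heights
)
print(result)
```

Each method call lines up with one verb in the `dfply` pipeline: boolean indexing for `filter_by`, `groupby` for `group_by`, and `agg` for `summarise`.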


## Example - Average Super Hero Height - `sqlalchemy`

In [5]:
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine, func
from heroes import Base, Hero

engine = create_engine('sqlite:///heroes.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()


# major_publisher is the list defined in the pandas example above
session.query(Hero.publisher, func.avg(Hero.height).label('avg_ht')).\
  filter(Hero.publisher.in_(major_publisher)).\
  group_by(Hero.publisher).\
  all()

[('DC Comics', 91.07209302325582), ('Marvel Comics', 142.75644329896906)]
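For comparison, the ORM query above corresponds roughly to a plain SQL `GROUP BY`. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table name `heroes`, its columns, and the toy rows are assumptions for illustration and are not taken from the course's `heroes.db`.

```python
import sqlite3

# In-memory database with a hypothetical heroes table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heroes (name TEXT, publisher TEXT, height REAL)")
conn.executemany(
    "INSERT INTO heroes VALUES (?, ?, ?)",
    [
        ("Hero A", "Marvel Comics", 180.0),
        ("Hero B", "Marvel Comics", 200.0),
        ("Hero C", "DC Comics", 190.0),
        ("Hero D", "Other Publisher", 150.0),
    ],
)

major_publisher = ["Marvel Comics", "DC Comics"]
# One "?" placeholder per publisher keeps the query parameterized.
placeholders = ", ".join("?" for _ in major_publisher)
query = f"""
    SELECT publisher, AVG(height) AS avg_ht
    FROM heroes
    WHERE publisher IN ({placeholders})
    GROUP BY publisher
    ORDER BY publisher
"""
rows = conn.execute(query, major_publisher).fetchall()
print(rows)
conn.close()
```

The `WHERE ... IN`, `GROUP BY`, and `AVG` clauses play the same roles as `filter`, `group_by`, and `func.avg` in the ORM version; the database does the aggregation, so only the small result set has to fit in Python's memory.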

## Example - Average Super Hero Height - `pyspark`

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()
df = spark1.read.csv('data/heroes_information.csv', inferSchema=True, header=True)

df.where(col('Publisher').isin(major_publisher)).\
   groupBy("Publisher").\
   agg(mean('Height')).\
   show()

+-------------+------------------+
|    Publisher|       avg(Height)|
+-------------+------------------+
|Marvel Comics|142.75644329896906|
|    DC Comics| 91.07209302325582|
+-------------+------------------+



## <font color="red"> Exercise 1: Compare and Contrast </font>

<img src="img/all_three_1.png" width=600>

Your thoughts here

## Filter using in/isin

<img src="img/all_three_2.png" width=600>

## Group by publisher

<img src="img/all_three_3.png" width=600>

## Aggregate the mean height

<img src="img/all_three_4.png" width=500>

## Course outline

* Part 1 - Working with Tabular Data

* Part 2 - Working with Unstructured Data


## Part 1 - Working with Tabular Data

* Cleaning and prepping data in `Pandas` (2-3 weeks)
* SQLAlchemy (2 weeks)
* Spark SQL (3 weeks)

## Part 2 - Working with Unstructured Data

* Introduction to functional list processing (3 weeks)
* Processing Unstructured Data with Spark
* Project