# Data Engineering, Big Data, and Machine Learning on Google Cloud Platform

As the number of internet-connected devices continues to grow, the amount of data generated worldwide is becoming mind-bogglingly large. Because of this proliferation of information, it is more important than ever to be able to build applications that automatically derive insights from vast quantities of data.

The Google Cloud Platform (GCP) is built on four core infrastructure components, topped by a layer of products: underlying all applications is **security**; above this base are **compute power**, **storage** and **networking** tools; finally, atop these four, are high-level **big data and machine learning products** that abstract away difficult implementation work.

#### The Big Data Ecosystem

* Hadoop: the most popular MapReduce framework
* Spark: general purpose framework for SQL, streaming, ML and more
* Hive: data warehousing system and query language
* Pig: scripting language that can be compiled into Hadoop MapReduce jobs
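
To make the MapReduce model behind Hadoop more concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are invented for illustration, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data on the cloud", "big models on big data"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can in principle be parallelized, which is what makes the model attractive for very large inputs.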

#### Compute Power

Google Compute Engine is an Infrastructure as a Service (IaaS) solution that enables users to run virtual machines.

Google trains machine learning algorithms on a vast network of data centers. Smaller, trained versions of these models are then deployed onto consumer hardware. You can access Google's AI research via pre-trained AI models that can be utilized out-of-the-box.

As Moore's Law has slowed and growth in compute performance has plateaued, one solution has been to build Application-Specific Integrated Circuits (ASICs): chips tailored to one class of workload rather than to general-purpose computing. Google has created Tensor Processing Units (TPUs), ASICs with more memory and faster processors that are specifically optimized for machine learning workloads. TPUs in the cloud enable businesses to solve large, challenging problems in a way that would not otherwise have been possible.

`Google Cloud Platform > Compute > Compute Engine > VM instances`

An example process:

* Spin up a VM
* Perform processing
* Stop the VM
* Copy output into cloud storage
* Serve files to end users

#### Storage

One major way that cloud computing differs from typical desktop computing is that compute and storage are independent. The size of the disks attached to a compute instance does not limit the amount of data that can be processed and stored. Rather, data is transferred via pipelines into a cloud storage solution, for example an elastic storage bucket. The `gsutil` command-line tool, part of the Google Cloud SDK, provides a Unix-like syntax for copying files into buckets.
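
As a rough illustration of the Unix-like `gsutil` syntax, the sketch below builds a `gsutil cp` command in Python without running it. The helper name `gsutil_copy_command`, the bucket `my-example-bucket` and the file `output.csv` are all hypothetical; actually executing the command would require the Cloud SDK to be installed and authenticated.

```python
import shlex
from typing import List


def gsutil_copy_command(source: str, bucket: str, prefix: str = "") -> List[str]:
    """Build (but do not run) a `gsutil cp` command for one local file.

    The destination always ends in "/" so that it is treated as a folder
    inside the bucket rather than as an object name.
    """
    destination = f"gs://{bucket}/{prefix}".rstrip("/") + "/"
    return ["gsutil", "cp", source, destination]


# Hypothetical file, bucket and prefix, for illustration only.
command = gsutil_copy_command("output.csv", "my-example-bucket", "results")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where `gsutil` is available.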

Storage options:

* Unstructured data? Cloud Storage
* Structured, transactional data? Cloud SQL or Cloud Spanner (SQL-based retrieval) or Cloud Datastore (key-based retrieval)
* Structured data requiring analytics? Cloud Bigtable (low latency) or BigQuery (higher latency)

`Google Cloud Platform > Storage > Storage`

`Google Cloud Platform > Storage > SQL`

`Google Cloud Platform > Big Data > BigQuery`
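
The storage options listed above amount to a small decision tree, which can be sketched as a Python function. The function name `choose_storage` and its parameters are made up for this example, and the real choice usually also depends on factors such as scale and cost that are not modeled here.

```python
def choose_storage(structured: bool,
                   transactional: bool = False,
                   sql_access: bool = True,
                   low_latency: bool = False) -> str:
    """Return the storage product suggested by the decision list in the text."""
    if not structured:
        return "Cloud Storage"
    if transactional:
        # SQL-based retrieval versus key-based retrieval.
        return "Cloud SQL or Cloud Spanner" if sql_access else "Cloud Datastore"
    # Structured data used for analytics.
    return "Cloud Bigtable" if low_latency else "BigQuery"


if __name__ == "__main__":
    print(choose_storage(structured=False))
    print(choose_storage(structured=True, transactional=True, sql_access=False))
    print(choose_storage(structured=True, low_latency=True))
```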

**Cloud SQL** is a transactional RDBMS that is optimized for more WRITES than READS; such relational databases are best suited for transactional updates on relatively small datasets. In contrast, BigQuery is a big data analytics warehouse for reporting READS. 

**Cloud Storage** is better suited for unstructured data that is accessed infrequently but may be used at a later time, for example imported from a bucket into a Hadoop cluster for analysis or read into BigQuery.

**BigQuery** is a petabyte-scale, serverless data warehouse complete with data analytics and accessible via web UI, REST API, command line and third-party tools. It is composed of two services: a fast SQL query engine and fully managed data storage. You can feed large datasets through machine learning algorithms or apply GIS functions directly within BigQuery. BigQuery also supports arrays as a data type, as well as a special field type called RECORD, which is a STRUCT that can contain multiple associated fields. Also included with BigQuery are [public datasets](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&_ga=2.229760409.-449722783.1565046880) (both structured and unstructured) that are maintained by Google.
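
To give a feel for nested and repeated fields, the plain-Python sketch below models one row in which `customer` plays the role of a RECORD (STRUCT) and `items` plays the role of an ARRAY of RECORDs. This is only an analogy: the row, its field names and the helper `unnest_items` are hypothetical, and the flattening step merely resembles what BigQuery's `UNNEST` operator does in SQL.

```python
from typing import Any, Dict, Iterator, List

# Hypothetical row: one order with a nested customer and repeated line items.
order_row: Dict[str, Any] = {
    "order_id": 1001,
    "customer": {"name": "Ada", "country": "UK"},   # RECORD-like field
    "items": [                                       # ARRAY-like field
        {"sku": "A-1", "quantity": 2},
        {"sku": "B-7", "quantity": 1},
    ],
}


def unnest_items(row: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per array element, similar in spirit to UNNEST."""
    for item in row["items"]:
        yield {
            "order_id": row["order_id"],
            "customer_name": row["customer"]["name"],
            "sku": item["sku"],
            "quantity": item["quantity"],
        }


flat_rows: List[Dict[str, Any]] = list(unnest_items(order_row))
for flat_row in flat_rows:
    print(flat_row)
```

Keeping related fields nested in a single row avoids repeating the order and customer details for every line item, while flattening remains available when a query needs one row per item.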

**Cloud Dataprep** enables users to clean and transform data via a web UI. Variables can be inspected for the number of different categories or missing values. You can also create a data cleaning pipeline and apply it to streaming data via Dataflow.

#### Networking

Google’s private network is the largest in the world, comprising thousands of miles of fiber optic cable that provide petabit-scale bisection bandwidth. This network interconnects with the public Internet at many edge points of presence worldwide. When a user accesses a Google resource, the request is routed to the location that can respond with the lowest latency.

#### Security

Google provides security for lower-level concerns, such as physical hardware and data encryption, that would otherwise be difficult for many businesses to manage on their own. At a higher level, customizable user access controls in BigQuery enable fine-grained security for data and encryption keys.

#### Migrating Existing Setups to the Cloud

An on-premises setup using SparkML and Apache Hadoop clusters can be moved onto Google Cloud, using Cloud Dataproc to run machine learning jobs and Cloud SQL (or one of the other storage options) as the RDBMS. Migrated jobs can be run on specialized, ephemeral clusters that only need to be active while the job is running.

* Create a cluster
* Submit jobs to cluster

`Google Cloud Platform > Big Data > Dataproc`
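
The ephemeral-cluster workflow above (create a cluster, submit a job, then remove the cluster once the job is done) can be sketched as a sequence of `gcloud dataproc` commands. The Python helper below only builds the commands and does not run them; the helper name, the cluster name `ephemeral-ml`, the region and the job file `train_model.py` are all hypothetical placeholders.

```python
import shlex
from typing import List


def ephemeral_cluster_commands(cluster: str, region: str, job_file: str,
                               num_workers: int = 2) -> List[List[str]]:
    """Build the create / submit / delete commands for a short-lived cluster."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--num-workers={num_workers}"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_file,
              f"--cluster={cluster}", f"--region={region}"]
    # --quiet skips the interactive confirmation prompt on deletion.
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return [create, submit, delete]


# Placeholder values, for illustration only.
for command in ephemeral_cluster_commands("ephemeral-ml", "us-central1",
                                          "train_model.py"):
    print(shlex.join(command))
```

Deleting the cluster as the final step is what keeps it ephemeral: compute is paid for only while the job runs, and any results worth keeping should be written to a storage service rather than to the cluster's disks.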

Fault-tolerant workloads, which can handle interruptions of individual VMs, may be candidates for Preemptible Virtual Machines (PVMs), which offer further cost-savings for the user.

#### Big Data Pipelines

Modern streaming pipelines must accommodate a wide variety of data sources as well as significant data volume and velocity. One such example is amalgamating data from a network of IoT devices. **Cloud Pub/Sub** is a distributed messaging service that facilitates streaming pipelines by ingesting messages from publishers and delivering them to subscribers. Messages are organized by topic: publishers send messages to a named topic, and subscribers receive them through subscriptions attached to that topic.
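
The publish/subscribe pattern itself can be illustrated with a toy in-memory broker. This is a conceptual sketch only, not the Cloud Pub/Sub client library: the class `InMemoryPubSub`, its methods, and the topic and subscription names are invented for the example, and a real service adds durability, acknowledgements and horizontal scaling.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class InMemoryPubSub:
    """Toy message broker: each topic fans messages out to its subscriptions."""

    def __init__(self) -> None:
        # topic name -> subscription name -> queue of undelivered messages
        self._queues: Dict[str, Dict[str, Deque[str]]] = defaultdict(dict)

    def subscribe(self, topic: str, subscription: str) -> None:
        """Attach a named subscription to a topic (no-op if it exists)."""
        self._queues[topic].setdefault(subscription, deque())

    def publish(self, topic: str, message: str) -> None:
        """Deliver a copy of the message to every subscription on the topic."""
        for queue in self._queues[topic].values():
            queue.append(message)

    def pull(self, topic: str, subscription: str) -> List[str]:
        """Return and remove all pending messages for one subscription."""
        queue = self._queues[topic][subscription]
        messages = list(queue)
        queue.clear()
        return messages


if __name__ == "__main__":
    broker = InMemoryPubSub()
    broker.subscribe("sensor-readings", "dashboard")
    broker.subscribe("sensor-readings", "archiver")
    broker.publish("sensor-readings", "device-1: 21.5C")
    print(broker.pull("sensor-readings", "dashboard"))
    print(broker.pull("sensor-readings", "archiver"))
```

The publisher never needs to know who is listening, and each subscription consumes its own copy at its own pace. That decoupling is what lets many IoT devices feed many downstream consumers.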

Apache Beam is an open-source programming model for defining both batch and streaming pipelines. Beam jobs can be executed on Cloud Dataflow, a managed service, without having to explicitly manage compute and storage or worry about scaling.
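
The appeal of defining a pipeline once and running it over either bounded or unbounded data can be shown with a plain-Python analogy built from generators. This is not the Beam API: the transforms, the threshold and the simulated sensor below are hypothetical, and a real runner would also handle windowing, parallelism and fault tolerance.

```python
import itertools
from typing import Iterable, Iterator


def to_fahrenheit(readings_celsius: Iterable[float]) -> Iterator[float]:
    """Transform step: convert each Celsius reading to Fahrenheit."""
    for celsius in readings_celsius:
        yield celsius * 9 / 5 + 32


def only_above(threshold: float, values: Iterable[float]) -> Iterator[float]:
    """Filter step: keep only values strictly above the threshold."""
    return (value for value in values if value > threshold)


def pipeline(source: Iterable[float]) -> Iterator[float]:
    """The same chain of steps, whatever the source looks like."""
    return only_above(70.0, to_fahrenheit(source))


def endless_sensor() -> Iterator[float]:
    """Simulated unbounded source that repeats a few readings forever."""
    return itertools.cycle([18.0, 25.0, 30.0])


# Batch: a bounded list is processed to completion.
batch_result = list(pipeline([18.0, 25.0, 30.0]))

# Streaming: the source never ends, so only the first few outputs are taken.
stream_head = list(itertools.islice(pipeline(endless_sensor()), 4))

print(batch_result)
print(stream_head)
```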

#### Machine Learning

There are three options for doing machine learning on GCP:

- BigQuery ML
- AutoML
- Custom Keras models on TensorFlow

**BigQuery ML** enables users to train machine learning models on data stored within BigQuery.
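
BigQuery ML models are created and queried with SQL statements. The sketch below builds such statements as Python strings so they could be pasted into the BigQuery web UI or sent through a client. The helper names, the dataset and table names, and the `churned` label column are hypothetical, and `logistic_reg` is just one of the available model types.

```python
def create_model_sql(model: str, source_table: str, label: str,
                     model_type: str = "logistic_reg") -> str:
    """Build a CREATE MODEL statement that trains on every column of a table.

    Plain string formatting is used for brevity; do not pass untrusted input.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def predict_sql(model: str, source_table: str) -> str:
    """Build an ML.PREDICT query that scores the rows of a table."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{model}`, "
        f"(SELECT * FROM `{source_table}`))"
    )


# Hypothetical dataset, table and column names, for illustration only.
print(create_model_sql("demo.churn_model", "demo.customers_train", "churned"))
print(predict_sql("demo.churn_model", "demo.customers_new"))
```

Training happens where the data already lives, so nothing has to be exported from BigQuery before a model can be fitted.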

**Google's AutoML** makes it easy to utilize high-performance Neural Architecture Search (NAS) algorithms, which search for the optimal neural network architecture for a given problem.

Jupyter notebooks can be used to write and test custom machine learning models built using **Keras**.

#### Data Visualization

**Data Studio** enables users to visualize and highlight key insights; stock templates streamline the process of creating dashboard reports. This functionality can be accessed from within BigQuery.