# Google Data Engineer Part 1

## 1.1 - Big Data Fundamentals

### 1.1.1 - Introduction to Data and Machine Learning on the GCP

In this first course of the data engineer track, you will gain an overview of the data and machine learning
parts of Google Cloud Platform. And not just a cursory overview. In each module, everything is going to be
two-pronged.

At one level, we will look at some products that help you accomplish certain things on Google Cloud. But at the same time, we'll look at very specific use cases, very common use cases that
involve machine learning, that involve data processing, that involve data analysis.

We'll look at how to accomplish those use cases using these particular products. And we'll do this over and over again. By the time we're done with this course, you will have a pretty good overview of all
of the moving parts of the data and ML parts of the platform.

### 1.2.1 - About Big Data and Machine Learning Fundamentals

In this course I'll be providing a very quick overview of GCP, in particular its foundational
parts: compute and storage. And then we dive straight in to provide a deeper overview of the
different ways you can process data with GCP. So we'll talk about BigQuery, Dataflow,
Dataproc, Cloud SQL, Datalab, etc.

So who's this class meant for? It's meant for what we call data engineers: people who design, build,
and maintain data structures, databases, data processing systems, and data pipelines, and who do
extraction, transformation, and loading of data, moving data from one place to another. These may be
data scientists who are analyzing data, enabling machine learning to happen, maybe even doing machine
learning themselves. People who model business processes, who enable data-driven decision making
within their company. So if you recognize yourself in any of these things, this class is meant for you.
It's meant for anyone who'll be working with data on Google Cloud Platform. It's also meant for any
decision makers who are trying to decide whether their company should move to GCP and do its data
processing on GCP.

This course gives you a good overview of all the capabilities that Google Cloud provides so
that you can make an informed decision. We will now start with an overview of Google Cloud Platform
as a whole, but with particular emphasis on the ways you can handle data in the platform: the way you
can ingest data, and the different ways of processing data, whether with MapReduce, with Spark, or
with a streaming mechanism like Dataflow. We look at BigQuery, which is our auto-scaling data
warehouse, etc.

So we'll give you an overview of GCP, but with an emphasis on the data-handling parts of it.
We'll then move on to talking about the foundation of GCP, which is computing and storage. Like any
computer, and the Cloud is a computer, the two key parts are the computing units and the storage units
for persistent data. The way that happens on the Cloud is with Compute Engine and Cloud Storage, so
we'll talk about both of those. And then we'll move on to things that you're probably doing today that
you could easily move to the Cloud. We'll talk about quite common use cases that Google provides a good
managed environment for, such that you can take things you're doing on-premise, on your own hardware,
and move them to the Cloud quite easily, because the same software that you may be using is also
present on the Cloud.

As examples of those use cases, we will talk about relational databases in the form of Cloud SQL,
which is a MySQL database hosted on GCP. And we'll talk about Cloud Dataproc, which is a hosted
version of Pig, Spark, Hive, and the Hadoop ecosystem software. In order to do that, we'll
look at how to import data and how to query MySQL running in Google Cloud. Similarly, we'll look at
how to take a Spark program, submit it, and have it run on Cloud Dataproc. The Spark program that we
will look at will be a machine-learning program to carry out recommendations. Once we have talked
about these migration use cases, we will move on to talking about more transformational use cases.
These may be things that you are not doing today, mainly because it may not be possible for
you to do them today.

For example, we look at how you could query petabytes of data in a matter
of seconds with Google BigQuery. We will talk about how to
do fast random access, trading off global consistency
versus low global availability. We'll also talk about machine learning. We'll talk about how to use TensorFlow. Maybe you already run TensorFlow, but very commonly on a single machine. We'll talk about how you'd run
TensorFlow in a distributed fashion on the Cloud over
extremely large data sets. And then we will move on to
just providing a very quick overview of how you would do
scalable reliable data processing on Google Cloud with Cloud Pub/Sub,
which is a messaging architecture and with Cloud Dataflow,
which is a way to execute code that processes both streaming data and
batch data in essentially the same way. And finally, we'll come to
the conclusion of the course, and I'll leave you with some resources for
further reading.
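
The idea of processing streaming and batch data "in essentially the same way" can be illustrated with a minimal sketch in plain Python. This does not use the Cloud Dataflow API; the function names (`parse_event`, `pipeline`, `event_stream`), the record format, and the doubling transform are all assumptions made up for illustration.

```python
from typing import Iterable, Iterator


def parse_event(line: str) -> tuple[str, int]:
    """Split a hypothetical 'user,amount' record into a typed pair."""
    user, amount = line.split(",")
    return user.strip(), int(amount)


def pipeline(records: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Apply the same steps to any iterable of records.

    The code does not care whether `records` is a finite list (batch)
    or a generator that produces items as they arrive (streaming).
    """
    for line in records:
        user, amount = parse_event(line)
        if amount > 0:  # drop records with nothing to process
            yield user, amount * 2  # hypothetical transform


def event_stream() -> Iterator[str]:
    """Stand-in for an unbounded source; yields one record at a time."""
    for line in ["ann,5", "bob,0", "cat,7"]:
        yield line


batch_result = list(pipeline(["ann,5", "bob,0", "cat,7"]))  # bounded input
stream_result = list(pipeline(event_stream()))              # lazily produced input
print(batch_result == stream_result)
```

The point of the sketch is only that the transform logic is written once and the source is swapped; a real streaming system also has to deal with windowing and late data, which this toy example ignores.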

## 1.2.2 - Introduction to GCP and Big Data Products

So what is the Google Cloud Platform? Well, let's start out by talking about what cloud computing is.
When I started out in computing a couple of decades ago, everything that we did was on-premise. I had
a workstation, and this workstation was literally under my desk. If I needed to run a program,
I would run it on that machine. If something happened to the machine, I would yank out the power cord
and plug it back in. I owned the machine, I managed the software on it, I installed everything.
I had root access on that machine. It was all mine. I managed the software, the data, the electricity,
the networking; I paid for everything.

That's essentially "on premise" computing. But at some point things changed, and it became: well, you don't have to do all of that. Let's take all of our machines and put them in a nice data center. So we had a new building,
we had a data center, and we had all of our machines in that data center. We still owned the data
center, because it was just another floor of our building, and whenever we had visitors
to our lab we would take them and very proudly show them: this is where
all of our computing happens. So we had our data center and we owned
the hardware, but the electricity, the networking, and
the physical security of that place, so that not everybody could
walk into that data center, were all managed by the people
who managed the data center.

But my group still paid for the hardware. We essentially paid a portion of the data
center costs, a portion of the "rent" if you will. But we still
owned those machines. We controlled those machines. We installed the software on them. We decided when
those machines would get upgraded. What we had given up was direct
physical access to those machines. Now think about this evolution:
from things being completely under my desk to being slightly out of reach, but
still something that I control.

Cloud computing is the next step of that. With cloud computing,
the big cloud vendors, whether it's Google or Amazon or
Microsoft, own the hardware. They manage the electricity,
they manage the networking, they manage the physical
security of those data centers. You don't even own the hardware. What you end up doing is
asking for some computing resource. As we'll see later, the computing
resource could be a virtual machine, but in many cases
you'd prefer it not to be a virtual machine. You don't want to work
at such a low level, where you're spinning up VMs and spinning
them down. You want to think in terms of higher-level constructs, like: here
is a job that I want to run. Or: here's a SQL query that
I want to execute.

Basically you have some kind of computing that you need to do. When you need to do that, you ask for
resources to carry that computing out, and you pay only for
those resources that you use. The whole cloud is
a shared resource: you get those resources for the time that
you need them and then you give them back. You're no longer worried
about over-provisioning servers that you're not using, or about not having enough of a computing
resource when you need it because somebody else is using it. So the whole idea behind cloud
computing is that you have a resource available whenever you need it, and
you're not paying for it when you don't.

So why is Google in the cloud business? This is our mission: to organize the world's information and
make it universally accessible and useful. What does that have to do with cloud? It doesn't seem that cloud computing and organizing the world's information have anything in common. Well, it turns out that if you need to organize the world's information and make it universally accessible and useful,
there are some things that you have to do. One of the things that you have to do if
you are going to organize the world's information, because there is a lot
of it and it keeps on growing, is build
extremely powerful infrastructure. There is a statistic that blew my
mind the first time I heard it. Of every five CPUs that
are produced in the world, year on year, Google buys one of them. Think about that for a second. That goes into our infrastructure. So you can imagine how powerful the infrastructure is that we have to build in order to organize
the world's information.

But that's the physical hardware, right? In addition,
we have to make it universally accessible. If you want to make information
universally accessible, you have to build global data centers and
you have to build a global network, so we have private fiber
between all the continents, and you need to have edge locations
in a lot of different countries. An edge location is essentially this
idea: if somebody accesses your resource, say from Africa, the
second person who tries to access that resource shouldn't have to go across
the world to get the resource again. They should be able to get a cached
version of that resource from an edge location. So you need to maintain edge locations, and edge locations are in a lot more places than the data centers.
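
The caching behavior of an edge location described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how Google's edge network is implemented: the `Origin` and `EdgeLocation` classes and the resource path are hypothetical names invented for this example.

```python
class Origin:
    """Stand-in for a distant data center that holds the real resource."""

    def __init__(self) -> None:
        self.fetch_count = 0  # how many times a request crossed the world

    def fetch(self, key: str) -> str:
        self.fetch_count += 1
        return f"content of {key}"


class EdgeLocation:
    """Stand-in for an edge location close to the users."""

    def __init__(self, origin: Origin) -> None:
        self.origin = origin
        self.cache: dict[str, str] = {}

    def get(self, key: str) -> str:
        # Only the first request for a key goes back to the origin;
        # later requests are served from the local cache.
        if key not in self.cache:
            self.cache[key] = self.origin.fetch(key)
        return self.cache[key]


origin = Origin()
edge = EdgeLocation(origin)
first = edge.get("/video/intro")   # first user: fetched from the origin
second = edge.get("/video/intro")  # second user: served from the edge cache
print(origin.fetch_count)
```

A real cache would also need expiry and invalidation, which are left out here to keep the sketch focused on the "second request stays local" idea.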

Within the data center, too, the design is such that any two machines
in the data center are just a hop away. If you look at the network bisection bandwidth between
machines in a Google data center, it's
more than a petabit per second. So now you're looking at
extremely fast networking and an extremely global infrastructure that
needed to get built in order to make the information that we had collected and
organized universally accessible. That's the physical part of it. In addition,
we've tended to run into problems with large amounts of data
ahead of most other companies in the world. What that has ended up doing is
that we have ended up innovating in data technologies, pioneering many of the techniques
that you're all familiar with now.

So, for example, GFS, the Google File System, and MapReduce: those were two papers that
came out of Google research. And they form the basis of what? Yup. HDFS, the Hadoop Distributed File System,
is based on the Google File System, GFS. And Hadoop itself, the MapReduce
framework, is based on the paper by Sanjay Ghemawat and Jeff Dean on
MapReduce that was published in 2004. And, of course, HDFS and Hadoop
led to the whole ecosystem of open source Hadoop tools
that are available in the world today. Something to realize is that
even though we published this paper on MapReduce in 2004, by about 2006 we were
no longer creating new MapReduce programs. We had moved on. Why had we moved on? Well, think about
the way the whole MapReduce framework works. If you
have a very, very large dataset, you take that dataset,
you chop it up into small pieces, and you store those pieces on
different compute nodes, close to the compute. Then you
have each of the compute nodes do its little bit of
processing on its local data. Those are the map operations. Then you take those results, you combine them, and
you do further processing on the combined result. Those are the reduce operations.
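
The map-then-combine flow described above can be sketched as a word count in plain Python. This is a single-process illustration under stated assumptions, not Hadoop or Google's MapReduce: the `shard`, `map_phase`, and `reduce_phase` functions, the round-robin sharding, and the sample sentences are all hypothetical.

```python
from collections import defaultdict


def shard(lines: list[str], num_nodes: int) -> list[list[str]]:
    """Chop the dataset into one piece per compute node (round-robin)."""
    shards: list[list[str]] = [[] for _ in range(num_nodes)]
    for i, line in enumerate(lines):
        shards[i % num_nodes].append(line)
    return shards


def map_phase(local_lines: list[str]) -> list[tuple[str, int]]:
    """Runs on each node against its local data: emit (word, 1) pairs."""
    return [(word.lower(), 1) for line in local_lines for word in line.split()]


def reduce_phase(pairs: list[tuple[str, int]]) -> dict[str, int]:
    """Combine the mapped results: sum the counts for each word."""
    counts: dict[str, int] = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


dataset = [
    "the cloud is a computer",
    "the data is on the cloud",
    "a job runs on the cloud",
]
shards = shard(dataset, num_nodes=2)           # data is tied to nodes here
mapped = [pair for piece in shards for pair in map_phase(piece)]
word_counts = reduce_phase(mapped)
print(word_counts["cloud"], word_counts["the"])
```

Note that the number of shards is fixed by the number of nodes before any processing starts, which is the coupling between storage and compute that the text goes on to discuss.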

But the key point is that you have to take your data set and shard it, or split it,
across all of the compute nodes, which means the
size of your data set and the number of your compute nodes
are intimately tied together. That limits your scale,
because you are often wasting a whole bunch of compute nodes just
because you need to store your data there. Or, if you need to process some
data, you can only process it on the compute nodes that
already have that data.

So this changed. Part of the innovation in data technology that came about is
that GFS, the Google File System, got replaced by Colossus,
which is the current file system. Of course, there have been enhancements
in Colossus all along the way. MapReduce was not just changed: we don't do MapReduce
at Google anymore. Instead, the data processing
technologies of choice are Dremel and Flume, and those are both externalized
in Google Cloud as BigQuery and Dataflow, which we will look at. So the point is that with Google Cloud we're opening up all of the innovation that's happening at Google to build our own infrastructure,
such that you can use it.

So Dremel, which I talked about, is the basis of BigQuery. Flume
started in 2010, but it has continually gotten enhanced. Flume, MillWheel, etc. all form what is now
externalized as Dataflow. And TensorFlow, from 2015,
forms the basis of Cloud ML. Cloud ML is a hosted
TensorFlow solution. Similarly you have Pub/Sub, which
has been released as itself, and Spanner, now in alpha, etc. So you have all
these innovations that continually get released
into Google Cloud. One consequence is that if
you're working on Google Cloud, you have access to all of
Google's data processing capabilities. Now, I talked about how the MapReduce
that we used to use in 2004, we don't use anymore. I mentioned that one of the
reasons is that
MapReduce tends to be limited by the
number of compute nodes that you have, and that we have better solutions
now in the form of BigQuery and Dataflow. That illustrates a key point. If you're doing cloud computing today, many times what you may think of as cloud is essentially the change from
colocation to a virtualized data center.

Remember the slide that I showed early
on that asked, what is cloud computing? I said cloud computing is the
evolution from things being on-premise
to being in a data center, where you
still own the hardware but part of the management of that hardware has
gone away, and then to cloud computing, where you don't have to own the hardware
and you don't have to manage any of it. You just get
the resources that you need. That is the idea. But if you are working in the cloud
at the level of virtual machines, if you're saying, I'm going to spin
up a VM so I can run this job and I'm going to reserve this VM for months
on end so I can keep running this job, at that point you've lost many
of the benefits of the cloud.

The cloud is hugely beneficial if you are running things ephemerally, if you're
running things just when you need them. So that's kind of what we're
talking about when we talk about how a virtualized data center
is your second wave of cloud. It's not yet
getting the full benefit of the cloud. The true benefit of the cloud
happens when you're using the cloud in an elastic
completely global way. And what I mean by that is that you should
be able to auto-scale your clusters, you should be able to
do distributed storage, distributed data processing,
distributed machine learning. If you have a job, and that job
can be done on a thousand machines in ten seconds, you should be able to
run it on a thousand machines for ten seconds and just pay for that. You shouldn't have to go ahead and
create a cluster of ten machines, and then process that job
on those ten machines for ten hours or twelve hours or
whatever it takes you. The whole idea behind an elastic
cloud is this idea that you get to use your machines just for
the time that you need them, and then when you don't need them,
it's not on your bill anymore.
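
The "pay only for the time you use" argument comes down to simple arithmetic, sketched below in Python. The price and the size of the job are hypothetical numbers chosen for the example, not real GCP pricing, and the sketch assumes a job that parallelizes perfectly and is billed per second.

```python
def job_cost(machines: int, seconds: float, price_per_machine_hour: float) -> float:
    """Cost of running `machines` machines for `seconds`, billed per second."""
    machine_hours = machines * seconds / 3600
    return machine_hours * price_per_machine_hour


PRICE = 0.10                   # hypothetical dollars per machine-hour
WORK_MACHINE_SECONDS = 36_000  # hypothetical job: 10 machine-hours of work

# Assumption: the work divides evenly across however many machines we use.
small_machines, small_seconds = 10, WORK_MACHINE_SECONDS / 10
large_machines, large_seconds = 1000, WORK_MACHINE_SECONDS / 1000

small_cost = job_cost(small_machines, small_seconds, PRICE)
large_cost = job_cost(large_machines, large_seconds, PRICE)

print(f"{small_machines} machines: {small_seconds:.0f} s, ${small_cost:.2f}")
print(f"{large_machines} machines: {large_seconds:.0f} s, ${large_cost:.2f}")
```

Under these assumptions the bill is the same either way, but the large cluster returns the answer a hundred times sooner. Real jobs rarely scale perfectly, so treat this as an upper bound on the benefit rather than a prediction.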

## 1.2.4 - Big Data Products

So let's move on to talk about the big data products in GCP. The best way to think about
the Google Cloud Platform is that it's a set of building blocks. It is a set of things that are well
designed and well integrated, such that
you can put together your own solutions to solve your problems in a very
convenient and very intuitive way.

So Spotify kind of illustrates the typical customer journey. Spotify came to Google Cloud for the
reason that most people come to the cloud. It's not to spend less,
but to pay just for what you use and to get more security, because things can be
safer if they're secured on the cloud by Google's security
engineering team, for example. It can be no-ops. No-ops is, again,
a word that we'll use quite a bit. What we mean by
no-ops in this context is essentially that no
systems operation is required. In other words, you don't need to
spin up a cluster and do a job on it. You just have a job, and
you submit it to the cloud. Google Cloud figures out which
machines to run it on and runs it, and you don't even interact with those
machines at all.

So those are all advantages. That's the reason why many
companies come to the cloud, and that's kind of why
Spotify came to the cloud. This was one of their tweets: "Did you hear we're migrating from our own
data centers to Google Cloud Platform?" And then they realized, hey, look, I see
two nice building blocks in Google Cloud. I see something called Pub/Sub,
which is a messaging system. And I see Dataflow, which is a data
pipeline execution environment. And look, if I can put these two things together, we
can build our own event delivery system. So that's the second tweet: see how we use Google Cloud Pub/Sub and Dataflow to build our own
event delivery system.
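
The "building blocks" idea, a messaging layer feeding a processing step, can be sketched with a toy in-memory broker in Python. This is not the Cloud Pub/Sub client library, not Dataflow, and not how Spotify built their system; `InMemoryBroker`, `count_play`, the topic name, and the event fields are hypothetical names invented for this illustration.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy publish/subscribe broker that delivers messages synchronously."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Publishers only know the topic, not who consumes the message.
        for handler in self._subscribers[topic]:
            handler(message)


plays_per_track: dict[str, int] = defaultdict(int)


def count_play(event: dict) -> None:
    """Processing step: keep a running count of play events per track."""
    if event.get("type") == "play":
        plays_per_track[event["track"]] += 1


broker = InMemoryBroker()
broker.subscribe("events", count_play)

for event in [
    {"type": "play", "track": "a"},
    {"type": "skip", "track": "a"},
    {"type": "play", "track": "b"},
    {"type": "play", "track": "a"},
]:
    broker.publish("events", event)

print(dict(plays_per_track))
```

The design point is the decoupling: the code that publishes events never references the code that processes them, so either side can be replaced or scaled independently.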

In other words, there is this whole flexibility to
put together your own solutions. That's great, but in addition
to building your own solutions, there is also the extreme power
that already comes built in. That was the third tweet:
"Google's BigQuery is the bomb. You can start with 2.2 billion things and summarize down to 20,000
things in less than a minute." The reason that he's so
excited is that before BigQuery, it probably took them, I don't know,
maybe an hour, maybe a couple of hours. So there's a huge benefit in
terms of business advantage in taking things that used to take
you hours and doing them in seconds. That's the promise, the transformational aspect
of doing things on the cloud. You may never be able to convince
your boss to give you 3,000 machines if you're not going to use
those 3,000 machines all the time. But on the cloud, you can get
3,000 machines for a few minutes and get your job done. And that time savings is worth a lot.

Another way to look at Google Cloud
is functionally: what does it do? What do these products do? Each of these blue hexagons is
a Google Cloud Platform product. They all do different things, so you can group them.

|Category                  |Technology                                            |
|--------------------------|------------------------------------------------------|
|Fundamentals              |Cloud Storage, Compute Engine                         |
|Databases                 |Cloud SQL, Bigtable, Cloud Datastore                  |
|Analytics and ML          |BigQuery, Cloud Datalab, Translate API, Vision API... |
|Data-handling frameworks  |Pub/Sub, Dataflow, Dataproc                           |


Data-handling frameworks are things that help you deal with streaming data and that help
you move data from one place to another. These are things like Pub/Sub,
Dataflow, Dataproc, etc. So this is the way that we are going to
look at the cloud platform: in terms of these groups. We look at foundational stuff, we look at
databases, we look at analytics and ML, and then we look at
data-handling frameworks.

The reason that there are all of
these big hexagons is that Google is trying to solve a particular issue
related to how people get onto the cloud, which is that a lot of times people want to migrate the code
that they're already running. So it's just about changing
where they're computing, not actually changing the code itself, because you don't want to do too
many things at the same time. You may take software services
that you're running on-premise and want to
move them to Google Cloud. In that sense, all that you
want to do is change where you compute, but you want to do
exactly the same things.


So if today, in your own data center,
you're running a Hadoop cluster and you want to move that code, that project,
over to the cloud platform, well, you can take Cloud Dataproc and
migrate things over. You can migrate your Hadoop, Spark, or Pig jobs over to Google Cloud
and run them on Cloud Dataproc. Similarly, if you have a MySQL database, you could run it on
Google Cloud using Cloud SQL. So these are all different ways of taking
things that you're already doing and just changing where you're
doing the computation.

At another level, the reason that
you may be moving to the cloud is that the cloud gives you greater
scalability and reliability. If you need to do very large scale,
very reliable messaging, you might want to use Cloud Pub/Sub. If you want to do very scalable,
very flexible, very reliable data processing,
you may want to use Dataflow. But you might also do it in Spark with
Dataproc and take advantage of the fact that with Dataproc you get to
resize clusters very quickly. So you may be looking at using
Dataflow, Dataproc, and Pub/Sub for the scalability and
reliability that the cloud provides. This is the concept that I was talking
about earlier: if you are running things on a cluster of, say, 20 or
30 machines, the reason you're doing it is that you have to
justify the cost on an annual basis. You may be able to do that
exact same thing in a much more scalable fashion on the cloud, because
you can easily justify using hundreds of machines for, say, 20 or
30 minutes, rather than having a small cluster crunch through
your data and take days or months.

The third reason why people move to
the cloud is all of the innovation, the things that the cloud,
Google's environment, makes possible. Whether it's data exploration,
business intelligence, an economical data warehouse for petabytes of data,
or distributed machine learning, those are all things that change
how you do your computing. These are not things that
you may be doing today. You may not be analyzing petabytes
of data, because it may not be the kind of thing that you
can do on a timely basis. But you can do that on Google Cloud, and
that means that rather than just moving over the code that you're already running,
you may be creating new business concepts and
new capabilities: new lines of business that open up
because you can now do something that you were not able to do before. Maybe you're able to analyze
your factory floor in real time, which you weren't able to do. Or maybe you're able to analyze
your customers' behavior and give them recommendations on
what to buy in real time, which you may not have been able to do. Those are the kinds of transformational
use cases that become possible.

So if you're looking at Google Cloud,
there are three possible situations in which people come to the cloud. Again,
this is stuff that we see because I'm part of the Professional Services Org
at Google, and we see these different use
cases play out all the time. There are people who are doing migrations,
changing where they compute; people trying to scale up their data
processing or make it more reliable; and people looking at
transforming their businesses. All three of these things
are supported by the big data and machine learning platform, and
we will look at all of them.

## 1.2.5 - How to do Labs

Let's move on to talking
about how to do the labs. So navigate to
codelabs.developers.google.com/cpb100. And if you do that, so
I'll just click on this link, it'll take you to a Codelab site. And here the title of the slide
is Sign up for the free trial. Look for the Codelab for
signing up for the free trial. There it is. Click on that and then go ahead and
follow these instructions. And then once you're done come back
to the video and we can start, okay?

## 1.2.6 - Lab Sign Up

Sign Up for the Free Trial and Create a Project (10 min)

Overview

In this lab you sign up for the Google Cloud free trial and create a project used to complete the labs. Be aware that you need a credit card in order to register for the trial. This is to confirm your identity.

What you need

To complete this lab, you need:

* Internet access
* Access to a supported Internet browser:
  * The latest version of Google Chrome
  * The latest version of Firefox
  * Microsoft Internet Explorer 11+
* A credit card to register for the free trial

What you learn

In this lab, you:

* Register for the Google Cloud Platform free trial
* Create a project using the Google Developers Console

Start the Codelab

* https://codelabs.developers.google.com/codelabs/cpb100-free-trial/

## 1.2.7 - Lab Resources

Finally, let's end by talking about
some of the resources available to you. cloud.google.com, that's the landing page. And then, you have a page about
datacenters, a page about security on Google Cloud,
why you might want to choose Google Cloud, and the pricing philosophy
behind the cloud itself. So these are all links;
feel free to peruse them at your leisure. Now we'll move on.

# 1.3 Foundations of GCP Compute and Storage

## 1.3.1 Introduction to GCP Compute and Storage

Let's start talking about
the foundations of GCP. The foundations of GCP lie with its
computing and storage infrastructure. Any computer consists of computing and storage, and networking to
connect the computing and storage. The cloud computer is also a computer. It's a global computer, but
it also contains compute, in the form of Compute Engine. It contains storage, and networking that
you don't directly interact with, but that is there in any case for you to connect the computing that you are
doing with the data that you have stored. So in this module we look at
the foundations of Google Cloud Platform: Compute Engine and Cloud Storage.

## 1.3.2. CPUs on Demand

Like any computer, and
the cloud computer is a computer, you need central processing units, CPUs. The CPUs on the cloud are provided
by Compute Engine virtual machines. And you need a place to store your input
data, your output data, your intermediate data, things that
are persistent, things that are temporary. That storage on GCP is
provided by Cloud Storage. And connecting the two,
connecting the virtual machines or Compute Engine instances
with these storage units or Cloud Storage, is a private network. You tend not to directly interact
with this network, but it's there, and it is what allows you to have a global-scale
data and compute infrastructure.

GCP provides:
* Compute Engine
* Global Private Network
* Cloud Storage

Goals: To be as no-ops as possible.

It provides load balancing, advanced networking, monitoring, clustering, and container support on every machine type. You also get after-the-fact discounts for sustained use. Preemptible machines give you a large discount in return for agreeing to give the machine up when it is needed elsewhere. You can't depend on them, but they can give you a great boost in compute power.
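To make the after-the-fact discount concrete, here is a minimal sketch of tiered billing. The tier widths and multipliers are assumptions, based on the sustained-use schedule historically published for N1 machine types; they are not stated in these notes and may have changed. Under these assumed tiers, using a machine for 60% of the month works out to a net discount of 15%, which is consistent with the figure quoted in the lecture.

```python
# ASSUMPTION: each successive quarter of the month is billed at a lower
# multiple of the base rate (100%, 80%, 60%, 40%). Illustrative only.
ASSUMED_TIERS = [(0.25, 1.0), (0.25, 0.8), (0.25, 0.6), (0.25, 0.4)]


def net_discount(usage_fraction, tiers=ASSUMED_TIERS):
    """Return the net discount for running a VM for a fraction of the month."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    remaining = usage_fraction
    billed = 0.0
    for width, multiplier in tiers:
        used = min(remaining, width)   # portion of usage that falls in this tier
        billed += used * multiplier
        remaining -= used
        if remaining <= 0:
            break
    # Compare what was billed with what full price would have been.
    return 1 - billed / usage_fraction


for usage in (0.25, 0.60, 1.00):
    print(f"used {usage:.0%} of the month -> net discount {net_discount(usage):.0%}")
```

Because the discount is computed from actual usage at the end of the month, there is nothing to reserve up front, which is the point the lecture makes about retaining agility.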

For the most part,
when you work with GCP, you will not be working at the level at which we're
going to be talking about in this chapter. So we're not going to be working at
the level of individual virtual machines. No, you're not going to be spinning
up VMs in order to do a job, we'll be working with things that
are much higher level than that. But even if you need to
work at this low level, in terms of infrastructure,
the design goals of GCP remain the same. And the design goal is for working with cloud infrastructure
to be as no-ops as possible. And no-ops here essentially means that we want to minimize system
administration overhead. And because we're talking
about computing and storage, we want to basically also mention that we
want this to be as flexible as possible. In such a way that you can
change the type of virtual machine that you're running without
paying any penalties, for example. So you're not reserving instances for
long periods of time. In fact, we want to make
it as flexible and easy for you to get your compute
jobs done as possible. So when we talk about Compute Engine,
the idea is in terms of flexibility. You can go ahead and get a Compute
Engine instance that is, say, an n1-standard-4, and that's a very specific configuration
of machine that you can have. But we would like you to stop thinking
about it in terms of this very specific physical infrastructure, and instead start thinking in terms
of more abstract concepts. So for example you might say, I want a virtual machine that
has 8 CPUs and 30 GB of RAM. And it's the job of the Google Cloud
infrastructure to go ahead and fetch you a virtual machine that
has 8 CPUs and 30 GB of RAM. Regardless of the type of machine that you
get, you will always get load balancing, advanced networking, monitoring,
clustering, container support, etc. So there is no second class machine here. Every machine that you have,
has all of these capabilities built in. At the same time we want to
give you flexible compute. And you lose flexibility whenever
you say that I have to go ahead and get a machine, and
have to keep it running for months on end. Because face it,
if you're running a machine for months on end,
you have essentially bought the machine. And what we want is for you to work
with machines on the order of minutes. However, there are always going to be
workloads where you might find yourself having a machine and using it full
tilt for long periods of time. Rather than ask you to try to determine
which of your workloads you are going to be running for long periods of time,
GCP gives you a discount after the fact. So at the end of the month, if it turns
out that you've used a machine for 60% of the month,
you will automatically get a 15% discount. And this is something that happens
on your bill after we've found that you've used it. So what this means is that you
always get to retain your agility. So, for example, if you have a workload
that's currently running on 8 CPUs and you decide that you need to
increase it to 12 CPUs, for a few hours, well,
go ahead and do that, right? You can move your workload to
a different machine when you need to and move it back to a smaller machine
when the peak loads go away. In addition to this whole idea of being
able to change the machine type of stopped instances, you have another concept
that's very, very, very useful, especially when it comes
to jobs like Hadoop jobs. And this is the idea of a
preemptible virtual machine. One of the reasons that GCP can say,
well, if you want an 8 CPU machine with
30 GB of RAM, we'll find it for you and
we'll give it to you, is that some of the machines that are currently being used
are what are called preemptible. Whoever is using those machines
has agreed that, in return for a hefty 80% discount on the machine
charge, they will give it up if someone comes along and is willing
to pay full price for those machines. So that's what a preemptible machine is: a machine
that you get a great discount on in return for your flexibility in letting
go of it when someone else needs it. But why would you do that? Why would you have a machine that
you're willing to give up? Well, if you're running a workload
like Hadoop, which is fault-tolerant, so that if a machine goes away,
whatever that machine was doing gets redistributed
among the other workers, then preemptible machines are a great
strategy to reduce your overall cost. So you might say, for example,
that you're creating a Dataproc cluster. Dataproc is a Hadoop cluster on GCP,
and we'll look at it in the next chapter. So you may say, I'm going to create a Dataproc
cluster, and in my Dataproc cluster I'm going to have 10 standard VMs and
30 preemptible VMs. So now your job is going to
get done four times faster. And at the same time,
those extra 30 machines that you're using are billed at an 80% discount off the normal cost. So not only are you
getting it done faster, you're also getting it done cheaper. So preemptible machines are a good thing
to incorporate into your strategy, with the idea that even if you
don't get any preemptible machines, the standard machines are enough for you
to get the job done in a timely manner. So you don't want to bank on a preemptible
machine being available when you need it, but if it is available and
you happen to get it, you have automatically gotten a huge discount on
the total cost of your job.
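The 10 standard plus 30 preemptible VM example above can be checked with a little arithmetic. This is a sketch under stated assumptions: the job scales linearly with the number of workers, preemptible VMs are billed at an 80% discount, no preemptible VM is actually reclaimed during the run, and the hourly rate and baseline runtime are made-up illustrative values.

```python
def cluster_cost(standard_vms, preemptible_vms, hours,
                 hourly_rate=1.0, preemptible_discount=0.80):
    """Total cost of running a cluster for a number of hours.

    hourly_rate is a hypothetical price per standard VM per hour.
    """
    preemptible_rate = hourly_rate * (1 - preemptible_discount)
    per_hour = standard_vms * hourly_rate + preemptible_vms * preemptible_rate
    return hours * per_hour


BASELINE_HOURS = 4.0                      # hypothetical runtime on 10 standard VMs
baseline = cluster_cost(10, 0, BASELINE_HOURS)

# ASSUMPTION: linear scaling, so 40 workers finish in a quarter of the time.
speedup = (10 + 30) / 10
mixed = cluster_cost(10, 30, BASELINE_HOURS / speedup)

print(f"speedup: {speedup:.0f}x")
print(f"cost relative to the standard-only cluster: {mixed / baseline:.0%}")
```

Under these assumptions the mixed cluster finishes four times faster and costs about 40% of the standard-only run. If the preemptible VMs are reclaimed partway through, the job simply falls back toward the baseline time and cost.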

## 1.3.3. Foundation of GCP Compute and Storage

### 1.3.3.1. Lab Overview

So let's go ahead and
try to start a Compute Engine instance, so we'll start a lab. In this lab,
we're going to create a Compute Engine
instance using the GCP console. We will then SSH into this instance. And then, just to show you that you
have root access to this machine, you will install a software package, git,
which is used for source code version control. So let's go ahead, try out this lab,
and come back and join me.

### 1.3.3.2. Lab Review

So this is what the lab will look like. You'll want to go to
console.cloud.google.com. This is the GCP console,
and there you have the projects that you can select. For
example, I have a bunch of projects open. I'm going to select
the cloud training demos project, so I'm fine there. And I'd say that I want to go ahead and
create a Compute Engine instance. So, go to Compute Engine and say
that you'd like to create an instance. At this point I can provide a name for the instance; I'll
just leave it as it is. Then there is the zone that I want the
instance to be in. We'll talk about zones and regions shortly, but let's say I pick us-central1-f. And then I say how many CPUs
I want and how much memory I want. So I could say that I want to have,
for example, an n1-standard-2, which is two virtual CPUs, or an n1-standard-16,
which is 16 CPUs. The one-CPU machine currently costs $27 a month. The eight-CPU machine costs $206 a month,
right. So there's this cost that's shown
to you when you pick a machine. But you could also say that you
want to customize a machine. I could say I want to
create a machine with four cores and, say, 20 GB of RAM, and that's
what it's going to cost. So I can fine-tune
the machine that I want. Here, I'll just say that I want
a single-CPU machine, the basic one, and that I want a Debian image. But we could go ahead and
change this; it could be Debian, CentOS, Ubuntu, etc. Besides these operating
system images, you can also have custom images, you can have snapshots
that you've made of existing VMs, you can have different application images,
or you could use an existing disk. I'll go ahead and use the default, Debian
Jessie, that's the current default, and then I'll
say that I want this virtual machine to be able to access all the Cloud APIs. In particular, we want it to be
able to access Cloud Storage. But rather than provide
access to the APIs one by one, we'll take the simple way out and
allow access to all the Cloud APIs. We're not going to be
running a web server, so I don't need to allow HTTP or
HTTPS traffic to the virtual machine. I could also go into
management, disks, and networking, add my own encryption keys, and so on. But for now, let's just go ahead and
pick the defaults.

Go ahead and say Create, and at this point
the instance is getting created. It usually takes 60 seconds or so. And we should have our instance, and
that instance is going to get created in us-central1-f because that was the zone
that I selected when I created it. So, when you see this green check, it means
that the instance has been created. You can go ahead and look at the
details of this instance, and of course there's no CPU usage or anything; we
haven't started using this instance yet.

But let's go ahead and SSH into it.
I can SSH from here, or I can SSH from this other window. I'll just click SSH, and at this point our SSH
window comes up. This is also just a browser window. Notice that it's transferring
SSH keys to the virtual machine, establishing a connection, and there we are. At this point my machine is empty. Now, I can type top and
see that my machine is not really doing anything. Zero percent of
the CPU is getting used. Now let's see if, for
example, git is installed. I can type in git, and the command's
not found; it's not installed. So what I'll do is
install git, and in order to do that, I'll run sudo apt-get install git. It asks, do you want to continue;
yes, I want to continue. And at this point, it's going to go
ahead and get that software and install it.

The key thing to realize is that if I want to
run something that requires root access, I use sudo. Quick trick: if you need to do
lots of things as root and don't want to keep typing sudo, you can also do
sudo su, and that switches you to root. And now, whoami? I should be root, and
I can do root stuff, but let's exit that. So, at this point I have git installed and
I have a machine up. Let's go ahead and
see what we can do with this machine, but to do that let's go back to the lecture.
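The lab does all of this in the console. As a rough, assumed equivalent, the same choices (zone, predefined or custom machine shape, access to all Cloud APIs) map onto flags of `gcloud compute instances create`. The sketch below only builds and prints the command; the instance name is hypothetical and nothing is executed.

```python
import shlex


def build_create_command(name, zone, machine_type=None,
                         custom_cpu=None, custom_memory_gb=None):
    """Build (but do not run) a gcloud command mirroring the console choices."""
    use_custom = custom_cpu is not None or custom_memory_gb is not None
    if use_custom and machine_type is not None:
        raise ValueError("choose either a predefined machine type or a custom shape")
    if use_custom and (custom_cpu is None or custom_memory_gb is None):
        raise ValueError("a custom machine needs both a CPU count and memory")
    if not use_custom and machine_type is None:
        raise ValueError("a machine type or a custom shape is required")

    cmd = ["gcloud", "compute", "instances", "create", name, f"--zone={zone}"]
    if use_custom:
        cmd += [f"--custom-cpu={custom_cpu}", f"--custom-memory={custom_memory_gb}GB"]
    else:
        cmd.append(f"--machine-type={machine_type}")
    # "Allow full access to all Cloud APIs" in the console.
    cmd.append("--scopes=cloud-platform")
    return cmd


# "my-instance" is a made-up name; the zone and shapes are the ones from the walkthrough.
basic = build_create_command("my-instance", "us-central1-f",
                             machine_type="n1-standard-1")
custom = build_create_command("my-instance", "us-central1-f",
                              custom_cpu=4, custom_memory_gb=20)
print(shlex.join(basic))
print(shlex.join(custom))
```

Once the instance is up, installing git is the same `sudo apt-get install git` step shown in the walkthrough.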

## 1.3.4. A Global Filesystem: Cloud Storage

### Data processing in the cloud


Cloud Storage lets you stage data for processing, be it in Cloud SQL, BigQuery or Dataproc. It is durable, globally available, and allows you to share data between products.

How do you get your data into Cloud Storage?
* *gsutil cp* - Allows you to copy data directly into a storage bucket. Loosely speaking, a bucket provides a domain name to dump files into.
* Make a REST API call
* GCP Console

Although gs appears to be hierarchical and to support traditional ls functionality etc., it is nothing more than string-to-blob storage.
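The string-to-blob idea can be illustrated with a plain dictionary. This is a toy model, not the Cloud Storage client: the object names and contents are made up, and the `prefix` and `delimiter` arguments merely mimic the way a listing can make a flat namespace look like folders.

```python
# A "bucket" modelled as a flat mapping from object name to bytes.
# The slashes are just characters in the key; there are no real directories.
bucket = {
    "data/2016/jan.csv": b"jan",
    "data/2016/feb.csv": b"feb",
    "data/readme.txt": b"readme",
    "logs/app.log": b"log",
}


def list_objects(objects, prefix="", delimiter="/"):
    """Emulate an ls-style listing purely by string matching on names."""
    entries = set()
    for name in objects:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # sep is "" for an object at this level and "/" for a pseudo-folder.
        entries.add(prefix + head + sep)
    return sorted(entries)


print(list_objects(bucket))            # top-level pseudo-folders
print(list_objects(bucket, "data/"))   # one "directory" down
```

Nothing is ever created for `data/` or `data/2016/`; the apparent hierarchy exists only because several keys share a prefix.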

Cloud Storage should be used as a staging area, and it supports data handling:
* provides object change notification
* can import into analysis tools and databases
* can control access at a project, bucket or object level
* provides versioning, redundancy and edge-caching

You are able to control latency and availability with zones and regions. These are geographical constructs that let you reduce latency, distribute workloads to minimize disruption, and make data globally available.
* use the closest zones to reduce latency
* use multiple zones to minimize disruption
* use multiple regions to provide global access

## 1.3.5. Lab interaction with Cloud Storage

What we do here is:
* ingest data
* transform data on Compute Engine
* store the transformed data on Cloud Storage by pushing it to a bucket
* publish Cloud Storage data to the web

GCP provides a "public link" feature to share data with anyone. For many tasks, CloudShell is a more efficient and cheaper option than a VM instance.

CloudShell is a free VM with a bunch of tools already present, which starts in your cloud home directory. It has git, python, gsutil and gcloud. It is a micro VM and is flushed when you close the browser window. You can use it to launch scripts and run commands without provisioning a server.
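For the "public link" mentioned above, a publicly readable object is served from a URL built out of the bucket and object names. The helper below is a small sketch of that URL shape; the bucket and object names are hypothetical, and it assumes the object has already been made publicly readable.

```python
from urllib.parse import quote


def public_url(bucket_name, object_name):
    """Return the public URL for an object, percent-encoding unsafe characters."""
    # Slashes are kept, since they are ordinary characters in object names.
    return f"https://storage.googleapis.com/{bucket_name}/{quote(object_name, safe='/')}"


# Hypothetical bucket and object names, for illustration only.
print(public_url("my-example-bucket", "earthquakes/map 1.png"))
```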

## 1.3.7. Resources

You can easily find documentation regarding:
* Compute Engine VMs
* storage
* pricing
* Cloud Launcher, which lets you use preconfigured VMs, e.g. WordPress


## 1.4. Data analysis in the Cloud

### 1.4.1. Cloud Managed Data Services for Common Use-cases

### 1.4.2. Stepping Stones to Transformation

With App Engine, the workflow used to be:
* Develop a web app
* Upload to App Engine
* App Engine automatically scales
* App Engine manages the runtimes

Problems?
* Only Java
* No choice of languages or frameworks
* Provided a prescriptive framework
* Required a greenfield start
* Couldn't handle legacy code/frameworks

Google Container Engine solved a lot of these problems by allowing you to containerize your app and deploy it into a Google-managed cloud.

Google Cloud Big Data Platform

* Change where you compute - the cloud is cheaper and secure
* Additional scaling and reliability
* Change how you compute - data exploration, data warehousing, machine learning

*Machine learning is the next transformation. Instead of programming a computer, you teach it to learn something and do what you want.* - Eric Schmidt

Who uses recommendation engines? This is a big use case for machine learning. E-commerce, movie sites etc.

How do recommendation engines work?

* Ratings - start by recording users' ratings/choices
* Training - a model is created to predict a user's ratings
* Recommending - for each user, the model is applied to unrated houses

One option is to use the choices of other users who rated the same houses at the same values, i.e. who is this user like? You could also look at general popularity. In general, you need to cluster users and cluster houses by ratings.
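The "who is this user like?" idea can be sketched in a few lines. This is a deliberately simple stand-in, not the model trained in the course: similarity is just the number of houses two users rated identically, the predicted rating is a similarity-weighted average, and all users, houses and ratings are made up.

```python
# Hypothetical ratings: user -> {house: rating on a 1-5 scale}.
ratings = {
    "alice": {"house_a": 5, "house_b": 3, "house_c": 4},
    "bob":   {"house_a": 5, "house_b": 3, "house_d": 2},
    "carol": {"house_a": 1, "house_b": 3, "house_c": 2, "house_d": 5},
}


def similarity(mine, theirs):
    """Count the houses both users rated with exactly the same value."""
    return sum(1 for house, value in mine.items() if theirs.get(house) == value)


def recommend(user, all_ratings):
    """Predict ratings for houses the user has not rated, best first."""
    mine = all_ratings[user]
    weighted_sum = {}   # house -> sum of similarity * rating
    weight_total = {}   # house -> sum of similarity
    for other, theirs in all_ratings.items():
        if other == user:
            continue
        weight = similarity(mine, theirs)
        if weight == 0:
            continue   # ignore users with nothing in common
        for house, value in theirs.items():
            if house in mine:
                continue   # only score unrated houses
            weighted_sum[house] = weighted_sum.get(house, 0) + weight * value
            weight_total[house] = weight_total.get(house, 0) + weight
    predictions = {h: weighted_sum[h] / weight_total[h] for h in weighted_sum}
    return sorted(predictions.items(), key=lambda item: item[1], reverse=True)


print(recommend("bob", ratings))
```

Here alice agrees with bob on two houses and carol on one, so alice's opinion of `house_c` counts twice as much as carol's when predicting bob's rating.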

Typically, you might build these models as a batch job, e.g. once a week. This fits a Hadoop cluster running PySpark on Dataproc. A relational database such as Cloud SQL would make sense here.

### 1.4.3. Your MySQL Database in the Cloud

You choose your storage options based on the access pattern.

| |Cloud Storage|Cloud SQL|Datastore|Bigtable|BigQuery|
|---------|----------------|----------------|----------------|------------|--------------|
|Capacity|Petabytes +|Gigabytes|Terabytes|Petabytes|Petabytes|
|Access metaphor|File system|Relational DB|Persistent hashmap|Key-value API|Relational|
|Read|Copy to local disk|Select rows|Filter on property|Scan rows|Select rows|
|Write|One file|Insert rows|Put object|Put row|Batch/stream|
|Update granularity|Object|Field|Attribute|Row|Field|
|Usage|Store blobs|No-ops SQL DB|Structured data for App Engine|No-ops, high throughput flat data|Interactive SQL, fully managed|

Cloud SQL is a fully managed MySQL service: fast connection from GAE/GCE, flexible pricing, managed backups, replication, security, and the ability to connect from anywhere.

### 1.4.4. Lab: Working with CloudSQL

When you create an instance, you give it a name, choose the defaults, and you are given an IP address. Once you have a database, you will want to create tables and put data in them. *CloudShell* is an easy way to do this.

You can do this by copying the CSV file to a bucket. Cloud SQL has an import option for a SQL file or a CSV file.
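The create-a-table-then-load-a-CSV flow can be tried locally before touching Cloud SQL. The sketch below uses Python's built-in SQLite as a stand-in for MySQL, which is an assumption made purely so the example runs anywhere; the table name, columns and rows are hypothetical, and the real lab uses the Cloud SQL import option instead.

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents, standing in for a file copied to a bucket.
CSV_TEXT = "id,title,price\n1,Cabin,120.0\n2,Loft,95.5\n"


def load_csv(conn, csv_text):
    """Create a table and insert one row per CSV record; return the row count."""
    conn.execute(
        "CREATE TABLE accommodation (id INTEGER PRIMARY KEY, title TEXT, price REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(int(r["id"]), r["title"], float(r["price"])) for r in reader]
    conn.executemany("INSERT INTO accommodation VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")   # in-memory database, nothing is written to disk
loaded = load_csv(conn, CSV_TEXT)
cheapest = conn.execute(
    "SELECT title FROM accommodation ORDER BY price LIMIT 1"
).fetchone()[0]
print(loaded, "rows loaded; cheapest:", cheapest)
```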

### 1.4.5. Managed Hadoop in the Cloud

There is a rich ecosystem around big data in open source.
* Hadoop is the canonical MapReduce framework
* Pig provides a convenient scripting language that compiles into Hadoop MapReduce jobs
* Hive is a data warehousing system and query language
* Spark is a fast, interactive, general-purpose framework for SQL, streaming and ML.
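For readers new to the map-reduce model that these tools build on, here is a single-process sketch of its three phases using the classic word count. It is plain Python, not Hadoop or Spark code, and the input documents are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big compute"]))
```

On a real cluster the map and reduce calls run on different workers and the shuffle moves data across the network, which is why losing one worker only means redoing that worker's share.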

Dataproc reduces cost and complexity, and provides a Google-managed Hadoop, Pig, Hive and Spark cluster. It integrates easily with GCP, and by storing your data in GCP, aside from all the normal advantages, your data lives independently of your cluster. Computing becomes a job-specific resource, independent of your data.

### 1.4.6. Providing recommendation with Cloud Dataproc

You create a Dataproc cluster much like you create a Cloud SQL instance. You want to create this cluster in the same zone as your databases to minimize latency and data transfer.

You can now submit jobs to the cluster as Python files.

### 1.4.7. Module review

Relational databases do not support very high throughput, mainly because they need to manage transactions. They handle up to a few hundred gigabytes well, but not more. They do not handle unstructured data. They handle transactions on relatively small data sets.

Cloud SQL and Dataproc add value by managing the infrastructure and providing no-ops deployments.

