Studying for the GCP Machine Learning Engineer Professional certification
Training path for Data Analyst
- Teaches how to use Looker!
Advanced Solutions Lab (For Teams!)
The Advanced Solutions Lab is an immersive training program that provides a unique opportunity for technical teams to learn from Google's machine learning experts in a dedicated, collaborative space on Google Campus.
Not all data sources are hosted on Google tools or within the Google Marketing Platform. To bridge this gap, several partner connectors can be used to connect to other data sources. Here are four notable partner connectors:
Connects to over 70 different marketing analytics tools and data sets. Most of its connectors are live connections, although some use snapshot data. Supermetrics can work around Universal GA sampling and provide fields not available in Google's built-in connector. It can also perform data transformations, join data, and aggregate data from multiple accounts at once. Data can be sent to Looker Studio, Google Sheets, or BigQuery.
Offers 40+ connectors, which use a data warehouse approach. This makes loading data faster compared to live connections. PMA can send data to Looker Studio, Google Sheets, and BigQuery as well.
A warehouse-based connector with many data sources available. It includes built-in data transformation features, allowing you to perform resource-intensive transformations before connecting to Looker Studio. Funnel.io supports additional destinations like BigQuery, Snowflake, and Tableau.
A newer player in the market with 90+ connectors. It can be a live connection or use a snapshot feature for warehouse connections. Dataddo includes built-in drag-and-drop data transformation features and supports various data destinations, including Looker Studio, Redshift, Power BI, and Tableau.
Supermetrics
https://supermetrics.com/
https://lookerstudio.google.com/u/0/reporting/6a211d82-26ef-4a4d-920c-7d5a2db2c279/page/MLdGB
Looker Studio Master Class by Ahmad Kanani
https://www.youtube.com/@siakanani/videos
https://support.google.com/looker-studio/table/6379764?hl=en
https://github.com/google/re2/wiki/Syntax
https://support.google.com/a/answer/1371417?hl=en
Tool that helps build your own custom visualizations in Google Data Studio using HTML, CSS, and JavaScript.
https://www.templr.pro/
https://www.templr.pro/article/getting-started-with-templr-pro
Tool to generate color palettes
https://coolors.co/
Tools to generate or download icons for free
https://icons8.com/
https://www.flaticon.com/
Perform sentiment analysis by using client libraries Link
Load a CSV file from Cloud Storage using an explicit schema Link
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
The US National Institute of Standards and Technology created this definition, although there is nothing US-specific about it: cloud computing is a way of using IT that has these five equally important traits.
- First, you get computing resources on-demand and self-service. All you have to do is use a simple interface and you get the processing power, storage, and network you need, with no need for human intervention.
- Second, you access these resources over the net from anywhere you want.
- Third, the provider of those resources has a big pool of them and allocates them to customers out of that pool. That allows the provider to get economies of scale by buying in bulk and pass the savings on to the customers. Customers don't have to know or care about the exact physical location of those resources.
- Fourth, the resources are elastic. If you need more resources you can get more, rapidly. If you need less, you can scale back.
- Last, the customers pay only for what they use or reserve as they go. If they stop using resources, they stop paying.
- Cloud Dataproc: offers the open-source big data environment Hadoop as a managed service.
- TensorFlow: machine learning open-source software library developed inside Google; it is at the heart of a strong open-source ecosystem.
- Kubernetes: gives customers the ability to mix and match microservices running across different clouds.
- Google Stackdriver: lets customers monitor workloads across multiple cloud providers.
Google's network carries as much as 40% of the world's Internet traffic every day.
100,000 km of fiber cable and 8 subsea cables
The computing power required for ML models doubled every 2 years until 2012.
From 2013 onward, the computing power required for ML models has been doubling every 3.5 months!
1. Collecting data is often the longest and hardest part of an ML project, and the one most likely to fail.
2. Manual analysis helps you fail fast and try new ideas in a more agile way.
3. To build a good ML model, you have to know your data. If you do not know analytics you cannot do ML.
4. ML is a journey towards automation and scale.
5. In ML it is necessary to build a Streaming Pipeline in addition to a Batch Pipeline.
6. The performance metrics you care about change between training and prediction as well. During training, the key performance aspect is scaling to a lot of data (distributed training, if you will). During prediction, though, the key performance aspects are speed of response and high QPS.
7. Lots of ML frameworks exist for training, but not so many are equally capable of operationalization.
1. Most ML value comes along the way.
2. ML improves almost everything it touches.
3. ML is hard, so it is hard for competitors too.
4. ML is a great differentiator.
1. Keeping track of different versions.
2. Controlling the experiment space.
3. Pinpointing the best-performing model.
4. Collaboration is not easy.
Simple ML and more data > Fancy ML and small data.
By International Data Corporation, May 2020
1. Lack of staff with the right expertise.
2. Lack of production-ready data.
3. Lack of an integrated development environment.
1. Discovery Phase
2. Development Phase
3. Deployment Phase
- Training your own ML algorithm would be faster than writing the software.
- Starting to create or use ML without data analysis.
  - A data strategy is necessary before a machine learning strategy.
  - "There is no Machine Learning without data, and there is no Machine Learning success without good data" by Robbie Allen
- Assuming the data is ready to use.
  - If you can't make a histogram chart of your data, neither can your ML; most ML algorithms are essentially making many plots and performing regression on them.
- Not keeping humans in the loop.
- Product launch focused on the ML algorithm only.
- Optimizing the ML model for the wrong things.
- Not checking whether the ML algorithm is really improving things in the real world.
- Using a pre-trained algorithm vs. building your own.
- Training ML algorithms only once.
  - ML algorithms must be trained more than once, and training requires resources.
- Trying to design your own perception or NLP algorithm.
- Trying to jump to a fully machine-learned, automated, end-to-end, auto-magic everything solution.
  - Everyone wants to make this leap, but it usually doesn't lead to great products or organizational outcomes. Google has seen this internally and within its partner organizations.
- Very high expectations of success
- 85% of Machine Learning Projects Fail
- According to Gartner it is predicted that through 2022, 85 percent of AI projects will deliver erroneous outcomes due to bias in data, algorithms or the teams responsible for managing them.
- Thinking ML will completely replace the human workforce.
- That's a very high expectation for an ML system to meet. You should think about ML as a way to expand or scale the impact of your people, not as a way of completely removing them.
- An ML model will bring higher returns very quickly.
- Can you guess how well a company will do with just one or two quarters of data? Probably not: it takes at least a full year of a public company's returns for investors and the market to assess how well the company is doing now, and two full years of data to forecast its future performance.
1. Recognize the ways that a model is dependent on data.
2. Make cost-conscious engineering decisions.
3. Know when to roll back a model to an earlier version.
4. Debug the causes of observed model behavior.
5. Implement a pipeline that is immune to one type of dependency.
1. Can be expensive.
2. Low model accuracy.
3. Long debugging.
1. Multi-functional teams: Requires a lot of experts in different fields.
2. Experimental nature: Needs constant experimentation.
3. Testing complexity: It is more complex than testing other software systems.
4. Deployment complexity: Requires multi-step pipeline.
5. Model decay: Data is constantly changing, so it is important to monitor otherwise performance can drop.
1. Instances are generated at random according to some probability distribution D.
2. Instances are independent and identically distributed.
3. The distribution D is stationary: it is fixed over time.
1. Data Drift: A change in P(X) is a shift in the model's input distribution.
2. Concept Drift: A change in P(Y|X) is a shift in the actual relationship between the model's inputs and the outputs.
3. Prediction Drift: A change in P(Ŷ|X) is a shift in the model's predictions.
4. Label Drift: A change in P(Y Ground Truth) is a shift in the model's output or label distribution.
NOTE: Data Drift, Feature Drift, Population, and Covariate Shift describe changes in the data distribution of inputs.
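As a hedged illustration (not a production monitor), data drift in a numeric input feature can be approximated by comparing the training-time distribution of X to the serving-time distribution, for example with a Population Stability Index. The bucket count and thresholds below are common rules of thumb, not anything prescribed by Vertex AI:

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between two samples of a numeric feature.

    Rule of thumb (assumption): PSI < 0.1 -> no drift, 0.1-0.25 -> moderate,
    > 0.25 -> significant drift in P(X).
    """
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / buckets or 1.0

    def bucket_fractions(sample):
        counts = [0] * buckets
        for x in sample:
            i = min(int((x - lo) / step), buckets - 1)  # clamp overflow into last bucket
            counts[i] += 1
        # small epsilon avoids log(0) for empty buckets
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bucket_fractions(expected), bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [0.1 * i for i in range(100)]               # training-time inputs
serve_same = [0.1 * i for i in range(100)]          # identical distribution
serve_shifted = [0.1 * i + 5 for i in range(100)]   # shifted inputs

print(psi(train, serve_same))     # ~0: no data drift
print(psi(train, serve_shifted))  # large: drift alert
```

The same comparison applied to the model's outputs instead of its inputs would flag prediction or label drift.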
- Accelerators: GPUs, TPUs, et cetera
- Disks
- Skillsets: software engineers, researchers, data engineers, data analysts, data scientists, different skillsets
- Teams across the org:
4.1. Teams that are going to be building the experiments.
4.2. Teams that are going to be using the experiments.
4.3. Teams that are going to be monitoring the machine learning models.
90% of enterprise data is unstructured, such as emails, video footage, texts, reports, catalogs, fashion shows, events, news, etc.
- Google is one of the world's largest corporate purchasers of wind and solar energy.
- Google has been a hundred percent carbon neutral since 2007.
- Its data centers' energy source will shortly reach one hundred percent renewable energy.
Accuracy of Data
Consistency of Data
Timeliness of Data
Completeness of Data
Google was the first major Cloud provider to deliver per second billing for its IaaS Compute offering.
Google gives customers the ability to run their applications elsewhere if Google is no longer the best provider for their needs.
GCP provides four tools to help with billing:
- Budgets and Alerts: you can define budgets either per billing account or per GCP project. A budget can be a fixed limit, or you can tie it to another metric.
- Billing Exports: lets you store detailed billing information in places where it's easy to retrieve for more detailed analysis, such as a BigQuery dataset or a Cloud Storage bucket.
- Reports: a visual tool in the GCP console that allows you to monitor your expenditure.
- Quotas: designed to prevent the over-consumption of resources, whether because of error or malicious attack. There are two types of quotas: rate quotas and allocation quotas.
- Rate Quotas: reset after a specific time. For example, by default, the Kubernetes Engine service sets a quota of 1,000 calls to its API from each GCP project every 100 seconds. After those 100 seconds, the limit is reset.
- Allocation Quotas: govern the number of resources you can have in your projects. For example, by default, each GCP project has a quota allowing it no more than five Virtual Private Cloud networks. Although projects all start with the same quotas, you can change some of them by requesting an increase from Google Cloud support.
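To make the two quota types concrete, here is a hypothetical sketch (not GCP's actual implementation, and the class names are made up): a rate quota resets on a fixed time window, while an allocation quota only frees up when a resource is deleted or the limit is raised:

```python
import time

class RateQuota:
    """Allows `limit` calls per `window` seconds, then resets.
    Analogous to GKE's default 1000 API calls per 100 seconds."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.count, self.window_start = 0, time.monotonic()

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:   # window elapsed: reset
            self.count, self.window_start = 0, now
        if self.count < self.limit:
            self.count += 1
            return True
        return False

class AllocationQuota:
    """Caps resources held at once, e.g. 5 VPC networks per project."""
    def __init__(self, limit):
        self.limit, self.held = limit, set()

    def create(self, name):
        if len(self.held) >= self.limit:
            return False          # must delete one first, or request an increase
        self.held.add(name)
        return True

    def delete(self, name):
        self.held.discard(name)   # frees quota immediately

vpcs = AllocationQuota(limit=5)
print(all(vpcs.create(f"vpc-{i}") for i in range(5)))  # True: within quota
print(vpcs.create("vpc-5"))                            # False: quota exhausted
vpcs.delete("vpc-0")
print(vpcs.create("vpc-5"))                            # True: slot freed
```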
- Google enables hardware encryption support in hard drives and SSDs. That's how Google achieves encryption at rest of customer data.
- Google services that want to make themselves available on the Internet register themselves with an infrastructure service called the Google Front End (GFE), which checks incoming network connections for correct certificates and best practices.
- The GFE also applies protections against denial-of-service attacks. The sheer scale of its infrastructure enables Google to simply absorb many denial-of-service attacks, even behind the GFEs.
- Google also has multi-tier, multi-layer denial of service protections that further reduce the risk of any denial of service impact.
- Inside Google's infrastructure, machine intelligence and rules warn of possible incidents.
- Google conducts Red Team exercises: simulated attacks to improve the effectiveness of its responses.
- Google aggressively limits and actively monitors the activities of employees who have been granted administrative access to the infrastructure.
- To guard against phishing attacks on Google employees, employee accounts require the use of U2F-compatible security keys.
- To help ensure that code is as secure as possible Google stores its source code centrally and requires two-party review of new code.
- Google also gives its developers libraries that keep them from introducing certain classes of security bugs.
- Externally, Google also runs a vulnerability rewards program, paying anyone who discovers and reports bugs in its infrastructure or applications.
Policies are inherited downwards in the hierarchy.
- All Google Cloud platform resources belong to a project.
- Projects are the basis for enabling and using GCP services: managing APIs, enabling billing, adding and removing collaborators, and enabling other Google services.
**Each project is a separate compartment** and each resource belongs to exactly one.
- Projects can have different owners and users
- Projects are built separately and they're managed separately.
- Each GCP project has a name and a project ID that you assign.
- The project ID is a permanent, unchangeable identifier, and it has to be unique across GCP.
- You use project IDs in several contexts to tell GCP which project you want to work with.
- On the other hand, project names are for your convenience, and you can assign them freely.
- GCP also assigns each of your projects a unique project number, and you'll see it displayed in various contexts.
- On the Navigation menu, click IAM & Admin. This opens a page that contains a list of users and specifies the permissions and roles granted to specific accounts.
Role Name | Title | Permissions |
---|---|---|
roles/viewer | Viewer | Permissions for read-only actions that do not affect state, such as viewing (but not modifying) existing resources or data. |
roles/editor | Editor | All viewer permissions, plus permissions for actions that modify state, such as changing existing resources. Note: The Editor role contains permissions to create and delete resources for most Google Cloud services. However, it does not contain permissions to perform all actions for all services. For more information about how to check whether a role has the permissions that you need, see Role types. |
roles/owner | Owner | All Editor permissions and permissions for the following actions: - Manage roles and permissions for a project and all resources within the project. - Set up billing for a project. Note: - Granting the Owner role at a resource level, such as a Pub/Sub topic, doesn't grant the Owner role on the parent project. - Granting the Owner role at the organization level doesn't allow you to update the organization's metadata. However, it allows you to modify all projects and other resources under that organization. - To grant the Owner role on a project to a user outside of your organization, you must use the Google Cloud console, not the gcloud CLI. If your project is not part of an organization, you must use the Google Cloud console to grant the Owner role. |
The Google APIs Explorer is a tool available on most REST API reference documentation pages that lets you try Google API methods without writing code. The APIs Explorer acts on real data, so use caution when trying methods that create, modify, or delete data. For more details, read the APIs Explorer documentation.
GCP API Design Guide
This is a general design guide for networked APIs. It has been used inside Google since 2014 and is the guide that Google follows when designing Cloud APIs and other Google APIs. This design guide is shared here to inform outside developers and to make it easier for us all to work together.
Cloud Endpoints developers may find this guide particularly useful when designing gRPC APIs, and we strongly recommend such developers use these design principles. However, we don't mandate its use. You can use Cloud Endpoints and gRPC without following the guide.
This guide applies to both REST APIs and RPC APIs, with specific focus on gRPC APIs. gRPC APIs use Protocol Buffers to define their API surface and API Service Configuration to configure their API services, including HTTP mapping, logging, and monitoring. HTTP mapping features are used by Google APIs and Cloud Endpoints gRPC APIs for JSON/HTTP to Protocol Buffers/RPC transcoding.
This guide is a living document and additions to it will be made over time as new style and design patterns are adopted and approved. In that spirit, it is never going to be complete and there will always be ample room for the art and craft of API design.
Machine families resource and comparison guide
This document describes the machine families, machine series, and machine types that you can choose from to create a virtual machine (VM) instance with the resources you need. When you create a VM, you select a machine type from a machine family that determines the resources available to that VM. There are several machine families you can choose from and each machine family is further organized into machine series and predefined machine types within each series. For example, within the N2 series in the general-purpose machine family, you can select the n2-standard-4 machine type.
Compute Engine resources are hosted in multiple locations worldwide. These locations are composed of regions and zones. A region is a specific geographical location where you can host your resources. Regions have three or more zones. For example, the us-west1 region denotes a region on the west coast of the United States that has three zones: us-west1-a, us-west1-b, and us-west1-c.
Resources that live in a zone, such as virtual machine instances or zonal persistent disks, are referred to as zonal resources. Other resources, like static external IP addresses, are regional. Regional resources can be used by any resource in that region, regardless of zone, while zonal resources can only be used by other resources in the same zone.
For example, to attach a zonal persistent disk to an instance, both resources must be in the same zone. Similarly, if you want to assign a static IP address to an instance, the instance must be in the same region as the static IP address.
Putting resources in different zones in a region reduces the risk of an infrastructure outage affecting all resources simultaneously. Putting resources in different regions provides an even higher degree of failure independence. This lets you design robust systems with resources spread across different failure domains.
Only certain resources are region- or zone-specific. Other resources, such as images, are global resources that can be used by any other resources across any location. For information on global, regional, and zonal Compute Engine resources, see Global, Regional, and Zonal Resources.
How to store and process data depends on:
- Type of data
- Business need
- Unstructured data (non-relational)
- Structured data (relational)
It is information stored in a non-tabular form, such as documents, images, and audio files.
IMPORTANT: Cloud Storage is great for No-Relational Data.
It is information stored in a tabular form. There are two types of workloads for such data: transactional workloads and analytical workloads.
Batch processing is when the processing and analysis happens on a set of stored data.
- For example, payroll and billing systems that have to be processed on either a weekly or monthly basis.
Streaming data is a flow of data records generated by various data sources. Streaming data processing means that the data is analyzed in near real-time and that actions are taken on the data as quickly as possible.
- The processing of streaming data happens as the data flows through a system. This results in the analysis and reporting of events as they happen. Examples include fraud detection and intrusion detection.
variety: Data can come in a variety of different sources and in various formats.
volume: The volume of data that varies from gigabytes to petabytes is not easy nor cheap to handle.
velocity: Data often needs to be processed in near real-time as it arrives, and late-arriving data adds complexity.
veracity: Data won't always be good quality and will come with some inconsistencies and uncertainties.
1. Data can be streamed from many different methods and devices.
2. It can be hard to distribute event messages to the right subscribers.
3. Data can arrive quickly and at high volumes
4. Ensuring services are reliable, secure, and perform as expected.
Pub/Sub is a distributed messaging service that can receive messages from a variety of device streams such as gaming events, IoT devices, and application streams. It ensures at-least-once delivery of received messages to subscribing applications, with no provisioning required. Pub/Sub’s APIs are open, the service is global by default, and it offers end-to-end encryption.
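As a toy illustration of the at-least-once guarantee (this is an in-memory sketch with made-up names, nothing like the real google-cloud-pubsub API): a message stays on a subscription until it is acknowledged, so a subscriber that crashes before acking simply sees the message redelivered:

```python
from collections import deque

class ToyTopic:
    """In-memory stand-in for a Pub/Sub topic with at-least-once delivery.

    Real Pub/Sub is a global, encrypted, fully managed service; this sketch
    only illustrates the ack/redelivery semantics."""
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        self.subscriptions[name] = deque()
        return name

    def publish(self, message):
        for q in self.subscriptions.values():   # fan-out: each subscription gets a copy
            q.append(message)

    def pull(self, sub):
        # Peek without removing: unacked messages will be delivered again.
        return self.subscriptions[sub][0] if self.subscriptions[sub] else None

    def ack(self, sub, message):
        q = self.subscriptions[sub]
        if q and q[0] == message:
            q.popleft()   # only acknowledged messages stop being delivered

topic = ToyTopic()
sub = topic.subscribe("fraud-detector")
topic.publish({"event": "transaction", "amount": 9999})

msg = topic.pull(sub)
# ...subscriber crashes before acking: the next pull redelivers the same message
assert topic.pull(sub) == msg
topic.ack(sub, msg)
print(topic.pull(sub))   # None: acked, no further delivery
```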
Dataflow creates a pipeline to process both streaming data and batch data.
- “Process” in this case refers to the steps to: extract, transform, and load data (ETL).
Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem.
Dataflow is serverless and NoOps (No Operations). - A NoOps environment is one that doesn't require management from an operations team, because maintenance, monitoring, and scaling are automated.
- Serverless computing is a cloud computing execution model in which Google Cloud, for example, manages infrastructure tasks on behalf of its users. This includes tasks like resource provisioning, performance tuning, and ensuring pipeline reliability.
Dataflow is designed to be low maintenance. This means you can spend more time analyzing the insights from your datasets and less time provisioning resources to ensure that your pipeline successfully completes its next cycles.
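The extract-transform-load steps that a Dataflow pipeline performs can be sketched with plain in-memory data (Dataflow itself runs Apache Beam pipelines; the sources and sinks here are illustrative stand-ins):

```python
def extract(source):
    """Extract: read raw records (a list standing in for Pub/Sub or Cloud Storage)."""
    yield from source

def transform(records):
    """Transform: parse, drop malformed rows mid-flight, and reshape."""
    for r in records:
        try:
            name, amount = r.split(",")
            yield {"name": name.strip(), "amount": float(amount)}
        except ValueError:
            pass  # malformed row: filtered out of the pipeline

def load(records, sink):
    """Load: write results to a sink (a list standing in for BigQuery)."""
    sink.extend(records)

raw = ["alice, 10.5", "bob, 3", "corrupted-row"]
table = []
load(transform(extract(raw)), table)
print(table)  # [{'name': 'alice', 'amount': 10.5}, {'name': 'bob', 'amount': 3.0}]
```

Because the steps are generators, records flow through one at a time, which is the same shape whether the source is a bounded batch or an unbounded stream.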
1. Will the pipeline code be compatible with both batch and streaming data, or will it need to be refactored?
2. Will the pipeline code's software development kit (SDK) have all the needed transformations, mid-flight aggregations, and windowing?
3. Will the pipeline SDK be able to handle late data?
4. Are there existing templates or solutions that should be referenced?
Popular solution for pipeline design. It's an open-source, unified programming model to define and execute data processing pipelines, including ETL, batch, and stream processing.
1. Apache Beam is unified, which means it uses a single programming model for both batch and streaming data.
2. It’s portable, which means it can work on multiple execution environments, like Dataflow and Apache Spark, among others.
3. It’s extensible, which means it allows you to write and share your own connectors and transformation libraries.
4. Apache Beam provides pipeline templates, so you don’t need to build a pipeline from nothing.
5. You can write pipelines in Java, Python, or Go.
6. The Apache Beam software development kit (SDK) is a collection of software development tools in one installable package. It provides a variety of libraries for transformations and for data connectors to sources and sinks.
7. Apache Beam creates a model representation from your code that is portable across many runners.
1. How much maintenance overhead is involved?
2. Is the infrastructure reliable?
3. How is the pipeline scaling handled?
4. How can the pipeline be monitored?
5. Is the pipeline locked in to a specific service provider?
When Dataflow receives a job, it handles the following:
1. Graph optimization
2. Work scheduling
3. Autoscaling
4. Auto-healing
5. Work rebalancing
6. Compute & storage
Dataflow templates come in three types: Streaming, Batch, and Utility.
Streaming templates are for processing continuous, or real-time, data.
- Pub/Sub to BigQuery
- Pub/Sub to Cloud Storage
- Datastream to BigQuery
- Pub/Sub to MongoDB
Batch templates are for processing bulk data, or batch load data.
- BigQuery to Cloud Storage
- Bigtable to Cloud Storage
- Cloud Storage to BigQuery
- Cloud Spanner to Cloud Storage
Utility templates address activities related to bulk compression, deletion, and conversion.
- Bulk compression of Cloud Storage files
- Firestore bulk deletion
- File format conversion
It is a fully managed data warehouse.
Being fully managed means that BigQuery takes care of the underlying infrastructure, so you can focus on using SQL queries to answer business questions without worrying about deployment, scalability, and security.
This article explains the format and schema of the data that is imported into BigQuery. The data used comes from the table: data-to-insights.ecommerce.web_analytics
.
A Data Lake is just a pool of raw, unorganized, and unclassified data, which has no specified purpose yet.
A Data Warehouse contains structured and organized data, which can be used for advanced querying.
BigQuery ML enables users to create and execute machine learning models in BigQuery by using standard SQL queries
End-to-end user journey for each model
It is a unified platform which means having one digital experience to create, deploy, and manage models over time, and at scale.
- Data Readiness
- Feature Readiness
- Training & Hyperparameter Tuning
- Deployment & Model Monitoring
- Create a dataset and upload data.
- Train an ML model on your data.
- Upload and store your model in Vertex AI.
- Deploy your trained model to an endpoint for serving predictions.
- Send prediction requests to your endpoint.
- Specify a prediction traffic split in your endpoint.
- Manage your models and endpoints.
- Determining how to handle large quantities of data.
- Determining the right machine learning model to train the data.
- Harnessing the required amount of computing power.
- Scalability
- Monitoring
- Continuous Integration, Delivery and Deployment
- Many tools require advanced coding skills.
- Take focus away from model configuration.
- No unified workflow.
- Difficulties finding tools.
Hyperparameters are the variables that govern the training process itself. For example, part of designing a DNN is deciding how many hidden layers of nodes to use between the input and output layers, and how many nodes each hidden layer should use. These variables are not directly related to the training data. They are configuration variables. Note that parameters change during a training job, while hyperparameters are usually constant during a job.
Hyperparameter tuning in Cloud Machine Learning Engine using Bayesian Optimization
Feature Store | Model Registry | ML Metadata | Model Evaluation |
---|---|---|---|
Share and reuse ML features across use cases. | Register, organize, track, and version your trained and deployed ML models. | Automatically track inputs or outputs of all components. | Iteratively run model evaluations on new datasets at scale. |
Serve ML features at scale with low latency. | Govern the model launch process. | Query the metadata to help analyze, debug, and audit the performance. | Visualize and compare model evaluations to identify the best model for prod deployment. |
Alleviate training serving skew. | Maintain model documentation and reporting. | Maintain model documentation and reporting. | Visualize, analyze, and compare detailed ML lineage. |
Hyperparameter tuning searches for the best combination of hyperparameter values by optimizing metric values across a series of trials. Metrics are scalar summaries that you add to your trainer, such as model accuracy. Hyperparameter tuning optimizes target variables that you specify, called hyperparameter metrics. Model accuracy, as calculated from an evaluation pass, is a common metric. Metrics must be numeric.
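A minimal sketch of the trial loop such a tuner runs. The objective function here is a made-up stand-in for a real training-and-evaluation pass, and Vertex AI uses smarter search strategies (e.g. Bayesian optimization) than the random sampling shown:

```python
import random

random.seed(0)

def train_and_evaluate(learning_rate, hidden_layers):
    """Stand-in for a real training job: returns a scalar hyperparameter
    metric (e.g. model accuracy). Peaks at lr=0.1 and 3 layers by construction."""
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(hidden_layers - 3)

search_space = {
    "learning_rate": lambda: 10 ** random.uniform(-4, 0),  # log-uniform sample
    "hidden_layers": lambda: random.randint(1, 8),
}

best_metric, best_params = float("-inf"), None
for trial in range(50):                       # each iteration is one trial
    params = {name: sample() for name, sample in search_space.items()}
    metric = train_and_evaluate(**params)
    if metric > best_metric:                  # maximize the target metric
        best_metric, best_params = metric, params

print(best_params, round(best_metric, 3))
```

Note how the tuner only ever sees (hyperparameters in, scalar metric out), which is exactly the black-box setting Vertex AI Vizier is built for.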
Provides robust, actionable explanations | Is built into multiple Vertex AI services | Is flexible, fast, and scalable |
---|---|---|
Vertex Explainable AI integrates feature attributions into Vertex AI to show the most important features for specific predictions with Shapley values, Integrated Gradients, and eXplanation with Ranked Area Integrals (XRAI). | You can get explanations easily through Vertex AI Prediction, AutoML Tables, and Vertex AI Workbench. | Supports tabular, image, and text models from any ML framework. Fully managed, serverless, and significantly faster than open source. |
It is a hyperparameter optimizer. Vertex AI Vizier is a black-box optimization service that helps you tune hyperparameters in complex machine learning (ML) models. When ML models have many different hyperparameters, it can be difficult and time consuming to tune them manually. Vertex AI Vizier optimizes your model's output by tuning the hyperparameters for you.
Black-box optimization is the optimization of a system that meets either of the following criteria:
- Doesn't have a known objective function to evaluate.
- Is too costly to evaluate by using the objective function, usually due to the complexity of the system.
This page describes how to make API requests to Vertex AI Vizier by using Python.
Tutorial on how to use Vertex AI Vizier in Python
It is a managed cloud service for machine learning engineers and data scientists to store, serve, manage, and share machine learning features at scale.
- It is a centralized repository to organize, store, and serve machine learning features.
- It aggregates all the different features from different sources and updates them to make them available from a central repository.
- When engineers need to model something, they can use the features available in the Feature Store to build a dataset.
Monitor and alert | Diagnose | Update the model |
---|---|---|
Monitor signals for model's predictive performance, and alert when those signals deviate. | Help identify the cause for deviation, for example, what changed, how, and how much. | Trigger model retraining pipeline or collect relevant training data to address performance degradation. |
Used to track, analyze, visualize, and compare ML experiments.
- Vary and track parameters and metrics as you experiment.
- Organize Vertex AI Pipeline runs and compare their parameters, metrics, and artifacts.
- Track steps and artifacts to capture the lineage of experiments.
- Compare Vertex AI Pipeline against Vertex AI Workbench Notebook experiments.
Based on the open-source TensorBoard tool, it is used to track, analyze, visualize, and compare ML experiments.
- Track and visualize metrics such as loss and accuracy over time.
- Visualize model computational graphs.
- View histograms of weights, biases, or other tensors.
- Project embeddings to a lower dimensional space.
- Display image, text, and audio samples.
It is a managed instance of Vertex AI Pipelines: a set of integrated, fully managed, and scalable pipelines for end-to-end ML with tabular data that uses Google's AutoML technology for model development and provides customization options to fit your needs.
- Supports large datasets that are multiple TB in size.
- Allows you to improve stability and lower training time.
- Allows you to improve training speed.
- Allows you to reduce model size and improve latency.
- Hard to share and reuse.
- Hard to serve in production reliably with low latency.
- Inadvertent skew in feature values between training and serving.
Feature Management Pain Points: hard to reuse, hard to serve, and training-serving skew.
1. Features are shareable for training or serving tasks: Features are managed and served from a central repository,
which helps maintain consistency across your organization.
2. Features are reusable: Helps save time and reduces duplicative efforts, especially for high-value features.
3. Features are scalable: Features automatically scale to provide low-latency serving, so you can focus on
developing the logic to create the features without worrying about deployment.
4. Features are easy to use: Feature Store is built on an easy-to-navigate user interface.
It is a rich feature repository to serve, share, and re-use ML features.
- Share and reuse ML features across use cases:
Centralized feature repository with easy APIs to search & discover features, fetch them for training/serving and manage permissions.
- Serve ML Features at scale with low latency:
Offload the operational overhead of handling infrastructure for low latency scalable feature serving.
- Alleviate training serving skewness:
Compute feature values once, re-use for training and serving. Track & monitor for drift and other quality issues.
- Batch and Streaming Feature Ingestion:
Ingest features efficiently in large batches, or in real-time as data streams in.
| Predicted Value |
---|---|
Actual Value | AV vs PV |
| P(cat) | N(dog) |
---|---|---|
P(cat) | TP | FN |
N(dog) | FP | TN |
True Positive: Things that you correctly predicted. Things you include that should be included. Label says something exists and the model predicts it.
False Negatives (Type II error): Things that you incorrectly did not predict. Things you exclude when it should be included. Label says something exists but the model doesn't predict it.
False Positive (Type I Error): Things that you incorrectly predict. Things you include when it should be excluded. Label says something doesn't exist but the model predicts it.
True Negative: Things that you correctly excluded. Things you exclude that should be excluded. Label says something doesn't exist and the model doesn't predict it.
Recall refers to all the actual positive cases, and looks at how many were predicted correctly.
R = TP / (TP + FN)
Precision refers to all the cases predicted as positive, and looks at how many are actually positive.
P = TP / (TP + FP)
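The two formulas above can be checked with a small helper (the function name and example counts are illustrative):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    return precision, recall

# Example: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
# p = 8/10 = 0.8, r = 8/12
```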
Machine Learning Development + Operations
ML = Upload Data + Engineering Feature + Train Model + Evaluate Model
Operations = Deploy + Monitor + Release
ML = Solve production challenges related to machine learning
Operations = Building an integrated system + Operating in production
Practicing MLOps means advocating for automation and monitoring at each step of
the ML system construction. This means adopting a process to enable:
- Continuous integration (CI)
- Continuous training (CT)
- Continuous delivery (CD)
Endpoint
- Best when immediate results with low latency are needed.
- A model must be deployed to an endpoint before it can be used to serve real-time predictions.
- e.g.: Making instant recommendations based on a user's browsing habits whenever they're online.
Batch Prediction
- Best when no immediate response is required, and accumulated data should be processed with a single request.
- e.g.: Sending out new ads every other week based on the user's recent purchasing behavior and what's currently popular on the market.
Offline Prediction
- Best when the model should be deployed in a specific environment off the cloud.
- e.g.: Edge AI, where a camera needs to identify defects on a product during an assembly line or packaging process.
It automates, monitors, and governs machine learning systems by orchestrating the workflow in a serverless manner
It is a notebook tool that helps to define one's own pipeline. You can do this with prebuilt pipeline components, which means that you primarily need to specify how the pipeline is put together using components as building blocks
STEP 1: Data Uploading
STEP 2: Feature Engineering
Data Types:
- Streaming vs Batch Data
- Structured vs Unstructured Data
STEP 1: Training Data
STEP 2: Evaluating Data
STEP 1: Deployment STEP 2: Monitoring STEP 3: Managing
- Avoid storing data in block storage like network file systems or VM hard disks.
- Avoid reading data directly from databases like Cloud SQL.
- Store tabular and intermediate processed data in BigQuery.
- Use Vertex AI Feature Store with structured data.
- For optimal speed, it is better to store materialized data than to use views or subqueries for training data.
- Store image, video, audio, and other unstructured data in Cloud Storage.
- This also applies to TFRecord files if using TensorFlow, or Avro files if using another framework.
- Improve write and read throughput to Cloud Storage by combining many individual images, videos, or audio clips into large files.
- Use Vertex Data Labeling for unstructured data.
- Search Vertex AI Feature Store.
1.1. Search to see if a feature already exists.
1.2. Fetch those features for your training labels using Vertex AI Feature Store's batch serving capability.
- Create a new feature.
2.1. Create a new feature using your Cloud Storage bucket or BigQuery location.
2.2. Fetch raw data from your data lake and write your scripts to perform feature processing.
2.3. Join the feature values and the new feature values. Merging those feature values produces the training data set.
2.4. It is a solution for online serving of the features to online prediction use cases.
2.5. You can share features among others in the organization for their own ML models.
- Use notebook instances for small datasets.
- For large datasets, distributed training, or scheduled training, it is recommended to use the Vertex Training service.
- Python Source Distribution: Training application code packaged as a Python source distribution. Can include custom Python dependencies or others.
- Cloud Storage Bucket: Push the package training application code to Google Cloud Storage bucket.
- Vertex Training: Configure and run custom job on Vertex Training with pre-built containers.
- Offers feature attributions to provide insights into why models generate predictions.
- Details the importance of each feature that a model uses as input to make a prediction.
- Supports custom-trained models based on tabular and image data.
- Maximize a model's predictive accuracy.
- Provides an automated model enhancer to test different hyperparameter configurations when training your model.
- Use Notebooks to evaluate (development) and understand (experimentation) your models.
- Use for writing code, starting jobs, running queries, and checking status.
- Notebooks offers What-if-Tool (WIT) and Language Interpretability Tool (LIT).
- Create a notebook for each team member.
- Use Vertex SDK for Python.
- Secure Personally Identifiable Information (PII) in Notebooks.
6.1. Apply data governance and security policies to help protect your Notebooks that contain PII data.
6.2. Follow Notebooks security blueprint: Protecting PII data guide.
- It is an enterprise-ready managed solution.
- Vertex AI TensorBoard service lets you track experiment metrics such as loss and accuracy over time.
- Visualize a model graph.
- Project embeddings to a lower-dimensional space.
- Allows cost-effective, secure solutions and easy collaboration among data scientists and ML researchers to track, compare, and share experiments.
- BigQuery for tabular data
- Dataflow to process unstructured data
- Use Dataflow to convert the unstructured data into binary data formats like TFRecord to improve ingestion performance during training.
- TensorFlow Extended for leveraging TensorFlow ecosystem.
- TensorFlow Transform is the TensorFlow component that enables defining and executing a preprocessing function to transform your data.
- Interaction Bias: People drew more all-stars than hills in a recent game, so the AI did not recognize hills.
- Latent Bias: Training a model to identify famous physicists is likely to be biased towards men.
- Selection Bias: Choosing photos from your family to train a model to identify anybody.
- Confirmation Bias: Refers to only looking for data that confirms your hypotheses.
- Reporting Bias: Refers to choices in data that reveal certain aspects about the trainers or their opinions.
- Automation Bias: Refers to the biases that appear when the data we use is only the data we can easily automate.
Biometrics | Country | Dialect | Health | Income | Language | Location | Race | Religion | Sexual Orientation | Skin Color | Socioeconomic Status
- It is available for free within the TensorBoard.
- Designed to let you visualize inference results.
- Edit a data point.
- See how your model performs.
- Explore the effects of a single feature.
- Arrange examples by similarity.
- View confusion matrices and other metrics.
- Test algorithmic fairness constraints.
- Developed by Google and Open-Sourced
- Facets Overview: Provides users a quick understanding of the distribution of values across features.
- Facets Dive: Provides users an easy-to-customize, intuitive interface.
A tensor is an N-dimensional array of data. Tensors behave like NumPy n-dimensional arrays except that:
tf.constant produces constant tensors.
tf.Variable produces tensors that can be modified.
reshape() by itself cannot be used to transpose a matrix unless the matrix happens to be a vector. If the matrix
is not a vector, then transpose alters the internal storage order of the elements, whereas reshape() never does.
For example, [1 2; 3 4] is stored internally (row-major) in the order 1 2 3 4, and its transpose [1 3; 2 4]
is stored in the order 1 3 2 4. You can see that the 2 and 3 have swapped internal places in the transpose.
Reshape never swaps internal orderings.
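A quick NumPy illustration of the difference (the same holds for tf.reshape vs tf.transpose):

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])  # stored row-major as 1 2 3 4

transposed = m.T                # [[1, 3], [2, 4]] -- storage order changes
reshaped = m.reshape(4)         # [1, 2, 3, 4]     -- storage order preserved

# Transpose swaps the internal places of 2 and 3; reshape never does.
print(transposed.flatten())     # [1 3 2 4]
print(reshaped)                 # [1 2 3 4]
```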
TensorFlow can compute the derivative of a function with respect to any parameter.
- The computation is recorded with GradientTape (a context manager).
import tensorflow as tf

def compute_gradients(X, Y, w0, w1):
    # Record the forward computation on the tape...
    with tf.GradientTape() as tape:
        loss = loss_mse(X, Y, w0, w1)  # loss_mse: an MSE loss defined elsewhere
    # ...then ask the tape for d(loss)/d(w0) and d(loss)/d(w1).
    return tape.gradient(loss, [w0, w1])

w0, w1 = tf.Variable(0.0), tf.Variable(0.0)
dw0, dw1 = compute_gradients(X, Y, w0, w1)
- The function is expressed with TensorFlow ops only!
Dataset = tf.data
can do more than just ingest data.
Feature Columns = tf.feature_column
tells the model what inputs to expect.
- Feature columns take care of packing the inputs into the input vector of the model. For example, the one-hot encoding below.
tf.feature_column.categorical_column_with_vocabulary_list("type", ["house","apt"])
"house" = 1, 0
"apt" = 0, 1
Other examples of feature columns:
tf.feature_column.bucketized_column(..)
tf.feature_column.embedding_column(..)
tf.feature_column.crossed_column(..) # Enables a model to learn separate weights for each combination of features.
tf.feature_column.categorical_column_with_hash_bucket(..)
...
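As a sketch of what a vocabulary-list column produces under the hood (a hypothetical plain-Python helper, not the tf.feature_column API itself):

```python
def one_hot(value, vocab):
    """One-hot encode a value against a fixed vocabulary list,
    as a vocabulary-list feature column does internally."""
    return [1 if value == v else 0 for v in vocab]

vocab = ["house", "apt"]
one_hot("house", vocab)  # [1, 0]
one_hot("apt", vocab)    # [0, 1]
```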
Lower Dimensions = Less Accuracy + More Lossy Compression
vs
Higher Dimensions = Overfitting + Slow Training
SavedModel is a universal serialization format for TensorFlow models.
- SavedModel provides a language-neutral format to save your machine learning models that is both recoverable and hermetic.
- It enables higher-level systems and tools to produce, consume, and transform your TensorFlow models. The resulting SavedModel is then servable.
- Models saved in this format can be restored using the tf.
Deploy a model saved with SavedModel as a version on gcloud ai-platform:

gcloud ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --framework=tensorflow \
  --python-version=3.5 \
  --runtime-version=2.1 \
  --origin=$EXPORT_PATH \
  --staging-bucket=gs://$BUCKET
Request predictions from the deployed model on gcloud ai-platform:

input.json = {"sq_footage": 3140, "type": "house"} # An input to test the loaded model.
gcloud ai-platform predict \
--model propertyprice \
--version dnn \
--json-instances input.json
Keras preprocessing layers: text preprocessing, numerical features preprocessing, categorical features preprocessing, image preprocessing, and image data augmentation.
tf.keras.layers.TextVectorization
-> turns raw strings into an encoded representation that can be read by an Embedding or Dense layer.
tf.keras.layers.Discretization
-> turns continuous numerical features into bucketed data with discrete ranges.
tf.keras.layers.CategoryEncoding
-> turns integer categorical features into one-hot, multi-hot, or count dense encodings.
tf.keras.layers.Hashing
-> performs categorical feature hashing, also known as the "hashing trick."
tf.keras.layers.StringLookup
-> turns string categorical values into an encoded representation that can be read by an Embedding or Dense layer.
tf.keras.layers.IntegerLookup
-> turns integer categorical values into an encoded representation that can be read by an Embedding or Dense layer.
Stateful preprocessing layers that compute state based on training data:
- TextVectorization: Holds a mapping between string tokens and integer indices.
- StringLookup and IntegerLookup: Hold a mapping between input values and integer indices.
- Normalization: Holds the mean and standard deviation of the features.
- Discretization: Holds information about value bucket boundaries.
NOTE: These layers are non-trainable. Their state is not set during training; it must be set before training, e.g. by calling adapt() on training data.
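A plain-Python sketch of the stateful pattern, mirroring how tf.keras.layers.Normalization works (the class here is illustrative, not the Keras implementation):

```python
import numpy as np

class Normalization:
    """Sketch of a stateful preprocessing layer: its state (mean, std)
    is computed from training data via adapt() BEFORE training,
    not learned during training."""
    def adapt(self, data):
        self.mean = np.mean(data, axis=0)
        self.std = np.std(data, axis=0)

    def __call__(self, x):
        return (x - self.mean) / self.std

norm = Normalization()
norm.adapt(np.array([[1.0], [3.0]]))  # state: mean=2.0, std=1.0
norm(np.array([3.0]))                  # -> array([1.0])
```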
Categorical Features Processing
- Not advisable for multiple inputs and outputs.
- If any layer in the model needs multiple inputs or outputs, layer sharing, or a nonlinear topology (such as a residual connection or multiple branches), the Sequential API is not suitable.
Functional API gives your model the ability to have multiple inputs and outputs.
- It allows models to share layers, and a bit more than that.
- It allows you to define ad hoc network graphs should you need them. With the Functional API, models are defined by creating instances of layers and connecting them directly to each other in pairs, then defining a model that specifies which layers act as the input and the output when stringing everything together.
- Create models that are more flexible than the sequential API.
- It can handle models with nonlinear topology, models with shared layers, and models with multiple inputs or outputs, so consider that functional API in those use cases.
- The API also makes it easy to manipulate multiple inputs and outputs, which is not possible in the Sequential API.
- Less verbose than using keras.Model subclasses.
- Validates your model while you are defining it.
- Your model is plottable and inspectable.
- Your model can be serialized or cloned.
- Does not support dynamic architectures.
- Sometimes you have to write from scratch and you need to build subclasses, e.g. custom training or inference layers.
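A minimal Functional API sketch of a two-input model, which the Sequential API cannot express (the input names and layer sizes are illustrative):

```python
import tensorflow as tf

# Two named inputs -- impossible with the Sequential API.
num_in = tf.keras.Input(shape=(4,), name="numeric")
cat_in = tf.keras.Input(shape=(3,), name="categorical")

# Layers are connected by calling them on tensors, building the graph.
x = tf.keras.layers.Concatenate()([num_in, cat_in])
x = tf.keras.layers.Dense(8, activation="relu")(x)
out = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs=[num_in, cat_in], outputs=out)
```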
Also known as a non-linear transformation layer.
- Acts as a transition point between layers, which is how you get nonlinearity.
- Adding this nonlinear transformation is the only way to stop the neural network from condensing down into a shallow network.
- Networks with ReLU hidden activations often train up to 10 times faster than networks with sigmoid hidden activations.
- Because the function is always zero in the negative domain, ReLU layers can end up dying: when updating the weights, since the error's derivative is multiplied by the activation, you end up with a gradient of zero.
- The logistic sigmoid function is a smooth approximation of the derivative of the rectifier (ReLU).
- Rectifier that allows small negative values when the input is less than zero.
- Learns parameters that control the leakiness and shape of the function.
- It adaptively learns parameters of the rectifiers.
- It is generalization of ReLU that uses parameterized exponential function to transform from positive to small negative values.
- Nonlinearity results from the expected transformation of a stochastic regularizer, which randomly applies the identity or zero map to the neuron's input.
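The rectifier variants above can be sketched in NumPy (alpha value is illustrative):

```python
import numpy as np

def relu(x):
    # Zero in the negative domain -- the source of "dying ReLU".
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small negative slope (alpha) keeps a nonzero gradient for x < 0.
    return np.where(x >= 0, x, alpha * x)

relu(-2.0)        # 0.0
leaky_relu(-2.0)  # -0.02
```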
Adam is not an acronym. It is an extension to stochastic gradient descent and can be used in place of classical stochastic gradient descent to update network weights more efficiently. PAPER: Adam: A Method for Stochastic Optimization
- It is a procedure to update the network weights iteratively based in training data.
- Invariant to diagonal rescaling of the gradients.
- Well-suited for problems that are large in terms of data and/or parameters.
- Useful when you have a lot of parameters to adjust.
- Handles problems with very noisy or sparse gradients and nonstationary objectives.
It reduces learning rate when the gradient values are small
It gives frequently occurring features low learning rates
It improves on AdaGrad by preventing the learning rate from decaying to zero
- It works well on wide models.
- FTRL, like Adam, makes a really good default for deep neural nets as well as the linear models you're building.
Callbacks are utilities called at certain points during model training for activities such as logging and visualization using tools such as TensorBoard. Saving the training iterations to a variable allows for plotting of all your chosen evaluation metrics like mean absolute error, root mean squared error, accuracy, et cetera, versus the epochs.
Refers to any technique that helps generalize a model.
A generalized model performs well not just on training data, but also on never-seen test data.
L1 Norm measures the absolute value of the distance between a and b (the sum of absolute differences).
L2 Norm is the Euclidean Distance. It is the square root of the sum of the squares.
- Regression model that uses the L1 regularization technique is called Lasso Regression.
- Regression model that uses the L2 Regularization technique is called Ridge Regression.
Lasso vs Ridge difference
Lasso shrinks the less important feature’s coefficient to zero, thus removing some features altogether.
Ridge regression adds “squared magnitude” of coefficient as a penalty term to the loss function.
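The two penalty terms can be sketched directly (function names and the lambda value are illustrative):

```python
import numpy as np

def l1_penalty(weights, lam):
    # Lasso: lambda * sum(|w|) -- can shrink coefficients to exactly zero.
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # Ridge: lambda * sum(w**2) -- the "squared magnitude" penalty.
    return lam * np.sum(weights ** 2)

w = np.array([1.0, -2.0, 3.0])
l1_penalty(w, 0.1)  # 0.1 * 6.0
l2_penalty(w, 0.1)  # 0.1 * 14.0
```

Either penalty is added to the training loss; only L1 produces exact zeros, which is why Lasso can remove features altogether.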
"It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data." Dr. Andrew Ng
Different problems in the same domain may need different features. It depends on you and your subject matter expertise to determine which fields you want to start with for your hypothesis.
Feature engineering is an iterative process.
Using indicator variables to isolate key information: Isolates a specific area for a training dataset.
Highlighting interactions between two or more features: Sum of two features, product of two features, etc.
Representing the same feature in a different way:
- Create new feature "grade" with "Elementary School","Middle School", and "High School" as classes.
- Group similar classes, and then group the remaining ones into a single "Other" class.
- Transform categorical features into dummy variables.
- Features should be related to the objective. Look for good features and avoid bad features.
- Features should be known at prediction-time.
- Features should be numeric.
3.1. The vocabulary and the mapping of the vocabulary need to be identical at prediction time. If new data is added, problems arise for sparse columns (one-hot encoding).
- Features should have enough examples.
It involves two aspects, representation transformation and feature construction.
- Feature representation is converting a numeric feature to a categorical feature through bucketization.
- Converting categorical features to a numeric representation through one-hot encoding, learning with counts, sparse feature embeddings, etc.
- Some models work only with numeric or categorical features.
- Other models handle mixed type features. Even when models handle both types.
- Polynomial expansion by using univariate mathematical functions.
- Feature crossing to capture feature interactions.
2.1. It is all about memorization. Memorization is the opposite of generalization, which is what machine learning aims to do. The goal of ML is generalization.
2.2. It only works on large data sets.
2.3. Feature crosses lead to sparsity. Sparse models contain fewer features, so they are easier to train on limited data with less chance of overfitting.
- Using business logic from the domain of the ML use case.
Performs the feature cross of all the combinations: ML.FEATURE_CROSS(STRUCT(features))
Specify all the preprocessing during model creation: TRANSFORM(ML.FEATURE_CROSS(STRUCT(features)), ML.BUCKETIZE(f, split_points), etc.)
Where split_points is an array: ML.BUCKETIZE(f, split_points)
Memorization works when you have so much data that for any single grid cell within your input space the distribution of data is statistically significant. When that is the case, you can memorize. You are essentially just learning the mean for every grid cell.
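What a feature cross does can be sketched in plain Python (the separator and example values are illustrative; BigQuery's ML.FEATURE_CROSS does this over all combinations in a STRUCT):

```python
def feature_cross(a, b):
    """Cross two categorical values into one combined feature,
    so the model can learn a separate weight per combination."""
    return f"{a}_x_{b}"

feature_cross("Tuesday", "18:00")  # "Tuesday_x_18:00"
```

Each crossed value acts like its own "grid cell"; with enough data per cell, the model can memorize the mean outcome for that cell.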
ST_Distance: returns the shortest distance in meters between two non-empty geographies.
ST_GeoPoint:
It is a preprocessing function that creates buckets or bins. That is, it bucketizes a continuous numerical feature into a string feature with bucket names as the value.
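A plain-Python sketch of the bucketizing idea (the bin_N naming is illustrative, not necessarily BigQuery's exact output):

```python
import bisect

def bucketize(value, split_points):
    """Map a continuous value to a string bucket name,
    as ML.BUCKETIZE does with an array of split points."""
    i = bisect.bisect_right(split_points, value)
    return f"bin_{i + 1}"

bucketize(5.0, [10, 20, 30])   # "bin_1"
bucketize(15.0, [10, 20, 30])  # "bin_2"
```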
When the TRANSFORM clause is used, transforms specified by the user during training will be automatically applied during model serving, prediction, evaluation, etc.
- TRANSFORM clause ensures that transformations are automatically applied during prediction.
Uses an open-source API (Apache Beam); Beam pipelines can also be executed on Flink, Spark, and other runners.
- Beam is a way to write elastic data processing pipelines.
- A Pipeline is a directed graph of steps.
- Pipeline must have a source, which is where the pipeline gets input data.
- The pipeline has a series of steps. Each of the steps in Beam is called a transform.
Each transform works on a structure called PCollection.
I'll return to a detailed explanation of PCollections shortly.
For now, just remember that every transform gets a PCollection as input and outputs the
result to another PCollection. The result of the last transform in a pipeline is important.
- None of the pipeline operators actually run the pipeline. You need a runner to run the pipeline. A runner takes the pipeline code and executes it.
- Runners are platform-specific, meaning that there's a dataflow runner for executing a pipeline on Cloud dataflow. There's also a direct runner that will execute a pipeline on your local computer. You can even implement your own custom runner for your own distributed computing platform.
- PCollection is like a data structure with pointers to where the dataflow cluster stores your data. That's how dataflow can provide elastic scaling of the pipeline.
- Dataflow is elastic and can use a cluster of servers for your pipeline.
- One way to implement the transformation is to take a PCollection of strings, which are called lines in the code, and return a PCollection of integers. This specific transform step in the code computes the length of each line.
import apache_beam as beam

# Schematic streaming pipeline from the course; Transform, Filter, and table
# are placeholders defined elsewhere.
pipe = beam.Pipeline()
(pipe
| beam.io.ReadStringsFromPubSub('project/topic')
| beam.WindowInto(SlidingWindows(60))
| beam.Map(Transform)
| beam.GroupByKey()
| beam.FlatMap(Filter)
| beam.io.WriteToBigQuery(table)
)
pipe.run()
- Apache Beam SDK comes with a variety of connectors that enable dataflow to read from many data sources, including text files in Google Cloud Storage or file systems.
- With different connectors, it's possible to read even from real time streaming data sources, like Google Cloud Pub/Sub or Kafka.
- There are connectors for Cloud Storage, Pub/Sub, BigQuery and more.
It is a hybrid of Apache Beam and TensorFlow.
TFX is an end-to-end ML platform based on TensorFlow.
NOTE: Artifacts produced by tf.Transform's are consumed at both training and serving time to avoid skew.
1. tf.transform is a hybrid of Apache Beam and TensorFlow
2. One of the goals of tf.Transform is to provide a TensorFlow graph for preprocessing that can be incorporated
into the serving graph (and, optionally, the training graph).
The Preprocessing function is the most important concept of tf.Transform. It is a logical description of a transformation of the dataset. The preprocessing function accepts and returns a dictionary of tensors, where a tensor means Tensor or 2D SparseTensor.
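The shape of a preprocessing function can be sketched in plain Python/NumPy (real tf.Transform operates on tensors and uses analyzers such as tft.scale_to_0_1 over the full dataset; the feature names here are illustrative):

```python
import numpy as np

def preprocessing_fn(inputs):
    """Sketch of the tf.Transform idea: a dict of feature arrays in,
    a dict of transformed features out."""
    x = np.asarray(inputs["x"], dtype=float)
    # Full-pass statistics (min, max) are computed once over the whole
    # dataset and reused at serving time to avoid training-serving skew.
    return {"x_scaled": (x - x.min()) / (x.max() - x.min())}

preprocessing_fn({"x": [0.0, 5.0, 10.0]})["x_scaled"]  # [0.0, 0.5, 1.0]
```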
1. Need to keep batch and live processing in sync.
2. All other tooling, such as evaluation, must be kept in sync with batch processing.
Do everything in the training graph
1. Loses the benefits of materialization.
2. Doesn't allow for "reduces".
Do everything in the training graph + using statistics/vocabs generated from raw data.
1. Only allows for stats/vocabs on raw data.
2. Doesn't address materialization.
Transform does batch processing but also emits a tf.Graph that can be used to repeat these transformations at serving time.
1. By combining this graph with the trained model graph into a single serving graph,
you can guarantee that the same operations that were done to the training data are also applied at serving time.
NOTE:
Gradient descent works better when input raw data is scaled. To do that, you first find the minimum
and the maximum of the numeric feature over the entire training data set, and then scale every input value
by the min and max that were computed on the training data set.
For serving, we need to write out the transformation data.
- Experimentation
- Training Operationalization
It is an automated training pipeline with the purpose to help the model be repeatedly retrained.
NOTE: Process that Data Scientists use to develop the models on an experimentation platform.
1. Problem Definition
2. Data Selection
3. Data Exploration
4. Feature Engineering
5. Model Prototyping
6. Model Validation
It is the process of algorithm selection, model training, hyperparameter tuning, and model evaluation in the Experimentation and Prototyping activity.
Feature Store
Data Catalog
Dataplex
Analytics Hub
Dataprep
Online serving is for low-latency data retrieval of small batches of data for real-time processing.
Data stakeholders (consumers, producers and administrators) within an organization face a number of challenges:
- Data consumers don't know what data is where. They have to navigate data "swamps" they stumble into.
- Data consumers don't know what data to use to get insights because most data is not well documented and, even if documented, is not well maintained.
- Data can't be found and is often lost when it resides only in people's minds.
- Is the data fresh, clean, validated, approved for use in production?
- Which data set out of several duplicate sets is relevant and up-to-date?
- How does one data set relate to another?
- Who is using the data set?
- Who and what processes are transforming the data?
- Data producers don't have an efficient way to put forward their data for consumers. If there's no self-service, consumers may overwhelm producers. Several data engineers can't manually provide data to thousands of data analysts.
- Valuable time is lost if data consumers have to find out how to request data access, request it, wait without a defined response time, escalate, and wait again.
Organize and manage your data in a way that makes sense for your business without data movement or duplication. It provides logical constructs like lakes, data zones and assets that enable you to abstract away underlying storage systems and become the foundation for setting policies around data access, security, life-cycle management, and more.
- Achieve freedom of choice.
- Store data wherever you want.
- Choose the best analytics tools for the job.
- Enforce consistent controls.
- Use built-in data intelligence.
- Automate data management.
- Get access to higher quality data.
- Structured Data | Semi-structured Data | Unstructured Data
- Automatic metadata extraction and classification
- Apply data validation and quality checks
- Data Catalog (for search & discovery) and Unified Metadata (across BigQuery and open source)
- Landing Zone
- Structured Zone
- Refined Zone
Exchanges data analytics assets across organizations to address challenges of data reliability and cost. You can exchange data, ML models, or other analytics assets, and easily publish or subscribe to shared datasets in an open, secure, and privacy-safe environment.
Analytics Hub makes it convenient for you to build a data ecosystem.
This data ecosystem could include public exchanges like data from the World Bank, industry exchanges
from healthcare and retail, commercial exchanges that include logistics, consumer, and energy data,
and also Google exchanges, which include patents, web analytics, and trend data.
Traditional data sharing requires batch data pipelines that extract data from databases, store it in flat files, and transmit them to the consumer where they are ingested into another database.
- Expensive and fragile data pipelines: Pipelines are expensive to run, but any changes to the source data can cause them to break.
- Unnecessary data replication: Pipelines result in multiple copies of the data, which bring unnecessary cost, especially with multi-petabyte datasets.
- Late arriving or asynchronous data assets: The time required by batch pipelines also means that data is late arriving, leading to less timely business decisions.
- Loss of visibility and control of data: Traditional data sharing techniques also bypass data governance processes. As a provider of data, how do you know how your data is being used?
- Commercialization of data workflows management: If you want to monetize your data, how do you manage subscriptions and entitlements?
NOTE: Altogether, these challenges mean that organizations are unable to realize the true potential
to transform their business with shared data.
- Data Publisher
- Exchange Administrator
- Data Subscriber
They are collections of data and analytics assets designed for sharing. Administrators can easily curate an exchange by managing the dataset listings within the exchange.
They are collections of tables and views in BigQuery defined by data publisher and make up the unit of cross-project or cross-organization sharing.
Dataprep is a tool to instantly prepare data. Dataprep will produce Dataflow jobs. You can automate or schedule Dataprep jobs because of Dataflow.
- Discover
- Cleanse
- Structure
- Enrich
- Validate
NOTE: When you are importing data into Dataprep, you are creating a reference to a source of data.
When the data is required for use, Dataprep reads a sample of the source data into the application
for your use through an object known as a connection.
There are several ways to upload or download your data from Dataprep.
- Upload/Download: Upload data directly from a local desktop and also save it locally on export.
- Cloud Storage: Read from and write to files in Cloud Storage
- BigQuery: Store relational content in BigQuery, from which Dataprep can read.
- Automatically identify schemas, data types, possible joins, and anomalies.
- Anomaly detection includes missing values, outliers, and duplicates, so you can skip the time-consuming work of data profiling.
- Offer visual representation with histograms, ranges, and key statistical information.
- Automatically detect 17 different data types.
- Can transform structured and unstructured datasets stored in CSV, JSON, or relational table formats.
- It can store any size, from megabytes to petabytes, with equal ease and simplicity.
- Quickly identify data quality issues.
- Get automatic data transformation suggestions.
- Standardize, structure and join datasets easily with a guided approach.
- Process diverse datasets
- Prepare datasets of any size
- Built on top of Dataflow
- Auto-scalable
NOTE: In Dataprep, flows are implemented as sequences of recipes. The recipes are data processing
steps built from a library of Wranglers. Cloud Dataprep Wranglers write Beam code to run on Cloud Dataflow.
1. Build recipes in the Cloud Dataprep UI.
2. Convert recipes to Beam.
3. Run a Cloud Dataflow job pipeline.
Recipes are a repeatable set of transformation steps, built by chaining data Wranglers together. You can include end-to-end steps from ingestion, transformation, aggregation, and save to BigQuery. You will run your Dataprep job to process your recipes against your entire dataset.
Learning rate controls the size of the step in weight space. If steps are too small, the training will take a long time. If steps are too large, the training will bounce around and miss the optimal point.
NOTE: The default learning rate for the linear regressor estimator in the TensorFlow library is set to 0.2, or one over the square root of the number of features. This assumes your feature and label values are small numbers.
Batch size controls the number of samples that gradient is calculated on. If batch size is too small, we could be bouncing around because the batch may not be a good enough representation of the input. If batch size is too large, training will take a very long time.
- As a rule of thumb, 40 to 100 tends to be a good range for batch size.
- Larger batch sizes require smaller learning rates.
- Recent research suggests that small mini-batch sizes provide more up-to-date gradient calculations, which yields more stable and reliable training. In experimental results for the CIFAR-10, CIFAR-100, and ImageNet datasets, the best performance was consistently obtained with mini-batch sizes between m = 2 and m = 32, so it may be better to start with a smaller batch size.
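The learning-rate trade-off above can be seen with plain gradient descent on a one-dimensional quadratic. This is a toy sketch; the objective, step count, and rates are invented for illustration:

```python
def gradient_descent(lr, steps=50, w=0.0):
    # minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

w_good = gradient_descent(lr=0.1)     # converges close to the optimum w = 3
w_small = gradient_descent(lr=0.001)  # steps too small: still far from 3 after 50 steps
w_large = gradient_descent(lr=1.1)    # steps too large: bounces past the optimum and diverges
```

With lr=0.1 the error shrinks by a factor of 0.8 per step; with lr=1.1 it grows by a factor of 1.2 per step, which is the "bouncing around" failure mode.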
It is running training in parallel on many devices such as CPUs or GPUs or TPUs in order to make your training faster.
Data parallelism and model parallelism.
It is a common architecture for distributed training, where you run the same model and computation on every device but train each using different training samples. Each device computes loss and gradients based on the training samples it sees. Then the model's parameters are updated using these gradients, and the updated model is used in the next round of computation.
Two Model Approaches
In an async parameter server architecture, some devices are designated as parameter servers and others as workers. Each worker independently fetches the latest parameters from the PS and computes gradients based on a subset of training samples. It then sends the gradients back to the PS, which updates its copy of the parameters with those gradients. Each worker does this independently, which allows the approach to scale well to a large number of workers. This has worked well for many models at Google, where training workers might be preempted by higher-priority production jobs, a machine may go down for maintenance, or there is asymmetry between the workers. None of these hurt scaling, because workers are not waiting for each other.
The downside of this approach, however, is that workers get out of sync. They compute parameter updates based on stale values, and this can delay convergence.
In this approach, each worker holds a copy of the model's parameters. There are no special servers holding the parameters. Each worker computes gradients based on the training samples it sees, and workers communicate among themselves to propagate the gradients and update their parameters. All workers are synchronized: conceptually, the next forward pass doesn't begin until each worker has received the gradients and updated its parameters. With fast devices in a controlled environment, the variance of the step time between workers is small. When combined with strong communication links between the workers, the overhead of synchronization is also small, so overall this approach can lead to faster convergence. These are the two broad strategies: the asynchronous parameter server approach and the synchronous allreduce approach.
Async parameter server approach
- Multiple machines.
- Many low-power or unreliable workers. Such as a cluster of machines with just CPUs.
- More mature approach. It is supported well by TensorFlow by the estimator API's train and evaluate method.
- Constrained by I/O.
Sync allreduce approach
- Multiple devices on one machine.
- When there are fast devices with strong communication links such as multiple GPUs on one host, or TPUs.
- Fast devices with strong links. Gaining a lot more traction recently because of the improvements in hardware.
- Constrained by compute power.
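The synchronous data-parallel loop can be sketched in plain Python: each simulated worker computes a gradient on its own data shard, the gradients are averaged (a stand-in for allreduce), and every replica applies the same update. The toy model, data, and learning rate are invented:

```python
def local_gradient(w, batch):
    # gradient of mean squared error for the model y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def allreduce_mean(values):
    # conceptual stand-in for allreduce: every worker receives the average
    avg = sum(values) / len(values)
    return [avg] * len(values)

# toy data generated from y = 2 * x, sharded across 4 simulated workers
data = [(float(x), 2.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(100):
    grads = [local_gradient(w, shard) for shard in shards]  # per-worker gradients
    g = allreduce_mean(grads)[0]                            # synchronize across workers
    w -= 0.01 * g                                           # identical update on every replica
```

Because every replica applies the same averaged gradient, all copies of `w` stay in sync, which is the defining property of the synchronous approach.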
It is when your model is so big that it doesn't fit on one device's memory. So you divide it into smaller parts that compute over the same training samples on multiple devices. For example, you could put different layers on different devices.
- Local Training First: instead of training your model directly within your notebook instance, you can submit a training job from your notebook. The training job would automatically provision computing resources and deprovision those resources when the job is complete. There's no worrying about leaving a high-performance virtual machine configuration running.
- Modularized Architecture: The training service can help to modularize your architecture. Put your training code into a container to operate as a portable unit. The training code can have parameters passed into it, such as input data location and hyperparameters, to adapt to different model-type scenarios without redeployment. Also, the training code can export the trained model file, thus enabling working with other AI services in a decoupled manner.
- Cloud Logging: The training service also supports reproducibility. Each training job is tracked with its inputs, outputs, and the container image used. Log messages are available in Cloud Logging, and jobs can be monitored while running.
- Distributed Training: The training service also supports distributed training, which means that you can train models across multiple nodes in parallel. That translates into faster training times than would be possible within a single VM instance.
- BigQuery data source tables cannot be larger than 100 gigabytes.
- You must use a multi-regional BigQuery dataset in the US or EU locations.
- If the table is in a different project, you must provide the BigQuery Data Editor role to the Vertex AI service account in that project.
- The first line of the data source must contain the name of the columns.
- Each data source file cannot be larger than 10 gigabytes. You can include multiple files up to a maximum size of 100 gigabytes.
- If the cloud storage bucket is in a different project where you use Vertex AI, you must provide the Storage Object Creator role to the Vertex AI service account in that project.
Artifact Lineage: it describes all the factors that resulted in an artifact such as training data or hyperparameters used for model training.
- The training, test, and evaluation data used to create the model.
- The hyperparameters used during model training.
- The code that was used to train the model.
- Metadata recorded from the training and evaluation process.
- Artifacts that descended from this model.
Source control repo (storage location)
- Notebooks
- Pipeline source code
- Preprocessing functions
- Model source code
Experiments and ML metadata (storage location)
- Experiments
- Parameters
- Metrics
- Datasets (reference)
- Pipeline metadata
Vertex AI (storage location)
- Trained models
Artifact Registry
- Pipeline containers
- Custom training environments
- Custom prediction environments
Vertex Prediction
- Deployed models
- Traffic Patterns
- Error rates
- Latency
- Resource Utilization
The machine learning model registry is a centralized tracking system that stores lineage, versioning, and related metadata for published machine learning models.
Registry can capture governance data required for auditing purposes:
1. Who trained and published a model.
2. Which datasets were used for training.
3. The values of metrics measuring predictive performance.
4. When the model was deployed to production.
Static Training | Dynamic Training |
---|---|
Space intensive | Compute intensive |
Higher storage cost | Lower storage cost |
Low, fixed latency | Variable latency |
Lower maintenance | Higher maintenance |
Highly Peaked: A model that predicts the next word based on the current word, which you might find in your mobile phone keyboard app, would be highly peaked because a small number of words account for the majority of words used.
Low Peaked: A model that predicts quarterly revenue for all sales verticals in order to populate a report would have a much flatter distribution across those verticals.
Low Cardinality: Model predicting sales revenue given organization division number.
High Cardinality: Model predicting lifetime value given a user ID on an e-commerce platform.
1. Modular programs are more maintainable.
2. Easier to reuse.
3. Easier to test.
4. Easier to fix because they allow engineers to focus on small pieces of code rather than the entire program.
1. An upstream model.
2. A data source maintained by another team.
3. The relationship between features and labels.
4. The distributions of inputs.
It is the change in relationships between the model inputs and the model output.
Sudden Drift: a new concept occurs within a short time.
Gradual Drift: a new concept gradually replaces an old one over a period of time.
Incremental Drift: an old concept incrementally changes to a new concept over a period of time.
Recurring Concepts: an old concept may reoccur after some time.
Data Drift | Concept Drift |
---|---|
Change in spamming behavior to try to fool the model | e-commerce apps' reliance on personalization, for example, the fact that people's preferences ultimately do change over time |
Rule update in the app, e.g. a change in the limit of user messages per minute | Sensors: the nature of the data they collect and how it may change over time |
Selection bias, as training data for a given season has no power to generalize to another season | Movie recommendations rely on user preferences, and those may change |
Non-stationary environment: training data for a given season has no power to generalize to another season | Demand forecasting relies heavily on time, and as we have seen, time is a major contributor to potential concept drift |
Data drift: If you diagnose data drift, enough of the data needs to be labeled to introduce new classes and the model retrained.
Concept drift: If you diagnose concept drift, the old data needs to be relabeled and the model retrained. Periodically updating your static model with more recent historical data, for example, is a common way to mitigate concept drift. Alternatively, either discard the static model completely, or use its existing state as the starting point for a better model and update it using a sample of the most recent historical data.
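A lightweight way to flag data drift is to compare serving feature statistics against training statistics. This sketch is illustrative only; the values and thresholds are invented, and real monitoring (e.g. TFDV) uses richer distribution tests:

```python
def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def drift_score(train, serving):
    # standardized shift of the serving mean relative to training statistics;
    # a large score suggests the input distribution has drifted
    return abs(mean(serving) - mean(train)) / (std(train) + 1e-9)

train_values = [10.0, 11.0, 9.0, 10.5, 9.5]   # feature values seen at training time
stable_serving = [10.2, 9.8, 10.1]            # similar distribution: low score
shifted_serving = [15.0, 16.0, 14.5]          # shifted distribution: high score
```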
1. Ingest and validate data.
  1.1. TensorFlow Data Validation.
  1.2. Two common use cases of TensorFlow Data Validation within a TensorFlow Extended pipeline: validation of continuously arriving data, and training-serving skew detection.
2. Train and analyze the model.
  2.1. MLOps.
3. Deploy in production.
On day one you generate statistics based on data from day one. Then, you generate statistics based on day two data. From there, you can validate day two statistics against day one statistics and generate a validation report. You can do the same for day three, validating day three statistics against statistics from both day two and day one.
Training-serving skew occurs when training data is generated differently from how the data used to request predictions is generated.
What causes distribution skew?
- Possible causes might come from a change in how data is handled in training vs in production, or even a faulty sampling mechanism that only chooses a subsample of the serving data to train on. For example, if you use an average value, and for training purposes you average over 10 days, but when you request prediction, you average over the last month. In general, any difference between how you generate your training data and your serving data, the data you use to generate predictions, should be reviewed to prevent training-serving skew.
- Training-serving skew can also occur based on your data distribution in your training, validation, and testing data splits.
- To summarize, distribution skew occurs when the distribution of feature values for training data is significantly different from serving data and one of the key causes for distribution skew is how data is handled or changed in training vs production.
- ExampleGen component: which takes raw data as input and generates TensorFlow examples, it can take many input formats for example CSV, TF Record. It also splits the examples for you into Train/Eval.
- StatisticsGen (Statistics Generation) component: which generates statistics for feature analysis.
- SchemaGen (Schema Generation) component: which gives you a description of your data.
- Example Validator component: which allows you to check for anomalies.
Common problems | Identify effective features |
---|---|
Missing Data | Informative features |
Labels treated as features | Redundant features |
Features with values outside an expected range | Features that vary so widely in scale that they may slow learning |
Data anomalies | Features with little or no unique predictive information |
- Feature min, max, mean, mode, and median.
- Feature correlations.
- Class imbalance.
- Check to see missing values.
- Histograms of features, both numerical and categorical.
Types, Categories and Ranges of the data.
- Type: indicates the feature datatype
- Presence: indicates whether the feature must be present in 100% of examples or not, so whether it’s required or optional.
- Valency: indicates the number of values required per training example.
- Domain and Values: indicates the feature domain and its values. In the case of categorical features, single indicates that each training example must have exactly one category for the feature.
- It can detect different classes of anomalies in the data and emit validation results.
- The ExampleValidator pipeline component identifies any anomalies in the example data by comparing data statistics computed by the StatisticsGen pipeline component against a schema.
- It takes the inputs and looks for problems in the data, like missing values, and reports any anomalies.
It is the time taken to train a model.
Constraint | Input/Output | CPU | Memory |
---|---|---|---|
Commonly Occurs | *Large inputs. *Input requires parsing. *Small models | *Expensive computations. *Underpowered Hardware | *Large number of inputs. *Complex model |
Take Action | *Store efficiently. *Parallelize reads. *Consider batch size | *Train on faster accel. *Upgrade processor. *Run on TPUs. *Simplify model | *Add memory. *Use fewer layers. *Reduce batch size. |
Distributed training distributes training workloads across multiple mini-processors, or worker nodes. These worker nodes work in parallel to accelerate the training process. Their parallelism can be achieved via two types of distributed training architecture: data parallelism and model parallelism.
It is model-agnostic, making it the most widely used paradigm for parallelizing neural network training. In data parallelism, you run the same model and computation on every device, but train each of them using different training data samples. Each device computes loss and gradients based on the training samples. Then you update the model's parameters using these gradients. The updated model is then used in the next round of computation.
There are two approaches used to update the model using gradients.
All of the devices train their local model using different parts of data from a single, large mini-batch. They then communicate their locally calculated gradients, directly or indirectly, to all devices. In this approach, each worker device computes the forward and backward passes through the model on a different slice of input data. The computed gradients from each of these slices are then aggregated across all of the devices and reduced, usually using an average, in a process known as
Allreduce. The optimizer then performs the parameter updates with these reduced gradients, thereby keeping the devices in sync.
Because each worker cannot proceed to the next training step until all the other workers have finished the current step, this gradient calculation becomes the main overhead in distributed training for synchronous strategies. Only after all devices have successfully computed and sent their gradients, so that all models are synchronized, is the model updated.
The asynchronous parameter server approach is great for sparse models, which contain fewer features and consume less memory, and it can run on just a cluster of CPUs. It shards the model across parameter servers, and workers only need to fetch the part they need for each step.
No device waits for updates to the model from any other device. The devices can run independently and share results as peers, or communicate through one or more central servers known as parameter servers. Thus, in an asynchronous parameter server architecture, some devices are designated to be parameter servers and others as workers. Devices used to run computations are called worker devices, while devices used to store variables are parameter devices. Each worker independently fetches the latest parameters from the parameter servers and computes gradients based on a subset of training samples. It then sends the gradients back to the parameter server, which then updates its copy of the parameters with those gradients. Each worker does this independently. This allows it to scale well to a large number of workers, where training workers might be preempted by higher priority production jobs, or a machine may go down for maintenance, or where there is asymmetry between the workers. This doesn't hurt the scaling, because workers are not waiting for each other.
The downside of this approach, however, is that workers can get out of sync. They compute parameter updates based on stale values, and this can delay convergence.
For dense models, the parameter server transfers the whole model each step, and this can create a lot of network pressure. Therefore, the synchronous Allreduce approach should be considered for dense models which contain many features and thus consume more memory. In this approach, all machines share the load of storing and maintaining the global parameters. This makes it the best option for dense models, like BERT, Bidirectional Encoder Representations from Transformers.
When a model is too big to fit on one device's memory, you can divide it into smaller parts on multiple devices and then compute over the same training samples. This is called model parallelism.
Model parallelism feeds or gives every processor the same data, but applies a different model to it. Think of model parallelism as simply multiple programs, same data. Model parallelism splits the weights of the net equally among the threads, and all threads work on a single mini-batch. Here, the generated output after each layer needs to be synchronized, i.e. stacked, to provide the input to the next layer. In this approach, each GPU has different parameters and computes different parts of the model.
In other words, multiple GPUs do not need to synchronize the values of the parameters.
Model parallelism needs special care when assigning different layers to different GPUs, which is more complicated than data parallelism. The gradients obtained from each model and each GPU are accumulated after a backward process, and the parameters are synchronized and updated. However, a hybrid of the data and model parallelism approaches is sometimes used together in the same architecture.
Optimize TensorFlow performance using the Profiler
- How will you distribute the data across the different devices?
- How will you accumulate the gradients during backpropagation?
- How will the model parameters be updated?
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. There are four TensorFlow distributed training strategies that support data parallelism: Mirrored Strategy, Multi-Worker Mirrored Strategy, TPU Strategy, Parameter Server Strategy.
You can use mirrored strategy when you have a single machine with multiple GPU devices.
- Mirrored strategy will create a replica of the model on each GPU.
- During training, one minibatch is split into n parts, where "n" equals the number of GPUs, and each part is fed to one GPU device.
- For this setup, mirrored strategy manages the coordination of data distribution and gradient updates across all of the GPUs.
To improve training, we can use the MirroredStrategy.
It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs.
Similar to mirrored strategy, it creates copies of all variables in the model on each device across all workers. If you've mastered single-host training and are looking to scale training even further, then adding multiple machines to your cluster can help you get an even greater performance boost.
For faster training, we can use the MultiWorkerMirroredStrategy.
The "cluster" key contains a dictionary with the internal IPs and ports of all the machines. This is set up through TF_CONFIG.
All machines are designated as "workers", which are the physical machines on which the replicated computation is executed.
There needs to be one worker that takes on some extra work, such as saving checkpoints and writing summary files to TensorBoard. This machine is known as the "chief".
- Create a strategy object.
- Wrap the creation of the model parameters within the Scope of the strategy.
- Scale the batch size by the number of replicas in the cluster.
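The TF_CONFIG setup and batch-size scaling described above can be sketched as follows. The cluster addresses are hypothetical, and the strategy/model creation itself is omitted so the sketch stays self-contained:

```python
import json
import os

# hypothetical two-machine cluster; IPs and ports are made up for illustration
tf_config = {
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},
    "task": {"type": "worker", "index": 0},  # worker 0 acts as the chief
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# scale the global batch size by the number of replicas in the cluster
num_replicas = len(tf_config["cluster"]["worker"])
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * num_replicas
```

Each machine in the cluster would run the same script with its own `task.index`; the worker with index 0 takes on the extra chief duties such as checkpointing.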
If the data is not stored in a single dataset, then TensorFlow's AutoShardPolicy will autoshard the elements across all the workers.
Saving the model is slightly more complicated in the multi-worker case, because there needs to be different destinations for each worker. The chief worker will save to the desired model directory, while the other workers will save the model to temporary directories. It's important that these temporary directories are unique in order to prevent multiple workers from writing to the same location. Saving can contain collective ops, so all workers must save, not just the chief.
TPUStrategy uses a single machine where the same model is replicated on each core with its variables synchronized (mirrored) across each replica of the model.
The main difference, however, is that TPUStrategy will all-reduce across TPU cores, whereas MirroredStrategy will all-reduce across GPUs.
For really fast training, we can use the TPUStrategy.
- TPUs read training data exclusively from Google Cloud Storage (GCS).
- GCS can sustain a pretty large throughput if it is continuously streaming from multiple files in parallel.
- Too few files: GCS will not have enough streams to get max throughput.
- Too many files: time will be wasted accessing each individual file.
Parameter server training cluster consists of Workers and ParameterServers. Variables are created on ParameterServers, and they are read and updated by Workers in each step.
The tf.data API makes it possible to handle large amounts of data, read it in different file and data formats, and perform complex transformations.
tf.data.Dataset represents a sequence of elements in which each element consists of one or more components. For example, in an image pipeline an element might be a single training example with a pair of tensor components representing the image and its label.
The Dataset API will help you create input functions for your model that load data progressively, throttling it as needed.
Two ways to create a dataset:
1. A data source constructs a dataset from data stored in memory or in one or more files.
2. A data transformation constructs a dataset from one or more tf.data.Dataset objects.
Large datasets tend to be sharded or broken apart into multiple files, which can be loaded progressively. Remember that you train on mini batches of data. You don't even have the entire dataset in memory. One mini batch is all you need for one training step.
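The idea of streaming mini-batches from sharded files can be sketched without TensorFlow. The file names and the reader callback below are hypothetical stand-ins for real shard readers:

```python
def minibatches(filenames, batch_size, read_file):
    # stream examples from sharded files one at a time, yielding fixed-size
    # mini-batches so the full dataset never has to fit in memory
    batch = []
    for name in filenames:
        for example in read_file(name):
            batch.append(example)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch

# simulate two shard files without touching disk
shards = {"part-0": [1, 2, 3], "part-1": [4, 5]}
batches = list(minibatches(shards, 2, lambda name: shards[name]))
```

Note that only one batch is materialized at a time; this is the same property tf.data.Dataset pipelines provide at scale.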
There are specialized dataset classes that can read data from text files like CSVs, TensorFlow records, or fixed length record files. Datasets can be created from many different file formats.
- Use TextLineDataset to instantiate a dataset object, which is comprised of, as you might guess, one or more text files.
- Use TFRecordDataset for TFRecord files.
- FixedLengthRecordDataset creates a dataset object of fixed-length records from one or more binary files.
- For anything else you can use the generic dataset class and add your own decoding code.
- In terms of raw processing speed, you'll want to use Cloud ML Engine batch predictions.
- The next fastest is to directly load the SavedModel into your Dataflow job and then invoke it.
- The third option, in terms of speed, is to use TensorFlow Serving on Cloud ML Engine.
- But if you want maintainability, the second and third options reverse. The batch prediction is still the best.
- Using online predictions as a microservice allows for easier upgradability and dependency management than loading up the current version into the Dataflow job.
An open-source machine learning platform designed to enable the use of machine learning pipelines to orchestrate complicated workflows running on Kubernetes. Kubeflow helps build hybrid cloud machine learning models.
1. It makes deploying machine learning workflows on Kubernetes simple, portable, and scalable.
2. It helps you migrate between cloud and on-prem environments.
3. It also extends Kubernetes' ability to run independent and configurable steps with machine-learning-specific frameworks and libraries.
4. Kubeflow is an open-source machine learning stack built on Kubernetes.
5. On Google Cloud, you can run Kubeflow on Google Kubernetes Engine (GKE).
6. Kubeflow can be run on anything from a phone, to a laptop, to an on-prem cluster.
7. Your code remains the same.
8. Some of the configuration settings just change.
- It is open source, so it can run on Google Kubernetes Engine which is part of Google Cloud.
- Kubeflow can actually run on anything, whether it's a phone, a laptop, or an on-premises cluster. Regardless of where it's run, the code remains the same.
The idea is you continuously train the model on the device, and then you combine the model updates from a federation of user devices to update the overall model. The goal is for each user to get their customized experience because there's model training happening on the device, but still retain privacy because it's the overall model update.
Recommendation systems are about personalization: taking a product that works for everyone and personalizing it for an individual user.
In a content-based recommendation system, you use the metadata about your products; for example, perhaps you know which movies are cartoons and which are sci-fi.
Use item features to recommend new items that are similar to what the user has already liked based on their previous actions or explicit feedback. They don't rely on information about other users or other user item interactions.
1. Needs metadata about the items.
2. Needs market segmentation of the users.
3. Uses attributes of the items to recommend new items to a user.
4. Recommendation relies on the behaviors and item interactions of a single user.
5. Recommended when a large number of items have been rated.
6. Doesn't take into account the behavior or ratings of other users.
7. There is no machine learning.
8. Relies on the builder of the recommendation system to assign proper tags to items and users.
9. Doesn't need information about other users.
10. Can recommend niche items that the user was not initially interested in.
11. Requires domain knowledge to hand-engineer features.
12. Difficult to expand the interests of the user.
1. Genres 2. Themes 3. Actors/directors involved 4. Professional Ratings
1. Movies Summary. 2. Stills from movie. 3. Movie Trailer 4. Professional Reviews
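A minimal content-based sketch: item metadata drives similarity, with no other users involved. The movie titles and genre tags below are made up:

```python
# hypothetical item metadata: genre tags as sets
items = {
    "Movie A": {"sci-fi", "action"},
    "Movie B": {"sci-fi", "drama"},
    "Movie C": {"romance", "drama"},
}

def jaccard(tags_a, tags_b):
    # overlap between two tag sets, in [0, 1]
    return len(tags_a & tags_b) / len(tags_a | tags_b)

def recommend(liked, catalog, k=1):
    # rank the remaining items by tag similarity to the item the user liked
    scores = {name: jaccard(catalog[liked], tags)
              for name, tags in catalog.items() if name != liked}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A user who liked "Movie A" is recommended "Movie B" because they share the sci-fi tag, illustrating both the strength (no other users needed) and the weakness (recommendations stay inside the user's existing tags).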
With collaborative filtering, you don't have any metadata about the products; instead, you learn about item similarity and user similarity from the ratings data itself. We might store our user-movie data in a matrix, with check marks indicating whether the user watched the complete movie, commented on it, gave it a star rating, or however it is that we measure engagement for a specific user.
1. No need of any metadata about your items.
2. No need of market segmentation of your users.
3. Recommendation relies on the behaviors and item interactions of other users.
4. As long as you have an interactions matrix (user-item interaction matrix), you're ready to go.
5. Uses similarities between users and items simultaneously to determine recommendations.
6. Recommended when only a few items have been rated.
7. Collaborative filtering learns latent factors and can explore outside the user's personal bubble.
1. User Ratings. 2. User Views 3. User wishlist/cart history 4. User Purchase/Return History
1. User Reviews 2. User-answered questions 3. User-submitted photos. 4. User-submitted videos
NOTE: Embeddings can be learned from data.
As long as the number of latent features k is less than half the harmonic mean of the number of users and the number of items, this will save space. For this hypothetical website, that would be almost 10,000 latent features (9,998 to be precise); each movie is essentially its own feature.
k < UV / (U + V)   (half the harmonic mean of U and V)
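A quick space check for this bound, using hypothetical catalog sizes (the user and item counts below are invented, not the course's example):

```python
U, V = 1_000_000, 10_000             # users, items (hypothetical)
half_harmonic = (U * V) / (U + V)    # half the harmonic mean of U and V
k = 5_000                            # number of latent features, kept below the bound

dense_entries = U * V                # entries in the full user-item interaction matrix
factored_entries = (U + V) * k       # entries in the user factor + item factor matrices
```

The factorization stores (U + V) x k numbers instead of U x V, so keeping k below UV/(U + V) guarantees the factored form is smaller.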
Flexible and parallelizable algorithm; you can scale your problem to handle much larger data.
Slower, and hard to handle unobserved interaction pairs.
Solves for U holding V constant, then solves for V holding U constant.
Only works for least-squares problems.
Parallelizable algorithm; you can scale your problem to handle much larger data.
Faster convergence than SGD.
Easier to handle unobserved interaction pairs.
Unobserved pairs are given values of zero, which leads to poor performance and recommendations.
|A-UV^T|^2
Just ignore the missing values.
SUM(i,j in obs) (Aij − Ui·Vj)^2
Assigns a weight for those interaction pairs that are missing as a way to represent low confidence.
SUM(i,j in obs) (Aij − Ui·Vj)^2 + w0 × SUM(i,j not in obs) (0 − Ui·Vj)^2
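The weighted objective can be evaluated directly. The tiny matrix, factor values, and weight w0 below are made up; None marks an unobserved pair:

```python
# tiny 2x2 ratings matrix; None marks the unobserved pair
A = [[5.0, 3.0], [4.0, None]]
U = [[1.0], [0.8]]   # user factors, k = 1
V = [[4.8], [3.1]]   # item factors, k = 1

def wals_loss(A, U, V, w0=0.1):
    loss = 0.0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            pred = sum(ui * vj for ui, vj in zip(U[i], V[j]))
            if a is not None:
                loss += (a - pred) ** 2        # observed interaction term
            else:
                loss += w0 * (0 - pred) ** 2   # down-weighted unobserved term
    return loss
```

Setting w0 between 0 (ignore missing values entirely) and 1 (treat them as confident zeros) expresses low confidence in the unobserved pairs.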
NOTE: If you have metadata and an interactions matrix a neural network solution is recommended.
Use neural networks to combine the advantages, and eliminate the disadvantages, of all three types of recommendation systems.
Knowledge-based recommendation systems are based on explicit knowledge about the user's preferences, the items, and/or recommendation criteria. They are especially useful when alternative approaches, such as collaborative filtering or content-based methods, cannot be applied; this occurs in situations where items are not purchased very often.
1. System that can be used to provide business impact inputs to systems.
2. Used in situations where items are not purchased very often.
3. Recommended when there is no rated information about an item or user.
4. Will often explicitly ask users for their preferences and then use that information to begin making
recommendations.
1. Demographic information 2. Location/country/language 3. Genre preferences 4. Global filters
1. User "about me" snippets.
Hybrid models use all three of these types of systems, drawing on all the data available and connecting the models together into an ML pipeline.
- User space and product space are sparse and skewed.
  1.1. The interaction matrix is sparse because there are potentially few interactions within the entire user-item space.
  1.2. Most items are rated by very few users.
  1.3. Most users rate only a small fraction of items.
  1.4. The interaction matrix is skewed because some properties are very popular.
  1.5. The interaction matrix is skewed because some users are very prolific.
- Cold-start problem, when there aren't enough interactions for users or items.
- Lack of explicit user feedback leads to the need for implicit user feedback:
  3.1. Number of clicks.
  3.2. Play counts.
  3.3. Fraction of video watched.
  3.4. Site navigation.
  3.5. Time spent on page.
- Explicit ratings are not easily available.
The dot product is a similarity metric for items in an embedding space.
s(a,b) = SUM_i (a_i × b_i)
def dot(ai, bi):
    # sum of elementwise products of the two vectors
    return sum(a * b for a, b in zip(ai, bi))
Cosine similarity is the dot product scaled by the norms of the feature vectors.
s(a,b) = SUM_i (a_i × b_i) / (|a||b|) = SUM_i (a_i × b_i) / (sqrt(SUM_i a_i^2) × sqrt(SUM_i b_i^2))
def norm(vi):
    return sum(v*v for v in vi) ** 0.5
def cos_sim(ai, bi):
    # Dot product scaled by the product of the vector norms.
    d = dot(ai, bi)
    norm_a, norm_b = norm(ai), norm(bi)
    return d / (norm_a * norm_b)
Deep learning models can also be used when building a recommendation system.
Deep neural networks work well because they are flexible and can be trained for varying outcomes, such as predicting ratings, interactions, or even the next item.
Context-aware recommendation systems add an extra dimension to the usual collaborative filtering problem. Traditional collaborative filtering recommendation systems use a rank-2 tensor: a user-item interaction matrix containing explicit or implicit ratings. Contextual collaborative filtering recommendation systems, on the other hand, use a multi-dimensional tensor in which the user-item interaction matrix of ratings is stratified across multiple dimensions of context.
Data (U x I x C -> R)
  -> Apply [C] Context Vector
  -> Contextualized Data (U x I -> R)
  -> 2D Recommender (U x I -> R)
  -> Apply [U] User Vector
  -> Contextual Recommendations (i1, i2, i3, ...)
1. Reduction-Based Approach (2005)
2. Exact and Generalized Prefiltering (2009)
3. Item Splitting (2009)
4. User Splitting (2011)
5. Dimensions as Virtual Items (2011)
6. User-Item Splitting (2014)
t-value
t = |M_ic - M_ić| / sqrt(S_ic/N_ic + S_ić/N_ić)
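As a sketch (my own illustrative helper, not from the course), the t-value can be computed as a two-sample t statistic comparing an item's mean rating inside a context against its mean rating outside that context:

```python
from statistics import mean, variance

def t_value(ratings_in_context, ratings_out_of_context):
    # Two-sample t statistic: absolute mean difference over the
    # standard error sqrt(S_ic/N_ic + S_ic'/N_ic').
    m_in, m_out = mean(ratings_in_context), mean(ratings_out_of_context)
    s_in, s_out = variance(ratings_in_context), variance(ratings_out_of_context)
    n_in, n_out = len(ratings_in_context), len(ratings_out_of_context)
    return abs(m_in - m_out) / (s_in / n_in + s_out / n_out) ** 0.5
```

A large t-value suggests the context significantly changes how the item is rated, so the item is a candidate for splitting.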
Data (U x I x C -> R)
  -> 2D Recommender (U x I -> R)
  -> Apply [U] User Vector
  -> Recommendations (i1, i2, i3, ...)
  -> Apply [C] Context Vector
  -> Contextual Recommendations (i1, i2, i3, ...)
Weight postfiltering method:
R'ij = Rij * P, where P is the probability that the item is relevant to the user in the given context.
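A minimal sketch of the weight postfiltering step (hypothetical names; `context_prob` is assumed to hold a per-item probability of contextual relevance):

```python
def weight_postfilter(predicted_ratings, context_prob):
    # R'_ij = R_ij * P: scale each context-free predicted rating by the
    # probability that the item is relevant in the target context.
    return {item: rating * context_prob.get(item, 0.0)
            for item, rating in predicted_ratings.items()}
```

Items unlikely to matter in the current context are pushed down the ranking without retraining the 2D recommender.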
Data (U x I x C -> R)
  -> MD Recommender (U x I x C -> R)
  -> Apply [U] User Vector
  -> Apply [C] Context Vector
  -> Contextual Recommendations (i1, i2, i3, ...)
- How does the user's rating deviate across contexts?
- Contextual rating deviation (CRD) looks at the deviations of users across context dimensions.
- CRD is used to adjust rating recommendations.
Biased Matrix Factorization in Traditional RS
Rui = M + bu + bi + pu^Tqi
M = Global Average Rating
bu = User Bias
bi = Item Bias
pu^Tqi = User-Item interaction
Deviation-based Context Aware Matrix Factorization (CAMF_C approach)
Ruic1c2...cN = M + bu + bi + pu^Tqi + SUMj=1^N(CRDcj)
M = Global Average Rating
bu = User Bias
bi = Item Bias
pu^Tqi = User-Item interaction
Ruic1c2...cN = Contextual Rating
SUMj=1^N(CRDcj) = Contextual Rating Deviation
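The CAMF_C prediction above can be sketched in plain Python (illustrative values; vector sizes and names are my own assumptions):

```python
def camf_c_predict(global_mean, user_bias, item_bias, p_u, q_i, crd_terms):
    # R_uic1...cN = M + b_u + b_i + p_u^T q_i + sum_j CRD_cj
    interaction = sum(p * q for p, q in zip(p_u, q_i))
    return global_mean + user_bias + item_bias + interaction + sum(crd_terms)
```

Setting `crd_terms` to an empty list recovers the traditional biased matrix factorization prediction.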
Reinforcement learning is borrowed from an area of behavioral psychology known as operant conditioning and deals with learning the relationship between stimuli, actions, and consequences, that is, the occurrence of rewards or punishments. These rewards and punishments then guide the learner toward the desired behavior, or policy.
Reinforcement learning in software is an area of machine learning where an agent or system of agents learns to achieve a goal by interacting with its environment. By goal, we mean that we want the agent to learn the optimal path or behavior that collects the maximum reward.
The agents learn to:
1. Achieve a goal.
2. Achieve the optimal behavior.
3. Obtain the maximum reward.
- There is no supervisor. There is only a real-valued reward signal.
- Decision making is sequential.
- Time plays a crucial role in RL problems.
- Feedback is always delayed, not instantaneous.
- The agent's actions determine the subsequent data it receives.
Agent -> Environment: States | Actions | Rewards
State: Summary of events so far; the current situation.
Action: One or more events that alter the state.
Environment: The scenario the agent has to respond to.
Agent: The learner entity that performs actions in an environment.
Reward: Feedback on agent actions, also known as reward signal.
Policy: Method to map the agent's states to actions.
Episode: A termination point.
Value: Long-term reward gained by the end of an episode.
Value Function: Measure of potential future rewards from being in a particular state, or V(S)
Q(S,A): "Q-value" of an action in various state/action pairs.
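A minimal tabular Q-learning update illustrating the Q(S,A) idea (my own sketch; Q is assumed to be stored as a dict of state -> {action: value}):

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Move Q(S,A) toward the TD target: reward + gamma * max_a' Q(S',a').
    best_next = max(Q[next_state].values()) if Q.get(next_state) else 0.0
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    return Q
```

Repeating this update over many episodes makes the table converge toward the expected long-term value of each state/action pair.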
Qualities | Model-based | Model-Free |
---|---|---|
You have access to or knowledge about the environment | YES | NO |
You can avoid needless exploration by focusing on areas you already know are worthwhile | YES | NO |
Need to make more assumptions and approximations | YES | NO |
Need lots of samples | NO | YES |
Over many episodes, results become less optimal | YES | NO |
Over many episodes, results become more optimal | NO | YES |
Applicable across a wide variety of applications | NO | YES |
Model-based methods | Model-free methods |
---|---|
Analytic gradient computation | Value-based |
Sampling-based planning | Policy-based |
Model-based data generation | Contextual bandits |
Value-equivalence prediction | Actor-critic |
 | On-policy |
 | Off-policy |
You explore in order to learn state-action values and maximize a value function, V(S).
The agent can sample and generalize to derive a policy, pi, that maximizes the value of action for each state.
Sample backup: the agent learns from environment sampling, which may provide an incomplete picture of the environment dynamics. With enough samples, the sample backup approaches come closer to the full backup approaches.
Deep backup: the agent learns the whole trajectory of the chosen action up to the termination point. This can be the whole trajectory of the sample and not necessarily the full environment.
Shallow backup: the agent learns one step at a time, in a breadth-first-search manner, of the chosen action trajectory.
Full backup: the agent learns from the ability to access the complete environment.
1. It tends to evaluate the whole trajectory of actions the agent took up to the terminal state of the episode, because this method of reward attribution is sensitive to the trajectory of actions taken.
2. It tends to overfit and exhibit higher variance.
3. If a particular value was achieved, it's assumed that each action was equally responsible for the outcome.
4. There is an inherent assumption that an episode has a terminal state, or endpoint, and reaches it in a feasible amount of time.
1. Easy to implement: at each time step, with each action the agent takes, it gets a reward.
2. The rewards accumulate throughout the episode and are backed up throughout, so that the agent learns that these actions led to a certain cumulative reward.
The TD method can learn directly from raw experience without a model of the environment's dynamics.
Unlike the Monte Carlo method, TD estimates are based in part on other learned estimates, without waiting until the end of the episode (bootstrapping). The agent learns from one or more intermediate time steps in a recursive fashion.
This recursive learning helps accelerate overall learning, even in cases where there might not be any well-defined terminal states.
NOTE: Because TD backups haven't seen the whole set of trajectories, they have a narrow perspective and tend to underfit, especially in the beginning, and exhibit higher bias.
TD backups are used more often than Monte Carlo backups.
You want the action performed in every state to help you gain the maximum reward in the future.
The agent will:
- Learn the stochastic policy function that maps state to action.
- Act by sampling policy.
- Utilize exploration techniques.
Policy-based methods are preferred over value-based methods when:
- There are large action spaces.
- A stochastic policy is needed.
- The agent should learn the policy directly.
- Lower bias in the policy is needed.
An agent simultaneously attempts to:
- Explore (acquire new knowledge).
- Exploit (optimize its decision based on existing knowledge).
It is an extension of multi-armed bandits or simplified RL.
- In a sequence of trials, the agent acts based on a given context.
- Each data point is a new episode.
- Value of exploration strategies is much easier to quantify/tune.
- Context can be the input feature space (recommender|personalization systems).
- A policy as the function approximator: Estimation value gain from an action.
The bandits problem is a simplified reinforcement learning problem which has only one time step and no state transition dynamics.
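The explore/exploit trade-off above can be sketched with the classic epsilon-greedy strategy for a multi-armed bandit (my own illustration; names are assumptions):

```python
import random

def epsilon_greedy(arm_values, epsilon=0.1, rng=random):
    # Explore a random arm with probability epsilon; otherwise exploit
    # the arm with the highest estimated value.
    if rng.random() < epsilon:
        return rng.randrange(len(arm_values))
    return max(range(len(arm_values)), key=arm_values.__getitem__)
```

In a contextual bandit, the value estimates would be produced by a policy function of the context rather than a fixed per-arm table.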
Batch Normalization: Our input pixel values are in the range [0,1] and this is compatible with the dynamic range of the typical activation functions and optimizers. However, once we add a hidden layer, the resulting output values will no longer lie in the dynamic range of the activation function for subsequent layers. When this happens, the neuron output is zero, and because there is no difference by moving a small amount in either direction, the gradient is zero. There is no way for the network to escape from the dead zone.
To fix this, batch norm normalizes neuron outputs across a training batch of data, i.e. it subtracts the average and divides by the standard deviation. The network then decides, through machine learning, how much centering and re-scaling to apply at each neuron; in Keras, you can selectively enable one or the other:
x^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])
import numpy as np

def batchnorm(data, gamma, beta, eps=1e-5):
    """
    Arguments:
    - data = Data of shape (X,Y)
    - gamma = Scale parameter of shape (Y,)
    - beta = Shift parameter of shape (Y,)
    - eps = Constant for numeric stability
    Returns:
    - out = Data result of shape (X,Y)
    - cache = A tuple of values needed for backward pass
    """
    sample_mean = data.mean(axis=0)
    sample_var = data.var(axis=0)
    std, x_center = np.sqrt(sample_var + eps), data - sample_mean
    x_norm = x_center / std
    out = (gamma * x_norm) + beta
    cache = (x_norm, x_center, std, gamma)
    return out, cache
Continuous Integration (CI) | Continuous Delivery or Deployment (CD) | Continuous Training (CT) |
---|---|---|
Checkout the code | Build | Monitor |
Complete the task | Test | Measure |
Validate against the code base | Release | Retrain |
Perform unit testing | | |
Merge the code | | |
DevOps | MLOps |
---|---|
Tests and validates code and components | Tests and validates data, data schemas and models |
Focuses on a single software package service | Considers the whole system: the ML training pipeline |
Deploys code and moves to the next task | Constantly monitors, retrains, and serves the model |
The difference between continuous delivery and continuous deployment is the automation of the deployment step.
Continuous delivery: automates integration or acceptance tests, deployment to staging, and smoke tests. Deployment to production is still done manually.
Continuous deployment: complements continuous integration by automating the configuration and deployment of the application to the production environment.
Represents the pressure to prioritize releases over quality, which might mean not paying close attention to code quality.
Discovery Phase | Development Phase | Deployment Phase |
---|---|---|
Business use case definition | Data pipeline creation and feature engineering | Plan for deployment |
Data exploration | Model building | Model operationalization |
Architecture and algorithm selection | Model evaluation | Model monitoring |
 | Presentation of results | |
TFX Component: an implementation of a machine learning task in your pipeline. Components are designed to be modular and extensible, while incorporating Google's machine learning best practices on tasks such as data partitioning, validation, and transformation.
Component specification: A configuration protocol buffer defines how components communicate with each other via input and output artifact channels and runtime parameters.
Component driver: A driver coordinates job execution.
Component executor: Code to perform an ML workflow step, such as data preprocessing or TensorFlow model training.
Component publisher: Updates ML Metadata store.
Component interface: Packages component specification and executor for use in pipeline.
- Driver reads the component specification for parameters and artifacts and retrieves input artifacts from the metadata store for the component.
- Executor performs computation on artifacts.
- Publisher uses the component specification and executor results to store the component's output artifacts in the metadata store.
It is a sequence of components connected by channels in a directed acyclic graph (DAG) of artifact dependencies.
1. Uses ML Metadata storage.
1.1. It stores the metadata in a relational back end.
1.2. It does not store the actual pipeline artifacts.
2. It is task-aware. Pipelines can be authored in a script or notebook to run manually by the user as tasks.
2.1. A task can be an entire pipeline run or a partial pipeline run of an individual component and its downstream components.
3. It is data-aware. TFX pipelines store all the artifacts from every component over many executions.
NOTE: TFX pipeline uses ML Metadata storage, an open source library to standardize the definition storage and querying of metadata for ML pipelines. ML metadata stores the metadata in a relational back end. It does not store the actual pipeline artifacts.
Coordinate pipeline components. These are primarily shared libraries of utilities and protobufs for defining abstractions that simplify development of TFX pipelines across different computing and orchestration environments.
They take the logical pipeline object, which contains the pipeline's components in a DAG, and are responsible for scheduling components of the TFX pipeline sequentially based on the artifact dependencies.
It defines a common in-memory data representation shared by all TFX libraries and components based on Apache Arrow, a columnar memory format for efficient analytics on CPUs and GPUs.
- It is the entry point to one's pipeline that ingests data.
- It supports inputs like CSV, TFRecords, Avro, and Parquet.
- As outputs, it produces TF Examples or TF SequenceExamples.
ExampleGen brings configurable and reproducible data partitioning and shuffling into TF Records, a common data representation used by all components in your pipeline.
ExampleGen supports external ingestion of CSV, Avro, Parquet, and TF Record data sources. This can be done across sharded file systems using glob file patterns as well.
NOTE: A span is a grouping of training examples.
1. Brings configurable and reproducible data partitioning.
2. Shuffling into TF records, a common data representation used by all components in one's pipeline.
3. Supports external ingestion of CSV, Avro, Parquet, and TFRecord data sources.
4. Supports BigQuery ingestion through configuring SQL queries for each data partition.
5. Leverages Apache Beam for scalable, fault-tolerant data ingestion.
6. Customizable to new input data formats and ingestion methods, which makes it easier to incorporate into one's machine learning project.
7. Supports advanced data management capabilities such as data partitioning, versioning, and custom
splitting on features or time.
The ExampleValidator pipeline component identifies any anomalies in the example data by comparing data statistics computed by the StatisticsGen pipeline component against a schema.
- It can perform validity checks by comparing data set statistics against a schema that codifies expectations of the user.
- It can detect feature train-serving skew by comparing training and serving data.
- It can also detect data drift by looking at a series of feature data across different data splits.
The Transform TFX pipeline component performs feature engineering on the TF Examples data artifact emitted from the ExampleGen component, using the data schema artifact from SchemaGen (or imported from external sources) as well as TensorFlow transformations, typically defined in a pre-processing function.
- It brings consistent feature engineering at training and serving time to benefit your machine learning project.
- Including feature engineering directly into your model graph reduces train-serving skew from differences in feature engineering.
- It is also underpinned by Apache Beam, so you can scale up your feature transformations using distributed compute as your data grows.
- It trains the TensorFlow model.
- It supports TF1 Estimators and native TF2 Keras models via the generic executor.
- The Trainer's component spec also allows you to parameterize your training and evaluation arguments, such as the number of steps.
- It makes extensive use of the Python Keras Tuner API for tuning hyperparameters.
- As inputs, the tuner component takes in the transform data and transform graph artifacts.
- As output, the Tuner component produces a hyperparameter artifact.
- You can modify the trainer configurations to directly ingest the best hyperparameters found from the most recent tuner run.
- Brings the benefits of tight integration with the trainer component to perform hyperparameter tuning in a continuous training pipeline.
- You can also perform distributed tuning by running parallel trials on Google Cloud to significantly speed up your tuning jobs.
- Evaluates how well the model performed during training and tuning.
- It will perform a thorough analysis using the TensorFlow Model Analysis library to compute machine learning metrics across data splits and slices.
- As outputs, the Evaluator component produces two artifacts: an evaluation metrics artifact that contains configurable model performance metric slices, and a "model blessing" artifact that indicates whether the model's performance was higher than the configured thresholds and that it is ready for production.
- Brings standardization to your machine learning projects for easier sharing and reuse.
- Evaluator blesses the model if the new trained model is good enough to be pushed to production. In other words, it assures that your pipeline will only graduate a model to production when it has exceeded the performance of previous models.
- It is used as an early warning layer before pushing a model to production.
- It blesses the model if the model is mechanically serviceable in a production environment.
- As inputs, InfraValidator takes the SavedModel artifact from the Trainer component, launches a sandboxed model server with the model, and tests whether it can be successfully loaded and, optionally, queried using the input data artifact from the ExampleGen component.
- Focuses on the compatibility between the model server binary, such as TensorFlow Serving, and the model to be deployed.
- It is the user's responsibility to configure the environment correctly.
- Only interacts with the model server in the user configured environment to see whether it works well.
- Brings an additional validation check to your TFX pipeline by ensuring that only top-performing models are graduated to production and that they do not have any failure-causing mechanical issues.
- Brings standardization to this model infra check and is configurable to mirror model-serving environments such as Kubernetes clusters and TF Serving.
- It is used to push a validated model to a deployment target during model training or retraining.
- Relies on one or more blessings from other validation components as input to decide whether to push the model.
- As output, a Pusher component will wrap model versioning data with the trained TensorFlow SavedModel for export to various deployment targets.
- The pusher component brings the benefits of a production gatekeeper to your TFX pipeline to ensure that only the best performing models that are mechanically sound make it to production.
- Standardizes the code for pipeline model export for reuse and sharing across machine learning projects while still having the flexibility to be configured for Filesystem and Model Server deployments.
- Used to perform batch inference on unlabeled TF examples.
- It is typically deployed after an evaluator component to perform inference with a validated model, or after the trainer component to directly perform inference on an exported model.
- It currently performs in-memory model inference and remote inference.
- Remote inference requires the model to be hosted on Cloud AI Platform.
- As inputs, BulkInferrer reads the following artifacts: a trained TensorFlow SavedModel from the Trainer component, optionally a model blessing artifact from the Evaluator component, and input data TF Example artifacts from the ExampleGen component.
- As output, BulkInferrer generates an inference result proto, which contains the original features and the prediction results.
- If your machine learning use case only calls for batch inference, BulkInferrer is a great option for your machine learning project to directly include inference in your pipeline.
- Enables you to have task and data-driven batch inference in a continuous training and inference pipeline.
They are special-purpose classes for performing advanced metadata operations, such as importing external artifacts into ML Metadata or performing queries of current ML Metadata based on artifact properties and their history.
Imports (register) an external data object into ML Metadata.
- The primary use case for this node is to bring in external artifacts like a schema into the TFX pipeline for use by the transform and trainer components.
- Instead of regenerating the schema for each pipeline read, you can use the ImporterNode to bring a previously generated or updated schema into your pipeline.
Handles special artifact resolution logistics that will be used as inputs for downstream nodes.
- It is only required if you're performing model validation in addition to evaluation.
- The LatestArtifactResolver returns the latest N artifacts in a given channel. This is useful for comparing multiple run artifacts, such as those generated by the Evaluator component.
- The LatestBlessedModelResolver returns the latest validated (blessed) model.
Pipeline: Entire set of operations being performed, including reading input, applying transformations, writing output, and the execution engine to be used.
PCollection: Represents an unordered set of data, e.g., for input and output.
PTransform: Data processing operation that operates over one or more PCollections. ParDo is the core parallel processing transform.
Runners: Execute and translate pipelines to massively parallel big-data processing systems.
Artifacts: Type definitions of artifacts and their properties.
Executions: Execution records (runs) of components.
Lineage: Data provenance across all executions.
It is a managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.
- Author end-to-end workflows on Google Cloud.
- Integrate with BigQuery, Cloud Storage, AI Platform, etc.
- Secure your workflows across Google Cloud using tools such as Cloud IAM.
- Have your infrastructure fully managed by Google.
- Explore Cloud Composer and Airflow logs through Cloud Operations Logging and Monitoring.
In Airflow, a DAG is defined in a Python script, which represents the DAG structure (tasks and dependencies) as code.
A DAG Run is usually created by the Airflow scheduler, but it can also be created by an external trigger.
The data collection is followed by the imposition of a model (e.g., normality, linearity); the analysis, estimation, and testing that follow are focused on the parameters of that model.
- The data collection is not followed by a model imposition.
- It is followed immediately by analysis with a goal of inferring what model would be appropriate.
- It does not impose deterministic or probabilistic models on the data.
- It allows the data to suggest admissible models that best fit the data.
- The analyst attempts to answer research questions about unknown parameters using probability statements based on prior data.
- They may bring their own domain knowledge and/or expertise to the analysis as new information is obtained.
- The purpose is to determine posterior probabilities based on prior probabilities and new information.
- Posterior probabilities
Mean Squared Error: There is a quadratic penalty for mean squared error, so it is essentially trying to minimize the Euclidean distance between the actual label and the predicted label.
Cross Entropy: The penalty is almost linear when the predicted probability is close to the actual label, but as it gets farther away it becomes exponential when it gets close to predicting the opposite class of the label.
Machine Learning | Standard Statistics (linear/logistic regression) | |
---|---|---|
Data Preparation | Doesn't require explicit commands to find patterns in data | Need to know variables and parameters beforehand |
Hypothesis | No hypothesis needed | Need hypothesis to test |
Type of data? | Multi-dimensional data that can be non-linear in nature | Linear data |
Training? | Needs to be trained | No training |
Goal? | Generally better for predictions | Generally better for inferences /hypothesis testing |
Scientific Question? | What will happen? | How/why does it happen? |
One measure of the quality of the prediction at a single point is simply the signed difference between the prediction and the actual value. This difference is called the error.
- Get the errors for the training sample:
Error = Actual Value - Predicted Value
or Error = Ÿi - Yi
- Compute the squares of the errors in step 1:
(Ÿi - Yi)^2
2.1. Ÿi = actual value
2.2. Yi = predicted value
- Compute the mean of the squared errors:
Mean = 1/n * sum(Error^2)
or Mean = 1/n * sum((Ÿi - Yi)^2)
- Take the square root of the mean to obtain the RMSE:
RMSE = sqrt(1/n * sum(Error^2))
n = number of values
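The steps above can be combined into a small helper (a sketch; the variable names are mine):

```python
def rmse(actual, predicted):
    # Root mean squared error: sqrt(1/n * sum((actual - predicted)^2)).
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5
```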
Problem: RMSE doesn't work as well for classification.
-1/N * sum(y*log(ÿ) + (1-y)*log(1-ÿ))
which is -1/N * sum(Positive Term + Negative Term)
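The binary cross-entropy formula above in plain Python (a sketch; it assumes predicted probabilities strictly between 0 and 1):

```python
from math import log

def binary_cross_entropy(labels, probs):
    # -1/N * sum(y*log(y_hat) + (1-y)*log(1-y_hat))
    n = len(labels)
    return -sum(y * log(p) + (1 - y) * log(1 - p)
                for y, p in zip(labels, probs)) / n
```

Note how the loss grows without bound as a prediction approaches the opposite class of the label.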
Search for the minimum by descending the gradient:
loss = compute_loss(params)
while loss > epsilon:
    direction = compute_direction()  # typically the negative gradient
    for i in range(len(params)):
        params[i] = params[i] + step_size * direction[i]
    loss = compute_loss(params)
In order to converge, an appropriate step size is necessary.
1. Small step sizes (learning rates) can take very long to converge.
2. Large step sizes (learning rates) may never converge to the true minimum. The process is not guaranteed to converge.
3. Finding a correct step size requires trial and error, and one correct step size does not fit all models.
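The effect of the step size can be demonstrated on a toy objective f(x) = x^2, whose gradient is 2x (my own illustration, not from the course):

```python
def gradient_descent(grad, x0, step_size, steps):
    # Repeatedly move against the gradient with a fixed step size.
    x = x0
    for _ in range(steps):
        x -= step_size * grad(x)
    return x

grad = lambda x: 2.0 * x                                # gradient of f(x) = x^2
near_minimum = gradient_descent(grad, 10.0, 0.1, 100)   # converges toward 0
diverged = gradient_descent(grad, 10.0, 1.5, 20)        # overshoots and diverges
```

With step size 0.1 each update shrinks x by a factor of 0.8; with step size 1.5 each update multiplies x by -2, so the iterates oscillate and blow up.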
Ablation Analysis: analysis where the value of an individual feature is computed by comparing the model to a model trained without that feature. An engineer performing this analysis might be concerned about legacy and bundled features.
AlloyDB: A fully managed PostgreSQL-compatible database service for your most demanding enterprise workloads. It combines the best of Google with PostgreSQL, for superior performance, scale, and availability.
AlloyDB AI: automates embedding generation to easily transform operational data into vector embeddings.
Artifact Lineage: it describes all the factors that resulted in an artifact, such as the training data or hyperparameters used for model training. One can understand differences in performance or accuracy over several pipeline runs.
Average Pooling: instead of calculating a max value like max pooling, average pooling calculates the average value for each block on the feature map.
Bundled Features: are features that were added as part of a bundle, which collectively are valuable but individually may not be.
Cardinality: refers to the number of values in a set.
Changes in the Distribution: The statistical term for changes in the likelihood of observed values like model inputs.
Cold Start: when your model is not updating to new users, new products, and new patterns in user preference. Because the model only knows about your older products, it continues to recommend them long after they’ve fallen out of favor. Ultimately, users simply ignore the recommendations altogether and make do with the site’s search functionality.
Composability: the ability to compose a bunch of microservices together and the option to use what makes sense for your problem.
Compulation:
Concept Drift: occurs when there is a change in the relationship between the input feature and the label, or target. It can occur due to shifts in the feature space and/or the decision boundary, so we need to be aware of these during production.
Container: it is an abstraction that packages applications and libraries together so that the applications can run on a greater variety of hardware and operating systems. This ultimately makes hosting large applications better.
Convolution: is the mathematical combination of two functions to produce a third function.
DAG Run: a physical instance of a DAG, containing task instances that run for a specific execution_date.
DAG Runner: it refers to an implementation that supports an orchestration.
Data Leakage: it is when the label is somehow leaking into the training data.
Directed Acyclic Graphs (DAGs): A DAG is a collection of all tasks you want to run, organized in a way that reflects their relationships and dependencies.
Distributed Training: it is running training in parallel on many devices such as CPUs or GPUs or TPUs in order to make your training faster.
Drift: Drift occurs when the statistical properties of the inputs and the target which the model is trying to predict change over time in unforeseen ways. In other words, it is the change in an entity with respect to its baseline.
Drift Detection: how significantly service requests are evolving over time.
Data Dredging: It is the statistical manipulation of data in order to find patterns which can be presented as statistically significant, when in reality there is no underlying effect.
Data Parallelism: it is a common architecture for distributed training where you run the same model and computation on every device. But train each of them using different training samples. Each device computes loss and gradients based on training samples it sees. Then we update the models' parameters using these gradients. The updated model is then used in the next round of computation.
Embedding: is a map from our collection of items to some finite dimensional vector space. They are commonly used to represent input features in machine learning problems.
Extrapolation: means to generalize outside the bounds of what we’ve previously seen.
Explicit feedback: the user is intentionally explicitly leaving feedback for that item.
Feature cross: It is a process of combining features into a single feature. It enables a model to learn separate weights for each combination of features. A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
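For instance, crossing two categorical features yields a synthetic vocabulary of all value combinations (a sketch with made-up feature values):

```python
from itertools import product

def feature_cross(day_values, period_values):
    # Every combination of the two features becomes one crossed value,
    # letting a linear model learn a separate weight for each pair.
    return [f"{d}_x_{p}" for d, p in product(day_values, period_values)]
```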
Implicit feedback: It is not intentionally given as a means of rating the item the user has interacted with. However, there was some type of interaction, and from that we can infer whether the user had a positive or a negative experience. This could be whether someone viewed a video, how long they watched a video, if a user spent a lot of time on a page, or if they clicked certain areas or buttons on the page, etc.
Interpolation: is the opposite of extrapolation. It means to generalize within the bounds of what we’ve previously seen.
Latent Factors: Compress the data to find the best generalities to rely on.
Latent Feature: It is a feature that we are not directly observing or defining but are instead inferring through our model from the other variables that are directly observed.
Legacy Features: are older features that were added because they were valuable at the time. They have become redundant because of the implementation of new features, without our knowledge.
Loss Function: it takes the quality of predictions for a group of data points from our training set and composes them into a single number with which to estimate the quality of the model's current parameters.
Machine Learning Metadata: Data about data, but not the data itself. Who triggered the pipeline run? What hyperparameters were used for training? Where is the model file stored? When was the model pushed to production? Why was model A preferred over model B? How was the training environment configured?
Max Pooling: an operation, like a convolution, that returns the maximum value out of all the input data values passed to a kernel.
Model Definition:
Model Staleness: Data that you used to train the model in the research or production environment does not represent the data that you actually have in your live system.
Model Residuals: it is the difference between its predictions and the labels.
Occam's razor: When presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions.
Peakedness: is the degree to which data values are concentrated around the mean in a data distribution, or in this case, how concentrated the distribution of the prediction workload is. You can also think of it as inverse entropy.
PKL: it is a standard method of serializing objects in Python.
Portability: Because it is necessary to configure the stack over and over again, and production serving is not done on your laptop, portability is needed.
Root Mean Squared Error (RMSE): RMSE measures the difference between the predictions of a model and the observed values. A large RMSE is equivalent to a large average error, so smaller values of RMSE are better.
1. RMSE is a useful way to see how well a model is able to fit a dataset.
2. The larger the RMSE, the larger the difference between the predicted and observed values.
This means the worse a model fits the data.
3. Conversely, the smaller the RMSE, the better a model is able to fit the data.
4. One nice property of RMSE is that the error is given in the units being measured,
so you can tell very directly how incorrect the model might be on the unseen data.
Training-serving skew: is a difference between model performance during training and performance during serving. This skew can be caused by:
- A discrepancy between how you handle data in the training and serving pipelines.
- A change in the data between when you train and when you serve.
- A feedback loop between your model and your algorithm.