# **Explore the Use Case and Analyze the Datasets**

## **Introduction**

![](2023-12-28-12-04-36.png)

![](2023-12-28-12-27-43.png)

![](2023-12-28-12-28-15.png)

![](2023-12-28-12-29-10.png)

![](2023-12-28-12-29-52.png)

**Scaling Out:** Distributed model training in parallel across multiple instances

![](2023-12-28-12-32-20.png)

**Note:** Looking at the toolbox for this step, you will use Amazon Simple Storage Service (Amazon S3) and Amazon Athena to ingest, store, and query your data. With AWS Glue, you will catalog the data and its schema. For statistical bias detection in data, you will learn how to work with Amazon SageMaker Data Wrangler and Amazon SageMaker Clarify.

![](2023-12-28-12-35-09.png)

![](2023-12-28-12-41-50.png)

![](2023-12-28-12-42-37.png)

![](2023-12-28-12-46-31.png)

![](2023-12-28-12-48-13.png)

![](2023-12-28-12-49-00.png)

**Note:** Multi-class classification is a supervised learning task, so you need to provide your text classifier model with examples from which it can learn to classify the products and the product reviews into the right sentiment classes.

![](2023-12-28-12-52-24.png)

## **Working with Data**

### **Data Ingestion and Exploration**

Imagine your e-commerce company is collecting customer feedback across all online channels. You need to capture customer feedback streaming from social media channels, feedback captured and transcribed through support center calls, incoming emails, mobile apps, website data, and much more. To do that, you need a flexible and elastic repository that can store not only different file formats for structured data, such as CSV files, but also unstructured data, such as support center call audio files.

![](2023-12-28-13-03-52.png)

You can ingest data in its raw format without any prior data transformation, whether it's structured relational data in the form of CSV or TSV files, semi-structured data such as JSON or XML files, or unstructured data such as images, audio, and media files. You can also ingest streaming data, such as an application delivering a continuous feed of log files, or feeds from social media channels, into your data lake.

A data lake needs to be governed. With new data arriving at any point in time, you need to implement ways to discover and catalog the new data. You also need to secure and control access to the data to comply with the applicable data security, privacy, and governance regulations. With this governance in place, you can give data science and machine learning teams access to large and diverse datasets.

![](2023-12-28-13-12-41.png)

Data lakes are often built on top of object storage, such as Amazon S3. You're probably familiar with file and block storage. File storage stores and manages data as individual files organized in hierarchical folder structures. In contrast, block storage stores and manages data as individual chunks called blocks. Each block receives a unique identifier, but no additional metadata is stored with that block. With object storage, data is stored and managed as objects, each of which consists of the data itself, any relevant metadata (such as when the object was last modified), and a unique identifier. Object storage is particularly helpful for storing and retrieving growing amounts of data of any type, which makes it a good foundation for data lakes. Amazon S3 gives you access to durable and highly available object storage in the cloud.
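To make the object model described above concrete, here is a minimal in-memory sketch in which every object carries its data, some metadata, and a unique key. The `ToyObjectStore` class, the keys, and the metadata fields are made up for illustration; this is not the Amazon S3 API.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StoredObject:
    key: str                     # unique identifier of the object
    data: bytes                  # the payload itself, of any type
    metadata: dict = field(default_factory=dict)


class ToyObjectStore:
    """Hypothetical, in-memory stand-in for an object store."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **user_metadata):
        # System metadata is recorded next to the data, not in a file system.
        metadata = {
            "last_modified": datetime.now(timezone.utc).isoformat(),
            "size_bytes": len(data),
            "checksum_sha256": hashlib.sha256(data).hexdigest(),
            **user_metadata,
        }
        self._objects[key] = StoredObject(key, data, metadata)
        return self._objects[key]

    def get(self, key):
        return self._objects[key]


store = ToyObjectStore()
# Structured and unstructured data live side by side in their raw format.
store.put("raw/reviews/reviews.csv", b"review_id,sentiment\n1,1\n", content_type="text/csv")
store.put("raw/calls/call-0001.wav", b"\x00\x01\x02", content_type="audio/wav")

obj = store.get("raw/calls/call-0001.wav")
print(obj.key, obj.metadata["content_type"], obj.metadata["size_bytes"])
```

The slash-separated keys only look like folders; in this sketch, as in object storage generally, the key is a flat identifier.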

![](2023-12-28-13-14-01.png)

![](2023-12-28-13-15-26.png)

![](2023-12-28-13-16-09.png)

![](2023-12-28-13-19-27.png)

![](2023-12-28-13-20-10.png)

![](2023-12-28-13-22-49.png)

To do that, import the AWS Data Wrangler Python library (`awswrangler`) as shown here, and then call the `catalog.create_database` function, providing a name for the database to create. AWS Data Wrangler also offers a convenience function called `catalog.create_csv_table` that you can use to register the CSV data with the AWS Glue Data Catalog. The function stores only the schema and the metadata in the AWS Glue Data Catalog table that you specify. The actual data remains in your S3 bucket.
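The registration step described above might look roughly like the following sketch. The database name, table name, bucket path, and column names are hypothetical placeholders, and the AWS calls are wrapped in a function that is not invoked here because they require credentials and an existing S3 bucket.

```python
def build_csv_table_request(database, table, s3_path):
    """Collect the keyword arguments for wr.catalog.create_csv_table.

    The column names and types below are assumptions for illustration;
    use the schema of your own CSV files.
    """
    return {
        "database": database,
        "table": table,
        "path": s3_path,
        "columns_types": {
            "review_id": "string",
            "product_category": "string",
            "sentiment": "int",
            "review_body": "string",
        },
        "sep": ",",
        "skip_header_line_count": 1,  # the first row of each file is a header
    }


def register_reviews(database, table, s3_path):
    """Create the Glue database and register the CSV table (needs AWS access)."""
    import awswrangler as wr  # imported lazily so the sketch loads without it

    # Raises an error if a database with this name already exists.
    wr.catalog.create_database(name=database)
    # Only schema and metadata go to the Glue Data Catalog; data stays in S3.
    wr.catalog.create_csv_table(**build_csv_table_request(database, table, s3_path))


request = build_csv_table_request(
    "dsoaws_demo", "reviews", "s3://example-bucket/data/reviews/"
)
print(request["path"])
```

Calling `register_reviews("dsoaws_demo", "reviews", "s3://example-bucket/data/reviews/")` from an environment with suitable AWS permissions would perform the actual registration.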

![](2023-12-28-13-29-59.png)

![](2023-12-28-13-31-06.png)

Athena is an interactive query service that lets you run standard SQL queries to explore your data. Athena is serverless, which means you don't need to set up any infrastructure to run those queries. No matter how large the data is that you want to query, you can simply type your SQL query, referencing the dataset schema you provided in the AWS Glue Data Catalog. No data is loaded or moved. Here is a sample SQL query.

Again, this database and table contain only the metadata of your data. The data still resides in S3. When you run this Python command, AWS Data Wrangler sends the SQL query to Amazon Athena. Athena then runs the query on the specified dataset and stores the results in S3. It also returns the results in a Pandas DataFrame, as specified in the command shown here.
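As a rough sketch of the query step described above, the snippet below builds a SQL string and hands it to `wr.athena.read_sql_query`. The table name, column names, and database name are hypothetical; the Athena call sits in a function that is not invoked here because it needs AWS credentials and a registered table.

```python
def top_categories_sql(table, limit=10):
    """Build a sample aggregation query (the column names are assumptions)."""
    return (
        "SELECT product_category, COUNT(*) AS review_count "
        f"FROM {table} "
        "GROUP BY product_category "
        "ORDER BY review_count DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql, database):
    """Send the query to Amazon Athena and return a Pandas DataFrame."""
    import awswrangler as wr  # imported lazily so the sketch loads without it

    # Athena reads the data in place from S3, using the schema
    # registered in the AWS Glue Data Catalog.
    return wr.athena.read_sql_query(sql=sql, database=database)


sql = top_categories_sql("reviews", limit=5)
print(sql)
```

With AWS access in place, `df = run_query(sql, database="dsoaws_demo")` would return the result set as a DataFrame ready for exploration.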

![](2023-12-28-13-39-24.png)

Athena is based on Presto, an open source distributed SQL engine, developed for this exact use case, running interactive queries against data sources of all sizes. And remember, no installation or infrastructure setup is needed, and no data movement is required. Just register your data with AWS Glue and use Amazon Athena to explore your datasets from the comfort of your Python environment. 

### **Data Visualization**

![](2023-12-28-13-50-30.png)

![](2023-12-28-17-02-00.png)

![](2023-12-28-16-59-10.png)

![](2023-12-28-16-59-56.png)

![](2023-12-28-17-01-24.png)

![](2023-12-28-17-04-30.png)

![](2023-12-28-17-17-10.png)

![](2023-12-28-17-13-31.png)

![](2023-12-28-17-14-08.png)

![](2023-12-28-17-14-49.png)

![](2023-12-28-17-19-52.png)

## **Practice**

![](2023-12-04-19-50-28.png)

![](2023-12-04-19-55-08.png)

![](2023-12-04-19-55-45.png)

![](2023-12-04-19-57-06.png)

![](2023-12-04-19-58-08.png)

![](2023-12-04-20-00-18.png)

![](2023-12-04-19-59-50.png)

![](2023-12-04-20-01-23.png)

![](2023-12-04-20-01-45.png)

![](2023-12-04-20-02-07.png)

![](2023-12-04-20-02-37.png)

![](2023-12-04-20-03-29.png)

![](2023-12-04-20-04-12.png)

![](2023-12-04-20-04-37.png)


**WHAT IS IAM?**

IAM is a web service that enables you to manage access to your AWS account and resources. It also provides a centralized view of who and what are allowed inside your AWS account (authentication), and who and what have permissions to use and work with your AWS resources (authorization).

With IAM, you can share access to an AWS account and resources without having to share your set of access keys or password. You can also provide granular access to those working in your account, so that people and services only have permissions to the resources they need. For example, to provide a user of your AWS account with read-only access to a particular AWS service, you can granularly select which actions and which resources in that service they can access.

**GET TO KNOW THE IAM FEATURES**

To help control access and manage identities within your AWS account, IAM offers many features to ensure security.

- IAM is global and not specific to any one Region. This means you can see and use your IAM configurations from any Region in the AWS Management Console.
- IAM is integrated with many AWS services by default.
- You can establish password policies in IAM to specify complexity requirements and mandatory rotation periods for users.
- IAM supports MFA.
- IAM supports identity federation, which allows users who already have passwords elsewhere (for example, in your corporate network or with an internet identity provider) to get temporary access to your AWS account.
- Any AWS customer can use IAM; the service is offered at no additional charge.

**WHAT IS AN IAM USER?**

An IAM user represents a person or service that interacts with AWS. You define the user within your AWS account, and any activity done by that user is billed to your account. Once you create a user, that user can sign in to gain access to the AWS resources inside your account.

You can also add more users to your account as needed. For example, for your cat photo application, you could create individual users in your AWS account that correspond to the people who are working on your application. Each person should have their own login credentials. Providing users with their own login credentials prevents sharing of credentials.

**IAM USER CREDENTIALS**

An IAM user consists of a name and a set of credentials. When creating a user, you can choose to provide the user with:

- Access to the AWS Management Console
- Programmatic access to the AWS Command Line Interface (AWS CLI) and AWS Application Programming Interface (AWS API)

To access the AWS Management Console, provide the users with a user name and password. For programmatic access, AWS generates a set of access keys that can be used with the AWS CLI and AWS API. IAM user credentials are considered permanent, in that they stay with the user until there’s a forced rotation by admins.

When you create an IAM user, you have the option to grant permissions directly at the user level. This can seem like a good idea if you have only one or a few users. However, as the number of users helping you build your solutions on AWS increases, it becomes more complicated to keep up with permissions. For example, if you have 3,000 users in your AWS account, administering access becomes challenging, and it’s impossible to get a top-level view of who can perform what actions on which resources.

If only there were a way to group IAM users and attach permissions at the group level instead. Guess what: There is!

**WHAT IS AN IAM GROUP?**

An IAM group is a collection of users. All users in the group inherit the permissions assigned to the group. This makes it easy to give permissions to multiple users at once. It’s a more convenient and scalable way of managing permissions for users in your AWS account. This is why using IAM groups is a best practice.

If you have an application that you’re trying to build and have multiple users in one account working on the application, you might decide to organize these users by job function. You might want IAM groups organized by developers, security, and admins. You would then place all of your IAM users in the respective group for their job function.

This provides a better view of who has what permissions within your organization and an easier way to scale as new people join, leave, and change roles in your organization. Consider the following examples.

- A new developer joins your AWS account to help with your application. You simply create a new user and add them to the developer group, without having to think about which permissions they need.
- A developer changes jobs and becomes a security engineer. Instead of editing the user’s permissions directly, you can remove them from the old group and add them to the new group that already has the correct level of access.

Keep in mind the following features of groups.

- Groups can have many users.
- Users can belong to many groups.
- Groups cannot belong to groups.

The root user can perform all actions on all resources inside an AWS account by default. This is in contrast to creating new IAM users, new groups, or new roles. New IAM identities can perform no actions inside your AWS account by default until you explicitly grant them permission.

The way you grant permissions in IAM is by using IAM policies.

**WHAT IS AN IAM POLICY?**

To manage access and provide permissions to AWS services and resources, you create IAM policies and attach them to IAM users, groups, and roles. Whenever a user or role makes a request, AWS evaluates the policies associated with them. For example, if you have a developer inside the developers group who makes a request to an AWS service, AWS evaluates any policies attached to the developers group and any policies attached to the developer user to determine if the request should be allowed or denied.
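The evaluation idea described above can be sketched as a toy model: gather the policies attached to the user and to the user's groups, deny by default, and let an explicit `Deny` override any `Allow`. This is a deliberate simplification (real IAM evaluation also considers conditions, resource-based policies, permissions boundaries, and more), and the bucket name and policies are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Policy elements may be a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def statement_matches(statement, action, resource):
    # Action names are compared case-insensitively; "*" acts as a wildcard.
    action_ok = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_ok = any(
        fnmatchcase(resource, pattern)
        for pattern in _as_list(statement["Resource"])
    )
    return action_ok and resource_ok


def is_allowed(policies, action, resource):
    """Default deny; an explicit Deny wins over any Allow."""
    allowed = False
    for policy in policies:
        for statement in policy["Statement"]:
            if not statement_matches(statement, action, resource):
                continue
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed


# Hypothetical policy attached to the developers group.
developers_group_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
}
# Hypothetical policy attached directly to one developer.
developer_user_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "s3:*",
        "Resource": "arn:aws:s3:::example-bucket/private/*",
    }],
}

attached = [developers_group_policy, developer_user_policy]
print(is_allowed(attached, "s3:GetObject", "arn:aws:s3:::example-bucket/reviews.csv"))
print(is_allowed(attached, "s3:GetObject", "arn:aws:s3:::example-bucket/private/pay.csv"))
print(is_allowed(attached, "s3:PutObject", "arn:aws:s3:::example-bucket/reviews.csv"))
```

The three calls cover an allowed read, a read blocked by the explicit deny, and a write that is refused because no statement grants it.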

**IAM POLICY EXAMPLES**

Most policies are stored in AWS as JSON documents with several policy elements. Take a look at the following example of what providing admin access through an IAM identity-based policy looks like.

    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }]
    }

In this policy, there are four major JSON elements: Version, Effect, Action, and Resource.

- The Version element defines the version of the policy language. It specifies the language syntax rules that are needed by AWS to process a policy. To use all the available policy features, include "Version": "2012-10-17" before the "Statement" element in all your policies.
- The Effect element specifies whether the statement will allow or deny access. In this policy, the Effect is "Allow", which means you’re providing access to a particular resource.
- The Action element describes the type of action that should be allowed or denied. In the above policy, the action is "*". This is called a wildcard, and it is used to symbolize every action inside your AWS account.
- The Resource element specifies the object or objects that the policy statement covers. In the policy example above, the resource is also the wildcard "*". This represents all resources inside your AWS account.

Putting all this information together, you have a policy that allows you to perform all actions on all resources inside your AWS account. This is what we refer to as an administrator policy.

Let’s look at another example of a more granular IAM policy.

    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "iam:ChangePassword",
                "iam:GetUser"
            ],
            "Resource": "arn:aws:iam::123456789012:user/${aws:username}"
        }]
    }

After looking at the JSON, you can see that this policy allows the IAM user to change their own IAM password (iam:ChangePassword) and get information about their own user (iam:GetUser). It only permits them to access their own credentials because the resource restricts access with the variable substitution ${aws:username}.
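The effect of the `${aws:username}` variable can be illustrated with a small sketch that substitutes the caller's user name into the resource ARN and compares the result with the ARN being accessed. The helper functions and the user names `alice` and `bob` are hypothetical, and this is not how AWS implements policy variables internally.

```python
import json

# The granular policy from above, written as a Python dictionary.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["iam:ChangePassword", "iam:GetUser"],
        "Resource": "arn:aws:iam::123456789012:user/${aws:username}",
    }],
}


def resolve_resource(template, username):
    """Replace the policy variable with the name of the calling user."""
    return template.replace("${aws:username}", username)


def can_access(policy, caller, action, target_arn):
    """Simplified check: the action is listed and the ARN is the caller's own."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        if action not in statement["Action"]:
            continue
        if resolve_resource(statement["Resource"], caller) == target_arn:
            return True
    return False


alice_arn = "arn:aws:iam::123456789012:user/alice"
bob_arn = "arn:aws:iam::123456789012:user/bob"

print(can_access(policy, "alice", "iam:GetUser", alice_arn))  # own user
print(can_access(policy, "alice", "iam:GetUser", bob_arn))    # someone else
print(json.dumps(policy, indent=4))  # serializes to a valid JSON document
```

Serializing the dictionary with `json.dumps` is also a convenient way to catch syntax slips, such as a missing comma, before a policy is pasted into the console.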

![](2023-12-04-20-51-42.png)

![](2023-12-04-20-55-34.png)

![](2023-12-04-20-56-29.png)

![](2023-12-04-22-26-16.png)

![](2023-12-04-22-37-41.png)

![](2023-12-04-22-26-47.png)

![](2023-12-04-22-38-03.png)

![](2023-12-04-22-38-25.png)

![](2023-12-04-22-38-53.png)

![](2023-12-04-22-39-07.png)

![](2023-12-04-22-58-13.png)

![](2023-12-04-22-58-44.png)

![](2023-12-04-22-59-05.png)

![](2023-12-04-22-59-46.png)

![](2023-12-04-23-00-02.png)

![](2023-12-04-23-00-22.png)

![](2023-12-04-23-00-39.png)

![](2023-12-04-23-01-12.png)

![](2023-12-04-23-01-24.png)

![](2023-12-04-23-01-40.png)

![](2023-12-04-23-01-53.png)

![](2023-12-04-23-02-14.png)

![](2023-12-04-23-02-31.png)

![](2023-12-04-23-03-24.png)

![](2023-12-04-23-03-45.png)

![](2023-12-04-23-04-02.png)

![](2023-12-04-23-04-21.png)

## **Exercise and Assessment**

### **Lab**

![](2023-12-05-09-27-47.png)

![](2023-12-05-09-30-31.png)

![](2023-12-05-09-31-12.png)

![](2023-12-05-09-31-37.png)

![](2023-12-05-09-33-19.png)

![](2023-12-05-09-33-47.png)

![](2023-12-05-09-34-38.png)

![](2023-12-05-09-35-29.png)

![](2023-12-05-09-36-01.png)

![](2023-12-05-09-36-25.png)

![](2023-12-05-09-36-51.png)

![](2023-12-05-09-38-16.png)

![](2023-12-05-09-38-47.png)

### **Demo AWS IAM**

![](2023-12-05-12-41-20.png)

![](2023-12-05-12-41-40.png)

**Roles provide temporary credentials that are used to make calls to AWS APIs.**

![](2023-12-05-12-44-04.png)

![](2023-12-05-12-44-21.png)

![](2023-12-05-14-11-16.png)

![](2023-12-05-14-15-53.png)

**For example, `s3-object-lambda:*` means that all API actions of that service are allowed.**

![](2023-12-05-14-18-33.png)

![](2023-12-05-14-18-56.png)

![](2023-12-05-14-19-42.png)

![](2023-12-05-14-20-26.png)

![](2023-12-05-14-20-52.png)

![](2023-12-05-14-21-06.png)

![](2023-12-05-14-21-20.png)

![](2023-12-05-14-21-46.png)

![](2023-12-05-14-22-04.png)

![](2023-12-05-14-22-24.png)

![](2023-12-05-14-22-42.png)

![](2023-12-05-14-22-54.png)

![](2023-12-05-14-23-18.png)

![](2023-12-05-14-23-42.png)

![](2023-12-05-14-24-06.png)

![](2023-12-05-14-24-26.png)

![](2023-12-05-14-24-54.png)

![](2023-12-05-14-25-15.png)

![](2023-12-05-14-25-37.png)

![](2023-12-05-14-26-11.png)

![](2023-12-05-14-26-28.png)

![](2023-12-05-14-26-46.png)

![](2023-12-05-14-27-06.png)

![](2023-12-05-14-27-25.png)

![](2023-12-05-14-27-49.png)

![](2023-12-05-14-28-09.png)

![](2023-12-05-14-28-27.png)

![](2023-12-05-14-28-55.png)

![](2023-12-05-14-29-11.png)

![](2023-12-05-14-29-29.png)

![](2023-12-05-14-29-42.png)

![](2023-12-05-14-29-56.png)

![](2023-12-05-14-30-15.png)

![](2023-12-05-14-30-30.png)

![](2023-12-05-14-30-46.png)

![](2023-12-05-14-31-17.png)

![](2023-12-05-14-31-38.png)

![](2023-12-05-14-31-51.png)

![](2023-12-05-14-32-26.png)

![](2023-12-05-14-32-54.png)

![](2023-12-05-14-33-19.png)

![](2023-12-05-14-33-40.png)

![](2023-12-05-14-34-07.png)

### **Hosting the Employee Directory Application on AWS**

![](2023-12-05-16-17-45.png)

![](2023-12-05-16-18-45.png)

![](2023-12-05-16-18-57.png)

![](2023-12-05-16-19-27.png)

![](2023-12-05-16-20-17.png)

![](2023-12-05-16-20-47.png)

![](2023-12-05-16-21-06.png)

![](2023-12-05-16-21-38.png)

![](2023-12-05-16-22-11.png)

![](2023-12-05-16-22-29.png)

![](2023-12-05-16-22-50.png)

![](2023-12-05-16-35-49.png)

![](2023-12-05-16-36-15.png)

![](2023-12-05-16-40-30.png)

![](2023-12-05-16-41-04.png)

**This is an instance-level firewall that will let HTTP and HTTPS traffic in.**

![](2023-12-05-16-45-58.png)

![](k.png)

![](2023-12-05-16-47-32.png)

**User Data** is a script that is going to run when the instance boots up. 

![](2023-12-05-16-48-47.png)

![](2023-12-05-16-49-03.png)

![](2023-12-05-16-49-19.png)

![](2023-12-05-16-50-06.png)

![](2023-12-05-16-50-22.png)

![](2023-12-05-16-50-38.png)

![](2023-12-05-16-51-43.png)

**Paste it into a new browser tab ...**

![](2023-12-05-16-52-29.png)

As of March 15, 2023, the default Amazon Machine Image (AMI) for Amazon EC2 has been updated to the Amazon Linux 2023 AMI. In the demonstrations for this course, we use the Amazon Linux 2 AMI. If you are following along with the videos, please be aware that if you use the new Amazon Linux 2023 AMI with the user data as it appears in the videos, the script will not run properly and the application will not launch. We are in the process of updating the course to reflect this change.

In the meantime, there are two ways to work around this issue. You can either use the Amazon Linux 2 AMI with the user data as shown in the demonstrations, or you can use the Amazon Linux 2023 AMI with the updated version of the user data script included below.

To recap, there is a new default AMI for EC2 instances called the Amazon Linux 2023 AMI, while the videos use Amazon Linux 2. Because of changes between these two AMIs, the user data script shown in the videos will not run properly on Amazon Linux 2023 based instances. You can either choose Amazon Linux 2 as the AMI when launching the instance and use the original user data script, or choose the Amazon Linux 2023 AMI and use the updated user data script.

**Note:** When using the user data scripts, remember to replace `<INSERT REGION HERE>` with the AWS Region you are operating in, and make sure you remove both angle brackets as well.

**Amazon Linux 2 user data script:**

          #!/bin/bash -ex
          wget https://aws-tc-largeobjects.s3-us-west-2.amazonaws.com/DEV-AWS-MO-GCNv2/FlaskApp.zip
          unzip FlaskApp.zip
          cd FlaskApp/
          yum -y install python3 mysql
          pip3 install -r requirements.txt
          amazon-linux-extras install epel
          yum -y install stress
          export PHOTOS_BUCKET=${SUB_PHOTOS_BUCKET}
          export AWS_DEFAULT_REGION=<INSERT REGION HERE>
          export DYNAMO_MODE=on
          FLASK_APP=application.py /usr/local/bin/flask run --host=0.0.0.0 --port=80



**Amazon Linux 2023 user data script:** 

          #!/bin/bash -ex
          wget https://aws-tc-largeobjects.s3-us-west-2.amazonaws.com/DEV-AWS-MO-GCNv2/FlaskApp.zip
          unzip FlaskApp.zip
          cd FlaskApp/
          yum -y install python3-pip
          pip install -r requirements.txt
          yum -y install stress
          export PHOTOS_BUCKET=${SUB_PHOTOS_BUCKET}
          export AWS_DEFAULT_REGION=<INSERT REGION HERE>
          export DYNAMO_MODE=on
          FLASK_APP=application.py /usr/local/bin/flask run --host=0.0.0.0 --port=80
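
Forgetting to swap out the Region placeholder, or leaving an angle bracket behind, is an easy mistake to make when pasting the user data scripts above. The following sketch fills in the placeholder programmatically; the `render_user_data` helper and the Region shape check are assumptions for illustration, and the template is shortened to the relevant lines.

```python
import re

PLACEHOLDER = "<INSERT REGION HERE>"

# Shortened excerpt of the user data script; the full script works the same way.
TEMPLATE = """#!/bin/bash -ex
export AWS_DEFAULT_REGION=<INSERT REGION HERE>
export DYNAMO_MODE=on
"""


def render_user_data(template, region):
    """Return the script with the Region placeholder filled in."""
    # Rough shape check only (e.g. "us-west-2"); it does not verify that
    # the Region actually exists.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        raise ValueError(f"unexpected Region format: {region!r}")
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the Region placeholder")
    return template.replace(PLACEHOLDER, region)


rendered = render_user_data(TEMPLATE, "us-west-2")
print(rendered)
```

The rendered text can then be pasted into the User Data field when launching the instance.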