

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

In [1]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Access the environment variables
bucket = os.getenv("BUCKET")
job_data_access_role = os.getenv("JOB_DATA_ACCESS_ROLE")
source_bucket = os.getenv("SOURCE_BUCKET")
source_prefix = os.getenv("SOURCE_PREFIX")
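
Because `os.getenv` returns `None` for a variable that is not set, a missing entry in the `.env` file would only surface later, when an S3 call fails. A minimal sketch of an early check follows. It assumes all four variables are required, and the helper name `find_missing_vars` is hypothetical and not part of the original notebook.

```python
import os

# Names taken from the cell above; treating all four as required is an assumption.
REQUIRED_VARS = ("BUCKET", "JOB_DATA_ACCESS_ROLE", "SOURCE_BUCKET", "SOURCE_PREFIX")


def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report any gaps before the rest of the notebook runs
missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment as an argument keeps the helper easy to check with a plain dictionary instead of the real process environment.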

## 1. Viewing the video files
([Go to top](#Capstone-8:-Bringing-It-All-Together))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [2]:
!aws s3 ls s3://{source_bucket}/{source_prefix}

2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4
2021-04-26 20:10:02   39576695 Mod02_Intro.mp4
2021-04-26 20:31:23  302994828 Mod02_Sect01.mp4
2021-04-26 20:17:33  416563881 Mod02_Sect02.mp4
2021-04-26 20:17:33  318685583 Mod02_Sect03.mp4
2021-04-26 20:17:33  255877251 Mod02_Sect04.mp4
2021-04-26 20:23:51   99988046 Mod02_Sect05.mp4
2021-04-26 20:24:54   50700224 Mod02_WrapUp.mp4
2021-04-26 20:26:27   60627667 Mod03_Intro.mp4
2021-04-26 20:26:28  272229844 Mod03_Sect01.mp4
2021-04-26 20:27:06  309127124 Mod03_Sect02_part1.mp4
2021-04-26 20:27:06  195635527 Mod03_Sect02_part2.mp4
2021-04-26 20:28:03  123924818 Mod03_Sect02_part3.mp4
2021-04-26 20:31:28  171681915 Mod03_Sect03_part1.mp4
2021-04-26 20:32:07  285200083 Mod03_Sect03_part2.mp4
2021-04-26 20:33:17  105470345 Mod03_Sect03_part3.mp4
2021-04-26 20:35:10  157185651 Mod03_Sect04_part1.mp4
2021-04-26 20:36:27  187435635 Mod03_Sect04_part2.mp4
2021-04-26 20:36:40  280720369 Mod03_Sect04_part3.mp4
2021-04-26 20:40:01  443479
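
Some file names in the listing contain spaces (for example, `Mod01_Course Overview.mp4`), so splitting each line on every space would break the name apart. The sketch below shows one way to parse such lines, assuming they follow the date, time, size, name layout shown in the listing. `parse_s3_ls_line` is a hypothetical helper, and it does not handle the `PRE` prefix lines that `aws s3 ls` prints for folders.

```python
def parse_s3_ls_line(line):
    """Split one `aws s3 ls` object line into (date, time, size_in_bytes, name)."""
    # Split on whitespace at most three times so names containing spaces stay intact
    date, time, size, name = line.strip().split(None, 3)
    return date, time, int(size), name


# Two lines copied from the listing above
sample_lines = [
    "2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4",
    "2021-04-26 20:10:02   39576695 Mod02_Intro.mp4",
]

records = [parse_s3_ls_line(line) for line in sample_lines]
total_bytes = sum(size for _, _, size, _ in records)
print(f"{len(records)} files, {total_bytes / 1e6:.1f} MB")
```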

## 2. Transcribing the videos
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to implement your solution to transcribe the videos. 
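
One small part of a transcription solution is deciding where the audio extracted from each video will be stored. The sketch below assumes the audio files go in a local `audios` directory and are named after the source video. `audio_path_for` is a hypothetical helper, not part of the original notebook.

```python
import posixpath


def audio_path_for(video_key, directory="audios"):
    """Map an S3 video key to a local .wav path, keeping the base file name."""
    # S3 keys use "/" separators regardless of the local operating system
    file_name = posixpath.basename(video_key)
    # splitext removes only the final extension, leaving the rest of the name intact
    stem, _ = posixpath.splitext(file_name)
    return f"{directory}/{stem}.wav"


print(audio_path_for("CUR-TF-200-ACMNLP-1/video/Mod02_Intro.mp4"))
```

Using `splitext` rather than a string replacement changes only the trailing extension, even if `.mp4` also appears elsewhere in the key.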

In [3]:
!pip install --upgrade pip
!pip install moviepy torch transformers
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate
!pip install nltk
!pip install spacy
!pip install python-dotenv
!python -m spacy download en_core_web_sm

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-cu667638
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-cu667638
  Resolved https://github.com/huggingface/transformers.git to commit b109257f4fb8b1166e7c53cc5418632014ed53a5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 39.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the

In [4]:
import os
import pandas as pd
import boto3
from moviepy.editor import VideoFileClip
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

# Directory to save audios
directory = "audios"

# Initialize the S3 client
s3 = boto3.client("s3")


# Convert videos to wav audio files
def process_video(video_key):
    try:
        # Create the output directory if it doesn't already exist
        os.makedirs(directory, exist_ok=True)

        # Extract the video ID
        video_id = video_key.split("/")[-1].replace(".mp4", "")

        # Output audio
        audio_file = f"{directory}/{video_id}.wav"

        # Download the video to a temporary file
        tmp_video_file = "/tmp/video.mp4"
        s3.download_file(source_bucket, video_key, tmp_video_file)

        # Load the video clip
        video_clip = VideoFileClip(tmp_video_file)

        # Extract audio from the video clip
        audio_clip = video_clip.audio

        # Write the audio to a 16 bit wav file
        audio_clip.write_audiofile(audio_file, codec="pcm_s16le")

        # Close the video clip
        video_clip.close()
        print(f"Successfully processed {video_key}")


    except Exception as e:
        print(f"Error processing {video_key}: {e}")
        return None

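# Convert audio to text with the Whisper large-v3 model.
# Note: the model and pipeline are rebuilt on every call, which is slow;
# building them once outside this function would avoid the repeated load.
def transcribe_audio(audio_path):
    # Use the GPU with half precision when available, otherwise the CPU with full precision
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32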

    model_id = "openai/whisper-large-v3"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=30,
        batch_size=16,
        return_timestamps=True,
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(audio_path)
    print(result["text"])
    return result["text"]


# List objects in the source bucket
response = s3.list_objects_v2(Bucket=source_bucket, Prefix=source_prefix)

# Create an empty DataFrame to collect the results
df = pd.DataFrame(columns=["video", "audio", "text"])

# Process each video
if "Contents" in response:
    for obj in response["Contents"]:
        video_key = obj["Key"]
        if video_key.endswith(".mp4"):
            process_video(video_key)
            video_name = video_key.split("/")[-1]
            audio_name = video_name.replace(".mp4", ".wav")
            audio_text = transcribe_audio(f"{directory}/{audio_name}")
            new_df = pd.DataFrame([{"video": video_name, "audio": audio_name, "text": audio_text}])
            df = pd.concat([df, new_df], ignore_index=True)

df

ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5033:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2501:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_concat returned error: N

MoviePy - Writing audio in audios/Mod01_Course Overview.wav


                                                                        

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod01_Course Overview.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


 Hi, and welcome to Amazon Academy Machine Learning Foundations. In this module, you'll learn about the course objectives, various job roles in the machine learning domain, and where you can go to learn more about machine learning. After completing this module, you should be able to identify course prerequisites and objectives, indicate the role of the data scientist in business, and identify resources for further learning. We're now going to look at the prerequisites for taking this course. Before you take this course, we recommend that you first complete AWS Academy Cloud Foundations. You should also have some general technical knowledge of IT, including foundational computer literacy skills like basic computer concepts, email, file management, and a good understanding of the Internet. We also recommend that you have intermediate skills with Python programming and a general knowledge of applied statistics. Finally, general business knowledge is important for this course. This include

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_Intro.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome to Module 2 of AWS Academy Machine Learning. In this module, we're going to introduce machine learning. We'll first look at the business problems that can be solved by machine learning. We'll then talk about terminology, process, tools, and some of the challenges you'll face. After completing this module, you should be able to recognize how machine learning and deep learning are part of artificial intelligence, describe artificial intelligence and machine learning terminology, identify how machine learning can be used to solve a business problem, be used to solve a business problem, describe the machine learning process, list the tools available to data scientists, and identify when to use machine learning instead of traditional software development methods. You're now ready to get started with Section 1. See you in the next video.
MoviePy - Writing audio in audios/Mod02_Sect01.wav


                                                                        

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_Sect01.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome to Section 1. In this section, we're going to talk about what machine learning is. This course is an introduction to machine learning, which is also known as ML. But first, we'll discuss where machine learning fits into the larger picture. Machine learning is a subset of artificial intelligence, or AI. This is a broad branch of computer science that's focused on building machines that can do human tasks. Deep learning is a subdomain of machine learning. To understand where these all fit together, we'll discuss each one. As we just mentioned, machine learning is a subset of a broader computer science field known as artificial intelligence. AI focuses on building machines that can perform tasks a human would typically perform. In contemporary popular culture, you've probably seen AIs in movies, television, or works of fiction. For example, you might have seen AIs that control the world around them, or that start acting on their own initiative. These AIs started as comput

                                                                        

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_Sect02.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi and welcome back. In this section, we're going to look at the types of business problems machine learning can help you solve. Machine learning is used all across your digital lives. Your email spam filter is the result of a machine learning program that was trained with examples of spam and regular email messages. Based on books you're reading or products you bought, machine learning programs can predict other books or products you're likely to be interested in. Again, the machine learning program was trained with data from other readers' habits and purchases. When detecting credit card fraud, the machine learning program was trained on examples of transactions that turned out to be fraud, along with normal transactions. You can probably think of many more examples, from social media applications, using facial detection to group your photos, to detecting brain tumors in brain scans or finding anomalies in x-rays. There are three main types of machine learning. There's supervised le

                                                                        

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_Sect03.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi and welcome back. This is section 3 and we're going to give you a quick, high-level overview of machine learning terminology and a typical workflow. We will cover these topics in more detail later in this course, but for now we'll focus on the larger picture. So to begin, you should always start with the business problem you or your team believe could benefit from machine learning. From there, you want to do some problem formulation. In this phase, one task is to articulate your business problem and convert it to an ML problem. After you've formulated the problem, you move on to the data preparation and pre-processing phase. You'll pull data from one or more data sources. These data sources might have differences in data or types that need to be reconciled so you can form a single, cohesive view of your data. You'll need to visualize your data and use statistics to determine if the data is consistent and can be used for machine learning. We'll look at some of the data sources later

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_Sect04.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome back. In this section, we'll look at some of the tools you'll be using throughout the rest of this course. Before we start, this list isn't an exhaustive list of all the tools available today. We're only going to cover them at a high level, but it's a good place to get started. First, there's the Jupyter Notebook. The Jupyter Notebook is an open-source web application you can use to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible. You can use it to configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, and machine learning. JupyterLab is extensible and modular. You can write plugins that add new componen

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_Sect05.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. This is section 5 and we're going to discuss challenges with machine learning. You'll come across many challenges in machine learning. There are a lot of poor quality and inconsistent data available. A significant portion of your job will be getting access to or generating enough good data that's representative of the problem you want to solve. A key issue to watch out for is under or overfitting the model. It's not all about the data, although it mostly is. Do you have data science experience? Is staffing a team of data scientists cost-effective? Does management support using machine learning? What does the business landscape look like? Are the problems too complex to formulate into a machine learning problem? Can the resulting model be explained to the business? If it can't be explained, it might not get adopted. What's the cost of building, updating, and operating a machine learning solution? Finally, how does the technology map? Does the business unit have access

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod02_WrapUp.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 It's now time to review the module. Here are the main takeaways for this module. First, we looked at defining machine learning and how it fits into the broader AI landscape. We also looked at the types of problems machine learning can help us solve and how machine learning applies learning algorithms to develop models from large datasets. We then looked at the machine learning pipeline and the different stages for developing a machine learning application. Finally, we introduced some of the tools and services you can use, before discussing some of the challenges with machine learning. In summary, in this module you learned how to recognize how machine learning. In summary, in this module, you learned how to recognize how machine learning and deep learning are part of artificial intelligence, describe artificial intelligence and machine learning terminology, identify how machine learning can be used to solve a business problem, describe the machine learning process, list the tools avai

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Intro.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome back to AWS Academy Machine Learning. This is module three, and we're going to work through the entire machine learning pipeline by using Amazon SageMaker. This module will discuss a typical process for handling a machine learning problem. The machine learning pipeline can be applied to many machine learning problem. The machine learning pipeline can be applied to many machine learning problems. The focus is on supervised learning, but the process you learn in this module can be adapted to other types of machine learning as well. This is a large module and we'll be covering a lot of material. At the end of this module, you'll be able to formulate a problem from a business request, obtain and secure data for machine learning, build a Jupyter notebook by using Amazon SageMaker, outline the process for evaluating data, explain why data needs to be preprocessed, use open-source tools to examine and pre-process data. Use Amazon SageMaker to train and host a machine learning model. 

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect01.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome back to module 3. This is section 1, and we're going to take a look at some of the data sets we'll use in this module. We'll also look at guidance for how to formulate a business problem. Before we get started, here's a reminder of the machine learning pipeline we looked at in the previous module, and how that maps to the sections in this module. This section, Section 1, will cover how to formulate a problem. It will also cover the datasets we'll use throughout this module. Section 2 will discuss how to obtain and secure data for your machine learning activities. In Section 3, we'll show you tools and techniques for gaining an understanding of your data. Then in Section 4, we'll look at pre-processing your data so it's ready to train a model. Section 5 will cover selecting and training an appropriate machine learning model. Section 6 will show you how to deploy a model so you can make a prediction. Section 7 will examine the process of evaluating the performance of a m

                                                                        

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect02_part1.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We're now going to look at a few ways you can collect and secure data. In this section, we'll explore some of the techniques and challenges associated with collecting and securing the data that's needed for machine learning. Consider again the original example about predicting credit card fraud. You've further formulated the problem. But what data do you need to actually train your model so you can get the desired output and subsequently achieve your intended business outcome? Do you have access to the data? If so, how much data do you have and where is it? What solution can you use to bring all this data into one centralized repository? The answers to these questions are essential at this stage. The good news for a budding data scientist is that there are many places where you can obtain data. Private data from you or your existing customer already exists, including everything from log files to customer invoice databases. Private data can be useful depending on the 

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect02_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring data collection by reviewing how to extract, transform, and load data. Data is typically spread across many different systems and data providers. This presents a challenge. You'll need to bring all these data sources together into something that can be consumed by a machine learning model. You can do this through extract, transform, and load, which is also known as ETL. The steps in ETL are defined this way. In the extract step, you pull the data from the sources to a single location. During extraction, you might need to modify the data, combine matching records, or do other tasks that transform the data. Finally, in the load step, the data is loaded into a repository such as Amazon S3. A typical ETL framework has several components. As an example, consider the diagram. First, the Crawler A program connects to a data store, which can be a source or a target. It progresses through a ranked list of classifiers to determine the schema for your d

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect02_part3.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring data collection by reviewing how to secure your data. It's important to consider the security of your data. Though the data sets used in this course are all public, real data about customer transactions or health records need to be kept secure. You can use AWS Identity and Access Management, which is also known as IAM. It's a service that controls access to resources. Make sure you're securing your data within AWS correctly so you can avoid data breaches. The diagram shows a simple IAM policy that allows only read access to a specific S3 bucket for the listed role. In addition to controlling access to data, you need to make sure your data is secure. It's a good practice and it might also be legally required for certain data types, such as financial data or healthcare records. AWS provides encryption features for storage services, typically for data that's at rest or in transit. You can often meet these encryption requirements by enabling encr

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect03_part1.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi and welcome back. This is section 3 and we're going to cover how to evaluate your data. In this section, we'll look at different data formats and types. We'll also look at how you can visualize and analyze the data before feature engineering. Before you can start running statistics on your data to better understand what you're working with, you need to ensure it's in the right format for analysis. For Amazon SageMaker, algorithms support training with data in CSV format. Many of the tools you'll use to explore, visualize, and analyze the data can also read it in CSV format. Generally speaking, you'll need to have at least some domain knowledge for the problem you're trying to solve with machine learning. For example, if you're developing a model to predict if a set of symptoms indicates a disease, you'd need to know the relationship between the symptoms and the disease. Data typically needs to be in numeric form, so machine learning algorithms can use the data to make predictions. 

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect03_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring how to describe your data. Now that your data is in a readable format, you can perform descriptive statistics on the data to better understand it. Descriptive statistics help you gain valuable insights into your data so that you can effectively pre-process the data and prepare it for your ML model. We'll look at how you can do that and discuss why it's so important. First, descriptive statistics can be organized into a few different categories. Overall statistics include the number of rows and the number of columns in your dataset. This information, which relates to the dimensions of your data, is very important. For example, it can indicate that you have too many features, which can lead to high dimensionality and poor model performance. Attribute statistics are another type of descriptive statistic, specifically for numeric attributes. They're used to get a better sense of the shape of your attributes. This includes properties like the mean

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect03_part3.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. Now we'll review how to find correlations in your dataset. How can you quantify the linear relationship among the variables you're seeing in a scatterplot? A correlation matrix is a good tool in this situation. It conveys both the strong and weak linear relationships among numerical variables. Correlation can go as high as 1 or as low as minus 1. When the correlation is 1, this means those two numerical features are perfectly correlated with each other. It's like saying y is proportional to x. When the correlation of those two variables is minus one, it's like saying y is proportional to minus x. Any linear relationship in between can be quantified by the correlation. So if the correlation is zero, this means there's no linear relationship. But it doesn't mean that there's no relationship, it's just an indication that there's no linear relationship between those two variables. However, looking at a number isn't always straightforward. Often it's easier to view the nu

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect04_part1.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome to section 4. In this section, we're going to look at feature engineering. Feature engineering is one of the most impactful things you can do to improve your machine learning model. We'll now look at what it is. There are two things that can help make your models more successful. The first is feature selection and the second is feature extraction or the process of creating features. In feature selection you select the most relevant features and discard the rest. You can apply feature selection to prevent redundancy or irrelevance in the existing features. You can also use it to limit the number of features to help prevent overfitting. Feature extraction builds valuable information from raw data by reformatting, combining, and transforming primary features into new ones. This process continues until it yields a new data set that can be consumed by the model to achieve your goals. As the diagram shows, feature extraction covers a range of activities, from dealing with mi

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect04_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring feature engineering by reviewing how to clean your dataset. In addition to converting string data to numerical data, you'll need to clean your dataset for several other potential problem areas. Before encoding the string data, make sure the strings are all consistent. You'll also need to make sure variables use a consistent scale. For example, if one variable describes the number of doors in a car, the scale will probably be between 2 and 8. But if another variable describes the number of cars of a particular type sold in the state of California, the scale will type sold in the state of California, the scale will probably be in the thousands. Some data items might also capture more than one variable in a single value. For instance, suppose the dataset includes variables that combine safety and maintenance into a single variable, such as safe high maintenance. single variable, such as safe high maintenance. You'll need to train your machine le

                                                                      

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect04_part3.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring feature engineering by describing how to work with outliers. You might also need to clean your data based on any outliers that exist. Outliers are points in your dataset that lie at an abnormal distance from other values. They're not always something you want to clean up because they can add richness to your dataset. But they can also make it harder to make accurate predictions because they skew values away from the other, more normal values related to that feature. An outlier might also indicate that the data point actually belongs to another column. You can think of outliers as falling into two broad categories. might also indicate that the data point actually belongs to another column. You can think of outliers as falling into two broad categories. The first is a single variation for just a single variable, or a univariate outlier. The second is a variation of two or more variables, or a multivariate outlier. One of the more common ways to

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect05.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back to module 3. This is section 5 on training. In this section, we're going to look at how to select a model and train it with the data we have preprocessed. At this point, you've done a lot to clean and prepare your data, but that doesn't mean your data is completely ready to train the algorithm. Some algorithms may not be able to work with training data in a data frame format. Some file formats, like CSV, are commonly used by various algorithms, but they do not make use of that optimization that some of the file formats, like RecordIO Protobuf, can use. Many Amazon SageMaker algorithms support training with data in a CSV format. Amazon SageMaker requires that a CSV file doesn't have a header record and that the target variable is in the first column. Most Amazon SageMaker algorithms work best when you use the optimized protobuf record I-O format for the training data. Using this format allows you to take advantage of pipe mode when training the algorithms that support 

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect06.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi and welcome back. This is section 6 and we're going to look at hosting and using the model. In this section, we'll look at how you can deploy your trained model so it can be consumed by applications. After you've trained, tuned, and tested your model you'll learn more about testing in the next section, you're now ready to deploy your model. If you're thinking that we're looking at the phases out of order, here's why we're discussing deployment now. If you want to test your model and get performance metrics from it, you first need to make an inference or prediction from the model, and this typically requires deployment. Deployment for testing is different from production, although the mechanics are the same. Amazon SageMaker provides everything you need to host your model for simple testing and evaluation, from a few requests to deployments handling tens of thousands of requests. There are two ways you can deploy your model. For single predictions, you can deploy your model with Ama

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect07_part1.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back to module 3. In this section, we'll look at how you can evaluate your model's success in predicting results. At this point, you've trained your models. It's now time to evaluate that model to determine if it will do a good job predicting the target on new and future data. Because future instances have unknown target values, you need to assess how the model will perform on data where you already know the target answer. You'll then use this assessment as a proxy for performance on future data. This is the reason why you hold out a sample of your data for evaluating or testing. An important part of this phase involves choosing the most appropriate metric for your business situation. Think back to the earlier section on problem formulation. During that phase, you define your business problem and outcome, and then you craft a business metric to evaluate success. The model metric you choose at this phase should be linked to that business metric as much as possible. There's 

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect07_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring how to evaluate your model. The diagram shows the confusion matrix of how two different models performed on the same data. Can you tell which one's better? Which is better isn't a good question to ask. What do you mean by better? Does better mean making sure you find all the cats? Even if it means you'll get many false positives? Or does better mean making sure the model is the most accurate? It's difficult to see just by looking at the two charts. What if you're trying several models, using multiple folds, and have hundreds of data points to compare? To do that, you'll need to calculate more metrics. The first metric is sensitivity. This is sometimes referred to as recall, hit rate, or true positive rate. Sensitivity is the percentage of positive identifications. In the cat example, it represents what percentage of cats were correctly identified. To calculate sensitivity, take the number of true positives, or the number of positive identific

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect07_part3.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring how to evaluate your model. Classification models are going to return a probability for the target. This is a value of the input belonging to the target class, and it will be between 0 and 1. To convert the value to a class, you need to determine the threshold to use. You might think it's 50%, but you could change it to be lower or higher to improve your results. As you've seen with sensitivity and specificity, there's a trade-off between correctly and incorrectly identifying classes. Changing the threshold can impact that outcome. We're going to take a look at how you can visualize this. A receiver operating characteristic graph is also known as an ROC graph. It summarizes all the confusion matrices that each threshold produced. To build one, you calculate and plot the sensitivity, or true positive rate, against the false positive rate on a graph for each threshold value. You can calculate the false positive rate by subtracting the specifici

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_Sect08.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome back to module 3. This is section 8. In this section, we're going to take a look at how you can tune the model's hyperparameters to improve model performance. Recall from an earlier module that hyperparameters can be thought of as the knobs that tune the machine learning algorithm to improve its performance. Now that we're looking more explicitly at tuning models, it's time to look more specifically at the different types of hyperparameters and how to perform hyperparameter optimization. There are a couple of different categories of hyperparameters. The first kind are model hyperparameters. The first kind are model hyperparameters. They help define the model itself. As an example, consider a neural network for a computer vision problem. For this case, additional attributes of the architecture need to be defined, like filter size, pooling, and the stride or padding. The second kind are optimizer hyperparameters. They relate to how the model learns patterns based on data

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod03_WrapUp.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 It's now time to review the module and wrap up with a knowledge check. In this module, you learned how to formulate a problem from a business request, obtain and secure data for machine learning, build a Jupyter notebook by using Amazon SageMaker. Outline the process for evaluating data. Explain why data needs to be pre-processed. Use open source tools to examine and pre-process data. Use Amazon SageMaker to train and host a machine learning model. Use cross-validation to test the performance of an ML model, use a hosted model for inference, and create an Amazon SageMaker hyperparameter tuning job to optimize a model's effectiveness. That concludes this module. Thanks for watching. We'll see you again in the next video.
MoviePy - Writing audio in audios/Mod04_Intro.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod04_Intro.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome to Module 4 of AWS Academy Machine Learning. In this module, we're going to look at forecasting. We'll start with an introduction to forecasting and look at how time series data is different from other kinds of data. Then, we're going to look at Amazon Forecast, a service that helps you simplify building forecasts. At the end of this module, you'll be able to describe the business problem solved with Amazon Forecast, describe the challenges of working with time series data, list the steps required to create a forecast by using Amazon Forecast. And use Amazon Forecast to make a prediction. See you in the next video!
MoviePy - Writing audio in audios/Mod04_Sect01.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod04_Sect01.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, and welcome to Section 1. We'll get started by reviewing what forecasting is and some use cases for it. Forecasting is an important area of machine learning. It's important because there are so many opportunities for predicting future outcomes based on historical data. Many of these opportunities involve a time component. However, while the time component adds additional information, it also makes time series problems more difficult to handle compared to other types of predictions. You can think of time series data as falling into two broad categories. The first type is univariate data, which means there's just one variable. The second one is multivariate data, which means there's more than one variable. There are several common patterns in time series data. The first pattern is a trend. With a trend, you get a pattern with the values increasing, decreasing, or staying the same over time. There are seasonal patterns. These reflect times of the year, month, day, or other patterns. 

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod04_Sect02_part1.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi and welcome back. This is section 2 and we're going to focus on processing time series data because it can be different from other types of data you've been using so far. Time series data is data that is captured in chronological sequence over a defined period of time. Introducing time into a machine learning model has a positive impact because the model can derive meaning from changes in the data points over time. Time series data tends to be correlated. This means that there's a dependency between data points. This has mixed results for forecasting. This is because you're dealing with a regression problem, and regression assumes that data points are independent. You need to develop a method for dealing with data dependence so you can increase the validity of the predictions. In addition to the time series data, you can add related data to augment a forecasting model. For example, suppose you want to make a prediction about retail sales. You could include information about the pro

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod04_Sect02_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring wrangling time series data. Seasonality in data is any kind of repeating observation where the frequency of the observation is stable. For example, in sales, you typically see higher sales at the end of a quarter and into the fourth quarter. Consumer retail sees even higher sales in the fourth quarter. Be aware that data can have multiple types of seasonality in the same data set. There are many times when you should incorporate seasonality information into your forecast. For instance, localized holidays are a good example for sales. The chart shows that the total revenue generated by arcades has a strong correlation with the number of computer science doctorates awarded in the US. But correlations do not mean causation. If you disagree, see the source for the chart. There are many other correlations plotted on the site, and none of them make any sense. With your own data, be careful that you're not seeing and acting on correlations that don'

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod04_Sect02_part3.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi and welcome back. In this section, we'll look at how you can use Amazon Forecast to create a predictor and generate forecasts. When you generate forecasts, you can apply the machine learning development pipeline you've seen throughout this course. But you still need data. You need to import as much data as you have, both historical data and related data. You'll want to do some basic evaluation and feature engineering before you use the data to train a model so you can meet the requirements of Amazon Forecast. To train a predictor, you need to choose an algorithm. If you're not sure which algorithm is the best for your data, Amazon Forecast can choose for you. To do this, select AutoML as your algorithm. You also need to select a domain for your data. If you're not sure what the best fit is, you can also select a custom domain. Domains have specific types of data they require. When you have a trained model, you can then use the model to make a forecast using an input data set group.

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod04_WrapUp.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. It's now time to review the module and wrap it up. In this module, you learned how to describe the business problem solved by Amazon Forecast, describe the challenges of working with time series data, list the steps required to create Forecast by using Amazon Forecast, and use Amazon Forecast to make a prediction. Thanks for participating. See you in the next module.
MoviePy - Writing audio in audios/Mod05_Intro.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Intro.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome back to AWS Academy Machine Learning. This is Module 5, and we have a great topic for you today, Computer Vision. In this module, we'll start with an overview of the computer vision space, and you'll learn about some of the use cases and terminology. Next, we'll explore details about analyzing image and video with managed services from Amazon Web Services or AWS. Finally, we'll look at how you can use your own customized data sets for performing object detection. At the end of this module, you'll be able to describe the use cases for computer vision, describe the Amazon Managed Machine Learning services available for image and video analysis, list the steps required to prepare a custom dataset for object detection, describe how Amazon SageMaker Ground Truth can be used to prepare a custom dataset. And finally, use Amazon Recognition to perform facial detection. Thanks for watching, we'll see you in the next video.
MoviePy - Writing audio in audios/Mod05_Sect01_ver2.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect01_ver2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


MoviePy - Writing audio in audios/Mod05_Sect02_part1_ver2.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part1_ver2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome back. In this section, we'll explore image analysis in more detail. And in part two, we'll take a closer look into video analysis. To start, we'll introduce the main Amazon service we'll be using, Amazon Recognition. Amazon Recognition is a computer vision service that's based on deep learning. You can use it to add image and video analysis to your applications. There are many uses for Amazon Recognition, including creating searchable image and video libraries. Amazon Recognition makes both images and stored videos searchable, so you can discover the objects and scenes that appear in them. You can use Amazon Recognition to build a face-based user verification system, so your applications can confirm user identities by comparing their live image with a reference image. Amazon Recognition interprets emotional expressions, such as happy, sad, or surprise. It can also interpret demographic information from facial images, such as gender. Amazon Recognition can also detect inappropr

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring image analysis with a closer look at facial detection. Facial detection uses a model that was tuned to perform predictions specifically for detecting faces and facial features. Facial detection has many of the same features as standard object detection, such as a bounding box or the coordinates of the box surrounding the face that was detected. This will include a value representing the confidence that the bounding box contains a face. There will be a list of attributes if found, such as if the face has a beard or if it appears to be male or female. There will also be a confidence score for these attributes. It can also detect physical emotions, like whether the person is smiling or frowning. It's important to understand this classification is based only on visual clues, and so it might not represent the actual emotion of the person. Facial landmarks are components of the face such as eyes and mouth. Typical landmarks also include X and Y coo

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part1.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 In this section, we'll look at preparing custom datasets for computer vision, so you can detect custom objects. One challenge of using a pre-built model is that it will only find images it was trained to find. Though Amazon Rekognition was trained with tens of millions of images, it can't detect objects that it wasn't trained on. For example, consider the 8 of hearts playing card. If you run this card through Amazon Recognition, the results show various attributes. However, none of the labels are playing card or 8 of hearts. If you want Amazon Recognition to detect images in your problem domain, you must train the model with your images. So in this section, you'll learn how to train Amazon Recognition with images from your problem domain. Though you'll focus only on using Amazon Recognition here, you'll encounter a similar process if you use other pre-trained models. Training a computer vision algorithm to recognize images requires a large input dataset, which isn't practical for most

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring video analysis by reviewing how to create the training dataset. Datasets contain information that's needed to train and test an Amazon Recognition Custom Labels model, such as images, labels, and bounding boxes. such as images, labels, and bounding boxes. You can use images from Amazon S3, or you can upload them from your computer to S3 as part of the process. To train a model, your dataset should have at least two labels, with at least 10 images per label. Each image in your dataset must be labeled. As we mentioned earlier, you can use the Amazon Recognition Custom Labels console or Amazon SageMaker Ground Truth to label your images. Again, to train an Amazon Recognition Custom Labels model, your images must be labeled. A label indicates that an image contains an object, scene, or concept. As we mentioned earlier, a dataset needs at least two defined labels. Also, each image must have at least one assigned label that identifies the object, s

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part3.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring video analysis by reviewing how to create the test dataset. The final step before you train your model is to identify a test dataset. You will use this test dataset to validate and evaluate the model's performance. You'll do this by performing an inference on the images in the test dataset. You'll then compare the results with the labeling information that's in the training dataset. You can create your own test dataset. Alternatively, you can use Amazon Recognition Custom Labels to split your training dataset into two datasets by using an 80-20 split. This split means that 80% of the data is used for training and 20% is used for testing. After you define the training and test datasets, Amazon Recognition Custom Labels can automatically train the model for you. The service automatically loads and inspects the data, selects the correct machine learning algorithms, trains a model, and provides model performance metrics. You're charged for the am

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part4_ver2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Hi, welcome back. We'll continue exploring video analysis by reviewing how to evaluate and improve your model. In general, you can improve the quality of your model with larger quantities of better quality data. Use training images that clearly show the object or scene, and don't include many things that you're not interested in. For bounding boxes around objects, use training images that show the object as fully visible and not hidden by other objects. Make sure that your training and test data sets match the type of images that you'll eventually run inference on. For objects where you have just a few training examples, like logos, you should provide bounding boxes around the logo in your test images. These images represent the scenarios you want to localize the object in. Reducing false positives often results in better precision. To reduce false positives, first, check if increasing the confidence threshold enables you to keep the correct predictions while eliminating false positiv

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod05_WrapUp_ver2.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 It's now time to summarize some of the main points in this module. In this module, you learned how to describe the use cases for computer vision, describe the Amazon Managed Machine Learning services available for image and video analysis, list the steps required to prepare a custom data set for object detection. Describe how Amazon SageMaker Ground Truth can be used to prepare a custom data set. And use Amazon Recognition to perform facial detection. That concludes this introduction to computer vision. Thanks for watching. We'll see you again in the next video.
MoviePy - Writing audio in audios/Mod06_Intro.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod06_Intro.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Introduction to Natural Language Processing Hi, and welcome to Module 6 of AWS Academy Machine Learning, Introduction to Natural Language Processing. In this module, we'll introduce Natural Language Processing, which is also known as NLP. This section includes a description of the major challenges faced by NLP and the overall development process for NLP applications. We'll then review five AWS services you can use to speed up the development of NLP-based applications. After completing this module, you should be able to describe the NLP use cases that are solved by using managed Amazon ML services and describe the managed Amazon ML services available for NLP let's get started
MoviePy - Writing audio in audios/Mod06_Sect01.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod06_Sect01.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 We'll get started by reviewing what Natural Language Processing means. Natural Language Processing is also known as NLP. Before we explain what NLP is, we'll consider an example of NLP, Amazon Alexa. Alexa works by having a device, such as an Amazon Echo, record your words. The recording of your speech is sent to Amazon's servers to be analyzed more efficiently. Amazon breaks down your phrase into individual sounds. Then, it connects to a database containing the pronunciation of various words to find which words most closely correspond to the combination of individual sounds. Amazon identifies important words to make sense of the tasks and carry out corresponding functions. For instance, if Alexa notices words like outside or temperature, it will open the weather Alexa skill. Amazon servers then send the information back to your device and Alexa skill. Amazon servers then send the information back to your device, and Alexa speaks. NLP is a broad term for a general set of business or c

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod06_Sect02.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome back. In this section, we'll review five managed machine learning services you can use for various use cases. These services simplify the process of creating a machine learning application. We'll start by looking at Amazon Transcribe. You can use Amazon Transcribe to recognize speech in audio files and produce a transcription. It can recognize specific voices in an audio file, and you can create a customized vocabulary for terms that are specialized for a particular domain. You can also add a transcription service to your applications by integrating with WebSockets, a transcription service to your applications by integrating with WebSockets, an internet protocol you can use for two-way communication between an application and Amazon Transcribe. Here are some of the more common use cases for Amazon Transcribe. First, medical professionals can record their notes, and Amazon Transcribe can capture their spoken notes as text. Also, video production organizations can generate subti

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod06_WrapUp.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome back. It's now time to review the module and wrap it up. In summary, in this module, you learned how to describe the NLP use cases that are solved by using managed Amazon ML services and describe the managed ML services available for NLP. Good job. Thanks for watching. We'll see you in the next module. and describe the managed ML services available for NLP. Good job. Thanks for watching. We'll see you in the next module.
MoviePy - Writing audio in audios/Mod07_Sect01.wav

MoviePy - Done.
Successfully processed CUR-TF-200-ACMNLP-1/video/Mod07_Sect01.mp4


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Welcome to Module 7, Course Wrap-Up. Congratulations on completing the AWS Academy Machine Learning course. We'll take a few minutes to review what you've learned and where you can go from here. We're going to start with a review of what you've learned in this course. You learned how to describe machine learning, implement a machine learning pipeline, and use Amazon machine learning services for forecasting, computer vision, and natural language processing. Well done. Although this course isn't designed to prepare you to become certified for the AWS Certified Machine Learning specialty, we'll review how you can continue to work towards that certification. AWS Certification helps you build credibility and confidence by validating your cloud expertise with an industry-recognized credential. It also helps organizations identify skilled professionals who can lead cloud initiatives by using AWS. You must earn a passing score by taking a proctored exam to earn an AWS certification. After re

Unnamed: 0,video,audio,text
0,Mod01_Course Overview.mp4,Mod01_Course Overview.wav,"Hi, and welcome to Amazon Academy Machine Lea..."
1,Mod02_Intro.mp4,Mod02_Intro.wav,"Hi, and welcome to Module 2 of AWS Academy Ma..."
2,Mod02_Sect01.mp4,Mod02_Sect01.wav,"Hi, and welcome to Section 1. In this section..."
3,Mod02_Sect02.mp4,Mod02_Sect02.wav,"Hi and welcome back. In this section, we're g..."
4,Mod02_Sect03.mp4,Mod02_Sect03.wav,Hi and welcome back. This is section 3 and we...
5,Mod02_Sect04.mp4,Mod02_Sect04.wav,"Welcome back. In this section, we'll look at ..."
6,Mod02_Sect05.mp4,Mod02_Sect05.wav,"Hi, welcome back. This is section 5 and we're..."
7,Mod02_WrapUp.mp4,Mod02_WrapUp.wav,It's now time to review the module. Here are ...
8,Mod03_Intro.mp4,Mod03_Intro.wav,Welcome back to AWS Academy Machine Learning....
9,Mod03_Sect01.mp4,Mod03_Sect01.wav,"Hi, and welcome back to module 3. This is sec..."


## 3. Normalizing the text
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to perform any text normalization steps that are necessary for your solution.

In [5]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

# Build these once instead of on every call to preprocess_text
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


# Text normalization and lemmatization
def preprocess_text(text):
    # Convert the text to lowercase
    text = text.lower()

    # Replace every run of non-alphanumeric characters (punctuation,
    # newlines, and so on) with a single space
    text = re.sub(r"[^a-z0-9]+", " ", text)

    # Remove leading or trailing whitespace
    text = text.strip()

    # Tokenize and remove stopwords
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize alphabetic tokens (tokens that contain digits are dropped).
    # The default noun lemmatization can over-strip: "discuss" appears as
    # "discus" in the output below.
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    return " ".join(tokens)

df["normalized_text"] = df["text"].apply(preprocess_text)
df

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,video,audio,text,normalized_text
0,Mod01_Course Overview.mp4,Mod01_Course Overview.wav,"Hi, and welcome to Amazon Academy Machine Lea...",hi welcome amazon academy machine learning fou...
1,Mod02_Intro.mp4,Mod02_Intro.wav,"Hi, and welcome to Module 2 of AWS Academy Ma...",hi welcome module aws academy machine learning...
2,Mod02_Sect01.mp4,Mod02_Sect01.wav,"Hi, and welcome to Section 1. In this section...",hi welcome section section going talk machine ...
3,Mod02_Sect02.mp4,Mod02_Sect02.wav,"Hi and welcome back. In this section, we're g...",hi welcome back section going look type busine...
4,Mod02_Sect03.mp4,Mod02_Sect03.wav,Hi and welcome back. This is section 3 and we...,hi welcome back section going give quick high ...
5,Mod02_Sect04.mp4,Mod02_Sect04.wav,"Welcome back. In this section, we'll look at ...",welcome back section look tool using throughou...
6,Mod02_Sect05.mp4,Mod02_Sect05.wav,"Hi, welcome back. This is section 5 and we're...",hi welcome back section going discus challenge...
7,Mod02_WrapUp.mp4,Mod02_WrapUp.wav,It's now time to review the module. Here are ...,time review module main takeaway module first ...
8,Mod03_Intro.mp4,Mod03_Intro.wav,Welcome back to AWS Academy Machine Learning....,welcome back aws academy machine learning modu...
9,Mod03_Sect01.mp4,Mod03_Sect01.wav,"Hi, and welcome back to module 3. This is sec...",hi welcome back module section going take look...
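
The effect of each normalization step is easier to see on a single sentence. The following is a minimal sketch that uses only the standard library. The `normalize` function, the tiny `STOP_WORDS` set, and the sample sentence are illustrative assumptions, not part of the notebook, which uses the full NLTK English stopword list and also lemmatizes each token.

```python
import re

# Hypothetical, deliberately small stopword list for illustration only.
# The notebook uses NLTK's full English list instead.
STOP_WORDS = {"and", "to", "of", "in", "this", "we", "re", "the", "a", "is"}


def normalize(text, stop_words=STOP_WORDS):
    # Lowercase, then collapse every run of non-alphanumeric characters
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    # Drop stopwords and any token that is not purely alphabetic
    tokens = [t for t in text.split() if t not in stop_words and t.isalpha()]
    return " ".join(tokens)


# Sample sentence modeled on the transcript rows shown above
sample = "Hi, and welcome to Module 2 of AWS Academy Machine Learning."
print(normalize(sample))  # hi welcome module aws academy machine learning
```

Lemmatization is left out of this sketch on purpose, so that it runs without downloading any NLTK data.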


## 4. Extracting key phrases and topics
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to extract the key phrases and topics from the videos.

In [6]:
import spacy

# Load spacy's English NLP model
nlp = spacy.load("en_core_web_sm")

# Extract key phrases
def extract_keyphrases(text):
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]

df["key_phrases"] = df["normalized_text"].apply(extract_keyphrases)

# Extract topics: named entities with the labels below, plus all nouns
labels = {"LANGUAGE", "ORDINAL", "EVENT", "PRODUCT", "DATE", "WORK_OF_ART", "FAC", "ORG", "PERSON", "GPE"}

def extract_topics(text):
    doc = nlp(text)

    # Keep named entities whose label is in the allowed set
    topics = [entity.text for entity in doc.ents if entity.label_ in labels]

    # Add every noun; duplicates are kept, so a term can appear several times
    topics.extend([token.text for token in doc if token.pos_ == "NOUN"])
    return topics

df["topics"] = df["normalized_text"].apply(extract_topics)
df


Unnamed: 0,video,audio,text,normalized_text,key_phrases,topics
0,Mod01_Course Overview.mp4,Mod01_Course Overview.wav,"Hi, and welcome to Amazon Academy Machine Lea...",hi welcome amazon academy machine learning fou...,"[amazon academy machine, foundation module, co...","[machine learning, first, amazon, lex amazon, ..."
1,Mod02_Intro.mp4,Mod02_Intro.wav,"Hi, and welcome to Module 2 of AWS Academy Ma...",hi welcome module aws academy machine learning...,"[introduce machine, first look business proble...","[first, welcome, module, machine, module, intr..."
2,Mod02_Sect01.mp4,Mod02_Sect01.wav,"Hi, and welcome to Section 1. In this section...",hi welcome section section going talk machine ...,"[welcome section section, talk machine learnin...","[first, every year, first, welcome, section, s..."
3,Mod02_Sect02.mp4,Mod02_Sect02.wav,"Hi and welcome back. In this section, we're g...",hi welcome back section going look type busine...,"[section, machine learning, digital life email...","[anomaly x ray, first, boundary cross, first, ..."
4,Mod02_Sect03.mp4,Mod02_Sect03.wav,Hi and welcome back. This is section 3 and we...,hi welcome back section going give quick high ...,"[section, quick high level overview machine, t...","[november, february year, mali city, age birth..."
5,Mod02_Sect04.mp4,Mod02_Sect04.wav,"Welcome back. In this section, we'll look at ...",welcome back section look tool using throughou...,"[section look tool, rest course, exhaustive li...","[today, first, linear, tensorflow kera, first,..."
6,Mod02_Sect05.mp4,Mod02_Sect05.wav,"Hi, welcome back. This is section 5 and we're...",hi welcome back section going discus challenge...,"[section, discus challenge machine learning, m...","[today, first, recognition, section, discus, c..."
7,Mod02_WrapUp.mp4,Mod02_WrapUp.wav,It's now time to review the module. Here are ...,time review module main takeaway module first ...,"[time review module main takeaway module, land...","[time, review, module, module, machine, fit, l..."
8,Mod03_Intro.mp4,Mod03_Intro.wav,Welcome back to AWS Academy Machine Learning....,welcome back aws academy machine learning modu...,"[back aws academy machine learning module, ama...","[machine, module, work, machine, pipeline, mod..."
9,Mod03_Sect01.mp4,Mod03_Sect01.wav,"Hi, and welcome back to module 3. This is sec...",hi welcome back module section going take look...,"[module section, data set use module, guidance...","[first, first, first, six month, first, second..."
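
The `topics` lists above contain repeated and generic entries such as `first`, `module`, and `section`. One possible refinement is to count the terms and keep only the most frequent ones. The following is a minimal sketch, assuming a hypothetical `top_topics` helper and a hand-picked `ignore` set; neither is part of the notebook.

```python
from collections import Counter

# Hypothetical set of generic terms to skip; tune it for your own transcripts.
GENERIC_TERMS = frozenset({"first", "second", "today", "welcome", "section", "module"})


def top_topics(topics, n=5, ignore=GENERIC_TERMS):
    # Count each term once per occurrence, skipping the generic ones
    counts = Counter(topic for topic in topics if topic not in ignore)
    # most_common sorts by count; ties keep their first-seen order
    return [topic for topic, _ in counts.most_common(n)]


sample = ["first", "welcome", "module", "machine", "module", "pipeline", "machine", "machine"]
print(top_topics(sample, n=2))  # ['machine', 'pipeline']
```

If this fits your solution, it could be applied with `df["topics"].apply(top_topics)` to store a shorter, deduplicated list for each video.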


## 5. Creating the dashboard
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to create the dashboard for your solution.

In [7]:
from urllib.parse import quote

import ipywidgets as widgets
from IPython.display import display, HTML

# Create a search textbox and a label that explains it
textbox = widgets.Text(description='Search:')
textbox_label = widgets.Label(value='Enter a keyword or topic to search videos:')

# Create an output widget to display the search results
output = widgets.Output()

# Search videos based on key phrases and topics
def search_video(sender):
    # Clear the earlier output
    output.clear_output()
    term = sender.value.strip()
    query = term.lower()

    # Ignore empty submissions
    if query == '':
        return

    search_result = 0
    with output:
        for index, row in df.iterrows():
            if any(query in phrase for phrase in row['key_phrases']) or any(query in topic for topic in row['topics']):
                search_result += 1
                video_name = row['video']
                # URL-encode the object key because some file names contain spaces
                video_url = f"https://{source_bucket}.s3.amazonaws.com/{quote(source_prefix + video_name)}"
                display(HTML(f"<br><p><b>{video_name.replace('.mp4','')}</b></p><video width='400' controls><source src='{video_url}' type='video/mp4'></video>"))

        if search_result > 0:
            print(f"Found {search_result} video(s) related to '{term}'.")
        else:
            print(f"No video found related to '{term}'.")


# Run the search when the user presses Enter in the textbox.
# Note: on_submit is deprecated in ipywidgets 8; observing the 'value'
# trait with continuous_update=False is the suggested replacement.
textbox.on_submit(search_video)

display(textbox_label, textbox, output)


Label(value='Enter a keyword or topic to search videos:')

Text(value='', description='Search:')

Output()
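
The matching rule that the dashboard uses can be checked without any widgets by separating it into a plain function. The following is a minimal sketch: `find_videos` and the two sample rows are hypothetical and only mirror the `video`, `key_phrases`, and `topics` columns of the DataFrame.

```python
def find_videos(query, rows):
    """Return the video names whose key phrases or topics contain the query."""
    query = query.strip().lower()
    if not query:
        return []
    return [
        row["video"]
        for row in rows
        # Substring match, the same rule the dashboard callback applies
        if any(query in term for term in row["key_phrases"] + row["topics"])
    ]


# Hypothetical sample rows; real rows could come from df.to_dict("records")
rows = [
    {"video": "Mod02_Intro.mp4", "key_phrases": ["introduce machine"], "topics": ["module", "machine"]},
    {"video": "Mod03_Intro.mp4", "key_phrases": ["machine learning pipeline"], "topics": ["machine", "pipeline"]},
]

print(find_videos(" Pipeline ", rows))  # ['Mod03_Intro.mp4']
print(find_videos("machine", rows))     # ['Mod02_Intro.mp4', 'Mod03_Intro.mp4']
```

Because the rule is a substring test, a short query can match inside longer words (for example, `pipe` matches `pipeline`). Matching on whole tokens would be stricter, at the cost of missing partial entries.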