# **Algorithmic Methods of Data Mining - Winter Semester 2023**

## **AWS Question (AWSQ)**
AWS offers access to many cloud-based tools and services that simplify data processing, storage, and analysis. Thanks to AWS's scalable and affordable solutions, data scientists can work effectively with large datasets and carry out advanced analytics. Learning how to use AWS is therefore an essential skill for a data scientist. In this question, you will set up an environment on Amazon Web Services to complete a straightforward data analysis task.

In this question, you are asked to provide the most commonly used tags for book lists. Going through the [__list.json__](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries) file, you'll notice that each list has a list of tags attached, and we want to find the <ins>most popular tags</ins> across all of the lists. Please report the __top 5__ most frequently used tags and the number of times they appear in the lists.

Follow these (recommended) steps:  
- Download the *list.json* file to your local system.
- Write a Python script that generates the report and measures the time the system takes to generate it.
- Set up an EC2 instance on your AWS account and upload the *list.json* file together with your script to the instance.
- Compare the running times of your script on your local system and on the EC2 instance.

__Important note__: Please run the __same script__ on both your local system and your EC2 instance to compare the results. For example, keep the parameters the same if you process the data by loading it partially and aggregating the results. Comment on the differences you find.

Please provide a report containing the following:
- The configuration of the EC2 instance
- The command used to connect to the EC2 instance
- The commands used to upload the files and run the script on the EC2 instance from your local system
- A table containing the most popular tags and their usage counts
- A table containing the running time of the script on your local system and on the EC2 instance
  
The following is the expected outcome for the most popular tags:
|tag|#usage|
|---|---|
|romance|6001|
|fiction|5291|
|young-adult|5016|
|fantasy|3666|
|science-fiction|2779|


### **Downloading the dataset and writing Python script**

```python
import json
from collections import Counter
import time

filepath = "dataset/list.json"  # one JSON object per line
n = 5  # number of top tags to report

start_time = time.time()

tags_array = []

# Collect the tags of every list; lists without a "tags" field contribute nothing
with open(filepath, "r") as f:
    for line in f:
        tags = json.loads(line).get("tags", [])
        for tag in tags:
            tags_array.append(tag)

# Count occurrences and keep the n most frequent tags
counter = Counter(tags_array)
top_tags = counter.most_common(n)

end_time = time.time()

print("tag: #usage")
print("-----------")
for tag, number_of_times in top_tags:
    print(f"{tag}: {number_of_times}")

print(f"\nExecution time: {end_time - start_time:.2f} seconds")
```

```
tag: #usage
-----------
romance: 6001
fiction: 5291
young-adult: 5016
fantasy: 3666
science-fiction: 2779

Execution time: 12.88 seconds
```
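
The script above collects every tag in a list before counting. As a possible variation, the sketch below updates the `Counter` line by line, so only the running counts are kept in memory. The function name `count_tags` and the three sample records are made up for illustration; they are not part of the original script or of list.json.

```python
import io
import json
from collections import Counter


def count_tags(lines, n=5):
    """Return the n most common tags from an iterable of JSON lines."""
    counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Records without a "tags" field contribute nothing
        counter.update(json.loads(line).get("tags", []))
    return counter.most_common(n)


# Tiny in-memory stand-in for list.json (hypothetical records, not real data)
sample = io.StringIO(
    '{"id": 1, "tags": ["romance", "fiction"]}\n'
    '{"id": 2, "tags": ["romance"]}\n'
    '{"id": 3}\n'
)
top = count_tags(sample, n=2)
print(top)
```

For the real file, the same function could be called as `count_tags(open("dataset/list.json"), n=5)`. If this variant were used, it would have to be used on both machines so that the timings stay comparable.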


### **Config of EC2**

Steps for creating the EC2 instance:
- Logging in to the AWS Academy Learner Lab
- Starting the lab
- Launching the AWS Management Console and opening the EC2 dashboard
- Clicking on Launch Instance
- Naming the instance: hw2_awsq
- Choosing a suitable Amazon Machine Image: Ubuntu 22.04
- Selecting an instance type: t2.large
- Adding a key pair: an existing key pair from the lab (mykeypair.pem)
- Configuring storage: a 30 GB volume
- Launching the instance
- On the EC2 dashboard, selecting the instance and then opening Actions -> Security -> Modify IAM role (top left corner)
- Choosing LabInstanceProfile and then clicking Update IAM role



### **Connecting to EC2 using SSH and uploading the files**

SSH is used to connect to the EC2 instance. The local machine runs Ubuntu 22.04:
- Navigating in the terminal to the directory that contains mykeypair.pem
- Running the following command to connect over SSH:
    
    ssh -i "mykeypair.pem" ubuntu@ec2-3-83-141-6.compute-1.amazonaws.com
- After the host is added to the list of known hosts, the following prompt appears: 
    
    ubuntu@ip-172-31-87-42:~$
- Two commands are used to update and upgrade the installed packages: 
    
    sudo apt update

    sudo apt upgrade
- Next, the AWS Command Line Interface (AWS CLI) is installed (it can also be used for other Amazon services): 

    sudo apt install awscli
- The instance is shut down to apply certain settings: 
    
    sudo shutdown -h now
- After starting the instance again through the console, we get a new command for connecting: 

    ssh -i "mykeypair.pem" ubuntu@ec2-54-146-140-217.compute-1.amazonaws.com
- The following command reports Python 3.10.12: 

    python3 --version 
- We also install pip by running: 

    sudo apt install pip
- A new directory, dataset, is created with: 

    mkdir dataset
- In a new terminal on the local machine, we first navigate to the directory that contains the script and then run the following two commands to transfer the files over SSH:

    scp -i "key/mykeypair.pem" aws_script.py ubuntu@ec2-54-146-140-217.compute-1.amazonaws.com:/home/ubuntu
    
    scp -i "key/mykeypair.pem" "dataset/list.json" ubuntu@ec2-54-146-140-217.compute-1.amazonaws.com:/home/ubuntu/dataset

- The transfer can also be done through the S3 service by creating a bucket and uploading the files there; the contents of the bucket can then be listed with:

    aws s3 ls s3://name-of-the-bucket
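
Because the public DNS name of the instance changed after the restart, the `scp` commands have to be rewritten with the new host each time. A minimal sketch of one way to handle this is to build the two upload commands from a single host variable. The helper `build_scp_command` is hypothetical and was not part of the original workflow; the paths and host are the ones used above.

```python
import shlex


def build_scp_command(key_path, local_path, host, remote_path, user="ubuntu"):
    """Build the argument list for one scp upload (hypothetical helper)."""
    return ["scp", "-i", key_path, local_path, f"{user}@{host}:{remote_path}"]


# Host name taken from the commands above; it changes when the instance restarts
host = "ec2-54-146-140-217.compute-1.amazonaws.com"
key = "key/mykeypair.pem"

commands = [
    build_scp_command(key, "aws_script.py", host, "/home/ubuntu"),
    build_scp_command(key, "dataset/list.json", host, "/home/ubuntu/dataset"),
]

for cmd in commands:
    # Print the command instead of running it; to actually transfer the file,
    # pass the list to subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Only the `host` value needs to change after a restart; the rest of the commands stay the same.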

### **Launching the EC2 instance through SSH**

### **Running the script**


After switching to the terminal connected to the remote machine, we list the contents of the directories and run the script. The commands and their output are shown below:
```
ubuntu@ip-172-31-87-42:~$ ls
aws_script.py  dataset
ubuntu@ip-172-31-87-42:~$ cd dataset/
ubuntu@ip-172-31-87-42:~/dataset$ ls
list.json
ubuntu@ip-172-31-87-42:~/dataset$ cd ..
ubuntu@ip-172-31-87-42:~$ python3 aws_script.py 
tag: #usage
-----------
romance: 6001
fiction: 5291
young-adult: 5016
fantasy: 3666
science-fiction: 2779

Execution time: 16.09 seconds
```
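
A single run gives only one timing, so the script was run several times on each machine. The sketch below shows one way such repeated measurements could be automated. It is an assumption, not what was actually used: `time_runs` and `placeholder_workload` are hypothetical names, and the workload is a stand-in for the tag-counting step.

```python
import time
from statistics import mean


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock suited to measuring intervals
        func()
        durations.append(time.perf_counter() - start)
    return durations


def placeholder_workload():
    # Hypothetical stand-in for reading list.json and counting the tags
    sum(i * i for i in range(100_000))


durations = time_runs(placeholder_workload, repeats=5)
for run, seconds in enumerate(durations, start=1):
    print(f"run {run}: {seconds:.4f} s")
print(f"mean: {mean(durations):.4f} s")
```

Replacing `placeholder_workload` with the actual counting function would time the real task on either machine with the same code.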

The following table lists the most popular tags and their usage counts on both the local machine and EC2:

|tag (local)|#usage (local)|tag (EC2)|#usage (EC2)|
|---|---|---|---|
|romance|6001|romance|6001|
|fiction|5291|fiction|5291|
|young-adult|5016|young-adult|5016|
|fantasy|3666|fantasy|3666|
|science-fiction|2779|science-fiction|2779|

The following table lists the running time of the script on the local system and on the EC2 instance:

|Run|Execution time, local machine (s)|Execution time, EC2 (s)|
|---|---|---|
|1|12.66|16.09|
|2|12.82|16.01|
|3|12.74|16.11|
|4|12.92|16.17|
|5|12.88|15.90|
|Average|12.80|16.06|
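
The averages of the five runs can be recomputed from the per-run values in the table. The sketch below does this and also reports the ratio between the two means; the variable names are illustrative.

```python
from statistics import mean

# Per-run execution times in seconds, copied from the table above
local_times = [12.66, 12.82, 12.74, 12.92, 12.88]
ec2_times = [16.09, 16.01, 16.11, 16.17, 15.90]

local_avg = mean(local_times)
ec2_avg = mean(ec2_times)
slowdown = ec2_avg / local_avg  # how many times slower EC2 was on average

print(f"local mean:  {local_avg:.2f} s")
print(f"EC2 mean:    {ec2_avg:.2f} s")
print(f"EC2 / local: {slowdown:.2f}x")
```

With these values the means come out to about 12.80 s locally and 16.06 s on EC2, so the t2.large instance was roughly 25% slower than the local machine for this script.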