# Assignment 7: Cloud Computing with AWS

# Due date: Friday, June 13

## Introduction

In this assignment, you will apply the cloud computing skills covered in the lecture. The work has two main parts:
1.  **Setup and Verification:** Set up a single AWS EC2 instance, connect to it, and verify its configuration.
2.  **Remote Analysis:** Perform a complete data analysis workflow by generating data locally, uploading it to your EC2 instance for processing, calculating summary statistics, creating a plot, and downloading the results.

**Important:** Remember to **terminate** your EC2 instance after you complete the assignment. The `.pem` key file is sensitive and should never be committed to a public repository.

## Activity: Remote Analysis of Road Accident Data

### (a) Launch and Configure your EC2 Instance

- Log into the AWS Console and launch a new EC2 instance with the following specifications:
  - **Name:** `qtm350-assignment`
  - **Application and OS Images:** `Ubuntu Server 24.04 LTS`
  - **Instance type:** `t2.micro` (to be eligible for the Free Tier)
  - **Key pair:** Create a new key pair named `qtm350_key` and download the `.pem` file.
  - **Network settings:** Allow SSH traffic from your IP address.
  - **Storage:** Use the default 8GB.

### (b) Generate Data Locally

- On your **local machine**, create a Python script named `generate_accidents.py` with the following code. This script will create a CSV file with fictitious data on road accidents.

```python
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Generate one date per day for a full year
dates = pd.date_range(start='2023-01-01', periods=365, freq='D')

# Multiplier applied to each day's accident count, rising towards the weekend
day_of_week_effect = [1.0, 1.0, 1.1, 1.2, 1.5, 1.8, 1.6]  # Mon-Sun

# Draw a Poisson base count per day and scale it by that weekday's multiplier
accidents = []
for date in dates:
    base_accidents = np.random.poisson(lam=10)
    # dayofweek is 0 for Monday through 6 for Sunday
    num_accidents = int(base_accidents * day_of_week_effect[date.dayofweek])
    accidents.append(num_accidents)

# Create DataFrame
df = pd.DataFrame({
    'date': dates.date,
    'day_of_week': dates.day_name(),
    'number_of_accidents': accidents
})

# Save to CSV
df.to_csv('uk_road_accidents.csv', index=False)
print("uk_road_accidents.csv created successfully.")
```

- Run the script from your local terminal: `python3 generate_accidents.py`.
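
Before uploading anything, it can help to confirm that the CSV looks the way the generator script intends. The following is a minimal sketch using only the standard library; the function name `check_accidents_csv` and its `expected_rows` parameter are made up for illustration and are not part of the assignment.

```python
import csv
from datetime import date

# Column names written by generate_accidents.py
EXPECTED_COLUMNS = ["date", "day_of_week", "number_of_accidents"]
# Index matches date.weekday(): 0 is Monday, 6 is Sunday
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def check_accidents_csv(path, expected_rows=365):
    """Return a list of problems found in the CSV (an empty list means none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"unexpected columns: {reader.fieldnames}"]
        rows = list(reader)

    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # Data starts on line 2 because line 1 is the header
    for line_no, row in enumerate(rows, start=2):
        try:
            day = date.fromisoformat(row["date"])
            count = int(row["number_of_accidents"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: could not parse {row}")
            continue
        if DAY_NAMES[day.weekday()] != row["day_of_week"]:
            problems.append(f"line {line_no}: weekday does not match date")
        if count < 0:
            problems.append(f"line {line_no}: negative accident count")
    return problems
```

Calling `check_accidents_csv('uk_road_accidents.csv')` from the folder where you ran the generator should return an empty list if the file has the expected header, 365 rows, matching weekday names, and non-negative counts.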

### (c) Prepare Analysis Script

- On your **local machine**, create another Python script named `analyse_accidents.py`. This script will read the data, calculate statistics, and generate a bar chart.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data
df = pd.read_csv('uk_road_accidents.csv')

# --- Analysis ---
# Calculate average accidents per day of the week
avg_accidents_by_day = df.groupby('day_of_week')['number_of_accidents'].mean().round(2)
# groupby sorts day names alphabetically, so restore calendar order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_accidents_by_day = avg_accidents_by_day.reindex(day_order)

most_accidents_day = avg_accidents_by_day.idxmax()
highest_avg = avg_accidents_by_day.max()

# --- Save Summary Statistics ---
with open('summary_stats.txt', 'w') as f:
    f.write('Road Accident Data Summary\n')
    f.write('==========================\n')
    f.write(f'Day with most accidents on average: {most_accidents_day}\n')
    f.write(f'Highest average number of accidents: {highest_avg}\n')
print("summary_stats.txt created successfully.")

# --- Create Visualisation ---
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_accidents_by_day.index, y=avg_accidents_by_day.values)
plt.title('Average Number of Road Accidents by Day of the Week', fontsize=16)
plt.xlabel('Day of the Week')
plt.ylabel('Average Number of Accidents')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('accident_plot.png')
print("Plot saved to accident_plot.png")
```
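
As a rough cross-check for the chart and summary file, you can work out what the per-day averages should be in the long run. The generator draws a Poisson count with mean 10 and truncates `int(count * multiplier)`, so the expected value sits slightly below `10 * multiplier` whenever the product is not a whole number. The sketch below computes that expectation directly; the helper name `expected_truncated_mean` and the `max_k` cutoff are illustrative assumptions, and the multipliers are copied from `generate_accidents.py`.

```python
import math

# Weekday multipliers copied from generate_accidents.py (Mon-Sun)
DAY_EFFECT = {
    "Monday": 1.0, "Tuesday": 1.0, "Wednesday": 1.1, "Thursday": 1.2,
    "Friday": 1.5, "Saturday": 1.8, "Sunday": 1.6,
}


def expected_truncated_mean(lam, multiplier, max_k=200):
    """Expected value of int(k * multiplier) when k follows Poisson(lam).

    Sums over k = 0..max_k; for lam = 10 the probability mass beyond
    k = 200 is negligible.
    """
    prob = math.exp(-lam)  # P(k = 0)
    total = 0.0
    for k in range(max_k + 1):
        total += int(k * multiplier) * prob
        prob *= lam / (k + 1)  # turn P(k) into P(k + 1)
    return total


for day, effect in DAY_EFFECT.items():
    print(f"{day:<9} {expected_truncated_mean(10, effect):.2f}")
```

Each weekday occurs only about 52 times in the generated year, so the averages in `summary_stats.txt` and the bar heights in `accident_plot.png` will scatter around these long-run values rather than match them exactly.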

### (d) Execute Full Workflow on EC2

- Connect to your `qtm350-assignment` instance via SSH.
- **On the instance**, install the required Python libraries:
  ```sh
  sudo apt update
  sudo apt install python3-pandas python3-matplotlib python3-seaborn
  ```
- **On the instance**, run `lsb_release -a > os_info.txt` to generate the OS info file.
- From your **local** terminal, use `scp` to upload both `uk_road_accidents.csv` and `analyse_accidents.py` to your EC2 instance's home directory.
- Back on your **EC2 instance**, run the analysis script: `python3 analyse_accidents.py`.
- From your **local** terminal, use `scp` to download the three generated files: `os_info.txt`, `summary_stats.txt`, and `accident_plot.png`.
- **Important:** Once you have all files, **terminate** your instance from the AWS Console.
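
The connection and transfer steps above all follow the same pattern: point `ssh` or `scp` at your key file with `-i` and address the instance as `user@host`. The sketch below only assembles and prints the commands so you can inspect them before running anything. It assumes the default `ubuntu` login user of Ubuntu Server images and that your key file is saved as `qtm350_key.pem` in the current folder; `EC2_PUBLIC_DNS` is a placeholder for the public DNS name or IP address shown in the AWS Console, and the helper function names are hypothetical.

```python
import shlex


def ssh_command(key_path, host, user="ubuntu"):
    """Argument list for opening an SSH session on the instance."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


def scp_upload(key_path, host, local_files, user="ubuntu"):
    """Argument list for copying local files to the remote home directory.

    An empty path after the colon means the remote user's home directory.
    """
    return ["scp", "-i", key_path, *local_files, f"{user}@{host}:"]


def scp_download(key_path, host, remote_files, local_dir=".", user="ubuntu"):
    """Argument list for copying files from the remote home directory."""
    sources = [f"{user}@{host}:{name}" for name in remote_files]
    return ["scp", "-i", key_path, *sources, local_dir]


KEY = "qtm350_key.pem"   # assumed file name of the downloaded key
HOST = "EC2_PUBLIC_DNS"  # placeholder: replace with your instance's address

print(shlex.join(ssh_command(KEY, HOST)))
print(shlex.join(scp_upload(
    KEY, HOST, ["uk_road_accidents.csv", "analyse_accidents.py"])))
print(shlex.join(scp_download(
    KEY, HOST, ["os_info.txt", "summary_stats.txt", "accident_plot.png"])))
```

You can paste the printed commands into your local terminal, or pass one of the argument lists to `subprocess.run(..., check=True)`. Note that `ssh` refuses private keys that other users can read, so you may need to run `chmod 400 qtm350_key.pem` first.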

## Submission

To complete the assignment, submit the following three files either by uploading them to Canvas or by committing and pushing them to a GitHub repository (if you use GitHub, please share the repository URL on Canvas, too):
1.  `os_info.txt`
2.  `summary_stats.txt`
3.  `accident_plot.png`
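
If you submit through GitHub, a quick check of the folder before committing can catch a missing file or a stray key. The following is a small sketch, not a required step; the function name `check_submission` is made up for illustration.

```python
from pathlib import Path

# The three files listed in the submission section
REQUIRED_FILES = ["os_info.txt", "summary_stats.txt", "accident_plot.png"]


def check_submission(folder="."):
    """Return (missing_or_empty_files, pem_files_found) for a folder."""
    folder = Path(folder)
    missing = [
        name for name in REQUIRED_FILES
        if not (folder / name).is_file() or (folder / name).stat().st_size == 0
    ]
    # Key files must never be committed, so flag any found in the folder tree
    stray_keys = sorted(p.name for p in folder.rglob("*.pem"))
    return missing, stray_keys
```

Running `check_submission()` in your repository folder should return two empty lists when all three files are present and non-empty and no `.pem` file sits anywhere inside the folder.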