# QTM 350 - Data Science Computing
## Practice: Interactive Data Analysis with Jupyter on AWS EC2
**Author:** Danilo Freire (danilo.freire@emory.edu, Emory University)

## Hands-on Practice: The Goal

- **Objective:**
- In this session, you will use a remote cloud server as a fully interactive data science workbench. This is a common workflow for running data analysis on a remote machine.

- **Process:**
- You will launch an EC2 instance, set up a remote Jupyter Notebook environment using SSH port forwarding, download and clean a dataset interactively, perform an analysis with visualisation, and then retrieve your final work (both the notebook and the cleaned data).

- **Environment:**
- This workflow uses a single AWS EC2 instance managed from your local terminal.

- **Estimated Time:**
- 60-75 minutes.

## Part 1: Server Setup and Jupyter Launch

- **Launch an EC2 Instance:**
- Log into the AWS Console and launch one `t2.micro` instance. Use the `Ubuntu Server 24.04 LTS` image and name it `jupyter-ec2-instance`.
- Use your existing `qtm350_key.pem` key pair and configure the security group to allow SSH traffic from your IP address.

- **Connect and Install Software:**
- Connect to your instance using `ssh`.
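- Once connected, install `wget`, `pandas`, `matplotlib`, `seaborn`, and Jupyter on the instance. On Ubuntu 24.04 the system Python rejects a plain `pip install` outside a virtual environment, so install the Python packages inside a virtual environment or through `apt`.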
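
Once the installation finishes, a quick check along the following lines can confirm that the packages are importable on the instance. This is a minimal sketch, not part of the original exercise: the helper name `missing_packages` is made up for illustration, and the module name `notebook` assumes Jupyter was installed as the classic Notebook package.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be imported in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed module names; adjust if you installed JupyterLab instead of Notebook.
required = ["pandas", "matplotlib", "seaborn", "notebook"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

Run it with the same Python interpreter (or virtual environment) that you will use to start Jupyter, since packages installed elsewhere will not be visible to the notebook kernel.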

- **Start a Port-Forwarding SSH Session:**
- Open a **new local terminal window** (keep your original SSH session open for now) and run a command of the form `ssh -i qtm350_key.pem -L 8000:localhost:8888 ubuntu@<your-instance-public-dns>`. This maps port 8000 on your local machine to port 8888 on the EC2 instance, allowing you to access the remote Jupyter server.
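
To check from your local machine that the forwarding session is listening, a small test like the one below may help. It is only a sketch: `port_is_open` is a hypothetical helper, and a `True` result shows that something accepts connections on local port 8000, not that Jupyter itself is already running on the instance.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

print("Listening on localhost:8000:", port_is_open("localhost", 8000))
```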

- **Start the Jupyter Notebook Server:**
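- In your **original SSH terminal** (the one already connected to EC2), start the Jupyter server on port 8888 with `jupyter notebook --no-browser --port=8888`. The `--no-browser` flag stops Jupyter from trying to open a web browser on the server, which has none.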

- Jupyter will print a URL containing a security token. Copy this token (the long string of characters after `token=`).
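
If you prefer not to pick the token out by eye, the standard library can extract it from the printed URL. The sketch below uses a made-up URL and token purely for illustration; `extract_token` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs

def extract_token(jupyter_url):
    """Return the value of the `token` query parameter, or None if absent."""
    query = parse_qs(urlparse(jupyter_url).query)
    values = query.get("token")
    return values[0] if values else None

# Made-up example in the shape Jupyter prints; your token will differ.
example_url = "http://localhost:8888/tree?token=abc123def456"
print(extract_token(example_url))  # abc123def456
```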

- **Access Jupyter in Your Browser:**
- Open a web browser on your **local machine** and navigate to `http://localhost:8000`.
- Paste the token you copied into the password box to log in.

## Part 2: Interactive Data Cleaning and Analysis

You are now running a Jupyter Notebook session on a remote cloud server. All work from this point will be done in notebook cells.

- **Create a New Notebook:**
- In the Jupyter file browser, click `New` -> `Python 3 (ipykernel)` to create a new notebook. Rename it `employee_analysis.ipynb`.

- **Download the Raw Data (Inside the Notebook):**
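- In the first cell of your new notebook, use the `wget` command to download `dirty_employee_data.csv` to your EC2 instance. Prefix the command with `!` so that the notebook passes it to the shell. More information about the command is available [here](https://niagads.scrollhelp.site/support/wget-linux-file-downloader-user-guide).
- The link is <https://github.com/danilofreire/qtm350-summer/blob/main/lectures/lecture-20/dirty_employee_data.csv>. This address points to the GitHub page for the file, so pass `wget` the raw file URL (the `Raw` button on that page); otherwise it saves an HTML page rather than the CSV.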
- If you need to create the dataset, you can find the script here: <https://github.com/danilofreire/qtm350-summer/blob/main/lecture-20/dirty-data-generation.py>.
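
GitHub page links containing `/blob/` return an HTML page, while `wget` needs the raw file. The sketch below shows one way to derive the raw URL from the page link; `github_blob_to_raw` is a hypothetical helper, and the result assumes the file really lives at the path given in the link above.

```python
def github_blob_to_raw(url):
    """Convert a github.com '/blob/' page URL to its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError("not a GitHub blob URL: " + url)
    # Split into 'owner/repo' and 'branch/path/to/file'.
    owner_repo, path = url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/" + owner_repo + "/" + path

page_url = ("https://github.com/danilofreire/qtm350-summer/blob/main/"
            "lectures/lecture-20/dirty_employee_data.csv")
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

In a notebook cell, the printed address can then be passed to `!wget`.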

- **Load and Inspect the Data:**
- In the next cell, load the data using pandas and use `df.info()` to inspect its structure and identify problems.

```python
# Cell 2: Load and inspect
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('dirty_employee_data.csv')
df.info()
```

- **Perform Data Cleaning:**
- In new cells, perform the following cleaning steps, inspecting the result after each one:
  1.  **Standardise Department Names:** Use a dictionary and the `.replace()` method to merge 'HR'/'hr' into 'Human Resources' and 'IT'/'Tech' into 'Technology'.
  2.  **Fill Missing Salaries:** Use `.groupby()` and `.transform()` to fill missing salary values with the median salary of that employee's department.
  3.  **Handle Missing Dates:** Remove any rows where the `start_date` is missing using `.dropna()`.
  4.  **Correct Data Type:** Convert the `start_date` column to a proper datetime format using `pd.to_datetime()`.

```python
# Cell 3: Clean data (can be split into multiple cells)
department_map = {'HR': 'Human Resources', 'hr': 'Human Resources', 'IT': 'Technology', 'Tech': 'Technology'}
df['department'] = df['department'].replace(department_map)

df['salary'] = df.groupby('department')['salary'].transform(lambda x: x.fillna(x.median()))

df.dropna(subset=['start_date'], inplace=True)

df['start_date'] = pd.to_datetime(df['start_date'])

print("Data after cleaning:")
df.info()
```
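
To see what the group-wise median fill does, it can help to run it on a tiny table first. The toy data below is made up for illustration and is not taken from the course dataset. The sketch uses `transform("median")` followed by `fillna`, a variant of the lambda above that gives the same filled values.

```python
import pandas as pd

# Made-up toy data: one missing salary in each department.
toy = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Technology", "Technology", "Technology"],
    "salary": [40000.0, None, 60000.0, 70000.0, 90000.0, None],
})

# transform() returns one value per original row, aligned on the index,
# so the result can be used directly to fill the gaps.
dept_median = toy.groupby("department")["salary"].transform("median")
toy["salary"] = toy["salary"].fillna(dept_median)

print(toy)
```

The missing Sales salary becomes 50000.0 (the median of 40000 and 60000) and the missing Technology salary becomes 80000.0 (the median of 70000 and 90000).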

- **Create a Visualisation:**
- In a new cell, use `sns.countplot()` to create a bar chart showing the number of employees in each department.

```python
# Cell 4: Create plot
plt.figure(figsize=(10, 6))
sns.countplot(y='department', data=df, order=df['department'].value_counts().index)
plt.title('Number of Employees per Department', fontsize=16)
plt.xlabel('Number of Employees')
plt.ylabel('Department')
plt.tight_layout()
plt.show()
```

## Part 3: Saving and Retrieving Your Work

- **Save the Cleaned Data:**
- In the final cell of your notebook, save the cleaned DataFrame to a new CSV file.
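
One possible way to write the file is `DataFrame.to_csv()`. The sketch below uses a small stand-in DataFrame and a temporary folder so that it runs on its own; the filename `clean_employee_data.csv` is an assumption, as the exercise does not prescribe one. In the notebook you would call `to_csv()` on your own `df` with a plain filename.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the cleaned DataFrame; in the notebook, use your own `df`.
df_clean = pd.DataFrame({
    "department": ["Human Resources", "Technology"],
    "salary": [52000.0, 81000.0],
    "start_date": pd.to_datetime(["2021-03-01", "2019-11-15"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_employee_data.csv")
    # index=False leaves the row-number index out of the file.
    df_clean.to_csv(out_path, index=False)
    # CSV stores dates as text, so parse them again when reading back.
    check = pd.read_csv(out_path, parse_dates=["start_date"])

print(check.dtypes)
```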

- **Download Your Files:**
- In the Jupyter interface in your browser, first save the notebook (`File` -> `Save Notebook`).
- Then, from your **local terminal**, use `scp` to download both your completed notebook and the new clean CSV file.
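
The `scp` commands can be assembled as follows. This is a sketch under several assumptions: `build_scp_command` is a hypothetical helper, `EC2_PUBLIC_DNS` is a placeholder for your instance's address, `ubuntu` is the default user on Ubuntu images, both files are assumed to sit in that user's home directory (where Jupyter was started), and the CSV name is whatever you chose when saving.

```python
import shlex

def build_scp_command(key_path, host, remote_file, local_dir="."):
    """Build an scp command that copies one file from the remote home directory."""
    return ["scp", "-i", key_path, "ubuntu@" + host + ":" + remote_file, local_dir]

host = "EC2_PUBLIC_DNS"  # placeholder: replace with your instance's public DNS name
for filename in ["employee_analysis.ipynb", "clean_employee_data.csv"]:
    # Print the command so it can be pasted into the local terminal.
    print(shlex.join(build_scp_command("qtm350_key.pem", host, filename)))
```

Each printed line is a command to run in your local terminal from the folder that contains the key file.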

## End of Practice Session
- Congratulations! 🥳
- You have successfully used a cloud server as an interactive data science workbench, from data acquisition to final analysis and retrieval of your work.
- **Crucially, do not forget to terminate your EC2 instance** from the AWS Console to stop incurring charges.
- To do this, return to the SSH session running the Jupyter server and press `Ctrl+C` twice to stop it, close the port-forwarding session, then go to the AWS Console and terminate `jupyter-ec2-instance`.

# And that's all for today! 🎉