This assignment provides hands-on experience with Apache Spark by analyzing real-world Spark cluster logs using PySpark. You will set up a Spark cluster on AWS EC2 and perform distributed data analysis on approximately 2.8 GB of production log data.
You will analyze logs from 194 Spark applications that ran on YARN clusters between 2015 and 2017, comprising 3,852 container log files. The dataset includes ApplicationMaster logs and Executor logs that reveal insights about resource allocation, task execution, performance patterns, and cluster behavior.
For complete dataset documentation, see DATASET_OVERVIEW.md.
By completing this assignment, you will:
- Set up and manage a distributed Apache Spark cluster on AWS EC2
- Use PySpark to process large-scale log data
- Extract structured information from unstructured log files
- Perform distributed data analysis and aggregations
- Optimize Spark jobs for performance
- Monitor jobs using Spark Web UI
- Apply log parsing and regex techniques at scale
- AWS Account with appropriate IAM permissions
- Access to an EC2 instance (Linux) to run the setup commands
- AWS CLI configured with credentials
- Your laptop's public IP address (get from https://ipchicken.com/)
- Basic familiarity with Python and PySpark
- Understanding of distributed computing concepts
Original Archive: The dataset is provided as a compressed archive in this repository.
Structure:
- 194 application directories
- Each application contains multiple container log files
- Container 000001: ApplicationMaster logs
- Container 000002+: Executor logs
- Total size: ~2.8 GB (uncompressed)
See DATASET_OVERVIEW.md for complete documentation.
The dataset is provided as a compressed archive. You'll need to download, uncompress, and upload it to your own S3 bucket.
# 1. Create necessary directories
mkdir -p data/raw data/output
# 2. Download the dataset archive from the course S3 bucket
# Note: The bucket uses Requester Pays, so you must include --request-payer requester
aws s3 cp s3://dsan6000-datasets/spark-logs/Spark.tar.gz data/raw/ \
--request-payer requester
# This downloads a compressed archive of ~175 MB
# 3. Extract the archive
cd data/raw/
tar -xzf Spark.tar.gz
cd ../..
# 4. Verify extraction
ls -d data/raw/application_* | head -10
# Should see directories like: application_1485248649253_0052
# 5. Create your personal S3 bucket (replace YOUR-NET-ID with your actual net ID)
export YOUR_NET_ID="your-net-id" # e.g., "abc123"
aws s3 mb s3://${YOUR_NET_ID}-assignment-spark-cluster-logs
# 6. Upload the data to S3
aws s3 cp data/raw/ s3://${YOUR_NET_ID}-assignment-spark-cluster-logs/data/ \
--recursive \
--exclude "*.gz" \
--exclude "*.tar.gz"
# 7. Verify upload
aws s3 ls s3://${YOUR_NET_ID}-assignment-spark-cluster-logs/data/ | head -10
# You should see application directories listed
# Note: "Broken pipe" error at the end is normal (caused by head command)
# 8. Save your bucket name for later use
echo "export SPARK_LOGS_BUCKET=s3://${YOUR_NET_ID}-assignment-spark-cluster-logs" >> ~/.bashrc
source ~/.bashrc
Important Notes:
- The source bucket (`dsan6000-datasets`) uses Requester Pays, so always include `--request-payer requester` when downloading (a boto3 alternative is sketched below)
- Once you upload to YOUR personal bucket, you don't need the `--request-payer` flag
- The `data/raw/` directory is in `.gitignore` and will not be committed to git
- The compressed archive files (`.gz`, `.tar.gz`) are also ignored
- Only your output files in `data/output/` will be committed
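If you prefer to script the download in Python rather than use the AWS CLI, a minimal boto3 sketch might look like the following (bucket key and local path copied from the commands above; assumes boto3 is installed and your AWS credentials are configured):

```python
import boto3

s3 = boto3.client("s3")

# Requester Pays download: the ExtraArgs entry tells S3 that your account
# agrees to pay the data-transfer cost, mirroring --request-payer requester.
s3.download_file(
    Bucket="dsan6000-datasets",
    Key="spark-logs/Spark.tar.gz",
    Filename="data/raw/Spark.tar.gz",
    ExtraArgs={"RequestPayer": "requester"},
)
```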
# Install Java (required for PySpark)
sudo apt update
sudo apt install -y openjdk-17-jdk-headless
# Verify installation
java -version
# Set JAVA_HOME (add to ~/.bashrc for persistence)
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync
# Activate the virtual environment
source .venv/bin/activate
Follow AUTOMATION_README.md to create your cluster:
# Run the automated setup script
./setup-spark-cluster.sh <YOUR_LAPTOP_IP>
# This creates a 4-node cluster (1 master + 3 workers)
# The script will auto-detect your IAM instance profile for S3 access
The script will:
- Auto-detect the IAM instance profile from your current EC2 instance
- Create security groups and SSH keys
- Launch 4 EC2 instances (1 master + 3 workers)
- Install and configure Spark on all nodes
- Output cluster configuration to `cluster-config.txt` and `cluster-ips.txt`
Note: The script automatically detects the IAM instance profile attached to your current EC2 instance. If no profile is found, it will attempt to use the first available profile, or launch without an IAM role (S3 access may not work).
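If you want to confirm what the script will detect before running it, a small sketch like the one below queries the EC2 instance metadata service for the attached profile (assumes IMDSv2 is enabled on your instance; this is not part of the setup script itself):

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

try:
    # IMDSv2: fetch a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    token = urllib.request.urlopen(token_req, timeout=2).read().decode()

    # Ask which instance profile, if any, is attached to this instance.
    info_req = urllib.request.Request(
        f"{IMDS}/meta-data/iam/info",
        headers={"X-aws-ec2-metadata-token": token},
    )
    info = json.loads(urllib.request.urlopen(info_req, timeout=2).read())
    print("Attached instance profile:", info.get("InstanceProfileArn"))
except urllib.error.HTTPError:
    print("No instance profile attached to this instance")
except urllib.error.URLError:
    print("Metadata service unreachable -- are you running on EC2?")
```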
For faster development iteration, create a small sample from the raw data:
# Copy one application directory as a sample for local testing
mkdir -p data/sample
cp -r data/raw/application_1485248649253_0052 data/sample/
# Now you can test locally before running on the cluster with the full dataset
Complete TWO problems total:
Development Recommendation:
While not required for submission, we strongly recommend developing and testing your code locally first (without the cluster) using the sample data in data/sample/. This helps you:
- Debug your code faster without cluster overhead
- Iterate quickly on your logic
- Verify your parsing patterns work correctly
Once your code works locally, adapt it for the cluster and run on the full dataset.
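One way to make the local-to-cluster switch painless is to parameterize the master URL, as in this sketch (paths and app name are illustrative, not a required interface):

```python
import sys
from pyspark.sql import SparkSession

# Take the master URL from the command line when running on the cluster;
# default to local mode for quick tests against the sample data.
master = sys.argv[1] if len(sys.argv) > 1 else "local[*]"
# Illustrative paths: the sample directory locally, your S3 bucket on the cluster.
data_path = ("data/sample/" if master.startswith("local")
             else "s3a://YOUR-NET-ID-assignment-spark-cluster-logs/data/")

spark = (
    SparkSession.builder
    .appName("log-analysis-dev")
    .master(master)
    .getOrCreate()
)

# recursiveFileLookup picks up log files nested inside the application directories.
logs_df = spark.read.option("recursiveFileLookup", "true").text(data_path)
print(f"Read {logs_df.count()} log lines from {data_path}")

spark.stop()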
Analyze the distribution of log levels (INFO, WARN, ERROR, DEBUG) across all log files. This problem requires basic PySpark operations and simple aggregations.
Deliverable:
problem1.py: Python script that runs on the cluster
Outputs (3 files):
- `data/output/problem1_counts.csv`: Log level counts
- `data/output/problem1_sample.csv`: 10 random sample log entries with their levels
- `data/output/problem1_summary.txt`: Summary statistics
Expected output 1 (counts):
log_level,count
INFO,125430
WARN,342
ERROR,89
DEBUG,12
Expected output 2 (sample):
log_entry,log_level
"17/03/29 10:04:41 INFO ApplicationMaster: Registered signal handlers",INFO
"17/03/29 10:04:42 WARN YarnAllocator: Container request...",WARN
...
Expected output 3 (summary):
Total log lines processed: 3,234,567
Total lines with log levels: 3,100,234
Unique log levels found: 4
Log level distribution:
INFO : 125,430 (40.45%)
WARN : 342 ( 0.01%)
...
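As a starting point, the counts file can be produced with a few basic DataFrame operations. A minimal sketch, assuming the raw log lines are read with `spark.read.text` (the input path and output handling are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("problem1-sketch").getOrCreate()

# Illustrative input path -- point this at data/sample/ locally or your S3 bucket on the cluster.
logs = spark.read.option("recursiveFileLookup", "true").text("data/sample/")

# Extract the log level from each line; lines without a level yield an empty string.
with_level = logs.withColumn(
    "log_level", regexp_extract("value", r"\b(INFO|WARN|ERROR|DEBUG)\b", 1)
)

counts = (
    with_level.filter(col("log_level") != "")
    .groupBy("log_level")
    .count()
    .orderBy(col("count").desc())
)

# The aggregated result is tiny, so collecting it to the driver and writing a
# single CSV is safe (assumes pandas is available in the environment).
counts.toPandas().to_csv("problem1_counts.csv", index=False)

spark.stop()
```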
Analyze cluster usage patterns to understand which clusters are most heavily used over time. Extract cluster IDs, application IDs, and application start/end times to create a time-series dataset suitable for visualization with Seaborn.
Key Questions to Answer:
- How many unique clusters are in the dataset?
- How many applications ran on each cluster?
- Which clusters are most heavily used?
- What is the timeline of application execution across clusters?
Deliverable:
problem2.py: Python script that runs on the cluster
Usage:
# Full Spark processing (10-20 minutes)
uv run python problem2.py spark://$MASTER_PRIVATE_IP:7077 --net-id YOUR-NET-ID
# Skip Spark and regenerate visualizations from existing CSVs (fast)
uv run python problem2.py --skip-spark
Outputs (5 files):
- `data/output/problem2_timeline.csv`: Time-series data for each application
- `data/output/problem2_cluster_summary.csv`: Aggregated cluster statistics
- `data/output/problem2_stats.txt`: Overall summary statistics
- `data/output/problem2_bar_chart.png`: Bar chart visualization
- `data/output/problem2_density_plot.png`: Faceted density plot visualization
Expected output 1 (timeline):
cluster_id,application_id,app_number,start_time,end_time
1485248649253,application_1485248649253_0001,0001,2017-03-29 10:04:41,2017-03-29 10:15:23
1485248649253,application_1485248649253_0002,0002,2017-03-29 10:16:12,2017-03-29 10:28:45
1448006111297,application_1448006111297_0137,0137,2015-11-20 14:23:11,2015-11-20 14:35:22
...
Expected output 2 (cluster summary):
cluster_id,num_applications,cluster_first_app,cluster_last_app
1485248649253,181,2017-03-29 10:04:41,2017-03-29 18:42:15
1472621869829,8,2016-08-30 12:15:30,2016-08-30 16:22:10
...
Expected output 3 (stats):
Total unique clusters: 6
Total applications: 194
Average applications per cluster: 32.33
Most heavily used clusters:
Cluster 1485248649253: 181 applications
Cluster 1472621869829: 8 applications
...
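A sketch of how the timeline fields above might be derived, assuming `logs_df` holds raw log lines in a column named `value` (column names and patterns are illustrative; adapt them to the log layout described in DATASET_OVERVIEW.md):

```python
from pyspark.sql.functions import (
    input_file_name, regexp_extract, to_timestamp,
    min as min_, max as max_,
)

parsed = (
    logs_df
    .withColumn("file_path", input_file_name())
    .withColumn("application_id",
                regexp_extract("file_path", r"(application_\d+_\d+)", 1))
    .withColumn("cluster_id",
                regexp_extract("application_id", r"application_(\d+)_\d+", 1))
    .withColumn("app_number",
                regexp_extract("application_id", r"application_\d+_(\d+)", 1))
    .withColumn("ts",
                to_timestamp(
                    regexp_extract("value", r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})", 1),
                    "yy/MM/dd HH:mm:ss"))
)

# Approximate each application's start/end with the earliest/latest timestamp in its logs.
timeline = (
    parsed.filter(parsed.ts.isNotNull())
    .groupBy("cluster_id", "application_id", "app_number")
    .agg(min_("ts").alias("start_time"), max_("ts").alias("end_time"))
)
```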
Visualization Output: The script automatically generates two separate visualizations (see the example sketch after this list):
- Bar Chart (`problem2_bar_chart.png`):
  - Number of applications per cluster
  - Value labels displayed on top of each bar
  - Color-coded by cluster ID
- Density Plot (`problem2_density_plot.png`):
  - Shows job duration distribution for the largest cluster (the cluster with the most applications)
  - Histogram with KDE overlay
  - Log scale on x-axis to handle skewed duration data
  - Sample count (n=X) displayed in title
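A sketch of these two plots driven from the timeline CSV (assumes seaborn ≥ 0.11 and matplotlib ≥ 3.4; column names follow the expected output above):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering on the EC2 instance
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

timeline = pd.read_csv("data/output/problem2_timeline.csv",
                       parse_dates=["start_time", "end_time"])

# Bar chart: applications per cluster, with value labels on each bar.
per_cluster = (timeline.groupby("cluster_id").size()
               .reset_index(name="num_applications"))
ax = sns.barplot(data=per_cluster, x="cluster_id", y="num_applications")
for container in ax.containers:   # bar_label requires matplotlib >= 3.4
    ax.bar_label(container)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("data/output/problem2_bar_chart.png")
plt.close()

# Density plot: duration distribution for the busiest cluster, log-scaled x-axis.
largest = per_cluster.loc[per_cluster["num_applications"].idxmax(), "cluster_id"]
subset = timeline[timeline["cluster_id"] == largest].copy()
subset["duration_s"] = (subset["end_time"] - subset["start_time"]).dt.total_seconds()
ax = sns.histplot(data=subset, x="duration_s", kde=True, log_scale=True)
ax.set_title(f"Cluster {largest} job durations (n={len(subset)})")
plt.tight_layout()
plt.savefig("data/output/problem2_density_plot.png")
plt.close()
```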
Follow all instructions for code development and cluster creation from the lab-spark-cluster repository. Once you have:
- Created your Spark cluster following the lab instructions
- Developed and tested your code locally on your EC2 instance (optionally in a Jupyter notebook)
- Created a `.py` file ready to run on the cluster
Then proceed to run your analysis on the cluster using the steps below.
Repeat this workflow for both problems (problem1.py and problem2.py).
# From your local machine, load cluster configuration
source cluster-config.txt
# Copy your script to master node
scp -i $KEY_FILE problem1.py ubuntu@$MASTER_PUBLIC_IP:~/
# SSH to master node
ssh -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP
# On the master node:
cd ~/spark-cluster
source cluster-ips.txt
# Run your script on the cluster (IMPORTANT: Use your actual net ID!)
# Replace YOUR-NET-ID with your actual net ID (e.g., abc123)
uv run python ~/problem1.py spark://$MASTER_PRIVATE_IP:7077 --net-id YOUR-NET-ID
# Example:
# uv run python ~/problem1.py spark://$MASTER_PRIVATE_IP:7077 --net-id abc123
# Exit back to your local machine
exit
# Download results to your repo (adjust filenames based on your problem)
# For problem1, download all 3 output files:
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem1_counts.csv data/output/
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem1_sample.csv data/output/
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem1_summary.txt data/output/
# For problem2, download all 5 output files:
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem2_timeline.csv data/output/
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem2_cluster_summary.csv data/output/
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem2_stats.txt data/output/
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem2_bar_chart.png data/output/
scp -i $KEY_FILE ubuntu@$MASTER_PUBLIC_IP:~/spark-cluster/problem2_density_plot.png data/output/
Important Notes:
- Replace `problem1.py` with `problem2.py` as needed
- Replace `YOUR-NET-ID` with your actual net ID (e.g., abc123)
- Download all output files generated by each problem
- Repeat this entire process for both problems
- Local runs (sample data): 1-5 minutes
- Cluster runs (full dataset): 5-20 minutes depending on complexity
Monitor progress in the Spark Web UI:
- Master UI: `http://$MASTER_PUBLIC_IP:8080`
- Application UI: `http://$MASTER_PUBLIC_IP:4040`
Submit the following to your GitHub repository:
- `problem1.py` - Your solution for Problem 1: Log Level Distribution
- `problem2.py` - Your solution for Problem 2: Cluster Usage Analysis
- `problem1_counts.csv` - Log level counts
- `problem1_sample.csv` - Sample log entries with levels
- `problem1_summary.txt` - Summary statistics
- `problem2_timeline.csv` - Time-series data for each application
- `problem2_cluster_summary.csv` - Aggregated cluster statistics
- `problem2_stats.txt` - Overall summary statistics
- `problem2_bar_chart.png` - Bar chart showing applications per cluster
- `problem2_density_plot.png` - Faceted density plots showing duration distribution per cluster
- Brief description of your approach for each problem
- Key findings and insights from the data
- Performance observations (execution time, optimizations)
- Screenshots of Spark Web UI showing job execution
- Explanation of the visualizations generated in Problem 2
- Local test files or outputs
- Configuration files (`cluster-config.txt`, `cluster-ips.txt`, `*.pem`)
- Raw data files (`data/raw/`)
- Compressed archives (`*.tar.gz`, `*.gz`)
Note: Configuration files are in .gitignore for security reasons.
Total Files to Submit: 10 files
| File | Location | Description |
|---|---|---|
| `problem1.py` | Repository root | Problem 1 script |
| `problem2.py` | Repository root | Problem 2 script |
| `problem1_counts.csv` | `data/output/` | Problem 1 log level counts |
| `problem1_sample.csv` | `data/output/` | Problem 1 sample entries |
| `problem1_summary.txt` | `data/output/` | Problem 1 statistics |
| `problem2_timeline.csv` | `data/output/` | Problem 2 timeline data |
| `problem2_cluster_summary.csv` | `data/output/` | Problem 2 cluster summary |
| `problem2_stats.txt` | `data/output/` | Problem 2 statistics |
| `problem2_bar_chart.png` | `data/output/` | Problem 2 bar chart |
| `problem2_density_plot.png` | `data/output/` | Problem 2 density plots |
| `ANALYSIS.md` | Repository root | Your analysis report |
Total: 100 Points
- Correct implementation (30 points)
- Accurate log level extraction and counting (15 points)
- Proper sample generation (10 points)
- Correct summary statistics (5 points)
- Code quality and documentation (10 points)
- Output file completeness (10 points)
- All 3 required output files present and correctly formatted
- Correct implementation (30 points)
- Accurate cluster and application ID extraction (10 points)
- Proper timeline data generation (10 points)
- Correct aggregations and statistics (10 points)
- Visualization quality (10 points)
- Both visualizations generated correctly with Seaborn
- Bar chart shows applications per cluster with clear labels
- Density plot shows job duration distribution for largest cluster
- Log scale properly applied to handle skewed data
- Charts are clear, labeled, and informative
- Code quality and documentation (5 points)
- Output file completeness (5 points)
- All 5 required output files present and correctly formatted
- Comprehensive analysis of findings (20 points)
- Clear description of approach for each problem
- Key insights and patterns discovered in the data
- Discussion of cluster usage patterns and trends
- Performance analysis (15 points)
- Execution time observations
- Optimization strategies employed
- Comparison of local vs cluster performance
- Documentation quality (10 points)
- Screenshots of Spark Web UI showing job execution
- Well-formatted markdown with clear sections
- Professional presentation
- Additional insights (5 points)
- Novel visualizations beyond requirements
- Deeper analysis of the data
- Creative problem-solving approaches
Maximum total with bonus: 150 points
Use PySpark's regexp_extract for parsing:
from pyspark.sql.functions import regexp_extract, col
logs_parsed = logs_df.select(
regexp_extract('value', r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})', 1).alias('timestamp'),
regexp_extract('value', r'(INFO|WARN|ERROR|DEBUG)', 1).alias('log_level'),
regexp_extract('value', r'(INFO|WARN|ERROR|DEBUG)\s+([^:]+):', 2).alias('component'),
col('value').alias('message')
)
from pyspark.sql.functions import input_file_name
# Extract from file path
df = logs_df.withColumn('file_path', input_file_name())
df = df.withColumn('application_id',
regexp_extract('file_path', r'application_(\d+_\d+)', 0))
df = df.withColumn('container_id',
regexp_extract('file_path', r'(container_\d+_\d+_\d+_\d+)', 1))
- Use `broadcast()` for small lookup tables
- Cache intermediate DataFrames: `df.cache()` (see the sketch below)
- Avoid collecting large datasets to the driver
- Use DataFrame operations instead of RDDs
- Partition data appropriately: `repartition()`
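A short sketch of the caching and partitioning tips, assuming `parsed_df` is an intermediate DataFrame that several aggregations reuse (the partition count is purely illustrative):

```python
# Repartition for better parallelism, then cache the parsed data so later
# aggregations read from memory instead of re-parsing the raw logs.
parsed_cached = parsed_df.repartition(48).cache()

level_counts = parsed_cached.groupBy("log_level").count()
component_counts = parsed_cached.groupBy("component").count()

# The first action materializes the cache; subsequent actions reuse it.
level_counts.show(5)
component_counts.show(5)

# Free executor memory once the cached data is no longer needed.
parsed_cached.unpersist()
```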
from pyspark.sql.functions import to_timestamp
df = df.withColumn('timestamp',
to_timestamp('timestamp', 'yy/MM/dd HH:mm:ss'))
IMPORTANT: Always clean up your cluster to avoid unnecessary AWS charges!
# Automated cleanup
./cleanup-spark-cluster.sh
# Or manual cleanup:
# 1. Terminate EC2 instances
# 2. Delete key pair
# 3. Delete security group
assignment-spark-cluster/
├── README.md # This file - assignment instructions
├── DATASET_OVERVIEW.md # Complete dataset documentation
├── AUTOMATION_README.md # Cluster setup guide
├── ANALYSIS.md # Your analysis report (to be created)
├── setup-spark-cluster.sh # Automated cluster creation script
├── cleanup-spark-cluster.sh # Automated cleanup script
├── pyproject.toml # Python dependencies
├── .gitignore # Git ignore rules
├── data/
│ ├── raw/ # Raw log files (NOT committed - in .gitignore)
│ │ ├── Spark.tar.gz # Downloaded archive (NOT committed)
│ │ └── application_*/ # Extracted log directories (NOT committed)
│ ├── sample/ # Small sample for local testing
│ │ └── application_*/ # Sample application logs
│ └── output/ # Your CSV results (COMMIT THESE!)
│ ├── problem1_X_local.csv
│ └── problem1_X_cluster.csv
├── cluster-files/ # Cluster configuration templates
├── src/ # Source code utilities
├── problem1_X_local.py # Your solution scripts (to be created)
└── problem1_X_cluster.py # Your cluster scripts (to be created)
Important Notes:
- `data/raw/` is in `.gitignore` - never commit large datasets
- `data/output/` is NOT in `.gitignore` - commit your CSV results here
- Configuration files with secrets (`.pem`, `cluster-config.txt`) are in `.gitignore`
- Instance type: t3.large (2 vCPU, 8 GB RAM)
- Number of instances: 4 (1 master + 3 workers)
- Estimated cost: ~$0.33/hour for the cluster
Expected total cost: $3-5 for completing the assignment (assuming 10-15 hours including cluster time)
Remember to terminate your cluster when not actively using it!
- Java not found:
  - `export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64`
- Spark session fails:
  - Verify cluster is running: check the Master UI
  - Check master URL format: `spark://PRIVATE_IP:7077`
  - Ensure security group allows worker communication
- S3 access denied:
  - Verify IAM role attached to EC2 instances
  - Check S3 bucket permissions
- Parsing errors:
  - Handle malformed log lines gracefully
  - Use `try`/`except` in UDFs for robust parsing (see the sketch after this list)
  - Review DATASET_OVERVIEW.md for log format details
- Check Spark logs: `$SPARK_HOME/logs/`
- Monitor Spark Web UI for job progress and errors
- Refer to PySpark Documentation
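For the robust-parsing tip above, a defensive UDF might look like this sketch (assumes `logs_df` holds raw lines in a column named `value`; the pattern is illustrative):

```python
import re

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

LEVEL_RE = re.compile(r"\b(INFO|WARN|ERROR|DEBUG)\b")

@udf(returnType=StringType())
def safe_level(line):
    # Return None instead of raising when a line is empty or malformed,
    # so one bad record cannot fail the whole job.
    try:
        match = LEVEL_RE.search(line or "")
        return match.group(1) if match else None
    except Exception:
        return None

df = logs_df.withColumn("log_level", safe_level("value"))
```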
- Never commit `.pem` files or AWS credentials
- Limit security group access to your IP only
- Use IAM roles for S3 access (not access keys)
- Always clean up resources when done
This is an individual assignment. You may discuss high-level concepts with classmates, but all code must be your own work. Copying code from others or online sources without attribution is prohibited.
Submit your completed assignment by pushing to your GitHub repository:
git add .
git commit -m "Complete Spark log analysis assignment"
git push origin main
Ensure your repository includes all deliverables listed above.
MIT License - See LICENSE file for details.