# 📊 Data Engineering & Data Science Portfolio

**Author**: Portfolio Showcase  
**Last Updated**: 2025-11-29  
**Target Roles**: Data Engineer | Data Scientist | Data Architect  

---

## 🎯 Overview

This portfolio demonstrates **production-grade** data engineering, machine learning, and distributed systems expertise through real-world projects. All notebooks are executable and showcase end-to-end implementation from data ingestion to deployment.

### 🔑 Key Competencies

✅ **Data Engineering**: ETL pipelines, streaming (Kafka), batch processing (PySpark)  
✅ **Machine Learning**: Deep learning (PyTorch), NLP, time series forecasting  
✅ **Cloud Architecture**: Azure integration, Data Lake, blob storage  
✅ **Distributed Systems**: Hadoop, HDFS, Docker, containerization  
✅ **Production Patterns**: Monitoring, error handling, cost optimization  

---

## 📚 Portfolio Structure

### 🤖 Machine Learning & Finance

#### [ML_Finance/GARCH_LSTM_Forecasting.ipynb](ML_Finance/GARCH_LSTM_Forecasting.ipynb)
**Cryptocurrency Trading System with LSTM**

- **Pipeline**: Kafka → SQLite → Feature Engineering → LSTM
- **Technologies**: PyTorch, Kafka, SQLite, Technical Analysis
- **Highlights**: 
  - 50+ technical indicators (Fibonacci, RSI, ATR)
  - Dual-output LSTM (price + volatility)
  - Production daemon with paper/live trading
  - Data leakage prevention (temporal splits)
- **Skills**: ML Engineering, Financial Modeling, Streaming ETL
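
The temporal split mentioned in the highlights can be illustrated with a minimal sketch. It assumes rows are already sorted by timestamp; the function name `temporal_split` and the 70/15/15 fractions are illustrative assumptions, not values taken from the notebook.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    rows: Sequence[T], train_frac: float = 0.7, val_frac: float = 0.15
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered rows into train/val/test without shuffling.

    Every validation row is later than every training row, and every test
    row is later than every validation row, so no future data leaks backward.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return list(rows[:train_end]), list(rows[train_end:val_end]), list(rows[val_end:])


# Hypothetical data: integers stand in for chronologically ordered candles.
candles = list(range(100))
train, val, test = temporal_split(candles)
```

Any scaler or indicator statistics would then be fitted on `train` only and applied unchanged to `val` and `test`.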

---

### 🎮 Game Theory & Reinforcement Learning

#### [ML_GameTheory/Poker_DeepRL.ipynb](ML_GameTheory/Poker_DeepRL.ipynb)
**Poker Strategy Analysis with Deep RL**

- **Approach**: Game-theoretic AI, Nash equilibrium approximation
- **Technologies**: Python, Deep Learning, Monte Carlo Simulation
- **Highlights**:
  - Self-attention for state representation
  - Counterfactual regret minimization
  - Monte Carlo equity calculation
  - GTO strategy implementation
- **Skills**: Reinforcement Learning, Game Theory, Statistical Simulation
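
Counterfactual regret minimization builds on regret matching: at each decision point, actions are played in proportion to their positive cumulative regret. The sketch below shows only that step; the function name and the example regrets are assumptions, and the notebook's own implementation may differ.

```python
from typing import List


def regret_matching(cumulative_regrets: List[float]) -> List[float]:
    """Turn cumulative regrets into a mixed strategy.

    Actions with positive regret get probability proportional to that regret;
    if no action has positive regret, fall back to the uniform strategy.
    """
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(cumulative_regrets)] * len(cumulative_regrets)
    return [p / total for p in positive]


# Hypothetical regrets for three actions, e.g. fold / call / raise.
strategy = regret_matching([3.0, -1.0, 1.0])
uniform = regret_matching([-2.0, -1.0])
```

In a full CFR loop, the average of these per-iteration strategies (not the last one) is what approximates the equilibrium.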

---

### 🧠 NLP & Transformers

#### [NLP/Custom_Transformer_Implementation.ipynb](NLP/Custom_Transformer_Implementation.ipynb)
**Transformer Architecture from Scratch**

- **Implementation**: Character-level language model (10M parameters)
- **Technologies**: Python, NumPy, Deep Learning
- **Highlights**:
  - Self-attention mechanism with causal masking
  - Sinusoidal positional encoding
  - Multi-head attention (6 heads, 6 layers)
  - Nucleus sampling, top-k, temperature scaling
- **Skills**: Deep Learning Architecture, NLP, Low-level Implementation
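
Two of the highlights above, sinusoidal positional encoding and causal masking, are small enough to sketch in plain Python. This is an illustration of the standard formulas, not the notebook's code; the function names are assumptions.

```python
import math
from typing import List


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row: List[float] = []
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[q][k] is True where query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)
mask = causal_mask(3)
```

Positions where the mask is `False` are typically set to negative infinity before the softmax, so a token never attends to later tokens.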

---

### 📄 Document Processing & NLP Engineering

#### [NLP_Engineering/Document_Processing_Pipeline.ipynb](NLP_Engineering/Document_Processing_Pipeline.ipynb)
**PDF Analysis with OpenAI Integration**

- **Pipeline**: PDF Extraction → OpenAI Analysis → LaTeX Generation
- **Technologies**: PyPDF2, OpenAI API, LaTeX, scikit-learn
- **Highlights**:
  - Automated exam question extraction
  - AI-powered solution generation
  - TF-IDF clustering for similarity detection
  - Cost-optimized API batching
- **Skills**: ETL Pipelines, API Integration, NLP, Document Generation
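
The similarity-detection idea can be sketched without any dependencies. The notebook lists scikit-learn; the version below is a standard-library stand-in that uses a smoothed IDF of the same form, and the example questions are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical exam questions.
questions = [
    "solve the integral of x squared",
    "compute the integral of x squared",
    "define a binary search tree",
]
vecs = tfidf_vectors(questions)
sim_close = cosine_similarity(vecs[0], vecs[1])
sim_far = cosine_similarity(vecs[0], vecs[2])
```

Pairs whose similarity exceeds a chosen threshold can then be grouped as near-duplicates before any paid API call is made.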

---

### ☁️ Cloud Integration

#### [Cloud_Integration/Azure_Face_DataLake.ipynb](Cloud_Integration/Azure_Face_DataLake.ipynb)
**Azure Face API + Data Lake Pipeline**

- **Architecture**: Blob Storage → Face API → Data Lake → Analytics
- **Technologies**: Azure (Face API, Blob, Data Lake), Python, Streamlit
- **Highlights**:
  - Batch processing for 1000s of images
  - Face detection + attribute extraction
  - Tiered storage (hot/cool/archive)
  - Cost optimization strategies
- **Skills**: Cloud Architecture, Azure Services, Data Lake Design
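
Tiered storage usually comes down to a rule that maps access recency to a tier. The sketch below shows one such rule; the 30-day and 180-day thresholds and the function name are hypothetical, and real policies would normally live in a storage lifecycle configuration rather than in application code.

```python
def choose_tier(
    days_since_last_access: int,
    cool_after_days: int = 30,
    archive_after_days: int = 180,
) -> str:
    """Pick a storage tier from access recency (thresholds are illustrative)."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= archive_after_days:
        return "archive"
    if days_since_last_access >= cool_after_days:
        return "cool"
    return "hot"


# Hypothetical blobs with the number of days since each was last read.
tiers = {name: choose_tier(age) for name, age in
         {"today.jpg": 0, "last_quarter.jpg": 90, "old_batch.jpg": 400}.items()}
```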

---

### 🖥️ Distributed Systems

#### [Distributed_Systems/Hadoop_Docker_Setup.ipynb](Distributed_Systems/Hadoop_Docker_Setup.ipynb)
**Containerized Hadoop Cluster**

- **Stack**: Docker Compose → Hadoop → HDFS → YARN → MapReduce
- **Technologies**: Docker, Hadoop, distributed computing
- **Highlights**:
  - Multi-node cluster orchestration
  - HDFS with 3x replication
  - MapReduce job execution
  - Horizontal scaling patterns
- **Skills**: DevOps, Infrastructure-as-Code, Big Data Architecture
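
The cost of 3x replication is easy to quantify with a little arithmetic. The sketch below assumes a 128 MiB block size (the default in recent Hadoop releases) and uses illustrative file and disk sizes; the function names are assumptions.

```python
def hdfs_block_replicas(
    file_size_bytes: int,
    block_size_bytes: int = 128 * 1024**2,
    replication: int = 3,
) -> int:
    """Number of block replicas stored across the cluster for one file."""
    blocks = -(-file_size_bytes // block_size_bytes)  # integer ceiling division
    return blocks * replication


def usable_capacity_bytes(raw_capacity_bytes: int, replication: int = 3) -> int:
    """Approximate user-visible capacity when every block is stored `replication` times."""
    return raw_capacity_bytes // replication


one_gib = 1024**3
replicas = hdfs_block_replicas(one_gib)        # 8 blocks, each stored 3 times
usable = usable_capacity_bytes(3 * 1024**4)    # 3 TiB of raw disk
```

With these assumptions a 1 GiB file occupies 24 block replicas, and 3 TiB of raw disk yields roughly 1 TiB of usable space.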

---

### 🔥 Data Engineering: Spark & Kafka

#### [PySpark/PySpak_Tutorial.ipynb](PySpark/PySpak_Tutorial.ipynb)
**PySpark Data Processing**

- **Focus**: Batch ETL, Spark SQL, DataFrames
- **Highlights**: Window functions, UDFs, performance optimization

#### [PySpark/Kafka_ETL/Kafka_ETL.ipynb](PySpark/Kafka_ETL/Kafka_ETL.ipynb)
**Real-time Streaming with Kafka**

- **Focus**: Structured Streaming, exactly-once semantics
- **Highlights**: Checkpointing, fault tolerance, schema evolution
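
Exactly-once results in a streaming job generally rest on three pieces: a replayable source, a checkpoint that records progress, and a sink whose writes are idempotent. The toy model below imitates that interplay in plain Python. It is not the Spark or Kafka API, and every name in it is hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (offset, payload); offsets are contiguous from 0


class IdempotentSink:
    """Writes are keyed by offset, so replaying a batch overwrites instead of duplicating."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}

    def write(self, offset: int, payload: str) -> None:
        self.rows[offset] = payload


def process_batch(source: List[Record], sink: IdempotentSink,
                  checkpoint: Dict[str, int], batch_size: int,
                  fail_before_commit: bool = False) -> None:
    """Read from the checkpointed offset, write to the sink, then commit the checkpoint."""
    start = checkpoint.get("next_offset", 0)
    batch = [rec for rec in source if start <= rec[0] < start + batch_size]
    for offset, payload in batch:
        sink.write(offset, payload.upper())
    if fail_before_commit:
        raise RuntimeError("simulated crash after write, before checkpoint commit")
    checkpoint["next_offset"] = start + len(batch)


source = [(i, f"event-{i}") for i in range(5)]
sink, checkpoint = IdempotentSink(), {}
try:
    process_batch(source, sink, checkpoint, batch_size=3, fail_before_commit=True)
except RuntimeError:
    pass  # on restart the checkpoint is unchanged, so the same batch is replayed
while checkpoint.get("next_offset", 0) < len(source):
    process_batch(source, sink, checkpoint, batch_size=3)
```

The first batch is written twice (once before the simulated crash, once on replay), yet the sink ends up with each record exactly once because writes are keyed by offset.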

---

### 📊 Causal Inference & Econometrics

#### [IIEL_notebook/notebook_IIEL.ipynb](IIEL_notebook/notebook_IIEL.ipynb)
**Spatial Econometrics & Treatment Effects**

- **Focus**: Panel data, fixed effects, event studies
- **Highlights**: Tkinter GUI, Folium maps, power simulations
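
For readers unfamiliar with fixed effects, the core idea is the within transformation: subtract each entity's mean from its observations, then regress the demeaned outcome on the demeaned regressor. The sketch below shows this for a single regressor; the function name and the synthetic panel are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def within_estimator(panel: List[Tuple[str, float, float]]) -> float:
    """Slope of y on x after removing entity means.

    Each panel row is (entity_id, x, y). Demeaning by entity removes any
    time-invariant entity effect before the slope is computed.
    """
    groups: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for entity, x, y in panel:
        groups[entity].append((x, y))
    numerator = denominator = 0.0
    for observations in groups.values():
        x_bar = sum(x for x, _ in observations) / len(observations)
        y_bar = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - x_bar) * (y - y_bar)
            denominator += (x - x_bar) ** 2
    if denominator == 0.0:
        raise ValueError("x has no within-entity variation")
    return numerator / denominator


# Synthetic panel: y = entity effect + 2 * x, with effects 10 (A) and -5 (B).
panel = [("A", x, 10.0 + 2.0 * x) for x in (1.0, 2.0, 3.0)]
panel += [("B", x, -5.0 + 2.0 * x) for x in (4.0, 5.0, 6.0)]
beta = within_estimator(panel)
```

The estimate recovers the slope of 2 even though the two entities have very different intercepts, which a pooled regression on this data would not do.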

---

## 🛠️ Technology Stack

### Programming & Frameworks
```
Python (Advanced)    │ PyTorch │ TensorFlow │ scikit-learn
PySpark (Production) │ Pandas │ NumPy │ SQL
```

### Cloud & Infrastructure
```
Azure (Face API, Blob, Data Lake) │ Docker │ Kubernetes
Kafka │ Hadoop │ HDFS │ YARN
```

### Data Engineering
```
ETL Pipelines │ Streaming (Kafka, Spark) │ Batch Processing
Data Modeling │ Schema Design │ Data Quality
```

### Machine Learning
```
Deep Learning │ NLP (Transformers, LSTM) │ Time Series
Reinforcement Learning │ Computer Vision │ MLOps
```

---

## 📈 Project Metrics

| Category | Metric | Value |
|----------|--------|-------|
| **Data Volume** | Max dataset processed | 100K+ rows |
| **ML Models** | Parameters (largest) | ~10M (Transformer) |
| **Streaming** | Latency | <200 ms (real-time) |
| **Cloud** | Images processed | 1000s/batch |
| **Distributed** | Cluster nodes | 3-10 nodes (scalable) |
| **API Integration** | Services | OpenAI, Azure, Binance |

---

## 🎓 Skills Demonstrated

### Data Engineering (70%)
- ✅ ETL pipeline design and implementation
- ✅ Streaming data processing (Kafka, Spark Streaming)
- ✅ Batch processing optimization (PySpark)
- ✅ Data modeling and schema design
- ✅ Cloud data lakes and warehouses
- ✅ Distributed systems (Hadoop, HDFS)

### Data Science (20%)
- ✅ Machine learning model development
- ✅ Deep learning (PyTorch, custom architectures)
- ✅ NLP and transformers
- ✅ Time series forecasting
- ✅ Statistical modeling and causal inference

### Data Architecture (10%)
- ✅ System design for data platforms
- ✅ Infrastructure-as-Code (Docker, Kubernetes)
- ✅ Cloud architecture (Azure)
- ✅ Scalability patterns
- ✅ Cost optimization

---

## 🚀 Quick Start Guide

### For Recruiters

**Recommended Reading Order**:

1. **Data Engineering Focus**: 
   - Start with `PySpark/Kafka_ETL/Kafka_ETL.ipynb` (streaming)
   - Then `Distributed_Systems/Hadoop_Docker_Setup.ipynb` (architecture)
   - Finally `Cloud_Integration/Azure_Face_DataLake.ipynb` (cloud)

2. **Data Science Focus**:
   - Start with `ML_Finance/GARCH_LSTM_Forecasting.ipynb` (production ML)
   - Then `NLP/Custom_Transformer_Implementation.ipynb` (deep learning)
   - Finally `ML_GameTheory/Poker_DeepRL.ipynb` (advanced AI)

3. **Full-Stack Data Role**:
   - Read all notebooks in order listed above

### Running Notebooks

```bash
# Install dependencies
pip install jupyter pandas numpy matplotlib seaborn scikit-learn

# Launch Jupyter
jupyter notebook

# Navigate to INDEX.ipynb
```

**Note**: Some notebooks require external services (Azure, OpenAI API keys). Mock data is provided for demonstration.

---

## 📞 Contact & Links

**Portfolio Repositories**:
- [ffws_GARCHLSTM](https://github.com/anarcoiris/ffws_GARCHLSTM) - Financial ML
- [DeepGamble](https://github.com/anarcoiris/DeepGamble) - Game Theory AI
- [MiNiLLM](https://github.com/anarcoiris/MiNiLLM) - Transformer Implementation
- [Examn_Xterminator](https://github.com/anarcoiris/Examn_Xterminator) - Document Processing
- [FaceGUI](https://github.com/anarcoiris/FaceGUI) - Azure Cloud Integration
- [docker-hadoop](https://github.com/anarcoiris/docker-hadoop) - Distributed Systems

---

## 🎯 Why This Portfolio?

### Production-Ready Code
- ✅ Error handling and validation
- ✅ Logging and monitoring
- ✅ Cost optimization
- ✅ Scalability patterns
- ✅ Documentation and testing

### Real-World Complexity
- ✅ Multi-service integration
- ✅ Distributed systems
- ✅ Cloud architecture
- ✅ Data at scale (100K+ rows)
- ✅ Production deployment patterns

### Breadth & Depth
- ✅ 10+ technologies demonstrated
- ✅ End-to-end pipelines
- ✅ ML + Engineering + Cloud
- ✅ Batch + Streaming + Real-time

---

## 📊 Portfolio Statistics

```
Total Notebooks:        10+
Lines of Code:          5000+ (across projects)
Technologies:           15+ frameworks/tools
External Services:      3 (Azure, Binance, OpenAI)
Distributed Nodes:      Up to 10 (Hadoop cluster)
ML Model Parameters:    10M (largest model)
Data Processed:         100K+ rows (batch), Real-time (streaming)
```

---

*This portfolio showcases enterprise-grade data engineering and data science capabilities for production environments.*

**Last Updated**: November 2025  
**Status**: ✅ All notebooks executable with demo data


```python
# Portfolio overview visualization
import matplotlib.pyplot as plt

# Self-assessed proficiency per skill area (percent)
skills = ['Data Engineering', 'Machine Learning', 'Cloud/DevOps', 'Data Science', 'Distributed Systems']
proficiency = [90, 85, 80, 85, 75]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: horizontal bar chart of skill proficiency
axes[0].barh(skills, proficiency, color='steelblue')
axes[0].set_xlabel('Proficiency (%)')
axes[0].set_title('Technical Skills Portfolio', fontweight='bold')
axes[0].set_xlim(0, 100)
axes[0].grid(axis='x', alpha=0.3)

# Right panel: pie chart of project counts per category
categories = ['ML/DL', 'ETL/Streaming', 'Cloud', 'Distributed\nSystems', 'NLP']
project_counts = [3, 3, 1, 1, 2]
axes[1].pie(project_counts, labels=categories, autopct='%1.0f%%', startangle=90)
axes[1].set_title('Portfolio Coverage by Category', fontweight='bold')

plt.tight_layout()
plt.show()

print("Portfolio loaded successfully!")
print("Navigate to any notebook above to explore specific projects.")
```