## Summary

Experience in software development and hardware design, from processor architecture to physical implementation. Knowledge of computer architecture, FPGA/ASIC/VLSI, computer science, and machine learning fundamentals. Interests in financial markets and efficient machine learning implementations.

## Experience

Intel, Austin, Texas

## **SoC Performance Architect**

2017-present

Provide workload performance at power analysis/workload projections for upcoming Xeon server SoCs. Enhanced SoC modeling capabilities, e.g., linked multiple simulators with a common timing kernel (C++). Created statistics parsers, dashboards, visualizations, workflow automation, and modeling tools to accelerate analysis. Conducted server/client SoC performance tuning, analysis, and validation for interconnect and memory systems.

## **Design Automation Engineer**

2016-2017

Developed automation solutions to deliver CPU physical implementation up to successful SoC tape-in.

Qualcomm Research, San Diego, California

## **Research Intern**

Summer 2013

Performed mixed-signal circuit design verification and FPGA prototyping.

### Skills

| Programming                          | Software/Machine Learning             | ASIC Design/SoC Performance         |
|--------------------------------------|---------------------------------------|-------------------------------------|
| Proficient in C++ and Python         | PyTorch, Git, Docker, Keras           | SystemC, Platform Architect, Simics |
| Tcl, Java, Clojure, LaTeX, Unix, SQL | Spark, Scikit-learn, Pandas ETL       | RTL Design: SystemVerilog/Verilog   |
| HTML+CSS, JavaScript, Node.js        | Shell scripting, basic finance/crypto | Place-Route, DFT, Timing, DRC/LVS   |

# **Select Publications/Awards**

A Logic-on-logic 3D-stacked Heterogeneous Multi-core Processor. IEEE ICCD 2017.

Physical Design of a 3D-stacked Heterogeneous Multi-core Processor. IEEE 3D-IC 2016.

Ranked 34<sup>th</sup> in USA, IEEExtreme 24-hour Programming Competition, 2014. Team of 2.

Best FPGA Implementation at International LSI Design Contest, Japan 2009. Xilinx Award. Team of 3.

### Education

## North Carolina State University

Raleigh, North Carolina

GPA: 3.98/4.0. 2016

Ph.D. in Computer Engineering

Dissertation: Three-Dimensional Integration of Heterogeneous Multi-Core Processors.

Built a functional 3D-IC processor chip. Developed custom 3D-IC physical implementation flow.

Performed architecture analysis, processor verification, and entire back-end flow up to deliverable layout.

Teaching Assistant (graduate-level): Design of Digital Systems, Computer Design & Technology.

Courses: Advanced Microarchitecture, ASIC Design, Electronic Sys. Level Design, Software Engineering, Computer Networks, Parallel Computer Arch., ASIC Verification, Physical Design, Memory Systems, Computer Design & Tech., IC Technology & Fabrication, VLSI Systems Design, Embedded Systems Design, Digital Electronics, VLSI System Testing (Duke U.), Modern Computer Algebra.

## Bandung Institute of Technology

Indonesia

### B.S. in Electrical Engineering, with distinction

2009

Thesis: C implementation of an on-chip feedforward neural network and Kohonen SOM (both training and inference, floating point and fixed point) on a multi-core Parallax microcontroller. TA: Digital Systems, Microprocessor Lab.

Oita University, Japan

## Exchange Student, Research & Coursework

2007-2008

Implemented control of a panning camera using neural networks (C). Used a neural network to track face location relative to the frame center and provide control commands to the camera.

# **Project Experience**

## **Machine Learning**

PyTorch: Integrated and analyzed model quantization coupled with a feedback alignment training algorithm (open-source libraries). Ran experiments on alternatives to the back-propagation algorithm, e.g., a binarized neural network with a greedy training approach.

Benchmarked quantized and non-quantized MobileNet and SqueezeNet models on Android using TensorFlow Lite.

## **Memory Systems**

Modeled and compared the performance of ideal and non-ideal cache block placement policies for multicore systems (requestor core cache vs. remote core cache). Analyzed experiment results from running SPEC2K benchmarks in Simics.

## **ESL & Physical Design**

Performed TLM & ESL modelling of an SoC design consisting of an ARM Cortex core, a DRAM model, and an AMBA bus. Performed physical design optimizations, signal integrity analysis, power analysis, and timing analysis. Tools: SystemC, Mentor Graphics Vista, Catapult, Python, C++, UML, Encounter, PrimeTime.

## **Parallel Computer Architecture**

Implemented a simulator for the MSI, MESI, and MOESI cache coherence protocols in C++.

Explored cache coherence protocols to reduce off-chip memory accesses.

## **Computer Design and Technology**

Implemented a generic cache simulator, branch target buffer simulator, and Tomasulo superscalar processor simulator in C++.

Implemented a checkpoint recovery mechanism for a large-fetch-window processor within the SimpleScalar simulator environment in C++.

## **Advanced Microarchitecture**

Implemented and compared thread migration strategies within the SimpleScalar simulator in C++.

## **ASIC Verification**

Verified an out-of-order superscalar core (FabScalar) for tape-out and found design bugs in the load-store unit and issue queue. Created a reusable SystemVerilog testbench executed in QuestaSim.

## **Digital Electronics**

Designed a low-power Hybrid Latch Flip-flop in an academic 45 nm tech library. Operating clock frequency 4 GHz, power consumption 19.9 µW, setup time 13.5 ps, hold time 86 ps, $t_{DQ}$ of 63.64 ps.

Designed voltage-mode and current-mode differential transmitter circuits. Tools: HSPICE.

## **VLSI Systems Design**

Designed a full-custom 3x3 arbiter-crossbar CMOS unit; ranked second-best on performance and energy\*delay-squared metric out of 27 teams. Customized the power delivery network and clock tree design. Created a custom standard cell library and performed top-level integration. Achieved 5.5 GHz clock frequency, 0.19 nW power, with the FreePDK45 technology library. Tools: Cadence Virtuoso, HSPICE, Calibre DRC-LFD.

## **ASIC Design**

Implemented a Viterbi Decoder in RTL Verilog. Optimized the throughput and delay per unit area metric by designing a fast floating point unit, using dual-port memory, and pipelining.

## **RTL Design, FPGA Prototyping**

Implemented "Sokoban" (moving box puzzle game) on FPGA: coded the game in MIPS assembly by hand (prototyped in C). Wrote MIPS processor RTL from scratch (team effort, 1 GHz clock in a commercial 180 nm process). Wrote the Verilog code to interface with FPGA buttons and render VGA graphics. Created game sprites.

### **Online Courses**

Startup Engineering (Coursera), Analysis of Algorithms, Scalable Machine Learning (edX).

## Silicon Implementation / Tape-outs

Successful academic tape-out (functional 3D-IC processor chip) of a heterogeneous multi-core processor system with thread migration features at NCSU. The processor implementation has two stacked dies of 5.25 mm x 5.25 mm on a 130 nm process.