## Summary

Strong C++/Python coding. Drove end-to-end academic processor implementation yielding functional silicon. Experience in computer architecture, digital ASIC/physical design automation, and edge machine learning projects.

## Experience

Intel, Austin, Texas

### Silicon Architecture Engineer (SoC Performance)

2017-present

Awarded for accelerating delivery of performance projections to customers and improving modeling methodologies. Analyze the impact of architectural and fabrication process changes on server SoC performance at a given power/thermal budget using regression analysis. Perform pre-silicon performance tuning and validation of interconnect fabric and I/O. Modeled features in the simulator; developed tools and dashboards for data analysis and multi-simulator integration.

### Design Automation Software Engineer (ASIC Physical Design)

Jan 2016-2017

Developed physical design flows and supported team members toward a successful CPU design tape-in.

Qualcomm Research, San Diego, California

Research Intern, Summer 2013

Performed mixed-signal circuit design verification, post-silicon measurements, and FPGA prototyping.

## Skills

| Programming                                 | Machine Learning/Tools              | Processor Implementation          |
|---------------------------------------------|-------------------------------------|-----------------------------------|
| Strong in C++, Python, Tcl                  | Strong in Pandas, PyTorch, SKlearn. | SystemVerilog/Verilog, SystemC    |
| Basic Java, JavaScript, Clojure, Perl       | Git, Docker, Dashboards, D3         | Platform Architect, Simics        |
| Unix shell, HTML, SQLite, Node.js           | XGBoost, Regressions, Efficient ML  | Place-Route, DFT, Timing, DRC/LVS |

## Select Publications/Awards

A Logic-on-logic 3D-stacked Heterogeneous Multi-core Processor. IEEE ICCD 2017.

Physical Design of a 3D-stacked Heterogeneous Multi-core Processor. IEEE 3D-IC 2016.

Ranked 34<sup>th</sup> in USA, IEEExtreme 24-hour Programming Competition, 2014. Team of 2.

Best FPGA Implementation at International LSI Design Contest, Japan 2009. Xilinx Award. Team of 3.

## Education

## North Carolina State University

Raleigh, North Carolina

#### Ph.D. in Computer Engineering

3.98/4.0. Fall 2010 - Spring 2016

Dissertation: Three-Dimensional Integration of Heterogeneous Multi-Core Processors.

Research team built a functional 3D-IC processor chip. Developed automated 3D-IC physical design flow. Performed architecture analysis, verification, and entire back-end flow up to deliverable layout (3 chips).

Research Assistant ('11-'15). Teaching Assistant: Design of Digital Systems, Computer Design & Technology.

Coursework: Software Engineering, Advanced Microarchitecture, ASIC Design, Electronic Sys. Level Design, Computer Networks, Parallel Computer Arch., ASIC Verification, Physical Design, Memory Systems, IC Technology & Fabrication, VLSI Systems Design, Computer Design & Tech., Embedded Systems Design, Modern Comp. Algebra-AU, VLSI System Testing (Duke U.), Digital Electronics.

Duke University

Durham, North Carolina

#### Visiting Scholar: coursework, research collaboration

Spring 2013

### Bandung Institute of Technology

Indonesia

#### B.S. in Electrical Engineering

Fall 2004 - Spring 2009

Thesis: C implementation of a neural network and Kohonen SOM: training/inference, floating/fixed point, on a multi-core Parallax microcontroller. Teaching Assistant: Digital Systems, Microprocessor Lab.

Oita University, Japan

**Exchange Student: Research (Shibata Lab), Engineering/Cultural Coursework** Fall 2007 - Spring 2008

Implemented a face-following feature on a panning camera using a feed-forward neural network (coded in C).

## Project Experience

#### **Machine Learning**

Implemented a quantized LeNet model with feedback alignment training (open-source libraries) in PyTorch. Experimented with new training algorithms for binarized neural networks in PyTorch.

Benchmarked deep learning models in quantized and non-quantized forms (OpenVINO, TensorFlow Lite).

#### End-to-end Silicon Implementation / Tape-outs

Successful academic tape-out (functional fabricated 3D-IC processor chip) of a heterogeneous multi-core processor system with thread migration features at NCSU. Developed a fully automated back-end flow within a single Makefile. Developed visualization tools, including a chip pin-out diagram in TikZ and a rendering of 3D-IC interconnect pins in D3.js. Performed final physical verification (DRC/LVS) and layout fixes (e.g. antenna rule violations) for signoff. The silicon implementation has two stacked dies of 5.25 mm x 5.25 mm on a 130 nm process.

#### RTL Design (Verilog), FPGA Prototyping, bare-metal programming (C/asm)

Implemented "Sokoban" (moving box puzzle game) on FPGA: coded the game in MIPS assembly by hand (prototyped in C). Wrote MIPS processor Verilog RTL with pipelining, data forwarding, and a Kogge-Stone adder (team effort, yielded 1 GHz clock in a commercial 180 nm process). Wrote Verilog code for debouncing FPGA buttons and rendering graphics through a VGA interface. Created game sprites.

#### **Memory Systems**

Modelled and compared the performance of ideal and non-ideal block placement policies for multicore systems. Cache block placement policy: requestor core cache vs remote core cache. Analyzed experiment results from running SPEC2K benchmarks in SIMICS.

#### ESL & Physical Design (SystemC, HLS)

Performed TLM & ESL modelling of an SoC design that consists of an ARM Cortex core, DRAM model, and AMBA bus. Performed physical design optimizations, signal integrity analysis, power analysis, timing analysis. Tools: SystemC, Mentor Graphics Vista, Catapult, Python, C++, UML, Encounter, PrimeTime.

#### **Parallel Computer Architecture**

Implemented a cache coherence protocol (MSI, MESI, MOESI) simulator in C++.

Explored enhancements to cache coherence protocols to reduce off-chip memory accesses.

#### Computer Design and Technology (C++)

Implemented cache, branch target buffer, and Tomasulo superscalar processor simulators.

Implemented a checkpoint recovery mechanism for a large-fetch-window processor within the SimpleScalar simulator.

#### Advanced Microarchitecture (C++)

Implemented and compared cross-core thread migration strategies within the SimpleScalar simulator framework.

#### ASIC Verification (SystemVerilog)

Verified an out-of-order superscalar core (FabScalar) for tape-out, found design bugs in load-store unit and issue queue. Created a reusable SystemVerilog testbench executed in QuestaSim.

#### Digital Electronics (CMOS circuit design)

Designed a low-power Hybrid Latch Flip-flop in an academic 45 nm tech library. Operating clock frequency 4 GHz, power consumption 19.9 uW, setup time 13.5 ps, hold time 86 ps, $t_{DQ}$ of 63.64 ps.

Designed a voltage-mode and current-mode differential transmitter circuit. Tools: HSPICE.

#### VLSI Systems Design (logic design, physical layout)

Designed a full-custom 3x3 arbiter-crossbar CMOS unit; placed 2nd out of 27 teams on performance and energy\*delay-squared metric. Customized power delivery network and clock tree design. Created custom standard cell library and top-level integration. Achieved 5.5 GHz clock frequency, 0.19 nW power, with FreePDK45 technology library. Tools: Cadence Virtuoso, HSPICE, Calibre DRC-LFD.

#### ASIC Design (Verilog)

Implemented a Viterbi Decoder in RTL Verilog. Optimized the throughput and delay-per-unit-area metrics by designing a fast floating point unit, using dual port memory, and pipelining.

#### **Online Courses**

Machine Learning, Startup Engineering, Analysis of Algorithms (Coursera). Fast.ai. Scalable ML/Spark (edX).