MazenEwiss/VLSI_Project

Convolution Accelerator - VLSI Hardware Design Project

📋 Project Overview

This repository contains a complete hardware implementation of a 2D Convolution Accelerator designed for high-performance image processing and deep learning applications. The accelerator is built using an 8×8 Systolic Array architecture and implements a streaming coprocessor model for efficient matrix convolution operations.

Course: CMP3020 - VLSI Design
Institution: Cairo University Faculty of Engineering (CUFE), Computer Engineering Department
Architecture: Domain-Specific Accelerator for 2D Convolution
Target Technology: Sky130 PDK (130nm)


🎯 Key Features

Hardware Capabilities

  • 8×8 Systolic Array with 64 Processing Elements (PEs)
  • 32KB On-Chip SRAM with ping-pong buffering for continuous operation
  • Configurable Matrix Sizes: 16×16 to 64×64 input matrices
  • Flexible Kernel Support: 2×2 to 16×16 convolution kernels
  • 8-bit Unsigned Integer arithmetic with 32-bit internal accumulation
  • AXI-Stream-like Interface with Valid/Ready handshake protocol
  • Weight Stationary Dataflow for optimal energy efficiency

Design Highlights

  • ✅ Modular Architecture: Separate control, memory, and compute subsystems
  • ✅ Memory Efficiency: Ping-pong buffering hides DRAM access latency
  • ✅ Address Generation Unit (AGU): Handles 2D-to-1D address mapping with sliding windows
  • ✅ Tiling Support: Processes large matrices in hardware-sized blocks
  • ✅ Handshake Protocols: Backpressure-aware data streaming
  • ✅ Comprehensive Testbenches: Self-checking verification environment
  • ✅ Golden Model: Python reference implementation for validation

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   Convolution Accelerator Top                   │
│                                                                 │
│   ┌──────────────┐      ┌──────────────┐      ┌─────────────┐   │
│   │   Control    │      │    Memory    │      │  Systolic   │   │
│   │     Unit     │◄────►│  Controller  │◄────►│   Array     │   │
│   │    (FSM)     │      │  (Ping-Pong) │      │   (8×8)     │   │
│   └──────┬───────┘      └──────┬───────┘      └─────┬───────┘   │
│          │                     │                    │           │
│          │                     │                    │           │
│   ┌──────▼─────────────────────▼────────────────────▼───────┐   │
│   │       Data Loader & Address Generation Unit (AGU)       │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                 SRAM Buffer (32KB Max)                  │   │
│   │          (Sky130 1rw1r Pseudo-Dual Port SRAM)           │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────┬───────────────────────────────────┬───────────────────┘
          │                                   │
    rx_data (8-bit input)            tx_data (8-bit output)
    rx_valid/rx_ready                tx_valid/tx_ready

Component Breakdown

1. Systolic Array (8×8 Core)

  • 64 identical Processing Elements (PEs)
  • Each PE performs Multiply-Accumulate (MAC) operations
  • Weight Stationary dataflow: weights preloaded, pixels stream through
  • Pipeline depth: 15 cycles (ROWS + COLS - 1)
  • Location: rtl/core/systolic_array.v, rtl/core/processing_element.v
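
The weight-stationary flow and the ROWS + COLS - 1 pipeline depth can be illustrated with a small cycle-level model. This is only a sketch, not a translation of systolic_array.v: the function name `systolic_matvec`, the one-register-per-PE timing, and the row-skewed injection of inputs are all assumptions made for illustration.

```python
def systolic_matvec(weights, x):
    """Cycle-level sketch of a weight-stationary array (hypothetical model).

    weights[r][c] stays inside PE(r, c). Input x[r] enters row r at cycle r
    (skewed), moves right one PE per cycle, and partial sums move down.
    Returns (column_outputs, cycles_until_last_output).
    """
    rows, cols = len(weights), len(weights[0])
    a_reg = [[0] * cols for _ in range(rows)]  # activation registers
    p_reg = [[0] * cols for _ in range(rows)]  # partial-sum registers
    out = [None] * cols
    cycle = 0
    while None in out:
        new_a = [[0] * cols for _ in range(rows)]
        new_p = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if c == 0:
                    a_in = x[r] if cycle == r else 0   # skewed injection
                else:
                    a_in = a_reg[r][c - 1]             # from left neighbour
                p_in = p_reg[r - 1][c] if r > 0 else 0  # from PE above
                new_a[r][c] = a_in
                new_p[r][c] = p_in + a_in * weights[r][c]  # the MAC
        a_reg, p_reg = new_a, new_p
        cycle += 1
        for c in range(cols):
            if cycle == rows + c:  # column c finishes after rows + c cycles
                out[c] = p_reg[rows - 1][c]
    return out, cycle
```

For an 8×8 weight matrix the model returns its last column result after 15 cycles, which matches the pipeline depth quoted above.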

2. Memory Controller

  • Manages 32KB on-chip SRAM (configurable: 4KB-32KB)
  • Ping-pong buffering: simultaneous read from one buffer, write to another
  • 3-cycle read latency handling
  • Integrates Sky130 SRAM hard macros (1rw1r configuration)
  • Location: rtl/mem/memory_controller.v
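
As a rough behavioural picture of ping-pong buffering, the sketch below keeps two banks and swaps their roles once a tile has been written. The class `PingPongBuffer` and its methods are hypothetical names; the 3-cycle read latency and the SRAM macro interface handled by memory_controller.v are not modelled here.

```python
class PingPongBuffer:
    """Two banks: one is filled while the other is read (illustrative only)."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]
        self.write_sel = 0  # index of the bank currently being filled

    def write(self, addr, value):
        # Incoming data always goes to the bank selected for writing.
        self.banks[self.write_sel][addr] = value

    def read(self, addr):
        # The compute side always reads the other bank.
        return self.banks[1 - self.write_sel][addr]

    def swap(self):
        # Exchange roles once the write bank holds a complete tile.
        self.write_sel = 1 - self.write_sel
```

A typical sequence under this model is: write tile 0, call `swap()`, then read tile 0 while tile 1 is being written into the other bank.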

3. Control Unit

  • Main FSM orchestrator with states: IDLE, LOAD_INPUT, LOAD_WEIGHT, COMPUTE, DRAIN, DONE
  • Configuration management (matrix size N, kernel size K)
  • Handshake protocol controller
  • Coordinates between memory, AGU, and systolic array
  • Location: rtl/control/control_unit.v
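
The state list above can be read as a simple linear sequence. The sketch below assumes that the states are visited in the order listed and that each phase raises a single completion flag; the README does not spell out the transition conditions, so `start`, `phase_done`, and the DONE to IDLE return are assumptions rather than the behaviour of control_unit.v.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    LOAD_INPUT = 1
    LOAD_WEIGHT = 2
    COMPUTE = 3
    DRAIN = 4
    DONE = 5


def next_state(state, start=False, phase_done=False):
    """Return the next FSM state (assumed linear ordering of the phases)."""
    if state is State.IDLE:
        return State.LOAD_INPUT if start else State.IDLE
    if state is State.DONE:
        return State.IDLE  # assumption: return to IDLE after signalling done
    if phase_done:
        return State(state.value + 1)  # advance to the next listed phase
    return state  # stay in the current phase until it completes
```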

4. Address Generation Unit (AGU)

  • Converts 2D coordinates (x, y) to linear memory addresses
  • Implements sliding window patterns for convolution tiling
  • Handles "halo" pixels at tile edges: Input Tile Size = Array Size + (K-1) per dimension (e.g. a 10×10 input tile for the 8×8 array with K = 3)
  • Generates non-sequential access patterns efficiently
  • Location: rtl/control/address_generator.v
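
A minimal sketch of the address arithmetic, assuming a row-major layout (address = base + y × N + x) and tiles clipped at the matrix boundary. The function names and the `array_size` default are illustrative and are not taken from address_generator.v.

```python
def linear_address(x, y, n, base=0):
    """Map 2D coordinates to a 1D address, assuming row-major storage."""
    return base + y * n + x


def tile_addresses(tile_x, tile_y, n, k, array_size=8):
    """List the addresses of one input tile, including its halo.

    The tile side is array_size + (k - 1); pixels that would fall outside
    the n x n matrix are skipped (partial tile at the edge).
    """
    side = array_size + (k - 1)
    addrs = []
    for dy in range(side):
        for dx in range(side):
            x, y = tile_x + dx, tile_y + dy
            if x < n and y < n:
                addrs.append(linear_address(x, y, n))
    return addrs
```

With N = 16 and K = 3, the tile at origin (0, 0) covers 10×10 = 100 addresses, and consecutive rows of the tile are 16 addresses apart, which is the non-sequential pattern the AGU has to produce.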

5. Data Loader

  • Manages data movement between external DRAM and on-chip buffers
  • Implements Valid/Ready handshake protocol
  • Handles format conversions (8-bit ↔ 32-bit)
  • Location: rtl/control/data_loader.v
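
The 8-bit ↔ 32-bit conversion amounts to packing four pixels into one word and unpacking them again. The sketch below assumes that the first byte received lands in bits 7:0 (little-endian order); the README does not state the byte order used by data_loader.v, and the function names are hypothetical.

```python
def pack_word(byte_values):
    """Pack four 8-bit values into one 32-bit word (first byte in bits 7:0)."""
    if len(byte_values) != 4 or any(not 0 <= b <= 255 for b in byte_values):
        raise ValueError("expected four values in 0..255")
    return int.from_bytes(bytes(byte_values), "little")


def unpack_word(word):
    """Split a 32-bit word back into four 8-bit values, lowest byte first."""
    return list(word.to_bytes(4, "little"))
```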

📁 Repository Structure

VLSI_Project/
├── README.md                          # This file - comprehensive project description
├── QUICK_START.md                     # Quick setup and simulation guide
├── BUGFIX_SUMMARY.md                  # Integration bug fixes documentation
├── project.txt                        # Full project specification document
├── project_doc.pdf                    # Project documentation (PDF)
│
├── rtl/                               # RTL source files (Verilog)
│   ├── convolution_accelerator_top.v  # Top-level module
│   ├── accelerator_integration.v      # Memory + Systolic array integration
│   ├── core/                          # Compute core modules
│   │   ├── systolic_array.v           # 8×8 systolic array
│   │   ├── processing_element.v       # Single PE (MAC unit)
│   │   ├── README.md                  # Core architecture documentation
│   │   ├── Stage1_IOs.md              # Stage 1 I/O specifications
│   │   └── systolic_array_handshake.md # Handshake protocol details
│   ├── control/                       # Control and data management
│   │   ├── control_unit.v             # Main FSM controller
│   │   ├── address_generator.v        # AGU for 2D→1D mapping
│   │   ├── data_loader.v              # Data streaming controller
│   │   └── README.md                  # Control subsystem docs
│   ├── mem/                           # Memory subsystem
│   │   ├── memory_controller.v        # Ping-pong buffer controller
│   │   └── README.md                  # Memory architecture docs
│   ├── tb/                            # Testbenches
│   │   ├── tb_processing_element.v    # PE unit tests
│   │   ├── tb_systolic_array.v        # Array tests
│   │   ├── tb_memory_controller.v     # Memory tests
│   │   ├── tb_control_unit.v          # FSM tests
│   │   ├── tb_accelerator_integration.v # Integration tests
│   │   └── tb_full_system.v           # Full system tests
│   ├── INTEGRATION_README.md          # Integration guide
│   └── README.md                      # RTL directory overview
│
├── scripts/                           # Automation scripts
│   ├── golden_model_conv2d.py         # Python reference model
│   ├── python/                        # Python utilities
│   │   ├── expected_out.txt           # Expected outputs
│   │   ├── results_hw.txt             # Hardware results
│   │   └── README.md
│   ├── sim/                           # Simulation scripts
│   │   └── README.md
│   └── utils/                         # Utility scripts
│       └── README.md
│
├── sim/                               # Simulation working directory
│   └── run_integration.do             # ModelSim/QuestaSim script
│
├── third_party/                       # Third-party IP
│   ├── sram_macros/                   # Sky130 SRAM models
│   │   ├── sky130_sram_1kbyte_1rw1r_32x256_8.v
│   │   └── README.md
│   └── README.md
│
├── test_cases/                        # Golden test vectors
│   ├── 01_Basic_Minimal_*.hex         # Basic 2×2 kernel test
│   ├── 02_Basic_Identity_*.hex        # Identity kernel test
│   ├── 03_Basic_AllOnes_*.hex         # All-ones kernel test
│   ├── 04_Regular_Standard_*.hex      # Standard 3×3 kernel
│   ├── 05_Regular_LargeHalo_*.hex     # Large kernel test
│   ├── 06_Regular_PingPong_*.hex      # Ping-pong buffer test
│   ├── 07_Adv_MaxSpec_*.hex           # Maximum size (64×64)
│   ├── 08_Adv_Throughput_*.hex        # Throughput test
│   ├── 09_Pro_PartialTile_*.hex       # Partial tile handling
│   └── 10_Pro_Saturation_*.hex        # Output saturation test
│
├── config/                            # OpenLane configuration
│   └── openlane/                      # Synthesis configs
│       └── README.md
│
└── final/                             # Final implementation outputs
    └── README.md                      # Final deliverables info

🚀 Getting Started

Prerequisites

Hardware Simulation:

  • ModelSim/QuestaSim (for Verilog simulation)
  • Icarus Verilog (alternative simulator)

Software Reference:

  • Python 3.7+ with NumPy

Physical Design (Optional):

  • OpenLane flow
  • Sky130 PDK

Quick Start

  1. Clone the Repository:

    git clone https://github.com/Uderscore/VLSI_Project.git
    cd VLSI_Project
  2. Run Basic Simulation:

    cd sim
    vsim -do run_integration.do
  3. Generate Golden Model:

    cd ../scripts    # from sim/; use "cd scripts" when starting at the repository root
    python golden_model_conv2d.py

For detailed setup instructions, see QUICK_START.md.


🔧 Technical Specifications

Operational Parameters

| Parameter | Min | Max | Type |
| --- | --- | --- | --- |
| Input Matrix (N×N) | 16×16 | 64×64 | Variable |
| Kernel Size (K×K) | 2×2 | 16×16 | Variable |
| Stride | 1 | 1 | Fixed |
| Padding | 0 | 0 | Fixed |
| Input/Weight Precision | 8-bit | 8-bit | Unsigned |
| Internal Accumulation | 32-bit | 32-bit | Fixed Point |
| Output Precision | 8-bit | 8-bit | Unsigned (Truncated) |
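
Two quantities follow directly from these parameters: the output size for stride 1 and padding 0, and the largest value an accumulator can reach with unsigned 8-bit operands. The helper names below are hypothetical and serve only as a worked calculation.

```python
def output_dim(n, k):
    """Output side length for stride 1 and padding 0: N - K + 1."""
    return n - k + 1


def max_accumulator(k):
    """Worst-case sum of K*K products of two unsigned 8-bit values."""
    return 255 * 255 * k * k


# Largest configuration in the table: N = 64, K = 16.
worst = max_accumulator(16)
print(output_dim(64, 16), worst, worst.bit_length())
```

For the largest kernel (K = 16) the worst-case sum is 16,646,400, which needs 24 bits, so the 32-bit accumulator has headroom under these assumptions.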

Hardware Constraints

| Resource | Min | Max | Notes |
| --- | --- | --- | --- |
| On-Chip Memory | 4 KB | 32 KB | Total SRAM |
| Systolic Array | 4×4 | 8×8 | Processing Elements |
| Register Size | 8-bit | 32-bit | Datapath registers |
| External Bus | 8-bit | 32-bit | DRAM interface |
| Internal Bus | 8-bit | 128-bit | SRAM ↔ Array |

Interface Signals

| Signal | Direction | Width | Description |
| --- | --- | --- | --- |
| clk | Input | 1 | System clock |
| rst_n | Input | 1 | Active-low async reset |
| start | Input | 1 | Begin computation pulse |
| cfg_N | Input | 7 | Input matrix dimension N |
| cfg_K | Input | 5 | Kernel dimension K |
| done | Output | 1 | Computation complete |
| rx_data | Input | 8-32 | Input data stream |
| rx_valid | Input | 1 | Input data valid |
| rx_ready | Output | 1 | Ready to accept input |
| tx_data | Output | 8-32 | Output data stream |
| tx_valid | Output | 1 | Output data valid |
| tx_ready | Input | 1 | Ready to accept output |
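
The valid/ready pairs follow the usual rule that a word transfers only in a cycle where both signals are high. The sketch below models that rule for a sender that always has data available; `stream_transfer` and `ready_pattern` are illustrative names and not part of the RTL or the testbenches.

```python
def stream_transfer(data, ready_pattern):
    """Simulate a valid/ready stream.

    data: words the sender wants to transmit (valid stays high until done).
    ready_pattern: receiver's ready value for each successive cycle.
    Returns (received_words, cycles_used).
    """
    received = []
    cycles = 0
    for ready in ready_pattern:
        if len(received) == len(data):
            break  # everything has been sent; valid would drop here
        cycles += 1
        if ready:  # valid is high, so valid and ready means a transfer
            received.append(data[len(received)])
    return received, cycles
```

When the receiver deasserts ready (backpressure), the sender simply holds the same word, so no data is lost or duplicated.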

✅ Verification & Testing

Test Coverage

The project includes 10 comprehensive test cases covering:

  1. Basic Tests (01-03): Minimal kernels, identity operations, edge cases
  2. Regular Tests (04-06): Standard convolutions, large halos, ping-pong buffering
  3. Advanced Tests (07-08): Maximum specifications, throughput validation
  4. Professional Tests (09-10): Partial tiles, saturation handling

Golden Model

A Python reference implementation generates expected outputs:

python scripts/golden_model_conv2d.py

Results are compared with hardware outputs with a tolerance of ±1 LSB to account for fixed-point rounding.
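
The comparison step can be sketched in plain Python as follows. This is not the contents of golden_model_conv2d.py: it assumes the deep-learning convention of sliding the kernel without flipping it, returns full-precision sums without the 8-bit output reduction, and uses hypothetical function names.

```python
def conv2d_valid(image, kernel):
    """Reference 2D convolution, stride 1, no padding, kernel not flipped."""
    n, k = len(image), len(kernel)
    out = n - k + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out)
        ]
        for i in range(out)
    ]


def within_tolerance(expected, actual, tol=1):
    """True if every element differs by at most tol (one LSB by default)."""
    return all(
        abs(e - a) <= tol
        for e_row, a_row in zip(expected, actual)
        for e, a in zip(e_row, a_row)
    )
```

For example, a 3×3 image with rows 1 to 9 and the kernel [[1, 0], [0, 1]] gives [[6, 8], [12, 14]].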

Running Tests

# Run integration testbench
cd sim
vsim -do run_integration.do

# Run full system test
vsim -do run_full_system.do

# Run specific module tests
cd ../rtl/tb    # from sim/; use "cd rtl/tb" when starting at the repository root
vsim tb_systolic_array -do "run -all"

📊 Performance Metrics

Design Goals

  • Throughput: 64 MACs per cycle at peak (64 PEs × 1 MAC/cycle)
  • Latency: ~15 cycles (pipeline fill) + N²/64 cycles (computation)
  • Memory Bandwidth: Up to 128 bits/cycle internal
  • Power: Clock gating for idle PEs
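
As a worked example of the latency goal, the helper below evaluates the pipeline-fill term plus the N²/64 term. It is a back-of-the-envelope estimate that mirrors the formula above, not a measured figure; `estimated_cycles` is a hypothetical name, and rounding the compute term up is an assumption.

```python
def estimated_cycles(n, rows=8, cols=8):
    """Latency estimate: (rows + cols - 1) fill cycles + ceil(n*n / PEs)."""
    fill = rows + cols - 1
    compute = -(-(n * n) // (rows * cols))  # ceiling division
    return fill + compute


# Smallest and largest supported input sizes.
print(estimated_cycles(16), estimated_cycles(64))
```

Under this estimate a 16×16 input takes about 19 cycles and a 64×64 input about 79 cycles.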

Optimization Areas

  1. Area Optimization:

    • Counter bit-width reduction
    • Resource sharing between PEs
    • Minimal state machine complexity
  2. Power Optimization:

    • Clock gating for idle PEs during halo loading
    • Efficient memory access patterns
    • Reduced switching activity
  3. Timing Optimization:

    • Pipeline balancing
    • Critical path analysis
    • Maximum operating frequency tuning

🐛 Known Issues & Fixes

See BUGFIX_SUMMARY.md for detailed bug reports and resolutions, including:

  • ✅ Multiple driver conflicts resolved
  • ✅ Ping-pong buffer switching corrected
  • ✅ Handshake protocol timing fixed
  • ✅ Testbench timeout protection added

📚 Documentation


🔗 Learning Resources

Recommended Reading

Systolic Arrays:

Memory Architecture:

Fixed-Point Arithmetic:

TPU Architecture:

SRAM Integration

Sky130 SRAM Macros:


🀝 Contributing

This is an academic project for CMP3020 - VLSI Design course. Team members are responsible for:

  1. Functional Verification (10 marks): Golden model matching with ±0.1 precision
  2. Performance Optimization (5 marks): PPA metrics ranking
  3. Personal Contribution (5 marks): Individual component ownership

Team Size: 7-8 members
Deadline: Week 13

Workload Division

Each team member should contribute to specific components:

  • PE design and verification
  • Systolic array assembly
  • Memory controller integration
  • AGU implementation
  • Control FSM development
  • Testbench development
  • Documentation

📝 Project Status

✅ Completed Components

  • Processing Element (PE) design
  • 8×8 Systolic Array
  • Memory Controller with ping-pong buffering
  • SRAM macro integration
  • Control Unit FSM
  • Address Generation Unit (AGU)
  • Data Loader
  • Top-level integration
  • Comprehensive testbenches
  • Golden model (Python)
  • 10 test cases with golden vectors
  • Bug fixes for integration issues

🔄 In Progress / Future Work

  • OpenLane synthesis and place-and-route
  • Power analysis and optimization
  • Clock gating implementation
  • Final GDS-II generation
  • PPA metrics optimization
  • Additional test cases
  • Performance benchmarking

📧 Contact & Support

Course: CMP3020 - VLSI Design
Institution: Cairo University Faculty of Engineering (CUFE)
Department: Computer Engineering
Instructor: Muhammad Sayed

For technical questions or issues:

  1. Check existing documentation in the repository
  2. Review testbench outputs and waveforms
  3. Consult BUGFIX_SUMMARY.md for common issues
  4. Contact team members or instructor

📜 License

This project is part of academic coursework for CMP3020 - VLSI Design at Cairo University Faculty of Engineering. All rights reserved.


🏆 Project Goals

Primary Objective: Design a functional 2D convolution accelerator that matches the golden model output within ±0.1 precision.

Secondary Objectives:

  • Achieve competitive PPA (Power, Performance, Area) metrics
  • Demonstrate modular and extensible architecture
  • Implement industry-standard design practices
  • Create comprehensive verification environment

Bonus Opportunities:

  • Advanced dataflow analysis (Input vs Weight Stationary)
  • Sophisticated memory banking strategies
  • Automated regression testing suite
  • Novel architectural optimizations

Last Updated: January 2026
Repository: https://github.com/Uderscore/VLSI_Project
Branch: copilot/describe-repo-details
