**Currently does not work where a data type follows a port direction, such as `output reg` or `input wire`** (see TODO section below)
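
One possible way to handle a data type after the direction is to treat it as an optional group in a single pattern. The sketch below is not part of the parser in this notebook: `PORT_DECL` and `split_port_decl` are hypothetical names, and it assumes a single declaration has already been isolated from the rest of the line.

```python
import re

# Hypothetical pattern: direction, optional data type, optional bit width, then the names
PORT_DECL = re.compile(
    r'\b(input|output|inout)\b\s*'      # direction
    r'(?:(reg|wire)\b\s*)?'             # optional data type
    r'(\[[^\]]*\])?\s*'                 # optional bit width, e.g. [3:0]
    r'(.*)'                             # the rest: comma-separated names
)

def split_port_decl(decl):
    """Return (direction, data_type, width, names) for one declaration, or None."""
    match = PORT_DECL.search(decl)
    if not match:
        return None
    direction, data_type, width, rest = match.groups()
    # Drop trailing punctuation, then split the remaining text into names
    names = [n.strip() for n in rest.strip(' ,;)').split(',') if n.strip()]
    return direction, data_type, width, names

print(split_port_decl('output reg [3:0] sum, carry;'))
print(split_port_decl('input wire clk'))
print(split_port_decl('input a, b'))
```

The data type and width come back as `None` when they are absent, so a caller can decide how to display them.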

Module definition formats that have been accounted for:

- all inputs and outputs defined in the first line

```
module half_adder(input a, b, output s0, c0);
```

- no inputs or outputs defined in the first line

```
module rams_sp_rom (clk, we, addr, di, dout);

--OR--

module arbiter (
clock      , // clock
reset      , // Active high, syn reset
req_0      , // Request 0
req_1      , // Request 1
gnt_0      , // Grant 0
gnt_1        // Grant 1
);
```

- multiple types of either input or output (different bit widths)

```
module RCA4(output [3:0] sum, output cout, input [3:0] a, b, input cin);
```

- multi-line and multiple types of either input or output

```
module SkipLogic(output cin_next,
  input [3:0] a, b, input cin, cout);
```

- parameters; inputs and outputs defined in an uncommon way

```
module spi_lite
//-----------------------------------------------------------------
// Params
//-----------------------------------------------------------------
#(
     parameter C_SCK_RATIO      = 32
)
//-----------------------------------------------------------------
// Ports
//-----------------------------------------------------------------
(
    // Inputs
     input          clk_i
    ,input          rst_i
    ,input          cfg_awvalid_i
    ,input  [31:0]  cfg_awaddr_i
...
```

Shortcomings and predicted errors:

First, there are likely many cases I didn't catch, where code is written in an unconventional format I didn't account for. One way to combat this would be to rewrite the program so that it doesn't parse line by line in the traditional manner, but instead looks at the whole file, interprets it, and then extracts the pieces based on what it found. That comes close to describing some kind of machine learning algorithm.
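
Short of machine learning, one step toward looking at the whole thing is to treat the file as a single string instead of a list of lines. The sketch below is only an illustration of that idea, not the approach used in this notebook: `strip_comments`, `MODULE_HEADER`, and `module_headers` are hypothetical names, and it assumes that `//`, `/*`, and `;` never appear inside string literals in a module header.

```python
import re

def strip_comments(text):
    # Remove /* ... */ blocks first (they may span lines), then // line comments
    text = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)
    return re.sub(r'//[^\n]*', '', text)

# Hypothetical pattern: 'module', a name, then everything up to the first ';'
MODULE_HEADER = re.compile(r'\bmodule\s+(\w+)\s*(.*?);', re.DOTALL)

def module_headers(text):
    """Return (name, header) pairs with the whitespace in each header collapsed."""
    clean = strip_comments(text)
    return [(name, ' '.join(header.split()))
            for name, header in MODULE_HEADER.findall(clean)]

sample = """
module arbiter (
clock      , // clock
reset      , /* Active high,
                syn reset */
gnt_1        // Grant 1
);
endmodule
"""
print(module_headers(sample))
```

Because the header is captured in one piece, multi-line definitions and leading-comma styles no longer need separate handling before the ports are split.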

However, here's what I've noticed so far:

- Does not account for instantiations of a module inside another module.

- Does not add a space after a second bit width specifier on the same line, if there is one.
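
For the bit width spacing issue, one possible fix is a single substitution that handles every closing bracket on the line rather than only the first. This is a sketch, not the code used below: `space_after_brackets` is a hypothetical helper, and it assumes a space is only needed when the bracket is directly followed by an identifier character.

```python
import re

def space_after_brackets(line):
    # Insert a space after every ']' that is directly followed by a word character
    return re.sub(r'\](?=\w)', '] ', line)

print(space_after_brackets('input [3:0]a, output [7:0]b'))
```

Brackets that are already followed by a space, a comma, or a semicolon are left unchanged.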

In [None]:
%%capture
# Hide the output of this cell (the %%capture cell magic has to be the first line of the cell)

import os

# Remove the folder if it's already there, then make the folder and go into it
! rm -rf FilesToParse
! mkdir FilesToParse
os.chdir('FilesToParse')

# These are a bunch of example repos I found that have some Verilog files in them
! git clone https://github.com/sudhamshu091/32-Verilog-Mini-Projects.git
! git clone https://github.com/snbk001/Verilog-Design-Examples.git
! git clone https://github.com/ashishrana160796/verilog-starter-tutorials.git
! git clone https://github.com/mongrelgem/Verilog-Adders.git
! git clone https://github.com/mihir8181/VerilogHDL-Codes.git
! git clone https://github.com/sudhamshu091/Single-Cycle-Risc-Processor-32-bit-Verilog.git
! git clone https://github.com/ultraembedded/minispartan6-audio.git

# Move back to the starting directory to continue with program
os.chdir('..')

# Remove the sample directory of files that Colab spawns every time we open it again
! rm -rf sample_data

In [None]:
import re

keywords = ['parameter', 'input', 'output', 'reg', 'wire']

def clean_lists(list_):
    whole = ''.join(list_)                                                  # the list returned from group matching in re search was a list of single characters
    elements = [e.strip() for e in re.split(r',|\(|\)|;', whole) if e]      # split the string to create a list, then strip each item in the list

    cleaned_elements = []
    for e in elements:
        if not e:
            continue
        if '=' in e:
            e = e.split('=')[0]
        e = re.sub(r'\s*:\s*', ':', e)                                      # get rid of any intra-bit width spacing on either side of the colon (or both)
        e = e.strip()
        cleaned_elements.append(e)

    return cleaned_elements


def print_keyword(keyword, keyword_list):
    print('    {} ({}):'.format(keyword, len(clean_lists(keyword_list[keywords.index(keyword)]))), file=output_file)
    for i in clean_lists(keyword_list[keywords.index(keyword)]):
        print('      ', i, sep='', file=output_file)


def check_if_in_module(in_module, keyword_list):
    if in_module:                                    # this checks to see if we're already in a module, in case the previous module did not have an 'endmodule'
        for keyword in keywords:
            print_keyword(keyword, keyword_list)
    return True


# This function is used for the cases in which a bit width is specified for some inputs,
#   but only noted once, resulting in the need for it to be distributed to each element for display
def distributor(elements, specifier):
    elements = elements.strip(',')                              # get rid of any commas at the end that may add a blank element when splitting
    s = elements.split(',')
    for i in range(1, len(s)):
        s[i] = specifier + ' ' + s[i]                           # distribute the specifier
        s[i] = re.sub(r'\s{2,}', ' ', s[i])                     # clean up spacing
    elements_with_specifier = ''
    for string in s:
        elements_with_specifier += ' ' + string + ','           # add a comma after each one because that's what we split by when cleaning the list
    return elements_with_specifier


def generalized_search_module_def(line, keyword_list):          # search through a module definition (generalized because the previous one
                                                                #   looked for inputs and outputs only instead of all keywords)
    mod_def_keyword_lists = [[] for keyword in keywords]        # "module definition keyword lists"
    mod_def_keyword_strings = ['' for keyword in keywords]      # "module definition keyword strings"

    for keyword in keywords:
        k = keywords.index(keyword)     # this is an integer representing an index

        keyword_lookahead = '|'.join(r'\b' + key + r'\b' for key in keywords)
        keyword_lookahead2 = '|'.join(r'.\b' + key + r'\b' for key in keywords)

        if re.search(keyword, line):
            # These checks are currently hardcoded because they didn't work when looping through the keywords list
            # They are checking if a keyword is preceded by one of these words and a space, meaning the current keyword
            #   isn't its own declaration
            # Currently can only check for a single space, but that should be ok because earlier when we clean the line,
            #   we replace spaces of two or more with just one
            if re.search(r'(?<=' + 'input' + r'\s)' + keyword, line):       # if the keyword is preceded by 'input'
                continue
            elif re.search(r'(?<=' + 'output' + r'\s)' + keyword, line):    # if the keyword is preceded by 'output'
                continue
            elif re.search(r'(?<=' + 'inout' + r'\s)' + keyword, line):     # if the keyword is preceded by 'inout'
                continue
                continue

            regex = r'\b' + keyword + r'\b\s*(.*?)\s*(?=' + keyword_lookahead2 + r'|$)'         # get everything until another keyword
            mod_def_keyword_lists[k] = re.findall(regex, line)

        for i, e in enumerate(mod_def_keyword_lists[k]):
            specifier = ''
            data_search = re.search(r'\A\s*(' + keyword_lookahead + ')', e)     # look for a data type specifier
            width_search = re.search(r'(\[.*:.*\])', e)                         # look for a bit width specifier

            if data_search:
                specifier += data_search.group(1)
            if width_search:
                specifier += width_search.group(1)

            # write the result back by position instead of searching the list again with index()
            mod_def_keyword_lists[k][i] = distributor(e, specifier)

        mod_def_keyword_strings[k] = ''.join(mod_def_keyword_lists[k])
        keyword_list[k].append(mod_def_keyword_strings[k])

    return keyword_list


def parse_lines(lines, output_file):
    in_comment_block = False
    in_module = False
    in_module_def = False
    in_function = False
    concat_lines = False

    temp_line = ''

    keyword_list = [[] for keyword in keywords]

    # Search through each line
    for line in lines:
        # Clean up the line for processing
        line = line.split('//')[0]              # remove any in-line comments
        line = line.strip()                     # clean up the line by removing any problem-causing whitespaces
                                                # - we can do this because indentation doesn't matter in Verilog
        if re.search(r'\A\s*,', line):          # if the line starts with a comma, which probably means
                                                #   it's part of a multi-line module definition
            line = line.lstrip(',')             # strip any commas at the beginning of the line, which cause problems

        # Shrink all spaces down to one space
        line = re.sub(r'\s{2,}', ' ', line)

        # Make sure there is a space after a bit width specifier
        # - only works for the first one, not any subsequent ones
        if ']' in line and re.search('](.)', line) and not (re.search('](.)', line).group(1) == ' '):   # if there is a closing bracket and the character immediately following it is not a space
            temp = line.split(']', 1)           # break up the line into two parts
            line = temp[0] + '] ' + temp[1]     # put the line back together with a space now coming after the closing bracket

        # Check if we should concat the previous line with the current one
        if concat_lines:
            line = temp_line + ' ' + line       # add the previous line to this one
        regex = '|'.join(r'\b' + keyword + r'\b' for keyword in keywords)
        if re.search(regex, line) and not re.search(';$', line):
            concat_lines = True
            temp_line = line
            continue
        concat_lines = False
        temp_line = ''                          # clear the temp line, otherwise the next time, that whole thing will be loaded in

        # Check if there is a variable being assigned to something,
        #   but there is only one in the line and the line isn't continuing to the next
        if '=' in line and ',' not in line:
            line = line.split('=')[0]

        # Look for comment block
        if re.search(r'/\*', line):                     # look for /* anywhere in the line
            if not re.search(r'/\*.*\*/', line):        # only stay in the block if it is not closed on this same line
                in_comment_block = True                 # we are now in a comment block
            continue                                    # go to the next line
        if in_comment_block:                            # if currently in a comment block
            if re.search(r'\*/', line):                 # end of the comment block
                in_comment_block = False                # we are no longer in the comment block
            continue                                    # go to the next line

        # Look for function
        if re.search(r'\s*\bfunction\b', line): # look for 'function' in the line, signaling the start of a function,
                                                #   which we don't want to pull keywords from
            in_function = True
            continue
        if in_function:
            if re.search(r'\s*\bendfunction\b', line):
                in_function = False
            continue

        # The first line this checks is the second line of the module definition
        if in_module_def:
            keyword_list = generalized_search_module_def(line, keyword_list)
            if re.search(r'\)\s*;$', line):     # if the line ends with ) ;
                in_module_def = False           # we are no longer in the module definition
            continue                            # go to the next line

        if re.search(r'\bmodule\b', line):
            in_module = check_if_in_module(in_module, keyword_list)
            module_name = re.search(r'module (\w+)', line).group(1)
            print("  Module:", module_name, file=output_file)
            keyword_list = generalized_search_module_def(line, keyword_list)
            if not re.search(r'\)\s*;$', line):
                in_module_def = True
            continue

        # See if we have reached the end of the module
        elif re.search('endmodule', line):
            in_module = False
            for keyword in keywords:
                print_keyword(keyword, keyword_list)
            keyword_list = [[] for keyword in keywords]


        # Look for each keyword in the current line
        for keyword in keywords:
            k = keywords.index(keyword)
            keyword_lookahead = '|'.join(r'\b' + key + r'\b' for key in keywords)

            if re.search(keyword, line):
                if re.search(r'(?<=' + 'input' + r'\s)' + keyword, line):
                    continue
                elif re.search(r'(?<=' + 'output' + r'\s)' + keyword, line):
                    continue
                elif re.search(r'(?<=' + 'inout' + r'\s)' + keyword, line):
                    continue

                regex = r'\A\s*' + keyword + r'\s+(.*)'                         # regex to find keyword and stuff after it
                                                                                # search from the beginning of the line because the keyword
                                                                                #   should only be at the beginning
                match_ = re.search(regex, line)                                 # search for the keyword in the line
                if match_:
                    elements = match_.group(1)
                    specifier = ''

                    data_search = re.search(r'\A\s*(' + keyword_lookahead + ')', elements)
                    bit_width_search = re.search(r'(\[.*:.*\])', elements)

                    if data_search:
                        specifier += data_search.group(1)
                    if bit_width_search:
                        specifier += bit_width_search.group(1)

                    elements_with_bit_width = distributor(elements, specifier)
                    keyword_list[k].append(elements_with_bit_width)

    # Print stuff if we have gotten to the end of the file and there was only one module but it had no 'endmodule'
    check_if_in_module(in_module, keyword_list)

In [None]:
parsed_files_filename = 'parsed_file_output.txt'
output_file = open(parsed_files_filename, 'w')

def traverse_directory(directory, extension, output_file):
    files_parsed = 0
    lines_parsed = 0
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)

        if os.path.isfile(filepath) and filename.endswith(extension):
            with open(filepath, 'r') as file_:                          # the with block closes the file even if reading fails
                lines = file_.readlines()
            print("File:", filepath, file=output_file)
            parse_lines(lines, output_file)
            print(file=output_file)                                     # print a new line after the information for the current file
            files_parsed += 1
            lines_parsed += len(lines)                                  # add the number of lines read in the file
        elif os.path.isdir(filepath):
            a, b = traverse_directory(filepath, extension, output_file)
            files_parsed += a
            lines_parsed += b

    return files_parsed, lines_parsed

files_parsed, lines_parsed = traverse_directory(".", ".v", output_file)
output_file.close()
print("Files parsed:", files_parsed)
print("Lines parsed:", lines_parsed)
print("Parsed output has been saved to {}".format(parsed_files_filename))

In [None]:
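# Cell to parse the lines of just one file
# - passing a full file name as the 'extension' argument works because it is matched with endswith()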

parsed_files_filename = 'single_file_parsed_output2.txt'
output_file = open(parsed_files_filename, 'w')

files_parsed, lines_parsed = traverse_directory(".", "RFSM.v", output_file)
output_file.close()
print("Files parsed:", files_parsed)
print("Lines parsed:", lines_parsed)
print("Parsed output has been saved to {}".format(parsed_files_filename))