**The parser currently does not work where a data type follows the signal type, such as ```output reg``` or ```input wire```** (see the TODO section below)
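
One possible way to handle these declarations is to peel off the signal type, the optional data type, and the optional bit width first, and only then distribute them to each name. The sketch below is a hypothetical helper (`PORT_DECL` and `distribute_port_decl` are not part of the parser in this notebook), and it assumes the declaration has already been isolated into a single string with no comments in it:

```python
import re

# Hypothetical pattern, for illustration only: direction, optional data type,
# optional bit width, then the comma-separated names.
PORT_DECL = re.compile(
    r'^\s*(input|output|inout)\s+'   # signal type (direction)
    r'(?:(reg|wire)\b\s*)?'          # optional data type
    r'(\[[^\]]*\]\s*)?'              # optional bit width, e.g. [3:0]
    r'(.*)$'                         # everything else: the names
)


def distribute_port_decl(decl):
    """Return (direction, 'data_type [width] name') pairs for one declaration."""
    match = PORT_DECL.match(decl)
    if not match:
        return []
    direction, data_type, width, names = match.groups()
    # Drop any whitespace inside the bit width, e.g. '[addrbits  :0]'
    width = re.sub(r'\s+', '', width) if width else ''
    prefix = ' '.join(part for part in (data_type, width) if part)
    result = []
    for name in names.split(','):
        name = name.strip()
        if name:                     # skip the blank left by a trailing comma
            result.append((direction, (prefix + ' ' + name).strip()))
    return result


print(distribute_port_decl('output wire[3:0] VALUE'))
# [('output', 'wire [3:0] VALUE')]
print(distribute_port_decl('input wire [7:0] a, b'))
# [('input', 'wire [7:0] a'), ('input', 'wire [7:0] b')]
```

Because the data type is matched as an optional group, plain declarations such as `input CLK, ENABLE` go through the same path with an empty prefix.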

In [None]:
# TODO
#
# make it work when there is a data type after the signal type, such as 'output reg' or 'input wire' (distribute the data type as in other areas):
# module up_down_counter(input CLK, ENABLE, UPDN, RST, output wire[3:0] VALUE);
#   reference /content/FilesToParse/VerilogHDL-Codes/Counter+BCD/up_down_counter.v
# another example:
# output reg [addrbits  :0] rdptr,
#   reference: ./FilesToParse/VerilogHDL-Codes/Asynchronous FIFO/RFSM.v
#
# remove any extra spaces within bit width distributors as in ./FilesToParse/VerilogHDL-Codes/Asynchronous FIFO/RFSM.v

Module definition formats that have been accounted for:

- all inputs and outputs defined in the first line

```
module half_adder(input a, b, output s0, c0);
```

- no inputs or outputs defined in the first line

```
module rams_sp_rom (clk, we, addr, di, dout);

--OR--

module arbiter (
clock      , // clock
reset      , // Active high, syn reset
req_0      , // Request 0
req_1      , // Request 1
gnt_0      , // Grant 0
gnt_1        // Grant 1
);
```

- multiple types of either input or output (different bit widths)

```
module RCA4(output [3:0] sum, output cout, input [3:0] a, b, input cin);
```

- multi-line and multiple types of either input or output

```
module SkipLogic(output cin_next,
  input [3:0] a, b, input cin, cout);
```

- parameters; inputs and outputs defined in an uncommon way

```
module spi_lite
//-----------------------------------------------------------------
// Params
//-----------------------------------------------------------------
#(
     parameter C_SCK_RATIO      = 32
)
//-----------------------------------------------------------------
// Ports
//-----------------------------------------------------------------
(
    // Inputs
     input          clk_i
    ,input          rst_i
    ,input          cfg_awvalid_i
    ,input  [31:0]  cfg_awaddr_i
...
```
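
All of the formats above become much easier to compare once the module header is collapsed into a single line. The following is a minimal sketch of that idea, not the approach the parser in this notebook takes (it works line by line). The function name `extract_module_header` is made up for this example, and it assumes the whole file is available as one string and that no string literal contains `//` or `/*`:

```python
import re


def extract_module_header(source):
    """Return the first module header as one whitespace-normalized line, or None."""
    # Strip comments first so they cannot hide commas or parentheses
    source = re.sub(r'//[^\n]*', '', source)                     # in-line comments
    source = re.sub(r'/\*.*?\*/', '', source, flags=re.DOTALL)   # block comments
    # The header runs from 'module' up to the first semicolon
    match = re.search(r'\bmodule\b(.*?);', source, flags=re.DOTALL)
    if not match:
        return None
    # Collapse newlines and repeated spaces into single spaces
    return ' '.join(match.group(1).split())


example = """module SkipLogic(output cin_next,
  input [3:0] a, b, input cin, cout);
"""
print(extract_module_header(example))
# SkipLogic(output cin_next, input [3:0] a, b, input cin, cout)
```

The `\b` word boundaries keep `endmodule` from being mistaken for the start of a header.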

Shortcomings and predicted errors:

First, there are likely many cases I didn't catch, where someone writes a module in an unconventional format that I didn't account for. One way to address this would be to rewrite the program so that it doesn't parse line by line against fixed patterns, but instead looks at the whole file, interprets its structure, and then pulls out the pieces based on what it has found. That is essentially describing some kind of machine learning algorithm.

However, here's what I've noticed so far:

- Does not account for instantiations of a module inside of another module.

- Does not add a space after a second bit width specifier, if one is present.
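
For the first shortcoming, one rough way to spot an instantiation is to look for a line that starts with two identifiers followed by an opening parenthesis, and to reject lines whose first word is a Verilog keyword. This is only a sketch under those assumptions: `find_instantiation` and `NOT_MODULE_NAMES` are hypothetical names, the keyword set is incomplete, and instantiations split across several lines would need the lines joined first.

```python
import re

# Hypothetical, incomplete set of words that can start a line but are not module names
NOT_MODULE_NAMES = {
    'module', 'input', 'output', 'inout', 'reg', 'wire', 'parameter',
    'assign', 'always', 'initial', 'if', 'else', 'for', 'while', 'case',
    'function', 'task', 'begin', 'end',
}

# <module_name> [#(parameter overrides)] <instance_name> (
INSTANTIATION = re.compile(r'^\s*(\w+)\s+(?:#\s*\(.*?\)\s*)?(\w+)\s*\(')


def find_instantiation(line):
    """Return (module_name, instance_name) if the line looks like an instantiation."""
    match = INSTANTIATION.match(line)
    if not match or match.group(1) in NOT_MODULE_NAMES:
        return None
    return match.group(1), match.group(2)


print(find_instantiation('  half_adder ha0(a, b, s0, c0);'))
# ('half_adder', 'ha0')
print(find_instantiation('module half_adder(input a, b, output s0, c0);'))
# None
```

Lines recognized this way could be skipped, or reported separately, instead of being searched for keywords.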

In [None]:
%%capture
# Hide the output of this cell (the %%capture magic has to be the first line of the cell)

import os

# Remove the folder if it's already there, then make the folder and go into it
! rm -rf FilesToParse
! mkdir FilesToParse
os.chdir('FilesToParse')

# These are a bunch of example repos I found that have some Verilog files in them
! git clone https://github.com/sudhamshu091/32-Verilog-Mini-Projects.git
! git clone https://github.com/snbk001/Verilog-Design-Examples.git
! git clone https://github.com/ashishrana160796/verilog-starter-tutorials.git
! git clone https://github.com/mongrelgem/Verilog-Adders.git
! git clone https://github.com/mihir8181/VerilogHDL-Codes.git
! git clone https://github.com/sudhamshu091/Single-Cycle-Risc-Processor-32-bit-Verilog.git
! git clone https://github.com/ultraembedded/minispartan6-audio.git

# Move back to the starting directory to continue with program
os.chdir('..')

# Remove the sample directory of files that Colab spawns every time we open it again
! rm -rf sample_data

In [None]:
import re

keywords = ['parameter', 'input', 'output', 'reg', 'wire']

def clean_lists(list_):
    whole = ''.join(list_)                                                  # the list returned from group matching in re search is joined into a single string
    elements = [e.strip() for e in re.split(r',|\(|\)|;', whole) if e]      # split the string to create a list, then strip each item in the list

    cleaned_elements = []
    for e in elements:
        if not e:
            continue
        if '=' in e:
            e = e.split('=')[0]
        if re.search(r'\[.*:.*\]', e) and not re.search(r'\]\s\S', e):                  # if there is a bit width specifier and there is not exactly one space after it
            parts = e.split(']', 1)                                                     # only split once, in case there is another bit width specifier later on
            e = parts[0] + '] ' + parts[1]
        e = e.strip()
        cleaned_elements.append(e)

    return cleaned_elements


def print_keyword(keyword, keyword_list):
    print('    {} ({}):'.format(keyword, len(clean_lists(keyword_list[keywords.index(keyword)]))), file=output_file)
    for i in clean_lists(keyword_list[keywords.index(keyword)]):
        print('      ', i, sep='', file=output_file)


def check_if_in_module(in_module, keyword_list):
    if in_module:                                    # this checks to see if we're already in a module, in case the previous module did not have an 'endmodule'
        for keyword in keywords:
            print_keyword(keyword, keyword_list)
    return True


# This function is used for the cases in which a bit width is specified for some inputs,
#   but only noted once, resulting in the need for it to be distributed to each element for display
def distributor(elements, specifier):
    elements = elements.strip(',')                              # get rid of any commas at the end that may add a blank element when splitting
    s = elements.split(',')
    for i in range(1, len(s)):
        s[i] = specifier + s[i]
    elements_with_specifier = ''
    for string in s:
        elements_with_specifier += ' ' + string + ','           # add a comma after each one because that's what we split by when cleaning the list
    return elements_with_specifier


def generalized_search_module_def(line, keyword_list):          # search through a module definition (generalized because the previous one
                                                                #   looked for inputs and outputs only instead of all keywords)
                                                                # so ugly :(
    mod_def_keyword_lists = [[] for keyword in keywords]        # "module definition keyword lists"
    mod_def_keyword_strings = ['' for keyword in keywords]      # "module definition keyword strings"

    specifier = ''

    for keyword in keywords:
        if re.search(keyword, line):
            lookahead = '|'.join(r'\b' + k + r'\b' for k in keywords)
            regex = r'\b' + keyword + r'\b\s*(.*?)\s*(?=,\s*' + lookahead + r'|$)'
            mod_def_keyword_lists[keywords.index(keyword)] = re.findall(regex, line)

        for e in mod_def_keyword_lists[keywords.index(keyword)]:
            bit_width_specifier = re.search(r'\[.*:.*\]', e)            # changed from \[.*\d+:.*\d+\] for the ones that don't have a number at least on the left
            if bit_width_specifier:
                specifier = bit_width_specifier.group()                 # use only this group's bit width; += would carry earlier widths over to later groups
                elements_with_specifier = distributor(e, specifier)
                mod_def_keyword_lists[keywords.index(keyword)][mod_def_keyword_lists[keywords.index(keyword)].index(e)] = elements_with_specifier

# TODO trying to make this work for distributing data types
            # look_for_data_type = keyword + '\s*(reg)|' + keyword + '\s*(wire)'              # do we need to add more options than just 'reg' and 'wire' ?
            # data_type_specifier = re.search(look_for_data_type, e)
            # if data_type_specifier:
            #     specifier += data_type_specifier.group(1)
            #     elements_with_specifier = distributor(e, specifier)
            #     mod_def_keyword_lists[keywords.index(keyword)][mod_def_keyword_lists[keywords.index(keyword)].index(e)] = elements_with_specifier


        mod_def_keyword_strings[keywords.index(keyword)] = ''.join(mod_def_keyword_lists[keywords.index(keyword)])

        keyword_list[keywords.index(keyword)].append(mod_def_keyword_strings[keywords.index(keyword)])

    return keyword_list


def parse_lines(lines, output_file):
    in_comment_block = False
    in_module = False
    in_module_def = False
    in_function = False
    concat_lines = False

    temp_line = ''

    keyword_list = [[] for keyword in keywords]

    # Search through each line
    for line in lines:
        # Clean up the line for processing
        line = line.split('//')[0]              # remove any in-line comments
        line = line.strip()                     # clean up the line by removing any problem-causing whitespaces
                                                # - we can do this because indentation doesn't matter in Verilog
        if re.search(r'\A\s*,', line):          # if the line starts with a comma, which probably means
                                                #   it's part of a multi-line module definition
            line = line.lstrip(',')             # strip any commas at the beginning of the line, which cause problems

        # Check if we should concat the previous line with the current one
        if concat_lines:
            line = temp_line + ' ' + line       # add the previous line to this one
        regex = '|'.join(r'\b' + keyword + r'\b' for keyword in keywords)
        if re.search(regex, line) and not re.search(';$', line):
            concat_lines = True
            temp_line = line
            continue
        concat_lines = False
        temp_line = ''                          # clear the temp line, otherwise the next time, that whole thing will be loaded in

        # Check if there is a variable being assigned to something,
        #   but there is only one in the line and the line isn't continuing to the next
        if '=' in line and not ',' in line:
            line = line.split('=')[0]

        # Look for comment block
        if re.search(r'/\*', line):             # look for /* in the line
            if not re.search(r'\*/', line):     # stay in the comment block only if it is not closed on this same line
                in_comment_block = True         # we are now in a comment block
            continue                            # go to the next line
        if in_comment_block:                    # if currently in a comment block
            if re.search(r'\*/', line):         # end of the comment block
                in_comment_block = False        # we are no longer in the comment block
            continue                            # go to the next line

        # Look for function
        if re.search(r'\s*\bfunction\b', line): # look for 'function' in the line, signaling the start of a function,
                                                #   which we don't want to pull keywords from
            in_function = True
            continue
        if in_function:
            if re.search(r'\s*\bendfunction\b', line):
                in_function = False
            continue

        # The first line this checks is the second line of the module definition
        if in_module_def:
            keyword_list = generalized_search_module_def(line, keyword_list)
            if re.search(r'\)\s*;$', line):     # if the line ends with ) ;
                in_module_def = False           # we are no longer in the module definition
            continue                            # go to the next line

        if re.search(r'\bmodule\b', line):
            in_module = check_if_in_module(in_module, keyword_list)
            module_name = re.search(r'module\s+(\w+)', line).group(1)
            print("  Module:", module_name, file=output_file)
            keyword_list = generalized_search_module_def(line, keyword_list)
            if not re.search(r'\)\s*;$', line):
                in_module_def = True
            continue

        # See if we have reached the end of the module
        elif re.search('endmodule', line):
            in_module = False
            for keyword in keywords:
                print_keyword(keyword, keyword_list)
            keyword_list = [[] for keyword in keywords]


        # Look for each keyword in the current line
        for keyword in keywords:
            if re.search(keyword, line):
                regex = r'\A\s*' + keyword + r'\s+(.*)'                         # regex to find keyword
                match_ = re.search(regex, line)                                 # search for the keyword in the line
                if match_:
                    elements = match_.group(1)
                    bit_width = re.search(r'\[.*:.*\]', elements)               # regex to check for bit width indicator ([number:number])
                    if bit_width:
                        bit_width_specifier = bit_width.group()                 # if the matched group contains a bit width indicator
                        elements_with_bit_width = distributor(elements, bit_width_specifier)
                        keyword_list[keywords.index(keyword)].append(elements_with_bit_width)

# TODO trying to make this work for distributing data types
                    # look_for_data_type = keyword + '\s*(reg)|' + keyword + '\s*(wire)'
                    # data_type_specifier = re.search(look_for_data_type, line)
                    # if data_type_specifier:
                    #     specifier += data_type_specifier.group(1)
                    #     elements_with_specifier = distributor(e, specifier)
                    #     keyword_list[keywords.index(keyword)].append(elements_with_specifier)

                    # the data type check above is still commented out, so data_type_specifier
                    #   is not defined yet and only the bit width is tested here
                    if not bit_width:
                        keyword_list[keywords.index(keyword)].append(elements)

    # Print stuff if we have gotten to the end of the file and there was only one module but it had no 'endmodule'
    check_if_in_module(in_module, keyword_list)

In [None]:
parsed_files_filename = 'parsed_file_output.txt'
output_file = open(parsed_files_filename, 'w')

def traverse_directory(directory, extension, output_file):
    files_parsed = 0
    lines_parsed = 0
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)

        if os.path.isfile(filepath) and filename.endswith(extension):
            with open(filepath, 'r') as file_:                          # the file is closed automatically when the block ends
                lines = file_.readlines()
            print("File:", filepath, file=output_file)
            parse_lines(lines, output_file)
            print(file=output_file)                                     # print a new line after the information for the current file
            files_parsed += 1
            lines_parsed += len(lines)                                  # add the number of lines read in the file
        elif os.path.isdir(filepath):
            a, b = traverse_directory(filepath, extension, output_file)
            files_parsed += a
            lines_parsed += b

    return files_parsed, lines_parsed

files_parsed, lines_parsed = traverse_directory(".", ".v", output_file)
output_file.close()
print("Files parsed:", files_parsed)
print("Lines parsed:", lines_parsed)
print("Parsed output has been saved to {}".format(parsed_files_filename))

Files parsed: 287
Lines parsed: 61208
Parsed output has been saved to parsed_file_output.txt


In [None]:
# Cell to parse the lines of just one file

parsed_files_filename = 'single_file_parsed_output1.txt'
output_file = open(parsed_files_filename, 'w')

def traverse_directory(directory, extension, output_file):
    files_parsed = 0
    lines_parsed = 0
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)

        if os.path.isfile(filepath) and filename.endswith(extension):
            with open(filepath, 'r') as file_:                          # the file is closed automatically when the block ends
                lines = file_.readlines()
            print("File:", filepath, file=output_file)
            parse_lines(lines, output_file)
            print(file=output_file)                                     # print a new line after the information for the current file
            files_parsed += 1
            lines_parsed += len(lines)                                  # add the number of lines read in the file
        elif os.path.isdir(filepath):
            a, b = traverse_directory(filepath, extension, output_file)
            files_parsed += a
            lines_parsed += b

    return files_parsed, lines_parsed

files_parsed, lines_parsed = traverse_directory(".", "RFSM.v", output_file)
output_file.close()
print("Files parsed:", files_parsed)
print("Lines parsed:", lines_parsed)
print("Parsed output has been saved to {}".format(parsed_files_filename))

Files parsed: 1
Lines parsed: 104
Parsed output has been saved to single_file_parsed_output1.txt
