# Group 6 Lab 2 Tasks 1 and 2

## Group Members
* **127820** - Catherine Nduta
* **130816** - Natalie Ndetei
* **134205** - Ryan Muema
* **119199** - Mutiku Adina
* **127690** - Sharon Mwangi
* **134583** - Jeffrey Ongicho
* **136667** - Benard Wanyande

## Task 1
- Write a program to identify programming lines (in the same programming language) as either comments or not.
- Further to this task:
  1. Describe the logic used.
  2. Were lexical analysis and/or syntax analysis concepts important in your logic? If yes, how?

## Write a program to identify programming lines (in the same programming language) as either comments or not.
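*The program below is written in Python and classifies lines of C-style source code, where comments are written as `//` or `/* ... */`.*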

In [26]:
import re

def identify_lines(input_code):
    lines = input_code.split('\n')
    in_comment = False  # True while inside an unclosed /* ... */ block
    for line in lines:
        if not in_comment and re.match(r'^\s*//', line):  # Single-line comment
            print("Comment:", line)
        elif not in_comment and re.match(r'^\s*/\*', line):  # Multi-line comment start
            # Stay in comment mode only if the block is not closed on this same line
            in_comment = not re.match(r'.*\*/\s*$', line.split('/*', 1)[1])
            print("Comment:", line)
        elif in_comment and re.match(r'.*\*/\s*$', line):  # Multi-line comment end
            in_comment = False
            print("Comment:", line)
        elif in_comment:  # Inside a multi-line comment
            print("Comment:", line)
        else:  # Code
            print("Code:", line)

# Example input code
input_code = """
int main() {
    // This is a single-line comment
    /* This is a
       multi-line comment */
    int x = 5; // Another single-line comment
    return 0;
}
"""

identify_lines(input_code)


Code: 
Code: int main() {
Comment:     // This is a single-line comment
Comment:     /* This is a
Comment:        multi-line comment */
Code:     int x = 5; // Another single-line comment
Code:     return 0;
Code: }
Code: 


## i. Describe the logic used

The Python code above defines a function named `identify_lines` that takes a string of source code and labels each line as either a comment or code. It combines regular expressions with a boolean flag that remembers whether a multi-line comment is still open. Here is a breakdown of the logic:

1. **Splitting Input Code into Lines:**
   - The input code is split into individual lines using `input_code.split('\n')`. This creates a list of lines, making it easier to process them one by one.

2. **Initializing the `in_comment` Variable:**
   - The `in_comment` variable is initialized as `False`. This variable is used to keep track of whether the current line is inside a multi-line comment.

3. **Iterating Through Each Line:**
   - A `for` loop iterates through each line in the input code.

4. **Checking for Single-Line Comments:**
   - The first condition uses `re.match(r'^\s*//', line)` to check whether the line starts with `//` after optional leading whitespace, which indicates a single-line comment.
   - If a single-line comment is found, the line is printed with the label "Comment:" and the loop moves on to the next line.

5. **Checking for Multi-Line Comments:**
   - The next two conditions look for the start of a multi-line comment (`/*` at the beginning of the line) and the end of one (`*/` at the end of the line).
   - A line that opens a multi-line comment without closing it sets `in_comment` to `True`, indicating that the following lines are inside the comment.
   - A line that closes the comment sets `in_comment` back to `False`.
   - The opening and closing lines are themselves printed as "Comment:" lines.

6. **Lines Inside a Comment and Code Lines:**
   - If none of the patterns match, the `elif in_comment` condition checks the flag. If it is set, the line lies inside a multi-line comment and is printed as "Comment:".
   - Otherwise the line is treated as code and printed as "Code:". Because the patterns are anchored to the start of the line, a line that contains code followed by a trailing comment, such as `int x = 5; // Another single-line comment`, is labelled as code, as the sample output shows.

7. **Example Input Code:**
   - The code at the end of the script demonstrates the `identify_lines` function with an example input code that includes single-line and multi-line comments as well as code lines.

8. **Output:**
   - The script prints each line with the label "Comment:" or "Code:" based on the identified category.
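
The patterns from steps 4 and 5 can be tried in isolation to see which lines each one matches. The snippet below is only a small sketch: the `PATTERNS` table and the `matching_patterns` helper are names introduced here for illustration and are not part of the program above.

```python
import re

# Hypothetical lookup table holding the three regular expressions
# that identify_lines relies on.
PATTERNS = {
    "single-line": r'^\s*//',
    "block start": r'^\s*/\*',
    "block end": r'.*\*/\s*$',
}

def matching_patterns(line):
    """Return the names of all patterns for which re.match succeeds on the line."""
    return [name for name, pattern in PATTERNS.items() if re.match(pattern, line)]

for sample in ["    // note", "    /* open", "   close */", "int x = 5; // trailing"]:
    print(repr(sample), "->", matching_patterns(sample))
```

The last sample matches none of the patterns because its `//` is not at the start of the line, which is why the program labels a line with a trailing comment as code.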

## ii. Were lexical analysis and/or syntax analysis concepts important in your logic? If yes, how?

Yes. Both lexical analysis and syntax analysis concepts are relevant to the program, although it applies them in a simplified way and is not a full lexer or parser. The points below explain how each concept relates to the logic:

**Lexical Analysis (Lexing):**
Lexical analysis, often referred to as "lexing" or "tokenization," involves breaking down a stream of characters (the input text) into meaningful units called tokens. In this code:

The splitting of the input code into individual lines using `input_code.split('\n')` can be seen as a rudimentary form of lexing. Each line becomes a token.

The regular expressions used to identify single-line comments (`r'^\s*//'`) and multi-line comments (`r'^\s*/\*'` and `r'.*\*/\s*$'`) effectively tokenize lines into "Comment" tokens or "Code" tokens.

**Syntax Analysis (Parsing):**

Syntax analysis, or "parsing," is the process of analyzing the structure of the code to determine if it conforms to a specified grammar or syntax. In this code:

The logic identifies whether each line is a comment or code, which is a basic form of syntax analysis. It doesn't enforce a specific grammar, but it does classify lines based on simple syntax rules (e.g., lines starting with `//` are considered single-line comments).

The use of `in_comment` as a flag represents a basic form of state management during parsing, which is common in more complex parsing tasks where the parser needs to keep track of context.

While this code doesn't handle full programming language syntax, it does perform a simplified form of syntax analysis by classifying lines based on patterns and context.
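
To make the link to lexical analysis more concrete, the sketch below scans each line character by character and separates the code part from the comment part, so a trailing `//` comment is detected as well. It is a simplified illustration rather than the program above: `split_code_and_comment` is a hypothetical helper, and it assumes C-style comments and ignores string literals, inside which `//` or `/*` would not really start a comment.

```python
def split_code_and_comment(lines):
    """Yield (code_text, comment_text) for each line of C-style source.

    Assumption: string and character literals are ignored, so a '//'
    inside a string would be treated as the start of a comment.
    """
    in_block = False  # state carried across lines, like in_comment above
    for line in lines:
        code, comment = [], []
        i = 0
        while i < len(line):
            pair = line[i:i + 2]
            if in_block:
                if pair == "*/":      # block comment closes here
                    comment.append(pair)
                    in_block = False
                    i += 2
                else:                 # still inside the block comment
                    comment.append(line[i])
                    i += 1
            elif pair == "//":        # rest of the line is a comment
                comment.append(line[i:])
                break
            elif pair == "/*":        # block comment opens here
                comment.append(pair)
                in_block = True
                i += 2
            else:                     # ordinary code character
                code.append(line[i])
                i += 1
        yield "".join(code).strip(), "".join(comment).strip()


source = [
    "int x = 5; // set x",
    "/* start",
    "   end */ return 0;",
]
for code, comment in split_code_and_comment(source):
    print(f"code={code!r} comment={comment!r}")
```

Working at the character level is closer to what a real lexer does, and the `in_block` variable plays the same role as the `in_comment` flag in the line-based version.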

## Task 2


In [29]:
import keyword

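def is_valid_identifier(identifier):
    # An empty string has no first character to inspect, so reject it up front
    if not identifier:
        return False

    # Reserved words cannot be used as identifiers
    if keyword.iskeyword(identifier):
        return False

    # The first character must be a letter or an underscore
    if not (identifier[0].isalpha() or identifier[0] == '_'):
        return False

    # Every remaining character must be a letter, a digit, or an underscore
    for char in identifier[1:]:
        if not (char.isalnum() or char == '_'):
            return False

    return True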

while True:
    identifier = input("Enter an identifier (or 'exit' to quit): ")

    if identifier.lower() == 'exit':
        break

    if is_valid_identifier(identifier):
        print(f"'{identifier}' is a valid Python identifier.")
    else:
        print(f"'{identifier}' is not a valid Python identifier.")


Enter an identifier (or 'exit' to quit): id
'id' is a valid Python identifier.
Enter an identifier (or 'exit' to quit): word
'word' is a valid Python identifier.
Enter an identifier (or 'exit' to quit): ""
'""' is not a valid Python identifier.
Enter an identifier (or 'exit' to quit): if
'if' is not a valid Python identifier.
Enter an identifier (or 'exit' to quit): else
'else' is not a valid Python identifier.
Enter an identifier (or 'exit' to quit): 123abc
'123abc' is not a valid Python identifier.
Enter an identifier (or 'exit' to quit): @invalid_identifier
'@invalid_identifier' is not a valid Python identifier.
Enter an identifier (or 'exit' to quit): my-var
'my-var' is not a valid Python identifier.
Enter an identifier (or 'exit' to quit): exit


## Logic behind the program explained

The Python program for checking the validity of identifiers uses specific rules and logic to determine whether an identifier is valid or not. Let's break down the logic step by step:

**Python Identifier Rules:**
1. An identifier must start with a letter (a-z or A-Z) or an underscore (_).
2. The remaining characters in an identifier can be letters (a-z, A-Z), digits (0-9), or underscores (_).
3. Identifiers are case-sensitive.
4. Python keywords (reserved words) cannot be used as identifiers.

**Logic Explained:**

1. **Using the `keyword` Module:**
   - The program uses Python's `keyword` module to check whether the given identifier is a Python keyword. Python keywords are reserved words with special meanings, and they cannot be used as identifiers.

2. **Checking the First Character:**
   - The program checks whether the first character of the identifier is a letter or an underscore (_), using `str.isalpha()` for the letter test. This follows Python's rule that an identifier must start with one of these characters.
   - If the first character is neither a letter nor an underscore, the identifier is considered invalid.

3. **Checking the Remaining Characters:**
   - The program iterates through the remaining characters of the identifier (starting from the second character).
   - For each character, it checks whether it is a letter or a digit (using `str.isalnum()`) or an underscore (_). These characters are allowed in Python identifiers.
   - If any character does not meet these criteria, the identifier is considered invalid.
   - Note that `str.isalpha()` and `str.isalnum()` also accept non-ASCII letters and digits, so the program is somewhat more permissive than the ASCII ranges (a-z, A-Z, 0-9) listed in the rules above.

4. **Case-Sensitivity:**
   - The program performs no case conversion on the identifier, so names like `myVar` and `myvar` are checked as two distinct identifiers. Only the `exit` command is compared case-insensitively.

5. **Output Messages:**
   - Depending on the validation checks, the program prints a message indicating whether the provided identifier is valid or not.

**Examples of Valid and Invalid Identifiers:**

- **Valid Identifiers:**
   - `my_variable`: Starts with a letter and contains only letters and an underscore.
   - `_private_var`: Starts with an underscore and includes valid characters.
   - `var2`: Starts with a letter and includes valid characters.

- **Invalid Identifiers:**
   - `123abc`: Starts with a digit, violating the rule that identifiers must start with a letter or underscore.
   - `@invalid_identifier`: Contains an invalid character (`@`).
   - `if`: A Python keyword and cannot be used as an identifier.
   - `my-var`: Contains a hyphen, which is not a valid character in Python identifiers.

The program applies these rules so that a name can be checked before it is used in Python code, where an invalid identifier would cause a `SyntaxError`.
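
As a cross-check, Python's standard library can perform the same validation: `str.isidentifier()` tests the lexical form of a name and `keyword.iskeyword()` rejects reserved words. The sketch below uses a hypothetical helper name, `is_valid_identifier_builtin`, and runs it on the examples listed above plus the empty string. Keep in mind that `str.isidentifier()` follows Python 3's full rules, which also allow many non-ASCII letters.

```python
import keyword

def is_valid_identifier_builtin(name):
    """Return True if name is a legal, non-keyword Python identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)

examples = ["my_variable", "_private_var", "var2",
            "123abc", "@invalid_identifier", "if", "my-var", ""]
for name in examples:
    verdict = "valid" if is_valid_identifier_builtin(name) else "not valid"
    print(f"{name!r}: {verdict}")
```

Comparing the results of this helper with the hand-written checks is a quick way to spot inputs on which the two approaches disagree.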

## i. Tools like Lex and Flex help users develop lexers to do tasks like the one given above. Using a Flex specification file, develop a simple lexer to validate identifiers.

### Installing dependencies

In [None]:
!sudo apt install flex bison

### Creating a flex specification file

In [31]:
%%writefile lexer.l
%{
#include <stdio.h>
%}

%%

[A-Za-z_][A-Za-z0-9_]* {
    printf("Valid Identifier: %s\n", yytext);
}

.|\n {
    printf("Invalid Identifier: %s\n", yytext);
}

%%

int main() {
    yylex();
    return 0;
}


Overwriting lexer.l


### Compiling the flex specification file

In [32]:
!flex lexer.l

### Compiling the C source file

In [33]:
!gcc -o lexer lex.yy.c -lfl

### Running the lexer

In [34]:
!./lexer


my_variable
Valid Identifier: my_variable
Invalid Identifier: 

123abc
Invalid Identifier: 1
Invalid Identifier: 2
Invalid Identifier: 3
Valid Identifier: abc
Invalid Identifier: 

@invalididentifier
Invalid Identifier: @
Valid Identifier: invalididentifier
Invalid Identifier: 

exit
Valid Identifier: exit
Invalid Identifier: 


Invalid Identifier: 

^C


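The lexer prints every run of characters that matches `[A-Za-z_][A-Za-z0-9_]*` as "Valid Identifier" and every other single character, including the newline at the end of each input line, as "Invalid Identifier". Because the rules are matched against pieces of a line rather than the whole line, `123abc` is reported as three invalid characters (`1`, `2`, `3`) followed by the valid identifier `abc`, and `@invalididentifier` as an invalid `@` followed by a valid identifier. The specification does not exclude reserved words or treat `exit` specially, so the session is ended with Ctrl+C.

The Flex specification file can be modified to handle these cases. One option would be to anchor the identifier rule to the full line with `^` and `$` and add a rule that consumes any other non-empty line, so that each input line produces a single verdict.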
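
The session above shows the rules being applied piece by piece within a line. For comparison, the sketch below validates each input line as a whole in Python, using the same character classes as the Flex pattern together with `re.fullmatch`. The names `IDENTIFIER` and `validate_line` are introduced here for illustration only, and like the Flex rule the pattern is ASCII-only and does not exclude keywords.

```python
import re

# Same character classes as the Flex rule: [A-Za-z_][A-Za-z0-9_]*
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def validate_line(line):
    """Classify a whole input line instead of individual fragments."""
    text = line.rstrip('\n')  # drop the trailing newline before matching
    if IDENTIFIER.fullmatch(text):
        return f"Valid Identifier: {text}"
    return f"Invalid Identifier: {text}"

for line in ["my_variable\n", "123abc\n", "@invalididentifier\n"]:
    print(validate_line(line))
```

With whole-line matching, `123abc` and `@invalididentifier` are each rejected once instead of being split into valid and invalid fragments.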