# ChatGPT Limits on Code Generation: An Example

### Practical Data Science: 1st Assignment

#### Dimitrios Tsirmpas 

## Case 1: Verifying Greek Tax Codes

The Greek Tax Identification Number (TIN) consists of 9 digits. The last digit is a check digit. It is calculated as follows:

1. Remove the check digit, so that we are left with an 8-digit number.
2. Take the 8 digits one by one, from the right to the left. 
3. Multiply each digit by the power of 2 corresponding to its position: the first from the right is multiplied by 2^1, the second by 2^2, and so on up to 2^8.
4. Sum these products.
5. Calculate the remainder of this sum when divided by 11.
6. Calculate the remainder of that result when divided by 10. This value must equal the check digit.
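
The steps above can be sketched in a few lines of Python. This is a minimal sketch rather than an official implementation; the function name `tin_check_digit` is an assumption introduced here for illustration and does not appear in the original notebook.

```python
def tin_check_digit(first_eight: str) -> int:
    """Compute the expected check digit for the first 8 digits of a Greek TIN.

    Hypothetical helper (not part of the original notebook); it follows the
    six steps listed above literally.
    """
    if len(first_eight) != 8 or not first_eight.isdigit():
        raise ValueError("expected exactly 8 digits")

    # Walk the digits from right to left; the rightmost digit gets weight 2^1.
    total = sum(int(ch) * 2 ** pos
                for pos, ch in enumerate(reversed(first_eight), start=1))

    # Remainder by 11, then remainder by 10.
    return (total % 11) % 10


print(tin_check_digit("09003433"))  # 7, the check digit of 090034337
```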

This is a pretty simple task, provided we know the algorithm. We will thus use it to check whether the LLM can piece together information about a relatively obscure, but publicly available topic.

To test the generated validators, we will build a simple helper function:

In [1]:
from typing import Callable


def print_check_tin(check_func: Callable[[str], bool], tin: str) -> None:
    """
    Print the result of a function validating a Greek TIN.
    :@param check_func: the validator function
    :@param tin: the TIN to be checked
    """
    if check_func(tin):
        print(f"The TIN {tin} is valid.")
    else:
        print(f"The TIN {tin} is not valid.")

We also define one valid and one invalid TIN:

In [2]:
correct_tin = "090034337"
false_tin = "090034336"

### Vague Prompt

We first assume that the model already knows enough about the Greek TIN, and therefore do not specify the calculation itself in our prompt. We give it the following prompt:

`Write a python program checking the validity of the Greek Tax Identification Number (TIN)`

The LLM produces the following result:

In [3]:
def is_valid_greek_tin(tin):
    # Check if the TIN has exactly 9 digits
    if len(tin) != 9:
        return False

    # Check if all characters are digits
    if not tin.isdigit():
        return False

    # Extract the last digit (check digit)
    check_digit = int(tin[-1])

    # Calculate the weighted sum of the first 8 digits
    weighted_sum = 0
    for i in range(8):
        digit = int(tin[i])
        if i % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        weighted_sum += digit

    # Calculate the remainder when dividing the weighted sum by 10
    remainder = weighted_sum % 10
    
    # Check if the check digit matches the remainder
    return check_digit == remainder

With the following comments:

`A Greek Tax Identification Number (TIN), also known as the AFM (Αριθμός Φορολογικού Μητρώου), follows a specific format and has rules for validity. A valid Greek TIN should consist of 9 digits, and it should satisfy specific mathematical criteria. Here's a Python program to check the validity of a Greek TIN:`

`This program checks the validity of a Greek TIN by verifying that it has 9 digits, all of which are numeric, and that it satisfies the mathematical criteria specified for Greek TINs. It calculates a weighted sum of the first 8 digits and compares the check digit with the remainder when dividing the weighted sum by 10 to determine validity.`

Let's now call the generated code from our template function:

In [4]:
print_check_tin(is_valid_greek_tin, correct_tin)
print_check_tin(is_valid_greek_tin, false_tin)

The TIN 090034337 is not valid.
The TIN 090034336 is not valid.


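We can see that the calculation is incorrect: even the valid TIN is rejected. If we examine the code, we notice that its overall structure is right (validate the input, compute a weighted sum of the first 8 digits, compare the result against the check digit), but the checksum itself is wrong. It doubles every other digit and subtracts 9 from two-digit results, which resembles the Luhn algorithm rather than the power-of-2 weighting, and it is missing the `modulo 11` operation at the end.

Notice how the model seems to know (or "believe" it knows) enough about our request to execute it and give us a description of the algorithm. Had we not checked the validity of the code ourselves, we would not have suspected an error.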
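
To see concretely why the valid TIN is rejected, the arithmetic of the generated function can be replayed by hand. The snippet below is a sketch that mirrors the loop of `is_valid_greek_tin` for `090034337`; the variable names `contributions` and `weighted_sum` are introduced here for illustration only.

```python
tin = "090034337"
check_digit = int(tin[-1])

# Replay the generated loop: digits at even 0-based positions are doubled,
# and 9 is subtracted whenever the doubled value exceeds 9.
contributions = []
for i, ch in enumerate(tin[:8]):
    digit = int(ch)
    if i % 2 == 0:
        digit *= 2
        if digit > 9:
            digit -= 9
    contributions.append(digit)

weighted_sum = sum(contributions)
print(contributions)                     # [0, 9, 0, 0, 6, 4, 6, 3]
print(weighted_sum, weighted_sum % 10)   # 28 8
print(check_digit)                       # 7
```

The generated scheme expects a check digit of 8, whereas the actual check digit is 7, so the valid TIN is rejected.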

Let's ask the LLM why this mistake took place:

**Prompt**:
```
Explain the calculation step by step.
```

**Answer**:
```
Sure, let's break down the calculation of the Greek Tax Identification Number (TIN) step by step:

    Input Validation:
        The TIN is initially checked to ensure it consists of exactly 9 digits.
        It is also confirmed that the TIN contains only numeric digits (0-9).

    Verification Digit Isolation:
        The last digit of the TIN (the 9th digit) is separated and temporarily removed for later verification.

    Initialization:
        A running total (total) is initialized to zero.

    Iterate Over the First 8 Digits:
        The program loops through the first 8 digits of the TIN (indices 0 to 7).

    Partial Sum Calculation:
        For each digit, it adds the digit to the running total.
        If the current digit is at an even position (0-based indexing, so 1st, 3rd, 5th, and 7th digits), it multiplies the digit by 2.
        If the result of this multiplication is greater than 4, it subtracts 9 from it.

    Calculate Verification Digit:
        After processing all 8 digits, we have a total that includes the sum of the digits as per the algorithm.
        The calculated verification digit is determined as the remainder of total when divided by 10.
        If the remainder is 0, the calculated verification digit is set to 10 (in accordance with the Greek TIN validation rules).

    Comparison:
        Finally, the calculated verification digit is compared to the original 9th digit of the TIN (which was temporarily removed). If they match, the TIN is considered valid.

Here's a simplified example:

Suppose you have a Greek TIN: 123456789.

    The program checks that it consists of 9 digits.
    It separates the verification digit (9) and temporarily removes it.
    The calculation starts with a total of 0.
    It iterates through the first 8 digits (1, 2, 3, 4, 5, 6, 7, 8).
    It adds the digits to total and applies the rules for even-indexed digits (e.g., 2 * 2 = 4, but since it's greater than 4, subtract 9 to get -5).
    The total becomes 1 + 4 - 2 + 8 - 4 + 6 - 6 + 7 = 14.
    The calculated verification digit is 14 % 10 = 4.
    The calculated verification digit (4) is compared to the original 9th digit (9), and they don't match, so this TIN would be considered invalid.

In a valid TIN, the calculated verification digit should match the original 9th digit.
```

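Note how the model fails to properly explain the internal logic behind its actions and instead attempts to explain the code line-by-line. Even this walkthrough contradicts the code in places: it subtracts 9 when the doubled digit exceeds 4 rather than 9, and it mentions a rule for a zero remainder that the code does not contain. This lack of understanding (or rather *incomplete* understanding) is most likely the source of the error.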

### Detailed Prompt

Let's now try a detailed prompt that spells out the computation rules and gives a worked example.

**Prompt**:

```
The Greek Tax Identification Number (TIN) consists of 9 digits. The last digit is a check digit. It is calculated as follows:

    We remove the check digit, so that we are left with an 8-digit number.
    We take the 8 digits one by one, from the right to the left. We multiply each digit by the power of 2 corresponding to its position: the first from the right will be multiplied by 2^1, the second will be multiplied by 2^2, and so on.
    We take these powers and we sum them.
    We calculate the remainder of this sum by 11.
    We take this remainder and we calculate its remainder by 10. The result must equal the check digit.

For example, let us say we have the TIN 090034337. The check digit is 7. The other digits are 09003433. We have: 3×2^1 + 3×2^2 + 4×2^3 + 3×2^4 + 9×2^7 = 3×2 + 3×4 + 4×8 + 3×16 + 9×128 = 6 + 12 + 32 + 48 + 1152 = 1250

Then, 1250 mod 11 = 7 and 7 mod 10 = 7

Write a program that will ask the user for TIN and will respond whether it is correct or not, with a function named 'is_valid_greek_tin2'.
```

**Answer**:

In [5]:
def is_valid_greek_tin2(tin):
    # Check if the TIN has exactly 9 digits
    if len(tin) != 9:
        return False

    # Check if all characters are digits
    if not tin.isdigit():
        return False

    # Extract the check digit from the TIN
    check_digit = int(tin[-1])

    # Remove the check digit to get the 8-digit number
    digits = [int(char) for char in tin[:-1]]

    # Calculate the weighted sum
    weighted_sum = 0
    for i, digit in enumerate(reversed(digits), start=1):
        weighted_sum += digit * (2 ** i)

    # Calculate the remainders
    remainder_1 = weighted_sum % 11
    remainder_2 = remainder_1 % 10

    # Check if the check digit matches the calculated remainder
    return remainder_2 == check_digit

This time the `modulo 11` rule is included. Let's check the function again:

In [6]:
print_check_tin(is_valid_greek_tin2, correct_tin)
print_check_tin(is_valid_greek_tin2, false_tin)

The TIN 090034337 is valid.
The TIN 090034336 is not valid.


Thus, by specifying the rules explicitly, the model is able to generate the correct code.

## Case 2: Calculating Orthodox Easter Sunday

To calculate Orthodox Easter Sunday for any year between 1900 and 2099, we can use the following algorithm:
```
Suppose that y is the year.

1. a = y mod 4
2. b = y mod 7
3. c = y mod 19
4. d = (19*c + 15) mod 30
5. e = (2*a + 4*b − d + 34) mod 7
6. month = ⌊(d + e + 114) / 31⌋ (where ⌊x⌋ means the integer part of x)
7. day = ((d + e + 114) mod 31) + 1

The result is the day and the month in the Julian calendar. To convert it to the Gregorian calendar, which we actually use, we have to add 13 days.  
```
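
For reference, the algorithm above can be sketched directly in Python. This is a hand-written sketch under the stated 1900 to 2099 assumption, not code from the original notebook; the name `orthodox_easter_reference` is hypothetical. It delegates the 13-day shift to `datetime.timedelta` so that any spill-over into the next month is handled by the standard library.

```python
import datetime


def orthodox_easter_reference(year: int) -> datetime.date:
    """Orthodox Easter Sunday (Gregorian calendar) for 1900 <= year <= 2099.

    Hypothetical reference helper, not part of the original notebook.
    """
    if not 1900 <= year <= 2099:
        raise ValueError("the 13-day offset only holds for 1900-2099")

    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month = (d + e + 114) // 31
    day = (d + e + 114) % 31 + 1

    # The month/day pair is a Julian-calendar date. Building a date object
    # from these numbers and adding 13 days lets the standard library handle
    # any roll-over into the following month.
    return datetime.date(year, month, day) + datetime.timedelta(days=13)


print(orthodox_easter_reference(2016))  # 2016-05-01
```

For 2016 the Julian result is 18 April; adding 13 days rolls over to 1 May, which is exactly the kind of edge case a month-by-month manual conversion has to get right.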

We will start by defining a test function which will check if the code generated by ChatGPT is correct. We will also need a set of precalculated, correct dates for some Easter Sundays.

In [7]:
import datetime


def test_sunday(easter_func: Callable[[int], datetime.date], expected: datetime.date) -> None:
    """
    Checks if a function correctly calculates Orthodox Easter Sunday and prints the result.
    :@param easter_func: a function that takes a year as an integer and returns a date object
    corresponding to the Orthodox Easter of that year
    :@param expected: the actual Orthodox Easter date
    """
    
    # never trust external code, especially from an LLM
    try:
        output = easter_func(expected.year) # external code call
    except Exception as e:
        print(f"Incorrect: Year {expected.year} ({e})")
        return
        
    if output == expected:
        print(f"Correct: Year {expected.year}")
    else:
        print(f"Incorrect: Year {expected.year} (expected {expected.day}/{expected.month}, "
        f"got {output.day}/{output.month})")


def test_all(easter_func: Callable[[int], datetime.date], dates: list[datetime.date]) -> None:
    """
    A wrapper executing a test for a list of dates. 
    @see test_sunday
    """
    for date in dates:
        test_sunday(easter_func, date)


correct_dates = [datetime.date(2012, 4, 15), 
                 datetime.date(2013, 5, 5), 
                 datetime.date(2014, 4, 20), 
                 datetime.date(2015, 4, 12), 
                 datetime.date(2016, 5, 1), 
                 datetime.date(2017, 4, 16)]

### Vague Prompt

We now query ChatGPT with the following prompt:

**Prompt:**
```
Write a program in python taking as input a year and which outputs the date of that year's Orthodox Easter. Must be valid for all years ranging from 1900 to 2099
```

**Answer**:

In [8]:
def orthodox_easter(year):
    # Julian calendar parameters
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month = (d + e + 114) // 31
    day = ((d + e + 114) % 31) + 1

    # Convert the Julian date to the Gregorian date
    if month == 3:
        gregorian_month = 4
        gregorian_day = day + 13
    else:
        gregorian_month = 3
        gregorian_day = day + 14

    # Check if the date is valid
    if gregorian_month > 4:
        gregorian_month -= 1

    # Adjust year if necessary
    if year < 1900:
        year += 100

    # Return the Orthodox Easter date
    return datetime.date(year, gregorian_month, gregorian_day)

With the following notes:

`Calculating the date of Orthodox Easter can be a bit complex because it is based on the Julian calendar and the Gregorian calendar, which have different rules for calculating the date of Easter. Here's a Python program that calculates the date of Orthodox Easter for a given year within the range of 1900 to 2099:`

`This program calculates the date of Orthodox Easter using the Julian calendar parameters and then converts it to the corresponding Gregorian date. It also validates the input year to ensure it falls within the specified range.`

Let's now test the code:

In [9]:
test_all(orthodox_easter, correct_dates)

Incorrect: Year 2012 (expected 15/4, got 16/3)
Incorrect: Year 2013 (day is out of range for month)
Incorrect: Year 2014 (expected 20/4, got 21/3)
Incorrect: Year 2015 (day is out of range for month)
Incorrect: Year 2016 (day is out of range for month)
Incorrect: Year 2017 (expected 16/4, got 17/3)


While the code superficially seems correct, it is evident that some logic errors have occurred.

Of note is the fact that some years raise invalid date exceptions, indicating a clear lack of edge-case handling.

### Detailed Prompt

Our original prompt was purposefully vague, which could explain the inaccuracies and the lack of edge-case handling on the part of the LLM. Let's try giving a more detailed prompt instead:

**Prompt:**
```
Write a python program that calculates Orthodox Easter Sunday for any year.

To calculate Orthodox Easter Sunday for any year between 1900 and 2099, we can use the following algorithm. Suppose that y is the year.

a = y mod 4
b = y mod 7
c = y mod 19
d = (19c + 15) mod 30
e = (2a + 4b − d + 34) mod 7
month = ⌊(d + e + 114) / 31⌋
(Where the symbol ⌊x⌋ means the integer part of x, that is, x round down towards 0)
day = ((d + e + 114) mod 31) + 1

The result is the day and the month in the Julian calendar. To convert it to the Gregorian calendar, which we actually use, we have to add 13 days. Be careful, this may change the month.
```

The generated code is as follows:

In [10]:
def calculate_orthodox_easter(year):
    # Calculate parameters
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month = (d + e + 114) // 31
    day = ((d + e + 114) % 31) + 1

    # Convert to Gregorian calendar
    gregorian_month = month + 3  # Add 3 to the month to convert to Gregorian
    if gregorian_month > 12:
        gregorian_month -= 12  # Adjust if the month is greater than 12
    gregorian_day = day + 13  # Add 13 days to the day

    # Check if the date is valid
    if gregorian_month == 4 and gregorian_day > 30:
        gregorian_day -= 30
        gregorian_month += 1

    # Adjust the year if necessary
    if year < 1900:
        year += 100

    # Return the Orthodox Easter date in the Gregorian calendar
    return datetime.date(year, gregorian_month, gregorian_day)


With the following comments:

`Here's a Python program that calculates Orthodox Easter Sunday for any year between 1900 and 2099 using the provided algorithm and converts it to the Gregorian calendar:`

`This program calculates the Orthodox Easter Sunday for any year between 1900 and 2099 using the provided algorithm and then converts it to the Gregorian calendar. It also handles cases where the date may cross over to the next month or year.`

If we now test the newly generated function, we see similarly incorrect results:

In [11]:
test_all(calculate_orthodox_easter, correct_dates)

Incorrect: Year 2012 (expected 15/4, got 15/7)
Incorrect: Year 2013 (day is out of range for month)
Incorrect: Year 2014 (expected 20/4, got 20/7)
Incorrect: Year 2015 (day is out of range for month)
Incorrect: Year 2016 (expected 1/5, got 31/7)
Incorrect: Year 2017 (expected 16/4, got 16/7)


We can therefore conclude that the issue was not the vague prompt, but rather the LLM's inability to generate correct code for this more complex problem: even with the algorithm spelled out, the conversion from the Julian to the Gregorian date is mishandled.
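
One plausible way to localize the failure, offered here as a sketch rather than as the author's own analysis: since `d` is a remainder modulo 30 and `e` a remainder modulo 7, the quantity `d + e + 114` is bounded, which limits the values `month` can take. The names `lowest`, `highest` and `possible_months` are introduced here for illustration.

```python
# d = (...) mod 30 lies in 0..29 and e = (...) mod 7 lies in 0..6,
# so d + e + 114 lies between 114 and 149 inclusive.
lowest = 0 + 0 + 114
highest = 29 + 6 + 114

# Integer division by 31 gives every value `month` could possibly take.
possible_months = sorted({total // 31 for total in range(lowest, highest + 1)})
print(possible_months)  # [3, 4]
```

Under this bound, `month` is already a calendar month number (March or April). The `month + 3` step in `calculate_orthodox_easter` therefore shifts every result to June or July, which is consistent with the `/7` dates in the output above, while `orthodox_easter` swaps the two months instead of keeping them.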

## Conclusion

In this notebook we analyzed two cases where prompting ChatGPT to generate code for some mildly complex problems failed.

- Our first case demonstrates the danger of assuming that the LLM has the knowledge its task requires, even when it assures us that it knows enough to complete it.

- Our second case demonstrates that a sufficiently complex task can cause the LLM's code to fail, even when appropriate guidance is provided.

In any case, we should rigorously check and test external code, be it from other humans or LLMs.