# Strings

<div class="alert alert-block alert-info">
    You can find all of the C programs in this notebook in the subdirectory containing this notebook:
    <code>./src/strings</code>
</div>

A string is a sequence of characters, usually meant to represent human-readable text. Unlike most modern
general-purpose programming languages, C does not have a dedicated string type. A string in C is simply an
array of characters whose end is marked by a character called the *null terminator*, which has the
integer value `0`.

The C standard library provides support for strings where the character type is `char` via the header file
`<string.h>`. Because `char` is typically an 8-bit type on most modern architectures, many written languages other
than English are not well supported.

The C90 standard introduced support for *wide strings* using the character type `wchar_t` which is defined
as "a distinct type whose values can represent distinct codes for all members of the largest extended 
character set specified among the supported locales". On most modern architectures, `wchar_t` is either
16 or 32 bits in width.

Characters are integer values that are mapped to symbols using some specified encoding scheme.
There are many different encoding schemes in wide use (see <https://en.wikipedia.org/wiki/Character_encoding>)
and the C standard does not specify any particular encoding. Programmers concerned with internationalization
must look to external libraries for support.

For our purposes, strings are assumed to be arrays of `char` using the POSIX C locale.

## Null terminated strings

A proper C string must be null terminated so that string processing functions can determine the end of the
string. The terminator character is `'\0'` which has the numeric value `0`.

Do not mistake the character `'0'` for the null terminator!

In [None]:
// terminator.c

#include <stdio.h>

int main(void) {
    char zero = '0';
    char terminator = '\0';
    printf("The numeric value of \'0\' is %d\n", zero);
    printf("The numeric value of \'\\0\' is %d\n", terminator);
    
    return 0;
}

The programmer must remember to allocate space for the null terminator when creating an array or allocating
memory for a string. Failure to allocate space for the null terminator, or failure to include the null terminator
at the end of a string are common programming errors that can be difficult to debug. The following program
attempts to print two strings, neither of which is correctly terminated. The behavior of the program is
undefined; on the author's computer, the first string is printed incorrectly:

In [None]:
// missing_terminator.c

#include <stdio.h>

int main(void) {
    
    char s1[4] = {'a', 'b', 'c', 'd'};
    char s2[4] = {'a', 'b', 'c', 'd'};
    
    printf("s1 = %s\n", s1);
    printf("s2 = %s\n", s2);
    
    return 0;
}
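
One way to guard against the mistake above is to check that a terminator actually exists within the bounds of
the array before treating the array as a string. The following is a minimal sketch, assuming the size of the
array is known; `is_terminated` is a hypothetical helper (it is not part of the standard library) built on the
standard `memchr` function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: returns true if a null terminator occurs
// somewhere in the first size bytes of buf.
bool is_terminated(const char *buf, size_t size) {
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}
```

For the arrays in `missing_terminator.c`, `is_terminated(s1, sizeof(s1))` returns `false`; declaring the array
as `char s1[5] = {'a', 'b', 'c', 'd', '\0'};` makes the check succeed.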

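The following program contains a struct intended to store a string of length up to 4 and an integer value. The
array has no room for the null terminator of a string of length 4, so writing the terminator is an out-of-bounds
write with undefined behavior; on typical implementations it overwrites part of the integer value stored in the
struct: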

In [None]:
// buffer_overrun.c

#include <stdio.h>

struct t {
    char s[4];
    int i;
};

int main(void) {
    struct t tmp;
    
    // empty string and 99?
    tmp.s[0] = '\0';
    tmp.i = 99;

    printf("tmp.s = %s\n", tmp.s);
    printf("tmp.i = %d\n", tmp.i);

    // "abcd" and 99?
    tmp.s[0] = 'a';
    tmp.s[1] = 'b';
    tmp.s[2] = 'c';
    tmp.s[3] = 'd';
    tmp.s[4] = '\0';

    printf("tmp.s = %s\n", tmp.s);
    printf("tmp.i = %d\n", tmp.i);

    return 0;
}

The null terminator is not necessarily the last element of an array representing a string, and
there may be more than one null terminator in the array. It is the first null terminator at or after the start
of the string that marks the end of the string:

In [None]:
// multiple_terminators.c

#include <stdio.h>

int main(void) {
    
    char s[] = {'a', 'b', 'c', '\0', 
                'd', 'e', 'f', '\0', 
                'g', 'h', 'i', '\0'};
    
    printf("s     = %s\n", s);
    printf("s + 4 = %s\n", s + 4);
    printf("s + 8 = %s\n", s + 8);
    
    return 0;
}

## String literals

A string literal is a sequence of characters enclosed by double quotes (as in Java). The null terminator
need not be part of the quoted sequence; the compiler will append a terminator character to the end of the
sequence. The size of the literal is equal to the number of characters in the literal plus 1, even if one or
more of the characters is the null character.

In [None]:
// literal.c

#include <stdio.h>
#include <string.h>

int main(void) {
    char s[] = "abc";
    printf("size   : %zu\n", sizeof(s));
    printf("length : %zu\n", strlen(s));

    char s2[] = "abc\0def";   // weird
    printf("size   : %zu\n", sizeof(s2));
    printf("length : %zu\n", strlen(s2));
    
    return 0;
}

String literals have static storage duration. You should assume that literals are read-only; the C standard
states that attempting to modify the array that stores the literal results in
undefined behavior.

In [None]:
// literal2.c

#include <stdio.h>

char *hello(void) {
    // static storage duration, "HELLO" exists for the lifetime of the program
    char *s = "HELLO";
    return s;
}

int main(void) {
    // ok, gets a pointer to the string and prints the string
    char *str = hello();
    printf("%s\n", str);
    
    // UNCOMMENT NEXT LINE, possible error: attempt to modify a string literal
    // str[0] = 'h';
    
    return 0;
}

If a literal is used to initialize an array, then the characters of the literal including the null terminator
are copied into the new array resulting in two independent copies of the string. The literal remains read-only,
but the newly initialized array may be modified (if it is not `const`):

In [None]:
// literal3.c

#include <stdio.h>

int main(void) {
    char str[] = "hello";
    // ok, modifies the array str
    str[0] = 'H';
    
    printf("%s\n", str);
    
    return 0;
}

## `<string.h>`

The header file `<string.h>` contains declarations of the C standard library functions that operate on strings
as well as some functions that perform memory manipulation of byte arrays. Documentation for all of the
functions can be found at <https://en.cppreference.com/w/c/string/byte>. This notebook discusses only a subset
of the available functions.

#### `strlen`

`strlen` returns the length of a specified null terminated string:

```c
size_t strlen(const char *str);
```

The function counts the number of characters starting at the specified pointer until the null terminator is found.
Undefined behavior results if the specified string is not null terminated or if the pointer is a null pointer.

In [None]:
// length.c

#include <stdio.h>
#include <string.h>

int main(void) {
    char s[] = "abc";
    size_t len = strlen(s);
    printf("s = %s, length = %zu\n", s, len);

    char s2[] = "abc\0def";   // weird
    len = strlen(s2);
    printf("s2 = %s, length = %zu\n", s2, len);
    
    return 0;
}

The function must iterate over the sequence of characters in the string until the null terminator is found because
a pointer to `char` carries no information about the length of the array that it points into. Several possible
implementations are shown below:

```c
size_t strlen(const char *str) {
    size_t i = 0;
    while (str[i] != '\0') {
        i++;
    }
    return i;
}
```

```c
size_t strlen(const char *str) {
    size_t i = 0;
    while (*str++ != '\0') {
        i++;
    }
    return i;
}
```

```c
size_t strlen(const char *str) {
    const char *s;
    for (s = str; *s; s++) {
        // do nothing
    }
    return s - str;
}
```

#### `strcpy`

`strcpy` copies the characters (including the null terminator) from a source string pointed at by `src` 
into a destination character array pointed at by `dest`:

```c
char *strcpy(char *dest, const char *src);
```

The returned pointer is equal to `dest` which allows the programmer to pass the return value to another function
(such as `printf`, for example).

In [None]:
// copy.c

#include <stdio.h>
#include <string.h>

int main(void) {
    char src[] = "01234";
    size_t len = strlen(src);
    
    // len+1 to make sure that there is space for the null terminator
    char dest1[len + 1];
    strcpy(dest1, src);
    printf("src   = %s\n", src);
    printf("dest1 = %s\n", dest1);
    
    // print the return value instead
    char dest2[len + 1];
    printf("dest2 = %s\n", strcpy(dest2, src));
    
    // copy into the middle of the destination array
    char dest3[] = "abcde-----";
    strcpy(dest3 + len, src);
    printf("dest3 = %s\n", dest3);
    
    // copy the end of a string into a destination array
    char *s = "CISC220";
    char dest4[4];
    strcpy(dest4, s + 4);
    printf("dest4 = %s\n", dest4);
    
    return 0;
}

The behavior of `strcpy` is undefined if:

* the `dest` array is not large enough, or
* the strings overlap, or
* either `dest` is not a pointer to a character array or `src` is not a pointer to a null-terminated byte string

`strcpy` iterates over the characters of `src` until it encounters the null terminator character copying each
character into the destination array. Two possible implementations are shown below:

```c
char * strcpy(char *dest, const char *src) {
    size_t i;
    for (i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }
    dest[i] = src[i];  // copies null terminator
    return dest;
}
```

```c
char * strcpy(char *dest, const char *src)  {
    char *save = dest;
    while (*dest++ = *src++) {  // tricky!
        // do nothing 
    }
    return save;
}
```

Note that a function such as `strcpy` is required to copy strings because strings are simply arrays and arrays
cannot be copied via assignment in C. For example, the following is an error:

```c
char s[100];  // un-initialized array
s = "hello";  // error, no array assignment in C
```
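
To store `"hello"` in an existing array, the characters must be copied instead. The following is a minimal
sketch; `set_text` is a hypothetical helper (not a standard function) that refuses to copy when the destination
is too small:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copies text into dest only if it fits,
// including the null terminator. size is the capacity of dest.
// Returns false and leaves dest unchanged if text does not fit.
bool set_text(char *dest, size_t size, const char *text) {
    assert(dest != NULL && text != NULL);
    if (strlen(text) + 1 > size) {
        return false;
    }
    strcpy(dest, text);
    return true;
}
```

With `char s[100];`, the call `set_text(s, sizeof(s), "hello")` has the effect that the failed assignment was
trying to achieve.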

#### `strcat`

`strcat` concatenates a null terminated source string pointed at by `src` to the end of a
null terminated destination string pointed at by `dest`:

```c
char *strcat(char *dest, const char *src);
```

The returned pointer is equal to `dest` which allows the programmer to pass the return value to another function
(such as `printf`, for example).

In [None]:
// concat.c

#include <stdio.h>
#include <string.h>

int main(void) {
    // change 11 to a smaller value to see what happens when dest is too small
    char s[11] = "01234";   
    char t[] = "56789";
    
    // concatenate s and t
    // size of s must be big enough
    // to hold the final string plus
    // the null terminator
    strcat(s, t);
    printf("s = %s\n", s);
    
    return 0;
}


The behavior of `strcat` is undefined if:

* the `dest` array is not large enough to hold the concatenated string, or
* the strings overlap, or
* either `dest` or `src` is not a pointer to a null-terminated byte string

Note that the second condition implies that `strcat` cannot be used to safely concatenate a string with itself.
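
If a string really does need to be appended to itself, the copy can be done without `strcat`. The sketch below
assumes the caller knows the capacity of the array; `append_self` is a hypothetical helper. It relies on the
fact that the original characters and the region that receives the copy do not overlap, so `memcpy` is safe here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: appends the string in s to itself.
// size is the capacity of the array holding s. Returns false
// (leaving s unchanged) if the doubled string does not fit.
bool append_self(char *s, size_t size) {
    assert(s != NULL);
    size_t len = strlen(s);
    if (2 * len + 1 > size) {
        return false;
    }
    // source [0, len) and destination [len, 2 * len) do not overlap
    memcpy(s + len, s, len);
    s[2 * len] = '\0';
    return true;
}
```

For example, an array declared as `char s[11] = "01234";` has exactly enough room for the doubled string
`"0123401234"` and its terminator.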

#### `strcmp`

Applying `==` or `!=` to two arrays compares the addresses of the arrays, not their contents, so these
operators cannot be used to test strings for equality. To test if two
strings pointed at by `s` and `t` are equal use:

```c
bool eq = strcmp(s, t) == 0;
```

To test if two strings pointed at by `s` and `t` are unequal use:

```c
bool not_eq = strcmp(s, t) != 0;
```

The `strcmp` function is similar to Java's `String.compareTo` method. It compares two null terminated
strings `lhs` and `rhs` for lexicographical order:

```c
int strcmp(const char *lhs, const char *rhs);
```

The sign of the result is the sign of the difference between the values of the first pair of characters
(both interpreted as `unsigned char`) that differ in the strings being compared. In other words, the
result is:

* negative if `lhs` comes before `rhs` in lexicographical order,
* positive if `lhs` comes after `rhs` in lexicographical order,
* `0` if `lhs` and `rhs` are equal 

The behavior of `strcmp` is undefined if either `lhs` or `rhs` is not a pointer to a null-terminated byte string.
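
Because `strcmp` reports ordering and not just equality, it can serve as the basis of a comparison function for
sorting. The following sketch sorts an array of pointers to strings with the standard `qsort` function from
`<stdlib.h>`; the names `compare_strings` and `sort_strings` are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// qsort passes pointers to the array elements; each element is
// itself a pointer to the first character of a string.
static int compare_strings(const void *a, const void *b) {
    const char *const *lhs = a;
    const char *const *rhs = b;
    return strcmp(*lhs, *rhs);
}

// Hypothetical helper: sorts n strings into lexicographical order.
void sort_strings(const char *strings[], size_t n) {
    assert(strings != NULL);
    qsort(strings, n, sizeof(strings[0]), compare_strings);
}
```

Only the pointers in the array are rearranged; the characters of the strings themselves are not moved.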

The following program, which you should run in an actual terminal, compares two strings provided as command line
arguments for lexicographical order:

```c
// compare.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: compare str1 str2\n");
        exit(1);
    }
    char *str1 = argv[1];
    char *str2 = argv[2];
    int res = strcmp(str1, str2);
    if (res == 0) {
        printf("%s and %s are equal\n", str1, str2);
    }
    else if (res < 0) {
        printf("%s comes before %s\n", str1, str2);
    }
    else {
        printf("%s comes after %s\n", str1, str2);
    }
    
    return 0;
}
```

#### `strchr` and `strrchr`

`strchr` and `strrchr` are similar to Java's `String.indexOf(char)` and `String.lastIndexOf(char)` methods 
in that they find the first and last occurrence of a specified character in a string:

```c
char *strchr(const char *str, int ch);   // find first occurrence

char *strrchr(const char *str, int ch);  // find last occurrence
```

The use of an `int` parameter for the character is to maintain compatibility with older C code. 

The returned pointer points at the first or last occurrence of the character, or is a null pointer if the
character is not found.

Pointer subtraction may be used if the index of the found character is desired:

In [None]:
// indexes.c

#include <stddef.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "CISC220";
    
    char *first = strchr(str, 'C');
    char *last = strrchr(str, 'C');
    
    ptrdiff_t i_first = first - str;
    ptrdiff_t i_last = last - str;
    
    printf("index of first C = %td\n", i_first);
    printf("index of last C  = %td\n", i_last);
    
    // change C to c
    *first = 'c';
    *last = 'c';
    printf("%s\n", str);
    
    return 0;
}
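
Because a null pointer signals that the character was not found, the result should be checked before it is
dereferenced or used in pointer subtraction. The sketch below uses that check as a loop condition to count the
occurrences of a character; `count_char` is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: counts the occurrences of ch in str.
// ch must not be the null terminator, because strchr treats the
// terminator as part of the string and the search would then
// continue past the end of the string.
size_t count_char(const char *str, char ch) {
    assert(str != NULL && ch != '\0');
    size_t n = 0;
    const char *p = strchr(str, ch);
    while (p != NULL) {
        n++;
        p = strchr(p + 1, ch);   // resume just after the match
    }
    return n;
}
```

The loop ends as soon as `strchr` returns a null pointer, so a character that never occurs yields a count of zero.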

#### `strstr`

`strstr` is similar to Java's `String.indexOf(String)` method in that it finds the first occurrence of a
specified substring pointed at by `sub` in a string pointed at by `str`:

```c
char *strstr(const char *str, const char *sub);
```

The returned pointer points at the first occurrence of the substring, or is a null pointer if the
substring is not found.

The following program, which you should run in an actual terminal, searches for a substring in a string using `strstr`:

```c
// substr.c

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: substr str sub\n");
        exit(1);
    }
    char *str = argv[1];
    char *sub = argv[2];
    char *p = strstr(str, sub);
    if (!p) {
        printf("%s does not occur in %s\n", sub, str);
    }
    else {
        ptrdiff_t index = p - str;
        printf("%s starts at index %td in %s\n", sub, index, str);
    }
    
    return 0;
}
```

### Memory manipulation

The `<string.h>` header file is somewhat unusual in that it also declares functions that operate on
memory buffers. These functions set, copy, or move some specified number of bytes of memory so the programmer
must remember to account for the size of elements in the buffer when using these functions.

#### `memcpy`

`memcpy` is typically the fastest standard library routine for memory to memory copying.
`memcpy` copies a specified number of bytes `count` from a source object pointed at by `src` to a destination
object pointed at by `dest`:

```c
void *memcpy(void *dest, const void *src, size_t count);
```

The returned pointer is equal to `dest`.

The following program illustrates the use of `memcpy` to copy both parts of arrays and entire arrays of various
types:

In [None]:
// memcpy.c

#include <stdio.h>
#include <string.h>

void print_iarr(const int *a, size_t n) {
    printf("%d", a[0]);
    for (size_t i = 1; i < n; i++) {
        printf(", %d", a[i]);
    }
}

void print_darr(const double *a, size_t n) {
    printf("%f", a[0]);
    for (size_t i = 1; i < n; i++) {
        printf(", %f", a[i]);
    }
}

int main(void) {
    // copy entire array of int
    size_t count = 4;
    int x[] = {1, 2, 3, 4};
    int y[count];
    memcpy(y, x, count * sizeof(int));
    printf("destination : ");
    print_iarr(y, count);
    printf("\n");
    
    // copy last count elements of array of double
    double dbl[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0};
    double cpy[count];
    memcpy(cpy, dbl + 3, count * sizeof(double));
    printf("destination : ");
    print_darr(cpy, count);
    printf("\n");
    
    // copy substring into another string
    char src[] = "this that then";
    char dest[] = "why what when";
    memcpy(dest + 4, src + 5, count);
    printf("destination : %s\n", dest);
    
    return 0;
}

The behavior of `memcpy` is undefined if:

* memory outside the bounds of the source or destination object is accessed, or
* `src` or `dest` are invalid or null pointers, or
* the source and destination regions overlap (the memory being copied from cannot overlap with the
memory being copied into)

#### `memmove`

`memmove` is similar to `memcpy` except that the memory being copied may overlap with the
memory being copied into. `memmove` is generally slower than `memcpy` because extra computation is required
to handle the case of overlapping memory. If the source and destination memory do not overlap, then `memcpy`
should be used instead.

```c
void *memmove(void *dest, const void *src, size_t count);
```

The returned pointer is equal to `dest`.

Note that even though `move` is part of the function name, the function copies bytes from the source
object (it does not move bytes from the source object); thus, the source object remains unchanged.

The following program illustrates the use of `memmove` to remove one element
from an array by shifting all of the following elements forward in the array; change the value of
`index` to remove a different element:

In [None]:
// memmove.c

#include <stdio.h>
#include <string.h>

void print_iarr(const int *a, size_t n) {
    printf("%d", a[0]);
    for (size_t i = 1; i < n; i++) {
        printf(", %d", a[i]);
    }
}


int main(void) {
    
    int x[] = {1, 2, 3, 4, 5, 6};
    
    printf("before : ");
    print_iarr(x, 6);
    printf("\n");

    size_t index = 4;
    int *dest = x + index;
    int *src = dest + 1;
    size_t count = 6 - index - 1;
    memmove(dest, src, count * sizeof(int));
    
    printf("after  : ");
    print_iarr(x, 6);
    printf("\n");
    
    return 0;
}
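
The same function can open a gap in an array. The following sketch shifts elements toward the end of the array
to insert a value, discarding the last element; `insert_at` is a hypothetical helper. The source and destination
regions overlap, which is why `memmove` is used here and not `memcpy`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: inserts value at position index of the
// n-element array a. Elements from index onward move one position
// toward the end; the former last element is discarded.
void insert_at(int *a, size_t n, size_t index, int value) {
    assert(a != NULL && index < n);
    size_t count = n - index - 1;   // number of elements to shift
    memmove(a + index + 1, a + index, count * sizeof(a[0]));
    a[index] = value;
}
```

Inserting `99` at index `2` of the array `{1, 2, 3, 4, 5, 6}` produces `{1, 2, 99, 3, 4, 5}`.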

## `sprintf`

The `<stdio.h>` header file declares the `sprintf` function for writing formatted data into a `char` buffer:

```c
int sprintf(char *buffer, const char *format, ...);
```

The function behaves similarly to the `printf` function except that the formatted data is written into a
buffer instead of standard output. The null terminator is always written in the buffer.

The returned value is the number of characters (not including the null terminator) that were written into
the buffer.

The behavior of `sprintf` is undefined if:

* the string to be written plus the null terminator exceeds the buffer length, or
* the `format` string contains an invalid conversion, or
* `buffer` overlaps with one or more of the data arguments
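
The first condition can be avoided with `snprintf`, a bounded variant of `sprintf` added in C99 and also
declared in `<stdio.h>`. It is not covered elsewhere in this notebook, so treat the following as a sketch.
`snprintf` writes at most `size - 1` characters plus the null terminator and returns the number of characters
that the complete output would have required; `format_date` is a hypothetical helper that uses the return value
to detect truncation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: formats a date as yyyy-mm-dd into buf,
// which has room for size characters including the terminator.
// Returns true only if the whole date fit in the buffer.
bool format_date(char *buf, size_t size, unsigned int year,
                 unsigned int month, unsigned int day) {
    assert(buf != NULL);
    int n = snprintf(buf, size, "%04u-%02u-%02u", year, month, day);
    // n is the length the full output needs, excluding the terminator
    return n >= 0 && (size_t) n < size;
}
```

When the buffer is too small the output is truncated but still null terminated, so the buffer can be printed
safely even after a failed call.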

The following example illustrates the use of `sprintf` to copy a string; the example performs most of the
same steps as the `strcpy` example shown above:

In [None]:
// copy_with_sprintf.c

#include <stdio.h>
#include <string.h>

int main(void) {
    char src[] = "01234";
    size_t len = strlen(src);
    
    // len+1 to make sure that there is space for the null terminator
    char dest1[len + 1];
    int n = sprintf(dest1, "%s", src);
    printf("src   = %s\n", src);
    printf("dest1 = %s\n", dest1);
    printf("number of characters written = %d\n", n);
    
    // copy into the middle of the destination array
    char dest3[] = "abcde-----";
    n = sprintf(dest3 + len, "%s", src);
    printf("dest3 = %s\n", dest3);
    printf("number of characters written = %d\n", n);
    
    // copy the end of a string into a destination array
    char *s = "CISC220";
    char dest4[4];
    n = sprintf(dest4, "%s", s + 4);
    printf("dest4 = %s\n", dest4);
    printf("number of characters written = %d\n", n);
    
    return 0;
}

Functionality similar to `strcat` can also be achieved using `sprintf`.
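
One way to do this, sketched below, is to start writing at the terminator of the destination string.
`append_with_sprintf` is a hypothetical helper; as with `strcat`, the caller is assumed to have made the
destination array large enough, and the two strings must not overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: appends src to the string stored in dest.
// Returns the number of characters appended (not including the
// null terminator), as reported by sprintf.
int append_with_sprintf(char *dest, const char *src) {
    assert(dest != NULL && src != NULL);
    size_t len = strlen(dest);
    // dest + len points at the current terminator of dest
    return sprintf(dest + len, "%s", src);
}
```

Appending `"56789"` to an array declared as `char s[11] = "01234";` leaves `"0123456789"` in `s`, the same
result as the `strcat` example.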

Java's string concatenation operator `+` in combination with the `toString` method that every object
is guaranteed to have makes it very easy to produce strings from any source types. `sprintf` provides
similar functionality, albeit in a less convenient way:

In [None]:
// concat_with_sprintf.c

#include <stdio.h>
#include <string.h>

int main(void) {
    // format into Queen's course code
    char dept[] = "CISC";
    unsigned int num = 220;
    char section = 'A';
    char course[9];
    sprintf(course, "%s%u%c", dept, num, section);
    puts(course);
    
    // format into yyyy-mm-dd, ensure leading zeros
    unsigned int year = 2023;
    unsigned int month = 10;
    unsigned int day = 3;
    char date[11];
    sprintf(date, "%04u-%02u-%02u", year, month, day);
    puts(date);
    
    // format a space into a postal code
    char pcode[] = "A1B2C3";
    char formatted[8];
    sprintf(formatted, "%.3s %.3s", pcode, pcode + 3);
    puts(formatted);
    
    return 0;
}

## Reading formatted data from a string

A common problem encountered in programming is extracting data from a string having some well-defined format.
For example, given a string such as:

```
"2023-01-29"
```

extract the year, month, and day as unsigned integer values. The formatting of such a string is always:

* 4 digits representing the year, followed by
* a hyphen, followed by
* 2 digits representing the month, followed by
* a hyphen, followed by
* 2 digits representing the day

The `sscanf` function (declared in `<stdio.h>`) reads data from a string described by a second string
that instructs the function how to match and convert the pieces of data encoded in the first string.

```c
int sscanf( const char *buffer, const char *format, ... );
```

`buffer` is a pointer to the first character of a null terminated string that contains the formatted data.
`format` is a pointer to the first character of a null terminated string that describes how to 
match and convert the pieces of data encoded in `buffer`.
Any additional function arguments are pointers to objects that `sscanf` will assign the data values to.
The additional arguments are called *receiving arguments* in the documentation found
at <https://en.cppreference.com/w/c/io/fscanf> because the arguments receive their values as they
are converted by `sscanf`. The receiving arguments must be pointers to types that can store the
converted values.

`sscanf` reads the string `buffer` from left to right attempting to match characters and conversions
specified by `format`. `sscanf` stops reading `buffer` as soon as a failed match or conversion occurs.

The return value is equal to the number of receiving arguments that are successfully assigned to.
The return value is equal to zero if no receiving arguments are successfully assigned to.
The return value is equal to `EOF` if the end of `buffer` is reached before a conversion or match is attempted.

The `format` string uses conversion specifiers similar to those used by `printf` to describe the data
string format. For example, the conversion specifier `%u` indicates the expected presence of an `unsigned int`.
Any characters that are not part of a conversion specifier are treated as characters to be matched in
the data string.

The following program illustrates the year-month-day example described above:

In [None]:
// sscanf_date1.c

#include <stdio.h>

int main(void) {
    char data[] = "2023-01-29";
    char fmt[] = "%u-%u-%u";
    unsigned int year;
    unsigned int month;
    unsigned int day;
    
    int num_conversions = sscanf(data, fmt, &year, &month, &day);
    if (num_conversions == EOF) {
        printf("Reached end of string before attempting a conversion\n");
    }
    else if (num_conversions == 0) {
        printf("No successful conversions\n");
    }
    else if (num_conversions >= 1) {
        printf("year = %u\n", year);
        if (num_conversions >= 2) {
            printf("month = %u\n", month);
        }
        if (num_conversions == 3) {
            printf("day = %u\n", day);
        }
    }
    return 0;
}

Readers are encouraged to modify the `data` string to see the effects on the output of the `sscanf` function.

The format string `char fmt[] = "%u-%u-%u";` instructs `sscanf` to attempt to read the data string by:

* matching and converting one `unsigned int` value followed by
* a hyphen followed by
* matching and converting one `unsigned int` value followed by
* a hyphen followed by
* matching and converting one `unsigned int` value

The first converted `unsigned int` value is stored in `year`, the second converted value is
stored in `month`, and the third converted value is stored in `day`. The matched hyphens are not
stored. Notice that the addresses of `year`, `month` and `day` are passed to `sscanf` because
`sscanf` must write into variables held by the caller.

Most conversion specifiers will consume and discard leading whitespace before attempting to perform the conversion.
This means that the conversion specifier `%u` will match and convert any sequence of whitespace followed
by an `unsigned int` value. For example:

```c
char data[] = "    2023-  01-  29";
```

is successfully parsed by the program above.

Characters in the format string that are not part of a conversion specifier must be matched exactly (except
for whitespace; see below). For example, the hyphens `-` in `fmt` must appear *immediately after* each
`unsigned int` value otherwise a matching error occurs. This means that if `data` is defined as:

```c
char data[] = "2023 -01-29";
```

then only the year will be successfully matched and converted.

A space in the format string will match any consecutive sequence of zero or more whitespace characters.
In our date example, if you want to allow the data string to contain spaces before the hyphens then
you would insert a space into the format string. For example, the format string:

```c
char fmt[] = "%u -%u -%u";
```

will successfully convert any of the following strings:

```
"2023-01-29"
"2023- 01-    29"
"2023    -01 -29"
"2023 - 01 -  29"
```
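
The claim can be checked with a small helper. The following sketch wraps the format string in a hypothetical
function `parse_date` that reports success only if all three values are converted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: parses a date of the form year-month-day,
// allowing optional whitespace before each hyphen and each number.
bool parse_date(const char *data, unsigned int *year,
                unsigned int *month, unsigned int *day) {
    assert(data != NULL && year != NULL && month != NULL && day != NULL);
    return sscanf(data, "%u -%u -%u", year, month, day) == 3;
}
```

Each of the four strings listed above yields the values 2023, 1, and 29, whereas a string such as
`"2023/01/29"` is rejected because the first hyphen in the format string fails to match.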

### The most common conversion specifiers

`%d` matches base-10 `int` values.

`%u` matches base-10 `unsigned int` values.

`%f` matches `float` values.

`%lf` matches `double` values.

`%s` matches a string *not containing whitespace characters*. The null terminator is written into the receiving
argument after the string.

`%c` matches a single `char` value. This specifier *does not* discard leading whitespace; in other words,
whitespace characters can be matched using this specifier.

The following program demonstrates the use of `sscanf` using the conversion specifiers shown above:

In [None]:
// sscanf_examples.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char str1[100];
    char str2[100];
    unsigned int x1;
    int y1;
    double z1;
    char c1;

    // scan empty string
    int n = sscanf("", "%s", str1);
    printf("n = %d\n", n);

    // scan blank string
    n = sscanf("    ", "%s", str1);
    printf("n = %d\n", n);
    
    // scan string with unmatched conversion
    n = sscanf("abc", "%d", &y1);
    printf("n = %d\n", n);

    // scan non-blank string
    n = sscanf("abc", "%s", str1);
    printf("n = %d, str1 = %s\n", n, str1);

    // scan non-blank string containing a space
    n = sscanf("abc xyz", "%s", str1);
    printf("n = %d, str1 = %s\n", n, str1);

    // scan non-blank strings and unsigned int
    n = sscanf("abc 1 xyz", "%s%u%s", str1, &x1, str2);
    printf("n = %d, str1 = %s, str2 = %s, x1 = %u\n", n, str1, str2, x1);

    // scan non-blank string for int
    n = sscanf("-99abc", "%d", &y1);
    printf("n = %d, y1 = %d\n", n, y1);
    
    // scan non-blank string for int, double
    n = sscanf("-99 ,   -1.5", "%d , %lf", &y1, &z1);
    printf("n = %d, y1 = %d, z1 = %f\n", n, y1, z1);
    
    // scan non-blank string for single char
    n = sscanf("-99abc", "%c", &c1);
    printf("n = %d, c1 = %c\n", n, c1);
}

### The `%[` conversion specifier for matching strings made up of specified characters

`%[` *set* `]` matches a non-empty string made up of the characters in *set*,
and the null terminator is written into the receiving argument after the string. For example,
`%[abc]` will match any string made up of the characters `a`, `b`, or `c`.

`%[^` *set* `]` matches a non-empty string made up of the characters *not* in *set*,
and the null terminator is written into the receiving argument after the string. For example,
`%[^abc]` will match any string made up of characters other than `a`, `b`, and `c`.

`%[` does not consume and discard whitespace before attempting a match.

There is no portable way to specify ranges of characters when specifying the set (for example,
`%[a-z]` is not guaranteed to match strings made up of lowercase letters on all platforms), and
there is no way to specify character classes (such as the POSIX character classes known
to the shell).

One possible use of `%[` is for extracting data from a character delimited string. The following
program extracts data from a string where the fields are separated by commas:

In [None]:
// sscanf_student1.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char data[] = "Simpson,Bart,123456";
    
    char last_name[10];     // maximum string length 9
    char first_name[10];    // maximum string length 9
    unsigned int stu_num;
    
    int n = sscanf(data, "%[^,],%[^,],%u", last_name, first_name, &stu_num);
    if (n != 3) {
        fprintf(stderr, "Failed to extract all three fields\n");
        exit(EXIT_FAILURE);
    }
    printf("%s %s, student number = %u\n", first_name, last_name, stu_num);
    
    return 0;
}

### The `%n` specifier for getting the number of characters read so far

It is occasionally useful to obtain the number of characters read so far when using `sscanf` to extract
data from a string. For example, in the previous example the location of the separating commas 
in the original string might be useful to know in certain cases.

The `%n` specifier can be used anywhere in the format string to obtain the number of characters
read so far by `sscanf`. The specifier does not consume any input from the input data buffer, nor does
it increment the match count returned by `sscanf`. The receiving argument is a pointer to `int`.

The following program builds on the previous example and replaces the separating commas with vertical bars:

In [None]:
// sscanf_student2.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char data[] = "Simpson,Bart,123456";
    
    char last_name[10];     // maximum string length 9
    char first_name[10];    // maximum string length 9
    unsigned int stu_num;
    
    int i1;   // index of first comma
    int i2;   // index of second comma
    
    int n = sscanf(data, "%[^,]%n,%[^,]%n,%u", last_name, &i1, first_name, &i2, &stu_num);
    if (n != 3) {
        fprintf(stderr, "Failed to extract all three fields\n");
        exit(EXIT_FAILURE);
    }
    
    // replace commas and print
    data[i1] = '|';
    data[i2] = '|';
    printf("%s\n", data);
    
    return 0;
}

Notice that the example above does not make use of the converted data (`last_name`, `first_name`, and `stu_num`).
The next section describes how to perform a match but not assign the converted value.

### Optional modifier #1: Suppressing assignment

There are three optional modifiers that can be used to modify the way that a conversion is
performed. The first of these is the `*` modifier.

An asterisk `*` following the percent sign suppresses assignment of the modified conversion.
The matched field is read but not assigned to a receiving argument. Note that this affects the return
value of `sscanf` because fewer receiving arguments are assigned to.

The following program revises the previous example to eliminate the unused receiving variables:

In [None]:
// sscanf_student3.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char data[] = "Simpson,Bart,123456";
    
    int i1 = -1;   // index of first comma
    int i2 = -1;   // index of second comma
    
    /*int n = */sscanf(data, "%*[^,]%n,%*[^,]%n,%*u", &i1, &i2);
    // NOTE: For this input the value of n is zero because no receiving
    // arguments are assigned to except for those corresponding to %n,
    // which are not counted
    
    // Error checking is left as an exercise for the student
    
    // replace commas and print
    data[i1] = '|';
    data[i2] = '|';
    printf("%s\n", data);
    
    return 0;
}

### Optional modifier #2: Restricting the number of characters to read

The second optional modifier is a positive integer, the *maximum field width*, that specifies
the maximum number of characters to consume when performing a conversion.
Suppose that in the date example we want to ensure that the year has no
more than 4 digits and that the month and day have no more than 2 digits.
We can do this by changing the format string to `"%4u-%2u-%2u"`:

In [None]:
// sscanf_date2.c

#include <stdio.h>

int main(void) {
    char data[] = "2023-01-29";
    char fmt[] = "%4u-%2u-%2u";
    unsigned int year;
    unsigned int month;
    unsigned int day;
    
    int num_conversions = sscanf(data, fmt, &year, &month, &day);
    if (num_conversions == EOF) {
        printf("Reached end of string before attempting a conversion\n");
    }
    else if (num_conversions == 0) {
        printf("No successful conversions\n");
    }
    else if (num_conversions >= 1) {
        printf("year = %u\n", year);
        if (num_conversions >= 2) {
            printf("month = %u\n", month);
        }
        if (num_conversions == 3) {
            printf("day = %u\n", day);
        }
    }
    return 0;
}

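The maximum field width modifier is particularly important when converting data into strings because it ensures
that the converted string does not exceed the capacity of the array allocated to hold it. The earlier
example of converting a student name and number can be rewritten to prevent exceeding the array capacity.
In the program below the last name is longer than 9 characters, so `%9[^,]` stops after 9 characters, the
literal comma in the format string then fails to match, and `sscanf` returns 1 instead of overflowing `last_name`: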

In [None]:
// sscanf_student4.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char data[] = "Nahasapeemapetilon,Apu,345678";
    
    char last_name[10];     // maximum string length 9
    char first_name[10];    // maximum string length 9
    unsigned int stu_num;
    
    int i1;   // index of first comma
    int i2;   // index of second comma
    
    int n = sscanf(data, "%9[^,]%n,%9[^,]%n,%u", last_name, &i1, first_name, &i2, &stu_num);
    if (n != 3) {
        fprintf(stderr, "Failed to extract all three fields\n");
        exit(EXIT_FAILURE);
    }
    
    // replace commas and print
    data[i1] = '|';
    data[i2] = '|';
    printf("%s\n", data);
    
    return 0;
}

#### Maximum field width and `%c`

Specifying a maximum field width of $w$ with the `%c` conversion will match exactly $w$ characters (including
white space) which means that the receiving argument must be an array of `char` of capacity at least equal
to $w$. 

Unlike `%s` and `%[`, the `%c` conversion does not write a trailing null terminator after the conversion. If
the receiving argument is meant to be used as a string, then it must be an array of capacity $w + 1$ or greater,
and the null terminator must be appended manually.

The following example uses `sscanf` twice to extract data from a string. The first use of `sscanf` finds the
lengths of two comma separated strings. Arrays are dynamically allocated to hold the strings and then
a second call to `sscanf` is used to extract the strings into their arrays:

In [None]:
// sscanf_student5.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char data[] = "Nahasapeemapetilon,Apu,345678";
    
    unsigned int stu_num;
    
    int i1;   // index of first comma
    int i2;   // index of second comma
    
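    // only stu_num is assigned, so a fully successful call returns 1
    if (sscanf(data, "%*[^,]%n,%*[^,]%n,%u", &i1, &i2, &stu_num) != 1) {
        fprintf(stderr, "Failed to parse the record\n");
        exit(EXIT_FAILURE);
    }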
    
    // length of string 1 = i1
    size_t len1 = i1;
    char *last_name = malloc(len1 + 1);     // +1 for \0
    
    // length of string 2 = i2 - i1 - 1
    size_t len2 = i2 - i1 - 1;
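    char *first_name = malloc(len2 + 1);     // +1 for \0
    if (last_name == NULL || first_name == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(EXIT_FAILURE);
    }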
    
    // generate format string "%[len1]c,%[len2]c"
    char fmt[100];
    // %% is the conversion for a literal %
    // %zu is the conversion for size_t
    sprintf(fmt, "%%%zuc,%%%zuc", len1, len2);  
    puts(fmt);
    
    sscanf(data, fmt, last_name, first_name);
    last_name[len1] = '\0';
    first_name[len2] = '\0';
    
    printf("%s %s, student number = %u\n", first_name, last_name, stu_num);
    
    free(last_name);
    free(first_name);
    
    return 0;
}

It is left as an exercise for the student to work out how to reliably determine the theoretical
maximum capacity required for the format string
(see <https://stackoverflow.com/questions/35019951/how-many-chars-do-i-need-to-print-a-size-t-with-sprintf>, for example).

### Optional modifier #3: Size of the receiving argument

The very observant reader may have noticed that there is one conversion for base-10 integers (`%d`) but there
are multiple integer types that may be used for the receiving argument (`int` versus `long`, for example).
The third modifier, called the *length modifier*, allows the programmer to specify the precise type of the
receiving argument, which affects the accuracy of the conversion and the effect of overflow/saturation.

This notebook does not describe all of the possible values for the length modifier. Instead, readers
should refer to <https://en.cppreference.com/w/c/io/fscanf>, and in particular, to the table and the
columns labelled *Argument type*.

One commonly encountered use of the length modifier is for receiving arguments of type `double`.
The `%f` conversion expects a pointer to `float`; to store the result in a `double`, the conversion
must be written as `%lf`.

`l` is also used as the length modifier for `long` integer values.

Another example of the use of the length modifier is for receiving arguments of type `size_t`. Because
the actual type of `size_t` is platform dependent, a dedicated length modifier `z` is provided. Instead of
using `%u`, the correct conversion is `%zu`.

The following examples scan a string using two different types for the receiving arguments.

In [None]:
// sscanf_integer.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    // greater than INT_MAX for 32-bit int but less than LONG_MAX for 64-bit long
    char data[] = "3000000000";   
    
    // scan data for an int value and for a long value
    int i = 0;
    long j = 0;
    
    sscanf(data, "%d", &i);
    sscanf(data, "%ld", &j);
    
    printf("i = %d, j = %ld\n", i, j);
    
    return 0;
}

In [None]:
// sscanf_floatingpt.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    // much greater than FLT_MAX but less than DBL_MAX
    char data[] = "1e100";   
    
    // scan data for a float value and a double value
    float f = 0.0f;
    double d = 0.0;
    
    sscanf(data, "%f", &f);
    sscanf(data, "%lf", &d);
    
    // annoying inconsistency between printf-like and scanf-like functions
    // no length modifier when using printf
    printf("f = %f, d = %f\n", f, d);
    
    return 0;
}

The behavior of `sscanf` is undefined when the converted value cannot be represented
in the type of the receiving argument. On the author's computer, the value
of `i` is a negative value and the value of `f` is positive infinity.