# A simple string type

<div class="alert alert-block alert-info">
    You can find all of the C programs in this notebook in the subdirectory containing this notebook:
    <code>./src/qstr</code>
</div>

C strings are simply arrays with a special terminator character. C strings are mutable in that the characters
making up the string can be modified and the length of the string can be modified to any value between
$0$ and $n-1$ where $n$ is the capacity of the array. The programmer is responsible for allocating and
deallocating memory when using strings with allocated storage duration. A notable weakness of C strings
is that finding the length of a string has computational complexity in $O(n)$ because the array must be
scanned until the terminator is found.
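
The linear cost can be seen in a minimal sketch of how the length of a C string has to be computed. The function name `count_length` is a hypothetical stand-in for the standard `strlen`, used here only to make the loop visible:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for strlen: counts the characters up to, but
 * not including, the null terminator. The loop visits every character,
 * so the running time grows linearly with the length of the string. */
size_t count_length(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

Calling `count_length("CISC220")` walks over seven characters before reaching the terminator and returns `7`.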

Languages such as Python, Java, and C++ have dedicated string types that allow the programmer to use
strings easily and safely in their programs. In these languages, finding the length of a string
has computational complexity in $O(1)$ because string objects know their own length.

The string types in Python and Java are immutable. Operations that seem to modify the contents of a string
actually create new string objects.

The string type in C++ is mutable. The programmer may modify the length of a string and individual characters
of a string.

This notebook shows how to use a struct to implement a string type in C. The approach taken here
is arguably the simplest possible one that supports a string length operation in $O(1)$. Our string type
will be mutable with the user being able to directly access the array storing the characters of the string.
Functions that change the length of the string will automatically allocate memory when necessary.

## A string struct

The C mechanism for creating a user-defined type is a struct (or possibly multiple structs) that stores the
data needed by the type and a set of functions that provide functionality of the type. The struct (or structs)
and functions are typically declared in a header file that users `#include` in their programs. The functions
are typically implemented in one or more C source code files. Our string type will be implemented with a
struct having the tagname `qstr`:

```c
struct qstr {
};
```

A string is a sequence of zero or more characters. 
The *length* of a string is the number of characters in the string, not including any special terminator characters.
Our implementation stores the length of the string in an unsigned integer member of the struct:

```c
struct qstr {
    size_t length;
};
```

We will use an array of `char` to store the sequence of characters. 
Furthermore, our functions will ensure that the sequence is null terminated to maintain
interoperability with the standard C library. Unlike ordinary C strings, we will allow a `qstr`
to contain zero or more null terminator characters in its sequence of characters.

Because structs are not allowed to have variable
length arrays as members, our struct stores a pointer to the first element of the array:

```c
struct qstr {
    size_t length;
    char *data;
};
```

Alternatively, we could use a flexible array member for the array (see the last section of the *struct* notebook
for links to further information).
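
As a rough sketch of that alternative, the struct and its characters can share a single allocation. The names `fstr` and `fstr_alloc` are hypothetical and are not used elsewhere in this notebook:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical alternative layout: the characters are stored in the
 * same block of memory as the struct itself. */
struct fstr {
    size_t length;
    size_t capacity;
    char data[];          /* flexible array member, must be last */
};

/* Allocates an empty fstr with room for capacity chars (at least 1).
 * Returns NULL if allocation fails. */
struct fstr *fstr_alloc(size_t capacity) {
    if (capacity == 0) {
        capacity = 1;
    }
    /* one allocation holds the struct followed by the array */
    struct fstr *f = malloc(sizeof(struct fstr) + capacity);
    if (!f) {
        return NULL;
    }
    f->length = 0;
    f->capacity = capacity;
    f->data[0] = '\0';
    assert(f->capacity > f->length);   /* room for the terminator */
    return f;
}
```

With this layout a single call to `free` releases the whole object. The trade-off is that growing the array means reallocating the entire struct, so the address of the object itself may change.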

Finally, our implementation stores the capacity of the array so that operations that change the length of
the string can determine if the array must be reallocated to obtain additional memory:

```c
struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};
```

Note that `capacity` $\geq$ (`length` + 1) because the array must store the characters of the string and the null terminator.
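
This invariant can be checked in code. The following is a minimal sketch of a debugging helper; the name `qstr_check_invariants` is hypothetical and is not part of the library developed in this notebook:

```c
#include <assert.h>
#include <stddef.h>

struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

/* Hypothetical debugging helper: aborts the program if s violates
 * the invariants described above. */
void qstr_check_invariants(const struct qstr *s) {
    assert(s != NULL);
    assert(s->data != NULL);
    /* same as capacity >= length + 1, but cannot overflow */
    assert(s->capacity > s->length);
    /* the sequence is always null terminated */
    assert(s->data[s->length] == '\0');
}
```

Because it relies on `assert`, the helper compiles to nothing when `NDEBUG` is defined, so it is only suitable for catching mistakes during development.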

## Using `qstr`

The `qstr` struct is usable as currently presented, but the following example illustrates that the user
is responsible for setting the values of the members and memory management:

```c
#include <stdio.h>
#include <stdlib.h>

struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

int main(void) {
    // dynamically allocated, continue reading the notebook for explanation
    struct qstr *s = malloc(sizeof(struct qstr));
    if (!s) {
        return 1;
    }

    // set capacity and allocate array
    s->capacity = 8;
    s->data = malloc(s->capacity);
    if (!s->data) {
        free(s);
        return 1;
    }

    // write a string into the array and set length
    sprintf(s->data, "CISC220");
    s->length = 7;

    // print
    puts(s->data);

    // free allocated memory
    free(s->data);
    free(s);

    return 0;
}
```

Tasks that modify the length or capacity of a `qstr` also require the user to carefully manage the members of 
the struct.
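
As a sketch of what such a task involves, the following hypothetical function `append_char` (not part of the library developed later in this notebook) appends a single character by hand. Note how `length`, `capacity`, `data`, and the null terminator all have to be kept consistent:

```c
#include <assert.h>
#include <stdlib.h>

struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

/* Hypothetical helper: appends c to s, growing the array by hand.
 * Returns 0 on success, or -1 if reallocation fails (s is unchanged). */
int append_char(struct qstr *s, char c) {
    assert(s && s->data && s->capacity > s->length);
    if (s->length + 2 > s->capacity) {      /* new char + terminator */
        size_t new_capacity = s->length + 2;
        char *buf = realloc(s->data, new_capacity);
        if (!buf) {
            return -1;
        }
        s->data = buf;                      /* the array may have moved */
        s->capacity = new_capacity;
    }
    s->data[s->length] = c;
    s->length++;
    s->data[s->length] = '\0';              /* restore the terminator */
    return 0;
}
```

Forgetting any one of these updates leaves the struct in an inconsistent state, which is the motivation for the library of functions introduced below.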

Memory for the `data` should always be dynamically allocated to ensure that the `qstr` object:

* always has a valid array, and
* has an array that can be reallocated by other functions, and
* has an array that can be deallocated using `free`

The following example incorrectly attempts to assign the `data` member to an array having automatic storage
duration:

```c
#include <stdio.h>
#include <stdlib.h>

struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

struct qstr *badstring(void) {
    struct qstr *s = malloc(sizeof(struct qstr));

    // set capacity and allocate array
    s->capacity = 8;
    char buffer[s->capacity];      // uh-oh, automatic storage duration
    s->data = &buffer[0];

    // write a string into the array and set length
    sprintf(s->data, "CISC220");
    s->length = 7;

    return s;
}

int main(void) {
    struct qstr *str = badstring();

    // uh-oh, attempt to print a nonexistent string
    puts(str->data);

    // uh-oh, attempt to free memory not allocated by malloc
    free(str->data);

    // ok, free memory allocated by malloc, but program flow probably won't make it this far...
    free(str);

    return 0;
}
```

## Creating a library of functions

Using `qstr` would be much more convenient if there was a library of functions for using `qstr` objects that
managed the members of the struct for the user. A general purpose string library would have too many functions
to illustrate here. Instead we will implement the small number of functions described briefly below:

* `struct qstr *qstr_new()`
    * returns a new empty `qstr`
* `struct qstr *qstr_fromcstr(const char *s)`
    * returns a new `qstr` by copying a C-style string
* `struct qstr *qstr_copy(const struct qstr *s)`
    * returns a new `qstr` by copying another `qstr`
* `struct qstr *qstr_clear(struct qstr *s)`
    * clears a `qstr` so that it becomes equal to the empty string
* `char *qstr_get(const struct qstr *s, size_t index)`
    * checked access to the character at a specified index
* `char *qstr_set(const struct qstr *s, size_t index, char c)`
    * checked modification of the character at a specified index
* `struct qstr *qstr_assign(struct qstr *lhs, const struct qstr *rhs)`
    * overwrites a `qstr` with the contents of another `qstr`
* `struct qstr *qstr_concat(struct qstr *lhs, const struct qstr *rhs)`
    * concatenates a `qstr` to the end of another `qstr` (possibly itself)
* `struct qstr *qstr_remrange(struct qstr *s, size_t start, size_t stop)`
    * removes a range of characters from a `qstr`
* `void qstr_destroy(struct qstr *s)`
    * deallocates memory used by `qstr`
    
Notice that the functions return and accept `struct qstr` objects by pointer, and our library assumes
that `struct qstr` objects are dynamically allocated. This is a common practice for C libraries that
provide user-defined types because it allows the library implementer to provide an *opaque type*.
Opaque types are discussed in greater detail at the end of this notebook.

Returning a pointer also allows us to return `NULL` to indicate that the function has encountered an error.

Also notice that all of our functions have the prefix `qstr_`. The reason for using the prefix is to minimize
the risk that the user's program will contain a function having the same name as one of our functions.


## The header file `qstr.h`

To implement `qstr`, we begin by creating a header file named `qstr.h`. In the header file we place
the definition of our struct and the declarations of the supporting functions. The contents of the header
file are shown below:

```c
#ifndef QSTR_H
#define QSTR_H

// qstr.h

#include <stddef.h>

struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

/*
 * Default capacity of a newly allocated empty qstr.
 */
static size_t QSTR_DEFAULT_CAPACITY = 8;

/*
 * Returns a pointer to a newly allocated empty qstr.
 * Returns a null pointer if the qstr cannot be allocated.
 */
struct qstr *qstr_new();

/*
 * Returns a pointer to a newly allocated qstr containing the characters
 * copied from a C-style string.
 * Returns an empty qstr if s is NULL.
 * Returns a null pointer if the qstr cannot be allocated.
 */
struct qstr *qstr_fromcstr(const char *s);

/*
 * Returns a pointer to a newly allocated qstr containing the characters
 * copied from another qstr s. 
 * Returns an empty qstr if s is NULL.
 * Returns a null pointer if the qstr cannot be allocated.
 */
struct qstr *qstr_copy(const struct qstr *s);

/*
 * Clears the specified qstr setting its length to 0. Returns s.
 * Returns a null pointer if s is NULL.
 */
struct qstr *qstr_clear(struct qstr *s);

/*
 * Checked character access by index.
 * Returns a pointer to the character at the specified index of the specified qstr.
 * index must satisfy the inequality 0 <= index < s->length
 * Returns a null pointer if s is NULL.
 * Returns a null pointer if the index is invalid.
 */
char *qstr_get(const struct qstr *s, size_t index);

/*
 * Checked character modification by index.
 * Sets the character at the specified index of the specified qstr to
 * the specified character.
 * Returns a pointer to the character at the specified index of the specified qstr.
 * index must satisfy the inequality 0 <= index < s->length
 * Returns a null pointer if s is NULL.
 * Returns a null pointer if the index is invalid.
 */
char *qstr_set(const struct qstr *s, size_t index, char c);

/*
 * Assigns the value of the qstr rhs to the qstr lhs by copying the
 * length and characters of rhs into lhs. Returns lhs.
 * lhs remains unchanged if rhs is NULL.
 * Returns a null pointer if lhs is NULL.
 */
struct qstr *qstr_assign(struct qstr *lhs, const struct qstr *rhs);

/*
 * Concatenates the characters of the qstr rhs to the end of lhs.
 * The capacity of lhs is increased if necessary. Returns lhs.
 * lhs remains unchanged if rhs is NULL.
 * Returns a null pointer if lhs is NULL.
 * Returns a null pointer if the qstr cannot be allocated.
 */
struct qstr *qstr_concat(struct qstr *lhs, const struct qstr *rhs);

/*
 * Checked substring removal by range.
 * Removes a range of characters from s starting at index start going
 * up to but not including index stop.
 * start and stop must satisfy 0 <= start <= stop <= s->length
 * Returns a null pointer if s is NULL.
 * Returns a null pointer if start and/or stop are invalid.
 */
struct qstr *qstr_remrange(struct qstr *s, size_t start, size_t stop);

/*
 * Deallocates memory allocated for the struct pointed at by s. Both s
 * and its data buffer are deallocated.
 * Does nothing if s is NULL.
 */
void qstr_destroy(struct qstr *s);

#endif
```

## The source code file `qstr.c`

The definitions of the functions declared in `qstr.h` are placed in the source code file `qstr.c`. In `qstr.c`,
we `#include` the standard library headers that we require and we `#include` the header `qstr.h`. Including
the header `qstr.h` allows us to call `qstr` functions from within other `qstr` functions without needing
to consider the order of the functions in the source code file. The beginning of `qstr.c` is shown below:

```c
// qstr.c

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "qstr.h"
```

### Functions that initialize a `qstr`

There are three functions that initialize a new `qstr` object:

* `struct qstr *qstr_new()`
    * returns a new empty `qstr`
* `struct qstr *qstr_fromcstr(const char *s)`
    * returns a new `qstr` by copying a C-style string
* `struct qstr *qstr_copy(const struct qstr *s)`
    * returns a new `qstr` by copying another `qstr`
    
Each of these functions performs a similar sequence of operations:

1. allocate a new `struct qstr` object
2. allocate the `data` buffer of the new object
3. set the `length` and `capacity` of the new object
4. set the characters of the object in the `data` buffer
5. return a pointer to the object

Steps 1 and 2 deal with memory allocation for a new `qstr` object. Rather than repeating the memory
allocation code in each function, it is preferable to create a separate function that performs the required
allocations. The function:

```c
struct qstr *qstr_alloc(size_t capacity);
```

allocates a new `qstr` object equal to the empty string and having an array with the specified capacity
returning a pointer to the newly allocated object. Its implementation is shown below:

```c
/*
 * Allocates a new struct qstr object having length 0.
 * Returns a pointer to the new object, or NULL if allocation fails.
 */
struct qstr *qstr_alloc(size_t capacity) {
    struct qstr *q = malloc(sizeof(struct qstr));
    if (!q) {
        fprintf(stderr, "qstr allocation error\n");
        return NULL;
    }
    if (capacity == 0) {                                   // 1
        capacity = 1;
    }
    char *buf = calloc(capacity, 1);                       // 2
    if (!buf) {
        free(q);                                           // 3
        fprintf(stderr, "qstr->data allocation error\n");
        return NULL;
    }
    q->length = 0;
    q->capacity = capacity;
    q->data = buf;
    return q;
}
```

Implementation notes (marked by a comment in the code above):

1. Calling the standard library memory allocation functions with a size of `0` has implementation-defined
behavior (the call may return a null pointer even though no error occurred).
We ensure that the capacity is at least `1` before calling the allocation function.
2. `calloc` sets each byte of the allocated array to `0` which ensures that we have a string of length
`0` (because the first character in the array is equal to the null terminator).
3. If memory for the struct is allocated but memory for the `data` buffer cannot be allocated, then we
make sure to deallocate the struct using `free` before returning `NULL`. Otherwise the struct would be
leaked because the caller never receives a pointer to it.

The three initialization functions are now easy to implement given the function `qstr_alloc`:

```c
struct qstr *qstr_new() {
    return qstr_alloc(QSTR_DEFAULT_CAPACITY);
}

struct qstr *qstr_fromcstr(const char *s) {
    if (!s) {
        return qstr_new();                           // empty qstr if s is NULL
    }
    size_t len = strlen(s);
    struct qstr *q = qstr_alloc(len + 1);            // 1
    if (q) {
        q->length = len;
        strcpy(q->data, s);                          // 2
    }
    return q;
}

struct qstr *qstr_copy(const struct qstr *s) {
    if (!s) {
        return qstr_new();
    }
    struct qstr *q = qstr_alloc(s->capacity);
    if (q) {
        qstr_assign(q, s);                           // 3
    }
    return q;
}
```

Implementation notes (marked by a comment in the code above):

1. The capacity of the allocated `qstr` must be one greater than the string length to accommodate the null
terminator.
2. `strcpy` can be used to copy the characters of `s` although `memcpy` is likely faster in this case.
`memcpy` is usable here because we know the length of the string `s`.
3. `qstr_assign` copies the characters from one `qstr` object into another existing `qstr` object. Calling
`qstr_assign` avoids code duplication in this case (even though it is not yet implemented).
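
To illustrate the second note, here is a minimal sketch of a `memcpy`-based variant. The name `qstr_fromcstr_memcpy` is hypothetical, the allocation is inlined so that the sketch stands alone, and the argument is assumed to be non-null:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

/* Hypothetical variant of qstr_fromcstr that copies with memcpy.
 * Returns NULL if allocation fails. Assumes s is not NULL. */
struct qstr *qstr_fromcstr_memcpy(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    struct qstr *q = malloc(sizeof(struct qstr));
    if (!q) {
        return NULL;
    }
    q->data = malloc(len + 1);
    if (!q->data) {
        free(q);
        return NULL;
    }
    /* len is already known, so memcpy does not need to test each
     * character for the terminator the way strcpy does */
    memcpy(q->data, s, len + 1);   /* len characters plus the terminator */
    q->length = len;
    q->capacity = len + 1;
    return q;
}
```

Whether this is measurably faster than `strcpy` depends on the compiler and the standard library, so it is best treated as a possible optimization rather than a guaranteed one.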


### Functions that do not alter the capacity of the string

There are four functions that do not alter the capacity of a string:

* `struct qstr *qstr_clear(struct qstr *s)`
    * clears a `qstr` so that it becomes equal to the empty string
* `char *qstr_get(const struct qstr *s, size_t index)`
    * checked access to the character at a specified index
* `char *qstr_set(const struct qstr *s, size_t index, char c)`
    * checked modification of the character at a specified index
* `struct qstr *qstr_remrange(struct qstr *s, size_t start, size_t stop)`
    * removes a range of characters from a `qstr`

The functions `qstr_clear`, `qstr_get`, and `qstr_set` are straightforward
because they simply read or write the `length` and `data` members of the struct.

```c
struct qstr *qstr_clear(struct qstr *s) {
    if (!s) {
        return NULL;
    }
    s->data[0] = '\0';
    s->length = 0;
    return s;
}

char *qstr_get(const struct qstr *s, size_t index) {
    if (!s || index >= s->length) {
        return NULL;
    }
    return s->data + index;
}

char *qstr_set(const struct qstr *s, size_t index, char c) {
    char *res = qstr_get(s, index);
    if (!res) {
        return NULL;
    }
    s->data[index] = c;
    return res;
}
```

#### `qstr_remrange`

The function `qstr_remrange` is slightly more involved than the previous functions. The following
figure illustrates a string of length 8 stored in an array having capacity 10:

![](images/qstr-remrange.png)

Suppose that we want to remove the characters `EFG` from the string. Removing the characters can be accomplished
by moving all of the characters after the `G` toward the front of the array so that they overwrite the removed characters:

![](images/qstr-remrange-after.png)

The implementation of `qstr_remrange` is shown below:

```c
/*
 * Checked substring removal by range.
 * Removes a range of characters from s starting at index start going
 * up to but not including index stop.
 * start and stop must satisfy 0 <= start <= stop <= s->length
 * Returns a null pointer if s is NULL.
 * Returns a null pointer if start and/or stop are invalid.
 */
struct qstr *qstr_remrange(struct qstr *s, size_t start, size_t stop) {
    if (!s || start > stop || stop > s->length) {
        return NULL;
    }
    size_t n_removed = stop - start;                         // 1
    size_t n_tomove = s->length - stop + 1;                  // 2
    memmove(s->data + start, s->data + stop, n_tomove);      // 3
    s->length -= n_removed;
    return s;
}
```

Implementation notes (marked by a comment in the code above):

1. `n_removed` is the number of characters that are being removed.
2. `n_tomove` is the number of characters that need to be moved. It is equal to the number of
characters starting at index `stop` extending to the null-terminator after the last character of the string.
3. `memmove` is required instead of `memcpy` because the destination and source data ranges overlap.
The destination location for the moved characters begins at index `start`; thus, a pointer to the
destination location is `s->data + start`. The source location for the moved characters begins at 
index `stop`; thus, a pointer to the source location is `s->data + stop`. 
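
The index arithmetic can be traced on a plain `char` array. The sketch below assumes that the string in the figure is `ABCDEFGH` (an assumption, since only `EFG` is named in the text), and the function `remove_range` is a hypothetical raw-buffer version of the same steps:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical raw-buffer version of qstr_remrange: removes the
 * characters in [start, stop) from the string of the given length
 * stored in buf, and returns the new length. */
size_t remove_range(char *buf, size_t length, size_t start, size_t stop) {
    assert(buf && start <= stop && stop <= length);
    size_t n_removed = stop - start;
    size_t n_tomove = length - stop + 1;   /* tail plus the terminator */
    memmove(buf + start, buf + stop, n_tomove);
    return length - n_removed;
}
```

For `ABCDEFGH` with `start` equal to 4 and `stop` equal to 7, `n_removed` is 3 and `n_tomove` is 2 (the `H` and the null terminator), leaving the string `ABCDH` with length 5.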


### Functions that may alter the capacity of the string

There are two functions that may alter the capacity of a string:

* `struct qstr *qstr_assign(struct qstr *lhs, const struct qstr *rhs)`
    * overwrites a `qstr` with the contents of another `qstr`
* `struct qstr *qstr_concat(struct qstr *lhs, const struct qstr *rhs)`
    * concatenates a `qstr` to the end of another `qstr` (possibly itself)

For both functions, the capacity of the `data` array belonging to the object pointed at by `lhs` may need to
be increased if the array cannot hold the final result of the computation. Increasing the capacity
of the array requires reallocating the array and adjusting the `data` pointer to point at the new array.
Rather than repeating the reallocation code in each function, it is preferable to create a separate 
function that performs the required reallocation. The function:

```c
struct qstr *qstr_ensure_capacity(struct qstr *s, size_t capacity);
```

ensures that the object pointed at by `s` has at least the specified capacity. If needed, it
reallocates and reassigns the `data` pointer held by the object pointed at by `s`. 
Its implementation is shown below:

```c
/*
 * Ensures that the object pointed at by s has at least the
 * specified capacity in its data array, reallocating the data
 * array if required.
 * Returns s, or NULL if reallocation failed.
 */
struct qstr *qstr_ensure_capacity(struct qstr *s, size_t capacity) {
    if (s->capacity >= capacity) {
        return s;
    }
    char *buf = realloc(s->data, capacity);
    if (!buf) {
        fprintf(stderr, "qstr->data reallocation error\n");
        return NULL;
    }
    s->data = buf;
    s->capacity = capacity;
    return s;
}
```

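The function assumes that `s` is a valid (non-null) `qstr` object pointer. The result of `realloc` is stored
in a temporary pointer so that, if reallocation fails, the original `data` array is still reachable
through `s` and the object remains valid.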

#### `qstr_assign`

The function `qstr_assign(struct qstr *lhs, const struct qstr *rhs)` overwrites the `length` and `data`
members of `lhs` with the information from `rhs`. It provides a functionality similar to `strcpy`
except that it reallocates storage for the copy when required. Its implementation is shown below:

```c
struct qstr *qstr_assign(struct qstr *lhs, const struct qstr *rhs) {
    if (!lhs) {
        return NULL;
    }
    if (!rhs) {
        return lhs;                                     // lhs remains unchanged
    }
    if (lhs == rhs) {                                   // 1
        return lhs;
    }
    const size_t REQ_CAPACITY = rhs->length + 1;        // 2
    if (!qstr_ensure_capacity(lhs, REQ_CAPACITY)) {
        return NULL;                                    // reallocation failed, lhs is unchanged
    }

    // strcpy would be a bug here because a qstr can contain \0
    memcpy(lhs->data, rhs->data, rhs->length + 1);      // 3
    lhs->length = rhs->length;
    return lhs;
}
```

Implementation notes (marked by a comment in the code above):

1. If `lhs` and `rhs` point at the same `qstr` object, then nothing needs to be done and we simply return `lhs`
(or `rhs`).
2. The required minimum capacity is equal to the length of the string represented by `rhs` plus 1 for the 
null terminator character.
3. `memcpy` is required instead of `strcpy` because our strings are allowed to contain the null terminator
character (`strcpy` stops copying characters after seeing the first null terminator character). The total
number of characters to copy is `rhs->length + 1` because we must copy over the trailing null terminator.

#### `qstr_concat`

The function `qstr_concat(struct qstr *lhs, const struct qstr *rhs)` concatenates the characters from 
the string pointed at by `rhs` to the end of the string pointed at by `lhs`. It provides a functionality 
similar to `strcat`
except that it reallocates storage for the resulting string when required. Its implementation is shown below:

```c
struct qstr *qstr_concat(struct qstr *lhs, const struct qstr *rhs) {
    if (!lhs) {
        return NULL;
    }
    if (!rhs) {
        return lhs;
    }
    const size_t REQ_CAPACITY = lhs->length + rhs->length + 1;         // 1
    struct qstr *tmp = qstr_ensure_capacity(lhs, REQ_CAPACITY);
    if (!tmp) {
        return NULL;
    }
    
    // must use memmove instead of memcpy here, can you see why?
    memmove(lhs->data + lhs->length, rhs->data, rhs->length + 1);      // 2
    lhs->length = lhs->length + rhs->length;
    return lhs;
}
```

Implementation notes (marked by a comment in the code above):

1. The required minimum capacity is equal to the length of the two strings being concatenated plus 1 for the 
null terminator character.
2. We want to copy the characters from the string pointed at by `rhs` to the end of the string pointed at by `lhs`.
The destination location is immediately after the last character in `lhs`; a pointer to this location
is `lhs->data + lhs->length`. 
The source location is the location of the first character in `rhs`; a pointer to this location
is `rhs->data`.
The total number of characters to copy is `rhs->length + 1` to account for the trailing null terminator.
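
One way to see why `memmove` is needed is the case where `lhs` and `rhs` point at the same object. The following raw-buffer sketch isolates that case; the function `self_concat` is hypothetical and assumes the buffer is already large enough:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical raw-buffer sketch of concatenating a string to itself.
 * buf must have room for at least 2 * length + 1 chars.
 * Returns the new length. */
size_t self_concat(char *buf, size_t length) {
    assert(buf && buf[length] == '\0');
    /* The source is buf[0..length] and the destination is
     * buf[length..2*length]. Both ranges contain the byte at index
     * length, so memcpy would have undefined behavior here. */
    memmove(buf + length, buf, length + 1);
    return 2 * length;
}
```

The overlap is only a single byte (the original null terminator), but that is enough to rule out `memcpy`, which requires that the source and destination do not overlap at all.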

### A function to deallocate the memory used by a string

A `qstr` object is always dynamically allocated and thus must be freed at the end of its lifetime. Freeing
a `qstr` object requires freeing memory used by the struct plus freeing the memory used by the `data` array.
The `qstr_destroy` function performs the necessary deallocation steps:

```c
void qstr_destroy(struct qstr *s) {
    if (!s) {                              // 1
        return;
    }
    free(s->data);                         // 2
    s->length = 0;                         // 3
    s->capacity = 0;
    s->data = NULL;
    free(s);                               // 4
}
```

Implementation notes (marked by a comment in the code above):

1. If the argument pointer is `NULL` then there is no object to free. We immediately return to avoid
dereferencing the pointer.
2. We have to free the `data` array before freeing the struct `s`. If we freed `s` first then attempting
to access the `data` array would lead to undefined behavior.
3. Although not required, we set `length` and `capacity` to zero, and `data` to `NULL` to indicate that
the object is no longer a valid `qstr` object.
4. Freeing `s` deallocates memory used by the struct.

## Hiding implementation details

Our `qstr` struct is declared and defined in the header file as:

```c
// defined in qstr.h
struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};
```

C has no notion of an access modifier (such as Java's `private` modifier) that restricts visibility of
a struct member.
Users are able to see and are expected to interact with the members of the struct. For example, a user
might change the last character of a string like so:

```c
// last_char.c

#include <stdio.h>

#include "qstr.h"

int main(void) {
    struct qstr *str = qstr_fromcstr("HELLO");
    if (!str) {
        return 1;
    }
    str->data[str->length - 1] = 'o';
    puts(str->data);
    qstr_destroy(str);

    return 0;
}
```

Although we cannot hide the visibility of individual members of a struct, it is possible to hide the
visibility of all of the members by creating what is called an *opaque struct* or an *opaque type*.
In the header file, we can declare, but not define, our struct:

```c
// declared in qstr.h
struct qstr;
```

Then in our source code file, we can define our struct:

```c
// defined in qstr.c
struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};
```

Recall that users of a C library typically have access only to the header files of the library and the
compiled library files. By using opaque types, the library implementer provides an interface via the
header files but hides the implementation details in the compiled library. An example of an
opaque type found in the standard library is the `FILE` type declared in `<stdio.h>` that represents
a stream.

If we change `struct qstr` so that it becomes an opaque type, then we must also provide a function that
returns the length of the string. For users to be able to use standard library functions with our string,
we also must provide a function that returns a pointer to the start of the `data` array.
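
A minimal sketch of two such accessor functions is shown below. The names `qstr_length` and `qstr_data` are hypothetical; in an opaque design their declarations would go in `qstr.h` and their definitions, together with the struct definition, in `qstr.c`:

```c
#include <assert.h>
#include <stddef.h>

/* In an opaque design this definition would appear only in qstr.c. */
struct qstr {
    size_t length;
    size_t capacity;
    char *data;
};

/* Hypothetical accessor: returns the length of s in O(1). */
size_t qstr_length(const struct qstr *s) {
    assert(s != NULL);
    return s->length;
}

/* Hypothetical accessor: returns a pointer to the first character
 * of s, suitable for passing to standard library functions. */
char *qstr_data(const struct qstr *s) {
    assert(s != NULL);
    return s->data;
}
```

A pointer obtained from `qstr_data` should not be held across calls that may reallocate the array, such as `qstr_assign` or `qstr_concat`, because the array may have moved.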

Earlier in this notebook, it was stated that libraries that provide user-defined opaque types typically provide
an interface where objects are passed by pointer instead of by value. An important advantage of such an interface
is that users of the library can write code that is *binary compatible* with different versions of the library.
Binary compatible means that users do not need to recompile their code when updating to a newer version of
the library. Instead, users only need to relink their compiled code to the new library. Binary compatibility
is achieved because instances of the library type are always passed by pointer (and all pointers to struct
types have the same size) and members of the type are never accessed directly by the user.

Consider two different declarations of `qstr_new()`:

```c
struct qstr qstr_new();    // 1 : return by value
struct qstr *qstr_new();   // 2 : return by pointer
```

Version 1 of `qstr_new()` returns a struct by value which means that the compiler must know the size of the
struct so that it can allocate the correct amount of space for the returned object. If the implementation
of the struct changes in such a way that the size of the struct changes, then any user of such a function
must recompile their code.

Version 2 of `qstr_new()` returns a struct by pointer, and the size of a pointer to a struct is fixed by the
platform and does not depend on the members of the struct. Modifications that change the size of the struct
have no effect on the user's compiled code.
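
The difference can be checked with a small sketch. The two struct types below are hypothetical stand-ins for an old and a new version of the library type, and the `hash` member is invented purely for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "version 1" of the library type. */
struct qstr_v1 {
    size_t length;
    size_t capacity;
    char *data;
};

/* Hypothetical "version 2" with one extra member. */
struct qstr_v2 {
    size_t length;
    size_t capacity;
    char *data;
    unsigned long hash;    /* invented member, for illustration only */
};

/* The structs have different sizes, so code that passes them by value
 * would have to be recompiled when the library changes. */
static_assert(sizeof(struct qstr_v1) < sizeof(struct qstr_v2),
              "adding a member makes the struct larger");

/* Returns 1 if pointers to the two versions have the same size. */
int pointer_sizes_match(void) {
    return sizeof(struct qstr_v1 *) == sizeof(struct qstr_v2 *);
}
```

The C standard requires all pointers to struct types to have the same representation, so `pointer_sizes_match` returns `1` regardless of how the struct members change.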

