# Elements of Programming Interviews

# Strings

Strings are represented by the **immutable** *str* data type, which holds a sequence of Unicode characters.


## Basic Operations

### Comparison

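Strings support the usual comparison operators <, <=, ==, !=, >, and >=. In Python 3 these operators compare strings lexicographically, code point by code point (in effect comparing the *ord()* values of the characters), not by locale or visual appearance.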

Two problems arise (neither is specific to Python):
1. Some Unicode characters can be represented by two or more different byte sequences. For example, the character Å can be encoded in UTF-8 in three different ways: [0xE2, 0x84, 0xAB] (the Ångström sign, U+212B), [0xC3, 0x85] (the precomposed letter, U+00C5), and [0x41, 0xCC, 0x8A] (the letter A followed by the combining ring above, U+030A).
    * **Solution**: import the *unicodedata* module and call *unicodedata.normalize()* with "NFKC" as the first argument (one of four normalization forms; the others are "NFC", "NFD", and "NFKD") and the string containing Å as the second. Whichever of the three byte sequences the input used, the returned string, when encoded as UTF-8, is always the byte sequence [0xC3, 0x85].
2. Language-specific sorting of characters. For example, although in English *ø* is sorted as if it were *o*, in Danish and Norwegian it is sorted after *z*.

In [1]:
'ø' > 'z'

True
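The normalization fix described for problem 1 can be sketched as follows. This is a minimal illustration that uses only the standard *unicodedata* module; the variable names are chosen here for the example and are not part of any API.

```python
import unicodedata

# Three different spellings of the same visible character Å.
angstrom_sign = '\u212b'   # ANGSTROM SIGN, UTF-8 bytes E2 84 AB
precomposed = '\u00c5'     # LATIN CAPITAL LETTER A WITH RING ABOVE, UTF-8 bytes C3 85
decomposed = 'A\u030a'     # 'A' + COMBINING RING ABOVE, UTF-8 bytes 41 CC 8A

variants = [angstrom_sign, precomposed, decomposed]

# Without normalization the three strings compare as different.
print(len(set(variants)))  # 3

# After NFKC normalization they all encode to the same two bytes.
normalized = [unicodedata.normalize('NFKC', v) for v in variants]
print([n.encode('utf-8') for n in normalized])
```

Normalizing both operands before comparing them is one way to make == treat the three spellings as equal.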

**Problem**. Check whether a string is palindromic.  
**Solution**. Time complexity = $O(n)$, space complexity = $O(1)$.

In [2]:
def is_palindromic(s):
    return all(s[i] == s[-i-1] for i in range(len(s)//2))
print('aka .....', is_palindromic('aka'))
print('akb .....', is_palindromic('akb'))

aka ..... True
akb ..... False


## Know your String Libraries

### Immutable

Strings are immutable, which means that operations like *s = s[1:]* or *s += '123'* create a new string (a new array of characters) that is then assigned back to *s*.

* Consequently, appending a single character n times to a string in a for loop has $O(n^2)$ time complexity in the worst case, because each concatenation may copy the whole string built so far.
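A common way to avoid the quadratic cost is to collect the pieces in a list and join them once at the end. The sketch below contrasts the two approaches; the function names are hypothetical and chosen only for illustration. CPython can sometimes resize a string in place when *+=* is used, but that is an implementation detail and should not be relied on.

```python
def repeat_by_concatenation(ch, n):
    # Each += may copy the whole string built so far: O(n^2) in the worst case.
    s = ''
    for _ in range(n):
        s += ch
    return s


def repeat_by_join(ch, n):
    # Collect the pieces in a list and build the final string once: O(n).
    parts = []
    for _ in range(n):
        parts.append(ch)
    return ''.join(parts)


print(repeat_by_concatenation('x', 5))  # xxxxx
print(repeat_by_join('x', 5))           # xxxxx
```

Both functions return the same result; only the amount of copying differs.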

## How Python saves memory when storing strings
Source: https://rushter.com/blog/python-strings-and-memory/

**Note**. Double check this information.

### How strings are represented in memory

Since Python 3, the **str** type holds Unicode text. Since Python 3.3 (PEP 393), CPython stores each string in one of 3 internal representations, chosen by the widest character the string contains:

* 1 byte per character (Latin-1 encoding)
    * Latin-1 covers many languages written in the Latin script, such as English, Swedish, Italian, Norwegian and so on.
    * However, it cannot store text in non-Latin scripts, such as Chinese, Japanese, Hebrew, or Cyrillic.
* 2 bytes per character (UCS-2 encoding, the fixed-width subset of UTF-16 without surrogate pairs)
    * Most of the popular natural languages fit in 2 bytes per character.
* 4 bytes per character (UCS-4 encoding, equivalent to UTF-32)
    * This is used when a string contains characters outside the Basic Multilingual Plane, such as emojis, some special symbols, or rare scripts.
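The three widths can be observed with *sys.getsizeof()*. The sketch below assumes CPython 3.3 or later; *bytes_per_character* is a hypothetical helper written for this example, and it compares two long strings so that the fixed per-object overhead cancels out.

```python
import sys


def bytes_per_character(ch, n=1000):
    # Size difference between 2n and n copies of ch, divided by n.
    # The fixed object header is the same in both and cancels out.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ['a', 'é', 'д', '中', '😀']:
    print(repr(ch), bytes_per_character(ch))
```

On CPython this should report 1 byte for *a* and *é* (both fit in Latin-1), 2 bytes for *д* and *中*, and 4 bytes for the emoji. Other Python implementations may store strings differently.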
    

|       | UTF-8 encoding | Fixed-length encoding |
| ------ | -------------- | --------------------- |
| **Storage efficient** | **Yes**: Each character is encoded using 1-4 bytes depending on the character it represents. | **No**: Every character takes as many bytes as the widest one, so inserting a single emoji into text that is otherwise stored with 1 byte per character increases the size of the string by a factor of 4. |
| **$O(1)$ locating a character by index** | **No**: To perform a simple operation such as *string[5]* with UTF-8, Python would need to scan the string from the start until it finds the required character. | **Yes**: To locate a character by index, Python just multiplies the index by the length of one character (1, 2 or 4 bytes). |
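The trade-off in the table can be made concrete with a short sketch. It uses only *str.encode()* and *bytes.decode()*; *utf8_index* is a hypothetical helper written for this example and does not reflect how Python itself indexes strings.

```python
text = 'aé€😀'

# UTF-8 is variable-width: each character takes 1 to 4 bytes.
utf8_lengths = [len(ch.encode('utf-8')) for ch in text]
print(utf8_lengths)  # [1, 2, 3, 4]


def utf8_index(data, index):
    # Find the character at `index` by scanning from the start: O(n).
    # Continuation bytes have the bit pattern 10xxxxxx and are skipped.
    pos = 0
    for _ in range(index):
        pos += 1
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
    end = pos + 1
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return data[pos:end].decode('utf-8')


print(utf8_index(text.encode('utf-8'), 2))  # €

# UTF-32 is fixed-width: the character at index i starts at byte 4 * i.
utf32 = text.encode('utf-32-le')  # the '-le' variant omits the byte order mark
i = 2
print(utf32[4 * i:4 * i + 4].decode('utf-32-le'))  # €
```

The fixed-width lookup is a single slice, while the UTF-8 lookup has to walk over every preceding character.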