### What is Unicode?
Unicode is a standard for representing characters from virtually every human writing system.

### Why is Unicode necessary?
It gives every character, whatever its language, a single agreed-upon identity. Before Unicode, each language had its own system for encoding characters, which made exchanging information across platforms difficult.

### What is the relationship between Unicode and characters?
The Unicode standard describes the relationship between code points and characters.
A code point is an integer in the range 0 to 0x10FFFF.
0x10FFFF is a hexadecimal number, equivalent to 1114111 in the decimal system. The hexadecimal number system uses the digits "0"-"9" and "A"-"F", where "A"-"F" represent the values 10-15.

$\mathrm{10FFFF}_{16} = 1114111_{10}$

$\mathrm{10FFFF}_{16} = 1 \times 16^5 + 0 \times 16^4 + 15 \times 16^3 + 15 \times 16^2 + 15 \times 16^1 + 15 \times 16^0$
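The conversion above can be checked directly in Python. This is only a small sketch; the variable names `digits` and `expanded` are chosen here for illustration.

```python
# Hexadecimal digits of 0x10FFFF, most significant first (F = 15).
digits = [1, 0, 15, 15, 15, 15]

# Positional expansion: each digit times the matching power of 16.
expanded = sum(d * 16 ** p for d, p in zip(digits, range(5, -1, -1)))

print(expanded)               # 1114111
print(0x10FFFF == expanded)   # True
print(int("10FFFF", 16))      # 1114111
print(hex(1114111))           # 0x10ffff
```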

Out of the roughly 1.1 million possible values, only about 110k have been assigned to characters so far, and the count grows with each Unicode release. The Unicode standard therefore defines the mapping between code points and characters. For example, the code point U+0061 represents the character 'a'. Informally, people often say that U+0061 *is* the character 'a'; strictly speaking, U+0061 is the code point and 'a' is the character it represents.
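Python exposes this mapping through the built-ins `ord()` and `chr()` and the standard `unicodedata` module. A minimal sketch, with `code_point` as an illustrative variable name:

```python
import unicodedata

code_point = ord("a")             # character -> code point (an integer)
print(code_point)                 # 97
print(f"U+{code_point:04X}")      # U+0061
print(chr(0x0061))                # a  (code point -> character)
print(unicodedata.name("a"))      # LATIN SMALL LETTER A
```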

### What is the relationship between characters and bytes?
To be stored in memory, a character's code point must be represented as one or more code units, and those code units are in turn mapped to 8-bit bytes. The rules for translating code points into a sequence of bytes are called a character encoding.
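As a small illustration of a character encoding at work, the sketch below encodes the euro sign (U+20AC) with UTF-8 and decodes it again. The choice of character and the variable names are assumptions made for this example.

```python
text = "\u20ac"                   # EURO SIGN, code point U+20AC
encoded = text.encode("utf-8")    # code point -> bytes under UTF-8

print(hex(ord(text)))             # 0x20ac
print(encoded)                    # b'\xe2\x82\xac'
print(list(encoded))              # [226, 130, 172]
print(len(text), len(encoded))    # 1 3  (one character, three bytes)
print(encoded.decode("utf-8") == text)  # True
```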

### What are the different ways to represent code points in a computer's memory?
UTF-32, UTF-16, and UTF-8 are encodings that represent code points using code units of 32, 16, and 8 bits respectively. UTF-32 always uses one code unit per code point, UTF-16 uses one or two, and UTF-8 uses one to four. UTF-8 is the most widely used encoding.
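To see how the three encodings differ, the following sketch counts the bytes each one needs for a few sample characters. The samples are chosen here for illustration, and the `-le` codec variants are used so that no byte order mark is added to the result.

```python
# 'A', 'é', '€' and an emoji, written as escapes to avoid ambiguity.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

byte_lengths = {
    ch: [len(ch.encode(enc)) for enc in encodings]
    for ch in samples
}

for ch, lengths in byte_lengths.items():
    print(f"U+{ord(ch):04X}", lengths)
# U+0041 [1, 2, 4]
# U+00E9 [2, 2, 4]
# U+20AC [3, 2, 4]
# U+1F600 [4, 4, 4]
```

UTF-32 is fixed-width, while UTF-8 and UTF-16 grow with the code point value.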


### Source
- https://docs.python.org/3/howto/unicode.html
- https://web.archive.org/web/20120315050914/http://www.diveintopython.net/xml_processing/unicode.html

In [1]:
import sys
print(sys.getdefaultencoding())

utf-8


In [2]:
# Take the character 'A' as an example
encode_utf8 = 'A'.encode('utf-8')
print(encode_utf8)
print('Size of the encoded variable is ', sys.getsizeof(encode_utf8))

b'A'
Size of the encoded variable is  34


In [3]:
encode_utf16 = 'A'.encode('utf-16')
print(encode_utf16)
print('Size of the encoded variable is ', sys.getsizeof(encode_utf16))

b'\xff\xfeA\x00'
Size of the encoded variable is  37


In [4]:
encode_utf32 = 'A'.encode('utf-32')
print(encode_utf32)
print('Size of the encoded variable is ', sys.getsizeof(encode_utf32))

b'\xff\xfe\x00\x00A\x00\x00\x00'
Size of the encoded variable is  41
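The leading `\xff\xfe` bytes in the UTF-16 output, and `\xff\xfe\x00\x00` in the UTF-32 output, are a byte order mark (BOM) that the generic `utf-16` and `utf-32` codecs prepend. A short sketch comparing them with the explicit little-endian codecs, which add no BOM (the variable names are illustrative):

```python
import codecs

with_bom = "A".encode("utf-16")         # generic codec: BOM + native byte order
without_bom = "A".encode("utf-16-le")   # explicit little-endian: no BOM

print(without_bom)                             # b'A\x00'
print(with_bom.startswith(codecs.BOM_UTF16))   # True
print(len(with_bom) - len(without_bom))        # 2

# The same holds for UTF-32, where the BOM is four bytes long.
bom32_len = len("A".encode("utf-32")) - len("A".encode("utf-32-le"))
print(bom32_len)                               # 4
```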


### Why do we see such a large number for the size of a character?

In [5]:
import platform
print(platform.python_implementation())

CPython


`sys.getsizeof()` calls the object's `__sizeof__()` method and adds garbage collector overhead if the object is managed by the garbage collector.
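The difference between the two calls can be observed directly. The sketch below assumes CPython; the exact numbers depend on the Python version and platform, so none are shown here. `gc.is_tracked()` reports whether the garbage collector manages an object.

```python
import gc
import sys

for obj in (b"A", "A", [], [2]):
    own_size = obj.__sizeof__()       # size reported by the object itself
    reported = sys.getsizeof(obj)     # own size plus any GC overhead
    print(type(obj).__name__, gc.is_tracked(obj), own_size, reported)

# Extra bytes added by sys.getsizeof() on top of __sizeof__().
gc_overhead_bytes = sys.getsizeof(b"A") - b"A".__sizeof__()
gc_overhead_list = sys.getsizeof([]) - [].__sizeof__()
print(gc_overhead_bytes, gc_overhead_list)
```

On a typical CPython build, only the lists show a difference, because `bytes` and `str` objects are not tracked by the garbage collector.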

A good description of memory management in Python (specifically the CPython implementation) is given here: https://stackify.com/python-garbage-collection/.

Memory management by a garbage collector has two main goals:
- memory that is no longer in use should be freed
- memory that is still in use must not be freed

CPython's primary mechanism is reference counting. Every Python object carries a reference count, which is incremented each time a new reference to the object is created and decremented each time a reference is removed. When the count reaches zero, the object's memory is freed.
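Reference counts can be inspected with `sys.getrefcount()`. A minimal sketch, assuming CPython; the absolute values vary (the call itself temporarily holds a reference), so only the changes are meaningful. The variable names are illustrative.

```python
import sys

data = []                             # a fresh object that nothing else shares
before = sys.getrefcount(data)

alias = data                          # a second reference to the same object
after_alias = sys.getrefcount(data)

del alias                             # remove that reference again
after_del = sys.getrefcount(data)

print(after_alias - before)           # 1
print(after_del - before)             # 0
```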

The second mechanism is the generational garbage collector, which handles reference cycles that reference counting alone cannot free. It generally has three generations, each with its own threshold for triggering a collection. An object moves into an older generation whenever it survives a garbage collection of its current generation. The `gc` module can be used to inspect and tune this process.
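The sketch below shows the generational collector reclaiming a reference cycle. `Node` is a hypothetical helper class written for this example, and the threshold values printed depend on the Python version.

```python
import gc
import weakref


class Node:
    """Hypothetical helper used only to build a reference cycle."""

    def __init__(self):
        self.partner = None


print(gc.get_threshold())             # one threshold per generation

a, b = Node(), Node()
a.partner, b.partner = b, a           # a and b now reference each other
probe = weakref.ref(a)                # observes a without keeping it alive

del a, b                              # the cycle keeps both refcounts above zero
gc.collect()                          # force a full collection
cycle_freed = probe() is None

print(cycle_freed)                    # True
```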

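For objects tracked by the garbage collector, some extra memory is reserved for its bookkeeping, and `sys.getsizeof()` adds this overhead to its result. Containers such as lists are tracked; `str` and `bytes` objects are not, so their reported size consists of a fixed object header plus the character data.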

In [6]:
sys.getsizeof('A')

50

In [7]:
sys.getsizeof('')

51

In [8]:
sys.getsizeof([])

64

In [9]:
sys.getsizeof([2])

72

In [10]:
sys.getsizeof([2, 4])

80

In [11]:
sys.getsizeof([2, 4, 6])

88

In [12]:
sys.getsizeof('A'.encode('utf-8'))  # 16 bytes smaller than sys.getsizeof('A')

34
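One way to separate the actual data from the object bookkeeping is to subtract `len()` from `sys.getsizeof()`. In the outputs recorded above, the difference for the encoded `bytes` objects is 33 bytes each time (34 - 1, 37 - 4, 41 - 8). The sketch below repeats that calculation; it assumes CPython, and the overhead value itself may differ on other versions or platforms.

```python
import sys

text = "A"
encodings = ("utf-8", "utf-16", "utf-32")

for encoding in encodings:
    encoded = text.encode(encoding)
    payload = len(encoded)                        # bytes of actual data
    overhead = sys.getsizeof(encoded) - payload   # bytes-object bookkeeping
    print(encoding, payload, overhead)

# Collect the distinct overhead values across the three encodings.
overheads = {
    sys.getsizeof(text.encode(enc)) - len(text.encode(enc))
    for enc in encodings
}
print(len(overheads))                             # 1: the overhead is constant
```

The large numbers therefore come from a fixed per-object cost, not from the encoded character itself.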