There is a difference between the **number of characters** and the **number of bytes** when dealing with Unicode characters and their encoding in UTF-8.

Let's break down what's happening in your code:


In [1]:
# Write two Unicode characters to a file
with open('unicode_test_write.txt', 'w', encoding="utf-8") as f:
    # 'A' has a code point below 128; 'Δ' has a code point above 127
    num_char = f.write("\u0041\u0394")
    # In text mode, write() returns the number of characters written
    print(num_char)

2


- **`\u0041`** corresponds to **'A'**, which is a character with a Unicode code point less than 128.
- **`\u0394`** corresponds to **'Δ'** (Greek capital letter Delta), which has a Unicode code point greater than 128.

When you call `f.write("\u0041\u0394")`, you're writing **two characters** to the file:

1. `'A'` (Unicode code point U+0041)
2. `'Δ'` (Unicode code point U+0394)

For a file opened in text mode, the `write()` method returns the number of **characters** written, not the number of bytes, which in this case is `2`. So `num_char` is `2`, and `print(num_char)` outputs:

```
2
```
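
To see both counts side by side, here is a minimal sketch that writes the same two characters and then asks the operating system for the file size. The temporary directory and the file name `demo.txt` are illustrative assumptions, not part of the original code.

```python
import os
import tempfile

# Write the same two characters to a throwaway file (path is illustrative).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        chars_written = f.write("\u0041\u0394")  # counts characters
    bytes_on_disk = os.path.getsize(path)        # counts encoded bytes

print(chars_written)  # 2
print(bytes_on_disk)  # 3
```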

Now, let's examine how these characters are encoded in UTF-8 and why the total number of **bytes** is `3`.

### UTF-8 Encoding

- **UTF-8** is a variable-length character encoding for Unicode.
- Characters are encoded using 1 to 4 bytes, depending on their code points.
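
As a quick check of the 1-to-4-byte claim, the sketch below encodes one sample character from each length class. The choice of `€` and `😀` as the 3-byte and 4-byte samples is an assumption made for illustration.

```python
# One sample character per UTF-8 length class:
# U+0041, U+0394, U+20AC, U+1F600
samples = ["A", "Δ", "€", "😀"]

lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```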

#### Encoding 'A' (U+0041)

- Code point: U+0041 (65 in decimal)
- Since it's less than 128, it's encoded in **1 byte**.
- Binary representation: `01000001`
- Hexadecimal representation: `0x41`

#### Encoding 'Δ' (U+0394)

- Code point: U+0394 (916 in decimal)
- Since it's between 128 and 2047, it's encoded in **2 bytes**.
- Binary representation of code point: `0000 0011 1001 0100`

**UTF-8 Encoding Steps for 'Δ':**

1. **Convert the code point to binary:**

   ```
   U+0394 => 916 decimal => 1110010100 binary
   ```

   This gives us 10 bits: `1110010100`

2. **Pad to 11 bits if necessary (a 2-byte sequence carries 5 + 6 = 11 payload bits):**

   ```
   Bits: 0 1110010100 (now we have 11 bits)
   ```

3. **Split the bits into two parts:**

   - First byte: Bits 1-5 (from the left)
   - Second byte: Bits 6-11

4. **Apply UTF-8 encoding format for 2-byte characters:**

   - **First byte format:** `110xxxxx`
   - **Second byte format:** `10xxxxxx`

5. **Fill in the bits:**

   - **First byte:**

     ```
     110xxxxx
     xxxxx = bits 1-5 = 01110
     First byte: 11001110
     ```

   - **Second byte:**

     ```
     10xxxxxx
     xxxxxx = bits 6-11 = 010100
     Second byte: 10010100
     ```

6. **Convert bytes to hexadecimal:**

   - **First byte:** `11001110` => `0xCE`
   - **Second byte:** `10010100` => `0x94`

So, the UTF-8 encoding of `'Δ'` is the byte sequence `0xCE 0x94`.
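
The steps above can be sketched with bit operations. The helper name `encode_two_byte` is hypothetical, and it handles only the 2-byte range; the result is compared with Python's own encoder as a sanity check.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    first = 0b11000000 | (code_point >> 6)           # 110xxxxx: top 5 of the 11 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([first, second])

manual = encode_two_byte(0x0394)
print(manual.hex())                   # ce94
print(manual == "Δ".encode("utf-8"))  # True
```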

### Total Bytes Written

- **'A'** is encoded as `0x41` (1 byte).
- **'Δ'** is encoded as `0xCE 0x94` (2 bytes).
- **Total bytes:** `1 + 2 = 3 bytes`

### Reading the File in Binary Mode


In [2]:
with open('unicode_test_write.txt', 'br') as f:
    print(list(f))
    f.seek(0)
    print(f.read().decode())
    f.seek(0)
    print(f.read(1))
    print(f.read(1))
    print(f.read(1))
    print(f.read(1))
    print(f"Length of two unicode characters is: {len(str.encode('AΔ'))}, WHY???")


[b'A\xce\x94']
AΔ
b'A'
b'\xce'
b'\x94'
b''
Length of two unicode characters is: 3, WHY???


- **`list(f)`**: Reads the file line by line in binary mode. Because the file contains no newline, the list has a single element:

  ```
  [b'A\xce\x94']
  ```

  This shows that the file contains the bytes `0x41`, `0xCE`, `0x94`.

- **`f.read().decode()`**: Reads all bytes and decodes them as UTF-8 (the default for `decode()`), resulting in:

  ```
  AΔ
  ```

- **Reading byte by byte:**


In [3]:
with open('unicode_test_write.txt', 'br') as f:
    f.seek(0)
    print(f.read(1))  # Outputs: b'A' (0x41)
    print(f.read(1))  # Outputs: b'\xce' (0xCE)
    print(f.read(1))  # Outputs: b'\x94' (0x94)
    print(f.read(1))  # Outputs: b'' (End of file)



b'A'
b'\xce'
b'\x94'
b''
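
Reading one byte at a time splits `'Δ'` across two reads, and neither half can be decoded on its own. A minimal sketch of what happens, and of one way to handle it with the standard library's incremental decoder (the in-memory `data` stands in for the file contents):

```python
import codecs

data = "AΔ".encode("utf-8")  # b'A\xce\x94'

# Decoding the lone lead byte 0xCE fails: it is only half of 'Δ'.
try:
    data[1:2].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False
print(partial_ok)  # False

# An incremental decoder buffers incomplete sequences between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
print(pieces)  # ['A', '', 'Δ']
```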



### Length of Encoded String



In [6]:
print(f"Length of two unicode characters is: {len(str.encode('AΔ'))}, WHY???")


Length of two unicode characters is: 3, WHY???


- **`str.encode('AΔ')`** (equivalent to `'AΔ'.encode()`) encodes the string `'AΔ'` into bytes using UTF-8, the default encoding.
- The resulting byte sequence is `b'A\xce\x94'`.
- **`len(b'A\xce\x94')`** returns `3`, which is the number of bytes.

### Observation

The string `'AΔ'` contains **2 characters**, but its UTF-8 encoding occupies **3 bytes** because:

- **'A'** is encoded in **1 byte**.
- **'Δ'** is encoded in **2 bytes**.

Therefore, `len(str.encode('AΔ'))` returns `3` because it counts the total number of **bytes** in the encoded string, not the number of characters.

### Summary

- **Number of characters:** `2` (as returned by `f.write()`)
- **Number of bytes after UTF-8 encoding:** `3` (as shown by `len(str.encode('AΔ'))`)
- **Reason:** Unicode characters may be encoded using multiple bytes in UTF-8, depending on their code points.

### Visual Representation

| Character | Unicode Code Point | UTF-8 Encoding | Bytes |
|-----------|--------------------|----------------|-------|
| `'A'`     | U+0041             | `0x41`         | 1     |
| `'Δ'`     | U+0394             | `0xCE 0x94`    | 2     |
| **Total** |                    |                | **3** |
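
The rows of this table can be reproduced for any string with a short loop. This is a sketch; the tuple layout of `rows` is just one convenient choice.

```python
text = "AΔ"

rows = []
for ch in text:
    encoded = ch.encode("utf-8")
    # (character, code point, UTF-8 bytes in hex, byte count)
    rows.append((ch, f"U+{ord(ch):04X}", encoded.hex().upper(), len(encoded)))

for row in rows:
    print(row)

total = sum(row[3] for row in rows)
print(total)  # 3
```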

### Conclusion

The length difference arises because UTF-8 is a variable-length encoding system. Characters with code points above 127 require more than one byte to represent. In your example:

- **'A'** (`U+0041`): Encoded in 1 byte.
- **'Δ'** (`U+0394`): Encoded in 2 bytes.

Therefore, when you encode `'AΔ'` in UTF-8, you get a byte sequence of length **3 bytes**, which explains why `len(str.encode('AΔ'))` returns `3`.

---

**Additional Note:**

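If you were to use a fixed-width encoding such as UTF-32, which uses 4 bytes per character, you would see different results. In Python, the plain `'utf-32'` codec also prepends a 4-byte byte order mark (BOM), so use `'utf-32-le'` (or `'utf-32-be'`) to count only the character data:

```python
print(len('AΔ'.encode('utf-32-le')))  # Output: 8
print(len('AΔ'.encode('utf-32')))     # Output: 12 (4-byte BOM + 8)
```

Here, each character is encoded in 4 bytes, so the total length is `2 * 4 = 8 bytes`; the plain `'utf-32'` codec adds 4 more bytes for the BOM.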
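
For a broader comparison, the sketch below encodes the same string with three codecs. It assumes the little-endian codec names, which fix the byte order and write no byte order mark, so only the character data is counted.

```python
text = "AΔ"

# Codec names ending in "-le" emit no byte order mark.
sizes = {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 3, 'utf-16-le': 4, 'utf-32-le': 8}
```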

---

**Understanding Encoding Lengths**

- **ASCII Characters (U+0000 to U+007F):** 1 byte in UTF-8.
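- **Latin-1 Supplement, Greek, Cyrillic, and similar (U+0080 to U+07FF):** 2 bytes in UTF-8.
- **Rest of the Basic Multilingual Plane (U+0800 to U+FFFF):** 3 bytes in UTF-8.
- **Supplementary planes (U+10000 to U+10FFFF):** 4 bytes in UTF-8.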
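
These ranges translate directly into a length predictor. The helper name `utf8_length` is hypothetical; the sketch checks it against Python's encoder at the boundary of each range.

```python
def utf8_length(code_point: int) -> int:
    """Predict the UTF-8 byte count for a code point (illustrative helper)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

# First and last code points around each threshold.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
predicted = [utf8_length(cp) for cp in boundaries]
actual = [len(chr(cp).encode("utf-8")) for cp in boundaries]
print(predicted == actual)  # True
```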

---

**Your Original Output Explained**

```
2  # Number of characters written

[b'A\xce\x94']  # Contents of the file in bytes

AΔ  # Decoded content from bytes

b'A'  # First byte read (0x41)
b'\xce'  # Second byte read (0xCE)
b'\x94'  # Third byte read (0x94)
b''  # End of file

Length of two unicode characters is: 3, WHY???  # Length of encoded bytes
```

---

**Why Is This Important?**

Understanding the difference between character count and byte length is crucial when:

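- **Reading/Writing Files:** In binary mode, `read(n)` and `seek()` count bytes, so a read can stop in the middle of a multi-byte character and return bytes that cannot be decoded on their own.
- **Networking Applications:** Message sizes and length fields are measured in bytes, so `len(text)` understates the payload size whenever the text contains non-ASCII characters.
- **Databases:** Column and index size limits may be defined in bytes rather than characters, so a string that fits by character count can still exceed the limit.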
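
When a limit is expressed in bytes, slicing the string by character count is not enough. One possible approach is sketched below: slice the encoded bytes, then decode while discarding any incomplete trailing sequence. The helper name `truncate_utf8` is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing, incomplete multi-byte sequence.
    return clipped.decode("utf-8", errors="ignore")

print(truncate_utf8("AΔ", 2))  # A
print(truncate_utf8("AΔ", 3))  # AΔ
```

With a 2-byte budget, only `'A'` fits, because `'Δ'` needs both of its bytes to survive the cut.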