In [8]:
from IPython.display import YouTubeVideo, HTML, Image

In [6]:
%%HTML
<div>
<span style="font-size:18px" align="right">Abdeljebbar BOUBEKRI (Dajebbar)</span>
<br>

<span style="font-size:16px">AI Programming with Python</span>
<br>
<span><b>Licence CC BY-NC-ND</b></span>
</div>


In [9]:
Image('Img/neural-python.jpg', width=150, height=150)

<IPython.core.display.Image object>

# Coding, character sets and unicode

In [2]:

YouTubeVideo(id='oXVmZGN6plY', width=900, height=400)

##### A character is not a byte

With Unicode, we broke the pattern *one character* == *one byte*. So in Python 3, when it comes to manipulating data from various data sources:

* the `bytes` type is appropriate if you want to load raw binary data into memory, as a sequence of bytes;
* the `str` type is suitable for representing a string of characters, which again are not necessarily bytes;
* you switch from one of these types to the other with encoding and decoding operations, as illustrated below;
* and for **all** encoding and decoding operations, it is necessary to know the encoding used.

![the bytes and str types](Img/str-bytes.png)

You can call the `encode` and `decode` methods without specifying the encoding; in Python 3 they then default to `utf-8`, whereas functions such as `open()` fall back to an encoding that depends on your system. However, it is far better to be explicit and choose your encoding. If in doubt, **explicitly specify** `utf-8`, which is becoming the norm at the expense of older encodings like `cp1252` (Windows) and `iso8859-*`, rather than letting the host system choose for you.
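As a minimal sketch of the round trip between `str` and `bytes` (the sample string is an illustrative assumption, not taken from the course material), the following shows that the number of characters and the number of bytes differ as soon as non-ASCII characters are involved:

```python
# Illustrative sample text (an assumption for this sketch)
text = "café €"

# str -> bytes: encoding, with the encoding named explicitly
data = text.encode("utf-8")

# bytes -> str: decoding, with the same encoding
roundtrip = data.decode("utf-8")

print(len(text))          # number of characters
print(len(data))          # number of bytes
print(roundtrip == text)  # the round trip is lossless
```

Here `é` takes two bytes and `€` takes three in UTF-8, which is why the two lengths differ.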

### What is an encoding?

As you know, a computer's memory - or disk - can only store binary representations. So there is no "natural" way to represent a character like 'A', a quotation mark or a semicolon.

An encoding is used for this. For example, [the `US-ASCII` code](http://www.asciitable.com/) stipulates, to keep it simple, that an 'A' is represented by the byte 65, which is written 01000001 in binary. It turns out that there are several encodings, incompatible with one another of course, depending on the system and the language. You can find more details below.
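The 'A' example can be checked directly in the interpreter. This is only a small sketch using built-in functions; the variable names are chosen for illustration:

```python
# Code point of the character 'A'
code = ord("A")

# The same value written as 8 binary digits
bits = format(code, "08b")

# The single byte produced by the US-ASCII encoding
raw = "A".encode("ascii")

print(code, bits, raw)
```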

The important point is that in order to be able to open a file "properly", you must of course have the **contents** of the file, but you must also know the **encoding** that was used to write it.

### Precautions to be taken when encoding your source code

Encoding is not just about string objects, but also about your source code. **Python 3** assumes **by default** that your source code is encoded in **`UTF-8`**. We advise you to keep this encoding, which is the one that offers you the most flexibility.

You can still change the encoding **of your source code** by including in your files, **in the first or second line**, a declaration like this:

```python
# -*- coding: <name_of_encoding> -*-
```
or more simply, like this:


```python
# coding: <name_of_encoding>
```

Note that the first option is also interpreted by the _Emacs_ text editor to use the same encoding. Apart from the use of Emacs, the second option, simpler and therefore more pythonic, is to be preferred.

The name **`UTF-8`** refers to **Unicode** (or, to be precise, to the most popular encoding among those defined in the Unicode standard, as we will see below). On some older systems you may need to use a different encoding. To determine the value to use in your specific case, you can run the following in the interactive interpreter:

```python
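# this must be performed on your machine
import locale
import sys

print(sys.getdefaultencoding())            # always 'utf-8' in Python 3
print(locale.getpreferredencoding(False))  # encoding chosen by your system locale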
```

For example with old versions of Windows (in principle increasingly rare) you may have to write:

```python
# coding: cp1252
```

The syntax of the `coding` line is specified in [this documentation](https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations) and in [PEP 263](https://www.python.org/dev/peps/pep-0263/).

### The great misunderstanding

Suppose I send you a file containing French text encoded with, say, [ISO/IEC 8859-15, a.k.a. `Latin-9`](http://en.wikipedia.org/wiki/ISO/IEC_8859-15). You can see in the table on that page that a '€' character is stored in my file as the byte `0xA4`, that is 164.

Now imagine that you try to open this same file on an old Windows computer configured for French. If it is given no indication of the encoding, the program reading the file on Windows will use the system default encoding, i.e. [CP1252](http://en.wikipedia.org/wiki/Windows-1252). As you can see in that table, the byte `0xA4` corresponds to the character ¤, and this is what you will see instead of €.
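This mismatch can be reproduced in a few lines. The sketch below only uses the two encodings named above, through their Python codec names `iso8859-15` and `cp1252`; the variable names are illustrative:

```python
# The producer encodes the euro sign with Latin-9
euro_bytes = "€".encode("iso8859-15")

# The consumer wrongly decodes the same byte with CP1252
misread = euro_bytes.decode("cp1252")

# Decoding with the encoding actually used gives the right character
correct = euro_bytes.decode("iso8859-15")

print(euro_bytes, misread, correct)
```

The byte is the same in both cases; only the interpretation changes.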

Contrary to what one might hope, this type of problem cannot be solved by adding a `# coding: <name_of_encoding>` tag, which only acts on the encoding used *to read the source file in question* (the one that contains the tag).

To solve this type of problem correctly, you must explicitly specify the encoding to be used to decode the file. You therefore need a reliable way to determine this encoding, which is not always easy, but that is another discussion.
This means that, to be completely clean, you should explicitly pass the `encoding` parameter whenever you call a function or method that accepts it.
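As a hedged illustration of passing `encoding` explicitly, here is a minimal sketch that writes and reads back a small text file with the built-in `open()`. The file name `sample.txt`, the sample text, and the choice of `iso8859-15` are assumptions made for the example:

```python
import os
import tempfile

# Illustrative content containing a non-ASCII character
text = "Prix : 10 €"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")  # hypothetical file name

    # The producer states which encoding it writes with
    with open(path, "w", encoding="iso8859-15") as writer:
        writer.write(text)

    # The consumer states the same encoding when reading
    with open(path, "r", encoding="iso8859-15") as reader:
        read_back = reader.read()

    # Reading in binary mode shows the raw bytes, with no decoding at all
    with open(path, "rb") as raw_reader:
        raw = raw_reader.read()

print(read_back == text)
print(raw)
```

Because both sides name the encoding, the result no longer depends on the default of the machine that runs the code.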

### Why does it work locally?

When the producer (the program that writes the file) and the consumer (the program that reads it) run on the same computer, everything generally works fine, because both programs fall back to the same default encoding.

There is a limit, however: if you are using a minimally configured Linux, it may default to the `US-ASCII` encoding (see below), which, being very old, does not "know" a simple é, let alone €. To write French, the default encoding of your computer must therefore at least contain the French characters, for example:

* `ISO 8859-1` (`Latin-1`)
* `ISO 8859-15` (`Latin-9`)
* `UTF-8`
* `CP1252`

Here again, UTF-8 is clearly to be preferred whenever possible.
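To see the difference between `US-ASCII` and the encodings listed above, the following sketch tries to encode the character é with each of them. The codec names are Python's names for these encodings, and the `results` dictionary is just an illustrative way to collect the outcome:

```python
# Python codec names for US-ASCII and the four encodings listed above
encodings = ["ascii", "latin-1", "iso8859-15", "utf-8", "cp1252"]

results = {}
for name in encodings:
    try:
        results[name] = "é".encode(name)
    except UnicodeEncodeError:
        # This encoding has no representation for the character
        results[name] = None

for name, encoded in results.items():
    print(name, encoded)
```

Only `ascii` fails here; the others all succeed, but UTF-8 produces two bytes where the single-byte encodings produce one.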

In [3]:
%%html
<script src="https://cdn.rawgit.com/parente/4c3e6936d0d7a46fd071/raw/65b816fb9bdd3c28b4ddf3af602bfd6015486383/code_toggle.js"></script>
