###  Introduction of Unicode

One of the major changes in Python 3 was the move to make all strings Unicode. Previously, there was a str type and a unicode type. For example:

```
# Python 2
x = 'blah'
print (type(x))
#<type 'str'>

y = u'blah'
print (type(y))
#<type 'unicode'>
```

If we do the same thing in Python 3, both literals produce the same str type:

```
# Python 3
x = 'blah'
print (type(x))
#<class 'str'>

y = u'blah'
print (type(y))
#<class 'str'>
```
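As a small sketch of the Python 3 side (the variable names are illustrative, not from the original example), the u prefix is accepted only for backward compatibility, while raw binary data lives in a separate bytes type:

```python
# Python 3
text = 'blah'
legacy = u'blah'   # the u prefix is allowed but changes nothing
raw = b'blah'      # bytes literal: a different type altogether

print(type(text) is type(legacy))  # True
print(type(raw))                   # <class 'bytes'>
print(text == raw)                 # False: str and bytes never compare equal
```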

Python 3 assumes UTF-8 as the default encoding for source files. This means you can use Unicode characters directly in your strings and even in variable names. So a source file containing Russian text runs as-is in Python 3, whereas Python 2 rejects it unless the file declares its encoding.
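A minimal sketch of what that allows, assuming the file is saved as UTF-8 (the identifier and the text are made up for illustration):

```python
# Python 3: non-ASCII characters in an identifier and in a string literal
привет = 'Привет, мир'

print(len(привет))                  # 11 characters
print(len(привет.encode('utf-8')))  # 20 bytes: each Cyrillic letter takes two
```

The character count and the byte count differ because len on a str counts code points, not encoded bytes.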

```
# Python 2
print (repr('abcdef' + chr(255)))
#'abcdef\xff'
```

Note that the string ends with the escape \xff rather than a readable character. That should be a “ÿ”; \xff is just the hexadecimal escape for that byte. In Python 3 you will get what you expect:

```
# Python 3
print ('abcdef' + chr(255))
#abcdefÿ

print (('abcdef' + chr(255)).encode('utf-8'))
#b'abcdef\xc3\xbf'
```
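To see where the \xff and \xc3\xbf forms come from, here is a minimal sketch (the name ch is only illustrative). Code point 255 fits in a single byte in Latin-1, but UTF-8 needs two bytes for it:

```python
# Python 3
ch = chr(255)

print(ch == 'ÿ')             # True
print(hex(ord(ch)))          # 0xff
print(ch.encode('latin-1'))  # b'\xff'
print(ch.encode('utf-8'))    # b'\xc3\xbf'
```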

###  Encoding/Decoding

Back in the Python 2 days, you quickly learned that calling decode on a unicode string or encode on a str rarely did what you wanted. If you took a unicode string and tried to decode it as ASCII, Python 2 first implicitly encoded it to a byte string with the ASCII codec, so you got a UnicodeEncodeError even though you had asked for a decode. For example:

In [None]:
u"\xa0".decode("ascii")

#Traceback (most recent call last):
#   File "/usercode/__ed_file.py", line 1, in <module>
#    u"\xa0".decode("ascii")
#UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

If you tried something like this in Python 3, you would get an AttributeError:

In [2]:
u"\xa0".decode("ascii")

AttributeError: 'str' object has no attribute 'decode'

The reason is that strings in Python 3 don’t have a decode method. However, byte strings do! Let’s try a byte string instead:

In [3]:
b"\xa0".decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

That still didn’t work the way we wanted, because the ASCII codec doesn’t know how to handle the byte 0xa0, which lies outside the ASCII range (0 to 127). Fortunately, you can pass an error handler as the second argument to the decode method to make this work:

In [4]:
b"\xa0".decode("ascii","replace")

'�'

In [5]:
b"\xa0".decode("ascii",'ignore')

''

These examples show what happens when you tell the decode method to replace the undecodable byte or to just ignore it. With replace, the byte becomes U+FFFD, the Unicode replacement character; with ignore, it is dropped, leaving an empty string.
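The same idea can be sketched with one more handler and with a codec that actually covers the byte. The choice of Latin-1 below is an assumption about where the data came from, not something the example above establishes:

```python
# Python 3
raw = b"\xa0"

# Error handlers control what happens to bytes the codec cannot decode
print(ascii(raw.decode("ascii", "replace")))           # '\ufffd'
print(ascii(raw.decode("ascii", "ignore")))            # ''
print(ascii(raw.decode("ascii", "backslashreplace")))  # '\\xa0'

# Assumption: the byte was written as Latin-1, where 0xa0 is a no-break space
print(ascii(raw.decode("latin-1")))                    # '\xa0'
```

Error handlers are a way to survive bad input. When you know the real encoding, decoding with that codec loses no information.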

Now let’s look at an example from the Python documentation to learn how to encode a string:

In [6]:
u = chr(40960) + 'abcd' + chr(1972)
print (u.encode('utf-8'))


print (u.encode('ascii'))

b'\xea\x80\x80abcd\xde\xb4'


UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)

In [7]:
u = chr(40960) + 'abcd' + chr(1972)

print (u.encode('ascii', 'ignore'))
#b'abcd'

print (u.encode('ascii', 'replace'))
#b'?abcd?'


For this example, we took a string and added a non-ASCII character to the beginning and to the end. Then we tried to convert the string to bytes using the encode method. Encoding to UTF-8 worked, but encoding to ASCII raised a UnicodeEncodeError. The next attempt used the ignore error handler, which dropped the non-ASCII characters entirely. The last one used the replace error handler, which puts a question mark in place of each character that cannot be encoded.
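Both ignore and replace throw information away. As a short sketch using the same string, two other standard error handlers keep the code points recoverable by writing them out as escapes:

```python
# Python 3
u = chr(40960) + 'abcd' + chr(1972)

# XML/HTML numeric character references
print(u.encode('ascii', 'xmlcharrefreplace'))  # b'&#40960;abcd&#1972;'

# Python-style backslash escapes
print(u.encode('ascii', 'backslashreplace'))   # b'\\ua000abcd\\u07b4'
```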

If you need to work with encodings a lot, Python also has a module called **codecs** that you should check out.
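As a hedged taste of what codecs offers, the sketch below looks up a codec by alias and decodes data that arrives in pieces. The chunk contents are invented for illustration:

```python
import codecs

# Look up a codec by one of its aliases
print(codecs.lookup('utf8').name)  # utf-8

# Decode bytes that arrive in chunks, even when a chunk boundary
# falls in the middle of a multi-byte character
decoder = codecs.getincrementaldecoder('utf-8')()
chunks = [b'abc\xc3', b'\xbfdef']  # 'ÿ' (b'\xc3\xbf') is split across chunks
text = ''.join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b'', final=True)  # flush; raises if bytes are left over
print(text)  # abcÿdef
```

Calling decode on each chunk separately would fail here, because the first chunk ends with an incomplete UTF-8 sequence. The incremental decoder buffers that byte until the rest arrives.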