**Some Useful Resources**

https://docs.python.org/3/howto/unicode.html

http://unicode.org/charts/PDF/U0900.pdf

https://www.pythonsheets.com/notes/python-unicode.html

https://en.m.wikipedia.org/wiki/Devanagari_(Unicode_block)

In Python (2 or 3), a string can be represented either as bytes or as Unicode code points.
A byte is a unit of information made up of 8 bits; every file on a hard disk is stored as a sequence of bytes. <br><br>
Unicode is an international standard that maps each individual character to a unique number, called a code point. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering many scripts, including those used for English, Hindi, Chinese and Japanese, as well as emojis. Each of these 137k characters is identified by its own code point, so code points stand for the actual characters that are displayed.
To store or transmit text, code points are encoded to bytes, and those bytes are later decoded back to code points. Examples: the code point for the letter a is U+0061, for the emoji 🖐 it is U+1F590, and for Ω it is U+03A9.
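
As a small illustration of the mapping described above, the built-in functions ord() and chr() convert between a character and its code point, and encode()/decode() convert between code points and bytes. This is only a sketch using the three example characters; the variable names are made up for illustration.

```python
# The three example characters, written as escapes so the code points are explicit.
samples = ["a", "\U0001F590", "\u03a9"]  # a, raised-hand emoji, Greek capital omega

# ord() returns the code point of a character; format it in the usual U+XXXX style.
labels = [f"U+{ord(ch):04X}" for ch in samples]
print(labels)  # ['U+0061', 'U+1F590', 'U+03A9']

# chr() is the inverse of ord().
omega = chr(0x03A9)

# Encoding turns code points into bytes; decoding reverses it.
encoded = omega.encode("utf-8")
print(encoded)  # b'\xce\xa9'
decoded = encoded.decode("utf-8")
print(decoded == omega)  # True
```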

**UTF-8:** It uses 1, 2, 3 or 4 bytes to encode each code point, and it is backwards compatible with ASCII. All ASCII characters, which cover plain English text, need just 1 byte, which is quite efficient. More bytes are needed only for non-ASCII characters; Devanagari letters, for example, take 3 bytes each.
It is the most popular encoding, and it is the default in Python 3 for source files and for str.encode() and bytes.decode(). In Python 2, the default encoding is ASCII (unfortunately). <br><br>
**UTF-16** is variable width, using 2 or 4 bytes per code point. This encoding works well for Asian text, as most of it can be encoded in 2 bytes per character. It is wasteful for English, as every English character also needs 2 bytes.<br><br>
**UTF-32** is fixed width: every code point is encoded in 4 bytes, so it needs a lot of memory. It is not used very often.
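
The size trade-offs above can be checked directly by encoding the same words with each scheme and counting the bytes. The sketch below assumes the little-endian codec names ("utf-16-le", "utf-32-le") so that no byte order mark is added to the counts; the sample words and variable names are illustrative only.

```python
# One ASCII word and one Devanagari word, each four code points long.
samples = {
    "English": "guru",
    "Devanagari": "\u0917\u0941\u0930\u0941",  # the word 'गुरु'
}
# The -le variants skip the byte order mark, so only the characters are counted.
encodings = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {}
for label, word in samples.items():
    sizes[label] = {enc: len(word.encode(enc)) for enc in encodings}
    print(label, sizes[label])
# English {'utf-8': 4, 'utf-16-le': 8, 'utf-32-le': 16}
# Devanagari {'utf-8': 12, 'utf-16-le': 8, 'utf-32-le': 16}
```

For this Devanagari word UTF-16 is the most compact of the three, while for the ASCII word UTF-8 is.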


<br><br>
 **What data types in Python handle Unicode code points and bytes?** <br>
As discussed earlier, a Python string can be represented either as bytes or as Unicode code points.
The main takeaways in Python are:
1. Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. All strings are str by default, which means bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII cannot represent those characters. In this case, we need to remember to call decode("utf-8") when reading files. This is inconvenient.
2. Python 3 came and fixed this. Strings are still str by default, but str now means Unicode code points: we carry what we see. To store these strings in files, we encode them to the bytes type. The default encoding is UTF-8 instead of ASCII. Perfect!
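
A minimal Python 3 sketch of the str/bytes split described in the list above. The temporary file name is hypothetical, and the encoding is passed to open() explicitly because the default used by open() in text mode can depend on the platform locale rather than always being UTF-8.

```python
import os
import tempfile

text = "\u0917\u0941\u0930\u0941"  # the word 'गुरु' as a str (code points)
data = text.encode("utf-8")        # the same word as bytes
print(type(text).__name__, type(data).__name__)  # str bytes

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Text mode: Python encodes the str for us using the encoding we name.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode: we get the raw bytes exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text mode again: the bytes are decoded back to a str.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(raw == data, round_trip == text)  # True True
```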

In [None]:

    #  \\U........      # 8-digit hex escapes
    # | \\u....          # 4-digit hex escapes
    # | \\x..            # 2-digit hex escapes
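
The three escape forms listed above look like the ones Python accepts in string literals. Assuming that is what they refer to, the sketch below writes one character with each form; the variable names are illustrative.

```python
# \x.. takes exactly 2 hex digits (code points up to U+00FF).
a = "\x61"
# \u.... takes exactly 4 hex digits (code points up to U+FFFF).
ga = "\u0917"
# \U........ takes exactly 8 hex digits (any code point).
hand = "\U0001F590"

print(a, ga, hand)

# Each escape produces a single code point, whatever its written length.
lengths = [len(a), len(ga), len(hand)]
print(lengths)  # [1, 1, 1]
```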

    

In [2]:
text=''' रचन: तुलसी दास्
दॊहा
श्री गुरु चरन सरॊज रज निजमनु मुकुरु सुधारि ।
बरनऊ रघुबर बिमल जसु जॊ दायकु फल चारि ॥
बुद्धिहीन ननु जानिकॆ सुमिरौ पवन कुमार ।
बल बुद्धि विद्या दॆहु मॊहि हरहु कलॆस बिकार् ॥'''

In [3]:
text

' रचन: तुलसी दास्\nदॊहा\nश्री गुरु चरन सरॊज रज निजमनु मुकुरु सुधारि ।\nबरनऊ रघुबर बिमल जसु जॊ दायकु फल चारि ॥\nबुद्धिहीन ननु जानिकॆ सुमिरौ पवन कुमार ।\nबल बुद्धि विद्या दॆहु मॊहि हरहु कलॆस बिकार् ॥'

In [4]:
arr=[]

In [6]:
for i in text.split():
    print(i," ",i.encode())
    arr.append([i,i.encode()])

रचन:   b'\xe0\xa4\xb0\xe0\xa4\x9a\xe0\xa4\xa8:'
तुलसी   b'\xe0\xa4\xa4\xe0\xa5\x81\xe0\xa4\xb2\xe0\xa4\xb8\xe0\xa5\x80'
दास्   b'\xe0\xa4\xa6\xe0\xa4\xbe\xe0\xa4\xb8\xe0\xa5\x8d'
दॊहा   b'\xe0\xa4\xa6\xe0\xa5\x8a\xe0\xa4\xb9\xe0\xa4\xbe'
श्री   b'\xe0\xa4\xb6\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa5\x80'
गुरु   b'\xe0\xa4\x97\xe0\xa5\x81\xe0\xa4\xb0\xe0\xa5\x81'
चरन   b'\xe0\xa4\x9a\xe0\xa4\xb0\xe0\xa4\xa8'
सरॊज   b'\xe0\xa4\xb8\xe0\xa4\xb0\xe0\xa5\x8a\xe0\xa4\x9c'
रज   b'\xe0\xa4\xb0\xe0\xa4\x9c'
निजमनु   b'\xe0\xa4\xa8\xe0\xa4\xbf\xe0\xa4\x9c\xe0\xa4\xae\xe0\xa4\xa8\xe0\xa5\x81'
मुकुरु   b'\xe0\xa4\xae\xe0\xa5\x81\xe0\xa4\x95\xe0\xa5\x81\xe0\xa4\xb0\xe0\xa5\x81'
सुधारि   b'\xe0\xa4\xb8\xe0\xa5\x81\xe0\xa4\xa7\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa4\xbf'
।   b'\xe0\xa5\xa4'
बरनऊ   b'\xe0\xa4\xac\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa4\x8a'
रघुबर   b'\xe0\xa4\xb0\xe0\xa4\x98\xe0\xa5\x81\xe0\xa4\xac\xe0\xa4\xb0'
बिमल   b'\xe0\xa4\xac\xe0\xa4\xbf\xe0\xa4\xae\xe0\xa4\xb2'
जसु   b'\xe0\xa4\x9c\xe0\xa4\xb8\xe0\xa5\

In [7]:
##let's look at this particular word
# 'गुरु'
g=b'\xe0\xa4\x97\xe0\xa5\x81\xe0\xa4\xb0\xe0\xa5\x81'

In [8]:
g.decode()

'गुरु'

In [13]:
##let's see the individual code points g is made of (each takes 3 bytes in UTF-8)
g1=b'\xe0\xa4\x97'
g2=b'\xe0\xa5\x81'
g3=b'\xe0\xa4\xb0'
g4=b'\xe0\xa5\x81'
print('Individual Code points in ',g.decode())
print(g1.decode())
print(g2.decode())
print(g3.decode())
print(g4.decode())

Individual Code points in  गुरु
ग
ु
र
ु
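
Instead of slicing the bytes by hand, the same breakdown can be obtained by decoding once and iterating over the resulting str, which yields one code point at a time. The sketch below also looks up each official character name with the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

g = b'\xe0\xa4\x97\xe0\xa5\x81\xe0\xa4\xb0\xe0\xa5\x81'
word = g.decode("utf-8")

# Iterating over a str yields its code points one by one.
breakdown = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]
for code_point, name in breakdown:
    print(code_point, name)
# U+0917 DEVANAGARI LETTER GA
# U+0941 DEVANAGARI VOWEL SIGN U
# U+0930 DEVANAGARI LETTER RA
# U+0941 DEVANAGARI VOWEL SIGN U
```

The vowel signs are combining marks, which is why they render attached to the preceding consonant when the word is displayed.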


In [9]:
type(g)

bytes

In [10]:
# To get the text back from the encoded bytes

for i in arr:
    u=i[1] # the UTF-8 bytes are stored at index 1 of each [word, bytes] pair
    print(u.decode())

रचन:
तुलसी
दास्
दॊहा
श्री
गुरु
चरन
सरॊज
रज
निजमनु
मुकुरु
सुधारि
।
बरनऊ
रघुबर
बिमल
जसु
जॊ
दायकु
फल
चारि
॥
बुद्धिहीन
ननु
जानिकॆ
सुमिरौ
पवन
कुमार
।
बल
बुद्धि
विद्या
दॆहु
मॊहि
हरहु
कलॆस
बिकार्
॥


In [11]:
# encode() returns a sequence of bytes (a bytes object).
# The encoding to use is selected by the encoding parameter.
# There are various character encoding schemes; str.encode() in Python 3 uses UTF-8 by default.
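
A short sketch of the encoding parameter in use, assuming Python 3. It also shows the errors parameter of str.encode(), which controls what happens when a character cannot be represented in the chosen encoding; the variable names are illustrative.

```python
word = "\u0917\u0941\u0930\u0941"  # the word 'गुरु'

# With no argument, encode() uses UTF-8.
default_bytes = word.encode()
utf8_bytes = word.encode("utf-8")
print(default_bytes == utf8_bytes)  # True

# A different encoding gives different bytes for the same text.
utf16_bytes = word.encode("utf-16-le")
print(len(utf8_bytes), len(utf16_bytes))  # 12 8

# ASCII cannot represent Devanagari, so strict encoding raises an error.
ascii_error = None
try:
    word.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

# errors="replace" substitutes '?' for each character that cannot be encoded.
replaced = word.encode("ascii", errors="replace")
print(replaced)  # b'????'
```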