<a href="https://colab.research.google.com/github/eftekhar-hossain/SKBI_Training/blob/main/Session_5_(Python_Dictionaries).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center> <h1> <u> <font color='red'> Training on AI and ML with Python </font> </u> </h1> </center>

# Session-5: Python Dictionaries


> Objective: 
 1. *Get familiar with Python dictionaries.*
 2. *Solve some advanced text analysis problems using them.*


# Dictionaries

A dictionary is like a list, but more general. In a list, the index positions have to be integers; in a dictionary, the indices can be (almost) any type.

You can think of a dictionary as a mapping between a set of indices (which are
called **keys**) and a set of **values**. Each key maps to a value. The association of a
key and a value is called a **key-value pair** or sometimes an **item**.
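
As a small illustration of keys that are not integers, the sketch below uses strings and tuples as keys, and shows that a list is rejected because keys must be hashable. The variable names (`ages`, `grid`, `list_key_error`) are made up for this example, and it uses the curly-bracket literal syntax for brevity.

```python
# Keys can be strings, numbers, or tuples (any hashable value).
ages = {'alice': 30, 'bob': 25}              # string keys
grid = {(0, 0): 'origin', (1, 2): 'point'}   # tuple keys

print(ages['alice'])   # look up by string key
print(grid[(1, 2)])    # look up by tuple key

# A list is mutable and therefore not hashable, so it cannot be a key.
try:
  bad = {[1, 2]: 'oops'}
except TypeError as err:
  list_key_error = str(err)
  print('TypeError:', list_key_error)
```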

The function `dict` creates a new dictionary with no items. Because `dict` is the name of a built-in function, you should avoid using it as a variable name.





In [None]:
eng2sp = dict()
print(eng2sp)

{}


The curly brackets, `{}`, represent an empty dictionary. To add items to the dictionary, you can use **square brackets**:

In [None]:
eng2sp['one'] = 'uno'
print(eng2sp)
#This line creates an item that maps from the key 'one' to the value “uno”.

{'one': 'uno'}


In [None]:
d = {"M":"Mercury","P":"Planet"}
print(d)

{'M': 'Mercury', 'P': 'Planet'}


In [None]:
# you can create a new dictionary with three items.
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print(eng2sp)

{'one': 'uno', 'two': 'dos', 'three': 'tres'}


In [None]:
# you use the keys to look up the corresponding values
print(eng2sp['two'])

dos


In [None]:
# The len function works on dictionaries; it returns the number of key-value pairs:
len(eng2sp)

3

The `in` operator works on dictionaries; it tells you whether something appears as a `key` in the dictionary.

In [None]:
print('one' in eng2sp)
print('uno' in eng2sp)

True
False


To see whether something appears as a value in a dictionary, you can use the
method **`values`**, which returns the values as a type that can be converted to a list, and then use the `in` operator. The method **`keys`** does the same for the keys:

In [None]:
# how to find all the keys of a dictionary

key = list(eng2sp.keys())
print(key)

['one', 'two', 'three']


In [None]:
# how to find all the values of a dictionary
vals = list(eng2sp.values())
print(vals)
print('uno' in vals)

['uno', 'dos', 'tres']
True
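
Converting to a list is not strictly required: the `in` operator also works directly on the object returned by `values()`. A minimal sketch, repeating the `eng2sp` dictionary so the cell runs on its own:

```python
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

# 'in' on the dictionary itself checks the keys
print('uno' in eng2sp)

# 'in' on values() checks the values, no list() needed
print('uno' in eng2sp.values())
```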


## Dictionary as a set of counters

In [None]:
# Suppose you are given a string and you want to count how many times each letter appears.

word = 'brontosaurus'
d = dict()
for c in word:
  if c not in d:
    d[c] = 1
  else:
    d[c] = d[c] + 1
print(d)


{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}


**We are effectively computing a histogram, which is a statistical term for a set of counters (or frequencies).** The `for` loop traverses the string. Each time through the loop, if the character `c` is not in the dictionary, we create a new item with key `c` and the initial value `1` (since we have seen this letter once). If `c` is already in the dictionary, we increment `d[c]`.

Dictionaries have a method called **get** that takes a **`key`** and a default **`value`**. If the key appears in the dictionary, get returns the corresponding value; otherwise it returns the default value. For example:

In [None]:
counts = { 'liza' : 1 , 'annie' : 42, 'jan': 100}
print(counts.get('jan', 0))
print(counts.get('tim', 0))


100
0


We can use **`get`** to write our histogram loop more concisely. Because the **`get`**
method automatically handles the case where a `key` is not in a dictionary, we can
reduce four lines down to one and eliminate the `if` statement.

In [None]:
word = 'brontosaurus'
d = dict()
for c in word:
  d[c] = d.get(c,0) + 1
print(d)

{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}
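
For comparison, the standard library's `collections.Counter` builds the same kind of histogram in one call. This is offered as an alternative sketch, not as a replacement for understanding the loop.

```python
from collections import Counter

word = 'brontosaurus'
d = Counter(word)        # counts each character in the string
print(dict(d))

# A missing key counts as 0 instead of raising KeyError
print(d['r'], d['z'])
```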


## Dictionaries and files

In [None]:
!wget https://www.py4e.com/code3/romeo.txt

In [None]:
!wget https://www.py4e.com/code3/romeo-full.txt

**One of the common uses of a dictionary is to count the occurrence of words in a file with some written text.**
```
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
```
We will write a Python program to read through the lines of the file, break each
line into a list of words, and then loop through each of the words in the line and count each word using a dictionary.



In [None]:
fname = input('Enter the file name: ')
try:
  fhand = open(fname)
except OSError:
  print('File cannot be opened:', fname)
  exit()

# word-frequency  
counts = dict()
for line in fhand:
  words = line.split()
  for word in words:
    if word not in counts:
      counts[word] = 1
    else:
      counts[word] = counts[word] + 1
print(counts)
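
The same counting loop can be tried without a file on disk. The sketch below makes two assumptions of its own: it keeps the four sample lines in a string (the name `text` is invented here), and it uses `get` in place of the `if`/`else`.

```python
# Sample lines kept in a string so the example is self-contained.
text = """But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"""

counts = dict()
for line in text.splitlines():
  for word in line.split():
    counts[word] = counts.get(word, 0) + 1   # start at 0 for a new word

print(counts['the'], counts['is'], counts['sun'])
```

When reading from a real file, `for line in fhand` takes the place of `text.splitlines()`.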

## Looping and dictionaries

If you use a dictionary as the sequence in a for statement, it traverses the keys of the dictionary.

In [None]:
counts = { 'it' : 1 , 'annie' : 42, 'jan': 100}
for key in counts:
  print(key, counts[key])

# Since Python 3.7 the keys come out in insertion order; older versions made no ordering guarantee.

it 1
annie 42
jan 100


In [None]:
# if we wanted to find all the entries in a dictionary with a value above ten
counts = { 'it' : 1 , 'annie' : 42, 'jan': 100}

for key in counts:
  if counts[key] > 10 :
    print(key, counts[key])

# The for loop iterates through the keys of the dictionary, 
# so we must use the index operator to retrieve the corresponding value for each key.    

annie 42
jan 100


If you want to print the keys in alphabetical order, first make a list of the keys using the dictionary's `keys` method, then sort that list and loop through it, looking up each key and printing the key-value pairs in sorted order:

In [None]:
counts = { 'it' : 1 , 'annie' : 42, 'jan': 100}
lst = list(counts.keys())
print(lst)
lst.sort()
print("Sorted---> ",lst)
for key in lst:
  print(key, counts[key])

['it', 'annie', 'jan']
Sorted--->  ['annie', 'it', 'jan']
annie 42
it 1
jan 100
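
A shorter route to the same result, sketched here, is the built-in `sorted`, which accepts the dictionary directly and returns a new sorted list of its keys. The name `ordered_keys` is made up for this example.

```python
counts = {'it': 1, 'annie': 42, 'jan': 100}

for key in sorted(counts):       # sorted() iterates over the keys
  print(key, counts[key])

ordered_keys = sorted(counts)    # the dictionary itself is left unchanged
print(ordered_keys)
```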


# Advanced Text Parsing

In [None]:
romeo = ["But, soft! what light through yonder window breaks?\nIt is the east, and Juliet is the sun.\nArise, fair sun, and kill the envious moon,\nWho is already sick and pale with grief,"]
for line in romeo:
  print(line)

But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon,
Who is already sick and pale with grief,


Since the Python **split** method looks for whitespace and treats words as tokens separated by it, we would treat the words “soft!” and “soft” as different words and create a separate dictionary entry for each word.
Also since the file has capitalization, we would treat **“who”** and **“Who”** as different words with different counts.

We can solve both these problems by using the string methods **`lower`** and **`translate`**, together with the constant **`string.punctuation`**. The `translate` method is the most subtle of these. Here is the documentation for `translate`:

**`line.translate(str.maketrans(fromstr, tostr, deletestr))`**

Replace the characters in **fromstr** with the character in the same position in **tostr**
and delete all characters that are in **deletestr**. The fromstr and tostr can be
empty strings and the deletestr parameter can be omitted.
We will not specify the tostr but we will use the deletestr parameter to delete
all of the punctuation. We will even let Python tell us the list of characters that
it considers “punctuation”:

In [None]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
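
To see the effect on a single line before running it on a whole file, here is a small sketch. The variable names `table` and `cleaned` are chosen for this example only.

```python
import string

line = 'But, soft! what light through yonder window breaks?'

# Build a translation table that deletes every punctuation character.
table = str.maketrans('', '', string.punctuation)
cleaned = line.translate(table).lower()

print(cleaned)
print(cleaned.split())
```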

In [None]:
fname = input('Enter the file name: ') ## romeo-full.txt
try:
  fhand = open(fname)
except OSError:
  print('File cannot be opened:', fname)
  exit()

counts = dict()
for line in fhand:
  line = line.rstrip() 
  line = line.translate(str.maketrans('', '', string.punctuation)) # punctuation removal code
  line = line.lower() # make lowercase 
  words = line.split()
  for word in words:
    if word not in counts:
      counts[word] = 1
    else:
      counts[word] += 1
print(counts)

# Tuples are immutable

Dictionaries have a method called **items** that returns the key-value pairs as tuples. In Python 3 the result is a view object, so we wrap it in `list` to get a list of tuples:

In [None]:
d = {'a':10, 'b':1, 'c':22}
t = list(d.items())
print(t)

[('a', 10), ('b', 1), ('c', 22)]


In [None]:
for key,val in d.items():
  print(key,val)

a 10
b 1
c 22


Since Python 3.7, a dictionary keeps its items in insertion order, which is not necessarily sorted order.

However, since the list of tuples is a list, and tuples are comparable, we can now sort the list of tuples. Converting a dictionary to a list of tuples is a way for us to output the contents of a dictionary sorted by key:

In [None]:
d = {'b':1, 'a':10,'c':22}
t = list(d.items())  ##  return list of tuples
print(t)
t.sort()
print(t)

#The new list is sorted in ascending alphabetical order by the key value.

[('b', 1), ('a', 10), ('c', 22)]
[('a', 10), ('b', 1), ('c', 22)]


## Multiple assignment with dictionaries

Combining **items**, **tuple assignment**, and **for**, you can see a nice code pattern for traversing the **keys** and **values** of a dictionary in a single loop:

In [None]:
for key, val in list(d.items()):
  print(val, key)

1 b
10 a
22 c


This loop has two iteration variables because `items` yields tuples and **key, val** is a tuple assignment that successively iterates through each of the key-value
pairs in the dictionary.
For each iteration through the loop, both `key` and `val` are advanced to the next
key-value pair in the dictionary.

**If we combine these two techniques, we can print out the contents of a dictionary sorted by the value stored in each key-value pair.**

To do this, we first make a list of tuples where each tuple is (value, key). The
items method would give us a list of (key, value) tuples, but this time we want
to sort by value, not key. Once we have constructed the list with the value-key
tuples, it is a simple matter to sort the list in reverse order and print out the new,
sorted list.

In [None]:
d = {'a':10, 'b':1, 'c':22}
l = list()
for key, val in d.items() :
  l.append( (val, key) )

print(l)

l.sort(reverse=True)
print(l)


[(10, 'a'), (1, 'b'), (22, 'c')]
[(22, 'c'), (10, 'a'), (1, 'b')]
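
The same ordering can be sketched without building the intermediate (value, key) list, by passing a `key` function to the built-in `sorted`. The name `by_value` is made up for this example.

```python
d = {'a': 10, 'b': 1, 'c': 22}

# Sort the (key, value) pairs by the value (item[1]), largest first.
by_value = sorted(d.items(), key=lambda item: item[1], reverse=True)
print(by_value)

for key, val in by_value:
  print(key, val)
```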


## The most Common Words

In [None]:
import string
fhand = open('romeo-full.txt')
counts = dict()

for line in fhand:
  # remove punctuation from the lines
  line = line.translate(str.maketrans('', '', string.punctuation))
  line = line.lower()
  words = line.split()
  for word in words:
    if word not in counts:
      counts[word] = 1
    else:
      counts[word] += 1

# Sort the dictionary by value
lst = list()
for key, val in list(counts.items()):
  lst.append((val, key))

lst.sort(reverse=True)
# Each tuple is (count, word), so unpack in that order
for val, key in lst[:10]:
  print(key, val)

The first part of the program, which reads the file and computes the dictionary
that maps each word to its count in the document, is unchanged. But
instead of simply printing out counts and ending the program, we construct a list of **`(val, key)`** tuples and then sort the list in **reverse order**.


Since the value is first, it will be used for the comparisons. If there is more than one tuple with the same value, it will look at the second element (the key), so tuples where the value is the same will be further sorted by the alphabetical order of the key (reversed here, because the whole sort is in reverse order). At the end we write a nice for loop which does a multiple assignment iteration and prints out the ten most common words by iterating through a slice of the list.
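
The tie-breaking behaviour is easy to check on a tiny hand-made dictionary (the words and counts below are invented for illustration). With `reverse=True`, ties on the count are reversed too, so tied words come out in reverse alphabetical order. Sorting on `(-count, word)` is one possible way to get descending counts with ties in alphabetical order.

```python
counts = {'sun': 2, 'moon': 1, 'and': 3, 'the': 3, 'is': 3}

# (count, word) tuples sorted in reverse: ties on the count are reversed too
pairs = list()
for key, val in counts.items():
  pairs.append((val, key))
pairs.sort(reverse=True)
print(pairs)

# Negating the count sorts counts descending while words stay ascending
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ordered)
```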


# Practice Exercises
1. Write a program that categorizes each mail message by
which day of the week the commit was done. To do this look for lines
that start with “From”, then look for the third word and keep a running
count of each of the days of the week. At the end of the program print
out the contents of your dictionary (order does not matter).

  **Sample Line:**
  ```
  From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
  ```
  **Sample Execution:**
  ```
  Enter a file name: mbox-short.txt
  {'Fri': 20, 'Thu': 6, 'Sat': 1}
  ```

2. Write a program to read through a mail log, build a histogram using a dictionary to count how many messages have come from
each email address, and print the dictionary.
  ```
  Enter file name: mbox-short.txt
  {'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3,
  'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1,
  'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3,
  'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1,
  'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2,
  'ray@media.berkeley.edu': 1}
  ```

3. Add code to the above program to figure out who has the
most messages in the file. After all the data has been read and the dictionary has been created, look through the dictionary using a maximum
loop to find who has the most messages and print how many messages the person has.

  ```
  Enter a file name: mbox-short.txt
  cwen@iupui.edu 5
  Enter a file name: mbox.txt
  zqian@umich.edu 195

  ```

4. Write a program that records the domain name (instead of the
whole email address) that each message was sent from. At the end of the program, print
out the contents of your dictionary.

  ```
  Enter a file name: mbox-short.txt
  {'media.berkeley.edu': 4, 'uct.ac.za': 6, 'umich.edu': 7,
  'gmail.com': 1, 'caret.cam.ac.uk': 1, 'iupui.edu': 8}
  ```

5. Revise a previous program as follows: Read and parse the
“From” lines and pull out the addresses from the line. Count the number of messages from each person using a dictionary.
After all the data has been read, print the person with the most commits
by creating a list of (count, email) tuples from the dictionary. Then
sort the list in reverse order and print out the person who has the most
commits.
Sample Line:
```
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
```
```
Enter a file name: mbox-short.txt
cwen@iupui.edu 5
Enter a file name: mbox.txt
zqian@umich.edu 195
```


6. This program counts the distribution of the hour of the day
for each of the messages. You can pull the hour from the “From” line
by finding the time string and then splitting that string into parts using
the colon character. Once you have accumulated the counts for each
hour, print out the counts, one per line, sorted by hour as shown below:

  ```
  Enter a file name: mbox-short.txt
  04 3
  06 1
  07 1
  09 2
  10 3
  11 6
  14 1
  15 2
  16 4
  17 2
  18 1
  19 1

  ```

# References:
1. [Programiz.com](https://www.programiz.com/python-programming/operators)
2. [Python for Everybody-Coursera](https://www.coursera.org/learn/python)