<a href="https://colab.research.google.com/github/adrian-alejandro/BDMA/blob/main/data-management/AdvancedPython.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Python

# Advanced Python Session

The purpose of this session is to provide you with a reference document
on libraries, techniques and tools that will be helpful throughout the rest of
the programming-based sessions in the master's programme. Thus, instead of a classical
lab session, we will follow a tutorial format, where the assignment document
is considerably longer than in the other sessions and no deliverable is expected.

## Tasks To Do Before the Session
Before coming to this session, you should have checked the session slides and
written down any questions you might have about them. We also encourage you
to check the Python documentation on the following website:
- Python (https://docs.python.org/3/index.html)


## Tutorial
### A. Data structures manipulation with Python
Since its origins, Python has included many operations and libraries for working with data collections. We'll review the ones that will
be most useful for completing the exercises. Additional information can
be found in the [Python Standard Library official documentation](https://docs.python.org/3/library/index.html).

#### A.1 Lists
A [list](https://docs.python.org/3/library/stdtypes.html#list) is an ordered collection whose elements can be accessed by their integer index. Together with the tuple and the range, it makes up the [sequence types in Python](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range). A list of elements
is defined as:


In [2]:
emptyList = []
emptyList.append("A")
first = emptyList[0]
print(first)


A
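As a small sketch of index-based access (the list `letters` and the variable names are only illustrative), the following shows positive and negative indices, slicing and in-place growth:

```python
letters = ["A", "B", "C", "D"]

first = letters[0]      # indices start at 0
last = letters[-1]      # negative indices count from the end
middle = letters[1:3]   # slicing returns a new list; the end index is excluded

letters.append("E")     # lists are mutable and can grow in place
print(first, last, middle, len(letters))
```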


Python has no built-in equivalent of the Java method Lists.partition() (from the Guava library); instead, we can use itertools.islice, which returns an iterator over a selected number of elements of an iterable. For instance, assume we have created a list with four elements:

In [3]:
import itertools
alist = ["A","B","C","D"]

We define a function to encapsulate the steps needed for using islice:

In [4]:
def partition(l, size) :
  it = iter(l)
  return iter( lambda : tuple ( itertools.islice(it, size)) , ())

Then, we can call the partition() function, specifying a partition size of two:

In [5]:
partitions = list(partition(alist,2))
print(partitions)

[('A', 'B'), ('C', 'D')]


In this example, the list *partitions* contains [('A', 'B'), ('C', 'D')]. The
function returns consecutive tuples of the same size, with the exception of
the last one, which may be smaller.
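To illustrate the smaller last partition, here is a minimal sketch that repeats the partition() function from above and applies it to a list of five elements (the list `uneven` is just an example):

```python
import itertools

def partition(l, size):
    it = iter(l)
    # iter(callable, sentinel) keeps calling the lambda until it returns ()
    return iter(lambda: tuple(itertools.islice(it, size)), ())

# Five elements cannot be split evenly into groups of two
uneven = list(partition(["A", "B", "C", "D", "E"], 2))
print(uneven)
```

The last tuple holds the single remaining element.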

#### A.2 Sets
In Python, we define a [set](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) as an unordered collection of
hashable objects that contains no duplicate elements. The Pythonic way
to create a set of elements is:

In [6]:
emptySet = set()
emptySet.add("A")
randomElement = emptySet.pop()
print(randomElement)

A


Python offers the following operations from set theory in order to operate with sets.

* **Difference**: The *set.difference()* method takes other set instances as parameters and returns a new set with the elements found
in the first set, but not in the ones passed as parameters. Elements
that exist in the sets passed as parameters but not in the first set are not
included.
* **Symmetric Difference**: The *set.symmetric_difference()* method takes
a single set as a parameter and returns a new set with the elements that are
contained in one set or the other, but not in both.
* **Intersection**: The *set.intersection()* method takes other set instances
as parameters and returns a new set containing the elements that are found in all of the sets.
* **Union**: The *set.union()* method takes other set instances as parameters and returns
a new set that contains the elements found in any of the sets.
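The four operations can be sketched as follows; the sets `a`, `b` and `c` are arbitrary examples chosen for illustration:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}
c = {4, 6}

diff = a.difference(b, c)         # in a, but neither in b nor in c
sym = a.symmetric_difference(b)   # in a or in b, but not in both
inter = a.intersection(b, c)      # present in a, b and c
uni = a.union(b, c)               # present in at least one of the sets

# Sets are unordered, so we sort them to get a stable printout
print(sorted(diff), sorted(sym), sorted(inter), sorted(uni))
```

None of these methods modifies the original sets; each call returns a new set.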




#### A.3 Dictionary
A [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) is a data structure that maps keys to values. A dictionary cannot contain duplicate keys; each key can map to at most one value (i.e., an object, which in general can be of arbitrary complexity).

In Python a dictionary is used as follows:



In [7]:
emptyDictionary = dict()
emptyDictionary["k1"]="A"
emptyDictionary["k2"]="B"
AandB = list(emptyDictionary.values())
print(AandB)

['A', 'B']
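As a brief sketch of the "no duplicate keys" rule (the dictionary `stock` and its contents are made up for illustration), assigning to an existing key replaces its value instead of adding a second entry:

```python
stock = {"apples": 3, "pears": 5}

stock["apples"] = 10               # existing key: the value is overwritten
missing = stock.get("kiwis", 0)    # get() returns a default instead of raising KeyError

# items() yields (key, value) pairs in insertion order
pairs = [(key, value) for key, value in stock.items()]
print(pairs, missing)
```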


### B. JSON manipulation with Python
JSON (JavaScript Object Notation) is a lightweight data-interchange format (https://www.json.org/). It is easy for humans to read and write. It is easy for machines to parse and generate. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers.

JSON is built on two structures:

*   A collection of key/value pairs. In various programming languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or map (as seen above).
*   An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures, which can be nested for creating more complex structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages is also based on these structures.

There are several Python libraries that can be used to parse a JSON string. The standard one is simply called [json](https://docs.python.org/3/library/json.html#module-json). You can check several examples of how to use this library on the same page.

One advantage that Python's dynamic typing brings us when parsing JSON is that the same variable can hold values of different types, so no type declarations or casts are needed for each field.

In [8]:
import json
jsonObject = json.loads('{"bar" : 1 , "bas" : "value" ,"bat" : [null, 2.0]}')
variable = jsonObject.get("bar")
print(variable) #prints 1
variable = jsonObject.get("bas")
print(variable) #prints value
variable = jsonObject.get("bat")
print(variable) # prints [None, 2.0]
print(variable[0]) # prints None
print(variable[1]) # prints 2.0

1
value
[None, 2.0]
None
2.0
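The following sketch shows which Python types json.loads() produces for each kind of JSON value, and how json.dumps() turns the result back into text. The extra "nested" field is an assumption added only to show that the two structures can be nested:

```python
import json

parsed = json.loads(
    '{"bar": 1, "bas": "value", "bat": [null, 2.0], "nested": {"ok": true}}'
)

# JSON objects become dicts, arrays become lists, null becomes None
types = {key: type(value).__name__ for key, value in parsed.items()}
print(types)

# Serialise back to text and parse again: the content is preserved
text = json.dumps(parsed, indent=2, sort_keys=True)
roundtrip = json.loads(text)
print(roundtrip == parsed)
```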


### C. List Comprehension
Python has a particular way of dealing with lists, called list comprehension. A list comprehension is a one-line form of a for loop, used both to create lists from scratch and to perform operations on each element of a list. For example, we can create a list with all the integers between 0 and 9:



In [9]:
aList = [x for x in range (10)]
print(aList)
#prints [ 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 ]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


We are not limited to a single for clause: we can combine several of them with if conditions in order to create more complex lists. In this example, we're creating a list of tuples from the elements of two lists, with the condition that both elements cannot be equal:

In [10]:
tList = [(x, y) for x in [1, 2, 3] for y in [ 3, 4 ] if x != y ]
print( tList )
#prints [ (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

[(1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
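To make the correspondence with a for loop explicit, here is a sketch that builds the same list with nested loops (the name `loopList` is only illustrative). The for clauses of the comprehension are read from left to right, as the loops are nested from the outside in:

```python
tList = [(x, y) for x in [1, 2, 3] for y in [3, 4] if x != y]

# Equivalent explicit loops: the first for clause is the outer loop
loopList = []
for x in [1, 2, 3]:
    for y in [3, 4]:
        if x != y:
            loopList.append((x, y))

print(loopList == tList)
```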


If we want to apply an operation to all the elements of a list using list comprehension, it can be done as in this example. We take all the numbers in a list, round them to two decimals and convert them to strings:

In [11]:
nList = [1.2384, 4.972638, 7.826344]
rnList = [ str(round(number, 2)) for number in nList ]
print(rnList)
#prints[ '1.24' , '4.97', '7.83' ]

['1.24', '4.97', '7.83']


### D. Lambda expressions
Lambda expressions, also known as anonymous functions, allow us to write quick throwaway (single-use) functions without naming them. Here is a regular function that tests whether its parameter is even:

In [12]:
def isEven(x):
  return x % 2 == 0
result = isEven(4)
print(result)

True


Now, here is the same function implemented as a lambda expression:


In [13]:
isEvenLambda = lambda x: x%2 == 0
result = isEvenLambda(4)
print(result)

True


In other languages such as Java, lambda expressions are the main way to perform changes while iterating over collections. In Python this purpose is mainly fulfilled by list comprehension, but lambda expressions can still be useful for manipulating lists using the map() and filter() functions. For instance, the following code filters a list to only contain the even elements of the original one:

In [14]:
isEvenLambda = lambda x: x%2 == 0
aList = [x for x in range(10)]
evenList = list(filter(isEvenLambda,aList))
print(evenList)

[0, 2, 4, 6, 8]
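The map() function works analogously, applying the lambda to every element instead of selecting elements. A minimal sketch (the names `square`, `squares` and `evenSquares` are illustrative), with the list comprehension equivalent for comparison:

```python
square = lambda x: x * x
isEvenLambda = lambda x: x % 2 == 0
aList = [x for x in range(10)]

# map() applies the lambda to each element; list() materialises the result
squares = list(map(square, aList))

# filter() and map() combined, next to the equivalent list comprehension
evenSquares = list(map(square, filter(isEvenLambda, aList)))
evenSquaresLC = [x * x for x in aList if x % 2 == 0]
print(squares)
print(evenSquares == evenSquaresLC)
```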


You may have noticed that we have used both list comprehension and lambda expressions here, taking advantage of the strengths of each. Note that in the lambda expression the variable x is just a name we decide to give to each number to be evaluated. However, Python presents some disadvantages compared to Java when using lambda expressions:


*   They can only contain a single expression. If you need
to perform several actions you have to encapsulate them in a function.
*   There is no equivalent of the stream interface in Python, so it's not easy to
chain different collection-processing steps in the same manner as in other
languages.
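Both limitations can be sketched with the standard library alone. The function `describe` and the variable names below are hypothetical, and functools.reduce is used here only as a plain-Python stand-in for a reduce step:

```python
from functools import reduce

def describe(x):
    # Several statements, so this has to be a named function, not a lambda
    parity = "even" if x % 2 == 0 else "odd"
    return "{} is {}".format(x, parity)

numbers = range(1, 6)

# Without a stream interface, chained steps are written inside-out:
# filter first, then map, then reduce
sumOfEvenSquares = reduce(
    lambda acc, x: acc + x,
    map(lambda x: x * x, filter(lambda x: x % 2 == 0, numbers)),
    0,
)

labels = list(map(describe, numbers))
print(sumOfEvenSquares, labels)
```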

However, when working with Spark in Python, lambda functions become really useful, like their Java counterparts, as they are passed to the map, reduce or filter operations. You'll probably end up using list comprehension for creating or transforming lists, and lambda expressions when using Spark.



### E. Regular expressions
A regular expression (regex) is a text string that describes a search pattern in strings. In this section, we explain how to work with them in Python. If you are not familiar with regular expressions and their syntax, we encourage you to check the summary at https://www.regular-expressions.info/quickstart.html

The standard module for regex is *re*, which provides, among others, the following
classes:


*   A **Pattern** object is a compiled representation of a regular expression. The **Pattern** class provides no public constructors. To create a pattern, you must invoke the *compile* function of module *re*, which returns a **Pattern** object. This function accepts a regular expression as its first argument and optional behaviour flags as its second one.
*   A **Match** object is the result given by a match or search operation
performed on a **Pattern** object.

The **Pattern** object has, among others, the following useful methods. Note that some of them have additional parameters to define the starting and ending positions in the input string.


*   The *search* method scans the input string for the first location where the defined pattern matches.
*   The *match* method tries to match the beginning of the input string to
the defined pattern.
*   The *fullmatch* method tries to match the whole input string to the defined
pattern.
*   The *findall* method returns a list with all the non-overlapping matches of the defined pattern in the input string.
*   The *finditer* method is similar to *findall* but returns an iterator over **Match** objects.

A **Match** is returned by *search*, *match* and *fullmatch* when the pattern matches; otherwise, *None* is returned. The *finditer* method yields one **Match** per occurrence. A **Match** object also has a set of methods to get the start and end positions of the match and the matching groups, among others. Check the documentation at https://docs.python.org/3/library/re.html#match-objects for reference.

The following program depicts an example of using such methods:







In [15]:
import re
# Input text to be searched with the regex pattern
# (named text to avoid shadowing the built-in input() function)
text = "This is an apple . These are 33 ( thirty-three) apples"
# Regex to be matched
regex = "Th"
# Step 1 : Compile the regex into a Pattern object
pattern = re.compile(regex)
#pattern = re.compile(regex, re.IGNORECASE) #case-insensitive
# Step 2 : Use the pattern to perform the matching and process
# the matching result
# Using finditer(): one Match object per occurrence
for match in pattern.finditer(text):
  print('findIter() found the pattern \"%s \" starting at \
  index %02d and ending at index %02d ' %\
  (match.group(0), match.start(), match.end()))

# Using fullmatch(): the whole text must match the pattern
match = pattern.fullmatch(text)
if match:
  print('fullmatch() found the pattern \"%s \" starting at \
  index %02d and ending at index %02d ' %\
  (match.group(0), match.start(), match.end()))
else:
  print("fullmatch() found nothing")

# Using match(): the pattern must match at the beginning of the text
match = pattern.match(text)
if match:
  print('match() found the pattern \"%s \" starting at \
  index %02d and ending at index %02d ' %\
  (match.group(0), match.start(), match.end()))
else:
  print("match() found nothing")

findIter() found the pattern "Th " starting at   index 00 and ending at index 02 
findIter() found the pattern "Th " starting at   index 19 and ending at index 21 
fullmatch() found nothing
match() found the pattern "Th " starting at   index 00 and ending at index 02 


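The remaining methods and the capture groups of a **Match** can be sketched as follows. The patterns and variable names are illustrative choices, not part of the original exercise:

```python
import re

text = "This is an apple. These are 33 (thirty-three) apples"

# search() finds the first match anywhere; parentheses define groups
number = re.compile(r"(\d+) \((\w+)-(\w+)\)")
m = number.search(text)
matched = text[m.start():m.end()]   # same as m.group(0)
digits, word1, word2 = m.groups()

# findall() returns the matched strings directly (no Match objects)
fruits = re.compile(r"apples?").findall(text)

# The IGNORECASE flag makes "th" match both "Th" and "th"
anyCase = re.compile("th", re.IGNORECASE).findall(text)

# match() only succeeds at the beginning of the string
atStart = re.compile("apple").match(text)

print(matched, digits, word1, word2)
print(fruits, anyCase, atStart)
```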




## In-class Practice

Now, we ask you to perform some data processing where the previously described techniques might come in handy. Specifically, we will process data related to the Nobel prizes as obtained from their API (https://nobelprize.readme.io/docs/getting-started).
We provide you with the following three JSON files corresponding to the prize, laureate and country APIs. You can uncomment any of the print lines to see the structure of the JSON file for your reference:

In [76]:
#!wget --no-check-certificate -O "country.json" "https://fpc-git.upc.es/upcschool-lab/dataset-repository/-/raw/main/AdvancedPython/country.json"
#!wget --no-check-certificate -O "laureate.json" "https://fpc-git.upc.es/upcschool-lab/dataset-repository/-/raw/main/AdvancedPython/laureate.json"
#!wget --no-check-certificate -O "prize.json" "https://fpc-git.upc.es/upcschool-lab/dataset-repository/-/raw/main/AdvancedPython/prize.json"

import json

# Read each file as text; the with statement closes the file automatically.
# The raw content is kept because the exercises below parse it themselves.
with open("country.json", 'r') as countryjsonfile:
  countryjsoncontent = countryjsonfile.read()
countriesContent = json.loads(countryjsoncontent)
#print(json.dumps(countriesContent, indent=4))

with open("laureate.json", 'r') as laureatejsonfile:
  laureatejsoncontent = laureatejsonfile.read()
laureatesContent = json.loads(laureatejsoncontent)
print(json.dumps(laureatesContent, indent=4))

with open("prize.json", 'r') as prizejsonfile:
  prizejsoncontent = prizejsonfile.read()
prizesContent = json.loads(prizejsoncontent)
#print(json.dumps(prizesContent, indent=4))

{
    "laureates": [
        {
            "id": "1",
            "firstname": "Wilhelm Conrad",
            "surname": "R\u00f6ntgen",
            "born": "1845-03-27",
            "died": "1923-02-10",
            "bornCountry": "Prussia (now Germany)",
            "bornCountryCode": "DE",
            "bornCity": "Lennep (now Remscheid)",
            "diedCountry": "Germany",
            "diedCountryCode": "DE",
            "diedCity": "Munich",
            "gender": "male",
            "prizes": [
                {
                    "year": "1901",
                    "category": "physics",
                    "share": "1",
                    "motivation": "\"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him\"",
                    "affiliations": [
                        {
                            "name": "Munich University",
                            "city": "Munich",
                         



### Exercise 1
Print the code and name for each country. If there are multiple country names for the same code, print all the names together as a list. An excerpt of your output should look like the following:


```
PT -- [Portugal]
LR -- [Liberia]
DK -- [Denmark, Sleswick, then Denmark, Sleswick]
LT -- [Lithuania, Russian Empire (now Lithuania), Poland (now Lithuania)]
LU -- [Luxembourg]
```


In [48]:

def allCountries(countriesJsonContent):
  countries = json.loads(countriesJsonContent)['countries']
  # unique country codes (a set has no defined order)
  codes = set(x['code'] for x in countries)
  # we get a mapping between codes and all possible names linked to a code
  mapping = [(x, [y['name'] for y in countries if y['code'] == x]) for x in codes]
  # format each entry as "CODE -- [name1, name2, ...]"; joining the names
  # directly keeps any apostrophes they may contain
  output = ['{} -- [{}]'.format(x, ', '.join(y)) for x, y in mapping]
  print('\n'.join(output))

allCountries(countryjsoncontent)

 -- [Southern Rhodesia]
MX -- [Mexico]
NG -- [Nigeria]
HU -- [Hungary, Austria-Hungary, Austria-Hungary (now Hungary)]
SE -- [Sweden]
TL -- [East Timor]
CH -- [Switzerland]
KE -- [Kenya]
CA -- [Canada]
PT -- [Portugal]
VN -- [Democratic Republic of Vietnam, Vietnam]
CR -- [Costa Rica]
FI -- [Finland, Russian Empire (now Finland)]
SI -- [Austria-Hungary (now Slovenia)]
KP -- [Korea]
KR -- [South Korea, Korea (now South Korea)]
UK -- [United Kingdom]
ID -- [Java, Dutch East Indies (now Indonesia)]
GP -- [Guadeloupe Island]
BG -- [Bulgaria]
FR -- [France, Germany (now France)]
DK -- [Denmark, Sleswick, then Denmark, Sleswick]
GT -- [Guatemala]
CY -- [Cyprus]
GR -- [Greece, Crete (now Greece)]
AR -- [Argentina]
GB -- [Scotland]
DZ -- [Algeria]
PL -- [Poland, Poland, then Russian Empire, Poland, Russian Empire, Prussia, Prussia (now Poland), Germany (now Poland), Austria-Hungary (now Poland), German-occupied Poland, Russian Empire (now Poland), Free City of Danzig (now Poland)]
TN -- [Tunis


### Exercise 2
For each year, print the total number of laureates that got a Nobel prize. An excerpt of your output should look like the following:


```
2015 -- 11
2016 -- 11
2017 -- 12
```



In [75]:

def amountPrizesPerYear(prizesJsonContent):
  prizes = json.loads(prizesJsonContent)['prizes']

  # we accumulate the number of laureates per year;
  # there is one prize entry per year and category with winners
  laureatesPerYear = {}
  for prize in prizes:
    year = prize['year']
    laureatesPerYear[year] = laureatesPerYear.get(year, 0) + len(prize['laureates'])

  # we print one "year -- count" line per year, in chronological order
  for year in sorted(laureatesPerYear):
    print(' -- '.join([year, str(laureatesPerYear[year])]))

amountPrizesPerYear(prizejsoncontent)

1901 -- 6
1902 -- 7
1903 -- 7
1904 -- 6
1905 -- 5
1906 -- 6
1907 -- 6
1908 -- 7
1909 -- 7
1910 -- 5
1911 -- 6
1912 -- 6
1913 -- 5
1914 -- 3
1915 -- 4
1916 -- 1
1917 -- 4
1918 -- 2
1919 -- 4
1920 -- 5
1921 -- 5
1922 -- 6
1923 -- 5
1924 -- 3
1925 -- 6
1926 -- 6
1927 -- 7
1928 -- 4
1929 -- 7
1930 -- 5
1931 -- 6
1932 -- 5
1933 -- 5
1934 -- 6
1935 -- 5
1936 -- 7
1937 -- 7
1938 -- 5
1939 -- 5
1943 -- 4
1944 -- 6
1945 -- 7
1946 -- 8
1947 -- 8
1948 -- 4
1949 -- 6
1950 -- 8
1951 -- 7
1952 -- 7
1953 -- 6
1954 -- 8
1955 -- 5
1956 -- 9
1957 -- 6
1958 -- 9
1959 -- 7
1960 -- 6
1961 -- 6
1962 -- 8
1963 -- 11
1964 -- 8
1965 -- 9
1966 -- 6
1967 -- 8
1968 -- 7
1969 -- 10
1970 -- 9
1971 -- 6
1972 -- 11
1973 -- 12
1974 -- 12
1975 -- 12
1976 -- 9
1977 -- 11
1978 -- 11
1979 -- 11
1980 -- 11
1981 -- 11
1982 -- 9
1983 -- 7
1984 -- 9
1985 -- 8
1986 -- 11
1987 -- 9
1988 -- 12
1989 -- 10
1990 -- 11
1991 -- 7
1992 -- 7
1993 -- 11
1994 -- 12
1995 -- 12
1996 -- 13
1997 -- 12
1998 -- 12
1999 -- 7
2000 -- 13
2001 -- 

### Exercise 3
Print the name and date of birth of those laureates that got a Nobel prize motivated by their work on DNA. An excerpt of your output should look like the following:

```
Berg, Paul -- 1926-06-30
Lindahl, Tomas -- 1938-01-28
Modrich, Paul -- 1946-06-13
Sancar, Aziz -- 1946-09-08
```

In [109]:
import re


def birthDateDNANobel(laureatesJsonContent):
  laureates = json.loads(laureatesJsonContent)['laureates']

  # we keep only the laureates that have both a surname and a first name
  laureates_clean = [l for l in laureates if 'surname' in l and 'firstname' in l]

  # we extract the prizes list as a text since all we care is that any of their
  # prizes is related to their work on DNA
  prizes = [([(x['surname'], x['firstname']), x['born']], str(x['prizes'])) for
            x in laureates_clean]

  # we now apply a regex to keep all the items that have 'DNA' in their content
  pattern = re.compile("DNA")
  dna_laureates = [x for x, s in prizes if pattern.search(s)]

  # we format each laureate as "surname, firstname  --  birth date"
  dna_laureates = [', '.join(x) + '  --  ' + y for x, y in dna_laureates]
  print('\n'.join(dna_laureates))

birthDateDNANobel(laureatejsoncontent)



Berg, Paul  --  1926-06-30
Mullis, Kary B.  --  1944-12-28
Smith, Michael  --  1932-04-26
Lindahl, Tomas  --  1938-01-28
Modrich, Paul  --  1946-06-13
Sancar, Aziz  --  1946-09-08
