# Statistics and Data Science: Exercises library

## Data normalization with Euclidean norm

A very common operation is to transform your data by normalization. Imagine you have a list of data points $x=$`[21.4,45.7,38.5,76.4,61.9,43.4,52.6,27.2]` and you want to normalize your data using the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm), i.e., rescale the data so that the resulting vector has length 1 (for non-negative data such as ours, every value then lies between 0 and 1), with the following operation:

$\hat{x}_{i} = \frac{x_{i}}{||x||}$

where: $||x||=\sqrt{x_1^2+\dots+x_n^2}$

Normalization is common, and often necessary, when you deal with several variables that have very different scales.

- Using list comprehension, create a new list with normalized $x$ using the Euclidean norm. 

In [5]:
x = [21.4,45.7,38.5,76.4,61.9,43.4,52.6,27.2]

# We first compute the square of each element using comprehension
x_square = [i**2 for i in x]

# Then we compute the Euclidean norm
x_norm = (sum(x_square))**(1/2)

# Finally we normalize our list
x_hat = [i/x_norm for i in x]

print(x_hat)

[0.15489594360747555, 0.33078245901222586, 0.27866793592933686, 0.5529929949350997, 0.44804013594872605, 0.3141347641385252, 0.3807255436333278, 0.19687708720202501]
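
As a quick sanity check on the formula above, the normalized list should itself have a Euclidean norm of 1. The sketch below recomputes the norm with `math.sqrt` (an alternative to `**(1/2)`, not the method used in the solution) and compares it to 1 with `math.isclose`; the name `norm_of_x_hat` is only illustrative.

```python
import math

x = [21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2]

# Euclidean norm: square root of the sum of squares
x_norm = math.sqrt(sum(i**2 for i in x))

# Normalize with a list comprehension
x_hat = [i / x_norm for i in x]

# The normalized list should have a Euclidean norm of 1 (up to rounding error)
norm_of_x_hat = math.sqrt(sum(i**2 for i in x_hat))
print(math.isclose(norm_of_x_hat, 1.0))
```

Comparing with `math.isclose` rather than `==` avoids false alarms caused by floating-point rounding.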


## Data cleaning with comprehension

Suppose we have the following list: $x=$`[21.4, 'NaN', 45.7,38.5,76.4,61.9, 'NaN', 43.4,52.6,27.2]`. Unfortunately, it contains some `'NaN'` (Not a Number) entries, stored here as strings.

- Clean your list by dropping the `'NaN'` values, using list comprehension.

In [7]:
x=[21.4, 'NaN', 45.7,38.5,76.4,61.9, 'NaN', 43.4,52.6,27.2]

x_clean = [i for i in x if i != 'NaN']

print(x_clean)

[21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2]
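
The comparison `i != 'NaN'` works here because the missing values are stored as the string `'NaN'`. If they were floating-point NaN values instead (an assumption, this is not the case in the exercise), the comparison would not filter anything, since a float NaN is never equal to anything. A minimal sketch for that hypothetical situation, using `math.isnan` on a made-up list `x_float`:

```python
import math

# Hypothetical variant: missing values are float('nan'), not the string 'NaN'
x_float = [21.4, float('nan'), 45.7, 38.5, float('nan'), 43.4]

# math.isnan is used because nan == nan is always False
x_float_clean = [i for i in x_float if not math.isnan(i)]

print(x_float_clean)  # [21.4, 45.7, 38.5, 43.4]
```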


## Data manipulation using dictionary comprehension

Comprehensions are not only for lists; they work with dictionaries too! Suppose you have the following dictionary, with the grades of some students on a 0-100 scale:

`{'Adam': 72, 'Elena': 91, 'Xiang': 87, 'Julie': 81, 'Takafumi': 79}`

- Use dictionary comprehension to convert the grades from the 0-100 scale to the Swiss 0-6 scale.
- Use dictionary comprehension to round each grade to the nearest 0.25 (for instance, 4.2 should be converted to 4.25).

Tip: you can use the `round()` function.

In [17]:
grade_100 = {'Adam': 72, 'Elena': 91, 'Xiang': 87, 'Julie': 81, 'Takafumi': 79}

# We convert the grade dividing the value by 100 and multiplying by 6
grade_6 = {key: val*6/100 for key,val in grade_100.items()}

# We round the grade
grade_6_rounded = {key: round(val*4)/4 for key,val in grade_6.items()}

print(grade_6)
print(grade_6_rounded)

{'Adam': 4.32, 'Elena': 5.46, 'Xiang': 5.22, 'Julie': 4.86, 'Takafumi': 4.74}
{'Adam': 4.25, 'Elena': 5.5, 'Xiang': 5.25, 'Julie': 4.75, 'Takafumi': 4.75}
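
One detail of `round()` is worth knowing: in Python 3 it rounds exact ties to the nearest even integer. None of the grades above falls exactly halfway between two quarter points, so the results are unaffected, but a grade such as 4.125 would be rounded down to 4.0. The sketch below contrasts the approach used above with a hypothetical "ties round up" variant; both function names are made up for this illustration.

```python
import math

def round_to_quarter(grade):
    """Round to the nearest 0.25 with round(); exact ties go to the even multiple."""
    return round(grade * 4) / 4

def round_to_quarter_half_up(grade):
    """Hypothetical alternative: exact ties always round up."""
    return math.floor(grade * 4 + 0.5) / 4

print(round_to_quarter(4.125), round_to_quarter_half_up(4.125))  # 4.0 4.25
print(round_to_quarter(4.2), round_to_quarter_half_up(4.2))      # 4.25 4.25
```

Which behavior is appropriate depends on the grading rules; the exercise itself does not specify how ties should be handled.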


## Green Bonds

You have a list of green bond identifiers:
`gb_ID = ['CH843556=S', 'CH843556=', 'CH868037=', 'CH6YT=RR', 'CH30YT=RR', 'CH975519=', 'CH1580323=', 'CH1580323=S', 'CH2452496=S']`

1. Create a new list with the elements of `gb_ID`, removing the `'='` sign and everything after it. For instance, `'CH843556=S'` should become `'CH843556'`.
2. Create a new list containing only the elements of `gb_ID` that have nothing after the `'='` sign, i.e., we disregard elements such as `'CH843556=S'`.

Hints: 
- You can use list comprehension inside another list comprehension.
- For the second question, you could use regular expressions ([RegEx](https://docs.python.org/3/library/re.html)). See also this [tutorial](https://www.w3schools.com/python/python_regex.asp).

### Question 1

In [9]:
gb_ID = ['CH843556=S', 'CH843556=', 'CH868037=', 'CH6YT=RR', 'CH30YT=RR', 'CH975519=', 'CH1580323=', 'CH1580323=S', 'CH2452496=S']

# We split each element of our list at the '=' character, and only keep the left part, indexed by 0:
gb_clean = [i.split('=')[0] for i in gb_ID]

print(gb_clean)

['CH843556', 'CH843556', 'CH868037', 'CH6YT', 'CH30YT', 'CH975519', 'CH1580323', 'CH1580323', 'CH2452496']
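
As an alternative sketch (not the method used above), `str.partition` splits a string at the first occurrence of a separator and always returns three parts: the text before, the separator, and the text after. The example uses a shortened copy of the list; the names `gb_sample`, `gb_parts` and `gb_heads` are only illustrative.

```python
# Shortened copy of the identifier list, for illustration
gb_sample = ['CH843556=S', 'CH843556=', 'CH6YT=RR']

# partition returns a tuple: (text before '=', '=', text after '=')
gb_parts = [i.partition('=') for i in gb_sample]

# Keep only the part before the '=' sign
gb_heads = [head for head, _, _ in gb_parts]

print(gb_heads)  # ['CH843556', 'CH843556', 'CH6YT']
```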


### Question 2

We'll see two methods:
- a list comprehension using string indexing
- a list comprehension using RegEx

In [11]:
# We use the fact that the last character is indexed by '-1'
gb_clean_2 = [i for i in gb_ID if i[-1]=='=']

print(gb_clean_2)

['CH843556=', 'CH868037=', 'CH975519=', 'CH1580323=']


In [12]:
import re

# The idea is the same as before: we keep the elements that end with '='
# '$' anchors the match at the end of the string
gb_clean_2 = [i for i in gb_ID if re.search(r'=$', i)]

print(gb_clean_2)

['CH843556=', 'CH868037=', 'CH975519=', 'CH1580323=']
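
A third option, shown here only as a sketch, is the string method `str.endswith`. Unlike indexing with `i[-1]`, it does not raise an `IndexError` on an empty string. The empty identifier in the list below is a hypothetical addition to illustrate that point; it is not part of the original data.

```python
# Shortened list; the final '' is a hypothetical empty identifier
gb_sample = ['CH843556=S', 'CH843556=', 'CH868037=', 'CH6YT=RR', '']

# endswith returns False for '' instead of raising an error
gb_ends_with_equal = [i for i in gb_sample if i.endswith('=')]

print(gb_ends_with_equal)  # ['CH843556=', 'CH868037=']
```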


## Optimizing a recursive function

During the lecture, we defined a function to calculate Fibonacci numbers:

$F(0)=0$

$F(1)=1$

$F(n)=F(n-1)+F(n-2)$

However, our function was not efficient because it repeated operations. For example, to compute $F(5)$, we needed $F(4)$ and $F(3)$, but to compute $F(4)$ we again needed $F(3)$ and $F(2)$, and so on. Since the Fibonacci numbers were not stored in memory, the function recomputed the same subproblems over and over again.

- Design a function that calculates Fibonacci numbers and solves the repetition issue.
- Create a list of the first 12 Fibonacci numbers.

Hint: you can use a dictionary

In [19]:
fibonacci_number = {0: 0, 1: 1}  #Initial values

def fibonacci_of(n):
    """This function computes Fibonacci numbers, 
    storing in a dictionary the previous calculations"""
    if n in fibonacci_number:  # If already stored in memory
        return fibonacci_number[n]
    else: 
        fibonacci_number[n] = fibonacci_of(n - 1) + fibonacci_of(n - 2)  # Else, recursive case
        return fibonacci_number[n]

[fibonacci_of(n) for n in range(12)]

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
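
The dictionary above is a hand-written cache. The standard library offers the same idea through the `functools.lru_cache` decorator, which stores the results of previous calls automatically. The sketch below is an alternative to the solution above, not a replacement for it; the names `fibonacci_cached` and `first_12` are made up for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # maxsize=None: keep every computed value
def fibonacci_cached(n):
    """Compute Fibonacci numbers; repeated calls are served from the cache."""
    if n < 2:  # Base cases: F(0) = 0 and F(1) = 1
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

first_12 = [fibonacci_cached(n) for n in range(12)]
print(first_12)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```

With the decorator, the function body stays identical to the mathematical definition, while the explicit dictionary makes the stored values easier to inspect.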

## Book information

We have some information about two books:

`(
Title = 'Sapiens: A Brief History of Humankind', 
Author = 'Yuval Noah Harari',
Year = 2011,
Language = 'Hebrew',
ISBN = '978-0062316097')`

`(
Title = 'Les Racines du ciel',
Author = 'Romain Gary',
Year = 1956,
Publisher = 'Gallimard'
)`

As you can see, the information we have differs.

- Write a function that prints, for each key: 'The (key) is (value).'. The key should be in lowercase, except for ISBN.
- Call your function with our two books.

For instance, the output for the second book should look like this:

The title is Les Racines du ciel.
The author is Romain Gary.
The year is 1956.
The publisher is Gallimard.

Hint: Try to use arbitrary keyword arguments (`**kwargs`) and the `str.format()` method.

In [1]:
def book_info(**book):
    for key, value in book.items():
        if key == 'ISBN':
            print("The {} is {}.".format(key,value))
        else:
            print("The {} is {}.".format(key.lower(),value))
    # Print a blank line to separate the books
    print()

book_info(Title = 'Sapiens: A Brief History of Humankind', 
          Author = 'Yuval Noah Harari',
          Year = 2011,
          Language = 'Hebrew',
          ISBN = '978-0062316097')

book_info(Title = 'Les Racines du ciel',
          Author= 'Romain Gary',
          Year = 1956,
          Publisher = 'Gallimard')

The title is Sapiens: A Brief History of Humankind.
The author is Yuval Noah Harari.
The year is 2011.
The language is Hebrew.
The ISBN is 978-0062316097.

The title is Les Racines du ciel.
The author is Romain Gary.
The year is 1956.
The publisher is Gallimard.
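
If the book information is already stored in a dictionary, it can be passed to a `**`-style function by unpacking it with `**` at the call site. The sketch below assumes a hypothetical variant, `book_info_lines`, that returns the sentences as a list instead of printing them, and builds them with an f-string rather than `str.format()`.

```python
def book_info_lines(**book):
    """Hypothetical variant: return the sentences instead of printing them."""
    return [
        # Keep 'ISBN' as is; lowercase every other key
        f"The {key if key == 'ISBN' else key.lower()} is {value}."
        for key, value in book.items()
    ]

book = {'Title': 'Les Racines du ciel',
        'Author': 'Romain Gary',
        'Year': 1956,
        'Publisher': 'Gallimard'}

# ** unpacks the dictionary into keyword arguments
for line in book_info_lines(**book):
    print(line)
```

Returning the lines makes the function easier to reuse, for example to write the sentences to a file rather than to the screen.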

