<div class="frontmatter text-center">
<h1> MATH5027 Scientific Python</h1>
<h3>Central European University, Fall 2017/2018</h3>
<h3>Instructor: Prof. Roberta Sinatra, TA: Johannes Wachs</h3>
Inspired by the Python lectures of J.R. Johansson, available at [http://github.com/jrjohansson/scientific-python-lectures](http://github.com/jrjohansson/scientific-python-lectures).
</div>

# Today
We will:
* learn an important feature of lists: comprehensions
* do more exercises with lists and strings
* do one more exercise with web scraping

When you iterate over a list, as we did before, you get the elements of the list, not their indices!

In [16]:
words=["scientific", "computing", "with", "python"]
for word in words:
    print(word)

scientific
computing
with
python


If you also want the indices of the elements, you need to use `enumerate`:

In [17]:
words=["scientific", "computing", "with", "python"]
for idx, word in enumerate(words):
    print(idx,word)

0 scientific
1 computing
2 with
3 python


In [18]:
type(enumerate(words))

enumerate

In [19]:
for x in range(10): # by default, range starts at 0
    print(x)

0
1
2
3
4
5
6
7
8
9


In [11]:
type(range(10))

range

`range` and `enumerate` return objects of their own type, but you can easily convert them to a list and to a list of tuples, respectively:

In [20]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [13]:
list(enumerate(range(-3,3)))

[(0, -3), (1, -2), (2, -1), (3, 0), (4, 1), (5, 2)]

In [15]:
words=["scientific", "computing", "with", "python"]
list(enumerate(words))

[(0, 'scientific'), (1, 'computing'), (2, 'with'), (3, 'python')]
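
As a small aside, `enumerate` also accepts an optional `start` argument for the counter. The sketch below reuses the `words` list from above; the variable names `position` and `numbered` are only illustrative.

```python
words = ["scientific", "computing", "with", "python"]

# enumerate accepts an optional start value for the counter
for position, word in enumerate(words, start=1):
    print(position, word)

# range and enumerate produce their values on demand;
# list() collects them into an actual list
numbered = list(enumerate(words, start=1))
print(numbered)
```

Counting from 1 is often handy when the index is shown to a human reader rather than used to access the list.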

# Example of web scraping
Complete the exercises from last time (reported below).

## Exercise &#x1F4D8;
* Select all the annual and quarterly tickets
* How much do you save if you buy four quarterly tickets instead of one annual ticket (use the average price)?

## Exercise &#x1F4D7;
* If you look at the webpage, you'll notice that we missed some tickets, like the Suburban railway extension ticket. Write code to get that information too and merge it with the list of pairs we have already found.
* Some ticket types have been truncated, like "Pass certificate &#8211; genera" - it should be "Pass certificate &#8211; general". Can you handle this exception?

## List comprehensions: Creating lists using `for` loops:

A convenient and compact way to initialize lists is through a **list comprehension**. A list comprehension mimics the mathematical notation for defining sets. For example:
$$ L=\lbrace x^2 : x \in \lbrace 0, 1, 2, 3, 4\rbrace \rbrace.$$
This translates into:

In [32]:
L = [x**2 for x in range(0,5)]

print(L)


[0, 1, 4, 9, 16]


In [86]:
L=[]
for x in range(0,5):
    L.append(x**2)
print(L)

[0, 1, 4, 9, 16]


You can also combine it with conditional statements. For example:
$$S = \lbrace x : x \in L \text{ and } x > 0\rbrace.$$
This becomes:

In [34]:
S=[x for x in L if x>0]
print(S)

[1, 4, 9, 16]


More examples:
$$ M = \lbrace x : x \in S \text{ and } x \text{ even} \rbrace$$

In [35]:
M = [x for x in S if x % 2 == 0]  

**BTW** do you remember the operator %? If not, refresh it and have a look at the documentation!
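
As a quick refresher, here is a minimal sketch of what `%` does with integers. The list name `remainders` is just an illustrative choice.

```python
# % returns the remainder of an integer division
print(7 % 2)   # remainder of 7 divided by 2
print(8 % 2)   # a remainder of 0 means the number is even

# remainders of 0, 1, ..., 5 when divided by 3
remainders = [x % 3 for x in range(6)]
print(remainders)
```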

You can also combine two ```for``` loops together:

In [36]:
[(x,y) for x in [1,2] for y in [1,2]]

[(1, 1), (1, 2), (2, 1), (2, 2)]

This is equivalent to:
$$\lbrace (x,y) : x \in \lbrace 1, 2\rbrace, y \in \lbrace 1, 2\rbrace \rbrace$$

More examples of ```for``` loops and conditional statements together:

In [38]:
mylist1=[(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]
print(mylist1)

[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]


Equivalent to:
$$\lbrace (x,y) : x \in \lbrace 1, 2, 3\rbrace, y \in \lbrace 1, 3, 4\rbrace \text{ and } x\neq y \rbrace $$

The line above produces the same list as the block of code below:

In [39]:
mylist2=[]
for x in [1,2,3]:
    for y in [3,1,4]:
        if x!=y:
            mylist2.append((x,y))
mylist2

[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]

You can also use an ``if else`` conditional expression to decide which value is stored for each element:

In [40]:
[x[0]+x[1] if x[0]>x[1] else 'smaller' for x in mylist2]

['smaller', 'smaller', 'smaller', 3, 'smaller', 4, 'smaller']
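
The position of the `if` matters. The sketch below, on a made-up list `numbers`, contrasts the two forms: an `if` after the `for` filters elements out, while an `if ... else` before the `for` keeps every element and only chooses the value to store.

```python
numbers = [-2, -1, 0, 1, 2]

# Filtering: the `if` after the loop drops the elements that fail the test
positives = [x for x in numbers if x > 0]

# Conditional expression: `if ... else` before the loop keeps every
# element and only decides which value is stored for it
labels = ['positive' if x > 0 else 'not positive' for x in numbers]

print(positives)
print(labels)
```

The first list can be shorter than `numbers`. The second always has the same length as `numbers`.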

## Exercise  &#x1F4D8;
Use ```%timeit``` (see questions asked during the first class) to check the best time to create ```mylist1``` and ```%%timeit``` for creating ```mylist2```. Which one is faster? Any guess why? 

You can also nest one list comprehension inside another:

In [41]:
mylist1=[x+1 for x in [y**3 for y in [-3,1,4]] if x > 0]

In mathematical notation, the above reads:
$$M=\lbrace y^3 : y \in \lbrace -3, 1, 4\rbrace \rbrace$$
$$\text{mylist1}= \lbrace x+1 : x \in M \text{ and } x>0 \rbrace.$$
The code below also produces the same list:

In [42]:
mysecondlist=[]
for y in [-3,1,4]:
    temp=y**3
    mysecondlist.append(temp)
print(mysecondlist)

mylist=[]
for x in mysecondlist:
    if x>0:
        mylist.append(x+1)
print(mylist)

[-27, 1, 64]
[2, 65]


More examples of list comprehension:

In [43]:
strs = ['hello', 'and', 'goodbye']
shouting = [ s2.upper()+'!!!' for s2 in [s for s in strs] if s2=='hello']
print(shouting)

['HELLO!!!']


In [44]:
mylist=[]
for s in strs:
    mylist.append(s.upper()+'!!!')
print(mylist)

['HELLO!!!', 'AND!!!', 'GOODBYE!!!']


In [45]:
# Select fruits containing 'a'
fruits = ['apple', 'cherry', 'banana', 'lemon']
afruits = [ s for s in fruits if 'a' in s ]
print(afruits)

['apple', 'banana']


## Exercises  
&#x1F4D8;
* Select all the fruits that contain the letter 'n', and convert to uppercase.
* Using a list comprehension, create a new list called "newlist" out of the list "numbers", which contains only the positive numbers from the list, as integers: ``numbers=[34.6, -203.4, 44.9, 68.3, -12.2, 44.6, 12.7]``

&#x1F4D7;
* Using a list comprehension, create a list of integers which specify the length of each word in a certain text, but only if the word is not the word "the" or "and". Use as input text the following:

    _Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do. Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?" So she was considering in her own mind (as well as she could, for the day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her._
    
    _Hint_: first remove all punctuation; for inspiration, have a look [here](https://stackoverflow.com/questions/16050952/how-to-remove-all-the-punctuation-in-a-string-python).
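
One possible way to strip punctuation, sketched on a made-up sentence rather than on the exercise text, uses `str.translate` together with `string.punctuation` from the standard library. The names `sentence`, `table` and `cleaned` are illustrative.

```python
import string

# A made-up sentence, not the text of the exercise
sentence = 'Hello, world! Is this "punctuation"? Yes (it is).'

# str.maketrans('', '', chars) builds a table that deletes every character in chars
table = str.maketrans('', '', string.punctuation)
cleaned = sentence.translate(table)

print(cleaned)
print(cleaned.split())
```

Note that `string.punctuation` also contains the hyphen, so a word like "daisy-chain" would become "daisychain" with this approach.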

## Sorting lists

The easiest way to sort is with the sorted(list) function, which takes a list and returns a new list with those elements in sorted order. The original list is not changed.

In [46]:
a = [5, 1, 4, 3]
print(sorted(a))
print(a)

[1, 3, 4, 5]
[5, 1, 4, 3]
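
For contrast, lists also have a `sort()` method, which sorts the list in place and returns `None`. A minimal sketch of the difference, with illustrative variable names `b` and `result`:

```python
a = [5, 1, 4, 3]

b = sorted(a)      # builds a new list, a is left untouched
print(a, b)

result = a.sort()  # sorts a in place and returns None
print(a, result)
```

A common mistake is writing `a = a.sort()`, which replaces the list with `None`.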


The sorted() function can be customized through optional arguments. The optional argument reverse=True, e.g. sorted(list, reverse=True), makes it sort in reverse order.

In [47]:
mystrs = ['aa', 'BB', 'zz', 'CC']
print(sorted(mystrs))  ## Remember! Sorting is case sensitive
print(sorted(mystrs, reverse=True))

['BB', 'CC', 'aa', 'zz']
['zz', 'aa', 'CC', 'BB']


### You can do customized sorting
For more complex custom sorting, ``sorted()`` takes an optional ``"key="`` specifying a "key" function that transforms each element before comparison. The key function takes in 1 value and returns 1 value, and the returned "proxy" value is used for the comparisons within the sort.

For example, with a list of strings, specifying ``key=len`` (the built-in ``len()`` function) sorts the strings by length, from shortest to longest. The sort calls ``len()`` for each string to get the list of proxy length values, and then sorts with those proxy values.

In [48]:
strs = ['ccc', 'aaaa', 'd', 'bb']
print(sorted(strs, key=len))

['d', 'bb', 'ccc', 'aaaa']


As another example, specifying "str.lower" as the key function is a way to force the sorting to treat uppercase and lowercase the same:

In [49]:
print(sorted(strs, key=str.lower)) 

['aaaa', 'bb', 'ccc', 'd']
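
The effect of `key=str.lower` is easier to see on a list that mixes cases. The sketch below uses the same strings as the `mystrs` example above, stored in a new list called `mixed` for illustration.

```python
mixed = ['aa', 'BB', 'zz', 'CC']

# Default sorting: uppercase letters come before lowercase ones
case_sensitive = sorted(mixed)

# With key=str.lower the strings are compared as 'aa', 'bb', 'zz', 'cc',
# but the original strings are the ones returned
case_insensitive = sorted(mixed, key=str.lower)

print(case_sensitive)
print(case_insensitive)
```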


You can also pass in your own function as the key function. For example, the function MyFn below takes a string and returns its last letter. We then pass this function as the key for sorting:

In [50]:
strs = ['xc', 'zb', 'yd' ,'wa']

def MyFn(s):
    return s[-1]

print(sorted(strs, key=MyFn))

['wa', 'zb', 'xc', 'yd']
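
For short key functions, an anonymous `lambda` function is a common alternative to a named function. A minimal sketch that should give the same ordering as `MyFn` (the name `by_last_letter` is illustrative):

```python
strs = ['xc', 'zb', 'yd', 'wa']

# lambda s: s[-1] is an unnamed function returning the last letter of s
by_last_letter = sorted(strs, key=lambda s: s[-1])
print(by_last_letter)
```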


### Exercise &#x1F4D8;
* Create a different sorting function - invent one - that works with strings 


## Proficiency with lists and strings is fundamental if you want to be a good Pythonist, so let's do a few more exercises
### Exercise &#x1F4D7;
* Create the functions requested in each cell below. Once you are done, check the solution by following the instructions in the cell following the exercises. 

In [51]:
# Function match_ends
# Given a list of strings, return the count of the number of
# strings where the string length is 2 or more and the first
# and last chars of the string are the same.
# Note: in Python the operator to increase the count is +=

def match_ends(words):
  # +++your code here+++
  return

In [52]:
# Function front_x
# Given a list of strings, return a list with the strings
# in sorted order, except group all the strings that begin with 'x' first.
# e.g. ['mix', 'xyz', 'apple', 'xanadu', 'aardvark'] yields
# ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
# Hint: this can be done by making 2 lists and sorting each of them
# before combining them.
def front_x(words):
  # +++your code here+++
  return

In [53]:
# Function sort_last
# Given a list of non-empty tuples, return a list sorted in increasing
# order by the last element in each tuple.
# e.g. [(1, 7), (1, 3), (3, 4, 5), (2, 2)] yields
# [(2, 2), (1, 3), (3, 4, 5), (1, 7)]
# Hint: use a custom key= function to extract the last element from each tuple, as in one of the examples above
def sort_last(tuples):
  # +++your code here+++
  return

### To check the solution
* Open the file list1.py with a text editor (remember to use the Jupyter notebook text editor, or install a professional editor like Sublime - it's free).
* Copy your solutions into the appropriate places (you will see where once you open the file).
* Execute the file in your Jupyter notebook with the command ```%run list1.py```. Make sure that list1.py is in the same folder as your notebook!
* If you did everything correctly, ``%run list1.py`` will produce an output like the one below.

In [60]:
%run list1.py

match_ends
 OK  got: 3 expected: 3
 OK  got: 2 expected: 2
 OK  got: 1 expected: 1

front_x
 OK  got: ['xaa', 'xzz', 'axx', 'bbb', 'ccc'] expected: ['xaa', 'xzz', 'axx', 'bbb', 'ccc']
 OK  got: ['xaa', 'xcc', 'aaa', 'bbb', 'ccc'] expected: ['xaa', 'xcc', 'aaa', 'bbb', 'ccc']
 OK  got: ['xanadu', 'xyz', 'aardvark', 'apple', 'mix'] expected: ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']

sort_last
 OK  got: [(2, 1), (3, 2), (1, 3)] expected: [(2, 1), (3, 2), (1, 3)]
 OK  got: [(3, 1), (1, 2), (2, 3)] expected: [(3, 1), (1, 2), (2, 3)]
 OK  got: [(2, 2), (1, 3), (3, 4, 5), (1, 7)] expected: [(2, 2), (1, 3), (3, 4, 5), (1, 7)]


### Exercise &#x1F4D7;
* Do the exercises in the following cells. Check the solutions as explained for the previous exercise, by running the file ```string1.py```.

In [55]:
# donuts
# Given an int count of a number of donuts, return a string
# of the form 'Number of donuts: <count>', where <count> is the number
# passed in. However, if the count is 10 or more, then use the word 'many'
# instead of the actual count.
# So donuts(5) returns 'Number of donuts: 5'
# and donuts(23) returns 'Number of donuts: many'
def donuts(count):
  # +++your code here+++
  return

In [56]:
# both_ends
# Given a string s, return a string made of the first 2
# and the last 2 chars of the original string,
# so 'spring' yields 'spng'. However, if the string length
# is less than 2, return instead the empty string.
def both_ends(s):
  # +++your code here+++
  return

In [57]:
# fix_start
# Given a string s, return a string
# where all occurrences of its first char have
# been changed to '*', except do not change
# the first char itself.
# e.g. 'babble' yields 'ba**le'
# Assume that the string is length 1 or more.
# Hint: s.replace(stra, strb) returns a version of string s
# where all instances of stra have been replaced by strb.
def fix_start(s):
  # +++your code here+++
  return

In [58]:
# mix_up
# Given strings a and b, return a single string with a and b separated
# by a space '<a> <b>', except swap the first 2 chars of each string.
# e.g.
#   'mix', 'pod' -> 'pox mid'
#   'dog', 'dinner' -> 'dig donner'
# Assume a and b are length 2 or more.
def mix_up(a, b):
  # +++your code here+++
  return

If all functions are correct, when running ``string1.py`` you will see an output like the one below:

In [61]:
%run string1.py

donuts
 OK  got: 'Number of donuts: 4' expected: 'Number of donuts: 4'
 OK  got: 'Number of donuts: 9' expected: 'Number of donuts: 9'
 OK  got: 'Number of donuts: many' expected: 'Number of donuts: many'
 OK  got: 'Number of donuts: many' expected: 'Number of donuts: many'

both_ends
 OK  got: 'spng' expected: 'spng'
 OK  got: 'Helo' expected: 'Helo'
 OK  got: '' expected: ''
 OK  got: 'xyyz' expected: 'xyyz'

fix_start
 OK  got: 'ba**le' expected: 'ba**le'
 OK  got: 'a*rdv*rk' expected: 'a*rdv*rk'
 OK  got: 'goo*le' expected: 'goo*le'
 OK  got: 'donut' expected: 'donut'

mix_up
 OK  got: 'pox mid' expected: 'pox mid'
 OK  got: 'dig donner' expected: 'dig donner'
 OK  got: 'spash gnort' expected: 'spash gnort'
 OK  got: 'fizzy perm' expected: 'fizzy perm'


## Exercise: more web scraping
I have hidden strings containing the family names (no accents or special characters):
`Ang`, `Baida`, `Baroma`, `Boza`, `Chen`, `Craciun`, `Czobor`, `Dezsenyi`, `Drucker`, `Duronelly`, `Fleck`, `Gabriel`, `Juhasz`, `Karimli`, `Kattika`, `Kazmina`, `Kripalani`, `Lukacs`, `Mattos`, `Mekhrishvili`, `Menyhert`, `Molnar`, `Mark`, `Natera`, `Neri`, `Oleksandrenko`, `Strabel`, `Szabo`, `Yetkin`, `Zhu`, 
as well as the string `Sinatra.Roberta`, in the html code of one of the pages of my website http://www.robertasinatra.com.

Write a program that:
* crawls the html code of the homepage `http://www.robertasinatra.com/index.html`;
* finds all other html pages that are linked in index.html and that belong to the main domain www.robertasinatra.com (e.g. `research.html`) - this step should be done automatically, you cannot list the html pages manually;
* searches for your family name, as listed above (no accents and special characters), in the html code of these pages. Should your family name not be included in the list above, search for `Sinatra.Roberta`. Note that different names are in different pages and that the name might appear in the pages in lower case, upper case or a combination of the two;
* prints on screen ‘My family name is hidden in ...’ , where ... is substituted with the address of the html page containing your family name.

Note: I have not hidden the name in html pages that are not linked in `index.html`, like `www.robertasinatra.com/teaching/math5027.html`.



In [1]:
import urllib.request
request = urllib.request.Request('http://www.robertasinatra.com/index.html')

result = urllib.request.urlopen(request)
text = result.read()


In [2]:
text=str(text)

In [3]:
text

'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\\r\\n  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> \\r\\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" > \\r\\n<head> \\r\\n    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" /> \\r\\n    <meta name="author" content="Roberta Sinatra" /> \\r\\n    <meta name="keywords" content="Roberta Sinatra" /> \\r\\n    <meta name="description" content="Roberta Sinatra" /> \\r\\n    <meta name="robots" content="all" /> \\r\\n    <title>Roberta Sinatra</title> \\r\\n \\r\\n    <style type="text/css" media="all"> \\r\\n        @import "stylesheet.css";\\r\\n    </style> \\r\\n \\r\\n<!-- Begin Google analytics code --> \\r\\n<script type="text/javascript"> \\r\\n \\r\\n  var _gaq = _gaq || [];\\r\\n  _gaq.push([\\\'_setAccount\\\', \\\'UA-18180011-2\\\']);\\r\\n  _gaq.push([\\\'_trackPageview\\\']);\\r\\n \\r\\n  (function() {\\r\\n    var ga = document.createElement(\\\'script\\\'); ga.type = \\\

In [6]:
splitted = text.split('.html')
splitted[6]

'?_r=0" target=\\\'_blank\\\'>New York Times</a>, <a href="https://www.wired.com/2016/11/see-scientists-influential-work-comes-waves/" target=\\\'_blank\\\'>Wired</a>,  \\r\\n        <a href="https://blogs.scientificamerican.com/sa-visual/the-science-of-success-in-science/" target=\\\'_blank\\\'>Scientific American</a>, \\r\\n        <a href="http://www.the-scientist.com/?articles.view/articleNo/47423/title/Predicting-Scientific-Success/" target=\\\'_blank\\\'>The Scientist</a>, <a href="http://www.chronicle.com/article/Older-Scientists-Are-Touted-as/238307\\r\\n" target=\\\'_blank\\\'>The Chronicle of Higher Education</a>, \\r\\n         <a href="http://www.forbes.com/sites/nextavenue/2016/11/04/study-shows-youth-isnt-the-key-to-making-a-mark/#780fd32d19c4" target=\\\'_blank\\\'>Forbes</a>, \\r\\n <a href="http://www.huffingtonpost.com/entry/science-success-age_us_5824a19ee4b07751c390d9b2" target=\\\'_blank\\\'>Huffington Post</a>,  <a href="http://bigthink.com/laurie-vazquez/scientis

In [95]:
url_list = []

# Split on 'href=' so that each piece starts with a quoted link target
for s in text.split('href='):
    temp = s.split('"')

    if len(temp)>1:
        if '.html' in temp[1]:
            url_list.append(temp[1])

url_list

['index.html',
 'bio.html',
 'research.html',
 'publications.html',
 'http://www.nytimes.com/2016/11/04/science/stem-careers-success-achievement.html?_r=0',
 'http://www.nytimes.com/2016/11/04/science/stem-careers-success-achievement.html?_r=0',
 'http://nymag.com/scienceofus/2016/11/can-you-be-too-old-for-success.html',
 'http://phys.org/news/2016-11-success-age.html',
 'http://cen.acs.org/articles/94/i44/Early-career-scientists-dont-necessarily.html?',
 'hhttp://sports.yahoo.com/news/either-trump-clinton-winner-defies-age-bias-213041043--politics.html',
 'http://www.universityherald.com/articles/49183/20161115/stem-researchers-discovered-when-the-eureka-moment-happens.html',
 'http://www.spiegel.de/wissenschaft/mensch/wissenschaft-die-maer-von-der-genialen-jugend-a-1119383.html',
 'http://www.adnkronos.com/salute/2016/11/04/studi-scientifici-successo-possibili-anche-eta-avanzata_aejbrOapqcNbj3JkFGWuhL.html',
 'http://www.lavanguardia.com/ciencia/cuerpo-humano/20161104/411563354798/ci

In [85]:
strings = []
for i in url_list:
    # Skip absolute links to other sites (assumes pages of the main domain
    # are linked with relative paths, as in the list above)
    if '://' in i:
        continue
    request = urllib.request.Request('http://www.robertasinatra.com/' + i)
    result = urllib.request.urlopen(request)
    text = result.read()
    mystring = str(text)
    strings.append(mystring)

strings



['b\'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\\r\\n  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> \\r\\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" > \\r\\n<head> \\r\\n    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" /> \\r\\n    <meta name="author" content="Roberta Sinatra" /> \\r\\n    <meta name="keywords" content="Roberta Sinatra" /> \\r\\n    <meta name="description" content="Roberta Sinatra" /> \\r\\n    <meta name="robots" content="all" /> \\r\\n    <title>Roberta Sinatra</title> \\r\\n \\r\\n    <style type="text/css" media="all"> \\r\\n        @import "stylesheet.css";\\r\\n    </style> \\r\\n \\r\\n<!-- Begin Google analytics code --> \\r\\n<script type="text/javascript"> \\r\\n \\r\\n  var _gaq = _gaq || [];\\r\\n  _gaq.push([\\\'_setAccount\\\', \\\'UA-18180011-2\\\']);\\r\\n  _gaq.push([\\\'_trackPageview\\\']);\\r\\n \\r\\n  (function() {\\r\\n    var ga = document.createElement(\\\'script\\\'); ga.type = \\
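
The strings above start with `b'` and contain escaped characters such as `\\r\\n` because `str()` was applied to a bytes object. A possible alternative, sketched below on made-up page content instead of a real download, is to decode the bytes (the pages declare `charset=iso-8859-1`) and to compare in lower case, so that the search ignores capitalization. The variable `raw` and the helper `contains_name` are hypothetical names introduced only for this illustration.

```python
# Made-up page content, standing in for the bytes returned by result.read()
raw = b'<html><body><p>Hello</p><!-- sInAtRa.RoBeRtA --></body></html>'

# decode() turns bytes into text; str(raw) would keep the b'...' wrapper instead
html = raw.decode('iso-8859-1')

def contains_name(page, name):
    """Return True if name occurs in page, ignoring upper/lower case."""
    return name.lower() in page.lower()

print(contains_name(html, 'Sinatra.Roberta'))
print(contains_name(html, 'Szabo'))
```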

## Exercise &#x1F4D9;
Create a list of N lists, each having N elements. The elements of the first list should go from 1 to N, the elements of the second list from N+1 to 2N, ..., and the elements of the last list should go from $N^2-N+1$ to $N^2$. In other words, this is like creating a matrix

$$
\begin{pmatrix}
  1 & 2 & \cdots & N \\
  N+1 & N+2 & \cdots & 2N \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  N^2-N+1 & N^2-N+2 & \cdots & N^2
\end{pmatrix}
$$

Can you create it with only one line of code? _Hint_: try first with multiple lines of code, and then make it more compact.

In [6]:
list1 = list(range(1,5))
list1

[1, 2, 3, 4]

### More on loops and iterators:
* [An overview of for loops and iterators](https://www.codementor.io/sheena/python-generators-and-iterators-du1082iua)
* What are the most basic definitions of "iterable", "iterator" and "iteration"? [A good explanation from stackoverflow](https://stackoverflow.com/questions/9884132/what-exactly-are-pythons-iterator-iterable-and-iteration-protocols)
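
To make the three terms concrete, here is a minimal sketch using the built-in functions `iter()` and `next()` on the `words` list from the beginning of the class. The names `it`, `first`, `second` and `remaining` are illustrative.

```python
words = ["scientific", "computing", "with", "python"]

it = iter(words)      # a list is iterable: iter() returns an iterator over it
first = next(it)      # next() fetches one element at a time
second = next(it)
print(first, second)

remaining = list(it)  # consumes whatever the iterator has left
print(remaining)
```

A `for` loop does the same thing behind the scenes: it calls `iter()` once and then `next()` repeatedly until the iterator is exhausted.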

### Authentication with urllib 
* [Python 3 documentation](https://docs.python.org/3/howto/urllib2.html)
* [Step by step explanation, with examples](http://www.voidspace.org.uk/python/articles/authentication.shtml)

## Further reading

* http://www.python.org - The official web page of the Python programming language.
* [Python Essential Reference](http://www.amazon.com/Python-Essential-Reference-4th-Edition/dp/0672329786) - A good reference book on Python programming.