In [1]:
from IPython.display import HTML

# Python Data Structures 

Karl Kosack (LEPCHE)

CEA SAp python workshop

TOPICS:
--------

* types, objects
* Advanced scalar data:
  * complex numbers
  * strings
* Collections:
  * tuples
  * lists and comprehensions
  * sets
  * dicts
* Generators and generator comprehensions

Reference: https://docs.python.org/2/library/stdtypes.html

### syntax preview:

#### Strings
```python
x = "text"  # a string
x = 'text'  # also a string
x = """ long text with maybe more than one 
        line in it """   # a string
```
#### containers
```python
x = (a,b)  # a tuple
x = [a,b]  # a list
x = {a,b}  # a set
x = {a: b, c:d}  # a dict
```


# Data Types

In [2]:
x = 3
y = 3.1
print "x is a ", type(x)
print "y is a ", type(y)

x is a  <type 'int'>
y is a  <type 'float'>


In [3]:
print type(False)

<type 'bool'>


In [4]:
print type(type(1))

<type 'type'>


In [5]:
print type( isinstance )

<type 'builtin_function_or_method'>


In [6]:
isinstance(1,int)

True

In [7]:
isinstance("bork",str)

True

In [8]:
isinstance(15.3, int)

False

All values in Python are actually **Objects**:

> * Objects are data types that have **methods** (functions that operate on the object itself) and **properties** (internal data, also called attributes)
> * you access the methods and properties via the ``.`` (period) operator: ``obj.method()``, or ``obj.property``
> * IPython and the Notebook can tell you what methods are available by **tab-completion**

In [9]:
x = 3
# example of tab-completing the x variable above:
# x.[tab]
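
Outside IPython, where tab-completion is not available, the built-in `dir()` and `callable()` functions give roughly the same information. The following is only a minimal sketch (the name `public_names` is made up for this example), written so that it runs on both Python 2 and 3:

```python
x = 3

# dir() returns the names of all attributes of an object;
# names starting with "_" are internal, so we skip them here
public_names = [name for name in dir(x) if not name.startswith("_")]
print(public_names)

# methods are callable, properties are plain data
print(callable(x.bit_length))   # True: a method, call it as x.bit_length()
print(callable(x.real))         # False: a property, no parentheses needed
```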


In [10]:
x.bit_length()

2

In [11]:
(128).bit_length()

8

In [12]:
(128.1).bit_length()

AttributeError: 'float' object has no attribute 'bit_length'

In [13]:
# bit_length is only a method for integers!
# let's see what other methods there are:
f = 1.5

In [14]:
f.as_integer_ratio()

(3, 2)

In [15]:
print "real=",f.real
print "imag=",f.imag

real= 1.5
imag= 0.0


Hmm... interesting! Complex numbers are built in!

Declare them by **using the "j" suffix after the imaginary part**

In [16]:
z = 12.0 + 3.0j
print z

(12+3j)


In [17]:
z.real

12.0

In [18]:
z.conjugate

<function conjugate>

In [19]:
# oops, that wasn't a property... it's a method! need to call it
z.conjugate()

(12-3j)

In [20]:
z**2

(135+72j)
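
For more than basic arithmetic, the standard-library `cmath` module provides complex versions of the usual math functions. A small sketch, assuming you want the polar form of a number (the variable names are illustrative):

```python
import cmath

z = 3.0 + 4.0j

magnitude = abs(z)            # length of the vector (3, 4) in the complex plane
r, phi = cmath.polar(z)       # polar form: modulus and phase (in radians)
z_again = cmath.rect(r, phi)  # back to rectangular form

print(magnitude)                  # 5.0
print(abs(z_again - z) < 1e-12)   # True: the round trip recovers z
```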

## the String type


In [21]:
s = "this is a test"
print type(s)

<type 'str'>


In [22]:
# concatenation:
print s + s 
print s + " of stuff"

this is a testthis is a test
this is a test of stuff


Note: string concatenation like this is *not fast*, so for building a string out of many pieces there are better ways (see `join()` later)
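
As a preview of that approach, the sketch below collects the pieces in a list and joins them once at the end, which avoids creating a new intermediate string at every step. The names `pieces` and `report` are illustrative only.

```python
pieces = []
for ii in range(5):
    pieces.append("item{0}".format(ii))   # cheap: just grows the list

report = ", ".join(pieces)   # the final string is built once, here
print(report)                # item0, item1, item2, item3, item4
```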

In [23]:
print "="*50
print "hello"*10

==================================================
hellohellohellohellohellohellohellohellohellohello


In [24]:
print s.capitalize()
print s.lower()
print s.upper()
print s.title()

print "-"*80
print s.upper().center(80)
print "-"*80

This is a test
this is a test
THIS IS A TEST
This Is A Test
--------------------------------------------------------------------------------
                                 THIS IS A TEST                                 
--------------------------------------------------------------------------------


In [25]:
print s.startswith("this")
print s.endswith("test")
print s.endswith("is")

True
True
False


### Formatting Strings

In [26]:
line = "My name is {} and I have lived in Paris for {} years"
print line.format("Karl",6)

My name is Karl and I have lived in Paris for 6 years


In [27]:
# with numbered placeholders
line = "My name  is {1} and I have lived in Paris for {0} years" 
line.format(6,"Karl") # note positions are now numbered

'My name  is Karl and I have lived in Paris for 6 years'

In [28]:
# with named placeholders
line = "My name  is {name} and I have lived in Paris for {years} years"
line.format( name="Fabio", years=2 )

'My name  is Fabio and I have lived in Paris for 2 years'

In [29]:
# formatting the values: use {placeholder:<format>}  Format is similar to printf
line = "The {0:>14s} has a mass of {1:0.3g} kg"
print line.format("electron",9.10938291e-31)
print line.format("proton",1.672621777e-27)

The       electron has a mass of 9.11e-31 kg
The         proton has a mass of 1.67e-27 kg
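
The part after the colon roughly follows the pattern `[align][width][.precision][type]`. A small sketch of the most common pieces (the field names `name`, `mass` and `count` are made up for this example):

```python
# "<" left-aligns, ">" right-aligns, "^" centers; the number after it is the width
row = "{name:<10s}|{mass:>8.2f}|{count:^6d}|"
line = row.format(name="proton", mass=938.272, count=3)
print(line)   # proton    |  938.27|  3   |

# "e" forces scientific notation, ".3" keeps three digits after the point
print("{0:.3e}".format(9.10938291e-31))   # 9.109e-31
```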


# Collections
Data types that represent multiple pieces of data:


## Tuples: ordered, immutable data of mixed types

In [30]:
# define tuples using parentheses (a,b,...)
t = (12.3, "astro", 1+12j)
print t
print type(t)

(12.3, 'astro', (1+12j))
<type 'tuple'>


In [31]:
# access an element of a tuple:
t[1]

'astro'

In [32]:
t[1] = 'SAp' # cannot change the data after definition! immutable

TypeError: 'tuple' object does not support item assignment

Tuples are fixed structures (you cannot add, remove, or replace items once they are defined!), but you can create a new tuple via concatenation:

In [33]:
t2 = t + ("value",6)
print t2

(12.3, 'astro', (1+12j), 'value', 6)


In [34]:
# a single-element tuple must be defined like this:
t3 = (12,)
print t3

(12,)


In [35]:
len(t2) # get the length of a tuple

5

In most cases, you can even skip the parentheses and Python still treats it as a tuple:

In [36]:
x = 1,4,5
print type(x)

<type 'tuple'>


This is very useful for **multiple assignment**: put several names on the left-hand side to unpack ("undo") the tuple!
    

In [37]:
x,y,z = 1,2,3
print y

2


In [38]:
#swap two variables:
x = 3
y = 4
x,y = y,x
print "x is now:", x
print "y is now:",y

x is now: 4
y is now: 3


#### What are tuples good for?
* packing or unpacking sets of variables:
    * multiple return values for a function
    * multiple generic input parameters for a function that accepts a single parameter:
   ``f(p) -> f( (x,y) )``
* passing around data that you do not want modified
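
The first two uses can be sketched as follows. The functions `min_max` and `distance_from_origin` are hypothetical helpers invented for this illustration:

```python
def min_max(values):
    """Return the smallest and largest value as one tuple."""
    return min(values), max(values)   # the comma packs both into a tuple

lowest, highest = min_max([3, 1, 4, 1, 5])   # unpack on the left-hand side
print(lowest)    # 1
print(highest)   # 5

def distance_from_origin(point):
    """Accept a single parameter that is really an (x, y) pair."""
    x, y = point
    return (x**2 + y**2) ** 0.5

print(distance_from_origin((3, 4)))   # 5.0
```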

## Lists: mutable, expandable arrays of data
Like tuples, the definition is simple: replace ``()`` with ``[]``

In [39]:
v = [1,2,3,4,5]
print v
print len(v), type(v)

[1, 2, 3, 4, 5]
5 <type 'list'>


In [40]:
# element access, same as tuple, but more flexible:
print v[0]
print v[1:3]  # slice: returns a sub-list of elements 1 to 2, not including element 3
# note: indexing from 0!

1
[2, 3]


In [41]:
print v[3:]  # from element 3 onward

[4, 5]


In [42]:
print v[:3] # all elements up to (but not including) element 3

[1, 2, 3]


In [43]:
print v[-1] # negative indices count from the end
print v[-2]

5
4


In [44]:
print v[:-1] , v[1:]  

[1, 2, 3, 4] [2, 3, 4, 5]
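
Slices also accept an optional third value, the step, written as `v[start:stop:step]`. A short sketch of a few common idioms (the list `w` is just an illustrative name):

```python
v = [1, 2, 3, 4, 5]

print(v[::2])    # every second element: [1, 3, 5]
print(v[::-1])   # a reversed copy:      [5, 4, 3, 2, 1]
print(v[1:-1])   # drop first and last:  [2, 3, 4]

w = v[:]         # a full slice makes a (shallow) copy
w[0] = 100
print(v[0])      # still 1, the original list is untouched
```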


In [45]:
# can be mixed-types:
v = ["CEA", 12, "SAp"]
print len(v)

3


In [46]:
# Lists can be nested to make more complex data:
M = [ [0,1],[1,0.5] ]
print M
print "determinant is: ", M[0][0]*M[1][1] - M[0][1]*M[1][0]

[[0, 1], [1, 0.5]]
determinant is:  -1.0


In [47]:
#you can change the list elements at will! (unlike tuples)
M[0][0] = 5
print "determinant is: ", M[0][0]*M[1][1] - M[0][1]*M[1][0]

determinant is:  1.5


#### Manipulating Lists

In [48]:
v = [1,2,3]
v.append(4)
v.append([-5,6])
print v

[1, 2, 3, 4, [-5, 6]]


In [49]:
# concatenating is not the same as appending (+ only works between two lists):
v = [1,2,3]
v.append(4)
v += [-5,6]
print v

[1, 2, 3, 4, -5, 6]


Lists can work like Stacks or Queues using `.append(x)` and `.pop()`

In [50]:
v = [1,2,3,4,5]
print v.pop()  # pops from end of list, stack-wise
print v.pop()
print v

5
4
[1, 2, 3]


In [51]:
v = [1,2,3,4,5]
print v.pop(0)  # pops from element 0, queue-wise
print v.pop(0)

1
2
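
One caveat: `pop(0)` has to shift every remaining element, so it gets slow for long lists. For queue-like use, the standard library offers `collections.deque`. The sketch below shows one way to use it; it is a suggestion, not something the examples above depend on.

```python
from collections import deque

queue = deque([1, 2, 3, 4, 5])
first = queue.popleft()   # removes from the front without shifting the rest
queue.append(6)           # new items still go on the end

print(first)         # 1
print(list(queue))   # [2, 3, 4, 5, 6]
```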


Lists can be sorted:

In [52]:
v1 = [5,6,2,4,3,3,4,16]
v1.sort()  # does *in-place* sort, the original v1 is now overwritten
print v1
v1.sort(reverse=True) # can go backwards too
print v1
l = ["this","is","a","list"]
l.reverse() # reverse without sorting
print l

[2, 3, 3, 4, 4, 5, 6, 16]
[16, 6, 5, 4, 4, 3, 3, 2]
['list', 'a', 'is', 'this']


In [53]:
v2 = [5,6,2,4,3,3,4,16]
s = sorted(v2)  # sorted() returns a new sorted list; v2 itself is left unchanged
print v2
print s
print list(reversed(v2))

[5, 6, 2, 4, 3, 3, 4, 16]
[2, 3, 3, 4, 4, 5, 6, 16]
[16, 4, 3, 3, 4, 2, 6, 5]


**Note**: `sorted()` returns a new list right away, but `reversed()` returns an iterator, which doesn't get evaluated until you loop over it! Calling `list()` on the result of `reversed()` forces it to be realized as a list.
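
A quick way to check what each of the two functions actually returns (a minimal sketch, using a shortened list):

```python
v2 = [5, 6, 2]

s = sorted(v2)     # a brand-new list
r = reversed(v2)   # an iterator: nothing has been computed yet

print(isinstance(s, list))   # True
print(isinstance(r, list))   # False

print(list(r))   # [2, 6, 5]
print(list(r))   # []  (an iterator can only be consumed once)
```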

We'll talk about looping over lists next...

### List iteration:

In [54]:
names = ["one","two","three","four"]

for item in names:
    print "got",item

got one
got two
got three
got four


In [55]:
# if you're used to C, you often also want an "index" variable
for item in enumerate(names):
    print item

(0, 'one')
(1, 'two')
(2, 'three')
(3, 'four')


In [56]:
# each item is a tuple: (ii, name)
# we can write this better by unpacking the tuple in the loop:
for ii,nn in  enumerate(names):
    print "name",ii,"is",nn

name 0 is one
name 1 is two
name 2 is three
name 3 is four


Now let's combine some string and list functions:

In [57]:
mylist = ["this","is","a","test"]
" ".join(mylist)  # join on space (join is a function of strings!) 

'this is a test'

In [58]:
print " *** ".join(mylist)

this *** is *** a *** test


In [59]:
l = "here are some words".split(" ") # split into a list on space
l.sort()
print l

['are', 'here', 'some', 'words']


> note:  more complex splitting and joining can be done with the `re` (regexp) module, but that's for another time

#### A complete example:

In [60]:
quote = """
Be that word our sign in parting, bird or fiend! I shrieked, upstarting
Get thee back into the tempest and the Night's Plutonian shore!
Leave no black plume as a token of that lie thy soul hath spoken!
Leave my loneliness unbroken! quit the bust above my door!
Take thy beak from out my heart, and take thy form from off my door!"
Quoth the Raven "Nevermore." 
"""
clean_quote = quote.replace('\n'," ").replace('"','').replace("!","").replace(".","").lower()
words = clean_quote.split(" ")
print words
print "----------"

for word in ["my", "the","door","soul"]:
    numtimes = words.count(word)
    print "* '{0}' appears {1} times".format( word, numtimes )

['', 'be', 'that', 'word', 'our', 'sign', 'in', 'parting,', 'bird', 'or', 'fiend', 'i', 'shrieked,', 'upstarting', 'get', 'thee', 'back', 'into', 'the', 'tempest', 'and', 'the', "night's", 'plutonian', 'shore', 'leave', 'no', 'black', 'plume', 'as', 'a', 'token', 'of', 'that', 'lie', 'thy', 'soul', 'hath', 'spoken', 'leave', 'my', 'loneliness', 'unbroken', 'quit', 'the', 'bust', 'above', 'my', 'door', 'take', 'thy', 'beak', 'from', 'out', 'my', 'heart,', 'and', 'take', 'thy', 'form', 'from', 'off', 'my', 'door', 'quoth', 'the', 'raven', 'nevermore', '', '']
----------
* 'my' appears 4 times
* 'the' appears 4 times
* 'door' appears 2 times
* 'soul' appears 1 times


### List Comprehensions: quickly generating lists

Something very *powerful and very useful* is the _list comprehension_: it builds a new list by applying an expression to every item of another list (or any iterable), optionally keeping only the items that pass a condition

> * alist = [ func(var) **for** var **in** anotherlist ]
> * alist = [ func(var) **for** var **in** anotherlist **if** condition ]
> * alist = [ func(var) **if** condition **else** anotherfunc(var) **for** var **in** anotherlist ]

In [61]:
mylist = ["this","is","a","test"]

# get all elements that start with "t" and make them upper-case:
selected = [ X.upper() for X in mylist if X.startswith("t") ]
print selected

['THIS', 'TEST']


In [62]:
# use the ternary operator:
selected = [ X.capitalize() if X.startswith("t") else X.upper() for X in mylist ]
print selected

['This', 'IS', 'A', 'Test']
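
If the syntax looks dense, it can help to see that a comprehension with a condition is just a compact way of writing a loop that appends to a list. A sketch comparing the two (the names `selected_loop` and `selected_comp` are made up here):

```python
mylist = ["this", "is", "a", "test"]

# the explicit loop...
selected_loop = []
for X in mylist:
    if X.startswith("t"):
        selected_loop.append(X.upper())

# ...and the equivalent one-line comprehension
selected_comp = [X.upper() for X in mylist if X.startswith("t")]

print(selected_loop == selected_comp)   # True
print(selected_comp)                    # ['THIS', 'TEST']
```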


In [63]:
mylist = [1,2,3,4,"test",5]
# convert all to strings:
mystringlist = [ str(s) for s in mylist]
print mylist
print mystringlist
print "'",", ".join(mystringlist),"'"

[1, 2, 3, 4, 'test', 5]
['1', '2', '3', '4', 'test', '5']
' 1, 2, 3, 4, test, 5 '


## Sets: list-like objects with no repeating values

In [64]:
set1 = {"apples","oranges","pears","blueberries"}
set2 = {"grapes","pears","apples","strawberries","grapes"} # note grapes appears twice
print type(set1)
print set1
print set2

<type 'set'>
set(['blueberries', 'oranges', 'pears', 'apples'])
set(['strawberries', 'pears', 'apples', 'grapes'])


Sets work a lot like lists or tuples, but they are unordered (so they cannot be indexed) and cannot have repeating values (adding an existing value does nothing)

In [65]:
"apples" in set1

True

In [66]:
"oranges" in set2

False

In [67]:
# find the intersection
print set1.intersection(set2)
print set1 & set2 

set(['apples', 'pears'])
set(['apples', 'pears'])


In [68]:
# find the difference:
print set1.difference(set2)
print set1 - set2 

set(['blueberries', 'oranges'])
set(['blueberries', 'oranges'])


In [69]:
# adding and removing items:
set1.add("tomatoes")
set1.remove("pears")
print set1
print set1 | set2 
print set1.union(set2)

set(['tomatoes', 'blueberries', 'oranges', 'apples'])
set(['blueberries', 'pears', 'oranges', 'strawberries', 'apples', 'grapes', 'tomatoes'])
set(['blueberries', 'pears', 'oranges', 'strawberries', 'apples', 'grapes', 'tomatoes'])


In [70]:
set3 = {"apples","oranges","pears","blueberries"}
print {"apples","pears"}.issubset( set3 )
print {"apples","grapes"}.issubset( set3 )

True
False
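
Two more set idioms that come up often are the symmetric difference and removing duplicates from a list. A brief sketch (`only_in_one` and `measurements` are illustrative names, and the sets are redefined so the block runs on its own):

```python
set1 = {"apples", "oranges", "pears", "blueberries"}
set2 = {"grapes", "pears", "apples", "strawberries"}

# items that are in exactly one of the two sets
only_in_one = set1 ^ set2     # same as set1.symmetric_difference(set2)
print(sorted(only_in_one))    # ['blueberries', 'grapes', 'oranges', 'strawberries']

# converting to a set removes duplicates (sorted() puts them back in order)
measurements = [3, 1, 3, 2, 1]
print(sorted(set(measurements)))   # [1, 2, 3]
```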


## Dicts: Key-Value storage (associative arrays)

One of the most powerful and useful structures is the **dict**  = "Dictionary"

In [71]:
mydict = { 'firstname': 'Karl', 'lastname': 'Kosack' }
print mydict

{'lastname': 'Kosack', 'firstname': 'Karl'}


In [72]:
# can also declare using a function:
mydict = dict( firstname="Karl", lastname="Kosack" )
print mydict

{'lastname': 'Kosack', 'firstname': 'Karl'}


In [73]:
# access an element
print mydict['firstname']

Karl


In [74]:
print mydict['stuff']

KeyError: 'stuff'

In [75]:
print 'stuff' in mydict  # "in" tests for a key (preferred over the older mydict.has_key('stuff'))

False


In [76]:
print mydict[0] # doesn't index like a list

KeyError: 0

In [77]:
print mydict.get("firstname") # same as ['firstname']

Karl


In [78]:
print mydict.get("stuff") # returns None instead of throwing an exception

None


In [79]:
print mydict.get("stuff","key not found") # can specify the value to return if key is not there. 

key not found


In [80]:
# add a new item
mydict['middlename'] = "Peter"
print mydict

{'middlename': 'Peter', 'lastname': 'Kosack', 'firstname': 'Karl'}


> NOTE: Dict keys **do not need to be strings!** Any hashable (immutable) value, such as a number or a tuple, can be a key.

In [81]:
# for example, you can use integer keys to make a sparse array!
dd = {}  # an empty dict, ready to be filled
dd[10] = 1.0
dd[0] = 2.0
print dd
print dd[10]

{0: 2.0, 10: 1.0}
1.0
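
The same idea extends to more than one dimension by using tuples as keys. A sketch of a sparse 2D grid (the name `grid` and the stored values are invented for this example):

```python
grid = {}                 # a sparse 2D array: only filled cells are stored
grid[(0, 0)] = 1.0
grid[(10, 3)] = 2.5

print(grid[(10, 3)])          # 2.5
print(grid.get((5, 5), 0.0))  # 0.0, the default for an empty cell

# mutable objects such as lists are not hashable, so they cannot be keys
try:
    grid[[1, 2]] = 3.0
except TypeError as err:
    print("TypeError: {0}".format(err))
```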


Note that **dictionaries are unordered** in Python 2: you can't assume that one key comes after another (from Python 3.7 on, dicts do keep insertion order)

(though there is a special `OrderedDict` type in the `collections` module that can be used in that case)

http://docs.python.org/whatsnew/2.7.html#pep-372-adding-an-ordered-dictionary-to-collections

#### Accessing the dictionary

We already saw that you can access by key using the [] syntax, but you can also get at the details in other ways:

In [82]:
print mydict

{'middlename': 'Peter', 'lastname': 'Kosack', 'firstname': 'Karl'}


In [83]:
# keys as a list:
print mydict.keys() , type(mydict.keys())

['middlename', 'lastname', 'firstname'] <type 'list'>


In [84]:
# values as a list: (guaranteed to be in the same order as keys())
print mydict.values() , type(mydict.values())

['Peter', 'Kosack', 'Karl'] <type 'list'>


In [85]:
# items (key,value) tuple as a list:
print mydict.items() , type(mydict.items())

[('middlename', 'Peter'), ('lastname', 'Kosack'), ('firstname', 'Karl')] <type 'list'>


In [86]:
# iterating:
for item in mydict:
    print item, mydict[item]


middlename Peter
lastname Kosack
firstname Karl


In [87]:
#another way
for key,val in mydict.items():
    print key,val

middlename Peter
lastname Kosack
firstname Karl


In [88]:
# an even better way (for large dictionaries especially):
for key,val in mydict.iteritems():
    print key,val


middlename Peter
lastname Kosack
firstname Karl


It looks the same! But `iteritems()` returns an iterator that accesses each item one at a time, rather than converting them all into a list. If the dictionary is very large, this is more memory efficient!

In [89]:
# A common example: use a dictionary to count things.  Recall our previous example:
quote = """
Be that word our sign in parting, bird or fiend! I shrieked, upstarting
Get thee back into the tempest and the Night's Plutonian shore!
Leave no black plume as a token of that lie thy soul hath spoken!
Leave my loneliness unbroken! quit the bust above my door!
Take thy beak from out my heart, and take thy form from off my door!"
Quoth the Raven "Nevermore." 
"""
clean_quote = quote.replace('\n'," ").replace('"','').replace("!","").replace(".","").lower()
words = clean_quote.split(" ")

# let's count the frequency using a dictionary:
freq = {}
for word in words:
    if word not in freq:
        freq[word] = 0
    freq[word] += 1
    
print freq


{'': 3, 'and': 2, 'spoken': 1, 'into': 1, 'back': 1, 'sign': 1, 'hath': 1, 'as': 1, 'in': 1, 'our': 1, 'quoth': 1, 'tempest': 1, 'out': 1, 'lie': 1, 'from': 2, 'heart,': 1, 'no': 1, 'plume': 1, 'get': 1, 'token': 1, 'bust': 1, 'door': 2, 'black': 1, 'take': 2, 'above': 1, 'fiend': 1, 'plutonian': 1, 'bird': 1, 'unbroken': 1, 'raven': 1, 'be': 1, 'thee': 1, 'form': 1, 'that': 2, 'shrieked,': 1, 'quit': 1, 'beak': 1, 'off': 1, 'a': 1, 'upstarting': 1, 'word': 1, 'thy': 3, "night's": 1, 'i': 1, 'of': 1, 'loneliness': 1, 'soul': 1, 'parting,': 1, 'leave': 2, 'shore': 1, 'nevermore': 1, 'the': 4, 'my': 4, 'or': 1}


Let's try that again, but now using a helper that removes the need to check whether the key exists first.

For that we will use a **defaultdict**: a dict subclass that, when you access a key that doesn't exist, inserts it automatically with a default value created by calling the given type (`int()` returns 0).

In [90]:
from collections import defaultdict
freq = defaultdict(int)  # the default value for a key in freq is an int with value 0

for word in words:
    freq[word] += 1
    
print freq

defaultdict(<type 'int'>, {'': 3, 'and': 2, 'spoken': 1, 'into': 1, 'back': 1, 'sign': 1, 'hath': 1, 'as': 1, 'in': 1, 'our': 1, 'quoth': 1, 'tempest': 1, 'out': 1, 'lie': 1, 'from': 2, 'heart,': 1, 'no': 1, 'plume': 1, 'get': 1, 'token': 1, 'bust': 1, 'door': 2, 'black': 1, 'take': 2, 'above': 1, 'fiend': 1, 'plutonian': 1, 'bird': 1, 'unbroken': 1, 'raven': 1, 'be': 1, 'thee': 1, 'form': 1, 'that': 2, 'shrieked,': 1, 'quit': 1, 'beak': 1, 'off': 1, 'a': 1, 'upstarting': 1, 'word': 1, 'thy': 3, "night's": 1, 'i': 1, 'of': 1, 'loneliness': 1, 'soul': 1, 'parting,': 1, 'leave': 2, 'shore': 1, 'nevermore': 1, 'the': 4, 'my': 4, 'or': 1})
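
The default does not have to be a number. As a sketch, `defaultdict(list)` gives each new key an empty list, which is handy for grouping items (the word list and the name `by_letter` are made up for this illustration):

```python
from collections import defaultdict

by_letter = defaultdict(list)   # a missing key starts out as an empty list []
for word in ["tempest", "the", "raven", "token", "plume"]:
    by_letter[word[0]].append(word)   # group the words by their first letter

print(by_letter["t"])      # ['tempest', 'the', 'token']
print(sorted(by_letter))   # ['p', 'r', 't']
```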


In [91]:
# let's sort by frequency... but dicts aren't ordered, so we have to do some work:
#
# let's take the keys and values and put them each into a list, 
# and then zip them together making a single list of tuples:
zlist = zip(freq.values(),freq.keys())
zlist.sort(reverse=True) # sort in descending order so the highest counts come first
print zlist[:10]

[(4, 'the'), (4, 'my'), (3, 'thy'), (3, ''), (2, 'that'), (2, 'take'), (2, 'leave'), (2, 'from'), (2, 'door'), (2, 'and')]


In [92]:
# Here's another way using sorted()'s key argument, and a list comprehension:

print [ (freq[word],word) for word in sorted( freq, key=freq.get, reverse=True) ][:10]

[(4, 'the'), (4, 'my'), (3, ''), (3, 'thy'), (2, 'and'), (2, 'from'), (2, 'door'), (2, 'take'), (2, 'that'), (2, 'leave')]


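Note again that order is not preserved! Both results are correctly sorted by count, but words with equal counts come out in a different order: the first method breaks ties by comparing the words themselves, while the second leaves ties in the dict's arbitrary order.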
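
For this particular task the standard library also has a ready-made tool, `collections.Counter` (available from Python 2.7), which counts and sorts in one go. A minimal sketch on a shortened, made-up word list:

```python
from collections import Counter

words = "the raven the door my door the".split()
freq = Counter(words)          # counts every item in one step

print(freq["the"])             # 3
print(freq["nevermore"])       # 0, missing keys count as zero
print(freq.most_common(2))     # [('the', 3), ('door', 2)]
```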

## Final Thoughts:

### you can convert between containers easily:

In [93]:
keys = ["a","b","c","c","c","d"]
print keys
print set(keys)
print tuple(keys)
print list( (1,2,3) )

['a', 'b', 'c', 'c', 'c', 'd']
set(['a', 'c', 'b', 'd'])
('a', 'b', 'c', 'c', 'c', 'd')
[1, 2, 3]


In [94]:
vals = list(range(len(keys)))
print vals
print keys,vals
print zip(keys,vals)
print dict(zip(keys,vals))

[0, 1, 2, 3, 4, 5]
['a', 'b', 'c', 'c', 'c', 'd'] [0, 1, 2, 3, 4, 5]
[('a', 0), ('b', 1), ('c', 2), ('c', 3), ('c', 4), ('d', 5)]
{'a': 0, 'c': 4, 'b': 1, 'd': 5}
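
Note what happened to the repeated key `'c'`: a dict holds each key once, so the last value wins. The sketch below shows the same behaviour with a dict comprehension, plus one possible way to keep the first value instead (the names `last_index` and `first_index` are illustrative):

```python
keys = ["a", "b", "c", "c", "c", "d"]

# dict comprehension (Python 2.7 and later): a later duplicate overwrites an earlier one
last_index = {key: position for position, key in enumerate(keys)}
print(last_index["c"])    # 4

# setdefault() only stores a value if the key is not there yet
first_index = {}
for position, key in enumerate(keys):
    first_index.setdefault(key, position)
print(first_index["c"])   # 2
```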


### You can nest containers easily:
(allowing you to make complex tree-structures)

In [95]:
a = {"test": [1,2,3], 12: 6, 15:("one","two"), "nesteddict": dict(x=1,y=15,z=[1,3,5]) }
print a

{'test': [1, 2, 3], 'nesteddict': {'y': 15, 'x': 1, 'z': [1, 3, 5]}, 12: 6, 15: ('one', 'two')}


In [96]:
print a['nesteddict']['z'][2]

5


In [97]:
print a['nesteddict'].keys()
print a[15][1]

['y', 'x', 'z']
two


In [98]:
tree = {}
tree['left'] = {}
tree['right'] = {}
tree['left']['left'] = 15
tree['left']['right'] = 10
tree['right']['left'] = 5
tree['right']['right'] = {}
tree['right']['right']['left'] = {}
tree['right']['right']['right'] = 2
tree['right']['right']['left']['left'] = 1
tree['right']['right']['left']['right'] = 100
print tree

{'right': {'right': {'right': 2, 'left': {'right': 100, 'left': 1}}, 'left': 5}, 'left': {'right': 10, 'left': 15}}
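
Nested structures like this are naturally processed with a recursive function. As a sketch, the hypothetical helper `sum_leaves` below walks the same tree (written here as a literal) and adds up the numbers stored at its leaves:

```python
tree = {
    "left":  {"left": 15, "right": 10},
    "right": {"left": 5,
              "right": {"left": {"left": 1, "right": 100},
                        "right": 2}},
}

def sum_leaves(node):
    """Add up all the numbers stored at the leaves below `node`."""
    if isinstance(node, dict):
        # an inner node: recurse into each branch
        return sum(sum_leaves(child) for child in node.values())
    return node   # a leaf: just a number

print(sum_leaves(tree))   # 133
```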


----------------------
A more real-world example: select data out of a database (with 2 columns "name", and "mass"):

In [99]:
particles = \
[{"name":"π+"  ,"mass": 139.57018}, {"name":"π0"  ,"mass": 134.9766}, 
 {"name":"η"   ,"mass": 547.853}, {"name":"η'(958)","mass": 957.78}, 
 {"name":"ηc(1S)", "mass": 2980.5}, {"name": "ηb(1S)","mass": 9388.9}, 
 {"name":"K+",  "mass": 493.677}, {"name":"K0"  ,"mass": 497.614}, 
 {"name":"K0S" ,"mass":  497.614}, {"name":"K0L" ,"mass":  497.614},
 {"name":"D+"  ,"mass": 1869.62}, {"name":"D0"  ,"mass": 1864.84},
 {"name":"D+s" ,"mass":  1968.49}, {"name":"B+"  ,"mass": 5279.15},
 {"name":"B0"  ,"mass": 5279.5}, {"name":"B0s" ,"mass":  5366.3},
 {"name":"B+c" ,"mass":    6277}]

# data source: http://en.wikipedia.org/wiki/List_of_mesons


In [100]:
# select mesons with mass in a particular range, and make them into a list of tuples:
# using a list comprehension
my_mesons = [ (x['name'],x['mass']) \
             for x in particles \
             if x['mass'] <= 1000.0 and x['mass'] >= 100.0 ]

for part,mass in my_mesons:
    print "{0:>17s}: {1}".format( part,mass )

              π+: 139.57018
              π0: 134.9766
               η: 547.853
         η'(958): 957.78
               K+: 493.677
               K0: 497.614
              K0S: 497.614
              K0L: 497.614
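
A list of dicts like this can also be sorted or searched by one of its "columns" using the `key` argument. A sketch on a three-row subset of the table (the names `by_mass` and `heaviest` are made up for this example):

```python
particles = [{"name": "K+", "mass": 493.677},
             {"name": "B0", "mass": 5279.5},
             {"name": "D0", "mass": 1864.84}]

# key= tells sorted() which value to compare for each item
by_mass = sorted(particles, key=lambda p: p["mass"])
print([p["name"] for p in by_mass])   # ['K+', 'D0', 'B0']

# max() and min() accept the same key argument
heaviest = max(particles, key=lambda p: p["mass"])
print(heaviest["name"])               # B0
```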


## Advanced topic: Generators 

You might have noticed that nearly all collections let you iterate over them using a `for` loop:

In [101]:
container = [1,3,5,7]
for variable in container:
    print variable

1
3
5
7


That is because all containers are _iterable_ (they have a specific interface that lets one loop over them)

But what happens when you want to represent a very large (or infinitely large!) container?

   `` mylist = range(1000000000) ``

**it would fill up memory pretty quickly!**

> **Generators** and **Iterators** fix that problem: they are special objects that are iterable, but calculate the next value on the fly rather than calculating all values at once
>
> In Python 2.x, `range()` always returns a list in memory, while `xrange()` returns a lazy object that behaves like a generator, producing each value only when it is asked for:
>> (note in Python 3+ `range()` is lazy in the same way, and it must be converted with `list(range(10))` if you want it to be a list)



In [102]:
print range(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [103]:
print xrange(10)

xrange(10)


In [104]:
for ii in range(10): print ii,
print " "
for ii in xrange(10): print ii,

0 1 2 3 4 5 6 7 8 9  
0 1 2 3 4 5 6 7 8 9


In [105]:
rr= xrange(100000000)
#doesn't fill up memory! (don't try this with range()!)

In [106]:
for ii in rr:
    print ii
    if ii>10:
        break

0
1
2
3
4
5
6
7
8
9
10
11


Generators are very powerful (and can get complex), and I don't have time to go through them here in detail. They let you work with data that normally would not fit in memory, with similar or faster speed, using the same looping interfaces you would use with lists.
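
To give at least a flavour of how they are written: any function that contains `yield` becomes a generator. The sketch below defines a hypothetical infinite generator `squares` and uses `itertools.islice` to take only a few values from it.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... forever, one value at a time."""
    n = 0
    while True:
        yield n * n   # execution pauses here until the next value is requested
        n += 1

gen = squares()
print(next(gen))              # 0
print(next(gen))              # 1
print(list(islice(gen, 3)))   # the next three values: [4, 9, 16]
```

Because values are only computed on request, the infinite loop inside `squares` is harmless as long as the caller stops asking.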

In [107]:
X = xrange(100)
mylist = [ val**2/2.0 for val in X ]    # a list comprehension
mygen = ( val**2/2.0 for val in X )     # a generator comprehension

In [108]:
print mylist

[0.0, 0.5, 2.0, 4.5, 8.0, 12.5, 18.0, 24.5, 32.0, 40.5, 50.0, 60.5, 72.0, 84.5, 98.0, 112.5, 128.0, 144.5, 162.0, 180.5, 200.0, 220.5, 242.0, 264.5, 288.0, 312.5, 338.0, 364.5, 392.0, 420.5, 450.0, 480.5, 512.0, 544.5, 578.0, 612.5, 648.0, 684.5, 722.0, 760.5, 800.0, 840.5, 882.0, 924.5, 968.0, 1012.5, 1058.0, 1104.5, 1152.0, 1200.5, 1250.0, 1300.5, 1352.0, 1404.5, 1458.0, 1512.5, 1568.0, 1624.5, 1682.0, 1740.5, 1800.0, 1860.5, 1922.0, 1984.5, 2048.0, 2112.5, 2178.0, 2244.5, 2312.0, 2380.5, 2450.0, 2520.5, 2592.0, 2664.5, 2738.0, 2812.5, 2888.0, 2964.5, 3042.0, 3120.5, 3200.0, 3280.5, 3362.0, 3444.5, 3528.0, 3612.5, 3698.0, 3784.5, 3872.0, 3960.5, 4050.0, 4140.5, 4232.0, 4324.5, 4418.0, 4512.5, 4608.0, 4704.5, 4802.0, 4900.5]


In [109]:
print mygen

<generator object <genexpr> at 0x111e84910>


In [110]:
print mygen[0]  # it's no longer exactly like a list.. 

TypeError: 'generator' object has no attribute '__getitem__'

In [111]:
# no random access! but we can get items sequentially:
print next(mygen)

0.0


In [112]:
print next(mygen)
print next(mygen)
print next(mygen)

0.5
2.0
4.5


In [113]:
for ii in mygen:
    print ii

8.0
12.5
18.0
24.5
32.0
40.5
50.0
60.5
72.0
84.5
98.0
112.5
128.0
144.5
162.0
180.5
200.0
220.5
242.0
264.5
288.0
312.5
338.0
364.5
392.0
420.5
450.0
480.5
512.0
544.5
578.0
612.5
648.0
684.5
722.0
760.5
800.0
840.5
882.0
924.5
968.0
1012.5
1058.0
1104.5
1152.0
1200.5
1250.0
1300.5
1352.0
1404.5
1458.0
1512.5
1568.0
1624.5
1682.0
1740.5
1800.0
1860.5
1922.0
1984.5
2048.0
2112.5
2178.0
2244.5
2312.0
2380.5
2450.0
2520.5
2592.0
2664.5
2738.0
2812.5
2888.0
2964.5
3042.0
3120.5
3200.0
3280.5
3362.0
3444.5
3528.0
3612.5
3698.0
3784.5
3872.0
3960.5
4050.0
4140.5
4232.0
4324.5
4418.0
4512.5
4608.0
4704.5
4802.0
4900.5


The power here is that at no point is the full list ever in memory, so a large memory allocation is never needed.

In many cases this is much more efficient

In [114]:
sum(mygen)

0

In [115]:
# huh? Oh right - 
# the generator is "empty" now, we used it up in the loop above
mygen = ( val**2/2.0 for val in X )     
sum(mygen)

164175.0

In [116]:
def f(size):
    mylist = [ val**2/2.0 for val in list(xrange(size)) ]
    return sum(mylist)

def g(size):
    mygen  = ( val**2/2.0 for val in xrange(size) )
    return sum(mygen)

In [117]:
%timeit f(10000)
%timeit g(10000)

1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.18 ms per loop


Not much difference in speed.

**However: we will see tomorrow** that there is another data structure (a NumPy `ndarray`) that is similar to a list (with N dimensions), but allows for fast math operations... So really this example is not the best way to do math on large datasets!

In [119]:
# a notebook extension that you probably don't have installed, but can install (it comes with the memory_profiler package)
%load_ext memory_profiler  

In [120]:
# let's look at memory usage now. for that we will increase the loop length to show the effect better
print "with 100000 entries"
%memit f(100000)
%memit g(100000)
print "with 1000000 entries"
%memit f(1000000)
%memit g(1000000)

with 100000 entries
peak memory: 31.35 MiB, increment: 5.38 MiB
peak memory: 30.19 MiB, increment: 0.00 MiB
with 1000000 entries
peak memory: 87.59 MiB, increment: 57.40 MiB
peak memory: 46.51 MiB, increment: -2.00 MiB


So we can see that, from a _memory_ usage standpoint, using generators is much more efficient!

**Therefore the rule of thumb is:** if you are working with a very large amount of data and don't need random access, use generators.
Otherwise, lists are fine (and provide more flexibility, like sorting and random access).

# BREAKOUT PROBLEM:

Make a program that:

* Opens the text file "verne-de_la_terre_a_la_lune.txt" (which contains the entire text of Jules Verne's _De la Terre à la Lune_) and  reads the contents into memory as a list of lines, as follows: (you can also loop directly over the lines if you prefer to be more memory efficient)

``` python
with open( "verne-de_la_terre_a_la_lune.txt" ) as infile:
    lines = infile.readlines()

# lines is now a list of strings, one per line in the file

#  <your code here>
# hint: loop over the lines, and use the string's  .split(" ") method
# hint: count the word frequency using a dict (perhaps defaultdict to make it easier)
# hint: dicts can't be sorted by value, so you need to turn the dict's
#       keys and values into a list of tuples (use zip() to combine them)
# hint: sorted() (or list.sort()) compares tuples element by element, so a list
#       of (count, word) tuples is sorted by count first
```

* At the end, your code should print:
  * the total number of lines of text
  * the total number of words in the text
  * the number of _distinct_ words in the text
  * **the top 20 most common words** in the novel that are **not** one of the following _set_:

```python
ignore = {'la','à','et','le','les','des','un','du','en','que','dans','se',
          'il', 'une', 'ne', 'qui', 'au', 'pas', 'son', 'par', 'plus', 'pour',
          'ce', 'sur', 'cette', 'avec','de'}
```

