# Data Structure

Side note, not covered in detail here: it is possible to restrict which attributes can be set on instances of our own classes using `__slots__`, which also saves memory.
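As a minimal sketch of the idea (the `Point` class below is hypothetical, not part of these notes), declaring `__slots__` fixes the set of allowed attribute names, so assigning any other name raises an `AttributeError`:

```python
class Point:
    # only these two attribute names may be set on instances
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
try:
    p.z = 3  # "z" is not listed in __slots__
    slot_error = None
except AttributeError as err:
    slot_error = err

print(type(slot_error).__name__)  # AttributeError
```

The memory saving comes from instances no longer carrying a per-instance `__dict__`.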

1. **Tuples:** They are immutable and typically group values of different kinds (e.g. a symbol and its prices) into a single record.
2. **Named Tuples:** Indexing a tuple by position can be unreadable, since those "magic" numbers mean nothing unless we paw through the code to find where the tuple was initially declared. If we know in advance which attributes we're going to store (if not, use a dictionary) and the data has no behavior attached to it, a named tuple is a good fit.

In [1]:
from collections import namedtuple

# the first argument is the identifier for the named tuple and the second is the attributes that 
# the named tuple can have
Stock = namedtuple( "Stock", ['symbol', 'current', 'high', 'low'] )

# we can use the Stock to create as many instances of this named tuple as we like
stock = Stock( symbol = "GOOG", current = 613.30, high = 625.86, low = 610.50 )
print(stock.high)

# stock.current = 609.27 # tuples are immutable, thus we can't set the new value

625.86
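Because a named tuple cannot be modified in place, updating a field means building a new tuple. A small sketch using the standard `_replace` method, reusing the `Stock` definition from the cell above:

```python
from collections import namedtuple

Stock = namedtuple("Stock", ["symbol", "current", "high", "low"])
stock = Stock(symbol="GOOG", current=613.30, high=625.86, low=610.50)

# _replace returns a new tuple; the original is left untouched
updated = stock._replace(current=609.27)
print(updated.current)  # 609.27
print(stock.current)    # 613.3

# a named tuple still behaves like a plain tuple (indexing, unpacking)
symbol, current, high, low = stock
print(stock[0] == symbol)  # True
```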


**Dictionary**

`.get` returns the value for a key, or an optional default value if the key doesn't exist.

In [2]:
stocks = {
    "GOOG": (613.30, 625.86, 610.50),
    "MSFT": (30.25, 30.70, 30.19)
}
print( stocks.get("RIM", "NOT FOUND") )

NOT FOUND


We can use the index syntax to set the value for any key, regardless of whether the key is already in the dictionary. Keys must be hashable, though, so we can't use lists or other mutable built-in containers as keys.

In [3]:
stocks["GOOG"] = (597.63, 610.00, 596.28)
print(stocks["GOOG"])

(597.63, 610.0, 596.28)
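The restriction on keys can be checked directly. In this sketch the date strings are made-up values used only for illustration: a tuple works as a key, while a list raises a `TypeError`.

```python
prices = {}

# tuples are immutable and hashable, so they work as keys
prices[("GOOG", "2014-01-02")] = 613.30

# lists are mutable and unhashable, so this assignment fails
try:
    prices[["GOOG", "2014-01-02"]] = 613.30
    key_error = None
except TypeError as err:
    key_error = err

print(type(key_error).__name__)  # TypeError
print(len(prices))               # 1
```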


We can set a default value for a missing key using the dictionary's `.setdefault`. The method sets a value in the dictionary only if the key has not previously been set. It then returns the value in the dictionary: either the one that was already there or the newly provided default.

In [4]:
value = stocks.setdefault( "RIM", (500.00, 600.00, 696.28) )
print(value)
print(stocks) # the "RIM"'s default value has been set

(500.0, 600.0, 696.28)
{'MSFT': (30.25, 30.7, 30.19), 'GOOG': (597.63, 610.0, 596.28), 'RIM': (500.0, 600.0, 696.28)}


In [5]:
from collections import defaultdict
def letter_frequency(sentence):
    frequencies = defaultdict(int)
    for letter in sentence:
        frequencies[letter] += 1
    return frequencies
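The cell above relies on `defaultdict(int)`, which creates a missing key with the value `int()`, i.e. `0`, the first time it is accessed. As a rough comparison, here is the same function next to a plain-dict variant built on `.setdefault` (the `letter_frequency_setdefault` name is made up for this sketch):

```python
from collections import defaultdict


def letter_frequency(sentence):
    # int() returns 0, so unseen letters start counting from zero
    frequencies = defaultdict(int)
    for letter in sentence:
        frequencies[letter] += 1
    return frequencies


def letter_frequency_setdefault(sentence):
    # same result with a plain dict and .setdefault
    frequencies = {}
    for letter in sentence:
        frequencies.setdefault(letter, 0)
        frequencies[letter] += 1
    return frequencies


counts = letter_frequency("hello")
print(counts["l"])  # 2
print(dict(counts) == letter_frequency_setdefault("hello"))  # True
```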

**List**

The most common methods are `append(element)`, `insert( index, element )`, `count(element)`, `index(element)`.

In [6]:
test_list = [ 1, 1, 2, 3 ]
test_list.count(1)

2

`sort()` sorts a list in place and is quite straightforward, e.g.
- If a list of tuples is provided, the list is sorted by the first element in each tuple, with ties broken by the following elements.
- If the list mixes items that cannot be compared with each other (e.g. numbers and strings), the sort raises a `TypeError` exception.
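Both behaviours can be seen in a short sketch (the sample lists are made up for illustration):

```python
pairs = [(2, "b"), (1, "z"), (1, "a")]
pairs.sort()  # compares first elements, then second elements on ties
print(pairs)  # [(1, 'a'), (1, 'z'), (2, 'b')]

mixed = [3, "two", 1]
try:
    mixed.sort()  # int and str cannot be compared with <
    sort_error = None
except TypeError as err:
    sort_error = err
print(type(sort_error).__name__)  # TypeError
```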

If we want to place objects that we've defined into a list and make them sortable, we have to define the `__lt__` method, which stands for "less than". The following example only implements `__lt__` to enable sorting; however, a class should normally implement the similar `__gt__`, `__eq__`, `__ne__`, `__ge__`, and `__le__` methods as well, so that >, ==, !=, >=, and <= all work properly.

In [7]:
class WeirdSortee(object):
    """the class's object can be sorted based on either a string or a number"""
    
    def __init__( self, string, number, sort_num ):
        self.string = string
        self.number = number
        self.sort_num = sort_num

    def __lt__( self, other ):
        """
        compares the object to another instance of the same class,
        or any duck typed object that has the string, number and sort_num attributes
        """
        if self.sort_num:
            return self.number < other.number
        return self.string < other.string
    
    def __repr__(self):
        return "{}:{}".format( self.string, self.number )

In [8]:
a = WeirdSortee( 'a', 4, True )
b = WeirdSortee( 'b', 3, True )
c = WeirdSortee( 'c', 2, True )
d = WeirdSortee( 'd', 1, True )
l = [a,b,c,d]
l # print out the class information using __repr__

[a:4, b:3, c:2, d:1]

In [9]:
l.sort()
l

[d:1, c:2, b:3, a:4]
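Writing all six comparison methods by hand is repetitive. One option from the standard library is `functools.total_ordering`, which derives the missing ordering methods from `__eq__` plus one of `__lt__`, `__le__`, `__gt__` or `__ge__`. A minimal sketch with a hypothetical `Version` class (not part of the example above):

```python
from functools import total_ordering


@total_ordering
class Version:
    """hypothetical example class, ordered by its number"""

    def __init__(self, number):
        self.number = number

    def __eq__(self, other):
        return self.number == other.number

    def __lt__(self, other):
        return self.number < other.number


print(Version(1) <= Version(2))  # True
print(Version(3) > Version(2))   # True
print(Version(2) >= Version(2))  # True
```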

We can also pass the additional `key` argument to the sort method. This is useful, for example, if we have a list of tuples and want to sort on the second item in each tuple rather than the first (which is the default for sorting tuples).

In [10]:
x = [(1,'c'), (2,'a'), (3, 'b')]
x.sort()
print(x)
x.sort( key = lambda i: i[1] )
print(x)

[(1, 'c'), (2, 'a'), (3, 'b')]
[(2, 'a'), (3, 'b'), (1, 'c')]


In [11]:
# another example of key : case-insensitive sort (compares the lower-cased strings)
l = [ "hello", "HELP", "Helo" ]
l.sort()
print(l)
l.sort( key = str.lower )
print(l)

['HELP', 'Helo', 'hello']
['hello', 'Helo', 'HELP']


**Set**

The primary feature of a set is uniqueness but, like dictionaries before Python 3.7, sets are unordered. Thus if you want to sort or order their elements you'll have to convert them to a list.

In [12]:
song_library = [
    ("Phantom Of The Opera", "Sarah Brightman"),
    ("Knocking On Heaven's Door", "Guns N' Roses"),
    ("Captain Nemo", "Sarah Brightman"),
    ("Patterns In The Ivy", "Opeth"),
    ("November Rain", "Guns N' Roses"),
    ("Beautiful", "Sarah Brightman"),
    ("Mal's Song", "Vixy and Tony")
]

artists = set()
for song, artist in song_library:
    artists.add(artist)
print(artists)

{"Guns N' Roses", 'Sarah Brightman', 'Opeth', 'Vixy and Tony'}
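To get the artists in a predictable order, one simple approach is `sorted()`, which accepts any iterable (including a set) and returns a new list:

```python
artists = {"Guns N' Roses", "Sarah Brightman", "Opeth", "Vixy and Tony"}

# sorted() never modifies the set, it builds a new list
ordered = sorted(artists)
print(ordered)
# ["Guns N' Roses", 'Opeth', 'Sarah Brightman', 'Vixy and Tony']
```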


But the primary purpose of sets is to use them in combination, to efficiently combine or compare the items in two or more sets.

In [13]:
# set operations
my_artists = { "Sarah Brightman", "Guns N' Roses", "Opeth", "Vixy and Tony" }
auburns_artists = { "Nickelback", "Guns N' Roses", "Savage Garden" }

# .union or the | operator, elements that are in either of the two sets
print( "All: {}".format( my_artists.union(auburns_artists) ) )

# .intersection or the & operator, elements that appeared in both sets
print( "Both: {}".format( auburns_artists.intersection(my_artists) ) )

# .symmetric_difference or the ^ operator, elements that are in one set or the other, but not both
print( "Either but not both: {}".format( my_artists.symmetric_difference(auburns_artists) ) )

All: {'Savage Garden', 'Opeth', 'Sarah Brightman', 'Nickelback', "Guns N' Roses", 'Vixy and Tony'}
Both: {"Guns N' Roses"}
Either but not both: {'Opeth', 'Vixy and Tony', 'Savage Garden', 'Nickelback', 'Sarah Brightman'}
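The operator forms mentioned in the comments can be checked against the method forms with a quick sketch:

```python
my_artists = {"Sarah Brightman", "Guns N' Roses", "Opeth", "Vixy and Tony"}
auburns_artists = {"Nickelback", "Guns N' Roses", "Savage Garden"}

# each operator mirrors the method with the same meaning
print((my_artists | auburns_artists) == my_artists.union(auburns_artists))         # True
print((my_artists & auburns_artists) == my_artists.intersection(auburns_artists))  # True
print((my_artists ^ auburns_artists) == my_artists.symmetric_difference(auburns_artists))  # True

# swapping the operands gives the same result for these three operations
print((my_artists ^ auburns_artists) == (auburns_artists ^ my_artists))  # True
```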


The set operations above return the same result regardless of which set calls the other. The following methods return different results depending on which set is the caller and which is the argument.

In [14]:
my_artists = { "Sarah Brightman", "Guns N' Roses", "Opeth", "Vixy and Tony" }
bands = {"Guns N' Roses", "Opeth"}
print("my_artists is to bands:")

# issuperset : return True if all the elements in the argument are also in the calling set
# s.issubset(t) == t.issuperset(s)
print( "issuperset: {}".format(my_artists.issuperset(bands) ) )
print( "issubset: {}".format(my_artists.issubset(bands) ) )

# difference, returns the elements that are in the calling set but not in the argument set
print( "difference: {}".format(my_artists.difference(bands) ) )
print( "*"*20)
print( "bands is to my_artists:")
print( "issuperset: {}".format(bands.issuperset(my_artists) ) )
print( "issubset: {}".format(bands.issubset(my_artists) ) )
print( "difference: {}".format(bands.difference(my_artists) ) )

my_artists is to bands:
issuperset: True
issubset: False
difference: {'Sarah Brightman', 'Vixy and Tony'}
********************
bands is to my_artists:
issuperset: False
issubset: True
difference: set()
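These order-dependent methods also have operator forms: `<=` for `issubset`, `>=` for `issuperset` and `-` for `difference`. A short sketch with the same two sets:

```python
my_artists = {"Sarah Brightman", "Guns N' Roses", "Opeth", "Vixy and Tony"}
bands = {"Guns N' Roses", "Opeth"}

print(bands <= my_artists)         # issubset: True
print(my_artists >= bands)         # issuperset: True
print(sorted(my_artists - bands))  # ['Sarah Brightman', 'Vixy and Tony']
print(bands - my_artists)          # set()
```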


## Extending built-in

We can add the `__add__` method to any class, and it will be called whenever the `+` operator is used with an instance of that class as the left operand. This is how list concatenation works.

In [15]:
# overriding __add__ so that adding any two values gives 0
class SillyInt(int):
    
    def __add__( self, num ):
        return 0

a = SillyInt(1)
b = SillyInt(2)
a + b

0

[Blog post](http://www.marinamele.com/2014/04/modifying-add-method-of-python-class.html) on `__add__`.
- Define the `__add__` method to be able to add two instances of a custom object.
- Define the `__radd__` method to be able to `sum` a list of instances.

In [16]:
class Day(object):
    """visits and contacts that a webpage generates per day"""
    
    def __init__(self, visits, contacts):
        self.visits = visits
        self.contacts = contacts
        
    def __str__(self):
        return 'Visits: %d, Contacts: %d' % ( self.visits, self.contacts )
    
    def __add__( self, other ):
        """total number of visits and contacts"""
        total_visits = self.visits + other.visits
        total_contacts = self.contacts + other.contacts
        return Day(total_visits, total_contacts)
    
    def __radd__( self, other ):
        """
        sum starts by computing 0 + day1, i.e. (0).__add__(day1);
        since int does not know how to add a Day, Python falls back
        to day1.__radd__(0), so we have to define the __radd__ method
        """
        if other == 0:
            return self
        else:
            return self.__add__(other)

day1 = Day(10, 1)
day2 = Day(20, 2)
day3 = day1 + day2
print(day3) 

Visits: 30, Contacts: 3


In [17]:
day4 = sum([day1, day2, day3])
print(day4)

Visits: 60, Contacts: 6
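An alternative worth knowing is that `sum` accepts a start value as its second argument. Passing an "empty" `Day` means every addition is `Day + Day`, so `__radd__` is not needed. A sketch with a trimmed-down version of the class above:

```python
class Day:
    def __init__(self, visits, contacts):
        self.visits = visits
        self.contacts = contacts

    def __add__(self, other):
        return Day(self.visits + other.visits, self.contacts + other.contacts)


days = [Day(10, 1), Day(20, 2), Day(30, 3)]

# the second argument replaces the default start value of 0
total = sum(days, Day(0, 0))
print(total.visits, total.contacts)  # 60 6
```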


In [18]:
# dir(list) # inspect all the methods

The same is true of all the special methods. If we want to use the `x in myobj` syntax, we can override `__contains__`. If we want to use the `myobj[i] = value` syntax, we implement `__setitem__`, and if we want to use `something = myobj[i]`, we implement `__getitem__`.
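A minimal sketch of these three methods, using a hypothetical `Portfolio` class that wraps a dictionary of prices:

```python
class Portfolio:
    """hypothetical container mapping stock symbols to prices"""

    def __init__(self):
        self._prices = {}

    def __contains__(self, symbol):        # symbol in portfolio
        return symbol in self._prices

    def __getitem__(self, symbol):         # portfolio[symbol]
        return self._prices[symbol]

    def __setitem__(self, symbol, price):  # portfolio[symbol] = price
        self._prices[symbol] = price


portfolio = Portfolio()
portfolio["GOOG"] = 613.30
print("GOOG" in portfolio)  # True
print("RIM" in portfolio)   # False
print(portfolio["GOOG"])    # 613.3
```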


## Case Study

`python3 -m http.server` will start a server running on http://localhost:8000/. 

Notes on **index.html:**
Websites are built inside of directories on a web server, with each web page as a separate file. Sometimes when you go to a URL, there is no file listed in the address. For example: http://webdesign.about.com/ . Even though it's not listed in the URL, there is still a file that the web server delivers so that the browser has something to display. This file is the default page for that directory. On most web servers, the default page in a directory is named **index.html**. In essence, when you go to a URL and do not specify a specific file, the server looks for a default file and displays that automatically - almost as if you had typed in that file name in the URL.

`re.compile` [reference](http://chimera.labs.oreilly.com/books/1230000000393/ch02.html#_matching_and_searching_for_text_patterns) and regular expression [reference](http://www.tutorialspoint.com/python/python_reg_expressions.htm).

In [19]:
# from urllib.request import urlopen
from urllib.parse import urlparse # parse the url into different parts
import requests
import re

url = 'http://localhost:8000/'
# [^...] matches any single character that is not listed inside the brackets
# compile the regular expression up front if it is going to be used multiple times
LINK_REGEX = re.compile("<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")

class LinkCollector(object):
    """
    the link collector will only follow links within its own network location
    (here localhost:8000); links that point to external pages are simply stored
    """
    def __init__( self, url ):
        """
        Parameters
        ----------
        url : str
            any url on the site to crawl; only its network location is kept

        Attributes
        ----------
        collected_links : dictionary
            each key is the webpage's url and the value is the set of links on that page
        """
        # ensure that the url starts with http://
        self.url = "http://" + urlparse(url).netloc
        self.collected_links = {}
        self.visited_links = set()
    
    def collect_links( self, path = '/' ):
        full_url = self.url + path
        self.visited_links.add(full_url)
        
        # page = str( urlopen(full_url).read() )
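        # .text decodes the response body into a str; str() on the raw bytes
        # would give the "b'...'" repr, with quotes escaped inside it
        page = requests.get(full_url).text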
        
        # .findall the links and normalize them using set comprehension
        links = LINK_REGEX.findall(page)
        links = { self.normalize_url(link) for link in links } 
        
        self.collected_links[full_url] = links
        unvisited_links = links.difference(self.visited_links) 

        for link in unvisited_links:
            if link.startswith(self.url):
                self.collect_links( path = urlparse(link).path )
            else:
                # set the external links as key and empty set as value
                self.collected_links.setdefault( link, set() )

    def normalize_url( self, link ):
        """
        Notice that the links may be absolute or relative,
        e.g. http://localhost:8000/contact.html, /contact.html and contact.html,
        so we have to normalize them (convert them all to absolute urls)
        """
        if link.startswith(('http://', 'https://')):
            return link
        elif link.startswith('/'):
            return self.url + link
        else:
            return self.url + '/' + link
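
To see what the two building blocks of the collector do on their own, here is a small sketch that runs the same `LINK_REGEX` and `urlparse` on a made-up HTML snippet (the `page` string is hypothetical, not taken from the case-study site):

```python
import re
from urllib.parse import urlparse

LINK_REGEX = re.compile("<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")

# hypothetical page content with one relative and one external link
page = '<p><a href="/contact.html">Contact</a> <a class="ext" href=\'http://example.com/a\'>A</a></p>'

# findall returns only the captured group, i.e. the href values
links = LINK_REGEX.findall(page)
print(links)  # ['/contact.html', 'http://example.com/a']

parts = urlparse("http://localhost:8000/blog.html")
print(parts.netloc)  # localhost:8000
print(parts.path)    # /blog.html
```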

In [21]:
collector = LinkCollector(url)
collector.collect_links()
for link, items in collector.collected_links.items():
    print( "{}: {}".format(link, items) )

http://en.wikipedia.org/wiki/Cavalier_King_Charles_Spaniel: set()
http://localhost:8000/contact.html: {'http://ccphillips.net/', 'http://localhost:8000/', 'http://localhost:8000/blog.html', 'http://localhost:8000/contact.html'}
http://archlinux.me/dusty/: set()
http://ccphillips.net/: set()
http://localhost:8000/taichi.html: {'http://masterhelenwu.com'}
http://masterhelenwu.com: set()
http://localhost:8000/esme.html: {'http://en.wikipedia.org/wiki/Cavalier_King_Charles_Spaniel', 'http://localhost:8000/hobbies.html'}
http://localhost:8000/: {'http://localhost:8000/esme.html', 'http://localhost:8000/contact.html', 'http://localhost:8000/blog.html', 'http://localhost:8000/hobbies.html', 'http://www.archlinux.org/'}
http://localhost:8000/blog.html: {'http://localhost:8000/esme.html', 'http://localhost:8000/', 'http://archlinux.me/dusty/', 'http://localhost:8000/contact.html'}
http://www.archlinux.org/: set()
http://localhost:8000/hobbies.html: {'http://localhost:8000/esme.html', 'http://ma

In [22]:
hasattr( collector, 'normalize_url' )

True