Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
743 lines (567 sloc) 52.3 KB

Blog

Fizz-Buzz: deluxe edition

Origin story

I remember a game I played in school called “Fizz Buzz” where we sat in a circle and progressed around the circle by counting up from “one”. The rule was: replace the number with “Fizz” if it was a multiple of 5, “Buzz” if it was a multiple of 7 and “FizzBuzz” if it was a multiple of both 5 and 7. There may have been some direction changes (and possibly drinking) involved but I can’t recall. It would look something like this:

“one” → “two” → “three” → “four” → “fizz” ← “six” ← “buzz” → “eight” → “nine” → “fizz”

Fast forward many years… everyone is encouraged to learn computer programming. A favorite exercise often used to teach or learn a programming language is to replicate this game. Another more challenging exercise is to add all the multiples of some numbers, usually 3 and 5, up to some limit. For example:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Check this link to try it out: Project Euler #1

Easy answers (danger! code ahead)

A straight-forward solution uses the basic tools any beginning programmer will quickly learn. In python, using a list comprehension, it might look like this:

print(sum ([x for x in range(1000) if (x % 3 == 0 or x % 5 == 0)]))

>>> 233168

This is a correct solution… but we can do better! The problem with this solution lies here:

for x in range(1000)

So what is the problem?[fn:7] This technique isn’t as efficient as it could be. Its performance (speed) depends on the size of the range. In this example it steps through 1000 numbers, one at a time, checking if each number is evenly divisible by 3 or 5. It then adds the number to a list if either test is true. Finally, it takes that list and adds all the items. What if the question asked for the sum up to 1,000,000? or 1,000,000,000? Even with a fast processor these solutions will take a long time to produce an answer. It turns out there is a way to shortcut this march through a long series of numbers.

Math to the rescue!

There is a mathematical formula that calculates the sum of a series of numbers (of course there is!). A nice explanation is here: Triangle Numbers where you can see the sum of a series can be rewritten as a formula:

1 + 2 + 3 + … n = n(n+1)/2

With this insight I can find our answer to the sum of multiples question without marching through a list of numbers. Here’s how:

First let’s look at just the multiples of threes. A list of the multiples of 3 would look like this:

3 + 6 + 9 + … + 999

which can be re-written as

3 * (1 + 2 + 3 + … + 333)

or

3 * n(n+1)/2 (where n = 333)

In this case n = 333 which is 1000 divided by 3 dropping any remainder (called integer division and uses the // symbol in python). So a python function to calculate the sum of multiples of 3 below 1000 would be:

def sum_of_threes (factor=3, limit=1000-1):
		  n = limit // factor
		  return factor * n * (n+1) // 2

print(sum_of_threes())

>>> 166833

Note that this would work for any factor and any limit. I could also get the answer for multiples of five by passing 5 as an input to the function. So, to keep the function generic I will rename it to sum_of_multiples and use “factor” and “limit” as its inputs.

def sum_of_multiples (factor, limit):
    n = (limit-1) // factor
    return factor * n * (n+1) // 2

print(sum_of_multiples(7, 1000))

>>> 71071

Next step

Now that I have a quick way to calculate the sum of a series of multiples of one number up to a limit, I need to figure out how to do it with two numbers. I could call the function twice, once with a factor of 3 and again with a factor of 5, then add the two sums together! Let’s see what we get:

def easy_answer(limit):
    return sum([x for x in range(limit) if x % 3 == 0 or x % 5 == 0])

def sum_of_multiples(factor, limit):
    n_count = (limit-1) // factor
    return factor * n_count * (n_count + 1) // 2

print("easy answer ", easy_answer(1000))
print("faster answer ", sum_of_multiples(3, 1000) + sum_of_multiples(5, 1000))

>>> easy answer  233168
>>> faster answer  266333

Wrong answer! My answer is too high. It is because I am counting some numbers twice. Those numbers that are evenly divisible by BOTH 3 AND 5 are counted twice, once in each subtotal. For example the threes list contains 15 and 30 and 45 … and the fives list contains 15 and 30 and 45… Therefore I must subtract the sum of the list of common multiples. To find all the common multiples I must use the lowest common multiple (LCM) of the two factors. In this case, using 3 and 5, the LCM is 15. Another example would use the LCM of 4 and 6 which is 12 (not 24).

So I must adjust our “faster answer” by subtracting sum_of_multiples(15, 1000).

Let’s see if it works!

def easy_answer(limit):
    return sum([x for x in range(limit) if x % 3 == 0 or x % 5 == 0])

def sum_of_multiples(factor, limit):
    n_count = (limit-1) // factor
    return factor * n_count * (n_count + 1) // 2

print("easy answer ", easy_answer(1000))
print("faster answer ", sum_of_multiples(3, 1000) + sum_of_multiples(5, 1000) - sum_of_multiples(15, 1000))

>>> easy answer  233168
>>> faster answer  233168

It works! And to show that it is much faster when the limit is very high, I’ll add some code to time each method using factors of 3 and 5 and a limit of 100,000,000.

import time

def easy_answer(limit):
    return sum([x for x in range(limit) if x % 3 == 0 or x % 5 == 0])

def sum_of_multiples(factor, limit):
    n_count = (limit-1) // factor
    return factor * n_count * (n_count+1) // 2

t0 = time.clock()
print("easy answer ", easy_answer(100_000_000))
print("took ", time.clock()-t0, "seconds")

t0 = time.clock()
print("faster answer ", sum_of_multiples(3, 100_000_000) + sum_of_multiples(5, 100_000_000) - sum_of_multiples(15, 100_000_000))
print("took ", time.clock()-t0, "seconds")

>>> easy answer  2333333316666668
>>> took  13.653634 seconds
>>> faster answer  2333333316666668
>>> took  1.100000000064938e-05 seconds

There you can see the easy answer method took over 13 seconds to run while the faster method took only 0.000011 seconds! Thanks math!

But we can do better!

Going further

Before going further, I will switch over to the haskell programming language for the code examples. Hopefully you won’t notice much of a difference (and I get to practice another language).

First, let’s rewrite the python code to haskell code (without the timing):

easyAnswer :: Integer -> Integer
easyAnswer limit = sum [x | x <- [1..limit-1], x `mod` 3 == 0 || x `mod` 5 == 0]

triangle :: Integer -> Integer -> Integer
triangle factor limit = factor * n * (n + 1) `div` 2 where
                        n = (limit-1) `div` factor

sumOfMultiples :: Integer -> Integer -> Integer -> Integer
sumOfMultiples factor1 factor2 limit = triangle factor1 limit +
                                       triangle factor2 limit - 
                                       triangle (lcm factor1 factor2) limit

You’ll hopefully notice many similarities like the list comprehension and the triangle function. Also, you can see a term lcm which is a function that does what I want: find the lowest common multiple of two integers.

So how can I go further? I can make the sumOfMultiples function truly generic so it can take any number of factors and find the sum of their multiples up to a limit. For example, I could find the sum of the multiples of 3, 5 and 7 below 1,000. Before I start, I’ll get the right answer using the easy method so I know what we’re looking for. In fact, I’ll build up a table of sums as I go along.

easyAnswer = sum [x | x <- [1..1000-1], x `mod` 3 == 0 ||
                                        x `mod` 5 == 0 ||
                                        x `mod` 7 == 0]

λ> easyAnswer
271066

My answer is: 271066 (I’ll call this my target number)

And now, using a limit of 1000, I’ll make a table of sums of multiples of 3, 5 and 7. I already know we’re going to subtract the sum of the LCM’s of 3 & 5 (from our previous solution)… and now I’ll have to subtract LCM’s for 3 & 7 AND 5 & 7. And for good measure, I’ll include the LCM for 3 & 5 & 7 to see if I need it. Finally, I’ll add some additional details in the table (I have a feeling I’ll need them).

factors# factorsLCMsum (1000)
313166833
51599500
71771071
3, 521533165
3, 722123688
5, 723514210
3, 5, 731054725

If I follow my earlier strategy, I will first add the 3’s, 5’s and 7’s and then subtract the 3&5, 3&7 and 5&7 sums. That result is 266341. Exactly 4725 too low! It turns out that when I subtract all three pairs I need to add back the 3&5&7 triple. When I do, I get a total sum of 271066!

I feel like I got lucky there, so I am going to go to the next level and try 4 factors and see how it adds up. I’ll use 3, 5, 7, and 11 as my factors and a limit of 10,000 this time. Using the easyAnswer method I get a target number of: 29224307.

I’ll build up my table again (using the easyAnswer method for each row):

factors# factorsLCMsum (10,000)
31316668333
5159995000
7177142142
111114549545
3, 52153331665
3, 72212384046
3, 112331519848
5, 72351426425
5, 11255905905
7, 11277645645
3, 5, 73105478800
3, 5, 113165301950
3, 7, 113231218526
5, 7, 113385125125
3, 5, 7, 114115541580

Whew! Things are getting complicated! Now I’ll follow my strategy again: add the sums of the single factors, subtract the sums of the pairs LCM, then add back the sums of the triples. I get a total of 29265887. Exactly 41580 too high! So it looks as though I must subtract the last quartet (3&5&7&11) LCM for a total of 29224307: my target number!

Now I think I have enough examples to build a general case. Here’s what my intuition leads me to:

the sum of multiples of all factors =

the sum of the LCM sums if the number of factors is odd

MINUS

the sum of the LCM sums if the number of factors is even

Before I get the to the final code, I should figure out how to generate some of the data in our table automatically. First, I need an easy way to generate the sets of factors in column 1. Searching through the Data.List module I found the function subsequences which takes a list and outputs a list of all subsequences from the input. Just what I need!

import Data.List (subsequences)

λ> subsequences [3,5,7,11]
[[],[3],[5],[3,5],[7],[3,7],[5,7],[3,5,7],[11],[3,11],[5,11],[3,5,11],[7,11],[3,7,11],[5,7,11],[3,5,7,11]]

Perfect! (except for the empty list [] at the beginning)

Second, I need a way to calculate the LCM of not just two factors but any number of factors. Here is where a foldl can help:

λ> foldl lcm 1 [3]
3
λ> foldl lcm 1 [3,5]
15
λ> foldl lcm 1 [3,5,7]
105
λ> foldl lcm 1 [3,5,7,11]
1155

Perfect! We’re now ready to put it all together:

import Data.List (subsequences, length)

factorsList :: [Integer] -> [(Int, Integer)]
factorsList factors = [(length a, foldl lcm 1 a) |
                       a <- subsequences factors, a /= []]

triangle :: (Int, Integer) -> Integer -> Integer
triangle (_, b) n = b * ((n `div` b) * ( n `div` b + 1)) `div` 2

triangleListPlus :: [Integer] -> Integer -> [Integer]
triangleListPlus factors limit = [triangle (a, b) (limit-1) |
                                  (a, b) <- factorsList factors, odd a]

triangleListMinus :: [Integer] -> Integer -> [Integer]
triangleListMinus factors limit = [triangle (a, b) (limit-1) | 
                                   (a, b) <- factorsList factors, even a]

sumOfMultiples :: [Integer] -> Integer -> Integer
sumOfMultiples factors limit = sum(triangleListPlus factors limit) -
                               sum(triangleListMinus factors limit)

First I created a factorsList function that put together the subsequences and lcm functions to output a list of all the possible combinations of factors showing their length (number of factors) and their corresponding LCM.

The triangleListPlus and triangleListMinus functions transformed the factorsList elements to their triangle sums: one list for even number of factors, the other for odd.

Finally, the sumOfMultiples function takes the plus list and subtracts the minus list to get the final total.

Not very elegant, but it does the job. To check that it works, I’ll try some test cases:

λ> sumOfMultiples [] 10000         -- should be 0
0
λ> sumOfMultiples [3] 10000
16668333
λ> sumOfMultiples [3,5] 10000
23331668
λ> sumOfMultiples [3,6] 10000      -- should be the same as [3]
16668333
λ> sumOfMultiples [3,5,7] 10000
27142139
λ> sumOfMultiples [3,3,5] 10000    -- should be the same as [3,5]
23331668
λ> sumOfMultiples [3,5,7,11] 10000
29224307
λ> sumOfMultiples [4,6,8] 25         -- [4,6,8,12,16,18,20,24] = 108
108

Look at that! Even the “tricky” test cases appear to pass. An empty list (no factors) returns 0, a case where the LCM of two factors is equal to one of the factors (3 and 6), and a case where a factor is repeated (3, 3, 5). The test case sumOfMultiples [3,5,7,11] 10000 gives me my expected target answer: 29224307! And the last test case shows me that my method handles factors with overlapping LCMs. Not Bad!

But we can do better!

Refactoring

I’m probably not going to get much more improvement in overall performance but I can save a little time and memory space with a few tweaks.

import Data.List (foldl', nub, subsequences, length)

lcmList :: [Integer] -> [Integer]
lcmList factors = tail [sign fs $ lcm' fs | fs <- subsequences $ nub factors]
                  where sign fs | odd (length fs) = id
                                | otherwise       = negate
                        lcm' = foldl' lcm 1

triangle' :: Integer -> Integer -> Integer
triangle' _ 0 = 0
triangle' n f = f * n1 * (n1 + 1) `div` 2
                where n1 = n `div` abs f

sumOfMultiples :: [Integer] -> Integer -> Integer
sumOfMultiples factors limit = sum (triangle' (limit-1) <$> lcmList factors)

main :: IO ()
main =
  mapM_
   print [
    sumOfMultiples [] 10,                  -- should be 0
    sumOfMultiples [0] 10,                 -- should be 0
    sumOfMultiples [] 0,                   -- should be 0
    sumOfMultiples [0] 0,                  -- should be 0
    sumOfMultiples [3] 0,                  -- should be 0
    sumOfMultiples [5] 3,                  -- should be 0
    sumOfMultiples [3] 1000,               -- should be 166833
    sumOfMultiples [3,3] 1000,             -- should be 166833
    sumOfMultiples [3,3,3] 1000,           -- should be 166833
    sumOfMultiples [3,0,3] 1000,           -- should be 166833
    sumOfMultiples [3,6] 1000,             -- should be 166833
    sumOfMultiples [3,6,9] 1000,           -- should be 166833
    sumOfMultiples [5] 10,                 -- should be 5
    sumOfMultiples [5,15] 10,              -- should be 5
    sumOfMultiples [3,5] 10000,            -- should be 23331668
    sumOfMultiples [3,5,7] 10000,          -- should be 27142139
    sumOfMultiples [3,5,7,11] 10000,       -- should be 29224307
    sumOfMultiples [3,5,7,11,13] 10000,    -- should be 30824867
    sumOfMultiples [3,5,7,11,13,17] 10000, -- should be 31950148
    sumOfMultiples [4,6] 25,               -- should be 108
    sumOfMultiples [201,203,207] 10,       -- should be 0
    sumOfMultiples [17,37,67] 10000000000000000000000000000000000000    
               ]

λ> main
0
0
0
0
0
0
166833
166833
166833
166833
166833
166833
5
5
23331668
27142139
29224307
30824867
31950148
108
0
4896424079918373158057091331893790194458273022803312531143962223856863266
λ> lcmList [3,5,7,11]
[3,5,-15,7,-21,-35,105,11,-33,-55,165,-77,231,385,-1155]
λ> triangle' 10000 15
3331665
λ> triangle' 10000 (-15)
-3331665
λ> triangle' (10000-1) <$> lcmList [3,5,7,11]
[16668333,9995000,-3331665,7142142,-2384046,-1426425,478800,4549545,-1519848,-905905,301950,-645645,218526,125125,-41580]
λ> '

First, I cleaned up the triangle function for readability and added a case where: if the only factor is zero then it returns zero (otherwise it would return a divide by zero error. I also designed it to negate the sum if the factor is negative. Finally, the new ~triangle’~ only takes a limit (Integer) and a list of factors (also Integers).

Second, I combined the old factorsList, triangleListPlus, triangleListMinus functions into one lcmList function. The subsequences function still creates the desired list of factor combinations I need – and the lcmList function transforms each sublist into its contribution (plus or minus) to the final sum.

Finally, the new sumOfMultiples maps the triangle'~ function over the ~lcmList and sums the resulting list for the final result.

You can see in the printout that this function handles empty lists, zero limits, zero factors, duplicate factors, similar factors (e.g. 3 and 6), and many factors as well as very large limits. It even handles negative factors – treating them as their positive equivalent.

Summary

I started this journey with a simple programming exercise: Find the sum of all the multiples of 3 or 5 below 1000. I showed a simple and correct method of computing the answer using a list comprehension. However, that method took $O(n)$ space and $O(n)$ time (where $n$ is the size of the limit) to compute the answer. Clearly this is not a good method if the limit is very large as it creates a large list in memory and takes tens of seconds or longer to compute when the limit is very large.

I then found a mathematical shortcut that calculates the sum of a series of natural numbers:

1 + 2 + 3 + ... + n = n(n + 1)/2

From that insight I explored how multiple factors and their lowest common multiples added (or subtracted) to the final total. I then created a series of functions that produced the building blocks to produce the final total: and my final answer for any general case.

My final step was to refactor (rewrite) my functions and combine them to produce a simpler (and hopefully more readable) set of functions.

The final method takes $O(2^k)$ space and $O(2^k)$ time (which looks very scary) BUT this time $k$ is the number of factors instead of the size of the limit! Assuming the number of factors is 20 or below (which I think is reasonable), this is a huge performance boost!

My journey began with a well known, easy(?) math/computing problem, solvable with a simple but inefficient solution. I went from a specific solution to a general case. Finally, I ended with a tidy and efficient solution.

Until next time!

List of links

Here’s a list of things I learned (or at least practiced again):

Download data from the web

Task

A colleague of mine has been entering meta-data for books through a web-based portal (let’s call it site A) over the last several years. Earlier this year we found out we needed to transfer a great deal of that data from the website to another place (site B). Unfortunately, even though there was a way to export a summary of all the entries which contained much of the information, there were several important pieces missing. In order to complete the transfer, we needed a way to add the missing data. If we had only 20-50 titles it would be feasible to copy-paste the missing fields into the summary export. But there were 962 titles! This would have taken a long time and would also be prone to errors. My task:

Find a way to automatically copy the missing data from the website to a spreadsheet. The spreadsheet shall have a column of ISBNs as an index and each row shall contain the missing data.

Steps

My gameplan was to use python to download the webpages with the data, search for the missing data and copy it to a file. I started by using the standard library urllib and tried to download a page. That’s where I hit my first snag. The site I was using required a login and was actually an https site and not an http site (which is standard these days (which is standard these days). I found a way to handle this using a library called http.cookiejar. Like so:

logging in with cookiejar

After some searches on Stack Overflow, I found a way to do this:

import http.cookiejar
import urllib.request

# Storing cookies in cj variable

cj = http.cookiejar.CookieJar()

# Defining a handler for later http operations with cookies(cj)

op = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Logging in

URL = ('https://target_site.com/Account/Login?ReturnUrl=%2F')

# login data below should be hidden/deleted before posting

VAL = {'Username': 'USERNAME', 'Password': 'PASSWORD'}

DATA = urllib.parse.urlencode(VAL)
ASCII_DATA = DATA.encode('ascii')
RES = op.open(URL, ASCII_DATA)

Using two common python libraries, http.cookiejar and urllib.request, I was able to login and maintain my login status in a cookie just like a web browser does. Cool!

Once I could do this, I could proceed to open the webpages I needed. Next I needed to figure out how to step through every page (one title per page).

URL parsing

The only way I could see to step through all the titles was to mimic how I would do it in a browser: navigate to the summary page(s) and then click on each title in turn. At each step I examined the URL to see what changed. The summary page URLs looked like this:

https://target_site.com/content/list?pagesize=20&page=1

I could see two variables: pagesize and page. Sure enough, as I navigated through the summary pages, these were the only variables that changed in the URL. Since I knew we had 962 titles, I merely (cough, cough) had to step through the summary pages to see them all. In the end, I chose PAGESIZE = 120 and PAGES = 9. But for testing I could use smaller numbers.

# 962 titles can be seen with 9 pages and 120 each page
# set these smaller when testing

PAGES = [1, 2]
PAGESIZE = 10

# iterate through the pages

for p in PAGES:

    list_page_url = 'https://target_site.com/content/list?pagesize='\
                    + str(PAGESIZE) + '&page=' + str(p)

Needles in Haystacks

Now that I could navigate to each title’s web page, my next task was to find all the missing data. One way would be to search through each webpage’s HTML text and find all the sections containing the missing data. I would have to examine a sample page’s HTML code, look for the sections I need, find a way to describe that section using HTML tags or something similar, and then copy the data between the tags. Fortunately, there are already good libraries that can help work with HTML and I chose one called BeautifulSoup. So instead of searching through the raw HTML text, I could instead search an object. Time for some code!

# create a searchable object for each summary page

list_page = op.open(list_page_url)
list_soup = BeautifulSoup(list_page, 'html5lib')

# iterate through the book titles on the page

for link in list_soup.select('a[data-content-id]'):

    item_page_url = 'https://target_site.com/content/edit/'\
                    + link.get('data-content-id')

    # create a searchable object for each book title

    item_page = op.open(item_page_url)
    item_soup = BeautifulSoup(item_page, 'html5lib')

First I had to open each summary page and turn it into an object. Then I could use the data-content-id to point to each title. Once I opened each title’s page I then transformed it into a BeautifulSoup object ready for searching.

The two important pieces of missing data were a title’s long description and any category codes that help define the genre of the book. While I’m looking at the page I will also copy the ISBN so I can see which title I’m copying from.

Here’s the code for finding the ISBN and the long description:

# extract the meaningful data we're looking for
# (BIC and long description)

isbn = item_soup.find('input', id='Identifier')['value']
long_description = item_soup.textarea.get_text().strip()

See? Once I had a page’s code transformed into a searchable object (item_soup), it only took one line to get the data I needed. It did take some trial and error to see what tags to look for, but once I found them, it worked every time.

However, the category codes turned out to be a little tricky.

Javascript, oh my!

There were two complications when I started looking for the category codes. The first was that each title could have up to two types of category codes (BIC or Thema) and any number of category codes for each type. And it was also possible for a title to have zero codes (missing data). The second was that the page displayed in the browser had embedded javascript to help render or edit the category data. BeautifulSoup works well with HTML but not with javascript. Oh no!

After researching how to handle javascript, I decided to simply search the javascript code using python’s regex library and find the data I needed. Fortunately, the data was there - it just needed a way to be found. It would have been much harder if the javascript didn’t have the actual codes I was looking for (maybe it pulled the codes from a database somewhere else?).

There were several chunks of javascript code throughout the page. I had to find the right chunk and then search for what I needed. I searched the page’s object for the right chunk and then searched that blob of text for the data.

Remember the first complication? I needed a way to handle the varying types and codes that could exist for each title. I noticed that the blob looked very much like a python dictionary - so I decided to transform the blob into a python dictionary (using python’s exec) and then pull out what I needed from there.

Time for some more code!

# BIC codes are embedded in javascript so we have to isolate and
# search for the part we need. It is found in a variable
# called "subjects"

js_text = item_soup.find_all('script', type='text/javascript')
var_subj = re.search(r"subjects = \[.+\]", js_text[-1].get_text())
bic_code = None
description = None

'''
if we find the variable we convert it to a list of dicts (like below):
    [{'code': 'JFSL1',
    'description': 'Etniske minoriteter og multikulturelle studier',
    'label': 'JFSL1: ',
    'mainSubject': False,
    'qualifier': False,
    'subjectId': 14110,
    'type': 1},
    {'code': 'JBSL1',
    'description': 'Her: racerelationer',
    'label': 'JBSL1: Etniske minoriteter og multikulturelle studier',
    'mainSubject': True,
    'qualifier': False,
    'subjectId': 20410,
    'type': 3}]
'''

if var_subj:

    fix_false = re.sub('false', 'False', var_subj[0])
    fix_true = re.sub('true', 'True', fix_false)
    exec(fix_true)

# create some subject code lists

bic_list = []
thema_list = []
for sub in subjects:
    if sub['type'] == 1:
        bic_list.append(sub['code'])
    if sub['type'] == 3:
        thema_list.append(sub['code'])

Wrapping it up

Almost there! To wrap everything together I decided to use this data structure:

book_data = {"isbn" : {"BIC_list" : [bic codes], 
                      "Thema_list" : [thema codes],
                      "long description" : long_description}
            }

Finally, to make everything even easier to work with and export to Excel, I transformed the dictionary into a pandas DataFrame which easily exports into Excel formats.

# create a DataFrame object, transpose the data
# and send it to an Excel sheet

BOOK_DF = pd.DataFrame(book_data)
BOOK_DF = BOOK_DF.transpose()
BOOK_DF.to_excel('book_data.xlsx')

Final Analysis

  • Did I accomplish the task?

Yes. I successfully copied all 962 titles and created an Excel spreadsheet which included ISBN, long description and any category codes.

  • What did I learn?
    • how to handle cookies and log into an https site in python
    • how to use BeautifulSoup to work with HTML downloads
    • how to use regex to search a blob of text (in this case JavaScript code-
    • how to transform data structures from one type to another (JavaScript -> python dictionary and python dictionary -> pandas DataFrame)
  • How much time did I save?

It took me about a day to get all this code working, so I estimate about 6 hours of work. If I had done it manually, by clicking and cutting and pasting into a spreadsheet - for 962 entries - I estimate it would have taken about

962 titles x 30 seconds/title x 1 hour/60 seconds = 8 hours

8 hours!

I consider that a big win! And not just two hours. Considering I would have probably made a few errors with all the clicking, copying and pasting - not to mention the RSI pain I would have had afterward - the automatic solution is much more reliable. Also, the next time I need to do this kind of thing, I will probably take around half the time to set up the code - now that I am familiar with the tools in python.

Just a Hobby

The Start

Programming. It’s my new hobby. Well, also reading about programming. Hacker News, Reddit, all kinds of blogs and other sites. Also computer science. Also a hobby. Algorithms, big O notation, functional versus imperative languages, searching, sorting, data structures, strong versus weak typed languages, operator overloading, etc. Ad infinitum. I think it’s fascinating. Not terribly fun sometimes, not easy to grasp most of the time. But still… it’s fascinating. And empowering.

I grew up at the start of the personal computer era. My neighbor worked at a local IBM facility and had an Apple II in his basement (was that disloyal?). Another one down the street built his own (I think a Heathkit H8?) My first computer was an IBM PC with an 8088 processor (yes, before the 8086, 286, 386, 486, pentium… etc.) with a green CRT monitor just like you’d imagine. It was an incredibly expensive toy for a kid to play with. I took it to college and never did much more than write papers and play with spreadsheets. Actually, I did my senior project on it… wrote my senior thesis and wrote some very basic programs in TrueBasic. I even used Lotus 1-2-3 and began using spreadsheets for some basic engineering work. Yes, I’m that old.

I never really learned to program though. I took one semester of Pascal… hated the whole process. Hated the waiting in line at the computer lab to sit in front of a terminal to debug a 150 line program. I recall we were to use 3D matrix transformations on basic images. Rotate, resize, translate. It never clicked with me and I didn’t have the patience for it. I never felt like I knew how to even approach the task. I just wrestled with syntax. And Pascal wasn’t forgiving. I wonder if things would have been different if Turbo-Pascal had come out just a few years earlier. What took us hours could have been done in seconds. Who knows?

That was 1986. Fast forward to 2014.

My first reaction to that temporal jump is: holy cow! Almost thirty years! What I hold in my hand does more than the entire computer lab at my university could do. I can actually write powerful code… in minutes!.. on my phone!.. It used to take me days to struggle with code that never even worked in the end. Sigh.

What changed? Everything.

The Journey

After my first steps in high school and college, I spent many years in the world of Microsoft; starting with DOS 6.2 and progressing through windows 3, 3.1, 3.11, windows 95, 98, NT, 2000, then to windows ME and XP… I saw it all. I saw the MS Office suite take over and dominate the world. I always had some sort of PC in the house. Nothing too powerful. Always in the $3,000 range.

I remember my first modem was 300 baud. Then I was excited to upgrade to 14,400 baud. That was so I could connect to Compuserve and what was then a very exciting “internet”. No programming though.

Years later, I remember seeing the first browsers Gopher and Netscape Navigator. Even Internet Explorer. Those were interesting days. Still, no programming. One colleague of mine was pretty absorbed in his computer world and programmed some pretty cool things in Visual Basic. I was impressed. I was more of a power user of applications. Spreadsheets, slide shows and the like. Never real programming though.

Sometime around 2001 I switched over to the Mac. It was when they switched over to Mac OS X and I still have a powerPC laptop with both System 9.2 and 10.1 on it. Mac was new to me then so I thought the older system looked… older. The newer system was shiny but pretty slow. Not that I cared much though. I was tired of the never-ending reboot -> reinstall program ->reinstall windows -> cycle of misery. Firewalls, antivirus programs, malware detectors… it was a lousy experience. But one that everyone seemed to tolerate. I wanted something different. So I switched.

Since then I have used Mac OS 10.1 through 10.9 and the switch from powerPC to Intel. And I have seen the resurgence of Apple as a contender for the home PC. I have also seen the emergence and complete dominance of computing with phones and tablets. Android, iOS, windows mobile and the ill-fated web OS and Blackberry. All this time I would call myself a computer enthusiast. Just smart enough to become the family tech guy but not smart enough to make it a career. Maybe a technology enthusiast with a minor in computers. Anyhow, it all led to a slow re-acquaintance with programming.

You notice after a while, the closer you are to technology, the more you encounter programming. Often it is disguised as something simple. Often it’s called “automation steps” or “scripts” or “macros”. Even formulas in Excel are a form of programming (take this number, multiply by the number to the left and add the number above… do this for every row). These days you find it in web applications (IFTTT comes to mind) and other apps (Hazel, for example). I never felt intimidated by the notion of programming, but it never felt like I was really programming (as I think of it). I always thought of programming as something closer to mathematics or data manipulation. Solving esoteric problems or crunching numbers. Of course it is a whole lot more than that. In it’s entirety, programming (developing software) is at the heart of everything we experience with technology. It is the great enabler.

The Courtship

And so I found myself lured by the siren call of programming. This time that world was much different than the one I left in 1986. Sure, compilers are still around. Even Pascal is still around. However, so are dozens of new, faster, easier, more powerful languages that make the programming experience much less frustrating and much more enjoyable.

So where does one even begin? Why not use your favorite internet search engine (a powerful set of programs by the way) and ask that very question? I didn’t, but if I do it today I would find the same resources I still use. Turns out, millions of people ask the very same question and there are thousands (probably millions) who are ready to answer or at least point you in the right direction. Welcome to the collective intelligence of the internet!

The top hits of a search on [Google.com] include an online course called “Getting Started with Programming”, an article titled “What’s the Best Programming Language for a Beginner?” and a YouTube video titled “Beginning Programming - Tutorial 1 Introduction”. And there are 10,000 more search results! As you can guess, it’s a hot topic and the internet is a perfect place to start.

Like I said earlier, I didn’t begin this way. I approached it in a slightly more indirect way. As I mentioned, I switched to Mac OS in 2001. When you make that switch, you can’t help but notice the complete difference there is between the Windows world and the Unix world. Since Mac OS X is really a fancy layer on top of a Unix operating system (just as Windows is or was a fancy layer on top of DOS) you must relearn some basic computer skills. For those who can find a command prompt on Windows and run a few basic commands (the cmd prompt and the DOS shell), there is a parallel experience on the Mac but in a new language (the terminal and bash). When I first encountered it, it felt like driving on the left side of the road in the UK for the first time. The same, but different.

Along with a new operating system came new system commands, a new file system, and new toys. Preinstalled on a Mac are things like Apple Script and Xcode. Apple Script is the friendly looking scripting language that allows one to automate nearly anything on a Mac. Only, it isn’t very friendly and can’t (easily) do the things I imagine a real program should do. But to be fair, I haven’t given it much of a chance yet. Maybe someday. Xcode is the intimidating developer application that gives any Mac owner the ability to create any program they can imagine - as long as they learn the lingua franca of the Mac world: Objective-C. Yikes! Have you seen what that looks like?

However, hidden among the many goodies of the Mac system are a couple of other pre-installed programming languages. And I’m not talking about BASH which is the default terminal shell environment (similar to the command prompt in Windows). Bash is itself a scriptable language (and a pretty powerful one too but focused mainly on system level things). No, I’m talking about Python and Ruby. Two relatively young, dynamic programming languages that come preinstalled and ready to run with just a one word command: “python” or “irb” ( the “irb” comes from “interactive ruby” and ruby files usually end with .rb).

I was vaguely aware of the dozens of newer programming languages being used by those magical “programmers”. I knew of C, C\++, Java, Perl, PHP, Lisp, Javascript, Python and Ruby to name a few. But I couldn’t tell one from another and I surely couldn’t understand their differences. I was aware of the popular concept of object oriented programming (OOP) but couldn’t describe it in any meaningful way. Still, the attraction was there for me. Right under my fingers was a tool I could learn that could let me do magical things! Real programming languages. How hard could it be?

Pretty hard. That’s how hard.

The Wedding

Most honest programming teachers will tell you up front: learning to program is hard. It is a new skill much like learning a new language. It takes concentration and commitment. It takes practice and study. It covers many disciplines such as mathematics and logic. It requires attention to detail and it takes a willingness to ask questions and search for answers. But none of this should scare you! Every honest programmer will also tell you: it was hard for them when they started too. And the barrier to learning is lower and lower every year.

So where did I begin? I began with Python. And Ruby. But mostly Python. And since I was curious as to which was the better one to start with, I read all the “Python vs. Ruby - which is better?” and “Compiled vs. Interpreted - which is better?” and “Learn C first before you learn another language” and “Java gets you more jobs” … blah blah blah blah… blah. As entertaining as all that is, it really isn’t helpful when starting out. After a while, you realize that every programming language has its strengths and weaknesses and no language is perfect for every task. But some languages are easier to learn than others, especially when you’re just a beginner. Python is one of those languages. Easy to grasp the basics and flexible enough to be forgiving. Many will argue that Ruby is better, or that you should start with a real language like C or Java. It comes down to a personal choice - which path will help you learn the basics in a way that makes sense to you. For me it was Python. And I see it is becoming the language of choice for beginning computer science courses and other computation oriented fields. So I think others can recognize its merits too.

In short: Python it was. I started the usual way: with a tutorial from the main site [python.org]. It does what many tutorials do. It starts with some basic math to show you how it calculates things. Then it introduces variables, functions, loops, branches… and away you go. But where exactly are you going? And why? There is always a point where you lose the purpose and the meaning of what you are shown. What is this used for? Why should I do this? Some of those questions can be answered in another forum. I found that out when I took an online course called CS 101 from [Udacity.com]. This course covered some of my why questions and introduced some good techniques for how to approach programming challenges. And it used Python as its language. Yeah! Things began to make more sense. Concepts began to click. When I finished that course I felt I could begin to program for real. I was hooked. And it was just the beginning.

Shortly after that class I took a few more similar ones, more practice with the language, more problem solving skills. It’s all a form of mental exercise - and it feels like my brain is the muscle getting a workout. I even branched out into some other areas such as statistics, finance and differential equations. All of those were now accessible to me with my basic programming skills. And I freely admit that basic is the best word to describe my skills.

In the mean time I have discovered a new language that has challenged my idea of programming. Haskell. Known as an “academic” language based on mathematics and set theory, this language forces me to approach programming in an entirely different way than I would with Python. The main difference is that Python (an imperative language) requires me to describe how to solve a problem, step by step. Haskell (a functional language) requires me to describe what a solution looks like without having to describe how to get it. It is a completely different way of approaching a problem and it forces me to think in a “new” way.

A language like Haskell appeals to mathematicians (I’m told) because its code looks like mathematical equations. It appeals to many programmers (I’m told) because of its elegant, minimal language and its pure, functional nature. It also has a strong type system that helps one to understand what’s going on in a program. I’ll refer to other resources ([learnyouahaskell.com]) to explain this better than I can. I can only say that it appeals to me because it’s a different way to think about programming and I’m enough of a novice not to harbor any prejudice. Finally, I must also admit that if I describe my Python skills as basic, my Haskell skills would qualify as beginner. Still, it’s so interesting.

So, all that sounds wonderful but what good is it really? Here’s an example: I am a novice investor who likes to track my investment performance and investigate better methods of investing. Although I follow a “buy and hold” strategy and only think in the long term, I do like to play around with various theories centered around optimum portfolios and efficient investing. This led me to theories on minimum variance portfolios and capital asset pricing models. So I took a course in computational investing so I could see the methods explained. The course relied heavily on Excel and a little R (a mostly statistics oriented language). I could follow the theory and the methods with no trouble. But it was rather tedious to work inside a spreadsheet for all the heavy calculations involved.

For example, the first few steps involve downloading some price histories of a few stocks, pasting them into a spreadsheet, reordering the prices, calculating returns on the stock and finally calculating the mean and standard deviation of the returns. If you have to do that for more than a few stocks, and for more than a few years worth of data, it gets very tedious and error prone. The next steps include creating a covariance matrix for all the stocks and finally setting up a Solver scenario to find a maximum return for a minimum variance. Have you set up matrix formulas in Excel? (Hint: they can be tricky) This little project alone takes a few hours to set up and run. And if you want to change any of the inputs (like time period, which stocks to include, etc.) it takes almost as long to reset the spreadsheets as it did to set it up in the first place.

So I decided to try my hand at programming a better solution. With my simple program I am able to input a start and stop date and a list of stocks (or whatever thing you want prices for). The program goes to the web, downloads all the prices you require, and performs all the calculations required to return the same answer as the “manual” Excel method. In seconds. As many times as you want. With as many different inputs as you want. Complete with publication quality graphs!

Happily Ever After?

And what else do I learn after an exercise like that? I get to (re)learn about matrices, statistical variance and covariance, maximize/minimize numerical methods and Lagrange multipliers. That’s just one little project! And that’s not including the new experiences with python’s powerful packages such as numpy, scipy, pandas and matplotlib (heavy duty number crunching and plotting packages).

After an exercise like that, I look around and see what applications are for sale in the app stores that may do something similar. Nothing that I can find. I even looked for finance apps that could calculate rates of return for investments. Again, nothing that I can find. The closest thing I can find is Quicken from Intuit. Way more than I need and it doesn’t do what I would like anyway.

So there’s an opportunity to create something of value for me to use - and maybe sell someday to others. Once the basic steps are done and tested, it only remains to build an interface that anyone can use and see clear results. What kind of results? How about: Using the last five years of data and a list of stocks and bonds, if I pick ten stocks then what portion should I allocate to each to give me the best return with the lowest risk? Or, which combination of eight stocks and bonds gives me the best return with the lowest risk? How about using only two years of data? What if I only want four stocks? Or, given my investment history, what is my internal rate of return? (Analogous to your average annual return) And how does that compare to the overall market? Doesn’t that sound interesting?

A project like that is more in the realm of a developer rather than a just a programmer. There are a whole new set of skills to learn to be a developer. Creating data structures, databases, application interfaces, testing and deployment just to name a few. See where this leads? It all started with curiosity about programming and now I’m talking about making an app for sale in the app store. And that’s what is so great about a hobby like this one. It is so powerful. Little by little, the tech around me looks less magical and more attainable, understandable.

Yes, I’m hooked.

Why the name “Dave 2.1”?

$O(nlogn)$

About

Colophon

This site is written and edited in Emacs.

But that’s not the whole story…

Here’s a list of stuff I use to make this site:

Once all those pieces and parts are linked together[fn:8], I just write entries in a text file using org mode in Emacs. All the rest happens automagically.

About Me

Some stuff about me here[fn:9].

Footnotes

[fn:9] yet another footnote

[fn:8] (i.e. Connecting · the · · dots · · ·) Each of these parts deserves a deeper dive - I’ll add more entries when I’m ready.

[fn:7] I guess technically not a problem. Just that it is an inefficient algorithm. Also, not very memory space efficient since it builds a list before it finds a sum. But now we’re getting ahead of ourselves!

[fn:6] this should be footnote number 2

[fn:5] more footnote stuff here

[fn:4] just checking the numbering here

[fn:3] how do you like the footnotes?

[fn:2] how will this be numbered?

[fn:1] using either SPC f s or ~, e e H A~

You can’t perform that action at this time.