# Week 4: lists and subloops

# Admin announcements
Remember to email or post on the forum if you have any questions.

# This week

We're exploring our first real data collection type.  We've seen data types before, but they were all content based.  Lists are more of a container data type, where you put other objects inside them.

# Meaning of being a sequence

As a review, sequences have order, meaning that individual items have specific positions.  You can use those positions:

* as their **explicit meaning** (so the numbers are directly meaningful)
* as a **transformed meaning** (so you can add 1 or do something else to the position number to make it meaningful)
* with a **referenced meaning** (such that you can use the position number to look up the meaning)
* with a **structural meaning** (where the position in the structure of the data itself is meaningful)

These items can also be individually manipulated.  Once referenced, they can be stored in memory, placed in another structure, and used as a source for further data.

We've explored some of these uses with last week's module on the Raven, where we used the explicit meaning of the position numbers to reference the line numbers (with a little transformation on top to make the numbers more human readable).

You may not have picked up on it, but we were also using a referenced meaning.  When we wrote the `range()` to produce the position numbers (that we could transform into our human readable line numbers), there wasn't actually any real connection between that range and the list.  Neither knew about each other, but we knew about both and knew that their patterns and structures were compatible.  So we were able to craft a `range` call that produced numbers that we could use to look up the lines that we wanted.
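As a quick refresher, here is a minimal sketch of that idea.  The `poem_lines` list is a made-up stand-in (not last week's actual data), and the variable names are only for illustration.

```python
# Hypothetical stand-in for last week's list of poem lines.
poem_lines = ["first line", "second line", "third line"]

# range() only produces position numbers; it knows nothing about poem_lines.
for position in range(len(poem_lines)):
    line_number = position + 1        # transformed meaning: human readable number
    line_text = poem_lines[position]  # referenced meaning: look up the content
    print(line_number, line_text)
```

The only thing tying the numbers to the list is that we built the `range` from the list's length.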

# A note on `input()` vs `eval(input())` vs `int(input())`

The `input("prompt text:")` function is given a string prompt to present to the user and waits for the user to enter some data.  You put what you want shown to the user in the `()` and what the user gives back is what's returned back from this function.

You've only used this for numerical input at this point.  This week we're going to use it to get string input, and the usage will be a bit different.

The use of `eval()` is a funky one, but it is an easy (if blunt) way of interpreting a string as a number when the string contains one.  There are better ways to do this, but they are a bit more complicated.

You've also been wrapping this with `int()`, which converts the input into an integer number.

Because we want to retain the string value of the phrase given to us by the user, we will be using `input()` alone and not wrapped inside of any other function.

Here's a quick decision tree for which form to use:

What do you want from the user?

* Numbers: use `int(input("prompt..."))`:  this converts the text the user entered into an integer (use `float()` in place of `int()` if you need decimals)
* Text:  use `input()`:  it will retain the input coming in from the user and store it as a single string
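Here's a small sketch of the difference.  To keep it runnable without anyone typing, the two `typed_...` variables are stand-ins (my own invention) for what `input()` would hand back, which is always a string.

```python
# Stand-ins for what input() would return; input() always gives back a string.
typed_number = "42"
typed_phrase = "I am a penguin"

count = int(typed_number)   # convert the text "42" into the integer 42
phrase = typed_phrase       # keep text input exactly as the user typed it

print(count + 1)            # math works now that count is an integer
print(phrase.upper())       # string methods work on the untouched text
```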

# Lists in context

We're going to be working through a problem statement here so we can see lists in context of their use.

Sometimes you'll want to make a list directly within your code, but many other times you are using a function, method, or other tool to create a list for you.  This means that you start with content in another data type, do something to it, and then you end up with a list.  These lists usually hold some version or metadata about the original content.

You would have seen some of this with the string methods that you played with in homework one.

You may also be a little confused when you see more uses of square brackets.  Here's a quick way to understand what you're looking at:

* when you see `[]` hanging around on their own in the code and the `[]` aren't directly following anything, you've got a list
    * example:  `words = ['hello', 'human', 'student']` is a list
* when you see `[]` sitting directly after something, either some content or a variable name, you've got an extraction syntax.  So this is slicing something out of that thing right before it and not a list.
    * example:  `"hello"[:3]` is a string that is being sliced
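A short side-by-side sketch of the two uses of square brackets (the variable names other than `words` are just for illustration):

```python
words = ['hello', 'human', 'student']  # brackets on their own: building a list

first_word = words[0]         # brackets right after a name: extracting one element
first_two = words[:2]         # slicing a list gives back a smaller list
first_letters = "hello"[:3]   # slicing a string gives back a smaller string

print(first_word)
print(first_two)
print(first_letters)
```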
    
So we're going to work through a real dataset problem and see how lists are being used.

# List quick facts we're going to explore in more detail

* **lists are surrounded by `[]`**, so when you see that in the code or you see it surrounding an output that you're printing, you're looking at a list
* lists can be empty, with a length of 0.  
    * These appear as `[]`
* **lists contain elements**, and when viewed in the list's content, those elements are visually separated by commas.  These are solely for our human eyes and not actually part of the content of the elements within the list (unless the commas are inside of that content, such as a string with a comma).
    * This appears as: `[1, 2, 3]` is a list of three integers and `['hello', 'yes, I would like to science', 'I am a penguin']` is a list of three strings.  Note the commas that are outside of the strings are separating the three elements, but the comma inside of the string is just part of the string's content.  You'll get used to looking at this, but it'll take a little bit.
* **lists can hold any kind of object, including mixes**
* **lists have an order**
    * `[1, 2, 3]` is a different list than `[3, 2, 1]`
* **lists can be sliced** using the same extraction syntax as strings.  This works by a similar mechanism, but it goes over the individual elements, not pieces of content within those elements.
* **lists can be looped over** in a regular `for` loop.  The loop will unpack the list by individual elements.  A list of three elements will make a loop execute three times, and the iterable variable will hold the element values one at a time.
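If you'd like to poke at these facts yourself, here is a small sketch.  The lists are made up for the demonstration.

```python
empty = []
mixed = [1, 'yes, I would like to science', 3.5]

print(len(empty))               # an empty list has a length of 0
print(len(mixed))               # the comma inside the string does not add an element
print([1, 2, 3] == [3, 2, 1])   # order matters, so these are different lists
print(mixed[:2])                # slicing works element by element

loop_count = 0
for element in mixed:           # the loop runs once per element
    loop_count = loop_count + 1
    print(element)
```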


# Problem statement:  count the downloads in this non-tabular data file

Given a data file of data repository records ([Illinois Data Bank](https://databank.illinois.edu/)), extract the number of downloads for each record and calculate the total number of dataset downloads.

This data file contains individual data deposit records.  These records contain things like the citation, funder, and downloads.  Here's an example record:

```
Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1
Funder: U.S. Department of Energy (DOE), Grant: DE-SC0010778
Downloads: 10 (2017-09-08 to 2017-09-13 )
```

For the sake of keeping this short, I'll read the file into memory.  This isn't a concept that we are going to cover until later, but the end result of this piece of code is that the contents of the text file are now going to be stored in a string variable called `full_text`.  I also have a small sample of the file contents that is useful for developing and exploring our methods.

In [1]:
# we'll be discussing reading in files later!
# just accept this at face value and don't worry about it

f = open('report.txt', 'r')

full_text = f.read()

f.close()

Here's a small snippet we can see in one screen:

In [2]:
sample = """Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1
Downloads: 9 (2017-08-30 to 2017-09-13 )
-----

Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1
Funder: U.S. Department of Energy (DOE), Grant: DE-SC0010778
Downloads: 10 (2017-09-08 to 2017-09-13 )
-----

Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2
Downloads: 6 (2017-09-06 to 2017-09-13 )
-----

Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in polynomial time using OCTAL. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8402610_V1
Funder: U.S. National Science Foundation (NSF), Grant: CCF-1535977
Funder: U.S. National Science Foundation (NSF), Grant: DGE-1144245
Downloads: 47 (2017-06-15 to 2017-09-13 )
-----"""

# What's our structure here?

We've got data on datasets here, but this isn't a tabular data file.  We can't open it up in Excel or directly do computations on it, yet.  We need to transform it into something with a standard structure for analysis.

A few things to note:

* the number of **lines per entry is variable**, so we can't exploit a steady mathematical structure like we did with the Raven last week.
* even if variable, we can see that the **lines inside each entry are meaningful**
* each line **has a field label** followed by a : and then content
* some of these **fields have multiple entries** (so multiple lines with the same field entry)
* some of the **line content has multiple values within it**
* the string **"-----" appears** (or seems to appear!) between each entry

# A basic framework for an approach

Last week we learned the core accumulator and counter patterns.  We were doing so without a #dostuff problem.  This week we're going to explore how to take care of business and do data extraction as part of our accumulators.

Many of these things are "hands off the keyboard and think" problems.  Your code development process will be in stages, where you apply patterns and elements on the same block of code.  Much like a painting of a person involves wire framing and sketching, there's a non-linear process involved.  You will not be writing this one line by line.  You'll be adding things in with passes, checking your work along the way.  Here is a general thought process for considering these things.

1.  Identify the sequence in question. 
    * Where does it live and what is the variable name?  Do you need to transform the sequence?  
2.  Run the sequence through a for loop printing out the iterable item until the individual results are what you want.  Don't worry about getting too far into the deep end here, because you can further extract out the data you want from the iterable.  What you're after is isolating the individual data containers.
    * Just because it is a sequence doesn't mean that what a for loop will grab from it will be what you want.  This can be the roughest stage, and often requires experimentation.  You may need to split a string up into a list, or do some fun slicing stuff, etc. 
3.  Inside your loop, transform or extract whatever you need out of the iterable variable.
    * You're going to do this inside of the body of your for loop, and NEVER in your opening for loop line of code.
4.  Once you have the specific data points that you want to collect printing out in your for loop, start your accumulator process as normal.
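To make those four steps concrete, here is a rough sketch on a tiny invented dataset.  The `inventory` string and all the names in it are hypothetical and have nothing to do with our data file; it's only meant to show where each step lands in the code.

```python
# Hypothetical data, invented for illustration only.
inventory = "apples: 3\npears: 5\nplums: 2"

# Step 1: identify (and transform) the sequence, here one string into lines.
inventory_lines = inventory.split('\n')

total = 0                        # Step 4: the accumulator starts before the loop
for line in inventory_lines:     # Step 2: loop over the individual data containers
    pieces = line.split(': ')    # Step 3: extract inside the loop body
    amount = int(pieces[1])
    total = total + amount       # Step 4: accumulate

print(total)
```

In practice you would build this up in passes: first just `print(line)`, then print `pieces`, then `amount`, and only add the accumulator once those look right.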

# What's our data granularity?

Last week we explored a poem and found that our units of analysis were the individual lines.  That made it easy because python knows about lines and has functions designed to easily interact with them.  In this case, our individual data entities are the records about each dataset.  This sort of thing, especially with the variability in number of lines within each record, is not something that python is able to immediately interact with without us doing some more leg work.  Our recourse for this issue is to use the tools inside of python to encode these chunks as individual data entries.  Once we have the individual records out we can operate on them independently and extract the information out.

This pattern of breaking the data apart so that we can apply broader (easier) methods of splitting individual data points out will be a common one.

Let's take a moment to consider the Raven again.  Say that we want to know about the words in each stanza.  We could use regular `.split()` on the entire poem and get all the words.  We'd have the granularity that we want (the words) but the membership information would then be gone.
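Here is a small sketch of that loss of membership.  The `poem` string is a made-up two-stanza stand-in, and I'm assuming (just for this example) that stanzas are separated by a blank line.

```python
# Hypothetical two-stanza text standing in for the poem.
poem = "one two\nthree four\n\nfive six\nseven"

all_words = poem.split()
print(all_words)   # every word, but no record of which stanza each came from

stanzas = poem.split('\n\n')
for stanza in stanzas:
    stanza_words = stanza.split()
    print(len(stanza_words), stanza_words)   # words stay grouped by their stanza
```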

Recall our basic for loop over lines through the poem:  it allows us to isolate each line at a time such that we can manipulate or take measurements from that line and infer that those measurements we get belong to that line because it is the one we are looping over at the moment.

Recall our example comparing `range(0, len(raven_lines), 7)`, which makes the position numbers we then use to look up each line, with a plain list slice (`::7`).  Both gave us the same content back (the first line of each stanza), but the list slicing method lost the line number information in the process.  That origin information could not always be dependably derived back from the individual line itself, and thus we depend on our iterable variable within the `range` loop to represent that component metadata about the line.
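A scaled-down sketch of that comparison, using a hypothetical list with "stanzas" of 3 lines instead of the poem's 7:

```python
# Hypothetical stand-in: 6 "lines" in stanzas of 3 (the poem used 7).
lines = ["a1", "a2", "a3", "b1", "b2", "b3"]

print(lines[::3])   # slicing: the content only

for position in range(0, len(lines), 3):
    print(position + 1, lines[position])   # range: content plus where it came from
```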

Unlike that pattern, this time we're going to be looping over our actual data.  Back then we used `range` to generate the line numbers, and the content (the numbers) that we generated with `range` wasn't directly related to our original content.  We used a known pattern to (correctly!) generate position numbers, so there was a certain trust that we had to put into place to make things work.

This time around we are splitting our data into chunks so that we can individually act on the data inside of it.  This means that instead of getting out all the lines that have the funder information and doing something fancy to figure out which chunk it was from, we can take out all the chunks and then get the funder lines from there.  Because we are isolating the data records from each other, we can use a pretty unfancy method of getting out the lines that we want.  

For example, the last line of each record is the download count for that dataset.  That location rule would be impossible to use if we weren't isolating each chunk.

So to answer our question:  we have several granularities.

1. Each data record
2. Each line in that data record
3. Each data point in each line (in the lines that we care about)

# What's the magic word that we see in that list?

**`each`** appears in each statement, which can tell us to use a for loop.

We aren't going to tackle a triple nested loop to start with, but we need to start somewhere.  Let's give ourselves a little to do list:

1. Get all the data records
2. For each record, get the lines (and figure out where the downloads line is and how to get the download count out of it)
3. For each line of interest, get the data of interest

Don't worry, we're going to do this one at a time.  We won't actually need to do a triple nested loop because we can store our intermediate results.

# Task 1:  Get the records

Not shockingly, we're going to start with `.split()`.  Remember that we need two things for this:

1.  A string to split apart
    * Got this covered:  `sample`
2.  Something in that string to split it apart on
    * We have a good theory here: `-----` but we need to confirm
    
We can visually inspect our file and see that this appears to be between each record.  Not only that, we can look all the way to the end and see that it isn't just between the records but appears at the end of each record.  So there isn't one before the first record and there is one after the last, meaning that we can expect to see 56 instances (one for every expected record).

We can use a new string method to count how many times it appears in the file.  We'll try it first on our small sample to see it working, and then we'll deploy it onto the whole document that we have stored in memory.

We can visually inspect our sample to see that there are 4 entries, so we can expect (hope?) to see a result of 4 when we run our candidate delimiter through the string function that counts how many appearances it makes.

In [3]:
print(sample.count("-----"))

4


Great, we can see that we found the expected number of instances in our sample.  Let's try it on our full version.  Remember that we are hoping to see 56.

In [4]:
print(full_text.count('-----'))

56


Yay!  Now we can see what the results are of running this through `.split()`.

In [5]:
sample_split = sample.split('-----')

print("There are", len(sample_split), "records")
print("-------------------")
print(sample_split)

There are 5 records
-------------------
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )\n', '\n\nPark, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )\n', '\n\nKozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )\n', '\n\nChristensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal co

That length of 5 is concerning, so we need to take a look inside the elements and see what might be unexpected.  Everything looks pretty normal until we get to the last item.  Hopefully this looks like something you've seen before, because you have.  This is an empty string.  Sometimes this happens when the string we are splitting on appears at the end, as you can see in the example below.

In [6]:
print('a-b-c-d-'.split("-"))

['a', 'b', 'c', 'd', '']


So our split has worked, but as usual there's a little more fussing we can do to make it better.  This is a good example of a situation where you won't know exactly what you'll need to do until you get the contents loaded and start working with it.

Look at the sections in the list where the strings should be separated.  This is a small section of the split result list, showing where two elements meet.  So we've got the end of one element, the comma the list uses to show the separation of elements, and the beginning of the next element.

`2017-09-13 )`**`\n', '\n\n`**`Christensen,`

Here is that same piece of text as it appears in the original text with the newline characters intact:

`2017-09-13 )\n-----\n\nChristensen,`

In this case we've got the `-----` removed from the split, leaving the surrounding newlines.

We haven't mentioned it before, but we've been passing a string with multiple characters in it to split, and split has been using it as a single item to split on (which is the designed way it works and what we want it to do!).  This means that we can add more to the string that we are passing it and see how that changes things.  From this example, we can see that there are newline characters (`\n`) surrounding the delimiter, 1 before and 2 after.  If we look closer at the actual file we can see these characters in action.  The `-----` appears on its own line (so that's the 1 newline before), and there is an extra empty line just below it (so that's the 2 after).  We can try including that in our split.  This change could provide two benefits:  it'll clean up the results a bit (we could always use `.strip()` on them, so that wasn't really a concern), and it may also get rid of that trailing empty string in our results.
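Before trying this on the sample, here is a tiny sketch of the behavior on a made-up string (`text` is invented for this demonstration):

```python
text = "a\n--\nb\n--\nc-d"

loose = text.split('--')       # the two dashes act as one unit; the single dash in c-d is untouched
tight = text.split('\n--\n')   # widening the delimiter also swallows the newlines around it

print(loose)
print(tight)
```

Notice that the wider delimiter leaves cleaner elements behind, which is the effect we're hoping for with our records.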

So I'm going to copy over the code for our previous example and just add those characters in.

In [7]:
sample

'Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )\n-----\n\nPark, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )\n-----\n\nKozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )\n-----\n\nChristensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in p

In [8]:
sample_split = sample.split('\n-----\n\n')

print("There are", len(sample_split), "records")

print(sample_split)

There are 4 records
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )', 'Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )', 'Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )', 'Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in p

There's a lot of information happening in this readout, so we'll have to look closely to see what's going on.  Looking between the elements, we can see that the breaks are pretty clean, but the last element is still holding a copy of our delimiter at the end of the string.  This is because it has the `\n-----` but is missing the final `\n\n` at the end of the record because it is the end of the file, and thus is not removed.

We can play with removing some of the new lines in the split and see if that helps.

In [9]:
sample_split = sample.split('\n-----') # see the \n\n at the end gone?

print("There are", len(sample_split), "records")

print(sample_split)

There are 5 records
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )', '\n\nPark, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )', '\n\nKozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )', '\n\nChristensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gen

Removing the trailing two `\n` characters from the end of the string provided to split allows the delimiter to be removed from the last element of the split records, but now we have that empty string appearing again.  

We have to choose:  do we want to deal with removing the `-----` at the end of the file or have to strip the whitespace off the records and remove the last empty string in our list.  We could also alter our original text to fix this last delimiter to look like the others.

We're going to handle this in 2 stages:

1. Add the `\n\n` to the end of the full string, so that we can use the full `\n-----\n\n` delimiter and avoid both having to strip the elements and cleaning the `-----` off the end of the last element.
2. This will still leave us the empty string in our list, so we'll have to remove that as well.
   
In case you're wondering, we could also use `.strip("-----")` on the sample, but sometimes this is not something we could handle so easily.  So it'll be valuable to explore handling those other methods.
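For the curious, here is roughly what a `.strip()` route could look like.  This is only a sketch on a hypothetical miniature of the sample (`mini` is invented, with the same delimiter layout but shorter records).  Note that I'm stripping `"-\n"` rather than `"-----"`: `.strip()` treats its argument as a set of characters to remove from both ends, so including the newline also cleans up the `\n` left in front of the final dashes.

```python
# Hypothetical miniature of the sample: same delimiter layout, shorter records.
mini = "record one\nDownloads: 9\n-----\n\nrecord two\nDownloads: 10\n-----"

trimmed = mini.strip("-\n")   # remove any mix of dashes and newlines from both ends
records = trimmed.split('\n-----\n\n')

print(len(records))
print(records)
```

The catch is that `.strip()` would just as happily eat dashes or newlines that legitimately belong to the start or end of your data, which is part of why it isn't always this easy.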

In [10]:
sample_fixed = sample + "\n\n" # append the newlines to the sample

sample_split = sample_fixed.split('\n-----\n\n')

print("There are", len(sample_split), "records")

print(sample_split)

There are 5 records
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )', 'Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )', 'Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )', 'Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in p

So adding the newlines to the end of the string fixed the `-----` appearing in the last element and the extra newlines within the elements, but we still have that last empty element hanging around.  We could always remove the last element, so long as we are sure that the last item is empty.  That kind of check will be something we can do when we start with boolean logic and decision structures.

We could also try to remove the trailing `-----` from the text itself, but that content is a subset of our larger delimiter, so removing every occurrence of it would remove all of our delimiters.

We could also remove the last 5 characters from the string, which is a heavy-handed version.  As I mentioned, `.strip()` would be a nice way of handling this.

But for the sake of practicing our list methods, we're going to explore `.pop()`.  With more advanced tests we could put in a check that it really is a string of length 0, but for now we can at least visually inspect what we are removing.

In [11]:
sample_split.pop(-1) # we want the last one, so our -1 friend will come back

''

`.pop()` will mutate our original list, which is why there is no assignment statement happening here.  In fact, if we try to reassign our list to the results of `pop`, we will have overwritten all the data we want with the data that we are removing. 

If we don't get this empty string out of here, then it will cause errors when we loop over all the records and try to split things apart.

We can see that reassignment mistake in action.

In [12]:
sample_fixed = sample + "\n\n"

sample_split = sample_fixed.split('\n-----\n\n')

print("There are", len(sample_split), "records")

print(sample_split)

sample_split = sample_split.pop(-1)

print("and now our data is:", sample_split)

There are 5 records
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )', 'Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )', 'Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )', 'Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in p

In [13]:
sample_fixed = sample + "\n\n"

sample_split = sample_fixed.split('\n-----\n\n')

print("There are", len(sample_split), "records")

sample_split.pop(-1)

print(sample_split)

print("There are now", len(sample_split), "records")

There are 5 records
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )', 'Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )', 'Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )', 'Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in p

# Task 2: split the records apart

So let's back up and consider what we have done and still need to do.

We've got a list of our individual records.  Next step:  go over each record and get the line with the downloads data.

A boolean filtering approach to this step could have rendered the record splitting unnecessary.  But as we've discussed, we don't know how to do that, and it would reduce the context of the data coming back to us -- we might need more information than such an approach would give us. 

As we have explored, we know that the downloads line is the last line of each record (now do you see why I gave this problem statement?)

Let's grab a single record to play with and get a proof of concept going.  Once we're happy with how we are splitting the record up, then we can integrate that into a for loop.  We know that we want the lines out, and we've already explored how to do that with a `str.split('\n')`

In [14]:
print(sample_split[2])

Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2
Downloads: 6 (2017-09-06 to 2017-09-13 )


In [15]:
print(sample_split[2].split('\n'))

['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2', 'Downloads: 6 (2017-09-06 to 2017-09-13 )']


This is a smaller record, but we can see that once we have the citation as one element, then the downloads element is indeed the last one.

This seems good enough to deploy on our entire sample.  Remember that we should start small and just print out the basics first.

In [16]:
for record in sample_split:
    print(record.split('\n'))

['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1', 'Downloads: 9 (2017-08-30 to 2017-09-13 )']
['Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1', 'Funder: U.S. Department of Energy (DOE), Grant: DE-SC0010778', 'Downloads: 10 (2017-09-08 to 2017-09-13 )']
['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2', 'Downloads: 6 (2017-09-06 to 2017-09-13 )']
['Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in polynomial

So far so good, but how can we tell if the downloads line is indeed the last line of each record?  We could just print out the last line of each record and visually inspect that each line starts with `Downloads:`.

In [17]:
for record in sample_split:
    record_split = record.split('\n')
    print(record_split[-1])

Downloads: 9 (2017-08-30 to 2017-09-13 )
Downloads: 10 (2017-09-08 to 2017-09-13 )
Downloads: 6 (2017-09-06 to 2017-09-13 )
Downloads: 47 (2017-06-15 to 2017-09-13 )


We have 4 records and 4 download lines appearing here, so we're good to go.  This step seems to be taken care of, so now is a good time to test it on all of our data.  Let's also go ahead and combine all these things together.

If we printed out the first few of these records, we would see that there is extra information in the first record because of the metadata header at the top of the file.  This top piece of information won't bother us, because it's the end of the record info that we want.

In [18]:
full_text_fixed = full_text + "\n\n"

full_text_fixed_split = full_text_fixed.split('\n-----\n\n')

print("There are", len(full_text_fixed_split), "records")

full_text_fixed_split.pop(-1)

print("There are now", len(full_text_fixed_split), "records")

There are 57 records
There are now 56 records


Good!  We have 56 records after correcting for the empty string at the end.  Let's start looping.
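Before looping, here is a tiny sketch of why the split leaves that extra empty string behind. The two-record text is made up for illustration (`tiny_text` and its contents are assumptions, not part of the real file), but it uses the same separator.

```python
# made-up miniature of the file: every record, including the last, ends with the separator
tiny_text = "record one\n-----\n\nrecord two\n-----\n\n"

tiny_split = tiny_text.split('\n-----\n\n')
print(tiny_split)  # the last element is an empty string

# the separator after the final record leaves an empty string behind, so pop it off
removed = tiny_split.pop(-1)
print(len(tiny_split), "records after the pop")
```

Whatever comes after the final separator becomes its own element, even when that "whatever" is nothing at all.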

As we keep stepping forward through these examples, watch closely how I'm only adding lines to the end of my loop block.  Each time I add something, I stop to check that I have done the thing that I intended to do.  Sometimes this is just printing out results to visually inspect what I've done, and other times I'll calculate or print out a derivative value to check my work.

Also note my consistency in variable names and that I'm not chaining together multiple methods in a single line of code.

In [19]:
for record in full_text_fixed_split:
    record_split = record.split('\n') # I've put my split in and saving the results
    print(record_split[-1]) # print out the last item of my split results

Downloads: 9 (2017-08-30 to 2017-09-13 )
Downloads: 10 (2017-09-08 to 2017-09-13 )
Downloads: 6 (2017-09-06 to 2017-09-13 )
Downloads: 47 (2017-06-15 to 2017-09-13 )
Downloads: 99 (2016-12-18 to 2017-09-13 )
Downloads: 3 (2017-12-12 to 2017-09-13 )
Downloads: 30 (2016-12-12 to 2017-09-13 )
Downloads: 42 (2016-12-12 to 2017-09-13 )
Downloads: 93 (2016-12-12 to 2017-09-13 )
Downloads: 31 (2016-12-12 to 2017-09-13 )
Downloads: 50 (2016-12-12 to 2017-09-13 )
Downloads: 12 (2017-08-11 to 2017-09-13 )
Downloads: 0 (2017-08-21 to 2017-09-13 )
Downloads: 99 (2017-07-29 to 2017-09-13 )
Downloads: 931 (2017-06-28 to 2017-09-13 )
Downloads: 28 (2017-06-16 to 2017-09-13 )
Downloads: 26 (2017-06-16 to 2017-09-13 )
Downloads: 28 (2017-06-16 to 2017-09-13 )
Downloads: 40 (2017-06-01 to 2017-09-13 )
Downloads: 68 (2017-06-01 to 2017-09-13 )
Downloads: 18 (2017-05-01 to 2017-09-13 )
Downloads: 41 (2017-05-31 to 2017-09-13 )
Downloads: 516 (2016-06-23 to 2017-09-13 )
Downloads: 72 (2017-05-22 to 2017-09

Visual inspection of the results says that we've got this part done.
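Visual inspection works at this size, but we could also let Python do the counting. This is only a minimal sketch: `mini_records` is a made-up stand-in for `full_text_fixed_split`, and it leans on `str.startswith()` and an `if` statement, which we may not have covered yet.

```python
# hypothetical stand-in for full_text_fixed_split (two shortened, made-up records)
mini_records = [
    "Citation one\nDownloads: 9 (2017-08-30 to 2017-09-13 )",
    "Citation two\nFunder: Example Funder\nDownloads: 10 (2017-09-08 to 2017-09-13 )",
]

matching = 0  # counts the records whose last line is a downloads line

for record in mini_records:
    record_split = record.split('\n')
    last_line = record_split[-1]
    if last_line.startswith('Downloads:'):
        matching = matching + 1

print(matching, "of", len(mini_records), "records end with a downloads line")
```

If the two numbers match, every record ends with a downloads line.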

# Step 3: get the downloads number

At this point we've got a good set of results. We've isolated our individual dataset records and inside each record we have isolated each line.  With those lines now accessible via position number, we can select the last line of the record, which happens to be the download count data line that we want.

So let's now consider each download line we're printing out and think of a way to get those download numbers out of them.  We've got a string in here, but our granularity is at the level of the word.  While we don't have just words in here, we've got stuff separated by white space.  Instead of a `-----` or `\n` delimiter, we have a single space.

Back once again with our friend `.split()`.  But what do we use in this case?  We care about whitespace, which happens to be the default separator for `.split()`.  Let's just copy one of these lines in and play with splitting the content.

In [20]:
line = "Downloads: 118 (2016-06-23 to 2017-09-13 )"

print(line.split())

['Downloads:', '118', '(2016-06-23', 'to', '2017-09-13', ')']


That seems to have done a part of the job.  We've split our line into a series of elements, one of which is actually the data point that we want.  Can we exploit some consistency across all the download lines?  Let's first just look at what this split does across all the download lines.

In [21]:
for record in full_text_fixed_split:
    record_split = record.split('\n')
    download_line = record_split[-1]
    print(download_line.split()) # now split the download lines on whitespace

['Downloads:', '9', '(2017-08-30', 'to', '2017-09-13', ')']
['Downloads:', '10', '(2017-09-08', 'to', '2017-09-13', ')']
['Downloads:', '6', '(2017-09-06', 'to', '2017-09-13', ')']
['Downloads:', '47', '(2017-06-15', 'to', '2017-09-13', ')']
['Downloads:', '99', '(2016-12-18', 'to', '2017-09-13', ')']
['Downloads:', '3', '(2017-12-12', 'to', '2017-09-13', ')']
['Downloads:', '30', '(2016-12-12', 'to', '2017-09-13', ')']
['Downloads:', '42', '(2016-12-12', 'to', '2017-09-13', ')']
['Downloads:', '93', '(2016-12-12', 'to', '2017-09-13', ')']
['Downloads:', '31', '(2016-12-12', 'to', '2017-09-13', ')']
['Downloads:', '50', '(2016-12-12', 'to', '2017-09-13', ')']
['Downloads:', '12', '(2017-08-11', 'to', '2017-09-13', ')']
['Downloads:', '0', '(2017-08-21', 'to', '2017-09-13', ')']
['Downloads:', '99', '(2017-07-29', 'to', '2017-09-13', ')']
['Downloads:', '931', '(2017-06-28', 'to', '2017-09-13', ')']
['Downloads:', '28', '(2017-06-16', 'to', '2017-09-13', ')']
['Downloads:', '26', '(2017

Visual inspection seems to indicate that there is a consistency that we want.  Much like the way we are depending on the downloads line to be always at the end of the record, we apparently can depend on the downloads count to always be the second element of each download line.  

There isn't anything intrinsic within the download content that says it should be in the second element of each line, but we can discover that about our data as we investigate things.  This is not something we should have expected to know immediately upon receiving this problem statement.  An experienced eye can look at this file and tell you that the structure is orderly enough and the data clearly marked enough that extracting it out shouldn't be a problem, but the exact approach to getting that job done can't be determined until you're in the thick of things.

The fact that this file is computer generated helps give us confidence that this system can be trusted to consistently place the data in the same positions each time.  Now, how it might handle missing data points or other problems is something that we should keep an eye on if we are reusing this code over a long period of time.  
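One hedged way to keep that eye on things is to test the assumption directly. The sketch below uses made-up lines (`sample_lines` is hypothetical, and its second entry is deliberately broken) and `str.isdigit()` to flag any downloads line whose second element is not a plain whole number.

```python
# hypothetical download lines; the second one is deliberately missing its count
sample_lines = [
    "Downloads: 118 (2016-06-23 to 2017-09-13 )",
    "Downloads: (2016-06-23 to 2017-09-13 )",
]

suspect_lines = []  # collects any line that does not fit the expected pattern

for line in sample_lines:
    line_split = line.split()
    candidate = line_split[1]  # where we expect the count to be
    if not candidate.isdigit():
        suspect_lines.append(line)

print(len(suspect_lines), "suspect line(s)")
```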

But I digress, let's keep moving forward.

Here are the pieces that we have put together: 

1. We can loop through the individual records
2. Once inside of each record, we can access the downloads line.
3. Once we have the downloads line split apart, we can isolate the downloads number.

We can go ahead and prove that we can get the download count for each line.

In [22]:
for record in full_text_fixed_split:
    record_split = record.split('\n')
    download_line = record_split[-1]
    download_line_split = download_line.split()
    downloads = download_line_split[1] # take the second element of the split text from the downloads line
    print(downloads)

9
10
6
47
99
3
30
42
93
31
50
12
0
99
931
28
26
28
40
68
18
41
516
72
82
37
1634
99
110
312
132
78
46
140
84
424
269
212
237
98
207
193
200
403
283
336
55
69
163
132
51
70
250
913
130
118


At this point, if you're feeling a little uncomfortable with using a hard-coded position number for both the downloads line and the downloads number, you would be right.  That can be a dangerous game that breaks apart when your data structure changes.  If this is the one and only time you'll need to process this file, and you don't expect the data to be updated or the format to shift over time, this isn't a terrible method of solving the problem.  It is just a fragile one.  There are better, more flexible methods of handling this, but they require skills and concepts that we haven't explored yet.
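As a peek ahead only, here is a sketch of one less position-dependent idea: find the line by its label instead of by its position. It assumes a made-up `record` in which the downloads line is not the last line, and it relies on `if` and `str.startswith()`.

```python
# hypothetical record where the downloads line is NOT the last line
record = "Citation text\nDownloads: 6 (2017-09-06 to 2017-09-13 )\nFunder: Example Funder"

download_line = None  # stays None if no downloads line is found

for line in record.split('\n'):
    if line.startswith('Downloads:'):
        download_line = line

print(download_line)
```

This still trusts the count to be the second element of that line, so it removes only one of the two hard-coded positions.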

What's our next step?  We've isolated the data points that we want, but we haven't yet done anything with them.  Our final step will be to calculate the total number of downloads for all our dataset records.

For this, we will need to use an accumulator.  Recall the pieces that we will need to have to make that work.

1. an accumulator variable (we don't have this yet)
2. a for loop going over our data (we already have this in place)
3. an incrementer that is adding the download value to our accumulator (we don't have this)
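As a reminder of how those three pieces fit together, here is a minimal sketch on a plain list of integers. The list `counts` is typed in by hand for illustration (the values are the first four download counts printed above).

```python
# the first four download counts from above, typed in as integers
counts = [9, 10, 6, 47]

total = 0  # 1. the accumulator variable

for count in counts:  # 2. the for loop going over the data
    total = total + count  # 3. the incrementer

print(total)
```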

In [23]:
download_total = 0 # here's our missing accumulator

for record in full_text_fixed_split: # the for loop we already have 
    record_split = record.split('\n')  # look at all this stuff we have to do!
    download_line = record_split[-1]
    download_line_split = download_line.split()
    downloads = download_line_split[1]
    print(downloads)
    download_total = download_total + downloads # our incrementer

9


TypeError: unsupported operand type(s) for +: 'int' and 'str'

What's that error?  It's saying that we can't add an int and a str.

What we're missing here is that the downloads number is still just a string.  We have been using string methods on it, which was necessary to get the information out.  But now that we have just the number, we can turn it into an integer that we can do math with.  Let's add that cast in here.  We can presume that this number is a whole number, but this is something that you should check if you don't know your data very well.
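A small sketch of what the cast changes, using a made-up value (`joined` and `added` are names introduced here only for illustration):

```python
downloads = "9"  # a string, like the pieces that come out of .split()

joined = downloads + downloads  # + on two strings concatenates them
added = int(downloads) + int(downloads)  # + on two ints does arithmetic

print(joined)
print(added)
```

The same `+` symbol does two different jobs depending on the types on either side of it, which is why mixing an int and a str raises an error instead of guessing.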

In [24]:
download_total = 0

for record in full_text_fixed_split:
    record_split = record.split('\n') 
    download_line = record_split[-1]
    download_line_split = download_line.split()
    downloads = download_line_split[1]
    download_total = download_total + int(downloads) # see the int cast here?

print(download_total)

9866


We can check whether the total would differ if we cast each value to a float instead.

In [25]:
download_total = 0 

for record in full_text_fixed_split: 
    record_split = record.split('\n') 
    download_line = record_split[-1]
    download_line_split = download_line.split()
    downloads = download_line_split[1]
    download_total = download_total + float(downloads) # see the float cast here

print(download_total)

9866.0


Nope, doesn't change.  Which is good.  So we can say that the total number of downloads is 9,866.  There are other methods we could use to collect and calculate information about the download counter.
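One such method, sketched here under assumptions, is to collect the counts in a list and let the built-in functions do the math. The list `download_strings` is a made-up stand-in for the values that `downloads` takes inside our loop.

```python
# hypothetical stand-in for the values of `downloads` seen inside the loop
download_strings = ["9", "10", "6", "47"]

download_counts = []  # an empty list to collect the integers

for downloads in download_strings:
    download_counts.append(int(downloads))

print(sum(download_counts))  # the total
print(max(download_counts))  # the largest single count
print(len(download_counts))  # how many counts were collected
```

Keeping the individual values around costs a little memory, but it lets us ask more than one question of the data without looping again.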

I'm going to put the complete program below.

In [26]:
######
# open the file and read the contents into the full_text string

f = open('report.txt', 'r')

full_text = f.read()

f.close()

#####
# add the missing newlines to the end of our string and split it apart.

full_text_fixed = full_text + "\n\n"

full_text_fixed_split = full_text_fixed.split('\n-----\n\n')

#####
# pop out the last element, which should be an empty string.
# there are print statements in here to check that it is deleting an empty string

print("There are", len(full_text_fixed_split), "records")

removed = full_text_fixed_split.pop(-1)

print("this was removed:", '"' + removed + '"')

print("There are now", len(full_text_fixed_split), "records")

#####
# loop through the records, getting the downloads line, splitting that, 
# grab the downloads item, convert it to an int, and accumulate that value

download_total = 0 # the accumulator

for record in full_text_fixed_split: # loop over every record
    record_split = record.split('\n')  # split the record into lines
    download_line = record_split[-1] # the downloads line is the last line
    download_line_split = download_line.split() # split that line on whitespace
    downloads = download_line_split[1] # the count is the second element
    download_total = download_total + int(downloads) # the incrementer, with the int cast

print("There have been", download_total, "total downloads for the Illinois Data Bank")

There are 57 records
this was removed: ""
There are now 56 records
There have been 9866 total downloads for the Illinois Data Bank


So we have three distinct parts to our code here:

First, we read in our data and prepare it for our looping actions.  We fixed a few things in the text along the way, but not everything.

Second, we loop over the records, breaking each one down further in order to extract the individual data element that we want.

Finally, we print out the number that we were after in the first place:  the total number of downloads.

# Activity:

Pair up!  

Break open the book and talk through the data granularity issues of the homework problems.