# Week 4: lists and subloops

Sequences have order, meaning that individual items have specific positions.  You can use those positions:

* as their explicit meaning (so the numbers are directly meaningful)
* as a transformed meaning (so you can add 1 or do something else to the position number to make it meaningful)
* with a referenced meaning (such that you can use the position number to look up the meaning).

These items can also be individually manipulated.  Once referenced, they can be stored in memory, placed in another structure, and used as a source for further data.  
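Here is a minimal sketch of those three uses of a position, plus storing a referenced item somewhere else.  The `months` list and the other variable names are made up for illustration; they are not part of our report data.

```python
# A made-up list, used only for illustration.
months = ["January", "February", "March"]

for position in range(len(months)):
    # 1. Explicit meaning: the position number is used as-is.
    explicit = position
    # 2. Transformed meaning: add 1 to turn the position into a calendar month number.
    month_number = position + 1
    # 3. Referenced meaning: use the position to look up the item itself.
    name = months[position]
    print(explicit, month_number, name)

# Once referenced, an item can be stored in memory and placed in another structure.
first_month = months[0]
winter = [first_month, months[1]]
print(winter)
```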

In [10]:
# we'll be discussing reading in files later!

f = open('report.txt', 'r')

full_text = f.read()

f.close()

Here's a small snippet we can see in one screen:

In [11]:
sample = """Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1
Downloads: 9 (2017-08-30 to 2017-09-13 )
-----

Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1
Funder: U.S. Department of Energy (DOE), Grant: DE-SC0010778
Downloads: 10 (2017-09-08 to 2017-09-13 )
-----

Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2
Downloads: 6 (2017-09-06 to 2017-09-13 )
-----

Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in polynomial time using OCTAL. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8402610_V1
Funder: U.S. National Science Foundation (NSF), Grant: CCF-1535977
Funder: U.S. National Science Foundation (NSF), Grant: DGE-1144245
Downloads: 47 (2017-06-15 to 2017-09-13 )
-----"""

# What's our structure here?

We've got data on datasets here, but this isn't a tabular data file.  We can't open it up in Excel or directly do computations on it, yet.  We need to transform it into something with a standard structure for analysis.

A few things to note:

* the number of lines per entry is variable, so we can't exploit a steady mathematical structure like we did with the Raven last week.
* even if variable, we can see that the lines inside each entry are meaningful 
* each line has a field label followed by a : and then content
* some of these fields have multiple entries (so multiple lines with the same field entry)
* some of the line content has multiple values
* the string "-----" appears (to appear!) between each entry

# What's our data granularity?

Last week we explored a poem and found that our units of analysis were the individual lines.  That made it easy because Python knows about lines and has functions designed to easily interact with them.  In this case, our individual data entities are records about datasets.  This sort of thing, especially with the variability in size, is not something that Python is able to directly interact with.  Our recourse is to use the tools inside of Python to encode these chunks as individual data entries.  Once we have the individual records out, we can operate on them independently and extract the information.

This pattern of breaking the data apart so that we can apply broader (easier) methods of splitting individual data points out will be a common one.

Let's take a moment to consider the Raven again.  Say that we want to know about the words in each stanza.  We could use regular `.split()` on the entire poem and get all the words.  We'd have the granularity that we want (the words) but the membership information would then be gone.
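To see that loss of membership in action, here is a small sketch.  The `poem` string is placeholder text (not the Raven), and it assumes stanzas are separated by a blank line, which may not match how your copy of the poem is laid out.

```python
# Placeholder text standing in for a poem: two stanzas separated by a blank line.
poem = "one two three\nfour five\n\nsix seven\neight nine ten"

# Splitting the whole poem gives us every word, but no idea which stanza each came from.
all_words = poem.split()
print(len(all_words))

# Splitting into stanzas first keeps the membership: one list of words per stanza.
stanzas = poem.split("\n\n")
words_by_stanza = []
for stanza in stanzas:
    words_by_stanza.append(stanza.split())
print(words_by_stanza)
```

Both versions contain the same words.  Only the second one can tell us that "six" belongs to the second stanza.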

Recall our basic for loop over the lines of the poem:  it allows us to isolate one line at a time, so that we can manipulate or take measurements from that line and know that those measurements belong to that line because it is the one we are looping over at the moment.

Recall our example comparing two ways of getting the first line of each stanza:  using `range(0, len(raven_lines), 7)` to make the position numbers and look up each line, versus just doing a list slice (`::7`).  Both gave us the same content back, but the list slicing method lost the line number information in the process.  That origin information could not be derived back from the individual line itself, and thus we depend on our loop variable within the `range` loop to carry that metadata about the line.

In our pattern where we used `range` to generate the line numbers, the content (the numbers) that we generated with `range` wasn't directly derived from our original content.  We used a known pattern to (correctly!) generate position numbers, so there was a certain trust that we had to put into place to make things work.
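As a reminder of the difference, here is a sketch with a stand-in for `raven_lines`.  The placeholder lines below are invented; only the 7-line stanza pattern comes from last week.

```python
# Stand-in for raven_lines: two 7-line "stanzas" made of placeholder text.
stanza = ["opening line"] + ["another line"] * 6
raven_lines = stanza + stanza

# List slicing: we get the first line of each stanza, but not where it came from.
first_lines = raven_lines[::7]
print(first_lines)

# range: the loop variable i IS the position, so we can keep it next to the content.
positions = []
for i in range(0, len(raven_lines), 7):
    positions.append(i)
    print(i, raven_lines[i])
```

The two opening lines are identical strings, so nothing in the sliced result can tell us that one came from position 0 and the other from position 7.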

This time around we are splitting our data into chunks so that we can individually act on the data inside each one.  This means that instead of getting out all the lines that have the funder information and doing something fancy to figure out which chunk it was from, we can take out all the chunks and then get the funder lines from there.  Because we are isolating the data records from each other, we can use a pretty unfancy method of getting out the lines that we want.

For example, the last line of each record is the download count for that dataset.  That location rule would be impossible to use if we weren't isolating each chunk.
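Here is a rough sketch of how that location rule might look once the chunks are isolated.  The `records` list is a hypothetical, shortened stand-in (the author, title, and funder text are placeholders), and it assumes each record is already its own string with no stray blank lines.

```python
# Two shortened, placeholder records laid out like the entries in our sample.
records = [
    "Author A (2017): First title.\nDownloads: 9 (2017-08-30 to 2017-09-13 )",
    "Author B (2017): Second title.\nFunder: Some Funder, Grant: XX-0000\nDownloads: 10 (2017-09-08 to 2017-09-13 )",
]

download_lines = []
funder_counts = []
for record in records:
    lines = record.split("\n")
    # The last line of an isolated record is its download line.
    download_lines.append(lines[-1])
    # Any funder lines we find here must belong to this record.
    count = 0
    for line in lines:
        if line.startswith("Funder"):
            count = count + 1
    funder_counts.append(count)

print(download_lines)
print(funder_counts)
```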

So to answer our question:  we have several granularities.

1. Each data record
2. Each line
3. Each data point in each line

# What's the magic word that we see in that list?

`each`

We aren't going to tackle a triple nested loop to start with, but we need to start somewhere.  Let's give ourselves a little to-do list:

1. Get all the data records
2. For each record, get the lines
3. For each line of interest, get the data of interest

Don't worry, we're going to do this one at a time.

# Task 1:  Get the records

Not shockingly, we're going to start with `.split()`.  Remember that we need two things for this:

1.  A string to split apart
    * Got this covered:  `full_text`
2.  Something in that string to split it apart on
    * We have a good theory here: `-----` but we need to confirm
    
We can visually inspect our file and see that this appears to be between each record.  Not only that, we can look all the way to the end and see that it doesn't just sit between records:  it appears at the end of every record, including the last one.  So there isn't one before the first record and there is one after the last, meaning that we can expect to see 56 instances (one for every expected record).

We can do a quick string method to count how many times it appears in the file.  We'll try it first on our small sample to see it working, and then we'll deploy it onto the whole document that we have stored in memory.

We can visually inspect our sample to see that there are 4 entries, so we can expect (hope?) to see a result of 4 when we run our candidate delimiter through the string function that counts how many appearances it makes.

In [16]:
print(sample.count("-----"))

4


Great, we can see that we found the expected number of instances in our sample.  Let's try it on our full version.  Remember that we are hoping to see 56.

In [17]:
print(full_text.count('-----'))

56


Yay!  Now we can split the text on `-----` to get our records.
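The notebook does not show how `datasets` was built, so the following is only one way the splitting step might look.  The `text` string is a shortened stand-in with placeholder content, and the clean-up with `.strip()` is an assumption based on the output having no leading or trailing newlines and no empty final item.

```python
# A shortened stand-in for the report text (placeholder content, same layout).
text = """First record line
Downloads: 9 (2017-08-30 to 2017-09-13 )
-----

Second record line
Funder: Some Funder, Grant: XX-0000
Downloads: 10 (2017-09-08 to 2017-09-13 )
-----"""

# Split on the delimiter.  Because the delimiter also ends the last record,
# we get one extra empty piece at the end.
pieces = text.split("-----")
print(len(pieces))

# Strip the leftover newlines from each piece and skip anything that is empty.
datasets = []
for piece in pieces:
    cleaned = piece.strip()
    if cleaned != "":
        datasets.append(cleaned)

print(len(datasets))
```

Counting the delimiter first tells us how many records to expect, so comparing `len(datasets)` against that count is a cheap sanity check.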

In [13]:
datasets

['Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V1\nDownloads: 9 (2017-08-30 to 2017-09-13 )',
 'Park, Jungsik; Le, Brian; Sklenar, Joseph; Chern, Gia-wei; Watts, Justin; Schiffer, Peter (2017): Magnetic response of brickwork artificial spin ice. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1528275_V1\nFunder: U.S. Department of Energy (DOE), Grant: DE-SC0010778\nDownloads: 10 (2017-09-08 to 2017-09-13 )',
 'Kozuch, Laura; Walker, Karen; Marquardt, William (2017): Modern sinistral whelk spire angles, genus Busycon . University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2031816_V2\nDownloads: 6 (2017-09-06 to 2017-09-13 )',
 'Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2017): Datasets from the study: Optimal completion of incomplete gene trees in polynomial time us