# While loops

When it comes to data processing in Python, the while loop is only deployed when the for loop won't cut it.

While loops are the standard way to loop in some languages, but not so much in Python.  There are certain tasks, such as interactive programs, where the while loop is almost always what you use.

This class deals with data, which usually lives in files, so you generally know exactly how many times the loop needs to run.

The while loop is valuable in the cases when you don't know how many times you'll need to run through things.
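As a minimal sketch of that situation, here is a loop where the number of passes depends on the value itself, so there is nothing to count ahead of time.  The function name `halvings_until_small` and the numbers are made up for illustration; they are not part of the class data.

```python
def halvings_until_small(value, limit=1):
    """Count how many times value must be halved before it is no longer above limit."""
    steps = 0
    # we can't know the number of passes ahead of time, so we loop on a condition
    while value > limit:
        value = value / 2
        steps += 1
    return steps

print(halvings_until_small(100))  # 7
```

A for loop needs something to iterate over; here the only thing we have is a condition to keep checking.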

Let's look at a real data example here.

So we know that the file is organized by state, but we don't know how many universities will be listed under each.  We can use a combination of patterns here:

* a decision tree to filter the lines we have found into their meaning groups
* a main and sub accumulator pattern, like the one you used in the week where you went over a file, used an accumulator on each of the lines, and then collected all of the results in a final accumulator.

Let's do the easy stuff first.

## pattern 1: filter all the things

Taking a look at the file in question, we can see that there are two kinds of lines:

* a line talking about a state
* a line talking about a university

We need to figure out how to detect these differences.  Going back to our previous discussions:

1. we could try to detect if only a state name is mentioned in a line
2. we could exploit the known and steady structure

Let's go for item 2 first.

* some of the universities have () in the names
* some of the universities have footnotes, and therefore end with [a number]

But is this something regular enough to depend on?

We can test the () first, and let's print out what that might tell us about the states, since that is the smaller category.

In [34]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()
lines = text.split("\n")
fileio.close()

for line in lines:
    if '(' in line and ')' in line:
        print("----")
    else:
        print("this in a state:", line)

this in a state: Alabama[edit]
----
----
----
----
----
----
----
----
----
----
----
----
----
----
this in a state: Alaska[edit]
----
----
----
----
----
this in a state: Arizona[edit]
----
----
this in a state: Gilbert[14]
----
----
----
----
----
----
----
----
----
this in a state: Arkansas[edit]
----
----
----
----
----
----
----
----
----
----
this in a state: California[edit]
----
----
----
----
this in a state: Antioch[14]
----
this in a state: Arden-Arcade[14]
this in a state: Azusa
----
----
----
----
----
----
----
----
----
----
----
----
----
this in a state: Corona
----
----
this in a state: Daly City
----
this in a state: Downey
this in a state: East Los Angeles
----
this in a state: El Monte
----
this in a state: Escondido
----
this in a state: Fontana
----
----
----
this in a state: Garden Grove
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----


Ouch, some university lines are being reported as states.  Some of them have [] in them, so we can add that check next.

In [35]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()
lines = text.split("\n")
fileio.close()

for line in lines:
    if '(' in line and ')' in line:
        print("----")
    elif '[' in line and ']' in line:
        print("----")
    else:
        print("this in a state:", line)

----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
this in a state: Azusa
----
----
----
----
----
----
----
----
----
----
----
----
----
this in a state: Corona
----
----
this in a state: Daly City
----
this in a state: Downey
this in a state: East Los Angeles
----
this in a state: El Monte
----
this in a state: Escondido
----
this in a state: Fontana
----
----
----
this in a state: Garden Grove
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
this in a state: Rialto
----
----
----
----
----
this in a state: Sacramento :
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
----
this in a state: Simi Vall

That made things worse. We've lost the states completely, because each state line (like `Alabama[edit]`) also contains [ and ].

In [39]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()
lines = text.split("\n")
fileio.close()

for line in lines:
    if '(' in line and ')' in line and '[' in line and ']' in line:
        print("----")
    else:
        print("this in a state:", line)

this in a state: Alabama[edit]
----
----
this in a state: Dothan (Fortis College, Troy University Dothan Campus, Alabama College of Osteopathic Medicine)
this in a state: Florence (University of North Alabama)
this in a state: Homewood (Samford University)
this in a state: Huntsville (University of Alabama, Huntsville)
----
----
----
----
this in a state: Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College, Faulkner University)
----
----
----
this in a state: Alaska[edit]
----
----
this in a state: Juneau (University of Alaska Southeast)
this in a state: Ketchikan (University of Alaska Southeast-extended campus)
this in a state: Sitka (University of Alaska Southeast-extended campus)
this in a state: Arizona[edit]
----
----
this in a state: Gilbert[14]
----
this in a state: Lake Havasu City (ASU Colleges at Lake Havasu City, Northern Arizona University extended campus)
----
----
----
this in a state: Pre

This also didn't work.
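To see side by side why none of the three attempts works, here is a small sketch that runs the same three checks on a handful of lines taken from the file.  The helper names (`has_parens`, `has_brackets`, `labeled_as_state`) are made up for this illustration.

```python
# a few sample lines taken from the file
samples = [
    "Alabama[edit]",
    "Auburn (Auburn University, Edward Via College of Osteopathic Medicine)[7]",
    "Florence (University of North Alabama)",
    "Gilbert[14]",
    "Azusa",
]

def has_parens(line):
    return "(" in line and ")" in line

def has_brackets(line):
    return "[" in line and "]" in line

def labeled_as_state(line):
    """Return (attempt 1, attempt 2, attempt 3): True where the line lands in the 'state' branch."""
    attempt1 = not has_parens(line)                           # only check ()
    attempt2 = not (has_parens(line) or has_brackets(line))   # check () or []
    attempt3 = not (has_parens(line) and has_brackets(line))  # check () and []
    return (attempt1, attempt2, attempt3)

for line in samples:
    print(labeled_as_state(line), line)
```

The only real state in the sample is `Alabama[edit]`, and no column is True for that line alone.  The () and [] checks describe some of the other lines, but they don't describe what makes a state line special.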

Maybe it's time to think about what might be unique about the states. We can eyeball the first few in the file, and we can take a look at the source page this came from to find something.

https://en.wikipedia.org/wiki/List_of_college_towns

Skimming through the Wikipedia page shows us that there appears to be an edit link next to each state name.  Is this duplicated throughout the file?

In [40]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()

print(text.count('[edit]'))
fileio.close()


51


So there are 51 results!  That lines up: there are 50 states, and the District of Columbia gets its own section just like a state.  You can also look at the outline of the page, which shows 51 sections.

We might think about now splitting the text on this [edit] text since it does mark the start of a new section. 

However, when we do that we end up with this:

In [42]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()
lines = text.split("[edit]")

print(lines[:2])
fileio.close()


['Alabama', '\nAuburn (Auburn University, Edward Via College of Osteopathic Medicine)[7]\nBirmingham (University of Alabama at Birmingham, Birmingham School of Law, Cumberland School of Law, Miles Law School)[8]\nDothan (Fortis College, Troy University Dothan Campus, Alabama College of Osteopathic Medicine)\nFlorence (University of North Alabama)\nHomewood (Samford University)\nHuntsville (University of Alabama, Huntsville)\nJacksonville (Jacksonville State University)[9]\nLivingston (University of West Alabama)[9]\nMobile (University of South Alabama)[8]\nMontevallo (University of Montevallo, Faulkner University)[9]\nMontgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College, Faulkner University)\nTroy (Troy University)[9][10]\nTuscaloosa (University of Alabama, Stillman College, Shelton State)[11][12]\nTuskegee (Tuskegee University)[13]\nAlaska']


The state names are now separated from their contents, and the name of the next state is at the end of the previous one's data.

So that sucks.

Since we can't use split here, we might want to just use a filter.  In this case, we know that [edit] comes at the end of the line, so we can use `.endswith` to check if that's our line.

In [44]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()
fileio.close()

lines = text.split("\n")

states = []

for line in lines:
    if line.endswith("[edit]"):
        print(line)
        states.append(line)

print(len(states))

Alabama[edit]
Alaska[edit]
Arizona[edit]
Arkansas[edit]
California[edit]
Colorado[edit]
Connecticut[edit]
Delaware[edit]
District of Columbia[edit]
Florida[edit]
Georgia[edit]
Hawaii[edit]
Idaho[edit]
Illinois[edit]
Indiana[edit]
Iowa[edit]
Kansas[edit]
Kentucky[edit]
Louisiana[edit]
Maine[edit]
Maryland[edit]
Massachusetts[edit]
Michigan[edit]
Minnesota[edit]
Mississippi[edit]
Missouri[edit]
Montana[edit]
Nebraska[edit]
Nevada[edit]
New Hampshire[edit]
New Jersey[edit]
New Mexico[edit]
New York[edit]
North Carolina[edit]
North Dakota[edit]
Ohio[edit]
Oklahoma[edit]
Oregon[edit]
Pennsylvania[edit]
Rhode Island[edit]
South Carolina[edit]
South Dakota[edit]
Tennessee[edit]
Texas[edit]
Utah[edit]
Vermont[edit]
Virginia[edit]
Washington[edit]
West Virginia[edit]
Wisconsin[edit]
Wyoming[edit]
51


There we have it!  We found a filter that correctly identifies our states, without any false positives or false negatives.
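The notes keep the `[edit]` text attached to each state name.  If you wanted the bare names later, one possible approach is to slice the marker off once the filter has matched.  This is only a sketch: `clean_state_name` is a made-up helper, not something used elsewhere in this notebook.

```python
def clean_state_name(line, marker="[edit]"):
    """Return the line without a trailing marker; lines without it come back unchanged."""
    if line.endswith(marker):
        # slice off exactly as many characters as the marker has
        return line[:-len(marker)]
    return line

print(clean_state_name("Alabama[edit]"))               # Alabama
print(clean_state_name("District of Columbia[edit]"))  # District of Columbia
print(clean_state_name("Gilbert[14]"))                 # Gilbert[14]
```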

Now that we can correctly find the marker for a state, we can turn our attention to getting the data under it.

So we have two kinds of line:

1. lines that serve as a marker that a new state is starting
2. lines of data about college towns, which all belong to the state above them

We need to decide what we want to do when we hit a new state line.

## the main/sub loop pattern

Recall our previous pattern where we looped over lines and needed a second loop to act on each line.  For example:

``` python
for line in file:
    # loop through the stuff in that line, collecting it as you go
    # then add that collected data to the broader list
    ...
```

Let's consider a final data structure here:


``` python
[ ['state1', 'city1', 'city2'],
  ['state2', 'city3', 'city4'] ]
```

Our primary accumulator is our outermost list (the outside []).  It serves the purpose of collecting all the secondary accumulators.  
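Here is a short sketch of how a structure shaped like this can be read back out, assuming (as in the layout above) that the first item of each inner list is the state and everything after it belongs to that state.  The values are the same placeholders, not real data.

```python
# placeholder values, matching the shape sketched above
allchunks = [
    ["state1", "city1", "city2"],
    ["state2", "city3", "city4"],
]

for chunk in allchunks:
    state = chunk[0]     # the first item is the state
    cities = chunk[1:]   # everything after it belongs to that state
    print(state, "has", len(cities), "cities:", cities)
```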

## Analogy: making info packets

Let's imagine a situation where you are assembling info packets for visiting candidates.  You have a table lined up with all the papers and packets to include.  You go down that table, grabbing one of each and placing it on the bottom of your current pile.  At the end of the table, you staple all the papers together.  Finally, you put the packet into the box holding all the completed packets.

That's a pretty defined accumulator pattern.  You just have to keep grabbing all the papers until there are none left to grab.

Our data is a little different.

## Analogy: a sandwich maker

Say you work at a sub sandwich shop.  A customer comes in and puts in an order.  There are a few standard things to do, but then you get to toppings.  Your job is to keep adding toppings to the sandwich until they tell you they are done. You don't know how many times you'll add stuff to the sandwich, but you'll know when they say they are done.  Then you package up the sandwich and hand it to the customer.

All the customers you have in the day are your primary accumulator, and the secondary accumulators are the single sandwiches that you make.
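The sandwich shop can be sketched in code roughly like this.  Everything here is invented for the analogy: the `make_sandwich` function, the topping names, and the idea that each customer is a list of things they say, ending with `"done"`.

```python
def make_sandwich(requests):
    """Add toppings from requests until the customer says 'done'."""
    sandwich = []   # secondary accumulator: one sandwich
    position = 0
    # keep adding toppings until the customer says they are done
    # (the length check is a safety net in case nobody ever says "done")
    while position < len(requests) and requests[position] != "done":
        sandwich.append(requests[position])
        position += 1
    return sandwich

# each inner list is what one customer says, in order
customers = [
    ["lettuce", "tomato", "done"],
    ["onion", "done"],
]

todays_sandwiches = []   # primary accumulator: everything made today
for requests in customers:
    todays_sandwiches.append(make_sandwich(requests))

print(todays_sandwiches)  # [['lettuce', 'tomato'], ['onion']]
```

The while loop is the natural fit for the toppings because the stopping point comes from the customer, not from a count we know in advance.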


## OK so this still isn't perfect

If finding a state line tells us that we need to start collecting a new bunch of data, then we need to decide what to do with the stuff we have previously collected.  It's like working in a shoe store where a customer gives you a bunch of shoe requests, and by the time you get back to them, they are gone and a new customer has a whole different set of requests.  You're still standing there with a bunch of shoes in your hands.

When you see that new customer, you excuse yourself and put the previous order to the side.  This is more what we are after.  The customer doesn't tell us that we are done; the fact that we see a new customer tells us that the previous order is done.

So here's our plan:

* loop through the lines
* when the line is a city (so not a state):
    * add that data to the secondary accumulator
* when the line is a state:
    * add secondary accumulator package to your primary accumulator
    * reset the secondary accumulator back to empty
    * add the state to that newly empty secondary accumulator
    
We've sort of seen this before, where we had:

```python
allstuff = []

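for line in file:
    newthing = []                  # start a fresh secondary accumulator
    for something in line:
        ...                        # whatever processing each piece needs
        newthing.append(something)
    allstuff.append(newthing)      # save the finished group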
```

`allstuff` is our primary accumulator and `newthing` is our secondary accumulator.  Because there's a very definite place where we want to start over, we have `newthing = []` just inside our primary for loop, with the secondary loop adding to it.  This allows it to reset every time we hit a new group.
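For a concrete version of that pattern, here is a small runnable sketch.  The two comma-separated lines are made up to stand in for a file, and uppercasing each piece is just a placeholder for whatever processing you would really do.

```python
# made-up lines standing in for a file
lines = ["red,green,blue", "cat,dog"]

allstuff = []                      # primary accumulator
for line in lines:
    newthing = []                  # secondary accumulator, reset for every line
    for something in line.split(","):
        newthing.append(something.upper())
    allstuff.append(newthing)      # save the finished group

print(allstuff)  # [['RED', 'GREEN', 'BLUE'], ['CAT', 'DOG']]
```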

In our case, we are using a conditional to check when we need to reset things.  That reset process includes adding the currently accrued contents to the primary accumulator, and then setting the secondary accumulator back to empty.  

In [30]:
fileio = open('crappystatedata.txt', 'r')

text = fileio.read()
fileio.close()
lines = text.split("\n")

allchunks = []
statechunk = []

foundfirst = False

for line in lines:
    if line.endswith("[edit]") and not foundfirst:
        print("This is the first state!", line)
        statechunk.append(line)
        foundfirst = True
    elif line.endswith("[edit]"):
        print("a new state has begun!", line)
        # add completed state
        allchunks.append(statechunk)
        # reset the chunk, starting it off with the new state line
        statechunk = [line]
    else:
        statechunk.append(line)

# after the loop ends, the last state's chunk still needs to be saved
allchunks.append(statechunk)
        

This is the first state! Alabama[edit]
a new state has begun! Alaska[edit]
a new state has begun! Arizona[edit]
a new state has begun! Arkansas[edit]
a new state has begun! California[edit]
a new state has begun! Colorado[edit]
a new state has begun! Connecticut[edit]
a new state has begun! Delaware[edit]
a new state has begun! District of Columbia[edit]
a new state has begun! Florida[edit]
a new state has begun! Georgia[edit]
a new state has begun! Hawaii[edit]
a new state has begun! Idaho[edit]
a new state has begun! Illinois[edit]
a new state has begun! Indiana[edit]
a new state has begun! Iowa[edit]
a new state has begun! Kansas[edit]
a new state has begun! Kentucky[edit]
a new state has begun! Louisiana[edit]
a new state has begun! Maine[edit]
a new state has begun! Maryland[edit]
a new state has begun! Massachusetts[edit]
a new state has begun! Michigan[edit]
a new state has begun! Minnesota[edit]
a new state has begun! Mississippi[edit]
a new state has begun! Missouri[edit]
a ne

In [32]:
allchunks[-1]

['Wyoming[edit]', 'Laramie (University of Wyoming)[13]']
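Since this section is about while loops, here is a sketch of the same grouping done with a while loop in place of the `foundfirst` flag.  It is one possible alternative, not the approach used above.  The function name `group_by_state` and the shortened, hand-copied `sample` list are assumptions made so the example runs without the file.

```python
def group_by_state(lines, marker="[edit]"):
    """Group lines into [state, town, town, ...] chunks."""
    allchunks = []
    position = 0
    while position < len(lines):
        line = lines[position]
        if line.endswith(marker):
            statechunk = [line]   # a new chunk starts with the state line
            position += 1
            # keep collecting until we see the next state or run out of lines
            while position < len(lines) and not lines[position].endswith(marker):
                statechunk.append(lines[position])
                position += 1
            allchunks.append(statechunk)
        else:
            # a line that shows up before any state: skip it
            position += 1
    return allchunks

# a shortened sample copied by hand from the file, not the real data
sample = [
    "Alabama[edit]",
    "Florence (University of North Alabama)",
    "Homewood (Samford University)",
    "Alaska[edit]",
    "Juneau (University of Alaska Southeast)",
    "Wyoming[edit]",
    "Laramie (University of Wyoming)[13]",
]

chunks = group_by_state(sample)
print(len(chunks))   # 3
print(chunks[-1])    # ['Wyoming[edit]', 'Laramie (University of Wyoming)[13]']
```

The inner while loop is the sandwich maker: it keeps adding lines to the current chunk without knowing how many there will be, and stops when the next state shows up.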