# Strings, Lists, and Loops for HTML

This notebook begins to introduce more complex data structures
and how to manipulate them in Python programs.

Here we will look at:

- Strings, used to represent text fragments.
- Lists, used to organize data into sequences.
- Loops, used to operate on each element of a data structure.

The notebook is in two parts:

- In the first part all the code is given ready to execute.  This
part introduces the concepts.

- In the second part the code is given with `FILL_IN_THE_BLANK` placeholders where
the user must apply the material introduced in the first part to complete the
computational workflow.

# Introduction:  Parsing presidential information

Let's look at manipulating some text strings obtained from
<a href="https://www.presidentsusa.net/birth.html">https://www.presidentsusa.net/birth.html</a>
which lists information about the presidents.

<img src="presidents.png" width="600">

The goal of this exercise is to automatically format the extracted information
as an HTML report.  Extracting data from web pages in this way is called
"web scraping" and a lot of people do it full time.

The first exercise is completed in the notebook.  After that the user
is asked to solve a similar exercise by filling in the blanks.

First look at the header for the information as a Python string.

In [3]:
headers_string = "President 	Birth Date 	Birth Place 	Death Date 	Location of Death"
print (headers_string)

President 	Birth Date 	Birth Place 	Death Date 	Location of Death


In [4]:
# The TAB characters in the string (shown as "arrows" in some editors) print as blank space.

# To use the headers in a program we need to break them out individually.  
# The "split" function for a string will do this, but not without a hint of where to split:
headers_string.split()

['President',
 'Birth',
 'Date',
 'Birth',
 'Place',
 'Death',
 'Date',
 'Location',
 'of',
 'Death']

In [5]:
# Here the split divided "Birth Date" into "Birth" and "Date".
# To get the right words for each header together we need to split on the TAB character
headers = headers_string.split("\t")

# Now we have the header names stored as a list of strings in the variable named headers.
headers

['President ',
 'Birth Date ',
 'Birth Place ',
 'Death Date ',
 'Location of Death']
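
The difference between the two calls to `split` can be seen side by side in a minimal sketch. The sample line below is hypothetical (it is not taken from the web site) and the variable names are chosen only for illustration.

```python
# A hypothetical header line with TAB-separated, multi-word fields.
line = "First Name \tLast Name \tHome Town"

# With no argument, split() breaks on any run of whitespace,
# so multi-word fields are broken apart.
words = line.split()

# With "\t", split() breaks only at TAB characters, so each field
# stays together (any trailing space is kept as well).
fields = line.split("\t")

print(words)   # ['First', 'Name', 'Last', 'Name', 'Home', 'Town']
print(fields)  # ['First Name ', 'Last Name ', 'Home Town']
```

Note that the fields still carry trailing spaces after splitting on TAB, which is why a later step strips them.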

In [22]:
# Below we demonstrate some useful methods for lists using the headers.

# indexing:
print ("the first header is", headers[0])
print ("the second header is", headers[1])
print ("the last header is", headers[-1])

# slicing
print ("the first two headers are", headers[:2])
print ("the fourth to the last header are", headers[3:])

# length
print ("there are", len(headers), "headers")

# simple looping
print("Here are the headers:")
for header in headers:
    print ("  here's one:", header)

# enumeration in looping
for (index, header) in enumerate(headers):
    print ("   ", header, "lives at index", index)
    
# and there are many more.

the first header is President 
the second header is Birth Date 
the last header is Location of Death
the first two headers are ['President ', 'Birth Date ']
the fourth to the last header are ['Death Date ', 'Location of Death']
there are 5 headers
Here are the headers:
  here's one: President 
  here's one: Birth Date 
  here's one: Birth Place 
  here's one: Death Date 
  here's one: Location of Death
    President  lives at index 0
    Birth Date  lives at index 1
    Birth Place  lives at index 2
    Death Date  lives at index 3
    Location of Death lives at index 4


In [6]:
# Now look at some data extracted from the web site.

# To save some space I only captured the first 3 entries, but the 
# automated methods we will develop could work on all of the data
# (and also on much bigger data sets).

data_string = """  
George Washington 	Feb 22, 1732 	Westmoreland Co., Va. 	Dec 14, 1799 	Mount Vernon, Va.
John Adams 	Oct 30, 1735 	Quincy, Mass. 	July 4, 1826 	Quincy, Mass.
Thomas Jefferson 	Apr 13, 1743 	Albemarle Co., Va. 	July 4, 1826 	Albemarle Co., Va.
  """
print(data_string)
data_string

  
George Washington 	Feb 22, 1732 	Westmoreland Co., Va. 	Dec 14, 1799 	Mount Vernon, Va.
John Adams 	Oct 30, 1735 	Quincy, Mass. 	July 4, 1826 	Quincy, Mass.
Thomas Jefferson 	Apr 13, 1743 	Albemarle Co., Va. 	July 4, 1826 	Albemarle Co., Va.
  


'  \nGeorge Washington \tFeb 22, 1732 \tWestmoreland Co., Va. \tDec 14, 1799 \tMount Vernon, Va.\nJohn Adams \tOct 30, 1735 \tQuincy, Mass. \tJuly 4, 1826 \tQuincy, Mass.\nThomas Jefferson \tApr 13, 1743 \tAlbemarle Co., Va. \tJuly 4, 1826 \tAlbemarle Co., Va.\n  '

In [7]:
# When we "print" the string we see separate lines, but when
# we view the string "value" we see all of the characters, including the "newline"
# character "\n" which separates lines.

# The first thing to notice is that there is "extra junk whitespace"
# at the beginning and end of the string.  We can strip that using the strip method.
data_string = data_string.strip()
data_string

'George Washington \tFeb 22, 1732 \tWestmoreland Co., Va. \tDec 14, 1799 \tMount Vernon, Va.\nJohn Adams \tOct 30, 1735 \tQuincy, Mass. \tJuly 4, 1826 \tQuincy, Mass.\nThomas Jefferson \tApr 13, 1743 \tAlbemarle Co., Va. \tJuly 4, 1826 \tAlbemarle Co., Va.'
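
As a small aside, `strip` has two one-sided relatives, `lstrip` and `rstrip`. The sketch below uses a made-up string (an assumption for illustration only) to show what each one removes.

```python
# A made-up string with whitespace on both ends.
padded = "  \n hello world \t\n"

both = padded.strip()    # removes whitespace from both ends
left = padded.lstrip()   # removes whitespace from the left end only
right = padded.rstrip()  # removes whitespace from the right end only

# repr() shows the invisible characters explicitly.
print(repr(both))   # 'hello world'
print(repr(left))   # 'hello world \t\n'
print(repr(right))  # '  \n hello world'
```

In every case the space inside `hello world` is untouched: these methods only trim the ends of the string.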

In [8]:
# In fact let's strip the extra white space from all of the headers too
# using a "list comprehension" which calls strip for each of the header strings:

header_strings = [header.strip() for header in headers]
header_strings

['President', 'Birth Date', 'Birth Place', 'Death Date', 'Location of Death']
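
A list comprehension like this one is shorthand for a loop that appends to a new list. The sketch below writes both forms for a short sample list (the names `raw`, `stripped_loop` and `stripped_comp` are illustrative, not part of the notebook).

```python
raw = ["President ", "Birth Date ", "Birth Place "]

# Explicit loop: start with an empty list and append each stripped item.
stripped_loop = []
for item in raw:
    stripped_loop.append(item.strip())

# List comprehension: the same result in a single expression.
stripped_comp = [item.strip() for item in raw]

print(stripped_loop)  # ['President', 'Birth Date', 'Birth Place']
print(stripped_loop == stripped_comp)  # True
```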

In [9]:
# Each of the "President" records is separated from the others by a newline.
# Again "split" on newline will break them out into separate substrings.

data_substrings = data_string.split("\n")
data_substrings

['George Washington \tFeb 22, 1732 \tWestmoreland Co., Va. \tDec 14, 1799 \tMount Vernon, Va.',
 'John Adams \tOct 30, 1735 \tQuincy, Mass. \tJuly 4, 1826 \tQuincy, Mass.',
 'Thomas Jefferson \tApr 13, 1743 \tAlbemarle Co., Va. \tJuly 4, 1826 \tAlbemarle Co., Va.']

In [13]:
# To break the first substring into individual values, one for each header, split on TAB
first_substring = data_substrings[0]
first_substring.split("\t")

['George Washington ',
 'Feb 22, 1732 ',
 'Westmoreland Co., Va. ',
 'Dec 14, 1799 ',
 'Mount Vernon, Va.']

In [14]:
# We can split *all* of the substrings using a list comprehension
data_lists = [substring.split("\t") for substring in data_substrings]
data_lists

[['George Washington ',
  'Feb 22, 1732 ',
  'Westmoreland Co., Va. ',
  'Dec 14, 1799 ',
  'Mount Vernon, Va.'],
 ['John Adams ',
  'Oct 30, 1735 ',
  'Quincy, Mass. ',
  'July 4, 1826 ',
  'Quincy, Mass.'],
 ['Thomas Jefferson ',
  'Apr 13, 1743 ',
  'Albemarle Co., Va. ',
  'July 4, 1826 ',
  'Albemarle Co., Va.']]
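
The values above still carry trailing spaces. One possible way to split and strip in a single step is a nested list comprehension. This is only a sketch on a small hypothetical sample, not the approach the notebook itself takes.

```python
# Hypothetical two-record sample: lines separated by "\n", fields by "\t".
text = "Ada Lovelace \tLondon \nAlan Turing \tLondon"

# Outer comprehension: one entry per line.
# Inner comprehension: one stripped field per TAB-separated value.
rows = [
    [field.strip() for field in line.split("\t")]
    for line in text.split("\n")
]

print(rows)  # [['Ada Lovelace', 'London'], ['Alan Turing', 'London']]
```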

In [18]:
# Now we have the headers and several records all broken out as values.
# There are better and fancier ways to do this, but below I use indexing
# and looping to print the headers together with the individual records

for data_list in data_lists:
    print() # blank line to separate records
    for (index, header) in enumerate(headers):
        value = data_list[index]
        print (header, ":", value)


President  : George Washington 
Birth Date  : Feb 22, 1732 
Birth Place  : Westmoreland Co., Va. 
Death Date  : Dec 14, 1799 
Location of Death : Mount Vernon, Va.

President  : John Adams 
Birth Date  : Oct 30, 1735 
Birth Place  : Quincy, Mass. 
Death Date  : July 4, 1826 
Location of Death : Quincy, Mass.

President  : Thomas Jefferson 
Birth Date  : Apr 13, 1743 
Birth Place  : Albemarle Co., Va. 
Death Date  : July 4, 1826 
Location of Death : Albemarle Co., Va.
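
One of the "fancier" alternatives to indexing is the built-in `zip`, which pairs up the items of two lists position by position. The sketch below uses two headers and the matching values from the first record. The variable names are chosen for illustration.

```python
names = ["President", "Birth Date"]
record = ["George Washington ", "Feb 22, 1732 "]

# zip yields (name, value) pairs, so no index variable is needed.
lines = []
for name, value in zip(names, record):
    lines.append("%s: %s" % (name, value.strip()))

for line in lines:
    print(line)
# President: George Washington
# Birth Date: Feb 22, 1732
```

Note that `zip` stops at the end of the shorter list, so it silently ignores extra items if the two lists differ in length.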


In [46]:
# Embed as HTML
HTML_List = []
for data_list in data_lists:
    HTML_List.append("<br/>")  # blank line
    for (index, header) in enumerate(headers):
        value = data_list[index]
        element = "<b> %s: </b> <em>%s</em> <br/>" % (header, value)
        HTML_List.append(element)
HTML_string = "\n".join(HTML_List)
print(HTML_string)

<br/>
<b> President : </b> <em>George Washington </em> <br/>
<b> Birth Date : </b> <em>Feb 22, 1732 </em> <br/>
<b> Birth Place : </b> <em>Westmoreland Co., Va. </em> <br/>
<b> Death Date : </b> <em>Dec 14, 1799 </em> <br/>
<b> Location of Death: </b> <em>Mount Vernon, Va.</em> <br/>
<br/>
<b> President : </b> <em>John Adams </em> <br/>
<b> Birth Date : </b> <em>Oct 30, 1735 </em> <br/>
<b> Birth Place : </b> <em>Quincy, Mass. </em> <br/>
<b> Death Date : </b> <em>July 4, 1826 </em> <br/>
<b> Location of Death: </b> <em>Quincy, Mass.</em> <br/>
<br/>
<b> President : </b> <em>Thomas Jefferson </em> <br/>
<b> Birth Date : </b> <em>Apr 13, 1743 </em> <br/>
<b> Birth Place : </b> <em>Albemarle Co., Va. </em> <br/>
<b> Death Date : </b> <em>July 4, 1826 </em> <br/>
<b> Location of Death: </b> <em>Albemarle Co., Va.</em> <br/>


In [47]:
from IPython.display import HTML, display
display(HTML(HTML_string))
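
The presidential data contains no characters that are special in HTML, but scraped text in general may contain `<`, `>` or `&`. A cautious sketch, assuming the standard library function `html.escape` is acceptable here, escapes each value before placing it in the markup. The value below is hypothetical.

```python
import html

header = "Birth Place"
value = "Smith & Sons <Main St.>"  # hypothetical value with special characters

# Escape the text first, then substitute it into the HTML template.
element = "<b>%s:</b> <em>%s</em><br/>" % (html.escape(header), html.escape(value))

print(element)
# <b>Birth Place:</b> <em>Smith &amp; Sons &lt;Main St.&gt;</em><br/>
```

Without escaping, a browser could interpret `<Main St.>` as a tag rather than as text.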

# Exercise: 4 columns to 2 columns to HTML

The following string is pasted from https://state.1keydata.com/state-capitals.php

<img src="capitals.png" width="500"/>

The goal of the exercise is to parse out the state/capital pairs and print
out the states and capitals in a formatted report that looks something like
this:

<img src="states_report.png" width="400"/>

In [43]:
State_Capitals_strings = """
US State 	State Capital 	US State 	State Capital
Alabama 	Montgomery 	Montana 	Helena
Alaska 	Juneau 	Nebraska 	Lincoln
Arizona 	Phoenix 	Nevada 	Carson City
Arkansas 	Little Rock 	New Hampshire 	Concord
California 	Sacramento 	New Jersey 	Trenton
Colorado 	Denver 	New Mexico 	Santa Fe
Connecticut 	Hartford 	New York 	Albany
Delaware 	Dover 	North Carolina 	Raleigh
Florida 	Tallahassee 	North Dakota 	Bismarck
Georgia 	Atlanta 	Ohio 	Columbus
Hawaii 	Honolulu 	Oklahoma 	Oklahoma City
Idaho 	Boise 	Oregon 	Salem
Illinois 	Springfield 	Pennsylvania 	Harrisburg
Indiana 	Indianapolis 	Rhode Island 	Providence
Iowa 	Des Moines 	South Carolina 	Columbia
Kansas 	Topeka 	South Dakota 	Pierre
Kentucky 	Frankfort 	Tennessee 	Nashville
Louisiana 	Baton Rouge 	Texas 	Austin
Maine 	Augusta 	Utah 	Salt Lake City
Maryland 	Annapolis 	Vermont 	Montpelier
Massachusetts 	Boston 	Virginia 	Richmond
Michigan 	Lansing 	Washington 	Olympia
Minnesota 	St. Paul 	West Virginia 	Charleston
Mississippi 	Jackson 	Wisconsin 	Madison
Missouri 	Jefferson City 	Wyoming 	Cheyenne
"""

In [25]:
# get rid of surrounding whitespace
State_Capitals_strings = FILL_IN_THE_BLANK

In [26]:
# Split into lines
State_Capitals_lines = FILL_IN_THE_BLANK
State_Capitals_lines

['US State \tState Capital \tUS State \tState Capital',
 'Alabama \tMontgomery \tMontana \tHelena',
 'Alaska \tJuneau \tNebraska \tLincoln',
 'Arizona \tPhoenix \tNevada \tCarson City',
 'Arkansas \tLittle Rock \tNew Hampshire \tConcord',
 'California \tSacramento \tNew Jersey \tTrenton',
 'Colorado \tDenver \tNew Mexico \tSanta Fe',
 'Connecticut \tHartford \tNew York \tAlbany',
 'Delaware \tDover \tNorth Carolina \tRaleigh',
 'Florida \tTallahassee \tNorth Dakota \tBismarck',
 'Georgia \tAtlanta \tOhio \tColumbus',
 'Hawaii \tHonolulu \tOklahoma \tOklahoma City',
 'Idaho \tBoise \tOregon \tSalem',
 'Illinois \tSpringfield \tPennsylvania \tHarrisburg',
 'Indiana \tIndianapolis \tRhode Island \tProvidence',
 'Iowa \tDes Moines \tSouth Carolina \tColumbia',
 'Kansas \tTopeka \tSouth Dakota \tPierre',
 'Kentucky \tFrankfort \tTennessee \tNashville',
 'Louisiana \tBaton Rouge \tTexas \tAustin',
 'Maine \tAugusta \tUtah \tSalt Lake City',
 'Maryland \tAnnapolis \tVermont \tMontpelier',
 'M

In [28]:
# Split lines into lists
FILL_IN_THE_BLANK
State_capital_lists

[['US State ', 'State Capital ', 'US State ', 'State Capital'],
 ['Alabama ', 'Montgomery ', 'Montana ', 'Helena'],
 ['Alaska ', 'Juneau ', 'Nebraska ', 'Lincoln'],
 ['Arizona ', 'Phoenix ', 'Nevada ', 'Carson City'],
 ['Arkansas ', 'Little Rock ', 'New Hampshire ', 'Concord'],
 ['California ', 'Sacramento ', 'New Jersey ', 'Trenton'],
 ['Colorado ', 'Denver ', 'New Mexico ', 'Santa Fe'],
 ['Connecticut ', 'Hartford ', 'New York ', 'Albany'],
 ['Delaware ', 'Dover ', 'North Carolina ', 'Raleigh'],
 ['Florida ', 'Tallahassee ', 'North Dakota ', 'Bismarck'],
 ['Georgia ', 'Atlanta ', 'Ohio ', 'Columbus'],
 ['Hawaii ', 'Honolulu ', 'Oklahoma ', 'Oklahoma City'],
 ['Idaho ', 'Boise ', 'Oregon ', 'Salem'],
 ['Illinois ', 'Springfield ', 'Pennsylvania ', 'Harrisburg'],
 ['Indiana ', 'Indianapolis ', 'Rhode Island ', 'Providence'],
 ['Iowa ', 'Des Moines ', 'South Carolina ', 'Columbia'],
 ['Kansas ', 'Topeka ', 'South Dakota ', 'Pierre'],
 ['Kentucky ', 'Frankfort ', 'Tennessee ', 'Nashville

In [30]:
# Get the headers list
state_headers = FILL_IN_THE_BLANK
# Get the data lists
state_data_lists = FILL_IN_THE_BLANK
state_headers

['US State ', 'State Capital ', 'US State ', 'State Capital']

In [31]:
state_data_lists

[['Alabama ', 'Montgomery ', 'Montana ', 'Helena'],
 ['Alaska ', 'Juneau ', 'Nebraska ', 'Lincoln'],
 ['Arizona ', 'Phoenix ', 'Nevada ', 'Carson City'],
 ['Arkansas ', 'Little Rock ', 'New Hampshire ', 'Concord'],
 ['California ', 'Sacramento ', 'New Jersey ', 'Trenton'],
 ['Colorado ', 'Denver ', 'New Mexico ', 'Santa Fe'],
 ['Connecticut ', 'Hartford ', 'New York ', 'Albany'],
 ['Delaware ', 'Dover ', 'North Carolina ', 'Raleigh'],
 ['Florida ', 'Tallahassee ', 'North Dakota ', 'Bismarck'],
 ['Georgia ', 'Atlanta ', 'Ohio ', 'Columbus'],
 ['Hawaii ', 'Honolulu ', 'Oklahoma ', 'Oklahoma City'],
 ['Idaho ', 'Boise ', 'Oregon ', 'Salem'],
 ['Illinois ', 'Springfield ', 'Pennsylvania ', 'Harrisburg'],
 ['Indiana ', 'Indianapolis ', 'Rhode Island ', 'Providence'],
 ['Iowa ', 'Des Moines ', 'South Carolina ', 'Columbia'],
 ['Kansas ', 'Topeka ', 'South Dakota ', 'Pierre'],
 ['Kentucky ', 'Frankfort ', 'Tennessee ', 'Nashville'],
 ['Louisiana ', 'Baton Rouge ', 'Texas ', 'Austin'],
 ['Main

In [32]:
# Get the first two columns
cols1_2 = [L[:2] for L in state_data_lists]
cols1_2

[['Alabama ', 'Montgomery '],
 ['Alaska ', 'Juneau '],
 ['Arizona ', 'Phoenix '],
 ['Arkansas ', 'Little Rock '],
 ['California ', 'Sacramento '],
 ['Colorado ', 'Denver '],
 ['Connecticut ', 'Hartford '],
 ['Delaware ', 'Dover '],
 ['Florida ', 'Tallahassee '],
 ['Georgia ', 'Atlanta '],
 ['Hawaii ', 'Honolulu '],
 ['Idaho ', 'Boise '],
 ['Illinois ', 'Springfield '],
 ['Indiana ', 'Indianapolis '],
 ['Iowa ', 'Des Moines '],
 ['Kansas ', 'Topeka '],
 ['Kentucky ', 'Frankfort '],
 ['Louisiana ', 'Baton Rouge '],
 ['Maine ', 'Augusta '],
 ['Maryland ', 'Annapolis '],
 ['Massachusetts ', 'Boston '],
 ['Michigan ', 'Lansing '],
 ['Minnesota ', 'St. Paul '],
 ['Mississippi ', 'Jackson '],
 ['Missouri ', 'Jefferson City ']]

In [34]:
# Get the last two columns
cols3_4 = FILL_IN_THE_BLANK
cols3_4

[['Montana ', 'Helena'],
 ['Nebraska ', 'Lincoln'],
 ['Nevada ', 'Carson City'],
 ['New Hampshire ', 'Concord'],
 ['New Jersey ', 'Trenton'],
 ['New Mexico ', 'Santa Fe'],
 ['New York ', 'Albany'],
 ['North Carolina ', 'Raleigh'],
 ['North Dakota ', 'Bismarck'],
 ['Ohio ', 'Columbus'],
 ['Oklahoma ', 'Oklahoma City'],
 ['Oregon ', 'Salem'],
 ['Pennsylvania ', 'Harrisburg'],
 ['Rhode Island ', 'Providence'],
 ['South Carolina ', 'Columbia'],
 ['South Dakota ', 'Pierre'],
 ['Tennessee ', 'Nashville'],
 ['Texas ', 'Austin'],
 ['Utah ', 'Salt Lake City'],
 ['Vermont ', 'Montpelier'],
 ['Virginia ', 'Richmond'],
 ['Washington ', 'Olympia'],
 ['West Virginia ', 'Charleston'],
 ['Wisconsin ', 'Madison'],
 ['Wyoming ', 'Cheyenne']]

In [35]:
# Put all pairs together
all_pairs = cols3_4 + cols1_2

# Sort by state names
all_pairs = sorted(all_pairs)

all_pairs

[['Alabama ', 'Montgomery '],
 ['Alaska ', 'Juneau '],
 ['Arizona ', 'Phoenix '],
 ['Arkansas ', 'Little Rock '],
 ['California ', 'Sacramento '],
 ['Colorado ', 'Denver '],
 ['Connecticut ', 'Hartford '],
 ['Delaware ', 'Dover '],
 ['Florida ', 'Tallahassee '],
 ['Georgia ', 'Atlanta '],
 ['Hawaii ', 'Honolulu '],
 ['Idaho ', 'Boise '],
 ['Illinois ', 'Springfield '],
 ['Indiana ', 'Indianapolis '],
 ['Iowa ', 'Des Moines '],
 ['Kansas ', 'Topeka '],
 ['Kentucky ', 'Frankfort '],
 ['Louisiana ', 'Baton Rouge '],
 ['Maine ', 'Augusta '],
 ['Maryland ', 'Annapolis '],
 ['Massachusetts ', 'Boston '],
 ['Michigan ', 'Lansing '],
 ['Minnesota ', 'St. Paul '],
 ['Mississippi ', 'Jackson '],
 ['Missouri ', 'Jefferson City '],
 ['Montana ', 'Helena'],
 ['Nebraska ', 'Lincoln'],
 ['Nevada ', 'Carson City'],
 ['New Hampshire ', 'Concord'],
 ['New Jersey ', 'Trenton'],
 ['New Mexico ', 'Santa Fe'],
 ['New York ', 'Albany'],
 ['North Carolina ', 'Raleigh'],
 ['North Dakota ', 'Bismarck'],
 ['Ohio
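
The call to `sorted` works here because Python compares lists item by item: the first items are compared first, and later items only break ties. The sketch below shows this on a small made-up list, along with an optional `key` function for sorting on a different item.

```python
# A small made-up list of pairs.
pairs = [["pear", "green"], ["apple", "red"], ["apple", "green"]]

# Default ordering: by first item, then by second item to break ties.
ordered = sorted(pairs)
print(ordered)  # [['apple', 'green'], ['apple', 'red'], ['pear', 'green']]

# A key function sorts on the second item of each pair only.
by_second = sorted(pairs, key=lambda pair: pair[1])
print(by_second)  # [['pear', 'green'], ['apple', 'green'], ['apple', 'red']]

# sorted() returns a new list; the original is left unchanged.
print(pairs[0])  # ['pear', 'green']
```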

In [37]:
# print out the states and capitals
for FILL_IN_THE_BLANK in all_pairs:
    print(FILL_IN_THE_BLANK)

The capital of Alabama  is Montgomery  .
The capital of Alaska  is Juneau  .
The capital of Arizona  is Phoenix  .
The capital of Arkansas  is Little Rock  .
The capital of California  is Sacramento  .
The capital of Colorado  is Denver  .
The capital of Connecticut  is Hartford  .
The capital of Delaware  is Dover  .
The capital of Florida  is Tallahassee  .
The capital of Georgia  is Atlanta  .
The capital of Hawaii  is Honolulu  .
The capital of Idaho  is Boise  .
The capital of Illinois  is Springfield  .
The capital of Indiana  is Indianapolis  .
The capital of Iowa  is Des Moines  .
The capital of Kansas  is Topeka  .
The capital of Kentucky  is Frankfort  .
The capital of Louisiana  is Baton Rouge  .
The capital of Maine  is Augusta  .
The capital of Maryland  is Annapolis  .
The capital of Massachusetts  is Boston  .
The capital of Michigan  is Lansing  .
The capital of Minnesota  is St. Paul  .
The capital of Mississippi  is Jackson  .
The capital of Missouri  is Jefferson Cit

In [48]:
# Embed as HTML
state_HTML_List = []
for FILL_IN_THE_BLANK:
    element = "The capital of <b> %s: </b> is <em>%s</em> <br/>" % FILL_IN_THE_BLANK
    state_HTML_List.append(element)
state_HTML_string = "\n".join(state_HTML_List)

display(HTML(state_HTML_string))