## Importing data from the web
Last week, we imported and plotted data from a spreadsheet. This week we generalize the lesson to acquire data from the web. The problem this week is simple: **Repeat last week's assignment, plotting the STD data from the CDC.** ***But this week, get it directly from the internet.*** I hope you see the importance of being able to do this: most data is accessible over the internet now. Being able to write short Python programs to get it could easily make your scientific endeavours more successful.

What follows are the bare necessities for getting some data from the web. In truth, it's a large, complex topic. The libraries we use here should remain useful if you get serious about this in the future.

### `urllib` - defining a web page
First, you need to let Python know that you're dealing with a web page. You might see it as similar to the `open` command that you used to access files, but now you are opening a web page. It's pretty simple to use. For this example we'll be looking at the data on:

https://en.wikipedia.org/wiki/List_of_countries_by_income_equality

This page contains a table of countries and their income-equality measures.

We open this resource with `urllib`, as follows. Afterwards, we parse the data in a simple way to verify that we got what we wanted.


In [9]:
import urllib.request  # In Python 3, urlopen lives in the urllib.request submodule

url = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_countries_by_income_equality")

# Let's just see which lines of this page mention Sweden.
for line in url:
    line = line.decode("utf-8")  # urlopen yields bytes, so decode each line to text
    if line.find("Sweden") != -1:
        print(line)
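
A slightly more defensive way to fetch and search a page can be sketched as follows. In Python 3 the `urlopen` function lives in `urllib.request`. This is only a sketch: the helper names `fetch_text` and `lines_containing` are made up for illustration, and it assumes the page is encoded as UTF-8.

```python
import urllib.request


def fetch_text(address, encoding="utf-8", timeout=10):
    """Download a page and return its contents as one string (hypothetical helper)."""
    # The with-block closes the connection even if reading fails.
    with urllib.request.urlopen(address, timeout=timeout) as page:
        return page.read().decode(encoding)


def lines_containing(text, keyword):
    """Return the lines of text that mention keyword (hypothetical helper)."""
    return [line for line in text.splitlines() if keyword in line]


# Offline demonstration on a small hand-written page.
sample = "<li>Norway</li>\n<li>Sweden</li>\n<li>Denmark</li>"
print(lines_containing(sample, "Sweden"))
```

For the real page, you would pass the result of `fetch_text("https://en.wikipedia.org/wiki/List_of_countries_by_income_equality")` to `lines_containing` instead of `sample`.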

### `BeautifulSoup` - parsing the web page
   
As you can see above, rudimentary parsing can be done on the text that `urllib` returns. However, the structured nature of web pages makes more advanced parsing possible. For example, we can look at tags to find repeating elements. We can also take advantage of the fact that web pages are hierarchical: a title, followed by a subtitle, then sections and sub-sections, etc.

`BeautifulSoup` is a Python library designed to help with parsing web pages. It can quickly find and extract HTML tags to get at the portions of a document you need.

For example, let's get the table from the web page we opened above. To see the tags we are targeting, try pressing `<CTRL><SHIFT> C` in your web browser on the page we are looking at. Each element is opened with `<tag>` and closed with `</tag>`.

In [7]:
import bs4 # This is BeautifulSoup, version 4. It should be included with Anaconda.
import requests # urllib.urlopen does not exist in Python 3 (it moved to urllib.request.urlopen), so we'll use requests instead

# Create an object for parsing the web page.
# There are several 'parsers' available; some are faster, others are more flexible.
# We'll use html.parser.
response = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_income_equality")
html = response.text # BeautifulSoup can't process a Response object, so extract the text first
soup = bs4.BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table.getText()

"\n\n\n\n\n\n\n\n\n\n\n1\n\n\n2\n\n\n3\n\n\n4\n\n\n5\n\n\n6\n\n\n7\n\n\n8\n\n\n9\n\n\n10\n\n\n11\n\n\n12\n\n\n13\n\n\n14\n\n\n15\n\n\n16\n\n\n17\n\n\n18\n\n\n19\n\n\n20\n\n\n21\n\n\n22\n\n\n23\n\n\n24\n\n\n25\n\n\n26\n\n\n27\n\n\n28\n\n\n29\n\n\n30\n\n\n31\n\n\n32\n\n\n33\n\n\n34\n\n\n35\n\n\n36\n\n\n37\n\n\n38\n\n\n39\n\n\n40\n\n\n41\n\n\n42\n\n\n43\n\n\n44\n\n\n45\n\n\n46\n\n\n47\n\n\n48\n\n\n49\n\n\n50\n\n\n51\n\n\n52\n\n\n53\n\n\n54\n\n\n55\n\n\n56\n\n\n57\n\n\n58\n\n\n59\n\n\n60\n\n\n61\n\n\n62\n\n\n63\n\n\n64\n\n\n65\n\n\n66\n\n\n67\n\n\n68\n\n\n69\n\n\n70\n\n\n71\n\n\n72\n\n\n73\n\n\n74\n\n\n75\n\n\n76\n\n\n77\n\n\n78\n\n\n79\n\n\n80\n\n\n81\n\n\n82\n\n\n83\n\n\n84\n\n\n85\n\n\n86\n\n\n87\n\n\n88\n\n\n89\n\n\n90\n\n\n91\n\n\n92\n\n\n93\n\n\n94\n\n\n95\n\n\n96\n\n\n97\n\n\n98\n\n\n99\n\n\n100\n\n\n101\n\n\n102\n\n\n103\n\n\n104\n\n\n105\n\n\n106\n\n\n107\n\n\n108\n\n\n109\n\n\n110\n\n\n111\n\n\n112\n\n\n113\n\n\n114\n\n\n115\n\n\n116\n\n\n117\n\n\n118\n\n\n119\n\n\n120\n\n\n121\n

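`soup.find('table')` returns only the first table on a page, which is not always the one holding the data (the output above appears to contain little more than row numbers). One way to see what is available is to list every table with `find_all` and then select the one you want. The sketch below runs on a small hand-written page, so the HTML, the `class` names, and the variable names are assumptions for illustration, not the structure of the Wikipedia page.

```python
import bs4

# A hand-written stand-in for a downloaded page (not the real Wikipedia HTML).
html = """
<table class="numbers"><tr><td>1</td></tr><tr><td>2</td></tr></table>
<table class="data">
  <tr><th>Country</th><th>Value</th></tr>
  <tr><td><a href="#">Exampleland</a></td><td>30.0</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find() returns the first match only; find_all() returns every match.
tables = soup.find_all("table")
for index, candidate in enumerate(tables):
    print(index, candidate.get("class"), len(candidate.find_all("tr")), "rows")

# Select a table by one of its attributes instead of by position.
data_table = soup.find("table", class_="data")
print(data_table.find("a").get_text())
```

On a real page, the browser's inspector shows which attributes distinguish the table you are after.
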
### Pulling the data out of a table
With the table containing the data extracted from the web page, we can now write a loop over it. This loop looks a lot like loops we've done before, over each row and then having the columns accessible inside the loop.

In [8]:
# For now, just get the countries; refine later to get the data too.
for row in table.find_all('tr'):                     # Loop over all rows (<tr>)
    for col, cell in enumerate(row.find_all('td')):  # Loop over all columns (<td>), counting them
        if cell.find('a'):                           # Observation - the country information is in an <a> tag
            print(cell.find('a').getText(), end="")
        if col == 3:
            print("\t", cell.getText())

	 3
World BankAfghanistanAlbaniaAlgeriaAngolaArgentinaArmeniaAustraliaAustriaAzerbaijanBahrainBangladeshBelarusBelgiumBelizeBeninBhutanBoliviaBosnia and HerzegovinaBotswanaBrazilBulgariaBurkina FasoBurundiCambodiaCameroonCanadaCape VerdeCentral African RepublicChadChileChinaColombiaComorosDR CongoCongo, Republic of theCosta RicaCôte d'IvoireCroatiaCubaCyprusCzech RepublicDenmarkDjiboutiDominican RepublicEcuador[8][8]EgyptEl SalvadorEquatorial GuineaEstoniaEthiopiaEuropean UnionFijiFinlandFranceGabonThe GambiaGeorgiaGermanyGhanaGreeceGuatemalaGuineaGuinea-BissauGuyanaHaitiHondurasHong KongHungaryIcelandIndiaIndonesiaIranIraqIrelandIsraelItalyJamaicaJapanJordanKazakhstanKenyaNorth KoreaSouth KoreaKuwaitKyrgyzstanLaosLatviaLebanonLesothoLiberiaLibyaLithuaniaLuxembourgMacauMacedoniaMadagascarMalawiMalaysiaMaldivesMaliMaltaMauritaniaMauritiusMexicoMoldovaMongoliaMontenegroMoroccoMozambiqueMyanmarNamibiaNepalNetherlands[9]New ZealandNicaraguaNigerNigeriaNorwayOmanPakistanPanamaPapua New Guin
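
Printing is handy for exploring, but for plotting it is easier to collect each row into a list. A minimal sketch, again on hand-written HTML: the table contents are invented, and the choice of column index 2 is an assumption for illustration.

```python
import bs4

# Hand-written sample rows; the names and numbers are invented.
html = """
<table>
  <tr><th>Country</th><th>A</th><th>B</th></tr>
  <tr><td><a href="#">Exampleland</a></td><td>1.5</td><td>30.0</td></tr>
  <tr><td><a href="#">Samplestan</a></td><td>2.5</td><td>40.0</td></tr>
</table>
"""

table = bs4.BeautifulSoup(html, "html.parser").find("table")

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:            # Header rows use <th>, so they have no <td> cells
        continue
    country = cells[0].get_text(strip=True)
    value = float(cells[2].get_text(strip=True))
    records.append((country, value))

print(records)
```

Each entry of `records` is a `(country, value)` pair, which can be unpacked into two lists for plotting.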

### Rejiggering w/ Internet

In [17]:
# This transfers the script from taking information from a file to taking information from the internet.
# It will still run from the command line.
import bs4 # This is BeautifulSoup, version 4. It should be included with Anaconda.
import requests # urllib.urlopen does not exist in Python 3 (it moved to urllib.request.urlopen), so we'll use requests instead

response = requests.get("https://www.cdc.gov/std/stats16/tables/1.htm")
html = response.text # BeautifulSoup can't process a Response object, so extract the text first
soup = bs4.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

for row in table.find_all('tr'):     # Loop over all rows (<tr>)
    for cell in row.find_all('td'):  # Loop over all columns (<td>)
        print(cell)

<td> 1941</td>
<td>485,560</td>
<td> 368.2</td>
<td> 68,231</td>
<td> 51.7</td>
<td>109,018</td>
<td> 82.6</td>
<td>202,984</td>
<td> 153.9</td>
<td> 17,600</td>
<td> 651.1</td>
<td> NR </td>
<td>Â </td>
<td> 193,468</td>
<td> 146.7</td>
<td> 3,384</td>
<td> 2.5</td>
<td> 1942</td>
<td>479,601</td>
<td> 363.4</td>
<td> 75,312</td>
<td> 57.0</td>
<td>116,245</td>
<td> 88.0</td>
<td>202,064</td>
<td> 153.1</td>
<td> 16,918</td>
<td> 566.0</td>
<td> NR </td>
<td>Â </td>
<td> 212,403</td>
<td> 160.9</td>
<td> 5,477</td>
<td> 4.1</td>
<td> 1943</td>
<td>575,593</td>
<td> 447.0</td>
<td> 82,204</td>
<td> 63.8</td>
<td>149,390</td>
<td> 116.0</td>
<td>251,958</td>
<td> 195.7</td>
<td> 16,164</td>
<td> 520.7</td>
<td> NR </td>
<td>Â </td>
<td> 275,070</td>
<td> 213.6</td>
<td> 8,354</td>
<td> 6.4</td>
<td> 1944</td>
<td>467,755</td>
<td> 367.9</td>
<td> 78,443</td>
<td> 61.6</td>
<td>123,038</td>
<td> 96.7</td>
<td>202,848</td>
<td> 159.6</td>
<td> 13,578</td>
<td> 462.0</td>
<td> NR </td>
<td
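
The raw cells above still need cleaning before they can be plotted: the numbers carry thousands separators and padding, `NR` appears where a number is missing, and the stray `Â` is what a UTF-8 non-breaking space looks like when it is decoded as Latin-1. The sketch below shows one way to handle these cases. `parse_cell` is a hypothetical helper, and treating `NR` and empty cells as `None` is an assumption about how you want missing data represented.

```python
def parse_cell(text):
    """Convert the text of one table cell to a float, or None if it holds no number."""
    # Drop a mis-decoded "Â" (U+00C2) and any non-breaking space (U+00A0),
    # then strip ordinary padding.
    cleaned = text.replace("\u00c2", "").replace("\u00a0", "").strip()
    if cleaned in ("", "NR"):
        return None
    return float(cleaned.replace(",", ""))  # "485,560" -> 485560.0


# Why the "Â" shows up: the two UTF-8 bytes of a non-breaking space,
# read as Latin-1, become "Â" followed by a non-breaking space.
mojibake = "\u00a0".encode("utf-8").decode("latin-1")
print(repr(mojibake))

row = [" 1941", "485,560", " 368.2", " NR ", mojibake]
print([parse_cell(cell) for cell in row])
```

If the page really is UTF-8, as the `Â` suggests, setting `response.encoding = "utf-8"` on the `requests` response before reading `response.text` may avoid the stray character at the source.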

In [2]:
# Get stuff off the Internet.
# urllib.urlopen does not exist in Python 3, so I'm using requests instead.
from io import StringIO

import requests
from bs4 import BeautifulSoup
import pandas as pd # using pandas was a good choice, we got stuff going

res = requests.get("https://www.cdc.gov/std/stats16/tables/1.htm").content # we use requests to access the webpage
soup = BeautifulSoup(res, "lxml") # parse the downloaded page with the lxml parser
table = soup.find_all('table')[0] # this finds the first data table
# This is pandas making a list of the data in the table.
# Wrapping the HTML in StringIO hands read_html a file-like object, which newer pandas versions expect.
df_list = pd.read_html(StringIO(str(table)))

for df in df_list:
    print(df)

          Year*                Syphilis     Chlamydia  \
   All Stagesâ Primary  and  Secondary Early  Latent   
          Cases                    Rate         Cases   
0          1941                  485560         368.2   
1          1942                  479601         363.4   
2          1943                  575593         447.0   
3          1944                  467755         367.9   
4          1945                  359114         282.3   
5          1946                  363647         271.7   
6          1947                  355592         252.3   
7          1948                  314313         218.2   
8          1949                  256463         175.3   
9          1950                  217558         146.0   
10         1951                  174924         116.1   
11         1952                  167762         110.2   
12         1953                  148573          95.9   
13         1954                  130697          82.9   
14         1955                
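
The three header rows in the output above come from the table's stacked headings, which pandas stores as a multi-level column index. Joining the levels into single strings can make the columns easier to select. A minimal sketch on a small hand-built frame: the two data rows are copied from the output shown earlier, but the column labels are an assumed reading of the CDC headings, not something pandas produced.

```python
import pandas as pd

# Column labels as (top, middle, bottom) header rows; an assumed reading of the table.
columns = pd.MultiIndex.from_tuples([
    ("Year", "", ""),
    ("Syphilis", "All Stages", "Cases"),
    ("Syphilis", "All Stages", "Rate"),
])
df = pd.DataFrame(
    [[1941, 485560, 368.2], [1942, 479601, 363.4]],
    columns=columns,
)

# Join the non-empty levels of each label into one flat column name.
df.columns = [" ".join(part for part in label if part) for label in df.columns]
print(df.columns.tolist())
print(df["Syphilis All Stages Rate"].tolist())
```

With flat names, a column can be passed straight to a plotting call, for example as the y-values against `df["Year"]`.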

In [3]:
# Let's try another approach.
import requests
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd

res = requests.get("https://www.cdc.gov/std/stats16/tables/1.htm") # we use requests to access the webpage
res = res.text # etree.HTML can only parse strings, so we take the text of the response
html = etree.HTML(res)
tr_s=html
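
With `lxml`, rows and cells can be selected using XPath expressions. The sketch below shows the idea on a hand-written table rather than the CDC page. The variable names are illustrative, and the XPath expressions assume a plain `<table>` with `<tr>` rows and non-empty `<td>` cells.

```python
from lxml import etree

# Hand-written sample using two rows copied from the output shown earlier.
sample = """
<table>
  <tr><th>Year</th><th>Cases</th></tr>
  <tr><td> 1941</td><td>485,560</td></tr>
  <tr><td> 1942</td><td>479,601</td></tr>
</table>
"""

html = etree.HTML(sample)          # Parse a string of HTML into an element tree
rows = []
for tr in html.xpath("//tr"):      # Every <tr> element in the document
    cells = [td.text.strip() for td in tr.xpath("./td")]
    if cells:                      # Skip the header row, which has only <th>
        rows.append(cells)

print(rows)
```

The cell values are still strings at this point, so they need converting to numbers before plotting.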

### Rejiggering w/ Internet + More

In [None]:
#This will be a rejiggering and revamping of the old script.
#The things we will try to do are to use the pandas module and make the script run more the way it is intended