In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Overview of Scraping and Munging Technologies

### Learning goals for this section


Here's a quick summary of the things you should learn in module 1, as well as some supplementary "good-to-know" items.


<table style="border-collapse:collapse;">
    <tr>
    <th>Concept</th>
    <th>Specialized Language</th>
    <th>Tool</th>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td class="concept">distributed version control</td>
    <td>git commands</td>
    <td>`git`</td>
    </tr>
    
    
    <tr class="blank"><td colspan="3"></td></tr>
        
    <tr>
    <td class="concept">strings</td>
    <td>regular expressions</td>
    <td>`import re`</td>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td rowspan="3" class="concept">relational database</td>
    <td rowspan="3">SQL</td>
    <td>Postgres</td>
    </tr>
    <tr>
    <td>`import psycopg2`</td>
    </tr>
    <tr>
    <td>`import sqlalchemy`</td>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td rowspan="3" class="concept">nested key-value data model</td>
    <td rowspan="3">JSON</td>
    <td>Mongo</td>
    </tr>
    <tr>
    <td>Python dictionaries / named tuples </td>
    </tr>
    <tr>
    <td>`import json`</td>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td rowspan="2" class="concept">HTML parse tree/DOM</td>
    <td>CSS selectors</td>
    <td rowspan="2">`from bs4 import BeautifulSoup`</td>
    </tr>
    <tr>
    <td>by-hand tree traversal</td>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td class="concept">DataFrames</td>
    <td></td>
    <td>`import pandas`</td>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td rowspan="2" class="concept">dependency management / analysis pipeline</td>
    <td>make syntax</td>
    <td>the `make` command</td>
    </tr>
    <tr>
    <td></td>
    <td>DIY dependency checks in Python / Bash / etc.</td>
    </tr>
    
    <tr class="blank"><td colspan="3"></td></tr>
    
    <tr>
    <td rowspan="2" class="concept">visualization</td>
    <td>pseudo-MATLAB syntax</td>
    <td>`import matplotlib, matplotlib.pyplot`</td>
    </tr>
    <tr>
    <td>JavaScript-based</td>
    <td>D3, NVD3, `import nvd3`</td>
    </tr>
</table>
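
The "DIY dependency checks in Python" entry in the table can be sketched with nothing more than file modification times, which is the same rule `make` applies: rebuild a target when it is missing or older than any of its inputs. The function name `needs_rebuild` and the file names below are hypothetical, chosen only for illustration.

```python
import os
import tempfile

def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)

# Demonstration in a throwaway directory; timestamps are set explicitly
# with os.utime so the outcome does not depend on how fast the code runs.
with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "raw.csv")      # hypothetical input file
    clean = os.path.join(workdir, "clean.csv")  # hypothetical output file

    with open(raw, "w") as f:
        f.write("a,b\n1,2\n")
    missing_target = needs_rebuild(clean, [raw])  # target not built yet

    with open(clean, "w") as f:
        f.write("a,b\n1,2\n")
    os.utime(raw, (1000, 1000))
    os.utime(clean, (2000, 2000))
    up_to_date = needs_rebuild(clean, [raw])      # target newer than input

    os.utime(raw, (3000, 3000))
    stale = needs_rebuild(clean, [raw])           # input changed afterwards

print(missing_target, up_to_date, stale)
```

A pipeline step would wrap its expensive work in `if needs_rebuild(...)`, so that re-running the whole script only redoes the stages whose inputs have changed.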

## Concepts, languages, and tools


  - A **concept** is an idea that you have to come to terms with.  It can be deep and hard to learn.  These call for lecturing. <small>(Some of the examples above are more legitimate than others; cf. "strings" vs. "relational database.")</small>
  - A **specialized language** is exactly what it sounds like: a more compact notation for working in some domain or with some concept.  It's not so deep but can also be hard to learn.  We learn these through lecture and "immersion."
  - A **tool** is something that we use to do something.  It's usually neither deep nor particularly hard to learn (unless, for example, it has its own language!).  The important thing here is to know what's available, know any gotchas, and to practice -- so that's what we'll do.
  
The relationship between these three things is many-to-many: the same concept can admit many languages and a language can work with several concepts, etc.  But we're simplifying things here, and pretending that it is many-to-one in one direction.

## Concrete tasks in Python


Besides what you learn, there's also why you learn it, i.e. the tasks you need to accomplish.  From this viewpoint we're focusing on three key themes:


### Basic tasks in Python:

  - Cleaning / munging data: aka **`Perl`**-like in Python
  - Basic graphing and numerical operations: aka **`MATLAB`**-like in Python
  - Basic data analysis: aka **`R`**-like in Python
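
As a taste of the "Perl-like" munging task, here is a minimal sketch that uses `re` to pull names and phone numbers out of inconsistently formatted lines. The records, the patterns, and the helper name `clean_record` are all made up for illustration; real data will need patterns tuned to its own quirks.

```python
import re

# Hypothetical messy records, invented for this example.
raw_records = [
    "  Smith, John   (212) 555-0147 ",
    "Doe,Jane 646.555.0199",
    "Lee ,  Ann  no phone",
]

# Phone: optional parentheses around the area code, then digit groups
# separated by any mix of spaces, dots, or dashes.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")
# Name: "Last, First" at the start of the line, with sloppy spacing.
NAME_RE = re.compile(r"^\s*(\w+)\s*,\s*(\w+)")

def clean_record(line):
    """Return (last, first, phone); phone is normalized or None."""
    name = NAME_RE.search(line)
    phone = PHONE_RE.search(line)
    last, first = name.groups() if name else (None, None)
    normalized = "-".join(phone.groups()) if phone else None
    return last, first, normalized

cleaned = [clean_record(line) for line in raw_records]
for row in cleaned:
    print(row)
```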

     

### Getting data in and out of Python:


  - Importing and exporting `.csv` / `.xls` / etc. files
  - Consuming APIs
  - Web scraping
    

### Storing data:

  - SQL
  - Python <-> SQL
  - ORM (Object Relational Mapping) for Python
  - A few words on NoSQL
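
To make the "Python <-> SQL" item concrete without installing a database server, the sketch below uses the standard-library `sqlite3` module with an in-memory database. The table layout and rows are hypothetical. The same pattern (connect, execute with parameters, fetch, commit) carries over to other drivers, although the placeholder style varies by driver.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, zipcode TEXT)"
)

# Pass values separately from the SQL text so the driver handles quoting.
conn.executemany(
    "INSERT INTO clients (name, zipcode) VALUES (?, ?)",
    [("Ada", "10012"), ("Grace", "94117"), ("Alan", "10012")],
)
conn.commit()

rows = conn.execute(
    "SELECT name FROM clients WHERE zipcode = ? ORDER BY name", ("10012",)
).fetchall()
names = [name for (name,) in rows]  # each row is a one-element tuple
print(names)

conn.close()
```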


## Python library cheat sheet


We'll see how to use these tools in greater depth later.  But with the concepts and languages down, they're pretty easy.  Here's a quick reference for when you get there:

**Downloading things:** `urllib`
```python
from urllib.request import urlopen
page = urlopen("http://www.thedataincubator.com/")
print(page.read()[:200])  # first 200 bytes of the response body
```

**Parsing through HTML:** `BeautifulSoup`
```python
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup( urlopen("http://www.thedataincubator.com/"), "html.parser" )
print(soup.select("section#faq div#panel-group div#panel-default")[3].select('strong'))
```

**Higher-level web requests:** `requests`
```python
import requests
params = { 'format':'json', 'q':'1600 Penn. Ave' } 
r = requests.get('http://nominatim.openstreetmap.org/search', params=params )
print(r.json())
```

**"Tidying up" broken HTML:** `tidylib`
```python
from tidylib import tidy_document
# urlopen needs a full URL (with a scheme), so read the local file with open()
tidy_page, errors = tidy_document( open("my_bad_code.html").read() )
```

**Reading legacy Excel (`.xls`) files:** `xlrd`
```python
import xlrd
url="http://www.census.gov/.../List1.xls"
wb = xlrd.open_workbook( file_contents = urlopen(url).read() )
sh1 = wb.sheet_by_index(0)
for i in range(sh1.nrows):
    print(sh1.row_values(i))
```

**Reading (and writing) CSV:** `csv`
```python
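import csv
import io
url="https://www.census.gov/..../NP2012_D1.csv"
# urlopen yields bytes but csv.reader needs text, so decode on the fly
# (UTF-8 is an assumption about the file's encoding)
reader = csv.reader( io.TextIOWrapper(urlopen(url), encoding="utf-8") )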
for r in reader:
    print(r)
```
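
The writing side works the same way in reverse. The following sketch round-trips a couple of made-up rows through `csv.DictWriter` and `csv.DictReader`, using an in-memory `io.StringIO` buffer in place of a real file; the field names and values are hypothetical.

```python
import csv
import io

# Hypothetical rows; values are strings because CSV has no types.
rows = [
    {"name": "alpha", "value": "1"},
    {"name": "beta, inc", "value": "2"},  # embedded comma forces quoting
]

buffer = io.StringIO()  # stands in for open("out.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "value"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Rewind and read the same text back into dictionaries keyed by the header.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)
```

Note that the writer quotes the field containing a comma on its own, which is the main reason to prefer `csv` over joining strings by hand.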

**Reading (and writing) JSON:** `json`
```python
import json
url="http://data.nasa.gov/...?great-images-in-nasa"
response = json.loads( urlopen(url).read() )
print(response.keys(), response['post'])
```
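
For the writing direction, and to show how JSON's nested key-value model lines up with Python dictionaries and lists, here is a small offline round trip. The record is invented for illustration.

```python
import json

# A hypothetical nested record: dicts map to JSON objects, lists to arrays,
# and None to null.
record = {
    "name": "Ada",
    "tags": ["math", "code"],
    "address": {"city": "London", "zip": None},
}

# Serialize to text; indent and sort_keys make the output readable and stable.
text = json.dumps(record, indent=2, sort_keys=True)
print(text)

# Parse the text back into Python objects and drill into the nesting.
restored = json.loads(text)
city = restored["address"]["city"]
print(city)
```

To write straight to a file, `json.dump(record, f)` takes an open file handle `f` instead of returning a string.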

**Efficient numerical code:** `numpy`
```python
import numpy as np
x = np.arange( -1000, 1000, 0.1 )
y = np.sqrt( 1 - np.sin(x) )
print(np.dot( x, y ))
```

**Graphing, plotting, visualization:** `matplotlib`
```python
# with x, y as above
import matplotlib.pyplot as plt
plt.plot( x, y )
```

**Data analysis:** `pandas`
```python
import pandas as pd
url = "https://www.census.gov/..../NP2012_D1.csv"
df = pd.read_csv( url )
print(df.columns, df.dtypes)
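# Combine the filters into one boolean mask rather than chaining [] lookups
mask = (df.ORIGIN == 0) & (df.RACE == 0) & (df.SEX == 0)
df.loc[mask, 'TOTAL_POP'].plot()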
```

**Interfacing Python with Postgres (via raw SQL):** `psycopg2`
```python
import psycopg2
conn = psycopg2.connect("dbname='mydb' user='preygel' host='localhost' password='secret'")
cur = conn.cursor()

# psycopg2 placeholders are %s (not ?); values are passed separately so the driver escapes them
cur.execute( "SELECT * FROM clients WHERE zipcode=%s AND gender=%s", ['10012', 'M'] )
rows = cur.fetchall()
for row in rows:
    print(row)

cur.execute( "INSERT INTO clients (name, zipcode, gender) VALUES (%s, %s, %s)", ['Toly', '94117', 'M'] )
conn.commit()
```

**Interfacing Python with an SQL database (via ORM):** `sqlalchemy`
```python
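from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()  # base class for mapped models (SQLAlchemy 1.4+)

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)

# Assumption: an in-memory SQLite database stands in for your real database URL
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)  # create the users table

Session = sessionmaker(bind=engine)
session = Session()

ed_user = User(name='ed', fullname='Ed Jones', password='edspassword')
ed_user.fullname  # Ed Jones
session.add(ed_user)  # Stage for insertion
session.commit()  # Store in DB

for instance in session.query(User).order_by(User.id):
    print(instance.name, instance.fullname)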
```

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*