# Setting up your development environment

1. Clone the state scraper repo from GitHub

```
git clone https://github.com/influence-usa/scrapers-us-state.git
```

2. Make a new virtualenv

```
mkvirtualenv --python=$(which python3) <name>
```
  
3. Use `pip` to install requirements

```
pip install -r requirements.txt
```
  
4. If you don't see a folder for the state you're working on, run the following:

```
(iusa-scrape)$>pupa init arizona
no pupa_settings on path, using defaults
jurisdiction name (e.g. City of Seattle): Arizona
division id (e.g. ocd-division/country:us/state:wa/place:seattle): ocd-division/country:us/state:az
classification (can be: government, legislature, executive, school_system): government
official URL: http://www.az.gov/
create disclosures scraper? [Y/n]: Y
create bills scraper? [y/N]: n
create events scraper? [y/N]: y
create votes scraper? [y/N]: n
create people scraper? [y/N]: y
```

Running `pupa init` created a new folder for the state. In this example, the state is Arizona (`arizona`).

```
(iusa-scrape)$>tree
.
├── ak
│   └── __init__.py
├── al
│   ├── __init__.py
│   └── people.py
├── arizona
│   ├── disclosures.py
│   ├── events.py
│   ├── __init__.py
│   └── people.py
├── md
│   └── __init__.py
├── README.md
├── requirements.txt
├── Untitled.ipynb
└── utils
    ├── __init__.py
    └── lxmlize.py
```
  
To follow the broader pupa convention, we'll change the directory name to `az`:

```
    (iusa-scrape)$>mv arizona az
    (iusa-scrape)$>tree
    .
    ├── ak
    │   └── __init__.py
    ├── al
    │   ├── __init__.py
    │   └── people.py
    ├── az
    │   ├── disclosures.py
    │   ├── events.py
    │   ├── __init__.py
    │   └── people.py
    ├── md
    │   └── __init__.py
    ├── README.md
    ├── requirements.txt
    ├── Untitled.ipynb
    └── utils
        ├── __init__.py
        └── lxmlize.py

5 directories, 13 files
```
  
Because we told it to in the questions asked above, it also created the starter code for our scrapers: there's one each for disclosures, events and people.

Also interesting is the `__init__.py` file in our state's directory.  It used the answers to our questions to build a `Jurisdiction` object that represents the state government:

```python
class Arizona(Jurisdiction):
    division_id = "ocd-division/country:us/state:az"
    classification = "government"
    name = "Arizona"
    url = "https://az.gov/"
    scrapers = {
        "events": ArizonaEventScraper,
        "people": ArizonaPersonScraper,
        "disclosures": ArizonaDisclosureScraper,
    }

    def get_organizations(self):
        yield Organization(name=None, classification=None)
```
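
The `division_id` above is a chain of `type:value` pairs after the `ocd-division/` prefix. As a minimal sketch for sanity-checking what you typed at the prompt, the snippet below splits such an ID into its parts. The helper name `parse_division_id` is hypothetical and is not part of pupa.

```python
def parse_division_id(division_id):
    """Split an OCD division ID into an ordered list of (type, value) pairs.

    Hypothetical helper for illustration; pupa does not provide it.
    """
    prefix = "ocd-division/"
    if not division_id.startswith(prefix):
        raise ValueError("not an OCD division ID: {!r}".format(division_id))
    pairs = []
    for segment in division_id[len(prefix):].split("/"):
        # Each segment looks like "state:az"; split on the first colon only.
        kind, _, value = segment.partition(":")
        pairs.append((kind, value))
    return pairs


arizona = parse_division_id("ocd-division/country:us/state:az")
seattle = parse_division_id("ocd-division/country:us/state:wa/place:seattle")
print(arizona)
print(seattle)
```

If the last pair is not the state you expected, fix the `division_id` in `__init__.py` before going further.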

# Version control

This is a good time to add and commit our changes so far.

```terminal
(iusa-scrape)$>git add az
(iusa-scrape)$>git commit -m "initialized arizona"
[master 3e622ef] initialized arizona
 4 files changed, 47 insertions(+)
 create mode 100644 az/__init__.py
 create mode 100644 az/disclosures.py
 create mode 100644 az/events.py
 create mode 100644 az/people.py
 ```

# Where is the data?

# Creating global authority organizations

## Create the Secretary of State

The `get_organizations` method of the `Jurisdiction` class lets us define some global organizations for all of the data that we'll be scraping from Arizona's sites. For campaign finance disclosures, we'll have to define the Arizona Secretary of State's Office.

```python
def get_organizations(self):                                        
```

First, initialize using the `Organization` class.

```python
    secretary_of_state = Organization(                                    
        name="Office of the Secretary of State, State of Arizona",        
        classification="office"                                           
    )
```
    
Here, we're able to set particular attributes using `kwargs`.  To get a sense of which attributes you can set at this point, check out the [source](https://github.com/influence-usa/pupa/blob/disclosures/pupa/scrape/popolo.py#L132-L182).

Now we can add other attributes using the helper methods found on the `Organization` class:

```python
    secretary_of_state.add_contact_detail(                                
        type="voice",                                                     
        value="602-542-4285"                                              
    )                    

    secretary_of_state.add_contact_detail(                                
        type="address",                                                   
        value="1700 W Washington St Fl 7, Phoenix AZ 85007-2808"          
    )                                                                     
    secretary_of_state.add_link(                                          
        url="http://www.azsos.gov/",                                      
        note="Home page"                                                  
    )                                                                     
```

We should add the organization we've created to the `Jurisdiction` object as a semi-private property. This is useful because the `Jurisdiction` object will essentially always be accessible to all of our scrapers. Whenever we want to refer to the AZ Secretary of State, we can access it from the `Arizona` jurisdiction object.

```python
    self._secretary_of_state = secretary_of_state                   
```

Finally, yield the organization we created. This is because `get_organizations` is actually the first scraper that runs each time we run Arizona scrapers of any kind.

```python
    yield secretary_of_state                                          
```
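
Put together, the pieces above form a single generator method. The sketch below shows the assembled shape so you can run it without pupa installed. Both classes are stand-ins written for this example: the real `Organization` comes from pupa and the real method lives on the `Arizona` class in `az/__init__.py`.

```python
class Organization:
    """Stand-in for pupa's Organization, just enough to run this sketch."""

    def __init__(self, name, classification):
        self.name = name
        self.classification = classification
        self.contact_details = []
        self.links = []

    def add_contact_detail(self, type, value):
        self.contact_details.append({"type": type, "value": value})

    def add_link(self, url, note=""):
        self.links.append({"url": url, "note": note})


class ArizonaSketch:
    """Stand-in for the Arizona Jurisdiction subclass."""

    def get_organizations(self):
        secretary_of_state = Organization(
            name="Office of the Secretary of State, State of Arizona",
            classification="office",
        )
        secretary_of_state.add_contact_detail(type="voice", value="602-542-4285")
        secretary_of_state.add_contact_detail(
            type="address",
            value="1700 W Washington St Fl 7, Phoenix AZ 85007-2808",
        )
        secretary_of_state.add_link(url="http://www.azsos.gov/", note="Home page")

        # Keep a reference on the jurisdiction so scrapers can reach it later.
        self._secretary_of_state = secretary_of_state

        yield secretary_of_state


jurisdiction = ArizonaSketch()
organizations = list(jurisdiction.get_organizations())
print(organizations[0].name)
```

Note that `self._secretary_of_state` is only set once the generator has been consumed, which is why the order "assign, then yield" matters.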

## Test what we have so far!

Cool, let's try out what we have so far.  From the project root (`scrapers-us-state`), run the command:

```
(iusa-scrape)$>pupa update az disclosures --scrape
```

This will throw a `ScrapeError` because we haven't written any of the main scrapers yet, but before it does we'll see that it creates our `Jurisdiction` object, and the `Organization` representing the Arizona Secretary of State.

```
no pupa_settings on path, using defaults
az (scrape)
  events: {}
  people: {}
  disclosures: {}
Not checking sessions...
13:30:10 INFO pupa: save jurisdiction Arizona as jurisdiction_ocd-jurisdiction-country:us-state:az-government.json
13:30:10 INFO pupa: save organization Office of the Secretary of State, State of Arizona as organization_1e330580-e20b-11e4-a4f5-e90fe0697b56.json
```

# Starting a new scraper

Now it's time to start writing the real meat and potatoes of our scraping code.

## Locate the source of the data

Check out the [Big Board](https://docs.google.com/spreadsheets/d/18-MvVJXg8TkUUNhtBmWoCEPUWEMf7F6-YVV6x7CWrg4/pubhtml) to see which URL you should use to start. Explore the links on that page until you find the data you're looking for.

For this example, we'll look at the Arizona Super PAC list. 

In [11]:
PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"

## Adding new scrape routines

We're going to add our code to `az/disclosures.py`. 

```python
class ArizonaDisclosureScraper(Scraper):
                                           
    def scrape_super_pacs(self):           
        pass                               
                                           
    def scrape(self):                      
        # needs to be implemented          
        yield from self.scrape_super_pacs()
```

When we're through, the `pupa` CLI commands will call the `scrape` method. It's good practice to follow this pattern and break that method down into a series of subroutines, one for each type of data you're returning. The pupa software doesn't actually care, though; it just expects a stream of Open Civic Data scrape objects (`Person`, `Organization`, `Event`, etc.).
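
To make the "stream of objects" idea concrete, here is a small plain-Python sketch of the same pattern. It yields dicts rather than real Open Civic Data objects, and `SketchScraper` and `scrape_lobbyists` are invented for illustration only.

```python
class SketchScraper:
    """Plain-Python stand-in for a pupa scraper; yields dicts, not OCD objects."""

    def scrape_super_pacs(self):
        yield {"kind": "organization", "name": "Example Super PAC"}

    def scrape_lobbyists(self):
        # Hypothetical second routine, to show how more data types slot in.
        yield {"kind": "person", "name": "Example Lobbyist"}

    def scrape(self):
        # `yield from` forwards every item of each subroutine, in order.
        yield from self.scrape_super_pacs()
        yield from self.scrape_lobbyists()


scraped = list(SketchScraper().scrape())
print([item["kind"] for item in scraped])
```

Each subroutine stays independently testable, and adding a new data type is one more `yield from` line in `scrape`.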

## Developing your scraper

At this point, you might want to move to a REPL (or, even better, to an IPython notebook) so that you can start figuring out how to obtain the target data.

In this example, things are fairly straightforward.  There's a `<table>` element in the middle of the page that has all the information we need to generate the `Organization` objects that we want.

In [12]:
from lxml import etree
from lxml.html import HTMLParser

import scrapelib

The scraper we'll eventually write is a subclass of `scrapelib.Scraper()` (see the `Scraper` class in [pupa/pupa/scrape/base.py](https://github.com/influence-usa/pupa/blob/disclosures/pupa/scrape/base.py)), and working with the parent class will be close enough to reality when we're undergoing trial and error at first.

In [13]:
my_scraper = scrapelib.Scraper()

In [14]:
_, resp = my_scraper.urlretrieve(PAC_LIST_URL)

In [16]:
d = etree.fromstring(resp.content, parser=HTMLParser())

The easiest thing to do is look for the table we're interested in by writing an XPath query.

In [17]:
d.xpath('//table')

[<Element table at 0x7fc07468e368>, <Element table at 0x7fc07468e3b8>]

Hm, looks like there's more than one, so we're going to have to narrow our XPath query. Here's where we can cheat: using Chrome/Chromium, we can right click on the table, and select "Inspect Element" from the dropdown. This opens Developer Tools and shows us where in the page's source that object is. Right click on the `<table>` tag and select "Copy XPath". Now paste the results into a text editor.  What you'll see is:

    //*[@id="ctl00_ctl00_MainPanel"]/table
    
Chrome DevTools does its best to find a short-ish query that uniquely identifies the node you've selected. The results of this query will be more targeted.

In [18]:
d.xpath('//*[@id="ctl00_ctl00_MainPanel"]/table')

[<Element table at 0x7fc07468e3b8>]

Notice that this is still a list. To access it we'll need to select the object by indexing with `[0]`.
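
Because a query always returns a list, an empty result plus `[0]` raises an `IndexError`. The sketch below shows one way to guard against that. It uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` so it runs anywhere, and the `PAGE` snippet and `first_match` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the real page (invented for illustration).
PAGE = """
<body>
  <table><tr><td>navigation</td></tr></table>
  <div id="ctl00_ctl00_MainPanel">
    <table><tr><th>Filer ID</th></tr></table>
  </div>
</body>
"""


def first_match(element, path):
    """Return the first element matching `path`, or None if nothing matched."""
    matches = element.findall(path)
    return matches[0] if matches else None


doc = ET.fromstring(PAGE)
all_tables = doc.findall(".//table")
target = first_match(doc, ".//*[@id='ctl00_ctl00_MainPanel']/table")
missing = first_match(doc, ".//*[@id='no_such_id']/table")
print(len(all_tables), target is not None, missing)
```

With `lxml`, the same helper would call `element.xpath(path)` instead of `findall`; the empty-list check is the part that carries over.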

We know that the information we need is in this table, and using IPython we can test out scripts that scrape the data.

In [19]:
target_table = d.xpath('//*[@id="ctl00_ctl00_MainPanel"]/table')[0]

The `table` element can, in turn, be queried with an XPath query. We can get all of the rows with `tr` (this page's table has no `<tbody>` wrapper).

In [20]:
rows = target_table.xpath('tr')
len(rows)

26

In [21]:
rows[0]

<Element tr at 0x7fc07468e728>

In [22]:
def find_the_table(html_document):
    return html_document.xpath('//*[@id="ctl00_ctl00_MainPanel"]/table')[0]

To see what the original HTML string of any element looks like, use:

In [23]:
etree.tostring(rows[5])

b'<tr>&#13;\n&#13;\n<td>201000081</td>&#13;\n<td>Arizona List P.A.C.<br />PO BOX 42294<br />TUCSON, AZ 85733 </td>&#13;\n<td>(520) 327-0520</td>&#13;\n<td>01/25/2015</td>&#13;\n<td>01/24/2019</td>&#13;\n</tr>&#13;\n&#13;\n'

Now we can write functions to query for, and parse each table. It can be helpful when debugging to break up your scraper's code into small functions, so that when (not if ;) something goes wrong, it's easier to tell where and why.

In [24]:
def scrape_table_first_draft(table_element):
    scraped_rows = []
    for row in table_element.xpath('tr'):
        _data = {}
        columns = row.xpath('td')
        _data['org_id'] = columns[0].text_content()
        _data['org_name_and_address'] = columns[1].text_content()
        _data['org_phone'] = columns[2].text_content()
        _data['org_begin_date'] = columns[3].text_content()
        _data['org_end_date'] = columns[4].text_content()
        scraped_rows.append(_data)
    return scraped_rows

In [25]:
scrape_table_first_draft(target_table)

IndexError: list index out of range

Oops! Let's look at what happened. We got an `IndexError` when looking for item `0` in the list `columns`. Expect to see a lot of these kinds of errors when writing `lxml` scrapers. It usually means that an XPath query returned an empty list. Let's make sure our `"td"` query is working on each row:

In [26]:
[row.xpath("td") for row in target_table.xpath('tr')]

[[],
 [<Element td at 0x7fc07469f4a8>,
  <Element td at 0x7fc07469f408>,
  <Element td at 0x7fc07469f138>,
  <Element td at 0x7fc07469f958>,
  <Element td at 0x7fc07469f9a8>],
 [<Element td at 0x7fc07469f9f8>,
  <Element td at 0x7fc07469fa48>,
  <Element td at 0x7fc07469fa98>,
  <Element td at 0x7fc07469fae8>,
  <Element td at 0x7fc07469fb38>],
 [<Element td at 0x7fc07469fb88>,
  <Element td at 0x7fc07469fbd8>,
  <Element td at 0x7fc07469fc28>,
  <Element td at 0x7fc07469fc78>,
  <Element td at 0x7fc07469fcc8>],
 [<Element td at 0x7fc07469fd18>,
  <Element td at 0x7fc07469fd68>,
  <Element td at 0x7fc07469fdb8>,
  <Element td at 0x7fc07469fe08>,
  <Element td at 0x7fc07469fe58>],
 [<Element td at 0x7fc07469fea8>,
  <Element td at 0x7fc07469fef8>,
  <Element td at 0x7fc07469ff48>,
  <Element td at 0x7fc07469ff98>,
  <Element td at 0x7fc0746b1048>],
 [<Element td at 0x7fc0746b1098>,
  <Element td at 0x7fc0746b10e8>,
  <Element td at 0x7fc0746b1138>,
  <Element td at 0x7fc0746b1188>,
  <E

Ah, the first row has no `td` elements!  Let's look at it again:

In [27]:
etree.tostring(target_table.xpath('tr')[0])

b'<tr><th>Filer ID</th><th>Name &amp; Address</th> <th>Phone #</th><th>Begin Date</th><th>End Date</th></tr>&#13;\n&#13;\n'

Of course: the first row is a header row, and only contains `<th>` tags. (Actually, this is bad HTML. The `th` tags should be inside of a `<thead>` tag, and `tr` tags should be in a separate `tbody` tag, but that's not what was on the page). Let's make sure that there are 5 `td` tags when parsing a row.

In [28]:
def scrape_table_second_draft(table_element):
    scraped_rows = []
    for row in table_element.xpath('tr'):
        _data = {}
        columns = row.xpath('td')
        if len(columns) == 5:
            _data['org_id'] = columns[0].text_content()
            _data['org_name_and_address'] = columns[1].text_content()
            _data['org_phone'] = columns[2].text_content()
            _data['org_begin_date'] = columns[3].text_content()
            _data['org_end_date'] = columns[4].text_content()
            scraped_rows.append(_data)
    return scraped_rows
        

In [29]:
result = scrape_table_second_draft(target_table)

In [30]:
result

[{'org_begin_date': '06/09/2014',
  'org_end_date': '06/08/2018',
  'org_id': '1066',
  'org_name_and_address': 'AEA FUND FOR PUBLIC EDUCATION (FORMERLY) AZ PAC (AZ EDUCATION ASSN PAC)345 E PALM LNPHOENIX, AZ 85004 ',
  'org_phone': '(602) 264-1774'},
 {'org_begin_date': '04/02/2014',
  'org_end_date': '04/01/2018',
  'org_id': '1354',
  'org_name_and_address': 'AFSCME PEOPLE1625 L ST NWWASHINGTON, DC 20036 ',
  'org_phone': '(202) 429-1088'},
 {'org_begin_date': '02/14/2014',
  'org_end_date': '02/13/2018',
  'org_id': '200602733',
  'org_name_and_address': 'ARIZONA CHAMBER POLITICAL ACTION COMMITTEE3200 N CENTRAL AVE, STE 1125PHOENIX, AZ 85012 ',
  'org_phone': '(602) 248-9172'},
 {'org_begin_date': '07/21/2014',
  'org_end_date': '07/20/2018',
  'org_id': '200602848',
  'org_name_and_address': 'ARIZONA FAMILIES UNITED FOR STRONG COMMUNITIES- PROJECT OF SEIU COPE3707 N 7TH ST, STE 100PHOENIX, AZ 85014 ',
  'org_phone': '(602) 279-8016'},
 {'org_begin_date': '01/25/2015',
  'org_end_d

Hm, let's see if we can do a better job with that `org_name_and_address` field. We're going to want to separate the name from the address.

In [31]:
rows = target_table.xpath('tr')
row = rows[7]

In [32]:
cell = row.xpath('td')[1]

In [33]:
etree.tostring(cell)

b'<td>Arizona Pipe Trades 469<br />3109 N 24TH ST<br />PHOENIX, AZ 85016 </td>&#13;\n'

The first part of the cell can be extracted using the `text` property. Unlike `text_content`, it will only return the text up until the first `<br/>` tag.

In [34]:
cell.text

'Arizona Pipe Trades 469'

We can take advantage of the `<br />` elements, by using their `tail` properties. This property returns text that follows the element, stopping before the next element encountered.

In [35]:
for br in cell.xpath('br'):
    print('tail: "{}"'.format(br.tail))

tail: "3109 N 24TH ST"
tail: "PHOENIX, AZ 85016 "


In [36]:
def separate_name_and_address(cell):
    name = cell.text
    address = ', '.join([br.tail for br in cell.xpath('br')])
    return name, address

def scrape_table(table_element):
    scraped_rows = []
    for row in table_element.xpath('tr'):
        _data = {}
        columns = row.xpath('td')
        if len(columns) == 5:
            _data['org_id'] = columns[0].text_content()
            _name, _address = separate_name_and_address(columns[1])
            _data['org_name'] = _name
            _data['org_address'] = _address
            _data['org_phone'] = columns[2].text_content()
            _data['org_begin_date'] = columns[3].text_content()
            _data['org_end_date'] = columns[4].text_content()
            scraped_rows.append(_data)
    return scraped_rows
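
One caveat with `separate_name_and_address`: if a `<br />` has nothing after it, its `tail` is `None`, and `', '.join(...)` raises a `TypeError`. The variant below is a sketch of a more defensive version that also strips stray whitespace. It uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` (both expose `text` and `tail`), and the `_safe` function name and the second test cell are made up for this example.

```python
import xml.etree.ElementTree as ET


def separate_name_and_address_safe(cell):
    """Like separate_name_and_address, but tolerant of missing text."""
    name = (cell.text or "").strip()
    # br.tail is None when nothing follows the <br/>, so treat it as empty.
    parts = [(br.tail or "").strip() for br in cell.findall("br")]
    address = ", ".join(part for part in parts if part)
    return name, address


cell = ET.fromstring(
    "<td>Arizona Pipe Trades 469<br />3109 N 24TH ST<br />PHOENIX, AZ 85016 </td>"
)
print(separate_name_and_address_safe(cell))

# Invented edge case: a trailing <br/> with no text after it.
empty_tail = ET.fromstring("<td>Example PAC<br /></td>")
print(separate_name_and_address_safe(empty_tail))
```

Whether to strip the trailing space is a judgment call; the outputs shown in this walkthrough keep it.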

In [37]:
results = scrape_table(target_table)

In [38]:
results

[{'org_address': '345 E PALM LN, PHOENIX, AZ 85004 ',
  'org_begin_date': '06/09/2014',
  'org_end_date': '06/08/2018',
  'org_id': '1066',
  'org_name': 'AEA FUND FOR PUBLIC EDUCATION (FORMERLY) AZ PAC (AZ EDUCATION ASSN PAC)',
  'org_phone': '(602) 264-1774'},
 {'org_address': '1625 L ST NW, WASHINGTON, DC 20036 ',
  'org_begin_date': '04/02/2014',
  'org_end_date': '04/01/2018',
  'org_id': '1354',
  'org_name': 'AFSCME PEOPLE',
  'org_phone': '(202) 429-1088'},
 {'org_address': '3200 N CENTRAL AVE, STE 1125, PHOENIX, AZ 85012 ',
  'org_begin_date': '02/14/2014',
  'org_end_date': '02/13/2018',
  'org_id': '200602733',
  'org_name': 'ARIZONA CHAMBER POLITICAL ACTION COMMITTEE',
  'org_phone': '(602) 248-9172'},
 {'org_address': '3707 N 7TH ST, STE 100, PHOENIX, AZ 85014 ',
  'org_begin_date': '07/21/2014',
  'org_end_date': '07/20/2018',
  'org_id': '200602848',
  'org_name': 'ARIZONA FAMILIES UNITED FOR STRONG COMMUNITIES- PROJECT OF SEIU COPE',
  'org_phone': '(602) 279-8016'},
 {

## Adding to the DisclosureScraper object

After we think that we're able to scrape the data we want, we should add the code to our scraper object.

```python
class ArizonaDisclosureScraper(Scraper):

    def scrape_super_pacs(self):
    
        def find_the_table(html_document):
            return html_document.xpath('//*[@id="ctl00_ctl00_MainPanel"]/table')[0]
    
        def separate_name_and_address(cell):
            name = cell.text
            address = ', '.join([br.tail for br in cell.xpath('br')])
            return name, address
    
        def scrape_the_table(table_element):
            scraped_rows = []
            for row in table_element.xpath('tr'):
                _data = {}
                columns = row.xpath('td')
                if len(columns) == 5:
                    _data['org_id'] = columns[0].text_content()
                    _name, _address = separate_name_and_address(columns[1])
                    _data['org_name'] = _name
                    _data['org_address'] = _address
                    _data['org_phone'] = columns[2].text_content()
                    _data['org_begin_date'] = columns[3].text_content()
                    _data['org_end_date'] = columns[4].text_content()
                    scraped_rows.append(_data)
            return scraped_rows
            
        PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"
        
        resp = self.urlretrieve(PAC_LIST_URL)
        
        html_document = etree.fromstring(resp, parser=HTMLParser())
        
        target_table = find_the_table(html_document)
        
        results = scrape_the_table(target_table)

    def scrape(self):                      
        # needs to be implemented          
        yield from self.scrape_super_pacs()
```

## Use scraped data to build Open Civic Data objects

Let's look at how we can use the PAC data we scraped to build an `Organization` object.

In [39]:
result = results[0]

result

{'org_address': '345 E PALM LN, PHOENIX, AZ 85004 ',
 'org_begin_date': '06/09/2014',
 'org_end_date': '06/08/2018',
 'org_id': '1066',
 'org_name': 'AEA FUND FOR PUBLIC EDUCATION (FORMERLY) AZ PAC (AZ EDUCATION ASSN PAC)',
 'org_phone': '(602) 264-1774'}

If we import the `Organization` scraper model, this becomes very easy:

In [40]:
from pupa.scrape.popolo import Organization

In [41]:
my_org = Organization(
    name=result['org_name'],
    classification='political action committee',
    founding_date=result['org_begin_date'],
    dissolution_date=result['org_end_date']
)

That's actually more information than we needed. Only the `name` and `classification` properties were required.

To add other information, like contact details and identifiers, we can use special helper functions that come with the `Organization` class:

In [42]:
my_org.add_identifier(
    identifier=result['org_id'],
    scheme='urn:az-state:committee'
    )

In [43]:
my_org.add_contact_detail(
    type='address',
    value=result['org_address']
)

In [44]:
my_org.add_contact_detail(
    type='voice',
    value=result['org_phone']
)

To see your organization as a `dict`, use the `as_dict` method:

In [45]:
my_org.as_dict()

{'_id': '0e683e08-e46e-11e4-a4f5-e90fe0697b56',
 'classification': 'political action committee',
 'contact_details': [{'note': '',
   'type': 'address',
   'value': '345 E PALM LN, PHOENIX, AZ 85004 '},
  {'note': '', 'type': 'voice', 'value': '(602) 264-1774'}],
 'dissolution_date': '06/08/2018',
 'extras': {},
 'founding_date': '06/09/2014',
 'identifiers': [{'identifier': '1066', 'scheme': 'urn:az-state:committee'}],
 'image': '',
 'links': [],
 'name': 'AEA FUND FOR PUBLIC EDUCATION (FORMERLY) AZ PAC (AZ EDUCATION ASSN PAC)',
 'other_names': [],
 'parent_id': None,
 'source_identified': False,
 'sources': []}

Ah! That reminds me: We have to be sure to do two things to all `Person` and `Organization` objects returned by our scraper.

First, make sure `source_identified` is `True`:

In [46]:
my_org.source_identified = True

Next, make sure you include the source (i.e., the URL where you found the organization):

In [47]:
my_org.add_source(url=PAC_LIST_URL)
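
Since these two steps are easy to forget, a quick check on the dictionary form of an object can help while you develop. The helper below is a hypothetical sketch, not something pupa provides. It only looks at the `source_identified` and `sources` keys seen in the `as_dict()` output above, and the shape of each `sources` entry is an assumption.

```python
def missing_provenance(org_dict):
    """Return a list of provenance problems for an organization dict.

    Hypothetical helper; it only inspects the two keys discussed above.
    """
    problems = []
    if not org_dict.get("source_identified"):
        problems.append("source_identified is not True")
    if not org_dict.get("sources"):
        problems.append("no sources listed")
    return problems


PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"

# Minimal dicts mimicking the relevant part of as_dict() output.
before = {"name": "Example PAC", "source_identified": False, "sources": []}
after = {
    "name": "Example PAC",
    "source_identified": True,
    # Assumed entry shape: one dict per source with a "url" key.
    "sources": [{"url": PAC_LIST_URL}],
}
print(missing_provenance(before))
print(missing_provenance(after))
```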

## Yielding your new objects

Adding the new code to our scraper:

```python
class ArizonaDisclosureScraper(Scraper):

    def scrape_super_pacs(self):

        def find_the_table(html_document):
            return html_document.xpath('//*[@id="ctl00_ctl00_MainPanel"]/table')[0]

        def separate_name_and_address(cell):
            name = cell.text
            address = ', '.join([br.tail for br in cell.xpath('br')])
            return name, address

        def scrape_the_table(table_element):
            scraped_rows = []
            for row in table_element.xpath('tr'):
                _data = {}
                columns = row.xpath('td')
                if len(columns) == 5:
                    _data['org_id'] = columns[0].text_content()
                    _name, _address = separate_name_and_address(columns[1])
                    _data['org_name'] = _name
                    _data['org_address'] = _address
                    _data['org_phone'] = columns[2].text_content()
                    _data['org_begin_date'] = columns[3].text_content()
                    _data['org_end_date'] = columns[4].text_content()
                    scraped_rows.append(_data)
            return scraped_rows

        PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"

        _, resp = self.urlretrieve(PAC_LIST_URL)

        html_document = etree.fromstring(resp, parser=HTMLParser())

        target_table = find_the_table(html_document)

        results = scrape_the_table(target_table)
        
        for result in results:
            _org = Organization(
                name=result['org_name'],
                classification='political action committee',
                founding_date=result['org_begin_date'],
                dissolution_date=result['org_end_date']
            )
            
            _org.add_identifier(
                identifier=result['org_id'],
                scheme='urn:az-state:committee'
            )
            
            _org.add_contact_detail(
                type='address',
                value=result['org_address']
            )
            
            _org.add_contact_detail(
                type='voice',
                value=result['org_phone']
            )
            
            _org.add_source(url=PAC_LIST_URL)
            
            _org.source_identified = True
            
            yield _org
            

    def scrape(self):                      
        # needs to be implemented          
        yield from self.scrape_super_pacs()
```

## Finished! 

You don't need to worry about anything after this point. After doing the hard work of writing your scraper code, you can hand off the rest (validation, database storage, deduplication) to the `pupa` framework.  

# Running your new scraper

Okay, time to see how well it works!  Run:

```
(iusa-scrape)$>pupa update az disclosures --scrape
```

Uh oh.

```
    no pupa_settings on path, using defaults
    az (scrape)
      disclosures: {}
    Not checking sessions...
    21:45:21 INFO pupa: save jurisdiction Arizona as jurisdiction_ocd-jurisdiction-country:us-state:az-government.json
    21:45:21 INFO pupa: save organization Office of the Secretary of State, State of Arizona as organization_4b266268-e250-11e4-a4f5-e90fe0697b56.json
    21:45:21 INFO scrapelib: GET - http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx
    Traceback (most recent call last):
      File "/home/blannon/.virtualenvs/iusa-scrape/bin/pupa", line 9, in <module>
        load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
      File "/home/blannon/.virtualenvs/iusa-scrape/src/pupa/pupa/cli/__main__.py", line 71, in main
        subcommands[args.subcommand].handle(args, other)
      File "/home/blannon/.virtualenvs/iusa-scrape/src/pupa/pupa/cli/commands/update.py", line 241, in handle
        report['scrape'] = self.do_scrape(juris, args, scrapers)
      File "/home/blannon/.virtualenvs/iusa-scrape/src/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
        report[scraper_name] = scraper.do_scrape(**scrape_args)
      File "/home/blannon/.virtualenvs/iusa-scrape/src/pupa/pupa/scrape/base.py", line 104, in do_scrape
        for obj in self.scrape(**kwargs) or []:
      File "/home/blannon/dev/ocd/scrapers-us-state/az/disclosures.py", line 78, in scrape
        yield from self.scrape_super_pacs()
      File "/home/blannon/dev/ocd/scrapers-us-state/az/disclosures.py", line 41, in scrape_super_pacs
        html_document = etree.fromstring(resp, parser=HTMLParser())
      File "lxml.etree.pyx", line 3094, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70505)
      File "parser.pxi", line 1827, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106328)
    ValueError: can only parse strings
```

Looks like something's wrong. Let's run it again, except this time we'll tell `pupa` to drop us into `pdb` when it fails. If you're not familiar with `pdb`, check out @claytron's [excellent talk](https://speakerdeck.com/claytron/so-you-think-you-can-pdb) from this year's PyCon.

```
(iusa-scrape)$>pupa --debug pdb update az disclosures --scrape
```

Now, when it crashes, we're launched into the Python debugger.

```
(Pdb) u
> /home/blannon/dev/ocd/scrapers-us-state/lxml.etree.pyx(3094)lxml.etree.fromstring (src/lxml/lxml.etree.c:70505)()
(Pdb) u
> /home/blannon/dev/ocd/scrapers-us-state/az/disclosures.py(41)scrape_super_pacs()
-> html_document = etree.fromstring(resp, parser=HTMLParser())
(Pdb) resp
('/tmp/tmpy2r8olmk', <Response [200]>)
(Pdb) l
 36  	
 37  	        PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"
 38  	
 39  	        resp = self.urlretrieve(PAC_LIST_URL)
 40  	
 41  ->	        html_document = etree.fromstring(resp, parser=HTMLParser())
 42  	
 43  	        target_table = find_the_table(html_document)
 44  	
 45  	        results = scrape_the_table(target_table)
 46  	

```
...ah ha! On line 41, we incorrectly assumed that `urlretrieve` would return an HTML string. Instead, it returned a tuple containing the location of the temporary file and a `Response` object. We'll have to change that line:

```diff
diff --git a/az/disclosures.py b/az/disclosures.py
index 64ebf1b..a10695b 100644
--- a/az/disclosures.py
+++ b/az/disclosures.py
@@ -36,9 +36,9 @@ class ArizonaDisclosureScraper(Scraper):
 
         PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"
 
-        resp = self.urlretrieve(PAC_LIST_URL)
+        tmp, resp = self.urlretrieve(PAC_LIST_URL)
 
-        html_document = etree.fromstring(resp, parser=HTMLParser())
+        html_document = etree.fromstring(resp.content, parser=HTMLParser())
 
         target_table = find_the_table(html_document)
```

Now, let's re-run!

```
(iusa-scrape)$>pupa --debug pdb update az disclosures --scrape
```

We'll get another error:

```
validictory.validator.ValidationError: validation of Organization 46282b8b-e254-11e4-a4f5-e90fe0697b56 failed: Value '06/08/2018' for field '<obj>.dissolution_date' does not match regular expression '(^[0-9]{4})?(-[0-9]{2}){0,2}$'
> /home/blannon/.virtualenvs/iusa-scrape/src/pupa/pupa/scrape/base.py(191)validate()
-> self.__class__.__name__, self._id, ve)
(Pdb) 
```

This time, `validictory`, the schema validation library, is telling us that our dates aren't formatted correctly.  Let's go back and fix that, with a `reformat_date` function:

```diff
diff --git a/az/disclosures.py b/az/disclosures.py
index 7d33214..40d585d 100644
--- a/az/disclosures.py
+++ b/az/disclosures.py
@@ -5,11 +5,17 @@ from pupa.scrape.popolo import Organization
 from lxml import etree
 from lxml.html import HTMLParser
 
+from datetime import datetime
+
 
 class ArizonaDisclosureScraper(Scraper):
 
     def scrape_super_pacs(self):
 
+        def reformat_date(datestring):
+            dt = datetime.strptime(datestring, '%m/%d/%Y')
+            return dt.strftime('%Y-%m-%d')
+
         def find_the_table(html_document):
             return html_document.xpath('//*[@id="ctl00_ctl00_MainPanel"]/table')[0]
 
@@ -29,8 +35,8 @@ class ArizonaDisclosureScraper(Scraper):
                     _data['org_name'] = _name
                     _data['org_address'] = _address
                     _data['org_phone'] = columns[2].text_content()
-                    _data['org_begin_date'] = columns[3].text_content()
-                    _data['org_end_date'] = columns[4].text_content()
+                    _data['org_begin_date'] = reformat_date(columns[3].text_content())
+                    _data['org_end_date'] = reformat_date(columns[4].text_content())
                     scraped_rows.append(_data)
             return scraped_rows
```
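
Before re-running the whole scraper, you can check the fix in isolation. The sketch below copies the regular expression from the validation error and tests both the raw and the reformatted date against it. It assumes the validator anchors the pattern at the start of the string, the way `re.match` does.

```python
import re
from datetime import datetime

# Pattern copied from the validation error message above.
DATE_PATTERN = re.compile(r"(^[0-9]{4})?(-[0-9]{2}){0,2}$")


def reformat_date(datestring):
    """Convert 'MM/DD/YYYY' to 'YYYY-MM-DD'."""
    dt = datetime.strptime(datestring, "%m/%d/%Y")
    return dt.strftime("%Y-%m-%d")


raw = "06/08/2018"
fixed = reformat_date(raw)
# Assumption: the validator anchors the pattern at the start, like re.match.
print(fixed, bool(DATE_PATTERN.match(raw)), bool(DATE_PATTERN.match(fixed)))
```

Note that `strptime` raises a `ValueError` on a cell that is empty or in another format, so a malformed row on the page will stop the scrape rather than slip through.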

Okay, I got a good feeling about this one:

```
(iusa-scrape)$>pupa --debug pdb update az disclosures --scrape
```

Output looks great!

```
(iusa-scrape)$>pupa --debug pdb update az disclosures --scrape
no pupa_settings on path, using defaults
az (scrape)
  disclosures: {}
Not checking sessions...
22:22:17 INFO pupa: save jurisdiction Arizona as jurisdiction_ocd-jurisdiction-country:us-state:az-government.json
22:22:17 INFO pupa: save organization Office of the Secretary of State, State of Arizona as organization_73f49f2a-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO scrapelib: GET - http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx
22:22:17 INFO pupa: save organization AEA FUND FOR PUBLIC EDUCATION (FORMERLY) AZ PAC (AZ EDUCATION ASSN PAC) as organization_73f49f2b-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization AFSCME PEOPLE as organization_73f49f2c-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization ARIZONA CHAMBER POLITICAL ACTION COMMITTEE as organization_73f49f2d-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization ARIZONA FAMILIES UNITED FOR STRONG COMMUNITIES- PROJECT OF SEIU COPE as organization_73f49f2e-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization Arizona List P.A.C. as organization_73f49f2f-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization Arizona Lodging & Tourism Super PAC as organization_73f49f30-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization Arizona Pipe Trades 469 as organization_73f49f31-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization AZ DENTAL POLITICAL ACTION COMMITTEE as organization_73f49f32-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization AZ MULTIHOUSING ASSN PAC as organization_73f49f33-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization BNSF Railway Company RAILPAC as organization_73f49f34-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization CWA COMMUNICATIONS WORKERS OF AMERICA /AZ COMMITTEE  ON POLITICAL EDUCATION (COPE) as organization_73f49f35-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization Enterprise Holdings, Inc. Political Action Committee as organization_73f49f36-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization GREATER PHOENIX CHAMBER OF COMMERCE POLITICAL ACTION COMMITTEE as organization_73f49f37-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization HOME BUILDERS ASSN OF CENTRAL AZ POLITICAL ACTION COMMITTEE as organization_73f49f38-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization Honeywell International Political Action Committee Arizona-(HIPAC AZ) as organization_73f49f39-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization JPMORGAN CHASE & CO ARIZONA PAC as organization_73f49f3a-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization PARADISE VALLEY FUND FOR CHILDREN IN PUBLIC EDUCATION (PV ED & SUPPORT EMPL ASSN PAC) as organization_73f49f3b-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization PINNACLE WEST CAPITAL CORPORATION POLITICAL ACTION COMMITTEE as organization_73f49f3c-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization REALTORS OF AZ PAC (RAPAC) as organization_73f49f3d-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization SALT RIVER VALLEY WATER USERS ASSN POLITICAL INVOLVEMENT COMMITTEE as organization_73f49f3e-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization The Political Committee of Planned Parenthood Advocates of Arizona as organization_73f49f3f-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization UNITE HERE TIP CAMPAIGN COMMITTEE as organization_73f49f40-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization UNITED FOOD & COMMERCIAL WORKERS ACTIVE BALLOT CLUB (UFCW) as organization_73f49f41-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization UNITED FOOD & COMMERCIAL WORKERS UNION OF AZ LOCAL 99 as organization_73f49f42-e255-11e4-a4f5-e90fe0697b56.json
22:22:17 INFO pupa: save organization United Services Automobile Assn Employee PAC as organization_73f49f43-e255-11e4-a4f5-e90fe0697b56.json
az (scrape)
  disclosures: {}
disclosures scrape:
  duration:  0:00:00.396384
  objects:
    organization: 25
jurisdiction scrape:
  duration:  0:00:00.001806
  objects:
    jurisdiction: 1
    organization: 1
```

# Modeling OCD Types

The following examples are abstract, and they assume that the target data has already been obtained. They're meant to show you how to use the OCD Scrape Objects and their helper functions.

## Person

In [48]:
from pupa.scrape.popolo import Person

Let's assume that this is data we've managed to pull out of an online source:

In [49]:
person_data = {
    'person_name': 'Sheldon Adelson',
    'person_aliases': [
        'Shelly A',
        'The Shelster',
        'Sheldon G Adelson',
        'Sheldon Gary Adelson',
    ],
    'birth_date': '1933-08-04',
    'biography': "Sheldon Gary Adelson (pronounced /ˈædəlsən/; born August 4, 1933) is an American business magnate, investor, and philanthropist. He is the chairman and chief executive officer of the Las Vegas Sands Corporation, which owns the Marina Bay Sands in Singapore and is the parent company of Venetian Macao Limited which operates The Venetian Resort Hotel Casino and the Sands Expo and Convention Center. He also owns the Israeli daily newspaper Israel HaYom. Adelson, a lifelong donor and philanthropist to a variety of causes, founded with his wife's initiative the Adelson Foundation. As of July 2014, Adelson was listed by Forbes as having a fortune of $36.4 billion, and as the 8th richest person in the world. Adelson is also a major contributor to Republican Party candidates, which has resulted in his gaining significant influence within the party.",
    'summary': 'Casino owner and large-dollar donor',
    'image': 'http://upload.wikimedia.org/wikipedia/commons/0/0f/Sheldon_Adelson_21_June_2010.jpg',
    'gender': 'male',
    'national_identity': 'USA',
    'address': '3355 Las Vegas Blvd S, LAS VEGAS, NV 89109',
    'fec_identifier': 'A0035', # Not real, just for demonstration!
}

### Initializing the object

When we initialize a `Person` object, we can set a lot of these properties using keyword arguments. Usually, you won't have this much information, but Shelly's pretty well known.

In [50]:
_person = Person(
    name=person_data['person_name'],
    birth_date=person_data['birth_date'],
    biography=person_data['biography'],
    summary=person_data['summary'],
    image=person_data['image'],
    gender=person_data['gender'],
    national_identity=person_data['national_identity'],
    source_identified=True
)

### Adding multivalued properties

Now that we've initialized the person object, we can use its helper functions to add other properties like contact details, sources, identifiers and other names (aliases). If you're curious about what's possible, many of these helper functions are implemented using the mixins found in [pupa/pupa/scrape/base.py](https://github.com/influence-usa/pupa/blob/disclosures/pupa/scrape/base.py#L213-L320).

In [51]:
_person.add_contact_detail(
    type="address",
    value=person_data['address'],
)

In [52]:
for alias_name in person_data['person_aliases']:
    _person.add_name(name=alias_name)

In [53]:
_person.add_source(
    url="http://www.example.com/disclosure/?DocumentID=21459sadgljs85102h235naosudgyy7",
    note="F342_contributions"
)

### Namespaced Identifiers

Identifiers can be added in the same way, but you'll need to include a [namespace](http://en.wikipedia.org/wiki/Namespace) for them. Namespaces keep identifiers from different sources separate. For instance, `"urn:sopr:registrant"` is the namespace for identifiers assigned to lobbying disclosure registrants by the Senate Office of Public Record. The House Clerk's office also assigns identifiers to many lobbying firms, and they might clash with SOPR's identifiers, so we assign them to a different namespace, which is `"urn:house-clerk:registrant"`.

If you don't see an existing namespace that is an obvious fit for the data you're scraping, feel free to make one up. It might get changed in the future, but that's okay.

In [54]:
_person.add_identifier(
    identifier=person_data['fec_identifier'],
    scheme='urn:fec:individual'
)
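
To see why the namespace matters, here is a minimal sketch using plain dictionaries rather than pupa objects. The identifier value `"12345"` and the firm names are made up for illustration; only the two namespace strings come from the text above.

```python
# Hypothetical data: the same raw identifier issued by two different authorities.
records = [
    {"scheme": "urn:sopr:registrant", "identifier": "12345", "name": "Example Firm A"},
    {"scheme": "urn:house-clerk:registrant", "identifier": "12345", "name": "Example Firm B"},
]

# Keyed on the bare identifier, the second record overwrites the first.
by_identifier = {r["identifier"]: r["name"] for r in records}

# Keyed on (scheme, identifier), both records survive.
by_namespaced = {(r["scheme"], r["identifier"]): r["name"] for r in records}

print(len(by_identifier))   # 1
print(len(by_namespaced))   # 2
```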

### Viewing the object as a dict

In [55]:
_person.as_dict()

{'_id': '13de4c60-e46e-11e4-a4f5-e90fe0697b56',
 'biography': "Sheldon Gary Adelson (pronounced /ˈædəlsən/; born August 4, 1933) is an American business magnate, investor, and philanthropist. He is the chairman and chief executive officer of the Las Vegas Sands Corporation, which owns the Marina Bay Sands in Singapore and is the parent company of Venetian Macao Limited which operates The Venetian Resort Hotel Casino and the Sands Expo and Convention Center. He also owns the Israeli daily newspaper Israel HaYom. Adelson, a lifelong donor and philanthropist to a variety of causes, founded with his wife's initiative the Adelson Foundation. As of July 2014, Adelson was listed by Forbes as having a fortune of $36.4 billion, and as the 8th richest person in the world. Adelson is also a major contributor to Republican Party candidates, which has resulted in his gaining significant influence within the party.",
 'birth_date': '1933-08-04',
 'contact_details': [{'note': '',
   'type': 'address'

## Organization

(see above)

## Event

In [56]:
from pupa.scrape import Event

Let's assume that this is data we've managed to pull out of an online source:

In [81]:
event_data = {
    "name": "Sidley Austin LLP - New Client for Existing Registrant, Vifor Pharma",
    "description": "Form LD-1: registration of a lobbying firm and the client they represent.",
    "start_time": "2012-03-01",
    "end_time": None,
    "classification": "registration",
    "location": "United States",
    "timezone": "America/New_York"
}

event_data_url = "http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=a3f1bf3c-7fa0-4b08-b703-bef451bb3d27&filingTypeID=1"

A few notes:

  - If we don't know the end time, that's fine: the `end_time` can be left as `None`.
  - When entering times, enter as much detail as you have available. If you know the event's time down to the second, feel free to specify that, but if you only know the month in which it occurs, something like `"2014-01"` is fine.
  - Location can likewise be as specific or as general as your data allows.
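
One way to produce dates at the precision you actually have is a small formatter like the sketch below. `partial_date` is a hypothetical helper, not part of pupa; it assumes you already have the year and month (and optionally the day) as integers.

```python
def partial_date(year, month, day=None):
    """Build an ISO-style date string with only the precision we actually have."""
    if day is None:
        # Month-level precision, e.g. "2014-01"
        return "{:04d}-{:02d}".format(year, month)
    # Day-level precision, e.g. "2012-03-01"
    return "{:04d}-{:02d}-{:02d}".format(year, month, day)

print(partial_date(2014, 1))      # 2014-01
print(partial_date(2012, 3, 1))   # 2012-03-01
```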

### Initializing the object

In [86]:
_event = Event(
    name=event_data["name"],
    description=event_data["description"],
    start_time=event_data["start_time"],
    classification=event_data["classification"],
    location=event_data["location"],
    timezone=event_data["timezone"]
)

**Protip**: notice how the keys of the `event_data` dict are the same as the kwargs that we assign them to? A shorter (though admittedly less explicit) way of initializing the object would be:

In [87]:
_quicker_event = Event(**event_data)
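
If the `**` syntax is unfamiliar, the sketch below shows the same idea without pupa. `make_event` is a hypothetical stand-in for a constructor that takes keyword arguments, and the data is made up; the point is only that unpacking a dict gives the same result as spelling out each keyword.

```python
def make_event(name, start_time, end_time=None, timezone="UTC"):
    """Hypothetical stand-in for a constructor that takes keyword arguments."""
    return {"name": name, "start_time": start_time,
            "end_time": end_time, "timezone": timezone}

data = {"name": "Example filing", "start_time": "2012-03-01",
        "end_time": None, "timezone": "America/New_York"}

# Spelling out every keyword by hand...
explicit = make_event(
    name=data["name"],
    start_time=data["start_time"],
    end_time=data["end_time"],
    timezone=data["timezone"],
)

# ...is equivalent to unpacking the dict into keyword arguments.
unpacked = make_event(**data)

print(explicit == unpacked)   # True
```

For a callable without a catch-all `**kwargs` parameter, a key that doesn't match any parameter name raises a `TypeError`, so this shortcut only works when the keys line up exactly.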

### Adding Participants

Participants in an event can be either of class `Organization` or `Person`, and they can be added using the helper function `add_participant`. When adding them, it's helpful to characterize their participation using the `note` field:

In [88]:
_lobbying_org = Organization(
    name="Sidley Austin LLP",
    classification='company',
)

_client = Organization(
    name="Vifor Pharma",
    classification="company",
)

_event.add_participant(
    type=_lobbying_org._type,
    id=_lobbying_org._id,
    name=_lobbying_org.name,
    note="registrant"
)

_event.add_participant(
    type=_client._type,
    id=_client._id,
    name=_client.name,
    note="client"
)

As you'll see, the `participants` list holds the entities that you added. More precisely, it holds stubs of those objects rather than the full objects; each stub carries an identifier that can be resolved to find the full object.

In [89]:
_event.participants

[{'entity_type': 'organization',
  'id': '6922f3f4-e486-11e4-a4f5-e90fe0697b56',
  'name': 'Sidley Austin LLP',
  'note': 'registrant'},
 {'entity_type': 'organization',
  'id': '6922f3f5-e486-11e4-a4f5-e90fe0697b56',
  'name': 'Vifor Pharma',
  'note': 'client'}]
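
As a rough illustration of what resolving a stub means, here is a sketch with plain dictionaries. The `objects_by_id` table and the `resolve` function are hypothetical and are not pupa's actual resolution mechanism; the stubs and ids are copied from the output above.

```python
# Stubs as they appear in the participants list above.
participants = [
    {"entity_type": "organization",
     "id": "6922f3f4-e486-11e4-a4f5-e90fe0697b56",
     "name": "Sidley Austin LLP", "note": "registrant"},
    {"entity_type": "organization",
     "id": "6922f3f5-e486-11e4-a4f5-e90fe0697b56",
     "name": "Vifor Pharma", "note": "client"},
]

# Hypothetical lookup table of full objects, keyed by id.
objects_by_id = {
    "6922f3f4-e486-11e4-a4f5-e90fe0697b56":
        {"name": "Sidley Austin LLP", "classification": "company"},
    "6922f3f5-e486-11e4-a4f5-e90fe0697b56":
        {"name": "Vifor Pharma", "classification": "company"},
}

def resolve(stub):
    """Return the full object that a participant stub points to."""
    return objects_by_id[stub["id"]]

registrant = next(p for p in participants if p["note"] == "registrant")
print(resolve(registrant)["classification"])   # company
```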

### Agenda Items

When describing events, we can give a more organized account of what happened than just summarizing in the `description` property. We can also model the details of the event using the `EventAgendaItem` class. Agenda items can describe things like the subjects of meetings, the issues discussed by lobbying firms or the intended agenda of an event. If you'd like, you can also record each item's position in the agenda through its `order` property.

In [90]:
_agenda_item = _event.add_agenda_item("issues lobbied on")

Unlike many of the other helper functions, this one adds the item and also returns it. That makes it possible to make further changes to the agenda item. Let's add the subject of the lobbying ("HCR" is the code used by the Senate for healthcare-related lobbying) and a note that we found on the registration form.

In [91]:
_agenda_item.add_subject("HCR")

_agenda_item['notes'].append("Regulation of complex large-molecule drugs by the Food and Drug Administration")

Because the object that was returned by `add_agenda_item` is the same object that is stored in `_event`'s `agenda` property, making these changes on `_agenda_item` is equivalent to making changes to the corresponding object inside of `_event`.

In [92]:
_event.agenda

[{'description': 'issues lobbied on',
  'media': [],
  'notes': ['Regulation of complex large-molecule drugs by the Food and Drug Administration'],
  'order': '0',
  'related_entities': [],
  'subjects': ['HCR']}]

In [93]:
_agenda_item

{'description': 'issues lobbied on',
 'media': [],
 'notes': ['Regulation of complex large-molecule drugs by the Food and Drug Administration'],
 'order': '0',
 'related_entities': [],
 'subjects': ['HCR']}

That means that you don't need to `yield` the `EventAgendaItem` objects. When you yield the `Event` object, you'll also be yielding the agenda items that it contains for free.
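
The shared-reference behavior described above is ordinary Python and can be reproduced without pupa. In the sketch below, `add_agenda_item` is a hypothetical stand-in that appends a dict to a list and returns that same dict; it is not pupa's implementation.

```python
agenda = []

def add_agenda_item(description):
    """Hypothetical stand-in: append a new item and return that same item."""
    item = {"description": description, "subjects": [], "notes": []}
    agenda.append(item)
    return item

item = add_agenda_item("issues lobbied on")

# Mutating the returned item also changes what is stored in the list,
# because both names refer to one and the same dict.
item["subjects"].append("HCR")

print(agenda[0] is item)        # True
print(agenda[0]["subjects"])    # ['HCR']
```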

### Adding pointers to other entities with `related_entities`

Let's pretend that, besides these details, we also know that Sidley Austin LLP lobbied on some specific bills. We can attach them to the agenda item with the `add_bill` helper, which appends to its `related_entities` property, and describe how they're related using the `note` argument.

In [94]:
_agenda_item.add_bill(
    bill='HR 3590',
    note='supporting'
)

In [95]:
_agenda_item

{'description': 'issues lobbied on',
 'media': [],
 'notes': ['Regulation of complex large-molecule drugs by the Food and Drug Administration'],
 'order': '0',
 'related_entities': [{'entity_type': 'bill',
   'name': 'HR 3590',
   'note': 'supporting'}],
 'subjects': ['HCR']}

### Add the source

Finally, don't forget to add the source:

In [82]:
_event.add_source(
    url=event_data_url,
    note="LDA Form LD-1"
)

## Disclosures

When working with influence-related data (which is often the result of legally required disclosures), an important OCD type is the `Disclosure` type. It contains details about the act of disclosing itself, rather than the events and relationships that are being disclosed.

Once again, we'll assume we've scraped the relevant data from the source:

In [80]:
disclosure_data = {
    "effective_date": "2012-03-01",
    "submitted_date": "2012-04-03",
    "timezone": "America/New_York",
    "classification": "lobbying"
}

disclosure_data_url = "http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=a3f1bf3c-7fa0-4b08-b703-bef451bb3d27&filingTypeID=1"

The submitted date is often different from the effective date. That's because many disclosure authorities require filers to report activity within a certain timeframe. For federal lobbying registrations, for example, registrants must submit within 45 days of establishing the relationship with their client.
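
As a quick check on the sample data, the sketch below computes the gap between the two dates using only the standard library. The 45-day figure is the one quoted above; the variable names are just for illustration.

```python
from datetime import datetime

def parse_day(value):
    """Parse a YYYY-MM-DD string into a date object."""
    return datetime.strptime(value, "%Y-%m-%d").date()

effective = parse_day("2012-03-01")   # effective_date from disclosure_data
submitted = parse_day("2012-04-03")   # submitted_date from disclosure_data

# Number of days between the relationship starting and the filing.
lag_days = (submitted - effective).days
within_window = lag_days <= 45

print(lag_days)        # 33
print(within_window)   # True
```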

### Initializing the object

In [78]:
from pupa.scrape.disclosure import Disclosure

In [101]:
_disclosure = Disclosure(
    classification=disclosure_data["classification"],
    effective_date=disclosure_data["effective_date"],
    submitted_date=disclosure_data["submitted_date"],
    timezone=disclosure_data["timezone"]
)   

### Adding related entities

Like `EventAgendaItem`, `Disclosure` objects can have related entities. Most of the time, there are three kinds of related entities:

  - **Registrant** - the registered entity that files the disclosure.
  - **Authority** - the authority that the registrant submits the disclosure to.
  - **Disclosed Event(s)** - one or more events that the disclosure describes.

There are helper functions for each of these entities. We can make use of some of the objects we made above when we're calling the helper functions.

In [102]:
_disclosure.add_registrant(
    name=_lobbying_org.name,
    type=_lobbying_org._type,
)

_disclosure.add_authority(
    name="Office of Public Record, US Senate",
    type="organization"
)

_disclosure.add_disclosed_event(
    name=_event.name,
    type=_event._type,
    classification=_event.classification
)

In [103]:
_disclosure.related_entities

[{'entity_type': 'organization',
  'name': 'Sidley Austin LLP',
  'note': 'registrant'},
 {'entity_type': 'organization',
  'name': 'Office of Public Record, US Senate',
  'note': 'authority'},
 {'classification': 'registration',
  'entity_type': 'event',
  'name': 'Sidley Austin LLP - New Client for Existing Registrant, Vifor Pharma',
  'note': 'disclosed_event'}]

NOTE: Here, we made up the Senate Office of Public Record (the authority) on the spot. This is just for instructional purposes. Usually you'd create it as a property of the `Jurisdiction` object, as [we did above for the Arizona Secretary of State](#Create-the-Secretary-of-State).
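
Since each related entity is tagged with a `note`, picking out a particular kind is a simple filter. The sketch below works on a plain list of dicts trimmed to the two keys it needs; `entities_with_note` is a hypothetical helper, not a pupa function.

```python
# Related entities as shown above, trimmed to the keys used here.
related_entities = [
    {"name": "Sidley Austin LLP", "note": "registrant"},
    {"name": "Office of Public Record, US Senate", "note": "authority"},
    {"name": "Sidley Austin LLP - New Client for Existing Registrant, Vifor Pharma",
     "note": "disclosed_event"},
]

def entities_with_note(entities, note):
    """Return the names of related entities carrying the given note."""
    return [e["name"] for e in entities if e["note"] == note]

print(entities_with_note(related_entities, "registrant"))  # ['Sidley Austin LLP']
print(entities_with_note(related_entities, "authority"))   # ['Office of Public Record, US Senate']
```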

### Add source

Finally, once again, don't forget to add the source:

In [104]:
_disclosure.add_source(
    url=disclosure_data_url,
    note="LDA Form LD-1"
)

# Modeling Relationships

The Open Civic Data spec also allows you to model the relationships between entities in the world. Many relationships are between a `Person` and an `Organization`, but it's possible for two `Organization` entities to have relationships with each other as well. Here are some examples:

  - **Employment** - a `Person` is a member of an `Organization` that employs them (e.g., a lobbyist might be a `Person` who is a member of a lobbying `Organization`).
  - **Elected Official** - if a `Person` was elected to a governmental body, then that person is a member of the `Organization` that represents said body (e.g., a senator is a `Person` who is a member of the "United States Senate" `Organization`).
  - **Hierarchy** - an `Organization` can have another `Organization` as its `parent` (e.g., the "United States Senate" is the `parent` `Organization` of the "Office of Public Record, US Senate" `Organization`).
  - **Ownership** - one `Organization` that owns another (for some definition of ownership) is its `parent` (e.g., the `Organization` "The Coca-Cola Company" is the `parent` of the "The Coca-Cola Company Nonpartisan Committee For Good Government" `Organization`).

## Memberships

You can represent a membership by either adding a member to an `Organization` object, or adding a membership to a `Person` object. There's no practical difference between the two.

When adding a membership, you are creating a `Membership` object. If you'd like to look at the source code, [check it out here](https://github.com/influence-usa/pupa/blob/disclosures/pupa/scrape/popolo.py#L37-L65). We'll take a closer look at this object below.

### Adding a member to an organization

In [112]:
_lobbyist = Person(
    name="Patricia DeLoatche",
)

_membership = _lobbying_org.add_member(_lobbyist)

Much like the `EventAgendaItem`, the helper function that adds a member to an `Organization` also returns something: the `Membership` object that represents that relationship.

In [113]:
_membership.as_dict()

{'_id': 'b83b9b7d-e54b-11e4-a4f5-e90fe0697b56',
 'contact_details': [],
 'end_date': '',
 'extras': {},
 'label': '',
 'links': [],
 'on_behalf_of_id': None,
 'organization_id': '6922f3f4-e486-11e4-a4f5-e90fe0697b56',
 'person_id': 'b83b9b7c-e54b-11e4-a4f5-e90fe0697b56',
 'post_id': None,
 'role': 'member',
 'start_date': ''}

We can add more detail to this `Membership` using the `role` and `label` properties. 

In [114]:
_membership.role = "lobbyist"
_membership.label = "lobbyist, as reported on LDA form LD-1"

In [115]:
_membership.as_dict()

{'_id': 'b83b9b7d-e54b-11e4-a4f5-e90fe0697b56',
 'contact_details': [],
 'end_date': '',
 'extras': {},
 'label': 'lobbyist, as reported on LDA form LD-1',
 'links': [],
 'on_behalf_of_id': None,
 'organization_id': '6922f3f4-e486-11e4-a4f5-e90fe0697b56',
 'person_id': 'b83b9b7c-e54b-11e4-a4f5-e90fe0697b56',
 'post_id': None,
 'role': 'lobbyist',
 'start_date': ''}

It's also possible to add details such as start and end dates, if they're available.

In [122]:
_membership.start_date = '2012-01-25'
_membership.end_date = '2014-05-06'

### Adding a membership to a Person

We could have made the above connection in the opposite direction. Let's do that for the next lobbyist:

In [120]:
_next_lobbyist = Person(
    name="Peter Goodloe"
)

_next_membership = _next_lobbyist.add_membership(
    organization=_lobbying_org,
    role="lobbyist",
    label="lobbyist, as reported on LDA form LD-1"
)

In [121]:
_next_membership.as_dict()

{'_id': '12e9cd7d-e54d-11e4-a4f5-e90fe0697b56',
 'contact_details': [],
 'end_date': '',
 'extras': {},
 'label': 'lobbyist, as reported on LDA form LD-1',
 'links': [],
 'on_behalf_of_id': None,
 'organization_id': '6922f3f4-e486-11e4-a4f5-e90fe0697b56',
 'person_id': '12e9cd7c-e54d-11e4-a4f5-e90fe0697b56',
 'post_id': None,
 'role': 'lobbyist',
 'start_date': ''}

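Comparing this membership to the one created above, it's clear that they are parallel objects: they share the same `organization_id`, `role` and `label`, and differ only in `_id` and `person_id`.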
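
To make the comparison concrete, the sketch below diffs the two outputs as plain dicts, trimmed to the populated keys. The values are copied from the `as_dict()` outputs above.

```python
membership_a = {
    "_id": "b83b9b7d-e54b-11e4-a4f5-e90fe0697b56",
    "organization_id": "6922f3f4-e486-11e4-a4f5-e90fe0697b56",
    "person_id": "b83b9b7c-e54b-11e4-a4f5-e90fe0697b56",
    "role": "lobbyist",
    "label": "lobbyist, as reported on LDA form LD-1",
}

membership_b = {
    "_id": "12e9cd7d-e54d-11e4-a4f5-e90fe0697b56",
    "organization_id": "6922f3f4-e486-11e4-a4f5-e90fe0697b56",
    "person_id": "12e9cd7c-e54d-11e4-a4f5-e90fe0697b56",
    "role": "lobbyist",
    "label": "lobbyist, as reported on LDA form LD-1",
}

# Collect the keys whose values differ between the two memberships.
differing = sorted(k for k in membership_a if membership_a[k] != membership_b[k])

print(differing)   # ['_id', 'person_id']
```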

### Including Posts

For more official positions in organizations, it's sometimes useful to represent the position independently of the person who happens to hold it at any point in time. For instance, the `Post` of "Arizona Secretary of State" existed before its current officeholder and it probably will continue to exist after she's replaced.

In [124]:
_office_secretary_of_state = Organization(
    name="Office of the Secretary of State, State of Arizona",
    classification="office"
)

_az_sos = _office_secretary_of_state.add_post("Arizona Secretary of State", role="secretary of state")

In [128]:
_former_az_sos = Person(
    name="Ken Bennett"
)

_current_az_sos = Person(
    name="Michele Reagan"
)

_former_membership = _office_secretary_of_state.add_member(
    _former_az_sos,
    role=_az_sos.role,
    post_id=_az_sos._id,
    start_date='2009-01-26',
    end_date='2015-01-05'
)

_current_membership = _office_secretary_of_state.add_member(
    _current_az_sos,
    role=_az_sos.role,
    post_id=_az_sos._id,
    start_date='2015-01-05'
)
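
One payoff of modeling the post separately is that you can ask who held it on a given day. The sketch below does that over plain dicts that mirror the two memberships above; `holder_on` is a hypothetical helper, and it treats an empty `end_date` as "still in office", matching the empty strings seen in the `as_dict()` outputs.

```python
memberships = [
    {"person": "Ken Bennett", "post": "Arizona Secretary of State",
     "start_date": "2009-01-26", "end_date": "2015-01-05"},
    {"person": "Michele Reagan", "post": "Arizona Secretary of State",
     "start_date": "2015-01-05", "end_date": ""},
]

def holder_on(post, day):
    """Return the person holding `post` on the YYYY-MM-DD date `day`, or None."""
    for m in memberships:
        if m["post"] != post:
            continue
        # Full ISO dates compare correctly as strings.
        started = m["start_date"] <= day
        # An empty end_date means the membership is still open.
        not_ended = m["end_date"] == "" or day < m["end_date"]
        if started and not_ended:
            return m["person"]
    return None

print(holder_on("Arizona Secretary of State", "2012-06-01"))  # Ken Bennett
print(holder_on("Arizona Secretary of State", "2015-03-01"))  # Michele Reagan
```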

### No need to yield Memberships

We don't need to explicitly `yield` the memberships that we create. When the people and organizations that they connect are yielded, the memberships will be, too.

## Organizational Parents

Because governmental organizations are often hierarchically related, `Organization` objects have a `parent` property that allows them to represent that kind of relationship. In the example below it is set through the `parent_id` keyword argument.

In [129]:
legislature = Organization(
    name="United States Congress",
    classification='legislature'
)


senate = Organization(
    name="United States Senate",
    classification='upper',
    parent_id=legislature._id,
)

sopr = Organization(
    name="Office of Public Record, US Senate",
    classification="office",
    parent_id=senate._id,
)
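
Once parents are recorded this way, the full chain of command can be recovered by following `parent_id` links upward. The sketch below uses plain dicts with short made-up keys in place of the generated ids; `ancestors` is a hypothetical helper, not part of pupa.

```python
# Hypothetical short ids stand in for the UUIDs that pupa generates.
orgs = {
    "congress": {"name": "United States Congress", "parent_id": None},
    "senate": {"name": "United States Senate", "parent_id": "congress"},
    "sopr": {"name": "Office of Public Record, US Senate", "parent_id": "senate"},
}

def ancestors(org_id):
    """Return organization names from the given one up to the root."""
    chain = []
    while org_id is not None:
        org = orgs[org_id]
        chain.append(org["name"])
        org_id = org["parent_id"]
    return chain

print(ancestors("sopr"))
# ['Office of Public Record, US Senate', 'United States Senate', 'United States Congress']
```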