Building Permit Exploration #16

Open
NealHumphrey opened this issue Oct 31, 2016 · 19 comments

Comments

@NealHumphrey
Collaborator

NealHumphrey commented Oct 31, 2016

Figure out how best to get building permit data.

Data sources:

Questions:

  • How can we filter for only the 'big investment' types of permits (condo conversions, big retail rehabs, etc.)?
  • How can we pull the data systematically while still keeping a long enough history to cover the lifetime of major projects? (For example, the 30-day version alone isn't enough, but maybe we can combine sources and/or maintain a cache ourselves.)
  • Make sure we have geocode data available for each permit, joining appropriate tables as needed (if not in source data, first place to look would be the MARS - Master Address Repository for DC)

Location Team:
-Using geocoded permit location data to return a list of all building permits within X miles of a specific affordable building.
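
A rough sketch of the "within X miles" lookup, assuming each permit has already been geocoded. The `latitude` and `longitude` keys and both function names are hypothetical, not taken from the project code. It uses the haversine great-circle distance, which should be accurate enough at city scale.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def permits_within(permits, building_lat, building_lon, max_miles):
    """Return the permits located within max_miles of the building.

    `permits` is an iterable of dicts with hypothetical 'latitude' and
    'longitude' keys holding decimal degrees.
    """
    return [
        p for p in permits
        if haversine_miles(building_lat, building_lon,
                           p["latitude"], p["longitude"]) <= max_miles
    ]
```

For large tables this filter would probably be better done in the database (for example with a spatial index), but a plain Python pass like this may be enough for exploration.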

@NealHumphrey NealHumphrey added this to the Static copies of all our data, data dictionary, and common join queries milestone Dec 21, 2016
@salomoneb
Collaborator

salomoneb commented Jan 6, 2017

Hey Neal,

A couple of items:

  1. I emailed citydw@dc.gov for the data here. We'll see if they get back to me.

  2. In the meantime, I'm trying to load the 2016 data from opendata.dc.gov into the database. I know you mentioned a few issues with it, but it seems like a good starting point. However, when I load the CSV and run the Python script, I get the following error:

sys:1: DtypeWarning: Columns (29,33) have mixed types. Specify dtype option on import or set low_memory=False.
in memory...
Traceback (most recent call last):
File "//anaconda/envs/housing-insights/lib/python3.5/site-packages/sqlalchemy/pool.py", line 1122, in _do_get
return self._pool.get(wait, self._timeout)
File "//anaconda/envs/housing-insights/lib/python3.5/site-packages/sqlalchemy/util/queue.py", line 145, in get
raise Empty
sqlalchemy.util.queue.Empty

This message is followed by a lot of other exception errors.

The first line says that there are conflicting data types in columns 29 and 33. Checking the columns, there are no weird values that would obviously be throwing everything off. This thread recommends specifying the data type.

Following the documentation here and here, I modified this line of the processing script to specify the dtype in a few different ways, but it didn't work.

The docs also say you need the StringIO class from the io module, which I'm not sure is configured right now.
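
For reference, a minimal sketch of specifying the dtype on import, as the warning suggests. The column names below are made up for illustration; the real columns at positions 29 and 33 would need to be looked up in the actual file. As far as I can tell, the pandas documentation uses `StringIO` only to build an in-memory sample file, so reading a real CSV from disk should not need it.

```python
import io

import pandas as pd

# In-memory stand-in for the permits CSV. Column names are hypothetical.
sample_csv = io.StringIO(
    "PERMIT_ID,ZIPCODE,FEES_PAID\n"
    "B1600001,20001,150.00\n"
    "B1600002,20002-1234,\n"
    "B1600003,,75.50\n"
)


def read_permits(source, text_columns):
    """Read a permits CSV, forcing the named columns to be parsed as text.

    Declaring the dtype up front avoids pandas guessing different types
    for different chunks of the same column (the DtypeWarning case).
    """
    dtype = {name: str for name in text_columns}
    return pd.read_csv(source, dtype=dtype, low_memory=False)


df = read_permits(sample_csv, text_columns=["ZIPCODE"])
```

With a real file, `source` would be the path to the CSV instead of the `StringIO` object.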

Any insight would be appreciated here.

@NealHumphrey
Collaborator Author

NealHumphrey commented Jan 6, 2017

  1. cool, let me know
  2. My only issue with the opendata.dc.gov data is update frequency - they are unreliable with how often data is updated. Clicking on the metadata link it says it was last updated Jan 20 2016; I think that if we just check the date stamp of the permits it will tell us how out of date it actually is.

I have not encountered that error before. I'll dig into it. I'll probably also migrate over those other changes I made to the code in the housing-risk project that make the upload scripts more robust, so watch for a pull request with notes describing the changes.

@NealHumphrey
Collaborator Author

p.s. the way I wrote the script, I deliberately used this method because I didn't have to specify the data type - i.e. it's a quick and dirty method that makes it easy to add new files to the upload. But especially with large files it does have the limitations noted in that thread. The fix will likely require optionally specifying some data types. I will probably do that by incorporating it into the json file that deals with columns that have dates.

@NealHumphrey
Collaborator Author

NealHumphrey commented Jan 6, 2017

@salomoneb First note, this line is the one that would need to be modified to fix the error you noted, as it is the one that reads the actual file. The line you pointed to (52) is the one that just reads the manifest file, not the file itself.

@salomoneb
Collaborator

salomoneb commented Jan 6, 2017

Thanks. I noticed that line after I sent my message and was wondering about it. I'll play around with the load some more when I have time tonight and this weekend. I agree that opendata.dc.gov may not be our best choice if the data's not current or complete; however, when you open the file, it seems to have permits going all the way through 12/30/16. I wonder if the metadata link refers to when the page was initially created, or if the file is being updated in a way that's not reflected in that date.

If the data is complete, it'll serve as a good set of foundational/historical information to start from. We can then figure out a better real-time source to tap into for 2017 and beyond.

I also noticed that the metadata link lists octo.dc.gov as well as dcra.dc.gov under Credits (Attribution) in the lower right sidebar. If it's actually pulling from the octo address, we may have the most complete info here.

@NealHumphrey
Collaborator Author

Ok, I just merged a hotfix into dev, and ran the code.

In the medium term, we should rewrite the whole loading function. There are a few issues:

  • The benefit of our current method is you don't have to add custom configuration for newly added data sources, because Pandas parses data types automatically. For long term stability this is a downside, because we risk our updated data sources changing format and breaking other stuff down the pipeline.
  • Pandas to_sql is super slow when connecting to a remote server
  • Pandas is generally not recommended in production code - although its ease of use might outweigh that for this particular project, TBD.

I added a new issue with some notes on this refactor, which we should discuss: #54

@NealHumphrey
Collaborator Author

FLAG: we need to reload this table and parse issue_date as a date field. We should check the other date columns too...
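
One possible way to do the date parsing in pandas before the upload, shown as a sketch only. The helper name is hypothetical, and `errors="coerce"` is an assumption about how unparseable values should be handled (they become `NaT` rather than raising).

```python
import pandas as pd


def parse_date_column(df, column="issue_date"):
    """Return a copy of df with `column` converted to datetime64.

    Values that cannot be parsed become NaT instead of raising.
    """
    out = df.copy()
    out[column] = pd.to_datetime(out[column], errors="coerce")
    return out
```

Once the column is a real date type, `parsed["issue_date"].max()` gives the newest permit in the file, which is a quick check of how stale a data set actually is.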

@NealHumphrey
Collaborator Author

@salomoneb if you're at the session tonight, we should coordinate on this issue too so I know where it stands. As with zoning, we should contact OCTO or whoever to find out what's up with this data. Did you ever hear back from your email?

@salomoneb
Collaborator

I'll be at the session. Looks like there's a separate source for 2017 data now, too. When/if I get a hold of the map specialist today, I'll ask them about this info.

@NealHumphrey
Collaborator Author

Looking into this again. As of now, the 2017 data set does not appear to be updating (only a handful of permits, last modified around Jan 1). Things we need to do:

  • Find how to get access to the live update version if opendata.dc.gov is not going to cooperate.
  • Figure out if we need to deduplicate the annual data versions. On opendata.dc.gov, each annual data set says it contains all permits applied for or issued in that year, so a permit may appear twice: first in the year it was applied for, and again if it was approved in the next calendar year.
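
If deduplication does turn out to be needed, a sketch along these lines might work. It assumes the annual files share a stable unique permit identifier; the `permit_id` column name and the function name are hypothetical and would need to be checked against the real data.

```python
import pandas as pd


def combine_annual_permits(frames, key="permit_id"):
    """Stack annual permit tables and keep one row per permit.

    `frames` must be ordered oldest to newest, so that keep="last"
    retains the most recent year's version of a repeated permit.
    """
    combined = pd.concat(frames, ignore_index=True)
    deduped = combined.drop_duplicates(subset=key, keep="last")
    return deduped.reset_index(drop=True)
```

Usage would look like `combine_annual_permits([permits_2016, permits_2017])`, where each argument is one year's table loaded as a DataFrame.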

@NealHumphrey
Collaborator Author

I sent a note on 4/10/2017 to open.data@dc.gov asking why the building permits data is not updating. It looks like the opendata dataset and the octo data catalog data set are both failing to include new permits. I think we will be able to use the opendata.dc.gov version once this is fixed. Hold off on integrating updated data until we find out the reason for this issue.

@NealHumphrey
Collaborator Author

Never received a response from open.data@dc.gov. Sent a second inquiry on 5/5/2017.

@NealHumphrey
Collaborator Author

Response from open.data@dc.gov:

"You are correct, the most recent data available for building permits is January 2017. This is what you are seeing within the dataset. Open Data DC is not receiving updates from DCRA at this time. Although, we are working with DCRA to re-establish the live feed but do not yet have a timeline. The date seen in the details page is actually a date where we made an edit to the description. In this case, we added a note regarding the break in data (see attached).
...
Understandably, timely data is a must. Therefore suggest reaching out to DCRA’s FOIA officer, Runako Allsopp. "

@emkap01 Are you up for bugging some government officials? Maybe start w/ their suggested DCRA contact and reach out to Traci Hughes for help if needed?

@emkap01
Collaborator

emkap01 commented May 10, 2017

Sure, I will reach out.

@NealHumphrey
Collaborator Author

Flagging the ESRI geojson limit on this ticket too, which came up in tax ingestion. The JSON APIs have a default limit of 1,000 records. Looks like the best solution is to take the 'geojson' API endpoint but swap .geojson for .csv, which gives a bulk download of all the data, already in CSV format.
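
The URL swap could be scripted roughly like this. The function name is hypothetical and the `example.com` URL in the usage note is a placeholder, not a real opendata.dc.gov endpoint.

```python
from urllib.parse import urlsplit, urlunsplit


def geojson_to_csv_url(url):
    """Swap a trailing .geojson in the URL path for .csv.

    Any query string is left untouched. Raises ValueError if the
    path does not end in .geojson.
    """
    parts = urlsplit(url)
    suffix = ".geojson"
    if not parts.path.endswith(suffix):
        raise ValueError("URL path does not end in .geojson: " + url)
    csv_path = parts.path[: -len(suffix)] + ".csv"
    return urlunsplit(parts._replace(path=csv_path))
```

For example, `geojson_to_csv_url("https://example.com/datasets/permits.geojson")` returns `"https://example.com/datasets/permits.csv"`, which could then be passed to whatever download step the loader uses.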

@louvis
Collaborator

louvis commented May 21, 2017

Hey @NealHumphrey and @emkap01, any word from DCRA on updated building permit data? I've been playing around with some of the data in Tableau and would like to use it but don't want to make it a point of emphasis if it's not accurate.

@NealHumphrey
Collaborator Author

@louvis It looks like opendata.dc.gov still does not have access to current data. @emkap01 did you ever reach out to the DCRA FOIA officer? Let me know if you're able to or want me to help.

@louvis I'm pretty sure this data should be reinstated at some point. Data is definitely valid through 2016, so you could do analysis that ends in that calendar year and I think it would be valuable both on its own and as a template for when data access is restored.

@emkap01
Collaborator

emkap01 commented May 23, 2017

Sorry for the delay on this. I emailed Runako and copied you on it. Once we hear back from him I will follow up as appropriate.

@louvis
Collaborator

louvis commented Jul 18, 2017

Hey @NealHumphrey and @emkap01 it looks like the 2017 building permit dataset has been updated through the end of June based on opendata.dc.gov. Issue dates run up into July as well so we might be able to add it to our existing 2016 data.

@NealHumphrey NealHumphrey removed the SQL label Sep 12, 2017