Fetching latest commit…
Cannot retrieve the latest commit at this time
|Failed to load latest commit information.|
FEC Scraper: ------------ By Christopher Schnaars and Anthony DeBarros, USA TODAY Developed with Python 2.7.2 Contents: --------- 1 - What is FEC Scraper? 2 - Requirements 3 - What does FEC Scraper do? 4 - What doesn't FEC Scraper do? 5 - How to incorporate a database with FEC Scraper 6 - How to specify committees that should be scraped 7 - Directories 1 - What is FEC Scraper? ------------------------ FEC Scraper is a Python script you can run to find and download campaign finance filings from the Federal Election Commission website for specified committees, including candidates for president and the House of Representatives as well as Super PACs and other organizations. The scraper is most powerful when you take advantage of its ability to interact with a database manager. The default behavior is for the script to retrieve a list of committees that already exist in the database as well as all files housed in the database, then go to the FEC website to look for and scrape new filings for those committees. The scraper can be used in conjunction with FEC Parser (coming soon) to convert all of the FEC's data into neat, delimited text files for easy import into your database. 2 - Requirements ---------------- To run FEC Scraper, you will need to install the mechanize module (http://wwwsearch.sourceforge.net/mechanize/download.html). If you want to enable functionality that integrates FEC Scraper with a database manager, you'll also need a library to connect to a database manager of your choice, if you don't use the SQLite functionality baked into Python. For example: * pyodbc (SQL Server, http://code.google.com/p/pyodbc/downloads/list) * mySQLdb (mySQL, http://mysql-python.sourceforge.net) * psycopg2 (PostgreSQL, http://initd.org/psycopg/download/) 3 - What does FEC Scraper do? ----------------------------- Every candidate or organization that has to file reports with the FEC is assigned a committee ID number. You can specify the committee IDs FEC Scraper should use when looking for new reports. There are two ways to do this. First, you have the option to specify a database the script can interact with. When you do this, FEC Parser automatically will download all Form 3 filings for all committees in the database, minus any filings previously downloaded or imported into the database. (You can download FECParser.sql for all the scripts you need to create a SQL Server database with all the necessary tables and stored procedures. You also can use these scripts to develop a database for a different database manager, such as mySQL, PostgreSQL or Access.) Second, you can specify a list of committees that you want scraped directly in the script. Even if you are interacting with a database, you'll need to specify the IDs for any committees that never have been downloaded previously. FEC Parser only downloads versions of Form 3, including Forms 3A, 3N, 3PA, 3PN, 3XA and 3XN. This covers monthly and quarterly reports filed by PACs and other political organizations as well as candidates for President and the House of Representatives (but not the Senate, where members still engage in the archaic process of filing their reports on paper to hide contributors and expenditures). Form 3 includes contributions, expenditures and loans, among other types of information, so it probably contains everything you need. As of this writing, the FEC offers data files in both comma and ASCII-28 delimited formats. Because of inherent problems with .csv files (namely, commas in the data itself), FEC Scraper uses the ASCII-28 delimited format. This ASCII code is obscure and unlikely to show up in your data. (The FEC Parser sister script will convert these files to standard tab-delimited files for easier import into a database manager.) 4 - What doesn't FEC Scraper do? -------------------------------- FEC Scraper does not download data for committees you don't specify. Additionally, it does not download data files for any form types other than Form 3. Additionally, FEC Scraper's use is solely to scrape the FEC site and download the data files. The script does not do any data processing. 5 - How to incorporate a database with FEC Scraper --------------------------------------------------- Near the top of the script, in the User Variables section, you can set the usedatabaseflag variable to a value other than 1 if you don't want the script to integrate with a database. As written, FEC Parser uses the pyodbc module (http://code.google.com/p/pyodbc/downloads/list) to communicate with a SQL Server database. If you want to use a different database manager, you should be able to modify the code pretty easily. Here are a few suggestions: * mySQL: MySQLdb module (http://mysql-python.sourceforge.net) * PostgreSQL: psycopg2 module (http://initd.org/psycopg/download/) For the default SQL Server code, you will need to input the server and the database (if you've named it something other than FEC). If necessary, you also can specify a username (UID) and password (PWD). When the script runs, it fires two stored procedures that create a list of Committee IDs and data files that already exist in the database. FEC Scraper uses these lists to determine what committees to scrape and what files do not need to be downloaded. The FECScraper.sql file contains all the SQL Server scripts you need to create the necessary tables and stored procedures. You can use this code to create similar objects for another database manager. 6 - How to specify committees that should be scraped ---------------------------------------------------- Even if you link FEC Scraper to a database, you'll still need to explicitly tell the script to download files for committees that don't already exist in the database. You'll find commented out examples of how to explicitly add a committee to the script. For example: commidlist.append('C00431445') 7 - Directories --------------- You can specify up to three directories for various files: * savedir: This is where scraped files will be saved. * reviewdir: This is a directory where you can house files that you have not imported into your database because the data requires cleanup before it can be successfully imported. This script uses this directory only to look for files that previously have been downloaded. No files in this directory will be altered, and no new files will be saved here. * processeddir: Similarly, you can put files you've already imported into a database in this directory. As with reviewdir, the script will use this directory only to look for files that previously have been downloaded. It will not alter files in this directory or save new files here.