Scripts to scrape the FEC website and parse campaign filings
Fetching latest commit…
Cannot retrieve the latest commit at this time
FEC Scraper/FEC Parser: ----------------------- By Christopher Schnaars and Anthony DeBarros, USA TODAY Developed with Python 2.7.2 Contents: --------- 1 - What's in the box? 2 - Requirements 3 - Notes about FEC data 4 - What is FEC Scraper? 5 - What is FEC Parser? 6 - What don't FEC Scraper and FEC Parser do? 7 - How to incorporate a database manager 8 - Directories 1 - What's in the box? ---------------------- FEC Scraper and FEC Parser are separate Python scripts that work together to download and parse data from the Federal Election Commission (FEC) provided in ASCII-28 delimited format. Optionally, you can enable integration with a database manager (SQL Server by default) to help you import the data and maintain database integrity. The scripts can handle hundreds or even thousands of files at once. The end result is eight to 11 tab-delimited text files housing cleaned-up data ready for easy import into a database manager or spreadsheet program. On the github site, you'll find a file called FECScraper.sql, which contains all the scripts you need to create tables and all other necessary database objects. If you are using a database manager other than SQL Server, you can use the schema contained in FECScraper.sql to build tables on a different RDMS. 2 - Requirements ---------------- If you want to enable functionality that integrates FEC Scraper and FEC Parser with a database manager, you'll need a library to connect to a database manager of your choice, if you don't use the SQLite functionality baked into Python. Examples: * pyodbc (SQL Server, http://code.google.com/p/pyodbc/downloads/list) * mySQLdb (mySQL, http://mysql-python.sourceforge.net) * psycopg2 (PostgreSQL, http://initd.org/psycopg/download/) Otherwise, the scripts just use standard libraries included with Python. These scripts were built in Python 2.7.2. 3 - Notes about FEC data ------------------------ Every candidate or organization that has to file reports with the FEC is assigned a committee ID number. FEC Scraper will download all Form 3 filings for each committe you specify, including Forms 3A, 3N, 3PA, 3PN, 3XA and 3XN. FEC Parser will churn through the data housed in these files and output standardized, tab-delimited text files for easy import. The Form 3 data files include monthly, quarterly and annual reports housing itemized contributions, expenditures and loans, among other types of information, so they probably contain everything you need. These reports are required by the FEC for all PACs and other political organizations as well as candidates for President and the House of Representatives (but not the Senate, where members still engage in the archaic process of filing their reports on paper to hide contributors and expenditures). As of this writing, the FEC offers data files in both comma and ASCII-28 delimited formats. Because of inherent problems with .csv files (namely, commas in the data itself), FEC Scraper uses the ASCII-28 delimited format. This ASCII code is obscure and unlikely to show up in your data. Each data file contains two (and sometimes three) header rows. Among other information, these header rows specify the header version used by the data file. FEC Scraper downloads all Form 3 filings regardless of the header version, but FEC Parser supports only header versions 6.4, 7.0 and 8.0. Data files that utilize other header versions are moved to the Review directory and are not processed. Please note that as of this writing, if a data file contains a third header row, that row is ignored. A third header row occurs when data is included for Column C on 30G reports. You can see what that form looks like at: http://www.fec.gov/pdf/forms/fecfrm3post.pdf The script also does not yet handle Schedule H records, which sometimes occur on Form F3X. Support for this data will be added in an upcoming release. For what it's worth, processing all data for more than 250 PACs and all candidates for President and the House of Represenatives yielded just under 50 data rows, so this issue is minimal. All Schedule H data rows presently are saved to the Review file. 4 - What is FEC Scraper? ------------------------ FEC Scraper is a Python script you can run to find and download campaign finance filings from the Federal Election Commission website for specified committees, including candidates for president and the House of Representatives as well as Super PACs and other organizations. FEC Scraper is most powerful when you take advantage of its ability to interact with a database manager. When this functionality is enabled (by setting the usedatabaseflag variable to 1), the script will retrieve a list of committees that already exist in the database as well as all files housed in the database, then go to the FEC website to look for and scrape new filings for those committees. Reports already housed in the database will not be downloaded. You also can specify a "Processed" directory. FEC Parser will move files that have been parsed successfully to this directory, and FEC Scraper will look in this directory to determine whether a data file previously has been imported. This helps ensure that even if the script doesn't interact with a database manager, you can avoid importing the same file more than once as long as you don't delete the processed files. (Development note: In a future version, we could have FEC Parser create and modify a text file housing a list of all Image IDs that have been processed.) If you don't interact with a database or if you want to download reports for a committe that does not exist in the database, you must specify in the script which committees you want to scrape. This is handled in a Try block near line 100 of the code. There are two ways to specify committees: First, the code will look for a text file called commidappend.txt. You can specify a list of committees (one per line) in the file. Don't use commas or any other separator, and make sure the committe IDs begin with a capital C. The file should look something like this: C00431445 C00500587 C00431171 Additionally, you can write code in the Finally block to append the commidlist list directly. Examples: commidlist.append('C00431445') # Obama for America commidlist.append('C00500587') # RickPerry.org Inc. commidlist.append('C00431171') # Romney for President Inc. 5 - What is FEC Parser? ----------------------- FEC Parser is a Python script that converts ASCII-28 delimited data files from the FEC into standard tab-delimited text files that easily can be imported into any database manager or spreadsheet program. When you download a Form 3 filing from the FEC, the data file contains two or three header rows at the top of the file, followed by data housed on eight separate forms (Schedules A, B, C, C1, C2, D and E and a long-text form). This data is in no particular order, and each form has different numbers of columns, and different column headings, making import tricky. Additionally, the FEC routinely adds and removes columns. If FEC Parser finds a file that is not a Form 3 filing (see the "Notes about FEC data" section above), the file will be moved to a Review directory you specify. In some cases, FEC Parser will append to the filename a brief explanation of why the file was not parsed. FEC Parser will go through all files with an .fec extension in whatever directory you specify and output eight tab-delimited text files: one for each schedule form. FEC Parser can process hundreds or even thousands of files at once. Each data row will be prepended with the FEC's unique ID for that file (called the ImageID) to create parent-child relationships between a header and each data row. Any tabs contained in the data are converted to spaces during parsing. If you implement the functionality that allows FEC Parser to interact with a database manager, it will check to make sure each report has not previously been imported into the database. If not, it will load the header rows into parent tables. If database functionality is disabled (the default behavior), FEC Parser will create three additional text files for the F3, F3P and F3X headers. FEC Parser presently supports only header versions 6.4, 7.0 and 8.0. Data files that utilize other header versions are moved to the Review directory and are not processed. Please note that due to the inherent messiness of FEC data (particularly problems with missing or excess column delimiters, FEC Parser will add delimiters to data rows that do not have the correct number of columns and remove excess delimiters when there are too many. If FEC Parser finds data in any excess columns, that row is instead written to the review file so no data is lost. This could cause data to be written to the wrong column if excess or missing delimiters occur anywhere other than the end of a data row. I've tested the data extensively, however, and as of this writing, I have never found a problem other than at the end of a data row. 6 - What don't FEC Scraper and FEC Parser do? --------------------------------------------- FEC Scraper downloads only Form 3 filings. (For more information about Form 3 and other filings, see the "Notes about FEC data" section above.) All other filings are ignored. Additionally, FEC Scraper does not download data for committees you don't specify. For the script to download filings for a particular committee you must do one of two things: * You must specify the committee directly in the code (commented-out examples of how to do this are included in the script). * You must enable database integration (by setting usedatabaseflag to 1) AND you must have data for that committee in the database. (When database integration is enabled, FEC Scraper will get a list of existing committees from the database and automatically download all new filings for them.) If you have database integration enabled, FEC Parser will load the header rows from the data files into your database. Otherwise, the script does not load any data into your database. You must manually import the data files after the script runs. Additionally, FEC Parser handles only Form 3 filings, and those filings must have header version 6.4, 7.0 or 8.0. All other form types and all Form 3 filings with other header versions are moved to the Review directory and are not processed. 7 - How to incorporate a database with FEC Scraper --------------------------------------------------- Near the top of the code for both FEC Scraper and FEC Parser, in the User Variables section, you can set the usedatabaseflag variable to 1 if you want the script to interact with a datatbase manager and 0 if you want the script to run independent of a database. Both scripts include code to import the pyodbc module (http://code.google.com/p/pyodbc/downloads/list) to integrate them with a SQL Server database. If you want to use a different database manager, you will need to modify the code to use a different library. Here are a few suggestions: * mySQL: MySQLdb module (http://mysql-python.sourceforge.net) * PostgreSQL: psycopg2 module (http://initd.org/psycopg/download/) For the default SQL Server code, you will need to specify the server and the name of the database (if you've named it something other than FEC) in the connection string. If necessary, you also can specify a username (UID) and password (PWD). When FEC Scraper runs with database integration enabled, it calls two stored procedures that create lists of Committee IDs and Image IDs that already exist in the database. FEC Scraper uses these lists to determine what committees to scrape and what files do not need to be downloaded. The script also will not download any files that already exist in the processed or review directories. When FEC Parser runs with databse integration enabled, it will check the database each time it processes a file to make sure that Image ID does not already exist in the database. The user will be alerted any time it finds a duplicate file, and that file will be omitted from the text files created by the script. Additionally, the header rows are loaded into the database for you. When database integration is disabled, FEC Parser will output the header rows to an additional three data files that you can import manually. The FECScraper.sql file contains all the SQL Server scripts you need to create the necessary tables, stored procedures and other database objects. You can use this code to create similar objects for other database managers. 8 - Directories --------------- You can specify up to four directories for various files: * savedir: This is where FEC Scraper saves files. FEC Parser will process all .fed files housed in this directory. * reviewdir: FEC Scraper gets a list of files housed in this directory and uses this list to avoid downloading previously downloaded files. FEC Parser moves all files that cannot be processed to this directory, such as files that are not Form 3 or use an unsupported header version. For data rows that are in a valid file but can't be parsed, FEC Parser writes those data rows to a single timestamped review file and saves that file here. No files in this directory are altered by either script, though FEC Parser may append to the file name a brief description of why the file could not be parsed. * processeddir: FEC Scraper gets a list of files housed in this directory and uses this list to avoid downloading previously downloaded files. FEC Parser moves all valid, processed files from the save directory to this directory. * outputdir: FEC Scraper does not use this directory. FEC Parser saves the time-stamped tab-delimited data files it creates in this directory.