Supreme Court dataset (for Python 3)
A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 — the "modern" era.
Records include the following fields:
text: full text of the Court's decision
case_name: name of the court case, in all caps
argument_date: date on which the case was argued before the Court, as a string with format 'YYYY-MM-DD'
decision_date: date on which the Court's decision was announced, as a string with format 'YYYY-MM-DD'
decision_direction: ideological direction of the majority decision; either 'conservative', 'liberal', or 'unspecifiable'
maj_opinion_author: name of the majority opinion's author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes
n_maj_votes: number of justices voting in the majority
n_min_votes: number of justices voting in the minority
issue: subject matter of the case's core disagreement (e.g. affirmative action) rather than its legal basis (e.g. the equal protection clause), as a string code whose mapping is given in SupremeCourt.issue_codes
issue_area: higher-level categorization of the issue (e.g. Civil Rights), as an integer code whose mapping is given in SupremeCourt.issue_area_codes
us_cite_id: citation identifier for each case according to the official United States Reports. Note: There are ~300 cases with duplicate ids, and it's not clear if that's "correct" or a data quality problem
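As a rough illustration of working with these fields, the sketch below parses the 'YYYY-MM-DD' date strings and looks for repeated us_cite_id values. It assumes each record is available as a plain dict keyed by the field names above; the records themselves are made-up placeholders (not real cases), and the helper names days_to_decision and duplicate_cite_ids are hypothetical, not part of any library.

```python
from collections import Counter
from datetime import datetime

# Placeholder records for illustration only; these are NOT real cases.
records = [
    {"case_name": "EXAMPLE V. PLACEHOLDER", "argument_date": "1970-01-12",
     "decision_date": "1970-03-02", "decision_direction": "liberal",
     "n_maj_votes": 6, "n_min_votes": 3, "us_cite_id": "000 U.S. 1"},
    {"case_name": "SAMPLE V. STANDIN", "argument_date": "1970-02-02",
     "decision_date": "1970-03-02", "decision_direction": "conservative",
     "n_maj_votes": 5, "n_min_votes": 4, "us_cite_id": "000 U.S. 1"},
    {"case_name": "DEMO V. MOCKUP", "argument_date": "1971-10-04",
     "decision_date": "1971-12-06", "decision_direction": "unspecifiable",
     "n_maj_votes": 9, "n_min_votes": 0, "us_cite_id": "000 U.S. 2"},
]


def parse_date(value):
    """Convert a 'YYYY-MM-DD' string into a datetime.date."""
    return datetime.strptime(value, "%Y-%m-%d").date()


def days_to_decision(record):
    """Number of days between oral argument and the announced decision."""
    argued = parse_date(record["argument_date"])
    decided = parse_date(record["decision_date"])
    return (decided - argued).days


def duplicate_cite_ids(recs):
    """Map each us_cite_id that occurs more than once to its count."""
    counts = Counter(rec["us_cite_id"] for rec in recs)
    return {cite: n for cite, n in counts.items() if n > 1}


# Keep only decisions coded with a given ideological direction.
liberal_cases = [r for r in records if r["decision_direction"] == "liberal"]

print(days_to_decision(records[0]))
print(duplicate_cite_ids(records))
print([r["case_name"] for r in liberal_cases])
```

Because some citation ids repeat, us_cite_id is probably best not treated as a unique key; a check like duplicate_cite_ids can flag the affected records before any join or lookup.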
The text in this dataset was derived from FindLaw's searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court
The metadata was extracted without modification from the Supreme Court Database:
Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org.
Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/
This corpus' creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time
NOTE: The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model's duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)
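To make the threshold choice in the note above more concrete, here is a minimal sketch of picking a score cutoff that maximizes an F-beta score. It assumes that "precision had twice as much weight as recall" corresponds to beta = 0.5 in the usual F-beta formula; that is one common reading, not a confirmed detail of how the merge was run. The scored pairs are toy values, the function names are hypothetical, and nothing here calls the dedupe package.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scored_pairs, beta=0.5):
    """Return (threshold, score) maximizing F-beta over observed scores.

    scored_pairs is a list of (match_score, is_true_match) tuples; a pair
    is predicted to be a match when its score is >= the threshold.
    """
    total_true = sum(1 for _, is_match in scored_pairs if is_match)
    best = (None, -1.0)
    for threshold in sorted({score for score, _ in scored_pairs}):
        predicted = [is_match for score, is_match in scored_pairs
                     if score >= threshold]
        true_positives = sum(predicted)
        precision = true_positives / len(predicted)
        recall = true_positives / total_true if total_true else 0.0
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (threshold, score)
    return best


# Toy (match_score, is_true_match) pairs, invented for illustration.
scored_pairs = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.40, False), (0.20, False),
]

print(best_threshold(scored_pairs, beta=0.5))  # precision-weighted cutoff
print(best_threshold(scored_pairs, beta=1.0))  # balanced F1 cutoff
```

On this toy data the precision-weighted setting selects a stricter cutoff than the balanced one, which fits the stated goal of keeping incorrect text-to-metadata matches rare at the cost of missing some true matches.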