Cast vote record (CVR) files are downloaded from here: https://dataverse.harvard.edu/dataverse/rcv_cvrs

They are unpacked and placed into the corresponding folders in the "cvr_records/raw" folder: single-winner CVRs in the "single" folder, proportional CVRs in the "proportional" folder, and sequential CVRs in the "sequential" folder.

To record more precise information about each CVR without modifying the actual CVR files, the files are renamed to follow the format "[YYYY-MM-DD]_[City,_County,_State]_[Office].cvr", which is what the renaming code below produces. The same information could of course be kept in a JSON file, but a descriptive file name is nicer to read and understand and allows easier interoperability with other programs. A ".cvr" file is an ordinary ".csv"; it is saved with this extension to play nicely with GitHub Large File Storage (LFS) and .gitignore.
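
As a rough illustration of the naming scheme, the sketch below assembles a file name from its parts. It assumes the bracketed layout that the renaming code in this notebook writes, "[YYYY-MM-DD]_[City,_County,_State]_[Office].cvr". `build_cvr_name` and the example office name are hypothetical and are not part of the notebook.

```python
from datetime import date


def build_cvr_name(election_date, city, county, state, office):
    """Assemble a CVR file name (hypothetical helper; layout assumed from the renaming code)."""
    location = ',_'.join([city, county, state])
    return f'[{election_date.isoformat()}]_[{location}]_[{office}].cvr'


print(build_cvr_name(date(2020, 11, 3), 'Berkeley', 'Alameda', 'California', 'Mayor'))
# [2020-11-03]_[Berkeley,_Alameda,_California]_[Mayor].cvr
```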

The file names are then manipulated in the "raw" folder with the following code:

In [1]:
import os
import re


# Path to the raw CVR folder, relative to this notebook's directory
base_dir = os.path.join('..', 'cvr_records', 'raw')
path = os.path.join(os.getcwd(), base_dir)

print(path)

/Users/es5891/Documents/GitHub/bugs-in-democracy/team_arrow/utils_notebooks/../cvr_records/raw


Since all single-winner CVRs follow the same naming format, LOCATION_DATE_OFFICE.csv, we can parse each name and convert it, repeating the given location in all three location fields as a placeholder. Afterwards we add the additional location information manually (sorry).

In [15]:
def rename_files(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            # Extract details from the filename
            parts = re.split('_', filename)  # This splits the filename at underscores only

            # Here, we make some assumptions about how your current filenames are formatted
            location = parts[0]  # e.g., Berkeley
            date = parts[1]  # e.g., 11032020
            office = parts[2].replace('.csv','')  # e.g., MemberCityCouncilDist2BerkeleyRCV

            # Reformat the date
            date = date[4:8] + '-' + date[0:2] + '-' + date[2:4]  # change from MMDDYYYY to YYYY-MM-DD

            # Construct the new filename
            new_filename = f'[{date}]_[{location},_{location},_{location}]_[{office}].cvr'

            # Rename the file
            os.rename(os.path.join(directory, filename), os.path.join(directory, new_filename))
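
The date slicing above trusts that the middle part of the file name really is an eight-digit MMDDYYYY string. A minimal sketch of a stricter conversion, assuming it is acceptable to fail loudly on impossible dates, uses the standard library's `datetime.strptime`. The helper name `reformat_date` is hypothetical.

```python
from datetime import datetime


def reformat_date(mmddyyyy):
    """Convert 'MMDDYYYY' to 'YYYY-MM-DD'; raises ValueError for impossible dates."""
    return datetime.strptime(mmddyyyy, '%m%d%Y').strftime('%Y-%m-%d')


print(reformat_date('11032020'))  # 2020-11-03
```

A name whose date part is malformed then stops the run before any file is renamed, instead of silently producing a scrambled date.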

Now we rename the single winner CVRs to the format described above.

In [16]:
dir_path = os.path.join(base_dir, 'single')
rename_files(dir_path)

print(dir_path)

../cvr_records/raw/single
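
Because `os.rename` is awkward to undo, it can help to preview the mapping before touching the disk. The sketch below is one possible dry run, not the notebook's own method: `plan_renames` is a hypothetical helper that applies the same parsing rules to a list of names and returns (old, new) pairs, skipping anything that does not have exactly three underscore-separated parts.

```python
def plan_renames(filenames):
    """Return (old, new) name pairs for LOCATION_MMDDYYYY_OFFICE.csv files without renaming."""
    plan = []
    for filename in filenames:
        if not filename.endswith('.csv'):
            continue
        parts = filename[:-len('.csv')].split('_')
        if len(parts) != 3:
            continue  # unexpected shape: leave it for manual review
        location, date, office = parts
        iso_date = f'{date[4:8]}-{date[0:2]}-{date[2:4]}'  # MMDDYYYY -> YYYY-MM-DD
        new_name = f'[{iso_date}]_[{location},_{location},_{location}]_[{office}].cvr'
        plan.append((filename, new_name))
    return plan


for old, new in plan_renames(['Berkeley_11032020_MemberCityCouncilDist2BerkeleyRCV.csv']):
    print(f'{old} -> {new}')
```

In the notebook this could be called as `plan_renames(os.listdir(dir_path))` before running the real rename.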


The final task for the single-winner CVRs is adding the additional location information and standardizing the office names. Unfortunately, this has to be done manually (for now), but it is a one-time task.

Following the single-winner procedure, we now do the same for the sequential CVRs. However, the sequential CVR dataset also includes pretabulated CVRs, which are not needed for the analysis, so we remove them.

Once again the file names are manipulated in the "raw" folder with the following code:

In [18]:
def rename_files(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            # Extract details from the filename
            parts = re.split('_', filename)  # This splits the filename at underscores only

            # Here, we make some assumptions about how your current filenames are formatted
            location = parts[0]  # e.g., Berkeley
            date = parts[1]  # e.g., 11032020
            office = parts[2].replace('.csv','')  # e.g., MemberCityCouncilDist2BerkeleyRCV

            # If there exists a fourth part, it's the pretabulated CVR, so we delete the file
            if len(parts) > 3:
                os.remove(os.path.join(directory, filename))
                continue

            # Reformat the date
            date = date[4:8] + '-' + date[0:2] + '-' + date[2:4]  # change from MMDDYYYY to YYYY-MM-DD

            # Construct the new filename
            new_filename = f'[{date}]_[{location},_{location},_{location}]_[{office}].cvr'

            # Rename the file
            os.rename(os.path.join(directory, filename), os.path.join(directory, new_filename))

Now we remove the pretabulated CVRs and rename the sequential CVRs to the format described above.

In [19]:
dir_path = os.path.join(base_dir, 'sequential')
rename_files(dir_path)

print(dir_path)

../cvr_records/raw/sequential
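
Since the cell above deletes files outright, a cautious variant is to list which names would count as pretabulated first. The sketch below applies the same rule (more than three underscore-separated parts) to a plain list of names. `split_pretabulated` is a hypothetical helper, and the example file names are made up for illustration.

```python
def split_pretabulated(filenames):
    """Partition .csv names into (keep, pretabulated) using the 'more than 3 parts' rule."""
    keep, pretabulated = [], []
    for filename in filenames:
        if not filename.endswith('.csv'):
            continue
        parts = filename.split('_')
        if len(parts) > 3:
            pretabulated.append(filename)
        else:
            keep.append(filename)
    return keep, pretabulated


# Made-up names, only to show the rule in action
keep, pretabulated = split_pretabulated([
    'Example_01012020_Office.csv',
    'Example_01012020_Office_extra.csv',
])
print('keep:', keep)
print('would delete:', pretabulated)
```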


As with the single-winner CVRs, the additional location information still has to be added and the office names standardized. Unfortunately, this has to be done manually (for now), but it is a one-time task.

Finally, we do the same for the proportional CVRs. One CVR file is not in the standard format, namely "Minneapolis 2013-board of estimation and taxation cvr.csv", so it has to be renamed manually. Please rename it now to either the output format or the standard input format.
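
Rather than relying on memory, the odd file can be detected automatically. The sketch below flags any ".csv" name that does not look like LOCATION_MMDDYYYY_OFFICE.csv. The regular expression is an assumption about what "standard input format" means here, and `find_nonstandard` is a hypothetical helper.

```python
import re

# Assumed input pattern: LOCATION_MMDDYYYY_OFFICE.csv (extra '_' parts are tolerated)
STANDARD_INPUT = re.compile(r'^[^_]+_\d{8}_.+\.csv$')


def find_nonstandard(filenames):
    """Return the .csv names that do not match the assumed standard input pattern."""
    return [name for name in filenames
            if name.endswith('.csv') and not STANDARD_INPUT.match(name)]


names = [
    'Cambridge_11062001_CityCouncil.csv',
    'Minneapolis 2013-board of estimation and taxation cvr.csv',
]
print(find_nonstandard(names))
# ['Minneapolis 2013-board of estimation and taxation cvr.csv']
```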

The code for the proportional CVRs is as follows:

In [21]:
# Pause here in case the person running the notebook hasn't renamed the file yet (read the instructions in the cell above)
print("Please rename the file in the 'proportional' folder before continuing.")
input("Press Enter to continue...")

dir_path = os.path.join(base_dir, 'proportional')
rename_files(dir_path)

print(dir_path)

Please rename the file in the 'proportional' folder before continuing.
../cvr_records/raw/proportional


### TODO - Add additional location information to CVRs
### TODO - Standardize office names in CVRs
### TODO - Talk to Sam about GitHub LFS (CVRs are too large for the 1 GB free tier limit)

In [2]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="myGeocoder")
location_memory = {}  # a dictionary to save the previous selections

In [3]:

def parse_location(location):
    # Split the address into components
    address = location.address.split(', ')
    city = address[0]
    county = address[1] if len(address) > 2 else city  # if county is not available, use the city name
    state = address[2] if len(address) > 3 else city   # if state is not available, use the city name
    return state, county, city

def get_location(city_name):
    # Check if we've already selected a location for this city
    if city_name in location_memory:
        return location_memory[city_name]

    # If not, ask the user to select a location
    location_list = geolocator.geocode(city_name, exactly_one=False)
    if not location_list:
        # geocode() returns None when nothing matches
        raise ValueError(f"No geocoding results for {city_name!r}")
    for i, location in enumerate(location_list):
        print(f"{i}: {location}")

    selection = int(input(f"Please select a location for {city_name}: "))
    selected_location = location_list[selection]

    # Parse the location into state, county, and city
    state, county, city = parse_location(selected_location)

    # Remember this selection for next time
    location_memory[city_name] = (state, county, city)

    return state, county, city
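
To see what `parse_location` does without calling the geocoding service, the sketch below feeds it stand-in objects that only carry an `address` string. The address strings are made up in the comma-separated shape the function expects, and real geocoder results may contain more or different components. `parse_location` is repeated here only so the example runs on its own.

```python
from types import SimpleNamespace


def parse_location(location):
    # Same logic as the notebook's parse_location, copied for a standalone example
    address = location.address.split(', ')
    city = address[0]
    county = address[1] if len(address) > 2 else city
    state = address[2] if len(address) > 3 else city
    return state, county, city


# Stand-in objects with made-up addresses (no network access needed)
four_parts = SimpleNamespace(address='Berkeley, Alameda County, California, United States')
three_parts = SimpleNamespace(address='Cambridge, Massachusetts, United States')

print(parse_location(four_parts))   # ('California', 'Alameda County', 'Berkeley')
print(parse_location(three_parts))  # ('Cambridge', 'Massachusetts', 'Cambridge')
```

The second result shows a limitation of positional parsing: when a component is missing, the state name lands in the county slot, so the output is worth checking by eye.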

In [6]:
def rename_files(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.cvr'):
            # Extract details from the filename
            parts = re.split(r'\]_\[|\.', filename)  # Raw string: splits the filename at ']_[' and at every '.'

            # Here, we make some assumptions about how your current filenames are formatted
            # Keep only the first location field, in case the placeholder location was repeated
            location = parts[1].split(',_')[0]  # e.g., Cambridge
            date = parts[0].replace('[','')  # e.g., 2001-11-06
            office = parts[2].replace(']','')  # e.g., CityCouncil

            # Use the geolocator to get the fully qualified location
            state, county, city = get_location(location)

            # Construct the new filename
            # Same field order as the naming format described at the top: date, location, office
            new_filename = f'[{date}]_[{city},_{county},_{state}]_[{office}].cvr'

            # Rename the file
            os.rename(os.path.join(directory, filename), os.path.join(directory, new_filename))
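
Splitting on every "." means an office name that contains a dot would be cut short. One alternative, sketched below under the assumption that every renamed file has exactly three bracketed fields, is to match the whole name with a single anchored regular expression. `split_cvr_name` is a hypothetical helper.

```python
import re

# Three bracketed fields joined by '_' and followed by the .cvr extension
CVR_NAME = re.compile(r'^\[([^\]]*)\]_\[([^\]]*)\]_\[([^\]]*)\]\.cvr$')


def split_cvr_name(filename):
    """Return the three bracketed fields as a tuple, or None if the name does not match."""
    match = CVR_NAME.match(filename)
    return match.groups() if match else None


print(split_cvr_name('[2001-11-06]_[Cambridge]_[CityCouncil].cvr'))
# ('2001-11-06', 'Cambridge', 'CityCouncil')
print(split_cvr_name('Cambridge_11062001_CityCouncil.csv'))
# None
```

Returning `None` for unmatched names lets the caller skip or report them instead of failing with an `IndexError`.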

In [7]:
dir_path = os.path.join(base_dir, 'sequential')
rename_files(dir_path)

print(dir_path)

../cvr_records/raw/sequential
