# Algorithm X application to constituency data

Previously we found all groups of 2, 3 or 4 neighbouring constituencies, i.e. constituencies which share a border. We call each such group a set and give it a unique identifier `set_no`. We will now apply Algorithm X to these merged constituencies to find (a subset of) the solutions in which every constituency is selected once and only once. We shall do this on a region-by-region basis for two reasons:

1. It substantially reduces the number of possible combinations.
1. It also (mostly) ensures consistency of political parties: e.g. we would not merge one constituency in England with one in Wales, which could potentially halve the Plaid Cymru vote.
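
To make "selected once and only once" concrete, here is a minimal sketch of an exact-cover search in the spirit of Algorithm X. It is not the implementation in `algox_modules.py`: the function `exact_covers`, the toy constituencies `A` to `D` and the `pairs` dictionary are all hypothetical, and it uses plain Python sets rather than dancing links.

```python
def exact_covers(universe, subsets):
    """Yield every list of subset ids that covers each item of `universe` exactly once.

    `subsets` maps an id (here standing in for `set_no`) to a frozenset of items.
    """
    # For each item, record the ids of the subsets that contain it.
    containing = {item: set() for item in universe}
    for set_no, members in subsets.items():
        for item in members:
            containing[item].add(set_no)

    def search(uncovered, available, chosen):
        if not uncovered:
            # Every item is covered exactly once: this is a solution.
            yield sorted(chosen)
            return
        # Branch on the item with the fewest usable subsets (Knuth's heuristic).
        item = min(uncovered, key=lambda i: len(containing[i] & available))
        for set_no in sorted(containing[item] & available):
            members = subsets[set_no]
            # Any subset sharing an item with the chosen one can no longer be used.
            clash = {s for m in members for s in containing[m]}
            yield from search(uncovered - members, available - clash, chosen + [set_no])

    yield from search(frozenset(universe), frozenset(subsets), [])


# Toy region: four constituencies arranged in a ring, so each borders two others.
pairs = {
    1: frozenset({"A", "B"}),
    2: frozenset({"B", "C"}),
    3: frozenset({"C", "D"}),
    4: frozenset({"A", "D"}),
}
solutions = sorted(exact_covers(["A", "B", "C", "D"], pairs))
print(solutions)  # [[1, 3], [2, 4]]
```

Each solution is a list of set identifiers whose members partition the region, which is the shape of result we want for every real region.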

There are often times when the total number of constituencies in a region is not divisible by 2, 3 or 4. In these cases we remove sets of a different size until the remaining number is divisible. For example, the North East has 29 constituencies, so to find all solutions where we merge 2 constituencies we pick at random one of the sets where 3 constituencies have been merged and remove its constituencies from our initial analysis, leaving 26. We then repeat this, removing another of the 3-way merged sets, until we get a large enough sample.
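
The arithmetic behind this can be sketched as follows. The helper `sets_to_remove` is hypothetical (it is not taken from `algox_modules.py`) and only counts how many sets of another size would need to be removed; choosing which sets to remove is still done at random.

```python
def sets_to_remove(n_constituencies, seats, other_size):
    """Smallest number of `other_size` sets to remove so that the remaining
    constituencies divide evenly into groups of `seats` (None if impossible)."""
    # Remainders repeat with a period of at most `seats`, so `seats` tries suffice.
    for k in range(seats):
        remaining = n_constituencies - k * other_size
        if remaining < 0:
            break
        if remaining % seats == 0:
            return k
    return None


# North East: 29 constituencies.
print(sets_to_remove(29, seats=2, other_size=3))  # 1  (29 - 3 = 26, divisible by 2)
print(sets_to_remove(29, seats=3, other_size=2))  # 1  (29 - 2 = 27, divisible by 3)
print(sets_to_remove(29, seats=4, other_size=2))  # None (removing pairs keeps the total odd)
```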

For some regions and set sizes there is a large number of solutions, so we will only keep a subset of them. When this happens we rerun the analysis with the dataframe resampled, which can change the initial solutions found.
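
As a rough illustration of what resampling means here, the sketch below reorders a list of rows with a seeded shuffle. The `rows` list and the `reshuffled` helper are hypothetical stand-ins for a sets dataframe; with pandas, `df.sample(frac=1)` has the same effect of returning every row in a random order. How `get_solns` actually resamples is not shown in this notebook.

```python
import random

# Hypothetical stand-in for the rows of a sets dataframe: (set_no, members).
rows = [(1, ("A", "B")), (2, ("B", "C")), (3, ("C", "D")), (4, ("A", "D"))]


def reshuffled(rows, seed):
    """Return the same rows in a new pseudo-random order (seeded for reproducibility)."""
    return random.Random(seed).sample(rows, k=len(rows))


first_run = reshuffled(rows, seed=1)
second_run = reshuffled(rows, seed=2)

# The content is unchanged; only the order in which a search would visit the sets differs.
print(sorted(first_run) == sorted(rows))  # True
```

Because the search is stopped once enough solutions have been found, the order of the candidate sets influences which solutions are reached first.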

The (sampled) solutions will be saved as csv files.

All functions used are stored in the `algox_modules.py` file.


In [1]:
import numpy as np  # used below for np.unique
import pandas as pd
from joblib import Parallel, delayed
from algox_modules import *

In [2]:
const_pairs = pd.read_csv("../Analysis/Data/const_pairs.csv.gz")
const_tris = pd.read_csv("../Analysis/Data/const_tris.csv.gz")
const_quads = pd.read_csv("../Analysis/Data/const_quads.csv.gz")

In [3]:
regions = np.unique(const_pairs['region'])

In [4]:
# Set up the folder used to store logs and info during the run
import os
os.makedirs("Logs", exist_ok=True)

# Remove any files that were created in a previous run
import glob
def del_files(pattern):
    """Delete every file matching the glob pattern."""
    for f in glob.glob(pattern):
        os.remove(f)
del_files("Logs/log_*.log")
del_files("Solutions/solns_*.csv.gz")


In [5]:
# Command to run with joblib.
element_information = Parallel(n_jobs=5, verbose=10)(
    delayed(get_solns)(const_pairs, const_tris, const_quads, seats, region, max_solns=1e5) 
        for region in regions for seats in [2,3,4])

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   3 tasks      | elapsed: 20.1min
[Parallel(n_jobs=5)]: Done   8 tasks      | elapsed: 56.1min
[Parallel(n_jobs=5)]: Done  15 tasks      | elapsed: 179.7min
[Parallel(n_jobs=5)]: Done  22 tasks      | elapsed: 782.2min
[Parallel(n_jobs=5)]: Done  31 out of  36 | elapsed: 1505.9min remaining: 242.9min
[Parallel(n_jobs=5)]: Done  36 out of  36 | elapsed: 4486.0min finished


In [None]:
# The above code took about 3 days to run. It might have been advantageous to run the longest-running jobs first,
# so the above code could have been written as:
# regions = ["London","South East","West Midlands","North West","Scotland","East","Yorkshire and The Humber","South West","East Midlands","Wales","North East","Northern Ireland"]
# element_information = Parallel(n_jobs=5, verbose=10)(
#     delayed(get_solns)(const_pairs, const_tris, const_quads, seats, region, max_solns=1e5)
#         for seats in [4,3,2] for region in regions)

In [None]:
# Now we need to remove any duplicates that may have occurred.
# In addition, only keep a maximum of 25,000 solutions per file.
import glob
from ast import literal_eval
sampled_solutions = 25000
if not os.path.isdir("../Analysis/Data/SampledSolutions/"):
    os.makedirs("../Analysis/Data/SampledSolutions/")
files = glob.glob("Solutions/solns_*.csv.gz")
for file in files:
    df = pd.read_csv(file, dtype={'region': str}, converters={'soln': literal_eval})
    df2 = pd.DataFrame(df['soln'].tolist())
    df3 = df2.drop_duplicates()
    if df2.shape[0] != df3.shape[0]:
        df = df[df.index.isin(df3.index)].reset_index(drop=True)
    # Save a sample of 'sampled_solutions' (if the number of solutions is bigger than that)
    file_name = file.replace("Solutions/solns_", "../Analysis/Data/SampledSolutions/sampled_solns_")
    if df.shape[0] > sampled_solutions:
        df.sample(sampled_solutions).to_csv(file_name, index=False)
    else:
        df.to_csv(file_name, index=False)