## Welcome!

Firstly, congratulations on advancing this far in the hiring process! The fact that you are reading this means that E&N has recognized your potential and is excited to test your skills further. We hope you find this task both challenging and enjoyable. Please feel free to share any feedback during your upcoming skills interview.

### Task Overview

This task is designed to test your problem-solving abilities and your familiarity with computational notebooks. While it is not intended to be easy, we are not expecting perfection. We value clear, concise documentation and a genuine attempt to produce functional code.

**Background:** In the future, when a data consumer (a developer or customer) comes to The Graph and pays for data to be indexed, a gateway will need to allocate a group of indexers (service providers) to index their subgraph. What follows is written from the point of view of a gateway operator (such as Edge & Node). Each indexer is given a score that shows how suitable we think they are to serve data on each chain. Once every indexer on a chain has a score, we use those scores to set each indexer's probability of being selected for an indexing agreement on that chain. Under an indexing agreement, the indexer provides indexing services and we pay the indexer for them (the customer pays us and we act as an intermediary). An indexer's aim is to receive as many indexing agreements as possible in order to maximise the income from their indexing activities. Indexers that provide higher quality of service (QoS) at lower prices increase their chances of receiving indexing agreements.

A single indexing agreement contains information such as the price the indexer will be paid and what data they have to make available to be queried. We can award contracts, which specify the portion of data that has to be made available, to several indexers instead of a single indexer. By giving an indexing agreement to a group of indexers instead of only one indexer, we aim to provide customers (developers/data consumers) with a higher quality of service. The rationale is that if a single indexer in the group were to have issues, the other indexers in the group could continue to operate. Awarding indexing agreements to a group should help to improve quality of service, reduce query latency, increase data uptime and increase query success rate, among other things that a consumer may find valuable. However, the QoS that a customer receives is a function of the indexers included in the group the customer is assigned to. Your task is to figure out how to group indexers together to give customers the best QoS.

In this task we have provided you with two CSV files containing an assortment of (simulated) indexers, where those indexers are located, which VPS provider each indexer is using, and each indexer's current scores for a variety of metrics. We have also already computed how many indexing agreements each indexer should receive, based on their scores, for the next batch of 10,000 indexing agreements. To test your Python skills, we want you to arrange indexers into groups of 3. The catch is that we have lost the data that gives each indexer an overall score, so we'll also need your help to reassign an overall score to each indexer. This will help you to work out the aggregate overall score of each group you assemble. Fortunately, we kept some of the information that we used to construct the indexers' overall scores. For example, in 'Example_scores_df_skills_take_home.csv' you'll find each indexer's stake_score, uptime_score, success_score and coeff_score.

**stake_score** is a measure of how much GRT the indexer is staking relative to the query fees they're collecting, scaled between 0 and 1, where a higher score indicates the indexer is providing a high level of economic security relative to the revenue they are generating from serving queries.

**uptime_score** is a measure of how often an indexer responds to a query from our gateway. It's scaled linearly between 0 and 1, where a score of 1 represents 100% uptime and a score of 0 represents 97% or less uptime. Therefore a score of 0.5 represents 98.5% uptime. 

**success_score** is a measure of how often an indexer *successfully* responds to a query from our gateway. It's scaled linearly between 0 and 1, where a score of 1 represents a 100% success rate and a score of 0 represents a success rate of 97% or less. Therefore a score of 0.5 represents a 98.5% success rate.
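The linear scaling described for uptime_score and success_score can be written out explicitly. The sketch below is only an illustration: the function names `score_to_percent` and `percent_to_score` are hypothetical and are not part of the provided notebook, and the 97% floor is taken from the descriptions above.

```python
def score_to_percent(score: float, floor: float = 97.0, ceiling: float = 100.0) -> float:
    """Map a score in [0, 1] back to the percentage it represents.

    A score of 0 only tells us the percentage was `floor` or lower,
    so the returned value is an upper bound in that case.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return floor + score * (ceiling - floor)


def percent_to_score(percent: float, floor: float = 97.0, ceiling: float = 100.0) -> float:
    """Map a percentage to a score in [0, 1], clipping anything below `floor` to 0."""
    clipped = min(max(percent, floor), ceiling)
    return (clipped - floor) / (ceiling - floor)


half_score_percent = score_to_percent(0.5)   # midpoint of the 97-100 range
low_uptime_score = percent_to_score(95.0)    # below the floor, so clipped
```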

**coeff_score** is a measure of how fast indexers are at responding to queries - this score was generated from a linear regression. It's normalised between 0 and 1 where a score of 1 means the indexer is the fastest at responding to queries and a score of 0 means the indexer is the slowest at responding to queries. Customers want low latency when they send a query to the network, because their applications may rely on them providing their users with accurate and timely data. 

**overall_score** is an aggregate score of the prior scores, but unfortunately we lost the data! Oops!
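One possible way to rebuild the lost score is sketched below. This is not the original formula, which we no longer have: it assumes an equally weighted mean of the four component scores and fills missing values with the column median. Both choices, the function name `rebuild_overall_score` and the toy data are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

SCORE_COLUMNS = ["stake_score", "uptime_score", "success_score", "coeff_score"]


def rebuild_overall_score(scores: pd.DataFrame, weights=None) -> pd.Series:
    """Weighted mean of the component scores (assumed aggregation, not the lost one)."""
    if weights is None:
        weights = {col: 1.0 for col in SCORE_COLUMNS}  # assumption: equal weights
    w = pd.Series(weights, dtype=float)[SCORE_COLUMNS]
    components = scores[SCORE_COLUMNS]
    # Assumption: a missing component is replaced by that column's median.
    components = components.fillna(components.median())
    return (components * w).sum(axis=1) / w.sum()


# Made-up rows, only to show the call; the second row has a missing stake_score.
toy_scores = pd.DataFrame({
    "stake_score": [0.8, np.nan, 0.4],
    "uptime_score": [1.0, 0.5, 0.0],
    "success_score": [1.0, 0.5, 0.0],
    "coeff_score": [0.8, 1.0, 0.6],
})
toy_scores["overall_score"] = rebuild_overall_score(toy_scores)
```

Different weights can be passed as a dictionary keyed by column name, for example to emphasise latency over stake. Whatever you choose, document why.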

The groups you create (in this hypothetical example) will be paid to index data for future customers, with each group having a chance to be awarded indexing agreements. When forming these groups, you should consider decentralisation and the overall quality of service (QoS) each group can provide. Another crucial factor is that the number of agreements allocated to the groups in which an indexer appears should sum to the total number of agreements that indexer has been awarded for the batch. This ensures that, as groups are randomly selected to serve customers' data, the total number of agreements each indexer fulfils equilibrates to their targeted number over time. If you do decide to deviate from this requirement, make sure to document your logic, as we are keen to understand your thought process.
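The agreement-count requirement above can be checked mechanically. The following sketch assumes allocations are stored as a dictionary mapping a group (a tuple of indexer ids) to a number of agreements. That representation, the helper names and the toy values are all hypothetical.

```python
from collections import Counter


def agreements_per_indexer(group_allocations):
    """Sum, for every indexer, the agreements of all groups it appears in."""
    totals = Counter()
    for group, n_agreements in group_allocations.items():
        for indexer in group:
            totals[indexer] += n_agreements
    return dict(totals)


def allocation_gaps(group_allocations, targets):
    """Return {indexer: allocated - target} for indexers that miss their target."""
    totals = agreements_per_indexer(group_allocations)
    return {
        indexer: totals.get(indexer, 0) - target
        for indexer, target in targets.items()
        if totals.get(indexer, 0) != target
    }


# Made-up example with four indexers and two groups of 3.
toy_targets = {"a": 5, "b": 5, "c": 3, "d": 2}
toy_allocations = {("a", "b", "c"): 3, ("a", "b", "d"): 2}
toy_gaps = allocation_gaps(toy_allocations, toy_targets)  # empty dict: every target is met
```

A non-empty result points at the indexers whose groups over- or under-allocate, which is a convenient place to start documenting any deliberate deviation.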

You should also think about what you are optimising for when creating the groups, given the data you have available, and about efficient ways to complete this task without spending too much time on it. If you want to focus on the quality of your submission, you are free to spend more time wherever you think is appropriate. Your submission will be compared against those of other candidates, so you should aim to do your best work, but don't stress about it: this is far from the only selection criterion for this role. Let us know if you have any problems with this assignment in your upcoming skills interview. We are looking forward to meeting you!


### Key Points to Focus On

1. **Optimisation:** Aim to optimise the indexer groups you create. Randomly allocating indexers into groups of 3 will not score many points.
2. **Documentation:** Ensure your code and your logic are well documented where appropriate, making them easy to understand and follow.
3. **Plots:** Attempt to demonstrate your work visually with graphs where appropriate. How you choose to do this is left for you to decide. You are not required to use Python.
4. **Efficiency:** If possible, try to optimise the speed of your code execution. Sound analysis matters more than efficient code, but efficient code is also beneficial.

You will notice that a start to the assignment has already been made. You are free to continue by adding to the existing code or to start fresh with your own code. If you are having any trouble, or would like further clarity on anything, you can send as many emails as you like to samuel@edgeandnode.com, pablo@edgeandnode.com & rem@edgeandnode.com.

We understand that candidates might have different time availability for take home challenges - we hope you can dedicate between 2 and 4 hours to solving this, and that you can present your solution in a week or two; if this will not work for you please let us know and we'll do our best to accommodate.

We look forward to seeing your approach and solutions. Good luck!

Please direct your submission to [samuel@edgeandnode.com](mailto:samuel@edgeandnode.com) (@MoonBoi9001 on GitHub), [pablo@edgeandnode.com](mailto:pablo@edgeandnode.com) (@pcarranzav) & [rem@edgeandnode.com](mailto:rem@edgeandnode.com) (@RembrandK). You can create a private fork of this repository on GitHub, and add us as collaborators, or send us a compressed file with your solution.

#### Import relevant packages

In [16]:
# Imports
import pandas as pd
import numpy as np
from itertools import combinations

#### Load the example data you have been given for this assignment

In [38]:
indexer_df = pd.read_csv("Example_indexer_df_skills_take_home.csv")
scores_df = pd.read_csv("Example_scores_df_skills_take_home.csv")

In [39]:
# Display the first 4 and last 2 rows of the indexer_df
pd.concat([
    indexer_df.head(4),
    pd.DataFrame(['...'] * len(indexer_df.columns)).transpose().set_axis(indexer_df.columns, axis=1),
    indexer_df.tail(2)
])

Unnamed: 0,indexer,indexer_vps_provider,indexer_location,indexing_agreements
0,0xbcfd8fadabb6cffba8fdfdd3aac9f478d82eacc2,AS24940 Hetzner Online GmbH,6020,262
1,0x1f7a67bbdea486fd31c62ef60de15adc5d9beffb,AS3170 VeloxServ Communications Ltd,600,247
2,0xa7ff5a5dedbdbadd9e60797c2e07fbcedc8ef34f,AS24940 Hetzner Online GmbH,6020,242
3,0xa9aae95abfbbbeec6fee421684ddcbdfcfa8dcfa,"AS13335 Cloudflare, Inc.","40,-120",234
0,...,...,...,...
72,0x31cdacf1fcdcc4454e89b0cc1ec83de86bdb81f9,AS37153 Xneelo (Pty) Ltd,-2020,7
73,0x110bc4d10bd862aed3bf6c2ebd9effd29c7f1ef8,"AS13335 Cloudflare, Inc.","40,-120",6


In [40]:
# Display the first 4 and last 2 rows of the scores_df
pd.concat([
    scores_df.head(4),
    pd.DataFrame(['...'] * len(scores_df.columns)).transpose().set_axis(scores_df.columns, axis=1),
    scores_df.tail(2)
])

Unnamed: 0,indexer,stake_score,uptime_score,success_score,coeff_score,overall_score
0,0xbcfd8fadabb6cffba8fdfdd3aac9f478d82eacc2,0.848469,1.0,1.0,0.88395,
1,0x1f7a67bbdea486fd31c62ef60de15adc5d9beffb,,0.999667,0.999722,0.870472,
2,0xa7ff5a5dedbdbadd9e60797c2e07fbcedc8ef34f,0.753762,0.999667,0.999696,0.892237,
3,0xa9aae95abfbbbeec6fee421684ddcbdfcfa8dcfa,,0.964,0.903438,1.0,
0,...,...,...,...,...,...
72,0x31cdacf1fcdcc4454e89b0cc1ec83de86bdb81f9,0.532803,0.415333,0.415205,0.845812,
73,0x110bc4d10bd862aed3bf6c2ebd9effd29c7f1ef8,0.623717,0.0,0.0,0.886796,


#### 1. Create a new df containing combinations of indexers from the indexer_df, give each combination an overall_score that represents the performance of the group.
#### 2. Add any further columns to this df that might be useful to determine how suitable the group is.

In [41]:
# Create dictionaries for quick lookups, keyed by indexer address
provider_dict = indexer_df.set_index('indexer')['indexer_vps_provider'].to_dict()
loc_dict = indexer_df.set_index('indexer')['indexer_location'].to_dict()
# Note: overall_score is empty in the provided CSV, so these lookups return NaN
# until the score has been rebuilt.
score_dict = scores_df.set_index('indexer')['overall_score'].to_dict()

# Calculate all possible groups of 3 indexers and their scores (sum of the members' scores).
groups = list(combinations(indexer_df['indexer'], 3))
group_scores = [score_dict[x] + score_dict[y] + score_dict[z] for x, y, z in groups]

In [42]:
group_df = pd.DataFrame({
    "group": groups,
    "overall_score": group_scores
})

# Filter out groups with scores that are too low. We already know that we won't be using them.
# In this case, any group whose score is at or below the 1st percentile of group scores is removed.

threshold_score = np.percentile(group_df['overall_score'], 1)
group_df = group_df[group_df['overall_score'] > threshold_score].reset_index(drop=True)

In [43]:
group_df

Unnamed: 0,group,overall_score
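As one idea for the "further columns" in step 2, the sketch below counts how many distinct VPS providers and locations each group spans, as a rough proxy for decentralisation. The function name `add_diversity_columns`, the column names and the toy data are hypothetical. In the notebook, the lookups would be `provider_dict` and `loc_dict`.

```python
import pandas as pd


def add_diversity_columns(group_df, provider_by_indexer, location_by_indexer):
    """Return a copy of group_df with simple per-group diversity counts."""
    out = group_df.copy()
    out["n_providers"] = [
        len({provider_by_indexer[ix] for ix in group}) for group in out["group"]
    ]
    out["n_locations"] = [
        len({location_by_indexer[ix] for ix in group}) for group in out["group"]
    ]
    # A group is "fully diverse" when no two members share a provider or a location.
    group_size = out["group"].map(len)
    out["fully_diverse"] = (out["n_providers"] == group_size) & (out["n_locations"] == group_size)
    return out


# Made-up indexers, providers and locations, only to show the call.
toy_groups = pd.DataFrame({"group": [("a", "b", "c"), ("a", "b", "d"), ("b", "c", "d")]})
toy_providers = {"a": "P1", "b": "P1", "c": "P2", "d": "P3"}
toy_locations = {"a": "L1", "b": "L2", "c": "L3", "d": "L1"}
toy_result = add_diversity_columns(toy_groups, toy_providers, toy_locations)
```

Whether these counts should act as a hard filter or feed into the group score as a bonus is a design decision worth documenting.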


#### Using your df above that contains combinations of indexers, their scores and their suitability, figure out which groups should receive the indexing agreements and how many agreements each should receive.

#### You may want to formulate a linear programming problem, however you are free to tackle this however you see fit. 
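A minimal sketch of such a linear programme is given below, using `scipy.optimize.linprog`. It makes several assumptions. There is one non-negative variable per group (the number of agreements it receives). The objective is the score-weighted total. Each indexer's target is enforced as an equality constraint. The solution is continuous, so it would still need rounding to whole agreements. The function name `allocate_agreements`, the toy indexers and the diversity bonus are all hypothetical.

One thing to keep in mind: if a group's score is just the sum of its members' scores, the equality constraints make the objective identical for every feasible allocation. The toy score therefore includes a non-additive term.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog


def allocate_agreements(groups, group_scores, targets):
    """Maximise sum(score_g * x_g) subject to each indexer's agreement target."""
    indexers = list(targets)
    row = {ix: r for r, ix in enumerate(indexers)}
    # A_eq[i, g] is 1 when indexer i is a member of group g.
    A_eq = np.zeros((len(indexers), len(groups)))
    for col, group in enumerate(groups):
        for ix in group:
            A_eq[row[ix], col] = 1.0
    b_eq = np.array([targets[ix] for ix in indexers], dtype=float)
    # linprog minimises, so negate the scores to maximise.
    c = -np.asarray(group_scores, dtype=float)
    result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not result.success:
        raise ValueError(result.message)
    return result.x


# Made-up data: five indexers, their scores, providers and agreement targets.
indexer_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
providers = {"a": "P1", "b": "P1", "c": "P2", "d": "P3", "e": "P2"}
targets = {"a": 3, "b": 3, "c": 4, "d": 3, "e": 2}

groups = list(combinations(sorted(targets), 3))
# Assumed group score: member scores plus a bonus per distinct provider.
group_scores = [
    sum(indexer_scores[ix] for ix in g) + 0.5 * len({providers[ix] for ix in g})
    for g in groups
]
allocation = allocate_agreements(groups, group_scores, targets)
```

With the full set of candidate groups the constraint matrix becomes large but very sparse, so pre-filtering groups or using sparse matrices may be worth considering.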