## Dataset Generation for Section 4 Experiments

This notebook generates the datasets for the experiments in **Section 4**, which predict $a_p$ from the sequence $(a_q)_{q \ne p,\, q < 100}$, where $q$ ranges over primes. The experiments cover the primes $p \in \{2, 3, 97\}$.

### Data Source

The required data is loaded from the file [`ECQ6ap_1e2.csv`](https://zenodo.org/records/15832317), which contains:

- The sequence $(a_q(E))_{q < 100}$ for various elliptic curves $E$;
- The conductor $N(E)$;
- The rank $r(E)$.

### Data Preparation

The file `ECQ6ap_1e2.csv` is derived from isogeny classes of elliptic curves with conductor $N(E) < 10^6$. These were selected from a larger dataset of curves with conductor $N(E) < 10^8$, available at https://zenodo.org/records/14847809.

One curve (row **2268768**) was removed from the dataset because its rank was not specified in the original source.
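
The selection described above can be sketched on a toy table. This is only an illustration under assumptions: the rows are made up, and the actual preprocessing that produced `ECQ6ap_1e2.csv` is not shown in this notebook. Only the column names `conductor` and `rank` are taken from the code below.

```python
import pandas as pd

# Toy stand-in for the larger source table; all values are made up.
source = pd.DataFrame({
    "conductor": [11, 14, 37, 2_000_003],
    "rank": [0, 0, None, 1],
    "a_2": [-2, 0, -2, 1],
})

# Keep conductors below 10^6, then drop rows whose rank is missing.
subset = source[source["conductor"] < 10**6].dropna(subset=["rank"])
print(subset)
```

Here the last row is excluded by the conductor bound and the third row by the missing rank, leaving two rows.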


In [9]:
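# Imports and data loading
import pandas as pd

# Prime p whose coefficient a_p is the prediction target (the experiments use p = 2, 3 and 97)
p = 3
df = pd.read_csv('ECQ6ap_1e2.csv')
# Shuffle the rows reproducibly so that the later head/tail split is random
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Keep only curves with good reduction at p, i.e. conductor not divisible by p
df = df[df['conductor'] % p != 0]
# Conductor and rank are not model inputs; drop() returns a new frame instead of editing a filtered slice in place
df = df.drop(columns=['conductor', 'rank'])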

df

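| | a_2 | a_3 | a_5 | a_7 | a_11 | a_13 | a_17 | a_19 | a_23 | a_29 | ... | a_53 | a_59 | a_61 | a_67 | a_71 | a_73 | a_79 | a_83 | a_89 | a_97 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -2 | -3 | 1 | 5 | -7 | 2 | 0 | -4 | 8 | ... | -1 | -1 | 10 | 1 | 3 | 12 | 2 | 14 | 3 | -7 |
| 2 | 1 | -1 | 0 | 3 | -3 | 0 | 3 | -1 | -4 | -1 | ... | -8 | -3 | -6 | -4 | -3 | -4 | -1 | 6 | 7 | -8 |
| 3 | 0 | -2 | 1 | -4 | 0 | -1 | -2 | 6 | 6 | 2 | ... | -2 | 6 | 2 | -4 | -6 | 6 | -12 | 16 | 2 | -2 |
| 10 | 0 | 0 | 1 | -2 | 1 | 6 | -6 | 4 | -6 | -1 | ... | 4 | 12 | 2 | 4 | 0 | 16 | 4 | 8 | 14 | -16 |
| 11 | 1 | -2 | 0 | -1 | 1 | -1 | 2 | 6 | -8 | 0 | ... | 6 | 3 | 10 | 3 | 3 | 8 | -3 | -4 | 2 | 18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3607199 | 0 | 0 | -1 | 0 | 4 | -2 | 2 | -4 | 8 | -6 | ... | -6 | -12 | 10 | 8 | 0 | 2 | 0 | -12 | 2 | 10 |
| 3607205 | 1 | -1 | 0 | 0 | -2 | 3 | -3 | -2 | 8 | 8 | ... | -6 | 3 | -2 | 16 | 12 | 10 | 11 | -14 | -14 | -10 |
| 3607206 | -1 | 0 | -1 | 2 | 0 | -1 | 7 | 5 | 4 | -4 | ... | -2 | 0 | 13 | -2 | -16 | -2 | -5 | 0 | -11 | -6 |
| 3607208 | -1 | 2 | 2 | 2 | -4 | 1 | -4 | -7 | -5 | 6 | ... | -6 | -11 | -6 | 4 | 4 | -11 | 0 | -9 | -6 | 9 |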
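
The gaps in the row index above come from the good-reduction filter: a curve has bad reduction at $p$ exactly when $p$ divides its conductor, so testing `conductor % p != 0` keeps the curves with good reduction. A minimal sketch on made-up rows (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

p = 3
# Made-up conductors; 15 and 99 are divisible by 3 and should be filtered out.
toy = pd.DataFrame({"conductor": [11, 15, 37, 99], "a_3": [-1, -1, -3, 0]})

# Boolean mask: True where p does not divide the conductor.
good = toy[toy["conductor"] % p != 0]
print(good)
```

The surviving rows keep their original index labels (0 and 2 here), which is why the index in the displayed frame is no longer consecutive.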


In [10]:
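# Encoding functions

# Primes below 100, in the order of the a_q columns
ps = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]


def encode_integer(val, base=1000, digit_sep=" "):
    """Encode an integer as a sign token followed by its base-`base` digits, most significant first."""
    if val == 0:
        return '+' + digit_sep + '0'
    sgn = '+' if val >= 0 else '-'
    val = abs(val)
    r = []
    # Collect digits from least to most significant
    while val > 0:
        r.append(str(val % base))
        val = val // base
    r.append(sgn)
    # Reverse so the sign comes first, followed by the most significant digit
    r.reverse()
    return digit_sep.join(r)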
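
To make the token format concrete, the sketch below repeats the function from the cell above and applies it to a few sample values. The sample inputs are chosen for illustration. Since $|a_q| \le 2\sqrt{q} < 20$ for $q < 100$, every coefficient in this dataset encodes to a sign token plus a single digit token; the multi-digit cases only show how the function behaves in general.

```python
def encode_integer(val, base=1000, digit_sep=" "):
    # Same logic as the notebook cell: sign token, then base-1000 digits.
    if val == 0:
        return '+ 0'
    sgn = '+' if val >= 0 else '-'
    val = abs(val)
    r = []
    while val > 0:
        r.append(str(val % base))
        val = val // base
    r.append(sgn)
    r.reverse()
    return digit_sep.join(r)


print(encode_integer(0))      # + 0
print(encode_integer(-7))     # - 7
print(encode_integer(1234))   # + 1 234
print(encode_integer(-1005))  # - 1 5  (digit tokens are not zero-padded)
```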
    

In [11]:
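# Encode data
# Smallest possible a_p for each target prime (Hasse bound: |a_p| <= 2*sqrt(p)),
# used to shift a_p to a non-negative value
min_vals = {2: -2, 3: -3, 97: -19}
for q in ps:
    col = 'a_' + str(q)
    if q != p:
        # Input columns become token strings
        print(f"Encoding columns for prime {q}")
        df[col] = df[col].apply(encode_integer)
    else:
        # Target column becomes an integer in 0 .. 2*|min_vals[p]|
        df[col] = df[col] - min_vals[p]
# Prefix token recording the number of input coefficients (24 of the 25 primes below 100)
df["data_type"] = "V" + str(len(ps) - 1)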

Encoding columns for prime 2
Encoding columns for prime 5
Encoding columns for prime 7
Encoding columns for prime 11
Encoding columns for prime 13
Encoding columns for prime 17
Encoding columns for prime 19
Encoding columns for prime 23
Encoding columns for prime 29
Encoding columns for prime 31
Encoding columns for prime 37
Encoding columns for prime 41
Encoding columns for prime 43
Encoding columns for prime 47
Encoding columns for prime 53
Encoding columns for prime 59
Encoding columns for prime 61
Encoding columns for prime 67
Encoding columns for prime 71
Encoding columns for prime 73
Encoding columns for prime 79
Encoding columns for prime 83
Encoding columns for prime 89
Encoding columns for prime 97
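
The offsets in `min_vals` agree with the Hasse bound $|a_p| \le 2\sqrt{p}$ for curves with good reduction at $p$, so subtracting `min_vals[p]` maps $a_p$ to a non-negative integer. The check below is a small sketch: `hasse_bound` is a helper name introduced here, and reading the shifted value as a class label is an assumption about how the output is used downstream.

```python
import math


def hasse_bound(p):
    """Largest integer B with B <= 2*sqrt(p), computed exactly as isqrt(4p)."""
    return math.isqrt(4 * p)


min_vals = {2: -2, 3: -3, 97: -19}  # copied from the notebook cell

for p, lo in min_vals.items():
    bound = hasse_bound(p)
    # a_p ranges over -bound .. bound, so the shifted value lies in 0 .. 2*bound
    num_values = 2 * bound + 1
    print(p, -bound == lo, num_values)
```

For $p = 2, 3, 97$ this gives 5, 7 and 39 possible shifted values respectively.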


In [4]:
# Separate training and test sets: the last 10,000 rows of the shuffled frame form the test set
df_train = df.head(df.shape[0] - 10000)
df_test = df.tail(10000)

In [5]:
# Column order: data_type prefix, then the inputs a_q (q != p), then the target a_p
cols = ['data_type'] + ['a_' + str(q) for q in ps if q != p] + ['a_' + str(p)]
df_train1 = df_train[cols]
df_test1 = df_test[cols]

In [None]:
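# Generate training dataset: one line per curve, input tokens and target separated by a tab
dftoint_train = pd.DataFrame()
# Join every column except the last (the target) into one space-separated string
dftoint_train['input'] = df_train1.iloc[:, :-1].agg(' '.join, axis=1)
dftoint_train['output'] = df_train1['a_' + str(p)]
dftoint_train.to_csv("ap_to_a" + str(p) + "_train.txt", sep='\t', index=False, header=False)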

In [None]:
# Generate test dataset: same format as the training file
dftoint_test = pd.DataFrame()
# Join every column except the last (the target) into one space-separated string
dftoint_test['input'] = df_test1.iloc[:, :-1].agg(' '.join, axis=1)
dftoint_test['output'] = df_test1['a_' + str(p)]
dftoint_test.to_csv("ap_to_a" + str(p) + "_test.txt", sep='\t', index=False, header=False)
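
To show what a line of the generated files looks like, the sketch below runs the same steps on a two-row toy frame and writes to an in-memory buffer instead of a file. The prime list is shortened to four primes and the coefficient values are made up, so the prefix is `V3` rather than `V24`.

```python
import io
import pandas as pd


def encode_integer(val, base=1000, digit_sep=" "):
    # Same logic as the notebook's encoder.
    if val == 0:
        return '+ 0'
    sgn = '+' if val >= 0 else '-'
    val = abs(val)
    r = []
    while val > 0:
        r.append(str(val % base))
        val = val // base
    r.append(sgn)
    r.reverse()
    return digit_sep.join(r)


ps = [2, 3, 5, 7]  # shortened prime list, for illustration only
p = 3
min_val = -3       # min_vals[3] in the notebook

# Made-up coefficients for two curves
toy = pd.DataFrame({"a_2": [0, 1], "a_3": [-2, 3], "a_5": [-3, 0], "a_7": [1, -4]})
for q in ps:
    col = f"a_{q}"
    if q != p:
        toy[col] = toy[col].apply(encode_integer)
    else:
        toy[col] = toy[col] - min_val
toy["data_type"] = "V" + str(len(ps) - 1)

out = pd.DataFrame()
out["input"] = toy[["data_type"] + [f"a_{q}" for q in ps if q != p]].agg(" ".join, axis=1)
out["output"] = toy[f"a_{p}"]

# Write to a string buffer with the same to_csv options as the notebook
buffer = io.StringIO()
out.to_csv(buffer, sep="\t", index=False, header=False)
lines = buffer.getvalue().splitlines()
for line in lines:
    print(repr(line))
# 'V3 + 0 - 3 + 1\t1'
# 'V3 + 1 + 0 - 4\t6'
```

Each line holds the prefix token, the sign and digit tokens of every input coefficient, a tab, and the shifted target value.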