# Machine Learning for Complete Intersection Calabi-Yau 3-folds

Harold Erbin and Riccardo Finotello

_Physics Department_

_Università degli Studi di Torino and I.N.F.N. - sezione di Torino_

_via Pietro Giuria 1, I-10125 Torino, Italy_

---

In this notebook we prepare the original dataset ([1]-[2]) and the favourable dataset ([3]) for the analysis: we download the tarballs, extract them and tidy the datasets.

- [1]: P. Candelas, A. M. Dale, C. A. Lutken and R. Schimmrigk, "Complete Intersection Calabi-Yau Manifolds", Nucl. Phys. B 298 (1988)
- [2]: P. S. Green, T. Hubsch and C. A. Lutken, "All Hodge Numbers of All Complete Intersection Calabi-Yau Manifolds", Class. Quant. Grav. 6 (1989)
- [3]: L. B. Anderson, X. Gao, J. Gray and S. J. Lee, "Fibrations in CICY Threefolds", JHEP 10 (2017), arXiv:1708.07907.

In [1]:
%load_ext autoreload
%autoreload 2

## Fetch the Datasets

In this section we download the tarballs and load the datasets.

In [2]:
import os
import tarfile
import urllib.request  # the `request` submodule must be imported explicitly
import pandas as pd

# create a directory to store the datasets
os.makedirs('./data', exist_ok=True)

# download the files
file_o_url = 'http://www.lpthe.jussieu.fr/~erbin/files/data/cicy3o_data.tar.gz'
file_f_url = 'http://www.lpthe.jussieu.fr/~erbin/files/data/cicy3f_data.tar.gz'

file_o_out = './data/cicy3o_data.tar.gz'
file_f_out = './data/cicy3f_data.tar.gz'

if not os.path.isfile(file_o_out):
    urllib.request.urlretrieve(file_o_url, file_o_out)

if not os.path.isfile(file_f_out):
    urllib.request.urlretrieve(file_f_url, file_f_out)
    
# open the tarballs
if os.path.isfile(file_o_out):
    with tarfile.open(file_o_out, 'r') as tar:
        tar.extract('cicy3o.h5', './data')
        
if os.path.isfile(file_f_out):
    with tarfile.open(file_f_out, 'r') as tar:
        tar.extract('cicy3f.h5', './data')
        
# load the datasets
df_o = pd.read_hdf('./data/cicy3o.h5')
df_f = pd.read_hdf('./data/cicy3f.h5')
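
The download-and-extract steps above are identical for the two tarballs, so they could be factored into a small helper. The following is only a sketch: the function name `fetch_member` and its arguments are hypothetical and not part of the notebook.

```python
import os
import tarfile
import urllib.request


def fetch_member(url: str, archive: str, member: str, out_dir: str) -> str:
    '''
    Download a tarball (only if it is not already on disk) and extract one member.

    Hypothetical helper, not part of the original notebook.
    '''
    os.makedirs(out_dir, exist_ok=True)

    # skip the download if the archive is already present
    if not os.path.isfile(archive):
        urllib.request.urlretrieve(url, archive)

    # mode 'r' lets tarfile detect the compression transparently
    with tarfile.open(archive, 'r') as tar:
        tar.extract(member, out_dir)

    return os.path.join(out_dir, member)


# possible usage with the names defined in the cell above (requires network access):
# path_o = fetch_member(file_o_url, file_o_out, 'cicy3o.h5', './data')
```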

The datasets consist of the following columns and dtypes:

In [3]:
df_o.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7890 entries, 1 to 7890
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   c2              7890 non-null   object 
 1   euler           7890 non-null   int16  
 2   h11             7890 non-null   int16  
 3   h21             7890 non-null   int16  
 4   matrix          7890 non-null   object 
 5   redun           7890 non-null   object 
 6   size            7890 non-null   object 
 7   num_cp          7890 non-null   int8   
 8   num_eqs         7890 non-null   int64  
 9   dim_cp          7890 non-null   object 
 10  min_dim_cp      7890 non-null   int64  
 11  max_dim_cp      7890 non-null   int64  
 12  mean_dim_cp     7890 non-null   float64
 13  median_dim_cp   7890 non-null   float64
 14  num_dim_cp      7890 non-null   object 
 15  num_cp_1        7890 non-null   int8   
 16  num_cp_2        7890 non-null   int8   
 17  num_cp_neq1     7890 non-null   i

In [4]:
df_f.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7890 entries, 1 to 7890
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   c2              7890 non-null   object 
 1   favour          7890 non-null   int64  
 2   h11             7890 non-null   int16  
 3   h21             7890 non-null   int16  
 4   isprod          7890 non-null   int64  
 5   kahlerpos       7890 non-null   int64  
 6   matrix          7890 non-null   object 
 7   size            7890 non-null   object 
 8   euler           7890 non-null   int16  
 9   num_cp          7890 non-null   int8   
 10  num_eqs         7890 non-null   int64  
 11  dim_cp          7890 non-null   object 
 12  min_dim_cp      7890 non-null   int64  
 13  max_dim_cp      7890 non-null   int64  
 14  mean_dim_cp     7890 non-null   float64
 15  median_dim_cp   7890 non-null   float64
 16  num_dim_cp      7890 non-null   object 
 17  num_cp_1        7890 non-null   i

## Tidy Datasets

To handle the datasets more easily, we first bring each _object_ column into a dense format by zero-padding its entries to a common shape; each entry will later become a separate variable (column):

In [5]:
import numpy as np

def extract_series(series: pd.Series) -> pd.Series:
    '''
    Extract a Pandas series into its dense format.
    
    Required arguments:
        series: the pandas series.
        
    Returns:
        the pandas series in dense format.
    '''
    # avoid direct overwriting
    series = series.copy()
    
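    # get the largest shape (tuples are compared lexicographically, so this is the
    # per-axis maximum only if a single entry attains the maximum along every axis)
    max_shape = series.apply(np.shape).max()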
    
    # return the transformed series
    if np.prod(max_shape) > 1:
        # compute the necessary shift and apply it
        offset = lambda s: [(0, max_shape[i] - np.shape(s)[i])
                            for i in range(len(max_shape))
                           ]
        return series.apply(lambda s: np.pad(s, offset(s), mode='constant'))
    else:
        return series

In [6]:
# apply it to all variables of the dataset 
df_o = df_o.apply(extract_series)
df_f = df_f.apply(extract_series)
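
The padding step can be illustrated on a toy series. The sketch below is a variant, not the notebook's function: the hypothetical helper `pad_to_common_shape` computes the maximum along each axis explicitly, and it assumes that all entries have the same number of axes.

```python
import numpy as np
import pandas as pd


def pad_to_common_shape(series: pd.Series) -> pd.Series:
    '''
    Zero-pad every entry of the series to the per-axis maximum shape.

    Hypothetical variant: assumes all entries have the same number of axes.
    '''
    # stack the shapes row by row and take the maximum along each axis
    shapes = np.array([np.shape(s) for s in series])
    max_shape = shapes.max(axis=0)

    def pad(s):
        # pad only at the end of each axis
        widths = [(0, int(m - n)) for m, n in zip(max_shape, np.shape(s))]
        return np.pad(s, widths, mode='constant')

    return series.apply(pad)


# toy example: no single entry has the largest size along both axes
toy = pd.Series([np.ones((2, 3)), np.ones((3, 2))], name='matrix')
padded = pad_to_common_shape(toy)
```

In this toy case both entries end up with shape `(3, 3)`, with zeros filling the added row or column.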

We then save the shape of all variables to be able to access them later:

In [7]:
df_shape_o = df_o.applymap(np.shape).apply(np.unique).to_dict(orient='records')[0]
df_shape_f = df_f.applymap(np.shape).apply(np.unique).to_dict(orient='records')[0]

In [8]:
import json

with open('./data/cicy3o_shapes.json', 'w') as o:
    json.dump(df_shape_o, o)

with open('./data/cicy3f_shapes.json', 'w') as f:
    json.dump(df_shape_f, f)

We then create a new dataframe in which each entry of the _object_ features is stored in its own column, so that the entries can be handled individually (we already know that all of them are numeric):

In [9]:
def explode_variables(series: pd.Series) -> pd.DataFrame:
    '''
    Take one variable and explode its components into separate columns.
    
    Required arguments:
        series: the variable to explode.
        
    Returns:
        a dataframe with one column per component (the series itself if its entries are scalars).
    '''
    # avoid direct overwriting
    series = series.copy()
    
    if series.apply(lambda x: np.prod(np.shape(x))).max() == 1:
        return series
    else:
        # flatten the array
        series = series.apply(lambda x: np.reshape(x, (-1,)))

        # explode over columns
        series = series.apply(pd.Series).rename(columns=lambda x: series.name + '_{}'.format(x+1))

        return series
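
As an illustration of the flatten-and-explode steps, here is a minimal sketch on a toy series. The data and the name `m` are made up for the example.

```python
import numpy as np
import pandas as pd

# toy variable: two 2x2 matrices stored as objects in one column
toy = pd.Series([np.array([[1, 2], [3, 4]]),
                 np.array([[5, 6], [7, 8]])],
                name='m')

# flatten each matrix in row-major order
flat = toy.apply(lambda x: np.reshape(x, (-1,)))

# one column per component, numbered from 1 as in the function above
wide = flat.apply(pd.Series).rename(columns=lambda c: f'{toy.name}_{c + 1}')
```

Here `wide` has the columns `m_1`, `m_2`, `m_3`, `m_4`, one row per original matrix.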

In [10]:
# choose the order of the columns
column_list = ['h11', 'h21', 'euler',
               'c2', 'size', 'isprod', 'favour',
               'num_cp', 'num_eqs',
               'dim_h0_amb',
               'dim_cp', 'num_dim_cp',
               'min_dim_cp', 'max_dim_cp', 'mean_dim_cp', 'median_dim_cp',
               'num_cp_1', 'num_cp_2', 'num_cp_neq1',
               'num_over', 'num_ex',
               'deg_eqs', 'num_deg_eqs',
               'min_deg_eqs', 'max_deg_eqs', 'mean_deg_eqs', 'median_deg_eqs',
               'rank_matrix', 'norm_matrix',
               'matrix' 
              ]

# create a new data frame and "left join" each new variable
df_o_tmp = pd.DataFrame(index=df_o.index)
for var in column_list:
    df_o_tmp = df_o_tmp.join(explode_variables(df_o[var]))
    
df_f_tmp = pd.DataFrame(index=df_f.index)
for var in column_list:
    df_f_tmp = df_f_tmp.join(explode_variables(df_f[var]))
    
# replace the old dataframes with the new ones
df_o = df_o_tmp
df_f = df_f_tmp

del df_o_tmp, df_f_tmp

Look for duplicates in rows (all variables must coincide):

In [11]:
df_o.duplicated().rename('dup_o').agg(['sum', 'mean'])

sum     0.0
mean    0.0
Name: dup_o, dtype: float64

In [12]:
df_f.duplicated().rename('dup_f').agg(['sum', 'mean'])

sum     0.0
mean    0.0
Name: dup_f, dtype: float64
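
For reference, this is how the check behaves on a toy dataframe (made-up values) that does contain a repeated row: `duplicated` flags every row equal to an earlier one across all columns, so the sum counts the duplicates and the mean gives their fraction.

```python
import pandas as pd

# toy dataframe: the second row repeats the first one
toy = pd.DataFrame({'h11': [1, 1, 3],
                    'h21': [2, 2, 4]})

dup = toy.duplicated().rename('dup_toy')  # False, True, False
stats = dup.agg(['sum', 'mean'])          # count and fraction of duplicated rows
```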

Save the new datasets to separate CSV files for later use:

In [13]:
df_o.to_csv('./data/cicy3o_tidy.csv', index=False)
df_f.to_csv('./data/cicy3f_tidy.csv', index=False)
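
When the tidy files are read back, the exploded columns can be reassembled into arrays using the saved shapes. The following is a sketch under assumptions: the toy dataframe, the helper `rebuild` and the layout of the `shapes` dictionary are hypothetical and may differ from the content of the JSON files written above.

```python
import numpy as np
import pandas as pd

# toy stand-in for a tidy dataset: a 2x3 'matrix' flattened into matrix_1 ... matrix_6
tidy = pd.DataFrame({'h11': [1, 2],
                     **{f'matrix_{i + 1}': [i, 10 + i] for i in range(6)}})

# assumed layout of the saved shapes (hypothetical)
shapes = {'matrix': [2, 3]}


def rebuild(df: pd.DataFrame, name: str, shape) -> np.ndarray:
    '''
    Stack the columns name_1, name_2, ... back into arrays of the given shape.

    Hypothetical helper: relies on the row-major flattening used when exploding.
    '''
    cols = [f'{name}_{i + 1}' for i in range(int(np.prod(shape)))]
    return df[cols].to_numpy().reshape(-1, *shape)


matrices = rebuild(tidy, 'matrix', shapes['matrix'])
```

The result has one array per row of the dataframe, here with overall shape `(2, 2, 3)`.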