<h1 style="color:#fc3f51"> Google AI4Code: The Results of My First Experiments </h1>

Most of this content comes from the [Getting Started with AI4Code](https://www.kaggle.com/code/ryanholbrook/getting-started-with-ai4code) notebook.


My contributions:

- New parameters for TfidfVectorizer (in Cell 11)
- Better parameters for the XGBRanker model (in Cell 14)

I also tried Optuna to search for better parameters, but it is not used in this notebook (the code is left commented out).

---

# Welcome to the Google AI4Code Competition! #

In this competition you're challenged to reconstruct the order of Kaggle notebooks whose cells have been shuffled. Check out the [Competition Pages](https://www.kaggle.com/competitions/AI4Code/overview) for a complete overview.

This notebook will walk you through making a submission with a simple ranking model. We'll look at how to:
- Wrangle the competition data and create validation splits,
- Represent the code cell orders with a feature,
- Build a ranking model with XGBoost,
- Evaluate predictions with a Python implementation of the competition metric, and,
- Format predictions to make a successful submission.

Our model will be able to learn roughly where a cell should go in a notebook based on what words it contains -- that, for example, cells containing "Introduction" or `import` should usually be near the beginning, while cells containing "Submit" or `submission.csv` should usually be near the end. These simple features are effective at reconstructing the global order of typical data science workflows. An understanding of the *interactions* or *relationships between cells*, however, will be required of the most successful solutions. We encourage you therefore to explore things like modern neural network language models for learning the relationships between natural language and computer code.

# Setup #

In [1]:
import json
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import sparse
from tqdm import tqdm

pd.options.display.width = 180
pd.options.display.max_colwidth = 120

data_dir = Path('../input/AI4Code')

# Load Data #

The notebooks are stored as individual JSON files. They've been cleaned of the usual metadata present in Jupyter notebooks, leaving only the `cell_type` and `source`. The [Data](https://www.kaggle.com/competitions/AI4Code/data) page on the competition website has the full documentation of this dataset.

We'll load the notebooks here and join them into a dataframe for easier processing. The full set of training data takes quite a while to load, so we'll just use a subset for this demonstration.

In [2]:
NUM_TRAIN = 10000


def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )


paths_train = list((data_dir / 'train').glob('*.json'))[:NUM_TRAIN]
notebooks_train = [
    read_notebook(path) for path in tqdm(paths_train, desc='Train NBs')
]
df = (
    pd.concat(notebooks_train)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

df

Train NBs: 100%|██████████| 10000/10000 [01:28<00:00, 112.79it/s]


id,cell_id,cell_type,source
000c0a9b2fef4d,1087237d,code,# Data manipulation\nimport pandas as pd\nimport numpy as np\n\n# Data visualization\nimport matplotlib.pyplot as pl...
000c0a9b2fef4d,d7209f1f,code,fifa_raw_dataset = pd.read_csv('../input/data.csv')
000c0a9b2fef4d,daf5b8ee,code,fifa_raw_dataset.head()
000c0a9b2fef4d,e404213c,code,fifa_raw_dataset.info()
000c0a9b2fef4d,2bad59b0,code,fifa_raw_dataset.shape
...,...,...,...
fffc63ff750064,56aa8da7,code,"\nsubmission.to_csv('house_price_rf.csv', index = False)"
fffc63ff750064,411b85d9,markdown,1. # Data exploration
fffc63ff750064,e7e67119,markdown,# # Data preprocessing
fffc63ff750064,8b54cf58,markdown,# Post-process for submission


Each notebook has all the code cells given first with the markdown cells following. The code cells are in the correct relative order, while the markdown cells are shuffled. In the next section, we'll see how to recover the correct orderings for notebooks in the training set.

In [3]:
# Get an example notebook
nb_id = df.index.unique('id')[6]
print('Notebook:', nb_id)

print("The disordered notebook:")
nb = df.loc[nb_id, :]
display(nb)
print()

Notebook: 00290ddf866418
The disordered notebook:


cell_id,cell_type,source
4e6f32f6,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
0aeca210,code,"import warnings\nimport random\n\nSEED=44\nrandom.seed(SEED)\nnp.random.seed(SEED)\npd.set_option('display.width', N..."
cadfdb16,code,train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')\ntest = pd.read_csv('/kaggle/input...
fe39c117,code,train.info()
bbc5f229,code,"y = train.iloc[:,-1]\nX = train.iloc[:,:-1]\nZ = test"
46cc92d1,code,def get_obj_cols(df):\n return [col for col in df.columns if df.dtypes[col] == np.object]
9d6ea72b,code,X_objs = get_obj_cols(X)\nX_objs_idx = [X.columns.get_loc(col) for col in X_objs]\nZ_objs = get_obj_cols(Z)\nZ_objs_...
d0f88604,code,for obj in X_objs:\n X[obj] = X[obj].astype('category').cat.codes\nfor obj in Z_objs:\n Z[obj] = Z[obj].astype...
8df28832,code,"X.drop('id', axis=1, inplace=True)\nZ.drop('id', axis=1, inplace=True)"
8c01934e,code,"from sklearn.model_selection import StratifiedKFold, cross_val_score\nfrom sklearn.metrics import roc_auc_score\nK = 10"





# Ordering the Cells #

In the `train_orders.csv` file we have, for notebooks in the training set, the correct ordering of cells in terms of the cell ids.

In [4]:
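# `squeeze=True` is no longer accepted by recent pandas versions of read_csv,
# so squeeze the single-column frame into a Series explicitly.
df_orders = pd.read_csv(
    data_dir / 'train_orders.csv',
    index_col='id',
).squeeze('columns').str.split()  # Split the string representation of cell_ids into a list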

df_orders

id
00001756c60be8    [1862f0a6, 448eb224, 2a9e43d6, 7e2f170a, 038b763d, 77e56113, 2eefe0ef, 1ae087ab, 0beab1cd, 8ffe0b25, 9a78ab76, 0d136...
00015c83e2717b    [2e94bd7a, 3e99dee9, b5e286ea, da4f7550, c417225b, 51e3cd89, 2600b4eb, 75b65993, cf195f8b, 25699d02, 72b3201a, f2c75...
0001bdd4021779    [3fdc37be, 073782ca, 8ea7263c, 80543cd8, 38310c80, 073e27e5, 015d52a4, ad7679ef, 7fde4f04, 07c52510, 0a1a7a39, 0bcd3...
0001daf4c2c76d    [97266564, a898e555, 86605076, 76cc2642, ef279279, df6c939f, 2476da96, 00f87d0a, ae93e8e6, 58aadb1d, d20b0094, 986fd...
0002115f48f982                                 [9ec225f0, 18281c6c, e3b6b115, 4a044c54, 365fe576, a3188e54, b3f6e12d, ee7655ca, 84125b7a]
                                                                           ...                                                           
fffc30d5a0bc46    [09727c0c, ff1ea6a0, ddfef603, a01ce9b3, 3ba953ee, bf92a015, f4a0492a, 095812e6, 53125cfe, aa32a700, 63340e73, 06d8c...
fffc3b44869198    [978a5137, fa

In [5]:
# Get the correct order
cell_order = df_orders.loc[nb_id]

print("The ordered notebook:")
nb.loc[cell_order, :]

The ordered notebook:


cell_id,cell_type,source
91d97bb2,markdown,# Read Data
4e6f32f6,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
0aeca210,code,"import warnings\nimport random\n\nSEED=44\nrandom.seed(SEED)\nnp.random.seed(SEED)\npd.set_option('display.width', N..."
cadfdb16,code,train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')\ntest = pd.read_csv('/kaggle/input...
fe39c117,code,train.info()
bbc5f229,code,"y = train.iloc[:,-1]\nX = train.iloc[:,:-1]\nZ = test"
91194a54,markdown,# Categorical
46cc92d1,code,def get_obj_cols(df):\n return [col for col in df.columns if df.dtypes[col] == np.object]
9d6ea72b,code,X_objs = get_obj_cols(X)\nX_objs_idx = [X.columns.get_loc(col) for col in X_objs]\nZ_objs = get_obj_cols(Z)\nZ_objs_...
d0f88604,code,for obj in X_objs:\n X[obj] = X[obj].astype('category').cat.codes\nfor obj in Z_objs:\n Z[obj] = Z[obj].astype...


We will call the correct numeric position of a cell within its notebook the **rank** of the cell. We can find the ranks of the cells within a notebook by referencing the true ordering of cell ids as given in `train_orders.csv`.

In [6]:
def get_ranks(base, derived):
    return [base.index(d) for d in derived]

cell_ranks = get_ranks(cell_order, list(nb.index))
nb.insert(0, 'rank', cell_ranks)

nb

cell_id,rank,cell_type,source
4e6f32f6,1,code,# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/pyt...
0aeca210,2,code,"import warnings\nimport random\n\nSEED=44\nrandom.seed(SEED)\nnp.random.seed(SEED)\npd.set_option('display.width', N..."
cadfdb16,3,code,train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')\ntest = pd.read_csv('/kaggle/input...
fe39c117,4,code,train.info()
bbc5f229,5,code,"y = train.iloc[:,-1]\nX = train.iloc[:,:-1]\nZ = test"
46cc92d1,7,code,def get_obj_cols(df):\n return [col for col in df.columns if df.dtypes[col] == np.object]
9d6ea72b,8,code,X_objs = get_obj_cols(X)\nX_objs_idx = [X.columns.get_loc(col) for col in X_objs]\nZ_objs = get_obj_cols(Z)\nZ_objs_...
d0f88604,9,code,for obj in X_objs:\n X[obj] = X[obj].astype('category').cat.codes\nfor obj in Z_objs:\n Z[obj] = Z[obj].astype...
8df28832,10,code,"X.drop('id', axis=1, inplace=True)\nZ.drop('id', axis=1, inplace=True)"
8c01934e,12,code,"from sklearn.model_selection import StratifiedKFold, cross_val_score\nfrom sklearn.metrics import roc_auc_score\nK = 10"


Sorting a notebook by the cell ranks is another way to order the notebook.

In [7]:
from pandas.testing import assert_frame_equal

assert_frame_equal(nb.loc[cell_order, :], nb.sort_values('rank'))
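
To make the rank idea concrete, here is a minimal sketch on a toy notebook. The cell ids are made up for illustration and do not come from the dataset; `get_ranks` is repeated so the snippet runs on its own.

```python
def get_ranks(base, derived):
    # Position of each cell id in `derived` within the true order `base`.
    return [base.index(d) for d in derived]


# Hypothetical cell ids, for illustration only.
true_order = ['md_intro', 'code_a', 'md_model', 'code_b', 'code_c']
# Same layout as the dataset: code cells first (in order), then shuffled markdown.
shuffled = ['code_a', 'code_b', 'code_c', 'md_model', 'md_intro']

ranks = get_ranks(true_order, shuffled)
print(ranks)  # [1, 3, 4, 2, 0]

# Sorting the shuffled cells by rank recovers the true order.
restored = [cell for _, cell in sorted(zip(ranks, shuffled))]
print(restored == true_order)  # True
```

Note that the code cells keep increasing ranks (1, 3, 4), while the markdown cells fill the gaps.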

The algorithm we'll be using for our baseline model uses the cell ranks as the target, so let's create a dataframe of the ranks for each notebook.

In [8]:
df_orders_ = df_orders.to_frame().join(
    df.reset_index('cell_id').groupby('id')['cell_id'].apply(list),
    how='right',
)

ranks = {}
for id_, cell_order, cell_id in df_orders_.itertuples():
    ranks[id_] = {'cell_id': cell_id, 'rank': get_ranks(cell_order, cell_id)}

df_ranks = (
    pd.DataFrame
    .from_dict(ranks, orient='index')
    .rename_axis('id')
    .apply(pd.Series.explode)
    .set_index('cell_id', append=True)
)

df_ranks

id,cell_id,rank
000c0a9b2fef4d,1087237d,2
000c0a9b2fef4d,d7209f1f,4
000c0a9b2fef4d,daf5b8ee,6
000c0a9b2fef4d,e404213c,7
000c0a9b2fef4d,2bad59b0,8
...,...,...
fffc63ff750064,56aa8da7,25
fffc63ff750064,411b85d9,1
fffc63ff750064,e7e67119,6
fffc63ff750064,8b54cf58,22


# Splits #

The `train_ancestors.csv` file identifies groups of notebooks derived from a common origin, that is, notebooks belonging to the same forking tree.

In [9]:
df_ancestors = pd.read_csv(data_dir / 'train_ancestors.csv', index_col='id')
df_ancestors

id,ancestor_id,parent_id
00001756c60be8,945aea18,
00015c83e2717b,aa2da37e,317b65d12af9df
0001bdd4021779,a7711fde,
0001daf4c2c76d,090152ca,
0002115f48f982,272b483a,
...,...,...
fffc30d5a0bc46,6aed207b,
fffc3b44869198,a6aaa8d7,
fffc63ff750064,0a1b5b65,
fffcd063cda949,d971e960,


To prevent leakage, the test set has no notebook with an ancestor in the training set. We therefore form a validation split using `ancestor_id` as a grouping factor.

In [10]:
from sklearn.model_selection import GroupShuffleSplit

NVALID = 0.1  # fraction of ancestor groups held out for validation

splitter = GroupShuffleSplit(n_splits=1, test_size=NVALID, random_state=0)

# Split, keeping notebooks with a common origin (ancestor_id) together
ids = df.index.unique('id')
ancestors = df_ancestors.loc[ids, 'ancestor_id']
ids_train, ids_valid = next(splitter.split(ids, groups=ancestors))
ids_train, ids_valid = ids[ids_train], ids[ids_valid]

df_train = df.loc[ids_train, :]
df_valid = df.loc[ids_valid, :]
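
As a quick sanity check of the grouping behavior, the sketch below runs the same kind of split on a handful of made-up notebook and ancestor ids (all names here are hypothetical) and confirms that no ancestor ends up on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical notebook ids and ancestor ids, for illustration only.
toy_ids = np.array(['nb0', 'nb1', 'nb2', 'nb3', 'nb4', 'nb5', 'nb6', 'nb7'])
toy_ancestors = np.array(['a', 'a', 'b', 'c', 'c', 'd', 'e', 'e'])

toy_splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(toy_splitter.split(toy_ids, groups=toy_ancestors))

train_groups = set(toy_ancestors[train_idx])
valid_groups = set(toy_ancestors[valid_idx])

# No ancestor appears on both sides of the split.
print(train_groups & valid_groups)  # set()
```

The same check could be run on `ids_train` and `ids_valid` above by looking up their `ancestor_id` values.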

# Feature Engineering #

Let's generate [tf-idf features](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) to use with our ranking model. These features will help our model learn what kinds of words tend to occur most often at various positions within a notebook.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Training set
tfidf = TfidfVectorizer(min_df=0.001357, lowercase=True)
X_train = tfidf.fit_transform(df_train['source'].astype(str))
# Rank of each cell within the notebook
y_train = df_ranks.loc[ids_train].to_numpy()
# Number of cells in each notebook
groups = df_ranks.loc[ids_train].groupby('id').size().to_numpy()
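
The `min_df` value above is a float, which scikit-learn reads as a proportion of documents: with `min_df=0.001357`, a term is kept only if it appears in roughly 0.14% of the training cells. The sketch below shows the same mechanism on four made-up cell sources with a much larger threshold, so the effect is easy to see.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cell sources, for illustration only.
toy_cells = [
    'import pandas as pd',
    'import numpy as np',
    'submission to_csv',
    'plot the data',
]

# A float min_df is a proportion of documents: with 4 cells and min_df=0.5,
# a term must appear in at least 2 cells to be kept.
toy_tfidf = TfidfVectorizer(min_df=0.5, lowercase=True)
toy_X = toy_tfidf.fit_transform(toy_cells)

print(sorted(toy_tfidf.vocabulary_))  # ['as', 'import']
print(toy_X.shape)                    # (4, 2)
```

Cells that contain none of the surviving terms simply get an all-zero row.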

Now let's add the code cell ordering as a feature. We'll append a column that enumerates the code cells in the correct order, like `1, 2, 3, 4, ...`, while having the dummy value `0` for all markdown cells. This feature will help the model learn to put the code cells in the correct order.

In [12]:
# Add code cell ordering
X_train = sparse.hstack((
    X_train,
    np.where(
        df_train['cell_type'] == 'code',
        df_train.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))
print(X_train.shape)

(416586, 1680)
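
To see what the appended column looks like, here is the same expression applied to a small made-up frame (the notebook ids and cell layout are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical two-notebook frame, for illustration only.
toy = pd.DataFrame({
    'id': ['nb0', 'nb0', 'nb0', 'nb0', 'nb1', 'nb1', 'nb1'],
    'cell_type': ['code', 'code', 'code', 'markdown', 'code', 'code', 'markdown'],
})

# Count cells within each (notebook, cell type) pair, starting from 1,
# then replace the markdown entries with the dummy value 0.
code_order = np.where(
    toy['cell_type'] == 'code',
    toy.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
    0,
)
print(code_order.tolist())  # [1, 2, 3, 0, 1, 2, 0]
```

The counter restarts at 1 for each notebook, and every markdown cell receives 0.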


In [13]:
df_train

id,cell_id,cell_type,source
000c0a9b2fef4d,1087237d,code,# Data manipulation\nimport pandas as pd\nimport numpy as np\n\n# Data visualization\nimport matplotlib.pyplot as pl...
000c0a9b2fef4d,d7209f1f,code,fifa_raw_dataset = pd.read_csv('../input/data.csv')
000c0a9b2fef4d,daf5b8ee,code,fifa_raw_dataset.head()
000c0a9b2fef4d,e404213c,code,fifa_raw_dataset.info()
000c0a9b2fef4d,2bad59b0,code,fifa_raw_dataset.shape
...,...,...,...
fffc63ff750064,56aa8da7,code,"\nsubmission.to_csv('house_price_rf.csv', index = False)"
fffc63ff750064,411b85d9,markdown,1. # Data exploration
fffc63ff750064,e7e67119,markdown,# # Data preprocessing
fffc63ff750064,8b54cf58,markdown,# Post-process for submission


# Train #

We'll use the ranking algorithm provided by XGBoost.

In [14]:
from xgboost import XGBRanker

#params = {'subsample': 0.9587444099995703, 'min_child_weight': 58, 'tree_method':'hist'}

model = XGBRanker(#**params
    min_child_weight=12,
    subsample=0.55,
    tree_method='hist',
)
model.fit(X_train, y_train, group=groups)

XGBRanker(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1,
          colsample_bynode=1, colsample_bytree=1, early_stopping_rounds=None,
          enable_categorical=False, eval_metric=None, gamma=0, gpu_id=-1,
          grow_policy='depthwise', importance_type=None,
          interaction_constraints='', learning_rate=0.300000012, max_bin=256,
          max_cat_to_onehot=4, max_delta_step=0, max_depth=6, max_leaves=0,
          min_child_weight=12, missing=nan, monotone_constraints='()',
          n_estimators=100, n_jobs=0, num_parallel_tree=1, predictor='auto',
          random_state=0, reg_alpha=0, reg_lambda=1, ...)
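
The `group` argument passed to `fit` gives the size of each query group, in row order; here each notebook is one group. The sketch below reproduces how `groups` is derived, using a small made-up ranks frame (the ids are hypothetical), and checks that the sizes line up with the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical ranks for two notebooks, for illustration only.
toy_ranks = pd.DataFrame(
    {'rank': [1, 2, 0, 0, 2, 1, 3]},
    index=pd.MultiIndex.from_tuples(
        [('nb0', 'c1'), ('nb0', 'c2'), ('nb0', 'm1'),
         ('nb1', 'c1'), ('nb1', 'c2'), ('nb1', 'm1'), ('nb1', 'm2')],
        names=['id', 'cell_id'],
    ),
)

# One entry per notebook: the number of consecutive rows that belong to it.
toy_groups = toy_ranks.groupby('id').size().to_numpy()
print(toy_groups.tolist())  # [3, 4]

# The sizes must add up to the number of rows...
assert toy_groups.sum() == len(toy_ranks)
# ...and their running totals mark where each notebook ends.
print(np.cumsum(toy_groups).tolist())  # [3, 7]
```

This only works if the feature rows are stored notebook by notebook in the same order as the group sizes, which is the case for `X_train` and `groups` above.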

In [15]:
from sklearn.metrics import mean_squared_error

In [16]:
"""
import optuna
#optuna.logging.set_verbosity(optuna.logging.WARNING)


# 1. Define an objective function to be maximized.
def objective(trial):
    
    params = {
        'tree_method':'hist',
        'subsample': trial.suggest_float('subsample', 0.02, 1),
        'min_child_weight': trial.suggest_int('min_child_weight', 10, 100),
        
        'learning_rate': trial.suggest_float('learning_rate', 0, 1),
        'n_estimators': trial.suggest_int('n_estimators', 1, 2500),
        #'max_depth': trial.suggest_int('max_depth', 3, 16),
    }

    model = XGBRanker(
        **params
        #min_child_weight=12,
    )
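    model.fit(X_train, y_train, group=groups)
    # Note: the score is computed on the training data itself,
    # so it measures fit rather than generalization.
    y_pred = model.predict(X_train)

    return 1 - mean_squared_error(y_train, y_pred)

# 2. Create a study object and optimize the objective function.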
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=125)
"""



In [17]:
#params = study.best_params
#params

In [18]:
"""
#params = {'subsample': 0.9587444099995703, 'min_child_weight': 58, 'tree_method':'hist'}

model = XGBRanker(**params)
model.fit(X_train, y_train, group=groups)
"""


# Evaluate #

Now let's see how well our model learned to order Kaggle notebook cells. We'll evaluate predictions on the validation set with a variant of the Kendall tau correlation.

## Validation set ##

First we'll create features for the validation set just like we did for the training set.

In [19]:
# Validation set
X_valid = tfidf.transform(df_valid['source'].astype(str))
# The metric uses cell ids
y_valid = df_orders.loc[ids_valid]

X_valid = sparse.hstack((
    X_valid,
    np.where(
        df_valid['cell_type'] == 'code',
        df_valid.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

Here we'll use the model to predict the rank of each cell within its notebook and then convert these ranks into a list of ordered cell ids.

In [20]:
y_pred = pd.DataFrame({'rank': model.predict(X_valid)}, index=df_valid.index)
y_pred = (
    y_pred
    .sort_values(['id', 'rank'])  # Sort the cells in each notebook by their rank.
                                  # The cell_ids are now in the order the model predicted.
    .reset_index('cell_id')  # Convert the cell_id index into a column.
    .groupby('id')['cell_id'].apply(list)  # Group the cell_ids for each notebook into a list.
)
y_pred.head(10)

id
0019c0de64fe80    [76b81a6e, c06f3027, 2e8ddcdc, 0bdb7484, 4470e13e, 12d2c276, 62cd27a1, 975dc007, 6a2f9600, 153d9221, 342a10b0, 175ac...
0098e6a711804b    [0b57c7d1, bac84fb8, 7befb981, 2b1c4938, ec8c11d4, c85adef0, 92aebe06, 1f2574e6, b39006ed, 0b34ec73, d6e30791, ad148...
0115938e54b661    [7bdb2279, 26008f09, 1393af9d, bc18f233, 2b20e565, 4ae6d198, e7596437, 60f096be, e2b23e8b, 38a4d1e2, 3912c173, 4247f...
01a86eb72c41c6    [43e4bff0, bd5969e9, 6f3a8cf6, e155f365, 7af1cc87, c6e3de3d, e0c51b34, c05bfc2a, 904ebd55, 1c3732fa, 86e80ef2, 1e0ef...
01b0f2f0cc925b    [e1ed7e8f, 4f96d02a, 46e763b2, ce25eefa, 6e0ade08, a720b51d, 8107082f, 55dbfb53, 26d53ebe, c4a12ce1, 786f1175, e3eec...
021671a4f2e18c    [ebdb791f, 9572d02e, 296fb2fe, cbb7476c, a1d6a390, 298e543f, 24afcc91, c784ebec, 88a62103, 65d9ae24, c46226f3, 987aa...
0245a6f3fe3b0f    [d868648f, 1dcf6f80, 70cd8db6, b7c21c6d, 201e80f2, 5411f98a, 906911bf, 86405f4a, 5df866b3, 1c7a4a93, 63a94813, 37a54...
02462dad8226f3    [94503eaf, 9f
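
The conversion from predicted scores to ordered cell ids can be traced on a toy example. The ids and scores below are made up; only their relative order within each notebook matters.

```python
import pandas as pd

# Hypothetical model scores for two notebooks, for illustration only.
toy_index = pd.MultiIndex.from_tuples(
    [('nb0', 'c1'), ('nb0', 'c2'), ('nb0', 'm1'),
     ('nb1', 'c1'), ('nb1', 'm1')],
    names=['id', 'cell_id'],
)
toy_pred = pd.DataFrame({'rank': [0.2, 1.5, 0.9, 0.4, -0.3]}, index=toy_index)

toy_order = (
    toy_pred
    .sort_values(['id', 'rank'])           # order cells within each notebook
    .reset_index('cell_id')                # cell_id index -> column
    .groupby('id')['cell_id'].apply(list)  # one list of cell ids per notebook
)
print(toy_order.to_dict())
# {'nb0': ['c1', 'm1', 'c2'], 'nb1': ['m1', 'c1']}
```

The predicted values are not positions themselves; sorting by them is what produces the order.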

Now let's examine a notebook to see how the model did.

In [21]:
nb_id = df_valid.index.get_level_values('id').unique()[8]

display(df.loc[nb_id])
display(df.loc[nb_id].loc[y_pred.loc[nb_id]])

cell_id,cell_type,source
e84775a2,code,# Bread and butter of Machine learning\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n...
f4f088b2,code,train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')\ntest = pd.read_csv('/kaggle/input/digit-recognizer/...
1a9b7d44,code,"# train.shape[0] is the number of images in the training data, and test.shape[0] holds the number of images in the t..."
2f87a4ad,code,def to_tensor(data):\n return [torch.FloatTensor(point) for point in data]\n\nclass MNISTData(Dataset):\n def ...
8b1b5945,code,# We'll split our data into 90% training and 10% test data \nsplit = int(0.9 * len(train))\nvalid_data = train[split...
d1b5be9b,code,"# Getting features of the image (pixel 0-783, 784 pixels in total for a 28*28 image)\nX_col = list(train.columns[1:]..."
c7c5c317,code,"class dummyModel(nn.Module):\n def __init__(self):\n super(dummyModel, self).__init__()\n \n ..."
93a9ff7f,code,network = dummyModel()\nprint(network)\nnum_epochs = 5\nfor epoch in range(num_epochs):\n for train_batch in trai...
2fd5e864,code,"class model(nn.Module):\n def __init__(self):\n super(model, self).__init__()\n self.conv1 = nn.Seq..."
67a541a1,code,"# Before we start training our neural network, we'll define our accuracy function\ndef acc(y_true, y_pred):\n y_t..."


cell_id,cell_type,source
e84775a2,code,# Bread and butter of Machine learning\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n...
f4f088b2,code,train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')\ntest = pd.read_csv('/kaggle/input/digit-recognizer/...
4389b63f,markdown,"# Mnist Handwritten Digit dataset with Pytorch (CNN)\n> Hi Everyone, in this notebook we're going to train a neural ..."
1a9b7d44,code,"# train.shape[0] is the number of images in the training data, and test.shape[0] holds the number of images in the t..."
2f87a4ad,code,def to_tensor(data):\n return [torch.FloatTensor(point) for point in data]\n\nclass MNISTData(Dataset):\n def ...
8b1b5945,code,# We'll split our data into 90% training and 10% test data \nsplit = int(0.9 * len(train))\nvalid_data = train[split...
d1b5be9b,code,"# Getting features of the image (pixel 0-783, 784 pixels in total for a 28*28 image)\nX_col = list(train.columns[1:]..."
b877bdf1,markdown,# Converting the data in csv files to images\n> Let's convert our data from pixel values in csv to actual images\n\n...
c7c5c317,code,"class dummyModel(nn.Module):\n def __init__(self):\n super(dummyModel, self).__init__()\n \n ..."
f7f7779e,markdown,# Meeting the data


## Metric ##

This competition uses a variant of the [Kendall tau correlation](https://www.kaggle.com/competitions/AI4Code/overview/evaluation), which will measure how close to the correct order our predicted orderings are. See this notebook for more on this metric: [Competition Metric - Kendall Tau Correlation](https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation/notebook).

In [22]:
from bisect import bisect


def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def kendall_tau(ground_truth, predictions):
    total_inversions = 0
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max
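
For intuition about the formula, here is a worked example on a single four-cell notebook with made-up cell ids. It counts inversions with a brute-force pairwise check instead of the bisect-based routine above, so it is a cross-check under that assumption rather than the competition implementation.

```python
from itertools import combinations


def count_inversions_naive(a):
    # Count pairs (i, j) with i < j whose values are out of order.
    return sum(1 for x, y in combinations(a, 2) if x > y)


def kendall_tau_single(gt, pred):
    # Score for one notebook: 1 - 4 * inversions / (n * (n - 1)).
    ranks = [gt.index(x) for x in pred]
    n = len(gt)
    return 1 - 4 * count_inversions_naive(ranks) / (n * (n - 1))


# Hypothetical cell ids, for illustration only.
gt = ['a', 'b', 'c', 'd']

print(kendall_tau_single(gt, ['a', 'b', 'c', 'd']))            # 1.0
print(round(kendall_tau_single(gt, ['b', 'a', 'c', 'd']), 4))  # 0.6667
print(kendall_tau_single(gt, ['d', 'c', 'b', 'a']))            # -1.0
```

A perfect order scores 1, a fully reversed order scores -1, and a single adjacent swap among four cells costs one inversion out of a possible six.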

Let's test the metric with a dummy submission created from the ids of the shuffled notebooks.

In [23]:
y_dummy = df_valid.reset_index('cell_id').groupby('id')['cell_id'].apply(list)
kendall_tau(y_valid, y_dummy)

0.42511216883092573

Comparing this to the score on the predictions, we can see that our model was indeed able to improve the cell ordering somewhat.

In [24]:
kendall_tau(y_valid, y_pred)

0.6325826824249605

# Submission #

To create a submission for this competition, we'll apply our model to the notebooks in the test set. Note that this is a **Code Competition**, which means that the test data we see here is only a small sample. When we submit our notebook for scoring, this example data will be replaced with the full test set of about 20,000 notebooks.

First we load the data.

In [25]:
paths_test = list((data_dir / 'test').glob('*.json'))
notebooks_test = [
    read_notebook(path) for path in tqdm(paths_test, desc='Test NBs')
]
df_test = (
    pd.concat(notebooks_test)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

Test NBs: 100%|██████████| 4/4 [00:00<00:00, 115.66it/s]


Then create the tf-idf and code cell features.

In [26]:
X_test = tfidf.transform(df_test['source'].astype(str))
X_test = sparse.hstack((
    X_test,
    np.where(
        df_test['cell_type'] == 'code',
        df_test.groupby(['id', 'cell_type']).cumcount().to_numpy() + 1,
        0,
    ).reshape(-1, 1)
))

And then create predictions on the test set.

In [27]:
y_infer = pd.DataFrame({'rank': model.predict(X_test)}, index=df_test.index)
y_infer = y_infer.sort_values(['id', 'rank']).reset_index('cell_id').groupby('id')['cell_id'].apply(list)
y_infer

id
0009d135ece78d    [ddfd239c, c6cd22db, 1372ae9b, 8cb8d28a, 7f388a41, 0a226b6a, 90ed07ab, 2843a25a, 06dbf8cf, f9893819, e25aa9bd, ba55e...
0010483c12ba9b                       [54c7cab3, fe66203e, 7844d5f8, 5ce8863c, 4a32c095, 7f270e34, 02a0be6d, 4a0777c4, 865ad516, 4703bb6d]
0010a919d60e4f    [aafc3d23, b7578789, 80e077ec, b190ebb4, ed415c3c, bbff12d4, 322850af, 8ce62db4, c069ed33, 23607d04, 868c4eae, 80433...
0028856e09c5b7                                                                                   [012c9d02, eb293dfc, 3ae7ece3, d22526d1]
Name: cell_id, dtype: object

The `sample_submission.csv` file shows what a correctly formatted submission must look like. We'll just use it as a visual check, but you might prefer to directly modify the values of the sample submission instead. (This would help prevent failed submissions due to missing notebook ids or incorrectly named columns, for instance.)

In [28]:
y_sample = pd.read_csv(data_dir / 'sample_submission.csv', index_col='id').squeeze('columns')
y_sample

id
0009d135ece78d       ddfd239c c6cd22db 1372ae9b 90ed07ab 7f388a41 2843a25a 06dbf8cf f9893819 ba55e576 39e937ec e25aa9bd 0a226b6a 8cb8d28a
0010483c12ba9b                                  54c7cab3 fe66203e 7844d5f8 5ce8863c 4a0777c4 4703bb6d 4a32c095 865ad516 02a0be6d 7f270e34
0010a919d60e4f    aafc3d23 80e077ec b190ebb4 ed415c3c 322850af c069ed33 868c4eae 80433cf3 bd8fbd76 0e2529e8 1345b8b2 cdae286f 4907b9ef...
0028856e09c5b7                                                                                        012c9d02 d22526d1 3ae7ece3 eb293dfc
Name: cell_order, dtype: object

We can see that a correctly formatted submission needs the index named `id` and the column of cell orders named `cell_order`. Moreover, we need to convert the list of cell ids into a space-delimited string of cell ids.

In [29]:
y_submit = (
    y_infer
    .apply(' '.join)  # list of ids -> string of ids
    .rename_axis('id')
    .rename('cell_order')
)
y_submit

id
0009d135ece78d       ddfd239c c6cd22db 1372ae9b 8cb8d28a 7f388a41 0a226b6a 90ed07ab 2843a25a 06dbf8cf f9893819 e25aa9bd ba55e576 39e937ec
0010483c12ba9b                                  54c7cab3 fe66203e 7844d5f8 5ce8863c 4a32c095 7f270e34 02a0be6d 4a0777c4 865ad516 4703bb6d
0010a919d60e4f    aafc3d23 b7578789 80e077ec b190ebb4 ed415c3c bbff12d4 322850af 8ce62db4 c069ed33 23607d04 868c4eae 80433cf3 bac960d3...
0028856e09c5b7                                                                                        012c9d02 eb293dfc 3ae7ece3 d22526d1
Name: cell_order, dtype: object
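
The formatting step can be checked end to end on a toy Series. The ids below are made up, and the CSV is written to an in-memory buffer rather than a file so the result can be inspected directly.

```python
import io

import pandas as pd

# Hypothetical predicted orders, for illustration only.
toy_infer = pd.Series(
    {'nb0': ['c1', 'm1', 'c2'], 'nb1': ['m1', 'c1']},
    name='cell_id',
)

toy_submit = (
    toy_infer
    .apply(' '.join)       # list of ids -> space-delimited string
    .rename_axis('id')     # index name expected by the submission format
    .rename('cell_order')  # column name expected by the submission format
)

buffer = io.StringIO()
toy_submit.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
# ['id,cell_order', 'nb0,c1 m1 c2', 'nb1,m1 c1']
```

The header row carries both required names, `id` and `cell_order`.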

And finally we'll write out the formatted submissions to a file `submission.csv`. When we submit our notebook, it will be rerun on the full test data to create the submission file that's actually scored.

In [30]:
y_submit.to_csv('submission.csv')

<h1 style="color:#fc3f51"> Conclusion </h1>


Retuning the model parameters improved the score slightly (0.588 to 0.592), but better vectorization of the data helped more (0.592 to 0.603).

<h1 style="color:#fc3f51"> What next? </h1>


- Improve the TfidfVectorizer parameters for a better score
- Explore other feature engineering
- Try other models (like LGBMRanker)

<center>
    <h1 style="color:#3c3f51"> Thanks for reading 🙂 </h1>
</center>
