## Second stage feature extraction

This notebook takes the first stage features and the first stage model, builds a graph from the first stage predictions, extracts graph features for each pair, and writes them to CSV files.

These features can later be fed to the second stage model to get more accurate predictions.
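The idea of turning pairwise predictions into a graph and reading features off it can be illustrated with a toy sketch. It uses plain dictionaries instead of `graph_tool`, and the function names, the 0.5 cutoff, and the three features shown are illustrative assumptions, not the features computed by `second_stage`.

```python
from collections import defaultdict


def build_weighted_graph(pairs, threshold=0.5):
    """Build a directed, weighted adjacency map from (mentor, mentee, score) tuples.

    Pairs scoring below `threshold` get no edge (hypothetical cutoff).
    """
    graph = defaultdict(dict)
    for mentor, mentee, score in pairs:
        if score >= threshold:
            graph[mentor][mentee] = score
    return graph


def pair_graph_features(graph, mentor, mentee):
    """Return a few simple graph features for one candidate pair."""
    out_degree = len(graph.get(mentor, {}))
    in_degree = sum(1 for targets in graph.values() if mentee in targets)
    return {
        "mentor_out_degree": out_degree,
        "mentee_in_degree": in_degree,
        "edge_weight": graph.get(mentor, {}).get(mentee, 0.0),
    }


# Made-up first stage predictions: (mentor, mentee, predicted score)
predictions = [
    ("alice", "bob", 0.9),
    ("alice", "carol", 0.7),
    ("dave", "bob", 0.6),
    ("dave", "carol", 0.2),  # below the cutoff, so no edge is created
]
graph = build_weighted_graph(predictions)
features = pair_graph_features(graph, "alice", "bob")
```

Each candidate pair ends up with one row of graph-derived values, which is the kind of table a second stage model could consume alongside the first stage features.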

Make sure you have enough RAM before running this. The program builds a graph of 200 million nodes, which requires around 120 GB of memory.
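Before starting a run this large, it may be worth checking how much physical memory the machine has. The sketch below is one way to do that with the standard library; it assumes a POSIX system where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`, and the helper names are hypothetical.

```python
import os


def total_ram_gb():
    """Return total physical memory in GB (POSIX only; relies on os.sysconf)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1e9


def has_enough_memory(required_gb, available_gb):
    """Return True when the available memory covers the requirement."""
    return available_gb >= required_gb


REQUIRED_GB = 120  # approximate figure quoted above
ram_gb = total_ram_gb()
if not has_enough_memory(REQUIRED_GB, ram_gb):
    print(f"Warning: {ram_gb:.0f} GB of RAM found, about {REQUIRED_GB} GB recommended")
```

This reports total installed memory, not memory that is currently free, so it is only a rough guard.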

In [1]:
from graph_tool.all import *
from second_stage import *

In [5]:
# Download the first stage features (52 GB) and the first stage model.
# --recursive is needed to copy every file under the features prefix.

#!aws s3 cp --recursive --no-sign-request s3://ai2-s2-research-public/s2amp/inferred/first_stage_features/ data/inferred/first_stage_features/
#!aws s3 cp --no-sign-request s3://ai2-s2-research-public/s2amp/gold/lgb_first.stage.model.pkl data/gold/lgb_first.stage.model.pkl

inferred_first_stage_features_dir = 'data/inferred/first_stage_features/'
first_stage_model = 'data/gold/lgb_first.stage.model.pkl'

In [3]:
# Create the output directory; exist_ok avoids an error when re-running the notebook
os.makedirs("second_stage_features", exist_ok=True)

In [6]:
# reading first stage pairwise features
df_features = read_features(inferred_first_stage_features_dir)

INFO:root:reading first stage features . . 
100%|██████████| 200/200 [09:09<00:00,  2.75s/it]


In [7]:
# getting predictions for first stage
df_pred = get_prediction_feature_dataframe(df_features, first_stage_model)

In [9]:
# building weighted graph
get_directed_mentor_mentee_graph(df_pred)

In [10]:
# Now that the graph has been built, we can free the huge prediction DataFrame
del df_pred

In [None]:
# Compute graph features for each pair in the first stage feature files.
# Each input file is split into 5 chunks, so 5 output CSV files are written per input file.

count = 0
for filename in tqdm(glob.glob(os.path.join(inferred_first_stage_features_dir, "*.csv"))):
    df = pd.read_csv(filename)
    # splitting huge file into smaller chunks
    df_splits = np.array_split(df, 5)

    # processing each chunk using the multiprocessing pool
    for df_split in df_splits:
        df_graph_features = parallelize_dataframe(
            df_split, get_graph_features, n_cores=10,
        )
        df_graph_features.to_csv(
            f"second_stage_features/features.{count}.csv", index=False,
        )
        count += 1
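The implementation of `parallelize_dataframe` is not shown in this notebook. As a rough sketch of the usual pattern behind such a helper (split the input, map a function over the pieces in a process pool, concatenate the results), the example below uses plain lists of rows in place of DataFrame chunks. The names `split_into_chunks`, `parallelize_rows`, and `double_rows` are hypothetical and are not part of `second_stage`.

```python
from multiprocessing import Pool


def split_into_chunks(rows, n_chunks):
    """Split a list into n_chunks nearly equal, contiguous parts."""
    n_chunks = max(1, n_chunks)
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional row
        end = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def parallelize_rows(rows, func, n_cores=4):
    """Apply func to each chunk of rows and concatenate the results in order."""
    chunks = [chunk for chunk in split_into_chunks(rows, n_cores) if chunk]
    if n_cores <= 1:
        # Serial fallback, handy for debugging
        results = [func(chunk) for chunk in chunks]
    else:
        with Pool(n_cores) as pool:
            results = pool.map(func, chunks)
    return [row for chunk_result in results for row in chunk_result]


def double_rows(chunk):
    """Stand-in for a per-chunk feature function."""
    return [2 * value for value in chunk]
```

With `n_cores=1` the helper runs serially, which makes it easy to check the per-chunk function before launching worker processes. The worker function has to be defined at module level so it can be pickled, and on platforms that spawn rather than fork, the call should sit under an `if __name__ == "__main__":` guard.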