### Notes:

##### The test_dataset is based on the Trump Meltdown data, with the top title rows removed and some missing score values filled in.

##### After encoding the CSV file, the 'phantom comments' (IDs that appear only as a parent and have no row of their own) have to be addressed manually: add a row for each of their IDs (e.g. 1395, 1396, ...) with a score of 0, and treat each one as a 'comment on post'.
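The manual step above could also be scripted. The following is a minimal sketch, not the method used for this dataset: the helper name `add_phantom_rows` is hypothetical, and it assumes the encoded file has the columns `comment_id`, `parent_id`, `score`, `source_id` and `target_id` (the `score` column name is a guess based on the note above).

```python
import pandas as pd

def add_phantom_rows(encoded: pd.DataFrame) -> pd.DataFrame:
    """Append a zero-score 'comment on post' row for every phantom ID.

    Hypothetical helper: a phantom ID is a target_id that never occurs
    as a source_id. ID 0 is the post itself and is not a phantom.
    """
    known = set(encoded['source_id']) | {0}
    # Rows whose parent has no row of its own; keep one entry per phantom
    refs = (encoded.loc[~encoded['target_id'].isin(known), ['parent_id', 'target_id']]
                   .drop_duplicates()
                   .sort_values('target_id'))
    if refs.empty:
        return encoded.copy()
    phantoms = pd.DataFrame({
        'comment_id': refs['parent_id'].to_numpy(),  # raw ID of the missing comment
        'parent_id': 'post',                         # treated as a comment on the post
        'score': 0,                                  # assumed column name
        'source_id': refs['target_id'].to_numpy(),
        'target_id': 0,
    })
    return pd.concat([encoded, phantoms], ignore_index=True)

# Toy data: comment 'c' replies to 'x', which has no row of its own
encoded = pd.DataFrame({
    'comment_id': ['a', 'b', 'c'],
    'parent_id': ['post', 'a', 'x'],
    'score': [5, 2, 1],
    'source_id': [1, 2, 3],
    'target_id': [0, 1, 4],
})
fixed = add_phantom_rows(encoded)
```

In the toy frame, ID 4 is the only phantom, so `fixed` gains a single row with `source_id` 4, `target_id` 0 and `score` 0.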

##### The modified encoded CSV file will be processed in torch_graph_8oct.ipynb

In [1]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('trump_dataset.csv')

# Combine comment IDs ('Source') and parent IDs ('Target'), and get the unique values
all_users = pd.concat([df['comment_id'], df['parent_id']]).dropna().unique()  # Drop missing (NaN) parent IDs

# Create a DataFrame for unique users and assign an index as their user ID
user_ids = pd.DataFrame(all_users, columns=['user'])
user_ids['user_id'] = user_ids.index + 1  # Start IDs from 1

# Add a row for the post itself (e.g., assign ID 0 for the post)
user_ids = pd.concat([pd.DataFrame([{'user': 'post', 'user_id': 0}]), user_ids], ignore_index=True)

# Replace missing (NaN) parent_id values with 'post' to indicate replies to the post itself
df['parent_id'] = df['parent_id'].fillna('post')

# Merge the IDs back into the original DataFrame: comment_id -> source_id, parent_id -> target_id
df_with_ids = df.merge(user_ids, left_on='comment_id', right_on='user', how='left') \
                .rename(columns={'user_id': 'source_id'}) \
                .drop(columns=['user'])  # Drop the temporary 'user' column

df_with_ids = df_with_ids.merge(user_ids, left_on='parent_id', right_on='user', how='left') \
                         .rename(columns={'user_id': 'target_id'}) \
                         .drop(columns=['user'])  # Drop the temporary 'user' column

# Save the updated DataFrame to a new CSV file
df_with_ids.to_csv('trump_encoded.csv', index=False)

print("CSV with user IDs for both 'Source' and 'Target' has been created.")


CSV with user IDs for both 'Source' and 'Target' has been created.
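To see how a 'phantom comment' arises from this encoding, here is a small worked example on an in-memory frame. It is only a sketch: the toy data is invented, the `score` column name is an assumption, and a dictionary lookup with `Series.map` stands in for the two merges in the cell above.

```python
import pandas as pd

# Invented toy data: 'x' is replied to but never appears as a comment_id
toy = pd.DataFrame({
    'comment_id': ['a', 'b', 'c'],
    'parent_id': [None, 'a', 'x'],
    'score': [5, 2, 1],
})

# Unique raw IDs in order of first appearance, numbered from 1
ids = pd.concat([toy['comment_id'], toy['parent_id']]).dropna().unique()
id_map = {raw: i + 1 for i, raw in enumerate(ids)}
id_map['post'] = 0  # ID 0 is reserved for the post itself

# Missing parents become replies to the post, then both columns are encoded
toy['parent_id'] = toy['parent_id'].fillna('post')
toy['source_id'] = toy['comment_id'].map(id_map)
toy['target_id'] = toy['parent_id'].map(id_map)

# Encoded IDs that are replied to but have no row of their own
phantom_ids = sorted(set(toy['target_id']) - set(toy['source_id']) - {0})
```

Here 'x' receives ID 4 because it occurs in `parent_id`, but no row has `source_id` 4, so `phantom_ids` is `[4]`. That is the kind of ID the notes say needs a row added.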
