# Remove bot-or-not noise

### Used files
- bot_or_not_without_info
- sybil_scar_results

### Summary:
1. Load necessary data
2. Apply logic to add new column "is_noisy"
3. Check bot label changes from sybilscar
4. Save new bot_or_not_without_noises

### 1. Load necessary data

In [None]:
import polars as pl
import os
pl.Config.set_fmt_str_lengths(400)

In [None]:
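# Base data directory. Set the DATA_PATH environment variable before running:
# with the empty default, the paths below resolve from the filesystem root.
DATA_PATH = os.getenv("DATA_PATH", "")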

In [None]:
bot_or_not = pl.read_parquet(f"{DATA_PATH}/interim/bot_or_not_without_info.parquet")
bot_or_not


In [None]:
sybilscar_result = pl.read_parquet(f"{DATA_PATH}/../farcaster-social-graph-api/farcaster_social_graph_api/data/sybil_scar_results.parquet")
sybilscar_result

In [None]:
fnames = pl.read_parquet(f"{DATA_PATH}/raw/farcaster-fnames-0-1730134800.parquet")
# Keep the most recently updated fname for each fid.
# Used in "3. Check bot label changes from sybilscar".
last_fnames = fnames[["fid","updated_at"]].group_by("fid").max()
last_fnames = last_fnames.join(fnames,on=["fid","updated_at"],how="left",coalesce=True)[["fid","fname"]]
last_fnames

### 2. Apply logic to add new column "is_noisy"

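For now, a sample is considered noisy when the sybilscar prediction disagrees with its bot_or_not label. Sybilscar predicts "bot" when the posterior is below 0.5. Samples with no sybilscar result are flagged as noisy only if they are labeled as bots.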


In [None]:
df = bot_or_not.join(sybilscar_result,on="fid",coalesce=True,how="left")
df

In [None]:
# Check for fids in bot_or_not that have no sybilscar result (null posterior)
df.filter(pl.col("posterior").is_null())

In [None]:
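df = df.with_columns([
    # No sybilscar result: the row is flagged as noisy only if it is labeled as a bot
    pl.when(pl.col("posterior").is_null())
    .then(pl.col("bot"))
    # Otherwise: noisy when the label disagrees with the sybilscar prediction (bot if posterior < 0.5)
    .otherwise(pl.col("bot") != (pl.col("posterior") < 0.5 ))
    .alias("is_noisy")
])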

display(df)
print("number of noisy elements: ",df["is_noisy"].sum())

### 3. Check bot label changes from sybilscar

In [None]:
bot_or_not_with_fnames = df.join(last_fnames[["fid","fname"]],on="fid",how="left", coalesce=True)
bot_or_not_with_fnames.filter(pl.col("is_noisy"))

Manual inspection of the rows flagged as noisy suggests that roughly 70% of the label changes make sense.

### 4. Save new bot_or_not_without_noises

In [None]:
# Filter and remove unnecessary columns
bot_or_not_without_noises = df.filter(~pl.col("is_noisy"))[["fid","bot"]]
bot_or_not_without_noises

In [None]:
bot_or_not_without_noises.write_parquet(f"{DATA_PATH}/interim/bot_or_not_without_noises.parquet")