# Remove noisy bot-or-not labels

### Used files
- bot_or_not_without_info
- sybil_scar_results
- farcaster-fnames-0-1730134800 (only used to show fnames during inspection)

### Summary:
1. Load necessary data
2. Apply logic to add new column "is_noisy"
3. Check bot label changes from sybilscar
4. Save new bot_or_not_without_noises

### 1. Load necessary data

In [1]:
import polars as pl
import os
pl.Config.set_fmt_str_lengths(400)

polars.config.Config

In [2]:
DATA_PATH = os.getenv("DATA_PATH", "")

In [3]:
bot_or_not = pl.read_parquet(f"{DATA_PATH}/interim/bot_or_not_without_info.parquet")
bot_or_not


fid,bot
i64,bool
446097,false
3,false
8,false
12,false
2,false
…,…
327500,true
428200,true
469138,false
278549,true


In [4]:
sybilscar_result = pl.read_parquet(f"{DATA_PATH}/../farcaster-social-graph-api/farcaster_social_graph_api/data/sybil_scar_results.parquet")
sybilscar_result

fid_index,posterior,fid
i64,f64,i64
198306,0.0,362936
47055,0.0,690195
326843,0.0,551357
120189,0.0,429013
297387,0.344896,818125
…,…,…
100725,1.0,466914
16259,0.7,863574
128403,1.0,720296
61238,0.0,727956


In [5]:
fnames = pl.read_parquet(f"{DATA_PATH}/raw/farcaster-fnames-0-1730134800.parquet")
# Keep only the most recent fname per fid (row with the latest updated_at).
# Will be used in "3. Check bot label changes from sybilscar".
last_fnames = fnames[["fid","updated_at"]].group_by("fid").max()
last_fnames = last_fnames.join(fnames,on=["fid","updated_at"],how="left",coalesce=True)[["fid","fname"]]
last_fnames

fid,fname
i64,str
606810,"""webfan"""
291006,"""elawgrrl"""
863985,"""hardiewalingvo"""
481618,"""ericnam"""
847339,"""maria0425"""
…,…
339354,"""rakos"""
836647,"""americans"""
860644,"""dogavehayat"""
492446,"""fainiguez"""


### 2. Apply logic to add new column "is_noisy"

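For now, a sample is considered noisy when the SybilSCAR label disagrees with the bot_or_not label. SybilSCAR labels an fid as a bot when its posterior is below 0.5. Fids with no SybilSCAR result are flagged as noisy only when bot_or_not labels them as bots, so unscored accounts labeled human are kept.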


In [6]:
df = bot_or_not.join(sybilscar_result,on="fid",coalesce=True,how="left")
df

fid,bot,fid_index,posterior
i64,bool,i64,f64
446097,false,163975,1.0
3,false,8129,1.0
8,false,255872,1.0
12,false,43493,1.0
2,false,248340,1.0
…,…,…,…
327500,true,169966,0.0
428200,true,72388,0.0
469138,false,105841,0.0
278549,true,22377,0.0


In [7]:
# List the fids in bot_or_not that have no SybilSCAR result (null posterior after the left join)
df.filter(pl.col("posterior").is_null())

fid,bot,fid_index,posterior
i64,bool,i64,f64
2348,false,,
12144,false,,
12775,false,,
191322,false,,
194515,false,,
…,…,…,…
854040,false,,
854041,false,,
854043,false,,
854923,false,,


In [8]:
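# (posterior < 0.5) is the SybilSCAR bot label; a row is noisy when it disagrees with "bot".
# Rows without a posterior are flagged as noisy only when bot_or_not labels them as bots.
df = df.with_columns([
    pl.when(pl.col("posterior").is_null())
    .then(pl.col("bot"))
    .otherwise(pl.col("bot") != (pl.col("posterior") < 0.5))
    .alias("is_noisy")
])

display(df)
print("number of noisy elements: ", df["is_noisy"].sum())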

fid,bot,fid_index,posterior,is_noisy
i64,bool,i64,f64,bool
446097,false,163975,1.0,false
3,false,8129,1.0,false
8,false,255872,1.0,false
12,false,43493,1.0,false
2,false,248340,1.0,false
…,…,…,…,…
327500,true,169966,0.0,false
428200,true,72388,0.0,false
469138,false,105841,0.0,true
278549,true,22377,0.0,false


number of noisy elements:  3946


### 3. Check bot label changes from sybilscar

In [9]:
bot_or_not_with_fnames = df.join(last_fnames[["fid","fname"]],on="fid",how="left", coalesce=True)
bot_or_not_with_fnames.filter(pl.col("is_noisy"))

fid,bot,fid_index,posterior,is_noisy,fname
i64,bool,i64,f64,bool,str
1731,true,149846,1.0,true,"""fayiz"""
1771,true,305979,1.0,true,"""ruslan"""
2183,true,48253,1.0,true,"""djo"""
2247,false,272536,0.0,true,"""papeclaus"""
2278,false,92265,0.0,true,"""versadchikov"""
…,…,…,…,…,…
390605,false,367876,0.0,true,"""siatoshi"""
810027,false,350222,0.0,true,"""naqu"""
287794,true,52460,1.0,true,"""jenny1"""
423036,true,283687,1.0,true,"""sheva7.eth"""


In [12]:
bot_or_not_with_fnames.filter(pl.col("is_noisy")).sample(10)

fid,bot,fid_index,posterior,is_noisy,fname
i64,bool,i64,f64,bool,str
415001,False,10115,0.0,True,"""parviz8998"""
826255,False,178004,0.0,True,"""austilicious123"""
472997,True,161983,1.0,True,"""jinkyo"""
843895,False,126476,0.0,True,"""escalord92"""
473155,False,199930,0.0,True,"""amircyber"""
324605,False,88363,0.0,True,"""babaika.eth"""
513102,False,311558,0.0,True,"""zach19"""
2864,True,351649,1.0,True,"""launch"""
507710,False,305621,0.0,True,"""cryptobeauty"""
322511,False,305219,0.0,True,"""lukichka"""


|      **fname**      | **Bot or Not label** | **SybilSCAR label** | **inspection result** |
|-----------------|------------------|-----------------|-------------------|
| fayiz           | bot              | human           | human             |
| ruslan          | bot              | human           | bot               |
| djo             | bot              | human           | bot               |
| papeclaus       | human            | bot             | bot               |
| versadchikov    | human            | bot             | bot               |
| siatoshi        | human            | bot             | bot               |
| naqu            | human            | bot             | bot               |
| jenny1          | bot              | human           | bot               |
| sheva7.eth      | bot              | human           | bot               |
| noormuhammad    | human            | bot             | bot               |
| parviz8998      | human            | bot             | bot               |
| austilicious123 | human            | bot             | bot               |
| jinkyo          | bot              | human           | human             |
| escalord92      | human            | bot             | bot               |
| amircyber       | human            | bot             | bot               |
| babaika.eth     | human            | bot             | bot               |
| zach19          | human            | bot             | bot               |
| launch          | bot              | human           | bot               |
| cryptobeauty    | human            | bot             | bot               |
| lukichka        | human            | bot             | bot               |

After manually inspecting the changed labels (noisy values) listed above, about 70% of the changes make sense, meaning the SybilSCAR label matches the inspection result.
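The share of sensible changes is the fraction of inspected rows where the SybilSCAR label equals the inspection result. The sketch below shows that computation on the first four rows of the table only; the helper `agreement_rate` is a hypothetical name introduced for illustration.

```python
def agreement_rate(rows):
    """Share of rows where the SybilSCAR label equals the inspection result."""
    matches = sum(1 for _, sybilscar, inspected in rows if sybilscar == inspected)
    return matches / len(rows)


# First four rows of the table: (fname, SybilSCAR label, inspection result).
sample_rows = [
    ("fayiz", "human", "human"),
    ("ruslan", "human", "bot"),
    ("djo", "human", "bot"),
    ("papeclaus", "bot", "bot"),
]

rate = agreement_rate(sample_rows)
print(f"{rate:.0%}")
```

Passing every inspected row instead of this four-row subset gives the overall agreement figure.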

### 4. Save new bot_or_not_without_noises

In [10]:
# Keep only the non-noisy rows and drop the helper columns
bot_or_not_without_noises = df.filter(~pl.col("is_noisy"))[["fid","bot"]]
bot_or_not_without_noises

fid,bot
i64,bool
446097,false
3,false
8,false
12,false
2,false
…,…
280179,true
327500,true
428200,true
278549,true


In [11]:
bot_or_not_without_noises.write_parquet(f"{DATA_PATH}/interim/bot_or_not_without_noises.parquet")