This script processes JSON files from the Serper.dev Playground and writes the combined data as JSON, CSV, or URL CSV, filtering out specific TLDs and domains/strings and removing duplicates.
I made it for friends who want to use the Playground and get data quickly, without having to boot up the API.
- Python 3
- pandas
- Clone the repository:
git clone https://github.com/garetharnold/serparse.git
cd serparse
- Install the required packages:
pip3 install pandas
- (OPTIONAL) Download the tlds.json file from the following URL and place it in the same directory as serparse.py:
- You can create your own tlds.json file for TLD filtering:
{
"blocked_tlds": [
{ "tld": ".xyz", "description": "Commonly associated with spam and low-quality emails." },
{ "tld": ".top", "description": "Commonly used by spammers." },
{ "tld": ".win", "description": "Frequently used for spam and fraudulent activities." },
{ "tld": ".vip", "description": "Known for a high volume of spam." },
{ "tld": ".click", "description": "Often used in phishing and spam emails." }
]
}
- You can create your own blacklist.json file for domain or string filtering:
{
"blocked_domains": [
"amazon.com",
"ebay.com",
"example.com",
"string",
"news",
"wikipedia"
]
}
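As a rough illustration of how the two filter files could be applied, here is a minimal sketch. It is not taken from serparse.py: the helper names `load_filters` and `is_blocked` are hypothetical, and the matching rules (host name ends with a blocked TLD, URL contains a blocked string) are assumptions based on the file layouts above.

```python
import json
from urllib.parse import urlparse


def load_filters(tlds_text, blacklist_text):
    """Parse the two filter files (given as JSON text) into plain lists."""
    tlds = [entry["tld"].lower() for entry in json.loads(tlds_text)["blocked_tlds"]]
    blocked = [s.lower() for s in json.loads(blacklist_text)["blocked_domains"]]
    return tlds, blocked


def is_blocked(url, tlds, blocked):
    """True if the host ends in a blocked TLD or the URL contains a blocked string.

    Assumption: blacklist entries are matched as case-insensitive substrings.
    """
    host = (urlparse(url).hostname or "").lower()
    if any(host.endswith(tld) for tld in tlds):
        return True
    return any(s in url.lower() for s in blocked)


# Inline stand-ins for tlds.json and blacklist.json so the sketch runs on its own.
TLDS_JSON = '{"blocked_tlds": [{"tld": ".xyz", "description": "spam"}]}'
BLACKLIST_JSON = '{"blocked_domains": ["amazon.com", "wikipedia"]}'

tlds, blocked = load_filters(TLDS_JSON, BLACKLIST_JSON)
for url in (
    "https://shop.example.xyz/page",
    "https://en.wikipedia.org/wiki/Python",
    "https://docs.python.org/3/",
):
    print(url, is_blocked(url, tlds, blocked))
```

Substring matching is why a short entry such as "news" removes a lot: under this assumption it would match any URL containing those letters, not only domains named "news".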
To run the script:
python3 serparse.py -i filename1.json filename2.json
Replace filename1.json and filename2.json with your actual JSON files. With no -o option, the output is written in the default JSON format.
- Default: JSON
- CSV: -o csv
- URL CSV: -o urls
python3 serparse.py -i input1.json input2.json -o csv
This will process the input files, filter out unwanted TLDs and domains/strings, remove duplicates, and save the output in CSV format with a timestamped filename.
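The duplicate removal and timestamped filename described above could look roughly like the following. This is a sketch under stated assumptions, not the script's actual code: it uses only the standard library rather than pandas, the `link` field name, the stripping of a leading `www.`, and the filename pattern are all hypothetical, and `dedupe_by_domain` and `timestamped_name` are illustrative names.

```python
from datetime import datetime
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host name with a leading 'www.' stripped (an assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def dedupe_by_domain(rows):
    """Keep the first row seen for each domain; return (kept, removed)."""
    seen, kept, removed = set(), [], []
    for row in rows:
        domain = domain_of(row["link"])  # "link" is a hypothetical field name
        if domain in seen:
            removed.append(row)  # candidates for the log file
        else:
            seen.add(domain)
            kept.append(row)
    return kept, removed


def timestamped_name(prefix, extension, now=None):
    """Build a name like prefix_YYYYMMDD_HHMMSS.ext (the pattern is an assumption)."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.{extension}"


rows = [
    {"link": "https://www.example.org/a"},
    {"link": "https://example.org/b"},
    {"link": "https://docs.python.org/3/"},
]
kept, removed = dedupe_by_domain(rows)
print(len(kept), len(removed))
print(timestamped_name("output", "csv"))
```

Returning the removed rows alongside the kept ones makes it straightforward to record what was dropped.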
- Input Files: The script accepts multiple JSON input files specified with the -i or --input option.
- Output Format: The output format can be specified with the -o or --output option. Supported formats are csv, urls, and json (default).
- Filtering: The script filters out entries based on TLDs specified in tlds.json and domains/strings specified in blacklist.json.
- Duplicate Removal: Duplicates are removed based on the domain.
- Logging: The script generates a log file that records which entries were removed due to duplicates or filtering.
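For readers curious how the options listed above map onto a command-line parser, here is a minimal argparse sketch. The flag names and format choices come from this README; everything else (the `build_parser` name, help strings, making -i required) is an assumption and may differ from serparse.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -i/--input and -o/--output options."""
    parser = argparse.ArgumentParser(
        description="Combine and filter Serper.dev Playground JSON files."
    )
    parser.add_argument(
        "-i", "--input", nargs="+", required=True,
        help="One or more JSON input files.",
    )
    parser.add_argument(
        "-o", "--output", choices=["json", "csv", "urls"], default="json",
        help="Output format (default: json).",
    )
    return parser


# Equivalent to: python3 serparse.py -i input1.json input2.json -o csv
args = build_parser().parse_args(["-i", "input1.json", "input2.json", "-o", "csv"])
print(args.input, args.output)

# Omitting -o falls back to the default format.
default_args = build_parser().parse_args(["-i", "input1.json"])
print(default_args.output)
```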