# Booru Data Analysis
This notebook explores which parameters work well when downloading images from boorus to build datasets.

## Install dependencies

In [None]:
pip install pybooru

## Get Data
Here we make multiple queries to the booru and collect the metadata of every post that matches the tags we are searching for. This is the data we will be analyzing.
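The paging pattern can be sketched independently of the API. In this minimal sketch, `fetch_page` is a hypothetical stand-in for a call such as `client.post_list(...)`, and `max_pages` is an assumed safety cap that the original loop does not have.

```python
from typing import Callable, Dict, List

Post = Dict[str, object]


def collect_all_pages(fetch_page: Callable[[int], List[Post]], max_pages: int = 1000) -> List[Post]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    collected: List[Post] = []
    for page in range(1, max_pages + 1):
        posts = fetch_page(page)
        if not posts:
            break  # an empty page means we have passed the last page
        collected.extend(posts)
    return collected


# Stand-in for the real API call: two fake pages, then nothing.
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
all_posts = collect_all_pages(lambda page: fake_pages.get(page, []))
print(len(all_posts))  # 3
```

Keeping the fetch call behind a function makes it easy to try the loop on fake data before spending time on real requests.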

In [None]:
from pybooru import Danbooru

client = Danbooru('danbooru')
tags = "amiya_(arknights)  -rating:explicit -rating:questionable" #@param {'type':'string'}
metadata = []
page = 1
limit = 100
while True:
  posts = client.post_list(limit=limit, tags=tags, page=page)
  print(f"Page {page}: fetched {len(posts)} posts")
  if len(posts) == 0:
      print(f"On Page {page} for tags {tags}. Found no posts. Ending.")
      break # we have reached the last page since there are no results
  else:
      metadata.extend(posts)
      page += 1

print(f"Data collection for tags=\"{tags}\" completed. Found {len(metadata)} results.")

## Filtering data
Downloading the metadata may have taken some time, so we don't want to download everything again each time we adjust the filters. We leave the original metadata untouched and build a filtered copy instead.
We will filter it by
1. removing posts that contain unwanted tags
2. removing posts that don't meet a minimum score
3. removing posts that are not parents, if desired. This is useful because child posts tend to be very similar to their parent posts and are often edits.
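The three rules above can be sketched as a single predicate. This is a minimal sketch on made-up posts; `keep_post` is a hypothetical helper, and it assumes the post's `tag_string` is a space-separated list of tags. It compares whole tags rather than substrings, so a required tag of `solo` does not accidentally match `solo_focus`.

```python
def keep_post(post, must_have, unwanted, minimum_score, include_children=False):
    """Return True if the post passes every filter (hypothetical helper)."""
    post_tags = set(post["tag_string"].split())  # tags are space-separated
    if post["score"] < minimum_score:
        return False  # under the minimum score
    if not include_children and post.get("parent_id"):
        return False  # has a parent, so it is a child post
    if not set(must_have) <= post_tags:
        return False  # a required tag is missing
    if set(unwanted) & post_tags:
        return False  # an unwanted tag is present
    return True


# Made-up posts for illustration only.
solo_post = {"tag_string": "1girl solo animal_ears", "score": 20, "parent_id": None}
focus_post = {"tag_string": "2girls solo_focus", "score": 20, "parent_id": None}

print(keep_post(solo_post, {"solo"}, {"comic"}, minimum_score=9))   # True
print(keep_post(focus_post, {"solo"}, {"comic"}, minimum_score=9))  # False
print("solo" in focus_post["tag_string"])  # True: a substring test would wrongly accept it
```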


In [None]:
metadata2 = []
must_have_tags = "solo" #@param {'type': 'string'}
unwated_tags = "meme, amiya_(guard)_(arknights), amiya_(newsgirl)_(arknights), amiya_(fresh_fastener)_(arknights), amiya_(planter)_(arknights), monochrome, comic, sex, hetero, yuri, 2girls, 3girls, multiple_boys, multiple_girls, 1boy, chibi, 2boys, 3boys, english_text, multiple_views, japanese_text, chinese_text, translation_request, censored" #@param {'type': 'string'}
minimum_score = 9 #@param {'type': 'integer'}
include_children = False #@param {'type': 'boolean'}
acceptable_file_types = ["jpg", "jpeg", "png", "webp", "bmp"]
must_have_tags_list = must_have_tags.replace(" ", "").split(",")
unwated_tags_list = unwated_tags.replace(" ", "").split(",")

for post in metadata:
  wanted = True
  if post['is_deleted']:
    continue # we won't consider deleted posts
  if post['file_ext'] not in acceptable_file_types:
    continue # we won't consider non image based sources
  if not include_children and post['parent_id']:
    continue # reject because we don't want children and this is a child by virtue of having a parent
  if post['score'] < minimum_score:
    continue # reject for being under score
  # compare whole tags; a substring test would let "solo" match "solo_focus"
  post_tags = set(post['tag_string'].split())
  for wanted_tag in must_have_tags_list:
    if wanted_tag and wanted_tag not in post_tags:
      wanted = False # reject because the wanted tag is not present
      break
  for unwanted_tag in unwated_tags_list:
    if unwanted_tag and unwanted_tag in post_tags:
      wanted = False # reject for having unwanted tag
      break
  if wanted:
    metadata2.append(post)

print(f"Original metadata contained {len(metadata)} posts. Trimmed down to {len(metadata2)} posts. Removed {len(metadata) - len(metadata2)} posts.")
    

In [None]:
import numpy as np
from matplotlib import pyplot as plt
scores = [x["score"] for x in metadata2]
np_scores = np.asarray(scores)
plt.figure(figsize=(15, 5))
plt.boxplot(np_scores)
plt.title("Score")
percentiles = np.percentile(np_scores, [0, 25, 50, 75, 100]).astype(np.int64)
print(f"lowest score: {percentiles[0]}")
print(f"25th percentile score: {percentiles[1]}")
print(f"50th percentile score: {percentiles[2]}")
print(f"75th percentile score: {percentiles[3]}")
print(f"highest score: {percentiles[4]}")

## Detour
What do images at a particular score in our current filtered list look like?

In [None]:
import cv2
target_score = 9 #@param {'type':'integer'}
image_limit = 3 #@param {'type': 'integer'}

used = 0
for post in metadata2:
  if post["score"] == target_score and "file_url" in post: # some posts don't have a file_url
    !wget {post["file_url"]} -O "example"
    img = cv2.imread("example")
    img_cvt=cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.imshow(img_cvt)
    plt.show()
    print(f"Post {post['id']} with score {post['score']}")
    used+=1
    if used >= image_limit:
      break


## Partitioning our data
When training a LoRA we want images that cover our subject broadly. For example, if we wanted to train a LoRA for a character, we would want images tagged full body, or the model might never learn what kind of shoes they wear.

Sometimes certain types of images are a lot rarer than others so we want to break down our potential images into several categories so later we can set appropriate thresholds for determining which images we will keep for each category while maintaining a suitable number of images for training. 

**categories**: a comma-delimited list of space-separated tags.
Example: *full_body standing, cowboy_shot*
Each category will contain only posts that have all of the tags in the category (so the first category above holds only posts with both *full_body* and *standing*).

An image will be in only one category: the first category in the list whose condition it meets. So if your categories were *standing, standing full_body*, all the standing images would land in the first category, and the second category, *standing full_body*, would have no images because those are also *standing* images.

The *remaining* category holds everything else that didn't meet any category's criteria.
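The first-match rule described above can be sketched as follows. This is a minimal illustration on made-up tag sets; `assign_category` is a hypothetical helper, not part of the notebook.

```python
def assign_category(post_tags, categories):
    """Return the name of the first category whose tags are all present."""
    for category in categories:
        if set(category) <= post_tags:
            return " ".join(category)
    return "remaining"


# The ordering from the text: "standing" comes first and swallows every standing image.
broad_first = [["standing"], ["standing", "full_body"]]
print(assign_category({"standing", "full_body"}, broad_first))  # standing
print(assign_category({"sitting"}, broad_first))                # remaining

# Listing the narrower category first lets it receive posts.
narrow_first = [["standing", "full_body"], ["standing"]]
print(assign_category({"standing", "full_body"}, narrow_first))  # standing full_body
```

Because the first match wins, listing narrower categories before broader ones keeps the narrow ones from ending up empty.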

In [None]:
from collections import defaultdict
categories_dict = defaultdict(list)
categories = "violin, ascot jacket ring"  #@param {'type': 'string'}
category_tags = [x.strip().split(" ") for x in categories.split(",")]
print(f"Creating {len(category_tags) + 1} categories of {category_tags} and remaining")

for category in category_tags: # we set a default ahead of time so the order of dictionary is same as user entry
  categories_dict[" ".join(category)] = []

for post in metadata2:
  found_matching_category = False
  post_tags = set(post['tag_string'].split()) # compare whole tags, not substrings
  for category in category_tags:
    has_all_tags = all(tag in post_tags for tag in category)

    if has_all_tags:
      categories_dict[" ".join(category)].append(post)
      found_matching_category = True
      break # a post belongs only to the first category it matches

  if not found_matching_category:
    categories_dict["remaining"].append(post)

i=1
for key in categories_dict.keys():
  print(f"category {i}: {key} has {len(categories_dict[key])} posts")
  i+=1   


## Visualize Categories
Let's break down the number of posts and the score distribution in each category.

In [None]:
categories_keys = []
categories_scores = []
for key in categories_dict.keys():
  scores = [post['score'] for post in categories_dict[key]]
  num_images = len(scores)
  np_scores = np.asarray(scores)
  percentiles = np.percentile(np_scores, [0, 25, 50, 75, 100]).astype(np.int64)
  categories_keys.append(key + f"\nimages:{num_images}" + f"\nlowest score:{percentiles[0]}" + f"\n25th percentile score:{percentiles[1]}" + f"\n50th percentile score:{percentiles[2]}" + f"\n75th percentile score:{percentiles[3]}" + f"\nhighest score:{percentiles[4]}")
  categories_scores.append(np_scores)
plt.figure(figsize=(25, 15))
# pass the list of arrays directly; categories differ in size, so they cannot form one rectangular array
plt.boxplot(categories_scores, labels=categories_keys, showfliers=False)
plt.title("Categories and scores")
plt.xlabel("Categories")
plt.ylabel("Score")
plt.show()

## Per category score thresholding
For each category we will individually assign a score threshold for what is accepted. Score isn't truly the best measure of how well an image will serve as training data for an accurate representation of our subject, but we can be fairly confident that higher-rated images are drawn better, and aesthetics are important for our dataset.

**scores_for_each_category**: a list of minimum scores, one per category. A post must be at or above its category's score to be kept. The list should contain as many numbers as you had categories in the previous section, including the *remaining* category, in the order you specified, with *remaining* last.
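A minimal sketch of this thresholding on made-up data follows. `apply_thresholds` is a hypothetical helper; the length check is an added assumption that turns a mismatch between categories and scores into an explicit error.

```python
def apply_thresholds(categories, minimum_scores):
    """Keep, per category, only the posts at or above that category's minimum score."""
    if len(categories) != len(minimum_scores):
        raise ValueError(
            f"expected {len(categories)} scores (one per category), got {len(minimum_scores)}"
        )
    kept = {}
    # dicts preserve insertion order, so scores pair up with categories as entered
    for (name, posts), minimum in zip(categories.items(), minimum_scores):
        kept[name] = [post for post in posts if post["score"] >= minimum]
    return kept


# Made-up posts for illustration only.
example = {
    "violin": [{"id": 1, "score": 8}, {"id": 2, "score": 15}],
    "remaining": [{"id": 3, "score": 30}, {"id": 4, "score": 10}],
}
kept = apply_thresholds(example, [9, 24])
print({name: [post["id"] for post in posts] for name, posts in kept.items()})
# {'violin': [2], 'remaining': [3]}
```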

In [None]:
scores_for_each_category = "9, 12, 24" #@param {'type': 'string'}
scores_for_each_category_list = [int(x.strip()) for x in scores_for_each_category.split(",")]
print("Thresholding each category with scores:")
i = 0
for key in categories_dict.keys():
  print(f"    {key:<30} score: {scores_for_each_category_list[i]} ")
  i+=1

threshholded_categories_dict = defaultdict(list)
i = 0
for key in categories_dict.keys():
  required_score_for_category = scores_for_each_category_list[i]
  for post in categories_dict[key]:
    if post['score'] >= required_score_for_category:
      threshholded_categories_dict[key].append(post)
  i+= 1

print(f"Before thresholding:")
i=1
for key in categories_dict.keys():
  print(f"    category {i}: {key} has {len(categories_dict[key])} posts")
  i+=1   
# report thresholding results
print(f"After thresholding:")
i=1
for key in threshholded_categories_dict.keys():
  print(f"    category {i}: {key} has {len(threshholded_categories_dict[key])} posts")
  i+=1   

## Download data
Now we will download all the posts that remain after thresholding.
Each post will be downloaded into a directory for the category.
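The resulting layout can be sketched without touching the network. In this minimal sketch, `output_paths` and `training_caption` are hypothetical helpers and the post is made up. The caption format assumes tags are joined with commas and underscores become spaces, which is what the `process_tags_for_training_format` option does.

```python
from pathlib import Path


def training_caption(prepend_tags, tag_string):
    """Turn space-separated booru tags into a comma-separated caption."""
    tags = f"{prepend_tags} {tag_string}".split()
    return ", ".join(tag.replace("_", " ") for tag in tags)


def output_paths(directory, category, post):
    """Files written for one post: the image, its caption, and its raw metadata."""
    folder = Path(directory) / category
    stem = str(post["id"])
    return {
        "image": folder / f"{stem}.{post['file_ext']}",
        "caption": folder / f"{stem}.txt",
        "metadata": folder / f"{stem}.json",
    }


# Made-up post for illustration only.
post = {"id": 123, "file_ext": "png", "tag_string_general": "1girl animal_ears"}
paths = output_paths("/content/data", "remaining", post)
caption = training_caption("amiya_(arknights)", post["tag_string_general"])
print(paths["image"].name)  # 123.png
print(caption)              # amiya (arknights), 1girl, animal ears
```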


In [None]:
import json
import requests
import shutil
from pathlib import Path

directory = "/content/data/" #@param {'type': 'string'}
prepend_tags = "amiya_(arknights)" #@param {'type': 'string'}
process_tags_for_training_format = True #@param {'type': 'boolean'}
download_images = True #@param {'type': 'boolean'}

Path(directory).mkdir(parents=True, exist_ok=True)

metadata_json = json.dumps(metadata)
with open(f'{directory}/metadata.json','w') as f:
  f.write(metadata_json)

for key in threshholded_categories_dict.keys():
  Path(f"{directory}/{key}").mkdir(exist_ok=True)
  for post in threshholded_categories_dict[key]:
    header = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
    if not "file_url" in post:
      print(f"post {post['id']} has no file_url")
      continue # some posts don't have a file_url
    file_url = post["file_url"]
    file_name = post["id"]
    file_extension = post["file_ext"]
    tag_string = post["tag_string_general"]

    file_path = Path(f'{directory}/{key}')
    file_path.mkdir(exist_ok=True)

    # Write metadata
    p = Path(file_path, f'{file_name}.json')
    p.write_text(json.dumps(post))
    # Write tags
    tags = prepend_tags + " " + tag_string
    if process_tags_for_training_format:
      tags = tags.replace(" ", ", ").replace("_", " ")
    Path(file_path, f'{file_name}.txt').write_text(tags)

    if download_images:
      r = requests.get(file_url, stream = True, headers=header)
      if r.status_code == 200:
          # Set decode_content value to True, otherwise the downloaded image file's size will be zero.
          r.raw.decode_content = True
          Path(file_path, f'{file_name}.{file_extension}').write_bytes(r.raw.data)
      else:
          print('Image couldn\'t be retrieved,', file_url, post)
    


# Package up data
Zip up the data for download. You can download directly from Colab (might be slow), move the contents to a mounted Google Drive, or upload to Hugging Face via git lfs (implement it yourself; this cell only supports direct download from Colab at the moment).
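Outside Colab, where the `zip` command may not be available, the same packaging can be sketched with the standard library. This is a minimal sketch on a temporary directory with a made-up file; the paths are placeholders, not the notebook's real data.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the data directory.
    data_dir = Path(tmp) / "data"
    (data_dir / "remaining").mkdir(parents=True)
    (data_dir / "remaining" / "123.txt").write_text("amiya (arknights), 1girl")

    # Creates <tmp>/data.zip containing the "data" folder and everything in it.
    archive_path = shutil.make_archive(
        str(Path(tmp) / "data"), "zip", root_dir=tmp, base_dir="data"
    )

    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print("data/remaining/123.txt" in archived_names)  # True
```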

In [None]:
directory = "/content/data/" #@param {'type': 'string'}
zip_file = "/content/data.zip" #@param {'type': 'string'}
!zip -r {zip_file} {directory}

# Upload to Hugging Face for later consumption

## Install dependencies

In [None]:
pip install huggingface_hub

## Login

In [None]:
import huggingface_hub
huggingface_hub.notebook_login()
huggingface_hub.whoami()

## Upload to Hugging Face
Creates a Hugging Face dataset repo if it doesn't exist, then uploads the zip to it.

In [None]:
from huggingface_hub import create_repo, upload_file
from huggingface_hub.utils import HfHubHTTPError

zip_file = "/content/data.zip" #@param {'type': 'string'}
huggingface_database_repo_name = "breakcore2/amiya_arknights" #@param {'type': 'string'} 
create_new_repro_if_not_exist = True #@param {'type': 'boolean'}
make_repro_private = True #@param {'type': 'boolean'}

if create_new_repro_if_not_exist:
  # exist_ok=True makes this a no-op when the repo already exists; other HTTP errors still surface
  create_repo(huggingface_database_repo_name, repo_type="dataset", private=make_repro_private, exist_ok=True)

upload_file(
    path_or_fileobj=zip_file,
    path_in_repo="data.zip",
    repo_id=huggingface_database_repo_name,
    repo_type="dataset"
)

# Clean up Data
After inspecting your data, you may find images you don't want in your dataset. Here we delete them; afterwards you can rerun the zip cell and re-upload to Hugging Face.
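The deletion step can be tried safely on a temporary directory first. In this minimal sketch, `delete_post_files` is a hypothetical helper and the file names are made up. It assumes each post's files share the post ID as their stem, as in the download step.

```python
import tempfile
from pathlib import Path


def delete_post_files(root, post_ids):
    """Delete every file named <post_id>.<anything> under root; return what was removed."""
    deleted = []
    for post_id in post_ids:
        for path in sorted(root.rglob(f"{post_id}.*")):
            path.unlink()
            deleted.append(path)
    return deleted


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "remaining").mkdir()
    for name in ("111.png", "111.txt", "111.json", "222.png"):
        (root / "remaining" / name).write_text("x")

    removed = delete_post_files(root, ["111"])
    removed_names = sorted(path.name for path in removed)
    survivors = sorted(path.name for path in root.rglob("*.*"))

print(removed_names)  # ['111.json', '111.png', '111.txt']
print(survivors)      # ['222.png']
```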

In [None]:
from pathlib import Path

unwanted_post_ids = "4015097, 4492858, 4525492, 4953900" #@param {'type': 'string'}
directory = "/content/data/" #@param {'type': 'string'}
unwanted_post_ids_list = [x for x in unwanted_post_ids.replace(" ", "").split(",")]

path = Path(directory)
for unwanted_post_id in unwanted_post_ids_list:
  unwanted_files = list(path.rglob(f'{unwanted_post_id}.*'))
  for unwated_file in unwanted_files:
    print(f"deleting {unwated_file}")
    unwated_file.unlink() # delete with pathlib instead of shelling out to rm

