<a href="https://colab.research.google.com/github/geo-tp/Keras-Colaboratory-Models/blob/main/WoW_OCR_Parsing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `wow-ocr` : Parsing screenshots to texts

- Get files urls and hashes from Alpha Archive API endpoint /file/
- Download and parse screenshots with wow-ocr to extract texts
- Save results to Alpha Archive API endpoint /image-text/

## INSTALL

In [None]:
!pip install wow-ocr

## CONFIG API

In [30]:
# Used to save results on Alpha Archive

IMAGE_HASH_URL = "https://archive.thealphaproject.eu/api/v1/image-text/"
ARCHIVE_URL = "https://archive.thealphaproject.eu/api/v1/files/?parent={}&ordering=filename"
HEADERS = {'Content-Type':'application/json', "Accept": "application/json"}
TOKEN = "1kjze876ezf76zse68mlczKDe2"

## FOLDERS

In [31]:
# Alpha archive folders that we want to parse

SORTED_FOLDERS = [
    "Darnassus",
    "Ironforge",
    "Orgrimmar",
    "Stormwind",
    "Thunder Bluff",
    "Undercity",
    "Alterac Mountains",
    "Arathi Highlands",
    "Badlands",
    "Blackrock Moutain",
    "Blasted Lands",
    "Burning Steppes",
    "Deadwind Pass",
    "Deeprun Tram",
    "Dun Morogh",
    "Duskwood",
    "Eastern Plaguelands",
    "Elwynn Forest",
    "Hilsbrad Foothills",
    "Hinterlands",
    "Loch Modan",
    "Redridge Mountains",
    "Searing Gorge",
    "Silverpine Forest",
    "Stranglethorn Vale",
    "Swamp of Sorrows",
    "Tirisfal Glades",
    "Western Plaguelands",
    "Westfall",
    "Wetlands",
    "Ashenvale",
    "Azshara",
    "Barrens",
    "Darkshore",
    "Desolace",
    "Durotar",
    "Dustwallow Marsh",
    "Felwood",
    "Feralas",
    "Hyjal",
    "Mulgore",
    "Silithus",
    "Stonetalon Mountains",
    "Tanaris",
    "Teldrassil",
    "Thousand Needles",
    "Un'goro",
    "Winterspring",
    "Blackrock Depths",
    "Ragefire",
    "Stratholme",
    "Zul'Farak",
    "Uldaman",
    "Scholomance",
    "Razorfen Downs",
    "Razorfen Kraul",
    "Walling Caverns",
    "Deadmines",
    "Sunken Temple",
    "Scarlet Monastery",
    "Gnomeregan",
    "Shadowfang Keep",
    "Lower Blackrock Spire",
    "Stockades",
    "Molten Core", 
    "Onyxia's Lair", 
    "Karazhan", 
    "Blackwing Lair", 
    "Dustwallow Marsh", 
    "Redridge Mountains", 
    "Elwynn Forest", 
    "Blackrock Mountain", 
    "Interface", 
    "Magazine", 
]

UNSORTED_FOLDERS = [
    "Computer Gaming World - Oct 2003", 
    "Computer Gaming World - Nov 2001", 
    "koaworld uploads", 
    "www_taurentotem_com", 
    "www_war3_com", 
    "from_alpha_archive_website_02043022", 
    "from_alpha_archive_19122022_and_koaworld", 
    "www_malador__com", 
    "ign_alpha_beta", 
    "langley beta", 
    "from_alpha_archive_website_0112022", 
    "from_alpha_archive_website_26032022", 
    "www_blizzplanet_com",
    "World Of Warcraft - Dossier Jeuxvideo-com - 05-09-2021 02-17-25",
    "www_gamekult_com/Preview - World of Warcraft - nos impressions - Gamekult - 05-09-2021 01-59-44", 
    "www_imgur_com", 
    "www_kevinho_net", 
    "www_thebigguild_org", 
    "from_alpha_archives_07082022", 
    "gamespy_alpha_beta", 
    "image_gamespot_com", 
    "www_azeroth-times_net", 
    "www_ign_com", 
    "www_worldofwar_net", 
    "www_wowzard_com", 
    "wow_warcry_com", 
    "wow_pimpbunnies_com", 
    "pc_mobygames_com", 
    "www_warbucket_com", 
    "gamespy_alpha_beta", 
    "www_planetwarcraft_com", 
    "image_gamespot_com", 
    "from_alpha_archives_102022", 
    "from_alpha_archive_03012023", 
    "www_ultimategamers_com"
    "from_alpha_archive_10042023"
]

## FILES

API FILE OBJECT 
```
    parent = models.CharField(max_length=1024)
    filename = models.CharField(max_length=1024)
    is_folder = models.BooleanField(default=False)
    image_raw = models.ImageField(blank=True, null=True, max_length=4096)
    image_thumbnail = models.ImageField(blank=True, null=True)
    image_hash = models.CharField(max_length=256, blank=True, null=True)
```

API IMAGE TEXT OBJECT
```
    image_hash = models.CharField(max_length=255, primary_key=True)
    easy_ocr_content = models.TextField(blank="True", default="")
    wow_ocr_content = models.TextField(blank="True", default="")
```

## GET IMAGE HASHES ALREADY DONE

In [32]:
import requests

# We get list of all image text object
response = requests.get(IMAGE_HASH_URL, headers=HEADERS)
image_texts = response.json()

# if wow_ocr_content exists, this file is already done
HASHES_ALREADY_DONE = [
    image_text["image_hash"] 
    for image_text in image_texts 
    if image_text["wow_ocr_content"]
]


In [33]:
print(len(HASHES_ALREADY_DONE), "files have been succesfully parsed")

8122 files have been succesfully parsed


## GET IMAGES TO PARSE

In [34]:
# Load API files, saves urls and hashes

FOLDERS = UNSORTED_FOLDERS

# wow-ocr is not supporting .gif
VALID_EXTS = ["jpg", "JPG", "png", "PNG", "jpeg", "JPEG"]

file_urls = []
file_hashes = []

for folder in FOLDERS:
  res = requests.get(ARCHIVE_URL.format(folder))
  api_files = res.json();

  for file_ in api_files:
    url = file_["image_raw"]
    if not url:
      continue

    image_hash = file_["image_hash"]
    if image_hash in HASHES_ALREADY_DONE:
      continue

    ext = url.split(".").pop()
    if ext in VALID_EXTS:
      file_urls.append(url)
      file_hashes.append(image_hash)


In [35]:
print(len(file_urls), "files will be parsed")

3809 files will be parsed


## INITIALIZE MODEL

In [36]:
import wow_ocr

# default weights fine tuned for WoW Screenshots
pipeline = wow_ocr.pipeline.Pipeline()

Looking for /root/.keras-ocr/detector_craft_mlt_25k.h5
Looking for /root/.keras-ocr/recognizer_wow_ocr.h5


## PARSE SCREENSHOTS AND SAVE RESULTS

In [None]:
STEP = 3 # we use a small step to avoid OOM

for i in range(0, len(file_urls), STEP):

  print("STARTING URL SLICE", i," TO", i+STEP, "OF", len(file_urls))

  # read screenshots from file_urls slice
  images = [
      wow_ocr.tools.read(url)
      for url in file_urls[i:i+STEP]
  ]

  prediction_groups = pipeline.recognize(images)
  # # Each list of predictions in prediction_groups is a list of
  # # (word, box) tuples.

  for index, pred in enumerate(prediction_groups):

      # we save extracted words as plain text
      text = " ".join([word for word, boxes in pred])
      #  we add  '§' as 'no content'
      text = text if text else "§" 
      file_hash = file_hashes[i+index]

      # format api message
      msg = {
          "token": TOKEN, 
          "image_hash": file_hash, 
          "wow_ocr_content": text
      }
      print(file_hash, "|", text)

      # sending
      res = requests.put(IMAGE_HASH_URL+file_hash+"/", json=msg, headers=HEADERS)
      print(res)


STARTING URL SLICE 0  TO 3 OF 3809




[1;30;43mLe flux de sortie a été tronqué et ne contient que les 5000 dernières lignes.[0m
280c7c7c0c0c0010 | word map centnans 3os ancnod snanp ol sorows loom out rignt cilch on zoon map to our l wo ae s ssor rastos s lo o hrdigh s towds
<Response [200]>
STARTING URL SLICE 1455  TO 1458 OF 3809
7e210000c5ff3f08 | thebigguild org duskwood dogmatix mrionn es rueror nen filzer tueo ulriker unriker napcess cragnarak hordes bi tneforsak shapeless gn srucrusher lessiensler cedrich oerance zerg ole nce the horde thunderbane saknd the hdnde ether greensleves delfinas loungors slurking rde sthe horder nau brego sthe oner barando tolerance lero merei horde in hundebane ackeron the hordes sthe nune chat skullcrusher yells raen skullcrusher bursts into dance 2 dogmatx raid hes lol sleeping greensleeves nightowl with dances 2 raidj delfinast here for bit get the back everyone wailt a people 2 ackeron daaancel raid imrickj bursts ames into dance ackeron with delethor dances cnen1 1d ggnl n
<Respon