# UIdealist

## We will use the Quick, Draw! dataset, released by Google, to build a classifier for hand-drawn images

Images drawn by users in UIdealist will be classified into object types, which should make it easier for additional models to generate further ideas.

# 1. Loading data from gcloud repository

To start, we load only a few data files; as compute resources increase, we will be able to load more.

In [17]:
# Libraries needed for reading XML data
# (BeautifulSoup's "xml" parser relies on lxml)
# (Just in case they're not already installed)
!pip install beautifulsoup4 lxml



In [44]:
# The bucket's base URL returns a master XML file that lists all of the
# public dataset files. Reading it over plain HTTP avoids the gcloud
# storage issues that come from needing to create a gcloud project

import requests

base_url = "https://storage.googleapis.com/quickdraw_dataset"

xml_file = requests.get(base_url).content.decode()

In [45]:
# Parse XML file
from bs4 import BeautifulSoup

xml_data = BeautifulSoup(xml_file, "xml")
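
As a dependency-free alternative to the BeautifulSoup parse above, the same kind of listing can probably be read with Python's standard-library `xml.etree.ElementTree`. This is only a minimal sketch, not the method this notebook uses: `sample_xml` is a hypothetical, shortened listing whose shape and namespace are assumptions, and `local_name` is a helper introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened listing (assumed shape; not real bucket output)
sample_xml = """<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
  <Name>quickdraw_dataset</Name>
  <Contents><Key>full/raw/cat.ndjson</Key></Contents>
  <Contents><Key>full/simplified/cat.ndjson</Key></Contents>
</ListBucketResult>"""


def local_name(tag):
    # ElementTree reports namespaced tags as "{namespace}Name";
    # keep only the part after the closing brace
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(sample_xml)

# Collect the text of every <Key> element, whatever its namespace
sample_keys = [el.text for el in root.iter() if local_name(el.tag) == "Key"]

print(sample_keys)
```

Matching on the local tag name keeps the sketch independent of the exact namespace string the server sends.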

In [46]:
# Get all available files ending with the .ndjson extension
import re

keys = [key.contents[0] for key in xml_data.find_all('Key')]
# Raw string, escaped dot and end anchor, so only real .ndjson keys match
matcher = r"full/raw/(.*)\.ndjson$"

filenames = []

for key in keys:
  result = re.search(matcher, key)

  if result is not None:
    filenames.append(result.group(1))

print(filenames[:3])

['The Eiffel Tower', 'The Great Wall of China', 'The Mona Lisa']
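
The filename extraction above can be tried on its own with a few made-up keys. The sketch below is illustrative only: the entries in `sample_keys` are hypothetical, and it compares a pattern with an unescaped dot against one that escapes the dot and anchors the match at the end of the key.

```python
import re

# Hypothetical keys, chosen to show which ones each pattern accepts
sample_keys = [
    "full/raw/The Eiffel Tower.ndjson",
    "full/simplified/cat.ndjson",   # different folder, should be skipped
    "full/raw/cat.ndjson.bak",      # wrong extension, should be skipped
]

loose = re.compile("full/raw/(.*).ndjson")      # "." matches any character
strict = re.compile(r"full/raw/(.*)\.ndjson$")  # literal dot, anchored at the end

loose_names = [m.group(1) for key in sample_keys if (m := loose.search(key))]
strict_names = [m.group(1) for key in sample_keys if (m := strict.search(key))]

print(loose_names)
print(strict_names)
```

With these sample keys the loose pattern also accepts the `.bak` entry, while the strict one keeps only `The Eiffel Tower`.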


In [55]:
# Download the data for the given filenames into pandas dataframes,
# optionally saving a local copy of each file
# A simplified version of the raw data lives under /full/simplified/<filename>.ndjson
import pandas as pd
import urllib.parse

num_files = 10
target_directory = "."
save_locally = False

dataframes = []

for file in filenames[:num_files]:
  # Category names can contain spaces, so percent-encode them for the URL
  url = f"{base_url}/full/simplified/{urllib.parse.quote(file)}.ndjson"

  df = pd.read_json(url, lines=True)
  dataframes.append(df)

  if save_locally:
    df.to_json(f"{target_directory}/{file}.ndjson", lines=True, orient="records")
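
Category names such as `The Eiffel Tower` contain spaces, which is why the download loop passes each name through `urllib.parse.quote`. A small sketch of that step in isolation, with no network access; `simplified_url` is a hypothetical helper name used only here:

```python
from urllib.parse import quote

base_url = "https://storage.googleapis.com/quickdraw_dataset"


def simplified_url(name):
    # Percent-encode the category name so spaces become %20
    return f"{base_url}/full/simplified/{quote(name)}.ndjson"


example_url = simplified_url("The Eiffel Tower")
print(example_url)
```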

# 2. Explore dataset info and condense the datasets into a single dataframe with class annotations

In [56]:
dataframes[0].describe()

       key_id
count  134801.0
mean   5628867000000000.0
std    649235100000000.0
min    4503652000000000.0
25%    5067949000000000.0
50%    5626143000000000.0
75%    6187254000000000.0
max    6755398000000000.0
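
One possible way to condense the per-category dataframes into a single annotated dataframe is sketched below. It uses toy stand-ins rather than the real data, and it assumes each dataframe has a `word` column holding the category name; the integer `label` column is an illustrative addition, not something the dataset provides.

```python
import pandas as pd

# Toy stand-ins for the per-category dataframes (hypothetical rows)
toy_frames = [
    pd.DataFrame({"key_id": [1, 2], "word": ["cat", "cat"]}),
    pd.DataFrame({"key_id": [3], "word": ["dog"]}),
]

# Stack all categories into one dataframe with a fresh 0..n-1 index
combined = pd.concat(toy_frames, ignore_index=True)

# Map each category name to an integer class id (categories sorted by name)
combined["label"] = combined["word"].astype("category").cat.codes

print(combined)
```

Integer codes derived this way depend on which categories are present, so the mapping from name to code should be stored alongside any model trained on it.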
