## Initial Data Exploration: Kuzushiji

Once you've identified a use case and data set, it is time to get familiar with the data. In the process model this task is called Initial Data Exploration. Please take a minute or two to (re)visit the following lecture:

https://www.coursera.org/learn/data-science-methodology

Module 2 - Data Understanding

Please also revisit:

http://coursera.org/learn/ds

Module 3 - Mathematical Foundations and Module 4 - Visualizations

Given the lectures above, please create statistics and visualizations on your data set to identify good columns for modeling, spot potential data quality issues, and anticipate any feature transformations that may be necessary.

Create a Jupyter notebook where you document your code and include visualizations as the first deliverable. Please also stick to the naming conventions explained in the process model manual.

So, the most important reasons / steps are:

- Identify quality issues (e.g. missing values, wrong measurements, …)
- Assess feature quality: how relevant is a certain measurement (e.g. use a correlation matrix)
- Get an idea of the value distribution of your data using statistical measures and visualizations
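The three steps above can be sketched on any image array. The example below is a minimal sketch that uses a small synthetic array in place of the real data; the names `images` and `labels` and the injected missing value are assumptions made for illustration only.

```python
import numpy as np

# Synthetic stand-in for a flattened image set (assumption):
# 100 images of 784 pixels each, with labels from 10 classes.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 784)).astype(np.float64)
labels = rng.integers(0, 10, size=100)

# Inject one missing value so the quality check has something to find.
images[3, 10] = np.nan

# 1. Quality issues: count missing values and pixels outside 0..255.
n_missing = int(np.isnan(images).sum())
n_out_of_range = int(((images < 0) | (images > 255)).sum())

# 2. Feature quality: correlation between mean brightness and the label.
mean_brightness = np.nanmean(images, axis=1)
corr = np.corrcoef(mean_brightness, labels)[0, 1]

# 3. Value distribution: summary statistics over all pixels.
summary = {
    "min": float(np.nanmin(images)),
    "max": float(np.nanmax(images)),
    "mean": float(np.nanmean(images)),
    "std": float(np.nanstd(images)),
}

print("missing values:", n_missing)
print("out-of-range pixels:", n_out_of_range)
print("brightness/label correlation:", corr)
print(summary)
```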

In [None]:
!pip install tensorflow
!pip install seaborn==0.11.1
!pip install Pillow
!pip install python-mnist
!pip install pyspark



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import image
import tensorflow as tf
import seaborn as sns
from mnist import MNIST
import numpy as np
import PIL
from PIL import Image
import os
import matplotlib.image as mping
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from numpy import asarray
import pandas as pd



In [None]:
tf.__version__



In [None]:
sns.__version__



In [None]:
PIL.__version__



In [None]:
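# Start a Spark session.
# Leave these lines commented out in a managed Spark environment, where
# `spark` is already defined; uncomment them when running locally.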
#sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

#spark = SparkSession \
#    .builder \
#    .getOrCreate()



In [None]:
# enable Arrow, which speeds up converting a pandas dataframe into a pyspark dataframe
# (Spark 3.x names this setting spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled","true")



In [None]:
!wget http://codh.rois.ac.jp/kmnist/dataset/kmnist/train-images-idx3-ubyte.gz?raw=True
!mv train-images-idx3-ubyte.gz?raw=True train-images-idx3-ubyte.gz
!gunzip train-images-idx3-ubyte.gz
!ls -lahr train-images-idx3-ubyte

!wget http://codh.rois.ac.jp/kmnist/dataset/kmnist/train-labels-idx1-ubyte.gz?raw=True
!mv train-labels-idx1-ubyte.gz?raw=True train-labels-idx1-ubyte.gz
!gunzip train-labels-idx1-ubyte.gz
!ls -lahr train-labels-idx1-ubyte

!wget http://codh.rois.ac.jp/kmnist/dataset/kmnist/t10k-images-idx3-ubyte.gz?raw=True
!mv t10k-images-idx3-ubyte.gz?raw=True t10k-images-idx3-ubyte.gz
!gunzip t10k-images-idx3-ubyte.gz
!ls -lahr t10k-images-idx3-ubyte

!wget http://codh.rois.ac.jp/kmnist/dataset/kmnist/t10k-labels-idx1-ubyte.gz?raw=True
!mv t10k-labels-idx1-ubyte.gz?raw=True t10k-labels-idx1-ubyte.gz
!gunzip t10k-labels-idx1-ubyte.gz
!ls -lahr t10k-labels-idx1-ubyte



In [None]:
url = "http://codh.rois.ac.jp/kmnist/dataset/kmnist/kmnist_classmap.csv"
df_classmap = pd.read_csv(url)
df_classmap.head(11)



In [None]:
# let's add the sound the character makes for non-Japanese speakers
# just to clarify the function of these characters in spoken Japanese

phonetic = ['o','ki','su','tsu','na','ha','ma','ya','re','wo']
df_classmap['phonetic'] = phonetic
df_classmap



In [None]:
!mkdir kmnistdata



In [None]:
!ls -al



In [None]:
!cp t10k-images-idx3-ubyte kmnistdata/t10k-images-idx3-ubyte
!cp t10k-labels-idx1-ubyte kmnistdata/t10k-labels-idx1-ubyte
!cp train-images-idx3-ubyte kmnistdata/train-images-idx3-ubyte
!cp train-labels-idx1-ubyte kmnistdata/train-labels-idx1-ubyte



In [None]:
!ls -al kmnistdata



In [None]:
data = MNIST('kmnistdata')
train_images, train_labels = data.load_training()
test_images, test_labels = data.load_testing()



In [None]:
print(train_labels[0])
print(train_images[0])



In [None]:
print(train_labels[1])
print(train_images[1])



In [None]:
type(train_labels)



In [None]:
type(train_images)



In [None]:
train_labels.typecode



In [None]:
# the following output is (address, length) giving current memory address
# and length in elements of the buffer used to hold the array's
# contents

train_labels.buffer_info()
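The `typecode` and `buffer_info()` calls above are methods of Python's built-in `array.array`. As a minimal sketch of what they report, the example below builds a small array by hand; the label values are made up for illustration and are not taken from the data set.

```python
from array import array

# Hypothetical label values stored as unsigned bytes (typecode 'B').
labels = array('B', [8, 7, 0, 1, 4])

# buffer_info() returns (memory address, number of elements).
address, length = labels.buffer_info()

print(labels.typecode)           # element type code
print(length)                    # number of elements in the buffer
print(length * labels.itemsize)  # buffer size in bytes
```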



In [None]:
# In the first dataset that we downloaded and loaded above, the data
# is already flattened, which is good for the machine learning model,
# but we would like to actually be able to view the images.

# So the first step here is to convert the data into numpy arrays.
# Numpy arrays can be normalized for the ML model, and they are
# also easier to reshape in case we want to view the data as images.



In [None]:
# transform to numpy arrays
train_images = np.array(train_images)
train_labels = np.array(train_labels)
test_images = np.array(test_images)
test_labels = np.array(test_labels)



In [None]:
# now we should be able to do more data exploration
train_images.shape



In [None]:
# the data is already flattened for use in the model, but
# we need to unflatten the data if we want to view and verify 
# that these are actually images of kuzushiji characters

train_images = np.reshape(train_images, (60000, 28, 28))
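Unflattening and flattening are both plain reshapes, so no pixel values change. The following is a minimal sketch on a synthetic array (6 images instead of 60,000, an assumption to keep it small); it uses `-1` so that NumPy infers the number of images rather than having it hard-coded.

```python
import numpy as np

# Synthetic stand-in: 6 flattened "images" of 28 * 28 = 784 pixels each.
flat = np.arange(6 * 784).reshape(6, 784)

# -1 lets NumPy infer the number of images from the array size.
unflattened = flat.reshape(-1, 28, 28)

# Flattening again recovers the original layout exactly.
reflattened = unflattened.reshape(len(unflattened), -1)

print(unflattened.shape)                  # (6, 28, 28)
print(reflattened.shape)                  # (6, 784)
print(np.array_equal(flat, reflattened))  # True
```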



In [None]:
# let's see the image at index 0
plt.figure()
plt.imshow(train_images[0])
plt.show()



In [None]:
df_classmap



In [None]:
# so we can see, using the classmap that this character 
# pronounced "re" should be type number 8
train_labels[0]



In [None]:
# we can check another one:
plt.figure()
plt.imshow(train_images[5])
plt.show()



In [None]:
# this one is 'su', so according to the classmap
# it should be type 2
train_labels[5]



In [None]:
# next
plt.figure()
plt.imshow(train_images[7])
plt.show()



In [None]:
# I cannot visually determine which one it is
# Checking the type:
train_labels[7]



In [None]:
# next we print the class number, the written character, and the phonetic
# for index numbers 20 through 39:
for i in range(20, 40):
    print(train_labels[i])
    print(df_classmap.loc[train_labels[i], 'char'], df_classmap.loc[train_labels[i], 'phonetic'])
    plt.figure()
    plt.imshow(train_images[i])
    plt.show()

In [None]:
# So now we have verified this data is what we want: namely images of kuzushiji characters,
# which we can view in "unflattened" format (60000, 28, 28) as images and which we can keep in "flattened"
# format as a two-dimensional numpy array with dimensions (60000, 784) for use in the ML model

# next we convert the train_labels numpy array to a dataframe

df_train_labels = pd.DataFrame(train_labels)

In [None]:
sns.displot(df_train_labels)

In [None]:
# so this shows us that this is a balanced set

In [None]:
df_train_labels[0].value_counts()

In [None]:
# this confirms the set is perfectly balanced

In [None]:
# next we obtain a slightly more difficult dataset
# which includes 49 classes of kuzushiji instead of just
# 10 classes:

!wget http://codh.rois.ac.jp/kmnist/dataset/k49/k49-train-imgs.npz?raw=True
!mv k49-train-imgs.npz?raw=True k49-train-imgs.npz
!ls -lahr k49-train-imgs.npz

!wget http://codh.rois.ac.jp/kmnist/dataset/k49/k49-train-labels.npz?raw=True
!mv k49-train-labels.npz?raw=True k49-train-labels.npz
!ls -lahr k49-train-labels.npz

!wget http://codh.rois.ac.jp/kmnist/dataset/k49/k49-test-imgs.npz?raw=True
!mv k49-test-imgs.npz?raw=True k49-test-imgs.npz
!ls -lahr k49-test-imgs.npz

!wget http://codh.rois.ac.jp/kmnist/dataset/k49/k49-test-labels.npz?raw=True
!mv k49-test-labels.npz?raw=True k49-test-labels.npz
!ls -lahr k49-test-labels.npz

In [None]:
# define a load function and use it to load to numpy arrays

def load(f):
    return np.load(f)['arr_0']

k49_train_images = load('k49-train-imgs.npz')
k49_train_labels = load('k49-train-labels.npz')
k49_test_images = load('k49-test-imgs.npz')
k49_test_labels = load('k49-test-labels.npz')
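The `load` helper above relies on each archive holding its array under the key `arr_0`, which is the name NumPy assigns to an array that was saved positionally. The following is a minimal sketch of how to list the keys of an `.npz` archive before indexing it; it writes a small temporary file instead of using the K49 downloads, and the array contents are placeholders.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npz")

    # An array passed positionally is stored under the default key 'arr_0'.
    np.savez_compressed(path, np.zeros((3, 28, 28), dtype=np.uint8))

    # Inspect the available keys, then read the array into memory.
    with np.load(path) as archive:
        keys = list(archive.files)
        loaded = archive["arr_0"]

print(keys)
print(loaded.shape, loaded.dtype)
```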

In [None]:
# and the class map of the 49 character data set

url = "http://codh.rois.ac.jp/kmnist/dataset/k49/k49_classmap.csv"
df_k49_classmap = pd.read_csv(url)
df_k49_classmap.head(51)

In [None]:
# let's add the phonetic sound the character makes for clarity

k49_phonetic = ['a', 'i',  'u',  'e',  'o',
                'ka','ki', 'ku', 'ke', 'ko',
                'sa','shi','su', 'se', 'so',
                'ta','chi','tsu','te', 'to',
                'na','ni', 'nu', 'ne', 'no',
                'ha','hi', 'fu', 'he', 'ho',
                'ma','mi', 'mu', 'me', 'mo',
                'ya','yu', 'yo',
                'ra','ri', 'ru', 're', 'ro',
                'wa','wi', 'we', 'wo', 'n','iteration_mark']
print(len(k49_phonetic))
df_k49_classmap['phonetic'] = k49_phonetic
df_k49_classmap

In [None]:
type(k49_train_images)

In [None]:
k49_train_images.shape

In [None]:
# this tells us that we have 232,365 images 
# that are each 28 x 28 pixels

In [None]:
# let's see the image at index 0
plt.figure()
plt.imshow(k49_train_images[0])
plt.show()

In [None]:
# visually, this one looks like 'ma'
# so according to the classmap
# it should be type 30
k49_train_labels[0]

In [None]:
# let's see the image at index 11
plt.figure()
plt.imshow(k49_train_images[11])
plt.show()

In [None]:
# visually, this one looks like 'no',
# so according to the classmap
# it should be type 24
k49_train_labels[11]

In [None]:
# next we print the class number, the written character, and the phonetic
# for index numbers 80 through 99:
for i in range(80, 100):
    print(k49_train_labels[i])
    print(df_k49_classmap.loc[k49_train_labels[i], 'char'], df_k49_classmap.loc[k49_train_labels[i], 'phonetic'])
    plt.figure()
    plt.imshow(k49_train_images[i])
    plt.show()

In [None]:
# For the 2nd dataset with 49 classes
# we have now verified visually that the data is what we want:
# images of 49 different kuzushiji characters

# next we convert k49_train_labels numpy array to a dataframe
# in order to inspect the dataset further:

df_k49_train_labels = pd.DataFrame(k49_train_labels)

In [None]:
# let's see how the character data is distributed among the 49 classes:
sns.displot(df_k49_train_labels)

In [None]:
df_k49_train_labels[0].value_counts().sort_index()

In [None]:
# so we confirm that many of the classes in the second data set do not
# have 6,000 images, which means this data set is not balanced
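Class balance can also be checked numerically rather than read off a plot. The sketch below is a minimal example on made-up labels (three classes with 600, 300 and 100 samples, an assumption for illustration); the ratio between the largest and smallest class is one simple way to summarize imbalance.

```python
import numpy as np

# Hypothetical labels: three classes with unequal frequencies.
labels = np.repeat([0, 1, 2], [600, 300, 100])

# Count the samples per class.
counts = np.bincount(labels, minlength=3)

# A balanced set has identical counts, i.e. a ratio of exactly 1.
imbalance_ratio = counts.max() / counts.min()
is_balanced = bool(counts.min() == counts.max())

print(counts)           # [600 300 100]
print(imbalance_ratio)  # 6.0
print(is_balanced)      # False
```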

In [None]:
# the final dataset includes Kanji characters
# this data has 3832 different classes and
# consists of 140,426 images

# this dataset is not as processed as the other two:
# it is just a bunch of png images in a directory
# inside an archive file

# we download the archive:

!wget http://codh.rois.ac.jp/kmnist/dataset/kkanji/kkanji.tar?raw=True
!mv kkanji.tar?raw=True kkanji.tar
!ls -lahr kkanji.tar

In [None]:
# next, unarchive and set up the numpy arrays
# for the third (kanji) dataset

# first we'll set up the classmap for the 
# kanji dataset

In [None]:
# list the contents of the archive
# limit output to the first 70 files

!tar -tf kkanji.tar | head -70

In [None]:
# remove the directory of images, so it can be rebuilt
!rm -rf kkanji2

In [None]:
# checking what's in the current working directory:

!ls -al

In [None]:
# so the format of the data is that the individual images exist
# in folders whose names are the codepoint for the image category;
# these folder names will be the basis for the class index of each image

# next let's extract:

!tar -xf kkanji.tar

In [None]:
# checking what's in the current working directory:

!ls -al

In [None]:
# now list out the first 50 folder names in the directory kkanji2
# the folder names are also the codepoint of each of the characters:

!ls kkanji2 | head -50

In [None]:
# put the codepoints in a file
# let the first line of the file be the name
# of the column in the dataframe we are creating
!echo codepoint > codepoints.csv
!ls kkanji2 >> codepoints.csv
!cat codepoints.csv | head -50

In [None]:
!ls -al

In [None]:
print(os.path.abspath("codepoints.csv"))

In [None]:
# read all the lines of the file into pandas dataframe
# including the column header which is already in the file

df_kanji_classmap = pd.read_csv("codepoints.csv")

In [None]:
df_kanji_classmap

In [None]:
#read the first listed image in the first folder and display it
img = mping.imread('kkanji2/U+5B87/72d56fcb33d10fe0.png')
plt.imshow(img)
plt.show()

In [None]:
# verify the full path for the folder containing the images
path_var = str(os.path.abspath("kkanji2/"))
path_var

In [None]:
# create a pandas dataframe that contains the codepoint for each image
# and its full path in the OS, then display that dataframe

data = []
root_dir = os.path.realpath(path_var)
for folder, _, files in os.walk(root_dir):
    for file_name in files:
        # keep only PNG images; the folder name is the codepoint
        if file_name.endswith(".png"):
            data.append((os.path.basename(folder), os.path.join(folder, file_name)))
df_kanji2 = pd.DataFrame(data, columns=['codepoint', 'image_file_path']).sort_values(by=['codepoint'], ignore_index=True)
df_kanji2
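The same folder-per-codepoint listing can be expressed with `pathlib`. This is a minimal sketch of an alternative, not the approach used in this notebook; it builds a tiny temporary directory tree whose folder and file names are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical layout: one folder per codepoint, PNG files inside.
    layout = {"U+4E00": ["a.png", "b.png"], "U+4E8C": ["c.png"]}
    for codepoint, names in layout.items():
        folder = root / codepoint
        folder.mkdir()
        for name in names:
            (folder / name).touch()
        # A non-image file that the pattern below should skip.
        (folder / "notes.txt").touch()

    # Match PNG files exactly one folder below the root;
    # the parent folder name is the codepoint.
    rows = sorted((p.parent.name, p.name) for p in root.glob("*/*.png"))

print(rows)
```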

In [None]:
# show the os path of the first image
df_kanji2['image_file_path'][0]

In [None]:
# read the image using Pillow
pimage = Image.open(df_kanji2['image_file_path'][0])

In [None]:
# show some information about the image:
print(pimage.format)

In [None]:
print(pimage.size)

In [None]:
print(pimage.mode)

In [None]:
# it is important to know that the image is 64 x 64 pixels, unlike the images in the first two datasets, which are 28 x 28

In [None]:
# next read the image using matplotlib

In [None]:
img = image.imread(df_kanji2['image_file_path'][0])

In [None]:
print(img.dtype)

In [None]:
print(img.shape)

In [None]:
plt.imshow(img)

In [None]:
# next we want to try to convert this image into a numpy array

In [None]:
image_nparray = asarray(img)

In [None]:
# verify we have created a numpy array
print(type(image_nparray))

In [None]:
# verify the numpy array is the correct dimensions:
print(image_nparray.shape)

In [None]:
image_nparray

In [None]:
# convert the pandas dataframe into a pyspark dataframe
df_kanji2_pyspk = spark.createDataFrame(df_kanji2)

In [None]:
# our data has 3,832 different classes, each with a unique string name
# based on its character codepoint,
# but we want a simple numeric class index,
# so we instantiate a StringIndexer in spark:

indexer = StringIndexer(inputCol="codepoint",outputCol="classIndex")
indexed_df = indexer.fit(df_kanji2_pyspk).transform(df_kanji2_pyspk)
indexed_df.show()

In [None]:
# transform back to pandas dataframe:
df_kanji2 = indexed_df.toPandas()

In [None]:
df_kanji2

In [None]:
# now it is easy to explore the distribution between the classes:
# note that the StringIndexer took frequency into account when
# creating the classes, so the lowest indexes have the largest count:

sns.displot(df_kanji2['classIndex'])

In [None]:
df_kanji2['classIndex'].value_counts().sort_index()
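To make the frequency-based ordering concrete, the following is a minimal pandas-only sketch of the same idea: the most frequent label gets index 0. It is not Spark's implementation. The codepoints are made-up sample values, and breaking ties alphabetically is an assumption about how equal counts are resolved.

```python
import pandas as pd

# Hypothetical sample of codepoint labels.
df = pd.DataFrame(
    {"codepoint": ["U+4E8C", "U+4E00", "U+4E00", "U+4E09", "U+4E00", "U+4E8C"]}
)

# Count occurrences, then order labels by descending count,
# using the label itself as the tie-breaker (assumption).
counts = df["codepoint"].value_counts()
ordered = sorted(counts.index, key=lambda label: (-counts[label], label))

# Assign indices as floats, matching the double-typed index column.
mapping = {label: float(i) for i, label in enumerate(ordered)}
df["classIndex"] = df["codepoint"].map(mapping)

print(mapping)
print(df)
```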

In [None]:
# walk through the dataframe and display an image and the
# image's class Index

for i in range(0, 100):
    print('Dataframe Index: ', i)
    print('Class Index: ', df_kanji2['classIndex'][i])
    imag = image.imread(df_kanji2['image_file_path'][i])
    plt.figure()
    plt.imshow(imag)
    plt.show()

In [None]:
# so in our exploration of the 3rd dataset, we can see
# that it contains more complex images, that the images
# are larger, and that it is also an unbalanced dataset,
# i.e. the classes do not all contain the same number
# of images.
