
Selecting TensorFlow 2.x (the `%tensorflow_version` magic only exists in Colab)

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


### Imports
All the imports for the project

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals
import functools
import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf
from PIL import Image
import imageio
import logging as log

In [3]:
tf.__version__

'2.0.0-rc2'

Setting the log level

In [0]:
log.basicConfig(level=log.DEBUG)

### Data Download
Assigning the training and test data file URLs to variables

In [0]:
TRAIN_DATA_URL = "https://s3.amazonaws.com/google-landmark/metadata/train.csv"
TRAIN_ATTRIBUTION_DATA_URL = "https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv"
TRAIN_LABEL_TO_CATEGORY_DATA_URL = "https://s3.amazonaws.com/google-landmark/metadata/train_label_to_category.csv"
TRAIN_IMAGES_DATA_TAR_URL = "https://s3.amazonaws.com/google-landmark/train/images_001.tar"

TEST_DATA_CSV_URL = "https://s3.amazonaws.com/google-landmark/metadata/test.csv"
TEST_DATA_RECOGNITION_SOLUTION_V2_URL = "https://s3.amazonaws.com/google-landmark/ground_truth/recognition_solution_v2.1.csv"
TEST_DATA_RETRIEVAL_SOLUTION_V2_URL = "https://s3.amazonaws.com/google-landmark/ground_truth/retrieval_solution_v2.1.csv"
TEST_IMAGES_DATA_TAR_URL = "https://s3.amazonaws.com/google-landmark/test/images_000.tar"

Downloading the training and test metadata (CSV) files. The training CSV contains the image URLs and landmark labels. Each variable below is a string holding the local path of the downloaded file on the machine running the notebook. The overall image dataset is very large: about 4.1 million training images and about 118k test images, per the counts logged further down.

In [6]:
train_file_csv = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
train_attribution_csv = tf.keras.utils.get_file("train_attribution.csv", TRAIN_ATTRIBUTION_DATA_URL)
train_label_to_category_csv = tf.keras.utils.get_file("train_label_to_category.csv", TRAIN_LABEL_TO_CATEGORY_DATA_URL)

Downloading data from https://s3.amazonaws.com/google-landmark/metadata/train.csv
Downloading data from https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv
Downloading data from https://s3.amazonaws.com/google-landmark/metadata/train_label_to_category.csv


In [7]:
log.debug("The type of variables is : " + str(type(train_file_csv)))

DEBUG:root:The type of variables is : <class 'str'>


In [8]:
test_file_csv = tf.keras.utils.get_file("test.csv", TEST_DATA_CSV_URL)
test_image_recognition_solution_csv = tf.keras.utils.get_file("test_images_recognition_solution.csv", TEST_DATA_RECOGNITION_SOLUTION_V2_URL)
test_image_retrieval_solution_csv = tf.keras.utils.get_file("test_images_retrieval_solution.csv", TEST_DATA_RETRIEVAL_SOLUTION_V2_URL)

Downloading data from https://s3.amazonaws.com/google-landmark/metadata/test.csv
Downloading data from https://s3.amazonaws.com/google-landmark/ground_truth/recognition_solution_v2.1.csv
Downloading data from https://s3.amazonaws.com/google-landmark/ground_truth/retrieval_solution_v2.1.csv


### Converting to pandas data frame
Loading the training and test data from the CSV files into pandas DataFrames

In [0]:
train_file_csv_df = pd.read_csv(train_file_csv)
train_attribution_csv_df = pd.read_csv(train_attribution_csv)
train_label_to_category_csv_df = pd.read_csv(train_label_to_category_csv)

In [0]:
test_file_csv_df = pd.read_csv(test_file_csv)
test_image_recognition_solution_csv_df = pd.read_csv(test_image_recognition_solution_csv)
test_image_retrieval_solution_csv_csv_df = pd.read_csv(test_image_retrieval_solution_csv)
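With roughly four million rows in `train.csv`, it can help to tell `pd.read_csv` the column types up front instead of letting pandas infer them. The sketch below uses a tiny in-memory CSV with the same three column names (`id`, `url`, `landmark_id`); the rows and the choice of `int32` are illustrative assumptions, not values from the real file.

```python
import io

import pandas as pd

# Toy stand-in for train.csv: same column names, made-up rows.
sample_csv = io.StringIO(
    "id,url,landmark_id\n"
    "aaa,http://example.com/a.jpg,7\n"
    "bbb,http://example.com/b.jpg,7\n"
    "ccc,http://example.com/c.jpg,9\n"
)

# Explicit dtypes skip type inference; int32 is an assumption that
# comfortably covers the landmark_id range seen in this notebook.
sample_df = pd.read_csv(
    sample_csv,
    dtype={"id": str, "url": str, "landmark_id": "int32"},
)

print(sample_df.dtypes)
```

The same `dtype` argument could be passed when reading the real `train.csv` path.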

* Checking for **null values**

In [54]:
# Count missing values per column (count() on filtered rows would count non-null values instead)
train_file_csv_df[["id", "url", "landmark_id"]].isnull().sum()

id             0
url            0
landmark_id    0
dtype: int64
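A small illustration of null checking on toy data: `isnull().sum()` reports missing values per column, and `isnull().any(axis=1).sum()` reports how many rows have at least one missing value. The DataFrame below is made up for the example; only the column names match the training CSV.

```python
import pandas as pd

# Toy frame with deliberately missing values (not real data).
toy_df = pd.DataFrame({
    "id": ["a", "b", None],
    "url": ["u1", None, "u3"],
    "landmark_id": [1, 2, 3],
})

# Missing values per column.
nulls_per_column = toy_df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_nulls = int(toy_df.isnull().any(axis=1).sum())

print(nulls_per_column)
print(rows_with_nulls)
```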

* The **top 20** landmarks with the highest number of images in the training set

In [50]:
train_file_csv_df.landmark_id.value_counts()[:20]

138982    10247
62798      4333
177870     3327
176528     3243
192931     2627
126637     2589
83144      2351
171772     2268
20409      2248
151942     1727
84689      1721
139894     1717
62074      1637
10618      1539
45428      1513
41808      1509
139706     1509
60532      1447
161902     1424
194914     1399
Name: landmark_id, dtype: int64

In [56]:
landmarks_less_than_10_df = train_file_csv_df.landmark_id.value_counts().reset_index(name="count").query("count<10")
log.debug("Sample of record in the train_file_csv_df dataframe : \n" + str(train_file_csv_df.head(1)))
log.debug("Total number of images in the training set : " + str(train_file_csv_df['url'].count()))
log.debug("Total number of unique landmark_ids in the training dataset : " + str(train_file_csv_df.landmark_id.nunique()))
log.debug("Total number of landmarks with less than 10 images in the dataset : " + str(len(landmarks_less_than_10_df)))

DEBUG:root:Sample of record in the train_file_csv_df dataframe : 
                 id  ... landmark_id
0  6e158a47eb2ca3f6  ...      142820

[1 rows x 3 columns]
DEBUG:root:Total number of images in the training set : 4132914
DEBUG:root:Total number of unique landmark_ids in the training dataset : 203094
DEBUG:root:Total number of landmarks with less than 10 images in the dataset : 110354


In [52]:
log.debug("Sample of record in the test_file_csv_df dataframe : \n" + str(test_file_csv_df.head(1)))
log.debug("Total number of images in the test set : " + str(test_file_csv_df['id'].count()))
# test.csv only has an image id column, so this counts unique image ids, not landmark ids
log.debug("Total number of unique image ids in the test dataset : " + str(test_file_csv_df.id.nunique()))

DEBUG:root:Sample of record in the test_file_csv_df dataframe : 
                 id
0  00016575233bc956
DEBUG:root:Total number of images in the test set : 117577
DEBUG:root:Total number of unique image ids in the test dataset : 117577


* Filtering out of the training set every landmark that has fewer than 10 images

In [0]:
filtered_train_df = train_file_csv_df[~train_file_csv_df.landmark_id.isin(landmarks_less_than_10_df['index'])]
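The filter above relies on the `index` column that `value_counts().reset_index(name="count")` produced in the pandas version used here; newer pandas releases name that column after the original series. A version-independent sketch of the same idea, using `groupby(...).transform("size")` on toy data, is shown below. The threshold constant `MIN_IMAGES` and the toy rows are assumptions for the example (the notebook's threshold is 10).

```python
import pandas as pd

MIN_IMAGES = 3  # the notebook uses 10; smaller here to suit the toy data

# Toy training frame: landmark 1 has three images, landmarks 2 and 3 have one each.
toy_train_df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "landmark_id": [1, 1, 1, 2, 3],
})

# For every row, the number of images that share its landmark_id.
images_per_landmark = toy_train_df.groupby("landmark_id")["id"].transform("size")

# Keep only rows whose landmark has at least MIN_IMAGES images.
toy_filtered_df = toy_train_df[images_per_landmark >= MIN_IMAGES]

print(toy_filtered_df)
```

Because the group sizes are aligned with the original rows, no intermediate list of landmark ids is needed.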

In [64]:
train_file_csv_df.landmark_id.value_counts().reset_index(name="count").describe()

              index         count
count      203094.0      203094.0
mean       101546.5     20.349759
std    58628.332123     52.366016
min             0.0           1.0
25%        50773.25           3.0
50%        101546.5           8.0
75%       152319.75          20.0
max        203093.0       10247.0


In [62]:
landmarks_more_than_10_df = train_file_csv_df.landmark_id.value_counts().reset_index(name="count").query("count>=10")
landmarks_more_than_10_df.describe()

              index         count
count       92740.0       92740.0
mean    101346.5613      39.66143
std    58807.538571     72.880997
min             0.0          10.0
25%         50116.0          14.0
50%        101362.0          22.0
75%       152211.25          42.0
max        203093.0       10247.0
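The summary statistics above are enough to estimate how much data the 10-image threshold keeps: the number of retained images is approximately the retained landmark count times the mean images per retained landmark. The sketch below plugs in the figures reported in this notebook; the result is approximate because the mean is rounded.

```python
# Figures reported in the outputs above.
total_images = 4_132_914          # images in the training set
total_landmarks = 203_094         # unique landmark_ids
kept_landmarks = 92_740           # landmarks with at least 10 images
mean_images_per_kept = 39.66143   # mean count among kept landmarks (rounded)

# Approximate number of images that survive the filter.
kept_images = kept_landmarks * mean_images_per_kept

kept_landmark_fraction = kept_landmarks / total_landmarks
kept_image_fraction = kept_images / total_images

print(f"landmarks kept: {kept_landmark_fraction:.1%}")
print(f"images kept (approx.): {kept_image_fraction:.1%}")
```

Under these figures, fewer than half of the landmarks are kept while the large majority of images remain, which reflects how many landmarks have only a handful of images.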


In [0]:
# test_file_csv_df.head(1)

In [0]:
# test_image_recognition_solution_csv_df.head(1)

In [0]:
# test_image_retrieval_solution_csv_csv_df.head(1)
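The cells that follow glob over `train_images_data_dir`, which is not defined in the cells shown in this notebook. One way it might be set up, as a sketch only, is to extract the downloaded image tar into a local directory and use that directory as a `pathlib.Path`. The helper name `extract_tar` and the destination directory `train_images` are hypothetical.

```python
import pathlib
import tarfile


def extract_tar(tar_path, dest_dir):
    """Extract the archive at tar_path into dest_dir and return dest_dir as a Path."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(dest)
    return dest


# Hypothetical usage in this notebook (downloads a large archive):
# train_images_tar = tf.keras.utils.get_file("images_001.tar", TRAIN_IMAGES_DATA_TAR_URL)
# train_images_data_dir = extract_tar(train_images_tar, "train_images")
```

Returning a `pathlib.Path` keeps the later `glob('**/*.jpg')` calls working unchanged.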

In [0]:
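# train_images_data_dir must be a pathlib.Path pointing at the extracted training images;
# it is not defined in the cells above.
train_image_count = sum(1 for _ in train_images_data_dir.glob('**/*.jpg'))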
print(train_image_count)

In [0]:
# Collect the (width, height) of every training image.
# Image.open reads the file header lazily, so pixel data is not decoded here.
image_sizes = []
for image_path in train_images_data_dir.glob('**/*.jpg'):
  with Image.open(str(image_path)) as img:
    image_sizes.append(img.size)
print(len(image_sizes))
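Once the sizes are collected, a quick summary shows how varied the image dimensions are. The sketch below uses a made-up list of `(width, height)` tuples in place of the real list built above; the names `sample_sizes` and `sizes_df` are hypothetical.

```python
from collections import Counter

import pandas as pd

# Made-up (width, height) tuples standing in for the sizes collected from the images.
sample_sizes = [(800, 600), (1024, 768), (800, 600), (640, 480)]

# One row per image, so describe() gives min/max/quartiles per dimension.
sizes_df = pd.DataFrame(sample_sizes, columns=["width", "height"])
print(sizes_df.describe())

# Most frequent exact size and how often it occurs.
most_common_size, occurrences = Counter(sample_sizes).most_common(1)[0]
print(most_common_size, occurrences)
```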