<a href="https://colab.research.google.com/github/amarsinghen/landmark-detection-kaggle/blob/master/google_landmark_detection_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Selecting TensorFlow 2.x (the `%tensorflow_version` magic only exists in Colab)

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


### Imports
All the imports for the project

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals
import functools
import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf
from PIL import Image
import imageio
import logging as log

In [3]:
tf.__version__

'2.1.0-rc1'

Setting the log level

In [0]:
log.basicConfig(level=log.DEBUG)

### Data Download
Assigning the training and test data file URLs to variables

In [0]:
TRAIN_DATA_URL = "https://s3.amazonaws.com/google-landmark/metadata/train.csv"
TRAIN_ATTRIBUTION_DATA_URL = "https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv"
TRAIN_LABEL_TO_CATEGORY_DATA_URL = "https://s3.amazonaws.com/google-landmark/metadata/train_label_to_category.csv"
TRAIN_IMAGES_DATA_TAR_URL = "https://s3.amazonaws.com/google-landmark/train/images_001.tar"

TEST_DATA_CSV_URL = "https://s3.amazonaws.com/google-landmark/metadata/test.csv"
TEST_DATA_RECOGNITION_SOLUTION_V2_URL = "https://s3.amazonaws.com/google-landmark/ground_truth/recognition_solution_v2.1.csv"
TEST_DATA_RETRIEVAL_SOLUTION_V2_URL = "https://s3.amazonaws.com/google-landmark/ground_truth/retrieval_solution_v2.1.csv"
TEST_IMAGES_DATA_TAR_URL = "https://s3.amazonaws.com/google-landmark/test/images_000.tar"

Downloading the training and test metadata (CSV) files. These files contain the image URLs and the landmark classification information. `tf.keras.utils.get_file` returns the local path of each downloaded file, so the variables below are strings. The full image dataset is very large: ~4 million images for training and ~200k for test.

In [6]:
train_file_csv = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
train_attribution_csv= tf.keras.utils.get_file("train_attribution.csv", TRAIN_ATTRIBUTION_DATA_URL)
train_label_to_category_csv = tf.keras.utils.get_file("train_label_to_category.csv",TRAIN_LABEL_TO_CATEGORY_DATA_URL)

Downloading data from https://s3.amazonaws.com/google-landmark/metadata/train.csv
Downloading data from https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv
Downloading data from https://s3.amazonaws.com/google-landmark/metadata/train_label_to_category.csv


In [7]:
log.debug("The type of variables is : " + str(type(train_file_csv)))

DEBUG:root:The type of variables is : <class 'str'>


In [8]:
test_file_csv = tf.keras.utils.get_file("test.csv", TEST_DATA_CSV_URL)
test_image_recognition_solution_csv= tf.keras.utils.get_file("test_images_recognition_solution.csv", TEST_DATA_RECOGNITION_SOLUTION_V2_URL)
test_image_retrieval_solution_csv = tf.keras.utils.get_file("test_images_retrieval_solution.csv",TEST_DATA_RETRIEVAL_SOLUTION_V2_URL)

Downloading data from https://s3.amazonaws.com/google-landmark/metadata/test.csv
Downloading data from https://s3.amazonaws.com/google-landmark/ground_truth/recognition_solution_v2.1.csv
Downloading data from https://s3.amazonaws.com/google-landmark/ground_truth/retrieval_solution_v2.1.csv


### Converting to pandas data frame
Loading the training and test CSV files into pandas DataFrames

In [0]:
train_file_csv_df = pd.read_csv(train_file_csv)
train_attribution_csv_df = pd.read_csv(train_attribution_csv)
train_label_to_category_csv_df = pd.read_csv(train_label_to_category_csv)

In [0]:
test_file_csv_df = pd.read_csv(test_file_csv)
test_image_recognition_solution_csv_df = pd.read_csv(test_image_recognition_solution_csv)
test_image_retrieval_solution_csv_df = pd.read_csv(test_image_retrieval_solution_csv)

- Checking for **null values**

In [11]:
train_file_csv_df[train_file_csv_df.id.isnull() | train_file_csv_df.url.isnull() | train_file_csv_df.landmark_id.isnull()].count()

id             0
url            0
landmark_id    0
dtype: int64
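
A minimal sketch of the same check on toy data. The three-row `toy_df` below is hypothetical and not part of the dataset. `isnull().sum()` gives the null count per column directly, and `isnull().any(axis=1)` flags rows with at least one null:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as train.csv.
toy_df = pd.DataFrame({
    "id": ["a", "b", None],
    "url": ["http://example.com/1.jpg", None, "http://example.com/3.jpg"],
    "landmark_id": [1, 2, 3],
})

# Number of nulls in each column.
null_counts = toy_df.isnull().sum()

# Number of rows that contain at least one null value.
rows_with_nulls = int(toy_df.isnull().any(axis=1).sum())

print(null_counts)
print(rows_with_nulls)
```

On the real training frame both results would be zero, matching the output above.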

* The **top 20** landmarks with the highest number of images in the training set

In [12]:
train_file_csv_df.landmark_id.value_counts()[:20]

138982    10247
62798      4333
177870     3327
176528     3243
192931     2627
126637     2589
83144      2351
171772     2268
20409      2248
151942     1727
84689      1721
139894     1717
62074      1637
10618      1539
45428      1513
41808      1509
139706     1509
60532      1447
161902     1424
194914     1399
Name: landmark_id, dtype: int64

In [13]:
landmarks_between_50_and_100_df = train_file_csv_df.landmark_id.value_counts().reset_index(name="count").query('count<55 and count>49')
log.debug("Sample of record in the train_file_csv_df dataframe : \n" + str(train_file_csv_df.head(1)))
log.debug("Total number of images in the training set : " + str(train_file_csv_df['url'].count()))
log.debug("Total number of unique landmark_ids in the training dataset : " + str(train_file_csv_df.landmark_id.value_counts()
                                                                                 .reset_index(name="count")["index"].count()))
log.debug("Total number of landmarks with 50 and 100 images in the dataset : " + str(landmarks_between_50_and_100_df["index"].count()))

INFO:numexpr.utils:NumExpr defaulting to 2 threads.
DEBUG:root:Sample of record in the train_file_csv_df dataframe : 
                 id  ... landmark_id
0  6e158a47eb2ca3f6  ...      142820

[1 rows x 3 columns]
DEBUG:root:Total number of images in the training set : 4132914
DEBUG:root:Total number of unique landmark_ids in the training dataset : 203094
DEBUG:root:Total number of landmarks with 50 and 100 images in the dataset : 2195


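* Filtering the training set to only include landmarks in the selected image-count range. Despite the "50 and 100" naming, the query above (`count<55 and count>49`) keeps landmarks with 50 to 54 images.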

In [14]:
filtered_train_between_50and100_df = train_file_csv_df[train_file_csv_df.landmark_id.isin(landmarks_between_50_and_100_df['index'])]
log.debug("Total number of images between 50 and 100 count in the training set after filtering  : " + str(filtered_train_between_50and100_df["id"].count()))
log.debug(filtered_train_between_50and100_df.head(1))
log.debug(filtered_train_between_50and100_df.iloc[1]['url'])

DEBUG:root:Total number of images between 50 and 100 count in the training set after filtering  : 113948
DEBUG:root:                 id  ... landmark_id
3  e7f70e9c61e66af3  ...      102140

[1 rows x 3 columns]
DEBUG:root:https://upload.wikimedia.org/wikipedia/commons/e/e2/Mount_Vernon_Mansion_as_seen_from_the_Bowling_Green._-_panoramio.jpg
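
The count-range filter can also be sketched without relying on the `"index"` column name, which depends on the pandas version. The helper `filter_by_image_count` and the toy frame are assumptions made for illustration, not part of the notebook:

```python
import pandas as pd

def filter_by_image_count(df, low, high, label_col="landmark_id"):
    """Keep rows whose label occurs between low and high times (inclusive)."""
    counts = df[label_col].value_counts()
    selected = counts[counts.between(low, high)].index
    return df[df[label_col].isin(selected)]

# Hypothetical toy data: landmark 1 has 3 images, landmark 2 has 2, landmark 3 has 1.
toy_df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],
    "landmark_id": [1, 1, 1, 2, 2, 3],
})

# Keep only landmarks with exactly 2 images.
toy_filtered = filter_by_image_count(toy_df, low=2, high=2)
print(toy_filtered)
```

Under the same assumptions, `filter_by_image_count(train_file_csv_df, 50, 54)` should select the same rows as the cell above.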


In [15]:
!pwd

/content


In [0]:
import os
import urllib.request
from urllib.error import ContentTooShortError, HTTPError

# Download every image to /tmp/<landmark_id>/<id>.jpg
for index, row in filtered_train_between_50and100_df.iterrows():
  local_dir = os.path.join("/tmp", str(row['landmark_id']))
  local_location = os.path.join(local_dir, str(row['id']) + ".jpg")
  # Create the per-landmark directory if it does not exist yet
  os.makedirs(local_dir, exist_ok=True)
  try:
    urllib.request.urlretrieve(row['url'], local_location)
  except FileNotFoundError as err:
    print(err)  # something wrong with local path
  except HTTPError as err:
    print(err)  # something wrong with url
    print(row['url'])
  except ContentTooShortError as err:
    print(err)  # download was shorter than expected
    print(row['url'])

HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/a/af/Calrencedockpublicart.jpg
HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/6/62/Interieur_doopvont_-_Voorburg_-_20533881_-_RCE.jpg
HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/8/8f/Exterieur_ZUIDGEVEL_-_Voorburg_-_20275792_-_RCE.jpg
HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/5/59/Namak_Lake_%282%29.jpg
HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/b/bb/HK_CWB_Causeway_Bay_%E7%BE%85%E7%B4%A0%E8%A1%97_Russell_Street_Times_Square_red_heros_figure_August_2018_SSG_02.jpg
HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/9/9c/Caverne_du_Pont_d%27Arc_Acc%C3%A8s_visiteurs.jpg
HTTP Error 404: Not Found
https://upload.wikimedia.org/wikipedia/commons/5/52/M%C3%BCnchen_2012_%2853%29.jpg
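
Because the download loop runs over more than 100k URLs, it can help to make a single download restartable. The sketch below is one possible shape for that, not the notebook's method: `download_image`, `fake_fetch`, `failing_fetch` and the `example.com` URLs are hypothetical, and the stand-in fetch functions are used so the demo needs no network access.

```python
import os
import tempfile
import urllib.request
from urllib.error import URLError

def download_image(url, landmark_id, image_id, root="/tmp",
                   fetch=urllib.request.urlretrieve):
    """Save url as <root>/<landmark_id>/<image_id>.jpg.

    Returns the local path, or None if the download failed.
    """
    target_dir = os.path.join(root, str(landmark_id))
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, str(image_id) + ".jpg")
    if os.path.exists(target):
        return target  # already downloaded on an earlier run
    try:
        fetch(url, target)
    except URLError as err:  # also covers HTTPError and ContentTooShortError
        print(err, url)
        return None
    return target

# Stand-in fetch functions for the demo (assumptions, no real network traffic).
def fake_fetch(url, target):
    with open(target, "wb") as handle:
        handle.write(b"not a real image")

def failing_fetch(url, target):
    raise URLError("simulated failure")

demo_root = tempfile.mkdtemp()
saved = download_image("https://example.com/a.jpg", 102140, "e7f70e9c61e66af3",
                       root=demo_root, fetch=fake_fetch)
failed = download_image("https://example.com/b.jpg", 102140, "missing",
                        root=demo_root, fetch=failing_fetch)
print(saved, failed)
```

Catching `URLError` is enough here because `HTTPError` and `ContentTooShortError` are both subclasses of it.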


In [0]:
log.debug("Sample of record in the test_file_csv_df dataframe : \n" + str(test_file_csv_df.head(3)))
log.debug("Total number of images in the test set : " + str(test_file_csv_df['id'].count()))
log.debug("Total number of unique image ids in the test dataset : " + str(test_file_csv_df['id'].nunique()))

Viewing one of the downloaded pictures. Configuring matplotlib for inline display

In [0]:
%matplotlib inline

In [0]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

image_path = "/tmp/100157/9cd60ce98c029534.jpg"

fig = plt.gcf()
img = mpimg.imread(image_path)
plt.imshow(img)
plt.show()

Build a TensorFlow model

In [0]:
import tensorflow as tf
model=tf.keras.models.Sequential([
                                  tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(400,400,3)),
                                  tf.keras.layers.MaxPooling2D(2,2),
                                  tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
                                  tf.keras.layers.MaxPooling2D(2,2),
                                  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
                                  tf.keras.layers.MaxPooling2D(2,2),
                                  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
                                  tf.keras.layers.MaxPooling2D(2,2),
                                  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
                                  tf.keras.layers.MaxPooling2D(2,2),
                                  tf.keras.layers.Flatten(),
                                  tf.keras.layers.Dense(512, activation='relu'),
                                  # One output unit per landmark class (not one per image)
                                  tf.keras.layers.Dense(filtered_train_between_50and100_df["landmark_id"].nunique(), activation='softmax')
])
model.summary()
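
The feature-map sizes that `model.summary()` reports can be checked by hand. The sketch below assumes the Keras defaults used above: `Conv2D` with `'valid'` padding and stride 1, and `MaxPooling2D` rounding down. The helper name `conv_pool_output_size` is made up for illustration.

```python
def conv_pool_output_size(size, num_blocks, kernel=3, pool=2):
    """Side length after num_blocks of (valid 3x3 conv, 2x2 max pool)."""
    for _ in range(num_blocks):
        size = size - kernel + 1  # 'valid' convolution with stride 1
        size = size // pool       # max pooling rounds down
    return size

# 400 -> 199 -> 98 -> 48 -> 23 -> 10 across the five conv/pool blocks.
side = conv_pool_output_size(400, num_blocks=5)

# The last convolution has 64 filters, so Flatten yields side * side * 64 values.
flattened = side * side * 64

# Weights plus biases of the Dense(512) layer that follows Flatten.
dense_params = flattened * 512 + 512

print(side, flattened, dense_params)  # 10 6400 3277312
```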

Compiling the model

In [0]:
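# The generator below uses class_mode='categorical' (one-hot labels),
# which pairs with categorical_crossentropy rather than the sparse variant.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])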
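
The loss has to match the label format that the data generator yields. With `class_mode='categorical'` the labels are one-hot vectors, which is what `categorical_crossentropy` expects. `sparse_categorical_crossentropy` expects integer class indices instead. A plain-Python sketch of the two formulas for a single sample (the function names mirror the Keras loss names, but these are simplified stand-ins, not the Keras implementations):

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy for a one-hot label vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy for an integer class index."""
    return -math.log(probs[class_index])

# Hypothetical softmax output over three classes; the true class is 1.
probs = [0.1, 0.7, 0.2]
loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_one_hot, loss_sparse)
```

Both give the same value for the same sample. Only the label encoding differs, so the choice of loss follows from the generator's `class_mode`.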

Data preprocessing with ImageDataGenerator

In [0]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale=1/255)
#flow training images in batches of 128
train_generator = train_datagen.flow_from_directory(
    '/tmp/',
    target_size=(400,400),
    batch_size=128,
    class_mode='categorical'
)
# Model.fit accepts generators directly; fit_generator is deprecated as of TF 2.1
history = model.fit(
    train_generator,
    steps_per_epoch=8,
    epochs=15,
    verbose=1
)
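
With `steps_per_epoch=8` and `batch_size=128`, each epoch draws only a small part of the filtered set. A quick check of the arithmetic, assuming the 113948 images logged earlier were all available (a few downloads failed with 404, so the real count is slightly lower):

```python
import math

num_images = 113948   # filtered image count from the log output above
batch_size = 128
steps_per_epoch = 8

# Images drawn from the generator in one epoch with the settings above.
images_per_epoch = steps_per_epoch * batch_size

# Steps that one full pass over the filtered set would need.
steps_for_full_pass = math.ceil(num_images / batch_size)

print(images_per_epoch, steps_for_full_pass)  # 1024 891
```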

Evaluating Accuracy and Loss

In [0]:
acc = history.history['acc']
loss = history.history['loss']
epochs = range(len(acc))

# Open a new figure before each plot so no empty figure is left over
plt.figure()
plt.plot(epochs, acc)
plt.title('Training accuracy')

plt.figure()
plt.plot(epochs, loss)
plt.title('Training Loss')
plt.show()