# Data exploration - SnakeCLEF2021

**About:**

- This notebook explores the original dataset from [snakeclef2021](https://www.aicrowd.com/challenges/snakeclef2021-snake-species-identification-challenge).
---
David Andrés Torres Betancour <br/>
Computer Engineering Student <br/>
University of Antioquia <br/>
davida.torres@udea.edu.co

## Importing Libraries

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import cv2  # OpenCV, used by displayImagesForRandomBreeds to read images
import h5py
import random
import codecs
from ipywidgets import FileUpload, Output
import asyncio
import threading
%matplotlib inline

## Tools

In [8]:
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    

def fetchDatasetFromKaggle(dataset_name, force_fetch=False):
  '''
  Downloads a dataset from the 'deividt' Kaggle account and unzips it locally.
  Set force_fetch=True to download again even if a local copy already exists.
  '''
  print(bcolors.BOLD + "Fetching data from kaggle ( This may take some time)..." + bcolors.ENDC)
  force_flag = "--force" if force_fetch else ""
  # Run the download once and capture the lines printed by the CLI
  process_info = ! kaggle datasets download -d deividt/{dataset_name} {force_flag}
  process_info = list(process_info)

  if "Skipping" in process_info[0]:
    print(bcolors.WARNING + "Data already exists locally\nTo force a new download, set the force_fetch parameter to True" + bcolors.ENDC)
  elif "404 - Not Found" in process_info[0]:
    print(bcolors.FAIL + "404 - Dataset not found in the 'deividt' account\n" + bcolors.ENDC)
  elif "100%" in process_info[-1]:
    print(bcolors.OKGREEN + "Data from kaggle successfully fetched\n" + bcolors.ENDC)
    print(bcolors.BOLD + "Unzipping data..." + bcolors.ENDC)
    ! unzip \*.zip && rm *.zip
    print(bcolors.OKGREEN + "Data is ready in your local folder!\n" + bcolors.ENDC)
  else:
    # Unexpected CLI output: surface the last line as the error message
    assert False, process_info[-1]



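def missingValues(dataF):
  '''
  Description:
    Counts the missing (NaN) values in each column of a DataFrame.

  Parameters:
      dataF (pd.DataFrame): DataFrame to inspect.

  Returns:
      pd.Series with the NaN count of every column that has at least one
      missing value, or the string "No missing values" if there are none.
  '''
  k = dataF.isna().sum()
  miss_values = k[k!=0]
  if miss_values.size>0:
    return (miss_values)
  return ("No missing values")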


def selectRandomBreeds(dataF,num_breeds=5):

  labels=pd.unique(dataF.binomial).tolist()
  np.random.shuffle(labels)
  return labels[:num_breeds]
  

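def displayImagesForRandomBreeds(df,num_breeds = 5, num_imgs_forbreed= 2):
  labels = selectRandomBreeds(df,num_breeds)
  for label in labels:
    sub_df=df[df.binomial==label]
    sub_df=sub_df.sample(frac=1)[:num_imgs_forbreed]
    img_paths=sub_df.image_path.tolist()

    # One column per requested image; squeeze=False keeps axarr 2-D even for a single column
    f, axarr = plt.subplots(1, num_imgs_forbreed, squeeze=False)

    for i,img in enumerate(img_paths):
      path_ ="data/"+img.split('/')[-1]
      print(path_)
      img = cv2.imread(path_)
      if img is None:  # cv2.imread returns None when the file is missing or unreadable
        print(bcolors.WARNING + f"Could not read {path_}" + bcolors.ENDC)
        continue
      img=cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
      axarr[0][i].imshow(img)
      axarr[0][i].title.set_text(f'Shape: {img.shape}')
    plt.suptitle(f"Breed: {label}")
    plt.show()
    print()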

## Data exploration

- Reading main .csv file

In [None]:
df = pd.read_csv('data_csv/data.csv')

- First samples

In [None]:
df.head(5)

- Last samples

In [None]:
df.tail(5)

- Missing values in columns

In [None]:
missingValues(df)
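
As a small illustration of what the missing-value check reports, the sketch below applies the same `isna().sum()` logic to a toy DataFrame. The column names and values are made up for the example and are not taken from `data.csv`; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame; the columns and values are hypothetical, not the real data.csv schema
toy = pd.DataFrame({
    "image_path": ["a.jpg", "b.jpg", None, "d.jpg"],
    "binomial": ["Naja naja", None, None, "Boa constrictor"],
    "country": ["India", "India", "Brazil", "Brazil"],
})


def missing_summary(frame):
    """Return per-column NaN counts, keeping only columns with at least one NaN."""
    counts = frame.isna().sum()
    return counts[counts != 0]


summary = missing_summary(toy)
print(summary)
```

Columns without gaps (here `country`) are filtered out, so the result only lists the columns that need attention.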

- Data types

In [None]:
for c in df.columns:
    print ("%20s"%c, df[c].dtype)

- Data Size

In [None]:
df.shape

- Breeds:

In [None]:
print(pd.unique(df['breed']))

- Total breeds:

In [None]:
len(pd.unique(df['breed']))

- Total images for each breed:

In [None]:
df['breed'].value_counts()
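
Raw counts are easier to judge next to their relative share. The following is a minimal sketch, using a made-up label series rather than the real `breed` column, that derives the share of each class and the ratio between the largest and the smallest class.

```python
import pandas as pd

# Hypothetical labels standing in for df['breed']
labels = pd.Series(["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"], name="breed")

counts = labels.value_counts()                 # absolute number of images per class
shares = labels.value_counts(normalize=True)   # fraction of the dataset per class
imbalance_ratio = counts.max() / counts.min()  # 1.0 would mean perfectly balanced

print(pd.DataFrame({"count": counts, "share": shares}))
print("Imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 suggests that some classes have much less data than others, which is worth knowing before any model is trained.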

- Data density:

In [None]:
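# sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
sns.histplot(df['breed_encode'], kde=True, stat="density");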

### Visualizing Images

#### Loading data

 - To use Kaggle's public API, you must first authenticate with an API token. Follow these steps: <br/> <br/>
     1. Go to https://www.kaggle.com/
     2. Click on your user profile picture
     3. Click on "Account" in the dropdown menu. This will take you to your account settings
     4. Scroll down to the section of the page labelled API
     5. To create a new token, click on the "Create New API Token" button. This will download a fresh authentication token named "kaggle.json" onto your machine
     6.  <font color='red'>YOU MUST UPLOAD THE kaggle.json FILE TO THE ROOT FOLDER</font> 
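
Once the file is uploaded, it can help to check that it looks like a usable token before any download is attempted. The sketch below assumes the token is a JSON object with `username` and `key` fields; `check_kaggle_token` is a hypothetical helper and not part of the Kaggle API.

```python
import json
import os


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found in the token file (an empty list means it looks usable)."""
    if not os.path.exists(path):
        return [f"{path} not found"]
    try:
        with open(path, encoding="utf-8") as fh:
            token = json.load(fh)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the token carries these two fields
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing or empty field: {field}")
    return problems


for problem in check_kaggle_token():
    print(problem)
```

An empty result only means the file is well formed; whether the credentials are accepted is decided by Kaggle when the download runs.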



In [9]:
assert os.path.exists('kaggle.json'), "kaggle.json file has not been uploaded" 
print ("... kaggle.json file has been uploaded")
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd() #Setup kaggle.json dir

... kaggle.json file has been uploaded


- Downloading original dataset [snakeclef2021](https://www.kaggle.com/deividt/snakeclef2021) from Kaggle. 
<br/>
<font color='red'>Fetching data and unzipping files can take a long time (at least 30 min)</font> 
<br/>


In [None]:
fetchDatasetFromKaggle( dataset_name = "snakeclef2021")

Fetching data from kaggle ( This may take some time)...
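
`fetchDatasetFromKaggle` decides what to do by looking for three markers in the captured command output: "Skipping", "404 - Not Found" and "100%". The sketch below isolates that decision as a plain function so it can be read and tried without running a download. The function name and the sample lines are hypothetical; the samples are illustrative strings, not real CLI logs.

```python
def classify_download_output(lines):
    """Map captured output lines to 'skipped', 'not_found', 'downloaded' or 'unknown'."""
    if not lines:
        return "unknown"
    first, last = lines[0], lines[-1]
    if "Skipping" in first:          # a local copy already exists
        return "skipped"
    if "404 - Not Found" in first:   # the dataset name does not exist in the account
        return "not_found"
    if "100%" in last:               # the progress output reached completion
        return "downloaded"
    return "unknown"


# Illustrative inputs only
print(classify_download_output(["example.zip: Skipping, local copy found"]))
print(classify_download_output(["404 - Not Found"]))
print(classify_download_output(["Downloading example.zip", "100% done"]))
```

Keeping the string matching in one place makes it simpler to adjust if the wording of the command output changes.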


- Display images for random breeds:

In [None]:
displayImagesForRandomBreeds(df,num_breeds = 5, num_imgs_forbreed= 2)