# **Data Collection**

## Objectives

* Fetch the opened/closed eyes image dataset from Kaggle, sort the images into "Awake" and "Drowsy" classes, and split them into train, validation and test sets.

## Inputs

* Kaggle authentication token: kaggle.json

## Outputs

* Generate the dataset folder inputs/drowsiness, containing train, validation and test subfolders, each with Awake and Drowsy image folders.

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Import Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import tensorflow as tf
sns.set_style("white")
from matplotlib.image import imread

# Change working directory

In [2]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\cmjim\\ml_pp5_drowsiness_detector\\jupyter_notebooks'

* os.path.dirname() returns the parent directory of a path
* os.chdir() sets the new current directory
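
The two calls listed above can be combined so that the project path does not have to be typed by hand. This is a minimal sketch, assuming the notebook is started from the jupyter_notebooks subfolder of the project; the helper name `change_to_parent_dir` is hypothetical and not part of the original notebook.

```python
import os


def change_to_parent_dir():
    """Make the parent of the current working directory the new working directory."""
    parent_dir = os.path.dirname(os.getcwd())  # one level above the current directory
    os.chdir(parent_dir)
    return os.getcwd()


# Usage (run once, from inside jupyter_notebooks):
# change_to_parent_dir()
```

Running it a second time moves up another level, so it should be executed only once per session.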

In [3]:
os.chdir('C:\\Users\\cmjim\\ml_pp5_drowsiness_detector')
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'C:\\Users\\cmjim\\ml_pp5_drowsiness_detector'

# Install Kaggle

In [6]:
pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
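
The `chmod` shell command may not be available in every environment (a default Windows command prompt, for example). As a hedged alternative, the same owner-only permission can be requested from Python with `os.chmod`. The helper name `restrict_to_owner` is hypothetical, and on Windows `os.chmod` only controls the read-only flag, so the call has limited effect there.

```python
import os
import stat


def restrict_to_owner(path):
    """Give the file owner read/write access only (the equivalent of chmod 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Usage, assuming kaggle.json sits in the current working directory:
# restrict_to_owner('kaggle.json')
```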

Set the Kaggle dataset path and the destination folder, then download the dataset

In [8]:
KaggleDataset = "hazemfahmy/openned-closed-eyes"
DestinationFolder = "inputs/drowsiness"
! kaggle datasets download -d {KaggleDataset} -p {DestinationFolder}

Downloading openned-closed-eyes.zip to inputs/drowsiness

Unzip the downloaded file and delete the zip file

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/openned-closed-eyes.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/openned-closed-eyes.zip')
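
A slightly more defensive variant of this step is sketched below: it checks that the download is a valid zip archive before extracting, and deletes the archive only after extraction has finished. The helper name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination_folder):
    """Extract zip_path into destination_folder, then delete the archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination_folder)
    os.remove(zip_path)  # reached only if extraction succeeded


# Usage with the names defined above:
# extract_and_remove(os.path.join(DestinationFolder, 'openned-closed-eyes.zip'), DestinationFolder)
```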

---

# Data Preparation

---

To clean up the data, we create two folders: "Awake" for images of open eyes and "Drowsy" for images of closed eyes. We then walk through the downloaded folders and move the files from every folder named "Opened" into "Awake" and from every folder named "Closed" into "Drowsy". Finally, we delete any files that are not .jpg images and remove the folders left empty.

In [10]:
import shutil

In [11]:
def move_open_folders_to_open(source_dir, destination_dir, folder_to_move='Opened', destination_folder='Awake'):
    """Move every file found in folders named `folder_to_move` into one `destination_folder`."""
    open_folder_path = os.path.join(destination_dir, destination_folder)
    os.makedirs(open_folder_path, exist_ok=True)

    # walk the whole tree and empty each folder whose name matches folder_to_move
    for root, _, files in os.walk(source_dir):
        if os.path.basename(root) == folder_to_move:
            for file in files:
                file_path = os.path.join(root, file)
                shutil.move(file_path, os.path.join(open_folder_path, file))

source_directory_path = 'inputs/drowsiness'
destination_directory_path = 'inputs/drowsiness'

move_open_folders_to_open(source_directory_path, destination_directory_path)

In [12]:
def move_close_folders_to_close(source_dir, destination_dir, folder_to_move='Closed', destination_folder='Drowsy'):
    """Move every file found in folders named `folder_to_move` into one `destination_folder`."""
    close_folder_path = os.path.join(destination_dir, destination_folder)
    os.makedirs(close_folder_path, exist_ok=True)

    # walk the whole tree and empty each folder whose name matches folder_to_move
    for root, _, files in os.walk(source_dir):
        if os.path.basename(root) == folder_to_move:
            for file in files:
                file_path = os.path.join(root, file)
                shutil.move(file_path, os.path.join(close_folder_path, file))

source_directory_path = 'inputs/drowsiness'
destination_directory_path = 'inputs/drowsiness'

move_close_folders_to_close(source_directory_path, destination_directory_path)
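
The two functions above differ only in their default arguments, and a file whose name already exists in the destination folder may be overwritten by `shutil.move`. The sketch below is one possible generalisation, not the notebook's method: a single hypothetical helper, `collect_files`, that adds a numeric suffix when a name is already taken and returns the number of files moved.

```python
import os
import shutil


def collect_files(source_dir, destination_dir, folder_to_move, destination_folder):
    """Move files from every folder named folder_to_move into destination_folder."""
    target_dir = os.path.join(destination_dir, destination_folder)
    os.makedirs(target_dir, exist_ok=True)

    moved = 0
    for root, _, files in os.walk(source_dir):
        if os.path.basename(root) != folder_to_move:
            continue
        for file_name in files:
            stem, ext = os.path.splitext(file_name)
            target = os.path.join(target_dir, file_name)
            suffix = 1
            # pick a new name instead of overwriting an existing file
            while os.path.exists(target):
                target = os.path.join(target_dir, f"{stem}_{suffix}{ext}")
                suffix += 1
            shutil.move(os.path.join(root, file_name), target)
            moved += 1
    return moved


# Usage with the folder names from the cells above:
# collect_files('inputs/drowsiness', 'inputs/drowsiness', 'Opened', 'Awake')
# collect_files('inputs/drowsiness', 'inputs/drowsiness', 'Closed', 'Drowsy')
```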

In [13]:
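def delete_files_with_extension(directory, file_extension):
    """Delete every file under `directory` that does NOT end with `file_extension`.

    Despite its name, the function keeps the files that have the given extension.
    """
    for root, _, files in os.walk(directory):
        for file in files:
            if not file.endswith(file_extension):
                os.remove(os.path.join(root, file))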

directory_path = 'inputs/drowsiness'
file_extension_to_save = '.jpg'

delete_files_with_extension(directory_path, file_extension_to_save)

In [14]:
def delete_empty_folders(directory):
    for root, dirs, files in os.walk(directory, topdown=False):
        for dir_name in dirs:
            folder_path = os.path.join(root, dir_name)
            if not os.listdir(folder_path):
                # If the folder is empty, delete it
                os.rmdir(folder_path)

directory_path = r'inputs\drowsiness'
delete_empty_folders(directory_path)

---

## Data Cleaning

Check for and remove non-image files in the input folders by verifying that each filename ends with an accepted image extension.

In [15]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0      # files kept
        non_image_count = 0  # files deleted
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [16]:
remove_non_image_file(r'inputs\drowsiness')
# The r prefix makes this a raw string, so the backslash in the path is not treated as an escape character

Folder: Awake - has image file 4933
Folder: Awake - has non-image file 0
Folder: Drowsy - has image file 4936
Folder: Drowsy - has non-image file 0


---

## Split Dataset into Train, Validation, and Test sets

In [17]:
import os
import shutil
import random
import math


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance, because a sum of float ratios is not always exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels
    labels = os.listdir(my_data_dir)  # expected to contain only the class folder names
    if 'test' in labels:
        # the dataset has already been split, so do nothing
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
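
`random.shuffle` is not seeded here, so the split differs from run to run. If a reproducible split is wanted, one option (an assumption, not something the notebook does) is to shuffle with a seeded `random.Random` instance. The helper name `shuffled_files` and the seed value are hypothetical.

```python
import random


def shuffled_files(files, seed=42):
    """Return a shuffled copy of files that is identical for a given seed."""
    ordered = sorted(files)               # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(ordered)  # private generator, global random state untouched
    return ordered


# Usage inside the split function, in place of random.shuffle(files):
# files = shuffled_files(os.listdir(my_data_dir + '/' + label))
```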

* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives the remaining 20% of the data (ratio 0.20).

In [18]:
split_train_validation_test_images(my_data_dir=r"inputs\drowsiness",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
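
The split function truncates the train and validation counts with `int()` and sends every remaining file to the test set. The arithmetic can be checked against the image counts reported by the cleaning step (4933 Awake, 4936 Drowsy). The helper name `split_sizes` is hypothetical.

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts using the same rule as the split function."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    test_qty = total_files - train_qty - validation_qty  # everything left over
    return train_qty, validation_qty, test_qty


print(split_sizes(4933, 0.7, 0.1))  # Awake:  (3453, 493, 987)
print(split_sizes(4936, 0.7, 0.1))  # Drowsy: (3455, 493, 988)
```

Because of the truncation, the test set ends up slightly above the nominal 20% for each class.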