<a href="https://colab.research.google.com/github/dataeducator/capstone/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project Submission: Capstone

- Student Name: Tenicka Norwood
- Program Pace: self-paced
- Scheduled Project Review Time: Tuesday, September 19, 2023, 12 pm
- Instructor name: Morgan Jones
- Blog post Url: https://medium.com/mlearning-ai/fueling-student-success-1723abd2991b

In this project I will use the __CRISP-DM__ process, which has six phases:

* Business Understanding __&#8594;__ Understanding the project objectives, requirements, and constraints from a business perspective.

* Data Understanding __&#8594;__ Exploring and assessing the available data, its quality, structure, and initial insights.

* Data Preparation __&#8594;__ Cleaning, transforming, and preparing the data to be used for modeling, including handling missing values and outliers.

* Modeling __&#8594;__ Selecting and applying appropriate machine learning algorithms or techniques to build predictive or descriptive models.

* Evaluation __&#8594;__ Assessing the performance of the models and determining their suitability for solving the business problem.

* Deployment __&#8594;__ Integrating the chosen model into the business environment, making it accessible for end-users.

## Business Understanding


#### __Disclaimer:__
This Jupyter notebook and its contents are __intended solely for educational purposes__. The included business case and the results of the deep learning models should not be interpreted as medical advice, and have not received endorsement or approval from any professional or medical organization.

The models and outcomes presented here are for illustrative purposes __only__. Users should __not__ use these models or their outcomes for making real-world decisions without consulting appropriate domain experts and medical professionals. Any actions taken based on the information in this notebook are at the user's own risk.
The author and contributors of this notebook disclaim any liability for the accuracy, completeness, or efficacy of the information provided.

## Data Understanding

## __Metrics__
We will prioritize recall over accuracy in this project, while maintaining a high level of precision. Recall (sensitivity) measures how many of the patients who have pneumonia the model correctly identifies, so prioritizing it reduces false negatives (missed cases). Precision measures how many of the model's positive predictions are correct, so keeping it high limits false positives, which could lead to unnecessary treatment or interventions.


* __True Positives (TP):__ The model predicted the positive class (PNEUMONIA) and the image belongs to that class.

* __True Negatives (TN):__ The model predicted the negative class (NORMAL) and the image belongs to that class.

* __False Positives (FP):__ The model predicted the positive class, but the image belongs to the negative class.

* __False Negatives (FN):__ The model predicted the negative class, but the image belongs to the positive class.

$$
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}
$$

$$
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
$$

A high precision indicates that when our model predicts pneumonia, the patient is likely to have pneumonia.
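
The two formulas above can be checked by hand on a small example. The following is a minimal sketch for the binary case, not the evaluation code used later in this notebook; the function name `precision_recall` and the label lists are hypothetical and exist only for illustration.

```python
def precision_recall(y_true, y_pred, positive_label=1):
    """Return (precision, recall) for a binary problem.

    y_true and y_pred are equal-length sequences of labels.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    # Guard against division by zero when there are no positive predictions or labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = PNEUMONIA, 0 = NORMAL
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this made-up example there are three true positives, one false positive, and one false negative, so both metrics come out to 3 / 4.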

# Downloading and Preparing the Dataset for Deep Learning Analysis
1. __Create or Log in to Your Kaggle Account:__
    If you do not already have a Kaggle account, create one. If you have an account, log in.
2. __Access the Pneumonia Dataset:__
    Go to the following direct link to access the dataset on Kaggle: [Dataset](https://www.kaggle.com/datasets/)
3. __Download the Dataset:__
    On the dataset page, you will see a "Download" button. Click it to download the dataset.
    The dataset is approximately __2GB__.

4. __Unzip the File and Add the Unzipped Archive to Your Google Drive:__
    After downloading and unzipping the dataset, you will have a folder named 'archive' that contains the dataset. To use this notebook, you will need to provide the location of the .zip file in your Google Drive.

5. __Run the Next Two Cells Without Making Any Changes:__
    Mount your Google Drive to allow Colab to access it.
    Unzip the file so you can use the contents of this notebook.
    The file should include train, test, and val folders that contain subfolders with NORMAL and PNEUMONIA images.

In [None]:
# Run cell without making any changes
import os  # needed for os.path.exists in get_dataset_location

from google.colab import drive

class ObtainData:
  """
  A class to obtain dataset location from Google Drive.

  Usage:
  data_obtainer = ObtainData()
  dataset_location = data_obtainer.get_dataset_location()
  """

  def __init__(self):
    self.drive_mounted = False

  def mount_drive(self):
    """
    Mounts Google Drive to '/content/drive'.
    """
    drive.mount('/content/drive')
    self.drive_mounted = True

  def get_dataset_location(self):
    """
    Prompts the user to enter the location of the dataset folder in their Google Drive.
    Returns the full file path of the dataset location.
    """

    while True:
      # Mount Google Drive if not already mounted
      if not self.drive_mounted:
        self.mount_drive()

      # Provide a template for the user input
      example_input ="/MyDrive/Your_Folder_name/"
      dataset_location = input(f"Enter the location of the dataset folder in your Google Drive (e.g., {example_input}):")
      file_path = f'/content/drive{dataset_location}'

      # Check if the file exists
      if os.path.exists(file_path):
        print(f" The file '{dataset_location}' exists in your Google Drive.")
        return file_path
      else:
        print(f" The file '{dataset_location}' does not exist in your Google Drive. Please try again.")

In [None]:
# Run cell without making any changes
# Create an instance of the ObtainData class
data_obtainer = ObtainData()

# Get the dataset location
dataset_location = data_obtainer.get_dataset_location()
print(f"Dataset location: {dataset_location}")
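
The steps above mention unzipping the file, but the two cells shown only mount Google Drive and ask for the dataset location. If the archive still needs to be extracted, one possible approach uses Python's standard `zipfile` module. This is a sketch under stated assumptions: the helper name `extract_dataset` is hypothetical, and the demo builds a tiny throwaway archive so the code runs anywhere instead of touching the real dataset.

```python
import os
import tempfile
import zipfile


def extract_dataset(zip_path, extract_to):
    """Extract zip_path into extract_to and return the split folders found."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_to)

    expected = ('train', 'test', 'val')
    found = []
    # Walk the extracted tree because the splits may sit inside a parent folder
    for root, dirs, _ in os.walk(extract_to):
        found.extend(os.path.join(root, d) for d in dirs if d in expected)
    return sorted(found)


# Demo on a tiny stand-in archive (assumption: not the real Kaggle download)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'archive.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        for split in ('train', 'test', 'val'):
            zf.writestr(f'{split}/NORMAL/img.jpeg', b'placeholder')
    found_splits = extract_dataset(demo_zip, os.path.join(tmp, 'data'))
    found_names = [os.path.basename(path) for path in found_splits]

print(found_names)
```

With the real dataset, `zip_path` would point at the archive inside `/content/drive/...` and `extract_to` at a working folder such as one under `/content`; both locations depend on your own Drive layout.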

## Data Preparation


In [None]:
import os
import shutil
import random


class DatasetPaths:
  """
  Helper class to manage paths for different sets and classes of the dataset
  """

  def __init__(self, base_location):
    self.base_location = base_location
    self.class_names = ['NORMAL', 'PNEUMONIA']
    self.set_names = ['train', 'test', 'val']

  def get_single_path(self, set_name):
    """
    Get the path for a specific set (for example, 'train').
    """
    return os.path.join(self.base_location, set_name)

  def get_path(self, set_name, class_name):
    """
    Get the path for a specific set and class.

    Parameters:
        set_name(str): Name of the dataset set('train', 'test', 'val')
        class_name(str): Name of the class ('NORMAL' or 'PNEUMONIA')

    Returns:
        path(str): Path to the specified set and class.
    """
    return f"{self.base_location}/{set_name}/{class_name}"

  def get_all_paths(self):
    """
    Get a dictionary containing all paths for the dataset.

    Returns:
      paths(dict): Dictionary containing paths for each set and class.
    """

    paths = {}
    for set_name in self.set_names:
        paths[set_name] = {}
        for class_name in self.class_names:
          paths[set_name][class_name] = self.get_path(set_name, class_name)
    return paths

In this section, we used the <code>DatasetPaths</code> class to calculate and display the distribution of images across the different sets and classes within the dataset. The output displays the number of images for each combination of set (training, testing, and validation) and class ('NORMAL' and 'PNEUMONIA'). Next, we will create a visualization to get a quick view of the dataset's composition.
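
The counting cell itself is not shown here, although a later cell reads a nested dictionary named `num_images_dict`. The following sketch shows one way such a dictionary could be built with the standard library. The function name `count_images`, the handling of missing folders, and the demo folder contents are assumptions made for illustration, not the author's original code.

```python
import os
import tempfile


def count_images(base_location, set_names=('train', 'test', 'val'),
                 class_names=('NORMAL', 'PNEUMONIA')):
    """Return {set_name: {class_name: file_count}} for the dataset folders."""
    counts = {}
    for set_name in set_names:
        counts[set_name] = {}
        for class_name in class_names:
            class_dir = os.path.join(base_location, set_name, class_name)
            # Missing folders are reported as 0 rather than raising an error
            n = len(os.listdir(class_dir)) if os.path.isdir(class_dir) else 0
            counts[set_name][class_name] = n
    return counts


# Demo on a throwaway folder tree (assumption: stand-in for the real dataset)
with tempfile.TemporaryDirectory() as demo_dir:
    for set_name, class_name, n in [('train', 'NORMAL', 2), ('train', 'PNEUMONIA', 3)]:
        class_dir = os.path.join(demo_dir, set_name, class_name)
        os.makedirs(class_dir)
        for i in range(n):
            open(os.path.join(class_dir, f'img_{i}.jpeg'), 'w').close()
    demo_counts = count_images(demo_dir)

for set_name, per_class in demo_counts.items():
    print(set_name, per_class)
```

In the notebook, calling `count_images(dataset_location)` would produce a dictionary with the same shape that the plotting cell expects from `num_images_dict`.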

__ClassDistributionPlot Class Description__

The <code>ClassDistributionPlot</code> class is a helper class designed to create a bar plot to visualize the distribution of classes across different dataset sets.

__Features:__
* Uses the <code>DatasetPaths</code> class to manage and access dataset paths.
* Counts the images per class for each dataset set by listing the files in each class folder.
* Aligns the bars for each class side by side for comparison.
* Adopts the <code>fivethirtyeight</code> style for consistency.

__Usage:__
1. Create an instance of the <code>DatasetPaths</code> class.
2. Create an instance of the <code>ClassDistributionPlot</code> class, providing the <code>DatasetPaths</code> instance along with the set names and class names.
3. Use the <code>plot_distribution()</code> method to generate a bar plot that illustrates class distribution across the different sets.

In [None]:
import os
import matplotlib.pyplot as plt


class ClassDistributionPlot:
    """
    Helper class to create class distribution plots for different dataset sets.
    """

    def __init__(self, dataset_paths, set_names, class_names):
        """
        Initialize the ClassDistributionPlot instance.

        Parameters:
            dataset_paths (DatasetPaths): An instance of the DatasetPaths class.
            set_names (list): List of dataset set names ('train', 'test', 'val').
            class_names (list): List of class names.
        """
        self.dataset_paths = dataset_paths
        self.set_names = set_names
        self.class_names = class_names

    def plot_distribution(self):
        """
        Create and display class distribution plots for different dataset sets.
        """
        plt.style.use('fivethirtyeight')
        plt.figure(figsize=(10, 8))
        x = range(len(self.class_names))

        for i, set_name in enumerate(self.set_names):
            num_images_per_class = [
                len(os.listdir(self.dataset_paths.get_path(set_name, class_name)))
                for class_name in self.class_names
            ]
            plt.bar([j + i * 0.2 for j in x], num_images_per_class, width=0.2,
                    align='center', label=set_name.capitalize())

        plt.xlabel('Class')
        plt.ylabel('Number of Images')
        plt.title('Class Distribution Across Sets')
        plt.xticks([i + 0.2 for i in x], self.class_names)
        plt.legend()

        plt.tight_layout()
        plt.show()

In [None]:
# Extract class names and the corresponding number of images for each class
# (assumes `dataset_paths` and `num_images_dict` were created in an earlier cell)
set_names = dataset_paths.set_names
class_names = dataset_paths.class_names
num_images_per_class = {
    'train': [num_images_dict['train'][class_name] for class_name in class_names],
    'test': [num_images_dict['test'][class_name] for class_name in class_names],
    'val': [num_images_dict['val'][class_name] for class_name in class_names]
}

# Create an instance of the ClassDistributionPlot class
distribution_plot = ClassDistributionPlot(dataset_paths, set_names, class_names)

# Plot and display class distribution
distribution_plot.plot_distribution()

Next, we will implement the `ScrubData` class for preprocessing our images.

### Key Features

The `ScrubData` class provides the following key features:

1. **Count Images**: It counts the number of images in each class for different sets (e.g., training, testing, validation) within your dataset.

2. **Create Data Generator**: It creates data generators using the Keras `ImageDataGenerator` class, simplifying the configuration of data augmentation and normalization.

3. **Plot Sample Images**: It enables you to visualize sample images from your dataset for a specified class, aiding in data exploration and understanding.
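
To make the normalization in item 2 concrete: Keras's `ImageDataGenerator` multiplies each pixel value by its `rescale` argument, so a factor of 1/255 maps 8-bit intensities into the range 0 to 1. The following is a pure-Python sketch of that arithmetic only; the helper name `rescale_pixels` and the pixel values are hypothetical.

```python
def rescale_pixels(pixels, rescale=1.0 / 255.0):
    """Multiply every pixel value by the rescale factor."""
    return [p * rescale for p in pixels]


# Hypothetical 8-bit intensities: black, a dark gray, and white
raw_pixels = [0, 51, 255]
scaled_pixels = rescale_pixels(raw_pixels)
print(scaled_pixels)
```

After rescaling, black stays at 0, white lands at (approximately) 1, and every other intensity falls in between, which keeps the network's inputs on a small, consistent scale.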

### Usage

Here's how you can use the `ScrubData` class:

* Initialize the ScrubData instance with your dataset location and optional parameters
    - <code>base_dataset_location = '/path/to/your/dataset'</code>
    - <code>scrubber = ScrubData(base_dataset_location)</code>

* Count images in different sets

* Create data generators

* Plot sample images for data exploration

In this section, we will visualize the data that we preprocessed using the ScrubData class. We will create an array of NORMAL and PNEUMONIA images and review the pixel intensities of sample images from each class.
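
As an illustration of what reviewing pixel intensities can involve, the sketch below bins the intensities of a flattened grayscale image into a coarse histogram. It uses only the standard library; the function name `intensity_histogram`, the bin count, and the sample pixel values are assumptions for illustration and are not taken from the dataset.

```python
from collections import Counter


def intensity_histogram(pixels, num_bins=8, max_value=255):
    """Count pixel values per equal-width bin.

    pixels is a flat iterable of integer intensities in [0, max_value].
    """
    bin_width = (max_value + 1) / num_bins
    # Clamp to the last bin so max_value never falls outside the range
    counts = Counter(min(int(p // bin_width), num_bins - 1) for p in pixels)
    return [counts.get(b, 0) for b in range(num_bins)]


# Hypothetical 2x4 grayscale image, flattened row by row
sample_pixels = [0, 12, 40, 90, 128, 200, 254, 255]
histogram = intensity_histogram(sample_pixels)
print(histogram)
```

Comparing such histograms for a NORMAL image and a PNEUMONIA image gives a rough view of how brightness is distributed in each class.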

In [None]:
import os
import random

import matplotlib.pyplot as plt
from keras.preprocessing.image import ImageDataGenerator

class ScrubData:
    """
    The ScrubData class is responsible for managing and processing the dataset for training and testing.

    Parameters:
        base_dataset_location (str): The base directory path of the dataset.
        image_size (tuple): The dimensions to which the images will be resized.
        class_names (list): List of class names in the dataset.
        batch_size_train (dict): Number of images in each class for the training set.
        batch_size_test (dict): Number of images in each class for the test set.
        batch_size_val (dict): Number of images in each class for the validation set.

    Methods:
        count_images(set_name): Counts the number of images in each class within a specified set.
        create_data_generator(directory, batch_size): Creates a data generator for a specified directory.
        plot_sample_images(data_generator, class_name, num_images, num_rows, num_cols, figsize):
            Plots a set of sample images from the data generator.
    """
    # Constructor method to initialize class attributes
    def __init__(self, base_dataset_location, image_size=(224, 224), class_names=('NORMAL', 'PNEUMONIA')):
        self.base_dataset_location = base_dataset_location
        self.image_size = image_size
        # Copy into a list so the default value is never shared or mutated
        self.class_names = list(class_names)

        # Calculate batch sizes for different sets
        self.batch_size_train = self.count_images('train_resampled')
        self.batch_size_test = self.count_images('test')
        self.batch_size_val = self.count_images('val')

    # Method to count the number of images in each class within a set
    def count_images(self, set_name):
        counts = {}
        set_path = os.path.join(self.base_dataset_location, set_name)

        for class_name in self.class_names:
            class_path = os.path.join(set_path, class_name)
            if os.path.exists(class_path):
                num_images = len(os.listdir(class_path))
                counts[class_name] = num_images

        return counts

    # Method to create a data generator for a specified directory
    def create_data_generator(self, directory, batch_size):
        # Both classes use the same normalization: scale pixel values to [0, 1]
        normalization_params = {'rescale': 1.0 / 255.0}
        # Specify class mode for the data generator
        class_mode = 'binary'

        return ImageDataGenerator(**normalization_params).flow_from_directory(
            directory,
            target_size=self.image_size,
            batch_size=batch_size,
            class_mode=class_mode,
            seed=42
        )

    # Method to plot sample images from a data generator
    def plot_sample_images(self, data_generator, class_name, num_images=16, num_rows=4, num_cols=4, figsize=(10, 10)):
        class_index = self.class_names.index(class_name)
        class_indices = [index for index, label in enumerate(data_generator.labels) if label == class_index]
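        # Never request more images than the class contains or the grid can hold
        num_images = min(num_images, len(class_indices), num_rows * num_cols)
        random_class_indices = random.sample(class_indices, num_images)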

        plt.figure(figsize=figsize)
        plt.suptitle(class_name, fontsize=16)

        for i, index in enumerate(random_class_indices):
            plt.subplot(num_rows, num_cols, i + 1)
            image = data_generator._get_batches_of_transformed_samples([index])[0][0]
            plt.imshow(image)
            plt.axis('off')

        # The rect argument below reserves space at the top for the title

        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        plt.show()

## Modeling


## Evaluation


## Deployment