# Phase 1: Business Understanding

## Project context

The health insurance sector in Tunisia has grown significantly in recent years, driven by increasing demand for better healthcare coverage. While insurers play a crucial role in society, handling large volumes of paperwork remains a major challenge for them.

Health insurance companies process vast amounts of medical documents daily. Manual categorization is **time-consuming, error-prone, and inefficient**. The objective of this project is to **automate document classification**, minimizing human intervention while improving speed and accuracy.

### Objective
We aim to develop an **automated document classification model** using **convolutional neural networks (CNNs)**. The model will distinguish between two primary categories of medical documents:

1. **Medical Care Forms** – Documents related to health insurance claims.
2. **Prescriptions** – Documents containing doctor-issued prescriptions for medications.

To ensure optimal performance, we will also **experiment with a pre-trained model** (VGG16) and compare its results against our custom CNN model. The most effective model will be selected to enhance the **document management workflow** in health insurance companies.

In [1]:
import os
import random
import shutil
from PIL import Image, ImageEnhance

# **Phase 2: Data Acquisition and Understanding**

## **Dataset Overview**
To build our medical document classification model, we collected scanned medical documents in the following categories:

- **Medical Care Forms**: 1,600 images  
- **Prescriptions**: 580 images  

These documents were sourced from a **health insurance company** and stored in a folder named `dataset_classification`. Each category contains scanned images of medical documents with varying resolutions, formats, and quality.
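
The two listed categories are imbalanced (1,600 versus 580 images). One common way to account for this during training is inverse-frequency class weighting. The sketch below only illustrates the arithmetic; it is an assumption about a possible follow-up, not a step taken from this notebook, and the function name `inverse_frequency_weights` and the dictionary keys are hypothetical.

```python
def inverse_frequency_weights(class_counts):
    """Return weight_c = N / (K * n_c) for each class c.

    N is the total number of images, K the number of classes,
    and n_c the number of images in class c.
    """
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {
        name: total / (num_classes * count)
        for name, count in class_counts.items()
    }

# Image counts taken from the dataset overview; the key names are illustrative.
counts = {"medical_care_forms": 1600, "prescriptions": 580}
weights = inverse_frequency_weights(counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

With these weights, the smaller class contributes more per image to the loss, which roughly equalizes the total contribution of each class.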

## **Data Preprocessing**
To enhance the quality of the images and ensure consistency, we applied several preprocessing steps:

### **1. Grayscale Conversion**
Since color information is not critical for document classification, we converted all images to grayscale to reduce computational complexity and focus on text-based features.
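
As an illustration of what grayscale conversion computes, Pillow documents that `convert("L")` on an RGB image uses the ITU-R 601-2 luma transform, `L = R * 299/1000 + G * 587/1000 + B * 114/1000`. The plain-Python sketch below reproduces that weighted sum on a few hand-picked pixels; the helper names are hypothetical, and Pillow's internal rounding may differ slightly from the integer division used here.

```python
def luma(r, g, b):
    """ITU-R 601-2 luma: weighted sum of 8-bit red, green and blue values."""
    return (r * 299 + g * 587 + b * 114) // 1000

def to_grayscale(rgb_pixels):
    """Map a list of (R, G, B) tuples to a list of single-channel intensities."""
    return [luma(r, g, b) for r, g, b in rgb_pixels]

# White, black, and a red stamp-like color (illustrative values).
pixels = [(255, 255, 255), (0, 0, 0), (200, 30, 30)]
print(to_grayscale(pixels))
```

Each pixel shrinks from three values to one, which is where the reduction in input size comes from.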

### **2. Image Sharpening**
To enhance text clarity and improve feature detection, we applied sharpening using Pillow's `ImageEnhance.Sharpness` with an enhancement factor of 2.0. This step increases local contrast along the edges of characters, so text strokes stand out more clearly from the background.
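
To make the effect of the enhancement factor concrete, the sketch below applies the same idea to a single row of pixel values: blend between a smoothed copy and the original, where a factor of 0.0 gives the smoothed row, 1.0 gives the original, and values above 1.0 extrapolate past the original. This is a simplified one-dimensional illustration, not Pillow's implementation; the 3-tap average is a stand-in for Pillow's smoothing filter, and both function names are hypothetical.

```python
def blur_row(row):
    """3-tap moving average with edge replication (stand-in smoothing filter)."""
    padded = [row[0]] + list(row) + [row[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(row))]

def sharpen_row(row, factor):
    """Blend smoothed and original values, then clip to the 8-bit range."""
    blurred = blur_row(row)
    result = []
    for original, smooth in zip(row, blurred):
        value = smooth + factor * (original - smooth)
        result.append(max(0, min(255, int(round(value)))))
    return result

# A dark-to-light edge, as at the border of a printed character.
edge = [50, 50, 50, 200, 200, 200]
print(sharpen_row(edge, 1.0))  # factor 1.0 leaves the row unchanged
print(sharpen_row(edge, 2.0))  # factor 2.0 exaggerates the step at the edge
```

Only the pixels next to the edge change, which is why sharpening affects text strokes much more than flat background regions.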

### **3. Resizing**
All images were resized to a uniform **512x512 pixels** using the **LANCZOS** resampling method to standardize input dimensions for our deep learning model.
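
Resizing directly to a square changes the aspect ratio of non-square scans. The notebook resizes directly; the sketch below shows the arithmetic for one possible alternative, fitting the image inside the square and padding the rest ("letterboxing"). It is offered as an assumption for comparison, not as the method used here. The function name `letterbox_geometry` is hypothetical, and 2480x3508 is used only as an example size (an A4 page scanned at 300 dpi).

```python
def letterbox_geometry(width, height, target=512):
    """Return (new_width, new_height, pad_left, pad_top) for an aspect-preserving fit.

    The longer side is scaled to `target`; the shorter side is scaled by the
    same factor and centered with padding.
    """
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    pad_left = (target - new_width) // 2
    pad_top = (target - new_height) // 2
    return new_width, new_height, pad_left, pad_top

print(letterbox_geometry(2480, 3508))
```

The trade-off is that padding keeps character shapes undistorted but spends part of the 512x512 input on empty borders.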

### **4. Further Enhancement**
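After resizing, a second, milder sharpening pass (factor 1.5, compared with 2.0 for the first pass) was applied to further improve document readability, since downsampling tends to soften fine text.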

### **5. Dataset Splitting**
After preprocessing, the dataset was split into **training** and **validation** sets, with a small **test** set held out separately:

- **Training Set**: 80% of the images in each class  
- **Validation Set**: 20% of the images in each class  
- **Test Set**: 53 randomly selected images  

Each class folder was split separately with the same ratio, so the training and validation sets keep the class proportions of the full dataset and neither set over-represents a category.
The 53 test images are not used during training or validation; they serve as a final evaluation of how well the model generalizes to unseen data.
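
The split logic described above can be sketched on a plain list of filenames. This is a minimal illustration under stated assumptions: the function name `split_filenames`, the seed, and the generated filenames are hypothetical, and the source does not say how the 53 test images were distributed across classes, so the example simply works on one list.

```python
import random

def split_filenames(filenames, test_count, val_ratio=0.2, seed=0):
    """Shuffle once, hold out `test_count` files, then split the rest into train/validation."""
    rng = random.Random(seed)      # local generator so the split is reproducible
    shuffled = sorted(filenames)   # sort first so the result does not depend on input order
    rng.shuffle(shuffled)
    test = shuffled[:test_count]
    remaining = shuffled[test_count:]
    val_size = int(len(remaining) * val_ratio)
    val = remaining[:val_size]
    train = remaining[val_size:]
    return train, val, test

# Hypothetical filenames, for illustration only.
names = [f"scan_{i:04d}.png" for i in range(580)]
train, val, test = split_filenames(names, test_count=53)
print(len(train), len(val), len(test))
```

Holding out the test files before computing the validation split guarantees that the three sets are disjoint.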

In [None]:
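# Process one class folder at a time; os.path.join keeps the paths portable.
input_dir = os.path.join("dataset_classification", "Others")
output_dir = os.path.join("preprocessed_dataset", "Processed_Others")

os.makedirs(output_dir, exist_ok=True)

TARGET_SIZE = (512, 512)  # (width, height) fed to the model
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.tiff', '.bmp')

def preprocess_image(image_path, output_path):
    """Grayscale, sharpen, resize, sharpen again, then save one image."""
    # The context manager closes the source file once processing is done.
    with Image.open(image_path) as image:
        # 1. Grayscale: drop color information.
        gray_image = image.convert("L")

        # 2. First sharpening pass (1.0 leaves the image unchanged, >1.0 sharpens).
        sharp_image = ImageEnhance.Sharpness(gray_image).enhance(2.0)

        # 3. Resize to the fixed input size; the aspect ratio is not preserved.
        resized_image = sharp_image.resize(TARGET_SIZE, Image.LANCZOS)

        # 4. Milder second sharpening pass after downsampling.
        final_image = ImageEnhance.Sharpness(resized_image).enhance(1.5)

        # `quality` only matters for lossy formats such as JPEG.
        final_image.save(output_path, dpi=(300, 300), quality=95)

for filename in os.listdir(input_dir):
    if filename.lower().endswith(IMAGE_EXTENSIONS):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, filename)
        preprocess_image(input_path, output_path)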

print("Image preprocessing completed.")
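
The cell above handles a single class folder, so it has to be rerun with different paths for each category. A minimal sketch of how the same loop could cover every class folder in one pass is shown below. The helper `plan_preprocessing` is hypothetical, and the `Processed_<class>` output naming is an assumption extrapolated from the `Processed_Others` folder used above.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tiff", ".bmp")

def plan_preprocessing(raw_root, processed_root):
    """Yield (input_path, output_path) pairs for every image in every class folder."""
    for class_name in sorted(os.listdir(raw_root)):
        class_dir = os.path.join(raw_root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        out_dir = os.path.join(processed_root, f"Processed_{class_name}")
        for filename in sorted(os.listdir(class_dir)):
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                yield os.path.join(class_dir, filename), os.path.join(out_dir, filename)

# Possible usage together with preprocess_image from the cell above:
# for src, dst in plan_preprocessing("dataset_classification", "preprocessed_dataset"):
#     os.makedirs(os.path.dirname(dst), exist_ok=True)
#     preprocess_image(src, dst)
```

Separating path planning from image processing also makes it easy to check which files would be touched before anything is written.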


In [None]:


input_dir = r"preprocessed_dataset"
train_dir = r"train_dataset"
val_dir = r"validation_dataset"

os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

split_ratio = 0.2  # fraction of each class moved to the validation set
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.tiff', '.bmp')  # same filter as preprocessing
random.seed(42)  # arbitrary fixed seed so the split is reproducible

for class_folder in os.listdir(input_dir):
    class_path = os.path.join(input_dir, class_folder)
    if not os.path.isdir(class_path):
        continue

    train_class_path = os.path.join(train_dir, class_folder)
    val_class_path = os.path.join(val_dir, class_folder)
    os.makedirs(train_class_path, exist_ok=True)
    os.makedirs(val_class_path, exist_ok=True)

    # Case-insensitive filter; sorting makes the sample independent of OS listing order.
    images = sorted(img for img in os.listdir(class_path)
                    if img.lower().endswith(IMAGE_EXTENSIONS))
    val_size = int(len(images) * split_ratio)
    val_images = set(random.sample(images, val_size))

    # Files are moved, not copied, so preprocessed_dataset is emptied by this cell.
    for img in images:
        destination = val_class_path if img in val_images else train_class_path
        shutil.move(os.path.join(class_path, img), os.path.join(destination, img))

print("Dataset split into Train & Validation!")
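
After splitting, it can be useful to confirm that each class ended up with roughly the intended 80/20 proportion. The sketch below counts image files per class folder for each split directory; the helper name `count_images` is hypothetical, and it returns an empty result if a directory does not exist yet.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tiff", ".bmp")

def count_images(split_dir):
    """Return {class_folder: number_of_image_files} for one split directory."""
    counts = {}
    if not os.path.isdir(split_dir):
        return counts
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = sum(
                1 for name in os.listdir(class_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts

for split_dir in ("train_dataset", "validation_dataset"):
    print(split_dir, count_images(split_dir))
```

Comparing the two dictionaries class by class shows at a glance whether any category was skipped or split unevenly.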