# Working With Directories and Files in Kaggle

![Kaggle Banner](https://www.kaggle.com/static/images/site-logo.png)

## Project Information
- **Author**: Dr. Saad Laouadi
- **Date**: April 16, 2025
- **Dataset**: Heart Failure Synthetic Dataset
- **Version**: 1.0
- **Repository**: [GitHub: Kaggle-File-Management](https://github.com/dr-saad-la/kaggle-projects-kaggle-file-management) - Contains all code, additional analysis, and documentation for this project

## Project Overview
This notebook demonstrates efficient techniques for file and directory management within the Kaggle environment. Whether you're participating in competitions, creating datasets, or sharing analysis, understanding how to properly navigate and manipulate the file system is essential for productive data science workflows.

## Objectives
- Explore Kaggle's file system structure and conventions
- Implement robust file handling using pathlib and other modern Python libraries
- Create reproducible data loading patterns for machine learning projects
- Establish best practices for organizing complex data science projects

## Technologies & Libraries
```python
# Core libraries
from pathlib import Path
import os
import glob

# Data processing
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Optional - for potential modeling demonstrations
from sklearn.model_selection import train_test_split
```

## Dataset Description
The Heart Failure Synthetic Dataset contains simulated medical records focused on heart failure conditions. We'll use this dataset to demonstrate file operations while also performing basic exploratory data analysis.

## Key Features of This Notebook
- **Path Management**: Using modern `pathlib` for cross-platform compatibility
- **Error Handling**: Implementing robust error checking for file operations
- **Workflow Optimization**: Techniques for efficient data loading and processing
- **Project Organization**: Best practices for structuring machine learning projects

---

### Why This Matters
Proper file management is the foundation of reproducible data science. By establishing good practices early in your workflow, you can:

- Create more maintainable code
- Improve collaboration with team members
- Ensure portability across different environments
- Reduce errors in data processing pipelines

## Dataset Access

This notebook uses the Heart Failure Synthetic Dataset available on Kaggle.

**Note:** For detailed instructions on downloading this dataset using the Kaggle API for local use, please see the [GitHub repository README](https://github.com/dr-saad-la/kaggle-projects).

Let's begin by exploring the Kaggle file system structure and implementing some best practices for working with datasets.

In [29]:
# Install the watermark notebook extension, which reports
# information about the working environment
import sys
!{sys.executable} -m pip install -q watermark

In [30]:
# Environment Setup
from pathlib import Path
import os
import glob

import pandas as pd  
import numpy as np   

%reload_ext watermark

print("--------- Showing Environment Information---------")
%watermark -a "Dr. Saad Laouadi"
%watermark -iv    


--------- Showing Environment Information---------
Author: Dr. Saad Laouadi

sys    : 3.11.11 (main, Dec  4 2024, 08:55:07) [GCC 11.4.0]
pathlib: 1.0.1
polars : 1.9.0
numpy  : 1.26.4
pandas : 2.2.3



## Understanding Kaggle's File Structure
Let's first explore the current working directory and its parent to understand Kaggle's notebook environment structure:

In [31]:
# Explore the working directory structure
cwd = Path('.').resolve()
parent_dir = Path('..').resolve()

print(f"Current working directory: {cwd}")
print(f"Parent directory: {parent_dir}")

# List contents with more descriptive output
print("\nFiles and directories in current working directory:")
for item in cwd.iterdir():
    item_type = "FILE" if item.is_file() else "DIRECTORY"
    print(f"  {item.name} ({item_type})")

print("\nFiles and directories in parent directory:")
for item in parent_dir.iterdir():
    item_type = "FILE" if item.is_file() else "DIRECTORY"
    print(f"  {item.name} ({item_type})")

Current working directory: /kaggle/working
Parent directory: /kaggle

Files and directories in current working directory:
  .virtual_documents (DIRECTORY)

Files and directories in parent directory:
  lib (DIRECTORY)
  input (DIRECTORY)
  working (DIRECTORY)


## Accessing the Input Directory

In Kaggle, datasets are mounted in the `/kaggle/input` directory. Let's explore this location to see what data is available:

In [32]:
# Explore the input directory where datasets are mounted
input_dir = Path('/kaggle/input').resolve()

# Get files and directories
try:
    # Separate files and directories for clarity
    files = [item for item in input_dir.iterdir() if item.is_file()]
    directories = [item for item in input_dir.iterdir() if item.is_dir()]
    
    # Print organized results
    print(f"Found {len(files)} files and {len(directories)} directories in {input_dir}")
    
    if files:
        print("\nFiles:")
        for file in files:
            print(f"  {file.name} ({file.stat().st_size / 1024:.2f} KB)")
    
    if directories:
        print("\nDirectories:")
        for directory in directories:
            dir_contents = list(directory.iterdir())
            print(f"  {directory.name} ({len(dir_contents)} items)")
            
            # Show first few items in each directory
            for item in dir_contents[:3]:  # Show only first 3 items
                item_type = "FILE" if item.is_file() else "DIR"
                print(f"    ├── {item.name} ({item_type})")
            if len(dir_contents) > 3:
                print(f"    └── ... and {len(dir_contents) - 3} more items")
                
except PermissionError:
    print(f"Permission denied accessing {input_dir}")
except FileNotFoundError:
    print(f"Directory not found: {input_dir}")

Found 0 files and 1 directories in /kaggle/input

Directories:
  heart-failure-prediction-synthetic-dataset (1 items)
    ├── heart_failure_prediction.csv (FILE)


## Creating a Helper Function for Dataset Exploration
Let's create a reusable function to explore dataset directories more thoroughly:

In [33]:
def explore_dataset(dataset_path, max_files=5):
    """
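    Explore a dataset directory and return structured information.
    
    File counts, sizes, and file details cover only the files directly
    inside dataset_path; each subdirectory is summarized separately.
    
    Args:
        dataset_path (Path): Path to the dataset directory
        max_files (int): Maximum number of top-level files to describe in detail
        
    Returns:
        dict: Dictionary containing dataset information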
    """
    dataset_info = {
        "name": dataset_path.name,
        "path": str(dataset_path),
        "file_count": 0,
        "dir_count": 0,
        "size_bytes": 0,
        "files": [],
        "directories": []
    }
    
    try:
        # List all items
        all_items = list(dataset_path.iterdir())
        
        # Count files and directories
        files = [f for f in all_items if f.is_file()]
        dirs = [d for d in all_items if d.is_dir()]
        
        dataset_info["file_count"] = len(files)
        dataset_info["dir_count"] = len(dirs)
        
        # Calculate total size
        for file in files:
            size = file.stat().st_size
            dataset_info["size_bytes"] += size
            
            # Get file information
            if len(dataset_info["files"]) < max_files:
                dataset_info["files"].append({
                    "name": file.name,
                    "extension": file.suffix,
                    "size_bytes": size,
                    "size_formatted": f"{size/1024/1024:.2f} MB" if size > 1024*1024 else f"{size/1024:.2f} KB"
                })
        
        # Get directory information
        for directory in dirs:
            dir_files = list(directory.glob('**/*'))
            dir_info = {
                "name": directory.name,
                "file_count": len([f for f in dir_files if f.is_file()]),
                "sample_files": [f.name for f in dir_files if f.is_file()][:3]
            }
            dataset_info["directories"].append(dir_info)
            
        return dataset_info
        
    except Exception as e:
        print(f"Error exploring {dataset_path}: {e}")
        return dataset_info

In [34]:
# Check the available datasets
for dataset_dir in directories:
    dataset_info = explore_dataset(dataset_dir)
    
    # Print formatted information
    print(f"\n{'='*50}")
    print(f"DATASET: {dataset_info['name']}")
    print(f"{'='*50}")
    print(f"Total files: {dataset_info['file_count']}")
    print(f"Total directories: {dataset_info['dir_count']}")
    print(f"Total size: {dataset_info['size_bytes']/1024/1024:.2f} MB")
    
    print("\nSample files:")
    for file in dataset_info["files"]:
        print(f"  • {file['name']} ({file['size_formatted']})")


DATASET: heart-failure-prediction-synthetic-dataset
Total files: 1
Total directories: 0
Total size: 1.09 MB

Sample files:
  • heart_failure_prediction.csv (1.09 MB)
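
Note that `explore_dataset` sums only the files directly inside the dataset directory, so the reported total would undercount a dataset that keeps its files in subdirectories. A minimal sketch of a recursive alternative is shown below; `directory_size` and `format_size` are hypothetical helper names, not part of the notebook's original code.

```python
from pathlib import Path


def directory_size(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TB"


# Only runs the measurement when the dataset is actually mounted.
dataset_dir = Path("/kaggle/input/heart-failure-prediction-synthetic-dataset")
if dataset_dir.is_dir():
    print(format_size(directory_size(dataset_dir)))
```

For the single-file dataset used here, the recursive total should match the value reported by `explore_dataset`; the two only diverge when subdirectories contain files.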


## Results Summary

The exploration above, combined with how Kaggle notebooks behave in general, leads to the following points:

* Kaggle organizes files in a predictable structure: `/kaggle/input` contains all mounted datasets, and `/kaggle/working` is the notebook's working directory
* Each competition or dataset gets its own subdirectory named after it (here, `heart-failure-prediction-synthetic-dataset`)
* Knowing this structure makes it possible to build paths programmatically instead of hardcoding them
* Input directories are read-only, so all outputs must be written to the `/kaggle/working` directory
* Datasets are mounted rather than downloaded inside the notebook, so even large files are available without extra setup
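
Because input directories are read-only, anything the notebook produces has to be written under `/kaggle/working`. The sketch below shows one way to resolve an output directory that also works outside Kaggle; the function name `resolve_output_dir`, the `outputs` subdirectory, and the fallback to the current directory are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(subdir: str = "outputs", base: Optional[Path] = None) -> Path:
    """Return a writable output directory, creating it if needed.

    When base is not given, /kaggle/working is used if it exists;
    otherwise the current working directory is used (local runs).
    """
    if base is None:
        kaggle_working = Path("/kaggle/working")
        base = kaggle_working if kaggle_working.is_dir() else Path.cwd()
    out_dir = base / subdir
    # parents/exist_ok make the call safe to repeat across notebook runs.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

A typical call would be `out_dir = resolve_output_dir()` followed by something like `df.to_csv(out_dir / "summary.csv", index=False)`, where the file name is again only an example.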

## Practical Example: Working with a Heart Failure Dataset

Now let's apply our file management knowledge to a concrete example by loading and examining the synthetic heart failure dataset:

In [35]:
# Locate our heart failure dataset
heart_dataset_dir = [d for d in input_dir.iterdir() if "heart-failure" in d.name.lower()]

if heart_dataset_dir:
    # Get the directory
    dataset_path = heart_dataset_dir[0]
    
    # Find CSV files in the dataset
    csv_files = list(dataset_path.glob("**/*.csv"))
    
    print(f"Found {len(csv_files)} CSV files in {dataset_path.name}")
    
    if csv_files:
        # Load the first CSV file
        df = pd.read_csv(csv_files[0])
        
        # Display basic information
        print("\nDataset Overview:")
        print(f"- Rows: {df.shape[0]}")
        print(f"- Columns: {df.shape[1]}")
        print("\nColumn names:")
        for col in df.columns:
            print(f"  - {col}")
            
        # Display the first few rows
        print("\nSample data:")
        display(df.head())
        
        # Basic statistics
        print("\nBasic statistics:")
        display(df.describe())
else:
    print("Heart failure dataset not found. Please ensure the dataset is mounted.")

Found 1 CSV files in heart-failure-prediction-synthetic-dataset

Dataset Overview:
- Rows: 10000
- Columns: 20

Column names:
  - Age
  - Gender
  - Chest_Pain_Type
  - Resting_BP
  - Cholesterol
  - Fasting_Blood_Sugar
  - Resting_ECG
  - Max_Heart_Rate
  - Exercise_Induced_Angina
  - Oldpeak
  - Slope
  - Num_Major_Vessels
  - Thalassemia
  - Diabetes
  - Smoking_History
  - Alcohol_Consumption
  - Physical_Activity_Level
  - Family_History
  - BMI
  - Heart_Failure

Sample data:


|  | Age | Gender | Chest_Pain_Type | Resting_BP | Cholesterol | Fasting_Blood_Sugar | Resting_ECG | Max_Heart_Rate | Exercise_Induced_Angina | Oldpeak | Slope | Num_Major_Vessels | Thalassemia | Diabetes | Smoking_History | Alcohol_Consumption | Physical_Activity_Level | Family_History | BMI | Heart_Failure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69 | Male | Atypical | 106 | 250 | 1 | ST-T Wave Abnormality | 171 | 0 | 0.92 | Flat | 2 | Normal | 1 | Former | Heavy | Low | 0 | 36.92 | 1 |
| 1 | 32 | Male | Non-anginal | 124 | 396 | 1 | Left Ventricular Hypertrophy | 73 | 0 | 0.92 | Downsloping | 2 | Reversible Defect | 1 | Current |  | Low | 0 | 36.92 | 1 |
| 2 | 89 | Female | Non-anginal | 164 | 256 | 1 | Left Ventricular Hypertrophy | 157 | 0 | 0.92 | Upsloping | 1 | Fixed Defect | 1 | Former |  | Low | 0 | 36.92 | 0 |
| 3 | 78 | Female | Typical | 116 | 297 | 1 | Normal | 163 | 1 | 0.92 | Flat | 1 | Reversible Defect | 1 | Former | Heavy | Low | 1 | 36.92 | 0 |
| 4 | 38 | Male | Non-anginal | 88 | 386 | 1 | ST-T Wave Abnormality | 123 | 1 | 0.92 | Upsloping | 3 | Fixed Defect | 0 | Never | Moderate | Low | 1 | 36.92 | 1 |



Basic statistics:


|  | Age | Resting_BP | Cholesterol | Fasting_Blood_Sugar | Max_Heart_Rate | Exercise_Induced_Angina | Oldpeak | Num_Major_Vessels | Diabetes | Family_History | BMI | Heart_Failure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 58.5849 | 139.5692 | 247.2062 | 0.5054 | 129.3466 | 0.5072 | 0.92 | 1.4814 | 0.5012 | 0.5063 | 36.92 | 0.5036 |
| std | 23.645835 | 34.86205 | 86.862739 | 0.499996 | 40.316689 | 0.499973 | 6.250868e-14 | 1.117488 | 0.500024 | 0.499985 | 7.17684e-13 | 0.500012 |
| min | 18.0 | 80.0 | 100.0 | 0.0 | 60.0 | 0.0 | 0.92 | 0.0 | 0.0 | 0.0 | 36.92 | 0.0 |
| 25% | 38.0 | 109.0 | 171.0 | 0.0 | 95.0 | 0.0 | 0.92 | 0.0 | 0.0 | 0.0 | 36.92 | 0.0 |
| 50% | 59.0 | 140.0 | 247.0 | 1.0 | 130.0 | 1.0 | 0.92 | 1.0 | 1.0 | 1.0 | 36.92 | 1.0 |
| 75% | 79.0 | 170.0 | 322.0 | 1.0 | 164.0 | 1.0 | 0.92 | 2.0 | 1.0 | 1.0 | 36.92 | 1.0 |
| max | 99.0 | 199.0 | 399.0 | 1.0 | 199.0 | 1.0 | 0.92 | 3.0 | 1.0 | 1.0 | 36.92 | 1.0 |


This example demonstrates how to:
1. Dynamically locate a dataset without hardcoding paths
2. Use globbing to find specific file types
3. Guard the loading workflow against a missing dataset directory or missing CSV files
4. Perform basic data exploration after loading
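
The discovery steps above can also be packaged into a small helper that fails with a descriptive error instead of silently skipping the load. This is a sketch under stated assumptions: `find_dataset_file` is a hypothetical name, and picking the first match in sorted order is an arbitrary choice that a real project may want to replace with an explicit file name.

```python
from pathlib import Path


def find_dataset_file(input_dir: Path, keyword: str, pattern: str = "*.csv") -> Path:
    """Return the first file matching pattern inside the first dataset
    directory whose name contains keyword (case-insensitive).

    Raises FileNotFoundError with a descriptive message when the input
    directory, the dataset directory, or a matching file is missing.
    """
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")

    candidates = sorted(
        d for d in input_dir.iterdir()
        if d.is_dir() and keyword.lower() in d.name.lower()
    )
    if not candidates:
        raise FileNotFoundError(
            f"No dataset directory matching '{keyword}' in {input_dir}"
        )

    # rglob searches the dataset directory recursively.
    matches = sorted(candidates[0].rglob(pattern))
    if not matches:
        raise FileNotFoundError(
            f"No files matching '{pattern}' in {candidates[0]}"
        )
    return matches[0]
```

In the notebook this could be used as `pd.read_csv(find_dataset_file(Path("/kaggle/input"), "heart-failure"))`, so a missing dataset surfaces as an exception at the point of loading.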

## Documentation Notes

### Official Resources
- [Kaggle API Documentation](https://github.com/Kaggle/kaggle-api)
- [Kaggle Notebooks Documentation](https://www.kaggle.com/docs/notebooks)
- [Kaggle Datasets Documentation](https://www.kaggle.com/docs/datasets)

### Related Resources
- [Python Pathlib Documentation](https://docs.python.org/3/library/pathlib.html)
- [Best Practices for File Handling in Data Science](https://realpython.com/working-with-files-in-python/)
- [Reproducible Data Science Guidelines](https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html)

## Conclusion

### Best Practices Learned
1. **Use pathlib instead of os.path** for more readable and cross-platform compatible code
2. **Implement error handling** when working with file operations to create robust workflows
3. **Dynamic path discovery** reduces hardcoding and makes notebooks more portable
4. **Structured file exploration** helps understand complex dataset organizations
5. **Documenting file structures** improves reproducibility and collaboration
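
To make the last two points concrete, the sketch below produces a plain-text listing of a directory that can be pasted into a notebook or README to document a dataset's layout. The function name `format_tree` and the default depth of 2 are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import List


def format_tree(root: Path, max_depth: int = 2) -> List[str]:
    """Return an indented listing of root, descending at most max_depth levels."""
    lines = [f"{root.name}/"]

    def walk(directory: Path, depth: int) -> None:
        if depth > max_depth:
            return
        # Directories first, then files, each group sorted by name.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            indent = "  " * depth
            if entry.is_dir():
                lines.append(f"{indent}{entry.name}/")
                walk(entry, depth + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(root, 1)
    return lines
```

Calling `print("\n".join(format_tree(Path("/kaggle/input"))))` inside a Kaggle notebook would list each mounted dataset and its top-level contents.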

### Next Steps
- Apply these file management techniques to your own projects
- Extend the helper functions to create a reusable file management toolkit
- Explore creating a standardized project structure for data science work
- Implement versioning for datasets to track changes over time
- Consider creating a pipeline for automated dataset validation and cleaning

This notebook provides a foundation for effective file management in Kaggle environments. By applying these techniques, you can create more robust, maintainable, and reproducible data science workflows.