# README: Annotation File Processing and Analysis

## Overview
This project processes a set of annotation files extracted from a ZIP archive. The work covers file extraction, filename validation, counting, organizing files into folders, and analysis of satellites and unique regions. The code is divided into seven tasks, each described below.

## Project Structure
- **ZIP File Path**: The location of the ZIP file containing the annotation data.
- **Extraction Path**: The directory where the extracted files are stored.
- **Tasks**: Each task is outlined below, explaining its purpose, the process, and the expected outcome.

---

### Task 1: Extract ZIP File and Count the Number of Files
**Objective**: Extract the annotation files from a ZIP archive and count the total number of files in the extracted directory.

**Process**:
- **Import necessary modules**:
  - `zipfile` for handling the extraction of the ZIP file.
  - `os` for interacting with the file system.
- **Define paths**:
  - `zip_file_path`: The path to the ZIP file that needs to be extracted.
  - `extraction_path`: The location where the contents of the ZIP file will be extracted.
- **Extract the ZIP file**:
  - Use `zipfile.ZipFile` in read mode (`'r'`) to open the ZIP file.
  - Extract all files using `extractall()` and store them in the specified `extraction_path`.
- **Verify extraction**:
  - Print a message indicating the path where files were extracted to confirm successful extraction.
- **Count the number of files**:
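  - Use `os.listdir()` and `len()` to count the entries in the extracted directory. Note that `os.listdir()` also returns subdirectory names, so the count equals the number of files only when the archive is flat.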

**Outcome**:
- Ensures successful extraction and counts the files in the directory to provide an overview of the dataset size.
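
The steps for Task 1 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `extract_and_count` is hypothetical, and filtering with `os.path.isfile()` is an addition so that subdirectories are not counted as files.

```python
import os
import zipfile


def extract_and_count(zip_file_path, extraction_path):
    """Extract a ZIP archive and return the number of top-level files."""
    # Open the archive in read mode and extract everything.
    with zipfile.ZipFile(zip_file_path, "r") as archive:
        archive.extractall(extraction_path)
    print(f"Files extracted to: {extraction_path}")

    # os.listdir() returns files and subdirectories alike,
    # so keep only regular files (an assumption about what "count" means).
    all_files = [
        name
        for name in os.listdir(extraction_path)
        if os.path.isfile(os.path.join(extraction_path, name))
    ]
    return len(all_files)
```

If the archive wraps everything in a single top-level folder, point `extraction_path` handling at that folder instead, since only the top level is counted here.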

---

### Task 2: Identify Files That Follow the Naming Convention
**Objective**: Validate filenames against a specific naming convention to ensure consistency.

**Process**:
- **Import the `re` module** for regular expressions.
- **Define a regex pattern** to match filenames with the format `{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{REGION}.txt`.
- **Filter and count valid files**:
  - Use list comprehension to iterate over `all_files` and filter filenames matching the pattern.
- **Print the count of valid files**:
  - Display the number of files adhering to the specified naming convention.

**Outcome**:
- Verifies which files adhere to the defined structure, ensuring data consistency and reliability.
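
A possible pattern for this convention is sketched below. The README does not state the exact field formats, so the following are assumptions made only for illustration: `DATE` as `YYYY-MM-DD`, `TIME` as `HH-MM-SS`, `VERSION` as three underscore-separated integers, and `REGION` as everything up to `.txt`. The names `FILENAME_PATTERN` and `filter_valid_files` are hypothetical.

```python
import re

# Assumed field formats (not specified in this README):
#   DATE = YYYY-MM-DD, TIME = HH-MM-SS, VERSION = N_N_N, REGION = rest of name.
FILENAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<time>\d{2}-\d{2}-\d{2})_"
    r"SN(?P<satellite>\d+)_QUICKVIEW_VISUAL_"
    r"(?P<version>\d+_\d+_\d+)_"
    r"(?P<region>.+)\.txt$"
)


def filter_valid_files(all_files):
    """Return the filenames that match the naming convention."""
    return [name for name in all_files if FILENAME_PATTERN.match(name)]
```

Named groups make the later tasks easier, because the satellite number and region can be read directly from the match object. Adjust the sub-patterns if the real filenames use different date, time, or version formats.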

---

### Task 3: Count Annotations Per Month
**Objective**: Count how many annotation files exist per month and identify the month with the most annotations.

**Process**:
- **Import `defaultdict` from `collections`** to simplify counting with default values.
- **Extract `year` and `month`** from each valid filename and increment counts in a `defaultdict`.
- **Identify the month with the highest count** using `max()`.

**Outcome**:
- Provides insight into data distribution over time and highlights peak annotation periods.
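
The counting step might look like the sketch below. It assumes each valid filename starts with a `YYYY-MM-DD` date (an assumption, since the date format is not given here), and the function names are hypothetical.

```python
from collections import defaultdict


def count_per_month(valid_files):
    """Count files per month, keyed as 'YYYY_MM'."""
    counts = defaultdict(int)
    for name in valid_files:
        # Assumes the filename begins with YYYY-MM-DD.
        year, month = name[:4], name[5:7]
        counts[f"{year}_{month}"] += 1
    return counts


def busiest_month(counts):
    """Return the month key with the highest count."""
    # max() over the keys, ranked by their counts.
    return max(counts, key=counts.get)
```

When two months tie, `max()` returns whichever of them it encounters first, so the result depends on insertion order in that case.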

---

### Task 4: Organize Annotations by Month
**Objective**: Organize valid annotation files into subfolders based on the month (formatted as `YYYY_MM`).

**Process**:
- **Import `shutil`** and use `os.makedirs()` to create month-based subfolders.
- **Iterate over valid files**:
  - Extract `year` and `month` from filenames.
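  - Create a month-based folder and move files using `os.rename()`. If the destination may be on a different filesystem, `shutil.move()` is the safer choice, because `os.rename()` can fail across filesystems.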

**Outcome**:
- Enhances data organization, making future data access and analysis more efficient.
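
One way to sketch this step is shown below. It uses `shutil.move()` rather than the `os.rename()` call described above, purely because `shutil.move()` also works across filesystems; the function name `organize_by_month` is hypothetical, and a `YYYY-MM-DD` filename prefix is assumed.

```python
import os
import shutil


def organize_by_month(extraction_path, valid_files):
    """Move each valid file into a YYYY_MM subfolder of extraction_path."""
    for name in valid_files:
        # Assumes the filename begins with YYYY-MM-DD.
        year, month = name[:4], name[5:7]
        month_dir = os.path.join(extraction_path, f"{year}_{month}")

        # exist_ok=True avoids an error when the folder already exists.
        os.makedirs(month_dir, exist_ok=True)

        shutil.move(
            os.path.join(extraction_path, name),
            os.path.join(month_dir, name),
        )
```

Because the files are moved rather than copied, running this twice on the same list would fail on the second run; re-list the directory first if the task needs to be repeatable.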

---

### Task 5: Print Annotations from Most Recent to Oldest
**Objective**: Sort and display the valid annotation files in descending order by date.

**Process**:
- **Import `numpy`** for sorting.
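- **Sort filenames** using `np.sort()` and reverse the order (`[::-1]`). The sort is lexicographic; because each filename begins with `{DATE}_{TIME}`, it matches chronological order as long as the date is written year-first and all fields are zero-padded.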
- **Print the sorted filenames** to display them from the most recent to the oldest.

**Outcome**:
- Displays files from the most recent to the oldest, providing a quick overview of the data timeline.
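
The sorting step can be sketched with the built-in `sorted()`, used here instead of `np.sort()` only to keep the example free of third-party dependencies; for a list of filename strings, `np.sort(valid_files)[::-1]` should give the same ordering. The function name is hypothetical, and the sketch assumes year-first, zero-padded dates at the start of each filename.

```python
def sort_most_recent_first(valid_files):
    """Return filenames sorted from most recent to oldest."""
    # Lexicographic order equals chronological order only when the
    # filename starts with a year-first, zero-padded date and time.
    return sorted(valid_files, reverse=True)


def print_sorted(valid_files):
    """Print one filename per line, most recent first."""
    for name in sort_most_recent_first(valid_files):
        print(name)
```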

---

### Task 6: Count Annotations Per Satellite and Identify the Most Recent Satellite
**Objective**: Count the number of annotations for each satellite and identify the satellite in the most recent annotation file.

**Process**:
- **Use `defaultdict`** to count annotations per satellite.
- **Extract `SATELLITE_NUMBER`** from each valid filename.
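- **Identify the most recent satellite** by reading the satellite number from `sorted_files[0]`, the first entry of the list sorted from most recent to oldest (as in Task 5).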

**Outcome**:
- Gives a breakdown of data distribution per satellite and highlights the latest satellite source.
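
A rough sketch of both steps follows. The function names and the `SATELLITE_RE` pattern are hypothetical; the pattern relies only on the `_SN{SATELLITE_NUMBER}_` part of the naming convention, and the sort assumes filenames start with a year-first, zero-padded date.

```python
import re
from collections import defaultdict

# Matches the "_SN<digits>_" segment of the naming convention.
SATELLITE_RE = re.compile(r"_SN(\d+)_")


def count_per_satellite(valid_files):
    """Count annotation files per satellite number."""
    counts = defaultdict(int)
    for name in valid_files:
        match = SATELLITE_RE.search(name)
        if match:
            counts[match.group(1)] += 1
    return counts


def most_recent_satellite(valid_files):
    """Return the satellite number of the most recent file, or None."""
    if not valid_files:
        return None
    # Most recent first, as in Task 5.
    sorted_files = sorted(valid_files, reverse=True)
    match = SATELLITE_RE.search(sorted_files[0])
    return match.group(1) if match else None
```

Satellite numbers are kept as strings here; convert them with `int()` if numeric ordering matters.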

---

### Task 7: Count Unique Regions
**Objective**: Identify and count the number of unique regions represented in the dataset.

**Process**:
- **Initialize a `set`** to store unique regions.
- **Extract the region** from each valid filename and add it to the set.
- **Print the total count** of unique regions using `len()`.

**Outcome**:
- Shows how diverse the dataset is in terms of geographical or segmented regions.
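
The region count might be sketched as below. The `REGION_RE` pattern is hypothetical and assumes that `VERSION` consists of three underscore-separated integers and that `REGION` is everything between the version and `.txt`; neither detail is confirmed by this README.

```python
import re

# Assumes VERSION looks like N_N_N and REGION is the rest of the name.
REGION_RE = re.compile(r"_QUICKVIEW_VISUAL_\d+_\d+_\d+_(.+)\.txt$")


def unique_regions(valid_files):
    """Return the set of distinct regions found in the filenames."""
    regions = set()
    for name in valid_files:
        match = REGION_RE.search(name)
        if match:
            # A set silently ignores duplicates.
            regions.add(match.group(1))
    return regions
```

Calling `len(unique_regions(valid_files))` then gives the total number of distinct regions.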

---

## How to Run the Code
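1. Ensure Python is installed along with `numpy`, the only third-party dependency. The other modules used (`zipfile`, `os`, `re`, `collections`, `shutil`) are part of the Python standard library and need no installation.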
2. Set the `zip_file_path` and `extraction_path` according to your directory structure.
3. Run the tasks in order. Later tasks reuse results from earlier ones, such as the list of valid files from Task 2 and the sorted list from Task 5.

## Conclusion
These tasks provide a step-by-step analysis of the dataset, from initial extraction to detailed data validation and organization. Each step is designed to improve data quality, organization, and insight into the dataset's structure and distribution.