# Python for Data Science
## Session 4 
### Basic Libraries I

---

## Outline

1. NumPy for numerical operations

2. SciPy for scientific computing

3. Math, os, glob, shutil 

---

## Basic Libraries I

Let's jump into today's exercise.

### Exercise


Given a zip file containing a subfolder of annotation files, each named according to the following convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE is expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME is expressed as HHMMSS (hours, minutes and seconds), e.g. 174301
- SATELLITE_NUMBER is an integer that identifies the satellite.
- VERSION gives the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION gives a unique location as a string, e.g. SATL-2KM-10N_552_4164

Find out the following things about your data:

1. How many files the annotations folder contains.
2. How many of them follow the naming convention described above.
3. How many annotations you have per month and per year, and which month has the most annotation files.
4. Create a new annotations folder with one subfolder per month.
5. Print all the annotations from the most recent to the oldest.
6. How many different satellites there are, how many annotations there are per satellite number, and which satellite was used in the most recent annotation file.
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to get each field of an annotation name.
- You can use sort from numpy on strings.
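
As a small illustration of both tips, the sketch below splits one file name from the folder listing into its fields and sorts a few shortened names with `np.sort`. The variable names and the assumption that the version always has three numbers are illustrative, not part of the exercise statement.

```python
import numpy as np

# One annotation name taken from the folder listing
name = "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt"

# Drop the extension, then split on underscores
fields = name[:-len(".txt")].split("_")
date, time, satellite = fields[0], fields[1], fields[2]
version = "_".join(fields[5:8])   # assumes three version numbers
region = "_".join(fields[8:])     # everything after the version

print(date, time, satellite, version, region)

# np.sort orders strings lexicographically; because the names start with
# YYYYMMDD_HHMMSS, that is also chronological order
names = np.array([
    "20240102_185527_SN27",
    "20240101_213601_SN31",
    "20240101_174301_SN33",
])
print(np.sort(names))
```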

In [1]:
import os
import glob
import shutil
import numpy as np

## Exercise 1

In [2]:
files = os.listdir(r'../session_4/annotations')
files

['20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt',
 '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt',
 '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt',
 '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt',
 '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_554_4162.txt',
 '20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3740.txt',
 '20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3742.txt',
 '20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_396_3752.txt',
 '20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt',
 '20240102_185605_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_690_3572.txt',
 '20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt',
 '20240104_220339_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_556_4178.txt',
 '20240110_192002_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_380_3728.txt',
 '20240112_192510_SN27_QU

In [3]:
number_of_files = len(files)    
number_of_files

206

## Exercise 2

Find out how many files follow the naming convention.

Strategy: Splitting on underscores and hyphens may not be the best approach here, so I will import *re* and use a regular expression.

In [4]:
#Create relative path to annotations folder:
annotations_folder = r'../Session_4/annotations'

all_files = glob.glob(annotations_folder + '/*txt')
all_files

['../Session_4/annotations\\20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt',
 '../Session_4/annotations\\20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt',
 '../Session_4/annotations\\20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt',
 '../Session_4/annotations\\20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt',
 '../Session_4/annotations\\20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_554_4162.txt',
 '../Session_4/annotations\\20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3740.txt',
 '../Session_4/annotations\\20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3742.txt',
 '../Session_4/annotations\\20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_396_3752.txt',
 '../Session_4/annotations\\20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt',
 '../Session_4/annotations\\20240102_185605_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_690_3

In [5]:
import re

def count_matching_files(directory_path):
    # Define the regex pattern for the naming convention
    pattern = r'^\d{8}_\d{6}_SN\d+_QUICKVIEW_VISUAL_\d+_.+\.txt$'
    
    # Initialize the count
    matching_file_count = 0
    
    # Iterate over all files in the directory
    for file_name in os.listdir(directory_path):
        # Check if the file name matches the pattern
        if re.match(pattern, file_name):
            matching_file_count += 1
    
    return matching_file_count

matching_files = count_matching_files(annotations_folder)
print(f"Number of files matching the convention: {matching_files}")


Number of files matching the convention: 194
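
The pattern above accepts any text after the first version number. A stricter variant with named groups could be sketched as follows. It assumes the version always has three numbers and that the region has the shape `NAME_digits_digits`, as in the example from the exercise statement. `STRICT_PATTERN` and `parse_name` are hypothetical names, the second test name is made up, and the count this pattern gives on the real folder may differ from the one above.

```python
import re

# Hypothetical stricter pattern: three version numbers and a region
# shaped like SATL-2KM-11N_404_3770 (both are assumptions)
STRICT_PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+_\d+_\d+)"
    r"_(?P<region>[^_]+_\d+_\d+)\.txt"
)

def parse_name(file_name):
    """Return a dict of fields, or None if the name does not match."""
    match = STRICT_PATTERN.fullmatch(file_name)
    return match.groupdict() if match else None

good = parse_name("20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt")
# Made-up name with an MS prefix, used only to show a non-match
bad = parse_name("20240101_174301_MS33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt")
print(good)
print(bad)
```

Named groups give every field in a single pass, which later exercises could reuse instead of splitting the name again.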


## Exercise 3:
1) Sort the months by number of annotation files.
2) Find the number of annotations for each month and for each year.

From here onwards, files that do not follow the naming convention are included as well. I also do not use regex in this exercise.

My strategy is a simple string operation: split the file name on '_' and take the first element (index 0), which holds the date.

In [6]:
import os

def count_files(files_path):

    # Dictionary to store counts of files for each month-year
    file_counts = {}
    
    # Iterate over all files in the directory
    for file_name in os.listdir(files_path):
        # Split the file name into parts using the underscore as a delimiter
        parts = file_name.split('_')
        

        if (len(parts) >= 6 and 
            len(parts[0]) == 8 and parts[0].isdigit()):  # Date check
            
            year_month = parts[0][:6]  # YYYYMM extracted here 
            
            # Increment of the count for the corresponding year-month
            if year_month in file_counts:
                file_counts[year_month] += 1
            else:
                file_counts[year_month] = 1 #To create a new key in the dictionary
    
    # Sort the results by count in descending order by using lambda x: x[1] to indicate the second element in the tuple
    sorted_counts = sorted(file_counts.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_counts

# Build the sorted (year-month, count) list by passing the annotations folder to the function:
monthly_counts = count_files(annotations_folder)

# Print the results
for year_month, count in monthly_counts:
    print(f"{year_month} | {count}")


202406 | 52
202402 | 45
202404 | 37
202405 | 28
202401 | 27
202403 | 17
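
The function above counts per year-month only. As a minimal sketch, `collections.Counter` can keep per-month and per-year totals at the same time. The helper name `count_by_month_and_year` is an assumption, and the sample uses three names from the listings above rather than the full folder.

```python
from collections import Counter

def count_by_month_and_year(file_names):
    """Count names per YYYYMM and per YYYY, skipping names without a date."""
    per_month, per_year = Counter(), Counter()
    for name in file_names:
        date = name.split("_")[0]
        if len(date) == 8 and date.isdigit():
            per_month[date[:6]] += 1
            per_year[date[:4]] += 1
    return per_month, per_year

sample = [
    "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt",
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
]
per_month, per_year = count_by_month_and_year(sample)
print(per_month.most_common())  # months ordered by count, largest first
print(per_year)
```

In the notebook, `os.listdir(annotations_folder)` could be passed in place of `sample`.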


## Exercise 4:

Create a new annotations folder with multiple folders corresponding to a month.

#### Strategy: The results above show that all annotations are from 2024 and span only six months (January to June), so the month can be read from the 5th and 6th characters of the date string (slice `[4:6]`).

In [7]:
import os
import shutil

def create_new_folder(base_directory):
    """
    Creates a new 'new_annotations' folder with subfolders for each month from January to June.
    
    Parameters:
        base_directory (str): The base directory where the 'new_annotations' folder will be created.
    """
    # Path for the new_annotations folder
    annotation_folder = os.path.join(base_directory, 'new_annotations')
    
    # Create the new_annotations folder if it does not exist
    if not os.path.exists(annotation_folder):
        os.makedirs(annotation_folder)
    
    # List of months to create subfolders for (January to June)
    months = ['01', '02', '03', '04', '05', '06']
    
    # Create a folder for each month
    for month in months:
        # Folder name in the format "2024-MM"
        folder_name = f'2024-{month}'
        # Full path for the subfolder
        month_folder_path = os.path.join(annotation_folder, folder_name)
        # Create the month subfolder if it does not exist
        if not os.path.exists(month_folder_path):
            os.makedirs(month_folder_path)

    print(f"Annotations folders created under {annotation_folder}.")



In [8]:
# Create new_annotations sub-folder with subfolders under session_4 folder:
session_4_folder = r'../Session_4/'
create_new_folder(session_4_folder)

Annotations folders created under ../Session_4/new_annotations.


Small note: I am stuck here. I do not know how to use shutil to copy the files from annotations to new_annotations.
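
One possible way to do the copy is sketched below. It uses `shutil.copy2` and creates each `YYYY-MM` subfolder on demand. `copy_by_month` is a hypothetical helper, and the demo runs in a temporary folder with made-up, shortened file names so that no real data is touched.

```python
import os
import shutil
import tempfile

def copy_by_month(source_dir, target_dir):
    """Copy each dated .txt file into target_dir/YYYY-MM/ and return the count."""
    copied = 0
    for name in os.listdir(source_dir):
        date = name.split("_")[0]
        if not (name.endswith(".txt") and len(date) == 8 and date.isdigit()):
            continue  # skip anything without a leading YYYYMMDD date
        month_dir = os.path.join(target_dir, f"{date[:4]}-{date[4:6]}")
        os.makedirs(month_dir, exist_ok=True)
        # copy2 leaves the original in place and preserves file metadata
        shutil.copy2(os.path.join(source_dir, name), month_dir)
        copied += 1
    return copied

# Demo on a throwaway folder (file names are made up for the demo)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "annotations")
    dst = os.path.join(tmp, "new_annotations")
    os.makedirs(src)
    for name in ["20240101_174301_SN33_a.txt", "20240623_215120_SN29_b.txt"]:
        open(os.path.join(src, name), "w").close()
    n_copied = copy_by_month(src, dst)
    created = sorted(os.listdir(dst))
    print(n_copied, created)
```

In the notebook this would be called as `copy_by_month(annotations_folder, os.path.join(session_4_folder, 'new_annotations'))`. Use `shutil.move` instead of `shutil.copy2` if the files should be relocated rather than duplicated.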

## Exercise 5:

Print all the annotations from the most recent to the oldest one. 

To be clearer, it refers to the text files sorted from the latest time and date to the earliest.

In [9]:
def sort_by_date_time(directory):

    # Get all .txt files in the directory
    text_files = [f for f in os.listdir(directory) if f.endswith('.txt') and os.path.isfile(os.path.join(directory, f))]
    
    # Extract date and time information for sorting
    # Create a structured array with the file name and extracted date/time
    file_data = np.array(
        [(f, f.split('_')[0], f.split('_')[1]) for f in text_files],  # (file_name, YYYYMMDD, HHMMSS)
        dtype=[('file_name', 'U100'), ('date', 'U8'), ('time', 'U6')] #Needed for sorting as it defines the structure of the data allowing numpy to sort the array
    )
    
    # Use numpy to sort by date and time in descending order
    sorted_files = np.sort(file_data, order=['date', 'time'])[::-1]
    
    # Print the sorted files
    print("Text files from most recent to oldest based on date and time:")
    for file in sorted_files:
        print(file['file_name'])



sort_by_date_time(annotations_folder)

Text files from most recent to oldest based on date and time:
20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt
20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt
20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt
20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt
20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt
20240618_193146_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_530_3682.txt
20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt
20240617_184443_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_702_3566.txt
20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt
20240616_213053_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_460_3792.txt
20240616_213047_SN30_QUICKVI
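
As an alternative to the structured array, the same ordering can be sketched with the built-in `sorted` and a key that parses the timestamp. `timestamp_of` is a hypothetical helper, and the sample holds three names from the output above.

```python
from datetime import datetime

def timestamp_of(file_name):
    """Parse the leading YYYYMMDD_HHMMSS of a file name into a datetime."""
    date, time = file_name.split("_")[:2]
    return datetime.strptime(date + time, "%Y%m%d%H%M%S")

sample = [
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt",
]
newest_first = sorted(sample, key=timestamp_of, reverse=True)
for name in newest_first:
    print(name)
```

Parsing with `datetime.strptime` also raises a `ValueError` for names whose date or time is not valid, which the plain string comparison would let through silently.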

## Exercise 6:

1) How many different satellite numbers there are.
2) How many annotations we have per satellite number.
3) Which satellite number was used in the most recent annotation file.

In this exercise, we only work with satellite numbers that start with SN. Annotation files whose identifier starts with MS are assumed NOT to be satellites, as they do not follow the naming convention.

In [10]:
def analyze_satellite_data(directory):
    # Dictionary to store counts of satellites
    satellite_counts = {}
    # List to store file data
    file_data = []
    
    # Get all .txt files in the directory
    text_files = [f for f in os.listdir(directory) if f.endswith('.txt') and os.path.isfile(os.path.join(directory, f))]
    
    # Extract satellite numbers and update counts
    for file_name in text_files:
        # Split the file name into parts using the underscore as a delimiter
        parts = file_name.split('_')
        
        # Check if the file name follows the expected pattern
        if len(parts) > 2 and parts[2].startswith('SN'):
            # Extract the satellite number after "SN", date, time information
            satellite_number = parts[2][2:]
            date = parts[0]
            time = parts[1]
            
            # Update the count for the satellite number
            if satellite_number in satellite_counts:
                satellite_counts[satellite_number] += 1
            else:
                satellite_counts[satellite_number] = 1
    
            # Add the file data to the list for sorting (inside the loop, SN files only)
            file_data.append((file_name, date, time, satellite_number))
    
    # Convert the list to a numpy array for sorting
    file_data_np = np.array(file_data, dtype=[('file_name', 'U100'), ('date', 'U8'), ('time', 'U6'), ('satellite_number', 'U10')])

    # Sort the array by date and time in descending order
    sorted_files = np.sort(file_data_np, order=['date', 'time'])[::-1]

    # 1) Number of different satellite numbers
    num_different_satellites = len(satellite_counts)
    
    # 2) Annotations per satellite number
    print(f"Number of different satellite numbers: {num_different_satellites}")
    print("Annotations per satellite number:")
    for satellite, count in satellite_counts.items():
        print(f"Satellite SN{satellite}: {count} annotations")
    
    # 3) Most recent annotation's satellite number
    most_recent_satellite = sorted_files[0]['satellite_number'] if len(sorted_files) > 0 else None
    print(f"The most recent annotation is from satellite SN{most_recent_satellite}.")
    

analyze_satellite_data(annotations_folder)


Number of different satellite numbers: 9
Annotations per satellite number:
Satellite SN33: 16 annotations
Satellite SN24: 26 annotations
Satellite SN31: 19 annotations
Satellite SN27: 29 annotations
Satellite SN28: 16 annotations
Satellite SN29: 22 annotations
Satellite SN26: 37 annotations
Satellite SN30: 18 annotations
Satellite SN43: 11 annotations
The most recent annotation is from satellite SN29.
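
The per-satellite counts can also be obtained with `np.unique(..., return_counts=True)`. The sketch below runs on three names from the Exercise 5 output rather than the full folder, and it relies on the fact that names starting with `YYYYMMDD_HHMMSS` sort chronologically as strings.

```python
import numpy as np

sample = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt",
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt",
]

# Keep only names whose third field starts with "SN"
sn_names = [n for n in sample
            if len(n.split("_")) > 2 and n.split("_")[2].startswith("SN")]
satellites = np.array([n.split("_")[2] for n in sn_names])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(satellites, return_counts=True)
per_satellite = {str(v): int(c) for v, c in zip(values, counts)}
print(per_satellite)

# The largest name as a string is the most recent annotation
most_recent = max(sn_names).split("_")[2]
print(most_recent)
```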


## Exercise 7:

Count the number of unique regions.

Note: Only 194 file names match the SN naming convention, but all 206 text files consistently show a region.
Ideally, all 206 files would therefore be counted, even if 12 of them may not be from satellites.

In [11]:
## Caveat: the pattern below still requires the SN prefix, so at most the 194 SN files can match and the other 12 files are skipped.
## To capture all 206, SN\d+ would have to be relaxed (for example to [^_]+). The count printed below comes from the SN-only pattern.

def count_unique_regions(directory):

    # Set to store unique regions
    unique_regions = set()

    # Define the regex pattern for the file naming convention with the focus on SATL
    pattern = r'^\d{8}_\d{6}_SN\d+_QUICKVIEW_VISUAL_\d+_\d+_\d+_(SATL-[\w\-]+_\d+_\d+)\.txt$'

    # Get all .txt files in the directory
    text_files = [f for f in os.listdir(directory) if f.endswith('.txt') and os.path.isfile(os.path.join(directory, f))]
    
    # Extract the region part from each file name using regex
    for file_name in text_files:
        # Match the file name against the regex pattern
        match = re.match(pattern, file_name)
        if match:
            # Extract the region from the capturing group
            region_info = match.group(1)
            # Add the region to the set
            unique_regions.add(region_info)
    
    # Number of unique regions
    num_unique_regions = len(unique_regions)
    
    # Print the result
    print(f"Number of unique regions: {num_unique_regions}")



count_unique_regions(annotations_folder)



Number of unique regions: 137
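
Since the note above expects every file to carry a region, a split-based extraction that ignores the satellite prefix could be sketched as follows. It assumes the region is always the last three underscore-separated fields before `.txt`. `region_of` is a hypothetical helper, and the third sample name is made up to repeat a region.

```python
def region_of(file_name):
    """Return the last three underscore-separated fields, or None if too short."""
    stem = file_name[:-len(".txt")] if file_name.endswith(".txt") else file_name
    parts = stem.split("_")
    if len(parts) < 3:
        return None
    return "_".join(parts[-3:])

sample = [
    "20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt",
    "20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt",
    # Made-up name that repeats the region of the previous entry
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_552_4164.txt",
]
unique_regions = {region_of(n) for n in sample} - {None}
print(len(unique_regions), sorted(unique_regions))
```

On the real folder this may give a different number than the regex-based count, because it does not filter on the SN prefix.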
