### Exercise


Given a zip file containing a subfolder of annotation files, each named according to the following convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE: the date as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME: the time as HHMMSS (hours, minutes and seconds), e.g. 215120
- SATELLITE_NUMBER: an integer identifying the satellite.
- VERSION: the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION: a string identifying a unique location, e.g. SATL-2KM-10N_552_4164
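
As a minimal sketch of how the convention above maps onto fields, the regular expression below uses one named group per field. The names `NAME_RE` and `parse_name` are hypothetical, and the sketch assumes that VERSION consists only of digits separated by underscores and that UNIQUE_REGION does not itself start with a digit group.

```python
import re

# Hypothetical helper: one named group per field of the naming convention.
# Assumption: VERSION is digits joined by underscores (e.g. "1_7_0").
NAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+(?:_\d+)*)_(?P<region>.+)\.txt"
)

def parse_name(filename):
    """Return a dict of fields, or None if the name does not follow the convention."""
    match = NAME_RE.fullmatch(filename)
    return match.groupdict() if match else None

fields = parse_name(
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"
)
print(fields)
```

All values come back as strings, so the satellite number would still need `int(...)` if numeric ordering matters.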

Find out the following things about your data:

1. How many files the annotations folder contains.
2. How many of them follow the naming convention described above.
3. How many annotations you have per month and per year, and which month has the most annotation files.
4. Create a new annotations folder with one subfolder per month.
5. Print all the annotations from the most recent to the oldest.
6. How many different satellites there are, how many annotations there are per satellite number, and which satellite was used in the most recent annotation file.
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to extract each field from an annotation name.
- You can use sort from numpy on strings.
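
The first tip can be sketched as follows. Because VERSION and UNIQUE_REGION contain underscores themselves, a plain `split("_")` yields more pieces than fields, so some pieces have to be joined again. The slice positions below assume a three-part version such as `1_7_0`, which is an assumption about the data rather than part of the convention. The last lines show that strings in YYYYMMDD form sort chronologically; Python's built-in `sorted` is used here in place of numpy.

```python
import os

name = "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"

stem = os.path.splitext(name)[0]   # drop the ".txt" extension
parts = stem.split("_")

date_str = parts[0]
time_str = parts[1]
satellite = parts[2][2:]           # strip the leading "SN"
version = "_".join(parts[5:8])     # assumes exactly three version components
region = "_".join(parts[8:])       # everything after the version

print(date_str, time_str, satellite, version, region)

# Fixed-width date strings sort in chronological order.
dates = ["20240619", "20231201", "20240101"]
print(sorted(dates))
```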

1. How many files the annotations folder has.

In [2]:
import os
annotations_folder = '/Users/abbi23/Downloads/session_4/annotations'
files = [f for f in os.listdir(annotations_folder) if os.path.isfile(os.path.join(annotations_folder,f))]
total_files = len(files)
total_files

207

2. How many of them follow the naming convention described above.

In [3]:
import re

# One capture group per field: date, time, satellite number, version, region.
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'
valid_files = [f for f in files if re.match(pattern, f)]
valid_file_count = len(valid_files)
valid_file_count

194
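
One caveat worth checking: `re.match` anchors only at the start of the string, so a name with extra characters after `.txt` would still count as valid, and `\d{8}` accepts digit runs that are not real dates. The sketch below shows both effects on made-up file names and a possible stricter check. The helper `is_real_timestamp` is hypothetical, and whether stricter matching would change the count above depends on the actual folder contents.

```python
import re
from datetime import datetime

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'

good = "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"
trailing = good + ".bak"  # hypothetical stray file

# re.match stops caring after ".txt"; re.fullmatch requires the whole name to fit.
match_accepts = bool(re.match(pattern, trailing))
fullmatch_accepts = bool(re.fullmatch(pattern, trailing))
print(match_accepts, fullmatch_accepts)

def is_real_timestamp(date_str, time_str):
    """Hypothetical helper: True if the digits form a valid calendar date and time."""
    try:
        datetime(
            int(date_str[:4]), int(date_str[4:6]), int(date_str[6:8]),
            int(time_str[:2]), int(time_str[2:4]), int(time_str[4:6]),
        )
    except ValueError:
        return False
    return True

print(is_real_timestamp("20240623", "215120"))
print(is_real_timestamp("20241345", "215120"))  # month 13 does not exist
```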

3. How many annotations you have per month and per year, and which month has the most annotation files.

In [4]:
from datetime import datetime

# Count valid annotation files per year and per "YYYY-MM" month key.
month_counts = {}
year_counts = {}

for file in files:
    match = re.match(pattern, file)
    if match:
        date_str = match.group(1)
        year = date_str[:4]
        month = date_str[4:6]
        year_counts[year] = year_counts.get(year, 0) + 1

        month_key = f"{year}-{month}"
        month_counts[month_key] = month_counts.get(month_key, 0) + 1

# The month with the highest count.
most_common_month = max(month_counts, key=month_counts.get)
print(month_counts)
print(most_common_month)




{'2024-01': 27, '2024-06': 52, '2024-04': 25, '2024-02': 45, '2024-03': 17, '2024-05': 28}
2024-06
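
The same tally can be written more compactly with `collections.Counter`. This is an alternative sketch, not the approach used in the cell above, and the sample list below is illustrative (the January name is made up).

```python
from collections import Counter

# Illustrative sample; the first name is made up for the example.
names = [
    "20240105_101500_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt",
    "20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
]

# The date is fixed-width, so year and month can be sliced directly.
year_counts = Counter(name[:4] for name in names)
month_counts = Counter(f"{name[:4]}-{name[4:6]}" for name in names)

busiest_month, busiest_count = month_counts.most_common(1)[0]
print(dict(year_counts))
print(dict(month_counts))
print(busiest_month, busiest_count)
```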


4. Create a new annotations folder with one subfolder per month.

In [5]:
import shutil as sh
new_folder_path = '/Users/abbi23/Downloads/session_4/new_annotations'
os.makedirs(new_folder_path, exist_ok=True)

for file in files:
    filename = os.path.basename(file)
    match = re.match(pattern, filename)  
    if match:
        date_str = match.group(1)
        year = date_str[:4]
        month = date_str[4:6]
        
        month_key = f"{year}-{month}"
        month_folder_path = os.path.join(new_folder_path, month_key)

        os.makedirs(month_folder_path, exist_ok=True)
        sh.copy(os.path.join(annotations_folder, file), os.path.join(month_folder_path, filename))

print(new_folder_path)
os.listdir(new_folder_path)



/Users/abbi23/Downloads/session_4/new_annotations


['2024-06', '2024-01', '2024-04', '2024-03', '2024-02', '2024-05']
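
For comparison, here is a sketch of the same per-month copy written with `pathlib`, run against a temporary directory so it can be tried without touching real data. The function `copy_by_month`, the simplified pattern `DATE_RE`, and the sample file names are all assumptions made for the example.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Simplified, hypothetical pattern: only the year and month are captured.
DATE_RE = re.compile(r"(\d{4})(\d{2})\d{2}_\d{6}_SN\d+_QUICKVIEW_VISUAL_.+\.txt")

def copy_by_month(src, dst):
    """Copy matching files from src into dst/YYYY-MM/ and return the folder names."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.iterdir()):
        match = DATE_RE.fullmatch(path.name)
        if not (path.is_file() and match):
            continue  # skip subfolders and names outside the convention
        month_dir = dst / f"{match.group(1)}-{match.group(2)}"
        month_dir.mkdir(exist_ok=True)
        shutil.copy2(path, month_dir / path.name)
    return sorted(p.name for p in dst.iterdir() if p.is_dir())

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "annotations"
    src.mkdir()
    sample_names = [
        "20240105_101500_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt",
        "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
        "readme.md",  # does not follow the convention, so it is not copied
    ]
    for name in sample_names:
        (src / name).write_text("")
    folders = copy_by_month(src, Path(tmp) / "new_annotations")

print(folders)
```

Returning the folder names sorted makes the result independent of the order in which the operating system lists directory entries.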

5. Print all the annotations from the most recent to the oldest one. 


In [6]:
def timestamp(name):
    """Combine the DATE and TIME fields into a datetime for sorting."""
    match = re.match(pattern, name)
    return datetime.strptime(match.group(1) + match.group(2), '%Y%m%d%H%M%S')

sorted_files = sorted(valid_files, key=timestamp, reverse=True)
for file in sorted_files:
    print(file)


20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt
20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt
20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt
20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt
20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt
20240618_193146_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_530_3682.txt
20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt
20240617_184443_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_702_3566.txt
20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt
20240616_213053_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_460_3792.txt
20240616_213047_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_464_3830.txt
20240616_213047_SN30
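
Since DATE and TIME are fixed-width and zero-padded, the first 15 characters of each name (`YYYYMMDD_HHMMSS`) already sort chronologically as plain strings. The sketch below relies on that assumption to order a few names from the listing above without parsing dates at all.

```python
names = [
    "20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240617_184443_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_702_3566.txt",
]

# Sort on the "YYYYMMDD_HHMMSS" prefix only, newest first.
newest_first = sorted(names, key=lambda name: name[:15], reverse=True)

for name in newest_first:
    print(name)
```

This only holds for names that already follow the convention, so it should be applied to the validated list rather than to every file in the folder.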

6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file.

In [7]:
satellite_counts = {}

for file in valid_files:
    filename = os.path.basename(file)
    match = re.match(pattern, filename)
    if match:
        satellite_number = match.group(3)

        if satellite_number in satellite_counts:
            satellite_counts[satellite_number] += 1
        else:
            satellite_counts[satellite_number] = 1

print(f'Number of total different satellites is {len(satellite_counts)}')

for satellite, count in satellite_counts.items():
    print(f'Satellite {satellite} has {count} annotations.')

most_recent_satellite = re.match(pattern, sorted_files[0]).group(3)
print(f'The most recent annotation file is from satellite SN{most_recent_satellite}.')



Number of total different satellites is 9
Satellite 27 has 29 annotations.
Satellite 24 has 26 annotations.
Satellite 26 has 37 annotations.
Satellite 33 has 16 annotations.
Satellite 29 has 22 annotations.
Satellite 28 has 16 annotations.
Satellite 31 has 19 annotations.
Satellite 30 has 18 annotations.
Satellite 43 has 11 annotations.
The most recent annotation file is from satellite SN29.
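
The per-satellite counts print in dictionary insertion order, which is why they look shuffled. A possible variant, sketched below on four names from the earlier listing, converts the satellite number to `int` so the report can be sorted numerically, and uses `max` to find the newest file without sorting the whole list. The helper `satellite_of` is hypothetical.

```python
from collections import Counter

names = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt",
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt",
    "20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt",
]

def satellite_of(name):
    """Hypothetical helper: third underscore-separated field, without the 'SN' prefix."""
    return int(name.split("_")[2][2:])

counts = Counter(satellite_of(name) for name in names)

# Report in ascending satellite number rather than insertion order.
for satellite in sorted(counts):
    print(f"Satellite {satellite} has {counts[satellite]} annotations.")

# The newest file has the largest "YYYYMMDD_HHMMSS" prefix.
latest = max(names, key=lambda name: name[:15])
latest_satellite = satellite_of(latest)
print(f"The most recent annotation file is from satellite SN{latest_satellite}.")
```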


7. How many unique regions there are.

In [8]:
unique_regions = set()

for file in valid_files:
    filename = os.path.basename(file)
    match = re.match(pattern, filename)
    if match:
        unique_regions.add(match.group(5))

print(f'There are {len(unique_regions)} unique regions.')
print('unique regions:')
for region in unique_regions:
    print(region)

There are 137 unique regions.
unique regions:
SATL-2KM-11N_466_3828
SATL-2KM-10N_544_4186
SATL-2KM-11N_578_3722
SATL-2KM-10N_596_4134
SATL-2KM-11N_244_3818
SATL-2KM-11N_624_3630
SATL-2KM-51N_728_4342
SATL-2KM-11N_380_3764
SATL-2KM-11N_378_3722
SATL-2KM-11N_488_3638
SATL-2KM-10N_630_4262
SATL-2KM-10N_568_4176
SATL-2KM-10N_594_4136
SATL-2KM-11N_376_3724
SATL-2KM-51N_748_4364
SATL-2KM-11N_490_3638
SATL-2KM-39N_558_2794
SATL-2KM-11N_380_3728
SATL-2KM-10N_546_4206
SATL-2KM-51N_730_4348
SATL-2KM-10N_742_4460
SATL-2KM-10N_556_4178
SATL-2KM-11N_574_3714
SATL-2KM-11N_700_3690
SATL-2KM-11N_718_3640
SATL-2KM-51N_686_4422
SATL-2KM-11N_416_3862
SATL-2KM-11N_712_3566
SATL-2KM-10N_542_4168
SATL-2KM-11N_264_4022
SATL-2KM-11N_418_3862
SATL-2KM-11N_566_3734
SATL-2KM-11N_706_3778
SATL-2KM-39N_562_2788
SATL-2KM-10N_726_3862
SATL-2KM-10N_556_4180
SATL-2KM-11N_500_3632
SATL-2KM-11N_546_3742
SATL-2KM-39N_562_2792
SATL-2KM-11N_500_3602
SATL-2KM-52N_368_4336
SATL-2KM-11N_248_4068
SATL-2KM-10N_562_4178
SATL-2KM