### Exercise


Given a zip file containing a subfolder of annotation files, each file is named according to the following convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE is expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME is expressed as HHMMSS (hours, minutes and seconds), e.g. 215120
- SATELLITE_NUMBER is an integer that identifies the satellite.
- VERSION gives the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION gives a unique location as a string, e.g. SATL-2KM-10N_552_4164

Find out the following things about your data:

1. How many files the annotations folder has.
2. How many of them follow the name convention expressed above.
3. How many annotations you have per month and year. Which month has the most annotation files.
4. Create a new annotations folder with multiple folders corresponding to a month.
5. Print all the annotations from the most recent to the oldest one. 
6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file. 
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to extract each field from an annotation name.
- You can use sort from numpy on strings.
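
The first tip can be sketched as follows. This is a minimal example rather than the only way to parse the names: `parse_annotation_name` is a hypothetical helper, and it assumes VERSION always has exactly three numeric parts (as in "0_1_2"), so that everything after them belongs to UNIQUE_REGION.

```python
def parse_annotation_name(filename):
    """Split a file name into its fields; return None if it does not fit the convention."""
    if not filename.endswith(".txt"):
        return None
    parts = filename[: -len(".txt")].split("_")

    # Expected layout: DATE, TIME, SN<number>, QUICKVIEW, VISUAL,
    # three version parts (assumption), then the region parts.
    if len(parts) < 9 or parts[3:5] != ["QUICKVIEW", "VISUAL"]:
        return None
    date, time, satellite = parts[0], parts[1], parts[2]
    if not (date.isdigit() and len(date) == 8):
        return None
    if not (time.isdigit() and len(time) == 6):
        return None
    if not (satellite.startswith("SN") and satellite[2:].isdigit()):
        return None
    if not all(part.isdigit() for part in parts[5:8]):
        return None

    return {
        "date": date,
        "time": time,
        "satellite_number": int(satellite[2:]),
        "version": "_".join(parts[5:8]),
        "unique_region": "_".join(parts[8:]),
    }


example = "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"
print(parse_annotation_name(example))
```

Joining the trailing parts back together matters because UNIQUE_REGION itself contains underscores.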

In [230]:
import os
import glob

1. How many files the annotations folder has.

In [243]:
annotations = '/Users/biancabaldonado/Desktop/session_4/annotations'
len(os.listdir(annotations))

218
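
Note that os.listdir returns every entry in the folder, including subfolders and hidden files (for example .DS_Store on macOS). That is one possible explanation for why 218 is larger than the 205 .txt files (193 + 12) found by the glob-based cells below. A minimal sketch that counts only visible regular files, where `count_regular_files` is a hypothetical helper name:

```python
import os


def count_regular_files(folder):
    """Count entries that are regular files, skipping subfolders and dot-files."""
    count = 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file() and not entry.name.startswith("."):
                count += 1
    return count


# Example call on the current directory; pass the annotations path instead.
print(count_regular_files("."))
```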

2. How many of them follow the name convention expressed above.

In [246]:
import re
import glob
import os

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

correct_convention = []
incorrect_convention = []

for annotation in annotations:

    # extract the file name
    filename = os.path.basename(annotation)
    
    # Search and extract values
    match = re.match(pattern, filename)
    if match:
        correct_convention.append(filename)
    else:
        incorrect_convention.append(filename)


print(len(correct_convention))

193
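
The pattern only checks the number of digits, so a name with an impossible date (month 13, for instance) would still count as correct. A minimal sketch of a stricter check, assuming the same pattern as above with named groups added; `CONVENTION` and `follows_convention` are hypothetical names. It uses re.fullmatch so the whole name must match, and datetime.strptime to reject invalid dates and times.

```python
import re
from datetime import datetime

# Same fields as the notebook's pattern, with named groups for readability.
CONVENTION = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>[\d_]+)_(?P<region>[A-Za-z0-9\-_.]+)\.txt"
)


def follows_convention(filename):
    """Return True if the name matches the convention and holds a real date and time."""
    match = CONVENTION.fullmatch(filename)
    if match is None:
        return False
    try:
        datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        # Digits were present but do not form a valid calendar date or time.
        return False
    return True


print(follows_convention("20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"))
```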


In [248]:
len(incorrect_convention)  # excluded from the later counts because these names do not follow the naming convention

12

In [249]:
incorrect_convention

['20240405_183824_409694_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_736_3716.txt',
 '20240418_213446_163074_MS_NS28_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_388_3748.txt',
 '20240417_215406_715231_MS_NS43_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-10N_740_4446.txt',
 '20240412_191539_631044_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_258_4036.txt',
 '20240410_214321_024179_MS_NS30_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_296_3786.txt',
 '20240410_214305_399233_MS_NS43_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_380_3764.txt',
 '20240408_211552_958249_MS_NS29_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_734_3742.txt',
 '20240412_191539_377035_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_258_4038.txt',
 '20240412_191549_672087_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_240_3966.txt',
 '20240407_190149_742846_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_258_4028.txt',
 '20240420_181053_341939_MS_NS33_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-10N_556_4180.txt',
 '20240412_052750_556466_MS_NS29_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-51N_688_442

### Disclaimer

Files that do not follow the naming convention are excluded from all the counts that follow. Their names contain extra fields (for example 409694_MS_NS24) whose meaning the convention does not define, so more information would be needed to interpret them. Only the files that match the convention are analyzed below.

3. How many annotations you have per month and year. Which month has the most annotation files.

In [252]:
import re
import glob
import os
from datetime import datetime
from collections import Counter

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

ann_datetime = []

for annotation in annotations:

    filename = os.path.basename(annotation)
    
    match = re.match(pattern, filename)
    if match:
        date, time, _, _, _ = match.groups()

        # Parse the combined date and time into a datetime object
        datetime_obj = datetime.strptime(date + time, "%Y%m%d%H%M%S")

        ann_datetime.append((datetime_obj.year, datetime_obj.month))

# Count annotations per year, per month, and per (year, month) pair
years_count = Counter(year for year, _ in ann_datetime)
months_count = Counter(month for _, month in ann_datetime)
yearmonth_count = Counter(ann_datetime)

print("Value Count per Year")
for x, y in years_count.items():
    print(f"Year: {x}, Count: {y}")

print("\nValue Count per Month")
for x, y in months_count.items():
    print(f"Month: {x}, Count: {y}") 

print("\nValue Count per Year and Month")
for x, y in yearmonth_count.items():
    print(f"Year & Month: {x}, Count: {y}") 

import calendar

# Month number with the highest count
max_month = max(months_count, key=months_count.get)

# calendar.month_name maps 1..12 to "January".."December"
max_month = calendar.month_name[max_month]

print("\nMonth that has the most annotations:",max_month)


Value Count per Year
Year: 2024, Count: 193

Value Count per Month
Month: 6, Count: 52
Month: 4, Count: 25
Month: 2, Count: 45
Month: 1, Count: 26
Month: 3, Count: 17
Month: 5, Count: 28

Value Count per Year and Month
Year & Month: (2024, 6), Count: 52
Year & Month: (2024, 4), Count: 25
Year & Month: (2024, 2), Count: 45
Year & Month: (2024, 1), Count: 26
Year & Month: (2024, 3), Count: 17
Year & Month: (2024, 5), Count: 28

Month that has the most annotations: June
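
Because DATE leads every file name, the per-month counts can also be read straight from the first six characters. A minimal sketch, assuming the names have already passed the convention check; `count_per_year_month` is a hypothetical helper and the sample names are taken from this notebook's output.

```python
import calendar
from collections import Counter


def count_per_year_month(filenames):
    """Count names per (year, month), read from the leading YYYYMMDD field."""
    return Counter((int(name[:4]), int(name[4:6])) for name in filenames)


sample = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt",
    "20240321_190819_SN27_QUICKVIEW_VISUAL_1_2_0_SATL-2KM-11N_714_3632.txt",
]
counts = count_per_year_month(sample)

# most_common(1) returns the (key, count) pair with the highest count
(top_year, top_month), top_count = counts.most_common(1)[0]
print(f"{calendar.month_name[top_month]} {top_year}: {top_count} files")
```

Keying on (year, month) rather than month alone keeps the same month of different years apart, which matters once the data spans more than one year.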


4. Create a new annotations folder with multiple folders corresponding to a month.

In [235]:
import os

annotations = '/Users/biancabaldonado/Desktop/session_4/annotations'  

os.makedirs(annotations,exist_ok=True)

monthsfolder = ["January","February","March","April","May","June","July","August","September","October","November","December"]

# Create one subfolder per month (a plain loop, since only the side effect is needed)
for month in monthsfolder:
    os.makedirs(os.path.join(annotations, month), exist_ok=True)
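
The cell above creates the month folders but leaves them empty, and it places them inside the existing annotations folder. One way to build a separate folder and fill it is sketched below, as an illustration only: `organize_by_month` and the target folder name in the usage comment are hypothetical, files are copied rather than moved so the original folder stays untouched, and only the months that actually occur get a folder.

```python
import calendar
import os
import re
import shutil

NAME_PATTERN = re.compile(
    r"(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt"
)


def organize_by_month(source_dir, target_dir):
    """Copy convention-following files into target_dir/<MonthName>/; return how many were copied."""
    copied = 0
    for name in sorted(os.listdir(source_dir)):
        source_path = os.path.join(source_dir, name)
        match = NAME_PATTERN.fullmatch(name)
        if match is None or not os.path.isfile(source_path):
            continue
        month = int(match.group(1)[4:6])  # MM from YYYYMMDD
        if not 1 <= month <= 12:
            continue
        month_dir = os.path.join(target_dir, calendar.month_name[month])
        os.makedirs(month_dir, exist_ok=True)
        shutil.copy2(source_path, month_dir)
        copied += 1
    return copied


# Hypothetical usage; the target folder name is an assumption:
# organize_by_month("session_4/annotations", "session_4/annotations_by_month")
```

Grouping by month name alone mixes years together; that is harmless here because every matching file is from 2024.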


5. Print all the annotations from the most recent to the oldest one. 

In [238]:
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'

annotations = glob.glob('session_4/annotations/*.txt')

ann_datetime = []

for annotation in annotations:
    filename = os.path.basename(annotation)
    match = re.match(pattern, filename)
    if match:
        date, time, satellite_number, _, _ = match.groups()
        datetime_str = date + time
        datetime_obj = datetime.strptime(datetime_str, "%Y%m%d%H%M%S")
        ann_datetime.append((filename, datetime_obj))

recent_to_oldest = sorted(ann_datetime, key=lambda x: x[1], reverse=True)

print("Files Arranged from Recent to Oldest:")
for filename, datetime_obj in recent_to_oldest:
    year = datetime_obj.year
    month = datetime_obj.month
    print(f"{filename} --> [Year: {year}, Month: {month}]")

Files Arranged from Recent to Oldest:
20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt --> [Year: 2024, Month: 6]
20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt --> [Year: 2024, Month: 6]
20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt --> [Year: 2024, Month: 6]
20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt --> [Year: 2024, Month: 6]
20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt --> [Year: 2024, Month: 6]
20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt --> [Year: 2024, Month: 6]
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt --> [Year: 2024, Month: 6]
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt --> [Year: 2024, Month: 6]
20240618_193146_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_530_3682.txt --> [Year: 2024, Month: 6]
20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt --> [Year: 2024, Mo
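
Since DATE and TIME are zero-padded and come first in the name, sorting the file names as plain strings gives the same chronological order, which is what the tip about sorting strings points to. A minimal sketch, assuming all names follow the convention; `sort_recent_first` is a hypothetical helper and the sample names come from this notebook's output. Files that share a timestamp may come out in a different relative order than in the datetime-based sort above.

```python
def sort_recent_first(filenames):
    """Sort convention-following names from most recent to oldest by string order."""
    return sorted(filenames, reverse=True)


names = [
    "20240321_190819_SN27_QUICKVIEW_VISUAL_1_2_0_SATL-2KM-11N_714_3632.txt",
    "20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt",
    "20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
]
recent_first = sort_recent_first(names)
for name in recent_first:
    print(name)
```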

6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file. 

In [257]:
satellites = []

for annotation in annotations:

    filename = os.path.basename(annotation)
    
    match = re.match(pattern, filename)
    if match:
        _, _, satellite_number, _, _ = match.groups()
        satellites.append(satellite_number)

satellites_count = Counter(satellites)

# recent_to_oldest (built in task 5) is sorted newest first,
# so its first entry is the most recent annotation file
most_recent_filename = recent_to_oldest[0][0]

pattern2 = r'SN\d+'

match = re.search(pattern2, most_recent_filename)

print("Number of Unique Satellites", len(satellites_count))

print("\nList of Unique Satellites")
for satellite_number in sorted(satellites_count):
    print(f"Satellite Number: {satellite_number}")

print("\nValue Count per Satellite")
for satellite_number,count in sorted(satellites_count.items()):
    print(f"Satellite Number: {satellite_number}, Count: {count}")

print("\nSN Number of Most Recent File")
if match:
    sn_number = match.group(0)
    # Print the most recent file name, not the last name left over from the loop
    print(f"SN Number: {sn_number}, File Name: {most_recent_filename}")

Number of Unique Satellites 9

List of Unique Satellites
Satellite Number: 24
Satellite Number: 26
Satellite Number: 27
Satellite Number: 28
Satellite Number: 29
Satellite Number: 30
Satellite Number: 31
Satellite Number: 33
Satellite Number: 43

Value Count per Satellite
Satellite Number: 24, Count: 26
Satellite Number: 26, Count: 37
Satellite Number: 27, Count: 28
Satellite Number: 28, Count: 16
Satellite Number: 29, Count: 22
Satellite Number: 30, Count: 18
Satellite Number: 31, Count: 19
Satellite Number: 33, Count: 16
Satellite Number: 43, Count: 11

SN Number of Most Recent File
SN Number: SN29, File Name: 20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
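
The satellite counts and the satellite of the most recent file can also be collected in a single pass, without relying on variables from earlier cells. A minimal sketch under the same naming convention; `SAT_PATTERN` and `satellite_summary` are hypothetical names, and the sample names come from this notebook's output.

```python
import re
from collections import Counter

SAT_PATTERN = re.compile(
    r"(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt"
)


def satellite_summary(filenames):
    """Return (annotations per satellite, satellite number of the most recent file)."""
    records = []
    for name in filenames:
        match = SAT_PATTERN.fullmatch(name)
        if match:
            date, time, satellite = match.group(1), match.group(2), int(match.group(3))
            # Zero-padded DATE + TIME compares chronologically as a string
            records.append((date + time, satellite))
    per_satellite = Counter(satellite for _, satellite in records)
    latest = max(records)[1] if records else None
    return per_satellite, latest


sample_names = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt",
    "20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt",
    "20240321_190819_SN27_QUICKVIEW_VISUAL_1_2_0_SATL-2KM-11N_714_3632.txt",
]
per_satellite, latest = satellite_summary(sample_names)
print(len(per_satellite), dict(per_satellite), latest)
```

Converting the satellite number to int keeps the sort order numeric if satellite numbers with different digit counts ever appear.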


7. How many unique regions there are.

In [240]:
regions=[]

for annotation in annotations:

    filename = os.path.basename(annotation)
    
    match = re.match(pattern, filename)
    if match:
        _, _, _, _, unique_region = match.groups()
        regions.append(unique_region)

regions_count = Counter(regions)

print("\nNumber of Unique Region:")
print(len(regions_count))

print("\nValue Count per Unique Region:")
for region, count in sorted(regions_count.items()):
    print(f"Region Number: {region}, Count: {count}")



Number of Unique Region:
136

Value Count per Unique Region:
Region Number: SATL-2KM-10N_542_4168, Count: 2
Region Number: SATL-2KM-10N_544_4186, Count: 1
Region Number: SATL-2KM-10N_546_4206, Count: 1
Region Number: SATL-2KM-10N_550_4202, Count: 1
Region Number: SATL-2KM-10N_552_4162, Count: 3
Region Number: SATL-2KM-10N_552_4164, Count: 2
Region Number: SATL-2KM-10N_554_4162, Count: 3
Region Number: SATL-2KM-10N_554_4172, Count: 2
Region Number: SATL-2KM-10N_556_4176, Count: 1
Region Number: SATL-2KM-10N_556_4178, Count: 1
Region Number: SATL-2KM-10N_556_4180, Count: 1
Region Number: SATL-2KM-10N_558_4184, Count: 1
Region Number: SATL-2KM-10N_560_4178, Count: 2
Region Number: SATL-2KM-10N_562_4170, Count: 1
Region Number: SATL-2KM-10N_562_4178, Count: 1
Region Number: SATL-2KM-10N_562_4196, Count: 1
Region Number: SATL-2KM-10N_564_4194, Count: 1
Region Number: SATL-2KM-10N_568_4176, Count: 1
Region Number: SATL-2KM-10N_594_4136, Count: 1
Region Number: SATL-2KM-10N_596_4134, Count: 