# Python for Data Science
## Session 5 
### Basic Libraries II

---

## Outline

1. JSON, Pickle and Parquet formats

2. re library

3. time and datetime libraries

---

## Basic Libraries II

Before working with different formats, let's see how to create and read text files using Python's built-in **open** function. 

In [53]:
# Open a file for writing and add some text
f = open('text_file.txt', 'w')  # built-in function; 'w' opens the file in text mode for writing (creating or overwriting it)
f.write('Hello')
f.write('\n')  # '\n' starts a new line
f.write('Bye')
f.close()

In [54]:
# Open and read content of a file
f = open('text_file.txt', 'r')  # 'r' is the file mode; here it means read
content = f.read()  # read the whole file into a single string
f.close()
print(content)

Hello
Bye


In [55]:
# We can also split the content into lines
f = open('text_file.txt', 'r')
lines = f.read().splitlines()  # splits the text into a list of lines
f.close()
# loop over the lines
for idx, line in enumerate(lines):  # enumerate yields each line together with its index
    print(f'Line {idx}: {line}')

Line 0: Hello
Line 1: Bye


In [56]:
# Let's create a CSV (comma-separated values) file
header = "Name,Age,Grade\n"  # remember that '\n' ends the line
rows = [
    "Jaume,30,8.9\n",
    "Francisco,25,7.1\n",
    "Elena,35,9.2\n"
]

In [57]:
with open("grades.csv", "w") as f:  # 'with' closes the file automatically when the block ends
    f.write(header) # Write the header
    
    # Write each row of data
    for row in rows:
        f.write(row)

In [58]:
# Open the file "grades.csv" in read mode ('r')
with open("grades.csv", "r") as f:
    # Read the whole file and split it into a list of lines, dropping the trailing newline of each line
    lines = f.read().splitlines()

# Take out the first line, which holds the column names (the header)
header = lines.pop(0)
# Split the header into columns, separating by commas
header = header.split(',')

# Print the header to see the columns
print(header)

# Create a dictionary called 'grades' with an empty list under the key 'students'
grades = {'students': []}

# Iterate over the remaining lines (each one represents a student)
for line in lines:
    # Temporary dictionary to store the current student's data
    student_dict = {}
    # Split the current line into values, separated by commas
    values = line.split(',')
    
    # Iterate over the header columns with their index to pair each column with its value
    for idx, column in enumerate(header):
        # Associate each value with its column in the student's dictionary
        student_dict[column] = values[idx]
    
    # Append the student's dictionary to the 'students' list in 'grades'
    grades['students'].append(student_dict)

# Display the final 'grades' dictionary with the data of all students
grades


['Name', 'Age', 'Grade']


{'students': [{'Name': 'Jaume', 'Age': '30', 'Grade': '8.9'},
  {'Name': 'Francisco', 'Age': '25', 'Grade': '7.1'},
  {'Name': 'Elena', 'Age': '35', 'Grade': '9.2'}]}
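
The manual parsing above can also be delegated to Python's built-in `csv` module. The following is a minimal sketch, not part of the original session: it assumes the same three rows, writes them to a temporary directory (used only to keep the example self-contained), and reads them back with `csv.DictReader`, which takes the header row as the dictionary keys.

```python
import csv
import os
import tempfile

rows = [
    {"Name": "Jaume", "Age": "30", "Grade": "8.9"},
    {"Name": "Francisco", "Age": "25", "Grade": "7.1"},
    {"Name": "Elena", "Age": "35", "Grade": "9.2"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "grades.csv")

    # newline="" is what the csv documentation recommends when opening CSV files
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Age", "Grade"])
        writer.writeheader()
        writer.writerows(rows)

    # DictReader uses the first line as keys for every following row
    with open(path, "r", newline="") as f:
        students = [dict(row) for row in csv.DictReader(f)]

grades = {"students": students}
print(grades)
```

As in the manual version, every value comes back as a string; converting ages or grades with `int()` or `float()` is left to the caller.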

## Basic Libraries II

Another useful statement is **with**. It manages the resources opened within its block and closes them automatically once the block finishes, even if an error occurs. It also makes the code more readable and maintainable.

In [59]:
# A file opened in a with statement is closed automatically when the block ends.
# The block works like a small scope: the file is only open inside it.
with open('text_file.txt', 'r') as f:  # no need to call f.close()
    lines = f.read().splitlines()
    
print(lines)

['Hello', 'Bye']
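
A quick way to check this behaviour is the file object's `closed` attribute. The sketch below is only illustrative: it assumes a throwaway file in a temporary directory (the name `demo.txt` is hypothetical) and also shows that an open file can be iterated line by line.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    with open(path, "w") as f:
        f.write("Hello\nBye\n")

    with open(path, "r") as f:
        inside = f.closed  # the file is still open inside the block
        # Iterating over the file yields one line at a time
        lines = [line.rstrip("\n") for line in f]

    outside = f.closed  # the file is closed as soon as the block ends

print(inside, outside, lines)  # False True ['Hello', 'Bye']
```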


## Basic Libraries II

JavaScript Object Notation (JSON) is a text-based format used to store data and exchange it across different platforms and languages.

As in Python dictionaries, data is represented as key-value pairs. 

In [60]:
{
    "students": [
        {
            "name": "Amelie",
            "age": 35
        },
        {
            "name": "Edgar",
            "age": 32
        }
    ]
}

{'students': [{'name': 'Amelie', 'age': 35}, {'name': 'Edgar', 'age': 32}]}

In [61]:
# other valid formats
[
    {
        "name": "Amelie",
        "age": 35
    },
    {
        "name": "Edgar",
        "age": 32
    }
]

[{'name': 'Amelie', 'age': 35}, {'name': 'Edgar', 'age': 32}]

In [62]:
# other valid formats
[
    "Amelie",
    137,
    True,  # in a JSON file, Python's True is written as true
    None,  # in a JSON file, Python's None is written as null
    {"age": 35},
    [10, 12, 13]
]

['Amelie', 137, True, None, {'age': 35}, [10, 12, 13]]
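
The mapping between Python values and JSON text can be checked directly with `json.dumps` and `json.loads`, which work on strings instead of files. This is a small sketch reusing the list from the cell above.

```python
import json

python_value = ["Amelie", 137, True, None, {"age": 35}, [10, 12, 13]]

# dumps() serializes a Python object into a JSON string
json_text = json.dumps(python_value)
print(json_text)  # ["Amelie", 137, true, null, {"age": 35}, [10, 12, 13]]

# loads() parses the JSON string back into Python objects
restored = json.loads(json_text)
print(restored == python_value)  # True
```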

## Basic Libraries II

To read, write and manipulate JSON files, Python provides the built-in json library.

In [63]:
import json # built in library
data = {
    "students": [
        {
            "name": "Amelie",
            "age": 35,
            "scolarship": True
        },
        {
            "name": "Edgar",
            "age": 32,
            "scolarship": None
        }
    ]
}

with open('json_example.json', 'w') as f:  # open the file for writing
    json.dump(data, f)  # serialize 'data' as JSON and write it to the file

In [64]:
# Open the file 'json_example.json' in read mode ('r')
with open('json_example.json', 'r') as f:
    # Load the JSON data and convert it into a Python dictionary or list
    json_data = json.load(f) 

# Print the content loaded from the JSON file
print(json_data)


{'students': [{'name': 'Amelie', 'age': 35, 'scolarship': True}, {'name': 'Edgar', 'age': 32, 'scolarship': None}]}


## Basic Libraries II

Python also includes a Pickle library. Unlike JSON, however, Pickle is a Python-specific serialization format. The Pickle library provides tools to serialize Python objects, which means transforming them into a stream of bytes. It also lets you read these byte streams by deserializing them, transforming them back into the original Python objects.

Unlike JSON, which is text, Pickle is a binary format, so it is usually more compact and, therefore, more efficient.

In [65]:
# pickle serializes the object you give it.
# Even plain text is turned into bytes, so the resulting file is not human-readable.

import numpy as np  # import the numpy library

# Generate an array of 10 random floating-point numbers between 0 and 1
data = np.random.rand(10)

import pickle  # import the pickle module for serialization and deserialization

# Serialize (save) the 'data' object to a file called 'data.pkl'
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)  # write the binary data to the file

# Deserialize (load) the object back from 'data.pkl'
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)  # read the binary data from the file

# Print the loaded data to check that it matches the original
print(loaded_data)


[0.93487018 0.8619097  0.17525086 0.90562099 0.60160151 0.42311113
 0.72666613 0.89620021 0.72594826 0.91393961]
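
Because Pickle is Python-specific, it can store types that JSON has no equivalent for. The sketch below uses a made-up `record` dictionary (its keys and values are assumptions for illustration) holding a `datetime`, a tuple and a set. It survives a Pickle round trip unchanged, while the default JSON encoder rejects it.

```python
import json
import pickle
from datetime import datetime

# Hypothetical record mixing types that JSON cannot represent directly
record = {
    "taken_at": datetime(2024, 1, 1, 17, 43, 1),
    "tile": (404, 3770),              # tuple
    "tags": {"visual", "quickview"},  # set
}

# Pickle round trip in memory: dumps() returns bytes, loads() rebuilds the object
restored = pickle.loads(pickle.dumps(record))
print(restored == record, type(restored["tile"]).__name__)  # True tuple

# JSON has no datetime or set type, so the default encoder raises TypeError
try:
    json.dumps(record)
    json_ok = True
except TypeError as error:
    json_ok = False
    print(f"json.dumps failed: {error}")
```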


## Basic Libraries II

**IMPORTANT**: Be extremely careful when loading pickled data from untrusted sources. Unpickling can execute arbitrary code.
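
To see why, here is a deliberately harmless sketch. The class `Surprise` is hypothetical: its `__reduce__` method tells Pickle which callable to run when the bytes are loaded. Here that callable is the built-in `sorted`, but a malicious file could name any importable function instead.

```python
import pickle


class Surprise:
    """Harmless stand-in for a malicious object (hypothetical class)."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sorted([3, 1, 2])"
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Surprise())

# Loading the bytes calls the function stored in them
result = pickle.loads(payload)
print(result)  # [1, 2, 3]
```

Note that the loaded value is not a `Surprise` instance at all: it is whatever the function chosen by the byte stream returned.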

## Basic Libraries II

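To work with **Parquet** files, you need the **pyarrow** library, either directly or through **pandas**, which relies on an engine such as pyarrow to read and write them. Parquet is a columnar storage format: values are stored column by column rather than row by row, which favours compression and reading only the columns you need. Logically it is still a table, where each row represents a sample and each column represents an attribute. It is a powerful format commonly used as a standard in platforms like **Hugging Face**.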

In [66]:
import pandas as pd  # if the import or the Parquet calls fail, uncomment the following line
# !pip install pandas pyarrow

# Creating a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Writing DataFrame to Parquet file with Pandas
df.to_parquet('data.parquet')

# Reading DataFrame from Parquet file with Pandas
df_loaded = pd.read_parquet('data.parquet')

print(df_loaded)

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35


## Basic Libraries II

When working with text, one of the most powerful tools is regular expressions, aka **regex**. With regex, you can perform complex pattern matching using wildcards and other special characters. Let's start with a simple search:

In [67]:
import re

# complex pattern matching

data = "What a wonderful life if we could play more time."

# Regex pattern to find 'if'
pattern = 'if'

# Find every occurrence of the pattern (the 'if' inside 'life' counts too)
matches = re.findall(pattern, data)

print(matches) 

['if', 'if']
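
The second match comes from the word "life". If only the standalone word is wanted, a word boundary (`\b`) can be added on both sides of the pattern. This is a short sketch on the same sentence.

```python
import re

data = "What a wonderful life if we could play more time."

# Plain pattern: also matches the "if" inside "life"
print(re.findall(r"if", data))      # ['if', 'if']

# \b marks a word boundary, so only the standalone word matches
whole_words = re.findall(r"\bif\b", data)
print(whole_words)                  # ['if']

# re.search returns a match object for the first hit, including its position
match = re.search(r"\bif\b", data)
print(match.span())
```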


## Basic Libraries II

Let's see how we could have handled the exercise from Session 4:

In [68]:
import re  # regular expressions
import glob  # file path pattern matching
import os  # operating system utilities

# Regex pattern; the 'r' prefix makes it a raw string,
# so Python does not treat backslashes ('\') as escape characters
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'  
# _QUICKVIEW_VISUAL_ is the fixed part that never changes

# Use glob to list all .txt files in the 'session_4/annotations/' directory
annotations = glob.glob('session_4/annotations/*.txt')

# Iterate over each annotation file
for annotation in annotations:

    # Keep only the file name (without the full path)
    filename = os.path.basename(annotation)
    
    # Match the file name against the regex pattern
    match = re.match(pattern, filename)
    if match:
        # On a match, unpack the captured groups (date, time, etc.)
        date, time, satellite_number, version, unique_region = match.groups()
        
        # Print the extracted values in a readable format
        print(f"Date: {date}; Time: {time}; SN: {satellite_number}; ver: {version}; region: {unique_region}")

    else:
        # No match: print the file name so it can be inspected
        print(filename)


Date: 20240102; Time: 185527; SN: 27; ver: 1_1_10; region: SATL-2KM-11N_740_3850
Date: 20240623; Time: 193704; SN: 27; ver: 1_7_0; region: SATL-2KM-11N_566_3734
Date: 20240402; Time: 184757; SN: 24; ver: 1_2_0; region: SATL-2KM-11N_488_3638
Date: 20240423; Time: 190101; SN: 26; ver: 1_5_0; region: SATL-2KM-11N_418_3872
Date: 20240201; Time: 075140; SN: 26; ver: 1_1_10; region: SATL-2KM-39N_558_2794
Date: 20240222; Time: 074151; SN: 26; ver: 1_1_10; region: SATL-2KM-39N_560_2794
Date: 20240101; Time: 174301; SN: 33; ver: 1_1_10; region: SATL-2KM-11N_404_3770
Date: 20240218; Time: 180121; SN: 33; ver: 1_1_10; region: SATL-2KM-10N_568_4176
Date: 20240101; Time: 192856; SN: 24; ver: 1_1_10; region: SATL-2KM-10N_552_4164
Date: 20240402; Time: 184757; SN: 24; ver: 1_2_0; region: SATL-2KM-11N_486_3630
Date: 20240102; Time: 185954; SN: 24; ver: 1_1_10; region: SATL-2KM-11N_414_3786
Date: 20240213; Time: 212524; SN: 29; ver: 1_1_10; region: SATL-2KM-11N_542_3750
Date: 20240222; Time: 074155; SN

In [69]:
# explanation of each part of the pattern

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'

# (\d{8}): Captures 8 digits (YYYYMMDD).
# _(\d{6}): Captures 6 digits (HHMMSS).
# _SN(\d+): Captures one or more digits (the satellite number).
# _QUICKVIEW_VISUAL_([\d_]+): Captures digits and underscores (the version).
# _([A-Za-z0-9\-_.]+): Captures letters, numbers, hyphens (-), underscores (_), and dots (.).
# \.txt: Matches the literal .txt extension (the backslash escapes the dot).
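
The same pattern can be made easier to read by naming its groups with `(?P<name>...)`. The sketch below is one possible variant, not the session's original solution: the group names (`date`, `time`, `satellite`, `version`, `region`) are chosen here for illustration, and the file name is one that appears in the annotations.

```python
import re

# Same structure as the pattern above, with a name for each captured group
named_pattern = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>[\d_]+)_(?P<region>[A-Za-z0-9\-_.]+)\.txt"
)

filename = "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt"

match = named_pattern.match(filename)
# groupdict() returns the captured values keyed by group name
fields = match.groupdict() if match else {}
print(fields)
```

With named groups, `fields["date"]` replaces positional unpacking, so adding a group later does not break existing code.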

## Basic Libraries II

**time** and **datetime** are two more built-in Python libraries, used in plenty of pipelines involving time measurements, timestamp creation and date manipulation.

In [70]:
import time

In [71]:
# Get the current timestamp, e.g. to record when something was generated
# result: seconds elapsed since the epoch (1970-01-01 00:00:00 UTC)
t = time.time() 
print(t)

1730722777.775599


In [72]:
time.sleep(1)  # wait 1 second
# pauses the execution of this program for the given number of seconds
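
A common use of the time library is measuring how long a piece of code takes. The sketch below uses `time.perf_counter()`, a clock intended for measuring durations, with `time.sleep(0.2)` standing in for the real work being timed.

```python
import time

start = time.perf_counter()  # high-resolution clock meant for measuring durations
time.sleep(0.2)              # stand-in for the work being timed
elapsed = time.perf_counter() - start

print(f"Elapsed: {elapsed:.3f} s")
```

The exact value varies from run to run, since the operating system may resume the program slightly late.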

In [73]:
# Format the local time of the machine where the code runs
formatted_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())  # time.localtime() returns the current local time
print(formatted_time)

2024-11-04 13:19:38


In [74]:
from datetime import datetime, timedelta  # import the 'datetime' and 'timedelta' classes from the 'datetime' module

# now() returns the current date and time
now = datetime.now()
print(now)  # print the current date and time

# As with strftime in the time library, datetime objects can be formatted as strings
formatted_now = now.strftime("%Y-%m-%d %H:%M:%S")  
# Formats the current date as a "Year-Month-Day Hour:Minute:Second" string
print(formatted_now)  # print the formatted date

# Parse a string into a datetime object using strptime
parsed_date = datetime.strptime("2024-10-17 21:00:00", "%Y-%m-%d %H:%M:%S")
# The format string must describe the layout of the input string
print(parsed_date)  # print the parsed datetime object

# Add one week to the current date using 'timedelta' with 7 days
future_date = now + timedelta(days=7)  
# Adding a timedelta to a datetime produces a new datetime
print(future_date)  # print the date one week in the future


2024-11-04 13:19:38.825142
2024-11-04 13:19:38
2024-10-17 21:00:00
2024-11-11 13:19:38.825142


In [75]:
parsed_date.year, parsed_date.month, parsed_date.day, parsed_date.hour

(2024, 10, 17, 21)
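
Datetime objects can also be compared and subtracted, and the difference is a `timedelta`. The sketch below uses two timestamps taken from the annotation file names (2024-01-01 17:43:01 and 2024-06-23 19:37:04).

```python
from datetime import datetime

fmt = "%Y%m%d%H%M%S"
first = datetime.strptime("20240101174301", fmt)
last = datetime.strptime("20240623193704", fmt)

delta = last - first          # subtracting two datetimes gives a timedelta
print(delta)                  # 174 days, 1:54:03
print(delta.days)             # 174
print(delta.total_seconds())  # 15040443.0
print(first < last)           # True
```

Being able to compare datetimes with `<` is what makes sorting by date possible.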

## Basic Libraries II

Let's now use them to order the annotations by date.

In [76]:
import re  # regular expressions
import glob  # find files matching a pattern
import os  # file system utilities
from datetime import datetime  # 'datetime' class for working with dates and times

# Regex pattern: the 'r' prefix makes it a raw string, so escape characters are not interpreted
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

# Find all .txt files in the 'session_4/annotations/' directory
annotations = glob.glob('session_4/annotations/*.txt')

# List where we will store, for each annotation, its datetime object
ann_datetime = []

# Iterate over each annotation file
for annotation in annotations:

    # Keep only the file name (without the full path)
    filename = os.path.basename(annotation)
    
    # Match the file name against the regex pattern
    match = re.match(pattern, filename)
    if match:
        # Keep the date and time (the first two groups) and ignore the rest (using _ for values we don't need).
        # The names date_str and time_str avoid shadowing the 'time' module imported earlier.
        date_str, time_str, _, _, _ = match.groups()

        # Join date and time, for example: "20240101192856" (YearMonthDayHourMinuteSecond)
        datetime_str = date_str + time_str 

        # Convert that string into a datetime object
        datetime_obj = datetime.strptime(datetime_str, "%Y%m%d%H%M%S")

        # Show the resulting datetime object
        print(f"Datetime Object: {datetime_obj}")
        
        # Append a tuple with the file name and its datetime object to the list
        ann_datetime.append((filename, datetime_obj))

print(ann_datetime)


Datetime Object: 2024-01-02 18:55:27
Datetime Object: 2024-06-23 19:37:04
Datetime Object: 2024-04-02 18:47:57
Datetime Object: 2024-04-23 19:01:01
Datetime Object: 2024-02-01 07:51:40
Datetime Object: 2024-02-22 07:41:51
Datetime Object: 2024-01-01 17:43:01
Datetime Object: 2024-02-18 18:01:21
Datetime Object: 2024-01-01 19:28:56
Datetime Object: 2024-04-02 18:47:57
Datetime Object: 2024-01-02 18:59:54
Datetime Object: 2024-02-13 21:25:24
Datetime Object: 2024-02-22 07:41:55
Datetime Object: 2024-06-03 21:52:26
Datetime Object: 2024-01-04 22:03:39
Datetime Object: 2024-06-03 21:53:48
Datetime Object: 2024-06-16 21:30:53
Datetime Object: 2024-01-15 21:38:34
Datetime Object: 2024-02-22 07:41:55
Datetime Object: 2024-03-22 21:25:16
Datetime Object: 2024-01-26 17:37:52
Datetime Object: 2024-03-17 22:12:29
Datetime Object: 2024-05-06 19:20:08
Datetime Object: 2024-02-22 07:41:51
Datetime Object: 2024-02-20 19:04:55
Datetime Object: 2024-01-01 17:43:01
Datetime Object: 2024-04-23 19:01:01
D

In [77]:
# Build a list with the datetime of each (filename, datetime) tuple in 'ann_datetime'
# and get the indices that would sort it chronologically
indices = np.argsort([date for name, date in ann_datetime])

# Show the indices that sort the dates
indices

array([ 25,   6,   8,  69,  44, 151,  88,  80,   0, 185,  10,  14, 164,
       155, 158,  17,  20, 124,  39,  60, 127,  90, 111, 148,  30, 144,
       115, 157,   4,  72, 190, 179,  97, 170, 109,  49, 172, 147, 145,
       175,  41, 110, 132, 165, 116,  74,  11,  78, 112,  99,   7,  87,
        24, 131,   5,  23, 103, 141, 149,  55,  18,  47,  12, 119, 162,
        42, 126,  50, 173, 152, 153,  54,  58, 128,  68,  86,  64,  33,
       102,  52,  81,  21, 193, 154,  66, 182,  19,  37,  53,  82, 123,
       146, 181,  32,   2,   9, 174,  46, 183, 160,  38,  77,  26, 169,
         3, 189,  61, 139,  71, 177,  96,  70,  98, 166,  83,  22, 161,
       192, 114,  28,  36, 113, 140, 104, 125,  73, 130,  51,  84,  29,
       150, 133, 117, 180, 186,  27,  93, 137, 188, 159,  34, 105, 163,
       143,  13,  15,  76, 178,  62, 138,  57, 184, 191,  75, 122,  31,
        65, 106,  92, 118,  35, 121,  91, 108, 129, 107,  79,  63, 176,
        85, 167,  95, 187,  67, 134,  48,  94, 100, 168, 120,  4

In [78]:
for i in indices:
    print(ann_datetime[i][0])

20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt
20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt
20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt
20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_554_4162.txt
20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt
20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_396_3752.txt
20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3742.txt
20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3740.txt
20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt
20240102_185605_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_690_3572.txt
20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt
20240104_220339_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_556_4178.txt
20240110_192002_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_380_3728.txt
20240112_192510_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_386_3750.txt
202401
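
The same ordering can be obtained without NumPy by using the built-in `sorted` with a key function. The sketch below is an alternative to the `np.argsort` approach, shown on three (filename, datetime) pairs taken from the annotations.

```python
from datetime import datetime

# A few (filename, datetime) pairs in the same shape as ann_datetime
samples = [
    ("20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt",
     datetime(2024, 1, 2, 18, 59, 54)),
    ("20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt",
     datetime(2024, 1, 1, 17, 43, 1)),
    ("20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
     datetime(2024, 1, 2, 18, 55, 27)),
]

# key selects the value to sort by: the datetime in position 1 of each tuple
ordered = sorted(samples, key=lambda pair: pair[1])

for name, when in ordered:
    print(when, name)
```

`sorted` returns a new list and leaves the original untouched, while `list.sort` with the same `key` sorts in place.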

### Exercise


Reusing the annotations we worked with in the previous session, answer the following items using the libraries we saw today: 

1. How many annotations are there per month and year? Which month has the most annotation files?
2. Create a dictionary where each **key** is a month and the corresponding **value** is a list of the names of all annotations dated in that month. 
    a. Save it in JSON format, and load it again to check that everything is OK.
    b. Save it again, this time using Pickle.
    c. Instead of storing a list of annotation names for each month, create for each annotation a dictionary with the keys name and date (a datetime object).
3. Print all the annotations from the second half of 2024, from the oldest to the newest. 

1. How many annotations are there per month and year? Which month has the most annotation files?

In [97]:
import re 
import glob  
import os  
from datetime import datetime  

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

annotations_per_month = {}

for annotation in annotations:
    filename = os.path.basename(annotation)
    match = re.match(pattern, filename)
    if match:
        date, _, _, _, _ = match.groups()
        date_format = datetime.strptime(date, "%Y%m%d")
        year = date_format.year
        month = date_format.month
        # up to here, this is the new material from session 5;
        # from here on, we apply the same approach used in session 4
        key = (year, month)

        if key in annotations_per_month:
            annotations_per_month[key] += 1
        else:
            annotations_per_month[key] = 1

print("Annotations per month and year:")
for (year, month), count in annotations_per_month.items():   
    print(f" In {month} of {year} there are {count} files")

most_repeated = 0
most_repeated_month = None

for key, count in annotations_per_month.items():
    if count > most_repeated:
        most_repeated = count
        most_repeated_month = key

if most_repeated_month:
    year, month = most_repeated_month
    print(f"The month with the most annotations is {month} of {year} with {most_repeated} appearances.")


Annotations per month and year:
 In 1 of 2024 there are 27 files
 In 6 of 2024 there are 52 files
 In 4 of 2024 there are 25 files
 In 2 of 2024 there are 45 files
 In 3 of 2024 there are 17 files
 In 5 of 2024 there are 28 files
The month with the most annotations is 6 of 2024 with 52 appearances.
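
The counting logic can also be written with `collections.Counter` from the standard library. The sketch below is an alternative to the manual dictionary. The `dates` list is a small made-up sample of date strings in the format captured by the first regex group, not the real dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of date strings as captured by the first regex group
dates = ["20240102", "20240623", "20240402", "20240201", "20240101", "20240104"]

counts = Counter()
for date in dates:
    parsed = datetime.strptime(date, "%Y%m%d")
    counts[(parsed.year, parsed.month)] += 1  # missing keys start at 0

# Sorting the items orders them chronologically by (year, month)
for (year, month), count in sorted(counts.items()):
    print(f"{year}-{month:02d}: {count} files")

# most_common(1) returns a list with the single most frequent (key, count) pair
(top_year, top_month), top_count = counts.most_common(1)[0]
print(f"Most annotations: {top_month} of {top_year} with {top_count} files")
```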


2. Create a dictionary where each **key** is a month and the corresponding **value** is a list of the names of all annotations dated in that month. 
    
    a. Save it in JSON format, and load it again to check that everything is OK.
    
    b. Save it again, this time using Pickle.
   
    c. Instead of storing a list of annotation names for each month, create for each annotation a dictionary with the keys name and date (a datetime object).


In [99]:
# First, let's create the dictionary with lists
import re 
import glob  
import os  
from datetime import datetime  

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

annotations_by_month = {}

for annotation in annotations:
    filename = os.path.basename(annotation)
    match = re.match(pattern, filename)
    if match:
        date, _, _, _, _ = match.groups()
        date_format = datetime.strptime(date, "%Y%m%d")
        year = date_format.year
        month = date_format.month

        key = (year, month)
        # same as before, but now we store the filename instead of counting
        if key in annotations_by_month:
            annotations_by_month[key].append(filename)
        else:
            annotations_by_month[key] = [filename]
print('Annotations by month:')
for (year, month), files in annotations_by_month.items():
    print(f"In {month}-{year}, existing files: {files}")


Annotations by month:
In 1-2024, existing files: ['20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt', '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt', '20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt', '20240104_220339_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_556_4178.txt', '20240115_213834_SN28_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_376_3722.txt', '20240126_173752_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_386_3722.txt', '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt', '20240130_173903_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_366_3756.txt', '20240127_190620_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_500_3600.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt', '20240127_190620_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_500_3602.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_554

In [100]:
# JSON
import json

# JSON object keys must be strings, so json.dump raises a TypeError with tuple keys; convert each (year, month) key to a 'year-month' string first
string_annotations = {f"{year}-{month}": files for (year, month), files in annotations_by_month.items()}

with open('annotations_by_month.json', 'w') as json_file:
    json.dump(string_annotations, json_file)

with open('annotations_by_month.json', 'r') as json_file:
    data_json = json.load(json_file)

print(f"Annotations by month in JSON format: {data_json}")


Annotations by month in JSON format: {'2024-1': ['20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt', '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt', '20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt', '20240104_220339_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_556_4178.txt', '20240115_213834_SN28_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_376_3722.txt', '20240126_173752_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_386_3722.txt', '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt', '20240130_173903_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_366_3756.txt', '20240127_190620_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_500_3600.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt', '20240127_190620_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_500_3602.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_554_
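
After loading, the keys are strings such as '2024-1', not the original tuples. If the (year, month) tuples are needed again, they can be rebuilt from the strings. The sketch below assumes a small dictionary with hypothetical file names and zero-pads the month, which is a choice made here so that the keys also sort correctly as text.

```python
import json

annotations_by_month = {
    (2024, 1): ["a.txt", "b.txt"],  # hypothetical file names
    (2024, 6): ["c.txt"],
}

# Tuple keys -> "YYYY-MM" strings
as_strings = {
    f"{year}-{month:02d}": files
    for (year, month), files in annotations_by_month.items()
}
json_text = json.dumps(as_strings)

# After loading, turn the string keys back into (year, month) tuples
loaded = json.loads(json_text)
restored = {
    tuple(int(part) for part in key.split("-")): files
    for key, files in loaded.items()
}

print(restored == annotations_by_month)  # True
```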

In [82]:
# PICKLE
import pickle

with open('annotations_by_month.pkl', 'wb') as pickle_file:
    pickle.dump(annotations_by_month, pickle_file)

with open('annotations_by_month.pkl', 'rb') as pickle_file:
    pickle_data = pickle.load(pickle_file)

print(f"Annotations by month in Pickle format: {pickle_data}")


Annotations by month in Pickle format: {(2024, 1): ['20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt', '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt', '20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt', '20240104_220339_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_556_4178.txt', '20240115_213834_SN28_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_376_3722.txt', '20240126_173752_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_386_3722.txt', '20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt', '20240130_173903_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_366_3756.txt', '20240127_190620_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_500_3600.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt', '20240127_190620_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_500_3602.txt', '20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_5

In [102]:
# DICTIONARY
annotations_by_month_dict = {}

for annotation in annotations:
    filename = os.path.basename(annotation)
    match = re.match(pattern, filename)
    if match:
        date, _, _, _, _ = match.groups()
        date_format = datetime.strptime(date, "%Y%m%d")
        year = date_format.year
        month = date_format.month

        key = (year, month)

        # up to here it is the same as before; now each annotation gets its own dictionary
        annotation_info = {
            "name": filename,
            "date": date_format  # the datetime parsed from this file's name
        }

        if key in annotations_by_month_dict:
            annotations_by_month_dict[key].append(annotation_info)
        else:
            annotations_by_month_dict[key] = [annotation_info]

print("Annotations by month and year in dictionary:")
for (year, month), details in annotations_by_month_dict.items():
    print(f"In {month}-{year}:")
    for annotation in details:
        print(f" For the {annotation['name']}, the date is {annotation['date']}")


Annotations by month and year in dictionary:
In 1-2024:
 For the 20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt, the date is 2024-01-02 00:00:00
 For the 20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt, the date is 2024-01-01 00:00:00
 For the 20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt, the date is 2024-01-01 00:00:00
 For the 20240102_185954_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_414_3786.txt, the date is 2024-01-02 00:00:00
 For the 20240104_220339_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_556_4178.txt, the date is 2024-01-04 00:00:00
 For the 20240115_213834_SN28_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_376_3722.txt, the date is 2024-01-15 00:00:00
 For the 20240126_173752_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_386_3722.txt, the date is 2024-01-26 00:00:00
 For the 20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt, the date is 2024-01-01 00:00:00
 For the 20240130_173903_SN33_QUICKVIEW_

3. Print all the annotations from the second half of 2024, from the oldest to the newest. 

In [103]:
import re
import glob
import os
from datetime import datetime

pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'

annotations = glob.glob('session_4/annotations/*.txt')

annotations_list = []

for annotation in annotations:
    filename = os.path.basename(annotation)
    match = re.match(pattern, filename)
    if match:
        date, _, _, _, _ = match.groups()
        date_format = datetime.strptime(date, "%Y%m%d")
        # same as before we only need to add the filter of the month
        if date_format.year == 2024 and (7 <= date_format.month <= 12):
            annotations_list.append((filename, date_format))

# Sort the list by datetime (oldest to newest)
annotations_list.sort(key=lambda x: x[1])

# Print the sorted annotations
print("Annotations from the second half of 2024:")
for filename, date in annotations_list:
    print(f"{date.strftime('%Y-%m-%d')} - {filename}")

# the result is empty: as seen above, our data only covers the first half of 2024

Annotations from the second half of 2024:
