# Python for Data Science
## Session 5 
### Basic Libraries II

---

## Outline

1. JSON, Pickle and Parquet formats

2. The `re` library

3. The `time` and `datetime` libraries

---

## Basic Libraries II

Before working with different formats, let's see how we can create and read text files using Python's built-in function **open**.

- Pickle files
- JSON files
- Parquet format
- `'\n'` means new line




In [1]:
# Open a file and write to it
f = open('text_file.txt', 'w')
f.write('Hello')
f.write('\n')
f.write('Bye')
f.close()

In [2]:
# Open and read content of a file
f = open('text_file.txt', 'r')
content = f.read()
f.close()
print(content)

Hello
Bye


In [5]:
# We can also split the content into lines with splitlines()
f = open('text_file.txt', 'r')
lines = f.read().splitlines()
f.close()
# loop over the lines
for idx, line in enumerate(lines): # enumerate returns the index and the element
    print(f'Line {idx}: {line}')

Line 0: Hello
Line 1: Bye
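
Besides `'w'` and `'r'`, **open** also accepts the mode `'a'`, which appends to an existing file instead of overwriting it, and a file object can be iterated line by line. A minimal sketch, assuming a scratch file called `append_demo.txt` (a hypothetical name, created in a temporary directory so that no other file is touched):

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'append_demo.txt')

f = open(path, 'w')   # 'w' creates the file (or empties an existing one)
f.write('Hello\n')
f.close()

f = open(path, 'a')   # 'a' keeps the existing content and writes at the end
f.write('Bye\n')
f.close()

f = open(path, 'r')
# Iterating over a file object yields one line at a time, newline included
stripped_lines = [line.rstrip('\n') for line in f]
f.close()

print(stripped_lines)
```

Iterating line by line avoids loading the whole file into memory, which matters for large files.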


In [6]:
# Let's create a CSV (comma separated values) file
header = "Name,Age,Grade\n"
rows = [
    "Jaume,30,8.9\n",
    "Francisco,25,7.1\n",
    "Elena,35,9.2\n"
]

In [7]:
with open("grades.csv", "w") as f:
    f.write(header) # Write the header
    
    # Write each row of data
    for row in rows:
        f.write(row)

In [8]:
with open("grades.csv", "r") as f:
    lines = f.read().splitlines()
    
header = lines.pop(0)
header = header.split(',')

print(header)

grades = {'students': []}
# create dictionary
for line in lines:
    student_dict = {}
    values = line.split(',')
    for idx, column in enumerate(header):
        student_dict[column] = values[idx]
    grades['students'].append(student_dict)
    
grades

['Name', 'Age', 'Grade']


{'students': [{'Name': 'Jaume', 'Age': '30', 'Grade': '8.9'},
  {'Name': 'Francisco', 'Age': '25', 'Grade': '7.1'},
  {'Name': 'Elena', 'Age': '35', 'Grade': '9.2'}]}
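
The manual parsing above keeps every value as a string. As a sketch of an alternative, the standard library's `csv` module can do the splitting for us; here the same content is read from an in-memory string (an assumption made only to keep the example self-contained), and the numeric columns are converted explicitly:

```python
import csv
import io

# Same content as grades.csv, kept in memory so the sketch is self-contained
csv_text = "Name,Age,Grade\nJaume,30,8.9\nFrancisco,25,7.1\nElena,35,9.2\n"

# DictReader uses the first row as the header and yields one dict per row
reader = csv.DictReader(io.StringIO(csv_text))
students = [dict(row) for row in reader]

# Values are still strings; convert them where numbers are needed
for student in students:
    student['Age'] = int(student['Age'])
    student['Grade'] = float(student['Grade'])

grades_from_csv = {'students': students}
print(grades_from_csv)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file object returned by `open("grades.csv", "r", newline="")`.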

## Basic Libraries II

Another useful statement is **with**. It manages the resources opened within its block and releases them when the block finishes, for example by closing a file, even if an error occurs. It also makes the code more readable and maintainable.

In [9]:
with open('text_file.txt', 'r') as f: # we don't have to close the open file, f.close()
    lines = f.read().splitlines()
    
print(lines)

['Hello', 'Bye']
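
The closing behaviour of **with** can be checked through the `closed` attribute of the file object. A small sketch, assuming a scratch file `with_demo.txt` (a hypothetical name) in a temporary directory:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical scratch file

with open(path, 'w') as f:
    f.write('Hello\nBye\n')
    inside = f.closed       # False: the file is still open inside the block

outside = f.closed           # True: leaving the block closed the file

# The file is also closed when the block is left because of an error
try:
    with open(path, 'r') as g:
        raise ValueError('something went wrong while reading')
except ValueError:
    closed_after_error = g.closed

print(inside, outside, closed_after_error)
```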


## Basic Libraries II

JavaScript Object Notation (JSON) is a text-based format used for storing data and exchanging it across different platforms and languages.

As in Python dictionaries, data is represented as key-value pairs. 

In [12]:
{
    "students": [
        {
            "name": "Amelie",
            "age": 35
        },
        {
            "name": "Edgar",
            "age": 32
        }
    ]
}

{'students': [{'name': 'Amelie', 'age': 35}, {'name': 'Edgar', 'age': 32}]}

In [13]:
# other valid formats
[
    {
        "name": "Amelie",
        "age": 35
    },
    {
        "name": "Edgar",
        "age": 32
    }
]

[{'name': 'Amelie', 'age': 35}, {'name': 'Edgar', 'age': 32}]

In [14]:
# other valid formats
[
    "Amelie",
    137,
    True, # True is written as true in a JSON file
    None, # None is written as null in a JSON file
    {"age": 35},
    [10, 12, 13]
]

['Amelie', 137, True, None, {'age': 35}, [10, 12, 13]]
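
The correspondence between Python values and JSON text can be seen with `json.dumps` and `json.loads`, which work on strings instead of files. A minimal sketch using the list from the cell above:

```python
import json

python_value = ['Amelie', 137, True, None, {'age': 35}, [10, 12, 13]]

# Python object -> JSON text: True becomes true and None becomes null
json_text = json.dumps(python_value)
print(json_text)

# JSON text -> Python object: the original values come back
restored = json.loads(json_text)
print(restored == python_value)
```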

## Basic Libraries II

To read, write and manipulate JSON files, Python provides the built-in **json** library.

In [15]:
import json
data = {
    "students": [
        {
            "name": "Amelie",
            "age": 35,
            "scholarship": True
        },
        {
            "name": "Edgar",
            "age": 32,
            "scholarship": None
        }
    ]
}

with open('json_example.json', 'w') as f: # write down json
    json.dump(data, f)

In [None]:
with open('json_example.json', 'r') as f:
    json_data = json.load(f)
    
print(json_data)
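
A JSON round trip does not always return exactly the object that was saved, because JSON only knows a few types. The sketch below (with made-up data) shows two common cases: tuples come back as lists, and non-string keys come back as strings.

```python
import json

original = {1: ('a', 'b'), 'nested': {'ok': True}}

# Save to JSON text and load it back
restored = json.loads(json.dumps(original))

print(restored)              # {'1': ['a', 'b'], 'nested': {'ok': True}}
print(restored == original)  # False: the key and the tuple changed type
```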

## Basic Libraries II

Python also includes the **pickle** library. In contrast to JSON, Pickle is a Python-specific serialization format. The library provides tools to serialize Python objects, that is, to transform them into a stream of bytes, and to deserialize those byte streams back into the original Python objects.

Because it is a binary format, a pickle is often more compact than the equivalent JSON text, and it can store Python objects, such as NumPy arrays, that JSON cannot represent directly.

In [12]:
import numpy as np
data = np.random.rand(10)

import pickle

# Serializing (dumping) the object
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

# Deserializing (loading) the object
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(loaded_data)

[0.11420616 0.45564953 0.49928084 0.63505174 0.68843895 0.01117045
 0.82594488 0.59493887 0.0017653  0.17464063]
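
To get a feeling for the size difference between the two formats, we can serialize the same data both ways and compare the number of bytes. This is only a sketch on one kind of data (a list of random floats, chosen here as an assumption); the result depends on what is being stored, so no particular ratio should be expected in general.

```python
import json
import pickle
import random

random.seed(0)  # fixed seed so the data is reproducible
values = [random.random() for _ in range(1000)]

json_bytes = json.dumps(values).encode('utf-8')  # text representation
pickle_bytes = pickle.dumps(values)              # binary representation

print(f'JSON:   {len(json_bytes)} bytes')
print(f'Pickle: {len(pickle_bytes)} bytes')

# Both formats restore the same list
from_json = json.loads(json_bytes.decode('utf-8'))
from_pickle = pickle.loads(pickle_bytes)
print(from_json == values, from_pickle == values)
```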


## Basic Libraries II

**IMPORTANT**: Be extremely careful when loading pickled data from untrusted sources. Loading a pickle can execute arbitrary code.
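
A harmless illustration of why this matters: an object can tell pickle, through its `__reduce__` method, which function to call when it is loaded. In the sketch below the class `Surprise` is made up for the demonstration and the function is just `print`, but a malicious file could name any callable instead.

```python
import pickle

class Surprise:
    # __reduce__ tells pickle how to rebuild the object:
    # "call this function with these arguments"
    def __reduce__(self):
        return (print, ('This message was printed while unpickling!',))

payload = pickle.dumps(Surprise())

# Loading the bytes calls print(...); no Surprise object is created
result = pickle.loads(payload)
print(result)  # None, the return value of print
```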

## Basic Libraries II

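To work with **Parquet** files, you can use either **pyarrow** directly or **pandas**, which relies on an engine such as pyarrow to read and write the format. Parquet is a columnar storage format: as in a table, each row represents a sample and each column an attribute, but the values are stored on disk column by column, which makes it efficient to compress the data and to read only the columns you need. It is a powerful format, commonly used as a standard on platforms like **Hugging Face**.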

In [13]:
import pandas as pd # pandas also needs a Parquet engine such as pyarrow
# If the import or to_parquet fails, install both from a notebook cell:
# %pip install pandas pyarrow

# Creating a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Writing DataFrame to Parquet file with Pandas
df.to_parquet('data.parquet')

# Reading DataFrame from Parquet file with Pandas
df_loaded = pd.read_parquet('data.parquet')

print(df_loaded)

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35


## Basic Libraries II

When working with text, one of the most powerful tools is regular expressions, aka **regex**. With regex, you can perform complex pattern matching using wildcards and other special characters. Let's start with a simple search:

In [14]:
import re

data = "What a wonderful life if we could play more time."

# Regex pattern to find 'if'
pattern = 'if'

# Search for the pattern
matches = re.findall(pattern, data)

print(matches) 

['if', 'if']
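
The plain pattern also matches the `if` inside `life`, which is why two results come back. As a sketch of how special characters refine a search, the word boundary `\b` restricts the match to the standalone word, and `re.finditer` reports where each match starts:

```python
import re

data = "What a wonderful life if we could play more time."

# \b marks a word boundary, so only the standalone word matches
whole_words = re.findall(r'\bif\b', data)
print(whole_words)   # ['if']

# finditer yields match objects, which know their position in the text
positions = [(m.start(), m.group()) for m in re.finditer('if', data)]
print(positions)     # [(18, 'if'), (22, 'if')]
```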


## Basic Libraries II

Let's see how we could have handled the exercise from Session 4:

In [None]:
import re
import glob
import os

# Regex pattern; the r in front of a string tells Python to treat it as a raw string
# We do this so backslashes are not interpreted as escape characters
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

for annotation in annotations:

    # extract the file name
    filename = os.path.basename(annotation)
    
    # Search and extract values
    match = re.match(pattern, filename)
    if match:
        date, time, satellite_number, version, unique_region = match.groups()
        print(f"Date: {date}; Time: {time}; SN: {satellite_number}; ver: {version}; region: {unique_region}")

In [None]:
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'

# (\d{8}): captures 8 digits (YYYYMMDD).
# _(\d{6}): captures 6 digits (HHMMSS).
# _SN(\d+): captures one or more digits (the satellite number).
# _QUICKVIEW_VISUAL_([\d_]+): captures digits and underscores (the version).
# _([A-Za-z0-9\-_.]+): captures letters, numbers, hyphens (-), underscores (_), and dots (.).
# \.txt: matches the literal .txt extension (the dot is escaped).
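
Groups can also be given names with the `(?P<name>...)` syntax, so that the extracted values are accessed by name instead of by position. The sketch below applies a named version of the same pattern to one of the annotation filenames; the group names (`date`, `time`, `satellite`, `version`, `region`) are chosen here for illustration.

```python
import re

# Same pattern as above, with a name attached to each group
named_pattern = re.compile(
    r'(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)'
    r'_QUICKVIEW_VISUAL_(?P<version>[\d_]+)_(?P<region>[A-Za-z0-9\-_.]+)\.txt'
)

filename = '20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt'

match = named_pattern.match(filename)
# groupdict() returns a dictionary {group name: captured text}
fields = match.groupdict() if match else {}
print(fields)
```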

## Basic Libraries II

**time** and **datetime** are two more Python built-in libraries, used in many pipelines that involve time measurements, timestamp creation and date manipulation.

In [None]:
import time

In [None]:
# Get current timestamp
t = time.time() 
print(t)

In [None]:
time.sleep(1) # wait 1 second(s)

In [None]:
# Format the local time of the machine where the code is run
formatted_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) 
print(formatted_time)
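
A typical use of the time library is measuring how long a piece of code takes. A minimal sketch, where `time.sleep(0.1)` stands in for the work being measured; `time.perf_counter()` is used because it is a clock designed for measuring short durations:

```python
import time

start = time.perf_counter()

time.sleep(0.1)  # placeholder for the code we want to measure

elapsed = time.perf_counter() - start
print(f'Elapsed: {elapsed:.3f} seconds')
```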

In [None]:
from datetime import datetime, timedelta

# method now() gives us the current date and time
now = datetime.now()
print(now)

# Similar to the strftime function in time, we can call it on a datetime object
formatted_now = now.strftime("%Y-%m-%d %H:%M:%S")
print(formatted_now)

# Parsing a string to a datetime object
parsed_date = datetime.strptime("2024-10-17 21:00:00", "%Y-%m-%d %H:%M:%S")
print(parsed_date)

# Adding a week using days with timedelta
future_date = now + timedelta(days=7)
print(future_date)

In [None]:
parsed_date.year, parsed_date.month, parsed_date.day, parsed_date.hour
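
Datetime objects can also be compared and subtracted; the difference between two of them is a `timedelta`. A short sketch, where the start date is an arbitrary value chosen for the example:

```python
from datetime import datetime

start = datetime(2024, 10, 10, 9, 30)  # arbitrary example date
end = datetime.strptime("2024-10-17 21:00:00", "%Y-%m-%d %H:%M:%S")

# Subtracting two datetime objects gives a timedelta
delta = end - start
print(delta)                  # 7 days, 11:30:00
print(delta.days)             # 7
print(delta.total_seconds())  # 646200.0

# Comparisons follow chronological order
print(end > start)            # True
```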

## Basic Libraries II

Let's now use them to order the annotations by date.

In [None]:
import re
import glob
import os
from datetime import datetime

# Regex pattern; the r in front of a string tells Python to treat it as a raw string
# We do this so backslashes are not interpreted as escape characters
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

# Let's create a list where, for each annotation, we store its name and its datetime object
ann_datetime = []

for annotation in annotations:

    # extract the file name
    filename = os.path.basename(annotation)
    
    # Search and extract values
    match = re.match(pattern, filename)
    if match:
        # Use names that do not shadow the time module imported earlier
        date_str, time_str, _, _, _ = match.groups()

        # Put them together, e.g. "20240101192856"
        datetime_str = date_str + time_str 

        # Parse the string into a datetime object
        datetime_obj = datetime.strptime(datetime_str, "%Y%m%d%H%M%S")

        # Output the datetime object
        print(f"Datetime Object: {datetime_obj}")
        
        ann_datetime.append((filename, datetime_obj))

In [None]:
import numpy as np

# argsort returns the indices that would sort the dates in ascending order
indices = np.argsort([date for name, date in ann_datetime])
indices

In [None]:
for i in indices:
    print(ann_datetime[i][0])
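
The same ordering can be obtained without NumPy by using the built-in `sorted` with a `key` function. A sketch on a few made-up `(filename, datetime)` pairs that have the same shape as `ann_datetime`:

```python
from datetime import datetime

# Hypothetical (filename, datetime) pairs, same shape as ann_datetime
pairs = [
    ('b.txt', datetime(2024, 6, 3, 21, 52, 26)),
    ('c.txt', datetime(2024, 6, 23, 19, 37, 4)),
    ('a.txt', datetime(2024, 1, 15, 8, 0, 0)),
]

# key picks the element used for the comparison: here, the datetime
ordered = sorted(pairs, key=lambda pair: pair[1])

names = [name for name, _ in ordered]
print(names)  # ['a.txt', 'b.txt', 'c.txt']
```

Passing `reverse=True` to `sorted` would list the newest annotations first.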

### Exercise


Reusing the annotations we worked with in the previous session, answer the following questions using the libraries we saw today:

1. How many annotations are there per month and year? Which month has the most annotation files?
2. Create a dictionary where each **key** is a month and the corresponding **value** is a list of the names of all annotations whose date falls in that month.
    a. Save it in JSON format, and load it again to check that everything is OK.
    b. Save it again, this time using Pickle.
    c. Instead of storing a list of annotation names for each month, create for each annotation a dictionary with the keys name and date (using a datetime object).
3. Print all the annotations from the second half of 2024, from the oldest to the newest.

1. How many annotations are there per month and year? Which month has the most annotation files?

In [2]:
import os
import re
from datetime import datetime
from collections import Counter

# Define the path to the annotations folder
annotations_path = '/Users/enriccortesarbues/Documents/ESADE/Term 1/Python for Data Science/PDS_EC/session_4/annotations'

# Define the naming convention pattern
pattern = re.compile(r"^(\d{8})_\d{6}_SN\d+_QUICKVIEW_VISUAL_[\d_]+_.+\.txt$")

# List to hold dates extracted from filenames
dates = []

# Check if the annotations path exists
if os.path.exists(annotations_path):
    # Get all files in the annotations folder (including those organized in month folders)
    for root, dirs, files in os.walk(annotations_path):
        for filename in files:
            match = pattern.match(filename)
            if match:
                # Extract the date portion from the filename
                date_str = match.group(1)  # This captures the DATE part (YYYYMMDD)
                date = datetime.strptime(date_str, '%Y%m%d')  # Convert to a datetime object
                month_year = date.strftime('%Y-%m')  # Format as 'YYYY-MM'
                dates.append(month_year)

    # Count occurrences of each month/year combination
    month_counts = Counter(dates)

    # Identify the month with the most annotations
    most_common_month, most_common_count = month_counts.most_common(1)[0] if month_counts else (None, 0)

    # Output the results
    print("Annotations per month and year:", month_counts)
    print(f"Month with the most annotations: {most_common_month} ({most_common_count} files)")

else:
    print("The specified annotations folder does not exist.")

Annotations per month and year: Counter({'2024-06': 52, '2024-02': 45, '2024-05': 28, '2024-01': 27, '2024-04': 25, '2024-03': 17})
Month with the most annotations: 2024-06 (52 files)


2. Create a dictionary where each **key** is a month and the corresponding **value** is a list of the names of all annotations whose date falls in that month.
    
    a. Save it in JSON format, and load it again to check that everything is OK.
    
    b. Save it again, this time using Pickle.
    
    c. Instead of storing a list of annotation names for each month, create for each annotation a dictionary with the keys name and date (using a datetime object).

In [8]:
import json
import pickle
from collections import defaultdict

# Define the path to the annotations folder

annotations_path = '/Users/enriccortesarbues/Documents/ESADE/Term 1/Python for Data Science/PDS_EC/session_4/annotations'

# Define the naming convention pattern

pattern = re.compile(r"^(\d{8})_(\d{6})_SN\d+_QUICKVIEW_VISUAL_[\d_]+_.+\.txt$")

# Step 1: Create a dictionary to hold annotations per month

annotations_by_month = defaultdict(list)

# Populate the dictionary
if os.path.exists(annotations_path):
    for root, dirs, files in os.walk(annotations_path):
        for filename in files:
            match = pattern.match(filename)
            if match:
                # Extract date from filename and convert to month/year format
                date_str = match.group(1)  # DATE part (YYYYMMDD)
                date = datetime.strptime(date_str, '%Y%m%d')
                month_year = date.strftime('%Y-%m')

                # Add filename to the list for the corresponding month
                annotations_by_month[month_year].append(filename)

    # Step 2a: Save the dictionary as JSON
    json_path = 'annotations_by_month.json'
    with open(json_path, 'w') as json_file:
        json.dump(annotations_by_month, json_file, indent=4)
    
    # Load the JSON file to verify it
    with open(json_path, 'r') as json_file:
        loaded_json_data = json.load(json_file)
    print("Loaded JSON data:", loaded_json_data)

    # Step 2b: Save the dictionary using Pickle
    pickle_path = 'annotations_by_month.pkl'
    with open(pickle_path, 'wb') as pickle_file:
        pickle.dump(annotations_by_month, pickle_file)

    # Load the Pickle file to verify it
    with open(pickle_path, 'rb') as pickle_file:
        loaded_pickle_data = pickle.load(pickle_file)
    print("Loaded Pickle data:", loaded_pickle_data)

    # Step 2c: Modify the dictionary to store each annotation as a dictionary with 'name' and 'date'
    annotations_by_month_detailed = defaultdict(list)
    for month_year, annotations in annotations_by_month.items():
        for annotation in annotations:
            match = pattern.match(annotation)
            if match:
                date_str = match.group(1)
                time_str = match.group(2)
                annotation_date = datetime.strptime(date_str + time_str, '%Y%m%d%H%M%S')
                annotation_dict = {'name': annotation, 'date': annotation_date}
                annotations_by_month_detailed[month_year].append(annotation_dict)

    # Save the detailed dictionary as JSON
    detailed_json_path = 'annotations_by_month_detailed.json'
    with open(detailed_json_path, 'w') as json_file:
        json.dump({k: [{'name': d['name'], 'date': d['date'].isoformat()} for d in v] for k, v in annotations_by_month_detailed.items()}, json_file, indent=4)
    
    # Save the detailed dictionary using Pickle
    detailed_pickle_path = 'annotations_by_month_detailed.pkl'
    with open(detailed_pickle_path, 'wb') as pickle_file:
        pickle.dump(annotations_by_month_detailed, pickle_file)

    print("Data saved successfully in JSON and Pickle formats.")

else:
    print("The specified annotations folder does not exist.")

Loaded JSON data: {'2024-06': ['20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt', '20240603_215226_SN28_QUICKVIEW_VISUAL_1_6_0_SATL-2KM-11N_248_4068.txt', '20240603_215348_SN28_QUICKVIEW_VISUAL_1_6_0_SATL-2KM-11N_346_3786.txt', '20240616_213053_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_460_3792.txt', '20240611_025943_SN26_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_748_4366.txt', '20240612_185400_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_574_3714.txt', '20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt', '20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt', '20240616_213047_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_464_3830.txt', '20240616_102144_SN28_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-39N_560_2792.txt', '20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt', '20240608_214614_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_248_4068.txt', '20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt', '20240606_180251_SN3
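
The detailed dictionary is converted with `isoformat()` before being saved as JSON because JSON has no type for dates, so `json.dump` cannot write a datetime object directly. A small sketch of the round trip, using a made-up record (`example_annotation.txt` is a hypothetical name) and `datetime.fromisoformat` to rebuild the datetime after loading:

```python
import json
from datetime import datetime

# Made-up record in the same shape as the detailed dictionary entries
record = {'name': 'example_annotation.txt', 'date': datetime(2024, 6, 23, 19, 37, 4)}

# A datetime object is not JSON serializable
try:
    json.dumps(record)
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# Store the date as an ISO 8601 string instead
serializable = {'name': record['name'], 'date': record['date'].isoformat()}
text = json.dumps(serializable)
print(text)

# After loading, turn the string back into a datetime object
loaded = json.loads(text)
loaded['date'] = datetime.fromisoformat(loaded['date'])
print(loaded == record)  # True
```

Pickle does not need this step, since it can store datetime objects as they are.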

3. Print all the annotations from the second half of 2024, from the oldest to the newest.

In [6]:
# Define the path to the annotations folder

annotations_path = '/Users/enriccortesarbues/Documents/ESADE/Term 1/Python for Data Science/PDS_EC/session_4/annotations'
# Define the naming convention pattern

pattern = re.compile(r"^(\d{8})_(\d{6})_SN\d+_QUICKVIEW_VISUAL_[\d_]+_.+\.txt$")

# List to store files with their datetime information

files_with_dates = []

# Check if the annotations path exists

if os.path.exists(annotations_path):
    # Get all files in the annotations folder (including those organized in month folders)
    for root, dirs, files in os.walk(annotations_path):
        for filename in files:
            match = pattern.match(filename)
            if match:
                # Extract DATE and TIME from the filename
                date_str = match.group(1)  # DATE in YYYYMMDD format
                time_str = match.group(2)  # TIME in HHMMSS format
                # Combine date and time for sorting
                file_datetime = datetime.strptime(date_str + time_str, '%Y%m%d%H%M%S')
                
                # Check if the date is in the second half of 2024
                if file_datetime.year == 2024 and 7 <= file_datetime.month <= 12:
                    file_path = os.path.join(root, filename)
                    files_with_dates.append((file_datetime, file_path))
                    print(f"Added: {file_datetime}: {file_path}")  # Debug statement
    
    # Check if any files were added
    if files_with_dates:
        # Sort files by datetime in ascending order (oldest to newest)
        files_with_dates.sort(key=lambda x: x[0])

        # Print the sorted filenames
        print("\nAnnotations from oldest to newest in the second half of 2024:")
        for file_datetime, file_path in files_with_dates:
            print(f"{file_datetime}: {os.path.basename(file_path)}")
    else:
        print("No matching files were found in the specified date range.")
else:
    print("The specified annotations folder does not exist.")

No matching files were found in the specified date range.
