# Python for Data Science
## Session 5 
### Basic Libraries II

---

## Outline

1. JSON, Pickle and Parquet formats

2. The `re` library

3. The `time` and `datetime` libraries

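In Python, three commonly used data formats are JSON, Pickle and Parquet. Each has its own strengths, weaknesses and typical use cases, summarized below.

**1. JSON (JavaScript Object Notation)**

Definition: JSON is a lightweight, text-based data interchange format, commonly used to store and exchange structured data, especially between applications and APIs.

Use cases:
- Web applications: JSON is the usual format for exchanging data between front end and back end.
- Configuration files: it can store a program's configuration and metadata.
- Serializing simple objects: it suits strings, numbers, lists, dictionaries and similar simple structures.

Advantages:
- Human-readable: JSON is plain text, so it is easy to read and edit.
- Cross-language: most programming languages support JSON, which makes it easy to share data across platforms.
- Lightweight: the syntax is simple, so data is easy to send over a network.

Disadvantages:
- Complex objects are hard to serialize: Python-specific types (such as sets or class instances) need extra handling.
- Larger files: JSON is uncompressed text, so storage is inefficient for large amounts of data.

**2. Pickle**

Definition: Pickle is Python's object serialization tool. It stores Python objects as binary data so they can be loaded again later.

Use cases:
- Serializing Python objects: it handles lists, dictionaries, class instances and other complex objects.
- Storing machine learning models: it is often used to save a trained model so it can be reloaded and reused.

Advantages:
- Supports complex objects: it can serialize most Python objects, including instances of custom classes and nested data structures (a few, such as lambdas and open file handles, cannot be pickled).
- Easy to use: the pickle API is very simple for both serializing and deserializing.

Disadvantages:
- Security: a pickle file may contain malicious code, so never load files from untrusted sources.
- Not cross-language: the format is specific to Python and does not suit cross-language scenarios.

**3. Parquet**

Definition: Parquet is a columnar storage format designed for data analysis and storage, typically used in big data settings.

Use cases:
- Big data storage: it suits analysis tasks on large datasets, especially in data warehouses or data lakes.
- Data analysis and querying: it is usually combined with big data processing tools (such as Apache Spark or Hadoop).

Advantages:
- Columnar storage: it compresses well and reads efficiently, especially when only specific columns are needed.
- High compression: Parquet supports several compression algorithms, which saves storage space and I/O.
- Compatibility: many data processing tools support it, and it works well in distributed computing environments.

Disadvantages:
- Complexity: compared with JSON and Pickle, working with Parquet files is more involved and requires libraries such as pyarrow or pandas.
- Not human-readable: Parquet is a binary format, so it cannot be read as easily as JSON.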
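
To make the JSON versus Pickle trade-off concrete, here is a minimal sketch that tries to serialize a dictionary containing a set with both libraries. The `record` dictionary is a made-up example, not data from the course.

```python
import json
import pickle

# Hypothetical record; a set is a Python-specific type with no JSON equivalent
record = {"name": "Amelie", "tags": {"a", "b"}}

# JSON refuses the set and raises TypeError
try:
    json.dumps(record)
    json_error = None
except TypeError as exc:
    json_error = str(exc)

# The usual workaround is to convert the set to a list first
as_json = json.dumps({**record, "tags": sorted(record["tags"])})

# Pickle round-trips the set unchanged
restored = pickle.loads(pickle.dumps(record))

print(json_error)
print(as_json)
print(restored == record)
```

The JSON version is readable by any language but loses the fact that `tags` was a set; the pickle keeps the exact Python type but can only be read from Python.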

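- `re.match()` matches only at the beginning of the string.
- `re.search()` scans the whole string and returns the first match.
- `re.findall()` finds all non-overlapping matches and returns them as a list.
- `re.sub()` replaces the matched parts of the string.

**3. Basic regular expression syntax**

3.1 Common metacharacters:
- `.`: matches any character except a newline
- `^`: matches the start of the string
- `$`: matches the end of the string (or just before a trailing newline)
- `*`: matches the preceding element zero or more times
- `+`: matches the preceding element one or more times
- `?`: matches the preceding element zero or one time
- `[]`: matches any one of the characters inside the brackets (for example, `[a-z]` matches any lowercase letter)
- `|`: matches either the expression on the left or the one on the right (for example, `a|b`)

3.2 Common predefined character classes:
- `\d`: matches a digit; equivalent to `[0-9]` for ASCII text (in Python 3 it also matches other Unicode digits unless `re.ASCII` is set)
- `\D`: matches any non-digit character
- `\w`: matches letters, digits and the underscore; equivalent to `[a-zA-Z0-9_]` when `re.ASCII` is set
- `\W`: matches any character that is not a letter, digit or underscore
- `\s`: matches whitespace characters (such as spaces, tabs and newlines)
- `\S`: matches any non-whitespace character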
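
The four functions listed above can be compared side by side. This is a small sketch on a made-up sentence (the `text` string is an assumption for illustration):

```python
import re

# Hypothetical sample text
text = "Order 66 shipped on 2024-11-20, order 7 on 2024-12-01."

# re.match only succeeds at the start of the string
no_match = re.match(r"\d+", text)         # None: the text starts with "Order"
start = re.match(r"Order", text).group()  # "Order"

# re.search returns the first match anywhere in the string
first_number = re.search(r"\d+", text).group()

# re.findall returns every non-overlapping match as a list of strings
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# re.sub replaces every match with the given replacement
masked = re.sub(r"\d", "#", text)

print(no_match, start, first_number)
print(dates)
print(masked)
```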

---

## Basic Libraries II

Before working with different formats, let's see how to create and read text files using Python's built-in function **open**.

In [140]:
# Open a file and write to it
f = open('text_file.txt', 'w')
f.write('Hello')
f.write('\n')
f.write('Bye')
f.close()

In [141]:
# Open and read content of a file
f = open('text_file.txt', 'r')
content = f.read()
f.close()
print(content)

Hello
Bye


In [142]:
# We can also split the content into lines with splitlines()
f = open('text_file.txt', 'r')
lines = f.read().splitlines()
f.close()
# loop over the lines
for idx, line in enumerate(lines): # enumerate returns the index and the element
    print(f'Line {idx}: {line}')

Line 0: Hello
Line 1: Bye


In [143]:
# Let's create a CSV (comma separated values) file
header = "Name,Age,Grade\n"
rows = [
    "Jaume,30,8.9\n",
    "Francisco,25,7.1\n",
    "Elena,35,9.2\n"
]

In [144]:
with open("grades.csv", "w") as f:
    f.write(header) # Write the header
    
    # Write each row of data
    for row in rows:
        f.write(row)

In [145]:
with open("grades.csv", "r") as f:
    lines = f.read().splitlines()
    
header = lines.pop(0)
header = header.split(',')

print(header)

grades = {'students': []}
# create dictionary
for line in lines:
    student_dict = {}
    values = line.split(',')
    for idx, column in enumerate(header):
        student_dict[column] = values[idx]
    grades['students'].append(student_dict)
    
grades

['Name', 'Age', 'Grade']


{'students': [{'Name': 'Jaume', 'Age': '30', 'Grade': '8.9'},
  {'Name': 'Francisco', 'Age': '25', 'Grade': '7.1'},
  {'Name': 'Elena', 'Age': '35', 'Grade': '9.2'}]}
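
The manual parsing above is a good exercise, but the standard library's `csv` module does the same job and also handles quoting. A minimal sketch, using an in-memory copy of the same three rows so it runs on its own (converting `Age` and `Grade` to numbers is an addition, not something the cell above does):

```python
import csv
import io

# Same content as grades.csv, kept in memory so the sketch is self-contained
csv_text = "Name,Age,Grade\nJaume,30,8.9\nFrancisco,25,7.1\nElena,35,9.2\n"

# DictReader uses the first row as the keys of each row dictionary
reader = csv.DictReader(io.StringIO(csv_text))

students = []
for row in reader:
    # Values arrive as strings; convert the numeric columns explicitly
    students.append({
        "Name": row["Name"],
        "Age": int(row["Age"]),
        "Grade": float(row["Grade"]),
    })

grades = {"students": students}
print(grades)
```

With a real file, `io.StringIO(csv_text)` would be replaced by a file object opened with `open("grades.csv", newline="")`.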

## Basic Libraries II

Another useful statement is **with**. It manages the resources opened inside its block, closing them automatically when the block finishes, even if an error occurs. It also makes the code more readable and maintainable.

In [146]:
with open('text_file.txt', 'r') as f: # no need to call f.close(); the file is closed when the block ends
    lines = f.read().splitlines()
    
print(lines)

['Hello', 'Bye']
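
To check that **with** really closes the file, we can inspect the `closed` attribute of the file object. The sketch below writes to a temporary scratch file (the path is an assumption, chosen so the example does not depend on `text_file.txt`):

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("Hello\nBye\n")
    inside = f.closed   # False: the file is still open inside the block

after = f.closed        # True: it was closed as soon as the block ended

# The file is also closed when the block exits through an exception
try:
    with open(path) as g:
        raise ValueError("something went wrong")
except ValueError:
    pass

print(inside, after, g.closed)
```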


## Basic Libraries II

JavaScript Object Notation (JSON) is a text-based format used for storing data and exchanging it across different platforms and languages.

As in Python dictionaries, data is represented as key-value pairs.

In [147]:
{
    "students": [
        {
            "name": "Amelie",
            "age": 35
        },
        {
            "name": "Edgar",
            "age": 32
        }
    ]
}

{'students': [{'name': 'Amelie', 'age': 35}, {'name': 'Edgar', 'age': 32}]}

In [148]:
# other valid formats
[
    {
        "name": "Amelie",
        "age": 35
    },
    {
        "name": "Edgar",
        "age": 32
    }
]

[{'name': 'Amelie', 'age': 35}, {'name': 'Edgar', 'age': 32}]

In [149]:
# other valid formats
[
    "Amelie",
    137,
    True, # in a JSON file, Python's True is written as true
    None, # in a JSON file, Python's None is written as null
    {"age": 35},
    [10, 12, 13]
]

['Amelie', 137, True, None, {'age': 35}, [10, 12, 13]]

## Basic Libraries II

To read, write and manipulate JSON files, Python provides the built-in json library.

In [150]:
import json
data = {
    "students": [
        {
            "name": "Amelie",
            "age": 35,
            "scholarship": True
        },
        {
            "name": "Edgar",
            "age": 32,
            "scholarship": None
        }
    ]
}

with open('json_example.json', 'w') as f: # write the data to a JSON file
    json.dump(data, f)

In [151]:
with open('json_example.json', 'r') as f:
    json_data = json.load(f)
    
print(json_data)

{'students': [{'name': 'Amelie', 'age': 35, 'scholarship': True}, {'name': 'Edgar', 'age': 32, 'scholarship': None}]}
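
Besides `json.dump` and `json.load`, which work with files, the library offers `json.dumps` and `json.loads`, which work with strings. The following sketch (with a made-up `data` dictionary) shows how Python values are written in JSON text and what comes back after a round trip:

```python
import json

# Hypothetical record mixing several Python types
data = {"name": "Amelie", "scholarship": True, "advisor": None, "grades": (8.9, 7.1)}

# indent makes the text easier to read; sort_keys gives a stable key order
text = json.dumps(data, indent=2, sort_keys=True)
print(text)

# True is written as true, None as null, and the tuple as a JSON array
restored = json.loads(text)
print(restored)

# The tuple comes back as a list, so the round trip is not always exact
print(restored == data)
```

The last comparison is `False` because JSON has a single array type: tuples are written as arrays and read back as lists.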


## Basic Libraries II

Python also includes the **pickle** library. In contrast to JSON, Pickle is a Python-specific serialization format. The library serializes Python objects, that is, it transforms them into a stream of bytes, and deserializes such byte streams back into the original Python objects.

Because it is a binary format, a pickle is usually more compact than the equivalent JSON text and, therefore, often more efficient to store and load.

In [152]:
import numpy as np
data = np.random.rand(10)

import pickle

# Serializing (dumping) the object
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

# Deserializing (loading) the object
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(loaded_data)

[0.12448755 0.22450707 0.18077445 0.78055662 0.23850298 0.41243278
 0.78928847 0.25504021 0.58352405 0.553139  ]


## Basic Libraries II

**IMPORTANT**: Be extremely careful when loading pickled data from untrusted sources. Unpickling can execute arbitrary code.
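
The reason is that a pickle can tell Python which function to call while it is being loaded. The sketch below illustrates the mechanism with a harmless built-in, `sorted`; the `Sneaky` class is a made-up example. A malicious file could name any other importable function instead.

```python
import pickle

class Sneaky:
    # __reduce__ tells pickle how to rebuild the object:
    # "call this callable with these arguments".
    # Here the callable is the harmless built-in sorted().
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Sneaky())

# Loading calls sorted([3, 1, 2]) and returns its result,
# not a Sneaky instance
result = pickle.loads(payload)
print(result)
```

Whoever creates the pickle decides what runs on the machine that loads it, which is why only pickles from trusted sources should be opened.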

## Basic Libraries II

To work with **Parquet** files from Python, you typically use **pandas** together with an engine such as **pyarrow**. Parquet is a columnar storage format: a table still has rows (samples) and columns (attributes), but the values of each column are stored together on disk, which makes compression effective and lets you read only the columns you need. It is a powerful format, widely used on platforms like **Hugging Face**.

In [153]:
import pandas as pd # if the import fails, uncomment the following line
# !pip install pandas pyarrow

# Creating a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Writing DataFrame to Parquet file with Pandas
df.to_parquet('data.parquet')

# Reading DataFrame from Parquet file with Pandas
df_loaded = pd.read_parquet('data.parquet')

print(df_loaded)

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35


## Basic Libraries II

When working with text, one of the most powerful tools is regular expressions, also known as **regex**. With regex, you can perform complex pattern matching using wildcards and other special characters. Let's start with a simple example:

In [154]:
import re

data = "What a wonderful life if we could play more time."

# Regex pattern to find 'if' (it also matches the 'if' inside 'life')
pattern = 'if'

# Search for the pattern
matches = re.findall(pattern, data)

print(matches) 

['if', 'if']


## Basic Libraries II

Let's see how we could have handled the exercise from session four:

In [155]:
import re
import glob
import os

# Regex pattern; the r in front of a string tells Python to treat it as a raw string,
# so backslashes are not interpreted as escape characters
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

for annotation in annotations:

    # extract the file name
    filename = os.path.basename(annotation)
    
    # Search and extract values
    match = re.match(pattern, filename)
    if match:
        date, time, satellite_number, version, unique_region = match.groups()
        print(f"Date: {date}; Time: {time}; SN: {satellite_number}; ver: {version}; region: {unique_region}")

Date: 20240101; Time: 123456; SN: 42; ver: 01; region: RegionName.txt


In [156]:
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'

'''
(\d{8}): Captures 8 digits (YYYYMMDD).
_(\d{6}): Captures 6 digits (HHMMSS).
_SN(\d+): Captures one or more digits (the satellite number).
_QUICKVIEW_VISUAL_([\d_]+): Captures digits and underscores (the version).
_([A-Za-z0-9\-_.]+): Captures letters, numbers, hyphens (-), underscores (_), and dots (.) (the region).
\.txt: Matches a literal .txt; re.match does not anchor the end, so append $ or use re.fullmatch to require that the filename ends there.
'''
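
A variant of the same pattern with named groups can make the extracted fields easier to use. This is a sketch, not the course solution: the group names and the sample filename are assumptions, and `fullmatch` is used so that nothing may follow the `.txt` extension.

```python
import re

# Same pattern as above, with named groups (the names are chosen for illustration)
pattern = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>[\d_]+)_(?P<region>[A-Za-z0-9\-_.]+)\.txt"
)

# Hypothetical filename following the naming scheme
filename = "20240101_123456_SN42_QUICKVIEW_VISUAL_01_RegionName.txt"

# fullmatch requires the whole string to match the pattern
match = pattern.fullmatch(filename)
fields = match.groupdict()
print(fields)

# A trailing suffix is rejected by fullmatch but accepted by match
print(pattern.fullmatch(filename + ".bak"))
print(pattern.match(filename + ".bak") is not None)
```

`groupdict()` returns the captured values keyed by group name, so later code can write `fields["date"]` instead of relying on the position of each group.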


## Basic Libraries II

**time** and **datetime** are two more built-in Python libraries, used in many pipelines for measuring time, creating timestamps and manipulating dates.

In [157]:
import time

In [158]:
# Get current timestamp
t = time.time() 
print(t)

1732099706.8522573


In [159]:
time.sleep(1) # wait 1 second(s)

In [160]:
# Format the local time of the machine where the code runs
formatted_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) 
print(formatted_time)

2024-11-20 11:48:27
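
A common use of the time library is measuring how long a piece of code takes. A minimal sketch using `time.perf_counter`, a clock intended for measuring short durations (the summation is just placeholder work):

```python
import time

start = time.perf_counter()

# Placeholder workload to time
total = sum(i * i for i in range(100_000))

elapsed = time.perf_counter() - start
print(f"Sum: {total}; elapsed: {elapsed:.4f} s")
```

`time.time()` would also work here, but it follows the system clock, which can be adjusted while the program runs; `perf_counter` is meant for measuring intervals.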


In [161]:
from datetime import datetime, timedelta

# method now() gives us the current date and time
now = datetime.now()
print(now)

# Similar to the strftime function in time, we can call strftime on a datetime object
formatted_now = now.strftime("%Y-%m-%d %H:%M:%S")
print(formatted_now)

# Parsing a string to a datetime object
parsed_date = datetime.strptime("2024-10-17 21:00:00", "%Y-%m-%d %H:%M:%S")
print(parsed_date)

# Adding a week using days with timedelta
future_date = now + timedelta(days=7)
print(future_date)

2024-11-20 11:48:27.887829
2024-11-20 11:48:27
2024-10-17 21:00:00
2024-11-27 11:48:27.887829


In [162]:
parsed_date.year, parsed_date.month, parsed_date.day, parsed_date.hour

(2024, 10, 17, 21)
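
Datetime objects can also be subtracted and compared. The sketch below uses two fixed dates (chosen for illustration) so the results are reproducible:

```python
from datetime import datetime, timedelta

# Two fixed example dates
start = datetime(2024, 10, 17, 21, 0, 0)
end = datetime(2024, 11, 20, 11, 48, 27)

# Subtracting two datetimes gives a timedelta
delta = end - start
print(delta.days)             # whole days in the interval
print(delta.total_seconds())  # full duration in seconds

# Datetimes compare chronologically
print(start < end)

# ISO 8601 strings are a convenient text form that can be parsed back
iso = start.isoformat()
print(iso)
print(datetime.fromisoformat(iso) == start)

# Adding a timedelta moves a date forward
print(start + timedelta(hours=3))
```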

## Basic Libraries II

Let's now use them to sort the annotations by date:

In [163]:
import re
import glob
import os
from datetime import datetime

# Regex pattern; the r in front of a string tells Python to treat it as a raw string,
# so backslashes are not interpreted as escape characters
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt' 

annotations = glob.glob('session_4/annotations/*.txt')

# let's create a list where, for each annotation, we store its name and datetime object
ann_datetime = []

for annotation in annotations:

    # extract the file name
    filename = os.path.basename(annotation)
    
    # Search and extract values
    match = re.match(pattern, filename)
    if match:
        date_str, time_str, _, _, _ = match.groups()  # names chosen to avoid shadowing the time module

        # Put them together, e.g. "20240101192856"
        datetime_str = date_str + time_str

        # Parse the string into a datetime object
        datetime_obj = datetime.strptime(datetime_str, "%Y%m%d%H%M%S")

        # Output the datetime object
        print(f"Datetime Object: {datetime_obj}")
        
        ann_datetime.append((filename, datetime_obj))

ann_datetime




Datetime Object: 2024-01-01 12:34:56


[('20240101_123456_SN42_QUICKVIEW_VISUAL_01_RegionName.txt.txt',
  datetime.datetime(2024, 1, 1, 12, 34, 56))]

In [164]:
indices = np.argsort([date for name, date in ann_datetime])
indices

array([0], dtype=int64)

In [165]:
for i in indices:
    print(ann_datetime[i][0])


20240101_123456_SN42_QUICKVIEW_VISUAL_01_RegionName.txt.txt
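
The same ordering can be obtained without NumPy by using the built-in `sorted` with a key function. A sketch with made-up (filename, datetime) pairs in the same shape as `ann_datetime`:

```python
from datetime import datetime

# Hypothetical (filename, datetime) pairs, same shape as ann_datetime
ann_datetime = [
    ("b.txt", datetime(2024, 7, 15, 9, 0, 0)),
    ("a.txt", datetime(2024, 1, 1, 12, 34, 56)),
    ("c.txt", datetime(2023, 12, 31, 23, 59, 59)),
]

# Sort by the datetime stored in the second position of each tuple
ordered = sorted(ann_datetime, key=lambda item: item[1])
names = [name for name, _ in ordered]
print(names)

# reverse=True gives newest first
newest_first = [name for name, _ in sorted(ann_datetime, key=lambda item: item[1], reverse=True)]
print(newest_first)
```

`sorted` returns the reordered pairs directly, so there is no need for an intermediate array of indices.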


### Exercise


Reusing the annotations we worked with in the previous session, answer the following questions using the libraries we saw today:

1. How many annotations are there per month and year? Which month has the most annotation files?
2. Create a dictionary where each **key** is a month and the corresponding **value** is a list of the names of all annotations dated in that month.
    a. Save it in JSON format, and load it again to check that everything is OK.
    b. Save it again, this time using Pickle.
    c. Instead of storing a list of annotation names for each month, create for each annotation a dictionary with the keys name and date (a datetime object).
3. Print all the annotations from the second half of 2024, from the oldest to the newest.

In [166]:
import re
import glob
import os
import json
import pickle
from datetime import datetime
pattern = r'(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([A-Za-z0-9\-_.]+)\.txt'
annotations = glob.glob('session_4/annotations/*.txt')
ann_datetime = []
for annotation in annotations:
    filename = os.path.basename(annotation)
    match = re.match(pattern, filename)
    if match:
        date_str, time_str, _, _, _ = match.groups()
        datetime_str = date_str + time_str
        datetime_obj = datetime.strptime(datetime_str, "%Y%m%d%H%M%S")
        ann_datetime.append((filename, datetime_obj))

from collections import Counter
month_year_count = Counter((dt.year, dt.month) for _, dt in ann_datetime)

most_common_month = month_year_count.most_common(1)[0]
print(f"Year-Month with the most annotations: {most_common_month[0]}, Count: {most_common_month[1]}")





Year-Month with the most annotations: (2024, 1), Count: 1


In [168]:
annotations_by_month = {}
for filename, dt in ann_datetime:
    key = f"{dt.year}-{dt.month:02d}"
    if key not in annotations_by_month:
        annotations_by_month[key] = []
    annotations_by_month[key].append(filename)
with open('annotations_by_month.json', 'w') as json_file:
    json.dump(annotations_by_month, json_file)

with open('annotations_by_month.json', 'r') as json_file:
    loaded_json_data = json.load(json_file)
print("Loaded JSON data:", loaded_json_data)

Loaded JSON data: {'2024-01': ['20240101_123456_SN42_QUICKVIEW_VISUAL_01_RegionName.txt.txt']}
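
The grouping step can be shortened with `collections.defaultdict`, which creates the empty list for a new key automatically. A sketch with made-up (filename, datetime) pairs standing in for `ann_datetime`:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (filename, datetime) pairs, same shape as ann_datetime
ann_datetime = [
    ("a.txt", datetime(2024, 1, 1, 12, 34, 56)),
    ("b.txt", datetime(2024, 1, 20, 8, 0, 0)),
    ("c.txt", datetime(2024, 7, 15, 9, 0, 0)),
]

# A missing key is initialized with an empty list on first access
grouped = defaultdict(list)
for filename, dt in ann_datetime:
    grouped[dt.strftime("%Y-%m")].append(filename)

# Convert back to a plain dict before saving or printing
annotations_by_month = dict(grouped)
print(annotations_by_month)
```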


In [169]:
with open('annotations_by_month.pkl', 'wb') as pickle_file:
    pickle.dump(annotations_by_month, pickle_file)
with open('annotations_by_month.pkl', 'rb') as pickle_file:
    loaded_pickle_data = pickle.load(pickle_file)
print("Loaded Pickle data:", loaded_pickle_data)


Loaded Pickle data: {'2024-01': ['20240101_123456_SN42_QUICKVIEW_VISUAL_01_RegionName.txt.txt']}


In [171]:
annotations_with_details = {}
for filename, dt in ann_datetime:
    key = f"{dt.year}-{dt.month:02d}"
    if key not in annotations_with_details:
        annotations_with_details[key] = []
    annotations_with_details[key].append({'name': filename, 'date': dt})
annotations_with_details 

{'2024-01': [{'name': '20240101_123456_SN42_QUICKVIEW_VISUAL_01_RegionName.txt.txt',
   'date': datetime.datetime(2024, 1, 1, 12, 34, 56)}]}
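
Note that this last dictionary can be pickled as it is, but `json.dump` would raise a `TypeError` because JSON has no datetime type. One possible workaround, sketched here with a made-up entry and a hypothetical helper `encode_datetime`, is to store the dates as ISO 8601 strings and parse them back after loading:

```python
import json
from datetime import datetime

# Hypothetical data in the same shape as annotations_with_details
annotations_with_details = {
    "2024-01": [{"name": "a.txt", "date": datetime(2024, 1, 1, 12, 34, 56)}]
}

def encode_datetime(obj):
    # Called by json for objects it cannot serialize on its own
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

text = json.dumps(annotations_with_details, default=encode_datetime)
print(text)

# After loading, the dates are plain strings; convert them back
loaded = json.loads(text)
for entries in loaded.values():
    for entry in entries:
        entry["date"] = datetime.fromisoformat(entry["date"])

print(loaded == annotations_with_details)
```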

In [172]:
second_half_2024 = [item for item in ann_datetime if item[1].year == 2024 and item[1].month >= 7]
second_half_2024_sorted = sorted(second_half_2024, key=lambda x: x[1])

print("\nAnnotations from July to December 2024 (sorted from oldest to newest):")
for filename, dt in second_half_2024_sorted:
    print(f"{dt}: {filename}")


Annotations from July to December 2024 (sorted from oldest to newest):
