## **[Serialize Your Data With Python](https://realpython.com/python-serialize-data/)**
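- Serialization, also known as marshalling, is the process of converting a piece of data into an interim representation suitable for transmission over a network or for persistent storage on a medium such as an optical disc.
- All serialized data, regardless of its original shape or form, ends up as a **byte stream**.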
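
As a small illustration of the byte-stream idea, the sketch below serializes the same dictionary with a textual format (JSON) and a binary one (pickle). The `record` dictionary is made up for this example; in both cases what finally goes to disk or over the wire is a `bytes` object.

```python
import json
import pickle

# Hypothetical sample record, used only for illustration
record = {"name": "John Doe", "age": 42}

# JSON yields text, which must still be encoded to bytes before transmission
as_json = json.dumps(record).encode("utf-8")

# pickle yields bytes directly
as_pickle = pickle.dumps(record)

print(type(as_json), type(as_pickle))

# Both byte streams can be turned back into an equal dictionary
assert json.loads(as_json.decode("utf-8")) == record
assert pickle.loads(as_pickle) == record
```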

In [10]:
# Example XML data (Atom feed)

xml_example = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <subtitle>A subtitle.</subtitle>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
</feed>
"""

In [4]:
! pip install xmltodict pymongo

Collecting xmltodict
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Collecting pymongo
  Downloading pymongo-4.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (670 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: xmltodict, dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.7.2 xmltodict-0.13.0


In [11]:
import xmltodict
import bson

def parse_and_convert_to_bson(xml_data):
    # Parse XML to a Python dictionary
    data_dict = xmltodict.parse(xml_data)

    # Serialize the dictionary to BSON (module-level encode() replaces the legacy BSON class)
    bson_data = bson.encode(data_dict)

    # Return BSON data
    return bson_data

# Assuming this function is called with the example XML
bson_result = parse_and_convert_to_bson(xml_example)

# Here we would typically write the BSON data to a file or send it over a network
print(bson_result)  # Display the BSON for demonstration (not typically done in production)


b'\xeb\x01\x00\x00\x03feed\x00\xe0\x01\x00\x00\x02@xmlns\x00\x1c\x00\x00\x00http://www.w3.org/2005/Atom\x00\x02title\x00\r\x00\x00\x00Example Feed\x00\x02subtitle\x00\x0c\x00\x00\x00A subtitle.\x00\x03link\x00$\x00\x00\x00\x02@href\x00\x14\x00\x00\x00http://example.org/\x00\x00\x02updated\x00\x15\x00\x00\x002003-12-13T18:30:02Z\x00\x03author\x00\x18\x00\x00\x00\x02name\x00\t\x00\x00\x00John Doe\x00\x00\x02id\x00.\x00\x00\x00urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6\x00\x03entry\x00\xd8\x00\x00\x00\x02title\x00\x1d\x00\x00\x00Atom-Powered Robots Run Amok\x00\x03link\x005\x00\x00\x00\x02@href\x00%\x00\x00\x00http://example.org/2003/12/13/atom03\x00\x00\x02id\x00.\x00\x00\x00urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a\x00\x02updated\x00\x15\x00\x00\x002003-12-13T18:30:02Z\x00\x02summary\x00\x0b\x00\x00\x00Some text.\x00\x00\x00\x00'
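
To make the byte string above easier to read, here is a minimal sketch of how BSON lays out a document that contains only string values: a 4-byte little-endian total length, then one element per key (type byte `0x02`, null-terminated key, 4-byte string length, null-terminated value), then a closing null byte. The `encode_flat_document` helper is hypothetical and handles only `str` to `str` mappings; it is not a replacement for the `bson` package.

```python
import struct

def encode_cstring(text):
    """Encode text as a null-terminated UTF-8 string."""
    return text.encode("utf-8") + b"\x00"

def encode_flat_document(mapping):
    """Encode a dict of str -> str as a BSON document (string elements only)."""
    body = b""
    for key, value in mapping.items():
        encoded_value = encode_cstring(value)
        body += b"\x02" + encode_cstring(key)  # 0x02 marks a UTF-8 string element
        # The string length counts the trailing null byte
        body += struct.pack("<i", len(encoded_value)) + encoded_value
    # Total size = 4-byte length prefix + elements + terminating null byte
    total = 4 + len(body) + 1
    return struct.pack("<i", total) + body + b"\x00"

print(encode_flat_document({"name": "John Doe"}))
```

The result matches the embedded `author` document inside the output above: `\x18\x00\x00\x00\x02name\x00\t\x00\x00\x00John Doe\x00\x00`.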


In [12]:
%%writefile atom.bson
"""
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <subtitle>A subtitle.</subtitle>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
</feed>
"""

Writing atom.bson


In [13]:
! hexdump -C atom.bson | head

00000000  22 22 22 0a 3c 66 65 65  64 20 78 6d 6c 6e 73 3d  |""".<feed xmlns=|
00000010  22 68 74 74 70 3a 2f 2f  77 77 77 2e 77 33 2e 6f  |"http://www.w3.o|
00000020  72 67 2f 32 30 30 35 2f  41 74 6f 6d 22 3e 0a 20  |rg/2005/Atom">. |
00000030  20 3c 74 69 74 6c 65 3e  45 78 61 6d 70 6c 65 20  | <title>Example |
00000040  46 65 65 64 3c 2f 74 69  74 6c 65 3e 0a 20 20 3c  |Feed</title>.  <|
00000050  73 75 62 74 69 74 6c 65  3e 41 20 73 75 62 74 69  |subtitle>A subti|
00000060  74 6c 65 2e 3c 2f 73 75  62 74 69 74 6c 65 3e 0a  |tle.</subtitle>.|
00000070  20 20 3c 6c 69 6e 6b 20  68 72 65 66 3d 22 68 74  |  <link href="ht|
00000080  74 70 3a 2f 2f 65 78 61  6d 70 6c 65 2e 6f 72 67  |tp://example.org|
00000090  2f 22 2f 3e 0a 20 20 3c  75 70 64 61 74 65 64 3e  |/"/>.  <updated>|


| Aspect            | Textual                                              | Binary                                              |
|-------------------|------------------------------------------------------|-----------------------------------------------------|
| **Examples**      | CSV, JSON, XML, YAML                                 | Avro, BSON, Parquet, Protocol Buffers, pickle               |
| **Readability**   | Human and machine-readable                           | Machine-readable                                    |
| **Processing Speed** | Slow with bigger datasets                          | Fast                                                |
| **Size**          | Large due to wasteful verbosity and redundancy       | Compact                                             |
| **Portability**   | High                                                 | May require extra care to ensure platform-independence |
| **Structure**     | Fixed or evolving, often self-documenting            | Usually fixed, which must be agreed on beforehand   |
| **Types of Data** | Mostly text, less efficient when embedding binary data | Text or binary data                               |
| **Privacy and Security** | Exposes sensitive information                  | Makes it more difficult to extract information, but not completely immune |


## **Serialize Python Objects**
> ### **1. Pickle Your Python Objects**
>> - The name alludes to preserving food, as with pickled cucumbers: pickling preserves a Python object for later use.

In [16]:
import pickle

data = 255
with open("filename.pkl", mode="wb") as file:  # binary mode; pickled bytes suit storage and transfer, but mind security
    pickle.dump(data, file)


pickle.dumps(data)

b'\x80\x04K\xff.'
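
The security remark in the comment above matters because unpickling can import and call arbitrary objects. The following is a minimal sketch modeled on the technique described in the Python `pickle` documentation (overriding `Unpickler.find_class`). The `SAFE_BUILTINS` allow-list and the `restricted_loads` name are illustrative assumptions, and such a filter reduces risk without making untrusted pickles safe.

```python
import builtins
import io
import pickle

# Hypothetical allow-list: only these built-in globals may be resolved
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global object
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain values never need a global lookup, so they load normally
print(restricted_loads(pickle.dumps(255)))

# A pickled reference to eval() is rejected instead of being resolved
try:
    restricted_loads(pickle.dumps(eval))
except pickle.UnpicklingError as error:
    print("blocked:", error)
```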

In [17]:
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    print(f"v{protocol}:", pickle.dumps(data, protocol))

pickle.DEFAULT_PROTOCOL

v0: b'I255\n.'
v1: b'K\xff.'
v2: b'\x80\x02K\xff.'
v3: b'\x80\x03K\xff.'
v4: b'\x80\x04K\xff.'
v5: b'\x80\x05K\xff.'


4
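
The listing above suggests a practical choice: `pickle.loads()` detects the protocol from the data itself, so only the writer has to decide. A brief sketch, assuming the reader might run an older Python (protocol 2 dates back to Python 2.3, while protocol 5 requires Python 3.8 or newer):

```python
import pickle

data = 255

# Pin an older protocol when the consumer may run an older interpreter
portable = pickle.dumps(data, protocol=2)

# Use the newest protocol when writer and reader share the same Python
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(portable)

# Every protocol round-trips to the same value; the reader needs no hint
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    assert pickle.loads(pickle.dumps(data, protocol)) == data
```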

In [18]:
with open("filename.pkl", mode="rb") as file:
    pickle.load(file)

pickle.loads(b"\x80\x04K\xff.")

255

In [20]:
import pandas as pd
pd.read_pickle("filename.pkl")

255

In [22]:
import pickle

data = {'key': 'value'}

# Serialize data using the default protocol
serialized_data = pickle.dumps(data)
print(serialized_data)

# Deserialize data using the default protocol
deserialized_data = pickle.loads(serialized_data)

# Print the deserialized data
print(deserialized_data)

b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x03key\x94\x8c\x05value\x94s.'
{'key': 'value'}


> ### **2. Encode Objects Using JSON**
>> - Unlike the binary protocols that the pickle module uses, **JSON is a textual serialization format readable by humans**.


In [23]:
import json

data = {
    "email": None,
    "name": "John Doe",
    "age": 42.5,
    "married": True,
    "children": ["Alice", "Bob"],
}

print(json.dumps(data, indent=4, sort_keys=True))

{
    "age": 42.5,
    "children": [
        "Alice",
        "Bob"
    ],
    "email": null,
    "married": true,
    "name": "John Doe"
}


### The JSON format supports only six native data types:
> **1. Array:**
>> An ordered list of values, which can be of different types. In JSON, arrays are written within square brackets, like [1, 2, 3].

> **2. Boolean:**
>> A data type that can hold only two values: true or false.

> **3. Null:**
>> A type that has only one value: null, which is used to represent the absence of a value.

> **4. Number:**
>> This type includes integers and floating-point numbers, without any distinction between them. Examples include 42 and 3.14.

> **5. Object:**
>> A collection of key-value pairs where keys are strings, and values can be any JSON data type. JSON objects are written in curly braces, like {"key1": "value", "key2": 42}.

> **6. String:**
>> A sequence of zero or more Unicode characters, written with double quotes, like "Hello, World!".
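
Because JSON has only these six types, some Python values change shape on a round trip. The sketch below uses a made-up `original` dictionary to show two common cases: tuples come back as lists, and non-string keys come back as strings.

```python
import json

# Hypothetical sample mixing Python types that JSON cannot represent exactly
original = {"point": (1, 2), 1: "one", "ratio": 0.5, "missing": None, "ok": True}

text = json.dumps(original)
restored = json.loads(text)

print(text)
print(restored)

# The tuple became a list and the integer key became a string
assert restored["point"] == [1, 2]
assert "1" in restored and 1 not in restored
```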

In [24]:
json.dumps({"Saturday", "Sunday"})

TypeError: Object of type set is not JSON serializable

In [25]:
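def serialize_custom(value):
    if isinstance(value, set):
        return {
            "type": "set",
            "elements": list(value)
        }
    # Without this, other unsupported types would silently serialize as null
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")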


data = {"weekend_days": {"Saturday", "Sunday"}}

json.dumps(data, default=serialize_custom)

'{"weekend_days": {"type": "set", "elements": ["Sunday", "Saturday"]}}'

In [26]:
def deserialize_custom(value):
    match value:
        case {"type": "set", "elements": elements}:
            return set(elements)
        case _:
            return value

json_string = """
    {
        "weekend_days": {
            "type": "set",
            "elements": ["Sunday", "Saturday"]
        }
    }
"""

json.loads(json_string, object_hook=deserialize_custom)

{'weekend_days': {'Saturday', 'Sunday'}}
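
The two hooks can also be combined into a full round trip. This sketch uses a `json.JSONEncoder` subclass as an alternative to the `default=` argument and sorts the elements so the output is stable, since set iteration order is arbitrary. The names `SetEncoder` and `decode_set` are hypothetical, and sorting assumes the elements are mutually comparable.

```python
import json

class SetEncoder(json.JSONEncoder):
    def default(self, value):
        if isinstance(value, set):
            # sorted() gives a deterministic order for comparable elements
            return {"type": "set", "elements": sorted(value)}
        # Defer to the base class, which raises TypeError for unsupported types
        return super().default(value)

def decode_set(value):
    # object_hook receives every decoded JSON object as a dict
    if value.get("type") == "set" and "elements" in value:
        return set(value["elements"])
    return value

data = {"weekend_days": {"Saturday", "Sunday"}}

text = json.dumps(data, cls=SetEncoder)
restored = json.loads(text, object_hook=decode_set)

print(text)
print(restored == data)
```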

In [28]:
import numpy as np
from sklearn.linear_model import LinearRegression
import pickle

# Prepare the data: four samples, each a 2-D vector of (x, y) coordinates
X = np.array([[1, 2], [2, 3], [3, 5], [4, 7]])
y = np.array([1, 2, 3, 4])

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Save the model to a pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
    print("Model training and saving complete.")

Model training and saving complete.


In [32]:
# Load the model file.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Example use: predict on new data (new 2-D vectors)
new_data = np.array([[5, 8], [6, 9]])
predictions = model.predict(new_data)
predictions

array([5., 6.])

In [34]:
new_data = np.array([[9, 15]])
predictions = model.predict(new_data)
predictions

array([9.])

### **X (input) must be a 2-D array (matrix)**

In [45]:
new_data = np.array([9, 15])
print(new_data.shape)
predictions = model.predict(new_data)
predictions

(2,)


ValueError: Expected 2D array, got 1D array instead:
array=[ 9 15].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [46]:
new_data = new_data.reshape(1, -1)
print(new_data.shape)

predictions = model.predict(new_data)
predictions

(1, 2)


array([9.])

### **[Model persistence](https://scikit-learn.org/stable/model_persistence.html)**
- **Python specific serialization**

In [38]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
X, y = datasets.load_iris(return_X_y=True)
clf.fit(X, y)

import pickle
s = pickle.dumps(clf)   # s is a bytes object
clf2 = pickle.loads(s)
clf2.predict(X[0:1])

array([0])

- The variable `s`, created by `s = pickle.dumps(clf)`, is a **`bytes` object**.
- `pickle.dumps()` serializes an object into a byte stream (a sequence of bytes), which is suitable for storage or transmission over a network.
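
When pickled bytes travel over a network, the receiver should check that they were not altered before unpickling them. The sketch below signs the payload with an HMAC, one of the precautions mentioned in the Python `pickle` documentation. `SECRET_KEY`, `sign_and_dump`, and `verify_and_load` are hypothetical names, and a plain dictionary stands in for a trained model.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"   # hypothetical key shared by both sides
DIGEST_SIZE = hashlib.sha256().digest_size    # 32 bytes for SHA-256

def sign_and_dump(obj):
    """Pickle obj and prepend an HMAC-SHA256 signature of the pickled bytes."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload

def verify_and_load(blob):
    """Unpickle only if the signature matches the payload."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

# Stand-in for a fitted estimator's state
model_state = {"coef": [1.0, 0.0], "intercept": 0.0}

blob = sign_and_dump(model_state)
print(verify_and_load(blob) == model_state)
```

Signing only proves that the bytes came from a holder of the key; it does not make pickles from unknown sources safe to load.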

In [42]:
print(X[0:1])  # 2-D array: one sample, four features
print(X[0:1].shape)

[[5.1 3.5 1.4 0.2]]
(1, 4)
