# **1. JSON**

**JSON stands for JavaScript Object Notation. It is a lightweight, text-based format for storing, transporting, and serializing structured data, for example over a network. It is language-independent, human-readable, and widely used in data engineering for:**

1. APIs and web services (REST, microservices)

2. Configuration files

3. Data exchange between systems

4. Staging data for ETL pipelines

5. Logging and audit trails


######https://chatgpt.com/share/6812cc86-d314-8003-9306-acc977d46cf3.

# **2. pandas or the json module: which does a data engineer use more often?**
A data engineer needs both pandas and the json module, but they serve different purposes. Here is a breakdown of when and why you would use each:
# **📌 1. pandas: for tabular (structured) data**


- When you're working with rows and columns (tables)

- When your source is CSV, Excel, SQL, Parquet, or structured JSON

- When you need to filter, join, group, clean, or analyze data

- When you're building data pipelines that transform data in bulk

# **📌 2. json: for configurations, API data, metadata**


- When reading config files (like config.json)

- When reading semi-structured data from an API or Kafka

- When writing logs, audit trails, or messages

- When passing pipeline settings (paths, database info, etc.)

# **3.🧱 JSON Structure**
A JSON object is made of key-value pairs wrapped in {}. Keys are always strings; a value can be a string, a number, a boolean, null, an array, or a nested object.


In [None]:
#1. Objects
#Written as key-value pairs wrapped in {}.
{
  "id": 101,
  "name": "Product A"
}


{'id': 101, 'name': 'Product A'}

In [None]:
#2. Arrays
#Lists of values, written in [].
{
  "categories": ["electronics", "gaming", "computers"]
}

{'categories': ['electronics', 'gaming', 'computers']}

In [None]:
#3. Nested Objects
#Useful for representing structured, hierarchical data.

{
  "user": {
    "id": 1,
    "profile": {
      "first_name": "Ahmed",
      "last_name": "Ali"
    }
  }
}

{'user': {'id': 1, 'profile': {'first_name': 'Ahmed', 'last_name': 'Ali'}}}
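
The three structures above map onto Python types when parsed. The following is a minimal sketch of that mapping; the `document` string and its field names are illustrative and not part of the original notebook.

```python
import json

# An illustrative document that uses every JSON value type.
document = """
{
  "id": 101,
  "price": 19.99,
  "name": "Product A",
  "in_stock": true,
  "discount": null,
  "categories": ["electronics", "gaming"],
  "supplier": {"country": "EG"}
}
"""

parsed = json.loads(document)

# Print each key next to the Python type the json module chose for its value.
for key, value in parsed.items():
    print(f"{key:<11} -> {type(value).__name__}")
```

Objects become `dict`, arrays become `list`, `true`/`false` become `bool`, and `null` becomes `None`.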

# **4. 🔄 What Are Serialization and Deserialization in the Context of JSON?**
Serialization is the process of converting an object (like a Python dictionary or class instance) into a format that can be stored or transmitted, such as a JSON string that is written to a file or sent over a network.
The reverse process is called deserialization.

#### **▶️ Serialization**
Python object → JSON string or file.

#### **◀️ Deserialization**
JSON string or file → Python object.
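
A minimal round trip can make the two directions concrete. The `record` dictionary below is an illustrative assumption, not data from the original notebook.

```python
import json

record = {"order_id": 7, "items": ("pen", "ink"), "paid": True}

# Serialization: Python object -> JSON string
text = json.dumps(record)

# Deserialization: JSON string -> Python object
restored = json.loads(text)

print(text)
print(restored)
```

Note that the round trip is not always lossless: JSON has no tuple type, so the tuple in `record` comes back as a list.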

# **5. 📦 json Module**
**Basic Functions in the json Module**

**1. json.load()** → Load from a file

**2. json.loads()** → Load from a string

**3. json.dump()** → Write to a file

**4. json.dumps()** → Write to a string
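
The four functions differ only in whether they work with a string or with a file-like object. Here is a small sketch of that split, using `io.StringIO` as a stand-in for a real file; the `settings` dictionary is illustrative.

```python
import io
import json

settings = {"job_name": "daily_sales_etl", "retries": 3}

# dumps / loads work with strings held in memory.
as_text = json.dumps(settings)
from_text = json.loads(as_text)

# dump / load work with file-like objects; io.StringIO stands in for an open file.
buffer = io.StringIO()
json.dump(settings, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(as_text == buffer.getvalue())
print(from_text == from_file == settings)
```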

# **5.1 json.load()**
It reads a JSON file and converts its contents into a Python dictionary (or list, depending on structure). This process is called deserialization.

In [None]:
#config.json: Example ETL Job Configuration File

#Let’s say this file is saved in /content/config.json
{
  "job_name": "daily_sales_etl",
  "source": {
    "type": "csv",
    "path": "./data/raw/sales_data.csv"
  },
  "target": {
    "type": "parquet",
    "path": "./data/processed/sales_data_cleaned.parquet",
    "mode": "overwrite"
  },
  "transformations": {
    "drop_columns": ["temp_id", "debug_info"],
    "convert_to_datetime": ["order_date", "ship_date"]
  },
  "log_path": "./logs/daily_sales_etl.log"
}


{'job_name': 'daily_sales_etl',
 'source': {'type': 'csv', 'path': './data/raw/sales_data.csv'},
 'target': {'type': 'parquet',
  'path': './data/processed/sales_data_cleaned.parquet',
  'mode': 'overwrite'},
 'transformations': {'drop_columns': ['temp_id', 'debug_info'],
  'convert_to_datetime': ['order_date', 'ship_date']},
 'log_path': './logs/daily_sales_etl.log'}

In [None]:
#📜 2. Python Code to Read it Using json.load()
import json

# Load config file
with open("config.json", "r") as f:
    config = json.load(f)

# Access config values
print("Job Name:", config["job_name"])
print("Source File:", config["source"]["path"])
print("Drop Columns:", config["transformations"]["drop_columns"])

Job Name: daily_sales_etl
Source File: ./data/raw/sales_data.csv
Drop Columns: ['temp_id', 'debug_info']


In [11]:
import json

try:
    with open("config.json", "r") as f:
        config = json.load(f)
except FileNotFoundError:
    print("❌ File not found.")
except json.JSONDecodeError as e:
    print("❌ Invalid JSON format:", e)


## **5.1.1 Practical tricks to use with json.load()**



### **🎯 1. Validate Required Keys After Loading**
Why: To avoid runtime errors if keys are missing in the config.

In [14]:
REQUIRED_KEYS = ["source", "target", "transformations"]

with open("config.json") as f:
    config = json.load(f)

for key in REQUIRED_KEYS:
    if key not in config:
        raise KeyError(f"Missing required key: {key}")
    else:
        print(f"{key} found in config.")

source found in config.
target found in config.
transformations found in config.


### **🎯 2. Use Default Values with dict.get()**

Why: Not every config has to be fully defined. dict.get() returns a safe fallback value when a key is missing.

In [18]:
# Load config file
with open("config.json", "r") as f:
    config = json.load(f)

# To see the fallback in action, remove "log_path" from config.json before running this cell.
log_path = config.get("log_path", "./logs/default.log")
print(log_path)

# config.get("target", {}) returns the "target" dictionary if it exists, otherwise an empty dict.
# .get("mode", "append") then returns the value of "mode" from target, or "append" as the default.
overwrite_mode = config.get("target", {}).get("mode", "append")

./logs/default.log
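
Chained `.get()` calls become hard to read for deeply nested settings. One possible approach is a small helper that walks a dotted path; `get_nested` is a hypothetical helper written for this sketch, not a function of the json module.

```python
def get_nested(config, dotted_key, default=None):
    """Walk a dotted path such as "target.mode"; return default if any step is missing."""
    current = config
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# Illustrative config with the same shape as config.json, but without "log_path".
config = {"target": {"type": "parquet", "mode": "overwrite"}}

print(get_nested(config, "target.mode", "append"))
print(get_nested(config, "log_path", "./logs/default.log"))
print(get_nested(config, "source.path"))
```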


### **🎯 3. Create a schema validator using pydantic or custom rules**

Why: Large configs benefit from strong validation.

Using pydantic:

In [19]:
from pydantic import BaseModel
from typing import List

class Transformations(BaseModel):
    drop_columns: List[str]
    convert_to_datetime: List[str]

class Source(BaseModel):
    type: str
    path: str

class Target(BaseModel):
    type: str
    path: str
    mode: str = "append"

class Config(BaseModel):
    job_name: str
    source: Source
    target: Target
    transformations: Transformations


In [20]:
with open("config.json") as f:
    raw = json.load(f)

cfg = Config(**raw)
print(cfg.target.path)
# Config(**raw) raises pydantic.ValidationError if a required field is missing or has the wrong type

./data/processed/sales_data_cleaned.parquet
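
The heading also mentions custom rules. When adding a dependency is not an option, a standard-library check can cover the basics. This is only a sketch: `EXPECTED_TYPES`, `ALLOWED_MODES`, and `validate_config` are hypothetical names, and the set of allowed modes is an assumption.

```python
# Top-level keys and the type each one should have (mirrors config.json above).
EXPECTED_TYPES = {
    "job_name": str,
    "source": dict,
    "target": dict,
    "transformations": dict,
}
ALLOWED_MODES = {"append", "overwrite"}  # assumed set of valid write modes


def validate_config(config):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for key, expected_type in EXPECTED_TYPES.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            actual = type(config[key]).__name__
            problems.append(f"{key}: expected {expected_type.__name__}, got {actual}")

    # Cross-field rule: the write mode, if present, must be one we support.
    target = config.get("target")
    if isinstance(target, dict):
        mode = target.get("mode", "append")
        if mode not in ALLOWED_MODES:
            problems.append(f"target.mode: unsupported value {mode!r}")
    return problems


bad = {"job_name": "daily_sales_etl", "source": "sales.csv", "target": {"mode": "replace"}}
for problem in validate_config(bad):
    print(problem)
```

Collecting every problem before failing lets the person fixing the config see all issues in one run instead of one per attempt.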


### **🎯 4. Load and Cache the Config Once (Singleton)**

Why: If many modules or functions use the same config, don't load it multiple times.

In [None]:
import json

_config_cache = None

def get_config(path="config.json"):
    global _config_cache
    if _config_cache is None:
        with open(path) as f:
            _config_cache = json.load(f)
    return _config_cache
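
An alternative to the global variable is `functools.lru_cache`, which caches one result per distinct `path` argument. The sketch below writes a throwaway file so it runs anywhere; `load_config` is a hypothetical name.

```python
import json
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_config(path):
    """Read and parse the file once per distinct path; later calls reuse the result."""
    with open(path) as f:
        return json.load(f)


# Demo with a temporary file standing in for config.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump({"job_name": "daily_sales_etl"}, f)

    first = load_config(path)
    second = load_config(path)  # served from the cache, the file is not read again

print(first is second)
print(load_config.cache_info().hits)
```

Both approaches hand every caller the same dictionary object, so a caller that mutates it changes it for everyone. With `lru_cache`, `load_config.cache_clear()` forces a reload on the next call.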


### **🎯 5. Dynamically Load ETL Parameters by Job Name**

Why: Share a single config file that holds multiple job settings.

In [22]:
#jobs.json
{
  "daily_sales_etl": {
    "source": "./data/sales.csv",
    "target": "./data/sales_cleaned.parquet"
  },
  "inventory_update": {
    "source": "./data/inventory.csv",
    "target": "./data/inventory.parquet"
  }
}

{'daily_sales_etl': {'source': './data/sales.csv',
  'target': './data/sales_cleaned.parquet'},
 'inventory_update': {'source': './data/inventory.csv',
  'target': './data/inventory.parquet'}}

In [24]:
job = "daily_sales_etl"

with open("jobs.json") as f:
    all_jobs = json.load(f)

job_config = all_jobs[job]
job_config

{'source': './data/sales.csv', 'target': './data/sales_cleaned.parquet'}
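
Indexing with `all_jobs[job]` raises a bare `KeyError` when the job name is misspelled. A possible refinement, sketched here with a hypothetical `get_job_config` helper and the jobs.json content inlined as a string, is to list the valid names in the error message.

```python
import json

# Same content as jobs.json above, inlined so the sketch is self-contained.
jobs_text = """
{
  "daily_sales_etl": {
    "source": "./data/sales.csv",
    "target": "./data/sales_cleaned.parquet"
  },
  "inventory_update": {
    "source": "./data/inventory.csv",
    "target": "./data/inventory.parquet"
  }
}
"""
all_jobs = json.loads(jobs_text)


def get_job_config(jobs, job_name):
    """Return the settings for one job, or fail with a message listing valid names."""
    try:
        return jobs[job_name]
    except KeyError:
        available = ", ".join(sorted(jobs))
        raise KeyError(f"Unknown job {job_name!r}. Available jobs: {available}") from None


print(get_job_config(all_jobs, "inventory_update"))

try:
    get_job_config(all_jobs, "weekly_report")
except KeyError as error:
    print(error)
```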

# **5.2 json.loads()**
**json.loads()** is a function that deserializes a JSON-formatted string into a Python object. The name loads stands for "load string". It is commonly used when you have JSON data as a string (often received from an API or read from a file) and need to convert it into Python objects such as dictionaries and lists.


```
json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None,
           parse_constant=None, object_pairs_hook=None, **kw)
```



**When Should You Use json.loads()?**

- Reading JSON from APIs: When you make HTTP requests to a web service, the response is often in JSON format. json.loads() parses that response body into Python objects (like a dictionary or list).

- Parsing JSON files: If you have already read a JSON file into a string, json.loads() converts it into a Python object. If you still have the open file object, json.load() reads it directly.

- Converting configuration data: Many ETL pipelines use JSON configuration files. After reading the configuration as a string, you can use json.loads() to convert it into a Python object for processing.
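
As a rough illustration of the API case, the sketch below parses a hard-coded `response_body` that stands in for the raw bytes of an HTTP response. The payload and its field names are invented for this example.

```python
import json

# Stand-in for the raw body of an HTTP response (bytes, as delivered over the network).
response_body = b'{"status": "ok", "orders": [{"id": 1, "total": 250.0}, {"id": 2, "total": 99.5}]}'

# json.loads() accepts str, bytes, or bytearray.
payload = json.loads(response_body)

total = sum(order["total"] for order in payload["orders"])
print(payload["status"], total)
```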

### **Basic Usage of json.loads()**

In [29]:
import json

# A sample JSON string
json_string = '{"name": "Ahmed", "age": 30, "is_active": true}'

# Convert the JSON string to a Python dictionary
data = json.loads(json_string)

print(data)
# Output: {'name': 'Ahmed', 'age': 30, 'is_active': True}

{'name': 'Ahmed', 'age': 30, 'is_active': True}


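### **Using object_hook to Convert JSON Objects into Custom Python Objects**

In some cases, you may want to convert JSON objects into custom Python objects (instead of regular Python dictionaries). This is where object_hook comes into play. The hook is called with every JSON object that is decoded, including nested ones, and its return value replaces the dictionary.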

In [30]:
import json

# Sample JSON string
json_string = '{"name": "Ahmed", "age": 30}'

# Custom class to represent a Person
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"Person(name={self.name}, age={self.age})"

# Custom object_hook function that converts the dictionary into a Person object
def custom_hook(dct):
    return Person(dct['name'], dct['age'])

# Deserialize JSON using custom_hook
person = json.loads(json_string, object_hook=custom_hook)

# Print the custom object
print(person)  # Output: Person(name=Ahmed, age=30)


Person(name=Ahmed, age=30)
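
Because object_hook runs for every JSON object in the document, a hook that assumes specific keys will raise KeyError on nested or unrelated objects. A more defensive sketch checks the keys first and leaves other dictionaries untouched. The dataclass version of `Person` and the `person_hook` name are choices made for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def person_hook(dct):
    # Called once per JSON object, innermost objects first.
    # Convert only dictionaries that look exactly like a Person.
    if dct.keys() == {"name", "age"}:
        return Person(**dct)
    return dct


json_string = '{"team": "data", "lead": {"name": "Ahmed", "age": 30}}'
result = json.loads(json_string, object_hook=person_hook)

print(result)
```

The outer object has different keys, so it stays a plain dictionary while the nested `lead` object becomes a `Person`.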


## **Using object_pairs_hook to Preserve Key Order**

In [31]:
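# object_pairs_hook receives each JSON object as a list of (key, value) pairs.
# Passing OrderedDict builds an OrderedDict instead of a regular dictionary.
# Note: since Python 3.7 a regular dict also preserves insertion order.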
import json
from collections import OrderedDict

# Sample JSON string
json_string = '{"name": "Ahmed", "age": 30, "is_active": true}'

# Deserialize JSON into an OrderedDict (preserving key order)
ordered_data = json.loads(json_string, object_pairs_hook=OrderedDict)

print(ordered_data)
# Output: OrderedDict([('name', 'Ahmed'), ('age', 30), ('is_active', True)])


OrderedDict([('name', 'Ahmed'), ('age', 30), ('is_active', True)])
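
Since the hook sees the raw list of pairs, it can also detect duplicate keys, which a plain `json.loads()` resolves silently by keeping the last value. The `reject_duplicates` function below is a hypothetical helper sketched for this purpose.

```python
import json


def reject_duplicates(pairs):
    """Build a dict from (key, value) pairs, failing on a repeated key."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key in JSON object: {key!r}")
        result[key] = value
    return result


duplicated = '{"mode": "append", "mode": "overwrite"}'

# Default behaviour: the last value wins and no error is raised.
print(json.loads(duplicated))

try:
    json.loads(duplicated, object_pairs_hook=reject_duplicates)
except ValueError as error:
    print("Rejected:", error)
```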


## **Using parse_int, parse_float, and parse_constant**

In [32]:
import json

# Sample JSON string with a large integer, a float, and the non-standard constant NaN.
# NaN must appear bare (unquoted): a quoted "NaN" is an ordinary string and never reaches parse_constant.
json_string = '{"int_value": 12345678901234567890, "float_value": 3.14, "special_value": NaN}'

# Custom function to parse integers; it receives the number as a string
def parse_int(value):
    return int(value)

# Custom function to parse floats; it receives the number as a string
def parse_float(value):
    return float(value)

# Custom function for the constants 'NaN', 'Infinity', and '-Infinity'
def parse_constant(value):
    if value == "NaN":
        return float('nan')
    elif value == "Infinity":
        return float('inf')
    elif value == "-Infinity":
        return float('-inf')
    return value

# Deserialize JSON with custom parsing
data = json.loads(json_string, parse_int=parse_int, parse_float=parse_float, parse_constant=parse_constant)

print(data)
# Output: {'int_value': 12345678901234567890, 'float_value': 3.14, 'special_value': nan}


{'int_value': 12345678901234567890, 'float_value': 3.14, 'special_value': nan}
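
A common practical use of parse_float is to keep monetary values exact by parsing them as `decimal.Decimal` instead of binary floats. The `line_item` string below is illustrative.

```python
import json
from decimal import Decimal

line_item = '{"sku": "A-100", "unit_price": 0.1, "quantity": 3}'

as_float = json.loads(line_item)
as_decimal = json.loads(line_item, parse_float=Decimal)

# Binary floats accumulate rounding error; Decimal keeps the written digits.
print(as_float["unit_price"] * as_float["quantity"])
print(as_decimal["unit_price"] * as_decimal["quantity"])
```

parse_float receives the number exactly as it appears in the JSON text, so `Decimal` is built from the string "0.1" and no precision is lost.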


# **5.3 json.dump()**

json.dump() does two things in one call:

- Serializes a Python object (such as a dictionary or list)

- Writes the serialized object directly to a file in JSON format


```
json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True,
          allow_nan=True, cls=None, indent=None, separators=None,
          default=None, sort_keys=False, **kw)
```
**📌 Key Parameters**

**obj:** The Python object to serialize.

**fp:** File-like object (must be opened for writing in text mode).

**indent:** Number of spaces to use for indentation (pretty printing).

**sort_keys:** Boolean – if True, dictionary keys are sorted alphabetically.

**ensure_ascii:** If True (default), escapes all non-ASCII characters; set to False to preserve Unicode.
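
To compare the effect of these parameters without creating files, the sketch below dumps into an in-memory buffer. `dump_to_text` is a hypothetical helper and the sample `data` is illustrative.

```python
import io
import json

data = {"name": "Ahmed", "city": "São Paulo"}


def dump_to_text(obj, **options):
    """Run json.dump() against an in-memory buffer and return what was written."""
    buffer = io.StringIO()
    json.dump(obj, buffer, **options)
    return buffer.getvalue()


default_output = dump_to_text(data)
unicode_output = dump_to_text(data, ensure_ascii=False)
pretty_output = dump_to_text(data, ensure_ascii=False, indent=2, sort_keys=True)

print(default_output)   # non-ASCII characters are escaped
print(unicode_output)   # non-ASCII characters are written as-is
print(pretty_output)    # indented, with keys in alphabetical order
```

When writing with `ensure_ascii=False` to a real file, it is safer to open the file with an explicit `encoding="utf-8"` so the result does not depend on the platform default.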



## **Basic usage**

In [33]:
#Writing a Python dictionary to a file:
import json

data = {
    "name": "Ahmed",
    "age": 30,
    "is_active": True,
    "skills": ["Python", "SQL", "ETL"]
}

with open("user_config.json", "w") as f:
    json.dump(data, f)


In [36]:
#Pretty Printing with indent
with open("user_config_pretty.json", "w") as f:
    json.dump(data, f, indent=4)

In [37]:
# Sorting Keys of the json file
with open("user_config_sorted.json", "w") as f:
    json.dump(data, f, indent=2, sort_keys=True)

In [38]:
#Real ETL Config File
etl_config = {
    "job_name": "daily_sales_etl",
    "source": {
        "type": "csv",
        "path": "data/sales_2025_04_30.csv"
    },
    "target": {
        "type": "parquet",
        "path": "output/sales.parquet",
        "mode": "overwrite"
    },
    "transformations": {
        "drop_columns": ["temp_col"],
        "convert_to_datetime": ["order_date"]
    }
}

with open("etl_config.json", "w") as f:
    json.dump(etl_config, f, indent=4, sort_keys=True)
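
ETL metadata often contains values that JSON cannot represent directly, such as dates or `Path` objects, and json.dump() raises TypeError for them. The `default` parameter from the signature above accepts a fallback function for such values. The following is a sketch under assumptions: `to_jsonable` and `run_info` are hypothetical names, and writing dates as ISO strings is one convention among several.

```python
import io
import json
from datetime import date, datetime
from pathlib import Path


def to_jsonable(value):
    """Fallback that json.dump() calls for objects it cannot serialize itself."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Path):
        return value.as_posix()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


run_info = {
    "job_name": "daily_sales_etl",
    "run_date": date(2025, 4, 30),
    "output": Path("output/sales.parquet"),
}

# io.StringIO stands in for a file opened with open(..., "w").
buffer = io.StringIO()
json.dump(run_info, buffer, indent=4, default=to_jsonable)
print(buffer.getvalue())
```

Reading the file back yields plain strings for these fields, so the loading side has to convert them to dates or paths again if it needs them.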
