### 🔥 Data Handling – JSON | CSV | Parquet (Hive & PySpark)

---

### 🧱 File Types Used in Big Data Pipelines
- JSON (simple, nested)
- CSV (delimited)
- Parquet (columnar)
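
To make the "columnar" label concrete, here is a minimal plain-Python sketch that lays the same records out row by row (as JSON and CSV do) and column by column (the idea behind Parquet). It is not how Parquet is actually encoded on disk, and the second record and all variable names are assumptions made for illustration.

```python
# Illustrative records; the second one is made up for this sketch.
rows = [
    {"id": 1, "name": "Sathya", "age": 30},
    {"id": 2, "name": "Arun", "age": 28},
]

# Row-oriented layout (JSON lines, CSV): the values of one record sit together.
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented layout (the idea behind Parquet): the values of one column sit
# together, so a query that needs only "age" can skip the other columns.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}

print(row_layout)
print(column_layout["age"])
```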

---

### =======================
### 🟢 JSON HANDLING
### =======================

### JSON Examples

### Simple JSON
```json
{"id":1,"name":"Sathya","age":30}
```
### Nested JSON
```json
{
  "emp": {
    "id": 101,
    "name": "Arun",
    "skills": ["Hive","Spark"],
    "address": {
      "city": "Chennai",
      "state": "TN"
    }
  }
}
```
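
As a quick way to see how the nested structure is navigated, the following sketch parses the nested document above with Python's standard `json` module. It uses no Hive or Spark, and the variable names are illustrative.

```python
import json

# The nested document from the example above, as a string.
doc = '''
{
  "emp": {
    "id": 101,
    "name": "Arun",
    "skills": ["Hive", "Spark"],
    "address": {"city": "Chennai", "state": "TN"}
  }
}
'''

record = json.loads(doc)

# Objects become dicts and arrays become lists,
# so nested fields are reached by chaining keys and indexes.
emp = record["emp"]
city = emp["address"]["city"]
first_skill = emp["skills"][0]

print(emp["id"], city, first_skill)  # 101 Chennai Hive
```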
### JSON in Hive
#### Create Table (Schema-on-Read)
```sql
CREATE TABLE raw_json (
  emp STRUCT<
    id:INT,
    name:STRING,
    skills:ARRAY<STRING>,
    address:STRUCT<city:STRING,state:STRING>
  >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
```

#### Extract Nested Fields
```sql
SELECT
  emp.id,
  emp.name,
  emp.address.city,
  emp.address.state
FROM raw_json;
```
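#### get_json_object (String JSON)
```sql
-- Use get_json_object when the JSON is stored as a plain STRING column.
-- raw_json_str and json_col are placeholder names for such a table and column.
SELECT
  get_json_object(json_col,'$.emp.id')   AS emp_id,
  get_json_object(json_col,'$.emp.name') AS emp_name
FROM raw_json_str;
```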
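
The path lookup that `get_json_object` performs can be approximated in plain Python. The helper below, `get_by_path`, is a hypothetical name used only for this sketch. It handles only dotted object keys such as `$.emp.id`, not array indexes or wildcards, and it returns the parsed Python value, whereas Hive returns the value as a string.

```python
import json

def get_by_path(json_text, path):
    """Resolve a simple '$.a.b' path against a JSON string.

    Illustration only: supports dotted object keys and returns None
    when the JSON is invalid or any key along the path is missing.
    """
    try:
        node = json.loads(json_text)
    except (TypeError, ValueError):
        return None  # invalid JSON yields None, similar to Hive returning NULL
    if not path.startswith("$"):
        return None
    for key in path[1:].split("."):
        if key == "":
            continue  # skip the empty piece before the first dot
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

json_col = '{"emp": {"id": 101, "name": "Arun"}}'
print(get_by_path(json_col, "$.emp.id"))    # 101
print(get_by_path(json_col, "$.emp.name"))  # Arun
print(get_by_path(json_col, "$.emp.dept"))  # None
```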

#### Flatten Array – EXPLODE
```sql
SELECT
  emp.id,
  skill
FROM raw_json
LATERAL VIEW EXPLODE(emp.skills) t AS skill;
```
#### Flatten Array – POSEXPLODE
```sql
SELECT
  emp.id,
  pos,
  skill
FROM raw_json
LATERAL VIEW POSEXPLODE(emp.skills) t AS pos, skill;
```
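
What `EXPLODE` and `POSEXPLODE` produce can be sketched in plain Python: one output row per array element, with `POSEXPLODE` adding the zero-based position. The function names mirror the Hive functions but are ordinary Python written for this illustration, and the second input row is a made-up case showing that a plain `LATERAL VIEW` drops rows whose array is empty.

```python
rows = [
    {"emp": {"id": 101, "skills": ["Hive", "Spark"]}},
    {"emp": {"id": 102, "skills": []}},  # hypothetical row with an empty array
]

def explode(rows):
    # One output row per array element; an empty array contributes no rows.
    return [(r["emp"]["id"], skill) for r in rows for skill in r["emp"]["skills"]]

def posexplode(rows):
    # Same as explode, plus the zero-based position of each element.
    return [
        (r["emp"]["id"], pos, skill)
        for r in rows
        for pos, skill in enumerate(r["emp"]["skills"])
    ]

print(explode(rows))     # [(101, 'Hive'), (101, 'Spark')]
print(posexplode(rows))  # [(101, 0, 'Hive'), (101, 1, 'Spark')]
```

In Hive, `LATERAL VIEW OUTER` keeps such rows and fills the exploded columns with NULL.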
### JSON in PySpark
```python
df = spark.read.option("multiLine","true").json("/path/json")
df.printSchema()
df.show(truncate=False)
```
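
The `multiLine` option matters because `spark.read.json` by default expects one complete JSON record per line. The plain-Python sketch below, which uses no Spark, shows why a pretty-printed record cannot be parsed line by line. The helper name `parse_line_by_line` and the second sample record are assumptions made for illustration.

```python
import json

# One record per line (JSON Lines), the layout Spark expects by default.
json_lines = '{"id": 1, "name": "Sathya", "age": 30}\n{"id": 2, "name": "Arun", "age": 28}\n'

# A single pretty-printed record that spans several lines.
multi_line = '{\n  "id": 1,\n  "name": "Sathya",\n  "age": 30\n}\n'

def parse_line_by_line(text):
    """Parse each non-empty line on its own; use None for lines that are not valid JSON."""
    parsed = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            parsed.append(None)
    return parsed

print(parse_line_by_line(json_lines))  # two complete records
print(parse_line_by_line(multi_line))  # every line fails on its own
print(json.loads(multi_line))          # parsing the whole text at once works
```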
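#### Select Nested Fields
```python
from pyspark.sql.functions import col

# Use col("emp.name") rather than df.emp.name: on a Column, `.name` resolves to
# the Column.name method (an alias of alias()), not to the nested field.
df.select(
  col("emp.id").alias("emp_id"),
  col("emp.name").alias("emp_name"),
  col("emp.address.city").alias("city")
).show()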
```
#### Flatten Array
```python
from pyspark.sql.functions import col, explode

# One output row per element of emp.skills.
df.select(col("emp.id").alias("emp_id"), explode(col("emp.skills")).alias("skill")).show()
```

In [0]:
dbutils.fs.head("dbfs:/Volumes/dev/club_db/data/json/nested_json.json")

In [0]:
# Read JSON from a Unity Catalog volume and save it as a managed table
json_path = "/Volumes/dev/club_db/data/json/simple_json.json"
df = spark.read.json(json_path)
df.write.mode("overwrite").saveAsTable("dev.club_db.bronze_simple_json")

In [0]:
%sql
select * from dev.club_db.bronze_simple_json


In [0]:
%sql
CREATE OR REPLACE TEMP VIEW raw_employees AS
SELECT *
FROM read_files(
  "/Volumes/dev/club_db/data/json/nested_json.json",
  format => "json",
  multiline => true
);

In [0]:
%sql
select * from raw_employees

In [0]:
%sql
SELECT
  emp.address.city  AS city,
  emp.address.state AS state,
  skill
FROM raw_employees
LATERAL VIEW EXPLODE(employees) e AS emp
LATERAL VIEW EXPLODE(emp.skills) s AS skill;
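
The two chained `LATERAL VIEW EXPLODE` clauses in the query above behave like nested loops: the first yields one row per employee, the second one row per skill of that employee. A minimal plain-Python sketch follows. It assumes a record with a top-level `employees` array, which is what the query expects, and the sample values (including the second employee) are made up.

```python
# Hypothetical record shaped the way the query expects: a top-level "employees" array.
record = {
    "employees": [
        {"skills": ["Hive", "Spark"], "address": {"city": "Chennai", "state": "TN"}},
        {"skills": ["Python"], "address": {"city": "Bengaluru", "state": "KA"}},
    ]
}

# Outer loop = EXPLODE(employees) AS emp; inner loop = EXPLODE(emp.skills) AS skill.
flat = [
    (emp["address"]["city"], emp["address"]["state"], skill)
    for emp in record["employees"]
    for skill in emp["skills"]
]

for row in flat:
    print(row)
```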


In [0]:
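%sql
-- Assumes the file has a top-level emp struct, as in the Nested JSON example.
-- If it holds an employees array instead, explode that array first.
SELECT
  emp.address.city  AS emp_city,
  emp.address.state AS emp_state
FROM raw_employees;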

In [0]:
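from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

json_path1 = "/Volumes/dev/club_db/data/json/nested_json.json"
# Explicit schema; assumes the file matches the "Nested JSON" example (top-level emp struct).
address_schema = StructType([
    StructField('city', StringType(), True),
    StructField('state', StringType(), True)
])
emp_schema = StructType([
    StructField('id', LongType(), True),
    StructField('name', StringType(), True),
    StructField('skills', ArrayType(StringType()), True),
    StructField('address', address_schema, True)
])
schema = StructType([StructField('emp', emp_schema, True)])
df1 = (spark.read.schema(schema)
       .option('multiLine', 'true')   # the record spans several lines
       .option('mode', 'PERMISSIVE')  # malformed records become nulls instead of failing the read
       .json(json_path1))
df1.write.mode("overwrite").saveAsTable("dev.club_db.bronze_nested_json")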

In [0]:
%sql
SELECT * FROM dev.club_db.bronze_nested_json
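
The `PERMISSIVE` mode used in the schema-based read above does not fail on bad input. Fields missing from a record come back as null, and a record that cannot be parsed becomes a row of nulls. The sketch below is a rough plain-Python emulation of that behaviour, not Spark's implementation. The helper `read_permissive`, the flat `FIELDS` tuple, and the second and third sample records are assumptions made for illustration.

```python
import json

FIELDS = ("id", "name", "age")  # flat columns from the Simple JSON example

def read_permissive(lines):
    """Rough emulation of PERMISSIVE parsing: never raise, use None for anything unreadable."""
    rows = []
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            obj = None
        if not isinstance(obj, dict):
            rows.append({f: None for f in FIELDS})        # malformed record -> all-null row
        else:
            rows.append({f: obj.get(f) for f in FIELDS})  # missing field -> None
    return rows

lines = [
    '{"id":1,"name":"Sathya","age":30}',
    '{"id":2,"name":"Arun"}',  # hypothetical record with a missing field
    '{"id":3,"name":',         # hypothetical truncated record
]
for row in read_permissive(lines):
    print(row)
```

The same effect explains why a schema that does not match the file's structure produces null columns rather than an error, so an all-null table is worth checking against the schema first.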