<img src="uva_seal.png">  

## Data Ingestion

### University of Virginia
### DS 5110: Big Data Systems
### Last Updated: January 17, 2026

---  

### Sources 

Learning Spark, Chapter 9: Spark SQL

### OBJECTIVES
- Define data ingestion
- Explain the Spark data schema and why explicitly providing it is preferable
- Demonstrate ingestion from different file formats
- Understand how to handle malformed records
- Execute data partitioning

### CONCEPTS AND FUNCTIONS
- Data ingestion
- Schema
- Parquet files
- Data partitioning

---  

### 1. Data ingestion

Data ingestion is the process of bringing data from external sources into a system

Common **sources** include local disk, HDFS, S3, or other distributed storage

Common **formats** include CSV, text files, JSON, Parquet, and Avro

 <img src="ingestion.png" width=600> 

---  

### 2. Data Schema in Spark

The schema in Spark defines the data structure.  
For each field, a 3-tuple is specified: `(column name, data type, nullable)`  



**Example of a schema with two fields, *author* and *pages*, neither of which can contain null values**
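```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# The third argument is nullable; False means the field may not contain nulls
schema = StructType([
    StructField("author", StringType(), False),
    StructField("pages", IntegerType(), False),
])
```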

You can let Spark infer the data schema, but it's preferable to feed it the schema:

- Avoids having Spark launch a separate job to read a large fraction of the data to infer schema
- Early detection of errors if the data doesn't match the schema
- Spark inference may be incorrect. For example, it may think all numerical data are strings.
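
The last two points can be illustrated without Spark. The sketch below is a toy inference routine in plain Python, not Spark's actual algorithm; the function names and the sample data are made up for illustration. It shows that inference needs a pass over the data and that a single unexpected value (here `N/A`) is enough to turn a numeric column into a string column.

```python
import csv
import io

def _narrowest_type(values):
    """Return the first type whose parser accepts EVERY value, else 'string'."""
    for type_name, parser in (("int", int), ("double", float)):
        try:
            for value in values:
                parser(value)
            return type_name
        except ValueError:
            continue
    return "string"

def infer_column_types(csv_text):
    """Toy schema inference: reads all rows, then decides one type per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # full pass over the data
    if not rows:
        return {}
    return {name: _narrowest_type([row[name] for row in rows]) for name in rows[0]}

# Hypothetical sample: one bad value in the numeric column
sample = "author,pages\nTolkien,423\nHerbert,N/A\n"
print(infer_column_types(sample))  # {'author': 'string', 'pages': 'string'}
```

With an explicit schema, the `N/A` value would surface as a null or an error (depending on the read mode) instead of silently changing the column type.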

**This schema is different from database schema**

A database *schema* is the structure that represents the logical view of the entire database.  
It defines how data is organized and how relations among them are associated.  
This is implemented through the use of tables, views, and integrity constraints.

**Common Spark Data Types**

- Integer types, all `int` in Python:
  - ShortType
  - IntegerType
  - LongType
- Floating-point types, both `float` in Python:
  - FloatType
  - DoubleType
- StringType
- BooleanType
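
Although the integer types all appear as `int` in Python, they differ in storage width: ShortType is 16-bit, IntegerType is 32-bit, and LongType is 64-bit, all signed. The following plain-Python sketch picks the narrowest of these three for a set of values. The helper name `smallest_integer_type` is hypothetical and is not part of PySpark.

```python
# Inclusive value ranges of the signed integer types listed above, narrowest first
INTEGER_TYPE_RANGES = [
    ("ShortType", -2**15, 2**15 - 1),
    ("IntegerType", -2**31, 2**31 - 1),
    ("LongType", -2**63, 2**63 - 1),
]

def smallest_integer_type(values):
    """Return the name of the narrowest integer type that can hold all values."""
    low, high = min(values), max(values)
    for name, type_min, type_max in INTEGER_TYPE_RANGES:
        if type_min <= low and high <= type_max:
            return name
    raise OverflowError("values do not fit in LongType")

print(smallest_integer_type([0, 100]))          # ShortType
print(smallest_integer_type([0, 40_000]))       # IntegerType
print(smallest_integer_type([3_000_000_000]))   # LongType
```

A check like this is useful when writing a schema by hand: a column declared as IntegerType cannot hold values above 2,147,483,647.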

---

### 3. Reading Files in Spark

`SparkSession.read` returns a `DataFrameReader`, which provides methods such as `csv()`, `json()`, and `parquet()` for loading data in PySpark.

Several examples of batch data ingestion are demonstrated below.  

**NOTES**:

1 | Recall that batch data is finite data.  

The process of ingesting streaming data, which is infinite data, will be discussed later in the course.

2 | These examples use the *Spark DataFrame* object.

See the notebook `spark_sql_and_dataframes.ipynb` to dive deeper into Spark DataFrames.


In [None]:
# initialize Spark Session
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("DataIngestionExample") \
    .getOrCreate()

#### 3.1 Read CSV file

`header=True` treats the first row as column names

`inferSchema=True` prompts Spark to infer the schema; generally not recommended

In [None]:
df_csv = spark.read.csv("../data/amzn_msft_prices.csv", header=True, inferSchema=True)
df_csv.show(5)
df_csv.printSchema()

**Read CSV file, explicitly defining schema**

In [None]:
from pyspark.sql.types import StructType, StructField, TimestampType, StringType, DoubleType, IntegerType

schema = StructType([
    StructField("date", TimestampType(), True),
    StructField("ticker", StringType(), True),
    StructField("close", DoubleType(), True),
    StructField("adjusted_close", DoubleType(), True),
    StructField("volume", IntegerType(), True)
])

df_csv_schema = spark.read.csv("../data/amzn_msft_prices.csv", header=True, schema=schema)
df_csv_schema.show(5)
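
The `schema` argument also accepts a DDL-formatted string, which is more compact than building a `StructType`. The sketch below wraps the same read in a small function; the names `PRICES_DDL` and `read_prices` are hypothetical, and the function assumes it is given an active `SparkSession` such as the `spark` object created earlier.

```python
# Same five columns as the StructType above, written as a DDL string
PRICES_DDL = (
    "date TIMESTAMP, ticker STRING, close DOUBLE, "
    "adjusted_close DOUBLE, volume INT"
)

def read_prices(spark, path):
    """Read the prices CSV with an explicit DDL-string schema (no inference)."""
    return spark.read.csv(path, header=True, schema=PRICES_DDL)
```

In the notebook this would be called as `read_prices(spark, "../data/amzn_msft_prices.csv")`.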

---

#### 3.2 Read semi-structured JSON file

JSON supports nested structures, which PySpark can handle with structs and arrays.

In [None]:
df_json = spark.read.json("../data/people.json")
df_json.show(5)
df_json.printSchema()

---

#### 3.3 Reading Parquet Files

- Parquet **format is columnar**
- Reading and writing Parquet files can be much faster in Spark than using row-based text formats such as CSV or JSON
- It is supported by many other data processing systems
- Stores the schema and other column metadata in the file itself, so Spark does not need to infer the schema
- Especially useful when querying columns for analytics and ML (don't generally need entire rows of data)
- Parquet files also have good compression options


In [None]:
df_parquet = spark.read.parquet("../data/amzn_msft_prices.parquet")
df_parquet.show(5)
df_parquet.printSchema()
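
To see why a columnar layout helps analytics, the plain-Python sketch below stores the same records row by row and column by column. It is only an analogy for the storage layout, not how Parquet is implemented, and the values are made up for illustration.

```python
# Row-oriented layout: each record keeps all of its fields together
rows = [
    {"date": "2024-01-02", "ticker": "AAA", "close": 100.0},
    {"date": "2024-01-03", "ticker": "AAA", "close": 200.0},
    {"date": "2024-01-04", "ticker": "AAA", "close": 300.0},
]

def to_columnar(records):
    """Regroup a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columnar(rows)

# An aggregate over one column touches a single list and ignores the others
mean_close = sum(columns["close"]) / len(columns["close"])
print(mean_close)  # 200.0
```

Values of one column also share a type and are often similar to each other, which is part of why columnar files tend to compress well.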

---

#### 3.4 Handling Malformed Records

Bad rows in your data can cause havoc if not handled properly.

In this example, there is a row with too many columns.

Use option `mode` to control handling:
- PERMISSIVE (default) → set corrupted fields to null
- DROPMALFORMED → drop bad rows
- FAILFAST → fail immediately if a row is bad

In [None]:
# FAILFAST: Spark raises an error when it parses the malformed row
df_csv_bad = spark.read.option("mode", "FAILFAST") \
    .csv("../data/amzn_msft_prices_bad_row.csv", header=True, inferSchema=True)

df_csv_bad.show(5)

In [None]:
df_csv_bad = spark.read.option("mode", "DROPMALFORMED") \
    .csv("../data/amzn_msft_prices_bad_row.csv", header=True, inferSchema=True)

df_csv_bad.show(5)
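
The three modes can be compared side by side with a simplified plain-Python emulation. This is a rough analogy under stated assumptions, not Spark's parser: it treats only a wrong field count as malformed, whereas Spark also flags values that do not match the schema types. The function name and sample data are hypothetical.

```python
import csv
import io

MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}

def read_csv_with_mode(csv_text, mode="PERMISSIVE"):
    """Parse CSV text, handling rows with the wrong field count per mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    records = []
    for line_number, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            if mode == "FAILFAST":
                raise ValueError(f"malformed record on line {line_number}: {fields}")
            if mode == "DROPMALFORMED":
                continue  # skip the bad row entirely
            # PERMISSIVE: pad missing fields with None and cut surplus fields
            fields = (fields + [None] * len(header))[: len(header)]
        records.append(dict(zip(header, fields)))
    return records

# Made-up sample: the second data row has one field too many
sample_csv = "date,ticker,close\n2024-01-02,AAA,1.0\n2024-01-03,BBB,2.0,extra\n"

print(len(read_csv_with_mode(sample_csv, "PERMISSIVE")))     # 2
print(len(read_csv_with_mode(sample_csv, "DROPMALFORMED")))  # 1
```

Calling the function with `"FAILFAST"` on this sample raises a `ValueError`, which mirrors the fail-immediately behavior described above.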

---

### 4. Reading Large Datasets Efficiently

For massive datasets, we want to be efficient when reading and handling data

#### 4.1 Partitioning in Spark

Spark reads data in parallel by splitting files into *partitions*

In [None]:
df_csv = spark.read.option("header", True).csv("../data/amzn_msft_prices.csv")
print(df_csv.rdd.getNumPartitions())

**Repartition** for better parallelism on large datasets

In [None]:
df_repart = df_csv.repartition(6)  # 6 partitions
print(df_repart.rdd.getNumPartitions())
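
Called with only a number, `repartition` shuffles the data so that rows are spread roughly evenly over the requested number of partitions. The plain-Python sketch below models that even spread with a round-robin assignment. It is a toy model under that assumption and ignores the network shuffle that makes repartitioning expensive on a real cluster.

```python
def round_robin_partitions(rows, num_partitions):
    """Deal rows out to partitions one at a time, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions

parts = round_robin_partitions(list(range(10)), 6)
print([len(p) for p in parts])  # [2, 2, 2, 2, 1, 1]
```

Because every row may move to a different partition, repartitioning is worth its cost mainly when the existing partitions are too few or badly unbalanced.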

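**Column pruning**: Select only the columns you need to reduce memory usage; with columnar formats such as Parquet, Spark can also skip reading the unselected columns from disk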

In [None]:
df_select = df_csv.select("date","ticker","adjusted_close")
df_select.show(5)

---

#### 4.2 Partition Discovery

We looked at partitioning data in Spark.

Database tables can also be partitioned to make querying more efficient.  

For example, a dataset may have logical groupings, such as locations or demographic subsets. 

We might split the data by **gender** and **country**, producing smaller tables.  

A query that filters on **country** then only needs to read the matching partitions instead of the whole table, so it runs faster.


**Different Directories**

In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory.  

All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically. 

In [None]:

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...
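
The idea behind partition discovery can be sketched in plain Python: read the `key=value` directory names out of each file path, then keep only the files whose values match a filter. The helper names `partition_values` and `prune` are hypothetical, and unlike Spark this sketch keeps every value as a string rather than inferring its type.

```python
from pathlib import PurePosixPath

def partition_values(file_path):
    """Collect key=value directory names from a file path into a dict."""
    values = {}
    for part in PurePosixPath(file_path).parent.parts:
        key, separator, value = part.partition("=")
        if separator:  # only directories of the form key=value count
            values[key] = value
    return values

def prune(file_paths, **wanted):
    """Keep only the files whose partition values match every filter."""
    return [
        path for path in file_paths
        if all(partition_values(path).get(k) == v for k, v in wanted.items())
    ]

# File paths taken from the directory tree above
paths = [
    "path/to/table/gender=male/country=US/data.parquet",
    "path/to/table/gender=male/country=CN/data.parquet",
    "path/to/table/gender=female/country=US/data.parquet",
    "path/to/table/gender=female/country=CN/data.parquet",
]

print(partition_values(paths[0]))  # {'gender': 'male', 'country': 'US'}
print(len(prune(paths, country="US")))  # 2
```

A filter on `country` here selects two of the four files without opening any of them, which is the saving that a partitioned layout offers.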


**Example of writing a DataFrame to Parquet, partitioned by columns**

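```
from pyspark.sql import functions as F

# Assumes df has a date or timestamp column named end_date
df = df.withColumn('end_month', F.month('end_date'))
df = df.withColumn('end_year', F.year('end_date'))

# Writes one directory per (end_year, end_month) combination
df.write.partitionBy("end_year", "end_month").parquet("/tmp/sample_table")
```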
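
The write above produces nested directories named after the partition column values, in the order the columns are given. The plain-Python sketch below computes the directory a given `end_date` would land in; `partition_dir` is a hypothetical helper for illustration, and Spark chooses the data file names inside each directory itself.

```python
from datetime import date

def partition_dir(base_path, end_date):
    """Directory for a row, mirroring partitionBy("end_year", "end_month")."""
    return f"{base_path}/end_year={end_date.year}/end_month={end_date.month}"

print(partition_dir("/tmp/sample_table", date(2024, 3, 15)))
# /tmp/sample_table/end_year=2024/end_month=3
```

The partition column values live in the directory names, so a later read that filters on `end_year` or `end_month` can skip whole directories.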

---

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) Write and execute code to ingest the JSON file `"../data/people.json"` and store the data in the `../data` folder in Parquet format.  
   If the write completes successfully, you will see a `people.parquet` folder containing a `_SUCCESS` file.

---

### 5. Summary

You should now have some understanding of how to ingest different data formats into Spark.

We also covered some methods for efficiently handling big data.

Next, dive deeper into Spark DataFrames and Spark SQL in `spark_sql_and_dataframes.ipynb`