<img src="uva_seal.png">  

## Data Ingestion

### University of Virginia
### DS 5110: Big Data Systems
### Last Updated: January 17, 2026

---  

### Sources 

Learning Spark, Chapter 9: Spark SQL

### OBJECTIVES
- Define data ingestion
- Explain the Spark data schema and why explicitly providing it is preferable
- Demonstrate ingestion from different file formats
- Understand how to handle malformed records
- Execute data partitioning

### CONCEPTS AND FUNCTIONS
- Data ingestion
- Schema
- Parquet files
- Data partitioning

---  

### 1. Data ingestion

Data ingestion is the process of bringing data from external sources into a system

Common **sources** include local disk, HDFS, S3, or other distributed storage

Common **formats** include CSV, text files, JSON, Parquet, and Avro

 <img src="ingestion.png" width=600> 

---  

### 2. Data Schema in Spark

The schema in Spark defines the data structure.  
For each field, a 3-tuple is specified: `(column name, data type, nullable)`  



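**Example of a schema with two fields, *author* and *pages*, neither of which may contain null values**
```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Each StructField is (column name, data type, nullable)
schema = StructType([
    StructField("author", StringType(), False),
    StructField("pages", IntegerType(), False),
])
```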

You can let Spark infer the data schema, but it's preferable to feed it the schema:

- Avoids having Spark launch a separate job to read a large fraction of the data to infer schema
- Early detection of errors if the data doesn't match the schema
- Spark inference may be incorrect. For example, it may think all numerical data are strings.

**This schema is different from database schema**

A database *schema* is the structure that represents the logical view of the entire database.  
It defines how the data is organized and how the relationships among tables are expressed.  
It is implemented through tables, views, and integrity constraints.

**Common Spark Data Types**

- Integer types, all `int` in Python:
  - ShortType
  - IntegerType
  - LongType
- Floating-point types, both `float` in Python:
  - FloatType
  - DoubleType
- StringType
- BooleanType

---

### 3. Reading Files in Spark

The `read` property of a `SparkSession` (written `spark.read`, with no parentheses) returns a `DataFrameReader`, which supports efficient data loading in PySpark.

Several examples of batch data ingestion are demonstrated below.  

**NOTES**:

1 | Recall that batch data is finite data.  

The process of ingesting streaming data, which is infinite data, will be discussed later in the course.

2 | These examples use the *Spark DataFrame* object.

See the notebook `spark_sql_and_dataframes.ipynb` to dive deeper into Spark DataFrames.


In [1]:
# initialize Spark Session
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("DataIngestionExample") \
    .getOrCreate()



#### 3.1 Read CSV file

`header=True` treats the first row as column names

`inferSchema=True` prompts Spark to infer the schema; generally not recommended

In [2]:
df_csv = spark.read.csv("../data/amzn_msft_prices.csv", header=True, inferSchema=True)
df_csv.show(5)
df_csv.printSchema()

                                                                                

+-------------------+------+----------+--------------+--------+
|               date|ticker|     close|adjusted_close|  volume|
+-------------------+------+----------+--------------+--------+
|2022-02-10 00:00:00|  MSFT|302.380005|    299.572968|45386200|
|2022-02-11 00:00:00|  MSFT|295.040009|    292.301117|39175600|
|2022-02-14 00:00:00|  MSFT|     295.0|    292.261505|36359500|
|2022-02-15 00:00:00|  MSFT|300.470001|    297.680695|27058300|
|2022-02-16 00:00:00|  MSFT|     299.5|    297.333221|29982100|
+-------------------+------+----------+--------------+--------+
only showing top 5 rows

root
 |-- date: timestamp (nullable = true)
 |-- ticker: string (nullable = true)
 |-- close: double (nullable = true)
 |-- adjusted_close: double (nullable = true)
 |-- volume: integer (nullable = true)



**Read CSV file, explicitly defining schema**

In [8]:
from pyspark.sql.types import StructType, StructField, TimestampType, StringType, DoubleType, IntegerType

schema = StructType([
    StructField("date", TimestampType(), True),
    StructField("ticker", StringType(), True),
    StructField("close", DoubleType(), True),
    StructField("adjusted_close", DoubleType(), True),
    StructField("volume", IntegerType(), True)
])

df_csv_schema = spark.read.csv("../data/amzn_msft_prices.csv", header=True, schema=schema)
df_csv_schema.show(5)

+-------------------+------+----------+--------------+--------+
|               date|ticker|     close|adjusted_close|  volume|
+-------------------+------+----------+--------------+--------+
|2022-02-10 00:00:00|  MSFT|302.380005|    299.572968|45386200|
|2022-02-11 00:00:00|  MSFT|295.040009|    292.301117|39175600|
|2022-02-14 00:00:00|  MSFT|     295.0|    292.261505|36359500|
|2022-02-15 00:00:00|  MSFT|300.470001|    297.680695|27058300|
|2022-02-16 00:00:00|  MSFT|     299.5|    297.333221|29982100|
+-------------------+------+----------+--------------+--------+
only showing top 5 rows



---

#### 3.2 Read semi-structured JSON file

JSON supports nested structures, which PySpark can handle with structs and arrays.

In [11]:
df_json = spark.read.json("../data/people.json")
df_json.show(5)
df_json.printSchema()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



---

#### 3.3 Reading Parquet Files

- Parquet **format is columnar**
- Reading and writing Parquet files can be MUCH faster in Spark than working with row-oriented text formats such as CSV or JSON
- It is supported by many other data processing systems
- Stores metadata (schema) about the columns, which can provide efficiency
- Especially useful when querying columns for analytics and ML (don't generally need entire rows of data)
- Parquet files also have good compression options


In [34]:
df_parquet = spark.read.parquet("../data/amzn_msft_prices.parquet")
df_parquet.show(5)
df_parquet.printSchema()

+-------------------+------+----------+--------------+--------+
|               date|ticker|     close|adjusted_close|  volume|
+-------------------+------+----------+--------------+--------+
|2022-02-10 00:00:00|  MSFT|302.380005|    299.572968|45386200|
|2022-02-11 00:00:00|  MSFT|295.040009|    292.301117|39175600|
|2022-02-14 00:00:00|  MSFT|     295.0|    292.261505|36359500|
|2022-02-15 00:00:00|  MSFT|300.470001|    297.680695|27058300|
|2022-02-16 00:00:00|  MSFT|     299.5|    297.333221|29982100|
+-------------------+------+----------+--------------+--------+
only showing top 5 rows

root
 |-- date: timestamp (nullable = true)
 |-- ticker: string (nullable = true)
 |-- close: double (nullable = true)
 |-- adjusted_close: double (nullable = true)
 |-- volume: integer (nullable = true)



---

#### 3.4 Handling Malformed Records

Bad rows in your data can cause havoc if not handled properly.

In this example, there is a row with too many columns.

Use option `mode` to control handling:
- PERMISSIVE (default) → set corrupted fields to null (the raw record can be kept in a `_corrupt_record` string column if the schema includes one)
- DROPMALFORMED → drop bad rows
- FAILFAST → fail immediately if a row is bad

In [35]:
df_csv_bad = spark.read.option("mode", "FAILFAST") \
    .csv("../data/amzn_msft_prices_bad_row.csv", header=True, inferSchema=True)

df_csv_bad.show(5)

26/01/17 22:46:15 ERROR Executor: Exception in task 0.0 in stage 62.0 (TID 59)
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.

Py4JJavaError: An error occurred while calling o214.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 62.0 failed 1 times, most recent failure: Lost task 0.0 in stage 62.0 (TID 59) (udc-aw33-3c1 executor driver): org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.


In [21]:
df_csv_bad = spark.read.option("mode", "DROPMALFORMED") \
    .csv("../data/amzn_msft_prices_bad_row.csv", header=True, inferSchema=True)

In [33]:
df_csv_bad.show(5)
print(f'number of records {df_csv_bad.count()}')

+-------------------+------+----------+--------------+--------+
|               date|ticker|     close|adjusted_close|  volume|
+-------------------+------+----------+--------------+--------+
|2022-02-10 00:00:00|  MSFT|302.380005|    299.572968|45386200|
|2022-02-11 00:00:00|  MSFT|295.040009|    292.301117|39175600|
|2022-02-14 00:00:00|  MSFT|     295.0|    292.261505|36359500|
|2022-02-15 00:00:00|  MSFT|300.470001|    297.680695|27058300|
|2022-02-16 00:00:00|  MSFT|     299.5|    297.333221|29982100|
+-------------------+------+----------+--------------+--------+
only showing top 5 rows

number of records 502


In [36]:
df_csv_bad = spark.read.option("mode", "DROPMALFORMED") \
    .csv("../data/amzn_msft_prices_bad_row.csv", header=True, inferSchema=True)

df_csv_bad.show(5)

+-------------------+------+----------+--------------+--------+
|               date|ticker|     close|adjusted_close|  volume|
+-------------------+------+----------+--------------+--------+
|2022-02-10 00:00:00|  MSFT|302.380005|    299.572968|45386200|
|2022-02-11 00:00:00|  MSFT|295.040009|    292.301117|39175600|
|2022-02-15 00:00:00|  MSFT|300.470001|    297.680695|27058300|
|2022-02-16 00:00:00|  MSFT|     299.5|    297.333221|29982100|
|2022-02-17 00:00:00|  MSFT|290.730011|    288.626678|32461600|
+-------------------+------+----------+--------------+--------+
only showing top 5 rows



---

### 4. Reading Large Datasets Efficiently

For massive datasets, we want to be efficient when reading and handling data

#### 4.1 Partitioning in Spark

Spark reads data in parallel by splitting files into *partitions*

In [37]:
df_csv = spark.read.option("header", True).csv("../data/amzn_msft_prices.csv")
print(df_csv.rdd.getNumPartitions())

1


**Repartition** for better parallelism on large datasets (note that this triggers a full shuffle of the data)

In [40]:
df_repart = df_csv.repartition(6)  # 6 partitions
print(df_repart.rdd.getNumPartitions())

6


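**Column pruning**: Select only the columns you need to reduce memory usage. With a columnar format such as Parquet, the unselected columns are not read from disk at all.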

In [41]:
df_select = df_csv.select("date","ticker","adjusted_close")
df_select.show(5)

+----------+------+--------------+
|      date|ticker|adjusted_close|
+----------+------+--------------+
|2022-02-10|  MSFT|    299.572968|
|2022-02-11|  MSFT|    292.301117|
|2022-02-14|  MSFT|    292.261505|
|2022-02-15|  MSFT|    297.680695|
|2022-02-16|  MSFT|    297.333221|
+----------+------+--------------+
only showing top 5 rows



---

#### 4.2 Partition Discovery

We looked at partitioning data in Spark.

Database tables can also be partitioned to make querying more efficient.  

For example, a dataset may have logical groupings, such as locations or demographic subsets. 

We might split the data by **gender** and **country**, producing smaller tables.  

If an analyst then filters on a country, only the matching partitions need to be read instead of the whole table, so the query runs faster.


**Different Directories**

In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory.  

All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically. 

In [None]:

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...
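
To make the `key=value` encoding concrete, here is a plain-Python sketch that recovers partition values from paths shaped like the tree above. It illustrates the naming convention only and is not Spark's implementation; the helper `parse_partition_values` is a hypothetical name.

```python
def parse_partition_values(relative_path):
    """Return {column: value} for every key=value directory in a path."""
    values = {}
    # The last path component is the data file, so skip it
    for part in relative_path.split("/")[:-1]:
        if "=" in part:
            column, value = part.split("=", 1)
            values[column] = value
    return values


# Paths relative to the table root, matching the directory layout above
paths = [
    "gender=male/country=US/data.parquet",
    "gender=male/country=CN/data.parquet",
    "gender=female/country=US/data.parquet",
    "gender=female/country=CN/data.parquet",
]

partitions = [parse_partition_values(p) for p in paths]
for path, values in zip(paths, partitions):
    print(path, values)
```

When Spark reads the table root, the partition columns (`gender` and `country` here) appear as ordinary columns even though they are not stored inside the data files.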


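**Example of writing a DataFrame to Parquet, partitioned by derived columns**

```
from pyspark.sql import functions as F

# Derive the partition columns from end_date, then write one directory per (year, month)
df = df.withColumn('end_month', F.month('end_date'))
df = df.withColumn('end_year', F.year('end_date'))
df.write.partitionBy("end_year", "end_month").parquet("/tmp/sample_table")
```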

---

### 5. Summary

You should now have some understanding of how to ingest different data formats into Spark.

We also covered some methods for efficiently handling big data.

Next, dive deeper into Spark DataFrames and Spark SQL in `spark_sql_and_dataframes.ipynb`.