<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Reading Data with Spark - Text Files

**Technical Accomplishments:**
- Reading data from a simple text file

## Getting Started

Let's start by importing the libraries and defining a few useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

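# S3 client and the base URI of the training data
s3 = boto3.client('s3')
baseUri = "s3a://quantia-master/training/"

# Must be set before the SparkSession is created so the S3 connector packages are loaded
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'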

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

## Reading from Text File

We can read just about any file in which each record is delimited only by a newline, just as we saw with the CSV and JSON (or rather JSON Lines) formats.

To accomplish this, we can use `DataFrameReader.text(..)`, which returns a `DataFrame` with a single column named **value** of type **string**.

The difference is that we now need to take responsibility for parsing out the data in each "column" ourselves.

Common use cases include fixed-width files and Apache HTTP access logs. The first requires a sequence of substring operations; for the second, a sequence of regular expressions is a better way to extract the value of each column. In either case, additional transformations are required, which we will go into later.
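To make the access-log case concrete, here is a minimal plain-Python sketch of the per-line parsing logic, independent of Spark. The pattern assumes the Common Log Format, and the sample line, the `LOG_PATTERN` name, and the `parse_access_log_line` helper are illustrative and not part of this course's code. In Spark, a similar pattern could be applied to the **value** column with `pyspark.sql.functions.regexp_extract`, one call per extracted column.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_access_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Illustrative sample line (not taken from the training data)
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_access_log_line(sample)
print(record["host"], record["status"], record["request"])
```

Lines that do not match return `None`, so malformed records can be filtered out or inspected separately instead of failing the whole load.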

For this example, we are going to create a `DataFrame` from the full text of the book *The Adventures of Tom Sawyer* by Mark Twain.

In [None]:
qcutils.list_s3_bucket_objects(limit=10)

In [None]:
qcutils.print_s3_bucket_object(key='training/tom.txt')

In [None]:
textFile = baseUri + "tom.txt"

textDF = spark.read.text(textFile)

textDF.printSchema()

And with the `DataFrame` created, we can view the data, one record for each line in the text file.

In [None]:
textDF.show(truncate=False)

As simple as this example is, it is also the basis for loading more complex text files, such as fixed-width files.

We will see later exactly how to do this, but for each line that is read in, it is simply a matter of a couple more transformations (such as taking substrings) to convert the line into something more meaningful.
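As a rough preview of the substring idea, the following plain-Python sketch slices one fixed-width line into named fields. The `LAYOUT` table, the field names, and the sample record are hypothetical and exist only for illustration. In Spark, the same slicing would be expressed on the **value** column, for example with `pyspark.sql.functions.substring`, whose position argument is 1-based, unlike the 0-based offsets used here.

```python
# Hypothetical layout: (field name, start offset, end offset),
# 0-based offsets with an exclusive end, as in Python slicing
LAYOUT = [
    ("id", 0, 5),
    ("name", 5, 25),
    ("amount", 25, 35),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of trimmed string fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Build an illustrative 35-character record: 5 + 20 + 10 columns
sample = "00042" + "Tom Sawyer".ljust(20) + "1876.00".rjust(10)
row = parse_fixed_width(sample)
print(row)
```

Every field comes back as a string, mirroring the single string column that `text(..)` produces; casting to numeric or date types would be a further transformation.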

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.