## Week 8

**1. Stream Directory Data**

In the first part of the exercise, you will create a simple Spark streaming program that reads an input stream from a file source. The file source stream reader reads data from a directory on a file system. When a new file is added to the directory, Spark adds that file’s data to the input data stream.

You can find the input data for this exercise in the baby-names/streaming directory. This directory contains the baby names CSV file randomized and split into 98 individual files. You will use these files to simulate incoming streaming data.

*a. Count the Number of Females*

In this step, you will create a Spark program that monitors an incoming directory. To simulate streaming data, you will copy CSV files from the baby-names/streaming directory into the incoming directory. Because a streaming file source does not infer the schema of CSV data by default, you will need to define a schema before you initialize the streaming DataFrame.
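One way to define the schema up front is to pass a DDL string to `spark.readStream.schema(...)`. The sketch below builds such a string with every column typed as `STRING`, which is also what a CSV read without type inference produces. The helper name `build_string_schema` and the column names in the usage comment are hypothetical; substitute the header names from the baby names file.

```python
def build_string_schema(column_names):
    """Return a DDL schema string that types every column as STRING."""
    if not column_names:
        raise ValueError("at least one column name is required")
    # Backticks keep names with spaces or punctuation valid in DDL.
    return ", ".join(f"`{name}` STRING" for name in column_names)


# Usage (assumes an existing SparkSession named `spark`; the column
# names here are placeholders, not the real header of the data set):
# ddl = build_string_schema(["name", "sex"])
# streaming = spark.readStream.schema(ddl).csv("incoming")
```

Typing everything as `STRING` is enough for a simple count; a numeric aggregate would need the relevant columns declared with numeric types instead.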

From this input data stream, you will create a simple output data stream that counts the number of females and writes it to the console. About every 10 seconds, copy a new file into the incoming directory and report the console output. Do this for the first ten files.
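The manual copying can also be scripted. The following is a minimal sketch, assuming both directories are on a local file system and the split files match `*.csv`; the function name `feed_files` and its parameters are hypothetical, not part of the exercise.

```python
import shutil
import time
from pathlib import Path


def feed_files(source_dir, incoming_dir, limit=10, interval_seconds=10,
               pattern="*.csv"):
    """Copy up to `limit` files into `incoming_dir`, pausing after each."""
    source = Path(source_dir)
    incoming = Path(incoming_dir)
    incoming.mkdir(parents=True, exist_ok=True)
    copied = []
    # Sort so the "first ten files" are chosen deterministically.
    for path in sorted(source.glob(pattern))[:limit]:
        shutil.copy(path, incoming / path.name)
        copied.append(path.name)
        time.sleep(interval_seconds)
    return copied


# Usage (paths are placeholders):
# feed_files("baby-names/streaming", "incoming")
```

Run it in a separate process or thread while the streaming query is active, so the query sees each file as it arrives.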

**2. Micro-Batching**

Repeat the previous step, but set a micro-batch interval so that processing is triggered every 30 seconds. About every 10 seconds, copy a new file into the incoming directory and report the console output. Do this for the first ten files. How does the output differ from the previous example?
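A micro-batch interval is set on the stream writer with `trigger(processingTime=...)`. The sketch below wraps that call in a hypothetical helper, `start_console_counts`, which assumes it receives an aggregated streaming DataFrame such as a count grouped by sex.

```python
def start_console_counts(counts_df, interval="30 seconds"):
    """Start a console query that fires one micro-batch per interval."""
    return (
        counts_df.writeStream
        .outputMode("complete")  # re-emit the full aggregate each batch
        .format("console")
        .trigger(processingTime=interval)
        .start()
    )


# Usage (assumes `counts` is an aggregated streaming DataFrame):
# query = start_console_counts(counts)
# ...copy files into the incoming directory...
# query.stop()
```

With files arriving about every 10 seconds and a 30-second trigger, each batch would be expected to pick up several files at once, so the console should show fewer updates with larger jumps in the counts.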

### Stream directory data

In [3]:
# load libraries
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark import SparkContext
from time import sleep

# create spark context
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

In [4]:
# define static and streaming directory
static_dir = '/FileStore/tables/babystatic/baby_names_csv-70f67.gz'
stream_dir = '/FileStore/tables/babystream'

# before streaming, read the static file to obtain the schema for the stream
spark = SparkSession.builder.appName('strtst').getOrCreate()
static = spark.read.csv(static_dir, header=True)

dataschema = static.schema

static.printSchema()

In [5]:
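# create the streaming DataFrame from the incoming directory
streaming = spark.readStream.schema(dataschema).csv(stream_dir)

# count rows per sex; the female count is one row of this aggregate
counts = streaming.groupBy('sex').count()
counts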

In [6]:
# start the query with an in-memory sink, poll the results table, then stop
streamingquery = (counts.writeStream
                  .queryName('counts')  # name of the in-memory table queried below
                  .format('memory')
                  .outputMode('complete')
                  .start())

for i in range(5):
    spark.sql('SELECT * FROM counts').show()
    sleep(1)

streamingquery.stop()