This notebook is designed to run in an IBM Watson Studio default runtime, NOT the Watson Studio Apache Spark runtime, because the default runtime with 1 vCPU is free of charge. Therefore, we install Apache Spark in local mode, for test purposes only. Please don't use it in production.

In case you are facing issues, please read the following two documents first:

https://github.com/IBM/skillsnetwork/wiki/Environment-Setup

https://github.com/IBM/skillsnetwork/wiki/FAQ

Then, please feel free to ask in the discussion forum of the course.

Please make sure to follow the guidelines before asking a question:

https://github.com/IBM/skillsnetwork/wiki/FAQ#im-feeling-lost-and-confused-please-help-me

This notebook should also work outside Watson Studio. If you are running in an existing Apache Spark context outside Watson Studio, please remove the Apache Spark setup in the first notebook cells.

In [None]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown('# <span style="color:red">'+string+'</span>'))


if ('sc' in locals() or 'sc' in globals()):
    printmd('<<<<<!!!!! It seems that you are running in an IBM Watson Studio Apache Spark Notebook. Please run it in an IBM Watson Studio Default Runtime (without Apache Spark) !!!!!>>>>>')


Let's install Spark.

In [None]:
!pip install pyspark==3.1.1

In [None]:
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
except ImportError as e:
    printmd('<<<<<!!!!! Please restart your kernel after installing Apache Spark !!!!!>>>>>')

Let's create a local Spark context (sc) and session (spark).

In [None]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

Let's pull the data in raw format from the source (GitHub).

In [None]:
!rm -Rf HMP_Dataset
!git clone https://github.com/wchill/HMP_Dataset

As you can see, the data set contains data in raw text format, with one folder per category.

In [None]:
!ls HMP_Dataset

In [None]:
!ls HMP_Dataset/Brush_teeth

In [None]:
!head ./HMP_Dataset/Brush_teeth/Accelerometer-2011-04-11-13-28-18-brush_teeth-f1.txt

As we can see, each file contains three columns of integer accelerometer readings as a time series. Let's create the appropriate schema.

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("x", IntegerType(), True),
    StructField("y", IntegerType(), True),
    StructField("z", IntegerType(), True)])
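
To make the schema more concrete, here is a minimal plain-Python sketch of what it expresses for a single line of a data file: three space-separated integers named x, y and z. The names `Reading` and `parse_reading` and the sample values are hypothetical and only serve as illustration; they are not part of the notebook or of Spark. Spark's CSV reader has its own handling for malformed lines, which this helper does not try to replicate.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    """One accelerometer sample, mirroring the x, y, z schema fields."""
    x: int
    y: int
    z: int


def parse_reading(line: str) -> Reading:
    """Split one space-delimited text line into three integer readings."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 values, got {len(parts)}: {line!r}")
    x, y, z = (int(p) for p in parts)
    return Reading(x, y, z)


# Hypothetical sample line; real values come from the data files.
sample = parse_reading("22 49 35")
print(sample)
```

In the notebook itself this conversion is done by Spark when the schema is passed to the CSV reader, so no such helper is needed there.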

This step takes a while. It walks through all folders and files and creates a temporary data-frame for each file, which gets appended to an overall data-frame "df". In addition, a column called "class" is added so the data can be used directly in Spark afterwards, for example in a supervised machine learning scenario.

In [None]:
import os
import fnmatch

d = 'HMP_Dataset/'

# keep only the folders containing data (directories whose names don't start with .)
file_list_filtered = [s for s in os.listdir(d) if os.path.isdir(os.path.join(d, s)) and not fnmatch.fnmatch(s, '.*')]

from pyspark.sql.functions import lit

# create a Spark data frame holding all the data

df = None

for category in file_list_filtered:
    data_files = os.listdir('HMP_Dataset/'+category)
    
    # create a temporary Spark data frame for each data file
    for data_file in data_files:
        print(data_file)
        temp_df = spark.read.option("header", "false").option("delimiter", " ").csv('HMP_Dataset/'+category+'/'+data_file,schema=schema)
        
        #create a column called "source" storing the current CSV file
        temp_df = temp_df.withColumn("source", lit(data_file))
        
        #create a column called "class" storing the current data folder
        temp_df = temp_df.withColumn("class", lit(category))
        
        # append the rows of this file to the overall data frame
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)
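
The folder-walking part of the loop above can be tried out without Spark. The following is a minimal sketch, assuming a directory layout like the one in the data set (one folder per category, hidden folders such as `.git` to be skipped). The helper name `list_category_files` and the folder and file names in the demo are hypothetical, not part of the notebook.

```python
import os
import tempfile


def list_category_files(root):
    """Return sorted (category, file_name) pairs for each non-hidden folder in root."""
    pairs = []
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        # Skip plain files and hidden folders such as .git
        if not os.path.isdir(folder) or category.startswith('.'):
            continue
        for file_name in sorted(os.listdir(folder)):
            pairs.append((category, file_name))
    return pairs


# Build a small throwaway directory tree (hypothetical names) to try it out.
with tempfile.TemporaryDirectory() as root:
    for folder, file_name in [('Brush_teeth', 'a.txt'), ('Walk', 'b.txt'), ('.git', 'config')]:
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        with open(os.path.join(root, folder, file_name), 'w') as f:
            f.write('1 2 3\n')
    # A plain file at the top level, which should be ignored
    with open(os.path.join(root, 'README.txt'), 'w') as f:
        f.write('readme\n')
    demo_pairs = list_category_files(root)

print(demo_pairs)
```

Each pair corresponds to one iteration of the inner loop: the first element becomes the "class" value and the second the "source" value of the rows read from that file.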
        


Let's write the data-frame to a file in "parquet" format. This will also take quite some time:

In [None]:
df.write.parquet('hmp.parquet')


Now we should have a file with our contents.

# Exercise 
Please use the data-frame "df" below to answer the following questions about the data-frame
(you can use SQL, the data-frame API, or a combination of both).

Please use the PySpark API documentation for reference: https://spark.apache.org/docs/latest/api/python/reference/index.html



1. How many total rows does the data-frame have? (Hint: If you don’t use SQL, there is a single function you can call on the “df” object which returns the solution)

2. How many rows in class "Brush_teeth"? (Hint: You need to filter first for class="Brush_teeth" before you apply the same function as in question one)

3. Which two additional columns besides x, y and z does the data-frame have? (Hint: You can either look at the ETL code from the previous cells or use a field of the “df” object which you can find when looking at the API reference)
    

In [None]:
df.createOrReplaceTempView('df')

In [None]:
df.# your code here

In [None]:
spark.sql('''

select # your code here # from df where # your code here #

''').show()