<h1 align="center">PySpark4Climate I/O primer</h1>

This notebook introduces some of the functionality of the PySpark4Climate ```read``` module and shows how to use it.

In [1]:
# __future__ imports belong before any other statement
from __future__ import print_function
# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')
import os
import sys
import numpy as np
from pyspark import SparkConf
from pyspark.sql import SparkSession
import read

In [2]:
spark = SparkSession.builder.appName("spark-read-test2").getOrCreate()
sc = spark.sparkContext

Let's make the **pyspark4climate read module** available to all executors by calling ```sc.addPyFile()```.

In [3]:
sc.addPyFile("/glade/p/work/abanihi/pyspark4climate/read.py")

In [4]:
help(read)

Help on module read:

NAME
    read

FILE
    /glade/p/work/abanihi/pyspark4climate/read.py

DESCRIPTION
    This module ingests netCDF file formats into Spark as:
        - a resilient distributed dataset(RDD)
        - a distributed dataframe
    
    Attributes:
        PARTITIONS (int): default number of partitions to be used by Spark.
    
    TODO:
        * Support multiple files reading
        * Convert time_indices from numbers to dates

CLASSES
    __builtin__.object
        dataset
    
    class dataset(__builtin__.object)
     |  Defines and initializes netCDF file attributes needed by Spark.
     |  Attributes:
     |      filepath                 (str)   :  path for the file to be read
     |      variable_name            (str)   :  variable name
     |      dims                     (tuple) :  dimensions (excluding time dimension) of the variable of interest
     |      ndims                    (int)   :  size of dims tuple
     |      partitions               (int)   :

In [5]:
# Print some information about Spark's configuration
print(SparkConf().toDebugString())

spark.Kryoserializer.buffer.max.mb=4096
spark.app.name=spark-read-test2
spark.driver.maxResultSize=10g
spark.driver.memory=20g
spark.executor.memory=15g
spark.master=spark://r1i6n24.ib0.cheyenne.ucar.edu:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.speculation=True
spark.submit.deployMode=client


# Dataset
For this tutorial we will be using the following dataset.
- ```/glade/p/CMIP/CMIP5/output1/NCAR/CCSM4/historical/3hr/atmos/3hr/r6i1p1/files/tas_20120514/tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100.nc```

In [6]:
!ncdump -h /glade/p/CMIP/CMIP5/output1/NCAR/CCSM4/historical/3hr/atmos/3hr/r6i1p1/files/tas_20120514/tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100.nc

netcdf tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100 {
dimensions:
	time = UNLIMITED ; // (2920 currently)
	lat = 192 ;
	lon = 288 ;
	bnds = 2 ;
variables:
	double time(time) ;
		time:units = "days since 1850-01-01 00:00:00" ;
		time:calendar = "noleap" ;
		time:axis = "T" ;
		time:long_name = "time" ;
		time:standard_name = "time" ;
	double lat(lat) ;
		lat:bounds = "lat_bnds" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
		lat:long_name = "latitude" ;
		lat:standard_name = "latitude" ;
	double lat_bnds(lat, bnds) ;
	double lon(lon) ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
		lon:long_name = "longitude" ;
		lon:standard_name = "longitude" ;
	double lon_bnds(lon, bnds) ;
	double height ;
		height:units = "m" ;
		height:axis = "Z" ;
		height:positive = "up" ;
		height:long_name = "height" ;
		height:standard_name = "height" ;
	float tas(time, lat, lon) ;
		tas:standard_name = "air_temperature" ;
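
The header above stores time as days since 1850-01-01 on a `noleap` calendar, and the module's TODO list mentions converting time indices to dates. As a rough sketch of what that conversion could look like for this one file, the snippet below maps a zero-based time index to a timestamp. It assumes (based only on the file name, not on the time values themselves) that the first record is at 2005-01-01 00:00 and that records are exactly 3 hours apart. The helper `time_index_to_datetime` is hypothetical and not part of `read.py`. Because 2005 has no leap day, plain `datetime` arithmetic agrees with the `noleap` calendar within this year.

```python
from datetime import datetime, timedelta

# Assumptions taken from the file name only: first record at 2005-01-01 00:00,
# regular 3-hourly spacing, 2920 records in total.
START = datetime(2005, 1, 1, 0, 0)
STEP = timedelta(hours=3)


def time_index_to_datetime(index):
    """Map a zero-based time index to a timestamp (hypothetical helper)."""
    return START + index * STEP


first = time_index_to_datetime(0)
last = time_index_to_datetime(2919)
print("first record: {}, last record: {}".format(first, last))
```

Under these assumptions the last index lands on 2005-12-31 21:00, which matches the end stamp in the file name.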


In [7]:
!du -lh /glade/p/CMIP/CMIP5/output1/NCAR/CCSM4/historical/3hr/atmos/3hr/r6i1p1/files/tas_20120514/tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100.nc

616M	/glade/p/CMIP/CMIP5/output1/NCAR/CCSM4/historical/3hr/atmos/3hr/r6i1p1/files/tas_20120514/tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100.nc


In [8]:
filepath = '/glade/p/CMIP/CMIP5/output1/NCAR/CCSM4/historical/3hr/atmos/3hr/r6i1p1/files/tas_20120514/tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100.nc'

# Step 1: Initialize the ```dataset``` class from the ```read``` module

To initialize this class, we pass a single argument: a tuple of the form ```(filepath, variable)```. Here we are interested in the ```tas``` variable.

In [9]:
dset = read.dataset((filepath, 'tas'))

In [10]:
print(dset.dims)
print(dset.filepath)
print(dset.variable_name)
print(dset.partitions)
print(dset.other_dims_values_tuple[:5])

(u'lat', u'lon')
/glade/p/CMIP/CMIP5/output1/NCAR/CCSM4/historical/3hr/atmos/3hr/r6i1p1/files/tas_20120514/tas_3hr_CCSM4_historical_r6i1p1_200501010000-200512312100.nc
tas
365
[(-90.0, 0.0), (-90.0, 1.25), (-90.0, 2.5), (-90.0, 3.75), (-90.0, 5.0)]
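
The attributes printed above can be reproduced in miniature. According to the `read.py` listing later in this notebook, `other_dims_values_tuple` is the Cartesian product of the non-time coordinate values, and the default partition count is the number of time steps divided by 8. The sketch below uses short, made-up coordinate lists (the real file has 192 latitudes and 288 longitudes), so only the shape of the result is meaningful.

```python
import itertools

# Made-up, shortened coordinate axes for illustration only.
lat_values = [-90.0, -89.0]
lon_values = [0.0, 1.25, 2.5]

# One (lat, lon) tuple per grid point, with the last dimension varying fastest.
grid = list(itertools.product(lat_values, lon_values))

# Default partition count: one partition per 8 time steps.
time_steps = 2920
default_partitions = time_steps // 8

print("first grid points: {}".format(grid[:3]))
print("default partitions: {}".format(default_partitions))
```

The ordering matters: because the last dimension varies fastest, the tuples line up with the values of a row-major array flattened with `ravel()`.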


# Step 2: Use Spark to broadcast the following dataset attributes to all the workers

In [11]:
other_dims_values_tuple = sc.broadcast(dset.other_dims_values_tuple) 
variable_name = sc.broadcast(dset.variable_name)
dims = sc.broadcast(dset.dims)
ndims = sc.broadcast(dset.ndims)

# Step 3: Create an RDD using ```read.create_rdd()```

In [12]:
tas_rdd = read.create_rdd(sc, (filepath, 'tas'), mode='single')

In [13]:
tas_rdd.count()

2920
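
The count equals the length of the time dimension because, in single-file mode, each RDD element is one time step. As a rough illustration of the chunking described in `read_nc_single_chunked` (listed later in this notebook), the sketch below computes the start and end indices each task would read. The helper `chunk_bounds` is hypothetical, and no Spark or netCDF calls are involved.

```python
rows = 2920        # length of the time dimension
partitions = 365   # default reported by dset.partitions above
step = rows // partitions


def chunk_bounds(rows, step):
    """Return (start, end) index pairs, one per chunk (hypothetical helper)."""
    return [(start, min(start + step, rows)) for start in range(0, rows, step)]


bounds = chunk_bounds(rows, step)
total = sum(end - start for start, end in bounds)
print("{} chunks of up to {} time steps, {} records in total".format(
    len(bounds), step, total))
```

Each chunk covers 8 consecutive time steps, and the chunks together cover all 2920 records exactly once.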

In [14]:
tas_rdd.take(10)

[(array([[ 252.29956055,  252.49697876,  252.51231384, ...,  252.30375671,
           252.2875061 ,  252.70848083],
         [ 254.22868347,  253.75341797,  253.87719727, ...,  254.39428711,
           254.04812622,  253.65689087],
         [ 255.30435181,  255.51652527,  255.38240051, ...,  255.73982239,
           255.53344727,  255.60620117],
         ..., 
         [ 242.31686401,  242.32778931,  242.32933044, ...,  242.25997925,
           242.2774353 ,  242.29290771],
         [ 242.40519714,  242.40791321,  242.41052246, ...,  242.39332581,
           242.40063477,  242.40325928],
         [ 243.58824158,  243.58340454,  243.57896423, ...,  243.60612488,
           243.59954834,  243.59364319]], dtype=float32), 0),
 (array([[ 252.1300354 ,  252.33421326,  252.34989929, ...,  252.13648987,
           252.11824036,  252.54826355],
         [ 254.00390625,  253.53767395,  253.66960144, ...,  254.15505981,
           253.8092041 ,  253.41487122],
         [ 254.95761108,  255.187942

![](https://i.imgur.com/51l0O4r.jpg)

# Step 4: Create a DataFrame using ```read.dataframe()```

In [18]:
# %load /glade/p/work/abanihi/pyspark4climate/read.py
"""
This module ingests netCDF file formats into Spark as:
    - a resilient distributed dataset(RDD)
    - a distributed dataframe

Attributes:
    PARTITIONS (int): default number of partitions to be used by Spark.

TODO:
    * Support multiple files reading
    * Convert time_indices from numbers to dates
"""

from __future__ import print_function
from netCDF4 import Dataset
from netCDF4 import MFDataset
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import Row
import itertools
import os

global PARTITIONS


class dataset(object):
    """Defines and initializes netCDF file attributes needed by Spark.
    Attributes:
        filepath                 (str)   :  path for the file to be read
        variable_name            (str)   :  variable name
        dims                     (tuple) :  dimensions (excluding time dimension) of the variable of interest
        ndims                    (int)   :  size of dims tuple
        partitions               (int)   :  number of partitions to be used by spark
        other_dims_values_tuple  (list)  :  list of tuples containing cartesian product of all dims values


    Examples:
        >>> dset = dataset(('ta_Amon_CCSM4_historical_r1i1p1_185001-189912.nc', 'ta'))
        >>> print(dset.partitions)
        75
        >>> print(dset.variable_name)
        ta
        >>> print(dset.ndims)
        3
        >>> print(dset.dims)
        (u'plev', u'lat', u'lon')
        >>> print(dset.other_dims_values_tuple[:2])
        [(100000.0, -90.0, 0.0), (100000.0, -90.0, 1.25)]

    """

    def __init__(self, filepath_variable_tuple=None):
        """

        Args:
            filepath_variable_tuple (tuple): tuple containing (filepath, 'variable')
        """

        if filepath_variable_tuple is not None:
            self.filepath = filepath_variable_tuple[0]
            self.variable_name = filepath_variable_tuple[1]
            self.dims = None
            self.ndims = None
            self.partitions = None
            self.other_dims_values_tuple = self.generate_cartesian_product()

    def generate_cartesian_product(self):
        f = Dataset(self.filepath, 'r')
        dset = f.variables[self.variable_name]
        # Floor division keeps the partition count an integer on Python 2 and 3.
        self.partitions = dset.shape[0] // 8
        global PARTITIONS
        PARTITIONS = self.partitions
        self.dims = dset.dimensions[1:]
        self.ndims = len(self.dims)
        values = [f.variables[dim][:].tolist() for dim in self.dims]
        f.close()
        return [element for element in itertools.product(*values)]


def create_rdd(sc, file_list_or_txt_file, mode='multi', partitions=None):
    """Create an RDD from a file list or a tuple of (filepath, variable) and return it.

    Args:
        sc                     (object)        : sparkContext Object
        file_list_or_txt_file  (list or tuple) : list of tuples or a tuple of the format (filepath, variable)
        mode                   (str)           : 'multi' if reading multiple files, otherwise 'single'
        partitions             (int)           : number of partitions

    """
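    if mode == 'multi':
        # NOTE: read_nc_multi is not defined in this listing (multi-file
        # reading is still on the TODO list above), so this branch fails.
        return read_nc_multi(sc, file_list_or_txt_file, partitions=partitions)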

    elif mode == 'single':
        return read_nc_single_chunked(sc, file_list_or_txt_file, partitions=partitions)

    else:
        raise NotImplementedError("You specified a mode that is not implemented.")


def read_nc_single_chunked(sc, filepath_variable_tuple, partitions=None):

    """Generates an RDD using the information passed by the create_rdd function.

    Args:
        sc                       (object) : sparkContext Object
        filepath_variable_tuple  (tuple)  : a tuple of the format (filepath, variable)
        partitions               (int)    : number of partitions

    Returns:
        rdd_                     (rdd)    : Spark's resilient distributed dataset
    """
    assert isinstance(filepath_variable_tuple, tuple), "For single file mode, you must input a tuple"
    dset = dataset(filepath_variable_tuple)
    filepath_ = dset.filepath
    variable_ = dset.variable_name
    # Open the file only long enough to read the length of the leading (time) dimension.
    f = Dataset(filepath_, 'r')
    rows = f.variables[variable_].shape[0]
    f.close()

    if not partitions:
        partitions = PARTITIONS

    if partitions > rows:
        partitions = rows

    # Floor division keeps step an integer on Python 2 and 3.
    step = rows // partitions

    rdd_ = sc.range(0, rows, step)\
             .sortBy(lambda x: x, numPartitions=partitions)\
             .flatMap(lambda x: readonep(filepath_, variable_, x, step)).zipWithIndex()

    return rdd_


def readonep(filepath_, variable_, start_idx, chunk_size):
    """Read a slice from one file.

    Args:
        filepath_    (str): string containing the file path
        variable_    (str): variable name
        start_idx    (int): starting index
        chunk_size   (int): the chunk size to be read at a time.

    Returns:
        list:   list of the chunk read
    """
    f = None
    try:
        f = Dataset(filepath_, 'r')
        dset = f.variables[variable_]

        # get the number of dimensions of the variable
        ndims = len(dset.dimensions)

        # clamp the end index so the last chunk does not run past the data
        end_idx = min(start_idx + chunk_size, dset.shape[0])
        chunk = dset[tuple([slice(start_idx, end_idx)] + [slice(None)]*(ndims-1))]

        return list(chunk[:])

    except Exception as e:
        print("IOError: {} {}".format(e, filepath_))

    finally:
        # f stays None if opening the file failed, so guard the close
        if f is not None:
            f.close()


def dataframe(sc, file_list_or_txt_file, mode='multi', partitions=None):
    """Creates a distributed dataframe from a netCDF file.

    Args:
        sc                     (object)       : sparkContext Object
        file_list_or_txt_file  (tuple)        : a tuple of the format (filepath, variable)
        partitions             (int)          : number of partitions
        mode                   (str)          : (multi) if reading multiple files, otherwise(single)
    Returns:
        df                  (dataframe)          : Spark's distributed data frame
    """

    df = create_rdd(sc, file_list_or_txt_file, mode=mode, partitions=partitions)\
        .map(flatten_data)\
        .flatMap(lambda x: x).repartition(partitions*10)\
        .map(row_transform)\
        .toDF()

    return df


def rdd_to_df(rdd):
    """Function that converts an RDD into a Spark data frame.
    Arguments:
        - rdd: (rdd)

    Returns:
        - df: Spark dataframe
    """
    df = rdd.map(flatten_data)\
            .flatMap(lambda x: x).repartition(PARTITIONS*10)\
            .map(row_transform)\
            .toDF()

    return df


def flatten_data(line):
    """Flattens a numpy array and returns a sequence of tuples, each pairing one value
    with its lat/lon coordinates and any other dimension values.

    Args:
        line (tuple) :  an rdd element in the form of a tuple (data, idx) where data is
                        a numpy array and idx corresponds to the time index.

    Returns:
         results (sequence): one tuple per grid point, each of the form
                             (idx, dim1_value, dim2_value, ..., data_value)
    """
    data = line[0].ravel().tolist()
    idx = line[1]
    results = map(lambda x: (idx, ) + (x[0]) + (x[1], ), zip(other_dims_values_tuple.value, data))
    return results


def row_transform(line):
    """Transforms a tuple (idx, dim1_value, dim2_value, ..., data_value) into a Spark SQL
       Row object.

    Args:
        line (tuple): a tuple of the form (idx, dim1_value, dim2_value, ..., data_value)

    Returns:
        row(*line) : Spark Row object with an arbitrary number of items depending on the size of
                     the tuple in line.

    Examples:
        >>> print(line)
        (0, 100000.0, -90.0, 0.0, 257.8)
        >>> row(*line)
        Row(time=0, plev=100000.0, lat=-90.0, lon=0.0, ta=257.8)

    """
    dims_ = dims.value
    ndims_ = len(dims_)
    variable_ = variable_name.value
    columns = ("time",)+tuple(dims_[:])+(variable_,)
    row = Row(*columns)
    return row(*line)


#if __name__ == '__main__':
    #pass 

In [16]:
tas_df = read.rdd_to_df(tas_rdd)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 12.0 failed 4 times, most recent failure: Lost task 51.3 in stage 12.0 (TID 1336, 10.148.8.134, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 209, in _batched
    for item in iterator:
  File "read.py", line 220, in flatten_data
    results = map(lambda x: (idx, ) + (x[0]) + (x[1], ), zip(other_dims_values_tuple.value, data))
NameError: global name 'other_dims_values_tuple' is not defined

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 209, in _batched
    for item in iterator:
  File "read.py", line 220, in flatten_data
    results = map(lambda x: (idx, ) + (x[0]) + (x[1], ), zip(other_dims_values_tuple.value, data))
NameError: global name 'other_dims_values_tuple' is not defined

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more


In [19]:
tas_df = rdd_to_df(tas_rdd)

NameError: global name 'PARTITIONS' is not defined
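
This second error appears to have a related cause. A bare `global PARTITIONS` statement at module level does not create the name. It only exists once a function that assigns it has run in the same namespace. Here `dset` was created through `read.dataset`, so the assignment would have happened in the `read` module, while the `rdd_to_df` called in In [19] looks in the notebook's namespace. The sketch below shows the underlying Python behaviour. The function `set_partitions` is a hypothetical stand-in for the assignment inside `generate_cartesian_product`.

```python
namespace = {}
source = (
    "global PARTITIONS\n"           # declares nothing at module level
    "def set_partitions(n):\n"
    "    global PARTITIONS\n"
    "    PARTITIONS = n\n"
)
exec(source, namespace)

# The name is absent until the assigning function has actually run.
defined_before = "PARTITIONS" in namespace
namespace["set_partitions"](365)
defined_after = "PARTITIONS" in namespace

print("defined before: {}, defined after: {}".format(defined_before, defined_after))
```

Passing `partitions` explicitly, as the `dataframe` call in In [20] does, avoids depending on this module-level state.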

In [14]:
dset.partitions

94170

In [20]:
tas_df = dataframe(sc, (filepath, 'tas'), mode='single', partitions=dset.partitions)

In [21]:
tas_df.show()

+----+------------------+------+------------------+
|time|               lat|   lon|               tas|
+----+------------------+------+------------------+
|  88|-77.74868774414062| 145.0|252.76138305664062|
|  88|-77.74868774414062|146.25|252.78738403320312|
|  88|-77.74868774414062| 147.5| 252.7574462890625|
|  88|-77.74868774414062|148.75|253.21543884277344|
|  88|-77.74868774414062| 150.0| 253.6935577392578|
|  88|-77.74868774414062|151.25|254.23153686523438|
|  88|-77.74868774414062| 152.5|254.77780151367188|
|  88|-77.74868774414062|153.75|  255.639404296875|
|  88|-77.74868774414062| 155.0| 256.7210693359375|
|  88|-77.74868774414062|156.25| 258.0180358886719|
|  88|41.937171936035156|  50.0| 276.7320556640625|
|  88|41.937171936035156| 51.25|   277.10302734375|
|  88|41.937171936035156|  52.5| 276.8706359863281|
|  88|41.937171936035156| 53.75|274.19281005859375|
|  88|41.937171936035156|  55.0| 269.5821533203125|
|  88|41.937171936035156| 56.25|268.18756103515625|
|  88|41.937
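
Each row shown above is one grid point at one time step. As a minimal sketch of the flattening step, the pure-Python function below pairs every value of a small row-major grid with its coordinates and prepends the time index, in the spirit of `flatten_data`. The function name `flatten_time_slice` and all values are made up for illustration. The real code works on numpy arrays and a broadcast coordinate list.

```python
def flatten_time_slice(grid, coords, time_index):
    """Pair each value of a row-major 2-D grid with its coordinates.

    Hypothetical pure-Python stand-in for flatten_data.
    """
    values = [value for row in grid for value in row]
    return [(time_index,) + coord + (value,)
            for coord, value in zip(coords, values)]


# Made-up coordinates in (lat, lon) order, with lon varying fastest.
coords = [(-90.0, 0.0), (-90.0, 1.25), (-89.0, 0.0), (-89.0, 1.25)]
# Made-up temperatures with shape (lat, lon).
grid = [[250.0, 251.0], [252.0, 253.0]]

flat_rows = flatten_time_slice(grid, coords, 88)
for flat_row in flat_rows:
    print(flat_row)
```

The output tuples have the same column order as the DataFrame: time, lat, lon, then the variable value.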

In [22]:
tas_df.describe().show()

+-------+-----------------+-----------------+------------------+-----------------+
|summary|             time|              lat|               lon|              tas|
+-------+-----------------+-----------------+------------------+-----------------+
|  count|        161464320|        161464320|         161464320|        161464320|
|   mean|           1459.5|              0.0|           179.375|278.1147441679061|
| stddev|842.9313461964497|52.23286586926288|103.92242230892091|22.18978777577845|
|    min|                0|            -90.0|               0.0|199.0844268798828|
|    max|             2919|             90.0|            358.75|325.5912780761719|
+-------+-----------------+-----------------+------------------+-----------------+
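
Several numbers in this summary can be checked by hand from the file header. The arithmetic below assumes one row per (time, lat, lon) combination and longitudes evenly spaced from 0 in steps of 360/288 degrees, which is consistent with the coordinate values printed in Step 1.

```python
n_time, n_lat, n_lon = 2920, 192, 288

# One DataFrame row per (time, lat, lon) combination.
expected_rows = n_time * n_lat * n_lon

# Time indices run from 0 to n_time - 1.
mean_time = sum(range(n_time)) / float(n_time)

# Assumed regular longitude grid starting at 0 degrees.
lon_step = 360.0 / n_lon
max_lon = (n_lon - 1) * lon_step
mean_lon = max_lon / 2

print("rows={}, mean time={}, max lon={}, mean lon={}".format(
    expected_rows, mean_time, max_lon, mean_lon))
```

These values agree with the `count`, the `mean` of `time`, and the `max` and `mean` of `lon` reported by `describe()`.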



![](https://i.imgur.com/AUKVUyV.jpg)

In [23]:
tas_df.createGlobalTempView("temperature")

In [24]:
# Import under an alias so Spark functions do not shadow builtins such as sum, min, and max
from pyspark.sql import functions as F

In [None]:
# Global temporary views live in the global_temp database, so qualify the name
spark.sql("""
          SELECT time, lat, lon, tas
          FROM global_temp.temperature""")