# Wind Statistics

### Introduction:

The data have been modified to contain some missing values, identified by NaN.  
Using pandas should make this exercise
easier, in particular for the bonus question.

You should be able to perform all of these operations without using
a for loop or other looping construct.


1. The data in 'wind.data' has the following format:

In [1]:
"""
Yr Mo Dy   RPT   VAL   ROS   KIL   SHA   BIR   DUB   CLA   MUL   CLO   BEL   MAL
61  1  1 15.04 14.96 13.17  9.29   NaN  9.87 13.67 10.25 10.83 12.58 18.50 15.04
61  1  2 14.71   NaN 10.83  6.50 12.62  7.67 11.50 10.04  9.79  9.67 17.54 13.83
61  1  3 18.50 16.88 12.33 10.13 11.17  6.17 11.25   NaN  8.50  7.67 12.75 12.71
"""


   The first three columns are year, month and day.  The
   remaining 12 columns are average windspeeds in knots at 12
   locations in Ireland on that day.   

   For more information about the dataset, go [here](wind.desc).

### Step 1. Import the necessary libraries

In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("winds").getOrCreate()
spark

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data)

### Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.

In [3]:
from pyspark import SparkFiles
import pandas as pd

In [4]:
# url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data"

# spark.sparkContext.addFile(url)

# data = spark.read.csv(SparkFiles.get("wind.data"), header=True, inferSchema=True, sep=" ")
# produces null columns & null values in different columns

data_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data'
# raw string so that \s is passed to the regex engine unchanged
data = pd.read_csv(data_url, sep = r"\s+", parse_dates = [[0,1,2]]) 
# data = pd.read_csv(data_url, sep = r"\s+") 
data.head()

# print(data.count())

Unnamed: 0,Yr_Mo_Dy,RPT,VAL,ROS,KIL,SHA,BIR,DUB,CLA,MUL,CLO,BEL,MAL
0,2061-01-01,15.04,14.96,13.17,9.29,,9.87,13.67,10.25,10.83,12.58,18.5,15.04
1,2061-01-02,14.71,,10.83,6.5,12.62,7.67,11.5,10.04,9.79,9.67,17.54,13.83
2,2061-01-03,18.5,16.88,12.33,10.13,11.17,6.17,11.25,,8.5,7.67,12.75,12.71
3,2061-01-04,10.58,6.63,11.75,4.58,4.54,2.88,8.63,1.79,5.83,5.88,5.46,10.88
4,2061-01-05,13.33,13.25,11.42,6.17,10.71,8.21,11.92,6.54,10.92,10.34,12.92,11.83


In [5]:
data_df = spark.createDataFrame(data)
data_df.count()

6574

In [6]:
data_df.show(5)

+-------------------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+
|           Yr_Mo_Dy|  RPT|  VAL|  ROS|  KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|
+-------------------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+
|2061-01-01 00:00:00|15.04|14.96|13.17| 9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|
|2061-01-02 00:00:00|14.71|  NaN|10.83|  6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|
|2061-01-03 00:00:00| 18.5|16.88|12.33|10.13|11.17|6.17|11.25|  NaN|  8.5| 7.67|12.75|12.71|
|2061-01-04 00:00:00|10.58| 6.63|11.75| 4.58| 4.54|2.88| 8.63| 1.79| 5.83| 5.88| 5.46|10.88|
|2061-01-05 00:00:00|13.33|13.25|11.42| 6.17|10.71|8.21|11.92| 6.54|10.92|10.34|12.92|11.83|
+-------------------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+
only showing top 5 rows



In [7]:
data_df.printSchema()

root
 |-- Yr_Mo_Dy: timestamp (nullable = true)
 |-- RPT: double (nullable = true)
 |-- VAL: double (nullable = true)
 |-- ROS: double (nullable = true)
 |-- KIL: double (nullable = true)
 |-- SHA: double (nullable = true)
 |-- BIR: double (nullable = true)
 |-- DUB: double (nullable = true)
 |-- CLA: double (nullable = true)
 |-- MUL: double (nullable = true)
 |-- CLO: double (nullable = true)
 |-- BEL: double (nullable = true)
 |-- MAL: double (nullable = true)



In [8]:
from pyspark.sql.functions import *

In [9]:
from pyspark.sql.types import *

In [10]:
data_df = data_df.withColumn("Yr_Mo_Dy", to_date("Yr_Mo_Dy"))
data_df.show(5)

+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+
|  Yr_Mo_Dy|  RPT|  VAL|  ROS|  KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|
+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+
|2061-01-01|15.04|14.96|13.17| 9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|
|2061-01-02|14.71|  NaN|10.83|  6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|
|2061-01-03| 18.5|16.88|12.33|10.13|11.17|6.17|11.25|  NaN|  8.5| 7.67|12.75|12.71|
|2061-01-04|10.58| 6.63|11.75| 4.58| 4.54|2.88| 8.63| 1.79| 5.83| 5.88| 5.46|10.88|
|2061-01-05|13.33|13.25|11.42| 6.17|10.71|8.21|11.92| 6.54|10.92|10.34|12.92|11.83|
+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+
only showing top 5 rows



In [11]:
data_df.printSchema()

root
 |-- Yr_Mo_Dy: date (nullable = true)
 |-- RPT: double (nullable = true)
 |-- VAL: double (nullable = true)
 |-- ROS: double (nullable = true)
 |-- KIL: double (nullable = true)
 |-- SHA: double (nullable = true)
 |-- BIR: double (nullable = true)
 |-- DUB: double (nullable = true)
 |-- CLA: double (nullable = true)
 |-- MUL: double (nullable = true)
 |-- CLO: double (nullable = true)
 |-- BEL: double (nullable = true)
 |-- MAL: double (nullable = true)



### Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.

In [12]:
date_data_df = data_df.withColumn("Yr", year("Yr_Mo_Dy")) \
.withColumn("Mo",month("Yr_Mo_Dy")) \
.withColumn("Dy", dayofmonth("Yr_Mo_Dy"))

date_data_df.show(5)

+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|  Yr_Mo_Dy|  RPT|  VAL|  ROS|  KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|  Yr| Mo| Dy|
+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|2061-01-01|15.04|14.96|13.17| 9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|2061|  1|  1|
|2061-01-02|14.71|  NaN|10.83|  6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|2061|  1|  2|
|2061-01-03| 18.5|16.88|12.33|10.13|11.17|6.17|11.25|  NaN|  8.5| 7.67|12.75|12.71|2061|  1|  3|
|2061-01-04|10.58| 6.63|11.75| 4.58| 4.54|2.88| 8.63| 1.79| 5.83| 5.88| 5.46|10.88|2061|  1|  4|
|2061-01-05|13.33|13.25|11.42| 6.17|10.71|8.21|11.92| 6.54|10.92|10.34|12.92|11.83|2061|  1|  5|
+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
only showing top 5 rows



In [13]:
date_data_df.printSchema()

root
 |-- Yr_Mo_Dy: date (nullable = true)
 |-- RPT: double (nullable = true)
 |-- VAL: double (nullable = true)
 |-- ROS: double (nullable = true)
 |-- KIL: double (nullable = true)
 |-- SHA: double (nullable = true)
 |-- BIR: double (nullable = true)
 |-- DUB: double (nullable = true)
 |-- CLA: double (nullable = true)
 |-- MUL: double (nullable = true)
 |-- CLO: double (nullable = true)
 |-- BEL: double (nullable = true)
 |-- MAL: double (nullable = true)
 |-- Yr: integer (nullable = true)
 |-- Mo: integer (nullable = true)
 |-- Dy: integer (nullable = true)



In [14]:
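def fix_century(x):
    # Caution: years <= 1989 fall through and return None,
    # which is where the nulls in Yr counted in Step 6 come from.
    if x > 1989:
        yr = x - 100 
        return yr

udf_fix_century = udf(fix_century, IntegerType())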

In [15]:
date_data_df = date_data_df.withColumn("Yr", udf_fix_century(col("Yr")))
date_data_df.show(5)

+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|  Yr_Mo_Dy|  RPT|  VAL|  ROS|  KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|  Yr| Mo| Dy|
+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|2061-01-01|15.04|14.96|13.17| 9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|1961|  1|  1|
|2061-01-02|14.71|  NaN|10.83|  6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|1961|  1|  2|
|2061-01-03| 18.5|16.88|12.33|10.13|11.17|6.17|11.25|  NaN|  8.5| 7.67|12.75|12.71|1961|  1|  3|
|2061-01-04|10.58| 6.63|11.75| 4.58| 4.54|2.88| 8.63| 1.79| 5.83| 5.88| 5.46|10.88|1961|  1|  4|
|2061-01-05|13.33|13.25|11.42| 6.17|10.71|8.21|11.92| 6.54|10.92|10.34|12.92|11.83|1961|  1|  5|
+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
only showing top 5 rows
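The helper in this step returns a value only for years above 1989, so any year that was already parsed into the 1900s comes back as `None`. A minimal sketch of a variant that passes such years through unchanged is shown here in plain Python. The 1989 cutoff is kept from the cell above, and the `None` guard is an added assumption for null inputs.

```python
def fix_century(year):
    """Shift years that were parsed into the 2000s back by one century.

    Years at or below the cutoff are returned unchanged, and None stays None.
    """
    if year is None:
        return None
    return year - 100 if year > 1989 else year


print(fix_century(2061))  # a year that was pushed a century ahead
print(fix_century(1975))  # a year that was already parsed correctly
```

It can be wrapped with `udf(fix_century, IntegerType())` in the same way as the original function.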



### Step 5. Set the right dates as the index. Pay attention to the data type, it should be datetime64[ns].

In [16]:
cols_to_drop = ["Yr_Mo_Dy","Yr","Mo","Dy"]

In [17]:
# d2_df = date_data_df.select(concat_ws("-", date_data_df.Yr, date_data_df.Mo, date_data_df.Dy).\
#                             alias("Date").cast(DateType()),*date_data_df.columns).drop(*cols_to_drop)

d2_df = date_data_df.select(concat_ws("-", date_data_df.Yr, date_data_df.Mo, date_data_df.Dy)\
                            .alias("Date").cast(DateType()),*date_data_df.columns)

d2_df.show(5)

+----------+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|      Date|  Yr_Mo_Dy|  RPT|  VAL|  ROS|  KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|  Yr| Mo| Dy|
+----------+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|1961-01-01|2061-01-01|15.04|14.96|13.17| 9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|1961|  1|  1|
|1961-01-02|2061-01-02|14.71|  NaN|10.83|  6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|1961|  1|  2|
|1961-01-03|2061-01-03| 18.5|16.88|12.33|10.13|11.17|6.17|11.25|  NaN|  8.5| 7.67|12.75|12.71|1961|  1|  3|
|1961-01-04|2061-01-04|10.58| 6.63|11.75| 4.58| 4.54|2.88| 8.63| 1.79| 5.83| 5.88| 5.46|10.88|1961|  1|  4|
|1961-01-05|2061-01-05|13.33|13.25|11.42| 6.17|10.71|8.21|11.92| 6.54|10.92|10.34|12.92|11.83|1961|  1|  5|
+----------+----------+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
only showing top 5 rows



In [18]:
d2_df.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Yr_Mo_Dy: date (nullable = true)
 |-- RPT: double (nullable = true)
 |-- VAL: double (nullable = true)
 |-- ROS: double (nullable = true)
 |-- KIL: double (nullable = true)
 |-- SHA: double (nullable = true)
 |-- BIR: double (nullable = true)
 |-- DUB: double (nullable = true)
 |-- CLA: double (nullable = true)
 |-- MUL: double (nullable = true)
 |-- CLO: double (nullable = true)
 |-- BEL: double (nullable = true)
 |-- MAL: double (nullable = true)
 |-- Yr: integer (nullable = true)
 |-- Mo: integer (nullable = true)
 |-- Dy: integer (nullable = true)



### Step 6. Compute how many values are missing for each location over the entire record.  
#### They should be ignored in all calculations below. 

In [22]:
cols = d2_df.columns
cols.remove('Date')
cols.remove('Yr_Mo_Dy')
cols

['RPT',
 'VAL',
 'ROS',
 'KIL',
 'SHA',
 'BIR',
 'DUB',
 'CLA',
 'MUL',
 'CLO',
 'BEL',
 'MAL',
 'Yr',
 'Mo',
 'Dy']

In [23]:
d2_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in cols]).show(5)

+---+---+---+---+---+---+---+---+---+---+---+---+----+---+---+
|RPT|VAL|ROS|KIL|SHA|BIR|DUB|CLA|MUL|CLO|BEL|MAL|  Yr| Mo| Dy|
+---+---+---+---+---+---+---+---+---+---+---+---+----+---+---+
|  6|  3|  2|  5|  2|  0|  3|  2|  3|  1|  0|  4|2922|  0|  0|
+---+---+---+---+---+---+---+---+---+---+---+---+----+---+---+



In [24]:
d2_df.select([count(when(isnan(c), c)).alias(c) for c in cols]).show(5)

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|RPT|VAL|ROS|KIL|SHA|BIR|DUB|CLA|MUL|CLO|BEL|MAL| Yr| Mo| Dy|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  6|  3|  2|  5|  2|  0|  3|  2|  3|  1|  0|  4|  0|  0|  0|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+



In [25]:
import pyspark.sql.functions as F
def count_missings(spark_df,sort=True):
    """
    Counts number of nulls and nans in each column
    """
    # only numeric columns can be tested with isnan
    numeric_cols = [c for (c, c_type) in spark_df.dtypes
                    if c_type not in ('timestamp', 'string', 'date')]
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
                          for c in numeric_cols]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count",ascending=False)

    return df

In [26]:
missing_counts = count_missings(d2_df)
missing_counts

Unnamed: 0,count
Yr,2922
RPT,6
KIL,5
MAL,4
VAL,3
DUB,3
MUL,3
ROS,2
SHA,2
CLA,2


In [27]:
missing_counts['count'].sum()

2953

### Step 7. Compute how many non-missing values there are in total.

In [28]:
d2_df.count() * len(d2_df.columns) - missing_counts['count'].sum()

108805
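The figure above is taken over every column of `d2_df`, including the date-related ones. If only the wind speed readings should count, one option is to restrict the count to the location columns. The sketch below uses plain pandas rather than Spark, on a hand-made frame holding the first three rows shown earlier. The names `sample` and `location_cols` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the wind table: a date column plus two locations.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["1961-01-01", "1961-01-02", "1961-01-03"]),
    "RPT": [15.04, 14.71, 18.50],
    "VAL": [14.96, np.nan, 16.88],
})

location_cols = ["RPT", "VAL"]  # wind speed columns only

# Missing readings per location, and non-missing readings in total.
missing_per_location = sample[location_cols].isna().sum()
non_missing_total = int(sample[location_cols].notna().sum().sum())

print(missing_per_location)
print(non_missing_total)
```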

### Step 8. Calculate the mean windspeed over all the locations and all the times.
#### A single number for the entire dataset.

In [None]:
#trying to find an answer
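One possible route, sketched in pandas rather than Spark: pool every reading from every location into one array and take a NaN-aware mean. The frame below is a hand-made stand-in built from the first three rows of the data, and `sample` and `location_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for the wind table with one missing reading.
sample = pd.DataFrame({
    "RPT": [15.04, 14.71, 18.50],
    "VAL": [14.96, np.nan, 16.88],
})
location_cols = ["RPT", "VAL"]

# nanmean skips NaN entries, so the result is the mean of all valid readings.
overall_mean = float(np.nanmean(sample[location_cols].to_numpy()))
print(overall_mean)
```

This is the mean of all individual readings. Averaging the per-location means instead would weight locations with fewer valid readings differently.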

### Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days 

#### A different set of numbers for each location.

In [29]:
loc_stats = d2_df.dropna().describe()

# for c in loc_stats.columns:
#     if loc_stats[c]
#         loc_stats = loc_stats.withColumn(c, round(c, 2))
    
loc_stats.show()

+-------+-----------------+------------------+------------------+-----------------+------------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+
|summary|              RPT|               VAL|               ROS|              KIL|               SHA|              BIR|               DUB|              CLA|              MUL|              CLO|              BEL|              MAL|                Yr|                Mo|                Dy|
+-------+-----------------+------------------+------------------+-----------------+------------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+
|  count|             3624|              3624|              3624|             3624|              3624|             3624|              3624|
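Note that `dropna()` discards a whole row whenever any column is missing, so a gap at one location also removes that day from every other location. A hedged alternative, sketched in pandas on a hand-made frame (the values are the first three rows of the data, and `sample` and `location_cols` are illustrative names), computes each statistic per column and skips NaN only within that column.

```python
import numpy as np
import pandas as pd

# Stand-in for the wind table with one missing reading in VAL.
sample = pd.DataFrame({
    "RPT": [15.04, 14.71, 18.50],
    "VAL": [14.96, np.nan, 16.88],
})
location_cols = ["RPT", "VAL"]

# One row per statistic, one column per location; NaN is skipped per column.
loc_stats = sample[location_cols].agg(["min", "max", "mean", "std"])
print(loc_stats)
```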

### Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.

#### A different set of numbers for each day.

In [30]:
d2_df.dropna().describe().show()

+-------+-----------------+------------------+------------------+-----------------+------------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+
|summary|              RPT|               VAL|               ROS|              KIL|               SHA|              BIR|               DUB|              CLA|              MUL|              CLO|              BEL|              MAL|                Yr|                Mo|                Dy|
+-------+-----------------+------------------+------------------+-----------------+------------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+
|  count|             3624|              3624|              3624|             3624|              3624|             3624|              3624|
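The cell above still summarises each location over all days. For one set of numbers per day, the statistics need to run across the location columns of each row. A minimal pandas sketch, assuming the data has been brought into pandas with the date as index; the frame is hand-made from the first three rows, and `values` and `location_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

location_cols = ["RPT", "VAL", "ROS"]

# Stand-in for the wind table, indexed by date.
values = pd.DataFrame(
    {
        "RPT": [15.04, 14.71, 18.50],
        "VAL": [14.96, np.nan, 16.88],
        "ROS": [13.17, 10.83, 12.33],
    },
    index=pd.to_datetime(["1961-01-01", "1961-01-02", "1961-01-03"]),
)

# axis=1 aggregates across locations within each day; NaN is skipped.
day_stats = pd.DataFrame({
    "min": values[location_cols].min(axis=1),
    "max": values[location_cols].max(axis=1),
    "mean": values[location_cols].mean(axis=1),
    "std": values[location_cols].std(axis=1),
})
print(day_stats)
```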

### Step 11. Find the average windspeed in January for each location.  
#### Treat January 1961 and January 1962 both as January.

In [49]:
cols_to_show = d2_df.columns
date_cols = ['Date','Yr_Mo_Dy','Yr','Mo','Dy']
for i in date_cols:
    cols_to_show.remove(i)
print(cols_to_show)

['RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR', 'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL']


In [54]:
d2_mean = d2_df.filter(d2_df.Mo.isin(1)).select(*cols_to_show).dropna().toPandas()
d2_mean

Unnamed: 0,RPT,VAL,ROS,KIL,SHA,BIR,DUB,CLA,MUL,CLO,BEL,MAL,Yr
0,10.58,6.63,11.75,4.58,4.54,2.88,8.63,1.79,5.83,5.88,5.46,10.88,1961
1,13.33,13.25,11.42,6.17,10.71,8.21,11.92,6.54,10.92,10.34,12.92,11.83,1961
2,13.21,8.12,9.96,6.67,5.37,4.50,10.67,4.42,7.17,7.50,8.12,13.17,1961
3,13.50,14.29,9.50,4.96,12.29,8.33,9.17,9.29,7.58,7.96,13.96,13.79,1961
4,10.96,9.75,7.62,5.91,9.62,7.29,14.29,7.62,9.25,10.46,16.62,16.46,1961
...,...,...,...,...,...,...,...,...,...,...,...,...,...
300,10.54,19.08,7.83,3.17,10.00,7.00,5.17,8.79,5.21,2.88,12.42,9.96,1970
301,17.50,17.67,17.83,8.58,17.37,14.62,13.75,14.88,11.42,12.50,18.79,21.25,1970
302,12.96,11.08,18.12,6.83,9.83,9.46,10.83,10.63,10.17,8.58,10.58,19.00,1970
303,9.79,6.63,14.96,3.17,5.54,3.21,5.66,2.79,2.67,2.17,4.50,8.54,1970


In [39]:
type(d2_mean)

pandas.core.frame.DataFrame

In [55]:
d2_mean.mean()

RPT      14.280656
VAL      12.449016
ROS      13.117049
KIL       7.224393
SHA      11.464131
BIR       7.777672
DUB      11.509410
CLA       9.355934
MUL       8.529377
CLO       9.913607
BEL      14.000951
MAL      16.716754
Yr     1965.573770
dtype: float64

### Step 12. Downsample the record to a yearly frequency for each location.

In [50]:
cols_to_show.append('Yr')
print(cols_to_show)

['RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR', 'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL', 'Yr']


In [58]:
d2_year_avg = d2_df.select(*cols_to_show).dropna().groupBy('Yr').avg().orderBy('Yr')
d2_year_avg.show(5)

+----+------------------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------+
|  Yr|          avg(RPT)|          avg(VAL)|          avg(ROS)|         avg(KIL)|          avg(SHA)|         avg(BIR)|          avg(DUB)|          avg(CLA)|         avg(MUL)|          avg(CLO)|          avg(BEL)|          avg(MAL)|avg(Yr)|
+----+------------------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------+
|1961| 12.20758823529412|10.207205882352943|11.282852941176477|6.846441176470588|10.742735294117647|7.617147058823529| 9.633264705882354| 8.684911764705886|8.571941176470586| 9.700470588235298|13.271558823529414|13.587205882352938| 1961.0|
|1962|12.274779005524865|10.150856353591

In [59]:
d2_year_avg.toPandas() #just to have a beautiful output

Unnamed: 0,Yr,avg(RPT),avg(VAL),avg(ROS),avg(KIL),avg(SHA),avg(BIR),avg(DUB),avg(CLA),avg(MUL),avg(CLO),avg(BEL),avg(MAL),avg(Yr)
0,1961,12.207588,10.207206,11.282853,6.846441,10.742735,7.617147,9.633265,8.684912,8.571941,9.700471,13.271559,13.587206,1961.0
1,1962,12.274779,10.150856,11.714365,6.964337,10.692265,7.415331,11.043122,8.829807,8.335552,9.693812,12.985691,14.361436,1962.0
2,1963,12.813452,10.836986,12.541151,7.330055,11.72411,8.434712,11.075699,10.336548,8.903589,10.224438,13.638877,14.999014,1963.0
3,1964,12.363661,10.920164,12.104372,6.787787,11.454481,7.570874,10.259153,9.46735,7.789016,10.207951,13.740546,14.910301,1964.0
4,1965,12.45137,11.075534,11.848767,6.858466,11.024795,7.47811,10.618712,8.879918,7.907425,9.918082,12.964247,15.591644,1965.0
5,1966,13.461973,11.557205,12.02063,7.345726,11.805041,7.793671,10.579808,8.835096,8.514438,9.768959,14.265836,16.30726,1966.0
6,1967,12.737151,10.990986,11.739397,7.143425,11.63074,7.368164,10.652027,9.325616,8.645014,9.547425,14.774548,17.135945,1967.0
7,1968,11.835628,10.468197,11.409754,6.477678,10.760765,6.067322,8.85918,8.255519,7.224945,7.832978,12.808634,15.017486,1968.0
8,1969,11.166356,9.723699,10.902,5.767973,9.873918,6.189973,8.564493,7.711397,7.924521,7.754384,12.621233,15.762904,1969.0
9,1970,12.600329,10.726932,11.730247,6.217178,10.56737,7.609452,9.60989,8.33463,9.297616,8.289808,13.183644,16.456027,1970.0


### Step 13. Downsample the record to a monthly frequency for each location.

In [60]:
print(cols_to_show)

['RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR', 'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL', 'Yr']


In [61]:
cols_to_show.remove('Yr')

In [69]:
# cols_to_show.append('Mo')
# cols_to_show.append('Yr')
print(cols_to_show)

['RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR', 'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL', 'Mo', 'Yr']


In [71]:
# d2_month_avg = d2_df.filter(d2_df.Yr.isin(1961)).select(*cols_to_show).dropna().groupBy('Mo').avg().orderBy('Mo')
# d2_month_avg.show()

#trying to find an easier way
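One way this could be approached is to group by year and month together, so that January 1961 and January 1962 stay separate. The sketch below uses pandas rather than Spark, and the readings in it are made-up values chosen only to keep the arithmetic easy to follow. `sample` and `location_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up readings spanning two months of 1961 and one month of 1962.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["1961-01-01", "1961-01-02", "1961-02-01", "1962-01-01"]),
    "RPT": [10.0, 12.0, 8.0, 14.0],
    "VAL": [np.nan, 6.0, 4.0, 2.0],
})
location_cols = ["RPT", "VAL"]

# Group keys derived from the date; rename gives the index levels readable names.
year_key = sample["Date"].dt.year.rename("Yr")
month_key = sample["Date"].dt.month.rename("Mo")

# mean() skips NaN per column, so no rows have to be dropped beforehand.
monthly = sample.groupby([year_key, month_key])[location_cols].mean()
print(monthly)
```

In Spark, the same grouping idea would correspond to `groupBy('Yr', 'Mo')` on the existing columns, provided the `Yr` column has no nulls.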

### Step 14. Downsample the record to a weekly frequency for each location.

In [72]:
# trying to find an answer
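A possible approach, sketched in pandas rather than Spark: with the date as index, `resample("W")` groups the rows into weeks that end on Sunday. The frame below is a hand-made stand-in holding the first five days of the data, and `values` is an illustrative name.

```python
import numpy as np
import pandas as pd

# First five days of the record for two locations, indexed by date.
values = pd.DataFrame(
    {
        "RPT": [15.04, 14.71, 18.50, 10.58, 13.33],
        "VAL": [14.96, np.nan, 16.88, 6.63, 13.25],
    },
    index=pd.date_range("1961-01-01", periods=5, freq="D"),
)

# "W" bins run Monday to Sunday and are labelled by the closing Sunday.
weekly = values.resample("W").mean()
print(weekly)
```

January 1, 1961 was a Sunday, so it closes a week of its own and the following four days fall into the week labelled 1961-01-08.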

### Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.

In [73]:
d2_df.columns

['Date',
 'Yr_Mo_Dy',
 'RPT',
 'VAL',
 'ROS',
 'KIL',
 'SHA',
 'BIR',
 'DUB',
 'CLA',
 'MUL',
 'CLO',
 'BEL',
 'MAL',
 'Yr',
 'Mo',
 'Dy']

In [74]:
d2_df.show(2)

+----------+----------+-----+-----+-----+----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|      Date|  Yr_Mo_Dy|  RPT|  VAL|  ROS| KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|  Yr| Mo| Dy|
+----------+----------+-----+-----+-----+----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
|1961-01-01|2061-01-01|15.04|14.96|13.17|9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|1961|  1|  1|
|1961-01-02|2061-01-02|14.71|  NaN|10.83| 6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|1961|  1|  2|
+----------+----------+-----+-----+-----+----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+
only showing top 2 rows



In [78]:
d2_wk_df = d2_df.withColumn('Wk', weekofyear("Date"))
d2_wk_df.show(2)

+----------+----------+-----+-----+-----+----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+---+
|      Date|  Yr_Mo_Dy|  RPT|  VAL|  ROS| KIL|  SHA| BIR|  DUB|  CLA|  MUL|  CLO|  BEL|  MAL|  Yr| Mo| Dy| Wk|
+----------+----------+-----+-----+-----+----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+---+
|1961-01-01|2061-01-01|15.04|14.96|13.17|9.29|  NaN|9.87|13.67|10.25|10.83|12.58| 18.5|15.04|1961|  1|  1| 52|
|1961-01-02|2061-01-02|14.71|  NaN|10.83| 6.5|12.62|7.67| 11.5|10.04| 9.79| 9.67|17.54|13.83|1961|  1|  2|  1|
+----------+----------+-----+-----+-----+----+-----+----+-----+-----+-----+-----+-----+-----+----+---+---+---+
only showing top 2 rows



In [80]:
print(cols_to_show)

['RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR', 'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL', 'Mo', 'Yr']


In [81]:
cols_to_show.append('Wk')

In [85]:
d2_wk_avg_1961 = d2_wk_df.filter(d2_wk_df.Yr.isin(1961)).select(*cols_to_show).dropna().groupBy('Wk').avg().orderBy('Wk')
d2_wk_avg_1961.show(2)

+---+------------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+-------+-------+-------+
| Wk|          avg(RPT)|         avg(VAL)|          avg(ROS)|         avg(KIL)|         avg(SHA)|         avg(BIR)|         avg(DUB)|         avg(CLA)|         avg(MUL)|         avg(CLO)|          avg(BEL)|          avg(MAL)|avg(Mo)|avg(Yr)|avg(Wk)|
+---+------------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+-------+-------+-------+
|  1|            12.316|           10.408|             10.05|            5.658|8.505999999999998|            6.242|           10.936|            5.932|             8.15|            8.428|            11.416|13.225999999999999|    1.0| 1961.0|    1.0|


In [86]:
d2_wk_min_1961 = d2_wk_df.filter(d2_wk_df.Yr.isin(1961)).select(*cols_to_show).dropna().groupBy('Wk').min().orderBy('Wk')
d2_wk_min_1961.show(2)

+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
| Wk|min(RPT)|min(VAL)|min(ROS)|min(KIL)|min(SHA)|min(BIR)|min(DUB)|min(CLA)|min(MUL)|min(CLO)|min(BEL)|min(MAL)|min(Mo)|min(Yr)|min(Wk)|
+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
|  1|   10.58|    6.63|    7.62|    4.58|    4.54|    2.88|    8.63|    1.79|    5.83|    5.88|    5.46|   10.88|      1|   1961|      1|
|  2|    9.04|    3.54|    7.08|    0.67|    2.29|    0.96|     2.5|    0.58|     0.5|    2.67|    5.25|    5.17|      1|   1961|      2|
+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
only showing top 2 rows



In [90]:
d2_wk_max_1961 = d2_wk_df.filter(d2_wk_df.Yr.isin(1961)).select(*cols_to_show).dropna().groupBy('Wk').max().orderBy('Wk')
d2_wk_max_1961.show(2)

+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
| Wk|max(RPT)|max(VAL)|max(ROS)|max(KIL)|max(SHA)|max(BIR)|max(DUB)|max(CLA)|max(MUL)|max(CLO)|max(BEL)|max(MAL)|max(Mo)|max(Yr)|max(Wk)|
+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
|  1|    13.5|   14.29|   11.75|    6.67|   12.29|    8.33|   14.29|    9.29|   10.92|   10.46|   16.62|   16.46|      1|   1961|      1|
|  2|   19.75|   12.08|    19.5|   10.54|   10.37|    9.46|   15.54|    11.5|   10.37|   14.58|   20.71|   16.92|      1|   1961|      2|
+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
only showing top 2 rows



In [88]:
#trying to find an easier way to calculate these stats at once
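One candidate for computing all four statistics in a single pass, sketched in pandas rather than Spark: slice the record from January 2, 1961, resample by week, pass the list of statistics to `agg`, and keep the first 52 rows. The frame is a hand-made stand-in with the first five days of the data, and `values` is an illustrative name.

```python
import numpy as np
import pandas as pd

# First five days of the record for two locations, indexed by date.
values = pd.DataFrame(
    {
        "RPT": [15.04, 14.71, 18.50, 10.58, 13.33],
        "VAL": [14.96, np.nan, 16.88, 6.63, 13.25],
    },
    index=pd.date_range("1961-01-01", periods=5, freq="D"),
)

weekly_stats = (
    values.loc["1961-01-02":]              # start at the first Monday
    .resample("W")                         # Monday-to-Sunday weeks
    .agg(["min", "max", "mean", "std"])    # all statistics at once
    .head(52)                              # first 52 weeks only
)
print(weekly_stats)
```

The result has one row per week and a two-level column index of (location, statistic). This differs from the `weekofyear` grouping above, which restarts its numbering every year.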