# Energy Prices in Spain
## Spark Individual Assignment
### By: Alain Grullón González

The objective of this analysis of average energy prices in Spain is to gain insight into how energy sources, weather conditions, and energy demand affect the final energy price.

### PySpark Setup
First, let's set up PySpark:

In [1]:
import findspark
findspark.init()

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import lit

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

### Data Source Loading into Spark Abstraction
Let's now load the data sources as DataFrames and join the two datasets into one, containing the weather conditions of the cities together with the energy sources, total load (demand), and national energy price.

Source: https://www.kaggle.com/nicholasjhana/energy-consumption-generation-prices-and-weather

In [2]:
energyDF = spark.read \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .csv("energy_dataset.csv")
weatherDF = spark.read \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .csv("weather_features.csv")

In [3]:
nrgDF = weatherDF.join(energyDF, on='time')

In [4]:
nrgDF.count()

178436
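
The count reflects how an inner join on time behaves: every hourly energy record is repeated once for each city weather record that shares its timestamp. As a rough illustration only, here is a plain-Python sketch on made-up toy rows (all values and the helper name `inner_join_on_time` are hypothetical, not taken from the datasets, and the sketch assumes each timestamp appears once on the energy side):

```python
def inner_join_on_time(weather_rows, energy_rows):
    """Pair each weather row with the energy row sharing its 'time' key."""
    # Index the energy rows by timestamp (assumes one energy row per timestamp).
    energy_by_time = {row["time"]: row for row in energy_rows}
    joined = []
    for w in weather_rows:
        e = energy_by_time.get(w["time"])
        if e is not None:  # inner join: weather rows without a match are dropped
            joined.append({**w, **e})
    return joined

# Hypothetical toy data: two cities, three hours of weather, two hours of energy.
weather = [
    {"time": "t0", "city_name": "Madrid", "temp": 270.5},
    {"time": "t0", "city_name": "Valencia", "temp": 275.0},
    {"time": "t1", "city_name": "Madrid", "temp": 271.0},
    {"time": "t1", "city_name": "Valencia", "temp": 276.0},
    {"time": "t2", "city_name": "Madrid", "temp": 272.0},  # no energy row for t2
]
energy = [
    {"time": "t0", "price actual": 60.0},
    {"time": "t1", "price actual": 62.5},
]

joined = inner_join_on_time(weather, energy)
print(len(joined))  # one joined row per (city, hour) pair that has an energy match
```

With five cities in the weather data, the joined DataFrame therefore ends up with roughly five rows per hour, each carrying the same national generation, load, and price values.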

In [5]:
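from pyspark.sql.functions import monotonically_increasing_id
# note: the generated ids are guaranteed to be unique and increasing, not consecutive;
# joining on them later assumes both DataFrames are evaluated in the same row order
nrgDF_in = nrgDF.select("*").withColumn("id",monotonically_increasing_id())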
nrgDF_in.show(3)

+--------------------+---------+-------+--------+--------+--------+--------+----------+--------+-------+-------+-------+----------+----------+------------+-------------------+------------+------------------+------------------------------------+----------------------------------+---------------------+---------------------------+---------------------+---------------------------+----------------------+---------------------+------------------------------------------+-------------------------------------------+------------------------------------------+--------------------------------+-----------------+------------------+----------------+--------------------------+----------------+----------------+------------------------+-----------------------+------------------------+---------------------------------+-------------------------------+-------------------+-----------------+---------------+------------+---+
|                time|city_name|   temp|temp_min|temp_max|pressure|humidity|wind_speed

Converting the time feature into a Timestamp object:

In [6]:
from pyspark.sql.functions import from_unixtime
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import TimestampType

Time_df = nrgDF.select(nrgDF.columns[0],
                     unix_timestamp('time', "yyyy-MM-dd HH:mm:ss+01:00")
                     .cast(TimestampType())
                     .alias("Date"))

In [7]:
Time_in = Time_df.select("*").withColumn("id",monotonically_increasing_id())
Time_in.show(2)

+--------------------+-------------------+---+
|                time|               Date| id|
+--------------------+-------------------+---+
|2015-01-01 00:00:...|2015-01-01 00:00:00|  0|
|2015-01-01 01:00:...|2015-01-01 01:00:00|  1|
+--------------------+-------------------+---+
only showing top 2 rows
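
The format string above treats the `+01:00` suffix as literal text. If some raw strings carry a different UTC offset (for example `+02:00` during summer time, which is an assumption about the data and not something shown in the output above), that literal may no longer match. As a plain-Python sketch of offset-aware parsing, using illustrative sample strings, `datetime.fromisoformat` reads the offset from each string instead of assuming it:

```python
from datetime import datetime, timezone

# Illustrative strings; the second one (summer offset) is hypothetical.
raw_times = ["2015-01-01 00:00:00+01:00", "2015-07-01 00:00:00+02:00"]

# fromisoformat keeps the UTC offset found in each string.
parsed = [datetime.fromisoformat(s) for s in raw_times]

for raw, ts in zip(raw_times, parsed):
    # Normalizing to UTC makes timestamps with different offsets comparable.
    print(raw, "->", ts.astimezone(timezone.utc).isoformat())
```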



In [8]:
Time_in = Time_in.drop("time")
nrgDF_in = nrgDF_in.join(Time_in, on = "id", how="inner")

In [9]:
nrgDF_in.count()

178436

In [10]:
# explicit import: a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import col

nrgDF_temp = nrgDF.select((col("temp")-273.15).alias("Temperature"))
nrgDF_temp = nrgDF_temp.select("*").withColumn("id",monotonically_increasing_id())
nrgDF_temp.show(2)

+-------------------+---+
|        Temperature| id|
+-------------------+---+
|-2.6749999999999545|  0|
|-2.6749999999999545|  1|
+-------------------+---+
only showing top 2 rows
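
Subtracting 273.15 converts the raw `temp` values from kelvin to degrees Celsius. A tiny plain-Python check of that conversion; the input of 270.475 K is inferred from the first row shown above and should be treated as an assumption, and `kelvin_to_celsius` is a hypothetical helper name:

```python
KELVIN_OFFSET = 273.15  # 0 ºC expressed in kelvin

def kelvin_to_celsius(temp_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_k - KELVIN_OFFSET

first_row_celsius = kelvin_to_celsius(270.475)
print(round(first_row_celsius, 3))
```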



In [11]:
nrgDF_temp = nrgDF_temp.join(nrgDF_in, on = "id", how="inner")

In [12]:
nrgDF_temp.count()

178436

In [13]:
nrgDF = nrgDF_temp.select("Date","city_name","Temperature","pressure","humidity","wind_speed",
              "rain_1h","rain_3h","snow_3h","clouds_all","generation biomass","generation fossil gas",
              "generation fossil hard coal","generation fossil oil","generation fossil oil shale",
              "generation geothermal","generation hydro run-of-river and poundage","generation hydro water reservoir",
              "generation nuclear", "generation solar","generation waste","generation wind offshore",
              "generation wind onshore","total load actual","price actual"
             )
nrgDF.show(2)

+-------------------+---------+-------------------+--------+--------+----------+-------+-------+-------+----------+------------------+---------------------+---------------------------+---------------------+---------------------------+---------------------+------------------------------------------+--------------------------------+------------------+----------------+----------------+------------------------+-----------------------+-----------------+------------+
|               Date|city_name|        Temperature|pressure|humidity|wind_speed|rain_1h|rain_3h|snow_3h|clouds_all|generation biomass|generation fossil gas|generation fossil hard coal|generation fossil oil|generation fossil oil shale|generation geothermal|generation hydro run-of-river and poundage|generation hydro water reservoir|generation nuclear|generation solar|generation waste|generation wind offshore|generation wind onshore|total load actual|price actual|
+-------------------+---------+-------------------+--------+--------

### Display schema and size of the DataFrame
Now that our datasets are joined into a single DataFrame, we will be working with nrgDF.

In [14]:
from IPython.display import display, Markdown

nrgDF.printSchema()
display(Markdown("This DataFrame has **%d rows**." % nrgDF.count()))

root
 |-- Date: timestamp (nullable = true)
 |-- city_name: string (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- pressure: integer (nullable = true)
 |-- humidity: integer (nullable = true)
 |-- wind_speed: integer (nullable = true)
 |-- rain_1h: double (nullable = true)
 |-- rain_3h: double (nullable = true)
 |-- snow_3h: double (nullable = true)
 |-- clouds_all: integer (nullable = true)
 |-- generation biomass: integer (nullable = true)
 |-- generation fossil gas: integer (nullable = true)
 |-- generation fossil hard coal: integer (nullable = true)
 |-- generation fossil oil: integer (nullable = true)
 |-- generation fossil oil shale: integer (nullable = true)
 |-- generation geothermal: integer (nullable = true)
 |-- generation hydro run-of-river and poundage: integer (nullable = true)
 |-- generation hydro water reservoir: integer (nullable = true)
 |-- generation nuclear: integer (nullable = true)
 |-- generation solar: integer (nullable = true)
 |-- generatio

This DataFrame has **178436 rows**.

### One or multiple random samples from the data set
Let's cache the DataFrame to make processing faster.

In [15]:
nrgDF.cache() # optimization to make the processing faster
nrgDF.sample(False, 0.01).take(2)

[Row(Date=datetime.datetime(2016, 3, 9, 13, 0), city_name='Valencia', Temperature=14.170000000000016, pressure=1012, humidity=39, wind_speed=6, rain_1h=0.0, rain_3h=0.0, snow_3h=0.0, clouds_all=20, generation biomass=384, generation fossil gas=5087, generation fossil hard coal=6234, generation fossil oil=269, generation fossil oil shale=0, generation geothermal=0, generation hydro run-of-river and poundage=728, generation hydro water reservoir=2174, generation nuclear=6935, generation solar=4657, generation waste=316, generation wind offshore=0, generation wind onshore=2224, total load actual=32629, price actual=30.02),
 Row(Date=datetime.datetime(2017, 2, 22, 11, 0), city_name='Madrid', Temperature=8.82000000000005, pressure=1022, humidity=70, wind_speed=1, rain_1h=0.0, rain_3h=0.0, snow_3h=0.0, clouds_all=0, generation biomass=362, generation fossil gas=6200, generation fossil hard coal=5478, generation fossil oil=307, generation fossil oil shale=0, generation geothermal=0, generatio

### Data entities, metrics and dimensions

It is important to understand the following elements of our data:

* **Entities:** Energy Price (target), Weather condition (dimension), Generation source (dimension)
* **Metrics:** Total Load, generation in MWh (by source)
* **Dimensions:** temperatures, humidity, rain, snow, clouds, etc.

### Column categorization

We can categorize the features as follows:

* **City/weather related columns:** *city*, *temperature*, *wind speed*, *wind direction*, *pressure*, *humidity*, *rain*, *snow*...
* **Energy Price related columns:** *Price*, *Total Load*, *Generation biomass*, *Generation fossil*, *Generation solar*, *Generation wind*, *Generation Hydro...*...

## Column groups basic profiling to better understand our data set
### Weather related columns basic profiling

In [16]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit


print ("Summary of columns Temperature, pressure:")
nrgDF.select("Temperature","pressure").summary().show()

print("Checking for nulls in Temperature, pressure:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["Temperature","pressure"]]).show()

print("Checking amount of distinct values in columns Temperature, pressure:")
nrgDF.select([countDistinct(c).alias(c) for c in ["Temperature","pressure"]]).show()

# Humidity and Rain
print ("Summary of columns humidity, rain_1h:")
nrgDF.select("humidity","rain_1h").summary().show()

print("Checking for nulls on columns humidity, rain_1h:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["humidity","rain_1h"]]).show()

print("Checking amount of distinct values in columns humidity, rain_1h:")
nrgDF.select([countDistinct(c).alias(c) for c in ["humidity","rain_1h"]]).show()

# Snow and Clouds
print ("Summary of columns snow_3h, clouds_all:")
nrgDF.select("snow_3h","clouds_all").summary().show()

print("Checking for nulls on columns snow_3h, clouds_all:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["snow_3h","clouds_all"]]).show()

print("Checking amount of distinct values in columns snow_3h, clouds_all:")
nrgDF.select([countDistinct(c).alias(c) for c in ["snow_3h","clouds_all"]]).show()

Summary of columns Temperature, pressure:
+-------+-------------------+------------------+
|summary|        Temperature|          pressure|
+-------+-------------------+------------------+
|  count|             178436|            178436|
|   mean| 16.467577374170588|1069.2480273039073|
| stddev|  8.025847611669839| 5968.962810861072|
|    min|-10.909999999999968|                 0|
|    25%| 10.511656300000027|              1013|
|    50%|               16.0|              1018|
|    75%|               22.0|              1022|
|    max| 42.450000000000045|           1008371|
+-------+-------------------+------------------+

Checking for nulls in Temperature, pressure:
+-----------+--------+
|Temperature|pressure|
+-----------+--------+
|          0|       0|
+-----------+--------+

Checking amount of distinct values in columns Temperature, pressure:
+-----------+--------+
|Temperature|pressure|
+-----------+--------+
|      20743|     190|
+-----------+--------+

Summary of columns humi

We can see the average weather conditions in these 5 cities that cover almost all of the Spanish landmass:

Average temperatures are around 16ºC, with an interquartile range from 10.5ºC to 22ºC and extremes from -11ºC to 42.5ºC, showing diverse weather conditions across the major Spanish cities. Although the pressure attribute is plagued by extreme erroneous outliers (a minimum of 0 and a maximum above 1,000,000), we can see it is fairly stable between 1013 hPa and 1022 hPa.

Humidity is also interesting at first glance: regardless of how dry Madrid is, humidity across these cities averages around 70% and lies between 53% and 87% half of the time. Rainfall varies significantly because of the weather diversity between cities. Still, the average rainfall per year works out to 0.0754 mm/hour × 8,760 hours/year ≈ 660 mm/year. The standard deviation is very large, but it will not be considered here because of the natural hourly volatility of this feature.
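
The annualization of the hourly rainfall mean is simple arithmetic and can be checked directly. A minimal sketch, taking the 0.0754 mm/hour mean quoted above as given and assuming a non-leap year of 8,760 hours:

```python
mean_rain_mm_per_hour = 0.0754   # mean of rain_1h quoted in the text
hours_per_year = 24 * 365        # 8,760 hours, ignoring leap years

annual_rain_mm = mean_rain_mm_per_hour * hours_per_year
print(f"{annual_rain_mm:.1f} mm/year")
```

The result comes out at about 660 mm/year.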

Only around 40 mm/year of snow seems to fall in these 5 cities, as expected given the relatively high average temperatures (well above 0ºC). Average cloud cover is around 25% and is below 40% most of the time.

All in all, we can conclude that weather conditions in Spain as a whole are warm and dry relative to the rest of Europe, and rainfall is very low.

There are no null values in any of the relevant weather features, which means we are ready to handle any query related to weather.

### Energy related columns basic profiling

In [17]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

# Thermal Energy Renewables 

print ("Summary of columns generation biomass, generation geothermal, generation waste:")
nrgDF.select("generation biomass","generation geothermal", "generation waste").summary().show()

print("Checking for nulls in generation biomass, generation geothermal,generation waste:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation biomass","generation geothermal","generation waste"]]).show()

print("Checking amount of distinct values in columns generation biomass, generation geothermal, generation waste:")
nrgDF.select([countDistinct(c).alias(c) for c in ["generation biomass","generation geothermal","generation waste"]]).show()

# Wind and Solar
print ("Summary of columns generation solar, generation wind onshore, generation wind offshore:")
nrgDF.select("generation solar","generation wind onshore", "generation wind offshore").summary().show()

print("Checking for nulls on columns generation solar, generation wind onshore, generation wind offshore:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation solar","generation wind onshore","generation wind offshore"]]).show()

print("Checking amount of distinct values in columns generation solar, generation wind onshore, generation wind offshore:")
nrgDF.select([countDistinct(c).alias(c) for c in ["generation solar","generation wind onshore","generation wind offshore"]]).show()

# Hydro
print ("Summary of columns generation hydro run-of-river and poundage, generation hydro water reservoir:")
nrgDF.select("generation hydro run-of-river and poundage","generation hydro water reservoir").summary().show()

print("Checking for nulls on columns generation hydro run-of-river and poundage, generation hydro water reservoir:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation hydro run-of-river and poundage","generation hydro water reservoir"]]).show()

print("Checking amount of distinct values in columns generation hydro run-of-river and poundage, generation hydro water reservoir:")
nrgDF.select([countDistinct(c).alias(c) for c in ["generation hydro run-of-river and poundage","generation hydro water reservoir"]]).show()

# Nuclear and Coal
print ("Summary of columns generation nuclear, generation fossil hard coal:")
nrgDF.select("generation nuclear","generation fossil hard coal").summary().show()

print("Checking for nulls on columns generation nuclear, generation fossil hard coal:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation nuclear","generation fossil hard coal"]]).show()

print("Checking amount of distinct values in columns generation nuclear, generation fossil hard coal:")
nrgDF.select([countDistinct(c).alias(c) for c in ["generation nuclear","generation fossil hard coal"]]).show()

# Oil and Gas
print ("Summary of columns generation fossil oil, generation fossil gas, generation fossil oil shale:")
nrgDF.select("generation fossil oil","generation fossil gas","generation fossil oil shale").summary().show()

print("Checking for nulls on columns generation fossil oil, generation fossil gas, generation fossil oil shale:")
nrgDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation fossil oil","generation fossil gas","generation fossil oil shale"]]).show()

print("Checking amount of distinct values in columns generation fossil oil, generation fossil gas, generation fossil oil shale:")
nrgDF.select([countDistinct(c).alias(c) for c in ["generation fossil oil","generation fossil gas", "generation fossil oil shale"]]).show()

Summary of columns generation biomass, generation geothermal, generation waste:
+-------+------------------+---------------------+-----------------+
|summary|generation biomass|generation geothermal| generation waste|
+-------+------------------+---------------------+-----------------+
|  count|            178341|               178346|           178341|
|   mean| 382.9993944185577|                  0.0|269.7537414279386|
| stddev| 85.25674784643664|                  0.0|50.08805651993184|
|    min|                 0|                    0|                0|
|    25%|               333|                    0|              241|
|    50%|               366|                    0|              280|
|    75%|               429|                    0|              310|
|    max|               592|                    0|              357|
+-------+------------------+---------------------+-----------------+

Checking for nulls in generation biomass, generation geothermal,generation waste:
+--------

We see there are a few nulls (around 90 to 95 for each generation source column), so let's take a look.

After running the forward fills below, there are no nulls left in the generation columns, and we are ready to determine the relationship between the sources and the price.

In [18]:
nrgDF.where(col("generation fossil oil").isNull()).show(10)

+-------------------+----------+------------------+--------+--------+----------+-------+-------+-------+----------+------------------+---------------------+---------------------------+---------------------+---------------------------+---------------------+------------------------------------------+--------------------------------+------------------+----------------+----------------+------------------------+-----------------------+-----------------+------------+
|               Date| city_name|       Temperature|pressure|humidity|wind_speed|rain_1h|rain_3h|snow_3h|clouds_all|generation biomass|generation fossil gas|generation fossil hard coal|generation fossil oil|generation fossil oil shale|generation geothermal|generation hydro run-of-river and poundage|generation hydro water reservoir|generation nuclear|generation solar|generation waste|generation wind offshore|generation wind onshore|total load actual|price actual|
+-------------------+----------+------------------+--------+--------

The missing generation values appear to be due to missed recordings at random times in this 4-year interval. As the generation values are fairly stable by nature (at a large scale, only new power plants coming online alter this stability) and there are only around 900 missing values out of 178K rows (~0.5% of values), we can go ahead and perform a simple forward fill for the null values.

In [19]:
from pyspark.sql import Window
from pyspark.sql.functions import last

# define the window: one partition per city, ordered by time
window = Window.partitionBy('city_name').orderBy('Date')

# generation columns to forward-fill
generation_cols = [
    'generation fossil oil', 'generation fossil gas', 'generation fossil oil shale',
    'generation fossil hard coal', 'generation nuclear', 'generation hydro water reservoir',
    'generation hydro run-of-river and poundage', 'generation solar',
    'generation wind onshore', 'generation wind offshore', 'generation waste',
    'generation geothermal', 'generation biomass',
]

# do the fill: each column takes its last non-null value up to the current row
nrgDF_filled = nrgDF
for c in generation_cols:
    nrgDF_filled = nrgDF_filled.withColumn(c, last(nrgDF[c], ignorenulls=True).over(window))
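
To make the semantics of `last(..., ignorenulls=True)` over an ordered window concrete, here is a plain-Python sketch of a forward fill on one city's time-ordered readings. The sample values and the `forward_fill` helper are hypothetical and only illustrate the idea; this is not the Spark implementation.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v          # remember the latest real reading
        filled.append(last_seen)   # gaps inherit the latest real reading
    return filled

# Hypothetical hourly readings for one city, already sorted by time.
hourly_oil = [269, None, None, 307, None]
print(forward_fill(hourly_oil))    # [269, 269, 269, 307, 307]

# A gap at the very start has nothing before it, so it stays empty.
print(forward_fill([None, 5]))     # [None, 5]
```

The second call shows the one case a forward fill cannot repair: a null at the start of a partition.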

Now let's rerun the profiling with nrgDF_filled and analyze the results.

In [20]:
# Thermal Energy Renewables 

print ("Summary of columns generation biomass, generation geothermal, generation waste:")
nrgDF_filled.select("generation biomass","generation geothermal", "generation waste").summary().show()

print("Checking for nulls in generation biomass, generation geothermal, generation waste:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation biomass","generation geothermal","generation waste"]]).show()

print("Checking amount of distinct values in columns generation biomass, generation geothermal, generation waste:")
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["generation biomass","generation geothermal","generation waste"]]).show()

# Wind and Solar
print ("Summary of columns generation solar, generation wind onshore, generation wind offshore:")
nrgDF_filled.select("generation solar","generation wind onshore", "generation wind offshore").summary().show()

print("Checking for nulls on columns generation solar, generation wind onshore, generation wind offshore:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation solar","generation wind onshore","generation wind offshore"]]).show()

print("Checking amount of distinct values in columns generation solar, generation wind onshore, generation wind offshore:")
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["generation solar","generation wind onshore","generation wind offshore"]]).show()

# Hydro
print ("Summary of columns generation hydro run-of-river and poundage, generation hydro water reservoir:")
nrgDF_filled.select("generation hydro run-of-river and poundage","generation hydro water reservoir").summary().show()

print("Checking for nulls on columns generation hydro run-of-river and poundage, generation hydro water reservoir:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation hydro run-of-river and poundage","generation hydro water reservoir"]]).show()

print("Checking amount of distinct values in columns generation hydro run-of-river and poundage, generation hydro water reservoir:")
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["generation hydro run-of-river and poundage","generation hydro water reservoir"]]).show()

# Nuclear and Coal
print ("Summary of columns generation nuclear, generation fossil hard coal:")
nrgDF_filled.select("generation nuclear","generation fossil hard coal").summary().show()

print("Checking for nulls on columns generation nuclear, generation fossil hard coal:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation nuclear","generation fossil hard coal"]]).show()

print("Checking amount of distinct values in columns generation nuclear, generation fossil hard coal:")
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["generation nuclear","generation fossil hard coal"]]).show()

# Oil and Gas
print ("Summary of columns generation fossil oil, generation fossil gas, generation fossil oil shale:")
nrgDF_filled.select("generation fossil oil","generation fossil gas","generation fossil oil shale").summary().show()

print("Checking for nulls on columns generation fossil oil, generation fossil gas, generation fossil oil shale:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["generation fossil oil","generation fossil gas","generation fossil oil shale"]]).show()

print("Checking amount of distinct values in columns generation fossil oil, generation fossil gas, generation fossil oil shale:")
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["generation fossil oil","generation fossil gas", "generation fossil oil shale"]]).show()

Summary of columns generation biomass, generation geothermal, generation waste:
+-------+------------------+---------------------+-----------------+
|summary|generation biomass|generation geothermal| generation waste|
+-------+------------------+---------------------+-----------------+
|  count|            178436|               178436|           178436|
|   mean|383.01865654912683|                  0.0|269.7211829451456|
| stddev|  85.2495653368818|                  0.0|50.10959354584978|
|    min|                 0|                    0|                0|
|    25%|               333|                    0|              241|
|    50%|               366|                    0|              280|
|    75%|               429|                    0|              310|
|    max|               592|                    0|              357|
+-------+------------------+---------------------+-----------------+

Checking for nulls in generation biomass, generation geothermal, generation waste:
+-------

To start, let's look at the renewable thermal energies, which have quite similar associated costs given that they use similar machinery, such as steam turbines, to generate power. Keep in mind that the average total load in Spain over these 4 years was 28,704.82 MW (based on the next profiling). Biomass generation averaged around 383 MW (a bit over 1% of the average load), peaking at 592 MW. Geothermal generation was 0, which is surprising given the potential of this technology. Waste-to-power was a bit below biomass, averaging 270 MW. We will see how these have evolved in the insights.

Now, let's take a look at solar and wind. Because solar peaks in the middle of the day and generates nothing at night, its peak is much higher than its average: roughly 4 times. Even so, solar has taken some market share, averaging around 1,500 MW (5% of average load) and peaking at 5,792 MW (20%), which means this technology, paired with other naturally synergetic technologies like wind, biomass, and geothermal, can make a great impact in reducing our footprint. Onshore wind generation averaged around 5,472 MW (about 19% of average load) and peaked at 17,436 MW (60% of average load), which tells us that wind can cover over half of our demand at very windy moments in Spain. There was still no offshore wind in Spain during these years. For these two technologies, storing the energy in electric vehicles could make an impact, provided that most vehicles become electric and the buffer is big enough.

Combining the two hydro sources tells us that Spain generates around 3,600 MW from hydro (12.5% of average load), which is a significant amount. With a peak of around 11,728 MW (40.85% of average load), this is further evidence that a synergetic feed of renewables in Spain helps meet demand. In total, during these 4 years the renewables mentioned above supplied around 38% of the average demand in Spain.

Let's now look at nuclear and fossil fuels. Nuclear supplied on average 6,263 MW (22% of average load) with a low standard deviation, since nuclear plants run non-stop. Coal averaged around 4,243 MW, about 15% of average load, but peaked at 8,359 MW (almost 30% of average load), probably during times of low wind and low water.

Oil and gas together account for roughly 20% of the national load, at around 6,000 MW on average. However, during this period gas peaked at over 20,000 MW, which is around 70% of the average national load. It is well known in the industry that these oil and gas peakers are the ones that keep the lights on at night in times of high power demand, low wind, and low water. It is also well known that these power plants are off most of the time but are paid handsomely purely for being available to go online, or to increase their output, at any given moment in response to the national grid's demand. From this observation, we can build a query to show prices at the times when the gas peakers are at their maximum and prices at the moments when solar and wind (the new cheapest technologies) are at their maximum.
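
The percentages quoted in this discussion are each source's average generation divided by the average total load. As a quick sanity check, the sketch below recomputes them from the rounded averages quoted in the text (so results are approximate; the dictionary keys and grouping are illustrative labels, not column names):

```python
avg_load_mw = 28705  # average total load, rounded from the profiling

# Rounded average generation per source in MW, as quoted in the text.
avg_generation_mw = {
    "biomass": 383,
    "waste": 270,
    "solar": 1500,
    "wind onshore": 5472,
    "hydro (both)": 3600,
    "nuclear": 6263,
    "hard coal": 4243,
    "oil and gas": 6000,
}

# Share of the average load covered by each source, in percent.
shares = {src: 100 * mw / avg_load_mw for src, mw in avg_generation_mw.items()}
for src, pct in shares.items():
    print(f"{src:>14}: {pct:5.1f}%")

# Combined share of the renewable sources discussed above.
renewables = ["biomass", "waste", "solar", "wind onshore", "hydro (both)"]
renewable_share = sum(shares[src] for src in renewables)
print(f"{'renewables':>14}: {renewable_share:5.1f}%")
```

Because the inputs are rounded, the combined renewable share comes out within about a percentage point of the figure stated above.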

### Prices/Load related columns basic profiling

In [21]:
print ("Summary of columns Date,  total load actual, price actual:")
nrgDF_filled.select("Date","total load actual","price actual").summary().show()

print("Checking for nulls in Date, total load actual, price actual:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["Date", "total load actual","price actual"]]).show()

print("Checking amount of distinct values in columns Date, total load actual, price actual:") 
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["Date", "total load actual","price actual"]]).show()

Summary of columns Date,  total load actual, price actual:
+-------+------------------+------------------+
|summary| total load actual|      price actual|
+-------+------------------+------------------+
|  count|            178256|            178436|
|   mean|28704.188403195403| 57.92262211661325|
| stddev| 4580.299460055086|14.208869074386032|
|    min|             18041|              9.33|
|    25%|             24807|             49.38|
|    50%|             28914|             58.06|
|    75%|             32202|             68.05|
|    max|             41015|             116.8|
+-------+------------------+------------------+

Checking for nulls in Date, total load actual, price actual:
+----+-----------------+------------+
|Date|total load actual|price actual|
+----+-----------------+------------+
|   0|              180|           0|
+----+-----------------+------------+

Checking amount of distinct values in columns Date, total load actual, price actual:
+-----+-----------------+--

There are 180 nulls in the total load column, around 0.1% of observations, so we can safely forward fill these values.
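
The share of missing values follows directly from the null count and the row count reported above. A minimal check (`null_share_pct` is a hypothetical helper name):

```python
def null_share_pct(n_null, n_rows):
    """Return the percentage of rows in which a column is null."""
    return 100 * n_null / n_rows

total_rows = 178436   # row count of the joined DataFrame
load_nulls = 180      # nulls reported for total load actual

load_null_pct = null_share_pct(load_nulls, total_rows)
print(f"{load_null_pct:.2f}% of rows have no total load value")
```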

In [22]:
filled_load = last(nrgDF['total load actual'], ignorenulls=True).over(window)
nrgDF_filled = nrgDF_filled.withColumn('total load actual', filled_load)

In [30]:
print ("Summary of columns Date,  total load actual, price actual:")
nrgDF_filled.select("Date","total load actual","price actual").summary().show()

print("Checking for nulls in Date, total load actual, price actual:")
nrgDF_filled.select([count(when(col(c).isNull(), c)).alias(c) for c in ["Date", "total load actual","price actual"]]).show()

print("Checking amount of distinct values in columns Date, total load actual, price actual:")
nrgDF_filled.select([countDistinct(c).alias(c) for c in ["Date", "total load actual","price actual"]]).show()

Summary of columns Date,  total load actual, price actual:
+-------+------------------+------------------+
|summary| total load actual|      price actual|
+-------+------------------+------------------+
|  count|            178436|            178436|
|   mean|28704.823499742204|57.922622116613624|
| stddev|4581.1880664151295|14.208869074386023|
|    min|             18041|              9.33|
|    25%|             24806|             49.39|
|    50%|             28914|             58.06|
|    75%|             32204|             68.05|
|    max|             41015|             116.8|
+-------+------------------+------------------+

Checking for nulls in Date, total load actual, price actual:
+----+-----------------+------------+
|Date|total load actual|price actual|
+----+-----------------+------------+
|   0|                0|           0|
+----+-----------------+------------+

Checking amount of distinct values in columns Date, total load actual, price actual:
+-----+-----------------+--

As a general rule, a higher load goes with a higher energy price: when more energy is needed, more (and more expensive) power plants must come online to meet that demand. In the studied period the total load averaged around 29,000 MW, with extremes of 18,000 MW and 41,000 MW, and sat between 25,000 MW and 32,000 MW half of the time (the interquartile range). Prices are measured in €/MWh; because the observations are hourly, an average load of 1 MW sustained over one hour corresponds to 1 MWh of energy. The average price from generator to distributor was around €58/MWh, or 5.8 cents/kWh, and half of the time a megawatt-hour cost between roughly €50 and €68. At the extremes, the price ranged from only €9.33 up to €116.80. My hypothesis is that the price was €9.33 when renewable output was at or near its highest, while the peak of €116.80 occurred when gas power plants were at their highest output. Analyzing price against generation sources (supply side) and weather conditions (demand side) will be the focus of our business questions.
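The unit conversion behind the figures above can be checked with a few lines of Python. This is a small sketch: `eur_per_mwh_to_cents_per_kwh` is a hypothetical helper name, and the prices and quartiles are rounded values read from the summary table printed earlier.

```python
def eur_per_mwh_to_cents_per_kwh(price_eur_per_mwh: float) -> float:
    """Convert a wholesale price from EUR/MWh to euro cents per kWh."""
    cents_per_mwh = price_eur_per_mwh * 100  # 1 EUR = 100 cents
    return cents_per_mwh / 1000              # 1 MWh = 1000 kWh


# Mean, quartiles and extremes of "price actual" from the summary above.
price_summary = [("mean", 57.92), ("25%", 49.39), ("75%", 68.05),
                 ("min", 9.33), ("max", 116.8)]
for label, price in price_summary:
    print(label, eur_per_mwh_to_cents_per_kwh(price), "cents/kWh")

# Interquartile range of the hourly load (MW), also from the summary above.
load_iqr_mw = 32204 - 24806
print("load IQR (MW):", load_iqr_mw)
```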

## Relationship between generation source and weather on Energy Supply and Demand 

### Summary for guidance:

**Weather features (Demand)**: Temperature, Pressure, Humidity, Rain (1h), Snow, Clouds

**Generation (Supply)**: Geothermal, Biomass, Waste, Solar, Wind Onshore, Wind Offshore, Hydro run of river, Hydro reservoir, Nuclear, Coal, Oil, Shale oil, Gas

**Markets (Target)**: Total Load, Price

### How do weather conditions in the 5 largest cities affect the total load and energy price

In [32]:
from pyspark.sql.functions import max, min, avg, stddev, count, mean, when, col
# Group by price level to check how load and weather relate to price
# First, let's check our hypothesis that lower load means lower prices
print("Load vs Energy Prices")
load_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
load_vs_price.cache() 
load_vs_price.select("PriceLevel", "price actual", "total load actual")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("total load actual")).alias("Load")) \
             .orderBy("PriceLevel").show()

# temperature affects price?
print("Temperatures vs Energy Prices")
temp_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
temp_vs_price.cache() 
temp_vs_price.select("PriceLevel", "price actual", "Temperature")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("Temperature")).alias("Temperature (ºC)"), \
                 (min("Temperature")).alias("Min Temperature"), \
                 (max("Temperature")).alias("Max Temperature")) \
             .orderBy("PriceLevel").show()

# rain affects price? 
print("Rain vs Energy Prices")
rain_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
rain_vs_price.cache() 
rain_vs_price.select("PriceLevel", "price actual", "rain_1h")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("rain_1h")).alias("Rain (mm)"), \
                 (max("rain_1h")).alias("Max Rain")) \
             .orderBy("PriceLevel").show()

# humidity affects price? 
print("Humidity vs Energy Prices")
humidity_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
humidity_vs_price.cache() 
humidity_vs_price.select("PriceLevel", "price actual", "humidity")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("humidity")).alias("Humidity (%)"), \
                 (min("humidity")).alias("Min Humidity"), \
                 (max("humidity")).alias("Max Humidity")) \
             .orderBy("PriceLevel").show()

# clouds affects price?
print("Clouds vs Energy Prices")
clouds_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
clouds_vs_price.cache() 
clouds_vs_price.select("PriceLevel", "price actual", "clouds_all")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("clouds_all")).alias("Clouds (%)")) \
             .orderBy("PriceLevel").show()

# snow affects price? 
print("Snow vs Energy Prices")
snow_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
snow_vs_price.cache() 
snow_vs_price.select("PriceLevel", "price actual", "snow_3h")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("snow_3h")).alias("Snow (mm)"), \
                 (max("snow_3h")).alias("Max Snow")) \
             .orderBy("PriceLevel").show()

Load vs Energy Prices
+----------+-----------------+------------------+
|PriceLevel|Instances (hours)|              Load|
+----------+-----------------+------------------+
| 1.highest|             7958| 33578.39859261121|
|  2.higher|            28368|31386.370699379582|
|    3.high|            42714|29571.527438310623|
|  4.normal|            52143| 27927.29706767927|
|     5.low|            30447|26562.506716589483|
|   6.lower|             9724|26426.894487865076|
|  7.lowest|             7082|25322.421349901157|
+----------+-----------------+------------------+

Temperatures vs Energy Prices
+----------+-----------------+------------------+-------------------+------------------+
|PriceLevel|Instances (hours)|  Temperature (ºC)|    Min Temperature|   Max Temperature|
+----------+-----------------+------------------+-------------------+------------------+
| 1.highest|             7958|15.569963798944464| -6.375999999999976| 41.18000000000001|
|  2.higher|            28368|17.11853115

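To start off, we can see that, as hypothesized, the lower the load, the lower the price, in line with the reasoning in the previous section: the mean load falls steadily from about 33,600 MW in the highest price band to about 25,300 MW in the lowest.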
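As a quick sanity check on the "Load vs Energy Prices" table, the per-band counts and means can be recombined in plain Python. The sketch below copies the figures from the printed output above; the variable names are illustrative only. The hours should add up to the row count of the joined dataset, and the hour-weighted mean load should reproduce the overall mean from the summary.

```python
# (hours, mean load in MW) per price level, copied from "Load vs Energy Prices".
load_by_level = {
    "1.highest": (7958, 33578.39859261121),
    "2.higher": (28368, 31386.370699379582),
    "3.high": (42714, 29571.527438310623),
    "4.normal": (52143, 27927.29706767927),
    "5.low": (30447, 26562.506716589483),
    "6.lower": (9724, 26426.894487865076),
    "7.lowest": (7082, 25322.421349901157),
}

hours_total = 0
load_hours_total = 0.0
for hours, mean_load in load_by_level.values():
    hours_total += hours
    load_hours_total += hours * mean_load

# Hour-weighted mean over the bands; should match the overall mean load.
weighted_mean_load = load_hours_total / hours_total

# True if the mean load falls at every step from the highest band to the lowest.
mean_loads = [mean_load for _, mean_load in load_by_level.values()]
is_decreasing = all(a > b for a, b in zip(mean_loads, mean_loads[1:]))

print(hours_total, weighted_mean_load, is_decreasing)
```

The drop between the "5.low" and "6.lower" bands is much smaller than the others, so the relationship is monotonic but not evenly spaced.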

Now for the temperature: it is difficult to find a pattern in the mean temperatures. However, the minimum and maximum temperatures suggest that Spain has lower energy prices when neither the cold nor the heat is as severe. Extreme temperatures are therefore features that could add predictive value when estimating future energy prices.

We can also see that as rain increases, energy prices decrease. This has no obvious demand-side logic, as you would expect people to stay inside and consume more electricity when it rains, driving prices up. The clearer explanation is on the supply side: as rain increases, more hydroelectricity is produced, and hydro is one of the cheapest power sources because its "fuel" is free and it is the variable cost that sets the price a generation source offers when it comes online.

Humidity and clouds are other features in which it is hard to see a trend, but they are clearly correlated with rain. In the same sense, as rain, humidity, and clouds increase, prices should decrease because more hydroelectric power can come online. In a machine learning model, these three features would most probably load onto the same principal component if a PCA were performed for dimensionality reduction. Finally, snow has no observable trend in the mean or in the maximum (the minimum is 0) and therefore would probably not be useful for predicting energy prices.

To sum up the weather vs. energy prices section, we obtained insights from temperature extremes on the demand side and from rain, humidity, and clouds on the supply side.
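The seven-band price bucketing is repeated verbatim in each block of the cell above. One way to define it once is sketched below, assuming the same thresholds and labels. `PRICE_BANDS`, `price_level` and `price_level_column` are hypothetical names introduced for this illustration and are not part of the original notebook. The PySpark import is deferred into the function so the plain-Python part can be tested without Spark.

```python
# Inclusive upper bounds (EUR/MWh) and labels, in the order used by the cells above.
PRICE_BANDS = [
    (30.0, "7.lowest"),
    (40.0, "6.lower"),
    (50.0, "5.low"),
    (60.0, "4.normal"),
    (70.0, "3.high"),
    (80.0, "2.higher"),
]
TOP_BAND = "1.highest"


def price_level(price: float) -> str:
    """Plain-Python version of the banding, handy for unit tests."""
    for upper, label in PRICE_BANDS:
        if price <= upper:
            return label
    return TOP_BAND


def price_level_column(price_col: str = "price actual"):
    """Build the equivalent PySpark Column expression (pyspark imported lazily)."""
    from pyspark.sql.functions import col, when

    first_upper, first_label = PRICE_BANDS[0]
    expr = when(col(price_col) <= first_upper, first_label)
    for upper, label in PRICE_BANDS[1:]:
        # Branches are evaluated in order, so an explicit lower bound is not needed.
        expr = expr.when(col(price_col) <= upper, label)
    return expr.otherwise(TOP_BAND)
```

With this in place, a single cached DataFrame such as `nrgDF_filled.withColumn("PriceLevel", price_level_column())` could feed every `groupBy("PriceLevel")` aggregation, instead of rebuilding and caching the same column six times.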

### How did these generation sources grow or shrink between 2015 and 2018

In [26]:
from pyspark.sql.functions import col, mean, when, year

# By year to check evolution

# Solar and Wind
print("Solar & Wind")
swvpri = nrgDF_filled\
   .withColumn("Year", when((year("Date")==2014),"'14")\
                          .when((year("Date")==2015),"'15")\
                           .when((year("Date")==2016),"'16")\
                            .when((year("Date")==2017),"'17")\
                            .otherwise("'18"))
swvpri.cache()
swvpri.select("Year", "generation solar", "generation wind onshore","wind_speed","clouds_all")\
             .groupBy("Year")\
             .agg(mean("generation solar").alias("Solar"), \
                 (mean("generation wind onshore")).alias("Wind"), \
                 (mean("wind_speed")).alias("Wind (m/s)"),\
                 (mean("clouds_all")).alias("Clouds (%)"))\
             .orderBy("Year").show()

# Thermal Renewables 
print("Thermal Renewables")
the_ren = nrgDF_filled\
   .withColumn("Year", when((year("Date")==2014),"'14")\
                          .when((year("Date")==2015),"'15")\
                           .when((year("Date")==2016),"'16")\
                            .when((year("Date")==2017),"'17")\
                            .otherwise("'18"))
the_ren.cache()
the_ren.select("Year", "generation biomass", "generation waste")\
             .groupBy("Year")\
             .agg(mean("generation biomass").alias("Biomass"), \
                 (mean("generation waste")).alias("Waste")) \
             .orderBy("Year").show()

# Hydro
print("Hydroelectric")
Hydro = nrgDF_filled\
   .withColumn("Year", when((year("Date")==2014),"'14")\
                          .when((year("Date")==2015),"'15")\
                           .when((year("Date")==2016),"'16")\
                            .when((year("Date")==2017),"'17")\
                            .otherwise("'18"))
Hydro.cache()
Hydro.select("Year", "generation hydro run-of-river and poundage", "generation hydro water reservoir", "rain_1h")\
             .groupBy("Year")\
             .agg(mean("generation hydro run-of-river and poundage").alias("Hydro River"), \
                 (mean("generation hydro water reservoir")).alias("Hydro Reservoir"), \
                 (mean("rain_1h")).alias("Rain (mm)")) \
             .orderBy("Year").show()

# Nuclear and Coal
print("Nuclear and Coal")
Nuc_Co = nrgDF_filled\
   .withColumn("Year", when((year("Date")==2014),"'14")\
                          .when((year("Date")==2015),"'15")\
                           .when((year("Date")==2016),"'16")\
                            .when((year("Date")==2017),"'17")\
                            .otherwise("'18"))
Nuc_Co.cache()
Nuc_Co.select("Year", "generation nuclear", "generation fossil hard coal")\
             .groupBy("Year")\
             .agg(mean("generation nuclear").alias("Nuclear"), \
                 (mean("generation fossil hard coal")).alias("Coal")) \
             .orderBy("Year").show()
# Oil and Gas
print("Oil and Gas")
Oil_Gas = nrgDF_filled\
   .withColumn("Year", when((year("Date")==2014),"'14")\
                          .when((year("Date")==2015),"'15")\
                           .when((year("Date")==2016),"'16")\
                            .when((year("Date")==2017),"'17")\
                            .otherwise("'18"))
Oil_Gas.cache()
Oil_Gas.select("Year", "generation fossil gas", "generation fossil oil")\
             .groupBy("Year")\
             .agg(mean("generation fossil gas").alias("Gas"), \
                 (mean("generation fossil oil")).alias("Oil")) \
             .orderBy("Year").show()

Solar & Wind
+----+------------------+-----------------+------------------+-----------------+
|Year|             Solar|             Wind|        Wind (m/s)|       Clouds (%)|
+----+------------------+-----------------+------------------+-----------------+
| '15|1454.7794691470056|5472.854401088929|2.4875226860254083|25.59038112522686|
| '16|1398.0767833923687|5414.845379545352| 2.502331524906963|26.63282966417074|
| '17|1492.2660277224174|5393.578341581387|2.3052377006964195|21.57453478737936|
| '18|1368.8127647306678|5603.838378462289|2.5852127824717805|26.48530814095314|
+----+------------------+-----------------+------------------+-----------------+

Thermal Renewables
+----+------------------+------------------+
|Year|           Biomass|             Waste|
+----+------------------+------------------+
| '15| 491.4190335753176|223.16068511796732|
| '16|365.42882123481144| 258.3329148545039|
| '17|340.29903486575455|297.69039568264776|
| '18|   336.75989621449|298.80220876854503|
+---

In this section, we will analyze the average generation per year of the most important generation sources and how it is directly affected by weather conditions. To start off, solar power output is strongly affected by clouds, as can be seen in this table: the fewer clouds on average, the more power output from solar (see 2017). The clouds feature should therefore help predict solar output and, as we will see in the next section, the generation sources in turn help predict the final energy price. Wind is similarly affected by weather: the years with the highest and lowest average wind speeds (2018 and 2017) are also the years with the highest and lowest wind power output, although the ordering of 2015 and 2016 does not match exactly. Again, meteorological wind speed forecasts can help us predict the energy price.
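To put a rough number on these two relationships, the yearly means from the "Solar & Wind" table can be fed into a Pearson correlation. This is only a sketch: the values are rounded from the printed output, `pearson` is a hand-written helper, and with just four yearly points the coefficients indicate direction rather than statistical significance.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Yearly means for 2015-2018, rounded from the "Solar & Wind" table above.
solar_mw = [1454.78, 1398.08, 1492.27, 1368.81]
clouds_pct = [25.59, 26.63, 21.57, 26.49]
wind_mw = [5472.85, 5414.85, 5393.58, 5603.84]
wind_speed_ms = [2.4875, 2.5023, 2.3052, 2.5852]

r_clouds_solar = pearson(clouds_pct, solar_mw)    # expected to be negative
r_speed_wind = pearson(wind_speed_ms, wind_mw)    # expected to be positive
print(r_clouds_solar, r_speed_wind)
```

Running the same calculation on the hourly rows rather than on four yearly averages would give a far more reliable estimate.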

Shifting to the thermal renewables (biomass and waste), burning agricultural residues such as sugar cane bagasse or orange peels has decreased in popularity over the years in Spain, with the sharpest drop between 2015 and 2016 and a much flatter decline afterwards. It would be interesting to find out why, but it can be a good thing, as these residues can be used as fertilizer to grow more food. Burning waste, on the other hand, has increased markedly, which suggests that waste separation efforts have been successful in the past few years, since separation is the first step to generating power from refuse.

Observing hydroelectric behavior in the past years, we cannot see much of a trend, as there seems to be very little correlation between hydropower output and rain. However, in the year with the lowest rainfall, 2017, hydropower output was the lowest of the four years. The increase in hydro run-of-river may reflect the growing popularity of this technology, which reached its highest output of the four years in 2018 even with low rainfall. Given that Spain does not receive much rainfall, it makes sense to shift toward wind and solar in order to rely less on polluting traditional power sources.

Nuclear is a different game from the power sources mentioned above, because once a nuclear power station is on, only programmed maintenance or a major problem will switch it off, and refueling is planned well in advance so that the supply of fuel rods never runs out. Looking at the power output, it has remained stable throughout the years. In 2018, however, one of two things seems to be happening: either reliance on nuclear decreased due to newly available renewable power, or some nuclear power stations were scheduled for decommissioning and dismantling. In either case, new power stations will need to be built, given that most nuclear stations were commissioned in the 70s and 80s and are reaching the end of their design lives, while energy demand is increasing substantially every year.

Analyzing fossil fuel power sources, we can see similar trends, especially in oil and coal. Coal is the oldest and most polluting power source, and Spain seems to have been doing a great job at significantly reducing its dependence on it: coal power output fell by almost 2,000 MW in only 4 years. Oil was also decreasing, except in 2017, when hydropower output was very low (the same happened with coal). It seems Spain does not rely much on oil-fueled engines, probably because most of its power generation engines have been converted to gas, a cleaner fuel. Oil is very dirty and produces a by-product (sludge) that is difficult and expensive to handle, and gas prices have been similar to or lower than those of oil, so with the technology available it is a no-brainer to switch from oil to gas. Finally, looking at gas-fired power output, we can observe the same trend as for oil and coal, but in the opposite direction. The transition from fossil fuels to renewables has had a step in between, and that step is gas. Gas-fired generation emits roughly half the CO2 of coal per unit of electricity, with oil sitting in between. Spain seems to be headed in the right direction in terms of energy generation sources since, as we will see in the next section, more renewables and less reliance on expensive and volatile fossil fuels mean lower prices for the end consumer.

### How do generation sources affect final energy prices

In [28]:
# Solar and Wind
print("Solar & Wind vs Energy Prices")
solwind_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
solwind_vs_price.cache() 
solwind_vs_price.select("PriceLevel", "price actual", "generation solar", "generation wind onshore")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("generation solar")).alias("Solar MW"), \
                 (mean("generation wind onshore")).alias("Wind MW")) \
             .orderBy("PriceLevel").show()

# Thermal Renewables 
print("Thermal Renewables vs Energy Prices")
thermren_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
thermren_vs_price.cache() 
thermren_vs_price.select("PriceLevel", "price actual", "generation biomass", "generation waste")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("generation biomass")).alias("Bio MW"), \
                 (mean("generation waste")).alias("Waste MW")) \
             .orderBy("PriceLevel").show()

# Hydro
print("Hydroelectric vs Energy Prices")
hydro_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
hydro_vs_price.cache() 
hydro_vs_price.select("PriceLevel", "price actual", "generation hydro water reservoir", "generation hydro run-of-river and poundage")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("generation hydro water reservoir")).alias("Reservoir MW"), \
                 (mean("generation hydro run-of-river and poundage")).alias("River MW")) \
             .orderBy("PriceLevel").show()

# Nuclear and Coal
print("Nuclear & Coal vs Energy Prices")
nuc_coal_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
nuc_coal_vs_price.cache() 
nuc_coal_vs_price.select("PriceLevel", "price actual", "generation nuclear", "generation fossil hard coal")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("generation nuclear")).alias("Nuclear MW"), \
                 (mean("generation fossil hard coal")).alias("Coal MW")) \
             .orderBy("PriceLevel").show()

# Oil and Gas
print("Oil & Gas vs Energy Prices")
oilgas_vs_price = nrgDF_filled\
   .withColumn("PriceLevel", when(col("price actual")<=30.00,"7.lowest")\
                            .when((col("price actual")>30.00) & (col("price actual")<=40.00),"6.lower")\
                            .when((col("price actual")>40.00) & (col("price actual")<=50.00),"5.low")\
                            .when((col("price actual")>50.00) & (col("price actual")<=60.00),"4.normal")\
                            .when((col("price actual")>60.00) & (col("price actual")<=70.00),"3.high")\
                            .when((col("price actual")>70.00) & (col("price actual")<=80.00),"2.higher")\
                            .otherwise("1.highest"))
oilgas_vs_price.cache() 
oilgas_vs_price.select("PriceLevel", "price actual", "generation fossil oil", "generation fossil gas")\
             .groupBy("PriceLevel")\
             .agg(count("PriceLevel").alias("Instances (hours)"), \
                 (mean("generation fossil oil")).alias("Oil MW"), \
                 (mean("generation fossil gas")).alias("Gas MW")) \
             .orderBy("PriceLevel").show()


Solar&Wind vs Energy Prices
+----------+-----------------+------------------+-----------------+
|PriceLevel|Instances (hours)|          Solar MW|          Wind MW|
+----------+-----------------+------------------+-----------------+
| 1.highest|             7958| 1673.106685096758|4939.565091731591|
|  2.higher|            28368|1638.1563381274675|4618.570008460237|
|    3.high|            42714|1571.1705061572318|4964.967809149225|
|  4.normal|            52143| 1396.459313810099| 5517.32807855321|
|     5.low|            30447|1181.4830360955102|6206.177423063027|
|   6.lower|             9724| 1114.763986013986|6483.796894282188|
|  7.lowest|             7082|1175.9889861621011|7657.819260096018|
+----------+-----------------+------------------+-----------------+

Thermal Renewables vs Energy Prices
+----------+-----------------+------------------+------------------+
|PriceLevel|Instances (hours)|            Bio MW|          Waste MW|
+----------+-----------------+------------------+

In this section, we will analyze the effects of the most important generation sources on the final energy price. To start off, solar and wind are two relatively new technologies whose installation costs have fallen dramatically in the past 20 years, and since their "fuel" is free there is almost no marginal or variable cost, meaning that their cost is lower and more stable than that of their fossil fuel counterparts. For solar, the higher the power output, the higher the price, which goes against the hypothesis that more renewables mean lower prices. Upon research, one possible explanation is the famous "impuestos contra el sol" (the so-called sun tax), which has had negative connotations among the Spanish public: the tax was imposed on solar production, so the more solar energy online, the higher the prices. Solar output also peaks during the day, when load tends to be higher, which may contribute to the pattern. The solar tax was eliminated at the end of 2018 and beginning of 2019, so data from 2019 onward might show a clearer picture. Wind power output, on the other hand, behaves as expected and hypothesized: the more production, the lower the price. Wind is naturally more intermittent and unpredictable than solar, but has the advantage of being able to produce at any time of the day.
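The direction of both trends can be summarized with a Spearman rank correlation between the price band and the mean output in that band. The sketch below uses values rounded from the "Solar&Wind vs Energy Prices" output above. `ranks` and `spearman` are hand-written helpers that assume there are no ties, which holds for these seven rows.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Rows ordered from "1.highest" to "7.lowest", so 7 marks the most expensive band.
price_rank = [7, 6, 5, 4, 3, 2, 1]
solar_mw = [1673.11, 1638.16, 1571.17, 1396.46, 1181.48, 1114.76, 1175.99]
wind_mw = [4939.57, 4618.57, 4964.97, 5517.33, 6206.18, 6483.80, 7657.82]

rho_solar = spearman(price_rank, solar_mw)  # positive: more solar in pricier bands
rho_wind = spearman(price_rank, wind_mw)    # negative: more wind in cheaper bands
print(rho_solar, rho_wind)
```

Both coefficients are close to ±1, which matches the reading of the table: the solar and wind columns move in opposite directions across the price bands, with only one pair of adjacent bands out of order in each column.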

Biomass and waste-to-power output also contradict my hypothesis that more renewables mean lower prices. A direct positive correlation can be observed between biomass and waste power output and the energy price: the higher the production, the higher the price. On reflection, the marginal cost (fuel plus operations and maintenance) of these plants is probably higher than that of most other generation sources, likely because of the cost of handling and transporting the fuel (waste). One possible insight is that it is costly to buy waste for burning in Spain because of the value placed on it, since in most developed countries people throw away things that are in good condition or could serve better purposes than being burned for electricity.

Analyzing the hydropower outputs and their relationship with energy prices, we cannot deduce much of a trend between reservoir hydroelectric and the price, as high production coincides with both the lowest and the highest energy prices, while prices seem to stabilize around an average of €45/MWh when its production is lowest. It is difficult to infer much from reservoir hydro, since a lot of the water there is used for drinking and irrigation. River hydroelectric does show the trend that the more power output, the lower the price, supporting our hypothesis that close-to-zero marginal costs mean lower overall electricity prices.

As speculated, nuclear power output has no specific trend with energy prices, as it sits roughly in the middle of the merit order, meaning that its marginal costs are lower than those of fossil fuels but obviously higher than those of renewables (with the exception of the solar taxes).
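The merit-order idea referred to above can be sketched in a few lines: offers are sorted by marginal cost, plants are dispatched until demand is met, and the last plant dispatched sets the price. The capacities and costs below are hypothetical round numbers chosen for illustration. They are not taken from the dataset or from the Spanish market, and `clearing_price` is an illustrative helper, not a model of the real auction.

```python
def clearing_price(offers, demand_mw):
    """Return (price, dispatch) for a simplified merit-order auction.

    offers: list of (source, capacity_mw, marginal_cost_eur_per_mwh).
    """
    dispatch = {}
    remaining = demand_mw
    price = None
    # Cheapest marginal cost first; ties keep their original order.
    for source, capacity, cost in sorted(offers, key=lambda offer: offer[2]):
        if remaining <= 0:
            break
        used = capacity if capacity < remaining else remaining
        dispatch[source] = used
        remaining -= used
        price = cost  # the last unit dispatched sets the price for everyone
    if remaining > 0:
        raise ValueError("demand exceeds total offered capacity")
    return price, dispatch


# Hypothetical offers, NOT taken from the dataset.
offers = [
    ("wind", 6000, 0.0),
    ("solar", 1500, 0.0),
    ("nuclear", 7000, 20.0),
    ("coal", 5000, 55.0),
    ("gas", 8000, 70.0),
]

low_price, low_dispatch = clearing_price(offers, demand_mw=14000)
high_price, high_dispatch = clearing_price(offers, demand_mw=22000)
print(low_price, high_price, high_dispatch)
```

In this toy setup, the lower demand is covered by renewables and nuclear, so nuclear sets the price. The higher demand pulls coal and gas online, and gas sets a much higher price. That is the mechanism by which more fossil-fuel-fired output goes hand in hand with higher prices.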

Now, moving on to the most interesting part: fossil fuels. The three most important fossil fuel power sources, coal, oil and gas, all have high and volatile marginal costs (both fuel costs and operations and maintenance costs). Their behavior supports the most important part of my hypothesis: more fossil-fuel-fired power output means higher prices. The difference between €90/MWh and €30/MWh corresponds to the difference between 14,000 MW and 6,000 MW of fossil fuel power output. Nevertheless, who is going to be there to match the peaks in demand, if not gas and coal peakers? The future needs energy storage, and it needs it fast. If Spain and Europe can get electric vehicles on the road and have them help manage the grid's load when parked, and bring online other large energy storage systems such as pumped hydroelectric power, they will be able to retire polluting and expensive power sources such as oil, coal, and gas. The most important insight here is that the more fossil fuel we burn, the more we pay for electricity.