### Import necessary packages
---
We use numpy and pandas for data handling, and matplotlib and seaborn for plotting.

In [246]:
import datetime as dt
import os
import pathlib
import warnings

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style("darkgrid")
warnings.filterwarnings("ignore")


In [247]:
getCurrentDirectory = os.getcwd()
getDataDirectory = pathlib.Path(getCurrentDirectory).parents[1]
display(getDataDirectory)


WindowsPath('d:/DipanjanDocuments/Education/University of Stavanger/Autumn - 2022/DAT540 - Introduction to Data Science/Project')

### Data Path
---
Relative paths to the four CSV files, keyed by plant and data type (`G` for generation, `W` for weather sensor).

In [248]:
plantData = {
    "Plant1G": "../Data/Plant_1_Generation_Data.csv",
    "Plant1W": "../Data/Plant_1_Weather_Sensor_Data.csv",
    "Plant2G": "../Data/Plant_2_Generation_Data.csv",
    "Plant2W": "../Data/Plant_2_Weather_Sensor_Data.csv",
}
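
The dictionary above relies on relative paths, so it only works when the notebook runs from the expected folder. As a minimal sketch (the helper names `resolve_data_paths` and `missing_files` are hypothetical, not part of the original notebook), the same entries could be resolved against an explicit base directory and checked before loading:

```python
from pathlib import Path


def resolve_data_paths(relative_paths, base_dir):
    """Turn relative path strings into absolute Path objects under base_dir."""
    base = Path(base_dir)
    return {key: (base / rel).resolve() for key, rel in relative_paths.items()}


def missing_files(paths):
    """Return the keys whose path does not point to an existing file."""
    return [key for key, path in paths.items() if not path.is_file()]


# Same keys and relative paths as the plantData dictionary above.
plant_paths = {
    "Plant1G": "../Data/Plant_1_Generation_Data.csv",
    "Plant1W": "../Data/Plant_1_Weather_Sensor_Data.csv",
    "Plant2G": "../Data/Plant_2_Generation_Data.csv",
    "Plant2W": "../Data/Plant_2_Weather_Sensor_Data.csv",
}

# Assumption: the notebook's working directory is the base for the paths.
resolved = resolve_data_paths(plant_paths, Path.cwd())
print("Missing files:", missing_files(resolved))
```

An empty list means all four files were found; any key that is printed points to a file that would make `pd.read_csv` fail later.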


### Load Data
---
Load the CSV data for plants 1 and 2 from local storage. The project uses four data files in total: each plant has one file of power generation readings and one of weather sensor readings.

In [249]:
plant1Genration = pd.read_csv(plantData["Plant1G"])
plant1Weather = pd.read_csv(plantData["Plant1W"])
plant2Genration = pd.read_csv(plantData["Plant2G"])
plant2Weather = pd.read_csv(plantData["Plant2W"])

dataDict1 = {
    "plant1": [plant1Genration, plant1Weather],
    "plant2": [plant2Genration, plant2Weather],
}

dataDict2 = {
    "generation": [plant1Genration, plant2Genration],
    "weather": [plant1Weather, plant2Weather],
}
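
The two dictionaries group the same four frames in different ways: by plant and by data type. One way such a grouping could be used is to loop over it and collect the shape of every frame. The sketch below is illustrative only: `summarize_frames` is a hypothetical helper and the tiny frames stand in for the real CSV data.

```python
import pandas as pd


def summarize_frames(frames_by_group):
    """Build one summary row (group, position, rows, columns) per DataFrame."""
    rows = []
    for group, frames in frames_by_group.items():
        for position, frame in enumerate(frames):
            rows.append(
                {
                    "group": group,
                    "position": position,
                    "rows": len(frame),
                    "columns": frame.shape[1],
                }
            )
    return pd.DataFrame(rows)


# Toy stand-ins for the generation and weather frames (not real data).
toy_generation = pd.DataFrame({"DC_POWER": [0.0, 1.0], "AC_POWER": [0.0, 0.9]})
toy_weather = pd.DataFrame({"IRRADIATION": [0.0, 0.3, 0.5]})

print(summarize_frames({"plant1": [toy_generation, toy_weather]}))
```

With the real data, passing `dataDict1` or `dataDict2` would give a quick overview of how many rows and columns each file contributes.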


### Data exploration and insights
---
We briefly explore the dataset here to see what information it contains and what we plan to do with it.

- Data sample of plant 1 power generation

In [250]:
display(plant1Genration.sample(5))


Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
44708,06-06-2020 13:15,4135001,McdE0feGgRqW7Ca,4146.285714,405.742857,4663.142857,7329344.143
24212,27-05-2020 09:30,4135001,McdE0feGgRqW7Ca,8906.142857,871.057143,1455.857143,7251015.857
25613,28-05-2020 02:00,4135001,ih0vzX44oOqAx2f,0.0,0.0,0.0,6278855.0
35829,02-06-2020 06:45,4135001,bvBOhCH3iADSZry,750.75,72.7875,29.375,6438674.375
68270,17-06-2020 18:00,4135001,zBIq5rxdHJRwDNY,594.0,57.528571,5799.0,6583351.0


- Data sample of plant 2 power generation

In [251]:
display(plant2Genration.sample(5))


Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
57384,2020-06-13 02:45:00,4136001,IQ2d7wF4YD8zU1Q,0.0,0.0,4802.0,20158320.0
45638,2020-06-07 13:15:00,4136001,LlT2YUhhzqhg5Sw,1150.357143,1122.514286,5008.5,282740000.0
24990,2020-05-28 13:00:00,4136001,9kRcWv60rDACzjR,1254.126667,1222.786667,5624.933333,2247814000.0
51325,2020-06-10 05:45:00,4136001,q49J1IKaHRwDQnt,0.0,0.0,0.0,477638.0
17142,2020-05-24 00:00:00,4136001,9kRcWv60rDACzjR,0.0,0.0,1622.5,2247778000.0


- Data sample of plant 1 weather sensors

In [252]:
display(plant1Weather.sample(5))


Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
1402,2020-05-30 10:45:00,4135001,HmiyD2TTLFNqkNe,24.859465,32.482479,0.348692
1749,2020-06-03 01:30:00,4135001,HmiyD2TTLFNqkNe,22.883348,21.585738,0.0
1640,2020-06-01 22:15:00,4135001,HmiyD2TTLFNqkNe,23.055707,20.388511,0.0
597,2020-05-21 19:45:00,4135001,HmiyD2TTLFNqkNe,25.76652,23.48217,0.0
1784,2020-06-03 10:15:00,4135001,HmiyD2TTLFNqkNe,26.170805,36.674451,0.415408


- Data sample of plant 2 weather sensors

In [253]:
display(plant2Weather.sample(5))


Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
3176,2020-06-17 03:15:00,4136001,iq8k7ZNt4Mwm3w0,22.963709,22.606509,0.0
1571,2020-05-31 09:30:00,4136001,iq8k7ZNt4Mwm3w0,30.568175,47.623211,0.758124
2977,2020-06-15 01:30:00,4136001,iq8k7ZNt4Mwm3w0,24.537739,24.403716,0.0
2748,2020-06-12 16:15:00,4136001,iq8k7ZNt4Mwm3w0,27.81586,32.435735,0.157751
2736,2020-06-12 13:15:00,4136001,iq8k7ZNt4Mwm3w0,25.594031,34.25254,0.436703


Next, we check for missing values in both the power generation and weather sensor data.
- Missing values in power generation

In [254]:
plant1GTemp = plant1Genration.isnull().sum()
plant2GTemp = plant2Genration.isnull().sum()
result = pd.concat([plant1GTemp, plant2GTemp], axis=1, join="inner")
result.columns = ["Plant 1", "Plant 2"]
display(result.transpose())


Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
Plant 1,0,0,0,0,0,0,0
Plant 2,0,0,0,0,0,0,0


- Missing values in weather sensor

In [255]:
plant1WTemp = plant1Weather.isnull().sum()
plant2WTemp = plant2Weather.isnull().sum()
result = pd.concat([plant1WTemp, plant2WTemp], axis=1, join="inner")
result.columns = ["Plant 1", "Plant 2"]
display(result.transpose())


Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
Plant 1,0,0,0,0,0,0
Plant 2,0,0,0,0,0,0


_As observed, neither power plant has any missing values._
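
The cells above only count nulls. A possible extension, sketched here under the assumption that negative and infinite values are also of interest, counts all three per frame. `quality_report` is a hypothetical helper, and the demo frame is made up to trigger each check:

```python
import numpy as np
import pandas as pd


def quality_report(frames):
    """Count nulls, negative values and infinities for each labelled frame."""
    rows = {}
    for label, frame in frames.items():
        # Only numeric columns can hold negative or infinite values.
        values = frame.select_dtypes(include="number").to_numpy(dtype=float)
        rows[label] = {
            "nulls": int(frame.isnull().sum().sum()),
            "negatives": int((values < 0).sum()),
            "infinities": int(np.isinf(values).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up frame with one null, one negative and one infinite value.
demo = pd.DataFrame(
    {
        "SOURCE_KEY": ["a", "b", "c", "d"],
        "DC_POWER": [1.0, -2.0, np.inf, None],
    }
)
print(quality_report({"demo": demo}))
```

A row of zeros for every frame would support the conclusion that the data needs no cleaning for these three problems.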

Now we check how many unique PLANT_ID values exist for each power plant.

In [256]:
print("Power generation:-")
print("Plant 1: ", len(plant1Genration["PLANT_ID"].unique()))
print("Plant 2: ", len(plant2Genration["PLANT_ID"].unique()))

print("Weather sensor:-")
print("Plant 1: ", len(plant1Weather["PLANT_ID"].unique()))
print("Plant 2: ", len(plant2Weather["PLANT_ID"].unique()))


Power generation:-
Plant 1:  1
Plant 2:  1
Weather sensor:-
Plant 1:  1
Plant 2:  1


_As expected, each file contains data from a single plant._ The PLANT_ID is 4135001 for plant 1 and 4136001 for plant 2.
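
Because each file holds a single PLANT_ID, the value can be read once and kept outside the frame. The following is a minimal sketch with a hypothetical helper, `single_value`, that returns the value and fails loudly if the column turns out to hold more than one:

```python
import pandas as pd


def single_value(frame, column):
    """Return the only distinct value in a column, or raise if there are several."""
    values = frame[column].unique()
    if len(values) != 1:
        raise ValueError(
            f"expected one distinct value in {column!r}, found {len(values)}"
        )
    return values[0]


# Toy frame using the plant 1 identifier reported above.
toy = pd.DataFrame({"PLANT_ID": [4135001, 4135001, 4135001]})
plant_id = single_value(toy, "PLANT_ID")
print(plant_id)
```

The explicit error is a design choice: if a later data delivery mixes plants in one file, the check stops the analysis instead of silently keeping the first identifier.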

Now we check how many inverters (distinct SOURCE_KEY values) the data contains and how many measurements there are per inverter.

In [257]:
print("Power generation:-")
print("Plant 1: ", len(plant1Genration["SOURCE_KEY"].unique()))
print("Plant 2: ", len(plant2Genration["SOURCE_KEY"].unique()))

print("Weather sensor:-")
print("Plant 1: ", len(plant1Weather["SOURCE_KEY"].unique()))
print("Plant 2: ", len(plant2Weather["SOURCE_KEY"].unique()))


Power generation:-
Plant 1:  22
Plant 2:  22
Weather sensor:-
Plant 1:  1
Plant 2:  1


- Checking the number of measurements per inverter for plant 1

In [258]:
print("\033[3;31m" + "Plant 1 generator:-")
display(plant1Genration.groupby("SOURCE_KEY").count())

print(
    "\033[1;4;30m"
    + "Number of measurements per inverter range from "
    + str(plant1Genration.SOURCE_KEY.value_counts().min())
    + " to "
    + str(plant1Genration.SOURCE_KEY.value_counts().max())
    + "\033[0m"
)

print("\n\033[3;34m" + "Plant 1 weather:-")
display(plant1Weather.groupby("SOURCE_KEY").count())


Plant 1 generator:-


SOURCE_KEY,DATE_TIME,PLANT_ID,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
1BY6WEcLGh8j5v7,3154,3154,3154,3154,3154,3154
1IF53ai7Xc0U56Y,3119,3119,3119,3119,3119,3119
3PZuoBAID5Wc2HD,3118,3118,3118,3118,3118,3118
7JYdWkrLSPkdwr4,3133,3133,3133,3133,3133,3133
McdE0feGgRqW7Ca,3124,3124,3124,3124,3124,3124
VHMLBKoKgIrUVDU,3133,3133,3133,3133,3133,3133
WRmjgnKYAwPKWDb,3118,3118,3118,3118,3118,3118
YxYtjZvoooNbGkE,3104,3104,3104,3104,3104,3104
ZnxXDlPa8U1GXgE,3130,3130,3130,3130,3130,3130
ZoEaEvLYb1n2sOq,3123,3123,3123,3123,3123,3123


Number of measurements per inverter range from 3104 to 3155

Plant 1 weather:-


SOURCE_KEY,DATE_TIME,PLANT_ID,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
HmiyD2TTLFNqkNe,3182,3182,3182,3182,3182


- Checking the number of measurements per inverter for plant 2

In [259]:
print("\033[3;31m" + "Plant 2 generator:-")
display(plant2Genration.groupby("SOURCE_KEY").count())

print(
    "\033[1;4;30m"
    + "Number of measurements per inverter range from "
    + str(plant2Genration.SOURCE_KEY.value_counts().min())
    + " to "
    + str(plant2Genration.SOURCE_KEY.value_counts().max())
    + "\033[0m"
)

print("\n\033[3;34m" + "Plant 2 weather:-")
display(plant2Weather.groupby("SOURCE_KEY").count())


Plant 2 generator:-


SOURCE_KEY,DATE_TIME,PLANT_ID,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
4UPUqMRk7TRMgml,3195,3195,3195,3195,3195,3195
81aHJ1q11NBPMrL,3259,3259,3259,3259,3259,3259
9kRcWv60rDACzjR,3259,3259,3259,3259,3259,3259
Et9kgGMDl729KT4,3195,3195,3195,3195,3195,3195
IQ2d7wF4YD8zU1Q,2355,2355,2355,2355,2355,2355
LYwnQax7tkwH5Cb,3259,3259,3259,3259,3259,3259
LlT2YUhhzqhg5Sw,3259,3259,3259,3259,3259,3259
Mx2yZCDsyf6DPfv,3195,3195,3195,3195,3195,3195
NgDl19wMapZy17u,2355,2355,2355,2355,2355,2355
PeE6FRyGXUgsRhN,3259,3259,3259,3259,3259,3259


Number of measurements per inverter range from 2355 to 3259

Plant 2 weather:-


SOURCE_KEY,DATE_TIME,PLANT_ID,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
iq8k7ZNt4Mwm3w0,3259,3259,3259,3259,3259


_As observed, the 22 inverters report different numbers of measurements, and this imbalance may cause issues for prediction models._ The data covers 34 days and each entry corresponds to a 15-minute measurement, so every inverter should have 34 days x 24 hours x 4 = 3264 rows. No inverter in either plant reaches that count, so some readings are missing for every inverter.
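
The expected row count can be turned into a per-inverter completeness check. This is a sketch rather than the notebook's own method: `completeness_by_inverter` is a hypothetical helper, and the toy frame only mimics the `SOURCE_KEY` column of the generation data.

```python
import pandas as pd

DAYS = 34
READINGS_PER_DAY = 24 * 4  # one reading every 15 minutes


def completeness_by_inverter(generation, days=DAYS):
    """Compare the observed row count per SOURCE_KEY with the expected count."""
    expected = days * READINGS_PER_DAY
    report = generation["SOURCE_KEY"].value_counts().rename("observed").to_frame()
    report["expected"] = expected
    report["missing"] = expected - report["observed"]
    report["coverage_pct"] = (100 * report["observed"] / expected).round(2)
    return report.sort_values("missing", ascending=False)


print("Expected rows per inverter:", DAYS * READINGS_PER_DAY)

# Toy data: inverter "a" has 3 readings, inverter "b" has 2, over one day.
toy = pd.DataFrame({"SOURCE_KEY": ["a", "a", "a", "b", "b"]})
print(completeness_by_inverter(toy, days=1))
```

Applied to the real data, the lowest plant 2 count reported above (2355 rows) would fall 909 rows short of the expected 3264.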



Summary of observations:

- (a) The data is very clean, with no null, negative or infinite values.
- (b) The column names are in uppercase. They will be changed to lowercase.
- (c) The DATE_TIME column is stored as text and will be converted to Timestamp.
- (d) DC_POWER and AC_POWER seem to have a scale problem in the plant 1 data: the two should be very similar, but DC_POWER appears to be about 10 times larger. The plant 2 samples above do not show this gap.
- (e) The PLANT_ID column holds a single value throughout each dataset, so it will be dropped and its value stored in an external variable to reduce the DataFrame's memory footprint.
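
The clean-up steps planned above (lowercase column names, a parsed DATE_TIME, and PLANT_ID moved out of the frame) could look roughly like the sketch below. The helper names `tidy_plant_frame` and `dc_ac_ratio` are hypothetical, and the date format is an assumption based on the plant 1 generation samples shown earlier (day first, no seconds); the other files appear to use `%Y-%m-%d %H:%M:%S`.

```python
import pandas as pd


def tidy_plant_frame(frame, date_format):
    """Lowercase the columns, parse date_time and move plant_id out of the frame."""
    tidy = frame.copy()
    tidy.columns = [name.lower() for name in tidy.columns]
    tidy["date_time"] = pd.to_datetime(tidy["date_time"], format=date_format)
    plant_id = tidy["plant_id"].iloc[0]  # assumes the column holds one value
    tidy = tidy.drop(columns="plant_id")
    return tidy, plant_id


def dc_ac_ratio(tidy):
    """Median DC/AC ratio over rows where the inverter is producing power."""
    producing = tidy[tidy["ac_power"] > 0]
    return (producing["dc_power"] / producing["ac_power"]).median()


# Two rows copied from the plant 1 generation sample above.
sample = pd.DataFrame(
    {
        "DATE_TIME": ["27-05-2020 09:30", "06-06-2020 13:15"],
        "PLANT_ID": [4135001, 4135001],
        "DC_POWER": [8906.142857, 4146.285714],
        "AC_POWER": [871.057143, 405.742857],
    }
)

tidy, plant_id = tidy_plant_frame(sample, date_format="%d-%m-%Y %H:%M")
print(tidy.dtypes)
print("Plant:", plant_id, "| median DC/AC ratio:", round(dc_ac_ratio(tidy), 2))
```

Passing the format explicitly avoids ambiguous dates such as `06-06-2020`, and the ratio gives a quick way to quantify the scale difference between DC_POWER and AC_POWER before deciding how to correct it.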

