## Introduction:
This notebook carries out a preliminary analysis of the provided PV data, combined with weather data downloaded from Meteostat.

In [1]:
import pandas as pd
import datetime
start_date = datetime.datetime(2022, 7, 1)
end_date = datetime.datetime(2022, 12, 31)

In [2]:
df_aet = pd.read_csv('../data/aet_air_quality_hourly_24June2022_18feb2023.csv', sep=',', header='infer').rename(columns={'timestamp':'datetime'})
df_aet['datetime'] = pd.to_datetime(df_aet['datetime'], format = '%Y-%m-%d %H:%M:%S')
df_aet = df_aet.set_index('datetime')
df_aet = df_aet.loc[start_date:end_date]

In [3]:
len(df_aet)

4393

In [4]:
df_weather = pd.read_csv('../data/weather_data_london.csv', sep=',', header='infer').rename(columns={'time':'datetime'})
df_weather['datetime'] =pd.to_datetime(df_weather['datetime'], format = '%Y-%m-%d %H:%M:%S')
df_weather = df_weather.set_index('datetime')
df_weather = df_weather.loc[start_date:end_date]

In [5]:
len(df_weather)

4393
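
The row count of 4393 is what a gap-free hourly index between `start_date` and `end_date` would give: 183 days × 24 hours, plus the closing midnight timestamp. A minimal sketch of checking this directly, using a synthetic frame in place of the real data; `find_missing_hours` and `demo` are hypothetical names, not part of the notebook:

```python
import datetime
import pandas as pd

start_date = datetime.datetime(2022, 7, 1)
end_date = datetime.datetime(2022, 12, 31)

# Every hourly timestamp expected between the two bounds (both inclusive).
expected_index = pd.date_range(start_date, end_date, freq=pd.Timedelta(hours=1))

def find_missing_hours(df, expected):
    """Return the expected timestamps that are absent from df's index."""
    return expected.difference(df.index)

# Synthetic stand-in for a loaded frame: drop two hours to simulate a gap.
demo = pd.DataFrame({"value": 1.0}, index=expected_index).drop(expected_index[[5, 6]])

missing = find_missing_hours(demo, expected_index)
print(len(expected_index), len(demo), list(missing))
```

An empty result from `find_missing_hours` would suggest the frame covers every hour in the window.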

In [6]:
df_pvc = pd.read_csv('../data/pvc_sensors.csv', sep=',', header='infer').drop(columns=['Type', 'Type.1']).rename(
    columns={'Date (UTC)':'datetime', '1200061435370 (kWh)': 'PV Panel A', '1200061695248 (kWh)':'PV Panel B'})
df_pvc['datetime'] = pd.to_datetime(df_pvc['datetime'], format = '%d/%m/%Y %H:%M')
df_pvc = df_pvc.set_index('datetime')
df_pvc = df_pvc.loc[start_date:end_date]
# Aggregate to hourly means
df_pvc_hourly = df_pvc.resample('H').mean()

In [7]:
len(df_pvc_hourly)

4393
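
The PV columns are in kWh, so it may be worth confirming whether the hourly mean or the hourly sum is the intended aggregate. The sketch below only illustrates the difference on synthetic data; the half-hourly sampling interval is an assumption, not something the source states:

```python
import pandas as pd

# Synthetic half-hourly readings (assumed interval), in kWh per interval.
idx = pd.date_range("2022-07-01 00:00", periods=4, freq=pd.Timedelta(minutes=30))
readings = pd.DataFrame({"PV Panel A": [1.0, 2.0, 3.0, 4.0]}, index=idx)

hourly = readings.resample(pd.Timedelta(hours=1))
hourly_mean = hourly.mean()  # average reading within each hour (as in the notebook)
hourly_sum = hourly.sum()    # total of the readings within each hour

print(hourly_mean["PV Panel A"].tolist())
print(hourly_sum["PV Panel A"].tolist())
```

If each reading is the energy produced during its interval, the sum gives energy per hour, while the mean gives the average per-interval energy.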

In [8]:
df_prec = pd.read_csv('../data/precipitation_data_london_jun_dec.csv', sep =',', header = 'infer').drop(columns = ['Unnamed: 0']).rename(
    columns={'date':'datetime'})
df_prec['datetime'] = pd.to_datetime(df_prec['datetime'], format = '%Y-%m-%d %H:%M:%S')
df_prec = df_prec.set_index('datetime')
df_prec = df_prec.loc[start_date:end_date]


In [9]:
display(df_prec)

datetime,precipitation
2022-07-01 00:00:00,0.000979
2022-07-01 01:00:00,0.000000
2022-07-01 02:00:00,0.000002
2022-07-01 03:00:00,0.000005
2022-07-01 04:00:00,0.000109
...,...
2022-12-30 20:00:00,0.005902
2022-12-30 21:00:00,0.005902
2022-12-30 22:00:00,0.005998
2022-12-30 23:00:00,0.006031
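
Note that `end_date` is midnight at the start of 31 December, and label-based slicing with `.loc` includes the end label but nothing after it. At most one timestamp from the final day can therefore pass the filter. A small sketch on a synthetic index (the `demo` frame is illustrative only):

```python
import datetime
import pandas as pd

end_date = datetime.datetime(2022, 12, 31)

# Synthetic hourly index covering the last two days of the year.
idx = pd.date_range("2022-12-30 00:00", "2022-12-31 23:00", freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"precipitation": 0.0}, index=idx)

# Label slicing includes the end label, but nothing after it.
upto_midnight = demo.loc[:end_date]

# To keep all of 31 December, slice up to the last hour of that day instead.
whole_day = demo.loc[:end_date + datetime.timedelta(hours=23)]

print(upto_midnight.index[-1], len(whole_day) - len(upto_midnight))
```

Whether the final day should be included is a choice for the analysis; the sketch only shows how the bound behaves.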


### Combining data

In [10]:
df1 = pd.merge(df_aet,df_pvc_hourly, on = 'datetime')
df2 = pd.merge(df_prec, df_weather, on ='datetime')
df_total = pd.merge(df1, df2, on = 'datetime')
df_total = df_total.drop(columns = ['prcp', 'snow', 'tsun', 'pm_10_from', 'pm_2_5_from'])

In [11]:
df_total.columns

Index(['deviceID', 'label', 'temperature_celsius', 'humidity_percent',
       'battery_volt', 'pressure_hpa', 'pm_1_ug_per_m3', 'pm_2_5_ug_per_m3',
       'pm_10_ug_per_m3', 'PV Panel A', 'PV Panel B', 'precipitation', 'temp',
       'dwpt', 'rhum', 'wdir', 'wspd', 'wpgt', 'pres', 'coco'],
      dtype='object')
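
`pd.merge` performs an inner join by default, so a timestamp missing from any one frame is dropped from `df_total` without warning. One way to see what an inner join would discard is an outer join with `indicator=True`. The sketch below uses two small synthetic frames; `left` and `right` are placeholder names:

```python
import pandas as pd

idx_a = pd.date_range("2022-07-01 00:00", periods=4,
                      freq=pd.Timedelta(hours=1), name="datetime")
idx_b = idx_a[1:]  # the second frame lacks the first hour

left = pd.DataFrame({"pm_10_ug_per_m3": [5.0, 6.0, 7.0, 8.0]}, index=idx_a)
right = pd.DataFrame({"precipitation": [0.0, 0.1, 0.2]}, index=idx_b)

# Default inner join: rows without a match on both sides are dropped.
inner = pd.merge(left, right, on="datetime")

# Outer join with indicator=True records which side each row came from.
outer = pd.merge(left, right, on="datetime", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner), len(outer), len(unmatched))
```

Comparing `len(df_total)` with the lengths of the input frames gives a quicker, coarser version of the same check.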

### Generating a ydata-profiling report

The dataset is large, so we can either aggregate it to a daily basis or create a separate profile for each month.
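
Both options can be sketched with standard pandas calls. The example uses a synthetic hourly frame as a stand-in for `df_total`; the column name and date range are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic hourly frame spanning two months (stand-in for df_total).
idx = pd.date_range("2022-07-01", "2022-08-31 23:00",
                    freq=pd.Timedelta(hours=1), name="datetime")
demo = pd.DataFrame({"temp": np.arange(len(idx), dtype=float)}, index=idx)

# Option 1: aggregate to one row per day.
daily = demo.resample("D").mean()

# Option 2: split into one frame per calendar month, to be profiled separately.
monthly_frames = {month: frame for month, frame in demo.groupby(demo.index.month)}

print(len(daily), sorted(monthly_frames), len(monthly_frames[8]))
```

Daily aggregation shrinks the data roughly 24-fold but smooths out the diurnal cycle, which matters for PV output; per-month profiles keep the hourly resolution at the cost of several reports.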

In [12]:
# %pip install ydata-profiling

In [13]:
import pandas as pd

print("Pandas version:", pd.__version__)

Pandas version: 1.5.3


In [14]:
df_total_aug = df_total[df_total.index.month ==8][['temperature_celsius', 'humidity_percent',
       'pressure_hpa', 'pm_1_ug_per_m3', 'pm_2_5_ug_per_m3',
       'pm_10_ug_per_m3', 'PV Panel A', 'PV Panel B', 'precipitation', 'temp',
       'dwpt', 'rhum', 'wdir', 'wspd', 'wpgt', 'pres']]

In [66]:
df_total_aug = df_total[df_total.index.month ==8][['dwpt', 'rhum', 'wdir', 'wspd', 'wpgt', 'pres']]

In [15]:
print("Are there any null values in the DataFrame?")
print(df_total_aug.isnull().any().any())

Are there any null values in the DataFrame?
False
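
The check above reduces everything to a single boolean. If it ever returns `True`, a per-column count is usually more informative. A minimal sketch on a synthetic frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"wspd": [3.0, np.nan, 5.0],
                     "pres": [1012.0, 1011.5, 1013.0]})

# Number of missing values in each column.
null_counts = demo.isnull().sum()

# Rows containing at least one missing value, for closer inspection.
rows_with_nulls = demo[demo.isnull().any(axis=1)]

print(null_counts.to_dict(), len(rows_with_nulls))
```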


***This error may be related to the latest version of pandas.***

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df_total_aug, title="Profiling Report")
profile.to_widgets()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render widgets:   0%|          | 0/1 [00:00<?, ?it/s]