## 📥 Loading & Merging City Data

We start with:
- **10 JSON files**, one per city, each holding that city's hourly weather observations,
- **2 CSV files** containing hourly electricity demand, each covering multiple cities.

For each city, we:
1. **Filter** the relevant CSV's rows to that city.
2. **Convert** the timestamps on both sides to datetimes (`utc_time` in the CSV, epoch-seconds `time` in the JSON).
3. **Merge** the filtered rows with that city's JSON-derived DataFrame on `utc_time` = `time`.

This produces 10 per‑city DataFrames (`df_merged`, `df_merged1`, … `df_merged9`).
We then **concatenate** all ten into a single unified `df_all` for downstream preprocessing.


Per-city variable names (merged DataFrame = city, source DataFrame):

- `df_merged` = seattle (`df`)
- `df_merged1` = phoenix (`df1`)
- `df_merged2` = nyc (`df2`)
- `df_merged3` = san jose (`df3`)
- `df_merged4` = dallas (`df4`)
- `df_merged5` = houston (`df5`)
- `df_merged6` = la (`df6`)
- `df_merged7` = philadelphia (`df7`)
- `df_merged8` = san antonio (`df8`)
- `df_merged9` = san diego (`df9`)
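
The filter, convert, merge, and concatenate steps described above can be sketched as one helper applied per city. This is a minimal sketch on tiny in-memory frames, not the notebook's actual code: the helper name `merge_city`, the `weather_by_city` dictionary, and all toy values are hypothetical, while the column names (`city`, `utc_time`, `time`) follow the data shown in the cells below.

```python
import pandas as pd

def merge_city(demand: pd.DataFrame, weather: pd.DataFrame, city: str) -> pd.DataFrame:
    """Filter demand rows to one city and inner-join them with that city's weather."""
    # Keep only this city's rows; copy so later assignments leave the input untouched.
    d = demand[demand["city"] == city].reset_index(drop=True).copy()
    d["utc_time"] = pd.to_datetime(d["utc_time"])
    w = weather.copy()
    # Weather timestamps are Unix epoch seconds.
    w["time"] = pd.to_datetime(w["time"], unit="s")
    merged = d.merge(w, left_on="utc_time", right_on="time", how="inner")
    # The join key from the weather side duplicates utc_time, so drop it.
    return merged.drop(columns=["time"])

# Toy stand-ins for one demand CSV and the per-city JSON files (hypothetical values).
demand = pd.DataFrame({
    "city": ["seattle", "seattle", "phoenix"],
    "utc_time": ["2018-07-01 08:00:00", "2018-07-01 09:00:00", "2018-07-01 08:00:00"],
    "demand": [809.0, 779.0, 3497.0],
})
weather_by_city = {
    "seattle": pd.DataFrame({"time": [1530432000, 1530435600], "temperature": [58.0, 57.0]}),
    "phoenix": pd.DataFrame({"time": [1530432000], "temperature": [83.0]}),
}

per_city = [merge_city(demand, w, city) for city, w in weather_by_city.items()]
df_all = pd.concat(per_city, ignore_index=True)
print(df_all)
```

Writing the steps once as a function avoids repeating the same cells for each of the ten cities, and `ignore_index=True` gives `df_all` a fresh 0-based index.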

In [1]:
import os
import pandas as pd

drive    = "/Volumes/meen"            
folder   = "archive"
filename = "cleaned_balance_data.csv"
filename1 = "cleaned_subregion_data.csv"
csv_path = os.path.join(drive, folder, filename)
csv_path1 = os.path.join(drive, folder, filename1)
df = pd.read_csv(csv_path)
print(df.head(10))

  company           local_time             utc_time  demand     city
0    AZPS  2018-07-01 01:00:00  2018-07-01 08:00:00  3497.0  phoenix
1    AZPS  2018-07-01 02:00:00  2018-07-01 09:00:00  3256.0  phoenix
2    AZPS  2018-07-01 03:00:00  2018-07-01 10:00:00  3065.0  phoenix
3    AZPS  2018-07-01 04:00:00  2018-07-01 11:00:00  2929.0  phoenix
4    AZPS  2018-07-01 05:00:00  2018-07-01 12:00:00  2833.0  phoenix
5    AZPS  2018-07-01 06:00:00  2018-07-01 13:00:00  2736.0  phoenix
6    AZPS  2018-07-01 07:00:00  2018-07-01 14:00:00  2764.0  phoenix
7    AZPS  2018-07-01 08:00:00  2018-07-01 15:00:00  2895.0  phoenix
8    AZPS  2018-07-01 09:00:00  2018-07-01 16:00:00  3096.0  phoenix
9    AZPS  2018-07-01 10:00:00  2018-07-01 17:00:00  3293.0  phoenix


In [2]:
df = df[df['city'] == 'seattle']
df = df.reset_index(drop=True)
print(df)

      company           local_time             utc_time  demand     city
0         SCL  2018-07-01 01:00:00  2018-07-01 08:00:00   809.0  seattle
1         SCL  2018-07-01 02:00:00  2018-07-01 09:00:00   779.0  seattle
2         SCL  2018-07-01 03:00:00  2018-07-01 10:00:00   753.0  seattle
3         SCL  2018-07-01 04:00:00  2018-07-01 11:00:00   748.0  seattle
4         SCL  2018-07-01 05:00:00  2018-07-01 12:00:00   745.0  seattle
...       ...                  ...                  ...     ...      ...
15955     SCL  2020-04-25 20:00:00  2020-04-26 03:00:00     NaN  seattle
15956     SCL  2020-04-25 21:00:00  2020-04-26 04:00:00     NaN  seattle
15957     SCL  2020-04-25 22:00:00  2020-04-26 05:00:00     NaN  seattle
15958     SCL  2020-04-25 23:00:00  2020-04-26 06:00:00     NaN  seattle
15959     SCL  2020-04-26 00:00:00  2020-04-26 07:00:00     NaN  seattle

[15960 rows x 5 columns]


In [3]:
print(type(df['utc_time'][0]))

<class 'str'>


In [4]:
# convert the utc_time column from strings to datetime objects
df['utc_time'] = pd.to_datetime(df['utc_time'])
print(df)

      company           local_time            utc_time  demand     city
0         SCL  2018-07-01 01:00:00 2018-07-01 08:00:00   809.0  seattle
1         SCL  2018-07-01 02:00:00 2018-07-01 09:00:00   779.0  seattle
2         SCL  2018-07-01 03:00:00 2018-07-01 10:00:00   753.0  seattle
3         SCL  2018-07-01 04:00:00 2018-07-01 11:00:00   748.0  seattle
4         SCL  2018-07-01 05:00:00 2018-07-01 12:00:00   745.0  seattle
...       ...                  ...                 ...     ...      ...
15955     SCL  2020-04-25 20:00:00 2020-04-26 03:00:00     NaN  seattle
15956     SCL  2020-04-25 21:00:00 2020-04-26 04:00:00     NaN  seattle
15957     SCL  2020-04-25 22:00:00 2020-04-26 05:00:00     NaN  seattle
15958     SCL  2020-04-25 23:00:00 2020-04-26 06:00:00     NaN  seattle
15959     SCL  2020-04-26 00:00:00 2020-04-26 07:00:00     NaN  seattle

[15960 rows x 5 columns]


In [5]:
print(type(df['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
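
As an alternative to converting after loading, `pd.read_csv` can parse the column while reading via its `parse_dates` parameter. The sketch below uses an in-memory CSV (the `csv_text` string and the name `df_demo` are hypothetical stand-ins for the real file), so it runs without the data on disk.

```python
import io
import pandas as pd

# Two rows in the same layout as the demand CSV (stand-in for the real file).
csv_text = """company,local_time,utc_time,demand,city
SCL,2018-07-01 01:00:00,2018-07-01 08:00:00,809.0,seattle
SCL,2018-07-01 02:00:00,2018-07-01 09:00:00,779.0,seattle
"""

# parse_dates converts the listed columns to datetimes during the read.
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["utc_time"])
print(df_demo.dtypes)
print(type(df_demo["utc_time"][0]))
```

Only `utc_time` is parsed here; `local_time` stays as strings, matching what the notebook does.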


In [6]:
#df_weather = pd.read_json('seattle.json', orient='records')
df_weather = pd.read_json('/Volumes/meen/seattle.json', orient='records')
print(df_weather)

             time        summary                 icon  precipIntensity  \
0      1530428400       Overcast               cloudy           0.0000   
1      1530432000       Overcast               cloudy           0.0000   
2      1530435600       Overcast               cloudy           0.0000   
3      1530439200       Overcast               cloudy           0.0000   
4      1530442800       Overcast               cloudy           0.0007   
...           ...            ...                  ...              ...   
16569  1590112800  Partly Cloudy    partly-cloudy-day           0.0204   
16570  1590116400  Partly Cloudy    partly-cloudy-day           0.0106   
16571  1590120000  Partly Cloudy  partly-cloudy-night           0.0058   
16572  1590123600  Partly Cloudy  partly-cloudy-night           0.0023   
16573  1590127200  Partly Cloudy  partly-cloudy-night           0.0003   

       precipProbability  temperature  apparentTemperature  dewPoint  \
0                   0.00        59.32  

In [7]:
df_weather['time'] = pd.to_datetime(df_weather['time'], unit='s')
print(df_weather)

                     time        summary                 icon  \
0     2018-07-01 07:00:00       Overcast               cloudy   
1     2018-07-01 08:00:00       Overcast               cloudy   
2     2018-07-01 09:00:00       Overcast               cloudy   
3     2018-07-01 10:00:00       Overcast               cloudy   
4     2018-07-01 11:00:00       Overcast               cloudy   
...                   ...            ...                  ...   
16569 2020-05-22 02:00:00  Partly Cloudy    partly-cloudy-day   
16570 2020-05-22 03:00:00  Partly Cloudy    partly-cloudy-day   
16571 2020-05-22 04:00:00  Partly Cloudy  partly-cloudy-night   
16572 2020-05-22 05:00:00  Partly Cloudy  partly-cloudy-night   
16573 2020-05-22 06:00:00  Partly Cloudy  partly-cloudy-night   

       precipIntensity  precipProbability  temperature  apparentTemperature  \
0               0.0000               0.00        59.32                59.32   
1               0.0000               0.00        58.96       

In [8]:
print(type(df_weather['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
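
The `unit='s'` argument matters because the weather `time` values are Unix epoch seconds. The short sketch below (the names `epoch_seconds`, `as_seconds`, and `as_default` are illustrative) contrasts it with the default, where integers are read as nanoseconds and end up in 1970.

```python
import pandas as pd

# First two `time` values from the Seattle weather output above.
epoch_seconds = pd.Series([1530428400, 1530432000])

# Interpreted as seconds since 1970-01-01 UTC.
as_seconds = pd.to_datetime(epoch_seconds, unit="s")
print(as_seconds)

# Default unit is nanoseconds, so the same integers land just after the epoch.
as_default = pd.to_datetime(epoch_seconds)
print(as_default)
```

The resulting timestamps are timezone-naive but represent UTC, which is why they can be joined directly against the naive `utc_time` column.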


In [9]:
df_merged = df.merge(
    df_weather,
    left_on='utc_time',
    right_on='time',
    how='inner'
)

In [10]:
df_merged.drop(columns=['time'], inplace=True)

In [11]:
print(df_merged)

      company           local_time            utc_time  demand     city  \
0         SCL  2018-07-01 01:00:00 2018-07-01 08:00:00   809.0  seattle   
1         SCL  2018-07-01 02:00:00 2018-07-01 09:00:00   779.0  seattle   
2         SCL  2018-07-01 03:00:00 2018-07-01 10:00:00   753.0  seattle   
3         SCL  2018-07-01 04:00:00 2018-07-01 11:00:00   748.0  seattle   
4         SCL  2018-07-01 05:00:00 2018-07-01 12:00:00   745.0  seattle   
...       ...                  ...                 ...     ...      ...   
15945     SCL  2020-04-25 20:00:00 2020-04-26 03:00:00     NaN  seattle   
15946     SCL  2020-04-25 21:00:00 2020-04-26 04:00:00     NaN  seattle   
15947     SCL  2020-04-25 22:00:00 2020-04-26 05:00:00     NaN  seattle   
15948     SCL  2020-04-25 23:00:00 2020-04-26 06:00:00     NaN  seattle   
15949     SCL  2020-04-26 00:00:00 2020-04-26 07:00:00     NaN  seattle   

             summary                 icon  precipIntensity  precipProbability  \
0           Overca
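
The inner merge returns fewer rows than the filtered demand frame (the last index drops from 15959 to 15949), which suggests some demand hours have no matching weather timestamp. One way to find them is a left merge with `indicator=True`. This is a sketch on toy frames, with `left`, `right`, `check`, and `unmatched` as hypothetical names and made-up temperatures.

```python
import pandas as pd

# Toy demand rows: three consecutive hours.
left = pd.DataFrame({
    "utc_time": pd.to_datetime(["2018-07-01 08:00", "2018-07-01 09:00", "2018-07-01 10:00"]),
    "demand": [809.0, 779.0, 753.0],
})
# Toy weather rows: the 09:00 observation is missing.
right = pd.DataFrame({
    "time": pd.to_datetime(["2018-07-01 08:00", "2018-07-01 10:00"]),
    "temperature": [58.0, 56.0],
})

# A left merge keeps every demand row; _merge records whether a match was found.
check = left.merge(right, left_on="utc_time", right_on="time", how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
print(unmatched[["utc_time", "demand"]])
```

Inspecting `unmatched` before switching to `how='inner'` shows which hours will be silently dropped.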

In [12]:
df1 = pd.read_csv(csv_path)
df1

Unnamed: 0,company,local_time,utc_time,demand,city
0,AZPS,2018-07-01 01:00:00,2018-07-01 08:00:00,3497.0,phoenix
1,AZPS,2018-07-01 02:00:00,2018-07-01 09:00:00,3256.0,phoenix
2,AZPS,2018-07-01 03:00:00,2018-07-01 10:00:00,3065.0,phoenix
3,AZPS,2018-07-01 04:00:00,2018-07-01 11:00:00,2929.0,phoenix
4,AZPS,2018-07-01 05:00:00,2018-07-01 12:00:00,2833.0,phoenix
...,...,...,...,...,...
31915,SCL,2020-04-25 20:00:00,2020-04-26 03:00:00,,seattle
31916,SCL,2020-04-25 21:00:00,2020-04-26 04:00:00,,seattle
31917,SCL,2020-04-25 22:00:00,2020-04-26 05:00:00,,seattle
31918,SCL,2020-04-25 23:00:00,2020-04-26 06:00:00,,seattle


In [13]:
df1 = df1[df1['city'] == 'phoenix']
df1 = df1.reset_index(drop=True)
print(df1)

      company           local_time             utc_time  demand     city
0        AZPS  2018-07-01 01:00:00  2018-07-01 08:00:00  3497.0  phoenix
1        AZPS  2018-07-01 02:00:00  2018-07-01 09:00:00  3256.0  phoenix
2        AZPS  2018-07-01 03:00:00  2018-07-01 10:00:00  3065.0  phoenix
3        AZPS  2018-07-01 04:00:00  2018-07-01 11:00:00  2929.0  phoenix
4        AZPS  2018-07-01 05:00:00  2018-07-01 12:00:00  2833.0  phoenix
...       ...                  ...                  ...     ...      ...
15955    AZPS  2020-04-25 20:00:00  2020-04-26 03:00:00     NaN  phoenix
15956    AZPS  2020-04-25 21:00:00  2020-04-26 04:00:00     NaN  phoenix
15957    AZPS  2020-04-25 22:00:00  2020-04-26 05:00:00     NaN  phoenix
15958    AZPS  2020-04-25 23:00:00  2020-04-26 06:00:00     NaN  phoenix
15959    AZPS  2020-04-26 00:00:00  2020-04-26 07:00:00     NaN  phoenix

[15960 rows x 5 columns]


In [14]:
print(type(df1['utc_time'][0]))
# convert the utc_time column from strings to datetime objects
df1['utc_time'] = pd.to_datetime(df1['utc_time'])
print(df1)
print(type(df1['utc_time'][0]))

<class 'str'>
      company           local_time            utc_time  demand     city
0        AZPS  2018-07-01 01:00:00 2018-07-01 08:00:00  3497.0  phoenix
1        AZPS  2018-07-01 02:00:00 2018-07-01 09:00:00  3256.0  phoenix
2        AZPS  2018-07-01 03:00:00 2018-07-01 10:00:00  3065.0  phoenix
3        AZPS  2018-07-01 04:00:00 2018-07-01 11:00:00  2929.0  phoenix
4        AZPS  2018-07-01 05:00:00 2018-07-01 12:00:00  2833.0  phoenix
...       ...                  ...                 ...     ...      ...
15955    AZPS  2020-04-25 20:00:00 2020-04-26 03:00:00     NaN  phoenix
15956    AZPS  2020-04-25 21:00:00 2020-04-26 04:00:00     NaN  phoenix
15957    AZPS  2020-04-25 22:00:00 2020-04-26 05:00:00     NaN  phoenix
15958    AZPS  2020-04-25 23:00:00 2020-04-26 06:00:00     NaN  phoenix
15959    AZPS  2020-04-26 00:00:00 2020-04-26 07:00:00     NaN  phoenix

[15960 rows x 5 columns]
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [15]:
df_weather1 = pd.read_json('/Volumes/meen/phoenix.json', orient='records')
print(df_weather1)

             time summary         icon  precipIntensity  precipProbability  \
0      1530428400   Clear  clear-night              0.0                0.0   
1      1530432000   Clear  clear-night              0.0                0.0   
2      1530435600   Clear  clear-night              0.0                0.0   
3      1530439200   Clear  clear-night              0.0                0.0   
4      1530442800   Clear  clear-night              0.0                0.0   
...           ...     ...          ...              ...                ...   
16569  1590112800   Clear    clear-day              0.0                0.0   
16570  1590116400   Clear  clear-night              0.0                0.0   
16571  1590120000   Clear  clear-night              0.0                0.0   
16572  1590123600   Clear  clear-night              0.0                0.0   
16573  1590127200   Clear  clear-night              0.0                0.0   

       temperature  apparentTemperature  dewPoint  humidity  pr

In [16]:
df_weather1['time'] = pd.to_datetime(df_weather1['time'], unit='s')
print(df_weather1)

                     time summary         icon  precipIntensity  \
0     2018-07-01 07:00:00   Clear  clear-night              0.0   
1     2018-07-01 08:00:00   Clear  clear-night              0.0   
2     2018-07-01 09:00:00   Clear  clear-night              0.0   
3     2018-07-01 10:00:00   Clear  clear-night              0.0   
4     2018-07-01 11:00:00   Clear  clear-night              0.0   
...                   ...     ...          ...              ...   
16569 2020-05-22 02:00:00   Clear    clear-day              0.0   
16570 2020-05-22 03:00:00   Clear  clear-night              0.0   
16571 2020-05-22 04:00:00   Clear  clear-night              0.0   
16572 2020-05-22 05:00:00   Clear  clear-night              0.0   
16573 2020-05-22 06:00:00   Clear  clear-night              0.0   

       precipProbability  temperature  apparentTemperature  dewPoint  \
0                    0.0        86.82                86.82     34.57   
1                    0.0        83.37              

In [17]:
print(type(df_weather1['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [18]:
df_merged1 = df1.merge(
    df_weather1,
    left_on='utc_time',
    right_on='time',
    how='inner'
)



In [19]:
df_merged1.drop(columns=['time'], inplace=True)

In [20]:
df_merged1

Unnamed: 0,company,local_time,utc_time,demand,city,summary,icon,precipIntensity,precipProbability,temperature,...,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,AZPS,2018-07-01 01:00:00,2018-07-01 08:00:00,3497.0,phoenix,Clear,clear-night,0.0000,0.00,83.37,...,1009.3,4.00,4.00,282.0,0.00,0.0,9.997,,,
1,AZPS,2018-07-01 02:00:00,2018-07-01 09:00:00,3256.0,phoenix,Clear,clear-night,0.0000,0.00,82.22,...,1009.5,2.47,2.47,279.0,0.00,0.0,9.997,,,
2,AZPS,2018-07-01 03:00:00,2018-07-01 10:00:00,3065.0,phoenix,Clear,clear-night,0.0000,0.00,80.34,...,1010.1,2.98,4.12,107.0,0.00,0.0,9.997,,,
3,AZPS,2018-07-01 04:00:00,2018-07-01 11:00:00,2929.0,phoenix,Clear,clear-night,0.0000,0.00,79.34,...,1010.4,2.74,3.28,106.0,0.00,0.0,9.997,,,
4,AZPS,2018-07-01 05:00:00,2018-07-01 12:00:00,2833.0,phoenix,Clear,clear-night,0.0000,0.00,76.67,...,1010.7,2.51,4.76,115.0,0.00,0.0,9.997,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15945,AZPS,2020-04-25 20:00:00,2020-04-26 03:00:00,,phoenix,Clear,clear-night,0.0000,0.00,86.44,...,1010.2,3.27,3.27,269.0,0.00,0.0,10.000,,301.8,
15946,AZPS,2020-04-25 21:00:00,2020-04-26 04:00:00,,phoenix,Clear,clear-night,0.0000,0.00,87.49,...,1010.7,1.29,7.63,235.0,0.00,0.0,10.000,,301.5,
15947,AZPS,2020-04-25 22:00:00,2020-04-26 05:00:00,,phoenix,Clear,clear-night,0.0000,0.00,81.75,...,1011.0,4.06,4.06,331.0,0.00,0.0,10.000,,301.3,
15948,AZPS,2020-04-25 23:00:00,2020-04-26 06:00:00,,phoenix,Clear,clear-night,0.0000,0.00,82.21,...,1011.5,4.11,4.11,359.0,0.00,0.0,10.000,,300.7,


In [21]:
df2 = pd.read_csv(csv_path1)
df2

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [22]:
# only keep rows whose city is nyc
df2 = df2[df2['city'] == 'nyc']
df2

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
108770,NYIS,ZONJ,4674.0,2020-05-19 20:00:00,2020-05-20 00:00:00,nyc
108771,NYIS,ZONJ,4708.0,2020-05-19 21:00:00,2020-05-20 01:00:00,nyc
108772,NYIS,ZONJ,4617.0,2020-05-19 22:00:00,2020-05-20 02:00:00,nyc
108773,NYIS,ZONJ,4440.0,2020-05-19 23:00:00,2020-05-20 03:00:00,nyc


In [23]:
df2 = df2.reset_index(drop=True)
df2

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
16531,NYIS,ZONJ,4674.0,2020-05-19 20:00:00,2020-05-20 00:00:00,nyc
16532,NYIS,ZONJ,4708.0,2020-05-19 21:00:00,2020-05-20 01:00:00,nyc
16533,NYIS,ZONJ,4617.0,2020-05-19 22:00:00,2020-05-20 02:00:00,nyc
16534,NYIS,ZONJ,4440.0,2020-05-19 23:00:00,2020-05-20 03:00:00,nyc


In [24]:
print(type(df2['utc_time'][0]))

<class 'str'>


In [25]:
# convert the utc_time column from strings to datetime objects
df2['utc_time'] = pd.to_datetime(df2['utc_time'])
df2

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
16531,NYIS,ZONJ,4674.0,2020-05-19 20:00:00,2020-05-20 00:00:00,nyc
16532,NYIS,ZONJ,4708.0,2020-05-19 21:00:00,2020-05-20 01:00:00,nyc
16533,NYIS,ZONJ,4617.0,2020-05-19 22:00:00,2020-05-20 02:00:00,nyc
16534,NYIS,ZONJ,4440.0,2020-05-19 23:00:00,2020-05-20 03:00:00,nyc


In [27]:
print(type(df2['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [28]:
df_weather2 = pd.read_json('/Volumes/meen/nyc.json', orient='records')
print(df_weather2)

             time              summary         icon  precipIntensity  \
0      1530504000                Clear  clear-night           0.0000   
1      1530507600                Clear  clear-night           0.0000   
2      1530511200                Clear  clear-night           0.0000   
3      1530514800                Clear  clear-night           0.0000   
4      1530518400                Clear  clear-night           0.0000   
...           ...                  ...          ...              ...   
16569  1590188400     Possible Drizzle         rain           0.0086   
16570  1590192000  Possible Light Rain         rain           0.0103   
16571  1590195600     Possible Drizzle         rain           0.0089   
16572  1590199200     Possible Drizzle         rain           0.0063   
16573  1590202800     Possible Drizzle         rain           0.0047   

       precipProbability  temperature  apparentTemperature  dewPoint  \
0                   0.00        83.18                87.93     

In [29]:
df_weather2['time'] = pd.to_datetime(df_weather2['time'], unit='s')

In [30]:
df_weather2

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,2018-07-02 04:00:00,Clear,clear-night,0.0000,0.00,83.18,87.93,71.08,0.67,1017.0,3.27,3.27,242.0,0.03,0.0,9.784,,,
1,2018-07-02 05:00:00,Clear,clear-night,0.0000,0.00,82.55,86.45,69.88,0.66,1017.2,2.40,2.40,234.0,0.02,0.0,9.763,,,
2,2018-07-02 06:00:00,Clear,clear-night,0.0000,0.00,79.89,82.86,69.55,0.71,1017.4,3.64,3.64,256.0,0.02,0.0,9.876,,,
3,2018-07-02 07:00:00,Clear,clear-night,0.0000,0.00,79.07,81.70,69.37,0.72,1017.3,5.51,5.51,254.0,0.02,0.0,9.793,,,
4,2018-07-02 08:00:00,Clear,clear-night,0.0000,0.00,78.12,79.12,69.24,0.74,1017.2,1.95,2.90,255.0,0.02,0.0,9.799,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-22 23:00:00,Possible Drizzle,rain,0.0086,0.31,62.97,63.01,58.73,0.86,1018.4,8.61,14.42,166.0,0.78,0.0,10.000,rain,332.5,
16570,2020-05-23 00:00:00,Possible Light Rain,rain,0.0103,0.39,62.11,62.20,58.81,0.89,1018.1,7.70,13.86,161.0,0.82,0.0,9.666,rain,334.6,
16571,2020-05-23 01:00:00,Possible Drizzle,rain,0.0089,0.37,61.70,61.70,57.95,0.88,1017.9,7.09,12.65,158.0,0.87,0.0,10.000,rain,335.4,
16572,2020-05-23 02:00:00,Possible Drizzle,rain,0.0063,0.31,61.55,61.55,57.33,0.86,1017.9,6.62,10.99,154.0,0.93,0.0,10.000,rain,335.4,


In [31]:
print(type(df_weather2['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [32]:
df_merged2 = df2.merge(
    df_weather2,
    left_on='utc_time',
    right_on='time',
    how='inner'
)

In [33]:
df_merged2.drop(columns=['time'], inplace=True)

In [34]:
df_merged2

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,NYIS,ZONJ,8548.0,2018-07-02 00:00:00,2018-07-02 04:00:00,nyc,Clear,clear-night,0.0000,0.00,...,1017.0,3.27,3.27,242.0,0.03,0.0,9.784,,,
1,NYIS,ZONJ,8153.0,2018-07-02 01:00:00,2018-07-02 05:00:00,nyc,Clear,clear-night,0.0000,0.00,...,1017.2,2.40,2.40,234.0,0.02,0.0,9.763,,,
2,NYIS,ZONJ,7824.0,2018-07-02 02:00:00,2018-07-02 06:00:00,nyc,Clear,clear-night,0.0000,0.00,...,1017.4,3.64,3.64,256.0,0.02,0.0,9.876,,,
3,NYIS,ZONJ,7541.0,2018-07-02 03:00:00,2018-07-02 07:00:00,nyc,Clear,clear-night,0.0000,0.00,...,1017.3,5.51,5.51,254.0,0.02,0.0,9.793,,,
4,NYIS,ZONJ,7368.0,2018-07-02 04:00:00,2018-07-02 08:00:00,nyc,Clear,clear-night,0.0000,0.00,...,1017.2,1.95,2.90,255.0,0.02,0.0,9.799,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16498,NYIS,ZONJ,4674.0,2020-05-19 20:00:00,2020-05-20 00:00:00,nyc,Partly Cloudy,partly-cloudy-day,0.0012,0.01,...,1027.8,14.09,23.67,94.0,0.43,0.0,10.000,rain,312.2,
16499,NYIS,ZONJ,4708.0,2020-05-19 21:00:00,2020-05-20 01:00:00,nyc,Partly Cloudy,partly-cloudy-night,0.0008,0.01,...,1028.3,13.53,24.30,86.0,0.46,0.0,10.000,rain,314.7,
16500,NYIS,ZONJ,4617.0,2020-05-19 22:00:00,2020-05-20 02:00:00,nyc,Partly Cloudy,partly-cloudy-night,0.0000,0.00,...,1028.6,13.29,23.73,79.0,0.46,0.0,10.000,,314.8,
16501,NYIS,ZONJ,4440.0,2020-05-19 23:00:00,2020-05-20 03:00:00,nyc,Partly Cloudy,partly-cloudy-night,0.0010,0.01,...,1028.6,11.51,22.95,76.0,0.46,0.0,10.000,rain,314.7,


In [35]:
df3 = pd.read_csv(csv_path1)
df3

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [36]:
# only keep rows whose city is san jose
df3 = df3[df3['city'] == 'san jose']
df3

Unnamed: 0,company,region,demand,local_time,utc_time,city
30919,CISO,PGAE,12522.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san jose
30920,CISO,PGAE,11745.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san jose
30921,CISO,PGAE,11200.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san jose
30922,CISO,PGAE,10822.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san jose
30923,CISO,PGAE,10644.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san jose
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [37]:
df3 = df3.reset_index(drop=True)
df3

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,CISO,PGAE,12522.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san jose
1,CISO,PGAE,11745.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san jose
2,CISO,PGAE,11200.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san jose
3,CISO,PGAE,10822.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san jose
4,CISO,PGAE,10644.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san jose
...,...,...,...,...,...,...
16531,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
16532,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
16533,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
16534,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [38]:
# print the type of the utc_time column
print(type(df3['utc_time'][0]))

<class 'str'>


In [39]:
# convert the utc_time column from strings to datetime objects
df3['utc_time'] = pd.to_datetime(df3['utc_time'])
df3

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,CISO,PGAE,12522.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san jose
1,CISO,PGAE,11745.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san jose
2,CISO,PGAE,11200.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san jose
3,CISO,PGAE,10822.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san jose
4,CISO,PGAE,10644.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san jose
...,...,...,...,...,...,...
16531,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
16532,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
16533,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
16534,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [40]:
print(type(df3['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [41]:
df_weather3 = pd.read_json('/Volumes/meen/san_jose.json', orient='records')
print(df_weather3)

             time           summary         icon  precipIntensity  \
0      1530428400             Clear  clear-night           0.0000   
1      1530432000             Clear  clear-night           0.0000   
2      1530435600             Clear  clear-night           0.0000   
3      1530439200             Clear  clear-night           0.0000   
4      1530442800             Clear  clear-night           0.0000   
...           ...               ...          ...              ...   
16569  1590112800             Clear    clear-day           0.0000   
16570  1590116400             Clear    clear-day           0.0000   
16571  1590120000             Clear  clear-night           0.0000   
16572  1590123600             Clear  clear-night           0.0000   
16573  1590127200  Possible Drizzle         rain           0.0001   

       precipProbability  temperature  apparentTemperature  dewPoint  \
0                   0.00        67.78                67.78     53.27   
1                   0.00   

In [42]:
df_weather3['time'] = pd.to_datetime(df_weather3['time'], unit='s')

In [43]:
df_weather3

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,ozone,precipType
0,2018-07-01 07:00:00,Clear,clear-night,0.0000,0.00,67.78,67.78,53.27,0.60,1012.8,6.36,8.58,141.0,0.17,0.0,9.988,,
1,2018-07-01 08:00:00,Clear,clear-night,0.0000,0.00,66.09,66.09,53.14,0.63,1013.1,3.15,5.36,143.0,0.01,0.0,9.988,,
2,2018-07-01 09:00:00,Clear,clear-night,0.0000,0.00,64.30,64.30,53.15,0.67,1013.0,4.74,5.84,117.0,0.00,0.0,9.997,,
3,2018-07-01 10:00:00,Clear,clear-night,0.0000,0.00,63.42,63.42,53.22,0.69,1012.9,4.51,6.80,160.0,0.06,0.0,9.988,,
4,2018-07-01 11:00:00,Clear,clear-night,0.0000,0.00,61.73,61.73,53.10,0.73,1013.0,3.48,5.91,121.0,0.08,0.0,9.978,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-22 02:00:00,Clear,clear-day,0.0000,0.00,69.47,69.47,48.13,0.47,1014.3,7.51,14.32,308.0,0.05,0.0,10.000,344.1,
16570,2020-05-22 03:00:00,Clear,clear-day,0.0000,0.00,65.14,65.14,46.11,0.50,1014.5,8.23,12.70,318.0,0.05,0.0,10.000,342.9,
16571,2020-05-22 04:00:00,Clear,clear-night,0.0000,0.00,61.28,61.28,46.18,0.58,1014.8,7.36,9.98,348.0,0.05,0.0,10.000,342.5,
16572,2020-05-22 05:00:00,Clear,clear-night,0.0000,0.00,58.57,58.57,46.25,0.64,1015.4,5.51,7.44,345.0,0.05,0.0,10.000,342.8,


In [44]:
print(type(df_weather3['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [45]:
df_merged3 = df3.merge(
    df_weather3,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged3.drop(columns=['time'], inplace=True)

In [46]:
df_merged3

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,ozone,precipType
0,CISO,PGAE,12522.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san jose,Clear,clear-night,0.0000,0.00,...,0.63,1013.1,3.15,5.36,143.0,0.01,0.0,9.988,,
1,CISO,PGAE,11745.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san jose,Clear,clear-night,0.0000,0.00,...,0.67,1013.0,4.74,5.84,117.0,0.00,0.0,9.997,,
2,CISO,PGAE,11200.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san jose,Clear,clear-night,0.0000,0.00,...,0.69,1012.9,4.51,6.80,160.0,0.06,0.0,9.988,,
3,CISO,PGAE,10822.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san jose,Clear,clear-night,0.0000,0.00,...,0.73,1013.0,3.48,5.91,121.0,0.08,0.0,9.978,,
4,CISO,PGAE,10644.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san jose,Clear,clear-night,0.0000,0.00,...,0.76,1013.6,4.10,5.96,161.0,0.10,0.0,9.988,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16521,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose,Clear,clear-day,0.0009,0.01,...,0.63,1018.9,7.74,12.71,294.0,0.28,0.0,10.000,360.1,rain
16522,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose,Partly Cloudy,partly-cloudy-night,0.0013,0.01,...,0.68,1019.3,6.22,10.75,300.0,0.42,0.0,10.000,356.2,rain
16523,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose,Partly Cloudy,partly-cloudy-night,0.0004,0.01,...,0.69,1019.7,4.97,8.00,260.0,0.54,0.0,10.000,353.4,rain
16524,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose,Partly Cloudy,partly-cloudy-night,0.0005,0.01,...,0.73,1019.9,4.98,9.09,271.0,0.51,0.0,10.000,351.1,rain


In [47]:
df4 = pd.read_csv(csv_path1)
# only keep rows whose city is dallas
df4 = df4[df4['city'] == 'dallas']
df4

Unnamed: 0,company,region,demand,local_time,utc_time,city
26502,ERCO,NCEN,,2018-07-01 01:00:00,2018-07-01 06:00:00,dallas
26503,ERCO,NCEN,,2018-07-01 02:00:00,2018-07-01 07:00:00,dallas
26504,ERCO,NCEN,,2018-07-01 03:00:00,2018-07-01 08:00:00,dallas
26505,ERCO,NCEN,,2018-07-01 04:00:00,2018-07-01 09:00:00,dallas
26506,ERCO,NCEN,,2018-07-01 05:00:00,2018-07-01 10:00:00,dallas
...,...,...,...,...,...,...
128924,ERCO,NCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,dallas
128925,ERCO,NCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,dallas
128926,ERCO,NCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,dallas
128927,ERCO,NCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,dallas


In [48]:
df4 = df4.reset_index(drop=True)
df4

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,ERCO,NCEN,,2018-07-01 01:00:00,2018-07-01 06:00:00,dallas
1,ERCO,NCEN,,2018-07-01 02:00:00,2018-07-01 07:00:00,dallas
2,ERCO,NCEN,,2018-07-01 03:00:00,2018-07-01 08:00:00,dallas
3,ERCO,NCEN,,2018-07-01 04:00:00,2018-07-01 09:00:00,dallas
4,ERCO,NCEN,,2018-07-01 05:00:00,2018-07-01 10:00:00,dallas
...,...,...,...,...,...,...
16531,ERCO,NCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,dallas
16532,ERCO,NCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,dallas
16533,ERCO,NCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,dallas
16534,ERCO,NCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,dallas


In [49]:
print(type(df4['utc_time'][0]))

<class 'str'>


In [50]:
# convert the utc_time column from strings to datetime objects
df4['utc_time'] = pd.to_datetime(df4['utc_time'])
print(type(df4['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [51]:
df4

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,ERCO,NCEN,,2018-07-01 01:00:00,2018-07-01 06:00:00,dallas
1,ERCO,NCEN,,2018-07-01 02:00:00,2018-07-01 07:00:00,dallas
2,ERCO,NCEN,,2018-07-01 03:00:00,2018-07-01 08:00:00,dallas
3,ERCO,NCEN,,2018-07-01 04:00:00,2018-07-01 09:00:00,dallas
4,ERCO,NCEN,,2018-07-01 05:00:00,2018-07-01 10:00:00,dallas
...,...,...,...,...,...,...
16531,ERCO,NCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,dallas
16532,ERCO,NCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,dallas
16533,ERCO,NCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,dallas
16534,ERCO,NCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,dallas


In [52]:
df_weather4 = pd.read_json('/Volumes/meen/dallas.json', orient='records')
print(df_weather4)

             time                  summary               icon  \
0      1530507600                    Clear        clear-night   
1      1530511200                    Clear        clear-night   
2      1530514800                    Clear        clear-night   
3      1530518400                    Clear        clear-night   
4      1530522000                    Clear        clear-night   
...           ...                      ...                ...   
16569  1590192000  Humid and Partly Cloudy  partly-cloudy-day   
16570  1590195600  Humid and Partly Cloudy  partly-cloudy-day   
16571  1590199200                    Humid        clear-night   
16572  1590202800                    Humid        clear-night   
16573  1590206400                    Humid        clear-night   

       precipIntensity  precipProbability  temperature  apparentTemperature  \
0               0.0000               0.00        89.56                91.71   
1               0.0000               0.00        88.35       

In [53]:
df_weather4['time'] = pd.to_datetime(df_weather4['time'], unit='s')

In [54]:
df_weather4

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,2018-07-02 05:00:00,Clear,clear-night,0.0000,0.00,89.56,91.71,65.31,0.45,1011.9,10.84,12.40,164.0,0.06,0.0,9.617,,,
1,2018-07-02 06:00:00,Clear,clear-night,0.0000,0.00,88.35,92.26,68.29,0.52,1012.7,9.94,9.94,179.0,0.04,0.0,9.817,,,
2,2018-07-02 07:00:00,Clear,clear-night,0.0000,0.00,87.05,91.45,69.26,0.56,1012.8,8.92,10.31,186.0,0.04,0.0,9.510,,,
3,2018-07-02 08:00:00,Clear,clear-night,0.0000,0.00,86.04,90.51,69.61,0.58,1012.7,8.47,8.47,189.0,0.18,0.0,9.617,,,
4,2018-07-02 09:00:00,Clear,clear-night,0.0000,0.00,84.80,89.74,70.68,0.63,1012.9,5.61,5.61,193.0,0.15,0.0,9.617,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-23 00:00:00,Humid and Partly Cloudy,partly-cloudy-day,0.0023,0.03,87.82,96.51,74.01,0.64,1007.3,13.85,22.64,165.0,0.46,1.0,10.000,rain,305.9,
16570,2020-05-23 01:00:00,Humid and Partly Cloudy,partly-cloudy-day,0.0023,0.04,85.99,93.66,73.63,0.67,1007.5,13.42,24.25,164.0,0.34,0.0,10.000,rain,304.2,
16571,2020-05-23 02:00:00,Humid,clear-night,0.0024,0.04,83.74,90.09,73.15,0.71,1007.9,12.80,25.85,163.0,0.20,0.0,10.000,rain,301.3,
16572,2020-05-23 03:00:00,Humid,clear-night,0.0037,0.05,82.07,87.47,72.93,0.74,1008.4,12.38,27.11,163.0,0.12,0.0,10.000,rain,299.2,


In [55]:
print(type(df_weather4['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
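
The weather `time` column holds Unix epoch seconds, which is why `unit='s'` is passed above. As a small sketch (the names `epochs`, `naive`, and `aware` are illustrative), the first two epoch values from the printed Dallas weather frame convert as follows. The `utc=True` variant is an optional alternative, not what this notebook uses.

```python
import pandas as pd

# First two epoch values from the printed Dallas weather frame
epochs = pd.Series([1530507600, 1530511200])

# What the notebook does: timezone-naive timestamps that are implicitly UTC
naive = pd.to_datetime(epochs, unit="s")

# Alternative: the same instants, explicitly tagged as UTC
aware = pd.to_datetime(epochs, unit="s", utc=True)

print(naive.iloc[0])
print(aware.iloc[0])
```

The notebook keeps both merge keys timezone-naive. If one side were made timezone-aware, the other side would need the same treatment before merging.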


In [56]:
df_merged4 = df4.merge(
    df_weather4,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged4.drop(columns=['time'], inplace=True)

In [57]:
df_merged4

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,ERCO,NCEN,,2018-07-02 00:00:00,2018-07-02 05:00:00,dallas,Clear,clear-night,0.0000,0.00,...,1011.9,10.84,12.40,164.0,0.06,0.0,9.617,,,
1,ERCO,NCEN,,2018-07-02 01:00:00,2018-07-02 06:00:00,dallas,Clear,clear-night,0.0000,0.00,...,1012.7,9.94,9.94,179.0,0.04,0.0,9.817,,,
2,ERCO,NCEN,,2018-07-02 02:00:00,2018-07-02 07:00:00,dallas,Clear,clear-night,0.0000,0.00,...,1012.8,8.92,10.31,186.0,0.04,0.0,9.510,,,
3,ERCO,NCEN,,2018-07-02 03:00:00,2018-07-02 08:00:00,dallas,Clear,clear-night,0.0000,0.00,...,1012.7,8.47,8.47,189.0,0.18,0.0,9.617,,,
4,ERCO,NCEN,,2018-07-02 04:00:00,2018-07-02 09:00:00,dallas,Clear,clear-night,0.0000,0.00,...,1012.9,5.61,5.61,193.0,0.15,0.0,9.617,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16498,ERCO,NCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,dallas,Clear,clear-day,0.0078,0.01,...,1006.4,9.93,15.59,64.0,0.23,0.0,10.000,rain,296.8,
16499,ERCO,NCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,dallas,Clear,clear-night,0.0016,0.01,...,1007.4,10.18,18.38,70.0,0.18,0.0,10.000,rain,295.7,
16500,ERCO,NCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,dallas,Clear,clear-night,0.0011,0.01,...,1008.0,9.20,20.93,73.0,0.11,0.0,10.000,rain,295.0,
16501,ERCO,NCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,dallas,Clear,clear-night,0.0014,0.01,...,1008.2,8.76,20.51,76.0,0.17,0.0,10.000,rain,294.9,
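
The merged frame ends at index 16501, while `df4` ended at 16534, so the inner join drops the demand rows that have no matching weather timestamp. One way to audit this is an outer merge with `indicator=True`. The sketch below uses toy frames (`demand`, `weather`, and `audit` are hypothetical names, with a few illustrative timestamps) rather than the real data.

```python
import pandas as pd

# Toy frames: demand covers three hours, weather only the last two
demand = pd.DataFrame({
    "utc_time": pd.to_datetime([
        "2018-07-02 04:00:00", "2018-07-02 05:00:00", "2018-07-02 06:00:00",
    ]),
    "city": ["dallas", "dallas", "dallas"],
})
weather = pd.DataFrame({
    "time": pd.to_datetime(["2018-07-02 05:00:00", "2018-07-02 06:00:00"]),
    "temperature": [89.56, 88.35],
})

# how="outer" keeps unmatched rows; indicator=True adds a _merge column
# with the values left_only, right_only, or both
audit = demand.merge(
    weather, left_on="utc_time", right_on="time", how="outer", indicator=True
)

matched = int((audit["_merge"] == "both").sum())
left_only = int((audit["_merge"] == "left_only").sum())
right_only = int((audit["_merge"] == "right_only").sum())
print(matched, left_only, right_only)
```

Rows marked `left_only` are demand hours without weather, and `right_only` rows are weather hours without demand. Both groups disappear under `how='inner'`.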


In [58]:
df5 = pd.read_csv(csv_path1)
df5

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [59]:
# only keep rows where city is houston
df5 = df5[df5['city'] == 'houston']
df5

Unnamed: 0,company,region,demand,local_time,utc_time,city
8834,ERCO,COAS,,2018-07-01 01:00:00,2018-07-01 06:00:00,houston
8835,ERCO,COAS,,2018-07-01 02:00:00,2018-07-01 07:00:00,houston
8836,ERCO,COAS,,2018-07-01 03:00:00,2018-07-01 08:00:00,houston
8837,ERCO,COAS,,2018-07-01 04:00:00,2018-07-01 09:00:00,houston
8838,ERCO,COAS,,2018-07-01 05:00:00,2018-07-01 10:00:00,houston
...,...,...,...,...,...,...
115488,ERCO,COAS,,2020-05-19 20:00:00,2020-05-20 01:00:00,houston
115489,ERCO,COAS,,2020-05-19 21:00:00,2020-05-20 02:00:00,houston
115490,ERCO,COAS,,2020-05-19 22:00:00,2020-05-20 03:00:00,houston
115491,ERCO,COAS,,2020-05-19 23:00:00,2020-05-20 04:00:00,houston


In [60]:
df5 = df5.reset_index(drop=True)
df5

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,ERCO,COAS,,2018-07-01 01:00:00,2018-07-01 06:00:00,houston
1,ERCO,COAS,,2018-07-01 02:00:00,2018-07-01 07:00:00,houston
2,ERCO,COAS,,2018-07-01 03:00:00,2018-07-01 08:00:00,houston
3,ERCO,COAS,,2018-07-01 04:00:00,2018-07-01 09:00:00,houston
4,ERCO,COAS,,2018-07-01 05:00:00,2018-07-01 10:00:00,houston
...,...,...,...,...,...,...
16531,ERCO,COAS,,2020-05-19 20:00:00,2020-05-20 01:00:00,houston
16532,ERCO,COAS,,2020-05-19 21:00:00,2020-05-20 02:00:00,houston
16533,ERCO,COAS,,2020-05-19 22:00:00,2020-05-20 03:00:00,houston
16534,ERCO,COAS,,2020-05-19 23:00:00,2020-05-20 04:00:00,houston


In [61]:
# print the type of the utc_time column
print(type(df5['utc_time'][0]))
# convert the utc_time column (the merge key) to datetime
df5['utc_time'] = pd.to_datetime(df5['utc_time'])
df5

<class 'str'>


Unnamed: 0,company,region,demand,local_time,utc_time,city
0,ERCO,COAS,,2018-07-01 01:00:00,2018-07-01 06:00:00,houston
1,ERCO,COAS,,2018-07-01 02:00:00,2018-07-01 07:00:00,houston
2,ERCO,COAS,,2018-07-01 03:00:00,2018-07-01 08:00:00,houston
3,ERCO,COAS,,2018-07-01 04:00:00,2018-07-01 09:00:00,houston
4,ERCO,COAS,,2018-07-01 05:00:00,2018-07-01 10:00:00,houston
...,...,...,...,...,...,...
16531,ERCO,COAS,,2020-05-19 20:00:00,2020-05-20 01:00:00,houston
16532,ERCO,COAS,,2020-05-19 21:00:00,2020-05-20 02:00:00,houston
16533,ERCO,COAS,,2020-05-19 22:00:00,2020-05-20 03:00:00,houston
16534,ERCO,COAS,,2020-05-19 23:00:00,2020-05-20 04:00:00,houston


In [62]:
print(type(df5['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [63]:
df_weather5 = pd.read_json('/Volumes/meen/houston.json', orient='records')
print(df_weather5)

             time                  summary                 icon  \
0      1530507600                    Humid          clear-night   
1      1530511200                    Humid          clear-night   
2      1530514800                    Humid          clear-night   
3      1530518400                    Humid          clear-night   
4      1530522000                    Humid          clear-night   
...           ...                      ...                  ...   
16569  1590192000            Partly Cloudy    partly-cloudy-day   
16570  1590195600            Partly Cloudy    partly-cloudy-day   
16571  1590199200  Humid and Partly Cloudy  partly-cloudy-night   
16572  1590202800  Humid and Partly Cloudy  partly-cloudy-night   
16573  1590206400  Humid and Partly Cloudy  partly-cloudy-night   

       precipIntensity  precipProbability  temperature  apparentTemperature  \
0               0.0000               0.00        81.87                87.88   
1               0.0000               

In [64]:
df_weather5['time'] = pd.to_datetime(df_weather5['time'], unit='s')

In [65]:
df_weather5

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,2018-07-02 05:00:00,Humid,clear-night,0.0000,0.00,81.87,87.88,74.15,0.78,1015.6,5.45,9.31,191.0,0.01,0.0,8.479,,,
1,2018-07-02 06:00:00,Humid,clear-night,0.0000,0.00,81.29,86.90,74.17,0.79,1015.4,5.30,7.54,195.0,0.00,0.0,7.880,,,
2,2018-07-02 07:00:00,Humid,clear-night,0.0000,0.00,80.34,85.34,74.47,0.82,1014.9,5.49,5.49,218.0,0.00,0.0,8.751,,,
3,2018-07-02 08:00:00,Humid,clear-night,0.0000,0.00,79.75,84.06,74.09,0.83,1014.8,3.53,8.39,187.0,0.01,0.0,7.772,,,
4,2018-07-02 09:00:00,Humid,clear-night,0.0000,0.00,79.23,83.02,74.02,0.84,1015.0,4.25,4.25,210.0,0.05,0.0,7.772,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-23 00:00:00,Partly Cloudy,partly-cloudy-day,0.0000,0.00,87.89,93.74,70.93,0.57,1010.8,12.79,17.16,150.0,0.49,0.0,10.000,,297.5,
16570,2020-05-23 01:00:00,Partly Cloudy,partly-cloudy-day,0.0000,0.00,85.22,90.94,71.60,0.64,1011.0,12.43,18.72,147.0,0.46,0.0,10.000,,298.4,
16571,2020-05-23 02:00:00,Humid and Partly Cloudy,partly-cloudy-night,0.0003,0.01,82.07,87.26,72.57,0.73,1011.5,12.11,20.79,143.0,0.43,0.0,10.000,rain,299.4,
16572,2020-05-23 03:00:00,Humid and Partly Cloudy,partly-cloudy-night,0.0006,0.02,80.04,84.52,73.87,0.82,1011.7,11.78,22.32,142.0,0.43,0.0,10.000,rain,300.4,


In [66]:
print(type(df_weather5['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [67]:
df_merged5 = df5.merge(
    df_weather5,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged5.drop(columns=['time'], inplace=True)

In [68]:
df_merged5

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,ERCO,COAS,,2018-07-02 00:00:00,2018-07-02 05:00:00,houston,Humid,clear-night,0.0000,0.00,...,1015.6,5.45,9.31,191.0,0.01,0.0,8.479,,,
1,ERCO,COAS,,2018-07-02 01:00:00,2018-07-02 06:00:00,houston,Humid,clear-night,0.0000,0.00,...,1015.4,5.30,7.54,195.0,0.00,0.0,7.880,,,
2,ERCO,COAS,,2018-07-02 02:00:00,2018-07-02 07:00:00,houston,Humid,clear-night,0.0000,0.00,...,1014.9,5.49,5.49,218.0,0.00,0.0,8.751,,,
3,ERCO,COAS,,2018-07-02 03:00:00,2018-07-02 08:00:00,houston,Humid,clear-night,0.0000,0.00,...,1014.8,3.53,8.39,187.0,0.01,0.0,7.772,,,
4,ERCO,COAS,,2018-07-02 04:00:00,2018-07-02 09:00:00,houston,Humid,clear-night,0.0000,0.00,...,1015.0,4.25,4.25,210.0,0.05,0.0,7.772,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16498,ERCO,COAS,,2020-05-19 20:00:00,2020-05-20 01:00:00,houston,Humid,clear-day,0.0012,0.01,...,1007.4,9.06,14.34,201.0,0.00,0.0,10.000,rain,293.1,
16499,ERCO,COAS,,2020-05-19 21:00:00,2020-05-20 02:00:00,houston,Humid,clear-night,0.0005,0.01,...,1008.3,9.06,16.53,195.0,0.00,0.0,10.000,rain,292.6,
16500,ERCO,COAS,,2020-05-19 22:00:00,2020-05-20 03:00:00,houston,Humid,clear-night,0.0014,0.01,...,1008.5,8.31,17.15,191.0,0.00,0.0,10.000,rain,292.4,
16501,ERCO,COAS,,2020-05-19 23:00:00,2020-05-20 04:00:00,houston,Humid,clear-night,0.0019,0.01,...,1008.6,8.66,20.45,195.0,0.10,0.0,10.000,rain,292.5,
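
Every displayed Dallas and Houston row has an empty `demand` value, so it may be worth measuring how much of the column is missing for each city before moving on. This is a minimal sketch on a toy frame (`toy` and `missing_share` are hypothetical names; the `la` values come from the rows shown in this notebook, and the `houston` values mirror the empty cells above). It says nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the subregion data
toy = pd.DataFrame({
    "city":   ["houston", "houston", "la", "la"],
    "demand": [np.nan, np.nan, 10681.0, 10197.0],
})

# Fraction of missing demand values per city (1.0 means the column is entirely NaN)
missing_share = toy["demand"].isna().groupby(toy["city"]).mean()
print(missing_share)
```

A share close to 1.0 for a city in the real data would suggest that its demand needs to come from another source, or that the city cannot be used as a prediction target.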


In [69]:
df6 = pd.read_csv(csv_path1)
df6

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [70]:
# only keep rows where city is la
df6 = df6[df6['city'] == 'la']
df6

Unnamed: 0,company,region,demand,local_time,utc_time,city
4417,CISO,SCE,10681.0,2018-07-01 01:00:00,2018-07-01 08:00:00,la
4418,CISO,SCE,10197.0,2018-07-01 02:00:00,2018-07-01 09:00:00,la
4419,CISO,SCE,9776.0,2018-07-01 03:00:00,2018-07-01 10:00:00,la
4420,CISO,SCE,9508.0,2018-07-01 04:00:00,2018-07-01 11:00:00,la
4421,CISO,SCE,9431.0,2018-07-01 05:00:00,2018-07-01 12:00:00,la
...,...,...,...,...,...,...
112129,CISO,SCE,10893.0,2020-05-19 20:00:00,2020-05-20 03:00:00,la
112130,CISO,SCE,11263.0,2020-05-19 21:00:00,2020-05-20 04:00:00,la
112131,CISO,SCE,10952.0,2020-05-19 22:00:00,2020-05-20 05:00:00,la
112132,CISO,SCE,10338.0,2020-05-19 23:00:00,2020-05-20 06:00:00,la


In [71]:
df6 = df6.reset_index(drop=True)
df6

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,CISO,SCE,10681.0,2018-07-01 01:00:00,2018-07-01 08:00:00,la
1,CISO,SCE,10197.0,2018-07-01 02:00:00,2018-07-01 09:00:00,la
2,CISO,SCE,9776.0,2018-07-01 03:00:00,2018-07-01 10:00:00,la
3,CISO,SCE,9508.0,2018-07-01 04:00:00,2018-07-01 11:00:00,la
4,CISO,SCE,9431.0,2018-07-01 05:00:00,2018-07-01 12:00:00,la
...,...,...,...,...,...,...
16531,CISO,SCE,10893.0,2020-05-19 20:00:00,2020-05-20 03:00:00,la
16532,CISO,SCE,11263.0,2020-05-19 21:00:00,2020-05-20 04:00:00,la
16533,CISO,SCE,10952.0,2020-05-19 22:00:00,2020-05-20 05:00:00,la
16534,CISO,SCE,10338.0,2020-05-19 23:00:00,2020-05-20 06:00:00,la


In [72]:
# print the type of the utc_time column
print(type(df6['utc_time'][0]))

<class 'str'>


In [73]:
# convert the utc_time column (the merge key) to datetime
df6['utc_time'] = pd.to_datetime(df6['utc_time'])
df6

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,CISO,SCE,10681.0,2018-07-01 01:00:00,2018-07-01 08:00:00,la
1,CISO,SCE,10197.0,2018-07-01 02:00:00,2018-07-01 09:00:00,la
2,CISO,SCE,9776.0,2018-07-01 03:00:00,2018-07-01 10:00:00,la
3,CISO,SCE,9508.0,2018-07-01 04:00:00,2018-07-01 11:00:00,la
4,CISO,SCE,9431.0,2018-07-01 05:00:00,2018-07-01 12:00:00,la
...,...,...,...,...,...,...
16531,CISO,SCE,10893.0,2020-05-19 20:00:00,2020-05-20 03:00:00,la
16532,CISO,SCE,11263.0,2020-05-19 21:00:00,2020-05-20 04:00:00,la
16533,CISO,SCE,10952.0,2020-05-19 22:00:00,2020-05-20 05:00:00,la
16534,CISO,SCE,10338.0,2020-05-19 23:00:00,2020-05-20 06:00:00,la


In [74]:
print(type(df6['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [75]:
df_weather6 = pd.read_json('/Volumes/meen/la.json', orient='records')
print(df_weather6)

             time   summary         icon  precipIntensity  precipProbability  \
0      1530428400     Clear  clear-night           0.0000               0.00   
1      1530432000  Overcast       cloudy           0.0000               0.00   
2      1530435600  Overcast       cloudy           0.0000               0.00   
3      1530439200  Overcast       cloudy           0.0000               0.00   
4      1530442800  Overcast       cloudy           0.0000               0.00   
...           ...       ...          ...              ...                ...   
16569  1590112800     Clear    clear-day           0.0083               0.01   
16570  1590116400     Clear  clear-night           0.0000               0.00   
16571  1590120000     Clear  clear-night           0.0000               0.00   
16572  1590123600     Clear  clear-night           0.0000               0.00   
16573  1590127200     Clear  clear-night           0.0000               0.00   

       temperature  apparentTemperature

In [76]:
df_weather6['time'] = pd.to_datetime(df_weather6['time'], unit='s')

In [77]:
df_weather6

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone
0,2018-07-01 07:00:00,Clear,clear-night,0.0000,0.00,65.45,65.45,58.64,0.79,1014.5,4.23,4.23,243.0,0.25,0.0,9.798,,
1,2018-07-01 08:00:00,Overcast,cloudy,0.0000,0.00,65.16,65.16,58.62,0.79,1014.4,3.95,3.95,193.0,0.88,0.0,9.777,,
2,2018-07-01 09:00:00,Overcast,cloudy,0.0000,0.00,64.58,64.58,58.23,0.80,1014.1,4.21,4.21,185.0,0.92,0.0,9.778,,
3,2018-07-01 10:00:00,Overcast,cloudy,0.0000,0.00,64.46,64.46,57.87,0.79,1013.9,4.01,4.04,175.0,0.99,0.0,9.782,,
4,2018-07-01 11:00:00,Overcast,cloudy,0.0000,0.00,64.19,64.19,57.96,0.80,1014.1,3.81,3.81,182.0,1.00,0.0,9.108,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-22 02:00:00,Clear,clear-day,0.0083,0.01,78.69,78.69,48.05,0.34,1010.7,4.62,8.96,239.0,0.00,0.0,10.000,rain,316.3
16570,2020-05-22 03:00:00,Clear,clear-night,0.0000,0.00,74.25,74.25,49.05,0.41,1010.9,2.41,7.73,250.0,0.01,0.0,10.000,,317.7
16571,2020-05-22 04:00:00,Clear,clear-night,0.0000,0.00,70.94,70.94,48.91,0.46,1011.1,2.58,6.56,294.0,0.01,0.0,10.000,,317.5
16572,2020-05-22 05:00:00,Clear,clear-night,0.0000,0.00,68.82,68.82,48.74,0.49,1011.8,2.68,5.03,246.0,0.02,0.0,10.000,,316.3


In [78]:
print(type(df_weather6['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [79]:
df_merged6 = df6.merge(
    df_weather6,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged6.drop(columns=['time'], inplace=True)

In [80]:
df_merged6

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone
0,CISO,SCE,10681.0,2018-07-01 01:00:00,2018-07-01 08:00:00,la,Overcast,cloudy,0.0000,0.00,...,0.79,1014.4,3.95,3.95,193.0,0.88,0.0,9.777,,
1,CISO,SCE,10197.0,2018-07-01 02:00:00,2018-07-01 09:00:00,la,Overcast,cloudy,0.0000,0.00,...,0.80,1014.1,4.21,4.21,185.0,0.92,0.0,9.778,,
2,CISO,SCE,9776.0,2018-07-01 03:00:00,2018-07-01 10:00:00,la,Overcast,cloudy,0.0000,0.00,...,0.79,1013.9,4.01,4.04,175.0,0.99,0.0,9.782,,
3,CISO,SCE,9508.0,2018-07-01 04:00:00,2018-07-01 11:00:00,la,Overcast,cloudy,0.0000,0.00,...,0.80,1014.1,3.81,3.81,182.0,1.00,0.0,9.108,,
4,CISO,SCE,9431.0,2018-07-01 05:00:00,2018-07-01 12:00:00,la,Overcast,cloudy,0.0000,0.00,...,0.80,1014.6,3.81,3.81,181.0,1.00,0.0,9.709,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16521,CISO,SCE,10893.0,2020-05-19 20:00:00,2020-05-20 03:00:00,la,Clear,clear-night,0.0012,0.01,...,0.52,1015.8,3.97,8.93,261.0,0.03,0.0,10.000,rain,347.5
16522,CISO,SCE,11263.0,2020-05-19 21:00:00,2020-05-20 04:00:00,la,Clear,clear-night,0.0015,0.01,...,0.59,1016.2,3.34,8.24,286.0,0.03,0.0,10.000,rain,345.1
16523,CISO,SCE,10952.0,2020-05-19 22:00:00,2020-05-20 05:00:00,la,Clear,clear-night,0.0000,0.00,...,0.63,1016.8,2.29,6.37,262.0,0.02,0.0,10.000,,342.2
16524,CISO,SCE,10338.0,2020-05-19 23:00:00,2020-05-20 06:00:00,la,Clear,clear-night,0.0000,0.00,...,0.67,1017.3,1.61,4.62,253.0,0.01,0.0,10.000,,339.9


In [81]:
df7 = pd.read_csv(csv_path1)
df7

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [82]:
# only keep rows where city is philadelphia
df7 = df7[df7['city'] == 'philadelphia']
df7

Unnamed: 0,company,region,demand,local_time,utc_time,city
13251,PJM,PE,4397.0,2018-07-01 01:00:00,2018-07-01 05:00:00,philadelphia
13252,PJM,PE,4423.0,2018-07-01 02:00:00,2018-07-01 06:00:00,philadelphia
13253,PJM,PE,4743.0,2018-07-01 03:00:00,2018-07-01 07:00:00,philadelphia
13254,PJM,PE,5230.0,2018-07-01 04:00:00,2018-07-01 08:00:00,philadelphia
13255,PJM,PE,5752.0,2018-07-01 05:00:00,2018-07-01 09:00:00,philadelphia
...,...,...,...,...,...,...
118847,PJM,PE,,2020-05-19 20:00:00,2020-05-20 00:00:00,philadelphia
118848,PJM,PE,,2020-05-19 21:00:00,2020-05-20 01:00:00,philadelphia
118849,PJM,PE,,2020-05-19 22:00:00,2020-05-20 02:00:00,philadelphia
118850,PJM,PE,,2020-05-19 23:00:00,2020-05-20 03:00:00,philadelphia


In [83]:
# reset the index to 0..n-1 and discard the old row labels
df7 = df7.reset_index(drop=True)
df7

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,PJM,PE,4397.0,2018-07-01 01:00:00,2018-07-01 05:00:00,philadelphia
1,PJM,PE,4423.0,2018-07-01 02:00:00,2018-07-01 06:00:00,philadelphia
2,PJM,PE,4743.0,2018-07-01 03:00:00,2018-07-01 07:00:00,philadelphia
3,PJM,PE,5230.0,2018-07-01 04:00:00,2018-07-01 08:00:00,philadelphia
4,PJM,PE,5752.0,2018-07-01 05:00:00,2018-07-01 09:00:00,philadelphia
...,...,...,...,...,...,...
16531,PJM,PE,,2020-05-19 20:00:00,2020-05-20 00:00:00,philadelphia
16532,PJM,PE,,2020-05-19 21:00:00,2020-05-20 01:00:00,philadelphia
16533,PJM,PE,,2020-05-19 22:00:00,2020-05-20 02:00:00,philadelphia
16534,PJM,PE,,2020-05-19 23:00:00,2020-05-20 03:00:00,philadelphia


In [84]:
# print the type of the utc_time column
print(type(df7['utc_time'][0]))

<class 'str'>


In [85]:
# convert the utc_time column (the merge key) to datetime
df7['utc_time'] = pd.to_datetime(df7['utc_time'])
df7

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,PJM,PE,4397.0,2018-07-01 01:00:00,2018-07-01 05:00:00,philadelphia
1,PJM,PE,4423.0,2018-07-01 02:00:00,2018-07-01 06:00:00,philadelphia
2,PJM,PE,4743.0,2018-07-01 03:00:00,2018-07-01 07:00:00,philadelphia
3,PJM,PE,5230.0,2018-07-01 04:00:00,2018-07-01 08:00:00,philadelphia
4,PJM,PE,5752.0,2018-07-01 05:00:00,2018-07-01 09:00:00,philadelphia
...,...,...,...,...,...,...
16531,PJM,PE,,2020-05-19 20:00:00,2020-05-20 00:00:00,philadelphia
16532,PJM,PE,,2020-05-19 21:00:00,2020-05-20 01:00:00,philadelphia
16533,PJM,PE,,2020-05-19 22:00:00,2020-05-20 02:00:00,philadelphia
16534,PJM,PE,,2020-05-19 23:00:00,2020-05-20 03:00:00,philadelphia


In [86]:
print(type(df7['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [87]:
df_weather7 = pd.read_json('/Volumes/meen/philadelphia.json', orient='records')
print(df_weather7)

             time              summary         icon  precipIntensity  \
0      1530504000                Clear  clear-night           0.0000   
1      1530507600                Clear  clear-night           0.0000   
2      1530511200                Clear  clear-night           0.0000   
3      1530514800                Clear  clear-night           0.0000   
4      1530518400                Clear  clear-night           0.0000   
...           ...                  ...          ...              ...   
16569  1590188400  Possible Light Rain         rain           0.0111   
16570  1590192000  Possible Light Rain         rain           0.0103   
16571  1590195600     Possible Drizzle         rain           0.0066   
16572  1590199200             Overcast       cloudy           0.0030   
16573  1590202800             Overcast       cloudy           0.0016   

       precipProbability  temperature  apparentTemperature  dewPoint  \
0                   0.00        80.27                83.79     

In [88]:
df_weather7['time'] = pd.to_datetime(df_weather7['time'], unit='s')

In [89]:
df_weather7

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,2018-07-02 04:00:00,Clear,clear-night,0.0000,0.00,80.27,83.79,70.79,0.73,1017.3,3.03,3.21,238.0,0.12,0.0,9.964,,,
1,2018-07-02 05:00:00,Clear,clear-night,0.0000,0.00,79.74,82.67,69.59,0.71,1017.2,2.39,2.39,226.0,0.05,0.0,9.705,,,
2,2018-07-02 06:00:00,Clear,clear-night,0.0000,0.00,78.14,79.19,69.70,0.75,1017.6,1.78,1.78,226.0,0.20,0.0,9.553,,,
3,2018-07-02 07:00:00,Clear,clear-night,0.0000,0.00,76.94,78.02,69.75,0.79,1017.6,0.95,1.21,192.0,0.12,0.0,9.401,,,
4,2018-07-02 08:00:00,Clear,clear-night,0.0000,0.00,76.07,77.20,69.96,0.81,1017.5,1.23,1.49,214.0,0.05,0.0,8.918,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-22 23:00:00,Possible Light Rain,rain,0.0111,0.42,65.13,65.54,61.87,0.89,1017.1,6.41,11.84,118.0,1.00,0.0,10.000,rain,333.4,
16570,2020-05-23 00:00:00,Possible Light Rain,rain,0.0103,0.41,64.70,65.08,61.63,0.90,1016.8,5.73,10.80,116.0,1.00,0.0,10.000,rain,334.8,
16571,2020-05-23 01:00:00,Possible Drizzle,rain,0.0066,0.34,64.42,64.86,61.89,0.92,1016.8,5.23,10.11,119.0,0.98,0.0,10.000,rain,335.1,
16572,2020-05-23 02:00:00,Overcast,cloudy,0.0030,0.22,64.37,64.84,62.06,0.92,1016.8,4.74,9.59,123.0,0.97,0.0,10.000,rain,334.8,


In [90]:
print(type(df_weather7['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [91]:
df_merged7 = df7.merge(
    df_weather7,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged7.drop(columns=['time'], inplace=True)

In [92]:
df_merged7

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,PJM,PE,4774.0,2018-07-02 00:00:00,2018-07-02 04:00:00,philadelphia,Clear,clear-night,0.0000,0.00,...,1017.3,3.03,3.21,238.0,0.12,0.0,9.964,,,
1,PJM,PE,4397.0,2018-07-02 01:00:00,2018-07-02 05:00:00,philadelphia,Clear,clear-night,0.0000,0.00,...,1017.2,2.39,2.39,226.0,0.05,0.0,9.705,,,
2,PJM,PE,4423.0,2018-07-02 02:00:00,2018-07-02 06:00:00,philadelphia,Clear,clear-night,0.0000,0.00,...,1017.6,1.78,1.78,226.0,0.20,0.0,9.553,,,
3,PJM,PE,4743.0,2018-07-02 03:00:00,2018-07-02 07:00:00,philadelphia,Clear,clear-night,0.0000,0.00,...,1017.6,0.95,1.21,192.0,0.12,0.0,9.401,,,
4,PJM,PE,5230.0,2018-07-02 04:00:00,2018-07-02 08:00:00,philadelphia,Clear,clear-night,0.0000,0.00,...,1017.5,1.23,1.49,214.0,0.05,0.0,8.918,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16498,PJM,PE,,2020-05-19 20:00:00,2020-05-20 00:00:00,philadelphia,Mostly Cloudy,partly-cloudy-day,0.0000,0.00,...,1026.8,13.21,25.27,89.0,0.87,0.0,10.000,,315.3,
16499,PJM,PE,,2020-05-19 21:00:00,2020-05-20 01:00:00,philadelphia,Mostly Cloudy,partly-cloudy-night,0.0000,0.00,...,1027.2,12.17,24.75,88.0,0.84,0.0,10.000,,318.1,
16500,PJM,PE,,2020-05-19 22:00:00,2020-05-20 02:00:00,philadelphia,Mostly Cloudy,partly-cloudy-night,0.0000,0.00,...,1027.4,11.78,25.32,86.0,0.79,0.0,10.000,,318.2,
16501,PJM,PE,,2020-05-19 23:00:00,2020-05-20 03:00:00,philadelphia,Mostly Cloudy,partly-cloudy-night,0.0013,0.01,...,1027.5,11.15,25.04,83.0,0.79,0.0,10.000,rain,318.0,
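
The same filter, reset, convert, and merge steps repeat for every city, so they could be wrapped in one helper. The function below is a sketch, not the notebook's actual code: `merge_city` is a hypothetical name, and the toy inputs reuse a few Philadelphia values from the output above. It assumes that `utc_time` arrives as strings and `time` as epoch seconds, as in the cells shown so far.

```python
import pandas as pd

def merge_city(demand_all, weather, city):
    """Filter demand rows to one city and inner-join them with that city's weather."""
    # Keep only this city's rows and renumber the index from 0
    city_df = demand_all[demand_all["city"] == city].reset_index(drop=True)
    city_df["utc_time"] = pd.to_datetime(city_df["utc_time"])

    # Work on a copy so the caller's weather frame is left unchanged
    weather = weather.copy()
    weather["time"] = pd.to_datetime(weather["time"], unit="s")

    merged = city_df.merge(weather, left_on="utc_time", right_on="time", how="inner")
    return merged.drop(columns=["time"])

# Toy inputs standing in for the CSV and one city's JSON
demand_all = pd.DataFrame({
    "company": ["PJM", "PJM", "CISO"],
    "utc_time": ["2018-07-02 04:00:00", "2018-07-02 05:00:00", "2018-07-01 08:00:00"],
    "city": ["philadelphia", "philadelphia", "la"],
})
weather = pd.DataFrame({
    "time": [1530504000, 1530507600],
    "temperature": [80.27, 79.74],
})

merged = merge_city(demand_all, weather, "philadelphia")
print(merged)
```

With a helper like this, each per-city block would reduce to one `pd.read_json` call followed by one `merge_city` call.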


In [130]:
df8 = pd.read_csv(csv_path1)
df8

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [131]:
df8 = df8[df8['city'] == 'san antonio']
df8

Unnamed: 0,company,region,demand,local_time,utc_time,city
17668,ERCO,SCEN,,2018-07-01 01:00:00,2018-07-01 06:00:00,san antonio
17669,ERCO,SCEN,,2018-07-01 02:00:00,2018-07-01 07:00:00,san antonio
17670,ERCO,SCEN,,2018-07-01 03:00:00,2018-07-01 08:00:00,san antonio
17671,ERCO,SCEN,,2018-07-01 04:00:00,2018-07-01 09:00:00,san antonio
17672,ERCO,SCEN,,2018-07-01 05:00:00,2018-07-01 10:00:00,san antonio
...,...,...,...,...,...,...
122206,ERCO,SCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,san antonio
122207,ERCO,SCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,san antonio
122208,ERCO,SCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,san antonio
122209,ERCO,SCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,san antonio


In [132]:
df8 = df8.reset_index(drop=True)
df8

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,ERCO,SCEN,,2018-07-01 01:00:00,2018-07-01 06:00:00,san antonio
1,ERCO,SCEN,,2018-07-01 02:00:00,2018-07-01 07:00:00,san antonio
2,ERCO,SCEN,,2018-07-01 03:00:00,2018-07-01 08:00:00,san antonio
3,ERCO,SCEN,,2018-07-01 04:00:00,2018-07-01 09:00:00,san antonio
4,ERCO,SCEN,,2018-07-01 05:00:00,2018-07-01 10:00:00,san antonio
...,...,...,...,...,...,...
16531,ERCO,SCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,san antonio
16532,ERCO,SCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,san antonio
16533,ERCO,SCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,san antonio
16534,ERCO,SCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,san antonio


In [133]:
# print the type of the utc_time column
print(type(df8['utc_time'][0]))
# convert the utc_time column (the merge key) to datetime
df8['utc_time'] = pd.to_datetime(df8['utc_time'])
df8

<class 'str'>


Unnamed: 0,company,region,demand,local_time,utc_time,city
0,ERCO,SCEN,,2018-07-01 01:00:00,2018-07-01 06:00:00,san antonio
1,ERCO,SCEN,,2018-07-01 02:00:00,2018-07-01 07:00:00,san antonio
2,ERCO,SCEN,,2018-07-01 03:00:00,2018-07-01 08:00:00,san antonio
3,ERCO,SCEN,,2018-07-01 04:00:00,2018-07-01 09:00:00,san antonio
4,ERCO,SCEN,,2018-07-01 05:00:00,2018-07-01 10:00:00,san antonio
...,...,...,...,...,...,...
16531,ERCO,SCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,san antonio
16532,ERCO,SCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,san antonio
16533,ERCO,SCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,san antonio
16534,ERCO,SCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,san antonio


In [134]:
print(type(df8['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [135]:
df_weather8 = pd.read_json('/Volumes/meen/san_antonio.json', orient='records')
print(df_weather8)

             time                  summary                 icon  \
0      1530507600                    Humid          clear-night   
1      1530511200                    Humid          clear-night   
2      1530514800  Humid and Mostly Cloudy  partly-cloudy-night   
3      1530518400  Humid and Mostly Cloudy  partly-cloudy-night   
4      1530522000  Humid and Partly Cloudy  partly-cloudy-night   
...           ...                      ...                  ...   
16569  1590192000  Humid and Partly Cloudy    partly-cloudy-day   
16570  1590195600                    Humid            clear-day   
16571  1590199200                    Clear          clear-night   
16572  1590202800                    Clear          clear-night   
16573  1590206400                    Clear          clear-night   

       precipIntensity  precipProbability  temperature  apparentTemperature  \
0               0.0000               0.00        81.09                86.15   
1               0.0000               

In [136]:
df_weather8['time'] = pd.to_datetime(df_weather8['time'], unit='s')

In [137]:
df_weather8

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,2018-07-02 05:00:00,Humid,clear-night,0.0000,0.00,81.09,86.15,73.38,0.77,1014.7,9.02,9.02,131.0,0.01,0.0,9.997,,,
1,2018-07-02 06:00:00,Humid,clear-night,0.0000,0.00,79.40,82.99,72.74,0.80,1014.7,9.18,9.18,131.0,0.02,0.0,9.976,,,
2,2018-07-02 07:00:00,Humid and Mostly Cloudy,partly-cloudy-night,0.0000,0.00,78.22,79.70,73.07,0.84,1014.3,11.20,11.20,141.0,0.75,0.0,9.976,,,
3,2018-07-02 08:00:00,Humid and Mostly Cloudy,partly-cloudy-night,0.0000,0.00,78.22,79.67,72.85,0.84,1014.6,8.01,8.01,162.0,0.75,0.0,9.976,,,
4,2018-07-02 09:00:00,Humid and Partly Cloudy,partly-cloudy-night,0.0000,0.00,77.08,78.47,72.18,0.85,1013.6,4.81,4.81,181.0,0.45,0.0,9.976,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-23 00:00:00,Humid and Partly Cloudy,partly-cloudy-day,0.0000,0.00,90.56,99.22,73.18,0.57,1007.0,13.32,17.07,144.0,0.41,1.0,10.000,,302.6,
16570,2020-05-23 01:00:00,Humid,clear-day,0.0000,0.00,88.32,95.58,72.41,0.59,1007.4,13.42,19.50,143.0,0.30,0.0,10.000,,302.0,
16571,2020-05-23 02:00:00,Clear,clear-night,0.0004,0.01,85.65,91.22,71.25,0.62,1008.3,13.46,22.66,142.0,0.13,0.0,10.000,rain,301.1,
16572,2020-05-23 03:00:00,Clear,clear-night,0.0010,0.02,83.25,87.76,70.63,0.66,1008.8,13.17,24.94,144.0,0.03,0.0,10.000,rain,300.5,


In [138]:
print(type(df_weather8['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
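
As a quick sanity check on the conversion above, the sketch below converts the first two raw `time` values printed for San Antonio. It assumes, as the later merge on `utc_time` implies, that the JSON `time` field holds Unix epoch seconds in UTC. With `unit='s'`, `pd.to_datetime` returns timezone-naive timestamps, which is what makes them directly comparable with the parsed `utc_time` strings.

```python
import pandas as pd

# First two raw `time` values from the San Antonio JSON printout
raw = pd.Series([1530507600, 1530511200])

# unit='s' interprets the integers as seconds since 1970-01-01 00:00:00 UTC
converted = pd.to_datetime(raw, unit='s')
print(converted)

# The result carries no timezone, like pd.to_datetime applied to the CSV strings
print(converted.dt.tz)
```

The converted values agree with the first two rows of the `df_weather8` table shown above.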


In [139]:
df_merged8 = df8.merge(
    df_weather8,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged8.drop(columns=['time'], inplace=True)

In [140]:
df_merged8

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation
0,ERCO,SCEN,,2018-07-02 00:00:00,2018-07-02 05:00:00,san antonio,Humid,clear-night,0.0,0.0,...,1014.7,9.02,9.02,131.0,0.01,0.0,9.997,,,
1,ERCO,SCEN,,2018-07-02 01:00:00,2018-07-02 06:00:00,san antonio,Humid,clear-night,0.0,0.0,...,1014.7,9.18,9.18,131.0,0.02,0.0,9.976,,,
2,ERCO,SCEN,,2018-07-02 02:00:00,2018-07-02 07:00:00,san antonio,Humid and Mostly Cloudy,partly-cloudy-night,0.0,0.0,...,1014.3,11.20,11.20,141.0,0.75,0.0,9.976,,,
3,ERCO,SCEN,,2018-07-02 03:00:00,2018-07-02 08:00:00,san antonio,Humid and Mostly Cloudy,partly-cloudy-night,0.0,0.0,...,1014.6,8.01,8.01,162.0,0.75,0.0,9.976,,,
4,ERCO,SCEN,,2018-07-02 04:00:00,2018-07-02 09:00:00,san antonio,Humid and Partly Cloudy,partly-cloudy-night,0.0,0.0,...,1013.6,4.81,4.81,181.0,0.45,0.0,9.976,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16498,ERCO,SCEN,,2020-05-19 20:00:00,2020-05-20 01:00:00,san antonio,Clear,clear-day,0.0,0.0,...,1005.5,7.09,13.00,114.0,0.00,0.0,10.000,,290.4,
16499,ERCO,SCEN,,2020-05-19 21:00:00,2020-05-20 02:00:00,san antonio,Clear,clear-night,0.0,0.0,...,1006.0,6.93,14.56,112.0,0.00,0.0,10.000,,290.4,
16500,ERCO,SCEN,,2020-05-19 22:00:00,2020-05-20 03:00:00,san antonio,Clear,clear-night,0.0,0.0,...,1006.3,7.71,16.61,113.0,0.00,0.0,10.000,,290.5,
16501,ERCO,SCEN,,2020-05-19 23:00:00,2020-05-20 04:00:00,san antonio,Humid,clear-night,0.0,0.0,...,1007.2,10.47,18.69,139.0,0.00,0.0,10.000,,290.7,
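
The inner join silently drops hours that lack a match on the other side: the row labels of `df8` run up to 16534, while those of `df_merged8` end at 16501. One way to see which side the unmatched rows come from is an outer merge with `indicator=True`. The sketch below uses two tiny hypothetical frames (`demand` and `weather`) with made-up values rather than the real data.

```python
import pandas as pd

# Hypothetical stand-ins for df8 (demand) and df_weather8 (weather)
demand = pd.DataFrame({
    'utc_time': pd.to_datetime(['2018-07-01 06:00', '2018-07-01 07:00', '2018-07-01 08:00']),
    'demand': [10.0, 11.0, 12.0],
})
weather = pd.DataFrame({
    'time': pd.to_datetime(['2018-07-01 07:00', '2018-07-01 08:00', '2018-07-01 09:00']),
    'temperature': [80.1, 79.4, 78.2],
})

# Outer merge keeps every row; the `_merge` column records where each row came from
check = demand.merge(weather, left_on='utc_time', right_on='time',
                     how='outer', indicator=True)

# left_only = demand hours without weather, right_only = weather hours without demand
counts = check['_merge'].value_counts()
print(counts)

# Number of rows an inner join would keep
inner_rows = int((check['_merge'] == 'both').sum())
print(inner_rows)
```

Running the same check on the real frames would show whether the dropped hours are missing from the weather JSON, the demand CSV, or both.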


In [109]:
df9 = pd.read_csv(csv_path1)
df9

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,NYIS,ZONJ,7269.0,2018-07-01 01:00:00,2018-07-01 05:00:00,nyc
1,NYIS,ZONJ,6977.0,2018-07-01 02:00:00,2018-07-01 06:00:00,nyc
2,NYIS,ZONJ,6725.0,2018-07-01 03:00:00,2018-07-01 07:00:00,nyc
3,NYIS,ZONJ,6539.0,2018-07-01 04:00:00,2018-07-01 08:00:00,nyc
4,NYIS,ZONJ,6415.0,2018-07-01 05:00:00,2018-07-01 09:00:00,nyc
...,...,...,...,...,...,...
132283,CISO,PGAE,11578.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san jose
132284,CISO,PGAE,11782.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san jose
132285,CISO,PGAE,11592.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san jose
132286,CISO,PGAE,11083.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san jose


In [111]:
df9 = df9[df9['city'] == 'san diego']
df9

Unnamed: 0,company,region,demand,local_time,utc_time,city
22085,CISO,SDGE,2023.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san diego
22086,CISO,SDGE,1896.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san diego
22087,CISO,SDGE,1857.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san diego
22088,CISO,SDGE,1825.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san diego
22089,CISO,SDGE,1798.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san diego
...,...,...,...,...,...,...
125565,CISO,SDGE,2220.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san diego
125566,CISO,SDGE,2317.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san diego
125567,CISO,SDGE,2227.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san diego
125568,CISO,SDGE,2056.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san diego


In [115]:
# reset to a fresh 0-based index, discarding the old row labels
df9 = df9.reset_index(drop=True)
df9

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,CISO,SDGE,2023.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san diego
1,CISO,SDGE,1896.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san diego
2,CISO,SDGE,1857.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san diego
3,CISO,SDGE,1825.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san diego
4,CISO,SDGE,1798.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san diego
...,...,...,...,...,...,...
16531,CISO,SDGE,2220.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san diego
16532,CISO,SDGE,2317.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san diego
16533,CISO,SDGE,2227.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san diego
16534,CISO,SDGE,2056.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san diego


In [116]:
# check the type of the utc_time column (a plain str before conversion)
print(type(df9['utc_time'][0]))

<class 'str'>


In [117]:
# convert the utc_time column to datetime
df9['utc_time'] = pd.to_datetime(df9['utc_time'])
df9

Unnamed: 0,company,region,demand,local_time,utc_time,city
0,CISO,SDGE,2023.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san diego
1,CISO,SDGE,1896.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san diego
2,CISO,SDGE,1857.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san diego
3,CISO,SDGE,1825.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san diego
4,CISO,SDGE,1798.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san diego
...,...,...,...,...,...,...
16531,CISO,SDGE,2220.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san diego
16532,CISO,SDGE,2317.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san diego
16533,CISO,SDGE,2227.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san diego
16534,CISO,SDGE,2056.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san diego


In [118]:
print(type(df9['utc_time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [119]:
df_weather9 = pd.read_json('/Volumes/meen/san_diego.json', orient='records')
print(df_weather9)

             time   summary         icon  precipIntensity  precipProbability  \
0      1530428400  Overcast       cloudy              0.0                0.0   
1      1530432000  Overcast       cloudy              0.0                0.0   
2      1530435600  Overcast       cloudy              0.0                0.0   
3      1530439200  Overcast       cloudy              0.0                0.0   
4      1530442800  Overcast       cloudy              0.0                0.0   
...           ...       ...          ...              ...                ...   
16569  1590112800     Clear    clear-day              0.0                0.0   
16570  1590116400     Clear  clear-night              0.0                0.0   
16571  1590120000     Clear  clear-night              0.0                0.0   
16572  1590123600     Clear  clear-night              0.0                0.0   
16573  1590127200     Clear  clear-night              0.0                0.0   

       temperature  apparentTemperature

In [125]:
df_weather9['time'] = pd.to_datetime(df_weather9['time'], unit='s')

In [126]:
df_weather9

Unnamed: 0,time,summary,icon,precipIntensity,precipProbability,temperature,apparentTemperature,dewPoint,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone
0,2018-07-01 07:00:00,Overcast,cloudy,0.0,0.0,61.71,61.71,55.47,0.80,1015.9,3.53,3.53,210.0,1.00,0.0,9.969,,
1,2018-07-01 08:00:00,Overcast,cloudy,0.0,0.0,62.26,62.26,55.50,0.79,1015.7,3.06,3.06,231.0,1.00,0.0,9.972,,
2,2018-07-01 09:00:00,Overcast,cloudy,0.0,0.0,62.27,62.27,55.52,0.79,1015.3,4.34,4.34,166.0,1.00,0.0,9.962,,
3,2018-07-01 10:00:00,Overcast,cloudy,0.0,0.0,62.13,62.13,55.29,0.78,1015.0,3.91,3.91,223.0,0.99,0.0,9.943,,
4,2018-07-01 11:00:00,Overcast,cloudy,0.0,0.0,61.52,61.52,55.84,0.82,1015.2,3.37,3.37,207.0,1.00,0.0,9.921,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16569,2020-05-22 02:00:00,Clear,clear-day,0.0,0.0,68.73,68.73,57.75,0.68,1012.1,4.74,6.59,269.0,0.02,0.0,10.000,,312.2
16570,2020-05-22 03:00:00,Clear,clear-night,0.0,0.0,66.17,66.17,58.35,0.76,1012.2,3.77,4.58,272.0,0.02,0.0,10.000,,312.1
16571,2020-05-22 04:00:00,Clear,clear-night,0.0,0.0,64.98,64.98,58.65,0.80,1012.6,2.83,3.40,267.0,0.02,0.0,10.000,,312.7
16572,2020-05-22 05:00:00,Clear,clear-night,0.0,0.0,64.13,64.14,58.83,0.83,1013.2,2.60,2.60,239.0,0.02,0.0,10.000,,313.7


In [127]:
print(type(df_weather9['time'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [142]:
df_merged9 = df9.merge(
    df_weather9,
    left_on='utc_time',
    right_on='time',
    how='inner'
)
df_merged9.drop(columns=['time'], inplace=True)

In [143]:
df_merged9

Unnamed: 0,company,region,demand,local_time,utc_time,city,summary,icon,precipIntensity,precipProbability,...,humidity,pressure,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone
0,CISO,SDGE,2023.0,2018-07-01 01:00:00,2018-07-01 08:00:00,san diego,Overcast,cloudy,0.0000,0.00,...,0.79,1015.7,3.06,3.06,231.0,1.00,0.0,9.972,,
1,CISO,SDGE,1896.0,2018-07-01 02:00:00,2018-07-01 09:00:00,san diego,Overcast,cloudy,0.0000,0.00,...,0.79,1015.3,4.34,4.34,166.0,1.00,0.0,9.962,,
2,CISO,SDGE,1857.0,2018-07-01 03:00:00,2018-07-01 10:00:00,san diego,Overcast,cloudy,0.0000,0.00,...,0.78,1015.0,3.91,3.91,223.0,0.99,0.0,9.943,,
3,CISO,SDGE,1825.0,2018-07-01 04:00:00,2018-07-01 11:00:00,san diego,Overcast,cloudy,0.0000,0.00,...,0.82,1015.2,3.37,3.37,207.0,1.00,0.0,9.921,,
4,CISO,SDGE,1798.0,2018-07-01 05:00:00,2018-07-01 12:00:00,san diego,Overcast,cloudy,0.0000,0.00,...,0.80,1015.5,2.71,2.71,300.0,1.00,0.0,9.920,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16521,CISO,SDGE,2220.0,2020-05-19 20:00:00,2020-05-20 03:00:00,san diego,Clear,clear-night,0.0000,0.00,...,0.66,1017.3,6.18,9.03,293.0,0.16,0.0,10.000,,324.5
16522,CISO,SDGE,2317.0,2020-05-19 21:00:00,2020-05-20 04:00:00,san diego,Clear,clear-night,0.0009,0.01,...,0.68,1017.7,5.14,7.11,306.0,0.28,0.0,10.000,rain,326.5
16523,CISO,SDGE,2227.0,2020-05-19 22:00:00,2020-05-20 05:00:00,san diego,Clear,clear-night,0.0000,0.00,...,0.70,1018.1,2.93,2.93,294.0,0.22,0.0,10.000,,328.4
16524,CISO,SDGE,2056.0,2020-05-19 23:00:00,2020-05-20 06:00:00,san diego,Partly Cloudy,partly-cloudy-night,0.0000,0.00,...,0.71,1018.2,2.83,3.45,282.0,0.48,0.0,10.000,,329.4


In [157]:
# put them in a list
dfs = [
    df_merged,
    df_merged1,
    df_merged2,
    df_merged3,
    df_merged4,
    df_merged5,
    df_merged6,
    df_merged7,
    df_merged8,
    df_merged9,
]

# stack them into one big DataFrame
df_all = pd.concat(dfs, ignore_index=True)

# inspect
print(df_all.shape)
print(df_all.city.unique())
df_all.head(50)

(163993, 24)
['seattle' 'phoenix' 'nyc' 'san jose' 'dallas' 'houston' 'la'
 'philadelphia' 'san antonio' 'san diego']


Unnamed: 0,company,local_time,utc_time,demand,city,summary,icon,precipIntensity,precipProbability,temperature,...,windSpeed,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation,region
0,SCL,2018-07-01 01:00:00,2018-07-01 08:00:00,809.0,seattle,Overcast,cloudy,0.0,0.0,58.96,...,5.02,5.02,201.0,1.0,0.0,3.775,,,,
1,SCL,2018-07-01 02:00:00,2018-07-01 09:00:00,779.0,seattle,Overcast,cloudy,0.0,0.0,58.56,...,4.2,4.2,199.0,1.0,0.0,3.56,,,,
2,SCL,2018-07-01 03:00:00,2018-07-01 10:00:00,753.0,seattle,Overcast,cloudy,0.0,0.0,58.26,...,3.66,3.66,195.0,1.0,0.0,3.482,,,,
3,SCL,2018-07-01 04:00:00,2018-07-01 11:00:00,748.0,seattle,Overcast,cloudy,0.0007,0.15,58.04,...,4.19,4.19,190.0,1.0,0.0,3.351,rain,,,
4,SCL,2018-07-01 05:00:00,2018-07-01 12:00:00,745.0,seattle,Overcast,cloudy,0.0012,0.08,57.73,...,3.78,3.78,184.0,1.0,0.0,2.758,rain,,,
5,SCL,2018-07-01 06:00:00,2018-07-01 13:00:00,749.0,seattle,Overcast,cloudy,0.0011,0.19,57.43,...,3.38,3.38,186.0,1.0,0.0,2.608,rain,,,
6,SCL,2018-07-01 07:00:00,2018-07-01 14:00:00,774.0,seattle,Overcast,cloudy,0.0014,0.17,57.51,...,3.57,3.57,197.0,1.0,0.0,2.57,rain,,,
7,SCL,2018-07-01 08:00:00,2018-07-01 15:00:00,827.0,seattle,Overcast,cloudy,0.001,0.19,57.97,...,3.81,3.81,196.0,1.0,1.0,2.65,rain,,,
8,SCL,2018-07-01 09:00:00,2018-07-01 16:00:00,881.0,seattle,Overcast,cloudy,0.0011,0.07,57.98,...,4.97,4.97,207.0,0.99,2.0,2.809,rain,,,
9,SCL,2018-07-01 10:00:00,2018-07-01 17:00:00,933.0,seattle,Overcast,cloudy,0.001,0.08,58.31,...,4.22,4.22,197.0,0.96,3.0,2.761,rain,,,
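
The ten per-city blocks above repeat the same steps: filter to one city, reset the index, parse `utc_time`, convert the weather `time`, and inner-merge. A minimal sketch of how that could be folded into one function follows. `merge_city` is a hypothetical helper, not part of the original notebook, and the toy frames (made-up values) stand in for the real CSV and JSON data.

```python
import pandas as pd

def merge_city(demand_df, weather_df, city):
    """Hypothetical helper: filter demand to one city and join hourly weather."""
    # Keep only this city's rows and start from a clean 0-based index
    city_df = demand_df[demand_df['city'] == city].reset_index(drop=True)
    # Parse the UTC timestamp strings
    city_df['utc_time'] = pd.to_datetime(city_df['utc_time'])
    # Convert epoch seconds to timestamps on a copy, leaving the input untouched
    weather = weather_df.copy()
    weather['time'] = pd.to_datetime(weather['time'], unit='s')
    # Inner join on the hour, then drop the duplicate key column
    merged = city_df.merge(weather, left_on='utc_time', right_on='time', how='inner')
    return merged.drop(columns=['time'])

# Toy inputs standing in for the CSV and one city's JSON
demand_df = pd.DataFrame({
    'utc_time': ['2018-07-01 07:00:00', '2018-07-01 08:00:00', '2018-07-01 08:00:00'],
    'demand': [2023.0, 1896.0, 7269.0],
    'city': ['san diego', 'san diego', 'nyc'],
})
weather_df = pd.DataFrame({'time': [1530428400, 1530432000],
                           'temperature': [61.71, 62.26]})

merged = merge_city(demand_df, weather_df, 'san diego')
print(merged)
```

With the real data, each call could take one city's frame from `pd.read_json`, and the results could be collected in a list and passed to `pd.concat`.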


In [158]:
print(os.listdir('/Volumes'))

['MACOS', 'meen']


In [159]:
# use a separate name so the input path in csv_path is not overwritten
out_path = '/Volumes/meen/all_cities.csv'
df_all.to_csv(out_path, index=False)
print(f"Saved to {out_path}")


Saved to /Volumes/meen/all_cities.csv


## 🎯 Sampling for Balanced Analysis

To keep modelling fast and give every city equal weight:
1. Create a `period` column (`'YYYY-MM'`) from the UTC timestamp.  
2. Within each city, draw a **stratified sample** of **5,000 rows** on `period`, so each month keeps roughly its share of that city's rows.  
3. Concatenate the ten samples into a single 50,000-row DataFrame, `df_sample`.


In [162]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# 1) Add a 'period' column in 'YYYY-MM' format, derived from the UTC timestamp:
df_all['period'] = df_all['utc_time'].dt.to_period('M').astype(str)

# 2) Loop over each city, stratifying by that period:
samples = []
for city, city_df in df_all.groupby('city'):
    # set up a stratified splitter to draw 5 000 rows
    sss = StratifiedShuffleSplit(n_splits=1, train_size=5000, random_state=42)
    
    # split returns indices of the “train” set we want
    train_idx, _ = next(sss.split(city_df, city_df['period']))
    
    # grab those 5 000 rows for this city
    samples.append(city_df.iloc[train_idx])

# 3) concatenate all ten cities back into one DataFrame
df_sample = pd.concat(samples, ignore_index=True)

# sanity check
print(df_sample.shape)             # should be (50000, …)
print(df_sample['city'].value_counts())  
print(df_sample.groupby('city')['period'].value_counts(normalize=True).head(12))
df_sample.tail(100)

(50000, 25)
city
dallas          5000
houston         5000
la              5000
nyc             5000
philadelphia    5000
phoenix         5000
san antonio     5000
san diego       5000
san jose        5000
seattle         5000
Name: count, dtype: int64
city    period 
dallas  2020-03    0.0452
        2019-07    0.0452
        2020-01    0.0452
        2018-10    0.0452
        2018-08    0.0450
        2019-12    0.0450
        2019-08    0.0450
        2018-12    0.0450
        2019-01    0.0450
        2019-03    0.0450
        2019-05    0.0450
        2019-10    0.0450
Name: proportion, dtype: float64


Unnamed: 0,company,local_time,utc_time,demand,city,summary,icon,precipIntensity,precipProbability,temperature,...,windGust,windBearing,cloudCover,uvIndex,visibility,precipType,ozone,precipAccumulation,region,period
49900,SCL,2020-01-02 07:00:00,2020-01-02 15:00:00,1204.0,seattle,Mostly Cloudy,partly-cloudy-night,0.0003,0.16,40.05,...,6.07,151.0,0.60,0.0,10.000,rain,360.9,,,2020-01
49901,SCL,2019-02-08 09:00:00,2019-02-08 17:00:00,1663.0,seattle,Mostly Cloudy,partly-cloudy-day,0.0110,0.13,32.45,...,6.32,144.0,0.86,0.0,7.335,snow,452.6,0.0863,,2019-02
49902,SCL,2020-04-01 11:00:00,2020-04-01 18:00:00,1249.0,seattle,Mostly Cloudy,partly-cloudy-day,0.0031,0.20,44.07,...,2.81,334.0,0.81,2.0,9.738,rain,473.6,,,2020-04
49903,SCL,2020-04-19 08:00:00,2020-04-19 15:00:00,835.0,seattle,Mostly Cloudy,partly-cloudy-day,0.0018,0.03,48.79,...,4.42,174.0,0.65,1.0,10.000,rain,348.5,,,2020-04
49904,SCL,2019-01-27 11:00:00,2019-01-27 19:00:00,1314.0,seattle,Mostly Cloudy,partly-cloudy-day,0.0003,0.01,41.15,...,7.52,13.0,0.74,1.0,7.568,rain,255.1,,,2019-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,SCL,2018-09-08 13:00:00,2018-09-08 20:00:00,1024.0,seattle,Overcast,cloudy,0.0000,0.00,64.87,...,7.66,189.0,0.93,4.0,3.275,,,,,2018-09
49996,SCL,2019-04-03 10:00:00,2019-04-03 17:00:00,1106.0,seattle,Possible Light Rain,rain,0.0334,0.57,53.58,...,19.51,179.0,0.94,2.0,9.184,rain,326.9,,,2019-04
49997,SCL,2019-04-10 00:00:00,2019-04-10 07:00:00,885.0,seattle,Mostly Cloudy,partly-cloudy-night,0.0007,0.03,47.91,...,5.74,131.0,0.81,0.0,9.285,rain,426.5,,,2019-04
49998,SCL,2019-08-07 01:00:00,2019-08-07 08:00:00,813.0,seattle,Clear,clear-night,0.0011,0.01,63.43,...,3.44,149.0,0.03,0.0,10.000,rain,295.8,,,2019-08
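
To check that stratification worked, the per-period shares in the sample can be compared with those in the full data. The sketch below does this on a small synthetic frame. `pop` and `sample` are hypothetical stand-ins for one city's rows in `df_all` and `df_sample`, and the draw uses pandas' `groupby(...).sample` as a lightweight substitute for `StratifiedShuffleSplit`, so it illustrates the check rather than the notebook's exact sampler.

```python
import pandas as pd

# Synthetic stand-in for one city's rows: 60 hours in one month, 40 in the next
pop = pd.DataFrame({
    'period': ['2018-07'] * 60 + ['2018-08'] * 40,
    'demand': range(100),
})

# Proportional draw per period (pandas-only substitute for StratifiedShuffleSplit)
sample = pop.groupby('period').sample(frac=0.5, random_state=42)

# Share of each period in the population and in the sample
pop_share = pop['period'].value_counts(normalize=True).sort_index()
sample_share = sample['period'].value_counts(normalize=True).sort_index()

# Largest absolute gap between the two distributions; near 0 means well stratified
max_gap = (pop_share - sample_share).abs().max()
print(pop_share, sample_share, max_gap, sep='\n')
```

On the real data, the same comparison could be run per city by grouping both frames on `city` before computing the shares.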


In [163]:
# use a separate name so earlier paths held in csv_path are not overwritten
sample_path = '/Volumes/meen/sampled_data.csv'
df_sample.to_csv(sample_path, index=False)
print(f"Saved sample to {sample_path}")


Saved sample to /Volumes/meen/sampled_data.csv
