# Python TFM Section

## Data Processing

On this section of the TFM it will be done all the preparation needed for the model:

1. Import all the .csv created on the R section to unify them 
2. Set up a unique data frame where we will have all the variables and information required to the regression model 
3. Unstack the structure for making it more suitable to be used on a model


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [2]:
#!conda install --yes pathlib 
#$ python -m pip install pathlib

In [3]:
from pathlib import Path

In [4]:
print(Path.cwd())

C:\Users\ES71531200G\Desktop\Data Science\00.TFM


In [5]:
#Defining path to files from the imports folder
%pwd

file1 = "COSTES_E4E_EUROS.csv"
file2 = "LIQUIDACIONES_EUROS.csv"
file3 = "NUCLEAR_WASTES_EUROS.csv"

file5 = "LIQUIDACIONES_MWH.csv"

File_list = [file1, file2, file3]
del(file1,file2,file3)


File_list

['COSTES_E4E_EUROS.csv', 'LIQUIDACIONES_EUROS.csv', 'NUCLEAR_WASTES_EUROS.csv']

In [6]:
%whos

Variable    Type      Data/Info
-------------------------------
File_list   list      n=3
Path        type      <class 'pathlib.Path'>
file5       str       LIQUIDACIONES_MWH.csv
np          module    <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd          module    <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
plt         module    <module 'matplotlib.pyplo<...>\\matplotlib\\pyplot.py'>


Importing directly with the read_csv function retrieves an error due to the encoding used by R during the exportation
The parametres encoding and sep solve the probem

In [7]:
inputpath1 = Path.cwd() / 'Outputs' / 'COSTES_E4E_EUROS.csv' 
df1 = pd.read_csv(inputpath1, sep = ';', header = 0 , encoding = "ISO-8859-1")

I create a df with the same columnames and data types but no rows for using it as the initial frame to append everything alltogheter.

In [8]:
dfTotal = df1[0:0]
del(df1)
dfTotal

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR,ID_UNIDAD


In [9]:
path_list = []
for file in File_list:
    inputpath = Path.cwd() / 'Outputs' / file
    print (inputpath)
    df1 = pd.read_csv(inputpath, sep = ';', header = 0 , encoding = "ISO-8859-1")
    dfTotal = dfTotal.append(df1)

C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\COSTES_E4E_EUROS.csv
C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\LIQUIDACIONES_EUROS.csv
C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\NUCLEAR_WASTES_EUROS.csv


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


The process comes back a warning saying that one column is nmissing in at least one section, so we explore the data to see what's happening

In [10]:
dfTotal.sample(20)

Unnamed: 0,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION
4527,ESPAÑA,LUBRICANTES,EMPRESA1,HN,EUROS,UPR303,"0,000000e+00",201810
1899,,S. Regulacion,EMPRESA1,GN,EUROS,UPR1851,-761297,201703
9511,,D. Medida Contador,EMPRESA3,NC,EUROS,UPR78,2768128,201712
1913,ESPAÑA,CENTIMO_VERDE,EMPRESA1,CI,EUROS,UPR1861,"-7,858044e+05",201711
12693,,M. Diario,EMPRESA1,BP,EUROS,UPR1751,-3081001,201805
2132,ESPAÑA,CANON HID,EMPRESA1,EB,EUROS,UPR2103,"-3,052967e+05",201712
12746,,Ajuste,EMPRESA1,CI,EUROS,UPR1860,-4597694,201805
10168,,M. Diario,EMPRESA1,HN,EUROS,UPR304,122314655,201801
5411,,Terciaria,EMPRESA2,NC,EUROS,UPR116,1000163,201708
13330,,R. Secundaria,EMPRESA1,BP,EUROS,UPR1315,2399292,201806


We obserb that the file "LIQUIDACIONES_EUROS" doesn't have the column ID_AREA_SISTEMA.
To overpass this situations, I proceed to replace all the NaNs with the right values through the selection of the unique tuples
[ID_AREA_SISTEMA - ID_UPR].

In [11]:
df_aux = dfTotal[['ID_AREA_SISTEMA', 'ID_UPR']].dropna().drop_duplicates()
df_aux.sample(10)

Unnamed: 0,ID_AREA_SISTEMA,ID_UPR
156,ESPAÑA,UPR417
118,ESPAÑA,UPR2343
98,ESPAÑA,UPR2331
69,ESPAÑA,UPR1863
45,PORTUGAL,UPR1850
48,PORTUGAL,UPR1851
51,ESPAÑA,UPR1860
26,ESPAÑA,UPR162
102,ESPAÑA,UPR2341
94,ESPAÑA,UPR2182


And now I will replace the values using a left join with pandas

In [12]:
df_merged = pd.merge(dfTotal, df_aux, on='ID_UPR', how='left')
df_merged.sample(10)
df_merged.columns

Index(['ID_AREA_SISTEMA_x', 'ID_CONCEPTO_CTRL', 'ID_GRUPO_EMPRESARIAL',
       'ID_TECNOLOGIA', 'ID_UNIDAD', 'ID_UPR', 'VALOR', 'VERSION',
       'ID_AREA_SISTEMA_y'],
      dtype='object')

In [13]:
df_merged = df_merged.rename(columns={'ID_AREA_SISTEMA_y': 'ID_AREA_SISTEMA'})
df_merged = df_merged.drop(columns="ID_AREA_SISTEMA_x")
print(df_merged.columns)
df_merged.sample(10)

Index(['ID_CONCEPTO_CTRL', 'ID_GRUPO_EMPRESARIAL', 'ID_TECNOLOGIA',
       'ID_UNIDAD', 'ID_UPR', 'VALOR', 'VERSION', 'ID_AREA_SISTEMA'],
      dtype='object')


Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
11544,G. Potencia MP,EMPRESA1,GN,EUROS,UPR1864,5709893,201709,ESPAÑA
22007,R. Secundaria,EMPRESA1,BP,EUROS,UPR2415,-1008571,201810,ESPAÑA
17077,Banda,EMPRESA3,CI,EUROS,UPR1721,2374836,201804,
279,ATR,EMPRESA1,GN,EUROS,UPR2182,"-1,863134e+05",201702,ESPAÑA
15615,Terciaria,EMPRESA3,GN,EUROS,UPR1846,6034455,201802,
23492,SERV_GEST_RES,EMPRESA1,NC,EUROS,UPR77,-1458420767,201706,ESPAÑA
18200,M. Diario,EMPRESA3,GN,EUROS,UPR317,139742420,201805,
5615,Banda,EMPRESA1,LN,EUROS,UPR2343,3200912,201701,ESPAÑA
12774,G. Potencia MP,EMPRESA1,HN,EUROS,UPR74,8895837,201710,ESPAÑA
4426,CENTIMO_VERDE,EMPRESA1,CI,EUROS,UPR1863,"-1,830766e+06",201810,ESPAÑA


In [14]:
df_merged = df_merged[df_merged['ID_GRUPO_EMPRESARIAL'] == 'EMPRESA1']
df_merged.sample(10)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
139,IMPUESTO ELECT,EMPRESA1,GN,EUROS,UPR300,"-1,560878e+05",201701,ESPAÑA
22576,Banda,EMPRESA1,HN,EUROS,UPR194,3774788,201811,ESPAÑA
4417,UREA,EMPRESA1,CI,EUROS,UPR1861,"-4,237013e+04",201810,ESPAÑA
7206,D. Medida Contador,EMPRESA1,EB,EUROS,UPR2331,-8719813,201703,ESPAÑA
7957,Ajuste,EMPRESA1,EB,EUROS,UPR2344,-5151712,201704,ESPAÑA
1255,IMPUESTO ELECT,EMPRESA1,GN,EUROS,UPR300,"-1,703250e+05",201707,ESPAÑA
6749,G. Potencia MP,EMPRESA1,EB,EUROS,UPR1207,1216728,201703,ESPAÑA
18079,Terciaria,EMPRESA1,LN,EUROS,UPR2342,5208894,201805,ESPAÑA
3330,COSTE_COMBUSTIBLE,EMPRESA1,HN,EUROS,UPR304,"2,153387e+04",201805,ESPAÑA
15517,R. Secundaria,EMPRESA1,CI,EUROS,UPR1661,-2659097,201802,ESPAÑA


In [15]:
df_merged[df_merged['ID_AREA_SISTEMA'].isna()].head(5)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
5079,D. Medida Contador,EMPRESA1,NC,EUROS,UPR1198,-3530563,201701,
5080,S. Res. Pot. Adicional,EMPRESA1,NC,EUROS,UPR1198,-1928,201701,
5239,Ajuste,EMPRESA1,BP,EUROS,UPR1751,-3748906,201701,
5240,Bilateral,EMPRESA1,BP,EUROS,UPR1751,-2131556,201701,
5241,D. Medida Contador,EMPRESA1,BP,EUROS,UPR1751,-2245397,201701,


There are still NaN values, but thanks to our knowledge from the original data, we know that there are ONLY 2 UPRs with ID_AREA_SISTEMA = 'Portugal', what means that every NaN value right now should be equal to ESPAÑA, so we replace now all the NANs

In [16]:
df_merged = df_merged.fillna('ESPAÑA')
df_merged.sample(10)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
20777,R. Cobertura,EMPRESA1,GN,EUROS,UPR162,139446126,201809,ESPAÑA
17585,M. Intradiarios,EMPRESA1,NC,EUROS,UPR116,4303674,201805,ESPAÑA
1703,OTROS,EMPRESA1,CI,EUROS,UPR1661,"-9,393526e+03",201710,ESPAÑA
18956,D. Medida Contador,EMPRESA1,HN,EUROS,UPR303,-5335355,201806,ESPAÑA
14562,D. Medida Contador,EMPRESA1,NC,EUROS,UPR78,8837154,201712,ESPAÑA
14038,I. G. Desvíos y Terciaria,EMPRESA1,CI,EUROS,UPR1862,-2101416,201712,ESPAÑA
15694,Banda,EMPRESA1,HN,EUROS,UPR194,4260680,201802,ESPAÑA
9464,M. Diario,EMPRESA1,BP,EUROS,UPR2414,-1939492,201706,ESPAÑA
14674,M. Intradiarios,EMPRESA1,BP,EUROS,UPR1315,1168270,201801,ESPAÑA
16071,M. Intradiarios,EMPRESA1,EB,EUROS,UPR417,5211724,201802,ESPAÑA


In [17]:
df_merged[df_merged['ID_AREA_SISTEMA'].isna()].head(5)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA


And finally, reorder de columns to the same order we are already used to 

In [18]:
df_merged = df_merged[['VERSION','ID_UPR','ID_TECNOLOGIA','ID_GRUPO_EMPRESARIAL','ID_AREA_SISTEMA','ID_CONCEPTO_CTRL','VALOR']]
df_merged.reset_index()
df_merged.sample(5)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR
6299,201702,UPR2103,EB,EMPRESA1,ESPAÑA,M. Intradiarios,-6888203
12116,201710,UPR162,GN,EMPRESA1,ESPAÑA,I. Res. Pot. Adicional,-4338705
11419,201709,UPR1752,BP,EMPRESA1,ESPAÑA,S. Res. Pot. Adicional,-252
5961,201702,UPR1206,BX,EMPRESA1,ESPAÑA,I. G. Desvíos y Terciaria,-210444
10170,201707,UPR2331,EB,EMPRESA1,ESPAÑA,Ajuste,-3618819


At this point, I will save the current df "df_merged" for the future visualizatin part, this is the structure desired to represent the Integral Margin of the different power plants and so it is for the temporal evolution of every single one of them.

The problem here seemed to be the data types... so first I tried to convert the column value to numeric directly 
with no success...

The error got, suggested that I should convert the data type to floats but the lesson learnt here was that float type in pandas use dots insted of comma for float


In [19]:
df_merged.dtypes

VERSION                  int64
ID_UPR                  object
ID_TECNOLOGIA           object
ID_GRUPO_EMPRESARIAL    object
ID_AREA_SISTEMA         object
ID_CONCEPTO_CTRL        object
VALOR                   object
dtype: object

In [20]:
df_merged['VALOR'].str.replace(',','.').sample(10)

4156     -1.869188e+06
23133          8883290
23379          7300800
9047           4517411
22185         44793371
13172          6984874
6970         -21372.92
1827     -1.928083e+01
16394           -338.7
236      -1.076721e+03
Name: VALOR, dtype: object

In [21]:
#pd.to_numeric(df_merged['VALOR'])
df_merged['VALOR'] = pd.to_numeric((df_merged['VALOR'].str.replace(',','.')),errors='coerce').fillna(0).astype(np.int64)
df_merged.sample(10)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR
19727,201807,UPR304,HN,EMPRESA1,ESPAÑA,Bilateral,16952863
17677,201805,UPR162,GN,EMPRESA1,ESPAÑA,Terciaria,792480
23224,201812,UPR1864,GN,EMPRESA1,ESPAÑA,Terciaria,691309
20710,201809,UPR1207,EB,EMPRESA1,ESPAÑA,Bilateral,11818987
19045,201806,UPR418,EB,EMPRESA1,ESPAÑA,Banda,2688868
5647,201701,UPR2414,BP,EMPRESA1,ESPAÑA,S. Res. Pot. Adicional,-69
1897,201711,UPR1662,CI,EMPRESA1,ESPAÑA,TASAS_MEDIOAMB,-40914
1053,201706,UPR2343,LN,EMPRESA1,ESPAÑA,OTROS,1478
21089,201809,UPR2142,BP,EMPRESA1,ESPAÑA,Terciaria,467368
20184,201808,UPR1862,CI,EMPRESA1,ESPAÑA,M. Diario,429636862


In [22]:
df_merged['VALOR'].sum()

175694686343

In [23]:
#df_merged.to_csv?
df_merged.to_csv(Path.cwd() / 'Outputs' / 'INTEGRATED_MARGIN.csv', sep= ';',index=False)

Now I proceed to unstack or pivot the table to get the suitable structure for modeling
During this procedure, I've faced multiple problems so here I brievely describe the process:

1. First attempts ended on multiple errors such as "Length of passed values is 15227, index implies 1" , "index contains duplicate entries,cannot reshape"
2. It seemed clear that in any moment of the dropping unused columns, I created a duplicity on a register so first thing requieres was to do a group by
3. After done, I reseted the index for setting free all the columns
4. I used the pandas fuction "pivot_table" instead the method .pivot due to the hability of the first one to summing all the values generated with duplicities during the process of resizing.
5. Once pivoted, indexes and headers were a problematic segmentation so I dropped it out and create a new header.


In [24]:
df_pivoted = df_merged[['VERSION','ID_UPR','ID_TECNOLOGIA','ID_CONCEPTO_CTRL','VALOR']]
df_pivoted.shape

(15117, 5)

In [25]:
df_pivoted2 =df_pivoted.groupby(['VERSION','ID_UPR','ID_TECNOLOGIA','ID_CONCEPTO_CTRL']).sum()
print(df_pivoted2.shape)
print(df_pivoted2.columns)
df_pivoted2.sample(5)

(15117, 1)
Index(['VALOR'], dtype='object')


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,VALOR
VERSION,ID_UPR,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,Unnamed: 4_level_1
201709,UPR115,NC,D. Medida Contador,-469088
201708,UPR74,HN,M. Intradiarios,1365732
201709,UPR74,HN,G. Potencia MP,770535
201702,UPR74,HN,Terciaria,119317
201811,UPR2342,LN,CANON_CONCESION,-17076


In [26]:
df_pivoted2= df_pivoted2.reset_index()
df_pivoted2.head(5)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,VALOR
0,201701,UPR115,NC,Bilateral,81778497
1,201701,UPR115,NC,CANON_NC_EST,-2543929
2,201701,UPR115,NC,COSTE_COMBUSTIBLE,-1875378
3,201701,UPR115,NC,D. Medida Contador,-7853
4,201701,UPR115,NC,IMPUESTO ELECT,-3190970


In [27]:
df_pivoted2.columns

Index(['VERSION', 'ID_UPR', 'ID_TECNOLOGIA', 'ID_CONCEPTO_CTRL', 'VALOR'], dtype='object')

In [28]:
df_pivoted3 = df_pivoted2.pivot_table( 
                          values=['VALOR'], 
                          index=['VERSION', 'ID_UPR', 'ID_TECNOLOGIA'],
                          columns=['ID_CONCEPTO_CTRL'], 
                          aggfunc=np.sum)
df_pivoted3.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR
Unnamed: 0_level_1,Unnamed: 1_level_1,ID_CONCEPTO_CTRL,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,CALIZAS,CANON HID,CANON_CONCESION,...,Redespachos,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF
VERSION,ID_UPR,ID_TECNOLOGIA,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2
201804,UPR304,HN,,,,,,,,0.0,,,...,,,,-3191.0,,-124259.0,,,,
201811,UPR1862,CI,,,,,,6759794.0,50815240.0,,,,...,,192116.0,,-481.0,,-31935.0,,2135382.0,-10293.0,245772.0
201802,UPR1751,BP,,,,-5558819.0,,,-152934.0,,,,...,-131800.0,,,,,,,-359119.0,,
201801,UPR194,HN,,,,,,4016524.0,,-69851.0,,,...,,,,-56.0,,,,302503.0,,90975.0
201804,UPR1863,CI,,,,,,,,,,,...,,,,-401.0,,-2726.0,,,0.0,
201805,UPR418,EB,,,,-39794395.0,,1259490.0,1287029000.0,,-4463186.0,-298695.0,...,,,,-13205.0,,,-336655.0,4351472.0,,11580.0
201709,UPR194,HN,,,,,,,,-41509.0,,,...,,,,-18.0,,-3464.0,,,,
201811,UPR162,GN,210996.0,,-744805.0,,,5624452.0,,,,,...,,5250183.0,,-3086.0,,,,3652391.0,,334678.0
201802,UPR2622,NC,,,,,,,758727100.0,,,,...,,,,-485.0,-3471182.0,,,,,
201708,UPR74,HN,,,,,,,,,,,...,,,,,,,,,,


In [29]:
df_pivoted3.columns = df_pivoted3.columns.droplevel()
df_modelize= df_pivoted3.reset_index()
df_modelize.head(10)

ID_CONCEPTO_CTRL,VERSION,ID_UPR,ID_TECNOLOGIA,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,...,Redespachos,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF
0,201701,UPR115,NC,,,,,,,81778497.0,...,,,,-36.0,-5156083.0,,,,,
1,201701,UPR116,NC,,,,,,,68455391.0,...,,,,,-4357224.0,,,,,
2,201701,UPR1198,NC,,,,,,,,...,,,,-192.0,,,,,,
3,201701,UPR1205,EB,,,,-124366.0,,,1578792.0,...,,,,-455.0,,-33711.0,,55144.0,,
4,201701,UPR1206,BX,,,,-9972.0,,,,...,,,,-7.0,,-16346.0,,,,
5,201701,UPR1207,EB,,,,-198314.0,,15066.0,59653.0,...,,,,-807.0,,-1528.0,,225821.0,,999.0
6,201701,UPR1314,BP,,,,-429377.0,,,-118383.0,...,-8556.0,,,-39.0,,,,-24031.0,,
7,201701,UPR1315,BP,,,,-506708.0,,2001.0,,...,1146.0,,,-397.0,,,,441646.0,,-162.0
8,201701,UPR160,GN,-994654.0,,-938689.0,,,1045907.0,,...,47854.0,51355.0,,-2223.0,,,,251416.0,,-14892.0
9,201701,UPR162,GN,-418200.0,,-507356.0,,,957825.0,,...,745956.0,579221.0,,-316.0,,,,573346.0,,-17674.0


Finally, we incorporate now the last column of data that we are gonna implement to the model, the power column from the second
dataframe that we got in the liquidations R process

In [30]:
df_power = pd.read_csv(Path.cwd() / 'Outputs' / 'LIQUIDACIONES_MWH.csv' , sep = ';', header = 0 , encoding = "ISO-8859-1",decimal=',')
print(df_power.describe())
df_power.sample(5)

             VERSION         VALOR
count   13048.000000  1.304800e+04
mean   201754.106913  4.498754e+05
std        49.951242  2.031215e+06
min    201701.000000 -5.489374e+06
25%    201706.000000 -9.301933e+03
50%    201712.000000  2.225860e+03
75%    201806.000000  8.903905e+04
max    201812.000000  3.518217e+07


Unnamed: 0,VERSION,ID_UPR,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,ID_UNIDAD,VALOR
5814,201711,UPR1864,EMPRESA1,GN,M. Diario,MWH,3887135.0
4073,201708,UPR1762,EMPRESA2,BP,Redespachos,MWH,17325.0
37,201701,UPR1207,EMPRESA1,EB,M. Diario,MWH,12646.8
9885,201806,UPR321,EMPRESA2,GN,M. Intradiarios,MWH,30786.0
8958,201805,UPR160,EMPRESA1,GN,D. Medida Contador,MWH,114788.1


I apply the same filters that I did in the previous dfs

And eventually a group by just in case we have the same problem than before

In [31]:
df_power = df_power[df_power['ID_GRUPO_EMPRESARIAL'] == 'EMPRESA1']
print(df_power.describe())

             VERSION         VALOR
count    6417.000000  6.417000e+03
mean   201755.352189  4.508979e+05
std        50.180305  2.115280e+06
min    201701.000000 -3.789073e+06
25%    201706.000000 -1.387000e+04
50%    201712.000000  2.163250e+02
75%    201807.000000  5.869300e+04
max    201812.000000  3.256008e+07


In [32]:
df_power = df_power[['VERSION','ID_UPR','ID_TECNOLOGIA','VALOR']]
df_power= df_power.groupby(['VERSION','ID_UPR','ID_TECNOLOGIA']).sum().reset_index()
print(df_power.describe())

             VERSION         VALOR
count    1024.000000  1.024000e+03
mean   201755.424805  2.825598e+06
std        50.143803  4.930549e+06
min    201701.000000 -3.789073e+06
25%    201706.000000  2.363667e+04
50%    201712.000000  5.563538e+05
75%    201806.000000  3.699485e+06
max    201812.000000  3.262049e+07


In [33]:
df_power.describe()

Unnamed: 0,VERSION,VALOR
count,1024.0,1024.0
mean,201755.424805,2825598.0
std,50.143803,4930549.0
min,201701.0,-3789073.0
25%,201706.0,23636.67
50%,201712.0,556353.8
75%,201806.0,3699485.0
max,201812.0,32620490.0


In [34]:
df_modelize = pd.merge(df_modelize, df_power, on=['VERSION','ID_UPR','ID_TECNOLOGIA'], how='left')
df_modelize = df_modelize.rename(columns={'VALOR': 'POWER_MWH'})
df_modelize.sample(10)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,...,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF,POWER_MWH
388,201709,UPR303,HN,,,,,,,,...,,,-39.0,,-9036.0,,,,,-20812.03
778,201806,UPR417,EB,,,,,,,,...,,,-157.0,,,-528080.0,46189.0,,,874581.929
121,201703,UPR2491,NC,,,,,,,,...,,,0.0,-52234.0,,,,,,39349.75
243,201706,UPR2141,BP,,,,-763969.0,,,-549307.0,...,,,-56.0,,,,-617914.0,,,-72156.6
916,201810,UPR1206,BX,,,,,,,,...,,,-35.0,,193.0,,,,,-1010.205
78,201702,UPR2622,NC,,,,,,,88213380.0,...,,,-21.0,-3449103.0,,,,,,1714469.685
386,201709,UPR2622,NC,,,,,,,447666184.0,...,,,-26.0,-3614913.0,,,-73865.0,,,9044923.186
988,201811,UPR2491,NC,,,,,,,,...,,,-7.0,-50564.0,,,1066.0,,,332786.743
766,201806,UPR2331,EB,,,,-1498433.0,,552847.0,18182295.0,...,,,-5604.0,,-27475.0,,2652945.0,,-19109.0,795811.152
927,201810,UPR1851,GN,,,316864.0,,-312679.0,14607510.0,,...,,-276705.0,,,,,-95305.0,,,5320572.8


In [35]:
df_modelize.to_csv(Path.cwd() / 'Outputs' / 'DF_MODELIZE.csv', sep= ';',decimal=',',index=False)

In [36]:
To_be_deleted =['df1',
                'df_aux',
                'df_merged',
                'dfTotal',
                'df_pivoted',
                'df_pivoted2',
                'df_pivoted3',
                'path_list',
                'inputpath',
                'inputpath1']
To_be_deleted

['df1',
 'df_aux',
 'df_merged',
 'dfTotal',
 'df_pivoted',
 'df_pivoted2',
 'df_pivoted3',
 'path_list',
 'inputpath',
 'inputpath1']

In [37]:
for item in To_be_deleted:
    try:
        del item
    except:
        pass