# Python TFM Section

## Data Processing

This section of the TFM covers all the data preparation needed for the model:

1. Import all the .csv files created in the R section and unify them
2. Build a single data frame holding all the variables and information required by the regression model
3. Unstack the structure to make it more suitable for modeling


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [2]:
#!conda install --yes pathlib 
#$ python -m pip install pathlib

In [3]:
from pathlib import Path

In [4]:
print(Path.cwd())

C:\Users\ES71531200G\Desktop\Data Science\00.TFM


In [5]:
#Defining path to files from the imports folder
%pwd

file1 = "COSTES_E4E_EUROS.csv"
file2 = "LIQUIDACIONES_EUROS.csv"
file3 = "NUCLEAR_WASTES_EUROS.csv"

file5 = "LIQUIDACIONES_MWH.csv"

File_list = [file1, file2, file3]
del(file1,file2,file3)


File_list

['COSTES_E4E_EUROS.csv', 'LIQUIDACIONES_EUROS.csv', 'NUCLEAR_WASTES_EUROS.csv']

In [6]:
%whos

Variable    Type      Data/Info
-------------------------------
File_list   list      n=3
Path        type      <class 'pathlib.Path'>
file5       str       LIQUIDACIONES_MWH.csv
np          module    <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd          module    <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
plt         module    <module 'matplotlib.pyplo<...>\\matplotlib\\pyplot.py'>


Importing directly with the read_csv function raises an error because of the encoding used by R when exporting the files.
The encoding and sep parameters solve the problem.
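
As a minimal sketch of why both parameters matter, the snippet below builds a small semicolon-separated file in memory, encoded as ISO-8859-1, and reads it back. The sample rows and the names `text`, `raw` and `df_demo` are made up for illustration; only the `sep` and `encoding` arguments mirror the call used in this notebook.

```python
import io
import pandas as pd

# Hypothetical sample: semicolon-separated text with a non-ASCII character
text = "ID_UPR;ID_AREA_SISTEMA;VALOR\nUPR194;ESPAÑA;100\nUPR1851;PORTUGAL;200\n"
raw = text.encode("ISO-8859-1")  # bytes as an ISO-8859-1 export would store them

# With the default UTF-8 decoding, the byte used for 'Ñ' is not valid
try:
    pd.read_csv(io.BytesIO(raw), sep=";")
except UnicodeDecodeError as err:
    print("default decoding failed:", err)

# Passing the matching separator and encoding parses the file
df_demo = pd.read_csv(io.BytesIO(raw), sep=";", header=0, encoding="ISO-8859-1")
print(df_demo)
```

Without `sep=';'` the whole line would be read as a single column, since read_csv assumes a comma separator by default.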

In [7]:
inputpath1 = Path.cwd() / 'Outputs' / 'COSTES_E4E_EUROS.csv' 
df1 = pd.read_csv(inputpath1, sep = ';', header = 0 , encoding = "ISO-8859-1")

I create a df with the same column names and data types but no rows, to use it as the initial frame onto which everything is appended.

In [8]:
dfTotal = df1[0:0]
del(df1)
dfTotal

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR,ID_UNIDAD
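
As a quick check of this trick, here is a minimal sketch on a made-up two-row frame (the name `df_demo` and its values are illustrative, not taken from the TFM data). Slicing `[0:0]` should keep the column names and dtypes while dropping every row.

```python
import pandas as pd

# Hypothetical toy frame standing in for df1
df_demo = pd.DataFrame({"VERSION": [201701, 201702], "VALOR": ["1,5", "2"]})

# A zero-row slice keeps the columns and their dtypes but no data
empty = df_demo[0:0]
print(empty.shape)
print(empty.dtypes)
```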


In [9]:
path_list = []
for file in File_list:
    inputpath = Path.cwd() / 'Outputs' / file
    print(inputpath)
    df1 = pd.read_csv(inputpath, sep = ';', header = 0 , encoding = "ISO-8859-1")
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    dfTotal = pd.concat([dfTotal, df1])

C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\COSTES_E4E_EUROS.csv
C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\LIQUIDACIONES_EUROS.csv
C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\NUCLEAR_WASTES_EUROS.csv


[warning output, truncated in the export]
... of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.


The process returns a warning indicating that the columns of the files are not aligned, that is, at least one column is missing in one of them, so we explore the data to see what is happening.
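
A minimal sketch of how such a misalignment shows up, using two made-up frames (`costs` and `settlements` are hypothetical names and values) and `pd.concat`: the result takes the union of the columns, and the rows coming from the frame that lacks a column get NaN there.

```python
import pandas as pd

# Hypothetical toy frames: the second one has no ID_AREA_SISTEMA column
costs = pd.DataFrame({"ID_UPR": ["UPR194"], "ID_AREA_SISTEMA": ["ESPAÑA"], "VALOR": [10]})
settlements = pd.DataFrame({"ID_UPR": ["UPR318"], "VALOR": [20]})

# The union of the columns is kept; missing cells become NaN.
# Passing sort explicitly avoids relying on the default column ordering.
combined = pd.concat([costs, settlements], sort=False, ignore_index=True)
print(combined)
```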

In [10]:
dfTotal.sample(20)

Unnamed: 0,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION
11743,,R. Secundaria,EMPRESA3,GN,EUROS,UPR318,-416778,201803
7042,,Terciaria,EMPRESA3,GN,EUROS,UPR161,9255238,201710
15800,,S. Res. Pot. Adicional,EMPRESA1,BP,EUROS,UPR1752,-1314552,201809
1165,,G. Potencia MP,EMPRESA1,GN,EUROS,UPR1864,1058181,201702
11543,,Terciaria,EMPRESA2,BX,EUROS,UPR2261,15104,201803
82,ESPAÑA,CALIZAS,EMPRESA1,HN,EUROS,UPR194,"-1,951940e+04",201701
5962,,Terciaria,EMPRESA1,BP,EUROS,UPR2415,-3066947,201708
219,,G. Desvios,EMPRESA3,HN,EUROS,UPR1803,1128275,201701
4913,,M. Diario,EMPRESA1,CI,EUROS,UPR1860,151916371,201707
13100,,Terciaria,EMPRESA3,EB,EUROS,UPR2600,3779883,201805


We observe that the file "LIQUIDACIONES_EUROS" doesn't have the column ID_AREA_SISTEMA.
To work around this, I replace the NaNs with the right values by selecting the unique [ID_AREA_SISTEMA - ID_UPR] pairs from the rows where the area is known.
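
The idea can be sketched on a toy frame as follows. The data and the names `df_demo`, `lookup` and `filled` are hypothetical; the `validate="m:1"` argument is an extra safeguard assumed here, which makes the merge fail if a UPR were mapped to more than one area.

```python
import pandas as pd

# Hypothetical toy data: some rows have no area
df_demo = pd.DataFrame({
    "ID_UPR": ["UPR194", "UPR194", "UPR1851", "UPR2134"],
    "ID_AREA_SISTEMA": ["ESPAÑA", None, "PORTUGAL", None],
    "VALOR": [1, 2, 3, 4],
})

# Unique (ID_UPR, ID_AREA_SISTEMA) pairs from the rows where the area is known
lookup = df_demo[["ID_UPR", "ID_AREA_SISTEMA"]].dropna().drop_duplicates()

# Left join: every original row is kept and receives the known area of its UPR
filled = df_demo.drop(columns="ID_AREA_SISTEMA").merge(
    lookup, on="ID_UPR", how="left", validate="m:1"
)
print(filled)
```

A UPR that never appears with an area (UPR2134 in this toy data) still ends up with NaN after the join, so a second fill step is needed for those rows.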

In [11]:
df_aux = dfTotal[['ID_AREA_SISTEMA', 'ID_UPR']].dropna().drop_duplicates()
df_aux.sample(10)

Unnamed: 0,ID_AREA_SISTEMA,ID_UPR
82,ESPAÑA,UPR194
89,ESPAÑA,UPR2103
120,ESPAÑA,UPR2343
14,ESPAÑA,UPR1206
159,ESPAÑA,UPR417
100,ESPAÑA,UPR2331
45,ESPAÑA,UPR1752
24,ESPAÑA,UPR160
104,ESPAÑA,UPR2341
112,ESPAÑA,UPR2342


Now I fill in the missing values using a left join with pandas.

In [12]:
df_merged = pd.merge(dfTotal, df_aux, on='ID_UPR', how='left')
df_merged.sample(10)
df_merged.columns

Index(['ID_AREA_SISTEMA_x', 'ID_CONCEPTO_CTRL', 'ID_GRUPO_EMPRESARIAL',
       'ID_TECNOLOGIA', 'ID_UNIDAD', 'ID_UPR', 'VALOR', 'VERSION',
       'ID_AREA_SISTEMA_y'],
      dtype='object')

In [13]:
df_merged = df_merged.rename(columns={'ID_AREA_SISTEMA_y': 'ID_AREA_SISTEMA'})
df_merged = df_merged.drop(columns="ID_AREA_SISTEMA_x")
print(df_merged.columns)
df_merged.sample(10)

Index(['ID_CONCEPTO_CTRL', 'ID_GRUPO_EMPRESARIAL', 'ID_TECNOLOGIA',
       'ID_UNIDAD', 'ID_UPR', 'VALOR', 'VERSION', 'ID_AREA_SISTEMA'],
      dtype='object')


Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
14406,I. G. Desvíos y Terciaria,EMPRESA1,LN,EUROS,UPR2343,-9931051,201712,ESPAÑA
18817,Banda,EMPRESA1,EB,EUROS,UPR2103,1257887,201806,ESPAÑA
1825,AMONIACO,EMPRESA1,LN,EUROS,UPR2342,"-7,255110e+03",201710,ESPAÑA
4100,OTROS,EMPRESA1,LN,EUROS,UPR2343,"1,458979e+04",201808,ESPAÑA
5483,VCF,EMPRESA1,CI,EUROS,UPR1863,-2471655,201701,ESPAÑA
10699,G. Potencia MP,EMPRESA1,CI,EUROS,UPR1661,3400514,201708,ESPAÑA
16021,D. Medida Contador,EMPRESA1,NC,EUROS,UPR2491,-2874566,201802,ESPAÑA
18083,Redespachos,EMPRESA2,BP,EUROS,UPR2134,-1104319,201805,
10214,M. Intradiarios,EMPRESA2,BP,EUROS,UPR2134,-2352078,201707,
14905,I. G. Desvíos y Terciaria,EMPRESA1,BP,EUROS,UPR1752,-2556516,201801,ESPAÑA


In [14]:
df_merged = df_merged[df_merged['ID_GRUPO_EMPRESARIAL'] == 'EMPRESA1']
df_merged.sample(10)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
9295,I. G. Desvíos y Terciaria,EMPRESA1,CI,EUROS,UPR1861,-1269182,201706,ESPAÑA
12249,G. Potencia MP,EMPRESA1,CI,EUROS,UPR1661,4290329,201710,ESPAÑA
22910,I. G. Desvíos y Terciaria,EMPRESA1,BP,EUROS,UPR2414,-2467155,201811,ESPAÑA
17142,Banda,EMPRESA1,CI,EUROS,UPR1661,4012416,201804,ESPAÑA
534,CALIZAS,EMPRESA1,HN,EUROS,UPR304,"-2,040799e+04",201703,ESPAÑA
3864,PEAJE GEN,EMPRESA1,EB,EUROS,UPR2344,"-8,721492e+03",201807,ESPAÑA
7347,M. Intradiarios,EMPRESA1,LN,EUROS,UPR2342,1797773,201703,ESPAÑA
18981,Terciaria,EMPRESA1,EB,EUROS,UPR2344,-1040609,201806,ESPAÑA
15313,G. Potencia MP,EMPRESA1,GN,EUROS,UPR300,4011557,201801,ESPAÑA
15889,Ajuste,EMPRESA1,BP,EUROS,UPR2141,-3439387,201802,


In [15]:
df_merged[df_merged['ID_AREA_SISTEMA'].isna()].head(5)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
5189,D. Medida Contador,EMPRESA1,NC,EUROS,UPR1198,-3530563,201701,
5190,S. Res. Pot. Adicional,EMPRESA1,NC,EUROS,UPR1198,-1928,201701,
5349,Ajuste,EMPRESA1,BP,EUROS,UPR1751,-3748906,201701,
5350,Bilateral,EMPRESA1,BP,EUROS,UPR1751,-2131556,201701,
5351,D. Medida Contador,EMPRESA1,BP,EUROS,UPR1751,-2245397,201701,


There are still NaN values, but from our knowledge of the original data we know that there are ONLY 2 UPRs with ID_AREA_SISTEMA = 'PORTUGAL', which means that every remaining NaN should be equal to ESPAÑA, so we now replace all the NaNs.
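
The replacement can be sketched on a toy frame as follows. The data and names are made up, and restricting the fill to the ID_AREA_SISTEMA column is a choice of this sketch: passing a dict to `fillna` leaves NaNs in other columns untouched, whereas a frame-wide fill would also write the text 'ESPAÑA' into any missing VALOR.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the area column and in VALOR
df_demo = pd.DataFrame({
    "ID_UPR": ["UPR1198", "UPR1851", "UPR1751"],
    "ID_AREA_SISTEMA": [np.nan, "PORTUGAL", np.nan],
    "VALOR": ["-1928", np.nan, "-3748906"],
})

# The dict maps column name -> fill value, so only the area column is filled
filled = df_demo.fillna({"ID_AREA_SISTEMA": "ESPAÑA"})
print(filled)
```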

In [16]:
df_merged = df_merged.fillna({'ID_AREA_SISTEMA': 'ESPAÑA'})  # fill only the area column, not every NaN in the frame
df_merged.sample(10)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
6103,S. Res. Pot. Adicional,EMPRESA1,BP,EUROS,UPR1314,-462,201702,ESPAÑA
21107,I. Res. Pot. Adicional,EMPRESA1,HN,EUROS,UPR194,-1763,201809,ESPAÑA
3224,IMPUESTO ELECT,EMPRESA1,EB,EUROS,UPR1205,"-1,933444e+05",201805,ESPAÑA
17956,Ajuste,EMPRESA1,CI,EUROS,UPR1863,-7163507,201805,ESPAÑA
10246,D. Medida Contador,EMPRESA1,GN,EUROS,UPR2182,-1690022,201707,ESPAÑA
6517,Terciaria,EMPRESA1,EB,EUROS,UPR2331,8250313,201702,ESPAÑA
8886,G. Desvios,EMPRESA1,HN,EUROS,UPR304,2018777,201705,ESPAÑA
8514,G. Potencia LP,EMPRESA1,GN,EUROS,UPR1851,7672579,201705,PORTUGAL
6330,G. Potencia LP,EMPRESA1,GN,EUROS,UPR1864,2250524,201702,ESPAÑA
4300,IMPUESTO ELECT,EMPRESA1,EB,EUROS,UPR2103,"-4,929911e+05",201809,ESPAÑA


In [17]:
df_merged[df_merged['ID_AREA_SISTEMA'].isna()].head(5)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA


And finally, I reorder the columns into the order we are already used to.

In [18]:
df_merged = df_merged[['VERSION','ID_UPR','ID_TECNOLOGIA','ID_GRUPO_EMPRESARIAL','ID_AREA_SISTEMA','ID_CONCEPTO_CTRL','VALOR']]
df_merged.reset_index()  # no effect as written: the result is not assigned, so the original index is kept
df_merged.sample(5)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR
2755,201803,UPR1207,EB,EMPRESA1,ESPAÑA,CANON HID,"-3,809874e+04"
21472,201809,UPR418,EB,EMPRESA1,ESPAÑA,D. Medida Contador,-3109097
2386,201801,UPR194,HN,EMPRESA1,ESPAÑA,PEAJE GEN,"-5,683775e+04"
14817,201801,UPR162,GN,EMPRESA1,ESPAÑA,Banda,5464730
6770,201702,UPR74,HN,EMPRESA1,ESPAÑA,M. Diario,7238959


At this point I want to save the current df, "df_merged", for the later visualization part: this is the structure needed to represent the Integral Margin of the different power plants and its evolution over time for each of them.

Before saving, the data types had to be fixed. I first tried to convert the VALOR column to numeric directly, with no success.

The error suggested converting the data to floats; the lesson learned was that pandas expects a dot rather than a comma as the decimal separator.
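
A minimal sketch of the conversion, on made-up strings written in the same style as the VALOR column (the series `valor` and the `"n/a"` entry are hypothetical): the decimal comma is swapped for a dot before parsing, and anything that still cannot be parsed becomes NaN.

```python
import pandas as pd

# Hypothetical values: scientific notation and plain numbers with decimal commas
valor = pd.Series(["-1,951940e+04", "9255238", "-6201,8", "n/a"])

# regex=False treats the comma as a literal character
with_dots = valor.str.replace(",", ".", regex=False)

# errors="coerce" turns unparseable entries into NaN instead of raising
as_float = pd.to_numeric(with_dots, errors="coerce")
print(as_float)
```

Note that casting the result to an integer type afterwards drops the decimal part, so it is worth keeping floats if the cents matter.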


In [19]:
df_merged.dtypes

VERSION                  int64
ID_UPR                  object
ID_TECNOLOGIA           object
ID_GRUPO_EMPRESARIAL    object
ID_AREA_SISTEMA         object
ID_CONCEPTO_CTRL        object
VALOR                   object
dtype: object

In [20]:
df_merged['VALOR'].str.replace(',','.').sample(10)

16797          -6201.8
9023         341565547
6308         -154359.7
2121      8.881526e+02
12060            -8.64
16967          1603866
23474        -14621057
1951     -2.084532e+05
2253     -4.318225e+03
14001         18915664
Name: VALOR, dtype: object

In [21]:
#pd.to_numeric(df_merged['VALOR'])
df_merged['VALOR'] = pd.to_numeric((df_merged['VALOR'].str.replace(',','.')),errors='coerce').fillna(0).astype(np.int64)
df_merged.sample(10)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR
12166,201710,UPR1314,BP,EMPRESA1,ESPAÑA,G. Desvios,-55432
2191,201712,UPR2182,GN,EMPRESA1,ESPAÑA,ATR,-337533
14520,201712,UPR300,GN,EMPRESA1,ESPAÑA,R. Secundaria,-841391
11596,201709,UPR1860,CI,EMPRESA1,ESPAÑA,D. Medida Contador,-995916
3825,201807,UPR2331,EB,EMPRESA1,ESPAÑA,CANON_CONCESION,99356
1197,201707,UPR1860,CI,EMPRESA1,ESPAÑA,TASAS_MEDIOAMB,-121233
648,201704,UPR1863,CI,EMPRESA1,ESPAÑA,UREA,-266
17949,201805,UPR1862,CI,EMPRESA1,ESPAÑA,M. Intradiarios,3153164
2056,201711,UPR303,HN,EMPRESA1,ESPAÑA,TASAS_MEDIOAMB,-9027
16754,201803,UPR2343,LN,EMPRESA1,ESPAÑA,D. Medida Contador,-411144


In [22]:
df_merged['VALOR'].sum()

176045148550

In [23]:
#df_merged.to_csv?
df_merged.to_csv(Path.cwd() / 'Outputs' / 'INTEGRATED_MARGIN.csv', sep= ';',index=False)

Now I unstack (pivot) the table to get a structure suitable for modeling.
I ran into several problems during this step, so I briefly describe the process here:

1. The first attempts ended in errors such as "Length of passed values is 15227, index implies 1" and "index contains duplicate entries,cannot reshape".
2. It seemed clear that dropping the unused columns had created duplicate records, so the first thing required was a group by.
3. After that, I reset the index to turn the grouping keys back into regular columns.
4. I used the pandas function "pivot_table" instead of the method .pivot because pivot_table can aggregate (here, sum) the values that share the same key when reshaping, whereas .pivot fails on duplicates.
5. Once pivoted, the multi-level index and headers were awkward to work with, so I dropped the extra header level and reset the index to get a flat header.
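
The steps above can be sketched on a toy long-format table. The data and the names `long_df` and `wide` are made up for illustration; the point is only to show `.pivot` rejecting a duplicated key while `pivot_table` sums it.

```python
import pandas as pd

# Hypothetical long-format data; (201701, Bilateral) appears twice
long_df = pd.DataFrame({
    "VERSION": [201701, 201701, 201701, 201702],
    "ID_UPR": ["UPR115", "UPR115", "UPR115", "UPR115"],
    "ID_CONCEPTO_CTRL": ["Bilateral", "Bilateral", "Ajuste", "Bilateral"],
    "VALOR": [10, 5, -3, 7],
})

# .pivot cannot decide which of the duplicated values to keep
try:
    long_df.pivot(index="VERSION", columns="ID_CONCEPTO_CTRL", values="VALOR")
except ValueError as err:
    print("pivot failed:", err)

# pivot_table aggregates the duplicates instead (here by summing them)
wide = long_df.pivot_table(values="VALOR", index=["VERSION", "ID_UPR"],
                           columns="ID_CONCEPTO_CTRL", aggfunc="sum")

# Flatten: clear the columns' name and turn the index levels into columns
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

Concepts that do not exist for a given key (Ajuste in 201702 here) come out as NaN, which matches the sparse look of the pivoted table.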


In [24]:
df_pivoted = df_merged[['VERSION','ID_UPR','ID_TECNOLOGIA','ID_CONCEPTO_CTRL','VALOR']]
df_pivoted.shape

(15227, 5)

In [25]:
df_pivoted2 =df_pivoted.groupby(['VERSION','ID_UPR','ID_TECNOLOGIA','ID_CONCEPTO_CTRL']).sum()
print(df_pivoted2.shape)
print(df_pivoted2.columns)
df_pivoted2.sample(5)

(15122, 1)
Index(['VALOR'], dtype='object')


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,VALOR
VERSION,ID_UPR,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,Unnamed: 4_level_1
201711,UPR2182,GN,Banda,3861053
201712,UPR2414,BP,Bilateral,-10535604
201807,UPR417,EB,M. Intradiarios,-3268295
201808,UPR300,GN,TASAS_MEDIOAMB,0
201811,UPR162,GN,Terciaria,3652391


In [26]:
df_pivoted2= df_pivoted2.reset_index()
df_pivoted2.head(5)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,VALOR
0,201701,UPR115,NC,Bilateral,81778497
1,201701,UPR115,NC,CANON_NC_EST,-2543929
2,201701,UPR115,NC,COSTE_COMBUSTIBLE,-1875378
3,201701,UPR115,NC,D. Medida Contador,-7853
4,201701,UPR115,NC,IMPUESTO ELECT,-3190970


In [27]:
df_pivoted2.columns

Index(['VERSION', 'ID_UPR', 'ID_TECNOLOGIA', 'ID_CONCEPTO_CTRL', 'VALOR'], dtype='object')

In [28]:
df_pivoted3 = df_pivoted2.pivot_table( 
                          values=['VALOR'], 
                          index=['VERSION', 'ID_UPR', 'ID_TECNOLOGIA'],
                          columns=['ID_CONCEPTO_CTRL'], 
                          aggfunc=np.sum)
df_pivoted3.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR
Unnamed: 0_level_1,Unnamed: 1_level_1,ID_CONCEPTO_CTRL,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,CALIZAS,CANON HID,CANON_CONCESION,...,Redespachos,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF
VERSION,ID_UPR,ID_TECNOLOGIA,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2
201712,UPR2415,BP,,,,-3586949.0,,167366.0,,,,,...,-15.0,,,-5925.0,,,,9345873.0,,-555.0
201708,UPR2415,BP,,,,-6111375.0,,,,,,,...,21566.0,,,-254.0,,,,-306694.0,,
201706,UPR417,EB,,,,,,,,,-11843.0,,...,,,,-14.0,,,23850.0,12944.0,,
201708,UPR1206,BX,,,,,,,,,,,...,,,,,,104386.0,,,,
201804,UPR418,EB,,,,-31879575.0,,5277166.0,919729782.0,,-3987281.0,-2532808.0,...,,,,-2584.0,,,-850014.0,4394852.0,,-119583.0
201704,UPR1850,GN,,,-232869.0,,-208453.0,1139583.0,,,,,...,-2691.0,,-22816.0,,,,,-272625.0,,
201805,UPR2331,EB,,,,-1742889.0,,365241.0,28429124.0,,-104684.0,-17739.0,...,,,,-8545.0,,-160804.0,,5402263.0,,1531.0
201703,UPR1851,GN,,,-1217682.0,,-208453.0,1520302.0,,,,,...,,,-76129.0,,,,,-405140.0,,
201804,UPR300,GN,,,1105713.0,,,,,,,,...,,,,-610.0,,-7026.0,,,,
201810,UPR1751,BP,,,,-8363216.0,,,-2454610.0,,,,...,-119044.0,,,-3.0,,,,-490293.0,,


In [29]:
df_pivoted3.columns = df_pivoted3.columns.droplevel()
df_modelize= df_pivoted3.reset_index()
df_modelize.head(10)

ID_CONCEPTO_CTRL,VERSION,ID_UPR,ID_TECNOLOGIA,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,...,Redespachos,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF
0,201701,UPR115,NC,,,,,,,81778497.0,...,,,,-36.0,-197055.0,,,,,
1,201701,UPR116,NC,,,,,,,68455391.0,...,,,,,-308245.0,,,,,
2,201701,UPR1198,NC,,,,,,,,...,,,,-192.0,,,,,,
3,201701,UPR1205,EB,,,,-124366.0,,,1578792.0,...,,,,-455.0,,-33711.0,,55144.0,,
4,201701,UPR1206,BX,,,,-9972.0,,,,...,,,,-7.0,,-16346.0,,,,
5,201701,UPR1207,EB,,,,-198314.0,,15066.0,59653.0,...,,,,-807.0,,-1528.0,,225821.0,,999.0
6,201701,UPR1314,BP,,,,-429377.0,,,-118383.0,...,-8556.0,,,-39.0,,,,-24031.0,,
7,201701,UPR1315,BP,,,,-506708.0,,2001.0,,...,1146.0,,,-397.0,,,,441646.0,,-162.0
8,201701,UPR160,GN,-994654.0,,-938689.0,,,1045907.0,,...,47854.0,51355.0,,-2223.0,,,,251416.0,,-14892.0
9,201701,UPR162,GN,-418200.0,,-507356.0,,,957825.0,,...,745956.0,579221.0,,-316.0,,,,573346.0,,-17674.0


Finally, we add the last column of data that will go into the model: the power column from the second dataframe obtained in the liquidations R process.

In [30]:
df_power = pd.read_csv(Path.cwd() / 'Outputs' / 'LIQUIDACIONES_MWH.csv' , sep = ';', header = 0 , encoding = "ISO-8859-1",decimal=',')
print(df_power.describe())
df_power.sample(5)

             VERSION         VALOR
count   13048.000000  1.304800e+04
mean   201754.106913  4.498754e+05
std        49.951242  2.031215e+06
min    201701.000000 -5.489374e+06
25%    201706.000000 -9.301933e+03
50%    201712.000000  2.225860e+03
75%    201806.000000  8.903905e+04
max    201812.000000  3.518217e+07


Unnamed: 0,VERSION,ID_UPR,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,ID_UNIDAD,VALOR
1363,201703,UPR1861,EMPRESA1,CI,R. Secundaria,MWH,2682.859
3760,201707,UPR2491,EMPRESA2,NC,D. Medida Contador,MWH,-4800.639
7357,201801,UPR78,EMPRESA2,NC,Bilateral,MWH,9954000.0
5505,201710,UPR366,EMPRESA2,BP,Ajuste,MWH,-88537.2
10873,201808,UPR2342,EMPRESA1,LN,Bilateral,MWH,3255095.0


I apply the same filters as for the previous dfs.

Then I apply a group by, in case we have the same duplication problem as before.
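
Why the group by matters before joining can be sketched as follows. The two toy frames `model` and `power` and their values are hypothetical; collapsing the energy table to one row per key first means the left join cannot multiply the rows of the model table.

```python
import pandas as pd

keys = ["VERSION", "ID_UPR", "ID_TECNOLOGIA"]

# Hypothetical model table: one row per key
model = pd.DataFrame({"VERSION": [201701, 201701], "ID_UPR": ["UPR115", "UPR160"],
                      "ID_TECNOLOGIA": ["NC", "GN"], "Bilateral": [81778497.0, None]})

# Hypothetical energy table: several concepts per key, in MWh
power = pd.DataFrame({"VERSION": [201701, 201701, 201701],
                      "ID_UPR": ["UPR115", "UPR115", "UPR160"],
                      "ID_TECNOLOGIA": ["NC", "NC", "GN"],
                      "VALOR": [100.5, 20.0, 7.25]})

# One total per key, then a left join on the same keys
power_total = power.groupby(keys, as_index=False)["VALOR"].sum()
merged = model.merge(power_total, on=keys, how="left")
merged = merged.rename(columns={"VALOR": "POWER_MWH"})
print(merged)
```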

In [31]:
df_power = df_power[df_power['ID_GRUPO_EMPRESARIAL'] == 'EMPRESA1']
print(df_power.describe())

             VERSION         VALOR
count    6417.000000  6.417000e+03
mean   201755.352189  4.508979e+05
std        50.180305  2.115280e+06
min    201701.000000 -3.789073e+06
25%    201706.000000 -1.387000e+04
50%    201712.000000  2.163250e+02
75%    201807.000000  5.869300e+04
max    201812.000000  3.256008e+07


In [32]:
df_power = df_power[['VERSION','ID_UPR','ID_TECNOLOGIA','VALOR']]
df_power= df_power.groupby(['VERSION','ID_UPR','ID_TECNOLOGIA']).sum().reset_index()
print(df_power.describe())

             VERSION         VALOR
count    1024.000000  1.024000e+03
mean   201755.424805  2.825598e+06
std        50.143803  4.930549e+06
min    201701.000000 -3.789073e+06
25%    201706.000000  2.363667e+04
50%    201712.000000  5.563538e+05
75%    201806.000000  3.699485e+06
max    201812.000000  3.262049e+07


In [33]:
df_power.describe()

Unnamed: 0,VERSION,VALOR
count,1024.0,1024.0
mean,201755.424805,2825598.0
std,50.143803,4930549.0
min,201701.0,-3789073.0
25%,201706.0,23636.67
50%,201712.0,556353.8
75%,201806.0,3699485.0
max,201812.0,32620490.0


In [34]:
df_modelize = pd.merge(df_modelize, df_power, on=['VERSION','ID_UPR','ID_TECNOLOGIA'], how='left')
df_modelize = df_modelize.rename(columns={'VALOR': 'POWER_MWH'})
df_modelize.sample(10)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,...,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF,POWER_MWH
906,201809,UPR304,HN,,,,-30302705.0,,4353773.0,48547240.0,...,3326822.0,,-12510.0,,0.0,,-379058.0,,-26216.0,6112505.35
19,201701,UPR1863,CI,,,,-665273.0,,507087.0,,...,,,-385.0,,-16965.0,,60666.0,,-24716.0,356606.66
933,201810,UPR194,HN,,,,-2979839.0,,4676164.0,45649355.0,...,582666.0,,-892.0,,0.0,,256886.0,,-92.0,5758031.67
619,201803,UPR160,GN,1552619.0,,-327355.0,,,41000861.0,,...,,,-1242.0,,,,6516341.0,,-1031038.0,2870328.03
242,201706,UPR2103,EB,,,,-9973601.0,,,22620640.0,...,,,-59.0,,-342718.0,,2788893.0,,,250413.622
147,201704,UPR1851,GN,,,-188983.0,,-208453.0,3647717.0,,...,,-13461.0,,,,,-347280.0,,,1281399.496
521,201712,UPR418,EB,,,,-22006169.0,,1617993.0,221521612.0,...,,,-747.0,,,-563690.0,10071687.0,,-52571.0,3515170.508
798,201807,UPR1850,GN,,,-2755574.0,,0.0,11198787.0,,...,,-453631.0,,,,,-16619612.0,,,6580730.35
696,201804,UPR77,NC,,,,,,,315371005.0,...,,,-119.0,3328.0,-2061993.0,,,,,8254901.674
501,201712,UPR1863,CI,1680134.0,,,,,12846378.0,26215189.0,...,,,-153.0,,-12093.0,,433606.0,-2098.0,-29388.0,4719742.07


In [35]:
df_modelize.to_csv(Path.cwd() / 'Outputs' / 'DF_MODELIZE.csv', sep= ';',decimal=',',index=False)

In [36]:
To_be_deleted =['df1',
                'df_aux',
                'df_merged',
                'dfTotal',
                'df_pivoted',
                'df_pivoted2',
                'df_pivoted3',
                'path_list',
                'inputpath',
                'inputpath1']
To_be_deleted

['df1',
 'df_aux',
 'df_merged',
 'dfTotal',
 'df_pivoted',
 'df_pivoted2',
 'df_pivoted3',
 'path_list',
 'inputpath',
 'inputpath1']

In [37]:
for item in To_be_deleted:
    # 'del item' would only unbind the loop variable; remove the named variable itself
    globals().pop(item, None)