# Python TFM Section

## Data Processing

This section of the TFM covers all the data preparation needed for the model:

1. Import all the .csv files created in the R section and unify them.
2. Build a single data frame containing all the variables and information required by the regression model.
3. Unstack the structure to make it more suitable for modelling.


In [1]:
# Import the commonly used libraries for DataFrame handling and quick visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [2]:
#!conda install --yes pathlib
#$ python -m pip install pathlib

In [3]:
# pathlib ships with the standard library since Python 3.4, so the install above is normally unnecessary.
# Import it to work with relative paths as easily as possible
from pathlib import Path

In [4]:
print(Path.cwd())

C:\Users\ES71531200G\Desktop\Data Science\00.TFM


In [5]:
# Define the names of the files to import from the Outputs folder
%pwd

file1 = "COSTES_E4E_EUROS.csv"
file2 = "LIQUIDACIONES_EUROS.csv"
file3 = "NUCLEAR_WASTES_EUROS.csv"

file5 = "LIQUIDACIONES_MWH.csv"

File_list = [file1, file2, file3]
del(file1,file2,file3)


File_list

['COSTES_E4E_EUROS.csv', 'LIQUIDACIONES_EUROS.csv', 'NUCLEAR_WASTES_EUROS.csv']

In [7]:
# Command to list all the variables created so far
%whos

Variable    Type      Data/Info
-------------------------------
File_list   list      n=3
Path        type      <class 'pathlib.Path'>
file5       str       LIQUIDACIONES_MWH.csv
np          module    <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd          module    <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
plt         module    <module 'matplotlib.pyplo<...>\\matplotlib\\pyplot.py'>


Calling read_csv with its default arguments raises an error because of the encoding R used when exporting the files.
Setting the encoding and sep parameters solves the problem.

In [8]:
inputpath1 = Path.cwd() / 'Outputs' / 'COSTES_E4E_EUROS.csv' 
df1 = pd.read_csv(inputpath1, sep = ';', header = 0 , encoding = "ISO-8859-1")

I create a df with the same column names and data types but no rows, to use it as the initial frame to which everything will be appended.

In [9]:
dfTotal = df1[0:0]
del(df1)
dfTotal

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR,ID_UNIDAD


In [10]:
path_list = []
for file in File_list:
    inputpath = Path.cwd() / 'Outputs' / file
    print (inputpath)
    df1 = pd.read_csv(inputpath, sep = ';', header = 0 , encoding = "ISO-8859-1")
    dfTotal = dfTotal.append(df1)

C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\COSTES_E4E_EUROS.csv
C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\LIQUIDACIONES_EUROS.csv
C:\Users\ES71531200G\Desktop\Data Science\00.TFM\Outputs\NUCLEAR_WASTES_EUROS.csv


FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


The process returns a warning: the columns of the files are not aligned, meaning at least one column is missing from at least one file. So we explore the data to see what is happening.

In [11]:
dfTotal.sample(20)

Unnamed: 0,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION
6016,,M. Intradiarios,EMPRESA2,NC,EUROS,UPR295,-2619198,201708
17507,,G. Potencia LP,EMPRESA1,GN,EUROS,UPR1864,32702929,201811
3807,ESPAÑA,CENTIMO_VERDE,EMPRESA1,HN,EUROS,UPR304,"-2,013255e+04",201807
16028,,I. G. Desvíos y Terciaria,EMPRESA1,BP,EUROS,UPR2142,-4512979,201809
5413,,S. Res. Pot. Adicional,EMPRESA1,NC,EUROS,UPR1198,-70555,201708
1231,ESPAÑA,OTROS,EMPRESA1,LN,EUROS,UPR2342,"-1,043785e+03",201707
3315,ESPAÑA,RELIQ_CANON_NC_CATALUÃA,EMPRESA1,NC,EUROS,UPR2622,"0,000000e+00",201805
3600,,M. Intradiarios,EMPRESA1,LN,EUROS,UPR2342,-3248840,201705
11552,,VCF,EMPRESA2,BX,EUROS,UPR2262,3156733,201803
15778,,I. G. Desvíos y Terciaria,EMPRESA3,GN,EUROS,UPR1714,-1500767,201809
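
A side note on the loop in In [10]: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent installation that cell would need pd.concat instead. The following is a minimal sketch, not the code that produced the outputs shown here; the two small frames are hypothetical stand-ins for the exported files. Passing sort=False, as the warning suggests, keeps the columns in order of first appearance rather than sorting them alphabetically.

```python
import pandas as pd

# Hypothetical stand-ins for two of the exported files: the second one
# lacks ID_AREA_SISTEMA, like LIQUIDACIONES_EUROS.csv does.
costs = pd.DataFrame({
    "VERSION": [201701],
    "ID_UPR": ["UPR115"],
    "ID_AREA_SISTEMA": ["ESPAÑA"],
    "VALOR": ["-1,5e+03"],
})
settlements = pd.DataFrame({
    "VERSION": [201701],
    "ID_UPR": ["UPR1198"],
    "VALOR": ["-3530563"],
})

# Concatenate every frame in one call; sort=False keeps the columns in
# order of first appearance instead of sorting them alphabetically.
combined = pd.concat([costs, settlements], sort=False, ignore_index=True)

print(combined.columns.tolist())
print(combined["ID_AREA_SISTEMA"].isna().sum())
```

With the real files, the list passed to pd.concat would hold one read_csv result per entry of File_list.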


### NaN replacement

We observe that the file "LIQUIDACIONES_EUROS" does not have the column ID_AREA_SISTEMA.
To work around this, I replace all the NaNs with the right values by selecting the unique pairs
[ID_AREA_SISTEMA - ID_UPR].

In [12]:
df_aux = dfTotal[['ID_AREA_SISTEMA', 'ID_UPR']].dropna().drop_duplicates()
df_aux.sample(10)

Unnamed: 0,ID_AREA_SISTEMA,ID_UPR
19,ESPAÑA,UPR1315
80,ESPAÑA,UPR194
75,ESPAÑA,UPR1864
102,ESPAÑA,UPR2341
12,ESPAÑA,UPR1206
51,ESPAÑA,UPR1860
137,ESPAÑA,UPR300
98,ESPAÑA,UPR2331
29,ESPAÑA,UPR1661
48,PORTUGAL,UPR1851


Now I replace the values using a left join in pandas.

In [13]:
df_merged = pd.merge(dfTotal, df_aux, on='ID_UPR', how='left')
df_merged.sample(10)
df_merged.columns

Index(['ID_AREA_SISTEMA_x', 'ID_CONCEPTO_CTRL', 'ID_GRUPO_EMPRESARIAL',
       'ID_TECNOLOGIA', 'ID_UNIDAD', 'ID_UPR', 'VALOR', 'VERSION',
       'ID_AREA_SISTEMA_y'],
      dtype='object')

In [14]:
df_merged = df_merged.rename(columns={'ID_AREA_SISTEMA_y': 'ID_AREA_SISTEMA'})
df_merged = df_merged.drop(columns="ID_AREA_SISTEMA_x")
print(df_merged.columns)
df_merged.sample(10)

Index(['ID_CONCEPTO_CTRL', 'ID_GRUPO_EMPRESARIAL', 'ID_TECNOLOGIA',
       'ID_UNIDAD', 'ID_UPR', 'VALOR', 'VERSION', 'ID_AREA_SISTEMA'],
      dtype='object')


Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
10726,D. Medida Contador,EMPRESA1,CI,EUROS,UPR1861,3791822,201708,ESPAÑA
3447,UREA,EMPRESA1,CI,EUROS,UPR1860,"-3,200063e+04",201806,ESPAÑA
4071,PEAJE GEN,EMPRESA1,BX,EUROS,UPR726,"-2,069280e+01",201808,ESPAÑA
21801,Banda,EMPRESA3,GN,EUROS,UPR21,1523253,201810,
1783,AMONIACO,EMPRESA1,LN,EUROS,UPR2342,"-7,255110e+03",201710,ESPAÑA
1875,IMPUESTO ELECT,EMPRESA1,BP,EUROS,UPR1315,"6,032868e+01",201711,ESPAÑA
13786,M. Diario,EMPRESA2,HN,EUROS,UPR1310,195743857,201712,
1214,IMPUESTO ELECT,EMPRESA1,EB,EUROS,UPR2331,"-6,176099e+04",201707,ESPAÑA
17397,M. Diario,EMPRESA1,NC,EUROS,UPR2491,9643711,201804,ESPAÑA
12646,S. Res. Pot. Adicional,EMPRESA2,NC,EUROS,UPR2622,-298045,201710,ESPAÑA
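
The merge, rename and drop steps above can also be expressed as a lookup. This is a hedged alternative sketch on a hypothetical four-row frame, not the method used in this notebook: it builds a Series that maps each ID_UPR to its known area and fills only the missing entries. One difference worth noting is that a left merge duplicates rows if a UPR appears with more than one area, whereas the lookup keeps the row count fixed.

```python
import pandas as pd

# Hypothetical miniature of dfTotal: one UPR has its area only on some rows.
df = pd.DataFrame({
    "ID_UPR": ["UPR1851", "UPR1851", "UPR115", "UPR21"],
    "ID_AREA_SISTEMA": ["PORTUGAL", None, "ESPAÑA", None],
})

# One area per UPR, taken from the rows where it is known.
area_by_upr = (
    df.dropna(subset=["ID_AREA_SISTEMA"])
      .drop_duplicates(subset="ID_UPR")
      .set_index("ID_UPR")["ID_AREA_SISTEMA"]
)

# Fill only the missing areas; UPRs never seen with an area stay NaN.
df["ID_AREA_SISTEMA"] = df["ID_AREA_SISTEMA"].fillna(df["ID_UPR"].map(area_by_upr))
print(df)
```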


In [15]:
df_merged = df_merged[df_merged['ID_GRUPO_EMPRESARIAL'] == 'EMPRESA1']
df_merged.sample(10)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
17275,G. Desvios,EMPRESA1,BP,EUROS,UPR2141,-6336555,201804,
10458,M. Intradiarios,EMPRESA1,NC,EUROS,UPR116,-1794167,201708,ESPAÑA
12261,M. Diario,EMPRESA1,GN,EUROS,UPR1851,188593618,201710,PORTUGAL
6418,R. Secundaria,EMPRESA1,LN,EUROS,UPR2341,9977274,201702,ESPAÑA
18574,G. Potencia LP,EMPRESA1,CI,EUROS,UPR1860,9053636,201806,ESPAÑA
10368,D. Medida Contador,EMPRESA1,EB,EUROS,UPR417,-9837201,201707,ESPAÑA
18874,Ajuste,EMPRESA1,BP,EUROS,UPR2414,-5750324,201806,ESPAÑA
11481,R. Secundaria,EMPRESA1,GN,EUROS,UPR1851,3524009,201709,PORTUGAL
23566,SERV_GEST_RES,EMPRESA1,NC,EUROS,UPR77,-1807981693,201807,ESPAÑA
3197,COSTE_COMBUSTIBLE,EMPRESA1,GN,EUROS,UPR1850,"-1,432440e+03",201805,PORTUGAL


In [16]:
df_merged[df_merged['ID_AREA_SISTEMA'].isna()].head(5)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
5079,D. Medida Contador,EMPRESA1,NC,EUROS,UPR1198,-3530563,201701,
5080,S. Res. Pot. Adicional,EMPRESA1,NC,EUROS,UPR1198,-1928,201701,
5239,Ajuste,EMPRESA1,BP,EUROS,UPR1751,-3748906,201701,
5240,Bilateral,EMPRESA1,BP,EUROS,UPR1751,-2131556,201701,
5241,D. Medida Contador,EMPRESA1,BP,EUROS,UPR1751,-2245397,201701,


There are still NaN values, but from our knowledge of the original data we know that ONLY 2 UPRs have ID_AREA_SISTEMA = 'PORTUGAL', which means that every remaining NaN should be ESPAÑA. So we now replace all the remaining NaNs.

In [17]:
df_merged = df_merged.fillna('ESPAÑA')
df_merged.sample(10)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA
21658,Banda,EMPRESA1,GN,EUROS,UPR1851,14607510,201810,PORTUGAL
3619,RELIQ_CANON_NC_CATALUÃA,EMPRESA1,NC,EUROS,UPR116,"0,000000e+00",201807,ESPAÑA
3918,PEAJE GEN,EMPRESA1,GN,EUROS,UPR1851,"0,000000e+00",201808,PORTUGAL
9410,Banda,EMPRESA1,LN,EUROS,UPR2341,2099372,201706,ESPAÑA
22555,R. Secundaria,EMPRESA1,CI,EUROS,UPR1863,-1261176,201811,ESPAÑA
3043,TASAS_MEDIOAMB,EMPRESA1,LN,EUROS,UPR2341,"4,259270e+03",201804,ESPAÑA
7091,Banda,EMPRESA1,EB,EUROS,UPR2103,7036507,201703,ESPAÑA
7501,D. Medida Contador,EMPRESA1,NC,EUROS,UPR116,1489421,201704,ESPAÑA
3904,IMPUESTO ELECT,EMPRESA1,CI,EUROS,UPR1662,"-1,169843e+06",201808,ESPAÑA
20197,M. Diario,EMPRESA1,CI,EUROS,UPR1863,339687625,201808,ESPAÑA
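
Note that df_merged.fillna('ESPAÑA') fills missing values in every column, not only in ID_AREA_SISTEMA. That is harmless as long as no other column has gaps, which is assumed rather than checked here. A minimal sketch of a more targeted fill, on a hypothetical two-row frame, passes a dict so that only the named column is touched:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two different columns.
df = pd.DataFrame({
    "ID_AREA_SISTEMA": ["PORTUGAL", np.nan],
    "VALOR": [np.nan, "12,5"],
})

# A dict restricts the replacement to the named column only.
filled = df.fillna({"ID_AREA_SISTEMA": "ESPAÑA"})
print(filled)
```

The gap in VALOR is left as NaN, so it can be handled separately when the column is converted to numbers.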


In [18]:
df_merged[df_merged['ID_AREA_SISTEMA'].isna()].head(5)

Unnamed: 0,ID_CONCEPTO_CTRL,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_UNIDAD,ID_UPR,VALOR,VERSION,ID_AREA_SISTEMA


And finally, reorder the columns into the order we are already used to.

In [19]:
df_merged = df_merged[['VERSION','ID_UPR','ID_TECNOLOGIA','ID_GRUPO_EMPRESARIAL','ID_AREA_SISTEMA','ID_CONCEPTO_CTRL','VALOR']]
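# reset_index() returns a new frame; the result is not assigned, so df_merged keeps its original index
df_merged.reset_index()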
df_merged.sample(5)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR
269,201702,UPR194,HN,EMPRESA1,ESPAÑA,LUBRICANTES,"-7,675195e+00"
18101,201805,UPR2414,BP,EMPRESA1,ESPAÑA,S. Res. Pot. Adicional,-64563
18560,201806,UPR1851,GN,EMPRESA1,PORTUGAL,Banda,13523918
12578,201710,UPR2414,BP,EMPRESA1,ESPAÑA,Redespachos,-1802986
12759,201710,UPR418,EB,EMPRESA1,ESPAÑA,D. Medida Contador,-3899041


### Solving the wrong data-type importation

At this point I tried to save the current df, "df_merged", for the later visualization part: this is the structure needed to represent the Integral Margin of the different power plants, as well as the evolution over time of each one of them.

The problem seemed to be the data types, so I first tried to convert the VALOR column to numeric directly, with no success.

The error suggested converting the data to floats, but the lesson learnt was that pandas expects a dot, not a comma, as the decimal separator when parsing floats.


In [20]:
df_merged.dtypes

VERSION                  int64
ID_UPR                  object
ID_TECNOLOGIA           object
ID_GRUPO_EMPRESARIAL    object
ID_AREA_SISTEMA         object
ID_CONCEPTO_CTRL        object
VALOR                   object
dtype: object

In [21]:
df_merged['VALOR'].str.replace(',','.').sample(10)

3853      0.000000e+00
6779           -537.69
18270         11580.02
11836         -669.258
15450         -1925.23
7244          -3837804
10597         -1653637
15703          -109.74
4698      0.000000e+00
4376     -2.482806e+06
Name: VALOR, dtype: object

In [22]:
df_merged['VALOR'] = pd.to_numeric((df_merged['VALOR'].str.replace(',','.')),errors='coerce').fillna(0).astype(np.int64)
df_merged.sample(10)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_GRUPO_EMPRESARIAL,ID_AREA_SISTEMA,ID_CONCEPTO_CTRL,VALOR
11313,201709,UPR160,GN,EMPRESA1,ESPAÑA,G. Desvios,442965
18895,201806,UPR2491,NC,EMPRESA1,ESPAÑA,M. Intradiario Continuo,8706
2173,201712,UPR2344,EB,EMPRESA1,ESPAÑA,TASA_ARAGON,12681
2845,201803,UPR2622,NC,EMPRESA1,ESPAÑA,OTROS,-13271
22357,201811,UPR162,GN,EMPRESA1,ESPAÑA,Terciaria,3652391
13365,201711,UPR2331,EB,EMPRESA1,ESPAÑA,I. G. Desvíos y Terciaria,-166229
22500,201811,UPR1851,GN,EMPRESA1,PORTUGAL,R. Secundaria,7208051
12669,201710,UPR304,HN,EMPRESA1,ESPAÑA,Ajuste,-1058067
10006,201707,UPR1863,CI,EMPRESA1,ESPAÑA,M. Diario,147771076
20991,201809,UPR194,HN,EMPRESA1,ESPAÑA,M. Intradiarios,28392252
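
The decimal-comma issue can be reproduced on a tiny example. The sketch below uses a hypothetical two-row export held in memory; it compares the replace-and-convert route used above with the decimal parameter of read_csv, which parses the commas at import time (the same parameter is used later in this notebook for LIQUIDACIONES_MWH.csv).

```python
import io
import pandas as pd

# Hypothetical two-row export using ';' as separator and ',' as decimal mark.
raw = "ID_UPR;VALOR\nUPR304;-2,013255e+04\nUPR2342;-1,043785e+03\n"

# Without decimal=',' the column is read as text.
as_text = pd.read_csv(io.StringIO(raw), sep=";")

# With decimal=',' pandas parses the values as floats at import time.
as_float = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

# The replace-and-convert route gives the same numbers.
converted = pd.to_numeric(as_text["VALOR"].str.replace(",", ".", regex=False))

print(as_text["VALOR"].dtype, as_float["VALOR"].dtype)
```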


Now, let's check whether the total sum of the "VALOR" column is 175694686343 (the total we know it must be, from summing the original files). If so, the comma replacement solved the issue and we can export the file for the future visualization module.

In [24]:
df_merged['VALOR'].sum()

175694686343

In [25]:
#df_merged.to_csv?
df_merged.to_csv(Path.cwd() / 'Outputs' / 'INTEGRATED_MARGIN.csv', sep= ';',index=False)

### Pivoting / Unstacking our df

Now I unstack (pivot) the table to get a structure suitable for modelling.
I faced multiple problems during this procedure, so here I briefly describe the process:

1. The first attempts ended in errors such as "Length of passed values is 15227, index implies 1" and "Index contains duplicate entries, cannot reshape".
2. It seemed clear that, at some point while dropping the unused columns, I had created duplicate records, so the first thing required was a group by.
3. After that, I reset the index to turn all the index levels back into regular columns.
4. I used the pandas function "pivot_table" instead of the method .pivot because the former can sum the values of any duplicates generated during the reshaping.
5. Once pivoted, the multi-level index and header were awkward to work with, so I dropped the extra level and created a new header.
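
The difference between the two reshaping functions in step 4 can be illustrated on a small example. This is a sketch on hypothetical data, with one deliberately duplicated key: .pivot raises a ValueError, while pivot_table aggregates the duplicates.

```python
import pandas as pd

# Hypothetical long-format data with one duplicated (ID_UPR, concept) pair.
long_df = pd.DataFrame({
    "VERSION": [201701, 201701, 201701],
    "ID_UPR": ["UPR115", "UPR115", "UPR115"],
    "ID_CONCEPTO_CTRL": ["Banda", "Banda", "Bilateral"],
    "VALOR": [10, 5, 100],
})

# .pivot refuses to reshape when an index/column pair appears twice.
try:
    long_df.pivot(index="ID_UPR", columns="ID_CONCEPTO_CTRL", values="VALOR")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates instead (here by summing them).
wide = long_df.pivot_table(
    index=["VERSION", "ID_UPR"],
    columns="ID_CONCEPTO_CTRL",
    values="VALOR",
    aggfunc="sum",
)
print(wide)
```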


In [26]:
df_pivoted = df_merged[['VERSION','ID_UPR','ID_TECNOLOGIA','ID_CONCEPTO_CTRL','VALOR']]
df_pivoted.shape

(15117, 5)

In [27]:
df_pivoted2 =df_pivoted.groupby(['VERSION','ID_UPR','ID_TECNOLOGIA','ID_CONCEPTO_CTRL']).sum()
print(df_pivoted2.shape)
print(df_pivoted2.columns)
df_pivoted2.sample(5)

(15117, 1)
Index(['VALOR'], dtype='object')


VERSION,ID_UPR,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,VALOR
201703,UPR304,HN,D. Medida Contador,-136336
201706,UPR2331,EB,D. Medida Contador,-129675
201807,UPR418,EB,I. G. Desvíos y Terciaria,-1036264
201810,UPR1205,EB,CANON HID,-61818
201709,UPR1864,GN,Terciaria,1575696


In [28]:
df_pivoted2= df_pivoted2.reset_index()
df_pivoted2.head(5)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,VALOR
0,201701,UPR115,NC,Bilateral,81778497
1,201701,UPR115,NC,CANON_NC_EST,-2543929
2,201701,UPR115,NC,COSTE_COMBUSTIBLE,-1875378
3,201701,UPR115,NC,D. Medida Contador,-7853
4,201701,UPR115,NC,IMPUESTO ELECT,-3190970


In [29]:
df_pivoted2.columns

Index(['VERSION', 'ID_UPR', 'ID_TECNOLOGIA', 'ID_CONCEPTO_CTRL', 'VALOR'], dtype='object')

In [30]:
df_pivoted3 = df_pivoted2.pivot_table( 
                          values=['VALOR'], 
                          index=['VERSION', 'ID_UPR', 'ID_TECNOLOGIA'],
                          columns=['ID_CONCEPTO_CTRL'], 
                          aggfunc=np.sum)
df_pivoted3.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR,VALOR
Unnamed: 0_level_1,Unnamed: 1_level_1,ID_CONCEPTO_CTRL,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,CALIZAS,CANON HID,CANON_CONCESION,...,Redespachos,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF
VERSION,ID_UPR,ID_TECNOLOGIA,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2
201812,UPR77,NC,,,,,,,531709800.0,,,,...,,,,-197.0,-1657920.0,-963125.0,,3148.0,,
201803,UPR1207,EB,,,,-3159417.0,,,3318776.0,,-38098.0,-95544.0,...,,,,-380.0,,-31206.0,,,,
201702,UPR2415,BP,,,,-3508085.0,,18842.0,,,,,...,,,,-562.0,,,,372237.0,,-2505.0
201808,UPR2414,BP,,,,,,,,,,,...,,,,-2720.0,,,,-472486.0,,
201808,UPR304,HN,,,,-3786943.0,,3237551.0,29335240.0,-73120.0,,,...,,2139507.0,,-6591.0,,-95941.0,,-804428.0,,-29651.0
201805,UPR726,BX,,,,,,,,,0.0,,...,,,,-271.0,,,,,,
201711,UPR194,HN,,,,,,3101623.0,37467900.0,-44660.0,,,...,,,,-883.0,,-3460.0,,102273.0,,-56871.0
201806,UPR115,NC,,,,,,,1509465000.0,,,,...,,,,-5170.0,-4955999.0,,,,,
201707,UPR2103,EB,,,,-9265889.0,,,29751970.0,,-279300.0,,...,,,,,,-4211125.0,,687236.0,,
201809,UPR74,HN,,,,,,,,,,,...,,,,,,,,,,


In [31]:
df_pivoted3.columns = df_pivoted3.columns.droplevel()
df_modelize= df_pivoted3.reset_index()
df_modelize.head(10)

ID_CONCEPTO_CTRL,VERSION,ID_UPR,ID_TECNOLOGIA,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,...,Redespachos,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF
0,201701,UPR115,NC,,,,,,,81778497.0,...,,,,-36.0,-5156083.0,,,,,
1,201701,UPR116,NC,,,,,,,68455391.0,...,,,,,-4357224.0,,,,,
2,201701,UPR1198,NC,,,,,,,,...,,,,-192.0,,,,,,
3,201701,UPR1205,EB,,,,-124366.0,,,1578792.0,...,,,,-455.0,,-33711.0,,55144.0,,
4,201701,UPR1206,BX,,,,-9972.0,,,,...,,,,-7.0,,-16346.0,,,,
5,201701,UPR1207,EB,,,,-198314.0,,15066.0,59653.0,...,,,,-807.0,,-1528.0,,225821.0,,999.0
6,201701,UPR1314,BP,,,,-429377.0,,,-118383.0,...,-8556.0,,,-39.0,,,,-24031.0,,
7,201701,UPR1315,BP,,,,-506708.0,,2001.0,,...,1146.0,,,-397.0,,,,441646.0,,-162.0
8,201701,UPR160,GN,-994654.0,,-938689.0,,,1045907.0,,...,47854.0,51355.0,,-2223.0,,,,251416.0,,-14892.0
9,201701,UPR162,GN,-418200.0,,-507356.0,,,957825.0,,...,745956.0,579221.0,,-316.0,,,,573346.0,,-17674.0
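
The extra 'VALOR' header level appears because values was passed as a list. As a hedged alternative to droplevel, the sketch below (hypothetical two-row data) passes values as a plain string, which yields single-level columns directly, and then clears the leftover axis name.

```python
import pandas as pd

long_df = pd.DataFrame({
    "VERSION": [201701, 201701],
    "ID_UPR": ["UPR115", "UPR116"],
    "ID_TECNOLOGIA": ["NC", "NC"],
    "ID_CONCEPTO_CTRL": ["Bilateral", "Banda"],
    "VALOR": [100, 7],
})

keys = ["VERSION", "ID_UPR", "ID_TECNOLOGIA"]

# values as a list -> columns get an extra outer level ('VALOR', concept).
nested = long_df.pivot_table(values=["VALOR"], index=keys,
                             columns="ID_CONCEPTO_CTRL", aggfunc="sum")

# values as a plain string -> columns are just the concept names.
flat = long_df.pivot_table(values="VALOR", index=keys,
                           columns="ID_CONCEPTO_CTRL", aggfunc="sum")

# Drop the leftover axis name so the header reads cleanly after reset_index.
flat = flat.reset_index().rename_axis(columns=None)
print(nested.columns.nlevels, flat.columns.tolist())
```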


Finally, we add the last column of data that we are going to use in the model: the power column from the second
dataframe obtained in the R liquidations process.

In [32]:
df_power = pd.read_csv(Path.cwd() / 'Outputs' / 'LIQUIDACIONES_MWH.csv' , sep = ';', header = 0 , encoding = "ISO-8859-1",decimal=',')
print(df_power.describe())
df_power.sample(5)

             VERSION         VALOR
count   13048.000000  1.304800e+04
mean   201754.106913  4.498754e+05
std        49.951242  2.031215e+06
min    201701.000000 -5.489374e+06
25%    201706.000000 -9.301933e+03
50%    201712.000000  2.225860e+03
75%    201806.000000  8.903905e+04
max    201812.000000  3.518217e+07


Unnamed: 0,VERSION,ID_UPR,ID_GRUPO_EMPRESARIAL,ID_TECNOLOGIA,ID_CONCEPTO_CTRL,ID_UNIDAD,VALOR
4825,201709,UPR2342,EMPRESA1,LN,Terciaria,MWH,-13291.9
6074,201711,UPR316,EMPRESA3,GN,G. Desvios,MWH,33456.5
1442,201703,UPR2133,EMPRESA2,EB,D. Medida Contador,MWH,-52178.46
8503,201804,UPR1315,EMPRESA1,BP,G. Desvios,MWH,15903.7
8322,201803,UPR2588,EMPRESA3,EB,M. Intradiarios,MWH,319965.6


I apply the same filters as in the previous dfs.

Then a group by, in case we have the same duplicate problem as before.

In [33]:
df_power = df_power[df_power['ID_GRUPO_EMPRESARIAL'] == 'EMPRESA1']
print(df_power.describe())

             VERSION         VALOR
count    6417.000000  6.417000e+03
mean   201755.352189  4.508979e+05
std        50.180305  2.115280e+06
min    201701.000000 -3.789073e+06
25%    201706.000000 -1.387000e+04
50%    201712.000000  2.163250e+02
75%    201807.000000  5.869300e+04
max    201812.000000  3.256008e+07


In [34]:
df_power = df_power[['VERSION','ID_UPR','ID_TECNOLOGIA','VALOR']]
df_power= df_power.groupby(['VERSION','ID_UPR','ID_TECNOLOGIA']).sum().reset_index()
print(df_power.describe())

             VERSION         VALOR
count    1024.000000  1.024000e+03
mean   201755.424805  2.825598e+06
std        50.143803  4.930549e+06
min    201701.000000 -3.789073e+06
25%    201706.000000  2.363667e+04
50%    201712.000000  5.563538e+05
75%    201806.000000  3.699485e+06
max    201812.000000  3.262049e+07


In [35]:
df_power.describe()

Unnamed: 0,VERSION,VALOR
count,1024.0,1024.0
mean,201755.424805,2825598.0
std,50.143803,4930549.0
min,201701.0,-3789073.0
25%,201706.0,23636.67
50%,201712.0,556353.8
75%,201806.0,3699485.0
max,201812.0,32620490.0


In [36]:
df_modelize = pd.merge(df_modelize, df_power, on=['VERSION','ID_UPR','ID_TECNOLOGIA'], how='left')
df_modelize = df_modelize.rename(columns={'VALOR': 'POWER_MWH'})
df_modelize.sample(10)

Unnamed: 0,VERSION,ID_UPR,ID_TECNOLOGIA,A. No Cobrados,AMONIACO,ATR,Ajuste,BONO_SOCIAL_PEGO,Banda,Bilateral,...,Res. Pot. Adicional,S. Regulacion,S. Res. Pot. Adicional,SERV_GEST_RES,TASAS_MEDIOAMB,TASA_ARAGON,Terciaria,UREA,VCF,POWER_MWH
6,201701,UPR1314,BP,,,,-429377.0,,,-118383.0,...,,,-39.0,,,,-24031.0,,,-13631.195
1005,201812,UPR1315,BP,,,,-7924166.0,,72267.0,,...,,,-3818.0,,,,449460.0,,3849.0,28134.576
210,201705,UPR2622,NC,,,,,,,241252566.0,...,,,,-3815873.0,,,,,,5103668.888
267,201707,UPR1205,EB,,,,-153932.0,,,34699165.0,...,,,,,-278650.0,,92364.0,,,758627.698
953,201810,UPR74,HN,,,,,,,,...,,,,,,,,,,
821,201807,UPR304,HN,,,,,,145485.0,16952863.0,...,,,-4176.0,,95941.0,,127715.0,,3382.0,417225.22
133,201704,UPR116,NC,,,,,,,194407935.0,...,,,,-4205449.0,,,,,,4473415.595
626,201803,UPR1851,GN,,,-3383349.0,,-312679.0,,,...,,,,,,,,,,
134,201704,UPR1198,NC,,,,,,,,...,,,,,,,,,,-5533.966
645,201803,UPR2622,NC,,,,,,,53314447.0,...,,,-10450.0,-199748.0,,,,,,624312.82


Finally, let's save the whole transformation to a new file. We will use this file, together with the one saved
previously, in the visualization section.

This one will bring new challenges, such as the NaN replacement, the creation of a new column called Integral_costs, and so on.

In [40]:
df_modelize.to_csv(Path.cwd() / 'Outputs' / 'DF_MODELIZE.csv', sep= ';',decimal=',',index=False)
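
Because the file is written with sep=';' and decimal=',', whatever reads it back has to use the same settings. A minimal round-trip sketch, using a hypothetical two-row frame and an in-memory buffer instead of the Outputs folder:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical two-row version of the modelling frame.
df = pd.DataFrame({
    "ID_UPR": ["UPR1314", "UPR74"],
    "POWER_MWH": [-13631.195, np.nan],
})

# Write with the same options used above, into memory instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", decimal=",", index=False)
buffer.seek(0)

# The reader must mirror the writer, or POWER_MWH comes back as text.
restored = pd.read_csv(buffer, sep=";", decimal=",")
print(restored.dtypes)
```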