# Modelling a territory with linear regression, without PCA

This notebook shows an example of modelling a territory, here the province of Zaragoza, using electoral sections chosen from the province of Burgos. The model is a linear regression, with no PCA step.

First we choose the sections using a single election, that of November 2019. We then take the chosen sections and use their equivalents from the 2016 election to check whether they also model the province of Zaragoza in that election.

## Modelling the November 2019 election

We start by loading the required libraries and the dataset for the November 2019 election.

In [1]:
import pandas as pd
import numpy as np
import random

In [2]:
strings = {'Sección' : 'str', 'cod_ccaa' : 'str', 'cod_prov' : 'str', 'cod_mun' : 'str', 'cod_sec' : 'str'}

In [8]:
df_eleccion_comp = pd.read_csv('aws/datasets/gen_N19_unif_cols_prov_copia.txt', dtype = strings)

In [9]:
df_eleccion_comp

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
0,022019111010400101001,01,04,04001,0400101001,Andalucía,Almería,Abla,1002,717,...,20172.0,19546.0,5574.0,4833.0,3286.0,3082.0,403.0,471.0,"{'PP': 193, 'PSOE': 310, 'Cs': 47, 'UP': 30, '...","[('PSOE', 310), ('PP', 193), ('VOX', 122), ('C..."
1,022019111010400201001,01,04,04002,0400201001,Andalucía,Almería,Abrucena,1013,711,...,17841.0,17115.0,4640.0,4048.0,3418.0,2770.0,568.0,620.0,"{'PP': 111, 'PSOE': 349, 'Cs': 45, 'UP': 42, '...","[('PSOE', 349), ('VOX', 147), ('PP', 111), ('C..."
2,022019111010400301001,01,04,04003,0400301001,Andalucía,Almería,Adra,667,484,...,26498.0,24688.0,5121.0,4795.0,2499.0,2301.0,337.0,333.0,"{'PP': 176, 'PSOE': 128, 'Cs': 15, 'UP': 34, '...","[('PP', 176), ('PSOE', 128), ('VOX', 116), ('U..."
3,022019111010400301002,01,04,04003,0400301002,Andalucía,Almería,Adra,1306,909,...,25677.0,23400.0,5381.0,4837.0,1815.0,1724.0,343.0,464.0,"{'PP': 251, 'PSOE': 220, 'Cs': 51, 'UP': 58, '...","[('VOX', 312), ('PP', 251), ('PSOE', 220), ('U..."
4,022019111010400301003,01,04,04003,0400301003,Andalucía,Almería,Adra,1551,975,...,22051.0,19687.0,5224.0,4044.0,1170.0,1198.0,416.0,476.0,"{'PP': 292, 'PSOE': 202, 'Cs': 73, 'UP': 52, '...","[('VOX', 327), ('PP', 292), ('PSOE', 202), ('C..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36297,022019111195200108011,19,52,52001,5200108011,Melilla,Melilla,Melilla,1638,1021,...,66352.0,62632.0,11378.0,11119.0,1508.0,1274.0,167.0,166.0,"{'PP': 303, 'PSOE': 140, 'Cs': 30, 'UP': 28, '...","[('Otros', 348), ('PP', 303), ('VOX', 158), ('..."
36298,022019111195200108012,19,52,52001,5200108012,Melilla,Melilla,Melilla,1676,1057,...,50730.0,50839.0,13272.0,13038.0,2763.0,2445.0,169.0,177.0,"{'PP': 463, 'PSOE': 205, 'Cs': 36, 'UP': 35, '...","[('PP', 463), ('VOX', 210), ('PSOE', 205), ('O..."
36299,022019111195200108013,19,52,52001,5200108013,Melilla,Melilla,Melilla,1132,638,...,37816.0,36729.0,10102.0,9640.0,1807.0,1615.0,234.0,252.0,"{'PP': 208, 'PSOE': 113, 'Cs': 31, 'UP': 25, '...","[('PP', 208), ('VOX', 144), ('PSOE', 113), ('O..."
36300,022019111195200108014,19,52,52001,5200108014,Melilla,Melilla,Melilla,899,527,...,29898.0,31384.0,5923.0,6061.0,2463.0,2136.0,244.0,284.0,"{'PP': 200, 'PSOE': 87, 'Cs': 13, 'UP': 12, 'I...","[('PP', 200), ('VOX', 126), ('PSOE', 87), ('Ot..."


First we specify the territory to model, here the province of Zaragoza. We leave the CCAA (autonomous community) and municipality options empty. The filters must be consistent with one another: if we chose a municipality, it would have to belong to the province of Zaragoza in this case.

In [10]:
ccaa_mod = []

provincia_mod = ['Zaragoza']

municipio_mod = []

secciones_mod = df_eleccion_comp

In [11]:
if len(ccaa_mod) > 0:

  secciones_mod = secciones_mod.loc[secciones_mod['CCAA'].isin(ccaa_mod)]

if len(provincia_mod) > 0:

  secciones_mod = secciones_mod.loc[secciones_mod['Provincia'].isin(provincia_mod)]

if len(municipio_mod) > 0:

  secciones_mod = secciones_mod.loc[secciones_mod['Municipio'].isin(municipio_mod)]
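
The same three-level filter is applied again further down to pick the Burgos sections, so it could be wrapped in a helper. A minimal sketch, assuming the column names 'CCAA', 'Provincia' and 'Municipio' of the dataset above; the function name `filtrar_territorio` and the three-row toy frame are hypothetical and only for illustration:

```python
import pandas as pd

def filtrar_territorio(df, ccaa=(), provincia=(), municipio=()):
    """Keep the rows matching every non-empty filter (an empty filter means no restriction)."""
    filtros = {'CCAA': ccaa, 'Provincia': provincia, 'Municipio': municipio}
    for columna, valores in filtros.items():
        if len(valores) > 0:
            df = df.loc[df[columna].isin(valores)]
    return df

# Toy frame for the example (hypothetical helper input, not the full dataset)
toy = pd.DataFrame({
    'CCAA': ['Aragón', 'Aragón', 'Andalucía'],
    'Provincia': ['Zaragoza', 'Zaragoza', 'Almería'],
    'Municipio': ['Zuera', 'Biel', 'Adra'],
    'Censo_Esc': [610, 133, 667],
})

zaragoza = filtrar_territorio(toy, provincia=['Zaragoza'])
zuera = filtrar_territorio(toy, provincia=['Zaragoza'], municipio=['Zuera'])
# Inconsistent filters (a municipality outside the province) select nothing
incoherente = filtrar_territorio(toy, provincia=['Zaragoza'], municipio=['Adra'])
```

Because the filters are combined with AND, an inconsistent combination simply returns an empty frame rather than raising an error, which is why the options have to agree with one another.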



We can see that the province of Zaragoza has 880 electoral sections.

In [12]:
secciones_mod

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
6553,022019111025000101001,02,50,50001,5000101001,Aragón,Zaragoza,Abanto,89,68,...,28322.021999,21149.000000,7855.336603,5134.000000,3217.875711,4987.000000,293.331625,139.000000,"{'PP': 42, 'PSOE': 13, 'Cs': 1, 'UP': 0, 'IU':...","[('PP', 42), ('PSOE', 13), ('VOX', 10), ('MP',..."
6554,022019111025000201001,02,50,50002,5000201001,Aragón,Zaragoza,Acered,125,91,...,18895.000000,20525.000000,3494.000000,2873.000000,4611.000000,3968.000000,84.000000,233.000000,"{'PP': 43, 'PSOE': 19, 'Cs': 4, 'UP': 0, 'IU':...","[('PP', 43), ('VOX', 20), ('PSOE', 19), ('Cs',..."
6555,022019111025000301001,02,50,50003,5000301001,Aragón,Zaragoza,Agón,117,89,...,27578.000000,27753.000000,5804.000000,5694.000000,5604.000000,5250.000000,161.000000,247.000000,"{'PP': 23, 'PSOE': 39, 'Cs': 2, 'UP': 2, 'IU':...","[('PSOE', 39), ('PP', 23), ('VOX', 20), ('Cs',..."
6556,022019111025000401001,02,50,50004,5000401001,Aragón,Zaragoza,Aguarón,475,360,...,25421.000000,23879.000000,7039.000000,6056.000000,3502.000000,3246.000000,208.000000,253.000000,"{'PP': 96, 'PSOE': 155, 'Cs': 17, 'UP': 19, 'I...","[('PSOE', 155), ('PP', 96), ('VOX', 44), ('MP'..."
6557,022019111025000501001,02,50,50005,5000501001,Aragón,Zaragoza,Aguilón,228,185,...,31410.000000,29687.000000,8651.000000,8019.000000,5616.000000,4816.000000,108.000000,191.000000,"{'PP': 84, 'PSOE': 34, 'Cs': 13, 'UP': 12, 'IU...","[('PP', 84), ('VOX', 35), ('PSOE', 34), ('Cs',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7428,022019111025029802001,02,50,50298,5029802001,Aragón,Zaragoza,Zuera,610,482,...,31542.000000,31419.000000,9774.000000,8326.000000,3118.000000,3365.000000,213.000000,395.000000,"{'PP': 134, 'PSOE': 139, 'Cs': 45, 'UP': 50, '...","[('PSOE', 139), ('PP', 134), ('VOX', 82), ('UP..."
7429,022019111025090101001,02,50,50901,5090101001,Aragón,Zaragoza,Biel,133,96,...,25367.000000,26506.000000,13108.000000,9636.000000,7146.000000,7398.000000,145.000000,214.000000,"{'PP': 18, 'PSOE': 33, 'Cs': 7, 'UP': 8, 'IU':...","[('PSOE', 33), ('VOX', 21), ('PP', 18), ('UP',..."
7430,022019111025090201001,02,50,50902,5090201001,Aragón,Zaragoza,Marracos,77,65,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 29, 'PSOE': 15, 'Cs': 4, 'UP': 3, 'IU':...","[('PP', 29), ('PSOE', 15), ('VOX', 10), ('Cs',..."
7431,022019111025090301001,02,50,50903,5090301001,Aragón,Zaragoza,Villamayor de Gállego,1143,844,...,34050.000000,31945.000000,9707.000000,8721.000000,3872.000000,3239.000000,162.000000,287.000000,"{'PP': 160, 'PSOE': 226, 'Cs': 64, 'UP': 133, ...","[('PSOE', 226), ('PP', 160), ('VOX', 160), ('U..."


We only want to model the electoral results, so we keep just those columns.

In [13]:
secciones_mod_lista = list(secciones_mod['Sección']) 

In [14]:
cols_validas_mod = ['Censo_Esc', 'Votos_Total', 'Nulos', 'Votos_Válidos', 'Blanco', 'V_Cand', 'PP', 'PSOE', 'Cs', 'UP',
       'IU', 'VOX', 'UPyD', 'MP', 'CiU', 'ERC', 'JxC', 'CUP', 'DiL', 'PNV',
       'Bildu', 'Amaiur', 'CC', 'FA', 'TE', 'BNG', 'PRC', 'GBai', 'Compromis',
       'PACMA', 'Otros']

In [15]:
secciones_mod = secciones_mod[cols_validas_mod]

In [16]:
secciones_mod

Unnamed: 0,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,UP,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
6553,89,68,0,68,0,68,42,13,1,0,...,0,0,0,0,0,0,0,0,0,0
6554,125,91,5,86,0,86,43,19,4,0,...,0,0,0,0,0,0,0,0,0,0
6555,117,89,0,89,1,88,23,39,2,2,...,0,0,0,0,0,0,0,0,0,0
6556,475,360,4,356,2,354,96,155,17,19,...,0,0,0,0,0,0,0,0,0,2
6557,228,185,1,184,2,182,84,34,13,12,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7428,610,482,2,480,3,477,134,139,45,50,...,0,0,0,0,0,0,0,0,0,13
7429,133,96,0,96,0,96,18,33,7,8,...,0,0,0,0,0,0,0,0,0,2
7430,77,65,3,62,0,62,29,15,4,3,...,0,0,0,0,0,0,0,0,0,0
7431,1143,844,5,839,10,829,160,226,64,133,...,0,0,0,0,0,0,0,0,4,17


In [27]:
# .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning
df_psoe = secciones_mod[['PSOE', 'Censo_Esc']].copy()

In [29]:
df_psoe['diff'] = df_psoe['PSOE'] / df_psoe['Censo_Esc']


In [85]:
# .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning
df_psoe = secciones_mod[['PSOE', 'Censo_Esc', 'Votos_Total']].copy()
df_psoe['division_voto'] = df_psoe['PSOE'] / df_psoe['Votos_Total']


In [86]:
df_psoe.head()

Unnamed: 0,PSOE,Censo_Esc,Votos_Total,division_voto
6553,13,89,68,0.191176
6554,19,125,91,0.208791
6555,39,117,89,0.438202
6556,155,475,360,0.430556
6557,34,228,185,0.183784


In [32]:
df_psoe['diff'].sum()/880


0.22461143484587268

In [87]:
df_psoe['division_voto'].sum()/880

0.31690528992843603

In [89]:
df_psoe.sum()

PSOE             157420.000000
Censo_Esc        715388.000000
Votos_Total      514697.000000
division_voto       278.876655
dtype: float64

Next we need the aggregate electoral results of the territory to be modelled. We first take the territory's census, then build a df with these results and finally, and this is very important, normalise them by dividing by the census, so that the size of the territory being modelled does not matter.

In [17]:
censo_mod = secciones_mod['Censo_Esc'].sum()

In [18]:
modelizacion = pd.DataFrame(secciones_mod.sum(), columns = ['Modelización'])

In [19]:
modelizacion['Modelización'] = modelizacion['Modelización'] / modelizacion['Modelización']['Censo_Esc']

We obtain a one-column df with the electoral results normalised by the census.

In [33]:
modelizacion

Unnamed: 0,Modelización
Votos_Total,0.719466
Nulos,0.006076
Votos_Válidos,0.713389
Blanco,0.006958
V_Cand,0.706431
PP,0.166932
PSOE,0.220048
Cs,0.065202
UP,0.080233
IU,0.0
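
Normalising the aggregated totals is not the same as averaging the per-section ratios: the cells above give 0.2246 for the simple mean of PSOE over census across sections, against 0.220048 for the province as a whole. The aggregate share is the census-weighted mean of the section shares. A small sketch of that identity, using the PSOE and census figures of the first three Zaragoza sections shown earlier; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'PSOE': [13, 19, 39], 'Censo_Esc': [89, 125, 117]})

# Share of the census won in each section
cuota = toy['PSOE'] / toy['Censo_Esc']

# Simple mean: every section counts the same, whatever its size
media_simple = cuota.mean()

# Aggregate share: sum the votes, sum the census, then divide
agregado = toy['PSOE'].sum() / toy['Censo_Esc'].sum()

# Census-weighted mean of the section shares, which matches the aggregate
media_ponderada = np.average(cuota, weights=toy['Censo_Esc'])
```

The weighted mean and the aggregate coincide, while the simple mean drifts towards the behaviour of the small sections.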


The first row, 'Censo_Esc', is always 1, since it is the census divided by itself, so we can drop it.

In [21]:
modelizacion = modelizacion.drop(['Censo_Esc']) 

In [37]:
modelizacion

Unnamed: 0,Modelización
Votos_Total,0.719466
Nulos,0.006076
Votos_Válidos,0.713389
Blanco,0.006958
V_Cand,0.706431
PP,0.166932
PSOE,0.220048
Cs,0.065202
UP,0.080233
IU,0.0


In [38]:
modelizacion.shape

(30, 1)

Now we have to find the sections that will model the province of Zaragoza. We choose to take them from the province of Burgos, starting with all of them. There are about 587.

In [39]:
ccaa_select = []

provincia_select = ['Burgos']

municipio_select = []

secciones_select = df_eleccion_comp

In [40]:
if len(ccaa_select) > 0:

  secciones_select = secciones_select.loc[secciones_select['CCAA'].isin(ccaa_select)]

if len(provincia_select) > 0:

  secciones_select = secciones_select.loc[secciones_select['Provincia'].isin(provincia_select)]

if len(municipio_select) > 0:

  secciones_select = secciones_select.loc[secciones_select['Municipio'].isin(municipio_select)]



In [41]:
secciones_select

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
13054,022019111080900101001,08,09,09001,0900101001,Castilla - La Mancha,Burgos,Abajas,32,25,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 9, 'PSOE': 12, 'Cs': 1, 'UP': 0, 'IU': ...","[('PSOE', 12), ('PP', 9), ('VOX', 3), ('Cs', 1..."
13055,022019111080900301001,08,09,09003,0900301001,Castilla - La Mancha,Burgos,Adrada de Haza,182,134,...,26576.000000,23346.000000,6150.000000,5573.000000,3588.000000,3408.000000,277.000000,301.000000,"{'PP': 33, 'PSOE': 48, 'Cs': 9, 'UP': 23, 'IU'...","[('PSOE', 48), ('PP', 33), ('UP', 23), ('VOX',..."
13056,022019111080900601001,08,09,09006,0900601001,Castilla - La Mancha,Burgos,Aguas Cándidas,57,30,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 7, 'PSOE': 7, 'Cs': 2, 'UP': 3, 'IU': 0...","[('PP', 7), ('PSOE', 7), ('VOX', 7), ('UP', 3)..."
13057,022019111080900701001,08,09,09007,0900701001,Castilla - La Mancha,Burgos,Aguilar de Bureba,54,37,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 25, 'PSOE': 5, 'Cs': 2, 'UP': 0, 'IU': ...","[('PP', 25), ('PSOE', 5), ('VOX', 3), ('Cs', 2..."
13058,022019111080900901001,08,09,09009,0900901001,Castilla - La Mancha,Burgos,Albillos,158,127,...,29610.000000,29418.000000,11411.000000,9711.000000,1816.000000,1738.000000,292.000000,348.000000,"{'PP': 34, 'PSOE': 29, 'Cs': 10, 'UP': 8, 'IU'...","[('VOX', 35), ('PP', 34), ('PSOE', 29), ('Cs',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13636,022019111080990401001,08,09,09904,0990401001,Castilla - La Mancha,Burgos,Valle de las Navas,464,344,...,28686.000000,30743.000000,10484.000000,9223.000000,4248.000000,4655.000000,232.000000,427.000000,"{'PP': 112, 'PSOE': 99, 'Cs': 26, 'UP': 26, 'I...","[('PP', 112), ('PSOE', 99), ('VOX', 67), ('Cs'..."
13637,022019111080990501001,08,09,09905,0990501001,Castilla - La Mancha,Burgos,Valle de Sedano,373,239,...,31934.000000,26436.000000,8577.000000,6574.000000,5122.000000,4766.000000,258.000000,332.000000,"{'PP': 72, 'PSOE': 64, 'Cs': 17, 'UP': 50, 'IU...","[('PP', 72), ('PSOE', 64), ('UP', 50), ('VOX',..."
13638,022019111080990601001,08,09,09906,0990601001,Castilla - La Mancha,Burgos,Merindad de Río Ubierna,1185,883,...,32718.000000,40404.000000,10685.000000,9695.000000,3976.000000,3455.000000,209.000000,334.000000,"{'PP': 302, 'PSOE': 198, 'Cs': 64, 'UP': 80, '...","[('PP', 302), ('VOX', 201), ('PSOE', 198), ('U..."
13639,022019111080990701001,08,09,09907,0990701001,Castilla - La Mancha,Burgos,Alfoz de Quintanadueñas,1526,1140,...,32831.000000,32477.000000,11561.000000,10642.000000,1569.000000,1559.000000,220.000000,338.000000,"{'PP': 218, 'PSOE': 308, 'Cs': 99, 'UP': 174, ...","[('PSOE', 308), ('VOX', 280), ('PP', 218), ('U..."


We now make a somewhat arbitrary decision: we keep only the sections with more than 500 registered voters, because we would rather not depend on very small sections where purely local factors can swing the result. That leaves 250 sections, a very large reduction, because Burgos is the Spanish province with the most municipalities and most of them are very small.

In [42]:
secciones_select = secciones_select.loc[secciones_select['Censo_Esc'] > 500]

In [43]:
secciones_select

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
13066,022019111080901801001,08,09,09018,0901801001,Castilla - La Mancha,Burgos,Aranda de Duero,707,557,...,32747.0,31477.0,13355.0,11823.0,1712.0,1381.0,116.0,202.0,"{'PP': 147, 'PSOE': 161, 'Cs': 75, 'UP': 56, '...","[('PSOE', 161), ('PP', 147), ('VOX', 91), ('Cs..."
13067,022019111080901801002,08,09,09018,0901801002,Castilla - La Mancha,Burgos,Aranda de Duero,1374,973,...,30765.0,29813.0,9874.0,8030.0,3533.0,4003.0,192.0,223.0,"{'PP': 292, 'PSOE': 306, 'Cs': 94, 'UP': 111, ...","[('PSOE', 306), ('PP', 292), ('VOX', 129), ('U..."
13068,022019111080901801003,08,09,09018,0901801003,Castilla - La Mancha,Burgos,Aranda de Duero,1111,769,...,27728.0,26579.0,6894.0,5913.0,5250.0,5034.0,188.0,280.0,"{'PP': 223, 'PSOE': 270, 'Cs': 72, 'UP': 74, '...","[('PSOE', 270), ('PP', 223), ('VOX', 113), ('U..."
13069,022019111080901801004,08,09,09018,0901801004,Castilla - La Mancha,Burgos,Aranda de Duero,1036,749,...,29902.0,28970.0,7946.0,7899.0,4819.0,3870.0,253.0,295.0,"{'PP': 175, 'PSOE': 319, 'Cs': 52, 'UP': 89, '...","[('PSOE', 319), ('PP', 175), ('VOX', 92), ('UP..."
13070,022019111080901801005,08,09,09018,0901801005,Castilla - La Mancha,Burgos,Aranda de Duero,1762,1212,...,30110.0,27798.0,10059.0,8877.0,2675.0,2137.0,230.0,266.0,"{'PP': 288, 'PSOE': 452, 'Cs': 111, 'UP': 149,...","[('PSOE', 452), ('PP', 288), ('VOX', 170), ('U..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13633,022019111080990301001,08,09,09903,0990301001,Castilla - La Mancha,Burgos,Villarcayo de Merindad de Castilla la Vieja,1001,724,...,23534.0,23348.0,6792.0,6180.0,3014.0,3205.0,268.0,266.0,"{'PP': 235, 'PSOE': 198, 'Cs': 50, 'UP': 75, '...","[('PP', 235), ('PSOE', 198), ('VOX', 142), ('U..."
13634,022019111080990301002,08,09,09903,0990301002,Castilla - La Mancha,Burgos,Villarcayo de Merindad de Castilla la Vieja,1269,805,...,23047.0,21295.0,7433.0,6180.0,3130.0,2755.0,255.0,286.0,"{'PP': 230, 'PSOE': 257, 'Cs': 61, 'UP': 82, '...","[('PSOE', 257), ('PP', 230), ('VOX', 142), ('U..."
13635,022019111080990301003,08,09,09903,0990301003,Castilla - La Mancha,Burgos,Villarcayo de Merindad de Castilla la Vieja,842,558,...,23326.0,21884.0,6638.0,5524.0,3546.0,3334.0,195.0,212.0,"{'PP': 177, 'PSOE': 171, 'Cs': 39, 'UP': 54, '...","[('PP', 177), ('PSOE', 171), ('VOX', 96), ('UP..."
13638,022019111080990601001,08,09,09906,0990601001,Castilla - La Mancha,Burgos,Merindad de Río Ubierna,1185,883,...,32718.0,40404.0,10685.0,9695.0,3976.0,3455.0,209.0,334.0,"{'PP': 302, 'PSOE': 198, 'Cs': 64, 'UP': 80, '...","[('PP', 302), ('VOX', 201), ('PSOE', 198), ('U..."
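
Since the 500-voter cutoff is admittedly arbitrary, it may help to look at how many sections and what share of the census survive at a few cutoffs before settling on one. A minimal sketch on a handful of census sizes like those in the Burgos tables above; `resumen_corte` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def resumen_corte(df, cortes, col='Censo_Esc'):
    """For each cutoff, count the sections strictly above it and the share of census they hold."""
    total = df[col].sum()
    filas = []
    for corte in cortes:
        grandes = df.loc[df[col] > corte]
        filas.append({'corte': corte,
                      'secciones': len(grandes),
                      'cuota_censo': grandes[col].sum() / total})
    return pd.DataFrame(filas).set_index('corte')

# Small illustrative sample: several tiny sections and a few large ones
toy = pd.DataFrame({'Censo_Esc': [32, 57, 54, 158, 182, 464, 707, 1185, 1374, 1526]})

resumen = resumen_corte(toy, cortes=[0, 250, 500])
```

In this sample the cutoff of 500 discards most of the sections but keeps most of the census, which is the trade-off the paragraph above describes.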


Again we keep only the columns of the Burgos dataset that hold electoral results.

In [44]:
col_validas_select = ['Sección', 'Censo_Esc', 'Votos_Total', 'Nulos', 'Votos_Válidos', 'Blanco', 'V_Cand', 'PP', 'PSOE', 'Cs', 'UP',
       'IU', 'VOX', 'UPyD', 'MP', 'CiU', 'ERC', 'JxC', 'CUP', 'DiL', 'PNV',
       'Bildu', 'Amaiur', 'CC', 'FA', 'TE', 'BNG', 'PRC', 'GBai', 'Compromis',
       'PACMA', 'Otros']

In [45]:
secciones_select = secciones_select[col_validas_select]

In [46]:
secciones_select

Unnamed: 0,Sección,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
13066,022019111080901801001,707,557,14,543,7,536,147,161,75,...,0,0,0,0,0,0,0,0,3,3
13067,022019111080901801002,1374,973,21,952,11,941,292,306,94,...,0,0,0,0,0,0,0,0,3,6
13068,022019111080901801003,1111,769,4,765,2,763,223,270,72,...,0,0,0,0,0,0,0,0,7,4
13069,022019111080901801004,1036,749,5,744,3,741,175,319,52,...,0,0,0,0,0,0,0,0,3,11
13070,022019111080901801005,1762,1212,17,1195,9,1186,288,452,111,...,0,0,0,0,0,0,0,0,2,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13633,022019111080990301001,1001,724,4,720,7,713,235,198,50,...,0,0,0,0,0,0,0,0,9,4
13634,022019111080990301002,1269,805,16,789,10,779,230,257,61,...,0,0,0,0,0,0,0,0,5,2
13635,022019111080990301003,842,558,9,549,7,542,177,171,39,...,0,0,0,0,0,0,0,0,3,2
13638,022019111080990601001,1185,883,6,877,16,861,302,198,64,...,0,0,0,0,0,0,0,0,7,9


In [47]:
secciones_select_norm = secciones_select.copy()

In [48]:
secciones_select_norm

Unnamed: 0,Sección,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
13066,022019111080901801001,707,557,14,543,7,536,147,161,75,...,0,0,0,0,0,0,0,0,3,3
13067,022019111080901801002,1374,973,21,952,11,941,292,306,94,...,0,0,0,0,0,0,0,0,3,6
13068,022019111080901801003,1111,769,4,765,2,763,223,270,72,...,0,0,0,0,0,0,0,0,7,4
13069,022019111080901801004,1036,749,5,744,3,741,175,319,52,...,0,0,0,0,0,0,0,0,3,11
13070,022019111080901801005,1762,1212,17,1195,9,1186,288,452,111,...,0,0,0,0,0,0,0,0,2,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13633,022019111080990301001,1001,724,4,720,7,713,235,198,50,...,0,0,0,0,0,0,0,0,9,4
13634,022019111080990301002,1269,805,16,789,10,779,230,257,61,...,0,0,0,0,0,0,0,0,5,2
13635,022019111080990301003,842,558,9,549,7,542,177,171,39,...,0,0,0,0,0,0,0,0,3,2
13638,022019111080990601001,1185,883,6,877,16,861,302,198,64,...,0,0,0,0,0,0,0,0,7,9


Now a little data preparation. We normalise each section's results by dividing by its census and then transpose the dataset, so that the sections become the columns and the normalised results the rows, as we did for the province of Zaragoza.

In [49]:
set_cols = ['Sección', 'Censo_Esc']

In [50]:
for col in secciones_select_norm.columns:

  if col not in set_cols:
    
    secciones_select_norm[col] = secciones_select_norm[col] / secciones_select_norm['Censo_Esc']

secciones_select_norm = secciones_select_norm.set_index('Sección')
secciones_select_norm = secciones_select_norm.drop('Censo_Esc', axis = 1)

secciones_select_norm = secciones_select_norm.T

In [51]:
secciones_select_norm

Sección,022019111080901801001,022019111080901801002,022019111080901801003,022019111080901801004,022019111080901801005,022019111080901801006,022019111080901801007,022019111080901801008,022019111080901801009,022019111080901802001,...,022019111080942701001,022019111080942701002,022019111080943401001,022019111080943801001,022019111080943901001,022019111080990301001,022019111080990301002,022019111080990301003,022019111080990601001,022019111080990701001
Votos_Total,0.787836,0.708151,0.692169,0.722973,0.687855,0.627887,0.740458,0.7296,0.76036,0.725049,...,0.650943,0.597424,0.755278,0.795374,0.754948,0.723277,0.634358,0.662708,0.745148,0.747051
Nulos,0.019802,0.015284,0.0036,0.004826,0.009648,0.013687,0.008724,0.0048,0.012613,0.009785,...,0.014825,0.00161,0.007819,0.017794,0.010368,0.003996,0.012608,0.010689,0.005063,0.013761
Votos_Válidos,0.768034,0.692868,0.688569,0.718147,0.678207,0.6142,0.731734,0.7248,0.747748,0.715264,...,0.636119,0.595813,0.747459,0.77758,0.744581,0.719281,0.621749,0.652019,0.740084,0.73329
Blanco,0.009901,0.008006,0.0018,0.002896,0.005108,0.010265,0.011996,0.012,0.004505,0.01272,...,0.006739,0.003221,0.0086,0.010676,0.001885,0.006993,0.00788,0.008314,0.013502,0.011796
V_Cand,0.758133,0.684862,0.686769,0.715251,0.673099,0.603935,0.719738,0.7128,0.743243,0.702544,...,0.62938,0.592593,0.738858,0.766904,0.742696,0.712288,0.613869,0.643705,0.726582,0.721494
PP,0.207921,0.212518,0.20072,0.168919,0.163451,0.109495,0.23337,0.1904,0.191892,0.197652,...,0.227763,0.333333,0.169664,0.274021,0.194156,0.234765,0.181245,0.210214,0.254852,0.142857
PSOE,0.227723,0.222707,0.243024,0.307915,0.256527,0.254919,0.221374,0.2456,0.240541,0.237769,...,0.179245,0.072464,0.225176,0.254448,0.214892,0.197802,0.202522,0.203088,0.167089,0.201835
Cs,0.106082,0.068413,0.064806,0.050193,0.062997,0.04876,0.080698,0.0784,0.087387,0.071429,...,0.053908,0.024155,0.10086,0.053381,0.073516,0.04995,0.048069,0.046318,0.054008,0.064875
UP,0.079208,0.080786,0.066607,0.085907,0.084563,0.071856,0.076336,0.0928,0.115315,0.076321,...,0.078167,0.030596,0.101642,0.071174,0.082941,0.074925,0.064618,0.064133,0.067511,0.114024
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
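
The column-by-column loop can also be written as a single vectorised division with `DataFrame.div(..., axis=0)`. A sketch on three rows with figures taken from the Burgos table above that yields the same transposed layout; the short section labels 'A', 'B', 'C' and the variable names are made up for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sección': ['A', 'B', 'C'],
    'Censo_Esc': [707, 1374, 1111],
    'Votos_Total': [557, 973, 769],
    'PP': [147, 292, 223],
    'PSOE': [161, 306, 270],
})

resultados = toy.set_index('Sección')

# Divide every result column by the census of its own row (axis=0 aligns on the row index)
norm = resultados.drop(columns='Censo_Esc').div(resultados['Censo_Esc'], axis=0)

# Transpose: sections become columns, normalised results become rows
norm_t = norm.T
```

The loop and the vectorised form give the same numbers; the second avoids assigning into the frame column by column.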


What we do not know yet is which sections we will finally use.

We will select the sections that are least correlated with one another. However, some rows are entirely zeros, so computing the correlation matrix directly from the previous dataset, 'secciones_select_norm', might raise an error.

Although it is somewhat redundant, we start again from the dataset before normalisation, 'secciones_select', and apply the function 'preparacion_sec' defined below. In essence it:

- Drops the columns (votes for parties) that are all zeros, that is, the parties that did not stand in Burgos in this case.

- Normalises by the census.

- Shuffles the order of the sections at random, which matters so that no section is systematically favoured over another when we select them.

- Transposes the result, as we saw above.

In [52]:
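def preparacion_sec(eleccion):

  # Caution: until the first all-zero column is dropped, the assignments below write into
  # the frame passed in; pass a copy if the original must stay untouched.
  set_cols = ['Sección', 'Censo_Esc']
  
  for col in eleccion.columns:

    # Drop result columns that are all zeros (parties that did not stand here)
    if eleccion[col].sum() == 0:

      eleccion = eleccion.drop([col], axis = 1)

    # Normalise every remaining result column by the census
    elif col not in set_cols:

      eleccion[col] = eleccion[col] / eleccion['Censo_Esc']

  eleccion = eleccion.set_index('Sección')
  eleccion = eleccion.drop('Censo_Esc', axis = 1)

  # Transpose: sections become columns, results become rows
  df_elec_transpose = eleccion.T

  # Shuffle the sections so that none is systematically favoured in the selection
  lista_sec = list(df_elec_transpose.columns)
  random.shuffle(lista_sec)

  df_elec_transpose = df_elec_transpose[lista_sec]

  return df_elec_transpose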
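
The `random.shuffle` call above draws from the module-level generator, so the column order changes on every run. One way to make the order repeatable is a dedicated `random.Random(seed)` instance. A sketch under that assumption; `barajar_columnas`, the seed value and the toy frame are hypothetical:

```python
import random
import pandas as pd

def barajar_columnas(df, semilla=None):
    """Return df with its columns in random order; the same seed gives the same order."""
    rng = random.Random(semilla)   # private generator, leaves the global random state alone
    columnas = list(df.columns)
    rng.shuffle(columnas)
    return df[columnas]

# One row of made-up normalised results for four hypothetical sections
toy = pd.DataFrame([[0.7, 0.6, 0.8, 0.5]], index=['Votos_Total'],
                   columns=['s1', 's2', 's3', 's4'])

a = barajar_columnas(toy, semilla=123)
b = barajar_columnas(toy, semilla=123)
```

Calling the helper twice with the same seed returns the columns in the same order, and the values travel with their labels.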


With its output, of which we will see an example below, we can select the sections. After computing the correlation matrix of all the sections, we pass it to the next function, 'secciones_corr', which goes through the correlations of each section with the others one by one, starting with the first, which as we saw was chosen at random.

For each section we check whether its maximum correlation with the sections already reviewed is above or below a limit, the threshold:

- If it is above, the section is too correlated with one we have already reviewed, so we drop it.

- If it is below, we keep it.

After going through all the sections we are left with those that are weakly correlated with one another. The threshold has to be chosen so that we keep a few sections but not too many, typically fewer than 10.

Which sections are chosen depends on the order in which they are examined, which the previous function randomises, so each run will almost certainly give different sections unless we fix a seed.

In [53]:
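def secciones_corr(dummy, threshold = 0.995):

  # dummy is the section-by-section correlation matrix; keep an untouched
  # reference to it (the original body read the global m instead)
  m = dummy

  # ind - 1 is the position of the section under review; position 0 is always kept.
  # Note: range(2, n) stops at position n - 2, so the last section is never reviewed;
  # range(2, m.shape[0] + 1) would include it.
  for ind in range(2, m.shape[0]):
    s = m.iloc[0:ind, 0:ind]

    # Correlations of this section with every section reviewed before it
    if (s.iloc[ind-1, 0:ind-1] > threshold).any():
      dummy = dummy.drop(m.columns[ind-1], axis = 0)
      dummy = dummy.drop(m.columns[ind-1], axis = 1)

  return dummy.columns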
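
To see the greedy rule at work, here is a self-contained sketch on a made-up table of four sections. It is a variant rather than a copy of the function above: it reviews every section, including the last one, and compares each candidate only with the sections kept so far. The name `seleccionar_poco_correlacionadas` and all the numbers are hypothetical:

```python
import numpy as np
import pandas as pd

def seleccionar_poco_correlacionadas(corr, threshold=0.995):
    """Greedy pass: keep a section only if its correlation with every kept one is <= threshold."""
    elegidas = []
    for seccion in corr.columns:
        if all(corr.loc[seccion, otra] <= threshold for otra in elegidas):
            elegidas.append(seccion)
    return elegidas

# Made-up normalised results: rows are results, columns are sections
toy = pd.DataFrame({
    's1': [0.70, 0.20, 0.25, 0.06],
    's2': [0.70, 0.20, 0.25, 0.06],   # identical to s1, so their correlation is 1
    's3': [0.60, 0.35, 0.10, 0.12],
    's4': [0.65, 0.10, 0.30, 0.02],
}, index=['Votos_Total', 'PP', 'PSOE', 'Cs'])

corr = toy.corr()
elegidas = seleccionar_poco_correlacionadas(corr, threshold=0.995)

# Largest correlation between two different selected sections
sub = corr.loc[elegidas, elegidas].to_numpy(copy=True)
np.fill_diagonal(sub, -np.inf)
max_corr = sub.max()
```

Here 's2' is dropped because it duplicates 's1', and the largest correlation left among the selected sections stays below the threshold. A quick check of this kind can also be run on the sections returned by 'secciones_corr'.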


The output of the first function is a normalised, transposed dataset whose rows are never entirely zeros.

In [54]:
# Pass a copy: 'preparacion_sec' assigns columns on the frame it receives
secc = preparacion_sec(secciones_select.copy())


In [55]:
secc

Sección,022019111080921904003,022019111080905906005,022019111080905905026,022019111080901802003,022019111080936301001,022019111080901803003,022019111080927401001,022019111080942701002,022019111080901803005,022019111080990301002,...,022019111080920901001,022019111080921903001,022019111080901801005,022019111080905909034,022019111080905905027,022019111080905909013,022019111080905902001,022019111080905907007,022019111080905905002,022019111080901801003
Votos_Total,0.650563,0.669734,0.837093,0.745698,0.728477,0.593373,0.747826,0.597424,0.723733,0.634358,...,0.752577,0.673594,0.687855,0.671593,0.790055,0.728261,0.682274,0.641295,0.742925,0.692169
Nulos,0.00728,0.013292,0.008772,0.006692,0.005519,0.00753,0.011594,0.00161,0.014933,0.012608,...,0.009021,0.012225,0.009648,0.014778,0.012431,0.005435,0.010033,0.010499,0.012972,0.0036
Votos_Válidos,0.643283,0.656442,0.828321,0.739006,0.722958,0.585843,0.736232,0.595813,0.7088,0.621749,...,0.743557,0.661369,0.678207,0.656814,0.777624,0.722826,0.672241,0.630796,0.729953,0.688569
Blanco,0.00728,0.00409,0.010025,0.00956,0.005519,0.00753,0.007729,0.003221,0.006933,0.00788,...,0.007732,0.00978,0.005108,0.013136,0.009669,0.008152,0.011706,0.007874,0.011792,0.0018
V_Cand,0.636003,0.652352,0.818296,0.729446,0.717439,0.578313,0.728502,0.592593,0.701867,0.613869,...,0.735825,0.651589,0.673099,0.643678,0.767956,0.714674,0.660535,0.622922,0.71816,0.686769
PP,0.12045,0.249489,0.451128,0.314532,0.352097,0.156627,0.229952,0.333333,0.183467,0.181245,...,0.235825,0.135697,0.163451,0.236453,0.379834,0.183424,0.264214,0.179353,0.261792,0.20072
PSOE,0.229649,0.203476,0.154135,0.150096,0.103753,0.224398,0.281159,0.072464,0.2416,0.202522,...,0.212629,0.311736,0.256527,0.210181,0.131215,0.279891,0.170569,0.180227,0.211085,0.243024
Cs,0.056254,0.038855,0.057644,0.076482,0.065121,0.052711,0.052174,0.024155,0.0736,0.048069,...,0.061856,0.031785,0.062997,0.044335,0.070442,0.057065,0.053512,0.062992,0.060142,0.064806
UP,0.131039,0.068507,0.028822,0.049713,0.037528,0.064759,0.069565,0.030596,0.082133,0.064618,...,0.085052,0.103912,0.084563,0.060755,0.040055,0.081522,0.046823,0.070866,0.051887,0.066607
VOX,0.088683,0.08589,0.122807,0.128107,0.151214,0.076807,0.087923,0.127214,0.108267,0.111899,...,0.130155,0.066015,0.096481,0.083744,0.13674,0.096467,0.115385,0.11811,0.123821,0.10171


Now we compute the correlation matrix and pass it to the second function together with the threshold. We obtain nine sections, which we know are not too highly correlated with one another.

In [56]:
m = secc.corr()
lista_sec = secciones_corr(m, 0.9975)

In [57]:
lista_sec

Index(['022019111080921904003', '022019111080905906005',
       '022019111080905905026', '022019111080901802003',
       '022019111080901803003', '022019111080921902004',
       '022019111080908601001', '022019111080910901001',
       '022019111080901801003'],
      dtype='object', name='Sección')

Knowing which sections were chosen, we can now select them from the normalised dataset of Burgos sections, the one that still includes the all-zero rows.

In [58]:
secciones_select_norm = secciones_select_norm[lista_sec]

In [59]:
secciones_select_norm

Sección,022019111080921904003,022019111080905906005,022019111080905905026,022019111080901802003,022019111080901803003,022019111080921902004,022019111080908601001,022019111080910901001,022019111080901801003
Votos_Total,0.650563,0.669734,0.837093,0.745698,0.593373,0.557971,0.734663,0.536381,0.692169
Nulos,0.00728,0.013292,0.008772,0.006692,0.00753,0.007246,0.01227,0.015858,0.0036
Votos_Válidos,0.643283,0.656442,0.828321,0.739006,0.585843,0.550725,0.722393,0.520522,0.688569
Blanco,0.00728,0.00409,0.010025,0.00956,0.00753,0.005435,0.004601,0.0,0.0018
V_Cand,0.636003,0.652352,0.818296,0.729446,0.578313,0.54529,0.717791,0.520522,0.686769
PP,0.12045,0.249489,0.451128,0.314532,0.156627,0.103261,0.226994,0.088619,0.20072
PSOE,0.229649,0.203476,0.154135,0.150096,0.224398,0.293478,0.162577,0.149254,0.243024
Cs,0.056254,0.038855,0.057644,0.076482,0.052711,0.021739,0.039877,0.028918,0.064806
UP,0.131039,0.068507,0.028822,0.049713,0.064759,0.070652,0.069018,0.175373,0.066607
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can see that it has the same 30 rows as the normalised data for the province of Zaragoza. We can add that column to this df so that all the data to be passed to the regression model sits in a single df.

In [60]:
secciones_select_norm.shape

(30, 9)

In [61]:
# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
secciones_select_norm = secciones_select_norm.copy()
secciones_select_norm['Modelización'] = modelizacion['Modelización']


In [62]:
secciones_select_norm

Sección,022019111080921904003,022019111080905906005,022019111080905905026,022019111080901802003,022019111080901803003,022019111080921902004,022019111080908601001,022019111080910901001,022019111080901801003,Modelización
Votos_Total,0.650563,0.669734,0.837093,0.745698,0.593373,0.557971,0.734663,0.536381,0.692169,0.719466
Nulos,0.00728,0.013292,0.008772,0.006692,0.00753,0.007246,0.01227,0.015858,0.0036,0.006076
Votos_Válidos,0.643283,0.656442,0.828321,0.739006,0.585843,0.550725,0.722393,0.520522,0.688569,0.713389
Blanco,0.00728,0.00409,0.010025,0.00956,0.00753,0.005435,0.004601,0.0,0.0018,0.006958
V_Cand,0.636003,0.652352,0.818296,0.729446,0.578313,0.54529,0.717791,0.520522,0.686769,0.706431
PP,0.12045,0.249489,0.451128,0.314532,0.156627,0.103261,0.226994,0.088619,0.20072,0.166932
PSOE,0.229649,0.203476,0.154135,0.150096,0.224398,0.293478,0.162577,0.149254,0.243024,0.220048
Cs,0.056254,0.038855,0.057644,0.076482,0.052711,0.021739,0.039877,0.028918,0.064806,0.065202
UP,0.131039,0.068507,0.028822,0.049713,0.064759,0.070652,0.069018,0.175373,0.066607,0.080233
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
secciones_select_norm.index

Index(['Votos_Total', 'Nulos', 'Votos_Válidos', 'Blanco', 'V_Cand', 'PP',
       'PSOE', 'Cs', 'UP', 'IU', 'VOX', 'UPyD', 'MP', 'CiU', 'ERC', 'JxC',
       'CUP', 'DiL', 'PNV', 'Bildu', 'Amaiur', 'CC', 'FA', 'TE', 'BNG', 'PRC',
       'GBai', 'Compromis', 'PACMA', 'Otros'],
      dtype='object')

We can now model by linear regression. We load the required libraries and define the matrix X and the vector y.

In [64]:
import numpy as np
from sklearn.linear_model import LinearRegression

In [65]:
X = secciones_select_norm.drop('Modelización', axis = 1).values

In [66]:
y = secciones_select_norm['Modelización'].values

In [67]:
X

array([[0.65056254, 0.66973415, 0.83709273, 0.7456979 , 0.59337349,
        0.55797101, 0.73466258, 0.5363806 , 0.69216922],
       [0.00727995, 0.01329243, 0.00877193, 0.00669216, 0.00753012,
        0.00724638, 0.01226994, 0.01585821, 0.00360036],
       [0.64328259, 0.65644172, 0.8283208 , 0.73900574, 0.58584337,
        0.55072464, 0.72239264, 0.52052239, 0.68856886],
       [0.00727995, 0.00408998, 0.01002506, 0.00956023, 0.00753012,
        0.00543478, 0.00460123, 0.        , 0.00180018],
       [0.63600265, 0.65235174, 0.81829574, 0.72944551, 0.57831325,
        0.54528986, 0.71779141, 0.52052239, 0.68676868],
       [0.12045003, 0.24948875, 0.45112782, 0.31453155, 0.15662651,
        0.10326087, 0.22699387, 0.0886194 , 0.20072007],
       [0.22964924, 0.20347648, 0.15413534, 0.1500956 , 0.22439759,
        0.29347826, 0.16257669, 0.14925373, 0.2430243 ],
       [0.05625414, 0.03885481, 0.05764411, 0.07648184, 0.05271084,
        0.02173913, 0.0398773 , 0.02891791, 0.06480648],


In [68]:
y

array([0.71946552, 0.00607642, 0.7133891 , 0.00695846, 0.70643064,
       0.16693179, 0.22004842, 0.06520238, 0.08023338, 0.        ,
       0.12857079, 0.        , 0.03213501, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00499869, 0.00831018])

We fit the model with X and y. The intercept is fixed at zero so that no votes appear for parties that ran in neither Burgos nor Zaragoza. This is mostly a cosmetic choice.

In [69]:
reg = LinearRegression(fit_intercept = False).fit(X, y)

In [70]:
reg.intercept_*censo_mod

0.0
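A minimal sketch of why the zero intercept matters, using made-up data rather than the notebook's: a row of X that is all zeros (a party with no votes in any selected section) is predicted as exactly zero without an intercept, whereas with an intercept its prediction equals the intercept, which is generally non-zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_toy = rng.random((6, 3))
X_toy[-1] = 0.0  # a "party" with no votes in any section
y_toy = X_toy @ np.array([0.5, 0.3, 0.2]) + 0.01 * rng.random(6)
y_toy[-1] = 0.0  # ...and no votes in the target territory either

sin_intercept = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)
con_intercept = LinearRegression(fit_intercept=True).fit(X_toy, y_toy)

# Prediction for the all-zero row under each model
pred_sin = sin_intercept.predict(X_toy)[-1]
pred_con = con_intercept.predict(X_toy)[-1]
print(pred_sin, pred_con)
```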

The fit looks excellent: an R² of 99.9%, measured on the same data used for fitting.

In [71]:
reg.score(X, y)

0.9991397839523883
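For reference, the value returned by `score` is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below reproduces it by hand on toy data (the arrays are illustrative, not the notebook's); note that the total sum of squares is taken around the mean of y even when the model has no intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X_toy = rng.random((30, 4))
y_toy = X_toy @ np.array([0.4, 0.3, 0.2, 0.1]) + 0.01 * rng.standard_normal(30)

reg_toy = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)
pred = reg_toy.predict(X_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg_toy.score(X_toy, y_toy))
```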

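These are the coefficients. Since after normalizing we are modelling one-dimensional magnitudes of the same order, one might expect them to sum to roughly 1, but here the sum is about 2.13, and several coefficients are large with opposite signs (for example -6.42 and -8.29 against 6.74 and 4.22). That pattern is typical when the predictor columns are still highly correlated, which is plausible given how high the 0.9975 threshold is.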

In [72]:
reg.coef_

array([-6.41877865, -8.28537284,  2.070113  ,  0.29409541,  6.74411194,
        1.73979213,  0.77334033,  4.2169627 ,  0.99541833])

In [73]:
reg.coef_.sum()

2.129682345912274
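One way to check whether large, offsetting coefficients come from nearly collinear columns is to look at the condition number of X. The sketch below uses synthetic columns that are near-copies of each other (an assumption made for illustration, not the notebook's data); a large condition number means the individual coefficients are poorly determined even when the overall fit is good.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
base = rng.random(30)

# Three predictor columns that differ from each other only by small noise
X_col = np.column_stack([base + 1e-3 * rng.standard_normal(30) for _ in range(3)])
y_col = base + 1e-3 * rng.standard_normal(30)

coef = LinearRegression(fit_intercept=False).fit(X_col, y_col).coef_
cond = np.linalg.cond(X_col)  # ratio of largest to smallest singular value

print(coef, coef.sum(), cond)
```

In the notebook, the equivalent check would be `np.linalg.cond(X)` on the matrix defined above.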

Now we can look at the results predicted by the model. We undo the normalization by multiplying again by the Zaragoza electoral census, and store the result in a df.

In [74]:
est = reg.predict(X)*censo_mod

In [75]:
df = pd.DataFrame(est, index = secciones_select_norm.index, columns = ['Estimación']).astype('int32')

In [76]:
df

Unnamed: 0,Estimación
Votos_Total,514320
Nulos,4723
Votos_Válidos,509596
Blanco,6108
V_Cand,503487
PP,122435
PSOE,160434
Cs,49659
UP,60412
IU,0
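A small caveat about the conversion to integers: `astype('int32')` truncates toward zero instead of rounding, so an estimate can come out one vote lower than the rounded value. The sketch below shows the difference on toy shares and a toy census size (both invented for illustration).

```python
import pandas as pd

fracciones = pd.Series({"A": 0.5, "B": 0.29999, "C": 0.0})  # toy normalized shares
censo_toy = 10_000                                          # toy census size

votos = fracciones * censo_toy
truncado = votos.astype("int32")            # drops the fractional part
redondeado = votos.round().astype("int32")  # rounds to the nearest integer first

print(pd.DataFrame({"truncado": truncado, "redondeado": redondeado}))
```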


Next we take the real data that we wanted to model and store it in another df.

In [77]:
df1 = pd.DataFrame(secciones_mod.sum(), columns = ['Real']).drop('Censo_Esc')

In [78]:
df1

Unnamed: 0,Real
Votos_Total,514697
Nulos,4347
Votos_Válidos,510350
Blanco,4978
V_Cand,505372
PP,119421
PSOE,157420
Cs,46645
UP,57398
IU,0


We now compare both dfs. Given such a high fit, it was to be expected that they would be quite similar. There is one exception, however: the votes for Más País (MP). The reason is simple: that party did not run in Burgos, so the model could not possibly reproduce it in Zaragoza. Even so, the fit looks impressive considering that we used only nine electoral sections from another province.

In [79]:
df['Real'] = df1['Real']

In [80]:
df

Unnamed: 0,Estimación,Real
Votos_Total,514320,514697
Nulos,4723,4347
Votos_Válidos,509596,510350
Blanco,6108,4978
V_Cand,503487,505372
PP,122435,119421
PSOE,160434,157420
Cs,49659,46645
UP,60412,57398
IU,0,0
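To make the comparison easier to read, the absolute and relative errors can be added as columns. The sketch below does this for a few of the rows shown above (values copied from the table; the column names `Error` and `Error_%` are illustrative choices).

```python
import pandas as pd

comparacion = pd.DataFrame(
    {
        "Estimación": [514320, 122435, 160434, 49659, 60412],
        "Real": [514697, 119421, 157420, 46645, 57398],
    },
    index=["Votos_Total", "PP", "PSOE", "Cs", "UP"],
)

# Signed error in votes and as a percentage of the real figure
comparacion["Error"] = comparacion["Estimación"] - comparacion["Real"]
comparacion["Error_%"] = (100 * comparacion["Error"] / comparacion["Real"]).round(2)

print(comparacion)
```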


## Modelling the 2016 election

A natural question is how valid the selection of electoral sections made with 2019 data remains if we use their equivalents in the 2016 election. That is what this chapter addresses. As a reminder, these are the selected sections:

In [81]:
lista_sec

Index(['022019111080921904003', '022019111080905906005',
       '022019111080905905026', '022019111080901802003',
       '022019111080901803003', '022019111080921902004',
       '022019111080908601001', '022019111080910901001',
       '022019111080901801003'],
      dtype='object', name='Sección')

Those are the 2019 sections, so we need to find their equivalent, or most similar, sections in 2016. To do so we load the section-similarity df, which covers all sections of the last 5 elections.

In [83]:
sim_secciones = pd.read_csv('similitud_secciones_def_REF.csv', dtype = 'str')

FileNotFoundError: [Errno 2] File similitud_secciones_def_REF.csv does not exist: 'similitud_secciones_def_REF.csv'
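The error above means the CSV is not in the working directory. A minimal sketch of a guard that fails early and reports the resolved path is shown below; `cargar_csv` is a hypothetical wrapper, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

def cargar_csv(ruta: str, **kwargs) -> pd.DataFrame:
    """Hypothetical wrapper: check that the file exists before reading it,
    and include the resolved path and working directory in the error."""
    path = Path(ruta)
    if not path.is_file():
        raise FileNotFoundError(f"CSV not found: {path.resolve()} (cwd: {Path.cwd()})")
    return pd.read_csv(path, **kwargs)

# Intended usage with the file from the cell above:
# sim_secciones = cargar_csv('similitud_secciones_def_REF.csv', dtype='str')
```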

In [None]:
sim_secciones

Now we select the rows for the Burgos sections found in the previous chapter...

In [None]:
sec_select_J16 = sim_secciones.loc[sim_secciones['cod_sec_ref'].isin(lista_sec)]

In [None]:
sec_select_J16

... and pick their equivalents in the 2016 election, one for each of the nine sections:

In [None]:
list_sec_J16 = list(sec_select_J16['cercana J16_ref'])

In [None]:
list_sec_J16
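One detail worth keeping in mind: filtering with `isin` returns the rows in the order of the similarity table, not in the order of `lista_sec`. The sketch below illustrates this with a toy table (the section codes are made up; only the column names `cod_sec_ref` and `cercana J16_ref` come from the cells above) and shows an alternative that preserves the selection order.

```python
import pandas as pd

# Toy similarity table with invented codes
sim_toy = pd.DataFrame({
    "cod_sec_ref": ["S19_A", "S19_B", "S19_C", "S19_D"],
    "cercana J16_ref": ["S16_A", "S16_B", "S16_C", "S16_D"],
})
lista_toy = pd.Index(["S19_C", "S19_A"])  # selection order: C first, then A

# isin() keeps the table's row order (A, C)...
por_isin = list(sim_toy.loc[sim_toy["cod_sec_ref"].isin(lista_toy), "cercana J16_ref"])

# ...whereas indexing by the reference code keeps the selection order (C, A).
por_orden = list(sim_toy.set_index("cod_sec_ref").loc[lista_toy, "cercana J16_ref"])

print(por_isin, por_orden)
```

The order does not change the quality of the least-squares fit, but it does matter if the coefficients are later matched to specific sections.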

We now load the results of the June 2016 election.

In [None]:
df_eleccion_comp_J16 = pd.read_csv('/content/drive/MyDrive/Proyecto_KeepCoding - Propio/Data/Gen-16-Jun/gen_J16_unif_cols_prov.txt', dtype = strings)

We select the sections to be modelled, which are naturally those of the province of Zaragoza.

In [None]:
secciones_mod = df_eleccion_comp_J16

# Apply each territorial filter only when its list is not empty
if len(ccaa_mod) > 0:
    secciones_mod = secciones_mod.loc[secciones_mod['CCAA'].isin(ccaa_mod)]

if len(provincia_mod) > 0:
    secciones_mod = secciones_mod.loc[secciones_mod['Provincia'].isin(provincia_mod)]

if len(municipio_mod) > 0:
    secciones_mod = secciones_mod.loc[secciones_mod['Municipio'].isin(municipio_mod)]


In [None]:
secciones_mod

In [None]:
censo_mod = secciones_mod['Censo_Esc'].sum()

We proceed in the same way as before: we sum the results, normalize them by the census and store them in a df.

In [None]:
censo_mod

In [None]:
secciones_mod = secciones_mod[cols_validas_mod]

In [None]:
modelizacion = pd.DataFrame(secciones_mod.sum(), columns = ['Modelización'])
modelizacion['Modelización'] = modelizacion['Modelización'] / modelizacion['Modelización']['Censo_Esc']
modelizacion = modelizacion.drop(['Censo_Esc']) 

In [None]:
modelizacion

In [None]:
modelizacion.shape
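The normalization step for the target territory can be sketched on toy data as follows (the two sections and their figures are invented; only the column names mirror the notebook). All sections are summed and every total is divided by the summed census, so each value becomes a share of the electorate.

```python
import pandas as pd

# Two toy sections of the territory to be modelled
secciones_toy = pd.DataFrame({
    "Censo_Esc": [1000, 500],
    "Votos_Total": [700, 300],
    "PP": [200, 100],
})

total = pd.DataFrame(secciones_toy.sum(), columns=["Modelización"])
# Divide every total by the total census, then drop the census row itself
total["Modelización"] = total["Modelización"] / total.loc["Censo_Esc", "Modelización"]
total = total.drop(["Censo_Esc"])

print(total)
```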

We no longer need to select the sections of the province of Burgos because we already know them: they are the nine we saw earlier. We do need to store the results they obtained in 2016.

In [None]:
secciones_select = df_eleccion_comp_J16.loc[df_eleccion_comp_J16['Sección'].isin(list_sec_J16)]

In [None]:
secciones_select = secciones_select[col_validas_select]

In [None]:
secciones_select

In [None]:
secciones_select_norm = secciones_select.copy()

Now we simply normalize and transpose.

In [None]:
# Divide every column not listed in set_cols by the section's census
for col in secciones_select_norm.columns:
    if col not in set_cols:
        secciones_select_norm[col] = secciones_select_norm[col] / secciones_select_norm['Censo_Esc']

secciones_select_norm = secciones_select_norm.set_index('Sección')
secciones_select_norm = secciones_select_norm.drop('Censo_Esc', axis = 1)

# Transpose so that each section becomes a column (one predictor per section)
secciones_select_norm = secciones_select_norm.T

In [None]:
secciones_select_norm

In [None]:
secciones_select_norm.shape
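As a compact illustration of the same normalize-and-transpose step, the sketch below divides each section's vote columns by that section's own census with `DataFrame.div` and then transposes. The data and the `cols_votos` list are made up for the example; this is an alternative formulation, not the code used in the cell above.

```python
import pandas as pd

sel_toy = pd.DataFrame({
    "Sección": ["S1", "S2"],
    "Censo_Esc": [1000, 500],
    "Votos_Total": [700, 300],
    "PP": [200, 100],
})
cols_votos = ["Votos_Total", "PP"]  # hypothetical list of columns to normalize

norm = sel_toy.set_index("Sección")
norm = norm[cols_votos].div(norm["Censo_Esc"], axis=0)  # divide each row by its own census
norm = norm.T  # rows: magnitudes, columns: sections

print(norm)
```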

We are ready to model. As before, we define the matrix X and the vector y.

In [None]:
secciones_select_norm['Modelización'] = modelizacion['Modelización']

In [None]:
X = secciones_select_norm.drop('Modelización', axis = 1).values
y = secciones_select_norm['Modelización'].values

In [None]:
X

We fit the model...

In [None]:
reg = LinearRegression(fit_intercept = False).fit(X, y)

... and obtain a score of 0.99997, even higher than the previous one.

This may seem counterintuitive, but bear in mind that this time Más País did not run in either Zaragoza or Burgos, so the largest source of error no longer exists.

In [None]:
reg.score(X, y)
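The model fitted in this chapter is an ordinary least-squares problem with no intercept, so the same coefficients can be obtained directly with `np.linalg.lstsq`. The sketch below checks that equivalence on random toy data (not the notebook's X and y).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X_toy = rng.random((30, 5))
y_toy = rng.random(30)

# Same problem solved two ways: scikit-learn without intercept, and plain least squares
coef_sklearn = LinearRegression(fit_intercept=False).fit(X_toy, y_toy).coef_
coef_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(np.allclose(coef_sklearn, coef_lstsq))
```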

If we now check the prediction against the real data, the differences are minimal, especially for the largest parties, and we achieved this using only nine sections from another province, selected with data from a different election...

In [None]:
est = reg.predict(X) * censo_mod
df = pd.DataFrame(est, index = secciones_select_norm.index, columns = ['Estimación']).astype('int32')
df1 = pd.DataFrame(secciones_mod.sum(), columns = ['Real']).drop('Censo_Esc')
df['Real'] = df1['Real']

In [None]:
df