# Modeling a territory with linear regression, without PCA

In this notebook we show an example of modeling a territory, here the province of Zaragoza, using electoral sections chosen from that same province. The model is a linear regression, and we do not use PCA.

We first choose the sections using a single election, that of November 2019. We then take the chosen sections and use their equivalents from the 2016 election to check whether they also model the province of Zaragoza in that election.

## Modeling the November 2019 election

We start by loading the required libraries and the dataset for the November 2019 election.

In [1]:
import pandas as pd
import numpy as np
import random

In [2]:
strings = {'Sección' : 'str', 'cod_ccaa' : 'str', 'cod_prov' : 'str', 'cod_mun' : 'str', 'cod_sec' : 'str'}

In [3]:
import boto3

BUCKET_NAME = 'electomedia' 

# replace with your own access credentials.
s3 = boto3.resource('s3', aws_access_key_id = 'xxxxxxxxxxxxxxx', 
                          aws_secret_access_key= 'xxxxxxxxxxxxxxxx')

In [4]:
import botocore.exceptions

KEY = 'datos-elecciones-generales-unificados/gen_N19_unif_cols_prov_copia.txt' 

try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'gen_N19_unif_cols_prov_copia.txt')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

In [5]:
df_eleccion_comp = pd.read_csv('gen_N19_unif_cols_prov_copia.txt', dtype = strings)
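
The `strings` mapping passed as `dtype` matters because codes such as `cod_ccaa` and `cod_prov` start with zeros. The following is a minimal sketch on a made-up two-row extract (the CSV text is hypothetical and not the real file) showing what would happen without it:

```python
import io
import pandas as pd

# Hypothetical extract with the same kind of code columns as the real file.
csv_text = "cod_ccaa,cod_prov,Censo_Esc\n01,04,1002\n02,50,89\n"

strings = {'cod_ccaa': 'str', 'cod_prov': 'str'}

sin_dtype = pd.read_csv(io.StringIO(csv_text))                 # codes parsed as integers
con_dtype = pd.read_csv(io.StringIO(csv_text), dtype=strings)  # codes kept as text

print(sin_dtype['cod_ccaa'].tolist())  # leading zeros are lost
print(con_dtype['cod_ccaa'].tolist())  # leading zeros are preserved
```

Columns not listed in the mapping, such as `Censo_Esc`, are still inferred as numeric.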

In [6]:
df_eleccion_comp

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
0,022019111010400101001,01,04,04001,0400101001,Andalucía,Almería,Abla,1002,717,...,20172.0,19546.0,5574.0,4833.0,3286.0,3082.0,403.0,471.0,"{'PP': 193, 'PSOE': 310, 'Cs': 47, 'UP': 30, '...","[('PSOE', 310), ('PP', 193), ('VOX', 122), ('C..."
1,022019111010400201001,01,04,04002,0400201001,Andalucía,Almería,Abrucena,1013,711,...,17841.0,17115.0,4640.0,4048.0,3418.0,2770.0,568.0,620.0,"{'PP': 111, 'PSOE': 349, 'Cs': 45, 'UP': 42, '...","[('PSOE', 349), ('VOX', 147), ('PP', 111), ('C..."
2,022019111010400301001,01,04,04003,0400301001,Andalucía,Almería,Adra,667,484,...,26498.0,24688.0,5121.0,4795.0,2499.0,2301.0,337.0,333.0,"{'PP': 176, 'PSOE': 128, 'Cs': 15, 'UP': 34, '...","[('PP', 176), ('PSOE', 128), ('VOX', 116), ('U..."
3,022019111010400301002,01,04,04003,0400301002,Andalucía,Almería,Adra,1306,909,...,25677.0,23400.0,5381.0,4837.0,1815.0,1724.0,343.0,464.0,"{'PP': 251, 'PSOE': 220, 'Cs': 51, 'UP': 58, '...","[('VOX', 312), ('PP', 251), ('PSOE', 220), ('U..."
4,022019111010400301003,01,04,04003,0400301003,Andalucía,Almería,Adra,1551,975,...,22051.0,19687.0,5224.0,4044.0,1170.0,1198.0,416.0,476.0,"{'PP': 292, 'PSOE': 202, 'Cs': 73, 'UP': 52, '...","[('VOX', 327), ('PP', 292), ('PSOE', 202), ('C..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36297,022019111195200108011,19,52,52001,5200108011,Melilla,Melilla,Melilla,1638,1021,...,66352.0,62632.0,11378.0,11119.0,1508.0,1274.0,167.0,166.0,"{'PP': 303, 'PSOE': 140, 'Cs': 30, 'UP': 28, '...","[('Otros', 348), ('PP', 303), ('VOX', 158), ('..."
36298,022019111195200108012,19,52,52001,5200108012,Melilla,Melilla,Melilla,1676,1057,...,50730.0,50839.0,13272.0,13038.0,2763.0,2445.0,169.0,177.0,"{'PP': 463, 'PSOE': 205, 'Cs': 36, 'UP': 35, '...","[('PP', 463), ('VOX', 210), ('PSOE', 205), ('O..."
36299,022019111195200108013,19,52,52001,5200108013,Melilla,Melilla,Melilla,1132,638,...,37816.0,36729.0,10102.0,9640.0,1807.0,1615.0,234.0,252.0,"{'PP': 208, 'PSOE': 113, 'Cs': 31, 'UP': 25, '...","[('PP', 208), ('VOX', 144), ('PSOE', 113), ('O..."
36300,022019111195200108014,19,52,52001,5200108014,Melilla,Melilla,Melilla,899,527,...,29898.0,31384.0,5923.0,6061.0,2463.0,2136.0,244.0,284.0,"{'PP': 200, 'PSOE': 87, 'Cs': 13, 'UP': 12, 'I...","[('PP', 200), ('VOX', 126), ('PSOE', 87), ('Ot..."


First we specify the territory we want to model, here the province of Zaragoza. We leave the CCAA (autonomous community) and municipality options empty. The filters must be consistent with each other: if we chose a municipality, it would have to belong to the province of Zaragoza in this case.

In [7]:
ccaa_mod = []

provincia_mod = ['Zaragoza']

municipio_mod = []

secciones_mod = df_eleccion_comp

In [8]:
if len(ccaa_mod) > 0:

  secciones_mod = secciones_mod.loc[secciones_mod['CCAA'].isin(ccaa_mod)]

if len(provincia_mod) > 0:

  secciones_mod = secciones_mod.loc[secciones_mod['Provincia'].isin(provincia_mod)]

if len(municipio_mod) > 0:

  secciones_mod = secciones_mod.loc[secciones_mod['Municipio'].isin(municipio_mod)]
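
Since the same three filters are applied again later when selecting candidate sections, they could be wrapped in a small helper. This is only a sketch: `filtrar_territorio` is a hypothetical name that does not appear in the notebook, and the toy DataFrame is illustrative. It also shows why the filters must be consistent with each other: an inconsistent combination simply returns no rows.

```python
import pandas as pd

def filtrar_territorio(df, ccaa=(), provincia=(), municipio=()):
    """Hypothetical helper: apply each non-empty filter in turn."""
    filtros = {'CCAA': ccaa, 'Provincia': provincia, 'Municipio': municipio}
    for columna, valores in filtros.items():
        if len(valores) > 0:
            df = df.loc[df[columna].isin(valores)]
    return df

# Toy data for illustration only.
toy = pd.DataFrame({
    'CCAA': ['Aragón', 'Aragón', 'Andalucía'],
    'Provincia': ['Zaragoza', 'Zaragoza', 'Almería'],
    'Municipio': ['Zuera', 'Alagón', 'Adra'],
})

en_zaragoza = filtrar_territorio(toy, provincia=['Zaragoza'])
inconsistente = filtrar_territorio(toy, provincia=['Zaragoza'], municipio=['Adra'])

print(len(en_zaragoza))    # 2
print(len(inconsistente))  # 0: Adra is not in the province of Zaragoza
```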



We see that there are 880 electoral sections in the province of Zaragoza.

In [9]:
secciones_mod

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
6553,022019111025000101001,02,50,50001,5000101001,Aragón,Zaragoza,Abanto,89,68,...,28322.021999,21149.000000,7855.336603,5134.000000,3217.875711,4987.000000,293.331625,139.000000,"{'PP': 42, 'PSOE': 13, 'Cs': 1, 'UP': 0, 'IU':...","[('PP', 42), ('PSOE', 13), ('VOX', 10), ('MP',..."
6554,022019111025000201001,02,50,50002,5000201001,Aragón,Zaragoza,Acered,125,91,...,18895.000000,20525.000000,3494.000000,2873.000000,4611.000000,3968.000000,84.000000,233.000000,"{'PP': 43, 'PSOE': 19, 'Cs': 4, 'UP': 0, 'IU':...","[('PP', 43), ('VOX', 20), ('PSOE', 19), ('Cs',..."
6555,022019111025000301001,02,50,50003,5000301001,Aragón,Zaragoza,Agón,117,89,...,27578.000000,27753.000000,5804.000000,5694.000000,5604.000000,5250.000000,161.000000,247.000000,"{'PP': 23, 'PSOE': 39, 'Cs': 2, 'UP': 2, 'IU':...","[('PSOE', 39), ('PP', 23), ('VOX', 20), ('Cs',..."
6556,022019111025000401001,02,50,50004,5000401001,Aragón,Zaragoza,Aguarón,475,360,...,25421.000000,23879.000000,7039.000000,6056.000000,3502.000000,3246.000000,208.000000,253.000000,"{'PP': 96, 'PSOE': 155, 'Cs': 17, 'UP': 19, 'I...","[('PSOE', 155), ('PP', 96), ('VOX', 44), ('MP'..."
6557,022019111025000501001,02,50,50005,5000501001,Aragón,Zaragoza,Aguilón,228,185,...,31410.000000,29687.000000,8651.000000,8019.000000,5616.000000,4816.000000,108.000000,191.000000,"{'PP': 84, 'PSOE': 34, 'Cs': 13, 'UP': 12, 'IU...","[('PP', 84), ('VOX', 35), ('PSOE', 34), ('Cs',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7428,022019111025029802001,02,50,50298,5029802001,Aragón,Zaragoza,Zuera,610,482,...,31542.000000,31419.000000,9774.000000,8326.000000,3118.000000,3365.000000,213.000000,395.000000,"{'PP': 134, 'PSOE': 139, 'Cs': 45, 'UP': 50, '...","[('PSOE', 139), ('PP', 134), ('VOX', 82), ('UP..."
7429,022019111025090101001,02,50,50901,5090101001,Aragón,Zaragoza,Biel,133,96,...,25367.000000,26506.000000,13108.000000,9636.000000,7146.000000,7398.000000,145.000000,214.000000,"{'PP': 18, 'PSOE': 33, 'Cs': 7, 'UP': 8, 'IU':...","[('PSOE', 33), ('VOX', 21), ('PP', 18), ('UP',..."
7430,022019111025090201001,02,50,50902,5090201001,Aragón,Zaragoza,Marracos,77,65,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 29, 'PSOE': 15, 'Cs': 4, 'UP': 3, 'IU':...","[('PP', 29), ('PSOE', 15), ('VOX', 10), ('Cs',..."
7431,022019111025090301001,02,50,50903,5090301001,Aragón,Zaragoza,Villamayor de Gállego,1143,844,...,34050.000000,31945.000000,9707.000000,8721.000000,3872.000000,3239.000000,162.000000,287.000000,"{'PP': 160, 'PSOE': 226, 'Cs': 64, 'UP': 133, ...","[('PSOE', 226), ('PP', 160), ('VOX', 160), ('U..."


We only want to model the election results, so we keep only those columns.

In [10]:
secciones_mod_lista = list(secciones_mod['Sección']) 

In [11]:
cols_validas_mod = ['Censo_Esc', 'Votos_Total', 'Nulos', 'Votos_Válidos', 'Blanco', 'V_Cand', 'PP', 'PSOE', 'Cs', 'UP',
       'IU', 'VOX', 'UPyD', 'MP', 'CiU', 'ERC', 'JxC', 'CUP', 'DiL', 'PNV',
       'Bildu', 'Amaiur', 'CC', 'FA', 'TE', 'BNG', 'PRC', 'GBai', 'Compromis',
       'PACMA', 'Otros']

In [12]:
secciones_mod = secciones_mod[cols_validas_mod]

In [13]:
secciones_mod

Unnamed: 0,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,UP,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
6553,89,68,0,68,0,68,42,13,1,0,...,0,0,0,0,0,0,0,0,0,0
6554,125,91,5,86,0,86,43,19,4,0,...,0,0,0,0,0,0,0,0,0,0
6555,117,89,0,89,1,88,23,39,2,2,...,0,0,0,0,0,0,0,0,0,0
6556,475,360,4,356,2,354,96,155,17,19,...,0,0,0,0,0,0,0,0,0,2
6557,228,185,1,184,2,182,84,34,13,12,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7428,610,482,2,480,3,477,134,139,45,50,...,0,0,0,0,0,0,0,0,0,13
7429,133,96,0,96,0,96,18,33,7,8,...,0,0,0,0,0,0,0,0,0,2
7430,77,65,3,62,0,62,29,15,4,3,...,0,0,0,0,0,0,0,0,0,0
7431,1143,844,5,839,10,829,160,226,64,133,...,0,0,0,0,0,0,0,0,4,17


Now we need the aggregate election results of the territory to be modeled. First we take the territory's total census, then we build a DataFrame with the summed results, and finally, very importantly, we normalize these results by dividing by the census, so that the size of the territory being modeled does not matter.

In [14]:
censo_mod = secciones_mod['Censo_Esc'].sum()

In [15]:
modelizacion = pd.DataFrame(secciones_mod.sum(), columns = ['Modelización'])

In [16]:
modelizacion['Modelización'] = modelizacion['Modelización'] / modelizacion['Modelización']['Censo_Esc']

We get a one-column DataFrame with the election results normalized by the census.

In [17]:
modelizacion

Unnamed: 0,Modelización
Censo_Esc,1.0
Votos_Total,0.719466
Nulos,0.006076
Votos_Válidos,0.713389
Blanco,0.006958
V_Cand,0.706431
PP,0.166932
PSOE,0.220048
Cs,0.065202
UP,0.080233


The first row is always 1, since it is the census divided by itself, so we can drop it.

In [18]:
modelizacion = modelizacion.drop(['Censo_Esc']) 

In [19]:
modelizacion

Unnamed: 0,Modelización
Votos_Total,0.719466
Nulos,0.006076
Votos_Válidos,0.713389
Blanco,0.006958
V_Cand,0.706431
PP,0.166932
PSOE,0.220048
Cs,0.065202
UP,0.080233
IU,0.0


In [20]:
modelizacion.shape

(30, 1)
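
To make the normalization step concrete, here is a minimal sketch on two made-up sections (the numbers are illustrative, not taken from the dataset). Summing first and dividing by the total census afterwards gives each result as a share of the whole territory's census:

```python
import pandas as pd

# Two toy sections; values are invented for illustration.
toy = pd.DataFrame({
    'Censo_Esc':   [100, 300],
    'Votos_Total': [80, 180],
    'PP':          [40, 60],
    'PSOE':        [30, 90],
})

totales = toy.sum()                                          # census 400, PP 100, ...
cuotas = (totales / totales['Censo_Esc']).drop('Censo_Esc')  # shares of the census

print(cuotas)
```

Here `PP` comes out as 100 / 400 = 0.25, which is the census-weighted average of the per-section shares 0.40 and 0.20.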

Now we must find the sections that will model the province of Zaragoza which, as mentioned, come from the province itself.

In [21]:
ccaa_select = []

provincia_select = ['Zaragoza']

municipio_select = []

secciones_select = df_eleccion_comp

In [22]:
if len(ccaa_select) > 0:

  secciones_select = secciones_select.loc[secciones_select['CCAA'].isin(ccaa_select)]

if len(provincia_select) > 0:

  secciones_select = secciones_select.loc[secciones_select['Provincia'].isin(provincia_select)]

if len(municipio_select) > 0:

  secciones_select = secciones_select.loc[secciones_select['Municipio'].isin(municipio_select)]



In [23]:
secciones_select

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
6553,022019111025000101001,02,50,50001,5000101001,Aragón,Zaragoza,Abanto,89,68,...,28322.021999,21149.000000,7855.336603,5134.000000,3217.875711,4987.000000,293.331625,139.000000,"{'PP': 42, 'PSOE': 13, 'Cs': 1, 'UP': 0, 'IU':...","[('PP', 42), ('PSOE', 13), ('VOX', 10), ('MP',..."
6554,022019111025000201001,02,50,50002,5000201001,Aragón,Zaragoza,Acered,125,91,...,18895.000000,20525.000000,3494.000000,2873.000000,4611.000000,3968.000000,84.000000,233.000000,"{'PP': 43, 'PSOE': 19, 'Cs': 4, 'UP': 0, 'IU':...","[('PP', 43), ('VOX', 20), ('PSOE', 19), ('Cs',..."
6555,022019111025000301001,02,50,50003,5000301001,Aragón,Zaragoza,Agón,117,89,...,27578.000000,27753.000000,5804.000000,5694.000000,5604.000000,5250.000000,161.000000,247.000000,"{'PP': 23, 'PSOE': 39, 'Cs': 2, 'UP': 2, 'IU':...","[('PSOE', 39), ('PP', 23), ('VOX', 20), ('Cs',..."
6556,022019111025000401001,02,50,50004,5000401001,Aragón,Zaragoza,Aguarón,475,360,...,25421.000000,23879.000000,7039.000000,6056.000000,3502.000000,3246.000000,208.000000,253.000000,"{'PP': 96, 'PSOE': 155, 'Cs': 17, 'UP': 19, 'I...","[('PSOE', 155), ('PP', 96), ('VOX', 44), ('MP'..."
6557,022019111025000501001,02,50,50005,5000501001,Aragón,Zaragoza,Aguilón,228,185,...,31410.000000,29687.000000,8651.000000,8019.000000,5616.000000,4816.000000,108.000000,191.000000,"{'PP': 84, 'PSOE': 34, 'Cs': 13, 'UP': 12, 'IU...","[('PP', 84), ('VOX', 35), ('PSOE', 34), ('Cs',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7428,022019111025029802001,02,50,50298,5029802001,Aragón,Zaragoza,Zuera,610,482,...,31542.000000,31419.000000,9774.000000,8326.000000,3118.000000,3365.000000,213.000000,395.000000,"{'PP': 134, 'PSOE': 139, 'Cs': 45, 'UP': 50, '...","[('PSOE', 139), ('PP', 134), ('VOX', 82), ('UP..."
7429,022019111025090101001,02,50,50901,5090101001,Aragón,Zaragoza,Biel,133,96,...,25367.000000,26506.000000,13108.000000,9636.000000,7146.000000,7398.000000,145.000000,214.000000,"{'PP': 18, 'PSOE': 33, 'Cs': 7, 'UP': 8, 'IU':...","[('PSOE', 33), ('VOX', 21), ('PP', 18), ('UP',..."
7430,022019111025090201001,02,50,50902,5090201001,Aragón,Zaragoza,Marracos,77,65,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 29, 'PSOE': 15, 'Cs': 4, 'UP': 3, 'IU':...","[('PP', 29), ('PSOE', 15), ('VOX', 10), ('Cs',..."
7431,022019111025090301001,02,50,50903,5090301001,Aragón,Zaragoza,Villamayor de Gállego,1143,844,...,34050.000000,31945.000000,9707.000000,8721.000000,3872.000000,3239.000000,162.000000,287.000000,"{'PP': 160, 'PSOE': 226, 'Cs': 64, 'UP': 133, ...","[('PSOE', 226), ('PP', 160), ('VOX', 160), ('U..."


We now make a somewhat arbitrary decision: we keep only the sections with more than 500 registered voters, because we would rather not depend on very small sections where purely local factors can swing the result. That leaves 661 sections, which is not a very large reduction.

In [24]:
secciones_select = secciones_select.loc[secciones_select['Censo_Esc'] > 500]

In [25]:
secciones_select

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
6558,022019111025000601001,02,50,50006,5000601001,Aragón,Zaragoza,Ainzón,913,670,...,28032.0,26577.0,8319.0,6628.0,3506.0,3683.0,261.0,258.0,"{'PP': 140, 'PSOE': 282, 'Cs': 44, 'UP': 59, '...","[('PSOE', 282), ('PP', 140), ('VOX', 91), ('UP..."
6560,022019111025000801001,02,50,50008,5000801001,Aragón,Zaragoza,Alagón,882,484,...,30871.0,29924.0,9155.0,7807.0,2856.0,2812.0,271.0,403.0,"{'PP': 65, 'PSOE': 162, 'Cs': 46, 'UP': 72, 'I...","[('PSOE', 162), ('VOX', 91), ('UP', 72), ('PP'..."
6561,022019111025000801002,02,50,50008,5000801002,Aragón,Zaragoza,Alagón,1353,856,...,28145.0,26651.0,9036.0,7514.0,2858.0,2896.0,250.0,328.0,"{'PP': 127, 'PSOE': 313, 'Cs': 58, 'UP': 151, ...","[('PSOE', 313), ('UP', 151), ('VOX', 148), ('P..."
6562,022019111025000801003,02,50,50008,5000801003,Aragón,Zaragoza,Alagón,1758,1138,...,28058.0,25554.0,8652.0,7190.0,2872.0,2771.0,254.0,380.0,"{'PP': 165, 'PSOE': 385, 'Cs': 98, 'UP': 191, ...","[('PSOE', 385), ('VOX', 192), ('UP', 191), ('P..."
6563,022019111025000801004,02,50,50008,5000801004,Aragón,Zaragoza,Alagón,1194,856,...,34736.0,33041.0,10159.0,8951.0,2698.0,2456.0,210.0,273.0,"{'PP': 139, 'PSOE': 257, 'Cs': 74, 'UP': 114, ...","[('PSOE', 257), ('VOX', 199), ('PP', 139), ('U..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7426,022019111025029801002,02,50,50298,5029801002,Aragón,Zaragoza,Zuera,1618,1226,...,29618.0,28599.0,9758.0,8291.0,2169.0,2263.0,183.0,249.0,"{'PP': 331, 'PSOE': 419, 'Cs': 94, 'UP': 92, '...","[('PSOE', 419), ('PP', 331), ('VOX', 215), ('C..."
7427,022019111025029801003,02,50,50298,5029801003,Aragón,Zaragoza,Zuera,1437,1016,...,29236.0,26961.0,10628.0,8834.0,1513.0,1768.0,166.0,284.0,"{'PP': 200, 'PSOE': 279, 'Cs': 98, 'UP': 135, ...","[('PSOE', 279), ('VOX', 238), ('PP', 200), ('U..."
7428,022019111025029802001,02,50,50298,5029802001,Aragón,Zaragoza,Zuera,610,482,...,31542.0,31419.0,9774.0,8326.0,3118.0,3365.0,213.0,395.0,"{'PP': 134, 'PSOE': 139, 'Cs': 45, 'UP': 50, '...","[('PSOE', 139), ('PP', 134), ('VOX', 82), ('UP..."
7431,022019111025090301001,02,50,50903,5090301001,Aragón,Zaragoza,Villamayor de Gállego,1143,844,...,34050.0,31945.0,9707.0,8721.0,3872.0,3239.0,162.0,287.0,"{'PP': 160, 'PSOE': 226, 'Cs': 64, 'UP': 133, ...","[('PSOE', 226), ('PP', 160), ('VOX', 160), ('U..."


Again, we keep only the dataset columns that hold election results.

In [26]:
col_validas_select = ['Sección', 'Censo_Esc', 'Votos_Total', 'Nulos', 'Votos_Válidos', 'Blanco', 'V_Cand', 'PP', 'PSOE', 'Cs', 'UP',
       'IU', 'VOX', 'UPyD', 'MP', 'CiU', 'ERC', 'JxC', 'CUP', 'DiL', 'PNV',
       'Bildu', 'Amaiur', 'CC', 'FA', 'TE', 'BNG', 'PRC', 'GBai', 'Compromis',
       'PACMA', 'Otros']

In [27]:
secciones_select = secciones_select[col_validas_select]

In [28]:
secciones_select

Unnamed: 0,Sección,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
6558,022019111025000601001,913,670,16,654,13,641,140,282,44,...,0,0,0,0,0,0,0,0,3,1
6560,022019111025000801001,882,484,3,481,10,471,65,162,46,...,0,0,0,0,0,0,0,0,10,4
6561,022019111025000801002,1353,856,9,847,13,834,127,313,58,...,0,0,0,0,0,0,0,0,10,7
6562,022019111025000801003,1758,1138,25,1113,7,1106,165,385,98,...,0,0,0,0,0,0,0,0,6,12
6563,022019111025000801004,1194,856,9,847,7,840,139,257,74,...,0,0,0,0,0,0,0,0,8,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7426,022019111025029801002,1618,1226,18,1208,18,1190,331,419,94,...,0,0,0,0,0,0,0,0,2,7
7427,022019111025029801003,1437,1016,7,1009,10,999,200,279,98,...,0,0,0,0,0,0,0,0,3,6
7428,022019111025029802001,610,482,2,480,3,477,134,139,45,...,0,0,0,0,0,0,0,0,0,13
7431,022019111025090301001,1143,844,5,839,10,829,160,226,64,...,0,0,0,0,0,0,0,0,4,17


In [29]:
secciones_select_norm = secciones_select.copy()

In [30]:
secciones_select_norm

Unnamed: 0,Sección,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
6558,022019111025000601001,913,670,16,654,13,641,140,282,44,...,0,0,0,0,0,0,0,0,3,1
6560,022019111025000801001,882,484,3,481,10,471,65,162,46,...,0,0,0,0,0,0,0,0,10,4
6561,022019111025000801002,1353,856,9,847,13,834,127,313,58,...,0,0,0,0,0,0,0,0,10,7
6562,022019111025000801003,1758,1138,25,1113,7,1106,165,385,98,...,0,0,0,0,0,0,0,0,6,12
6563,022019111025000801004,1194,856,9,847,7,840,139,257,74,...,0,0,0,0,0,0,0,0,8,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7426,022019111025029801002,1618,1226,18,1208,18,1190,331,419,94,...,0,0,0,0,0,0,0,0,2,7
7427,022019111025029801003,1437,1016,7,1009,10,999,200,279,98,...,0,0,0,0,0,0,0,0,3,6
7428,022019111025029802001,610,482,2,480,3,477,134,139,45,...,0,0,0,0,0,0,0,0,0,13
7431,022019111025090301001,1143,844,5,839,10,829,160,226,64,...,0,0,0,0,0,0,0,0,4,17


Now we do some light data preparation. We normalize each section's results by dividing by its census, and then transpose the dataset so that the sections become the columns and the normalized results become the rows, matching the layout of the dataset we want to model.

In [31]:
set_cols = ['Sección', 'Censo_Esc']

In [32]:
for col in secciones_select_norm.columns:

  if col not in set_cols:
    
    secciones_select_norm[col] = secciones_select_norm[col] / secciones_select_norm['Censo_Esc']

secciones_select_norm = secciones_select_norm.set_index('Sección')
secciones_select_norm = secciones_select_norm.drop('Censo_Esc', axis = 1)

secciones_select_norm = secciones_select_norm.T
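
The column-by-column loop above can also be written as a single vectorized division. This is a sketch of an alternative on toy data, not the notebook's own code; it relies on `DataFrame.div` with `axis=0`, which divides each row by the matching entry of the census Series:

```python
import pandas as pd

# Toy sections; values are invented for illustration.
toy = pd.DataFrame({
    'Sección':   ['A', 'B'],
    'Censo_Esc': [100, 200],
    'PP':        [40, 50],
    'PSOE':      [30, 100],
})

votos = toy.set_index('Sección')

# Divide every result column by the census of its row, then transpose.
norm = votos.drop(columns='Censo_Esc').div(votos['Censo_Esc'], axis=0).T

print(norm)
```

The result has one row per result column (`PP`, `PSOE`) and one column per section (`A`, `B`).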

In [33]:
secciones_select_norm

Sección,022019111025000601001,022019111025000801001,022019111025000801002,022019111025000801003,022019111025000801004,022019111025001701001,022019111025001801001,022019111025002001001,022019111025002201001,022019111025002401001,...,022019111025029712007,022019111025029712008,022019111025029712009,022019111025029712010,022019111025029801001,022019111025029801002,022019111025029801003,022019111025029802001,022019111025090301001,022019111025090301002
Votos_Total,0.733844,0.548753,0.632668,0.647327,0.716918,0.723632,0.704417,0.689308,0.740079,0.736648,...,0.655271,0.655263,0.587117,0.695172,0.749154,0.757726,0.707029,0.790164,0.738408,0.747556
Nulos,0.017525,0.003401,0.006652,0.014221,0.007538,0.010152,0.006795,0.003774,0.007937,0.005525,...,0.002849,0.006579,0.006748,0.005517,0.01071,0.011125,0.004871,0.003279,0.004374,0.008889
Votos_Válidos,0.71632,0.545351,0.626016,0.633106,0.70938,0.71348,0.697622,0.685535,0.732143,0.731123,...,0.652422,0.648684,0.580368,0.689655,0.738444,0.746601,0.702157,0.786885,0.734033,0.738667
Blanco,0.014239,0.011338,0.009608,0.003982,0.005863,0.007332,0.005663,0.005031,0.02381,0.005525,...,0.008547,0.007895,0.008589,0.004138,0.005637,0.011125,0.006959,0.004918,0.008749,0.005333
V_Cand,0.702081,0.534014,0.616408,0.629124,0.703518,0.706148,0.691959,0.680503,0.708333,0.725599,...,0.643875,0.640789,0.571779,0.685517,0.732807,0.735476,0.695198,0.781967,0.725284,0.733333
PP,0.153341,0.073696,0.093865,0.093857,0.116415,0.135928,0.184598,0.138365,0.210317,0.145488,...,0.119658,0.110526,0.134969,0.17931,0.199549,0.204574,0.139179,0.219672,0.139983,0.144889
PSOE,0.308872,0.183673,0.231338,0.218999,0.215243,0.178229,0.321631,0.242767,0.263889,0.368324,...,0.24359,0.218421,0.089571,0.136552,0.233935,0.258962,0.194154,0.227869,0.197725,0.184
Cs,0.048193,0.052154,0.042868,0.055745,0.061977,0.078962,0.032843,0.056604,0.053571,0.042357,...,0.052707,0.063158,0.053374,0.062069,0.055242,0.058096,0.068198,0.07377,0.055993,0.066667
UP,0.064622,0.081633,0.111604,0.108646,0.095477,0.087422,0.026048,0.055346,0.027778,0.062615,...,0.08547,0.092105,0.066258,0.078621,0.063698,0.05686,0.093946,0.081967,0.11636,0.102222
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


At this point we still do not know which sections we will finally use.

We will select the sections that are least correlated with each other. However, some rows are entirely zero, so we might get an error if we tried to compute the correlation matrix from the previous dataset, 'secciones_select_norm'.

Although it is somewhat redundant, we start again from the dataset before normalization, 'secciones_select', and apply to it the function 'preparacion_sec' defined below. In essence, it:

- Drops the columns (votes for parties) that are all zeros, that is, the parties that did not run in this case.

- Normalizes by the census.

- Transposes the dataset, as we saw above.

- Shuffles the order of the sections at random. This matters so that we do not systematically favor one section over another when we select them.

In [34]:
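def preparacion_sec(eleccion):

  # Work on a copy so the caller's DataFrame is left untouched.
  eleccion = eleccion.copy()

  set_cols = ['Sección', 'Censo_Esc']
  
  for col in eleccion.columns:

    # Drop columns that sum to zero (parties with no votes in these sections).
    if eleccion[col].sum() == 0:

      eleccion = eleccion.drop([col], axis = 1)

    # Normalize every remaining result column by the section census.
    elif col not in set_cols:

      eleccion[col] = eleccion[col] / eleccion['Censo_Esc']

  eleccion = eleccion.set_index('Sección')
  eleccion = eleccion.drop('Censo_Esc', axis = 1)

  # Transpose: sections become columns, results become rows.
  df_elec_transpose = eleccion.T

  # Shuffle the section order so that no section is systematically favored.
  lista_sec = list(df_elec_transpose.columns)
  random.shuffle(lista_sec)

  df_elec_transpose = df_elec_transpose[lista_sec]

  return df_elec_transpose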
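
Because the shuffle makes every run different, it can be handy to make it reproducible. The following is a sketch of a vectorized variant with an optional seed; `preparar_secciones` and its `seed` parameter are hypothetical names introduced here for illustration, and the toy data is invented:

```python
import random
import pandas as pd

def preparar_secciones(eleccion, seed=None):
    """Hypothetical variant of preparacion_sec with an optional seed."""
    datos = eleccion.set_index('Sección')
    censo = datos['Censo_Esc']
    votos = datos.drop(columns='Censo_Esc')
    votos = votos.loc[:, votos.sum() != 0]   # drop all-zero columns
    norm = votos.div(censo, axis=0).T        # normalize by census, then transpose
    columnas = list(norm.columns)
    random.Random(seed).shuffle(columnas)    # same seed gives the same order
    return norm[columnas]

# Toy sections; 'IU' has no votes anywhere, so it should disappear.
toy = pd.DataFrame({
    'Sección':   ['A', 'B', 'C'],
    'Censo_Esc': [100, 200, 400],
    'PP':        [40, 50, 100],
    'IU':        [0, 0, 0],
})

uno = preparar_secciones(toy, seed=0)
dos = preparar_secciones(toy, seed=0)

print(list(uno.index))                          # only 'PP' remains
print(list(uno.columns) == list(dos.columns))   # same seed, same order
```

Using a local `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.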


With its output (we will see an example below) we can now select the sections. After computing the correlation matrix of all the sections, we pass it to the next function, 'secciones_corr', which goes through the correlations of each section with the others one by one, starting with the first which, as we saw, was placed there at random.

For each section, we check whether its maximum correlation with the sections reviewed before it is above or below a limit, threshold:

- If it is above, the section is too correlated with one we have already reviewed, so we drop it.

- If it is below, we keep it.

After going through all the sections, we are left with those that are least correlated with each other. The threshold should be chosen so that a few sections remain, but not too many, typically fewer than 10, say.

The selection depends on the order in which the sections are examined, which the previous function randomizes, so each run will (almost certainly) yield different sections unless we fix a seed.

In [35]:
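def secciones_corr(dummy, threshold = 0.995):

  # Iterate over the full correlation matrix that was passed in,
  # instead of relying on a global variable named m.
  m = dummy

  # Note: the range stops before m.shape[0], so the last section
  # is never tested and is always kept.
  for ind in range(2, m.shape[0]):
    s = m.iloc[0:ind, 0:ind]

    # Drop section ind-1 if it correlates above the threshold with any earlier one.
    if (s.iloc[ind-1, 0:ind-1] > threshold).any():
      dummy = dummy.drop(m.columns[ind-1], axis = 0)
      dummy = dummy.drop(m.columns[ind-1], axis = 1)

  return dummy.columns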
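
The function above compares each section with every section that precedes it, including ones that were already discarded. A possible variation, sketched below, compares each candidate only with the sections kept so far. This is an assumption-laden alternative rather than the notebook's method: `seleccionar_poco_correlacionadas` is a hypothetical name and the toy data is invented.

```python
import pandas as pd

def seleccionar_poco_correlacionadas(corr, threshold=0.995):
    """Hypothetical variant: compare each section only with those already kept."""
    elegidas = []
    for seccion in corr.columns:
        # Keep the section if it is not too correlated with any kept section.
        if all(corr.loc[seccion, otra] <= threshold for otra in elegidas):
            elegidas.append(seccion)
    return elegidas

# Toy normalized results: rows are result types, columns are sections.
toy = pd.DataFrame({
    'A': [0.10, 0.20, 0.30, 0.40],
    'B': [0.11, 0.21, 0.31, 0.41],   # A shifted by a constant, so perfectly correlated with A
    'C': [0.40, 0.10, 0.30, 0.20],   # correlation with A is -0.4
})

elegidas = seleccionar_poco_correlacionadas(toy.corr(), threshold=0.995)
print(elegidas)
```

In this toy case `B` is discarded because it duplicates `A`, while `C` is kept. As in the notebook, the outcome depends on the column order.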


The output of the first function is a normalized, transposed dataset that no longer contains rows that are entirely zero.

In [36]:
secc = preparacion_sec(secciones_select)


In [37]:
secc

Sección,022019111025029702040,022019111025029703080,022019111025029704016,022019111025007404001,022019111025029710048,022019111025014701001,022019111025020901001,022019111025029710055,022019111025027201001,022019111025029705010,...,022019111025029708017,022019111025006701001,022019111025029710059,022019111025023501001,022019111025029705014,022019111025029704031,022019111025029706007,022019111025029703058,022019111025029704043,022019111025029711011
Votos_Total,0.725949,0.839552,0.795296,0.642292,0.795102,0.682796,0.748705,0.629992,0.709125,0.659283,...,0.801301,0.644315,0.770099,0.705263,0.78508,0.751524,0.648062,0.661509,0.782036,0.728814
Nulos,0.005696,0.001866,0.002613,0.012846,0.009495,0.002688,0.011226,0.005705,0.002535,0.007384,...,0.002365,0.015549,0.014104,0.007368,0.00444,0.006098,0.003101,0.003868,0.005988,0.00339
Votos_Válidos,0.720253,0.837687,0.792683,0.629447,0.785607,0.680108,0.737478,0.624287,0.706591,0.651899,...,0.798936,0.628766,0.755994,0.697895,0.780639,0.745427,0.644961,0.65764,0.776048,0.725424
Blanco,0.008228,0.009328,0.005226,0.001976,0.008496,0.00672,0.008636,0.00652,0.010773,0.009494,...,0.00887,0.007775,0.007052,0.007368,0.007105,0.007622,0.006202,0.003868,0.003593,0.000847
V_Cand,0.712025,0.828358,0.787456,0.62747,0.777111,0.673387,0.728843,0.617767,0.695817,0.642405,...,0.790065,0.620991,0.748942,0.690526,0.773535,0.737805,0.63876,0.653772,0.772455,0.724576
PP,0.202532,0.330224,0.272648,0.136364,0.104948,0.123656,0.199482,0.102689,0.124208,0.159283,...,0.274985,0.124393,0.141044,0.169474,0.241563,0.22561,0.094574,0.12766,0.307784,0.116949
PSOE,0.208861,0.173507,0.171603,0.205534,0.264868,0.298387,0.219344,0.206194,0.184411,0.238397,...,0.1644,0.215743,0.251058,0.195789,0.195382,0.161585,0.213953,0.272727,0.152096,0.229661
Cs,0.067722,0.113806,0.084495,0.066206,0.090455,0.043011,0.056995,0.054605,0.06654,0.03692,...,0.105855,0.050534,0.06347,0.068421,0.10302,0.073171,0.052713,0.032882,0.058683,0.069492
UP,0.065823,0.046642,0.09669,0.043478,0.112444,0.071237,0.068221,0.106764,0.084918,0.07173,...,0.064459,0.048591,0.152327,0.074737,0.043517,0.073171,0.134884,0.083172,0.061078,0.127119
VOX,0.126582,0.113806,0.107143,0.12747,0.148926,0.086022,0.149396,0.112469,0.202155,0.098101,...,0.141928,0.139942,0.09591,0.144211,0.155417,0.166159,0.08062,0.102515,0.160479,0.130508


Now we compute the correlation matrix and pass it to the second function together with the threshold value. We obtain eight sections, which we know are not so strongly correlated with each other.

In [38]:
m = secc.corr()
lista_sec = secciones_corr(m, 0.996)

In [39]:
lista_sec

Index(['022019111025029702040', '022019111025029703080',
       '022019111025014701001', '022019111025027201001',
       '022019111025029708025', '022019111025029702026',
       '022019111025029702009', '022019111025029711011'],
      dtype='object', name='Sección')

In [40]:
lista_sec = np.sort(lista_sec)

In [41]:
lista_sec

array(['022019111025014701001', '022019111025027201001',
       '022019111025029702009', '022019111025029702026',
       '022019111025029702040', '022019111025029703080',
       '022019111025029708025', '022019111025029711011'], dtype=object)

Now that we know which sections we have chosen, we can select them from the normalized dataset that contained the 661 sections of Zaragoza, including the rows that are all zeros.

In [42]:
secciones_select_norm = secciones_select_norm[lista_sec].copy()

In [43]:
secciones_select_norm

Sección,022019111025014701001,022019111025027201001,022019111025029702009,022019111025029702026,022019111025029702040,022019111025029703080,022019111025029708025,022019111025029711011
Votos_Total,0.682796,0.709125,0.739445,0.839384,0.725949,0.839552,0.817143,0.728814
Nulos,0.002688,0.002535,0.001206,0.0011,0.005696,0.001866,0.005079,0.00339
Votos_Válidos,0.680108,0.706591,0.738239,0.838284,0.720253,0.837687,0.812063,0.725424
Blanco,0.00672,0.010773,0.002413,0.007701,0.008228,0.009328,0.007619,0.000847
V_Cand,0.673387,0.695817,0.735826,0.830583,0.712025,0.828358,0.804444,0.724576
PP,0.123656,0.124208,0.126659,0.468647,0.202532,0.330224,0.221587,0.116949
PSOE,0.298387,0.184411,0.171291,0.056106,0.208861,0.173507,0.136508,0.229661
Cs,0.043011,0.06654,0.074789,0.075908,0.067722,0.113806,0.147937,0.069492
UP,0.071237,0.084918,0.072376,0.022002,0.065823,0.046642,0.04381,0.127119
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We see that it has the same 30 rows as the normalized data of the province of Zaragoza that we want to model. We can add that column to this DataFrame so that all the data we pass to the regression model is in a single DataFrame.

In [44]:
secciones_select_norm.shape

(30, 8)

In [45]:
secciones_select_norm['Modelización'] = modelizacion['Modelización']


In [46]:
secciones_select_norm

Sección,022019111025014701001,022019111025027201001,022019111025029702009,022019111025029702026,022019111025029702040,022019111025029703080,022019111025029708025,022019111025029711011,Modelización
Votos_Total,0.682796,0.709125,0.739445,0.839384,0.725949,0.839552,0.817143,0.728814,0.719466
Nulos,0.002688,0.002535,0.001206,0.0011,0.005696,0.001866,0.005079,0.00339,0.006076
Votos_Válidos,0.680108,0.706591,0.738239,0.838284,0.720253,0.837687,0.812063,0.725424,0.713389
Blanco,0.00672,0.010773,0.002413,0.007701,0.008228,0.009328,0.007619,0.000847,0.006958
V_Cand,0.673387,0.695817,0.735826,0.830583,0.712025,0.828358,0.804444,0.724576,0.706431
PP,0.123656,0.124208,0.126659,0.468647,0.202532,0.330224,0.221587,0.116949,0.166932
PSOE,0.298387,0.184411,0.171291,0.056106,0.208861,0.173507,0.136508,0.229661,0.220048
Cs,0.043011,0.06654,0.074789,0.075908,0.067722,0.113806,0.147937,0.069492,0.065202
UP,0.071237,0.084918,0.072376,0.022002,0.065823,0.046642,0.04381,0.127119,0.080233
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
secciones_select_norm.index

Index(['Votos_Total', 'Nulos', 'Votos_Válidos', 'Blanco', 'V_Cand', 'PP',
       'PSOE', 'Cs', 'UP', 'IU', 'VOX', 'UPyD', 'MP', 'CiU', 'ERC', 'JxC',
       'CUP', 'DiL', 'PNV', 'Bildu', 'Amaiur', 'CC', 'FA', 'TE', 'BNG', 'PRC',
       'GBai', 'Compromis', 'PACMA', 'Otros'],
      dtype='object')

We can now model with linear regression. We load the required libraries and define the matrix X and the vector y.

In [48]:
import numpy as np
from sklearn.linear_model import LinearRegression

In [49]:
X = secciones_select_norm.drop('Modelización', axis = 1).values

In [50]:
y = secciones_select_norm['Modelización'].values

In [51]:
X

array([[0.6827957 , 0.70912548, 0.73944511, 0.83938394, 0.72594937,
        0.83955224, 0.81714286, 0.72881356],
       [0.00268817, 0.00253485, 0.00120627, 0.00110011, 0.0056962 ,
        0.00186567, 0.00507937, 0.00338983],
       [0.68010753, 0.70659062, 0.73823884, 0.83828383, 0.72025316,
        0.83768657, 0.81206349, 0.72542373],
       [0.00672043, 0.01077313, 0.00241255, 0.00770077, 0.00822785,
        0.00932836, 0.00761905, 0.00084746],
       [0.6733871 , 0.69581749, 0.7358263 , 0.83058306, 0.71202532,
        0.82835821, 0.80444444, 0.72457627],
       [0.12365591, 0.12420786, 0.12665862, 0.46864686, 0.20253165,
        0.33022388, 0.2215873 , 0.11694915],
       [0.2983871 , 0.18441065, 0.17129071, 0.05610561, 0.20886076,
        0.17350746, 0.13650794, 0.22966102],
       [0.04301075, 0.06653992, 0.0747889 , 0.07590759, 0.06772152,
        0.11380597, 0.14793651, 0.06949153],
       [0.07123656, 0.08491762, 0.07237636, 0.0220022 , 0.06582278,
        0.04664179, 0.043809

In [52]:
y

array([0.71946552, 0.00607642, 0.7133891 , 0.00695846, 0.70643064,
       0.16693179, 0.22004842, 0.06520238, 0.08023338, 0.        ,
       0.12857079, 0.        , 0.03213501, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00499869, 0.00831018])

We fit the model on X and y. We set the intercept to zero so that no votes are assigned to parties that did not run in Zaragoza. This is mostly a cosmetic choice.

In [53]:
reg = LinearRegression(fit_intercept = False).fit(X, y)

In [54]:
reg.intercept_*censo_mod

0.0
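To see why a zero intercept matters here, recall that a linear model predicts `row · coef + intercept`. The sketch below is plain Python with made-up coefficients and a hypothetical census (none of these values come from the notebook). It shows that a party with zero votes in every selected section is predicted to get exactly zero votes only when the intercept is zero.

```python
def linear_predict(row, coef, intercept=0.0):
    """Prediction of a linear model for a single row of features."""
    return sum(x * c for x, c in zip(row, coef)) + intercept

# Hypothetical values, for illustration only.
coef = [0.4, 0.35, 0.25]
census = 100_000

# A party that received no votes in any of the selected sections.
absent_party = [0.0, 0.0, 0.0]

votes_no_intercept = linear_predict(absent_party, coef) * census
votes_with_intercept = linear_predict(absent_party, coef, intercept=0.001) * census

print(votes_no_intercept)    # exactly zero
print(votes_with_intercept)  # about 100 phantom votes
```

With a fitted intercept, every all-zero row would receive the same small offset, which is the visual artifact the notebook avoids.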

The fit looks excellent: R² is about 99.9997%.

In [55]:
reg.score(X, y)

0.9999971822646484
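`reg.score` returns the coefficient of determination R². As a rough cross-check, the sketch below computes R² by hand in plain Python on the ten rows of estimated and real vote counts that the notebook displays for 2019 further down. Because the other 20 rows are not included, the value is not expected to match `reg.score(X, y)` exactly. The helper name `r_squared` is ours.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Visible rows of the 2019 comparison table (Votos_Total ... IU).
real = [514697, 4347, 510350, 4978, 505372, 119421, 157420, 46645, 57398, 0]
estimate = [514617, 4601, 510016, 4342, 505674, 119512, 157654, 46981, 57492, 0]

print(r_squared(real, estimate))
```

R² does not change when both series are multiplied by the same constant, so computing it on vote counts or on census-normalized shares gives the same value, up to the integer truncation of the estimates.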

These are the coefficients. It is not surprising that they add up to almost 1: after normalization, the target and every section are dimensionless vote shares of the same order of magnitude.

In [56]:
reg.coef_

array([-0.01211387, -0.12020185,  0.01638242, -0.00883913,  1.05953755,
       -0.21674705,  0.07829277,  0.21540932])

In [57]:
reg.coef_.sum()

1.0117201543850207
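One way to read this sum: if every selected section had the same share s for a given row, the prediction for that row would be s times the sum of the coefficients. A sum close to 1 therefore means the model roughly reproduces a typical section. A minimal check in plain Python, using the coefficients printed above (rounded to eight decimals) and a hypothetical uniform share of 0.72:

```python
# Coefficients as printed by reg.coef_ (rounded to 8 decimals).
coef = [-0.01211387, -0.12020185, 0.01638242, -0.00883913,
        1.05953755, -0.21674705, 0.07829277, 0.21540932]

coef_sum = sum(coef)

# Hypothetical row in which every section has the same share s.
s = 0.72
uniform_row = [s] * len(coef)
prediction = sum(x * c for x, c in zip(uniform_row, coef))

print(coef_sum)
print(prediction, s * coef_sum)  # the two values agree up to rounding
```

Note that individual coefficients can be negative or larger than 1, so they are regression weights rather than proportions.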

We can now look at the results predicted by the model. We undo the normalization by multiplying by the total census of the province of Zaragoza, and store the result in a DataFrame.

In [58]:
est = reg.predict(X)*censo_mod

In [59]:
df = pd.DataFrame(est, index = secciones_select_norm.index, columns = ['Estimación']).astype('int32')

In [60]:
df

Unnamed: 0,Estimación
Votos_Total,514617
Nulos,4601
Votos_Válidos,510016
Blanco,4342
V_Cand,505674
PP,119512
PSOE,157654
Cs,46981
UP,57492
IU,0


Next we put the real figures we wanted to model in another DataFrame.

In [61]:
df1 = pd.DataFrame(secciones_mod.sum(), columns = ['Real']).drop('Censo_Esc')

In [62]:
df1

Unnamed: 0,Real
Votos_Total,514697
Nulos,4347
Votos_Válidos,510350
Blanco,4978
V_Cand,505372
PP,119421
PSOE,157420
Cs,46645
UP,57398
IU,0


We compare the two DataFrames. Given such a high R², it was to be expected that they would be very similar, especially for the main parties. The fit is striking given that we used only 8 electoral sections of the province, which, as we saw, has 880 in total.

In [63]:
df['Real'] = df1['Real']

In [64]:
df

Unnamed: 0,Estimación,Real
Votos_Total,514617,514697
Nulos,4601,4347
Votos_Válidos,510016,510350
Blanco,4342,4978
V_Cand,505674,505372
PP,119512,119421
PSOE,157654,157420
Cs,46981,46645
UP,57492,57398
IU,0,0


## Modelling the 2016 elections

A natural question is how well the sections selected with 2019 data hold up when we use their equivalents in the 2016 elections. That is what this section addresses. As a reminder, these are the selected sections:

In [65]:
lista_sec

array(['022019111025014701001', '022019111025027201001',
       '022019111025029702009', '022019111025029702026',
       '022019111025029702040', '022019111025029703080',
       '022019111025029708025', '022019111025029711011'], dtype=object)

These are 2019 sections, so we need to find their equivalent, or most similar, sections in 2016. To do so we load the section-similarity DataFrame, which gathers the sections of the last five elections.

In [66]:
import botocore.exceptions

KEY = 'datos-elecciones-generales-unificados/similitud_secciones_def_REF.csv' 

try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'similitud_secciones_def_REF.csv')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

In [67]:
sim_secciones = pd.read_csv('similitud_secciones_def_REF.csv', dtype = 'str')

In [68]:
sim_secciones

Unnamed: 0,cod_sec_ref,CUSEC,CUMUN,CPRO,Elección,cod_ccaa_orig,cod_ccaa_ref,cercana N11_ref,cercana D15_ref,cercana J16_ref,cercana A19_ref,cercana N19_ref
0,022019041140100901001,0100901001,01009,01,02201904,16,14,022011111140100901001,022015121140100901001,022016061140100901001,022019041140100901001,022019111140100901001
1,022019041140101001002,0101001002,01010,01,02201904,16,14,022011111140101001002,022015121140101001002,022016061140101001002,022019041140101001002,022019111140101001002
2,022019041140103101001,0103101001,01031,01,02201904,16,14,022011111140103101001,022015121140103101001,022016061140103101001,022019041140103101001,022019111140103101001
3,022019041140103301001,0103301001,01033,01,02201904,16,14,022011111140103301001,022015121140103301001,022016061140103301001,022019041140103301001,022019111140103301001
4,022019041140103701001,0103701001,01037,01,02201904,16,14,022011111140103701001,022015121140103701001,022016061140103701001,022019041140103701001,022019111140103701001
...,...,...,...,...,...,...,...,...,...,...,...,...
181034,022011111195200108010,5200108010,52001,52,02201111,19,19,022011111195200108010,022015121195200108010,022016061195200108010,022019041195200108010,022019111195200108010
181035,022011111195200108011,5200108011,52001,52,02201111,19,19,022011111195200108011,022015121195200108011,022016061195200108011,022019041195200108011,022019111195200108011
181036,022011111195200108012,5200108012,52001,52,02201111,19,19,022011111195200108012,022015121195200108012,022016061195200108012,022019041195200108012,022019111195200108012
181037,022011111195200108013,5200108013,52001,52,02201111,19,19,022011111195200108013,022015121195200108013,022016061195200108013,022019041195200108013,022019111195200108013


Now we select the rows for the Zaragoza sections found in the previous section...

In [69]:
sec_select_J16 = sim_secciones.loc[sim_secciones['cod_sec_ref'].isin(lista_sec)]

In [70]:
sec_select_J16

Unnamed: 0,cod_sec_ref,CUSEC,CUMUN,CPRO,Elección,cod_ccaa_orig,cod_ccaa_ref,cercana N11_ref,cercana D15_ref,cercana J16_ref,cercana A19_ref,cercana N19_ref
67803,022019111025014701001,5014701001,50147,50,2201911,2,2,022011111025014701001,022015121025014701001,022016061025014701001,022019041025014701001,022019111025014701001
67993,022019111025027201001,5027201001,50272,50,2201911,2,2,022011111025027201001,022015121025027201001,022016061025027201001,022019041025027201001,022019111025027201001
68003,022019111025029702009,5029702009,50297,50,2201911,2,2,022011111025029702009,022015121025029702009,022016061025029702009,022019041025029702009,022019111025029702009
68045,022019111025029702026,5029702026,50297,50,2201911,2,2,022011111025029702026,022015121025029702026,022016061025029702026,022019041025029702026,022019111025029702026
68055,022019111025029702040,5029702040,50297,50,2201911,2,2,022011111025029702040,022015121025029702040,022016061025029702040,022019041025029702040,022019111025029702040
68126,022019111025029703080,5029703080,50297,50,2201911,2,2,022011111025029703080,022015121025029703080,022016061025029703080,022019041025029703080,022019111025029703080
68319,022019111025029708025,5029708025,50297,50,2201911,2,2,022011111025029708020,022015121025029708025,022016061025029708025,022019041025029708025,022019111025029708025
70002,022019111025029711011,5029711011,50297,50,2201911,2,2,022011111025029711011,022015121025029711011,022016061025029711011,022019041025029711011,022019111025029711011


... and take their equivalents in the 2016 elections, which are these eight:

In [71]:
list_sec_J16 = list(sec_select_J16['cercana J16_ref'])

In [72]:
list_sec_J16

['022016061025014701001',
 '022016061025027201001',
 '022016061025029702009',
 '022016061025029702026',
 '022016061025029702040',
 '022016061025029703080',
 '022016061025029708025',
 '022016061025029711011']
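The lookup above can be pictured as a dictionary from each 2019 code to its closest 2016 section. The sketch below rebuilds that mapping in plain Python from three of the rows shown in `sec_select_J16`; the names `nearest_j16`, `nearest_n11` and `swap_election` are ours. For these sections the 2016 code differs from the 2019 one only in the election-date field, but the `cercana N11_ref` column shows that this is not guaranteed, which is a reason to rely on the similarity table instead of string replacement.

```python
# 2019 code -> closest 2016 section, copied from the table above.
nearest_j16 = {
    '022019111025014701001': '022016061025014701001',
    '022019111025029702009': '022016061025029702009',
    '022019111025029708025': '022016061025029708025',
}

# 2019 code -> closest 2011 section, for one row of the same table.
nearest_n11 = {
    '022019111025029708025': '022011111025029708020',
}

def swap_election(code, new_date):
    """Naive alternative: replace the six date characters after the leading '02'."""
    return code[:2] + new_date + code[8:]

sections_2019 = list(nearest_j16)
sections_2016 = [nearest_j16[code] for code in sections_2019]

# For 2016 the naive swap happens to agree with the table...
agrees_2016 = all(swap_election(c, '201606') == nearest_j16[c] for c in sections_2019)
# ...but for 2011 it does not for this section.
code = '022019111025029708025'
agrees_2011 = swap_election(code, '201111') == nearest_n11[code]

print(agrees_2016, agrees_2011)  # True False
```

A dictionary lookup also keeps the order of the input list, whereas filtering with `isin` returns rows in the order of the similarity table.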

In [73]:
list_sec_J16 = np.sort(list_sec_J16)

In [74]:
list_sec_J16

array(['022016061025014701001', '022016061025027201001',
       '022016061025029702009', '022016061025029702026',
       '022016061025029702040', '022016061025029703080',
       '022016061025029708025', '022016061025029711011'], dtype='<U21')

In [75]:
lista_sec

array(['022019111025014701001', '022019111025027201001',
       '022019111025029702009', '022019111025029702026',
       '022019111025029702040', '022019111025029703080',
       '022019111025029708025', '022019111025029711011'], dtype=object)

We now load the results of the June 2016 elections.

In [76]:
import botocore.exceptions

KEY = 'datos-elecciones-generales-unificados/gen_J16_unif_cols_prov_copia.txt' 

try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'gen_J16_unif_cols_prov_copia.txt')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

In [77]:
df_eleccion_comp_J16 = pd.read_csv('gen_J16_unif_cols_prov_copia.txt', dtype = strings)

We select the sections to model, which are of course those of the province of Zaragoza.

In [78]:
secciones_mod = df_eleccion_comp_J16

# Apply each territorial filter only if its list is non-empty.
if len(ccaa_mod) > 0:
    secciones_mod = secciones_mod.loc[secciones_mod['CCAA'].isin(ccaa_mod)]

if len(provincia_mod) > 0:
    secciones_mod = secciones_mod.loc[secciones_mod['Provincia'].isin(provincia_mod)]

if len(municipio_mod) > 0:
    secciones_mod = secciones_mod.loc[secciones_mod['Municipio'].isin(municipio_mod)]


In [79]:
secciones_mod

Unnamed: 0,Sección,cod_ccaa,cod_prov,cod_mun,cod_sec,CCAA,Provincia,Municipio,Censo_Esc,Votos_Total,...,Renta hogar 2017,Renta hogar 2015,Renta Salarios 2018,Renta Salarios 2015,Renta Pensiones 2018,Renta Pensiones 2015,Renta Desempleo 2018,Renta Desempleo 2015,dict_res,dict_res_ord
6494,022016061025000101001,02,50,50001,5000101001,Aragón,Zaragoza,Abanto,100,77,...,28322.021999,21149.000000,7855.336603,5134.000000,3217.875711,4987.000000,293.331625,139.000000,"{'PP': 52, 'PSOE': 15, 'Cs': 4, 'UP': 6, 'IU':...","[('PP', 52), ('PSOE', 15), ('UP', 6), ('Cs', 4..."
6495,022016061025000201001,02,50,50002,5000201001,Aragón,Zaragoza,Acered,143,108,...,18895.000000,20525.000000,3494.000000,2873.000000,4611.000000,3968.000000,84.000000,233.000000,"{'PP': 72, 'PSOE': 22, 'Cs': 8, 'UP': 1, 'IU':...","[('PP', 72), ('PSOE', 22), ('Cs', 8), ('UP', 1..."
6496,022016061025000301001,02,50,50003,5000301001,Aragón,Zaragoza,Agón,128,92,...,27578.000000,27753.000000,5804.000000,5694.000000,5604.000000,5250.000000,161.000000,247.000000,"{'PP': 34, 'PSOE': 32, 'Cs': 15, 'UP': 6, 'IU'...","[('PP', 34), ('PSOE', 32), ('Cs', 15), ('UP', ..."
6497,022016061025000401001,02,50,50004,5000401001,Aragón,Zaragoza,Aguarón,522,390,...,25421.000000,23879.000000,7039.000000,6056.000000,3502.000000,3246.000000,208.000000,253.000000,"{'PP': 146, 'PSOE': 148, 'Cs': 43, 'UP': 49, '...","[('PSOE', 148), ('PP', 146), ('UP', 49), ('Cs'..."
6498,022016061025000501001,02,50,50005,5000501001,Aragón,Zaragoza,Aguilón,229,189,...,31410.000000,29687.000000,8651.000000,8019.000000,5616.000000,4816.000000,108.000000,191.000000,"{'PP': 111, 'PSOE': 37, 'Cs': 23, 'UP': 12, 'I...","[('PP', 111), ('PSOE', 37), ('Cs', 23), ('UP',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7366,022016061025029802001,02,50,50298,5029802001,Aragón,Zaragoza,Zuera,619,492,...,31542.000000,31419.000000,9774.000000,8326.000000,3118.000000,3365.000000,213.000000,395.000000,"{'PP': 188, 'PSOE': 109, 'Cs': 86, 'UP': 92, '...","[('PP', 188), ('PSOE', 109), ('UP', 92), ('Cs'..."
7367,022016061025090101001,02,50,50901,5090101001,Aragón,Zaragoza,Biel,138,105,...,25367.000000,26506.000000,13108.000000,9636.000000,7146.000000,7398.000000,145.000000,214.000000,"{'PP': 42, 'PSOE': 33, 'Cs': 11, 'UP': 17, 'IU...","[('PP', 42), ('PSOE', 33), ('UP', 17), ('Cs', ..."
7368,022016061025090201001,02,50,50902,5090201001,Aragón,Zaragoza,Marracos,87,79,...,28322.021999,26938.114416,7855.336603,6845.948425,3217.875711,2985.302533,293.331625,347.217589,"{'PP': 50, 'PSOE': 14, 'Cs': 8, 'UP': 4, 'IU':...","[('PP', 50), ('PSOE', 14), ('Cs', 8), ('UP', 4..."
7369,022016061025090301001,02,50,50903,5090301001,Aragón,Zaragoza,Villamayor de Gállego,1133,846,...,34050.000000,31945.000000,9707.000000,8721.000000,3872.000000,3239.000000,162.000000,287.000000,"{'PP': 276, 'PSOE': 188, 'Cs': 136, 'UP': 211,...","[('PP', 276), ('UP', 211), ('PSOE', 188), ('Cs..."


In [80]:
censo_mod = secciones_mod['Censo_Esc'].sum()

We proceed as before: we sum the results, normalize them and store them in a DataFrame.

In [81]:
censo_mod

712841

In [82]:
secciones_mod = secciones_mod[cols_validas_mod]

In [83]:
modelizacion = pd.DataFrame(secciones_mod.sum(), columns = ['Modelización'])
modelizacion['Modelización'] = modelizacion['Modelización'] / modelizacion['Modelización']['Censo_Esc']
modelizacion = modelizacion.drop(['Censo_Esc']) 

In [84]:
modelizacion

Unnamed: 0,Modelización
Votos_Total,0.721988
Nulos,0.005557
Votos_Válidos,0.716432
Blanco,0.005759
V_Cand,0.710673
PP,0.250529
PSOE,0.175263
Cs,0.120522
UP,0.145055
IU,0.0


In [85]:
modelizacion.shape

(30, 1)
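The normalization is simply a division by the census, and it is undone by multiplying back. Below is a small plain-Python check with two figures taken from this notebook: the 2016 census of 712841 and the real `Votos_Total` of 514663. Note that casting to an integer, as `astype('int32')` does for the estimates, truncates instead of rounding, which can shift a count by one vote.

```python
census = 712841        # censo_mod for 2016
votes_total = 514663   # real Votos_Total for 2016

# Normalize: votes as a share of the census.
share = votes_total / census
print(round(share, 6))

# Undo the normalization starting from the 6-decimal share shown in the table.
displayed_share = 0.721988
back = displayed_share * census

print(int(back))    # truncation, as an integer cast does
print(round(back))  # rounding to the nearest vote
```

The one-vote gap between the two last values comes only from truncation and from the share being displayed with six decimals.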

This time we do not need to select sections within the province of Zaragoza because we already know them: they are the 8 listed above. We do need to retrieve the results they had in 2016.

In [86]:
secciones_select = df_eleccion_comp_J16.loc[df_eleccion_comp_J16['Sección'].isin(list_sec_J16)]

In [87]:
secciones_select = secciones_select[col_validas_select]

In [88]:
secciones_select

Unnamed: 0,Sección,Censo_Esc,Votos_Total,Nulos,Votos_Válidos,Blanco,V_Cand,PP,PSOE,Cs,...,Amaiur,CC,FA,TE,BNG,PRC,GBai,Compromis,PACMA,Otros
6686,022016061025014701001,777,552,0,552,2,550,174,194,43,...,0,0,0,0,0,0,0,0,2,6
6835,022016061025027201001,1566,1074,6,1068,18,1050,357,211,224,...,0,0,0,0,0,0,0,0,14,14
6909,022016061025029702009,810,599,3,596,2,594,263,93,109,...,0,0,0,0,0,0,0,0,5,5
6924,022016061025029702026,900,773,4,769,7,762,583,33,110,...,0,0,0,0,0,0,0,0,2,3
6936,022016061025029702040,1567,1162,8,1154,12,1142,491,239,175,...,0,0,0,0,0,0,0,0,10,21
7019,022016061025029703080,536,469,1,468,4,464,223,73,94,...,0,0,0,0,0,0,0,0,3,5
7219,022016061025029708025,1482,1213,4,1209,7,1202,536,132,378,...,0,0,0,0,0,0,0,0,9,8
7345,022016061025029711011,1129,840,3,837,10,827,207,201,166,...,0,0,0,0,0,0,0,0,14,12


In [89]:
secciones_select_norm = secciones_select.copy()

Now we simply normalize and transpose.

In [90]:
# Divide every count column by the census of its section.
for col in secciones_select_norm.columns:
    if col not in set_cols:
        secciones_select_norm[col] = secciones_select_norm[col] / secciones_select_norm['Censo_Esc']

secciones_select_norm = secciones_select_norm.set_index('Sección')
secciones_select_norm = secciones_select_norm.drop('Censo_Esc', axis = 1)

secciones_select_norm = secciones_select_norm.T

In [91]:
secciones_select_norm

Sección,022016061025014701001,022016061025027201001,022016061025029702009,022016061025029702026,022016061025029702040,022016061025029703080,022016061025029708025,022016061025029711011
Votos_Total,0.710425,0.685824,0.739506,0.858889,0.741544,0.875,0.818489,0.744021
Nulos,0.0,0.003831,0.003704,0.004444,0.005105,0.001866,0.002699,0.002657
Votos_Válidos,0.710425,0.681992,0.735802,0.854444,0.736439,0.873134,0.815789,0.741364
Blanco,0.002574,0.011494,0.002469,0.007778,0.007658,0.007463,0.004723,0.008857
V_Cand,0.707851,0.670498,0.733333,0.846667,0.728781,0.865672,0.811066,0.732507
PP,0.223938,0.227969,0.324691,0.647778,0.313338,0.416045,0.361673,0.183348
PSOE,0.249678,0.134738,0.114815,0.036667,0.152521,0.136194,0.089069,0.178034
Cs,0.055341,0.14304,0.134568,0.122222,0.111678,0.175373,0.255061,0.147033
UP,0.168597,0.137931,0.128395,0.026667,0.126356,0.119403,0.087719,0.194863
IU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
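The normalization loop can also be written as a single vectorized division. This is only a sketch, under the assumption that `set_cols` holds the non-count columns (`Sección` and `Censo_Esc`). It uses three of the 2016 sections and a subset of the columns from the table printed earlier; the names `raw`, `count_cols` and `shares` are ours.

```python
import pandas as pd

# Three of the 2016 sections printed in the notebook, with a subset of columns.
raw = pd.DataFrame({
    'Sección': ['022016061025014701001', '022016061025027201001',
                '022016061025029702009'],
    'Censo_Esc': [777, 1566, 810],
    'Votos_Total': [552, 1074, 599],
    'PP': [174, 357, 263],
    'PSOE': [194, 211, 93],
})

# Assumed count columns: everything except the identifier and the census.
count_cols = [c for c in raw.columns if c not in ('Sección', 'Censo_Esc')]

indexed = raw.set_index('Sección')
# Divide each row by its own census, then transpose so sections become columns.
shares = indexed[count_cols].div(indexed['Censo_Esc'], axis=0).T

print(shares.round(6))
```

The result has one row per magnitude and one column per section, the same layout as `secciones_select_norm`.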


In [92]:
secciones_select_norm.shape

(30, 8)

We are ready to model. As before, we define the matrix X and the vector y.

In [93]:
secciones_select_norm['Modelización'] = modelizacion['Modelización']

In [94]:
X = secciones_select_norm.drop('Modelización', axis = 1).values
y = secciones_select_norm['Modelización'].values

In [95]:
X

array([[0.71042471, 0.68582375, 0.73950617, 0.85888889, 0.74154435,
        0.875     , 0.81848853, 0.74402126],
       [0.        , 0.00383142, 0.0037037 , 0.00444444, 0.0051053 ,
        0.00186567, 0.00269906, 0.00265722],
       [0.71042471, 0.68199234, 0.73580247, 0.85444444, 0.73643906,
        0.87313433, 0.81578947, 0.74136404],
       [0.002574  , 0.01149425, 0.00246914, 0.00777778, 0.00765795,
        0.00746269, 0.00472335, 0.0088574 ],
       [0.70785071, 0.67049808, 0.73333333, 0.84666667, 0.72878111,
        0.86567164, 0.81106613, 0.73250664],
       [0.22393822, 0.22796935, 0.32469136, 0.64777778, 0.31333759,
        0.41604478, 0.36167341, 0.1833481 ],
       [0.24967825, 0.13473819, 0.11481481, 0.03666667, 0.15252074,
        0.13619403, 0.08906883, 0.17803366],
       [0.05534106, 0.14303959, 0.1345679 , 0.12222222, 0.11167837,
        0.17537313, 0.25506073, 0.14703277],
       [0.16859717, 0.13793103, 0.12839506, 0.02666667, 0.12635609,
        0.11940299, 0.087719

... and we score the model fitted in the previous section on these data, without refitting.

The score is very high, about 99.89%.

In [96]:
reg.score(X, y)

0.998892067038494
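Scoring without refitting means the 2019 weights are applied unchanged to the 2016 shares. As a hand check in plain Python, the sketch below takes the coefficients printed by `reg.coef_` in the previous section and the `Votos_Total` row of the 2016 matrix `X`, both rounded to eight decimals, and multiplies the result by the 2016 census. The outcome should land within about a vote of the estimate the notebook reports for that row (523133).

```python
# 2019 coefficients as printed by reg.coef_ (rounded to 8 decimals).
coef_2019 = [-0.01211387, -0.12020185, 0.01638242, -0.00883913,
             1.05953755, -0.21674705, 0.07829277, 0.21540932]

# Votos_Total row of the 2016 X matrix (shares of each section's census).
votos_total_2016 = [0.71042471, 0.68582375, 0.73950617, 0.85888889,
                    0.74154435, 0.875, 0.81848853, 0.74402126]

census_2016 = 712841  # censo_mod for 2016

# With a zero intercept the prediction is just a weighted sum.
share = sum(x * c for x, c in zip(votos_total_2016, coef_2019))
estimated_votes = share * census_2016

print(share, estimated_votes)
```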

Comparing the prediction with the real data, the differences for the main parties range from about 1 to a little over 3 percentage points, which is of the same order as the margin of error of a typical opinion poll. And we achieved that with only 8 sections of the province, selected using data from a different election...

In [97]:
est = reg.predict(X) * censo_mod
df = pd.DataFrame(est, index = secciones_select_norm.index, columns = ['Estimación']).astype('int32')
df1 = pd.DataFrame(secciones_mod.sum(), columns = ['Real']).drop('Censo_Esc')
df['Real'] = df1['Real']

In [98]:
df

Unnamed: 0,Estimación,Real
Votos_Total,523133,514663
Nulos,3813,3961
Votos_Válidos,519320,510702
Blanco,5227,4105
V_Cand,514092,506597
PP,198958,178587
PSOE,113870,124935
Cs,82131,85913
UP,99860,103401
IU,0,0


Below we compare the real and estimated results as percentages of valid votes. In the rows shown, the gap is about 1 pp for Cs and UP, 2.5 pp for PSOE and 3.3 pp for PP. The negative result for Vox is due to the low share of the vote it obtained in 2016.

In [99]:
df['pc Estimación'] = df['Estimación'] / df.loc['Votos_Válidos', 'Estimación'] * 100

In [100]:
df['pc Real'] = df['Real'] / df.loc['Votos_Válidos', 'Real'] * 100


In [101]:
df['dif. Real-Est.'] = df['pc Real'] - df['pc Estimación']

In [102]:
df

Unnamed: 0,Estimación,Real,pc Estimación,pc Real,dif. Real-Est.
Votos_Total,523133,514663,100.734229,100.775599,0.04137
Nulos,3813,3961,0.734229,0.775599,0.04137
Votos_Válidos,519320,510702,100.0,100.0,0.0
Blanco,5227,4105,1.006509,0.803796,-0.202713
V_Cand,514092,506597,98.993299,99.196204,0.202906
PP,198958,178587,38.311253,34.968925,-3.342328
PSOE,113870,124935,21.92675,24.463386,2.536635
Cs,82131,85913,15.815104,16.822531,1.007426
UP,99860,103401,19.228992,20.246837,1.017845
IU,0,0,0.0,0.0,0.0
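As a worked check of the last three columns, here are the PP figures recomputed in plain Python from the counts in the table above. Percentages are taken over `Votos_Válidos`, which is why that row reads 100.

```python
valid_est, valid_real = 519320, 510702   # Votos_Válidos: estimate, real
pp_est, pp_real = 198958, 178587         # PP: estimate, real

pc_est = pp_est / valid_est * 100        # estimated share of valid votes
pc_real = pp_real / valid_real * 100     # real share of valid votes
diff = pc_real - pc_est                  # real minus estimate, in pp

print(round(pc_est, 2), round(pc_real, 2), round(diff, 2))  # 38.31 34.97 -3.34
```

A negative difference means the model overestimates the party, as it does for PP here.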


In [121]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
df_result = pd.concat([df.head(12), df.tail(2)])

In [122]:
df_result

Unnamed: 0,Estimación,Real,pc Estimación,pc Real,dif. Real-Est.
Votos_Total,523133,514663,100.734229,100.775599,0.04137
Nulos,3813,3961,0.734229,0.775599,0.04137
Votos_Válidos,519320,510702,100.0,100.0,0.0
Blanco,5227,4105,1.006509,0.803796,-0.202713
V_Cand,514092,506597,98.993299,99.196204,0.202906
PP,198958,178587,38.311253,34.968925,-3.342328
PSOE,113870,124935,21.92675,24.463386,2.536635
Cs,82131,85913,15.815104,16.822531,1.007426
UP,99860,103401,19.228992,20.246837,1.017845
IU,0,0,0.0,0.0,0.0


In [123]:
df_result.to_csv('result_seleccion_secciones_zgz_2019_aplicado_en_2016.csv')

In [124]:
# To save the file to S3:

import logging

from botocore.exceptions import ClientError

# Replace with real access credentials.
s3_client = boto3.client(
    's3',
    aws_access_key_id='XXXXXXX',
    aws_secret_access_key='xxxxxxxxxxxxxx',
)

def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket.

    :param file_name: file to upload
    :param bucket: bucket to upload to
    :param object_name: S3 object name, including the destination folder. If omitted, file_name is used
    :return: True if the file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file; log the error and report failure if S3 rejects it
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True

In [126]:
upload_file('result_seleccion_secciones_zgz_2019_aplicado_en_2016.csv',
            'electomedia',
            object_name = "resultados_modelos/" + 'result_seleccion_secciones_zgz_2019_aplicado_en_2016.csv'
           )

True