## Predicting community type from energy consumption using *k-Means*

Alberto Ramos Sánchez

08/01/2021

### Contents

* [Data preparation](#Data-preparation)
* [A1: Prediction using electricity consumption](#A1:-Prediction-using-electricity-consumption)
    * [A1.1: Prediction without balancing the data](#A1.1:-Prediction-without-balancing-the-data)
    * [A1.2: Prediction with balanced data](#A1.2:-Prediction-with-balanced-data)
* [A2: Prediction using gas consumption](#A2:-Prediction-using-gas-consumption)
    * [A2.1: Prediction without balancing the data](#A2.1:-Prediction-without-balancing-the-data)
    * [A2.2: Prediction with balanced data](#A2.2:-Prediction-with-balanced-data)

In [1]:
import random

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

from dataloader import DataLoader

# Fix the seeds so that sampling is reproducible
seed_val = 42
np.random.seed(seed_val)
random.seed(seed_val)

In [2]:
df = pd.read_csv('energy-usage-2010-clean.csv')
df

| | COMMUNITY AREA NAME | CENSUS BLOCK | BUILDING TYPE | BUILDING_SUBTYPE | KWH JANUARY 2010 | KWH FEBRUARY 2010 | KWH MARCH 2010 | KWH APRIL 2010 | KWH MAY 2010 | KWH JUNE 2010 | ... | TOTAL POPULATION | TOTAL UNITS | AVERAGE STORIES | AVERAGE BUILDING AGE | AVERAGE HOUSESIZE | OCCUPIED UNITS | OCCUPIED UNITS PERCENTAGE | RENTER-OCCUPIED HOUSING UNITS | RENTER-OCCUPIED HOUSING PERCENTAGE | OCCUPIED HOUSING UNITS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Archer Heights | 1.703157e+14 | Residential | Multi < 7 | | | | | | | ... | 89.0 | 24.0 | 2.00 | 71.33 | 3.87 | 23.0 | 0.9582 | 9.0 | 0.3910 | 23.0 |
| 1 | Ashburn | 1.703170e+14 | Residential | Multi 7+ | 7334.0 | 7741.0 | 4214.0 | 4284.0 | 2518.0 | 4273.0 | ... | 112.0 | 67.0 | 2.00 | 41.00 | 1.81 | 62.0 | 0.9254 | 50.0 | 0.8059 | 62.0 |
| 2 | Auburn Gresham | 1.703171e+14 | Commercial | Multi < 7 | | | | | | | ... | 102.0 | 48.0 | 3.00 | 86.00 | 3.00 | 34.0 | 0.7082 | 23.0 | 0.6759 | 34.0 |
| 3 | Austin | 1.703125e+14 | Commercial | Multi < 7 | | | | | | | ... | 121.0 | 56.0 | 2.00 | 84.00 | 2.95 | 41.0 | 0.7321 | 32.0 | 0.7800 | 41.0 |
| 4 | Austin | 1.703125e+14 | Commercial | Multi < 7 | | | | | | | ... | 62.0 | 23.0 | 2.00 | 85.00 | 3.26 | 19.0 | 0.8261 | 11.0 | 0.5790 | 19.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67046 | Woodlawn | 1.703184e+14 | Residential | Single Family | 2705.0 | 1318.0 | 1582.0 | 1465.0 | 1494.0 | 2990.0 | ... | 116.0 | 55.0 | 1.00 | 0.00 | 3.14 | 37.0 | 0.6727 | 26.0 | 0.7030 | 37.0 |
| 67047 | Woodlawn | 1.703184e+14 | Commercial | Multi < 7 | 1005.0 | 1760.0 | 1521.0 | 1832.0 | 2272.0 | 2361.0 | ... | 31.0 | 24.0 | 3.00 | 104.50 | 2.07 | 15.0 | 0.6250 | 13.0 | 0.8670 | 15.0 |
| 67048 | Woodlawn | 1.703184e+14 | Residential | Multi < 7 | 3567.0 | 3031.0 | 2582.0 | 2295.0 | 7902.0 | 4987.0 | ... | 31.0 | 24.0 | 2.33 | 100.67 | 2.07 | 15.0 | 0.6250 | 13.0 | 0.8670 | 15.0 |
| 67049 | Woodlawn | 1.703184e+14 | Residential | Single Family | 1208.0 | 1055.0 | 1008.0 | 1109.0 | 1591.0 | 1367.0 | ... | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0.0 | | 0.0 | | 0.0 |


In [3]:
dl = DataLoader(df)

In [4]:
dataset_energy = dl[dl.energy_cols + ['BUILDING TYPE']]
dataset_gas = dl[dl.gas_cols + ['BUILDING TYPE']]

There are three classes: residential, commercial and industrial.

In [5]:
dataset_energy['BUILDING TYPE'].unique().tolist()

['Residential', 'Commercial', 'Industrial']

In [6]:
dataset_gas['BUILDING TYPE'].unique().tolist()

['Residential', 'Commercial', 'Industrial']

In [7]:
dataset_energy.groupby(['BUILDING TYPE']).count()['KWH JANUARY 2010']

BUILDING TYPE
Commercial     16630
Industrial        26
Residential    49447
Name: KWH JANUARY 2010, dtype: int64

In [8]:
dataset_gas.groupby(['BUILDING TYPE']).count()['THERM JANUARY 2010']

BUILDING TYPE
Commercial     14505
Industrial        31
Residential    46200
Name: THERM JANUARY 2010, dtype: int64
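Because the classes are heavily imbalanced, it helps to know what accuracy a trivial rule would reach. The short sketch below is not part of the original notebook; it computes the share of the largest class from the counts printed above, and the helper name `majority_baseline` is made up for illustration. The counts cover the whole dataset rather than a train or test split, so the figures are only rough reference points.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


# Counts copied from the two groupby outputs above
energy_counts = {"Commercial": 16630, "Industrial": 26, "Residential": 49447}
gas_counts = {"Commercial": 14505, "Industrial": 31, "Residential": 46200}

energy_baseline = majority_baseline(energy_counts)  # roughly 0.748
gas_baseline = majority_baseline(gas_counts)        # roughly 0.761
print(f"electricity: {energy_baseline:.3f}, gas: {gas_baseline:.3f}")
```

Any score obtained on the unbalanced data is easier to judge against these two values.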

### Data preparation

In [9]:
dataset_energy_X = dataset_energy[dl.energy_cols]
dataset_energy_y = dataset_energy['BUILDING TYPE']

In [10]:
dataset_gas_X = dataset_gas[dl.gas_cols]
dataset_gas_y = dataset_gas['BUILDING TYPE']

We split the data into training (90%) and test (10%) sets.

In [11]:
dataset_energy_X_train = dataset_energy_X.sample(int(len(dataset_energy_X)*0.9))
dataset_energy_X_test = dataset_energy_X.drop(dataset_energy_X_train.index)

dataset_energy_y_train = dataset_energy_y[dataset_energy_X_train.index]
dataset_energy_y_test = dataset_energy_y.drop(dataset_energy_y_train.index)

In [12]:
dataset_gas_X_train = dataset_gas_X.sample(int(len(dataset_gas_X)*0.9))
dataset_gas_X_test = dataset_gas_X.drop(dataset_gas_X_train.index)

dataset_gas_y_train = dataset_gas_y[dataset_gas_X_train.index]
dataset_gas_y_test = dataset_gas_y.drop(dataset_gas_y_train.index)
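The split above draws rows purely at random, so a class as small as Industrial may end up with only a handful of rows in the test set. A stratified split, which samples the same fraction from every class, is one possible alternative. The sketch below is an assumption about how that could look using only the standard library; `stratified_split` and the toy label list are hypothetical and do not come from the notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.9, seed=42):
    """Return (train_positions, test_positions), split class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pos, label in enumerate(labels):
        by_class[label].append(pos)

    train, test = [], []
    for positions in by_class.values():
        rng.shuffle(positions)
        cut = int(len(positions) * train_frac)
        train.extend(positions[:cut])
        test.extend(positions[cut:])
    return sorted(train), sorted(test)


# Toy labels (invented) with an imbalance similar in spirit to the dataset
labels = ["Residential"] * 100 + ["Commercial"] * 50 + ["Industrial"] * 10
train_pos, test_pos = stratified_split(labels)
print(len(train_pos), len(test_pos))
```

The returned positions could then be passed to `DataFrame.iloc` to build the actual train and test frames.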

### A1: Prediction using electricity consumption

#### A1.1: Prediction without balancing the data

In [13]:
# Standardise each monthly column to zero mean and unit variance
scaler = preprocessing.StandardScaler()
data = dataset_energy_X_train
norm_data = scaler.fit(data).transform(data)

# One cluster per building type
kmeans = KMeans(n_clusters = 3, random_state = 42)
kmeans.fit(norm_data)

KMeans(n_clusters=3, random_state=42)
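For readers unfamiliar with what `KMeans.fit` does, the following is a simplified NumPy sketch of the classic Lloyd iteration: assign each point to its nearest centre, then move each centre to the mean of its points. It is an illustration only. scikit-learn's implementation adds smarter initialisation (k-means++ by default) and other refinements, and the function name `lloyd_kmeans` and the toy points are invented here.

```python
import numpy as np


def lloyd_kmeans(X, init_centers, n_iter=100):
    """Plain Lloyd iteration starting from the given initial centres."""
    centers = np.asarray(init_centers, dtype=float)
    k = len(centers)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centre
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its points (keep it if it is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Three well-separated toy groups (invented data)
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [10.0, 10.0], [10.0, 11.0],
              [20.0, 0.0], [20.0, 1.0]])
labels, centers = lloyd_kmeans(X, init_centers=X[[0, 2, 4]])
print(labels)
print(centers)
```

In this sketch the cluster ids simply follow the order of the initial centres, which is why the ids carry no class meaning by themselves.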

In [14]:
cluster = kmeans.predict(norm_data)

Train score:

In [15]:
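# Encode the class names as integers; k-means numbers its clusters
# arbitrarily, so this fixed encoding need not match the cluster ids
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_energy_y_train.map(d), cluster)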

0.7468062932831305
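k-means assigns the cluster ids 0, 1 and 2 in no particular order, so comparing them with a fixed dictionary such as `d` only measures agreement if the ids happen to line up with the classes. One common workaround, sketched here as an assumption rather than as the notebook's method, is to label each cluster with the class that is most frequent inside it and to score against that labelling. `majority_mapping` and `aligned_accuracy` are hypothetical helper names, and the toy data are invented.

```python
from collections import Counter


def majority_mapping(cluster_ids, true_labels):
    """Map every cluster id to the most frequent true label inside it."""
    votes = {}
    for c, y in zip(cluster_ids, true_labels):
        votes.setdefault(c, Counter())[y] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def aligned_accuracy(cluster_ids, true_labels, mapping):
    """Accuracy after translating cluster ids into class labels."""
    hits = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Toy example: the clusters are sensible but their ids are permuted
clusters = [2, 2, 2, 0, 0, 1]
truth = ["Residential", "Residential", "Commercial",
         "Commercial", "Commercial", "Industrial"]

mapping = majority_mapping(clusters, truth)
aligned = aligned_accuracy(clusters, truth, mapping)

# Fixed encoding as used in the notebook cells
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
naive = sum(d[y] == c for c, y in zip(clusters, truth)) / len(truth)
print(mapping, aligned, naive)
```

On real data the mapping would be learned on the training split and reused unchanged on the test split.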

Test score:

In [16]:
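# Note: a new scaler is fitted on the test split here, so the test data are
# not scaled with the statistics the k-means model was trained on
scaler = preprocessing.StandardScaler()
data = dataset_energy_X_test
norm_data = scaler.fit(data).transform(data)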

cluster = kmeans.predict(norm_data)

In [17]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_energy_y_test.map(d), cluster)

0.7597942822568446
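The test cell above fits a fresh `StandardScaler` on the test split. The usual practice is to reuse the statistics learned on the training split, since the cluster centres live in that scaled space. The NumPy sketch below uses made-up numbers to show how far the two choices can diverge; it mimics `StandardScaler` by using the population standard deviation.

```python
import numpy as np

# Toy data (invented): the test rows are much larger than the training rows
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[10.0, 100.0], [20.0, 200.0]])

# Statistics learned on the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the same convention StandardScaler uses

# Option 1: reuse the training statistics
test_train_stats = (test - mean) / std

# Option 2: re-fit on the test split, as in the cell above
test_refit = (test - test.mean(axis=0)) / test.std(axis=0)

print(test_train_stats)
print(test_refit)
```

With scikit-learn, the first option corresponds to calling `transform` on the scaler that was fitted in the training cell instead of creating a new one.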

#### A1.2: Prediction with balanced data

The data are balanced by downsampling every class to the size of the smallest one.

In [18]:
# Downsample every class to the size of the smallest one
g = dataset_energy.groupby("BUILDING TYPE")
dataset_energy_bal = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

# Rebuild the label column from the first level of the resulting MultiIndex
dataset_energy_bal["BUILDING TYPE"] = dataset_energy_bal.index
dataset_energy_bal["BUILDING TYPE"] = dataset_energy_bal["BUILDING TYPE"].apply(lambda x: x[0])

dataset_energy_bal

| BUILDING TYPE | | KWH JANUARY 2010 | KWH FEBRUARY 2010 | KWH MARCH 2010 | KWH APRIL 2010 | KWH MAY 2010 | KWH JUNE 2010 | KWH JULY 2010 | KWH AUGUST 2010 | KWH SEPTEMBER 2010 | KWH OCTOBER 2010 | KWH NOVEMBER 2010 | KWH DECEMBER 2010 | BUILDING TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Commercial | 0 | 8076.0 | 7988.0 | 8598.0 | 9562.0 | 9821.0 | 10288.0 | 11636.0 | 11290.0 | 10466.0 | 10062.0 | 10598.0 | 11120.0 | Commercial |
| Commercial | 1 | 11412.0 | 10808.0 | 11285.0 | 12010.0 | 12872.0 | 14416.0 | 16091.0 | 13222.0 | 11874.0 | 11062.0 | 14382.0 | 14304.0 | Commercial |
| Commercial | 2 | 0.0 | 719.0 | 274.0 | 167.0 | 138.0 | 170.0 | 189.0 | 228.0 | 277.0 | 383.0 | 2139.0 | 1837.0 | Commercial |
| Commercial | 3 | 14581.0 | 13973.0 | 14613.0 | 11522.0 | 17511.0 | 19829.0 | 22480.0 | 19950.0 | 16179.0 | 11221.0 | 14346.0 | 15774.0 | Commercial |
| Commercial | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 446.0 | 586.0 | 733.0 | 1011.0 | 503.0 | 338.0 | 1916.0 | 1760.0 | Commercial |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Residential | 21 | 3345.0 | 2839.0 | 2537.0 | 3436.0 | 4591.0 | 5601.0 | 5700.0 | 4033.0 | 2277.0 | 1965.0 | 4868.0 | 4094.0 | Residential |
| Residential | 22 | 2054.0 | 1197.0 | 1005.0 | 1744.0 | 2939.0 | 3136.0 | 3200.0 | 2205.0 | 1117.0 | 949.0 | 1521.0 | 2280.0 | Residential |
| Residential | 23 | 4187.0 | 3380.0 | 4229.0 | 3413.0 | 4593.0 | 9306.0 | 9454.0 | 5306.0 | 3826.0 | 3194.0 | 5763.0 | 7348.0 | Residential |
| Residential | 24 | 9582.0 | 7683.0 | 6774.0 | 7880.0 | 11154.0 | 16201.0 | 17010.0 | 12129.0 | 8166.0 | 11656.0 | 14434.0 | 14911.0 | Residential |


In [19]:
dataset_energy_bal_X = dataset_energy_bal[dl.energy_cols]
dataset_energy_bal_y = dataset_energy_bal['BUILDING TYPE']

In [20]:
dataset_energy_bal_X_train = dataset_energy_bal_X.sample(int(len(dataset_energy_bal_X)*0.9))
dataset_energy_bal_X_test = dataset_energy_bal_X.drop(dataset_energy_bal_X_train.index)

dataset_energy_bal_y_train = dataset_energy_bal_y[dataset_energy_bal_X_train.index]
dataset_energy_bal_y_test = dataset_energy_bal_y.drop(dataset_energy_bal_y_train.index)

Prediction

In [21]:
scaler = preprocessing.StandardScaler()
data = dataset_energy_bal_X_train
norm_data = scaler.fit(data).transform(data)

kmeans = KMeans(n_clusters = 3, random_state = 42)
kmeans.fit(norm_data)

KMeans(n_clusters=3, random_state=42)

In [22]:
cluster = kmeans.predict(norm_data)

Train score:

In [23]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_energy_bal_y_train.map(d), cluster)

0.37142857142857144

Test score:

In [24]:
# Scaler re-fitted on the test split (not the training statistics)
scaler = preprocessing.StandardScaler()
data = dataset_energy_bal_X_test
norm_data = scaler.fit(data).transform(data)

cluster = kmeans.predict(norm_data)

In [25]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_energy_bal_y_test.map(d), cluster)

0.5

### A2: Prediction using gas consumption

#### A2.1: Prediction without balancing the data

In [26]:
scaler = preprocessing.StandardScaler()
data = dataset_gas_X_train
norm_data = scaler.fit(data).transform(data)

kmeans = KMeans(n_clusters = 3, random_state = 42)
kmeans.fit(norm_data)

KMeans(n_clusters=3, random_state=42)

In [27]:
cluster = kmeans.predict(norm_data)

Train score:

In [28]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_gas_y_train.map(d), cluster)

0.7603819838278878

Test score:

In [29]:
# Scaler re-fitted on the test split (not the training statistics)
scaler = preprocessing.StandardScaler()
data = dataset_gas_X_test
norm_data = scaler.fit(data).transform(data)

cluster = kmeans.predict(norm_data)

In [30]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_gas_y_test.map(d), cluster)

0.7627593019427066

#### A2.2: Prediction with balanced data

The data are balanced by downsampling every class to the size of the smallest one.

In [31]:
# Downsample every class to the size of the smallest one
g = dataset_gas.groupby("BUILDING TYPE")
dataset_gas_bal = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

# Rebuild the label column from the first level of the resulting MultiIndex
dataset_gas_bal["BUILDING TYPE"] = dataset_gas_bal.index
dataset_gas_bal["BUILDING TYPE"] = dataset_gas_bal["BUILDING TYPE"].apply(lambda x: x[0])

dataset_gas_bal

| BUILDING TYPE | | THERM JANUARY 2010 | THERM FEBRUARY 2010 | THERM MARCH 2010 | THERM APRIL 2010 | THERM MAY 2010 | THERM JUNE 2010 | THERM JULY 2010 | THERM AUGUST 2010 | THERM SEPTEMBER 2010 | THERM OCTOBER 2010 | THERM NOVEMBER 2010 | THERM DECEMBER 2010 | BUILDING TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Commercial | 0 | 2038.0 | 1694.0 | 1699.0 | 1319.0 | 469.0 | 368.0 | 256.0 | 207.0 | 209.0 | 217.0 | 605.0 | 1199.0 | Commercial |
| Commercial | 1 | 4909.0 | 4045.0 | 2223.0 | 698.0 | 291.0 | 261.0 | 260.0 | 90.0 | 79.0 | 162.0 | 1365.0 | 4711.0 | Commercial |
| Commercial | 2 | 502.0 | 462.0 | 487.0 | 721.0 | 755.0 | 618.0 | 862.0 | 610.0 | 746.0 | 604.0 | 547.0 | 636.0 | Commercial |
| Commercial | 3 | 1686.0 | 1397.0 | 1519.0 | 2158.0 | 590.0 | 1240.0 | 602.0 | 1233.0 | 536.0 | 1407.0 | 1291.0 | 2108.0 | Commercial |
| Commercial | 4 | 3405.0 | 4301.0 | 3137.0 | 888.0 | 425.0 | 94.0 | 39.0 | 28.0 | 26.0 | 74.0 | 114.0 | 2017.0 | Commercial |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Residential | 26 | 3500.0 | 3353.0 | 2445.0 | 1249.0 | 773.0 | 368.0 | 273.0 | 269.0 | 257.0 | 335.0 | 1014.0 | 2637.0 | Residential |
| Residential | 27 | 3830.0 | 3445.0 | 2646.0 | 1422.0 | 944.0 | 539.0 | 356.0 | 368.0 | 355.0 | 447.0 | 1111.0 | 2884.0 | Residential |
| Residential | 28 | 6662.0 | 5718.0 | 4056.0 | 2037.0 | 1070.0 | 569.0 | 482.0 | 449.0 | 606.0 | 1099.0 | 2390.0 | 7094.0 | Residential |
| Residential | 29 | 5927.0 | 5112.0 | 4867.0 | 2359.0 | 1544.0 | 1126.0 | 857.0 | 890.0 | 807.0 | 972.0 | 1352.0 | 2864.0 | Residential |


In [32]:
dataset_gas_bal_X = dataset_gas_bal[dl.gas_cols]
dataset_gas_bal_y = dataset_gas_bal['BUILDING TYPE']

In [33]:
dataset_gas_bal_X_train = dataset_gas_bal_X.sample(int(len(dataset_gas_bal_X)*0.9))
dataset_gas_bal_X_test = dataset_gas_bal_X.drop(dataset_gas_bal_X_train.index)

dataset_gas_bal_y_train = dataset_gas_bal_y[dataset_gas_bal_X_train.index]
dataset_gas_bal_y_test = dataset_gas_bal_y.drop(dataset_gas_bal_y_train.index)

Prediction

In [34]:
scaler = preprocessing.StandardScaler()
data = dataset_gas_bal_X_train
norm_data = scaler.fit(data).transform(data)

kmeans = KMeans(n_clusters = 3, random_state = 42)
kmeans.fit(norm_data)

KMeans(n_clusters=3, random_state=42)

In [35]:
cluster = kmeans.predict(norm_data)

Train score:

In [36]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_gas_bal_y_train.map(d), cluster)

0.3614457831325301

Test score:

In [37]:
# Scaler re-fitted on the test split (not the training statistics)
scaler = preprocessing.StandardScaler()
data = dataset_gas_bal_X_test
norm_data = scaler.fit(data).transform(data)

cluster = kmeans.predict(norm_data)

In [38]:
d = {"Residential": 0, "Commercial": 1, "Industrial": 2}
accuracy_score(dataset_gas_bal_y_test.map(d), cluster)

0.1