# Supervised learning



Supervised learning seeks to predict the values of dependent variables (targets) from independent variables (features), using examples for which the target is already known. It rests on the assumption that relationships exist among the variables that allow one to be explained by the others.

Supervised learning is often discussed alongside the following related paradigms and extensions:
- Semi-supervised learning
- Active learning
- Reinforcement learning
- Deep learning

## Packages

In [44]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler

import matplotlib.pyplot as plt

## Linear regression

A regression fits a curve to the data by minimizing the error. The simplest case is linear regression, which predicts values of $Y$ from given values of $X$ through the linear equation $Y = b_0 + X \cdot b_1$, where $b_0$ is the constant term, or intercept, and $b_1$ is the slope.
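
As a minimal sketch of how $b_0$ and $b_1$ can be estimated, the block below applies the closed-form least-squares formulas for a single predictor. The data are synthetic and hypothetical (generated from $Y = 2 + 3X$), not the air-quality measurements used in this notebook.

```python
import numpy as np

# Synthetic data (hypothetical, not the air-quality dataset): y = 2 + 3x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Closed-form least-squares estimates for simple linear regression.
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0 = y_mean - b1 * x_mean  # intercept

print(f'y = {b0} + X ({b1})')
```

Because the synthetic points lie exactly on a line, the estimates recover the intercept and slope used to generate them. With noisy data the same formulas give the line that minimizes the sum of squared residuals.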

The worked example in this notebook predicts $\text{PM}_{10}$ from relative humidity (RH).

In [51]:
df_sel = pd.read_csv('../data/results/df_sel.csv')
df_sel.timestamp = pd.to_datetime(df_sel.timestamp)
df_sel.timestamp = df_sel.timestamp.values.astype(np.int64) / 10 ** 9
df_sel

Unnamed: 0,timestamp,lat,lon,h,variable,value
0,1.483229e+09,25.670,-100.338,560,PM10,143.0
1,1.483232e+09,25.670,-100.338,560,PM10,183.0
2,1.483236e+09,25.670,-100.338,560,PM10,142.0
3,1.483240e+09,25.670,-100.338,560,PM10,101.0
4,1.483243e+09,25.670,-100.338,560,PM10,85.0
...,...,...,...,...,...,...
5124400,1.577819e+09,25.665,-100.413,636,WD,82.0
5124401,1.577822e+09,25.665,-100.413,636,WD,87.0
5124402,1.577826e+09,25.665,-100.413,636,WD,98.0
5124403,1.577830e+09,25.665,-100.413,636,WD,104.0


In [52]:
df_sel.dtypes

timestamp    float64
lat          float64
lon          float64
h              int64
variable      object
value        float64
dtype: object

In [None]:
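scaler = MinMaxScaler()
# Numeric columns to scale; 'timestamp' is assumed to be the column missing from the original list
cols = ['timestamp', 'lat', 'lon', 'h', 'value']
# https://stackoverflow.com/a/43383700
scaled = scaler.fit_transform(df_sel[cols])
# https://datatofish.com/numpy-array-to-pandas-dataframe/
# Reuse the scaled column names; df_sel.columns also contains the non-numeric 'variable'
df_scaled = pd.DataFrame(scaled, columns = cols)
df_scaled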
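
To make the scaling step concrete, the sketch below compares `MinMaxScaler` with the formula $(x - \min) / (\max - \min)$ applied by hand. The three values are a hypothetical toy column, not rows of `df_sel`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column, not taken from df_sel.
values = np.array([[560.0], [636.0], [598.0]])

# Manual min-max scaling, computed per column: (x - min) / (max - min).
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# MinMaxScaler with its default feature_range=(0, 1) applies the same formula.
scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

Each column is rescaled independently, so after the transformation every column spans the interval from 0 to 1 regardless of its original units.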

In [47]:
df_pollutants_coords = pd.read_csv('../data/results/df_pollutants_coords.csv')
df_pollutants_coords.timestamp = pd.to_datetime(df_pollutants_coords.timestamp)
df_pollutants_coords.dropna(inplace = True)
df_pollutants_coords

Unnamed: 0,station,abbr,lat,lon,h,timestamp,CO,NO,NO2,NOX,...,PM10,PM2_5,BP,RF,RH,SO2,SR,T,WV,WD
1420,Centro,C,25.670,-100.338,560,2017-03-01 04:00:00,2.48,3.7,13.6,17.3,...,98.0,34.0,709.4,0.0,56.0,4.0,0.003,22.43,1.7,256.0
1421,Centro,C,25.670,-100.338,560,2017-03-01 05:00:00,2.40,2.5,10.9,13.4,...,87.0,33.0,709.5,0.0,61.0,3.8,0.003,21.78,1.5,242.0
1422,Centro,C,25.670,-100.338,560,2017-03-01 06:00:00,2.41,3.1,11.8,14.8,...,81.0,77.0,709.9,0.0,60.0,3.9,0.003,21.59,1.9,235.0
1426,Centro,C,25.670,-100.338,560,2017-03-01 10:00:00,2.57,5.8,8.5,14.3,...,158.0,23.0,712.3,0.0,35.0,3.8,0.308,25.93,12.2,34.0
1427,Centro,C,25.670,-100.338,560,2017-03-01 11:00:00,2.39,4.0,5.1,9.1,...,268.0,167.0,713.3,0.0,29.0,3.6,0.626,26.21,13.7,25.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
341276,Suroeste2,SO2,25.665,-100.413,636,2019-12-17 09:00:00,2.01,1.1,1.0,2.1,...,19.0,4.0,717.5,0.0,44.0,3.0,0.033,9.17,3.8,88.0
341279,Suroeste2,SO2,25.665,-100.413,636,2019-12-17 12:00:00,2.05,1.2,1.0,2.2,...,37.0,7.0,717.6,0.0,36.0,2.7,0.468,11.31,2.6,77.0
341302,Suroeste2,SO2,25.665,-100.413,636,2019-12-18 11:00:00,2.06,1.0,1.1,2.1,...,32.0,6.0,718.9,0.0,27.0,5.1,0.443,8.63,0.7,69.0
341303,Suroeste2,SO2,25.665,-100.413,636,2019-12-18 12:00:00,2.05,1.0,1.3,2.3,...,37.0,4.0,718.6,0.0,25.0,9.0,0.532,9.65,0.8,70.0


In [28]:
df_train = df_pollutants_coords.sample(frac = 0.7)
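
The cell above draws a random 70% of the rows without fixing a seed, so the training sample changes between runs. One possible alternative, sketched here under the assumption that a reproducible split is wanted, is scikit-learn's `train_test_split` with a fixed `random_state`. The small frame `df` is a hypothetical stand-in for `df_pollutants_coords`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for df_pollutants_coords.
df = pd.DataFrame({
    'RH': [56.0, 61.0, 60.0, 35.0, 29.0, 44.0, 36.0, 27.0, 25.0, 50.0],
    'PM10': [98.0, 87.0, 81.0, 158.0, 268.0, 19.0, 37.0, 32.0, 37.0, 60.0],
})

# Reproducible split: about 70% of rows for training, the rest held out for testing.
df_train, df_test = train_test_split(df, train_size=0.7, random_state=0)

print(len(df_train), len(df_test))
```

The two resulting frames share no rows, which keeps the held-out data unseen during fitting.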

In [31]:
x_train = df_train[['RH']]# [['BP', 'RF', 'RH', 'SR', 'T', 'WV', 'WD']]
y_train = df_train[['PM10']]

In [32]:
from sklearn import linear_model

In [33]:
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)

In [38]:
print(f'y = {reg.intercept_[0]} + X ({reg.coef_[0][0]})')

y = 83.43208140223541 + X (-0.3401237504230793)


In [39]:
reg.score(x_train, y_train)

0.028867355377801385
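
The value returned by `reg.score` is the coefficient of determination $R^2$, so a score near 0.029 indicates that RH alone explains roughly 3% of the variance of $\text{PM}_{10}$ in the training sample. The sketch below, on hypothetical noisy data rather than the air-quality dataset, shows how the same quantity follows from the definition $R^2 = 1 - SS_{res} / SS_{tot}$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data, not the air-quality dataset.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(200, 1))
y = 80.0 - 0.3 * x[:, 0] + rng.normal(0, 40, size=200)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x, y))
```

Both printed numbers should agree, since `score` computes the same ratio internally for regressors.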

In [None]:
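df_test = df_pollutants_coords.drop(df_train.index)  # rows not sampled for training
reg.predict(df_test[['RH']])  # predict PM10 from RH on the held-out rows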

### Least squares

## Logistic regression

## Decision trees

## $K$-nearest neighbors

## Random forests

## Naive Bayes

## Sources

- https://scikit-learn.org/stable/supervised_learning.html
- https://www.toptal.com/machine-learning/supervised-machine-learning-algorithms
- https://www.datacamp.com/blog/supervised-machine-learning