# SVR - Example - Student Performance

**Context**  
This dataset records student performance as a function of several factors.

**Content**  
The dataset comes from Kaggle: [Student Performance](https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression).  
It contains 10,000 rows with the following columns:

| Variable                         | Definition                                       | Value           |
| -------------------------------- | ------------------------------------------------ | --------------- |
| Hours Studied                    | Total number of hours spent studying             | Integer         |
| Previous Scores                  | Scores obtained on previous tests                | Integer         |
| Extracurricular Activities       | Participation in extracurricular activities      | Yes / No        |
| Sleep Hours                      | Hours of sleep per day                           | Integer         |
| Sample Question Papers Practiced | Number of sample question papers practiced       | Integer         |
| Performance Index                | Student performance **(target variable)**        | Between 10 and 100 |

**Problem statement**  
The goal is to identify which factors have the greatest influence on student performance.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn import metrics

## Load Data

In [2]:
# Load the data
df = pd.read_csv('Student_Performance.csv')
df.head()

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0


In [3]:
# Rename columns
df.columns = ['horas_estudio', 'calificacion_anterior', 'actividades_extra', 'horas_sueño', 'preguntas_practica', 'desempeño']
df.head()

Unnamed: 0,horas_estudio,calificacion_anterior,actividades_extra,horas_sueño,preguntas_practica,desempeño
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   horas_estudio          10000 non-null  int64  
 1   calificacion_anterior  10000 non-null  int64  
 2   actividades_extra      10000 non-null  object 
 3   horas_sueño            10000 non-null  int64  
 4   preguntas_practica     10000 non-null  int64  
 5   desempeño              10000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB


## One-hot encoding

In [5]:
print(df['actividades_extra'].unique())

['Yes' 'No']


In [6]:
df['actividades_extra'].value_counts()

No     5052
Yes    4948
Name: actividades_extra, dtype: int64

In [7]:
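# One-hot encode the categorical column
# (dtype=int keeps 0/1 columns; pandas >= 2.0 returns booleans by default)
df_adj = pd.get_dummies(df, columns=['actividades_extra'], dtype=int)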
df_adj.head()

Unnamed: 0,horas_estudio,calificacion_anterior,horas_sueño,preguntas_practica,desempeño,actividades_extra_No,actividades_extra_Yes
0,7,99,9,1,91.0,0,1
1,4,82,4,2,65.0,1,0
2,8,51,7,2,45.0,0,1
3,5,52,5,2,36.0,0,1
4,7,75,8,5,66.0,1,0


In [8]:
# Drop the redundant dummy column and restore the original column name
df_adj = df_adj[['horas_estudio', 'calificacion_anterior', 'actividades_extra_Yes', 'horas_sueño', 'preguntas_practica', 'desempeño']]
df_adj.columns = ['horas_estudio', 'calificacion_anterior', 'actividades_extra', 'horas_sueño', 'preguntas_practica', 'desempeño']
df_adj.head()

Unnamed: 0,horas_estudio,calificacion_anterior,actividades_extra,horas_sueño,preguntas_practica,desempeño
0,7,99,1,9,1,91.0
1,4,82,0,4,2,65.0
2,8,51,1,7,2,45.0
3,5,52,1,5,2,36.0
4,7,75,0,8,5,66.0
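
Because `actividades_extra` has only two levels, the same 0/1 column can also be produced in a single step. The following is a minimal sketch of two alternatives, assuming a small hypothetical frame `demo` in place of the full dataset; it is not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
demo = pd.DataFrame({
    'horas_estudio': [7, 4, 8],
    'actividades_extra': ['Yes', 'No', 'Yes'],
})

# Option 1: map the two labels directly to 1/0
mapped = demo['actividades_extra'].map({'Yes': 1, 'No': 0})

# Option 2: one-hot encode and drop the first (alphabetical) level, 'No'
dummies = pd.get_dummies(demo, columns=['actividades_extra'],
                         drop_first=True, dtype=int)

print(mapped.tolist())
print(dummies.columns.tolist())
```

Either option avoids the manual column selection and renaming, at the cost of being slightly less explicit about which level is kept.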


## Standardization

In [9]:
# Independent variables
X = df_adj[['horas_estudio', 'calificacion_anterior', 'actividades_extra', 'horas_sueño', 'preguntas_practica']]
X.head()

Unnamed: 0,horas_estudio,calificacion_anterior,actividades_extra,horas_sueño,preguntas_practica
0,7,99,1,9,1
1,4,82,0,4,2
2,8,51,1,7,2
3,5,52,1,5,2
4,7,75,0,8,5


In [10]:
# Standardize to zero mean and unit variance
scaler_X = StandardScaler()
X_adj = scaler_X.fit_transform(X)
print(X_adj)

[[ 0.77518771  1.70417565  1.01045465  1.45620461 -1.24975394]
 [-0.38348058  0.72391268 -0.98965352 -1.49229423 -0.90098215]
 [ 1.16141048 -1.06362569  1.01045465  0.27680507 -0.90098215]
 ...
 [ 0.38896495  0.7815752   1.01045465  0.86650484  0.1453332 ]
 [ 1.54763324  1.5888506   1.01045465  0.27680507 -1.59852572]
 [ 0.77518771  0.26261245 -0.98965352  0.86650484 -1.24975394]]
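
To make the transformation concrete, `StandardScaler` replaces each value with `z = (x - mean) / std`, computed per column. The sketch below checks this by hand on a hypothetical toy matrix `X_demo` (an assumption for illustration, not the notebook's `X`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix (two columns on very different scales)
X_demo = np.array([[7.0, 99.0], [4.0, 82.0], [8.0, 51.0], [5.0, 52.0]])

scaled = StandardScaler().fit_transform(X_demo)

# Same computation by hand, using the population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Putting the features on a common scale matters for SVR because the kernels depend on distances and dot products between rows, which would otherwise be dominated by the column with the largest range.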


In [11]:
# Dependent variable
y = df_adj[['desempeño']]
y.head()

Unnamed: 0,desempeño
0,91.0
1,65.0
2,45.0
3,36.0
4,66.0


In [12]:
# Standardize the target
scaler_y = StandardScaler()
y_adj = scaler_y.fit_transform(y)
y_adj = y_adj.ravel()
print(y_adj)

[ 1.86216688  0.50881766 -0.5322202  ...  0.9772847   2.07037446
  0.45676577]


In [13]:
print('X:', len(X_adj), 'y:', len(y_adj))

X: 10000 y: 10000


## Modeling

In [14]:
# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_adj, y_adj, test_size=0.3, random_state=0)

In [15]:
print('X_train:', len(X_train), 'y_train:', len(y_train))
print('X_test:',  len(X_test),  'y_test:',  len(y_test))

X_train: 7000 y_train: 7000
X_test: 3000 y_test: 3000
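
In this notebook the scalers are fitted on all 10,000 rows before the split, so the test rows contribute to the scaling statistics. One possible refinement is to split first and let a pipeline fit the scaler on the training rows only. The sketch below illustrates the idea on synthetic stand-in data; `X_demo`, `y_demo`, and the weights used to generate them are assumptions, not values from the student dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the student dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 5.0, 0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=200)

# Split first so the test rows never influence the scaling statistics
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training rows only, then fits the SVR
pipe = make_pipeline(StandardScaler(), SVR(kernel='linear'))
pipe.fit(Xtr, ytr)

print('R2 on held-out rows:', pipe.score(Xte, yte))
```

With a dataset this large the difference is likely small, but the pipeline keeps the evaluation strictly out-of-sample.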


### Linear Kernel

In [16]:
model1 = SVR(kernel='linear')
model1.fit(X_train,y_train)

In [17]:
# Predictions
pred1 = model1.predict(X_test)
pred1

array([-0.24794251, -0.11037558,  1.20103093, ...,  0.34935119,
       -0.83682275,  1.79083362])
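
The problem statement asks which factors matter most. With a linear kernel and standardized features, one way to approach this is to compare the absolute values of the fitted weights, which scikit-learn exposes as `coef_` for linear-kernel SVR only. The sketch below uses synthetic data whose true weights are an assumption chosen for illustration; in this notebook the equivalent attribute would be `model1.coef_`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the generating weights are an assumption
rng = np.random.default_rng(0)
names = ['horas_estudio', 'calificacion_anterior', 'actividades_extra',
         'horas_sueño', 'preguntas_practica']
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([0.4, 0.9, 0.02, 0.03, 0.03]) + rng.normal(scale=0.05, size=300)

X_std = StandardScaler().fit_transform(X_demo)
lin = SVR(kernel='linear').fit(X_std, y_demo)

# coef_ exists only for the linear kernel and has shape (1, n_features)
weights = pd.Series(lin.coef_.ravel(), index=names)
ranking = weights.abs().sort_values(ascending=False)
print(ranking)
```

Because the features share a common scale, a larger absolute weight indicates a larger change in the prediction per standard deviation of that feature. The ranking printed here reflects the synthetic weights, not the real dataset.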

### RBF Kernel

In [18]:
model2 = SVR(kernel='rbf')
model2.fit(X_train,y_train)

In [19]:
# Predictions
pred2 = model2.predict(X_test)
pred2

array([-0.22790031, -0.13242202,  1.21677892, ...,  0.37531325,
       -0.84499325,  1.81067634])

### Polynomial Kernel

In [20]:
model3 = SVR(kernel='poly', degree=3)
model3.fit(X_train,y_train)

In [21]:
# Predictions
pred3 = model3.predict(X_test)
pred3

array([-0.23924672, -0.11173186,  1.17420794, ...,  0.34747007,
       -0.85017315,  1.8383586 ])

## Evaluation

In [22]:
# Linear kernel
print('MAE:', metrics.mean_absolute_error(y_test, pred1))
print('MSE:', metrics.mean_squared_error(y_test, pred1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred1)))
print('R2:', metrics.r2_score(y_test, pred1))

MAE: 0.08304751100169622
MSE: 0.01094751763520833
RMSE: 0.10463038581219287
R2: 0.9888036752125198
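
The same four metrics are computed for each kernel, so they could be wrapped in a small helper. The sketch below defines a hypothetical function `regression_report` (not part of scikit-learn) and calls it on made-up values; in this notebook it would be called as `regression_report(y_test, pred1)`.

```python
import numpy as np
from sklearn import metrics

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for one set of predictions."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        'MAE': metrics.mean_absolute_error(y_true, y_pred),
        'MSE': mse,
        'RMSE': float(np.sqrt(mse)),
        'R2': metrics.r2_score(y_true, y_pred),
    }

# Hypothetical values, only to show the call
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])
report = regression_report(y_true_demo, y_pred_demo)
print(report)
```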


In [23]:
# RBF kernel
print('MAE:', metrics.mean_absolute_error(y_test, pred2))
print('MSE:', metrics.mean_squared_error(y_test, pred2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred2)))
print('R2:', metrics.r2_score(y_test, pred2))

MAE: 0.08555874144469623
MSE: 0.011676927196830484
RMSE: 0.10805983156025409
R2: 0.9880576881653055


In [24]:
# Polynomial kernel
print('MAE:', metrics.mean_absolute_error(y_test, pred3))
print('MSE:', metrics.mean_squared_error(y_test, pred3))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred3)))
print('R2:', metrics.r2_score(y_test, pred3))

MAE: 0.08453327589739122
MSE: 0.011344133424181332
RMSE: 0.10650884200000173
R2: 0.9883980454307597
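
Because the target was standardized, MAE, MSE, and RMSE above are expressed in standard deviations of `desempeño` rather than in index points, while R2 is unaffected by the scaling. A minimal sketch of how the errors could be mapped back to the original scale with `inverse_transform`, using hypothetical target values and made-up prediction offsets:

```python
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

# Hypothetical target values on the original scale
y_orig = np.array([[91.0], [65.0], [45.0], [36.0], [66.0]])

scaler = StandardScaler()
y_std = scaler.fit_transform(y_orig).ravel()

# Pretend predictions in standardized units (true values plus small offsets)
pred_std = y_std + np.array([0.1, -0.1, 0.05, 0.0, -0.05])

# Map predictions back to the original scale before scoring
pred_orig = scaler.inverse_transform(pred_std.reshape(-1, 1)).ravel()

rmse_std = np.sqrt(metrics.mean_squared_error(y_std, pred_std))
rmse_orig = np.sqrt(metrics.mean_squared_error(y_orig.ravel(), pred_orig))

# For a single standardized target the two differ by the scaler's standard deviation
print(rmse_orig, rmse_std * scaler.scale_[0])
```

In this notebook the same conversion would use `scaler_y` and the predictions `pred1`, `pred2`, and `pred3`.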
