# Student Performance Analysis

## Introduction

This project analyzes student performance in three key areas (mathematics, reading, and writing) using the 'Students Performance' dataset. The goal is to explore how demographic and socioeconomic characteristics, such as gender, ethnic group, and parental level of education, are associated with students' academic performance.

Through this analysis, we aim to identify meaningful patterns and predictive factors that can inform strategies for improving student performance. The main tasks are data exploration, visualization, handling of missing data, statistical analysis, and predictive modeling to estimate math scores.

Beyond its educational insights, the analysis shows how data analysis techniques can be applied in educational settings to reach practical, useful conclusions.


## Loading the Data
Next, we load the dataset using the pandas library. It contains the following columns of interest:
- `gender`: Student's gender.
- `race/ethnicity`: Student's ethnic group.
- `parental level of education`: Parents' level of education.
- `lunch`: Type of lunch.
- `test preparation course`: Whether the student took a test preparation course.
- `math score`, `reading score`, `writing score`: Scores on the math, reading, and writing tests.


In [None]:
import pandas as pd

# Load the data
data_path = '../data/StudentsPerformance.csv'
df = pd.read_csv(data_path)

# Show the first rows of the dataframe using the head method
df.head()


In [None]:
# General information about the dataframe
df.info()

In [None]:
# Summary statistics for the numerical columns
df.describe()

In [None]:
# Count missing values per column (printed explicitly, since only a cell's last expression is displayed)
missing_counts = df.isnull().sum()
print(missing_counts)

# Handle missing values, if any: mean for numerical columns, mode for categorical columns
if missing_counts.any():
    df = df.fillna({
        'math score': df['math score'].mean(),
        'reading score': df['reading score'].mean(),
        'writing score': df['writing score'].mean(),
        'gender': df['gender'].mode()[0],
        'race/ethnicity': df['race/ethnicity'].mode()[0],
        'parental level of education': df['parental level of education'].mode()[0],
        'lunch': df['lunch'].mode()[0],
        'test preparation course': df['test preparation course'].mode()[0]
    })
else:
    print('No missing values found')
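
The imputation rule used above (mean for numerical columns, mode for categorical ones) can be sketched on a small example. The toy DataFrame and the helper name `impute_by_dtype` are illustrative assumptions, not part of the project; the helper chooses the fill value from each column's dtype instead of listing every column by hand.

```python
import pandas as pd

# Toy frame with one numeric gap and one categorical gap (made-up values).
toy = pd.DataFrame({
    "math score": [70.0, None, 90.0],
    "lunch": ["standard", "standard", None],
})

def impute_by_dtype(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the mean and other gaps by the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled

toy_filled = impute_by_dtype(toy)
print(toy_filled)
```

Mean imputation pulls the filled values toward the center of the distribution, so it is reasonable only when few values are missing.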


In [None]:
## Histograms of the scores
'''
We use histograms to visualize the distribution of the math, reading, and writing scores.
These plots show the shape of each distribution and reveal patterns such as skewness or the presence of peaks.

'''

import matplotlib.pyplot as plt
import seaborn as sns

# Setting the style of seaborn
sns.set(style="whitegrid")

# Creating the histogram plots for the math, reading, and writing scores
plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
sns.histplot(df['math score'], kde=True, color='blue')
plt.title('Distribution of Math Scores')

plt.subplot(1, 3, 2)
sns.histplot(df['reading score'], kde=True, color='green')
plt.title('Distribution of Reading Scores')

plt.subplot(1, 3, 3)
sns.histplot(df['writing score'], kde=True, color='red')
plt.title('Distribution of Writing Scores')

plt.show()
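
The skewness that the histograms show visually can also be summarized as a number. The following is a minimal sketch on made-up scores, not the project's data: a negative sample skewness indicates a tail toward low scores, which typically goes together with a mean below the median.

```python
import pandas as pd

# Made-up scores with a long left tail (one very low value).
scores = pd.Series([20, 55, 60, 65, 70, 72, 75, 78, 80, 85])

# Series.skew() returns the sample skewness; negative means a tail toward low values.
skewness = scores.skew()
print(f"mean={scores.mean():.1f}, median={scores.median():.1f}, skew={skewness:.2f}")
```

On the real data, the same call could be applied to `df['math score']` and the other score columns.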


In [None]:
## Box plots by gender

'''
Box plots offer a visual way to compare the score distributions between gender groups, highlighting differences in medians, interquartile ranges, and the presence of outliers.
'''

# Box plots comparing performance by gender
plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x='gender', y='math score', data=df)
plt.title('Math Scores by Gender')

plt.subplot(1, 3, 2)
sns.boxplot(x='gender', y='reading score', data=df)
plt.title('Reading Scores by Gender')

plt.subplot(1, 3, 3)
sns.boxplot(x='gender', y='writing score', data=df)
plt.title('Writing Scores by Gender')

plt.show()
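
The quantities a box plot draws can be computed directly. This sketch uses made-up scores and assumes the common 1.5 × IQR whisker rule (Tukey's rule), which is what matplotlib and seaborn apply by default as far as I know; points beyond the fences are the ones drawn as outliers.

```python
import numpy as np

# Made-up scores with one unusually low value.
scores = np.array([18, 52, 58, 61, 66, 70, 73, 77, 81, 88])

# Quartiles and interquartile range (the box itself).
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```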


In [None]:
# Box plots of scores in every subject by parental level of education
subjects = ['math score', 'reading score', 'writing score']
titles = ['Math', 'Reading', 'Writing']

plt.figure(figsize=(18, 6))
for i, (score, title) in enumerate(zip(subjects, titles), 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='parental level of education', y=score, data=df)
    plt.xticks(rotation=45)
    plt.title(f'{title} Scores by Parental Level of Education')

plt.tight_layout()
plt.show()


In [None]:
# Box plots of scores in every subject by lunch type
plt.figure(figsize=(18, 6))
for i, score in enumerate(subjects, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='lunch', y=score, data=df)
    plt.title(f'{score.title()} by Lunch Type')

plt.tight_layout()
plt.show()


In [None]:
# Box plots of scores in every subject by test preparation course
plt.figure(figsize=(18, 6))
for i, score in enumerate(subjects, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='test preparation course', y=score, data=df)
    plt.title(f'{score.title()} by Test Preparation Course')

plt.tight_layout()
plt.show()


In [None]:
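# Bin math scores into three categories; include_lowest=True keeps a score of exactly 0 in 'Low'
df['score_category'] = pd.cut(df['math score'], bins=[0, 50, 70, 100], labels=['Low', 'Medium', 'High'], include_lowest=True)

# Share of each category within each gender (rows sum to 1). pandas creates its own figure,
# so the size is passed to plot() instead of calling plt.figure() first.
category_share = pd.crosstab(df['gender'], df['score_category'], normalize='index')
category_share.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Distribution of Math Score Categories by Gender')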
plt.show()
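
With bin edges `[0, 50, 70, 100]`, whether a score of exactly 0 receives a category depends on the `include_lowest` argument of `pd.cut`, because the intervals are right-closed by default. A minimal sketch on made-up boundary values:

```python
import pandas as pd

# Boundary values chosen to show which side of each edge is included.
scores = pd.Series([0, 50, 51, 70, 71, 100])
bins = [0, 50, 70, 100]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 50], (50, 70], (70, 100], so a score of 0 falls outside.
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 50].
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

In the default case the score of 0 becomes `NaN` and would silently drop out of any grouped count.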


In [None]:
## Scatter plots between scores
'''
We use scatter plots to assess the relationship between the different academic scores. These plots help identify potential correlations between the math, reading, and writing scores.

'''
# Pairwise scatter plots of the scores
sns.pairplot(df[['math score', 'reading score', 'writing score']])
plt.suptitle('Scatter Plots between Scores', y=1.02)
plt.show()


In [None]:
# Correlations between the scores
correlation_matrix = df[['math score', 'reading score', 'writing score']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix between Scores')
plt.show()
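
`DataFrame.corr()` computes the Pearson coefficient by default. As a sketch of what each cell of the heatmap represents, the coefficient can be computed by hand and checked against NumPy; the paired scores below are made up for illustration.

```python
import numpy as np

# Made-up paired scores for five students.
reading = np.array([72.0, 90.0, 95.0, 57.0, 78.0])
writing = np.array([74.0, 88.0, 93.0, 44.0, 75.0])

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson r: co-deviation divided by the product of the deviation norms."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_manual = pearson(reading, writing)
r_numpy = np.corrcoef(reading, writing)[0, 1]
print(f"manual={r_manual:.4f}, numpy={r_numpy:.4f}")
```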

In [None]:
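# Encode the categorical variables and split the data
from sklearn.model_selection import train_test_split, cross_val_score

# Drop the derived 'score_category' column if present: it is computed from the target (math score),
# so keeping it would leak the target, and its string labels cannot be fed to the regressors
df_encoded = pd.get_dummies(
    df.drop(columns=['score_category'], errors='ignore'),
    columns=['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
)

X = df_encoded.drop(['math score', 'reading score', 'writing score'], axis=1)  # Drop all scores so only background variables remain as features
y = df_encoded['math score']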

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
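
As a sketch of what `pd.get_dummies` produces, the toy frame below (made-up rows) encodes a single categorical column. The `drop_first=True` variant is shown only as an option: it removes one indicator per variable, which avoids perfectly collinear columns in an unregularized linear regression. The notebook itself keeps every indicator.

```python
import pandas as pd

toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["lunch"])

# drop_first=True drops the first category's indicator; it becomes the baseline.
reduced = pd.get_dummies(toy, columns=["lunch"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the full encoding, the indicator columns of each variable sum to 1 in every row, which is the source of the collinearity.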

In [None]:
# Linear Regression Model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print(f"Linear Regression Mean Squared Error: {mse}")

# Coefficients of the linear regression model
importance = pd.DataFrame({
    'Attribute': X_train.columns,
    'Importance': model.coef_
})
importance = importance.sort_values(by='Importance', ascending=False)
print("Linear Regression coefficient importance:\n", importance)
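
Mean squared error is expressed in squared score points, which makes it hard to read directly. Taking the square root (RMSE) brings the error back to the same scale as the scores. A small sketch with made-up predictions:

```python
import numpy as np

# Made-up actual and predicted math scores.
y_true = np.array([70.0, 55.0, 88.0, 62.0])
y_pred = np.array([66.0, 58.0, 85.0, 70.0])

# MSE averages the squared errors; RMSE is its square root, in score points.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(f"MSE={mse:.2f}, RMSE={rmse:.2f}")
```

Here the errors are 4, -3, 3, and -8 points, giving an MSE of 24.5 and an RMSE of roughly 4.95 points.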



In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np

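# Standardize the data: fit the scaler on the training split only, so test-set statistics do not leak into it
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(X_train)
x_test_scaled = scaler.transform(X_test)

# Scaled copy of the full feature matrix, for use in later cells
x_scaled = StandardScaler().fit_transform(X)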

# Different values of alpha
alpha_options = [0.001, 0.01, 0.1, 1, 10, 100]

In [None]:
# Ridge model
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

best_score = np.inf
best_alpha = None

for alpha in alpha_options:
    # The scaler is part of the pipeline, so it is refit on the training part of every CV fold;
    # cross_val_score clones and fits the model itself, so no separate fit call is needed
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha, random_state=42, max_iter=10000))
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    mean_mse = -scores.mean()
    if mean_mse < best_score:
        best_score = mean_mse
        best_alpha = alpha
    print(f"Alpha: {alpha}, Mean Squared Error: {mean_mse:.2f}")

print(f"\n\nBest Alpha: {best_alpha}, Best Mean Squared Error: {best_score:.2f}")



In [None]:
# Lasso model
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

best_score = np.inf
best_alpha = None

for alpha in alpha_options:
    # The scaler is part of the pipeline, so it is refit on the training part of every CV fold;
    # cross_val_score clones and fits the model itself, so no separate fit call is needed
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha, random_state=42, max_iter=10000))
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    mean_mse = -scores.mean()
    if mean_mse < best_score:
        best_score = mean_mse
        best_alpha = alpha
    print(f"Alpha: {alpha}, Mean Squared Error: {mean_mse:.2f}")

print(f"\n\nBest Alpha: {best_alpha}, Best Mean Squared Error: {best_score:.2f}")
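
The manual alpha loops above can also be expressed with scikit-learn's `GridSearchCV` over a `Pipeline`. The sketch below is one possible arrangement, not the notebook's method: it runs on synthetic data (the names `X_demo` and `y_demo` and the coefficients are assumptions for illustration) and places the scaler inside the pipeline so that each fold is scaled using only its own training part.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the encoded features (assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# The scaler is a pipeline step, so every CV fold fits it on its training part only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Step parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_, -search.best_score_)
```

Swapping `Ridge()` for `Lasso()` (and renaming the step) would cover the Lasso search in the same way.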

In [None]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor


# List of different settings for testing
n_estimators_options = [100, 200, 300]
max_features_options = ['sqrt', 'log2', None]
max_depth_options = [10, 15, 20, None]

best_score = np.inf
best_params = None

# Test all the combinations
for n_estimators in n_estimators_options:
    for max_features in max_features_options:
        for max_depth in max_depth_options:
            # cross_val_score clones and fits the model on each fold, so no separate fit call is needed
            model = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features, max_depth=max_depth, random_state=42)
            scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
            mean_mse = -scores.mean()

            print(f"Random Forest with n_estimators={n_estimators}, max_features={max_features}, max_depth={max_depth}: MSE={mean_mse:.2f}")

            # Keep track of the best configuration
            if mean_mse < best_score:
                best_score = mean_mse
                best_params = {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth}

print(f"Best MSE: {best_score:.2f} with parameters: {best_params}")
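
The three nested loops enumerate a Cartesian product of settings. As a sketch, `itertools.product` produces the same grid in a single loop and makes the size of the search explicit before any model is trained; the option lists are copied from the cell above.

```python
from itertools import product

n_estimators_options = [100, 200, 300]
max_features_options = ["sqrt", "log2", None]
max_depth_options = [10, 15, 20, None]

# One dictionary per configuration, in the same order as the nested loops.
grid = [
    {"n_estimators": n, "max_features": f, "max_depth": d}
    for n, f, d in product(n_estimators_options, max_features_options, max_depth_options)
]

print(len(grid))  # 3 * 3 * 4 = 36 configurations
print(grid[0])
```

With 5-fold cross-validation, 36 configurations mean 180 forest fits, which is worth knowing before starting the run.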

In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from itertools import combinations

def test_feature_combinations(df, target, max_features=3):
    features = [col for col in df.columns if col != target]
    results = []

    # Create the results file, making sure the output directory exists
    filename = f"../Results/linear_regression_results_{max_features}_features.txt"
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, 'w') as file:
        file.write("Linear Regression Feature Combination Results\n\n")
        file.write(f"Max Features: {max_features}\n")

    # Try every combination of up to max_features features
    for r in range(1, max_features + 1):
        for subset in combinations(features, r):
            X = df[list(subset)]
            y = df[target]
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
            
            # Train the linear regression model
            model = LinearRegression()
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            mse = mean_squared_error(y_test, predictions)
            
            # Store the subset, its MSE, and a sample of actual vs. predicted values
            results.append({
                'features': subset,
                'mse': mse,
                'sample_comparison': pd.DataFrame({'Actual': y_test[:5], 'Predicted': predictions[:5]}).to_string()
            })

    # Sort the results by MSE, lowest (best) first
    results.sort(key=lambda x: x['mse'])

    # Keep the 50 best combinations, reversed so that the best one is written last
    results = results[:50][::-1]
    
    # # Show the retained combinations
    # for result in results:
    #     print(f"Features: {result['features']}")
    #     print(f"MSE: {result['mse']}")
    #     print(f"Sample Comparison:\n{result['sample_comparison']}\n")

    # Save the results to the file
    with open(filename, 'a') as file:
        for result in results:
            file.write(f"Features: {result['features']}\n")
            file.write(f"MSE: {result['mse']}\n")
            file.write(f"Sample Comparison:\n{result['sample_comparison']}\n\n")

# Example usage
test_feature_combinations(df_encoded, 'math score', max_features=2)
test_feature_combinations(df_encoded, 'math score', max_features=3)
test_feature_combinations(df_encoded, 'math score', max_features=5)
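
The exhaustive search grows quickly with `max_features`, since it fits one model per subset of every size from 1 up to that limit. The sketch below counts the subsets with `math.comb`. The figure of 19 candidate features is an assumption (17 one-hot indicator columns plus the reading and writing scores); the actual count depends on the columns of `df_encoded`.

```python
from math import comb

def count_subsets(n_features: int, max_size: int) -> int:
    """Number of non-empty feature subsets with at most max_size features."""
    return sum(comb(n_features, r) for r in range(1, max_size + 1))

n_features = 19  # assumed number of candidate columns; check len(df_encoded.columns) - 1

for max_size in (2, 3, 5):
    print(f"max_features={max_size}: {count_subsets(n_features, max_size)} models")
```

Under that assumption the three calls fit 190, 1,159, and 16,663 linear models respectively, so the last call dominates the running time.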


In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from itertools import combinations
import numpy as np

# The default alphas are a tuple: a mutable list default would be shared between calls
def test_ridge_combinations(df, target, max_features=3, alpha_options=(0.001, 0.01, 0.1, 1, 10, 100)):
    features = [col for col in df.columns if col != target]
    results = []
    scaler = StandardScaler()

    # Create the results file, making sure the output directory exists
    filename = f"../Results/ridge_results_max_features_{max_features}.txt"
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, 'w') as file:
        file.write("Ridge Regression Analysis\n")
        file.write(f"Max Features: {max_features}\n")

    # Try every combination of up to max_features features
    for r in range(1, max_features + 1):
        for subset in combinations(features, r):
            X = df[list(subset)]
            y = df[target]
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
            
            # Scale the data (the scaler is fit on the training split only)
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)
            
            # Evaluate different alphas
            for alpha in alpha_options:
                model = Ridge(alpha=alpha, max_iter=10000)
                model.fit(X_train_scaled, y_train)
                predictions = model.predict(X_test_scaled)
                mse = mean_squared_error(y_test, predictions)
                scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
                mean_cv_mse = -np.mean(scores)
                
                # Store the results
                results.append({
                    'features': subset,
                    'alpha': alpha,
                    'mse': mse,
                    'mean_cv_mse': mean_cv_mse,
                    'sample_comparison': pd.DataFrame({'Actual': y_test[:5], 'Predicted': predictions[:5]}).to_string()
                })

    # Sort the results by cross-validated MSE, lowest (best) first
    results.sort(key=lambda x: x['mean_cv_mse'])

    # Keep the 50 best combinations, reversed so that the best one is written last
    results = results[:50][::-1]

    # # Show the retained combinations
    # for result in results:
    #     print(f"Features: {result['features']}, Alpha: {result['alpha']}")
    #     print(f"MSE: {result['mse']}, Mean CV MSE: {result['mean_cv_mse']}")
    #     print(f"Sample Comparison:\n{result['sample_comparison']}\n")

    # Write the results to the file
    with open(filename, 'a') as file:
        for result in results:
            file.write(f"Features: {result['features']}, Alpha: {result['alpha']}\n")
            file.write(f"MSE: {result['mse']}, Mean CV MSE: {result['mean_cv_mse']}\n")
            file.write(f"Sample Comparison:\n{result['sample_comparison']}\n\n")


test_ridge_combinations(df_encoded, 'math score', max_features=2, alpha_options=[0.001, 0.01, 0.1, 1, 10, 100])
test_ridge_combinations(df_encoded, 'math score', max_features=3, alpha_options=[0.001, 0.01, 0.1, 1, 10, 100])
test_ridge_combinations(df_encoded, 'math score', max_features=5, alpha_options=[0.001, 0.01, 0.1, 1, 10, 100])


In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from itertools import combinations
import numpy as np

def test_random_forest_combinations(df, target, max_features_comb=3):
    features = [col for col in df.columns if col != target]
    results = []
    scaler = StandardScaler()

    # Create the results file, making sure the output directory exists
    filename = f"../Results/random_forest_results_max_features_{max_features_comb}.txt"
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, 'w') as file:
        file.write("Random Forest Regression Analysis\n")
        file.write(f"Max Features for Combination Testing: {max_features_comb}\n")

    # Random Forest configuration options
    n_estimators_options = [100, 200]
    max_features_options = ['sqrt', 'log2', None]
    max_depth_options = [10, 15, None]

    # Try every combination of up to max_features_comb features
    for r in range(1, max_features_comb + 1):
        for subset in combinations(features, r):
            X = df[list(subset)]
            y = df[target]
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # Scale the data (tree-based models do not need scaling, but it is harmless here)
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)

            # Evaluate different Random Forest configurations
            for n_estimators in n_estimators_options:
                for max_features in max_features_options:
                    for max_depth in max_depth_options:
                        model = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features, max_depth=max_depth, random_state=42)
                        model.fit(X_train_scaled, y_train)
                        predictions = model.predict(X_test_scaled)
                        mse = mean_squared_error(y_test, predictions)
                        scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
                        mean_cv_mse = -np.mean(scores)

                        # Store the results
                        results.append({
                            'features': subset,
                            'n_estimators': n_estimators,
                            'max_features': max_features,
                            'max_depth': max_depth,
                            'mse': mse,
                            'mean_cv_mse': mean_cv_mse,
                            'sample_comparison': pd.DataFrame({'Actual': y_test[:5], 'Predicted': predictions[:5]}).to_string()
                        })

    # Sort the results by cross-validated MSE, lowest (best) first
    results.sort(key=lambda x: x['mean_cv_mse'])

    # Write the 20 best results to the file
    with open(filename, 'a') as file:
        for result in results[:20]:
            file.write(f"Features: {result['features']}, Estimators: {result['n_estimators']}, Max Features: {result['max_features']}, Max Depth: {result['max_depth']}\n")
            file.write(f"MSE: {result['mse']}, Mean CV MSE: {result['mean_cv_mse']}\n")
            file.write(f"Sample Comparison:\n{result['sample_comparison']}\n\n")

# Example usage
test_random_forest_combinations(df_encoded, 'math score', max_features_comb=2)
test_random_forest_combinations(df_encoded, 'math score', max_features_comb=3)
