# 📊 Exploratory Data Analysis (EDA)
## Dataset: Student Performance Factors

This project analyzes factors that influence student performance, using a public dataset from Kaggle.  
The goal is to apply **data cleaning**, **exploration**, and **visualization** techniques to extract relevant patterns.

## 🔧 1. Loading the libraries and dataset
In this section we import the required libraries and load the dataset from the `../data/` folder.


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the dataset
df = pd.read_csv("../data/student_performance.csv")

In [None]:
# Initial exploration
print("Dataset shape:", df.shape)  # (rows, columns)
print("\nColumns:")
print(df.columns)

df.head()

Dataset shape: (6607, 20)

Columns:
Index(['Hours_Studied', 'Attendance', 'Parental_Involvement',
       'Access_to_Resources', 'Extracurricular_Activities', 'Sleep_Hours',
       'Previous_Scores', 'Motivation_Level', 'Internet_Access',
       'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'School_Type',
       'Peer_Influence', 'Physical_Activity', 'Learning_Disabilities',
       'Parental_Education_Level', 'Distance_from_Home', 'Gender',
       'Exam_Score'],
      dtype='object')


|  | Hours_Studied | Attendance | Parental_Involvement | Access_to_Resources | Extracurricular_Activities | Sleep_Hours | Previous_Scores | Motivation_Level | Internet_Access | Tutoring_Sessions | Family_Income | Teacher_Quality | School_Type | Peer_Influence | Physical_Activity | Learning_Disabilities | Parental_Education_Level | Distance_from_Home | Gender | Exam_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 84 | Low | High | No | 7 | 73 | Low | Yes | 0 | Low | Medium | Public | Positive | 3 | No | High School | Near | Male | 67 |
| 1 | 19 | 64 | Low | Medium | No | 8 | 59 | Low | Yes | 2 | Medium | Medium | Public | Negative | 4 | No | College | Moderate | Female | 61 |
| 2 | 24 | 98 | Medium | Medium | Yes | 7 | 91 | Medium | Yes | 2 | Medium | Medium | Public | Neutral | 4 | No | Postgraduate | Near | Male | 74 |
| 3 | 29 | 89 | Low | Medium | Yes | 8 | 98 | Medium | Yes | 1 | Medium | Medium | Public | Negative | 4 | No | High School | Moderate | Male | 71 |
| 4 | 19 | 92 | Medium | Medium | Yes | 6 | 65 | Medium | Yes | 3 | Medium | High | Public | Neutral | 4 | No | College | Near | Female | 70 |


## 🔍 2. Initial exploration
Here we review the structure of the dataset: its columns, data types, non-null counts, and summary statistics.


In [19]:
# General information
df.info()

# Descriptive statistics
df.describe(include='all')  # include='all' also summarizes the non-numeric columns


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

|  | Hours_Studied | Attendance | Parental_Involvement | Access_to_Resources | Extracurricular_Activities | Sleep_Hours | Previous_Scores | Motivation_Level | Internet_Access | Tutoring_Sessions | Family_Income | Teacher_Quality | School_Type | Peer_Influence | Physical_Activity | Learning_Disabilities | Parental_Education_Level | Distance_from_Home | Gender | Exam_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6607.0 | 6607.0 | 6607 | 6607 | 6607 | 6607.0 | 6607.0 | 6607 | 6607 | 6607.0 | 6607 | 6529 | 6607 | 6607 | 6607.0 | 6607 | 6517 | 6540 | 6607 | 6607.0 |
| unique |  |  | 3 | 3 | 2 |  |  | 3 | 2 |  | 3 | 3 | 2 | 3 |  | 2 | 3 | 3 | 2 |  |
| top |  |  | Medium | Medium | Yes |  |  | Medium | Yes |  | Low | Medium | Public | Positive |  | No | High School | Near | Male |  |
| freq |  |  | 3362 | 3319 | 3938 |  |  | 3351 | 6108 |  | 2672 | 3925 | 4598 | 2638 |  | 5912 | 3223 | 3884 | 3814 |  |
| mean | 19.975329 | 79.977448 |  |  |  | 7.02906 | 75.070531 |  |  | 1.493719 |  |  |  |  | 2.96761 |  |  |  |  | 67.235659 |
| std | 5.990594 | 11.547475 |  |  |  | 1.46812 | 14.399784 |  |  | 1.23057 |  |  |  |  | 1.031231 |  |  |  |  | 3.890456 |
| min | 1.0 | 60.0 |  |  |  | 4.0 | 50.0 |  |  | 0.0 |  |  |  |  | 0.0 |  |  |  |  | 55.0 |
| 25% | 16.0 | 70.0 |  |  |  | 6.0 | 63.0 |  |  | 1.0 |  |  |  |  | 2.0 |  |  |  |  | 65.0 |
| 50% | 20.0 | 80.0 |  |  |  | 7.0 | 75.0 |  |  | 1.0 |  |  |  |  | 3.0 |  |  |  |  | 67.0 |
| 75% | 24.0 | 90.0 |  |  |  | 8.0 | 88.0 |  |  | 2.0 |  |  |  |  | 4.0 |  |  |  |  | 69.0 |
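
The `count` row above falls below 6607 for three columns, which means those columns contain missing values. As a minimal sketch (the helper `missing_summary` and the small `sample` frame are hypothetical, introduced only for illustration), the gaps could be tabulated per column like this:

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Tiny hypothetical frame standing in for the notebook's `df`
sample = pd.DataFrame({
    "Teacher_Quality": ["Medium", None, "High", "Low"],
    "Distance_from_Home": ["Near", "Moderate", None, None],
    "Exam_Score": [67, 61, 74, 71],
})

report = missing_summary(sample)
print(report)
```

Applied to the notebook's `df`, the counts in the table imply 90 missing values in `Parental_Education_Level`, 78 in `Teacher_Quality`, and 67 in `Distance_from_Home`, each between 1% and 1.4% of the rows.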


## 🔍 Initial observations

- The dataset contains **6607 rows** and **20 columns**.
- The columns include both **numeric variables** (e.g. `Exam_Score`, `Hours_Studied`) and **categorical variables** (e.g. `Gender`, `Parental_Involvement`, `Motivation_Level`).
- Three columns have **missing values** that will need to be handled in the cleaning phase: `Teacher_Quality` (78), `Parental_Education_Level` (90), and `Distance_from_Home` (67).
- Some text-valued variables might need **normalization** (e.g. inconsistent capitalization or stray spaces), although the low `unique` counts (2 to 3 per column) suggest the categories are already consistent.
- The dataset looks well suited to exploratory analysis, since it combines social, academic, and personal factors related to student performance.
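
The numeric/categorical split and the text-normalization check from the list above can be sketched as follows. The helpers `split_columns` and `normalize_labels`, and the deliberately messy `sample` frame, are assumptions made for illustration; nothing in the output above shows that the real dataset has inconsistent labels.

```python
import pandas as pd


def split_columns(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numeric, categorical) column names based on dtype."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


def normalize_labels(s: pd.Series) -> pd.Series:
    """Trim whitespace and title-case text labels; missing values stay missing."""
    return s.str.strip().str.title()


# Hypothetical frame with inconsistent labels, standing in for `df`
sample = pd.DataFrame({
    "Hours_Studied": [23, 19, 24],
    "Gender": ["Male", " male", "FEMALE "],
    "Exam_Score": [67, 61, 74],
})

numeric_cols, categorical_cols = split_columns(sample)

# Compare the number of distinct labels before and after normalization
before = sample["Gender"].nunique()
sample["Gender"] = normalize_labels(sample["Gender"])
after = sample["Gender"].nunique()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
print("Distinct Gender labels:", before, "->", after)
```

If the number of distinct labels does not change after normalization, the column was already consistent and the step can be skipped for it.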
