# Classification with an Academic Success Dataset
## Playground Series - Season 4, Episode 6

#### Importing Necessary Libraries
We start by importing `numpy`, `pandas`, and `os`, along with the `sklearn` pieces for data splitting (`train_test_split`), model building (`RandomForestClassifier`), and evaluation (`accuracy_score`). Note that `train_test_split` and `accuracy_score` are imported but not used in the cells below.

#### Load the Data
We load `train.csv` and `test.csv` with `pd.read_csv()` into the pandas DataFrames `train_df` and `test_df`. The data describe educational and socioeconomic indicators of students. The training set has 76,518 rows and 38 columns; the test set has 51,012 rows and 37 columns, because it lacks the `Target` column.

#### Understanding the Training Dataset
To understand the structure and content of the training dataset, we display the first few rows with `train_df.head()`, print column types and non-null counts with `train_df.info()`, and compute summary statistics with `train_df.describe()`.

#### Understanding the Test Dataset
Similarly, we examine the test dataset (`test_df`) to ensure consistency in structure and content. We display the first few rows using `test_df.head()` and check dataset information using `test_df.info()` and `test_df.describe()`.
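
The consistency check described above can also be done programmatically rather than by eye. The following is a minimal sketch on toy stand-ins for `train_df` and `test_df`; the frames `toy_train` and `toy_test` and their values are hypothetical and only illustrate the comparison.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (hypothetical values, not the real data)
toy_train = pd.DataFrame({
    "id": [0, 1, 2],
    "Course": [9238, 9254, 9500],
    "GDP": [2.02, -0.92, 0.32],
    "Target": ["Graduate", "Dropout", "Enrolled"],
})
toy_test = pd.DataFrame({
    "id": [3, 4],
    "Course": [9500, 9238],
    "GDP": [0.79, 2.02],
})

# Columns present in one frame but not the other
only_in_train = set(toy_train.columns) - set(toy_test.columns)
only_in_test = set(toy_test.columns) - set(toy_train.columns)

# Shared columns whose dtypes differ between the two frames
shared = [c for c in toy_train.columns if c in toy_test.columns]
dtype_mismatches = {
    c: (toy_train[c].dtype, toy_test[c].dtype)
    for c in shared
    if toy_train[c].dtype != toy_test[c].dtype
}

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
print("Dtype mismatches:", dtype_mismatches)
```

On the real data, the expectation would be that `Target` is the only column missing from the test set.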

#### Data Preprocessing
To prepare the data for modeling:
- **Combining Datasets**: The training set (with `Target` dropped) and the test set are concatenated into `combined_df` so that both receive identical preprocessing.
- **Handling Missing Values**: Missing numerical values are filled with each column's median using `fillna()`.
- **Encoding Categorical Variables**: Categorical variables are one-hot encoded with `pd.get_dummies()`. By default this only affects object, string, and category columns, so integer-coded features such as `Course` pass through unchanged.
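
To make the effect of these steps concrete, here is a small sketch on a toy frame. The frame `toy` and its `attendance` text column are hypothetical; they are included only to show which columns `fillna()` and `pd.get_dummies()` actually change.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one integer-coded category, one numeric column
# with a gap, and one text category
toy = pd.DataFrame({
    "Course": [9238, 9254, 9500, 9238],
    "grade": [126.0, np.nan, 137.0, 131.0],
    "attendance": ["day", "evening", "day", "day"],
})

# Fill missing numeric values with the per-column median
numeric_cols = toy.select_dtypes(include=["number"]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# One-hot encode: only the text column is expanded
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

The integer-coded `Course` column stays a single numeric column, while the text column is replaced by one indicator column per value.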

#### Model Training
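- **Splitting Data**: The combined dataset is split back by row position into training features (`X_train`) and test features (`X_test`), with the `id` column dropped from both. The labels `y_train` come from the `Target` column of `train_df`.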
- **Initializing Model**: We choose `RandomForestClassifier()` as our model, initialized with `random_state=42`.
- **Model Fitting**: The model is trained on the training data using `model.fit(X_train, y_train)`.
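
The notebook fits on the full training set and does not estimate accuracy before submitting. One possible way to use the imported `train_test_split` and `accuracy_score` is a simple holdout check, sketched below on synthetic data. The arrays `X` and `y`, the 80/20 split, and `n_estimators=50` are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three classes driven by the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] > 0.5, "Graduate",
             np.where(X[:, 0] < -0.5, "Dropout", "Enrolled"))

# Hold out 20% of the rows, preserving class proportions
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on the remaining 80% and score on the held-out rows
holdout_model = RandomForestClassifier(n_estimators=50, random_state=42)
holdout_model.fit(X_fit, y_fit)
val_accuracy = accuracy_score(y_val, holdout_model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")
```

With the real data, `X` and `y` would be replaced by `X_train` and `y_train`, and the final model could still be refit on all rows before predicting.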

#### Model Prediction and Submission
- **Making Predictions**: Using the trained model, predictions are made on the test data (`X_test`) with `model.predict()`.
- **Creating Submission File**: Predictions are organized into a DataFrame (`submission_df`) along with corresponding `id` values from the test dataset.
- **Saving Results**: Finally, the submission DataFrame is saved to `submission.csv` with `to_csv(index=False)`, which omits the DataFrame index from the file.
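
As a quick illustration of the resulting file layout, the sketch below builds a tiny submission frame and writes it to an in-memory buffer instead of disk. The predicted labels are hypothetical placeholders, and `toy_submission` is an illustrative name.

```python
import io
import pandas as pd

# A few test ids with hypothetical predicted labels
toy_ids = pd.Series([76518, 76519, 76520], name="id")
toy_predictions = ["Dropout", "Graduate", "Graduate"]

toy_submission = pd.DataFrame({"id": toy_ids, "Target": toy_predictions})

# Write to an in-memory buffer; index=False keeps the row index out of the CSV
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The file has exactly two columns, `id` and `Target`, with one row per test record.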

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import os

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Load the data
train_df = pd.read_csv('/kaggle/input/playground-series-s4e6/train.csv')
test_df = pd.read_csv('/kaggle/input/playground-series-s4e6/test.csv')

In [3]:
# Display the first few rows of the training dataset to understand its structure
print("Training Dataset:")
train_df.head(5)

Training Dataset:


Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,0,1,1,1,9238,1,1,126.0,1,1,...,0,6,7,6,12.428571,0,11.1,0.6,2.02,Graduate
1,1,1,17,1,9238,1,1,125.0,1,19,...,0,6,9,0,0.0,0,11.1,0.6,2.02,Dropout
2,2,1,17,2,9254,1,1,137.0,1,3,...,0,6,0,0,0.0,0,16.2,0.3,-0.92,Dropout
3,3,1,1,3,9500,1,1,131.0,1,19,...,0,8,11,7,12.82,0,11.1,0.6,2.02,Enrolled
4,4,1,1,2,9500,1,1,132.0,1,19,...,0,7,12,6,12.933333,0,7.6,2.6,0.32,Graduate


In [4]:
# info() takes no row count; its first positional argument is `verbose`
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76518 entries, 0 to 76517
Data columns (total 38 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   id                                              76518 non-null  int64  
 1   Marital status                                  76518 non-null  int64  
 2   Application mode                                76518 non-null  int64  
 3   Application order                               76518 non-null  int64  
 4   Course                                          76518 non-null  int64  
 5   Daytime/evening attendance                      76518 non-null  int64  
 6   Previous qualification                          76518 non-null  int64  
 7   Previous qualification (grade)                  76518 non-null  float64
 8   Nacionality                                     76518 non-null  int64  
 9   Mother's qualification                 

In [5]:
train_df.describe()

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
count,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,...,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0,76518.0
mean,38258.5,1.111934,16.054419,1.64441,9001.286377,0.915314,3.65876,132.378766,1.2266,19.837633,...,0.05796,0.137053,5.933414,7.234468,4.007201,9.626085,0.062443,11.52034,1.228218,-0.080921
std,22088.988286,0.441669,16.682337,1.229645,1803.438531,0.278416,8.623774,10.995328,3.392183,15.399456,...,0.40849,0.93383,1.627182,3.50304,2.772956,5.546035,0.462107,2.653375,1.398816,2.251382
min,0.0,1.0,1.0,0.0,33.0,0.0,1.0,95.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.6,-0.8,-4.06
25%,19129.25,1.0,1.0,1.0,9119.0,1.0,1.0,125.0,1.0,1.0,...,0.0,0.0,5.0,6.0,1.0,10.0,0.0,9.4,0.3,-1.7
50%,38258.5,1.0,17.0,1.0,9254.0,1.0,1.0,133.1,1.0,19.0,...,0.0,0.0,6.0,7.0,5.0,12.142857,0.0,11.1,1.4,0.32
75%,57387.75,1.0,39.0,2.0,9670.0,1.0,1.0,140.0,1.0,37.0,...,0.0,0.0,6.0,9.0,6.0,13.244048,0.0,12.7,2.6,1.79
max,76517.0,6.0,53.0,9.0,9991.0,1.0,43.0,190.0,109.0,44.0,...,12.0,19.0,23.0,33.0,20.0,18.0,12.0,16.2,3.7,3.51


In [6]:
train_df.tail(5)

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
76513,76513,1,17,1,9254,1,1,121.0,1,19,...,0,6,8,5,10.6,0,13.9,-0.3,0.79,Graduate
76514,76514,1,1,6,9254,1,1,125.0,1,1,...,0,6,9,6,13.875,0,9.4,-0.8,-3.12,Graduate
76515,76515,5,17,1,9085,1,1,138.0,1,37,...,0,5,8,5,11.4,1,9.4,-0.8,-3.12,Enrolled
76516,76516,1,1,3,9070,1,1,136.0,1,38,...,0,6,0,0,0.0,0,7.6,2.6,0.32,Dropout
76517,76517,1,1,1,9773,1,1,133.1,1,19,...,0,6,6,6,13.666667,0,15.5,2.8,-4.06,Graduate


In [7]:
# Display the first few rows of the test dataset
print("\nTest Dataset:")
test_df.head(5)


Test Dataset:


Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,76518,1,1,1,9500,1,1,141.0,1,3,...,0,0,8,0,0,0.0,0,13.9,-0.3,0.79
1,76519,1,1,1,9238,1,1,128.0,1,1,...,0,0,6,6,6,13.5,0,11.1,0.6,2.02
2,76520,1,1,1,9238,1,1,118.0,1,1,...,0,0,6,11,5,11.0,0,15.5,2.8,-4.06
3,76521,1,44,1,9147,1,39,130.0,1,1,...,0,3,8,14,5,11.0,0,8.9,1.4,3.51
4,76522,1,39,1,9670,1,1,110.0,1,1,...,0,0,6,9,4,10.666667,2,7.6,2.6,0.32


In [8]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51012 entries, 0 to 51011
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   id                                              51012 non-null  int64  
 1   Marital status                                  51012 non-null  int64  
 2   Application mode                                51012 non-null  int64  
 3   Application order                               51012 non-null  int64  
 4   Course                                          51012 non-null  int64  
 5   Daytime/evening attendance                      51012 non-null  int64  
 6   Previous qualification                          51012 non-null  int64  
 7   Previous qualification (grade)                  51012 non-null  float64
 8   Nacionality                                     51012 non-null  int64  
 9   Mother's qualification                 

In [9]:
test_df.describe()

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
count,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,...,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0,51012.0
mean,102023.5,1.109092,16.067102,1.648161,9026.304556,0.918313,3.635007,132.328001,1.20009,19.913275,...,0.05781,0.129283,5.944131,7.274092,4.039697,9.709128,0.063809,11.520611,1.228719,-0.086477
std,14726.040303,0.438084,16.654196,1.235666,1751.328311,0.273889,8.57725,10.885679,3.26473,15.383823,...,0.403434,0.87725,1.599746,3.433149,2.749871,5.49681,0.467176,2.651113,1.402773,2.25165
min,76518.0,1.0,1.0,0.0,33.0,0.0,1.0,95.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.6,-0.8,-4.06
25%,89270.75,1.0,1.0,1.0,9119.0,1.0,1.0,125.0,1.0,1.0,...,0.0,0.0,5.0,6.0,1.0,10.0,0.0,9.4,0.3,-1.7
50%,102023.5,1.0,17.0,1.0,9254.0,1.0,1.0,133.1,1.0,19.0,...,0.0,0.0,6.0,8.0,5.0,12.166667,0.0,11.1,1.4,0.32
75%,114776.25,1.0,39.0,2.0,9670.0,1.0,1.0,139.0,1.0,37.0,...,0.0,0.0,6.0,9.0,6.0,13.25,0.0,12.7,2.6,1.79
max,127529.0,6.0,53.0,9.0,9991.0,1.0,43.0,190.0,109.0,44.0,...,12.0,19.0,23.0,33.0,20.0,17.714286,10.0,16.2,3.7,3.51


In [10]:
test_df.tail(5)

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
51007,127525,1,1,2,171,1,1,128.0,1,38,...,0,0,0,0,0,0.0,0,15.5,2.8,-4.06
51008,127526,2,39,1,9119,1,19,133.1,1,19,...,0,0,5,5,0,0.0,0,9.4,-0.8,-3.12
51009,127527,1,1,1,171,1,1,127.0,1,1,...,0,0,0,0,0,0.0,0,15.5,2.8,-4.06
51010,127528,1,1,3,9773,1,1,132.0,1,19,...,0,0,6,9,3,13.0,0,7.6,2.6,0.32
51011,127529,1,1,1,171,1,1,129.0,1,37,...,0,0,0,0,0,0.0,0,7.6,2.6,0.32


In [11]:
# Combine train and test datasets for preprocessing
combined_df = pd.concat([train_df.drop('Target', axis=1), test_df])

In [12]:
# Example: Fill missing numerical values with the median
numerical_cols = combined_df.select_dtypes(include=['number']).columns
combined_df[numerical_cols] = combined_df[numerical_cols].fillna(combined_df[numerical_cols].median())

In [13]:
# Example: Encode categorical variables using pandas get_dummies
combined_df = pd.get_dummies(combined_df)

In [14]:
# Separate back into train and test sets
X_train = combined_df.iloc[:len(train_df), :].drop(['id'], axis=1)
X_test = combined_df.iloc[len(train_df):, :].drop(['id'], axis=1)

In [15]:
# Prepare target variable for training
y_train = train_df['Target']

In [16]:
# Initialize RandomForestClassifier (you can choose other classifiers as well)
model = RandomForestClassifier(random_state=42)

In [17]:
# Fit the model on training data
model.fit(X_train, y_train)

In [18]:
# Predict on test data
predictions = model.predict(X_test)

In [19]:
# Prepare submission DataFrame
submission_df = pd.DataFrame({'id': test_df['id'], 'Target': predictions})

In [20]:
# Save submission to a CSV file
submission_df.to_csv('/kaggle/working/submission.csv', index=False)
print("Submission saved successfully.")

Submission saved successfully.
