<a href="https://colab.research.google.com/github/dengathitu/Delaware-Students-Enrollment/blob/main/Delaware_Student_Enrolment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Description
Delaware District wants to understand the students enrollment trends among different schools in different districts.

## Data Source
*  https://www.kaggle.com/datasets/noeyislearning/delaware-student-enrollment


In [42]:
import pandas as pd # Dataset Manipulation
import numpy as np # for mathematical operations
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder #For converting categorical features to numerical columns
from sklearn.model_selection import train_test_split # divide the dataset into train & test data
from sklearn.linear_model import LinearRegression, LogisticRegression # Contains the LinearRegression and LogisticRegression algorithims to be used
from sklearn.tree import DecisionTreeRegressor # Contains the Decision Tree Algorithim to be used
from sklearn.ensemble import RandomForestRegressor # Containins the Radom Forest Algorithim to be used
from sklearn.neighbors import KNeighborsClassifier # Contains the KNN Classifier

## Load the Dataset

In [2]:
data = pd.read_csv("/content/student_enrollment.csv")

## Overview of the DataSet
*  Shows the first 5 rows of the dataset.

In [3]:
data.head()

Unnamed: 0,School Year,District Code,District,School Code,Organization,Race,Gender,Grade,SpecialDemo,Geography,SubGroup,RowStatus,Students,EOYEnrollment,PctOfEOYEnrollment,FallEnrollment
0,2015,0,State of Delaware,0,State of Delaware,White,All Students,9th Grade,All Students,All Students,White/9th Grade,REPORTED,5631.0,141336.0,3.98,134932.0
1,2015,0,State of Delaware,0,State of Delaware,White,All Students,Twelfth,All Students,All Students,White/Twelfth,REPORTED,4828.0,141336.0,3.42,134932.0
2,2015,0,State of Delaware,0,State of Delaware,White,All Students,All Students,All Students,All Students,White,REPORTED,65185.0,141336.0,46.12,134932.0
3,2015,0,State of Delaware,0,State of Delaware,White,Female,4th Grade,Homeless,All Students,White/Female/4th Grade/Homeless,REDACTED,37.0,141336.0,0.03,
4,2015,0,State of Delaware,0,State of Delaware,White,Female,4th Grade,Low-Income,All Students,White/Female/4th Grade/Low-Income,REPORTED,600.0,141336.0,0.42,134932.0


## Data Structure

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186759 entries, 0 to 186758
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   School Year         186759 non-null  int64  
 1   District Code       186759 non-null  int64  
 2   District            186759 non-null  object 
 3   School Code         186759 non-null  int64  
 4   Organization        186758 non-null  object 
 5   Race                186758 non-null  object 
 6   Gender              186758 non-null  object 
 7   Grade               186758 non-null  object 
 8   SpecialDemo         186758 non-null  object 
 9   Geography           186758 non-null  object 
 10  SubGroup            186758 non-null  object 
 11  RowStatus           186758 non-null  object 
 12  Students            102745 non-null  float64
 13  EOYEnrollment       186460 non-null  float64
 14  PctOfEOYEnrollment  102745 non-null  float64
 15  FallEnrollment      162550 non-nul

## Findings
*   9 categorical columns
*   7 numerical columns
*   1281801 entries

#Check for Null values

In [5]:
data.isnull().sum()

Unnamed: 0,0
School Year,0
District Code,0
District,0
School Code,0
Organization,1
Race,1
Gender,1
Grade,1
SpecialDemo,1
Geography,1


# Findings
The following columns have null values
- Students
- EOYEnrollment
- PctOfEOYEnrollment
- FallEnrollment



# Data Preprocessing
*   Preparing data for analysis


**Step 1: Remove Null Values**

In [6]:
data.dropna(subset=["Students", "EOYEnrollment", "PctOfEOYEnrollment", "FallEnrollment"], inplace=True)

**Step 2: Confirm whether the null values have been dropped**

In [7]:
data.isnull().sum()

Unnamed: 0,0
School Year,0
District Code,0
District,0
School Code,0
Organization,0
Race,0
Gender,0
Grade,0
SpecialDemo,0
Geography,0


**Findings**
*   There are nolonger null values

**Step 3: Dealing with Categorical values**
- District
- Organization       
- Race               
- Gender             
- Grade               
- SpecialDemo   
- Geography           
- SubGroup            
- RowStatus           



In [8]:
# A list of Categorical Columns
Categorical_Columns = ["District", "Organization", "Race", "Gender", "Grade", "SpecialDemo", "Geography", "SubGroup", "RowStatus"]

# A dictionary to store the LabelEncoders
Label_Encoders = {}
for columns in Categorical_Columns:
  LE = LabelEncoder()
  data[columns] = LE.fit_transform(data[columns].astype(str))
  Label_Encoders[columns] = LE

## Confirmation

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 98307 entries, 0 to 186755
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   School Year         98307 non-null  int64  
 1   District Code       98307 non-null  int64  
 2   District            98307 non-null  int64  
 3   School Code         98307 non-null  int64  
 4   Organization        98307 non-null  int64  
 5   Race                98307 non-null  int64  
 6   Gender              98307 non-null  int64  
 7   Grade               98307 non-null  int64  
 8   SpecialDemo         98307 non-null  int64  
 9   Geography           98307 non-null  int64  
 10  SubGroup            98307 non-null  int64  
 11  RowStatus           98307 non-null  int64  
 12  Students            98307 non-null  float64
 13  EOYEnrollment       98307 non-null  float64
 14  PctOfEOYEnrollment  98307 non-null  float64
 15  FallEnrollment      98307 non-null  float64
dtypes: float

**Findings**
- There are no longer categirical columns

In [10]:
data.head()

Unnamed: 0,School Year,District Code,District,School Code,Organization,Race,Gender,Grade,SpecialDemo,Geography,SubGroup,RowStatus,Students,EOYEnrollment,PctOfEOYEnrollment,FallEnrollment
0,2015,0,38,0,212,7,0,11,1,0,1487,0,5631.0,141336.0,3.98,134932.0
1,2015,0,38,0,212,7,0,15,1,0,1713,0,4828.0,141336.0,3.42,134932.0
2,2015,0,38,0,212,7,0,12,1,0,1426,0,65185.0,141336.0,46.12,134932.0
4,2015,0,38,0,212,7,1,6,4,0,1529,0,600.0,141336.0,0.42,134932.0
6,2015,0,38,0,212,7,2,15,10,0,1711,0,2053.0,141336.0,1.45,134932.0


**Findings**
- Data has no ***null values*** nor ***categorical features*** and is therefore ready for processing.

## Data Analysis

- Check for columns found in the dataset

In [11]:
data.columns

Index(['School Year', 'District Code', 'District', 'School Code',
       'Organization', 'Race', 'Gender', 'Grade', 'SpecialDemo', 'Geography',
       'SubGroup', 'RowStatus', 'Students', 'EOYEnrollment',
       'PctOfEOYEnrollment', 'FallEnrollment'],
      dtype='object')

FallEnrollment is the dependent variable

In [12]:
X = data [['School Year', 'District Code', 'District', 'School Code',
       'Organization', 'Race', 'Gender', 'Grade', 'SpecialDemo', 'Geography',
       'SubGroup', 'RowStatus', 'Students', 'EOYEnrollment',
       'PctOfEOYEnrollment']]

y = data [['FallEnrollment']]

**Split Data into Training and Testing set**
- 80% used for training and 20% used for testing.

In [13]:
X_test, X_train, y_test, y_train = train_test_split(X,y, test_size=0.2, random_state=0)

# Linear Regression
- The dependent variable keeps on changing depending on the independent variable(s)
- Linear Regression is used to determine the relationship between the dependent and independent variables.

# Step 1: Fitting the algorithim onto the dataset

In [14]:
algorithim = LinearRegression()
algorithim.fit(X_train, y_train)

# Finding
Algorithim applied to the dataset

## Step 2: Use the Algorithim to make predictions

In [15]:
y_prediction=algorithim.predict(X_test)

# Step 3: Comparison with Actual Values

In [16]:
print("Linear Regression Prediction")
data = pd.DataFrame({"Actual Values":y_test['FallEnrollment'],"Predicted Values":y_prediction.flatten()})
data

Linear Regression Prediction


Unnamed: 0,Actual Values,Predicted Values
35839,475.0,446.998372
136318,481.0,475.970892
34705,500.0,531.697247
130457,1800.0,1811.464470
88683,9842.0,9776.692863
...,...,...
39766,1361.0,1263.199484
86600,594.0,744.155045
80494,484.0,505.730018
82122,553.0,538.619018


#Step 4: Evaluate the Performance of Model

In [17]:
# Mean Absolute Error
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_prediction))

Mean Absolute Error: 117.97414066711617


In [18]:
# Mean Squared Error
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_prediction))

Mean Squared Error: 45686.16637758277


In [19]:
# Squareroot of MeanSquared Error
print("Squareroot Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_prediction)))

Squareroot Mean Squared Error: 213.74322533727886


In [20]:
# Performance Score

print("Performance Score:", metrics.r2_score(y_test, y_prediction)*100,"%")

Performance Score: 99.99219401164086 %


# **Logistic Regression**

*   A classification algorithim used to represent multi dependent variables.



**Step 1: Fit the Algorithim on the dataset**

In [21]:
algorithim2 = LogisticRegression()
algorithim2.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**Finding**
- Algorithim loaded

**Step 2: Use the Algorithim to make predictions**

In [22]:
y_prediction2=algorithim2.predict(X_test)

**Step 3: Comparison with Actual Values**

In [23]:
print("Logistic Regression Prediction")
data2 = pd.DataFrame({"Actual Values":y_test['FallEnrollment'],"Predicted Values":y_prediction2.flatten()})
data2

Logistic Regression Prediction


Unnamed: 0,Actual Values,Predicted Values
35839,475.0,564.0
136318,481.0,564.0
34705,500.0,564.0
130457,1800.0,1948.0
88683,9842.0,15553.0
...,...,...
39766,1361.0,1948.0
86600,594.0,564.0
80494,484.0,564.0
82122,553.0,564.0


**Step 4: Evaluate the Performance of Model**

In [24]:
# Mean Absolute Error
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_prediction2))

Mean Absolute Error: 824.5463157225507


In [25]:
# Mean Squared Error
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_prediction2))

Mean Squared Error: 3180395.636810986


In [26]:
# Squareroot of MeanSquared Error
print("Squareroot Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_prediction2)))

Squareroot Mean Squared Error: 1783.3663776159362


In [52]:
#Accuracy Score
print("Accuracy Score;", accuracy_score(y_test, y_prediction2))

Accuracy Score; 0.05733358764066374


In [27]:
# Performance Score

print("Performance Score:", metrics.r2_score(y_test, y_prediction2)*100,"%")

Performance Score: 99.45659412275467 %


**Comparison Between Linear Regression and Logistic Regression**

In [28]:
Comparison_Lr_Lr = 99.99319777696202 - 99.06005779497814
print(Comparison_Lr_Lr)

0.933139981983885


Linear Regresseion model is more accurate by 0.933139981983885%

# **Decision Tree**
- Splits the dataset into smaller subsets based on their features until a more defined feature has been achieved.

**Step 1: Fit the Algorithim on the dataset**

> Add blockquote



In [29]:
algorithim3 = DecisionTreeRegressor()
algorithim3.fit(X_train, y_train)

**Step 2: Use the Algorithim to make predictions**

In [30]:
y_prediction3=algorithim3.predict(X_test)

**Step 3: Comparison with Actual Values**

In [31]:
print("Decision Tree Prediction")
data3 = pd.DataFrame({"Actual Values":y_test['FallEnrollment'],"Predicted Values":y_prediction3.flatten()})
data3

Decision Tree Prediction


Unnamed: 0,Actual Values,Predicted Values
35839,475.0,475.0
136318,481.0,481.0
34705,500.0,500.0
130457,1800.0,1800.0
88683,9842.0,9842.0
...,...,...
39766,1361.0,1361.0
86600,594.0,594.0
80494,484.0,484.0
82122,553.0,553.0


**Step 4: Evaluate the Performance of Model**

In [32]:
# Mean Absolute Error
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_prediction3))

Mean Absolute Error: 0.03103820967639392


In [33]:
# Squareroot of MeanSquared Error
print("Squareroot Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_prediction3)))

Squareroot Mean Squared Error: 1.3754752269462294


In [51]:
#Accuracy Score
print("Accuracy Score;", accuracy_score(y_test, y_prediction3))

Accuracy Score; 0.9990717782440078


In [34]:
# Performance Score

print("Performance Score:", metrics.r2_score(y_test, y_prediction3)*100,"%")

Performance Score: 99.99999967674242 %


# **Random Forest**
- Takes several decision trees and combines their results to improve the accuracy

**Step 1: Fit the Algorithim on the dataset**

In [35]:
algorithim4 = RandomForestRegressor(n_estimators=100)
algorithim4.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


**Step 2: Use the Algorithim to make predictions**

In [36]:
y_prediction4=algorithim4.predict(X_test)

**Step 3: Comparison with Actual Values**

In [37]:
print("Random Forest Prediction")
data4 = pd.DataFrame({"Actual Values":y_test['FallEnrollment'],"Predicted Values":y_prediction4.flatten()})
data4

Random Forest Prediction


Unnamed: 0,Actual Values,Predicted Values
35839,475.0,475.0
136318,481.0,481.0
34705,500.0,500.0
130457,1800.0,1800.0
88683,9842.0,9842.0
...,...,...
39766,1361.0,1361.0
86600,594.0,594.0
80494,484.0,484.0
82122,553.0,553.0


Step 4: Evaluate the Performance of Model

In [38]:
# Mean Absolute Error
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_prediction4))

Mean Absolute Error: 0.06937160658655907


In [39]:
# Squareroot of MeanSquared Error
print("Squareroot Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_prediction4)))

Squareroot Mean Squared Error: 2.1560409075291416


In [50]:
#Accuracy Score
print("Accuracy Score;", accuracy_score(y_test, y_prediction4))

Accuracy Score; 0.36465128107317696


In [40]:
# Performance Score

print("Performance Score:", metrics.r2_score(y_test, y_prediction4)*100,"%")

Performance Score: 99.99999920575009 %


Comparison of Accuracy Values Between Decision Tree and Random Forest

In [41]:
Dt_Rf = 99.999979 - 99.999903
print(Dt_Rf)

7.599999999285956e-05


# K-NEAREST NEIGHBOURS
- Looks for *similarities*



**Step 1: Fit the Algorithim on the dataset**

In [44]:
algorithim5 = KNeighborsClassifier(n_neighbors=100)
algorithim5.fit(X_train, y_train)

  return self._fit(X, y)


**Step 2: Make Predictions**

In [46]:
y_prediction5=algorithim5.predict(X_test)

**Step 3: Coparison with Actual Values**

In [47]:
print("K Nearest Neighbours")
data5 = pd.DataFrame({"Actual Values":y_test['FallEnrollment'],"Predicted Values":y_prediction5.flatten()})
data5

K Nearest Neighbours


Unnamed: 0,Actual Values,Predicted Values
35839,475.0,394.0
136318,481.0,437.0
34705,500.0,500.0
130457,1800.0,1948.0
88683,9842.0,9842.0
...,...,...
39766,1361.0,1367.0
86600,594.0,606.0
80494,484.0,425.0
82122,553.0,553.0


**Step 4: Evaluation of the Model**

**Accuracy Score**

In [48]:
print("Accuracy Score;", accuracy_score(y_test, y_prediction5))

Accuracy Score; 0.36465128107317696


**Performance Score**

In [49]:
print("Performance Score:", metrics.r2_score(y_test, y_prediction5)*100,"%")

Performance Score: 99.9853526386207 %


# **Recommendations**
- This dataset used in this machine learning exercise works best with the Decision Tree model.
- It has the highest accuracy and performance scores as follows;
*   Accuracy Score; 0.9990717782440078
*   Performance Score: 99.99999967674242 %

