# CS5228 Project - Primary School Score

Our project predicts the rental price of houses. Nearby primary schools are an important factor in rental prices, so this notebook computes a primary school score for each house.

**Important:** 
* We extracted the number of vacancies and registered students for each school from the [primary-schools-vacancy-and-registered](https://elite.com.sg/primary-schools) page.
* We rank the primary schools with a weighting, then combine each school's rank with its distance to the house to obtain the score.
* To rank the primary schools, we use the Registered/Vacancy Percentage directly as the ranking metric. Schools with higher percentages (indicating higher demand relative to availability) get higher ranks.
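As a minimal sketch of the ranking idea in the last bullet, the snippet below ranks three schools by a single percentage column with pandas. The values are copied from the first rows of the dataset shown later; the column name `demand_rank` is hypothetical, and this is not necessarily how `primary_school_score` ranks schools.

```python
import pandas as pd

# Phase 2C percentages copied from the first rows of the vacancy dataset.
demo = pd.DataFrame({
    "name": [
        "Admiralty Primary School",
        "Ahmad Ibrahim Primary School",
        "Ai Tong School",
    ],
    "2C_Registered_Vacancy_Percentage": [156, 17, 173],
})

# Higher percentage = higher demand = better (smaller) rank number.
# "demand_rank" is a hypothetical column name used only in this sketch.
demo["demand_rank"] = (
    demo["2C_Registered_Vacancy_Percentage"]
    .rank(ascending=False, method="min")
    .astype(int)
)

print(demo.sort_values("demand_rank"))
```

With `method="min"`, schools that tie on the percentage share the same rank.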

In [1]:
from primary_school_score import *

import pandas as pd

Let's have a look at the data:

In [2]:
df_primary_school = pd.read_csv('./Datasets/auxiliary-data/sg-primary-schools-vacancy-registered.csv')

df_primary_school.head()

Unnamed: 0,name,1_vacancy,1_registered,1_Registered_Vacancy_Percentage,2A_registered,2B_vacancy,2B_registered,2B_Registered_Vacancy_Percentage,2C_vacancy,2C_registered,2C_Registered_Vacancy_Percentage,2CS_vacancy,2CS_registered,2CS_Registered_Vacancy_Percentage
0,Admiralty Primary School,150,75,50,32,34,37,109,70,109,156,0,0,0
1,Ahmad Ibrahim Primary School,160,54,34,10,52,0,0,157,27,17,130,15,12
2,Ai Tong School,240,114,48,120,22,56,255,45,78,173,0,0,0
3,Alexandra Primary School,140,57,41,13,43,4,9,127,141,111,0,0,0
4,Anchor Green Primary School,180,85,47,30,42,0,0,128,36,28,93,66,71


The phases in the column names correspond to the following result announcement dates:

| Phases          | Results announcement dates  |
|-----------------|-----------------------------|
| 1               | **Tuesday, 11 July 2023**   |
| 2A              | **Friday, 21 July 2023**    |
| 2B              | **Monday, 31 July 2023**    |
| 2C              | **Tuesday, 15 August 2023** |
| 2CSupplementary | **Tuesday, 29 August 2023** |

For example, the '1_vacancy' feature is the number of vacancies offered by a primary school, as announced on Tuesday, 11 July 2023;
the '2C_registered' feature is the number of registered students, as announced on Tuesday, 15 August 2023;
the '2CS_Registered_Vacancy_Percentage' feature is the oversubscription percentage (registered divided by vacancy), as announced on Tuesday, 29 August 2023.
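The percentage columns can be reproduced from the raw counts. The sketch below assumes the percentage is `registered / vacancy * 100`, rounded to the nearest integer, and set to 0 when no vacancy is offered; this matches the rows shown above, but the helper `registered_vacancy_percentage` is hypothetical and not part of the dataset or of `primary_school_score`.

```python
import pandas as pd

# Counts copied from the first two rows of the dataset (phases 2B and 2CS).
phase = pd.DataFrame({
    "name": ["Admiralty Primary School", "Ahmad Ibrahim Primary School"],
    "2B_vacancy": [34, 52],
    "2B_registered": [37, 0],
    "2CS_vacancy": [0, 130],
    "2CS_registered": [0, 15],
})

def registered_vacancy_percentage(registered, vacancy):
    """Hypothetical helper: registered / vacancy * 100, or 0 if vacancy is 0."""
    safe_vacancy = vacancy.where(vacancy > 0)  # NaN where no vacancy is offered
    pct = (registered / safe_vacancy * 100).round()
    return pct.fillna(0).astype(int)

pct_2b = registered_vacancy_percentage(phase["2B_registered"], phase["2B_vacancy"])
pct_2cs = registered_vacancy_percentage(phase["2CS_registered"], phase["2CS_vacancy"])
print(pct_2b.tolist(), pct_2cs.tolist())
```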

### Data preprocessing

We add the rank from sg-primary-schools_2022_ranking.csv to the dataframe of primary schools.

The dataset we extracted from the website appears clean and well structured (no duplicates, no null values, etc.). However, it has one issue. In our algorithm, a low Registered/Vacancy Percentage implies low public demand for a school, so the school gets a lower rank, but there are some special cases.

For example, the values of '2CS_vacancy', '2CS_registered', and '2CS_Registered_Vacancy_Percentage' for Admiralty Primary School are all 0, but this does not mean that the school is unpopular. On the contrary, Admiralty Primary School is quite popular according to the data in phases '1', '2B' and '2C'. Its 2CS_Registered_Vacancy_Percentage is 0 because it offers no vacancies in phase '2CS' (2CS_vacancy(Admiralty Primary School) = 0). In fact, many schools offer no vacancies in phase 2CS.

In addition, the data in phase 2CS does not reflect the true popularity of a school. For example, the percentages of Dazhong Primary School in the earlier phases are only average: phase 1 is only 40% and phase 2B is 0. Its percentage then rises to 98% in phase 2C and to 2050% in phase 2CS. This is not evidence that Dazhong Primary School is popular: many children who are not accepted by their first-choice primary school settle for another option.

Therefore, we can simply remove the 2CS features, since people's first choices are likely to be more meaningful in determining a school's popularity.

In [3]:
# df_primary_school = df_primary_school.drop('2CS_vacancy', axis=1)
# df_primary_school = df_primary_school.drop('2CS_registered', axis=1)
# df_primary_school = df_primary_school.drop('2CS_Registered_Vacancy_Percentage', axis=1)
# 
# df_primary_school.head()
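The commented-out cell above drops the three 2CS columns one at a time. An equivalent sketch selects them by prefix and drops them in a single call; the toy dataframe `demo` and the names `cols_2cs` and `demo_no_2cs` are illustrative only.

```python
import pandas as pd

# Toy frame using the dataset's column names and Admiralty's phase 2C/2CS values.
demo = pd.DataFrame({
    "name": ["Admiralty Primary School"],
    "2C_vacancy": [70],
    "2C_registered": [109],
    "2CS_vacancy": [0],
    "2CS_registered": [0],
    "2CS_Registered_Vacancy_Percentage": [0],
})

# The trailing underscore keeps "2C_..." columns from matching the prefix.
cols_2cs = [c for c in demo.columns if c.startswith("2CS_")]
demo_no_2cs = demo.drop(columns=cols_2cs)

print(demo_no_2cs.columns.tolist())
```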

### Rank the school!
#### Things to note:
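We first tried using the average Registered/Vacancy Percentage across phases to calculate each school's rank. The cell below instead merges the 2022 ranking from sg-primary-schools_2022_ranking.csv into the list of schools.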

In [4]:
schools = pd.read_csv('./Datasets/auxiliary-data/sg-primary-schools.csv')
ranking = pd.read_csv('./Datasets/auxiliary-data/sg-primary-schools_2022_ranking.csv')

# Merge the rankings into the primary schools list based on school name
merged = schools.merge(ranking, left_on='name', right_on='School Name', how='left')

# Drop the 'School Name' column as it's redundant
merged.drop('School Name', axis=1, inplace=True)

# Rename the 'Rank' column to 'Ranking 2022'
merged.rename(columns={'Rank': 'Ranking 2022'}, inplace=True)

# Assign the last ranking + 1 to empty rankings
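last_ranking = merged['Ranking 2022'].max()
# Assign the result back: an in-place fillna on a selected column is deprecated in recent pandas
merged['Ranking 2022'] = merged['Ranking 2022'].fillna(last_ranking + 1)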

# Convert the rankings to integers
merged['Ranking 2022'] = merged['Ranking 2022'].astype(int)

# Save the merged dataframe
output_path = "./Datasets/auxiliary-data/sg-primary-schools_with_ranking.csv"
merged.to_csv(output_path, index=False)

# Display the first few rows of the merged dataframe
merged.head()

Unnamed: 0,name,latitude,longitude,Ranking 2022
0,Admiralty Primary School,1.454038,103.817436,8
1,Ahmad Ibrahim Primary School,1.433153,103.832942,166
2,Ai Tong School,1.360583,103.83302,10
3,Alexandra Primary School,1.291334,103.824425,120
4,Anchor Green Primary School,1.39037,103.887165,100
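Because the merge is a left join on the school name, any school whose name is spelled differently in the two files silently ends up with the fallback rank. A minimal way to check for this, assuming the same column names as above, is the `indicator=True` option of `DataFrame.merge`. The frames below are toy data, and "Hypothetical Primary School" is a made-up name.

```python
import pandas as pd

# Toy inputs: one real name from the data and one made-up name with no ranking.
schools_demo = pd.DataFrame(
    {"name": ["Ai Tong School", "Hypothetical Primary School"]}
)
ranking_demo = pd.DataFrame({"School Name": ["Ai Tong School"], "Rank": [10]})

# indicator=True adds a "_merge" column marking where each row was found.
checked = schools_demo.merge(
    ranking_demo,
    left_on="name",
    right_on="School Name",
    how="left",
    indicator=True,
)

# Rows marked "left_only" had no match in the ranking file.
unmatched = checked.loc[checked["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

Inspecting `unmatched` before calling `fillna` shows whether the fallback rank is being applied to schools that are truly unranked or to schools with mismatched names.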


In [5]:
# df_school_ranked = get_school_rank(df_primary_school)
# 
# df_school_ranked.head(20)

### Calculate the Euclidean distance and get the score
We use the Euclidean distance between houses and primary schools when predicting rent because a school's location can affect the rental price: the closer a house is to a school, the higher its rent tends to be.
Here we load two dataframes, one of primary schools and one of houses.
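As a rough illustration of the distance step, the sketch below computes all house-to-school Euclidean distances at once with NumPy broadcasting. The coordinates are copied from the tables in this notebook, the distances are in degrees of latitude/longitude rather than metres, and the variable names are illustrative; the actual computation inside `primary_school_score` is not shown here and may differ.

```python
import numpy as np

# (latitude, longitude) pairs copied from the tables in this notebook.
houses = np.array([
    [1.358411, 103.891722],  # hougang street 22
    [1.446343, 103.820817],  # sembawang vista
])
schools = np.array([
    [1.454038, 103.817436],  # Admiralty Primary School
    [1.360583, 103.833020],  # Ai Tong School
])

# Broadcasting: (n_houses, 1, 2) - (1, n_schools, 2) -> (n_houses, n_schools, 2)
diff = houses[:, None, :] - schools[None, :, :]

# Euclidean distance in degrees, one row per house and one column per school.
dist_deg = np.sqrt((diff ** 2).sum(axis=-1))

# Index of the closest school for each house.
nearest_school = dist_deg.argmin(axis=1)
print(dist_deg.round(4))
print(nearest_school)
```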

In [6]:
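# Use the same 'Datasets' casing as the path written above (matters on case-sensitive file systems)
df_loc_primary_school = pd.read_csv('./Datasets/auxiliary-data/sg-primary-schools_with_ranking.csv')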

df_loc_primary_school.head()

Unnamed: 0,name,latitude,longitude,Ranking 2022
0,Admiralty Primary School,1.454038,103.817436,8
1,Ahmad Ibrahim Primary School,1.433153,103.832942,166
2,Ai Tong School,1.360583,103.83302,10
3,Alexandra Primary School,1.291334,103.824425,120
4,Anchor Green Primary School,1.39037,103.887165,100


In [7]:
df_loc_house = pd.read_csv('./Datasets/test.csv')

df_loc_house.head()

Unnamed: 0,rent_approval_date,town,block,street_name,flat_type,flat_model,floor_area_sqm,furnished,lease_commence_date,latitude,longitude,elevation,subzone,planning_area,region
0,2023-01,hougang,245,hougang street 22,5-room,improved,121.0,yes,1984,1.358411,103.891722,0.0,lorong ah soo,hougang,north-east region
1,2022-09,sembawang,316,sembawang vista,4-room,model a,100.0,yes,1999,1.446343,103.820817,0.0,sembawang central,sembawang,north region
2,2023-07,clementi,708,Clementi West Street 2,4-room,new generation,91.0,yes,1980,1.305719,103.762168,0.0,clementi west,clementi,west region
3,2021-08,jurong east,351,Jurong East Street 31,3 room,model a,74.0,yes,1986,1.344832,103.730778,0.0,yuhua west,jurong east,west region
4,2022-03,jurong east,305,jurong east street 32,5-room,improved,121.0,yes,1983,1.345437,103.735241,0.0,yuhua west,jurong east,west region


Given the dataframe of houses (here loaded from test.csv) and the dataframe of school locations, we can calculate the final score by combining the school rank with the Euclidean distances between primary schools and houses.
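The implementation of `get_score_optimized` lives in `primary_school_score` and is not reproduced in this notebook. Purely as an illustration of how a rank and a distance could be combined, the sketch below counts the schools within a radius of each house and sums a rank-based weight discounted by distance. The function `sketch_school_score`, the `radius_deg` parameter, the weighting formula, and the `*_sketch` column names are all assumptions, not the project's actual method.

```python
import numpy as np
import pandas as pd

def sketch_school_score(df_house, df_school, radius_deg=0.02):
    """Hypothetical score: rank-weighted, distance-discounted sum over nearby schools."""
    h = df_house[["latitude", "longitude"]].to_numpy()
    s = df_school[["latitude", "longitude"]].to_numpy()

    # Pairwise Euclidean distances in degrees, shape (n_houses, n_schools).
    dist = np.sqrt(((h[:, None, :] - s[None, :, :]) ** 2).sum(axis=-1))

    # Better-ranked schools (smaller rank number) get larger weights (assumed formula).
    rank = df_school["Ranking 2022"].to_numpy()
    weight = rank.max() + 1 - rank

    # Only schools within the radius contribute; closer schools contribute more.
    nearby = dist <= radius_deg
    contrib = np.where(nearby, weight / (1.0 + dist / radius_deg), 0.0)

    out = df_house.copy()
    out["school_count_sketch"] = nearby.sum(axis=1)
    out["school_score_sketch"] = contrib.sum(axis=1)
    return out

# Rows copied from the tables above.
schools_demo = pd.DataFrame({
    "name": ["Admiralty Primary School", "Ai Tong School"],
    "latitude": [1.454038, 1.360583],
    "longitude": [103.817436, 103.83302],
    "Ranking 2022": [8, 10],
})
houses_demo = pd.DataFrame({
    "street_name": ["sembawang vista", "hougang street 22"],
    "latitude": [1.446343, 1.358411],
    "longitude": [103.820817, 103.891722],
})

scored_demo = sketch_school_score(houses_demo, schools_demo)
print(scored_demo[["street_name", "school_count_sketch", "school_score_sketch"]])
```

In this toy example the Sembawang house has one school within the radius and the Hougang house has none, so its sketch score is 0.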

In [8]:
# df_loc_house = get_score(df_loc_primary_school, df_loc_house)
df_loc_house = get_score_optimized(df_loc_primary_school, df_loc_house)

df_loc_house.head(20)

Unnamed: 0,rent_approval_date,town,block,street_name,flat_type,flat_model,floor_area_sqm,furnished,lease_commence_date,latitude,longitude,elevation,subzone,planning_area,region,school_score,school_count
0,2023-01,hougang,245,hougang street 22,5-room,improved,121.0,yes,1984,1.358411,103.891722,0.0,lorong ah soo,hougang,north-east region,376.7,7.0
1,2022-09,sembawang,316,sembawang vista,4-room,model a,100.0,yes,1999,1.446343,103.820817,0.0,sembawang central,sembawang,north region,312.0,8.0
2,2023-07,clementi,708,Clementi West Street 2,4-room,new generation,91.0,yes,1980,1.305719,103.762168,0.0,clementi west,clementi,west region,255.4,4.0
3,2021-08,jurong east,351,Jurong East Street 31,3 room,model a,74.0,yes,1986,1.344832,103.730778,0.0,yuhua west,jurong east,west region,284.8,8.0
4,2022-03,jurong east,305,jurong east street 32,5-room,improved,121.0,yes,1983,1.345437,103.735241,0.0,yuhua west,jurong east,west region,287.8,5.0
5,2022-01,clementi,701,west coast road,3-room,new generation,67.0,yes,1980,1.307637,103.761326,0.0,clementi west,clementi,west region,258.4,4.0
6,2021-05,punggol,643,punggol central,5 room,premium apartment,110.0,yes,2005,1.397319,103.916019,0.0,waterway east,punggol,north-east region,325.1,11.0
7,2022-08,tampines,515b,Tampines Central 7,5-room,dbss,108.0,yes,2008,1.356968,103.938668,0.0,tampines east,tampines,east region,336.8,11.0
8,2021-12,jurong west,604,Jurong West Street 62,executive,premium apartment,133.0,yes,2001,1.338605,103.700079,0.0,jurong west central,jurong west,west region,300.1,9.0
9,2023-07,marine parade,14,marine terrace,5-room,standard,120.0,yes,1975,1.303476,103.915412,0.0,marine parade,marine parade,central region,240.9,6.0


In [9]:
output_path = "./Datasets/auxiliary-data/test_with_schoolscore.csv"
df_loc_house.to_csv(output_path, index=False)