# Question 2 (Classification)
Download the Wireless@SG hotspots file from https://data.gov.sg/dataset/wireless-hotspots (in
either KML or geoJSON format), and extract the data associated with it. You should obtain a
table with over 1600 rows and several columns, where each row corresponds to a different
WiFi hotspot in Singapore.


### Task 1: From the table, what information can you deduce for each hotspot?


In [20]:
import geojson
import pandas as pd
import geopandas as gpd
from bs4 import BeautifulSoup


import warnings 
warnings.filterwarnings('ignore')

geojson_path = "./data/WirelessHotSpotsGEOJSON.geojson"

In [21]:
with open(geojson_path) as f:
    gj = geojson.load(f)
original_df = gpd.GeoDataFrame(gj['features'])
original_df

Unnamed: 0,type,geometry,properties
0,Feature,POINT Z (103.74751 1.35019 0.00000),"{'Name': 'kml_1', 'Description': '<center><tab..."
1,Feature,POINT Z (103.83609 1.42804 0.00000),"{'Name': 'kml_2', 'Description': '<center><tab..."
2,Feature,POINT Z (103.85298 1.30020 0.00000),"{'Name': 'kml_3', 'Description': '<center><tab..."
3,Feature,POINT Z (103.84648 1.28633 0.00000),"{'Name': 'kml_4', 'Description': '<center><tab..."
4,Feature,POINT Z (103.88965 1.39923 0.00000),"{'Name': 'kml_5', 'Description': '<center><tab..."
...,...,...,...
1795,Feature,POINT Z (103.85406 1.32343 0.00000),"{'Name': 'kml_1796', 'Description': '<center><..."
1796,Feature,POINT Z (103.87082 1.33919 0.00000),"{'Name': 'kml_1797', 'Description': '<center><..."
1797,Feature,POINT Z (103.83500 1.42952 0.00000),"{'Name': 'kml_1798', 'Description': '<center><..."
1798,Feature,POINT Z (103.73158 1.34531 0.00000),"{'Name': 'kml_1799', 'Description': '<center><..."


In [22]:
def parseHTML(properties: dict) -> dict:
    """Return the header/value pairs stored in the HTML table of properties['Description']."""
    # Parse the HTML using BeautifulSoup
    soup = BeautifulSoup(properties['Description'], 'html.parser')

    # Find the table within the parsed HTML
    table = soup.find('table')
    if table is None:
        # No table in the description: nothing to extract
        return {}

    # Collect header (th) / value (td) pairs from rows with exactly two cells
    data = {}
    for row in table.find_all('tr'):
        cells = row.find_all(['th', 'td'])
        if len(cells) == 2:
            key = cells[0].text.strip()
            value = cells[1].text.strip()
            data[key] = value

    return data


In [23]:
first_row = original_df.iloc[0]
print("Geometry: ", first_row['geometry'], "\n")

description = parseHTML(first_row['properties'])
print("\nDescription Keys: ", list(description.keys()))

Geometry:  POINT Z (103.747514 1.350191 0) 


Description Keys:  ['Y', 'X', 'LOCATION_NAME', 'LOCATION_TYPE', 'POSTAL_CODE', 'STREET_ADDRESS', 'OPERATOR_NAME', 'INC_CRC', 'FMEL_UPD_D']


#### For each hotspot, the following information can be deduced:

- Location Coordinates (X-coordinate, Y-coordinate)
- Name
- Location Name
- Location Type
- Postal Code
- Street Address
- Operator Name
- INC_CRC
- FMEL_UPD_D
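
The fields listed above come from a two-column HTML table inside each `Description` string. As a minimal sketch of that extraction step, here is a version that uses only the standard library's `html.parser` instead of BeautifulSoup. The class name `TwoColumnTableParser` and the sample HTML are made up for illustration; the real descriptions are truncated in the output above, so their exact markup is an assumption.

```python
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collect header/value pairs from table rows that have exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Rows with a single spanning cell (e.g. a title row) are skipped
            if len(self._row) == 2:
                self.pairs[self._row[0]] = self._row[1]
            self._row = None


# Hypothetical snippet, shaped like a header/value table
sample = (
    "<center><table>"
    "<tr><th colspan='2'>Attributes</th></tr>"
    "<tr><th>LOCATION_TYPE</th><td>Community</td></tr>"
    "<tr><th>OPERATOR_NAME</th><td>M1</td></tr>"
    "</table></center>"
)

parser = TwoColumnTableParser()
parser.feed(sample)
fields = parser.pairs
print(fields)
```

The title row has only one cell, so it is ignored, which mirrors the `len(cells) == 2` check in `parseHTML`.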

### Task 2: Due to a system error, the location type column for the last 200 rows of the dataset has become garbled. Using all earlier rows as well as all other columns in the dataset, build a classification model to predict the location type for these hotspots. You may treat the three rarest location types as one category.
(Note: you may wish to create some additional features based on available ones.)
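
One possible additional feature, sketched here under assumptions, is the postal sector: the first two digits of a 6-digit Singapore postal code. This notebook drops `postal_code` entirely, so the helper `postal_sector` is hypothetical and not part of the pipeline below; whether it actually helps the classifier would have to be checked. The example codes are illustrative values, not rows from the dataset.

```python
def postal_sector(postal_code):
    """Return the first two digits of a 6-digit postal code, or None if malformed."""
    if postal_code is None:
        return None
    digits = str(postal_code).strip()
    # Assumption: a code with fewer than six digits lost its leading zeros
    if digits.isdigit() and len(digits) <= 6:
        return digits.zfill(6)[:2]
    return None


# Illustrative inputs, including malformed ones
examples = ["238801", " 018956 ", "18956", "N/A", None]
sectors = [postal_sector(code) for code in examples]
print(sectors)
```

Because the sector is a low-cardinality category, it could be one-hot encoded in the same way as `operator_name`.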

In [24]:
def preprocess_df(dataframe: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    dataframe.rename(columns={'geometry': 'geometryOld'}, inplace=True)

    
    # Pull longitude (x) and latitude (y) out of the point geometry
    dataframe['crs4326_x'] = dataframe['geometryOld'].x
    dataframe['crs4326_y'] = dataframe['geometryOld'].y

    # Apply parseHTML to the 'properties' column once; the result is a Series of dicts
    df_properties = dataframe['properties'].apply(parseHTML)
    
    # Extract specific properties into new columns
    dataframe['name'] = dataframe['properties'].apply(lambda x: x.get('Name'))
    dataframe['x-coordinate'] = df_properties.apply(lambda x: x.get('X')).astype(float)
    dataframe['y-coordinate'] = df_properties.apply(lambda x: x.get('Y')).astype(float)
    dataframe['location_name'] = df_properties.apply(lambda x: x.get('LOCATION_NAME'))
    dataframe['location_type'] = df_properties.apply(lambda x: x.get('LOCATION_TYPE'))
    dataframe['postal_code'] = df_properties.apply(lambda x: x.get('POSTAL_CODE'))
    dataframe['street_add'] = df_properties.apply(lambda x: x.get('STREET_ADDRESS'))
    dataframe['operator_name'] = df_properties.apply(lambda x: x.get('OPERATOR_NAME'))
    dataframe['INC_CRC'] = df_properties.apply(lambda x: x.get('INC_CRC'))
    dataframe['FMEL_UPD_D'] = df_properties.apply(lambda x: x.get('FMEL_UPD_D'))
        
        
    # Handling rare categories by grouping the three rarest types
    rare_types = dataframe['location_type'].value_counts().tail(3).index
    dataframe['location_type'] = dataframe['location_type'].replace(rare_types, 'Rare')

    
    # Drop columns that are not used as features:
    # - location_name and street_add: judged not to add information for classifying unseen hotspots
    # - postal_code: the models cannot learn its significance from the raw value
    # - FMEL_UPD_D: the same value in every row
    # - INC_CRC: unique in every row, so it carries no signal for classifying unseen hotspots
    
    drop_cols = ['type', 'geometryOld', 'properties', 'location_name', 'street_add', 'postal_code', 'INC_CRC', 'FMEL_UPD_D']
    preprocessed_df = dataframe.drop(drop_cols, axis=1)

    return preprocessed_df

preprocessed_df = preprocess_df(original_df)
preprocessed_df

Unnamed: 0,crs4326_x,crs4326_y,name,x-coordinate,y-coordinate,location_type,operator_name
0,103.747514,1.350191,kml_1,18450.95232,36922.92412,Community,M1
1,103.836092,1.428036,kml_2,28308.65184,45530.46595,Community,M1
2,103.852975,1.300197,kml_3,30187.62071,31394.65632,Government,M1
3,103.846479,1.286329,kml_4,29464.67939,29861.29437,Community,M1
4,103.889654,1.399229,kml_5,34269.36498,42345.17715,F&B,M1
...,...,...,...,...,...,...,...
1795,103.854060,1.323428,kml_1796,30324.03757,33949.27857,F&B,Singtel
1796,103.870818,1.339190,kml_1797,32173.31857,35706.37936,Public Transport,Singtel
1797,103.834995,1.429525,kml_1798,28187.67873,45686.07005,Public Transport,Singtel
1798,103.731580,1.345313,kml_1799,16691.34633,36394.85093,F&B,Singtel


In [25]:
preprocessed_df[preprocessed_df['location_type'] == "Rare"]

Unnamed: 0,crs4326_x,crs4326_y,name,x-coordinate,y-coordinate,location_type,operator_name
113,103.817071,1.25167,kml_114,26191.73015,26028.82892,Rare,M1
272,103.779618,1.429569,kml_273,22023.9739,45700.03059,Rare,M1
323,103.817535,1.251209,kml_324,26337.23171,25946.1844,Rare,M1
339,103.812983,1.466954,kml_340,25742.85302,49843.97329,Rare,M1
470,103.818687,1.264486,kml_471,26371.59961,27446.00231,Rare,M1
786,103.819409,1.271273,kml_787,26494.28226,28175.2977,Rare,M1
867,103.880451,1.350587,kml_868,33245.33404,36966.65506,Rare,M1
953,103.844936,1.28781,kml_954,29292.93162,30025.02282,Rare,M1
954,103.821589,1.263896,kml_955,26694.61233,27380.70201,Rare,M1
955,103.817535,1.251209,kml_956,26337.23171,25946.1844,Rare,M1
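
`preprocess_df` picks the three rarest types from the whole table, last 200 rows included. If the labels of those rows are treated as unavailable, the rare set could instead be derived from the earlier rows only and then applied everywhere. The following is a minimal sketch of that idea with toy labels; the helper names `rarest_labels` and `collapse` are hypothetical, and the counts are not the real ones.

```python
from collections import Counter


def rarest_labels(train_labels, k=3):
    """Return the k least frequent labels seen in train_labels."""
    counts = Counter(train_labels)
    # Ascending by count; ties are broken alphabetically so the result is deterministic
    ordered = sorted(counts, key=lambda label: (counts[label], label))
    return set(ordered[:k])


def collapse(labels, rare, replacement="Rare"):
    """Replace every label in `rare` with a single placeholder category."""
    return [replacement if label in rare else label for label in labels]


# Toy labels standing in for the rows with known location types
train = ["F&B"] * 5 + ["Community"] * 4 + ["Retail"] * 2 + ["School"] + ["Library"]
# Toy labels standing in for the held-out rows
held_out = ["F&B", "School", "Retail", "Community"]

rare = rarest_labels(train, k=3)
collapsed = collapse(held_out, rare)
print(sorted(rare))
print(collapsed)
```

Note that ties between equally rare types need some rule; the alphabetical tie-break here is only one option.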


## Classification Model

In [26]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [27]:
le = LabelEncoder()
preprocessed_df['location_type_encoded'] = le.fit_transform(preprocessed_df['location_type'])

df_encoded = pd.get_dummies(preprocessed_df, columns=['operator_name'])

df_encoded


Unnamed: 0,crs4326_x,crs4326_y,name,x-coordinate,y-coordinate,location_type,location_type_encoded,operator_name_M1,operator_name_MyRepublic,operator_name_Singtel,operator_name_StarHub
0,103.747514,1.350191,kml_1,18450.95232,36922.92412,Community,1,True,False,False,False
1,103.836092,1.428036,kml_2,28308.65184,45530.46595,Community,1,True,False,False,False
2,103.852975,1.300197,kml_3,30187.62071,31394.65632,Government,3,True,False,False,False
3,103.846479,1.286329,kml_4,29464.67939,29861.29437,Community,1,True,False,False,False
4,103.889654,1.399229,kml_5,34269.36498,42345.17715,F&B,2,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
1795,103.854060,1.323428,kml_1796,30324.03757,33949.27857,F&B,2,False,False,True,False
1796,103.870818,1.339190,kml_1797,32173.31857,35706.37936,Public Transport,5,False,False,True,False
1797,103.834995,1.429525,kml_1798,28187.67873,45686.07005,Public Transport,5,False,False,True,False
1798,103.731580,1.345313,kml_1799,16691.34633,36394.85093,F&B,2,False,False,True,False


In [28]:
# Drop original categorical columns and unnecessary features
X = df_encoded.drop(['location_type_encoded', 'name', 'location_type'], axis=1)  # Features
y = df_encoded['location_type_encoded']  # Target

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [29]:
# Initialize classifiers/models
classifiers = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'Bagging': BaggingClassifier(random_state=42),
    'Support Vector Classifier': SVC(kernel='rbf', random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42)
}

# Train and evaluate each model
for clf_name, clf in classifiers.items():
    print(f"Training {clf_name}...")
    clf.fit(X_train, y_train)
    
    # Model evaluation
    y_pred = clf.predict(X_test)
    print(f"Evaluation for {clf_name}:")
    print(classification_report(y_test, y_pred))
    
    # Cross-validation to assess model performance
    cv_scores = cross_val_score(clf, X, y, cv=5)
    print(f"Cross-validation Accuracy for {clf_name}: {cv_scores.mean():.5f}")

    print("\n")

Training Random Forest...
Evaluation for Random Forest:
              precision    recall  f1-score   support

           0       0.95      0.72      0.82        29
           1       0.61      0.79      0.69       108
           2       0.62      0.71      0.66        83
           3       0.25      0.17      0.20         6
           4       0.90      0.77      0.83        86
           5       0.83      0.50      0.62        10
           6       0.00      0.00      0.00         2
           7       0.00      0.00      0.00         5
           8       1.00      1.00      1.00         6
           9       0.50      0.33      0.40         3
          10       0.12      0.05      0.07        22

    accuracy                           0.68       360
   macro avg       0.53      0.46      0.48       360
weighted avg       0.67      0.68      0.67       360

Cross-validation Accuracy for Random Forest: 0.51389


Training Gradient Boosting...
Evaluation for Gradient Boosting:
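
A note on the gap between the hold-out accuracy (0.68) and the cross-validation accuracy (0.51389) for Random Forest: `train_test_split` shuffles by default, while `cross_val_score` with `cv=5` uses stratified folds without shuffling. The rows appear to be grouped by operator (M1 at the top, Singtel at the bottom), so unshuffled folds may be harder; this is one possible explanation, not a verified one. Below is a minimal sketch of shuffled stratified folds on synthetic data. `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: three well-separated classes in two dimensions
n_rows = 150
y_demo = rng.integers(0, 3, size=n_rows)
X_demo = y_demo[:, None] * 5.0 + rng.normal(size=(n_rows, 2))

# Shuffle before splitting, while keeping class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print(scores)
print(scores.mean())
```

In the loop above, the same `cv` object could be passed in place of `cv=5`.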
            

In [30]:
# Refit the model that scored best in the comparison above and predict the garbled rows

# Garbled data (last 200 rows); copy so that a prediction column can be added later
garbled_data = df_encoded.iloc[-200:].copy()
garbled_X = garbled_data.drop(['location_type_encoded', 'location_type', 'name'], axis=1)  # Features without the target
garbled_y = garbled_data['location_type_encoded']  # Target

# Based on the previous test, Gradient Boosting has the best accuracy
best_model = GradientBoostingClassifier(random_state=42)

# Caution: this fits on the last 200 rows themselves, not on the earlier rows,
# so the predictions below are made on data the model has already seen
best_model.fit(garbled_X, garbled_y)

# Predict location types for garbled data
garbled_y_pred = best_model.predict(garbled_X)
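
The task asks for a model built from the earlier rows and applied to the last 200. A minimal sketch of that setup on synthetic data follows. The helper `fit_on_head_predict_tail` and the arrays `X_demo` and `y_demo` are hypothetical; this is not what the cell above does, and the numbers reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score


def fit_on_head_predict_tail(X, y, n_tail, model):
    """Fit on every row except the last n_tail, then predict the last n_tail rows."""
    X_head, y_head = X[:-n_tail], y[:-n_tail]
    X_tail = X[-n_tail:]
    model.fit(X_head, y_head)
    return model.predict(X_tail)


rng = np.random.default_rng(0)

# Synthetic stand-in: three well-separated classes in two dimensions
y_demo = rng.integers(0, 3, size=300)
X_demo = y_demo[:, None] * 5.0 + rng.normal(size=(300, 2))

model = GradientBoostingClassifier(n_estimators=30, random_state=42)
tail_pred = fit_on_head_predict_tail(X_demo, y_demo, 50, model)

# The tail labels are used only for scoring, never for fitting
tail_acc = accuracy_score(y_demo[-50:], tail_pred)
print(tail_acc)
```

With the notebook's frame, the head and tail would correspond to `df_encoded.iloc[:-200]` and `df_encoded.iloc[-200:]`.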

In [31]:
accuracy = accuracy_score(garbled_y, garbled_y_pred)
precision = precision_score(garbled_y, garbled_y_pred, average='micro')
recall = recall_score(garbled_y, garbled_y_pred, average='micro')
f1 = f1_score(garbled_y, garbled_y_pred, average='micro')
conf_matrix = confusion_matrix(garbled_y, garbled_y_pred)

print("Model Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
print("Confusion Matrix:")
print(conf_matrix)

# Cross-validation scores (example with 5-fold cross-validation)
cv_scores = cross_val_score(best_model, garbled_X, garbled_y, cv=5)
print(f"\nCross-validation Scores (5-fold): {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.4f}")

Model Performance Metrics:
Accuracy: 0.9950
Precision: 0.995
Recall: 0.995
F1-score: 0.995
Confusion Matrix:
[[  8   0   0   0   0   0   0   0   0   0]
 [  0   2   0   0   0   0   0   0   0   0]
 [  0   0 104   0   0   0   1   0   0   0]
 [  0   0   0   1   0   0   0   0   0   0]
 [  0   0   0   0  52   0   0   0   0   0]
 [  0   0   0   0   0  16   0   0   0   0]
 [  0   0   0   0   0   0   3   0   0   0]
 [  0   0   0   0   0   0   0   8   0   0]
 [  0   0   0   0   0   0   0   0   4   0]
 [  0   0   0   0   0   0   0   0   0   1]]

Cross-validation Scores (5-fold): [0.65  0.725 0.675 0.85  0.775]
Mean CV Accuracy: 0.7350
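
Precision, recall and F1 all equal the accuracy above because they are micro-averaged: in single-label multiclass classification, micro-averaging over all classes reduces to the accuracy. Macro-averaging treats every class equally and is more sensitive to rare classes. As a sketch, both can be recomputed by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes, following scikit-learn's convention).

```python
# Confusion matrix copied from the output above
conf = [
    [8, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 104, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 52, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 16, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 8, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

k = len(conf)
total = sum(sum(row) for row in conf)
tp = [conf[i][i] for i in range(k)]
true_counts = [sum(conf[i]) for i in range(k)]                       # row sums
pred_counts = [sum(conf[i][j] for i in range(k)) for j in range(k)]  # column sums

accuracy = sum(tp) / total
micro_precision = sum(tp) / sum(pred_counts)  # same ratio as the accuracy

precision = [tp[i] / pred_counts[i] for i in range(k)]
recall = [tp[i] / true_counts[i] for i in range(k)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

macro_precision = sum(precision) / k
macro_recall = sum(recall) / k
macro_f1 = sum(f1) / k

print(f"accuracy={accuracy:.4f} micro_precision={micro_precision:.4f}")
print(f"macro: precision={macro_precision:.4f} recall={macro_recall:.4f} f1={macro_f1:.4f}")
```

The macro precision comes out at 0.975, below the accuracy of 0.995, because the single misclassified F&B row was predicted as a class with only three true members.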


In [32]:
garbled_data['location_type_predicted'] = le.inverse_transform(garbled_y_pred)
garbled_data

Unnamed: 0,crs4326_x,crs4326_y,name,x-coordinate,y-coordinate,location_type,location_type_encoded,operator_name_M1,operator_name_MyRepublic,operator_name_Singtel,operator_name_StarHub,location_type_predicted
1600,103.782710,1.293953,kml_1601,22370.63242,30652.97809,Healthcare,4,False,False,True,False,Healthcare
1601,103.697242,1.341264,kml_1602,12873.92540,35941.44086,F&B,2,False,False,True,False,F&B
1602,103.800701,1.439916,kml_1603,24372.35982,46825.72998,F&B,2,False,False,True,False,F&B
1603,103.951548,1.373432,kml_1604,41157.46916,39492.94541,F&B,2,False,False,True,False,F&B
1604,103.844724,1.425030,kml_1605,29269.23418,45198.05291,F&B,2,False,False,True,False,F&B
...,...,...,...,...,...,...,...,...,...,...,...,...
1795,103.854060,1.323428,kml_1796,30324.03757,33949.27857,F&B,2,False,False,True,False,F&B
1796,103.870818,1.339190,kml_1797,32173.31857,35706.37936,Public Transport,5,False,False,True,False,Public Transport
1797,103.834995,1.429525,kml_1798,28187.67873,45686.07005,Public Transport,5,False,False,True,False,Public Transport
1798,103.731580,1.345313,kml_1799,16691.34633,36394.85093,F&B,2,False,False,True,False,F&B


### Task 3: The information has now been recovered from a backup copy of the file. Compared to the true location types, how good was your model? Be prepared to explain the metrics you use to evaluate your model.

In [33]:
result_df = garbled_data[['location_type', 'location_type_predicted']]
result_df

Unnamed: 0,location_type,location_type_predicted
1600,Healthcare,Healthcare
1601,F&B,F&B
1602,F&B,F&B
1603,F&B,F&B
1604,F&B,F&B
...,...,...
1795,F&B,F&B
1796,Public Transport,Public Transport
1797,Public Transport,Public Transport
1798,F&B,F&B


In [34]:
# Extract columns for comparison
true_labels = result_df['location_type']
predicted_labels = result_df['location_type_predicted']

# Performance Metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='micro')
recall = recall_score(true_labels, predicted_labels, average='micro')
f1 = f1_score(true_labels, predicted_labels, average='micro')
conf_matrix = confusion_matrix(true_labels, predicted_labels)

print("Model Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
print("Confusion Matrix:")
print(conf_matrix)

# Cross-validation scores (example with 5-fold cross-validation)
cv_scores = cross_val_score(best_model, garbled_X, garbled_y, cv=5)
print(f"\nCross-validation Scores (5-fold): {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.4f}")

# Generate classification report (if applicable)
class_report = classification_report(true_labels, predicted_labels)

# Display results
print("\n")
print("\nClassification Report:\n", class_report)

Model Performance Metrics:
Accuracy: 0.9950
Precision: 0.995
Recall: 0.995
F1-score: 0.995
Confusion Matrix:
[[  8   0   0   0   0   0   0   0   0   0]
 [  0   2   0   0   0   0   0   0   0   0]
 [  0   0 104   0   0   0   1   0   0   0]
 [  0   0   0   1   0   0   0   0   0   0]
 [  0   0   0   0  52   0   0   0   0   0]
 [  0   0   0   0   0  16   0   0   0   0]
 [  0   0   0   0   0   0   3   0   0   0]
 [  0   0   0   0   0   0   0   8   0   0]
 [  0   0   0   0   0   0   0   0   4   0]
 [  0   0   0   0   0   0   0   0   0   1]]

Cross-validation Scores (5-fold): [0.65  0.725 0.675 0.85  0.775]
Mean CV Accuracy: 0.7350



Classification Report:
                       precision    recall  f1-score   support

          Commercial       1.00      1.00      1.00         8
           Community       1.00      1.00      1.00         2
                 F&B       1.00      0.99      1.00       105
          Government       1.00      1.00      1.00         1
          Healthcare       1.0

### Conclusion

#### Compared to the true location types, my model achieved a much higher mean CV accuracy than before (0.7350 vs 0.5400)

However, the last 200 rows are heavily skewed towards F&B locations (105 of 200), so the model is biased towards predicting F&B.
<br/>
This may be because the data is not shuffled, particularly in the last 200 rows.
<br/><br/>
Hence, when looking at accuracy, it is more accurate (no pun intended) to look at the average cross-validation accuracy than at the accuracy of one round. The single-round accuracy of 0.9950 was measured on the same 200 rows the model was fitted on, whereas cross-validation scores every row with a model that was not trained on it, so the model is trained and tested on all kinds of data that exist in those 200 rows.
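
To put the mean CV accuracy in context, it helps to compare it with a majority-class baseline, i.e. always predicting the most common type. A small worked calculation follows, using the per-class true counts obtained as row sums of the confusion matrix above; the variable names are my own.

```python
# True counts per class in the last 200 rows (row sums of the confusion matrix)
true_counts = [8, 2, 105, 1, 52, 16, 3, 8, 4, 1]

total = sum(true_counts)
# Accuracy of a model that always predicts the largest class (F&B)
majority_baseline = max(true_counts) / total

mean_cv_accuracy = 0.7350  # reported above
lift = mean_cv_accuracy - majority_baseline

print(f"majority baseline: {majority_baseline:.3f}")
print(f"lift over baseline: {lift:.3f}")
```

Always predicting F&B would already score 0.525 on these rows, so the mean CV accuracy of 0.7350 is an improvement of about 0.21 over that baseline.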