<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Data Analysis of HDB Housing Prices in Singapore

---
## Part 1: Background, Overview & Data Cleaning
---

## Contents
---
- [Problem Statement](#Problem-Statement)
- [Overview of Process](#Overview-of-Process)
- [Summary of Key Features](#Summary-of-Key-Features)
- [Data Cleaning](#Data-Cleaning)

---
## Problem Statement
---

House hunting in Singapore can be a tiring ordeal, especially when one is looking to both buy _and_ sell. The problem is especially pronounced for working adults who have to manage it alongside a full-time job and, for some, caring for family members. This project aims to identify the key features affecting resale prices of public housing (HDB flats) in Singapore so that prospective buyers and sellers can avoid wasting time and effort on the process and can gauge whether quoted prices are reasonable. A linear model will be built and evaluated to check that the selected features are appropriate and informative.

---
## Overview of Process
---

#### Data Cleaning

Standard data cleaning was carried out for both the `train.csv` and `test.csv` files. The most notable part of this process is the method used to deal with null values, which is shown in the Data Cleaning section.

#### Features Selection & Exploratory Data Analysis (EDA)

With 76 features (excluding `resale_price`) in the original dataset, features that are less relevant or weaker predictors of resale price had to be filtered out. During this process, an additional feature, derived from the original features, was added to improve accuracy.

In addition, it became apparent that a separate model for the most expensive resale flats might be useful. Thus, two models - high-end and regular - will be built.

#### Modelling

Based on the above, the dataset was split into 2 parts - 'high-end outliers' and the remaining flats. For each part, predictive models were built using linear regression, ridge regression and LASSO regression. For each model, a set of resale price predictions for `test.csv` was generated and saved as a CSV file. These files were then submitted to Kaggle.

Finally, the models were evaluated based on their $R^2$ scores (only available for `train.csv`) and RMSE values. Based on these, the final selection of models was made.
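
As a rough illustration of the two metrics, the sketch below computes $R^2$ and RMSE by hand with NumPy. The helper names `r_squared` and `rmse` and the prices are made up for this example; they are not taken from the modelling notebooks, which may well use library functions instead.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (dollars)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical resale prices and predictions (illustration only).
actual = [300_000, 450_000, 520_000, 610_000]
predicted = [310_000, 440_000, 500_000, 630_000]

print(f'R^2:  {r_squared(actual, predicted):.4f}')
print(f'RMSE: {rmse(actual, predicted):.2f}')
```

A higher $R^2$ means more of the variation in price is explained, while RMSE gives the typical size of a prediction error in dollars, so the two are read together.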

#### Conclusion & Recommendations

Based on the analysis and model results, recommendations are given to the target audience: prospective flat owners and buyers.

---
## Summary of Key Features
---

This is a summary of features that were used in the models. Out of 76 predictive features, 14 were selected due to their relevance and impact on resale prices.

Justifications for the selection of these features, as well as omission of the rest, are provided in the file `02_Feature-Engineering.ipynb` that is located within the same folder.

| Feature | Type | Description | No. Labels (Categorical) | Unit of Measurement |
| --- | --- | --- | --- | --- |
| `flat_type` | Categorical | Flat type of resale flat. | 7 | - |
| `flat_model` | Categorical | Flat model of resale flat. | 20 | - |
| `Tranc_Year` | Categorical | Year that the resale flat was transacted. | 10 | - |
| `planning_area` | Categorical | Planning area that resale flat is located in. | 32 | - |
| `bus_interchange` | Categorical | Boolean value denoting if the resale flat's nearest MRT station has a bus interchange. | 2 | - |
| `mrt_interchange` | Categorical | Boolean value denoting if the resale flat's nearest MRT station is an MRT interchange. | 2 | - |
| `floor_area_sqm` | Numerical | Floor area of the resale flat. | - | square metres ($m^2$) |
| `mid` | Numerical | Middle value of the range of levels that the resale flat is located in; used as an approximation of its level. | - | - |
| `max_floor_lvl` | Numerical | Number of levels in the resale flat's block. | - | - |
| `Mall_Nearest_Distance` | Numerical | Distance from the resale flat to the nearest mall. | - | metres (m) |
| `Mall_Within_500m` | Numerical | Number of malls within 500m of the resale flat. | - | - |
| `Hawker_Nearest_Distance` | Numerical | Distance from the resale flat to the nearest hawker centre. | - | metres (m) |
| `mrt_nearest_distance` | Numerical | Distance from the resale flat to the nearest MRT station. | - | metres (m) |
| `hdb_age_at_tranc` | Numerical | Age of the resale flat as at transaction. | - | years |
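
To make the `mid` feature concrete, here is a minimal sketch of how a midpoint could be derived from a storey range. It assumes that `storey_range` values are strings of the form `'10 TO 12'`; that format, the helper `storey_midpoint` and the column `mid_sketch` are assumptions for illustration, not the way the dataset's own `mid` column was produced.

```python
import pandas as pd


def storey_midpoint(storey_range: str) -> float:
    """Return the middle of a range such as '10 TO 12' (assumed format)."""
    lower, upper = (int(part) for part in storey_range.split(' TO '))
    return (lower + upper) / 2


# Hypothetical storey ranges (illustration only).
demo = pd.DataFrame({'storey_range': ['01 TO 03', '10 TO 12', '16 TO 20']})
demo['mid_sketch'] = demo['storey_range'].map(storey_midpoint)

print(demo)
```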

---
## Data Cleaning
---

### Part 1: `train.csv`
---

In [1]:
import pandas as pd
import numpy as np

from math import radians, cos, sin, asin, sqrt

### Classes & Functions
---

In [2]:
# Function to obtain information on columns with null values.

def nulls_info(df=None):
    
    '''
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    '''
    
    # Create a Series reflecting columns with null values (index)...
    # ...as well as each column's total number of null values.
    nulls = df.isnull().sum()
    
    # Create a dictionary of columns with null values using the above information, 
    # based on the following key: value format:
    # column: number of nulls
    cols_with_nulls = {col: int(count) for col, count in nulls.items() if count != 0}
    
    return cols_with_nulls

In [3]:
# Function to obtain distance between two points based on latitude & longitude.
# The main use of this function is to determine the block/street closest to...
# ...the target block because the latter has a null value that requires filling.
# Mathematical formulas obtained from https://www.geeksforgeeks.org/program-distance-two-points-earth/

def distance(df=None, street=None, town=None, missing_col=None):
    
    '''
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    street : str
        Value from the 'street_name' column of the DataFrame.
    town: str
        Value from the 'town' column of the DataFrame.
        Should only be specified if the street for which the target block is located in 
        has no useable values for the missing value ('missing_col').
    missing_col: str
        Feature/Column name. Refers to the feature for which the target block has a missing/null value.
    '''
    
    # List of features that are necessary for calculations.
    cols = ['block', 'postal', 'street_name', 
            'Mall_Nearest_Distance', 'Mall_Within_500m', 'Mall_Within_1km', 'Mall_Within_2km', 
            'Longitude', 'Latitude']
    
    # In terms of features, 'block' is a subset of 'street_name', which is in turn a subset of 'town'.
    # In most cases, only 'street_name' is needed because the block closest to the target block is...
    # ...located in the same street.
    # However, in some cases, the entire street are target blocks, ie. all have null values for the same column.
    # In such cases, 'town' is used to locate the nearest block instead.
    # Note that the 'town' column is not used; instead, all of the relevant town's streets are included.
    # .copy() is used so that the columns added below do not modify (or warn about) the original DataFrame.
    if town is None:
        df = df[df.street_name.isin([street])][cols].copy()
        
    else:
        df = df[df.town.isin([town])][cols].copy()
    
    # Duplicates are dropped to avoid having to run the code multiple times to obtain the same value.
    df.drop_duplicates(subset=['Mall_Nearest_Distance', 'Longitude', 'Latitude'], inplace=True)
    
    # Converting the unit of measurement of existing longitude & latitude columns from degrees to radians.
    df['Longitude'] = df.Longitude.map(lambda x: radians(x))
    df['Latitude'] = df.Latitude.map(lambda x: radians(x))
    
    # Lists of longitude & latitude values for target block(s) in the specified street...
    # ...(or streets, if 'town' column is being used.)
    lon1 = list(df[(df[missing_col].isnull()) & (df.street_name.isin([street]))].Longitude.unique())
    lat1 = list(df[(df[missing_col].isnull()) & (df.street_name.isin([street]))].Latitude.unique())
    
    # For each target block:
    for i in range(len(lon1)):
        
        # Create 2 columns where column values are the target block's longitude & latitude.
        df[f'lon1_{i}'] = lon1[i]
        df[f'lat1_{i}'] = lat1[i]
        
        # Create 2 columns that reflect the differences in longitude & latitude values between...
        # ...the target block & neighbouring blocks.
        df[f'dlon_{i}'] = abs(df.Longitude - df[f'lon1_{i}'])
        df[f'dlat_{i}'] = abs(df.Latitude - df[f'lat1_{i}'])
        
        # Create 3 columns that reflect different stages of the Haversine formula.
        # The final column, 'Distance', reflects the distance between the target block...
        # ...& neighbouring blocks.
        df[f'a_{i}'] = 0
        
        df[f'a_{i}'] = df[f'dlat_{i}'].map(lambda x: sin(x / 2)**2) + \
        df[f'lat1_{i}'].map(lambda x: cos(x)) * \
        df.Latitude.map(lambda x: cos(x)) * \
        df[f'dlon_{i}'].map(lambda x: sin(x / 2)**2)
        
        df[f'c_{i}'] = df[f'a_{i}'].map(lambda x: 2 * asin(sqrt(x)))
        
        df[f'Distance_{i}'] = df[f'c_{i}'].map(lambda x: x * 6371 * 1000)
    
    # Return the DataFrame with only the necessary columns pertaining to:
    # - information of the neighbouring blocks (block number, postal code, distance to the nearest mall),
    # - proximity (in metres) to the target block.
    return df.loc[:, df.columns.map(lambda x: x.startswith(('block', 'postal', 'Mall_Nearest_Distance', 'Distance')))]

In [4]:
# Class that replaces nulls depending on feature.

class ReplaceNulls():
    
    
    def __init__(self, df=None):
        
        self.df = df
        
        # Create a list of street names that have null values in the 'Mall_Nearest_Distance' column.
        self.streets = list(self.df[self.df.Mall_Nearest_Distance.isnull()].street_name.unique())
        
        # Create an empty dictionary that will later have the following key: value format:
        # street: blocks
        # ...where the blocks reflected are those that have null values in the 'Mall_Nearest_Distance' column.
        self.streets_and_blocks = {}
        
        self.check()
        self.approx_dist()   
        self.replace_nulls()
        self.check_and_return()
        
        
    # Function to ascertain whether there are any houses for the given street and block...
    # ...that DO have a valid value in the 'Mall_Nearest_Distance' column.
    def compare_blocks(self, street=None):

        # For the target street, create a list of its blocks that have null values in the...
        # ...'Mall_Nearest_Distance' column.
        self.blocks = list(self.df[self.df.street_name.isin([street]) & \
                                   self.df.Mall_Nearest_Distance.isnull()].block.unique())

        # Add a new key: value pair into the relevant dictionary where...
        # ...key = target street and value = list of street's blocks with null values.
        self.streets_and_blocks[street] = self.blocks

        # Create a list of unique values from all the houses in each target block's 'Mall_Nearest_Distance' column.
        # If the assumption that all houses belonging to the same target block have the same value (nan)...
        # ...for the 'Mall_Nearest_Distance' column, every element in this list should contain only 1 element: nan.
        self.nearest_mall_dist = [self.df[self.df.street_name.isin([street]) & \
                                          self.df.block.isin([block])].Mall_Nearest_Distance.unique() \
                                  for block in self.blocks]

        return self.nearest_mall_dist
    

    def check(self):
        
        for street in self.streets:

            # Using the compare_blocks() function, check that all blocks (for their corresponding streets)...
            # ...with null values in the 'Mall_Nearest_Distance' column...
            self.dists = self.compare_blocks(street=street)

            for dist in self.dists:
                
                # ...have no valid values in the 'Mall_Nearest_Distance' column.
                # ie. All houses for a given target block have null values in the 'Mall_Nearest_Distance' column.
                # This is done to:
                # 1) Prevent overwriting existing values, and
                # 2) Check if there are valid values that can be used.
                if np.isnan(dist[0]) and len(dist) == 1:
                    pass
                
                # If not, raise an 'alert'.
                else:
                    print(dist)
              
            
    def approx_dist(self):

        # Create a dictionary to store key: value pairs where keys are identical to keys from...
        # ...the streets_and_blocks dictionary.
        self.mall_nearest_distance = {street: {} for street in self.streets_and_blocks.keys()}

        # Create an empty dictionary to store streets that have no valid values in...
        # ...the 'Mall_Nearest_Distance' column, ie. all values are np.nan.
        self.to_expand_area = {}

        # For each street in the 'streets' list:
        for street in self.streets:

            try:

                # 1. Apply the distance() function and sort resulting DataFrame by distance.
                df_dist = distance(df=self.df, 
                                   street=street,
                                   missing_col='Mall_Nearest_Distance')

                for i in range(3, len(df_dist.columns)):
                    
                    # 2. For each target block, obtain the 'Mall_Nearest_Distance' column value...
                    # ...from its closest neighbouring block.
                    df_sorted = df_dist.sort_values(df_dist.columns[i])

                    row = 1
                    while np.isnan(df_sorted.iloc[row, 2]) == True:
                        row +=1

                    self.mall_nearest_distance[street][df_sorted.iloc[0, 0]] = (df_sorted.iloc[row, 2])

            except Exception:
                
                # 3. If step 2 doesn't turn up a result, ie. all blocks in the target street...
                # ...have no valid values in the 'Mall_Nearest_Distance' column,
                # add the street to the to_expand_area dictionary.
                self.to_expand_area[street] = None
                continue
                
        # For each street (key) in the to_expand_area dictionary, obtain its corresponding town (value).
        for street in self.to_expand_area:
            self.to_expand_area[street] = self.df[self.df.street_name.isin([street])].town.unique()[0]
            
        # For each street in the to_expand_area dictionary:
        for street in self.to_expand_area.keys():

            # 1. Apply the distance() function, this time to the entire town of the target street. 
            # Sort resulting DataFrame by distance.
            df_dist = distance(df=self.df, 
                               street=street, 
                               town=self.to_expand_area[street], 
                               missing_col='Mall_Nearest_Distance')
            
            # 2. For each target block, obtain the 'Mall_Nearest_Distance' column value from its...
            # ...closest neighbouring block.
            for i in range(3, len(df_dist.columns)):

                df_sorted = df_dist.sort_values(df_dist.columns[i])

                row = 1
                while np.isnan(df_sorted.iloc[row, 2]) == True:
                    row +=1

                self.mall_nearest_distance[street][df_sorted.iloc[0, 0]] = (df_sorted.iloc[row, 2])
                
                
    def replace_nulls(self):
                
        # Obtain the column number corresponding to the 'Mall_Nearest_Distance' column.
        col_num_nearest_mall = [i for i in range(len(self.df.columns)) \
                                if self.df.columns[i] == 'Mall_Nearest_Distance'][0]
        
        # Filter the DataFrame to only retain rows that have null values for the 'Mall_Nearest_Distance' column.
        # Replace the null value using the distance value from the mall_nearest_distance dictionary.
        for street in self.mall_nearest_distance.keys():

            for block in self.mall_nearest_distance[street]:

                self.df.iloc[self.df[(self.df.street_name == street) & (self.df.block == block)].index, 
                             col_num_nearest_mall] = self.mall_nearest_distance[street][block]

        # For each of the 3 'Mall_Within_...' columns:
        for col, dist in zip(['Mall_Within_500m', 'Mall_Within_1km', 'Mall_Within_2km'], [500, 1000, 2000]):
            
            # 1. Extract index of rows with null values for target column and filter the DataFrame accordingly.
            col_num = [i for i in range(len(self.df.columns)) if self.df.columns[i] == col][0]

            # 2. Replace the null value with either 0 or 1 depending on that row's value for...
            # ...the 'Mall_Nearest_Distance' column.
            self.df.iloc[self.df[np.isnan(self.df[col])].index, col_num] = \
            np.where(self.df.iloc[self.df[np.isnan(self.df[col])].index, col_num_nearest_mall] <= dist, 1, 0)

        # For each of the 3 'Hawker_Within_...' columns (and the radius, in metres, that each one covers):
        for col, dist in zip(['Hawker_Within_500m', 'Hawker_Within_1km', 'Hawker_Within_2km'], [500, 1000, 2000]):

            # 1. Check smallest value of Hawker_Nearest_Distance for which the target column has a null value.
            null_rows = self.df[col].isnull()

            if self.df.loc[null_rows, 'Hawker_Nearest_Distance'].min() > dist:

                # 2. If all such values exceed the column's radius, null values will be substituted with 0.
                self.df.loc[null_rows, col] = 0

            else:

                print(False)
                
                
    def check_and_return(self):
        
        # Check that there are no more null values.
        if self.df.isnull().sum().sum() == 0:
            return self.df
        
        else:
            return self.df.isnull().sum()

### Reading in the Data
---

In [5]:
trn = pd.read_csv('../data/train.csv')


In [6]:
trn.shape

(150634, 77)

In [7]:
trn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150634 entries, 0 to 150633
Data columns (total 77 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   id                         150634 non-null  int64  
 1   Tranc_YearMonth            150634 non-null  object 
 2   town                       150634 non-null  object 
 3   flat_type                  150634 non-null  object 
 4   block                      150634 non-null  object 
 5   street_name                150634 non-null  object 
 6   storey_range               150634 non-null  object 
 7   floor_area_sqm             150634 non-null  float64
 8   flat_model                 150634 non-null  object 
 9   lease_commence_date        150634 non-null  int64  
 10  resale_price               150634 non-null  float64
 11  Tranc_Year                 150634 non-null  int64  
 12  Tranc_Month                150634 non-null  int64  
 13  mid_storey                 15

### Dropping Duplicates
---

Duplicates are defined as rows with identical values for all features except `id`. All `id` values are unique, but a closer look at rows that differ only in `id` reveals that:
1) Every value other than `id` is identical;
2) `id` values are usually just 1 digit apart.

The above points, coupled with the fact that it is extremely unlikely that identical sales were made for nearby flats in the same month for the exact same resale price, point towards the duplicates likely being a result of human error (data entry issues). Thus, they should be dropped.
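
The idea of ignoring `id` when looking for duplicates can be sketched on a toy DataFrame. The rows below are invented for illustration; only the column names echo the dataset.

```python
import pandas as pd

# Hypothetical rows: the first two differ only in 'id'.
demo = pd.DataFrame({
    'id': [101, 102, 103],
    'block': ['215', '215', '238'],
    'resale_price': [400000.0, 400000.0, 520000.0],
})

# Compare rows on every column except 'id'.
cols_to_compare = list(demo.columns.drop('id'))

# Count rows that repeat an earlier row, then keep the first of each group.
n_dupes = int(demo.duplicated(subset=cols_to_compare).sum())
deduped = demo.drop_duplicates(subset=cols_to_compare).reset_index(drop=True)

print(n_dupes)
print(deduped)
```

Counting with `duplicated()` before dropping gives a quick check on how many rows are about to be removed.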

In [8]:
trn.drop_duplicates(trn.columns[1:], inplace=True)

In [9]:
# 217 duplicate rows were dropped.

trn.shape

(150417, 77)

In [10]:
trn.reset_index(drop=True, inplace=True)

### Replacing Null / `NIL` Values
---

#### Part 1: Dealing with `NIL` values in the `postal` column

Blocks in the same area (`street_name`) should have similar postal codes, i.e. the first 3 digits should be the same. The `distance()` function will be used to determine the blocks closest to the target block. The first 3 digits of these neighbouring blocks' postal codes will then be used.
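
For reference, the Haversine calculation that `distance()` applies column by column can be written as a standalone scalar function. This is only a sketch: the name `haversine_m` and the coordinates are made up for illustration, and the Earth radius of 6371 km matches the constant used in `distance()`.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371 * 1000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * EARTH_RADIUS_M


# Two hypothetical points a short distance apart (illustration only).
print(haversine_m(1.3800, 103.7500, 1.3801, 103.7502))
```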

In [11]:
trn[trn.postal.str.isnumeric() == False].postal.unique()

array(['NIL'], dtype=object)

In [12]:
# Standardise values in 'postal' column as str.

trn['postal'] = trn.postal.map(lambda x: str(x))

In [13]:
# Change 'NIL' values in 'postal' column to np.nan.

trn['postal'] = trn.postal.map(lambda x: np.nan if x in ['NIL'] else x)

In [14]:
# Obtain block & street name of rows that have no postal codes.

no_postal_trn = trn[trn.postal.isnull() == True][['block', 'street_name']]\
.drop_duplicates().to_dict(orient='records')

In [15]:
# Using the distance() function, check postal codes of blocks nearest to the...
# ...target blocks (blocks with no postal codes).

for i in range(len(no_postal_trn)):
    
    street = no_postal_trn[i]['street_name']
    
    display(distance(df=trn, 
                     street=street, 
                     missing_col='postal').sort_values('Distance_0').head())

| | block | postal | Mall_Nearest_Distance | Distance_0 |
| --- | --- | --- | --- | --- |
| 880 | 215 | NaN | 300.156625 | 0.0 |
| 8121 | 216 | 680216.0 | 299.385682 | 17.500518 |
| 24440 | 214 | 680214.0 | 303.83177 | 53.673124 |
| 3751 | 217 | 680217.0 | 242.948553 | 60.536651 |
| 17218 | 212 | 680212.0 | 371.138174 | 74.551625 |

| | block | postal | Mall_Nearest_Distance | Distance_0 |
| --- | --- | --- | --- | --- |
| 3030 | 238 | NaN | 448.929181 | 0.0 |
| 17274 | 237 | 540237.0 | 464.57181 | 65.597126 |
| 2414 | 236 | 540236.0 | 527.031302 | 86.255979 |
| 7567 | 240 | 540240.0 | 362.051069 | 86.880028 |
| 43090 | 235 | 540235.0 | 523.986192 | 109.653587 |


In [16]:
# Update postal codes of blocks with no postal codes.
# First 3 digits: same as that of nearest blocks.
# Last 3 digits: block number (all target block numbers are 3 digits long).

trn.loc[trn[(trn.postal.isnull() == True) & (trn.block == '215')].index, 'postal'] = '680215'

trn.loc[trn[(trn.postal.isnull() == True) & (trn.block == '238')].index, 'postal'] = '540238'

#### Part 2: Dealing with `null` values

**`Mall_Nearest_Distance` column**

Similar to the missing postal codes, the `distance()` function will be used to determine blocks closest to the target block. The target block will then take on the `Mall_Nearest_Distance` value of the closest neighbouring block.

Note that it is extremely unlikely that these blocks are right next to a mall, so the missing values should not be substituted with `0`:
- There are `0` values for this column, meaning that blocks located next to a mall already have this reflected in their `Mall_Nearest_Distance` value.
- The 3 related columns - `Mall_Within_500m`, `Mall_Within_1km`, `Mall_Within_2km` - are always empty for these rows as well, indicating either that the information is simply missing or that values are not recorded beyond a certain distance.
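
The nearest-neighbour fill described above can be sketched on a toy table. The values and the column `dist_to_target` (distance in metres from each block to the target block 215) are hypothetical; in the notebook that distance comes from `distance()`.

```python
import numpy as np
import pandas as pd

# Hypothetical blocks on one street; block 215 is the target with a missing value.
demo = pd.DataFrame({
    'block': ['215', '216', '214', '217'],
    'Mall_Nearest_Distance': [np.nan, 299.4, 303.8, np.nan],
    'dist_to_target': [0.0, 17.5, 53.7, 60.5],
})

# Only neighbours that have a valid value can donate it.
candidates = demo[demo['Mall_Nearest_Distance'].notna()]

# Take the value from the closest such neighbour.
nearest = candidates.sort_values('dist_to_target').iloc[0]
demo.loc[demo['block'] == '215', 'Mall_Nearest_Distance'] = nearest['Mall_Nearest_Distance']

print(demo)
```

Filtering out neighbours with missing values first avoids copying one null onto another, which is what the `while` loop in `approx_dist()` guards against.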

**All of the remaining 6 columns**

It is likely that the missing values simply mean that there are no malls or hawker centres within the specified radius. This is supported by the fact that there are no `0` values across all 6 columns. A check will be done to ensure that this is the case. The missing `Mall_Within_...` values will then be filled in using the corresponding `Mall_Nearest_Distance` values, while the missing `Hawker_Within_...` values will be filled with `0`. For example, for a given block with the following values:
- `Mall_Nearest_Distance` = 789
- `Mall_Within_500m` = nan
- `Mall_Within_1km` = nan

`Mall_Within_500m` and `Mall_Within_1km` will be filled with `0` and `1` respectively.
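
The fill rule in the example above can be sketched with `np.where` on a toy DataFrame. The rows are invented for illustration; the first row reproduces the 789 m example, and existing (non-null) counts are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (illustration only).
demo = pd.DataFrame({
    'Mall_Nearest_Distance': [789.0, 1500.0, 320.0],
    'Mall_Within_500m': [np.nan, np.nan, 2.0],
    'Mall_Within_1km': [np.nan, np.nan, 3.0],
})

for col, limit in [('Mall_Within_500m', 500), ('Mall_Within_1km', 1000)]:
    missing = demo[col].isnull()
    # 1 if the nearest mall lies within the radius, otherwise 0.
    demo.loc[missing, col] = np.where(
        demo.loc[missing, 'Mall_Nearest_Distance'] <= limit, 1, 0)

print(demo)
```

Note that a filled `1` only records that at least one mall lies within the radius; it is an approximation of the true count.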

In [17]:
nulls_info(df=trn)

{'Mall_Nearest_Distance': 826,
 'Mall_Within_500m': 92659,
 'Mall_Within_1km': 25402,
 'Mall_Within_2km': 1935,
 'Hawker_Within_500m': 97246,
 'Hawker_Within_1km': 60779,
 'Hawker_Within_2km': 29152}

In [18]:
# For each of the columns with null values, calculate (across all rows in the DataFrame)...
# ...number of rows with a value of zero.

for col in nulls_info(df=trn).keys():
    print(col, trn[trn[col] == 0].shape[0])

Mall_Nearest_Distance 30
Mall_Within_500m 0
Mall_Within_1km 0
Mall_Within_2km 0
Hawker_Within_500m 0
Hawker_Within_1km 0
Hawker_Within_2km 0


In [19]:
# Check if all rows with a null value in the 'Mall_Nearest_Distance' column also have...
# ...null values in each of the 3 columns starting with 'Mall_Within_...'.

trn[trn.Mall_Nearest_Distance.isnull()][['Mall_Within_500m', 
                                         'Mall_Within_1km', 
                                         'Mall_Within_2km']].isnull().sum()

Mall_Within_500m    826
Mall_Within_1km     826
Mall_Within_2km     826
dtype: int64

In [20]:
ReplaceNulls(trn)

<__main__.ReplaceNulls at 0x1f774f9d9d0>

In [21]:
# Check that there are no more null values.

trn.isnull().sum().sum()

0

In [22]:
# Save cleaned data into another csv file for use in other notebooks.

trn.to_csv('../data/train_cleaned.csv')

### Part 2: `test.csv`
---

In [23]:
tst = pd.read_csv('../data/test.csv')


In [24]:
tst[tst.postal.str.isnumeric() == False].postal.unique()

array(['NIL'], dtype=object)

In [25]:
# Standardise values in 'postal' column as str.

tst['postal'] = tst.postal.map(lambda x: str(x))

In [26]:
# Change 'NIL' values in 'postal' column to np.nan.

tst['postal'] = tst.postal.map(lambda x: np.nan if x in ['NIL'] else x)

In [27]:
# Obtain block & street name of rows that have no postal codes.

no_postal_tst = tst[tst.postal.isnull() == True][['block', 'street_name']]\
.drop_duplicates().to_dict(orient='records')

In [28]:
# Using the distance() function, check postal codes of blocks nearest to the...
# ...target blocks (blocks with no postal codes).

for i in range(len(no_postal_tst)):
    
    street = no_postal_tst[i]['street_name']
    
    display(distance(df=tst, 
                     street=street, 
                     missing_col='postal').sort_values('Distance_0').head())

| | block | postal | Mall_Nearest_Distance | Distance_0 |
| --- | --- | --- | --- | --- |
| 5137 | 238 | NaN | 448.929181 | 0.0 |
| 58 | 236 | 540236.0 | 527.031302 | 86.254314 |
| 9349 | 240 | 540240.0 | 362.051069 | 86.881562 |
| 13157 | 235 | 540235.0 | 523.986192 | 109.652115 |
| 833 | 239 | 540239.0 | 363.8584 | 113.716682 |


In [29]:
# Update postal codes of blocks with no postal codes.
# First 3 digits: same as that of nearest blocks.
# Last 3 digits: block number (all target block numbers are 3 digits long).

tst.loc[tst[(tst.postal.isnull() == True) & (tst.block == '238')].index, 'postal'] = '540238'

In [30]:
nulls_info(df=tst)

{'Mall_Nearest_Distance': 84,
 'Mall_Within_500m': 10292,
 'Mall_Within_1km': 2786,
 'Mall_Within_2km': 213,
 'Hawker_Within_500m': 10755,
 'Hawker_Within_1km': 6729,
 'Hawker_Within_2km': 3254}

In [31]:
# For each of the columns with null values, calculate (across all rows in the DataFrame)...
# ...number of rows with a value of zero.

for col in nulls_info(df=tst).keys():
    print(col, tst[tst[col] == 0].shape[0])

Mall_Nearest_Distance 1
Mall_Within_500m 0
Mall_Within_1km 0
Mall_Within_2km 0
Hawker_Within_500m 0
Hawker_Within_1km 0
Hawker_Within_2km 0


In [32]:
# Check if all rows with a null value in the 'Mall_Nearest_Distance' column also have...
# ...null values in each of the 3 columns starting with 'Mall_Within_...'.

tst[tst.Mall_Nearest_Distance.isnull()][['Mall_Within_500m', 
                                         'Mall_Within_1km', 
                                         'Mall_Within_2km']].isnull().sum()

Mall_Within_500m    84
Mall_Within_1km     84
Mall_Within_2km     84
dtype: int64

In [33]:
ReplaceNulls(tst)

<__main__.ReplaceNulls at 0x1f70020ed30>

In [34]:
# Check that there are no more null values.

tst.isnull().sum().sum()

0

In [35]:
# Save cleaned data into another csv file for use in other notebooks.

tst.to_csv('../data/test_cleaned.csv')