<a href="https://colab.research.google.com/github/alfafimel/IPWK10-CORE-ZINDI-COVID-HACKATHON/blob/main/South_African_COVID_19_Vulnerability_Map_by_Zindi_ELIZABETH_JOSEPHINE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **QUESTION: PROBLEM**

Can we infer important COVID-19 public health risk factors from outdated data? In many countries, census and other survey data may be incomplete or out of date. This challenge is to develop a proof-of-concept for how machine learning can help governments more accurately map COVID-19 risk in 2020 using old data, without requiring a new, costly, risky, and time-consuming on-the-ground survey.

The 2011 census gives us valuable information for determining who might be most vulnerable to COVID-19 in South Africa. However, the data is nearly 10 years old, and we expect that some key indicators will have changed in that time. Building an up-to-date map showing where the most vulnerable are located will be a key step in responding to the disease. A mapping effort like this requires bringing together many different inputs and tools. For this competition, we’re starting small. Can we infer important risk factors from more readily available data?

The task is to predict the percentage of households that fall into a particularly vulnerable bracket - large households that must leave their homes to fetch water - using 2011 South African census data. Solving this challenge will show that machine learning can use easy-to-measure statistics to identify the areas most at risk, even in years when census data is not collected.

## **DATA ANALYSIS**

In [1]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# reading in the data
# training dataset
train = pd.read_csv('Train_maskedv2.csv')
# test dataset
test = pd.read_csv('Test_maskedv2.csv')

In [3]:
# previewing the shape of the datasets
train.shape, test.shape

# the train dataset has 3174 rows and 50 columns while
# the test dataset has 1102 rows and 49 columns

((3174, 50), (1102, 49))

In [4]:
# previewing a sample of each dataset
print(train.sample(6))
print('**********************************')
print(test.sample(6))

         ward  total_households  ...     pg_04    lgt_00
2580  R8WD9H5        3294.89919  ...  0.002373  0.927243
2469  B0DMWCB        1421.23288  ...  0.002362  0.319931
528   VK46ZKR        2274.85910  ...  0.008841  0.835105
2079  OHWI5DC        1841.05616  ...  0.005634  0.862503
1944  6RFVC71         518.57566  ...  0.009031  0.459686
2693  6OFQ3ZA        2455.04380  ...  0.001092  0.956165

[6 rows x 50 columns]
**********************************
        ward  total_households  total_individuals  ...     pg_03     pg_04    lgt_00
855  RS6CKPN        9022.35767        28116.89638  ...  0.427909  0.027690  0.994626
732  NYODCCH        1194.78307         7083.79440  ...  0.192877  0.012511  0.980773
996  W9YLEKZ        1489.27839         4916.67090  ...  0.000000  0.006176  0.956691
61   24WYRNI        1374.50664         9378.05541  ...  0.000845  0.000378  0.002970
212  6XO9WKJ        2109.66431         6317.49552  ...  0.078023  0.003916  0.952388
987  VYWR87L        3216.10822   

In [5]:
# combining the training and test datasets for easier feature engineering
df = pd.concat([train, test])

In [6]:
# previewing the combined dataset
print(df.shape)
print('**********************************')
print(df.head(3))
print('**********************************')
print(df.columns)

(4276, 50)
**********************************
      ward  total_households  total_individuals  ...     pg_03     pg_04    lgt_00
0  9D9BEUB       13569.97801        39024.03083  ...  0.029031  0.010292  0.599259
1  RERH3XM       13593.88256        32879.94646  ...  0.000586  0.002832  0.699136
2  GJWA3BO        2698.30050         8261.71093  ...  0.003201  0.000663  0.972315

[3 rows x 50 columns]
**********************************
Index(['ward', 'total_households', 'total_individuals', 'target_pct_vunerable',
       'dw_00', 'dw_01', 'dw_02', 'dw_03', 'dw_04', 'dw_05', 'dw_06', 'dw_07',
       'dw_08', 'dw_09', 'dw_10', 'dw_11', 'dw_12', 'dw_13', 'psa_00',
       'psa_01', 'psa_02', 'psa_03', 'psa_04', 'stv_00', 'stv_01', 'car_00',
       'car_01', 'lln_00', 'lln_01', 'lan_00', 'lan_01', 'lan_02', 'lan_03',
       'lan_04', 'lan_05', 'lan_06', 'lan_07', 'lan_08', 'lan_09', 'lan_10',
       'lan_11', 'lan_12', 'lan_13', 'lan_14', 'pg_00', 'pg_01', 'pg_02',
       'pg_03', 'pg_04', 'lgt

In [7]:
print("old size: %d" % len(train))
train = train.dropna(how='any', axis=0)
print("New size after dropping missing value: %d" % len(train))

old size: 3174
New size after dropping missing value: 3174


In [8]:
import gc
# flagging columns in which more than 98% of the values are missing
missing_perc_thresh = 0.98
exclude_missing = []
num_rows = df.shape[0]
for c in df.columns:
    num_missing = df[c].isnull().sum()
    if num_missing == 0:
        continue
    missing_frac = num_missing / float(num_rows)
    if missing_frac > missing_perc_thresh:
        exclude_missing.append(c)
print("We exclude: %s" % len(exclude_missing))
#
# freeing the temporary variables
del num_rows, missing_perc_thresh
gc.collect();

We exclude: 0
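
The same per-column check can be written more compactly with pandas. The snippet below is only a sketch on a toy frame (the column names `a`, `b`, `c` and their values are made up for illustration); it computes the fraction of missing values per column and lists the columns above the 0.98 threshold used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined data (hypothetical columns and values).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0],
})

missing_thresh = 0.98  # same threshold as in the cell above

# Fraction of missing values in each column.
missing_frac = toy.isna().mean()

# Columns whose missing fraction exceeds the threshold.
exclude_missing = missing_frac[missing_frac > missing_thresh].index.tolist()
print(exclude_missing)
```

On the real data, `toy` would be replaced by `df`; since the cell above reports zero excluded columns, the list would be empty there.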


In [9]:
## FEATURE ENGINEERING

# dropping the columns that hold unspecified values
df.drop(columns=['dw_12', 'psa_03', 'lan_13', 'lan_14'], inplace=True)
#
# the average number of individuals in a household
df['household_size'] = df['total_individuals'] / df['total_households']
#
# grouping individuals with no school attendance
df['negativeschoolattendance'] = df['psa_01'] + df['psa_02']
#
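# adding a KMeans cluster label as an extra feature
from sklearn.cluster import KMeans
cols = df.drop(["target_pct_vunerable", "ward"], axis=1).columns

df_km = df[cols].copy()

# scaling the household counts to [0, 1] so they do not dominate the distances
df_km["total_households"] /= df_km["total_households"].max()

km = KMeans(20, random_state=49)
df["cluster"] = km.fit_predict(df_km[cols])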
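
To make the clustering step easier to follow in isolation, here is a minimal sketch on synthetic data. The column names `share_a` and `share_b`, the random values, and the choice of 5 clusters with `n_init=10` are assumptions made for the example; the notebook itself uses 20 clusters on the census features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the ward-level features (hypothetical columns).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "total_households": rng.uniform(100, 10000, size=200),
    "share_a": rng.uniform(0, 1, size=200),
    "share_b": rng.uniform(0, 1, size=200),
})

# Work on a copy so the scaling does not alter the original column.
features = toy.copy()
features["total_households"] /= features["total_households"].max()

# Fit KMeans and store the cluster label as a new feature.
km = KMeans(n_clusters=5, n_init=10, random_state=49)
toy["cluster"] = km.fit_predict(features)

print(toy["cluster"].value_counts().sort_index())
```

Dividing `total_households` by its maximum puts it on roughly the same scale as the share columns, which matters because KMeans relies on Euclidean distances.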

### **Build model**

In [11]:
# splitting the combined frame back into the train and test sets
l_train = len(train)
train = df[:l_train].copy()
test = df[l_train:].copy()
#
# keeping the ward IDs for the submission file
_id = test['ward']
test.drop(columns=['target_pct_vunerable','ward'], inplace=True)
train.drop(columns=['ward'], inplace=True)
#
# declaring the feature matrix and the target variable
X = train.drop(['target_pct_vunerable'], axis=1)
y = train['target_pct_vunerable']
#
# positions of the non-float columns, passed to CatBoost as categorical features
categorical_features_indices = np.where(X.dtypes != np.float64)[0]
categorical_features_indices

array([14, 46])
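
As a rough illustration of how the non-float column positions are found, the sketch below runs the same idea on a tiny frame. The frame, its column names, and its values are hypothetical; only the selection logic mirrors the cell above.

```python
import pandas as pd

# Tiny stand-in for the feature matrix X (hypothetical columns and values).
toy_X = pd.DataFrame({
    "total_households": [1200.5, 830.0, 410.25],
    "int_flag": [0, 0, 0],            # integer column
    "household_size": [3.1, 4.2, 2.8],
    "cluster": [2, 0, 1],             # integer cluster label
})

# Positions of every column whose dtype is not a float.
cat_idx = [
    i for i, dtype in enumerate(toy_X.dtypes)
    if not pd.api.types.is_float_dtype(dtype)
]
print(cat_idx)
```

Checking dtypes this way treats every integer column as categorical, so it is worth confirming that each selected column really is a label rather than a count.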

In [13]:
# Installing required libraries
!pip install catboost
!pip install rgf-python

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/52/39/128fff65072c8327371e3c594f3c826d29c85b21cb6485980353b168e0e4/catboost-0.24.2-cp36-none-manylinux1_x86_64.whl (66.1MB)
Installing collected packages: catboost
Successfully installed catboost-0.24.2
Collecting rgf-python
  Downloading https://files.pythonhosted.org/packages/7d/b5/9ba527ce20a1a5a6c279318be856aa4996e0d1b59035d466173198ffd468/rgf_python-3.9.0-py2.py3-none-manylinux1_x86_64.whl (757kB)
Installing collected packages: rgf-python
Successfully installed rgf-python-3.9.0


In [14]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

testsplit_store = []  # RMSE on the held-out part of each fold
test_store = []       # test-set predictions from each fold's model
fold = KFold(n_splits=15, shuffle=True, random_state=123456)
for train_index, test_index in fold.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    cat = CatBoostRegressor(n_estimators=10000, eval_metric='RMSE', learning_rate=0.0801032, random_seed=123456, l2_leaf_reg=4, use_best_model=True)
    # eval_set holds the training fold and the held-out fold; training stops after 300 rounds without improvement
    cat.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], early_stopping_rounds=300, verbose=100, cat_features=categorical_features_indices)
    # scoring the held-out fold
    predict = cat.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predict))
    print("err: ", rmse)
    testsplit_store.append(rmse)
    # predicting the competition test set with this fold's model
    pred = cat.predict(test)
    test_store.append(pred)


0:	learn: 12.1384567	test: 12.1384567	test1: 13.0356250	best: 13.0356250 (0)	total: 68.1ms	remaining: 11m 21s
100:	learn: 4.8736797	test: 4.8736797	test1: 5.1860612	best: 5.1850419 (99)	total: 1.88s	remaining: 3m 4s
200:	learn: 3.6749530	test: 3.6749530	test1: 5.0040193	best: 4.9927432 (197)	total: 3.72s	remaining: 3m 1s
300:	learn: 2.8920581	test: 2.9012199	test1: 4.9102574	best: 4.9006805 (290)	total: 5.54s	remaining: 2m 58s
400:	learn: 2.3643745	test: 2.3767904	test1: 4.8626683	best: 4.8567107 (392)	total: 7.36s	remaining: 2m 56s
500:	learn: 1.9488156	test: 1.9633063	test1: 4.8590414	best: 4.8552519 (488)	total: 9.2s	remaining: 2m 54s
600:	learn: 1.6440419	test: 1.6603104	test1: 4.8532745	best: 4.8531693 (592)	total: 11s	remaining: 2m 52s
700:	learn: 1.4164740	test: 1.4334710	test1: 4.8185048	best: 4.8158127 (695)	total: 12.8s	remaining: 2m 50s
800:	learn: 1.2126203	test: 1.2302567	test1: 4.8157228	best: 4.8044600 (765)	total: 14.6s	remaining: 2m 48s
900:	learn: 1.0552629	test: 1.07
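
The fold loop above combines two ideas: scoring each fold on its held-out rows, and averaging the test-set predictions of all fold models. The sketch below shows that pattern on synthetic data. To keep it fast and self-contained it swaps in scikit-learn's `Ridge` for CatBoost and uses 5 folds instead of 15; the data, the model, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical stand-in for X, y and the test set).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=300)
X_new = rng.normal(size=(50, 5))

fold = KFold(n_splits=5, shuffle=True, random_state=123456)
fold_rmse, fold_preds = [], []
for train_idx, valid_idx in fold.split(X_toy):
    # Fit a fresh model on the training part of the fold.
    model = Ridge(alpha=1.0)
    model.fit(X_toy[train_idx], y_toy[train_idx])

    # Score it on the held-out part of the fold.
    valid_pred = model.predict(X_toy[valid_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_toy[valid_idx], valid_pred)))

    # Predict the unseen data with this fold's model.
    fold_preds.append(model.predict(X_new))

cv_rmse = np.mean(fold_rmse)            # cross-validated RMSE estimate
avg_pred = np.mean(fold_preds, axis=0)  # one averaged prediction per unseen row
print(cv_rmse, avg_pred.shape)
```

Averaging over fold models tends to smooth out the variance of any single fit, at the cost of training one model per fold.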

In [17]:
print(np.mean(testsplit_store))
print('********************************')
# making a submission file
submission = {"ward": _id, 'target_pct_vunerable': np.mean(test_store, 0)}
final_df = pd.DataFrame(data = submission)
final_df.sample(6)

5.578583537280377
********************************


        ward  target_pct_vunerable
530  HNODGSZ             23.034483
790  PPOUS3C             17.909118
965  VA2T7KX             -1.117707
714  NJCV4SV              6.024565
78   2LDKL4U             27.345245
925  TV0NYBT             37.880918
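
One row in the sample above has a negative prediction, which cannot be a valid percentage. Assuming the target is bounded by 0 and 100, a possible post-processing step is to clip the predictions before writing the file. The sketch below uses a toy frame with made-up ward IDs and values; it is a suggestion rather than part of the original pipeline.

```python
import pandas as pd

# Toy submission frame (hypothetical ward IDs and predictions).
toy_sub = pd.DataFrame({
    "ward": ["AAA", "BBB", "CCC"],
    "target_pct_vunerable": [23.03, -1.12, 101.5],
})

# Clip predictions into the assumed valid range for a percentage.
toy_sub["target_pct_vunerable"] = toy_sub["target_pct_vunerable"].clip(lower=0, upper=100)
print(toy_sub)
```

Clipping cannot move a prediction further from a true value that lies inside the range, so under that assumption it should not increase the RMSE.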


In [19]:
# the final submission file
final_df.to_csv('zindi.csv', index=False)