# Feature Description

#### Features - 1
- *amount_tsh* - Total static head (amount of water available to the waterpoint)
- *date_recorded* - The date the row was entered
- ***funder* - Who funded the well**
- ***gps_height* - Altitude of the well**
- ***installer* - Organization that installed the well**
- ***longitude* - GPS coordinate**
- ***latitude* - GPS coordinate**
- *wpt_name* - Name of the waterpoint if there is one
- *num_private* -
- ***basin* - Geographic water basin**
___
#### Geographic Location
- *subvillage* - Geographic location
- *region* - Geographic location
- *region_code* - Geographic location (coded)
- *district_code* - Geographic location (coded)
- *lga* - Geographic location
- *ward* - Geographic location
___
#### Features - 2
- ***population* - Population around the well**
- *public_meeting* - True/False
- ***recorded_by* - Group entering this row of data**
- ***scheme_management* - Who operates the waterpoint**
- *scheme_name* - Who operates the waterpoint
- ***permit* - If the waterpoint is permitted**
- ***construction_year* - Year the waterpoint was constructed**
___
#### Water Extraction
- *extraction_type* - The kind of extraction the waterpoint uses
- *extraction_type_group* - The kind of extraction the waterpoint uses
- *extraction_type_class* - The kind of extraction the waterpoint uses
___
#### Features - 3
- *management* - How the waterpoint is managed
- *management_group* - How the waterpoint is managed
- *payment* - What the water costs
- *payment_type* - What the water costs
- *water_quality* - The quality of the water
- *quality_group* - The quality of the water
- *quantity* - The quantity of water
- *quantity_group* - The quantity of water
- *source* - The source of the water
- *source_type* - The source of the water
- *source_class* - The source of the water
- *waterpoint_type* - The kind of waterpoint
- *waterpoint_type_group* - The kind of waterpoint

# Setup

In [102]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [103]:
PIT_data_df = pd.read_csv('/content/Pump it Up - Training set Values.csv')
PIT_test_df = pd.read_csv('/content/Pump it Up - Testing set.csv')
PIT_label_df = pd.read_csv('/content/Pump it Up - Training set Labels.csv')
PIT_data_df.loc[PIT_data_df.latitude.abs() < 0.0001, 'latitude'] = 0.0
PIT_test_df.loc[PIT_test_df.latitude.abs() < 0.0001, 'latitude'] = 0.0
merged_PIT_data_df = pd.merge(PIT_data_df, PIT_label_df,how='inner', on='id')
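An inner merge silently drops any `id` that is missing from one side, so it can be worth checking that no rows were lost. The following is a minimal sketch on toy data (the frames `values` and `labels` are hypothetical stand-ins for the values and labels files), assuming each `id` should appear exactly once in both tables.

```python
import pandas as pd

# Toy stand-ins for the values and labels tables (hypothetical data).
values = pd.DataFrame({"id": [1, 2, 3], "gps_height": [0, 1390, 686]})
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises pandas.errors.MergeError if an id is
# duplicated on either side.
merged = pd.merge(values, labels, how="inner", on="id", validate="one_to_one")

# An inner join drops unmatched ids without warning, so compare row counts.
rows_kept = len(merged) == len(values) == len(labels)
print(rows_kept)
```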

# Feature Characteristics Identification

In [104]:
PIT_feature_info_df = pd.DataFrame(merged_PIT_data_df.columns, columns=['feature'])
PIT_feature_info_df = pd.merge(PIT_feature_info_df, pd.DataFrame(merged_PIT_data_df.isna().sum(), columns=['null_count']),how='inner',left_on='feature',right_index=True)
PIT_feature_info_df = pd.merge(PIT_feature_info_df, pd.DataFrame(merged_PIT_data_df.dtypes, columns=['data_type']),how='inner',left_on='feature',right_index=True)
PIT_feature_info_df = pd.merge(PIT_feature_info_df, pd.DataFrame((merged_PIT_data_df == 0).sum(axis=0), columns=['zero_count']),how='inner',left_on='feature',right_index=True)
PIT_feature_info_df = pd.merge(PIT_feature_info_df, pd.DataFrame(merged_PIT_data_df.nunique(), columns=['unique_val_count']),how='inner',left_on='feature',right_index=True)
PIT_feature_info_df

| | feature | null_count | data_type | zero_count | unique_val_count |
|---|---|---|---|---|---|
| 0 | id | 0 | int64 | 1 | 59400 |
| 1 | amount_tsh | 0 | float64 | 41639 | 98 |
| 2 | date_recorded | 0 | object | 0 | 356 |
| 3 | funder | 3635 | object | 0 | 1897 |
| 4 | gps_height | 0 | int64 | 20438 | 2428 |
| 5 | installer | 3655 | object | 0 | 2145 |
| 6 | longitude | 0 | float64 | 1812 | 57516 |
| 7 | latitude | 0 | float64 | 1812 | 57517 |
| 8 | wpt_name | 0 | object | 0 | 37400 |
| 9 | num_private | 0 | int64 | 58643 | 65 |
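The same per-feature summary can probably be built without chained merges, because every statistic is a Series indexed by column name. This is only a sketch on a toy frame (`toy` and its values are made up for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical miniature of the training frame.
toy = pd.DataFrame({
    "amount_tsh": [0.0, 25.0, 0.0],
    "funder": ["Roman", None, "Unicef"],
    "gps_height": [1390, 0, 686],
})

# Each statistic is indexed by column name, so pandas aligns them by label.
summary = pd.DataFrame({
    "null_count": toy.isna().sum(),
    "data_type": toy.dtypes.astype(str),
    "zero_count": (toy == 0).sum(),
    "unique_val_count": toy.nunique(),
}).rename_axis("feature").reset_index()

print(summary)
```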


# Handling Label Imbalance

In [105]:
merged_PIT_data_df.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [106]:
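# Undersample every class down to the size of the rarest one
# (4317 rows, 'functional needs repair') so the labels are balanced
grouping = merged_PIT_data_df.groupby('status_group',group_keys=False)
merged_PIT_data_df = grouping.apply(lambda x: x.sample(grouping.size().min()).reset_index(drop=True))
merged_PIT_data_df.reset_index(drop=True, inplace=True)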
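The sampling above is unseeded, so each run keeps a different subset of rows. A minimal sketch of a reproducible variant, assuming pandas 1.1 or later (which provides `DataFrameGroupBy.sample`) and using a made-up toy frame:

```python
import pandas as pd

# Hypothetical imbalanced labels: 5 / 3 / 2 rows per class.
toy = pd.DataFrame({
    "id": range(10),
    "status_group": ["functional"] * 5
                    + ["non functional"] * 3
                    + ["functional needs repair"] * 2,
})

# Size of the rarest class.
n_min = toy["status_group"].value_counts().min()

# Draw n_min rows from every class; random_state makes the draw repeatable.
balanced = (
    toy.groupby("status_group")
       .sample(n=n_min, random_state=0)
       .reset_index(drop=True)
)
class_counts = balanced["status_group"].value_counts()
print(class_counts)
```

The seed value `0` is arbitrary; any fixed integer gives a repeatable subset.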

# Preprocessing

In [107]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
str_encoder = LabelEncoder()
scaler = MinMaxScaler()

In [108]:
def preprocess(df):
  # Zeros in these columns stand in for missing readings. Mask them first so they
  # do not drag down the regional statistic that is used to fill them.
  for col, stat in [('gps_height','mean'),('longitude','median'),('latitude','median'),('population','mean')]:
    df[col] = df[col].replace(0,np.nan)
    df[col] = df[col].fillna(df.groupby('region')[col].transform(stat))
    # Regions with no valid reading at all fall back to the overall statistic
    df[col] = df[col].fillna(df[col].agg(stat))
  # Fill missing construction years from progressively coarser groupings
  df['construction_year'] = df['construction_year'].replace(0,np.nan)
  for key in ['scheme_name','installer','funder','scheme_management','management']:
    df['construction_year'] = df['construction_year'].fillna(df.groupby(key)['construction_year'].transform('mean'))
  # Age of the waterpoint in years, relative to 2021
  df['duration'] = (2021 - df['construction_year'])
  df['funder'] = df['funder'].fillna('Unknown')
  df['installer'] = df['installer'].fillna('Unknown')
  df['scheme_management'] = df['scheme_management'].fillna('Unknown')
  df['permit'] = df['permit'].fillna('Unknown')
  df['permit'] = df['permit'].replace({True:1,False:0,'Unknown':0.5})
  selected_columns = ['id','gps_height','longitude','latitude','population','basin','region','duration','funder','installer','waterpoint_type','source','quantity','water_quality','payment','management_group','extraction_type_class']
  if('status_group' in df.columns):
    selected_columns.append('status_group')
  # Return a copy so later column assignments do not act on a slice of df
  return df[selected_columns].copy()
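The group-wise imputation idea can be illustrated on a toy frame. The sketch below masks zeros before computing the regional mean, so placeholder zeros do not lower the fill value, and it falls back to the overall mean for a region with no valid reading. The region names and heights are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Mara" has no valid gps_height at all.
toy = pd.DataFrame({
    "region": ["Iringa", "Iringa", "Iringa", "Mara", "Mara"],
    "gps_height": [1400, 0, 1600, 0, 0],
})

# Step 1: treat zeros as missing.
height = toy["gps_height"].replace(0, np.nan)

# Step 2: per-region mean of the valid readings, broadcast back to every row.
region_mean = height.groupby(toy["region"]).transform("mean")
toy["gps_height_filled"] = height.fillna(region_mean)

# Rows in regions without any valid reading are still missing here.
still_missing = int(toy["gps_height_filled"].isna().sum())

# Step 3: fall back to the overall mean of the valid readings.
toy["gps_height_filled"] = toy["gps_height_filled"].fillna(height.mean())
print(toy)
```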

In [109]:
def encode_frame(df):
  encodables = ['basin','region','duration','funder','installer','waterpoint_type','source','quantity','water_quality','payment','management_group','extraction_type_class']
  scalables = ['gps_height','longitude','latitude','population']
  # fit_transform refits the encoder and scaler on every call, so the integer codes
  # and scaling ranges of two different frames are not guaranteed to match
  for en in encodables:
    df[en] = str_encoder.fit_transform(df[en])
  if('status_group' in df.columns):
    df['status_group'] = df['status_group'].replace({"functional":2, "non functional":0, "functional needs repair":1})
  df[scalables] = scaler.fit_transform(df[scalables])
  return df
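Because `encode_frame` calls `fit_transform` on whichever frame it receives, the training and test frames can end up with different integer codes for the same category. One possible alternative, sketched here on invented data, is to fit on the training frame only and reuse the fitted objects on the test frame. It swaps in scikit-learn's `OrdinalEncoder` (with `handle_unknown="use_encoded_value"`, available from scikit-learn 0.24) in place of `LabelEncoder`; this is an assumption about what would suit the notebook, not the author's method.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical frames; "Lake Nyasa" never appears in the training data.
train = pd.DataFrame({"basin": ["Rufiji", "Pangani", "Rufiji"],
                      "gps_height": [0.0, 500.0, 1000.0]})
test = pd.DataFrame({"basin": ["Pangani", "Lake Nyasa"],
                     "gps_height": [250.0, 1000.0]})

# Categories unseen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scaler = MinMaxScaler()

train_enc, test_enc = train.copy(), test.copy()

# Fit on the training frame only, then apply the same mapping to the test frame.
train_enc["basin"] = encoder.fit_transform(train[["basin"]]).ravel()
test_enc["basin"] = encoder.transform(test[["basin"]]).ravel()
train_enc["gps_height"] = scaler.fit_transform(train[["gps_height"]]).ravel()
test_enc["gps_height"] = scaler.transform(test[["gps_height"]]).ravel()

print(train_enc)
print(test_enc)
```

With this arrangement "Pangani" receives the same code in both frames, and the test heights are scaled with the training range.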

In [110]:
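cleaned_data_df = preprocess(merged_PIT_data_df)
cleaned_test_df = preprocess(PIT_test_df)
# The training cell reads encoded_data_df and encoded_test_df, so these calls must run.
# Encode copies so the cleaned frames keep their original string values.
encoded_data_df = encode_frame(cleaned_data_df.copy())
encoded_test_df = encode_frame(cleaned_test_df.copy())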

# Training

In [111]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=1000, max_depth=30, random_state=1)

In [None]:
for i in range(1):
  enc_copy_df = encoded_data_df.copy()
  validation_df = enc_copy_df.sample(frac = 0.2)
  train_df = enc_copy_df.drop(validation_df.index)
  test_df = encoded_test_df

  features = train_df.columns.to_list()
  features.remove('status_group')
  features.remove('id')
  train_y = train_df['status_group']
  train_x = train_df[features]
  val_y = validation_df['status_group']
  val_x = validation_df[features]

  model.fit(train_x, train_y)
  # accuracy_score expects the true labels first, then the predictions
  accuracy = accuracy_score(val_y, model.predict(val_x))
  print(f'epoch:{i+1} finished ---> accuracy:{accuracy}')
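The hold-out split above uses an unseeded `sample`, so the validation accuracy changes from run to run. A minimal sketch of a seeded, stratified split with scikit-learn's `train_test_split` follows; the toy frame, the small forest size, and the seed are assumptions chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical encoded data: one scaled feature and three balanced classes.
toy = pd.DataFrame({
    "gps_height": [i / 14 for i in range(15)],
    "status_group": [0, 1, 2] * 5,
})

features = ["gps_height"]

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
train_x, val_x, train_y, val_y = train_test_split(
    toy[features], toy["status_group"],
    test_size=0.2, stratify=toy["status_group"], random_state=1,
)

forest = RandomForestClassifier(n_estimators=10, random_state=1)
forest.fit(train_x, train_y)
accuracy = accuracy_score(val_y, forest.predict(val_x))
print(f"validation rows: {len(val_x)}, accuracy: {accuracy}")
```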

In [113]:
prediction = model.predict(test_df[features])

submission_df = pd.DataFrame({
  "id": test_df["id"],
  "status_group": prediction
})
submission_df['status_group'] = submission_df['status_group'].replace({2:"functional", 0:"non functional", 1:"functional needs repair"})
submission_df.to_csv("PIT_submission.csv", index = False)

In [None]:
importances = list(model.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(features, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
feature_importances
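The importances can also be held in a pandas Series, which keeps feature names attached and sorts in one call. The sketch below uses an invented two-feature dataset in which only `signal` carries information; the feature names and forest settings are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: "signal" determines the label, "constant" never varies.
X = pd.DataFrame({"signal": [0, 0, 0, 0, 1, 1, 1, 1], "constant": [5] * 8})
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

# Index the importances by feature name and sort from most to least important.
importances = (
    pd.Series(forest.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances.round(2))
```

Impurity-based importances such as these sum to 1 across features, so they describe relative rather than absolute contribution.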