# INFO-F-422 - Statistical Foundations of Machine Learning

### Guzga Adrian-Dumitru - __[adrian-dumitru.guzga@ulb.be](mailto:adrian-dumitru.guzga@ulb.be) - Student ID 460513__
### Martis William - __[william.martis@ulb.be](mailto:william.martis@ulb.be) - Student ID 441157__
### Schmidt Xavier - __[xavier.schmidt@ulb.be](mailto:xavier.schmidt@ulb.be) - Student ID 445723__

### Video presentation: www.youtube.com/doge_shiba_cummies_to_the_moon

## Project Title


# Introduction


# Data preprocessing

In [1]:
library(rlang)
library(dummies)
library(stringdist)
library(dplyr)
library(randomForest)

In [2]:
df <- read.csv("training_data.csv", stringsAsFactors = TRUE, header = TRUE, strip.white = TRUE, sep = ",")
#df <- df[, setdiff(names(df), "X")]
#df_labels <- read.csv("training_labels.csv", header = TRUE, strip.white = TRUE, sep = ",")

In [3]:
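lowercase_all = function(data) {
    # Lowercase every factor column so that differently capitalised spellings
    # count as the same value. tolower() returns character vectors, so the
    # affected columns are no longer factors afterwards.
    for(column in names(data)) {
        if(is.factor(data[[column]])) {
            data[[column]] <- tolower(data[[column]])
        }
    }
    data # return the modified copy instead of assigning globally with <<-
}

In [4]:
df <- lowercase_all(df)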

## Cleaning data

After analysing each initial feature of the dataset, we observed that some categorical features had many unique values that occur with very low frequency, which we considered to be noise. For example, most installers were responsible for only a very small number of water pumps (between 1 and 10), which is negligible compared to the size of the dataset (roughly 58,000 entries). We therefore sorted the unique values of the *installer* column by decreasing frequency and kept only the 10 most frequent installers. We applied the same treatment to other categorical features as well.
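The frequency-based filtering described above can be sketched as follows. This is only an illustration, not the code we ran: the helper name `keep_top_n` is hypothetical, and collapsing the rare values into a single `"other"` value is an assumption, since the text only states that the 10 most frequent installers are kept.

```r
# Hypothetical helper: keep the n most frequent values of a column and
# collapse every other value into one "other" value (an assumption).
keep_top_n <- function(values, n = 10, other = "other") {
    values <- as.character(values)
    counts <- sort(table(values), decreasing = TRUE)   # NA is not counted
    top <- names(counts)[seq_len(min(n, length(counts)))]
    # Missing values are left untouched so they can be imputed later
    ifelse(is.na(values) | values %in% top, values, other)
}

# Toy example with made-up installer names
installer <- c("dwe", "dwe", "dwe", "gov", "gov", "a", "b")
filtered <- keep_top_n(installer, n = 2)
print(filtered)
```

On the real data, the same call would be applied to `df$installer` with `n = 10`.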

### Geographical position of water pumps

Given how many faults we found in the dataset (misspelled funders and installers, incomplete names, etc.), we could not rule out the possibility that some of the water pumps are not actually located in Tanzania. To test this assumption and act accordingly, we defined a rectangular region containing Tanzania, described by two points (the latitude and longitude of its upper right and lower left corners), and checked that each entry in the dataset falls inside that region. Entries inside the region are kept; all others are deleted from the dataframe. This led to the identification and removal of roughly 1500 entries, which is about 2.5% of the dataset.

In [5]:
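# Bounding box around Tanzania (latitude/longitude in decimal degrees)
lat_max = 0.022    # upper right corner
lon_max = 40.729
lat_min = -12.729  # lower left corner
lon_min = 28.138

# Keep only the pumps inside the box; this drops entries whose coordinates
# fall outside Tanzania (for example in the Atlantic Ocean)
df <- df[df$latitude > lat_min & df$latitude < lat_max & df$longitude > lon_min & df$longitude < lon_max, ]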

## Missing value imputation

Our dataset has missing values, as any real-world dataset does. Counting the entries with at least one missing value showed that approximately 45% of the entries are incomplete. Obtaining a complete dataset is a challenge, because many approaches can be taken: mean/median value imputation (mostly used for numerical features), random sampling imputation (mostly used for categorical features), imputation with the most frequent value of a feature, or imputation based on similarity. We opted for the last option, as it made the most sense to look for the most similar complete entry in the dataset and copy its values into the missing fields of the incomplete entry. In this way we also do not need to treat categorical and numerical features differently, as would be the case with the first two approaches. We argue that the third option, based on frequency, biases the data too strongly towards the most common values, which can result in worse classification rates later.
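For comparison, the simpler alternatives mentioned above can be sketched on a toy dataframe. These are baselines we did not use; the helper names `impute_median` and `impute_mode` are hypothetical and the values are made up for illustration.

```r
# Toy dataframe with one numerical and one categorical feature (made-up values)
toy <- data.frame(
    population = c(120, NA, 300, 80),
    basin = c("pangani", "rufiji", NA, "pangani"),
    stringsAsFactors = FALSE
)

# Share of rows with at least one missing value
incomplete_share <- mean(!complete.cases(toy))

# Median imputation for a numerical feature
impute_median <- function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
}

# Most-frequent-value imputation for a categorical feature
impute_mode <- function(x) {
    counts <- table(x)   # NA is not counted
    x[is.na(x)] <- names(counts)[which.max(counts)]
    x
}

toy$population <- impute_median(toy$population)
toy$basin <- impute_mode(toy$basin)
print(toy)
```

Both helpers fill every gap of a feature with the same value, which is the source of the bias discussed above.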

In [6]:
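# Treat placeholder values (0, "-", empty string, "unknown") as missing in every column except the id
work_on = setdiff(names(df), c("id"))
df[work_on][df[work_on] == 0 | df[work_on] == "-" | df[work_on] == "" | df[work_on] == "unknown"] = NA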

In [7]:
drops <- c("latitude", "longitude", "funder", "date_recorded", "gps_height", "wpt_name", "num_private", "lga", "ward", "subvillage", "region_code", "district_code", "recorded_by", "scheme_name", "waterpoint_type_group", "payment", "management", "management_group", "extraction_type_group", "extraction_type", "quantity_group", "quality_group", "amount_tsh", "population", "public_meeting", "scheme_management", "permit", "source_type", "source_class")
df = df[ , !(names(df) %in% drops)] # drop features judged uninformative a priori; feature selection may later justify keeping more of them

In [20]:
# Optional: if we keep only 5 features

drops <- c("installer", "basin", "source", "payment_type")
full_df = full_df[ , !(names(full_df) %in% drops)] # drop four more features from the complete rows

To perform missing value imputation based on similarity, we take each row of the dataframe that holds the entries with missing values and compare its complete features with the same features of every complete entry in the dataset. We used the stringdist package to compute a distance between two words. We then take the index of the complete entry with the smallest total distance to the entry with missing values and, using the ID of the entry, insert the missing values into the original dataframe.
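The idea can be illustrated on a toy example before looking at the notebook cell. This is a minimal sketch under stated assumptions: `jaccard_chars` is a base-R stand-in for a character-level Jaccard distance (it is not the stringdist implementation), `impute_from_nearest` is a hypothetical helper, and the feature values are made up.

```r
# Base-R stand-in for a character-level Jaccard distance:
# 1 - (size of intersection / size of union) of the character sets of two strings
jaccard_chars <- function(a, b) {
    set_a <- unique(strsplit(a, "")[[1]])
    set_b <- unique(strsplit(b, "")[[1]])
    1 - length(intersect(set_a, set_b)) / length(union(set_a, set_b))
}

# Fill the NAs of `row` (a named character vector) from the closest row of `complete`
impute_from_nearest <- function(row, complete) {
    missing <- names(row)[is.na(row)]
    known <- setdiff(names(row), c(missing, "id"))   # the id is not a feature
    # Total distance to each complete row, summed over the known features
    distances <- sapply(seq_len(nrow(complete)), function(i) {
        sum(mapply(jaccard_chars, row[known], unlist(complete[i, known])))
    })
    nearest <- complete[which.min(distances), ]
    row[missing] <- unlist(nearest[missing])
    row
}

complete <- data.frame(
    id = c("1", "2", "3"),
    basin = c("pangani", "rufiji", "pangani"),
    source = c("spring", "river", "river"),
    payment_type = c("monthly", "never pay", "per bucket"),
    stringsAsFactors = FALSE
)
incomplete <- c(id = "4", basin = "rufiji", source = "river", payment_type = NA)

filled <- impute_from_nearest(incomplete, complete)
print(filled)
```

Here the second complete row matches the incomplete entry on both known features, so its `payment_type` is copied over.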

In [10]:
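similarity_test_apply = function(line) {
    # apply() passes each row as a named character vector
    idxs = which(is.na(line))                  # positions of the missing features
    known = setdiff(names(line)[-idxs], "id")  # complete features to compare on
    # Total Jaccard distance between this row and every complete row,
    # summed feature by feature (stringdist is vectorised over the column)
    distances = Reduce(`+`, lapply(known, function(column) {
        stringdist(line[[column]], as.character(full_df[[column]]), method='jaccard')
    }))
    most_similar = full_df[which.min(distances),]

    # Copy the missing values from the closest complete row into the global df
    to_change = which(strtoi(df$id) == strtoi(line[["id"]]))
    df[to_change, idxs] <<- most_similar[idxs]
}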

In [9]:
sum(!complete.cases(df)) # number of rows that contain at least one NA value
full_df <- na.omit(df)
dim(full_df)

In [11]:
na_df = dplyr::setdiff(df,full_df)
dim(na_df)

In [12]:
# Impute every incomplete row from its most similar complete row and time the run

start.time = Sys.time()
res = apply(na_df, 1, similarity_test_apply)
round(Sys.time() - start.time, 5)
write.csv(df, "similarity_applied.csv")

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multiple of length of shorter"
"longer argument not a multi

Time difference of 4.38331 hours

In [17]:
good_df = read.csv("similarity_applied.csv", stringsAsFactors=T, header = TRUE, strip.white = TRUE, sep = ",")
clean = c("X")
good_df = good_df[ , !(names(good_df) %in% clean)]
good_df

id,installer,basin,region,construction_year,extraction_type_class,payment_type,water_quality,quantity,source,waterpoint_type
<int>,<fct>,<fct>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
69572,roman,lake nyasa,iringa,1999,gravity,annually,soft,enough,spring,communal standpipe
8776,grumeti,lake victoria,mara,2010,gravity,never pay,soft,insufficient,rainwater harvesting,communal standpipe
34310,world vision,pangani,manyara,2009,gravity,per bucket,soft,enough,dam,communal standpipe multiple
67743,unicef,ruvuma / southern coast,mtwara,1986,submersible,never pay,soft,dry,machine dbh,communal standpipe multiple
19728,artisan,lake victoria,kagera,1999,gravity,never pay,soft,seasonal,rainwater harvesting,communal standpipe
9944,dwe,pangani,tanga,2009,submersible,per bucket,salty,enough,other,communal standpipe multiple
19816,dwsp,internal,shinyanga,1999,handpump,never pay,soft,enough,machine dbh,hand pump
54551,dwe,lake tanganyika,shinyanga,1999,handpump,annually,milky,enough,shallow well,hand pump
53934,water aid,lake tanganyika,tabora,1999,handpump,never pay,salty,seasonal,machine dbh,hand pump
46144,artisan,lake victoria,kagera,1999,handpump,never pay,soft,enough,shallow well,hand pump


In [16]:
df

Unnamed: 0_level_0,id,installer,basin,region,construction_year,extraction_type_class,payment_type,water_quality,quantity,source,waterpoint_type
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,69572,roman,lake nyasa,iringa,1999,gravity,annually,soft,enough,spring,communal standpipe
2,8776,grumeti,lake victoria,mara,2010,gravity,never pay,soft,insufficient,rainwater harvesting,communal standpipe
3,34310,world vision,pangani,manyara,2009,gravity,per bucket,soft,enough,dam,communal standpipe multiple
4,67743,unicef,ruvuma / southern coast,mtwara,1986,submersible,never pay,soft,dry,machine dbh,communal standpipe multiple
5,19728,artisan,lake victoria,kagera,1999,gravity,never pay,soft,seasonal,rainwater harvesting,communal standpipe
6,9944,dwe,pangani,tanga,2009,submersible,per bucket,salty,enough,other,communal standpipe multiple
7,19816,dwsp,internal,shinyanga,1999,handpump,never pay,soft,enough,machine dbh,hand pump
8,54551,dwe,lake tanganyika,shinyanga,1999,handpump,annually,milky,enough,shallow well,hand pump
9,53934,water aid,lake tanganyika,tabora,1999,handpump,never pay,salty,seasonal,machine dbh,hand pump
10,46144,artisan,lake victoria,kagera,1999,handpump,never pay,soft,enough,shallow well,hand pump


## Feature engineering

### One-hot encoding

One-hot encoding is a feature engineering method for categorical features. It encodes a categorical feature with $m$ unique values into $m$ binary features, each indicating whether or not an observation takes the corresponding value of the original feature. One-hot encoding is generally suited to categorical features whose unique values have no natural order. As a counterexample, consider a feature representing the size of a T-shirt, with 4 unique values: S, M, L, XL. These values can be encoded directly by assigning the integers 0 to 3 to the sizes in the order given above, because a natural order exists between them.

In the dataset we use, however, this is not the case. Certain features, such as the installer of the waterpump, contain unique values that cannot be ordered in any meaningful way. Encoding those values as integers would imply an order between them that does not exist, and a model could wrongly exploit it. Thus, we decided to apply one-hot encoding to all the categorical features. This results in a higher dimensionality of the feature space, since each categorical feature with $m$ unique values contributes $m$ columns.
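
As a minimal sketch of the two encodings discussed above, the following uses base R only. The toy vectors `size` and `installer` are invented for illustration and are not taken from the dataset; the notebook itself relies on `dummy.data.frame` from the `dummies` package rather than on `model.matrix`.

```r
# Toy data, invented for illustration only.
size      <- c("S", "M", "L", "XL", "M")
installer <- c("dwe", "roman", "other", "dwe", "other")

# Ordinal encoding: the levels have a natural order, so integer codes 0..3 are meaningful.
size_codes <- as.integer(factor(size, levels = c("S", "M", "L", "XL"))) - 1L
print(size_codes)

# One-hot encoding: one binary column per unique value, no order implied.
# "- 1" removes the intercept so that every level gets its own column.
installer_onehot <- model.matrix(~ installer - 1,
                                 data.frame(installer = factor(installer)))
print(installer_onehot)
```

Each row of `installer_onehot` contains exactly one 1, in the column of the installer that the row belongs to.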



In [21]:
df_labels <- read.csv("training_labels.csv", stringsAsFactors=T, header = TRUE, strip.white = TRUE, sep = ",")

In [27]:
good_df = df

In [29]:
indx <- tail(names(sort(table(df$installer))),10)
print(indx)
to_change = which(!(good_df$installer %in% indx))
good_df[to_change, 'installer'] = 'other'
good_df

 [1] "central government" "kkkt"               "district council"  
 [4] "danida"             "commu"              "rwe"               
 [7] "hesawa"             "government"         "roman"             
[10] "dwe"               


Unnamed: 0_level_0,id,installer,basin,region,construction_year,extraction_type_class,payment_type,water_quality,quantity,source,waterpoint_type
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,69572,roman,lake nyasa,iringa,1999,gravity,annually,soft,enough,spring,communal standpipe
2,8776,other,lake victoria,mara,2010,gravity,never pay,soft,insufficient,rainwater harvesting,communal standpipe
3,34310,other,pangani,manyara,2009,gravity,per bucket,soft,enough,dam,communal standpipe multiple
4,67743,other,ruvuma / southern coast,mtwara,1986,submersible,never pay,soft,dry,machine dbh,communal standpipe multiple
5,19728,other,lake victoria,kagera,1999,gravity,never pay,soft,seasonal,rainwater harvesting,communal standpipe
6,9944,dwe,pangani,tanga,2009,submersible,per bucket,salty,enough,other,communal standpipe multiple
7,19816,other,internal,shinyanga,1999,handpump,never pay,soft,enough,machine dbh,hand pump
8,54551,dwe,lake tanganyika,shinyanga,1999,handpump,annually,milky,enough,shallow well,hand pump
9,53934,other,lake tanganyika,tabora,1999,handpump,never pay,salty,seasonal,machine dbh,hand pump
10,46144,other,lake victoria,kagera,1999,handpump,never pay,soft,enough,shallow well,hand pump
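
The top-N grouping shown above for `installer` could be wrapped in a small helper so that it is applied in the same way to other columns. This is only a sketch: the function name `keep_top_levels` and the toy vector are hypothetical, and the value of `n` to use for each feature is left open. Ties between equally frequent values may be broken differently than with `tail(names(sort(table(...))), n)`.

```r
# Hypothetical helper: keep the n most frequent values of a vector
# and replace every other value by `other_label`.
keep_top_levels <- function(x, n = 10, other_label = "other") {
    x <- as.character(x)
    counts <- sort(table(x), decreasing = TRUE)
    top <- names(counts)[seq_len(min(n, length(counts)))]
    x[!(x %in% top)] <- other_label
    x
}

# Toy example (invented values): keep the 2 most frequent installers.
installers <- c("dwe", "dwe", "dwe", "roman", "roman", "kkkt", "commu")
grouped <- keep_top_levels(installers, n = 2)
print(table(grouped))
```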


In [3]:
"(indx <- tail(names(sort(table(full_df$extraction_type_class))),17)
print(indx)
to_change = which(!(full_df$extraction_type %in% indx))
full_df[to_change, 'extraction_type'] = 'other'
full_df)"

In [27]:
Merged_df = full_df

In [28]:
Merged_df<-merge(Merged_df,df_labels,by="id",all=FALSE)
dim(Merged_df)
Merged_df = Merged_df[,2:ncol(Merged_df)]
dim(Merged_df)

In [29]:
Merged_df = dummy.data.frame(Merged_df, sep="_")

"non-list contrasts argument ignored"
"non-list contrasts argument ignored"
"non-list contrasts argument ignored"
"non-list contrasts argument ignored"
"non-list contrasts argument ignored"
"non-list contrasts argument ignored"


In [23]:
dim(Merged_df)
Merged_df[,70:80]

source_spring,waterpoint_type_cattle trough,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other,status_group_functional,status_group_functional needs repair,status_group_non functional
0,0,0,0,0,1,0,0,1,0,0
1,0,1,0,0,0,0,0,1,0,0
0,0,1,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,1,0,0
0,0,0,1,0,0,0,0,1,0,0
0,0,0,1,0,0,0,0,0,1,0
0,0,0,0,0,0,0,1,0,0,1
1,0,1,0,0,0,0,0,0,0,1
0,0,1,0,0,0,0,0,1,0,0


## Feature selection

In [None]:
#df.pca <- prcomp(dftest_, center = TRUE,scale. = TRUE)
#summary(df.pca)

# Model selection

## Model 1 : Random Forest

In [24]:
# Known problem: the target was one-hot encoded into 3 status_group_* columns, but only the last one is used as the target here; the other two remain among the predictors.
library("randomForest")

n_trees <- 20
accuracy_vec <- array(0,n_trees)

df_idx <- sample(1:nrow(Merged_df))
half_split <- floor(nrow(Merged_df)/4*3)
target_variable <-ncol(Merged_df)


    # Take the first 75% of the shuffled rows as the training set
train_data <- Merged_df[df_idx[1:half_split],]

    # Take the remaining 25% as the hold-out (test) set
test_data <- Merged_df[df_idx[(half_split+1):nrow(Merged_df)],]
    
model_rf <- randomForest(x=train_data[,-c(target_variable)],
                         y=as.factor(train_data[,c(target_variable)]),
                         xtest=test_data[,-c(target_variable)],
                         ytest=as.factor(test_data[,c(target_variable)]),
                         ntree=n_trees, do.trace=T)
    

# R vectors are 1-indexed: assigning to index 0 silently does nothing
accuracy_vec[1] = (model_rf$test$confusion[1,1]+model_rf$test$confusion[2,2])/sum(model_rf$test$confusion)

#plot(accuracy_vec,main = "Number of trees influence",xlab = "Nbr of trees",ylab = "Classification rate") 

ntree      OOB      1      2|    Test      1      2
    1:   3.15%  1.74%  5.85%|   3.20%  1.51%  6.32%
    2:   3.04%  2.22%  4.61%|   3.00%  1.34%  6.07%
    3:   3.88%  2.42%  6.70%|   1.40%  0.73%  2.65%
    4:   4.20%  2.61%  7.28%|   1.50%  0.61%  3.16%
    5:   4.55%  2.54%  8.51%|   0.84%  0.49%  1.49%
    6:   4.26%  2.44%  7.84%|   1.22%  0.61%  2.36%
    7:   4.16%  2.24%  7.91%|   0.99%  0.59%  1.74%
    8:   3.76%  1.89%  7.43%|   1.20%  0.37%  2.72%
    9:   3.17%  1.55%  6.35%|   0.61%  0.33%  1.13%
   10:   2.61%  1.35%  5.09%|   0.61%  0.24%  1.31%
   11:   2.34%  1.14%  4.70%|   0.40%  0.22%  0.73%
   12:   1.92%  0.95%  3.82%|   0.25%  0.10%  0.54%
   13:   1.69%  0.77%  3.50%|   0.27%  0.12%  0.54%
   14:   1.54%  0.63%  3.33%|   0.23%  0.08%  0.51%
   15:   1.23%  0.52%  2.62%|   0.19%  0.08%  0.40%
   16:   1.07%  0.49%  2.22%|   0.18%  0.04%  0.44%
   17:   1.03%  0.44%  2.18%|   0.17%  0.06%  0.36%
   18:   0.91%  0.43%  1.87%|   0.14%  0.04%  0.33%
   19:   0.8

In [25]:
y_pred_rf = predict(model_rf, newdata = test_data[,-target_variable])

ERROR: Error in predict.randomForest(model_rf, newdata = test_data[, -target_variable]): No forest component in the object
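
The error above is consistent with how `randomForest` behaves when a test set is passed at fit time: according to the package documentation, `keep.forest` defaults to `FALSE` whenever `xtest` is supplied, so the returned object contains no forest for `predict()` to use. A minimal sketch of a workaround follows. It uses the built-in `iris` data instead of the waterpump data, an assumption made only to keep the example self-contained.

```r
library(randomForest)

set.seed(1)
idx   <- sample(seq_len(nrow(iris)))
train <- iris[idx[1:100], ]
test  <- iris[idx[101:150], ]

# keep.forest = TRUE retains the trees even though xtest/ytest are supplied.
rf <- randomForest(x = train[, 1:4], y = train$Species,
                   xtest = test[, 1:4], ytest = test$Species,
                   ntree = 20, keep.forest = TRUE)

pred <- predict(rf, newdata = test[, 1:4])
print(mean(pred == test$Species))
```

Alternatively, the test-set predictions computed during fitting are already stored in `rf$test$predicted`, so a separate call to `predict()` is not strictly needed.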


In [26]:
print((model_rf$test$confusion[1,1]+model_rf$test$confusion[2,2])/sum(model_rf$test$confusion))

[1] 0.9985976
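
This accuracy should be read with caution. In the Random Forest cell, `target_variable` points at the last column only, so the two other one-hot `status_group_*` columns stay among the predictors and carry information about the label. One possible remedy, sketched below on an invented toy data frame, is to collapse the three indicator columns back into a single factor and to remove all three from the inputs. The helper name `collapse_one_hot` and the toy column names are hypothetical.

```r
# Hypothetical helper: turn one-hot indicator columns back into a single factor.
collapse_one_hot <- function(df, cols) {
    labels <- cols[max.col(as.matrix(df[, cols]), ties.method = "first")]
    factor(labels, levels = cols)
}

# Toy data frame (invented values) mimicking the layout of Merged_df.
toy <- data.frame(
    quantity_dry = c(0, 1, 0, 0),
    quantity_enough = c(1, 0, 1, 0),
    status_group_functional = c(1, 0, 0, 1),
    status_group_needs_repair = c(0, 0, 1, 0),
    status_group_non_functional = c(0, 1, 0, 0)
)

status_cols <- grep("^status_group_", names(toy), value = TRUE)
# Single 3-class target
y <- collapse_one_hot(toy, status_cols)
# Predictors without any of the target columns
x <- toy[, setdiff(names(toy), status_cols), drop = FALSE]

print(y)
print(names(x))
```

With `x` and `y` built this way, a classifier receives a genuine 3-class target and no column derived from the label.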


## Model 2 : Neural Network

In [27]:
library(nnet)

df_idx <- sample(1:nrow(Merged_df))
half_split <- floor(nrow(Merged_df)/4*3)
target_variable <-ncol(Merged_df)


    # Take the first 75% of the shuffled rows as the training set
train_data <- Merged_df[df_idx[1:half_split],]

    # Take the remaining 25% as the hold-out (test) set
test_data <- Merged_df[df_idx[(half_split+1):nrow(Merged_df)],]

hidden_nodes<-72
threshold<-0.5
target_variable <-c(ncol(Merged_df)-2,ncol(Merged_df)-1,ncol(Merged_df))

model_nn<-nnet(x=train_data[,-target_variable],
               y=train_data[,target_variable],
               size=hidden_nodes,
               softmax=T,
               skip=FALSE,
               trace=T, 
               maxit=1000,
               MaxNWts=100000,
               rang=0.5)

model_nn

Y_pred<-predict(model_nn,test_data[,-target_variable])




#Y_hat_one_hot <- ifelse(Y_pred[,2] > threshold,"functional","functional needs repair")

head(Y_pred)

# weights:  5835
initial  value 42805.787271 
iter  10 value 20332.990796
iter  10 value 20332.990793
iter  10 value 20332.990793
final  value 20332.990793 
converged


a 77-72-3 network with 5835 weights
options were - softmax modelling 

Unnamed: 0,status_group_functional,status_group_functional needs repair,status_group_non functional
2636,0.5875005,0.06891277,0.3435867
17165,0.5875005,0.06891277,0.3435867
20360,0.5875005,0.06891277,0.3435867
27280,0.5875005,0.06891277,0.3435867
4994,0.5875005,0.06891277,0.3435867
13457,0.5875005,0.06891277,0.3435867


In [28]:
y_hat_idx = cbind(1:nrow(Y_pred), max.col(Y_pred, 'first'))

In [29]:
Y <- test_data[,target_variable]
y_idx = cbind(1:nrow(Y), max.col(Y, 'first'))

In [30]:
good_preds = which(y_hat_idx[,2] == y_idx[,2])
n_good = length(good_preds)
print(n_good/nrow(Y))

[1] 0.6036197
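
The six rows printed by `head(Y_pred)` are identical, which suggests (though the excerpt alone does not prove it) that the network outputs the same class probabilities for every input. A quick sanity check in such a case is to compare the accuracy of the model with a majority-class baseline. The sketch below uses invented labels; `y_true` and `y_pred` are hypothetical names.

```r
# Invented labels for illustration only.
y_true <- c("functional", "functional", "non functional",
            "functional", "non functional")
# A model that always predicts the same class.
y_pred <- rep("functional", length(y_true))

accuracy <- mean(y_pred == y_true)
# Accuracy obtained by always predicting the most frequent class.
baseline <- max(table(y_true)) / length(y_true)

print(c(accuracy = accuracy, baseline = baseline))
```

If the two numbers coincide, the model has learned nothing beyond the class frequencies.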


## Model 3 :  Support Vector Machine

In [133]:
#install.packages("e1071")
library(e1071)

names(Merged_df)[(length(names(Merged_df))-1)]<-"status_group_functional_needs_repair"
names(Merged_df)[length(names(Merged_df))]<-"status_group_non_functional"
Merged_df

df_idx <- sample(1:nrow(Merged_df))
half_split <- floor(nrow(Merged_df)/2)

#Merged_df

    #3.1 Take the first half of the dataset as a training data set
train_data <- Merged_df[df_idx[1:half_split],]

    #3.2 Take the second half of the dataset as a hold out or test data set
test_data <- Merged_df[df_idx[(half_split+1):nrow(Merged_df)],]

train_funct = train_data[,-which(names(train_data) == "status_group_functional_needs_repair" | names(train_data) == "status_group_non_functional")]
train_funct_rep = train_data[,-which(names(train_data) == "status_group_functional" | names(train_data) == "status_group_non_functional")]
train_non_funct = train_data[,-which(names(train_data) == "status_group_functional_needs_repair" | names(train_data) == "status_group_functional")]

test_funct = test_data[,-which(names(test_data) == "status_group_functional_needs_repair" | names(test_data) == "status_group_non_functional")]
test_funct_rep = test_data[,-which(names(test_data) == "status_group_functional" | names(test_data) == "status_group_non_functional")]
test_non_funct = test_data[,-which(names(test_data) == "status_group_functional_needs_repair" | names(test_data) == "status_group_functional")]

#DS<-cbind(train_funct,functional = train_funct[,ncol(train_funct)])

# Model fit (using the svm function from e1071): "functional" vs. rest
model_svm_f<- svm(status_group_functional~.,train_funct, probability=TRUE)

# Model prediction 
Y_hat_f<- predict(model_svm_f,test_funct, probability=TRUE)

#DS<-cbind(train_funct_rep,funct_rep = train_funct_rep[,ncol(train_funct_rep)])

# Model fit (using the svm function from e1071): "functional needs repair" vs. rest
model_svm_fr<- svm(status_group_functional_needs_repair~.,train_funct_rep,probability=TRUE)

# Model prediction 
Y_hat_fr<- predict(model_svm_fr,test_funct_rep,probability=TRUE)

#DS<-cbind(train_non_funct,non_functional = train_non_funct[,ncol(train_non_funct)])

# Model fit (using the svm function from e1071): "non functional" vs. rest
model_svm_nf<- svm(status_group_non_functional~.,train_non_funct,probability=TRUE)

# Model prediction 
Y_hat_nf<- predict(model_svm_nf,test_non_funct,probability=TRUE)



Unnamed: 0_level_0,region_arusha,region_iringa,region_kigoma,region_kilimanjaro,region_lindi,region_manyara,region_mara,region_morogoro,region_mtwara,region_mwanza,...,waterpoint_type_cattle trough,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other,status_group_functional,status_group_functional_needs_repair,status_group_non_functional
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
5,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,1,0,0
7,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
8,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1
9,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
10,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0


In [134]:
Y_hat_fr = data.frame(Y_hat_fr)
Y_hat_nf = data.frame(Y_hat_nf)
Y_hat_f = data.frame(Y_hat_f)
predictions = cbind(Y_hat_f, Y_hat_fr, Y_hat_nf)
predictions

Unnamed: 0_level_0,Y_hat_f,Y_hat_fr,Y_hat_nf
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
2402,0.86920435,0.025002690,0.15226842
3073,0.04802844,0.953328138,0.04712184
18813,0.95071125,0.025161582,0.04722639
6434,0.90832576,0.025144738,0.04735435
6327,0.95442068,0.024968544,0.04498301
20572,0.01996273,0.015296122,0.98032890
30994,0.03239396,0.025165586,0.95242882
16625,0.94889237,0.025183205,0.04731846
21508,0.93720234,0.025157662,0.04430976
22433,0.95082491,0.025088885,0.04523310


In [135]:
y_hat_idx_svm = cbind(1:nrow(predictions), max.col(predictions, 'first'))

In [136]:
Y = test_data[,(ncol(test_data)-2):ncol(test_data)]
y_idx_svm = cbind(1:nrow(Y), max.col(Y, 'first'))

In [98]:
train_funct

Unnamed: 0_level_0,region_arusha,region_iringa,region_kigoma,region_kilimanjaro,region_lindi,region_manyara,region_mara,region_morogoro,region_mtwara,region_mwanza,...,quantity_insufficient,quantity_seasonal,waterpoint_type_cattle trough,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other,status_group_functional
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
5947,0,0,0,1,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,1
20097,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
19987,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
7245,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3123,0,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
9120,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
26480,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
23372,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
3115,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
8092,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [137]:
good_preds = which(y_hat_idx_svm[,2] == y_idx_svm[,2])
n_good = length(good_preds)
print(n_good/nrow(Y))

[1] 0.7435636


In [45]:
dim(Merged_df)
sum(complete.cases(Merged_df))

In [None]:
###CV test
library(e1071)
CV_folds <- 3
N = nrow(Merged_df)

size_CV <-floor(N/CV_folds)

CV_err<-numeric(CV_folds)

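# The last two dummy columns contain spaces in their names; rename them so they can be used in formulas
names(Merged_df)[(length(names(Merged_df))-1)]<-"status_group_functional_needs_repair"
names(Merged_df)[length(names(Merged_df))]<-"status_group_non_functional"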

for (i in 1:CV_folds) {
    idx_ts<-(((i-1)*size_CV+1):(i*size_CV))  ### idx_ts holds the indices of the test set for the i-th fold
    test_data<-Merged_df[idx_ts,]  
    
    idx_tr<-setdiff(1:N,idx_ts) ### idx_tr holds the indices of the training set for the i-th fold
    
    train_data <- Merged_df[idx_tr,]
     
    train_funct = train_data[,-which(names(train_data) == "status_group_functional_needs_repair" | names(train_data) == "status_group_non_functional")]
    train_funct_rep = train_data[,-which(names(train_data) == "status_group_functional" | names(train_data) == "status_group_non_functional")]
    train_non_funct = train_data[,-which(names(train_data) == "status_group_functional_needs_repair" | names(train_data) == "status_group_functional")]

    test_funct = test_data[,-which(names(test_data) == "status_group_functional_needs_repair" | names(test_data) == "status_group_non_functional")]
    test_funct_rep = test_data[,-which(names(test_data) == "status_group_functional" | names(test_data) == "status_group_non_functional")]
    test_non_funct = test_data[,-which(names(test_data) == "status_group_functional_needs_repair" | names(test_data) == "status_group_functional")]

    #DS<-cbind(train_funct,functional = train_funct[,ncol(train_funct)])

    # Model fit (one-vs-rest SVM for the "functional" class; the 0/1 target is numeric, so svm() fits a regression)
    model_svm_f<- svm(status_group_functional~.,train_funct, probability=TRUE)

    # Model prediction (real-valued score for the "functional" class)
    Y_hat_f<- predict(model_svm_f,test_funct, probability=TRUE)

    #DS<-cbind(train_funct_rep,funct_rep = train_funct_rep[,ncol(train_funct_rep)])

    # Model fit (one-vs-rest SVM for the "functional needs repair" class)
    model_svm_fr<- svm(status_group_functional_needs_repair~.,train_funct_rep,probability=TRUE)

    # Model prediction (real-valued score for the "functional needs repair" class)
    Y_hat_fr<- predict(model_svm_fr,test_funct_rep,probability=TRUE)

    #DS<-cbind(train_non_funct,non_functional = train_non_funct[,ncol(train_non_funct)])

    # Model fit (one-vs-rest SVM for the "non functional" class)
    model_svm_nf<- svm(status_group_non_functional~.,train_non_funct,probability=TRUE)

    # Model prediction (real-valued score for the "non functional" class)
    Y_hat_nf<- predict(model_svm_nf,test_non_funct,probability=TRUE)
     
    
    
    Y_hat_fr = data.frame(Y_hat_fr)
    Y_hat_nf = data.frame(Y_hat_nf)
    Y_hat_f = data.frame(Y_hat_f)
    predictions = cbind(Y_hat_f, Y_hat_fr, Y_hat_nf)
    # Fold accuracy: share of rows whose highest-scoring class matches the true class
    
    y_hat_idx_svm = cbind(1:nrow(predictions), max.col(predictions, 'first'))
    Y = test_data[,(ncol(test_data)-2):ncol(test_data)]
    y_idx_svm = cbind(1:nrow(Y), max.col(Y, 'first'))
    
    good_preds = which(y_hat_idx_svm[,2] == y_idx_svm[,2])
    n_good = length(good_preds)
    accuracy = n_good/nrow(Y)
    CV_err[i] = 1 - accuracy  # misclassification rate of the i-th fold
    
    print(i)
    print(accuracy)
    
    # TODO: compute the cross-entropy as well
}
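
The loop above leaves the cross-entropy uncomputed. One possible way to obtain it from the per-class scores is sketched below. The helper `cross_entropy` and the toy matrices are assumptions made for illustration. Because the SVM scores are not guaranteed to be probabilities, the sketch clips them away from zero and rescales each row to sum to one before taking logarithms; this normalisation is a pragmatic choice, not something prescribed by the notebook.

```r
# Hypothetical helper, not part of the original notebook.
# Y: one-hot matrix of true classes; scores: matrix of per-class scores.
cross_entropy <- function(Y, scores, eps = 1e-15) {
    Y <- as.matrix(Y)
    P <- as.matrix(scores)
    P[P < eps] <- eps          # clip so that log() stays finite
    P <- P / rowSums(P)        # rescale each row to sum to 1
    -mean(rowSums(Y * log(P))) # average negative log-score of the true class
}

# Toy example with two observations and three classes.
Y_toy <- rbind(c(1, 0, 0),
               c(0, 0, 1))
S_toy <- rbind(c(0.80, 0.10, 0.10),
               c(0.25, 0.25, 0.50))
ce <- cross_entropy(Y_toy, S_toy)
print(ce)
```

Inside the loop, the same helper could be called as `cross_entropy(Y, predictions)` once both objects exist for the fold.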

# Alternative models





In [18]:
library(keras)

"package 'keras' was built under R version 4.0.5"


In [19]:
good_df

| id | installer | basin | region | construction_year | extraction_type_class | payment_type | water_quality | quantity | source | waterpoint_type |
|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 69572 | roman | lake nyasa | iringa | 1999 | gravity | annually | soft | enough | spring | communal standpipe |
| 8776 | grumeti | lake victoria | mara | 2010 | gravity | never pay | soft | insufficient | rainwater harvesting | communal standpipe |
| 34310 | world vision | pangani | manyara | 2009 | gravity | per bucket | soft | enough | dam | communal standpipe multiple |
| 67743 | unicef | ruvuma / southern coast | mtwara | 1986 | submersible | never pay | soft | dry | machine dbh | communal standpipe multiple |
| 19728 | artisan | lake victoria | kagera | 1999 | gravity | never pay | soft | seasonal | rainwater harvesting | communal standpipe |
| 9944 | dwe | pangani | tanga | 2009 | submersible | per bucket | salty | enough | other | communal standpipe multiple |
| 19816 | dwsp | internal | shinyanga | 1999 | handpump | never pay | soft | enough | machine dbh | hand pump |
| 54551 | dwe | lake tanganyika | shinyanga | 1999 | handpump | annually | milky | enough | shallow well | hand pump |
| 53934 | water aid | lake tanganyika | tabora | 1999 | handpump | never pay | salty | seasonal | machine dbh | hand pump |
| 46144 | artisan | lake victoria | kagera | 1999 | handpump | never pay | soft | enough | shallow well | hand pump |


In [30]:
Merged_df<-merge(good_df,df_labels,by="id",all=FALSE)
dim(Merged_df)



In [31]:
Merged_df = Merged_df[,2:ncol(Merged_df)] #drop id when training, not a feature

In [32]:
Merged_df = dummy.data.frame(Merged_df, sep="_")
dim(Merged_df)

"non-list contrasts argument ignored"


In [33]:
Merged_df

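| | installer_central government | installer_commu | installer_danida | installer_district council | installer_dwe | installer_government | installer_hesawa | installer_kkkt | installer_other | installer_roman | ... | waterpoint_type_cattle trough | waterpoint_type_communal standpipe | waterpoint_type_communal standpipe multiple | waterpoint_type_dam | waterpoint_type_hand pump | waterpoint_type_improved spring | waterpoint_type_other | status_group_functional | status_group_functional needs repair | status_group_non functional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ... | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |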
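
The table above is the result of one-hot encoding: every level of every categorical feature becomes its own 0/1 column named `feature_level`. The same kind of encoding can be sketched in base R with `model.matrix`, which is a different route from the `dummy.data.frame` call used in the notebook. The toy data frame and the `one_hot` helper are assumptions made for illustration.

```r
# Toy data frame with two categorical features (values borrowed from the dataset).
toy <- data.frame(
    quantity = factor(c("enough", "dry", "seasonal", "enough")),
    basin    = factor(c("pangani", "internal", "pangani", "lake nyasa"))
)

# Hypothetical helper: one indicator column per level of the factor f.
# "- 1" removes the intercept so that no level is dropped.
one_hot <- function(f, name) {
    m <- model.matrix(~ f - 1)
    colnames(m) <- paste(name, levels(f), sep = "_")
    m
}

# Encode every column and bind the indicator blocks side by side.
encoded <- do.call(cbind, lapply(names(toy), function(n) one_hot(toy[[n]], n)))
print(encoded)
```

Each row contains exactly one 1 per original feature, which is why the three `status_group_*` columns can later be read back into a class index with `max.col`.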


In [37]:
library(tensorflow)

physical_devices = tf$config$list_physical_devices('GPU')
# Enable memory growth only when a GPU is available; indexing an empty device list fails otherwise
if (length(physical_devices) > 0) {
    tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)
}
tf$keras$backend$set_floatx('float32')


In [40]:
install_tensorflow(version='gpu')

ERROR: Error: could not find a Python environment for C:/Users/guzga/AppData/Local/Programs/Python/Python39/python.exe


In [41]:
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 120, activation = 'relu', input_shape = c(82)) %>% 
  #layer_dropout(rate = 0.1) %>% 
  layer_dense(units = 140, activation = 'relu') %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dense(units = 90, activation = 'relu') %>%
  layer_dense(units = 80, activation = 'relu') %>%
  layer_dense(units = 75, activation = 'relu') %>%
  layer_dense(units = 40, activation = 'relu') %>%
  layer_dense(units = 20, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'relu') %>%
  #layer_dropout(rate = 0.1) %>%
  layer_dense(units = 3, activation = 'softmax')
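
The last layer uses a softmax activation, which turns the three raw outputs of the network into class probabilities that sum to one. A minimal base R sketch of that transformation (the `softmax` helper and the `logits` values are illustrative assumptions, not taken from the trained model):

```r
# Hypothetical helper: softmax of a numeric vector.
softmax <- function(z) {
    z <- z - max(z)          # shift for numerical stability; the result is unchanged
    exp(z) / sum(exp(z))
}

# Made-up raw outputs for the three classes.
logits <- c(2.0, 0.5, -1.0)
p <- softmax(logits)
print(p)
print(which.max(p))          # index of the most probable class
```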

In [42]:
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

In [None]:
df_idx <- sample(1:nrow(Merged_df))
train_size <- floor(nrow(Merged_df)/4*3)
target_variables <-c((ncol(Merged_df)-2),(ncol(Merged_df)-1), ncol(Merged_df))


    #3.1 Take the first three quarters of the shuffled dataset as the training set
train_data <- Merged_df[df_idx[1:train_size],]
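
The cell above keeps three quarters of the shuffled rows for training. The remaining quarter can serve as a held-out test set. A minimal sketch of the full split on a toy data frame (`toy_df`, `n_train`, `train_part`, `test_part` and the seed are assumptions made for illustration):

```r
set.seed(42)                        # arbitrary seed, only for reproducibility
toy_df <- data.frame(x = 1:20, y = rep(c(0, 1), 10))

idx <- sample(1:nrow(toy_df))       # shuffled row indices
n_train <- floor(nrow(toy_df) * 3 / 4)

# First three quarters of the shuffled indices for training, the rest for testing.
train_part <- toy_df[idx[1:n_train], ]
test_part  <- toy_df[idx[(n_train + 1):nrow(toy_df)], ]

print(dim(train_part))
print(dim(test_part))
```

Because both parts are cut from the same shuffled index vector, no row can end up in both sets.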

In [None]:
x_train = train_data[,-target_variables]
y_train = train_data[,target_variables]


# Convert the data frames to numeric matrices before passing them to keras
x_train = data.matrix(x_train)
y_train = data.matrix(y_train)
# Drop the first column so that 82 feature columns remain, as expected by input_shape = c(82)
x_train = x_train[,2:ncol(x_train)]
dim(x_train)

In [None]:
history <- model %>% fit(
  x_train, y_train, 
  epochs = 200, batch_size = 64,
  validation_split = 0.2
)

In [None]:
plot(history)

# Conclusions