# Data Cleaning and Modeling
This notebook imports, cleans, and saves the data found at the paths specified in *config.ini*, then fits a model on the cleaned data.


These variables control how the challenge's *.csv* files are transformed.
* *to_drop*
<br>A list of columns to drop from the data.
* *dtype_dict*
<br>A dictionary mapping column names to the dtypes they should be read as.
* *fill_dict*
<br>A dictionary of fill values for missing data, used during cleaning.

---

Import the required libraries and the project's helper functions. Additional libraries
(such as `configparser` and `pd`, used below) are imported through *src.etl_functions*.

In [1]:
import sys 
sys.path.append('../')
# Imports needed libraries
# and created functions
from src.etl_functions import *
import warnings
warnings.filterwarnings("ignore")

Read the configuration file.

In [2]:
config = configparser.ConfigParser()
config.read("src/config.ini")

['src/config.ini']
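
The list printed above is the return value of `config.read`, which names the files it parsed successfully; an empty list would mean the file was not found. As a rough sketch of what the *paths* section might look like, the example below parses an in-memory string instead of a file. The key names match the lookups used in this notebook, but every path value is a hypothetical placeholder.

```python
import configparser

# Hypothetical config.ini contents: the section and key names match the
# lookups in this notebook, but every path value is a placeholder.
EXAMPLE_INI = """
[paths]
data_path = data/output/
train_data = data/train_values.csv
train_labels = data/train_labels.csv
test_data = data/test_values.csv
sub_form = data/submission_format.csv
"""

config_example = configparser.ConfigParser()
config_example.read_string(EXAMPLE_INI)

# Values come back as plain strings, ready to hand to a CSV reader.
paths = dict(config_example["paths"])
print(paths["train_data"])
```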

Read the output path and the training and test data paths from the configuration. Load
the training labels and the example submission form from their *.csv* files.

In [3]:
# Specified output path
output = config['paths']['data_path']
# Specified data paths
trn_data = config['paths']['train_data']
trn_lbls = pd.read_csv(config['paths']['train_labels'])
tst_data = config['paths']['test_data']
# Import submission format
sub_form = pd.read_csv(config['paths']['sub_form'])

The inputs below can be modified.

* *to_drop*
<br>Columns to drop and exclude from the data.
* *fill_dict*
<br>Dictionary of values used to fill nulls, keyed by column.
* *dtype_dict*
<br>Dictionary of dtypes used when importing columns. It applies only to the original
*.csv* files provided for the challenge.
* *n*
<br>The limit for categorical columns. Each categorical column is limited to at most
*n*+1 categories. Defaults to 20 if not specified.
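
One plausible reading of the *n*+1 limit is that the *n* most frequent values are kept and everything else is lumped into a single catch-all category. The sketch below illustrates that idea only. The function name `limit_categories`, the `other` label, and the sample values are all hypothetical, and the actual logic in *src.etl_functions* may differ.

```python
import pandas as pd

def limit_categories(col: pd.Series, n: int = 20, other: str = "Other") -> pd.Series:
    """Keep the n most frequent values of col and replace the rest with `other`."""
    top = col.value_counts().nlargest(n).index.tolist()
    # Leave missing values alone so they can be filled separately.
    keep = col.isin(top) | col.isna()
    limited = col.astype(object).where(keep, other)
    return limited.astype("category")

# Invented sample values, for illustration only.
sample = pd.Series(["A", "A", "A", "B", "B", "C", "D"], dtype="category")
limited = limit_categories(sample, n=2)
print(limited.tolist())
```

With `n=2`, the two most frequent values survive and the remaining ones share one extra category, giving at most three categories in total.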

In [4]:
to_drop = ['extraction_type', 'extraction_type_group',
            'management_group',
            'payment_type',
            'quantity_group',
            'source_type','source_class', 
            'waterpoint_type_group',
            'district_code', 
            'construction_year',
            'num_private',
            'recorded_by',
            'id',
            'scheme_name', 
            'date_recorded']

fill_dict = {'funder':'Other',
                'installer': 'Other',
                'subvillage': 'Other', 
                'public_meeting': False,
                'scheme_management': 'Unknown',
                'permit': False}

dtype_dict = {'amount_tsh': 'float32',
            'funder': 'category',
            'gps_height': 'int16',
            'installer': 'category',
            'longitude': 'float16',  # float16 keeps only about 3 significant digits
            'latitude': 'float16',   # same precision limit applies here
            'wpt_name': 'category',
            'num_private': 'int16',
            'basin': 'category',
            'subvillage': 'category',
            'region': 'category',
            'region_code': 'int8',
            'district_code': 'int8',
            'lga': 'category',
            'ward': 'category',
            'population': 'int16',
            'recorded_by': 'category',
            'scheme_management': 'category',
            'construction_year': 'int16',
            'extraction_type': 'category',
            'extraction_type_group': 'category',
            'extraction_type_class': 'category',
            'management': 'category',
            'management_group': 'category',
            'payment': 'category',
            'payment_type': 'category',
            'water_quality': 'category',
            'quality_group': 'category',
            'quantity': 'category',
            'quantity_group': 'category',
            'source': 'category',
            'source_type': 'category',
            'source_class': 'category',
            'waterpoint_type': 'category',
            'waterpoint_type_group': 'category'}
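
To make the roles of the three inputs concrete, here is a minimal sketch of how such settings could be applied with plain pandas. It is not the implementation of `get_cleaned_sets`, which is not shown in this notebook. The in-memory CSV, its values, and the shortened `dtypes`, `drops`, and `fills` names are invented for illustration.

```python
import io
import pandas as pd

# Invented sample rows that reuse a few of the real column names.
CSV = io.StringIO(
    "id,funder,gps_height,permit,recorded_by\n"
    "1,FunderA,1390,True,SourceX\n"
    "2,,1399,,SourceX\n"
    "3,FunderB,686,False,SourceX\n"
)
dtypes = {"funder": "category", "gps_height": "int16", "recorded_by": "category"}
drops = ["id", "recorded_by"]
fills = {"funder": "Other", "permit": False}

# dtypes are applied while reading, then unwanted columns are dropped.
df = pd.read_csv(CSV, dtype=dtypes)
df = df.drop(columns=drops)

for col, value in fills.items():
    # A categorical column rejects fill values that are not already categories,
    # so register the fill value first.
    if isinstance(df[col].dtype, pd.CategoricalDtype) and value not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories([value])
    df[col] = df[col].fillna(value)

print(df)
```

The category check matters because several columns in *fill_dict* are also read as `category` in *dtype_dict*.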

Clean the training and test datasets. Each one can be passed as either a
DataFrame or a path to the data.

In [5]:
train_df, test_df, exp_output = get_cleaned_sets(
    trn_data, tst_data, to_drop, output, fill_dict,
    dtype_dict=dtype_dict,
    return_output=True,
)

Cleaning successful.
Associated time is 280722_1007PM
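
The stamp printed above looks like a day-month-year date followed by a 12-hour time, although that reading is only a guess from the digits. Under that assumption, a stamp of the same shape can be produced and parsed with the standard library as sketched here. The `STAMP_FORMAT` name is hypothetical.

```python
from datetime import datetime

# Assumed layout: day, month, two-digit year, underscore, 12-hour time, AM/PM.
STAMP_FORMAT = "%d%m%y_%I%M%p"

# Build a stamp from a known moment, then parse it back.
stamp = datetime(2022, 7, 28, 22, 7).strftime(STAMP_FORMAT)
parsed = datetime.strptime(stamp, STAMP_FORMAT)
print(stamp, parsed)
```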


# Modeling

Use *id* as the index of the training labels and remove it as a column, leaving only the labels in their original order.

In [6]:
trn_lbls = trn_lbls.set_index('id')


Fit a random forest model on the training data and predict labels for the test set.
Save the predictions as a *submission.csv* file to submit for the challenge, and append
the model parameters to the notes file in the experiment output directory.

In [7]:
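rf = RandomForestClassifier(n_estimators=150, random_state=42)
# Pass the labels as a 1-D array; a single-column DataFrame triggers a conversion warning
rf.fit(train_df, trn_lbls.values.ravel())
preds = rf.predict(test_df)
sub_form['status_group'] = preds
sub_form.to_csv(f'{exp_output}submission.csv', index=False)
# Append the full parameter dictionary; set_params() with no arguments only returns the estimator
with open(f"{exp_output}experiment_notes.txt", 'a') as f:
    f.write(str(rf.get_params()))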
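
When recording a model's configuration, it helps to know the difference between the two scikit-learn parameter methods. The short sketch below uses a separate, unfitted estimator named `model` purely for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=150, random_state=42)

# set_params() with no arguments changes nothing and returns the estimator itself,
# so str() of the result is just the estimator's repr.
same_model = model.set_params()

# get_params() returns a dict of every constructor argument, defaults included.
params = model.get_params()
print(params["n_estimators"], params["random_state"])
```

Writing `str(params)` to the notes file therefore captures default values as well as the ones set explicitly.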

In [8]:
sub_form

| | id | status_group |
|---|---|---|
| 0 | 50785 | non functional |
| 1 | 51630 | functional |
| 2 | 17168 | functional |
| 3 | 45559 | non functional |
| 4 | 49871 | functional |
| ... | ... | ... |
| 14845 | 39307 | non functional |
| 14846 | 18990 | functional |
| 14847 | 28749 | functional |
| 14848 | 33492 | functional |
