<a href="https://colab.research.google.com/github/economicactivist/DS-Unit-1-Build/blob/master/module1-decision-trees/LS_DS_221_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition. Notice that the Rules page also has instructions for the Submission process. The Data page has feature definitions.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
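
For the baseline item in the checklist above, a majority-class baseline is one common starting point. The sketch below uses a small made-up `y_train` Series rather than the real waterpumps labels; with the actual data you would presumably use `train['status_group']` instead.

```python
import pandas as pd

# Hypothetical toy labels standing in for train['status_group']
y_train = pd.Series(['functional', 'functional', 'non functional',
                     'functional', 'functional needs repair'])

# The majority class is the most frequent label
majority_class = y_train.mode()[0]

# Baseline accuracy: share of rows that a constant majority-class guess gets right
baseline_accuracy = (y_train == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Any model worth keeping should beat this number on the validation set.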


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.
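
The wrangling item in the list above could be sketched roughly as follows. The column names `date_recorded` and `construction_year` come from the dataset; the new column names `year_recorded` and `years_in_service`, and the toy rows, are assumptions made for illustration. Treating a construction year of 0 as missing is also an assumption about the data.

```python
import numpy as np
import pandas as pd

def wrangle(df):
    """Apply the same cleaning steps to any split (a sketch, not a definitive recipe)."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    # Parse the recording date and extract the year
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].dt.year
    # Assumption: a construction year of 0 means "unknown", not a real year
    df['construction_year'] = df['construction_year'].replace(0, np.nan)
    # Years from construction to inspection (NaN when the construction year is unknown)
    df['years_in_service'] = df['year_recorded'] - df['construction_year']
    return df

# Hypothetical toy rows
toy = pd.DataFrame({'date_recorded': ['2011-03-14', '2013-02-04'],
                    'construction_year': [1999, 0]})
wrangled = wrangle(toy)
print(wrangled[['year_recorded', 'construction_year', 'years_in_service']])
```

Calling the same function on train, validate, and test keeps the three sets consistent.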


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.
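
As a rough illustration of the binning idea above, the sketch below bins a numeric feature with `pd.qcut` and computes the share of functional pumps per bin. The data is made up, and `population_bin` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical toy data, not the real waterpumps records
toy = pd.DataFrame({
    'population': [1, 5, 20, 50, 100, 150, 300, 1000],
    'status_group': ['non functional', 'functional', 'functional', 'non functional',
                     'functional', 'functional', 'non functional', 'functional'],
})
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Quantile bins: 4 bins with roughly equal row counts
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# Share of functional pumps in each bin
rate_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column can then be used as the categorical axis of a Seaborn categorical plot.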

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```


In [2]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [4]:
# Check Pandas Profiling version
import pandas_profiling
pandas_profiling.__version__

'2.5.0'

In [0]:
# Old code for Pandas Profiling version 2.3
# It can be very slow with medium & large datasets.
# These parameters will make it faster.

# profile = train.profile_report(
#     check_correlation_pearson=False,
#     correlations={
#         'pearson': False,
#         'spearman': False,
#         'kendall': False,
#         'phi_k': False,
#         'cramers': False,
#         'recoded': False,
#     },
#     plot={'histogram': {'bayesian_blocks_bins': False}},
# )
#

# New code for Pandas Profiling version 2.4 and later
# from pandas_profiling import ProfileReport
# profile = ProfileReport(train, minimal=True)
# profile.to_notebook_iframe()  # displays the report inline; its return value is not the report

### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh`: Total static head (amount of water available to waterpoint)
- `date_recorded`: The date the row was entered
- `funder`: Who funded the well
- `gps_height`: Altitude of the well
- `installer`: Organization that installed the well
- `longitude`: GPS coordinate
- `latitude`: GPS coordinate
- `wpt_name`: Name of the waterpoint if there is one
- `num_private`:
- `basin`: Geographic water basin
- `subvillage`: Geographic location
- `region`: Geographic location
- `region_code`: Geographic location (coded)
- `district_code`: Geographic location (coded)
- `lga`: Geographic location
- `ward`: Geographic location
- `population`: Population around the well
- `public_meeting`: True/False
- `recorded_by`: Group entering this row of data
- `scheme_management`: Who operates the waterpoint
- `scheme_name`: Who operates the waterpoint
- `permit`: If the waterpoint is permitted
- `construction_year`: Year the waterpoint was constructed
- `extraction_type`: The kind of extraction the waterpoint uses
- `extraction_type_group`: The kind of extraction the waterpoint uses
- `extraction_type_class`: The kind of extraction the waterpoint uses
- `management`: How the waterpoint is managed
- `management_group`: How the waterpoint is managed
- `payment`: What the water costs
- `payment_type`: What the water costs
- `water_quality`: The quality of the water
- `quality_group`: The quality of the water
- `quantity`: The quantity of water
- `quantity_group`: The quantity of water
- `source`: The source of the water
- `source_type`: The source of the water
- `source_class`: The source of the water
- `waterpoint_type`: The kind of waterpoint
- `waterpoint_type_group`: The kind of waterpoint

In [0]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [7]:
train.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'status_group'],
      dtype='object')

In [0]:
cols_to_keep = [
    'id', 'funder', 'installer',
    'basin', 'subvillage', 'region', 'population',
    'permit',
    'extraction_type', 'extraction_type_group', 'extraction_type_class',
    'management', 'management_group', 'payment',
    'quality_group', 'quantity_group',
    'source', 'source_class',
    'waterpoint_type_group', 'status_group',
]  # not kept yet: gps_height (median impute? or drop?), 'longitude', 'latitude'

I kept 18 feature columns plus the target `status_group`, dropping 22 of the original 41 training columns (including `id`).

In [0]:
test_train_cols_to_keep = cols_to_keep[1:]
test_id_col=test.id

In [0]:
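train = train[test_train_cols_to_keep].copy()  # copy so later column assignments do not act on a slice
test_train_cols_to_keep.pop()  # remove 'status_group', which the test set does not have
test = test[test_train_cols_to_keep].copy()
# train and test now share the same feature columns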

In [11]:
test

Unnamed: 0,funder,installer,basin,subvillage,region,population,permit,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,quality_group,quantity_group,source,source_class,waterpoint_type_group
0,Dmdd,DMDD,Internal,Magoma,Manyara,321,True,other,other,other,parastatal,parastatal,never pay,good,seasonal,rainwater harvesting,surface,other
1,Government Of Tanzania,DWE,Pangani,Kimnyak,Arusha,300,True,gravity,gravity,gravity,vwc,user-group,never pay,good,insufficient,spring,groundwater,communal standpipe
2,,,Internal,Msatu,Singida,500,,other,other,other,vwc,user-group,never pay,good,insufficient,rainwater harvesting,surface,other
3,Finn Water,FINN WATER,Ruvuma / Southern Coast,Kipindimbi,Lindi,250,True,other,other,other,vwc,user-group,unknown,good,dry,shallow well,groundwater,other
4,Bruder,BRUDER,Ruvuma / Southern Coast,Losonga,Ruvuma,60,True,gravity,gravity,gravity,water board,user-group,pay monthly,good,enough,spring,groundwater,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14353,Danida,Da,Wami / Ruvu,Yombo,Pwani,20,True,mono,mono,motorpump,vwc,user-group,never pay,good,enough,river,surface,communal standpipe
14354,Hiap,HIAP,Pangani,Mkondoa,Tanga,2960,False,nira/tanira,nira/tanira,handpump,vwc,user-group,pay annually,salty,insufficient,shallow well,groundwater,hand pump
14355,,,Internal,Juhudi,Singida,200,,gravity,gravity,gravity,vwc,user-group,never pay,good,insufficient,dam,surface,communal standpipe
14356,Germany,DWE,Lake Nyasa,Namakinga B,Ruvuma,150,True,gravity,gravity,gravity,vwc,user-group,never pay,good,insufficient,river,surface,communal standpipe


In [12]:
train.shape, test.shape

((59400, 19), (14358, 18))

In [13]:
train.shape, test.shape  # row counts are unchanged; only columns were dropped



((59400, 19), (14358, 18))

In [14]:
train.funder.value_counts()

Government Of Tanzania    9084
Danida                    3114
Hesawa                    2202
Rwssp                     1374
World Bank                1349
                          ... 
Maajabu Pima                 1
Unhcr/government             1
Maashumu Mohamed             1
Ngumi                        1
Muhindi                      1
Name: funder, Length: 1897, dtype: int64

In [15]:
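train.funder = train.funder.fillna("Government Of Tanzania")
train.installer = train.installer.fillna("DWE")
train.subvillage = train.subvillage.fillna("Majengo")

# Apply the same fills to the test set so both sets are treated alike
test.funder = test.funder.fillna("Government Of Tanzania")
test.installer = test.installer.fillna("DWE")
test.subvillage = test.subvillage.fillna("Majengo")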

train.funder.value_counts()


Government Of Tanzania    12719
Danida                     3114
Hesawa                     2202
Rwssp                      1374
World Bank                 1349
                          ...  
Maajabu Pima                  1
Unhcr/government              1
Maashumu Mohamed              1
Ngumi                         1
Muhindi                       1
Name: funder, Length: 1897, dtype: int64

In [0]:
# def reduce_categories(df, list_of_series, list_of_thresholds):
#   a=[]
#   for i in range(len(list_of_series)):
#     series = df[list_of_series[i]]
#     series_frequencies = series.value_counts(normalize=True)
#     threshold = list_of_thresholds[i]
#     smaller_categories = series_frequencies[series_frequencies<threshold].index
#     reduced_series = df[series].replace(smaller_categories, "Other")
#     a.append(reduced_series)
#   return a

# reduce_categories(train, ['funder', 'installer', 'subvillage'], [.01,.01,.001])


funder_frequencies = train.funder.value_counts(normalize=True) # threshold: < .01
installer_frequencies = train.installer.value_counts(normalize=True) # threshold: < .01
subvillage_frequencies = train.subvillage.value_counts(normalize=True) # threshold: < .001

# Each result is an Index of the category labels that fall below the threshold
funder_small_categories = funder_frequencies[funder_frequencies < 0.01].index
installer_small_categories = installer_frequencies[installer_frequencies < 0.01].index
subvillage_small_categories = subvillage_frequencies[subvillage_frequencies < 0.001].index

train.funder = train.funder.replace(funder_small_categories, "Other")
train.installer = train.installer.replace(installer_small_categories, "Other")
train.subvillage = train.subvillage.replace(subvillage_small_categories, "Other")

# Apply the training-set mapping to the test set as well
test.funder = test.funder.replace(funder_small_categories, "Other")
test.installer = test.installer.replace(installer_small_categories, "Other")
test.subvillage = test.subvillage.replace(subvillage_small_categories, "Other")
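
The commented-out `reduce_categories` draft in the cell above indexes the DataFrame with a Series (`df[series]`), which would not work as intended. One possible working variant is sketched below. It changes the signature to take a dict of thresholds, so it is an illustration rather than the original function, and the toy data is made up.

```python
import pandas as pd

def reduce_categories(df, thresholds, other_label="Other"):
    """Merge categories rarer than a per-column threshold into one label.

    thresholds maps column name -> minimum relative frequency to keep.
    Returns the reduced copy and the kept categories per column.
    """
    df = df.copy()
    kept = {}
    for column, threshold in thresholds.items():
        frequencies = df[column].value_counts(normalize=True)
        kept[column] = frequencies[frequencies >= threshold].index
        # Keep frequent categories and missing values; relabel everything else
        keep_mask = df[column].isin(kept[column]) | df[column].isna()
        df[column] = df[column].where(keep_mask, other_label)
    return df, kept

# Hypothetical toy data: 'C' makes up 10% of rows
toy = pd.DataFrame({'funder': ['A'] * 6 + ['B'] * 3 + ['C']})
reduced, kept = reduce_categories(toy, {'funder': 0.2})
print(reduced['funder'].value_counts())
```

The returned `kept` dict can be reused to relabel validation and test data with the categories learned from the training set.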



In [17]:
train.population.median()

25.0

In [18]:
((train.population==0).sum())/train.shape[0]

0.35994949494949496

In [0]:
median_population = train.population.median()

train.population =  train.population.replace(0, median_population)
test.population =  test.population.replace(0, median_population)
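
Note that the median used above is computed with the zeros included, and zeros make up about 36% of the training rows. An alternative, sketched here on made-up numbers, is to treat zeros as missing first and take the median of the remaining values. This is an assumption about what zero means in this column, not something the data dictionary states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not the real population column
train_population = pd.Series([0, 0, 0, 10, 20, 30, 400])
test_population = pd.Series([0, 15, 250])

# Median with zeros included (what a plain .median() gives)
median_with_zeros = train_population.median()

# Treat zeros as missing; Series.median skips NaN by default
population_missing = train_population.replace(0, np.nan)
median_nonzero = population_missing.median()

# Fill both sets with the training-set statistic
train_filled = population_missing.fillna(median_nonzero)
test_filled = test_population.replace(0, np.nan).fillna(median_nonzero)
print(median_with_zeros, median_nonzero)
```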


In [0]:
train.permit = train.permit.fillna(False)
test.permit = test.permit.fillna(False)

In [21]:
train.source_class.value_counts(normalize=True)

groundwater    0.770943
surface        0.224377
unknown        0.004680
Name: source_class, dtype: float64

In [0]:
train.source_class = train.source_class.replace("unknown", "groundwater")
test.source_class = test.source_class.replace("unknown", "groundwater")

In [23]:
train.source_class.value_counts()

groundwater    46072
surface        13328
Name: source_class, dtype: int64

To do: come back to `construction_year` later.

In [0]:
# train.construction_year.value_counts(normalize=True)  

In [0]:
# numeric_train = train.select_dtypes(include="number")
# numeric_test = test.select_dtypes(include="number")

In [0]:
# non_numeric_train = train.select_dtypes(exclude="number")
# non_numeric_test = test.select_dtypes(exclude="number")

In [0]:
#(train.shape[1] == (non_numeric_train.shape[1]+numeric_train.shape[1]))

In [0]:
#(test.shape[1] == (non_numeric_test.shape[1]+numeric_test.shape[1]))

In [0]:
y = train.status_group
X = train.drop(y.name, axis=1)

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_validate, y_train, y_validate = train_test_split(X,y,test_size=0.2, random_state=99)  

To do: try `stratify=y` in `train_test_split` later, so that each split keeps the same class proportions.
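
A minimal sketch of a stratified split, using made-up labels with a 60/30/10 class mix. `stratify` is a standard `train_test_split` parameter; the toy `X` and `y` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with imbalanced classes
X = pd.DataFrame({'feature': range(100)})
y = pd.Series(['functional'] * 60
              + ['non functional'] * 30
              + ['functional needs repair'] * 10)

# stratify=y keeps the class proportions the same in both splits
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.2, random_state=99, stratify=y)

print(y_train.value_counts(normalize=True))
print(y_validate.value_counts(normalize=True))
```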

In [0]:
#!pip install git+https://github.com/MaxHalford/Prince


In [0]:
#from prince import MCA
#from sklearn.linear_model import LogisticRegressionCV

In [33]:
train.shape

(59400, 19)

In [0]:
#!pip install catboost


In [0]:
from category_encoders import OneHotEncoder, CatBoostEncoder, OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
#from catboost import CatBoostClassifier


In [0]:
# X_latitude_positive = X_train.latitude.apply(abs)
# X_longitude_positive = X_train.longitude.apply(abs)

# X_train_positive = X_train.copy()

# X_train_positive.latitude = X_latitude_positive
# X_train_positive.longitude = X_longitude_positive

In [0]:
# mca = MCA()
# mca.fit(X_train_positive.select_dtypes(exclude="number"))

In [0]:
# from catboost import Pool

# train_data = X_train
# eval_data = y_train
# cat_features = X_train.select_dtypes(include="object").columns.to_list()



# train_dataset = Pool(data=X_train,
#                      label=y_train,
#                      cat_features=cat_features)

# eval_dataset = Pool(data=X_validate,
#                     label=y_validate,
#                     cat_features=cat_features)

# # Initialize CatBoostClassifier
# model = CatBoostClassifier(iterations=10,
#                            learning_rate=.5,
#                            depth=16,
#                            loss_function='MultiClass')
# # Fit model
# model.fit(train_dataset)
# # Get predicted classes
# preds_class = model.predict(eval_dataset)
# # Get predicted probabilities for each class
# preds_proba = model.predict_proba(eval_dataset)
# # Get predicted RawFormulaVal
# preds_raw = model.predict(eval_dataset, 
#                           prediction_type='RawFormulaVal')

In [0]:
#model.score(X_validate,y_validate)

In [0]:
#pd.DataFrame(preds_class)[0].unique()

In [0]:
#pd.DataFrame(preds_proba, columns=['functional', 'non functional', 'functional needs repair'])

In [0]:
pipeline = make_pipeline(OrdinalEncoder(),
                         RandomForestClassifier(max_depth=50))

In [42]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'basin',
                                      'subvillage', 'region', 'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'quality_group', 'quantity_group',
                                      'source', 'source_class',
                                      'waterpoint_type_group'],
                                drop_invariant=False, handle_missing='value...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=50, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
 

In [43]:
print('Validation Accuracy', pipeline.score(X_validate, y_validate))


Validation Accuracy 0.7734848484848484
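
The assignment also asks for feature importances. One way to read them out of a fitted pipeline is sketched below. To keep the example self-contained it uses scikit-learn's `OrdinalEncoder` instead of the `category_encoders` one used in this notebook, and the toy data is made up, so treat it as an illustration of the pattern only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder  # stand-in for category_encoders.OrdinalEncoder

# Hypothetical toy data: quantity_group determines the label, basin does not
toy_X = pd.DataFrame({
    'quantity_group': ['dry', 'enough', 'dry', 'enough', 'dry', 'enough'] * 5,
    'basin': ['Pangani', 'Pangani', 'Internal', 'Internal', 'Rufiji', 'Rufiji'] * 5,
})
toy_y = pd.Series(['non functional', 'functional'] * 15)

toy_pipeline = make_pipeline(OrdinalEncoder(),
                             RandomForestClassifier(n_estimators=20, random_state=0))
toy_pipeline.fit(toy_X, toy_y)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_,
                        index=toy_X.columns).sort_values()
print(importances)
# importances.plot.barh() would draw a horizontal bar chart (requires matplotlib)
```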


In [44]:
pipeline.predict(test)

array(['functional', 'functional', 'functional', ..., 'functional',
       'functional', 'non functional'], dtype=object)
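
To turn predictions like these into a Kaggle submission, the ids and predicted labels need to go into a two-column table. A small sketch with made-up ids and predictions follows; in this notebook the ids would come from `test_id_col` and the labels from `pipeline.predict(test)`.

```python
import pandas as pd

# Hypothetical ids and predictions for illustration
test_ids = pd.Series([50785, 51630, 17168], name='id')
predictions = ['functional', 'functional', 'non functional']

# Two columns, matching the header used elsewhere in this notebook
submission = pd.DataFrame({'id': test_ids, 'status_group': predictions})

# to_csv without a path returns the CSV text; pass a filename to write a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```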

In [0]:
Combined_X = pd.concat([X_train, X_validate])  # DataFrame.append was removed in pandas 2.0

In [0]:
Combined_y = pd.concat([y_train, y_validate])  # Series.append was removed in pandas 2.0

In [0]:
# from sklearn.model_selection import RandomizedSearchCV
# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 100, stop = 800, num = 10)]
# # Number of features to consider at every split
# max_features = ['auto', 'sqrt']
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [3, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}

In [0]:
# rf = RandomForestClassifier()
# # Random search of parameters, using 4-fold cross-validation (cv=4),
# # sampling 6 parameter combinations (n_iter=6), and using all available cores
# rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 6, cv = 4, verbose=2, random_state=42, n_jobs = -1)
# # Fit the random search model
# pipeline2 = make_pipeline(OrdinalEncoder(),
#                          rf_random)

# pipeline2.fit(Combined_X, Combined_y)
# y_pred = pipeline2.predict(test)
# pd.DataFrame(data={"id":test_id_col,"status_group":y_pred}).to_csv("water_pred.csv", index=False)
# pd.read_csv('water_pred.csv')

In [0]:
#from google.colab import output


In [0]:
#output.eval_js('new Audio("https://upload.wikimedia.org/wikipedia/commons/0/05/Beep-09.ogg").play()')

In [0]:
#output.eval_js('new Audio("https://upload.wikimedia.org/wikipedia/commons/0/05/Beep-09.ogg").play()')

In [0]:
# import lightgbm
import xgboost as xgb

In [0]:
# ohe = OneHotEncoder(use_cat_names=True)


In [0]:
# encoded_X = ohe.fit_transform(X)
# encoded_y = y.replace({'functional': 3, 'non functional': 2, 'functional needs repair': 1})


In [0]:
# XG_train, XG_validate, yG_train, yG_validate = train_test_split(encoded_X,encoded_y, test_size=.2, random_state=99)

In [0]:
# encoded_y.unique()

In [0]:
# model1 = xgb.XGBClassifier()
# model2 = xgb.XGBClassifier(n_estimators=200, max_depth=12, learning_rate=0.3, subsample=0.5)

# train_model1 = model1.fit(XG_train, yG_train)
# train_model2 = model2.fit(XG_train, yG_train)

In [0]:
# {3: 'functional', 2: 'non functional', 1: 'functional needs repair'}


# pred1 = train_model1.predict(XG_validate)
# pred2 = train_model2.predict(XG_validate)

In [0]:
from sklearn.metrics import accuracy_score

In [0]:
# accuracy_score(yG_validate, pred1)

In [0]:
# accuracy_score(yG_validate, pred2)

In [0]:
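ohe2 = OneHotEncoder(use_cat_names=True, handle_unknown="value")  # "ignore" is scikit-learn's option; category_encoders uses 'error', 'return_nan', 'value', or 'indicator'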

In [0]:
XG_encoder = ohe2.fit(Combined_X)
encoded_train_X = XG_encoder.transform(Combined_X)
encoded_Combo_y = Combined_y.replace({'functional': 3, 'non functional': 2, 'functional needs repair': 1})
encoded_test = XG_encoder.transform(test)
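
The target above is encoded by hand as 3, 2, 1. As a hedged alternative, scikit-learn's `LabelEncoder` produces labels 0 to n-1 (which, as far as I know, recent XGBoost releases expect) and can map predictions back to the original strings. The toy labels below are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels
toy_y = pd.Series(['functional', 'non functional',
                   'functional needs repair', 'functional'])

label_encoder = LabelEncoder()
# Classes are sorted alphabetically and numbered from 0
encoded = label_encoder.fit_transform(toy_y)

# inverse_transform maps integer predictions back to the original strings
decoded = label_encoder.inverse_transform(encoded)
print(list(label_encoder.classes_), list(encoded))
```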

In [0]:

# model_with_params = xgb.XGBClassifier(n_estimators=200, max_depth=12, learning_rate=0.3, subsample=0.6)

# trained_with_params = model_with_params.fit(encoded_train_X, encoded_Combo_y)

# XGboost_pred = trained_with_params.predict(encoded_test)

# XGboost_pred = pd.Series(XGboost_pred).replace({3: 'functional', 2:'non functional', 1:'functional needs repair'})
# pd.DataFrame(data={"id":test_id_col,"status_group":XGboost_pred}).to_csv("water_pred_xgb.csv", index=False)
# pd.read_csv('water_pred_xgb.csv')

In [0]:
from sklearn.model_selection import GridSearchCV

clf = xgb.XGBClassifier()
parameters = {
     "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,  # scikit-learn wrapper name for XGBoost's eta
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

grid = GridSearchCV(clf,
                    parameters, n_jobs=2,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(encoded_train_X, encoded_Combo_y)
XGboost_predGV = grid.predict(encoded_test)  # predict with the best estimator refit by the grid search
XGboost_predGV = pd.Series(XGboost_predGV).replace({3: 'functional', 2:'non functional', 1:'functional needs repair'})

pd.DataFrame(data={"id":test_id_col,"status_group":XGboost_predGV}).to_csv("water_pred_xgbGV.csv", index=False)

pd.read_csv('water_pred_xgbGV.csv')
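
Before running an exhaustive grid search like the one above, it can help to count how many model fits it implies. The sketch below copies the value lists from the grid and uses scikit-learn's `ParameterGrid` to count combinations; the step-size key is written as `learning_rate`, which XGBoost's native API calls `eta`.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the grid above
parameters = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

n_combinations = len(ParameterGrid(parameters))
n_fits = n_combinations * 3  # cv=3 fits each combination three times
print(n_combinations, n_fits)
```

With this many fits, a randomized search with a fixed `n_iter`, like the commented-out `RandomizedSearchCV` cell earlier in this notebook, is likely to be far cheaper.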