Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition. Notice that the Rules page also has instructions for the Submission process. The Data page has feature definitions.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
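
A minimal sketch of the pipeline, validation score, and feature importance steps from the checklist above, run on a tiny made-up dataset rather than the real waterpumps data. It assumes scikit-learn's own `OrdinalEncoder` (the `category_encoders` package installed later in this notebook could be swapped in), mean imputation, and default tree settings. The `toy` frame and the rule that generates its labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up data standing in for the waterpumps features (hypothetical values)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], size=n),
    'gps_height': rng.normal(1000, 300, size=n),
})
# Knock out some numeric values so the imputer has work to do
toy.loc[rng.choice(n, size=20, replace=False), 'gps_height'] = np.nan
# Made-up labeling rule so the tree has a pattern to learn
toy['status_group'] = np.where(toy['quantity'] == 'dry',
                               'non functional', 'functional')

target = 'status_group'
features = ['quantity', 'gps_height']
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy[target])

# Encode the categorical column, pass the numeric column through unchanged
encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     ['quantity']),
    remainder='passthrough',
)
pipeline = make_pipeline(
    encoder,
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=42),
)
pipeline.fit(toy_train[features], toy_train[target])

val_accuracy = pipeline.score(toy_val[features], toy_val[target])
print('Validation accuracy:', val_accuracy)

# The transformer outputs the encoded column first and the passthrough
# column second, which matches the order of `features` here
tree = pipeline.named_steps['decisiontreeclassifier']
importances = pd.Series(tree.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

With matplotlib installed, `importances.sort_values().plot.barh()` would draw the importances as a horizontal bar chart.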


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.
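
As one possible starting point for the feature engineering goal above, here is a small sketch that extracts the year from `date_recorded` and computes the years between construction and inspection. The function name `engineer_dates` and the new column names `year_recorded` and `years_in_service` are made up for illustration, and the sketch assumes that a `construction_year` of 0 means "unknown".

```python
import numpy as np
import pandas as pd

def engineer_dates(df):
    """Return a copy with year_recorded and years_in_service added."""
    df = df.copy()
    # Parse the date strings, then pull out the year of inspection
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].dt.year
    # Assumption: a construction_year of 0 is a placeholder for "unknown"
    df['construction_year'] = df['construction_year'].replace(0, np.nan)
    # Years from waterpump construction to waterpump inspection
    df['years_in_service'] = df['year_recorded'] - df['construction_year']
    return df

# Two example rows: one with a known construction year, one with a zero
example = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-02-04'],
    'construction_year': [1999, 0],
})
engineered = engineer_dates(example)
print(engineered[['year_recorded', 'construction_year', 'years_in_service']])
```

Because the function returns a copy, the same call can be applied to the train, validate, and test sets without side effects.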


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
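
To make the binning idea concrete, here is a small sketch that bins a numeric feature with `pd.qcut` and computes the share of functional pumps in each bin, which is the kind of estimate a categorical plot would display. The values in `toy` are made up for illustration and the column name `gps_height_bin` is hypothetical.

```python
import pandas as pd

# Made-up values standing in for a numeric feature and the target
toy = pd.DataFrame({
    'gps_height': [0, 10, 200, 450, 800, 1200, 1500, 1800],
    'status_group': ['non functional', 'functional',
                     'non functional', 'functional',
                     'functional', 'functional',
                     'non functional', 'functional'],
})

# Represent the target as a number, 0 or 1
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Cut the numeric feature into 4 bins with equal numbers of rows
toy['gps_height_bin'] = pd.qcut(toy['gps_height'], q=4)

# Share of functional pumps within each bin
rate_by_bin = toy.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```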

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
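
The same idea can be wrapped in a small helper so that the top values are always chosen from the training data and then reused for the other sets. This is only a sketch: the function name `reduce_cardinality` and its parameters are hypothetical, and the example values are made up. Like the code above, it also turns missing values into 'OTHER'.

```python
import pandas as pd

def reduce_cardinality(train_col, other_cols, top_n=10, other_label='OTHER'):
    """Keep the top_n most frequent training values; relabel the rest."""
    # Choose the top values from the training data only
    top = train_col.value_counts().index[:top_n]
    # Series.where keeps values where the condition holds, else substitutes
    reduced_train = train_col.where(train_col.isin(top), other_label)
    reduced_others = [col.where(col.isin(top), other_label)
                      for col in other_cols]
    return reduced_train, reduced_others

# Made-up example with a top_n of 2
train_funder = pd.Series(['Roman', 'Roman', 'Dwsp', 'Rwssp', 'Rwssp', 'Rwssp'])
val_funder = pd.Series(['Roman', 'Hesawa'])

train_reduced, (val_reduced,) = reduce_cardinality(
    train_funder, [val_funder], top_n=2)
print(train_reduced.tolist())
print(val_reduced.tolist())
```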


In [0]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [43]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
# Check Pandas Profiling version
import pandas_profiling
pandas_profiling.__version__

'2.4.0'

In [4]:
# Look at the first row of the train DataFrame
train.head(1)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional


In [0]:
# Split the train DataFrame into a train set and a validation set
train, val = train_test_split(train, train_size=0.80, test_size=0.20,
                              random_state=42,
                              stratify=train['status_group'])

In [7]:
# Look at the sizes of the train, validation, and test sets
print(train.shape, val.shape, test.shape)

(47520, 41) (11880, 41) (14358, 40)
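
Passing `stratify` asks `train_test_split` to keep the class proportions about the same in both pieces. A quick way to check this is sketched below on made-up labels (the 50/40/10 class counts are hypothetical and chosen only so that the arithmetic is easy to follow).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 100 rows with a 50/40/10 class balance
labels = pd.DataFrame({
    'status_group': (['functional'] * 50
                     + ['non functional'] * 40
                     + ['functional needs repair'] * 10),
})

toy_train, toy_val = train_test_split(
    labels, train_size=0.80, test_size=0.20,
    random_state=42, stratify=labels['status_group'])

# Compare the class shares in the two pieces
train_share = toy_train['status_group'].value_counts(normalize=True)
val_share = toy_val['status_group'].value_counts(normalize=True)
gap = (train_share - val_share).abs().max()

print(train_share)
print(val_share)
print('Largest difference in class share:', gap)
```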


In [11]:
# Old code for Pandas Profiling version 2.3
# It can be very slow with medium & large datasets.
# These parameters will make it faster.

# profile = train.profile_report(
#     check_correlation_pearson=False,
#     correlations={
#         'pearson': False,
#         'spearman': False,
#         'kendall': False,
#         'phi_k': False,
#         'cramers': False,
#         'recoded': False,
#     },
#     plot={'histogram': {'bayesian_blocks_bins': False}},
# )
#

# New code for Pandas Profiling version 2.4
from pandas_profiling import ProfileReport
profile = ProfileReport(train, minimal=True)

profile.to_notebook_iframe()

In [0]:
# Save the profile report to an HTML file
profile.to_file(output_file='tanzania_profile_report_minimal.html')

In [9]:
# Find the baseline for the training data
train['status_group'].value_counts(normalize=True)

functional                 0.543077
non functional             0.384238
functional needs repair    0.072685
Name: status_group, dtype: float64

In [0]:
# Majority-class baseline: always predicting 'functional' is right about 54% of the time
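
The majority-class baseline can also be scored against a validation set. The sketch below uses short made-up label lists rather than the real `train` and `val` frames; with the real data, the two Series would be `train['status_group']` and `val['status_group']`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up labels for illustration
y_train = pd.Series(['functional'] * 6
                    + ['non functional'] * 3
                    + ['functional needs repair'])
y_val = pd.Series(['functional', 'non functional',
                   'functional', 'functional needs repair'])

# The baseline guess is the most common class in the training labels
majority_class = y_train.mode()[0]

# Predict that class for every validation row and score the guess
y_pred = [majority_class] * len(y_val)
baseline_accuracy = accuracy_score(y_val, y_pred)

print(majority_class, baseline_accuracy)
```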

In [27]:
train.head(1).T

| | 43360 |
|---|---|
| id | 72938 |
| amount_tsh | 0 |
| date_recorded | 2011-07-27 |
| funder | |
| gps_height | 0 |
| installer | |
| longitude | 33.5429 |
| latitude | -9.17478 |
| wpt_name | Kwa Mzee Noa |
| num_private | 0 |


In [0]:
# Name the target column (it is not removed from train here)
target = 'status_group'
# Count the unique values in each non-numeric (categorical) feature
cardinalFeatureAmounts = train.select_dtypes(exclude='number')
cardinalFeatureAmounts = cardinalFeatureAmounts.nunique()

# Create a list of the categorical features with a cardinality of 50 or less
cardinal = cardinalFeatureAmounts[cardinalFeatureAmounts <= 50].index.tolist()

In [0]:
# doing some imports for plotly express
import plotly.express as px

In [32]:
# Look at the longitude and the latitude compared to the status_group
fig = px.scatter_mapbox(train, lat='latitude', lon="longitude", color='status_group')
fig.update_layout(mapbox_style='stamen-terrain')
fig.show()

In [36]:
train[['latitude']].describe()

| | latitude |
|---|---|
| count | 47520.0 |
| mean | -5.705946 |
| std | 2.941332 |
| min | -11.64944 |
| 25% | -8.528215 |
| 50% | -5.021436 |
| 75% | -3.327185 |
| max | -2e-08 |


In [38]:
t = train[train['latitude'] == -2.000000e-08]
t


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
32604,3286,0.0,2013-02-04,Lwi,0,LWI,0.0,-2.000000e-08,Upendo,0,Lake Victoria,Nyanding'O,Shinyanga,17,1,Bariadi,Mhango,0,False,GeoData Consultants Ltd,WUG,,False,0,india mark ii,india mark ii,handpump,wug,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
37113,10809,0.0,2013-02-05,Dwsp,0,DWE,0.0,-2.000000e-08,Mwamalale,0,Lake Victoria,Mwamalale,Shinyanga,17,1,Bariadi,Mhango,0,False,GeoData Consultants Ltd,WUG,,False,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,salty abandoned,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional needs repair
24948,39399,0.0,2013-02-14,Rwssp,0,DWE,0.0,-2.000000e-08,Ilugala,0,Lake Victoria,Ilugala,Shinyanga,17,1,Bariadi,Somanda,0,,GeoData Consultants Ltd,WUG,,False,0,swn 80,swn 80,handpump,wug,user-group,unknown,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
33079,9314,0.0,2012-10-24,Dasp,0,DARDO,0.0,-2.000000e-08,Amani,0,Lake Victoria,Nyankunzi,Shinyanga,17,1,Bariadi,Ikungulyabashashi,0,,GeoData Consultants Ltd,WUG,,False,0,nira/tanira,nira/tanira,handpump,wug,user-group,unknown,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
43390,10432,0.0,2011-07-19,Rwssp,0,DWE,0.0,-2.000000e-08,Mkombozi,0,Lake Victoria,Samu,Shinyanga,17,1,Bariadi,Mwaswale,0,True,GeoData Consultants Ltd,WUG,,False,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12118,69295,0.0,2013-02-20,Dwssp,0,DWE,0.0,-2.000000e-08,Umoja,0,Lake Victoria,Tobo,Shinyanga,17,1,Bariadi,Bumera,0,True,GeoData Consultants Ltd,WUG,,False,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,salty abandoned,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional needs repair
13360,23106,0.0,2013-01-20,Rwssp,0,DWE,0.0,-2.000000e-08,Kifaru,0,Lake Victoria,Mwakiduta B,Shinyanga,17,1,Bariadi,Dutwa,0,True,GeoData Consultants Ltd,WUG,,False,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional needs repair
29994,36396,0.0,2012-10-26,Hesawa,0,DWE,0.0,-2.000000e-08,Kwa Luhaula,0,Lake Victoria,Solima A,Mwanza,19,2,Magu,Kabita,0,True,GeoData Consultants Ltd,VWC,,True,0,nira/tanira,nira/tanira,handpump,vwc,user-group,never pay,never pay,salty,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
33658,12635,0.0,2013-02-04,Dwssp,0,DWE,0.0,-2.000000e-08,Gegedi Secondary,0,Lake Victoria,Gegedi Secondary,Shinyanga,17,1,Bariadi,Nkololo,0,True,GeoData Consultants Ltd,Parastatal,,False,0,gravity,gravity,gravity,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional needs repair


In [102]:
# Check for and remove coordinate values that fall outside Tanzania
df = train.copy()

df['latitude'] = df['latitude'].replace(-2e-08, 0)

# Replace the zeros in the latitude and the longitude with np.nan
cols_w_zeros = ['latitude', 'longitude']

for col in cols_w_zeros:
  df[col] = df[col].replace(0, np.nan)

# looking at the rows that have np.nan






# Changing the values of the longitude and latitude to mean of region from 
 # for name, group in groups:
 #   group = group.copy()
 #   newMask = pd.isnull(group['latitude'])
 #   group[newMask] = group['latitude'].mean()
 #   newMask = pd.isnull(group['longitude'])
 #   group[newMask] = group['longitude'].mean()



def myLatitude(theRow):
  # Fill a missing latitude with the mean latitude of the same region_code
  if pd.isnull(theRow['latitude']):
    sameRegion = df[df['region_code'] == theRow['region_code']]
    return sameRegion['latitude'].mean()
  # Otherwise keep the recorded latitude
  return theRow['latitude']

def myLon(theRow):
  # Fill a missing longitude with the mean longitude of the same region
  if pd.isnull(theRow['longitude']):
    sameRegion = df[df['region'] == theRow['region']]
    return sameRegion['longitude'].mean()
  # Otherwise keep the recorded longitude
  return theRow['longitude']

# trying apply
df['latitude'] = df.apply(myLatitude, axis=1)
df['latitude'].value_counts(dropna=False)
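
Filling missing coordinates with a regional mean can also be sketched without a row-wise `apply`, using `groupby` with `transform`. The toy frame below is made up for illustration. With the real data, it may be preferable to compute the regional means on the training set only and reuse them for the validation and test sets.

```python
import numpy as np
import pandas as pd

# Made-up rows: the third one has placeholder coordinates
toy = pd.DataFrame({
    'region': ['Shinyanga', 'Shinyanga', 'Shinyanga', 'Iringa', 'Iringa'],
    'latitude': [-3.0, -4.0, -2e-08, -9.0, -10.0],
    'longitude': [33.0, 34.0, 0.0, 34.0, 35.0],
})

# Treat the near-zero latitude as zero, as in the cell above
toy['latitude'] = toy['latitude'].replace(-2e-08, 0)

for col in ['latitude', 'longitude']:
    # Zeros stand for missing coordinates
    toy[col] = toy[col].replace(0, np.nan)
    # Mean of the non-missing values in each region, aligned to every row
    region_mean = toy.groupby('region')[col].transform('mean')
    # Fill only the missing entries
    toy[col] = toy[col].fillna(region_mean)

print(toy)
```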

In [0]:
# Thinking of the amount of cardinality that I will need to remove


In [0]:
# This is my wrangle function

def wrangleFxn(df, numCardinal=50):
  ''' Wrangle the train, validate, or test DataFrame in the same way.

      df:           the DataFrame to wrangle. The target column is dropped
                    if it is present, so the test set can be passed too.

      numCardinal:  default is 50. This is the threshold. A categorical
                    feature is retained only if its number of unique
                    values is less than or equal to this number.

      returns:  Will return the DataFrame prepared for the pipeline.

  '''
  
  target = 'status_group'

  # making the copy
  df = df.copy()
  # Dropping the target (if present), the id, and quantity_group
  df_features = df.drop(columns=[target, 'id', 'quantity_group'],
                        errors='ignore')

  # Getting the names of the numerical features
  numerical_features = df_features.select_dtypes(include='number').columns.tolist()

  # Counting the unique values in each categorical feature
  cardinalFeatureAmounts = train.select_dtypes(exclude='number')
  cardinalFeatureAmounts = cardinalFeatureAmounts.nunique()

  # Creating a list of the categorical features with cardinality <= numCardinal
  cardinal = cardinalFeatureAmounts[cardinalFeatureAmounts <= numCardinal].index.tolist()

  # List of all the features
  features = numerical_features + cardinal

  df['latitude'] = df['latitude'].replace(-2e-08, 0)

  # Getting rid of the zeros in the latitude and the longitude putting in 
  # np.nan
  cols_w_zeros = ['latitude', 'longitude']

  for col in cols_w_zeros:
    df[col] = df[col].replace(0, np.nan)