# *Decision Tree* classification of Countries of the World data

Determine whether Decision Tree Learning can predict the birth rate class of a given country (low, medium or high) based on a set of input features.

- Data is provided for 224 countries
- Up to 18 input features available per country 

**Tasks**:
- Select a subset of suitable input features
- Prepare the data for Scikit Learn  
- Create a training and testing sample 
- Run the decision tree classification process and evaluate predictive power  
- Investigate prediction accuracy changes with sample variation, hyper-parameter tuning and ensemble methods  

### Notes on the assessment 
- Please follow the **8** tasks provided 
- All tasks have an associated mark 
- Code must be understandable and reproducible. Before grading, the notebook kernel will be **restarted** and **re-run**
- If you are unsure how to proceed, please refer to the *IRIS dataset* notebooks for relevant examples
- Please ask one of the TAs if you are really stuck!
- If you are unable to complete a task, describe the approach you would take to extract the data and obtain the result of interest
- Notebooks will be graded from now until the end of next week 

### Notes on the setup and notebook
- This notebook will **not** work on the Lab PCs by default. It needs to be launched within a virtual environment with all the required packages installed. See `setup-instructions.md` for more details 
- Successful installation is not part of the assessment so please ask one of the teaching assistants for help if you have any problems running the code
- The countries data is provided by calling `dt_utils.gendata()` and preprocessed for use in your analysis. As an alternative you can choose to read the data directly from the CSV file

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, tree, metrics, model_selection, ensemble
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image
import pydotplus
import itertools
import seaborn as sns
from scipy import stats
import dt_utils
%matplotlib inline 

#### 1. Data Extraction: Complete the following steps to prepare the data for Decision Tree Training and Testing (**2 Marks**)
- Select any **four** features as input into the Decision Tree. The features can be any of the following: 
  - Country, Region, Population, Area, Density, Coastline, Migration, InfantMortality, GDP, Literacy, Phones, Arable, Crops, OtherLand, Climate, Deathrate, Agriculture, Industry, Service
  - More information on the dataset is available [here](https://www.kaggle.com/fernandol/countries-of-the-world)
- Assign your choices to the `features` variable as a list of strings
- Extract the data using the `dt_utils.gendata()` utility method provided. This method will provide all your selected feature data (per country) in a 2D array with the last column holding our target value (Birth Rate)  
- Split the data into `data` and `target` arrays where:
  - `data` is a 2D array holding your **feature data** for each country
  - `target` is a 1D array holding the **birth rate** for each country
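
The split described above can be sketched with NumPy slicing. This is only an illustration on a small made-up array: the values are invented, and `raw` is a hypothetical stand-in for whatever `dt_utils.gendata()` returns for your chosen features.

```python
import numpy as np

# Made-up stand-in for the 2D array returned by dt_utils.gendata():
# three rows (countries), four feature columns, target in the last column.
raw = np.array([
    [1.0, 2.0, 3.0, 4.0, 0.0],
    [5.0, 6.0, 7.0, 8.0, 1.0],
    [9.0, 10.0, 11.0, 12.0, 2.0],
])

data = raw[:, :-1]   # every row, all columns except the last
target = raw[:, -1]  # every row, last column only

print(data.shape, target.shape)
```

Slicing with `:-1` and `-1` keeps working if you later change the number of selected features, because it never hard-codes the column count.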

**Note**: The "Region" feature will need pre-processing before being used as input into a Decision Tree. I can provide a code snippet to show how to encode the feature - but it might be best avoided in your initial selection.
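
One common way to pre-process a categorical feature such as "Region" is one-hot encoding. The sketch below uses `pandas.get_dummies` on a toy table; the region labels and GDP values are invented for illustration, and this is not necessarily the encoding used in the course snippet.

```python
import pandas as pd

# Toy rows with invented labels and values, for illustration only.
df = pd.DataFrame({
    "Region": ["ASIA", "EUROPE", "ASIA", "AFRICA"],
    "GDP": [1000.0, 30000.0, 5000.0, 800.0],
})

# Replace the text column with one 0/1 indicator column per region value.
encoded = pd.get_dummies(df, columns=["Region"], dtype=int)

print(encoded.columns.tolist())
```

Each region becomes its own numeric column, so the tree can split on "is this country in region X" without assuming any ordering between regions.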

In [None]:
bclass = ['low', 'medium', 'high'] # birth rate class (provided)
features = [] # feature data for you to define

#### 2. Generate 6 x 2D plots showing the feature distribution of your selected observations (**1 Mark**)

- Use one of the utility methods in `dt_utils` to plot the distributions

#### 3. Construct and train a decision tree to classify birth rate class based on your training data with the following requirements (**1 Mark**): 
- 70% of the observations are reserved for training
- Tree depth is limited to 5 
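
As a rough guide, a 70/30 split and a depth-limited tree could look like the sketch below. It uses the Iris dataset purely as a stand-in for your own `data` and `target` arrays; the `random_state` values and the `stratify` option are assumptions made here for reproducibility, not requirements of the task.

```python
from sklearn import datasets, model_selection, tree

# Iris is used only as a stand-in for the countries data.
iris = datasets.load_iris()

# Reserve 70% of the observations for training and 30% for testing.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Limit the tree depth to 5, as required by the task.
clf = tree.DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train, y_train)

print("depth:", clf.get_depth())
print("test accuracy:", clf.score(X_test, y_test))
```

Fixing `random_state` makes the split and the fitted tree identical each time the kernel is restarted and re-run.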

#### 4. Show the generated decision tree logic (**1 Mark**): 
- Just use the utility method available in `dt_utils` to generate the tree
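
If the graphical output is unavailable, scikit-learn can also print the tree rules as plain text. The sketch below is an alternative view rather than the `dt_utils` method, and it again uses Iris with a shallow tree as a stand-in.

```python
from sklearn import datasets, tree

# Iris is used only as a stand-in for the countries data.
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the split rules as an indented string.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```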

#### 5. Display a classification report and confusion matrix for your predictions using the test sample (**1 Mark**)
- Use one of the utility methods to plot the confusion matrix
- How many false positives and false negatives do you have for the high birth rate class? 
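
False positives and false negatives for a single class can be read directly from the confusion matrix. The sketch below uses made-up label lists in place of real predictions and assumes scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
from sklearn import metrics

labels = ["low", "medium", "high"]

# Made-up true and predicted classes, for illustration only.
y_true = ["low", "low", "medium", "high", "high", "high", "medium", "low"]
y_pred = ["low", "medium", "medium", "high", "medium", "high", "high", "low"]

print(metrics.classification_report(y_true, y_pred, labels=labels))

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
k = labels.index("high")

fp = cm[:, k].sum() - cm[k, k]  # predicted "high" but actually another class
fn = cm[k, :].sum() - cm[k, k]  # actually "high" but predicted as another class

print("false positives:", fp, "false negatives:", fn)
```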

#### 6. Run cross-validation to determine the average accuracy (**1 Mark**)
- Split the data into **5** folds
- Use the whole dataset as part of the cross-validation
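
A minimal sketch of 5-fold cross-validation over a whole dataset, assuming `cross_val_score` is an acceptable way to do it and using Iris as a stand-in:

```python
from sklearn import datasets, model_selection, tree

# Iris is used only as a stand-in for the countries data.
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(max_depth=5, random_state=0)

# cv=5 splits the full dataset into 5 folds; each fold is used once for testing.
scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=5)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

The spread of the fold accuracies gives a feel for how much a single train/test split can vary.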

#### 7. Perform a grid-search on **two** suitable hyper-parameters of your own choosing to look for higher accuracy (**1 Mark**)
- Limit the search to an 8x8 grid
- Use the same 70/30 split of training/testing data as before  
- Determine the best combination of hyper-parameters
- Re-apply the decision tree fit and evaluate performance 

Notes: 
- The distance between successive points does not have to be uniform
- Make reasonable choices for the range of values chosen (you will be asked about your reasoning) 
- Apply the grid-search to your **training** data 
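
The grid search could be sketched as follows. The choice of `max_depth` and `min_samples_leaf`, their value ranges, and the 5-fold internal cross-validation are all assumptions made for illustration on the Iris stand-in; you are expected to pick and justify your own.

```python
from sklearn import datasets, model_selection, tree

# Iris is used only as a stand-in for the countries data.
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Two hyper-parameters with 8 values each gives an 8x8 grid (non-uniform spacing).
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, 6, 8, 10],
    "min_samples_leaf": [1, 2, 3, 4, 5, 8, 12, 20],
}

search = model_selection.GridSearchCV(
    tree.DecisionTreeClassifier(random_state=0), param_grid, cv=5
)
search.fit(X_train, y_train)  # the search only sees the training data

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
print("test accuracy of refitted best tree:", search.score(X_test, y_test))
```

By default `GridSearchCV` refits the best combination on the full training set, so `search.score` evaluates that refitted tree on the held-out test sample.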

#### 8. Repeat your decision tree analysis to try and improve upon your classification performance (**3 Marks**)

Consider the following: 
- Will more than four features as input improve the classification? 
- Which is the most prominent feature that can help predict birth rate?
- Why do you think a given input feature is a good predictor of the target feature? 
- Are the parameters of your chosen decision tree optimal? 
- Would using an ensemble method help produce improved classification? 
- Is there any significant variation in accuracy when modifying the testing and training sample proportion?  
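
For the ensemble and feature-importance questions, one possible starting point is a random forest, sketched below on the Iris stand-in. The number of trees and the depth limit are arbitrary choices made here, and a random forest is only one of several ensemble methods you could try.

```python
from sklearn import datasets, ensemble, model_selection

# Iris is used only as a stand-in for the countries data.
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# An ensemble of 100 depth-limited trees, each trained on a bootstrap sample.
forest = ensemble.RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=0
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Impurity-based importances sum to 1; larger values mean more influential features.
importances = forest.feature_importances_
ranked = sorted(zip(iris.feature_names, importances), key=lambda p: p[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```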

Notes: 
- If you get over 90% accuracy, well done! However, the approaches taken matter more than the final result. Make sure you can justify each of the steps you took when being graded
- Do not try and plot decision surfaces with this dataset (the data needs to be normalised first) 