## Notes and Questions
- The metrics they use to select model parameters are classification metrics. I need to look further into regression analogs beyond the standard ones covered in class.
- How strictly should I follow the paper, e.g. the number of trees in the random forest?
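
On the first note, a minimal sketch of the usual regression counterparts (MAE, MSE/RMSE, R2), assuming scikit-learn. The arrays are made-up numbers for illustration, not values from the deforestation data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error^2
rmse = np.sqrt(mse)                        # back in the units of the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

MAE and RMSE play roughly the role that accuracy plays in the paper (lower is better), while R2 is the closest thing to a bounded "share explained" score.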

### Covariates in corruption paper 
1. private sector includes different measures of economic activity and sectoral distributions

- Average business establishments size based on employment, number of business establishments, payroll per employee, average business establishments payroll, share of business establishments entering, share of business establishments exiting, business establishments churning, share of private sector workers over population, Hirschman-Herfindahl index based on business establishments size, average growth in business establishments and in employment in past 3 years, share of business establishments below 5 employees, share of business establishments between 5 and 25 employees, share of business establishments above 25 employees, share of business establishments in construction, share of business establishments in retail, share of business establishments in services.

2. public sector features include the size, relative importance, and wages of public officials

- Share of public sector employees over population, average wage of public sector employees, share of public institutions opening, share of public institutions closing, public institutions churning, share of workers by position within the institution, average growth in public employment and public institutions in past 3 years, share of public sector employees from municipal institutions, number of public institutions, average public institution size based on employment.

3. financial development includes measures of credit-related variables from public and private banks

- Share of business establishments receiving public loans, number of public loans per business establishment, total public credit per business establishment, average interest rate in public lending, bank branches per capita, banks per capita, total private credit per capita, total deposits per capita, and Hirschman-Herfindahl index based on private banks total assets and based on private banks credit.

4. human capital includes measures of education and access to it

- Literacy rate, the share of population between 15 and 24 years old that finished the first, second, and third cycle of primary education (Census), illiteracy rate (Census), average test scores in Portuguese and maths for nationwide tests at 4th and 8th grade, average private sector employees education, average private sector employees education by worker position within the firm, share of unqualified public employees based on job requirements, share of unqualified public employees by position within the institution, average public employees education, average public employees education by position within the institution, number of higher public education institutions per capita, number of higher private education institutions per capita.

5. public spending includes different types of spending as well as local procurement variables

- Total expenditures per capita, personnel expenditures per capita, budget surplus per capita, total revenue per capita, federal transfers of capital per capita, federal current transfers per capita, transfers from the national tax fund per capita, share of business establishments in the municipality with public procurement, number of contracts per business establishments, federal procurement expenditure over population, share of discretionary contracts, and share of competitive contracts.

6. local politics includes variables of political competition and alignment with the central government

- Number of candidates, Hirschman-Herfindahl index based on the vote shares, margin of victory between the winner and the runner-up, an indicator for whether the mayor is in his second term, an indicator for whether the mayor’s party is the same as the one of the governor, an indicator for whether the mayor’s party is from the same party as the one of the president, an indicator if the mayor is from right-wing party, an indicator if the mayor is from left-wing party, average candidate campaign donations and expenditures for firms and individuals, and per capita campaign donations and expenditures for firms and individuals.

7. local demographics

- Population density, GDP per capita, share of population living in rural areas (Census), deaths by aggression, GINI coefficient for income distribution (Census), average night light intensity coverage performing deblurring, inter-calibration, and geometric corrections, local radio, local newspapers, infant mortality rate, child mortality rate, average number of prenatal visits, share of abnormal births, share of underweight births, share of births with more than seven prenatal visits, and share of births with more than four prenatal visits.

8. natural resources’ dependency includes the relevance of different natural resources

- Share of business establishments in agriculture and mining sector, share of production of each of the top-7 crops in the country multiplied by the log change in international prices and share of value of production over GDP (as constructed in Bernstein et al., 2018). The crops included are sugar cane, oranges, soybeans, maize, rice, banana, and wheat, covering more than 98% of total agricultural production.
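
Several of the covariates above are Hirschman-Herfindahl indices (on establishment size, bank assets, vote shares). A minimal sketch of the standard formula, the sum of squared shares, in case I need to rebuild one; the function name `hhi` and the vote counts are hypothetical.

```python
import numpy as np

def hhi(sizes):
    """Sum of squared shares: 1/n for n equal units, 1 for a single unit."""
    sizes = np.asarray(sizes, dtype=float)
    shares = sizes / sizes.sum()
    return float(np.sum(shares ** 2))

# Hypothetical vote counts for three candidates in one municipality
votes = [500, 300, 200]
print(round(hhi(votes), 4))  # 0.5^2 + 0.3^2 + 0.2^2 = 0.38
```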

### Covariates in deforestation
1. federal politics
- "lula": Lula government years (2003-2010),
- "dilma": Dilma government years (2011-2016),
- "temer": Temer interim government (2017-2018),
- "bolsonaro": Bolsonaro government (2019-2020),
- "fed_election_year": Years where there was a Federal Election 
- "new_forest_code": years after new forest code (post 2012), --? is this one sorted correctly

2. local politics includes variables of political competition and alignment with the central government
- "mun_election_year": municipal election year,


3. local demographics

- "populacao": Population (I think this is census data so 2000 and 2010),
- "pib_pc": GDP per capita,
- "indigenous_homol": pixel is inside an indigenous, homologated territory

4. natural resources and economy

- "ironore": whether the municipality produces iron ore,
- "silver": whether the municipality produces silver,
- "copper": whether the municipality produces copper,
- "gold": whether the municipality produces gold,

- "soy_price": whether the municipality produces soy,
- "beef_price": whether the municipality produces beef,

- "ag_jobs": total employment in the agricultural sector,
- "mining_jobs": total employment in the mining sector,
- "public_jobs": total employment in the public sector,
- "construction_jobs": total employment in the construction sector


5. non-human related geographic variables
- "rain1": rainfall,
- "elevation": elevation (meters above sea level),
- "slope": slope,
- "aspect": aspect,
- "near_mines": distance to nearest mine,
- "near_roads": distance to nearest road,
- "near_hidrovia": distance to nearest hydroelectric,
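
The period indicators in the federal and local politics groups can be rebuilt from a year column. A minimal sketch, assuming a hypothetical `year` column in a toy `panel` frame (neither is taken from `analysis.Rdata`). The election-year rules and the strict reading of "post 2012" are my assumptions and should be checked against the data.

```python
import pandas as pd

# Toy frame with one row per year; the real data has many rows per year
panel = pd.DataFrame({"year": range(2001, 2021)})

# Government periods as listed above (inclusive year ranges)
periods = {
    "lula": (2003, 2010),
    "dilma": (2011, 2016),
    "temer": (2017, 2018),
    "bolsonaro": (2019, 2020),
}
for name, (start, end) in periods.items():
    panel[name] = panel["year"].between(start, end).astype(int)

# Assumption: federal elections every four years counting from 2002,
# municipal elections every four years counting from 2000
panel["fed_election_year"] = (panel["year"] % 4 == 2).astype(int)
panel["mun_election_year"] = (panel["year"] % 4 == 0).astype(int)

# Assumption: "post 2012" is read as strictly after 2012
panel["new_forest_code"] = (panel["year"] > 2012).astype(int)

print(panel.set_index("year").loc[[2010, 2012, 2018]])
```

Comparing these rebuilt columns with the ones already in the data would be one way to answer the question about whether "new_forest_code" is coded correctly.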

## Models
- Random Forests
    - In this application, we keep fixed the number of fitted trees (500) and use cross-validation to determine the optimal number of features available in every node
- Gradient Boosting 
    - we keep fixed the learning rate (shrinkage parameter) and the minimum number of observations in the terminal nodes to avoid overfitting, and use cross-validation to determine the optimal number of trees and the interaction depth
- Neural Networks 
    - we keep fixed a logistic activation function and use cross-validation to determine the optimal number of units in the hidden layer (size) and the regularization parameter (decay)
- LASSO
    - The tuning parameter in the cross-validation is the weight of the penalization term in the objective function (λ), which is optimized over a grid of potential values
- Super Learner Ensemble
    - we use the Super Learner ensemble method developed by Polley et al. (2011), which finds an optimal combination of individual prediction models by minimizing the cross-validated out-of-bag risk of these predictions
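
A rough sketch of how the five models above might map onto scikit-learn regressors. Everything here is my assumption rather than the paper's setup: the grid values are placeholders, `max_depth` stands in for interaction depth, `min_samples_leaf` for the minimum terminal-node size, `alpha` in `MLPRegressor` for decay, and `StackingRegressor` with non-negative weights is only a stand-in for the Super Learner, not the Polley et al. implementation.

```python
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.neural_network import MLPRegressor

# Each entry: (estimator with the fixed settings, grid of the tuned settings).
# All grid values are placeholders, not the paper's.
learners = {
    "random_forest": (
        RandomForestRegressor(n_estimators=500),      # number of trees fixed
        {"max_features": [0.2, 0.4, 0.6, 0.8, 1.0]},  # share of features per split
    ),
    "gradient_boosting": (
        GradientBoostingRegressor(learning_rate=0.1, min_samples_leaf=10),
        {"n_estimators": [100, 300, 500], "max_depth": [1, 2, 3]},
    ),
    "neural_network": (
        MLPRegressor(activation="logistic", max_iter=1000),
        {"hidden_layer_sizes": [(5,), (10,), (20,)],  # units in the hidden layer
         "alpha": [1e-4, 1e-2, 1.0]},                 # L2 penalty
    ),
    "lasso": (
        Lasso(max_iter=10_000),
        {"alpha": [0.001, 0.01, 0.1, 1.0]},           # penalty weight (lambda)
    ),
}

# Stand-in for the Super Learner: combine out-of-fold predictions of the
# base learners using non-negative linear weights
stack = StackingRegressor(
    estimators=[(name, est) for name, (est, _) in learners.items()],
    final_estimator=LinearRegression(positive=True),
    cv=5,
)
```

Each `(estimator, grid)` pair can be passed to `GridSearchCV`, and the tuned estimators can then replace the defaults inside `stack`.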

## Protocol
- We divide our dataset into 70% as our training set and 30% as our testing set
- In our training set, we perform a 5-fold cross-validation procedure in order to train our models and choose the optimal combination of parameters
- The previous step is repeated 10 times with different random partitions. Hence, we obtain 10 “optimal parameters” and we use as our optimal parameter the average of them. For the case of integer parameters, we round it to the closest integer
- Using these optimal parameters, we assess the performance of our models in the testing set that has never been used for training purposes
- We standardize the data by the mean and standard deviation of the training set
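
The protocol above could be sketched as follows for a single model. This is an illustration under assumptions, not the paper's code: the data is synthetic, the forest uses 25 trees instead of 500 to keep it quick, the `max_features` grid is a placeholder, and the scaler is fit once on the whole training set rather than inside each fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real covariates and outcome
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize with the mean and standard deviation of the training set only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 5-fold CV on the training set, repeated 10 times with different partitions
grid = {"max_features": [2, 4, 6, 8]}
best_values = []
for rep in range(10):
    folds = KFold(n_splits=5, shuffle=True, random_state=rep)
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=25, random_state=rep),
        grid, cv=folds, scoring="neg_mean_squared_error")
    search.fit(X_train_s, y_train)
    best_values.append(search.best_params_["max_features"])

# Average the 10 optimal values; round because the parameter is an integer
max_features = int(round(np.mean(best_values)))

# Refit with the averaged parameter and score on the untouched test set
final = RandomForestRegressor(
    n_estimators=25, max_features=max_features, random_state=0)
final.fit(X_train_s, y_train)
test_mse = mean_squared_error(y_test, final.predict(X_test_s))
print("max_features:", max_features, "test MSE:", round(test_mse, 2))
```

Tree-based models do not depend on the scale of the inputs, so the standardization step matters mainly for LASSO and the neural network.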

## Assessing Models’ Performance
- We use as a first performance measure of interest the area under the ROC (Receiver Operating Characteristic) curve (AUC)
- We also present each model’s level of accuracy, which corresponds to the proportion of municipalities correctly predicted as corrupt; models’ precision, which is the proportion of positive identifications that are correct (or true positives over true positives plus false positives); models’ recall, which is the proportion of actual positives identified correctly (true positives over true positives plus false negatives), and models’ F1, which is the harmonic mean of precision and recall
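
A small worked example of the formulas in the bullet above, using made-up confusion-matrix counts. Accuracy is computed here in its standard form, the share of all municipalities classified correctly.

```python
# Hypothetical counts for a corrupt / not-corrupt classifier
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 170 / 200
precision = tp / (tp + fp)                          # 40 / 50
recall = tp / (tp + fn)                             # 40 / 60
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```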

## Random Forests

In [63]:
import pyreadr
import pandas as pd
import sklearn
import numpy as np

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [25]:
result = pyreadr.read_r('/Users/annieulichney/Desktop/Deforestation/analysis.Rdata')

In [64]:
df = pd.DataFrame(result['forest_full'])

In [65]:
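# Work on the first 1,000 rows while prototyping
df2 = df.iloc[:1000]
# Target: forest cover; features: every other column
y = df2['forest.l']
df1 = df2.copy()
X = df2.drop(columns='forest.l')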

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [57]:
model = RandomForestRegressor()

#5-fold cross validation as specified by the paper
#remove random state later
cv = RepeatedKFold(n_splits = 5, n_repeats = 10, random_state = 1)


In [60]:
from sklearn.model_selection import RandomizedSearchCV


# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split: all of them (1.0, formerly 'auto') or sqrt
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}



In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()

# Random search of parameters, using 5 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 5, verbose=2, random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [None]:
rf_random.best_params_

In [None]:
#Evaluate
def evaluate(model, test_features, test_labels):
    # Reports the mean absolute error and 100 - MAPE as a rough "accuracy";
    # MAPE is undefined when any label equals zero
    predictions = model.predict(test_features)
    errors = np.abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy

In [None]:
# Baseline: a small forest with default settings
base_model = RandomForestRegressor(n_estimators = 10, random_state = 42)
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_test, y_test)

# Best model found by the random search
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)

print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))

In [54]:
#n_estimators should be 500 as in the paper? 

In [55]:
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [51]:
#Model performance 

#MAE
mae_n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (np.mean(mae_n_scores), np.std(mae_n_scores)))

#MSE
mse_n_scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
print('MSE: %.3f (%.3f)' % (np.mean(mse_n_scores), np.std(mse_n_scores)))

#R2
r2_n_scores = cross_val_score(model, X, y, scoring='r2', cv=cv, n_jobs=-1)
print('R2: %.3f (%.3f)' % (np.mean(r2_n_scores), np.std(r2_n_scores)))


MAE: -5.816 (0.339)
MSE: -61.162 (8.537)
R2: 0.893 (0.015)
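
The MAE and MSE above are negative because scikit-learn's `neg_*` scorers flip the sign so that higher is always better; "MAE: -5.816" corresponds to a mean absolute error of 5.816. A sketch of getting all three metrics from a single pass with `cross_validate` and restoring the signs, using synthetic data and hypothetical `*_demo` names rather than the notebook's `X`, `y`, and `model`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in data and a small forest to keep the run short
X_demo, y_demo = make_regression(
    n_samples=150, n_features=6, noise=5.0, random_state=0)
model_demo = RandomForestRegressor(n_estimators=20, random_state=0)
cv_demo = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)

# One cross-validation pass, three scorers
scores = cross_validate(
    model_demo, X_demo, y_demo, cv=cv_demo,
    scoring={"mae": "neg_mean_absolute_error",
             "mse": "neg_mean_squared_error",
             "r2": "r2"})

mae = -scores["test_mae"]             # undo the sign flip
rmse = np.sqrt(-scores["test_mse"])   # per-fold RMSE, in units of the target
r2 = scores["test_r2"]

print("MAE:  %.3f (%.3f)" % (mae.mean(), mae.std()))
print("RMSE: %.3f (%.3f)" % (rmse.mean(), rmse.std()))
print("R2:   %.3f (%.3f)" % (r2.mean(), r2.std()))
```

This fits each fold once instead of three times, which matters once the full dataset and 500 trees are used.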


## Variable Importance: Things to Look Into
https://mljar.com/blog/feature-importance-in-random-forest/

https://arxiv.org/pdf/1407.7502.pdf
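
As a starting point, a minimal sketch comparing two importance measures available in scikit-learn: the impurity-based `feature_importances_` attribute and `permutation_importance` on held-out data. The data is synthetic and built so that only the first two of five features matter; all names ending in `_imp` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on features 0 and 1 only
rng = np.random.default_rng(0)
X_imp = rng.normal(size=(300, 5))
y_imp = 3.0 * X_imp[:, 0] + 1.0 * X_imp[:, 1] + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imp, y_imp, test_size=0.3, random_state=0)
rf_imp = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: computed from the training data, sums to 1
impurity = rf_imp.feature_importances_

# Permutation importance: drop in test-set score when one column is shuffled
perm = permutation_importance(
    rf_imp, X_te, y_te, n_repeats=10, random_state=0)

for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: impurity={impurity[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Permutation importance is computed on data the forest has not seen, so it may be the safer choice with this many correlated covariates; that still needs checking against the two references above.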