# Baseline_model.ipynb
The goal of this model is to predict whether a refugee will be granted asylum, using only information available before the hearing. For this baseline model, we fit an L2-regularized logistic regression on three features, described below.

#### Features  
   * `nat`: applicant's nationality (one-hot encoded)
   * `tracid`: judge's identification number (one-hot encoded)
   * `osc_date`: the Notice to Appear date for each proceeding (continuous)  
  
<br>
We define 'asylum granted' in two different ways. Each definition requires slightly different data cleaning, which leads to the following two datasets used in the baseline model.
1. `merged_full_asylum_master_app.csv`: the individual (identified by a unique `idnProceeding`) is granted Full Asylum.
2. `merged_any_master_app.csv`: the individual is granted at least one of the following: Full Asylum, Withholding of Removal, Protection Under Convention Against Torture.


## Format features
The variables of interest were cleaned and merged in `Cleaning.ipynb`. However, these features still need to be transformed into the appropriate format (e.g., one-hot encoded) for this model.

In [1]:
import pandas as pd
import numpy as np
import scipy
from scipy.sparse import csc_matrix
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
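# sklearn.cross_validation was removed in scikit-learn 0.20; the same functions live in model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

pd.set_option('display.precision', 5)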



In [2]:
# LOAD IN CLEANED DATASET
path = '/home/emilyboeke/'

master_app = pd.read_csv(path + 'merged_any_master_app.csv', low_memory=False)
#master_app = pd.read_csv(path + 'merged_full_asylum_master_app.csv', low_memory=False)


In [3]:
# load the idncases in the training set, and select the rows of master_app that correspond to those cases
train_cases = pd.read_csv(path + 'train_cases_any_asylum.csv',header=None).values
train_cases = train_cases.ravel()  # flatten the N x 1 array to 1-D
# .copy() so that the column assignments below modify an independent DataFrame, not a view
master_app = master_app[master_app.idncase.isin(train_cases)].copy()


### Prepare variables to go into model

In [4]:
# CHANGE OSC_DATE TO CONTINUOUS NUMBER (number of days since startdate)

master_app['osc_date'] = pd.to_datetime(master_app['osc_date'])
startdate = pd.Timestamp('1984-01-01') # earliest date, from which the timedelta is calculated

# vectorized subtraction: days elapsed between startdate and each osc_date
osc_date_cont = (master_app['osc_date'] - startdate).dt.days.values

In [5]:

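# change dec to binary variable (0 = DENY, 1 = GRANT)
master_app.loc[(master_app["dec"] == 'DENY'),'dec'] = 0
master_app.loc[(master_app["dec"] == 'GRANT'),'dec'] = 1

# cast to int so the labels are not left as a generic object column
y = master_app['dec'].astype(int)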
#master_app.describe()

In [6]:
# ONE HOT ENCODE CATEGORICAL VARIABLES

# change string nationalities to integer categories 
le = LabelEncoder()
nat_int = le.fit_transform(master_app['nat'])
nat_int = np.reshape(nat_int,[len(nat_int),1])

# get N x 2 array of features of interest
feat_int = np.concatenate((master_app[['tracid']], nat_int), axis=1)
feat_int.shape

# get one hot encoder of features
enc = OneHotEncoder()
enc.fit(feat_int)

# create sparse matrix of all observations in Compressed Sparse Row format
blah = enc.transform(feat_int)
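
To make the encoding step concrete, here is a minimal sketch on toy data. The judge IDs and nationality codes are made up for illustration and are not taken from the dataset; the calls mirror the cell above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy values: three proceedings, two judges, two nationalities
tracid_toy = np.array([[1], [2], [1]])
nat_toy = np.array(['MX', 'CN', 'MX'])

# Map nationality strings to integer categories (sorted alphabetically: CN -> 0, MX -> 1)
nat_int_toy = LabelEncoder().fit_transform(nat_toy).reshape(-1, 1)

# N x 2 integer array: one column per categorical feature
feat_toy = np.concatenate((tracid_toy, nat_int_toy), axis=1)

# One indicator column per distinct value in each input column
onehot_toy = OneHotEncoder().fit_transform(feat_toy)
print(onehot_toy.shape)
print(onehot_toy.toarray())
```

Each row contains exactly one 1 per original feature, so with two features every row sums to 2.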

In [7]:
# CONCATENATE ONE HOT ENCODED FEATURES WITH CONTINUOUS FEATURE

# convert CSR to CSC, because appending a column is simpler in column-major format
blah = blah.tocsc()
# append osc_date as one extra, fully populated column
new_data = np.concatenate((blah.data, osc_date_cont)) # stored values, with the new column's values at the end
new_indices = np.concatenate((blah.indices, np.arange(len(osc_date_cont)))) # row index of each stored value
new_ind_ptr = np.append(blah.indptr, blah.indptr[-1]+len(osc_date_cont)) # column pointers, plus the end of the new column
# build the new matrix
X = csc_matrix((new_data, new_indices, new_ind_ptr))
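
The manual CSC construction above appends one dense column to a sparse matrix. As a cross-check, the sketch below builds the same kind of matrix on toy data both ways: by hand, and with `scipy.sparse.hstack`. The toy values are hypothetical, and `hstack` is offered as an alternative, not as what the notebook used.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix, hstack

# Hypothetical one-hot block (3 rows, 4 indicator columns) and day counts
onehot_toy = csr_matrix(np.array([[1., 0., 0., 1.],
                                  [0., 1., 1., 0.],
                                  [1., 0., 0., 1.]]))
days_toy = np.array([120, 4500, 9000])

# Manual approach: extend the CSC arrays with one fully populated column
csc_toy = onehot_toy.tocsc()
data = np.concatenate((csc_toy.data, days_toy))
indices = np.concatenate((csc_toy.indices, np.arange(len(days_toy))))
indptr = np.append(csc_toy.indptr, csc_toy.indptr[-1] + len(days_toy))
X_manual = csc_matrix((data, indices, indptr))

# Alternative: let scipy stack the blocks horizontally
days_col = csr_matrix(days_toy.reshape(-1, 1))
X_hstack = hstack([onehot_toy, days_col], format='csc')

print(np.array_equal(X_manual.toarray(), X_hstack.toarray()))
```

Both routes yield a matrix with the one-hot columns first and the day count as the last column.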


## Implement L2 logistic regression

We use sklearn to define a logistic regression with an L2 penalty.

In [8]:
LogReg = LogisticRegression(penalty='l2') # defining model: logistic regression with L2 penalization

### Implement model on train/validation set

In [9]:
# split data into train and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogReg.fit(X_train, y_train)
print(model.score(X_test,y_test))

0.7307730618117426


Using only the applicant's nationality, the judge's ID, and the Notice to Appear date, the model reaches the following validation accuracy on each dataset:  
  * `merged2_master_app.csv`: around 73% accuracy
  * `merged_any_master_app.csv`: around 73.3% accuracy
  * `merged_full_asylum_master_app.csv`: around 73.5% accuracy
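
These accuracies are easier to interpret next to the accuracy of always predicting the more common outcome. The grant/deny base rate is not reported here, so the sketch below uses made-up labels (`y_toy`, 70% denials) purely to show how such a reference point could be computed with scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 7 denials (0) and 3 grants (1); not the real class balance
y_toy = np.array([0] * 7 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = dummy.score(X_toy, y_toy)
print(baseline_acc)
```

On the real data, the same call with `X_train`, `y_train`, `X_test`, and `y_test` would give the majority-class accuracy to compare against the model's score.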



In [10]:
# print some summary stats on the weights
coeffs = model.coef_[0]

print(min(coeffs))
print(max(coeffs))
print(np.mean(coeffs))

-1.888390763926314
0.9809239771939438
-0.004849619656256326


In [11]:
# plot weights

import matplotlib.pyplot as plt
plt.hist(coeffs, bins='auto') 
plt.title("L2 logistic regression weights")
plt.show()

<Figure size 640x480 with 1 Axes>

In [12]:
# LOOK AT LARGEST MAGNITUDE WEIGHTS
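
One way to list the largest-magnitude weights is to sort by absolute value. The sketch below uses toy coefficients and hypothetical feature names (`coeffs_toy`, `names_toy`); on the fitted model, `coeffs` would take the place of `coeffs_toy`, and the names would have to be reconstructed from the encoder's column order.

```python
import numpy as np

# Hypothetical weights and labels, for illustration only
coeffs_toy = np.array([0.12, -1.89, 0.98, -0.05, 0.40])
names_toy = np.array(['tracid_1', 'tracid_2', 'nat_0', 'nat_1', 'osc_date'])

top_k = 3
# Indices sorted by descending absolute weight, truncated to the top k
order = np.argsort(np.abs(coeffs_toy))[::-1][:top_k]

for name, w in zip(names_toy[order], coeffs_toy[order]):
    print(f"{name}: {w:+.3f}")
```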



### Evaluate model using k-fold cross-validation

In [13]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogReg, X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())

[0.64945327 0.69279295 0.64181681 0.70449565 0.72379751 0.71279633
 0.71461142 0.71253071 0.78004294 0.64713565]
0.6979473251269621
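
The fold scores above range from about 0.64 to 0.78. When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling, so any ordering in the rows (for example by date) can make folds differ from one another. Whether that explains the spread here is an assumption that has not been checked. The sketch below shows, on synthetic data, how shuffled folds could be requested instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratified 10-fold split with rows shuffled before folding
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_toy = cross_val_score(LogisticRegression(penalty='l2'), X_toy, y_toy,
                             scoring='accuracy', cv=cv)
print(scores_toy.mean(), scores_toy.std())
```

Reporting the standard deviation alongside the mean makes the fold-to-fold variation explicit.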
