# Predicting Acceptance of US Visa Applications Using Machine Learning
Capstone Project for SlideRule Intensive Data Science workshop.

Charles Franzen

## The Project

Each year hundreds of thousands of applications are filed for work visas in the US. So many, in fact, that the legislated caps on the number of H-1B applications have been reached every single year since 2007. In 2008, the quota was reached on the first day of open applications. In light of this, companies and prospective employees lucky enough to get an application in via the lottery want to give it as high a chance of approval as possible.

My project is to investigate application data sets and create models that will predict the acceptance or rejection of a given application. This model could evaluate the strength of an application before it is submitted, and provide insights into how a weak application could be improved. The model will also elucidate the most important factors determining the fate of a visa application.

## The Data

I will be investigating data published by the US Department of Labor. The data sets consist of quarterly application data and decisions for different types of visas. I will be focusing on two of these data sets: one that includes a mix of all visa types, and one that consists solely of H-1B visas.

The data fields cover a wide range of information, including details about the country of origin of the applicant, location and type of the job, pay and educational requirements of the job, education of the applicant, and whether all regulatory hurdles have been cleared.

H-1B visas are work visas reserved for 'specialty occupations', and can therefore act as a bellwether for how much highly educated talent the US is importing.

## Methodology

I will be treating this as a classification problem. The binary variable 'denied' will indicate whether an application was rejected.

So far I have used logistic regression to model the all-visa data set. Next steps include refining the logistic regression model, creating a similar model for the H-1B data set, and continuing with other machine learning techniques, including tree-based models and support vector machines (SVM).

### Logistic Regression

In order to implement logistic regression, I've had to format the data sets, converting string cell values into integers or floats, and creating dummy variables for categorical data.
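
The two formatting steps described above can be sketched on a toy frame. The column names and values below are invented for illustration, and `pd.get_dummies` is just one way to build dummy variables; it is not necessarily the approach used later in this notebook.

```python
import pandas as pd

# toy data standing in for the real application records (values are made up)
toy = pd.DataFrame({
    'wage': ['83,366', '16,973', '49,982'],             # numbers stored as strings
    'refile': ['N', 'Y', 'N'],                          # Y/N indicator
    'education': ["Bachelor's", 'None', 'Doctorate'],   # categorical
})

# strings with thousands separators -> floats
toy['wage'] = toy['wage'].str.replace(',', '', regex=False).astype(float)

# Y/N indicator -> 1/0
toy['refile'] = (toy['refile'] == 'Y').astype(int)

# categorical column -> one dummy column per level
toy = pd.get_dummies(toy, columns=['education'])

print(toy)
```

After these steps every column is numeric (or boolean), which is what a logistic regression fit expects.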

Using historical data from the Department of Labor, I will test my models and further refine them.

## Limitations

Since the base rate of application rejection is so low (6.7% for all visa types), it may be difficult to identify relevant variables. Thus far my models have tended to produce a high number of false negatives (type II errors).

The models may also end up providing unhelpful insights. For example, the model may identify country of citizenship as a crucial determinant. It won't be helpful to most applicants to tell them to change their citizenship.

## Import Modules

In [267]:
%matplotlib inline
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

## Data Loading and Initial Processing
Withdrawn applications are removed from the data set, since there was no decision made.

In [281]:
# loading data
data = pd.read_excel('Data/PERM_Disclosure_Data_FY15_Q4.xlsx', 'DDFY2015_Final')

# removing withdrawn applications
data = data[data['CASE_STATUS'] != 'Withdrawn']

# adding a binary series with 1 indicating that an application was denied
data['denied'] = (data.CASE_STATUS == 'Denied').astype(int)

Unnamed: 0,CASE_NUMBER,DECISION_DATE,CASE_STATUS,CASE_RECEIVED_DATE,REFILE,ORIG_FILE_DATE,ORIG_CASE_NO,SCHD_A_SHEEPHERDER,EMPLOYER_NAME,EMPLOYER_ADDRESS_1,...,FOREIGN_WORKER_INFO_REQ_EXPERIENCE,FOREIGN_WORKER_INFO_ALT_EDU_EXPERIENCE,FOREIGN_WORKER_INFO_REL_OCCUP_EXP,PREPARER_INFO_EMP_COMPLETED,PREPARER_INFO_TITLE,EMPLOYER_DECL_INFO_TITLE,NAICS_CODE,NAICS_TITLE,PW_JOB_TITLE_9089,denied
0,A-13316-14231,2015-05-29,Certified,11/19/2013,N,,,N,GENERAC POWER SYSTEMS,S45 W29290 HWY 59,...,A,A,Y,N,Attorney,Vice President of Human Resources,335312,Motor and Generator Manufacturing,Industrial Engineers,0
1,A-13316-14287,2015-06-26,Denied,11/12/2013,N,,,N,"AMERICA'S CATCH, INC.",46623 COUNTY ROAD 523 (MAIL TO PO BOX ONLY),...,A,A,A,N,Representative,President,311712,Fresh and Frozen Seafood Processing,"Meat, Poultry, and Fish Cutters and Trimmers",1
2,A-13316-14312,2014-10-16,Denied,11/27/2013,N,,,N,"AVIDITY, LLC",12635 E. MONTVIEW BLVD. #140,...,Y,A,A,N,Attorney,Chief Financial Officer,54171,"Research and Development in the Physical, Engi...",Protein Production Scientist,1
3,A-13316-14276,2015-05-26,Certified,11/13/2013,N,,,N,STAR COMMUNICATIONS LLC,17426 HWY 99,...,A,A,A,N,Attorney,President,45399,All Other Miscellaneous Store Retailers,Market Research Analysts and Marketing Special...,0
4,A-13316-14275,2015-06-26,Denied,11/12/2013,N,,,N,"AMERICA'S CATCH, INC.",46623 COUNTY ROAD 523 (MAIL TO PO BOX ONLY),...,A,A,A,N,Representative,President,311712,Fresh and Frozen Seafood Processing,"Meat, Poultry, and Fish Cutters and Trimmers",1
5,A-13317-14380,2015-05-28,Certified,11/15/2013,N,,,N,SANSU CORPORATION,4750 S. HAGADORN ROAD,...,Y,A,Y,N,Attorney,President,722110,Full-Service Restaurants,"Cooks, Restaurant",0
6,A-13317-14356,2014-12-01,Certified-Expired,07/02/2014,N,,,N,TALBERT & TALBERT LLC,317 MADISON AVENUE,...,N,A,Y,N,Attorney,Managing Partner,541219,Other Accounting Services,Public Relations Specialists,0
7,A-13316-14278,2015-05-18,Denied,11/12/2013,N,,,N,"AMERICA'S CATCH, INC.",46623 COUNTY ROAD 523 (MAIL TO PO BOX ONLY),...,A,A,A,N,Representative,President,311712,Fresh and Frozen Seafood Processing,"Meat, Poultry, and Fish Cutters and Trimmers",1
8,A-13316-14279,2015-05-18,Denied,11/12/2013,N,,,N,"AMERICA'S CATCH, INC.",46623 COUNTY ROAD 523 (MAIL TO PO BOX ONLY),...,A,A,A,N,Representative,President,311712,Fresh and Frozen Seafood Processing,"Meat, Poultry, and Fish Cutters and Trimmers",1
9,A-13316-14281,2015-05-18,Denied,11/12/2013,N,,,N,"AMERICA'S CATCH, INC.",46623 COUNTY ROAD 523 (MAIL TO PO BOX ONLY),...,A,A,A,N,Representative,President,311712,Fresh and Frozen Seafood Processing,"Meat, Poultry, and Fish Cutters and Trimmers",1


## Descriptive Statistics

To do:

1. make pie charts for country of origin, education, case status, visa type
2. make histograms for salary/wage, training, and experience requirements

In [166]:
data['CASE_STATUS'].value_counts()

Certified            40176
Certified-Expired    38762
Denied                5696
Name: CASE_STATUS, dtype: int64

In [167]:
pcert, pcertex, pdenied = (data['CASE_STATUS'].value_counts().values) / float(len(data))
size = len(data)
print 'Certified: \t\t', pcert, '\nCertified-Expired: \t', pcertex, '\nDenied: \t\t', pdenied, '\nSize: \t\t\t', size
                                                                 

Certified: 		0.474702838103 
Certified-Expired: 	0.457995604603 
Denied: 		0.0673015572938 
Size: 			84634


The base rate of denials is low, at 6.7%.

In [168]:
serror = (pdenied*(1-pdenied)/float(len(data))) ** .5
print 'Standard Error: ',serror

Standard Error:  0.000861213341343
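
As a check on the figure above, the standard error of a proportion, sqrt(p(1 - p)/n), can be recomputed from the raw counts in the `value_counts()` output. The variable names here are illustrative; the `threshold` line mirrors the `pdenied + 2*serror` cutoff applied to the categorical data further down.

```python
import math

# counts taken from the value_counts() output above
n_denied = 5696
n_total = 84634

p_denied = n_denied / float(n_total)

# standard error of a proportion: sqrt(p * (1 - p) / n)
std_error = math.sqrt(p_denied * (1 - p_denied) / n_total)

# cutoff two standard errors above the base rate
threshold = p_denied + 2 * std_error

print(p_denied, std_error, threshold)
```

With these counts the cutoff comes out just under 6.91%, so any group flagged by the filter has a denial rate only slightly above the overall average at minimum.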


### Categorical Data

In [169]:
# possibly interesting columns of categorical data
dfcat = data[['denied', 'EMPLOYER_NAME', 'EMPLOYER_STATE', 'AGENT_FIRM_NAME', 'PW_SOC_TITLE', 'JOB_INFO_EDUCATION', 
              'COUNTRY_OF_CITIZENSHIP', 'CLASS_OF_ADMISSION', 'FOREIGN_WORKER_INFO_EDUCATION', 
              'FOREIGN_WORKER_INFO_MAJOR']]

After exploring these columns, visa type (`CLASS_OF_ADMISSION`), country of citizenship, and education appear to be the most informative.

In [265]:
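# visa types with a high rate of denial
dftype = dfcat[['CLASS_OF_ADMISSION', 'denied']]
dftype1 = dftype.groupby('CLASS_OF_ADMISSION').mean()
# note: summing the 0/1 'denied' flag gives the number of denials per class,
# so the 'size' column counts denied applications, not all applications
dftype1['size'] = dftype.groupby('CLASS_OF_ADMISSION').agg(sum)
# keep classes whose denial rate is more than two standard errors above the base rate
dftype = dftype1[(dftype1['denied'] > (pdenied + 2*serror)) & (dftype1['size'] >= 10)]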
dftype

CLASS_OF_ADMISSION,denied,size
B-1,0.347826,40
B-2,0.265248,187
E-1,0.149123,17
E-2,0.123584,120
E-3,0.071429,11
EWI,0.356796,147
F-1,0.096111,304
F-2,0.116822,25
H-1B1,0.162791,14
H-2A,0.789474,15
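
Because `denied` is a 0/1 flag, its group mean is a denial rate and its group sum is a denial count (for H-2A, 15 denials at a rate of 0.789474 corresponds to 15 of 19 applications). A minimal sketch that computes the rate, the denial count, and the group size in one pass, using made-up rows and hypothetical output column names `rate`, `denials`, and `n`:

```python
import pandas as pd

# toy data (made up): visa class and 0/1 denial flag
toy = pd.DataFrame({
    'CLASS_OF_ADMISSION': ['B-1', 'B-1', 'B-1', 'F-1', 'F-1', 'H-1B'],
    'denied':             [1,     0,     1,     0,     1,     0],
})

summary = toy.groupby('CLASS_OF_ADMISSION')['denied'].agg(
    rate='mean',     # share of applications denied
    denials='sum',   # number of denied applications
    n='size',        # number of applications in the group
)

print(summary)
```

Keeping `denials` and `n` as separate columns makes it explicit whether a minimum-count filter applies to denials or to applications.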


In [266]:
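# countries with a high rate of denial
dfcit = dfcat[['COUNTRY_OF_CITIZENSHIP', 'denied']]
dfcit1 = dfcit.groupby('COUNTRY_OF_CITIZENSHIP').mean()
# note: summing the 0/1 'denied' flag gives the number of denials per country,
# so the 'size' column counts denied applications, not all applications
dfcit1['size'] = dfcit.groupby('COUNTRY_OF_CITIZENSHIP').agg(sum)
# keep countries whose denial rate is more than two standard errors above the base rate
dfcit = dfcit1[(dfcit1['denied'] > (pdenied + 2*serror)) & (dfcit1['size'] >= 10)]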
dfcit

COUNTRY_OF_CITIZENSHIP,denied,size
ARGENTINA,0.098859,26
BRAZIL,0.076125,44
BULGARIA,0.078571,11
COLOMBIA,0.117493,45
DOMINICAN REPUBLIC,0.12766,12
ECUADOR,0.270408,53
EL SALVADOR,0.237624,24
GAMBIA,0.85,17
GREECE,0.078125,10
GUATEMALA,0.303371,27


In [172]:
dfcat.groupby(['FOREIGN_WORKER_INFO_EDUCATION'])['denied'].mean().sort_values()

FOREIGN_WORKER_INFO_EDUCATION
Doctorate      0.033010
Master's       0.051275
Bachelor's     0.053348
Other          0.074332
Associate's    0.185270
None           0.228803
High School    0.270021
Name: denied, dtype: float64

In [173]:
dfcat.groupby(['JOB_INFO_EDUCATION'])['denied'].mean().sort_values()

JOB_INFO_EDUCATION
Doctorate      0.025573
Master's       0.047168
Bachelor's     0.051324
Other          0.075218
Associate's    0.188406
None           0.233772
High School    0.304193
Name: denied, dtype: float64

Applicants with at least a Bachelor's have markedly lower rates of rejection than those without one: roughly 3% to 5% in both tables, against roughly 19% to 30% for Associate's, None, and High School.

## Data Preparation for Logistic Regression

All of the values are read in as unicode strings, even the numbers, so they have to be converted into the proper data types before the regression can be run. Additionally, dummy variables need to be created for the categorical data.

In [201]:
# interesting fields with binary or numerical data
data_fields = ['REFILE', 'EMPLOYER_NUM_EMPLOYEES', 
               'FOREIGN_WORKER_OWNERSHIP_INTEREST', 'PW_AMOUNT_9089', 'WAGE_OFFER_FROM_9089', 'WAGE_OFFER_TO_9089', 
               'JOB_INFO_TRAINING', 'JOB_INFO_TRAINING_NUM_MONTHS', 'JOB_INFO_EXPERIENCE', 
               'JOB_INFO_EXPERIENCE_NUM_MONTHS', 'JOB_INFO_ALT_FIELD', 'JOB_INFO_ALT_COMBO_ED_EXP', 
               'JOB_INFO_ALT_CMB_ED_OTH_YRS', 'JOB_INFO_FOREIGN_ED', 'JOB_INFO_ALT_OCC', 'JOB_INFO_ALT_OCC_NUM_MONTHS', 
               'JOB_INFO_JOB_REQ_NORMAL', 'JOB_INFO_FOREIGN_LANG_REQ', 'JOB_INFO_COMBO_OCCUPATION',
               'JI_FOREIGN_WORKER_LIVE_ON_PREMISES', 'JI_LIVE_IN_DOMESTIC_SERVICE', 'JI_LIVE_IN_DOM_SVC_CONTRACT', 
               'RECR_INFO_PROFESSIONAL_OCC', 'RECR_INFO_COLL_UNIV_TEACHER', 'RECR_INFO_COLL_TEACH_COMP_PROC', 
               'RI_POSTED_NOTICE_AT_WORKSITE', 'RI_LAYOFF_IN_PAST_SIX_MONTHS', 'RI_US_WORKERS_CONSIDERED', 
               'FOREIGN_WORKER_INFO_TRAINING_COMP', 'FOREIGN_WORKER_INFO_REQ_EXPERIENCE', 
               'FOREIGN_WORKER_INFO_ALT_EDU_EXPERIENCE', 'FOREIGN_WORKER_INFO_REL_OCCUP_EXP', 
               'PREPARER_INFO_EMP_COMPLETED']
dflogit = data[data_fields]

In [202]:
def remove_commas(uni):
    # converts unicode numbers with commas into floats
    string = str(uni)
    return float(string.replace(',', ''))

def data_clean(df):
    # formats datatypes to binary for indicators or floats for numbers;
    # each column's type is inferred from its first row only, and columns
    # that fail conversion are silently left out of the result
    dfr = pd.DataFrame()
    for column in df.columns:
        name = str(column).lower()
        first = str(df[column].iloc[0])
        if first in ['Y', 'N']:
            # Y/N indicator -> 1/0
            dfr[name] = (df[column] == 'Y').astype(int)
        elif ',' in first:
            # numbers written with thousands separators
            try:
                dfr[name] = df[column].apply(remove_commas)
            except ValueError:
                continue
        else:
            try:
                dfr[name] = df[column].apply(float)
            except ValueError:
                continue
    return dfr

def add_columns(df1, df2, columns):
    # adds columns from one dataframe to another
    for column in columns:
        df2[column] = df1[column]

In [203]:
# formatting data for logistic regression
dflogit = data_clean(dflogit)
new_columns = ['JOB_INFO_EDUCATION', 'COUNTRY_OF_CITIZENSHIP', 'CLASS_OF_ADMISSION', 'FOREIGN_WORKER_INFO_EDUCATION',
               'denied']
add_columns(data, dflogit, new_columns)
dflogit.head()

Unnamed: 0,refile,employer_num_employees,foreign_worker_ownership_interest,pw_amount_9089,job_info_training,job_info_training_num_months,job_info_experience,job_info_experience_num_months,job_info_alt_field,job_info_alt_combo_ed_exp,...,recr_info_coll_univ_teacher,ri_posted_notice_at_worksite,ri_layoff_in_past_six_months,foreign_worker_info_rel_occup_exp,preparer_info_emp_completed,JOB_INFO_EDUCATION,COUNTRY_OF_CITIZENSHIP,CLASS_OF_ADMISSION,FOREIGN_WORKER_INFO_EDUCATION,denied
0,0,1935,0,83366,0,,0,,0,0,...,0,1,0,1,0,Bachelor's,INDIA,H-1B,Bachelor's,0
1,0,350,0,16973,0,,0,,0,0,...,0,1,0,0,0,,SOUTH KOREA,,,1
2,0,4,0,49982,0,,1,36.0,0,0,...,0,1,0,0,0,Doctorate,GERMANY,H-1B,Doctorate,1
3,0,8,0,43514,0,,0,,0,0,...,0,1,0,0,0,Master's,SOUTH KOREA,E-2,Master's,0
4,0,350,0,16973,0,,0,,0,0,...,0,1,0,0,0,,SOUTH KOREA,,,1


In [204]:
# checking data types
dflogit.dtypes

refile                                  int64
employer_num_employees                float64
foreign_worker_ownership_interest       int64
pw_amount_9089                        float64
job_info_training                       int64
job_info_training_num_months          float64
job_info_experience                     int64
job_info_experience_num_months        float64
job_info_alt_field                      int64
job_info_alt_combo_ed_exp               int64
job_info_alt_cmb_ed_oth_yrs           float64
job_info_foreign_ed                     int64
job_info_alt_occ                        int64
job_info_alt_occ_num_months           float64
job_info_job_req_normal                 int64
job_info_foreign_lang_req               int64
job_info_combo_occupation               int64
ji_foreign_worker_live_on_premises      int64
ji_live_in_domestic_service             int64
recr_info_professional_occ              int64
recr_info_coll_univ_teacher             int64
ri_posted_notice_at_worksite      

In [208]:
dflogit.describe()

Unnamed: 0,refile,employer_num_employees,foreign_worker_ownership_interest,pw_amount_9089,job_info_training,job_info_training_num_months,job_info_experience,job_info_experience_num_months,job_info_alt_field,job_info_alt_combo_ed_exp,...,job_info_combo_occupation,ji_foreign_worker_live_on_premises,ji_live_in_domestic_service,recr_info_professional_occ,recr_info_coll_univ_teacher,ri_posted_notice_at_worksite,ri_layoff_in_past_six_months,foreign_worker_info_rel_occup_exp,preparer_info_emp_completed,denied
count,84634.0,84600.0,84634.0,84588.0,84634.0,1756.0,84634.0,46772.0,84634.0,84634.0,...,84634.0,84634.0,84634.0,84634.0,84634.0,84634.0,84634.0,84634.0,84634.0,84634.0
mean,0.00117,21949.727128,0.003686,83749.914278,0.018645,32.408884,0.551752,33.390383,0.373999,0.29995,...,0.004017,0.001229,0.000792,0.892277,0.02655,0.980776,0.031489,0.547038,0.15951,0.067302
std,0.034182,74840.819191,0.060605,39964.592513,0.135269,21.346236,0.497317,22.952114,0.483866,0.458239,...,0.063255,0.035033,0.028125,0.310032,0.160764,0.137312,0.174635,0.497785,0.366154,0.250545
min,0.0,0.0,0.0,6.55,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,75.0,0.0,65936.0,0.0,12.0,0.0,12.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,0.0,1374.5,0.0,83762.0,0.0,36.0,1.0,24.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
75%,0.0,27800.0,0.0,103355.0,0.0,36.0,1.0,60.0,1.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
max,1.0,2200000.0,1.0,5067600.0,1.0,240.0,1.0,240.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [249]:
params = dflogit.columns[:-5]
dummies = ['C(JOB_INFO_EDUCATION)', 'C(COUNTRY_OF_CITIZENSHIP)', 'C(CLASS_OF_ADMISSION)', 'C(FOREIGN_WORKER_INFO_EDUCATION)']

In [250]:
# prep dataframes for sklearn logit
y, X = dmatrices('denied ~ ' + ' + '.join(params) + ' + ' + ' + '.join(dummies), dflogit, return_type="dataframe")
print X.columns

Index([u'Intercept', u'C(JOB_INFO_EDUCATION)[T.Bachelor's]',
       u'C(JOB_INFO_EDUCATION)[T.Doctorate]',
       u'C(JOB_INFO_EDUCATION)[T.High School]',
       u'C(JOB_INFO_EDUCATION)[T.Master's]', u'C(JOB_INFO_EDUCATION)[T.None]',
       u'C(JOB_INFO_EDUCATION)[T.Other]',
       u'C(COUNTRY_OF_CITIZENSHIP)[T.ALBANIA]',
       u'C(COUNTRY_OF_CITIZENSHIP)[T.ALGERIA]',
       u'C(COUNTRY_OF_CITIZENSHIP)[T.ANGOLA]',
       ...
       u'job_info_foreign_lang_req', u'job_info_combo_occupation',
       u'ji_foreign_worker_live_on_premises', u'ji_live_in_domestic_service',
       u'recr_info_professional_occ', u'recr_info_coll_univ_teacher',
       u'ri_posted_notice_at_worksite', u'ri_layoff_in_past_six_months',
       u'foreign_worker_info_rel_occup_exp', u'preparer_info_emp_completed'],
      dtype='object', length=264)


In [251]:
# flattening y into a 1-D array
y = np.ravel(y)

## Logistic Regression

In [282]:
# creating a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

In [283]:
# checking the accuracy on the training set
model.score(X, y)

0.94047619047619047

In [254]:
y.mean()

0.077380952380952384

The model is doing only slightly better than the null (accepting every application).
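
The comparison with the null model can be made explicit with the two numbers printed above. This is plain arithmetic on the notebook's own outputs; the variable names are illustrative.

```python
# values copied from the two cells above
train_accuracy = 0.94047619047619047   # model.score(X, y)
denial_rate = 0.077380952380952384     # y.mean()

# a null classifier that approves everything is right whenever y == 0
null_accuracy = 1 - denial_rate

# how far the fitted model sits above that baseline
improvement = train_accuracy - null_accuracy

print(null_accuracy)
print(improvement)
```

The fitted model's training accuracy is about 1.8 percentage points above the always-approve baseline of about 92.3%.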

In [257]:
# examining the coefficients
pd.DataFrame(zip(X.columns, np.transpose(model.coef_))).sort_values(by=1)

Unnamed: 0,0,1
202,C(CLASS_OF_ADMISSION)[T.H-1B],[-0.00622750573483]
252,job_info_alt_occ_num_months,[-0.00334810709121]
77,C(COUNTRY_OF_CITIZENSHIP)[T.INDIA],[-0.00315043854947]
258,recr_info_professional_occ,[-0.00223340535993]
233,C(FOREIGN_WORKER_INFO_EDUCATION)[T.Bachelor's],[-0.00153513035455]
4,C(JOB_INFO_EDUCATION)[T.Master's],[-0.00130397932083]
212,C(CLASS_OF_ADMISSION)[T.L-1],[-0.00117151780021]
260,ri_posted_notice_at_worksite,[-0.000963699341092]
246,job_info_experience_num_months,[-0.00054177408555]
238,C(FOREIGN_WORKER_INFO_EDUCATION)[T.Other],[-0.000519379369431]


According to the model, by far the strongest factor leading to application rejection is the number of months of training required: jobs that require several months of training appear to be viewed unfavorably. Another strong factor is whether an alternate (i.e. lower) level of education is acceptable for the job, provided the applicant has a certain number of years of experience. The more experience required, the greater the chances of rejection. Applicants from Thailand also seem to be rejected at relatively high rates.

In contrast, applicants from India who are applying for H-1B visas in professional occupations may have the highest chances of having their applications approved.
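
One common way to read logistic regression coefficients is to exponentiate them into odds ratios: `exp(b)` is the factor by which the odds of denial are multiplied for a one-unit increase in that feature, holding the others fixed. The sketch below applies this to three coefficients copied from the table above. Note the assumptions: the model uses scikit-learn's default L2 penalty and the features are not rescaled, so the magnitudes are not directly comparable across features.

```python
import math

# three coefficients copied from the table above
coefs = {
    'C(CLASS_OF_ADMISSION)[T.H-1B]': -0.00622750573483,
    'C(COUNTRY_OF_CITIZENSHIP)[T.INDIA]': -0.00315043854947,
    'recr_info_professional_occ': -0.00223340535993,
}

# exp(coefficient) is the multiplicative change in the odds of denial
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print('%-40s %.5f' % (name, ratio))
```

All three ratios fall just below 1, meaning each feature is associated with slightly lower odds of denial in this fit.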

### Evaluating the Model

In [275]:
# evaluating the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.95, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Since the base rate is so low, a large number of applications are required for testing, so `test_size=.95` holds out 95% of the rows and leaves only 5% for training.

In [276]:
# predicting class labels for the test set
predicted = model2.predict(X_test)
print predicted

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


In [277]:
# generating class probabilities
probs = model2.predict_proba(X_test)
print probs

[[  9.99999969e-01   3.07533538e-08]
 [  9.99999394e-01   6.05818926e-07]
 [  9.99999977e-01   2.33318224e-08]
 [  9.99999884e-01   1.15840691e-07]
 [  9.99999978e-01   2.17361136e-08]
 [  9.99911364e-01   8.86356122e-05]
 [  9.99946811e-01   5.31885263e-05]
 [  9.99999851e-01   1.48917818e-07]
 [  9.99865616e-01   1.34384141e-04]
 [  9.99999818e-01   1.81920677e-07]
 [  8.08954505e-02   9.19104549e-01]
 [  9.99999991e-01   9.28437740e-09]
 [  1.00000000e+00   5.17762158e-19]
 [  9.99999996e-01   4.24245899e-09]
 [  1.00000000e+00   8.97642879e-11]
 [  9.92957781e-01   7.04221853e-03]
 [  9.99999965e-01   3.54446634e-08]
 [  9.99999578e-01   4.22428153e-07]
 [  9.99999566e-01   4.33873667e-07]
 [  9.99999997e-01   3.07288390e-09]
 [  9.99999973e-01   2.72623190e-08]
 [  9.99999996e-01   3.69249299e-09]
 [  1.00000000e+00   1.90028332e-12]
 [  9.99999667e-01   3.33238181e-07]
 [  9.99999990e-01   9.95069179e-09]
 [  9.99974277e-01   2.57232760e-05]
 [  1.00000000e+00   1.22114608e-14]
 

In [284]:
# generating evaluation metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.roc_auc_score(y_test, probs[:, 1])

0.93125
0.644144144144


In [285]:
print metrics.confusion_matrix(y_test, predicted)
print metrics.classification_report(y_test, predicted)

[[147   1]
 [ 10   2]]
             precision    recall  f1-score   support

        0.0       0.94      0.99      0.96       148
        1.0       0.67      0.17      0.27        12

avg / total       0.92      0.93      0.91       160



The confusion matrix does not look very promising: 10 of the 12 actual denials (83% of all positives) are missed (type II error). Further work will need to be done to improve the model.
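
The headline metrics can be recomputed by hand from the four cells of the confusion matrix, which makes the 83% figure easy to verify. The names `tn`, `fp`, `fn`, and `tp` are the usual shorthand, assuming rows are actual classes and columns are predicted classes (the scikit-learn convention).

```python
# confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 147, 1
fn, tp = 10, 2

recall = tp / float(tp + fn)        # share of actual denials that were caught
miss_rate = fn / float(tp + fn)     # share of actual denials missed (type II)
precision = tp / float(tp + fp)     # share of predicted denials that were real
accuracy = (tp + tn) / float(tn + fp + fn + tp)

print(recall, miss_rate, precision, accuracy)
```

These reproduce the 0.17 recall and 0.67 precision reported for class 1.0, and the overall accuracy of 0.93125.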

In [286]:
# evaluating the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print scores
print scores.mean()

[ 0.88888889  0.94444444  0.88888889  0.88235294  0.94117647  0.9375      1.
  0.9375      0.9375      1.        ]
0.935825163399


Accuracy varies widely over the subsamples, from 0.88 to 1.0, indicating inconsistencies in the classification.
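
The spread across folds can be quantified from the printed scores. The fold accuracies are simple fractions such as 15/16 (0.9375) and 15/17 (0.88235294), which suggests each fold holds only 16 to 18 rows, so a single misclassified application moves a fold's accuracy by about six percentage points. A short sketch of the summary statistics, with illustrative variable names:

```python
import statistics

# fold accuracies copied from the output above
scores = [0.88888889, 0.94444444, 0.88888889, 0.88235294, 0.94117647,
          0.9375, 1.0, 0.9375, 0.9375, 1.0]

mean_acc = statistics.mean(scores)
spread = statistics.pstdev(scores)     # population standard deviation
low, high = min(scores), max(scores)

print(mean_acc, spread, low, high)
```

The standard deviation is about four percentage points, which is large compared with the model's margin over the always-approve baseline.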