# FEUP - AC
## Banking - Predicting a Loan Outcome

### Business understanding

##### Analysis of requirements with the end user
Text

##### Definition of business goals
Text

##### Translation of business goals into data mining goals
Text

### Exploratory data analysis

Let us start by importing the datasets.

In [169]:
import pandas as pd
import numpy as np

COMP_DATA_SOURCE = "comp_data/"
DEV_DATA_SOURCE = "dev_data/"

# Read in the data; merge data for tables other than loans since they do not influence merges
accounts = pd.read_csv(DEV_DATA_SOURCE + 'account.csv', sep=';')

cards_dev = pd.read_csv(DEV_DATA_SOURCE + 'card.csv', sep=';')
cards_comp = pd.read_csv(COMP_DATA_SOURCE + 'card.csv', sep=';')
cards = pd.concat([cards_dev, cards_comp])

clients = pd.read_csv(DEV_DATA_SOURCE + 'client.csv', sep=';')

dispositions = pd.read_csv(DEV_DATA_SOURCE + 'disp.csv', sep=';')

districts = pd.read_csv(DEV_DATA_SOURCE + 'district.csv', sep=';')

transactions_dev = pd.read_csv(DEV_DATA_SOURCE + 'trans.csv', sep=';')
transactions_comp = pd.read_csv(COMP_DATA_SOURCE + 'trans.csv', sep=';')
transactions = pd.concat([transactions_dev, transactions_comp])

loans_dev = pd.read_csv(DEV_DATA_SOURCE + 'loan.csv', sep=';')
loans_comp = pd.read_csv(COMP_DATA_SOURCE + 'loan.csv', sep=';')

loans_dev.head()

  transactions_dev = pd.read_csv(DEV_DATA_SOURCE + 'trans.csv', sep=';')


Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,5314,1787,930705,96396,12,8033,-1
1,5316,1801,930711,165960,36,4610,1
2,6863,9188,930728,127080,60,2118,1
3,5325,1843,930803,105804,36,2939,1
4,7240,11013,930906,274740,60,4579,1


Plenty of analysis with graphs, distributions, outliers...

### Data preprocessing

Text

#### Processing accounts

The account spreadsheet relates it with the district and date in which it was created, as well with the frequency of issuance of statements with the owner.

The date comes in the from YYMMDD, which is easily extractable to the three different columns that it is aggregating. This will aid the model in detecting important patterns in the year of creation, for example, indicating a global economic crisis or prosperity when the person has created the account.

In [170]:
import src.preprocess.accounts as accpp
accounts = accpp.preprocess_accounts(accounts)
accounts.head()

Unnamed: 0,account_id,district_id,account_frequency,account_year,account_month,account_day
0,576,55,monthly issuance,93,1,1
1,3818,74,monthly issuance,93,1,1
2,704,55,monthly issuance,93,1,1
3,2378,16,monthly issuance,93,1,1
4,2632,24,monthly issuance,93,1,2


#### Processing credit cards

In the credit cards relation, the same concept applies to the date of issuance.

MORE TEXT

In [171]:
from src.preprocess.cards import preprocess_cards

cards = preprocess_cards(cards)
cards.head()

Unnamed: 0,card_id,disp_id,type,year,month,day
0,1005,9285,classic,93,11,7
1,104,588,classic,94,1,19
2,747,4915,classic,94,2,5
3,70,439,classic,94,2,8
4,577,3687,classic,94,2,15


#### Processing clients

In the clients relation, two foreign references and a date appear once again. However, regarding the latter one, there is an important difference: the day is the actual day plus fifty, if the client is a female, and so we can extract the gender too, here.

In [172]:
from src.preprocess.clients import preprocess_clients

clients = preprocess_clients(clients)
clients.head()

Unnamed: 0,client_id,district_id,year,month,day,gender
0,1,18,70,12,13,female
1,2,1,45,2,4,male
2,3,1,40,10,9,female
3,4,5,56,12,1,male
4,5,5,60,7,3,female


#### Processing dispositions

The disposition describes the rights of clients to operate accounts: only "owners" can ask for loans and issue permanent orders.

There is no clear way to merge this table with loans later on, since the loans table does not refer the disposition used for the loan. Instead, let's focus on the number of dispositions and the ratio of owners of an account, which will give an impression of the nature of the people able to control the account. All humans fight and cheat on each other, or build trust together. Let's discover that pattern in Czech Republic.

In [173]:
from src.preprocess.dispositions import preprocess_dispositions
dispositions = preprocess_dispositions(dispositions)
dispositions.head()

Unnamed: 0,disp_id,client_id,account_id,type,number_account_dispositions
0,1,1,1,OWNER,1
1,2,2,2,OWNER,2
2,3,3,2,DISPONENT,2
3,4,4,3,OWNER,2
4,5,5,3,DISPONENT,2


#### Processing districts

The districts relation is quite dense, describing the demographic data of a region. We can extract interesting measures here:

- The commited crimes growth ratio, derived from the evolution of the number of crimes from '95 to '96
- The unemployment rate growth ratio, derived from the evolution of the employment rate from '95 to '96

We also acknowledged that there was one region where the number of crimes and unemployment rate in '95 was not known (`?` was the appearance in the respective column). In order not to discard the entire region data, we'll simply assume that number to be the same as the matching record in '96.

In [174]:
from src.preprocess.districts import preprocess_districts
districts = preprocess_districts(districts)
districts.head()

Unnamed: 0,district_id,district_no_inhabitants,district_no_cities,district_urban_inhabitants_ratio,district_average_salary,district_unemployment_rate,districts_entrepreneurs_ratio,district_crimes_per_inhabitant,district_crime_growth,district_unemploymant_growth
0,1,1204953,1,100.0,12541,0.43,0.167,0.08225,1.156752,1.482759
1,2,88884,5,46.7,8507,1.85,0.132,0.030084,1.238536,1.107784
2,3,75232,5,41.7,8980,2.21,0.111,0.037391,0.996105,1.133333
3,4,149893,6,67.4,9753,5.05,0.109,0.039308,1.12357,1.088362
4,5,95616,6,51.4,9307,4.43,0.118,0.031794,1.16208,1.150649


#### Processing loans

The loans relation references an account and the date of loan, which can be extracted. The duration seems to be in months, so let us improve the name of the column. The amounts seems to be in `Czech korunas`, the czech currency, as hinted by the list of regions; we'll convert it to euros for better visualization and understanding of the upcoming results.

In [175]:
from src.preprocess.loans import preprocess_loans
loans_comp = preprocess_loans(loans_comp)
loans_dev = preprocess_loans(loans_dev)
loans_dev.head()

Unnamed: 0,loan_id,account_id,amount,duration_months,monthly_payment,status,year,month,day
0,5314,1787,3950.66,12,329.22,-1,93,7,5
1,5316,1801,6801.64,36,188.93,1,93,7,11
2,6863,9188,5208.2,60,86.8,1,93,7,28
3,5325,1843,4336.23,36,120.45,1,93,8,3
4,7240,11013,11259.84,60,187.66,1,93,9,6


#### Processing transactions

This table contains a lot of information that we cannot directly relate to a loan.

Let us first extract the date and make the withdrawal amounts actually negative, to give the model an impression of the direction of the money flow in and out of the bank.

For now, let us extract some statistical data related to the accounts, which we'll relate to the loans table.

In [176]:
from src.preprocess.transactions import preprocess_transactions
transactions = preprocess_transactions(transactions)
transactions.head()

Unnamed: 0,trans_id,account_id,type,operation,amount,balance,k_symbol,bank,account,year,month,day
0,1548749,5270,credit,credit in cash,32.79,32.79,,,,93,1,13
1,1548750,5270,credit,collection from another bank,1833.98,1866.76,,IJ,80269753.0,93,1,14
2,3393738,11265,credit,credit in cash,40.98,40.98,,,,93,1,14
3,3122924,10364,credit,credit in cash,45.08,45.08,,,,93,1,17
4,1121963,3834,credit,credit in cash,28.69,28.69,,,,93,1,19


#### Feature engineering

Text

##### Getting the average account balance based on the transactions

In [177]:
from src.feature_engineering.merge import merge_account_transactions
accounts = merge_account_transactions(accounts, transactions)
accounts.head()

Unnamed: 0,account_id,district_id,account_frequency,account_year,account_month,account_day,account_average_balance
0,576,55,monthly issuance,93,1,1,995.213406
1,3818,74,monthly issuance,93,1,1,1755.200383
2,704,55,monthly issuance,93,1,1,1424.108906
3,2378,16,monthly issuance,93,1,1,2376.927362
4,2632,24,monthly issuance,93,1,2,1326.053913


#### Merging the data

The model will be working on the `loans` table, since the target variable belongs to it. The one foreign key of that table is `account_id`; so, to feed the model with more information about the context of the loan, we should relate concepts to the account for each table.

##### Relating dispositions to clients

In [178]:
import src.feature_engineering.merge as merge

dispositions = merge.merge_dispositions_clients(dispositions, clients)
dispositions.head()

Unnamed: 0,disp_id,account_id,type,number_account_dispositions,client_district_id,client_year,client_month,client_day,client_gender
0,1,1,OWNER,1,18,70,12,13,female
1,2,2,OWNER,2,1,45,2,4,male
2,3,2,DISPONENT,2,1,40,10,9,female
3,4,3,OWNER,2,5,56,12,1,male
4,5,3,DISPONENT,2,5,60,7,3,female


##### Relating owner dispositions to accounts

Only owners can ask for loans. This way, we can directly relate an account to its owner client.

In [179]:
accounts = merge.merge_account_dispositions(accounts, dispositions)
accounts.head(50)

Unnamed: 0,account_id,district_id,account_frequency,account_year,account_month,account_day,account_average_balance,number_account_dispositions,owner_year,owner_month,owner_day,owner_gender
0,576,55,monthly issuance,93,1,1,995.213406,2,36,1,11,female
1,3818,74,monthly issuance,93,1,1,1755.200383,2,35,4,2,male
2,704,55,monthly issuance,93,1,1,1424.108906,2,45,1,14,male
3,2378,16,monthly issuance,93,1,1,2376.927362,1,75,3,24,female
4,2632,24,monthly issuance,93,1,2,1326.053913,1,38,8,12,male
5,1972,77,monthly issuance,93,1,2,922.834958,1,18,7,14,male
6,1539,1,issuance after transaction,93,1,3,1042.995251,1,42,6,11,female
7,793,47,monthly issuance,93,1,3,978.532416,2,65,4,12,female
8,2484,74,monthly issuance,93,1,3,1305.771514,1,79,3,24,female
9,1695,76,monthly issuance,93,1,3,2152.34036,1,71,3,2,male


##### Relating cards to accounts

In [180]:
cards = merge.merge_card_account(cards, dispositions)
cards.head()

Unnamed: 0,card_id,disp_id,card_type,card_year,card_month,card_day,account_id
0,1005,9285,classic,93,11,7,7753
1,104,588,classic,94,1,19,489
2,747,4915,classic,94,2,5,4078
3,70,439,classic,94,2,8,361
4,577,3687,classic,94,2,15,3050


##### Relating accounts with district data

Text

In [181]:
accounts = accounts.merge(districts, on="district_id").drop(axis=1, columns=["district_id"])
accounts.head()

Unnamed: 0,account_id,account_frequency,account_year,account_month,account_day,account_average_balance,number_account_dispositions,owner_year,owner_month,owner_day,owner_gender,district_no_inhabitants,district_no_cities,district_urban_inhabitants_ratio,district_average_salary,district_unemployment_rate,districts_entrepreneurs_ratio,district_crimes_per_inhabitant,district_crime_growth,district_unemploymant_growth
0,576,monthly issuance,93,1,1,995.213406,2,36,1,11,female,157042,9,33.9,8743,2.43,0.111,0.024796,1.064225,1.292553
1,704,monthly issuance,93,1,1,1424.108906,2,45,1,14,male,157042,9,33.9,8743,2.43,0.111,0.024796,1.064225,1.292553
2,192,monthly issuance,93,1,8,1019.869554,1,21,6,17,female,157042,9,33.9,8743,2.43,0.111,0.024796,1.064225,1.292553
3,10364,monthly issuance,93,1,17,1404.598889,2,60,8,20,male,157042,9,33.9,8743,2.43,0.111,0.024796,1.064225,1.292553
4,497,monthly issuance,93,4,15,2244.168398,1,43,2,22,male,157042,9,33.9,8743,2.43,0.111,0.024796,1.064225,1.292553


##### Loading the loans table with data

In [182]:
accounts.rename(columns = {'year':'account_year', 'month':'account_month', 'day':'account_day', 'frequency': 'account_frequency', 'district_id': 'account_district_id'}, inplace=True)
loans_dev = loans_dev.merge(accounts, on="account_id", how="left")
loans_comp = loans_comp.merge(accounts, on="account_id", how="left")
loans_dev.head()

# merge with disp with card
# merge disp with account
# merge stuff

Unnamed: 0,loan_id,account_id,amount,duration_months,monthly_payment,status,year,month,day,account_frequency,...,owner_gender,district_no_inhabitants,district_no_cities,district_urban_inhabitants_ratio,district_average_salary,district_unemployment_rate,districts_entrepreneurs_ratio,district_crimes_per_inhabitant,district_crime_growth,district_unemploymant_growth
0,5314,1787,3950.66,12,329.22,-1,93,7,5,weekly issuance,...,female,94812,10,81.8,9650,3.67,0.1,0.029574,0.939363,1.085799
1,5316,1801,6801.64,36,188.93,1,93,7,11,monthly issuance,...,male,112709,10,73.5,8369,2.31,0.117,0.023228,0.917309,1.290503
2,6863,9188,5208.2,60,86.8,1,93,7,28,monthly issuance,...,male,77917,5,53.5,8390,2.89,0.132,0.027234,1.020192,1.267544
3,5325,1843,4336.23,36,120.45,1,93,8,3,monthly issuance,...,female,107870,6,58.0,8754,4.31,0.137,0.035858,1.016824,1.125326
4,7240,11013,11259.84,60,187.66,1,93,9,6,weekly issuance,...,male,1204953,1,100.0,12541,0.43,0.167,0.08225,1.156752,1.482759


##### Replace account data with ranges

In [183]:
import src.preprocess.util as utils
#loans_dev = utils.account_date_to_levels(loans_dev)
#loans_comp = utils.account_date_to_levels(loans_comp)
#loans_dev.head()

##### Feature selection on the loans table

In [184]:
#loans_dev = loans_dev.drop(columns=["account_id","district_no_cities", "district_crime_growth"])
#loans_comp = loans_comp.drop(columns=["account_id","district_no_cities", "district_crime_growth"])

loans_dev = loans_dev.drop(columns=["account_id", "district_no_cities"])
loans_comp = loans_comp.drop(columns=["account_id", "district_no_cities"])

loans_dev.head(10)

Unnamed: 0,loan_id,amount,duration_months,monthly_payment,status,year,month,day,account_frequency,account_year,...,owner_day,owner_gender,district_no_inhabitants,district_urban_inhabitants_ratio,district_average_salary,district_unemployment_rate,districts_entrepreneurs_ratio,district_crimes_per_inhabitant,district_crime_growth,district_unemploymant_growth
0,5314,3950.66,12,329.22,-1,93,7,5,weekly issuance,93,...,22,female,94812,81.8,9650,3.67,0.1,0.029574,0.939363,1.085799
1,5316,6801.64,36,188.93,1,93,7,11,monthly issuance,93,...,22,male,112709,73.5,8369,2.31,0.117,0.023228,0.917309,1.290503
2,6863,5208.2,60,86.8,1,93,7,28,monthly issuance,93,...,2,male,77917,53.5,8390,2.89,0.132,0.027234,1.020192,1.267544
3,5325,4336.23,36,120.45,1,93,8,3,monthly issuance,93,...,20,female,107870,58.0,8754,4.31,0.137,0.035858,1.016824,1.125326
4,7240,11259.84,60,187.66,1,93,9,6,weekly issuance,93,...,7,male,1204953,100.0,12541,0.43,0.167,0.08225,1.156752,1.482759
5,6687,3600.0,24,150.0,1,93,9,13,monthly issuance,93,...,16,female,53921,41.3,8598,3.26,0.123,0.034773,1.174076,1.176895
6,7284,2163.44,12,180.29,1,93,9,15,monthly issuance,93,...,16,male,58796,51.9,9045,3.6,0.124,0.031958,1.018428,1.15016
7,6111,7161.64,24,298.4,-1,93,9,24,monthly issuance,93,...,12,female,122603,80.0,8991,2.01,0.128,0.043009,1.014429,1.446043
8,7235,6328.52,48,131.84,1,93,10,13,weekly issuance,93,...,25,female,70699,65.3,8968,3.35,0.131,0.027016,1.097701,1.183746
9,5997,4796.07,24,199.84,1,93,11,4,monthly issuance,93,...,3,male,177686,74.8,10045,1.71,0.135,0.035428,0.95321,1.204225


### Training the model

Text

In [185]:
for col in loans_dev.columns:
    print(col)

print(len(loans_dev.columns) == len(loans_comp.columns))

loans_dev = utils.extract_categorical(loans_dev, "account_frequency")
loans_comp = utils.extract_categorical(loans_comp, "account_frequency")
loans_dev = utils.extract_categorical(loans_dev, "owner_gender")
loans_comp = utils.extract_categorical(loans_comp, "owner_gender")

loan_id
amount
duration_months
monthly_payment
status
year
month
day
account_frequency
account_year
account_month
account_day
account_average_balance
number_account_dispositions
owner_year
owner_month
owner_day
owner_gender
district_no_inhabitants
district_urban_inhabitants_ratio
district_average_salary
district_unemployment_rate
districts_entrepreneurs_ratio
district_crimes_per_inhabitant
district_crime_growth
district_unemploymant_growth
True


#### Splitting the data for training

Text

In [186]:
from src.model.model import train_test_split_unbalanced
print(loans_dev.shape)
x_train, x_test, y_train, y_test = train_test_split_unbalanced(loans_dev, "status", sampling_strategy="smote", sort_by_date=False, train_ratio=0.8)
print(x_train.shape)

(328, 29)
(451, 28)


#### Predicting and evaluating

Evaluate classification model performance with loans table

code the categorical feature into multiple binary features to allow the model to work!! (TBD)

In [187]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, roc_auc_score

# Create a Random Forest Model
random_classifier = RandomForestClassifier(n_estimators=200, random_state=0)

# Train it with data
random_classifier.fit(x_train, y_train)

if len(y_test) > 0: # not training with entire dataset
    # Predict the test data for dev and evaluate
    predictions_dev = random_classifier.predict(x_test)
    predictions_dev_prob = random_classifier.predict_proba(x_test)

    # Evaluate
    print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
    print('Accuracy: ', accuracy_score(y_test, predictions_dev))
    print('Precision: ', precision_score(y_test, predictions_dev))
    print('F-Score: ', f1_score(y_test, predictions_dev))
    print("AUROC: ", roc_auc_score(y_test, predictions_dev))

    # Write the dev predictions to a file
    predictions_dev_id = x_test["loan_id"]
    predictions_dev_debt = [x[0] for x in predictions_dev_prob]
    predictions_df_dev = pd.DataFrame({"Id": predictions_dev_id, "Predicted": predictions_dev_debt, "Actual": y_test})
    predictions_df_dev.to_csv("output/predictions_dev.csv", index=False)

# Predict the competition data with probabilities
loans_comp_indep = loans_comp.drop(axis=1, columns=["status"])
predictions_comp_prob = random_classifier.predict_proba(loans_comp_indep).tolist()
predictions_comp = random_classifier.predict(loans_comp_indep)

# Write the comp predictions to a file
predictions_comp_debt = [x[0] for x in predictions_comp_prob]
predictions_comp_id = loans_comp["loan_id"].tolist()
predictions_df_comp = pd.DataFrame({"Id": predictions_comp_id, "Predicted": predictions_comp_debt})
predictions_df_comp.to_csv("output/predictions_comp.csv", index=False)

Recall:  0.953125
Accuracy:  0.9469026548672567
Precision:  0.9387755102040817
F-Score:  0.9387755102040817
AUROC:  0.9459502551020409


### Experimenting with other models

In [188]:
# Fit Naive Bayes to the training set

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predict test set results
predictions_dev = classifier.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.71875
Accuracy:  0.6902654867256637
Precision:  0.64
F-Score:  0.6464646464646464
AUROC:  0.6859056122448979


In [189]:
# SVM

from sklearn.svm import SVC
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score


classifier = SVC()
classifier.fit(x_train, y_train)
predictions_dev = classifier.predict(x_test)

# Check precision and recall (in this case it's better), f-measure, auc
print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.125
Accuracy:  0.4247787610619469
Precision:  0.4166666666666667
F-Score:  0.5517241379310345
AUROC:  0.47066326530612246


In [190]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(x_train, y_train)
predictions_dev = classifier.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.6875
Accuracy:  0.7168141592920354
Precision:  0.6491228070175439
F-Score:  0.6981132075471698
AUROC:  0.7213010204081631


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [191]:
from sklearn.neural_network import MLPClassifier

classifier = MLPClassifier()
classifier.fit(x_train, y_train)
predictions_dev = classifier.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.421875
Accuracy:  0.4424778761061947
Precision:  0.38333333333333336
F-Score:  0.42201834862385323
AUROC:  0.44563137755102045


In [192]:
from sklearn.neural_network import BernoulliRBM
lassifier = BernoulliRBM()
classifier.fit(x_train, y_train)
predictions_dev = classifier.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.359375
Accuracy:  0.415929203539823
Precision:  0.36923076923076925
F-Score:  0.4210526315789474
AUROC:  0.4245854591836735


In [193]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)
predictions_dev = classifier.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.859375
Accuracy:  0.7876106194690266
Precision:  0.7906976744186046
F-Score:  0.7391304347826086
AUROC:  0.776626275510204


In [194]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(x_train, y_train)
predictions_dev = classifier.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

Recall:  0.1875
Accuracy:  0.4336283185840708
Precision:  0.4157303370786517
F-Score:  0.536231884057971
AUROC:  0.47130102040816324


In [195]:
classifiers = [ ("cfl1",LogisticRegression()),("cfl2",SVC()),("cfl3",GaussianNB()),("cfl4",MultinomialNB()),("cfl5",MLPClassifier())]

from sklearn.ensemble import VotingClassifier

v_c = VotingClassifier(estimators= classifiers, voting="hard",n_jobs = -1)
v_c = v_c.fit(x_train,y_train)
predictions_dev = v_c.predict(x_test)

print('Recall: ', recall_score(y_test, predictions_dev, pos_label=-1))
print('Accuracy: ', accuracy_score(y_test, predictions_dev))
print('Precision: ', precision_score(y_test, predictions_dev))
print('F-Score: ', f1_score(y_test, predictions_dev))
print("AUROC: ", roc_auc_score(y_test, predictions_dev))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Recall:  0.1875
Accuracy:  0.4778761061946903
Precision:  0.44680851063829785
F-Score:  0.5874125874125874
AUROC:  0.5223214285714286
