# Assignment 4 - Data Set Description
The questions below relate to the data files for the 'DengAI: Predicting Disease Spread' contest, published at the following website:
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/data/

Anyone can join the contest. For details on contest submissions, visit the following webpage:
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/submissions/
Ranking near the top of the leaderboard is a way to showcase your machine learning skills.

Problem description:
Your goal is to predict the total_cases label for each (city, year, weekofyear) in the test set. There are two cities, San Juan and Iquitos, with test data for each city spanning 5 and 3 years respectively. You will make one submission that contains predictions for both cities. The data for each city have been concatenated along with a city column indicating the source: sj for San Juan and iq for Iquitos. The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data. Throughout, missing values have been filled as NaNs.

Assignment:
The goal is achieved through four successive assignments (1, 2, 3 and 4), all using the same dataset.


The features in this dataset
You are provided the following set of information on a (year, weekofyear) timescale:

(Where appropriate, units are provided as a _unit suffix on the feature name.)

City and date indicators

    city – City abbreviations: sj for San Juan and iq for Iquitos
    week_start_date – Date given in yyyy-mm-dd format

NOAA's GHCN daily climate data weather station measurements

    station_max_temp_c – Maximum temperature
    station_min_temp_c – Minimum temperature
    station_avg_temp_c – Average temperature
    station_precip_mm – Total precipitation
    station_diur_temp_rng_c – Diurnal temperature range
    
PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)

    precipitation_amt_mm – Total precipitation

NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)

    reanalysis_sat_precip_amt_mm – Total precipitation
    reanalysis_dew_point_temp_k – Mean dew point temperature
    reanalysis_air_temp_k – Mean air temperature
    reanalysis_relative_humidity_percent – Mean relative humidity
    reanalysis_specific_humidity_g_per_kg – Mean specific humidity
    reanalysis_precip_amt_kg_per_m2 – Total precipitation
    reanalysis_max_air_temp_k – Maximum air temperature
    reanalysis_min_air_temp_k – Minimum air temperature
    reanalysis_avg_temp_k – Average air temperature
    reanalysis_tdtr_k – Diurnal temperature range

Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements

    ndvi_se – Pixel southeast of city centroid
    ndvi_sw – Pixel southwest of city centroid
    ndvi_ne – Pixel northeast of city centroid
    ndvi_nw – Pixel northwest of city centroid

# Assignment 4 - Questions
Use the merged data frame from Assignments 1, 2 and 3 for this assignment.

This assignment focuses on data preprocessing and model building. Continue with the datasets loaded in Assignments 1, 2 and 3 (or reload them with the same steps and recreate the merged data frame). In this assignment you need to use a neural network.

1. Load the data (both features and label data set as before)
2. Preprocess the data - briefly comment if any special preprocessing is adopted to suit Neural Network
3. Optional: Build a Neural Network Multi-Layer Perceptron Regressor model (you can use sklearn neural network MLP Regressor)
4. Optional: Evaluate the model and compare it with the previous three assignments
5. Add a new column called 'above_average' with value 1 or 0. 1 if the total_cases > median of total_cases
6. Build a Neural Network MLP Classifier on the 'above_average' column with 80/20 train/test split
7. Explain the meaning of Precision, Recall and F1-Score and why these are used to evaluate Classification models (instead of using Accuracy as a metric). Evaluate the classifier using Precision, Recall and F1 score values

Submit the .ipynb, and .html (optional submission.csv if you performed MLP regressor)


In [0]:
#Importing required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [0]:
np.random.seed(67354724)

In [0]:
df = pd.read_csv('dengue_features_train.csv')

In [108]:
#Renaming columns to abbreviate & changing category
df.columns = df.columns.str.replace("station", "stn")
df.columns = df.columns.str.replace("reanalysis", "re_an")
df.columns = df.columns.str.replace("humidity", "hd")
df.columns = df.columns.str.replace("precipitation", "prec")
print(df.columns)
type(df.columns)
df.year = df.year.astype('category')
df.year.dtype

Index([u'city', u'year', u'weekofyear', u'week_start_date', u'ndvi_ne',
       u'ndvi_nw', u'ndvi_se', u'ndvi_sw', u'prec_amt_mm', u're_an_air_temp_k',
       u're_an_avg_temp_k', u're_an_dew_point_temp_k', u're_an_max_air_temp_k',
       u're_an_min_air_temp_k', u're_an_precip_amt_kg_per_m2',
       u're_an_relative_hd_percent', u're_an_sat_precip_amt_mm',
       u're_an_specific_hd_g_per_kg', u're_an_tdtr_k', u'stn_avg_temp_c',
       u'stn_diur_temp_rng_c', u'stn_max_temp_c', u'stn_min_temp_c',
       u'stn_precip_mm'],
      dtype='object')


CategoricalDtype(categories=[1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
                  2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
                  2010],
                 ordered=False)

In [0]:
df_label = pd.read_csv('dengue_labels_train.csv')

In [0]:
df_merge = pd.merge(df,df_label,how='inner',on=['city','year','weekofyear'])

Dropping reanalysis_sat_precip_amt_mm (total precipitation), since it repeats the total precipitation already provided by precipitation_amt_mm (prec_amt_mm after renaming)

In [0]:
df_merge = df_merge.drop(['re_an_sat_precip_amt_mm'], axis=1)

Preprocess the data - briefly comment if any special preprocessing is adopted to suit Neural Network

In [0]:
df_merge.loc[df_merge.ndvi_ne <= 0.25, 'ndvi_ne'] = 0
df_merge.loc[df_merge.ndvi_ne > 0.25, 'ndvi_ne'] = 1

In [0]:
df_merge.loc[df_merge.ndvi_nw <= 0.25, 'ndvi_nw'] = 0
df_merge.loc[df_merge.ndvi_nw > 0.25, 'ndvi_nw'] = 1

In [0]:
df_merge.loc[df_merge.ndvi_se <= 0.25, 'ndvi_se'] = 0
df_merge.loc[df_merge.ndvi_se > 0.25, 'ndvi_se'] = 1

In [0]:
df_merge.loc[df_merge.ndvi_sw <= 0.25, 'ndvi_sw'] = 0
df_merge.loc[df_merge.ndvi_sw > 0.25, 'ndvi_sw'] = 1

In [0]:
#Converting year to object dtype so that it is picked up as a categorical column below
df_merge.year = df_merge.year.astype('object')

In [0]:
# Categorical boolean mask
categorical_feature_mask = df_merge.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = df_merge.columns[categorical_feature_mask].tolist()

In [0]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

In [0]:
# apply le on categorical feature columns
df_merge[categorical_cols] = df_merge[categorical_cols].apply(lambda col: le.fit_transform(col))

In [0]:
#Taking target variable in a separate df, so as to not standardize it
df_total_cases = df_merge['total_cases']

In [0]:
#Dropping target variable
df_merge = df_merge.drop(['total_cases'],axis=1)

In [0]:
#Standardizing Numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [0]:
df_merge = scaler.fit_transform(df_merge)

In [0]:
df_merge = pd.DataFrame(df_merge)

In [125]:
pip install missingpy



Imputing missing values with KNN

In [0]:
#For imputing missing values
from missingpy import KNNImputer
imputer = KNNImputer()

In [0]:
df_imp = imputer.fit_transform(df_merge)

In [0]:
df_merge = pd.DataFrame(df_imp)

In [0]:
df_merge.columns = ['city',
 'year',
 'weekofyear',
 'week_start_date',
 'ndvi_ne',
 'ndvi_nw',
 'ndvi_se',
 'ndvi_sw',
 'prec_amt_mm',
 're_an_air_temp_k',
 're_an_avg_temp_k',
 're_an_dew_point_temp_k',
 're_an_max_air_temp_k',
 're_an_min_air_temp_k',
 're_an_precip_amt_kg_per_m2',
 're_an_relative_hd_percent',
 're_an_specific_hd_g_per_kg',
 're_an_tdtr_k',
 'stn_avg_temp_c',
 'stn_diur_temp_rng_c',
 'stn_max_temp_c',
 'stn_min_temp_c',
 'stn_precip_mm']

In [130]:
#Count of missing values
df_merge.isna().sum().sum()

0

Optional: Build a Neural Network Multi-Layer Perceptron Regressor model (you can use sklearn neural network MLP Regressor)

In [0]:
from sklearn.neural_network import MLPRegressor
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,50), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10)

In [0]:
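#Binning total_cases with 50 bin edges between 0 and 40 so that the split can be stratified
#(values of 40 or more all fall into the last bin)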
bins = np.linspace(0, 40, 50)

total_cases_bin = np.digitize(df_total_cases, bins)

In [0]:
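#Stratified split on the binned total cases.
#Note: the binned values (total_cases_bin) are also used as the regression target here,
#so the metrics below are measured in bin indices rather than raw case counts.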
X_train, X_test, y_train, y_test = train_test_split(df_merge,total_cases_bin,test_size=0.20,stratify=total_cases_bin,shuffle=True)

In [179]:
#Fitting the MLP Reg model 
mlp_reg.fit(X_train,y_train)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 50), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [0]:
pred = mlp_reg.predict(X_test)

In [181]:
print("MSE:")
print(mean_squared_error(y_test,pred))

MSE:
140.01206575089918


In [182]:
print("R squared:")
print(r2_score(y_test,pred))

R squared:
0.5195151270293776


In [183]:
print("MAE:")
print(mean_absolute_error(y_test,pred))

MAE:
9.194467044305112


Doing a Grid Search CV to find best hyperparameters.

In [0]:
from sklearn.model_selection import GridSearchCV

In [191]:
parameters = {"hidden_layer_sizes": [(150),(150,50),(100,50),(200)], "activation": ["identity", "logistic", "tanh", "relu"], "solver": ["lbfgs", "sgd", "adam"], "alpha": [0.00005,0.0005]}
grid = GridSearchCV(MLPRegressor(), parameters, refit = True, verbose = 3)
grid.fit(X_train, y_train)
print(grid)

Fitting 3 folds for each of 96 candidates, totalling 288 fits
[CV] alpha=5e-05, activation=identity, solver=lbfgs, hidden_layer_sizes=150 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  alpha=5e-05, activation=identity, solver=lbfgs, hidden_layer_sizes=150, score=0.374073735448, total=   0.5s
[CV] alpha=5e-05, activation=identity, solver=lbfgs, hidden_layer_sizes=150 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV]  alpha=5e-05, activation=identity, solver=lbfgs, hidden_layer_sizes=150, score=0.39566595699, total=   0.5s
[CV] alpha=5e-05, activation=identity, solver=lbfgs, hidden_layer_sizes=150 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


[CV]  alpha=5e-05, activation=identity, solver=lbfgs, hidden_layer_sizes=150, score=0.352677768312, total=   0.5s
[CV] alpha=5e-05, activation=identity, solver=sgd, hidden_layer_sizes=150 
[CV]  alpha=5e-05, activation=identity, solver=sgd, hidden_layer_sizes=150, score=0.361507374121, total=   0.2s
[CV] alpha=5e-05, activation=identity, solver=sgd, hidden_layer_sizes=150 
[CV]  alpha=5e-05, activation=identity, solver=sgd, hidden_layer_sizes=150, score=0.380688040288, total=   0.2s
[CV] alpha=5e-05, activation=identity, solver=sgd, hidden_layer_sizes=150 
[CV]  alpha=5e-05, activation=identity, solver=sgd, hidden_layer_sizes=150, score=0.348195730517, total=   0.2s
[CV] alpha=5e-05, activation=identity, solver=adam, hidden_layer_sizes=150 
[CV]  alpha=5e-05, activation=identity, solver=adam, hidden_layer_sizes=150, score=0.362965170696, total=   0.6s
[CV] alpha=5e-05, activation=identity, solver=adam, hidden_layer_sizes=150 
[CV]  alpha=5e-05, activation=identity, solver=adam, hidden_

[Parallel(n_jobs=1)]: Done 288 out of 288 | elapsed:  6.7min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'alpha': [5e-05, 0.0005], 'activation': ['identity', 'logistic', 'tanh', 'relu'], 'solver': ['lbfgs', 'sgd', 'adam'], 'hidden_layer_sizes': [150, (150, 50), (100, 50), 200]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)
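
The log above reports 96 candidates and 288 fits. As a quick sanity check, the count follows directly from the parameter grid: one candidate per combination of values, each fitted once per fold. The sketch below uses only the standard library and copies the grid values from the cell above; `n_folds = 3` is taken from the log line, not set explicitly in the notebook.

```python
from itertools import product

# Same grid as in the GridSearchCV cell, written as plain Python values.
param_grid = {
    "hidden_layer_sizes": [150, (150, 50), (100, 50), 200],
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "alpha": [0.00005, 0.0005],
}

n_folds = 3  # number of folds reported in the log above

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)   # 4 * 4 * 3 * 2
n_fits = n_candidates * n_folds  # each candidate is fitted once per fold

print(n_candidates, n_fits)
```

Adding one more value to any list multiplies the number of fits, which is worth keeping in mind given that this search already took several minutes.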


In [192]:
# print best parameter after tuning 
print(grid.best_params_) 

{'alpha': 5e-05, 'activation': 'relu', 'solver': 'sgd', 'hidden_layer_sizes': 200}


In [193]:
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

MLPRegressor(activation='relu', alpha=5e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=200, learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='sgd', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)


In [0]:
grid_predictions = grid.predict(X_test) 

Checking metrics after tuning

In [195]:
print("MSE:")
print(mean_squared_error(y_test,grid_predictions))

MSE:
129.6882279675205


In [196]:
print("R squared:")
print(r2_score(y_test,grid_predictions))

R squared:
0.5549438442568009


In [197]:
print("MAE:")
print(mean_absolute_error(y_test,grid_predictions))

MAE:
8.960813338142088


After tuning, the MAE drops slightly from 9.19 to 8.96, the MSE drops from 140.0 to 129.7, and R squared rises from 0.52 to 0.55

Now, checking the predictions on the competition dataset

In [0]:
comp_test = pd.read_csv('dengue_features_test.csv')

Preprocessing the data like before

In [148]:
#Renaming columns to abbreviate & changing category
comp_test.columns = comp_test.columns.str.replace("station", "stn")
comp_test.columns = comp_test.columns.str.replace("reanalysis", "re_an")
comp_test.columns = comp_test.columns.str.replace("humidity", "hd")
comp_test.columns = comp_test.columns.str.replace("precipitation", "prec")
print(comp_test.columns)
type(comp_test.columns)
comp_test.year = comp_test.year.astype('category')
comp_test.year.dtype

Index([u'city', u'year', u'weekofyear', u'week_start_date', u'ndvi_ne',
       u'ndvi_nw', u'ndvi_se', u'ndvi_sw', u'prec_amt_mm', u're_an_air_temp_k',
       u're_an_avg_temp_k', u're_an_dew_point_temp_k', u're_an_max_air_temp_k',
       u're_an_min_air_temp_k', u're_an_precip_amt_kg_per_m2',
       u're_an_relative_hd_percent', u're_an_sat_precip_amt_mm',
       u're_an_specific_hd_g_per_kg', u're_an_tdtr_k', u'stn_avg_temp_c',
       u'stn_diur_temp_rng_c', u'stn_max_temp_c', u'stn_min_temp_c',
       u'stn_precip_mm'],
      dtype='object')


CategoricalDtype(categories=[2008, 2009, 2010, 2011, 2012, 2013], ordered=False)

In [0]:
comp_test = comp_test.drop(['re_an_sat_precip_amt_mm'], axis=1)

In [0]:
comp_test.loc[comp_test.ndvi_ne <= 0.25, 'ndvi_ne'] = 0
comp_test.loc[comp_test.ndvi_ne > 0.25, 'ndvi_ne'] = 1

In [0]:
comp_test.loc[comp_test.ndvi_nw <= 0.25, 'ndvi_nw'] = 0
comp_test.loc[comp_test.ndvi_nw > 0.25, 'ndvi_nw'] = 1

In [0]:
comp_test.loc[comp_test.ndvi_se <= 0.25, 'ndvi_se'] = 0
comp_test.loc[comp_test.ndvi_se > 0.25, 'ndvi_se'] = 1

In [0]:
comp_test.loc[comp_test.ndvi_sw <= 0.25, 'ndvi_sw'] = 0
comp_test.loc[comp_test.ndvi_sw > 0.25, 'ndvi_sw'] = 1

In [0]:
#Converting year to object dtype so that it is picked up as a categorical column below
comp_test.year = comp_test.year.astype('object')

In [0]:
# Categorical boolean mask
categorical_feature_mask = comp_test.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = comp_test.columns[categorical_feature_mask].tolist()

In [0]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

In [0]:
# apply le on categorical feature columns
comp_test[categorical_cols] = comp_test[categorical_cols].apply(lambda col: le.fit_transform(col))

In [0]:
#Standardizing Numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [0]:
comp_test = scaler.fit_transform(comp_test)

In [0]:
comp_test = pd.DataFrame(comp_test)

In [0]:
comp_test = imputer.fit_transform(comp_test)

In [0]:
comp_test = pd.DataFrame(comp_test)

In [0]:
comp_test.columns = ['city',
 'year',
 'weekofyear',
 'week_start_date',
 'ndvi_ne',
 'ndvi_nw',
 'ndvi_se',
 'ndvi_sw',
 'prec_amt_mm',
 're_an_air_temp_k',
 're_an_avg_temp_k',
 're_an_dew_point_temp_k',
 're_an_max_air_temp_k',
 're_an_min_air_temp_k',
 're_an_precip_amt_kg_per_m2',
 're_an_relative_hd_percent',
 're_an_specific_hd_g_per_kg',
 're_an_tdtr_k',
 'stn_avg_temp_c',
 'stn_diur_temp_rng_c',
 'stn_max_temp_c',
 'stn_min_temp_c',
 'stn_precip_mm']
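
One caveat about the cells above: they call fit_transform on the competition file, so the test features are standardized with their own mean and standard deviation, while the model was trained on features standardized with the training statistics. A common alternative is to compute the statistics on the training data only and reuse them for the test data. The sketch below shows that idea in plain Python so the arithmetic is visible; the helper names and the temperature values are hypothetical and not part of the notebook.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Return (mean, std) computed from the training values only."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant columns
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any set of values."""
    return [(v - mu) / sigma for v in values]

train_temps = [24.0, 26.0, 28.0, 30.0]  # hypothetical training column
test_temps = [27.0, 31.0]               # hypothetical test column

mu, sigma = fit_standardizer(train_temps)
train_scaled = standardize(train_temps, mu, sigma)
test_scaled = standardize(test_temps, mu, sigma)  # reuses the training mu and sigma

print(mu, sigma, test_scaled)
```

With scikit-learn, the same idea corresponds to calling transform (rather than fit_transform) on the test data with the scaler that was fitted on the training data.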

In [0]:
pred_comp = grid.predict(comp_test)

In [0]:
pred_comp = pd.DataFrame(pred_comp)

In [0]:
pred_comp.columns = ['total_cases']

In [0]:
# Now we will convert total cases into 'int64' type. 
pred_comp.total_cases = pred_comp.total_cases.astype('int64') 

In [92]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
#Writing the final predictions to a csv file for submission
pred_comp.to_csv(r'/content/drive/My Drive/deng_sub_mlp', index=False, header=True)

After submitting the csv, I got an MAE of 27.05 on the competition, which is better than my previous scores

Add a new column called 'above_average' with value 1 or 0. 1 if the total_cases > median of total_cases

In [50]:
#Taking median of total cases column
med_total_cases = np.median(df_total_cases)
print("Median of Total Cases: ", med_total_cases)

('Median of Total Cases: ', 12.0)


In [0]:
above_average = [] 
for value in df_total_cases: 
    # Note: >= also labels weeks exactly at the median as 1; the task statement uses a strict >
    if value >= med_total_cases: 
        above_average.append(1) 
    else: 
        above_average.append(0) 
       
df["above_average"] = above_average

In [0]:
# No renaming is needed here: df["above_average"] is a Series that is already named 'above_average'

In [0]:
y = df['above_average']
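
The task statement asks for total_cases > median, while the loop above uses >=. The two differ only for weeks whose count equals the median, which matters here because the median (12.0) is a value that actually occurs in count data. The sketch below illustrates the difference on a small hypothetical list of weekly counts; the values are made up for illustration and are not taken from the dataset.

```python
from statistics import median

total_cases = [4, 5, 12, 12, 30, 45]  # hypothetical weekly case counts
med = median(total_cases)             # average of the two middle values

# Strict comparison, as worded in the task statement.
strict = [1 if v > med else 0 for v in total_cases]

# Inclusive comparison, as used in the loop above.
inclusive = [1 if v >= med else 0 for v in total_cases]

print(med, strict, inclusive)
```

In this toy list the inclusive rule labels four of six weeks as 1 and the strict rule only two, so the choice shifts the class balance whenever the median is a repeated value.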

Build a Neural Network MLP Classifier on the 'above_average' column with 80/20 train/test split

In [0]:
#Splitting the data frame
X_train, X_test, y_train, y_test = train_test_split(df_merge,y,test_size=0.20,stratify=y,shuffle=True)

In [0]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10)

In [60]:
#Fitting the MLP Clf model 
clf.fit(X_train,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [0]:
pred_clf = clf.predict(X_test)

In [102]:
from sklearn.metrics import precision_score
print('Precision Score:')
precision_score(y_test, pred_clf, labels=None, pos_label=1, average='binary', sample_weight=None)

Precision Score:


0.7756410256410257

In [99]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred_clf, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False))

              precision    recall  f1-score   support

           0       0.80      0.76      0.78       144
           1       0.78      0.82      0.80       148

   micro avg       0.79      0.79      0.79       292
   macro avg       0.79      0.79      0.79       292
weighted avg       0.79      0.79      0.79       292

In [103]:
from sklearn.metrics import recall_score
print('Recall Score:')
recall_score(y_test, pred_clf, labels=None, pos_label=1, average='binary', sample_weight=None)

Recall Score:


0.8175675675675675

In [104]:
from sklearn.metrics import f1_score
print('F1 Score:')
f1_score(y_test, pred_clf, labels=None, pos_label=1, average='binary', sample_weight=None)

F1 Score:


0.7960526315789475

Recall Score: 0.82
F1 Score: 0.80
Precision Score: 0.78

Precision is the fraction of samples predicted as positive that are actually positive: TP / (TP + FP).
Recall is the fraction of actual positives that the model correctly labels as positive: TP / (TP + FN).
F1 Score is the harmonic mean of precision and recall: 2 * Precision * Recall / (Precision + Recall).

Precision, Recall and F1 Score are often better measures than accuracy because, on a skewed dataset, a model can reach a high accuracy simply by always predicting the majority class. Accuracy alone does not show how the errors split between false positives and false negatives.
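
To make the definitions concrete, the sketch below recomputes the three scores from confusion-matrix counts. The counts (TP=121, FP=35, FN=27, TN=109) are not printed anywhere in the notebook; they are the values implied by the reported scores together with the supports of 144 and 148 in the classification report, so treat them as a derived assumption. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = float(tp) / (tp + fp)  # of the predicted positives, how many are right
    recall = float(tp) / (tp + fn)     # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Counts implied by the reported scores and supports (derived, not printed by the notebook).
tp, fp, fn, tn = 121, 35, 27, 109

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = float(tp + tn) / (tp + fp + fn + tn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))

# Why accuracy can mislead on skewed data: with 95 negatives and 5 positives,
# a model that always predicts 0 is right on every negative and misses every positive.
skew_accuracy = float(0 + 95) / 100
skew_recall = float(0) / (0 + 5)
print(skew_accuracy, skew_recall)
```

Because the labels here come from a median split, the two classes are close to balanced, which is why accuracy (about 0.79) lands near the F1 score; the second part of the sketch shows how the two can diverge when one class dominates.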