# Binary Classification 

## Introduction
The purpose of this notebook is to create a classification model. Not much about the data is known. We are going to follow these steps iteratively:

1. Cleaning
2. Exploratory Data Analysis
3. Preprocessing
4. Feature Engineering
5. Modelling
6. Evaluation


### Cleaning and Exploration

Before we model we should first view the data set for inconsistencies and missing values. After observing the rows displayed below, we see that some rows have inconsistent data types and missing values.

We will proceed as follows:

1. Read in data
2. Add names to the unnamed columns.
3. Handle inconsistent data types, using **classlabel** as a benchmark.
4. Transform Data Types
5. Handle missing values accordingly.
6. Find and remove duplicates.
7. Search for outliers.


In [1]:
# import statements
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures, StandardScaler
from sklearn.impute import KNNImputer
from sklearn.metrics import log_loss, f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedStratifiedKFold, RandomizedSearchCV
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler



# Add column names
colnames =  ['v1','v2', 'v3', 'v4', 'v33', 'v76', 'v12', 'v68', 'v50', 'v7', 'v70', 'v55', 'v20', 'v24','v32', 'v97', 'v28', 'v99', 'v95', 'v42', 'v53', 'v85', 'v9', 'v84','v44', 'classlabel']
train = pd.read_csv("../input/apt-train/Training.csv", names =  colnames, header = None)
validate = pd.read_csv("../input/aptvalidate/Validation.csv", names =  colnames, header = None)
train = train.iloc[1:len(train.index)]
validate = validate.iloc[1:len(validate.index)]


In [2]:
train.head(5)

In [3]:
train.classlabel.isna().sum()

### Handling inconsistencies
Some of the columns have inconsistent data types. In many rows, values seem to have shifted into neighbouring columns. To handle this we will use the pandas ***shift*** method.

**Strategy**: We visually identify shifted rows by displaying rows where **classlabel** is missing. We then scan for patterns and write rules around them. One such pattern is ***shifting rows where v4 = 'f' or 't'***. It comes from the observation that ***v4*** should contain numeric values, so ***f*** and ***t*** belong to the neighbouring column.

We identify these problematic rows throughout the dataset and iteratively come up with rules to adjust them. We compose these transformations into a function called **clean**, which lets us repeat the process on the validation set with ease.
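As a minimal sketch of the shifting idea, the toy frame below has one row whose values sit one column too far to the left. The column names, the values and the helper name `shift_row_right` are hypothetical and only serve as an illustration; they are not taken from the data set.

```python
import pandas as pd


def shift_row_right(df, row_pos, start_col, fill_value="0"):
    """Shift one row's values one column to the right, starting at start_col."""
    j = df.columns.get_loc(start_col)
    shifted = df.iloc[row_pos, j:].shift(1, fill_value=fill_value)
    # Assign by position; to_numpy() sidesteps any index alignment.
    df.iloc[row_pos, j:] = shifted.to_numpy()
    return df


# Hypothetical toy data: in row 1 the flag 't' landed in the numeric column "b".
toy = pd.DataFrame(
    {"a": ["x", "y"], "b": ["1", "t"], "c": ["t", "5"], "d": ["5", None]},
    dtype="object",
)
toy = shift_row_right(toy, row_pos=1, start_col="b")
print(toy)
```

The last value of a shifted row falls off the end, so a rule like this is only safe when the final column of the affected rows is empty.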

In [4]:
def clean(dataset):
    # Row labels start at 1 because the header row was dropped after reading,
    # hence the i+1 lookups below.
    row_num, col = dataset.shape

    # shifting rows where v4 = 'f' or 't'
    j = dataset.columns.get_loc("v4")
    for i in range(row_num):
        if ((dataset.v4[i+1] == "f") | (dataset.v4[i+1] == "t")):
            dataset.iloc[i, j:] = dataset.iloc[i, j:].shift(1,fill_value = "0")

    # shifting rows where v12 has 'a' or 'b'
    j = dataset.columns.get_loc("v12")
    for i in range(row_num):
        if ((dataset.v12[i+1] == "a") | (dataset.v12[i+1] == "b")):
            dataset.iloc[i, j:] = dataset.iloc[i, j:].shift(1)

    # shifting rows where v70 is not '0'
    j = dataset.columns.get_loc("v70")
    for i in range(row_num):
        if (dataset.v70[i+1] != "0"):
            dataset.iloc[i, j:] = dataset.iloc[i, j:].shift(1)

    # shifting rows where v20 has 'u' or 'y'
    j = dataset.columns.get_loc("v20")
    for i in range(row_num):
        if ((dataset.v20[i+1] == "u") | (dataset.v20[i+1] == "y")):
            dataset.iloc[i, j:] = dataset.iloc[i, j:].shift(1)

    # shifting rows where v24 is not 'u', 'y' or 'l'
    j = dataset.columns.get_loc("v24")
    for i in range(row_num):
        if ((dataset.v24[i+1] != "u") & (dataset.v24[i+1] != "y") & (dataset.v24[i+1] != 'l')):
            dataset.iloc[i, j:] = dataset.iloc[i, j:].shift(1)

    # shifting rows where v97 has 'f' or 't'
    j = dataset.columns.get_loc("v97")
    for i in range(row_num):
        if ((dataset.v97[i+1] == "f") | (dataset.v97[i+1] == "t")):
            dataset.iloc[i, j:] = dataset.iloc[i, j:].shift(1, fill_value="0")

    # shifting only the v68..v70 block where v68 has 't' or 'f'
    start = dataset.columns.get_loc("v68")
    end = dataset.columns.get_loc("v70") + 1
    for i in range(row_num):
        if ((dataset.v68[i+1] == 't') | (dataset.v68[i+1] == 'f') ):
            dataset.iloc[i, start:end] = dataset.iloc[i, start:end].shift(1)
            
    return dataset


In [5]:
validate

In [6]:
train = clean(train)
validate = clean(validate)

In [7]:
#view results.
train.head()

Above we see the third and seventh row shifted towards the right.

In [8]:
#check for missing or null labels
train.classlabel.isna().sum()

In [9]:
validate.classlabel.isna().sum()

In [10]:
train.head()

Let's check whether the columns now have more consistent values by printing the unique values in each column.

In [11]:
validate.apply(set)

In [12]:
train.apply(set)

#### Dealing with formatting

I presume that some rows were initially skewed by a formatting error. My assumption is that decimals were represented by commas in the original data file, so each decimal number was split across two columns.

To correct this we concatenate those columns, separated by a period (**"."**).
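A minimal sketch of that repair on made-up values (the column names `whole` and `frac` are hypothetical): the two halves are joined with a period and the result is parsed as a float.

```python
import pandas as pd

# Hypothetical toy data: the number 17.92 was split into "17" and "92".
parts = pd.DataFrame({"whole": ["17", "3"], "frac": ["92", "5"]})

# Join the two halves with a period and parse the result as a float.
joined = (parts["whole"] + "." + parts["frac"]).astype(float)
print(joined.tolist())
```

Note that string concatenation propagates missing values, so a missing half yields a missing result unless it is filled first.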

In [13]:
def reformat(dataset):
    """
    This function reformats the numerical columns.
    
    Arguments: Dataframe
    """
    # Converting missing values in the v12, v4 and v3 columns to "0"
    dataset.v12 = dataset.v12.fillna("0")
    dataset.v4 = dataset.v4.fillna("0")
    dataset.v3 = dataset.v3.fillna("0")

    # concatenating the integer and fractional parts
    dataset["v5"] = dataset.v3 + "." + dataset.v4
    dataset["v8"] = dataset.v76 + "." + dataset.v12
    dataset["v91"] = dataset.v70 + "."+ dataset.v55
    dataset["v100"] = dataset.v32 + "." + dataset.v97

    # dropping the original columns
    dataset = dataset.drop(columns=['v3', 'v4', 'v76', 'v12', 'v70', 'v55', 'v32', 'v97'])
    
    return dataset

In [14]:
train = reformat(train)
validate = reformat(validate)

### Transform Data Types
Though we have achieved homogeneity in the dataset, every column is still of type "object", so we have to specify the types.

1. We pick out the numeric columns and convert them to floats.
2. We make the rest category types.
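Before casting, it can help to confirm that a column really is numeric. The sketch below is one possible check on made-up values, not the notebook's own method: `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN, which exposes values that would make `astype('float')` fail.

```python
import pandas as pd

# Hypothetical column with one stray flag and one genuinely missing value.
raw = pd.Series(["1.5", "2", "t", None], dtype="object")

# Unparseable strings become NaN instead of raising an error.
as_number = pd.to_numeric(raw, errors="coerce")

# Entries that were present but could not be parsed as numbers.
unparsed = raw[as_number.isna() & raw.notna()]
print(unparsed.tolist())
```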


In [15]:
train.info()

In [16]:
# Convert numeric columns to float (train)
numeric_columns = ['v5', "v8", 'v7', "v91", 'v20','v100', 'v53', 'v42']
train[numeric_columns] = train[numeric_columns].astype('float')

# Convert numeric columns to float (validate)
validate[numeric_columns] = validate[numeric_columns].astype('float')

In [17]:
#Convert object types to categories train
object_columns =  train.select_dtypes(include="object").columns.tolist()
train[object_columns] = train[object_columns].astype('category')

#Convert object types to categories validate
validate[object_columns] = validate[object_columns].astype('category')

In [18]:
train.info()

**Reformatting the classlabel column**

In [19]:
# Reformatting the classlabel column: "yes." becomes 1, everything else 0
train.classlabel = np.where(train.classlabel == "yes.",1, 0)
validate.classlabel = np.where(validate.classlabel == "yes.",1, 0)

In [20]:
train.head()

##### **Dealing with missingness**
Here, we deal with missing values in the dataset. We first identify the columns with missing values, then dig deeper to understand the nature of the missingness.

We examine rows with missing values and conclude one of two things:
1. Values are missing at random.
2. There is a pattern to the missingness.

The nature of the missingness determines the approach to handling it.
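One rough way to look for such a pattern is sketched below on made-up data (the column names are hypothetical): the share of missing values per column, plus the correlation between missingness indicators, which is high when two columns tend to be missing in the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" and "b" are missing in exactly the same rows.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["u", None, "y", None],
    "c": [1, 2, 3, 4],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Correlation between missingness indicators, restricted to columns with gaps.
indicators = toy.isna().loc[:, toy.isna().any()].astype(int)
co_missing = indicators.corr()
print(missing_share)
print(co_missing)
```

A high indicator correlation only suggests that the gaps share a cause; it does not by itself say which mechanism produced them.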

In [21]:
#display all columns that have missing values
missing_columns = train.columns.values[train.isnull().any()]
missing_columns


In [22]:
missing_columns = validate.columns.values[validate.isnull().any()]
missing_columns

##### **Notes**
Both the validation and training sets have missing values in similar columns, so we will treat them the same way.

In [23]:
#train[train.v68.isna()].iloc[20:50, :]
#train[train.v24.isna()].iloc[20:50, :]
#train[train.v99.isna()].iloc[10:50, 10:]
#train[train.v95.isna()].iloc[:, 10:]
train[train.v85.isna()].iloc[:, 10:]

# The commented-out lines above display rows with missing values in the other columns

##### **Notes**
* There seems to be a pattern in the first two columns with missing values (**v1** and **v68**). We can use KNN imputation for those columns.

* Column **v95** seems to be missing too much information, with about 2/3 of its values missing. I believe it should be dropped in its entirety.




In [24]:
train = train.drop(columns='v95')
validate = validate.drop(columns='v95')

#### Duplicate Values
Here we remove duplicated rows.

In [25]:
train = train.loc[~train.duplicated(), :]

### Exploratory Data Analysis


In [26]:
train[numeric_columns].describe()

##### Notes

* The columns seem to be on very different scales. We might need to scale the values while preprocessing.



In [27]:
# Compute the correlation matrix
corr = train[numeric_columns].corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 10, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, center=0, square=True, linewidths=.5,)

Above is a correlation plot. The hue represents the strength and sign of the correlation between a pair of columns. Only a few pairs are highly correlated (positively or inversely).

* That is, there is little linear dependence between most columns. Strong dependence between features can complicate modelling.

* v42 and v7 seem to be highly correlated.


In [28]:
sns.scatterplot(x="v42", y="v7", data=train)

The plot above shows that v42 and v7 have a perfectly linear relationship. Such redundancy can cause the model to needlessly overfit.

After displaying the values below, we see that ***v42*** and ***v7*** are merely multiples of each other, so we can remove one.
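A quick numeric check of the "multiples of each other" observation could look like the sketch below. The values and the column names `p` and `q` are made up; the idea is simply that a constant ratio between two columns means one is a scaled copy of the other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "p" is always 100 times "q".
toy = pd.DataFrame({"p": [100.0, 250.0, 40.0], "q": [1.0, 2.5, 0.4]})

# If the ratio is constant across rows, the columns carry the same information.
# Zeros in the denominator would need separate handling.
ratio = toy["p"] / toy["q"]
is_multiple = bool(np.allclose(ratio, ratio.iloc[0]))
print(is_multiple)
```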

In [29]:
train[["v42", 'v7']]

In [30]:
train = train.drop(columns='v42')
validate = validate.drop(columns='v42')

In [31]:
sns.scatterplot(x="v8", y="v5", data=train)

Numeric columns (as listed in `numeric_columns`): v5, v8, v7, v91, v20, v100, v53, v42

In [32]:
f, axes = plt.subplots(ncols=7, figsize=(20,4))

sns.boxplot(x="classlabel", y="v8", data=train, ax = axes[0])
sns.boxplot(x="classlabel", y="v7", data=train, ax = axes[1])
sns.boxplot(x="classlabel", y="v91", data=train, ax=  axes[2])
sns.boxplot(x="classlabel", y="v20", data=train,ax = axes[3])
sns.boxplot(x="classlabel", y="v100", data=train, ax = axes[4])
sns.boxplot(x="classlabel", y="v53", data=train, ax = axes[5])
sns.boxplot(x="classlabel", y="v5", data=train, ax =axes[6])


plt.show()

#### Notes
The plots above show the distribution of the numerical values for each class label.

* We observe that the difference between the per-class distributions varies from variable to variable. A big difference might mean that the variable has some predictive power.

* We also see outliers. There is no single way to determine whether a value is an outlier. Nevertheless, the interquartile range method illustrated by the plots above is widely used. The dots represent values beyond the whiskers.

* It would pay to look closer at these distributions.
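The interquartile range rule mentioned above can be sketched as follows. The helper name `iqr_outliers` and the sample values are hypothetical; the multiplier of 1.5 is the conventional default that box plots use for their whiskers.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical toy data with one extreme value.
toy = pd.Series([1, 2, 3, 4, 100])
mask = iqr_outliers(toy)
print(toy[mask].tolist())
```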

In [33]:
# one axis per numeric column
f, axes = plt.subplots(ncols=7, figsize=(20,4))

sns.histplot(x=train.v5, ax = axes[0])
sns.histplot(x=train.v8, ax = axes[1])
sns.histplot(x=train.v7, ax = axes[2])
sns.histplot(x=train.v91, ax=  axes[3])
sns.histplot(x=train.v20,ax = axes[4])
sns.histplot(x=train.v100, ax = axes[5])
sns.histplot(x=train.v53, ax = axes[6])


plt.show()


In [34]:
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(x=train.v91, kde=True)

#### Notes
Most of the data distributions are skewed. Some models perform better with more symmetric distributions, so we can apply transformations to them.
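One common transformation for right-skewed, non-negative values is the logarithm. The sketch below uses made-up values and `np.log1p` as an assumed choice; the notebook itself does not commit to a particular transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values.
toy = pd.Series([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)

# Compare sample skewness before and after a log(1 + x) transform.
skew_before = toy.skew()
skew_after = np.log1p(toy).skew()
print(skew_before, skew_after)
```

`log1p` is only defined for values greater than -1, so columns with negative values would need a different transform.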

## Explore Categorical Columns

Here we plot the distribution of the class labels across the categories of each variable. **Less homogeneity might indicate higher predictive power**, as those columns are likely to differentiate the data more.
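The same per-category class shares can be read off a table instead of a plot. This is a small sketch on made-up data using `pd.crosstab`; the column names `cat` and `label` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: category "a" leans towards class 1, "b" is evenly split.
toy = pd.DataFrame({
    "cat": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "label": [1, 1, 1, 0, 0, 1, 0, 1],
})

# Each row sums to 1: the share of every class within one category.
share = pd.crosstab(toy["cat"], toy["label"], normalize="index")
print(share)
```

Categories whose rows look alike are unlikely to help separate the classes, while rows that differ strongly point to a more informative variable.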




In [35]:
f, axes = plt.subplots(ncols=3, figsize=(20,4))

#v1 column
class_by_v1 = train.groupby(['v1'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v1", y="percentage", hue="classlabel", data=class_by_v1, ax = axes[0])

#v2
class_by_v2 = train.groupby(['v2'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v2", y="percentage", hue="classlabel", data=class_by_v2, ax = axes[1])

class_by_v33 = train.groupby(['v33'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v33", y="percentage", hue="classlabel", data=class_by_v33, ax = axes[2])


plt.show()

In [36]:
f, axes = plt.subplots(ncols=3, figsize=(20,4))

class_by_v68 = train.groupby(['v68'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v68", y="percentage", hue="classlabel", data=class_by_v68, ax = axes[0])

class_by_v50 = train.groupby(['v50'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v50", y="percentage", hue="classlabel", data=class_by_v50, ax = axes[1])

#v24 column
class_by_v24 = train.groupby(['v24'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v24", y="percentage", hue="classlabel", data=class_by_v24, ax = axes[2])


plt.show()

In [37]:
f, axes = plt.subplots(ncols=3, figsize=(20,4))


#v9 column
class_by_v9 = train.groupby(['v9'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v9", y="percentage", hue="classlabel", data=class_by_v9, ax = axes[0])

class_by_v85 = train.groupby(['v85'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v85", y="percentage", hue="classlabel", data=class_by_v85, ax = axes[1])

class_by_v28= train.groupby(['v28'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v28", y="percentage", hue="classlabel", data=class_by_v28, ax = axes[2])



plt.show()


In [38]:
f, axes = plt.subplots(ncols=3, figsize=(20,4))

#v44 column
class_by_v44 = train.groupby(['v44'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v44", y="percentage", hue="classlabel", data=class_by_v44, ax = axes[0])

#v99 column
class_by_v99 = train.groupby(['v99'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v99", y="percentage", hue="classlabel", data=class_by_v99, ax = axes[1])

#v84 column
class_by_v84 = train.groupby(['v84'])["classlabel"].value_counts(normalize=True).rename('percentage').mul(100).reset_index()
sns.barplot(x="v84", y="percentage", hue="classlabel", data=class_by_v84, ax = axes[2])

plt.show()

##### **Notes**

Features v9 and v28 are unlikely to differentiate the classes, as the class proportions are similar across their categories.





### Preprocessing and Modelling

Here we finally apply a model to the cleaned data from our previous steps. However, it is very likely that we will need to perform additional preprocessing steps on the data:
1. Scaling
2. Feature engineering
3. ***Resampling***

Due to the imbalanced nature of the dataset, the model might overfit on training examples from the dominant class. To counter this we can perform a number of resampling operations. We will use the results of cross-validation to judge model performance.


**Feature Selection**: Since we know so little about the data, a manual feature selection would not be well informed. However, we can employ techniques like regularization, variable importance metrics or maybe even a dimensionality reduction technique.


#### Preprocessing
We will perform a number of operations before we finally tackle our model, following these steps:

1. Encode categorical variables.
2. Impute or drop missing values.
3. Scale numerical features.
4. Downsample or upsample.
5. Try SMOTE.
6. Create two baseline models, one linear and the other tree-based.

1. Imputing missing values

We determined earlier that our data doesn't seem to be ***missing at random***, especially in the first two columns. As such, we can estimate the missing values based on observed patterns. However, to use such imputation techniques we have to encode our categories as numbers.
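The encode-then-impute idea can be sketched on a toy frame as below. The column names and values are made up, and pandas category codes are used here instead of `LabelEncoder` purely to keep the example short; this is an illustration under those assumptions, not the notebook's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: the last row is missing its category.
toy = pd.DataFrame({
    "colour": ["red", "red", "blue", None],
    "size": [1.0, 1.1, 5.0, 5.1],
})

# Encode categories as numbers; pandas marks missing values with -1,
# so map that code back to NaN for the imputer.
as_cat = toy["colour"].astype("category")
codes = as_cat.cat.codes.astype(float).replace(-1.0, np.nan)
encoded = pd.DataFrame({"colour": codes, "size": toy["size"]})

# With one neighbour, the missing code is copied from the closest row.
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)

# Translate the imputed code back into its label.
filled_label = as_cat.cat.categories[int(round(imputed.loc[3, "colour"]))]
print(filled_label)
```

With more than one neighbour the imputer averages the neighbours' codes, which has no natural meaning for unordered categories, so the result would need rounding or a different strategy.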

In [39]:
#recoding non null labels as numbers
train_proc = train.apply(lambda series: pd.Series(LabelEncoder().fit_transform(series[series.notnull()]),index=series[series.notnull()].index))
train_proc 

In [40]:
#recoding non null labels as numbers
validate_proc = validate.apply(lambda series: pd.Series(LabelEncoder().fit_transform(series[series.notnull()]),index=series[series.notnull()].index))
validate_proc 

In [41]:
#Using the KNN method to impute missing values (fitted on the training set)
imputer = KNNImputer(n_neighbors=4)
columns = train_proc.columns.values
train_proc[columns] = imputer.fit_transform(train_proc)

#Impute the validation set with the imputer fitted on the training set
columns = validate_proc.columns.values
validate_proc[columns] = imputer.transform(validate_proc)

In [42]:
#display all columns that have missing values
missing_columns = train_proc.columns.values[train_proc.isnull().any()]
missing_columns

missing_columns_val = validate_proc.columns.values[validate_proc.isnull().any()]
missing_columns_val

In [43]:
#Convert Category Columns back to categories.
cat_columns = train_proc.columns.values[train.dtypes == 'category']
train_proc[cat_columns] = train_proc[cat_columns].astype("int").astype("category")
train_proc["classlabel"] = train_proc["classlabel"].astype("int").astype("category")

cat_columns = validate_proc.columns.values[train.dtypes == 'category']
validate_proc[cat_columns] = validate_proc[cat_columns].astype("int").astype("category")
validate_proc["classlabel"] = validate_proc["classlabel"].astype("int").astype("category")

#### Feature Engineering
* Here we rescale our numerical features. We shouldn't do this before splitting the dataset, as that would cause some information leakage.

* We transform them into polynomial features. This can help our model capture more complex relationships but can lead to overfitting.

* Label encoding is a widely used technique, but it can add error to our model in the sense that the arbitrary number ordering can be misinterpreted by some models. We can instead recode our categories using another widely used technique, one-hot encoding.

We should introduce the pipeline module, which makes the transformations more manageable and reusable on a validation set.
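As a rough sketch of how scaling and one-hot encoding can be bundled and reused, the example below uses scikit-learn's `ColumnTransformer` on made-up frames. The frames and column names are hypothetical; the point is that the transformer is fitted on the training data only and then applied unchanged to the validation data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the validation set contains an unseen category "l".
train_toy = pd.DataFrame({"num": [1.0, 2.0, 3.0], "cat": ["u", "y", "u"]})
valid_toy = pd.DataFrame({"num": [2.0, 10.0], "cat": ["y", "l"]})

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["num"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training data only, then reuse the fitted transformer.
train_ready = preprocess.fit_transform(train_toy)
valid_ready = preprocess.transform(valid_toy)
print(valid_ready)
```

Categories that were never seen during fitting are encoded as all zeros because of `handle_unknown="ignore"`.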



In [44]:
#get indexes for numerical columns
num_columns = train_proc.columns.values[train_proc.dtypes == 'float64']
num_columns

In [79]:
#Scaling ensures that all the columns are on the same scale/unit.
#Fit the scaler on the training set only, then reuse it on the validation set to avoid leakage.
scaler = StandardScaler()
train_proc[num_columns] = scaler.fit_transform(train_proc[num_columns])
validate_proc[num_columns] = scaler.transform(validate_proc[num_columns])

In [51]:
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
poly_fit = poly.fit_transform(train_proc[num_columns])

poly_frame = pd.DataFrame(poly_fit, columns=[f"poly1_{i}" for i in range(poly_fit.shape[1])])

#reset the row index for a smooth join
train_proc = train_proc.reset_index(drop=True)

#join (preview only; the result is not assigned back to train_proc)
pd.concat([train_proc, poly_frame], axis = 1)

In [47]:
#pd.merge(train_proc, poly_frame, how='inner', on=['v5:poly1_3'])

In [48]:
train_proc = train_proc.loc[~train_proc.duplicated(), :]

#### **Baseline Model**



In [None]:
# Extract Features and Target Columns
features = train_proc.columns.values[train_proc.columns.values != "classlabel"]
X_train = train_proc[features]
y_train = train_proc["classlabel"]

# Extract Features and Target Columns for the validation set
features = validate_proc.columns.values[validate_proc.columns.values != "classlabel"]
X_test = validate_proc[features]
y_test = validate_proc["classlabel"]

In [None]:
X_train

In [None]:
X_test.columns.values

### Handling Class Imbalance

The distribution of the classes is imbalanced, as illustrated below. To deal with this we can employ a number of techniques, such as downsampling or upsampling.
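To make the idea of downsampling concrete, here is a small sketch in plain pandas rather than with imbalanced-learn's `RandomUnderSampler`. The toy frame and its column names are made up; each class is randomly reduced to the size of the smallest class.

```python
import pandas as pd

# Hypothetical toy data: eight rows of class 0 and two rows of class 1.
toy = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})

# Size of the smallest class.
minority_size = int(toy["label"].value_counts().min())

# Draw the same number of rows from every class.
balanced = toy.groupby("label").sample(n=minority_size, random_state=0)
print(balanced["label"].value_counts())
```

Resampling should be applied to the training folds only, which is what placing the sampler inside a pipeline achieves during cross-validation.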

In [None]:
sns.countplot(x="classlabel", data=train_proc)

In [None]:
# define pipeline (sampling_strategy must be passed by keyword in recent imbalanced-learn releases)
steps = [('under', RandomUnderSampler(sampling_strategy=0.1)), ('model', LGBMClassifier())]
pipeline = Pipeline(steps=steps)

# Create the parameter grid
gbm_param_grid = {'model__learning_rate': np.arange(.05, 1, .05), 'model__max_depth': np.arange(3,20, 2),'model__n_estimators': np.arange(20, 100, 25), }

# cross-validation scheme used to evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator=pipeline, param_distributions=gbm_param_grid, n_iter=2, scoring='roc_auc', cv=cv, verbose=1)

# Fit the estimator
randomized_roc_auc.fit(X_train, y_train)

# Compute metrics
print(randomized_roc_auc.best_score_)
print(randomized_roc_auc.best_estimator_)

#### Final Model

We use the best estimator found by the search as our final model. We then use this model to perform inference on the validation set.

In [None]:
y_preds = randomized_roc_auc.predict(X_test)

# f1_score expects the true labels first, then the predictions
f1_score(y_test, y_preds)

#### Notes
The model seems to overfit the training set and performs woefully on the validation set.

#### Model Evaluation

Due to the imbalanced nature of the dataset, the accuracy score can be misleading. To offset this we can use the F1 score as our key evaluation metric, as it takes the types of classification error into account.
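A tiny made-up example shows why accuracy misleads here: a classifier that always predicts the dominant class looks accurate yet never finds a positive case. The labels below are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: nine negatives, one positive, and a model that always says 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)
print(accuracy, f1)
```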

To address the overfitting, we can:
1. Introduce regularization.
2. Reduce variance by decreasing the model's complexity.
3. Introduce feature selection.
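The effect of model complexity on the gap between training and validation scores can be illustrated with the sketch below. It uses a scikit-learn decision tree on synthetic data rather than the LightGBM pipeline above, so it only illustrates the principle; for a boosted model the analogous controls would be its depth and regularization settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# Compare an unrestricted tree with a shallow one.
scores = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))

for depth, (train_acc, valid_acc) in scores.items():
    print(depth, round(train_acc, 3), round(valid_acc, 3))
```

An unrestricted tree memorises the training set, so a large difference between its two scores is a sign of overfitting; limiting the depth trades some training accuracy for a smaller gap.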

