In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model

### 1. Data Exploration
Before working with the data, we explore it to understand its scope and contents. It is also important to deal with any dirty data, such as missing or invalid values.

Having an appreciation and understanding of the data will enable us to interpret the results of any model well.

#### Reading the data 

In [None]:
training_set = pd.read_csv('training.csv')
test_set = pd.read_csv('testing.csv')

In [None]:
training_set.head()

In [None]:
test_set.head()

In [None]:
print(training_set.columns)

In [None]:
np.unique(training_set['class'])

Looking at the first 5 observations of both the training and test datasets, we see a class variable and 9 numerical variables labelled b1-b9. We are given little information about what these covariates represent, so we take them as they are.

There are also prediction variables (columns prefixed with `pred_minus_obs`) in the dataset. They may not be relevant to our study, so we omit them.

In [None]:
# Keep only the columns that do not start with 'pred_minus_obs'
def var_omit(dataset):
    mask = []  # names of the columns to keep
    for col in dataset.columns:  # iterate over this dataset's own columns
        if not col.startswith('pred_minus_obs'):
            mask.append(col)  # keep the column if it is not a pred variable
    data = dataset[mask]  # select only the kept columns
    return data  # return the new dataset with the appropriate columns

In [None]:
training_set, test_set = var_omit(training_set), var_omit(test_set)

The datasets look fine after omitting the pred variables.

In [None]:
training_set, test_set
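
The same column filter can also be written without an explicit loop. The sketch below is one possible alternative to `var_omit`, shown on a small made-up frame. The helper name `drop_pred_columns` and the toy column name `pred_minus_obs_x` are assumptions for illustration only and do not come from the dataset.

```python
import pandas as pd

def drop_pred_columns(dataset, prefix='pred_minus_obs'):
    """Return a copy of dataset without the columns whose name starts with prefix."""
    keep = ~dataset.columns.str.startswith(prefix)  # boolean mask over the columns
    return dataset.loc[:, keep].copy()

# Toy frame with made-up values, only to show the behaviour
toy = pd.DataFrame({'class': ['a', 'b'], 'b1': [1, 2], 'pred_minus_obs_x': [0.1, 0.2]})
filtered = drop_pred_columns(toy)
print(list(filtered.columns))
```

Because the mask is built from the columns of the frame passed in, the helper does not depend on any global dataset.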

#### Data Visualisation and Cleaning

In [None]:
datasets = [training_set, test_set]
training_set.dtypes  # last expression in the cell, so its output is displayed

We observe that the datatypes for the 

In [None]:
# Count missing values per column in each dataset (train and test); the columns must match
def missing_value_test(datasets):
    colnames = datasets[0].columns
    # one row of per-column missing-value counts for each dataset
    missing_value = np.vstack([np.array(data.isnull().sum()) for data in datasets])
    df = pd.DataFrame(missing_value, columns=colnames, index=['train', 'test'])
    return df

missing_value_test(datasets)

Neither the training set nor the test set contains missing values, so there is no need to impute or otherwise handle missing data.
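
As a sanity check on the approach, the per-column missing-value count can be tried on small frames where the answer is known in advance. The following is a minimal sketch under that assumption. The helper `missing_value_counts` and the toy frames are hypothetical, and the values are made up.

```python
import numpy as np
import pandas as pd

def missing_value_counts(named_datasets):
    """Build one row of per-column NaN counts for each named dataset."""
    rows = {name: data.isnull().sum() for name, data in named_datasets.items()}
    return pd.DataFrame(rows).T  # transpose so that each dataset is a row

# Toy frames with known gaps: one NaN in train b1, two NaNs in test b2
toy_train = pd.DataFrame({'b1': [1.0, np.nan, 3.0], 'b2': [4.0, 5.0, 6.0]})
toy_test = pd.DataFrame({'b1': [1.0, 2.0], 'b2': [np.nan, np.nan]})
counts = missing_value_counts({'train': toy_train, 'test': toy_test})
print(counts)
```

Passing a dictionary keeps the row labels tied to the datasets, so the table stays correct if more datasets are added.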

Next, we use the describe method to look at summary statistics of the two datasets. This gives a rough idea of whether the data contain any extreme outliers. So far, there is no observable disparity between the 2 sets, and the min and max values show nothing extreme.

In [None]:
training_set.describe()

In [None]:
test_set.describe()

Perhaps we can better visualise this using a boxplot. 

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15, 7))
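# Skip the first column (assumed to be the class label) and plot the numeric columns as floats
ax[0].boxplot(training_set.iloc[:, 1:].to_numpy(dtype=float))
ax[1].boxplot(test_set.iloc[:, 1:].to_numpy(dtype=float))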
ax[0].set_title('Training Set')
ax[1].set_title('Test Set')
ax[0].set_ylim(0, 200)
ax[1].set_ylim(0, 200)
ax[0].grid(True)
ax[1].grid(True)
fig.suptitle('Boxplot of b1-b9 variables')

The boxplots of the numerical variables, for both the training and test sets, show a fair number of outliers beyond the whiskers (more than 1.5 × IQR from the quartiles) on some variables. However, the distributions of the 9 covariates look comparable across the 2 sets.
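
To put numbers on what the boxplots show, the points drawn as outliers can be counted per column with the same 1.5 × IQR rule that matplotlib uses by default. This is a rough sketch on made-up values. The helper name `count_iqr_outliers` and the toy frame are assumptions and are not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(data, k=1.5):
    """Count values per column lying more than k * IQR beyond the quartiles."""
    numeric = data.select_dtypes(include='number')  # ignore non-numeric columns such as class
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy frame: b1 has one obvious outlier (100), b2 has none
toy = pd.DataFrame({'b1': [10, 11, 12, 13, 100], 'b2': [1, 2, 3, 4, 5]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling the helper on `training_set` and `test_set` would give one count per covariate, which makes the two sets easier to compare than reading the plots by eye.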

In [None]:
training_set.hist(bins=20, figsize=(20,15))
plt.show()