1. Perform an exploratory data analysis
    - What are the abnormalities in the data?
    - Are there any interesting, perhaps unexpected relationships to be found?

1. Create a model for predicting the cancer_type
    - Select an appropriate model and keep its complexity reasonable (number of used features, etc.)
    
    - I would like you to send me a submission.csv for the cases in test_data.csv at least one hour before the interview that includes the prediction of the cancer_type of your model. The cases should be in the same order as in the test_data.csv and should only contain the label of the predicted cancer_type. See the sample_submission.csv for format clarification.

1. Build a regression model for predicting radius_2 based on perimeter_1
    - the model should be able to quantify its prediction reliability, e.g. density estimates, etc. (please do not just use the outputted R^2 or confidence interval estimates of typical linear regression packages fitting results)
    - prepare a visualization that illustrates possible prediction uncertainties


In [1]:
import time

# data processing library
import numpy as np
import pandas as pd
from functools import reduce
import collections

# data visualization library  
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as stats

# feature engineering library
# (feature_engine >= 1.0 renamed this module to feature_engine.outliers)
from feature_engine.outlier_removers import Winsorizer

# machine learning library
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression, LogisticRegression 
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier

from sklearn.metrics import classification_report, confusion_matrix, f1_score

from sklearn import model_selection
from sklearn.model_selection import KFold, RepeatedStratifiedKFold, train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.metrics import average_precision_score, auc, roc_curve, precision_recall_curve
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

# gradient boosting machine 
import lightgbm as lgbm

# feature analysis
import shap

# oversampling imbalanced datasets
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings("ignore")


Note for macOS users: when LightGBM is installed from PyPI via ``pip install lightgbm``, the gcc compiler is no longer required.
Instead, LightGBM needs the OpenMP library on systems that use the Apple Clang compiler.
It can be installed with the following command: ``brew install libomp``.


# Perform an exploratory data analysis

- What are the abnormalities in the data?

- Are there any interesting, perhaps unexpected relationships to be found?
    


## Data processing


In [3]:
train_file = 'train_data.csv'
test_file = 'test_data.csv'
sample_submission_file = 'sample_submission.csv'


In [4]:
df_train = pd.read_csv(train_file)
df_test = pd.read_csv(test_file)
df_sample_submission = pd.read_csv(sample_submission_file)


In [5]:
df_train.head()


Unnamed: 0,radius_0,texture_0,perimeter_0,radius_1,texture_1,perimeter_1,radius_2,texture_2,perimeter_2,age,treatment_date,diagnose_date,cancer_type
0,19.858394,27.204437,136.324256,22.68329,32.802578,119.523841,21.477052,27.3070874472,82.366936,44,2006-06-03,2005-10-23,0
1,14.182069,15.75473,80.916983,14.043753,30.094704,94.911073,15.012329,17.8551305385,103.078286,59,2004-02-22,2007-08-20,1
2,25.380268,21.291553,152.281062,23.852166,46.237931,,28.563252,21.0971528265,143.367792,37,2006-01-06,2004-08-07,0
3,11.835961,17.820702,72.178523,11.260258,44.805167,,12.082749,16.4992370844,65.920413,51,2003-04-14,2005-06-16,1
4,14.8756,17.534187,98.54583,14.380683,26.190447,89.712492,12.930685,19.8566873539,108.380754,21,2004-06-21,2002-11-27,1


In [6]:
df_train.shape


(398, 13)

In [7]:
df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   radius_0        398 non-null    float64
 1   texture_0       398 non-null    float64
 2   perimeter_0     398 non-null    float64
 3   radius_1        343 non-null    float64
 4   texture_1       398 non-null    float64
 5   perimeter_1     264 non-null    float64
 6   radius_2        398 non-null    float64
 7   texture_2       398 non-null    object 
 8   perimeter_2     398 non-null    float64
 9   age             398 non-null    int64  
 10  treatment_date  398 non-null    object 
 11  diagnose_date   398 non-null    object 
 12  cancer_type     398 non-null    int64  
dtypes: float64(8), int64(2), object(3)
memory usage: 40.5+ KB
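
The `info()` output above shows fewer non-null values for `radius_1` and `perimeter_1` than for the other columns. A minimal sketch of how the missing share per column could be tabulated follows. The small frame reuses values from the five rows printed by `df_train.head()`, and `missing_summary` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Three columns from the five rows printed by df_train.head() above.
sample = pd.DataFrame({
    "radius_1": [22.68329, 14.043753, 23.852166, 11.260258, 14.380683],
    "perimeter_1": [119.523841, 94.911073, np.nan, np.nan, 89.712492],
    "radius_2": [21.477052, 15.012329, 28.563252, 12.082749, 12.930685],
})

def missing_summary(df):
    """Count and fraction of missing values per column (hypothetical helper)."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    # Columns with the largest share of missing values come first.
    return summary.sort_values("frac_missing", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Applied to the full training set, the counts reported by `info()` imply 398 - 264 = 134 missing `perimeter_1` values (about 34%) and 398 - 343 = 55 missing `radius_1` values (about 14%).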


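`cancer_type` is a binary integer label (0 or 1 in the rows shown above). The two classes presumably correspond to malignant and benign tumours, although the data does not state which value is which.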


In [8]:
df_test.shape


(171, 12)

In [9]:
df_test.head()


Unnamed: 0,radius_0,texture_0,perimeter_0,radius_1,texture_1,perimeter_1,radius_2,texture_2,perimeter_2,age,treatment_date,diagnose_date
0,12.567724,13.561447,77.106898,10.773643,45.494416,,12.526989,15.7063580493,123.583682,31,2008-11-19,2003-04-22
1,11.195949,19.693575,81.244301,15.058411,7.909249,86.766622,13.72896,21.485344712,154.164201,18,2001-08-18,2003-07-07
2,15.71272,26.114134,90.977022,13.832857,18.086143,,14.758324,27.0205254475,114.023403,43,2006-11-17,2004-03-06
3,13.428698,26.649458,76.456016,14.837875,6.12295,89.609565,16.279206,29.1837924649,199.756098,57,2001-01-10,2006-02-24
4,179.763472,14.175435,51.125047,,21.116416,52.041704,9.191477,13.5857306814,74.879232,26,2008-07-12,2004-06-21
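
Row 4 of the test data has `radius_0 = 179.76`, roughly ten times larger than the other values shown. One common way to flag such values is the interquartile-range (IQR) rule, sketched below on the five `radius_0` values printed above. The helper name `iqr_outliers` and the multiplier `k=1.5` are assumptions for illustration, not choices made in the original notebook.

```python
import pandas as pd

# radius_0 values from the five rows printed by df_test.head() above.
radius_0 = pd.Series(
    [12.567724, 11.195949, 15.71272, 13.428698, 179.763472], name="radius_0"
)

def iqr_outliers(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

flags = iqr_outliers(radius_0)
print(radius_0[flags])
```

A second hint that this value may be an entry error: the ratio `perimeter_0 / radius_0` lies roughly between 5.7 and 7.3 for the other four rows, which is near 2π, but is only about 0.28 for this row.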


In [10]:
df_test.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   radius_0        171 non-null    float64
 1   texture_0       171 non-null    float64
 2   perimeter_0     171 non-null    float64
 3   radius_1        146 non-null    float64
 4   texture_1       171 non-null    float64
 5   perimeter_1     105 non-null    float64
 6   radius_2        171 non-null    float64
 7   texture_2       171 non-null    object 
 8   perimeter_2     171 non-null    float64
 9   age             171 non-null    int64  
 10  treatment_date  171 non-null    object 
 11  diagnose_date   171 non-null    object 
dtypes: float64(8), int64(1), object(3)
memory usage: 16.2+ KB


In [11]:
df_sample_submission.shape


(171, 1)

In [12]:
df_sample_submission.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   cancer_type  171 non-null    int64
dtypes: int64(1)
memory usage: 1.5 KB


In [13]:
df_sample_submission.head()


Unnamed: 0,cancer_type
0,1
1,1
2,1
3,0
4,1


### Convert treatment_date, diagnose_date into datetime


In [14]:
df_train.columns


Index(['radius_0', 'texture_0', 'perimeter_0', 'radius_1', 'texture_1',
       'perimeter_1', 'radius_2', 'texture_2', 'perimeter_2', 'age',
       'treatment_date', 'diagnose_date', 'cancer_type'],
      dtype='object')

In [15]:
df_train['treatment_date'] = pd.to_datetime(df_train['treatment_date'], format="%Y-%m-%d")


In [16]:
df_train['treatment_date'].describe()


count                     398
unique                    371
top       2003-04-27 00:00:00
freq                        3
first     2000-01-19 00:00:00
last      2008-11-26 00:00:00
Name: treatment_date, dtype: object

In [17]:
df_test['treatment_date'] = pd.to_datetime(df_test['treatment_date'], format="%Y-%m-%d")


In [18]:
df_test['treatment_date'].describe()


count                     171
unique                    160
top       2005-10-12 00:00:00
freq                        3
first     2000-01-09 00:00:00
last      2008-11-19 00:00:00
Name: treatment_date, dtype: object
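
The cells above convert only `treatment_date`. A minimal sketch of applying the same conversion to `diagnose_date` and then checking the order of the two dates is shown below, using the five date pairs printed by `df_train.head()`. The column names `days_to_treatment` and `treated_before_diagnosis` are hypothetical and introduced only for this illustration.

```python
import pandas as pd

# Date pairs from the five rows printed by df_train.head().
dates = pd.DataFrame({
    "treatment_date": ["2006-06-03", "2004-02-22", "2006-01-06", "2003-04-14", "2004-06-21"],
    "diagnose_date": ["2005-10-23", "2007-08-20", "2004-08-07", "2005-06-16", "2002-11-27"],
})

# Convert both columns with the same explicit format.
for col in ["treatment_date", "diagnose_date"]:
    dates[col] = pd.to_datetime(dates[col], format="%Y-%m-%d")

# A negative value means the treatment is dated before the diagnosis.
dates["days_to_treatment"] = (dates["treatment_date"] - dates["diagnose_date"]).dt.days
dates["treated_before_diagnosis"] = dates["days_to_treatment"] < 0
print(dates)
```

In this small sample, two of the five rows have a treatment date that precedes the diagnosis date, which may be worth reporting as an abnormality in the exploratory analysis.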

### Convert texture_2 to float
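
`info()` reports `texture_2` as `object` even though the values shown look numeric, which suggests that at least one entry cannot be parsed as a number. The notebook does not show the cause, so the sketch below uses a hypothetical bad entry (`"xx"`) to illustrate one possible approach with `pd.to_numeric(..., errors="coerce")`.

```python
import pandas as pd

# The first three values are taken from df_train.head(); the last entry is a
# hypothetical non-numeric value used only to illustrate the coercion.
texture_2 = pd.Series(
    ["27.3070874472", "17.8551305385", "21.0971528265", "xx"], name="texture_2"
)

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(texture_2, errors="coerce")

# Entries that were present but could not be parsed; worth inspecting by hand.
bad_entries = texture_2[converted.isna() & texture_2.notna()]

print(converted.dtype)
print(bad_entries)
```

Inspecting `bad_entries` before overwriting the column makes it possible to decide whether the affected rows should be corrected, imputed, or dropped.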


# Create a model for predicting the cancer_type



- Select an appropriate model and keep its complexity reasonable (number of used features, etc.)
    
    
- I would like you to send me a submission.csv for the cases in test_data.csv at least one hour before the interview that includes the prediction of the cancer_type of your model. The cases should be in the same order as in the test_data.csv and should only contain the label of the predicted cancer_type. See the sample_submission.csv for format clarification.
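
One possible low-complexity baseline for this task is a logistic regression on a handful of features, with median imputation for the missing values. The sketch below is not the model used in this notebook. It runs on synthetic stand-in data (generated by the hypothetical `make_synthetic` helper) so that it is self-contained, and the chosen feature names and class separation are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_synthetic(n):
    """Synthetic stand-in for the real data (illustration only)."""
    y = rng.integers(0, 2, size=n)
    X = pd.DataFrame({
        "radius_0": rng.normal(18 - 5 * y, 2.0),
        "texture_0": rng.normal(20, 4, size=n),
        "radius_1": rng.normal(18 - 5 * y, 2.5),
    })
    # Knock out some radius_1 values to mimic the missing data.
    X.loc[rng.random(n) < 0.15, "radius_1"] = np.nan
    return X, y

X_train, y_train = make_synthetic(300)
X_test, _ = make_synthetic(100)

# Imputation and scaling live inside the pipeline, so each CV fold
# fits them on its own training part only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print("mean F1:", scores.mean())

# Fit on all training data and build the submission in test order.
model.fit(X_train, y_train)
submission = pd.DataFrame({"cancer_type": model.predict(X_test)})
csv_text = submission.to_csv(index=False)
```

`df_sample_submission.info()` reports a single `cancer_type` column, so writing the file with `index=False` should match the expected format.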
    

# Build a regression model for predicting radius_2 based on perimeter_1

- the model should be able to quantify its prediction reliability, e.g. density estimates, etc. (please do not just use the outputted R^2 or confidence interval estimates of typical linear regression packages fitting results)



- prepare a visualization that illustrates possible prediction uncertainties
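
One way to quantify prediction reliability without relying on a regression package's built-in intervals is a bootstrap predictive distribution. The sketch below is an illustration under stated assumptions, not the method used in this notebook. It uses synthetic stand-in data for `perimeter_1` and `radius_2`, a straight-line fit, and a hypothetical helper `bootstrap_predictive`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for (perimeter_1, radius_2); NOT the real data.
n = 150
x = rng.uniform(50, 150, size=n)
y = x / (2 * np.pi) + rng.normal(0, 1.5, size=n)

def bootstrap_predictive(x, y, x_new, n_boot=500, rng=rng):
    """Pairs bootstrap of a line fit plus resampled residual noise (hypothetical helper)."""
    x_new = np.asarray(x_new, dtype=float)
    draws = np.empty((n_boot, x_new.size))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # resample cases with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        resid = y[idx] - (slope * x[idx] + intercept)
        noise = rng.choice(resid, size=x_new.size)   # one residual draw per grid point
        draws[b] = slope * x_new + intercept + noise
    return draws

grid = np.linspace(50, 150, 21)
draws = bootstrap_predictive(x, y, grid)

# Empirical 90% predictive band and median prediction per grid point.
lower, median, upper = np.percentile(draws, [5, 50, 95], axis=0)
print(np.column_stack([grid, lower, median, upper])[:3])
```

For the visualization, the band could be drawn with `plt.fill_between(grid, lower, upper, alpha=0.3)` on top of a scatter plot of the data. Each column of `draws` is also an empirical predictive sample, so a histogram or kernel density estimate of a single column gives a density estimate at that `perimeter_1` value.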

