After spending some time on the competition's discussion forums, it seems like a lot of people are modeling with just the tabular data. So let's run some models on the tabular data to create a baseline we can compare our processed CT scans against.

We'll try a single decision tree, random forests, and a tabular deep learning model.
We could also build a deep learning model from scratch in PyTorch.

In [68]:
from fastai2.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [7]:
path = Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic')
path.ls()

(#6) [Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic/test.csv'),Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic/train'),Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic/train.csv'),Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic/test'),Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic/osic-pulmonary-fibrosis-progression.zip'),Path('/home/azaidi/Desktop/fastai/nbs/kaggle/osic/sample_submission.csv')]

In [8]:
train_df = pd.read_csv(path/'train.csv', low_memory=False)
test_df = pd.read_csv(path/'test.csv', low_memory=False)
sample_sub = pd.read_csv(path/'sample_submission.csv')

In [9]:
train_df.head(5)

Unnamed: 0,Patient,Weeks,FVC,Percent,Age,Sex,SmokingStatus
0,ID00007637202177411956430,-4,2315,58.253649,79,Male,Ex-smoker
1,ID00007637202177411956430,5,2214,55.712129,79,Male,Ex-smoker
2,ID00007637202177411956430,7,2061,51.862104,79,Male,Ex-smoker
3,ID00007637202177411956430,9,2144,53.950679,79,Male,Ex-smoker
4,ID00007637202177411956430,11,2069,52.063412,79,Male,Ex-smoker


In [34]:
len(train_df['Patient'].unique())

176

So there are 176 patients. Why not hold out the last 10 or so patients as the validation set? We could sort train_df by Patient first.

In [36]:
train_df.sort_values('Patient')

Unnamed: 0,Patient,Weeks,FVC,Percent,Age,Sex,SmokingStatus
0,ID00007637202177411956430,-4,2315,58.253649,79,Male,Ex-smoker
1,ID00007637202177411956430,5,2214,55.712129,79,Male,Ex-smoker
2,ID00007637202177411956430,7,2061,51.862104,79,Male,Ex-smoker
3,ID00007637202177411956430,9,2144,53.950679,79,Male,Ex-smoker
4,ID00007637202177411956430,11,2069,52.063412,79,Male,Ex-smoker
...,...,...,...,...,...,...,...
1543,ID00426637202313170790466,11,2976,73.077301,73,Male,Never smoked
1544,ID00426637202313170790466,13,2712,66.594637,73,Male,Never smoked
1545,ID00426637202313170790466,19,2978,73.126412,73,Male,Never smoked
1546,ID00426637202313170790466,31,2908,71.407524,73,Male,Never smoked


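That did not sort the DataFrame in place: sort_values returns a new DataFrame and leaves train_df as it was. The index in the output is still in its original order, so the rows were already ordered by Patient.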

In [38]:
train_df.head(5)

Unnamed: 0,Patient,Weeks,FVC,Percent,Age,Sex,SmokingStatus
0,ID00007637202177411956430,-4,2315,58.253649,79,Male,Ex-smoker
1,ID00007637202177411956430,5,2214,55.712129,79,Male,Ex-smoker
2,ID00007637202177411956430,7,2061,51.862104,79,Male,Ex-smoker
3,ID00007637202177411956430,9,2144,53.950679,79,Male,Ex-smoker
4,ID00007637202177411956430,11,2069,52.063412,79,Male,Ex-smoker
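
A quick sketch on a toy frame (hypothetical data, not the competition file) of how `sort_values` behaves with and without `inplace=True`:

```python
import pandas as pd

# Toy frame with patients deliberately out of order (illustrative values only)
toy = pd.DataFrame({"Patient": ["B", "A", "C"], "FVC": [2000, 2300, 2100]})

# Default call: returns a sorted copy, the original keeps its row order
sorted_copy = toy.sort_values("Patient")
order_after_copy = toy["Patient"].tolist()

# With inplace=True the original frame itself is reordered
toy.sort_values("Patient", inplace=True)
order_after_inplace = toy["Patient"].tolist()

print(order_after_copy, order_after_inplace)
```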


In [39]:
train_df.columns

Index(['Patient', 'Weeks', 'FVC', 'Percent', 'Age', 'Sex', 'SmokingStatus'], dtype='object')

Most of the EDA on these dataframes was already done in the first notebook.

For categorical variables, specifically SmokingStatus: does it make sense to encode them as integers so that they are treated as ordinal? Meaning, is there a meaningful ordering from 'Never smoked' through 'Ex-smoker' to 'Currently smokes'? Probably.

In [40]:
train_df['SmokingStatus'].unique()

array(['Ex-smoker', 'Never smoked', 'Currently smokes'], dtype=object)
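
If an ordinal encoding is wanted, one possible sketch uses an ordered pandas categorical. The ordering chosen here (never, ex, current) is an assumption for illustration, not something the data confirms:

```python
import pandas as pd

# A few example values taken from the unique() output above
status = pd.Series(["Ex-smoker", "Never smoked", "Currently smokes", "Ex-smoker"])

# Assumed ordering from least to most smoking exposure (hypothetical choice)
order = ["Never smoked", "Ex-smoker", "Currently smokes"]
ordered = status.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Integer codes follow the position of each value in `order`
codes = ordered.cat.codes.tolist()
print(codes)
```

With this ordering, comparisons such as `ordered > "Never smoked"` are also meaningful, which is not the case for an unordered categorical.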

In [41]:
cont, cat = cont_cat_split(train_df, 1, dep_var=['FVC', 'Percent'])

In [42]:
cont, cat

(['Weeks', 'FVC', 'Percent', 'Age'], ['Patient', 'Sex', 'SmokingStatus'])
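
The output above still lists FVC and Percent among the continuous columns even though both were passed as dep_var. A minimal workaround, assuming the intent was to keep both out of the feature set, is to filter them out by hand (the name `cont_features` is hypothetical):

```python
# Column lists as returned by cont_cat_split above
cont = ['Weeks', 'FVC', 'Percent', 'Age']
dep_vars = ['FVC', 'Percent']

# Keep only the continuous columns that are not dependent variables
cont_features = [c for c in cont if c not in dep_vars]
print(cont_features)
```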

In [43]:
procs = [Categorify]

In [51]:
len(train_df), len(train_df)*0.8

(1549, 1239.2)

In [49]:
list(range(12))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [53]:
list(range(1239,1245))

[1239, 1240, 1241, 1242, 1243, 1244]

In [58]:
splits = (list(range(1239)), list(range(1239, 1549)))
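
A split by row range can cut through a patient's visits, leaving some of that patient's rows in training and the rest in validation. A patient-level hold-out, along the lines of the "last 10 or so patients" idea, could look like the sketch below. The helper name `last_n_patients_split` and the toy frame are hypothetical:

```python
import pandas as pd

def last_n_patients_split(df, n_valid=10, id_col="Patient"):
    """Return (train_idx, valid_idx) as positional row indices,
    holding out every row of the last n_valid patient IDs."""
    patients = sorted(df[id_col].unique())
    valid_ids = patients[-n_valid:]
    is_valid = df[id_col].isin(valid_ids).to_numpy()
    train_idx = [i for i, v in enumerate(is_valid) if not v]
    valid_idx = [i for i, v in enumerate(is_valid) if v]
    return train_idx, valid_idx

# Toy data: three patients with a different number of visits each
toy = pd.DataFrame({"Patient": ["A", "A", "B", "B", "B", "C"],
                    "Weeks":   [0, 5, 1, 4, 9, 2]})
train_idx, valid_idx = last_n_patients_split(toy, n_valid=1)
print(train_idx, valid_idx)
```

The returned pair has the same shape as the `splits` tuple above, so no patient appears on both sides of the split.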

In [74]:
to = TabularPandas(train_df, procs, cat, cont, splits=splits,
                y_names=['FVC'])

In [75]:
to

      Patient  Weeks   FVC    Percent  Age  Sex  SmokingStatus
0           1     -4  2315  58.253647   79    2              2
1           1      5  2214  55.712128   79    2              2
2           1      7  2061  51.862103   79    2              2
3           1      9  2144  53.950680   79    2              2
4           1     11  2069  52.063412   79    2              2
...       ...    ...   ...        ...  ...  ...            ...
1544      176     13  2712  66.594635   73    2              3
1545      176     19  2978  73.126411   73    2              3
1546      176     31  2908  71.407524   73    2              3
1547      176     43  2975  73.052742   73    2              3
1548      176     59  2774  68.117081   73    2              3

[1549 rows x 7 columns]

In [76]:
to.train.xs

Unnamed: 0,Patient,Sex,SmokingStatus,Weeks,FVC,Percent,Age
0,1,2,2,-4,2315,58.253647,79
1,1,2,2,5,2214,55.712128,79
2,1,2,2,7,2061,51.862103,79
3,1,2,2,9,2144,53.950680,79
4,1,2,2,11,2069,52.063412,79
...,...,...,...,...,...,...,...
1234,141,2,2,12,2242,56.445114,68
1235,141,2,2,18,2253,56.722054,68
1236,141,2,2,30,2160,54.380665,68
1237,141,2,2,42,2000,50.352467,68


In [77]:
to.valid.xs

Unnamed: 0,Patient,Sex,SmokingStatus,Weeks,FVC,Percent,Age
1239,142,1,1,28,2849,143.381989,68
1240,142,1,1,32,2851,143.482635,68
1241,142,1,1,35,2798,140.815292,68
1242,142,1,1,36,2854,143.633621,68
1243,142,1,1,38,2891,145.495728,68
...,...,...,...,...,...,...,...
1544,176,2,3,13,2712,66.594635,73
1545,176,2,3,19,2978,73.126411,73
1546,176,2,3,31,2908,71.407524,73
1547,176,2,3,43,2975,73.052742,73


In [78]:
xs = to.train.xs
y = to.train.y
valid_xs = to.valid.xs
valid_y = to.valid.y

In [88]:
m = DecisionTreeRegressor(min_samples_leaf=1)

In [89]:
m.fit(xs, y)

DecisionTreeRegressor()

In [90]:
def r_mse(pred, y):
    return round(math.sqrt(((pred-y)**2).mean()), 6)

In [91]:
def m_rmse(m, xs, y):
    return r_mse(m.predict(xs), y)
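
As a sanity check, here is `r_mse` applied to a few made-up predictions (the numbers are illustrative, not model output). The result is in the same units as the target:

```python
import math
import numpy as np

def r_mse(pred, y):
    # Root mean squared error, rounded to 6 decimal places
    return round(math.sqrt(((pred - y) ** 2).mean()), 6)

# Errors are 30, -40, 0, 0, so the mean squared error is (900 + 1600) / 4 = 625
pred = np.array([2330.0, 2040.0, 2000.0, 2100.0])
actual = np.array([2300.0, 2080.0, 2000.0, 2100.0])
err = r_mse(pred, actual)
print(err)  # 25.0
```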

In [92]:
m_rmse(m, xs, y)

0.0

In [93]:
m_rmse(m, valid_xs, valid_y)

4.16514
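
A training error of exactly 0.0 is what a fully grown tree (min_samples_leaf=1) produces whenever the training rows are distinct, because every row can end up in its own leaf. The sketch below shows this on synthetic data, which is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with a noisy target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

def rmse(model, X, y):
    return float(np.sqrt(((model.predict(X) - y) ** 2).mean()))

# Unconstrained tree: free to isolate each training row in its own leaf
deep = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)
# Constrained tree: every leaf must average at least 25 rows
shallow = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

deep_train_rmse = rmse(deep, X, y)
shallow_train_rmse = rmse(shallow, X, y)
print(deep_train_rmse, shallow_train_rmse)
```

Raising min_samples_leaf trades some training accuracy for smoother predictions, which is one of the knobs the random forest helper also exposes.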

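Looks like we overfit :) A training RMSE of 0.0 means the tree has memorized the training set. Note also that FVC, the target, is still one of the columns in xs (see to.train.xs above), so these scores benefit from target leakage and should not be read as a real baseline.
Let's try a random forest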

In [97]:
def rf(xs, y, n_estimators=40, max_samples=1200,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    # Extra keyword arguments are forwarded to RandomForestRegressor
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)

In [98]:
m = rf(xs, y)

In [99]:
m_rmse(m, xs, y)

39.403021

In [100]:
m_rmse(m, valid_xs, valid_y)

78.617271
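
Since the forest is fit with oob_score=True, scikit-learn also stores out-of-bag predictions, which give a third error estimate without touching the validation set. A sketch on synthetic data (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the tabular features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=5,
                          max_features=0.5, oob_score=True,
                          n_jobs=-1, random_state=0).fit(X, y)

# oob_prediction_ holds, for each training row, the average prediction
# of the trees that did not see that row during fitting
oob_rmse = float(np.sqrt(((m.oob_prediction_ - y) ** 2).mean()))
train_rmse = float(np.sqrt(((m.predict(X) - y) ** 2).mean()))
print(train_rmse, oob_rmse)
```

Comparing the out-of-bag error with the validation error can help separate ordinary overfitting from a validation set that simply differs from the training patients.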