### Summary

This notebook takes a first look at the data, validates it, and draws some initial insights from it.

In [6]:
import pandas as pd
import numpy as np
import time

In [2]:
INPUT_DIR = '../input/'


In [3]:
train = pd.read_csv(INPUT_DIR + 'train.csv')

In [7]:
ts = time.time()
test = pd.read_csv(INPUT_DIR + 'test.csv')
time.time() - ts

69.47160696983337

###### Let us print out some relevant count metrics here.

In [8]:
len(train)

4459

In [9]:
len(test)

49342

In [10]:
len(train.columns)

4993

In [11]:
cols_present_in_test_not_in_train = [x for x in test.columns if x not in train.columns]

In [12]:
cols_present_in_test_not_in_train

[]

In [13]:
cols_present_in_train_not_in_test = [x for x in train.columns if x not in test.columns]

In [14]:
cols_present_in_train_not_in_test

['target']
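The two list comprehensions above can also be written with pandas' `Index.difference`. The following is a minimal sketch on small made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins, not the competition data):

```python
import pandas as pd

# Hypothetical toy frames standing in for train/test; column names are illustrative.
toy_train = pd.DataFrame({"ID": ["a", "b"], "target": [1.0, 2.0], "f1": [0, 3]})
toy_test = pd.DataFrame({"ID": ["c", "d"], "f1": [5, 0]})

# Columns present in one frame but absent from the other.
only_in_test = toy_test.columns.difference(toy_train.columns).tolist()
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()

print(only_in_test)
print(only_in_train)
```

Note that `Index.difference` returns its result sorted, whereas the list comprehensions keep the original column order.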

##### So, we can summarize the problem in the following way:

1. The number of columns in the training data (4993) exceeds the number of training records (4459).
2. The test data has 49342 records, roughly 11 times the number of records in the training set (4459).

While this is not the most favourable setting for a prediction problem, the upshot is that we should focus on building a low-dimensional model that performs well on out-of-sample data.
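As a quick check on the proportions involved, the ratios can be computed directly from the counts printed earlier. This is only a sketch; the variable names are illustrative, and the feature count assumes that `ID` and `target` are the only non-feature columns:

```python
# Counts taken from the cell outputs above.
n_train_rows = 4459
n_test_rows = 49342
n_train_cols = 4993

# Assumption: every column other than ID and target is a feature.
n_features = n_train_cols - 2

test_to_train_ratio = n_test_rows / n_train_rows
features_per_train_row = n_features / n_train_rows

print(round(test_to_train_ratio, 2))
print(round(features_per_train_row, 2))
```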

#### Data Validation

In [17]:
train.isna().any().any()

False

In [18]:
test.isna().any().any()

False
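The `isna().any().any()` check collapses the whole frame into a single boolean. Had it returned `True`, a per-column count would help locate the gaps. A minimal sketch on a made-up frame (`toy` is hypothetical, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in column f1.
toy = pd.DataFrame({"f1": [0.0, np.nan, 2.0], "f2": [1.0, 1.0, 1.0]})

# Single boolean: is anything missing anywhere in the frame?
has_missing = bool(toy.isna().any().any())

# Number of missing values per column, and the columns that have any.
missing_per_col = toy.isna().sum()
cols_with_missing = missing_per_col[missing_per_col > 0].index.tolist()

print(has_missing)
print(cols_with_missing)
```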

In [19]:
train.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


###### There look to be a lot of zeros in the train data. Let us drill down on it!

In [52]:
cols_apart_from_id_and_target = [x for x in train.columns if x not in ['ID', 'target']]

In [53]:
cols_apart_from_id_and_target[0]

'48df886f9'

In [54]:
non_zero_percent_train_cols = np.zeros(len(cols_apart_from_id_and_target))



In [55]:
for i, col in enumerate(cols_apart_from_id_and_target):
    # Percentage of rows in this column that hold a non-zero value.
    non_zero_percent_train_cols[i] = np.count_nonzero(train[col].to_numpy()) * 100 / len(train)


In [56]:
non_zero_percent_train_cols

array([0.87463557, 0.08970621, 0.74007625, ..., 0.7625028 , 1.79412424,
       3.65552815])

In [62]:
np.sort(non_zero_percent_train_cols)[::-1]

array([35.09755551, 35.09755551, 34.98542274, ...,  0.        ,
        0.        ,  0.        ])

In [66]:
np.sum(non_zero_percent_train_cols > 5)/len(non_zero_percent_train_cols)

0.1997595672209978

###### Thus, we see that just under 20% of the columns have non-zero values in more than 5% of their rows. In other words, about 80% of the columns are at least 95% zeros.
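The per-column non-zero percentage can also be computed without an explicit loop. The following is a minimal sketch on a made-up frame; `toy`, `feature_cols` and the 50% threshold are illustrative assumptions chosen to keep the example small:

```python
import pandas as pd

# Hypothetical toy frame; the real data has thousands of feature columns.
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "target": [1.0, 2.0, 3.0, 4.0],
    "f1": [0, 0, 0, 7],
    "f2": [0.0, 0.0, 0.0, 0.0],
    "f3": [1, 2, 0, 4],
})

feature_cols = [c for c in toy.columns if c not in ("ID", "target")]

# The mean of a boolean mask is the fraction of True entries in each column.
non_zero_percent = (toy[feature_cols] != 0).mean() * 100

# Share of feature columns whose non-zero percentage exceeds the threshold.
threshold = 50
share_above = (non_zero_percent > threshold).mean()

print(non_zero_percent)
print(share_above)
```

Here the result is a pandas Series indexed by column name, which makes it easy to see which columns are the dense ones.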

##### Things look ripe for us to try out a simple lasso model. Let us try that in a new notebook.