# 1. Set up the Kaggle CLI and download the dataset in Google Colab

Because all data is lost when a Google Colab session ends, the six steps below download the dataset from Kaggle so that you do not have to upload it by hand every time. The first two steps are manual and only need to be done once. Steps 3-6 are carried out by the three code cells below, which you must rerun each time you start a new session.
  

1. Create a Kaggle account, then create and download the JSON credentials file (kaggle.json). See https://github.com/Kaggle/kaggle-api for more details
2. Upload the kaggle.json file to your Google Drive
3. Run the script in the first cell below to copy kaggle.json from Google Drive to your Colab environment
4. When prompted, click the link and enter the verification code
5. Install the Kaggle CLI using pip install
6. Download the dataset




In [1]:
# Code from https://medium.com/@move37timm/using-kaggle-api-for-google-colaboratory-d18645f93648
# Create kaggle.json by following instructions at https://github.com/Kaggle/kaggle-api
# Upload kaggle.json to Google Drive
# Download kaggle.json to Colab from the user's Google Drive

from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
auth.authenticate_user()
drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])
filename = "/root/.kaggle/kaggle.json"
if not os.path.exists(os.path.dirname(filename)):
  os.makedirs(os.path.dirname(filename))
request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
os.chmod(filename, 0o600)  # octal mode: owner read/write only (decimal 600 sets different bits)

Download 100%.
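
The Kaggle CLI warns when the key file is readable by other users, so the mode passed to `os.chmod` matters. The sketch below uses a temporary file as a hypothetical stand-in for `/root/.kaggle/kaggle.json` and shows that the octal literal `0o600` gives owner-only read/write, while the decimal literal `600` is a different bit pattern. It assumes a POSIX file system such as the one Colab runs on.

```python
import os
import stat
import tempfile

# Temporary file used as a stand-in for /root/.kaggle/kaggle.json (illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kaggle.json")
    with open(path, "w") as f:
        f.write("{}")

    os.chmod(path, 0o600)  # octal literal: read/write for the owner, nothing for others
    mode = stat.S_IMODE(os.stat(path).st_mode)
    print(oct(mode))

# The decimal literal 600 corresponds to a different set of permission bits
print(oct(600))  # 0o1130
```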


In [2]:
# Install kaggle cli
!pip install kaggle



In [4]:
# Download the dataset for the house-prices-advanced-regression-techniques competition
!kaggle competitions download -c house-prices-advanced-regression-techniques

Downloading sample_submission.csv to /content
  0% 0.00/31.2k [00:00<?, ?B/s]
100% 31.2k/31.2k [00:00<00:00, 27.0MB/s]
Downloading test.csv to /content
  0% 0.00/441k [00:00<?, ?B/s]
100% 441k/441k [00:00<00:00, 57.6MB/s]
Downloading train.csv to /content
  0% 0.00/450k [00:00<?, ?B/s]
100% 450k/450k [00:00<00:00, 59.0MB/s]
Downloading data_description.txt to /content
  0% 0.00/13.1k [00:00<?, ?B/s]
100% 13.1k/13.1k [00:00<00:00, 12.4MB/s]


# 2. Read the data into pandas dataframes
1. Check that the train and test csv files have been downloaded
2. Import pandas and numpy and create train and test dataframes from the respective csv files
3. Inspect the dataframes
4. Convert the dataframes to numpy arrays for the train, validation, and test sets

In [5]:
# Check train and test csv files exist
!ls -ltr

total 952
drwxr-xr-x 1 root root   4096 Apr 29 16:32 sample_data
-rw-r--r-- 1 root root   2520 May  5 11:32 adc.json
-rw-r--r-- 1 root root  31939 May  5 11:33 sample_submission.csv
-rw-r--r-- 1 root root 451405 May  5 11:33 test.csv
-rw-r--r-- 1 root root 460676 May  5 11:33 train.csv
-rw-r--r-- 1 root root  13370 May  5 11:33 data_description.txt


In [20]:
!head -n 5 sample_submission.csv

Id,SalePrice
1461,169277.0524984
1462,187758.393988768
1463,183583.683569555
1464,179317.47751083


In [0]:
# Read the csv files using pandas
import pandas as pd
import numpy as np
df_tr = pd.read_csv('train.csv')
df_te = pd.read_csv('test.csv')


In [8]:
# Examine the contents of train.csv
df_tr.info()  # info() prints its summary itself and returns None
df_tr.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [9]:
# Examine the contents of test.csv

df_te.info()  # info() prints its summary itself and returns None
df_te.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id               1459 non-null int64
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
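
The `info()` listings above show that some columns are only partly filled (for example, `LotFrontage` has 1201 non-null values out of 1460 in the training data, and `Alley` has 91). A quick way to list only the columns with gaps is sketched below on a small toy frame; the frame `toy` and its values are illustrative assumptions, and in the notebook the same two lines could be applied to `df_tr` or `df_te`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_tr; the values are illustrative only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```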


In [0]:
# Helper functions adapted from the fastai library
from sklearn.ensemble import RandomForestRegressor as RF
from IPython.display import display


def display_all(df):
    # Show a dataframe without pandas truncating rows or columns
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)


def add_date_part(df, fldnm, drop=True):
    # Expand a date column into numeric parts (not used for this dataset)
    fld = df[fldnm]
    if not np.issubdtype(fld.dtype, np.datetime64):
        df[fldnm] = fld = pd.to_datetime(fld)
    dateparts = ['day', 'week', 'month', 'year', 'dayofyear', 'dayofweek', 'is_month_start', 'is_month_end',
                 'is_quarter_start', 'is_quarter_end', 'is_year_start', 'is_year_end']
    for f in dateparts:
        # Series.dt.week was removed in pandas 2.0; isocalendar().week replaces it
        df[fldnm + '_' + f] = fld.dt.isocalendar().week if f == 'week' else getattr(fld.dt, f)
    df[fldnm + '_' + 'elapsed'] = fld.astype(np.int64)
    if drop:
        df.drop(fldnm, inplace=True, axis=1)


def train_cats(df):
    # Convert string columns to ordered categoricals
    for n, c in df.items():
        if pd.api.types.is_string_dtype(c):
            df[n] = c.astype('category').cat.as_ordered()
    return df


def apply_cats(df, trn):
    # Give df the same category levels as the training frame so that the codes line up
    for n, c in df.items():
        if n in trn.columns and isinstance(trn[n].dtype, pd.CategoricalDtype):
            # set_categories(..., inplace=True) was removed in pandas 2.0, so assign the result
            df[n] = c.astype('category').cat.set_categories(trn[n].cat.categories, ordered=True)
    return df


def numericalize(df):
    # Replace every non-numeric column with its integer category codes (missing becomes -1)
    for n, c in df.items():
        if not pd.api.types.is_numeric_dtype(c):
            df[n] = pd.Categorical(c).codes
    return df
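
The point of pairing `train_cats` with `apply_cats` is that a given string must map to the same integer code in the training and test frames. The sketch below shows the idea in isolation on toy data (the frames `train` and `test` and their values are assumptions for illustration): the category levels are learned from the training column only, and both columns are then encoded against those levels.

```python
import pandas as pd

# Toy data: "RH" appears in test but never in train
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})
test = pd.DataFrame({"MSZoning": ["RM", "RH", "RL"]})

# Learn the category levels from the training column only
levels = train["MSZoning"].astype("category").cat.categories

# Encode both frames against the same levels so the codes are comparable
train_codes = pd.Categorical(train["MSZoning"], categories=levels).codes
test_codes = pd.Categorical(test["MSZoning"], categories=levels).codes

print(list(levels))       # ['FV', 'RL', 'RM']
print(list(train_codes))  # [1, 2, 0, 1]
print(list(test_codes))   # [2, -1, 1]
```

A value unseen in training ("RH" here) gets the code -1, the same code pandas uses for missing values, so the model cannot tell the two cases apart.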

In [0]:
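# NOTE: this cell appears to be left over from a different notebook. The house-prices data has
# no "Name" column, so running it raises a KeyError. It is kept commented out for reference.
# import re
# df_tr["Title"] = df_tr["Name"].apply(lambda x: re.search(r' ([A-Za-z]+)\.', x).group(1))
# df_te["Title"] = df_te["Name"].apply(lambda x: re.search(r' ([A-Za-z]+)\.', x).group(1))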


In [0]:

df_tr = train_cats(df_tr)
df_te = apply_cats(df_te,df_tr)
df_tr = numericalize(df_tr)
df_tr = df_tr.fillna(-1)
df_te = numericalize(df_te)
df_te = df_te.fillna(-1)


In [38]:
# Partition the training data into features (independent variables) and the log of SalePrice (dependent variable)
X = np.asarray(df_tr.drop(['Id','SalePrice'],axis=1))
yhat = np.log(np.asarray(df_tr['SalePrice']))
print (X.shape,yhat.shape)

(1460, 79) (1460,)
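
The target is log-transformed here, presumably because the competition scores submissions by RMSE on the logarithm of the price (worth confirming on the competition's evaluation page). The sketch below shows that `np.exp` undoes the transform and what an RMSE on log prices looks like. The helper `rmse_log` is a hypothetical name, and the three prices are the first `SalePrice` values shown in the `df_tr.head()` output above.

```python
import numpy as np

def rmse_log(pred_price, true_price):
    # Root-mean-squared error between log prices (hypothetical helper)
    return np.sqrt(np.mean((np.log(pred_price) - np.log(true_price)) ** 2))

price = np.array([208500.0, 181500.0, 223500.0])

# Training on log prices and exponentiating predictions is a round trip
log_price = np.log(price)
restored = np.exp(log_price)

# On the log scale, doubling every price is a constant error of log(2)
err_double = rmse_log(2 * price, price)
print(err_double)
```

Because the error is measured on the log scale, a prediction that is off by a given ratio costs the same for a cheap house as for an expensive one.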


In [0]:

np.random.seed(2)
# Build a random boolean mask that puts roughly 10% of the labelled data in the validation set
validx = (np.random.uniform(size=len(X)) <= 0.1)

# Create training set (roughly 90% of the labelled data)
X_trn = X[~validx]
y_trn = yhat[~validx]

# Create validation set (roughly 10% of the labelled data)
X_val = X[validx]
y_val = yhat[validx]

# Create the test set
X_tes = np.asarray(df_te.drop(['Id'],axis=1))
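
The uniform-threshold mask used above assigns each row to the validation set independently, so the validation size varies from seed to seed. If a fixed-size split is preferred, scikit-learn's `train_test_split` is one alternative. The sketch below compares the two on placeholder arrays (`X_demo` and `y_demo` are random stand-ins, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(2)
X_demo = rng.normal(size=(1460, 3))  # placeholder features
y_demo = rng.normal(size=1460)       # placeholder target

# Threshold mask: each row lands in validation with probability 0.1,
# so the number of validation rows is random
mask = rng.uniform(size=len(X_demo)) <= 0.1
print(mask.sum(), (~mask).sum())

# train_test_split: the validation size is fixed by test_size
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=2)
print(len(X_a), len(X_b))
```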

In [75]:
from sklearn.ensemble import RandomForestRegressor as RF
m = RF(n_estimators=100, n_jobs=-1, oob_score=True, max_depth=15, min_samples_split=5, max_features=0.4)
m.fit(X_trn, y_trn)
print(m.score(X_val, y_val))    # R^2 on the validation set (log prices)
print(m.oob_score_)             # R^2 estimated from out-of-bag samples
res = np.exp(m.predict(X_tes))  # undo the log transform to get prices

0.9153476614463018
0.8732729148621134
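
The two numbers printed above are both R^2 values: `score` evaluates the forest on the held-out validation rows, while `oob_score_` evaluates each training row using only the trees that did not see it during bootstrapping. The sketch below reproduces both quantities on synthetic data (the arrays and the model settings are assumptions for illustration, not the notebook's configuration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression problem (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(300, 4))
y_demo = 3 * X_demo[:, 0] + np.sin(6 * X_demo[:, 1]) + rng.normal(scale=0.1, size=300)

X_fit, y_fit = X_demo[:240], y_demo[:240]
X_hold, y_hold = X_demo[240:], y_demo[240:]

rf = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
rf.fit(X_fit, y_fit)

holdout_r2 = rf.score(X_hold, y_hold)            # R^2 on held-out rows
same_r2 = r2_score(y_hold, rf.predict(X_hold))   # the same quantity computed explicitly
oob_r2 = rf.oob_score_                           # R^2 from out-of-bag predictions
print(holdout_r2, oob_r2)
```

The out-of-bag estimate needs no separate validation set, which can be useful when the labelled data is small.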


In [76]:
res

array([122424.71211496, 152655.39465517, 176264.87069436, ...,
       157908.63538141, 108857.79290006, 229088.56133757])

In [0]:
# Convert the results to a pandas dataframe
sub = pd.DataFrame({"Id":df_te['Id'],"SalePrice":res})

# Create the submission csv file from the dataframe
sub.to_csv("sub.csv",index=False)
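
Before submitting, it can be worth checking that the submission frame has the same layout as `sample_submission.csv` (an `Id` column and a `SalePrice` column, as shown earlier). The sketch below is one way to do that. `check_submission` is a hypothetical helper, and the two small frames are toy stand-ins for the sample file and for `sub`.

```python
import pandas as pd

# Toy stand-ins: `sample` mimics sample_submission.csv, `sub_demo` mimics `sub`
sample = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [0.0, 0.0, 0.0]})
sub_demo = pd.DataFrame({"Id": [1461, 1462, 1463],
                         "SalePrice": [122424.7, 152655.4, 176264.9]})

def check_submission(sub, sample):
    # Hypothetical helper: raise AssertionError if the layout does not match
    assert list(sub.columns) == list(sample.columns), "column names differ"
    assert len(sub) == len(sample), "row count differs"
    assert sub["Id"].equals(sample["Id"]), "Ids differ or are out of order"
    assert sub["SalePrice"].notna().all(), "missing predictions"
    assert (sub["SalePrice"] > 0).all(), "non-positive prices"
    return True

ok = check_submission(sub_demo, sample)
print(ok)
```

In the notebook, the equivalent call would compare `sub` against `pd.read_csv('sample_submission.csv')`.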

In [78]:
# Submit the csv file to kaggle using the kaggle api
!kaggle competitions submit -c house-prices-advanced-regression-techniques -f sub.csv -m "submission_1"



100% 33.7k/33.7k [00:02<00:00, 13.2kB/s]
Successfully submitted to House Prices: Advanced Regression Techniques