## Now that I have figured out how to install XGBoost as well as run it cleanly, this notebook will be a cleaned-up version of the previous notebook. I'm going to take out all of the exploratory stuff and go straight to the heart of the matter. I'm also going to clean up the imports, test and train data frames as well as the null columns using the train data set. Go line by line and make sure everything is straightforward and decently commented. 

In [1]:
#The usual imports..
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
import time                     # Using time to check how long it takes to train model
import gc

In [2]:
train_df = pd.read_csv('./data/train.csv', parse_dates=['timestamp'])
test_df = pd.read_csv('./data/test.csv', parse_dates=['timestamp'])

In [3]:
print(train_df.shape)
print(test_df.shape)
print(test_df.columns)
print(train_df.columns)

(30471, 292)
(7662, 291)
Index(['id', 'timestamp', 'full_sq', 'life_sq', 'floor', 'max_floor',
       'material', 'build_year', 'num_room', 'kitch_sq',
       ...
       'cafe_count_5000_price_1500', 'cafe_count_5000_price_2500',
       'cafe_count_5000_price_4000', 'cafe_count_5000_price_high',
       'big_church_count_5000', 'church_count_5000', 'mosque_count_5000',
       'leisure_count_5000', 'sport_count_5000', 'market_count_5000'],
      dtype='object', length=291)
Index(['id', 'timestamp', 'full_sq', 'life_sq', 'floor', 'max_floor',
       'material', 'build_year', 'num_room', 'kitch_sq',
       ...
       'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000',
       'cafe_count_5000_price_high', 'big_church_count_5000',
       'church_count_5000', 'mosque_count_5000', 'leisure_count_5000',
       'sport_count_5000', 'market_count_5000', 'price_doc'],
      dtype='object', length=292)


In [4]:
train_df.isnull().sum()

id                                           0
timestamp                                    0
full_sq                                      0
life_sq                                   6383
floor                                      167
max_floor                                 9572
material                                  9572
build_year                               13605
num_room                                  9572
kitch_sq                                  9572
state                                    13559
product_type                                 0
sub_area                                     0
area_m                                       0
raion_popul                                  0
green_zone_part                              0
indust_part                                  0
children_preschool                           0
preschool_quota                           6688
preschool_education_centers_raion            0
children_school                              0
school_quota 

In [5]:
train_df = train_df.fillna(train_df.mean())
train_df.isnull().sum()

id                                       0
timestamp                                0
full_sq                                  0
life_sq                                  0
floor                                    0
max_floor                                0
material                                 0
build_year                               0
num_room                                 0
kitch_sq                                 0
state                                    0
product_type                             0
sub_area                                 0
area_m                                   0
raion_popul                              0
green_zone_part                          0
indust_part                              0
children_preschool                       0
preschool_quota                          0
preschool_education_centers_raion        0
children_school                          0
school_quota                             0
school_education_centers_raion           0
school_educ

### Now, to determine whether the test set has missing data, and how well it matches up with the training set. I hope to fill any missing data with the mean from the training set. To remind myself:
train_df - full data frame of training data

X_train - 80% sample of the training data

X_valid - 20% sample of training data (cross-validation set)

test_df - full data frame of the test data. (no labels)

In [6]:
#test_df[test_df.isnull()].dtypes
train_df['product_type'].value_counts()

Investment       19448
OwnerOccupier    11023
Name: product_type, dtype: int64

In [7]:
# We see that there is roughly an even split between investment and owneroccupier in the training data.
# I'll fill the test data frame missing values with "Investment" and move forward
test_df['product_type'].fillna('Investment', inplace=True)

In [8]:
test_df['sub_area'].values.max()

'Zjuzino'

In [9]:
for col in test_df.columns:
    if test_df[col].isnull().sum() != 0:
        test_df[col] = test_df[col].fillna(train_df[col].mean())
test_df.isnull().sum()

id                                       0
timestamp                                0
full_sq                                  0
life_sq                                  0
floor                                    0
max_floor                                0
material                                 0
build_year                               0
num_room                                 0
kitch_sq                                 0
state                                    0
product_type                             0
sub_area                                 0
area_m                                   0
raion_popul                              0
green_zone_part                          0
indust_part                              0
children_preschool                       0
preschool_quota                          0
preschool_education_centers_raion        0
children_school                          0
school_quota                             0
school_education_centers_raion           0
school_educ

### Well, it looks like I took care of the null values in the test dataframe. All I had to do was sort out the product_type column, it seems. I'll take it! 

In [10]:
# Saw this interesting method to create a cross validation set from a single data frame. 
df2split = train_df
msk = np.random.rand(len(df2split)) < 0.8
X_train = df2split[msk]
y_train = X_train.price_doc
X_valid = df2split[~msk]
y_valid = X_valid.price_doc
#X_train = X_train.drop('price_doc', axis=1)  #Not sure if I should drop the labels from train / validation sets.
#X_valid = X_valid.drop('price_doc', axis=1)

In [11]:
X_train.shape

(24318, 292)

In [12]:
X_valid.shape

(6153, 292)

In [13]:
#Set up some global variables for cleaning up the data
target = 'price_doc'
IDcol = 'id'
timestamp = 'timestamp'

In [14]:
X_train = X_train.drop('timestamp', axis=1)
test_df = test_df.drop('timestamp', axis=1)
train_df = train_df.drop('timestamp', axis=1)

In [15]:
print(X_train.shape)
print(train_df.shape)
print(test_df.shape)

(24318, 291)
(30471, 291)
(7662, 290)


In [16]:
# Clear up the categorical variables and create dummy columns of the categorical variables
s = train_df.dtypes
object_columns = s[s.values == 'object'].index.values

In [17]:
# Continue with cleaning up categorical stuff.
for i in object_columns:
    train_df[i] = train_df[i].astype('category')
    X_train[i] = X_train[i].astype('category')
    test_df[i] = test_df[i].astype('category')

In [None]:
# Create dummy variables for the categorical columns
train_df = pd.get_dummies(train_df)
X_train = pd.get_dummies(X_train)
test_df = pd.get_dummies(test_df)

In [20]:
# Prep the train dataframe for xgboost... it is a quirky model..
label = train_df.price_doc
predictors = [x for x in train_df.columns if x not in [target, IDcol]]
train_df = train_df[predictors]
dtrain = train_df.as_matrix()

AttributeError: 'DataFrame' object has no attribute 'price_doc'

In [21]:
# Set up the parameter dictionary for xgboost
params = {}
params['eta'] = 0.1
params['objective'] = 'reg:linear'
params['max_depth'] = 5
params['min_child_weight'] = 1
params['gamma'] = 0
params['eval_metric'] = 'mae'
params['updater'] = 'grow_gpu'

num_round = 20
xgtrain = xgb.DMatrix(dtrain, label=label)
tmp = time.time()
bst = xgb.train(params, xgtrain, num_round)
boost_time = time.time() - tmp
print('Train time: %s sec' % (str(boost_time)))
xgb.plot_importance(bst, max_num_features=20)
plt.show()

ValueError: could not convert string to float: 'poor'

In [None]:
xgtest = xgb.DMatrix(test_df)
preds = bst.predict(xgtest)