# Homework 6

In this homework I will explore a subset of the Ames Housing Dataset, which is available on Kaggle. To reduce the time spent on parameter tuning and data cleaning, I'll use a smaller subset of predictors. I'm accessing this data through the Kaggle API to save space on my computer. I'll implement an Elastic Net, Linear Regression, KNN Regression, Random Forest, and Gradient-Boosting Machine. Most of the code below is data acquisition and cleaning; I'll add comments where necessary, but this can largely be skipped.



The Ames Housing Dataset is a famous dataset in machine learning with over 80 columns which characterize houses. The size of the lot, number of bathrooms, and house's sale price are some of the variables included. The training and testing data were accessed from [this](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. The Kaggle page has a detailed outline of the data, but the only variables I'll use are the following:


SalePrice: The property's sale price in dollars. This is the target variable that I'm trying to predict.

LotArea: Lot size in square feet

BldgType: Type of dwelling

HouseStyle: Style of dwelling

YearBuilt: Original construction date

Heating: Type of heating

CentralAir: Central air conditioning

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Fireplaces: Number of fireplaces

GarageArea: Size of garage in square feet

PoolArea: Pool area in square feet

YrSold: Year Sold

In [None]:
! pip install -q kaggle

In [None]:
from google.colab import files
files.upload()
# import kaggle API token w/ upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"calvinsmith625","key":"<redacted>"}'}

In [None]:
# create a local directory called .kaggle (-p avoids an error if it already exists)
! mkdir -p ~/.kaggle
# add kaggle.json token to that directory
! cp kaggle.json ~/.kaggle/


In [None]:
# restrict the token file's permissions so that only the owner can read and write it
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# check my access to the kaggle API
! kaggle datasets list

ref                                                         title                                              size  lastUpdated          downloadCount  
----------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  
gpreda/reddit-vaccine-myths                                 Reddit Vaccine Myths                              221KB  2021-03-28 09:42:48           1148  
crowww/a-large-scale-fish-dataset                           A Large Scale Fish Dataset                          3GB  2021-02-17 16:10:44            866  
dhruvildave/wikibooks-dataset                               Wikibooks Dataset                                   1GB  2021-02-18 10:08:27            743  
imsparsh/musicnet-dataset                                   MusicNet Dataset                                   22GB  2021-02-18 14:12:19            324  
alsgroup/end-als                                            End ALS Kaggle C

In [None]:
# load desired dataset
! kaggle competitions download -c house-prices-advanced-regression-techniques
# this will make the data available in the files panel, in other cases it may need to be unzipped

data_description.txt: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
import pandas as pd
import numpy as np

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
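# upload solution.csv: the test set's true SalePrice values, obtained from a separate source
files.upload()
test_answers = pd.read_csv('solution.csv')
# these values are merged onto the test data below so that train and test can be combined into one dataset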

Saving solution.csv to solution (1).csv


In [None]:
display(train.shape, test.shape)

(1460, 81)

(1459, 80)

In [None]:
test = test.merge(test_answers,
           how='left',
           on='Id')

In [None]:
data = pd.concat([train, test], axis=0)

I'll only use the columns below for simplicity's sake: the dataset has over 80 variables, and I've narrowed it down to the few listed. The categorical variables require one-hot encoding, which I'll perform below. Pandas' get_dummies() function creates a dummy variable for each category in a categorical column, and the drop_first argument drops the first category, so a column with N categories gets N-1 dummy columns. Therefore, the CentralAir column becomes a single binary variable where 1 = Yes (Y), there is Central Air, and 0 = No Central Air.
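To make the N-1 encoding concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the Ames data). Unlike the cell below, it passes the whole frame with `columns=` so the dummy names carry a prefix; that is my own variation, shown only because it keeps the new column names unambiguous.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'BldgType': ['1Fam', 'Duplex', 'Twnhs'],
    'CentralAir': ['Y', 'N', 'Y'],
})

# drop_first=True drops the first (alphabetically sorted) category of each
# encoded column, so a column with N categories yields N-1 dummy columns.
encoded = pd.get_dummies(toy, columns=['BldgType', 'CentralAir'],
                         drop_first=True, dtype=int)
print(encoded)
```

Here `BldgType` loses its `1Fam` level and `CentralAir` loses `N`, so a row of all zeros in the `BldgType_*` columns means the dropped category.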

In [None]:
data = data[['Id', 'LotArea', 'BldgType', 'HouseStyle', 'YearBuilt', 'Heating', 'CentralAir', 'FullBath',
               'HalfBath', 'Fireplaces', 'GarageArea', 'PoolArea', 'YrSold', 'SalePrice']]

In [None]:
cat_cols = ['BldgType', 'HouseStyle', 'Heating', 'CentralAir']
dummies = [pd.get_dummies(data[x], drop_first=True) for x in cat_cols]

In [None]:
dummies = pd.concat(dummies, axis=1)

In [None]:
data.drop(cat_cols, axis=1, inplace=True)
data = pd.concat([data, dummies], axis=1)

In [None]:
# drop every row that contains a NaN (in a predictor or in the true SalePrice values)
omit_rows = data.index[data.isna().any(axis=1)]
data = data.drop(omit_rows, axis=0)

In [None]:
data.reset_index(inplace=True, drop=True)
data

Unnamed: 0,Id,LotArea,YearBuilt,FullBath,HalfBath,Fireplaces,GarageArea,PoolArea,YrSold,SalePrice,2fmCon,Duplex,Twnhs,TwnhsE,1.5Unf,1Story,2.5Fin,2.5Unf,2Story,SFoyer,SLvl,GasA,GasW,Grav,OthW,Wall,Y
0,1,8450,2003,2,1,0,548.0,0,2008,208500.0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
1,2,9600,1976,2,0,1,460.0,0,2007,181500.0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
2,3,11250,2001,2,1,1,608.0,0,2008,223500.0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
3,4,9550,1915,1,0,1,642.0,0,2006,140000.0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
4,5,14260,2000,2,1,1,836.0,0,2008,250000.0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2912,2915,1936,1970,1,1,0,0.0,0,2006,90500.0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1
2913,2916,1894,1970,1,1,0,286.0,0,2006,71000.0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1
2914,2917,20000,1960,1,0,1,576.0,0,2006,131000.0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
2915,2918,10441,1992,1,0,0,0.0,0,2006,132000.0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1


In [None]:
y = data['SalePrice']
X = data.drop(['SalePrice', 'Id'], axis=1)

from sklearn.model_selection import train_test_split

def trainSets(x_data, y_data, test_size):
    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data,
        test_size=test_size, train_size=1-test_size,
        random_state=611, shuffle=True)
    return x_train, x_test, y_train, y_test

# unpack the four pieces of the 80/20 split directly
x_train, x_test, y_train, y_test = trainSets(X, y, .2)

In [None]:
display(x_train, x_test)

Unnamed: 0,LotArea,YearBuilt,FullBath,HalfBath,Fireplaces,GarageArea,PoolArea,YrSold,2fmCon,Duplex,Twnhs,TwnhsE,1.5Unf,1Story,2.5Fin,2.5Unf,2Story,SFoyer,SLvl,GasA,GasW,Grav,OthW,Wall,Y
1564,20062,1977,1,1,2,690.0,0,2010,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
1909,13607,1986,2,1,1,501.0,0,2009,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1
827,8529,2001,2,0,1,527.0,0,2009,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
385,3182,2004,2,0,1,430.0,0,2010,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1
1470,1680,1971,1,1,0,264.0,0,2010,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1792,8010,1958,1,0,0,480.0,0,2009,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
297,7399,1997,2,1,1,576.0,0,2007,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
1266,13214,2008,2,0,1,746.0,0,2010,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
698,8450,1965,1,0,1,336.0,0,2010,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1


Unnamed: 0,LotArea,YearBuilt,FullBath,HalfBath,Fireplaces,GarageArea,PoolArea,YrSold,2fmCon,Duplex,Twnhs,TwnhsE,1.5Unf,1Story,2.5Fin,2.5Unf,2Story,SFoyer,SLvl,GasA,GasW,Grav,OthW,Wall,Y
1958,10295,1969,1,0,0,288.0,0,2008,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
2880,8170,1929,2,0,1,451.0,0,2006,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
74,5790,1915,2,0,0,379.0,0,2010,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
1503,8000,2002,2,0,1,596.0,0,2010,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
2619,9783,1996,2,1,1,443.0,0,2006,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2848,21533,1996,2,1,1,467.0,0,2006,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
1013,7200,1910,1,0,0,280.0,0,2009,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
1456,9042,1941,2,0,2,252.0,0,2010,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
2861,11060,2003,2,1,1,502.0,0,2006,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1


In [None]:
from sklearn.metrics import explained_variance_score, mean_squared_error, max_error, mean_absolute_error

# Monte Carlo validation: score an already-fitted model on a fresh random subset of the data.
# Note that the model is not refit here, so held-out rows may overlap its original training set.
def monte_carlo_valid(model, dat, response, pct_data_split=0.8):
    # randomly flag roughly pct_data_split of the rows as the "train" side
    inds = np.random.rand(len(dat)) < pct_data_split
    # only the held-out rows are used for scoring
    x_test = dat[~inds]
    y_test = response[~inds]

    ypred = model.predict(x_test)

    # one-row DataFrame of metrics, so that repeated runs can be concatenated
    df = {'Explained_Variance': [np.round(explained_variance_score(y_test, ypred),5)],
          'Max_Error': [np.round(max_error(y_test, ypred),5)],
          'MSE': [np.round(mean_squared_error(y_test, ypred),5)],
          'MAE': [np.round(mean_absolute_error(y_test, ypred),5)]}
    df = pd.DataFrame(data=df)
    return df

# Elastic Net

The first model I'll build is a 5-Fold Cross-Validated Elastic Net Regression. You'll see the [Explained Variance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score), [Max Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.max_error.html#sklearn.metrics.max_error), [Mean-Absolute-Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error), and [Mean-Squared-Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) printed as the results for each model.

In [None]:
from sklearn.metrics import explained_variance_score, mean_squared_error, max_error, mean_absolute_error
def eval_metrics(y_true, y_pred):
  print('Explained Variance: ', np.round(explained_variance_score(y_true, y_pred),2))
  print('Max Error: ', np.round(max_error(y_true, y_pred),2))
  print('MSE: ', np.round(mean_squared_error(y_true, y_pred),2))
  print('MAE: ', np.round(mean_absolute_error(y_true, y_pred),2))

In [None]:
from sklearn.linear_model import ElasticNetCV
eNet = ElasticNetCV(cv=5, random_state=0)
eNet.fit(x_train, y_train)
eNet_preds = eNet.predict(x_test)
eval_metrics(y_test, eNet_preds)

Explained Variance:  0.2
Max Error:  532117.39
MSE:  5037233739.27
MAE:  51517.7


The Elastic Net does not perform well by any metric on the testing data. We can check these results with the Monte-Carlo validation function built above. I'll run the simulation 50 times, with roughly 20% of the data held out each time. As written, the function draws a new random split and scores the already-fitted model on the held-out portion; it does not refit the model, so some held-out rows were also part of the original training set.
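For comparison, a variant that refits a fresh copy of the model on every split could look like the sketch below. This is not the function used in this notebook: `monte_carlo_refit` is a hypothetical name, the data is synthetic, and it reports only MAE to stay short. It relies on scikit-learn's `clone` and `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def monte_carlo_refit(model, X, y, n_runs=50, test_size=0.2, seed=0):
    """Hypothetical variant: refit an unfitted copy of `model` on every random split."""
    rng = np.random.RandomState(seed)
    maes = []
    for _ in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=rng.randint(0, 2**31 - 1))
        fitted = clone(model).fit(X_tr, y_tr)   # the held-out rows are never seen in fitting
        maes.append(mean_absolute_error(y_te, fitted.predict(X_te)))
    return pd.Series(maes, name='MAE')

# Synthetic demo data (assumption: a linear signal plus small noise).
demo_rng = np.random.RandomState(1)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + demo_rng.normal(scale=0.1, size=200)

mae_runs = monte_carlo_refit(LinearRegression(), X_demo, y_demo, n_runs=10)
print(mae_runs.describe())
```

Cloning a `GridSearchCV` object this way would rerun the whole grid search on every split, so for the tuned models it may be cheaper to clone only the best estimator.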

In [None]:
# run the monte carlo function 50 times
results = [monte_carlo_valid(eNet, X, y) for p in range(0,50)]
# concat all rows into one df
results = pd.concat(results, axis=0)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
results.describe()

Unnamed: 0,Explained_Variance,Max_Error,MSE,MAE
count,50.0,50.0,50.0,50.0
mean,0.2,437087.39,5159789100.13,51400.32
std,0.02,83187.18,591356393.98,1963.03
min,0.17,269009.62,4126127874.38,46959.44
25%,0.19,368918.89,4702250273.5,49681.18
50%,0.2,414043.84,5177912530.75,51716.95
75%,0.22,532117.39,5579762288.69,52528.77
max,0.25,536404.39,6810428887.2,57828.1


The Elastic Net's poor results were not a fluke. The error ranges were consistent with what we saw above. The model explained only ~20% of the variance in SalePrice, and the average prediction was off by over $50,000. Let's see how that compares to other models.

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
print('R^2: ', reg.score(x_train, y_train))
lin_mod_preds = reg.predict(x_test)
eval_metrics(y_test, lin_mod_preds)

R^2:  0.6500159994811975
Explained Variance:  0.63
Max Error:  409893.2
MSE:  2323348538.56
MAE:  33742.58


# OLS Regression

A simple OLS Regression appears to outperform the Elastic Net. I've displayed the R-Squared (computed here on the training data) for context and to show how it differs from the Explained Variance, which is a slightly different calculation.
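The difference between the two calculations is easiest to see with biased predictions. The sketch below re-implements both scores in NumPy following the definitions in the scikit-learn documentation (the helper names are mine): R-Squared uses the raw squared residuals, while Explained Variance uses the variance of the residuals, so a constant offset in the predictions hurts the first but not the second.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # 1 - SS_res / SS_tot: any systematic bias in the residuals is penalised
    resid = y_true - y_pred
    return 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y): the mean of the residuals is subtracted out
    resid = y_true - y_pred
    return 1 - np.var(resid) / np.var(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0   # every prediction is too high by exactly 1

r2 = r_squared(y_true, y_pred)
ev = explained_variance(y_true, y_pred)
print(round(r2, 2), round(ev, 2))
```

When the residuals average to zero the two scores coincide, which is why they are usually close for a linear model with an intercept.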

In [None]:
# run the monte carlo function 50 times
results = [monte_carlo_valid(reg, X, y) for p in range(0,50)]
# concat all rows into one df
results = pd.concat(results, axis=0)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
results.describe()

Unnamed: 0,Explained_Variance,Max_Error,MSE,MAE
count,50.0,50.0,50.0,50.0
mean,0.65,322992.09,2259569678.05,33120.51
std,0.02,59023.8,321131874.5,1414.66
min,0.59,174095.77,1660454175.41,30567.45
25%,0.63,296657.79,2041824091.39,32117.63
50%,0.65,308809.08,2193046462.68,32810.25
75%,0.66,379200.38,2427688916.14,34002.43
max,0.7,409893.2,3259978353.33,37543.74


The simulation supports the conclusion that OLS is more effective than the simple implementation of the Elastic Net. However, there's still plenty of room for more refined models to improve on it. Next I'll show the results of a K-Nearest Neighbor Regression.

# K-Nearest Neighbors

In [None]:
def bestParams(model):
    best_params = model.best_params_
    return best_params

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
knn = KNeighborsRegressor()
knn.fit(x_train, y_train)

parameters = {'n_neighbors': [3,5,7,10,15,20],
              'weights': ['uniform', 'distance']}

knn = GridSearchCV(knn, parameters, cv=5, scoring='neg_mean_squared_error', refit=True)
knn.fit(x_train, y_train)
knn_preds = knn.predict(x_test)
eval_metrics(y_test, knn_preds)
print('Optimal Parameters: ', bestParams(knn))

Explained Variance:  0.48
Max Error:  567706.76
MSE:  3264175811.66
MAE:  37958.34
Optimal Parameters:  {'n_neighbors': 10, 'weights': 'distance'}


Through a Cross-Validated Grid Search I've tuned the number of neighbors and the weights parameter to minimize the Mean Squared Error. Of the options provided, 10 neighbors performed best, and weighting the neighbors by inverse distance rather than uniformly improved cross-validated accuracy.
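As a rough illustration of what the two `weights` options do, here is a minimal NumPy sketch of a single KNN prediction. `knn_predict` is a hypothetical helper written for this example, not scikit-learn's implementation, and the three training points are made up.

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=3, weights='uniform'):
    """Predict one target as a (weighted) average of the k nearest training targets."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    if weights == 'uniform':
        return y_train[nearest].mean()
    # 'distance': weight each neighbour by 1 / distance
    d = dists[nearest]
    if np.any(d == 0):
        # an exact match dominates: return the target(s) of the matching point(s)
        return y_train[nearest][d == 0].mean()
    w = 1.0 / d
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Made-up data: one feature, three houses.
X_toy = np.array([[1.0], [2.0], [4.0]])
y_toy = np.array([100.0, 200.0, 400.0])
query = np.array([2.5])

pred_uniform = knn_predict(query, X_toy, y_toy, k=3, weights='uniform')
pred_distance = knn_predict(query, X_toy, y_toy, k=3, weights='distance')
print(pred_uniform, pred_distance)
```

With distance weighting the closest house (at 2.0) pulls the prediction toward its own price, and a query that coincides with a training row simply gets that row's target back.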

In [None]:
# run the monte carlo function 50 times
results = [monte_carlo_valid(knn, X, y) for p in range(0,50)]
# concat all rows into one df
results = pd.concat(results, axis=0)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
results.describe()

Unnamed: 0,Explained_Variance,Max_Error,MSE,MAE
count,50.0,50.0,50.0,50.0
mean,0.89,279347.82,702993425.64,7894.19
std,0.04,157032.02,305674249.84,1036.2
min,0.8,127747.62,331219239.54,5810.26
25%,0.88,183136.5,512889617.47,7108.09
50%,0.91,226928.95,582659737.02,7790.6
75%,0.92,228038.54,736942224.35,8573.12
max,0.95,567706.76,1416517014.0,10791.99


Our Monte-Carlo simulation shows fantastic improvements for KNN over OLS and Elastic Net. Every metric improved substantially. Now we turn our attention to the focus of this assignment: the Random Forest and Boosting algorithms.

# Random Forest
I'll tune the number of estimators, the max depth of a tree, and the minimum number of samples required to split a node. For the sake of training time I've chosen a fairly limited range of values for each of these parameters. The values are chosen via a Cross-Validated Grid Search over a list of candidates.

Where possible, I included the default value in each range (the default max_depth, None, places no limit on depth and so is not in the grid) to show that the defaults are often not optimal.
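To give a sense of why the ranges are kept small, the sketch below enumerates the grid used in the next cell with `itertools.product` and counts the model fits implied by 5-fold cross-validation. It only counts candidates; it does not train anything.

```python
from itertools import product

param_grid = {'n_estimators': [50, 75, 100, 200],
              'max_depth': [2, 5, 10, 15],
              'min_samples_split': [2, 25, 50, 100]}
cv_folds = 5

# every combination of one value per parameter is one candidate model
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * cv_folds

print(len(combos), 'candidates,', n_fits, 'cross-validation fits')
```

Adding a fourth parameter with four values would multiply both counts by four, which is why the Gradient-Boosting grid later in the notebook takes noticeably longer.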

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=33)
rf.fit(x_train, y_train)

parameters = {'n_estimators': [50,75,100,200],
              'max_depth': [2,5,10,15],
              'min_samples_split': [2,25,50,100]}

rf = GridSearchCV(rf, parameters, cv=5, scoring='neg_mean_squared_error', refit=True)
rf.fit(x_train, y_train)
rf_preds = rf.predict(x_test)
eval_metrics(y_test, rf_preds)
print('Optimal Parameters: ', bestParams(rf))

Explained Variance:  0.75
Max Error:  269778.24
MSE:  1555339494.02
MAE:  25541.02
Optimal Parameters:  {'max_depth': 15, 'min_samples_split': 2, 'n_estimators': 200}


In [None]:
# run the monte carlo function 50 times
results = [monte_carlo_valid(rf, X, y) for p in range(0,50)]
# concat all rows into one df
results = pd.concat(results, axis=0)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
results.describe()

Unnamed: 0,Explained_Variance,Max_Error,MSE,MAE
count,50.0,50.0,50.0,50.0
mean,0.92,174497.53,482034100.69,12920.24
std,0.01,51742.74,76172999.12,612.16
min,0.9,116503.76,350651069.07,11617.82
25%,0.92,131088.32,428576959.9,12570.01
50%,0.92,175367.69,474470014.75,12950.82
75%,0.93,192201.26,532922470.69,13354.38
max,0.95,269778.24,691481298.33,14304.89


The grid search found that building 200 estimators (2x the default value) was optimal. It also selected the deepest trees from the range of values given. So, the Forest consists of 200 trees with a maximum depth of 15.

Interestingly, the results are mixed in comparison to the best-performing baseline model (KNN). The Random Forest's Explained Variance is marginally better, and its Max Error is lower by roughly $100,000. This could indicate that the Random Forest's worst predictions are not as far off as the KNN's, an idea supported by the Random Forest's lower MSE. However, KNN provides a better MAE. Because MSE punishes large errors more harshly than MAE does, this combination suggests that the KNN algorithm tends to predict a number either very close to or very far from the actual SalePrice. The Random Forest is off by more on average, but it is less prone to substantially inaccurate predictions.
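A small made-up example shows how one model can win on MAE while losing on MSE. The error values below are invented purely to illustrate the pattern; they are not taken from the models above.

```python
# Absolute prediction errors in dollars for two hypothetical models.
errors_steady = [10_000] * 10             # always off by $10,000
errors_spiky = [1_000] * 9 + [60_000]     # usually very close, one large miss

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    return sum(e * e for e in errors) / len(errors)

print('steady:', mae(errors_steady), mse(errors_steady))
print('spiky: ', mae(errors_spiky), mse(errors_spiky))
```

The spiky model has the lower MAE (6,900 vs 10,000) but the higher MSE, because squaring lets the single $60,000 miss dominate the average.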



In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(random_state=25)
gb.fit(x_train, y_train)

parameters = {'n_estimators': [50,75,100,200],
              'learning_rate': [.01, .05, .1, .2],
              'max_depth': [3, 6, 10, 15],
              'subsample': [.25, .5, .75, 1]}

gb = GridSearchCV(gb, parameters, cv=5, scoring='neg_mean_squared_error', refit=True)
gb.fit(x_train, y_train)
gb_preds = gb.predict(x_test)
eval_metrics(y_test, gb_preds)
print('Optimal Parameters: ', bestParams(gb))

Explained Variance:  0.75
Max Error:  235602.41
MSE:  1577888155.28
MAE:  26121.41
Optimal Parameters:  {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.75}


In [None]:
# run the monte carlo function 50 times
results = [monte_carlo_valid(gb, X, y) for p in range(0,50)]
# concat all rows into one df
results = pd.concat(results, axis=0)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
results.describe()

Unnamed: 0,Explained_Variance,Max_Error,MSE,MAE
count,50.0,50.0,50.0,50.0
mean,0.9,178151.52,659337267.04,17655.86
std,0.02,40531.7,76231109.08,689.68
min,0.86,96894.71,529143541.8,15906.78
25%,0.88,162795.32,605240958.86,17287.72
50%,0.9,173520.65,659197683.78,17557.94
75%,0.9,204104.67,705184481.11,18104.03
max,0.93,235602.41,859691237.86,19371.86


# Gradient-Boosted Regressor

For my Gradient-Boosted Regressor I tuned the number of estimators, learning rate, max tree depth, and the fraction of samples used to fit each base learner (subsample). Learning rate and subsample should be tuned whenever n_estimators is tuned, as they tend to interact with the number of estimators. Although n_estimators remained at the default value, the grid search lowered both the learning rate and subsample from their defaults. The learning rate shrinks the contribution of each individual tree, so a lower learning rate decreases the impact of any single tree. This balances the decrease in subsample from its default value: because a lower subsample fits each tree on a smaller fraction of the samples, variance decreases and bias increases. The grid search found this combination to be better than the default values used in the baseline model.
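The roles of the learning rate and subsample can be sketched with a stripped-down boosting loop. This is an illustration under simplifying assumptions, not scikit-learn's implementation: it uses squared-error loss, a single feature, one-split stumps instead of full trees, and synthetic data, and `fit_stump` and `boost` are hypothetical helpers.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_val, right_val = residual[left].mean(), residual[~left].mean()
        sse = (np.sum((residual[left] - left_val) ** 2)
               + np.sum((residual[~left] - right_val) ** 2))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_val, right_val)
    _, threshold, left_val, right_val = best
    return lambda q: np.where(q <= threshold, left_val, right_val)

def boost(x, y, n_estimators=100, learning_rate=0.05, subsample=0.75, seed=0):
    rng = np.random.default_rng(seed)
    init = y.mean()
    pred = np.full(len(y), init)
    stumps = []
    for _ in range(n_estimators):
        # subsample: each learner sees only a random fraction of the rows
        idx = rng.choice(len(y), size=max(2, int(subsample * len(y))), replace=False)
        stump = fit_stump(x[idx], y[idx] - pred[idx])   # fit the current residuals
        # learning rate: only a shrunken share of the learner is added
        pred = pred + learning_rate * stump(x)
        stumps.append(stump)
    return lambda q: init + learning_rate * sum(s(q) for s in stumps)

# Synthetic demo data (assumption: a noisy sine curve).
x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.default_rng(1).normal(scale=0.1, size=200)

model = boost(x, y)
baseline_mse = np.mean((y - y.mean()) ** 2)
model_mse = np.mean((y - model(x)) ** 2)
print(round(baseline_mse, 3), round(model_mse, 3))
```

Because each learner contributes only `learning_rate` times its fit, a smaller rate generally needs more estimators to reach the same training error, which is the interaction the paragraph above describes.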

Counter to my expectation, the Gradient-Boosted model did not outperform the 'simple' Random Forest. Its comparison to the KNN model was similar to the Random Forest's, but the Gradient-Boosted model posted a higher MAE, MSE, and Max Error than the Random Forest. So, on average it missed by a wider margin than the Random Forest.

# Final Comparison

While the Random Forest and Gradient-Boosted Regressor outperformed the Elastic Net and Linear Regression, their results were not substantially better than an optimized K-Nearest Neighbor approach. While KNN may not be considered a true 'baseline' model, one would expect a Random Forest or GBR to be more predictive than KNN. I suspect KNN is a very good approach for this dataset because it functions similarly to how a human prices a house. An agent walks through a house, notes the number of bathrooms and the square footage, and then shows a price comparison to houses with similar attributes. Because pricing a house from its characteristics resembles KNN predicting from the nearest 'comparisons' to the house in question, the KNN algorithm is well suited to this dataset. For instance, a house with two bathrooms and 2,000 square feet would be expected to be priced very similarly to other houses with those attributes. This is easy to detect with a KNN approach, as the prediction is based on the most similar neighbors. However, the difference between the ensemble methods and KNN was not large.
