**Introduction to Machine Learning Course from Kaggle:**

**Import Necessary Libraries**

In [32]:
import pandas as pd
import kagglehub
import os
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile

**Kaggle API Set Up**

In [2]:
# Point the Kaggle client at the directory that holds kaggle.json (the API key file)
os.environ['KAGGLE_CONFIG_DIR'] = '/path/to/.kaggle'

# Initialize Kaggle API
api = KaggleApi()
api.authenticate()

**Download Dataset from Kaggle**

In [3]:
dataset_name = "dansbecker/melbourne-housing-snapshot"
download_path = "C:/Users/boddyc/projects/pythonlearning/kaggleML"  # where you want to save the dataset
api.dataset_download_files(dataset_name, path=download_path, unzip=True)  # download and unzip the dataset
data_path = os.path.join(download_path, "melb_data.csv")  # path to the extracted CSV file

Dataset URL: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot


**Import and Summarise the Dataset**

In [4]:
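# read the data and store it in a DataFrame called melbourne_data
melbourne_data = pd.read_csv(data_path)
# compute summary statistics (a notebook cell displays only its last expression,
# so wrap this call in print() if you want to see it)
melbourne_data.describe()
# list the column names
melbourne_data.columns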

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

**Drop NA Values**

The Melbourne data has some missing values (some houses for which some variables weren't recorded).

We'll learn to handle missing values in a later tutorial. The Iowa data used in the course exercises doesn't have missing values in the columns we use, so for now we take the simplest option and drop every house that has a missing value.

Don't worry about this much for now; the code is:

In [5]:
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
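
As a small illustration of what `dropna(axis=0)` does, the sketch below applies it to a made-up four-row DataFrame (the values are hypothetical and not taken from the Melbourne data). Rows containing at least one missing value are removed, and the remaining rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four rows have a missing value
houses = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [1035000.0, np.nan, 1600000.0, 1876000.0],
    "BuildingArea": [79.0, 150.0, np.nan, 210.0],
})

# axis=0 drops rows (houses) that contain at least one missing value
complete_houses = houses.dropna(axis=0)

print(len(houses), "rows before,", len(complete_houses), "rows after")
print(complete_houses.index.tolist())  # surviving rows keep their original labels
```

Because the original labels are kept, the index of a cleaned DataFrame is usually not consecutive, which is why the row labels in the later `X.head()` output skip some numbers.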

**Select a Prediction Target**

You can pull out a variable with dot notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We'll use the dot notation to select the column we want to predict, which is called the prediction target.

By convention, the prediction target is called y. 

So the code we need to save the house prices in the Melbourne data is...

In [6]:
y = melbourne_data.Price
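
A quick sketch, using a made-up two-row DataFrame, of what dot notation returns. Bracket notation with the column name should give the same Series, and it is the form to fall back on when a column name is not a valid Python identifier (for example a name that starts with a digit).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"Rooms": [2, 3], "Price": [1035000.0, 1465000.0]})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation, works for any column name

print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```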

**Choosing Features**

The columns that are inputted into our model (and later used to make predictions) are called "features."

In our case, those would be the columns used to determine the home price. 

Sometimes, you will use all columns except the target as features. 

Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. 

Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets.

In [7]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

# By convention, this data is called X.
X = melbourne_data[melbourne_features]

X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [8]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |


**Building a Model**

You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

**Define:** What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.

**Fit:** Capture patterns from provided data. This is the heart of modeling.

**Predict:** Just what it sounds like.

**Evaluate:** Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [9]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
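
The effect of `random_state` can be checked with a small sketch. The data here is synthetic and made up purely for illustration; the point is only that two trees created with the same `random_state` and fitted to the same data should give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] * 2 + rng.normal(0, 1, size=200)
X_new = rng.uniform(0, 10, size=(50, 3))

# Two separate models with the same seed, fitted to the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print(same)
```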

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [10]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


**Validating the Model** 

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is:

error = actual − predicted

So, if a house cost $150,000 and you predicted it would cost $100,000, the error is $50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. 
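
To make the arithmetic concrete, here is a minimal hand calculation of MAE in plain Python. The three prices are made up for illustration (the first pair matches the $150,000 / $100,000 example above); scikit-learn's `mean_absolute_error` performs the same calculation.

```python
# Hypothetical prices for three houses
actual = [150000, 200000, 310000]
predicted = [100000, 215000, 300000]

# error = actual - predicted, for each house
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so over- and under-predictions don't cancel out
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -15000, 10000]
print(mae)     # 25000.0
```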

In [11]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

The measure we just computed can be called an **"in-sample" score**. We used a single "sample" of houses for both building the model and evaluating it. **Here's why this is bad.**

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on houses it hasn't seen before. This data is called validation data.

In [12]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model (no random_state is set here, so the MAE below can vary slightly between runs)
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

268877.065203357


**Wow!**
Your mean absolute error for the in-sample data was about 1,100 dollars. Out-of-sample it is more than 250,000 dollars.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.

There are many ways to improve this model, such as experimenting to find better features or different model types.

**Underfitting and Overfitting**

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves.
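
The doubling can be tabulated in a few lines of plain Python. The sketch assumes a perfectly balanced tree in which every group is split at every level, and uses the 6,196 rows reported by `X.describe()` above to show roughly how few houses would end up in each leaf.

```python
n_houses = 6196  # row count reported by X.describe() above

leaves_by_depth = {}
for depth in range(1, 11):
    leaves = 2 ** depth  # each extra level doubles the number of groups
    leaves_by_depth[depth] = leaves
    print(f"depth {depth:2d}: {leaves:4d} leaves, about {n_houses / leaves:.1f} houses per leaf")

houses_per_leaf_at_10 = n_houses / leaves_by_depth[10]
```

Real trees are rarely perfectly balanced, so the actual number of houses per leaf varies, but the trend is the same: deeper trees leave very few houses in each leaf.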

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called **overfitting,** where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on training data, that is called **underfitting.**
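
The contrast can be seen on a small synthetic example. The sketch below uses made-up noisy data (not the Melbourne data) and compares a tree limited to 2 leaves with a tree that is allowed to grow without restriction. The variable names are hypothetical and chosen so they don't clash with the notebook's own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal plus noise, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) * 100 + rng.normal(0, 20, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for label, leaves in [("shallow", 2), ("unrestricted", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    results[label] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),  # training MAE
        mean_absolute_error(y_va, tree.predict(X_va)),  # validation MAE
    )
    print(label, "train MAE:", results[label][0], "validation MAE:", results[label][1])
```

The unrestricted tree memorises the training rows, so its training error is essentially zero while its validation error is not. The 2-leaf tree has a large error even on the data it was trained on.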

**Controlling Tree Depth**

The max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area to the overfitting area.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [13]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

In [14]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

# %d is a placeholder for an integer, \t adds a tab space, %() fills the placeholder with the values in the brackets

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 500  		 Mean Absolute Error:  261718
Max leaf nodes: 5000  		 Mean Absolute Error:  271320


Of the options listed, 500 is the optimal number of leaves.
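
Rather than reading the best value off the printout, it can be picked programmatically. This sketch copies the four MAE values printed above into a dictionary (the name `scores` is just an illustrative choice) and takes the key with the smallest MAE.

```python
# MAE values copied from the output above, keyed by max_leaf_nodes
scores = {5: 385696, 50: 279794, 500: 261718, 5000: 271320}

# min() with key=scores.get returns the key whose value is smallest
best_tree_size = min(scores, key=scores.get)

print(best_tree_size)  # 500
```

In the notebook itself the dictionary could be built directly from `get_mae`, for example with a dictionary comprehension over the candidate sizes.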

Models can suffer from either:

**Overfitting:** capturing spurious patterns that won't recur in the future, leading to less accurate predictions

**Underfitting:** failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

Now that we know the best number of leaves, we can refit the model using all of the data (without holding out the validation data).

In [15]:
# Use the best value found above for max_leaf_nodes
final_model = DecisionTreeRegressor(max_leaf_nodes=500, random_state=1)

# Fit the final model on all of the data
final_model.fit(X, y)

**Random Forests**

A single decision tree is fit once to the training data. A random forest instead builds many decision trees, each on a different random sample of the training data, and averages their predictions to get the final prediction. This helps to reduce overfitting and generally improves accuracy.

In [16]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

207190.6873773146


Here the MAE (about 207,000) is lower than the best decision tree result above (about 262,000), so on this validation set the random forest predicts house prices more accurately.
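
The averaging step can be checked directly. The sketch below fits a small forest to synthetic, made-up data and compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = X_demo[:, 0] * 3 + X_demo[:, 1] + rng.normal(0, 1, size=300)
X_new = rng.uniform(0, 10, size=(5, 2))

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_demo, y_demo)

# One row of predictions per tree, then average across the trees
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict(X_new))
print(per_tree.shape)
print(matches)
```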

**Intermediate Machine Learning Course:**

**Download Housing Data**

This time we use a new dataset, taken from the housing prices competition on Kaggle.

In [33]:
home_data_name = "house-prices-advanced-regression-techniques"  # name of the competition in Kaggle's API
# download_path is kept the same as for the previous dataset
api.competition_download_files(home_data_name, path=download_path)  # downloads a zip archive (not unzipped automatically)

# Unzip the files manually
zip_path = os.path.join(download_path, home_data_name + '.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(download_path)

home_data_path = os.path.join(download_path, "train.csv")  # path to the extracted training data

Load the training and validation features in X_train and X_valid, along with the prediction targets in y_train and y_valid. The test features are loaded in X_test.

In [43]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../kaggleML/train.csv', index_col='Id')
X_test_full = pd.read_csv('../kaggleML/test.csv', index_col='Id')

# Obtain target and predictors
y = X_full.SalePrice  # the target (sale price)
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']  # columns used as predictors
X = X_full[features].copy()  # .copy() makes X an independent DataFrame, which avoids pandas' SettingWithCopyWarning if X is modified later
X_test = X_test_full[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Take a quick look at the training data
X_train.head()
X_train.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1168.0 | 1168.0 | 1168.0 | 1168.0 | 1168.0 | 1168.0 | 1168.0 |
| mean | 10589.672945 | 1970.890411 | 1160.958904 | 351.479452 | 1.566781 | 2.882705 | 6.544521 |
| std | 10704.180793 | 30.407486 | 373.315037 | 438.137938 | 0.546698 | 0.802166 | 1.624493 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7589.5 | 1953.75 | 884.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9512.5 | 1972.0 | 1092.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1389.25 | 729.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 3228.0 | 1872.0 | 3.0 | 8.0 | 14.0 |
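
The count of 1,168 in the summary follows from the 80/20 split. The sketch below assumes the full training file has 1,460 rows (the figure implied by 1,168 rows at `train_size=0.8`) and splits a list of stand-in row ids the same way to show the resulting sizes.

```python
from sklearn.model_selection import train_test_split

# Stand-in row ids; 1460 is an assumed row count, not read from the file
row_ids = list(range(1460))

train_ids, valid_ids = train_test_split(
    row_ids, train_size=0.8, test_size=0.2, random_state=0
)

print(len(train_ids), len(valid_ids))
```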


**Create Random Forest Models**

In [44]:
from sklearn.ensemble import RandomForestRegressor

# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]

To select the best model out of the five, we define a function score_model() below. This function returns the mean absolute error (MAE) from the validation set. Recall that the best model will obtain the lowest MAE.

In [None]:
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t) # fit the model to the training data
    predictions = model.predict(X_v) # make predictions on the validation data
    return mean_absolute_error(y_v, predictions) # calculate the mean absolute error for the model predictions against the true validation data

for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i, mae))

Model 1 MAE: 24015
Model 2 MAE: 23740
Model 3 MAE: 23528
Model 4 MAE: 23996
Model 5 MAE: 23706


**Make Predictions**

Model 3 achieved the lowest validation MAE (23,528), so we will use it to make predictions on the test set.

In [46]:
# Generate test predictions
preds_test = model_3.predict(X_test)

# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('initial_predictions.csv', index=False)