<a href="https://colab.research.google.com/github/ccwilliamsut/ml_beginners/blob/master/Intro_ML_for_Absolute_Beginners.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This basic project is modeled on the one found in *Machine Learning for Absolute Beginners: A Plain English Introduction* by Oliver Theobald.


---


![Machine Learning for Absolute Beginners](https://images-na.ssl-images-amazon.com/images/I/413%2BI3pEaXL.jpg)


---


Much of the code for this lesson can be found in Chapter 14 of his book, and the optional Grid Search code can be found at the end of that chapter as well.

His book is an excellent introduction for those just beginning their machine learning journey. Although this class introduces a number of principles found in that text, I highly recommend buying the book and working through it after your experience here. He walks the reader through a number of important concepts that are too extensive for this course, and his writing is clear: he does a spectacular job of explaining difficult topics. Additionally, he provides a number of datasets and examples that the reader can tackle after getting the basics down.

The book can be found here if you are interested: [Machine Learning for Absolute Beginners](https://www.amazon.com/Machine-Learning-Absolute-Beginners-Introduction/dp/1549617214/ref=sr_1_1?crid=1AF1PFSE85G4F&keywords=machine+learning+for+absolute+beginners+a+plain+english+introduction&qid=1563399014&s=gateway&sprefix=machine+learning+for+absolute+beginners+%2Caps%2C326&sr=8-1)



# Step 1: Preliminary Work
## A: Download the dataset
Open a web browser and download the dataset at this link: [Melbourne Housing Data](https://www.kaggle.com/anthonypino/melbourne-housing-market/)

1.   If you are not a member, you will need to register for an account
2.   Once registered, click the "Download" button and download the dataset
3.   Leave the dataset in the *Downloads* folder and unzip it (if you move the dataset, then note where you have placed it for future reference in the code)

> *Note: If you are in my beginner's class, then the data has already been downloaded and is located in the Downloads folder*

## B: Analyze the dataset
Now that we have downloaded our data, we can analyze it in various ways to get familiar with it:
- How many columns (features) does it have?
- How many rows (samples) does it have?
- Is any data missing? If so, where? How many samples are affected?
- Are there things that we can fix to make it more usable?
- How spread out is the data (what is the variance)?

We analyze the data to become familiar with it, but also to see what state it is in and what needs to be done to make it more useful. Fixing the problems we find (for example, missing or inconsistent values) is known as *data scrubbing*, and it is a large part of any data scientist's work.

We also need to see how we can manipulate the features to make them more usable, which is known as *feature engineering*. We do this for a number of reasons, but the primary ones are to change any text-based data into numerical data and to reduce the number of features, where possible, so that our model is more efficient. The more features one introduces, the more complex the model becomes and the slower it will run (often without any added benefit).
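
As a rough illustration of those two goals, the sketch below drops one feature and converts a text-based feature into numeric indicator columns (often called one-hot encoding). The small `toy` table and its values are made up for this example and are not the real housing file; which columns to drop or encode in the actual dataset is a separate decision.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not the real housing file)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "h"],                     # text-based feature
    "Address": ["1 A St", "2 B St", "3 C St"],   # feature we choose to drop
    "Price": [850000, 620000, 1200000],
})

# Reduce the number of features by dropping a column we do not need
reduced = toy.drop(columns=["Address"])

# Turn the text-based "Type" column into numeric indicator columns,
# one new column per distinct value ("Type_h", "Type_u")
encoded = pd.get_dummies(reduced, columns=["Type"])

print(encoded.columns.tolist())
```

After this step every remaining column holds numbers (or true/false flags), which is the form most scikit-learn models expect.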

In [0]:
# Import the pandas library (to work with data files)
import pandas as pd

# Set the URL for the data file
url = 'https://raw.githubusercontent.com/ccwilliamsut/ml_beginners/master/Melbourne_housing_FULL.csv'

# Import the datafile from the provided url and run the cell
df = pd.read_csv(url)

# Test that the data was imported correctly (look at the first 10 rows of the file)
print(df.head(10))


Run the above code (either with the "play" button or by **selecting the cell** and then choosing **Runtime -->  Run the focused cell** from the menu).

If you do not see data in the frame above, then consult this [link](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92) to learn how to work with data files in Colaboratory.
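
The questions listed in section B can be answered with a handful of pandas calls. The sketch below runs them on a tiny made-up table (`sample`, with invented values) so that it works without any download; the same calls should work on `df` once the real file has loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few deliberately missing values
sample = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, 620000.0],
    "Suburb": ["Abbotsford", "Richmond", None, "Carlton"],
})

# How many rows (samples) and columns (features)?
n_rows, n_cols = sample.shape

# Where is data missing? Count missing values in each column
missing_per_column = sample.isnull().sum()

# How many samples are affected by at least one missing value?
rows_with_missing = int(sample.isnull().any(axis=1).sum())

# How spread out is a numeric column? (missing values are skipped)
price_variance = sample["Price"].var()

print(n_rows, n_cols)
print(missing_per_column)
print(rows_with_missing)
print(sample.describe())   # count, mean, std, min, quartiles, max
```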
       
            
            


---

## C: Locate specific records within a dataset

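Next, let's see how to look at individual samples (records) within the dataframe. You do this with the `iloc[]` indexer, which selects rows by their integer position, counting from 0.

For this example, we will look at the record at position 27 (the 28th row, since counting starts at 0): `df.iloc[27]`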

In [0]:
# Look at a specific sample (record) in the dataset
df.iloc[27]

You should see the individual features (columns) of the record displayed below the code. Feel free to look up other records within the file by simply changing the 27 to the desired number.
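
Beyond a single record, `iloc[]` also accepts negative positions, slices, and a row/column pair. The sketch below shows these on a small made-up table (`records`, with invented values) rather than the housing data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
records = pd.DataFrame(
    {"Rooms": [2, 3, 4, 1], "Price": [850000, 620000, 1200000, 400000]}
)

first = records.iloc[0]       # first row (positions start at 0)
last = records.iloc[-1]       # negative positions count from the end
middle = records.iloc[1:3]    # rows at positions 1 and 2 (the end is excluded)
cell = records.iloc[2, 1]     # row position 2, column position 1 ("Price")

print(first)
print(middle)
print(cell)
```

A single position returns one record as a pandas Series, while a slice returns a smaller dataframe.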


---

## D: Looking at column headers

If we want to see the names of the columns (headers), we can simply use the dataframe's **columns** attribute:

--> **df.columns**

In [0]:
# Print the names of the column headers
df.columns
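
The column names can also be turned into a plain Python list, checked for a particular name, or filtered by data type. This is a minimal sketch on a made-up table (`frame`, with invented values), not the housing data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
frame = pd.DataFrame(
    {"Rooms": [2, 3], "Type": ["h", "u"], "Price": [850000, 620000]}
)

# Column names as a plain Python list
names = list(frame.columns)

# Names of only the numeric columns
numeric_names = list(frame.select_dtypes(include="number").columns)

# Check whether a particular column exists
has_price = "Price" in frame.columns

print(names)
print(numeric_names)
print(has_price)
```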

# Step 2: Build a Decision Tree Model with Python
## 1. Import appropriate common libraries

For this project, we will draw on the following libraries (commonly used for machine learning projects):

- pandas (imported previously during preliminary steps, but added again for illustration purposes)
- sklearn (scikit-learn)

In [0]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib
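
To give a feel for what each of these imports is for, here is a minimal sketch that uses them together on synthetic data. Everything in it is an assumption made for illustration: the random data, the choice of `GradientBoostingRegressor` from the `ensemble` module, the parameter values, and the file name `toy_model.pkl`. It is not the model built in this lesson.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): two features and a noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# train_test_split: hold back 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# ensemble: fit a model built from many small decision trees
model = ensemble.GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# mean_absolute_error: average size of the prediction errors on the test set
mae = mean_absolute_error(y_test, model.predict(X_test))
print(mae)

# joblib: save the trained model to disk and load it back
model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
joblib.dump(model, model_path)
restored = joblib.load(model_path)
```

The restored model makes the same predictions as the original, which is why saving with joblib is handy when training takes a long time.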
