# Innovation Center AI Team Lesson #1 - Data

Today you will learn about pandas, one of the most common libraries in data science. It is most often used to organize tabular data (data that can be found in table or spreadsheet form), such as the housing prices and property features we will see today. This data is often stored in a type of file called a CSV, which stands for comma-separated values. Essentially, it is a large collection of values separated by commas, with headings in the first line, as can be seen below.

![](https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/rawcsv.PNG)

While this may sound like a strange new type of file, it is essentially just a table. In fact, you can open CSV files in spreadsheet software such as Excel.

![](https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/excelcsv.PNG)

Today you will learn how to import datasets to your project, explore your data, and prepare it for use in a data science project. Stay tuned because next technical stream we will create a basic machine learning model using the dataset you explore today to predict housing prices!

### Setup
All this code does is import the pandas library. Python libraries are essentially just large collections of code that we can use to make our job as programmers easier. Rather than writing many hundreds of lines to perform simple tasks, we can use functions included in these libraries. A typical function call includes the name of the library, then the name of the function (or action to perform), and then some input (such as a file).

In [5]:
import pandas as pd #imports the package we need and renames it to pd for a little less typing :)
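
To make the "library, then function, then input" pattern concrete, here is a minimal sketch. The two functions chosen (`math.sqrt` and `pd.isna`) are just illustrations and are not used elsewhere in this lesson.

```python
import math
import pandas as pd

# Pattern: library name, a dot, the function name, then the input in parentheses
root = math.sqrt(16)      # sqrt function from Python's built-in math library
missing = pd.isna(None)   # isna function from pandas, called through the pd nickname

print(root)
print(missing)
```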

### Finding and Importing a Dataset

Finding good data is one of the most important jobs for those wishing to create machine learning models. According to Dimensional Research, 8 out of 10 AI projects fail, and 96% of those failures are due to a lack of data quality. Free datasets can be found in a number of places online. Local and federal agencies often have free-to-use datasets, and so may companies seeking data science expertise. Some good tools for finding datasets include [Google Dataset Search](https://datasetsearch.research.google.com/), which searches the internet for open-source datasets, and [Kaggle](https://www.kaggle.com/datasets), a free data science platform with over 33,000 user-created datasets and counting. The data we will be using today is from Kaggle, entitled ["Melbourne Housing Dataset"](https://www.kaggle.com/anthonypino/melbourne-housing-market) and created by Kaggle user Tony Pino.

Google Dataset Search:

![](https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/datasetsearch.PNG)

Kaggle Datasets:

![](https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/Kaggledatasetsearch.PNG)

To import this dataset we will use the pandas function `read_csv`, where our input is a link to the CSV holding the data, located in a GitHub repository. Most online datasets will have a file hosted online. If they don't, you can either upload the dataset to a GitHub repository (as we've done) or, if you're using an Anaconda distribution on your computer, read a file stored locally, i.e. 'C:\Users\user\Documents\dataset'. We store our dataset as a pandas object called a dataframe (essentially a table) under the variable `df`.

In [6]:
#creates the variable df that stores our csv file as a pandas dataframe
df = pd.read_csv('https://raw.githubusercontent.com/UncertainQubit/binder/master/Melbourne_housing_FULL.csv')
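
If you want to see what `read_csv` does without downloading anything, the sketch below feeds it a few lines of CSV text held in memory. The text in `csv_text` and the name `toy_df` are made up for illustration; only the column names are borrowed from the housing data.

```python
import io
import pandas as pd

# Hypothetical CSV text: headings on the first line, one row per line after that
csv_text = """Suburb,Rooms,Price
Abbotsford,2,1480000
Abbotsford,3,1465000
Richmond,2,
"""

# read_csv accepts a URL, a local path, or a file-like object such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.columns.tolist())       # column names come from the first line
print(toy_df.shape)                  # (number of rows, number of columns)
print(toy_df['Price'].isna().sum())  # the empty Price field is read as a missing value
```

Notice that the blank value at the end of the last row is not an error. pandas simply records it as missing, which matters later when we clean the data.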

### Basics of Pandas
Before you begin to explore your data, it is important to understand some of the powerful features contained within pandas. Pandas makes it easy to create dataframes with the command `pd.DataFrame`. There are a few different ways to add data to a dataframe, but the one we will use throughout this tutorial is a dictionary. Dictionaries are a type of object in Python that, like a real dictionary, match one value (a key) to another. They are an easy way to create dataframes because each key provides a column name while its associated list of values fills the rows of that column. In the example below we create a dictionary with two entries: 'Fruit', paired with a list of fruit names, and 'Veggies', paired with a list of vegetable names.

In [141]:
#Creates a simple dictionary
dictionary = {'Fruit' : ['strawberry', 'apple', 'banana', 'strawberry'], 'Veggies' : ['carrot', 'broccoli', 'lettuce', 'carrot']}
print(dictionary)

{'Fruit': ['strawberry', 'apple', 'banana', 'strawberry'], 'Veggies': ['carrot', 'broccoli', 'lettuce', 'carrot']}


That dictionary doesn't look very intuitive and certainly doesn't look like the tables above. However, we can easily turn it into a simple DataFrame and then use powerful pandas functions to manipulate it.

In [142]:
simpledf = pd.DataFrame(dictionary)
simpledf

|  | Fruit | Veggies |
|---|---|---|
| 0 | strawberry | carrot |
| 1 | apple | broccoli |
| 2 | banana | lettuce |
| 3 | strawberry | carrot |
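
The lesson mentions that there are a few ways to add data to a dataframe. As one illustration (an alternative, not the approach used in the rest of this lesson), `pd.DataFrame` also accepts a list of dictionaries where each dictionary is one row. The names `rows`, `rowsdf`, and `columnsdf` are made up for this sketch.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
rows = [
    {'Fruit': 'strawberry', 'Veggies': 'carrot'},
    {'Fruit': 'apple', 'Veggies': 'broccoli'},
]
rowsdf = pd.DataFrame(rows)

# The same table built column by column, as in the lesson
columnsdf = pd.DataFrame({'Fruit': ['strawberry', 'apple'],
                          'Veggies': ['carrot', 'broccoli']})

# Both approaches produce identical dataframes
print(rowsdf.equals(columnsdf))
```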


That's looking a lot better, but you may notice that pandas added some numbers along the side. This is what is known as the `index`. Essentially, it's just the label of every row. By default pandas numbers the rows starting from 0, which is fine if you never need to refer to a row by name. If you want named rows, you can set the index to your own values, typically given as a list.

In [143]:
indexdata = ['Good', 'Bad', 'Ok', 'Personal Favs']
simpledf = pd.DataFrame(dictionary, index=indexdata)
simpledf

|  | Fruit | Veggies |
|---|---|---|
| Good | strawberry | carrot |
| Bad | apple | broccoli |
| Ok | banana | lettuce |
| Personal Favs | strawberry | carrot |
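
One thing a named index gives you is the ability to look rows up by their label. The sketch below rebuilds the table above and uses `.loc` (select by label) and `.iloc` (select by position), two standard pandas accessors that this lesson does not otherwise cover.

```python
import pandas as pd

dictionary = {'Fruit': ['strawberry', 'apple', 'banana', 'strawberry'],
              'Veggies': ['carrot', 'broccoli', 'lettuce', 'carrot']}
indexdata = ['Good', 'Bad', 'Ok', 'Personal Favs']
simpledf = pd.DataFrame(dictionary, index=indexdata)

# .loc selects by row label (and optionally by column name)
print(simpledf.loc['Bad'])          # the whole 'Bad' row
print(simpledf.loc['Ok', 'Fruit'])  # a single cell

# .iloc selects by position, whatever the labels are
print(simpledf.iloc[0]['Veggies'])  # first row, 'Veggies' column
```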


For most data, a named index is largely unnecessary because all important information can be encoded in columns, so you won't need to worry too much about working with one for now. None of our lessons that use tabular data will have named indices. Now that you have a simple DataFrame, we can begin to use some of pandas' powerful functions to manipulate it. Below are some of the most common pandas functions you will encounter for analyzing data.

In [145]:
# Generates some simple statistics about the dataset, for numerical data will include averages, etc.
simpledf.describe()

|  | Fruit | Veggies |
|---|---|---|
| count | 4 | 4 |
| unique | 3 | 3 |
| top | strawberry | carrot |
| freq | 2 | 2 |
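
The output above is what `describe` reports for text columns. For numeric columns the summary is different. Here is a minimal sketch using made-up numbers (the values in `pricesdf` are hypothetical and are not taken from the housing dataset).

```python
import pandas as pd

# Hypothetical numeric data for illustration only
pricesdf = pd.DataFrame({'Rooms': [2, 3, 4, 3],
                         'Price': [100.0, 200.0, 300.0, 400.0]})

stats = pricesdf.describe()
print(stats)  # rows: count, mean, std, min, 25%, 50%, 75%, max

# Individual statistics can be pulled out by row label and column name
print(stats.loc['mean', 'Price'])
```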


In [147]:
#Brackets allow for easy selection of individual columns in pandas
simpledf['Fruit']

Good             strawberry
Bad                   apple
Ok                   banana
Personal Favs    strawberry
Name: Fruit, dtype: object

In [151]:
#A boolean condition inside the brackets selects only the rows where it is True
simpledf[simpledf.Fruit == 'strawberry']

|  | Fruit | Veggies |
|---|---|---|
| Good | strawberry | carrot |
| Personal Favs | strawberry | carrot |
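
It can help to see what the condition produces on its own. The sketch below prints the True/False "mask" and then combines two conditions. Note that pandas uses `&` and `|` for this rather than Python's `and` and `or` keywords, which do not work element by element on columns. The names `mask`, `both`, and `either` are made up for this example.

```python
import pandas as pd

simpledf = pd.DataFrame({'Fruit': ['strawberry', 'apple', 'banana', 'strawberry'],
                         'Veggies': ['carrot', 'broccoli', 'lettuce', 'carrot']})

# The condition alone gives one True/False value per row
mask = simpledf['Fruit'] == 'strawberry'
print(mask.tolist())

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
both = simpledf[(simpledf['Fruit'] == 'strawberry') & (simpledf['Veggies'] == 'carrot')]
either = simpledf[(simpledf['Fruit'] == 'apple') | (simpledf['Veggies'] == 'lettuce')]

print(both)
print(either)
```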


In [152]:
#value_counts counts how many times each value appears in a column
simpledf.Fruit.value_counts()

strawberry    2
apple         1
banana        1
Name: Fruit, dtype: int64
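
If you would rather see proportions than raw counts, `value_counts` takes an optional `normalize=True` argument. A small sketch on the same fruit values (the variable names are made up):

```python
import pandas as pd

fruit = pd.Series(['strawberry', 'apple', 'banana', 'strawberry'])

counts = fruit.value_counts()                # raw counts, most common first
shares = fruit.value_counts(normalize=True)  # fraction of rows for each value

print(counts)
print(shares)
```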

### Exploring your Data
It's important to explore your data so that you understand it before you begin to create a machine learning model. The easiest way to start is with the `head` function in pandas. Simply call `head` on a dataframe to display the first five rows. If you need to see more, you can add a number inside the parentheses to display that many rows. Below you can see a display of our dataframe.

In [15]:
df.head() #displays first five rows of the dataframe

|  | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
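
Besides `head`, a few other quick checks are handy when you first open a dataset. The sketch below uses a tiny stand-in table (three columns copied from the first four rows shown above, stored under the made-up name `toy_df`) so it runs without downloading the full file.

```python
import numpy as np
import pandas as pd

# Small stand-in for the housing data: the first four rows, three columns only
toy_df = pd.DataFrame({
    'Suburb': ['Abbotsford', 'Abbotsford', 'Abbotsford', 'Abbotsford'],
    'Rooms': [2, 2, 2, 3],
    'Price': [np.nan, 1480000.0, 1035000.0, np.nan],
})

print(toy_df.head(2))            # a number inside the parentheses sets how many rows to show
print(toy_df.shape)              # (rows, columns)
print(toy_df.columns.tolist())   # every column name
print(toy_df.isna().sum())       # number of missing values in each column
```

Counting missing values early is useful here because several rows of the housing data have no recorded price.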


Exploring your data is also how you gain valuable insights. For example, when creating our model we'll eventually have to choose which features to incorporate in price prediction. Even setting machine learning aside, exploring data can provide powerful, actionable insights. For example, a data scientist may be contracted by a real estate company to find which sellers they should recruit. One way to approach this problem is to see which sellers sold the most real estate using the `value_counts` function. This function counts the number of times each value appears in a column. In our case `SellerG` contains the name of the seller of the property, so counting its values tells us how many properties each seller has sold.

In [16]:
top_sellers = df.SellerG.value_counts() #counts the instances of every unique seller and sorts them greatest to least
top_sellers.head() #displays first five rows of data

Jellis           3359
Nelson           3236
Barry            3235
hockingstuart    2623
Marshall         2027
Name: SellerG, dtype: int64

Now that our real estate company knows Jellis is a top seller they may also want to know the average price of a house sold by Jellis. We can do this by creating a new dataframe that contains only the rows of real estate sold by Jellis and then using the `mean` function to find the average price of real estate sold by Jellis.

In [76]:
Jellis = df[df.SellerG == 'Jellis'] #creates a new dataframe called Jellis made only of real estate sold by Jellis

jellisaverage = Jellis['Price'].mean() #takes the average price of all real estate sold by Jellis and stores it under jellisaverage

print(jellisaverage) #prints the average price of all real estate sold by Jellis

1350790.0284360189


In [77]:
#Finds the difference between the average price of houses sold by Jellis and the average selling price of all properties
jellisaverage - df['Price'].mean()

300616.68348061084
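
One detail worth knowing: some rows in the dataset have no price, and `mean` skips missing values by default instead of failing. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical prices, one of them missing
prices = pd.Series([100.0, np.nan, 300.0])

print(prices.mean())              # missing value is ignored: (100 + 300) / 2
print(prices.count())             # number of non-missing values
print(prices.mean(skipna=False))  # asking pandas not to skip gives NaN
```

So the averages above describe only the properties that have a recorded price.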

Jellis is looking like a good candidate for the job: the average property they sold went for about \$300,000 AUD (Australian dollars) more than the average property price in Melbourne. However, we can also write a simple loop to see whether other sellers have larger average selling prices.

In [108]:
pricedata = [] #will hold the average sale price of each seller
sellers = df.SellerG.unique() #Creates an array containing the name of every seller
for value in sellers:
    sellerdf = df[df.SellerG == value] #keeps only the rows for this seller
    average = sellerdf['Price'].mean() #average price, ignoring missing prices
    pricedata.append(average)
sellersdf = pd.DataFrame({'Average_Sale_Price' : pricedata}, index=sellers) #one row per seller
sellersdf = sellersdf.sort_values(by=['Average_Sale_Price'], ascending=False) #highest average first
sellersdf = sellersdf.dropna(axis=0) #removes sellers with no recorded prices
sellersdf.head()

|  | Average_Sale_Price |
|---|---|
| For | 3780000.0 |
| Weast | 3320000.0 |
| Sotheby's | 2688409.0 |
| VICProp | 2415750.0 |
| Darras | 2410000.0 |
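
The loop above works, but pandas also has a built-in shortcut for "do something for each group" called `groupby`. The sketch below is an alternative to the loop, not the method used in this lesson, and it runs on a tiny made-up table (`toy_df`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sales: seller C has no recorded prices
toy_df = pd.DataFrame({
    'SellerG': ['A', 'A', 'B', 'C', 'C'],
    'Price': [100.0, 300.0, 500.0, np.nan, np.nan],
})

averages = (
    toy_df.groupby('SellerG')['Price']  # split the Price column by seller
    .mean()                             # average price within each group
    .dropna()                           # drop sellers with no recorded prices
    .sort_values(ascending=False)       # highest average first
)
print(averages)
```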


This looks like a pretty good result: we can see the five sellers whose properties sold for the most on average. However, it is helpful to analyze the data more critically before we make our final decision about who is the best seller. After all, one of these sellers could have sold only one expensive property, which would artificially inflate their average sale price.

In [129]:
count = 0
for value in df.SellerG:
    if value == 'Darras':
        count += 1
print(count)
#Counts noted from rerunning this cell for each of the top five sellers, in table order: 1, 1, 18, 4, 5

5
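
Rather than counting one seller at a time, you could put the number of sales next to the average price and then ignore sellers with too few sales. This is only a sketch on made-up data: the table `toy_df`, the column name `Sales`, and the cutoff `min_sales` are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sales: 'Solo' sold one expensive property, the others sold several
toy_df = pd.DataFrame({
    'SellerG': ['Solo', 'Busy', 'Busy', 'Busy', 'Mid', 'Mid'],
    'Price': [900.0, 300.0, 400.0, 500.0, 700.0, 500.0],
})

sales = toy_df['SellerG'].value_counts()             # number of rows per seller
average = toy_df.groupby('SellerG')['Price'].mean()  # average price per seller
summary = pd.DataFrame({'Sales': sales, 'Average_Sale_Price': average})

min_sales = 2  # assumed cutoff; pick whatever makes sense for your question
frequent = summary[summary['Sales'] >= min_sales]
frequent = frequent.sort_values(by='Average_Sale_Price', ascending=False)
print(frequent)
```

With the cutoff in place, the one-off seller no longer tops the ranking even though their single sale was the most expensive.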


In [80]:
top_suburbs = df.Suburb.value_counts()
top_suburbs.head()

Reservoir         844
Bentleigh East    583
Richmond          552
Glen Iris         491
Preston           485
Name: Suburb, dtype: int64

### Cleaning your Data

In [15]:
df = df.dropna(axis=0) #drops every row that has at least one missing value in any column
features = ['Rooms', 'Type', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude'] #columns we will use as features (spelled as they appear in the dataset)
X = df[features]
X.head()

|  | Rooms | Type | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|---|
| 2 | 2 | h | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 4 | 3 | h | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 6 | 4 | h | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 11 | 3 | h | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 14 | 2 | h | 1.0 | 256.0 | -37.806 | 144.9954 |
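
Dropping every row with any missing value is simple, but it can throw away a lot of data, because a row is removed even when the gap is in a column you never planned to use. The sketch below compares that approach with two common alternatives on a made-up table. The data in `toy_df` and the choice of the median as a fill value are assumptions for illustration, not recommendations from this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical data: YearBuilt is a column we do not plan to use
toy_df = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Landsize': [126.0, np.nan, 120.0],
    'YearBuilt': [np.nan, 1900.0, np.nan],
})
features = ['Rooms', 'Landsize']

# Option 1: drop rows with a gap in ANY column, then pick the features
drop_first = toy_df.dropna(axis=0)[features]

# Option 2: only drop rows with a gap in the feature columns
select_first = toy_df.dropna(subset=features)[features]

# Option 3: keep every row and fill gaps with each column's median
filled = toy_df[features].fillna(toy_df[features].median())

print(len(drop_first), len(select_first), len(filled))
```

Here the first option keeps no rows at all, the second keeps two, and the third keeps all three at the cost of inventing a value.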


### Summary
Today you learned how to...

### Your Turn

They say practice makes perfect. I don't know who "they" are, but they're correct. Just click the link below to launch a practice notebook where you will explore a dataset containing ramen ratings and practice your newfound knowledge. Further instructions will be included in the notebook itself.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/UncertainQubit/binder/master?filepath=lesson1_practice)

### Stay Tuned
Next technical stream we will use the data that you explored today to create a basic machine learning model called a decision tree, which will allow us to predict the selling price of a home in Melbourne, Australia. We will also begin to cover how to optimize your machine learning models for the best results. Soon, you will be able to apply the basics of machine learning to new, unique problems.

We are also working on creating a custom challenge using a dataset of real estate sold in Boulder County to predict sale prices so stay tuned!

### Additional Resources
If you enjoyed exploring the data in today's lesson you can also try an optional extension notebook that will teach you how to use another popular library (seaborn) to create graphs and other visualizations of your data.

![](https://mybinder.org/badge_logo.svg) The link to this notebook is not yet active.

For more information on pandas view the resources below:
* [Pandas Documentation](https://pandas.pydata.org/docs/) (Documentation of every pandas function and a great resource)

For more information on python view the resources below:
* [Python Cheatsheet](https://www.pythoncheatsheet.org/) (Quick reference for Python commands, data structures, etc.)

For additional learning opportunities view the resources below:
* [Kaggle Learn](https://www.kaggle.com/learn/overview) (Amazing tutorials on all subjects related to artificial intelligence)
* [DatatoFish](https://datatofish.com/python-tutorials/) (Great tutorials related to data!)
* [W3Schools](https://www.w3schools.com/python/default.asp) (Great explanations of python!)

Random resources:
* [Stack Overflow](https://stackoverflow.com/) (Always a great resource for questions)
* [GeeksforGeeks](https://www.geeksforgeeks.org/) (Has some good examples of pandas)
* [Binder](https://mybinder.org/) (Free resource that we use to host these notebooks and make them available to you)