# Innovation Center AI Team Technical Livestream #1

Today you will learn about pandas, one of the most common libraries in data science. It is most often used to organize tabular data (data that can be found in table or spreadsheet form), such as housing prices and the features of each property, as we will see today. This data is often stored in a type of file called a CSV, which stands for comma-separated values. Essentially, it is a large collection of values separated by commas, with the column headings in the first line, as can be seen below.

![](https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/rawcsv.PNG)

While this may sound like a strange new type of file, it is essentially just a table. In fact, you can open CSV files in spreadsheet software such as Excel.

![](https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/excelcsv.PNG)
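To make the "a CSV is just a table" idea concrete, here is a minimal sketch that turns a few lines of comma-separated text into a pandas table. The values and the names `csv_text` and `toy_df` are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the headings,
# and every following line is one row of the table.
csv_text = """Suburb,Rooms,Price
Abbotsford,2,1480000
Richmond,3,1200000
"""

# read_csv accepts a file path, a link, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)        # displayed as a table with three columns
print(toy_df.shape)  # (number of rows, number of columns)
```

Each comma marks the boundary between two columns, which is exactly how Excel decides where one cell ends and the next begins.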

Today you will learn how to import datasets to your project, explore your data, and prepare it for use in a data science project. Stay tuned because next technical stream we will create a basic machine learning model using the dataset you explore today to predict housing prices!

### Setup
All this code does is import the pandas library. Python libraries are essentially just large collections of code that we can use to make our job as programmers easier. Rather than writing many hundreds of lines to perform simple tasks, we can use functions included in these libraries. A typical function call consists of the name of the library, then the name of the function (or action to perform), and then some input (such as a file) in parentheses.

In [5]:
import pandas as pd #Imports the package we need and renames it to pd for a little less typing :)

### Finding and Importing a Dataset

Finding good data is one of the most important jobs for those wishing to create machine learning models. According to Dimensional Research, 8 out of 10 AI projects fail, and 96% of those failures are due to a lack of data quality. Free datasets can be found in a number of places online. Local and federal agencies often publish free-to-use datasets, as do companies seeking data science expertise. Some good tools for finding datasets include [Google Dataset Search](https://datasetsearch.research.google.com/), which searches the internet for open-source datasets, and [Kaggle](https://www.kaggle.com/datasets), a free data science platform with over 33,000 user-created datasets and counting. The data we will be using today is from Kaggle, entitled ["Melbourne Housing Dataset"](https://www.kaggle.com/anthonypino/melbourne-housing-market) and created by Kaggle user Tony Pino.

To import this dataset we will use the pandas function `read_csv`, where our input is a link to the CSV file holding the data, located in a GitHub repository. Most online datasets will have a file hosted online. If they don't, you can either upload the dataset to a GitHub repository (as we've done) or, if you're using an Anaconda distribution on your computer, you can read a file stored locally, e.g. 'C:\Users\user\Documents\dataset'. We store our dataset as a pandas object called a dataframe (essentially a table) under the variable `df`.

In [6]:
#Creates the variable df that stores our csv file as a pandas dataframe
df = pd.read_csv('https://raw.githubusercontent.com/UncertainQubit/binder/master/Melbourne_housing_FULL.csv')
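Reading a file stored on your own computer works the same way as reading from a link. The sketch below is only an illustration: it first writes a small, made-up CSV into a temporary folder (the file name `toy_housing.csv` is hypothetical) so that it can run on any machine, then reads it back with `read_csv`.

```python
import os
import tempfile
import pandas as pd

# Write a small, made-up CSV to a temporary folder so the example runs anywhere.
folder = tempfile.mkdtemp()
local_path = os.path.join(folder, "toy_housing.csv")  # hypothetical file name
with open(local_path, "w") as f:
    f.write("Suburb,Rooms,Price\nAbbotsford,2,1035000\nAbbotsford,3,1465000\n")

# A local path is passed to read_csv exactly like a link.
local_df = pd.read_csv(local_path)
print(local_df)
```

On Windows, write paths as raw strings, for example `r'C:\Users\...'`, because Python otherwise treats a backslash followed by certain letters as a special character.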

### Exploring your Data
It's important to explore your data so that you understand it before you begin to create a machine learning model. The easiest way to do this is with the `head` function in pandas. Simply call `head` on a dataframe to display its first five rows. If you need to see more, you can put a number inside the parentheses to display that many rows. Below you can see a display of our dataframe.

In [15]:
df.head() #Displays the first five rows of the dataframe

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0
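Besides `head`, a few other pandas tools give a quick overview of a table. The sketch below uses a made-up stand-in dataframe called `toy` (not the real housing data) so that it runs on its own; the same calls work on `df`.

```python
import pandas as pd

# Made-up stand-in for the housing data; None marks a missing price.
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Richmond", "Richmond",
               "Carlton", "Carlton", "Kew"],
    "Rooms": [2, 2, 3, 4, 1, 2, 5],
    "Price": [None, 1480000.0, 1035000.0, None, 465000.0, 720000.0, 2100000.0],
})

print(toy.head(3))           # first three rows instead of the default five
print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # the column headings as a list
print(toy.isna().sum())      # how many values are missing in each column
```

Counting missing values early is useful because, as the display above shows, some rows of the real dataset have no recorded price.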


Exploring your data also yields valuable insights. For example, when creating our model we'll eventually have to choose which features to incorporate in price prediction. Even regardless of machine learning, though, exploring data can provide powerful, actionable insights. For example, a data scientist may be contracted by a real estate company to find which sellers it should recruit. One way we could approach this problem is to see which sellers sold the most real estate using the `value_counts` function. This function counts the number of times that each value appears in a column. In our case `SellerG` contains the name of the seller of the property, so counting its values tells us how many properties each seller has sold.

In [16]:
top_sellers = df.SellerG.value_counts()
top_sellers.head()

Jellis           3359
Nelson           3236
Barry            3235
hockingstuart    2623
Marshall         2027
Name: SellerG, dtype: int64
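To see what `value_counts` does on a scale small enough to check by hand, here is a sketch on a made-up list of six sales (the dataframe `sales` is illustrative only).

```python
import pandas as pd

# Made-up sales records: each entry is the seller of one property.
sales = pd.DataFrame(
    {"SellerG": ["Jellis", "Biggin", "Jellis", "Nelson", "Biggin", "Jellis"]}
)

# Counts how often each seller appears, sorted from most to least frequent.
counts = sales["SellerG"].value_counts()
print(counts)
print(counts.index[0])  # name of the seller with the most sales

# normalize=True reports each seller's share of all sales instead of a count.
shares = sales["SellerG"].value_counts(normalize=True)
print(shares)
```

Because the result is sorted, calling `head` on it, as in the cell above, returns the most frequent values first.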

Now that our real estate company knows Jellis is a top seller, they may also want to know the average price of a house sold by Jellis. Are they a luxury seller? We can answer this by creating a new dataframe that contains only the rows of real estate sold by Jellis and then using the `mean` function to find the average price of those properties.

In [25]:
Jellis = df[df.SellerG == 'Jellis']
jellisaverage = Jellis['Price'].mean()
print(jellisaverage)

1350790.0284360189


In [26]:
jellisaverage - df['Price'].mean()

300616.68348061084
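The two cells above rely on a pattern called boolean filtering: a comparison produces one True or False per row, and only the True rows are kept. Here is a minimal sketch on made-up data (the dataframe `sales` and its prices are illustrative).

```python
import pandas as pd

# Made-up sales; None marks a sale with no recorded price.
sales = pd.DataFrame({
    "SellerG": ["Jellis", "Biggin", "Jellis", "Nelson"],
    "Price": [1500000.0, 900000.0, None, 600000.0],
})

mask = sales["SellerG"] == "Jellis"  # one True/False value per row
print(mask.tolist())

jellis_rows = sales[mask]                  # keeps only the rows where mask is True
jellis_mean = jellis_rows["Price"].mean()  # mean() skips missing prices
overall_mean = sales["Price"].mean()
print(jellis_mean - overall_mean)          # how far Jellis sits above the overall average
```

Note that `mean` ignores missing values by default, so an average like this is based only on the sales that have a recorded price.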

Jellis is looking like a good candidate for the job: the average property they sold went for about 300,000 Australian dollars more than the average property in our dataset. However, we can also write a simple loop to see whether other sellers have higher average selling prices.

In [74]:
#Identifies the sellers with the highest average sale prices
pricedata = []
sellers = df.SellerG.unique() #Creates an array containing the name of every seller
for value in sellers:
    sellerdf = df[df.SellerG == value] #Keeps only the rows for this seller
    average = sellerdf['Price'].mean() #Average price for this seller, skipping missing prices
    pricedata.append(average) #Must be inside the loop so every seller's average is stored
#Builds a dataframe of average prices labelled by seller name, highest first
sellersdf = pd.DataFrame(data=pricedata, columns=['Average_Sale_Price'], index=sellers)
sellersdf.sort_values('Average_Sale_Price', ascending=False).head()
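An average per seller can also be computed without writing a loop, using the pandas `groupby` function, which splits the rows into one group per seller and applies `mean` to each group. This is an alternative sketch on made-up data, not the approach used in the cell above.

```python
import pandas as pd

# Made-up sales; None marks a sale with no recorded price.
sales = pd.DataFrame({
    "SellerG": ["Jellis", "Biggin", "Jellis", "Nelson", "Biggin"],
    "Price": [1500000.0, 900000.0, 1300000.0, 600000.0, None],
})

# One group per seller, then the average price within each group, highest first.
avg_by_seller = (
    sales.groupby("SellerG")["Price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_seller.head())
```

A seller with very few sales can top a list like this by chance, so it is worth checking the averages alongside the sale counts from `value_counts`.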


In [None]:
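top_suburbs = df.Suburb.value_counts() #Counts how many properties were sold in each suburb
print(top_suburbs.head()) #Shows the five suburbs with the most sales
print(df.Price.mean()) #Average sale price across the whole dataset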

### Cleaning your Data

In [15]:
df = df.dropna(axis=0) #Drops every row with a missing value in any column, which can discard a lot of data
features = ['Rooms', 'Type', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude'] #Columns used to predict price, spelled as in the dataset
X = df[features] #A new dataframe holding only the chosen features
X.head()

Unnamed: 0,Rooms,Type,Bathroom,Landsize,Lattitude,Longtitude
2,2,h,1.0,156.0,-37.8079,144.9934
4,3,h,2.0,134.0,-37.8093,144.9944
6,4,h,1.0,120.0,-37.8072,144.9941
11,3,h,2.0,245.0,-37.8024,144.9993
14,2,h,1.0,256.0,-37.806,144.9954
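Dropping every row that has any missing value is simple, but it can throw away rows that are only missing values in columns we never use. One possible alternative, sketched below on made-up data, is to pass `subset` to `dropna` so that only the columns we need are checked. The dataframe `toy` and the list `needed` are illustrative names.

```python
import pandas as pd

# Made-up data; None marks a missing value.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [156.0, None, 120.0, 245.0],
    "YearBuilt": [1900.0, None, None, None],  # mostly missing and not a chosen feature
    "Price": [1035000.0, 1465000.0, None, 1876000.0],
})
needed = ["Rooms", "Landsize", "Price"]  # the columns we actually plan to use

drop_any = toy.dropna(axis=0)            # a row is dropped if ANY column is missing
drop_needed = toy.dropna(subset=needed)  # only the needed columns are checked

print(len(toy), len(drop_any), len(drop_needed))
```

Here the last row is lost under the first approach only because `YearBuilt` is missing, even though every column we care about is present.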


### Your Turn

In [1]:
#Maybe include some way to download a notebook that can be run in different distributions for practice. Github repo for hints?
#Nobel laureates database? Another housing? Look at making Boulder County one?

### Stay Tuned

In [4]:
#Just explain here what will happen next tech stream, creating a basic decision tree. Next concept over/underfitting, randomizing data, etc.?

### Additional Resources

In [3]:
#Include links to pandas documentation, maybe try and find some videos, etc.