# Basic Data Exploration (Kaggle Tutorial)

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.
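To make the table analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The rows and values are made up for illustration and are not taken from the Melbourne dataset; only the idea of named columns holding one value per row carries over.

```python
import pandas as pd

# Hypothetical toy table: three homes, one row each.
# The values are invented for illustration only.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [850000.0, 1035000.0, 1600000.0],
    "Suburb": ["Abbotsford", "Richmond", "Kew"],
})

# shape is (number of rows, number of columns), like a sheet's dimensions
print(homes.shape)

# Column names play the role of the header row in a spreadsheet
print(homes.columns.tolist())

# Selecting one column by name returns a pandas Series
print(homes["Price"])
```

Each dictionary key becomes a column and each list position becomes a row, which is roughly how a sheet in Excel or a table in SQL is laid out.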

Pandas has powerful methods for most things you'll want to do with this type of data. As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.

The example (Melbourne) data is at the file path: <code>$DATA/input/melbourne-housing-snapshot/melb_data.csv</code>. We load and explore the data with the following commands:

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

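# Path to the CSV file; adjust it to wherever the data lives in your environment
filename = 'dataset/melbourne_housing/melb_data.csv'
data = pd.read_csv(filename)

# Show summary statistics for the numeric columns
data.describe()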

## Interpreting Data Description

The results show 8 numbers for each numeric column in your original dataset. To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value.

1. The first number, the count, shows how many rows have non-missing values. Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1-bedroom house. We'll come back to the topic of missing data.

2. The second value is the mean, which is the average.

3. Under that, std is the standard deviation, which measures how numerically spread out the values are.

4. The min is the first (smallest) value in the sorted column.

5. If you go a quarter of the way through the sorted list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile").

6. The 50% value (the 50th percentile, also called the median) is defined analogously.

7. The 75% value (the 75th percentile) is defined analogously.

8. The max is the largest number.
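The statistics described above can be checked by hand on a small example. The sketch below uses a made-up Series named `sizes` (hypothetical values, not from the Melbourne data) with one missing entry, and compares the output of `describe()` with the individual pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration; np.nan marks a missing entry.
sizes = pd.Series([10.0, 20.0, 30.0, 40.0, np.nan], name="Size")

# describe() returns the 8 summary numbers as a Series
summary = sizes.describe()
print(summary)

# The same numbers, computed one at a time
count = sizes.count()   # counts only the non-missing values
mean = sizes.mean()     # average of the non-missing values
std = sizes.std()       # sample standard deviation (divides by n - 1)
q25, q50, q75 = sizes.quantile([0.25, 0.5, 0.75])  # percentiles
smallest, largest = sizes.min(), sizes.max()

print(count, mean, std, smallest, q25, q50, q75, largest)
```

Two details are worth noticing. The count is 4, not 5, because the missing entry is skipped. The 25% value comes out as 17.5, which is not one of the original values: by default pandas interpolates linearly between neighbouring values when a percentile falls between two rows.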

## Exercise

In [None]:
# What is the average land size (rounded to the nearest integer)?
# np.round returns a float, so cast the result to get an actual integer
avg_ls = int(np.round(data['Landsize'].mean()))
avg_ls

In [None]:
# As of today, how old is the newest home (current year minus the year it was built)?
# YearBuilt is stored as a float, so cast the latest year to int before subtracting
newest = datetime.now().year - int(data['YearBuilt'].max())
newest