# Explore Housing Price Data

The dataset included in this repository is from [Zillow.com](https://www.zillow.com/research/data/) -- it represents the median price of a 3-bedroom house in every state on a monthly basis since 1996.  

We're going to use it to practice getting data from a file, computing averages, looking for outliers, and exploring.

First, we import the data and inspect it.

In [None]:
# This cell imports python modules that make it easier to access the data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# This cell imports the data file and stores it in the variable df
df = pd.read_csv('../data/State_Zhvi_3bedroom.csv', index_col='RegionName')


Inspect the data by printing the names of the data columns and the whole data frame.  You can also inspect it by opening the data file using the file browser to the left.


In [None]:
df.columns

In [None]:
df

# What do we notice?

You should see that the file is organized on a monthly basis, with the rows ordered by the population of the state (Wyoming is last, California is first).

You should also notice that there are blanks or "NaN" values for North Dakota between 1996 and 2005 (NaN stands for "Not a Number").  

Why?  I don't know for sure, but this presumably means that North Dakota didn't provide data to Zillow for that time period, or they didn't have it.  This is going to be an issue because some of the analysis we will want to do will be confused by the missing entries.
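If you would rather not hunt for blanks by eye, pandas can report which rows contain missing entries. The sketch below uses a small made-up table (the state names and prices are invented for illustration, not taken from the Zillow file) and assumes only that missing entries are stored as NaN:

```python
import numpy as np
import pandas as pd

# Toy table standing in for the real data; all values are invented
toy = pd.DataFrame(
    {"1996-04": [100.0, np.nan, 80.0], "1996-05": [101.0, np.nan, 81.0]},
    index=["StateA", "StateB", "StateC"],
)

# isna() marks each missing cell; any(axis=1) flags rows with at least one
rows_with_gaps = toy.index[toy.isna().any(axis=1)].tolist()
print(rows_with_gaps)
```

The same expression applied to `df` should list any state with at least one missing month.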

People who do data analysis for a living say that they end up spending **a large percentage of their time** dealing with issues like this -- cleaning up and preparing data in a usable format before they can do any actual analysis.

In this case, since it's only the one state with any missing data, we're just going to ignore that entire row, using the following:

In [None]:
df.dropna(inplace=True)

"dropna" is a command that "drops" every row containing an "na" (missing) value.  The **inplace=True** part changes the data frame stored in df directly instead of returning a new one, so we don't have to assign the result to another variable.

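To see the difference that **inplace=True** makes, here is a minimal sketch on a made-up table (the names and numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy table with one missing value; all values are invented
toy = pd.DataFrame(
    {"1996-04": [100.0, np.nan, 80.0], "1996-05": [101.0, 95.0, 81.0]},
    index=["StateA", "StateB", "StateC"],
)

# Without inplace, dropna returns a new table and leaves the original alone
cleaned = toy.dropna()
rows_before = len(toy)        # the original still has every row
rows_cleaned = len(cleaned)   # the copy has lost the row with the NaN

# With inplace=True, the original variable itself is modified
toy.dropna(inplace=True)
rows_after = len(toy)
print(rows_before, rows_cleaned, rows_after)
```

Either form works; the in-place version simply saves you from keeping track of a second variable.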


In [None]:
months = df.columns[2:]
months

**months** is now a variable holding all the months in the data table, and you can see that it started in April of 1996 and ended in October of 2018. 



You can now refer to the data in one of two ways, using the month or the state name:

In [None]:
df.loc['Washington']

In [None]:
df['2018-10']

Or you can use the two methods together to access a specific state in a specific month:

In [None]:
print('Apr 1996:', df.loc['Washington']['1996-04'])
print('Oct 2018:', df.loc['Washington']['2018-10'])
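As an aside, `.loc` also accepts the row label and the column label together in a single call. The sketch below compares the two spellings on a made-up table (the states and prices are invented):

```python
import pandas as pd

# Toy table; all values are invented for illustration
toy = pd.DataFrame(
    {"1996-04": [100.0, 80.0], "2018-10": [250.0, 190.0]},
    index=["StateA", "StateB"],
)

# Two steps: pick the row first, then the month within that row
chained = toy.loc["StateA"]["2018-10"]

# One step: give .loc the row label and the column label together
direct = toy.loc["StateA", "2018-10"]
print(chained, direct)
```

Both forms return the same value when reading data, so use whichever you find easier to read.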

# First Task

Your first task is to calculate the average nationwide price in October of 2018 using the **accumulator** pattern we used for the rainfall problem, which means you need to **initialize a variable** to hold the total of all the home prices during October 2018, then use a **loop** to accumulate the values, and divide by the number of data points to get the average.
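As a reminder of the pattern, here is a minimal sketch of an accumulator on a plain list. The rainfall numbers are hypothetical and are not the values from the earlier exercise:

```python
# Hypothetical rainfall amounts; any list of numbers would do
rainfall = [0.5, 1.25, 0.0, 2.25]

# Initialize the accumulator variables before the loop
total = 0.0
count = 0

# Accumulate inside the loop
for amount in rainfall:
    total += amount
    count += 1

# Divide after the loop to get the average
average = total / count
print(average)
```

The housing version has the same three parts: initialize, accumulate in the loop, divide at the end.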

The cell below is set up with a loop that will access each state in turn -- right now it prints out the state name.  You will need to add some code before the loop to initialize the variables, and some code in the loop to access the home price for every state during the month of October 2018 and add it to the running total:

In [None]:
# Add some lines before the loop to initialize the accumulator variables

for state in df.index:
    # add some lines in the loop to access the price during October 2018
    # and add it to the running total
    print(state)
    
# Add some lines after the loop to compute the average and print it

# Second Task

 * How would you modify or use the code above to compute the national average for a different month?
 * How would you modify the code above to find the maximum value in a given month? The minimum?
 * How could you modify or use this code to quickly and easily compute the average price for every month in the data set?
 * Complete at least one of these tasks and make a markdown cell to explain what you did.
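The accumulator idea is not limited to sums: the variable you carry through the loop can just as well remember the largest value seen so far. Here is one possible sketch on invented data (the dictionary below is hypothetical, not the Zillow table), which you would still need to adapt to `df`:

```python
# Invented prices for illustration only
prices = {"StateA": 210.0, "StateB": 340.0, "StateC": 180.0}

# Start with "nothing seen yet"
largest_price = None
largest_state = None

for state, price in prices.items():
    # Replace the remembered value whenever a bigger one turns up
    if largest_price is None or price > largest_price:
        largest_price = price
        largest_state = state

print(largest_state, largest_price)
```

Flipping the comparison gives the minimum instead.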

 

# Third Task

 * Which state changed the most in the 22 years?  The least?
 * How could you find the median value in a given month?
 * The most common value?
 * Is there a way to identify states whose pricing is very different from the national average or trends?
 * Complete at least the first of these (determine which states changed the most and the least in the 22 years) and explain how you did it.
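For the median, most common value, and outlier questions, pandas has built-in helpers you may want to experiment with. The sketch below is one possible starting point, not the only approach; it uses an invented Series in place of a real month of data, and measuring distance from the mean in standard deviations is just one assumed way to define "very different":

```python
import pandas as pd

# Invented prices for a single month; not real data
month_prices = pd.Series(
    [210.0, 340.0, 180.0, 210.0],
    index=["StateA", "StateB", "StateC", "StateD"],
)

# Middle value once the prices are sorted
median_price = month_prices.median()

# mode() returns every value tied for most common, so convert to a list
most_common = month_prices.mode().tolist()

# Distance of each state from the average, in standard deviations
distance = (month_prices - month_prices.mean()) / month_prices.std()
farthest_state = distance.abs().idxmax()

print(median_price, most_common, farthest_state)
```

With real data, `df['2018-10']` would take the place of `month_prices`.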
