# Project - EDA with Pandas Using the Ames Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains data on residential home sales in Ames, Iowa.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Load the data (which is stored in the file ``ames_train.csv``)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split.
* Next, use histograms and scatter plots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
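
As a minimal sketch of the centrality and dispersion step, the snippet below summarizes three columns with pandas. To stay self-contained it uses only the five rows shown in the `df_value.head()` preview later in this notebook; the frame name `sample` and the choice of statistics are assumptions for illustration, and in the project itself the same calls would be applied to the full `df`.

```python
import pandas as pd

# Five rows of three columns, copied from the df_value.head() preview in this
# notebook. `sample` is an illustrative stand-in for the full DataFrame `df`.
sample = pd.DataFrame({
    "OverallQual": [7, 6, 7, 7, 8],
    "TotRmsAbvGrd": [8, 6, 6, 7, 9],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Centrality: mean and median. Dispersion: standard deviation, range, IQR.
summary = pd.DataFrame({
    "mean": sample.mean(),
    "median": sample.median(),
    "std": sample.std(),  # sample standard deviation (ddof=1)
    "range": sample.max() - sample.min(),
    "iqr": sample.quantile(0.75) - sample.quantile(0.25),
})
print(summary)
```

Comparing the mean with the median is a quick check for skew: a mean well above the median suggests a long right tail, which is common for prices.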

## Variable Descriptions
Look in ``data_description.txt`` for a full description of all variables.

A preview of some of the columns:

**MSZoning**: Identifies the general zoning classification of the sale.
		
       A	 Agriculture
       C	 Commercial
       FV	Floating Village Residential
       I	 Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

**OverallCond**: Rates the overall condition of the house

       10	Very Excellent
       9	 Excellent
       8	 Very Good
       7	 Good
       6	 Above Average	
       5	 Average
       4	 Below Average	
       3	 Fair
       2	 Poor
       1	 Very Poor

**KitchenQual**: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

**YrSold**: Year Sold (YYYY)

**SalePrice**: Sale price of the house in dollars
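
Rating codes such as those listed for `KitchenQual` have a natural order that plain strings do not carry. One possible way to encode that order in pandas is an ordered categorical, sketched below. The five values in `kitchen` are hypothetical and are not taken from `ames_train.csv`; only the code list comes from the description above.

```python
import pandas as pd

# Quality codes from the KitchenQual description, ordered from worst to best
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Hypothetical values for illustration only (not read from ames_train.csv)
kitchen = pd.Series(["Gd", "TA", "Ex", "TA", "Fa"], name="KitchenQual")

# An ordered categorical keeps the labels but knows that Po < Fa < TA < Gd < Ex
kitchen_ordered = pd.Series(
    pd.Categorical(kitchen, categories=quality_order, ordered=True),
    name="KitchenQual",
)

# Sorting and comparisons now follow quality order, not alphabetical order
print(kitchen_ordered.sort_values().tolist())
print((kitchen_ordered >= "Gd").sum())

# Integer codes (0 = Po ... 4 = Ex) are available if a numeric scale is needed
print(kitchen_ordered.cat.codes.tolist())
```

With the full dataset, the same idea would apply to `df['KitchenQual']`, which makes a split such as "Good or better" versus the rest easy to express.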

In [1]:
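# Let's get started by importing the necessary libraries
import pandas as pd
import matplotlib as plt  # note: `plt` is the top-level matplotlib package here, not pyplot
import matplotlib.pyplot  # load the pyplot submodule so that plt.pyplot is available below
%matplotlib notebook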

In [2]:
# Loading the data
df = pd.read_csv('ames_train.csv')

In [3]:
# Investigate the Data
df.info()
less_necessary = ['Id',]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n...

In [10]:
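# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'
plt.style.use('seaborn')
# Two column groups: size-related features and quality/age-related features
lst1 = ['MSSubClass', 'LotFrontage', 'LotArea', 'YearBuilt', '1stFlrSF', 'TotRmsAbvGrd', 'SalePrice']
lst2 = ['OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'TotRmsAbvGrd', 'YrSold', 'SalePrice']
df_area = df.loc[:, lst1]
#df_area.head()
df_value = df.loc[:, lst2]
df_value.head()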

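|   | OverallQual | OverallCond | YearBuilt | YearRemodAdd | TotRmsAbvGrd | YrSold | SalePrice |
|---|-------------|-------------|-----------|--------------|--------------|--------|-----------|
| 0 | 7 | 5 | 2003 | 2003 | 8 | 2008 | 208500 |
| 1 | 6 | 8 | 1976 | 1976 | 6 | 2007 | 181500 |
| 2 | 7 | 5 | 2001 | 2002 | 6 | 2008 | 223500 |
| 3 | 7 | 5 | 1915 | 1970 | 7 | 2006 | 140000 |
| 4 | 8 | 5 | 2000 | 2000 | 9 | 2008 | 250000 |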
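
The project goals ask for 2-way splits built with `.loc`. The sketch below shows one such split on the five preview rows above, using a boolean mask on `YearBuilt`. The year-2000 cutoff and the names `sample`, `newer`, and `older` are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Five rows copied from the df_value.head() preview above (illustration only)
sample = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "OverallCond": [5, 8, 5, 5, 5],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Boolean mask defining the split; the cutoff year is an arbitrary choice
built_2000_or_later = sample["YearBuilt"] >= 2000

# .loc takes the row mask first, then the columns to keep
newer = sample.loc[built_2000_or_later, ["YearBuilt", "SalePrice"]]
older = sample.loc[~built_2000_or_later, ["YearBuilt", "SalePrice"]]

# Put the same statistics side by side for both halves of the split
comparison = pd.DataFrame({
    "newer": newer["SalePrice"].agg(["count", "median", "std"]),
    "older": older["SalePrice"].agg(["count", "median", "std"]),
})
print(comparison)
```

Negating the mask with `~` guarantees the two subsets are complementary, so every row lands in exactly one of them.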


In [11]:
# Investigating Distributions using scatter_matrix
pd.plotting.scatter_matrix(df_area);

In [19]:
pd.plotting.scatter_matrix(df_value);

In [22]:
# Compare the construction year with the remodel year
df.plot.scatter(x='YearBuilt', y='YearRemodAdd');

In [21]:
# Create a plot that shows the SalePrice distribution
plt.pyplot.figure()
df['SalePrice'].plot.hist();

In [23]:
# Create a plot that shows the LotArea Distribution
plt.pyplot.figure()
df['LotArea'].plot.hist();

In [30]:
# Create a plot that shows the Distribution of the overall house condition
plt.pyplot.figure()
bin_count = len(df['OverallCond'].unique())
df['OverallCond'].plot.hist(bins=bin_count);
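
When an integer `bins=k` is passed, the histogram divides the span from the smallest to the largest value into `k` equal-width bins, so the bin edges may not fall between integer ratings. A possible alternative, sketched below under the assumption that ratings run from 1 to 10 as in the `OverallCond` description, is to pass explicit edges at the half-integers. The `ratings` values are the five `OverallCond` entries from the preview rows earlier in this notebook.

```python
import numpy as np
import pandas as pd

# OverallCond values from the five preview rows (illustration only)
ratings = pd.Series([5, 8, 5, 5, 5], name="OverallCond")

# Edges at 0.5, 1.5, ..., 10.5 give ten bins, each centred on one rating 1-10
edges = np.arange(0.5, 11.5, 1.0)

# np.histogram uses the same binning rule as the plotting call
counts, _ = np.histogram(ratings, bins=edges)
print(dict(zip(range(1, 11), counts.tolist())))

# The same edges can be passed to the plot, for example:
# ratings.plot.hist(bins=edges)
```

With these edges each bar corresponds to exactly one rating, which makes the histogram read like a bar chart of counts.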

In [31]:
# Create a Box Plot for SalePrice
plt.pyplot.figure()
df['SalePrice'].plot.box();

In [32]:
# Explore home values by age
df.plot.scatter(x='YearBuilt', y='SalePrice');
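
To compare this relationship across the two halves of a split, the goals suggest subplots. The sketch below draws one scatter plot per subset on shared axes. The split (remodeled after construction versus not) and the names `sample` and `subsets` are assumptions for illustration, and the data are again the five preview rows rather than the full `df`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit inside the notebook
from matplotlib import pyplot  # own name, since this notebook uses `plt` for the top-level package
import pandas as pd

# Five rows copied from the df_value.head() preview (illustration only)
sample = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Hypothetical 2-way split: remodeled after construction vs. not
remodeled = sample["YearRemodAdd"] > sample["YearBuilt"]
subsets = {
    "Remodeled": sample.loc[remodeled],
    "Not remodeled": sample.loc[~remodeled],
}

# One panel per subset; shared axes make the panels directly comparable
fig, axes = pyplot.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, (label, subset) in zip(axes, subsets.items()):
    ax.scatter(subset["YearBuilt"], subset["SalePrice"])
    ax.set_title(f"{label} (n={len(subset)})")
    ax.set_xlabel("YearBuilt")
axes[0].set_ylabel("SalePrice")
fig.tight_layout()
```

Sharing both axes keeps the scales identical, so differences between the panels reflect the data rather than the axis limits.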

## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!