# Doing Data Science: chapter 2 - Housing Dataset

Python code for the exercise on the RealDirect study about the housing dataset ("*improve the way people sell and buy houses*").

**Author**: Damien Garaud

**Project on Github**: https://github.com/garaud/doing_pydata_science

## Getting Data 

Clone the official GitHub project at https://github.com/oreillymedia/doing_data_science and unzip the `dds_datasets.zip` file. Inside, you'll find another ZIP file named `dds_chapter2_rollingsales.zip`. Unzip it to get five XLS files:

- `rollingsales_bronx.xls`
- `rollingsales_brooklyn.xls`
- `rollingsales_manhattan.xls`
- `rollingsales_queens.xls`
- `rollingsales_statenisland.xls`
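
The two unzip steps can also be scripted. This is only a minimal sketch with the standard `zipfile` module: the helper name `extract_nested` is hypothetical, and the exact layout inside `dds_datasets.zip` is an assumption, so the inner archive is searched for recursively rather than expected at a fixed path.

```python
import zipfile
from pathlib import Path


def extract_nested(outer_zip, inner_name, dest):
    """Extract `outer_zip` into `dest`, then extract the inner archive too.

    Returns the sorted list of .xls paths found under `dest`.
    (Hypothetical helper, not part of the original notebook.)
    """
    dest = Path(dest)
    with zipfile.ZipFile(outer_zip) as outer:
        outer.extractall(dest)
    # The inner archive may sit in a sub-directory, so search for it.
    inner_path = next(dest.rglob(inner_name))
    with zipfile.ZipFile(inner_path) as inner:
        inner.extractall(dest)
    return sorted(dest.rglob("*.xls"))


# Example call, assuming the outer archive is in the current directory:
# extract_nested("dds_datasets.zip", "dds_chapter2_rollingsales.zip", "data")
```

Depending on how the archives are organized, the XLS files may end up in a sub-directory of `data`; the returned paths show where they actually landed.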

**Note**: I'll try to write a **function** for each task carried out. Functions are good.

## Modules

In [1]:
import pandas as pd

In [2]:
print(pd.__version__)

0.17.1


## Reading Data

Keep using pandas to read the XLS files. Suppose the `rollingsales_AREA.xls` files are in the `data` directory.

The header is on row 5 of the spreadsheet (spreadsheet rows are numbered from 1). pandas counts rows from 0, so this corresponds to `header=4`.
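
To illustrate the off-by-one between spreadsheet rows and the `header` argument without needing the XLS files, here is a small sketch on made-up CSV text. It uses `pd.read_csv`, whose `header` argument is also 0-based; the toy rows and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for the spreadsheet: four preamble rows, then the header
# on spreadsheet row 5. All values are made up for illustration.
text = "\n".join([
    "preamble 1,,",
    "preamble 2,,",
    "preamble 3,,",
    "preamble 4,,",
    "BOROUGH,BLOCK,LOT",
    "3,100,1",
    "3,101,2",
])

spreadsheet_row = 5                  # 1-based, as shown in a spreadsheet
header_index = spreadsheet_row - 1   # 0-based, as expected by pandas

toy = pd.read_csv(io.StringIO(text), header=header_index)
print(list(toy.columns))
```

The rows above the header are discarded, and the two remaining rows become the data.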

In [11]:
def read_data(fname):
    """Read data from an Excel file
    """
    return pd.read_excel(fname, header=4)

In [18]:
brooklyn = read_data("./data/rollingsales_brooklyn.xls")

In [4]:
brooklyn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23373 entries, 0 to 23372
Data columns (total 21 columns):
BOROUGH                           23373 non-null int64
NEIGHBORHOOD                      23373 non-null object
BUILDING CLASS CATEGORY           23373 non-null object
TAX CLASS AT PRESENT              23373 non-null object
BLOCK                             23373 non-null int64
LOT                               23373 non-null int64
EASE-MENT                         23373 non-null object
BUILDING CLASS AT PRESENT         23373 non-null object
ADDRESS                           23373 non-null object
APART
MENT
NUMBER                 23373 non-null object
ZIP CODE                          23373 non-null int64
RESIDENTIAL UNITS                 23373 non-null int64
COMMERCIAL UNITS                  23373 non-null int64
TOTAL UNITS                       23373 non-null int64
LAND SQUARE FEET                  23373 non-null int64
GROSS SQUARE FEET                 23373 non-null int64
YEAR

The DataFrame has 21 columns with several different data types.

## Load and Clean up

Quoting: *First challenge: load in and clean up the data. Next, conduct
exploratory data analysis in order to find out where there are
outliers or missing values, decide how you will treat them, make
sure the dates are formatted correctly, make sure values you
think are numerical are being treated as such, etc.*
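
As a rough sketch of the "make sure values you think are numerical are being treated as such" and "make sure the dates are formatted correctly" steps, the helper below coerces columns with `pd.to_numeric` and `pd.to_datetime`. The function name `coerce_types` and the toy column names `price` and `date` are hypothetical; they are not taken from the dataset.

```python
import pandas as pd


def coerce_types(df, numeric_cols=(), date_cols=()):
    """Return a copy with the given columns converted to numbers / dates.

    Values that cannot be parsed become NaN (numbers) or NaT (dates).
    (Hypothetical helper, not part of the original notebook.)
    """
    df = df.copy()
    for col in numeric_cols:
        # Drop currency signs, thousands separators and whitespace first.
        cleaned = df[col].astype(str).str.replace(r"[$,\s]", "", regex=True)
        df[col] = pd.to_numeric(cleaned, errors="coerce")
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df


# Made-up values, for illustration only.
toy = pd.DataFrame({
    "price": ["$1,200", "-", "350000"],
    "date": ["2013-01-15", "not a date", "2012-12-01"],
})
typed = coerce_types(toy, numeric_cols=["price"], date_cols=["date"])
print(typed.dtypes)
```

Coercing unparseable values to missing ones makes them easy to count afterwards with `isnull().sum()`, which is one way to decide how to treat them.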

Rename some columns (some names contain `\n`).

In [16]:
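def clean_columns(df):
    """Clean some column names
    """
    # Replace embedded newlines with spaces and lowercase every name
    df.columns = [x.replace('\n', ' ').lower() for x in df.columns]
    # `rename(columns=...)` replaces `rename_axis` with a dict, which was
    # deprecated in later pandas releases
    return df.rename(columns={'apart ment number': 'apartment number'})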
    
def missing_string(df, colnames):
    """Strip content and replace empty strings with the string 'NaN'
    """
    df = df.copy()
    for col in colnames:
        df[col] = df[col].str.strip().apply(lambda x: 'NaN' if not x else x)
    return df

In [6]:
brooklyn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23373 entries, 0 to 23372
Data columns (total 21 columns):
borough                           23373 non-null int64
neighborhood                      23373 non-null object
building class category           23373 non-null object
tax class at present              23373 non-null object
block                             23373 non-null int64
lot                               23373 non-null int64
ease-ment                         23373 non-null object
building class at present         23373 non-null object
address                           23373 non-null object
apart ment number                 23373 non-null object
zip code                          23373 non-null int64
residential units                 23373 non-null int64
commercial units                  23373 non-null int64
total units                       23373 non-null int64
land square feet                  23373 non-null int64
gross square feet                 23373 non-null int64
year

The column `apart ment number` has an odd name and is an object (i.e. a string). Rename it and check whether its values can be integers.

**Update** the `clean_columns` function accordingly (the definition above already includes this rename).

In [19]:
brooklyn = clean_columns(brooklyn)

In [20]:
brooklyn['apartment number'].sample(10)

4824                 
4223                 
21382    4C          
13851                
18186                
7375                 
99                   
7579                 
6079                 
6935                 
Name: apartment number, dtype: object
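
A sample only shows a few rows. To check over a whole column whether the values could be integers, something like the sketch below may help. The helper name `could_be_int` is hypothetical, and the toy values are made up to mimic the sample.

```python
import pandas as pd


def could_be_int(series):
    """True if every non-blank value consists of digits only.

    Note: a column containing only blanks also returns True.
    (Hypothetical helper, not part of the original notebook.)
    """
    values = series.dropna().astype(str).str.strip()
    values = values[values != ""]   # ignore empty strings
    return bool(values.str.isdigit().all())


# Made-up values mimicking the sample: blanks, padding and a letter suffix.
toy = pd.Series(["", "4C          ", "12", "   ", "3 "])
print(could_be_int(toy))
```

With a value such as `4C` in the column, the check fails, so the column has to stay a string.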

OK, sometimes you get two digits, or one digit and a letter. We can normalize that by stripping the surrounding whitespace and replacing every empty string with 'NaN'.

In [22]:
brooklyn = missing_string(brooklyn, ['apartment number'])
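
Note that `missing_string` writes the *string* `'NaN'`, which pandas does not treat as a missing value. As a possible alternative (not the approach used in this notebook), the sketch below stores a real missing value instead, so that `isnull()` detects it. The name `blank_to_nan` is hypothetical.

```python
import pandas as pd


def blank_to_nan(df, colnames):
    """Strip the given string columns and turn empty strings into real NaN.

    (Hypothetical variant of `missing_string`, shown for comparison.)
    """
    df = df.copy()
    for col in colnames:
        stripped = df[col].str.strip()
        # `mask` replaces the entries where the condition holds with NaN.
        df[col] = stripped.mask(stripped == "")
    return df


# Made-up values, for illustration only.
toy = pd.DataFrame({"apartment number": ["", "4C          ", "   "]})
cleaned = blank_to_nan(toy, ["apartment number"])
print(cleaned["apartment number"].isnull().sum())
```

Which variant fits better depends on what comes next: the string `'NaN'` keeps the column purely textual, while a real NaN works with `isnull()`, `dropna()` and `fillna()`.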

Let's see if other string fields have the same issue.
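
One way to look for this, sketched under the assumption that a "string field" is any column that is neither numeric nor a date: count the empty or whitespace-only values per column. The helper name `blank_counts` is hypothetical.

```python
import pandas as pd


def blank_counts(df):
    """Count empty or whitespace-only values in each non-numeric, non-date column.

    (Hypothetical helper, not part of the original notebook.)
    """
    counts = {}
    for col in df.select_dtypes(exclude=["number", "datetime"]).columns:
        values = df[col].dropna().astype(str).str.strip()
        counts[col] = int((values == "").sum())
    return pd.Series(counts, dtype="int64")


# Made-up values, for illustration only.
toy = pd.DataFrame({
    "a": ["x", " ", ""],
    "b": [1, 2, 3],
    "c": ["y", "z", "w"],
})
print(blank_counts(toy))
```

Applied to the real data, the call would be `blank_counts(brooklyn)`; columns with a non-zero count are candidates for `missing_string`.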