## Characteristics of New Housing

This page provides national, annual data on the characteristics of new privately-owned residential structures, such as square footage, number of bedrooms and bathrooms, type of wall material, and sales prices. Many characteristics are available at the region level. Source: https://www.census.gov/construction/chars/

### Data Sources
- soc99.zip through soc17.zip : Survey of Construction microdata archives, downloaded from https://www.census.gov/construction/chars/microdata.html

### Changes
- 12-29-2018 : Started project

In [18]:
import pandas as pd
from pathlib import Path
from datetime import datetime
import zipfile
import requests
import io
import glob
import numpy as np

### File Locations

In [3]:
today = datetime.today()
in_file = Path.cwd() / "data" / "raw" / "FILE1"
in_directory = Path.cwd() / "data" / "raw"
summary_file = Path.cwd() / "data" / "processed" / f"summary_{today:%b-%d-%Y}.pkl"




### Load Data from URL Sources

Main page for manually downloading these files:
https://www.census.gov/construction/chars/microdata.html

In [4]:
file_names = ['soc17.zip','soc16.zip','soc15.zip','soc14.zip','soc13.zip',
              'soc12.zip','soc11.zip','soc10.zip','soc09.zip','soc08.zip',
              'soc07.zip','soc06.zip','soc05.zip','soc04.zip','soc03.zip',
              'soc02.zip','soc01.zip','soc00.zip','soc99.zip']

for file in file_names:
    source_url = f'https://www.census.gov/construction/chars/xls/{file}'

    # Extract each archive into its own folder, e.g. data/raw/soc17
    out_dir = in_directory / Path(file).stem
    r = requests.get(source_url)
    r.raise_for_status()  # stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(out_dir)
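The explicit list of archive names could also be generated from a range of years. The sketch below assumes the `socYY.zip` naming pattern continues to hold for every year in the range; `soc_zip_names` is a hypothetical helper, not part of the original notebook.

```python
def soc_zip_names(first_year, last_year):
    """Build names like 'soc17.zip', newest year first, from four-digit years."""
    return [f"soc{year % 100:02d}.zip"
            for year in range(last_year, first_year - 1, -1)]


# Reproduces the 1999-2017 list used in this notebook
file_names = soc_zip_names(1999, 2017)
print(file_names[0], file_names[-1], len(file_names))
```

Adding a new survey year then only requires changing the upper bound, provided the Census site keeps the same naming scheme.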
    


In [5]:
# Credit: http://pbpython.com/excel-file-combine.html
frames = []
for f in glob.glob("./data/raw/*/*.xls"):
    df = pd.read_excel(f)
    # Record the source folder (e.g. soc17) so each row can be traced to its file
    df['FileName'] = Path(f).parent.name
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_data = pd.concat(frames, ignore_index=True, sort=False)
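When yearly files do not share exactly the same columns, combining them leaves missing values in the columns that some files lack. That is one possible explanation (an assumption, not verified here) for columns whose non-null count is lower than the total row count. A minimal sketch with two made-up frames standing in for yearly files:

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files with slightly different columns
older = pd.DataFrame({"ACS": [1, 2], "SQFS": [1700, 2600]})
newer = pd.DataFrame({"ACS": [1], "SQFS": [2100], "ID": [1.0]})

combined = pd.concat(
    [older.assign(FileName="soc16"), newer.assign(FileName="soc17")],
    ignore_index=True,
    sort=False,
)

# Non-null counts per column show which columns are absent from some files
coverage = combined.notna().sum()
print(coverage)
```

Grouping the same check by `FileName` would show which years contribute each column.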




In [111]:
all_data.describe()

| | ACS | AGER | ASSOC | BASE | CAT | CLOS | CON | DECK | DET | DIV | ... | FSLPR | PVALU | SQFS | FSQFS | LOTV | FNSQ | FFNSQ | AREA | AUTH | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | ... | 88727.0 | 88727.0 | 88727.0 | 54522.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 88727.0 | 23690.0 |
| mean | 1.032403 | 1.95831 | 1.401389 | 2.184273 | 1.44185 | 0.0 | 1.098279 | 1.653544 | 1.115613 | 5.586496 | ... | 92074.79 | 216487.0 | 2587.919111 | 2087.344485 | 17956.404477 | 99.821137 | 6.951244 | 31421.91058 | 201515.578291 | 11845.5 |
| std | 0.31831 | 0.235261 | 0.557948 | 0.967597 | 0.787598 | 0.0 | 0.982427 | 0.568563 | 0.32518 | 2.274503 | ... | 190669.4 | 208311.4 | 1306.470102 | 1505.783732 | 40312.67306 | 356.299658 | 96.217009 | 78991.57148 | 130.821321 | 6838.858275 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 200806.0 | 1.0 |
| 25% | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 | ... | 0.0 | 100000.0 | 1800.0 | 1200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 201408.0 | 5923.25 |
| 50% | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.0 | 2.0 | 2.0 | 1.0 | 5.0 | ... | 0.0 | 189800.0 | 2405.0 | 2138.0 | 0.0 | 0.0 | 0.0 | 7810.0 | 201509.0 | 11845.5 |
| 75% | 1.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 | 2.0 | 1.0 | 7.0 | ... | 0.0 | 290000.0 | 3228.0 | 2999.0 | 24600.0 | 0.0 | 0.0 | 17100.0 | 201609.0 | 17767.75 |
| max | 2.0 | 2.0 | 2.0 | 4.0 | 4.0 | 0.0 | 2.0 | 2.0 | 2.0 | 9.0 | ... | 2060000.0 | 3060300.0 | 8240.0 | 7937.0 | 733000.0 | 3967.0 | 3400.0 | 435600.0 | 201803.0 | 23690.0 |


In [113]:
all_data.head()

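| | ACS | AGER | ASSOC | BASE | CAT | CLOS | CON | DECK | DET | DIV | ... | PVALU | SQFS | FSQFS | LOTV | FNSQ | FFNSQ | AREA | AUTH | ID | FileName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 2 | 1 | ... | 0 | 1700 | 1700.0 | 0 | 0 | 0 | 0 | 201501 | 1.0 | soc17 |
| 1 | 1 | 2 | 2 | 1 | 1 | 0 | 0 | 2 | 2 | 1 | ... | 175000 | 2600 | 0.0 | 0 | 0 | 0 | 0 | 201609 | 2.0 | soc17 |
| 2 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 2100 | 0.0 | 0 | 0 | 0 | 0 | 201710 | 3.0 | soc17 |
| 3 | 1 | 2 | 2 | 1 | 1 | 0 | 2 | 2 | 1 | 1 | ... | 130000 | 1800 | 0.0 | 50000 | 0 | 0 | 21800 | 201603 | 6.0 | soc17 |
| 4 | 1 | 2 | 2 | 1 | 1 | 0 | 2 | 1 | 1 | 1 | ... | 80000 | 1500 | 1500.0 | 0 | 0 | 0 | 4400 | 201612 | 8.0 | soc17 |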


### Column Cleanup
- Rename the columns for consistency; they are all acronyms.

In [12]:
cols_to_rename = {'ACS': 'CentralAir',
                  'AGER': 'RestrictedAge',
                  'ASSOC': 'HomeOwnerAssociation',
                  'BASE': 'FoundationType',
                  'CAT': 'BuildReason',
                  'CLOS': 'ClosingCostsInc',
                  'CON': 'CondominiumProject',
                  'DECK': 'Deck',
                  'DET': 'HouseDesign',
                  'DIV': 'CensusDivision',
                  'FINC': 'FinancingType',
                  'FNBS': 'FinishedBasement',
                  'FOYER': 'TwoStoryFoyer',
                  'FRAME': 'FramingMaterial',
                  'GAR': 'Garage',
                  'HEAT': 'HeatingSystemPrimary',
                  'HEAT2': 'HeatingSystemSecondary',
                  'LNDR': 'LaundryLocation',
                  'METRO': 'InMetroArea',
                  'MFGS': 'ConstructionMethod',
                  'PATI': 'Patio',
                  'PRCH': 'Porch',
                  'SEWER': 'SewerType',
                  'STOR': 'Stories',
                  'WALS': 'WAL1-MERGE',
                  'WAL1': 'ExteriorMaterialsPrimary',
                  'WAL2': 'ExteriorMaterialsSecondary',
                  'WATER': 'WaterSupply',
                  'AREA': 'LotSizeSQFT',
                  'BEDR': 'Bedrooms',
                  'COMP': 'CompletionDate',
                  'FNSQ': 'FinishedBasementSQFT',
                  'FFNSQ': 'FinishedBasementFinalSQFT',
                  'FPLS': 'Fireplaces',
                  'FULB': 'BathroomsFull',
                  'HAFB': 'BathroomsHalf',
                  'LOTV': 'LotValue',
                  'PVALU': 'PermitValue',
                  'SALE': 'SaleDate',
                  'FSQFS': 'FootageFinalSQFT',
                  'STRT': 'StartDate',
                  'CONPR': 'PriceContract',
                  'SLPR': 'PriceSales',
                  'SQFS': 'FootagePrelimSQFT',
                  'WEIGHT': 'SurveyWeight',
                  'FUEL': 'HeatingSystemFuelPrimary',
                  'FUEL2': 'HeatingSystemFuelSecondary',
                  'FCONPR': 'PriceContractAtCompletion',
                  'FSLPR': 'PriceSalesAtCompletion',

                  'AREA_F': 'LotSizeChange',
                  'FNSQ_F': 'FinishedBasementChange',
                  'FFNSQ_F': 'FinishedBasementFinalChange',
                  'SLPR_F': 'PriceSalesChange',
                  'FSLPR_F': 'PriceSalesFinalChange',
                  'CONPR_F': 'ContractChange',
                  'FCONPR_F': 'PriceContractFinalChange',
                  'LOTV_F': 'LotValueChange',
                  'SQFS_F': 'FootageChange',
                  'FSQFS_F': 'FootageFinalChange',
                  'PVALU_F': 'PermitValueChange',
                  'AUTH': 'PermitAuthorizationDate',
                 }
all_data.rename(columns=cols_to_rename, inplace=True)
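`rename` does not complain when two source columns are mapped to the same new name, so a long mapping like this one can silently produce duplicate column labels. The sketch below shows one way to check a mapping before applying it; `duplicate_targets` and the sample mapping are hypothetical and only illustrate the idea.

```python
from collections import Counter


def duplicate_targets(mapping):
    """Return the new names that more than one source column maps to."""
    counts = Counter(mapping.values())
    return sorted(name for name, n in counts.items() if n > 1)


# Small made-up mapping with one repeated target name
sample = {"A_F": "SizeChange", "B_F": "SizeChange", "C_F": "ValueChange"}
print(duplicate_targets(sample))
```

Calling `duplicate_targets(cols_to_rename)` before the rename and expecting an empty list would catch this kind of collision early.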

### Clean up each column's descriptions

In [22]:
# Central air conditioning
# 1 = yes, 2 = no, 0 = unknown
all_data.loc[all_data['CentralAir']==1,'CentralAir']='Yes'
all_data.loc[all_data['CentralAir']==2,'CentralAir']='No'
all_data.loc[all_data['CentralAir']==0,'CentralAir']= np.nan

all_data.CentralAir.value_counts(dropna=False)

Yes    264752
No      26571
NaN      9838
Name: CentralAir, dtype: int64
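The same recoding pattern could be applied to several coded columns with a small helper built on `Series.map`. This is a sketch only: `decode` is a hypothetical helper, and the assumption that `Deck` uses the same 1/2/0 coding as `CentralAir` should be checked against the schema document.

```python
import pandas as pd


def decode(series, labels):
    """Map integer codes to labels; codes missing from `labels` become NaN."""
    return series.map(labels)


yes_no = {1: "Yes", 2: "No"}  # 0 (unknown) is left out on purpose

# Made-up rows; the Deck coding is assumed, not taken from the schema
demo = pd.DataFrame({"CentralAir": [1, 2, 0, 1], "Deck": [2, 2, 1, 0]})
for col in ["CentralAir", "Deck"]:
    demo[col] = decode(demo[col], yes_no)

print(demo["CentralAir"].value_counts(dropna=False))
```

Unlike the `.loc` assignments, `map` is not safe to run twice on the same column: a second pass finds no integer codes and turns every value into NaN.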

### Clean Up Data Types
Most of the data is typed as integers. This step maps the integer codes to proper names using the schema document hosted at https://www.census.gov/construction/chars/pdf/socmicro_info.pdf, which is also located in the docs folder of the project.

In [114]:
all_data.dtypes

ACS           int64
AGER          int64
ASSOC         int64
BASE          int64
CAT           int64
CLOS          int64
CON           int64
DECK          int64
DET           int64
DIV           int64
FINC          int64
FNBS          int64
FOYER       float64
FRAME         int64
GAR           int64
HEAT          int64
HEAT2         int64
LNDR          int64
METRO         int64
MFGS        float64
PATI          int64
PRCH          int64
SEWER         int64
STOR          int64
WAL1          int64
WAL2          int64
WALS          int64
WATER         int64
AREA          int64
BEDR          int64
             ...   
FFNSQ         int64
FPLS          int64
FULB          int64
HAFB          int64
LOTV          int64
PVALU         int64
SALE          int64
FSQFS       float64
STRT          int64
CONPR         int64
SLPR          int64
SQFS          int64
WEIGHT        int64
FUEL          int64
FUEL2         int64
FCONPR        int64
FSLPR         int64
AREA_F        int64
FNSQ_F        int64
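Date-like columns such as `AUTH` are stored as integers in `YYYYMM` form (for example 201501). A minimal sketch of converting them to datetimes follows; `yyyymm_to_datetime` is a hypothetical helper, and treating 0 as "unknown" is an assumption to confirm against the schema document.

```python
import pandas as pd


def yyyymm_to_datetime(series):
    """Parse integers such as 201501 as the first day of that month.

    Values that do not match YYYYMM (for example 0) become NaT.
    """
    return pd.to_datetime(series.astype(str), format="%Y%m", errors="coerce")


auth = pd.Series([201501, 201609, 0])  # made-up sample values
parsed = yyyymm_to_datetime(auth)
print(parsed)
```

The helper expects an integer column; a float column would stringify as `201501.0` and be coerced to NaT, so it would need casting first.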


### Data Manipulation

### Save output file into processed directory

Save the cleaned data to a file in the processed directory. It will be read in and used later for further analysis.

Other options besides pickle include:
- feather
- msgpack (deprecated in pandas 0.25 and removed in 1.0)
- parquet

In [None]:
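summary_file.parent.mkdir(parents=True, exist_ok=True)
all_data.to_pickle(summary_file)  # save the combined frame, not the last file read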
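A quick round trip can confirm that the saved file reads back unchanged. The sketch below uses a temporary directory and a small made-up frame rather than the project's `data/processed` folder, so the names and values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned data
cleaned = pd.DataFrame({"CentralAir": ["Yes", "No", "Yes"], "Bedrooms": [3, 4, 2]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "processed" / "summary_demo.pkl"
    # to_pickle does not create missing folders, so make the parent first
    out_file.parent.mkdir(parents=True, exist_ok=True)
    cleaned.to_pickle(out_file)
    restored = pd.read_pickle(out_file)

print(restored.equals(cleaned))
```

Pickle preserves dtypes exactly but is specific to Python; parquet or feather may be preferable when the file has to be shared with other tools.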