### The following processes will be conducted in this data wrangling section:

### Reading the raw data from a CSV file

### Display data samples to understand its structure

### Data Cleaning: Identify and handle missing values

### Feature Engineering: Create new features based on existing ones (e.g., calculate mean income per zip code)

### Data Filtering: Filter out properties that are not relevant to the analysis (e.g., outliers or properties outside Austin city limits)

### Feature Engineering: Create new features based on existing ones (e.g., calculate price per square foot)

### Data Aggregation: Calculate average price per zip code

### Data Merging: Combining housing data with demographic data

### Data Visualization: Visualize the data to gain insights and identify trends

### After data wrangling, the cleaned and transformed data can be used for further analysis and modeling.


In [1]:
# First, import the relevant modules and packages
import pandas as pd
import numpy as np
#from matplotlib import pyplot
import matplotlib.pyplot as plt

In [2]:
# Load the Austin house listings data (the income data, us_income_zipcode.csv, is loaded later)
df1 = pd.read_csv('../01_raw_data/AustinTXHouseListingsDataV1.csv', low_memory=False)

In [3]:
# Displaying the first few rows of the raw data to understand its structure
df1.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_city,address_neighborhood,address_state,address_streetAddress,address_subdivision,address_zipcode,bathrooms,bedrooms,...,latest_salemonth,latest_saleyear,latest_price_source,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,photo,imagepath,cleanpath
0,0,15,Austin,,TX,12801 Wooded Lake Ct,,78732,5.0,4.0,...,2,2018,Agent Provided,0,3,39,1,https://photos.zillowstatic.com/fp/911f9a59acb...,/content/gdrive/MyDrive/zillow-images/70352485...,70352485_911f9a59acb6fe5fd909538a685e2a0c-p_f.jpg
1,1,16,Austin,,TX,904 Lakewood Hills Ter,,78732,5.0,5.0,...,8,2020,Broker Provided,0,3,58,2,https://photos.zillowstatic.com/fp/c6346c4a39d...,/content/gdrive/MyDrive/zillow-images/70352465...,70352465_c6346c4a39d1f87578ee1529329d9296-p_f.jpg
2,2,17,Austin,,TX,13701 Montview Dr,,78732,3.5,,...,3,2018,Agent Provided,0,0,1,1,https://maps.googleapis.com/maps/api/streetvie...,/content/gdrive/MyDrive/zillow-images/83823478...,83823478_streetviewlocation13701MontviewDr2CAu...
3,3,18,Austin,,TX,700 Lakewood Hills Ter,,78732,4.0,4.0,...,6,2018,Agent Provided,0,1,40,4,https://photos.zillowstatic.com/fp/805b5a6b748...,/content/gdrive/MyDrive/zillow-images/70352478...,70352478_805b5a6b748b1d9cffeb93a4c41fedb0-p_f.jpg
4,4,19,Austin,Steiner Ranch-Lakewood Hills,TX,1008 Lakewood Hills Ter,,78732,4.0,4.0,...,8,2019,Broker Provided,0,4,40,7,https://photos.zillowstatic.com/fp/a09cfad45a9...,/content/gdrive/MyDrive/zillow-images/70352461...,70352461_a09cfad45a949f7195c9540224f51e49-p_f.jpg


In [4]:
df1.shape

(16482, 736)

With 736 columns, the data has far more features than we need.

In [5]:
#Analyze the summary of the data
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16482 entries, 0 to 16481
Columns: 736 entries, Unnamed: 0.1 to cleanpath
dtypes: bool(13), float64(227), int64(10), object(486)
memory usage: 91.1+ MB


In [6]:
df1.select_dtypes(include="object").count()

address_city             16482
address_neighborhood       431
address_state            16482
address_streetAddress    16482
address_subdivision        175
                         ...  
latest_saledate          16482
latest_price_source      16482
photo                    16482
imagepath                16482
cleanpath                16482
Length: 486, dtype: int64

In [7]:
list(df1.select_dtypes(include="object"))

['address_city',
 'address_neighborhood',
 'address_state',
 'address_streetAddress',
 'address_subdivision',
 'description',
 'homeStatus',
 'resoFactsStats_accessibilityFeatures_0',
 'resoFactsStats_accessibilityFeatures_1',
 'resoFactsStats_accessibilityFeatures_2',
 'resoFactsStats_accessibilityFeatures_3',
 'resoFactsStats_accessibilityFeatures_4',
 'resoFactsStats_accessibilityFeatures_5',
 'resoFactsStats_accessibilityFeatures_6',
 'resoFactsStats_accessibilityFeatures_7',
 'resoFactsStats_additionalParcelsDescription',
 'resoFactsStats_appliances_0',
 'resoFactsStats_appliances_1',
 'resoFactsStats_appliances_2',
 'resoFactsStats_appliances_3',
 'resoFactsStats_appliances_4',
 'resoFactsStats_appliances_5',
 'resoFactsStats_appliances_6',
 'resoFactsStats_appliances_7',
 'resoFactsStats_appliances_8',
 'resoFactsStats_appliances_9',
 'resoFactsStats_appliances_10',
 'resoFactsStats_appliances_11',
 'resoFactsStats_appliances_12',
 'resoFactsStats_architecturalStyle',
 'resoFact

In [8]:
list(df1['lotSize'])

['0.36 Acres',
 '0.66 Acres',
 '0.50 Acres',
 '0.49 Acres',
 '0.48 Acres',
 '7,797 sqft',
 '0.41 Acres',
 '0.47 Acres',
 '2.67 Acres',
 '0.28 Acres',
 '0.30 Acres',
 '0.30 Acres',
 '0.30 Acres',
 '0.59 Acres',
 '8,232 sqft',
 '8,276 sqft',
 '0.32 Acres',
 '7,405 sqft',
 '0.52 Acres',
 '8,842 sqft',
 '0.29 Acres',
 '0.31 Acres',
 '8,363 sqft',
 '0.28 Acres',
 '0.27 Acres',
 '7,971 sqft',
 '9,931 sqft',
 '10,497 sqft',
 '2.68 Acres',
 '7,492 sqft',
 '0.37 Acres',
 '0.36 Acres',
 '0.36 Acres',
 '7,753 sqft',
 '0.37 Acres',
 '0.44 Acres',
 '4.18 Acres',
 '1.34 Acres',
 '5,662 sqft',
 '0.29 Acres',
 '10,018 sqft',
 '2.35 Acres',
 '0.26 Acres',
 '8,929 sqft',
 '0.34 Acres',
 '1.07 Acres',
 '0.28 Acres',
 '0.59 Acres',
 '1.36 Acres',
 '3.96 Acres',
 '0.34 Acres',
 '9,888 sqft',
 '0.32 Acres',
 '1.86 Acres',
 '6,316 sqft',
 '0.87 Acres',
 '9,670 sqft',
 '0.76 Acres',
 '0.50 Acres',
 '1.31 Acres',
 '10,367 sqft',
 '0.48 Acres',
 '1.25 Acres',
 '0.40 Acres',
 '3.54 Acres',
 '6,011 sqft',
 '1.52 

In [9]:
df1[['lotSize_value', 'lotSize_unit']] = df1['lotSize'].str.split(' ', expand=True)
df1.head(3)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_city,address_neighborhood,address_state,address_streetAddress,address_subdivision,address_zipcode,bathrooms,bedrooms,...,latest_price_source,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,photo,imagepath,cleanpath,lotSize_value,lotSize_unit
0,0,15,Austin,,TX,12801 Wooded Lake Ct,,78732,5.0,4.0,...,Agent Provided,0,3,39,1,https://photos.zillowstatic.com/fp/911f9a59acb...,/content/gdrive/MyDrive/zillow-images/70352485...,70352485_911f9a59acb6fe5fd909538a685e2a0c-p_f.jpg,0.36,Acres
1,1,16,Austin,,TX,904 Lakewood Hills Ter,,78732,5.0,5.0,...,Broker Provided,0,3,58,2,https://photos.zillowstatic.com/fp/c6346c4a39d...,/content/gdrive/MyDrive/zillow-images/70352465...,70352465_c6346c4a39d1f87578ee1529329d9296-p_f.jpg,0.66,Acres
2,2,17,Austin,,TX,13701 Montview Dr,,78732,3.5,,...,Agent Provided,0,0,1,1,https://maps.googleapis.com/maps/api/streetvie...,/content/gdrive/MyDrive/zillow-images/83823478...,83823478_streetviewlocation13701MontviewDr2CAu...,0.5,Acres


In [10]:
df1.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'address_city', 'address_neighborhood',
       'address_state', 'address_streetAddress', 'address_subdivision',
       'address_zipcode', 'bathrooms', 'bedrooms',
       ...
       'latest_price_source', 'numOfAccessibilityFeatures', 'numOfAppliances',
       'numPhotos', 'numPriceChanges', 'photo', 'imagepath', 'cleanpath',
       'lotSize_value', 'lotSize_unit'],
      dtype='object', length=738)

In [12]:
df1['lotSize_unit'].unique()

array(['Acres', 'sqft', nan], dtype=object)

In [13]:
len(df1['lotSize_value'])

16482

In [14]:
df1['lotSize_value']

0          0.36
1          0.66
2          0.50
3          0.49
4          0.48
          ...  
16477      0.54
16478     9,321
16479    10,280
16480     8,450
16481     7,797
Name: lotSize_value, Length: 16482, dtype: object

In [15]:
df1.replace('nan', float('nan'), inplace=True)

In [16]:
df1['lotSize_unit'].unique()

array(['Acres', 'sqft', nan], dtype=object)

In [17]:
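# Note: Series.replace matches whole cell values, not substrings, so this call leaves values such as '9,321' unchanged
df1['lotSize_value'].replace(',', '', inplace=True)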

In [18]:
df1['lotSize_value']

0          0.36
1          0.66
2          0.50
3          0.49
4          0.48
          ...  
16477      0.54
16478     9,321
16479    10,280
16480     8,450
16481     7,797
Name: lotSize_value, Length: 16482, dtype: object

In [19]:
df1['lotSize_value'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 16482 entries, 0 to 16481
Series name: lotSize_value
Non-Null Count  Dtype 
--------------  ----- 
16154 non-null  object
dtypes: object(1)
memory usage: 128.9+ KB


In [20]:
df1['lotSize_value'] = df1['lotSize_value'].str.replace(',', '')

In [21]:
df1['lotSize_value']

0         0.36
1         0.66
2         0.50
3         0.49
4         0.48
         ...  
16477     0.54
16478     9321
16479    10280
16480     8450
16481     7797
Name: lotSize_value, Length: 16482, dtype: object

In [22]:
df1['lotSize_value'] = pd.to_numeric(df1['lotSize_value'])

In [23]:
df1[['lotSize_unit']]

Unnamed: 0,lotSize_unit
0,Acres
1,Acres
2,Acres
3,Acres
4,Acres
...,...
16477,Acres
16478,sqft
16479,sqft
16480,sqft


In [24]:
# Convert acres to square feet where lotSize_unit is 'Acres' (1 acre = 43,560 sqft); other rows keep their value
df1['lotSize_sqft'] = np.where(df1['lotSize_unit'] == 'Acres',
                               df1['lotSize_value'] * 43560,
                               df1['lotSize_value'])
df1.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_city,address_neighborhood,address_state,address_streetAddress,address_subdivision,address_zipcode,bathrooms,bedrooms,...,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,photo,imagepath,cleanpath,lotSize_value,lotSize_unit,lotSize_sqft
0,0,15,Austin,,TX,12801 Wooded Lake Ct,,78732,5.0,4.0,...,0,3,39,1,https://photos.zillowstatic.com/fp/911f9a59acb...,/content/gdrive/MyDrive/zillow-images/70352485...,70352485_911f9a59acb6fe5fd909538a685e2a0c-p_f.jpg,0.36,Acres,15681.6
1,1,16,Austin,,TX,904 Lakewood Hills Ter,,78732,5.0,5.0,...,0,3,58,2,https://photos.zillowstatic.com/fp/c6346c4a39d...,/content/gdrive/MyDrive/zillow-images/70352465...,70352465_c6346c4a39d1f87578ee1529329d9296-p_f.jpg,0.66,Acres,28749.6
2,2,17,Austin,,TX,13701 Montview Dr,,78732,3.5,,...,0,0,1,1,https://maps.googleapis.com/maps/api/streetvie...,/content/gdrive/MyDrive/zillow-images/83823478...,83823478_streetviewlocation13701MontviewDr2CAu...,0.5,Acres,21780.0
3,3,18,Austin,,TX,700 Lakewood Hills Ter,,78732,4.0,4.0,...,0,1,40,4,https://photos.zillowstatic.com/fp/805b5a6b748...,/content/gdrive/MyDrive/zillow-images/70352478...,70352478_805b5a6b748b1d9cffeb93a4c41fedb0-p_f.jpg,0.49,Acres,21344.4
4,4,19,Austin,Steiner Ranch-Lakewood Hills,TX,1008 Lakewood Hills Ter,,78732,4.0,4.0,...,0,4,40,7,https://photos.zillowstatic.com/fp/a09cfad45a9...,/content/gdrive/MyDrive/zillow-images/70352461...,70352461_a09cfad45a949f7195c9540224f51e49-p_f.jpg,0.48,Acres,20908.8

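The split, comma stripping, numeric conversion, and unit conversion above can also be collected into one helper. The following is a minimal sketch rather than the notebook's own code: the function name `lot_size_to_sqft`, the regular expression, and the sample values are assumptions made for illustration, and it assumes every non-missing entry looks like `<number> Acres` or `<number> sqft`.

```python
import pandas as pd

SQFT_PER_ACRE = 43560  # 1 acre = 43,560 square feet


def lot_size_to_sqft(lot_size: pd.Series) -> pd.Series:
    """Convert strings such as '0.36 Acres' or '7,797 sqft' to square feet.

    Hypothetical helper: entries that are missing or do not match the
    expected '<number> <unit>' pattern come back as NaN.
    """
    # Capture the numeric part and the unit in one pass.
    parts = lot_size.str.extract(r'^\s*(?P<value>[\d,.]+)\s+(?P<unit>\w+)\s*$')
    # Remove thousands separators, then convert to float.
    value = pd.to_numeric(
        parts['value'].str.replace(',', '', regex=False), errors='coerce'
    )
    # Map each unit to its multiplier; unknown units become NaN.
    factor = parts['unit'].str.lower().map({'acres': SQFT_PER_ACRE, 'sqft': 1})
    return value * factor


# Small illustrative sample (values copied from the lotSize output above)
sample = pd.Series(['0.36 Acres', '7,797 sqft', None])
lot_sqft = lot_size_to_sqft(sample)
print(lot_sqft)
```

Mapping units to multipliers makes an unexpected unit show up as NaN instead of silently passing through as if it were already in square feet.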

In [25]:
# Analyze how the dependent variable, 'latest_price', correlates with the numeric independent variables
dict(df1.corr(numeric_only=True)['latest_price'].abs().sort_values(ascending=False))


{'latest_price': 1.0,
 'taxAssessedValue': 0.8628559938898727,
 'taxAnnualAmount': 0.8156128587000064,
 'bathrooms': 0.551445307264037,
 'resoFactsStats_bathrooms': 0.47586093567052096,
 'livingArea': 0.3919678078297034,
 'resoFactsStats_bathroomsFull': 0.3854093485257164,
 'bedrooms': 0.3236994120035267,
 'resoFactsStats_bedrooms': 0.29769330047900794,
 'schools_2_rating': 0.2879634712029946,
 'schools_1_studentsPerTeacher': 0.2846390962391095,
 'schools_1_rating': 0.24097254565070483,
 'schools_0_rating': 0.23987455502465402,
 'lotSize_value': 0.23968531085209016,
 'schools_1_size': 0.23598605610932682,
 'resoFactsStats_bathroomsHalf': 0.22839099263704998,
 'stories': 0.21967042532231817,
 'numPhotos': 0.181905441001118,
 'longitude': 0.18157016525279043,
 'hasSpa': 0.17628657766491157,
 'coveredSpaces': 0.17196269771541464,
 'parking': 0.16639348500886164,
 'schools_2_studentsPerTeacher': 0.16305886359458596,
 'garageSpaces': 0.16297483664756376,
 'schools_0_distance': 0.15324071928

In [26]:
# Create a 'corr_r' row holding each numeric column's absolute correlation with 'latest_price',
# so that weakly correlated features (columns) can be filtered out
df1.loc['corr_r'] = df1.corr(numeric_only=True)['latest_price'].abs().sort_values(ascending=False)


In [27]:
# Check the new row
df1.tail()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_city,address_neighborhood,address_state,address_streetAddress,address_subdivision,address_zipcode,bathrooms,bedrooms,...,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,photo,imagepath,cleanpath,lotSize_value,lotSize_unit,lotSize_sqft
16478,16480.0,22070.0,Austin,,TX,6318 Clairmont Dr,,78749.0,2.0,4.0,...,0.0,3.0,33.0,1.0,https://photos.zillowstatic.com/fp/2f26fb52681...,/content/gdrive/MyDrive/zillow-images/29487158...,29487158_2f26fb52681290dd3e096cb85fc299ec-p_f.jpg,9321.0,sqft,9321.0
16479,16481.0,22072.0,Austin,,TX,6104 Abilene Trl,,78749.0,2.0,3.0,...,0.0,5.0,25.0,5.0,https://photos.zillowstatic.com/fp/71a9662d303...,/content/gdrive/MyDrive/zillow-images/29491564...,29491564_71a9662d3031a1834e29f578f5a92b71-p_f.jpg,10280.0,sqft,10280.0
16480,16482.0,22073.0,Austin,,TX,7702 Kincheon Ct,,78749.0,2.5,3.0,...,0.0,1.0,27.0,1.0,https://photos.zillowstatic.com/fp/f452bc0f4ba...,/content/gdrive/MyDrive/zillow-images/29486556...,29486556_f452bc0f4ba54fe11c89905773f38840-p_f.jpg,8450.0,sqft,8450.0
16481,16483.0,22074.0,Austin,,TX,5300 Indio Cv,,78745.0,,,...,0.0,0.0,11.0,1.0,https://photos.zillowstatic.com/fp/eb612b3f7c8...,/content/gdrive/MyDrive/zillow-images/58314089...,58314089_eb612b3f7c80976715d8a54e389b8e15-p_f.jpg,7797.0,sqft,7797.0
corr_r,0.12465,0.124756,,,,,,0.152605,0.551445,0.323699,...,0.015614,0.048005,0.181905,0.09137,,,,0.239685,,0.019782


In [28]:
df1.shape

(16483, 739)

In [29]:
# Exclude columns that have correlation values less than 0.005
df2 = df1.loc[:,df1.loc['corr_r'] >= 0.005]

In [30]:
df2.shape

(16483, 63)

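The number of columns is reduced from 739 (the original 736 plus the three derived lot-size columns) to 63 by keeping only columns whose absolute correlation with latest_price is at least 0.005. Non-numeric columns have no correlation value, so they are dropped as well.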
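Appending `corr_r` as an extra row mixes a statistic into the listing data and upcasts integer columns to float (the zip codes in the `tail()` output print as `78749.0`). A hedged alternative is to keep the correlations in a separate Series and use it only as a column selector. The function name `select_correlated_columns` and the toy data below are assumptions for illustration.

```python
import pandas as pd


def select_correlated_columns(df: pd.DataFrame, target: str, threshold: float) -> pd.DataFrame:
    """Keep numeric columns whose absolute correlation with `target` is >= threshold.

    Hypothetical helper: non-numeric columns and columns with an undefined
    (NaN) correlation are dropped, because they never satisfy the comparison.
    """
    corr = df.corr(numeric_only=True)[target].abs()
    keep = corr[corr >= threshold].index
    return df[keep]


# Toy data: 'a' tracks the price exactly, 'b' is uncorrelated, 'city' is text
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1, -1, 1, -1, 1],
    'city': ['Austin'] * 5,
    'price': [2, 4, 6, 8, 10],
})
toy_selected = select_correlated_columns(toy, target='price', threshold=0.5)
print(list(toy_selected.columns))
```

Because the original frame is never modified, no helper row has to be removed later and the integer dtypes are preserved.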

In [31]:
df2.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_zipcode,bathrooms,bedrooms,dateposted,latitude,livingArea,longitude,propertyTaxRate,...,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,lotSize_value,lotSize_sqft
0,0.0,15.0,78732.0,5.0,4.0,,30.354322,4060.0,-97.911278,1.98,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,39.0,1.0,0.36,15681.6
1,1.0,16.0,78732.0,5.0,5.0,,30.355553,4558.0,-97.912544,1.98,...,70352465.0,1025000.0,8.0,2020.0,0.0,3.0,58.0,2.0,0.66,28749.6
2,2.0,17.0,78732.0,3.5,,,30.382227,3402.0,-97.908333,1.98,...,83823478.0,59000.0,3.0,2018.0,0.0,0.0,1.0,1.0,0.5,21780.0
3,3.0,18.0,78732.0,4.0,4.0,,30.352081,4749.0,-97.912048,1.98,...,70352478.0,825000.0,6.0,2018.0,0.0,1.0,40.0,4.0,0.49,21344.4
4,4.0,19.0,78732.0,4.0,4.0,,30.356226,4867.0,-97.911697,1.98,...,70352461.0,849000.0,8.0,2019.0,0.0,4.0,40.0,7.0,0.48,20908.8


In [32]:
df2.isnull().sum().sort_values(ascending=False)

resoFactsStats_storiesTotal                16446
resoFactsStats_carportSpaces               16427
resoFactsStats_numberOfUnitsInCommunity    16376
resoFactsStats_fireplaces                  16302
resoFactsStats_onMarketDate                16259
                                           ...  
furnished                                      0
longitude                                      0
latitude                                       0
address_zipcode                                0
parking                                        0
Length: 63, dtype: int64

In [33]:
df2_missing_counts = df2.isnull().sum()

In [34]:
df2_columns_to_drop = df2_missing_counts[df2_missing_counts > 14000].index

In [35]:
df2_columns_to_drop

Index(['dateposted', 'resoFactsStats_carportSpaces',
       'resoFactsStats_entryLevel', 'resoFactsStats_fireplaces',
       'resoFactsStats_numberOfUnitsInCommunity',
       'resoFactsStats_onMarketDate', 'resoFactsStats_storiesTotal',
       'resoFactsStats_yearBuiltEffective'],
      dtype='object')

In [36]:
# Create a row called 'null_sum_col' holding each column's missing-value count, used to drop columns with more than 14,000 missing values (about 85% of rows)
df2.loc['null_sum_col'] = df2.isnull().sum()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.loc['null_sum_col'] = df2.isnull().sum()


In [37]:
# Keep only columns with fewer than 14,000 missing values (this result is overwritten by the drop() call in the next cell)
df3 = df2.loc[:, df2.loc['null_sum_col'] < 14000]

In [38]:
df3 = df2.drop(columns=df2_columns_to_drop)

In [39]:
df3.shape

(16484, 55)

In [40]:
df3.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,...,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,lotSize_value,lotSize_sqft
0,0.0,15.0,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,39.0,1.0,0.36,15681.6
1,1.0,16.0,78732.0,5.0,5.0,30.355553,4558.0,-97.912544,1.98,2007.0,...,70352465.0,1025000.0,8.0,2020.0,0.0,3.0,58.0,2.0,0.66,28749.6
2,2.0,17.0,78732.0,3.5,,30.382227,3402.0,-97.908333,1.98,2007.0,...,83823478.0,59000.0,3.0,2018.0,0.0,0.0,1.0,1.0,0.5,21780.0
3,3.0,18.0,78732.0,4.0,4.0,30.352081,4749.0,-97.912048,1.98,2011.0,...,70352478.0,825000.0,6.0,2018.0,0.0,1.0,40.0,4.0,0.49,21344.4
4,4.0,19.0,78732.0,4.0,4.0,30.356226,4867.0,-97.911697,1.98,2009.0,...,70352461.0,849000.0,8.0,2019.0,0.0,4.0,40.0,7.0,0.48,20908.8


The number of columns further decreased from 63 to 55 after dropping the eight columns with more than 14,000 missing values.
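The column filter uses an absolute count of 14,000 missing values, which only makes sense for this particular row count. A minimal sketch of the same idea expressed as a fraction of rows is shown below; the function name `drop_sparse_columns` and the toy frame are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_fraction: float) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `max_missing_fraction`."""
    # isnull().mean() gives the fraction of missing values per column.
    missing_fraction = df.isnull().mean()
    return df.loc[:, missing_fraction <= max_missing_fraction]


# Toy data: 'b' is 75% missing, 'a' and 'c' are mostly complete
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})
toy_dense = drop_sparse_columns(toy, max_missing_fraction=0.5)
print(list(toy_dense.columns))
```

For the listings data, a fraction of roughly 0.85 would correspond to the 14,000-row cutoff used above.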

In [41]:
# Show the rows that have more than 3 null values
df3[df3.isnull().sum(axis = 1) > 3]

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,...,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,lotSize_value,lotSize_sqft
2,2.0,17.0,78732.0,3.5,,30.382227,3402.0,-97.908333,1.98,2007.0,...,8.382348e+07,59000.0,3.0,2018.0,0.0,0.0,1.0,1.0,0.50,21780.0
11,11.0,26.0,78732.0,,,30.330782,1480.0,-97.925789,1.98,1975.0,...,2.934395e+07,315000.0,4.0,2019.0,0.0,1.0,18.0,9.0,0.30,13068.0
78,78.0,106.0,78732.0,,,30.382215,2512.0,-97.909950,1.98,2006.0,...,7.998023e+07,100000.0,2.0,2018.0,0.0,0.0,15.0,1.0,0.49,21344.4
93,93.0,122.0,78732.0,,,30.332327,1.0,-97.911598,1.98,2005.0,...,2.419302e+08,2195000.0,4.0,2019.0,0.0,0.0,5.0,1.0,2.00,87120.0
104,104.0,136.0,78734.0,3.0,4.0,30.342268,3023.0,-97.930733,1.98,1992.0,...,5.829815e+07,1500000.0,8.0,2020.0,0.0,0.0,31.0,2.0,0.86,37461.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16417,16419.0,21857.0,78737.0,3.0,5.0,30.198322,2747.0,-97.989227,2.01,2006.0,...,8.007668e+07,365000.0,1.0,2018.0,0.0,7.0,14.0,1.0,0.26,11325.6
16437,16439.0,21946.0,78717.0,3.0,3.0,30.476957,1645.0,-97.772057,2.21,2018.0,...,2.093629e+09,331700.0,1.0,2018.0,0.0,0.0,16.0,3.0,,
16439,16441.0,21951.0,78758.0,1.0,1.0,30.406792,657.0,-97.697861,1.98,1986.0,...,2.113423e+09,124999.0,9.0,2019.0,0.0,5.0,17.0,7.0,,
16466,16468.0,22039.0,78751.0,,,30.311029,3700.0,-97.719620,1.98,2020.0,...,2.940159e+07,1299900.0,12.0,2020.0,0.0,0.0,30.0,6.0,9408.00,9408.0


In [42]:
df3['null_sum_rows'] = df3.isnull().sum(axis = 1)

In [43]:
df3.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,...,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPhotos,numPriceChanges,lotSize_value,lotSize_sqft,null_sum_rows
0,0.0,15.0,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,...,715000.0,2.0,2018.0,0.0,3.0,39.0,1.0,0.36,15681.6,1
1,1.0,16.0,78732.0,5.0,5.0,30.355553,4558.0,-97.912544,1.98,2007.0,...,1025000.0,8.0,2020.0,0.0,3.0,58.0,2.0,0.66,28749.6,0
2,2.0,17.0,78732.0,3.5,,30.382227,3402.0,-97.908333,1.98,2007.0,...,59000.0,3.0,2018.0,0.0,0.0,1.0,1.0,0.5,21780.0,7
3,3.0,18.0,78732.0,4.0,4.0,30.352081,4749.0,-97.912048,1.98,2011.0,...,825000.0,6.0,2018.0,0.0,1.0,40.0,4.0,0.49,21344.4,1
4,4.0,19.0,78732.0,4.0,4.0,30.356226,4867.0,-97.911697,1.98,2009.0,...,849000.0,8.0,2019.0,0.0,4.0,40.0,7.0,0.48,20908.8,1


In [44]:
df3.shape

(16484, 56)

In [45]:
df3 = df3[df3['null_sum_rows'] <= 3]

In [46]:
df3.shape

(14735, 56)
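The row filter above adds a helper column, filters on it, and later drops it again. As a hedged alternative, `DataFrame.dropna` with its `thresh` argument (the minimum number of non-null values a row must have) gives the same kind of filter in one call. The function name `drop_rows_with_many_nulls` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def drop_rows_with_many_nulls(df: pd.DataFrame, max_nulls: int) -> pd.DataFrame:
    """Keep rows that have at most `max_nulls` missing values."""
    # dropna(thresh=k) keeps rows with at least k non-null values.
    min_non_null = df.shape[1] - max_nulls
    return df.dropna(thresh=min_non_null)


# Toy data: row 0 has no nulls, row 1 has 3 nulls, row 2 has 4 nulls
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, np.nan, np.nan],
    'c': [3.0, np.nan, np.nan],
    'd': [4.0, 4.0, np.nan],
    'e': [5.0, 5.0, 5.0],
})
toy_kept = drop_rows_with_many_nulls(toy, max_nulls=3)
print(toy_kept)
```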

In [47]:
# Drop unwanted columns
df3 = df3.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'numPhotos', 'null_sum_rows','lotSize_value'])

In [48]:
# Drop the helper rows 'corr_r' and 'null_sum_col', which are no longer needed
df3.drop(index=['corr_r', 'null_sum_col'], inplace = True)

In [49]:
df3.tail()

Unnamed: 0,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,resoFactsStats_bathroomsFull,...,schools_2_totalCount,yearBuilt,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft
16476,78749.0,2.0,3.0,30.224886,1496.0,-97.862335,1.98,1988.0,2.0,0.0,...,1.0,1988.0,29483690.0,292500.0,1.0,2018.0,0.0,0.0,1.0,5662.0
16477,78749.0,2.0,3.0,30.221336,2270.0,-97.879677,1.98,1999.0,2.0,0.0,...,1.0,1999.0,29484499.0,425000.0,1.0,2018.0,0.0,3.0,1.0,23522.4
16478,78749.0,2.0,4.0,30.216406,2645.0,-97.876045,1.98,1993.0,2.0,0.0,...,1.0,1993.0,29487158.0,424999.0,1.0,2018.0,0.0,3.0,1.0,9321.0
16479,78749.0,2.0,3.0,30.213816,1469.0,-97.873711,1.98,1992.0,2.0,2.0,...,1.0,1992.0,29491564.0,316000.0,1.0,2018.0,0.0,5.0,5.0,10280.0
16480,78749.0,2.5,3.0,30.218878,1700.0,-97.86496,1.98,1986.0,,0.0,...,1.0,1986.0,29486556.0,384900.0,1.0,2018.0,0.0,1.0,1.0,8450.0


In [50]:
df3['price_per_sqft'] = df3['latest_price'] / df3['livingArea']

In [51]:
df3.shape

(14733, 52)
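A plain division produces misleading or infinite values when `livingArea` is zero or implausibly small (one listing in the earlier output reports a `livingArea` of 1.0). The sketch below masks such areas before dividing. The function name and the `min_area` cutoff of 100 sqft are assumptions chosen for illustration, not values taken from the notebook.

```python
import pandas as pd


def price_per_sqft(price: pd.Series, living_area: pd.Series, min_area: float = 0.0) -> pd.Series:
    """Divide price by living area, returning NaN where the area is <= min_area."""
    # where() keeps values that satisfy the condition and sets the rest to NaN.
    area = living_area.where(living_area > min_area)
    return price / area


toy = pd.DataFrame({
    'latest_price': [715000.0, 2195000.0, 100000.0],
    'livingArea': [4060.0, 1.0, 0.0],
})
# min_area=100.0 is a hypothetical plausibility threshold
toy['price_per_sqft'] = price_per_sqft(toy['latest_price'], toy['livingArea'], min_area=100.0)
print(toy)
```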

In [52]:
# Count the missing values in each column
df3.isnull().sum()

address_zipcode                                 0
bathrooms                                     118
bedrooms                                       45
latitude                                        0
livingArea                                      0
longitude                                       0
propertyTaxRate                                 0
resoFactsStats_atAGlanceFacts_1_factValue       0
resoFactsStats_bathrooms                      158
resoFactsStats_bathroomsFull                    0
resoFactsStats_bathroomsHalf                    0
resoFactsStats_bathroomsThreeQuarter          189
resoFactsStats_bedrooms                         4
coveredSpaces                                8315
furnished                                       0
garageSpaces                                    0
hasAttachedGarage                               0
hasCarport                                      0
hasCooling                                      0
hasGarage                                       0


In [53]:
# Show the rows that contain at least one missing value
df3[df3.isnull().sum(axis = 1) > 0]

Unnamed: 0,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,resoFactsStats_bathroomsFull,...,yearBuilt,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft
0,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,4.0,...,2007.0,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374
3,78732.0,4.0,4.0,30.352081,4749.0,-97.912048,1.98,2011.0,4.0,3.0,...,2011.0,70352478.0,825000.0,6.0,2018.0,0.0,1.0,4.0,21344.4,173.720783
4,78732.0,4.0,4.0,30.356226,4867.0,-97.911697,1.98,2009.0,4.0,4.0,...,2009.0,70352461.0,849000.0,8.0,2019.0,0.0,4.0,7.0,20908.8,174.440107
6,78732.0,,3.0,30.329847,2326.0,-97.926910,1.98,2018.0,0.0,0.0,...,2018.0,124904977.0,139990.0,4.0,2018.0,0.0,0.0,1.0,17859.6,60.184867
7,78732.0,5.0,4.0,30.352270,3385.0,-97.912140,1.98,2007.0,5.0,4.0,...,2007.0,70352477.0,800000.0,10.0,2020.0,0.0,8.0,2.0,20473.2,236.336780
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16476,78749.0,2.0,3.0,30.224886,1496.0,-97.862335,1.98,1988.0,2.0,0.0,...,1988.0,29483690.0,292500.0,1.0,2018.0,0.0,0.0,1.0,5662.0,195.521390
16477,78749.0,2.0,3.0,30.221336,2270.0,-97.879677,1.98,1999.0,2.0,0.0,...,1999.0,29484499.0,425000.0,1.0,2018.0,0.0,3.0,1.0,23522.4,187.224670
16478,78749.0,2.0,4.0,30.216406,2645.0,-97.876045,1.98,1993.0,2.0,0.0,...,1993.0,29487158.0,424999.0,1.0,2018.0,0.0,3.0,1.0,9321.0,160.680151
16479,78749.0,2.0,3.0,30.213816,1469.0,-97.873711,1.98,1992.0,2.0,2.0,...,1992.0,29491564.0,316000.0,1.0,2018.0,0.0,5.0,5.0,10280.0,215.112321


In [54]:
df3.mean().round(1)

address_zipcode                                 78735.9
bathrooms                                           2.7
bedrooms                                            3.4
latitude                                           30.3
livingArea                                       2153.2
longitude                                         -97.8
propertyTaxRate                                     2.0
resoFactsStats_atAGlanceFacts_1_factValue        1987.5
resoFactsStats_bathrooms                            2.6
resoFactsStats_bathroomsFull                        2.0
resoFactsStats_bathroomsHalf                        0.4
resoFactsStats_bathroomsThreeQuarter                0.0
resoFactsStats_bedrooms                             3.4
coveredSpaces                                       1.7
furnished                                           0.0
garageSpaces                                        1.2
hasAttachedGarage                                   0.0
hasCarport                                      

In [55]:
df3.shape

(14733, 52)

In [56]:
(df3.address_zipcode.min(), df3.address_zipcode.max())

(78617.0, 78759.0)

In [57]:
df3.address_zipcode.dtype

dtype('float64')

In [58]:
df3['zipcode'] = df3.address_zipcode.astype('int')

In [59]:
df3 = df3.set_index(['zipcode'])

In [60]:
df3.head()

Unnamed: 0_level_0,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,resoFactsStats_bathroomsFull,...,yearBuilt,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft
zipcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,4.0,...,2007.0,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374
78732,78732.0,5.0,5.0,30.355553,4558.0,-97.912544,1.98,2007.0,5.0,5.0,...,2007.0,70352465.0,1025000.0,8.0,2020.0,0.0,3.0,2.0,28749.6,224.879333
78732,78732.0,4.0,4.0,30.352081,4749.0,-97.912048,1.98,2011.0,4.0,3.0,...,2011.0,70352478.0,825000.0,6.0,2018.0,0.0,1.0,4.0,21344.4,173.720783
78732,78732.0,4.0,4.0,30.356226,4867.0,-97.911697,1.98,2009.0,4.0,4.0,...,2009.0,70352461.0,849000.0,8.0,2019.0,0.0,4.0,7.0,20908.8,174.440107
78732,78732.0,4.0,5.0,30.341896,3485.0,-97.907944,1.98,2009.0,4.0,3.0,...,2009.0,89028960.0,625000.0,7.0,2019.0,0.0,4.0,4.0,7797.0,179.340029
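The outline lists an average price per zip code step. With `zipcode` as the index, one way it could look is a `groupby` on the index level, sketched here on a toy frame. The output column names (`avg_price`, `avg_price_per_sqft`, `n_listings`) and the toy values are assumptions.

```python
import pandas as pd

# Toy listings indexed by zip code, mirroring the layout of df3 above
toy = pd.DataFrame(
    {
        'latest_price': [715000.0, 1025000.0, 292500.0],
        'price_per_sqft': [176.1, 224.9, 195.5],
    },
    index=pd.Index([78732, 78732, 78749], name='zipcode'),
)

# Named aggregation: one output column per (input column, function) pair
zip_summary = toy.groupby(level='zipcode').agg(
    avg_price=('latest_price', 'mean'),
    avg_price_per_sqft=('price_per_sqft', 'mean'),
    n_listings=('latest_price', 'count'),
)
print(zip_summary)
```

Keeping a listing count next to each average makes it easier to spot zip codes whose mean rests on only a handful of sales.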


# Load the US income data for Austin and merge it with the cleaned Austin house listings data

In [61]:
# load the us_income data
df1_inc = pd.read_csv('../01_raw_data/us_income_zipcode.csv', low_memory=False)

In [62]:
df1_inc.shape

(364998, 111)

In [63]:
df2_inc = df1_inc[df1_inc['ZIP'].between(78617,78759)]

In [64]:
df2_inc.shape

(1025, 111)
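`between(78617, 78759)` keeps every ZIP in that numeric range, which may include ZIP codes that have no listings, and 1,025 rows for at most 143 distinct ZIP values means some ZIPs appear more than once. The sketch below assumes the repeats correspond to different values of the `Year` column, filters to an explicit set of ZIP codes, and keeps only the most recent year for each. The function name `income_for_zips` and the toy data are hypothetical.

```python
import pandas as pd


def income_for_zips(income: pd.DataFrame, zips) -> pd.DataFrame:
    """Return one row per ZIP (latest Year) for the requested ZIP codes.

    Assumes `income` has 'ZIP' and 'Year' columns, as in us_income_zipcode.csv.
    """
    subset = income[income['ZIP'].isin(list(zips))]
    # After sorting by Year, the last row for each ZIP is the most recent one.
    latest = subset.sort_values('Year').drop_duplicates('ZIP', keep='last')
    return latest.set_index('ZIP').sort_index()


toy_income = pd.DataFrame({
    'ZIP': [78617, 78617, 78618, 78732],
    'Year': [2020.0, 2021.0, 2021.0, 2021.0],
    'Households Mean Income (Dollars)': [90000.0, 99354.0, 105468.0, 150000.0],
})
# In the notebook, the set of ZIPs could come from the listings index, e.g. df3.index.unique()
toy_latest = income_for_zips(toy_income, zips={78617, 78732})
print(toy_latest)
```

A unique ZIP index also matters for the later merge, since repeated ZIP rows would duplicate every matching listing.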

In [65]:
list(df2_inc.columns)

['ZIP',
 'Geography',
 'Geographic Area Name',
 'Households',
 'Households Margin of Error',
 'Households Less Than $10,000',
 'Households Less Than $10,000 Margin of Error',
 'Households $10,000 to $14,999',
 'Households $10,000 to $14,999 Margin of Error',
 'Households $15,000 to $24,999',
 'Households $15,000 to $24,999 Margin of Error',
 'Households $25,000 to $34,999',
 'Households $25,000 to $34,999 Margin of Error',
 'Households $35,000 to $49,999',
 'Households $35,000 to $49,999 Margin of Error',
 'Households $50,000 to $74,999',
 'Households $50,000 to $74,999 Margin of Error',
 'Households $75,000 to $99,999',
 'Households $75,000 to $99,999 Margin of Error',
 'Households $100,000 to $149,999',
 'Households $100,000 to $149,999 Margin of Error',
 'Households $150,000 to $199,999',
 'Households $150,000 to $199,999 Margin of Error',
 'Households $200,000 or More',
 'Households $200,000 or More Margin of Error',
 'Households Median Income (Dollars)',
 'Households Median Income

In [66]:
df2_inc = df2_inc.set_index(['ZIP'])

In [67]:
df2_inc.head()

Unnamed: 0_level_0,Geography,Geographic Area Name,Households,Households Margin of Error,"Households Less Than $10,000","Households Less Than $10,000 Margin of Error","Households $10,000 to $14,999","Households $10,000 to $14,999 Margin of Error","Households $15,000 to $24,999","Households $15,000 to $24,999 Margin of Error",...,"Nonfamily Households $150,000 to $199,999","Nonfamily Households $150,000 to $199,999 Margin of Error","Nonfamily Households $200,000 or More","Nonfamily Households $200,000 or More Margin of Error",Nonfamily Households Median Income (Dollars),Nonfamily Households Median Income (Dollars) Margin of Error,Nonfamily Households Mean Income (Dollars),Nonfamily Households Mean Income (Dollars) Margin of Error,Nonfamily Households Nonfamily Income in the Past 12 Months,Year
ZIP,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
78617,860Z200US78617,ZCTA5 78617,7662.0,682.0,3.0,2.3,1.7,2.0,3.7,1.8,...,2.1,3.4,2.6,4.0,44941.0,7175.0,56019.0,12104.0,46.5,2021.0
78618,860Z200US78618,ZCTA5 78618,129.0,50.0,6.2,6.8,0.8,2.4,5.4,5.5,...,14.8,22.7,14.8,15.0,44688.0,32515.0,154152.0,98763.0,37.0,2021.0
78619,860Z200US78619,ZCTA5 78619,2159.0,372.0,0.0,2.2,1.0,1.2,0.0,2.2,...,0.0,25.9,0.0,25.9,86354.0,28771.0,75649.0,16152.0,16.8,2021.0
78620,860Z200US78620,ZCTA5 78620,6538.0,432.0,1.5,1.0,3.4,2.1,2.9,1.4,...,4.8,3.9,5.4,4.2,47183.0,12833.0,70972.0,13816.0,27.4,2021.0
78621,860Z200US78621,ZCTA5 78621,8103.0,552.0,5.0,2.3,2.2,1.4,2.4,1.4,...,12.2,15.1,2.0,3.0,72837.0,21942.0,80298.0,21264.0,39.2,2021.0


In [68]:
# Inspect every field of the first row as a dictionary
dict(df2_inc.iloc[0])

{'Geography': '860Z200US78617',
 'Geographic Area Name': 'ZCTA5 78617',
 'Households': 7662.0,
 'Households Margin of Error': 682.0,
 'Households Less Than $10,000': 3.0,
 'Households Less Than $10,000 Margin of Error': 2.3,
 'Households $10,000 to $14,999': 1.7,
 'Households $10,000 to $14,999 Margin of Error': 2.0,
 'Households $15,000 to $24,999': 3.7,
 'Households $15,000 to $24,999 Margin of Error': 1.8,
 'Households $25,000 to $34,999': 6.6,
 'Households $25,000 to $34,999 Margin of Error': 2.4,
 'Households $35,000 to $49,999': 19.2,
 'Households $35,000 to $49,999 Margin of Error': 6.4,
 'Households $50,000 to $74,999': 20.1,
 'Households $50,000 to $74,999 Margin of Error': 4.9,
 'Households $75,000 to $99,999': 10.3,
 'Households $75,000 to $99,999 Margin of Error': 3.9,
 'Households $100,000 to $149,999': 23.0,
 'Households $100,000 to $149,999 Margin of Error': 5.4,
 'Households $150,000 to $199,999': 6.0,
 'Households $150,000 to $199,999 Margin of Error': 2.8,
 'Household

In [69]:
# Keep only the mean-income columns, excluding their margin-of-error companions
selected_cols = [col for col in df2_inc.columns if 'Mean Income' in col and 'Margin of Error' not in col]

In [70]:
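# Take an explicit copy so that adding a column later does not trigger SettingWithCopyWarning
df2_inc = df2_inc[selected_cols].copy()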

In [71]:
df2_inc.head()

ZIP,Households Mean Income (Dollars),Families Mean Income (Dollars),Married-Couple Families Mean Income (Dollars),Nonfamily Households Mean Income (Dollars)
78617,99354.0,106461.0,,56019.0
78618,105468.0,92582.0,,154152.0
78619,165887.0,169968.0,,75649.0
78620,151574.0,165526.0,,70972.0
78621,94097.0,95930.0,,80298.0


In [72]:
df2_inc.shape

(1025, 4)
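
The column selection in cell 69 can also be written as a boolean mask over the column index. The sketch below is only an illustration on a one-row toy frame (`toy` is a hypothetical name; the column names and values are taken from the 78617 row shown earlier) and checks that both approaches pick the same column.

```python
import pandas as pd

# Toy frame with a handful of the real column names (values from the 78617 row)
toy = pd.DataFrame(
    [[44941.0, 7175.0, 56019.0, 12104.0, 2021.0]],
    columns=[
        "Nonfamily Households Median Income (Dollars)",
        "Nonfamily Households Median Income (Dollars) Margin of Error",
        "Nonfamily Households Mean Income (Dollars)",
        "Nonfamily Households Mean Income (Dollars) Margin of Error",
        "Year",
    ],
)

# List comprehension, as in the notebook
selected = [c for c in toy.columns if "Mean Income" in c and "Margin of Error" not in c]

# Equivalent boolean mask built with the vectorized string methods
mask = toy.columns.str.contains("Mean Income") & ~toy.columns.str.contains("Margin of Error")
via_mask = toy.loc[:, mask]

print(selected)
print(list(via_mask.columns))
```

Note that either form also drops the Year column, which matters later if the income table holds more than one row per ZIP code.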

In [73]:
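# Row-wise mean of the mean-income columns as a standalone Series (NaN values are skipped by default)
df2_inc_average = df2_inc.mean(axis=1).round(1)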

In [74]:
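# Store the same row-wise mean as a new column; run this cell only once,
# otherwise zip_inc_ave itself is included in the average
df2_inc['zip_inc_ave'] = df2_inc.mean(axis=1).round(2)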

In [75]:
df2_inc.columns

Index(['Households Mean Income (Dollars)', 'Families Mean Income (Dollars)',
       'Married-Couple Families Mean Income (Dollars)',
       'Nonfamily Households Mean Income (Dollars)', 'zip_inc_ave'],
      dtype='object')
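
Because the Married-Couple Families column is NaN in the rows shown, the row-wise mean is effectively an average of three values, not four. The following minimal sketch reproduces the first two rows on a toy frame (`toy` and `non_null` are illustrative names, not part of the notebook) to make the default NaN-skipping behavior of `DataFrame.mean` visible.

```python
import numpy as np
import pandas as pd

# First two rows of the mean-income table shown above
toy = pd.DataFrame(
    {
        "Households Mean Income (Dollars)": [99354.0, 105468.0],
        "Families Mean Income (Dollars)": [106461.0, 92582.0],
        "Married-Couple Families Mean Income (Dollars)": [np.nan, np.nan],
        "Nonfamily Households Mean Income (Dollars)": [56019.0, 154152.0],
    },
    index=pd.Index([78617, 78618], name="ZIP"),
)

# mean(axis=1) skips NaN by default, so each row is averaged over its non-null values
row_mean = toy.mean(axis=1).round(2)

# Number of values that actually contributed to each row's mean
non_null = toy.notna().sum(axis=1)

print(row_mean)
print(non_null)
```

Whether averaging household, family, and nonfamily means together is the right income measure is a modeling choice; using the Households column alone would be a simpler alternative.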

In [76]:
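# Note: 'ZIP' is the index name rather than a column, and no column is named '',
# so this rename leaves df2_inc unchanged (the output below still shows ZIP and zip_inc_ave)
df2_inc.rename(columns={'ZIP': 'zipcode', '': 'inc_average'}, inplace=True)
df2_inc.head()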

ZIP,Households Mean Income (Dollars),Families Mean Income (Dollars),Married-Couple Families Mean Income (Dollars),Nonfamily Households Mean Income (Dollars),zip_inc_ave
78617,99354.0,106461.0,,56019.0,87278.0
78618,105468.0,92582.0,,154152.0,117400.67
78619,165887.0,169968.0,,75649.0,137168.0
78620,151574.0,165526.0,,70972.0,129357.33
78621,94097.0,95930.0,,80298.0,90108.33
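
The rename in cell 76 has no visible effect because `rename(columns=...)` only looks at column labels, and ZIP is the name of the index. A small sketch of the difference, on a toy frame with hypothetical variable names, assuming the goal was to relabel the index as zipcode:

```python
import pandas as pd

toy = pd.DataFrame(
    {"zip_inc_ave": [87278.0, 117400.67]},
    index=pd.Index([78617, 78618], name="ZIP"),
)

# Keys that match no column are silently ignored, so nothing changes
unchanged = toy.rename(columns={"ZIP": "zipcode"})

# rename_axis changes the name of the index itself
renamed = toy.rename_axis("zipcode")

print(unchanged.index.name)  # ZIP
print(renamed.index.name)    # zipcode
```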


In [77]:
df3.head()

zipcode,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,resoFactsStats_bathroomsFull,...,yearBuilt,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft
78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,4.0,...,2007.0,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374
78732,78732.0,5.0,5.0,30.355553,4558.0,-97.912544,1.98,2007.0,5.0,5.0,...,2007.0,70352465.0,1025000.0,8.0,2020.0,0.0,3.0,2.0,28749.6,224.879333
78732,78732.0,4.0,4.0,30.352081,4749.0,-97.912048,1.98,2011.0,4.0,3.0,...,2011.0,70352478.0,825000.0,6.0,2018.0,0.0,1.0,4.0,21344.4,173.720783
78732,78732.0,4.0,4.0,30.356226,4867.0,-97.911697,1.98,2009.0,4.0,4.0,...,2009.0,70352461.0,849000.0,8.0,2019.0,0.0,4.0,7.0,20908.8,174.440107
78732,78732.0,4.0,5.0,30.341896,3485.0,-97.907944,1.98,2009.0,4.0,3.0,...,2009.0,89028960.0,625000.0,7.0,2019.0,0.0,4.0,4.0,7797.0,179.340029


In [78]:
df3.shape

(14733, 52)

In [79]:
df2_inc_average.head()

ZIP
78617     87278.0
78618    117400.7
78619    137168.0
78620    129357.3
78621     90108.3
dtype: float64

In [80]:
df2_inc_average.tail()

ZIP
78754    60590.3
78756    84134.3
78757    67990.0
78758    52497.7
78759    91688.0
dtype: float64

In [81]:
df2_inc_average.shape

(1025,)

In [82]:
type(df2_inc_average)

pandas.core.series.Series

In [83]:
# Convert the unnamed Series to a one-column DataFrame; the column gets the integer label 0
df3_inc_average = pd.DataFrame(df2_inc_average)
type(df3_inc_average)

pandas.core.frame.DataFrame

In [84]:
df3_inc_average.head(2)

ZIP,0
78617,87278.0
78618,117400.7


In [85]:
# Move the zipcode index back into an ordinary column (all other arguments were defaults)
df4 = df3.reset_index()
df4.head()

Unnamed: 0,zipcode,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,...,yearBuilt,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft
0,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,2007.0,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374
1,78732,78732.0,5.0,5.0,30.355553,4558.0,-97.912544,1.98,2007.0,5.0,...,2007.0,70352465.0,1025000.0,8.0,2020.0,0.0,3.0,2.0,28749.6,224.879333
2,78732,78732.0,4.0,4.0,30.352081,4749.0,-97.912048,1.98,2011.0,4.0,...,2011.0,70352478.0,825000.0,6.0,2018.0,0.0,1.0,4.0,21344.4,173.720783
3,78732,78732.0,4.0,4.0,30.356226,4867.0,-97.911697,1.98,2009.0,4.0,...,2009.0,70352461.0,849000.0,8.0,2019.0,0.0,4.0,7.0,20908.8,174.440107
4,78732,78732.0,4.0,5.0,30.341896,3485.0,-97.907944,1.98,2009.0,4.0,...,2009.0,89028960.0,625000.0,7.0,2019.0,0.0,4.0,4.0,7797.0,179.340029


In [86]:
# Move the ZIP index back into an ordinary column (all other arguments were defaults)
df4_inc_average = df3_inc_average.reset_index()
df4_inc_average.head()

Unnamed: 0,ZIP,0
0,78617,87278.0
1,78618,117400.7
2,78619,137168.0
3,78620,129357.3
4,78621,90108.3


In [87]:
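# Note: the income column is labeled with the integer 0, not the string '0',
# so only ZIP is renamed here; the income column is renamed by position in a later cell
df4_inc_average.rename(columns={'ZIP': 'zipcode', '0': 'average_income'}, inplace=True)
df4_inc_average.head()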

Unnamed: 0,zipcode,0
0,78617,87278.0
1,78618,117400.7
2,78619,137168.0
3,78620,129357.3
4,78621,90108.3
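
The income column keeps the label 0 above because a DataFrame built from an unnamed Series gets the integer 0 as its column label, and the rename mapping uses the string '0'. The sketch below, on toy data with illustrative variable names, shows the mismatch and one way to avoid the placeholder label altogether by naming the Series before converting it.

```python
import pandas as pd

s = pd.Series([87278.0, 117400.7], index=pd.Index([78617, 78618], name="ZIP"))

# Same steps as the notebook: Series -> DataFrame -> reset_index
frame = pd.DataFrame(s).reset_index()  # columns are 'ZIP' and the integer 0

# A string key does not match the integer label, so only ZIP is renamed
by_string = frame.rename(columns={"ZIP": "zipcode", "0": "average_income"})

# An integer key matches
by_int = frame.rename(columns={"ZIP": "zipcode", 0: "average_income"})

# Alternative: name the Series and its index first, then reset the index
named = s.rename("average_income").rename_axis("zipcode").reset_index()

print(list(by_string.columns))
print(list(by_int.columns))
print(list(named.columns))
```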


In [88]:
df4_inc_average.shape

(1025, 2)

In [89]:
df4.shape

(14733, 53)

In [90]:
merged_df = pd.merge(df4, df4_inc_average, on='zipcode', how='left')
merged_df.shape

(162063, 54)
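
The merged frame has 162063 rows, exactly 11 times the 14733 listings in df4, although a left join onto one income row per ZIP code should preserve the row count. This indicates that the income table contains several rows per zipcode, possibly one per survey year (an assumption; the Year column was dropped when the mean-income columns were selected). The sketch below uses toy data and hypothetical names to show how `validate='many_to_one'` catches this, and one possible fix that collapses the income table to one row per zipcode first. Averaging across the duplicate rows is only one option; keeping the most recent year would be another.

```python
import pandas as pd

# Toy listings: two properties in 78732, one in 78617 (prices are illustrative)
listings = pd.DataFrame(
    {"zipcode": [78732, 78732, 78617], "latest_price": [715000.0, 1025000.0, 250000.0]}
)

# Toy income table with repeated zipcodes, mimicking several rows per ZIP
income = pd.DataFrame(
    {
        "zipcode": [78732, 78732, 78732, 78617],
        "average_income": [194007.3, 195901.3, 186103.0, 87278.0],
    }
)

# Unchecked left join: each 78732 listing is repeated once per income row
exploded = listings.merge(income, on="zipcode", how="left")

# validate raises MergeError when the right-hand keys are not unique
try:
    listings.merge(income, on="zipcode", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# One possible fix: reduce income to a single row per zipcode before merging
income_unique = income.groupby("zipcode", as_index=False)["average_income"].mean()
merged = listings.merge(income_unique, on="zipcode", how="left", validate="many_to_one")

print(len(listings), len(exploded), len(merged))
```

With the deduplicated income table, the merged frame keeps one row per listing, so downstream counts and averages are not inflated.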

In [91]:
merged_df.head()

Unnamed: 0,zipcode,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,...,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft,0
0,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,194007.3
1,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,195901.3
2,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,186103.0
3,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,172963.7
4,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,162950.7


In [92]:
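# Note: merged_df has no 'ZIP' column and the income column is the integer 0, not '0',
# so this rename has no effect (the output below is unchanged)
merged_df.rename(columns={'ZIP': 'zipcode', '0': 'average_income'}, inplace=True)
merged_df.head()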

Unnamed: 0,zipcode,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,...,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft,0
0,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,194007.3
1,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,195901.3
2,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,186103.0
3,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,172963.7
4,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,162950.7


In [93]:
# Rename the last column (the integer label 0) by position instead of by name
merged_df.rename(columns={merged_df.columns[-1]: 'income_ave'}, inplace=True)
merged_df.head()

Unnamed: 0,zipcode,address_zipcode,bathrooms,bedrooms,latitude,livingArea,longitude,propertyTaxRate,resoFactsStats_atAGlanceFacts_1_factValue,resoFactsStats_bathrooms,...,zpid,latest_price,latest_salemonth,latest_saleyear,numOfAccessibilityFeatures,numOfAppliances,numPriceChanges,lotSize_sqft,price_per_sqft,income_ave
0,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,194007.3
1,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,195901.3
2,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,186103.0
3,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,172963.7
4,78732,78732.0,5.0,4.0,30.354322,4060.0,-97.911278,1.98,2007.0,5.0,...,70352485.0,715000.0,2.0,2018.0,0.0,3.0,1.0,15681.6,176.108374,162950.7


In [94]:
merged_df.columns

Index(['zipcode', 'address_zipcode', 'bathrooms', 'bedrooms', 'latitude',
       'livingArea', 'longitude', 'propertyTaxRate',
       'resoFactsStats_atAGlanceFacts_1_factValue', 'resoFactsStats_bathrooms',
       'resoFactsStats_bathroomsFull', 'resoFactsStats_bathroomsHalf',
       'resoFactsStats_bathroomsThreeQuarter', 'resoFactsStats_bedrooms',
       'coveredSpaces', 'furnished', 'garageSpaces', 'hasAttachedGarage',
       'hasCarport', 'hasCooling', 'hasGarage', 'hasHeating', 'hasSpa',
       'hasView', 'parking', 'stories', 'taxAnnualAmount', 'taxAssessedValue',
       'resoFactsStats_yearBuilt', 'schools_0_distance', 'schools_0_rating',
       'schools_0_size', 'schools_0_studentsPerTeacher',
       'schools_0_totalCount', 'schools_1_distance', 'schools_1_rating',
       'schools_1_size', 'schools_1_studentsPerTeacher',
       'schools_1_totalCount', 'schools_2_distance', 'schools_2_rating',
       'schools_2_studentsPerTeacher', 'schools_2_totalCount', 'yearBuilt',
       'zp

In [103]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162063 entries, 0 to 162062
Data columns (total 54 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   zipcode                                    162063 non-null  int64  
 1   address_zipcode                            162063 non-null  float64
 2   bathrooms                                  160765 non-null  float64
 3   bedrooms                                   161568 non-null  float64
 4   latitude                                   162063 non-null  float64
 5   livingArea                                 162063 non-null  float64
 6   longitude                                  162063 non-null  float64
 7   propertyTaxRate                            162063 non-null  float64
 8   resoFactsStats_atAGlanceFacts_1_factValue  162063 non-null  float64
 9   resoFactsStats_bathrooms                   160325 non-null  float64
 10  resoFact
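
The non-null counts above show that some columns have gaps (for example, bathrooms has 160765 non-null values out of 162063 rows). A compact way to summarize this is a per-column missing-value report. The following is a minimal sketch on toy data; the frame and the `report` name are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "bathrooms": [5.0, np.nan, 4.0, 4.0],
        "bedrooms": [4.0, 5.0, np.nan, np.nan],
        "livingArea": [4060.0, 4558.0, 4749.0, 4867.0],
    }
)

# Count and percentage of missing values per column
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

# Keep only columns that actually have gaps, worst first
report = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
report = report[report["missing"] > 0].sort_values("missing", ascending=False)

print(report)
```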

In [108]:
# Save the DataFrame to a CSV file
merged_df.to_csv('../03_processed_data/austin_housePrice_and_income_data.csv', index=False)

print("DataFrame saved to 'austin_housePrice_and_income_data.csv'.")

#datapath = '../03_processed_data'
#save_file(merged_df, 'austin_housePrice_and_income_data.csv', datapath)

DataFrame saved to 'austin_housePrice_and_income_data.csv'.
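
After saving, it can be worth reading the file back to confirm that the shape and column names survived the round trip. The sketch below does this with a toy frame and a temporary directory so that it does not touch the project folders; the frame contents are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"zipcode": [78732, 78617], "income_ave": [194007.3, 87278.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "austin_housePrice_and_income_data.csv"
    # index=False keeps the row index out of the file, as in the cell above
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print(same_shape, same_columns)
```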
