# [Retrieve demographic data](#retrieve-demographic-data)

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
from uszipcode import SearchEngine

<a id="toc"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [User Inputs](#user-inputs)
2. [Load Prices](#load-prices)
3. [Get demographic data](#get-demographic-data)
4. [Merge](#merge)
5. [Export merged data](#export-merged-data)

<a id="about"></a>

## 0. [About](#about)

In this notebook, we extract new demographic features from the zipcode of each listing. Specifically, the notebook retrieves demographic data within a 5-mile radius of the listing zipcode.

**Note**
1. The 5-mile radii around nearby zipcodes can overlap, so the same demographic data may be counted more than once. As a result, although these features are generated, they will not be used in subsequent analysis. Future work may explore alternatives such as retrieving demographic data for the geographic region, within each city, that contains the listing's zipcode.
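
The overlap described in the note can be checked with a small sketch: two 5-mile circles overlap whenever their centers are less than 10 miles apart. The helper names and the coordinates below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def radii_overlap(point_a, point_b, radius_miles=5.0):
    """Two circles of equal radius overlap if their centers are closer than 2 * radius."""
    return haversine_miles(*point_a, *point_b) < 2 * radius_miles


# Hypothetical centroids: two points a few miles apart, and one far away
near_a = (30.27, -97.74)
near_b = (30.30, -97.70)
far_c = (47.61, -122.33)

print(radii_overlap(near_a, near_b))
print(radii_overlap(near_a, far_c))
```

Any pair of zipcodes for which `radii_overlap` is true shares part of its search area, so zipcodes returned for one may also be returned for the other.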

<a id="user-inputs"></a>

## 1. [User Inputs](#user-inputs)

In [None]:
prices_filepath = Path.cwd() / "data" / "processed_data__AUS29_SEA22_2SEAzipcodes_20200605_144009.csv"
merged_filepath = Path.cwd() / "data" / "processed_data__AUS29_SEA22_2SEAzipcodes_20200605_144009__with_demographics.csv"

<a id="load-prices"></a>

## 2. [Load Prices](#load-prices)

We'll start by loading the car price data into a `DataFrame`.

In [None]:
df = pd.read_csv(prices_filepath)

<a id="get-demographic-data"></a>

## 3. [Get demographic data](#get-demographic-data)

In [None]:
search = SearchEngine(simple_zipcode=True)

First, we'll assemble a dictionary mapping each zipcode to its latitude and longitude (`LAT` and `LONG`).

In [None]:
zips_list = df['seller_zip'].str.extract(r'(\d{5})')[0].value_counts().index.tolist()
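
To make the extraction step concrete, here is a plain-Python sketch of the same idea: take the first run of five digits from each raw value, drop values with no match, and list the unique zipcodes from most to least frequent. The sample strings and the `first_zip` helper are hypothetical; the actual contents of `seller_zip` may look different.

```python
import re
from collections import Counter

# Hypothetical raw values; the real seller_zip column may be formatted differently
raw_zips = ["78701", "98101-1234", "Austin, TX 78701", "n/a"]

ZIP_RE = re.compile(r"(\d{5})")


def first_zip(text):
    """Return the first run of five digits in text, or None if there is none."""
    match = ZIP_RE.search(text)
    return match.group(1) if match else None


# Count the extracted zipcodes, skipping values without a match
counts = Counter(z for z in map(first_zip, raw_zips) if z is not None)

# Unique zipcodes ordered from most to least frequent
unique_zips = [z for z, _ in counts.most_common()]
print(unique_zips)
```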

In [None]:
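# Look up each zipcode once and reuse the result for both coordinates
zip_records = {z: search.by_zipcode(int(z)) for z in zips_list}
zips_wanted = {z: [rec.lat, rec.lng] for z, rec in zip_records.items()}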
df_zips = pd.DataFrame.from_dict(zips_wanted, orient="index", columns=["LAT", "LONG"]).reset_index()
df_zips = df_zips.rename(columns={"index": "zipcode"})
df_zips

Next, we'll use the `LAT` and `LONG` columns to get demographic data within a 5-mile radius.

In [None]:
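def get_median_household_value(row):
    # Zipcodes (at most 100) within a 5-mile radius of the row's coordinates
    result = search.by_coordinates(row["LAT"], row["LONG"], radius=5, returns=100)
    # Mean of the median home values; missing values are counted as 0, which lowers the mean
    home_value = np.mean([r.median_home_value if r.median_home_value is not None else 0 for r in result])
    return home_value

def get_median_household_income(row):
    # Zipcodes (at most 100) within a 5-mile radius of the row's coordinates
    result = search.by_coordinates(row["LAT"], row["LONG"], radius=5, returns=100)
    # Mean of the median household incomes; missing values are counted as 0, which lowers the mean
    home_income = np.mean([r.median_household_income if r.median_household_income is not None else 0 for r in result])
    return home_income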
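
The functions above replace a missing value with 0 before averaging, which pulls the mean toward zero when some nearby zipcodes have no reported figure. The sketch below contrasts that with an alternative that skips missing values instead. It is not the notebook's method: `ZipRecord` is a hypothetical stand-in for the search results, and both helper functions are illustrative names.

```python
from collections import namedtuple
from statistics import mean

# Hypothetical stand-in for a search result with one demographic attribute
ZipRecord = namedtuple("ZipRecord", ["median_home_value"])


def mean_missing_as_zero(records, attr):
    """Mirror the notebook's approach: treat None as 0 before averaging."""
    return mean([getattr(r, attr) if getattr(r, attr) is not None else 0 for r in records])


def mean_ignoring_missing(records, attr):
    """Alternative: average only the values that are present (None if there are none)."""
    values = [getattr(r, attr) for r in records if getattr(r, attr) is not None]
    return mean(values) if values else None


records = [ZipRecord(200000), ZipRecord(None), ZipRecord(400000)]

zero_filled = mean_missing_as_zero(records, "median_home_value")
skipped = mean_ignoring_missing(records, "median_home_value")
print(zero_filled, skipped)
```

Which behavior is appropriate depends on how often values are missing in the search results; the sketch only shows that the two choices can give noticeably different averages.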

In [None]:
df_zips["median_household_value"] = df_zips.apply(get_median_household_value, axis=1)
df_zips["median_household_income"] = df_zips.apply(get_median_household_income, axis=1)
df_zips

<a id="merge"></a>

## 4. [Merge](#merge)

Join car listings data with the demographic data on zipcodes

Having extracted the demographic data, we can now merge this `DataFrame` with the scraped car price listings.

In [None]:
df_f = df.merge(df_zips, left_on=["seller_zip"], right_on="zipcode", how="inner")
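
Because an inner merge silently drops listings whose key has no match, it can be useful to check how many rows would be lost. The sketch below uses small, hypothetical frames and assumes that some `seller_zip` values may contain more than the bare 5-digit code; the `zip5` column is an illustrative name that does not exist in the notebook.

```python
import pandas as pd

# Hypothetical sample data; the real columns may hold different values
listings = pd.DataFrame(
    {
        "seller_zip": ["78701", "98101", "TX 78702", "98199"],
        "price": [10000, 12000, 9000, 15000],
    }
)
zips = pd.DataFrame(
    {"zipcode": ["78701", "98101", "78702"], "LAT": [30.27, 47.61, 30.26]}
)

# Normalize the join key to the same 5-digit string used for the lookup table
listings["zip5"] = listings["seller_zip"].str.extract(r"(\d{5})")[0]

# A left merge with indicator=True keeps every listing and flags unmatched rows
checked = listings.merge(
    zips, left_on="zip5", right_on="zipcode", how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(checked), len(unmatched))
```

If `unmatched` is empty, an inner merge on the normalized key keeps every listing.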

This is the final `DataFrame` we'll use for feature engineering, EDA, and model assessment.

<a id="export-merged-data"></a>

## 5. [Export merged data](#export-merged-data)

In [None]:
df_f.to_csv(merged_filepath, index=False)