<font size="+3"><strong>Exploring the Data</strong></font>

In this project we're going to work with data from [Funda](https://funda.nl). This company provides house prices in the Netherlands.

Keep in mind that **Funda** prices are just a starting point: properties usually sell for 110% of the listed price, or even more than 120%, because sales work like an auction.
During this auction you are not supposed to know the bids of others, but in practice agents sometimes share current bids with their clients (this is not legal, but it happens).
In addition, a lower bid can win when the bidder is not going to take out a loan (the cash is ready to be transferred).

Additional expenses are usually around 7%, but this is just background information about the property market in the Netherlands.
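
As a rough illustration of the figures above, the sketch below estimates a total purchase cost from an asking price. The function name `estimate_total_cost` and its parameters are hypothetical, and it assumes the roughly 7% of additional expenses applies to the final sale price, which the text does not specify.

```python
def estimate_total_cost(asking_price, overbid_ratio=1.10, extra_cost_ratio=0.07):
    """Estimate the total cost of a purchase (illustrative assumptions only)."""
    # Assumption: the property sells for asking_price * overbid_ratio.
    sale_price = asking_price * overbid_ratio
    # Assumption: additional expenses are a fraction of the sale price.
    return sale_price * (1 + extra_cost_ratio)


total = estimate_total_cost(400_000)
print(f'Estimated total: {total:,.0f}')
```

Passing `overbid_ratio=1.20` models the more aggressive overbidding case mentioned above.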

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Prepare data

First, we need to load the data. At this point you should already have a dump of data from [Funda](https://funda.nl) (if not, please follow the README file of this project).

In [None]:
df = pd.read_json('../scrapy/x.json_backup')
df.info()
df.head()

All fields are dumped as strings. It is necessary to convert fields like **price**, **year_of_construction** and **living_area**.
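
A minimal sketch of such a conversion, run on a small hand-made frame rather than the real dump (the `sample` frame and the `year_numeric` column are hypothetical). `pd.to_numeric` with `errors='coerce'` turns every value that is not a plain number into `NaN`, so unexpected values show up as missing data instead of raising an error.

```python
import pandas as pd

# Hypothetical sample: every value is a string (or missing), as in the dump.
sample = pd.DataFrame(
    {'year_of_construction': ['1995', 'Before 1906', 'After 2020', '2424', None]}
)

# Non-numeric strings and missing values become NaN instead of raising.
sample['year_numeric'] = pd.to_numeric(sample['year_of_construction'], errors='coerce')
print(sample)
```

Coercion keeps the row count unchanged, which makes it easy to count afterwards how many values could not be parsed.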

## Field "price"

This field has three different states:
* start price is known (example: "")
* .
* .

## Field "year_of_construction"
This field ...

In [None]:
total_records = df['year_of_construction'].count()
print(f'Total records {total_records}')

known_year_mask = df['year_of_construction'].str.isdigit() == True
known_year_count = df[known_year_mask]['year_of_construction'].count()
print(f'Records with known year: {known_year_count}')

after_2020_year_mask = df['year_of_construction'] == 'After 2020'
after_2020_year_count = df[after_2020_year_mask]['year_of_construction'].count()
print(f'Records with known approximate date (after 2020): {after_2020_year_count}')

before_1906_year_mask = df['year_of_construction'] == 'Before 1906'
before_1906_year_count = df[before_1906_year_mask]['year_of_construction'].count()
print(f'Records with known approximate date (before 1906): {before_1906_year_count}')

# every record must fall into exactly one of the three known groups
missed_count = df[~known_year_mask & ~after_2020_year_mask & ~before_1906_year_mask]['year_of_construction'].count()
assert missed_count == 0, 'Detected new type of records'

# plot
labels = ['Known year', 'Before 1906', 'After 2020']
values = [known_year_count, before_1906_year_count, after_2020_year_count]
explode = (0, 0.1, 0.3)

fig1, ax1 = plt.subplots()
ax1.pie(values, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Interesting values: **After 2020** and **Before 1906**. It sounds like there are three ranges:

- before 1906
- in the range from 1906 to 2020
- after 2020

Let's check whether there are any exact years before **1906** or after **2020**.

In [None]:
df_years = df[known_year_mask]

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=[20, 15])

# before 1906
known_years_before_1906_mask = df_years['year_of_construction'].astype(int) < 1906
known_years_before_1906 = df_years[known_years_before_1906_mask]['year_of_construction'].astype(int)
known_years_before_1906.plot.hist(bins=50, ax=axes[0]);

# after 2020; the construction year is not bounded (a listing can claim a house
# that will be built 400 years from now, in 2424), so filter out such cases
known_years_after_2020_mask = (df_years['year_of_construction'].astype(int) > 2020) & (df_years['year_of_construction'].astype(int) < 2040)
known_years_after_2020 = df_years[known_years_after_2020_mask]['year_of_construction'].astype(int)
known_years_after_2020.plot.hist(bins=5, ax=axes[1]);

As we can see, exact years overlap with the **Before 1906** and **After 2020** ranges. In addition, years are not bounded: a property dated **2424** was found.
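
One possible way to handle this overlap is to map every raw value to a category plus an optional exact year. The sketch below is an assumption about how the cleanup could look, not the project's actual method: the `classify_year` helper, its category names, and the `max_plausible_year` cutoff (set to 2040, the same bound used in the filter above) are all hypothetical.

```python
def classify_year(raw, max_plausible_year=2040):
    """Map a raw year_of_construction value to a (category, year) pair."""
    if raw == 'Before 1906':
        return 'before_1906', None
    if raw == 'After 2020':
        return 'after_2020', None
    if isinstance(raw, str) and raw.isdigit():
        year = int(raw)
        if year > max_plausible_year:
            # Values such as 2424 are treated here as implausible.
            return 'implausible', year
        if year < 1906:
            # Exact year that overlaps with the "Before 1906" label.
            return 'before_1906', year
        if year > 2020:
            # Exact year that overlaps with the "After 2020" label.
            return 'after_2020', year
        return 'known', year
    # Anything else (missing or unexpected text) is left unclassified.
    return 'unknown', None


for raw in ['1890', '1975', '2023', '2424', 'Before 1906', 'After 2020']:
    print(raw, '->', classify_year(raw))
```

Keeping the exact year next to the category preserves the extra precision of records such as 1890, while the approximate labels still end up in the same bucket.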

## Field "living_area"
This field ...