# 1.- Project overview and goals 
We at Red Cedar Property Advisors are committed to advising our customers on the best timing and expected price for their real estate properties. Our recommendations cover not only selling but also acquiring properties that can easily be flipped for a high profit. 

Our client, Timothy Stevens, gave us the following brief (*"Owns expensive houses in the center, needs to get rid, best timing within a year, open for renovation when profits rise"*):
- He wants to sell *"some/all"* of his properties. He owns expensive houses in the center, but it is still open which center is meant: \
&nbsp;&nbsp;1. The county? \
&nbsp;&nbsp;2. Seattle? \
&nbsp;&nbsp;3. Auburn? \
&nbsp;&nbsp;4. Bellevue? 

- He wants to sell within a year.
- He is open to renovation when profits rise.



# 2.- Data description
We at Red Cedar Property Advisors use official King County data from 2014 and "early" 2015 to make our recommendations on price and timing for selling and flipping real estate properties at the highest profitability.
The King County dataset has 21597 records with the following 21 variables. It can be found at: (https://www.kaggle.com/datasets/swathiachath/kc-housesales-data)

1. id: County database id key (int)
2. date: The date when the house was sold, according to the original data description. (str, *needs cleaning*)
3. price: The sale price. (num, plot it in K)
4. bedrooms: The number of bedrooms per house. (int, some data may be missing, *needs cleaning*)
5. bathrooms: The number of bathrooms per house. (num, some data may be missing, *needs cleaning*)
6. sqft_living: Living space area (num)
7. sqft_lot: Property area (num)
8. floors: Number of levels in the property (num)
9. waterfront: Specifies whether the property has a waterfront. (boolean, 0 or 1, *needs cleaning as some data may be missing*)
10. view: Initially unclear; it first appeared to state whether the house had been viewed. (num, *needs further analysis and maybe cleaning*) In the conversation at 16:00 it was determined that it is a categorical variable with values between 0 and 4. It represents the quality of the view from the house, where 0 is the worst and 4 is the best.
11. condition: Specifies the house condition. (num, between 1 and 5)
    - 1: Poor - Worn out. End of life.
    - 2: Fair - Badly worn. Needs repairs and refurbishment.
    - 3: Average. Needs minor repairs.
    - 4: Good. Above average house condition.
    - 5: Very Good.
12. grade: Rates the construction quality of the house relative to the building code. (num, between 1 and 13)
    - 1-3: Falls short of minimum building standard. Normally a cabin or inferior structure.
    - 4: Older or poor construction. Does not meet building code.
    - 5: Low construction cost and workmanship. Small and simple design.
    - 6: Lowest construction grade that meets building code. Low quality of construction materials.
    - 7: Average grade of construction and design. 
    - 8: Just above average construction and design. 
    - 9: Better architectural design with extra interior and exterior design and quality.
    - 10: Homes with high quality and features. 
    - 11: Custom designs with amenities made of solid wood, bathroom fixtures and other luxurious options.
    - 12: Custom designs and excellent builders.
    - 13: Custom designs, mansion level.
13. sqft_above: Living area excluding the basement (num, *check that no data is missing*)
14. sqft_basement: Basement area. (num, *needs cleaning*)
15. yr_built: Year of construction (num)
16. yr_renovated: Year when the house was renovated (num)
17. zipcode: House location zipcode (category, *This may be a key player*)
18. lat: House latitude location. (num)
19. long: House longitude location. (num)
20. sqft_living15: Mean living area (sqft) of the nearest 15 neighbors. (num)
21. sqft_lot15: Mean lot area (sqft) of the nearest 15 neighbors. (num)
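
Because the analysis depends on these exact column names, a quick check against the loaded header can catch misspellings early. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `check_columns` are hypothetical names, the column list is copied from the DataFrame output in this notebook, and the toy frame contains made-up values.

```python
import pandas as pd

# Column names as they appear in the loaded DataFrame of this notebook.
EXPECTED_COLUMNS = [
    "id", "date", "price", "bedrooms", "bathrooms", "sqft_living",
    "sqft_lot", "floors", "waterfront", "view", "condition", "grade",
    "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode",
    "lat", "long", "sqft_living15", "sqft_lot15",
]


def check_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> dict:
    """Report columns that are missing from, or unexpected in, the DataFrame."""
    present = set(df.columns)
    return {
        "missing": [c for c in expected if c not in present],
        "unexpected": [c for c in df.columns if c not in expected],
    }


# Toy frame (made-up values) with one misspelled column name.
toy = pd.DataFrame({"id": [1], "price": [100000.0], "yr_build": [1990]})
report = check_columns(toy, expected=["id", "price", "yr_built"])
print(report)
```

On the real data the call would presumably be `check_columns(df_rcpa)`, run right after loading the CSV.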



In [None]:
# Load base libraries
import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure Matplotlib plot defaults (Seaborn plots inherit them)
from matplotlib.ticker import PercentFormatter
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [113]:
# load data from ./data/King_County_House_prices_dataset.csv
df_rcpa = pd.read_csv("data/King_County_House_prices_dataset.csv", sep = ",")
df_rcpa

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.000,3,1.000,1180,5650,1.000,,0.000,...,7,1180,0.0,1955,0.000,98178,47.511,-122.257,1340,5650
1,6414100192,12/9/2014,538000.000,3,2.250,2570,7242,2.000,0.000,0.000,...,7,2170,400.0,1951,1991.000,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.000,2,1.000,770,10000,1.000,0.000,0.000,...,6,770,0.0,1933,,98028,47.738,-122.233,2720,8062
3,2487200875,12/9/2014,604000.000,4,3.000,1960,5000,1.000,0.000,0.000,...,7,1050,910.0,1965,0.000,98136,47.521,-122.393,1360,5000
4,1954400510,2/18/2015,510000.000,3,2.000,1680,8080,1.000,0.000,0.000,...,8,1680,0.0,1987,0.000,98074,47.617,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21592,263000018,5/21/2014,360000.000,3,2.500,1530,1131,3.000,0.000,0.000,...,8,1530,0.0,2009,0.000,98103,47.699,-122.346,1530,1509
21593,6600060120,2/23/2015,400000.000,4,2.500,2310,5813,2.000,0.000,0.000,...,8,2310,0.0,2014,0.000,98146,47.511,-122.362,1830,7200
21594,1523300141,6/23/2014,402101.000,2,0.750,1020,1350,2.000,0.000,0.000,...,7,1020,0.0,2009,0.000,98144,47.594,-122.299,1020,2007
21595,291310100,1/16/2015,400000.000,3,2.500,1600,2388,2.000,,0.000,...,8,1600,0.0,2004,0.000,98027,47.535,-122.069,1410,1287


In [None]:
df_rcpa.info()

# 3.- Data descriptive statistics


In [None]:
df_rcpa.describe()


In [None]:
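# A notebook cell only displays its last expression, so print the first two results explicitly
print(df_rcpa.bedrooms.unique())
print(df_rcpa.bathrooms.unique())
# Inspect the rows with a fractional bathroom count of 3.25
df_rcpa.query("bathrooms == 3.25")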



# 4.- Hypothesis 

**We are interested in houses with condition >= 4 and grade >= 9**
- Do these houses show seasonality in their sale price?
- Which is the best month to sell/buy a house with these characteristics?
- Is there a price difference between zip codes? We presume that Seattle will be higher.
- Is it not worth renovating houses with condition 4 and grade 9? (*correlation? how?*)
- Is the average price gap between houses with condition == 3 and grade == 8 and houses with condition == 4 and grade == 9 larger than the gap between the latter and houses with condition == 5 and grade == 10?
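
One possible way to approach the seasonality and best-month questions is to compare the median sale price per calendar month for the houses of interest. This is only a sketch under assumptions: `monthly_median_price` is a hypothetical helper, the median is one choice of summary among several, and the toy rows are made up for illustration.

```python
import pandas as pd


def monthly_median_price(df: pd.DataFrame, min_condition: int = 4, min_grade: int = 9) -> pd.Series:
    """Median sale price per calendar month for houses meeting both thresholds."""
    # Cast to int so the comparison also works if the columns are stored as categories.
    mask = (df["condition"].astype(int) >= min_condition) & (df["grade"].astype(int) >= min_grade)
    subset = df.loc[mask]
    # Extract the calendar month (1-12) from the sale date.
    months = pd.to_datetime(subset["date"]).dt.month
    return subset.groupby(months)["price"].median()


# Toy data (made-up values); the last row is filtered out by the thresholds.
toy = pd.DataFrame({
    "date": ["5/2/2014", "5/20/2014", "1/15/2015", "1/30/2015", "5/5/2014"],
    "price": [900000.0, 1100000.0, 700000.0, 800000.0, 300000.0],
    "condition": [4, 5, 4, 4, 3],
    "grade": [9, 10, 9, 11, 7],
})
by_month = monthly_median_price(toy)
best_month = by_month.idxmax()  # month with the highest median price
print(by_month)
print("Best month to sell:", best_month)
```

Since the data covers roughly one year, each month is observed about once, so any monthly pattern found this way is an indication of seasonality rather than proof.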

# 5.- Data cleaning


In [120]:
# Identify the variables that contain NaN values and count how many NaN each one has.
df_rcpa.isna().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
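
The counts above are all zero, which presumably reflects that this cell (execution count 120) was last run after the fill steps below. To see which columns need cleaning before any fill is applied, a small report of counts and percentages can help. A minimal sketch, assuming a hypothetical helper `missing_report` and made-up toy values:

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of NaN per column, listing only columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data (made-up values): two columns with gaps, one complete column.
toy = pd.DataFrame({
    "waterfront": [np.nan, 0.0, 0.0, np.nan],
    "view": [0.0, 0.0, np.nan, 0.0],
    "price": [1.0, 2.0, 3.0, 4.0],
})
report = missing_report(toy)
print(report)
```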

In [115]:
# Clean NaN in waterfront. Set them to 0, assuming that a missing value means the house has no waterfront
df_rcpa['waterfront'] = df_rcpa['waterfront'].fillna(0)  # substitute the NaN values with 0
df_rcpa[df_rcpa['waterfront'].isna()]['waterfront']  # select the rows where waterfront is still NaN; the result should be an empty Series

Series([], Name: waterfront, dtype: float64)

In [116]:
# Apply the same approach to the variable view to convert the NaN into 0. It
# is assumed that a missing value means the view was not a highlight of the house.
df_rcpa['view'].unique()
df_rcpa['view'] = df_rcpa['view'].fillna(0)
df_rcpa[df_rcpa['view'].isna()]['view']

Series([], Name: view, dtype: float64)

In [117]:
# yr_renovated also contains NaN values. First list its unique values to decide how to replace the NaN.
df_rcpa['yr_renovated'].unique()

array([   0., 1991.,   nan, 2002., 2010., 1992., 2013., 1994., 1978.,
       2005., 2003., 1984., 1954., 2014., 2011., 1983., 1945., 1990.,
       1988., 1977., 1981., 1995., 2000., 1999., 1998., 1970., 1989.,
       2004., 1986., 2007., 1987., 2006., 1985., 2001., 1980., 1971.,
       1979., 1997., 1950., 1969., 1948., 2009., 2015., 1974., 2008.,
       1968., 2012., 1963., 1951., 1962., 1953., 1993., 1996., 1955.,
       1982., 1956., 1940., 1976., 1946., 1975., 1964., 1973., 1957.,
       1959., 1960., 1967., 1965., 1934., 1972., 1944., 1958.])

In [119]:
# The values represent years, but also include 0 and NaN. Therefore the NaN will become 0
df_rcpa['yr_renovated'] = df_rcpa['yr_renovated'].fillna(0)
df_rcpa[df_rcpa['yr_renovated'].isna()]['yr_renovated']

Series([], Name: yr_renovated, dtype: float64)
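
The three fill steps above follow the same pattern, so they could be expressed as one helper that also records how many values were replaced. This is a sketch, not the notebook's actual code: `fill_missing_with_zero` is a hypothetical name and the toy values are made up. Note that after the fill, a 0 in `yr_renovated` no longer distinguishes "never renovated" from "unknown", so keeping the counts is a way to document that assumption.

```python
import numpy as np
import pandas as pd


def fill_missing_with_zero(df: pd.DataFrame, columns: list) -> pd.Series:
    """Fill NaN with 0 in the given columns (in place) and return the number filled per column."""
    filled = df[columns].isna().sum()
    df[columns] = df[columns].fillna(0)
    return filled


# Toy data (made-up values) with one NaN per column.
toy = pd.DataFrame({
    "waterfront": [np.nan, 0.0, 1.0],
    "view": [0.0, np.nan, 2.0],
    "yr_renovated": [0.0, 1991.0, np.nan],
})
filled = fill_missing_with_zero(toy, ["waterfront", "view", "yr_renovated"])
print(filled)
```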

In [131]:
df_rcpa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             21597 non-null  int64         
 1   date           21597 non-null  datetime64[ns]
 2   price          21597 non-null  float64       
 3   bedrooms       21597 non-null  int64         
 4   bathrooms      21597 non-null  float64       
 5   sqft_living    21597 non-null  int64         
 6   sqft_lot       21597 non-null  int64         
 7   floors         21597 non-null  float64       
 8   waterfront     21597 non-null  boolean       
 9   view           21597 non-null  category      
 10  condition      21597 non-null  category      
 11  grade          21597 non-null  category      
 12  sqft_above     21597 non-null  int64         
 13  sqft_basement  21597 non-null  float64       
 14  yr_built       21597 non-null  int64         
 15  yr_renovated   2159

In [126]:
# Review each variable type and configure properly
#df_rcpa.info()
# date needs to change from str (object) to a datetime variable
df_rcpa['date'] = pd.to_datetime(df_rcpa['date'])
#df_rcpa['price'] = df_rcpa['price'].astype('int64')
df_rcpa['waterfront'] = df_rcpa['waterfront'].astype('boolean')
df_rcpa['view'] = df_rcpa['view'].astype('category')
df_rcpa['condition'] = df_rcpa['condition'].astype('category')
df_rcpa['grade'] = df_rcpa['grade'].astype('category')
df_rcpa['zipcode'] = df_rcpa['zipcode'].astype('category')
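
One caveat with the plain `'category'` dtype: it is unordered, and pandas raises a `TypeError` for comparisons such as `>=` on unordered categoricals, which the hypothesis filter (condition >= 4 and grade >= 9) needs. A possible alternative is an ordered categorical dtype. The sketch below assumes the value ranges given in the data description (condition 1 to 5, grade 1 to 13) and uses made-up toy rows.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtypes, assuming the ranges from the data description.
condition_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
grade_dtype = CategoricalDtype(categories=list(range(1, 14)), ordered=True)

# Toy data (made-up values).
toy = pd.DataFrame({"condition": [3, 4, 5, 2], "grade": [8, 9, 10, 6]})
toy["condition"] = toy["condition"].astype(condition_dtype)
toy["grade"] = toy["grade"].astype(grade_dtype)

# Order comparisons now work because both columns are ordered categoricals.
selected = toy[(toy["condition"] >= 4) & (toy["grade"] >= 9)]
print(selected)
```

The scalar on the right-hand side of the comparison has to be one of the declared categories, which holds for 4 and 9 here.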


In [128]:
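# sqft_basement is read as text because unknown areas are marked with '?'; treat those as 0, then convert to float
df_rcpa['sqft_basement'] = df_rcpa['sqft_basement'].replace({'?': "0"})
df_rcpa['sqft_basement'] = df_rcpa['sqft_basement'].astype('float64')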
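
Replacing `'?'` with 0 assumes that an unknown basement means no basement. An alternative, sketched below under assumptions, is to convert with `pd.to_numeric(..., errors="coerce")` and fill the unknowns with `sqft_living - sqft_above`, since `sqft_above` is described as the living area without the basement. The first three toy rows repeat values shown in the DataFrame output above; the last row is made up to illustrate the `'?'` case. Whether the difference holds for every record would need to be checked on the full data first.

```python
import pandas as pd

# Toy data: rows 0-2 repeat values from the displayed DataFrame, row 3 is made up.
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 1960, 1500],
    "sqft_above": [1180, 2170, 1050, 1200],
    "sqft_basement": ["0.0", "400.0", "910.0", "?"],
})

# Convert text to numbers; anything that is not a number (such as '?') becomes NaN.
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce")
n_unknown = int(basement.isna().sum())

# Assumption: basement area equals total living area minus the area above ground.
toy["sqft_basement"] = basement.fillna(toy["sqft_living"] - toy["sqft_above"])
print(n_unknown, toy["sqft_basement"].tolist())
```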



# 6.- Analysis


# 7.- Findings