<h1 align = "center">House Price Prediction</h1>

---

**Objective:** [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) is a free-to-use dataset provided on Kaggle. I had previously used this dataset, but the results were poor. Let's see if my skills have improved! This notebook provides a data analysis that assumes no prior knowledge of the data. In this notebook, let's discuss the following:
 * **Understand the Dataset/Problem:** The given dataset has `79` explanatory features; let's analyze each feature one by one and understand its practical impact.

In [11]:
# show current code version
# versioning follows https://semver.org/
# the `VERSION` file keeps track of the progress of
# individual projects/competitions
# the actual tag is represented as: <PROJECT_CODE>:<version>
open("VERSION", 'rt').read() # bump codecov

'development #semver-2.0.0'

## Code Imports

The **PEP 8 Style Guide** lists the following *guidelines* for imports:
 1. Imports should be on separate lines,
 2. Import order should be:
    * standard library/modules,
    * related third party imports,
    * local application/user defined imports
 3. Wildcard imports (`*`) should be avoided; where one is unavoidable, tag it with **`# noqa: F403`** for `flake8`.
 4. Prefer absolute imports; explicit relative imports are an acceptable alternative, while implicit relative imports should never be used.

For more details, see the [PEP 8 imports section](https://peps.python.org/pep-0008/#imports). Note that the actual `flake8` configuration file is currently missing from the template and will be added later if required. In addition, the `logging` module is imported and configured.

[**`logging`**](https://docs.python.org/3/howto/logging.html) is a standard Python module for tracking events that happen while code runs. It is powerful and helpful for debugging, among other purposes. The next section defines a `logging` configuration in the **`/logs/`** directory. Each project logs to its own `<PROJECT_CODE>/<VERSION>/<DATE>.log` file. The directory is created automatically if it does not exist. Use logging operations like:

```python
logging.debug("This is a Debug Message.")
logging.info("This is an Information Message.")
logging.warning("This is a Warning Message.")
logging.error("This is an ERROR Message.")
logging.critical("This is a CRITICAL Message.")
```
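The per-project log layout described above can be sketched as follows. This is a minimal illustration, not the notebook's own configuration: the helper name `make_logger`, the temporary root directory, and the sample project code, version, and date values are all assumptions, and it attaches a `FileHandler` to a named logger instead of calling `logging.basicConfig`.

```python
import logging
import tempfile
from os import makedirs
from os.path import isfile, join


def make_logger(logs_root, project_code, version, date):
    """Create <logs_root>/<project_code>/<version>/<date>.log (hypothetical helper)."""
    log_dir = join(logs_root, project_code, version)
    makedirs(log_dir, exist_ok=True)  # create the directory if it is missing
    log_path = join(log_dir, f"{date}.log")

    logger = logging.getLogger(f"{project_code}:{version}")
    logger.setLevel(logging.DEBUG)

    handler = logging.FileHandler(log_path, mode="a")  # append to an existing file
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger, log_path


# a temporary directory stands in for the real logs root (assumption)
logs_root = tempfile.mkdtemp()
logger, log_path = make_logger(logs_root, "demo-project", "v0.1.0", "2022-04-06")
logger.info("This is an information message.")

print(isfile(log_path))  # True
```

A named logger with its own handler also sidesteps one caveat of `logging.basicConfig`: it does nothing when the root logger already has handlers, unless `force=True` is passed.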

In [1]:
import logging # configure logging on `global arguments` section

In [2]:
from time import ctime # will be used in logging, file/output directory create etc.
from os import makedirs # create directories dynamically, if not already done so manually
from os.path import join # keep directories `os`-independent
from copy import deepcopy # `pd.DataFrame` is mutable, so any `df` operation may need `deepcopy`
from tqdm import tqdm as TQ # provide progress bar for code completions
from uuid import uuid1 as UUID # keep output file name unique
from datetime import datetime as dt # formatting datetime objects

In [3]:
# import numpy as np
import pandas as pd

## Define Global Arguments

In [4]:
# a single project can have multiple sub-projects and/or outputs
# generally, each sub-project has its own `notebook` and code files
# use the `PROJECT_CODE` tag to create a directory of the format
# <execution date>/<PROJECT_CODE>, thus giving a unique identity to
# each run of the code. Once defined, keep this code the same throughout.
# this code can also be used to keep track of progress at the
# sub-project level.
PROJECT_CODE = "House Price Prediction (data analysis)"

In [5]:
ROOT = "." # current directory
DATA = join(ROOT, "data")

In [6]:
# define output directory
# this is defined on current date
# `today` is formatted so that it is a valid file/directory name on both Windows and *nix
today = dt.strftime(dt.strptime(ctime(), "%a %b %d %H:%M:%S %Y"), "%a, %b %d %Y")

print(f"Code Execution Started on: {today}") # only date

Code Execution Started on: Wed, Apr 06 2022
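As a quick check of the date label used above, the sketch below formats a fixed date with the same `"%a, %b %d %Y"` pattern and verifies that the result contains none of the characters Windows forbids in file names. The helper name `date_label` and the `FORBIDDEN` set are illustrative assumptions; calling `strftime` directly on `dt.now()` is shown as a shorter equivalent of the `ctime()` round trip, assuming the default English locale.

```python
from datetime import datetime as dt

# characters that Windows does not allow in file/directory names
FORBIDDEN = set('<>:"/\\|?*')


def date_label(moment: dt) -> str:
    """Format a datetime as a date-only, file-name-safe label (hypothetical helper)."""
    return moment.strftime("%a, %b %d %Y")


label = date_label(dt(2022, 4, 6, 10, 30))
print(label)  # Wed, Apr 06 2022 (with the default English locale)

# none of the forbidden characters appears in the label
print(FORBIDDEN.isdisjoint(label))  # True

# the same label for the current moment
today = date_label(dt.now())
```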


In [7]:
OUTPUT_DIR = join(ROOT, "output", today, PROJECT_CODE)
makedirs(OUTPUT_DIR, exist_ok = True) # create dir if not exist

# also create directory for `logs`
LOGS_DIR = join("/", "logs", PROJECT_CODE, open("VERSION", 'rt').read().strip()) # drop any trailing newline
makedirs(LOGS_DIR, exist_ok = True)

In [8]:
logging.basicConfig(
    filename = join(LOGS_DIR, f"{today}.log"), # change `reports` file name
    filemode = "a", # append logs to existing file, if file exists
    format = "%(asctime)s - %(name)s - CLASS:%(levelname)s:%(levelno)s:L#%(lineno)d - %(message)s",
    level = logging.DEBUG
)

In [9]:
# set/change output file name
OUTPUT_FILE = f"{UUID()}.xlsx" # unique name from `uuid1` (time- and host-based, not random)

# log/inform users of current output file name
logging.info(f"Output File : {join(OUTPUT_DIR, OUTPUT_FILE)}")
print(f"Output File : {join(OUTPUT_DIR, OUTPUT_FILE)}") # use this syntax

Output File : .\output\Wed, Apr 06 2022\House Price Prediction (data analysis)\3628e095-b5a8-11ec-871b-5405db104a4e.xlsx
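`uuid1` builds its value from the current time and a host ID, so the file names above are unique but not random; `uuid4` is the random variant. A minimal sketch of the difference (the variable names are illustrative):

```python
from uuid import uuid1, uuid4

time_based = uuid1()    # current time + host ID
random_based = uuid4()  # generated from random bits

print(time_based.version, random_based.version)  # 1 4

# repeated calls still give distinct file names
names = {f"{uuid1()}.xlsx" for _ in range(100)}
print(len(names))  # 100
```

If output files are shared outside the machine that produced them, `uuid4` may be the safer choice, since a version 1 UUID can embed the host's network address.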


In [10]:
INPUT_FILENAME = join(DATA, "train.csv") # let's start with `training` dataset

## Read & Process Input File(s)

In [13]:
def read_file(filename : str) -> pd.DataFrame:
    """
    Read a CSV File using `pd.read_csv()`
    
    The function is intended to read the `training` and `testing` files
    of the given project. Since both files share the same format, the
    same function can be used to read and process either of them.
    
    :param filename: Input file name. Generally, this is either `train.csv` or
                     `test.csv`, including its (absolute/relative) path.

    :return: The file contents as a `pd.DataFrame`, indexed by the first column.
    """
    
    data = pd.read_csv(filename, index_col = 0)
    return data.copy() # deepcopy
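
A small sketch of what `index_col = 0` does, using an in-memory CSV in place of `train.csv`. The two rows reuse a few values from the sample output shown in this notebook, and `io.StringIO` stands in for a file on disk since `pd.read_csv` accepts file-like objects; the variable names are illustrative.

```python
import io

import pandas as pd

# a cut-down stand-in for `train.csv` (only three of the feature columns)
csv_text = """Id,LotFrontage,LotArea,SalePrice
1185,50.0,35133,186700
1078,,15870,138800
"""

frame = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(frame.index.name)  # Id
print(frame.shape)       # (2, 3)

# the empty field is read as a missing value
print(pd.isna(frame.loc[1078, "LotFrontage"]))  # True
```

With the first column as the index, `Id` no longer counts as a feature column, which keeps it out of any later per-feature analysis.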

### Understand the Dataset/Problem

The **objective** is to predict the **`SalePrice`** given various attributes that describe (almost) every aspect of residential homes in Ames, Iowa. In order to understand our data, let's look into each of the categories and understand their relevance to the given problem.

In [14]:
data = read_file(INPUT_FILENAME)
data.sample(5)

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1185 | 20 | RL | 50.0 | 35133 | Grvl | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 186700 |
| 1078 | 20 | RL | NaN | 15870 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 3 | 2006 | WD | Abnorml | 138800 |
| 1353 | 50 | RM | 50.0 | 6000 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2009 | WD | Normal | 134900 |
| 1262 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Normal | 128900 |
| 865 | 20 | FV | 72.0 | 8640 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2008 | New | Partial | 250580 |
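
One way to start the feature-by-feature review is to tabulate each column's type and share of missing values. The sketch below does this on a tiny hand-built frame that reuses a few values from the sample rows above; on the real data the same calls would run on `data`. The helper name `summarize` is an assumption for illustration.

```python
import pandas as pd

# three sample rows, restricted to a handful of columns
sample = pd.DataFrame(
    {
        "MSZoning": ["RL", "RL", "RM"],
        "LotFrontage": [50.0, float("nan"), 50.0],
        "LotArea": [35133, 15870, 6000],
        "Alley": [None, None, None],
        "SalePrice": [186700, 138800, 134900],
    },
    index=pd.Index([1185, 1078, 1353], name="Id"),
)


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype and fraction of missing values (hypothetical helper)."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing": frame.isna().mean(),
        }
    ).sort_values("missing", ascending=False)


summary = summarize(sample)
print(summary)

# numeric columns, as a first split between numeric and categorical features
numeric = sample.select_dtypes(include="number").columns.tolist()
print(numeric)  # ['LotFrontage', 'LotArea', 'SalePrice']
```

Sorting by the missing share puts columns such as `Alley` at the top, which is a reasonable place to begin deciding whether a feature carries usable information.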
