# Notebook 03 - Data Analysis

## Objectives

* Answer business requirement 1: "The client is interested in discovering how house attributes correlate with sale prices. Therefore, the client expects data visualisations of the correlated variables against the sale price."
    - Load and inspect the data gathered during data collection in the previous notebook
    - Analyze the data to identify relevant variables
    - Conduct a correlation study between these variables and the sale price
    - Create visualizations to illustrate the correlations
    - Summarize the findings and provide insights on the relationship between house attributes and sale prices
    - Identify next steps

## Inputs

* outputs/datasets/collection/house_price_records.csv

## Outputs

* Code that answers business requirement 1 and can be used to build the Streamlit app
* Plots saved in a folder for documentation

## [Conclusions]

* [Conclusions here]

---

# Import Packages

In [2]:
import os
import pandas as pd
from ydata_profiling import ProfileReport


# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The current working directory therefore needs to be changed to the workspace, i.e., the working directory needs to be changed from the current folder to its parent folder

First, the current directory is retrieved with `os.getcwd()`:

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\housing-price-predictor\\jupyter_notebooks'

Next, the working directory is set to the parent of the current `jupyter_notebooks` directory:
* `os.path.dirname()` returns the parent directory
* `os.chdir()` sets the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, confirm that the new current directory has been successfully set

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\housing-price-predictor'
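
Re-running the `os.chdir()` cell above moves the working directory up one further level each time. A minimal sketch of a guard against this, assuming the notebook folder is always named `jupyter_notebooks`; the helper `workspace_root` is a hypothetical name and not part of the original notebook:

```python
import os


def workspace_root(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` if `path` is the notebooks folder, else `path` unchanged."""
    normalised = os.path.normpath(path)
    if os.path.basename(normalised) == notebooks_dir:
        return os.path.dirname(normalised)
    return path


# Only moves up one level when currently inside the notebooks folder,
# so running this cell twice does not leave the workspace.
os.chdir(workspace_root(os.getcwd()))
```

With this guard, the cell can be run repeatedly without changing the result after the first run.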

---

# Load Data

The data is loaded from the outputs/datasets/collection folder:

In [6]:
df = pd.read_csv('outputs/datasets/collection/house_price_records.csv')
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


---

# Pandas Profile Report

In order to become more familiar with the dataset, we can create a Pandas profile report:

In [7]:
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()


The Pandas profile report shows several noteworthy features of the data:

* A number of features have at least some missing values
* `EnclosedPorch` and `WoodDeckSF` have very high proportions of missing values, with 90.7% and 89.4% respectively
* Several features have some clear outliers: notably `OpenPorchSF`, `TotalBsmtSF`, `LotArea` and `LotFrontage`
* Several of the features relating to houses' basements have high numbers of zeros, presumably because the houses in question have no basement
* Similarly, `GarageArea` also has a significant number of zeros
    - We can assume that this is because these houses do not have garages, since we can see from the word cloud that `GarageFinish` has a number of houses recorded as having no garage
    - Note that `GarageYrBlt` has 81 missing values (and no zeros), which are likely to correspond to the 81 zeros of `GarageArea` and relate to houses with no garage
* We see that `1stFlrSF`, `BsmtUnfSF`, `GrLivArea` and the target variable `SalePrice` all have a positively skewed distribution
    - This is in line with what we would expect, with a majority of values close to a relatively low median value and a small number of higher values relating to unusually large or high-value houses
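
The observations above on zeros, missing values and skew can also be checked directly with pandas. The following is a rough sketch on a small made-up frame rather than the real dataset; the helper name `summarise_numeric` and the toy values are assumptions for illustration only:

```python
import pandas as pd


def summarise_numeric(data):
    """Count zeros and missing values, and compute skewness, for each numeric column."""
    numeric = data.select_dtypes(include="number")
    return pd.DataFrame({
        "zeros": (numeric == 0).sum(),
        "missing": numeric.isna().sum(),
        "skew": numeric.skew(),
    })


# Toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "GarageArea": [0, 0, 400, 500, 600],
    "GarageYrBlt": [float("nan"), float("nan"), 1990, 2000, 2005],
    "SalePrice": [100000, 110000, 120000, 130000, 400000],
})

summary = summarise_numeric(toy)

# True when every row with a zero garage area also lacks a garage build year, and vice versa
garage_link = (toy["GarageArea"].eq(0) == toy["GarageYrBlt"].isna()).all()
```

Applied to `df`, the same check would show whether the missing `GarageYrBlt` values really do line up with the zeros in `GarageArea`, instead of relying on the matching counts alone.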

To view the information about variables with missing data in a way that is easier to read and interpret, we can create a new dataframe as follows:

In [21]:
vars_with_missing_data = df.columns[df.isna().sum() > 0]

missing_data_table = pd.DataFrame({
    'Variable': vars_with_missing_data,
    'Percentage Missing': (df[vars_with_missing_data].isna().mean() * 100).round(1)
}).sort_values(by='Percentage Missing', ascending=False).reset_index(drop=True)

print(missing_data_table)

        Variable  Percentage Missing
0  EnclosedPorch                90.7
1     WoodDeckSF                89.4
2    LotFrontage                17.7
3   GarageFinish                11.1
4   BsmtFinType1                 7.8
5   BedroomAbvGr                 6.8
6       2ndFlrSF                 5.9
7    GarageYrBlt                 5.5
8     MasVnrArea                 0.5


---

# Correlation Study

The dataset has a number of categorical variables. These can be transformed using [One Hot Encoding](https://feature-engine.trainindata.com/en/1.1.x/encoding/OneHotEncoder.html).

* Note that One Hot Encoding is generally preferred to integer encoding for categorical variables with no natural order: integer codes imply an order and magnitude, so a model can become biased towards labels encoded as higher numbers, when in fact we want all labels to be treated as equally important
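
To illustrate the difference, here is a small sketch that uses plain pandas rather than `feature_engine`; the four example labels are made up for the illustration:

```python
import pandas as pd

finish = pd.Series(["Unf", "RFn", "Fin", "RFn"], name="GarageFinish")

# Integer codes impose an arbitrary alphabetical order: Fin=0 < RFn=1 < Unf=2
integer_codes = finish.astype("category").cat.codes

# One-hot encoding gives each label its own 0/1 column, with no implied order
one_hot = pd.get_dummies(finish, prefix="GarageFinish", dtype=int)
```

Each row of `one_hot` contains exactly one `1`, so no label is numerically "larger" than another.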

In [22]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
df_ohe.head()

ValueError: Some of the variables to transform contain missing values. Check and remove those before using this transformer.
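
The error arises because some categorical variables (for example `GarageFinish` and `BsmtFinType1`) contain missing values. One possible workaround, sketched here on made-up data, is to label missing categories explicitly before encoding. This assumes that a placeholder label such as `'Missing'` is acceptable for the correlation study, and it uses `pd.get_dummies` as a stand-in for the `feature_engine` encoder so that the example is self-contained:

```python
import pandas as pd

# Toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "GarageFinish": ["RFn", None, "Unf", "Fin"],
    "GarageArea": [548, 0, 642, 836],
    "SalePrice": [208500, 90000, 140000, 250000],
})

# Select the non-numeric (categorical) columns
categorical_vars = toy.select_dtypes(exclude="number").columns.to_list()

# Assumption: a missing category is replaced by the explicit label 'Missing'
toy_filled = toy.copy()
toy_filled[categorical_vars] = toy_filled[categorical_vars].fillna("Missing")

# One-hot encode the categorical columns; numeric columns pass through unchanged
toy_ohe = pd.get_dummies(toy_filled, columns=categorical_vars, dtype=int)

# Correlation of every encoded variable with the target
corr_with_price = toy_ohe.corr()["SalePrice"].drop("SalePrice")
```

After the same `fillna` step on `df`, the `OneHotEncoder` call above should no longer meet missing values in the categorical variables. Whether `'Missing'` is the right label is a data-cleaning decision that this sketch does not settle.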

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [8]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is filled in
except Exception as e:
  print(e)