# Sale Price Study Notebook

## Objectives.

- These are the answers to business requirement 1.
    - My niece is interested in discovering how the house attributes correlate with the sale price of the houses. 
    - Therefore, my niece is expecting a data visualization of the correlated variables against the sale price. 

- Load and inspect the data prepared during data collection (01_data_collection).
- Data Exploration.
- EDA on selected variables.
- Conclusion and next steps.

## Inputs:

- inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

Additional Comments: 

- This notebook was written with help from my colleagues, the walkthrough project and the data cleaning lesson provided within the course.
- This notebook is designed to allow us to explore the data using the CRISP-DM data understanding methodology. 

___


## Changing the working directory:

Change the working directory from its current folder to its parent folder: 

- Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
 
    - os.path.dirname() gets the parent directory.
    - os.chdir() defines the new current directory. 

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory, congrats!")

The following command will confirm the new current directory: 

In [None]:
current_dir = os.getcwd()
current_dir
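
Re-running the `os.chdir` cell moves one more level up each time. The following is a minimal sketch of a guarded version, assuming the project root can be recognised by the presence of an `inputs` folder. The helper name `ensure_project_root` and the `marker` parameter are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def ensure_project_root(marker: str = "inputs") -> Path:
    """Move up one directory only if `marker` is missing here but present in the parent.

    `marker` is an assumed folder name used to recognise the project root.
    """
    cwd = Path.cwd()
    if not (cwd / marker).exists() and (cwd.parent / marker).exists():
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `ensure_project_root()` several times in a row leaves the working directory unchanged after the first successful move.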

___

Now that this is done, we need to import the packages:

## Import Packages:

In [None]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None
from pandas_profiling import ProfileReport

Now that we have imported the pandas packages, we can load the house price records previously prepared: 

In [None]:
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)
df.head()

## Data Exploration:

- We have the data, now we have to explore the dataset, by checking variable types and distribution, missing levels and what value these variables may add in the context of the first business requirement. 

In [None]:
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

___


## Correlation Study

Assess correlation levels across numerical variables using the 'Spearman' and 'Pearson' methods.

- We will exclude the first item returned, as this is the correlation of SalePrice with itself (always 1) under either method.
- Ideally we fetch only the most relevant correlations, so we keep the top 10 for each method.
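
The selection described above can be sketched as a small helper that drops the target by label rather than by position. This is an illustrative sketch only: the function name `top_correlations` and the toy data are hypothetical and not taken from the house price dataset.

```python
import pandas as pd


def top_correlations(df, target, method="pearson", n=10):
    """Return the n variables most correlated with `target`, ranked by absolute value."""
    # Keep numeric columns only, then take the column of correlations with the target
    corr = df.select_dtypes("number").corr(method=method)[target]
    # Drop the target's correlation with itself (always 1) before ranking
    return corr.drop(target).sort_values(key=abs, ascending=False).head(n)


# Toy data for illustration only
toy = pd.DataFrame({
    "price": [1, 2, 3, 4, 5],
    "a": [1, 2, 3, 4, 6],
    "b": [5, 4, 3, 2, 1],
    "c": [1, 0, 1, 0, 1],
    "label": ["x", "y", "x", "y", "x"],
})
top_two = top_correlations(toy, "price", n=2)
print(top_two)
```

Dropping by label avoids relying on the target always sorting into the first position.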

### Pearson Method:

- Using the 'Pearson' method to measure the linear relationship between two features.

In [None]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

### Spearman Method:

- Using the 'Spearman' method to measure the monotonic relationship between two features. It is rank-based, so it also captures trends that are not linear.

In [None]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

After conducting the correlation study with the two methods (Pearson and Spearman), we conclude that there are strong positive correlations between Sale Price and at least 5 variables.
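
To illustrate why both methods are worth running, here is a minimal sketch on toy data (not from the house price dataset) where one variable grows monotonically but not linearly with the other. Spearman, which works on ranks, reports a perfect relationship, while Pearson reports a weaker one.

```python
import pandas as pd

# Toy data for illustration only: y increases with x, but along a curve
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A variable that ranks high under Spearman but lower under Pearson may therefore have a curved relationship with the Sale Price.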

### Investigation:

- We will take the top 5 variables returned by each method and merge the two lists into a single set, which removes duplicates.
- After this we will be able to work with a single list of correlated variables from both methods combined. 

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

- This results in 6 variables (1stFlrSF, GarageArea, GrLivArea, OverallQual, TotalBsmtSF, YearBuilt) that correlate with Sale Price.
- These 6 variables will be assessed for their strength in predicting the Sale Price.

In [None]:
corr_var_list = list(set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list()))
corr_var_list
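
A Python `set` has no guaranteed order, so the order of `corr_var_list` (and of the plots produced from it) can differ between runs. One possible way to make it reproducible is to sort the merged names, sketched here on hypothetical correlation values rather than the real results.

```python
import pandas as pd

# Hypothetical correlation values, for illustration only
pearson_toy = pd.Series({"A": 0.80, "B": 0.70, "C": 0.60})
spearman_toy = pd.Series({"A": 0.82, "C": 0.75, "D": 0.60})

top_n = 2
# Union of the top-n names from each method, sorted for a stable order
combined = sorted(set(pearson_toy[:top_n].index) | set(spearman_toy[:top_n].index))
print(combined)
```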

___

## EDA on the correlated variable list.

- Filter the house price dataset on only the correlated variable list and include the sale price.

In [None]:
df_eda = df.filter(corr_var_list + ['SalePrice'])
print(df_eda.shape)
df_eda.head(7)

### Visualize variable correlation to Sale Price:

- Plot each variable against the Sale Price as a scatter plot with a fitted regression line: 



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

def plot_numerical(df, col, target_var):
    # Scatter plot of one variable against the target, with a fitted regression line
    plt.figure(figsize=(15, 8))
    sns.regplot(data=df, x=col, y=target_var)
    plt.title(f"{col}", fontsize=20)
    plt.show()


target_var = 'SalePrice'
# One plot per correlated variable
for col in corr_var_list:
    plot_numerical(df_eda, col, target_var)
    print("\n\n")

___

## Conclusion and the next steps:

**The correlation results and the plot interpretations converge.**

- The following are the variables isolated in the correlation study:
    - 1stFlrSF: First Floor square feet.
    - GarageArea: Size of the garage in square feet.
    - GrLivArea: Above grade living area square feet.
    - OverallQual: Rates the overall material and finish of the house when constructed / refurbished.
    - YearBuilt: Original construction date. (1972 to 2010).
    - TotalBsmtSF: Total square feet of basement area.

- Following the above analysis, we consider that the elements playing the most important role in house pricing are the following:
    - Ground floor living area,
    - Basement area,
    - Garage area.
- In addition, other important factors in house pricing are the year the house was built and the quality of the materials used in building or refurbishing it.

- The plots show that the variables isolated in the correlation study do indeed have a strong correlation with, and possibly strong predictive power for, the Sale Price of these houses. 

- Our next step will be data cleaning. Let's go! :)


*Please refer to the readme.md file to better understand the parameters and the naming used within this project.
Please keep in mind that these prices represent the price of each house up to 2010.*

*The house prices have been increasing lately due to the high interest in mortgages, inflation (resulting in a higher price of materials) and many other factors. This project is mostly fictional with a public dataset provided by Code Institute, please do not take financial advice from this project. 😃*



___