# **(ADD THE NOTEBOOK NAME HERE)**

## Objectives

* Write your notebook objective here, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write here which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

#### Understanding data:

- Understand the following:

    - Distribution
    - Data type
    - Apply Descriptive Statistics
    - Population and sample
    - Outliers
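As a rough illustration of the checklist above, the sketch below computes descriptive statistics, reads the data type, and flags outliers for a single numerical column. The values are made up for the example, and the 1.5 × IQR rule is only one assumed convention for outliers, not necessarily the one this project will settle on.

```python
import pandas as pd

# Toy values standing in for one numerical column; illustrative only.
sample = pd.Series([100, 110, 120, 130, 140, 150, 1000], name="toy_price")

# Descriptive statistics: count, mean, std, min, quartiles and max.
summary = sample.describe()

# Data type of the column.
column_dtype = sample.dtype

# Assumed outlier convention: values more than 1.5 * IQR beyond the quartiles.
q1 = sample.quantile(0.25)
q3 = sample.quantile(0.75)
iqr = q3 - q1
outliers = sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)]

print(summary)
print(column_dtype)
print(outliers.to_list())
```

A histogram of the same column (for example with sample.hist()) would be one way to look at the distribution.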

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues'

## Import the dataset from Kaggle.com

- The kaggle package must be installed to download the data

- You also need a registered Kaggle account and a Kaggle API key (the kaggle.json file)

In [4]:
! pip install kaggle==1.5.12



---

- Import the os module
- Set the Kaggle config directory to the current working directory
- Restrict kaggle.json to read and write permissions for the user only (600)

In [8]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

- Define the Kaggle dataset path, which is the part of the dataset URL on kaggle.com that identifies the dataset
- Set the destination folder the download will be stored in to inputs/datasets/raw
- The -d flag specifies the dataset to download
- The -p flag specifies the path of the folder to save the download to

In [9]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.10MB/s]


In [10]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  


## Load and Inspect the Kaggle Data

- The work in this notebook is part of the **'Data Understanding'** phase of the CRISP-DM Workflow.

In [24]:
import pandas as pd

df_house_prices_records = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(len(df_house_prices_records))
df_house_prices_records.head(20)

1460


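| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |
| 5 | 796 | 566.0 | 1.0 | No | 732 | GLQ | 64 | NaN | 480 | Unf | ... | 85.0 | 0.0 | 30 | 5 | 5 | 796 | NaN | 1993 | 1995 | 143000 |
| 6 | 1694 | 0.0 | 3.0 | Av | 1369 | GLQ | 317 | NaN | 636 | RFn | ... | 75.0 | 186.0 | 57 | 5 | 8 | 1686 | NaN | 2004 | 2005 | 307000 |
| 7 | 1107 | 983.0 | 3.0 | Mn | 859 | ALQ | 216 | NaN | 484 | NaN | ... | NaN | 240.0 | 204 | 6 | 7 | 1107 | NaN | 1973 | 1973 | 200000 |
| 8 | 1022 | 752.0 | 2.0 | No | 0 | Unf | 952 | NaN | 468 | Unf | ... | 51.0 | 0.0 | 0 | 5 | 7 | 952 | NaN | 1931 | 1950 | 129900 |
| 9 | 1077 | 0.0 | 2.0 | No | 851 | GLQ | 140 | NaN | 205 | RFn | ... | 50.0 | 0.0 | 4 | 6 | 5 | 991 | NaN | 1939 | 1950 | 118000 |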


#### Check the information relating to each column in the dataframe.
- Note: Any column with a non-null count n below 1460 has 1460 - n empty cells
- Note: The target variable (SalePrice) is an integer and has no null values
- Note: There are 9 columns which contain missing data
- Note: EnclosedPorch and WoodDeckSF are imbalanced; EnclosedPorch, for example, has only 136 non-null values
- Note: There is no need to check for duplicates in this data, as every column is expected to contain repeated values, and repeated values in our numerical target variable are not a concern.
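
The relationship between non-null counts and empty cells can be sketched on a small stand-in frame. The column names below mirror the dataset, but the values are made up and the frame is not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: names borrowed from the dataset, values invented for illustration.
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000],
    "EnclosedPorch": [0.0, np.nan, 0.0, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})

n_rows = len(toy)
non_null = toy.count()        # the figures info() reports as Non-Null Count
missing = n_rows - non_null   # empty cells per column

# Names of the columns that have at least one empty cell.
columns_with_missing = missing[missing > 0].index.to_list()

print(missing)
print(columns_with_missing)
```

The same per-column totals come from toy.isna().sum(), which avoids the subtraction.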

In [63]:
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

---

- Check how many unique datatypes are contained in the dataset.

In [59]:
unique_dtypes = set(df_house_prices_records.dtypes)
unique_dtypes

{dtype('int64'), dtype('float64'), dtype('O')}

- Check how many columns contain: **categorical, numerical(int), and numerical(float) variables**

- Return a list of all these columns

In [57]:
df_house_prices_records_categorical_columns = df_house_prices_records.select_dtypes(include=['object']).columns.to_list()
print(f"There are {len(df_house_prices_records_categorical_columns)} columns that contain categorical data.\n They are: {df_house_prices_records_categorical_columns}.\n\n")

df_house_prices_records_numerical_columns = df_house_prices_records.select_dtypes(include=['int']).columns.to_list()
print(f"There are {len(df_house_prices_records_numerical_columns)} columns that contain numerical data.\n They are: {df_house_prices_records_numerical_columns}.\n\n")

df_house_prices_records_numerical_fl_columns = df_house_prices_records.select_dtypes(include=['float']).columns.to_list()
print(f"There are {len(df_house_prices_records_numerical_fl_columns)} columns that contain numerical data which are floats.\n They are: {df_house_prices_records_numerical_fl_columns}.\n\n")


There are 4 columns that contain categorical data.
 They are: ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'].


There are 13 columns that contain numerical data.
 They are: ['1stFlrSF', 'BsmtFinSF1', 'BsmtUnfSF', 'GarageArea', 'GrLivArea', 'LotArea', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd', 'SalePrice'].


There are 7 columns that contain numerical data which are floats.
 They are: ['2ndFlrSF', 'BedroomAbvGr', 'EnclosedPorch', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea', 'WoodDeckSF'].
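
As a possible alternative to the three separate select_dtypes calls, the columns can be grouped by data type in a single pass over DataFrame.dtypes. This is only a sketch on a toy frame; the name columns_by_dtype is hypothetical and the dtypes are set explicitly so the result does not depend on the platform.

```python
import pandas as pd

# Toy frame with one column of each data type found in the dataset.
toy = pd.DataFrame({
    "GarageArea": pd.Series([548, 460], dtype="int64"),
    "LotFrontage": pd.Series([65.0, 80.0], dtype="float64"),
    "KitchenQual": pd.Series(["Gd", "TA"], dtype="object"),
})

# Map each dtype name to the list of columns that have it.
columns_by_dtype = {}
for name, dtype in toy.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(name)

for dtype_name, names in columns_by_dtype.items():
    print(f"{len(names)} column(s) of type {dtype_name}: {names}")
```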




- Check that all column names are unique by using a Python set: if the set has as many items as the dataframe has columns (24), no name is repeated

In [21]:
cols = df_house_prices_records.columns.to_list()
cols
cols=set(cols)
cols

{'1stFlrSF',
 '2ndFlrSF',
 'BedroomAbvGr',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinType1',
 'BsmtUnfSF',
 'EnclosedPorch',
 'GarageArea',
 'GarageFinish',
 'GarageYrBlt',
 'GrLivArea',
 'KitchenQual',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'SalePrice',
 'TotalBsmtSF',
 'WoodDeckSF',
 'YearBuilt',
 'YearRemodAdd'}

In [22]:
len(cols)

24
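
The set comparison above can also be written as an explicit check. The sketch below uses a toy frame that deliberately repeats a column name, so the check fails and the repeated name is listed; pandas also exposes the same answer through Index.is_unique.

```python
import pandas as pd

# Toy frame with a deliberately repeated column name.
toy = pd.DataFrame([[1, 2, 3]], columns=["LotArea", "GrLivArea", "LotArea"])

names = toy.columns.to_list()

# A set drops repeats, so equal lengths mean every name is unique.
all_unique = len(set(names)) == len(names)

# Names that appear more than once (second and later occurrences).
duplicated_names = toy.columns[toy.columns.duplicated()].to_list()

print(all_unique)
print(toy.columns.is_unique)
print(duplicated_names)
```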

In [66]:

columns_with_nan = df_house_prices_records.columns[df_house_prices_records.isna().sum() > 0].to_list()
columns_with_nan

['2ndFlrSF',
 'BedroomAbvGr',
 'BsmtFinType1',
 'EnclosedPorch',
 'GarageFinish',
 'GarageYrBlt',
 'LotFrontage',
 'MasVnrArea',
 'WoodDeckSF']

- To do: 10-01-2023
- Create a function which goes through the column names (keys) and returns a list.

- Iterate over each column
- For each iteration:
    - Sum the NaN values
    - Append this sum to the list
    - Append the column name to the list
    - Append its datatype to the list
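
One way the planned function could look is sketched below. The name summarise_nan and the choice to return one tuple per column are assumptions made for the example, not a settled design, and the toy frame only stands in for the real data.

```python
import numpy as np
import pandas as pd


def summarise_nan(df):
    """Return one (column name, NaN count, dtype) tuple per column."""
    summary = []
    for column in df.columns:
        nan_count = int(df[column].isna().sum())  # sum the NaN values
        summary.append((column, nan_count, str(df[column].dtype)))
    return summary


# Toy frame: names borrowed from the dataset, values invented for illustration.
toy = pd.DataFrame({
    "GarageArea": pd.Series([548, 460, 608], dtype="int64"),
    "LotFrontage": pd.Series([65.0, np.nan, 68.0], dtype="float64"),
    "GarageFinish": pd.Series(["RFn", None, None], dtype="object"),
})

nan_summary = summarise_nan(toy)
for row in nan_summary:
    print(row)
```

Keeping the three values together in a tuple makes it easy to sort or filter the result by NaN count or by data type later.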

- Understand which categories of variables may be affected by the quantity of NaN values.

- Also:
    - Understand better whether the imbalance in WoodDeckSF and EnclosedPorch will cause issues.
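
A first step towards that could be to quantify the imbalance. The sketch below, on made-up values, measures the share of missing cells and the share of zeros per column; treating whatever remains as the informative share is an assumption of this example, not an established criterion.

```python
import numpy as np
import pandas as pd

# Toy values only; the real columns would come from df_house_prices_records.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, 0.0, np.nan, 0.0, np.nan, np.nan, 112.0],
    "WoodDeckSF": [0.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 140.0],
})

missing_share = toy.isna().mean()   # fraction of cells that are NaN
zero_share = (toy == 0).mean()      # fraction of cells that are exactly zero
informative_share = 1 - missing_share - zero_share

print(pd.DataFrame({
    "missing": missing_share,
    "zero": zero_share,
    "informative": informative_share,
}))
```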

- To do: 10-01-2023
    - Study the inherited houses dataset

In [18]:
df_inherited_houses = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited_houses

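| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | ... | 11622 | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | ... | 14267 | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | ... | 13830 | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | ... | 9978 | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 |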


NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down (you can't create a workflow in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [7]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder: keeps the try block valid until os.makedirs is uncommented
except Exception as e:
  print(e)