# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data.
* Carry out an initial investigation of the data.
* Check that data types are correct for further analysis of the target 'SalePrice'.
* Check that data types match across both CSV files so that later correlation analysis is accurate.
* Evaluate the data types of the feature variables.
* Evaluate the quantity of null values in the dataset and their possible implications.
* Develop an understanding of what distributions and further analysis may be necessary.


## Inputs

* Kaggle JSON file - the authentication token.
* Kaggle dataset URL: codeinstitute/housing-prices-data

## Outputs

* outputs/datasets/collection/original.csv
* outputs/datasets/collection/inherited.csv
* outputs/datasets/collection/inherited_with_index.csv




---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

* We need to change the working directory from its current folder to its parent folder

    * We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues'
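The directory change above can also be wrapped in a small helper. The following is a minimal sketch using pathlib; the function name `move_to_parent_dir` is an illustrative assumption and is not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_parent_dir() -> Path:
    """Make the parent of the current working directory the new working directory."""
    parent = Path.cwd().parent  # same idea as os.path.dirname(os.getcwd())
    os.chdir(parent)
    return parent
```

Called once from the notebooks subfolder, this should have the same effect as the two cells above. Calling it again moves up one more level, so it is best run only once per session.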

## Import the dataset from Kaggle.com

- The kaggle package must be installed to download the data

- You also need a registered Kaggle account and a Kaggle API key (JSON file)

In [4]:
! pip install kaggle==1.5.12



---

- Import the os module
- Set the Kaggle config directory to the current working directory
- Set the permissions on kaggle.json to read and write for the user only (600)

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
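The chmod error above shows that kaggle.json was not in the working directory when the cell ran. A minimal sketch of a guard that checks for the token before setting permissions is given below. `prepare_kaggle_token` is a hypothetical helper name, and the sketch assumes a POSIX file system where mode 600 is meaningful.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir: str) -> Path:
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to the user."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        # Fail early with a clear message instead of a later authentication traceback
        raise FileNotFoundError(f"Expected the Kaggle token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return token
```

Running `prepare_kaggle_token(os.getcwd())` before the download cell would stop the notebook at the missing token rather than at the later download step.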


- Define the Kaggle dataset path as the owner/dataset identifier of the dataset on kaggle.com
- Set/create the folder the data will be stored in: inputs/datasets/raw
- -d flag specifies the dataset to download
- -p flag specifies the folder the files are downloaded to

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace/milestone-project-heritage-housing-issues. Or use the environment method.


In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
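The unzip step can also be done from Python, which avoids depending on a shell glob that fails when no archive is present. This is a minimal sketch using the standard-library zipfile module. `extract_and_remove_zips` is a hypothetical helper name, and unlike the shell cell above it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_and_remove_zips(folder: str) -> List[str]:
    """Extract every .zip archive in folder into folder, then delete the archive."""
    extracted: List[str] = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive once its contents are extracted
    return extracted
```

If the folder contains no archives, the function returns an empty list, which makes a failed download easy to detect.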


## Load and Inspect the Kaggle Data

- The following work contained in this notebook is part of the **'Data Understanding'** phase of the CRISP-DM Workflow.

In [8]:
import pandas as pd

df_house_prices_records = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(len(df_house_prices_records))
print(df_house_prices_records.shape)
df_house_prices_records.head(8)

1460
(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
5,796,566.0,1.0,No,732,GLQ,64,,480,Unf,...,85.0,0.0,30,5,5,796,,1993,1995,143000
6,1694,0.0,3.0,Av,1369,GLQ,317,,636,RFn,...,75.0,186.0,57,5,8,1686,,2004,2005,307000
7,1107,983.0,3.0,Mn,859,ALQ,216,,484,,...,,240.0,204,6,7,1107,,1973,1973,200000


#### Check the information relating to each column in the dataframe.
- Note: Any column with a non-null count n below 1460 has 1460 - n empty cells
- Note: The target variable is an integer and has no null values
- Note: There are 9 columns which contain missing data
- Note: EnclosedPorch and WoodDeckSF have few entries in the dataset
- Note: All columns contain duplicate values, but duplicates are not a concern in this context.

In [9]:
df_house_prices_records.info()
print(df_house_prices_records['WoodDeckSF'].info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

---

- Check how many unique datatypes are contained in the dataset.

In [10]:
unique_dtypes = set(df_house_prices_records.dtypes)
unique_dtypes

{dtype('int64'), dtype('float64'), dtype('O')}

- Check how many columns contain: **categorical, numerical(int), and numerical(float) variables**

- Return a list of all these columns

In [11]:
df_house_prices_records_categorical_columns = df_house_prices_records.select_dtypes(include=['object']).columns.to_list()
print(f"There are {len(df_house_prices_records_categorical_columns)} columns that contain categorical data.\n They are: {df_house_prices_records_categorical_columns}.\n\n")

df_house_prices_records_numerical_columns = df_house_prices_records.select_dtypes(include=['int']).columns.to_list()
print(f"There are {len(df_house_prices_records_numerical_columns)} columns that contain numerical data which are integers.\n They are: {df_house_prices_records_numerical_columns}.\n\n")

df_house_prices_records_numerical_fl_columns = df_house_prices_records.select_dtypes(include=['float']).columns.to_list()
print(f"There are {len(df_house_prices_records_numerical_fl_columns)} columns that contain numerical data which are floats.\n They are: {df_house_prices_records_numerical_fl_columns}.\n\n")


There are 4 columns that contain categorical data.
 They are: ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'].


There are 13 columns that contain numerical data which are integers.
 They are: ['1stFlrSF', 'BsmtFinSF1', 'BsmtUnfSF', 'GarageArea', 'GrLivArea', 'LotArea', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd', 'SalePrice'].


There are 7 columns that contain numerical data which are floats.
 They are: ['2ndFlrSF', 'BedroomAbvGr', 'EnclosedPorch', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea', 'WoodDeckSF'].
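The three select_dtypes calls above can be generalised by grouping column names by dtype in a single pass. The sketch below uses a small made-up frame (`toy`) standing in for the housing records; the values are illustrative only.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical three-column frame standing in for the housing records
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA"],
    "GrLivArea": [1710, 1262],
    "LotFrontage": [65.0, 80.0],
})

# Map each dtype name to the list of columns stored with that dtype
columns_by_dtype = defaultdict(list)
for column, dtype in toy.dtypes.items():
    columns_by_dtype[str(dtype)].append(column)

for dtype_name, columns in columns_by_dtype.items():
    print(f"{dtype_name}: {len(columns)} column(s) -> {columns}")
```

This avoids naming each dtype in advance, so a column with an unexpected dtype would show up as its own group.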




In [12]:

columns_with_nan = df_house_prices_records.columns[df_house_prices_records.isna().sum() > 0].to_list()
columns_with_nan

['2ndFlrSF',
 'BedroomAbvGr',
 'BsmtFinType1',
 'EnclosedPorch',
 'GarageFinish',
 'GarageYrBlt',
 'LotFrontage',
 'MasVnrArea',
 'WoodDeckSF']

- Make a quick assessment of each of the above features to understand what values may be required to fill the NaN and 0.0 values, if needed.

In [13]:
# The total number of null values in each column that contains them.
# An interesting observation is the possibility of wood decks and enclosed porches being desirable due to rarity.
df_house_prices_records[columns_with_nan].isna().sum()

2ndFlrSF           86
BedroomAbvGr       99
BsmtFinType1      114
EnclosedPorch    1324
GarageFinish      162
GarageYrBlt        81
LotFrontage       259
MasVnrArea          8
WoodDeckSF       1305
dtype: int64
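Raw counts are easier to judge next to the share of rows they represent. The following sketch adds a percentage column to the null counts; `missing_summary` is a hypothetical helper and `toy` is made-up data used only to show the shape of the result.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns with at least one null."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical sample: one sparse column, one mostly complete, one complete
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, 112.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})
print(missing_summary(toy))
```

Applied to the housing records, this would put EnclosedPorch and WoodDeckSF at the top of the table, which matches the counts shown above.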

In [14]:
# 781 of the 1460 samples have a 2ndFlrSF of 0.0 (about 53% of the data).
# Treating NaN as "no second floor" as well, the total percentage of houses with no second floor is:
no_top_floor_float = len(df_house_prices_records.query("`2ndFlrSF` == 0.0"))
no_top_floor_NaN = df_house_prices_records['2ndFlrSF'].isna().sum()
print(no_top_floor_float)
print(no_top_floor_NaN)
# Multiply by 100 so the value printed before the % sign is a percentage, not a fraction
percen_dataset_no_top_floor = (no_top_floor_float + no_top_floor_NaN) / len(df_house_prices_records) * 100
percen_dataset_no_top_floor = round(percen_dataset_no_top_floor, 2)
print(f"{percen_dataset_no_top_floor}% of houses in this dataset have no second floor")

781
86
59.38% of houses in this dataset have no second floor


In [15]:
# 'BedroomAbvGr' may be a candidate to be transformed to a categorical variable if necessary.
# These are bedrooms that are above ground, i.e. not at basement level. NaN could likely be grouped with 0.0 here, as both probably mean the same thing in this instance.
unique_bedroom_grades = set(df_house_prices_records['BedroomAbvGr'].unique())
unique_bedroom_grades

{nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0}

In [16]:
df_unique_bedroom_grades = df_house_prices_records['BedroomAbvGr']
df_unique_bedroom_grades.head(30)

0     3.0
1     3.0
2     3.0
3     NaN
4     4.0
5     1.0
6     3.0
7     3.0
8     2.0
9     2.0
10    3.0
11    4.0
12    2.0
13    3.0
14    NaN
15    2.0
16    2.0
17    2.0
18    NaN
19    3.0
20    4.0
21    3.0
22    3.0
23    3.0
24    3.0
25    3.0
26    3.0
27    3.0
28    2.0
29    1.0
Name: BedroomAbvGr, dtype: float64

In [17]:
# Note here: None and nan likely both mean no basement in this context and are essentially the same category.
unique_bsmt_finish_types = set(df_house_prices_records['BsmtFinType1'].unique())
unique_bsmt_finish_types 

{'ALQ', 'BLQ', 'GLQ', 'LwQ', 'None', 'Rec', 'Unf', nan}

In [18]:
# Enclosed porches will probably add value to a home. They give more living space and so are possibly a desirable feature.
# The absence of an enclosed porch may also decrease the sale price, and so might correlate with lower sale prices.
# There are 136 non-null entries in this column, with 18 unique non-zero sizes among them.
unique_enclosed_porch_values  = set(df_house_prices_records['EnclosedPorch'].unique())
unique_enclosed_porch_values

{0.0,
 nan,
 42.0,
 50.0,
 91.0,
 112.0,
 136.0,
 138.0,
 144.0,
 145.0,
 158.0,
 185.0,
 190.0,
 216.0,
 224.0,
 226.0,
 234.0,
 244.0,
 268.0,
 286.0}
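Because the column mixes NaN, 0.0 and real areas, it can help to count each group separately before deciding how to treat the missing values. This is a small sketch on made-up values; `area_breakdown` and `toy_porch` are hypothetical names.

```python
import numpy as np
import pandas as pd


def area_breakdown(series: pd.Series) -> dict:
    """Count missing, zero and positive entries in an area column."""
    positive = series[series > 0]  # NaN compares as False, so it is excluded here
    return {
        "missing": int(series.isna().sum()),
        "zero": int((series == 0).sum()),
        "positive": int(len(positive)),
        "unique_positive_sizes": int(positive.nunique()),
    }


# Hypothetical sample values, not taken from the dataset
toy_porch = pd.Series([0.0, np.nan, 112.0, 112.0, 244.0, np.nan], name="EnclosedPorch")
print(area_breakdown(toy_porch))
```

The same function could be applied to WoodDeckSF, which has a similar mix of NaN and 0.0 entries.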

In [19]:
# Develop a better understanding of the distribution in this column
unique_garage_finishes = df_house_prices_records['GarageFinish'].unique()
unique_garage_finishes

array(['RFn', 'Unf', nan, 'Fin', 'None'], dtype=object)

In [20]:
# Develop a better understanding of the distribution in this column

df_garage_yt_blt = df_house_prices_records['GarageYrBlt']
df_garage_yt_blt

0       2003.0
1       1976.0
2       2001.0
3       1998.0
4       2000.0
         ...  
1455    1999.0
1456    1978.0
1457    1941.0
1458    1950.0
1459    1965.0
Name: GarageYrBlt, Length: 1460, dtype: float64

In [21]:
df_bsmt_exposure = set(df_house_prices_records['BsmtExposure'].unique())
df_bsmt_exposure

{'Av', 'Gd', 'Mn', 'No', 'None'}

In [22]:
# Distribution/correlation analysis is needed to understand this feature further.
# NaN values will likely be imputed with the mean or median
df_lot_frontage = df_house_prices_records['LotFrontage'].head(10)
print(df_lot_frontage)
df = df_house_prices_records['LotFrontage'].iloc[7:8]
df

0    65.0
1    80.0
2    68.0
3    60.0
4    84.0
5    85.0
6    75.0
7     NaN
8    51.0
9    50.0
Name: LotFrontage, dtype: float64


7   NaN
Name: LotFrontage, dtype: float64
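As an illustration of the median option mentioned in the cell above, the sketch below fills a missing value in a short made-up series. It is only one possible approach; the choice between mean and median would depend on the distribution analysis, and `toy_frontage` is hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical LotFrontage values with one missing entry
toy_frontage = pd.Series([65.0, 80.0, 68.0, 60.0, np.nan], name="LotFrontage")

# Series.median() skips NaN by default, so the median is taken over the four known values
median_value = toy_frontage.median()
filled = toy_frontage.fillna(median_value)

print(median_value)
print(filled)
```

The median is less sensitive to a few very wide lots than the mean, which is one reason it is often tried first on skewed features.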

In [23]:
# There are only 8 NaN values in this column as seen above. Assess the distribution
df_mas_vnr_area = df_house_prices_records['MasVnrArea']
print(df_mas_vnr_area.head(10))



0    196.0
1      0.0
2    162.0
3      0.0
4    350.0
5      0.0
6    186.0
7    240.0
8      0.0
9      0.0
Name: MasVnrArea, dtype: float64


In [24]:
# This feature stands out as a possible rarity and something that could add value. NaN may need to be changed to 0.0 to get a measurable square-foot value, since a sum that includes NaN returns nan (as below).
df_woodeck = df_house_prices_records['WoodDeckSF']
df_woodeck.head(10)

df = df_house_prices_records['WoodDeckSF'].iloc[7:10]
df.values.sum()

nan

In [25]:
# Possible room to categorize these variables if required. < 5 = poor > 5 = Good?
df_condition = set(df_house_prices_records['OverallCond'].unique())
print(df_condition)

df_quality = set(df_house_prices_records['OverallQual'].unique())
print(df_quality)


{1, 2, 3, 4, 5, 6, 7, 8, 9}
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
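If the ratings are later grouped into categories, pandas `cut` is one way to do it. The sketch below is an assumption about how the bands might be drawn: the comment above leaves a rating of exactly 5 undecided, so it is given its own "Average" band here, and the labels are illustrative.

```python
import pandas as pd

# Hypothetical ratings covering each band
toy_condition = pd.Series([3, 5, 8], name="OverallCond")

# Bins are right-inclusive: (0, 4] -> Poor, (4, 5] -> Average, (5, 10] -> Good
condition_band = pd.cut(
    toy_condition,
    bins=[0, 4, 5, 10],
    labels=["Poor", "Average", "Good"],
)
print(condition_band)
```

The same bins would work for OverallQual, since its ratings run from 1 to 10.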


In [26]:
# Assess later whether the age of a house correlates with sale price, particularly if it has been remodelled or has some standout features.
df_YearBuilt = df_house_prices_records['YearBuilt']
print(df_YearBuilt.head(5))

df_YearRemodAdd = df_house_prices_records['YearRemodAdd']
print(df_YearRemodAdd.head(5))

0    2003
1    1976
2    2001
3    1915
4    2000
Name: YearBuilt, dtype: int64
0    2003
1    1976
2    2002
3    1970
4    2000
Name: YearRemodAdd, dtype: int64
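One way to prepare for the remodelling question is to derive a flag from the two year columns. The sketch below uses the five rows printed above and assumes that YearRemodAdd equals YearBuilt when a house has not been remodelled. The column names `WasRemodelled` and `YearsToRemodel` are hypothetical.

```python
import pandas as pd

# The first five rows of YearBuilt and YearRemodAdd shown in the output above
years = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000],
})

# Assumption: a remodel year later than the build year means the house was remodelled
years["WasRemodelled"] = years["YearRemodAdd"] > years["YearBuilt"]
years["YearsToRemodel"] = years["YearRemodAdd"] - years["YearBuilt"]
print(years)
```

In this sample only the third and fourth houses would be flagged, the fourth having been remodelled 55 years after it was built.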



- Study the inherited houses dataset

In [27]:
df_inherited_houses = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited_houses

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


- Compare the 2 dataframes to make sure all column data types match up.

In [28]:
df_inherited_houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       4 non-null      int64  
 1   2ndFlrSF       4 non-null      int64  
 2   BedroomAbvGr   4 non-null      int64  
 3   BsmtExposure   4 non-null      object 
 4   BsmtFinSF1     4 non-null      float64
 5   BsmtFinType1   4 non-null      object 
 6   BsmtUnfSF      4 non-null      float64
 7   EnclosedPorch  4 non-null      int64  
 8   GarageArea     4 non-null      float64
 9   GarageFinish   4 non-null      object 
 10  GarageYrBlt    4 non-null      float64
 11  GrLivArea      4 non-null      int64  
 12  KitchenQual    4 non-null      object 
 13  LotArea        4 non-null      int64  
 14  LotFrontage    4 non-null      float64
 15  MasVnrArea     4 non-null      float64
 16  OpenPorchSF    4 non-null      int64  
 17  OverallCond    4 non-null      int64  
 18  OverallQual   

In [29]:
# Show the columns that were cut off in the truncated info() output above
df_testing = df_house_prices_records.filter(['WoodDeckSF','YearBuilt','YearRemodAdd'])
print(df_testing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   WoodDeckSF    155 non-null    float64
 1   YearBuilt     1460 non-null   int64  
 2   YearRemodAdd  1460 non-null   int64  
dtypes: float64(1), int64(2)
memory usage: 34.3 KB
None


---

In [30]:
df_house_prices_records.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500


- A comparison of datatypes in the house_records set vs the inherited_houses set highlighted the following changes, needed to avoid skewed correlation results later in the project:

    - In the inherited housing set, change the following:
        - BsmtFinSF1 change to int
        - BsmtUnfSF change to int
        - GarageArea change to int
        - TotalBsmtSF change to int
    
    - In the housing records set, change the following:
        - 2ndFlrSF change to int
        - BedroomAbvGr change to int
        - EnclosedPorch change to int
        - WoodDeckSF change to int
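The comparison above was made by reading the two info() outputs side by side. It can also be automated, as in the following sketch. `dtype_mismatches` is a hypothetical helper, and the two small frames are made-up stand-ins for the real datasets.

```python
import numpy as np
import pandas as pd


def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """List the shared columns whose dtypes differ between two frames."""
    shared = [column for column in left.columns if column in right.columns]
    comparison = pd.DataFrame({
        "left": left.dtypes[shared].astype(str),
        "right": right.dtypes[shared].astype(str),
    })
    return comparison[comparison["left"] != comparison["right"]]


# Hypothetical stand-ins for the records and inherited frames
toy_records = pd.DataFrame({
    "GarageArea": [548, 460],
    "2ndFlrSF": [854.0, np.nan],
    "SalePrice": [208500, 181500],
})
toy_inherited = pd.DataFrame({
    "GarageArea": [730.0, 312.0],
    "2ndFlrSF": [0, 701],
})
print(dtype_mismatches(toy_records, toy_inherited))
```

Columns that exist in only one frame, such as SalePrice, are ignored, so the result lists only the columns that need a cast.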

In [31]:
columns_to_update1 = ['BsmtFinSF1','BsmtUnfSF', 'GarageArea', 'TotalBsmtSF']

df_inherited_houses[columns_to_update1] = df_inherited_houses[columns_to_update1].astype(int)

df_inherited_houses.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       4 non-null      int64  
 1   2ndFlrSF       4 non-null      int64  
 2   BedroomAbvGr   4 non-null      int64  
 3   BsmtExposure   4 non-null      object 
 4   BsmtFinSF1     4 non-null      int64  
 5   BsmtFinType1   4 non-null      object 
 6   BsmtUnfSF      4 non-null      int64  
 7   EnclosedPorch  4 non-null      int64  
 8   GarageArea     4 non-null      int64  
 9   GarageFinish   4 non-null      object 
 10  GarageYrBlt    4 non-null      float64
 11  GrLivArea      4 non-null      int64  
 12  KitchenQual    4 non-null      object 
 13  LotArea        4 non-null      int64  
 14  LotFrontage    4 non-null      float64
 15  MasVnrArea     4 non-null      float64
 16  OpenPorchSF    4 non-null      int64  
 17  OverallCond    4 non-null      int64  
 18  OverallQual   

In [32]:
columns_to_update = ['2ndFlrSF', 'BedroomAbvGr', 'EnclosedPorch', 'WoodDeckSF']
# Fill NaN with 0 first (NaN cannot be cast to int), then cast the columns to int
df_house_prices_records[columns_to_update] = df_house_prices_records[columns_to_update].fillna(0).astype(int)
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1460 non-null   int64  
 2   BedroomAbvGr   1460 non-null   int64  
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  1460 non-null   int64  
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
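The order of operations in the cell above matters: NaN has no integer representation, so the cast fails unless the missing values are filled first. The following sketch shows this on made-up data. `toy` is hypothetical, and "int64" is requested explicitly so the result does not depend on the platform default integer size.

```python
import numpy as np
import pandas as pd

# Hypothetical float columns containing NaN
toy = pd.DataFrame({
    "WoodDeckSF": [0.0, np.nan, 140.0],
    "BedroomAbvGr": [3.0, np.nan, 4.0],
})
columns = ["WoodDeckSF", "BedroomAbvGr"]

# Casting directly fails because NaN cannot be converted to an integer
try:
    toy[columns].astype("int64")
    cast_failed = False
except ValueError as error:
    cast_failed = True
    print(f"Direct cast failed: {error}")

# Filling NaN with 0 first makes the cast safe
converted = toy.copy()
converted[columns] = toy[columns].fillna(0).astype("int64")
print(converted.dtypes)
```

Note that filling with 0 treats a missing value as "none", which is an assumption about the data rather than something the dataset states.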

In [44]:
df_inherited_houses.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, House A to House D
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   House Choice   4 non-null      object 
 1   1stFlrSF       4 non-null      int64  
 2   2ndFlrSF       4 non-null      int64  
 3   BedroomAbvGr   4 non-null      int64  
 4   BsmtExposure   4 non-null      object 
 5   BsmtFinSF1     4 non-null      int64  
 6   BsmtFinType1   4 non-null      object 
 7   BsmtUnfSF      4 non-null      int64  
 8   EnclosedPorch  4 non-null      int64  
 9   GarageArea     4 non-null      int64  
 10  GarageFinish   4 non-null      object 
 11  GarageYrBlt    4 non-null      float64
 12  GrLivArea      4 non-null      int64  
 13  KitchenQual    4 non-null      object 
 14  LotArea        4 non-null      int64  
 15  LotFrontage    4 non-null      float64
 16  MasVnrArea     4 non-null      float64
 17  OpenPorchSF    4 non-null      int64  
 18  Overall

##### Compare both dataframes for type

- Inherited houses dataframe  dtypes: float64(3), int64(16), object(4)

- House records dataframe  dtypes: float64(3), int64(17), object(4)

In [33]:
# See which columns remain with NaN values
df_house_prices_records[columns_with_nan].isna().sum()

2ndFlrSF           0
BedroomAbvGr       0
BsmtFinType1     114
EnclosedPorch      0
GarageFinish     162
GarageYrBlt       81
LotFrontage      259
MasVnrArea         8
WoodDeckSF         0
dtype: int64

In [34]:
# Replace the 8 NaN values in MasVnrArea with 0.0: there are few of them and NaN likely has the same meaning as 0.0 here.
df_house_prices_records['MasVnrArea'] = df_house_prices_records['MasVnrArea'].fillna(0.0)

df_mas_vnr_area = df_house_prices_records['MasVnrArea']
df_mas_vnr_area.isna().sum()

0

##### These columns still have NaN values at this point.

- BsmtFinType1 (obj: 114)

- GarageFinish (obj: 162)

- GarageYrBlt (float: 81)

- LotFrontage (float: 259)


In [35]:
remaining_columns_with_nan = ['BsmtFinType1', 'GarageFinish', 'GarageYrBlt', 'LotFrontage']
df_house_prices_records[remaining_columns_with_nan].head(10)

Unnamed: 0,BsmtFinType1,GarageFinish,GarageYrBlt,LotFrontage
0,GLQ,RFn,2003.0,65.0
1,ALQ,RFn,1976.0,80.0
2,GLQ,RFn,2001.0,68.0
3,ALQ,Unf,1998.0,60.0
4,GLQ,RFn,2000.0,84.0
5,GLQ,Unf,1993.0,85.0
6,GLQ,RFn,2004.0,75.0
7,ALQ,,1973.0,
8,Unf,Unf,1931.0,51.0
9,GLQ,RFn,1939.0,50.0


In [41]:

df_inherited_houses.insert(0, 'House Choice', ['House A', 'House B', 'House C', 'House D'], True)
df_inherited_houses.columns


Index(['House Choice', '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure',
       'BsmtFinSF1', 'BsmtFinType1', 'BsmtUnfSF', 'EnclosedPorch',
       'GarageArea', 'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual',
       'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'OverallCond',
       'OverallQual', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt',
       'YearRemodAdd'],
      dtype='object')

In [43]:
df_inherited_houses.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, House A to House D
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   House Choice   4 non-null      object 
 1   1stFlrSF       4 non-null      int64  
 2   2ndFlrSF       4 non-null      int64  
 3   BedroomAbvGr   4 non-null      int64  
 4   BsmtExposure   4 non-null      object 
 5   BsmtFinSF1     4 non-null      int64  
 6   BsmtFinType1   4 non-null      object 
 7   BsmtUnfSF      4 non-null      int64  
 8   EnclosedPorch  4 non-null      int64  
 9   GarageArea     4 non-null      int64  
 10  GarageFinish   4 non-null      object 
 11  GarageYrBlt    4 non-null      float64
 12  GrLivArea      4 non-null      int64  
 13  KitchenQual    4 non-null      object 
 14  LotArea        4 non-null      int64  
 15  LotFrontage    4 non-null      float64
 16  MasVnrArea     4 non-null      float64
 17  OpenPorchSF    4 non-null      int64  
 18  Overall

#### Conclusion

- This notebook allowed me to understand:

    - What datatypes are contained in the dataset
    - That some data types in the inherited and original sets needed to be aligned so that later analysis and predictions are more accurate
    - That some object columns can be encoded as numerical values if needed
    - Overall, what data is contained in the dataset and how I might proceed with analysis and cleaning.

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [42]:
import os
try:
    # Create the output folder; the error is printed if it already exists
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    print(e)

# Save both datasets; the inherited houses are also saved with their index
df_house_prices_records.to_csv("outputs/datasets/collection/original.csv", index=False)
df_inherited_houses.to_csv("outputs/datasets/collection/inherited.csv", index=False)
df_inherited_houses.to_csv("outputs/datasets/collection/inherited_with_index.csv", index=True)


[Errno 17] File exists: 'outputs/datasets/collection'
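The "File exists" message above is harmless, but it can be avoided by passing `exist_ok=True` to `os.makedirs`. A minimal sketch of a save helper built on that follows; `save_dataframe` is a hypothetical name and is not part of the original notebook.

```python
import os

import pandas as pd


def save_dataframe(df: pd.DataFrame, folder: str, filename: str, index: bool = False) -> str:
    """Write df to folder/filename as CSV, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=index)
    return path
```

For example, `save_dataframe(df_house_prices_records, "outputs/datasets/collection", "original.csv")` would write the same file as the cell above and can be re-run without printing an error.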
