# **Data Collection**

## Objectives

- **Download the dataset from Kaggle and store it as unprocessed raw data.**

- **Review and explore the dataset, then save the inspected version in outputs/datasets/collection/.**

## Inputs

- **Kaggle JSON file - the authentication token**

## Outputs

- **Generate Dataset: outputs/datasets/collection/HousePrices.csv**

## Additional Comments

- **Data set can be found [here](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)**

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
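import os

current_dir = os.getcwd()
print(current_dir)  # a bare expression is displayed only when it is the last line of a cell
print(os.listdir())  # contents of the current directory
print(os.listdir('/workspaces'))  # contents of the workspace root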


We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir('/workspaces/Heritage-Housing/jupyter_notebooks')
print(os.getcwd())
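
The cell above sets the directory with an absolute path. As a minimal sketch of the `os.path.dirname()` approach described in the bullets, the parent folder can be derived from the current one instead. `parent_of` is a hypothetical helper that is not part of the original notebook, and the sketch assumes the kernel was started in a sub-folder of the project root.

```python
import os


def parent_of(path: str) -> str:
    """Return the parent directory of `path` (hypothetical helper)."""
    # normpath strips a trailing slash so dirname really moves one level up
    return os.path.dirname(os.path.normpath(path))


# Assumption: the kernel was started inside a sub-folder of the project root.
current_dir = os.getcwd()
print("Current directory:", current_dir)
print("Parent directory: ", parent_of(current_dir))

# To actually switch to the parent folder, uncomment the next line:
# os.chdir(parent_of(current_dir))
```

Later cells use paths relative to the working directory, so they would need adjusting if the working directory is changed this way.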

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Fetch data from Kaggle

Install the Kaggle package to enable data downloading.

In [None]:
%pip install kaggle==1.5.12

Make sure your kaggle.json file is in the current working directory so that the cell below can register the token for this session. Also add kaggle.json to the .gitignore file so the token is never committed.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
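
The `chmod` call above fails with a shell error if the token is missing. A pure-Python sketch of the same step, which checks for the file first, might look like this. `secure_kaggle_token` is a hypothetical helper name, and the sketch assumes a POSIX file system where permission bits apply.

```python
import os
import stat


def secure_kaggle_token(token_path: str = "kaggle.json") -> str:
    """Restrict the token file to owner read/write and return its mode as an octal string."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"{token_path} not found; add it to the working directory first"
        )
    # Owner read + owner write only, the equivalent of `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return oct(stat.S_IMODE(os.stat(token_path).st_mode))
```

Calling `secure_kaggle_token()` with no arguments looks for kaggle.json in the current working directory, matching the `KAGGLE_CONFIG_DIR` setting above.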

Download the Kaggle dataset and put the extracted files into the "inputs/datasets/raw" folder.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

We unzip the files and then we can remove both the zip and kaggle.json.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
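
Where the `unzip` command is not available, the extraction step can be sketched with the standard-library `zipfile` module. `extract_and_remove` is a hypothetical helper. It only handles the archives and leaves kaggle.json alone, so the token would still need to be deleted separately.

```python
import glob
import os
import zipfile


def extract_and_remove(folder: str) -> list:
    """Extract every .zip archive in `folder` into it, delete the archive, return extracted names."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)  # only reached if extraction did not raise
    return extracted
```

In this notebook the call would be `extract_and_remove(DestinationFolder)`.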

---

# Load and Inspect Kaggle data

The dataset comprises two CSV files, both of which will be inspected to understand their structure and contents.

- The variable df will be assigned to house_prices_records.csv.

- The variable dfa will be assigned to inherited_houses.csv.

### **House Prices CSV**

In [None]:
import pandas as pd
df = pd.read_csv("/workspaces/Heritage-Housing/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head(10)

The df.info() method in pandas provides a concise summary of a DataFrame, including the number of entries, column names, data types, non-null counts, and memory usage. This summary is useful for quickly assessing the structure and completeness of the dataset.

In [None]:
df.info()

### **Inherited Houses CSV**

In [None]:
import pandas as pd
dfa = pd.read_csv("/workspaces/Heritage-Housing/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
dfa.head(10)

Calling `dfa.info()` produces the same kind of summary for the inherited houses data: number of entries, column names, data types, non-null counts, and memory usage.

In [None]:
dfa.info()

---

### **house-metadata TXT**

The dataset provides detailed information on various housing features and characteristics relevant to residential property sales. Each row represents a single house, while the columns capture structural attributes, quality ratings, and sale information.

### **Data Exploration House Prices CSV**

We want to get familiar with the dataset: check variable types and distributions, the level of missing data, and what the variables mean in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

**Duplicates**

Check for fully duplicated rows to avoid redundant data. If any are found, they will be dropped using `df.drop_duplicates()`.

In [None]:
duplicate_rows = df.duplicated()
print(f"Total duplicate rows: {duplicate_rows.sum()}")
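
As a small illustration of how `duplicated()` and `drop_duplicates()` behave, the sketch below uses a made-up three-row frame. The values and the `toy` name are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data (made up for illustration): rows 0 and 2 are identical.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 8450],
    "SalePrice": [208500, 181500, 208500],
})

# duplicated() flags repeats only, not the first occurrence of a row
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Total duplicate rows: {n_duplicates}")
print(deduplicated)
```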

**Missing Data**

Identify variables with a large share of missing data, here more than 30% of rows. Variables with many missing values can degrade the quality of the analysis or model.

In [None]:
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
print("Missing data (columns with more than 30% missing):")
print(missing_percent[missing_percent > 30].apply(lambda x: f"{x:.2f}%"))
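
The same check can be wrapped in a reusable function. This is a sketch only: `missing_share` is a hypothetical helper and the toy frame is invented for illustration. The 30% default mirrors the cutoff used in the cell above.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame, threshold: float = 30.0) -> pd.Series:
    """Return the percentage of missing values for columns above `threshold` percent."""
    # isnull().mean() is the fraction of missing rows per column
    percent = frame.isnull().mean() * 100
    return percent[percent > threshold].sort_values(ascending=False)


toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],  # 50% missing
    "B": [1.0, 2.0, 3.0, np.nan],     # 25% missing
    "C": [1.0, 2.0, 3.0, 4.0],        # complete
})
print(missing_share(toy))
```

Only column `A` exceeds the 30% cutoff in this toy frame, so it is the only one reported.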

Check how strongly these variables correlate with `SalePrice`, and in which direction.

- An absolute correlation above ~0.3 may indicate predictive value.
- A value close to 0 suggests the variable may be non-informative.

In [None]:
print(df[['WoodDeckSF', 'EnclosedPorch', 'SalePrice']].corr())
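
The rule of thumb above can be sketched as a small filter on the absolute Pearson correlation, which is the default method of `DataFrame.corr`. `weak_features` is a hypothetical helper, the toy data is invented, and the 0.3 default is taken from the guideline in the text rather than from any fixed standard. The `numeric_only` argument assumes pandas 1.5 or later.

```python
import pandas as pd


def weak_features(frame: pd.DataFrame, target: str, threshold: float = 0.3) -> list:
    """List numeric columns whose absolute Pearson correlation with `target` is below `threshold`."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()


toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],    # perfectly correlated with y
    "z": [1, -1, 1, -1, 1],  # uncorrelated with y
    "y": [2, 4, 6, 8, 10],
})
print(weak_features(toy, target="y"))
```

Pearson correlation only captures linear relationships, so a weak value here does not rule out a non-linear dependence.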

The correlation analysis indicates that `WoodDeckSF` and `EnclosedPorch` have very weak or negligible correlation with the target variable `SalePrice`. Variations in these features do not meaningfully explain changes in sale price, so they are likely to have limited predictive power in this analysis.

We will discard `WoodDeckSF` and `EnclosedPorch` from our data.

In [None]:
from feature_engine.selection import DropFeatures
features_discard = ['WoodDeckSF', 'EnclosedPorch']
dropper = DropFeatures(features_to_drop=features_discard)
dropper.fit(df)
df_drop = dropper.transform(df)

print("Dropped features:", features_discard)
print("Original shape:", df.shape)
print("New shape:", df_drop.shape)


---

# Push files to Repo

In [None]:

import os

# Assumption: the working directory is jupyter_notebooks, so ../outputs is the project-level folder
output_dir = "../outputs/datasets/collection"
os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet

df_drop.to_csv(f"{output_dir}/HousePrices.csv", index=False)