# Homework: Data Exploration

## Name: <span style="color:blue"> *Azaan Patil* </span>

## Utils

In [17]:
from typing import List, Dict, Tuple
import os
import gc
import traceback
import warnings
from pdb import set_trace

# Default seed
seed = 0

In [18]:
class TodoCheckFailed(Exception):
    pass

def todo_check(asserts, mute=False, success_msg="", **kwargs):
    locals().update(kwargs)
    failed_err = "You passed {}/{} and FAILED the following code checks:\n{}"
    failed = ""
    n_failed = 0
    for check, (condi, err) in enumerate(asserts):
        exc_failed = False
        if isinstance(condi, str):
            try:
                passed = eval(condi)
            except Exception:
                exc_failed = True
                n_failed += 1
                failed += f"\nCheck [{check+1}]: Failed to execute check [{check+1}] due to the following error...\n{traceback.format_exc()}"
        elif isinstance(condi, bool):
            passed = condi
        else:
            raise ValueError("asserts must be a list of strings or bools")

        if not exc_failed and not passed:
            n_failed += 1
            failed += f"\nCheck [{check+1}]: Failed\n\tTip: {err}\n"

    if len(failed) != 0:
        passed = len(asserts) - n_failed
        err = failed_err.format(passed, len(asserts), failed)
        # err is already fully formatted; formatting it again can fail if a traceback contains braces
        raise TodoCheckFailed(err)
    if not mute: print(f"Your code PASSED all the code checks! {success_msg}")

## Instructions
In this assignment, you will begin to experience and practice some of the initial stages of the ML pipeline: problem formulation, data gathering, and data visualization/exploration.

Your job is to read through the assignment and fill in any code segments that are marked by `TODO` headers and comments. Some TODOs will have a `todo_check()` function which will give you a rough estimate of whether your code is functioning as expected. Others, like visualization TODOs, might not have these checks. Regardless, all the correct outputs are given below each code cell. It might be useful to copy the contents of certain TODO cells into a new cell so you can try to match the desired output with the output produced by your own code! For visualization TODOs, you simply have to have a plot that looks similar. You can change aspects such as color, titles, or x/y-axis labels if you so wish.

At any point, if you feel lost concerning how to program a specific TODO, take some time and visit the official documentation for each library and read about the methods/functions that you need to use.

## Submission

1. Save the notebook.
2. Enter your name in the appropriate markdown cell provided at the top of the notebook.
3. Select `Kernel` -> `Restart Kernel and Run All Cells`. This will restart the kernel and run all cells. Make sure everything runs without errors and double-check the outputs are as you desire!
4. Submit the `.ipynb` notebook on Canvas.


# Data Exploration

## Problem formulation

<center><img src="https://insideclimatenews.org/wp-content/uploads/2023/03/wildfire_thibaud-mortiz-afp-getty-scaled.jpg" alt="drawing" width="500"/></center>

For this assignment, you will be tasked with exploring data related to the prediction of forest fires, a major environmental concern that affects forest preservations around the world.

However, before getting into the data itself, it is important that we demonstrate how to properly formulate the goal of this assignment. Recall that when formulating a problem, it is useful to ask yourself the following three questions:

1. What is the problem?
2. Why does the problem need to be solved?
3. How would you solve the problem?

### What is the problem?
Well, as you might have already read, the problem this assignment aims to tackle is forest fire prediction. To be more specific, let's focus on predicting forest fires in particular parks, like national parks. By doing so, we will simplify the problem by focusing on a specific park. The big assumption here is going to be that it will be possible to predict forest fires from data. But what kind of data do we need? If we do a little bit of research by reading research papers and talking with experts on the subject, we'll find that natural forest fires actually tend to coincide with the weather (what a shock)!

For instance, natural fires tend to coincide with high temperatures, dry conditions, and wind, all of which tend to be commonly recorded features in meteorological datasets [\[1\]](https://mylandplan.org/content/facts-about-fire)! Thus, the goal will be to find a meteorological dataset for a specific national park.

### Why does the problem need to be solved?
Why are we even looking at this problem to begin with? Well, to some of you this might be obvious, but let's make it clear anyway.

One point of forest fire prediction is to help prevent fires by either predicting likely areas where they may occur or predicting how much a potential area could burn (identifying high risk areas). By making these predictions, we hope to allow for faster detection of fires by focusing on particular "high risk" areas and, in turn, becoming more successful at forest fire prevention. 

Now, this is just one way to frame why this problem needs to be solved. Feel free to think about your own personal take.


### How would you solve the problem?
To answer this question, think about how you would go about achieving or solving the problem (try to visualize what you would do for each of those 7 ML steps). Since you are likely new to ML, we will try to walk you through our thought process.

The first step you might think about is how would you gather data? Well, we tried to answer this question earlier. Recall, we talked about narrowing our focus down to national parks. The next step would be to contact parks and see if they do indeed have any open source data (particularly meteorological data). Further, we could search for related problems to see if they already have datasets, or we could simply try searching for a dataset related to forest fires. We will also have to make sure the dataset is labeled so that we can train our model to predict coordinates for which parts of the forest are likely to be vulnerable to fires or how much of an area will be burned.

Say we found some data, then what? What features will our data even have? We mentioned we would like some sort of meteorological data that maybe contains information about temperature, moisture, wind, etc. But are there other domain specific features out there? As none of us are forest fire experts, we probably do not have any clue if there are! Take note as this is a place of uncertainty that we have identified and would need to further investigate.

Okay, so let's also say we found some data with useful meteorological and domain specific features (whatever those may be). Now, what assumptions can we make about our data? Do we know what algorithm might actually be well suited for our problem? Notice that these questions are becoming harder and harder to answer as we attempt to plan further into the future. Once answers start becoming this vague, it might be a good idea to get your hands dirty by finding some data and exploring it using these questions as guides.


### Summary

In summary, our goal is to help prevent forest fires and make firefighting easier by identifying "high risk" areas, either by predicting likely areas where forest fires may occur or by predicting how much a potential area could burn when a forest fire occurs. We think we can achieve these predictions by using some sort of labeled data where the labels are either coordinates for which parts of the forest have burned before or how much of an area was burned.

# Data Gathering and Loading
In this assignment, you will be working with the House Listings Dataset. This is a real-world dataset containing information about residential properties listed for sale. The dataset includes various features such as property characteristics (number of bedrooms, bathrooms, square footage), location information, and the sale price.

This dataset provides an excellent opportunity to explore the relationships between property features and their market values, which is a fundamental problem in real estate analysis and machine learning.

## Library Imports
Before loading the data, you will need to import any Python libraries you think will be needed. For data handling (loading, storing, and manipulation), you can use either NumPy or Pandas. For plotting, you can use Matplotlib and Pandas (which has wrappers for Matplotlib which makes plotting super easy). 

Recall that Pandas is a high-level data manipulation and analysis tool built on top of NumPy. In particular, it can be easier to work with when cleaning, visualizing, and preprocessing data. Plus, Pandas tends to be easier on the eyes when visualizing raw datasets.

- [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/index.html).
- [NumPy docs](https://numpy.org/doc/stable/)
- [Matplotlib docs](https://matplotlib.org/stable/contents.html)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


### Downloading
The House Listings dataset file `House_listings_dataset.csv` should be located in the **SAME** directory/folder as this notebook. If you do not have it, please ensure it is placed in the correct location before proceeding.

To display the local path of this notebook and the directory it is currently in, run the below code.

In [None]:
print(f"The current path for your notebook is:\n {os.getcwd()}\n")
print(f"Your notebook is currently in the following directory:\n {os.path.basename(os.getcwd())}")

#### TODO 1 (5 points): Data Loading and Displaying
Load the `House_listings_dataset.csv` by using Pandas `read_csv()` function. The  `read_csv()` function works by taking in a path to a csv file (e.g., `/home/user/Downloads/House_listings_dataset.csv`). However, since you should have moved the data into the same directory as this notebook, you will only need to pass the name of the csv `House_listings_dataset.csv`. 

**WARNING: Do NOT pass a path specific to your computer! If you do, the TAs will not be able to run your notebook without making changes to it. This is why we are having you move the dataset into the same directory as this notebook.**

1. Load House Listings dataset by passing the name of the csv file "House_listings_dataset.csv" to the Pandas function `read_csv()` ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Store the output into the `df` variable.

2. Using the `df` DataFrame you just defined, call the `columns` class variable to get all the column/feature names. Store the output into the variable `feature_names`.
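
The two steps above can be tried on a tiny in-memory table before touching the real file. This is only a sketch: the CSV text, its values, and the names `toy_df` and `toy_feature_names` are made up for illustration and are not taken from `House_listings_dataset.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the values are invented.
csv_text = """Price,Bedroom,State
250000,3,TX
410000,4,CA
180000,2,TX
"""

# read_csv() accepts a path or any file-like object, so StringIO works here.
toy_df = pd.read_csv(io.StringIO(csv_text))

# `columns` is an attribute (no parentheses) holding the column labels.
toy_feature_names = toy_df.columns

print(toy_df.shape)
print(list(toy_feature_names))
```

With the real dataset, the only change should be passing the file name instead of the `StringIO` object.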

In [None]:
# TODO 1.1
df = pd.read_csv("House_listings_dataset.csv")


In [None]:
todo_check([
    ("os.path.exists('House_listings_dataset.csv')", f"The House_listings_dataset.csv is not detected in your local path! You need to move the 'House_listings_dataset.csv' file to the same location/directory as this notebook which is {os.getcwd()}"),
    ("len(df) > 0", 'The dataset was not loaded correctly!')
])

To display `df` you can simply print it using Python's `print()` function. However, this does not look very nice.

In [None]:
print(df)

Instead, you can pretty print a Pandas DataFrame using the `display()` function built into Jupyter Lab/Notebook.

In [None]:
display(df)

In [None]:
# TODO 1.2
feature_names = df.columns

print(f'The feature names are:\n{feature_names.values}')

In [None]:
todo_check([
    ("len(feature_names) > 0", "No feature names detected!")
])

# Data Visualization and Exploration

## Defining the features

Once you have loaded the data, it is time to understand what the provided features even mean. If you paid close attention, the web-page does provide a brief description of each feature. For your convenience, the descriptions are posted below.

    X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
    Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
    month - month of the year: 'jan' to 'dec'
    day - day of the week: 'mon' to 'sun'
    FFMC - FFMC index from the FWI system: 18.7 to 96.20
    DMC - DMC index from the FWI system: 1.1 to 291.3
    DC - DC index from the FWI system: 7.9 to 860.6
    ISI - ISI index from the FWI system: 0.0 to 56.10
    temp - temperature in Celsius degrees: 2.2 to 33.30
    RH - relative humidity in %: 15.0 to 100
    wind - wind speed in km/h: 0.40 to 9.40
    rain - outside rain in mm/m2 : 0.0 to 6.4
    area - the burned area of the forest (in ha): 0.00 to 1090.84
    
Some of the features are pretty straightforward to understand, like temp, RH (relative humidity), wind, rain, area, month, and day. But what about these other features like X, Y, FFMC, DMC, DC, and ISI, which are more technical?

X and Y are x-y coordinates that correspond with the below image that shows off Montesinho natural park.

<center><img src="https://www.researchgate.net/profile/Paulo-Cortez-4/publication/238767143/figure/fig1/AS:298804772392991@1448252017812/The-map-of-the-Montesinho-natural-park.png" width=500 height=500></center>

For the other variables (FFMC, DMC, DC, and ISI), you actually have to take a look at the [research paper](http://www.dsi.uminho.pt/~pcortez/fires.pdf) to get a better explanation. To save you some time, here is what it says.

> The forest Fire Weather Index (FWI) is the Canadian system for rating fire danger
and it includes six components (Figure 1) [24]: Fine Fuel Moisture Code (FFMC),
Duff Moisture Code (DMC), Drought Code (DC), Initial Spread Index (ISI), Buildup
Index (BUI) and FWI. The first three are related to fuel codes: the FFMC denotes the
moisture content surface litter and influences ignition and fire spread, while the DMC
and DC represent the moisture content of shallow and deep organic layers, which affect
fire intensity. The ISI is a score that correlates with fire velocity spread, while BUI
represents the amount of available fuel. The FWI index is an indicator of fire intensity
and it combines the two previous components. Although different scales are used for
each of the FWI elements, high values suggest more severe burning conditions. Also,
the fuel moisture codes require a memory (time lag) of past weather conditions: 16
hours for FFMC, 12 days for DMC and 52 days for DC.

In summary, it sounds like these remaining variables are domain specific variables related to a system for rating fire danger based on things like intensity, spread, and potential fuel.


## Exploring

Given your rough understanding of the features, it is time to start developing a better intuition for each feature by exploring their values.


#### TODO 2 (5 points): Data Info 
When exploring a new dataset, you should first figure out how many data samples and features are present and the data type of each feature.

1. Get the shape of the `df` DataFrame (which contains the number of data samples and features) by calling the `shape` class variable. Store the output into `ff_shape`.

2. Display a short summary of the data by calling the `info()` method for the `df` DataFrame.
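
As a rough illustration of what `shape` and `info()` report, here is a sketch on a small made-up table. The data and the names `toy_df`, `n_rows`, `n_cols`, and `non_null` are assumptions for this example only.

```python
import pandas as pd

# Hypothetical toy data; the None in "Price" becomes a missing value (NaN).
toy_df = pd.DataFrame({
    "Price": [250000.0, 410000.0, None],
    "Bedroom": [3, 4, 2],
    "State": ["TX", "CA", "TX"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy_df.shape
print(f"{n_rows} samples, {n_cols} features")

# info() prints each column's dtype and its non-null count.
toy_df.info()

# count() returns the same non-null counts as a Series, which is easier to inspect in code.
non_null = toy_df.count()
```

A column whose non-null count is smaller than the number of rows (here, the toy "Price" column) contains missing values.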

In [None]:
# TODO 2.1
ff_shape = df.shape

print(f'The house listings dataset shape is: {ff_shape}')

In [None]:
todo_check([
    ("ff_shape[0] > 0 and ff_shape[1] > 0", 'The shape does not appear to be correct')
])

The shape of the dataset indicates the number of data samples (rows) and features (columns) available in the house listings dataset.

In [None]:
# TODO 2.2

df.info()

Based on the output of the `info()` method, you should be able to see which features are integers or floats and which are not. You can tell this by looking at the "Dtype" column, which stands for data types ([dtype docs](https://numpy.org/doc/stable/reference/arrays.dtypes.html)). Some features may be classified as an "object". Pandas typically assigns the dtype "object" to columns that hold strings, so these features are most likely text (this can be confirmed by looking at their values in the displayed `df` provided above).

Lastly, it is worth noting that `info()` reports a "Non-Null Count" for each feature. When that count equals the total number of entries, the column has no missing values; when it is smaller, some values are missing. You will explore this manually later on.


#### TODO 3 (10 points): State and Bedroom Visualization

Take a closer look at the 'State' and 'Bedroom' features to see if you can gain any further insights about the data. For these TODOs use either `iloc`, `loc` or square brackets `[ ]` to slice/index the `df` DataFrame.

1. Index the 'State' column of `df` and call the `value_counts()` method ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)) on said column data to get the number of times each state appears in the data. Store the output into the variable `state_counts`.

2. Plot `state_counts` using the Matplotlib bar plot function `plt.bar()` ([docs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)) so that the states are displayed on the x-axis and the state counts are displayed on the y-axis.
    1. Hint: To easily access the state names, use the `.index` class variable of `state_counts`.

3. Index the 'Bedroom' column of `df` and call the `value_counts()` method on said column data to get the number of times each bedroom count appears in the data. Sort the result with `sort_index()` so it is ordered by number of bedrooms. Store the output into the variable `bedroom_counts`.

4. Plot `bedroom_counts` using the Matplotlib bar plot function `plt.bar()` ([docs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)) so that the number of bedrooms is displayed on the x-axis and the counts are displayed on the y-axis.
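
To see how the output of `value_counts()` maps onto the two arguments of `plt.bar()`, here is a small sketch on made-up categorical data. The Series contents and the names `toy_counts`, `labels`, and `heights` are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column.
toy_column = pd.Series(["TX", "CA", "TX", "NY", "TX", "CA"], name="State")

# value_counts() returns a Series sorted by count, largest first.
toy_counts = toy_column.value_counts()

labels = toy_counts.index     # category labels, usable as the x positions of plt.bar()
heights = toy_counts.values   # counts, usable as the bar heights of plt.bar()

print(toy_counts)
```

For numeric categories, chaining `.sort_index()` after `value_counts()` orders the bars by category value instead of by count.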

In [None]:
# TODO 3.1
state_counts = df["State"].value_counts()

display(state_counts)

In [None]:
todo_check([
    ("state_counts.shape[0] > 0", 'State values did not load correctly!')
])

In [None]:
# TODO 3.2
plt.figure(figsize=(12, 5))
plt.bar(state_counts.index, state_counts.values)
plt.xlabel("State")
plt.ylabel("Count")
plt.title("House Listings by State")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The distribution of house listings across states gives us insight into the geographic concentration of the data. Some states may have significantly more listings than others.

In [None]:
# TODO 3.3
bedroom_counts = df["Bedroom"].value_counts().sort_index()

display(bedroom_counts)

In [None]:
todo_check([
    ("bedroom_counts.shape[0] > 0", 'Bedroom values did not load correctly!')
])

In [None]:
# TODO 3.4
plt.figure(figsize=(10, 5))
plt.bar(bedroom_counts.index, bedroom_counts.values)
plt.xlabel("Number of Bedrooms")
plt.ylabel("Count")
plt.title("Distribution of Bedrooms in House Listings")
plt.xticks(bedroom_counts.index)
plt.show()

The distribution of bedrooms shows the variety of house types available in the dataset. Most houses tend to have a typical number of bedrooms, with varying frequencies.

#### TODO 4 (10 points): Null Check

The `info()` method used previously reports the non-null count of each column, but it is also possible to check for nulls or missing values manually. If there are any nulls or missing values, then these will need to be dealt with during the data cleaning phase by filling them or dropping the entire data sample.

1. Convert `df` into a boolean DataFrame (true indicates a value is missing) using the Pandas DataFrame `isnull()` method ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html)). Store the output into the `df_isnull` variable.

It's hard to see if every single element in the dataset is false. If only there was an easier way. Well, there is! You can use the NumPy function `any()` ([docs](https://numpy.org/doc/stable/reference/generated/numpy.any.html)) to check if every boolean in `df_isnull` is false. In other words, `any()` will only return true if at least one element in `df_isnull` is true.

2. Use the NumPy `any()` function to determine if `df_isnull` has any True values indicating there are null elements in the data. Store the output into the variable `hasnull`.
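
The null check can be sketched on a toy table as follows. The data and the names `toy_isnull`, `toy_hasnull`, and `nulls_per_column` are made up; converting to a NumPy array first is a choice made here so that `np.any()` clearly reduces over every element.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
toy_df = pd.DataFrame({
    "Price": [250000.0, np.nan, 180000.0],
    "Bedroom": [3, 4, 2],
})

# Boolean DataFrame: True wherever a value is missing.
toy_isnull = toy_df.isnull()

# Reduce over all elements: True if at least one value is missing.
toy_hasnull = np.any(toy_isnull.to_numpy())

# Summing booleans counts the True entries, giving missing values per column.
nulls_per_column = toy_isnull.sum()

print(toy_hasnull)
print(nulls_per_column)
```

The per-column counts are often more useful than the single boolean, since they show where cleaning is needed.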

In [None]:
# TODO 4.1
df_isnull = df.isnull()

display(df_isnull.head(10))

In [None]:
todo_check([
    ("df_isnull.shape[0] > 0", "df_isnull shape does not match expected"),
])

In [None]:
# TODO 4.2
hasnull = np.any(df_isnull)

print(f"Dataset contains null values: {hasnull}")

# Count null values per column
print("\nNull values per column:")
print(df.isnull().sum())

In [None]:
todo_check([
    ("isinstance(hasnull, (bool, np.bool_))", 'hasnull should be a boolean value!')
])

The house listings dataset contains some missing data. Several columns have null values that will need to be addressed during preprocessing through imputation or removal.


#### TODO 5 (15 points): Data Statistics

Next up, we need to check out the statistics of the numerical features in the data, such as the mean and standard deviation. Such scale information will prove to be vital in the future, as ML algorithms can have a hard time learning if the scales of all the features are different. For instance, an algorithm might end up valuing a feature more just because it has a larger scale, which is not what we want.


1. Use the Pandas DataFrame `describe()` method on `df` to get a statistical summary for each of our numerical features. Store the output into the variable `ff_describe`.

2. Using a box plot, visualize the statistics of the data by calling the Pandas `.boxplot()` method ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html)) for `df`.

3. Visualize each numerical feature's box plot separately by creating a for-loop which loops over all column names (i.e., features) in `df` and plots a single box plot for the current column by passing the column name to the `boxplot()` method (you used this method in the previous step).
    1. Hint: You will need to avoid plotting any non-numerical columns.
    2. Hint: Be sure to add `plt.show()` after plotting within the for-loop, or all the plots will be plotted on top of each other.
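
As a small sketch of what `describe()` returns and how non-numerical columns can be skipped, consider the made-up table below. The data and the names `toy_describe` and `numeric_cols` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column.
toy_df = pd.DataFrame({
    "Price": [200000.0, 300000.0, 400000.0],
    "Bedroom": [2, 3, 4],
    "State": ["TX", "CA", "TX"],
})

# For a mixed-type DataFrame, describe() summarizes only the numeric columns by default.
toy_describe = toy_df.describe()

# select_dtypes() gives the numeric column names, which is handy for looping over box plots.
numeric_cols = toy_df.select_dtypes(include="number").columns

print(toy_describe.loc[["mean", "std"]])
print(list(numeric_cols))
```

Even in this toy table the means differ by several orders of magnitude, which is the kind of scale mismatch the statistics are meant to reveal.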

In [None]:
# TODO 5.1
ff_describe = df.describe()

display(ff_describe)

In [None]:
todo_check([
    ("ff_describe.shape[0] > 0", 'ff_describe did not compute correctly'),
])

You should notice that nearly every feature has a different mean and STD. For example, the 'DC' feature has a mean of ~547 while the 'rain' feature has a mean of ~0.0216. This is an indication that the features have different scales!

Further, take note of how large the standard deviation or [STD](https://en.wikipedia.org/wiki/Standard_deviation) is for some of the features. If you have a large mean and comparably large STD, this is not necessarily a problem. However, if you have a small mean and a very large STD, this can cause trouble for learning as the values for said feature will vary drastically. For instance, the area has a small mean and a much larger STD. Keep this in mind as you'll investigate this more soon.

As a side note, the `describe()` method also provides the min, max, count, and the [percentiles](https://www.w3schools.com/python/python_ml_percentile.asp) values.

Finally, notice that the 'day' and 'month' features are not included. This is because their values are non-numerical (in this case they are string values) and computing the statistics is impossible.  

In [None]:
# TODO 5.2
df.boxplot(figsize=(15, 6))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

For those that do not know how to read a box plot, refer to this blog [*Box Plot Explained*](https://www.simplypsychology.org/boxplots.html). Any circle drawn beyond the whiskers indicates an outlier, i.e., a value above the upper whisker or below the lower whisker.

As you should see, this is not a very informative plot because all the features have different scales. So, features with small scales are all squashed towards 0, making them unreadable.
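
To make the outlier markers less mysterious, the sketch below computes the box plot fences by hand, assuming the common rule that whiskers reach at most 1.5 times the interquartile range (IQR) beyond the box, which is Matplotlib's default. The values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical values with one obviously extreme point.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# The box spans the first quartile (Q1) to the third quartile (Q3).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, lower_fence, upper_fence)
print(outliers)
```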

In [None]:
# TODO 5.3
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    df.boxplot(column=col, figsize=(10, 5))
    plt.title(f"Box Plot of {col}")
    plt.show()

#### TODO 6 (10 points): Price Visualization

Now it's time to take a closer look at the target/label, the 'Price' column. To do so, you need to examine the actual values and understand the price distribution.

1. Store the 'Price' data into a separate variable. Do so by indexing the 'Price' column using Pandas' `iloc`, `loc` or square brackets `[ ]`. Store the output into the variable `price_values`.
   
2. To more easily visualize the 'Price' values, create a histogram. First remove NaN values, then plot `price_values` using the Matplotlib `plt.hist()` function ([docs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)).
    1. Hint: The only required parameter of the function is `x`. The `bins` parameter is optional and will automatically be computed using the passed data.

3. Call the `describe()` method for `price_values` to get summary statistics of house prices. Store the output into the variable `price_stats`.

4. Examine the price distribution and note any patterns or outliers.


In [None]:
# TODO 6.1
price_values = df["Price"]

display(price_values.head(20))

In [None]:
todo_check([
    ("len(price_values) > 0", 'price_values did not load correctly!'),
])

Just from the quick printout of the values for `area_values` you should already be noticing lots of zero values.

In [None]:
# TODO 6.2
# Remove NaN values for visualization
price_clean = price_values.dropna()
plt.figure(figsize=(10, 5))
plt.hist(price_clean, bins=50)
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.title("Distribution of House Prices")
plt.show()

Notice, the majority of the area values are centered around 0 such that the data seems to be [skewed positively or to the right](https://www.investopedia.com/terms/s/skewness.asp).
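
Skewness can also be checked numerically instead of only by eye. The sketch below uses made-up values with a long right tail; for such data the mean is pulled above the median and the Pandas `skew()` method returns a positive number. The names `toy_values` and `toy_skew` are hypothetical.

```python
import pandas as pd

# Hypothetical values: mostly small, with one large value forming a right tail.
toy_values = pd.Series([1, 1, 1, 2, 2, 3, 50], dtype=float)

# A positive result indicates right (positive) skew, a negative result left skew.
toy_skew = toy_values.skew()

print(f"mean={toy_values.mean():.2f}, median={toy_values.median():.2f}, skew={toy_skew:.2f}")
```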

In [None]:
# TODO 6.3
price_stats = price_values.dropna().describe()

display(price_stats)

In [None]:
todo_check([
    ("len(price_stats) > 0", 'price_stats did not compute correctly!')
])

In [None]:
# TODO 6.4 - observe statistics
print("Price Statistics:")
print(f"Count: {price_stats['count']:.0f}")
print(f"Mean: ${price_stats['mean']:,.2f}")
print(f"Std: ${price_stats['std']:,.2f}")
print(f"Min: ${price_stats['min']:,.2f}")
print(f"Max: ${price_stats['max']:,.2f}")

In [None]:
todo_check([
    ("price_stats.shape[0] > 0", 'price_stats did not compute correctly!')
])

Now, looking at the exact numbers, you should be able to see that a majority (247 samples) of the area values are zero (no forest area was burned). However, further take note that the maximum value is 1090.84 and that there are many values reported in between 0 and 1090, yet they all have a count near 1. This spread of the data, with the majority of the data samples having an area of 0, is likely the cause of the data having a small mean but a large STD. As such, the 'area' data is positively or right skewed!

#### TODO 7 (10 points): Price Transformation

Data transformation can help normalize distributions and improve model performance. The [logarithmic transformation](https://machinelearningmastery.com/skewness-be-gone-transformative-tricks-for-data-scientists/) is a common technique for handling skewed data by taking the $\log$ of values to spread out the distribution and make it more [Gaussian-like](https://en.wikipedia.org/wiki/Normal_distribution).

1. Apply the log transform to `price_clean` (cleaned price values without NaNs) by passing it to NumPy's `log()` function. Add 1 before taking the log to handle any edge cases. Store the output into the variable `log_price_values`.

2. Create a histogram using the log-transformed values stored in `log_price_values` using the Matplotlib `plt.hist()` function ([docs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)).

3. Call the `describe()` method for `log_price_values` to get a statistical summary of the log-transformed price values. Store the output into the variable `log_price_describe`.
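
As a quick sanity check of the transform in step 1, the sketch below applies it to made-up values. It also shows NumPy's `log1p()`, which computes the same $\log(1 + x)$ in a single call; using it is optional and is not required by this assignment. The names `toy_prices`, `log_a`, and `log_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices spanning several orders of magnitude, including a zero.
toy_prices = pd.Series([0.0, 99.0, 9999.0, 999999.0])

# Adding 1 keeps a zero value finite: log(0 + 1) = 0, whereas log(0) is -inf.
log_a = np.log(toy_prices + 1)

# np.log1p(x) computes log(1 + x) directly.
log_b = np.log1p(toy_prices)

print(log_a)
```

The original values range from 0 to roughly one million, while the transformed values all fall between 0 and 14.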

In [None]:
# TODO 7.1
log_price_values = np.log(price_clean + 1)

display(log_price_values.head())

In [None]:
todo_check([
    ("len(log_price_values) > 0", 'log_price_values was not computed correctly.'),
])

In [None]:
# TODO 7.2
plt.figure(figsize=(10, 5))
plt.hist(log_price_values, bins=50)
plt.xlabel("Log Price")
plt.ylabel("Frequency")
plt.title("Distribution of Log-Transformed House Prices")
plt.show()

Now, you should be able to see that the value range (the x-axis) for the 'Price' values has been greatly condensed. This should allow you to more easily see the spread of the counts for each value bin.

In [None]:
# TODO 7.3
log_price_describe = log_price_values.describe()

display(log_price_describe)

In [None]:
todo_check([
    ("len(log_price_describe) > 0", 'log_price_describe was not computed correctly!')
])

Now, if you take a look at the mean and STD you should see that the STD value is more in line with the mean. Additionally, you should see that the max value has shrunk significantly.

#### TODO 8 (5 points): Feature Scatter Matrix
Next, it is time to investigate the relationships between the features in the data and the distributions of each feature. It is important to investigate relationships to see how each feature compares with every other feature. In doing so, you can begin to observe trends in the data. Distributions are important to investigate in order to see how each feature is distributed, just like you did for the target/label 'Price'.

To achieve both the comparison of features against one another and to observe the distributions of each feature, a handy tool called a scatter matrix can be employed. 

**WARNING: Plotting the scatter matrix can take a while!**

1. Use Pandas' `scatter_matrix()` function ([docs](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html)) to plot a scatter matrix. Pass the numeric columns of `df` as input. If the graph is too small, pass a larger `figsize` as an additional argument (e.g., `figsize=(15, 15)`).

In [None]:
# TODO 8.1
# Select numeric columns only for scatter matrix
numeric_df = df.select_dtypes(include=[np.number])

# Create a subset with key features for visualization
from pandas.plotting import scatter_matrix
scatter_matrix(numeric_df.iloc[:, :6], figsize=(15, 15), alpha=0.5)
plt.show()

If you read the docs, all `scatter_matrix` does is plot each feature against one another. If a feature is plotted against itself, then the distribution over all the feature's values is given. For instance, the bottom right corner is the histogram you plotted for 'area'. Also notice that the plot is symmetric and 'day' and 'month' are not included.

Now it can be hard to see any patterns as there is a lot of information being thrown at you. You might say there are some linear-like trends for the feature pairs  'DMC' and 'DC'. However, what might be more useful is looking at a reduced version of this graph, where you compare all the input features (everything but the target) against your target/label ('area' in this case).
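
When the scatter plots are too crowded to read, a numeric complement can help. The sketch below is not part of the assignment: it swaps in the Pearson correlation, computed with the Pandas `corr()` method, to summarize how linearly each feature relates to the target. The table, its values, and the name `corr_with_target` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two input features and a target column.
toy_df = pd.DataFrame({
    "SqFt": [800, 1200, 1500, 2000, 2600],
    "Bedroom": [1, 2, 3, 3, 4],
    "Price": [120000, 180000, 230000, 300000, 390000],
})

# corr() returns the pairwise Pearson correlation matrix of the numeric columns.
# Keep only the target's column and drop its trivial correlation with itself.
corr_with_target = toy_df.corr()["Price"].drop("Price")

print(corr_with_target)
```

Values near 1 or -1 suggest a strong linear trend, while values near 0 suggest none. Correlation only captures linear relationships, so the plots remain worth inspecting.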

#### TODO 9 (20 points): Area-Feature Scatter Matrix

First, you must drop the 'area' column from the data so that only the input features remain. The `log_area_values` will be used in-place of the original 'area' columns values.

1. Drop the 'area' column from the `forestfire_df` by calling the `drop()` ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)) method and passing the name column name 'area'. Store the output into the variable `X`.
    1. Hint: Be sure to pass the correct `axis` argument. `axis=0` corresponds to the rows or indexes, and `axis=1` corresponds to the columns. To know which one to use, think about whether 'area' is a row or column name.
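
To make the `axis` hint concrete, here is a small sketch on a toy DataFrame. The frame `toy_df` and its values are invented for illustration. It shows that `axis=1` removes a column, that `drop()` returns a new DataFrame rather than modifying the original, and that `axis=0` fails because no row is labelled 'area'.

```python
import pandas as pd

# Toy frame for illustration; the real data has more columns
toy_df = pd.DataFrame({"temp": [8.2, 18.0], "wind": [6.7, 0.9], "area": [0.0, 1.5]})

X_toy = toy_df.drop("area", axis=1)    # 'area' is a column label, so axis=1
X_same = toy_df.drop(columns="area")   # equivalent keyword form

print(list(X_toy.columns))        # ['temp', 'wind']
print("area" in toy_df.columns)   # True: drop() returned a copy, toy_df is unchanged

# axis=0 looks for a *row* labelled 'area' and fails, since the row labels are 0 and 1
try:
    toy_df.drop("area", axis=0)
except KeyError as err:
    print("KeyError:", err)
```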

Next, you will need to create a subplot. A subplot is a plot that contains many smaller plots. As there are 12 features (not including 'area') the subplot must have 12 plots. One way this can be achieved is by having 3 rows and 4 columns of plots. Each `ax` will represent one plot, therefore, once `ax` is flattened, it should have a length of 12.

2. Create a subplot by using Matplotlib's `plt.subplots()` function ([docs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html)). The subplot should have 3 rows and 4 columns. Store the output into the variables `fig` and `ax`.
    1. Hint: Look at the docs to see which argument is used to set the number of rows and columns.
    2. Hint: If the graph is too small, pass a larger `figsize` as an additional argument (e.g., `figsize=(15,15)`).
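
The relationship between the grid shape and the flattened length can be checked with a short sketch like the one below. It assumes nothing beyond Matplotlib itself and draws no data.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=3, ncols=4, figsize=(15, 15))
print(ax.shape)      # (3, 4): a 2-D NumPy array of Axes objects

ax = ax.flatten()    # 1-D array, ordered row by row
print(len(ax))       # 12
```

Flattening lets you address every plot with a single index instead of a (row, column) pair, which is what makes the loop that follows simple.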
   
**Inside the for-loop**

The loop iterates over the column names, with the current name stored in `column_name`. `enumerate` additionally yields a counter, stored in `idx`, that gives the position of the current column name. Since there are 12 column names and 12 plots, `idx` can be used to index `ax`.

3. Index the current `ax` using the variable `idx`. Store the output into `current_ax`. `current_ax` will represent the current plot you are plotting to.

4. Create a scatter plot by calling the `plot()` method for `current_ax`. Pass the `log_area_values` as the x-values, the current feature data as the y-values, and the format string `'.'` (to make the plot a scatter plot).

    1. Hint: You can get the current feature data by indexing `X` by the current column name `column_name`.
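
The loop steps above can be sketched on synthetic data as follows. The names `X_demo` and `log_target` are hypothetical stand-ins for `X` and `log_area_values`, and the `np.log1p` transform is only an assumption about how a log-scaled target might be built.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for illustration; X_demo and log_target are hypothetical names
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["f0", "f1", "f2", "f3"])
log_target = np.log1p(rng.exponential(size=50))  # assumed log(1 + value) transform

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))
ax = ax.flatten()

for idx, column_name in enumerate(X_demo.columns):
    current_ax = ax[idx]                                    # the idx-th panel
    current_ax.plot(log_target, X_demo[column_name], ".")   # '.' draws markers only, no line
    current_ax.set_title(column_name)

fig.tight_layout()
```

Without the `'.'` format string, `plot()` would connect consecutive points with line segments, which produces an unreadable scribble for unordered data.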

In [None]:
# TODO 9.1
X = df.drop("Price", axis=1)

# Select only numeric features
numeric_features = X.select_dtypes(include=[np.number]).columns

print(f"Features for plotting:\n{list(numeric_features)}")

# TODO 9.2
fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
ax = ax.flatten()

# TODO 9.3
for idx, column_name in enumerate(numeric_features[:9]):
    current_ax = ax[idx]

    # Keep only rows where both the feature and the price are present,
    # so the x and y values stay aligned row by row
    pair = df[[column_name, "Price"]].dropna()

    current_ax.plot(pair[column_name], pair["Price"], '.')
    current_ax.set_ylabel('Price')
    current_ax.set_xlabel(column_name)
    
fig.tight_layout()
plt.show()

Given the picture below of what different correlations can look like, do you see any? (The numbers above each panel indicate the correlation coefficient, which you will compute next.)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1280px-Correlation_examples2.svg.png)

Well, using this qualitative visualization, it does not look like the features have any obvious correlations with the 'area' target. However, do note that it is hard to observe any correlations because many of the variables take on discrete values, like the features 'X', 'Y', 'month', and 'day'.

#### TODO 10 (10 points): Correlation Matrix

To quantitatively examine correlations between features and the target, compute a correlation matrix. Recall that the correlation coefficient ranges from -1 to 1: close to 1 means strong positive correlation, close to -1 means strong negative correlation, and near zero means very weak or no correlation.
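
As a sketch of what the correlation coefficient measures, the example below computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with the value from `corr()`. The DataFrame `demo`, its column names, and the helper `pearson` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "x": x,
    "pos": 3 * x + rng.normal(scale=0.5, size=500),   # strong positive relationship
    "neg": -x + rng.normal(scale=2.0, size=500),      # weaker negative relationship
    "noise": rng.normal(size=500),                    # unrelated to x
})

def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(demo["x"], demo["pos"])
r_pandas = demo.corr().loc["x", "pos"]
print(np.isclose(r_manual, r_pandas))   # True

# Correlation of every column with 'x'
print(demo.corr()["x"].round(2))
```

In this sketch 'pos' should land close to 1, 'neg' should be clearly negative but farther from -1, and 'noise' should sit near 0.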

1. Call the `corr()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)) for the `df` DataFrame to compute the correlation matrix of all **numerical** feature pairs. Store the output into the variable `corr_matrix`.
    1. Pass the `numeric_only` argument to include only numerical features.
<br><br>
2. Index the 'Price' column of `corr_matrix` and store the output into the variable `price_corr`.

In [None]:
# TODO 10.1
corr_matrix = df.corr(numeric_only=True)

display(corr_matrix.style.background_gradient())

In [None]:
todo_check([
    ("np.isclose(corr_matrix.values.diagonal().sum(), 11.0)", 'corr_matrix values did not match!'),
    ("np.isclose(corr_matrix.iloc[0].sum(), 1.5719, rtol=0.01)", 'corr_matrix values did not match!'),
])

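Calling `.style.background_gradient()` color codes the cells of `corr_matrix`. With the default colormap (`'PuBu'`), each column is shaded from light to dark blue as its values increase, so the darkest cells mark the largest (most positive) correlations in that column. To tell positive and negative correlations apart by color, pass a diverging colormap such as `cmap='coolwarm'`, where red corresponds to high values, blue to low values, and gray to values in the middle of each column's range.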

In [None]:
# TODO 10.2
price_corr = corr_matrix["Price"]

display(price_corr)

In [None]:
todo_check([
    ("np.isclose(price_corr['Price'], 1.0, rtol=0.01)", 'price_corr values did not match!')
])

Examine the correlation coefficients to understand which property features have the strongest relationships with house prices. Features with stronger correlations (further from 0) are likely to be better predictors of price.
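
One way to read off the strongest relationships is to rank the coefficients by absolute value, as in the sketch below. The Series `price_corr_demo`, its feature names, and its values are hypothetical placeholders, not results from the assignment data.

```python
import pandas as pd

# Hypothetical correlation values for illustration; not results from the assignment data
price_corr_demo = pd.Series(
    {"Price": 1.0, "feat_a": 0.45, "feat_b": -0.30, "feat_c": 0.02}
)

ranked = (
    price_corr_demo.drop("Price")            # the target always correlates 1.0 with itself
    .sort_values(key=abs, ascending=False)   # rank by strength, ignoring sign
)
print(ranked.index.tolist())   # ['feat_a', 'feat_b', 'feat_c']
```

Keep in mind that the Pearson coefficient only captures linear association, so a feature with a coefficient near 0 can still be related to the target in a non-linear way.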