In [None]:
#!git clone https://github.com/darioka/impactdeal-2022.git
#%cd impactdeal-2022
#!pip install -r requirements.txt
#!pip install .

# Data Exploration - EPC Rating


In this notebook you will explore a dataset with known EPC ratings from three major cities in the UK.
The data has been downloaded, subsampled and pseudonymized from https://epc.opendatacommunities.org/.

Remember that our final goal with this dataset is to train a machine learning model that predicts the energy rating of a dwelling from the other features in the dataset. For the moment, though, we just need to understand what information it contains and become familiar with it.

## Loading the dataset

The dataset is **known_epc_ratings.csv.gz**. The format is CSV, but it is compressed with gzip because it contains quite a few lines. No worries: pandas can detect the compression from the file extension and handle it.
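To see the compression handling in isolation, here is a minimal sketch on a tiny gzip file created on the fly. The file name `toy.csv.gz` and its columns are made up for the example and are not part of the real dataset.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny gzip-compressed CSV so the example is self-contained.
# File name and column names are illustrative only.
toy_path = Path(tempfile.mkdtemp()) / "toy.csv.gz"
with gzip.open(toy_path, "wt") as f:
    f.write("ID,RATING\n1,C\n2,D\n")

# compression="infer" is the default, so pandas picks gzip from the ".gz" suffix.
toy_df = pd.read_csv(toy_path)
print(toy_df.shape)
```

For the real dataset the same `pd.read_csv` call should work; just pass the path to `known_epc_ratings.csv.gz` in your copy of the repository.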

In [None]:
# Write your code here!


## First look at the dataset

Here is where you get to know the dataset. In this section, try to give answers to very simple questions like:

* How many rows/columns?
* What are their data types?
* What is the meaning of each column?
* Are data types consistent with column descriptions?
* What is the column with the EPC rating? How many classes are there?
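As a rough sketch of how these questions map onto pandas calls, the toy frame below stands in for the real dataset. The column names `FLOOR_AREA` and `RATING` are hypothetical placeholders, not the actual names in the data.

```python
import pandas as pd

# Toy stand-in for the real dataframe; column names are hypothetical.
toy_df = pd.DataFrame(
    {
        "FLOOR_AREA": [54.0, 120.5, 80.0, 67.2],
        "RATING": ["C", "D", "C", "E"],
    }
)

# How many rows/columns?
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# What are the data types?
print(toy_df.dtypes)

# How many classes does the (assumed) target column have, and how frequent is each?
class_counts = toy_df["RATING"].value_counts()
print(class_counts)
```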

In particular, make sure to check the column descriptions ([column_description.tsv](https://github.com/darioka/impactdeal-2022/blob/main/data/column_description.tsv)) and compare them with the actual data! This kind of sanity check is often very useful with real-world datasets.

Also, be flexible! The dataset is big (many columns, many more rows) and it's not very easy to "take a look at it"... why don't you [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) a hundred rows, export them to a standard [csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) and read them with your favorite Excel-like tool?
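A minimal sketch of that idea, assuming a toy dataframe in place of the real one. The output file name `preview_sample.csv` is arbitrary.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe.
toy_df = pd.DataFrame({"A": np.arange(1000), "B": np.arange(1000) % 7})

# random_state makes the sample reproducible between runs.
preview = toy_df.sample(n=100, random_state=0)

# index=False keeps the pandas row index out of the exported file.
out_path = Path(tempfile.mkdtemp()) / "preview_sample.csv"
preview.to_csv(out_path, index=False)
print(out_path)
```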

In [None]:
# Write your code here!


Finally, let's create lists with the names of the columns, grouped by type. They will be useful later, when we run different kinds of analyses depending on the data type.

For your convenience, the lists have already been created and the next cell imports them. Notice that, besides numerical and categorical columns, we make two more groups:

* **ids**: uniquely identify a single sample or a small set of them. No useful information can be inferred from them.

* **dates**: cannot be treated naively as numbers nor as categories.

In [None]:
from impactdeal.config.column_names import TARGET, IDS, DATES, NUMERICAL, CATEGORICAL

## Missing values

It would be great if missing values were always clearly indicated as `np.nan`! Unfortunately, data collection is a complex business and invalid data can be encoded in many more ways. For example, the people and the software that collected the data could have written `"NULL"` or `"missing"` to indicate an unknown value. This is one of the aspects of dealing with [data quality](https://en.wikipedia.org/wiki/Data_quality).

Use the following cell to find missing values in the columns `ENERGY_TARIFF`, `FLOOR_LEVEL` and `GLAZED_TYPE`.
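One way to spot such encodings is to list every distinct value of a column together with its frequency. The sketch below does this on a toy Series; the values in it are invented for illustration and are not taken from the real columns.

```python
import numpy as np
import pandas as pd

# Toy column mixing real values, a true NaN and two string placeholders.
toy = pd.Series(
    ["Single", "NULL", "dual", np.nan, "Single", "missing"],
    name="TOY_COLUMN",
)

# dropna=False keeps NaN in the tally, so every encoding shows up side by side.
counts = toy.value_counts(dropna=False)
print(counts)
```

Scanning the output by eye is usually enough to notice suspicious entries such as `"NULL"`.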

In [None]:
# Write your code here!


Now, if we want to count how many missing values are present in the dataset, we should replace the values you found with `np.nan`, so that all can be counted together.

Let's write a function that:
1. takes two inputs: a dataframe and the names of the columns to check,
2. replaces all values that we consider missing with `np.nan`,
3. returns the new dataframe.

Some hints:
- in (2) we should include all the values found before, plus an additional pattern that appears in other columns: all values starting with `SAP05` or `sap05` are missing values too,
- make the search case-insensitive, i.e. `hello` and `HELLO` count as the same value,
- you can `.copy()` the input dataframe to make sure you return a new, independent dataframe,
- to iterate faster during development, first try with just one column or a few rows.
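The building blocks behind these hints can be tried on a single toy Series before writing the full function. This is only a sketch: the set `MISSING_TOKENS` is an illustrative assumption, and you should fill it with the values you actually found in the data.

```python
import numpy as np
import pandas as pd

# Toy column; the values are invented for the example.
toy = pd.Series(["Single", "NULL", "null", "SAP05:Tariff", "sap05 old", np.nan])

# Illustrative placeholders, written in lower case on purpose.
MISSING_TOKENS = {"null", "missing"}

# Lower-casing first makes both checks case-insensitive.
lowered = toy.str.lower()
is_missing = lowered.isin(MISSING_TOKENS) | lowered.str.startswith("sap05", na=False)

# mask() returns a new Series, so the original one is left untouched.
cleaned = toy.mask(is_missing, np.nan)
print(cleaned)
```

Wrapping the same steps in a loop over the requested columns of a copied dataframe is one possible route to the function described above.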

Take your time, this is a hard task!

In [None]:
# def normalize_missing(df, columns):
#     ...
#     return new_df

# new_df = normalize_missing(df, columns=CATEGORICAL)

In [None]:
# this is a possible solution (but try first to solve the problem by yourself)

# from impactdeal.cleaning import normalize_missing

Now we are ready to check the completeness of the dataset. Let's print the relative frequency of missing values in each column, ordered from largest to smallest.
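On a toy dataframe, one possible way to compute this looks as follows. It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned dataframe.
toy_df = pd.DataFrame(
    {
        "A": [1.0, np.nan, np.nan, 4.0],
        "B": ["x", "y", None, "z"],
        "C": [1, 2, 3, 4],
    }
)

# isna() gives a boolean frame; mean() per column is the share of missing values.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```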

In [None]:
# Write your code here!


## Exploratory data analysis

In the following section, we will try to answer some questions with the help of our dataset:

1. How does the number of EPC ratings evolve over time? Try to visualize the time series.

2. Which city has more buildings in the lowest categories? Rank cities by the number of properties with EPC rating lower than E.

3. What is the distribution of construction dates?

4. How widespread is the use of low energy lighting?

5. Are there outliers in properties' total floor area? If yes, how should we treat them?

6. There are multiple variables with information about the dimension of the property and multiple variables with information about the lighting. What are their correlations? What could we do here?


Some hints:
* Some of the questions are deliberately ambiguous, because business questions usually are.
* For dates, take a look at `pandas` [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) and [datetime methods](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-series-date-functionality).
* `CONSTRUCTION_AGE_BAND` needs some cleaning. If you are in a hurry, use the function `impactdeal.cleaning.clean_age_band`.
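For the date hint, here is a small sketch of parsing strings into datetimes and counting records per month. The column name `TOY_DATE` and its values are invented; use the names listed in `DATES` for the real data.

```python
import pandas as pd

# Toy frame with one unparsable entry; column name and values are hypothetical.
toy = pd.DataFrame(
    {"TOY_DATE": ["2019-01-15", "2019-01-30", "2019-03-02", "not a date"]}
)

# errors="coerce" turns unparsable strings into NaT instead of raising.
toy["TOY_DATE"] = pd.to_datetime(toy["TOY_DATE"], errors="coerce")

# Group by calendar month; rows with NaT are dropped by groupby by default.
per_month = toy.groupby(toy["TOY_DATE"].dt.to_period("M")).size()
print(per_month)
```

Calling `per_month.plot()` afterwards is one simple way to visualize the resulting time series, provided matplotlib is installed.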

In [None]:
# Write your code here!
