<p align="center">
  <a>
    <img src="./figures/logo-hi-paris-retina.png" alt="Logo" width="280" height="180">
  </a>

  <h3 align="center">Data Science Bootcamp</h3>
</p>

Authors: Yann Berthelot, Florian Bettini, Laure-Amélie Colin

Data cleaning
======

#### Why is it a problem for our analyst to use the dataset as is, without cleaning?

#### What is data cleaning?
Data cleaning normalizes the data so that it is easier to manipulate during the analysis.
Several operations are possible: modifying or deleting data that is incorrect, incomplete, irrelevant, corrupted, duplicated, or badly formatted.
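
As an illustration of these operations, here is a minimal sketch on a toy table. The values are made up for the example and are not taken from the fires dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Toy data (hypothetical values) with a duplicated row, a missing value,
# inconsistent formatting in STATE, and numbers stored as text in FIRE_SIZE.
raw = pd.DataFrame({
    "FOD_ID": [1, 2, 2, 3],
    "STATE": ["ca", "TX ", "TX ", None],
    "FIRE_SIZE": ["0.5", "12", "12", "3.2"],
})

clean = (
    raw
    .drop_duplicates()                # delete exact duplicate rows
    .dropna(subset=["STATE"])         # delete rows missing a required field
    .assign(
        # modify badly formatted values: trim spaces, use upper case
        STATE=lambda d: d["STATE"].str.strip().str.upper(),
        # modify the type: text -> numbers
        FIRE_SIZE=lambda d: pd.to_numeric(d["FIRE_SIZE"]),
    )
    .reset_index(drop=True)
)
print(clean)
```

Whether a row with a missing value should be dropped or repaired depends on the analysis; dropping it here is only one possible choice.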

### Why is this important?
- Cleaning corrects duplicate or misfiled data.
- Cleaning corrects errors from manual data entry.
- Incorrect data can distort the results and reduce their accuracy.

Objective of this lab
======

Clean the fires dataset to obtain a quality dataset that is free of errors, duplicates, and irrelevant values, and ready to be analyzed.

### Data Path

`./data/` is the path that contains all data.

<p align="center">
  <a>
    <img src="./figures/UpToYou.png" alt="Logo" width="200" height="280">
  </a>
</p>

#### Libraries

In [6]:
import pandas as pd
import numpy as np
from datetime import datetime
from utils import check_duplicates

#### Input variables

In [7]:
checks = {True:"OK", False: "NOK"}

<h1 align="center">Preparation of the fires dataset</h1>

The raw dataset is located in `./data/1_raw/fires/fires.csv`.

This table includes wildfire data for the period of 2011-2015 compiled from US federal, state, and local reporting systems.

The columns are:
* `FOD_ID` = Global unique identifier.
* `FIRE_SIZE` = Estimate of acres within the final perimeter of the fire.
* `FIRE_SIZE_CLASS` = Code for fire size based on the number of acres within the final fire perimeter (A=greater than 0 but less than or equal to 0.25 acres, B=0.26-9.9 acres, C=10.0-99.9 acres, D=100-299 acres, E=300 to 999 acres, F=1000 to 4999 acres, and G=5000+ acres).
* `FIRE_NAME` = Name of the incident, from the fire report (primary) or ICS-209 report (secondary).
* `FIRE_YEAR` = Calendar year in which the fire was discovered or confirmed to exist.
* `DISCOVERY_DATE` = Date on which the fire was discovered or confirmed to exist (Julian format).
* `DISCOVERY_TIME` = Time of day that the fire was discovered or confirmed to exist.
* `CONT_DATE` = Date on which the fire was declared contained or otherwise controlled (Julian format).
* `CONT_TIME` = Time of day that the fire was declared contained or otherwise controlled (hhmm where hh=hour, mm=minutes).
* `LATITUDE` = Latitude (NAD83) for point location of the fire (decimal degrees).
* `LONGITUDE` = Longitude (NAD83) for point location of the fire (decimal degrees).
* `STATE` = Two-letter alphabetic code for the state in which the fire burned (or originated), based on the nominal designation in the fire report.
* `CAUSE_CODE` = Code for the cause of the fire.
* `CAUSE_DESCR` = Description of the cause of the fire.
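
The size classes listed above can be used as a consistency check between `FIRE_SIZE` and `FIRE_SIZE_CLASS`. The sketch below is one possible way to do it, not part of the original lab: `size_class` is a hypothetical helper, the sample values are made up, and sizes that fall in the gaps between the published ranges (for example 9.95 acres) are assumed to belong to the lower class.

```python
from bisect import bisect_right

import pandas as pd


def size_class(acres: float) -> str:
    """Map a fire size in acres to its class letter (A to G).

    Hypothetical helper: it does not validate zero or negative sizes.
    """
    if acres <= 0.25:
        return "A"
    # Lower bounds of classes C, D, E, F, G; anything below 10 acres is B.
    lower_bounds = [10, 100, 300, 1000, 5000]
    return "BCDEFG"[bisect_right(lower_bounds, acres)]


# Made-up sample; the class of the fourth row is deliberately wrong.
sample = pd.DataFrame({
    "FIRE_SIZE": [0.1, 0.25, 5.0, 10.0, 250.0, 6000.0],
    "FIRE_SIZE_CLASS": ["A", "A", "B", "B", "D", "G"],
})

sample["EXPECTED_CLASS"] = sample["FIRE_SIZE"].map(size_class)
mismatches = sample[sample["FIRE_SIZE_CLASS"] != sample["EXPECTED_CLASS"]]
print(mismatches)
```

Rows that appear in `mismatches` are worth inspecting by hand before deciding which of the two columns to trust.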

In [8]:
def convert_datetime(x: pd.Series, opt: str) -> datetime:
    '''
    Build a datetime for one row of a DataFrame from its date and time columns.

    Input:
    x (pd.Series): row of the input DataFrame
    opt (str): prefix of the column names. Either "DISCOVERY" or "CONT".

    Output:
    (datetime): combined datetime, or None if the date or the time is missing
    '''
    if pd.notnull(x[opt + "_TIME"]) and pd.notnull(x[opt + "_DATE"]):
        # times are stored as numbers in hhmm form, e.g. 930 -> "0930"
        t = str(int(x[opt + "_TIME"])).zfill(4)
        d = x[opt + "_DATE"].strftime("%Y-%m-%d")
        return datetime.strptime(f"{d} {t[:2]}:{t[2:]}", "%Y-%m-%d %H:%M")
    return None

def cleaning_fires(fires: pd.DataFrame, cols:list) -> pd.DataFrame:
    '''
    Clean the input dataframe, by converting dates and selecting columns

    Input:
    fires (pd.DataFrame): input DataFrame
    cols (list): list of columns to keep

    Output:
    (pd.DataFrame): cleaned DataFrame
    '''
    # select useful columns; copy so the assignments below do not act on a view
    fires = fires.loc[:, cols].copy()

    # convert Julian dates (in days) to datetimes, using the Julian date of the Unix epoch as offset
    for c in ["DISCOVERY_DATE", "CONT_DATE"]:
        fires[c] = pd.to_datetime(fires[c] - pd.Timestamp(0).to_julian_date(), unit='D')

    # combine each date with its hhmm time into a full datetime, if both are available
    for option in ["DISCOVERY", "CONT"]:
        fires[option + "_TIME"] = fires.apply(lambda x: convert_datetime(x, option), axis=1)

    # time between discovery and containment, in seconds
    fires["DURATION"] = (fires["CONT_TIME"] - fires["DISCOVERY_TIME"]).dt.total_seconds()

    return fires
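
To make the date handling in the cell above easier to follow, here is a small worked example of the same two steps on hand-picked values. The Julian dates and the `hhmm` value are chosen for illustration and are not read from the dataset.

```python
import pandas as pd

# Julian date of the Unix epoch (1970-01-01 00:00), used as the offset.
epoch_jd = pd.Timestamp(0).to_julian_date()
print(epoch_jd)  # 2440587.5

# Two illustrative Julian dates, expressed in days.
julian = pd.Series([2455562.5, 2455593.5])

# Days since the epoch -> calendar datetimes.
dates = pd.to_datetime(julian - epoch_jd, unit="D")
print(dates.dt.strftime("%Y-%m-%d").tolist())  # ['2011-01-01', '2011-02-01']

# A time stored as the number 930 means 09:30 once padded to four digits.
hhmm = 930.0
t = str(int(hhmm)).zfill(4)
stamp = dates.iloc[0] + pd.Timedelta(hours=int(t[:2]), minutes=int(t[2:]))
print(stamp)  # 2011-01-01 09:30:00
```

Subtracting two such timestamps gives a `Timedelta`, which is what the `DURATION` column converts to seconds.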

In [9]:
cols = [
    'FOD_ID', 'FIRE_YEAR', 'DISCOVERY_DATE', 'DISCOVERY_TIME',
    'CONT_DATE', 'CONT_TIME', 'FIRE_SIZE', 'FIRE_SIZE_CLASS',
    'LATITUDE', 'LONGITUDE', 'STATE', 'CAUSE_CODE', 'CAUSE_DESCR'
]

# cleaning and feature engineering
fires = pd.read_csv("./data/1_raw/fires/fires.csv")
fires = cleaning_fires(fires, cols)

# check duplicate values
c = checks.get(check_duplicates(fires, ["FOD_ID"]), "NOK")
print(f"Check duplicates: {c}")

# save to csv
fires.to_csv("./data/2_clean/fires.csv", index=False)

Check duplicates: OK
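
The `check_duplicates` helper is imported from `utils` and its code is not shown in this notebook. Judging only from how it is called, it probably returns `True` when the key columns contain no duplicates. The function below is a hypothetical stand-in written under that assumption, not the actual implementation.

```python
import pandas as pd


def check_duplicates_sketch(df: pd.DataFrame, keys: list) -> bool:
    """Return True when no two rows share the same values in `keys`.

    Hypothetical stand-in for utils.check_duplicates (assumed behavior).
    """
    return not df.duplicated(subset=keys).any()


# Made-up examples: unique identifiers, then a repeated identifier.
unique_ids = pd.DataFrame({"FOD_ID": [1, 2, 3]})
repeated_ids = pd.DataFrame({"FOD_ID": [1, 2, 2]})

print(check_duplicates_sketch(unique_ids, ["FOD_ID"]))    # True
print(check_duplicates_sketch(repeated_ids, ["FOD_ID"]))  # False
```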


# Takeaways
- Edit variable types / formats
- Identify duplicates
- Delete columns with many missing values
- Use common sense and keep only relevant variables
- Observe the distribution of values of a variable
- Use visual representations to understand how a variable behaves

### Pitfalls to avoid
- Automatically deleting duplicates: first understand why the duplicate appeared.
- Automatically deleting all rows with missing values, which loses information. Approximating some values lets you keep the information needed to meet an objective.
- Automatically deleting outliers: first understand where they come from. Are they errors, or do they represent genuine extreme cases?
- Retaining variables that could compromise the ethics of a project (skin color, address...).
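
One way to avoid the first three pitfalls is to inspect suspicious rows before deleting anything. The sketch below uses made-up values and the common 1.5 × IQR rule for outliers; that rule is an assumption chosen for illustration, not something prescribed by this lab.

```python
import pandas as pd

# Made-up data: a repeated identifier, a missing size, and one very large fire.
df = pd.DataFrame({
    "FOD_ID": [1, 2, 2, 3, 4, 5],
    "FIRE_SIZE": [0.5, 1.0, 1.0, 2.0, None, 5000.0],
})

# 1. Show every row involved in a duplicate instead of dropping it blindly.
dupes = df[df.duplicated(subset=["FOD_ID"], keep=False)]

# 2. Measure how much is missing per column before deciding what to do.
missing_share = df.isna().mean()

# 3. Flag, rather than delete, values outside 1.5 * IQR.
q1, q3 = df["FIRE_SIZE"].quantile([0.25, 0.75])
iqr = q3 - q1
df["IS_OUTLIER"] = (df["FIRE_SIZE"] < q1 - 1.5 * iqr) | (df["FIRE_SIZE"] > q3 + 1.5 * iqr)

print(dupes)
print(missing_share)
print(df[df["IS_OUTLIER"]])
```

A flagged row such as the 5000-acre fire may well be a genuine extreme case, so the flag is a prompt for review rather than a reason to delete.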

### Go further:
- [The Ultimate Guide to Data Cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)
- [Learn Data Cleaning Tutorials | Kaggle](https://www.kaggle.com/learn/data-cleaning)
