<p align="center">
  <a>
    <img src="./figures/logo-hi-paris-retina.png" alt="Logo" width="280" height="180">
  </a>

  <h3 align="center">Data Science Bootcamp</h3>
</p>

Authors: Yann Berthelot, Florian Bettini, Laure-Amélie Colin

Data cleaning
======

#### How can it be problematic for our analyst to use the dataset as is, without cleaning? 

#### What is data cleaning?
The purpose of this step is to normalize the data so that it is easier to manipulate during the analysis.
Several operations are possible: modifying or deleting data that are incorrect, incomplete, irrelevant, corrupted, duplicated or badly formatted.

### Why is this important? 
- Correct duplicate or misfiled data. 
- Correct errors in manual data entry. 
- Wrong data can affect the results and their accuracy.

In [1]:
# first, we need to import pandas
import pandas as pd

Context and files for this Lab
======

Multiple datasets will be used to compute features for the final model. All raw datasets are located in `./data/1_raw/`

- **Fires** --> **this dataset will be cleaned in this notebook.**
    - location: `./data/1_raw/fires/fires_train.csv`
    - This table includes wildfire data for the period of 2011-2014 compiled from US federal, state, and local reporting systems.
    - Columns are :
        * `FOD_ID` = Global unique identifier.
        * `FIRE_SIZE` = Estimate of acres within the final perimeter of the fire.
        * `FIRE_SIZE_CLASS` = Code for fire size based on the number of acres within the final fire perimeter (A=greater than 0 but less than or equal to 0.25 acres, B=0.26-9.9 acres, C=10.0-99.9 acres, D=100-299 acres, E=300 to 999 acres, F=1000 to 4999 acres, and G=5000+ acres).
        * `FIRE_NAME` = Name of the incident, from the fire report (primary) or ICS-209 report (secondary).
        * `FIRE_YEAR` = Calendar year in which the fire was discovered or confirmed to exist.
        * `DISCOVERY_DATE` = Date on which the fire was discovered or confirmed to exist. `Warning`: the date is stored as a Julian day number (e.g. `2455641.5`), not as a calendar date.
        * `DISCOVERY_TIME` = Time of day that the fire was discovered or confirmed to exist. `Warning`: the format is HHMM (24-hour clock). Ex: 5:30PM is "1730".
        * `CONT_DATE` = Date on which the fire was declared contained or otherwise controlled. `Warning`: the date is stored as a Julian day number (e.g. `2455641.5`), not as a calendar date.
        * `CONT_TIME` = Time of day that the fire was declared contained or otherwise controlled. `Warning`: the format is HHMM (24-hour clock). Ex: 5:30PM is "1730".
        * `LATITUDE` = Latitude (NAD83) for point location of the fire (decimal degrees).
        * `LONGITUDE` = Longitude (NAD83) for point location of the fire (decimal degrees).
        * `STATE` = Two-letter alphabetic code for the state in which the fire burned (or originated), based on the nominal designation in the fire report.
        * `CAUSE_CODE` = Code for the cause of the fire.
        * `CAUSE_DESCR` = Description of the cause of the fire.
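
To illustrate the two `Warning` notes above, here is a minimal sketch that turns a Julian day number and an HHMM time into a single timestamp. It assumes values shaped like `2455641.5` and `1212.0`. The helper name `to_timestamp` and the one-row toy DataFrame are invented for this example and are not part of the lab's code.

```python
import pandas as pd

def to_timestamp(julian_days: pd.Series, hhmm: pd.Series) -> pd.Series:
    """Combine Julian day numbers and HHMM times into timestamps (hypothetical helper)."""
    # origin="julian" interprets the numbers as Julian day numbers; unit must be "D"
    dates = pd.to_datetime(julian_days, unit="D", origin="julian")
    # 1730 -> 17 hours and 30 minutes; a missing time yields a missing timestamp
    hours = pd.to_timedelta(hhmm // 100, unit="h")
    minutes = pd.to_timedelta(hhmm % 100, unit="m")
    return dates + hours + minutes

# toy data, made up for the example
toy = pd.DataFrame({"DISCOVERY_DATE": [2455641.5], "DISCOVERY_TIME": [1212.0]})
discovered_at = to_timestamp(toy["DISCOVERY_DATE"], toy["DISCOVERY_TIME"])
print(discovered_at)
```

Whether you want a combined timestamp or separate date and time columns depends on the features you plan to build later; this is only one possible approach.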

In [2]:
# show the fires dataset
fires = pd.read_csv("./data/1_raw/fires/fires.csv")
print('print the first 5 rows of the dataframe')
display(fires.head())

print the first 5 rows of the dataframe


Unnamed: 0,FOD_ID,FIRE_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_TIME,CONT_DATE,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,CAUSE_DESCR,CAUSE_CODE
0,20020059,VFD BEAR CREEK #1,2011,2455641.5,1212.0,2455641.5,1618.0,0.1,A,60.246389,-149.349444,AK,accidental,1
1,20020060,CPR LNDG ORGANIC DMP,2011,2455666.5,1812.0,2455669.5,1156.0,0.1,A,60.475833,-149.7525,AK,accidental,1
2,20020061,TOKLAT WAY DEBRIS,2011,2455692.5,1250.0,2455692.5,1331.0,0.1,A,60.514444,-149.4675,AK,accidental,1
3,20020062,LAWING DRIVE,2011,2455694.5,1220.0,2455694.5,1250.0,0.1,A,60.399722,-149.360833,AK,accidental,1
4,20020063,RUSSIAN RIVER TRAIL,2011,2455759.5,1020.0,2455759.5,1230.0,0.1,A,60.4675,-149.973056,AK,accidental,1


- **External data** --> **Multiple datasets that will be cleaned and merged in another notebook:** `2_external_data_preparation.ipynb`
    - Temperature and precipitation
        - The `./data/1_raw/cities/` folder contains temperature and precipitation values for 210 US cities.
        - It contains a file `./data/1_raw/cities/city_info.csv` that maps each city name to a code (example: "USW00094728" for "New York").
        - A README file `./data/1_raw/cities/README.txt` gives additional information on these files.
        - All other files are named with a city code (example: `./data/1_raw/cities/USW00094728.csv` for "New York") and contain historical temperature and precipitation values between **1894** and **2021**, if available.
    - Demographics
        - The `./data/1_raw/demographics/us-cities-demographics.csv` file contains demographic data (age, total population, etc.) for US cities.

In [3]:
# show Temperature and precipitation

# city infos
city_infos = pd.read_csv("./data/1_raw/cities/city_info.csv")
print('City infos: print the first 5 rows of the dataframe')
display(city_infos.head())

# one file with temperature and precipitation
city_example = pd.read_csv("./data/1_raw/cities/USW00094728.csv")
print('One example of a city file (USW00094728): print the first 5 rows of the dataframe')
display(city_example.head())

City infos: print the first 5 rows of the dataframe


Unnamed: 0.1,Unnamed: 0,Name,ID,Lat,Lon,Stn.Name,Stn.stDate,Stn.edDate,Unnamed: 8
0,1,Lander,USW00024021,42.8153,-108.7261,LANDER WBO,1892-01-01,5/28/1946,False
1,2,Lander,USW00024021,42.8153,-108.7261,LANDER HUNT FIELD,5/29/1946,12/31/2021,False
2,3,Cheyenne,USW00024018,41.1519,-104.8061,CHEYENNE WBO,1871-01-01,8/31/1935,False
3,4,Cheyenne,USW00024018,41.1519,-104.8061,CHEYENNE MUNICIPAL ARPT,9/1/1935,12/31/2021,False
4,5,Wausau,USW00014897,44.9258,-89.6256,Wausau Record Herald,1896-01-01,12/31/1941,False


One example of a city file (USW00094728): print the first 5 rows of the dataframe


Unnamed: 0.1,Unnamed: 0,Date,tmax,tmin,prcp
0,1,1869-01-01,29.0,19.0,0.75
1,2,1869-01-02,27.0,21.0,0.03
2,3,1869-01-03,35.0,27.0,0.0
3,4,1869-01-04,37.0,34.0,0.18
4,5,1869-01-05,43.0,37.0,0.05


- **fires_days_train** --> **This dataset is already cleaned and will not be used during this first lab**
    - location: `./data/1_raw/fires/fires_days_train.csv`
    - This table indicates whether at least one fire of size class B or bigger was reported on a given date (between 2011 and 2014) in a given state (all US states). State and date combinations are also provided for 2015, the year for which predictions will be made at the end. As a consequence, the target value `FIRE` is null in 2015.
    - It contains 3 columns:
        * `DISCOVERY_DATE` = Date (format: YYYY-mm-dd)
        * `STATE` = Two-letter abbreviation for the US state
        * `FIRE` = binary target value
            - 1 if a fire of class size B or bigger is reported in the given state, at the given date (fires of class A are not considered)
            - 0 otherwise
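
The definition of `FIRE` above can be sketched on toy data as follows. This is only an illustration of the rule "at least one fire of class B or bigger", not the code that produced `fires_days_train.csv`. The toy rows and the intermediate column `IS_B_OR_BIGGER` are invented for the example. Note that the real table also lists state/date combinations with no reported fire at all, which this sketch does not generate.

```python
import pandas as pd

# toy fires table, made up for the example
toy_fires = pd.DataFrame({
    "DISCOVERY_DATE": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-01", "2011-01-02"]
    ),
    "STATE": ["AK", "MO", "MO", "AK"],
    "FIRE_SIZE_CLASS": ["A", "A", "C", "B"],
})

# class A fires are not considered, every other class counts
toy_fires["IS_B_OR_BIGGER"] = toy_fires["FIRE_SIZE_CLASS"] != "A"

# one row per (date, state): 1 if any fire of class B or bigger, 0 otherwise
target = (
    toy_fires.groupby(["DISCOVERY_DATE", "STATE"])["IS_B_OR_BIGGER"]
    .any()
    .astype(int)
    .rename("FIRE")
    .reset_index()
)
print(target)
```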

In [4]:
# show fires_days_train
fires_days = pd.read_csv("./data/1_raw/fires/fires_days_train.csv", parse_dates=["DISCOVERY_DATE"])
print('print the first 5 rows of the dataframe (target value is available)')
display(fires_days.head())
print('print the last 5 rows of the dataframe (target value is unavailable)')
display(fires_days.tail())

print the first 5 rows of the dataframe (target value is available)


Unnamed: 0,DISCOVERY_DATE,STATE,FIRE
0,2011-01-01,AK,0.0
1,2011-01-01,MN,0.0
2,2011-01-01,MI,0.0
3,2011-01-01,MO,1.0
4,2011-01-01,IL,0.0


print the last 5 rows of the dataframe (target value is unavailable)


Unnamed: 0,DISCOVERY_DATE,STATE,FIRE
94947,2015-12-31,PR,
94948,2015-12-31,RI,
94949,2015-12-31,VT,
94950,2015-12-31,MA,
94951,2015-12-31,DE,


Warm-up
===========

#### Useful functions from the pandas library (see below for some examples):
- To read a csv file, one can use the function [pd.read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). Some parameters help you adapt the behaviour of this function (see the linked documentation for further details):
    * delimiter
    * parse_dates
    * index_col
- To save a pandas DataFrame "df" into a csv file, one can use the function [.to_csv()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). We strongly recommend using the parameter index=False if your DataFrame has a default index.
- To access part of a DataFrame based on labels (column name or row name), you can use [.loc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). Note that `.loc` is an indexer used with square brackets, not a function.
- Similarly, to access part of a DataFrame based on integer positions (column number or row number), you can use [.iloc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html).
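
The three `pd.read_csv()` parameters listed above can be tried on an in-memory file, as in the sketch below. The semicolon-separated text and the `id` column are invented for the example. `io.StringIO` stands in for a file path so that the snippet runs without any data on disk.

```python
import io
import pandas as pd

# toy csv content, made up for the example (note the ";" separator)
csv_text = "id;DISCOVERY_DATE;STATE\n0;2015-01-01;AK\n1;2015-01-02;MN\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=";",                  # column separator used in the file
    parse_dates=["DISCOVERY_DATE"], # convert this column to datetime
    index_col="id",                 # use this column as the row index
)
df.info()

# write back without the index: only DISCOVERY_DATE and STATE are saved
buffer = io.StringIO()
df.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)
```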

In [5]:
# read a csv file, and store it into a DataFrame "example_df"
relative_path_file = "./data/4_predictions/example_submission.csv"
example_df = pd.read_csv(relative_path_file)

# show DataFrame
print("Raw DataFrame")
display(example_df)

# display only the DISCOVERY_DATE and STATE columns, and the first 5 rows, using .loc[]
# In .loc[:4, ["col1", "col2"]]:
# ":4" selects all rows up to and including the row labeled 4 (label-based slices include the end)
# and ["col1", "col2"] selects only the columns col1 and col2
loc_df = example_df.loc[:4, ["DISCOVERY_DATE", "STATE"]]
print("Part of the DataFrame, using .loc()")
display(loc_df)

# Similarly, display only the DISCOVERY_DATE and STATE columns, and the first 5 rows, using .iloc[]
# In .iloc[:5, [0, 1]]:
# ":5" selects all rows up to position 5 excluded (positions start at 0),
# and [0, 1] selects only the columns at positions 0 and 1, which correspond to DISCOVERY_DATE and STATE
iloc_df = example_df.iloc[:5, [0, 1]]
print("Part of the DataFrame, using .iloc()")
display(iloc_df)

# store "loc_df" into a csv file, without the index
# you can see the result by opening the file "./data/6_test/example_df.csv" after execution
loc_df.to_csv("./data/6_test/example_df.csv", index=False)


Raw DataFrame


Unnamed: 0,DISCOVERY_DATE,STATE,FIRE
0,2015-01-01,AK,0
1,2015-01-01,MN,0
2,2015-01-01,MI,0
3,2015-01-01,MO,0
4,2015-01-01,IL,0
...,...,...,...
18975,2015-12-31,PR,0
18976,2015-12-31,RI,0
18977,2015-12-31,VT,0
18978,2015-12-31,MA,0


Part of the DataFrame, using .loc()


Unnamed: 0,DISCOVERY_DATE,STATE
0,2015-01-01,AK
1,2015-01-01,MN
2,2015-01-01,MI
3,2015-01-01,MO
4,2015-01-01,IL


Part of the DataFrame, using .iloc()


Unnamed: 0,DISCOVERY_DATE,STATE
0,2015-01-01,AK
1,2015-01-01,MN
2,2015-01-01,MI
3,2015-01-01,MO
4,2015-01-01,IL


Objectives of this Notebook
======

Objectives:
- Read the fires dataset `./data/1_raw/fires/fires_train.csv`
- Analyze it, to find duplicate values, columns' types, numerical and categorical distributions
- Clean it accordingly in order to obtain a quality dataset, without errors, duplicates, irrelevant values... ready to be analyzed. Cleaning can consist in removing, correcting or imputing data.
- Save the cleaned DataFrame in `./data/2_clean/fires.csv`


##### Some guidelines for this process can be found below:

##### Step 1. Analyze the dataset
- Check columns' types with [.info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)
- Check duplicate values with [.duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)
- Check numerical data distribution with [.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
- Check categorical data distribution with [.value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
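
The four checks of Step 1 can be sketched on a small toy DataFrame as shown below. The rows are invented for the example and only borrow column names from the fires dataset. This is not a solution to the exercise.

```python
import pandas as pd

# toy data, made up for the example (rows 1 and 2 are identical)
toy = pd.DataFrame({
    "FOD_ID": [1, 2, 2, 3],
    "FIRE_SIZE": [0.1, 12.0, 12.0, 350.0],
    "STATE": ["AK", "CA", "CA", "AK"],
})

# column types and non-null counts
toy.info()

# number of rows that repeat an earlier row exactly
n_duplicates = toy.duplicated().sum()
print("duplicated rows:", n_duplicates)

# summary statistics of a numerical column
size_summary = toy["FIRE_SIZE"].describe()
print(size_summary)

# frequency of each category
state_counts = toy["STATE"].value_counts()
print(state_counts)
```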

##### Step 2. Cleaning
- Replace/remove missing values
    - Impute new values with [.fillna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
        - Option 1: with statistical data (mean, median, etc.)
        - Option 2: with a dedicated flag (e.g. 0, etc.)  
    - Option 3: If relevant, drop observations with [.dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function.
- Remove irrelevant data, if any (check that all columns are needed). To drop a column, use [.drop(columns=[...])](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)
- If relevant, remove duplicate rows with [.drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
- If relevant, perform type conversion with [.astype()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)
- If relevant, fix discovered typos. One way of doing so is to use the [.str.replace()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) method of pandas Series.
- If relevant, map categorical data into smaller groups by using the [.map()](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) function.
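
As a rough illustration of Step 2, the sketch below chains the operations listed above on toy data. Everything in it is an assumption made for the example: the rows, the leftover `ROW_NUMBER` column, the "acidental" typo, the choice of the median for imputation and the `CAUSE_GROUP` grouping. The right choices for the real fires dataset depend on what your analysis of Step 1 reveals.

```python
import numpy as np
import pandas as pd

# toy data, made up for the example
toy = pd.DataFrame({
    "ROW_NUMBER": [0, 1, 2, 3, 4],  # hypothetical irrelevant column
    "FIRE_SIZE": [0.1, np.nan, 12.0, 12.0, 350.0],
    "FIRE_YEAR": ["2011", "2012", "2013", "2013", "2014"],
    "STATE": ["AK", "CA", "ca ", "ca ", None],
    "CAUSE_DESCR": ["accidental", "natural", "acidental", "acidental", "natural"],
})

# remove an irrelevant column, then exact duplicate rows
cleaned = toy.drop(columns=["ROW_NUMBER"])
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# impute missing sizes with the median (one option among others)
cleaned["FIRE_SIZE"] = cleaned["FIRE_SIZE"].fillna(cleaned["FIRE_SIZE"].median())

# drop rows where the state is unknown
cleaned = cleaned.dropna(subset=["STATE"]).reset_index(drop=True)

# type conversion: years stored as text become integers
cleaned["FIRE_YEAR"] = cleaned["FIRE_YEAR"].astype(int)

# fix formatting issues and a typo
cleaned["STATE"] = cleaned["STATE"].str.strip().str.upper()
cleaned["CAUSE_DESCR"] = cleaned["CAUSE_DESCR"].str.replace("acidental", "accidental")

# map categories into broader groups (hypothetical grouping)
cleaned["CAUSE_GROUP"] = cleaned["CAUSE_DESCR"].map(
    {"accidental": "human", "natural": "natural"}
)
print(cleaned)
```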

<p align="center">
  <a>
    <img src="./figures/UpToYou.png" alt="Logo" width="200" height="280">
  </a>
</p>

#### Libraries

In [6]:
import pandas as pd
import numpy as np
from datetime import datetime
from utils import check_duplicates

#### Input files/variables

In [7]:
input_file = "./data/1_raw/fires/fires.csv" # path input file
dest_file = "./data/2_clean/fires.csv" # path output file
checks = {True:"OK", False: "NOK"} # dict to convert boolean to string

<h1 align="center">Preparation of the fires dataset</h1>

In [None]:
# CODE HERE

# Take Away
- Edit variable types / formats
- Identify duplicates
- Delete columns with many missing values
- Use common sense and keep only relevant variables
- Observe the distribution of values of a variable
- Visual representations are useful to understand how a variable works

### Pitfalls to avoid
- Automatically deleting a duplicate: first understand why the duplicate appeared.
- Automatically deleting all rows with missing values, which loses information. Approximating some values allows you to keep information that helps meet the objective.
- Automatically deleting outliers: first understand where they come from. Are they errors, or do they represent genuine extreme cases?
- Retaining variables that could raise ethical concerns for the project (skin color, address...).
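
One way to avoid the first and third pitfalls is to look at suspicious rows before removing anything, as in the sketch below. The toy rows are invented, and the 1.5 × IQR rule used to flag large values is only one common convention, chosen here as an assumption. Flagged rows are marked, not deleted.

```python
import pandas as pd

# toy data, made up for the example
toy = pd.DataFrame({
    "FOD_ID": [10, 11, 11, 12],
    "FIRE_NAME": ["LAWING DRIVE", "BEAR CREEK", "BEAR CREEK", "TOKLAT WAY"],
    "FIRE_SIZE": [0.1, 5.0, 5.0, 4000.0],
})

# keep=False marks every copy of a duplicated row, so they can be compared side by side
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# flag unusually large fires instead of dropping them
q1, q3 = toy["FIRE_SIZE"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["SIZE_OUTLIER"] = toy["FIRE_SIZE"] > q3 + 1.5 * iqr
print(toy)
```

A very large fire flagged this way may be a perfectly valid class F or G fire, which is why inspecting it comes before any decision to remove it.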

### Go further:
- [The Ultimate Guide to Data Cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)
- [Learn Data Cleaning Tutorials | Kaggle](https://www.kaggle.com/learn/data-cleaning)
