Author: Kevin ALBERT  

Created: September 2020 (Updated: 24 Mar 2022)  

TestRun: 24 Mar 2022

# Datareport
_**How to generate an interactive and static report.**_

In [None]:
# install python modules
!conda install -y -c conda-forge pandas-profiling dtale seaborn pyarrow fastparquet xlrd

In [1]:
import numpy as np
import pandas as pd
import dtale
import pandas_profiling as pp
from IPython.display import Javascript

In [2]:
# check versions
!conda -V
!python -V
!conda list |grep numpy
!conda list |grep pandas
!conda list |grep pandas_profiling
!conda list |grep dtale

conda 4.10.3
Python 3.8.12
numpy                     1.22.3                   pypi_0    pypi
pandas                    1.4.1                    pypi_0    pypi
pandas-profiling          3.1.0                    pypi_0    pypi
dtale                     2.2.0                    pypi_0    pypi


### load data

In [3]:
# demo with clean '*.parquet' data: ../bronze/ -> silver -> gold -> platinum
df = pd.read_parquet('../../data/platinum/diabetes.parquet')

### summary

In [4]:
# concise summary (shape, memory use, data types, nan's)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   PatientID               10000 non-null  int64  
 1   Pregnancies             10000 non-null  int64  
 2   PlasmaGlucose           10000 non-null  int64  
 3   DiastolicBloodPressure  10000 non-null  int64  
 4   TricepsThickness        10000 non-null  int64  
 5   SerumInsulin            10000 non-null  int64  
 6   BMI                     10000 non-null  float64
 7   DiabetesPedigree        10000 non-null  float64
 8   Age                     10000 non-null  int64  
 9   Diabetic                10000 non-null  int64  
dtypes: float64(2), int64(8)
memory usage: 781.4 KB


### interactive report

In [5]:
# start webapp (change IP, port)
d = dtale.show(df, host="20.223.36.46", port="40000", ignore_duplicate=True, drop_index=True, reaper_on=False)

In [6]:
# print the URL of this D-Tale instance
d.main_url()

http://20.223.36.46:40000/dtale/main/1
Executing shutdown...


2022-03-24 20:18:31,944 - INFO     - Executing shutdown...


In [None]:
# stop webapp
d.kill()

### static report

In [7]:
reportFile = "../../data/report/diabetes_report.html"

In [8]:
# quick report on all records (minimal mode, Cramér's V correlation disabled)
pp.ProfileReport(df=df.sample(frac=1),
                 minimal=True,
                 progress_bar=False,
                 correlations={"cramers": {"calculate": False}},
                 title="Report Title",
                 html={"style": {"full_width": True}}).to_file(reportFile)

In [9]:
# open the report (*.html)
display(Javascript('window.open("{url}");'.format(url=reportFile)))

<IPython.core.display.Javascript object>

### data loading test
```python
# check first 2 lines
! head -n 2 ../../data/bronze/data.csv 
```
### data loading
```python
# read dataset, define delimiter, dtypes and encoding (preserve id: '000000')
df = pd.read_csv('data.csv',
                 delimiter=';',
                 encoding='utf-8',
                 dtype={"event_id":"str",
                        "game_id":"str"})
```
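To see why the ID columns are read as strings, here is a minimal sketch using an in-memory CSV. The sample values and the `score` column are invented for illustration; only the column names `event_id` and `game_id` come from the snippet above.

```python
import io

import pandas as pd

# hypothetical two-row sample; the values are made up for illustration
csv_text = "event_id;game_id;score\n000000;000042;3\n000001;000043;5\n"

# default parsing infers integers and drops the leading zeros
df_inferred = pd.read_csv(io.StringIO(csv_text), delimiter=';')

# forcing 'str' keeps the identifiers exactly as written
df_typed = pd.read_csv(io.StringIO(csv_text),
                       delimiter=';',
                       dtype={"event_id": "str", "game_id": "str"})

print(df_inferred["event_id"].tolist())  # integers, leading zeros lost
print(df_typed["event_id"].tolist())     # strings, leading zeros kept
```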
### report
```python
# generate a data report to get a feel of the data (EDA)
d = dtale.show(df, host="52.174.238.247", port="40000", ignore_duplicate=True, drop_index=True, reaper_on=False)
```
### data selection
```python
# remove columns that are irrelevant to the use case
# drop features with only 1 unique value
# drop features that are highly correlated (ex: lat, long == loc_x, loc_y)
df = df.drop(['col1', 'col2'], axis=1)
# select columns by regular expression [tags_0_name, tags_1_name, ...]
df.filter(regex=r'tags_\d*_name', axis=1)
```
```python
# remove records that are irrelevant to the use case
# remove records that are 2 characters long
df = df.drop(df[df["col1"].str.len() == 2].index, axis=0)
```
```python
# drop all columns with all values NA
df = df.dropna(axis=1, how='all')
# drop all records with all values NA
df = df.dropna(axis=0, how='all')
```
```python
# remove exact duplicate rows
df = df.drop_duplicates()
```
### replace values
```python
# careful: .map() turns every value missing from the mapping into NaN, .replace() leaves it unchanged
# replace values and categorical errors (typos, capitalization, labeling)
df = df.replace({'col1':{'Male': 'M', 'Man': 'M'},
                 'col2':{'True': 1, 'Yes': 1}})
```
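The difference between `.map()` and `.replace()` can be shown on a small, hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

# hypothetical column with inconsistent labels
s = pd.Series(['Male', 'Man', 'F', 'M'])
mapping = {'Male': 'M', 'Man': 'M'}

mapped = s.map(mapping)        # 'F' and 'M' are not keys in the mapping -> NaN
replaced = s.replace(mapping)  # 'F' and 'M' are left untouched

print(mapped.tolist())
print(replaced.tolist())
```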
### variance
Remove the columns flagged red for low variance; they carry too little information to be usable for machine learning
![enable low variance flag](../../image/howto_dataclean/dtale_low_variance.png)
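
A rough pandas equivalent, as a sketch only: the helper `low_variance_columns` and its `max_top_share` threshold are assumptions made for this example and do not reproduce D-Tale's own low-variance rule.

```python
import pandas as pd


def low_variance_columns(df, max_top_share=0.95):
    """Return columns that are constant or dominated by a single value.

    max_top_share is an assumed threshold, not a D-Tale setting.
    """
    flagged = []
    for col in df.columns:
        # relative frequency of each value, most frequent first
        shares = df[col].value_counts(normalize=True, dropna=False)
        if len(shares) <= 1 or shares.iloc[0] >= max_top_share:
            flagged.append(col)
    return flagged


# hypothetical data for illustration
demo = pd.DataFrame({
    'constant': [1] * 100,
    'almost_constant': [0] * 98 + [1, 2],
    'varied': list(range(100)),
})

flagged = low_variance_columns(demo)
demo_clean = demo.drop(columns=flagged)
print(flagged)
```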
### outliers
Never remove outliers; they will be normalized or scaled for machine learning
![enable outliers flag](../../image/howto_dataclean/dtale_highlight_outliers.png)
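
Two common ways to scale a column instead of dropping its outliers, sketched with plain pandas. The sample values are hypothetical, and the choice of min-max and median/IQR scaling is an assumption for this example, not something the original workflow prescribes.

```python
import pandas as pd

# hypothetical numeric column with one extreme value
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# min-max normalization: rescale to the range [0, 1]
min_max = (s - s.min()) / (s.max() - s.min())

# robust scaling: center on the median, divide by the interquartile range
iqr = s.quantile(0.75) - s.quantile(0.25)
robust = (s - s.median()) / iqr

print(min_max.round(3).tolist())
print(robust.round(3).tolist())
```

Min-max scaling squeezes the regular values together when one value is extreme, while median/IQR scaling keeps their spread and leaves the outlier as a large value.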
### missing
#### KEEP
```python
# replace nan
df['col1'] = df['col1'].fillna('No Value')
```
#### DELETE
Drop columns with >90% nan
![highlight missing](../../image/howto_dataclean/dtale_highlight_missing.png)
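
The same rule can be sketched in pandas. The demo frame and its column names are hypothetical; the 90% threshold is the one stated above.

```python
import numpy as np
import pandas as pd

# hypothetical frame: 'mostly_missing' is 95% NaN, 'some_missing' is 50% NaN
demo = pd.DataFrame({
    'complete': range(20),
    'some_missing': [1.0, np.nan] * 10,
    'mostly_missing': [1.0] + [np.nan] * 19,
})

# share of missing values per column
nan_share = demo.isna().mean()

# keep only the columns with at most 90% missing values
demo_clean = demo.loc[:, nan_share <= 0.90]
print(list(demo_clean.columns))
```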
#### IMPUTE
Replace missing categorical values with the most frequent value
![replace categorical missing values](../../image/howto_dataclean/dtale_replace_missing_with_categorical_top_frequency.png)
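
A minimal pandas sketch of filling a categorical column with its most frequent value; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical categorical column with missing values
s = pd.Series(['red', 'blue', np.nan, 'red', None, 'green'])

# mode() ignores missing values and returns the most frequent value(s); take the first
top_value = s.mode().iloc[0]
filled = s.fillna(top_value)
print(filled.tolist())
```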
Replace missing numerical values with the median, not the mean  
![replace numerical missing values](../../image/howto_dataclean/dtale_replace_missing_with_numerical_median.png)  
```python
# replace missing 'age' values with the column median
df['age'] = df['age'].fillna(df['age'].median())
```
### setup dtypes
```python
# timestamp, option 1: pass an explicit strftime format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# timestamp, option 2: let pandas detect the format (flag deprecated since pandas 2.0)
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
```
```python
# add: year, month, day, hour, part_of_day
df['year'] = df['dateTime'].dt.year
df['month'] = df['dateTime'].dt.month
df['day'] = df['dateTime'].dt.day
df['hour'] = df['dateTime'].dt.hour

def f(x):
    # map an hour of the day (0-23) to a part-of-day label
    if (x > 4) and (x <= 8):
        return 'Early Morning'
    elif (x > 8) and (x <= 12):
        return 'Morning'
    elif (x > 12) and (x <= 16):
        return 'Noon'
    elif (x > 16) and (x <= 20):
        return 'Eve'
    elif (x > 20) and (x <= 24):
        return 'Night'
    elif (x <= 4):
        return 'Late Night'

df['part_of_day'] = df['hour'].apply(f)
```
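As an alternative to applying a function row by row, the same hour ranges can be expressed as bins for `pd.cut`. This is a sketch on hypothetical hours; note that the result has the `category` dtype rather than plain strings.

```python
import pandas as pd

# hypothetical hours of the day (0-23)
hours = pd.Series([0, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 23])

# right-closed bins: (-1, 4], (4, 8], (8, 12], (12, 16], (16, 20], (20, 24]
bins = [-1, 4, 8, 12, 16, 20, 24]
labels = ['Late Night', 'Early Morning', 'Morning', 'Noon', 'Eve', 'Night']

part_of_day = pd.cut(hours, bins=bins, labels=labels)
print(part_of_day.tolist())
```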
```python
# generate dtypes dictionary
df.dtypes.apply(lambda x: x.name).to_dict()
# float, bool, category, str, int (size: int8, int16, int32, int64)
df = df.astype({'col1':'float', 'col2':'bool', 'col3':'category', 'col4':'int', 'col5':'str'})
```
### setup feature names
```python
# replace feature names with interpretable naming (others are left as-is) 
df = df.rename(columns={'col1': 'age', 'col2': 'price'})
# rename all the columns
df.columns = ['age', 'price', '...']
```
### feature engineering
#### CREATE
Transform what you have into the right format, handle text, metrics and units
```python
# create an aggregation (sum, mean, count) on a group
df.groupby(['year','month','day','hour','camera'])['amount'].sum().reset_index()
# calculate the time remaining in seconds: (minutes x 60) + seconds
df['time_remaining_seconds'] = (df['minutes_remaining'] * 60) + (df['seconds_remaining'])
```
#### CAPPING
Replace the lowest 5% of values with the 5th percentile value and the highest 1% of values with the 99th percentile value
![winsorize capping](../../image/howto_dataclean/winsorize_capping.png)
```python
from scipy.stats.mstats import winsorize
winsorize(df["col1"], limits=[0.05, 0.01])
```
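A similar capping can be sketched with pandas alone, using `quantile` and `clip` instead of `winsorize`. The sample data is hypothetical, and the results can differ slightly from `winsorize` because `quantile` interpolates between data points by default.

```python
import pandas as pd

# hypothetical numeric column: the numbers 1..100
s = pd.Series(range(1, 101), dtype='float')

# cap values below the 5th percentile and above the 99th percentile
lower = s.quantile(0.05)
upper = s.quantile(0.99)
capped = s.clip(lower=lower, upper=upper)

print(lower, upper)
```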
#### BINNING
Convert numeric data into bins (categorize or discretize)
![categorize numeric data](../../image/howto_dataclean/dtale_binning.png)
```python
# equal-frequency bins (each bin holds roughly the same number of rows)
pd.qcut(df["col1"], 3)
```
```python
# equal-width bins defined by explicit edges
pd.cut(df["col1"], [0,20,40,60])
```
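Bins become easier to read when they carry labels. A small sketch with hypothetical ages and assumed label names:

```python
import pandas as pd

# hypothetical ages
ages = pd.Series([5, 20, 21, 40, 59, 60])

# right-closed bins (0, 20], (20, 40], (40, 60] with assumed label names
age_group = pd.cut(ages, bins=[0, 20, 40, 60], labels=['young', 'middle', 'senior'])

# count how many values fall into each bin
counts = age_group.value_counts(sort=False)
print(counts)
```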
### machine learning
Unique ID columns are useful as database keys, but should be removed from ML training data
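
A minimal sketch of separating features from the identifier and the label. The column names come from the diabetes dataset loaded earlier; the row values are made up, and treating `Diabetic` as the label is an assumption for this example.

```python
import pandas as pd

# hypothetical rows; only the column names come from the diabetes dataset
df_demo = pd.DataFrame({
    'PatientID': [1000001, 1000002, 1000003],
    'Age': [21, 23, 47],
    'BMI': [43.5, 21.2, 41.5],
    'Diabetic': [0, 0, 1],
})

# features: everything except the identifier and the label
X = df_demo.drop(columns=['PatientID', 'Diabetic'])
# label: the column to predict (assumed here to be 'Diabetic')
y = df_demo['Diabetic']

print(list(X.columns))
```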
### data saving
```python
# export without the index; set the encoding and delimiter, and round floats to two decimals
df.to_csv("../../data/silver/data.csv", encoding='utf-8', sep=',', float_format='%.2f', index=False)
```
```python
# parquet preserves the schema (dtypes) and stores the data compressed
df.to_parquet("../../data/silver/data.parquet")
```
### documentation
[framework guide](https://elitedatascience.com/data-cleaning)  
[howto dtale youtube](https://www.youtube.com/watch?v=Q2kMNPKgN4g)  
[Google Docs - Giant Data Quality List](https://docs.google.com/document/d/19THZHWUQkxHg4t5kToRnEnOb7VKcye6Gs046meXCtng/edit?usp=sharing)  