# {{cookiecutter.project_name}}

{{cookiecutter.description}}

## Data Sources
- file1 : Description of where this file came from

## Changes
- {% now 'utc', '%Y-%m-%d' %} : Started project

## Requirements

```shell
conda install feather-format -c conda-forge
```

In [None]:
from imports import *

# Data preparation
## 1. Load phase

In [None]:
# File Locations
today = datetime.today()

INPUT_DIR = Path.cwd() / 'data' / '01-input'
PROCESSED_DIR = Path.cwd() / 'data' / '02-processed'

# Consider: make the input file name a cookiecutter parameter and use it both here and in the project description.
INPUT_FILE = INPUT_DIR / 'FILE1.csv'
OUTPUT_FILE = PROCESSED_DIR / f'cleaned_{today:%Y-%m-%d}.feather'

In [None]:
%%time
global df # Workaround against %%time bug. See: https://stackoverflow.com/questions/55341134/variable-scope-is-changed-in-consecutive-cells-using-time-in-jupyter-notebook

df = pd.read_csv(INPUT_FILE)
# or:
# df = pd.read_excel(INPUT_FILE)

Other optional arguments:
```python
df = pd.read_csv(
    INPUT_FILE, 
    nrows=100000,
    dtype={ 
        'class_1': 'category',
        'target_class': 'category'
    }
)
```

## 2. Cleanup phase

### 2.1 Preview the data

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
df.head(10)

In [None]:
describe(df)

Inspect the statistical properties of all features, grouped by values of the selected feature.

In [None]:
df.groupby(['selected_feature']).agg(['mean', 'count'])
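
On recent pandas versions, `mean` may raise a `TypeError` when the frame contains non-numeric columns. A minimal sketch of one way around this, restricting the aggregation to numeric columns; the sample data and the `amount` and `label` column names are hypothetical:

```python
import pandas as pd

# Hypothetical sample data; column names are placeholders.
sample = pd.DataFrame({
    'selected_feature': ['a', 'a', 'b'],
    'amount': [1.0, 3.0, 5.0],
    'label': ['x', 'y', 'z'],  # non-numeric, excluded from the aggregation
})

# Keep numeric columns only and leave out the grouping column itself.
numeric_cols = (
    sample.select_dtypes(include='number')
    .columns
    .difference(['selected_feature'])
)

summary = sample.groupby('selected_feature')[list(numeric_cols)].agg(['mean', 'count'])
print(summary)
```

The result has one row per group and a two-level column index of (feature, statistic).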

### 2.2 Column Cleanup

- Remove all leading and trailing spaces from the column names.
- Rename the columns for consistency.

In [None]:
# https://stackoverflow.com/questions/30763351/removing-space-in-dataframe-python
df.columns = [x.strip() for x in df.columns]

In [None]:
cols_to_rename = {'col1': 'new_name'}
df.rename(columns=cols_to_rename, inplace=True)

### 2.3 Clean Up Data Types

In [None]:
# Fix for: Date+time stored as object
# pd.to_datetime is used because casting to unit-less 'datetime64' fails on pandas 2.x
df['date1'] = pd.to_datetime(df['date1'])
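
If the date column may contain malformed entries, `pd.to_datetime` with `errors='coerce'` is one option. The sketch below uses hypothetical sample values and counts how many entries could not be parsed, so they can be reviewed rather than silently lost:

```python
import pandas as pd

# Hypothetical sample; 'date1' is the placeholder column name used in this notebook.
sample = pd.DataFrame({'date1': ['2021-03-01 10:15:00', 'not a date', None]})

# errors='coerce' turns unparseable values into NaT instead of raising.
sample['date1'] = pd.to_datetime(sample['date1'], errors='coerce')

# Count missing and unparseable values together.
n_unparsed = sample['date1'].isna().sum()
print(sample.dtypes)
print(f'unparsed values: {n_unparsed}')
```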

In [None]:
# Fix for: Boolean stored as object
# Step 1
distinct_values = df['has_flag'].unique()
distinct_values

In [None]:
# Step 2
boolean_values = [True, False, False, True] # Must have same length as distinct_values
df['has_flag'] = df['has_flag'].replace(distinct_values, boolean_values)
df['has_flag'].unique()

In [None]:
# Alternative fix for: Boolean stored as object
# Works on multiple columns of the same type at once.
# Step 1
bool_columns = ['has_flag01', 'has_flag02', 'has_flag03', 'has_flag04']
# alternative: filter columns by name, using regex:
# bool_columns = list(df.filter(regex='^phrase_').columns)

values_set = set()

for col_name in bool_columns:
    distinct_values = set(df[col_name].unique())
    values_set = values_set.union(distinct_values)

values_set

In [None]:
# Step 2
# Use output from previous cell to create dictionary of replacements
replacements = {0: False, 1: True}

for col_name in bool_columns:
    print(f'col_name: {col_name}')
    df[col_name] = df[col_name].replace(replacements)
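
A replacement dictionary that misses one of the raw values leaves that value unconverted. As a hedged alternative, the sketch below checks coverage first and then converts with `map`; the sample values and the use of the nullable `'boolean'` dtype are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample; the column name and raw values are placeholders.
sample = pd.DataFrame({'has_flag': ['Y', 'N', 'yes', 'no', 'Y']})

replacements = {'Y': True, 'N': False, 'yes': True, 'no': False}

# Find raw values that the mapping does not cover before converting.
unmapped = set(sample['has_flag'].dropna().unique()) - set(replacements)
if unmapped:
    raise ValueError(f'unmapped values: {unmapped}')

# map() yields NaN for keys missing from the dictionary;
# the nullable 'boolean' dtype keeps such gaps as <NA> instead of True.
sample['has_flag'] = sample['has_flag'].map(replacements).astype('boolean')
print(sample['has_flag'].value_counts())
```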

## 3. Transformation phase
### Data Manipulation

#### Create derived features

Create a boolean feature that flags rows where one of the original string features contains a given substring.

In [None]:
df['new_bool_feature'] = df['original_str_feature'].str.contains('interesting_substring', na=False)
df['new_bool_feature'].value_counts()
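
Note that `str.contains` interprets the pattern as a regular expression by default, so characters such as `.` or `(` carry special meaning. A small sketch with hypothetical sample values, comparing the default behavior with `regex=False`:

```python
import pandas as pd

# Hypothetical sample values for the placeholder column.
sample = pd.DataFrame({'original_str_feature': ['v1.5', 'v125', 'other', None]})

# Default: '.' is a regex wildcard, so 'v125' matches as well.
as_regex = sample['original_str_feature'].str.contains('1.5', na=False)

# regex=False: the pattern is matched literally.
literal = sample['original_str_feature'].str.contains('1.5', regex=False, na=False)

print(pd.DataFrame({'as_regex': as_regex, 'literal': literal}))
```

With `na=False`, missing values count as non-matches in both cases.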

#### Drop redundant or unnecessary columns

In [None]:
df.drop(
    ['col1', 'col2'], 
    axis=1, 
    inplace=True
)

## 4. Export phase

### Inspect the results

Inspect the dataset one last time before the export. 
Tweak and re-run previous steps if needed.

In [None]:
df.head(10)

In [None]:
describe(df)

### Save output file into processed directory

Save the cleaned dataset to the processed directory, where later analysis steps will read it.

Format options include:
- pickle
- feather
- msgpack (removed from pandas in version 1.0, so avoid it for new projects)
- parquet

In [None]:
df.to_feather(OUTPUT_FILE)
# or:
# df.to_pickle(OUTPUT_FILE)
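
Before relying on the exported file, a quick round-trip check can confirm that values and dtypes survive the export. The sketch below is illustrative only: it uses pickle and a temporary directory so that it runs without extra dependencies, and the sample frame is hypothetical. The same pattern should apply to `to_feather` and `pd.read_feather`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data with a categorical and a boolean column.
sample = pd.DataFrame({
    'class_1': pd.Categorical(['a', 'b', 'a']),
    'has_flag': [True, False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / 'cleaned.pkl'
    sample.to_pickle(out_file)
    restored = pd.read_pickle(out_file)

# Raises AssertionError if values or dtypes differ after the round trip.
pd.testing.assert_frame_equal(sample, restored)
print(restored.dtypes)
```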

## 5. Notes

If the input file is too large, do the initial inspection of the data and column types on a subset of the rows.

```python
df = pd.read_csv(INPUT_FILE, nrows=x)
``` 
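
A minimal sketch of that workflow, assuming a small in-memory CSV as a stand-in for `INPUT_FILE` (the column names are hypothetical): preview a few rows, inspect dtypes and memory use, then load the full file with the dtypes chosen from the preview.

```python
import io

import pandas as pd

# Stand-in for INPUT_FILE so the sketch is self-contained.
csv_text = 'id,class_1,value\n1,a,0.5\n2,b,1.5\n3,a,2.5\n4,b,3.5\n'

# Step 1: read only the first rows and inspect them.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.dtypes)
print(preview.memory_usage(deep=True))

# Step 2: load everything, applying the dtypes chosen from the preview.
full = pd.read_csv(io.StringIO(csv_text), dtype={'class_1': 'category'})
print(full.dtypes)
```

Dtypes inferred from a subset can be misleading if later rows contain values the preview did not, so treat the preview as a starting point.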

The original Feather format (V1) does not support compression ([discussion](https://stackoverflow.com/a/57685438/401095)), so the output file stays large, approximately as large as the input CSV file. Feather V2, available since pyarrow 0.17, supports LZ4 and ZSTD compression.