# <img style="float: left; padding-right: 10px; width: 200px" src="https://fligoo.com/img/logo-large.png"> Fligoo - Tech Interview: Strict DF

### March 2020

<hr style="height:2pt">

## Description

Pandas DataFrames can hold mixed data types within a single column: for instance, a column "month" can contain both integer and string values, in which case pandas assigns it the "object" data type. This flexibility can cause undesired side effects in several use cases, especially in data pipelines, where the type of the data must be known at every step.

The goal of this test is to develop a library that provides utils to handle Pandas DataFrames in a "strict" way regarding the data schema, so a DataFrame enforces a proper data type for each of its columns, as well as other related utils. 

The library should be tested with the following dataset, obtained from a financial institution and purposely manipulated for this exercise: [link](https://s3-us-west-2.amazonaws.com/fligoo.data-science/TechInterviews/StrictDF/data/credit-data.csv).

Please read the following assignment carefully in order to understand what we expect as an outcome.

## Assignment 

The end goal of this exercise is to develop a library providing a class `StrictDataFrame` that enforces a strict data schema, with a single data type for each column of the DataFrame. The use case for the library should look like this:

```python
from strictdf import StrictDataFrame
import pandas as pd

df = pd.read_csv("data/credit-data.csv")

sdf = StrictDataFrame(df)

sdf.dtypes
# Returns the following dict:
# {"serious_dlqin2yrs": "bool", 
#  "revolving_utilization_of_unsecured_lines": "float64",
#  "age": "int64",
#  "number_of_time30-59_days_past_due_not_worse": "int64",
#  "debt_ratio": "float64",
#  "monthly_income": "int64",
#  "number_of_open_credit_lines_and_loans": "int64",
#  "number_of_times90_days_late": "int64", 
#  "number_real_estate_loans_or_lines": "int64",
#  "number_of_time60-89_days_past_due_not_worse": "int64",
#  "number_of_dependents": "int64"}

sdf.report()
# Prints "DataFrame having shape '(120263, 11)' (29737 rows removed from original)"

sdf.old_df
# Returns the original pd.DataFrame

sdf.new_df
# Returns the modified pd.DataFrame
```

You are free to extend the behavior in any way that is aligned with the use case, but the final work must comply with the "mandatory" aspects listed below. There are also some "bonus" aspects that can earn an outstanding score on this assessment.

**NOTE: documentation must be in English.**

### Mandatory

- **Technologies:** The work involved should be done by using at least the following stack:   
    - Python 3.x (not 2.x)
    - Pandas 1.0.x
    - Jupyter 1.0.x
    - Pytest 5.3.x
- **Python package:** The resulting library "strictdf" must be installable as a Python package that resolves all the dependencies needed for correct usage. It should cover the following aspects of the use case defined above:
	- Class `StrictDataFrame` that accepts a pd.DataFrame and applies the following filters:
		- NaNs are removed from all columns (no missing values allowed).
		- For each column, infer the expected data type and remove all the rows whose value does not match that type. For example, if 90% of the values in a column are integers, then the rows holding floats or strings in that column should be removed.
		- The supported data types for this class are: `int64`, `float64`, `bool`, `str`. 
	- The class should allow retrieving the attributes `old_df` (the pd.DataFrame received by the constructor) and `new_df` (the pd.DataFrame obtained after the filters are applied).
	- The attribute `dtypes` should be a dict with the data schema obtained for the DataFrame after the filters are applied.
	- A method `report()` should print a message about the shape of the DataFrame obtained, compared with the original one (see the example above for the expected format).
- **Testing:** Unit-testing through pytest, in order to achieve a code coverage of at least 70%.
- **Jupyter notebook:** A very simple example of usage, covering the use case defined as well. 
- **Code versioning with Git** (you are free to publish it on your own Github/Bitbucket account).
- **Use all possible [programming styles in Python](https://blog.newrelic.com/engineering/python-programming-styles/)** (i.e. imperative, OOP, functional, procedural).
- **Docstring** on classes and functions developed.
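
To make the filtering rules above more concrete, here is a minimal sketch of one possible approach: label every value with the narrowest supported type it can be parsed as, pick the most frequent label per column, and keep only the rows that match in every column. The helper names `classify` and `strict_filter` and the sample data are hypothetical, and the rule for choosing the "expected" type (simple majority) is an assumption, not a prescribed solution.

```python
import pandas as pd


def classify(value):
    """Label one value as "bool", "int64", "float64" or "str" (None if missing)."""
    if pd.isna(value):
        return None
    if isinstance(value, bool):
        return "bool"
    text = str(value).strip()
    if text.lower() in ("true", "false"):
        return "bool"
    try:
        int(text)
        return "int64"
    except ValueError:
        pass
    try:
        float(text)
        return "float64"
    except ValueError:
        return "str"


def strict_filter(df):
    """Return (filtered DataFrame, schema dict) using a majority rule per column."""
    keep = pd.Series(True, index=df.index)
    schema = {}
    for name in df.columns:
        labels = df[name].map(classify)
        # Assumption: the expected type is the most frequent label.
        # A column made only of missing values would need separate handling.
        dominant = labels.value_counts().idxmax()
        schema[name] = dominant
        # Missing values carry the label None, so they are dropped here too
        keep &= labels == dominant
    return df[keep], schema


# Hypothetical sample data, not taken from the exercise dataset
sample = pd.DataFrame({
    "age": [30, "41", "n/a", 25.5, 52],
    "name": ["ann", "bob", "cy", None, "dee"],
})
filtered, schema = strict_filter(sample)
print(schema)
print(filtered)
```

This sketch only selects rows; a complete implementation would presumably also cast each surviving column to its inferred type and decide how to treat borderline cases, such as 0/1 columns that should be read as `bool` or whole-number floats that should be read as `int64`.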

### Bonus

- Dockerfile to setup environment and execute tests.
- API Reference in HTML / PDF automatically generated (e.g. Sphinx, Mkdocs)
- Propose a way to impute the values that would otherwise be removed for not matching the expected data types.
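
For the imputation bonus, one simple option (an assumption for illustration, among many valid strategies) is to coerce non-conforming values to missing and then fill them with the column median. The sample series below is made up for this example:

```python
import pandas as pd

# Hypothetical column that should be integer but holds a bad string and a missing value
raw = pd.Series([30, "41", "n/a", None, 52], name="age")

# Values that cannot be parsed as numbers become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Replace every NaN with the median of the valid values
imputed = numeric.fillna(numeric.median())

# All values are whole numbers here, so the cast to int64 is safe
imputed = imputed.astype("int64")
print(imputed.tolist())
```

Median imputation keeps every row, at the cost of introducing values that were never observed; whether that trade-off is acceptable depends on how the data is used downstream.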

This assessment is designed to be finished in one week at most. Once you complete it, please send a ZIP file of the folder with all the resources used in this work (e.g. Python library, text files, Jupyter notebooks, etc.) to leandro.ferrado@fligoo.com. You will then have a final meeting with the team to discuss the work done and answer any questions that arise.

**Have fun!**

![Have fun](https://media.thefinergifs.club/08x11-352852.gif)