## Data cleaning at load

One of the most common mistakes data professionals make is not using the cleaning capabilities of their loading framework.

###### Adding a few arguments to your load statement will save hours down the line

###### It also has the added benefit of tidying your [ETL pipeline](https://en.wikipedia.org/wiki/Extract,_transform,_load) drastically.


## Exercise 0.1

 - play around with the arguments for `pd.read_csv`
 - `pd.read_csv(path, argument1 = x, argument2 = y)`
 - look at the arguments `true_values`, `false_values`, `comment` and `parse_dates`, for example `true_values=["TRUE"], false_values=["FALSE"], comment="#", parse_dates=[0]`
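
To get a feel for what these arguments do before touching the exercise file, here is a minimal sketch on a tiny in-memory CSV. The data, the `yes`/`no` tokens and the names `raw` and `demo` are made up for illustration and are not part of the exercise.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file.
raw = io.StringIO(
    "Date,Logical,Integer\n"
    "#NNAA,#NNAA,#NNAA\n"
    "01/01/2013,yes,74\n"
    "01/02/2013,no,32\n"
)

demo = pd.read_csv(
    raw,
    true_values=["yes"],   # tokens to read as True
    false_values=["no"],   # tokens to read as False
    comment="#",           # drop everything from "#" to the end of the line
    parse_dates=[0],       # parse the first column as dates
)

# The "#NNAA" row is gone, and the columns already have useful dtypes.
print(demo)
print(demo.dtypes)
```

Because the `#NNAA` row starts with the comment character, the whole line is treated as a comment and skipped, so it never reaches the DataFrame.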


![](https://miro.medium.com/max/882/1*eVZyKIcUXOfzPrMTGx7yVw.png)


In [2]:
import pandas as pd
path = "../data/exercise.csv"
df = pd.read_csv(path)
df.head()

|   | Date | Logical | Integer | Real | Factor |
|---|---|---|---|---|---|
| 0 | #NNAA | #NNAA | #NNAA | #NNAA | #NNAA |
| 1 | #NNAA | #NNAA | #NNAA | #NNAA | #NNAA |
| 2 | 01/01/2013 | TRUE | 74 | 67.64 | Washington.DC |
| 3 | 01/02/2013 | FALSE | 32 | 52.42 | Denver |
| 4 | 01/03/2013 | TRUE | 20 | 52.42 | Atlanta |


In [3]:
## Your code

#### Try to end up with only 2 columns of type `object`

Why is the column `Logical` of type object?

This is because `NaN` is itself a float. An integer column that contains missing values is implicitly cast to `float64`, but a boolean column has no such fallback, so pandas stores it as `object`.
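
The casting behaviour can be checked on two small hand-made Series. This is only an illustrative sketch: the names `ints`, `bools` and `nullable` are hypothetical, and the last step assumes pandas 1.0 or later, where the nullable `"boolean"` dtype is available.

```python
import numpy as np
import pandas as pd

# Integers with a missing value are upcast to float, because NaN is a float.
ints = pd.Series([1, 2, np.nan])
print(ints.dtype)   # float64

# Booleans with a missing value cannot be upcast, so pandas falls back to object.
bools = pd.Series([True, False, np.nan])
print(bools.dtype)  # object

# Assumption: pandas >= 1.0. The nullable "boolean" dtype keeps True/False
# and represents the missing entry as <NA> instead of forcing object.
nullable = bools.astype("boolean")
print(nullable.dtype)
```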

In [7]:
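# Declare the boolean tokens, skip commented rows, and parse the first column as dates
df = pd.read_csv(path, true_values=["TRUE"], false_values=["FALSE"], comment="#", parse_dates=[0])
# Factor has only a few distinct values, so store it as a category
df["Factor"] = df["Factor"].astype("category")
df.dtypes
#df.head()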

Date       datetime64[ns]
Logical            object
Integer             int64
Real              float64
Factor           category
dtype: object

In [8]:
df.loc[df["Logical"].isnull(), :].head(10)

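|   | Date | Logical | Integer | Real | Factor |
|---|---|---|---|---|---|
| 11 | 2013-01-12 | NaN | 35 | 54.87 | Miami |
| 78 | 2013-03-20 | NaN | 61 | 46.42 | Denver |
| 81 | 2013-03-23 | NaN | 28 | 47.06 | Denver |
| 86 | 2013-03-28 | NaN | 47 | 73.97 | LosAngeles |
| 93 | 2013-04-04 | NaN | 12 | 66.86 | Washington.DC |
| 112 | 2013-04-23 | NaN | 65 | 58.56 | Atlanta |
| 133 | 2013-05-14 | NaN | 13 | 47.06 | Chicago |
| 152 | 2013-06-02 | NaN | 58 | 41.67 | Chicago |
| 154 | 2013-06-04 | NaN | 75 | 41.67 | Washington.DC |
| 180 | 2013-06-30 | NaN | 99 | 58.56 | NewYork |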


In [5]:
df.dtypes

Date       datetime64[ns]
Logical            object
Integer             int64
Real              float64
Factor           category
dtype: object

##### Consider casting the `object` column `Factor` to type `category`.

##### This can yield certain advantages:

- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see [here](https://pandas.pydata.org/pandas-docs/version/0.25/user_guide/categorical.html#categorical-memory).
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see [here](https://pandas.pydata.org/pandas-docs/version/0.25/user_guide/categorical.html#categorical-sort).
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
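
The first two points can be illustrated with a rough sketch. The city and size values below are made-up stand-ins, not the exercise data, and the exact byte counts will vary with the pandas version.

```python
import pandas as pd

# Memory: a column with only a few distinct strings, repeated many times.
cities = pd.Series(["Denver", "Atlanta", "Chicago"] * 1000)
as_category = cities.astype("category")

object_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)  # the categorical version is smaller

# Ordering: lexical order would be one, three, two.
sizes = pd.Series(["two", "three", "one"])
ordered = pd.Series(
    pd.Categorical(sizes, categories=["one", "two", "three"], ordered=True)
)

sorted_sizes = ordered.sort_values()
print(list(sorted_sizes))  # ['one', 'two', 'three']
print(ordered.max())       # three
```

With a plain string column, `max()` would return `"two"` because it compares alphabetically. The ordered categorical returns `"three"` because it follows the order given in `categories`.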