## Data Cleansing
OK, I've put this off long enough.  It's time to cover the least interesting and possibly most critical aspect of feature engineering... data cleansing!  

Many will include data cleansing as part of the raw data collection pipeline rather than the feature engineering step, and I can't argue with cleansing data as early in the process as possible.  However, your data can never be too clean, so I take the "belt and suspenders" approach.  Clean your data on collection, clean on usage.  Clean, clean, clean!

The main tools covered in this section are:
* `pd.to_datetime`, `pd.to_numeric`, and `astype()` (int, string, float...) for data typing
* `fillna()` (forward fill, 0, mean) for missing data

### Data Typing
If you've spent any time with data work in Python, you're already familiar with the sometimes annoying data typing issues of a "duck-typed" language.  Pandas does an admirable job of inferring types from your data, but you'll sometimes want to exercise more control to make sure your data is perfect.

The first data typing issue I face is dates and times, which can be represented in several different formats.  I prefer to standardize all datetimes using the pandas `pd.to_datetime()` function, which yields two main benefits: (1) you will be able to align and join multiple datetime values together and (2) you'll be able to take advantage of the many pandas date/time functions.

Example:

In [2]:
## code of casting to datetime, selecting weekday etc...
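
A minimal sketch of what such a cell might look like. The sample data and the column names (`date`, `price`, `weekday`, `day_name`) are made up for illustration and are not from the original notebook:

```python
import pandas as pd

# Hypothetical sample data: dates arrive as plain strings
raw = pd.DataFrame({
    'date': ['2018-01-02', '2018-01-03', '2018-01-04'],
    'price': [10.0, 10.5, 10.2],
})

# Cast the string column to a proper datetime64 dtype
raw['date'] = pd.to_datetime(raw['date'], format='%Y-%m-%d')

# With a datetime dtype, the .dt accessor exposes calendar attributes
raw['weekday'] = raw['date'].dt.weekday       # Monday=0 ... Sunday=6
raw['day_name'] = raw['date'].dt.day_name()

print(raw.dtypes)
print(raw)
```

Passing an explicit `format` is optional here, but it tends to make parsing faster and fails loudly if a value doesn't match what you expect.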


If you fail to control your datetime typing, you'll inevitably end up with difficulty in aligning and joining data on date, like this:  

In [3]:
# example of a str and a datetime repr which are joined on axis=1 and result in an awkward dataframe
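
One way this can play out, sketched with two hypothetical series (`a` and `b` are illustrative names) that cover the same three dates but index them differently:

```python
import pandas as pd

dates = ['2018-01-02', '2018-01-03', '2018-01-04']

# Same dates, but one index holds strings and the other holds datetimes
a = pd.Series([1.0, 2.0, 3.0], index=dates, name='a')
b = pd.Series([10.0, 20.0, 30.0], index=pd.to_datetime(dates), name='b')

# The string labels never match the datetime labels, so rows fail to line up
awkward = pd.concat([a, b], axis=1)
print(awkward)

# Standardizing the index type first lets the rows align as intended
a.index = pd.to_datetime(a.index)
aligned = pd.concat([a, b], axis=1)
print(aligned)
```

The first frame ends up with one row per label from each series and a missing value in every row, while the second has one fully populated row per date.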

Among the pandas date/time functions is a very useful resampling method, which allows you to aggregate from a higher frequency (e.g., hourly) to a lower frequency (e.g., daily, weekly, or monthly).  Depending on the timeframe of your strategy, you may seek to resample everything to a lower frequency.

In [4]:
## example of resampling
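
A rough sketch of resampling hourly data down to daily, using a synthetic series (the name `hourly` and the values are made up for illustration). Which aggregation makes sense (last, mean, sum...) depends on the feature, so treat the two shown here as examples only:

```python
import pandas as pd

# Hypothetical hourly series covering two full days (values are simply 0..47)
idx = pd.date_range('2018-01-02', periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(range(48), index=idx, name='price', dtype='float64')

# Aggregate each calendar day: take the last observation, or the daily average
daily_last = hourly.resample('D').last()
daily_mean = hourly.resample('D').mean()

print(daily_last)
print(daily_mean)
```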


The other main typing issue I find is with numeric types. Number values are commonly represented as integers, floats, and strings which look like integers or floats.  Pandas attempts to guess the right type for data when it's loaded (via `read_csv`, `read_sql`, etc.).  Problems arise when some values within a column don't follow the column's expected type.

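The example below illustrates how a single non-numeric value (here, the string `'None'`) keeps pandas from computing an average until the column is retyped with `pd.to_numeric`.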

In [5]:
import pandas as pd

df = pd.DataFrame({'symbol':['a','b','c','d','e'],'price':[1,2,3,4,'None']})
print(df)
print()
# 'price' is stored as an object column, so there is nothing numeric to average
print('Average: ',df.mean(numeric_only=True)) # no results
print()
print('######################')
# retype to numeric; errors='coerce' turns unparseable values into NaN
print()
df['price'] = pd.to_numeric(df.price,errors='coerce')
print(df)
print()
print('Average: ',df.mean(numeric_only=True)) # works

### Handling Missing Data
Incomplete data is a reality for us all.  Whether it's because some input sources are of a lower frequency, shorter history (i.e., don't go back as far in time) or have unexplained unavailable data points at times, we need a thoughtful approach for addressing missing data.

Most machine learning algorithms require a valid value for each feature at each observation point (or they will fail to run...). If we don't apply some sensible workarounds, we'll end up dropping lots of _valid_ data points because of a single missing feature.  

Before outlining the tactics and code patterns we can apply, my core principles for data cleansing are:
1. Always try to reflect the data you would have had available _at the time_ of the missing data point.  In other words, don't peek into the future if at all possible.
2. Drop valid data only as a last resort (and as late in the process as possible).  
3. Questionable data (i.e., extreme outliers) should be treated like missing data.
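
As a rough illustration, the three principles might translate into pandas along these lines. The feature names (`slow_signal`, `price`) and the outlier threshold of 100 are assumptions made up for this sketch, not recommendations:

```python
import numpy as np
import pandas as pd

# Hypothetical daily features with gaps and one suspicious price
idx = pd.date_range('2018-01-01', periods=6, freq='D')
feat = pd.DataFrame({
    'slow_signal': [np.nan, 1.0, np.nan, 2.0, np.nan, np.nan],
    'price': [10.0, 10.1, 999.0, 10.2, 10.3, np.nan],
}, index=idx)

clean = feat.copy()

# Principle 3: treat questionable values as missing (assumed threshold)
clean.loc[clean['price'] > 100, 'price'] = np.nan

# Principle 1: forward fill uses only values known at or before each date
clean = clean.ffill()

# Principle 2: drop rows only at the very end, and only if still incomplete
model_ready = clean.dropna()

print(clean)
print(model_ready)
```

Note that the first `slow_signal` value stays missing after the forward fill, since filling it would require a later observation. That row is the only one dropped.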

