## Installation
Install from PyPI:

In [62]:
# !pip install exploretransform

Import the exploretransform package:

In [63]:
import exploretransform.exploretransform as et

### Summary of Functions and Classes

In [64]:
%%html
<style>
table {float:left}
</style>

Function / Class | Description
:---- | :------------- 
loadboston | loads the Boston housing dataset
peek | returns dtype, levels, # of observations, and the first five observations for a dataframe
explore | provides various statistics on a dataframe (zeros, inf, missing, levels, dtypes)
nested | takes a list, series or dataframe and returns the location of nested objects
freq | for categorical or ordinal features, provides the count, percent, and cumulative percent for each level
plotfreq | generates a bar plot using the data generated by freq
corrtable | generates a table of all pairwise correlations and uses the average correlation of each variable (its row and column) to decide on potential drop/filter candidates
calcdrop | analyzes corrtable output and determines which features should be filtered/dropped
skewstats | returns the skewness statistics and magnitude for each numeric feature
ascores | calculates various association scores (kendall, pearson, mic, dcor, spearman) between predictors and target
ColumnSelect | custom transformer that selects columns for a pipeline
CategoricalOtherLevel | custom transformer that creates an "other" level in categorical / ordinal data based on a threshold
CorrelationFilter | custom transformer that filters numeric features based on pairwise correlation

## How to use exploretransform
In addition to this guide, the docstrings provide detailed examples and explanations. Type `?` followed by the name of the function for more information. For example:

In [65]:
?et.explore
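
The `?` syntax only works in IPython and Jupyter. In a plain Python session, the built-in `help()` function or `inspect.getdoc()` returns the same docstring. A minimal sketch, using `json.dumps` from the standard library as a stand-in so that it runs without exploretransform installed:

```python
import inspect
import json

# json.dumps is only a stand-in here; with exploretransform installed,
# the same call should work on et.explore.
doc = inspect.getdoc(json.dumps)

# Print the first line of the docstring.
print(doc.splitlines()[0])
```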

### loadboston() loads the Boston housing dataset

In [66]:
df, X, y = et.loadboston()

### peek() returns dtype, levels, # of observations, and the first five observations for a dataframe

In [67]:
et.peek(X)

|  | variable | dtype | lvls | obs | head |
| :---- | :---- | :---- | :---- | :---- | :---- |
| 0 | town | object | 92 | 506 | [Nahant, Swampscott, Swampscott, Marblehead, M... |
| 1 | lon | float64 | 375 | 506 | [-70.955, -70.95, -70.936, -70.928, -70.922] |
| 2 | lat | float64 | 376 | 506 | [42.255, 42.2875, 42.283, 42.293, 42.298] |
| 3 | crim | float64 | 504 | 506 | [0.00632, 0.02731, 0.02729, 0.0323699999999999... |
| 4 | zn | float64 | 26 | 506 | [18.0, 0.0, 0.0, 0.0, 0.0] |
| 5 | indus | float64 | 76 | 506 | [2.31, 7.07, 7.07, 2.18, 2.18] |
| 6 | chas | category | 2 | 506 | [0, 0, 0, 0, 0] |
| 7 | nox | float64 | 81 | 506 | [0.5379999999999999, 0.469, 0.469, 0.457999999... |
| 8 | rm | float64 | 446 | 506 | [6.575, 6.421, 7.185, 6.997999999999999, 7.147] |
| 9 | age | float64 | 356 | 506 | [65.2, 78.9, 61.1, 45.8, 54.2] |


Column | Description
:---- | :------------- 
variable | name of variable
dtype | pandas dtype of the variable
lvls | number of unique values of the variable
obs | number of observations
head | first five observations
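
To make the column definitions concrete, here is a minimal pandas sketch that builds a similar summary. The helper `peek_sketch` is hypothetical and is not the package's implementation; in particular, whether `lvls` counts missing values is an assumption.

```python
import pandas as pd

def peek_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one summary row per column of df."""
    rows = []
    for name in df.columns:
        col = df[name]
        rows.append({
            "variable": name,
            "dtype": str(col.dtype),
            "lvls": col.nunique(),         # distinct non-missing values (assumption)
            "obs": len(col),               # number of rows
            "head": col.head(5).tolist(),  # first five observations
        })
    return pd.DataFrame(rows)

demo = pd.DataFrame({"town": ["Nahant", "Swampscott", "Swampscott"],
                     "zn": [18.0, 0.0, 0.0]})
summary = peek_sketch(demo)
print(summary)
```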

### explore() provides various statistics on a dataframe (zeros, inf, missing, levels, dtypes)

In [69]:
et.explore(X)

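|  | variable | obs | q_zer | p_zer | q_na | p_na | q_inf | p_inf | dtype |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| 0 | town | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | object |
| 1 | lon | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |
| 2 | lat | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |
| 3 | crim | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |
| 4 | zn | 506 | 372 | 73.52 | 0 | 0.0 | 0 | 0.0 | float64 |
| 5 | indus | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |
| 6 | chas | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | category |
| 7 | nox | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |
| 8 | rm | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |
| 9 | age | 506 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | float64 |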


Column | Description
:---- | :------------- 
variable | name of variable
obs | number of observations
q\_zer | number of zeros
p\_zer | percentage of zeros
q\_na | number of missing
p\_na | percentage of missing
q\_inf | number of infinite values
p\_inf | percentage of infinite values
dtype | pandas dtype of the variable
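
The counts and percentages above can be reproduced with a few pandas calls. The following is a sketch only: `explore_sketch` is a hypothetical helper, and restricting the zero and infinity checks to numeric columns is an assumption, not confirmed behavior of `explore`.

```python
import numpy as np
import pandas as pd

def explore_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: zero, missing, and infinity counts per column."""
    n = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        q_na = int(col.isna().sum())
        if pd.api.types.is_numeric_dtype(col):
            values = col.to_numpy(dtype=float)
            q_zer = int((values == 0).sum())
            q_inf = int(np.isinf(values).sum())
        else:
            # Assumption: non-numeric columns report no zeros or infinities.
            q_zer, q_inf = 0, 0
        rows.append({
            "variable": name, "obs": n,
            "q_zer": q_zer, "p_zer": round(q_zer / n * 100, 2),
            "q_na": q_na, "p_na": round(q_na / n * 100, 2),
            "q_inf": q_inf, "p_inf": round(q_inf / n * 100, 2),
            "dtype": str(col.dtype),
        })
    return pd.DataFrame(rows)

demo = pd.DataFrame({
    "zn": [18.0, 0.0, 0.0, np.nan],
    "rm": [6.575, np.inf, 7.185, 6.998],
    "town": ["Nahant", "Swampscott", "Swampscott", "Marblehead"],
})
stats = explore_sketch(demo).set_index("variable")
print(stats)
```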

### nested() takes a list, series or dataframe and returns the location of nested objects

In [71]:
import pandas as pd
a = pd.DataFrame({'first' : [1, 2, 3, (1,2,3), 4, 5, 6],
                  'second': [2, 4, 5, [1,3,4], 6, 7, 8]},
                  columns = ['first', 'second'])
print(a)

       first     second
0          1          2
1          2          4
2          3          5
3  (1, 2, 3)  [1, 3, 4]
4          4          6
5          5          7
6          6          8


In [72]:
et.nested(a, retloc = True)

[(3, 0), (3, 1)]

In [73]:
et.nested(a)

True
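
The idea behind the location output can be sketched by scanning every cell and checking its type. `nested_locations` is a hypothetical helper, and the set of types treated as nested here (list, tuple, dict, set) is an assumption about what the package checks.

```python
import pandas as pd

# Assumption: these container types count as "nested" objects.
NESTED_TYPES = (list, tuple, dict, set)

def nested_locations(df: pd.DataFrame) -> list:
    """Return (row, column) positions of cells holding container objects."""
    locs = []
    for r in range(df.shape[0]):
        for c in range(df.shape[1]):
            if isinstance(df.iat[r, c], NESTED_TYPES):
                locs.append((r, c))
    return locs

a = pd.DataFrame({'first' : [1, 2, 3, (1, 2, 3), 4, 5, 6],
                  'second': [2, 4, 5, [1, 3, 4], 6, 7, 8]})
locs = nested_locations(a)
print(locs)        # positions of the tuple and the list in row 3
print(bool(locs))  # True when any nested object exists
```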

### freq() for categorical or ordinal features, provides the count, percent, and cumulative percent for each level

In [74]:
t = et.freq(X['town'])

In [75]:
t

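|  | town | freq | perc | cump |
| :---- | :---- | :---- | :---- | :---- |
| 0 | Cambridge | 30 | 5.93 | 5.93 |
| 1 | Boston Savin Hill | 23 | 4.55 | 10.47 |
| 2 | Lynn | 22 | 4.35 | 14.82 |
| 3 | Boston Roxbury | 19 | 3.75 | 18.58 |
| 4 | Newton | 18 | 3.56 | 22.13 |
| ... | ... | ... | ... | ... |
| 87 | Wenham | 1 | 0.20 | 99.21 |
| 88 | Nahant | 1 | 0.20 | 99.41 |
| 89 | Middleton | 1 | 0.20 | 99.60 |
| 90 | Millis | 1 | 0.20 | 99.80 |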
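
The `freq`, `perc`, and `cump` columns can be derived from `value_counts()`. This is a sketch under assumptions: `freq_sketch` is a hypothetical helper, and computing the cumulative percent from cumulative counts (rather than summing rounded percentages) is inferred from the figures above, not taken from the package source.

```python
import pandas as pd

def freq_sketch(s: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: count, percent, and cumulative percent per level."""
    counts = s.value_counts()  # sorted from most to least frequent
    total = counts.sum()
    out = pd.DataFrame({s.name: counts.index, "freq": counts.to_numpy()})
    out["perc"] = (out["freq"] / total * 100).round(2)
    out["cump"] = (out["freq"].cumsum() / total * 100).round(2)
    return out

towns = pd.Series(["Cambridge"] * 3 + ["Lynn"] * 2 + ["Nahant"], name="town")
table = freq_sketch(towns)
print(table)
```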


### plotfreq() generates a bar plot using the data generated by freq

In [76]:
%matplotlib qt

In [77]:
et.plotfreq(t)

<ggplot: (8781419750747)>

### corrtable() returns a table of all pairwise correlations and uses the average correlation of each variable (its row and column) to decide on potential drop/filter candidates

In [78]:
N = X.select_dtypes('number').copy()

In [79]:
c = et.corrtable(N, cut = 0.5, full = True)

In [80]:
c

|  | v1 | v2 | v1.target | v2.target | corr | drop |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| 52 | nox | dis | 0.578860 | 0.526551 | 0.880015 | nox |
| 25 | crim | nox | 0.562681 | 0.578860 | 0.821465 | nox |
| 63 | age | dis | 0.525682 | 0.526551 | 0.801610 | dis |
| 51 | nox | age | 0.578860 | 0.525682 | 0.795153 | nox |
| 42 | indus | nox | 0.549707 | 0.578860 | 0.791189 | nox |
| ... | ... | ... | ... | ... | ... | ... |
| 8 | lon | tax | 0.242329 | 0.486066 | 0.050237 |  |
| 22 | lat | lstat | 0.159767 | 0.522203 | 0.039065 |  |
| 14 | lat | indus | 0.159767 | 0.549707 | 0.021472 |  |
| 18 | lat | dis | 0.159767 | 0.526551 | 0.012832 |  |


Column | Description
:---- | :------------- 
v1 | variable 1
v2 | variable 2
v1.target | average correlation of v1, used to choose between v1 and v2 when dropping
v2.target | average correlation of v2, used to choose between v1 and v2 when dropping
corr | pairwise correlation based on method
drop | variable to drop when the correlation exceeds the threshold (`cut`); empty otherwise
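
The drop rule can be illustrated with a short pandas sketch: for each pair whose absolute correlation exceeds the cutoff, the variable with the higher average correlation is flagged. `corrtable_sketch` is a hypothetical helper; excluding the diagonal from the average and using absolute values are assumptions, so the numbers need not match `corrtable` exactly.

```python
import itertools
import pandas as pd

def corrtable_sketch(df: pd.DataFrame, cut: float = 0.9,
                     method: str = "pearson") -> pd.DataFrame:
    """Hypothetical helper: pairwise correlations with a drop candidate."""
    corr = df.corr(method=method).abs()
    n = corr.shape[0]
    # Average absolute correlation with the other columns
    # (the self-correlation of 1.0 is excluded; assumption).
    avg = (corr.sum(axis=1) - 1.0) / (n - 1)
    rows = []
    for v1, v2 in itertools.combinations(corr.columns, 2):
        c = corr.loc[v1, v2]
        drop = ""
        if c > cut:
            # Flag the variable that is more correlated with everything else.
            drop = v1 if avg[v1] > avg[v2] else v2
        rows.append({"v1": v1, "v2": v2,
                     "v1.target": avg[v1], "v2.target": avg[v2],
                     "corr": c, "drop": drop})
    return pd.DataFrame(rows).sort_values("corr", ascending=False)

demo = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                     "b": [2, 4, 6, 8, 10, 13],   # nearly a multiple of a
                     "c": [5, 1, 4, 2, 6, 3]})    # weakly related to both
table = corrtable_sketch(demo, cut=0.9)
print(table)
```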

### calcdrop() analyzes corrtable output and determines which features should be filtered/dropped

In [81]:
et.calcdrop(c)

['age', 'dis', 'nox', 'lstat', 'indus', 'crim']

### skewstats() returns the skewness statistics and magnitude for each numeric feature

In [82]:
et.skewstats(N)

|  | dtype | skewness | magnitude |
| :---- | :---- | :---- | :---- |
| crim | float64 | 5.207652 | 2-high |
| zn | float64 | 2.219063 | 2-high |
| dis | float64 | 1.008779 | 2-high |
| b | float64 | -2.881798 | 2-high |
| nox | float64 | 0.727144 | 1-medium |
| age | float64 | -0.597186 | 1-medium |
| tax | int64 | 0.667968 | 1-medium |
| ptratio | float64 | -0.799945 | 1-medium |
| lstat | float64 | 0.903771 | 1-medium |
| lon | float64 | -0.204775 | 0-approx_symmetric |
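
The magnitude labels above are consistent with a common rule of thumb: absolute skewness below 0.5 is roughly symmetric, 0.5 to 1 is moderate, and above 1 is high. The sketch below applies that rule; the exact cutoffs and the use of `DataFrame.skew()` are assumptions, and `magnitude` and `skewstats_sketch` are hypothetical names.

```python
import pandas as pd

def magnitude(skew: float) -> str:
    """Label skewness by absolute size (cutoffs are an assumption)."""
    size = abs(skew)
    if size >= 1:
        return "2-high"
    if size >= 0.5:
        return "1-medium"
    return "0-approx_symmetric"

def skewstats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dtype, skewness, and magnitude per numeric column."""
    out = pd.DataFrame({"dtype": df.dtypes.astype(str),
                        "skewness": df.skew()})
    out["magnitude"] = out["skewness"].map(magnitude)
    return out.sort_values("magnitude", ascending=False)

demo = pd.DataFrame({"even": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "tail": [1.0, 1.0, 1.0, 1.0, 10.0]})
stats = skewstats_sketch(demo)
print(stats)
```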


### ascores() calculates various association scores (kendall, pearson, mic, dcor, spearman) between predictors and target

In [83]:
et.ascores(N,y)

|  | pearson | kendall | spearman | mic | dcor |
| :---- | :---- | :---- | :---- | :---- | :---- |
| lon | 0.322947 | 0.278908 | 0.42094 | 0.379753 | 0.435849 |
| lat | 0.006826 | 0.013724 | 0.02142 | 0.234796 | 0.16703 |
| crim | 0.389582 | 0.406992 | 0.562982 | 0.375832 | 0.528595 |
| zn | 0.360386 | 0.340738 | 0.438768 | 0.290145 | 0.404253 |
| indus | 0.484754 | 0.420263 | 0.580004 | 0.41414 | 0.543948 |
| nox | 0.4293 | 0.398342 | 0.565899 | 0.442515 | 0.523653 |
| rm | 0.696304 | 0.485182 | 0.635092 | 0.46161 | 0.711034 |
| age | 0.377999 | 0.391067 | 0.551747 | 0.414676 | 0.480248 |
| dis | 0.249315 | 0.313745 | 0.446392 | 0.316136 | 0.382746 |
| tax | 0.471979 | 0.418005 | 0.566999 | 0.336899 | 0.518158 |

### ColumnSelect() is a custom transformer that selects columns for pipelines

In [84]:
categorical_columns = ['rad', 'town']

In [85]:
cs = et.ColumnSelect(categorical_columns).fit_transform(X)

In [86]:
print(cs)

    rad        town
0     1      Nahant
1     2  Swampscott
2     2  Swampscott
3     3  Marblehead
4     3  Marblehead
..   ..         ...
501   1    Winthrop
502   1    Winthrop
503   1    Winthrop
504   1    Winthrop
505   1    Winthrop

[506 rows x 2 columns]
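
A column-selecting transformer needs little more than `fit` and `transform`. The class below is a dependency-light sketch with a hypothetical name, not the package's code; a real scikit-learn transformer would normally inherit from `BaseEstimator` and `TransformerMixin` so that it can be cloned inside a `Pipeline`.

```python
import pandas as pd

class ColumnSelectSketch:
    """Hypothetical transformer that keeps only the requested columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X: pd.DataFrame, y=None):
        # Nothing is learned; only check that the columns exist.
        missing = [c for c in self.columns if c not in X.columns]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X[self.columns].copy()

    def fit_transform(self, X: pd.DataFrame, y=None) -> pd.DataFrame:
        return self.fit(X, y).transform(X)

demo = pd.DataFrame({"rad": [1, 2, 2],
                     "town": ["Nahant", "Swampscott", "Swampscott"],
                     "zn": [18.0, 0.0, 0.0]})
selected = ColumnSelectSketch(["rad", "town"]).fit_transform(demo)
print(selected)
```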


### CategoricalOtherLevel() is a custom transformer that creates an "other" level in categorical / ordinal data based on a threshold
Note: Setting threshold = 0 also lets this transformer create an "other" level that accounts for novel categories appearing in test or unseen data.

In [87]:
co = et.CategoricalOtherLevel(colname = 'town', threshold = 0.015).fit_transform(cs)

In [88]:
print(co)

    rad   town
0     1  other
1     2  other
2     2  other
3     3  other
4     3  other
..   ..    ...
501   1  other
502   1  other
503   1  other
504   1  other
505   1  other

[506 rows x 2 columns]
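
The thresholding idea can be sketched as follows: `fit` records the levels whose share of the training rows meets the threshold, and `transform` maps every other value, including levels never seen in training, to "other". The class name, the `>=` comparison, and the assumption that the column holds plain strings rather than a pandas categorical are all illustrative choices.

```python
import pandas as pd

class OtherLevelSketch:
    """Hypothetical transformer that lumps rare or unseen levels into 'other'."""

    def __init__(self, colname: str, threshold: float):
        self.colname = colname
        self.threshold = threshold

    def fit(self, X: pd.DataFrame, y=None):
        share = X[self.colname].value_counts(normalize=True)
        # Keep levels whose share of rows meets the threshold (assumption: >=).
        self.keep_ = set(share[share >= self.threshold].index)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        out = X.copy()
        col = out[self.colname]
        out[self.colname] = col.where(col.isin(self.keep_), "other")
        return out

train = pd.DataFrame({"town": ["Cambridge"] * 6 + ["Lynn"] * 3 + ["Nahant"]})
test = pd.DataFrame({"town": ["Cambridge", "Nahant", "Salem"]})

# threshold = 0.2 keeps Cambridge (60%) and Lynn (30%); Nahant (10%) is lumped.
rare = OtherLevelSketch("town", 0.2).fit(train).transform(test)
# threshold = 0 keeps every training level; only the unseen "Salem" is lumped.
novel = OtherLevelSketch("town", 0).fit(train).transform(test)
print(rare["town"].tolist())
print(novel["town"].tolist())
```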


### CorrelationFilter() is a custom transformer that filters numeric features based on pairwise correlation

In [89]:
cf = et.CorrelationFilter(cut = 0.5).fit_transform(N)

In [90]:
print(cf)

         lon      lat     crim    zn     rm  ptratio       b
0   -70.9550  42.2550  0.00632  18.0  6.575     15.3  396.90
1   -70.9500  42.2875  0.02731   0.0  6.421     17.8  396.90
2   -70.9360  42.2830  0.02729   0.0  7.185     17.8  392.83
3   -70.9280  42.2930  0.03237   0.0  6.998     18.7  394.63
4   -70.9220  42.2980  0.06905   0.0  7.147     18.7  396.90
..       ...      ...      ...   ...    ...      ...     ...
501 -70.9860  42.2312  0.06263   0.0  6.593     21.0  391.99
502 -70.9910  42.2275  0.04527   0.0  6.120     21.0  396.90
503 -70.9948  42.2260  0.06076   0.0  6.976     21.0  396.90
504 -70.9875  42.2240  0.10959   0.0  6.794     21.0  393.45
505 -70.9825  42.2210  0.04741   0.0  6.030     21.0  396.90

[506 rows x 7 columns]
