### Installation and import

In [1]:
#!pip install exploretransform


In [3]:
import exploretransform as et

### How to use exploretransform

Let's load the Boston corrected dataset and get started.

In [4]:
df, X, y = et.loadboston()

At this stage, I like to check that the data types align with the data dictionary and the first five observations.  The number of levels (lvls) can also indicate potential categorical features or features with high cardinality.  Any dates that need reformatting can be detected here as well.  We can use peek() to provide the statistics needed.

In [7]:
et.peek(X)

Unnamed: 0,variable,dtype,lvls,obs,head
0,town,object,92,506,"[Nahant, Swampscott, Swampscott, Marblehead, M..."
1,lon,float64,375,506,"[-70.955, -70.95, -70.936, -70.928, -70.922]"
2,lat,float64,376,506,"[42.255, 42.2875, 42.283, 42.293, 42.298]"
3,crim,float64,504,506,"[0.00632, 0.02731, 0.02729, 0.0323699999999999..."
4,zn,float64,26,506,"[18.0, 0.0, 0.0, 0.0, 0.0]"
5,indus,float64,76,506,"[2.31, 7.07, 7.07, 2.18, 2.18]"
6,chas,category,2,506,"[0, 0, 0, 0, 0]"
7,nox,float64,81,506,"[0.5379999999999999, 0.469, 0.469, 0.457999999..."
8,rm,float64,446,506,"[6.575, 6.421, 7.185, 6.997999999999999, 7.147]"
9,age,float64,356,506,"[65.2, 78.9, 61.1, 45.8, 54.2]"
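
To make the columns of this output concrete, here is a minimal plain-Python sketch of the same kind of summary. It is not the package's implementation: `peek_columns` and the toy data are hypothetical, and the "dtype" is simply the Python type of the first value.

```python
def peek_columns(columns, n_head=5):
    """Summarise each column: value type, distinct levels, row count, first rows."""
    rows = []
    for name, values in columns.items():
        rows.append({
            "variable": name,
            "dtype": type(values[0]).__name__,  # type of the first value only
            "lvls": len(set(values)),           # number of distinct values
            "obs": len(values),                 # number of observations
            "head": values[:n_head],            # first n_head values
        })
    return rows


# Toy data (hypothetical, not the Boston dataset)
toy = {
    "town": ["Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead", "Salem"],
    "zn": [18.0, 0.0, 0.0, 0.0, 0.0, 12.5],
    "chas": [0, 0, 0, 0, 0, 1],
}

for row in peek_columns(toy):
    print(row)
```

A column with few levels relative to its observations (like chas here) is a candidate categorical feature, while a text column with many levels suggests high cardinality.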


After analyzing the data types, we can use explore() to identify missing, zero, and infinity values.

In [8]:
et.explore(X)

Unnamed: 0,variable,obs,q_zer,p_zer,q_na,p_na,q_inf,p_inf,dtype
0,town,506,0,0.0,0,0.0,0,0.0,object
1,lon,506,0,0.0,0,0.0,0,0.0,float64
2,lat,506,0,0.0,0,0.0,0,0.0,float64
3,crim,506,0,0.0,0,0.0,0,0.0,float64
4,zn,506,372,73.52,0,0.0,0,0.0,float64
5,indus,506,0,0.0,0,0.0,0,0.0,float64
6,chas,506,0,0.0,0,0.0,0,0.0,category
7,nox,506,0,0.0,0,0.0,0,0.0,float64
8,rm,506,0,0.0,0,0.0,0,0.0,float64
9,age,506,0,0.0,0,0.0,0,0.0,float64
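
The q_ columns above are counts and the p_ columns are percentages of obs (for zn, 372 of 506 rows is about 73.52%). The sketch below reproduces that arithmetic in plain Python for a single column. `explore_column` is a hypothetical helper, and treating both None and NaN as missing is an assumption, not necessarily what explore() does.

```python
import math


def explore_column(values):
    """Count zero, missing (None or NaN), and infinite entries in one column."""
    n = len(values)
    q_zer = sum(1 for v in values if v == 0)
    q_na = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    q_inf = sum(1 for v in values if isinstance(v, float) and math.isinf(v))

    def pct(q):
        # Percentage of observations, rounded to two decimals
        return round(100 * q / n, 2)

    return {
        "obs": n,
        "q_zer": q_zer, "p_zer": pct(q_zer),
        "q_na": q_na, "p_na": pct(q_na),
        "q_inf": q_inf, "p_inf": pct(q_inf),
    }


# Toy column (hypothetical values)
sample = [0.0, 18.0, 0.0, float("nan"), float("inf"), 12.5, None, 0.0]
summary = explore_column(sample)
print(summary)
```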


Earlier, we saw that town was likely a categorical feature with high cardinality.  We can use freq() to analyze categorical or ordinal features; it provides the count, percent, and cumulative percent for each level.

In [9]:
t = et.freq(X['town'])

In [10]:
t

Unnamed: 0,town,freq,perc,cump
0,Cambridge,30,5.93,5.93
1,Boston Savin Hill,23,4.55,10.47
2,Lynn,22,4.35,14.82
3,Boston Roxbury,19,3.75,18.58
4,Newton,18,3.56,22.13
...,...,...,...,...
87,Cohasset,1,0.20,99.21
88,Middleton,1,0.20,99.41
89,Hamilton,1,0.20,99.60
90,Medfield,1,0.20,99.80
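
As a rough illustration of how the freq, perc, and cump columns relate, here is a plain-Python sketch using `collections.Counter`. `freq_table` and the toy list are hypothetical. The cumulative percent is computed from the running count rather than by summing rounded percentages, which is consistent with the output above (5.93 + 4.55 would give 10.48, not 10.47).

```python
from collections import Counter


def freq_table(values):
    """Return count, percent, and cumulative percent per level, most frequent first."""
    total = len(values)
    rows, running = [], 0
    for level, count in Counter(values).most_common():
        running += count
        rows.append({
            "level": level,
            "freq": count,
            "perc": round(100 * count / total, 2),
            "cump": round(100 * running / total, 2),  # from the running count
        })
    return rows


# Toy data (hypothetical)
towns = ["Cambridge"] * 3 + ["Lynn"] * 2 + ["Newton"]
table = freq_table(towns)
for row in table:
    print(row)
```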


To visualize the results of freq(), we can use plotfreq().  It generates a bar plot showing the levels in descending order.

In [11]:
%matplotlib qt

In [48]:
et.plotfreq(t)

[Plot output: bar plot of the town levels in descending order]

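To pair with the histograms you would normally examine, skewstats() returns the skewness statistic and its magnitude for each numeric feature.  This function becomes more useful when you have too many features to plot.  Here N holds the numeric columns of X, created with X.select_dtypes('number').copy() as in the corrtable() example further down.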

In [22]:
et.skewstats(N)

Unnamed: 0,dtype,skewness,magnitude
crim,float64,5.207652,2-high
zn,float64,2.219063,2-high
dis,float64,1.008779,2-high
b,float64,-2.881798,2-high
nox,float64,0.727144,1-medium
age,float64,-0.597186,1-medium
tax,int64,0.667968,1-medium
ptratio,float64,-0.799945,1-medium
lstat,float64,0.903771,1-medium
lon,float64,-0.204775,0-approx_symmetric
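
The magnitude labels above look consistent with a common rule of thumb: an absolute skewness below 0.5 is approximately symmetric, 0.5 to 1 is medium, and 1 or more is high. The sketch below assumes those cut-offs and uses the simple population skewness formula; skewstats() may use a different estimator, so treat the function names and thresholds as assumptions.

```python
def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def magnitude(skew):
    """Label the skewness using assumed cut-offs of 0.5 and 1.0."""
    size = abs(skew)
    if size >= 1.0:
        return "2-high"
    if size >= 0.5:
        return "1-medium"
    return "0-approx_symmetric"


# Toy data (hypothetical)
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 20]

print(magnitude(skewness(symmetric)))
print(magnitude(skewness(right_tailed)))
```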


To measure the association between the predictors and the target, ascores() calculates pearson, kendall, spearman, mic, and dcor.  A variety of scores is useful since some of them measure linear associations while others detect non-linear relationships.

In [83]:
et.ascores(N,y)

Unnamed: 0,pearson,kendall,spearman,mic,dcor
lon,0.322947,0.278908,0.42094,0.379753,0.435849
lat,0.006826,0.013724,0.02142,0.234796,0.16703
crim,0.389582,0.406992,0.562982,0.375832,0.528595
zn,0.360386,0.340738,0.438768,0.290145,0.404253
indus,0.484754,0.420263,0.580004,0.41414,0.543948
nox,0.4293,0.398342,0.565899,0.442515,0.523653
rm,0.696304,0.485182,0.635092,0.46161,0.711034
age,0.377999,0.391067,0.551747,0.414676,0.480248
dis,0.249315,0.313745,0.446392,0.316136,0.382746
tax,0.471979,0.418005,0.566999,0.336899,0.518158


Correlation matrices can get unwieldy once we hit a certain number of features.  While the Boston dataset is well below this threshold, one can imagine that a table might be more useful than a matrix when dealing with high dimensionality.  corrtable() returns a table of all pairwise correlations and uses the average correlation for the row and column to decide on potential drop/filter candidates.  You can use any of the methods you normally would with the pandas corr function:

* pearson
* kendall    
* spearman  
* callable

This function is used by the CorrelationFilter() class to help determine which columns should be dropped.

In [13]:
N = X.select_dtypes('number').copy()

In [17]:
c = et.corrtable(N, cut = 0.5, full= True, methodx = 'pearson')

In [20]:
c

Unnamed: 0,v1,v2,v1.target,v2.target,corr,drop
52,nox,dis,0.504729,0.471084,0.769230,nox
42,indus,nox,0.509010,0.504729,0.763651,indus
63,age,dis,0.478022,0.471084,0.747881,age
51,nox,age,0.504729,0.478022,0.731470,nox
46,indus,tax,0.509010,0.483391,0.720760,indus
...,...,...,...,...,...,...
22,lat,lstat,0.155477,0.484758,0.045660,
14,lat,indus,0.155477,0.509010,0.041093,
10,lon,b,0.207700,0.315351,0.018300,
7,lon,dis,0.207700,0.471084,0.011243,


Column | Description
:---- | :------------- 
v1 | variable 1
v2 | variable 2
v1.target | metric used to compare v1 and v2 for drop
v2.target | metric used to compare v1 and v2 for drop
corr | pairwise correlation based on method
drop | the variable to drop when the correlation exceeds the threshold (cut)
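
The idea behind the table can be sketched in plain Python: compute every pairwise correlation, and when a pair exceeds the cut, flag the variable whose average correlation with the other variables is higher. This matches the pattern in the output above (nox is flagged over dis because 0.504729 > 0.471084), but the details are assumptions: absolute Pearson correlation, averages that exclude the self-correlation, and the hypothetical names `corr_table` and `pearson`.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def corr_table(columns, cut=0.5):
    """One row per variable pair, with a drop candidate when corr > cut."""
    names = list(columns)
    corr = {
        (a, b): abs(pearson(columns[a], columns[b]))
        for a, b in combinations(names, 2)
    }

    def pair(a, b):
        return corr[(a, b)] if (a, b) in corr else corr[(b, a)]

    # Average absolute correlation of each variable with all the others
    avg = {
        a: sum(pair(a, b) for b in names if b != a) / (len(names) - 1)
        for a in names
    }
    rows = []
    for (a, b), r in corr.items():
        drop = ""
        if r > cut:
            drop = a if avg[a] > avg[b] else b  # flag the more correlated one
        rows.append({"v1": a, "v2": b, "v1.target": avg[a],
                     "v2.target": avg[b], "corr": r, "drop": drop})
    return sorted(rows, key=lambda row: row["corr"], reverse=True)


# Toy data (hypothetical): a and b are nearly collinear, c is unrelated
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
}
table = corr_table(data, cut=0.5)
for row in table:
    print(row)
```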

Based on the output of corrtable(), calcdrop() determines which features should be dropped.

In [21]:
et.calcdrop(c)

['age', 'indus', 'nox', 'dis', 'lstat', 'tax']

ColumnSelect() is a custom transformer that selects columns for pipelines.

In [23]:
categorical_columns = ['rad', 'town']

In [24]:
cs = et.ColumnSelect(categorical_columns).fit_transform(X)

In [47]:
cs

Unnamed: 0,rad,town
0,1,other
1,2,other
2,2,other
3,3,other
4,3,other
...,...,...
501,1,other
502,1,other
503,1,other
504,1,other


CategoricalOtherLevel() is a custom transformer that creates an "other" level in categorical / ordinal data based on a threshold.  This is useful in situations where you have high cardinality predictors and when new categories may appear in future data.

In [31]:
co = et.CategoricalOtherLevel(colname = 'town', threshold = 0.015).fit_transform(cs)

In [42]:
co.iloc[0:15, :]

Unnamed: 0,rad,town
0,1,other
1,2,other
2,2,other
3,3,other
4,3,other
5,3,other
6,5,other
7,5,other
8,5,other
9,5,other
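
The thresholding idea can be sketched in plain Python: learn which levels are frequent enough on the training data, then map everything else, including levels never seen before, to "other". The function names are hypothetical, and keeping levels whose share is at or above the threshold is an assumption about the exact rule.

```python
from collections import Counter


def fit_levels(values, threshold):
    """Return the set of levels whose share of observations meets the threshold."""
    total = len(values)
    return {
        level for level, count in Counter(values).items()
        if count / total >= threshold
    }


def apply_other(values, keep, other="other"):
    """Replace any level outside the kept set (rare or unseen) with `other`."""
    return [v if v in keep else other for v in values]


# Toy training data (hypothetical): shares are 0.6, 0.3, and 0.1
train = ["Cambridge"] * 6 + ["Lynn"] * 3 + ["Nahant"]
keep = fit_levels(train, threshold=0.2)

# New data containing a rare level and a level not present in training
new = ["Cambridge", "Nahant", "Salem"]
print(apply_other(new, keep))
```

Learning the kept levels once and reusing them at transform time is what lets unseen categories fall into the same "other" bucket.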


CorrelationFilter() is a custom transformer that filters numeric features based on pairwise correlation.  It uses corrtable() and calcdrop() to perform the drop evaluations and calculations.  For more information on how it works, please see:

[Are you dropping too many correlated features?](https://towardsdatascience.com/are-you-dropping-too-many-correlated-features-d1c96654abe6)

In [45]:
cf = et.CorrelationFilter(cut = 0.5).fit_transform(N)

In [46]:
cf

Unnamed: 0,lon,lat,crim,zn,rm,ptratio,b
0,-70.9550,42.2550,0.00632,18.0,6.575,15.3,396.90
1,-70.9500,42.2875,0.02731,0.0,6.421,17.8,396.90
2,-70.9360,42.2830,0.02729,0.0,7.185,17.8,392.83
3,-70.9280,42.2930,0.03237,0.0,6.998,18.7,394.63
4,-70.9220,42.2980,0.06905,0.0,7.147,18.7,396.90
...,...,...,...,...,...,...,...
501,-70.9860,42.2312,0.06263,0.0,6.593,21.0,391.99
502,-70.9910,42.2275,0.04527,0.0,6.120,21.0,396.90
503,-70.9948,42.2260,0.06076,0.0,6.976,21.0,396.90
504,-70.9875,42.2240,0.10959,0.0,6.794,21.0,393.45
