# Selecting an alternative type for a column

Sometimes ptype infers a type for a column that we know to be incorrect. In that case we can select a different column type and still take advantage of ptype's per-row type inference (conditional on the new column type) to identify anomalous and missing values. To demonstrate this functionality, we present two use cases, summarized below:
- A toy example: We run ptype on a toy column built from 4-digit years (e.g., 1991) as normal values and one 2-digit year (e.g., 99) as an anomalous value. We assume that the user runs ptype on this data frame and inspects the inferred schema, which shows that the column is classified as integer rather than date-iso-8601. The user therefore asks ptype to reclassify the column as date-iso-8601, and this feedback lets ptype detect an anomalous entry that it could not detect before.
- A real-world example: We measure the association between two non-numerical columns of the Grub Damage dataset, a collection of information about grass grub numbers, irrigation and damage. We assume that the user loads the dataset into a Pandas DataFrame using `read_csv` and transforms the data frame using ptype in order to calculate the Cramér's V statistic between these columns. However, ptype misclassifies one of the columns, which causes the statistic to be undefined. The user therefore needs to correct ptype's prediction before the statistic can be calculated.

In [1]:
# Preamble to run notebook in context of source package.
import sys
sys.path.insert(0, '../')

### Toy Example
Here, we construct a Pandas DataFrame with nine entries representing the years 1991 to 1999, where 1999 is encoded as 99. Note that we set `dtype` to `str` so that ptype can infer a "schema" based on this untyped (string) representation.

In [2]:
import pandas as pd

x = ['1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '99']
column = 'year'

df = pd.DataFrame(x, dtype='str', columns=[column])
df

Unnamed: 0,year
0,1991
1,1992
2,1993
3,1994
4,1995
5,1996
6,1997
7,1998
8,99


First, we show that the presence of 99 prevents ptype from labelling the column with the date type. 

ptype can infer a “schema” that specifies the most likely type for the column and additional relevant metadata about missing or anomalous values.

In [3]:
from ptype.Ptype import Ptype

ptype = Ptype()
ptype.schema_fit(df)
ptype.show_schema()

Unnamed: 0,year
type,integer
normal values,"[1991, 1992, 1993, 1994, 1995, 1996, 1997, 199..."
ratio of normal values,1
missing values,[]
ratio of missing values,0
anomalous values,[]
ratio of anomalous values,0


Notice that the column's type is reported as integer and no missing or anomalous entries are detected. 

We can now change how the column is interpreted by interacting with ptype:

In [4]:
ptype.cols[column].reclassify('date-iso-8601')

As a result of this interaction, the inferred schema is modified as follows:

In [5]:
ptype.show_schema()

Unnamed: 0,year
type,date-iso-8601
normal values,"[1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998]"
ratio of normal values,0.89
missing values,[]
ratio of missing values,0
anomalous values,[99]
ratio of anomalous values,0.11


We observe that the column is correctly classified as date-iso-8601, which in turn lets ptype update its beliefs over anomalous values, i.e., 99 is now detected as an anomalous entry.
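
The effect of conditioning per-row inference on the column type can be sketched with a deliberately simplified stand-in. The regular expressions and the `split_by_type` helper below are hypothetical, chosen only for illustration; they are not ptype's type models, and the `year` pattern covers only the 4-digit years used in this toy column.

```python
import re

# Hypothetical patterns for illustration only; not ptype's internal type models.
PATTERNS = {
    "integer": re.compile(r"\d+"),
    "year": re.compile(r"\d{4}"),  # 4-digit years only, as in the toy column
}

def split_by_type(values, type_name):
    """Split string values into those that fit the given type and those that do not."""
    pattern = PATTERNS[type_name]
    normal = [v for v in values if pattern.fullmatch(v)]
    anomalous = [v for v in values if not pattern.fullmatch(v)]
    return normal, anomalous

values = ['1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '99']

# Every entry is a valid integer, so nothing stands out under that type.
normal_as_integer, anomalous_as_integer = split_by_type(values, "integer")
print(anomalous_as_integer)  # []

# Under the stricter year pattern, the 2-digit entry no longer fits.
normal_as_year, anomalous_as_year = split_by_type(values, "year")
print(anomalous_as_year)  # ['99']
```

The same value is unremarkable under one column type and anomalous under another, which is why correcting the column type changes the row-level results.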

### Real-world Example
In this example, we use the Grub Damage dataset to analyze the relationship between grass grub numbers, irrigation and damage.

Let us simplify the problem and consider the task of finding the association between the zone and GG_new columns.

In [6]:
df = pd.read_csv('../data/grub-damage.csv', encoding="ISO-8859-1", dtype='str')
df.head()

Unnamed: 0,year_zone,year,strip,pdk,damage_rankRJT,damage_rankALL,dry_or_irr,zone,GG_new
0,6f,86,3,1,1,0,D,F,low
1,6f,86,3,2,0,0,D,F,high
2,6f,86,3,3,1,1,D,F,high
3,6f,86,3,4,1,0,D,F,high
4,6f,86,3,5,0,0,D,F,low


First, we use ptype to inspect the properties of this dataset and transform it accordingly. 

In [7]:
ptype = Ptype()

schema = ptype.schema_fit(df)
ptype.show_schema()

Unnamed: 0,year_zone,year,strip,pdk,damage_rankRJT,damage_rankALL,dry_or_irr,zone,GG_new
type,string,integer,integer,integer,integer,integer,string,boolean,string
normal values,"[0c, 0f, 0m, 1c, 1f, 1m, 2c, 2f, 2m, 6c, 6f, 6...","[86, 87, 88, 89, 90, 91, 92]","[1, 10, 2, 3, 4, 5, 6, 7, 9]","[0, 1, 2, 3, 4, 5]","[0, 1, 2, 3, 4, 5]","[0, 1, 2, 3, 4, 5]","[B, D, O]",[F],"[average, high, low, veryhigh]"
ratio of normal values,1,1,1,1,1,1,1,0.46,1
missing values,[],[],[],[],[],[],[],[],[]
ratio of missing values,0,0,0,0,0,0,0,0,0
anomalous values,[],[],[],[],[],[],[],"[C, M]",[]
ratio of anomalous values,0,0,0,0,0,0,0,0.54,0


As you can see, ptype predicts the type of the zone column as boolean and labels the values C and M as anomalies. We can confirm that these are in fact normal values using the corresponding metadata, which states "8. zone - position of paddock (F: foothills, M: midplain, C: coastal) - enumerated".
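
This kind of check against the documentation can also be done programmatically. A minimal sketch, with the documented codes and the flagged values copied by hand from the metadata line and the schema above (the variable names are ours):

```python
# Zone codes listed in the dataset's metadata (F: foothills, M: midplain, C: coastal).
documented_zones = {"F", "M", "C"}

# Values that the schema above reports as anomalous for the zone column.
flagged_as_anomalous = ["C", "M"]

# A flagged value that is a documented code points to a wrong column type
# rather than to a genuinely anomalous entry.
false_alarms = [v for v in flagged_as_anomalous if v in documented_zones]
print(false_alarms)  # ['C', 'M']
```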

Without interacting with ptype, we obtain the following data frame:

In [8]:
df_transformed = ptype.schema_transform(df, schema)
df_transformed

Unnamed: 0,year_zone,year,strip,pdk,damage_rankRJT,damage_rankALL,dry_or_irr,zone,GG_new
0,6f,86,3,1,1,0,D,False,low
1,6f,86,3,2,0,0,D,False,high
2,6f,86,3,3,1,1,D,False,high
3,6f,86,3,4,1,0,D,False,high
4,6f,86,3,5,0,0,D,False,low
...,...,...,...,...,...,...,...,...,...
150,2c,92,9,4,1,1,B,,average
151,2c,92,10,1,3,3,O,,high
152,2c,92,10,2,1,1,D,,average
153,2c,92,10,3,2,2,O,,average


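Therefore, the Cramér's V statistic between the zone and GG_new columns is undefined: the values C and M, treated as anomalous, become missing in the transformed zone column, which leaves it with a single category (False).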
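
For reference, the bias-corrected statistic computed by the `cramers_corrected_stat` function in the next cell can be written as follows, where $\chi^2$ is the chi-squared statistic of the $r \times k$ contingency table and $n$ is the total number of observations. The tilde notation is ours, introduced only to mirror the variable names `phi2corr`, `rcorr` and `kcorr` in the code.

$$
\varphi^2 = \frac{\chi^2}{n}, \qquad
\tilde{\varphi}^2 = \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right),
$$

$$
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}.
$$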

In [9]:
# NBVAL_IGNORE_OUTPUT
import numpy as np
import scipy.stats as ss

# see https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
def cramers_corrected_stat(x, y):
    """ Calculate the Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2 and bias-corrected table dimensions.
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))


cramers_corrected_stat(df_transformed['zone'], df_transformed['GG_new'])



nan
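
The `nan` can be traced to the correction terms in `cramers_corrected_stat`. A minimal sketch, assuming the transformed zone column contributes a single category to the contingency table (r = 1); the helper name and the values k = 4 and n = 70 are illustrative and not taken from the data:

```python
def corrected_denominator(r, k, n):
    """Denominator of the bias-corrected statistic for an r x k table with n observations."""
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return min(kcorr - 1, rcorr - 1)

# A single row category gives rcorr == 1, so the denominator vanishes.
print(corrected_denominator(r=1, k=4, n=70))  # 0.0

# With three row categories the denominator is strictly positive.
print(corrected_denominator(r=3, k=4, n=70) > 0)  # True
```

With a single-row table the corrected numerator is also 0, so the function presumably ends up evaluating 0/0, which NumPy reports as `nan`.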

Let us now interact with ptype to fix its prediction for the zone column.

In [10]:
ptype.cols['zone'].reclassify('string')
ptype.show_schema()

Unnamed: 0,year_zone,year,strip,pdk,damage_rankRJT,damage_rankALL,dry_or_irr,zone,GG_new
type,string,integer,integer,integer,integer,integer,string,string,string
normal values,"[0c, 0f, 0m, 1c, 1f, 1m, 2c, 2f, 2m, 6c, 6f, 6...","[86, 87, 88, 89, 90, 91, 92]","[1, 10, 2, 3, 4, 5, 6, 7, 9]","[0, 1, 2, 3, 4, 5]","[0, 1, 2, 3, 4, 5]","[0, 1, 2, 3, 4, 5]","[B, D, O]","[C, F, M]","[average, high, low, veryhigh]"
ratio of normal values,1,1,1,1,1,1,1,1,1
missing values,[],[],[],[],[],[],[],[],[]
ratio of missing values,0,0,0,0,0,0,0,0,0
anomalous values,[],[],[],[],[],[],[],[],[]
ratio of anomalous values,0,0,0,0,0,0,0,0,0


As we can see, the predicted type of the zone column is now correct. The per-row predictions are updated accordingly: C and M are no longer flagged as anomalous.

In [11]:
# we use the updated schema
schema = ptype.cols
df_transformed = ptype.schema_transform(df, schema)
df_transformed

Unnamed: 0,year_zone,year,strip,pdk,damage_rankRJT,damage_rankALL,dry_or_irr,zone,GG_new
0,6f,86,3,1,1,0,D,F,low
1,6f,86,3,2,0,0,D,F,high
2,6f,86,3,3,1,1,D,F,high
3,6f,86,3,4,1,0,D,F,high
4,6f,86,3,5,0,0,D,F,low
...,...,...,...,...,...,...,...,...,...
150,2c,92,9,4,1,1,B,C,average
151,2c,92,10,1,3,3,O,C,high
152,2c,92,10,2,1,1,D,C,average
153,2c,92,10,3,2,2,O,C,average


We can now calculate the Cramér's V statistic:

In [12]:
cramers_corrected_stat(df_transformed['zone'], df_transformed['GG_new'])

0.3074039662588285