# Selecting an alternative type for a column

Sometimes ptype infers a type for a column that we know to be incorrect. In that case we can select a different column type and still take advantage of ptype's per-row type inference (conditional on the new column type) to identify anomalous and missing values.

We present two use cases, summarized below:
- A toy example: We run ptype on a toy column made up of 4-digit years as normal values and one 2-digit year as an anomalous entry. We assume that the user runs ptype on this data frame and then inspects the inferred schema. The schema shows that the column is classified as integer, whereas the correct column type is date-iso-8601. The user therefore asks ptype to reclassify the column as date-iso-8601, and this feedback lets ptype detect an anomalous entry that it could not detect before.
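- A real-world example: We run ptype on the Grub Damage dataset, where the zone column is classified as boolean and the values C and M are labelled as anomalies. Reclassifying the column as string lets us calculate the Cramér's V statistic between the zone and GG_new columns.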



In [1]:
# Preamble to run notebook in context of source package.
import sys
sys.path.insert(0, '../')

In [None]:
from IPython.display import display
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcdefaults()
import numpy as np
import pandas as pd

from ptype.Ptype import Ptype
from utils import plot_column_type_posterior, plot_arff_type_posterior, subsample_df

### Toy Example

In [None]:
x = ['1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '90']
column = 'year'

df = pd.DataFrame(x, dtype='str', columns=[column])
df

In [None]:
ptype = Ptype()

ptype.schema_fit(df)
ptype.show_schema()

In [None]:
ptype.cols[column].reclassify('date-iso-8601')

In [None]:
ptype.show_schema()
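
To see why reclassifying the column changes which rows are flagged, here is a minimal sketch. It is not ptype's actual inference, which is probabilistic; it assumes two hypothetical hand-written patterns, one for integers and one for a year-only ISO 8601 date (`YYYY`), and flags every entry that does not match the pattern for the chosen column type. The names `TYPE_PATTERNS` and `flag_anomalies` are made up for this illustration.

```python
import re

# Hypothetical, simplified stand-ins for two column types. ptype itself
# uses probabilistic models rather than hard pattern matches.
TYPE_PATTERNS = {
    "integer": re.compile(r"[+-]?\d+"),
    "date-iso-8601": re.compile(r"\d{4}"),  # year-only form, YYYY
}


def flag_anomalies(values, column_type):
    """Return the entries that do not fit the given column type."""
    pattern = TYPE_PATTERNS[column_type]
    return [v for v in values if pattern.fullmatch(v) is None]


years = ['1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '90']

# Conditional on "integer", every entry is a valid value.
anomalies_as_integer = flag_anomalies(years, "integer")
# Conditional on "date-iso-8601", the 2-digit year no longer fits.
anomalies_as_date = flag_anomalies(years, "date-iso-8601")

print(anomalies_as_integer)
print(anomalies_as_date)
```

The same entry, '90', is unremarkable as an integer and anomalous as a date, which is the effect the reclassification above relies on.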

### Real-world Example
In this example, we use the Grub Damage dataset to analyze the relationship between grass grub numbers, irrigation and damage.

Let us simplify the problem and consider the task of finding the association between the zone and GG_new columns.

In [None]:
df = pd.read_csv('../data/grub-damage.csv', encoding="ISO-8859-1", dtype='str')
df.head()

First, we use ptype to inspect the properties of this dataset and transform it accordingly. 

In [None]:
ptype = Ptype()

schema = ptype.schema_fit(df)
ptype.show_schema()

As you can see, ptype predicts the data type of the zone column as boolean and labels the values C and M as anomalies. However, the corresponding metadata confirms that these are normal values: "8. zone - position of paddock (F: foothills, M: midplain, C: coastal) - enumerated".

If we did not interact with ptype, we would obtain the following data frame.

In [None]:
df_transformed = ptype.schema_transform(df, schema)
df_transformed

Therefore, the Cramér's V statistic between the zone and GG_new columns would be undefined due to the anomalous values.
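
For reference, the bias-corrected statistic computed by the `cramers_corrected_stat` function in the following cell can be written as below, where $\chi^2$ is the chi-squared statistic of the $r \times k$ contingency table and $n$ is the total number of observations. The tilde notation is our own shorthand for the corrected quantities.

```latex
\tilde{\varphi}^2 = \max\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right),
\qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1},
\qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1},

\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min\left(\tilde{k}-1,\ \tilde{r}-1\right)}}
```

If one of the two columns retains only a single distinct value, then $r = 1$ (or $k = 1$), so both the numerator and the denominator are zero. This is one way in which the statistic can become undefined.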

In [None]:
# NBVAL_IGNORE_OUTPUT

import scipy.stats as ss

def cramers_corrected_stat(x, y):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2 and bias-corrected table dimensions.
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))


cramers_corrected_stat(df_transformed['zone'], df_transformed['GG_new'])
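
The following is a minimal, standard-library-only sketch of the same bias-corrected statistic, included to illustrate its behaviour on small made-up inputs. The function name `cramers_v_corrected` and all labels are hypothetical and are not taken from the Grub Damage data. Unlike `scipy.stats.chi2_contingency`, which applies Yates' continuity correction to 2x2 tables by default, the sketch applies no continuity correction, so results on 2x2 tables can differ from the function above. It also assumes that NaN is the desired result when a column has fewer than two categories.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equal-length label sequences.

    Illustrative sketch: no continuity correction, and NaN is returned
    when the statistic is undefined (e.g. a single remaining category).
    """
    pairs = list(zip(xs, ys))
    n = len(pairs)
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    observed = Counter(pairs)
    r, k = len(row_totals), len(col_totals)
    if n < 2 or r < 2 or k < 2:
        return math.nan

    # Pearson chi-squared statistic over every cell of the r x k table.
    chi2 = 0.0
    for x, row_total in row_totals.items():
        for y, col_total in col_totals.items():
            expected = row_total * col_total / n
            chi2 += (observed[(x, y)] - expected) ** 2 / expected

    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    denominator = min(kcorr - 1, rcorr - 1)
    if denominator <= 0:
        return math.nan
    return math.sqrt(phi2corr / denominator)


# One column collapsed to a single category: the statistic is undefined.
print(cramers_v_corrected(['F', 'F', 'F', 'F'], ['a', 'b', 'a', 'b']))

# Perfectly associated made-up labels: the statistic is close to 1.
print(cramers_v_corrected(['F', 'F', 'M', 'M', 'C', 'C'],
                          ['a', 'a', 'b', 'b', 'c', 'c']))

# Independent made-up labels: the statistic is 0.
print(cramers_v_corrected(['F', 'F', 'M', 'M'], ['a', 'b', 'a', 'b']))
```

The first call mirrors the situation where all but one category of a column has been discarded as anomalous, which leaves nothing to associate with the other column.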

Let us now interact with ptype to fix its predictions for the zone column.

In [None]:
ptype.cols['zone'].reclassify('string')
ptype.show_schema()

As we can see, the column type prediction of the zone column is now correct. Moreover, the row type predictions are also updated.

In [None]:
# we use the updated schema
schema = ptype.cols
df_transformed = ptype.schema_transform(df, schema)
df_transformed

We can now calculate the Cramér's V statistic as follows:

In [None]:
cramers_corrected_stat(df_transformed['zone'], df_transformed['GG_new'])