This demo shows how to use ptype. It covers two tasks:

- running ptype on a data frame and printing a summary of the results;
- showing the interactions ptype offers when a prediction needs to be changed.

In [None]:
# Preamble to run notebook in context of source package.
# NBVAL_IGNORE_OUTPUT
import sys
sys.path.insert(0, '../')
!{sys.executable} -m pip install -r ../requirements.txt


In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcdefaults()

from ptype.Ptype import Ptype
from ptype.utils import evaluate_types
import pandas as pd
import numpy as np

## 1. Using ptype
### 1.a Create a ptype assistant

In [None]:
ptype = Ptype()

### List ptype’s target types

In [None]:
list(ptype.types.values())

### Load the data

In [None]:
dataset_name = 'auto'
dataset_path = '../data/' + dataset_name + '.csv'

df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, header=None, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(5)
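
The `dtype=str` and `keep_default_na=False` options keep every cell as the literal text found in the file, presumably so that ptype, rather than pandas, decides which values are missing or anomalous. The sketch below illustrates the difference on a small made-up CSV; the column names and values are hypothetical and not taken from the demo data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, not taken from the demo data.
raw = "id,price\n1,NA\n2,?\n3,13495\n"

# Default parsing: pandas infers dtypes and converts "NA" to NaN.
inferred = pd.read_csv(io.StringIO(raw))

# Options used in this notebook: every cell stays a string.
as_text = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)

print(inferred["price"].tolist())
print(as_text["price"].tolist())
```

With the default settings the `NA` entry is already replaced by `NaN` before any type inference could see it, whereas the second call preserves it as the string `'NA'`.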

### 1.b Run ptype

In [None]:
ptype.run_inference(df)

### 1.c Report the results

In [None]:
evaluate_types(dataset_name, ptype)

#### Show the results for all of the columns
We can generate a new dataframe which includes the column type predictions in the header.

In [None]:
df = ptype.show_results_df()
df.head(20)

To inspect the results in detail, we can generate a human-readable description for any column.

#### Show the results for the columns with missing data

In [None]:
for col in ptype.cols.values():
    if col.has_missing():
        col.show_results()

#### Show the results for the columns with anomalies

In [None]:
for col in ptype.cols.values():
    if col.has_anomalous():
        col.show_results()

## 2. User Interactions
ptype lets the user correct its output by:
- changing the column type predictions,
- changing the anomaly type predictions,
- changing the missing type predictions,
- merging different encodings of missing data.


### 2.a Change the column type predictions

In [None]:
dataset_name = 'data_gov_10151_1'
dataset_path = '../data/' + dataset_name + '.csv'
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(2)

In [None]:
ptype.run_inference(df)
ptype.show_results_df().head(20)

#### Check the columns annotated with the gender type

In [None]:
gender_col = [col for col in ptype.cols.values() if col.predicted_type == 'gender'][0]
gender_col.show_results()
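
Indexing the filtered list with `[0]` raises an `IndexError` when no column is predicted as `gender`. A minimal sketch of a more defensive lookup follows; it uses stand-in objects that expose only a `predicted_type` attribute, and the helper `first_of_type` is hypothetical, not part of ptype.

```python
from types import SimpleNamespace

# Stand-in objects with only a predicted_type attribute
# (hypothetical; these are not ptype's own column objects).
cols = {
    "name": SimpleNamespace(predicted_type="string"),
    "sex": SimpleNamespace(predicted_type="gender"),
}

def first_of_type(cols, type_name):
    """Return the first column predicted as type_name, or None if there is none."""
    return next(
        (col for col in cols.values() if col.predicted_type == type_name),
        None,
    )

gender_col = first_of_type(cols, "gender")  # the "sex" stand-in
date_col = first_of_type(cols, "date")      # None: no such column
```

Returning `None` lets the caller decide how to handle a data set without a matching column instead of failing inside the list comprehension.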

In [None]:
gender_col.predicted_type = 'string'
gender_col.show_results()

### 2.b Change the anomaly annotations
Notice that the values 'T' and 'U' are still annotated as anomalies. We update the annotations to mark them as normal values.

In [None]:
gender_col.reclassify_normal(['T', 'U'])
gender_col.show_results()

### 2.c Change missing data encodings

In [None]:
dataset_name = 'auto'
dataset_path = '../data/' + dataset_name + '.csv'
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, header=None, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(2)

In [None]:
ptype.run_inference(df)

#### Check the columns with missing data

In [None]:
for col in ptype.cols.values():
    if col.has_missing():
        col.show_results()

In [None]:
# Mark '-1' in the first column as a normal value.
col = ptype.cols[0]
col.reclassify_normal(['-1'])
col.show_results()

### 2.d Merge different encodings of missing data

In [None]:
dataset_name = 'mass_6'
dataset_path = '../data/' + dataset_name + '.csv'
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(2)

In [None]:
ptype.run_inference(df)

In [None]:
column_name = 'LRE Ages 3-5 - Full Incl #'
ptype.cols[column_name].show_results()

In [None]:
new_encoding = 'NA'
ptype.replace_missing(column_name, new_encoding)
ptype.cols[column_name].show_results()
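
Conceptually, merging encodings means mapping every marker that stands for missing data to one chosen token. The sketch below illustrates the idea with plain pandas on a made-up column. It is not ptype's implementation, and the markers and values are hypothetical.

```python
import pandas as pd

# Hypothetical column that mixes several missing-data markers.
values = pd.Series(["12", "-", "NULL", "7", "", "3"])

# Markers assumed to mean "missing" in this illustration.
missing_markers = ["-", "NULL", ""]
new_encoding = "NA"

# Replace every marker with the single chosen encoding.
merged = values.mask(values.isin(missing_markers), new_encoding)
print(merged.tolist())
```

After the replacement, downstream code only has to recognise one missing-data token instead of three.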