This demo shows how to use ptype. It covers two tasks:

- running ptype on a data frame and printing a summary of the results;
- showing the interactions ptype offers when one of its predictions needs to be changed.

In [1]:
!pip install greenery

Collecting greenery
  Downloading greenery-3.1.zip (40 kB)
Building wheels for collected packages: greenery
  Building wheel for greenery (setup.py) ... done
  Created wheel for greenery: filename=greenery-3.1-py3-none-any.whl size=39400 sha256=07d2ad928d4fc24d1129eed5df9d861f537ad77a468ecda475995d9946d794ef
  Stored in directory: /home/jovyan/.cache/pip/wheels/5a/77/1f/7abfa93f2e8a77645e8baa1c732202f7f47e2072c53a56de2d
Successfully built greenery
Installing collected packages: greenery
Successfully installed greenery-3.1


### Imports

In [4]:
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcdefaults()

import sys
sys.path.insert(0, '../')

from src.Ptype import Ptype
from src.utils import evaluate_types
import pandas as pd
import numpy as np

## 1 Using ptype
### 1.a Create a ptype assistant

In [5]:
ptype = Ptype()

### Loading data

In [6]:
dataset_name = 'auto'
dataset_path = '../data/' + dataset_name + '.csv'

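# Read every cell as a string (dtype=str) and keep markers such as 'NA' verbatim
# (keep_default_na=False) instead of letting pandas convert them to NaN.
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, header=None, keep_default_na=False, skipinitialspace=True)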
print(df.shape)
df.head(5)

(205, 26)


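| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |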


### 1.b Run ptype

In [7]:
ptype.run_inference(_data_frame=df)

### 1.c Report the results

In [8]:
evaluate_types(dataset_name, ptype)

correct/total =  1.0 (26/26)


#### Show the results for all of the columns
We can generate a new dataframe which includes the column type predictions in the header.

In [9]:
df = ptype.show_results_df()
df.head(2)

| | 0(integer) | 1(integer) | 2(string) | 3(string) | 4(string) | 5(string) | 6(string) | 7(string) | 8(string) | 9(float) | ... | 16(integer) | 17(string) | 18(float) | 19(float) | 20(float) | 21(integer) | 22(integer) | 23(integer) | 24(integer) | 25(integer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
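
The header labels in this table follow a simple name(type) pattern. As a rough illustration of that pattern only, and not of how ptype builds the data frame internally, the sketch below derives such labels from a mapping of column names to predicted types. The function `label_columns` and the dictionary are hypothetical; the four types are copied from the output above.

```python
def label_columns(column_names, predicted_types):
    """Return header labels of the form 'name(type)' (hypothetical helper)."""
    return [f"{name}({predicted_types[name]})" for name in column_names]


# Predicted types for a few columns, copied from the header shown above.
predicted_types = {"0": "integer", "1": "integer", "2": "string", "9": "float"}

labels = label_columns(["0", "1", "2", "9"], predicted_types)
print(labels)
```

Keeping the type in the header makes it easy to scan a wide table for columns whose predicted type looks suspicious.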


To inspect the results in detail, we can generate a human-readable description.

In [10]:
ptype.show_results()

col: 0
	predicted type: integer
	posterior probs:  [9.99999674e-01 0.00000000e+00 3.26244845e-07 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['-2', '0', '1', '2', '3']
	their counts:  [3, 67, 54, 32, 27]
	fraction of normal: 0.89 

	missing values: ['-1']
	their counts:  [22]
	fraction of missing: 0.11 

col: 1
	predicted type: integer
	posterior probs:  [1.00000000e+00 0.00000000e+00 4.73609772e-47 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['101', '102', '103', '104', '106', '107', '108', '110', '113', '115', '118', '119', '121', '122', '125', '128', '129', '134', '137', '142'
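
The numbers reported for column 0 can be checked by hand. The sketch below recomputes the predicted type and the two fractions from the printed values. It assumes that the predicted type is the type with the highest posterior probability, which is consistent with this output but is not stated in the demo; the variable names are illustrative and nothing here calls ptype.

```python
# Types and posterior probabilities for column 0, copied from the report.
types = ['integer', 'string', 'float', 'boolean', 'gender',
         'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std']
posterior = [9.99999674e-01, 0.0, 3.26244845e-07, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Assumption: the predicted type is the argmax of the posterior.
best = max(range(len(types)), key=lambda i: posterior[i])
predicted_type = types[best]

# Counts of normal and missing entries, copied from the report.
normal_counts = [3, 67, 54, 32, 27]
missing_counts = [22]
total = sum(normal_counts) + sum(missing_counts)  # equals the number of rows

fraction_normal = round(sum(normal_counts) / total, 2)
fraction_missing = round(sum(missing_counts) / total, 2)
print(predicted_type, total, fraction_normal, fraction_missing)
```

The total of 205 matches the number of rows printed when the data was loaded, so every entry of the column is accounted for as either normal or missing.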

#### Show the results for the columns with missing data

In [11]:
column_names = ptype.get_columns_with_missing()
ptype.show_results(column_names)

# columns with missing data: 8 

col: 0
	predicted type: integer
	posterior probs:  [9.99999674e-01 0.00000000e+00 3.26244845e-07 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['-2', '0', '1', '2', '3']
	their counts:  [3, 67, 54, 32, 27]
	fraction of normal: 0.89 

	missing values: ['-1']
	their counts:  [22]
	fraction of missing: 0.11 

col: 1
	predicted type: integer
	posterior probs:  [1.00000000e+00 0.00000000e+00 4.73609772e-47 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['101', '102', '103', '104', '106', '107', '108', '110', '113', '115', '118', '119', '121', '122', '125', 

#### Show the results for the columns with anomalies

In [12]:
column_names = ptype.get_columns_with_anomalies()
ptype.show_results(column_names)

# columns with anomalies: 0 



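## 2 User Interactions
ptype lets the user correct its output in the following ways:

- changing the column type predictions,
- changing the anomaly annotations,
- changing the missing data annotations,
- merging different encodings of missing data.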


### 2.a Change the column type predictions

In [13]:
dataset_name = 'data_gov_10151_1'
dataset_path = '../data/' + dataset_name + '.csv'
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(2)

(99, 21)


| | ï»¿OBJECTID | Loc_name | Status | Score | Match_type | Match_addr | Side | Ref_ID | X | Y | ... | Addr_type | ARC_Street | ARC_City | ARC_State | ARC_ZIP | Name | Municipali | Address | Municipa_1 | ZipCodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DW_Addressing_ | M | 95.42 | A | 1710 PACIFIC AVE, HARRISON, PA, 15065 | | 3150892 | 1420542.434568 | 475246.110473 | ... | StreetAddress | 1710 Pacific Avenue | Natrona Heights | | 15065 | Community Market | Natrona Heights | 1710 Pacific Avenue | Natrona Heights | 15065 |
| 1 | 2 | DW_Addressing_ | M | 93.91 | A | 1117 MILLTOWN RD, PENN HILLS, PA, 15147 | | 3148048 | 1401019.429449 | 431068.460889 | ... | StreetAddress | 1117 Mill Town Road | Verona | | 15147 | Community Market | Verona | 1117 Mill Town Road | Verona | 15147 |


In [14]:
ptype.run_inference(_data_frame=df)

#### Checking the columns annotated with the gender type

In [15]:
gender_columns = ptype.get_columns_with_type('gender')
ptype.show_results(gender_columns)

col: Status
	predicted type: gender
	posterior probs:  [0. 0. 0. 0. 1. 0. 0. 0. 0.]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['M']
	their counts:  [86]
	fraction of normal: 0.87 

	anomalies: ['T', 'U']
	their counts: [5, 8]
	fraction of anomalies: 0.13 



In [16]:
ptype.change_column_type_annotations(gender_columns, ['string',])
ptype.show_results(gender_columns)

The column type of Status is changed from gender to string
col: Status
	predicted type: string
	posterior probs:  [0. 0. 0. 0. 1. 0. 0. 0. 0.]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['M']
	their counts:  [86]
	fraction of normal: 0.87 

	anomalies: ['T', 'U']
	their counts: [5, 8]
	fraction of anomalies: 0.13 



### 2.b Change the anomaly annotations
The values 'T' and 'U' are still annotated as anomalies: as the output above shows, changing the column type does not update the anomaly annotations. We therefore update them explicitly.

In [17]:
ptype.change_anomaly_annotations('Status', ['T', 'U'])
ptype.show_results(gender_columns)

col: Status
	predicted type: string
	posterior probs:  [0. 0. 0. 0. 1. 0. 0. 0. 0.]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['M', 'T', 'U']
	their counts:  [86, 5, 8]
	fraction of normal: 1.0 
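
The bookkeeping behind this change can be illustrated with plain dictionaries. The sketch below is not ptype's implementation; `accept_anomalies` is a hypothetical helper that moves the given values from the anomaly counts to the normal counts, using the counts printed for the Status column.

```python
def accept_anomalies(normal, anomalies, values):
    """Move `values` from the anomaly counts to the normal counts.

    Both inputs map a data value to its number of occurrences; copies are
    returned so that the caller's dictionaries are left untouched.
    """
    normal = dict(normal)
    anomalies = dict(anomalies)
    for value in values:
        if value in anomalies:
            normal[value] = normal.get(value, 0) + anomalies.pop(value)
    return normal, anomalies


# Counts copied from the report for the Status column.
normal, anomalies = accept_anomalies({'M': 86}, {'T': 5, 'U': 8}, ['T', 'U'])

total = sum(normal.values()) + sum(anomalies.values())
fraction_normal = sum(normal.values()) / total
print(normal, anomalies, fraction_normal)
```

With both values accepted, no anomalies remain and all 99 rows count as normal, which agrees with the reported fraction of 1.0.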



### 2.c Change missing data encodings

In [18]:
dataset_name = 'auto'
dataset_path = '../data/' + dataset_name + '.csv'
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, header=None, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(2)

(205, 26)


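| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |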


In [19]:
ptype.run_inference(_data_frame=df)

#### Checking the columns with missing data

In [20]:
column_names = ptype.get_columns_with_missing()
ptype.show_results(column_names)

# columns with missing data: 8 

col: 0
	predicted type: integer
	posterior probs:  [9.99999674e-01 0.00000000e+00 3.26244845e-07 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['-2', '0', '1', '2', '3']
	their counts:  [3, 67, 54, 32, 27]
	fraction of normal: 0.89 

	missing values: ['-1']
	their counts:  [22]
	fraction of missing: 0.11 

col: 1
	predicted type: integer
	posterior probs:  [1.00000000e+00 0.00000000e+00 4.73609772e-47 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['101', '102', '103', '104', '106', '107', '108', '110', '113', '115', '118', '119', '121', '122', '125', 

In [21]:
column_name = '0'
ptype.change_missing_data_annotations(column_name, ['-1'])
ptype.show_results([column_name,])

col: 0
	predicted type: integer
	posterior probs:  [9.99999674e-01 0.00000000e+00 3.26244845e-07 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['-1', '-2', '0', '1', '2', '3']
	their counts:  [22, 3, 67, 54, 32, 27]
	fraction of normal: 1.0 
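
The result of this change can be reproduced by hand from the counts printed earlier for column 0. The sketch below is not ptype's implementation. It assumes that the normal values are listed in lexicographic (string) order, which would explain why '-1' appears before '-2' in the output; the variable names are illustrative.

```python
# Counts copied from the report for column 0 before the change.
normal = {'-2': 3, '0': 67, '1': 54, '2': 32, '3': 27}
missing = {'-1': 22}

# Re-label '-1' as a normal value instead of a missing-data encoding.
for value in ['-1']:
    normal[value] = missing.pop(value)

# Assumption: values are reported in string sort order, not numeric order.
values = sorted(normal)
counts = [normal[value] for value in values]
print(values)
print(counts)
```

Because the values are strings, '-1' sorts before '-2', and both sort before '0'; a numeric sort would instead start with -2.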



### 2.d Merge different encodings of missing data

In [22]:
dataset_name = 'mass_6'
dataset_path = '../data/' + dataset_name + '.csv'
df = pd.read_csv(dataset_path, sep=',', encoding='ISO-8859-1', dtype=str, keep_default_na=False, skipinitialspace=True)
print(df.shape)
df.head(2)

(3148, 23)


| | Year | Org Name | Org Code | Spec Ed. Grad Rate - Grad Cohort # | Spec Ed. Grad Rate - Grad # | Spec Ed. Grad Rate - Grad % | Spec Ed. Dropout - Enr # | Spec Ed. Dropout - Drop # | Spec Ed. Dropout - Drop % | LRE Ages 6-21 - Students # | ... | LRE Ages 3-5 - Full Incl # | LRE Ages 3-5 - Full Incl % | Cohort Completion Year | Substantial growth of knowledge & skills | Survey Period | Surv Meet Std # | Surv Meet Std % | Sch Yr Rev | # of Students Engaged | Dist Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012 | Abington | 10000 | 27 | 17 | 63% | 84 | 3 | 3.60% | 295 | ... | 1 | 5% | 2012-13 | - | Spring 2013 | | | 2010-11 | 8 | 100% |
| 1 | 2013 | Abington | 10000 | 20 | 16 | 80% | 70 | 2 | 2.90% | 272 | ... | 4 | 19% | 2012-13 | - | Spring 2013 | 11.0 | 73.20% | 2010-11 | 8 | 100% |


In [23]:
ptype.run_inference(_data_frame=df)

In the results for this dataset, column names appear with their whitespace removed, so the column 'LRE Ages 3-5 - Full Incl #' is referred to as 'LREAges3-5-FullIncl#'.

In [24]:
column_name = 'LREAges3-5-FullIncl#'
ptype.show_results([column_name,])

col: LREAges3-5-FullIncl#
	predicted type: integer
	posterior probs:  [1.00000000e+00 0.00000000e+00 1.66434002e-57 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['0', '1', '10', '108', '11', '110', '116', '12', '129', '13', '14', '147', '15', '158', '16', '17', '18', '19', '2', '20']
	their counts:  [78, 37, 8, 1, 12, 1, 1, 8, 1, 14, 8, 1, 8, 1, 14, 5, 7, 5, 26, 8]
	fraction of normal: 0.16 

	missing values: ['', '-', 'N/A', 'NA']
	their counts:  [1565, 292, 751, 29]
	fraction of missing: 0.84 



In [25]:
new_encoding = 'NA'
ptype.merge_missing_data(column_name, new_encoding)
ptype.show_results([column_name,])

col: LREAges3-5-FullIncl#
	predicted type: integer
	posterior probs:  [1.00000000e+00 0.00000000e+00 1.66434002e-57 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]
	types:  ['integer', 'string', 'float', 'boolean', 'gender', 'date-iso-8601', 'date-eu', 'date-non-std-subtype', 'date-non-std'] 

	some normal data values:  ['0', '1', '10', '108', '11', '110', '116', '12', '129', '13', '14', '147', '15', '158', '16', '17', '18', '19', '2', '20']
	their counts:  [78, 37, 8, 1, 12, 1, 1, 8, 1, 14, 8, 1, 8, 1, 14, 5, 7, 5, 26, 8]
	fraction of normal: 0.16 

	missing values: ['NA']
	their counts:  [2637]
	fraction of missing: 0.84 
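
The merged count can be verified from the four counts printed before the merge. The sketch below also shows one way such a merge could be applied to raw values. It is an illustration under stated assumptions, not ptype's implementation: `merge_missing` and the six-value sample column are hypothetical, while the counts and the number of rows are copied from the output above.

```python
from collections import Counter


def merge_missing(values, missing_encodings, new_encoding):
    """Replace every missing-data encoding in `values` with `new_encoding`."""
    return [new_encoding if v in missing_encodings else v for v in values]


# Counts of the four encodings, copied from the report before the merge.
missing_counts = {'': 1565, '-': 292, 'N/A': 751, 'NA': 29}
merged_total = sum(missing_counts.values())

# The data frame has 3148 rows, as printed when the data was loaded.
fraction_missing = round(merged_total / 3148, 2)

# Hypothetical sample column mixing normal values and missing encodings.
sample = ['0', '', '-', 'N/A', 'NA', '12']
merged = merge_missing(sample, set(missing_counts), 'NA')
merged_counts = Counter(merged)
print(merged_total, fraction_missing, merged)
```

The merge only relabels entries, so the fraction of missing data stays at 0.84 while the four encodings collapse into a single one.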

