In this notebook, we present use cases that demonstrate the benefit of using ptype for column type inference. The notebook is organized as follows:

- Part 1 (solve the data cleaning problem without ptype):
    - import dataset using Pandas read_csv
    - run linear regression on data
    - error occurs because of missing data: could not convert string to float: '?'
    - inspect dtypes property of dataframe to see the problem
    - use Pandas to change encoding of missing data, remove relevant rows
    - run linear regression again (no errors), plot results
    - use dtypes to verify that we now have appropriate column types

- Part 2 (how ptype makes this problem easier):
    - import dataset using Pandas read_csv, but this time with dtype='str'
    - instantiate Ptype
    - ask Ptype to infer schema; show inferred types
    - ask Ptype to adjust type of dataframe to match schema
    - inspect transformed dataframe to verify types as expected
    - then as per Part 1 to remove missing data and continue


In [None]:
# Preamble to run notebook in context of source package.
# NBVAL_IGNORE_OUTPUT
import sys
sys.path.insert(0, '../')

In [None]:
from IPython.display import display
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcdefaults()
import numpy as np
import pandas as pd

from utils import scatter_plot

### UCI Automobile Dataset

In [None]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df = pd.read_csv('../data/auto.csv', names = headers)
df.head()

### The Analytical Task

This dataset is commonly used for a regression task, where the goal is to predict the price of an automobile given its attributes.

### A Solution using Standard Python Libraries
Let's now develop a simple solution for this problem. The solution is inspired by two Kaggle notebooks (see https://www.kaggle.com/fazilbtopal/data-wrangling and https://www.kaggle.com/fazilbtopal/model-development-and-evaluation-with-python).

In [None]:
features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
target = ['price']

X = df[features]
y = df[target]

df = df[features+target]
df.head()

In [None]:
# to see the error message, uncomment the fit and predict calls below

lm = LinearRegression()
# lm.fit(X, y)
# y_hat = lm.predict(X)

We notice that some data entries have the value '?', which the fit function cannot convert to a number.

Although it is not immediately obvious which data entries have this value, we can query the dataframe to find the occurrences of '?'.

In [None]:
df[(df['horsepower']=='?') | (df['price']=='?')]
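
The query above checks only the two columns where `?` is already suspected. A more general check could count the placeholder in every column at once. The sketch below runs on a small made-up frame (the values are illustrative stand-ins, not the contents of `auto.csv`):

```python
import pandas as pd

# Toy stand-in for the automobile data; values are illustrative only.
toy = pd.DataFrame({
    "horsepower": ["111", "?", "154"],
    "engine-size": [130, 152, 109],
    "price": ["13495", "16500", "?"],
})

# Element-wise comparison with "?" gives a boolean frame;
# summing it counts the placeholder occurrences per column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)

# Rows that contain the placeholder in at least one column.
rows_with_placeholder = toy[(toy == "?").any(axis=1)]
print(rows_with_placeholder)
```

Applied to the real dataframe, the same two expressions would reveal any other columns that use `?` as a missing-value marker.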

Note that these '?' entries also lead Pandas to classify the horsepower and price columns as object rather than int64.

In [None]:
df.dtypes
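
A single non-numeric entry is enough to stop Pandas from inferring a numeric type for a whole column. The sketch below illustrates this on a tiny in-memory CSV (made-up values). It also shows the `na_values` option of `read_csv`, which is one way to declare `?` as a missing-value marker at load time. This is an aside on Pandas behaviour, not a step of the workflow in this notebook:

```python
import io

import pandas as pd

# Made-up CSV content with '?' standing in for missing values.
csv_text = "horsepower,price\n111,13495\n?,16500\n154,?\n"

# Default parsing: the '?' entries prevent numeric inference.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Declaring '?' as a missing-value marker lets Pandas parse numbers
# and store the missing entries as NaN.
declared = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(declared.dtypes)
print(declared.isna().sum())
```

This only helps when the missing-value encoding is known in advance, which is exactly the knowledge that type inference aims to supply.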

We need to "clean" the horsepower and price columns in terms of missing values. Let's first have a look at what we can do without ptype:

In [None]:
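# replace missing data encoding
# (assign the result back; in-place replace on a selected column may not update df in recent Pandas)
df['horsepower'] = df['horsepower'].replace("?", np.nan)
df['price'] = df['price'].replace("?", np.nan)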

# drop rows
n = df.shape[0]
df.dropna(subset=["horsepower", "price"], axis=0, inplace=True)
print("# rows deleted = " + str(n-df.shape[0]))

# update the indices
df.reset_index(drop=True, inplace=True)

In [None]:
df.dtypes

Although this does not cause any errors, we may want to update the data types.

In [None]:
df = df.astype(int)
df.dtypes
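
The `astype(int)` call works here only because the rows with missing values were dropped first: a NumPy integer column cannot hold `NaN`, and a leftover `?` would raise an error. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN` in one step. A minimal sketch on made-up values:

```python
import pandas as pd

# Illustrative values only; '?' marks a missing entry.
horsepower = pd.Series(["111", "?", "154"], name="horsepower")

# Entries that cannot be parsed as numbers become NaN instead of raising.
numeric = pd.to_numeric(horsepower, errors="coerce")
print(numeric)

# Once the missing entries are dropped, the integer conversion is safe.
cleaned = numeric.dropna().astype(int)
print(cleaned.tolist())  # → [111, 154]
```

Note that coercion silently maps every unparseable value to `NaN`, so it hides the difference between a missing-value marker and a genuinely malformed entry.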

In [None]:
X = df[features].values
y = df[target].values

lm.fit(X, y)
y_hat = lm.predict(X)

scatter_plot(y, y_hat)

In [None]:
df.dtypes

Let's now revisit the problem and see how we can use ptype to resolve it. Note that we now pass an additional parameter to the read_csv function: we set dtype to 'str' so that all data entries are read as strings. This is needed because ptype processes each data value as a string.
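
To illustrate the effect of `dtype='str'`, the sketch below reads a tiny in-memory CSV (made-up values) twice. Without the parameter, Pandas parses numeric-looking columns into numbers; with it, every value stays a string:

```python
import io

import pandas as pd

# Made-up CSV content; 'engine-size' is fully numeric, 'horsepower' has a '?'.
csv_text = "engine-size,horsepower\n130,111\n152,?\n109,154\n"

# Default parsing: Pandas infers a type per column.
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype='str': no inference, every entry is kept as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype="str")

print(inferred["engine-size"].tolist())  # integers
print(as_text["engine-size"].tolist())   # strings
```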


In [None]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df = pd.read_csv('../data/auto.csv', names = headers, dtype='str')
df = df[features+target]
df.head()

In [None]:
from ptype.Ptype import Ptype

ptype = Ptype()

In [None]:
schema = ptype.fit_schema(df)
ptype.show_schema()

In [None]:
df = ptype.transform_schema(df, schema)

In [None]:
df[(df['horsepower'].isna()) | (df['price'].isna())]

In [None]:
df.dtypes

In [None]:
# drop rows
n = df.shape[0]
df.dropna(subset=["horsepower", "price"], axis=0, inplace=True)
print("# rows deleted = " + str(n-df.shape[0]))

# update the indices
df.reset_index(drop=True, inplace=True)

In [None]:
X = df[features].values
y = df[target].values

lm = LinearRegression()
lm.fit(X, y)
y_hat = lm.predict(X)

scatter_plot(y, y_hat)