A list of checklists to ensure consistency and quality across key data science
tasks.

## General

- Use samples of different sizes for code development if useful, but don't
  switch back and forth between samples for analysis (you might get hung up on
  explaining small-sample artifacts).
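
One way to keep a development sample stable across runs is to draw it with a fixed seed. This is a minimal sketch: the toy frame, the 10% fraction and the seed value are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the full dataset (made-up values).
full = pd.DataFrame({'id': range(1000), 'value': range(1000)})

# A fixed random_state draws the same rows on every run, so results from
# different sessions refer to the same sample.
dev_sample = full.sample(frac=0.1, random_state=42)

print(len(dev_sample))
```

Keeping the seed and fraction in one place makes it easy to stick to a single sample for the analysis itself.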

## Importing data

## A first quick look

In [1]:
import pandas as pd
import seaborn as sns

sns.set_style('whitegrid')

In [5]:
df = pd.read_csv('data/competitor_prices.csv')
df.head(3)

,Date,Product_id,Competitor_id,Competitor_Price
0,25/11/2013,4.0,C,74.95
1,25/11/2013,4.0,D,74.95
2,25/11/2013,4.0,E,75.0


In [8]:
from IPython.display import display

def inspect(df, nrows=2):
    """Print the shape of df and show its first nrows rows."""
    print('({:,}, {})'.format(*df.shape))
    display(df.head(nrows))

inspect(df)

(15,395, 4)


,Date,Product_id,Competitor_id,Competitor_Price
0,25/11/2013,4.0,C,74.95
1,25/11/2013,4.0,D,74.95


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15395 entries, 0 to 15394
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              15132 non-null  object 
 1   Product_id        15132 non-null  float64
 2   Competitor_id     15132 non-null  object 
 3   Competitor_Price  15132 non-null  float64
dtypes: float64(2), object(2)
memory usage: 481.2+ KB


In [7]:
df.describe()

,Product_id,Competitor_Price
count,15132.0,15132.0
mean,248.535223,85.339813
std,115.916096,48.460998
min,4.0,2.3
25%,143.0,49.95
50%,251.0,75.0
75%,355.0,108.0
max,421.0,500.0


In [8]:
df.Competitor_id.describe()

count     15132
unique        7
top           D
freq       8092
Name: Competitor_id, dtype: object

## Check data integrity

### Missing values

Todo: work out best practices for using missingno: https://github.com/ResidentMario/missingno
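
Plain pandas already gives a quick count of missing values per column. The sketch below uses a small made-up frame; the helper name `missing` and the toy values are assumptions, not part of the dataset above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names mirror the example data.
toy = pd.DataFrame({
    'Date': ['25/11/2013', None, '26/11/2013', None],
    'Competitor_Price': [74.95, np.nan, 75.0, 80.0],
})

def missing(df):
    """Return count and share of missing values per column, largest first."""
    counts = df.isna().sum()
    out = pd.DataFrame({'n_missing': counts, 'share': counts / len(df)})
    return out.sort_values('n_missing', ascending=False)

summary = missing(toy)
print(summary)

# Number of rows in which every column is missing.
n_empty_rows = toy.isna().all(axis=1).sum()
print(n_empty_rows)
```

Counting fully empty rows separately is useful when all columns report the same number of non-null values, as in the `df.info()` output above.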

### Duplicates

In [9]:
def dups(df):
    d = df.duplicated().sum()
    print(f'{d} of {len(df)} rows ({d/len(df):.1%}) are duplicates.')
    
dups(df)

500 of 15395 rows (3.2%) are duplicates.
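
Once duplicates are counted, it can help to look at them before dropping them. A minimal sketch on a made-up frame (values are illustrative only):

```python
import pandas as pd

# Toy frame with made-up values; rows 0 and 1 are identical.
toy = pd.DataFrame({
    'Date': ['25/11/2013', '25/11/2013', '25/11/2013', '26/11/2013'],
    'Competitor_id': ['C', 'C', 'D', 'C'],
    'Competitor_Price': [74.95, 74.95, 74.95, 75.0],
})

# keep=False marks every member of a duplicate group, not only the repeats,
# which makes it easier to eyeball what is being duplicated.
dup_groups = toy[toy.duplicated(keep=False)]
print(dup_groups)

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```

Whether dropping is appropriate depends on whether identical rows are genuine repeated observations or loading errors.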


### Column types

Ensure that columns are of the desired type.
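
A possible conversion for columns like the ones in the example data, sketched on a made-up frame. The target types (datetime, nullable integer, categorical) and the day-first date format are assumptions based on the output shown earlier.

```python
import pandas as pd

# Toy frame with made-up values, including one fully missing row.
toy = pd.DataFrame({
    'Date': ['25/11/2013', '26/11/2013', None],
    'Product_id': [4.0, 7.0, None],
    'Competitor_id': ['C', 'D', None],
})

typed = toy.assign(
    # An explicit format avoids day/month ambiguity; missing values become NaT.
    Date=pd.to_datetime(toy['Date'], format='%d/%m/%Y'),
    # Nullable integer dtype keeps missing values without falling back to float.
    Product_id=toy['Product_id'].astype('Int64'),
    # Few distinct values, so a categorical is compact and explicit.
    Competitor_id=toy['Competitor_id'].astype('category'),
)
print(typed.dtypes)
```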

### Value format

Ensure that values conform to required formats (e.g. use regex to validate postcodes and IDs).
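
A regex check of this kind can be sketched with `Series.str.fullmatch`. The rule used here (an id is exactly one uppercase letter) is a hypothetical example, as are the values.

```python
import pandas as pd

# Made-up ids; the last three should fail the check.
ids = pd.Series(['C', 'D', 'c', 'DD', None], name='Competitor_id')

# Hypothetical rule: an id is exactly one uppercase letter.
pattern = r'[A-Z]'

# fullmatch requires the whole string to match; na=False treats missing
# values as invalid instead of propagating them.
valid = ids.str.fullmatch(pattern, na=False)
invalid = ids[~valid]
print(invalid)
```

Listing the offending values rather than only counting them usually makes the cause (casing, padding, concatenated ids) obvious.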

## Regression model specification

- Does it make sense to take logs of currency amounts or large integers
  (especially if there are few cases with zero, in which case using *log(x +
  1)* is usually fine to avoid missing values from zeroes)?

- Does it make sense to standardise some variables to interpret changes in
  standard deviations, or even to standardise all variables (to get beta
  coefficients) to easily gauge the relative importance of variables?

- Have I included variables that I shouldn't have, given the ceteris-paribus
  interpretation of the coefficients (e.g. including alcohol consumption when
  estimating the effect of an alcohol tax on traffic fatalities, which will
  mostly run through lower alcohol consumption)?

- Are there variables that are correlated with *y* but uncorrelated with the
  included *x*s that I haven't included yet? (If so, I should include them, as
  doing so increases the precision of the estimates without causing
  multicollinearity issues.)
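
The first two items on this list (taking logs and standardising) can be sketched as follows. The frame and its column names are made up for illustration; the standardisation uses the sample standard deviation, which is the pandas default.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; note the zero in 'price'.
toy = pd.DataFrame({
    'price': [0.0, 9.0, 99.0, 999.0],
    'sales': [10.0, 20.0, 30.0, 40.0],
})

# log(x + 1) is defined at zero, so no observations are lost.
toy['log_price'] = np.log1p(toy['price'])

# z-scores: subtract the mean and divide by the standard deviation, so that
# a one-unit change corresponds to one standard deviation.
cols = ['price', 'sales']
z = (toy[cols] - toy[cols].mean()) / toy[cols].std()
print(z.round(2))
```

Regressing a standardised *y* on standardised *x*s then yields coefficients measured in standard deviations.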