This notebook tests different ways to check a column for duplicate values.
My current list-based implementation appears to hang on large data sets that have had the fractional seconds removed.
This is strange to me, but the mystery will have to remain for now.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import WP19_analysis as wpa
import seaborn as sns
import matplotlib.pyplot as plt

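These checks use numpy: `np.unique` is applied to the index of the pandas object returned by `load_timeseries_file`, and the number of unique values is compared with the length of the index.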

In [2]:
energy = wpa.load_timeseries_file('asei-raw.csv')
np.unique(energy.index.values).shape[0] == len(energy.index)

False

In [3]:
energy = wpa.load_timeseries_file('asei-clean.csv')
np.unique(energy.index.values).shape[0] == len(energy.index)

True

This version uses pandas directly, through the `is_unique` attribute of the index.

In [4]:
energy = wpa.load_timeseries_file('asei-raw.csv')
energy.index.is_unique

False

In [5]:
energy = wpa.load_timeseries_file('asei-clean.csv')
energy.index.is_unique

True
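
As a small aside, `is_unique` only says whether any duplicates exist. A minimal sketch of how the repeated entries themselves could be listed with `Index.duplicated` follows. The timestamps are made-up toy values, not rows from `asei-raw.csv`.

```python
import pandas as pd

# Toy index with one repeated timestamp (hypothetical values for illustration).
idx = pd.to_datetime([
    "2015-04-22 17:22:00",
    "2015-04-22 17:22:01",
    "2015-04-22 17:22:00",
])

# Same check as in the cells above.
print(idx.is_unique)

# Boolean mask that is True for the second and later occurrences of a value.
dup_mask = idx.duplicated(keep="first")

# The index entries that repeat an earlier entry.
print(idx[dup_mask])
```

With `keep=False` the mask would flag every occurrence of a repeated value, including the first one.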

I'd like an implementation in pure Python that works for larger data files. It looks like a set-based implementation can handle the larger files even though the logic isn't that different. I'd speculate that membership testing is faster for a set: a list has to be scanned element by element on every test, while a set uses a hash lookup.

I have put this new routine in the `WP19_analysis.py` file and will use it for validation.
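
The speculation about membership testing can be checked with a rough timing sketch. The two helper functions below are hypothetical stand-ins for the file-based routines (they take a list of strings instead of a filename), and the timestamps are synthetic. Absolute timings will vary by machine.

```python
import timeit

def first_duplicate_list(values):
    """Return the first repeated value, tracking seen values in a list."""
    seen = []
    for v in values:
        if v in seen:          # linear scan of the list on every test
            return v
        seen.append(v)
    return None

def first_duplicate_set(values):
    """Return the first repeated value, tracking seen values in a set."""
    seen = set()
    for v in values:
        if v in seen:          # hash lookup, roughly constant time on average
            return v
        seen.add(v)
    return None

# Synthetic, duplicate-free timestamps (hypothetical data, not from the CSV files).
values = [f"2015-04-22 17:{i // 60:02d}:{i % 60:02d}" for i in range(3000)]

# With no duplicates, both functions must walk the whole input (the worst case).
t_list = timeit.timeit(lambda: first_duplicate_list(values), number=3)
t_set = timeit.timeit(lambda: first_duplicate_set(values), number=3)
print(f"list: {t_list:.4f} s, set: {t_set:.4f} s")
```

The worst case is a file with no duplicates at all, since neither function can return early. The list version then does work that grows with the square of the number of rows.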

In [6]:
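def csv_col_duplicates(filename, col=1, skiprows=1):
    """Return the first line whose value in column `col` repeats, or 0 if none does.

    List-based version: each membership test scans the whole list.
    """
    with open(filename) as f:
        existing_values = []
        for line in f.readlines()[skiprows:]:
            vals = line.strip().split(',')
            if vals[col] not in existing_values:
                existing_values.append(vals[col])
            else:
                return line.strip()
        # No duplicates found; return 0 to match the set-based version.
        return 0

def csv_col_duplicates_set(filename, col=1, skiprows=1):
    """Return the first line whose value in column `col` repeats, or 0 if none does.

    Set-based version: membership tests are hash lookups.
    """
    with open(filename) as f:
        unique = set()
        for line in f.readlines()[skiprows:]:
            vals = line.strip().split(',')
            if vals[col] in unique:
                return line.strip()
            else:
                unique.add(vals[col])
        return 0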

In [7]:
filename = 'asei-raw.csv'
csv_col_duplicates(filename, col=0)

'2015-04-22 17:22:00.001,10469,,6,232.1,232.7,232.1,6.6,1.76,18.78,,16.633,1.0'

In [8]:
filename = 'asei-raw.csv'
csv_col_duplicates_set(filename, col=0)

'2015-04-22 17:22:00.001,10469,,6,232.1,232.7,232.1,6.6,1.76,18.78,,16.633,1.0'

In [9]:
filename = 'asei-clean.csv'
csv_col_duplicates_set(filename, col=0)

0

In [10]:
filename = 'ajau-clean.csv'
csv_col_duplicates_set(filename, col=0)

0
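
One possible variation on the set-based routine, sketched here under assumptions rather than taken from `WP19_analysis.py`: read the file row by row with the standard `csv` module instead of calling `readlines()`, so the whole file is never held in memory and quoted fields containing commas are split correctly. The function name `csv_col_duplicates_stream` and the sample data are hypothetical.

```python
import csv
import os
import tempfile

def csv_col_duplicates_stream(filename, col=0, skiprows=1):
    """Return (row_number, row) for the first repeated value in `col`, or None."""
    seen = set()
    with open(filename, newline='') as f:
        # Row numbers are 1-based and count the skipped header rows too.
        for row_number, row in enumerate(csv.reader(f), start=1):
            if row_number <= skiprows or len(row) <= col:
                continue       # skip header rows and rows that are too short
            key = row[col]
            if key in seen:
                return row_number, row
            seen.add(key)
    return None

# Small made-up file to exercise the function (not real measurement data).
sample = (
    "timestamp,value\n"
    "2015-04-22 17:22:00,1\n"
    "2015-04-22 17:22:01,2\n"
    "2015-04-22 17:22:00,3\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write(sample)
    path = tmp.name

result = csv_col_duplicates_stream(path, col=0)
os.remove(path)
print(result)  # (4, ['2015-04-22 17:22:00', '3'])
```

Unlike the routines above, this sketch returns `None` rather than `0` when no duplicate is found, and it returns the parsed row together with its row number instead of the raw line.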