# Diagnose a data file

# Document

<table align="left">
    <tr>
        <th class="text-align:left">Title</th>
        <td class="text-align:left">Diagnose a data file</td>
    </tr>
    <tr>
        <th class="text-align:left">Last modified</th>
        <td class="text-align:left">2019-01-22</td>
    </tr>
    <tr>
        <th class="text-align:left">Author</th>
        <td class="text-align:left">Gilles Pilon &lt;gillespilon13@gmail.com&gt;</td>
    </tr>
    <tr>
        <th class="text-align:left">Status</th>
        <td class="text-align:left">Active</td>
    </tr>
    <tr>
        <th class="text-align:left">Type</th>
        <td class="text-align:left">Jupyter notebook</td>
    </tr>
    <tr>
        <th class="text-align:left">Created</th>
        <td class="text-align:left">2018-12-21</td>
    </tr>
    <tr>
        <th class="text-align:left">File name</th>
        <td class="text-align:left">data_file_diagnose.ipynb</td>
    </tr>
    <tr>
        <th class="text-align:left">Other files required</th>
        <td class="text-align:left">thirteen_weeks.csv</td>
    </tr>
</table>

# Introduction

- Read a CSV file
- Determine the column types
- Determine the number of unique entries
- Create a filter
- Read a file with a filter
- Scatter plot of a column
- Create a munging filter

In [None]:
import pandas as pd

## Read a csv file

In [None]:
FILE_TO_READ = 'thirteen_weeks.csv'
df = pd.read_csv(FILE_TO_READ,
                 parse_dates=True,
                 index_col='Time')

In [None]:
df.shape

In [None]:
# Delete weekend data.
df = df[df.index.dayofweek < 5]

In [None]:
df.shape
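
The weekend filter relies on `DatetimeIndex.dayofweek`, which numbers Monday as 0 and Sunday as 6. A minimal sketch on made-up data (the `demo` frame and its `value` column are hypothetical, not part of `thirteen_weeks.csv`) shows the effect on the shape:

```python
import pandas as pd

# Hypothetical data: one row per day for a full week starting on a Monday.
index = pd.date_range('2019-01-07', periods=7, freq='D', name='Time')
demo = pd.DataFrame({'value': range(7)}, index=index)

# dayofweek is 0 for Monday through 6 for Sunday, so "< 5" keeps weekdays.
weekdays = demo[demo.index.dayofweek < 5]
print(demo.shape, weekdays.shape)
```

Comparing the two shapes, as the cells above do, is a quick check that only Saturday and Sunday rows were dropped.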

## Determine the column types

In [None]:
# Check the data type of each column. Numeric columns should be float or int, not object.
df.dtypes
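
A column that should be numeric usually ends up with a non-numeric dtype because at least one entry is text. The sketch below illustrates this with a tiny in-memory file; the column names `a` and `b` and the values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical two-column file; column 'b' contains one text entry.
csv_text = 'a,b\n1.5,2.0\n2.5,No Data\n'
demo = pd.read_csv(io.StringIO(csv_text))

# 'a' is parsed as float; 'b' is not numeric because of the text entry.
print(demo.dtypes)
print(pd.api.types.is_numeric_dtype(demo['a']),
      pd.api.types.is_numeric_dtype(demo['b']))
```

A single stray string is enough to change the dtype of the whole column, which is why the next steps look for those strings.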

## Determine the number of unique entries

In [None]:
for column_name in df.columns:
    print(column_name, 'has', df[column_name].nunique(), 'unique values.')

In [None]:
for column_name in df.columns:
    print(column_name, 'has these unique values:', df[column_name].unique(), '\n')

In [None]:
# Find text values for a single column.
# na=False treats missing entries as non-matches so the boolean mask has no NaN.
print(df[df['Trim Board Density (lb/cft)']
         .str.contains('[a-z]', na=False)]['Trim Board Density (lb/cft)']
         .unique())

In [None]:
# Find text values in the dataframe, by column.
# Only object columns can hold text; na=False treats missing entries as non-matches.
for column_name in df.columns:
    if df[column_name].dtype == object:
        print(column_name, df[df[column_name]
                               .str.contains('[a-z]', na=False)][column_name]
                               .unique())
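
The pattern `'[a-z]'` only matches entries that contain a lowercase letter, so an entry such as `N/A` would go unnoticed. One possible alternative, sketched here on a hypothetical series rather than on the real file, is to ask pandas which entries cannot be converted to numbers with `pd.to_numeric(..., errors='coerce')`:

```python
import pandas as pd

# Hypothetical column as read from a CSV: numbers stored as text, plus text entries.
raw = pd.Series(['12.5', 'No Data', '13.1', 'N/A', 'Bad Input', '12.5'])

# Entries that cannot be parsed as numbers become NaN under errors='coerce'.
parsed = pd.to_numeric(raw, errors='coerce')

# Keep the original entries that failed to parse but were not missing to begin with.
non_numeric = sorted(raw[parsed.isna() & raw.notna()].unique())
print(non_numeric)
```

The resulting list is a candidate for the `na_values` argument used in the next section.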

## Create a filter

In [None]:
NA_VALUES = ['Bad Input',
             'Invalid Data',
             'No Data',
             'Calc Failed',
             'Pt Created']

## Read a file with a filter

In [None]:
df = pd.read_csv(FILE_TO_READ,
                 parse_dates=True,
                 index_col='Time',
                 na_values=NA_VALUES)

In [None]:
df.shape

In [None]:
df.dtypes
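
The effect of `na_values` can be seen on a small in-memory file. This is a sketch only: the column name `density` and the rows are made up, and just one of the filter strings is used.

```python
import io

import pandas as pd

# Hypothetical file contents; the column name and values are invented.
csv_text = (
    'Time,density\n'
    '2019-01-07 08:00,41.2\n'
    '2019-01-07 09:00,No Data\n'
    '2019-01-07 10:00,40.8\n'
)

# Without the filter, 'No Data' is kept as text and the column is not numeric.
without_filter = pd.read_csv(io.StringIO(csv_text),
                             parse_dates=True,
                             index_col='Time')

# With the filter, 'No Data' is read as NaN and the column becomes float.
with_filter = pd.read_csv(io.StringIO(csv_text),
                          parse_dates=True,
                          index_col='Time',
                          na_values=['No Data'])

print(without_filter['density'].dtype, with_filter['density'].dtype)
print(with_filter['density'].isna().sum())
```

Counting the missing values per column with `isna().sum()` is a simple way to see how many entries the filter replaced.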

In [None]:
# Save this munged version to a new file.
df.to_csv('thirteen_weeks_munged.csv')

## Scatter plot of a column

In [None]:
# Plot one column versus the index 'Time'. style='.' draws points only, giving a scatter plot.
import matplotlib.pyplot as plt
%matplotlib inline
ax = df.plot.line(y='Water Load (lb/MSF)',
                  legend=False,
                  style='.')

## Create a munging filter

In [None]:
# Munge this column with a filter: set values above 1200 to NaN.
import numpy as np
df.loc[df['Water Load (lb/MSF)'] > 1200,
       'Water Load (lb/MSF)'] = np.nan
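
The same kind of filter can also be written with `Series.mask`, which replaces values with NaN wherever a condition is True. The sketch below uses a hypothetical `load` column and made-up readings; only the threshold of 1200 comes from the cell above.

```python
import pandas as pd

# Hypothetical readings; 5000.0 stands in for a value above the threshold.
demo = pd.DataFrame({'load': [950.0, 1010.0, 5000.0, 980.0]})
UPPER_LIMIT = 1200  # same threshold as in the cell above

# Series.mask replaces values where the condition is True (NaN by default).
demo['load'] = demo['load'].mask(demo['load'] > UPPER_LIMIT)
print(demo['load'].isna().sum(), demo['load'].max())
```

Both forms leave the row in place and only blank out the value, so the time index stays intact for plotting.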

In [None]:
# Redo the scatter plot
ax = df.plot.line(y='Water Load (lb/MSF)',
                  legend=False,
                  style='.')

# References

- [matplotlib pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot)

- [pandas plot line](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.line.html)

- [numpy](https://docs.scipy.org/doc/)

- [pandas read csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)