# Data Cleaning
In this stage, you will read in the raw data and perform any necessary reshaping, combining, and cleaning of the data. Though this stage is not explicitly an exploratory stage, you will learn a lot about the content of the data throughout the course of this notebook.

### Imports

For your convenience, we have included a few pre-written functions, which you might find useful in your analysis. They are by no means necessary, but feel free to use any or all of them. The code for these functions can be found in /src.

In [1]:
import pandas as pd
import os
import sys

In [2]:
src_path = os.path.abspath('../src/')
sys.path.append(src_path)

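# Import the helpers by name so it is clear where each function comes from
from data_cleaning import standardize_col_names, null_counts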

The Data Cleaning functions in the src file are:

- standardize_col_names(df)

Standardizes the column names of a dataframe. It removes surrounding white space, replaces spaces with underscores, and eliminates special characters (including parentheses and slashes). All letters are converted to lowercase.

Returns a copy of the dataframe.

- null_counts(df)

Returns a dataframe containing the number of null values in each column of a given dataframe.
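
If you would rather not open the source files, the sketch below shows roughly what helpers with this behavior could look like. It is written from the descriptions above only; the `_sketch` names and the regular expressions are assumptions, and the actual code in /src may differ.

```python
import re
import pandas as pd

def standardize_col_names_sketch(df):
    """Hypothetical stand-in for standardize_col_names; returns a copy."""
    out = df.copy()
    cleaned = []
    for name in out.columns:
        name = str(name).strip().lower()         # trim white space, lowercase
        name = re.sub(r"\s+", "_", name)         # spaces -> underscores
        name = re.sub(r"[^0-9a-z_]", "", name)   # drop special characters
        cleaned.append(name)
    out.columns = cleaned
    return out

def null_counts_sketch(df):
    """Hypothetical stand-in for null_counts; one row per column."""
    return df.isnull().sum().to_frame(name="null_count")

example = pd.DataFrame({" Unit Price ": [1.0, None], "Qty/Box (kg)": [3, 4]})
standardized = standardize_col_names_sketch(example)
nulls = null_counts_sketch(standardized)
```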

### Data
Read in the data and take an initial look.

In [None]:
input_loc = '../data/raw/'
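
How you read the data depends on its format. As one possible starting point, the sketch below assumes the raw files are CSVs and loads each one into a dictionary keyed by file name. The helper `load_raw_csvs` is hypothetical, not part of /src.

```python
import glob
import os
import pandas as pd

def load_raw_csvs(input_dir):
    """Read every CSV in input_dir into a dict keyed by file name."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        frames[os.path.basename(path)] = pd.read_csv(path)
    return frames

# Typical use in this notebook (assuming CSV files in the raw folder):
# raw = load_raw_csvs(input_loc)
# for name, frame in raw.items():
#     print(name, frame.shape)
#     display(frame.head())
```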

## Describing the Data
Before diving into the data, let's consider a few practical issues, which may or may not be important, depending on your dataset.

### Data Shape
Take a look at the current shape of the data set or sets. Is it in a convenient form to perform some exploratory analysis on? If not, what will you need to change before analysis can happen?

### Data Type
Which data types does each variable contain? Are variables continuous or categorical?
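
One way to answer these questions is sketched below on a small made-up dataframe; the column names are illustrative only. Note that a numeric dtype does not guarantee a continuous variable (an integer code can still be categorical), so the unique-value counts are worth a look too.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 27],
    "income": [52000.0, 61500.5, 48000.0],
    "region": ["north", "south", "north"],
})

dtype_summary = df.dtypes                                   # dtype of each column
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
unique_counts = df.nunique()                                # few unique values hints at a categorical
```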

### Data Amount
How much data does the set contain? (i.e., how many variables, how many records, how many files?) If the data set is very large, this may affect the tools you can use later (or how much feature selection will need to take place).
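
A quick way to size up a dataframe is sketched below, using a made-up frame in place of your data. `memory_usage(deep=True)` also counts the contents of string columns, which the default shallow estimate can understate.

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000), "label": ["a", "b"] * 500})

n_records, n_variables = df.shape                            # rows, columns
memory_mb = df.memory_usage(deep=True).sum() / 1024 ** 2     # approximate size in MiB
print(f"{n_records} records, {n_variables} variables, {memory_mb:.3f} MiB")
```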

## Cleaning

Before you can dive into exploratory analysis, it's likely that you will have to do some reshaping of the data. Depending on what your data set looks like, there may be a lot or very little work to do here.

### Combine and/or Reshape
Be careful not to lose any data while reshaping. Make sure you have the same number of variables and observations in the raw set as you do after reshaping. Keep in mind that, for analysis and machine learning, you will want to have one or more dataframes with each column representing a measured variable and each row representing an observation.
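
As an illustration, the sketch below reshapes a made-up wide table (one column per year) into one row per observation and then checks that no values were lost. The column names are assumptions; adapt the check to whatever reshaping your data needs.

```python
import pandas as pd

wide = pd.DataFrame({
    "station": ["A", "B"],
    "2019": [1.2, 3.4],
    "2020": [1.5, None],
})

# One row per (station, year) observation
tidy = wide.melt(id_vars="station", var_name="year", value_name="reading")

# Sanity checks: same number of cells and of non-null values before and after
values_before = wide.drop(columns="station")
n_cells_before = values_before.size
n_non_null_before = int(values_before.notna().sum().sum())
n_cells_after = len(tidy)
n_non_null_after = int(tidy["reading"].notna().sum())
assert n_cells_before == n_cells_after
assert n_non_null_before == n_non_null_after
```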

### Indexing
Which variable or variables should you index on? Depending on your dataset, this may or may not have a simple answer. Regardless, choose one or more index values for your data set.
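
A minimal sketch of setting a two-level index is shown below, on made-up data. Passing `verify_integrity=True` makes pandas raise a `ValueError` if the chosen keys do not uniquely identify each row, which is a useful early warning.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["FR", "FR", "DE"],
    "year": [2019, 2020, 2020],
    "value": [1.0, 2.0, 3.0],
})

keys = ["country", "year"]
n_key_duplicates = int(df.duplicated(subset=keys).sum())   # rows sharing the same key

# Raises ValueError if the keys are not unique
indexed = df.set_index(keys, verify_integrity=True).sort_index()
```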

### Variable Names
Depending on your data set, variables may have names with strange symbols, which can make loading, saving, and subsetting data difficult. If applicable, you should deal with this now.

### Duplicates
Deal with any duplicate rows/columns (or, conversely, any empty rows/columns), if applicable.
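
The sketch below shows one way to handle both cases on a made-up dataframe: count and drop exact duplicate rows, then drop rows and columns that are entirely empty. Whether dropping is the right choice depends on your data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, None],
    "score": [0.5, 0.5, 0.9, None],
    "notes": [None, None, None, None],
})

n_duplicate_rows = int(df.duplicated().sum())   # exact repeats of an earlier row
deduped = df.drop_duplicates()

# Remove rows, then columns, that contain no data at all
cleaned = (
    deduped
    .dropna(axis="index", how="all")
    .dropna(axis="columns", how="all")
)
```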

### Categorical Encoding
Are there any categorical variables in your data? If so, how would you deal with this so that it can be handled by a machine learning algorithm? You're free to implement the solution now or during your model training process, whichever suits you better.
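
One common option is one-hot encoding, sketched below with `pd.get_dummies` on made-up data. This is only one of several approaches (ordinal or target encoding may suit some models better), and the column names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "red"],
    "size_cm": [10, 12, 9],
})

# Replace the categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)
```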

### Uniformity
Are all variables measured using compatible units? Or are there monetary values in different forms of currency? Or something else entirely? If any of these issues are applicable to your data, design and implement a solution.
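
As a sketch of one possible solution, the example below converts lengths recorded in mixed units to metres using a lookup table. The data, the `unit` column, and the conversion table are all made up for illustration; the same pattern could apply to currencies given a table of exchange rates.

```python
import pandas as pd

df = pd.DataFrame({
    "length": [1.0, 250.0, 30.0],
    "unit": ["m", "cm", "cm"],
})

to_metres = {"m": 1.0, "cm": 0.01}          # assumed conversion factors
factors = df["unit"].map(to_metres)

# Fail loudly rather than silently producing NaN for an unknown unit
if factors.isna().any():
    unknown = df.loc[factors.isna(), "unit"].unique().tolist()
    raise ValueError(f"No conversion factor for units: {unknown}")

df["length_m"] = df["length"] * factors
```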

### Missing Data
If missing data is present (i.e., NaN values), what is the best way to deal with it? Should values be imputed (and if so, how)? Or should they simply be filled? How will each option affect the ultimate outcome of your model? Design and implement a solution.
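
To make the options concrete, the sketch below compares dropping incomplete rows with a simple imputation (median for a numeric column, most frequent value for a categorical one) on made-up data. These are only two of many strategies, and the right one depends on why the values are missing.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.6, None, 1.8, 1.7],
    "city": ["Oslo", "Lima", None, "Oslo"],
})

# Option 1: drop any row with a missing value (loses records)
dropped = df.dropna()

# Option 2: impute, keeping every record
imputed = df.copy()
imputed["height"] = imputed["height"].fillna(imputed["height"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])
```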

### Additional Cleaning
Depending on your data set, there may be additional cleaning you would like to do at this stage. If so, do that here.

#### Save the Analytic Set
After this step, you should have a single dataframe with any inconsistencies, such as non-uniform column names and missing data, fixed. Depending on the data, the dataframe could be multiindexed as well. This is a good time to save the set. We'll load it from here during the next notebook.

Be sure to give your analytic set a unique name, as other people will be using the same repository to store their data on git. To adhere to the git repo naming conventions, prepend your initials to the filename.

In [None]:
output_loc = '../data/interim/'
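
A minimal sketch of saving the set with your initials prepended is shown below. The helper `save_analytic_set`, the default file name, and the choice of CSV are assumptions; use whatever format suits your data. If your dataframe is multiindexed, remember to pass the index columns to `index_col` when reading the CSV back.

```python
import os
import pandas as pd

def save_analytic_set(df, output_dir, initials, name="analytic_set"):
    """Write df to <output_dir>/<initials>_<name>.csv and return the path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{initials}_{name}.csv")
    df.to_csv(path)   # the index is written as the leading column(s)
    return path

# Typical use in this notebook (replace 'ab' with your own initials):
# save_analytic_set(df, output_loc, initials="ab")
```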

## Verifying the Data Quality
While we're cleaning the data, it's important to give some thought to the quality.

### Data Source and Reliability
What is the origin of the data? How was it obtained? For quality assurance purposes, it's relevant to know how reliable the data is. This will help you deal with any potential data-entry errors or determine how trustworthy your eventual model will be.

### Error Handling
Are there any data-entry errors in the data? How can you tell? How would you characterize the overall quality of the dataset?
If appropriate, implement a solution for dealing with any errors you find.
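
Simple validity checks often surface data-entry errors. The sketch below flags out-of-range ages and unparseable dates in a made-up dataframe; the columns, the plausible range, and the date format are all assumptions to replace with rules that fit your data.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 51, 430],
    "signup": ["2021-03-01", "2021-13-45", "2020-07-19", "2019-11-30"],
})

# Values outside a plausible range
bad_age = ~df["age"].between(0, 120)

# Dates that cannot be parsed become NaT instead of raising
parsed = pd.to_datetime(df["signup"], format="%Y-%m-%d", errors="coerce")
bad_date = parsed.isna()

# Rows worth inspecting by hand before deciding how to fix or drop them
suspect = df[bad_age | bad_date]
```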

## Outcome 

By the end of this notebook you should have an analytic data set and be ready to dive into some analysis.