# `datexplore` Example Usage
This notebook shows how the `datexplore` package can be used in the early stages of a data analysis project, with example usage for each function in the package (`clean_names`, `visualise`, and `detect_outliers`).

Data analysis projects often begin with similar steps. For many projects, data cleaning and exploratory data analysis are essential before beginning more complex analysis. Using clean data can make your code less susceptible to bugs and errors. Exploratory data analysis can also help direct the rest of the project and give you a stronger understanding of the data you are working with.

This package aims to help with these early stages. Specifically, it contains a function to clean the column names of tabular data, a function to detect outliers in numerical data, and a function to create useful visualizations for exploratory data analysis.

## Imports

In [2]:
from datexplore.datexplore import clean_names
from datexplore.datexplore import visualise
from datexplore.datexplore import detect_outliers
import pandas as pd

## Clean names

Raw data often contains non-syntactic column names. This is particularly troublesome when the column names contain spaces and you are working with other packages that are designed only for column names without spaces.
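One quick way to see which names are affected is Python's built-in `str.isidentifier`, which reports whether a string could be used as a variable name. This is only a rough check for illustration and is not something `datexplore` is known to use internally; the helper name `find_non_syntactic` is hypothetical, and the sketch assumes every column name is a string.

```python
import keyword


def find_non_syntactic(column_names):
    """Return the names that are not valid, non-keyword Python identifiers.

    Hypothetical helper for illustration; not part of datexplore.
    """
    return [
        name
        for name in column_names
        if not name.isidentifier() or keyword.iskeyword(name)
    ]


columns = ["Even Numbers", "odd numbers", "even_numbers", "class"]
problem_columns = find_non_syntactic(columns)
print(problem_columns)  # ['Even Numbers', 'odd numbers', 'class']
```

Names containing spaces fail the check, as does `class`, which is a reserved word in Python.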

#### For column names with spaces:
One tool that does not accept unquoted column names containing spaces is the `.query()` method from the pandas library, as shown below:

In [4]:
raw_df = pd.DataFrame({'Even Numbers': [2, 4, 6, 8], 'odd numbers': [1, 3, 5, 7]})
# The space in "Even Numbers" makes this query expression invalid
filtered_df = raw_df.query("Even Numbers > 2")

SyntaxError: invalid syntax (<unknown>, line 1)

As you can see, using the column name containing a space results in an error.
We can use the `clean_names` function to "clean" the column names of the dataframe. By "cleaning", we mean renaming every column so that its name uses only letters, numbers, and underscores.

The `clean_names` function takes a pandas dataframe as input. There is also an optional parameter, `case`, which specifies the capitalization structure of the output column names (more on this later).

#### For column names without spaces:
Below we use the `clean_names` function and show that the resulting dataframe can now be used with the `.query()` method.

In [11]:
# Clean the column names and view the resulting dataframe
raw_df = pd.DataFrame({'Even Numbers': [2, 4, 6, 8], 'odd numbers': [1, 3, 5, 7]})
clean_df = clean_names(raw_df)
clean_df

| | even_numbers | odd_numbers |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 4 | 3 |
| 2 | 6 | 5 |
| 3 | 8 | 7 |


In [10]:
# Use the .query method on the new dataframe
filtered_df = clean_df.query("even_numbers > 2")
filtered_df

| | even_numbers | odd_numbers |
|---|---|---|
| 1 | 4 | 3 |
| 2 | 6 | 5 |
| 3 | 8 | 7 |


This may not seem very useful for a dataframe with only two columns, but for a dataframe with many columns, the `clean_names` function could save a lot of time.

#### Exploring the case parameter: 
The `clean_names` function also has an optional parameter, `case`, which specifies the capitalization structure of the output column names. The default value for this parameter is "snake_case", and the other options are "CamelCase" and "lowerCamelCase". "snake_case" uses only lowercase letters and replaces spaces with underscores. "CamelCase" capitalizes the first letter of the name and the first letter following each space. "lowerCamelCase" makes the first letter of the name lowercase and capitalizes the first letter following each space.
Below are some examples using this optional parameter:  
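
As a rough illustration of the three options described above, the sketch below reimplements the conversions in plain Python. The function `sketch_convert_case` is hypothetical and is not part of `datexplore`; it assumes that spaces and other non-alphanumeric characters are dropped in the two camel-case styles, which the description above does not state explicitly. With the package installed, the equivalent call would presumably be `clean_names(raw_df, case="CamelCase")`.

```python
import re


def sketch_convert_case(name, case="snake_case"):
    """Illustrative stand-in for the behaviour described above.

    This is NOT the datexplore implementation; it only mirrors the
    written description of the three `case` options.
    """
    # Split on any run of characters that is not a letter or digit
    # (assumption: such characters act as word separators).
    words = [w for w in re.split(r"[^A-Za-z0-9]+", name) if w]
    if case == "snake_case":
        return "_".join(w.lower() for w in words)
    # Capitalize the first letter of every word, leave the rest unchanged
    camel = "".join(w[:1].upper() + w[1:] for w in words)
    if case == "CamelCase":
        return camel
    if case == "lowerCamelCase":
        return camel[:1].lower() + camel[1:]
    raise ValueError(f"Unknown case: {case!r}")


columns = ["Even Numbers", "odd numbers"]
converted = {
    case: [sketch_convert_case(col, case) for col in columns]
    for case in ("snake_case", "CamelCase", "lowerCamelCase")
}
print(converted["snake_case"])      # ['even_numbers', 'odd_numbers']
print(converted["CamelCase"])       # ['EvenNumbers', 'OddNumbers']
print(converted["lowerCamelCase"])  # ['evenNumbers', 'oddNumbers']
```

The same mapping could be applied to a dataframe by assigning the converted list back to `df.columns`.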

## Visualise

## Detect outliers