# LOAD LIBRARIES
- automatically import required libraries to notebook
- `%run ../../src/start.py`


In [1]:
%run ../../src/start.py


python	3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21) 
[GCC 7.3.0]
---------------------
Versions:
----------------------
pandas      0.25.3
numpy       1.17.3
matplotlib  3.1.1
seaborn     0.9.0
plotly      4.2.1
----------------------


Loaded Libraries
-------------------
import pandas as pd
import numpy as np
import sys,os
import re
import glob
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
----------------
source file: src/start.py



# LOAD files FROM `data` FOLDERS
## Example 1

- Notebook file is under `notebooks\EXAMPLES.ipynb`
- data file is under `data\raw\tix_data_file.csv`

```bash 
.
├── data
│   ├── interim
│   ├── processed
│   └── raw
│       └── tix_data_file.csv
├── enviroment
├── notebooks
│   └── EXAMPLES.ipynb
├── reports
│   └── figures
└── src

```

## create raw_data path

- create a variable with the name of the file you want to load into pandas
- use `os.path.join` to build the path from the notebook's directory to the data file
```python
file_name = 'tix_data_file.csv'
raw_data = os.path.join('..', 'data', 'raw', file_name)
```
- the `raw_data` variable now holds the relative path to the file we want to load (on Windows the separators are backslashes)

     - `'../data/raw/tix_data_file.csv'`
## load `raw_data` file to pandas
```python
df = pd.read_csv(raw_data)
```
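
The same relative path can also be built with `pathlib`, which makes it easy to check that the file exists before handing it to pandas. This is only a sketch: the helper name `raw_data_path` and its `levels_up` parameter are made up for illustration and are not part of the project.

```python
from pathlib import Path


def raw_data_path(file_name, levels_up=1):
    """Return the relative path <levels_up x '..'>/data/raw/<file_name>.

    Hypothetical helper, assuming the folder layout shown above.
    """
    parts = [".."] * levels_up + ["data", "raw", file_name]
    return Path(*parts)


raw_data = raw_data_path("tix_data_file.csv")
print(raw_data.as_posix())  # ../data/raw/tix_data_file.csv

# Fail early with a readable message instead of a pandas traceback.
if not raw_data.is_file():
    print(f"file not found: {raw_data.resolve()}")
```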
-----

## Example 2
The folder structure is a bit different: the notebook now sits one level deeper.
```bash 
.
├── data
│   ├── interim
│   ├── processed
│   └── raw
│       └── tix_data_file.csv
├── enviroment
├── notebooks
│   ├── 01_CLEAN
│   └── 02_EDA
│       └── EXAMPLES.ipynb
├── reports
│   └── figures
└── src
```
- Notebook file is under `notebooks\02_EDA\EXAMPLES.ipynb`
- data file is under `data\raw\tix_data_file.csv`


## create raw_data path

- create a variable with the name of the file you want to load into pandas
- use `os.path.join` to build the path from the notebook's directory to the data file
- since the notebook is nested one folder deeper (`notebooks/02_EDA`), we add an extra `'..'` to the `os.path.join` call
```python
file_name = 'tix_data_file.csv'
raw_data = os.path.join('..', '..', 'data', 'raw', file_name)
```
- the `raw_data` variable now holds the relative path to the file we want to load

     - `'../../data/raw/tix_data_file.csv'`
## load `raw_data` file to pandas
```python
df = pd.read_csv(raw_data)
```
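
Examples 1 and 2 differ only in how many `'..'` segments are needed. One way to avoid counting them, sketched here under the assumption that the project root is the first parent folder containing a `data` directory, is to walk up from the notebook's directory until that folder is found. The function name `find_project_root` and its `marker` parameter are hypothetical.

```python
from pathlib import Path


def find_project_root(start=None, marker="data"):
    """Walk up from `start` until a folder containing `marker` is found.

    Hypothetical helper: assumes the project root is the closest parent
    directory that has a `data` sub-folder.
    """
    current = (Path.cwd() if start is None else Path(start)).resolve()
    for folder in [current, *current.parents]:
        if (folder / marker).is_dir():
            return folder
    raise FileNotFoundError(f"no '{marker}' folder found at or above {current}")


# Usage from any notebook depth (not executed here, since it needs the
# project layout on disk):
#   root = find_project_root()
#   raw_data = root / "data" / "raw" / "tix_data_file.csv"
#   df = pd.read_csv(raw_data)
```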
----------------------------

## Example 3
### loading a group of similar files

```bash 
.
├── data
│   ├── interim
│   ├── processed
│   └── raw
│       └── monthly_data
│           ├── apr.csv
│           ├── may.csv
│           ├── jun.csv
│           ├── jul.csv
│           ├── aug.csv
│           └── sep.csv
├── enviroment
├── notebooks
│   ├── 01_CLEAN
│   └── 02_EDA
│       └── EXAMPLES.ipynb
├── reports
│   └── figures
└── src
```

- Notebook file is under `notebooks\02_EDA\EXAMPLES.ipynb`
- data files are under `data\raw\monthly_data\`


## create raw_data path

- create a variable with the name of the folder containing the similar files
- use `os.path.join` to build the path from the notebook's directory to that folder
- since the notebook is nested one folder deeper (`notebooks/02_EDA`), we add an extra `'..'` to the `os.path.join` call

```python
path = 'monthly_data'
raw_directory = os.path.join('..', '..', 'data', 'raw', path)
```
- the `raw_directory` variable now holds the relative path to the folder we want

     - `'../../data/raw/monthly_data'`
  
## load files
- find all **csv** files in the `raw_directory` path
- create a generator expression that loads each file into a pandas dataframe
- combine all dataframes into one

```python
all_files = glob.glob(os.path.join(raw_directory, "*.csv"))  # use "*.xlsx" for Excel files
df_from_each_file = (pd.read_csv(f) for f in all_files)      # use `pd.read_excel(f)` for Excel files
df = pd.concat(df_from_each_file, ignore_index=True)         # combine into one large dataframe
```
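
After concatenation it is no longer obvious which month a row came from, and `glob.glob` returns files in arbitrary order. A minimal sketch that sorts the files and tags each row with its source file is shown here. The function name `load_folder` and the `source_file` column are assumptions made for illustration.

```python
import glob
import os

import pandas as pd


def load_folder(directory, pattern="*.csv"):
    """Load every matching file in `directory` into one dataframe.

    Hypothetical helper: adds a `source_file` column so each row can be
    traced back to the file it was read from.
    """
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        # pd.concat raises a less helpful ValueError on an empty list
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    frames = []
    for file_path in files:
        frame = pd.read_csv(file_path)
        frame["source_file"] = os.path.basename(file_path)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

With the layout above, `df = load_folder(raw_directory)` would give a dataframe whose `source_file` column holds values such as `apr.csv` and `may.csv`.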

# IMPORT SRC FILES
## Example
 - from the `src` folder, you want to load the `eda.py` module, which lives in the `visualization` folder

```bash
src
├── data
├── dscore
├── features
├── helpers
├── models
├── process_mining_core
├── utils
└── visualization
    ├── SankeyBreak.py
    ├── bubble_plot.py
    ├── eda.py
    ├── format_layout.py
    ├── heatmap.py
    ├── parsing.py
    ├── to_html.py
    └── visualize.py
```
- first append the `src` path to `sys.path` so Python can find the package
    - `sys.path.append('../../src/')`
- then import the module with `from visualization import eda as eda`


In [4]:
sys.path.append('../../src/')

In [7]:
from visualization import eda as eda
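
Re-running the `sys.path.append` cell adds the same relative entry again each time. A small sketch that stores the absolute path and skips duplicates could look like this. The helper name `add_to_sys_path` is hypothetical.

```python
import os
import sys


def add_to_sys_path(relative_path):
    """Append the absolute form of `relative_path` to sys.path once.

    Hypothetical helper: returns the absolute path that was registered.
    """
    abs_path = os.path.abspath(relative_path)
    if abs_path not in sys.path:
        sys.path.append(abs_path)
    return abs_path


src_path = add_to_sys_path(os.path.join("..", "..", "src"))
print(src_path in sys.path)  # True
```

Because the entry is absolute, the import keeps working even if the notebook later changes its working directory.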

## list the names defined inside `eda`

In [8]:
dir(eda)

['Bar',
 'Figure',
 'Histogram',
 'Layout',
 'SET1_COLORS',
 'Scatter',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'consolidate_data',
 'create_intro_div',
 'create_intro_paragraph',
 'format_layout',
 'intro_stats_barplots',
 'np',
 'pd',
 'plotly',
 'plotly_bar_plot',
 'resource_breakdown',
 'ttr_histogram_sidebyside']
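
`dir` returns every name in the module, including imports such as `np` and `pd` and the dunder attributes. To see only the functions a module defines itself, one option is the standard `inspect` module. The helper name `list_public_functions` below is an assumption, and the sketch is demonstrated on the standard-library `json` module rather than on `eda`.

```python
import inspect
import json


def list_public_functions(module):
    """Return the names of public functions defined in `module` itself.

    Hypothetical helper: skips classes, imported functions, and names
    that start with an underscore.
    """
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_") and obj.__module__ == module.__name__
    )


print(list_public_functions(json))
```

In the notebook, `list_public_functions(eda)` would be the equivalent call.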

### display docstring for `eda.consolidate_data`

In [9]:
help(eda.consolidate_data)

Help on function consolidate_data in module visualization.eda:

consolidate_data(indata, top=8)
    Takes in a dataframe with counts broken out by columns and consolidates
    categories into 'Other' according to argument top
    e.g. if there are 10 columns and top is 8, will wrap up 2 smallest
    columns into 'Other'. Note: if the 'Other' category is the biggest after
    consolidation, will list it first.
    
    Parameters
    ----------
    indata : pandas.DataFrame
        Dataframe which is copied and consolidated.
    top : int, default
        Determines how many columns to include and categorizes smallest
        remainders as other.
    
    Returns
    -------
    pandas.DataFrame
        Returns a dataframe of original columns and smaller columns categorized
        as other.
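
To make the docstring concrete, here is a rough sketch of the behaviour it describes. This is not the code in `eda.py`: the function name `consolidate_sketch` is made up, and it assumes that all columns are numeric and that column size is measured by the column sum.

```python
import pandas as pd


def consolidate_sketch(indata, top=8):
    """Keep the `top` largest columns and sum the rest into 'Other'.

    Illustrative guess at the documented behaviour, not the real
    `eda.consolidate_data` implementation.
    """
    df = indata.copy()
    totals = df.sum().sort_values(ascending=False)
    if len(totals) <= top:
        return df  # nothing to consolidate
    keep = list(totals.index[:top])
    small = list(totals.index[top:])
    result = df[keep].copy()
    result["Other"] = df[small].sum(axis=1)
    # Per the docstring, 'Other' is listed first when it is the biggest.
    if result["Other"].sum() > totals.iloc[0]:
        result = result[["Other"] + keep]
    return result


counts = pd.DataFrame({"a": [5, 5], "b": [1, 1], "c": [1, 0], "d": [3, 3]})
print(consolidate_sketch(counts, top=2))
```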




# Need work


# SAVE PLOTS

In [None]:
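def save_plots(plt_figure, name):
    # NOTE: `plt_directory_saves` and `pkl_directory_saves` are not defined here;
    # set them to the output folders for the html and pickle files before calling
    # display the plot in the notebook
    iplot(plt_figure)
    # save to html
    html_name = '{}.html'.format(name)
    plot(plt_figure, filename=os.path.join(plt_directory_saves, html_name))
    # save to pkl format; the `with` block closes the file even on errors
    filename = os.path.join(pkl_directory_saves, name)
    with open(filename, 'wb') as outfile:
        pickle.dump(plt_figure, outfile)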
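
A figure pickled by `save_plots` can be read back with `pickle.load`. The sketch below assumes the file was written by `pickle.dump` as above. The function name `load_pickled_figure` is hypothetical. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle


def load_pickled_figure(path):
    """Load an object previously saved with pickle.dump.

    Hypothetical counterpart to `save_plots`; `path` is the full path of
    the pickle file.
    """
    with open(path, "rb") as infile:
        return pickle.load(infile)
```

In the notebook this could be used as `fig = load_pickled_figure(os.path.join(pkl_directory_saves, name))` followed by `iplot(fig)`.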