# EDA Repository Considerations

## Your README file

It is important to create a repository that delivers a clear narrative describing your process and results. Above all, your repository should include a `README.md` file that provides this overview, with illustrations that support your findings and descriptions of each step along the way: data preparation, calculations, lines of inquiry, and insights.



## Repository directory structure

Consumers of your project (stakeholders, "future you", or anyone who wishes to reproduce your results) should also find your repository organized in a sensible, reasonably standard fashion.

Below is a fairly standard EDA structure.

|              |                                                |
| -----------  | ---------------------------------------------- |
| 📂 data      | for dataset                                    |
| 📂 img       | for plots & illustrations                      |
| 📂 notebooks | for Jupyter notebooks                          |
| 📂 src       | for `.py` script files                         |
| README.md    | your report for people viewing your repository |
| .gitignore   | to keep cruft from your nice clean repo!       |
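
If it helps to start from this layout, the folders and placeholder files can be created in one step. The following is only a minimal sketch: `make_eda_skeleton` is a hypothetical helper name, and the empty `README.md` and `.gitignore` it creates are placeholders for you to fill in.

```python
from pathlib import Path


def make_eda_skeleton(root):
    """Create the data/img/notebooks/src layout under `root`.

    Hypothetical helper for illustration; existing folders and files are left alone.
    """
    root = Path(root)
    # One folder per row of the table above
    for folder in ("data", "img", "notebooks", "src"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Empty placeholder files; touch() does not overwrite existing content
    for filename in ("README.md", ".gitignore"):
        (root / filename).touch(exist_ok=True)
    # Return the top-level names so the result is easy to inspect
    return sorted(p.name for p in root.iterdir())
```

Calling `make_eda_skeleton("my_eda_project")` (the project name here is made up) is safe to repeat, since nothing that already exists is replaced.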



## Importing scripts with this structure

There are a few quirks of importing and running Python script files in
Jupyter notebooks that should be mentioned in order to keep things
running smoothly for your group.

Under normal circumstances, all of your import statements would sit at the top of the notebook. Here, each import is placed where it is discussed, for demonstration purposes.

### Installed Library Imports

Any library already installed in your environment (such as pandas, NumPy, Matplotlib, or seaborn) remains available as usual.

In [15]:
# Imports
import pandas as pd

In [16]:
! ls ../data

grades.csv


In [17]:
# Import grades.csv
grades_df = pd.read_csv(
    "../data/grades.csv",
    index_col="Student"
)

In [18]:
# Display grades_df
grades_df

| Student | Math | English | Gym |
| ------- | ---- | ------- | --- |
| Sanjay  | 83   | 87      | 84  |
| Sally   | 93   | 97      | 94  |
| Edgar   | 73   | 77      | 74  |


In [19]:
# Check info
grades_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Sanjay to Edgar
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Math     3 non-null      int64
 1   English  3 non-null      int64
 2   Gym      3 non-null      int64
dtypes: int64(3)
memory usage: 96.0+ bytes


### Not importing: running a file in another directory

Once you have created a file in `src`, you will want access to the functions, classes, variables, and other content you have created.

You can run Python script files in Jupyter notebooks in much the same way that you run them in IPython.

Because the file is executed as `"__main__"`, any `if __name__ == "__main__":` block in it will run with this method.

Note:  
> The bare `run` form relies on IPython's automagic, which applies only to single-line cells, so adding a Python-style comment above this line of code causes the cell to fail. Writing the magic explicitly as `%run` avoids the problem.

In [20]:
run ../src/double_file

I was the main file!
5 doubled equals 10.


In [21]:
# The function double is now available for use
double(20)

40
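
The output above is consistent with a script shaped like the sketch below. The real `src/double_file.py` is not shown here, so treat the body as an assumption reconstructed from the printed lines, not as the actual file.

```python
# Hypothetical contents of src/double_file.py (assumed, not the original file)


def double(x):
    """Return twice the input; works for numbers as well as pandas objects."""
    return x * 2


if __name__ == "__main__":
    # This block runs when the file is executed directly (for example with run),
    # because the file's __name__ is then "__main__".
    # It is skipped when the file is imported as a module.
    print("I was the main file!")
    print(f"5 doubled equals {double(5)}.")
```

With a layout like this, running the file both prints the two lines and leaves `double` defined in the notebook, while a plain `import` would define the function without printing anything.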

### Importing from within the same directory

Importing from a Python script file within the same directory works much as you would expect it to.

Unfortunately, this makes your folder organization untidy at best. Although it's demonstrated here, it's not suggested.

In [22]:
# import from the same directory
import in_folder

In [23]:
in_folder.inside()

I am inside this directory


### Traversing to another directory to run code

You can also change into another directory to run code. This is demonstrated here but not recommended.

In [24]:
# you can run unix-style commands
# in a cell with several lines, prefix each shell command with !

# print current dir
! pwd

# look at what's up one level and inside src
! ls ../src

/Users/shawnkeech/daimil10/case_studies/tbd_file_structure_demo/notebooks
[1m[36m__pycache__[m[m           double_file.py
added_to_path_file.py traversed_file.py


In [25]:
%cd ../src

# confirm contents of the directory
! ls

In [None]:
import traversed_file

In [None]:
traversed_file.traversed()

I traversed to get here.


In [None]:
# getting back to notebooks directory
# for the purpose of demonstration
# (%cd moves the notebook itself; ! cd would only affect a temporary subshell)

%cd ../notebooks
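
A shell command started with `!` runs in a child process, so a `cd` issued that way cannot move the notebook itself; only `%cd` (or `os.chdir`) changes the notebook's working directory. The sketch below imitates the `!` behaviour with `subprocess`; the function name `cwd_after_shell_cd` is made up for illustration.

```python
import os
import subprocess


def cwd_after_shell_cd():
    """Run `cd ..` in a child shell and report the parent's directory before and after."""
    before = os.getcwd()
    # Roughly what `! cd ..` does: the directory change happens in a
    # short-lived child shell and disappears when that shell exits.
    subprocess.run("cd ..", shell=True, check=True)
    after = os.getcwd()
    return before, after


before, after = cwd_after_shell_cd()
print(before == after)  # the parent process has not moved
```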

## A better way to do imports: adding `src` to the path

There's a quick and easy way to add the `src` directory to the path. Once
this is done, you may import files easily.

In [None]:
# This would normally be at the top of the notebook
# with the other imports
import sys
sys.path.insert(0, '../src')
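
Because `'../src'` is a relative path, it is interpreted against the current working directory, so it stops pointing at `src` if the notebook's directory changes. A minimal sketch of a sturdier variant follows; `add_src_to_path` is a hypothetical helper, and it assumes `src` sits next to the `notebooks` folder as in the structure above.

```python
import sys
from pathlib import Path


def add_src_to_path(notebook_dir=None):
    """Put the sibling src directory at the front of sys.path as an absolute path.

    Hypothetical helper; `notebook_dir` defaults to the current working directory.
    """
    base = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    src = (base / ".." / "src").resolve()
    # Avoid piling up duplicate entries when the cell is re-run
    if str(src) not in sys.path:
        sys.path.insert(0, str(src))
    return src


# Run once near the top of the notebook, alongside the other imports
src_dir = add_src_to_path()
```

Resolving the path once means later directory changes do not affect which folder Python searches.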

In [None]:
# importing a function from a file inside src
from added_to_path_file import triple

In [None]:
triple(5)

15

In [None]:
# triple is now available
triple(grades_df)

| Student | Math | English | Gym |
| ------- | ---- | ------- | --- |
| Sanjay  | 249  | 261     | 252 |
| Sally   | 279  | 291     | 282 |
| Edgar   | 219  | 231     | 222 |


In [None]:
grades_df['Gym'].apply(double)

Student
Sanjay    168
Sally     188
Edgar     148
Name: Gym, dtype: int64

## Conclusion

For a clean repository, keep a tidy, predictable file structure.

|              |                                                |
| -----------  | ---------------------------------------------- |
| 📂 data      | for dataset                                    |
| 📂 img       | for plots & illustrations                      |
| 📂 notebooks | for Jupyter notebooks                          |
| 📂 src       | for `.py` script files                         |
| README.md    | your report for people viewing your repository |
| .gitignore   | to keep cruft from your nice clean repo!       |


Use the `%run` magic to run your Python scripts, or add the `src` directory to `sys.path` and import your functions.
