# pandas 10 - Best Practices in Data Analysis

by Nova@Douban

The video record of this session is here: https://zoom.us/recording/share/-AyhhqiRKrw42R8xjEWHfXKDs-w2-IGS_NLh01a9q5SwIumekTziMw


---

This session covers some best practices I have picked up over the past years working as an NLP engineer and data scientist.

## 10.1 Version control

Version control is an important practice in software development, and we should follow it to reduce errors such as deleting code by mistake.

### 10.1.1 Git

Git is one of the most popular version control tools, and the main hosting platforms all support it. We can choose from GitHub, GitLab, or Bitbucket.

Git supports many workflows. We can start with gitflow, which is simple to follow.

<img src="../image/gitflow.svg">

### 10.1.2 Git LFS

Git is suitable for storing code, but not for large files. Git LFS (Large File Storage), an extension to Git, was created to handle them. We can use it to store our datasets.

### 10.1.3 Version control your code, not your data

Some of us may be used to storing temporary or intermediate datasets on the hard disk. We do not recommend this. Instead, version control the code that produces them, and commit as frequently as possible.



---

## 10.2 Folder structure

1. If our code folder is managed by Git, it can be a separate folder that contains only code.
2. At the same level, we can have two more folders, one for raw data and one for results. We can use Git LFS to manage these two folders.
3. We can use virtualenv to control the package environment.
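
To keep scripts consistent with this layout, the data and results folders can be resolved relative to the code folder instead of being hard-coded as absolute paths. The following is a minimal sketch; the folder names `data` and `results` and the helper `project_paths` are assumptions made for illustration, not names taken from the course material.

```python
from pathlib import Path


def project_paths(code_dir):
    """Return the code folder and its sibling data / results folders.

    The folder names "data" and "results" are hypothetical; adjust them
    to match your own layout.
    """
    code_dir = Path(code_dir).resolve()
    root = code_dir.parent  # the folder that holds code, data and results
    return {
        "code": code_dir,
        "data": root / "data",
        "results": root / "results",
    }


if __name__ == "__main__":
    # Assumes the script is started from inside the code folder
    for name, path in project_paths(Path.cwd()).items():
        print(f"{name}: {path}")
```

Because every path is derived from one place, moving the whole project to another machine does not require editing the scripts.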


<img src="../image/folder.png">

---

## 10.3 Jupyter Notebook and IPython

### 10.3.1 Jupyter Notebook

1. Jupyter Notebook is a convenient tool for interactive data analysis.
2. We can put code, documentation, and visualisations in a single Jupyter notebook.
3. We can experiment / draft / benchmark in Jupyter notebook.
4. We can export to markdown / python code / PDF, etc.
5. We can install extensions to make Jupyter Notebook easier to use. See [this](https://towardsdatascience.com/jupyter-notebook-extensions-517fa69d2231?gi=e865fc4d7033).
6. All outlines of this course are created with Jupyter Notebook.

### 10.3.2 IPython

1. IPython is the default Python kernel behind Jupyter Notebook.
2. Jupyter Notebook needs a browser. For example, if we need to fix something on a server that we can only access via SSH, then the IPython shell is the practical choice.
3. If you are familiar with Jupyter Notebook, then IPython is fairly easy to use.

```python
%load_ext autoreload
%autoreload 2

from test_script import fast_read

data = fast_read('../data/gspc.csv', ['Open', 'Close'])
display(data)
```

```
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
```

| Unnamed: 0 | Open | Close |
|---|---|---|
| 0 | 2498.77002 | 2485.73999 |
| 1 | 2498.939941 | 2506.850098 |
| 2 | 2476.959961 | 2510.030029 |
| 3 | 2491.919922 | 2447.889893 |
| 4 | 2474.330078 | 2531.939941 |
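
The source does not show `test_script.py`, so the following is only a guess at what a helper like `fast_read` might look like. It assumes the helper simply reads the requested columns with `pandas.read_csv` and its `usecols` parameter; the real function may do more or behave differently. With `%autoreload 2` active, edits to such a file are picked up on the next call without restarting the session.

```python
import pandas as pd


def fast_read(path, columns):
    """Read only the requested columns from a CSV file.

    Hypothetical sketch: the real fast_read in test_script.py is not
    shown in the source and may behave differently.
    """
    # usecols restricts the result to the listed columns
    return pd.read_csv(path, usecols=columns)
```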


---

## 10.4 Data file format

Please refer to pandas 04 - pandas IO.

1. For data serialization, we can choose from JSON / Parquet / Arrow / HDF5.
2. When we use JSON, prefer line-based JSON (JSONL), which keeps stream processing possible later.
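3. NEVER use Python pickle! Loading a pickle file from an untrusted source can execute arbitrary code, and pickle files can break across Python or library versions.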
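
The stream-processing point about JSONL can be illustrated with the standard library alone. This is a minimal sketch; the function names `write_jsonl` and `read_jsonl` are made up for this example.

```python
import json


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield records one at a time, so the whole file never has to fit in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Since each line is a complete JSON document, a consumer can start working before the file is fully written and can stop early without parsing the rest. pandas reads the same format with `pd.read_json(path, lines=True)`.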

---

## 10.5 Script structure

1. Collect all frequently used snippets in a single script, so that other scripts can import them.
2. Build core functions with those snippets, and put them into another script.
3. Build a script of wrappers to call core functions.
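
The three layers can be sketched as follows. To keep the example runnable it is shown as one block, with comments marking where each part would live; the file names `utils.py`, `core.py`, and `run.py` and all function names are hypothetical.

```python
# --- utils.py: small, frequently used snippets -------------------------
def pct_change(old, new):
    """Return the relative change from `old` to `new`."""
    return (new - old) / old


# --- core.py: core functions built from the snippets -------------------
# In a real project this file would start with: from utils import pct_change
def intraday_returns(rows):
    """Compute the open-to-close return for each (open, close) pair."""
    return [pct_change(open_, close) for open_, close in rows]


# --- run.py: thin wrapper that calls the core functions ----------------
# In a real project this file would start with: from core import intraday_returns
def main():
    rows = [(100.0, 110.0), (200.0, 190.0)]  # illustrative values only
    for value in intraday_returns(rows):
        print(f"{value:+.2%}")


if __name__ == "__main__":
    main()
```

Keeping the wrapper thin means the core functions can be imported into a notebook or tested on their own.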

---

## 10.6 How to write functions?

1. Make your functions readable and pure;
2. Make the function name simple to remember;
3. Always add docstring and comments;
4. Always abstract your functions so that they can be reused;
5. Always refactor your code;
6. If a problem is too difficult to solve, try writing pseudo code first.
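
As an illustration of the first and third points, here is a small function that is pure and documented. It is a sketch written for this outline, not code from the course repository.

```python
def min_max_scale(values):
    """Scale numbers linearly to the range [0, 1].

    Pure: the input is not modified, and the same input always
    gives the same output.

    Args:
        values: a non-empty sequence of numbers.

    Returns:
        A new list of floats. If all values are equal, every entry is 0.0.
    """
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Avoid division by zero when all values are identical
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]
```

Because the function depends only on its argument and returns a new list, it can be called repeatedly from a notebook without hidden state changing the result.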

## 10.7 Exercises

1. Read [Cookiecutter Data Science — Organize your Projects — Atom and Jupyter](https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e)
2. Read [Gitflow introduction](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow) by Bitbucket