# Data Preparation

In [38]:
# Commands to enable autoreload of the imported packages
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


The first goal is to load all 9 csv files. Each csv file is read into a pandas DataFrame, and all of them are stored in a single dictionary named `data`, where each key is derived from the name of the csv file and each value is the DataFrame created from that csv.

e.g.:

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    ...
    }
```

### 1. The variable `csv_path` is created, which stores the path to the `csv` folder as a string

- When calling `pd.read_csv()`, the path passed to it can be absolute or relative:
    - A **`relative path`** can start with `.` or `..` and is always resolved with respect to the current working directory
        - *Use `!pwd` in a notebook or `pwd` in a terminal to see where the current working directory is located*
    - An **`absolute path`** starts with `/` on Linux/macOS (or with a drive letter such as `C:\` on Windows)
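
As a quick check of the two kinds of path, the standard library can tell them apart and convert one into the other. This is a minimal sketch; the folder names `data` and `csv` are only used for illustration and do not need to exist for the code to run.

```python
import os

relative = os.path.join("data", "csv")  # no leading separator: relative to the working directory
absolute = os.path.abspath(relative)    # same location, anchored at the filesystem root

print(os.path.isabs(relative))  # False
print(os.path.isabs(absolute))  # True

# abspath resolves a relative path against the current working directory
print(absolute == os.path.join(os.getcwd(), "data", "csv"))  # True
```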

In [39]:
# Checking current working directory using `os.getcwd()`
import os
os.getcwd()

'/home/chongxe1991/code/chongxe1991/olist_data_analysis'

☝️ `getcwd` (short for `get current working directory`) returns the absolute path of the directory _from which this notebook is being executed_

A relative `csv_path` now needs to be created.

Using [`os.path.join`](https://docs.python.org/3/library/os.path.html) inserts the correct separator for the current operating system, so it replaces both:
* Linux/macOS syntax (e.g. `../folder_name`)
* and Windows syntax (e.g. `..\folder_name`)

and is therefore more robust!
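
To see both separators on a single machine, the two flavours of `os.path` can be called directly. `os.path` is an alias of `posixpath` on Linux/macOS and of `ntpath` on Windows. The short sketch below only illustrates the difference and is not needed for the rest of the notebook.

```python
import ntpath     # Windows flavour of os.path
import posixpath  # Linux/macOS flavour of os.path

print(posixpath.join("..", "folder_name"))  # ../folder_name
print(ntpath.join("..", "folder_name"))     # ..\folder_name
```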

In [40]:
#Creating relative path
csv_path = os.path.join("data", "csv")
csv_path

'data/csv'

In [41]:
#Getting the root path of the current working directory
root_path = os.getcwd()
root_path

'/home/chongxe1991/code/chongxe1991/olist_data_analysis'

In [42]:
# Joining the root path with the data/csv folder
os.path.join(root_path, "data", "csv")

'/home/chongxe1991/code/chongxe1991/olist_data_analysis/data/csv'

In [43]:
#Listing the files in the directory
os.listdir(os.path.join(root_path, "data", "csv"))

['.gitkeep',
 'olist_sellers_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_orders_dataset.csv',
 'olist_order_payments_dataset.csv',
 'product_category_name_translation.csv',
 'olist_products_dataset.csv',
 'olist_geolocation_dataset.csv']

In [44]:
# Testing code below
import pandas as pd
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv')).head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


### 2. The list `file_names` is created. It contains all csv file names in the csv directory

- It looks like this: `file_names = ['olist_sellers_dataset.csv', ....]`
- `os.listdir()` can be used
- It must only list csv files!
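
One possible way to write this filter is a small helper built on a generator expression. The function name `list_csv_files` is hypothetical and not part of the project. It also sorts the names, since `os.listdir()` returns entries in arbitrary order, so the result may be ordered differently from the output shown in this notebook.

```python
import os

def list_csv_files(folder):
    """Return the names of the .csv files in `folder`, sorted alphabetically."""
    # Entries such as '.gitkeep' are skipped because they do not end with '.csv'
    return sorted(name for name in os.listdir(folder) if name.endswith(".csv"))
```

It would be called as `file_names = list_csv_files(csv_path)`.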

In [45]:
file_names = []
for item in os.listdir(csv_path):
    if item.endswith(".csv"):
        file_names.append(item)
file_names

['olist_sellers_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_orders_dataset.csv',
 'olist_order_payments_dataset.csv',
 'product_category_name_translation.csv',
 'olist_products_dataset.csv',
 'olist_geolocation_dataset.csv']

### 3. The list of dictionary keys `key_names` is created
Starting from `file_names`, each key is obtained by:
- Removing the suffix "_dataset.csv" when it exists
- Otherwise removing the suffix ".csv" when it exists
- Removing the prefix "olist_" when it exists

<details>
    <summary>▸ Details</summary>

- `.replace()` is used
    
- Strings (`str`) are sequences that can be sliced with `[ ]`
</details>

In [46]:
key_names = []
for item in file_names:
    item = item.replace("_dataset.csv", "")
    item = item.replace(".csv", "")
    item = item.replace("olist_", "")
    key_names.append(item)
key_names

['sellers',
 'order_reviews',
 'order_items',
 'customers',
 'orders',
 'order_payments',
 'product_category_name_translation',
 'products',
 'geolocation']
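
Note that `.replace()` removes a substring wherever it occurs, not only at the start or end of the name. That works for these file names, but a stricter variant can rely on `startswith`/`endswith` and slicing. The sketch below is one such alternative; the helper name `to_key` is hypothetical.

```python
def to_key(file_name):
    """Strip the 'olist_' prefix and the '_dataset.csv' or '.csv' suffix."""
    for suffix in ("_dataset.csv", ".csv"):
        if file_name.endswith(suffix):
            file_name = file_name[:-len(suffix)]
            break  # at most one suffix is removed
    if file_name.startswith("olist_"):
        file_name = file_name[len("olist_"):]
    return file_name

print(to_key("olist_sellers_dataset.csv"))              # sellers
print(to_key("product_category_name_translation.csv"))  # product_category_name_translation
```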

### 4. The dictionary `data` is constructed

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    'order_items': DataFrame3,
    ...
    }
```

<details>
    <summary>▸ Details</summary>

The built-in `zip()` function is used to iterate over two lists in parallel
```python
for (x, y) in zip(['a','b','c'], [1,2,3]):
    print(x,y)

# prints:
# a 1
# b 2
# c 3
```
</details>
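
The same pairing can feed a dictionary comprehension. The sketch below uses two sample keys and maps each one to the path of its csv file, which is the value that would then be passed to `pd.read_csv()`. The sample lists are shortened for illustration.

```python
import os

key_names = ["sellers", "orders"]
file_names = ["olist_sellers_dataset.csv", "olist_orders_dataset.csv"]
csv_path = os.path.join("data", "csv")

# Map each key to the path of its csv file
paths = {key: os.path.join(csv_path, name) for key, name in zip(key_names, file_names)}

print(list(paths))  # ['sellers', 'orders']
```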

In [47]:
data = {}
for (key, file_name) in zip(key_names, file_names):
    # Store the full DataFrame (not just its first rows) under its key
    data[key] = pd.read_csv(os.path.join(csv_path, file_name))

### 5. The method `get_data()` in `data.py` is implemented

It returns the dictionary `data` when called as shown below:

```python
from olist.data import Olist
Olist().get_data()
```
- The method `get_data()` needs to be callable from various places (e.g. the terminal, this notebook, another notebook located elsewhere, etc.)
- A relative path cannot be used this time, because it would be resolved against the current working directory `os.getcwd()`, which depends on where the code is run from
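
A common way around this is to build the path from the location of the module file itself rather than from the working directory. The sketch below is not the actual content of `data.py`. It assumes a layout in which `olist/data.py` and `data/csv` sit under the same project root, and the helper name `csv_dir_from` is hypothetical.

```python
import os

def csv_dir_from(module_file):
    """Return the absolute path of the csv folder, given the path of olist/data.py.

    Assumed layout (hypothetical): <project_root>/olist/data.py and <project_root>/data/csv
    """
    package_dir = os.path.dirname(os.path.abspath(module_file))  # <project_root>/olist
    project_root = os.path.dirname(package_dir)                  # <project_root>
    return os.path.join(project_root, "data", "csv")
```

Inside `data.py`, this would be called as `csv_dir_from(__file__)`, so the result no longer depends on where Python was started.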

In [48]:
from olist.data import Olist
Olist().get_data()['sellers'].head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP
