## Data Wrangling

In [5]:
import os
import pandas as pd
from tqdm import tqdm

### TQDM Warmer

In [4]:
!pip install tqdm

Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 2.0 MB/s eta 0:00:00
Installing collected packages: tqdm
Successfully installed tqdm-4.64.1


In [7]:
empty_list = []




In [10]:
len(empty_list)

0

Reading in the temperature data is a tricky task! There are over 6000 individual files, and the dataset is too large to load into a typical computer's memory all at once. The files therefore need to be processed one by one.

### Walk-through: Build a big temperature dataset for Europe

This walk-through uses Python, raw SQL, notebooks and the `psql` shell. The general procedure is as follows:

1. Read in each individual file with `pandas` for data cleaning
2. Append each dataset to a large `.csv` file on disk
3. Use psql's `\copy` command to import the merged dataset into PostgreSQL

#### Start with a small dataset first
Try the exercise on a small subset of the data files first. Once everything works, attempt a run on the complete dataset.

### Objective 1: Python

Write a Python function `parse_file(filename)` that processes a single dataset:

- Read in the dataset and skip the initial header lines
- Clean the column names
- Cast the `date` column into the `datetime` format
- Only select valid observations (the ones where `q_tg==0`)
- Drop the columns `souid` and `q_tg`

Test the function on a single dataset first:

```python
parse_file('TG_STAID000001.txt')
```

should return

|    |   staid | date       |   tg |
|---:|--------:|:-----------|-----:|
|  0 |       1 | 1860-01-01 |   21 |
|  1 |       1 | 1860-01-02 |   46 |
|  2 |       1 | 1860-01-03 |   31 |
|  3 |       1 | 1860-01-04 |   37 |
|... |     ... | ...        |  ... |
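
The five steps listed above could be combined roughly as follows. This is a minimal sketch, not the reference solution: it assumes the file is comma-separated, that the column header follows 19 header lines (the `skiprows=19` value used in this notebook), and that the `date` column is stored as `YYYYMMDD`. The `skiprows` parameter is added here only to make that assumption easy to change.

```python
import pandas as pd


def parse_file(filename, skiprows=19):
    """Read one station file and return the columns staid, date and tg.

    Assumptions (not confirmed by the exercise text): comma-separated
    data, `skiprows` header lines before the column names, and dates
    written as YYYYMMDD.
    """
    # STEP 1: read the file and skip the initial header lines;
    # skipinitialspace drops the padding after each comma
    df = pd.read_csv(filename, skiprows=skiprows, skipinitialspace=True)

    # STEP 2: clean the column names (strip whitespace, lower-case)
    df.columns = df.columns.str.strip().str.lower()

    # STEP 3: cast the date column into the datetime format
    df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

    # STEP 4: keep only valid observations
    df = df[df["q_tg"] == 0]

    # STEP 5: drop the columns that are no longer needed
    return df.drop(columns=["souid", "q_tg"])


# Example call (the path is an assumption about where the files live):
# parse_file('./data/TG_STAID000001.txt').head()
```

If the header length differs between files, `skiprows` has to be adjusted; checking the first lines of a file in a text editor is the quickest way to confirm it.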

In [11]:
# STEP 1: Read in the dataset and skip the initial header lines

df = pd.read_csv('./data/TG_STAID000001.txt', skiprows=19)

FileNotFoundError: [Errno 2] No such file or directory: './data/TG_STAID000001.txt'

In [None]:
# Display the column names, how could we clean them?



In [None]:
# STEP 2: Clean the column names



In [None]:
# STEP 3: Cast the `date` column into the `datetime` format



In [None]:
# What is the shape of the dataframe? 



In [None]:
# STEP 4: Only select valid observations (the ones where `q_tg==0`)



In [None]:
# What is the shape of the dataframe after the filtering?



In [None]:
# STEP 5: Drop the columns `souid` and `q_tg`



In [None]:
# What is the shape of the dataframe now?



In [None]:
# Take a look at the first 5 rows and all of the columns



In [None]:
# Now wrap all five steps that transform the data into a function:

def parse_file(filename):
    
    
    
    
    
    

In [None]:
# test the function on a file

parse_file('TG_STAID000001.txt').head()

### Objective 2: Automation

Loop over all files, read in the data, and append each data frame to a single text file `mean_temperature.csv`:

```python
from tqdm import tqdm

with ___("./data/mean_temperature.csv", mode="w", newline='') as file:
    for ___ in tqdm(os.listdir(___)):
        if 'TG_STAID' ___ filename:
            df = ___(filename)
            ___.to_csv(file, index=False, header=False)
```

- Use the `tqdm` function to generate a progress bar while looping over the files
- Only process files that contain `TG_STAID` in their filename

**NOTE:**
This step can take around 15 minutes. It also creates a CSV file that is larger than 2 GB! Test it with a few data files first!
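
One way to test with only a few files is to cap how many are processed. The sketch below is an illustration under assumptions: `merge_files`, `parse` and `limit` are hypothetical names, and `parse` stands for any function that takes a file path and returns a data frame (for example your own `parse_file`). Because `os.listdir` returns bare filenames, the directory is joined back on before parsing.

```python
import os

try:
    from tqdm import tqdm
except ImportError:
    # fallback so the sketch still runs when tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable


def merge_files(data_dir, out_path, parse, limit=None):
    """Append the parsed data of every TG_STAID file to one CSV file.

    `parse` is any callable that maps a file path to an object with a
    pandas-style `to_csv` method; `limit` caps the number of files.
    Returns the number of files that were processed.
    """
    filenames = sorted(f for f in os.listdir(data_dir) if "TG_STAID" in f)
    if limit is not None:
        filenames = filenames[:limit]

    with open(out_path, mode="w", newline="") as file:
        for filename in tqdm(filenames):
            df = parse(os.path.join(data_dir, filename))
            df.to_csv(file, index=False, header=False)
    return len(filenames)


# Example call (paths are assumptions about the local folder layout):
# merge_files("./data", "./data/mean_temperature_small.csv", parse_file, limit=3)
```

Running it once with `limit=3` and once with `limit=None` gives the small test file and the full file from the same code.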

In [None]:
# recap os.listdir(directory)

os.listdir()

# which directory can we use in the for-loop?

In [None]:
# create a small version first with 2-3 datasets as "mean_temperature_small.csv"   >> os.listdir('../data/ECA_blend_tg/')[0:10] 
# we will use it for the upload test !




In [None]:
# read in the CSV, check to see if the data looks correct.



In [None]:
# change the file name to mean_temperature.csv and run your code with ALL files (5-10 min)





In [None]:
# let's check the file. import the mean_temperature.csv (it needs time to load)



In [None]:
# how many rows? 



In [None]:
# how many unique stations?
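
The checks in the last few cells could look roughly like this. It is a sketch that assumes the merged file has no header row and holds the three columns `staid`, `date` and `tg` in that order; `load_merged` is a hypothetical helper name, and the in-memory sample only stands in for the real file.

```python
import io

import pandas as pd


def load_merged(source):
    """Load a header-less merged CSV with the columns staid, date, tg."""
    return pd.read_csv(
        source,
        header=None,                    # the file was written without a header
        names=["staid", "date", "tg"],  # assumed column order
        parse_dates=["date"],
    )


# Small in-memory stand-in for mean_temperature_small.csv
sample = io.StringIO("1,1860-01-01,21\n1,1860-01-02,46\n2,1900-01-01,-5\n")
df = load_merged(sample)

n_rows = len(df)                    # how many rows?
n_stations = df["staid"].nunique()  # how many unique stations?
print(n_rows, n_stations)
```

For the real file, pass the path (for example `'./data/mean_temperature.csv'`) instead of the in-memory sample.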



### Objective 3: SQL

Because the file is so big, we process it outside of Python with `psql`. The `\copy` 
command is one of the fastest ways to bulk load large amounts of data into a database.

Write a `.sql` script that contains:

- A table definition for the table `mean_temperature`
- A foreign key constraint for the column `staid` **(skip if you don't have a station table in your climate DB)**
- A `\copy` statement that imports the data

Use this SQL script as a reference:

```postgresql

SELECT transaction_timestamp();

BEGIN;

___ ___ IF EXISTS mean_temperature CASCADE;

CREATE TABLE ___ (
    staid INT,
    date ___,
    ___ ___
);

\copy ___ FROM ___  WITH (HEADER false, FORMAT csv);

COMMIT;

SELECT transaction_timestamp();
```

**BIG DATA**:
As this step depends heavily on the speed of your network and the processing power of your database, it can take up to several hours to complete.

The `BEGIN` and `COMMIT` statements that are wrapped around the actual queries 
set up a transaction. It bundles all statements into a single all-or-nothing operation.
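
The all-or-nothing behaviour can be tried out in a few lines of Python. The sketch below uses the built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made only so that it runs without a database server); the table layout and the simulated failure are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mean_temperature (staid INT, date TEXT, tg INT)")
conn.commit()

# A transaction that fails halfway: nothing is kept
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO mean_temperature VALUES (1, '1860-01-01', 21)")
        conn.execute("INSERT INTO mean_temperature VALUES (1, '1860-01-02', 46)")
        raise RuntimeError("simulated failure before COMMIT")
except RuntimeError:
    pass

rows_after_failure = conn.execute(
    "SELECT COUNT(*) FROM mean_temperature"
).fetchone()[0]

# A transaction that completes: both rows are kept
with conn:
    conn.execute("INSERT INTO mean_temperature VALUES (1, '1860-01-01', 21)")
    conn.execute("INSERT INTO mean_temperature VALUES (1, '1860-01-02', 46)")

rows_after_success = conn.execute(
    "SELECT COUNT(*) FROM mean_temperature"
).fetchone()[0]

print(rows_after_failure, rows_after_success)
```

In the `.sql` script the same idea applies: if the import fails before `COMMIT`, the `DROP` and `CREATE` statements are rolled back as well, so the previous table is left untouched.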

**NOTE:**
Try the SQL script with the small version first: `mean_temperature_small.csv`

Verify that the `mean_temperature` table is now populated with data.

### Last step: Foreign Keys

Assuming that the `stations` table has a primary key on `staid`, add a foreign key to the `mean_temperature` table with the following SQL query:

```sql
ALTER TABLE mean_temperature
ADD FOREIGN KEY (staid) REFERENCES stations(staid);
```
