# Pre-Requisite Notebook
## Welcome
Welcome to Applied Data Science for 2023 Semester 2! 

## General Tutorial Structure for Weeks 1 - 4
- We will be using Python, PySpark and Git for the majority of this subject. 
- This is because the TLC (NYC Taxi and Limousine Commission), as of 2022, has opted to publish its data as `parquet` rather than `csv`. We will discuss this in the first tutorial.
- The first hour of the tutorial will be based on general programming how-to's and walkthroughs.
- The remainder of the tutorial is for asking questions and forming groups. We can also answer general questions about industry / applying for jobs. 
- Finally, **tutorial attendance will be marked after the first 4 tutorials for the industry project component.**

----

## `git` (GitHub) Summary
_Whilst this should have been covered in pre-req subjects, I understand that the uni may not have covered it_. For this subject, the minimal requirement for `git` is:
1. `clone`: copy an existing repo from remote (repository) into your local destination.
2. Publishing new changes:
  - `add` + `commit`: create a new snapshot of the local repository and commit the changes.
  - `push`: upload your local commits to remote.
3. Syncing unseen changes:
  - `fetch`: download unseen commits from remote to local.
  - `merge`: merge the commits from remote with changes in local.
    - If local has no new changes (or `is up to date`), the merge does not create a new snapshot.
    - Otherwise, changes will be merged automatically if there is no *conflict*; if there is one, you need to resolve it. You will need to `commit` the merge result once this process finishes.
      - Question: After `merge`, are local and remote now synced? Why or why not?
  - `pull`: Shorthand for chaining `fetch` and `merge`
  
Graphical illustration:
![gitoverview](../../media/git-process.png)

4. Authentication:
  - For GitHub, a [Personal Access Token (PAT)](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) is required as a security measure.

### GitHub Desktop
GitHub Desktop hides much of the process under the hood. It is good for those who are not familiar with `git` and, honestly, we use it for industry work because it's easy.

**Cloning:** 
1. Download [GitHub Desktop](https://desktop.github.com/)
2. Login with your credentials
3. On the top-left menu, click on `Add` -> `Clone repository...`
4. Enter https://github.com/VoLKyyyOG/MAST30034_Python as the URL
5. Click on `Clone`.
6. Done!

**Publishing:**
1. Add changes.
2. Click on the `MAST30034_Python` repo.
3. Add a summary (e.g. `"removed incorrect transformation for xyz"`)
4. Commit to `main` (or your specified `branch` if you know what it is)
5. Push, and you are done.

**Syncing:**
1. Click on the `MAST30034_Python` repo.
2. Click `Fetch origin` (refresh icon)
3. Pull, and you are done.

### `git` CLI (Command Line Interface)
If you are using the `git` CLI, you will need a PAT:
1. Visit https://github.com/settings/tokens 
2. Generate a token (set it to expire at the end of this semester).
3. Add changes and commit as usual.
4. Now, after entering your `username`, you will be prompted for a `password`. Rather than using your GitHub password, you should use your generated PAT here.
5. Done!

**Cloning:** 
1. Open a terminal (yes, this needs to be command-line `git` to work).
2. `git clone HTTPS` (where HTTPS is the https URL of your GitHub repo).
3. Enter your credentials (with PAT).
4. Done.

**Publishing:** 
1. Change directories to inside your repository (`cd NAME_OF_REPO_FOLDER`).
2. `git add -A` (this stages all changed and untracked files for the next commit; ignored files are excluded). You can use `git status` to check which files have changed before adding.
3. `git commit -m "message"` (make a commit with a message).
4. `git push`
5. Enter your credentials.
    - Here, use the same username.
    - BUT, instead of your password, use the PAT you generated.
6. Done.

**Syncing:** 
1. Change directories to inside your repository (`cd NAME_OF_REPO_FOLDER`).
2. `git pull`
3. Done.

---

## General Tips for Jupyter Notebook
Cell shortcuts:
- `shift + enter` : Run current cell (equivalent of pressing <button class='btn btn-default btn-xs'><i class="fa-play fa"></i><span class="toolbar-btn-label">Run</span></button>)
- `ctrl + enter` : Run selected cells

Command mode (press `esc` to enter):
- `m` : Makes the cell markdown
- `y` : Makes the cell into code
- `a` : Insert cell above
- `b` : Insert cell below
- double `d` : Delete current cell

Code Shortcuts:
- `shift + tab` : brings up the function's arguments and documentation

Multiline Cursor:
- Hold down `ctrl` on Windows or `cmd` on Mac and click on the places you wish to edit all together.

---

## Python / Requests
This notebook will explain how to make requests in Python to download files.

1. There are several libraries and packages available for Python when it comes to requesting data. For this tutorial, we'll use `urllib`.

In [1]:
from urllib.request import urlretrieve

2. We now want to set an output directory. You can manually create it OR you can also use Python to do so. We will be using the latter method for automation purposes. To do so, we will use the [`os` library](https://docs.python.org/3/library/os.html#os.mkdir).

Important (Paths): 
- Windows Users: https://www.computerhope.com/issues/ch001708.htm#windows
- MacOS/Linux/WSL Users: https://www.computerhope.com/issues/ch001708.htm#linux
- `..` is used to _go up_ a level (i.e the back button).

We will make a new folder _outside_ this `tutorials/tute_1` folder, inside the root `MAST30034` directory. To do so, we will use `../../data` to "exit" the current directory and "go up" two levels to the root directory. Then, we will go into the `data` folder to create subdirectories.

If you cloned the repo, you should already have the `data/taxi_zones` directory.
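
To make the `..` notation concrete, here is a minimal sketch of how a relative path such as `../../data` resolves. The starting folder `/home/user/MAST30034/tutorials/tute_1` is a hypothetical location chosen for illustration, and `posixpath` is used only so that the output looks the same on every operating system.

```python
import posixpath

# Hypothetical location of this notebook's folder (an assumption for illustration)
notebook_dir = "/home/user/MAST30034/tutorials/tute_1"

# each `..` climbs one level: tute_1 -> tutorials -> MAST30034
one_up = posixpath.normpath(posixpath.join(notebook_dir, ".."))
two_up = posixpath.normpath(posixpath.join(notebook_dir, "../.."))
data_dir = posixpath.normpath(posixpath.join(notebook_dir, "../../data"))

print(one_up)    # /home/user/MAST30034/tutorials
print(two_up)    # /home/user/MAST30034
print(data_dir)  # /home/user/MAST30034/data
```

In your own notebook, `os.path.abspath('../../data')` should show where the relative path points from your current working directory.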

In [2]:
import os

# from the current `tute_1` directory, go back two levels to the `MAST30034` directory
output_relative_dir = '../../data/'

# check whether it exists first, as os.makedirs will raise an error if the directory already exists
if not os.path.exists(output_relative_dir):
    os.makedirs(output_relative_dir)
    
# now, for each type of data set we will need, we will create the paths
for target_dir in ('tlc_data', 'tute_data'): # taxi_zones should already exist
    if not os.path.exists(output_relative_dir + target_dir):
        os.makedirs(output_relative_dir + target_dir)
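
As an alternative to checking `os.path.exists` before every call, `os.makedirs` accepts `exist_ok=True`, which does nothing when the directory is already there. The sketch below is one possible way to wrap this; the helper name `ensure_dirs` is hypothetical, and the demo runs inside a temporary directory so it does not touch your repository.

```python
import os
import tempfile


def ensure_dirs(base_dir, subdirs):
    """Create base_dir and each subdirectory, leaving existing ones untouched."""
    paths = []
    for name in subdirs:
        path = os.path.join(base_dir, name)
        # exist_ok=True means no error is raised if the directory already exists
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


# demo in a throwaway directory rather than the real `data` folder
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "data")
    ensure_dirs(base, ("tlc_data", "tute_data"))
    ensure_dirs(base, ("tlc_data", "tute_data"))  # running it twice is safe
    print(sorted(os.listdir(base)))  # ['tlc_data', 'tute_data']
```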

3. Now, we will download the required datasets. For this tutorial, we will only use January to March, but you can adjust it to your requirements.

**Please only use the years where there are zones (post 2015).**

In [6]:
YEAR = '2022'
# adjust the range function to the numerical months i.e 1 = jan, 2 = feb, etc...
# MONTHS = range(1, 13)
MONTHS = range(1, 4)

In [7]:
# this is the URL template as of 07/2023
URL_TEMPLATE = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_"  # followed by year-month.parquet

In [8]:
# data output directory is `data/tlc_data/`
tlc_output_dir = output_relative_dir + 'tlc_data'

for month in MONTHS:
    # 0-fill i.e 1 -> 01, 2 -> 02, etc
    month = str(month).zfill(2) 
    print(f"Begin month {month}")
    
    # generate url
    url = f'{URL_TEMPLATE}{YEAR}-{month}.parquet'
    # generate output location and filename
    output_path = f"{tlc_output_dir}/{YEAR}-{month}.parquet"
    # download the file at `url` and save it to `output_path`
    urlretrieve(url, output_path) 
    
    print(f"Completed month {month}")

Begin month 01
Completed month 01
Begin month 02
Completed month 02
Begin month 03
Completed month 03
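
The parquet files are large, so re-running the download cell repeats a lot of work. The sketch below shows one possible way to skip months that are already on disk. The helper names `build_url` and `download_month` are hypothetical, as is the `retrieve` parameter, which exists only so the function can be tried out without touching the network.

```python
import os
from urllib.request import urlretrieve

# same URL template as in the cell above
URL_TEMPLATE = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_"


def build_url(year, month):
    """Build the download URL, 0-filling the month (1 -> 01, 2 -> 02, ...)."""
    return f"{URL_TEMPLATE}{year}-{str(month).zfill(2)}.parquet"


def download_month(year, month, output_dir, retrieve=urlretrieve):
    """Download one month unless its file already exists; return the local path."""
    output_path = os.path.join(output_dir, f"{year}-{str(month).zfill(2)}.parquet")
    if os.path.exists(output_path):
        print(f"Skipping {output_path} (already downloaded)")
        return output_path
    retrieve(build_url(year, month), output_path)
    return output_path


print(build_url("2022", 1))
# https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
```

With this in place, the loop becomes `for month in MONTHS: download_month(YEAR, month, tlc_output_dir)`, and a second run should only print the skip messages.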


4. The taxi zone lookup table and the zip file containing the shapefile are both linked from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page:
    - https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
    - https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip
    
and now we are done!
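
If you did not clone the repo and downloaded `taxi_zones.zip` yourself, the archive still needs to be unpacked before the shapefile can be used. Below is a minimal sketch using the standard `zipfile` module; the helper name `extract_zip` and the paths in the usage comment are assumptions for illustration.

```python
import os
import zipfile


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Hypothetical usage, assuming the zip was saved inside data/taxi_zones:
# names = extract_zip('../../data/taxi_zones/taxi_zones.zip', '../../data/taxi_zones')
# print([name for name in names if name.endswith('.shp')])
```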