# Python Good Practices

This notebook collects tips and tricks for good practices in Python.

## Contents

* [Main declaration in scripts](#scrollTo=fI2EjehkYvTD)
* [Comprehensions](#scrollTo=PTI6g18lbBdW&line=1&uniqifier=1)
* [Using numpy for loops](#scrollTo=blfA3wRPdGaf)
* [Glob](#scrollTo=lQFOqMouIqZP&line=1&uniqifier=1)

## Main declaration in scripts

When a file is meant to be run as a script, it is good practice to put its logic in a `main` function and call that function behind a `__name__` guard, as follows.

> **Note:** file != script. A script is a file that is meant to be run and is not meant to be imported.

In [1]:
def main():
  ...

if __name__ == '__main__':
  main()

Why? Python sets a file's `__name__` differently depending on how the file is used: it is `'__main__'` when the file is executed as a script, and the module's name when the file is imported. Python has no built-in entry point like Java's `public static void main` (psvm), so this guard serves as the entry point for the script. If another file imports the script, the code in `main` does not run by mistake.
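
To make the two values of `__name__` concrete, here is a minimal sketch that writes a small script to a temporary directory, runs it as a script with `runpy`, and then imports it as a module with `importlib`. The file name `demo_script.py` and the `MAIN_RAN` flag are hypothetical and exist only for this demonstration.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical script contents, used only for this demonstration.
SCRIPT = '''
MAIN_RAN = False

def main():
    global MAIN_RAN
    MAIN_RAN = True

if __name__ == "__main__":
    main()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, "demo_script.py")
    path.write_text(SCRIPT)

    # Run the file as a script would run: __name__ is "__main__".
    as_script = runpy.run_path(str(path), run_name="__main__")

    # Import the same file as a module: __name__ is "demo_script".
    spec = importlib.util.spec_from_file_location("demo_script", path)
    as_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(as_module)

print(as_script["MAIN_RAN"])  # True: the guard called main()
print(as_module.MAIN_RAN)     # False: main() was skipped on import
print(as_module.__name__)     # demo_script
```

The guard fires only in the first case, which is why importing a script written this way has no side effects beyond defining its functions.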

## Comprehensions

### List comprehensions

List comprehensions are a concise way to build a list from an iterable, optionally filtering or transforming its items. They are *generally* easier to read than the equivalent loop and fit in a single line of code.

In [2]:
numbers = [2, 7, 4, 5, 10, 4, 5]

numbers_greater_than_five = [number for number in numbers if number > 5]
print(numbers_greater_than_five)

[7, 10]
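
For comparison, here is a minimal sketch of the explicit loop that the comprehension above replaces, plus a comprehension that also transforms the items it keeps. The names `filtered_with_loop` and `doubled` are illustrative and not part of the original notebook.

```python
numbers = [2, 7, 4, 5, 10, 4, 5]

# Equivalent explicit loop: start with an empty list and append matches.
filtered_with_loop = []
for number in numbers:
    if number > 5:
        filtered_with_loop.append(number)

# The comprehension builds the same list in one expression.
filtered = [number for number in numbers if number > 5]

# The leading expression can also transform each kept item.
doubled = [number * 2 for number in numbers if number > 5]

print(filtered_with_loop)  # [7, 10]
print(filtered)            # [7, 10]
print(doubled)             # [14, 20]
```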


### Dictionary comprehensions

Dictionary comprehensions work the same way as **list comprehensions**, but they build a dictionary from `key: value` pairs.

In [3]:
fruits = ["apple", "apple", "kiwi", "melon", "apple", "kiwi"]

fruit_dict = {f: fruits.count(f) for f in fruits}
print(fruit_dict)

{'apple': 3, 'kiwi': 2, 'melon': 1}
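
One caveat about the cell above: `fruits.count(f)` rescans the whole list for every element, and repeated fruits are counted more than once. For counting specifically, a possible alternative is `collections.Counter` from the standard library, sketched below. The names `counts` and `repeated` are illustrative.

```python
from collections import Counter

fruits = ["apple", "apple", "kiwi", "melon", "apple", "kiwi"]

# Counter tallies every item in a single pass over the list.
counts = dict(Counter(fruits))
print(counts)  # {'apple': 3, 'kiwi': 2, 'melon': 1}

# A dictionary comprehension can then filter or reshape the result.
repeated = {fruit: n for fruit, n in counts.items() if n > 1}
print(repeated)  # {'apple': 3, 'kiwi': 2}
```

The dictionary comprehension remains a good fit for the second step, where each key is visited exactly once.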


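## Using `numpy` for loops

The cell below times four ways of summing the integers from 0 to n - 1: an explicit `for` loop, a `while` loop, the built-in `sum`, and `np.sum` over `np.arange`. In the run recorded here, the NumPy version was roughly 40 times faster than the explicit loops and about 5 times faster than the built-in `sum`.
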
In [4]:
from time import perf_counter as timer
import numpy as np

def for_loop(n=100_000_000):
  s = 0
  for i in range(n):
    s += i

  return s

def while_loop(n=100_000_000):
  s = 0
  i = 1
  while i < n:
    s += i
    i += 1

  return s

def sum_fn(n=100_000_000):
  s = sum(range(n))
  return s

def numpy_sum(n=100_000_000):
  s = np.sum(np.arange(n))
  return s

# Test results
start = timer()
result = for_loop()
end = timer()
print(f"For loop sum | Result: {result} | Time: {end - start}")

start = timer()
result = while_loop()
end = timer()
print(f"While loop sum | Result: {result} | Time: {end - start}")

start = timer()
result = sum_fn()
end = timer()
print(f"Python sum | Result: {result} | Time: {end - start}")

start = timer()
result = numpy_sum()
end = timer()
print(f"Numpy sum | Result: {result} | Time: {end - start}")

For loop sum | Result: 4999999950000000 | Time: 15.217263341000006
While loop sum | Result: 4999999950000000 | Time: 14.777114275000002
Python sum | Result: 4999999950000000 | Time: 2.0536612530000014
Numpy sum | Result: 4999999950000000 | Time: 0.38569682400000715
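
A single wall-clock measurement like the ones above can be noisy. As a hedged alternative, the sketch below uses the standard library's `timeit.repeat` to take the best of several runs, and checks both results against the closed form n(n - 1)/2. The smaller `N`, the explicit `dtype=np.int64`, and the function name `python_sum` are assumptions made to keep the example quick and portable; they are not part of the original benchmark.

```python
import timeit

import numpy as np

N = 1_000_000  # assumption: smaller than the original n so this runs quickly


def python_sum(n=N):
    return sum(range(n))


def numpy_sum(n=N):
    # An explicit 64-bit dtype avoids depending on the platform default.
    return int(np.arange(n, dtype=np.int64).sum())


# Both approaches should agree with the closed form n * (n - 1) / 2.
expected = N * (N - 1) // 2
assert python_sum() == numpy_sum() == expected

for fn in (python_sum, numpy_sum):
    # Each of the 3 repeats calls fn 5 times; keep the fastest repeat.
    best = min(timeit.repeat(fn, number=5, repeat=3))
    print(f"{fn.__name__}: {best / 5:.6f} s per call")
```

Absolute timings depend on the machine, so no expected output is shown.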


## Glob

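`glob` is a standard-library module for matching file paths against patterns. It uses Unix shell-style wildcards, as shown below. The `root_dir` parameter used in these examples requires Python 3.10 or later.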

In [5]:
import glob

# * matches any sequence of characters, including none
print("Returns every file ending in '.csv':")
print(glob.glob("*.csv", root_dir="sample_data"))

# ? matches exactly one character
print("\nReturns every file starting with 5 unknown characters and ending in '_test.csv':")
print(glob.glob("?????_test.csv", root_dir="sample_data"))

# [seq] matches any single character listed in seq
print("\nReturns every file starting with 'm' or 'a' and ending in '.csv':")
print(glob.glob("[ma]*.csv", root_dir="sample_data"))

# [!seq] matches any single character not listed in seq
print("\nReturns every file not starting with 'm' or 'a' and ending in '.csv':")
print(glob.glob("[!ma]*.csv", root_dir="sample_data"))

Returns every file ending in '.csv':
['mnist_train_small.csv', 'mnist_test.csv', 'california_housing_test.csv', 'california_housing_train.csv']

Returns every file starting with 5 unknown characters and ending in '_test.csv':
['mnist_test.csv']

Returns every file starting with 'm' or 'a' and ending in '.csv':
['mnist_train_small.csv', 'mnist_test.csv']

Returns every file not starting with 'm' or 'a' and ending in '.csv':
['california_housing_test.csv', 'california_housing_train.csv']


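However, `glob.glob()` builds a list of every matching path in memory, which can be costly for very large directories. `glob.iglob()` takes the same arguments but returns an **iterator** that yields the paths one at a time, so they can be processed without storing them all at once.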

In [6]:
globs = glob.iglob("*.csv", root_dir="sample_data")

print(next(globs))
print(next(globs))
print(next(globs))

mnist_train_small.csv
mnist_test.csv
california_housing_test.csv
