<a href="https://colab.research.google.com/github/binhudas/Data-Application-Programming/blob/main/Week_1_Tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark==3.4.0


Collecting pyspark==3.4.0
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317122 sha256=38e0d9de6a243a66149a948dc2b8aae0d4afb31f092b3f2fd72dcafaa4c3e50b
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


#  Profiling Python Code

## Software profiling is the process of analysing various metrics of a running program to identify its most inefficient parts. It is the last of three questions to ask when considering whether optimising code is worthwhile:

*Testing: Have you tested your code to prove that it works as expected and without errors?*

*Refactoring: Does your code need some cleanup to become more maintainable and Pythonic?*

*Profiling: Have you identified the most inefficient parts of your code?*

## Profiling relies on dynamic analysis, which means running the code, rather than on static code review. Because dynamic profiling involves running a slow piece of code again and again, start with small inputs.

## The profiling tools used will be **Timers**, **Deterministic Profilers** and **Statistical Profilers**


# **Timers**

## **time**: Measure the Execution Time

In [8]:
>>> import time

>>> def sleeper():
...     # Ask the OS task scheduler to suspend the current thread of execution for 1.75 s.
...     # The thread stays dormant, allowing other programs to run.
...     time.sleep(1.75)
...

>>> def spinlock():
...     # In contrast, this function wastes CPU cycles in a busy loop.
...     for _ in range(100_000_000):
...         pass
...

# Call each test function, reading time.perf_counter() (elapsed real time) and
# time.process_time() (CPU time) immediately before and after the call.
>>> for function in sleeper, spinlock:
...     t1 = time.perf_counter(), time.process_time()
...     function()
...     t2 = time.perf_counter(), time.process_time()
...     print(f"{function.__name__}()")
...     print(f" Real time: {t2[0] - t1[0]:.2f} seconds")
...     print(f" CPU time: {t2[1] - t1[1]:.2f} seconds")
...     print()



sleeper()
 Real time: 1.75 seconds
 CPU time: 0.01 seconds

spinlock()
 Real time: 2.31 seconds
 CPU time: 2.31 seconds



The time module is versatile and quick to set up - *good for temporary checks*

Its real-time measurements include external factors such as system load - *gives a realistic impression of runtime in real-world conditions*
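
For temporary checks like the one above, the before-and-after bookkeeping can be wrapped in a small helper. The following is only a sketch: `stopwatch` and the `timing` dictionary are hypothetical names introduced for illustration, not part of the `time` module. It relies only on `time.perf_counter()`, `time.process_time()` and `contextlib.contextmanager` from the standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label):
    """Hypothetical helper: report real and CPU time spent inside a with-block."""
    start_real, start_cpu = time.perf_counter(), time.process_time()
    result = {}  # filled in when the with-block exits
    try:
        yield result
    finally:
        result["real"] = time.perf_counter() - start_real
        result["cpu"] = time.process_time() - start_cpu
        print(f"{label}: real {result['real']:.2f} s, CPU {result['cpu']:.2f} s")


# A short sleep keeps the example quick; real time should far exceed CPU time.
with stopwatch("sleep") as timing:
    time.sleep(0.2)
```

Because the measurement happens in the `finally` clause, the timings are still reported if the timed code raises an exception.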


## **timeit**: Benchmark Short Code Snippets


In [9]:
>>> from timeit import timeit


# This function calculates the nth element of the Fibonacci sequence recursively.
>>> def fib(n):
...     return n if n < 2 else fib(n - 2) + fib(n - 1)
...

>>> iterations = 100

# Ask timeit to measure the total time taken by 100 calls to fib(30).
>>> total_time = timeit("fib(30)", number=iterations, globals=globals())

# Then calculate the average time per call by dividing by the number of iterations.
>>> f"Average time is {total_time / iterations:.2f} seconds"
'Average time is 0.15 seconds'




This repetition minimises the effects of system noise on timing - *reduces the impact of external factors*
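
A related approach, sketched here with assumed settings, is `timeit.repeat()`, which performs several independent timing runs and returns one total per run. Taking the minimum is a common way to estimate the time with the least interference from other processes. The argument `15`, the 10 calls per run and the 5 runs are arbitrary choices to keep the example fast, and a callable is passed instead of a string so that no `globals` argument is needed.

```python
from timeit import repeat


def fib(n):
    return n if n < 2 else fib(n - 2) + fib(n - 1)


calls_per_run = 10  # assumed value, chosen to keep the example quick

# Five independent runs, each timing 10 calls to fib(15).
runs = repeat(lambda: fib(15), number=calls_per_run, repeat=5)

# The fastest run is the one least disturbed by system noise.
best = min(runs) / calls_per_run
print(f"Best average per call: {best:.6f} seconds")
```

The slower runs are usually caused by other activity on the machine rather than by the code itself, which is why the minimum is often preferred over the mean here.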





# Extra Stuff

https://realpython.com/python-profiling/#timeit-benchmark-short-code-snippets
