## Naming Conflicts in Module Imports

Importing modules in Python and R can lead to naming conflicts when an imported function has the same name as one that already exists in your session. This article demonstrates why you should import modules carefully so that these conflicts do not occur.

A common example in Python is using [`from pyspark.sql.functions import *`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html), which will overwrite some built-in Python functions (e.g. `sum()`). Instead, it is good practice to use `from pyspark.sql import functions as F`, where you prefix the functions with `F`, e.g. [`F.sum()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.sum.html).

### Naming variables

When writing code, it is important to give your variables sensible names that are informative but not too long. A good reference on this is the [Clean Code](https://best-practice-and-impact.github.io/qa-of-code-guidance/core_programming.html#clean-code) section from [QA of Code for Analysis and Research](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html). **You should avoid using the names of existing built-in functions for user-defined variables**.
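As a minimal sketch of how this goes wrong in Python, the snippet below reuses `sum` as an accumulator variable, which hides the built-in function of the same name. The variable names (`values`, `running_total`) are illustrative assumptions and not part of the referenced guidance. The original function remains reachable through the standard `builtins` module.

```python
import builtins

values = [1, 2, 3]

# Anti-pattern: "sum" is reused as an accumulator and now hides the built-in.
sum = 0
for value in values:
    sum += value

try:
    sum(values)  # "sum" is now the integer 6, not a function
except TypeError as error:
    message = str(error)
    print(message)  # 'int' object is not callable

# The original function is still available from the builtins module.
print(builtins.sum(values))  # 6

# Preferred: a descriptive name that does not clash with a built-in.
running_total = 0
for value in values:
    running_total += value
print(running_total)  # 6
```

Choosing a name such as `running_total` avoids the clash entirely, so there is no need to reach into `builtins` afterwards.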

### Keywords

Some words are reserved: for instance, in Python you cannot have a variable called `def`, `False` or `lambda`. These are referred to as *keywords*, and if you try to use one as a variable name Python raises a `SyntaxError` before any of the code runs. You can generate a list of these with [`keyword.kwlist`](https://docs.python.org/3/library/keyword.html).

In R, use `?reserved` to get a list of the reserved words.

```python
import keyword
print(keyword.kwlist)
```

```
['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
```


```r
?reserved
```
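Returning to Python, a quick way to check a candidate variable name is `keyword.iskeyword()`, from the same standard library module as `keyword.kwlist`. The sketch below is illustrative only: the list of candidate names is an assumption, and `compile()` is used simply to show that assigning to a keyword is rejected before anything is executed.

```python
import keyword

# Candidate variable names: two keywords and two built-in function names.
candidates = ["lambda", "def", "sum", "round"]
is_reserved = {name: keyword.iskeyword(name) for name in candidates}
print(is_reserved)

# Assigning to a keyword is rejected when the source is compiled,
# before any of it is executed.
try:
    compile("lambda = 1", "<example>", "exec")
    assignment_rejected = False
except SyntaxError:
    assignment_rejected = True
print(assignment_rejected)
```

Names such as `sum` and `round` are not keywords, so Python will not stop you from reassigning them.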

### Built-in functions and module imports in Python

<details>
    
<summary><b>Python Example</b></summary>

You might notice that the Python keyword list is quite short and that some common Python functionality is not listed, for instance, `sum()` or `round()`. This means that it is possible to overwrite these names, which is not good practice and should be avoided.

This can be surprisingly easy to do in PySpark, and can be hard to debug if you do not know the reason for the error.

#### Python Example

First, look at the documentation for `sum`:

```python
help(sum)
```

```
Help on built-in function sum in module builtins:

sum(iterable, start=0, /)
    Return the sum of a 'start' value (default: 0) plus an iterable of numbers

    When the iterable is empty, return the start value.
    This function is intended specifically for use with numeric values and may
    reject non-numeric types.
```



Show that `sum` works with a simple example, adding three integers together:

```python
sum([1, 2, 3])
```

```
6
```

Now import the modules we need to use Spark. The recommended way to do this is `from pyspark.sql import functions as F`, which means that whenever you want to access a function from this module you prefix it with `F`, e.g. `F.sum()`. Sometimes the best way to see why something is recommended is to try a different method and see that it is a bad idea; in this case, import everything from `functions` with `*`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
```

Attempting to sum the integers will now give an error:

```python
try:
    sum([1, 2, 3])
except AttributeError as e:
    print(e)
```

```
'NoneType' object has no attribute '_jvm'
```


To see why this error occurs, take another look at `help(sum)`. The documentation is now different: `sum` refers to the PySpark function rather than the Python built-in.

```python
help(sum)
```

```
Help on function sum in module pyspark.sql.functions:

sum(col)
    Aggregate function: returns the sum of all values in the expression.

    .. versionadded:: 1.3
```



So by importing all the PySpark functions we have overwritten some key Python functionality. Note that this would also apply if you imported individual functions, e.g. `from pyspark.sql.functions import sum`.
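One way to see in advance which names a wildcard import would overwrite is to compare a module's exported names with the contents of `builtins`. The helper below, `wildcard_collisions`, is a hypothetical function written for this illustration, not part of Python or PySpark. It is shown with the standard library modules `os` and `math` so that it runs without Spark installed.

```python
import builtins
import importlib


def wildcard_collisions(module_name):
    """Return names that `from <module_name> import *` would bind
    and that also exist as Python built-ins."""
    module = importlib.import_module(module_name)
    # A wildcard import uses __all__ if the module defines it,
    # otherwise every public name (no leading underscore).
    exported = getattr(module, "__all__", None)
    if exported is None:
        exported = [name for name in dir(module) if not name.startswith("_")]
    return sorted(name for name in exported if hasattr(builtins, name))


# `from os import *` would replace the built-in open() with os.open().
print(wildcard_collisions("os"))

# `from math import *` would replace the built-in pow() with math.pow().
print(wildcard_collisions("math"))
```

If PySpark is installed, calling `wildcard_collisions("pyspark.sql.functions")` should list the affected names in the same way, including `sum`. The exact list depends on the PySpark version.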

You can also overwrite functions with your own variables, often unintentionally. As an example, first start a Spark session:

```python
spark = (SparkSession.builder.master("local[2]")
         .appName("module-imports")
         .getOrCreate())
```

Create a small DataFrame:

```python
sdf = spark.range(5).withColumn("double_id", col("id") * 2)
sdf.show()
```

```
+---+---------+
| id|double_id|
+---+---------+
|  0|        0|
|  1|        2|
|  2|        4|
|  3|        6|
|  4|        8|
+---+---------+
```



Loop through the columns, using `col` as the control variable. This will work, but is not a good idea as it is overwriting `col()` from `functions`:

```python
for col in sdf.columns:
    sdf.select(col).show()
```

```
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

+---------+
|double_id|
+---------+
|        0|
|        2|
|        4|
|        6|
|        8|
+---------+
```



If we try adding another column with `col()` then it will not work, as `col` has now been reassigned to the string `'double_id'`, the last value taken by the loop variable:

```python
try:
    sdf = sdf.withColumn("triple_id", col("id") * 3)
except TypeError as e:
    print(e)
```

```
'str' object is not callable
```


```python
col
```

```
'double_id'
```
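The same behaviour can be reproduced without Spark, because a Python `for` loop does not create a new scope: the loop variable stays bound after the loop ends. In the sketch below, `col` is a hypothetical stand-in for `pyspark.sql.functions.col()` and `column_names` is an illustrative list. Neither touches a real DataFrame.

```python
def col(name):
    """Hypothetical stand-in for pyspark.sql.functions.col()."""
    return f"Column<{name}>"


column_names = ["id", "double_id"]

# Safe: the loop variable has its own name, so col() is untouched.
for column_name in column_names:
    pass
safe_result = col("id")
print(safe_result)  # Column<id>

# Unsafe: the loop variable rebinds "col" and stays bound after the loop.
for col in column_names:
    pass
leftover = col
print(leftover)  # double_id

try:
    col("id")
except TypeError as error:
    message = str(error)
    print(message)  # 'str' object is not callable
```

A descriptive loop variable such as `column_name` avoids the clash even when a function called `col` is in scope.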

Importing the PySpark `functions` as `F` and using [`F.col()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html) solves this problem:

```python
from pyspark.sql import functions as F
sdf = sdf.withColumn("triple_id", F.col("id") * 3)
sdf.show()
```

```
+---+---------+---------+
| id|double_id|triple_id|
+---+---------+---------+
|  0|        0|        0|
|  1|        2|        3|
|  2|        4|        6|
|  3|        6|        9|
|  4|        8|       12|
+---+---------+---------+
```



</details>

### Built-in functions and package imports in R

<details>
<summary><b>R Example</b></summary>

It is advisable to use `::` to call a function from a specific package directly. For instance, there is a `filter` function in both [`stats`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/filter.html) and [`dplyr`](https://dplyr.tidyverse.org/reference/filter.html); you can specify exactly which to use with `dplyr::filter()` or `stats::filter()`.

Note that despite being commonly used as the name of a data frame in R, [`df` is actually a built-in function: the density function of the F distribution](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Fdist.html). As such, it is not recommended to use `df` as a variable name for data frames.

```r
?df
```

</details>

### Further Resources

PySpark Documentation:
- [`pyspark.sql.functions`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html)
- [`F.col()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html)
- [`F.sum()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.sum.html)

sparklyr and tidyverse Documentation:
- [`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html)

Python Documentation:
- [`keyword.kwlist`](https://docs.python.org/3/library/keyword.html)

R Documentation:
- [`stats::df()`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Fdist.html)
- [`stats::filter()`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/filter.html)