# Installing Python Packages on Existing EMR Clusters

## Notebook-Scoped Libraries
Notebook-scoped libraries provide the following benefits:

* Runtime installation – You can import your favorite Python libraries from PyPI repositories and install them on your remote cluster (driver and executors) on the fly when you need them. These libraries are instantly available to your Spark runtime environment. There is no need to restart the notebook session or recreate your cluster.
* Dependency isolation – The libraries you install using EMR Notebooks are isolated to your notebook session and don’t interfere with bootstrapped cluster libraries or libraries installed from other notebook sessions. These notebook-scoped libraries take precedence over bootstrapped libraries. Multiple notebook users can import their preferred version of the library and use it without dependency clashes on the same cluster.
* Portable library environment – The library package installation happens from your notebook file. This allows you to recreate the library environment when you switch the notebook to a different cluster by re-executing the notebook code. At the end of the notebook session, the libraries you install through EMR Notebooks are automatically removed from the hosting EMR cluster.

#### This functionality is available for clusters running EMR release 5.26.0 or later. More information can be found in the [EMR Management Guide](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-installing-libraries-and-kernels.html).
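
As a quick way to apply the release requirement above, the following is a minimal sketch that checks a release label against the 5.26.0 minimum. It assumes labels follow the `emr-X.Y.Z` pattern; the helper names `parse_release_label` and `supports_notebook_scoped_libraries` are hypothetical and not part of any AWS library.

```python
import re

# Minimum EMR release stated above for notebook-scoped libraries.
MIN_RELEASE = (5, 26, 0)


def parse_release_label(label):
    """Turn a label such as 'emr-5.26.0' into a tuple of ints (5, 26, 0)."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label.strip())
    if match is None:
        raise ValueError(f"Unrecognized release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def supports_notebook_scoped_libraries(label):
    """Compare numerically, so 'emr-5.9.0' is correctly older than 'emr-5.26.0'."""
    return parse_release_label(label) >= MIN_RELEASE


for label in ["emr-5.25.0", "emr-5.26.0", "emr-6.5.0"]:
    print(label, supports_notebook_scoped_libraries(label))
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating `5.9.0` as newer than `5.26.0`.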

In [None]:
%%info

## To enable notebook-scoped libraries, we must configure the session to use a virtualenv

In [None]:
%%configure -f
{ "conf":{
          "spark.pyspark.python": "python3",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
         }
}
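
Because the body of the `%%configure` cell must be valid JSON, it can help to generate it rather than type it by hand. The sketch below builds the same four settings shown above and serializes them with the standard `json` module. The function name `build_virtualenv_conf` and its parameters are hypothetical, chosen only for illustration; the printed text would still need to be pasted into a `%%configure -f` cell.

```python
import json


def build_virtualenv_conf(python="python3", virtualenv_bin="/usr/bin/virtualenv"):
    """Return the session configuration from the cell above as a dict.

    All values are kept as strings, including "true", to match the
    configuration shown in this notebook.
    """
    return {
        "conf": {
            "spark.pyspark.python": python,
            "spark.pyspark.virtualenv.enabled": "true",
            "spark.pyspark.virtualenv.type": "native",
            "spark.pyspark.virtualenv.bin.path": virtualenv_bin,
        }
    }


# Serialize to the JSON text expected as the body of the %%configure cell.
payload = json.dumps(build_virtualenv_conf(), indent=2)
print(payload)
```

Serializing through `json.dumps` rules out trailing commas and unbalanced braces, which are easy to introduce when editing the cell manually.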

In [4]:
sc.list_packages()

aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.3)
boto (2.49.0)
boto3 (1.18.63)
botocore (1.21.65)
certifi (2021.10.8)
charset-normalizer (2.0.9)
click (8.0.1)
cycler (0.11.0)
Cython (0.29.24)
docutils (0.14)
idna (3.3)
jmespath (0.10.0)
joblib (1.0.1)
kiwisolver (1.3.2)
lockfile (0.11.0)
lxml (4.6.3)
matplotlib (3.4.3)
mysqlclient (1.4.2)
nltk (3.6.2)
nose (1.3.4)
numpy (1.21.2)
pandas (1.2.5)
Pillow (8.4.0)
pip (9.0.1)
py-dateutil (2.2)
pyparsing (3.0.6)
pystache (0.5.4)
python-daemon (2.2.3)
python-dateutil (2.8.2)
python37-sagemaker-pyspark (1.4.1)
pytz (2021.1)
PyYAML (5.4.1)
regex (2021.8.3)
requests (2.26.0)
s3transfer (0.5.0)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
tqdm (4.62.1)
urllib3 (1.26.7)
wheel (0.29.0)
windmill (1.6)

You are using pip version 9.0.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

## Notice that there is no "translate" package available on our cluster
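
Scanning a long listing by eye is error-prone, so the check can also be scripted. The following is a minimal sketch that parses text in the `name (version)` format shown above and tests for a package. It assumes the listing has already been captured as a string; `parse_package_listing` is a hypothetical helper, and the sample text is a shortened copy of the output above.

```python
import re


def parse_package_listing(text):
    """Map lower-cased package names to versions for lines like 'wheel (0.29.0)'.

    Lines that do not match the pattern, such as the pip upgrade notice,
    are ignored.
    """
    packages = {}
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(\S+) \(([^)]+)\)\s*", line)
        if match:
            packages[match.group(1).lower()] = match.group(2)
    return packages


# Shortened sample of the listing printed above.
sample = """\
tqdm (4.62.1)
urllib3 (1.26.7)
wheel (0.29.0)
windmill (1.6)

You are using pip version 9.0.1, however version 21.3.1 is available.
"""

installed = parse_package_listing(sample)
print("translate" in installed)  # False: translate is not in the listing
print(installed["wheel"])        # 0.29.0
```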

In [None]:
# Install translate from given PyPI repository
sc.install_pypi_package("translate", "https://pypi.org/simple") 

In [6]:
from translate import Translator
from pyspark.sql.functions import udf, col

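# @udf without an explicit return type produces a string column
@udf
def translate_to_german(sentence):
    # Translate a single sentence into German ("de")
    translator = Translator(to_lang="de")
    return translator.translate(sentence)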

In [7]:
columns = ["TaskID", "Sentence"]
data = [("1", "This is a pen"),
        ("2", "This is a chair")]

df = spark.createDataFrame(data=data, schema=columns)

df.show(truncate=False)

+------+---------------+
|TaskID|Sentence       |
+------+---------------+
|1     |This is a pen  |
|2     |This is a chair|
+------+---------------+

In [8]:
df.select(col("TaskID"), \
    translate_to_german(col("Sentence")).alias("Sentence") ) \
   .show(truncate=False)

+------+-------------------+
|TaskID|Sentence           |
+------+-------------------+
|1     |Das ist ein Gehege.|
|2     |Das ist ein Stuhl  |
+------+-------------------+