# Getting started with machine learning pipelines in PySpark
  
PySpark ships with built-in machine learning routines, along with utilities for assembling them into complete machine learning pipelines. This chapter introduces both.
  
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   
      /_/
```


## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Function</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pyspark.SparkContext()</td>
    <td>Creates a SparkContext, the entry point to core Spark functionality.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>pyspark.SparkContext().version</td>
    <td>Returns the version of Spark that the application is running on.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>pyspark.SparkContext().stop()</td>
    <td>Stops the SparkContext, terminating the connection to the Spark cluster.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>pyspark.sql.SparkSession</td>
    <td>The class that serves as the entry point to Spark SQL and DataFrame functionality.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>pyspark.sql.SparkSession.builder.appName()</td>
    <td>Sets the name of the application for the SparkSession.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate()</td>
    <td>Returns an existing SparkSession or creates a new one.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pyspark.sql.SparkSession.read</td>
    <td>Returns a DataFrameReader for reading data into a DataFrame.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pyspark.sql.SparkSession.read.format()</td>
    <td>Sets the input data source format (for example, 'csv') for the read.</td>
  </tr>
  <tr>
    <td>9</td>
    <td>pyspark.sql.SparkSession.read.format().option('inferSchema', 'True')</td>
    <td>Sets the option to infer the column types of the DataFrame from the data.</td>
  </tr>
  <tr>
    <td>10</td>
    <td>pyspark.sql.SparkSession.read.format().option('header', 'True')</td>
    <td>Sets the option to treat the first row as the column names of the DataFrame.</td>
  </tr>
  <tr>
    <td>11</td>
    <td>pyspark.sql.SparkSession.read.format().load()</td>
    <td>Loads data into a DataFrame using the specified format and options.</td>
  </tr>
  <tr>
    <td>12</td>
    <td>pyspark.sql.DataFrame.createOrReplaceTempView()</td>
    <td>Registers a DataFrame as a temporary view, replacing any existing view with the same name.</td>
  </tr>
  <tr>
    <td>13</td>
    <td>pyspark.sql.SparkSession.catalog.listTables()</td>
    <td>Returns a list of the tables and temporary views in the catalog.</td>
  </tr>
  <tr>
    <td>14</td>
    <td>pyspark.sql.SparkSession.sql()</td>
    <td>Executes a SQL query and returns the result as a DataFrame.</td>
  </tr>
  <tr>
    <td>15</td>
    <td>pyspark.sql.SparkSession.sql().toPandas()</td>
    <td>Converts a Spark DataFrame to a pandas DataFrame, collecting all rows to the driver.</td>
  </tr>
  <tr>
    <td>16</td>
    <td>pyspark.sql.SparkSession.sql().toPandas().head()</td>
    <td>Returns the first n rows of a pandas DataFrame (5 by default).</td>
  </tr>
  <tr>
    <td>17</td>
    <td>pyspark.sql.SparkSession.createDataFrame()</td>
    <td>Creates a Spark DataFrame from a pandas DataFrame, an RDD, or a list.</td>
  </tr>
  <tr>
    <td>18</td>
    <td>pyspark.sql.SparkSession.read.csv()</td>
    <td>Reads a CSV file and returns a DataFrame.</td>
  </tr>
</table>

  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: scikit-learn  
Version: 1.3.0  
Summary: A set of python modules for machine learning and data mining  
  
Name: hyperopt  
Version: 0.2.7  
Summary: Distributed Asynchronous Hyperparameter Optimization  
  
Name: TPOT  
Version: 0.12.1  
Summary: Tree-based Pipeline Optimization Tool  
  
Name: pyspark  
Version: 3.4.1  
Summary: Apache Spark Python API  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
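`python3.11 -m IPython` : Starts the IPython interactive shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline script in the background, detached from the terminal, with its standard output written to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines and starting line number of a user-defined function (here, `test`).  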
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in Matplotlib styles:
  
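```python
import numpy as np
import matplotlib.pyplot as plt

# Sample cubic curve to draw under every style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
for i, style in enumerate(available):
    # Apply each style only while its subplot is created, drawn, and titled
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
        ax.set_title(style)
```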
  

```python
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import pyspark                      # Apache Spark:             Cluster Computing

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)
```