# Model tuning and selection in PySpark
  
In this last chapter, you'll apply what you've learned to create a model that predicts which flights will be delayed.
  
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   
      /_/
```


## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pyspark.SparkContext()</td>
    <td>Create a new SparkContext instance, the entry point to using Spark functionality.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>pyspark.SparkContext().version</td>
    <td>Retrieve the version of Spark that the SparkContext is running.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>pyspark.SparkContext().stop()</td>
    <td>Stop the SparkContext, releasing associated resources.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>pyspark.sql.SparkSession</td>
    <td>Create a new SparkSession instance, offering an entry point for DataFrame and SQL functionality.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate()</td>
    <td>Retrieve an existing SparkSession or create a new one if none exists.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>pyspark.sql.SparkSession.builder.appName</td>
    <td>Set the application name for the SparkSession.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate</td>
    <td>Get an existing SparkSession or create a new one if none exists.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pyspark.sql.SparkSession.read</td>
    <td>Create a DataFrameReader for reading data in various formats.</td>
  </tr>
  <tr>
    <td>9</td>
    <td>pyspark.sql.SparkSession.read.format</td>
    <td>Specify the input data format when reading data using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>10</td>
    <td>pyspark.sql.SparkSession.read.format.option('inferSchema', 'True')</td>
    <td>Specify options, such as inferring schema from data, when reading data using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>11</td>
    <td>pyspark.sql.SparkSession.read.format.option('header', 'True')</td>
    <td>Specify options, such as reading headers from data, when reading data using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>12</td>
    <td>pyspark.sql.SparkSession.read.format.load()</td>
    <td>Load data into a DataFrame based on specified options using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>13</td>
    <td>pyspark.sql.DataFrame.createOrReplaceTempView</td>
    <td>Create or replace a temporary view of a DataFrame.</td>
  </tr>
  <tr>
    <td>14</td>
    <td>pyspark.sql.SparkSession.catalog.listTables()</td>
    <td>List the tables available in the catalog.</td>
  </tr>
  <tr>
    <td>15</td>
    <td>pyspark.sql.SparkSession.sql()</td>
    <td>Execute a SQL query and return the result as a DataFrame.</td>
  </tr>
  <tr>
    <td>16</td>
    <td>pyspark.sql.SparkSession.sql().toPandas()</td>
    <td>Convert the result of a SQL query to a Pandas DataFrame.</td>
  </tr>
  <tr>
    <td>17</td>
    <td>pyspark.sql.SparkSession.sql().toPandas().head()</td>
    <td>Retrieve the first few rows of a Pandas DataFrame obtained from a SQL query result.</td>
  </tr>
  <tr>
    <td>18</td>
    <td>pyspark.sql.SparkSession.createDataFrame()</td>
    <td>Create a DataFrame from a list or RDD.</td>
  </tr>
  <tr>
    <td>19</td>
    <td>pyspark.sql.SparkSession.read.csv()</td>
    <td>Read data from a CSV file and load it into a DataFrame.</td>
  </tr>
  <tr>
    <td>20</td>
    <td>pyspark.sql.SparkSession.table</td>
    <td>Create a DataFrame representing a table in the catalog.</td>
  </tr>
  <tr>
    <td>21</td>
    <td>pyspark.sql.DataFrame.filter</td>
    <td>Filter rows of a DataFrame based on a condition.</td>
  </tr>
  <tr>
    <td>22</td>
    <td>pyspark.sql.DataFrame.select</td>
    <td>Select columns from a DataFrame.</td>
  </tr>
  <tr>
    <td>23</td>
    <td>pyspark.sql.DataFrame.selectExpr</td>
    <td>Select columns using SQL expressions from a DataFrame.</td>
  </tr>
  <tr>
    <td>24</td>
    <td>pyspark.sql.DataFrame.printSchema</td>
    <td>Print the schema of a DataFrame.</td>
  </tr>
  <tr>
    <td>25</td>
    <td>pyspark.sql.DataFrame.withColumn</td>
    <td>Add or replace a column in a DataFrame.</td>
  </tr>
  <tr>
    <td>26</td>
    <td>pyspark.sql.types.IntegerType</td>
    <td>Create an IntegerType column type for use in DataFrame schema.</td>
  </tr>
  <tr>
    <td>27</td>
    <td>pyspark.sql.functions.col</td>
    <td>Reference a column in a DataFrame based on its name.</td>
  </tr>
  <tr>
    <td>28</td>
    <td>pyspark.sql.DataFrame.groupBy</td>
    <td>Group rows in a DataFrame based on specified columns.</td>
  </tr>
  <tr>
    <td>29</td>
    <td>pyspark.sql.GroupedData.min</td>
    <td>Compute the minimum value of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>30</td>
    <td>pyspark.sql.GroupedData.max</td>
    <td>Compute the maximum value of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>31</td>
    <td>pyspark.sql.GroupedData.avg</td>
    <td>Compute the average value of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>32</td>
    <td>pyspark.sql.GroupedData.sum</td>
    <td>Compute the sum of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>33</td>
    <td>pyspark.sql.GroupedData.count</td>
    <td>Compute the count of rows for grouped columns.</td>
  </tr>
  <tr>
    <td>34</td>
    <td>pyspark.sql.functions.stddev</td>
    <td>Compute the standard deviation of specified columns in a DataFrame.</td>
  </tr>
  <tr>
    <td>35</td>
    <td>pyspark.sql.DataFrame.withColumnRenamed</td>
    <td>Rename a column in a DataFrame.</td>
  </tr>
  <tr>
    <td>36</td>
    <td>pyspark.sql.DataFrame.join</td>
    <td>Join two DataFrames based on specified columns.</td>
  </tr>
  <tr>
    <td>37</td>
    <td>pyspark.ml.feature.StringIndexer</td>
    <td>Convert categorical strings to numerical indices using StringIndexer.</td>
  </tr>
  <tr>
    <td>38</td>
    <td>pyspark.ml.feature.OneHotEncoder</td>
    <td>Encode categorical indices as one-hot vectors using OneHotEncoder.</td>
  </tr>
  <tr>
    <td>39</td>
    <td>pyspark.ml.feature.VectorAssembler</td>
    <td>Combine multiple columns into a single feature vector using VectorAssembler.</td>
  </tr>
  <tr>
    <td>40</td>
    <td>pyspark.ml.Pipeline</td>
    <td>Construct an ML pipeline by assembling a sequence of transformers and an estimator.</td>
  </tr>
  <tr>
    <td>41</td>
    <td>pyspark.sql.DataFrame.randomSplit</td>
    <td>Randomly split a DataFrame into training and testing datasets.</td>
  </tr>
</table>

  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: pyspark  
Version: 3.4.1  
Summary: Apache Spark Python API  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Starts an interactive IPython shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background and writes its output to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of the user-defined function `test`.  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles:
  
```python
import matplotlib.pyplot as plt
import numpy as np

# Sample cubic curve to draw under every style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2

fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
for i, style in enumerate(available):
    # Apply each style only while its subplot is created and drawn
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
    ax.set_title(style)
```
  

```python
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import pyspark                      # Apache Spark:             Cluster Computing

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)
```

### What is logistic regression?
  
The model you'll be fitting in this chapter is called a logistic regression. This model is very similar to a linear regression, but instead of predicting a numeric variable, it predicts the probability (between 0 and 1) of an event.
  
To use this as a classification algorithm, all you have to do is assign a cutoff point to these probabilities. If the predicted probability is above the cutoff point, you classify that observation as a 'yes' (in this case, the flight being late); if it's below, you classify it as a 'no'.
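
The cutoff idea can be sketched in a few lines of plain Python. This is only an illustration, not the PySpark model used in the chapter: the `sigmoid` and `classify` helpers, the scores, and the cutoff values are all made up for the example, and a probability exactly equal to the cutoff is assumed to fall on the 'no' side.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cutoff=0.5):
    """Return 'yes' (flight is late) when the probability is above the cutoff."""
    # Assumption: a probability exactly at the cutoff is labelled 'no'.
    return "yes" if probability > cutoff else "no"


# Hypothetical linear scores for three flights (illustrative values, not real data)
scores = [-2.0, 0.5, 3.0]
probabilities = [sigmoid(s) for s in scores]

# The same probabilities lead to different labels under different cutoffs
labels_default = [classify(p) for p in probabilities]
labels_strict = [classify(p, cutoff=0.7) for p in probabilities]

print(labels_default)
print(labels_strict)
```

Raising the cutoff makes the classifier more reluctant to flag a flight as late, which is why the middle flight changes label between the two runs.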
  
You'll tune this model by testing different values for several hyperparameters. A hyperparameter is a value in the model that's not estimated from the data, but rather is supplied by the user to maximize performance. For this course it's not necessary to understand the mathematics behind all of these values; what's important is that you'll try out a few different choices and pick the best one.
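
As a rough sketch of "try a few choices and pick the best one", the plain-Python example below fits a tiny one-feature logistic regression for several values of a regularization strength and keeps the value with the lowest validation loss. Everything here is an assumption made for illustration: the toy data, the `fit` and `log_loss` helpers, the candidate values, and the choice of log loss as the metric. In PySpark the same idea is usually expressed with `pyspark.ml.tuning.ParamGridBuilder` and `CrossValidator` rather than a hand-written loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, reg, lr=0.1, steps=500):
    """Fit weight w and bias b by gradient descent on L2-penalised log loss.

    `reg` is the hyperparameter: it is supplied by the user, not learned.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + reg * w
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def log_loss(xs, ys, w, b):
    """Average negative log-likelihood; lower is better."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


# Made-up toy data: x is a scaled feature, y is 1 if the flight was late
train_x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
train_y = [0, 0, 0, 1, 0, 1, 1, 1]
valid_x = [-1.2, -0.3, 0.4, 1.3]
valid_y = [0, 0, 1, 1]

# Candidate hyperparameter values to try (hypothetical choices)
candidates = [0.0, 0.1, 1.0]
results = {}
for reg in candidates:
    w, b = fit(train_x, train_y, reg)
    results[reg] = log_loss(valid_x, valid_y, w, b)

# Keep the value that performs best on data the model was not fitted on
best_reg = min(results, key=results.get)
print(results, best_reg)
```

The candidates are scored on a held-out validation set rather than on the training data, so the chosen value reflects how the model generalizes instead of how closely it memorizes.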
  
---
  
Why do you supply hyperparameters?
  
**Possible Answers**
  
- [ ] They explain information about the data.
- [x] They improve model performance.
- [ ] They improve model fitting speed.
  
Great job! You supply hyperparameters to optimize your model.