#Model Training, Selection, and Evaluation

This lab is structured into three notebooks that cover the most important topics related to model training, selection, and evaluation. If you are new to Python and/or data science, we highly recommend going through the lab notebooks in the suggested order. Following the suggested order will help you get the most out of the allocated time.

The lab notebooks are:  
* <a href="$./02 Basic Regression with Azure Databricks">Basic Regression with Azure Databricks</a> - explores the basics of creating a regression model. The notebook covers simple linear regression, learning curves and model tuning, cost functions, linear regression with multiple features, and cross validation.
* <a href="$./03 Classification with Azure Databricks">Create a classification model with Azure Databricks</a>  - explores the process of creating a classification model.
* <a href="$./04 Advanced Regression with Azure Databricks">Advanced Regression with Azure Databricks</a>  - explores the more advanced aspects of creating a regression model.

###Import lab data

Note: You should run the remaining cells of this notebook only if you haven't already imported the data.

First, you will download a copy of the used cars data set. 

You can download this from here:
[UsedCars.csv](https://databricksdemostore.blob.core.windows.net/data/02.02/UsedCars.csv)

Second, you will upload this CSV file to your Azure Databricks Workspace by following these steps.

Open a new browser tab and navigate to your workspace.

Navigate to the Data tab and then select + to the right of Tables to create a new table. 

![img](https://databricksdemostore.blob.core.windows.net/images/02/data-tab.png)

Leave the Data source set to Upload File. 

![img](https://databricksdemostore.blob.core.windows.net/images/02/create-new-table-ui-data-source.png)

Select browse and then choose your copy of UsedCars.csv

![img](https://databricksdemostore.blob.core.windows.net/images/02/create-new-table-ui-file.png)

Your file will be uploaded. Select Create Table with UI.

![img](https://databricksdemostore.blob.core.windows.net/images/02/create-new-table-ui-file-ready.png)

In the Cluster drop-down, select an available cluster, and then choose Preview Table.

Then, in the Specify Table Attributes section, change the table name to **"usedcars_#####"** (replace ##### with characters that make the name unique within your environment) and check the box for "First row is header". Your preview should look as follows. Observe that the table has the correct header names and that all columns default to type STRING.

![img](https://databricksdemostore.blob.core.windows.net/images/02/create-new-table-ui-table-attributes.png)

Select Create Table. When the Table:usedcars screen appears showing your new table, your data has been loaded into a table and you can continue with the next steps in this notebook.

Run the following cell to create a new DataFrame in which all of the numeric columns have the correct data type.

Be sure to replace the table name "usedcars\_#####" with the unique name you created while importing the data.

In [7]:
df_typed = spark.sql("SELECT cast(Price as int), cast(Age as int), cast(KM as int), FuelType, cast(HP as int), cast(MetColor as int), cast(Automatic as int), cast(CC as int), cast(Doors as int), cast(Weight as int) FROM usedcars_#####")
df_typed
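As a hedged alternative to typing the long query by hand, the SELECT statement can be assembled from a list of column names. This is only a sketch: the helper names `build_typed_query` and `load_typed` and the table name `usedcars_12345` are hypothetical, and the column order differs slightly from the cell above because `FuelType` is listed last.

```python
# Columns that should be cast to int (taken from the query above).
NUMERIC_COLUMNS = ["Price", "Age", "KM", "HP", "MetColor",
                   "Automatic", "CC", "Doors", "Weight"]


def build_typed_query(table_name, numeric_columns=NUMERIC_COLUMNS,
                      string_columns=("FuelType",)):
    """Build a SELECT that casts each numeric column to int and keeps its name."""
    casts = [f"cast({col} as int) as {col}" for col in numeric_columns]
    select_list = ", ".join(casts + list(string_columns))
    return f"SELECT {select_list} FROM {table_name}"


def load_typed(spark, table_name):
    """Run the generated query; `spark` is the session a notebook provides."""
    return spark.sql(build_typed_query(table_name))


# Hypothetical table name; substitute the unique name you chose during import.
print(build_typed_query("usedcars_12345"))
```

In a notebook, `df_typed = load_typed(spark, "usedcars_12345")` would then stand in for the hand-written query.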

Let's clean up the FuelType values in our DataFrame. We want to perform these transformations:
- "Diesel" to "diesel"
- "Petrol" to "petrol"
- "CompressedNaturalGas" to "cng"
- "methane" to "cng"
- "CNG" to "cng"

We can use the replace() method exposed through the DataFrame's na property to describe and apply our transformation in a way that will work at scale.

In [9]:
df_cleaned_fueltype = df_typed.na.replace(["Diesel","Petrol","CompressedNaturalGas","methane","CNG"],["diesel","petrol","cng","cng","cng"],"FuelType")
display(df_cleaned_fueltype.select("FuelType").distinct())
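The two parallel lists in the cell above can be hard to keep aligned. A minimal sketch of the same mapping written as a dictionary follows, applied here to hypothetical sample values in plain Python so the effect is easy to inspect. The names `FUEL_TYPE_MAP` and `normalize_fuel_type` are illustrative and not part of the lab.

```python
# Mapping from the raw FuelType spellings to their normalized form.
FUEL_TYPE_MAP = {
    "Diesel": "diesel",
    "Petrol": "petrol",
    "CompressedNaturalGas": "cng",
    "methane": "cng",
    "CNG": "cng",
}


def normalize_fuel_type(value):
    """Return the normalized fuel type; unknown values and None pass through."""
    return FUEL_TYPE_MAP.get(value, value)


# Hypothetical sample values, not taken from UsedCars.csv.
sample = ["Diesel", "Petrol", "CNG", "methane", "CompressedNaturalGas", "petrol", None]
normalized = [normalize_fuel_type(v) for v in sample]
print(sorted({v for v in normalized if v is not None}))
# prints ['cng', 'diesel', 'petrol']

# Assumption: recent PySpark versions also accept the dictionary form directly,
# e.g. df_typed.na.replace(FUEL_TYPE_MAP, subset=["FuelType"])
```

Keeping each raw value next to its replacement makes it harder to pair the wrong entries when the list of spellings grows.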

Now for the last bit of cleanup: let's address the rows that have missing (null) values. Recall from our previous exploration that the columns Price, Age and KM each had rows with missing values.

You typically handle missing values either by deleting the rows that have them or by filling them in with a suitable computed value (sometimes called data imputation). While how you handle missing values depends on the situation, in our case we just want to delete the rows that have missing values.

In [11]:
df_cleaned_of_nulls = df_cleaned_fueltype.na.drop("any",subset=["Price", "Age", "KM"])
display(df_cleaned_of_nulls.describe())
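To make the `"any"` and `subset` arguments concrete, here is a small plain-Python sketch of the row-filtering rule. It is not how Spark implements `na.drop`; the function `drop_rows_with_nulls` and the sample rows are hypothetical and only mirror the semantics.

```python
def drop_rows_with_nulls(rows, subset, how="any"):
    """Drop rows whose `subset` columns contain nulls.

    how="any": drop a row if at least one subset column is None.
    how="all": drop a row only if every subset column is None.
    """
    check = any if how == "any" else all
    return [row for row in rows
            if not check(row.get(col) is None for col in subset)]


# Hypothetical rows, not taken from UsedCars.csv.
rows = [
    {"Price": 13500, "Age": 23, "KM": 46986, "FuelType": "diesel"},
    {"Price": None, "Age": 24, "KM": 41711, "FuelType": "diesel"},
    {"Price": None, "Age": None, "KM": None, "FuelType": "petrol"},
    {"Price": 14950, "Age": 26, "KM": 48000, "FuelType": None},
]
subset = ["Price", "Age", "KM"]

print(len(drop_rows_with_nulls(rows, subset, how="any")))  # prints 2
print(len(drop_rows_with_nulls(rows, subset, how="all")))  # prints 3
```

The last sample row survives in both cases because its only null is in `FuelType`, which is outside the subset.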

Next, we want to save this prepared dataset as a global table so that the cleansed data is easy to reuse, for example for further data understanding efforts or for modeling, irrespective of which Databricks cluster we end up using later on.

To do so, execute the following cell. 

Be sure to update the table name "usedcars\_clean\_#####" (replace ##### with characters that make the name unique within your environment).

In [13]:
df_cleaned_of_nulls.write.mode("overwrite").saveAsTable("usedcars_clean_#####")