<a href="https://colab.research.google.com/github/ccwilliamsut/machine_learning/blob/master/MLAB_02_Scrubbing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrubbing Data

Most of a data scientist's time is spent on working with data. This includes performing all of the tasks that we have done previously as well as the ones which we will now cover:
- One-hot encoding
- Deleting (and sometimes creating new) features
- Fixing missing data
- Dealing with outlier data

## A. Setup Environment

To begin, we will re-import all of our libraries and reapply the small changes that we have made thus far. Repeating this code here makes it easier and faster for you to change your code and see the results, because you do not have to wait for all of the earlier code to execute first.

> **NOTE:** Because we will be looking at a heatmap several times, I have created a function that makes this much easier. By calling this function and supplying it with a dataframe, we do not have to rewrite the plotting code each time. To call it from here on out, we simply issue the command **```make_heatmap(df)```**.

In [0]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP ENVIRONMENT <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
from IPython.display import display
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

df = pd.read_csv(url)
# Alternative: load a local copy instead (the path must be a quoted string)
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')


# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP DATA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# --------------------------------------------- RENAME FEATURES ---------------------------------------------
df.rename(columns = {'lattitude':'latitude', 't_rooms':'total_rooms'}, inplace=True)



# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FUNCTIONS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
def make_heatmap(dataframe):
  corr = dataframe.corr()
  plt.figure(figsize=(15 ,9))
  sns.heatmap(corr, 
              annot=True, 
              vmin = -1, 
              vmax = 1, 
              center = 0, 
              fmt = '.1g', 
              cmap = 'coolwarm'
              )
  plt.show()
  plt.close()

## B. One-Hot Encoding
One of our features, **```ocean_proximity```**, is *categorical*. Because we believe this feature to be useful in helping us predict housing prices, we want to keep it, but **we must change the categorical values into numerical ones** for modeling purposes. 

To do this, we employ a procedure known as **one-hot encoding** which creates a new feature for each of the known values and then assigns a **1 or 0** depending upon whether or not each row contains that value.

In [0]:
# Make a copy of our dataset to make things easier below
dfx = df.copy()

# Perform one-hot encoding on 'ocean_proximity'
# Show the initial state 
print('The original state of our dataset:')
display(dfx.sample(5))

# Change our target feature into a categorical data type
dfx['ocean_proximity'] = pd.Categorical(dfx['ocean_proximity'])

# Create a temp dataframe to hold our new "dummy values"
dfx_dummies = pd.get_dummies(dfx['ocean_proximity'], 
                            drop_first = False
                            )

# Drop the target feature from the original dataframe
dfx.drop(['ocean_proximity'], 
        axis = 1, 
        inplace = True
        )

# Join our temp dataframe to our original to create a single dataframe (concatenate dataframes)
dfx = pd.concat([dfx, dfx_dummies], 
               axis=1
               )

# Show the new state after one-hot encoding
print('\n\nThe state of our dataset after one-hot encoding:')
display(dfx.sample(5))
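
To see the effect of one-hot encoding without scrolling through the full dataset, here is a minimal sketch on a tiny, made-up dataframe. The `toy` data and its values are assumptions for illustration only; the category labels merely mimic those found in `ocean_proximity`.

```python
import pandas as pd

# Hypothetical toy data; the category labels mimic 'ocean_proximity' values
toy = pd.DataFrame({
    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND'],
    'median_income': [3.2, 5.1, 2.8, 4.0],
})

# dtype=int forces 0/1 output (recent pandas versions default to True/False)
dummies = pd.get_dummies(toy['ocean_proximity'], dtype=int)

# Replace the categorical column with its one-hot columns
encoded = pd.concat([toy.drop(columns=['ocean_proximity']), dummies], axis=1)
print(encoded)
```

Each row ends up with exactly one `1` across the new columns, which is where the name "one-hot" comes from.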

## C. Removing Unwanted Features
When working with many datasets, there are **likely to be features that are not useful for the goal at hand**. In our case, we are trying to predict the median house value, so we must look at the features we have and decide if there are any that will not help us in that goal.

Let's first take a look at the correlations between our target variable and the other variables to get an idea of which features will be useful in predicting a price. We will use our brand new **```make_heatmap```** function to build it.

>**A note on multicollinearity**:
> This term refers to two or more independent variables (features) being perfectly, or nearly perfectly, correlated with one another. For example, as light increases, darkness decreases, so these two things are closely correlated.
> This can create problems in **linear regression** models (see this [link](https://www.quora.com/Is-multicollinearity-a-problem-with-gradient-boosted-trees) and this [link](https://medium.com/future-vision/collinearity-what-it-means-why-its-bad-and-how-does-it-affect-other-models-94e1db984168) for more information). Therefore, **if we are using linear regression, we want to keep only one feature from any group of features that exhibit multicollinearity with one another**.

In [0]:
# Perform all the previous actions
dfx = df.copy()
dfx['ocean_proximity'] = pd.Categorical(dfx['ocean_proximity'])
dfx_dummies = pd.get_dummies(dfx['ocean_proximity'], drop_first = True)
dfx.drop(['ocean_proximity'], axis = 1, inplace = True)
dfx = pd.concat([dfx, dfx_dummies], axis=1)

# Call on your new function to create a heatmap of the current dataframe
make_heatmap(dfx)
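
Reading strongly correlated pairs off a heatmap by eye can be error-prone. As a complement, the sketch below lists the pairs numerically. The helper name `high_corr_pairs`, the `0.9` threshold, and the synthetic `demo` data are all assumptions made for illustration; they are not part of the original workflow.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(dataframe, threshold=0.9):
    """Return feature pairs whose absolute correlation is at or above `threshold`."""
    corr = dataframe.corr().abs()
    # Keep only the upper triangle so each pair appears once (and the diagonal is skipped)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Synthetic demo data: 'b' is nearly a multiple of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})
result = high_corr_pairs(demo)
print(result)
```

Calling `high_corr_pairs(dfx)` on the housing dataframe should give a short list of candidate pairs to compare against the heatmap.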

Based upon our goal, the following features can probably be dropped:
- **```households / total_bedrooms```** - Exhibiting multicollinearity
- **```total_rooms / population```** - Exhibiting multicollinearity
- **```proximity_to_store```** - Remember from our graphical analysis that this appears to be uniformly distributed and, thus, of little use for predictive purposes
- **```ISLAND```** - This feature appears to have almost no correlation with anything across the board. Investigate this further, but it looks like we might be able to drop it as well.

In [0]:
# Compare possible multicollinearity candidates
dfx_pair = dfx[['households', 'total_bedrooms', 'total_rooms', 'population']].copy()
sns.pairplot(data = dfx_pair, x_vars = dfx_pair.columns, y_vars = dfx_pair.columns, dropna = True); plt.show(); plt.close()

# Look at the distribution of questionable features
sns.distplot(dfx['proximity_to_store'], bins = 20); plt.show(); plt.close()

# Count the 0 and 1 values in 'ISLAND' to see how many samples it actually describes
print(dfx['ISLAND'].value_counts())

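Having double-checked the data, it appears safe to eliminate:
- **```households```**
- **```proximity_to_store```**

**```total_rooms```** also looks like a candidate, but the code below leaves it in place, and later cells still reference it.

After eliminating these features, we can re-check to see if we have any further problems.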

In [0]:
# Drop unwanted / unnecessary features
dfx = dfx.drop(['households', 
                'proximity_to_store'], 
               axis = 1
               )

# Also drop this feature from our temp dataframe for one last comparison
dfx_pair = dfx_pair.drop(['households'], axis = 1)

# Verify that we do not need to eliminate any more features
make_heatmap(dfx)
sns.pairplot(data = dfx, x_vars = dfx_pair.columns, y_vars = dfx_pair.columns, dropna = True); plt.show(); plt.close()

## D. Working with Missing Data
When first analyzing our data, we discovered that there were some "NaN" values (null data) in the **```total_bedrooms```** feature. Before proceeding to build our model, we must first figure out what to do with the records that contain those missing values.

Let's quickly count the missing values again.

In [0]:
# Get a count of "NaN" or missing values by feature
print('Count of missing values by feature:\n')
display(dfx.isnull().sum(axis=0))


Now that we have located missing data, we have *at least* four choices:
1. Fill in missing values with the **mode**
2. Fill in missing values with the **median**
3. **Delete** samples with missing values
4. **Delete** the feature(s) containing the missing data

**Options 1 & 2** will depend upon the type of data being manipulated. 
- For **categorical data**, it is generally okay to use the **mode** as this represents the most frequently encountered value.
- For **continuous data**, it is generally okay to use the **median** so as to avoid the influence of any outlier data (the mean, by contrast, is pulled toward outliers)

**Option 3** (deleting samples) is considered the **last resort** because it causes us to lose valuable data that could help our model perform better. If the number of samples to be deleted is insignificant compared to the overall size of the dataset, then this might not be a problem. Often, however, our data is limited and every sample counts, so we will want to keep every bit of it that we can.

**Option 4** (deleting features with missing data) is also worth considering **if and only if the feature is not consequential for the model**. In other words, if we do not feel that the data in this feature will help us to train a more accurate prediction model, then we can just delete the feature altogether and be done with it.
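
The four options above can be sketched on a small, made-up dataframe. The `toy` data below is hypothetical and exists only to show which pandas call corresponds to each option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    'total_bedrooms': [2.0, 3.0, np.nan, 3.0, 40.0],
    'ocean_proximity': ['INLAND', None, 'INLAND', 'NEAR BAY', 'INLAND'],
})

# Option 1: fill a categorical feature with its mode (most frequent value)
mode_filled = toy['ocean_proximity'].fillna(toy['ocean_proximity'].mode()[0])

# Option 2: fill a continuous feature with its median (NaN is skipped when computing it)
median_filled = toy['total_bedrooms'].fillna(toy['total_bedrooms'].median())

# Option 3: delete every sample (row) that has a missing value
rows_dropped = toy.dropna(axis=0)

# Option 4: delete every feature (column) that has a missing value
cols_dropped = toy.dropna(axis=1)

print(mode_filled.tolist())
print(median_filled.tolist())
print(rows_dropped.shape, cols_dropped.shape)
```

Note that the median of the known bedroom values is 3.0, while their mean is 12.0 because of the single value of 40. This is why the median is usually the safer fill value for continuous data.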

> **NOTE:** There are actually a host of different methods for dealing with missing data, and they range greatly in their complexity. **The method one chooses will ultimately depend upon the type of data being used and the goal(s) of the model.** For the purposes of this lesson, however, we have covered 4 possibilities simply to demonstrate that there are easy ways to manage datasets with missing values. For more information on dealing with missing data, please see the following links:
- [Towards Data Science: Compensating for missing values](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779)
- [Geeks for Geeks: Excellent yet simple tutorial for working with missing data](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/)
- [Pandas Documentation: Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Machine Learning Mastery: Detailed options for dealing with missing data](https://machinelearningmastery.com/handle-missing-data-python/)

In this case, we have 207 values missing out of 20,640, which represents about 1% of our data. For the purposes of this exercise, we will elect to keep that data and fill those missing values.

> **Question:** Which method should we use to fill in the missing data?

In [0]:
# Let's look at the current state of our data prior to changes
print('-------------------- BEFORE CHANGES --------------------')
print('\nCount of null values per feature:')
print(dfx.isnull().sum(axis=0))


# ---------------------------- APPLY CHANGES ----------------------------
# First, create a variable to hold the median value
tb_med = dfx['total_bedrooms'].median(axis=0)

# Next, we want to fill the "NaN" values with the median value of the column (existing "NaN" values are skipped when the median is calculated)
dfx['total_bedrooms'] = dfx['total_bedrooms'].fillna(value = tb_med)

# Change the datatype of 'total_bedrooms' to 'int' (since we cannot really have partial bedrooms)
dfx['total_bedrooms'] = dfx['total_bedrooms'].astype(int)


# ---------------------------- VERIFY CHANGES ----------------------------
# Next, we want to verify that our changes have been correctly applied
print('-------------------- AFTER CHANGES --------------------')
print('\nCount of null values per feature:')
print(dfx.isnull().sum(axis=0))

Next, we want to inspect **```ISLAND```** to count the instances of 0 and 1. We will also compare the categorical values against one another to get an idea of relevancy in terms of training.

In [0]:
# Analyze our categorical values for any multicollinearity or other issues
cols = dfx[['ISLAND', 'INLAND', 'NEAR BAY', 'NEAR OCEAN']].copy()
make_heatmap(cols)

# Count the records that have '0' and '1' for each categorical feature
print('\n\nValue Counts for Categoricals:\n')
print(dfx['ISLAND'].value_counts(), '\n')
print(dfx['INLAND'].value_counts(), '\n')
print(dfx['NEAR BAY'].value_counts(), '\n')
print(dfx['NEAR OCEAN'].value_counts(), '\n')

As we can see from the graph and counts above, there are only **5 samples** in which the **```ISLAND```** feature is set to 1. We will need to keep an eye on this as we move through the next steps.

## E. Working with Outlier Data

Now that we have filled in our missing data, let's take a look at our remaining features' distributions. This will help us to get an idea of other changes that we might need to make.

First, let's look at some graphs (histograms and boxplots).

>Websites Referenced:
- [Towards Data Science: Working with Outlier Data](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)
- [Medium.com: Standardize and Normalize Data](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc)

In [0]:
# Look at distributions to identify possible outlier data
fig, axes = plt.subplots(nrows=3, 
                         ncols=4, 
                         figsize = (20, 10)
                         )
# Flatten the axes to make them easier to reference
axes = axes.flatten()

# Plot each graph that you would like to see
bins = 20
dfx.hist('longitude', ax = axes[0], bins = bins)
dfx.hist('latitude', ax = axes[1], bins = bins)
dfx.hist('population', ax = axes[2], bins = bins)
dfx.hist('housing_median_age', ax = axes[3], bins = bins)
dfx.hist('total_rooms', ax = axes[4], bins = bins)
dfx.hist('total_bedrooms', ax = axes[5], bins = bins)
dfx.hist('median_income', ax = axes[6], bins = bins)
dfx.hist('median_house_value', ax = axes[7], bins = bins)
sns.countplot(x = dfx['INLAND'], ax = axes[8])
sns.countplot(x = dfx['NEAR BAY'], ax = axes[9])
sns.countplot(x = dfx['NEAR OCEAN'], ax = axes[10])
sns.countplot(x = dfx['ISLAND'], ax = axes[11]); plt.tight_layout(); plt.show(); plt.close()

We can also use **boxplots** and **logistic regression plots** to get a better idea about outliers.

In [0]:
# Look at distributions to identify possible outlier data
fig, axes = plt.subplots(nrows=2, 
                         ncols=4, 
                         figsize = (16, 8)
                         )
# Flatten the axes to make them easier to reference
axes = axes.flatten()

# Plot each graph that you would like to see
# (use dfx, the working copy with missing values filled, rather than the raw df)
sns.boxplot(x=dfx['longitude'], ax = axes[0])
sns.boxplot(x=dfx['latitude'], ax = axes[1])
sns.boxplot(x=dfx['housing_median_age'], ax = axes[2])
sns.boxplot(x=dfx['total_rooms'], ax = axes[3])
sns.boxplot(x=dfx['total_bedrooms'], ax = axes[4])
sns.boxplot(x=dfx['population'], ax = axes[5])
sns.boxplot(x=dfx['median_income'], ax = axes[6])
sns.boxplot(x=dfx['median_house_value'], ax = axes[7])
sns.lmplot(x = 'median_house_value', y = 'INLAND', logistic = True, data = dfx, ci = None)
sns.lmplot(x = 'median_house_value', y = 'NEAR BAY', logistic = True, data = dfx, ci = None)
sns.lmplot(x = 'median_house_value', y = 'NEAR OCEAN', logistic = True, data = dfx, ci = None)
sns.lmplot(x = 'median_house_value', y = 'ISLAND', logistic = True, data = dfx, ci = None)
plt.show()
plt.close()

### i. Using the Z-score Method with Outlier Data

There are a number of ways to deal with outlier data, but one of the more common ones is the **z-score**, which measures how many standard deviations a value lies from the mean of its feature. This allows us to exclude data that falls outside a standard deviation boundary of our choice (3 standard deviations in the code below), thus eliminating outliers. We will then look at our data again and see if this has helped.

In [0]:
# Note that we will use the following library to accomplish this task (imported in "Setup Environment")
#  from scipy import stats

# ------------------------------------- Calculate the z-score for all data -------------------------------------
# Get the absolute z-score value for our dataset
z = np.abs(stats.zscore(dfx))

# Finally, we keep only the rows in which every feature lies within 3 standard deviations of its mean
dfz = dfx[(z < 3).all(axis = 1)]

# Create a "figure" that can hold multiple plots
fig, ((ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8)) = plt.subplots(nrows = 2, 
                                                                ncols = 4, 
                                                                figsize = (16, 8)
                                                                )

# Show boxplots for all of our current non-categorical features
sns.boxplot(x=dfz['longitude'], ax = ax1)
sns.boxplot(x=dfz['latitude'], ax = ax2)
sns.boxplot(x=dfz['housing_median_age'], ax = ax3)
sns.boxplot(x=dfz['total_rooms'], ax = ax4)
sns.boxplot(x=dfz['total_bedrooms'], ax = ax5)
sns.boxplot(x=dfz['population'], ax = ax6)
sns.boxplot(x=dfz['median_income'], ax = ax7)
sns.boxplot(x=dfz['median_house_value'], ax = ax8)
plt.show()
plt.close()

# Also show the data description to get accurate values for the quartiles
print('Original Dataframe (dfx):\n')
display(dfx.describe())

print('Z-score Dataframe (dfz):\n')
display(dfz.describe())
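
To make the z-score filter concrete, here is a minimal sketch on a small, made-up array. The `values` array is hypothetical; the point is only to show that `stats.zscore` matches the manual formula and that the `< 3` rule drops the extreme observation.

```python
import numpy as np
from scipy import stats

# Hypothetical values: nineteen ordinary observations plus one extreme one
values = np.array([10.0] * 10 + [12.0] * 9 + [100.0])

# A z-score is the distance from the mean measured in standard deviations
z_manual = (values - values.mean()) / values.std()
z_scipy = stats.zscore(values)  # same calculation (population std, ddof=0)

# Keep only observations within 3 standard deviations of the mean
kept = values[np.abs(z_scipy) < 3]
print(len(values), len(kept), kept.max())
```

Here the value 100 has a z-score above 4 and is removed, while the nineteen ordinary values are kept.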

> **Questions:**
- Were there any **substantial changes** to our data using the z-score method for removing outliers?
- Did this **help our goal?**

### ii. Using the Interquartile Range Method (IQR) with Outlier Data

Another way to work with outliers is to eliminate those values that fall too far outside the middle 50% of the data, as measured by the IQR (Interquartile Range, the distance between the first and third quartiles). This works somewhat similarly to the z-score, but our threshold for eliminating outlier data will be based upon quartiles.

> Websites referenced:
- [Towards Data Science: Detect and Remove Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)
- [Purple Math: How IQR Relates to Box-and-Whisker Graphs (great, simple explanation of IQR)](https://www.purplemath.com/modules/boxwhisk3.htm)

After running the code below, we will have the IQR necessary to calculate our threshold for outliers. The formula is:

- ```lower boundary = q1 - (1.5 * iqr)```
- ```upper boundary = q3 + (1.5 * iqr)```

The subsequent code will then look for and remove all samples that fall outside these boundaries and leave us (hopefully) with substantially fewer outliers. 

We will then compare both methods (z-score and IQR) against the original dataset to see which one gives us the most usable data.
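
As a worked example of the boundary formula, the sketch below applies it to a short, made-up series. The values in `s` are hypothetical and chosen so that a single outlier is easy to spot.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)

# Keep only the values that fall inside the boundaries
kept = s[(s >= lower) & (s <= upper)]
print(q1, q3, iqr, lower, upper)
print(kept.tolist())
```

With pandas' default (linear) quantile interpolation, this gives `q1 = 3.25`, `q3 = 7.75`, and `iqr = 4.5`, so the boundaries are `-3.5` and `14.5` and only the value 100 is removed.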

In [0]:
# Remove outliers with IQR method

# Establish variables
q1 = dfx.quantile(0.25)
q3 = dfx.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
print('\n\nValues less than these will be removed:\n', 
      '-'*30,
      '\n',
      lower
      )
print('\n\nValues greater than these will be removed:\n', 
      '-'*30,
      '\n', 
      upper
      )

# Remove samples outside of the upper and lower boundaries
dfi = dfx[~((dfx < lower) | (dfx > upper)).any(axis = 1)]

# Compare longitude
fig, (ax1a, ax1b, ax1c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['longitude'],  ax = ax1a, color='g')
sns.boxplot(x=dfz['longitude'],  ax = ax1b)
sns.boxplot(x=dfx['longitude'],  ax = ax1c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare latitude
fig, (ax2a, ax2b, ax2c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['latitude'],  ax = ax2a, color='g')
sns.boxplot(x=dfz['latitude'],  ax = ax2b)
sns.boxplot(x=dfx['latitude'],  ax = ax2c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare housing_median_age
fig, (ax3a, ax3b, ax3c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['housing_median_age'], ax = ax3a, color='g')
sns.boxplot(x=dfz['housing_median_age'], ax = ax3b)
sns.boxplot(x=dfx['housing_median_age'], ax = ax3c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare total_rooms
fig, (ax4a, ax4b, ax4c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['total_rooms'], ax = ax4a, color='g')
sns.boxplot(x=dfz['total_rooms'], ax = ax4b)
sns.boxplot(x=dfx['total_rooms'], ax = ax4c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare total_bedrooms
fig, (ax5a, ax5b, ax5c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['total_bedrooms'], ax = ax5a, color='g')
sns.boxplot(x=dfz['total_bedrooms'], ax = ax5b)
sns.boxplot(x=dfx['total_bedrooms'], ax = ax5c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare median_income
fig, (ax6a, ax6b, ax6c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['median_income'], ax = ax6a, color='g')
sns.boxplot(x=dfz['median_income'], ax = ax6b)
sns.boxplot(x=dfx['median_income'], ax = ax6c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare median_house_value
fig, (ax7a, ax7b, ax7c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['median_house_value'], ax = ax7a, color='g')
sns.boxplot(x=dfz['median_house_value'], ax = ax7b)
sns.boxplot(x=dfx['median_house_value'], ax = ax7c, color='grey')
plt.show()
plt.close()

In [0]:
# Compare data descriptions for all three dataframes
print('Z-Score Dataframe (dfz):\n')
display(dfz.describe())

print('Original Dataframe (dfx):\n')
display(dfx.describe())

print('IQR Dataframe (dfi):\n')
display(dfi.describe())
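
One thing worth checking when reading these tables: if the IQR filter is applied to every column, the one-hot columns are filtered too. For a 0/1 column in which fewer than a quarter of the rows are 1, both quartiles are 0, so both boundaries collapse to 0 and every row holding a 1 is treated as an outlier. The sketch below demonstrates this on hypothetical data and shows one possible alternative, restricting the filter to chosen columns. The helper `iqr_filter` and the `toy` data are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy data: 'flag' is a one-hot style column that is 1 in 2 of 10 rows
toy = pd.DataFrame({
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'flag':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

def iqr_filter(dataframe, columns):
    """Drop rows that fall outside the 1.5 * IQR boundaries in any of `columns`."""
    subset = dataframe[columns]
    q1 = subset.quantile(0.25)
    q3 = subset.quantile(0.75)
    iqr = q3 - q1
    outside = (subset < q1 - 1.5 * iqr) | (subset > q3 + 1.5 * iqr)
    return dataframe[~outside.any(axis=1)]

all_columns = iqr_filter(toy, ['value', 'flag'])   # filters on 'flag' as well
continuous_only = iqr_filter(toy, ['value'])       # ignores 'flag'

print(len(all_columns), int(all_columns['flag'].sum()))
print(len(continuous_only), int(continuous_only['flag'].sum()))
```

In this toy case, filtering on all columns removes both rows where `flag` is 1, while filtering on `value` alone keeps all ten rows. Comparing `value_counts()` for the categorical columns in `dfx` and `dfi` would show whether something similar happens with the housing data.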

Everything appears to be better using the IQR method with the exception of **```median_house_value```**, which we might want to inspect a little more closely. Let's look at the new data in a distribution plot (histogram) to see what it looks like.

In [0]:
# Compare median_house_value via histograms
fig, (ax8a, ax8b, ax8c) = plt.subplots(nrows = 1, ncols = 3, figsize = (20, 5), )
ax8a = sns.distplot(dfi['median_house_value'],  
                    kde = True,
                    hist = True,
                    bins = 50, 
                    color = 'green',
                    ax = ax8a,
                    hist_kws={'alpha':1}, 
                    ).set_title('dfi (IQR)')
ax8b = sns.distplot(dfz['median_house_value'], 
                    kde = True,
                    hist = True,
                    bins = 50,
                    ax = ax8b,
                    hist_kws={'alpha':1}
                    ).set_title('dfz (Z-score)')
ax8c = sns.distplot(dfx['median_house_value'],  
                    kde = True,
                    hist = True,
                    bins = 50, 
                    color = 'grey',
                    ax = ax8c,
                    hist_kws={'alpha':1}
                    ).set_title('dfx (Original)')

The IQR method appears to eliminate many of the existing outlier problems while also handling the **```median_house_value```** "wall" at ~500,000. We can use any of the dataframes as we move forward, but the **IQR-based dataframe seems likely to produce the best model**. We can test that hypothesis as we build out various models in the next steps.