# California Info

Recording the observed range of each feature would help confine users to valid entries.


In [1]:
import sys

# Add the parent directory to the module search path so its modules can be imported
sys.path.insert(0, '../')
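
The relative entry `'../'` is interpreted against the current working directory, so it only points at the project root when the notebook is launched from its own folder. A minimal sketch of an alternative, assuming the same launch location, resolves the parent to an absolute path first; the name `project_root` is hypothetical and not part of the original notebook.

```python
import sys
from pathlib import Path

# Assumption: the current working directory is the notebook's folder,
# so its parent is the directory that holds the project modules.
project_root = str(Path.cwd().resolve().parent)

# Avoid adding the same entry again when the cell is re-run.
if project_root not in sys.path:
    sys.path.insert(0, project_root)
```

An absolute entry keeps pointing at the same directory even if the working directory changes later in the session.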

In [2]:
# autoreload all libraries/modules
%load_ext autoreload
%autoreload 2

In [3]:
# joblib tends to be more efficient than pickle for objects that hold large NumPy arrays
import joblib

from sklearn.datasets import fetch_california_housing

## Recap: California Housing Data

This is the data behind the model you created in the Cross Validation assignment. This section serves as a brief recap.

### Import Data & Separate Features & Targets

What is the target for the California housing data?

* [The California housing dataset — Scikit-learn course](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html) 




<br>
<details>
<summary>Solution</summary>

The target is the median house value for each California district,
expressed in hundreds of thousands of dollars ($100,000).

```python
print(california.DESCR)
```
</details>
<br>



In [4]:
# Fetch dataset from sklearn's internal datasets
california = fetch_california_housing(as_frame=True)

# Features for dataset
X = california['data']
# print(X)

# Target for dataset
y = california['target']
# print(y)

### Get Basic Information

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


In [6]:
X.describe().T[['min', 'mean', 'max']]

| | min | mean | max |
|---|---|---|---|
| MedInc | 0.4999 | 3.870671 | 15.0001 |
| HouseAge | 1.0 | 28.639486 | 52.0 |
| AveRooms | 0.846154 | 5.429 | 141.909091 |
| AveBedrms | 0.333333 | 1.096675 | 34.066667 |
| Population | 3.0 | 1425.476744 | 35682.0 |
| AveOccup | 0.692308 | 3.070655 | 1243.333333 |
| Latitude | 32.54 | 35.631861 | 41.95 |
| Longitude | -124.35 | -119.569704 | -114.31 |


In [7]:
OBS_BOUNDS = X.describe().T[['min', 'mean', 'max']]

In [8]:
OBS_BOUNDS.loc['MedInc', 'min']

0.4999
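
Since only three of the statistics from `describe()` are kept, the same table can likely be built by asking for just those three. The sketch below is an alternative, not the notebook's own method; `X_small` is a hypothetical stand-in for `X` with made-up values so the block runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in for X (made-up values, two features only).
X_small = pd.DataFrame(
    {"MedInc": [0.5, 3.9, 15.0], "HouseAge": [1.0, 29.0, 52.0]}
)

# Approach used in the notebook: full summary, then keep three columns.
via_describe = X_small.describe().T[["min", "mean", "max"]]

# Alternative: compute only the three statistics that are needed.
via_agg = X_small.agg(["min", "mean", "max"]).T

print(via_describe)
print(via_agg)

# A single .loc[row, column] call reads one value from either table.
medinc_min = via_agg.loc["MedInc", "min"]
```

Both tables are indexed by feature name with one column per statistic, so lookups work the same way on either one.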

### Joblib

These bounds are worth saving to disk with joblib so they can be reloaded later without recomputing them from the dataset.

In [9]:
# Save the bounds table to disk with joblib
with open("../models/observation_bounds.joblib", "wb") as f:
    joblib.dump(OBS_BOUNDS, f, protocol=5)

In [10]:
# Load the bounds with joblib; the same open/load pattern applies to pickle or cloudpickle
with open("../models/observation_bounds.joblib", "rb") as f:
    observation_bounds = joblib.load(f)
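
The save and load steps can be checked as a round trip. The following is a minimal sketch under stated assumptions: it writes to a temporary directory rather than `../models/`, passes a path to joblib instead of an open file handle, and uses a small hypothetical `bounds` table with two rows copied from the output above.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# Two rows copied from the bounds table, enough to illustrate the round trip.
bounds = pd.DataFrame(
    {"min": [0.4999, 1.0], "mean": [3.870671, 28.639486], "max": [15.0001, 52.0]},
    index=["MedInc", "HouseAge"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "observation_bounds.joblib"
    joblib.dump(bounds, path)      # joblib accepts a file path directly
    restored = joblib.load(path)   # read it back before the directory is removed

# The restored table should match the original exactly.
print(restored.equals(bounds))
```

Passing a path lets joblib open and close the file itself, which removes the need for the `with open(...)` wrapper.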

In [11]:
observation_bounds

| | min | mean | max |
|---|---|---|---|
| MedInc | 0.4999 | 3.870671 | 15.0001 |
| HouseAge | 1.0 | 28.639486 | 52.0 |
| AveRooms | 0.846154 | 5.429 | 141.909091 |
| AveBedrms | 0.333333 | 1.096675 | 34.066667 |
| Population | 3.0 | 1425.476744 | 35682.0 |
| AveOccup | 0.692308 | 3.070655 | 1243.333333 |
| Latitude | 32.54 | 35.631861 | 41.95 |
| Longitude | -124.35 | -119.569704 | -114.31 |


How do you get the mean of `MedInc`? The maximum of `Population`?

<br>
<details>
<summary>Solution</summary>

```python
# Mean of MedInc (three equivalent lookups)
observation_bounds.loc['MedInc', 'mean']
observation_bounds.loc['MedInc', :]['mean']
observation_bounds['mean'].iloc[0]

# Maximum of Population
observation_bounds.loc['Population', 'max']
```
</details>
<br>



In [12]:
observation_bounds[['mean']].T.reset_index(drop=True)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
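
Returning to the goal stated at the top, one way to confine users to valid entries is to compare each submitted value with the stored minimum and maximum. The sketch below is an assumption about how such a check might look, not part of the original notebook: `find_out_of_range` and the sample `entry` are hypothetical, and the small `bounds` table copies three rows from the output above so the block runs on its own.

```python
import pandas as pd

# Three rows copied from the bounds table above, for brevity.
bounds = pd.DataFrame(
    {
        "min": [0.4999, 1.0, 32.54],
        "mean": [3.870671, 28.639486, 35.631861],
        "max": [15.0001, 52.0, 41.95],
    },
    index=["MedInc", "HouseAge", "Latitude"],
)


def find_out_of_range(entry, bounds):
    """Return {feature: (value, min, max)} for values outside the observed range."""
    problems = {}
    for feature, value in entry.items():
        low = bounds.loc[feature, "min"]
        high = bounds.loc[feature, "max"]
        if not low <= value <= high:
            problems[feature] = (value, low, high)
    return problems


# Hypothetical user entry: HouseAge is above the observed maximum of 52.
entry = {"MedInc": 3.5, "HouseAge": 80.0, "Latitude": 34.0}
problems = find_out_of_range(entry, bounds)
print(problems)

# The feature means could serve as defaults for fields the user leaves blank.
defaults = bounds["mean"]
```

Whether an out-of-range value should be rejected, clipped, or only flagged is a design choice the notebook leaves open; this sketch only reports the offending fields.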
