In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

# Lab 07: Exploratory Data Analysis of California Housing Prices

Exploratory Data Analysis (EDA) is a critical step in understanding and summarizing a dataset before modeling. It combines data visualization, summary statistics, and other summarization techniques to reveal the data's underlying patterns and relationships. EDA helps identify outliers and missing values and shows how each variable is distributed, which guides preprocessing and informs later modeling decisions. By exploring the data, we can uncover patterns, validate assumptions, and generate hypotheses. Many EDA visualizations in Python are generated with the `seaborn` module.

The `seaborn` documentation and tutorials can be found [here](https://seaborn.pydata.org/examples/index.html).

### In today's lab, we will...
- Demonstrate mastery of methods for EDA
- Use Pandas DataFrames
- Use Seaborn to plot
    - pair plots
    - scatterplots
    - heat map of correlation coefficients
    - line of best fit and confidence interval
    

In [None]:
# import packages we will use
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn # to get the California data set

### 1. Import the Data

We will begin by importing the dataset from `scikit-learn`, a machine learning module for Python (imported under the name `sklearn`). We will not explore the module's functions and capabilities in great depth in this course (you will see a lot more of it in DATA 322 and DATA 422), but we will occasionally use it to import datasets. Like `plotnine` and `seaborn`, `scikit-learn` has several stored datasets. Today, we will use one about California housing prices. Run the cells below to import the data.

In [None]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)

In [None]:
# check the type of data
type(data)

As we can see, the data are imported as a `scikit-learn` `Bunch` object, a dictionary-like container whose entries can also be accessed as attributes.
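
To get a feel for how this kind of object behaves, here is a minimal sketch using a small made-up `Bunch`. The name `toy_bunch` and its contents are assumptions for illustration only and are not part of the housing data.

```python
from sklearn.utils import Bunch

# A toy Bunch with made-up contents (not the housing data)
toy_bunch = Bunch(numbers=[1, 2, 3], DESCR="A made-up description.")

# The same entry can be reached as a dictionary key or as an attribute
by_key = toy_bunch["numbers"]
by_attr = toy_bunch.numbers

# Like a dictionary, a Bunch can list its keys
print(list(toy_bunch.keys()))
```

This is why both `data["DESCR"]` and `data.DESCR` style access work on objects of this type.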

In [None]:
# This is what the data look like
data

In [None]:
# To get a general description of the dataset
print(data.DESCR)

<!-- BEGIN QUESTION -->

**Question 1:** Based on the description printed above, the dataset contains attributes and a target. In Markdown, make a bulleted list of each attribute and the target, with a description of what each one is.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Now that we have a general understanding of the data, let's put it into a Pandas DataFrame with the target variable as the last column.

In [None]:
ca_houses = pd.DataFrame(data=data.data, columns=data.feature_names)
ca_houses['MedHouseVal'] = data.target # add last column
ca_houses.head()

### 2. EDA - statistics

**Question 2.1:** How many rows and columns are in the `ca_houses` DataFrame?

In [None]:
num_rows = ...
num_cols = ...

num_rows, num_cols

In [None]:
grader.check("q2_1")

**Question 2.2:** Use Pandas method(s) to determine the minimum, maximum, mean, and median of the median house values in the data set.
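
*HINT:* The sketch below shows one possible way to compute summary statistics with Pandas on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration, not the lab data.

```python
import pandas as pd

# Made-up prices, purely for illustration
toy = pd.DataFrame({"price": [1.0, 2.0, 3.0, 10.0]})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = toy.describe()

# Individual statistics are also available as Series methods
toy_min = toy["price"].min()
toy_max = toy["price"].max()
toy_mean = toy["price"].mean()
toy_median = toy["price"].median()
```

Note that the median appears in the `describe()` output as the `50%` row.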

In [None]:
describe_ca_houses = ...

In [None]:
min_val = ...
max_val = ...
mean_val = ...
median_val = ...

min_val, max_val, mean_val, median_val

In [None]:
grader.check("q2_2")

**Question 2.3:** In the data set, which feature has the largest mean? Which feature has the smallest mean? Report your answers as strings.

In [None]:
largest_mean = ...
smallest_mean = ...

largest_mean, smallest_mean

In [None]:
grader.check("q2_3")

<!-- BEGIN QUESTION -->

**Question 2.4:** Do any variables take on negative values? Explain. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 3. EDA - preprocessing

**Question 3.1:** Does the dataset contain any null values? Assign `has_nulls` to 0 if there are no null values and 1 if there are null values. Assign `columns_with_nulls` to a list containing the names of columns that contain null values. (Assign `columns_with_nulls` to an empty list if there are no null values.)
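
*HINT:* Here is a minimal sketch of checking for missing values on a made-up DataFrame. The names `toy`, `null_counts`, `any_nulls`, and `cols_with_nulls` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Number of missing values in each column
null_counts = toy.isna().sum()

# True if at least one cell anywhere in the DataFrame is missing
any_nulls = bool(toy.isna().any().any())

# Names of the columns that contain at least one missing value
cols_with_nulls = toy.columns[toy.isna().any()].tolist()
```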

In [None]:
...

In [None]:
has_nulls = ...
columns_with_nulls = ...

In [None]:
grader.check("q3_1")

**Question 3.2:** Does the dataset contain any duplicated rows? Assign `num_duplicates` to the number of duplicated rows in the dataset. If there are duplicates, remove them from `ca_houses`.
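
*HINT:* The sketch below shows one way to count and drop duplicated rows on a made-up DataFrame. The names `toy`, `n_dupes`, and `toy_unique` are assumptions for illustration.

```python
import pandas as pd

# Made-up data in which the second row repeats the first
toy = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})

# duplicated() marks each row that repeats an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame and leaves toy unchanged
toy_unique = toy.drop_duplicates()
```

Because `drop_duplicates()` returns a new DataFrame by default, the result has to be assigned to a name for the change to be kept.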

In [None]:
num_duplicates = ...
num_duplicates

In [None]:
grader.check("q3_2")

### 4. EDA - visualization

**Question 4.1:** Import the `seaborn` module. 

In [None]:
...
...

In [None]:
grader.check("q4_1")

**Question 4.2:** Generate a Seaborn pairplot from the `ca_houses` data.

In [None]:
houses_pairplot = ...
houses_pairplot

In [None]:
grader.check("q4_2")

**Question 4.3:** Based on your pairplot, identify two variables which appear to be positively correlated.

1. `HouseAge` and `MedInc`
2. `MedInc` and `MedHouseVal`
3. `Population` and `MedHouseVal`
4. `AveOccup` and `Longitude`

Enter 1, 2, 3, or 4. 

In [None]:
positive_corr = ...
positive_corr

In [None]:
grader.check("q4_3")

<!-- BEGIN QUESTION -->

**Question 4.4:** Based on your pairplot, do any of the variables appear to be bimodal? If so, which? Any hypothesis as to why that might be? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.5:** List any other insights you gained from your pairplot.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 4.6:** Generate a Seaborn countplot to determine how many rows of each `HouseAge` there are. Explain what you see. Does this reveal anything about how the data were collected or recorded?

_Type your answer here, replacing this text._

In [None]:
...
countplot_houses = ...

In [None]:
grader.check("q4_6")

**Question 4.7:** Create a correlation matrix and a heatmap to observe the correlation between each pair of variables in `ca_houses`.
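
*HINT:* A minimal sketch of a correlation matrix and heatmap on made-up data follows. The names `toy` and `toy_corr`, and the choices `annot=True`, `vmin=-1`, `vmax=1`, and `cmap="coolwarm"`, are assumptions for illustration; other settings work too.

```python
import pandas as pd
import seaborn as sns

# Made-up data, purely for illustration
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # increases in step with x
    "z": [4, 3, 2, 1],   # decreases in step with x
})

# corr() computes pairwise Pearson correlation coefficients by default
toy_corr = toy.corr()

# Draw the matrix as a heatmap, printing each coefficient in its cell
ax = sns.heatmap(toy_corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

Fixing the color scale to the range from -1 to 1 makes positive and negative correlations easy to tell apart.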

In [None]:
...
correlation_mat = ...
houses_heatmap = ...

In [None]:
grader.check("q4_7")

**Question 4.8:** Which variables are positively correlated with median house price?

1. `MedInc`
2. `HouseAge`
3. `AveRooms`
4. `AveBedrms`
5. `Population`
6. `AveOccup`
7. `Latitude`
8. `Longitude`

Assign `pos_corr` to a list with the numbers associated with positive correlations. 

In [None]:
pos_corr = ...

In [None]:
grader.check("q4_8")

**Question 4.9:** One thing we might want to explore further is how housing prices relate to median income (since the heatmap shows that these are strongly correlated). Create a copy of the `ca_houses` DataFrame called `ca_houses_with_income_status`.

In [None]:
grader.check("q4_9")

**Question 4.10:** In the new DataFrame, create a column called `IncStatus`. Set its value to `below average` if the row's `MedInc` is below the average `MedInc` value, and to `above average` otherwise.

*HINT:* You may want to use a list comprehension to create this new column. The structure should look like
```python
ca_houses_with_income_status['IncStatus'] = # your list comprehension here
```
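
As a reminder of the syntax, here is a small sketch of a list comprehension that picks one of two labels for each item. The list `temps`, the `cutoff` value, and the labels are made up for illustration.

```python
# Made-up temperatures, purely for illustration
temps = [55, 72, 64, 80]
cutoff = 65

# "value_if_true if condition else value_if_false" is evaluated once per item
labels = ["cool" if t < cutoff else "warm" for t in temps]
```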

In [None]:
ave_income = ...
ca_houses_with_income_status['IncStatus'] = ...
ca_houses_with_income_status

In [None]:
grader.check("q4_10")

**Question 4.11:**  Create a boxplot showing the distribution of median housing prices for each category of income status. Comment on what you see. 
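
*HINT:* The sketch below draws one box per category on a made-up DataFrame. The name `toy` and the columns `group` and `score` are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up scores for two groups, purely for illustration
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "score": [1, 2, 3, 4, 6, 8],
})

# The categorical column goes on one axis and the numeric column on the other
ax = sns.boxplot(data=toy, x="group", y="score")
```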

_Type your answer here, replacing this text._

In [None]:
...
houses_boxplot = ...

In [None]:
grader.check("q4_11")

**Question 4.12:** Create a Seaborn scatter plot with `Longitude` along the x-axis and `Latitude` along the y-axis. Map the median house value to the color of the points and map the population size to the size of the points. Be sure to give your plot a title.

*NOTE:* You might find it helpful to change the transparency of the points using `alpha = 0.5` when you create the graph.
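
To illustrate how columns can be mapped to color and point size, here is a sketch on a made-up DataFrame. The name `toy`, its columns, and the title text are assumptions for illustration only.

```python
import pandas as pd
import seaborn as sns

# Made-up points, purely for illustration
toy = pd.DataFrame({
    "x": [0, 1, 2, 3],
    "y": [3, 1, 2, 0],
    "value": [10, 20, 30, 40],
    "n": [5, 50, 500, 5000],
})

# hue controls the color of each point, size controls its area,
# and alpha makes overlapping points easier to see
ax = sns.scatterplot(data=toy, x="x", y="y", hue="value", size="n", alpha=0.5)
ax.set_title("Toy scatter plot")
```

`sns.scatterplot` returns a Matplotlib `Axes` object, so the title can be set on the returned object.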

_Type your answer here, replacing this text._

In [None]:
...
cali_scatter = ...
...
...

In [None]:
grader.check("q4_12")

<!-- BEGIN QUESTION -->

**Question 4.13:** What do you notice about the figure you made in the previous problem? Describe your observations. Does anything in this figure relate to the bimodality you noticed in a few of the pairplot histograms? Explain.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.14:** In this data set, which is derived from the 1990 U.S. census, how do prices in Humboldt compare qualitatively to the rest of the state?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 4.15:**  Use a Seaborn data visualization to determine at least one region of California that tends to have older homes in this dataset. Assign the visualization to `age_plot`, then answer the question in the text cell provided.

_Type your answer here, replacing this text._

In [None]:
...
age_plot = ...
...
...

In [None]:
grader.check("q4_15")

**Question 4.16:**  Use a Seaborn data visualization to determine at least one region of California that tends to have below average income. Assign the visualization to `income_plot`, then answer the question in the text cell provided.

_Type your answer here, replacing this text._

In [None]:
...
income_plot = ...
...
...

In [None]:
grader.check("q4_16")

## You're done! 

Congratulations on finishing the lab! Gus wishes he could join you for spring break. Run the cell below and submit to Canvas. 

<img src="gus_trip.JPG" alt="drawing" width="500"/>

## References
- Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)