# Lab 03
## King County Housing, This Time With Graphs

In this lab, we are going to import data into Python using `pandas`, clean it up, and make some **graphs**. We'll again use the housing data from [King County in Washington State](https://geodacenter.github.io/data-and-lab/KingCounty-HouseSales2015/).

Use our online notes to complete our labs. I have my own commentary and links that will help. You can work through the parts below in order.

If you are getting errors and are not sure why, my first suggestion is always to **Restart** the Python kernel, to **Clear All Output**, and run each code cell **one-by-one** from the top.

Let each code cell represent one idea, or output. Take advantage of **Markdown** in your write-up. Use headers and formatting. Here's a [Markdown cheat sheet](https://www.markdownguide.org/cheat-sheet/) that I keep handy. Once you figure out some of the basics, you might not want to go back to Word or Google Docs. 

Finally, use **Chapter 7** of our textbook as a resource. I have that chapter and our online notes open as I'm making this lab. I will refer to page numbers. I may also ask you to figure out how to do something that isn't exactly in our notes!

## Part 1

Open up VS Code. Have you set up your class folder yet? I've posted a video that will help you figure that part out. Open up that folder under File. You should see it in the **EXPLORER** window pane on the left. 

Type **Cmd-Shift-P** (Ctrl-Shift-P in Windows) to open up the command palette. Make sure there's a little `>` in the search bar that pops up (there should be). Search for `Jupyter: Create New Jupyter Notebook`. This will open up a blank `.ipynb` file. Save this in a new folder in your course folder called `lab03`. Call this file `lab03-lastname-firstname`, where you fill in your name, of course! 


## Part 2

You'll see a blank code cell, or **cell**, at the top. By default, this is set to Python. See the lower right-hand corner of the cell? Click there and search and select **Markdown**. 

Create the following in that top Markdown cell.

```
# Lab03
Firstname Lastname
Date
```

As you answer the questions below, use new Markdown cells and headers to separate your answers, as appropriate.

Create your first code cell where you `import` both `numpy` and `pandas` as `np` and `pd`, respectively. 

Also include `import datetime as dt`. The `datetime` library is discussed in the first part of the second DataCamp assignment.

Include `import matplotlib.pyplot as plt` and `from matplotlib import style`. We are going to automatically style our graphs too.

Add a **comment** to that cell calling it **Set-Up**. You can add comments to each code cell to remind yourself what you're doing. Comments are different from the Markdown that you're using to write your narrative - they go in the Python cells and use `#`. 
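Here is a minimal sketch of what that first code cell could look like. The grouping and ordering of the imports is just one reasonable choice, not a requirement.

```python
# Set-Up
import datetime as dt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import style
```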

## Part 3

I'll remind you about this data. 

The DataFrame we will be working with will come from `kc_house_data.csv`, which is up in my data folder on [GitHub](https://github.com/aaiken1/fin-data-analysis-python/tree/main/data).

`kc_house_data.csv` is a CSV file on home sales and characteristics from May 2014 - May 2015 in King County, Washington State. I pulled the file from Kaggle, a data science competition page and a good resource for interesting data sets.

The **url** for the data is <https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/kc_house_data.csv>. Download it directly using `pandas` and put it into a DataFrame called `kc`. There is a variable called `id` in the first column. Make that the index value for the DataFrame. Clean up the `date` variable like last time.
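A minimal sketch of the import step follows. So that it runs without a network connection, it reads a few made-up rows from an in-memory string rather than from the url. The column layout and the date format (`20141013T000000`) are assumptions about the real file, so check them against your own download.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file. The column layout and the
# date format are assumptions about kc_house_data.csv.
sample_csv = io.StringIO(
    "id,date,price,bedrooms,sqft_living\n"
    "101,20141013T000000,221900,3,1180\n"
    "102,20141209T000000,538000,3,2570\n"
    "103,20150225T000000,180000,2,770\n"
)

# With the real data, pass the url string in place of sample_csv.
kc = pd.read_csv(sample_csv, index_col="id")

# Convert the text dates into proper datetime values.
kc["date"] = pd.to_datetime(kc["date"], format="%Y%m%dT%H%M%S")

print(kc.dtypes)
```

Setting `index_col="id"` makes the `id` column the index as the file is read, so there is no separate `set_index` step.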

As mentioned in the last lab, we want to think about our question first and then match that to our data. What are we interested in? Can our data answer that question? In practice, just knowing this can be really tough. For example, some questions that this data **might** be able to help us answer:

- What time period does this data cover?
- How large are houses in King County? What is the typical number of bedrooms or bathrooms?
- What does the distribution of prices look like?
- What types of amenities are present?
- What housing characteristics do people value? 
- Can we predict home values?

Answering these questions would help you figure out what to price a particular home at, for example. Zillow's [Zestimate pricing model](https://www.zillow.com/z/zestimate/) does that. They messed up, though, when trying to use their Zestimate to [actually buy and sell houses](https://www.npr.org/2021/11/08/1053689886/ibuyers-zillow-and-the-lemons-problem). The problem wasn't their model, per se. They ran into what's called the [lemon problem](https://en.wikipedia.org/wiki/The_Market_for_Lemons). 

Anyone trading, or just buying and selling when the quality of the good is uncertain, has to be aware of this problem! Always ask: Why is this other person trying to sell me this item (e.g. house, car, stock) at this particular price?

A friend of mine has created a company that scores homes and neighborhoods based on their curb-appeal [using machine learning techniques](https://www.wsj.com/articles/selling-your-home-its-whats-on-the-outside-that-counts-11579792560) based on photos.

## Part 4

Let's start with some sorting, filtering, and variable creation. 

Create a new `pandas` DataFrame from our main data file that contains: date (date sold), price, square feet of living space, year built, number of bedrooms, and price per square foot. You'll need to create two of these variables! Name this DataFrame `kc_subset`. 

You can create the year by selecting the `date` column and using the `dt.year` method on it. You can find an example [here](https://datascienceparichay.com/article/pandas-extract-year-from-datetime-column/). We did this in the last lab too. 

How can you create a `prc_sqr_ft` variable?

Remember, this new DataFrame will live in memory, but it isn't stored on your computer until you actually save it using something like `to_csv`. 

Do the following:

- Show the ten least expensive homes sold in this data set on a price per square foot basis.
- Print the min price.
- Print the max price.
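The steps in this part can be sketched as follows, using four made-up rows in place of the real `kc` DataFrame. The column names `sqft_living` and `bedrooms` and the new column name `year` are assumptions for illustration; use whatever names your data and your write-up call for.

```python
import pandas as pd

# Made-up stand-in for the cleaned kc DataFrame; column names are assumptions.
kc = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2014-10-13", "2014-12-09", "2015-02-25", "2015-04-01"]
        ),
        "price": [221900.0, 538000.0, 180000.0, 604000.0],
        "sqft_living": [1180, 2570, 770, 1960],
        "bedrooms": [3, 3, 2, 4],
    },
    index=pd.Index([101, 102, 103, 104], name="id"),
)

# Keep only the columns of interest; .copy() leaves kc itself untouched.
kc_subset = kc[["date", "price", "sqft_living", "bedrooms"]].copy()

# Two new variables: the year taken from the date, and price per square foot.
kc_subset["year"] = kc_subset["date"].dt.year
kc_subset["prc_sqr_ft"] = kc_subset["price"] / kc_subset["sqft_living"]

# Ten lowest prices per square foot (fewer here because the sample is tiny).
print(kc_subset.nsmallest(10, "prc_sqr_ft"))
print(kc_subset["price"].min())
print(kc_subset["price"].max())
```

`nsmallest` sorts and takes the top rows in one step. Sorting with `sort_values("prc_sqr_ft")` and then calling `head(10)` gives the same rows.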

## Part 5

What does the distribution of housing prices look like? Let's create a **histogram** of prices. 

It’s good practice to always think about your data. For a histogram, this can mean figuring out the number of bins to use. When you create your histogram, include this as an argument: `range = (0, 8000000)`. This will have the x-axis go from 0 to 8,000,000, essentially the range of prices that you should have seen above.

Now, how can you figure out the **bin width** and the number of bins to use? The bin width is the range divided by the number of bins. If you choose `bins = 8`, then each bin is going to have a width of 1,000,000. If you choose `bins = 80`, then each bin is 100,000 wide. You might ask yourself: “What would be a meaningful difference in housing prices?” \$1 is obviously too little, \$500,000 might be too high.
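To check the arithmetic, here is a short sketch that computes the bin width for a few bin counts over the range 0 to 8,000,000. The bin counts are arbitrary examples, not recommendations.

```python
import numpy as np

low, high = 0, 8_000_000

# Bin width is the width of the range divided by the number of bins.
for n_bins in (8, 80, 160):
    width = (high - low) / n_bins
    print(f"{n_bins} bins -> each bin is {width:,.0f} wide")

# The edges of 8 equal-width bins: 9 evenly spaced points from low to high.
edges = np.linspace(low, high, 8 + 1)
print(edges)
```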

Pick a number of bins. Add a title. Remove the legend. Add the following to style the graph automatically: `style = 'seaborn-white'`.

All of your options will go **inside** of the `plot()` method.

For various settings in `pandas` plot, check out this [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html).

`pandas` is using `matplotlib` functions to style the plots. In fact, `pandas` plot and `matplotlib` can be used together. To see a list of styles, try: `plt.style.available`. Style names depend on your `matplotlib` version; starting with version 3.6, `seaborn-white` is listed as `seaborn-v0_8-white`. Don't change the color of the line, etc. when making the plot, or this will override your style choice.
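A minimal sketch of a `pandas` histogram with these options is below. It uses synthetic prices, not the King County data, and 80 bins is only an example. Note one swap: the sketch sets the style with `plt.style.use` before plotting rather than through an argument to `plot()`, and it picks whichever seaborn style name your installation lists.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic prices standing in for kc_subset; this is not the real data.
rng = np.random.default_rng(0)
kc_subset = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1_000)})

# The seaborn style names changed across matplotlib versions, so use
# whichever one this installation lists, falling back to the default.
candidates = ["seaborn-v0_8-white", "seaborn-white"]
style_name = next((s for s in candidates if s in plt.style.available), "default")
plt.style.use(style_name)

# All of the histogram options go inside plot().
ax = kc_subset.plot(
    y="price",
    kind="hist",
    bins=80,
    range=(0, 8_000_000),
    title="Distribution of sale prices",
    legend=False,
);
```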


## Part 6

I didn't like the x-axis on my graph. It uses scientific notation and just goes 1, 2, 3... up to 8, because the top range was 8,000,000. Let's do two things now. 

- We can filter our data before plotting it. Try `kc_subset[kc_subset['price'] < 1000000].your_plot_code`, where `your_plot_code` is your histogram code from above. This will filter your data to only homes with prices less than 1,000,000 and then pass that filtered data set to the `plot()` function. Edit the range and bins to something more appropriate. **Do you see spikes in the data? What's going on there?**

By the way, there's a term for this kind of programming logic. We are filtering our DataFrame and then **piping** it, or sending it, to the plot function. Basically, there's some logic to your code that you can read from left to right. And this is all done without changing the actual DataFrame.
  
- I still don't like the x-axis. Let's use `matplotlib` to make the graph. It's easier to modify things this way. You can actually use your `pandas` plot and then modify it with `matplotlib` code. But, we'll start from scratch. **Make the same histogram again, but use the `pyplot` way discussed in our notes.**
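The filter-then-plot chain from the first bullet can be sketched like this, again with synthetic prices in place of the real data. The 50 bins and the title are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for kc_subset; this is not the real data.
rng = np.random.default_rng(1)
kc_subset = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1_000)})

# Filter first, then send the filtered rows straight to plot().
ax = kc_subset[kc_subset["price"] < 1_000_000].plot(
    y="price",
    kind="hist",
    bins=50,
    range=(0, 1_000_000),
    title="Sale prices under 1,000,000",
    legend=False,
)

# The original DataFrame still holds every row.
print(len(kc_subset))
```

Because the filter produces a temporary DataFrame, nothing is assigned back to `kc_subset`.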

This means using the `plt.hist` function. Use **all of the price data** by plotting `kc_subset.price` or `kc_subset['price']` inside of `plt.hist`. Either pulls the column that you want out of the DataFrame. Add an x-label and a title. Do not include a legend. Style it using `plt.style.use('seaborn-white')`, or the matching name from `plt.style.available` if your `matplotlib` version no longer lists `seaborn-white`.

Let's fix that x-axis finally. 

Make sure that the last line has a semi-colon.
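One possible way to fix the axis is a tick formatter from `matplotlib.ticker`. This is a sketch on synthetic prices, and the helper name `millions` and the label text are made up for illustration. Other approaches, such as `plt.ticklabel_format(style='plain', axis='x')`, would also remove the scientific notation.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Synthetic prices standing in for kc_subset; this is not the real data.
rng = np.random.default_rng(2)
kc_subset = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1_000)})


def millions(x, pos):
    """Format a tick value such as 2000000 as '2.0M' (hypothetical helper)."""
    return f"{x / 1_000_000:.1f}M"


plt.hist(kc_subset["price"], bins=80, range=(0, 8_000_000))
plt.gca().xaxis.set_major_formatter(FuncFormatter(millions))
plt.xlabel("Sale price (dollars)")
plt.title("Distribution of sale prices");
```

The trailing semi-colon keeps the notebook from printing the text representation of the last object under the cell.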


## Part 7

Let’s create a simple **categorical variable** from our data. Let’s define a large house as anything greater than 2,500 square feet. Write code to create a new categorical variable called `large_house`.
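One way to build such a flag is `np.where`, sketched here on five made-up square footages. The labels `"large"` and `"not large"` are assumptions; a True/False column would work just as well.

```python
import numpy as np
import pandas as pd

# Made-up square footages standing in for kc_subset.
kc_subset = pd.DataFrame({"sqft_living": [770, 1180, 2500, 2570, 4100]})

# Strictly greater than 2,500 counts as large, so 2,500 itself does not.
kc_subset["large_house"] = np.where(
    kc_subset["sqft_living"] > 2500, "large", "not large"
)

print(kc_subset["large_house"].value_counts())
```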

Let’s create housing categories based on the quartile of living space that the house falls into. In `pandas`, `pd.qcut` can do this (it plays the role that `mutate` and `ntile` play in R). This is a less ad hoc way of creating categories. Call the new variable `sqft_living_4`; it will range from 1 to 4, depending on the size of the living space. Note that I am saving over my data set when I create my new variable.
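A minimal sketch of quartile categories in `pandas`, assuming `pd.qcut` is an acceptable stand-in for the quantile-binning step described above. The eight square footages are made up.

```python
import pandas as pd

# Made-up square footages standing in for kc_subset.
kc_subset = pd.DataFrame(
    {"sqft_living": [770, 960, 1180, 1680, 1960, 2570, 3100, 4100]}
)

# q=4 splits the column at its quartiles; labels names the four groups 1-4.
kc_subset["sqft_living_4"] = pd.qcut(
    kc_subset["sqft_living"], q=4, labels=[1, 2, 3, 4]
)

print(kc_subset)
```

The result is a categorical column. If you want plain integers instead, add `.astype(int)` to the end of the `pd.qcut` call.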

Plot the distribution of `sqft_living` using a histogram, faceted by `sqft_living_4`.
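One way to facet in `pandas` is the `by` argument of `DataFrame.hist`, which draws one panel per group. This sketch uses synthetic square footages, and the bin count and figure size are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic square footages standing in for kc_subset; not the real data.
rng = np.random.default_rng(3)
kc_subset = pd.DataFrame(
    {"sqft_living": rng.lognormal(mean=7.5, sigma=0.4, size=400)}
)
kc_subset["sqft_living_4"] = pd.qcut(
    kc_subset["sqft_living"], q=4, labels=[1, 2, 3, 4]
).astype(int)

# One histogram panel per quartile group.
axes = kc_subset.hist(
    column="sqft_living", by="sqft_living_4", bins=20, figsize=(8, 6)
)
```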

Which `sqft_living_4` quantile category has the highest mean price per square foot? Let’s use `groupby` for the first time.
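The grouped mean can be sketched as below. The numbers are invented purely to show the mechanics, so the answer here says nothing about the real data.

```python
import pandas as pd

# Invented values for illustration only; these are not results from the data.
kc_subset = pd.DataFrame(
    {
        "sqft_living_4": [1, 1, 2, 2, 3, 3, 4, 4],
        "prc_sqr_ft": [300.0, 280.0, 260.0, 250.0, 240.0, 230.0, 220.0, 200.0],
    }
)

# Split by category, then average the price per square foot in each group.
mean_ppsf = kc_subset.groupby("sqft_living_4")["prc_sqr_ft"].mean()

print(mean_ppsf)
print(mean_ppsf.idxmax())  # label of the group with the highest mean
```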



## Part 8



## Part 9



## Part 10

Finally, **clear the outputs of all of your cells** using the button at the top of the notebook. Restart your Python kernel. This will clear everything from memory. Then, **Run All**. All of your cells should now run, in order, from top to bottom.

You are done! **Turn in this .ipynb file via Moodle.**

You are developing a nice set of tools. You can bring in some raw data, clean it up, summarize it, create new variables, and make some graphs. Once you have the basics down, you can read documentation and look at examples to figure out how to do a lot more.