# Discussion 03: Plotting and Functions

Welcome to Discussion 03! 

This week, we will go over plotting, writing functions, and using the ```.apply()``` method to modify DataFrames.

You can find additional help on these topics in ['Visualization' (section 3)](https://notes.dsc10.com/03-visualization/intro.html) and ['Defining and Applying Functions' (section 2.12)](https://notes.dsc10.com/02-data_sets/apply.html) of the course notes.

[Here](https://babypandas.readthedocs.io/en/latest/) is a pointer to that reference sheet we saw last time.

<img src="data/panda_lounging.jpeg" width="1000">

In [1]:
import babypandas as bpd
import numpy as np

import otter
grader = otter.Notebook()

calfire = bpd.read_csv('data/calfire-full.csv')
calfire

# Plotting

We can visualize and plot our data directly from DataFrames! 
Plots often reveal patterns that would be hard to spot in a table of raw numbers.

`df.plot(kind='...', x=..., y=...)`
- `kind` can be `'scatter'`, `'line'`, `'bar'`, `'barh'`, or `'hist'`

`df.get(col_name).plot(kind='hist', bins=n_bins, density=True)`

## Tips and steps to answering plotting questions
- Get the data to the right aggregated level (if needed) with the required columns
- Picture the plot in your head first: which kind of plot will answer the question, and what should it look like?
- Code
- Generate insights, validate the plot, and answer the question
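
The steps above can be sketched on a small made-up table. This is only an illustration: the `week` and `visit` columns are hypothetical, and `pandas` is used as a stand-in for `babypandas` (on the assumption that the methods shown behave the same way in both) so that the sketch runs without the course package.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import pandas as pd    # stand-in for babypandas in this sketch

# Hypothetical data: one row per visit, labeled with the week it happened in
toy = pd.DataFrame().assign(
    week=[1, 1, 2, 2, 2, 3],
    visit=['a', 'b', 'c', 'd', 'e', 'f'],
)

# Step 1: aggregate to one row per week.
# After .count(), each remaining column holds the number of rows in that group.
per_week = toy.groupby('week').count()

# Steps 2-3: a line plot suits a trend over time; the index (week) is used for x.
ax = per_week.plot(kind='line', y='visit')

# Step 4: validate the plot against the aggregated table before drawing conclusions.
print(per_week)
```

Printing the aggregated table next to the plot is a cheap way to confirm that the plotted values are the ones you intended.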


### Question 1: Is the number of fires increasing with time (year)?

*set ```count_is_increasing``` to True if the number of fires is increasing*

- What kind of plot will best answer this question?
- Is there more than one appropriate plot?

<!--
BEGIN QUESTION
name: q11
-->

In [2]:
count_is_increasing = ...
count_is_increasing

# Aggregate data to the right level - What should we group by? And what is the right aggregation metric?

# Plot the data

In [None]:
grader.check("q11")

### Question 2: What about the median size of a fire? Is it also increasing?
*set ```median_is_increasing``` to True if the median fire size is increasing*

<!--
BEGIN QUESTION
name: q12
-->

In [4]:
median_is_increasing = ...
median_is_increasing



In [None]:
grader.check("q12")

### Question 3: Is the largest fire per year increasing?
*set ```max_is_increasing``` to True if the largest fire per year is increasing*

<!--
BEGIN QUESTION
name: q13
-->

In [6]:
max_is_increasing = ...
max_is_increasing




In [None]:
grader.check("q13")

### Question 4: Is there an association between latitude and fire size?
*set ```latitude_size_association``` to True if there is an association between latitude and fire size*

<!--
BEGIN QUESTION
name: q14
-->

In [8]:
latitude_size_association = ...
latitude_size_association 

# Do we need to aggregate (groupby) data to answer this question? Why or why not?

# Plot the data

In [None]:
grader.check("q14")

### Question 5: What is the distribution of fire sizes?

In [10]:
# What plot is used to identify the distribution of a numerical variable?

In [11]:
# Why do you see almost no bars on the right side? What does that indicate?
# Now plot a histogram of only the small fires (acres < 100). What differences do you notice?

### Question 6: Plot the number of fires due to each cause.

In [12]:
# What is the right plot? Does sorting the data before plotting help? When is `bar` vs `barh` useful?

### Question 7a: In what times of the year (month) are fires most common?

In [13]:
# What plot is appropriate? Even though the 'month' variable is involved in the graph, would a bar chart be more useful?

### Question 7b: In what times of the year are *large* fires most common in *Southern California*?

By large, say over 5,000 acres. By SoCal, we mean latitude < 37.

In [14]:
...

### Question 8: On the same plot, show natural vs. human-caused fires over time.

- That is, have one line for the number of fires caused by lightning over time.
- Another line for all other causes over time.

In [15]:

# Get the count of naturally caused fires across years

# Get the count of human-caused fires across years

# What happens if you plot two plots in the same cell? Is that what you expected?

# Functions and Apply

Now let's take a look at writing our own functions and then applying these functions to DataFrames.
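
As a rough sketch of the pattern, here is a function applied to one column of a made-up table. The table, the `acres_to_hectares` helper, and the `hectares` column are all hypothetical, and `pandas` stands in for `babypandas` on the assumption that `.get`, `.apply`, and `.assign` behave the same way in both.

```python
import pandas as pd  # stand-in for babypandas in this sketch

def acres_to_hectares(acres):
    """Convert an area in acres to hectares (1 acre is about 0.4047 hectares)."""
    return acres * 0.4047

# Hypothetical table, not the calfire data
toy = pd.DataFrame().assign(
    name=['A', 'B', 'C'],
    acres=[10.0, 250.0, 5000.0],
)

# .apply calls the function once per value and returns a new Series
hectares = toy.get('acres').apply(acres_to_hectares)

# .assign returns a new DataFrame; reassign it to keep the new column
toy = toy.assign(hectares=hectares)
print(toy)
```

Note that the function is passed by name, without parentheses: `.apply(acres_to_hectares)`, not `.apply(acres_to_hectares())`.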

### Question 9: Cause Codes

- Currently, the causes are written as `<code> - <name>`, for example `7 - Arson`.
- Write a function that takes in the cause and returns only the name.
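
If string methods are unfamiliar, the following sketch shows how `str.split` breaks a string around a separator. The label and its `": "` separator are made up for illustration and deliberately differ from the cause format, so adapting the idea is still up to you.

```python
# Hypothetical label in the form "<id>: <description>"
label = "A12: Sample label"

# split returns a list of the pieces on either side of the separator
parts = label.split(": ")
print(parts)        # ['A12', 'Sample label']

# Index into the list to pick out the piece you want
description = parts[1]
print(description)  # Sample label
```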

In [17]:
causes_old = ...
causes_old

In [18]:
def convert_cause(cause):
    ...

In [19]:
# print out causes before and after (Don't worry if you don't know for loops - they will be covered in a week)
for cause in causes_old:
    print(f"{cause} --> {convert_cause(cause)}")

### Question 10: Replace the `cause` column with one containing only the names

In [20]:
# Create a new Series using the .apply() method
# Assign this new Series to the DataFrame

### Question 11: Convert month numbers to names.

- Write a function that converts a month's number (1, 2, 3, ..., 12) to the word (January, ..., December).

In [21]:
import calendar
calendar.month_name[1]

In [22]:
def month_number_to_name(month_number):
    import calendar
    month_name = calendar.month_name[month_number]
    return month_name

In [23]:
months = ...
months

In [24]:
# print out months before and after 
for month in months:
    print(f"{month} --> {month_number_to_name(month)}")

In [25]:
# Create a new Series using the .apply() method
# Assign this new Series to the DataFrame

### Question 12: Closest fire

- Write a function that accepts a latitude/longitude pair and returns the name of the closest fire.


- We will use one specific approach to answer this question. Here are some hints:
    - We will use the function [`haversine_distances`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html) from the `sklearn` module, which takes two `array_like` collections of points and returns the distance between every pair of points across the two collections.
    - You may need to install the module by running `!pip install scikit-learn` in a Jupyter cell. The package is named `scikit-learn` on PyPI even though it is imported as `sklearn`.
    - Note that the [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) gives the great-circle distance between two points on a sphere from their latitudes and longitudes. In this case, the function from `sklearn` will take care of the calculation. If you are interested, you can try writing that formula as a function yourself.
    - According to its documentation, `haversine_distances` expects each point as `[latitude, longitude]` in radians. If your coordinates are in degrees, convert them first (for example with `np.radians`).
    - To the `haversine_distances` function, provide the target lat/long pair as the first input (wrapped in a list, since the function expects a collection of points), and the numpy array of all `calfire` lat/longs as the second input.
    - Check how `argmin()` from `numpy` can be used to find the index of the minimum value.
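
For those curious about writing the formula by hand, here is a minimal sketch using only the standard library. It is not the `sklearn` approach the exercise asks for: the point names and coordinates are made up, the Earth radius is an assumed mean value, and the built-in `min` plays the role that `np.argmin` would play on a numpy array.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    # The formula works in radians, so convert from degrees first.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical points (not from calfire): (name, (latitude, longitude)) in degrees
points = [
    ("north", (34.0, -118.0)),
    ("near", (32.9, -117.2)),
    ("far", (40.0, -122.0)),
]
target = (32.881158, -117.237566)

# Distance from the target to every point
distances = [haversine_km(target[0], target[1], lat, lon) for _, (lat, lon) in points]

# Index of the smallest distance (np.argmin(distances) would give the same index)
closest_index = min(range(len(distances)), key=distances.__getitem__)
closest_name = points[closest_index][0]
print(closest_name)
```

The overall shape (compute all distances, find the index of the minimum, look up the name at that index) is the same one the skeleton below follows.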

In [26]:
# !pip install scikit-learn

In [27]:
from sklearn.metrics.pairwise import haversine_distances
def get_closest_fire(latitude, longitude):
    
    # target point - as a list containing one [latitude, longitude] pair
    target = ...
    
    # get fire coords - as a 2D numpy array
    fire_coords = ...
    
    # get distances
    distances = ...
    
    # get index of closest fire
    closest_fire_index = ...
    
    # get the name of the fire
    closest_fire = ...

    return closest_fire

In [28]:
# Let's find the fire incident closest to Geisel Library
geisel = [32.881158, -117.237566]
closest_fire = get_closest_fire(geisel[0], geisel[1])
closest_fire

In [29]:
grader.check_all()