# Lambda School Data Science - Forest Fire Statistics

![Forest fire](https://www.publicdomainpictures.net/pictures/220000/velka/forest-fire.jpg)

Forest fires are a sadly timely topic, but data can help us better understand and perhaps manage them in future. In this assignment you'll look at a data set of forest fires in Portugal during 2007 - this is a real research dataset, and you can [read more about it](https://archive.ics.uci.edu/ml/datasets/Forest+Fires) though you do not need to for this assignment.

For our purposes, the main thing that you need to understand are the attributes in the data.

1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
3. month - month of the year: 'jan' to 'dec'
4. day - day of the week: 'mon' to 'sun'
5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
6. DMC - DMC index from the FWI system: 1.1 to 291.3
7. DC - DC index from the FWI system: 7.9 to 860.6
8. ISI - ISI index from the FWI system: 0.0 to 56.10
9. temp - temperature in Celsius degrees: 2.2 to 33.30
10. RH - relative humidity in %: 15.0 to 100
11. wind - wind speed in km/h: 0.40 to 9.40
12. rain - outside rain in mm/m2 : 0.0 to 6.4
13. area - the burned area of the forest (in ha): 0.00 to 1090.84 

Most of these features are numeric - this means we can do things like look at their mean, median, mode, and plot histograms. They have technical sounding names, but generally refer to meteorological data (i.e. the weather).

For the discrete features we can still draw histograms (as in the lecture notebook). X and Y are already integer values - month and day do have natural order, but if you want to use them you may want to translate them from strings to numbers (hint - you can build a dict that maps them, and loop over to apply it).

## Exercise 1 - Load the data and take a peek

The data is accessible as a CSV at the URL: https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv

You want to load this into a dataframe, so you can then look at the variables and perform descriptive statistics.

After you load it, verify that you've got it working by printing the first five rows of data.

In [1]:
# Your code here!
# Hint - look at the day 2 material for how to use pandas
import pandas as pd
pd.set_option('display.max_rows', None)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv')
df.head(30)

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0
5,8,6,aug,sun,92.3,85.3,488.0,14.7,22.2,29,5.4,0.0,0.0
6,8,6,aug,mon,92.3,88.9,495.6,8.5,24.1,27,3.1,0.0,0.0
7,8,6,aug,mon,91.5,145.4,608.2,10.7,8.0,86,2.2,0.0,0.0
8,8,6,sep,tue,91.0,129.5,692.6,7.0,13.1,63,5.4,0.0,0.0
9,7,5,sep,sat,92.5,88.0,698.6,7.1,22.8,40,4.0,0.0,0.0


## Exercise 2 - Explore and summarize the data

Now that you've got the data, take a deeper look at it - the description above gives the overall range (from minimum to maximum), but look at the other core statistics. You should pick three variables you want to look at - two continuous, and one discrete, and for each you should calculate the mean and median.

Don't use the magic built-in functions of pandas or other libraries - write your own functions to calculate mean and median (you can of course refer to the lecture notebooks for help). This is a good exercise both to practice coding and reinforce the statistical concepts.

For each of the three variables you look at, answer the following questions (as comments in your code):

- Is the median larger or smaller than the mean?
- What does that tell you about how the variable is distributed?
- (For the discrete variable only) What is the mode?

Hint - it may help to draw a histogram to look at the variable and really get a feel for it.

Another hint - part of this exercise is making sure you can distinguish between continuous and discrete variables, so take some time to think through your choice of variables.

In [19]:
# Your code here!
# Remember, you should write your own code to calculate mean, median, and mode
# If you want, you can doublecheck your answer with pandas methods afterwards
# And if you want to draw a histogram, you should use:
from collections import Counter as ct
from matplotlib.pyplot import hist
from functools import reduce
wind = df['wind']
RH = df['RH']
X = df['X']



def mean(numbers):
  total_num = reduce(lambda x, y:x+y, numbers)
  mean_num = total_num / len(numbers)
  return mean_num



def median(wind):
  wind_sorted = sorted(wind)
  length = len(wind_sorted)
  if length %2 == 0:
    mid_num = wind_sorted[int(length/2)] 
    median_num = (mid_num+wind_sorted[(length/2)-1])/2

  if length %2 != 0:
    median_num = wind_sorted[int(length//2)]
  return median_num

def mode(n_num):
  n = len(n_num) 
  
  data = ct(n_num) 
  get_mode = dict(data) 
  mode = [k for k, v in get_mode.items() if v == max(list(data.values()))] 
  
  if len(mode) == n: 
    get_mode = "No mode found"
  else: 
    get_mode =  ', '.join(map(str, mode)) 
       
  return get_mode


print('mean of the wind speed is :' , mean(wind))
print('median of the wind speed is :' , median(wind))
print('mean of the RH is :' , mean(RH))
print('median of the RH is :' , median(RH))
print('mean of the X is :' , mean(X))
print('median of the X is :' , median(X))
print('mode of the X is :' , mode(X))

mean of the wind speed is : 4.017601547388782
median of the wind speed is : 4.0
mean of the RH is : 44.28820116054158
median of the RH is : 42
mean of the X is : 4.669245647969052
median of the X is : 4
mode of the X is : 4


## Exercise 3 - Simulate more data!

There are many more things that could be done with this data, but for now we've not learned about hypothesis testing or inferential statistics. So, one fun thing to do is - make more data!

How do we do that? We can use the same `random` module demonstrated in lecture, and repeatedly sample our data. This is related to the Monte Carlo method used to demonstrate the central limit theorem. In this setting, such simulations could then be applied to Bayesian methods - another topic for another time.

For the same three variables that you looked at in exercise 2, you should do the following:

1. Generate a *new* variable based on taking values at random from the original one - make the new variable have at least 10 times as many observations as the original
2. Calculate the mean, median, and mode of the new variable (it's okay to use prewritten functions for this)
3. Compare your results to what you saw in exercise 2 - it should be very similar

Once you're done, look back at your code. Chances are you learned things as you wrote it, and you can revisit it to clean it up a bit. Maybe put pieces of code you use multiple times in a function, or add some explanatory comments so anyone reading (including "future you") has a clearer understanding of what you did.

In [26]:
# Your code here!
# You'll definitely want to import random
# And you may find random.choice particularly helpful
import random
import numpy as np 
random_wind = [0]
for i in range(100):
  random_num = random.choice(wind) 

  random_wind.append(random_num)
  print(random_num)
  ++i

print('mean of the random wind is :', mean(random_wind))
print('median of the random wind is :', median(random_wind))
print('mode of the random wind is :', mode(random_wind))

3.6
8.5
3.1
6.3
2.2
3.1
2.2
4.9
2.2
6.3
2.2
2.2
4.9
3.1
4.5
1.8
2.2
2.7
4.0
4.9
1.3
2.7
4.0
2.7
6.3
6.3
4.9
4.0
6.3
3.6
5.8
4.0
4.5
7.6
4.0
1.3
4.0
4.9
1.8
4.9
5.4
4.9
6.3
2.7
3.6
3.1
1.3
1.8
3.1
5.4
6.7
4.0
2.2
6.3
4.9
9.4
1.8
2.7
4.9
1.3
2.7
0.9
1.8
6.3
4.9
5.4
4.9
6.3
4.9
2.7
1.3
5.4
4.0
6.7
5.8
3.1
3.1
4.0
4.9
3.1
0.9
9.4
8.0
4.5
7.6
4.0
3.6
3.1
8.5
4.9
5.4
4.5
4.5
4.0
4.0
3.1
4.5
3.1
2.2
5.8
mean of the random wind is : 4.1524752475247535
median of the random wind is : 4.0
mode of the random wind is : 4.9
