<a href="https://colab.research.google.com/github/baddfish/Data-Wrangling-Exercise-1-Basic-Data-Manipulation/blob/master/Portugal_2007_EDA_Statistics_of_Forest_Fires.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Portugal 2007 Forest Fire Statistics

![Forest fire](https://www.publicdomainpictures.net/pictures/220000/velka/forest-fire.jpg)

For this mini project, I'll look at a research data set of forest fires in Portugal during 2007.  [More here](https://archive.ics.uci.edu/ml/datasets/Forest+Fires)

Let's take a look at the attributes:

1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
3. month - month of the year: 'jan' to 'dec'
4. day - day of the week: 'mon' to 'sun'
5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
6. DMC - DMC index from the FWI system: 1.1 to 291.3
7. DC - DC index from the FWI system: 7.9 to 860.6
8. ISI - ISI index from the FWI system: 0.0 to 56.10
9. temp - temperature in Celsius degrees: 2.2 to 33.30
10. RH - relative humidity in %: 15.0 to 100
11. wind - wind speed in km/h: 0.40 to 9.40
12. rain - outside rain in mm/m2 : 0.0 to 6.4
13. area - the burned area of the forest (in ha): 0.00 to 1090.84 

Most of these features are numeric, so we can compute their mean, median, and mode and plot histograms.

For the discrete features (`month`, `day`, and the grid coordinates `X` and `Y`, which are already integers), a histogram is simply a count of each distinct value.
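To make the idea of a histogram for a discrete feature concrete, here is a minimal sketch that counts each distinct value with `collections.Counter`. The `days` list and the `text_histogram` helper are illustrative assumptions, not part of the dataset or the notebook.

```python
from collections import Counter

# Illustrative values only: a few weekday labels in the dataset's 'mon'..'sun' format.
days = ["fri", "tue", "sat", "fri", "sun"]

def text_histogram(values):
    """Return one 'label | ### (n)' line per distinct value, most frequent first."""
    counts = Counter(values)
    return [f"{label:>4} | {'#' * n} ({n})" for label, n in counts.most_common()]

for line in text_histogram(days):
    print(line)
```

The same counts could feed a bar chart; the text version just keeps the sketch free of plotting dependencies.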

## 1. Load the data and take a peek

The data is accessible as a CSV at the URL: https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv

Let's load this into a dataframe, so we can then look at the variables and perform descriptive statistics.

After loading, we verify it's working by printing the first five rows of data.

In [0]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv')



In [0]:
df.shape


(517, 13)

In [0]:
df.head(5)

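| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |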


In [0]:
df.describe()

| | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 |
| mean | 4.669246 | 4.299807 | 90.644681 | 110.87234 | 547.940039 | 9.021663 | 18.889168 | 44.288201 | 4.017602 | 0.021663 | 12.847292 |
| std | 2.313778 | 1.2299 | 5.520111 | 64.046482 | 248.066192 | 4.559477 | 5.806625 | 16.317469 | 1.791653 | 0.295959 | 63.655818 |
| min | 1.0 | 2.0 | 18.7 | 1.1 | 7.9 | 0.0 | 2.2 | 15.0 | 0.4 | 0.0 | 0.0 |
| 25% | 3.0 | 4.0 | 90.2 | 68.6 | 437.7 | 6.5 | 15.5 | 33.0 | 2.7 | 0.0 | 0.0 |
| 50% | 4.0 | 4.0 | 91.6 | 108.3 | 664.2 | 8.4 | 19.3 | 42.0 | 4.0 | 0.0 | 0.52 |
| 75% | 7.0 | 5.0 | 92.9 | 142.4 | 713.9 | 10.8 | 22.8 | 53.0 | 4.9 | 0.0 | 6.57 |
| max | 9.0 | 9.0 | 96.2 | 291.3 | 860.6 | 56.1 | 33.3 | 100.0 | 9.4 | 6.4 | 1090.84 |


## 2. Explore and summarize the data

Now that we've got the data, let's take a deeper look. We'll pick three variables, two continuous and one discrete, and calculate the mean and median of each.

Questions to consider:
- Is the median larger or smaller than the mean?
- What does that tell you about how the variable is distributed?
- (For the discrete variable only) What is the mode?
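The first two questions can be turned into a small rule-of-thumb check. The sketch below compares the mean and median and reports which way the data leans. The `skew_hint` helper and its `tolerance` threshold are hypothetical, and the sample lists are made-up numbers rather than values from the dataset. Comparing mean and median is only a heuristic, so a histogram is still worth a look.

```python
import statistics

def skew_hint(values, tolerance=0.05):
    """Compare mean and median and return a rough description of the skew.

    `tolerance` is a hypothetical threshold: the mean-median gap is measured
    relative to the standard deviation, and smaller gaps count as symmetric.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0 or abs(mean - median) / spread < tolerance:
        return "roughly symmetric"
    if mean > median:
        return "right-skewed (mean > median)"
    return "left-skewed (mean < median)"

# Made-up numbers, not taken from the dataset.
print(skew_hint([1, 2, 3, 4, 5]))
print(skew_hint([0, 0, 0, 1, 50]))
print(skew_hint([1, 48, 49, 50, 50]))
```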



**Continuous #1**

In [0]:
# temperature
temp_mean = sum(df.temp) / len(df.temp)
print(temp_mean)

temp_median = sorted(df.temp)[len(df.temp) // 2]
print(temp_median)

# Is the median larger or smaller than the mean?
# What does that tell you about how the variable is distributed?

# The median (19.3) is larger than the mean (18.89), which suggests the data
# is skewed to the left.


18.88916827852998
19.3


**Continuous #2**

In [0]:
# wind speed
wind_mean = sum(df.wind) / len(df.wind)
print(wind_mean)

wind_median = sorted(df.wind)[len(df.wind) // 2]
print(wind_median)

# Is the median larger or smaller than the mean?
# What does that tell you about how the variable is distributed?

# Because the mean (4.02) and median (4.0) are approximately the same, this
# suggests that the distribution is roughly symmetrical.

4.017601547388782
4.0
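One caveat about the `sorted(values)[len(values) // 2]` shortcut used in these cells: it picks the true middle value only when the number of observations is odd, as it is here (517 rows). A minimal sketch of a median that also handles even-sized samples, written as a standalone function whose name is my own choice, could look like this:

```python
def median(values):
    """Median that handles both odd and even sample sizes."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median() of an empty sequence")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the two middle values

print(median([3, 1, 2]))     # odd length
print(median([4, 1, 3, 2]))  # even length
```

The standard library's `statistics.median` follows the same convention, so it is a convenient cross-check.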


**Discrete**

In [0]:
# area
# Note: area is measured in hectares, so strictly it is continuous; it is
# treated as discrete here because so many values are exactly 0.0.
from collections import Counter

area_mean = sum(df.area) / len(df.area)
print(area_mean)

area_median = sorted(df.area)[len(df.area) // 2]
print(area_median)

data = Counter(df.area)
data.most_common()   # Returns all unique items and their counts as (value, count) pairs
data.most_common(1)  # Returns the most frequent item and its count


# Is the median larger or smaller than the mean?
# What does that tell you about how the variable is distributed?

# The median (0.52) is much smaller than the mean (12.85), so the data is
# strongly skewed to the right: most fires burned little or nothing, and a
# few very large fires pull the mean up.

# (For the discrete variable only) What is the mode?
# The mode is 0.0, which occurs 247 times.


12.847292069632491
0.52


[(0.0, 247)]
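When reading this output, keep in mind that `Counter.most_common` returns `(value, count)` pairs, so in `[(0.0, 247)]` the first number is the mode and the second is how often it occurs. A minimal sketch with made-up areas (not the real column) shows how to unpack the pair:

```python
from collections import Counter

# Made-up burned areas in hectares; 0.0 repeats, as it does in the real column.
areas = [0.0, 0.0, 0.52, 6.57, 0.0, 12.8]

# most_common(1) is a one-element list holding a (value, count) tuple.
value, count = Counter(areas).most_common(1)[0]
print(f"mode = {value} (appears {count} times)")
```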

## 3. Simulate data

Let's simulate more data:

1. Generate a *new* variable based on taking values at random from the original one.
2. Calculate the mean, median, and mode of the new variable.
3. Compare results
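The three steps above can be sketched with the standard library alone. This is one possible approach, assuming sampling with replacement is what "taking values at random" means here. The `resample` helper, its `seed` parameter, and the short `original` list are illustrative assumptions. In the notebook, the input would be a column such as `df.temp`.

```python
import random
import statistics

def resample(values, size, seed=None):
    """Draw `size` values from `values` at random, with replacement."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.choices(list(values), k=size)

# Illustrative temperatures only; in the notebook this would be df.temp.
original = [8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 19.3, 21.5]
simulated = resample(original, size=10 * len(original), seed=0)

# Compare mean, median, and mode(s) of the original and simulated variables.
for name, data in [("original", original), ("simulated", simulated)]:
    print(
        name,
        round(statistics.mean(data), 2),
        statistics.median(data),
        statistics.multimode(data),
    )
```

Because every simulated value comes from the original list, the summary statistics should stay close to the originals, and the gap tends to shrink as the sample grows.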



In [0]:
import random
from collections import Counter

# Build a new variable by drawing 5170 temperatures at random, with
# replacement, from the original 517 observations
temps = df.temp.tolist()
new_temp = [random.choice(temps) for _ in range(5170)]

# New temp mean: should land close to the original 18.89
new_temp_mean = sum(new_temp) / len(new_temp)
print(new_temp_mean)

# New temp median: 5170 is even, so average the two middle values
ordered = sorted(new_temp)
mid = len(ordered) // 2
new_temp_median = (ordered[mid - 1] + ordered[mid]) / 2
print(new_temp_median)

# New temp mode: the most frequent value and its count
print(Counter(new_temp).most_common(1))