# pandas

In [1]:
import pandas as pd

We're going to work again with the forest fire dataset from Day One.

#### If you are using Google Colab, you must run the next line of code. *If you are NOT using Google Colab, do NOT run the next line.*

In [None]:
!wget https://raw.githubusercontent.com/aGitHasNoName/pandasBasics/main/forestfires.csv

<br><br>Everyone can now run the next line of code to create a DataFrame from our csv file.

In [2]:
df = pd.read_csv("forestfires.csv")

In [3]:
df.head()

,X,Y,month,day,fuel_code,moisture_code,drought_code,initial_spread_code,temp,humidity,wind,rain,area_burned
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


## <br><br><br>Data aggregation

Data aggregation means taking many data points and reducing them to a single number, whether it's a count, sum, mean, or another statistic. Here are some commonly used pandas methods:

- [`.count()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html)
- [`.sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)
- [`.mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)
- [`.median()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)
- [`.min()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html)
- [`.max()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)
- [`.unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)
- [`.nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html)
- [`.std()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)   # Standard deviation
- [`.var()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)   # Variance
- And more! https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats
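
As a quick check on how two of these methods relate, `.std()` returns the standard deviation, which is the square root of the value `.var()` returns. The sketch below uses a handful of made-up temperatures (not read from `forestfires.csv`) and assumes the pandas defaults, which compute the sample versions of both statistics.

```python
import math

import pandas as pd

# Made-up temperatures for illustration; not taken from forestfires.csv
temps = pd.Series([8.2, 18.0, 14.6, 8.3, 11.4])

variance = temps.var()  # sample variance (divides by n - 1 by default)
std_dev = temps.std()   # sample standard deviation

# The standard deviation is the square root of the variance
print(math.isclose(std_dev, math.sqrt(variance)))
```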

If you call a method on the entire DataFrame, pandas applies it to every column it can.

In [4]:
df.count()

X                      517
Y                      517
month                  517
day                    517
fuel_code              517
moisture_code          517
drought_code           517
initial_spread_code    517
temp                   517
humidity               517
wind                   517
rain                   517
area_burned            517
dtype: int64

In [5]:
df.min()

X                         1
Y                         2
month                   apr
day                     fri
fuel_code              18.7
moisture_code           1.1
drought_code            7.9
initial_spread_code     0.0
temp                    2.2
humidity                 15
wind                    0.4
rain                    0.0
area_burned             0.0
dtype: object

In [6]:
df.sum()

X                                                                   2414
Y                                                                   2223
month                  maroctoctmarmaraugaugaugsepsepsepsepaugsepseps...
day                    frituesatfrisunsunmonmontuesatsatsatfrimonwedf...
fuel_code                                                        46863.3
moisture_code                                                    57321.0
drought_code                                                    283285.0
initial_spread_code                                               4664.2
temp                                                              9765.7
humidity                                                           22897
wind                                                              2077.1
rain                                                                11.2
area_burned                                                      6642.05
dtype: object
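
Adding up the text columns is rarely what you want. One way to avoid it, sketched here on a small made-up DataFrame (only the column names mirror the fire dataset), is the `numeric_only=True` argument, which asks pandas to skip non-numeric columns.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the fire dataset
toy = pd.DataFrame({
    "day": ["fri", "tue", "sat"],
    "temp": [8.2, 18.0, 14.6],
    "rain": [0.0, 0.0, 0.2],
})

# numeric_only=True skips the text column "day"
numeric_totals = toy.sum(numeric_only=True)
numeric_means = toy.mean(numeric_only=True)

print(numeric_totals)
print(numeric_means)
```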

In [7]:
df.unique()

AttributeError: 'DataFrame' object has no attribute 'unique'

<br>Not all methods work on an entire DataFrame. `.unique()`, for example, is defined only for a single column (a Series). Most of the time you are interested in only a subset of the data anyway:

In [8]:
df["day"].unique()

array(['fri', 'tue', 'sat', 'sun', 'mon', 'wed', 'thu'], dtype=object)

In [9]:
list(df["day"].unique())

['fri', 'tue', 'sat', 'sun', 'mon', 'wed', 'thu']

In [10]:
df["month"].nunique()

12

In [11]:
df["temp"].var()

33.7168979503096

### <br><br>Exercise 1

Write code to find the mean humidity of the dataset:

In [12]:
df["humidity"].mean()

44.28820116054158

Write code to find the coldest temperature in the dataset:

In [13]:
df["temp"].min()

2.2

In [15]:
df.temp

0       8.2
1      18.0
2      14.6
3       8.3
4      11.4
       ... 
512    27.8
513    21.9
514    21.2
515    25.6
516    11.8
Name: temp, Length: 517, dtype: float64

## <br><br>groupby

Often, you will want to calculate the statistics for a particular subgroup of a data column.

For example, let's say we want to ask if more fires happen on certain days of the week. This code will tell you the count for every column in the DataFrame except the column that you are using to group your data (i.e. "day").

In [16]:
df.groupby("day").count()

day,X,Y,month,fuel_code,moisture_code,drought_code,initial_spread_code,temp,humidity,wind,rain,area_burned
fri,85,85,85,85,85,85,85,85,85,85,85,85
mon,74,74,74,74,74,74,74,74,74,74,74,74
sat,84,84,84,84,84,84,84,84,84,84,84,84
sun,95,95,95,95,95,95,95,95,95,95,95,95
thu,61,61,61,61,61,61,61,61,61,61,61,61
tue,64,64,64,64,64,64,64,64,64,64,64,64
wed,54,54,54,54,54,54,54,54,54,54,54,54


<br>We can make this easier to read by sorting the rows from lowest count to highest count. We can chain several methods one after another, as long as they go in the correct logical order. Every column holds the same counts, so we can pick any one of them to sort on.

In [20]:
df.groupby("day").count().sort_values("X")

day,X,Y,month,fuel_code,moisture_code,drought_code,initial_spread_code,temp,humidity,wind,rain,area_burned
wed,54,54,54,54,54,54,54,54,54,54,54,54
thu,61,61,61,61,61,61,61,61,61,61,61,61
tue,64,64,64,64,64,64,64,64,64,64,64,64
mon,74,74,74,74,74,74,74,74,74,74,74,74
sat,84,84,84,84,84,84,84,84,84,84,84,84
fri,85,85,85,85,85,85,85,85,85,85,85,85
sun,95,95,95,95,95,95,95,95,95,95,95,95


<br>It looks like more fires are recorded on Sunday, Friday, and Saturday than on the other days of the week.
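
When all you need is the number of rows per group, repeating the same count in every column is more output than necessary. A possible alternative, shown on made-up data, is `.size()`, which returns one number per group. Note that `.size()` counts every row, while `.count()` counts only non-missing values in each column.

```python
import pandas as pd

# Made-up rows; each row stands for one recorded fire
toy = pd.DataFrame({
    "day": ["fri", "sat", "sat", "sun", "sun", "sun"],
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4, 20.1],
})

# One count per group, sorted from lowest to highest
fires_per_day = toy.groupby("day").size().sort_values()
print(fires_per_day)
```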

### <br><br>Exercise 2

Write code to find the mean for all the columns when grouped by month.

In [21]:
df.groupby("month").mean(numeric_only=True)  # skip the text column "day"; newer pandas versions raise an error without this

month,X,Y,fuel_code,moisture_code,drought_code,initial_spread_code,temp,humidity,wind,rain,area_burned
apr,5.777778,4.222222,85.788889,15.911111,48.555556,5.377778,12.044444,46.888889,4.666667,0.0,8.891111
aug,4.483696,4.282609,92.336957,153.732609,641.077717,11.072283,21.631522,45.48913,4.086413,0.058696,12.489076
dec,4.555556,5.0,84.966667,26.122222,351.244444,3.466667,4.522222,38.444444,7.644444,0.0,13.33
feb,5.15,4.4,82.905,9.475,54.67,3.35,9.635,55.7,3.755,0.0,6.275
jan,3.0,4.5,50.4,2.4,90.35,1.45,5.25,89.0,2.0,0.0,0.0
jul,5.21875,4.59375,91.328125,110.3875,450.603125,9.39375,22.109375,45.125,3.734375,0.00625,14.369687
jun,6.294118,4.823529,89.429412,93.382353,297.705882,11.776471,20.494118,45.117647,4.135294,0.0,5.841176
mar,4.722222,4.481481,89.444444,34.542593,75.942593,7.107407,13.083333,40.0,4.968519,0.003704,4.356667
may,5.0,4.0,87.35,26.7,93.75,4.6,14.65,67.0,4.45,0.0,19.24
nov,6.0,3.0,79.5,3.0,106.7,1.1,11.8,31.0,4.5,0.0,0.0


<br><br><br><br>If you only want to see an aggregate statistic for one column in the DataFrame, you can add on the indexing technique we learned yesterday. This code asks: what is the mean area burned on each day of the week?

In [22]:
df.groupby("day")[["area_burned"]].mean()

day,area_burned
fri,5.261647
mon,9.547703
sat,25.534048
sun,10.104526
thu,16.345902
tue,12.621719
wed,10.714815
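
The double brackets in `[["area_burned"]]` select a one-column DataFrame, so the result is displayed as a table. With single brackets the same calculation returns a Series instead. The sketch below compares the two on made-up values.

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    "day": ["fri", "fri", "sat", "sat"],
    "area_burned": [0.0, 4.0, 10.0, 30.0],
})

as_frame = toy.groupby("day")[["area_burned"]].mean()  # double brackets: DataFrame
as_series = toy.groupby("day")["area_burned"].mean()   # single brackets: Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
```

Both hold the same numbers; which form is more convenient depends on what you plan to do with the result next.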


<br>Again, we can also sort the data:

In [23]:
df.groupby("day")[["area_burned"]].mean().sort_values("area_burned")

day,area_burned
fri,5.261647
mon,9.547703
sun,10.104526
wed,10.714815
tue,12.621719
thu,16.345902
sat,25.534048


<br>So, on average, Saturday fires burn the largest area.

<br>We can also add other methods to the chain, like `.round()`:

In [26]:
df.groupby("day")[["area_burned"]].mean().round(2).sort_values("area_burned")

day,area_burned
fri,5.26
mon,9.55
sun,10.1
wed,10.71
tue,12.62
thu,16.35
sat,25.53
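
Long chains like this one can get hard to read on a single line. One common style, sketched here with made-up data, wraps the whole expression in parentheses so each method sits on its own line. The example also passes `ascending=False` to `.sort_values()` to put the largest mean first.

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    "day": ["fri", "fri", "sat", "sat", "sun"],
    "area_burned": [0.0, 4.5, 10.25, 30.0, 7.333],
})

# Parentheses let the chain span several lines, one step per line
summary = (
    toy.groupby("day")[["area_burned"]]
    .mean()
    .round(2)
    .sort_values("area_burned", ascending=False)
)
print(summary)
```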


### <br><br>Exercise 3

Write code to count how many fires happened in each month (as a bonus, sort the results):

In [27]:
df.groupby("month").count().sort_values("X")

month,X,Y,day,fuel_code,moisture_code,drought_code,initial_spread_code,temp,humidity,wind,rain,area_burned
nov,1,1,1,1,1,1,1,1,1,1,1,1
jan,2,2,2,2,2,2,2,2,2,2,2,2
may,2,2,2,2,2,2,2,2,2,2,2,2
apr,9,9,9,9,9,9,9,9,9,9,9,9
dec,9,9,9,9,9,9,9,9,9,9,9,9
oct,15,15,15,15,15,15,15,15,15,15,15,15
jun,17,17,17,17,17,17,17,17,17,17,17,17
feb,20,20,20,20,20,20,20,20,20,20,20,20
jul,32,32,32,32,32,32,32,32,32,32,32,32
mar,54,54,54,54,54,54,54,54,54,54,54,54


Write code to see the mean area burned for fires in each month (as a bonus, round and sort the results):

In [28]:
df.groupby("month")[["area_burned"]].mean().round(2).sort_values("area_burned")

month,area_burned
jan,0.0
nov,0.0
mar,4.36
jun,5.84
feb,6.28
oct,6.64
apr,8.89
aug,12.49
dec,13.33
jul,14.37
