## Mean and median
Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.
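As a small illustration of how the mean and median can differ, the sketch below uses the first five **weekly_sales** values from the **sales** data in this exercise. The variable names `weekly_sales`, `mean_sales`, and `median_sales` are chosen here for illustration only.

```python
import pandas as pd

# First five weekly_sales values from the sales data used in this exercise
weekly_sales = pd.Series([24924.50, 21827.90, 57258.43, 17413.94, 17558.09])

# The mean is pulled upward by the single large week (57258.43)
mean_sales = weekly_sales.mean()

# The median is the middle value once the data is sorted
median_sales = weekly_sales.median()

print(mean_sales)
print(median_sales)
```

On this sample the mean is noticeably higher than the median, which is a hint that one unusually large week is skewing the average.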

**sales** is available and **pandas** is loaded as **pd**.

### Instructions

- Explore your new DataFrame first by printing the first few rows of the **sales** DataFrame.
- Print information about the columns in **sales**.
- Print the mean of the **weekly_sales** column.
- Print the median of the **weekly_sales** column.

In [3]:
import pandas as pd

sales = pd.read_csv('../sales_subset.csv', index_col=0)

print(sales.head())

print(sales.info())

print(sales["weekly_sales"].mean())

print(sales.weekly_sales.median())  # dot access works when the column name is a valid Python identifier

   store type  department        date  weekly_sales  is_holiday  \
0      1    A           1  2010-02-05      24924.50       False   
1      1    A           1  2010-03-05      21827.90       False   
2      1    A           1  2010-04-02      57258.43       False   
3      1    A           1  2010-05-07      17413.94       False   
4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10774 entries, 0 to 10773
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   store                 10774 non-null  int64  
 1   t

## Summarizing dates
Summary statistics can also be calculated on date columns that have values with the data type **datetime64**. Some summary statistics, like the mean, don't make much sense on dates, but others are very helpful: the minimum and maximum, for example, show what time range your data covers.
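If a date column was read as plain strings (which `pd.read_csv` does unless it is asked to parse dates), it can be converted to **datetime64** first. The sketch below is one way to do that, using three dates that appear in the **sales** data. The names `dates`, `earliest`, `latest`, and `span` are illustrative.

```python
import pandas as pd

# Three dates taken from the sales data, stored as plain strings
dates = pd.Series(["2010-02-05", "2010-03-05", "2012-10-26"])

# Convert the strings to the datetime64 data type
dates = pd.to_datetime(dates)

earliest = dates.min()
latest = dates.max()

# Subtracting two timestamps gives a Timedelta, so the covered range is explicit
span = latest - earliest

print(earliest, latest, span.days)
```

With real datetime values, the difference between the maximum and minimum tells you directly how many days the data covers.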

**sales** is available and **pandas** is loaded as **pd**.

### Instructions

- Print the maximum of the **date** column.
- Print the minimum of the **date** column.

In [5]:
print(sales.date.max())

print(sales.date.min())

2012-10-26
2010-02-05


## Efficient summaries
While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The **.agg()** method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

***df['column'].agg(function)***

In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.
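To make the **.agg()** pattern concrete, here is a minimal sketch on a small hypothetical DataFrame `df` built from the first five rows of two **sales** columns. It passes the string `"median"` rather than **np.median**, which is an alternative spelling that pandas also accepts, so it is not the exact call the exercise asks for.

```python
import pandas as pd

# Custom IQR function: 75th percentile minus 25th percentile
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Small illustrative DataFrame (first five rows of two sales columns)
df = pd.DataFrame({
    "temperature_c": [5.727778, 8.055556, 16.816667, 22.527778, 27.050000],
    "unemployment": [8.106, 8.106, 7.808, 7.808, 7.808],
})

# One custom function applied to one column
temp_iqr = df["temperature_c"].agg(iqr)

# Several functions applied to several columns at once;
# the result has one row per function and one column per input column
summary = df[["temperature_c", "unemployment"]].agg([iqr, "median"])

print(temp_iqr)
print(summary)
```

Passing a list of functions returns a DataFrame whose row labels are the function names, which is the same layout as the output printed in this exercise.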

**sales** is available and **pandas** is loaded as **pd**.

### Instructions

- Use the custom **iqr** function defined for you along with **.agg()** to print the IQR of the **temperature_c** column of **sales**.
- Update the column selection to use the custom **iqr** function with **.agg()** to print the IQR of **temperature_c**, **fuel_price_usd_per_l**, and **unemployment**, in that order.
- Update the aggregation functions called by **.agg()**: include **iqr** and **np.median** in that order.

In [10]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

import numpy as np

print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


## Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.
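As a quick sketch of what the two cumulative methods return, the example below builds a small hypothetical DataFrame `weekly` from four rows of store 1, department 1, and adds a running total and a running maximum.

```python
import pandas as pd

# Four weeks of sales for store 1, department 1 (illustrative subset)
weekly = pd.DataFrame({
    "date": ["2010-02-05", "2010-03-05", "2010-04-02", "2010-05-07"],
    "weekly_sales": [24924.50, 21827.90, 57258.43, 17413.94],
})

# Cumulative statistics depend on row order, so sort by date first
weekly = weekly.sort_values("date")

# Running total: each row holds the sum of all sales up to that week
weekly["cum_weekly_sales"] = weekly["weekly_sales"].cumsum()

# Running maximum: each row holds the largest weekly sales seen so far
weekly["cum_max_sales"] = weekly["weekly_sales"].cummax()

print(weekly)
```

The running maximum only changes in the third week, when sales of 57258.43 exceed every earlier week, and then stays at that value.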

A DataFrame called **sales_1_1** has been created for you, which contains the sales data for department 1 of store 1. **pandas** is loaded as **pd**.

### Instructions

- Sort the rows of **sales_1_1** by the date column in ascending order.
- Get the cumulative sum of **weekly_sales** and add it as a new column of **sales_1_1** called **cum_weekly_sales**.
- Get the cumulative maximum of **weekly_sales**, and add it as a column called **cum_max_sales**.
- Print the **date**, **weekly_sales**, **cum_weekly_sales**, and **cum_max_sales** columns.

In [33]:
# .copy() makes sales_1_1 an independent DataFrame, so adding columns
# below does not trigger pandas' SettingWithCopyWarning
sales_1_1 = sales[(sales["store"] == 1) & (sales["department"] == 1)].copy()

# sort_values returns a new DataFrame, so assign the result back
sales_1_1 = sales_1_1.sort_values("date")

sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

          date  weekly_sales  cum_weekly_sales  cum_max_sales
0   2010-02-05      24924.50          24924.50       24924.50
1   2010-03-05      21827.90          46752.40       24924.50
2   2010-04-02      57258.43         104010.83       57258.43
3   2010-05-07      17413.94         121424.77       57258.43
4   2010-06-04      17558.09         138982.86       57258.43
5   2010-07-02      16333.14         155316.00       57258.43
6   2010-08-06      17508.41         172824.41       57258.43
7   2010-09-03      16241.78         189066.19       57258.43
8   2010-10-01      20094.19         209160.38       57258.43
9   2010-11-05      34238.88         243399.26       57258.43
10  2010-12-03      22517.56         265916.82       57258.43
11  2011-01-07      15984.24         281901.06       57258.43
