# **Mean and median**

Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, **mean**, **median**, **minimum**, **maximum**, and **standard deviation** are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

The following code cell installs pandas, if you have not done so already, and imports it under the usual alias:

`import pandas as pd`




In [None]:
# Install (uncomment if needed) and import pandas
# !pip install pandas
import pandas as pd

Then read the data from the CSV file with pandas and create a `sales` DataFrame.

In [None]:
file = "https://github.com/elmoallistair/datacamp/blob/master/data-manipulation-with-pandas/datasets/sales_subset.csv?raw=True"
# read the CSV dataset from the URL into a DataFrame
sales = pd.read_csv(file)
# display the first five rows
print(sales.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
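
Before the exercises, here is a minimal sketch of why both the mean and the median are worth checking. The `weekly` Series is hypothetical toy data, not taken from `sales`; it only illustrates that one extreme value pulls the mean far more than the median.

```python
import pandas as pd

# Hypothetical toy data: four typical weeks and one extreme week
weekly = pd.Series([100.0, 120.0, 110.0, 130.0, 5000.0])

mean_value = weekly.mean()      # pulled upward by the extreme value
median_value = weekly.median()  # middle value, robust to the outlier

print("mean:", mean_value)
print("median:", median_value)
print("min:", weekly.min(), "max:", weekly.max())
```

When the two numbers differ this much, the data is probably skewed or contains outliers.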


**Instructions:**



*   Explore your new DataFrame first by printing the first few rows of the `sales` DataFrame.
*   Print information about the columns in `sales`.
*   Print the mean of the `weekly_sales` column.
*   Print the median of the `weekly_sales` column.


In [None]:
# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame
# (.info() prints its report itself, so no print() is needed)
sales.info()

# Print the mean of weekly_sales
mean_weekly_sales = sales['weekly_sales'].mean()
print(mean_weekly_sales)

# Print the median of weekly_sales
median_weekly_sales = sales['weekly_sales'].median()
print(median_weekly_sales)

# **Summarizing dates**

Summary statistics can also be calculated on date columns that have values with the data type `datetime64`. Some summary statistics, like the mean, don't make a ton of sense on dates, but others are super helpful: the minimum and maximum, for example, let you see what time range your data covers.
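
One caveat worth a short sketch: when a CSV is read without date parsing, a date column is typically stored as plain strings rather than `datetime64`. ISO-formatted strings such as `2010-02-05` still sort correctly, so `.min()` and `.max()` work, but converting with `pd.to_datetime` also enables date arithmetic. The `toy` DataFrame below is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical toy frame; dates stored as strings, as after a plain read_csv
toy = pd.DataFrame({"date": ["2010-02-05", "2012-10-26", "2011-06-17"]})

# Convert the strings to datetime64 so that date arithmetic works
toy["date"] = pd.to_datetime(toy["date"])

first_day = toy["date"].min()
last_day = toy["date"].max()
span = last_day - first_day  # a Timedelta covering the whole range

print("from", first_day.date(), "to", last_day.date())
print("days covered:", span.days)
```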

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Then complete the following two exercises:

**Instructions:**
* Print the maximum of the `date` column.
* Print the minimum of the `date` column.


In [None]:
# Print the maximum of the date column
date_max = sales["date"].max()
print(date_max)

# Print the minimum of the date column
date_min = sales["date"].min()
print(date_min)

2012-10-26
2010-02-05


# **Efficient summaries**

While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

`df['column'].agg(function)`

In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.
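
As a rough sketch of the three ways `.agg()` is used in the exercises below (one column, several columns, several functions), here is a toy example. The `toy` DataFrame and its values are hypothetical; the `iqr` function is the same one used in the exercise.

```python
import pandas as pd

def iqr(column):
    """Inter-quartile range: 75th percentile minus 25th percentile."""
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical toy data
toy = pd.DataFrame({
    "temperature_c": [0.0, 10.0, 20.0, 30.0, 40.0],
    "unemployment": [7.0, 7.5, 8.0, 8.5, 9.0],
})

single = toy["temperature_c"].agg(iqr)  # one column, one function -> scalar
per_column = toy.agg(iqr)               # every column, one function -> Series
table = toy.agg([iqr, "median"])        # every column, two functions -> DataFrame

print(single)
print(per_column)
print(table)
```

Passing a list of functions returns one row per function, which is why the final exercise output is a small table.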

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Using the `.agg()` method, do the following exercises.

**Instructions:**


1.   Use the custom `iqr` function defined for you along with `.agg()` to print the IQR of the `temperature_c` column of `sales`.



In [None]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR of the temperature_c column, passing the custom function to .agg()
temperature_iqr = sales["temperature_c"].agg(iqr)
print("IQR of the temperature_c column:", temperature_iqr)

2. Update the column selection to use the custom `iqr` function with `.agg()` to print the IQR of `temperature_c`, `fuel_price_usd_per_l`, and `unemployment`, in that order.

In [None]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR of the temperature_c, fuel_price_usd_per_l, and unemployment columns
iqr_values = sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr)
print(iqr_values)

3. Update the aggregation functions called by `.agg()`: include `iqr` and the median, in that order. The median can be given as `np.median` or, as in the solution below, as the string `"median"`.

In [None]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR and median of the temperature_c, fuel_price_usd_per_l, and unemployment columns
iqr_median_values = sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, "median"])
print(iqr_median_values)


        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


# **Cumulative statistics**

Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.
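
A minimal sketch of the two cumulative methods on a hypothetical four-row frame (the dates and values are made up). The rows are sorted by date first, because the running totals only make sense in chronological order.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of date order
toy = pd.DataFrame({
    "date": ["2010-02-19", "2010-02-05", "2010-02-26", "2010-02-12"],
    "weekly_sales": [200.0, 100.0, 75.0, 50.0],
})

# Sort chronologically, then compute the running total and running maximum
toy = toy.sort_values("date")
toy["cum_weekly_sales"] = toy["weekly_sales"].cumsum()
toy["cum_max_sales"] = toy["weekly_sales"].cummax()

print(toy)
```

The running maximum only changes when a week beats every earlier week, so it stays flat for long stretches.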

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

If you have a DataFrame object, `df`, you can use the `df.sort_values()` method to sort its rows.

Based on this, do the following exercises.

**Instructions:**

* Sort the rows of `sales` by the `date` column in ascending order.
* Get the cumulative sum of `weekly_sales` and add it as a new column of `sales` called `cum_weekly_sales`.
* Get the cumulative maximum of `weekly_sales`, and add it as a column called `cum_max_sales`.
* Print the `date`, `weekly_sales`, `cum_weekly_sales`, and `cum_max_sales` columns.


In [None]:
# Sort the rows of sales by the date column in ascending order
# (assumes the sales DataFrame from the earlier cells is already loaded)
sales_sorted = sales.sort_values(by="date")

# Add cumulative sum of 'weekly_sales' as a new column
sales_sorted["cum_weekly_sales"] = sales_sorted["weekly_sales"].cumsum()

# Add cumulative maximum of 'weekly_sales' as a new column
sales_sorted["cum_max_sales"] = sales_sorted["weekly_sales"].cummax()

# Print the required columns
print(sales_sorted[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])


             date  weekly_sales  cum_weekly_sales  cum_max_sales
0      2010-02-05      24924.50      2.492450e+04       24924.50
6437   2010-02-05      38597.52      6.352202e+04       38597.52
1249   2010-02-05       3840.21      6.736223e+04       38597.52
6449   2010-02-05      17590.59      8.495282e+04       38597.52
6461   2010-02-05       4929.87      8.988269e+04       38597.52
...           ...           ...               ...            ...
3592   2012-10-05        440.00      2.568932e+08      293966.05
8108   2012-10-05        660.00      2.568938e+08      293966.05
10773  2012-10-05        915.00      2.568947e+08      293966.05
6257   2012-10-12          3.00      2.568947e+08      293966.05
3384   2012-10-26        -21.63      2.568947e+08      293966.05

[10774 rows x 4 columns]


# **Dropping duplicates**

Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from `sales`.

If you have a DataFrame object, `df`, you can use `df.drop_duplicates(subset=column_name)` to remove duplicate rows based on one or more columns.
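
To illustrate how `subset` decides what counts as a duplicate, here is a small sketch on hypothetical data. By default, the first row of each duplicate group is kept.

```python
import pandas as pd

# Hypothetical toy data: two stores with repeated store/department rows
toy = pd.DataFrame({
    "store": [1, 1, 2, 2, 2],
    "type": ["A", "A", "B", "B", "B"],
    "department": [1, 2, 1, 1, 3],
})

# One row per unique (store, type) pair
unique_stores = toy.drop_duplicates(subset=["store", "type"])

# One row per unique (store, department) pair
unique_store_depts = toy.drop_duplicates(subset=["store", "department"])

print(unique_stores)
print(unique_store_depts)
```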

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Based on this, do the following exercises.

**Instructions:**
* Remove rows of `sales` with duplicate pairs of `store` and `type` and save as `store_types` and print the head.
* Remove rows of `sales` with duplicate pairs of `store` and `department` and save as `store_depts` and print the head.
* Subset the rows that are holiday weeks using the `is_holiday` column, and drop the rows with duplicate `date` values, saving as `holiday_dates`.
* Select the `date` column of `holiday_dates`, and print.



In [33]:
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store","type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")


# Print date col of holiday_dates
print(holiday_dates["date"])

      Unnamed: 0  store type  department        date  weekly_sales  \
0              0      1    A           1  2010-02-05      24924.50   
901          901      2    A           1  2010-02-05      35034.06   
1798        1798      4    A           1  2010-02-05      38724.42   
2699        2699      6    A           1  2010-02-05      25619.00   
3593        3593     10    B           1  2010-02-05      40212.84   

      is_holiday  temperature_c  fuel_price_usd_per_l  unemployment  
0          False       5.727778              0.679451         8.106  
901        False       4.550000              0.679451         8.324  
1798       False       6.533333              0.686319         8.623  
2699       False       4.683333              0.679451         7.259  
3593       False      12.411111              0.782478         9.765  
    Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0            0      1    A           1  2010-02-05      24924.50       False   

# **Counting categorical variables**

Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

`# Drop duplicate store/type combinations`

`store_types = sales.drop_duplicates(subset=["store", "type"])`

`# Drop duplicate store/department combinations`

`store_depts = sales.drop_duplicates(subset=["store", "department"])`

The `store_types` and `store_depts` DataFrames you created in the last exercise are available, and pandas is imported as pd.
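
As a quick sketch of counting versus proportions, the example below uses a hypothetical `store_types_toy` frame (not the real `store_types`). `.value_counts()` sorts in descending order by default; `sort=True` is written out only to make that explicit.

```python
import pandas as pd

# Hypothetical toy data: one row per store
store_types_toy = pd.DataFrame({
    "store": [1, 2, 3, 4],
    "type": ["A", "A", "B", "A"],
})

# Absolute counts per category, largest first
counts = store_types_toy["type"].value_counts(sort=True)

# Same counts expressed as fractions of the total
proportions = store_types_toy["type"].value_counts(sort=True, normalize=True)

print(counts)
print(proportions)
```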

Based on this, do the following exercises.

**Instructions:**
* Count the number of stores of each store `type` in `store_types`.
* Count the proportion of stores of each store `type` in `store_types`.
* Count the number of different `department`s in `store_depts`, sorting the counts in descending order.
* Count the proportion of different `department`s in `store_depts`, sorting the proportions in descending order.


In [36]:
# Count the number of stores of each type
store_type_counts = store_types["type"].value_counts()
print("Number of stores of each store type:")
print(store_type_counts)

# Count the proportion of stores of each store type in store_types
store_type_proportions = store_types["type"].value_counts(normalize=True)
print("\nProportion of stores of each store type:")
print(store_type_proportions)

# Count the number of different departments in store_depts, sorted in descending order
department_counts = store_depts["department"].value_counts()
print("\nNumber of different departments (sorted):")
print(department_counts)

# Count the proportion of different departments in store_depts, sorted in descending order
department_proportions = store_depts["department"].value_counts(normalize=True)
print("\nProportion of different departments (sorted):")
print(department_proportions)


Number of stores of each store type:
type
A    11
B     1
Name: count, dtype: int64

Proportion of stores of each store type:
type
A    0.916667
B    0.083333
Name: proportion, dtype: float64

Number of different departments (sorted):
department
1     12
55    12
72    12
71    12
67    12
      ..
37    10
48     8
50     6
39     4
43     2
Name: count, Length: 80, dtype: int64

Proportion of different departments (sorted):
department
1     0.012917
55    0.012917
72    0.012917
71    0.012917
67    0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: proportion, Length: 80, dtype: float64


# **What percent of sales occurred at each store type?**

While `.groupby()` is useful, you can also calculate grouped summary statistics without it.

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this exercise, you'll calculate the total sales made at each store type, without using `.groupby()`. You can then use these numbers to see what proportion of Walmart's total sales were made at each type.
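
The approach can be sketched on toy data as follows. The `toy` frame is hypothetical; wrapping the totals in `np.array` before dividing is one way to make the element-wise division explicit, and it is a choice made for this sketch rather than a requirement of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with no type "C" rows
toy = pd.DataFrame({
    "type": ["A", "A", "B"],
    "weekly_sales": [100.0, 200.0, 100.0],
})

total = toy["weekly_sales"].sum()

# Boolean subsetting: keep rows of one type, then sum their sales
sales_a = toy.loc[toy["type"] == "A", "weekly_sales"].sum()
sales_b = toy.loc[toy["type"] == "B", "weekly_sales"].sum()
sales_c = toy.loc[toy["type"] == "C", "weekly_sales"].sum()  # empty subset sums to 0.0

# Element-wise division gives the share of each type
proportions = np.array([sales_a, sales_b, sales_c]) / total
print(proportions)
```

A type with no rows simply contributes a total of zero, which matches the `0.0` that appears for type "C" when the subset is empty.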

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Based on this, do the following exercises.

**Instructions:**
* Calculate the total `weekly_sales` over the whole dataset.
* Subset for `type` "A" stores, and calculate their total weekly sales.
* Do the same for `type` "B" and `type` "C" stores.
* Combine the A/B/C results into a list, and divide by `sales_all` to get the proportion of sales by type.


In [40]:

sales_all = sales["weekly_sales"].sum()
print(f"Total sales (all stores): {sales_all}")

# Subset for type "A" stores and calculate their total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()
print(f"Total sales (type A stores): {sales_A}")

# Subset for type "B" stores and calculate their total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
print(f"Total sales (type B stores): {sales_B}")

# Subset for type "C" stores and calculate their total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()
print(sales_C)

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)

Total sales (all stores): 256894718.89999998
Total sales (type A stores): 233716315.01
Total sales (type B stores): 23178403.89
0.0
[0.9097747 0.0902253 0.       ]


# **Calculations with `.groupby()`**

The `.groupby()` method makes life much easier. In this exercise, you'll perform the same calculations as last time, except you'll use the `.groupby()` method. You'll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it's a holiday week or not.
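
A minimal sketch of both ideas, grouping by one variable and by two, on a hypothetical `toy` frame (values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "is_holiday": [False, True, False, False],
    "weekly_sales": [300.0, 100.0, 150.0, 50.0],
})

# Group by one variable: total sales per store type, then each type's share
by_type = toy.groupby("type")["weekly_sales"].sum()
propn_by_type = by_type / by_type.sum()

# Group by two variables: total sales per (type, is_holiday) combination
by_type_holiday = toy.groupby(["type", "is_holiday"])["weekly_sales"].sum()

print(propn_by_type)
print(by_type_holiday)
```

Grouping by a list of columns returns a result with a two-level index, one level per grouping column.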

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Based on this, do the following exercises.

**Instructions:**
* Group sales by "`type`", take the sum of "`weekly_sales`", and store as `sales_by_type`.
* Calculate the proportion of `sales` at each store `type` by dividing by the sum of `sales_by_type`. Assign to `sales_propn_by_type`.


In [41]:
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sales_by_type.sum()
print(sales_propn_by_type)

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64


# **Multiple grouped summaries**

Earlier in this chapter, you saw that the `.agg()` method is useful to compute multiple statistics on multiple variables. It also works with grouped data. NumPy, which is imported as `np`, has many different summary statistics functions, including `np.min`, `np.max`, `np.mean`, and `np.median`.
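
Here is a small sketch of `.agg()` on grouped data, using hypothetical values. It passes the statistics as string names, which pandas accepts in place of the NumPy functions; this is an assumption about style made for the sketch, and the NumPy functions name the same statistics.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B"],
    "weekly_sales": [10.0, 20.0, 60.0, 5.0, 15.0],
    "unemployment": [8.0, 8.2, 7.8, 9.0, 9.4],
})

# Several statistics of one column, per group
stats = toy.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])

# Several statistics of several columns, per group (two-level column index)
multi = toy.groupby("type")[["weekly_sales", "unemployment"]].agg(["min", "max"])

print(stats)
print(multi)
```

With more than one column selected, each result column is addressed by a pair such as `("unemployment", "max")`.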

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Based on this, do the following exercises.

**Instructions:**
* Import `numpy` with the alias `np`.
* Get the `min`, `max`, `mean`, and `median` of `weekly_sales` for each store type using `.groupby()` and `.agg()`. Store this as `sales_stats`. Use NumPy functions or, as in the solution below, their equivalent string names.
* Get the `min`, `max`, `mean`, and `median` of `unemployment` and `fuel_price_usd_per_l` for each store `type`. Store this as `unemp_fuel_stats`.


In [50]:
# Import numpy with the alias np
import numpy as np


# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg(['min', 'max', 'mean', 'median'])



# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median

unemp_fuel_stats = sales.groupby("type")[["unemployment","fuel_price_usd_per_l"]].agg(['min','max','mean','median'])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

         min        max          mean    median
type                                           
A    -1098.0  293966.05  23674.667242  11943.92
B     -798.0  232558.51  25696.678370  13336.08
     unemployment                         fuel_price_usd_per_l            \
              min    max      mean median                  min       max   
type                                                                       
A           3.879  8.992  7.972611  8.067             0.664129  1.107410   
B           7.170  9.765  9.279323  9.199             0.760023  1.107674   

                          
          mean    median  
type                      
A     0.744619  0.735455  
B     0.805858  0.803348  


# **Pivoting on one variable**

Pivot tables are the standard way of aggregating data in spreadsheets.

In pandas, pivot tables are essentially another way of performing grouped calculations. That is, the `.pivot_table()` method is an alternative to `.groupby()`.

In this exercise, you'll perform calculations using `.pivot_table()` to replicate the calculations you performed in the last lesson using `.groupby()`.
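
To make the equivalence concrete, the sketch below computes the same group means both ways on a hypothetical `toy` frame. The only visible difference is the shape of the result: `.groupby()` returns a Series here, while `.pivot_table()` returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "weekly_sales": [10.0, 30.0, 5.0, 15.0],
})

# Grouped mean with .groupby()
via_groupby = toy.groupby("type")["weekly_sales"].mean()

# The same grouped mean with .pivot_table()
via_pivot = toy.pivot_table(values="weekly_sales", index="type", aggfunc="mean")

print(via_groupby)
print(via_pivot)
```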

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Based on this, do the following exercises.

**Instructions:**
1. Get the mean `weekly_sales` by type using `.pivot_table()` and store as `mean_sales_by_type`.

In [56]:
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values="weekly_sales",index="type",aggfunc="mean")

# Print mean_sales_by_type
print(mean_sales_by_type)

      weekly_sales
type              
A     23674.667242
B     25696.678370


2. Get the `mean` and `median` of `weekly_sales` by type (using NumPy functions or their string names) with `.pivot_table()` and store as `mean_med_sales_by_type`.

In [58]:
# Pivot for mean and median of weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values="weekly_sales",index="type",aggfunc=["mean","median"])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)

              mean       median
      weekly_sales weekly_sales
type                           
A     23674.667242     11943.92
B     25696.678370     13336.08


3. Get the `mean` of `weekly_sales` by `type` and `is_holiday` using `.pivot_table()` and store as `mean_sales_by_type_holiday`.

In [62]:
# Pivot for mean weekly_sales for each store type for holiday
mean_sales_by_type_holiday = sales.pivot_table(values="weekly_sales",index="type",columns="is_holiday",aggfunc="mean")

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

is_holiday         False      True 
type                               
A           23768.583523  590.04525
B           25751.980533  810.70500


# **Fill in missing values and sum values with pivot tables**

The `.pivot_table()` method has several useful arguments, including `fill_value` and `margins`.

* `fill_value` replaces missing values with a real value (known as imputation). What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
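* `margins` is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: with `margins=True`, an `All` row and an `All` column are added that apply the aggregation function across each row and column (totals when the function is a sum, overall means when it is a mean).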
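
Both arguments can be sketched on a hypothetical three-row frame in which type "B" has no department 2, so that cell would otherwise be missing:

```python
import pandas as pd

# Hypothetical toy data: no rows for type "B", department 2
toy = pd.DataFrame({
    "type": ["A", "A", "B"],
    "department": [1, 2, 1],
    "weekly_sales": [10.0, 30.0, 20.0],
})

# fill_value replaces the missing (B, 2) cell with the number 0
filled = toy.pivot_table(values="weekly_sales", index="type",
                         columns="department", aggfunc="mean", fill_value=0)

# margins=True adds an "All" row and column computed with the same aggfunc
with_margins = toy.pivot_table(values="weekly_sales", index="type",
                               columns="department", aggfunc="mean",
                               fill_value=0, margins=True)

print(filled)
print(with_margins)
```

Note that `fill_value` is given as the number `0`, not the string `'0'`, so the table stays numeric.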

In this exercise, you'll practice using these arguments to up your pivot table skills, which will help you crunch numbers more efficiently!

The `sales` DataFrame is available from the previous cells, and pandas is loaded as `pd` in the first cell. Answer each question in its own cell.

Based on this, do the following exercises.

**Instructions:**
1. Print the mean `weekly_sales` by `department` and `type`, filling in any missing values with `0`.

2. Print the mean `weekly_sales` by `department` and `type`, filling in any missing values with `0` and adding the `All` row and column with `margins=True`.

In [63]:
# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values="weekly_sales", index="type", columns="department", aggfunc="mean", fill_value=0))

# Print mean weekly_sales by department and type; fill missing values with 0
# and add the All row and column with margins=True
print(sales.pivot_table(values="weekly_sales", index="type", columns="department", aggfunc="mean", fill_value=0, margins=True))


department            1              2             3             4   \
type                                                                  
A           30961.725379   67600.158788  17160.002955  44285.399091   
B           44050.626667  112958.526667  30580.655000  51219.654167   

department            5             6             7             8   \
type                                                                 
A           34821.011364   7136.292652  38454.336818  48583.475303   
B           63236.875000  10717.297500  52909.653333  90733.753333   

department            9             10  ...            90            91  \
type                                    ...                               
A           30120.449924  30930.456364  ...  85776.905909  70423.165227   
B           66679.301667  48595.126667  ...  14780.210000  13199.602500   

department             92            93            94             95  \
type                                                         

