<a href="https://colab.research.google.com/github/chonginbilly/Moringa_DS/blob/main/statisticalMethodsPandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="green">*To start working on this notebook, or any other notebook that we will use in this course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

---

# Statistical Methods In Pandas

## Introduction

Let's explore the tools that help us gain insight into our datasets. We will use Pandas methods such as `df.describe()` and `df.info()` to obtain summary statistics and to understand the structure and metadata of our DataFrames. We will also use built-in Pandas methods to calculate individual summary statistics for specific columns. Finally, we will look at the `.apply()` method, which lets us apply custom functions to the elements of a Series or to the rows or columns of a DataFrame. By the end of this lesson, you will be able to explore and analyze a dataset with confidence. Let's dive in!

## Objectives

By the end of this lesson, you will be able to:

- Obtain comprehensive DataFrame-level summary statistics
- Compute individual column statistics
- Use the `.apply()` method to apply a function to a Pandas Series or DataFrame

## Import Libraries

In [None]:
# for numerical operations
import numpy as np
# for tabular data analysis
import pandas as pd

# for os functionality
import os

## Mount Google Drive

In [None]:
from google.colab import drive

# mount your google drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# path to data folder
data_folder = "/content/drive/MyDrive/Product/Naivas Big Data /Data"

# change directory to the data folder
os.chdir(data_folder)

In [None]:
# current location
os.getcwd()

'/content/drive/MyDrive/Product/Naivas Big Data /Data'

## Load the data

Let's explore the [Online Retail](https://docs.google.com/spreadsheets/d/1RrAWT89rctGHlEKrZhHusG0IilzE0QzS/edit?usp=sharing&ouid=101284319059865935215&rtpof=true&sd=true) dataset to learn how to use some of the key summary statistics methods in Pandas.

In [None]:
# load the first 200 rows of the Excel file
df = pd.read_excel('Online Retail.xlsx', nrows=200)
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


In [None]:
# preview last five rows
df.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
195,536388,22469,HEART OF WICKER SMALL,12,2010-12-01 09:59:00,1.65,16250,United Kingdom
196,536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,2010-12-01 09:59:00,1.65,16250,United Kingdom
197,536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,2010-12-01 10:03:00,8.5,12431,Australia
198,536389,21622,VINTAGE UNION JACK CUSHION COVER,8,2010-12-01 10:03:00,4.95,12431,Australia
199,536389,21791,VINTAGE HEADS AND TAILS CARD GAME,12,2010-12-01 10:03:00,1.25,12431,Australia


## Data Understanding

In [None]:
# shape
print(f"The dataset has: \n\t* {df.shape[0]} rows \n\t* {df.shape[1]} columns")

The dataset has: 
	* 200 rows 
	* 8 columns


In [None]:
# data types
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID              int64
Country                object
dtype: object

In [None]:
# columns of the data
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

## Getting DataFrame-Level Summary Statistics

When dealing with a new dataset, the initial step always involves gaining an understanding of its composition. The Pandas DataFrame class conveniently provides two inherent methods that simplify this process for us.

### Using `df.info()`

The `df.info()` method provides us with concise metadata summaries regarding our DataFrame. In other words, it offers information about our dataset, revealing details such as the number of rows and columns it contains, along with the respective data types in which the data is stored.

In [None]:
# brief info about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    200 non-null    object        
 1   StockCode    200 non-null    object        
 2   Description  200 non-null    object        
 3   Quantity     200 non-null    int64         
 4   InvoiceDate  200 non-null    datetime64[ns]
 5   UnitPrice    200 non-null    float64       
 6   CustomerID   200 non-null    int64         
 7   Country      200 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 12.6+ KB


As the output shows, the `.info()` method tells us about the DataFrame's structure but not about the actual values it holds. Take a closer look at the output and note:
- the count of columns and rows in the DataFrame
- the data type assigned to each column
- the count of non-null values in each column (excluding NaNs)
- the memory usage of the DataFrame.

This type of information, often referred to as **metadata**, provides a comprehensive overview of the dataset by offering details about its structure and properties without diving into the content of the data itself.
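Each piece of metadata listed above can also be retrieved on its own. The following is a minimal sketch, assuming a small made-up DataFrame (`toy` and its values are hypothetical and not part of the Online Retail data):

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing price
toy = pd.DataFrame({
    "item": ["lantern", "hanger", "bottle"],
    "price": [3.39, np.nan, 2.75],
})

n_rows, n_cols = toy.shape                      # number of rows and columns
col_types = toy.dtypes                          # data type of each column
non_null = toy.notna().sum()                    # non-null count per column
mem_bytes = toy.memory_usage(deep=True).sum()   # total memory used, in bytes

print(n_rows, n_cols)
print(col_types)
print(non_null)
print(mem_bytes)
```

A non-null count lower than the number of rows, as for `price` here, is the quickest sign that a column has missing values.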

### Using `.describe()`

The `df.describe()` method, in contrast to `df.info()`, provides statistical summaries of the *numerical* columns within the DataFrame. Instead of offering metadata, `df.describe()` delivers statistical measures such as the *mean*, *standard deviation*, *minimum, maximum*, and *quartile values* for each numerical column. It gives a more in-depth understanding of the distribution and central tendencies of the numerical data in the DataFrame, assisting in data exploration and analysis.

In [None]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,200.0,200.0,200.0
mean,19.435,3.571,15709.23
std,50.224179,3.543118,1862.393243
min,-1.0,0.38,12431.0
25%,3.0,1.65,14688.0
50%,6.0,2.55,15670.0
75%,12.0,4.25,17850.0
max,432.0,27.5,18074.0


As observed, the output generated by the `.describe()` method proves highly useful, providing pertinent details such as:

- A count of the values in each column, aiding in the identification of columns with missing values.
- The mean and standard deviation of each column, offering insights into the central tendencies and variability of the data.
- The minimum and maximum values within each column, helping to understand the data range.
- The median (50%) and quartile values (25% and 75%) for each column, facilitating a comprehensive view of the data distribution.

We should leverage the power of the `.describe()` method during the initial stages of the Exploratory Data Analysis process for a quick and effective understanding of the provided dataset.


Note columns such as `CustomerID`, for which statistical measures like the mean, standard deviation, minimum, and maximum are irrelevant. Such columns typically hold categorical or unique-identifier data, where numerical summary statistics do not convey meaningful insights. It's essential to recognize these columns and treat them according to their categorical or identifier nature rather than attempting to derive numerical statistics from them.
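One possible way to keep identifier columns out of the numerical summary is sketched below. It assumes a hypothetical toy DataFrame that borrows the column names of our dataset; selecting the measurement columns or casting the identifier to text are two options among several.

```python
import pandas as pd

# hypothetical toy data using the same column names as our dataset
toy = pd.DataFrame({
    "Quantity": [6, 8, 12],
    "UnitPrice": [2.55, 2.75, 1.65],
    "CustomerID": [17850, 17850, 16250],
})

# option 1: describe only the columns that are real measurements
measures = toy[["Quantity", "UnitPrice"]].describe()

# option 2: store the identifier as text, so describe() skips it by default
toy["CustomerID"] = toy["CustomerID"].astype(str)
summary = toy.describe()

print(summary.columns.tolist())
```

With either option, the summary table covers only `Quantity` and `UnitPrice`.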

## Creating Sales Column

We can perform element-wise arithmetic between two columns and store the result in a new column of the DataFrame. This way we can create useful information that tells us something important about our dataset.

Let's create a new column `Sales` from `UnitPrice` and `Quantity`:

In [None]:
# Sales = UnitPrice * Quantity
df["Sales"] = df["UnitPrice"] * df["Quantity"]


In [None]:
# brief data description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    200 non-null    object        
 1   StockCode    200 non-null    object        
 2   Description  200 non-null    object        
 3   Quantity     200 non-null    int64         
 4   InvoiceDate  200 non-null    datetime64[ns]
 5   UnitPrice    200 non-null    float64       
 6   CustomerID   200 non-null    int64         
 7   Country      200 non-null    object        
 8   Sales        200 non-null    float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 14.2+ KB


## Calculating Individual Column Statistics

We can easily calculate individual statistics about a column using Pandas DataFrames and Series objects as they provide a variety of built-in methods for quick summary statistics. Check out the examples in the code blocks below to see how we can swiftly compute these statistics.

### Count

`.count()` returns the number of non-null entries in a specific column, giving us a quick way to check the completeness of our data.

In [None]:
# count
df["StockCode"].count()

200
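In our sample every `StockCode` is present, so the count equals the number of rows. The sketch below uses a hypothetical Series with missing entries to show how `.count()` differs from `len()`:

```python
import numpy as np
import pandas as pd

# hypothetical stock codes, two of them missing
codes = pd.Series(["85123A", None, "71053", np.nan])

total_rows = len(codes)          # every row, missing or not
non_missing = codes.count()      # only the non-null entries
n_missing = codes.isna().sum()   # number of missing entries

print(total_rows, non_missing, n_missing)
```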

### Measures of Central Tendency



In [None]:
# mean sales
total_sales = sum(df["Sales"])
n = len(df)
mean = total_sales / n

print(f"Total Sales: {total_sales}")
print(f"Number of sales (n): {n}")
print(f"Average Sale: {mean}")

Total Sales: 8813.649999999998
Number of sales (n): 200
Average Sale: 44.06824999999999


In [None]:
# mean sales
total_sales = sum(df["Sales"])
n = len(df)
mean = total_sales / n

print(f"Total Sales: {np.round(total_sales, 4)}")
print(f"Number of sales (n): {n}")
print(f"Average Sale: {np.round(mean,4)}") # np.round(number_to_round_off, number_of_decimals)

Total Sales: 8813.65
Number of sales (n): 200
Average Sale: 44.0682


In [None]:
# using Numpy to calculate mean
numpy_mean = np.mean(df["Sales"])
print(f"Mean calculated using Numpy: {numpy_mean}")

Mean calculated using Numpy: 44.06825


In [None]:
# using pandas
pandas_mean = df["Sales"].mean()
print(f"Mean calculated using Pandas(rounded off, 3 decimals): {np.round(pandas_mean, 3)}")

Mean calculated using Pandas(rounded off, 3 decimals): 44.068


In [None]:
# median using numpy
numpy_median = np.median(df["Sales"])
print(f"Median calculated using Numpy: {numpy_median}")

Median calculated using Numpy: 17.85


In [None]:
# using pandas
pandas_median = df["Sales"].median()
print(f"Median calculated using Pandas: {pandas_median}")

Median calculated using Pandas: 17.85


In [None]:
# mode of quantity column
mode_ = df["Quantity"].mode()
print(mode_)

0    6
Name: Quantity, dtype: int64


### Measures of Dispersion

* variance - a measure of how far a set of numbers is spread out from its average.

In [None]:
# variance - numpy
variance_numpy = np.var(df["Sales"])
print(variance_numpy)

10724.712670437499


In [None]:
# pandas
variance_pandas = df["Sales"].var()
print(variance_pandas)

10778.605698932159


In [None]:
# Variance Manually
mean = sum(df["Sales"]) / len(df["Sales"])# calculate the mean

deviations = [x - mean for x in df["Sales"]] # calculate deviations from the mean

squared_deviations = [x ** 2 for x in deviations] # calculate squared_deviations

variance = sum(squared_deviations) / len(squared_deviations) # calculate variance

print(variance)

10724.712670437493
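The NumPy and manual results above agree with each other but differ from the Pandas result. The reason is the default divisor: `np.var` divides by `n` (population variance, `ddof=0`), while `Series.var()` divides by `n - 1` (sample variance, `ddof=1`). The same defaults apply to `np.std` and `Series.std()`. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

sales = pd.Series([15.3, 20.34, 22.0, 20.34])  # hypothetical values
n = len(sales)

pop_var_numpy = np.var(sales)        # divides by n (ddof=0)
sample_var_pandas = sales.var()      # divides by n - 1 (ddof=1)

# either library can produce either version through the ddof argument
sample_var_numpy = np.var(sales, ddof=1)
pop_var_pandas = sales.var(ddof=0)

print(pop_var_numpy, pop_var_pandas)
print(sample_var_numpy, sample_var_pandas)
```

Which version to use depends on whether the data is treated as the whole population or as a sample drawn from it; the important part is to be explicit about `ddof` when comparing results across libraries.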


* standard deviation - the square root of the variance; it measures the amount of variation or dispersion in the data.

In [None]:
# std numpy
std_numpy = np.std(df["Sales"])
print(std_numpy)

103.56018863654845


In [None]:
# std pandas
std_pandas = df["Sales"].std()
print(std_pandas)

103.82006404800644


In [None]:
# std from the manually computed variance
std = variance ** 0.5
print(std)

103.56018863654842


## Summary Statistics for Categorical Columns

We can't compute most summary statistics on columns containing non-numeric data; for instance, finding the mean of the `StockCode` column wouldn't make sense. However, some summary statistics are designed specifically for categorical columns. The following cells show examples.

In [None]:
# how many times each unique item appears in the column
df["StockCode"].value_counts()

In [None]:
# unique items in the column
df["StockCode"].unique()

array(['85123A', 71053, '84406B', '84029G', '84029E', 22752, 21730, 22633,
       22632, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622,
       21754, 21755, 21777, 48187, 22960, 22913, 22912, 22914, 21756,
       22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326,
       22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22544,
       22492, 'POST', 22086, 20679, 37370, 21871, 21071, 21068, 82483,
       82486, 82482, '82494L', 21258, 22114, 21733, 22386, '85099C',
       21033, 20723, '84997B', '84997C', 21094, 20725, 21559, 22352,
       21212, 21975, 21977, 84991, '84519A', '85183B', '85071B', 21931,
       21929, 22961, 22139, 84854, 22411, 82567, 21672, 22774, 22771,
       71270, 22262, 22637, 21934, 21169, 21166, 21175, '37444A',
       '37444C', 22083, '84971S', 47580, 22261, 84832, 22644, 21533,
       21557, '15056BL', '15056N', 22646, 22176, 22438, 22778, 22719,
       21523, 'D', 21912, 21832, 22379, 22381, 22798, 22926, 22839, 22838,
       22783, 

In [None]:
# number of unique items in a column
df["StockCode"].nunique()


156

These methods are extremely useful when dealing with categorical data!

`.unique()` shows us all the unique values contained in the column.

`.value_counts()` shows us a count for how many times each unique value is present in a dataset, giving us a feel for the distribution of values in the column.
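The three methods can be compared side by side on a small example. This is a sketch with a hypothetical Series of country names, not values taken from the dataset:

```python
import pandas as pd

# hypothetical categorical values
countries = pd.Series(
    ["United Kingdom", "United Kingdom", "Australia", "France", "United Kingdom"]
)

distinct = countries.unique()                     # distinct values, in order of first appearance
n_distinct = countries.nunique()                  # number of distinct values
counts = countries.value_counts()                 # occurrences of each value, most frequent first
shares = countries.value_counts(normalize=True)   # the same counts expressed as proportions

print(distinct)
print(n_distinct)
print(counts)
print(shares)
```

Passing `normalize=True` is useful when the relative share of each category matters more than the raw count.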

## Applying Functions to a DataFrame with `.apply()`

The `.apply()` method lets us apply a function to either a Pandas Series (a single column) or a whole DataFrame. It's like telling Pandas, "Hey, apply this function to every element in the Series, or to every row or column in the DataFrame."

For a Series, it means we can transform each element individually using a custom function. For example, we might want to double every number in a column or convert certain text values to uppercase.

For a DataFrame, the `.apply()` method can work on either the entire DataFrame or specific columns. It's handy when you want to perform custom operations on the data across rows or columns. This could be anything from complex calculations on multiple columns to applying a function to each row.

In essence, `.apply()` is a powerful tool that enables us to bring in our own custom logic and apply it to the data in a flexible and efficient manner, making it a key part of data manipulation and analysis in Pandas.
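The cases described above can be sketched as follows. The DataFrame `toy` and the derived column names are hypothetical and only serve to illustrate the call patterns:

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "Description": ["white metal lantern", "heart of wicker small"],
    "Quantity": [6, 12],
    "UnitPrice": [3.39, 1.65],
})

# Series.apply: the function receives one element at a time
toy["Description"] = toy["Description"].apply(str.upper)
toy["DoubleQuantity"] = toy["Quantity"].apply(lambda q: q * 2)

# DataFrame.apply with axis=1: the function receives one row (as a Series) at a time
toy["Sales"] = toy.apply(lambda row: row["Quantity"] * row["UnitPrice"], axis=1)

# DataFrame.apply with the default axis=0: the function receives one column at a time
col_max = toy[["Quantity", "UnitPrice"]].apply(lambda col: col.max())

print(toy)
print(col_max)
```

On a DataFrame, the `axis` argument decides whether the function sees whole columns (`axis=0`, the default) or whole rows (`axis=1`).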

In [None]:
# give a 5% discount on sales above 21
df['discount'] = df['Sales'].apply(lambda x: 0.05 * x if x > 21 else 0)

In [None]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales,discount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,0.0
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,0.0
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,1.1
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,0.0
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,0.0


## Summary

In this lesson, we worked with essential Pandas methods for exploring and understanding our datasets. We used `df.info()` and `df.describe()` to obtain metadata and summary statistics for our DataFrames. We also used built-in Pandas methods to calculate individual statistics for numerical and categorical columns. Finally, we used the `.apply()` method to apply custom functions to the elements of a Series or to the rows or columns of a DataFrame. These skills help us derive meaningful insights and make informed decisions as we explore our data.