# Descriptive Statistics Cheat Sheet

__Contents:__
* <a href = '#sec1'>Preliminary</a>
* <a href = '#sec2'>Descriptive statistics</a>

----
<a id='sec1'></a>
# Preliminary

### Import required packages and change directory 

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set working directory (uncomment and define default_path as the folder containing the data)
#os.chdir(default_path)

### Load data

In [None]:
# Load the data
churn_df = pd.read_pickle("churn_for_engineering.p")

----

<a id='sec2'></a>
# Descriptive Statistics

### Creating a summary table

    describe(): Get summary of data

In [None]:
churn_df.describe()
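
By default, `describe()` summarizes only the numeric columns. The sketch below shows how `include="all"` also covers text columns. It uses a small made-up DataFrame (`toy_df` is a hypothetical stand-in for `churn_df`, and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for churn_df (values are made up)
toy_df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "profession": ["nurse", "teacher", "nurse", "engineer"],
})

# Default behaviour: only numeric columns are summarized
numeric_summary = toy_df.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = toy_df.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column type (for example the mean of a text column) appear as `NaN` in the combined table.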

### Unique values & counts of "profession"
     unique(): Get unique values of a column
     value_counts(): Count occurrences of each value in a column

In [None]:
# Display just the unique values
churn_df.profession.unique()

In [None]:
# Display the 10 most frequent values with their counts (sorted in descending order)
churn_df.profession.value_counts().head(10)
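
Relative frequencies are often easier to read than raw counts. As a minimal sketch, assuming a made-up `profession` Series in place of the real column, `value_counts(normalize=True)` returns shares and `nunique()` returns the number of distinct values.

```python
import pandas as pd

# Hypothetical toy column standing in for churn_df.profession
profession = pd.Series(
    ["nurse", "teacher", "nurse", "engineer", "nurse"], name="profession"
)

# Share of each value instead of the raw count
shares = profession.value_counts(normalize=True)

# Number of distinct values
n_unique = profession.nunique()

print(shares)
print(n_unique)
```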

### Grouped Summary Statistics
Run descriptive analyses grouped by our target __churn_flag__.

    groupby('variable name'): Group the data by this variable;
    subsequent operations are applied to each group separately

In [None]:
summary_grouped = churn_df.groupby('churn_flag').describe()
summary_grouped

In [None]:
# Let's change the way the results are displayed.
# pandas has various display options to customize the output;
# here we allow up to 100 rows to be shown.
# The .T attribute transposes the result so that statistics appear as rows.
pd.set_option('display.max_rows', 100)
summary_grouped = churn_df.groupby('churn_flag').describe()
summary_grouped.T

In [None]:
# This command resets the display options for DataFrames to the default
pd.reset_option('display.max_rows')
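
An alternative to calling `set_option` and `reset_option` by hand is `pd.option_context`, which changes an option only inside a `with` block. The sketch below uses a hypothetical `toy_df` with invented values.

```python
import pandas as pd

# Hypothetical toy data standing in for churn_df (values are made up)
toy_df = pd.DataFrame({
    "churn_flag": [0, 0, 1, 1],
    "cash_withdrawals_value": [100.0, 300.0, 50.0, 150.0],
})

default_max_rows = pd.get_option("display.max_rows")

# The option is changed only inside the with-block
with pd.option_context("display.max_rows", 100):
    inside_max_rows = pd.get_option("display.max_rows")
    print(toy_df.groupby("churn_flag").describe().T)

# Outside the block the previous value is restored automatically
after_max_rows = pd.get_option("display.max_rows")
```

This avoids leaving a changed display setting active if a later cell is skipped.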

### Data aggregation
We can also aggregate one variable by a specific group. Let's see how much cash customers withdraw on average, depending on whether they churn or not. Are there any differences?

    groupby('variable'): all calculations will be grouped by the specified variables
    mean(): Calculates the mean of a column
    agg(): Aggregate using one or more operations over several columns

#### Aggregate of one variable

In [None]:
churn_df.groupby('churn_flag')['cash_withdrawals_value'].mean()

#### Aggregate by multiple grouping variables
We can also group by several variables at once by passing a list of column names.

In [None]:
data_agg = churn_df.groupby(['churn_flag', 'gender'])['cash_withdrawals_value'].mean()
data_agg.head(20)
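
Grouping by two variables returns a Series with a two-level index. A minimal sketch, using a hypothetical `toy_df` with invented values, shows how `unstack()` pivots one level into columns to give a wide table.

```python
import pandas as pd

# Hypothetical toy data standing in for churn_df (values are made up)
toy_df = pd.DataFrame({
    "churn_flag": [0, 0, 1, 1],
    "gender": ["F", "M", "F", "M"],
    "cash_withdrawals_value": [100.0, 200.0, 50.0, 150.0],
})

# Mean per (churn_flag, gender) combination; the result has a MultiIndex
means = toy_df.groupby(["churn_flag", "gender"])["cash_withdrawals_value"].mean()

# Move the 'gender' level from the index into the columns
wide = means.unstack("gender")
print(wide)
```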

#### Multiple Aggregations
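We can also apply several aggregations to different variables by passing a dictionary that maps each column name to a list of functions. Note that this example aggregates over the whole DataFrame rather than by group.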

In [None]:
data_agg2 = churn_df.agg({'cash_withdrawals_value':['sum', 'max', 'min', 'mean'], 'credit_rating': ['sum','min', 'max', 'mean']})
data_agg2.head()
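
The same idea can be combined with `groupby()`. The sketch below uses pandas named aggregation, where each keyword gives an output column name and a `(column, function)` pair. `toy_df` and the output names `mean_withdrawal` and `max_rating` are hypothetical, and the values are invented.

```python
import pandas as pd

# Hypothetical toy data standing in for churn_df (values are made up)
toy_df = pd.DataFrame({
    "churn_flag": [0, 0, 1, 1],
    "cash_withdrawals_value": [100.0, 300.0, 50.0, 150.0],
    "credit_rating": [3, 5, 2, 4],
})

# Named aggregation: output_name=(input_column, function)
grouped = toy_df.groupby("churn_flag").agg(
    mean_withdrawal=("cash_withdrawals_value", "mean"),
    max_rating=("credit_rating", "max"),
)
print(grouped)
```

Named aggregation yields flat, readable column names instead of the two-level columns produced by a dictionary of lists.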

### Contingency tables
Two-way tables, also known as contingency tables, show the joint frequency distribution of two categorical variables.

    pd.crosstab(): Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

In [None]:
pd.crosstab(index=churn_df["churn_flag"], columns=churn_df["gender"])
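
Two optional arguments of `pd.crosstab()` are often useful here: `margins=True` adds row and column totals, and `normalize="index"` converts counts to row-wise shares. The sketch below uses a hypothetical `toy_df` with invented values.

```python
import pandas as pd

# Hypothetical toy data standing in for churn_df (values are made up)
toy_df = pd.DataFrame({
    "churn_flag": [0, 0, 0, 1, 1],
    "gender": ["F", "M", "M", "F", "F"],
})

# Counts with row and column totals (labelled "All")
counts = pd.crosstab(
    index=toy_df["churn_flag"], columns=toy_df["gender"], margins=True
)

# Row-wise shares: each row sums to 1
row_shares = pd.crosstab(
    index=toy_df["churn_flag"], columns=toy_df["gender"], normalize="index"
)

print(counts)
print(row_shares)
```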