# Summarizing and Computing Descriptive Statistics

- calling the sum method on a DataFrame returns a Series containing the column sums

- passing axis="columns" (or axis=1) sums across the columns instead, giving one total per row

- options for reduction methods: 
    * https://wesmckinney.com/book/pandas-basics#tbl-table_pandas_reduction

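- Some methods return indirect statistics
    * idxmin and idxmax return the index label where the minimum or maximum value occurs
    * cumsum() is an accumulation: it returns running totals instead of a single value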

- 'describe' produces multiple summary statistics in one shot

- Descriptive and summary statistics: 
    * https://wesmckinney.com/book/pandas-basics#tbl-table_descriptive_stats
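
The reduction options mentioned above can be sketched with a small example. The DataFrame `scores` and its values are hypothetical (not from the original notes); the sketch only illustrates how `axis` and `skipna` change the result of `mean`. By default pandas skips NaN values, while `skipna=False` lets any NaN propagate into the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only; NaN marks a missing value
scores = pd.DataFrame(
    [[1.0, np.nan], [3.0, 4.0], [np.nan, np.nan]],
    index=["a", "b", "c"],
    columns=["one", "two"],
)

# Column means: NaN values are skipped by default
col_means = scores.mean()

# Row means: a row is NaN only when every value in it is NaN
row_means = scores.mean(axis="columns")

# Row means with skipna=False: any NaN in a row makes the result NaN
strict_means = scores.mean(axis="columns", skipna=False)

print(col_means)
print(row_means)
print(strict_means)
```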

**Covariance**
- It measures how much two variables change together
- +ve covariance indicates that the two variables tend to increase or decrease TOGETHER
- -ve covariance indicates that one variable tends to increase when the other decreases
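
To make the definition concrete, here is a minimal sketch that computes the sample covariance by hand and compares it with `Series.cov`. It reuses the values from the positive covariance example in these notes; the names `a`, `b`, `manual_cov` and `pandas_cov` are illustrative only.

```python
import pandas as pd

# Same values as columns A and B in the positive covariance example
a = pd.Series([10, 20, 30, 40, 50])
b = pd.Series([15, 25, 35, 45, 60])

# Sample covariance: sum of the products of deviations from the mean,
# divided by n - 1
manual_cov = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

# pandas computes the same quantity with Series.cov
pandas_cov = a.cov(b)

print(manual_cov, pandas_cov)
```

Both values are positive here because B rises whenever A rises.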

**Correlation**
- It measures the strength and direction of a linear relationship between two variables
- it ranges from '-1' to '1'
- 1 = perfect positive correlation
- -1 = perfect negative correlation
- 0 = no linear correlation
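
As a rough sketch of why correlation stays between -1 and 1: the Pearson correlation is the covariance divided by the product of the two standard deviations. The series `a` and `b` below are illustrative values in which one falls at a constant rate as the other rises; the variable names are assumptions made for this example.

```python
import pandas as pd

# Illustrative values: b falls by 10 each time a rises by 10
a = pd.Series([10, 20, 30, 40, 50])
b = pd.Series([50, 40, 30, 20, 10])

# Pearson correlation = covariance / (std of a * std of b)
manual_corr = a.cov(b) / (a.std() * b.std())

# pandas computes the same quantity with Series.corr
pandas_corr = a.corr(b)

print(manual_corr, pandas_corr)
```

Dividing by the standard deviations removes the units, so the result depends only on how tightly the points follow a straight line, not on the scale of the data.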

**Unique Values, Value Counts, and Membership**
- unique() gives an array of unique values in a Series
- value_counts() computes a Series containing value frequencies.
- isin() performs a vectorized set membership check
    * it's useful for filtering a dataset down to a subset of values in a Series or a column in a DataFrame
- Unique, value counts and set membership methods: 
    * https://wesmckinney.com/book/pandas-basics#tbl-table_binning
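
The three methods above can be sketched on a small Series. The Series `obj` and its values are hypothetical, chosen only to show what each call returns.

```python
import pandas as pd

# Hypothetical data for illustration only
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Distinct values, in order of first appearance
uniques = obj.unique()

# Frequency of each value, sorted from most to least common
counts = obj.value_counts()

# Boolean mask: True where the value is "b" or "c"
mask = obj.isin(["b", "c"])

# Use the mask to filter the Series down to the matching rows
subset = obj[mask]

print(uniques)
print(counts)
print(subset)
```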


In [4]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# create a multidimensional array
arr = np.array([[2, 4, 6], [3, 6, 9], [4, 8, 12], [5, 10, 15]])

print(arr)

#create dataframe using the array
df = pd.DataFrame(arr, index =["k", "l", "m", "n"], columns=['one', 'two', 'three'])
print(f'\ndataFrame:\n{df}')

[[ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]
 [ 5 10 15]]

dataFrame:
   one  two  three
k    2    4      6
l    3    6      9
m    4    8     12
n    5   10     15


In [5]:
# sum methods
df.sum()

one      14
two      28
three    42
dtype: int64

In [6]:
#passing axis="columns" sums across the columns (one total per row)
df.sum(axis="columns")

k    12
l    18
m    24
n    30
dtype: int64

In [7]:
#axis=1 is equivalent to axis="columns"
df.sum(axis=1)

k    12
l    18
m    24
n    30
dtype: int64

In [8]:
df

   one  two  three
k    2    4      6
l    3    6      9
m    4    8     12
n    5   10     15


In [10]:
#index labels of the max & min values in each column

print(df.idxmax())
print('\n')
print(df.idxmin())

one      n
two      n
three    n
dtype: object


one      k
two      k
three    k
dtype: object


In [13]:
df.cumsum()

   one  two  three
k    2    4      6
l    5   10     15
m    9   18     27
n   14   28     42


In [15]:
#describe for numeric data

print(df)

print(df.describe())

   one  two  three
k    2    4      6
l    3    6      9
m    4    8     12
n    5   10     15
            one        two      three
count  4.000000   4.000000   4.000000
mean   3.500000   7.000000  10.500000
std    1.290994   2.581989   3.872983
min    2.000000   4.000000   6.000000
25%    2.750000   5.500000   8.250000
50%    3.500000   7.000000  10.500000
75%    4.250000   8.500000  12.750000
max    5.000000  10.000000  15.000000


In [18]:
#describe for non-numeric data

letter_objeck = pd.Series(["w", "x", "y", "z"] * 3)
letter_objeck

0     w
1     x
2     y
3     z
4     w
5     x
6     y
7     z
8     w
9     x
10    y
11    z
dtype: object

In [20]:
letter_objeck.describe()

count     12
unique     4
top        w
freq       3
dtype: object

**POSITIVE COVARIANCE**

In [22]:
cov_data = {
    'A': [10, 20, 30, 40, 50],
    'B': [15, 25, 35, 45, 60]
}

df = pd.DataFrame(cov_data)
df

    A   B
0  10  15
1  20  25
2  30  35
3  40  45
4  50  60


In [23]:
cov_matrix = df.cov()
cov_matrix

       A      B
A  250.0  275.0
B  275.0  305.0


**NEGATIVE COVARIANCE**
- One variable increases while the other decreases

In [31]:
cov_data_negative = {
    'A': [10, 20, 30, 40, 50],
    'B': [50, 40, 30, 20, 10]
}

df_negative = pd.DataFrame(cov_data_negative)
df_negative

    A   B
0  10  50
1  20  40
2  30  30
3  40  20
4  50  10


In [32]:
cov_matrix_negative = df_negative.cov()
cov_matrix_negative

       A      B
A  250.0 -250.0
B -250.0  250.0


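**STRONG POSITIVE CORRELATION**
- In df (from In [22]) B rises with A, but not by a constant amount, so the correlation is close to 1 and not exactly 1

In [34]:
corr_matrix = df.corr()
corr_matrix

          A         B
A  1.000000  0.995893
B  0.995893  1.000000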


**PERFECT NEGATIVE CORRELATION**


In [35]:
corr_matrix_negative = df_negative.corr()
corr_matrix_negative

     A    B
A  1.0 -1.0
B -1.0  1.0
