# Descriptive Statistics

Descriptive statistics give a brief overview of the characteristics of a dataset by summarizing it. They fall into two basic categories of measures: measures of central tendency and measures of variability (or spread).
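
As a small illustration of the two categories, the sketch below uses Python's standard `statistics` module on the five `Age` values visible in the first rows of the Titanic file loaded later in this notebook. The variable names are only illustrative.

```python
import statistics

# Ages of the first five passengers shown by df.head() later in the notebook
ages = [34.5, 47.0, 62.0, 27.0, 22.0]

# Measures of central tendency
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Measure of variability: sample standard deviation (divides by n - 1)
std_age = statistics.stdev(ages)

print(f"mean={mean_age}, median={median_age}, std={std_age:.2f}")
```

The mean is pulled above the median here by the single 62-year-old passenger, which is the kind of skew the spread measures help reveal.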

#### Usage: 
This notebook can compute statistics for any tabular dataset, read either from a CSV file or from an existing dataset. Refer to the **Load the dataset** section of the notebook to try it with your own dataset file.

#### Methods used in this notebook:
- Detecting categorical vs. numerical features
- Missing data, such as features with empty values
- Mean of the features
- Median of the features
- Standard deviation of the features
- Visualization of the spread of the data
- Labels treated as features, which would let your model peek at the right answer during training
- Features with values outside the range you expect

### Libraries used 
- We use ``TensorFlow Data Validation (TFDV)`` to investigate and visualize the dataset. Understanding the input data is the most important step in building a data science pipeline, because problems in the data can harm the model's predictions.

#### Input: 
The input to this notebook is a tabular dataset.

#### Output:
The output of this notebook is the set of statistics generated from the dataset.

![Klee - Visual Analytics](https://github.com/nikbearbrown/Visual_Analytics/blob/main/IMG/Klee_Visual_Analytics.png?raw=true)


YouTube - https://www.youtube.com/c/NikBearBrown    
GitHub - https://github.com/nikbearbrown/Visual_Analytics   
Kaggle - https://www.kaggle.com/nikbearbrown   
Klee.ai (Visual AI) - http://klee.ai    



In [9]:
## install the library
## uncomment the lines below if TensorFlow Data Validation is not installed

# print('Installing TensorFlow Data Validation')
# !pip install -q tensorflow_data_validation[visualization]

## restart the runtime after installing

In [2]:
# imports 
import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv

print('TF version: {}'.format(tf.__version__))
print('TFDV version: {}'.format(tfdv.version.__version__))

TF version: 2.7.0
TFDV version: 1.5.0


## Load the dataset

In [3]:
# loading dataset
url = "https://github.com/nikbearbrown/Visual_Analytics/raw/main/CSV/titanic_dataset.csv"
df = pd.read_csv(url)
df.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Visualizing the statistics:

TFDV can compute descriptive statistics that give a quick overview of the data. ``tfdv.generate_statistics_from_csv()`` computes them directly from a CSV file; because the data here is already loaded into a pandas DataFrame, we use ``tfdv.generate_statistics_from_dataframe()`` instead.
- Numeric features and categorical features are visualized separately, with charts showing the distribution of each feature.
- Features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features. The percentage is the share of examples that have missing or zero values for that feature.
- Click "expand" above the charts to change the display.
- Hover over bars in the charts to display bucket ranges and counts.
- Switch between the log and linear scales, and notice how the log scale reveals much more detail.
- Select "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages.
In [6]:
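# alternative: compute the statistics straight from a CSV file
# (traintext would be the path to that file; it is not defined in this notebook)
# train_stats = tfdv.generate_statistics_from_csv(data_location=traintext)

# compute the statistics from the DataFrame loaded above
titanic_stats = tfdv.generate_statistics_from_dataframe(df)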

In [7]:
# raw output: a protocol buffer, printed in text format
titanic_stats

datasets {
  num_examples: 418
  features {
    num_stats {
      common_stats {
        num_non_missing: 418
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 41.8
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 41.8
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 41.8
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 41.8
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 41.8
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 41.8
          }
          buckets {
            low_value: 1.0
            high_value: 

In [8]:
# visual output of the statistics from the dataset
tfdv.visualize_statistics(titanic_stats)

## Schema for the data
We use ``tfdv.infer_schema`` to create a schema for the data. A schema serves as documentation of the data, which is especially useful when several people work on the same dataset.
The schema describes:
- which features are expected to be present
- their type
- the number of values for a feature in each example
- the presence of each feature across all examples
- the expected domains of features.
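
To make the listed properties concrete, here is a minimal pandas sketch of checking a frame against a hand-written schema. The `expected` dictionary and the `check_against_schema` helper are hypothetical and are not TFDV's schema format or API; they only mirror the ideas of presence, type, and domain.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical hand-written schema (NOT the TFDV schema format)
expected = {
    "Sex": {"numeric": False, "required": True, "domain": {"female", "male"}},
    "Age": {"numeric": True, "required": False, "domain": None},
}

def check_against_schema(frame, expected):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for name, rule in expected.items():
        # Presence of the feature itself
        if name not in frame.columns:
            problems.append(f"{name}: feature is missing")
            continue
        column = frame[name]
        # Type: numeric vs. non-numeric
        if is_numeric_dtype(column) != rule["numeric"]:
            problems.append(f"{name}: unexpected type {column.dtype}")
        # Presence across examples: required features may not have gaps
        if rule["required"] and column.isna().any():
            problems.append(f"{name}: required feature has missing values")
        # Domain: every observed value must be one of the allowed values
        if rule["domain"] is not None:
            unexpected = set(column.dropna().unique()) - rule["domain"]
            if unexpected:
                problems.append(f"{name}: values outside domain {sorted(unexpected)}")
    return problems

# Made-up sample with one value outside the 'Sex' domain
sample = pd.DataFrame({"Sex": ["male", "female", "unknown"], "Age": [34.5, None, 62.0]})
problems = check_against_schema(sample, expected)
print(problems)
```

In this sample only the `Sex` feature is reported, because `Age` is optional and its missing value is therefore allowed.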

In [9]:
schema = tfdv.infer_schema(statistics=titanic_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'PassengerId',INT,required,,-
'Pclass',INT,required,,-
'Name',BYTES,required,,-
'Sex',STRING,required,,'Sex'
'Age',FLOAT,optional,single,-
'SibSp',INT,required,,-
'Parch',INT,required,,-
'Ticket',BYTES,required,,-
'Fare',FLOAT,optional,single,-
'Cabin',STRING,optional,single,'Cabin'


Domain,Values
'Sex',"'female', 'male'"
'Cabin',"'A11', 'A18', 'A21', 'A29', 'A34', 'A9', 'B10', 'B11', 'B24', 'B26', 'B36', 'B41', 'B45', 'B51 B53 B55', 'B52 B54 B56', 'B57 B59 B63 B66', 'B58 B60', 'B61', 'B69', 'B71', 'B78', 'C101', 'C105', 'C106', 'C116', 'C130', 'C132', 'C22 C26', 'C23 C25 C27', 'C28', 'C31', 'C32', 'C39', 'C46', 'C51', 'C53', 'C54', 'C55 C57', 'C6', 'C62 C64', 'C7', 'C78', 'C80', 'C85', 'C86', 'C89', 'C97', 'D', 'D10 D12', 'D15', 'D19', 'D21', 'D22', 'D28', 'D30', 'D34', 'D37', 'D38', 'D40', 'D43', 'E31', 'E34', 'E39 E41', 'E45', 'E46', 'E50', 'E52', 'E60', 'F', 'F E46', 'F E57', 'F G63', 'F2', 'F33', 'F4', 'G6'"
'Embarked',"'C', 'Q', 'S'"


## Pandas Analysis

Numerical (continuous) columns

In [13]:
a = df.describe().T
a

Feature,count,mean,std,min,25%,50%,75%,max
PassengerId,418.0,1100.5,120.810458,892.0,996.25,1100.5,1204.75,1309.0
Pclass,418.0,2.26555,0.841838,1.0,1.0,3.0,3.0,3.0
Age,332.0,30.27259,14.181209,0.17,21.0,27.0,39.0,76.0
SibSp,418.0,0.447368,0.89676,0.0,0.0,0.0,1.0,8.0
Parch,418.0,0.392344,0.981429,0.0,0.0,0.0,0.0,9.0
Fare,417.0,35.627188,55.907576,0.0,7.8958,14.4542,31.5,512.3292


Categorical columns

In [14]:
print(df.columns.to_list())
df.describe().columns.to_list()

['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [15]:
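# columns skipped by describe() are treated as categorical
# (a set difference is used, so the original column order is not preserved)
categoricalColumns = list(set(df.columns.to_list()) - set(df.describe().columns.to_list()))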
categoricalColumns

['Ticket', 'Cabin', 'Sex', 'Name', 'Embarked']
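
Because the set difference does not keep the original column order, an order-preserving alternative is `DataFrame.select_dtypes`. The sketch below uses a small frame (`sample`) built from the first two rows shown earlier rather than the full Titanic data; the variable names are illustrative.

```python
import pandas as pd

# Two rows and four columns taken from the head of the Titanic file
sample = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"],
    "Age": [34.5, 47.0],
    "Sex": ["male", "female"],
})

# Split the columns by dtype; the original column order is kept
numerical_columns = sample.select_dtypes(include="number").columns.to_list()
categorical_columns = sample.select_dtypes(exclude="number").columns.to_list()

print(numerical_columns)
print(categorical_columns)
```

Note that a dtype-based split still treats integer codes such as `Pclass` as numerical, so columns like that may need to be reassigned by hand.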

In [16]:
b = df[categoricalColumns].describe().T
b

Feature,count,unique,top,freq
Ticket,418,363,PC 17608,5
Cabin,91,76,B57 B59 B63 B66,3
Sex,418,2,male,266
Name,418,418,"Foley, Mr. William",1
Embarked,418,3,S,270


In [17]:
df.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [18]:
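# stack the numerical summary (a) on top of the categorical summary (b)
mergeddf = pd.concat([a, b])

# add the dtype and the number of missing values per feature (aligned on the feature name)
mergeddf["dataTypes"] = df.dtypes
mergeddf["NullValues"] = df.isnull().sum()
mergeddf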

Feature,count,mean,std,min,25%,50%,75%,max,unique,top,freq,dataTypes,NullValues
PassengerId,418,1100.5,120.810458,892.0,996.25,1100.5,1204.75,1309.0,,,,int64,0
Pclass,418,2.26555,0.841838,1.0,1.0,3.0,3.0,3.0,,,,int64,0
Age,332,30.27259,14.181209,0.17,21.0,27.0,39.0,76.0,,,,float64,86
SibSp,418,0.447368,0.89676,0.0,0.0,0.0,1.0,8.0,,,,int64,0
Parch,418,0.392344,0.981429,0.0,0.0,0.0,0.0,9.0,,,,int64,0
Fare,417,35.627188,55.907576,0.0,7.8958,14.4542,31.5,512.3292,,,,float64,1
Ticket,418,,,,,,,,363.0,PC 17608,5.0,object,0
Cabin,91,,,,,,,,76.0,B57 B59 B63 B66,3.0,object,327
Sex,418,,,,,,,,2.0,male,266.0,object,0
Name,418,,,,,,,,418.0,"Foley, Mr. William",1.0,object,0
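
A similar combined table can be sketched in a single call with `describe(include="all")`, which reports numerical and categorical statistics side by side. The `sample` frame below is made up for illustration, and the `dataTypes` and `NullValues` column names simply mirror the ones used above.

```python
import pandas as pd

# Made-up sample with one numerical and two categorical features
sample = pd.DataFrame({
    "Age": [34.5, None, 62.0, 27.0],
    "Sex": ["male", "female", "male", "male"],
    "Cabin": ["B45", None, "B45", "C85"],
})

# One row per feature; statistics that do not apply to a feature are NaN
summary = sample.describe(include="all").T

# Add the dtype and the number of missing values per feature
summary["dataTypes"] = sample.dtypes
summary["NullValues"] = sample.isna().sum()
print(summary)
```

The trade-off is that the column order of the result is fixed by pandas, whereas concatenating two separate summaries keeps the numerical and categorical blocks visibly apart.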


In [19]:
# work on a copy of the merged summary
# (note: this rebinds the name `schema`, which previously held the TFDV schema)
schema = mergeddf.copy()
schema

Feature,count,mean,std,min,25%,50%,75%,max,unique,top,freq,dataTypes,NullValues
PassengerId,418,1100.5,120.810458,892.0,996.25,1100.5,1204.75,1309.0,,,,int64,0
Pclass,418,2.26555,0.841838,1.0,1.0,3.0,3.0,3.0,,,,int64,0
Age,332,30.27259,14.181209,0.17,21.0,27.0,39.0,76.0,,,,float64,86
SibSp,418,0.447368,0.89676,0.0,0.0,0.0,1.0,8.0,,,,int64,0
Parch,418,0.392344,0.981429,0.0,0.0,0.0,0.0,9.0,,,,int64,0
Fare,417,35.627188,55.907576,0.0,7.8958,14.4542,31.5,512.3292,,,,float64,1
Ticket,418,,,,,,,,363.0,PC 17608,5.0,object,0
Cabin,91,,,,,,,,76.0,B57 B59 B63 B66,3.0,object,327
Sex,418,,,,,,,,2.0,male,266.0,object,0
Name,418,,,,,,,,418.0,"Foley, Mr. William",1.0,object,0


In [20]:
# flag every feature that has at least one missing value
schema["FillMissingValues"] = schema.NullValues.apply(lambda x: "No" if x == 0 else "Yes")
# placeholder columns, left empty here
schema["RemoveFeature"]     = None
schema["NormalizeData"]     = None

In [63]:
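schema  # the saved output below (In [63]) appears to come from a run on a different dataset, not the Titanic data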

Feature,count,mean,std,min,25%,50%,75%,max,unique,top,freq,dataTypes,NullValues,FillMissingValues,RemoveFeature,NormalizeData
patient,120.0,60.5,34.785054,1.0,30.75,60.5,90.25,120.0,,,,int64,0,No,,
bp_before,120.0,156.45,11.389845,138.0,147.0,154.5,164.0,185.0,,,,int64,0,No,,
bp_after,120.0,151.358333,14.177622,125.0,140.75,149.5,161.0,185.0,,,,int64,0,No,,
sex,120.0,,,,,,,,2.0,Male,60.0,object,0,No,,
agegrp,120.0,,,,,,,,3.0,30-45,40.0,object,0,No,,


In [21]:
pd.DataFrame(schema.FillMissingValues)

Feature,FillMissingValues
PassengerId,No
Pclass,No
Age,Yes
SibSp,No
Parch,No
Fare,Yes
Ticket,No
Cabin,Yes
Sex,No
Name,No


In [22]:
schema.index[schema['FillMissingValues'] == "No"].tolist()

['PassengerId',
 'Pclass',
 'SibSp',
 'Parch',
 'Ticket',
 'Sex',
 'Name',
 'Embarked']
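
The notebook stops at flagging features. As one possible next step, the sketch below fills the flagged features using the median for numerical columns and the most frequent value for categorical ones. This strategy is an assumption chosen for illustration, not something the notebook prescribes, and `sample` is made-up data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up sample with gaps in one numerical and one categorical feature
sample = pd.DataFrame({
    "Age": [34.5, None, 62.0, 27.0],
    "Cabin": ["B45", None, "B45", "C85"],
    "Sex": ["male", "female", "male", "male"],
})

# Features that would be flagged "Yes" in a FillMissingValues column
to_fill = sample.columns[sample.isna().any()].tolist()

filled = sample.copy()
for name in to_fill:
    if is_numeric_dtype(filled[name]):
        # numerical feature: use the median of the observed values
        filled[name] = filled[name].fillna(filled[name].median())
    else:
        # categorical feature: use the most frequent observed value
        filled[name] = filled[name].fillna(filled[name].mode().iloc[0])

print(filled)
```

For a feature as sparse as `Cabin` in the Titanic data (327 of 418 values missing), filling may be less sensible than dropping the feature, which is presumably what the `RemoveFeature` column is reserved for.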