# Descriptive Statistics

Descriptive statistics give a brief overview of the characteristics of a dataset by summarizing it. They fall into two basic categories: measures of central tendency (such as the mean and median) and measures of variability, or spread (such as the standard deviation).
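As a minimal illustration of the two categories, the sketch below uses Python's standard `statistics` module on a small made-up list. The values are hypothetical and are not taken from any dataset used in this notebook.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle of the sorted values

# Measures of variability (spread)
std_value = statistics.pstdev(values)     # population standard deviation
value_range = max(values) - min(values)   # distance between the extremes

print(mean_value, median_value, std_value, value_range)
```

Note that pandas' `std()` and `describe()` use the sample standard deviation (dividing by n - 1) by default, so their figures differ slightly from the population formula used in this sketch.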

#### Usage: 
This notebook computes statistics for any tabular dataset, read either from a CSV file or from an in-memory dataset such as a pandas DataFrame. See the **Load the dataset** section of the notebook to try it with your own dataset file.

#### Methods used in this notebook:
- Detecting categorical vs. numerical features
- Missing data, such as features with empty values
- Mean of the features
- Median of the features
- Standard deviation of the features
- Visualization of the spread of the data
- Labels treated as features, so that your model gets to peek at the right answer during training
- Features with values outside the range you expect

### Libraries used 
- We use ``TensorFlow Data Validation (TFDV)`` to investigate and visualize the dataset. Understanding the input data is one of the most important steps in building a data science pipeline, because undetected problems in the data can degrade the model's predictions.

#### Input: 
The input to this notebook is a tabular dataset.

#### Output:
The output of this notebook is the set of statistics generated from the dataset.

In [9]:
## install the library
## uncomment the lines below if TensorFlow Data Validation is not installed

# print('Installing TensorFlow Data Validation')
# !pip install -q tensorflow_data_validation[visualization]

### Restart the runtime

In [1]:
# imports 
import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv

print('TF version: {}'.format(tf.__version__))
print('TFDV version: {}'.format(tfdv.version.__version__))

TF version: 2.7.0
TFDV version: 1.5.0


## Load the dataset

In [4]:
url = "https://github.com/nikbearbrown/Visual_Analytics/raw/main/CSV/titanic_train.csv"
train = pd.read_csv(url)
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
url = "https://github.com/nikbearbrown/Visual_Analytics/raw/main/CSV/titanic_test.csv"
test = pd.read_csv(url)
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Visualizing the statistics:

TFDV computes descriptive statistics that give a quick overview of the data. ``tfdv.generate_statistics_from_csv()`` computes them from a CSV file; alternatively, ``tfdv.generate_statistics_from_dataframe()`` works directly on a DataFrame, which is what we use below for the training dataset.
- Numeric and categorical features are visualized separately, with charts showing the distribution of each feature.
- Features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features. The percentage is the share of examples that have missing or zero values for that feature.
- Click "expand" above the charts to change the display.
- Hover over bars in the charts to display bucket ranges and counts.
- Switch between the log and linear scales, and notice how the log scale reveals much more detail.
- Select "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages.
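The red percentages described above can be approximated outside the TFDV viewer. The following is a rough pandas sketch on hypothetical toy data; it is not how TFDV computes its figures internally, only an illustration of what "percentage of examples with missing or zero values" means.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Parch": [0, 0, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Share of rows that are missing, and share that are exactly zero, per column
missing_pct = toy.isna().mean() * 100
zero_pct = (toy == 0).mean() * 100

report = pd.DataFrame({"missing_%": missing_pct, "zero_%": zero_pct})
print(report)
```

A high zero percentage is not necessarily a problem (most passengers travel without parents or children, for example), which is why the viewer treats it as a hint to inspect rather than an error.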

In [7]:
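# Option 1: compute statistics from a CSV file
# (`traintext` is not defined in this notebook; set it to the path of your CSV file)
# train_stats = tfdv.generate_statistics_from_csv(data_location=traintext)

# Option 2: compute statistics directly from the DataFrame loaded above
train_stats = tfdv.generate_statistics_from_dataframe(train)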

In [8]:
# raw output: the statistics are returned as a protocol buffer, shown here in text format
train_stats

datasets {
  num_examples: 891
  features {
    num_stats {
      common_stats {
        num_non_missing: 891
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 89.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 89.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 89.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 89.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 89.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 89.1
          }
          buckets {
            low_value: 1.0
            high_value: 

In [9]:
# visual output of the statistics from the dataset
tfdv.visualize_statistics(train_stats)

## Schema for the data
We use ``tfdv.infer_schema`` to infer a schema from the computed statistics. The schema doubles as documentation of the data, which helps when several people work on the same dataset.
The schema describes:
- which features are expected to be present
- their type
- the number of values for a feature in each example
- the presence of each feature across all examples
- the expected domains of features.
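To make the listed properties concrete, here is a rough pandas analogue that derives a type, a presence flag, and a value domain for each column. The helper `summarize_schema` and its `max_domain_size` parameter are hypothetical names introduced for this sketch; this is not TFDV's inference algorithm, and the toy data is made up.

```python
import pandas as pd

def summarize_schema(df, max_domain_size=10):
    """Build a rough, schema-like summary of a DataFrame (hypothetical helper)."""
    rows = []
    for name in df.columns:
        col = df[name]
        is_text = not pd.api.types.is_numeric_dtype(col)
        # Collect the distinct values of text columns as a candidate domain
        values = sorted(col.dropna().unique()) if is_text else []
        rows.append({
            "feature": name,
            "type": str(col.dtype),
            # "required" means no missing values were observed
            "presence": "required" if col.notna().all() else "optional",
            # Only low-cardinality text columns get a closed domain
            "domain": values if is_text and len(values) <= max_domain_size else None,
        })
    return pd.DataFrame(rows).set_index("feature")

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "Pclass": [3, 1, 3],
})
summary = summarize_schema(toy)
print(summary)
```

Like any inferred schema, this reflects only the data it has seen, so it is best treated as a draft to review rather than a final specification.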

In [10]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'PassengerId',INT,required,,-
'Survived',INT,required,,-
'Pclass',INT,required,,-
'Name',BYTES,required,,-
'Sex',STRING,required,,'Sex'
'Age',FLOAT,optional,single,-
'SibSp',INT,required,,-
'Parch',INT,required,,-
'Ticket',BYTES,required,,-
'Fare',FLOAT,required,,-


Domain,Values
'Sex',"'female', 'male'"
'Embarked',"'C', 'Q', 'S'"


## Compare Train vs Eval data for errors
Sometimes we focus heavily on the training dataset and give little thought to what the test set looks like. It is important that the datasets are consistent, so we validate them against the same schema. In particular, the evaluation dataset should contain roughly the same range of values for numerical features as the training dataset. Here the test set serves as the evaluation data.
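The range requirement can also be checked directly in pandas. The sketch below is one simple way to do it, assuming two DataFrames with overlapping numeric columns; `out_of_range_columns` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def out_of_range_columns(train_df, eval_df):
    """Return numeric columns whose eval values fall outside the training range."""
    flagged = {}
    numeric_cols = train_df.select_dtypes(include="number").columns
    for name in numeric_cols:
        if name not in eval_df.columns:
            continue  # a missing column is a schema problem, not a range problem
        lo, hi = train_df[name].min(), train_df[name].max()
        e_lo, e_hi = eval_df[name].min(), eval_df[name].max()
        if e_lo < lo or e_hi > hi:
            flagged[name] = {"train": (lo, hi), "eval": (e_lo, e_hi)}
    return flagged

# Hypothetical toy data: eval 'Age' exceeds the training maximum, 'Fare' does not
train_toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})
eval_toy = pd.DataFrame({"Age": [34.5, 47.0, 62.0], "Fare": [7.83, 8.00, 9.69]})

flagged = out_of_range_columns(train_toy, eval_toy)
print(flagged)
```

A strict min/max comparison is sensitive to single outliers; comparing quantiles instead would be a more forgiving variant.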

In [12]:
# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_dataframe(test)

- The visualization now shows statistics from both the training and evaluation datasets for each feature, which makes them easy to compare.

In [14]:
# docs-infra: no-execute
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

This gives us a good idea of how the statistics differ between the training and evaluation data.

In [15]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.

anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

# display the anomalies
tfdv.display_anomalies(anomalies)

Feature name,Anomaly short description,Anomaly long description
'Survived',Column dropped,Column is completely missing
'Fare',Multiple errors,"The feature has a shape, but it's not always present (if the feature is nested, then it should always be present at each nested level) or its value lengths vary. The feature was present in fewer examples than expected: minimum fraction = 1.000000, actual = 0.997608"
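Both anomalies in the table above come down to simple checks: a column the schema expects is absent, and a column that was always present in training has gaps. The reported fraction of 0.997608 would be consistent with a single missing `Fare` value among 418 rows (417/418 ≈ 0.997608), although the row count of the test file is not shown in this notebook. The sketch below reproduces both checks in pandas on hypothetical toy data; it illustrates the idea and is not TFDV's validation logic.

```python
import pandas as pd

# Hypothetical eval-like data: 'Fare' is missing in one row
# and there is no 'Survived' column at all
eval_toy = pd.DataFrame({
    "Pclass": [3, 3, 2, 3],
    "Fare": [7.83, None, 9.69, 8.66],
})
expected_columns = ["Survived", "Pclass", "Fare"]

# Columns the schema expects but the data lacks ("Column dropped")
dropped = [c for c in expected_columns if c not in eval_toy.columns]

# Fraction of rows in which each present column has a value
presence = eval_toy.notna().mean()
below_required = presence[presence < 1.0]

print(dropped)
print(below_required)
```

Whether such findings are real errors depends on context: a label column such as `Survived` is expected to be absent from data used for prediction, while a missing `Fare` value may call for either imputation or a relaxed presence requirement.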


## Pandas Analysis

In [16]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv") 

df

Unnamed: 0,patient,sex,agegrp,bp_before,bp_after
0,1,Male,30-45,143,153
1,2,Male,30-45,163,170
2,3,Male,30-45,153,168
3,4,Male,30-45,153,142
4,5,Male,30-45,146,141
...,...,...,...,...,...
115,116,Female,60+,152,152
116,117,Female,60+,161,152
117,118,Female,60+,165,174
118,119,Female,60+,149,151


Continuous (numerical) columns:

In [17]:
a = df.describe().T
a

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
patient,120.0,60.5,34.785054,1.0,30.75,60.5,90.25,120.0
bp_before,120.0,156.45,11.389845,138.0,147.0,154.5,164.0,185.0
bp_after,120.0,151.358333,14.177622,125.0,140.75,149.5,161.0,185.0


Categorical columns:

In [18]:
print(df.columns.to_list())
df.describe().columns.to_list()

['patient', 'sex', 'agegrp', 'bp_before', 'bp_after']


['patient', 'bp_before', 'bp_after']

In [19]:
# select the non-numeric columns; unlike a set difference, this keeps the original column order
categoricalColumns = df.select_dtypes(exclude="number").columns.to_list()
categoricalColumns

['sex', 'agegrp']

In [20]:
b = pd.DataFrame(df[categoricalColumns].describe()).T
b

Unnamed: 0,count,unique,top,freq
sex,120,2,Male,60
agegrp,120,3,30-45,40


In [21]:
df.isnull().sum()

patient      0
sex          0
agegrp       0
bp_before    0
bp_after     0
dtype: int64

In [22]:
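# stack the numerical summary (a) on top of the categorical summary (b)
mergeddf = pd.concat([a, b])

# add the dtype and the null count of each column, aligned on the column name
mergeddf["dataTypes"] = df.dtypes
mergeddf["NullValues"] = df.isnull().sum()
mergeddf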

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,unique,top,freq,dataTypes,NullValues
patient,120,60.5,34.785054,1.0,30.75,60.5,90.25,120.0,,,,int64,0
bp_before,120,156.45,11.389845,138.0,147.0,154.5,164.0,185.0,,,,int64,0
bp_after,120,151.358333,14.177622,125.0,140.75,149.5,161.0,185.0,,,,int64,0
sex,120,,,,,,,,2.0,Male,60.0,object,0
agegrp,120,,,,,,,,3.0,30-45,40.0,object,0


In [23]:
# note: this rebinds `schema`, which previously held the TFDV schema
schema = mergeddf.copy()
schema

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,unique,top,freq,dataTypes,NullValues
patient,120,60.5,34.785054,1.0,30.75,60.5,90.25,120.0,,,,int64,0
bp_before,120,156.45,11.389845,138.0,147.0,154.5,164.0,185.0,,,,int64,0
bp_after,120,151.358333,14.177622,125.0,140.75,149.5,161.0,185.0,,,,int64,0
sex,120,,,,,,,,2.0,Male,60.0,object,0
agegrp,120,,,,,,,,3.0,30-45,40.0,object,0


In [24]:
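# flag columns that contain at least one null value
schema["FillMissingValues"] = schema.NullValues.apply(lambda x: "No" if x == 0 else "Yes")
# the next two columns are left empty (None) here
schema["RemoveFeature"]     = None
schema["NormalizeData"]     = None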

In [25]:
schema

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,unique,top,freq,dataTypes,NullValues,FillMissingValues,RemoveFeature,NormalizeData
patient,120,60.5,34.785054,1.0,30.75,60.5,90.25,120.0,,,,int64,0,No,,
bp_before,120,156.45,11.389845,138.0,147.0,154.5,164.0,185.0,,,,int64,0,No,,
bp_after,120,151.358333,14.177622,125.0,140.75,149.5,161.0,185.0,,,,int64,0,No,,
sex,120,,,,,,,,2.0,Male,60.0,object,0,No,,
agegrp,120,,,,,,,,3.0,30-45,40.0,object,0,No,,


In [26]:
pd.DataFrame(schema.FillMissingValues)

Unnamed: 0,FillMissingValues
patient,No
bp_before,No
bp_after,No
sex,No
agegrp,No


In [27]:
schema.index[schema['FillMissingValues'] == "No"].tolist()

['patient', 'bp_before', 'bp_after', 'sex', 'agegrp']
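In the blood pressure data every column is flagged "No", so nothing needs filling. As a sketch of how such a flag could drive an imputation step, the example below applies the same flag logic to hypothetical toy data with gaps and fills numeric columns with the median and other columns with the most frequent value. The choice of median and mode is an assumption made for illustration, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical data with gaps in one numeric and one categorical column
toy = pd.DataFrame({
    "bp": [140.0, None, 160.0, 150.0],
    "sex": ["Male", "Female", None, "Female"],
    "patient": [1, 2, 3, 4],
})

# Same flag logic as above: "Yes" where a column has at least one null
flags = toy.isnull().sum().apply(lambda x: "No" if x == 0 else "Yes")
to_fill = flags.index[flags == "Yes"].tolist()

filled = toy.copy()
for name in to_fill:
    if pd.api.types.is_numeric_dtype(filled[name]):
        # numeric column: fill with the median of the observed values
        filled[name] = filled[name].fillna(filled[name].median())
    else:
        # non-numeric column: fill with the most frequent observed value
        filled[name] = filled[name].fillna(filled[name].mode()[0])

print(to_fill)
print(filled)
```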