# Descriptive Statistics and Data Visualization

This lecture introduces the fundamentals of descriptive statistics and data visualization. It moves from data collection methods and variable types, through summary statistics, to visualizing, cleaning, and normalizing data.


## Table of Contents
1. [Getting the Data](#getting-the-data)
2. [Statistical Variables and Types of Data](#statistical-variables-and-types-of-data)
3. [Datasets & Design Matrices](#datasets--design-matrices)
4. [Descriptive Statistics](#descriptive-statistics)
5. [Data Visualization](#data-visualization)
6. [Data Cleaning](#data-cleaning)
7. [Data Normalization](#data-normalization)

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

## 1. Getting the Data

Data collection is the crucial first step in any data analysis process. There are several methods to obtain data, each with its own advantages and limitations.

### 1.1 Surveys

Surveys are structured questionnaires used to gather information from individuals or groups. They can be conducted through various means:
- **Paper questionnaires**: Traditional method, good for controlled environments
- **Online forms**: Cost-effective, wide reach, easy data collection
- **Telephone interviews**: Personal interaction, often higher response rates than self-administered forms
- **In-person interviews**: Can yield the most detailed responses, at the highest cost per respondent

**Key characteristics of surveys:**
- Structured and standardized questions
- Can be administered in various formats
- Allow for quantifiable responses
- Useful for collecting opinions, attitudes, and demographic information

### 1.2 Experiments

Experiments allow researchers to collect data in a **controlled** environment and are primarily used to establish cause-and-effect relationships between variables.

**Randomized Controlled Trials (RCTs)** are the gold standard:
- **Controlled manipulation** of variables
- **Random assignment** of participants to treatment groups
- **Replicability** to test hypotheses
- **Control groups** to isolate the effect of the treatment

**Example**: Testing drug effectiveness
- Population: People with a specific disease
- Sample: Random selection of patients
- Treatment: Half receive the drug, half receive placebo
- Randomization: Assignment is random to avoid bias
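
The random assignment step in the example above can be sketched as follows. This is a minimal illustration, assuming a hypothetical sample of 20 patient IDs and an even split; real trials typically use more elaborate schemes (blocking, stratification).

```python
import numpy as np

# Hypothetical patient IDs; in a real trial these come from the enrolled sample.
patient_ids = np.arange(1, 21)

rng = np.random.default_rng(seed=0)      # fixed seed so the split is reproducible
shuffled = rng.permutation(patient_ids)  # random order removes any selection pattern

# First half receives the drug, second half the placebo.
half = len(shuffled) // 2
treatment_group = np.sort(shuffled[:half])
placebo_group = np.sort(shuffled[half:])

print("Treatment:", treatment_group)
print("Placebo:  ", placebo_group)
```

Because the order is shuffled before splitting, no characteristic of a patient can influence which group they land in.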

### 1.3 Observational Studies

Observational data collection involves passive recording of information as it naturally occurs, without interference from the researcher.

**Key characteristics:**
- No direct manipulation of variables
- Captures data as it naturally occurs
- Useful for studying complex, uncontrolled environments
- Often used when experiments are unethical or impractical

**Example**: Studying the effects of smoking
- Cannot ethically ask people to smoke
- Observe existing smokers vs. non-smokers
- Track health outcomes over time

### 1.4 Data Collection Methods

#### Scraping
Web scraping involves automatically extracting data from websites:
- **Advantages**: Large amounts of data, automated collection
- **Challenges**: Legal considerations, changing website structures
- **Tools**: BeautifulSoup, Scrapy, Selenium
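
To make the idea of extracting data from page markup concrete, here is a small sketch. It deliberately uses Python's standard-library `html.parser` instead of the tools listed above, so that it runs without extra installs; the HTML fragment, product names, and prices are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would first download the page.
HTML = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(HTML)
print(parser.rows)
```

Libraries such as BeautifulSoup wrap this kind of parsing in a more convenient query interface, but the underlying task is the same: walk the markup and keep the pieces you need.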

#### APIs (Application Programming Interfaces)
APIs provide structured access to data from various services:
- **Advantages**: Structured data, official access, real-time updates
- **Examples**: Twitter API, Google Maps API, Weather APIs
- **Considerations**: Rate limits, authentication requirements
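
APIs usually return JSON, which then has to be flattened into a table. The sketch below assumes a hypothetical response body (the city name, field names, and readings are invented, and no real service is called) and uses `pd.json_normalize` to turn the nested records into a DataFrame.

```python
import json
import pandas as pd

# Hypothetical response body from a weather-style API (not a real service).
payload = """
{"city": "Exampleville",
 "readings": [
   {"time": "2024-01-01T00:00", "temp_c": 3.5},
   {"time": "2024-01-01T01:00", "temp_c": 3.1}
 ]}
"""

data = json.loads(payload)

# One row per reading; the top-level "city" field is repeated on each row.
readings = pd.json_normalize(data, record_path="readings", meta=["city"])
print(readings)
```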

#### Public Datasets
Many organizations provide free access to datasets:

**Kaggle**: 
- One of the largest online communities of data scientists
- Competitions with real-world datasets
- Wide variety of domains (finance, healthcare, sports, etc.)

**UCI Machine Learning Repository**:
- Academic datasets for research
- Well-documented and clean datasets
- Classic datasets for learning and benchmarking

**ISTAT (Italian National Institute of Statistics)**:
- Official Italian government statistics
- Demographic, economic, and social data
- Reliable and authoritative source

### 1.5 Sample vs Population

Understanding the distinction between sample and population is crucial for proper statistical inference.

**Population (Ω)**:
- The complete set of all possible observations
- Often theoretical or too large to study entirely
- Examples: All people in the world, all possible coin tosses

**Sample**:
- A subset of the population: {ω⁽¹⁾, ω⁽²⁾, ..., ω⁽ⁿ⁾} ⊆ Ω
- Practical and manageable size
- Should be representative of the population

**Key considerations**:
- **Sampling bias**: When the sample doesn't represent the population
- **Sample size**: Larger samples generally give more precise estimates, but they do not correct sampling bias
- **Sampling method**: Random sampling is preferred for generalizability
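
The effect of sampling bias can be illustrated with a small simulation. This is a sketch on synthetic data: the "population" of heights and its parameters (mean 170, standard deviation 10) are made up for the purpose of the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic population: 100,000 heights in cm (made-up parameters).
population = rng.normal(loc=170, scale=10, size=100_000)

# Simple random sample: every unit has the same chance of being selected.
random_sample = rng.choice(population, size=500, replace=False)

# Biased sample: only units taller than 175 cm can be selected.
biased_sample = rng.choice(population[population > 175], size=500, replace=False)

print(f"Population mean:    {population.mean():.1f}")
print(f"Random sample mean: {random_sample.mean():.1f}")
print(f"Biased sample mean: {biased_sample.mean():.1f}")
```

Both samples have the same size, yet only the random one tracks the population mean. Collecting more units with the biased procedure would not help.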

In [3]:
# Example: Loading a dataset from a remote source

# Read a CSV file directly from a URL (pandas downloads and parses it)
titanic_url = 'https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv'
titanic = pd.read_csv(titanic_url, index_col='PassengerId')

print("Titanic Dataset Shape:", titanic.shape)
print("\nFirst 5 rows:")
titanic.head()

Titanic Dataset Shape: (891, 11)

First 5 rows:


| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 2. Statistical Variables and Types of Data

Statistical variables are the building blocks of data analysis. Understanding different types of variables is essential for choosing appropriate analysis methods.

### 2.1 Definition of Statistical Variables

A statistical variable is a function that maps observations to values:

**X : Ω → S**
**ω ↦ x**

Where:
- Ω is the population
- S is the set of possible values
- ω is an observation
- x is the value assigned to that observation
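
The definition above can be read quite literally as code: a variable is a function applied to each observation. The sketch below uses a tiny hypothetical population of two people; the names and values are invented.

```python
# A tiny hypothetical population Ω: each observation ω is one person.
population = [
    {"name": "Ada", "height_cm": 165, "eye_color": "brown"},
    {"name": "Ben", "height_cm": 180, "eye_color": "blue"},
]

def height(omega):
    """Quantitative variable: maps an observation to a number."""
    return omega["height_cm"]

def eye_color(omega):
    """Qualitative variable: maps an observation to a category."""
    return omega["eye_color"]

# Applying X to every ω yields the observed values x.
heights = [height(omega) for omega in population]
colors = [eye_color(omega) for omega in population]
print(heights, colors)
```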

### 2.2 Classification by Nature

#### Quantitative Variables
Represent properties that can be measured numerically:
- **Examples**: Height, weight, age, income, temperature
- **Operations**: Can perform arithmetic operations (addition, subtraction, etc.)
- **Analysis**: Can calculate mean, standard deviation, correlation

#### Qualitative (Categorical) Variables
Represent properties that describe categories or qualities:
- **Examples**: Gender, color, brand, nationality, education level
- **Operations**: Cannot perform arithmetic operations
- **Analysis**: Use frequencies, proportions, chi-square tests

### 2.3 Classification by Continuity

#### Discrete Variables
- Can assume a **finite** or **countably infinite** number of values
- Often result from counting
- **Examples**: Number of children, number of cars, dice rolls
- **Visualization**: Bar charts, frequency tables

#### Continuous Variables
- Can assume any value within an interval (an **uncountably infinite** set of values)
- Often result from measuring
- **Examples**: Height, weight, time, temperature
- **Visualization**: Histograms, density plots

### 2.4 Scales of Measurement

#### Nominal Scale
- **No natural ordering** of categories
- Categories are mutually exclusive
- **Examples**: Gender (male, female), eye color (blue, brown, green)
- **Operations**: Equality (=, ≠)
- **Statistics**: Mode, frequencies

#### Ordinal Scale
- Categories have a **natural ordering**
- Differences between categories are not necessarily equal
- **Examples**: Education level (high school, bachelor's, master's), satisfaction rating (poor, fair, good, excellent)
- **Operations**: Equality, ordering (<, >)
- **Statistics**: Median, percentiles (in addition to mode and frequencies)

#### Interval Scale
- **Equal intervals** between consecutive values
- **No true zero** point
- **Examples**: Temperature in Celsius, calendar years
- **Operations**: Addition, subtraction (ratios are not meaningful: 20 °C is not "twice as hot" as 10 °C)
- **Statistics**: Mean, standard deviation

#### Ratio Scale
- Equal intervals AND **true zero** point
- **Examples**: Height, weight, age, income
- **Operations**: All arithmetic operations
- **Statistics**: All descriptive statistics, geometric mean
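
In pandas, the difference between nominal and ordinal scales can be encoded with unordered and ordered categoricals. The following is a small sketch with made-up responses; it shows which operations each scale supports.

```python
import pandas as pd

# Nominal: categories without an order.
eye_color = pd.Series(pd.Categorical(["blue", "brown", "green", "brown"], ordered=False))

# Ordinal: categories with an explicit order, lowest first.
levels = ["poor", "fair", "good", "excellent"]
rating = pd.Series(
    pd.Categorical(["good", "poor", "excellent", "fair", "good"],
                   categories=levels, ordered=True)
)

print(eye_color.mode()[0])          # the mode is meaningful on a nominal scale
print(rating.min(), rating.max())   # min/max need an ordering, so ordinal only
print((rating >= "good").sum())     # comparisons follow the declared order
```

Declaring the order explicitly matters: without it, pandas would fall back to treating the labels as unordered and refuse the comparison.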

### 2.5 Grouped Variables

Sometimes continuous variables are **grouped** into categories for analysis:
- **Age groups**: 0-18, 19-35, 36-65, 66+
- **Income brackets**: Low, Medium, High
- **Advantages**: Simplifies analysis, easier interpretation
- **Disadvantages**: Loss of information, arbitrary boundaries

In [4]:
# Example: Identifying variable types in the Titanic dataset

print("Titanic Dataset Info:")
print(titanic.info())
print("\n" + "="*50)

# Classify variables by type
# (Survived and Pclass are stored as integers but encode categories, so they are qualitative)
quantitative_vars = ['Age', 'SibSp', 'Parch', 'Fare']
qualitative_vars = ['Survived', 'Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

print("\nQuantitative Variables:")
for var in quantitative_vars:
    if var in titanic.columns:
        print(f"- {var}: {titanic[var].dtype}")

print("\nQualitative Variables:")
for var in qualitative_vars:
    if var in titanic.columns:
        print(f"- {var}: {titanic[var].dtype}")

# Example of creating grouped variables
# pd.cut uses right-closed bins by default: (0, 18], (18, 35], (35, 65], (65, 100]
titanic['Age_Group'] = pd.cut(titanic['Age'], 
                             bins=[0, 18, 35, 65, 100], 
                             labels=['Child', 'Young Adult', 'Adult', 'Senior'])

print("\nAge Groups Distribution:")
print(titanic['Age_Group'].value_counts())

Titanic Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
None


Quantitative Variables:
- Age: float64
- SibSp: int64
- Parch: int64
- Fare: float64

Qualitative Variables:
- Survived: int64
- Pclass: int64
- Name: object
- Sex: object
- Ticket: object
- Cabin: object
- Embarked: object

Age Groups Distribution:
Age_Group
Young Adult    358
Adult          209
Chi

## 3. Datasets & Design Matrices

### 3.1 Structure of Datasets

A dataset is typically organized as a **design matrix** or **data matrix** where:
- **Rows** represent observations (samples, instances)
- **Columns** represent variables (features, attributes)
- Each cell contains the value of a specific variable for a specific observation
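
As a minimal sketch of this layout, the snippet below builds a hypothetical dataset with three observations and two variables (the column names and values are invented) and extracts the underlying matrix.

```python
import pandas as pd

# Hypothetical dataset: 3 observations (rows) x 2 variables (columns).
df = pd.DataFrame(
    {"height_cm": [165.0, 180.0, 172.0], "weight_kg": [60.0, 82.0, 70.0]},
    index=["obs_1", "obs_2", "obs_3"],
)

X = df.to_numpy()                      # the design matrix as a NumPy array
n_observations, n_variables = X.shape

print(X.shape)
print(X[1, 0])   # row 1, column 0: height of the second observation
```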

### 3.2 Long vs Wide Data

Data can be organized in different formats depending on the analysis needs:

#### Wide Format
- Each variable has its own column
- Each observation is a single row
- **Advantages**: Easy to read, good for analysis
- **Use cases**: Most statistical analyses, machine learning

#### Long Format
- Variables are stacked into fewer columns
- Multiple rows per observation
- **Advantages**: Efficient storage, good for certain visualizations
- **Use cases**: Time series, repeated measures, some plotting functions

In [5]:
# Example: Wide vs Long format

# Create sample data in wide format
wide_data = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 92, 78],
    'Science': [88, 85, 92],
    'English': [92, 88, 85]
})

print("Wide Format:")
print(wide_data)

# Convert to long format
long_data = pd.melt(wide_data, 
                   id_vars=['Student'], 
                   var_name='Subject', 
                   value_name='Score')

print("\nLong Format:")
print(long_data)

# Convert back to wide format (pivot sorts the Subject columns alphabetically)
wide_again = long_data.pivot(index='Student', columns='Subject', values='Score')
print("\nBack to Wide Format:")
print(wide_again)

Wide Format:
   Student  Math  Science  English
0    Alice    85       88       92
1      Bob    92       85       88
2  Charlie    78       92       85

Long Format:
   Student  Subject  Score
0    Alice     Math     85
1      Bob     Math     92
2  Charlie     Math     78
3    Alice  Science     88
4      Bob  Science     85
5  Charlie  Science     92
6    Alice  English     92
7      Bob  English     88
8  Charlie  English     85

Back to Wide Format:
Subject  English  Math  Science
Student                        
Alice         92    85       88
Bob           88    92       85
Charlie       85    78       92


## 4. Descriptive Statistics

Descriptive statistics summarize the main features of a dataset with a small set of numbers, such as where the values are centered and how widely they are spread.

### 4.1 Measures of Central Tendency

Central tendency measures indicate where the center of a distribution lies.

#### Mean (Arithmetic Average)
The sum of all values divided by the number of values:

**x̄ = (1/n) Σᵢ xᵢ**

- **Advantages**: Uses all data points, mathematically tractable
- **Disadvantages**: Sensitive to outliers
- **Best for**: Symmetric distributions without extreme outliers

#### Median
The middle value when data is ordered from smallest to largest:

**Median = x₍ₖ₎ with k = (n+1)/2** (if n is odd)

**Median = (x₍ₖ₎ + x₍ₖ₊₁₎)/2 with k = n/2** (if n is even)

Here x₍ᵢ₎ denotes the i-th smallest value in the sorted data.

- **Advantages**: Robust to outliers, good for skewed distributions
- **Disadvantages**: Doesn't use all information
- **Best for**: Skewed distributions, ordinal data

#### Mode
The most frequently occurring value(s):

- **Advantages**: Can be used with any type of data
- **Disadvantages**: May not exist or may not be unique
- **Best for**: Categorical data, finding most common value
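
A short worked example may help compare the three measures. The sketch below uses Python's standard `statistics` module on made-up values, one of which (40) acts as an outlier.

```python
import statistics

# Made-up data with one large value (40) acting as an outlier.
values = [2, 3, 3, 5, 7, 10, 40]

mean_all = statistics.mean(values)       # sum of values divided by n
median_all = statistics.median(values)   # middle of the sorted values (n is odd)
mode_all = statistics.mode(values)       # most frequent value

# Drop the outlier to see which measures move.
trimmed = values[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)   # n is even: average of the two middle values

print(f"With outlier:    mean={mean_all}, median={median_all}, mode={mode_all}")
print(f"Without outlier: mean={mean_trimmed}, median={median_trimmed}")
```

Removing the single outlier halves the mean (10 to 5) but moves the median only from 5 to 4, which is what "robust to outliers" means in practice.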

### 4.2 Measures of Spread (Dispersion)

Dispersion measures indicate how spread out the data values are.

#### Range
The difference between the maximum and minimum values:

**Range = max(x) - min(x)**

- **Advantages**: Simple to calculate and understand
- **Disadvantages**: Very sensitive to outliers

#### Variance
The average squared deviation from the mean. The sample version divides by n-1 rather than n (Bessel's correction):

**s² = Σᵢ(xᵢ - x̄)² / (n-1)** (sample variance)

- **Advantages**: Uses all data points, mathematically useful
- **Disadvantages**: Units are squared, sensitive to outliers

#### Standard Deviation
The square root of variance:

**s = √(s²)**

- **Advantages**: Same units as original data, interpretable
- **Disadvantages**: Sensitive to outliers

#### Interquartile Range (IQR)
The difference between the 75th and 25th percentiles:

**IQR = Q₃ - Q₁**

- **Advantages**: Robust to outliers
- **Disadvantages**: Doesn't use all data
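
The formulas in this section can be checked by hand on a small example. The sketch below uses eight made-up values and compares the manual sample variance with NumPy's built-in result. Note that the quartiles depend on the interpolation rule; NumPy's default (linear) is assumed here.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)   # made-up values
n = len(x)

data_range = x.max() - x.min()
mean = x.sum() / n
sample_var = ((x - mean) ** 2).sum() / (n - 1)   # divide by n-1, as in the formula above
sample_std = np.sqrt(sample_var)
q1, q3 = np.percentile(x, [25, 75])              # NumPy's default linear interpolation
iqr = q3 - q1

# The manual results should agree with NumPy's built-ins when ddof=1.
assert np.isclose(sample_var, np.var(x, ddof=1))
assert np.isclose(sample_std, np.std(x, ddof=1))

print(f"range={data_range}, variance={sample_var:.3f}, std={sample_std:.3f}, IQR={iqr}")
```

Passing `ddof=1` is what selects the n-1 denominator; NumPy's default (`ddof=0`) divides by n instead.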

### 4.3 Quantiles, Quartiles, and Percentiles

#### Quantiles
Values that split the ordered data at a given proportion:
- **Quantile of order α** (q_α, with 0 ≤ α ≤ 1): A value such that a proportion α of the observations is less than or equal to it
- **Interpretation**: If q_α = x, then approximately α×n observations have values ≤ x

#### Percentiles
Quantiles whose order is expressed as a percentage (0-100%): the p-th percentile is the quantile of order α = p/100

#### Quartiles
Specific quantiles that divide data into four equal parts:
- **Q₀** (0th quartile): Minimum value
- **Q₁** (1st quartile): 25th percentile
- **Q₂** (2nd quartile): 50th percentile (median)
- **Q₃** (3rd quartile): 75th percentile
- **Q₄** (4th quartile): Maximum value
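
The interpretation of a quantile can be verified directly on simple data. The sketch below uses the integers 1 to 100 (chosen only for convenience) and NumPy's default linear interpolation; other interpolation rules would give slightly different values.

```python
import numpy as np

data = np.arange(1, 101)   # the integers 1..100, already sorted

alpha = 0.9
q_alpha = np.quantile(data, alpha)       # quantile of order 0.9
share_below = np.mean(data <= q_alpha)   # fraction of observations <= q_alpha

quartiles = np.quantile(data, [0, 0.25, 0.5, 0.75, 1])   # Q0 through Q4

print(f"q_{alpha} = {q_alpha:.1f}, share of data at or below it = {share_below:.2f}")
print("Quartiles:", quartiles)
```

The share of observations at or below the quantile comes out equal to α, and Q0 and Q4 coincide with the minimum and maximum.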

In [None]:
# Example: Calculating descriptive statistics

# Create sample data
np.random.seed(42)
sample_data = np.random.normal(100, 15, 1000)  # Normal distribution, mean=100, std=15

# Add some outliers
outliers = [150, 160, 40, 30]
data_with_outliers = np.concatenate([sample_data, outliers])

# Calculate measures of central tendency
print("MEASURES OF CENTRAL TENDENCY")
print("="*40)
print(f"Mean: {np.mean(data_with_outliers):.2f}")
print(f"Median: {np.median(data_with_outliers):.2f}")
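# Round to integers first: a continuous sample rarely repeats exact values.
# keepdims=False (SciPy >= 1.9) makes .mode a scalar so the format spec works.
print(f"Mode: {stats.mode(data_with_outliers.round(), keepdims=False).mode:.2f}")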

# Calculate measures of dispersion
print("\nMEASURES OF DISPERSION")
print("="*40)
print(f"Range: {np.ptp(data_with_outliers):.2f}")
print(f"Variance: {np.var(data_with_outliers, ddof=1):.2f}")
print(f"Standard Deviation: {np.std(data_with_outliers, ddof=1):.2f}")
print(f"IQR: {np.percentile(data_with_outliers, 75) - np.percentile(data_with_outliers, 25):.2f}")

# Calculate quartiles and percentiles
print("\nQUARTILES AND PERCENTILES")
print("="*40)
quartiles = np.percentile(data_with_outliers, [0, 25, 50, 75, 100])
print(f"Q0 (Min): {quartiles[0]:.2f}")
print(f"Q1 (25th): {quartiles[1]:.2f}")
print(f"Q2 (Median): {quartiles[2]:.2f}")
print(f"Q3 (75th): {quartiles[3]:.2f}")
print(f"Q4 (Max): {quartiles[4]:.2f}")

# Compare with and without outliers
print("\nIMPACT OF OUTLIERS")
print("="*40)
print(f"Mean without outliers: {np.mean(sample_data):.2f}")
print(f"Mean with outliers: {np.mean(data_with_outliers):.2f}")
print(f"Median without outliers: {np.median(sample_data):.2f}")
print(f"Median with outliers: {np.median(data_with_outliers):.2f}")

In [None]:
# Descriptive statistics for Titanic dataset
print("TITANIC DATASET - DESCRIPTIVE STATISTICS")
print("="*50)

# Numerical variables
numerical_stats = titanic[['Age', 'SibSp', 'Parch', 'Fare']].describe()
print("\nNumerical Variables:")
print(numerical_stats)

# Categorical variables
print("\nCategorical Variables:")
print("\nSurvival Rate:")
print(titanic['Survived'].value_counts(normalize=True))

print("\nPassenger Class Distribution:")
print(titanic['Pclass'].value_counts(normalize=True))

print("\nGender Distribution:")
print(titanic['Sex'].value_counts(normalize=True))