# Univariate Analysis
The prefix "uni-" comes from Latin and means "single" or "one". Univariate analysis is the simplest form of analysis and focuses on a single variable. It can take the form of descriptive or inferential statistics and can be applied to both numerical and categorical data.

- **Descriptive statistics**: Used to describe the characteristics or features of a dataset. The term ‘descriptive statistics’ can be used to describe both individual quantitative observations (also known as ‘summary statistics’) as well as the overall process of obtaining insights from these data.
- **Inferential statistics**: Used to make generalizations about a larger population based on a representative sample of that population. Because inferential statistics focuses on making predictions (rather than stating facts), its results are usually expressed as probabilities.


In [1]:
import numpy as np
import pandas as pd

# Dataset
For our analysis we load the **titanic-disaster-dataset**. 

In [2]:
data = pd.read_csv("../Datasets/titanic-disaster-dataset/titanic.csv")

In [3]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


We use the "_age_" variable for the numerical examples and the "_sex_" variable for the categorical examples.

# Descriptive Statistics
In univariate analysis, descriptive statistics are used to summarize and explore the characteristics of a variable. In this section, we provide an overview of commonly used descriptive statistics for both categorical and numerical data, and show how to interpret the results through examples.

## Summary Statistics
Summary statistics are used to summarize and provide information about a set of observations. They provide a condensed version of the data and are used to describe the main characteristics of the dataset, including the central tendency, dispersion, and distribution of the data. Examples of summary statistics include mean, median, mode, standard deviation, variance, minimum and maximum values, quartiles, and percentiles. These statistics are used to gain a quick understanding of the overall pattern and behavior of the data, and to help identify trends and relationships in the data.

### Arithmetic Mean
Arithmetic mean is also referred to as the arithmetic average. It is a measure of central tendency of a finite set of numbers. More specifically, the mean is the sum of the values divided by the number of values. Given a data set $X = \{x_1, x_2, ..., x_n\}$, the mean is denoted using an overhead bar, $\overline{x}$ and is defined by the following formula:

$$\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{x_1 + x_2 + ... + x_n}{n}$$

In [4]:
# Using numpy
np.mean(data.age)

29.8811345124283

In [5]:
# Using pandas
data.age.mean()

29.8811345124283
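
The formula can also be checked by hand. The following is a minimal sketch on a small, made-up list of ages; the values and the helper name `arithmetic_mean` are illustrative assumptions and are not taken from the Titanic dataset. Missing values are dropped before averaging, which mirrors the default behaviour of `Series.mean()` in pandas.

```python
import math


def arithmetic_mean(values):
    """Return the sum of the non-missing values divided by their count."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(present) / len(present)


# Hypothetical sample with one missing entry (not Titanic data)
ages = [22.0, 38.0, float("nan"), 26.0, 35.0]
print(arithmetic_mean(ages))  # (22 + 38 + 26 + 35) / 4 = 30.25
```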

### Median
The median is the middle value of a data set: it separates the lower half of the values from the upper half. With the values sorted in ascending order, $x_1 \le x_2 \le ... \le x_n$, the median is the middle value if $n$ is odd, and the mean of the two middle values if $n$ is even:
$$\mbox{If $n$ is odd, } median(x) = x_{(n+1)/2}$$
$$\mbox{If $n$ is even, } median(x) = \frac{ x_{n/2} + x_{(n/2)+1}}{2}$$

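In [6]:
# Using numpy
# np.median returns nan when the column contains missing values,
# so we use np.nanmedian, which ignores them
np.nanmedian(data.age)

28.0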

In [7]:
# Using pandas (missing values are skipped by default)
data.age.median()

28.0
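
The odd and even cases of the definition can be sketched in plain Python. The helper name `median_sorted` and the sample lists are assumptions made for illustration; the input is assumed to contain no missing values.

```python
def median_sorted(values):
    """Median of a list of numbers: sort first, then pick the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle value
        return ordered[mid]
    # Even length: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_sorted([5, 1, 3]))     # odd length  -> 3
print(median_sorted([4, 1, 3, 2]))  # even length -> 2.5
```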

### Mode

The mode is the most frequently occurring value in a data set. In other words, it is the value that is most likely to be sampled from a population. Like the mean and the median, the mode is a measure of central tendency, and it is the only one of the three that applies to categorical data. It is defined as:

$$Mode(X) = \underset{x_i}{\operatorname{arg\,max}} \; P(X=x_i) $$

In [8]:
# Using pandas
data.sex.mode()

0    male
dtype: object
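
pandas returns a Series rather than a single value because a data set can have several modes when values tie for the highest count. A minimal sketch of the same idea in plain Python follows; the helper name `modes` and the sample lists are illustrative assumptions.

```python
from collections import Counter


def modes(values):
    """Return every value that reaches the highest frequency, sorted."""
    counts = Counter(values)
    if not counts:
        raise ValueError("the mode of an empty data set is undefined")
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes(["male", "female", "male"]))  # ['male']
print(modes([1, 1, 2, 2, 3]))             # tie -> [1, 2]
```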

### Standard Deviation

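The standard deviation is a measure of dispersion: it describes how far the values typically lie from the mean. It is the square root of the variance, which is also a measure of dispersion and is often used because it is easier to manipulate in calculations. For a data set of $N$ values they are defined as:

$$ Var(X) = \frac{\sum_{i=1}^{N} (x_i - \overline{x})^2}{N}, \qquad \sigma = \sqrt{Var(X)} $$

This is the population form; the sample form divides by $N - 1$ instead of $N$. NumPy uses the population form by default and pandas the sample form, which is why the two results below differ slightly.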

In [9]:
# Using numpy
np.std(data.age)

14.406608234217705

In [10]:
# Using pandas
data.age.std()

14.413499699923594
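
Both libraries accept a `ddof` ("delta degrees of freedom") argument that selects the divisor $N - ddof$. The sketch below shows the two conventions on a small, made-up series; the values are an assumption chosen so the arithmetic is easy to follow and are not Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample without missing values
ages = pd.Series([22.0, 38.0, 26.0, 35.0])

# NumPy default: ddof=0, divide by N (population form)
population_std = np.std(ages.to_numpy())

# pandas default: ddof=1, divide by N - 1 (sample form)
sample_std = ages.std()

print(population_std)
print(sample_std)

# Passing ddof explicitly makes the two libraries agree
print(np.isclose(ages.std(ddof=0), population_std))
print(np.isclose(np.std(ages.to_numpy(), ddof=1), sample_std))
```

Stating `ddof` explicitly in both calls avoids depending on either library's default.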

### Minimum & Maximum

The minimum is simply the smallest value of a data set and the maximum is the largest value.

In [11]:
# Using numpy
np.min(data.age)

0.1667

In [12]:
# Using pandas
data.age.min()

0.1667

In [13]:
# Using numpy
np.max(data.age)

80.0

In [14]:
# Using pandas
data.age.max()

80.0

### Percentiles

The $k^{th}$ percentile is the value below which $k$% of the observations in a data set fall. For example, the $50^{th}$ percentile of a data set is the median, because 50% of the values fall below it.

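In [15]:
# Using numpy
# np.percentile returns nan when the column contains missing values,
# so we use np.nanpercentile, which ignores them
print(f"25th percentile: {np.nanpercentile(data.age, 25)}")
print(f"50th percentile: {np.nanpercentile(data.age, 50)}")
print(f"75th percentile: {np.nanpercentile(data.age, 75)}")

25th percentile: 21.0
50th percentile: 28.0
75th percentile: 39.0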


In [16]:
# Using pandas (quantiles are given as fractions between 0 and 1)
print(f"0.25 quantile: {data.age.quantile(.25)}")
print(f"0.50 quantile: {data.age.quantile(.5)}")
print(f"0.75 quantile: {data.age.quantile(.75)}")

0.25 quantile: 21.0
0.50 quantile: 28.0
0.75 quantile: 39.0
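
When the requested percentile falls between two observations, the result has to be interpolated. The sketch below implements linear interpolation between the two nearest sorted values, which is the default method in both `np.percentile` and `Series.quantile`. The helper name `percentile_linear` and the sample values are illustrative assumptions, and the input is assumed to contain no missing values.

```python
import math


def percentile_linear(values, k):
    """k-th percentile (0 <= k <= 100) using linear interpolation."""
    if not 0 <= k <= 100:
        raise ValueError("k must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    # Fractional position of the percentile in the sorted list
    pos = (n - 1) * k / 100
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    # Move from the lower neighbour towards the upper one
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


ages = [22, 26, 35, 38]  # hypothetical sample
print(percentile_linear(ages, 25))  # position 0.75 -> 25.0
print(percentile_linear(ages, 50))  # position 1.5  -> 30.5
```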


## Frequency Table

In [17]:
data.sex.value_counts()

male      843
female    466
Name: sex, dtype: int64

In [18]:
data.survived.value_counts()

0    809
1    500
Name: survived, dtype: int64
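
A frequency table lists how often each category occurs. Absolute counts are often easier to compare when shown next to relative frequencies, which `value_counts(normalize=True)` provides. The following is a minimal sketch on a small, made-up series; the values and the column labels `count` and `share` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical sample (not Titanic data)
sex = pd.Series(["male", "female", "male", "male", "female"])

counts = sex.value_counts()                # absolute frequencies
shares = sex.value_counts(normalize=True)  # relative frequencies, sum to 1

# Combine both into a single frequency table, aligned on the category
table = pd.DataFrame({"count": counts, "share": shares})
print(table)
```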

## Quantile Table