# Exploratory Data Analysis (EDA)

#### Today we will focus on how to...
- load data
- analyze individual attributes
- analyze the relationships between attributes
- visualize data (suitable types of visualizations, properties of good visualizations, how not to deceive with visualization)

#### Before we start analyzing the data, we should clarify...
- What questions should we answer with the analysis?
- What task do we have to solve?

#### In this subject we will deal with only two ML tasks
- Classification
- Regression

#### In both cases, we try to find a function $f$ of the attributes $X$ that predicts the value of the dependent variable $Y$
- In the case of regression, $Y \in \mathbb{R}$
- In the case of classification, $Y \in \{C_1, C_2, \dots, C_N\}$

Both tasks are examples of **supervised learning**.

# We can start EDA

- Describe the data together with their characteristics = **Descriptive statistics**
- Formulate and verify data hypotheses = **Data visualization** + inferential statistics
- Identify relationships between attributes = **Dependencies** (e.g. correlations)
- Identify problems in the data = What we will have to solve as part of preprocessing

#### Possible problems in the data

* inappropriate data structure (data is not in tabular form or one entity is described by several rows of the table)
* duplicate records, or ambiguous mapping between records
* inconsistent data formats
* missing values
* outliers (values that deviate strongly from the rest of the data)
* and more
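
Several of these problems can be checked with a few pandas calls. The following is only a minimal sketch on a made-up table: the column names `id`, `height_cm` and `measured` are hypothetical, and the date check assumes that ISO format (`YYYY-MM-DD`) is the intended one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one duplicate row, one missing value
# and two different date formats in the same column.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "height_cm": [170.0, 165.0, 165.0, np.nan],
    "measured": ["2021-01-05", "05/01/2021", "05/01/2021", "2021-01-07"],
})

n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
missing_per_column = df.isna().sum()        # number of missing values per column

# Rows whose date string does not match the assumed format YYYY-MM-DD
bad_dates = ~df["measured"].str.fullmatch(r"\d{4}-\d{2}-\d{2}")

print("duplicate rows:", n_duplicates)
print(missing_per_column)
print(df[bad_dates])
```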

# Iris dataset

Three species: setosa, virginica, versicolor
<img src="https://i.imgur.com/PQqYGaW.png" width="40%" />

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats as stats

In [None]:
iris = sns.load_dataset("iris")
iris.info()

In [None]:
iris.shape[0] - iris.dropna().shape[0]

In [None]:
iris[iris.isnull().any(axis=1)]

In [None]:
iris.species.unique()

# Attribute types
* Continuous (numeric)
* Discrete (categorical) - nominal vs. ordinal

**Beware of categorical attributes that are represented numerically, i.e. the numbers just code the category**
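
As an illustration, a column that stores a category as integers still has a computable mean, but the number is not meaningful. The sketch below uses a hypothetical `pclass` column with the codes 1, 2, 3 and assumes the codes are ordered; it shows one way to tell pandas that the column is categorical.

```python
import pandas as pd

# Hypothetical column: a class label stored as the integers 1, 2, 3.
df = pd.DataFrame({"pclass": [1, 3, 2, 3, 3, 1]})

# This runs without error, but the result has no useful interpretation.
numeric_mean = df["pclass"].mean()

# Declare the column as an ordered categorical (ordinal) attribute.
df["pclass"] = pd.Categorical(df["pclass"], categories=[1, 2, 3], ordered=True)

# Frequencies are the appropriate summary for a categorical attribute.
counts = df["pclass"].value_counts().sort_index()
print(df["pclass"].dtype)
print(counts)
```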

### Univariate analysis - Analysis of attributes one by one

* **continuous** - descriptive statistics (average, median, ...), distributions
* **categorical** - number of unique values, frequency of their occurrence

### Bivariate analysis - Pair analysis

* **continuous-continuous** - dependence, correlation
* **continuous-categorical** - differences in the value of the continuous attribute depending on the category
* **categorical-categorical** - table, ratio of frequency of values

In [None]:
iris.describe()

In [None]:
iris.describe(exclude=np.number)

# Univariate analysis - Analysis of individual attributes: Continuous attributes

We want to show the shape of the data distribution, whether the values are clustered around some **center**, and what the **dispersion** of the values is.

## The most common Measures of Central Tendency

* **mean**
* **median**: the middle value that separates the higher and lower values
* **mode** (modal value, most likely value): most frequent value (value with the greatest probability of occurrence)

In [None]:
x = np.array([1000, 1000, 1200, 1100, 10000])
x.mean()

In [None]:
np.median(x)

In [None]:
stats.mode(x)

## Measures of Dispersion

* **variance**: mean of the squared deviations from the mean 
$$ E[(X-E[X])^2] $$

* **standard deviation**: the square root of the variance; it is expressed in the units of the measured variable
$$ s = \sqrt{\frac{1}{N-1}\sum_{i=1}^N{(x_i-\overline{x})^2}} $$ 

* **range**: max - min
* **quartile**: value below which 25% (first quartile) or 75% (third quartile) of the values lie
* **percentile**: value below which XX% of the values lie
* **interquartile range**: difference between the third (75%) and the first (25%) quartile; less sensitive to outliers than the range
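
These measures can be computed directly with NumPy. The sketch below reuses the small array from the central tendency example. Note that NumPy divides by $N$ by default, so `ddof=1` is passed to match the $N-1$ formula above (pandas uses `ddof=1` by default).

```python
import numpy as np

x = np.array([1000, 1000, 1200, 1100, 10000])

variance = x.var(ddof=1)              # sample variance (divides by N - 1)
std = x.std(ddof=1)                   # sample standard deviation, same units as x
value_range = x.max() - x.min()       # strongly affected by the single value 10000
q1, q3 = np.percentile(x, [25, 75])   # first and third quartile
iqr = q3 - q1                         # much less affected by the outlier

print("variance:", variance)
print("std:", std)
print("range:", value_range)
print("IQR:", iqr)
```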

### We use two basic types of graphs to visualize continuous variables

* Boxplot
* Histogram

In [None]:
iris[iris.columns.difference(['species'])].plot.box()

## Histogram

- For continuous variables, **pyplot.hist** or **seaborn.histplot** may be used (the older **seaborn.distplot** is deprecated).
- For discrete variables, *seaborn.countplot* is more convenient.

In [None]:
iris.petal_length.plot.hist(bins=30)

## Data distribution

In [None]:
sns.displot(iris.petal_length, bins=30)

## Skewness and Kurtosis

### Skewness

Skewness is a measure of asymmetry.

The coefficient of skewness is a metric of how skewed a distribution is. A perfectly symmetric distribution has a coefficient equal to 0. A distribution skewed to the right (long right tail, positive skew) has a coefficient greater than 0; a distribution skewed to the left (long left tail, negative skew) has a coefficient less than 0.

<img src="https://miro.medium.com/max/600/1*nj-Ch3AUFmkd0JUSOW_bTQ.jpeg" alt="Skewness explained" />

So, when is the skewness too much?
* If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
* If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are moderately skewed.
* If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.
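
The rule of thumb above can be wrapped in a small helper. This is only a sketch: `describe_skewness` is a hypothetical function name, and the thresholds are the informal ones from the list, not a statistical test.

```python
import numpy as np
from scipy import stats

def describe_skewness(values):
    """Return the sample skewness and a rule-of-thumb label for it."""
    g = stats.skew(values)
    if abs(g) <= 0.5:
        label = "fairly symmetrical"
    elif abs(g) <= 1:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    return g, label

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # theoretical skewness 0
right_skewed = rng.exponential(size=10_000)  # theoretical skewness 2

print(describe_skewness(symmetric))
print(describe_skewness(right_skewed))
```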

In [None]:
sample_size = 10000

norm = stats.norm(0, 1)
x = np.linspace(-5, 5, 100)
sample = norm.rvs(sample_size)

plt.plot(x, norm.pdf(x))
plt.hist(sample, bins=20)
plt.title("Normal distribution: ""Skewness %.5f" % (stats.skew(sample), ))

In [None]:
sample_size = 1000

chi2 = stats.chi2(5)
x = np.linspace(0, 30, 100)
sample = chi2.rvs(sample_size)

plt.plot(x, chi2.pdf(x))
plt.hist(sample, bins=20)
plt.title("Chi-squared(5) distribution: ""Skewness %.5f" % (stats.skew(sample)))

In [None]:
sample_size = 1000

chi2 = stats.chi2(5)
x = np.linspace(0, 30, 100)
sample = 30 - chi2.rvs(sample_size)

plt.plot(x, chi2.pdf(30 - x))
plt.hist(sample, bins=20)
plt.title("30 - Chi-squared(5) distribution: ""Skewness %.5f" % (stats.skew(sample), ))

### Kurtosis

- Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution
- The kurtosis coefficient measures the amount of data concentrated in the tails. It therefore expresses the tendency of a given distribution to produce outliers (values far from the center of the distribution).
- It is very often compared to the kurtosis of the normal distribution, which is 3. If it is greater than 3, more data is concentrated in the tails; if it is less than 3, there is less data in the tails.
- *Excess kurtosis* is also often used. It is the difference from the kurtosis of the normal distribution, i.e. kurtosis - 3.

<img src="https://excelrcom.b-cdn.net/assets/admin/ckfinder/userfiles/images/tableau1/tableau2/tableau3/tableau4/tableau5/tableau6/skewness-kurtosis_1JPG-.jpg" width="50%"/>

In [None]:
sample_size = 100000

norm = stats.norm(0, 1)
x = np.linspace(-5, 5, 100)
sample = norm.rvs(sample_size)

plt.plot(x, norm.pdf(x))
plt.hist(sample, bins=20)
plt.title("Normal distribution: ""Kurtosis %.5f" % (stats.kurtosis(sample), ))

By default (`fisher=True`), `stats.kurtosis` returns the excess kurtosis, so the value for a normal distribution is close to 0.

In [None]:
sample_size = 100000

norm = stats.norm(0,1)
x = np.linspace(-7, 7, 100)
sample = norm.rvs(sample_size)

plt.plot(x, norm.pdf(x))
plt.hist(sample, bins=20)
plt.title("Normal distribution: ""Kurtosis %.5f" % (stats.kurtosis(sample, fisher=False), ))
# fisher=False returns the (non-excess) kurtosis, which is close to 3 for a normal distribution

In [None]:
sample_size = 1000

logistic = stats.logistic()
x = np.linspace(-7, 7, 100)
sample = logistic.rvs(sample_size)

plt.plot(x, logistic.pdf(x))
plt.hist(sample, bins=20)

plt.title("Logistic distribution: ""Kurtosis %.5f" % (stats.kurtosis(sample, fisher=False)))

In [None]:
sample_size = 1000

uniform = stats.uniform()
x = np.linspace(-7, 7, 100)
sample = uniform.rvs(sample_size)

plt.plot(x, uniform.pdf(x))
plt.hist(sample, bins=20)

plt.title("Uniform distribution: ""Kurtosis %.5f" % (stats.kurtosis(sample, fisher=False)))

## Univariate analysis - Analysis of individual attributes - Categorical attributes

The most common form of display is a frequency table showing either the number of observations for each unique attribute value or its ratio to the total number of observations.

For graphical visualization, a **column graph (bar plot)** is used.
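
Both variants of the frequency table can be obtained with `value_counts`. The sketch below uses a small made-up Series (the values are hypothetical) so that it runs without any data file.

```python
import pandas as pd

# Hypothetical categorical attribute.
colors = pd.Series(["G", "E", "G", "F", "E", "G", "D", "G"], name="color")

freq = pd.DataFrame({
    "count": colors.value_counts(),                # absolute frequencies
    "ratio": colors.value_counts(normalize=True),  # relative frequencies, sum to 1
})
print(freq)
```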

In [None]:
diamonds = pd.read_csv('data/diamonds.csv')
diamonds.head()

In [None]:
diamonds.color.value_counts()

In [None]:
diamonds.color.value_counts().plot(kind='bar')

### When is it appropriate to use a column (bar) chart, and when a pie chart? What are their advantages and disadvantages?

In [None]:
diamonds.color.value_counts().plot(kind='pie')

For more than 3-4 values, it is better to use a bar chart.

## Bivariate analysis - Pair analysis

### Continuous - continuous: Scatter plot

The most common way to visualize the relationship of two continuous attributes.

Shows the distribution in the value space. It allows you to see if there are any natural clusters in the data.

In [None]:
plt.scatter(iris.sepal_length, iris.sepal_width)

In [None]:
sns.pairplot(iris, hue="species")

## Correlation

A value in the range [-1, 1] that tells how strong the **linear** relationship is between the attributes.

* -1 perfect negative correlation
* 0 no correlation
* 1 perfect positive correlation

Pearson's correlation coefficient:
$$ corr(X, Y) = \frac{cov(X,Y)}{\sigma_X\sigma_Y} = \frac{E[(X-E[X])(Y-E[Y])]}{\sigma_X\sigma_Y }$$

$$ r_{xy} = \frac{\sum_{i=1}^{n}{(x_i-\overline{x})(y_i-\overline{y})}}{(n-1)s_xs_y} $$
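
The sample formula for $r_{xy}$ can be checked numerically. The sketch below evaluates it term by term on a small made-up sample and compares the result with `np.corrcoef`.

```python
import numpy as np

# Small made-up sample with an almost linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
s_x = x.std(ddof=1)   # sample standard deviations (N - 1 in the denominator)
s_y = y.std(ddof=1)

# r_xy = sum((x_i - mean(x)) * (y_i - mean(y))) / ((n - 1) * s_x * s_y)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * s_x * s_y)
r_numpy = np.corrcoef(x, y)[0, 1]

print("manual: %.6f, numpy: %.6f" % (r_manual, r_numpy))
```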

<img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Data_exploration_4.png" width="50%"/>

### Correlation between two variables X and Y with values in [-1, +1]

Pearson's correlation coefficient measures the **linear relationship** between two variables.

However, there may be another type of dependency between two variables.

Alternatives to the Pearson correlation coefficient that do not require linearity, only monotonicity, are:
* Spearman's coefficient
* Kendall's $\tau$
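
The difference is easiest to see on a relationship that is monotonic but not linear. In the sketch below the data are synthetic ($y = e^{x/2}$); the rank-based coefficients equal 1, while Pearson's coefficient is noticeably lower.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21)
y = np.exp(x / 2.0)   # strictly increasing, but far from linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print("Pearson:  %.3f" % pearson_r)
print("Spearman: %.3f" % spearman_rho)
print("Kendall:  %.3f" % kendall_tau)
```

In pandas, the same coefficients are available through the `method` parameter of `corr`, e.g. `iris.petal_length.corr(iris.petal_width, method="spearman")`.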

In [None]:
sns.regplot(x="petal_length", y="petal_width", data=iris)
print("Pearson correlation: %.3f" % iris.petal_length.corr(iris.petal_width))

In [None]:
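# species is not numeric, so it is dropped before computing correlations
# (newer pandas versions raise an error otherwise)
iris.drop(columns="species").corr()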

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(iris.drop(columns="species").corr(), ax=ax, annot=True, fmt=".3f")

## Correlation does not imply causation - Correlation $\neq$ causality

- If two phenomena are correlated, it may be a coincidence. (*See examples of spurious correlations here:* http://tylervigen.com/spurious-correlations)
- Or there may be some other phenomenon that is the cause of both. (*E.g. student participation in lectures can be correlated with their final grade in the subject, but maybe more hard-working students attend lectures, who would have a better grade anyway.*)
- **Proving causality is non-trivial; it generally requires a controlled (randomized) experiment.**
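
The lecture attendance example can be imitated with a small simulation. This is a sketch with made-up assumptions: a hidden variable `diligence` drives both `attendance` and `grade`, and attendance has no direct effect on the grade at all. The two are still clearly correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

diligence = rng.normal(size=n)                # hidden common cause
attendance = diligence + rng.normal(size=n)   # depends only on diligence
grade = diligence + rng.normal(size=n)        # depends only on diligence

# Attendance and grade are correlated although neither influences the other.
r_raw = np.corrcoef(attendance, grade)[0, 1]

# In a simulation we know the true effect of diligence (coefficient 1),
# so we can subtract it; the remaining parts are uncorrelated.
r_adjusted = np.corrcoef(attendance - diligence, grade - diligence)[0, 1]

print("raw correlation:      %.3f" % r_raw)
print("after removing cause: %.3f" % r_adjusted)
```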

## Bivariate analysis - Pair analysis: Continuous - categorical

Here we most often split the observations by the categorical value and display the distribution of the numerical values in each subset, for example using histograms or boxplots.

In other words, we reuse the visualizations for continuous attributes, once per category.
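
The same split can also be summarized numerically with `groupby`. The sketch below uses a tiny made-up table (the species labels `a` and `b` and the values are hypothetical) so that it runs on its own; the same call should work on `iris` with `species` and `petal_length`.

```python
import pandas as pd

# Hypothetical miniature of a dataset with one categorical and one continuous attribute.
df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "petal_length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9],
})

# Descriptive statistics of the continuous attribute within each category.
summary = df.groupby("species")["petal_length"].agg(["count", "mean", "median", "std"])
print(summary)
```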

In [None]:
sns.boxplot(x='species', y='petal_length', data=iris)

## Bivariate analysis - Pair analysis: Categorical - categorical

* Contingency table
* Heatmap
* Pairplot
* Correlation visualization

In [None]:
titanic = pd.read_csv('data/titanic/train.csv')
titanic.head()

In [None]:
# frequency table
titanic["Survived"].value_counts()

In [None]:
survived_class = pd.crosstab(index=titanic["Survived"], 
                             columns=titanic["Pclass"])
survived_class.index= ["died","survived"]
survived_class

In [None]:
sns.heatmap(survived_class, annot=True, fmt="d")

In [None]:
survived_class_perc = pd.crosstab(index=titanic["Survived"],
                                  columns=titanic["Pclass"],
                                  normalize='columns')  # other options: 'index', 'all'
survived_class_perc.index = ["died", "survived"]

sns.heatmap(survived_class_perc, 
            annot=True, 
            fmt=".4f")
survived_class_perc

In [None]:
pd.crosstab(index=titanic["Survived"],
            columns=[titanic["Pclass"], titanic["Sex"]],
            margins=True)

In [None]:
pd.crosstab(index=titanic["Pclass"], columns=titanic["Survived"]).plot.bar(stacked=True)

# Visualizations help us understand the data
### If they are done well...