# Introduction to Statistics and Big Data Analytics
---
mail: marco.bovo@unibo.it

unibo: [website](https://www.unibo.it/sitoweb/marco.bovo/)

address: DISTAL department Viale Fanin 46, Bologna.

---

# Suggested Reading

- **Statistical Rethinking** -  Richard McElreath (Great [YouTube playlist](https://www.youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XHt7uUVGWWVSus))
- **Elements of Statistical Learning** - T. Hastie, R. Tibshirani, J. Friedman
- **Pattern Recognition and Machine Learning** - Christopher M. Bishop
- **Deep Learning** - I. Goodfellow, Y. Bengio ([Available for Free](https://www.deeplearningbook.org/))
- **Think Python 2nd Ed.** - Allen B. Downey ([Available for Free](https://greenteapress.com/wp/think-python-2e/))

# Course Structure

1. Introduction to Colab and Python
2. The Data
3. Basic Statistics
4. Linear Models
5. Multivariate Linear Models

# Introduction to Python

## Why Programming?

- **REPRODUCIBILITY**: Allow others to use your analysis on new data and use cases
- **REPLICABILITY**
  - Repeat old analysis months later
  - Allow others to do the same
- In many cases, it is the only way to handle large amounts of data
- In general, it is a useful skill to have

## Why Python?

> Python is the best at nothing, but second best at most things

- It is a good all-rounder
- High-level language (closer to human language than machine) $\to$ Readability
- Large community, extensive documentation and **A LOT** of libraries
- Generally better for Machine Learning and Deep Learning than R

## Variables and Data Structures

A *variable* is a name that points to a value stored in memory. You can think of it as a label you attach to some information so you can reuse it later.

In Python, variables are **dynamically typed**, which means you don’t need to declare the type in advance—Python figures it out for you based on the value you assign.
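A minimal sketch of what dynamic typing means in practice: the same name can be re-bound to values of different types, and `type()` reports the type of whatever it currently points to. The variable name `x` is just an illustration.

```python
# The same name can point to values of different types over time
x = 5
print(type(x))   # an integer

x = 3.14
print(type(x))   # now a float

x = "five"
print(type(x))   # now a string
```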

A data structure is a way to organize and store multiple pieces of data. In this section we start with the simplest building blocks:

-   Numbers (int, float)
-   Text (str, strings)

In [None]:
# How to define variables (numbers)
a = 5       # integer (int)
b = 3.14    # floating point number (float)

print("a =", a)
print("b =", b)

print("a + b =", a + b)   # arithmetic
print("a > b =", a > b)   # comparison (returns True/False)

In [None]:
a = "Hello World" # This is a String

print("Sum of strings: ", a + ", I am string")
print("UPPER function: ", a.upper())
print("lower function: ", a.lower())
print("Splitting strings :", a.split(" "))

## Lists and Tuples

A list is a mutable collection of objects. A tuple is an immutable collection of objects.
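To make "mutable" versus "immutable" concrete, here is a small sketch (the names `my_list` and `my_tuple` are illustrative): a list can be changed in place, while trying to change a tuple raises a `TypeError`.

```python
my_list = [1, 2, 3]
my_list[0] = 100        # lists can be modified in place
my_list.append(4)       # and can grow
print(my_list)

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 100   # tuples cannot be modified
except TypeError as error:
    print("Cannot modify a tuple:", error)
```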

In [None]:
a = 5
b = 3.14

this_is_a_list = [a, b, 3, 4.4, "Hello World", 4]
print(this_is_a_list, "This is a list")

this_is_a_tuple = (a, b, "hello world", 4)
print(this_is_a_tuple, "This is a tuple")

In [None]:
print("Select first element: ", this_is_a_list[0])
print("Select fifth element: ", this_is_a_list[4])
print("Select elements 1, 2, 3: ", this_is_a_list[1: 4])
print("Select last element :", this_is_a_list[-1])

## Dictionaries

A dictionary is a mutable collection of key-value pairs.

Imagine we have a dataset of people and we want to store each person's age.

We can use a dictionary:

In [None]:
age_dictionary = {"Alice": 25, "Bob": 30, "Charlie": 28}
print(age_dictionary)

In [None]:
# We can retrieve data for one person
print(age_dictionary["Alice"])

In [None]:
# We can add a new person
age_dictionary["Dave"] = 32
print(age_dictionary)

## Conditionals

In [None]:
a = 5

if a > 0:
    print("a is positive")
elif a < 0:
    print("a is negative")
else:
    print("a is zero")

## Functions

In [None]:
# Function definition
def myadd(a, b):
    return a + b # The return statement outputs the result

print("Pass arguments in order: ", myadd(5, 3))
print("Pass arguments by names: ", myadd(a=5, b=3))
print("Pass arguments by names: ", myadd(b=3, a=5))

## When to write a function?
- Every time you have to repeat a piece of code
- It gives a name and generalization to that specific piece of code

## Iterations

In [None]:
this_is_a_list = [1, 2, "Hello"]

print(this_is_a_list[0])
print(this_is_a_list[1])
print(this_is_a_list[2])

In [None]:
this_is_a_list = [1, 2, "Hello"]

for element in this_is_a_list:
    print(element)

## Imports

You will find yourself **always** importing functionality from other libraries.
How do you install new libraries?

In a Google Colab notebook the command is:

```python
!pip install package_name
```

For example:

In [None]:
!pip install scikit-learn
!pip install statsmodels

In [None]:
import sklearn
print(sklearn.__version__)

## Main Libraries:

- [pandas](https://pandas.pydata.org/) to handle tabular data
- [numpy](https://numpy.org/) for numerical computation
- [matplotlib](https://matplotlib.org/) for plots
- [seaborn](https://seaborn.pydata.org/) for better plots
- [statsmodels](https://www.statsmodels.org/stable/index.html) for statistical analysis
- [scikit-learn](https://scikit-learn.org/stable/) for machine learning
- [tensorflow](https://www.tensorflow.org/) or [pytorch](https://pytorch.org/) for deep learning

## $\color{red}{\text{Exercise}}$

Define a function that takes as input this list of numbers:

[1, -1, 0, -100, 300]

Iterate through the numbers and print if a number is positive, negative or zero.

## Solution
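One possible solution, given as a sketch rather than the only correct answer. The function name `describe_signs` is our own choice, and returning the labels in addition to printing them is an extra that makes the result easier to check.

```python
def describe_signs(numbers):
    """Print whether each number is positive, negative or zero."""
    labels = []
    for number in numbers:
        if number > 0:
            label = "positive"
        elif number < 0:
            label = "negative"
        else:
            label = "zero"
        print(number, "is", label)
        labels.append(label)
    return labels


labels = describe_signs([1, -1, 0, -100, 300])
```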

# The Data
 _Issues with data_

- Type
  - Quantitative, Qualitative, Structured, ...
- Quality
  - Data are never perfect: missing values, inconsistencies, ...
  - Handle outliers: a small amount of data that differs from the rest (anomalies or errors)
  - Understand how to properly collect **OR** how data was collected

- Better Data Quality $\to$ Better results
  - <span style="color:red">garbage-in-garbage-out</span>.

  - Pre-processing:
    - Transform your data **BEFORE** analysis to make the later steps easier

## Data Types

or variable types

- **Categorical** (Discrete)
  - **Nominal** - Variables with no ordering (sex, colors, names...)
  - **Ordinal** - Variables with an order (months, minerals' hardness, severity of illness...)

- **Numerical** (Continuous)
  - **Interval** - Continuous variables with an arbitrary 0 point (temperature in Celsius)
  - **Ratio** - Like interval variables but with a unique 0 definition (temperature in Kelvin, length, mass, counts)
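As an illustration of the nominal/ordinal distinction, pandas lets you declare whether a categorical variable is ordered. This is a sketch with made-up values (`severity`, `colors`): an ordered categorical supports `min()` and `max()`, while a nominal one has no ordering.

```python
import pandas as pd

# Ordinal: the categories have a meaningful order
severity = pd.Categorical(
    ["mild", "severe", "moderate", "mild"],
    categories=["mild", "moderate", "severe"],
    ordered=True,
)
print("Lowest severity:", severity.min())
print("Highest severity:", severity.max())

# Nominal: categories without any ordering
colors = pd.Categorical(["red", "green", "red"])
print("Are colors ordered?", colors.ordered)
```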

## Data storage format

There are many ways to store data.
What you will typically see in your career are two formats:

1. Tables in databases
2. Tables in files

The first one goes beyond the scope of this course.



## Comma Separated Values

CSV (Comma Separated Values) is one of the most common formats for data storage in science. CSV files are:

- Read by any common statistical tool
- Easy to share
- Not ideal for very large datasets (but those would require a database)

Files in this format end with .csv and can be opened with any CSV reader (such as Excel).
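To show what a CSV actually looks like, here is a small sketch that parses CSV text held in a string instead of a file. The column names and values are invented for the example, and `io.StringIO` only stands in for a real `.csv` file on disk.

```python
import io

import pandas as pd

# A CSV is plain text: one header row, then one row per observation
csv_text = """name,age,height_cm
Alice,25,165.0
Bob,30,180.5
"""

# StringIO lets pandas read the text as if it were a file
people = pd.read_csv(io.StringIO(csv_text))
print(people)
print(people.dtypes)  # pandas infers a type for each column
```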

## Excel sheets

Another common format is .xlsx, the Microsoft Excel spreadsheet format.
I would recommend against it unless you reaaaally want to work in Excel.

## Import Data

### _DataFrame in Pandas_

- A DataFrame is a 2-dimensional data structure that can store data of different types in columns.
- Most data-analysis languages have their own implementation of the DataFrame

- Imagine a table containing information about people, flowers, animals etc...:
  - each row is a subject (**OBSERVATIONS**)
  - each column is something measured about that subject (**FEATURES**)

A special role is played by data organized as **TIDY DATA**, a term introduced to the community by [Wickham in his 2014 paper](http://vita.had.co.nz/papers/tidy-data.pdf).
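In tidy data each variable is a column and each observation is a row. The sketch below reshapes a small "wide" table into tidy form with `DataFrame.melt`; the table and its column names (`cow`, `day_1`, `day_2`, `milk_kg`) are invented for illustration.

```python
import pandas as pd

# Wide format: one column per day, so "day" is hidden in the column names
wide = pd.DataFrame({
    "cow": ["A", "B"],
    "day_1": [30.1, 25.4],
    "day_2": [31.0, 26.2],
})

# Tidy format: one row per (cow, day) observation
tidy = wide.melt(id_vars="cow", var_name="day", value_name="milk_kg")
print(tidy)
```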

## Reading a CSV with pandas

In [None]:
# Import the library pandas
import pandas as pd

# Load dataset
iris = pd.read_csv("./iris.csv", index_col=0)
iris

150 observations of 4 features and 3 class labels: Iris Setosa, Iris Versicolour, Iris Virginica

In [None]:
iris.head()

In [None]:
iris.tail()

In [None]:
iris.info()

## Operations with DataFrame

Every operation is **vectorized** $\to$ is applied *element-wise* to the entire column (vector)

Examples

In [None]:
# Vectorized addition
iris["sepal_length"] + 100

In [None]:
# Vectorized multiplication
iris["sepal_length"] * 100

## Operations between columns

In [None]:
iris["sepal_length"] + iris["sepal_width"]
iris["sepal_length"] - iris["sepal_width"]
iris["sepal_length"] * iris["sepal_width"]
iris["sepal_length"] / iris["sepal_width"]

print("Every operation is applied element by element")

## Create new columns

In [None]:
iris["new_col"] = 1.0
# Or
iris = iris.assign(
    new_col2=2.0
)
iris.head()

## Data Exploration

- Visually exploring data gives you a grasp of the relationship between variables

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot sepal_width against sepal_length from the dataframe iris
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width");

Specifying a *hue* helps with categorical plotting

In [None]:
# Add hue="class" to scatterplot to color the points by class
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="class");

In [None]:
# Create a figure and 2 axes in the figure with sizes 12x5 inches
figure, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Fill the first axis with the sepal length distribution and modify title and labels
sns.histplot(data=iris, x="sepal_length", hue="class", ax=ax1, kde=True, bins=20);
ax1.set(xlabel="Sepal Length (cm)", ylabel="", yticks=[], title="Sepal Length Distribution")

sns.despine(ax=ax1, left=True)

# Fill the second axis with petal length and width and modify title and labels
sns.kdeplot(data=iris, x="petal_length", y="petal_width", hue="class", ax=ax2);
sns.scatterplot(data=iris, x="petal_length", y="petal_width", hue="class", ax=ax2, alpha=0.7);
ax2.set(xlabel="Petal Length (cm)", ylabel="Petal Width (cm)", title="Petal Dimensions")

sns.move_legend(ax2, "upper left")

# Show the figure
plt.show()

## A real world example

I mainly work with data from automatically milked cows.

- Each row is a single **milking**, for a particular **cow**, at a given **time**
- Each column is a measurement registered by the AMS (Automatic Milking System)


In [None]:
cows = pd.read_csv("./cows.csv", sep=";")
print(cows.shape)
cows.head()

In [None]:
# Add a column and make IDs categorical
cows = cows.assign(
    dim_int=lambda _df: _df["dim"].astype(int),
    lactation_id=lambda _df: pd.Categorical(_df["lactation_id"]),
    milking_id=lambda _df: pd.Categorical(_df["milking_id"]),
)

# Select only the columns I want
list_of_columns = ["farm_id", "lactation_id", "milking_id", "parity", "tmy", "dim_int", "dim", "mi", "dimstage"]
cows = cows[list_of_columns]
cows.head()

In [None]:
# Filter data for a single lactation
lactation = cows[cows["lactation_id"] == 453]

# Plot Data
ax = sns.scatterplot(data=lactation, x="dim_int", y="tmy", color="tab:blue", alpha=0.4)
ax.set(xlabel="Days in Milk", ylabel="Total Milk Yield (kg)")
plt.show()

In [None]:
# Or we can also plot other kinds of relationships
ax = sns.scatterplot(data=lactation, x="mi", y="tmy", alpha=0.4, hue="dim")
ax.set(xlabel="Milking Interval", ylabel="Total Milk Yield (kg)")
plt.show()

## Interactive Plotting
Plotly allows for interactive plots (very useful for data exploration!)

In [None]:
import plotly.express as px
px.scatter(lactation, x="dim", y="tmy")

## $\color{red}{\text{Exercise}}$

Create a new column in the dataframe that is the sum of petal_length and petal_width and make a scatter plot of it against sepal_width, colored by class.

## Solution

# Data Distributions

A distribution tells us what values a variable takes and how often it takes them. One of the most important is the **Gaussian** or **Normal** distribution:

![](https://drive.google.com/uc?export=view&id=1Yvw0Cul2stOkXTKJ6L2tOJnmgRJjKnLm)

But **why** is it so important?

## Central Limit Theorem

The Gaussian is so useful because a random variable that is the sum of a large number of small, independent contributions is approximately Gaussian.

The theorem says:

For $n$ independent r.v. $x_i$ with finite variances $\sigma_i^2$, arbitrarily distributed, consider the sum

$$y = \sum_{i=1}^{n} x_i$$

For large $n$, $y$ is approximately **Gaussian**.
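A quick simulation can make the theorem tangible. The sketch below is only an illustration under assumptions of our own: the $x_i$ are uniform on $[0, 1]$ (clearly not Gaussian), $n = 50$, and we draw 10,000 sums. For a Gaussian, roughly 68% of the values fall within one standard deviation of the mean, so we check that too.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 50            # number of terms in each sum
n_sums = 10_000   # how many sums we simulate

# Each row holds n uniform(0, 1) draws; y is the sum of each row
x = rng.uniform(0, 1, size=(n_sums, n))
y = x.sum(axis=1)

# A uniform(0, 1) variable has mean 1/2 and variance 1/12,
# so the sum should have mean n/2 and variance n/12
print("Mean of y:", y.mean(), "expected:", n / 2)
print("Variance of y:", y.var(ddof=1), "expected:", n / 12)

# Fraction of sums within one standard deviation of the mean
within_one_sd = np.mean(np.abs(y - y.mean()) < y.std(ddof=1))
print("Fraction within 1 std:", within_one_sd)
```

Plotting `y` with `sns.histplot` should show the familiar bell shape, even though each individual $x_i$ is flat.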

## Histogram

To visualize the distribution of our data, the most common way is to plot a **histogram**.




In [None]:
sns.histplot(data=cows, x="tmy")
plt.show()

## Descriptive Statistics

Descriptive statistics, as the name suggests, are used to describe data distributions. The most common are:

- **Mean**: The sum of all the values divided by the number of values

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$$

- **Median**: The value that divides the distribution in half
- **Percentile (or Quantile)**: The $n$-th percentile of a distribution is the value below which $n\%$ of the data fall. The 50-th percentile is the median itself.
- **Variance**: The average of the squared deviations of the data from the mean (with $N-1$ in the denominator when it is estimated from a sample). It measures how spread out the data are around the mean

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2$$

- **Standard Deviation**: The square root of the variance, usually referred to as $\sigma$

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2}$$
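To see the formulas at work, here is a small sketch that computes them by hand on five made-up numbers and compares the variance with Python's `statistics.variance`, which also uses the $N-1$ denominator.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(values)

# Mean: sum of the values divided by their number
mean = sum(values) / n

# Variance: sum of squared deviations from the mean, divided by N - 1
variance = sum((v - mean) ** 2 for v in values) / (n - 1)

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

# Median: the middle value of the sorted data
median = statistics.median(values)

print("Mean:", mean)
print("Variance:", variance)
print("Std Dev:", std_dev)
print("Median:", median)
print("Same variance from the library:", statistics.variance(values))
```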

In [None]:
# Compute the Mean production per-milking:
print("Mean:", cows['tmy'].mean().round(2))

# Compute the median
print("Median:", cows["tmy"].median())

# Standard deviation
print("Std Dev:", cows["tmy"].std())

# Percentile
print("75th percentile:", cows["tmy"].quantile(0.75))

# Pandas also provides a full summary table:
display(cows["tmy"].describe())

A **histogram** can readily show this information in graphical form

In [None]:
# Let's see those in a plot:
fig, ax1 = plt.subplots(1, 1, figsize=(8, 4))

# Histogram
sns.histplot(data=cows, x="tmy", ax=ax1, kde=True, bins=20);
ax1.axvline(cows["tmy"].mean(), color="k", lw=2, ls="--", label="Mean")
ax1.axvline(cows["tmy"].median(), color="r", lw=2, ls="--", label="Median")

# Plot 1 std from the mean
ax1.axvline(cows["tmy"].mean() + cows["tmy"].std(), lw=2, color="g", ls="--", label="$\\mu_x \\pm \\sigma$")
ax1.axvline(cows["tmy"].mean() - cows["tmy"].std(), lw=2, color="g", ls="--")
ax1.legend()

plt.tight_layout()
plt.show()

### Boxplot

A BoxPlot can provide basically the same information

![Caption](https://drive.google.com/uc?export=view&id=1cm4Vvfq9er4Oanv9h0_2y7Uqp66Kwexz)

Image source: [Kdnuggets](https://www.kdnuggets.com/2019/11/understanding-boxplots.html)


- The center line is the **Median** of the data
- **Q1** and **Q3** are called First and Third **Quartiles**
- The box spans the range between the 25th and 75th percentiles, so it contains the central 50% of the data


## Correlations

Correlations measure the association between two random variables.
They answer the question: *do variables x and y vary together or not?*
Typically a correlation coefficient varies in the range $[-1, 1]$, where -1 and 1 indicate perfect negative and positive association and 0 indicates no association of the kind being measured (which is not the same as independence).

Mainly, two correlation coefficients are used:

- **Pearson** Correlation: measures the **linear** association between two variables

$$\rho = \frac{cov(x, y)}{\sigma_x \sigma_y}$$

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Correlation_coefficient.png/400px-Correlation_coefficient.png)

Source: [Wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)

- **Spearman** Correlation: the Pearson correlation between the ranks of the two variables, so it can also measure non-linear (monotonic) association

![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Spearman_fig1.svg/300px-Spearman_fig1.svg.png)

Source: [Wikipedia](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
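The difference between the two coefficients shows up on data that are monotonic but not linear. In this sketch (made-up data, $y = x^3$) Spearman is computed directly from its definition, as the Pearson correlation of the ranks, rather than through `method="spearman"`.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = x ** 3   # grows with x, but not along a straight line

# Pearson: linear association between the raw values
pearson = x.corr(y, method="pearson")

# Spearman: Pearson correlation computed on the ranks
spearman = x.rank().corr(y.rank(), method="pearson")

print("Pearson:", pearson)
print("Spearman:", spearman)
```

Because `y` always increases when `x` does, the ranks agree perfectly and Spearman reaches 1, while Pearson stays below 1.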

### Correlation $\nRightarrow$ Causation

[Spurious Correlations](https://www.tylervigen.com/spurious-correlations)

In [None]:
# With pandas we can easily measure correlations between variables with the "corr" function:
print("Correlation between Milk Production and Milking Interval")

# Compute Pearson Correlation
p_corr = cows["tmy"].corr(cows["mi"], method="pearson").round(2)
print("Pearson:", p_corr)

# Compute Spearman Correlation
s_corr = cows["tmy"].corr(cows["mi"], method="spearman").round(2)
print("Spearman:", s_corr)

# Linear Models

What is a linear regression?

Mathematically, the general form of a linear model is:

$$\hat y = w_0 + \sum_{i=1}^{N} w_i x_i = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_N x_N$$

Where $w_0$ is the intercept and $w_i$ are the coefficients of the model.
The objective is to find the set of weights $\bar{w}$ that minimizes the squared error between the predicted values $\hat y_j$ and the true values $y_j$ over the $M$ observations:

$$E(w) = \sum_{j=1}^{M} (\hat y_j - y_j)^2$$

This technique is called **Ordinary Least Squares** (OLS).
The minimization algorithms are already implemented in many libraries.
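As a sketch of what such a library routine does, the example below fits a one-predictor model with NumPy's least-squares solver on synthetic data. The "true" values $w_0 = 2$ and $w_1 = 3$, the noise level and the sample size are all assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic data: y = 2 + 3 x + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

# Design matrix: a column of ones (for the intercept w_0) and the predictor
X = np.column_stack([np.ones_like(x), x])

# Find the weights that minimize the sum of squared errors E(w)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated w_0, w_1:", w)

# The squared error at the solution
y_hat = X @ w
error = np.sum((y_hat - y) ** 2)
print("E(w):", error)
```

Any other choice of weights gives a larger value of $E(w)$, which is exactly what "least squares" means.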

## Let's build a Linear Model!

A simple model of Weight ($W$) and Height ($H$).
The data come from a census of the !Kung San people (Angola, Namibia) carried out in the late 1960s.

Load the data with:


In [None]:
howell = pd.read_csv("./howell1.csv")
howell

In [None]:
# Let's Plot!
ax = sns.scatterplot(data=howell, x="weight", y="height", alpha=0.4, color="grey", edgecolor="black");
ax.set(xlabel="Weight (kg)", ylabel="Height (cm)")
plt.show()

In [None]:
fig = sns.relplot(kind="scatter", data=howell, x="weight", y="height", col="male", hue="age")
ax1, ax2 = fig.axes.ravel()
ax1.set(xlabel="Weight (kg)", ylabel="Height (cm)", title="Women")
ax2.set(xlabel="Weight (kg)", title="Men")

plt.tight_layout()
plt.show()

In [None]:
# Filter By Adult Age and create a copy
df = howell[howell["age"] >= 18].copy()

# Plot the new relationship
ax = sns.scatterplot(data=df, x="weight", y="height", alpha=0.5, color="grey")
ax.set(xlabel="Weight (kg)", ylabel="Height (cm)")

plt.tight_layout()
plt.show()

## Statsmodels

- The most complete library for statistics in Python is `statsmodels`

- Its usage is very similar to what you would find in R


Our first model will be a simple model:

$$H \sim W$$

That means:

$$H = w_0 + w_1 W$$

- $w_0$ is called **Intercept** or **Bias**
- $w_1$ is the **weight** of the relationship between $H$ and $W$

In [None]:
import statsmodels.formula.api as smf

# Definition of the model with a formula
m1 = smf.ols(formula="height ~ 1 + weight", data=df)

# Fit the model
res1 = m1.fit()

# See the results
print(res1.summary())

Given the results, our model is:

$$H = 113.6 + 0.91 \cdot W$$

## Interpretations

- The model says that *a person 1 kg heavier is expected to be 0.91 cm taller*
- The $R^2$ suggests that $W$ explains almost 60% of the variation in $H$
- Moreover, the $95\%$ confidence interval for $w_1$ lies between $0.82$ and $0.99$.
- All in all, there is strong evidence of a relationship between $H$ and $W$

A small point about intercepts: more often than not they are difficult to interpret. Here, a person with $W = 0$ kg would be $113.6$ cm tall, which of course doesn't make any sense.

## P-values

P-values indicate the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null-hypothesis is correct. [source](https://en.wikipedia.org/wiki/P-value).

Our null-hypothesis in this case is:

- $H_0$: the feature $W$ has no impact on $H$ *or* the value of the coefficient $w_1$ is $0$

A p-value $\leq T$ (where $T$ is typically set to $0.05$) indicates that we can *reject* $H_0$. In that case $W$ is said to be **significant**.

But [**beware of pvalues**](https://en.wikipedia.org/wiki/Misuse_of_p-values).
Always look at your data, doubt your models and think about the logical implications of your findings.
That's because:

> *All models are wrong, but some are useful*
>
> George Box, 1976

In [None]:
# Let's Plot the model
ax = sns.lineplot(x=df["weight"], y=res1.fittedvalues, color="k", label="Model")
sns.scatterplot(data=df, x="weight", y="height", alpha=0.5, color="grey", label="Data", ax=ax)

ax.set(xlabel="Weight (kg)", ylabel="Height (cm)", title="LM Height vs Weight")
ax.legend()

plt.show()

In [None]:
# What if we fit our model to the whole dataset?

m2 = smf.ols(formula="height ~ 1 + weight", data=howell)
res2 = m2.fit()

# And Plot it
ax = sns.lineplot(x=howell["weight"], y=res2.fittedvalues, color="k", label="Model")
sns.scatterplot(data=howell, x="weight", y="height", alpha=0.5, color="grey", label="Data", ax=ax)

ax.set(xlabel="Weight (kg)", ylabel="Height (cm)", title="LM Height vs Weight")
ax.legend()

plt.show()

## Polynomial Regression

It is still a linear regression (the model remains linear in the weights), but we also use powers of the variables. For example:

$$H = w_0 + w_1 W + w_2 W^2$$

This is a second-order (parabolic) polynomial.

Now we fit a parabola instead of a straight line, and $w_2$ is the strength of the curvature.

The first thing to do when building this kind of model is to **standardize** the variables.
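To see why this is still a *linear* model, note that $W^2$ can simply be treated as one more column of predictors. The sketch below uses noiseless made-up data generated from $w_0 = 1$, $w_1 = 0.5$, $w_2 = -2$ and recovers those weights with an ordinary least-squares solver.

```python
import numpy as np

# Made-up predictor values and an exact parabola built from known weights
w_values = np.linspace(-2, 2, 21)
h = 1.0 + 0.5 * w_values - 2.0 * w_values ** 2

# Design matrix with three columns: 1, W and W^2
X = np.column_stack([np.ones_like(w_values), w_values, w_values ** 2])

# Ordinary least squares on the extended set of predictors
coeffs, *_ = np.linalg.lstsq(X, h, rcond=None)
print("Recovered w_0, w_1, w_2:", coeffs)
```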

## Standardization

A useful technique for dealing with *multicollinearity* and *numerical* problems is **standardizing** the input features.
The new features are centered and rescaled versions of the original features:

$$\hat x = \frac{x - \mu_x}{\sigma_x}$$

Where:
- $\hat x$ is the standardized feature
- $x$ is the original feature
- $\mu_x$ is the mean of $x$
- $\sigma_x$ is the standard deviation of $x$


Notes:
- It reduces multicollinearity (for example between $x$ and $x^2$ in polynomial models)
- It avoids numerical glitches with large numbers
- It makes interpretation slightly more difficult
- It allows predictors to be compared on a common scale
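A small numerical check of the first note, on made-up data (the integers from 1 to 100): before standardization $x$ and $x^2$ are almost perfectly correlated, while after standardization the correlation between $\hat x$ and $\hat x^2$ is essentially zero. The result is this clean because the example values are symmetric around their mean; with real data the reduction is usually only partial.

```python
import numpy as np

x = np.arange(1.0, 101.0)              # raw feature: 1, 2, ..., 100

# Standardize: subtract the mean, divide by the standard deviation (N - 1)
z = (x - x.mean()) / x.std(ddof=1)

print("Mean of z:", z.mean())
print("Std of z:", z.std(ddof=1))

# Correlation between the feature and its square, before and after
corr_raw = np.corrcoef(x, x ** 2)[0, 1]
corr_std = np.corrcoef(z, z ** 2)[0, 1]
print("corr(x, x^2):", corr_raw)
print("corr(z, z^2):", corr_std)
```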

In [None]:
# Standardize
howell["weight_s"] = (howell["weight"] - howell["weight"].mean()) / howell["weight"].std()

# Plot The difference between standardized and non-standardized weight
ax = sns.histplot(data=howell, x="weight", alpha=0.4, color="salmon", label="Non-Standardized", kde=True)
sns.histplot(data=howell, x="weight_s", alpha=0.4, color="lightblue", ax=ax, label="Standardized", kde=True)

ax.axvline(0, ls="--", color="k")
ax.set(xlabel="Weight (kg)", ylabel="Count", title="Weight Distribution")
ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Fit the polynomial model:

m3 = smf.ols(formula="height ~ 1 + weight_s + I(weight_s**2)", data=howell)
res3 = m3.fit()
print(res3.summary())

In [None]:
# Plot the model
ax = sns.scatterplot(data=howell, x="weight", y="height", alpha=0.5, color="grey", label="Data")
sns.lineplot(x=howell["weight"], y=res3.fittedvalues, color="k", label="Model", ax=ax)

plt.show()

In [None]:
res4 = smf.ols(formula="height ~ 1 + weight_s + I(weight_s**2) + I(weight_s**3)", data=howell).fit()

# Plot all comparisons:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 5))

sns.scatterplot(data=howell, x="weight", y="height", alpha=0.5, color="grey", label="Data", ax=ax1)
sns.lineplot(x=howell["weight"], y=res2.fittedvalues, color="k", label="Model", ax=ax1)
ax1.set(title="Original Model")

sns.scatterplot(data=howell, x="weight", y="height", alpha=0.5, color="grey", label="Data", ax=ax2)
sns.lineplot(x=howell["weight"], y=res3.fittedvalues, color="k", label="Model", ax=ax2)
ax2.set(title="Second Order")

sns.scatterplot(data=howell, x="weight", y="height", alpha=0.5, color="grey", label="Data", ax=ax3)
sns.lineplot(x=howell["weight"], y=res4.fittedvalues, color="k", label="Model", ax=ax3)
ax3.set(title="Third Order")

plt.tight_layout()
plt.show()

## Multivariate Linear Models

### *Or simply Linear Models*

The concept is the same as for the simple linear model, but instead of one predictor we use several!

In the previous example we can add the categorical variable `male` to the model:

$$H = w_0 + w_1 W + w_2 M$$

The question we are asking now is: *does the model change if the subject is male?*



In [None]:
# We go back to adult subjects

m5 = smf.ols(formula="height ~ 1 + weight + C(male)", data=df)
res5 = m5.fit()
print(res5.summary())

The new model for *male* subjects ($M = 1$) is:

$$H = 122.70 + 0.64 \cdot W + 6.5 \cdot M$$

and the new model for *female* subjects ($M = 0$) is:

$$H = 122.70 + 0.64 \cdot W$$
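The two equations are really one model in which $M$ switches an extra term on or off. A minimal sketch using the rounded coefficients reported above (the function name `predict_height` and the example weight of 45 kg are our own choices):

```python
def predict_height(weight_kg, male):
    """Predicted height in cm; male is 1 for men and 0 for women."""
    return 122.70 + 0.64 * weight_kg + 6.5 * male


height_woman = predict_height(45, male=0)
height_man = predict_height(45, male=1)

print("Woman of 45 kg:", height_woman)
print("Man of 45 kg:", height_man)
print("Difference at equal weight:", height_man - height_woman)
```

At any given weight the two predictions differ by the coefficient of $M$, so the two fitted lines are parallel.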

In [None]:
ax = sns.scatterplot(data=df, x="weight", y="height", alpha=0.5, color="grey", hue="male", palette={1: "tab:blue", 0: "tab:orange"})

male_model = 122.7034 + df["weight"] * 0.6412 + 6.5003 * 1
sns.lineplot(x=df["weight"], y=male_model, color="blue", label="Male Model", ax=ax)

female_model = 122.7034 + df["weight"] * 0.6412 + 6.5003 * 0
sns.lineplot(x=df["weight"], y=female_model, color="r", label="Female Model", ax=ax)

original_model = 113.6 + df["weight"] * 0.91
sns.lineplot(x=df["weight"], y=original_model, color="k", label="Original Model", ax=ax, ls="--")

plt.tight_layout()
plt.show()

## LM limits

- Predictors should not be strongly correlated with each other (multicollinearity)
- The residuals should be normally distributed
- Susceptible to outliers

## What's Next

- Bayesian Probability
- Generalized Linear Models
- Non-linear modelling
- Unsupervised models

In the next lesson (Statistics and Big Data analytics) I will show you the basics of Machine Learning and Deep Learning

# Questions?

...