### Estimation and inference
In this section, we will cover:

> - Statistical estimation and inference
> - Parametric and non-parametric approaches to modeling
> - Common statistical distributions
> - Frequentist vs Bayesian statistics

--- 
### Estimation vs Inference
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

---
### Machine Learning and Statistical Inference 

**Machine Learning** and **Statistical Inference** are similar (a case of computer science borrowing from a long history in statistics).

In both cases, we're using data to learn/infer qualities of a distribution that generated the data (often termed the **data-generating process**).

We may care either about the whole distribution or just about particular features of it (e.g. the mean).

Machine learning applications that focus on understanding individual effects draw more heavily on tools from statistical inference (some applications are focused only on results).

### Example: Customer Churn 

Customer **churn** occurs when a customer leaves a company.

Data related to churn may include a target variable for whether or not the customer left.

Features could include:
- The length of time as a customer
- The type and amount purchased
- Other customer characteristics

Churn prediction is often approached by predicting a score for individuals that estimates the probability the customer will leave. 

### Customer Churn: Estimation

**Estimation** of factors driving customer churn involves measuring the impact of each factor in predicting churn.

**Inference** involves determining whether these measured impacts are statistically significant.
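
To make the distinction concrete, here is a minimal sketch using a two-proportion z-test. The counts are hypothetical (not taken from the dataset below), and the function name `two_proportion_ztest` is just an illustrative helper. The difference in churn rates is the *estimate*; the p-value is the *inference* about whether that difference could plausibly be due to chance.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(churned_a, n_a, churned_b, n_b):
    """Estimate the difference in churn rates and test it against zero."""
    p_a = churned_a / n_a
    p_b = churned_b / n_b
    diff = p_a - p_b                        # estimation: measured impact
    pooled = (churned_a + churned_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se
    # inference: two-sided p-value under the null of equal churn rates
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p_value


# Hypothetical counts: group A has 120 churners out of 400 customers,
# group B has 90 churners out of 600 customers.
diff, z, p_value = two_proportion_ztest(120, 400, 90, 600)
print(f"difference in churn rate: {diff:.3f}, z = {z:.2f}, p = {p_value:.2g}")
```

With these made-up counts the estimated difference is 15 percentage points, and the small p-value suggests it is unlikely to be a sampling artifact.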

### Customer Churn: Example Dataset

IBM Cognos Customer Churn Dataset:
- Data from a fictional telecommunications firm
- Includes account type, customer characteristics, revenue per customer, satisfaction score, and an estimate of customer lifetime value
- Includes information on whether customer churned (and some categories of churn type)

### Customer Churn Example: Plotting 

```python
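# Examining churn data: churn rate by payment type.
# Assumes df_phone (phone customers) is already loaded.
import matplotlib.pyplot as plt
import seaborn as sns

# y is the churn indicator, x is the payment type
# (newer seaborn versions spell ci=None as errorbar=None)
sns.barplot(y='churn_value', x='payment',
            data=df_phone, ci=None)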
plt.ylabel('Churn Rate')
plt.xlabel('Payment Type')
plt.title('Churn Rate by Payment Type, Phone Customers')
plt.show()
```
![image-3.png](attachment:image-3.png)


```python
# Examining churn data, this time by 
# tenure
sns.barplot(y='churn_value',
            x=pd.cut(df_phone.months, bins=5),
            data=df_phone, ci=None)
plt.ylabel('Churn Rate')
plt.xlabel('Tenure Range in Months')
plt.title('Churn Rate by Tenure, Phone Customers')
plt.show()
```
![image-4.png](attachment:image-4.png)

```python
# seaborn plot, feature correlations
pairplot = df_phone[['months',
                      'gb_mon',
                      'total_revenue',
                      'cltv',
                      'churn_value']]
sns.pairplot(pairplot, hue='churn_value')
plt.show()
```
![image-5.png](attachment:image-5.png)


```python
# seaborn hexbin plot
sns.jointplot(x=df[labels['months']],
              y=df[labels['monthly']],
              kind='hex')
plt.show()
```
![image-6.png](attachment:image-6.png)

---


### Parametric vs Non-Parametric Models

If **inference** is about trying to find out the data-generating process (DGP), then we can say that a statistical model (of the data) is a set of possible distributions or maybe even regressions.

A **parametric** model is a particular type of statistical model: it's also a set of distributions or regressions, but they have a finite number of parameters. 


### Non-Parametric Statistics

In **non-parametric** statistics, we make fewer assumptions.

In particular, we don't assume that the data belong to any particular distribution (this is also called **distribution-free inference**).

This doesn't mean that we know nothing, though!

### Non-Parametric Inference

An example of **non-parametric inference** is estimating the distribution of the data (for example its CDF, or cumulative distribution function) using a histogram.

In this case, we're not specifying parameters. 
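
As a rough illustration, the empirical CDF can be built directly from a sample without naming any distribution. The tenure values below are hypothetical, and `empirical_cdf` is an illustrative helper rather than a library function.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return F(x) = fraction of observations less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted observations are <= x
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical customer tenures in months
tenure_months = [1, 3, 3, 7, 12, 24, 24, 36, 48, 60]
F = empirical_cdf(tenure_months)
print(F(3), F(24), F(60))
```

The only inputs are the observations themselves: no mean, variance, or other parameter is specified.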

### Parametric Models

A **parametric model** is a particular type of statistical model: it's also a set of distributions or regressions, but they have a finite number of parameters.

An example of a parametric model is the normal distribution, which has two parameters: mean and standard deviation.

![image.png](attachment:image.png)
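
A small sketch of what "two parameters" means in practice, using Python's standard-library `statistics.NormalDist`. The parameter values and the sample are hypothetical; once the mean and standard deviation are fixed, every probability under the model follows from them.

```python
from statistics import NormalDist

# Hypothetical model of monthly charges: mean 65, standard deviation 30
charges = NormalDist(mu=65.0, sigma=30.0)

# Any probability is determined by the two parameters alone
p_below_100 = charges.cdf(100.0)
within_one_sd = charges.cdf(95.0) - charges.cdf(35.0)  # about 68%

# Estimating the two parameters from a (hypothetical) sample.
# Note: from_samples uses the sample standard deviation (n - 1 divisor).
fitted = NormalDist.from_samples([70.0, 85.5, 64.0, 99.0, 75.5, 80.0])
print(round(p_below_100, 3), round(within_one_sd, 3), fitted.mean)
```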

### Example: Customer lifetime value

**Customer lifetime value** is an estimate of the customer's value to the company. 

Data related to customer lifetime value might include:
- The expected length of time as a customer
- The expected amount spent over time

To estimate lifetime value, we make **assumptions** about the data.

These assumptions can be **parametric** (assuming a specific distribution) or **non-parametric**.
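
The sketch below contrasts the two kinds of assumption on one question: what share of customers stay longer than 24 months? The tenure values are hypothetical, and the exponential distribution is chosen purely for illustration; it is an assumption of this example, not something the dataset prescribes.

```python
from math import exp

# Hypothetical observed tenures in months
tenure_months = [2, 5, 9, 14, 20, 27, 33, 41, 52, 67]
horizon = 24

# Non-parametric: use the observed fraction directly
surv_nonparametric = sum(t > horizon for t in tenure_months) / len(tenure_months)

# Parametric: assume tenure is exponentially distributed.
# The model has one parameter, the mean tenure, estimated by the sample mean.
mean_tenure = sum(tenure_months) / len(tenure_months)
surv_parametric = exp(-horizon / mean_tenure)

print(f"non-parametric: {surv_nonparametric:.3f}")
print(f"parametric (exponential): {surv_parametric:.3f}")
```

The two answers differ because the parametric estimate leans on the assumed shape of the distribution, while the non-parametric one uses only the observed counts.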

### Parametric Models: Maximum Likelihood

The most common way of estimating parameters in a parametric model is through **maximum likelihood estimation (MLE)**.

The **likelihood function** is related to probability and is a function of the **parameters** of the model.

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)
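
A minimal sketch of MLE for a normal model, assuming independent observations. The data are hypothetical. For the normal distribution the maximizers have a closed form (the sample mean, and the standard deviation with divisor n), so the code computes them directly and then checks that nearby parameter values give a lower log-likelihood.

```python
from math import log, pi, sqrt


def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. normal data, as a function of (mu, sigma)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * log(2 * pi * sigma ** 2) - squared_error / (2 * sigma ** 2)


def normal_mle(data):
    """Closed-form maximum likelihood estimates for a normal model."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma_hat = sqrt(sum((x - mu_hat) ** 2 for x in data) / n)  # divisor n, not n - 1
    return mu_hat, sigma_hat


# Hypothetical monthly charges
monthly_charges = [70.0, 85.5, 64.0, 99.0, 75.5, 80.0]
mu_hat, sigma_hat = normal_mle(monthly_charges)
best = normal_log_likelihood(monthly_charges, mu_hat, sigma_hat)

# Moving either parameter away from the MLE lowers the log-likelihood
shifted_mu = normal_log_likelihood(monthly_charges, mu_hat + 1.0, sigma_hat)
shifted_sigma = normal_log_likelihood(monthly_charges, mu_hat, sigma_hat * 1.1)
print(mu_hat, round(sigma_hat, 3), best > shifted_mu, best > shifted_sigma)
```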

---
### Commonly Used Distributions

![image-4.png](attachment:image-4.png)

![image-5.png](attachment:image-5.png)

![image-6.png](attachment:image-6.png)

![image-7.png](attachment:image-7.png)

![image-8.png](attachment:image-8.png)

---

### Frequentist vs Bayesian Statistics

A **Frequentist** is concerned with repeated observations in the limit.

Processes may have true frequencies, but we're interested in modeling probabilities as the long-run result of many repeats of an experiment.

Frequentist approach:
1. **Derive** the probabilistic property of a procedure
2. **Apply** the probability directly to the observed data
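
The two steps above can be sketched with a confidence interval for a churn rate. This is an illustration under stated assumptions: the true rate, sample size, and counts are hypothetical, and the simple normal-approximation (Wald) interval stands in for "the procedure". Step 1 is checked by simulation rather than derived analytically: over many repeated samples, the interval should cover the true rate roughly 95% of the time. Step 2 applies the same procedure once to a single observed sample.

```python
import random
from math import sqrt


def wald_interval(churned, n, z=1.96):
    """Approximate 95% confidence interval for a proportion."""
    p_hat = churned / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(true_rate, n, repeats, seed=0):
    """Fraction of repeated experiments whose interval contains true_rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        churned = sum(rng.random() < true_rate for _ in range(n))
        low, high = wald_interval(churned, n)
        hits += low <= true_rate <= high
    return hits / repeats


# Step 1: property of the procedure under repeated sampling
estimated_coverage = coverage(true_rate=0.3, n=200, repeats=2000)

# Step 2: apply the procedure to one observed (hypothetical) sample
low, high = wald_interval(churned=54, n=200)
print(f"coverage ~ {estimated_coverage:.3f}, interval = ({low:.3f}, {high:.3f})")
```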

A **Bayesian** describes parameters by probability distributions.

Before seeing any data, a **prior distribution** (based on the experimenter's beliefs) is formulated.

This **prior distribution** is then updated after seeing data (a sample from the distribution).

After updating, the distribution is called the **posterior distribution**.
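
One simple way to see the prior-to-posterior update is the Beta-Binomial model for a churn rate. This is a sketch under assumptions: the Beta prior, its parameters, and the observed counts are all hypothetical choices made for illustration. With a Beta(alpha, beta) prior and binomial data, the posterior is again a Beta distribution whose parameters are the prior's plus the observed counts.

```python
def update_beta(alpha, beta, churned, stayed):
    """Posterior Beta parameters after observing churn outcomes."""
    return alpha + churned, beta + stayed


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior belief: churn rate around 20%, held weakly
prior = (2.0, 8.0)

# Hypothetical data: 30 of 100 customers churned
posterior = update_beta(*prior, churned=30, stayed=70)

prior_mean = beta_mean(*prior)
posterior_mean = beta_mean(*posterior)
print(f"prior mean: {prior_mean:.3f}, posterior mean: {posterior_mean:.3f}")
```

The posterior mean sits between the prior mean and the observed churn rate, and moves closer to the data as more observations arrive.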

We use much of the same math and the same formulas in both Frequentist and Bayesian statistics.

The element that differs is the **interpretation** of the probabilities.

We will point out the difference in interpretation, where appropriate.