# Homework for statistics module

Rules for code style:
* All the code is in this notebook
* Imports are provided at the top of the notebook
* All the cells can be run in order from top to bottom
* Comments are required
* All the plots should have a title, axis labels and summaries (if necessary)
* Main conclusions are written in Markdown cells **(your analysis of the results and data is very important!!!)**
* Try to use functions and classes to reduce duplicated code blocks to a minimum

You can also use $\LaTeX$ to write formulas, e.g. inline $\bar{y} = \frac{1}{n}\sum_{i=1}^n \hat{x}_i$ or on a new line:
$$
    \bar{y} = \frac{1}{n}\sum_{i=1}^n \hat{x}_i
$$
To do so, write the formula between $\$ \quad \$$ (or $ \$\$ \quad\$\$ $).


### Criteria (50 points total):
* Task 1 (17 points)

    * Requested formulas are provided - 4 points
    * All 18 necessary experiments are done - 6 points
    * Results are analysed with commentaries - 7 points
* Tasks 2.1 and 2.2 (16 points in total, 8 per part)

    * Visualization and/or simple exploratory data analysis are implemented - 2 points
    * Hypotheses are tested - 2 points
    * Results are analysed with commentaries - 4 points
* Task 3 (13 points)

    * Visualization and/or simple exploratory data analysis are implemented - 2 points
    * Hypotheses are tested - 4 points
    * Data was aggregated properly - 2 points
    * Results are analysed with commentaries - 5 points
* Extra points:

    * 4 points for clear, "pythonic" and understandable code style.

Good work: 40+ points.

In [433]:
import pandas as pd
import numpy as np
import scipy.stats as st

import matplotlib.pyplot as plt
%matplotlib inline

# <<<All the imports here>>>
 

# Task 1

## Stratification

The example:

Suppose we need to estimate the mean vote count for each election candidate. The country has 3 cities: 1 million factory workers live in city A, 2 million office workers live in city B, and 3 million senior citizens live in city C. We could draw a simple random sample of 60 votes from the entire population, but there is some chance that the sample will be poorly balanced between these cities and therefore biased and of little use (like the "average temperature in the hospital"), causing a significant estimation error. If we instead draw simple random samples of 10, 20 and 30 votes from cities A, B and C, respectively, we can get a smaller estimation error with the same total sample size. This technique is called stratification.
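
The allocation in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary and variable names are made up here, while the population sizes and the total sample size of 60 come from the example above.

```python
# Proportional allocation: each city gets a share of the sample
# equal to its share of the total population.
populations = {"A": 1_000_000, "B": 2_000_000, "C": 3_000_000}  # from the example
total_sample = 60

total_population = sum(populations.values())
allocation = {
    city: total_sample * size // total_population
    for city, size in populations.items()
}
print(allocation)  # {'A': 10, 'B': 20, 'C': 30}
```

Integer division is exact here because the shares divide the sample size evenly; in general the rounding has to be handled explicitly.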


### The task

Suppose the population is a mixture of 3 normally distributed random variables. In other words, the population can be divided into 3 strata.
$$
    F(X) = a_1 F(X_1) + a_2 F(X_2) + a_3 F(X_3)
$$

**Goals:**  

1. Derive (for example, in the block below, using Markdown) the formulas for the point estimates of the expected value and variance of the sample mean for subsamples formed in different ways:
- simple random sampling from the entire population;
- stratified sampling with stratum sample sizes proportional to the strata shares;
- stratified sampling with optimal allocation of the stratum sample sizes.

2. Calculate the point estimates of the expected value and variance of the sample mean for each sampling method from item 1, under the following conditions:
* Run the experiments for 3 cases (for each method from item 1):
     * all strata have the same mean and variance;
     * strata have different means but the same variance;
     * strata have different means and variances.
* Repeat this for two sample sizes: 40 and 500.
* Repeat each experiment 1000 times.

Thus, the total number of experiments is 18 (3 sampling methods \* 3 cases of distribution parameters \* 2 sample sizes).  
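
One way to keep track of the 18 configurations is to enumerate them up front. The sketch below is only a suggestion: the string labels are illustrative names, and only the counts (3 methods, 3 cases, 2 sample sizes) come from the task.

```python
from itertools import product

# Illustrative labels; the task itself only fixes the counts 3 x 3 x 2.
methods = ["random", "proportional", "optimal"]
cases = [
    "equal means, equal variances",
    "different means, equal variances",
    "different means, different variances",
]
sample_sizes = [40, 500]

# Every combination of method, case and sample size is one experiment.
experiment_grid = list(product(methods, cases, sample_sizes))
print(len(experiment_grid))  # 18
```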

**Example**: you run an experiment for the random sampling method with equal means and variances across the strata. You draw a sample of size 40 and estimate the statistics, repeating this 1000 times. You can then average the results or plot boxplots (we suggest the latter) to compare the point estimates. Then you repeat this for a sample size of 500, and run the same pipeline for the other sampling methods.  
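
A minimal sketch of such a repeated experiment is shown below, for simple random sampling only. All parameter values (means, standard deviations, strata sizes) and the helper name `repeated_sample_means` are assumptions made for illustration, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters: three strata with equal means and variances.
means, stds, strata_sizes = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1000, 2000, 3000]
population = np.concatenate(
    [rng.normal(m, s, n) for m, s, n in zip(means, stds, strata_sizes)]
)

def repeated_sample_means(population, size, n_experiments, rng):
    """Draw `n_experiments` simple random subsamples and return their means."""
    return np.array([
        rng.choice(population, size=size, replace=False).mean()
        for _ in range(n_experiments)
    ])

for size in (40, 500):
    sample_means = repeated_sample_means(population, size, 1000, rng)
    # Empirical mean and variance of the sample mean over 1000 repetitions.
    print(size, sample_means.mean(), sample_means.var(ddof=1))
```

The resulting arrays of sample means can be passed directly to `plt.boxplot` to compare sample sizes or sampling methods side by side.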

Choose the parameters of the normal distributions, the strata shares and the subsample sizes yourself.
To make the code easier to structure, you can build your solution on the following class:

In [490]:
class Experiment:

    def __init__(self, means, stds, random_state=42):
        """Initializes our experiment and saves the given distributions
        
        :param means: List of expectations for normal distributions
        :param stds: List of standard deviations for normal distributions
        :param random_state: Seed that fixes the randomness, so that repeating the
        experiment with the same input parameters gives the same results
        """
        self.means = means
        self.stds = stds
        self.random_state = random_state
        self.strats = [st.norm(mean, std) for mean, std in zip(means, stds)]
    
    def sample(self, sizes):
        """Creates a population sample
        
        :param sizes: List with sample sizes of the corresponding normal distributions
        """
    
        # Use the `sizes` argument; `self.sizes` is never defined on the instance.
        self.strats_samples = [rv.rvs(size) for rv, size in zip(self.strats, sizes)]
        self.general_samples = np.hstack(self.strats_samples)
        self.N = self.general_samples.shape[0]
 
        

    def random_subsampling(self, size):
        """Creates a random subset of the entire population
        
        :param size: subsample size
        """
      
    def proportional_subsampling(self, size):
        """Creates a subsample in which the number of elements from each stratum
        is proportional to the stratum's share of the population
        
        :param size: subsample size
        """
 
    
      
    def optimal_subsampling(self, size):
        """Creates a subsample with the optimal number of elements relative to strata
        
        :param size: subsample size
        """
 
 
    
    
    def run_experiments(self, subsampling_method, n_experiments=1000):
        """Conducts a series of experiments and saves the results
        
        :param subsampling_method: method for creating a subsample
        :param n_experiments: number of experiment runs
        """
 
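
For the optimal method, one common choice is Neyman allocation, where the size of each stratum's subsample is proportional to the product of the stratum's share and its standard deviation. The helper below is a hedged sketch of that rule, not part of the class template: the function name `neyman_allocation` and its rounding scheme (largest fractional parts get the leftover units) are assumptions made here.

```python
import numpy as np

def neyman_allocation(weights, stds, size):
    """Split `size` across strata proportionally to weight * std."""
    weights = np.asarray(weights, dtype=float)
    stds = np.asarray(stds, dtype=float)
    shares = weights * stds / np.sum(weights * stds)
    raw = shares * size
    allocation = np.floor(raw).astype(int)
    # Hand the units lost to rounding down to the largest fractional parts.
    remainder = size - allocation.sum()
    order = np.argsort(raw - allocation)[::-1]
    allocation[order[:remainder]] += 1
    return allocation

print(neyman_allocation([0.2, 0.3, 0.5], [1.0, 2.0, 3.0], 40).tolist())  # [4, 10, 26]
```

When all standard deviations are equal, this rule reduces to proportional allocation.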

# Task 2

The data is available here: https://drive.google.com/drive/folders/1zlvCNV6zNY9i3KIiFM6McByEgPBn_y2w?usp=sharing

### Part 1
Using [these criteria](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/interpret-the-results/#:~:text=For%20the%20Pearson%20correlation%2C%20an,linear%20relationship%20between%20the%20variables.&text=If%20both%20variables%20tend%20to,represents%20the%20correlation%20slopes%20upward.), check whether brain size and intelligence are correlated in the dataset that includes both men and women. Also check this for the male and female subsamples separately.

**Data is in `HW1_task2_brain_data.tsv`**
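
As a starting point, a Pearson correlation and its p-value can be computed with `scipy.stats.pearsonr`. The sketch below runs on synthetic stand-in data, because the column names of the TSV file are not given here; the variables `brain_size` and `iq` and their generating parameters are purely hypothetical.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Synthetic stand-in data; replace with the columns loaded from the TSV file.
brain_size = rng.normal(900_000, 70_000, size=40)
iq = 100 + 0.0001 * (brain_size - 900_000) + rng.normal(0, 15, size=40)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = st.pearsonr(brain_size, iq)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3f}")
```

The same call can be repeated on the male and female subsets after filtering the DataFrame.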

### Part 2
Using the $\chi^2$ test, check whether there is a statistically significant difference between men's and women's car preferences (features `Sex` and `PreferCar`).

**Data is in `HW1_task2_car_prefs_data.tsv`**
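
A $\chi^2$ test of independence is usually run on a contingency table of the two features. The sketch below uses toy data: the column names `Sex` and `PreferCar` come from the task, but the category labels and counts are invented for illustration.

```python
import pandas as pd
import scipy.stats as st

# Toy data; the category labels and counts here are hypothetical.
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "F", "F", "F", "M", "F", "M", "F"] * 10,
    "PreferCar": ["A", "B", "A", "B", "B", "A", "A", "B", "B", "B"] * 10,
})

# Rows are sexes, columns are car preferences, cells are counts.
contingency = pd.crosstab(df["Sex"], df["PreferCar"])
chi2, p_value, dof, expected = st.chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

It is worth checking the `expected` counts as well, since the $\chi^2$ approximation is unreliable when they are small.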

In [None]:
# <<<YOUR CODE HERE>>>

# Task 3

You can find and download the dataset here:
https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016


1) For any country (choose any of those presented):
 *  Visualize the feature **suicides_no** against the other features: **sex**, **age** (or **generation**) and **year**;
 *  Check whether there is a statistically significant difference in the number of suicides between men and women. If there is, can we claim that people of a certain sex are more prone to suicide, or do we need additional information?

2) For 2016: divide the countries into 3-4 groups by the values of the **gdp_per_capita** feature (use statistical characteristics to decide how to split the data into groups), and check whether the suicides / 100k pop indicator differs between these groups. Remember that each country has several values, which need to be either aggregated or checked for each group separately.
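
One possible pipeline for item 2 is sketched below on toy data: aggregate to one row per country, split the countries into groups at the GDP quantiles with `pd.qcut`, and compare the groups with a Kruskal-Wallis test. The column names `country` and `suicides_per_100k`, the generated values, and the choice of three quantile groups and of this particular test are all assumptions; the real dataset's columns and a suitable test should be checked against the data.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(1)

# Toy data with hypothetical column names: several rows per country.
countries = [f"country_{i}" for i in range(30)]
gdp = rng.uniform(1_000, 60_000, size=30)
df = pd.DataFrame({
    "country": np.repeat(countries, 4),
    "gdp_per_capita": np.repeat(gdp, 4),
    "suicides_per_100k": rng.gamma(2.0, 5.0, size=120),
})

# One row per country, then three GDP groups split at the terciles.
per_country = df.groupby("country", as_index=False).agg(
    gdp_per_capita=("gdp_per_capita", "first"),
    suicides_per_100k=("suicides_per_100k", "mean"),
)
per_country["gdp_group"] = pd.qcut(
    per_country["gdp_per_capita"], q=3, labels=["low", "mid", "high"]
)

# Kruskal-Wallis compares the groups without assuming normality.
groups = [
    g["suicides_per_100k"].to_numpy()
    for _, g in per_country.groupby("gdp_group", observed=True)
]
statistic, p_value = st.kruskal(*groups)
print(f"H = {statistic:.3f}, p-value = {p_value:.3f}")
```

Averaging the per-row rates is the simplest aggregation, but it ignores the population of each row; a population-weighted rate may be a better choice for the real data.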

In [None]:
# <<<YOUR CODE HERE>>>