# 06 Hypothesis testing
- Author: Agustín Arturo Melian Su
- Date: 18/08/2025
- Goal: Test claims about a population's characteristics. Learn and practice concepts such as z- and t-scores, p-values, significance levels, and one- and two-tailed tests.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [12]:
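def hypothesis_testing(n, sample_mean, pob_mean, pob_std, alpha, two_tails=False):
    """z-test for a mean when the population standard deviation is known."""
    # Standardize the sample mean using the standard error sigma / sqrt(n)
    z_score = (sample_mean - pob_mean) / (pob_std / np.sqrt(n))
    # Tail area beyond |z|, doubled for a two-tailed test.
    # abs() assumes the sample mean lies on the side stated by H_1.
    p_value = (1 - stats.norm.cdf(abs(z_score))) * (2 if two_tails else 1)

    # The p-value already covers both tails, so compare it with alpha directly
    if p_value > alpha:
        return 'H_0 is NOT rejected'
    else:
        return 'H_0 is rejected'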
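
The function above returns only the verdict. When writing a conclusion it helps to see the numbers behind it. The following is a minimal sketch, not part of the original notebook: `z_test_summary` is a hypothetical helper name, the example figures are made up, and a known population standard deviation is assumed.

```python
import numpy as np
import scipy.stats as stats


def z_test_summary(n, sample_mean, pob_mean, pob_std, two_tails=False):
    """Return (z_score, p_value) for a z-test on a mean (hypothetical helper)."""
    # Standard error of the sample mean
    std_error = pob_std / np.sqrt(n)
    z_score = (sample_mean - pob_mean) / std_error
    # Survival function = upper-tail area beyond |z|
    p_value = stats.norm.sf(abs(z_score)) * (2 if two_tails else 1)
    return float(z_score), float(p_value)


# Made-up example: 36 observations, sample mean 52,
# hypothesized mean 50, population standard deviation 6
z, p = z_test_summary(n=36, sample_mean=52, pob_mean=50, pob_std=6)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The decision rule is then a single comparison: reject $H_0$ when the p-value is at most $\alpha$.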

*Problem 1:*

Scientists in the medical community have established that if the mean nicotine content of cigarettes is 25 mg or more, the smoker has a greater chance of getting cancer. If the measured content is less than 25 mg, the user is relatively safe.

To set up the hypotheses, we test a sample of cigarettes to decide whether the data support a mean nicotine content below 25 mg, thus:

$$H_0: \mu \geq 25$$

$$H_1: \mu < 25$$

$$\alpha = 0.01$$

Data:

$\bar{X}=23.5$

$\sigma = 4.2$

$\mu _0 = 25$ 

$n=40$



In [22]:
hypothesis_testing(n=40, pob_mean=25, sample_mean=23.5, pob_std=4.2, alpha=0.05)

conclusion = (
    'The one-tailed p-value (about 0.012) is below alpha = 0.05, so the null hypothesis is '
    'rejected at that level: the data support a mean nicotine content below 25 mg per cigarette. '
    'At the stricter alpha = 0.01 stated above, H_0 would not be rejected.'
)

conclusion

'The one-tailed p-value (about 0.012) is below alpha = 0.05, so the null hypothesis is rejected at that level: the data support a mean nicotine content below 25 mg per cigarette. At the stricter alpha = 0.01 stated above, H_0 would not be rejected.'
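
To make the decision for Problem 1 explicit, the z-score and the left-tailed p-value can be computed directly from the data listed above. This is a sketch using `scipy.stats.norm`; the variable names are illustrative and not taken from the notebook.

```python
import numpy as np
import scipy.stats as stats

# Data from Problem 1
n, sample_mean, mu_0, sigma = 40, 23.5, 25, 4.2

z_score = (sample_mean - mu_0) / (sigma / np.sqrt(n))
# Left-tailed test (H_1: mu < 25): the p-value is the area below z
p_value = stats.norm.cdf(z_score)

print(f"z = {z_score:.3f}, p-value = {p_value:.4f}")
for alpha in (0.05, 0.01):
    decision = "rejected" if p_value <= alpha else "NOT rejected"
    print(f"alpha = {alpha}: H_0 is {decision}")
```

The p-value falls between 0.01 and 0.05, so the verdict depends on which significance level is chosen.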

An old industrial process used by a factory gives a mean production of 100 units per hour, with a standard deviation of 8 units. A new machine that delivers the same product has just been launched. It is pricey in comparison with the one the factory already has, but adopting it would be very lucrative if its mean production were above 150 units per hour.

Suppose the machine has been tested for 35 hours and its mean output was found to be 160 units per hour. Should the company buy the new machine?

Hypothesis:

$$H_0: \mu \leq 150$$
$$H_1: \mu > 150$$

$$\alpha = 0.05$$

Data:

$\bar{x} = 160$

$\sigma = 8$

$\mu_0 = 150$

$n = 35$

In [None]:
hypothesis_testing(n=35, pob_mean=150, sample_mean=160, pob_std=8, alpha=0.05)

conclusion = (
    'The z-score is about 7.4, so the one-tailed p-value is far below alpha = 0.05 and the null '
    'hypothesis is rejected. The test supports a mean production above 150 units per hour, '
    'which is the threshold at which buying the new machine pays off.'
)

conclusion

'The z-score is about 7.4, so the one-tailed p-value is far below alpha = 0.05 and the null hypothesis is rejected. The test supports a mean production above 150 units per hour, which is the threshold at which buying the new machine pays off.'
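
The same decision can be reached with the critical-value approach instead of the p-value: compare the z-score with the quantile that leaves $\alpha$ in the upper tail. The sketch below applies it to the machine problem with $\alpha = 0.05$; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Data from the machine problem
n, sample_mean, mu_0, sigma, alpha = 35, 160, 150, 8, 0.05

z_score = (sample_mean - mu_0) / (sigma / np.sqrt(n))
# Right-tailed test: reject H_0 when z exceeds the (1 - alpha) quantile
z_critical = stats.norm.ppf(1 - alpha)
reject = z_score > z_critical

print(f"z = {z_score:.3f}, critical value = {z_critical:.3f}, reject H_0: {reject}")
```

Both approaches always agree for a given $\alpha$; the critical value is simply the z-score whose p-value equals $\alpha$.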

*Problem 2:*

The average distance a certain brand of car must travel to stop when traveling at 30 miles per hour is 65 feet. The company's engineering department has designed a new braking system that it considers more effective. To test this invention, the new braking system is tested on 64 cars. The tests show that the average stopping distance for a car at a speed of 30 miles per hour is 63.5 feet with a standard deviation of 4 feet. Does this reduction in distance demonstrate that the new braking system is more effective than the old one?

Hypothesis:

$$H_0: \mu \geq 65$$
$$H_1: \mu < 65$$

$$\alpha = 0.01$$

Data:

$\bar{x} = 63.5$

$\sigma = 4$

$\mu_0 = 65$

$n = 64$

In [5]:
hypothesis_testing(n=64, sample_mean=63.5, pob_mean=65, pob_std=4, alpha=0.01)

conclusion = (
    'The z-score is -3 and the one-tailed p-value (about 0.0013) is below alpha = 0.01, so the '
    'null hypothesis is rejected. The data support a mean stopping distance below 65 feet '
    'with the new braking system.'
)

conclusion

'The z-score is -3 and the one-tailed p-value (about 0.0013) is below alpha = 0.01, so the null hypothesis is rejected. The data support a mean stopping distance below 65 feet with the new braking system.'
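
The 4-foot standard deviation in Problem 2 comes from the 64 test cars, so it can be read as a sample estimate rather than a known population value. Under that reading (an assumption, since the problem statement does not say so explicitly), a t-test with $n - 1$ degrees of freedom is the textbook alternative. A minimal sketch from the summary statistics:

```python
import numpy as np
import scipy.stats as stats

# Data from Problem 2, treating 4 ft as the sample standard deviation (assumption)
n, sample_mean, mu_0, sample_std, alpha = 64, 63.5, 65, 4, 0.01

t_score = (sample_mean - mu_0) / (sample_std / np.sqrt(n))
df = n - 1
# Left-tailed p-values under Student's t and, for comparison, the normal
p_value_t = stats.t.cdf(t_score, df)
p_value_z = stats.norm.cdf(t_score)

print(f"t = {t_score:.2f}, df = {df}")
print(f"p-value (t) = {p_value_t:.4f}, p-value (z) = {p_value_z:.4f}")
print("H_0 is rejected" if p_value_t <= alpha else "H_0 is NOT rejected")
```

With 63 degrees of freedom the t distribution is close to the normal, so the t-based p-value is only slightly larger and the decision at $\alpha = 0.01$ does not change.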

*Problem 3:*

A floor wax manufacturer has salespeople nationwide. In the past, its salespeople earned an average commission of \$600 per month with a standard deviation of \$80. Recently, new brands of floor wax have entered the market, and this factor tends to reduce the sales volume per salesperson, and thus, salespeople's income.

Suppose a random sample of 100 commissions is drawn and the average is found to be \$585. The company's management is anxious to discover whether the entry of these new brands has had any net effect on commissions.

Hypothesis:

$$H_0: \mu = 600$$
$$H_1: \mu \neq 600$$

$$\alpha = 0.05$$

Data:

$\bar{x} = 585$

$\sigma = 80$

$\mu_0 = 600$

$n = 100$

In [6]:
hypothesis_testing(n=100, sample_mean=585, pob_mean=600, pob_std=80, alpha=0.05, two_tails=True)

conclusion = (
    'The two-tailed p-value (about 0.061) is greater than alpha = 0.05, so the null hypothesis '
    'is not rejected. The sample does not provide enough evidence that the mean commission '
    'differs from 600 USD.'
)

conclusion

'The two-tailed p-value (about 0.061) is greater than alpha = 0.05, so the null hypothesis is not rejected. The sample does not provide enough evidence that the mean commission differs from 600 USD.'
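
A two-tailed z-test at level $\alpha$ is equivalent to checking whether $\mu_0$ falls inside the $(1 - \alpha)$ confidence interval built around the sample mean. The sketch below illustrates this with the data of Problem 3; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Data from Problem 3
n, sample_mean, mu_0, sigma, alpha = 100, 585, 600, 80, 0.05

std_error = sigma / np.sqrt(n)
# Quantile leaving alpha / 2 in each tail (about 1.96 for alpha = 0.05)
z_half = stats.norm.ppf(1 - alpha / 2)
ci_low = sample_mean - z_half * std_error
ci_high = sample_mean + z_half * std_error
inside = ci_low <= mu_0 <= ci_high

print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print("H_0 is NOT rejected" if inside else "H_0 is rejected")
```

The hypothesized mean of 600 sits just inside the upper end of the interval, which matches a p-value only slightly above 0.05.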

# Part 2: Hypothesis testing with traffic accident data from Barcelona

In [7]:
df_2010_raw = pd.read_csv('/workspaces/clases-4geeks/datasets/ACCIDENTS_GU_BCN_2010.csv', encoding='latin-1')

df_2010_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9056 entries, 0 to 9055
Data columns (total 25 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Número d'expedient            9056 non-null   object
 1   Codi districte                9056 non-null   int64 
 2   Nom districte                 9056 non-null   object
 3   NK barri                      9056 non-null   object
 4   Nom barri                     9056 non-null   object
 5   Codi carrer                   9056 non-null   int64 
 6   Nom carrer                    9056 non-null   object
 7   Num postal caption            9056 non-null   object
 8   Descripció dia setmana        9056 non-null   object
 9   Dia de setmana                9056 non-null   int64 
 10  Descripció tipus dia          9056 non-null   object
 11  NK Any                        9056 non-null   int64 
 12  Mes de any                    9056 non-null   int64 
 13  Nom mes           

In [8]:
df_2010_raw.sample(10)

Unnamed: 0,Número d'expedient,Codi districte,Nom districte,NK barri,Nom barri,Codi carrer,Nom carrer,Num postal caption,Descripció dia setmana,Dia de setmana,...,Hora de dia,Descripció torn,Descripció causa vianant,Número de morts,Número de lesionats lleus,Número de lesionats greus,Número de víctimes,Número de vehicles implicats,Coordenada UTM (Y),Coordenada UTM (X)
5011,2010S004373,7,Horta-Guinardó,74-7-35,el Guinardó,367402,Volart,0083 0083,Dimarts,2,...,23,Nit,Desconegut,0,1,0,1,2,458539716,43108296
4098,2010S003102,2,Eixample,21-2-10,Sant Antoni,349706,Comte d'Urgell,0010 0010,Dijous,4,...,22,Nit,Desconegut,0,1,0,1,1,458120684,42997356
7271,2010S005840,7,Horta-Guinardó,71-7-39,Sant Genís dels Agudells,148409,Sant Cugat,000140000,Diumenge,7,...,15,Tarda,Desconegut,0,2,0,2,1,458632160,42742032
7986,2010S006226,10,Sant Martí,104-10-69,Diagonal Mar i el Front Marítim del Poblenou,700892,Garcia Fària,0024S0024S,Dimecres,3,...,7,Matí,Desconegut,0,1,0,1,2,458396491,43444594
7704,2010S002797,4,Les Corts,41-4-19,les Corts,48608,Breda,0015 0015,Dilluns,1,...,8,Matí,Desconegut,0,2,0,2,2,458201350,42820230
8231,2010S004624,3,Sants-Montjuïc,34-3-17,Sants - Badal,328808,Sugranyes,0048 0048,Dissabte,6,...,6,Matí,Desconegut,0,1,0,1,2,458060499,42713216
6991,2010S000913,2,Eixample,23-2-8,l'Antiga Esquerra de l'Eixample,89004,Consell de Cent,0305 0309,Diumenge,7,...,16,Tarda,Desconegut,0,1,0,1,2,458227202,42999337
8489,2010S000318,9,Sant Andreu,91-9-62,el Congrés i els Indians,367300,Biscaia,0435 0435,Divendres,5,...,11,Matí,Desconegut,0,0,0,0,1,458576356,43164403
7696,2010S002354,4,Les Corts,41-4-19,les Corts,231502,Numància,0084 0084,Dijous,4,...,20,Tarda,Desconegut,0,2,0,2,2,458188144,42785690
1469,2010S007368,4,Les Corts,42-4-20,la Maternitat i Sant Ramon,144601,Diagonal,0645 0645,Divendres,5,...,19,Tarda,Desconegut,0,1,0,1,1,458191188,42617459


In [9]:
df_2010_baking = df_2010_raw.copy()

df_2010_baking["date"] = (
    df_2010_baking["Dia de mes"].astype(str) 
    + "-" 
    + df_2010_baking["Mes de any"].astype(str)
)

data = df_2010_baking['date'].value_counts()

data

date
15-12    46
29-11    45
30-11    44
16-11    43
14-4     42
         ..
3-4       8
1-1       7
2-4       7
12-10     7
16-8      4
Name: count, Length: 365, dtype: int64

In [10]:
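# `data` already holds accidents per date, so `data.value_counts()` tallies how many
# dates share each total; `data.mean()` / `data.std()` would give per-day statistics.
pob_mean = data.value_counts().mean()
pob_std = data.value_counts().std()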

pob_std

np.float64(6.264203373330006)
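
Because `data` already stores the number of accidents for each date, calling `value_counts()` on it a second time counts how many dates share the same total. A toy example with made-up dates (not taken from the dataset) shows how the two summaries differ:

```python
import pandas as pd

# Made-up records: one row per accident, labelled by day-month
toy = pd.DataFrame({"date": ["1-1", "1-1", "1-1", "2-1", "2-1", "3-1", "3-1", "4-1"]})

# Accidents per date: 3, 2, 2, 1
daily_counts = toy["date"].value_counts()
# How many dates had each total: two dates with 2, one with 3, one with 1
count_frequencies = daily_counts.value_counts()

mean_per_day = daily_counts.mean()              # (3 + 2 + 2 + 1) / 4
mean_of_frequencies = count_frequencies.mean()  # (2 + 1 + 1) / 3

print(f"mean accidents per day:  {mean_per_day:.3f}")
print(f"mean of the frequencies: {mean_of_frequencies:.3f}")
```

Only the first quantity is an average number of accidents per day; the second describes how often each daily total occurs.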

*Problem 4:*

Barcelona's city council is evaluating the effectiveness of its safety strategy. In a sample of 30 days, the average number of accidents was 8.02.

Can we conclude that the safety policies have had any effect?

Hypothesis:

$$H_0: \mu = 8.92$$
$$H_1: \mu \neq 8.92$$

$$\alpha = 0.05$$

Data:

$\bar{x} = 8.02$

$\sigma = 6.26$

$\mu_0 = 8.92$

$n = 30$

In [11]:
hypothesis_testing(n=30, sample_mean=8.02, pob_mean=8.92, pob_std=6.26, alpha=0.05, two_tails=True)

conclusion = (
    'The two-tailed p-value (about 0.43) is greater than alpha = 0.05, so H_0 is not rejected. '
    'The sample gives no significant evidence that the safety policies changed the mean.'
)
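
If the 30 daily totals themselves were available, rather than only their mean, the test could be run on the raw values and the standard deviation estimated from the sample. The sketch below uses `scipy.stats.ttest_1samp` on synthetic Poisson counts generated with a fixed seed; the sample is a stand-in for illustration, not data from the Barcelona file.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for 30 observed daily totals (assumption, not real data)
rng = np.random.default_rng(seed=0)
sample = rng.poisson(lam=8.0, size=30)

# Two-sided one-sample t-test against the hypothesized mean from Problem 4
result = stats.ttest_1samp(sample, popmean=8.92)
alpha = 0.05
decision = "rejected" if result.pvalue <= alpha else "NOT rejected"

print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.3f}: H_0 is {decision}")
```

Unlike the z-test above, this version does not need a population standard deviation, at the cost of using the t distribution with $n - 1$ degrees of freedom.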