# Chocolate Bar Ratings Dataset

In this assignment you will use a dataset with the ratings of several chocolate bars produced by companies located in different parts of the world. The dataset is a pre-processed version of the original, which is available at this [link](https://www.kaggle.com/rtatman/chocolate-bar-ratings).

Your focus will be on the ratings of chocolate bars produced in the UK and Switzerland. The ratings are in the range of 1-5; the higher the better.

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.
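
As a minimal sketch of the second point, the snippet below uses a small made-up DataFrame (the names `original` and `working` and the values are illustrative assumptions, not part of the assignment data) to show that changes made to a `.copy()` leave the source DataFrame untouched:

```python
import pandas as pd

# Toy stand-in for the assignment's DataFrame (values are illustrative only).
original = pd.DataFrame({"company_location": ["U.K.", "Switzerland"],
                         "rating": [3.0, 3.5]})

# Work on an independent copy so later edits cannot touch `original`.
working = original.copy()
working["rating"] = working["rating"] + 1

print(original["rating"].tolist())  # [3.0, 3.5]
print(working["rating"].tolist())   # [4.0, 4.5]
```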

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats

### Dataset and variables

**1. Load the dataset called `flavors_cacao.csv` into a DataFrame**

Store the answer in a variable called `df` and call the `.head()` method to view the first 5 rows.

In [2]:
# Add your code below
df = pd.read_csv('data/flavors_cacao.csv')
df.head()


| Unnamed: 0 | company | species | REF | review_year | cocoa_p | company_location | rating | country |
|---|---|---|---|---|---|---|---|---|
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 0.63 | France | 3.75 | Sao Tome |
| 1 | A. Morin | Kpime | 1676 | 2015 | 0.7 | France | 2.75 | Togo |
| 2 | A. Morin | Atsane | 1676 | 2015 | 0.7 | France | 3.0 | Togo |
| 3 | A. Morin | Akata | 1680 | 2015 | 0.7 | France | 3.5 | Togo |
| 4 | A. Morin | Quilla | 1704 | 2015 | 0.7 | France | 3.5 | Peru |


**2. Using the `.loc` method, select the column `rating` for the rows of `df` where `company_location` equals `"U.K."`**

Store it in a variable called `uk_ratings`.

Your answer should be a Pandas Series.

In [3]:
# Add your code below
# Passing a single column label (not a list) to .loc returns a Series directly
uk_ratings = df.loc[df['company_location'] == 'U.K.', 'rating']
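
Whether `.loc` returns a Series or a DataFrame depends on how the column is specified. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration) to show the difference between passing a single label and passing a list of labels:

```python
import pandas as pd

# Illustrative data only; not taken from flavors_cacao.csv.
toy = pd.DataFrame({"company_location": ["U.K.", "France", "U.K."],
                    "rating": [3.0, 2.5, 3.5]})
mask = toy["company_location"] == "U.K."

as_series = toy.loc[mask, "rating"]    # single column label -> Series
as_frame = toy.loc[mask, ["rating"]]   # list of labels -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.tolist())        # [3.0, 3.5]
```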


**3. Using the `.loc` method, select the column `rating` for the rows of `df` where `company_location` equals `"Switzerland"`**

Store it in a variable called `swiss_ratings`

In [4]:
# Add your code below
# Passing a single column label (not a list) to .loc returns a Series directly
swiss_ratings = df.loc[df['company_location'] == 'Switzerland', 'rating']


**4. How many rows are in `uk_ratings`?** 

Store your answer in a variable called `uk_rows`

In [5]:
# Add your code below
uk_rows = uk_ratings.shape[0]



**5. What is the mean rating of the chocolate produced by companies in the UK?** 

Store your answer in a variable called `uk_mean_rating`

In [6]:
# Add your code below
uk_mean_rating = np.mean(uk_ratings)
print(uk_mean_rating)


3.0546875


**6. What is the Standard Error of the Mean (SEM) of the ratings of the chocolates produced by UK companies?** 

Store your answer in a variable called `uk_ratings_sem`

*Hint: You can use the `scipy.stats.sem` function*

In [7]:
# Add your code below

uk_ratings_sem = scipy.stats.sem(uk_ratings)
print(uk_ratings_sem)


0.050382735382976704
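
For reference, the SEM is the sample standard deviation (with `ddof=1`) divided by the square root of the sample size. The sketch below checks this against `scipy.stats.sem`, whose default is also `ddof=1`, on a small made-up sample (the array `sample` is an assumption for illustration):

```python
import numpy as np
import scipy.stats

# Illustrative sample; not taken from the chocolate dataset.
sample = np.array([2.5, 3.0, 3.5, 4.0])

# SEM by hand: sample standard deviation over sqrt(n)
manual_sem = np.std(sample, ddof=1) / np.sqrt(len(sample))

# SEM from SciPy (uses ddof=1 by default)
library_sem = scipy.stats.sem(sample)

print(np.isclose(manual_sem, library_sem))  # True
```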


**7. How many rows are in `swiss_ratings`?** 

Store your answer in a variable called `swiss_rows`

In [8]:
# Add your code below
swiss_rows = swiss_ratings.shape[0]
print(swiss_rows)

38


**8. What is the mean rating of the chocolate produced by Swiss companies?** 

Store your answer in a variable called `swiss_mean_rating`

In [9]:
# Add your code below
swiss_mean_rating = np.mean(swiss_ratings)
print(swiss_mean_rating)


3.3421052631578947


**9. What is the Standard Error of the Mean (SEM) of the ratings of the chocolate produced by Swiss companies?** 

Store your answer in a variable called `swiss_ratings_sem`

*Hint: You can use the `scipy.stats.sem` function*

In [10]:
# Add your code below
swiss_ratings_sem = scipy.stats.sem(swiss_ratings)
print(swiss_ratings_sem)

0.07567915156885435


**10. Define a function called `confidence_interval` which takes two inputs: the sample mean and the standard error. The function should return the 95% confidence interval around the mean of the data**

The confidence interval should be in the form of a list with two elements, for example: $[2.44, 2.98]$


In [11]:
# Add your code below
def confidence_interval(sample_mean, sem):
    # 97.5th and 2.5th percentiles of a normal centred on the sample mean
    high = scipy.stats.norm.ppf(0.975, loc=sample_mean, scale=sem)
    low = scipy.stats.norm.ppf(0.025, loc=sample_mean, scale=sem)
    interval = [low, high]
    return interval
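
An equivalent way to read this interval is as the mean plus or minus the critical z value (about 1.96 for 95%) times the SEM. The sketch below applies that form to the UK mean and SEM printed earlier in this notebook; the names `uk_mean`, `uk_sem` and `z_crit` are local to this illustration:

```python
import scipy.stats

# Values copied from the notebook outputs for the UK ratings.
uk_mean = 3.0546875
uk_sem = 0.050382735382976704

# Critical value of the standard normal for a 95% two-sided interval
z_crit = scipy.stats.norm.ppf(0.975)  # about 1.96

interval = [uk_mean - z_crit * uk_sem, uk_mean + z_crit * uk_sem]
print(interval)  # approximately [2.956, 3.153]
```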

**11. Compute the 95% confidence interval around `uk_mean_rating`**

Use the function `confidence_interval` and store your answer in a variable called `uk_confidence`

In [12]:
# Add your code below
uk_confidence = confidence_interval(uk_mean_rating,uk_ratings_sem)
print(uk_confidence)

[2.955939153206754, 3.153435846793246]


**12. Compute the 95% confidence interval around `swiss_mean_rating`**

Use the function `confidence_interval` and store your answer in a variable called `swiss_confidence`

In [13]:
# Add your code below
swiss_confidence = confidence_interval(swiss_mean_rating,swiss_ratings_sem)
print(swiss_confidence)

[3.1937768517023923, 3.490433674613397]


### Hypothesis Testing

**Swiss chocolate has a mean rating that is different from that of UK chocolate. Is this difference statistically significant?** 

We will use a two-tailed two-sample z-test to find out.

$H_0$: `uk_mean_rating`=`swiss_mean_rating`

$H_1$: `uk_mean_rating`$\neq$`swiss_mean_rating`

**13. What is the variance of the ratings of the chocolate produced by UK companies?** 

Store your answer in a variable called `uk_var`

*Hint: remember to set `ddof`=1*. 

Please refer to the [`documentation`](https://numpy.org/doc/stable/reference/generated/numpy.var.html) for a description of this parameter and to this Wikipedia [`link`](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)) for more info.

In [14]:
# Add your code below
uk_var = np.var(uk_ratings,ddof=1)
print(uk_var)

0.24368832236842106
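
To illustrate what `ddof` changes, the sketch below computes both variants on a small made-up array (`sample` is an assumption for illustration). With `ddof=0` the sum of squared deviations is divided by $n$; with `ddof=1` it is divided by $n-1$, which gives the unbiased sample variance:

```python
import numpy as np

# Illustrative sample: mean is 3, squared deviations sum to 2.
sample = np.array([2.0, 3.0, 4.0])

population_var = np.var(sample)       # ddof=0: 2 / 3
sample_var = np.var(sample, ddof=1)   # ddof=1: 2 / 2

print(population_var)
print(sample_var)  # 1.0
```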


**14. What is the variance of the ratings of the chocolates produced by Swiss companies?** 

Store your answer in a variable called `swiss_var`

In [15]:
# Add your code below
swiss_var = np.var(swiss_ratings,ddof=1)
print(swiss_var)


0.21763869132290195


**15. Define a function called `compute_z` which takes $6$ arguments as input: `mean_1`, `var_1`, `n_1`, `mean_2`, `var_2`, `n_2`. From these arguments, it should compute the two-sample z-test statistic**

*mean_i*, *var_i* and *n_i* indicate the mean, variance and number of samples of *sample_data_i*

Remember that the z statistic is computed as follows: $$ z= \frac{mean_1 - mean_2}{\sqrt{\frac{var_1}{n_1} + \frac{var_2}{n_2}}} $$ 

In [16]:
# Add your code below
def compute_z(mean_1, var_1, n_1, mean_2, var_2, n_2):
    z_score = (mean_1 - mean_2)/(((var_1/n_1) + (var_2/n_2))**0.5)
    return z_score
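
As a cross-check of the formula, note that $\frac{var_i}{n_i}$ is the square of the SEM when both are computed with `ddof=1`, so the same statistic can be obtained from the means and SEMs alone. The sketch below does this with the UK and Swiss values printed earlier in this notebook; the variable names are local to this illustration:

```python
import math

# Means and SEMs copied from the notebook outputs.
uk_mean, uk_sem = 3.0546875, 0.050382735382976704
swiss_mean, swiss_sem = 3.3421052631578947, 0.07567915156885435

# var/n equals sem**2, so the denominator is sqrt(sem_1^2 + sem_2^2)
z = (uk_mean - swiss_mean) / math.sqrt(uk_sem**2 + swiss_sem**2)
print(round(z, 4))  # -3.1613
```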


**16. Use this function to compute the z statistic for the hypothesis test we are looking to answer**

Store your answer in a variable called `z_statistic`

In [20]:
# Add your code below
z_statistic = compute_z(uk_mean_rating, uk_var, uk_rows, swiss_mean_rating, swiss_var, swiss_rows)
z_statistic

-3.161348448014516

**17. Write a function called `reject_hypothesis` that takes as input the `z_statistic` and `alpha`.**

If we can reject the null hypothesis, `reject_hypothesis` should return `True`. Otherwise, it should return `False`.

*Hint: To compute the threshold that you compare the `z_statistic` with, you can use the `stats.norm.ppf` method from Scipy.*

*Hint: Remember that we are doing a two-tailed test. The input to the `stats.norm.ppf` method has to be `alpha/2`, and you should reject the null hypothesis if $z\_statistic < threshold$ or if $z\_statistic > -threshold$*.

In [23]:
# Add your code below
def reject_hypothesis(z_statistic, alpha):
    # Two-tailed test: split alpha equally between the two tails
    tail = alpha/2
    print(tail)
    low_threshold = scipy.stats.norm.ppf(tail)
    high_threshold = scipy.stats.norm.ppf(1-tail)
    print(low_threshold, high_threshold)
    print(z_statistic)
    # Reject H0 when the statistic falls outside the two thresholds
    return not (low_threshold < z_statistic < high_threshold)
    

**18. Can we reject the null hypothesis of our test for `alpha` = 0.01?**

Use the function `reject_hypothesis` and give as input the `z_statistic` computed before.

Your answer should be a boolean.

Store your answer in a variable called `rejection`.

In [24]:
# Add your code below

rejection = reject_hypothesis(z_statistic, 0.01)
rejection


0.005
-2.575829303548901 2.5758293035489004
-3.161348448014516


True
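
The same decision can also be reached through a p-value rather than thresholds. This is not part of the assignment; it is a minimal sketch that takes the z statistic printed above and uses `scipy.stats.norm.sf` (the survival function, `1 - cdf`) to compute the two-tailed p-value, rejecting the null hypothesis when it is below `alpha`:

```python
import scipy.stats

# z statistic copied from the notebook output above.
z = -3.161348448014516
alpha = 0.01

# Two-tailed p-value: probability mass in both tails beyond |z|
p_value = 2 * scipy.stats.norm.sf(abs(z))

print(p_value < alpha)  # True
```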