Your task is to analyze the survey data in surveydata3.csv. The analysis covers estimating the mean with its confidence interval and standard error, finding the minimum sample size for a specified precision, and estimating proportions.

- The data is loaded from the CSV file available at the following link: [surveydata3.csv](https://raw.githubusercontent.com/juanspinzon/survey-data/refs/heads/main/surveydata3.csv).
- The dataset contains 753 rows and 55 columns, including various demographic and survey response variables.
- A detailed description of the dataset can be found [here](https://raw.githubusercontent.com/juanspinzon/survey-data/refs/heads/main/surveydata3_description.csv).

**Instructions**

**Load the Data:**

- Load the data from the CSV file.
- Calculate the mean number of hours of sleep per night.
- Calculate the proportion of people who want to buy Udacity swag and prefer hoodies.

**Mean & Variance Estimation:**

- Calculate the mean and variance of the number of hours that Udacity students sleep per night.
- Calculate the confidence interval for the mean and variance.
- Calculate the standard error.
- Calculate the minimum sample size required to achieve a specified precision (e.g., 3%).
- Visualize the results with a boxplot (mean, standard error, confidence interval).

**Proportion Estimation:**

- Calculate the proportion of people who want to buy Udacity swag and prefer hoodies.
- Calculate the confidence interval for that proportion.
- Calculate the standard error.
- Calculate the minimum sample size required to achieve a specified precision (e.g., 3%).
- Visualize the results with a boxplot (proportion, standard error, confidence interval).
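
The proportion steps listed above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: the indicator `wants_hoodie` is hypothetical synthetic data (the real column names for swag and hoodie preference are not shown here), the 99% confidence level is an assumption, and "precision" is read as an absolute margin of error of 0.03 under the normal approximation.

```python
import math

import numpy as np
from scipy import stats

# Hypothetical 0/1 indicator: 1 = wants to buy swag AND prefers a hoodie.
# Synthetic data stands in for the real survey columns.
rng = np.random.default_rng(0)
wants_hoodie = rng.binomial(1, 0.4, size=500)

n = wants_hoodie.size
p_hat = wants_hoodie.mean()                      # sample proportion
se_p = math.sqrt(p_hat * (1 - p_hat) / n)        # standard error of p_hat

confidence_level = 0.99                          # assumed level
z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
ci_low, ci_high = p_hat - z * se_p, p_hat + z * se_p

# Smallest n whose margin of error z * sqrt(p(1-p)/n) is at most `margin`.
margin = 0.03
n_min = math.ceil(z**2 * p_hat * (1 - p_hat) / margin**2)

print(f"Proportion: {p_hat:.3f}, SE: {se_p:.4f}")
print(f"{confidence_level:.0%} CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"Minimum sample size for a {margin:.0%} margin: {n_min}")
```

With real data, the indicator would be built by combining the two survey columns with a logical AND before taking the mean.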



In [3]:
import pandas as pd
url = 'https://raw.githubusercontent.com/juanspinzon/survey-data/refs/heads/main/clean_surveydata3.xlsx?raw=true'
df = pd.read_excel(url, engine='openpyxl')
print(df.head())
print(df.shape)

   Index                                      reasons_study  age_years  \
0      0                                                NaN       32.0   
1      1                                                NaN       38.0   
2      2                   Start a new career in this field       30.0   
3      3  General interest in the topic (personal growth...       37.0   
4      4                   Start a new career in this field       24.0   

   sleep hours per night  commute_minutes  sitting hours per day  \
0                    NaN              NaN                    NaN   
1                    NaN              NaN                    NaN   
2                    7.0             45.0                    8.0   
3                    7.0             30.0                    5.0   
4                    8.0             65.0                    0.0   

   books per year   location  buy swag  \
0             NaN      China         1   
1             NaN  Argentina         1   
2             2.0   

In [6]:
sleep = df['sleep hours per night'].dropna()
sleep

2      7.0
3      7.0
4      8.0
5      6.0
6      8.0
      ... 
748    7.0
749    7.0
750    8.0
751    7.0
752    6.0
Name: sleep hours per night, Length: 747, dtype: float64

In [17]:
from scipy import stats

mean = sleep.mean()
print(f"Mean: {mean}")
se = stats.sem(sleep)
print(f"Standard Error: {se}")
relative_error = (se/mean) * 100
print(f"Relative Error: {relative_error:.2f}%")

Mean: 6.918340026773762
Standard Error: 0.03614444538796781
Relative Error: 0.52%
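
The minimum sample size for a given precision can be estimated from the numbers printed above. The sketch below makes two assumptions that the notebook does not state: "3% precision" is taken to mean a confidence-interval half-width of 3% of the mean, and the confidence level is 99% with a normal critical value.

```python
import math

from scipy import stats

# Values copied from the cell output above.
mean = 6.918340026773762
se = 0.03614444538796781
n = 747  # number of non-missing sleep responses

relative_precision = 0.03   # assumed meaning: half-width / mean
confidence_level = 0.99     # assumed level

s = se * math.sqrt(n)       # recover the sample standard deviation from the SE
z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
margin = relative_precision * mean

# Smallest n such that z * s / sqrt(n) <= margin.
n_min = math.ceil((z * s / margin) ** 2)
print(f"Minimum sample size: {n_min}")
```

Because the standard deviation is small relative to the mean, the required sample is far smaller than the 747 responses available.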


In [12]:
import statsmodels.stats.api as sms

In [15]:
confidence_level = 0.99
ci_mean_min, ci_mean_max = sms.DescrStatsW(sleep).tconfint_mean(alpha=1 - confidence_level)
print(f"Confidence Interval for the mean: {ci_mean_min}, {ci_mean_max}")



Confidence Interval for the mean: 6.8249993186828855, 7.011680734864638
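
As a cross-check, the same interval can be reproduced by hand from the mean and standard error reported earlier, using a Student's t critical value with n - 1 degrees of freedom. This is only a verification sketch; the input values are copied from the printed output rather than recomputed from the data.

```python
from scipy import stats

# Values copied from the earlier cell outputs.
mean = 6.918340026773762
se = 0.03614444538796781
n = 747
confidence_level = 0.99

# Two-sided critical value of Student's t with n - 1 degrees of freedom.
t_crit = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)

ci_low = mean - t_crit * se
ci_high = mean + t_crit * se
print(f"Manual CI for the mean: {ci_low}, {ci_high}")
```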


In [19]:

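# Chi-square quantiles for a 96% interval with n degrees of freedom
# (the textbook interval for a sample variance uses n - 1).
chi2_min, chi2_max = stats.chi2.interval(0.96, sleep.size)

# se**2 is the estimated variance of the sample mean (s^2 / n), so these
# bounds describe that quantity rather than the variance of sleep hours.
ci_var_min = ((sleep.size-1)*se**2) / chi2_max
print(ci_var_min)
ci_var_max = ((sleep.size-1)*se**2) / chi2_min
print(ci_var_max)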


0.0011763158122183696
0.0014550916753030172
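
For comparison, the textbook chi-square interval for the variance of the sleep hours themselves uses the sample variance s² and n - 1 degrees of freedom. The sketch below recovers s² from the reported standard error and assumes a 99% level to match the interval for the mean; it also assumes the data are roughly normal, which the chi-square interval requires.

```python
from scipy import stats

# Values copied from the earlier cell outputs.
se = 0.03614444538796781
n = 747
confidence_level = 0.99   # assumed, to match the interval for the mean

s2 = se**2 * n            # sample variance, since se = s / sqrt(n)
df = n - 1
alpha = 1 - confidence_level

chi2_low = stats.chi2.ppf(alpha / 2, df)
chi2_high = stats.chi2.ppf(1 - alpha / 2, df)

# The larger quantile gives the lower bound and vice versa.
ci_var_low = df * s2 / chi2_high
ci_var_high = df * s2 / chi2_low
print(f"Sample variance: {s2}")
print(f"CI for the variance: {ci_var_low}, {ci_var_high}")
```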
