In [189]:
# Hypothesis Testing

### Dataset Used : Coursera Course Dataset
#### URL : https://www.kaggle.com/datasets/siddharthm1698/coursera-course-dataset

#### Feature-Engineered Data : https://www.kaggle.com/code/azminetoushikwasi/coursera-eda-prep-viz-fe-with-analytics-insights/notebook

## Data Brief
Course dataset scraped from the Coursera website. This dataset contains 6 main columns and 890 course records. The detailed description:

1. course_title : The title of the course.
2. course_organization : The organization that conducts the course.
3. course_Certificate_type : The type of certification available for the course.
4. course_rating : The rating associated with the course.
5. course_difficulty : The difficulty level of the course.
6. course_students_enrolled : The number of students enrolled in the course.

In [190]:
## Data import and Correlation Matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sps

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
df=pd.read_csv("coursera_data_FEd.csv")
df=df.drop("Unnamed: 0",axis=1)


#### Sampling example

In [191]:
sample=df.sample(40,random_state=1)
sample.describe()

| statistic | course_rating | course_students_enrolled_modified | course_difficulty_modified |
|---|---|---|---|
| count | 40.0 | 40.0 | 40.0 |
| mean | 4.675 | 7.162798 | 0.325 |
| std | 0.151488 | 8.651666 | 0.460629 |
| min | 4.2 | 1.0 | 0.0 |
| 25% | 4.6 | 2.0 | 0.0 |
| 50% | 4.7 | 5.0 | 0.0 |
| 75% | 4.8 | 8.0 | 0.5 |
| max | 4.9 | 38.0 | 2.0 |
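
To get a feel for how much a sample of 40 courses can vary from one draw to the next, the sketch below repeatedly samples from a synthetic population. It assumes 891 courses of which 487 are beginner level (the counts this notebook reports in the test further down); the array `is_beginner`, the seed, and the 2,000 repetitions are illustrative choices, not part of the original analysis.

```python
import math
import numpy as np

# Synthetic population: 1 marks a beginner-level course, 0 any other level.
# The counts 487 and 891 are the ones reported in this notebook.
is_beginner = np.array([1] * 487 + [0] * 404)
population_prop = is_beginner.mean()

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
n = 40                          # same sample size as df.sample(40, ...)

# Draw many samples without replacement and record each sample proportion.
sample_props = np.array([
    rng.choice(is_beginner, size=n, replace=False).mean()
    for _ in range(2000)
])

# Textbook standard error of a proportion (ignores the finite population).
theoretical_se = math.sqrt(population_prop * (1 - population_prop) / n)

print(f"population proportion : {population_prop:.4f}")
print(f"mean of sample props  : {sample_props.mean():.4f}")
print(f"std of sample props   : {sample_props.std():.4f}")
print(f"sqrt(p(1-p)/n)        : {theoretical_se:.4f}")
```

The spread of the simulated proportions should sit close to the formula value, which is the same standard-error formula the z test uses.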


# Hypothesis Formulation

### Hypothesis 01:
Null hypothesis: 50% or fewer of the courses are beginner-level courses.
- Test Method: z statistic
- Significance Level: 8%

### Hypothesis 02:
Null hypothesis: Coursera courses have an average rating of more than 4.5.
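
One possible way to test this hypothesis is a one-sided, one-sample t test. The sketch below assumes the null is read as μ ≥ 4.5 (so that it includes the boundary value) against the alternative μ < 4.5, and it assumes a 5% significance level because none is stated. The `ratings` array is invented for illustration and is not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings, invented for illustration only.
ratings = np.array([4.7, 4.8, 4.6, 4.5, 4.9, 4.7, 4.8, 4.6, 4.4, 4.7])

mu_0 = 4.5    # boundary value of the null hypothesis
alpha = 0.05  # assumed significance level; the notebook does not state one

# One-sided, one-sample t test of H0: mu >= mu_0 against H1: mu < mu_0.
result = stats.ttest_1samp(ratings, popmean=mu_0, alternative="less")
t_stat, p_value = result.statistic, result.pvalue

reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

On the real data, `ratings` would be replaced by `df['course_rating']`.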

### Hypothesis 03:
Null hypothesis: University courses have an average rating 0.2 higher than non-university courses.
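
A difference of means like this could be checked with a two-sample (Welch) t test. The sketch below assumes the null is H₀: μ_university − μ_non-university = 0.2, tested two-sided at an assumed 5% level. It shifts the university ratings down by 0.2 so that the standard test of "no difference" applies. Both arrays are invented for illustration; how courses would be labelled as university courses is left open.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings, invented for illustration only.
university = np.array([4.8, 4.7, 4.9, 4.6, 4.8, 4.8])
non_university = np.array([4.6, 4.5, 4.7, 4.4, 4.6, 4.5])

delta_0 = 0.2  # hypothesized difference in mean rating
alpha = 0.05   # assumed significance level; the notebook does not state one

# Subtracting delta_0 turns H0: mu_u - mu_n = delta_0 into H0: equal means.
# equal_var=False selects Welch's t test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(
    university - delta_0, non_university, equal_var=False
)

observed_diff = university.mean() - non_university.mean()
reject_h0 = p_value < alpha
print(f"observed difference = {observed_diff:.3f}")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```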

# Conducting a formal significance test for one of the hypotheses and discussing the results
### Testing for Hypothesis 01:

#### Necessary Data
- H₀: π ≤ 0.50
- H₁: π > 0.50
- α = 0.08
 
- Test Method: z statistic; z = (p-π)/σₚ, where p is the sample proportion, n is the sample size, and σₚ=sqrt(π(1-π)/n)

In [192]:
pi=0.5      # hypothesized proportion under H₀
alpha=0.08  # significance level α

#### Calculating P

In [193]:
sample_size=len(df)
sample_size

891

So, the total sample size is n = 891.

In [194]:
# Number of beginner-level courses (course_difficulty_modified == 0), the numerator of the sample proportion P
positives= df[df['course_difficulty_modified']==0]['course_rating'].count()
positives

487

Number of beginner-level courses (course_difficulty_modified == 0) = 487

In [195]:
P=positives/sample_size
P

0.5465768799102132

Now, we determine the value of σₚ, the standard error of the sample proportion under H₀, where σₚ=sqrt(π(1-π)/n)

In [196]:
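import math

# meu_p returns σₚ, the standard error of the sample proportion under H₀
# (despite its name, it is a standard deviation, not a mean)
def meu_p(pi, sample_size):
    temp = pi*(1-pi)/sample_size
    return math.sqrt(temp)

meu_p(pi, sample_size)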

0.016750630254320203

In [197]:
#defining z_statistic function

def z_stat(pi,p,sample_size):
    return (p-pi)/meu_p(pi,sample_size)

In [198]:
## Applying
z_stat(pi,P,sample_size)

2.7806046222171505

In [201]:
# Display a standard normal (z) table to look up the cumulative probability for z ≈ 2.78
from IPython.display import Image
Image(url= "http://www.z-table.com/uploads/2/1/7/9/21795380/8573955.png?759")

From the z table, the cumulative probability to the left of z ≈ 2.78 is about 0.9973, taken here as roughly 0.998. Because H₁ is right-tailed, we need the probability to the right of z, that is, the area that falls in the rejection (critical) region:

In [202]:
1-0.998

0.0020000000000000018
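
The table lookup can be cross-checked numerically. The sketch below recomputes the test from the counts reported in this notebook (487 beginner-level courses out of 891) and uses `scipy.stats.norm` for the right-tail probability and for the critical value at α = 0.08. The variable names are illustrative.

```python
import math
from scipy.stats import norm

n = 891          # sample size reported above
positives = 487  # beginner-level courses reported above
pi_0 = 0.5       # hypothesized proportion under H0
alpha = 0.08     # significance level stated for Hypothesis 01

p_hat = positives / n
se = math.sqrt(pi_0 * (1 - pi_0) / n)  # standard error under H0
z = (p_hat - pi_0) / se

# Right-tailed test: p-value is the area to the right of z.
p_value = norm.sf(z)
# Critical value: reject H0 when z exceeds this quantile.
z_crit = norm.ppf(1 - alpha)

print(f"z = {z:.4f}, p-value = {p_value:.4f}, critical z = {z_crit:.4f}")
print("reject H0:", p_value < alpha)
```

With these counts the computed p-value is about 0.0027 and the critical value is about 1.41, so z ≈ 2.78 falls in the rejection region under either check.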

Alpha is 0.08, and the p-value (about 0.002) is smaller than alpha.
So, the null hypothesis is rejected.
## More than 50% of the courses are beginner-level courses.


# Suggestions for next steps in analyzing this data

- Test the other hypotheses.
- Analyze university-based data.
- Group the courses into related subjects based on keywords in the course title, and see whether any subject or field performs better than others.
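
The keyword grouping suggested in the last item could start from something like the sketch below. The course titles, ratings, and the `keywords` mapping are all invented for illustration; a real analysis would need a much more careful keyword list.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
courses = pd.DataFrame({
    "course_title": [
        "Machine Learning",
        "Python for Everybody",
        "Financial Markets",
        "Deep Learning with Python",
        "Introduction to Finance",
    ],
    "course_rating": [4.9, 4.8, 4.8, 4.7, 4.6],
})

# Hypothetical subject -> keyword mapping; the first match wins.
keywords = {"data/ML": "learning", "programming": "python", "finance": "financ"}

def assign_subject(title):
    """Return the first subject whose keyword appears in the title."""
    lowered = title.lower()
    for subject, keyword in keywords.items():
        if keyword in lowered:
            return subject
    return "other"

courses["subject"] = courses["course_title"].map(assign_subject)

# Number of courses and mean rating per subject.
summary = courses.groupby("subject")["course_rating"].agg(["count", "mean"])
print(summary)
```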

# The quality of this data set and a request for additional data if needed
- Data quality is good, but the data is not evenly distributed across categories.
- The course_rating column is highly one-sided: in the sample above, ratings range only from 4.2 to 4.9.
- The student enrollment count could be given as a number instead of a string.
- Course length and similar information would have been helpful.
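
Until the enrollment count is supplied as a number, it can be converted in code. The sketch below assumes the strings use `k` and `m` suffixes for thousands and millions (for example `"5.3k"`); `parse_enrollment` is a hypothetical helper, not part of the original notebook.

```python
def parse_enrollment(text):
    """Convert a string such as '5.3k' or '1.2m' to an integer count.

    Assumes 'k' means thousands and 'm' means millions; strings
    without a suffix are treated as plain numbers.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower()
    if cleaned and cleaned[-1] in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(float(cleaned))

examples = ["5.3k", "130k", "1.2m", "480"]
print([parse_enrollment(value) for value in examples])
```

Applied to a pandas column, this would be something like `df[column].map(parse_enrollment)`, where `column` is whichever column holds the raw strings.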

### Data Request: 
- More data is required for some categories (such as advanced courses) to analyse them better.
- More data means more accurate results. For a large platform like Coursera, we need more data and metadata, such as the date and time of course launch, the date of each record, and so on.