# ETL

In data science and data engineering, the process of taking data from a source, changing it, and then loading it into a database is called ETL, short for extract, transform, load. ETL tends to be more programming-intensive than other data science tasks such as visualization, so this lesson also explores Python as an object-oriented programming language. Specifically, we'll create our own Python class to contain our ETL processes.

In [1]:
#import random

import pandas as pd
from pymongo import MongoClient

In [24]:
df = pd.read_excel(r"C:\Users\hp\WorldQuantum\7) A-B Testing\Wq-TestInfo-AB.xlsx")
df.head()

| Unnamed: 0 | _id | firstName | lastName | email | birthday | gender | highestDegreeEarned | countryISO2 | admissionsQuiz |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6525d787953844722c8383f8 | Terry | Hassler | terry.hassler28@yahow.com | 1998-04-29 | male | Bachelor's degree | GB | incomplete |
| 1 | 6525d787953844722c8383f9 | Alan | Noble | alan.noble91@hotmeal.com | 1999-03-03 | male | Bachelor's degree | NG | complete |
| 2 | 6525d787953844722c8383fa | Ruth | Vedovelli | ruth.vedovelli46@microsift.com | 1994-08-16 | female | Master's degree | ZM | incomplete |
| 3 | 6525d787953844722c8383fb | Jennifer | Mayer | jennifer.mayer25@gmall.com | 1984-11-23 | female | Bachelor's degree | NG | complete |
| 4 | 6525d787953844722c8383fc | Ray | Hersey | ray.hersey99@hotmeal.com | 1990-10-15 | male | Master's degree | PK | complete |


# Extract: Developing the Hypothesis

##### How many applicants actually complete the DS Lab admissions quiz?

In [25]:
len(df["admissionsQuiz"])

5025

In [26]:
df['admissionsQuiz'].notnull().sum()

5025

In [34]:
complete = (df["admissionsQuiz"]=="complete").sum()
incomplete = (df["admissionsQuiz"]!="complete").sum()
print("complete:", complete)
print("incomplete:", incomplete)

complete: 3717
incomplete: 1308


In [36]:
total = incomplete + complete
prop_incomplete = incomplete / total
print(
    "Proportion of users who don't complete admissions quiz:", round(prop_incomplete, 2)
)

Proportion of users who don't complete admissions quiz: 0.26


In [37]:
null_hypothesis = """
    There is no relationship between receiving an email and completing the quiz.
    Sending an email does not increase the rate of completion.
"""

alternate_hypothesis = """
    There is a relationship between receiving an email and completing the quiz.
    Sending an email does increase the rate of completion.
"""

print("Null Hypothesis:", null_hypothesis)
print("Alternate Hypothesis:", alternate_hypothesis)

Null Hypothesis: 
    There is no relationship between receiving an email and completing the quiz.
    Sending an email does not increase the rate of completion.

Alternate Hypothesis: 
    There is a relationship between receiving an email and completing the quiz.
    Sending an email does increase the rate of completion.



The next step is to filter the data so that we're only looking at students who applied on a certain date. This is a perfect chance to write a function!

Note: the spreadsheet loaded above is missing the `createdAt` column (`df["createdAt"]`), which a date filter would rely on.
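Because the loaded data has no `createdAt` column, the filtering function cannot be run on `df` as it stands. The following is only a sketch of how such a function might look, assuming a timestamp column named `createdAt`. The function name `filter_by_date`, its parameters, and the toy rows are all hypothetical and made up for illustration.

```python
import pandas as pd


def filter_by_date(df, date_string, date_col="createdAt"):
    """Return the rows whose `date_col` timestamp falls on the given calendar day."""
    # Build the half-open interval [start, start + 1 day)
    start = pd.to_datetime(date_string, format="%Y-%m-%d")
    end = start + pd.DateOffset(days=1)
    # Parse the column in case it was read in as strings
    timestamps = pd.to_datetime(df[date_col])
    mask = (timestamps >= start) & (timestamps < end)
    return df[mask]


# Toy frame with made-up values (assumption: the real data would carry `createdAt`)
toy = pd.DataFrame(
    {
        "email": ["a@example.com", "b@example.com", "c@example.com"],
        "createdAt": [
            "2022-05-01 09:30:00",
            "2022-05-02 00:00:00",
            "2022-05-02 18:45:00",
        ],
        "admissionsQuiz": ["complete", "incomplete", "incomplete"],
    }
)

same_day = filter_by_date(toy, "2022-05-02")
print(same_day["email"].tolist())
```

Using a half-open interval keeps a row stamped exactly at midnight in the day it starts, and avoids comparing against a hard-coded `23:59:59` upper bound.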

# Transform

# Load: Preparing the Data
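
The introduction mentions wrapping the ETL steps in a Python class. As a rough sketch of that idea (not the lesson's actual class), the example below keeps extract, transform, and load as separate methods. The names `QuizETL`, `InMemoryCollection`, and `quizComplete` are hypothetical, and the in-memory collection is only a stand-in so the example runs without a database server.

```python
class InMemoryCollection:
    """Stand-in for a database collection so the sketch runs without a server."""

    def __init__(self, docs=None):
        self.docs = list(docs or [])

    def find(self, query=None):
        # Return documents whose fields equal every key/value pair in `query`
        query = query or {}
        return [d for d in self.docs if all(d.get(k) == v for k, v in query.items())]

    def insert_many(self, docs):
        self.docs.extend(docs)


class QuizETL:
    """Hypothetical wrapper that keeps the three ETL steps in one place."""

    def __init__(self, source, target):
        self.source = source
        self.target = target

    def extract(self, query=None):
        # Pull matching documents out of the source as a list of dicts
        return list(self.source.find(query or {}))

    def transform(self, docs):
        # Add a boolean flag without mutating the extracted documents
        return [{**d, "quizComplete": d["admissionsQuiz"] == "complete"} for d in docs]

    def load(self, docs):
        # Skip the write when nothing matched, then report how many were loaded
        if docs:
            self.target.insert_many(docs)
        return len(docs)

    def run(self, query=None):
        return self.load(self.transform(self.extract(query)))


# Made-up documents, for illustration only
source = InMemoryCollection(
    [
        {"email": "a@example.com", "admissionsQuiz": "complete"},
        {"email": "b@example.com", "admissionsQuiz": "incomplete"},
    ]
)
target = InMemoryCollection()
etl = QuizETL(source, target)
n_loaded = etl.run({"admissionsQuiz": "incomplete"})
print(n_loaded, target.docs)
```

The class only relies on `find` and `insert_many`, which a pymongo `Collection` also provides, so a real collection could probably be passed in place of the stand-in, assuming a running MongoDB server.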