# The Power of Statistics

# Data exploration and hypothesis testing

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How will descriptive statistics help you analyze your data?

* How will you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerge from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>


# PACE stages

## PACE: Plan
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.

The research question is whether there is a statistically significant difference in video view counts between verified and unverified accounts.

We will conduct a two-sample hypothesis test of verified versus unverified accounts in terms of video view counts.

### **Task 1. Imports and Data Loading**

In [1]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

ModuleNotFoundError: No module named 'seaborn'

In [None]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

## PACE: Analyze and Construct
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?

Descriptive statistics help us understand and become familiar with the dataset. They let us check the averages and the distribution of the specific variable we will focus on.

### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Refer back to *Self Review Descriptive Statistics* for this step-by-step process.

</details>

Inspect the first five rows of the dataframe.

In [None]:
# Display first few rows
data.head()

In [None]:
# Generate a table of descriptive statistics about the data
data.describe()

In [None]:
# Check for missing values
data.isna().sum()

Drop the rows with missing values, provided that losing them is acceptable for the analysis.
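Before dropping rows, it can help to check how many of them would be lost. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical and are not taken from `tiktok_dataset.csv`); the same calls can be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", None, "not verified"],
    "video_view_count": [1200.0, 56000.0, np.nan, 430.0],
})

# Count the rows that contain at least one missing value
rows_before = len(toy)
rows_with_na = int(toy.isna().any(axis=1).sum())
share_dropped = rows_with_na / rows_before

# Drop those rows and report the impact
cleaned = toy.dropna(axis=0)
print(f"{rows_with_na} of {rows_before} rows ({share_dropped:.1%}) contain missing values")
print(f"Rows remaining after dropna: {len(cleaned)}")
```

If the share of affected rows is small, dropping them is usually a reasonable choice; if it is large, the missing values may deserve a closer look first.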

In [None]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [None]:
# Display first few rows after handling missing values
data.head()

We are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [None]:
# Compute the mean `video_view_count` for each group in `verified_status`

# Calculate mean for `verified_status` = "verified"
verified = data[data["verified_status"] == "verified"]
verified["video_view_count"].mean()

In [None]:
# Calculate mean for `verified_status` = "not verified"
not_verified = data[data["verified_status"] == "not verified"]
not_verified["video_view_count"].mean()
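As an alternative to filtering each group separately, the group means can be computed in a single step with `groupby`. This is a sketch on made-up numbers (the `toy` dataframe is hypothetical); only the column names `verified_status` and `video_view_count` come from the dataset used here.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the real dataset
toy = pd.DataFrame({
    "verified_status": ["verified", "verified", "not verified", "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 1000.0, 2000.0, 6000.0],
})

# Summarize the view counts per verification status in one call
group_stats = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(group_stats)
```

Including the median next to the mean gives a quick indication of skew: when the two differ a lot within a group, a few very large values are pulling the mean upward.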

### Task 3. Hypothesis testing

The goal is to conduct a two-sample t-test. 

Steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

$H_0$: There is no difference in mean `video_view_count` between verified and not verified accounts (any observed difference in the sample is due to chance).

$H_1$: There is a difference in mean `video_view_count` between verified and not verified accounts (the observed difference reflects an actual difference in the population).

We choose a significance level of 5%.
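The decision rule that follows from the chosen significance level can be sketched as below. The two samples are synthetic and generated only for illustration; `alpha`, `group_a`, and `group_b` are hypothetical names and do not come from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic samples (illustrative only), with different means and variances
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=130.0, scale=25.0, size=150)

# Step 2: choose the significance level
alpha = 0.05

# Step 3: find the p-value with a two-sided, two-sample t-test
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Step 4: reject the null hypothesis only if the p-value is below alpha
reject_null = bool(result.pvalue < alpha)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}, reject H0: {reject_null}")
```

Formatting the p-value with `.3g` keeps the exponent visible when the value is very small, which avoids misreading the leading digits as the p-value itself.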

In [None]:
# Conduct a two-sample t-test to compare means
verified_views = data[data["verified_status"] == "verified"]["video_view_count"]
not_verified_views = data[data["verified_status"] == "not verified"]["video_view_count"]

stats.ttest_ind(a=not_verified_views, b=verified_views, equal_var=False)

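The p-value is far below the 5% significance level, so the null hypothesis is rejected. (The reported 2.61 is presumably the leading digits of a value printed in scientific notation, since a p-value cannot exceed 1.) The difference in mean video view counts between "verified" and "not verified" accounts is statistically significant and unlikely to be due to random chance.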
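Passing `equal_var=False` makes `stats.ttest_ind` perform Welch's t-test, which does not assume that the two groups have equal variances. As a rough illustration of what that computes, the sketch below derives the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand and compares them with SciPy's result. The arrays `a` and `b` are made-up numbers, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Made-up samples with visibly different spreads
a = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
b = np.array([22.0, 30.0, 19.0, 41.0, 27.0])

# Squared standard error of each sample mean (sample variance / n)
se2_a = a.var(ddof=1) / a.size
se2_b = b.var(ddof=1) / b.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (a.size - 1) + se2_b**2 / (b.size - 1))

# Two-sided p-value from the t distribution's survival function
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Compare against SciPy's implementation
reference = stats.ttest_ind(a=a, b=b, equal_var=False)
print(np.isclose(t_stat, reference.statistic), np.isclose(p_value, reference.pvalue))
```

Because the p-value is a tail probability, it always lies between 0 and 1, however extreme the test statistic is.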

## PACE: Execute

## Step 4: Communicate insights with stakeholders

### Outcome
The difference in video view counts between verified and not verified accounts is statistically significant, with not verified accounts showing the higher view counts. This difference should be investigated further. The next step is to build a regression model.