<a href="https://colab.research.google.com/github/andremarinho17/data_analytics_projects_en/blob/main/TikTok_Project_A_B_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **TikTok Project - A/B Test**

<p align="center"><img src="https://logodownload.org/wp-content/uploads/2019/08/tiktok-logo-9.png" height="150px"></p>

**Course 4 - The Power of Statistics**

Author: André Marinho Moreira

You are a data professional at TikTok. The current project is reaching its midpoint; a project proposal, Python coding work, and exploratory data analysis have all been completed.

The team has reviewed the results of the exploratory data analysis and the previous executive summary the team prepared. You received an email from Orion Rainier, Data Scientist at TikTok, with your next assignment: determine and conduct the necessary hypothesis tests and statistical analysis for the TikTok classification project.

A notebook was structured and prepared to help you in this project. Please complete the following questions.


# **Course 4 End-of-course project: Data exploration and hypothesis testing**

In this activity, you will explore the data provided and conduct hypothesis testing.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How will descriptive statistics help you analyze your data?

* How will you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerge from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>

Follow the instructions and answer the questions below to complete the activity. Then, complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.



# **Data exploration and hypothesis testing**


# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.



## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response.

1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.

Research question: Given TikTok's user claim data, which hypothesis test best suits the data and the claims classification project?

*Complete the following steps to perform statistical analysis of your data:*

### **Task 1. Imports and Data Loading**

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Be sure to import `pandas`, `numpy`, `matplotlib.pyplot`, `seaborn`, and `scipy`.

</details>

In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

Load the dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# Load dataset into dataframe
df = pd.read_csv("tiktok_dataset.csv")

## **PACE: Analyze and Construct**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


Descriptive analysis (DA) helps data analysts get to know the data they are working with. With DA, it's possible to understand patterns in the data, how it is structured, which variables are available for analysis, how they are distributed, and other information needed to conduct the analysis.

Furthermore, DA provides essential insights into the characteristics of the dataset, such as central tendencies, variability, and relationships between variables, which are critical for guiding further analysis and decision-making.

### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Refer back to *Self Review Descriptive Statistics* for this step-by-step process.

</details>

Inspect the first five rows of the dataframe.

In [None]:
# Display first few rows
df.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [None]:
# Generate a table of descriptive statistics about the data
df.describe(include="all")

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19084,19382.0,19382.0,19084,19382,19382,19084.0,19084.0,19084.0,19084.0,19084.0
unique,,2,,,19012,2,3,,,,,
top,,claim,,,a friend read in the media a claim that badmi...,not verified,active,,,,,
freq,,9608,,,2,18142,15663,,,,,
mean,9691.5,,5627454000.0,32.421732,,,,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,,2536440000.0,16.229967,,,,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,,1234959000.0,5.0,,,,20.0,0.0,0.0,0.0,0.0
25%,4846.25,,3430417000.0,18.0,,,,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,,5618664000.0,32.0,,,,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,,7843960000.0,47.0,,,,504327.0,125020.0,18222.0,1156.25,292.0


The dataset includes 19,382 records across columns such as `video_duration_sec`, `video_view_count`, `video_like_count`, and others. Some columns have missing values (NaN): `claim_status`, `video_transcription_text`, and the five engagement count columns each have 19,084 non-null entries, which leaves 298 missing values per column.

When it comes to video performance, the average video duration is 32.4 seconds, with a maximum of 60 seconds. The video view count varies significantly, with an average of 254,708 views but a maximum reaching nearly 1 million. Additionally, videos generally receive a substantial amount of interaction: the average like count is 84,304, with some videos reaching over 657,000 likes. Share counts and downloads are lower, with means of 16,735 shares and 1,049 downloads.

It's important to draw attention to the disparity between interaction types. While videos gain significant likes, the number of shares and downloads is much lower. This suggests that while users are engaged enough to like the content, they are less inclined to share or download, possibly indicating passive consumption of the content.
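The averages quoted above can be pulled upward by a few very popular videos; in the `describe()` output the mean view count (254,708) sits far above the median (9,954.5). A minimal sketch of that comparison follows. The `views` series is made-up illustrative data, not the TikTok dataset.

```python
import pandas as pd

# Hypothetical view counts (made up): many small videos and a few very large ones
views = pd.Series(
    [20, 500, 4_900, 9_900, 10_000, 504_000, 900_000],
    name="video_view_count",
)

# Compare the mean with the median and the maximum
summary = views.agg(["mean", "median", "max"])
print(summary)

# A mean far above the median signals a right-skewed distribution
mean_to_median = summary["mean"] / summary["median"]
print(round(mean_to_median, 1))
```

When the ratio is large, the median is often a more representative "typical" value than the mean.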

Check for and handle missing values.

In [None]:
# Check for missing values
df.isna()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
19377,False,True,False,False,True,False,False,True,True,True,True,True
19378,False,True,False,False,True,False,False,True,True,True,True,True
19379,False,True,False,False,True,False,False,True,True,True,True,True
19380,False,True,False,False,True,False,False,True,True,True,True,True


In [None]:
# Drop rows with missing values

df = df.dropna()
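It can be useful to record how many rows `dropna()` removes, so the loss of data is visible in the report. A minimal sketch on a made-up frame (`toy`, `rows_before`, and `rows_dropped` are illustrative names, not part of the lab):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "claim_status": ["claim", np.nan, "opinion", "claim"],
    "video_view_count": [343296.0, np.nan, 140877.0, np.nan],
})

rows_before = len(toy)

# Drop every row that has at least one missing value
cleaned = toy.dropna(axis=0).reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```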

In [None]:
# Display first few rows after handling missing values

df.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


You are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [None]:
# Compute the mean `video_view_count` for each group in `verified_status`
print('Average video view count for verified users: ', df[df['verified_status'] == "verified"]['video_view_count'].mean())
print('Average video view count for non verified users: ', df[df['verified_status'] == "not verified"]['video_view_count'].mean())

Average video view count for verified users:  91439.16416666667
Average video view count for non verified users:  265663.78533885034
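The two filtered means above can also be computed in one step with `groupby`, which makes it easy to add the group sizes and medians alongside the means. The sketch below runs on made-up numbers (`toy` is hypothetical); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "verified_status": [
        "verified", "verified",
        "not verified", "not verified", "not verified",
    ],
    "video_view_count": [100.0, 300.0, 1000.0, 2000.0, 6000.0],
})

# Group size, mean, and median of views for each verification status
group_summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(group_summary)
```

Seeing the group sizes matters here because the two groups are very unequal in the dataset (18,142 of the records are `not verified`).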


### **Task 3. Hypothesis testing**

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypotheses. What are your hypotheses for this data project?

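As the sample means show, the average view count for videos from non-verified users is considerably higher than for videos from verified users. A hypothesis test can determine whether this difference is statistically significant or could plausibly be due to sampling variability.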



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis



* $H_0$: there is no difference in the mean number of views between videos from verified users and videos from unverified users.
* $H_A$: there is a difference in the mean number of views between videos from verified users and videos from unverified users.

You choose 5% as the significance level and proceed with a two-sample t-test.

In [None]:
# Conduct a two-sample t-test to compare means
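# Each series holds the raw view counts for one group (not averages)
views_verified = df[df['verified_status'] == "verified"]['video_view_count']
views_unverified = df[df['verified_status'] == "not verified"]['video_view_count']

# equal_var=False runs Welch's t-test, which does not assume equal variances
stats.ttest_ind(a=views_verified, b=views_unverified, equal_var=False)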

Ttest_indResult(statistic=-25.499441780633777, pvalue=2.6088823687177823e-120)
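For intuition about what `stats.ttest_ind(..., equal_var=False)` computes, here is a minimal sketch of Welch's t-test worked out by hand and checked against SciPy. The arrays `sample_a` and `sample_b` are synthetic data generated for illustration; they are not the TikTok view counts.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and variances (illustration only)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=100.0, scale=10.0, size=50)
sample_b = rng.normal(loc=110.0, scale=25.0, size=400)

# Sample variances (ddof=1) and the standard error of the difference in means
var_a, var_b = sample_a.var(ddof=1), sample_b.var(ddof=1)
se = np.sqrt(var_a / sample_a.size + var_b / sample_b.size)

# Welch's t statistic
t_manual = (sample_a.mean() - sample_b.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = se**4 / (
    (var_a / sample_a.size) ** 2 / (sample_a.size - 1)
    + (var_b / sample_b.size) ** 2 / (sample_b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# SciPy's result for comparison
result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The sign of the statistic depends only on the order of the two samples; a negative value means the first sample has the lower mean.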

**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?


The p-value is about $2.6 \times 10^{-120}$ (the `e-120` in the output is scientific notation), which is effectively zero.

This means that, if the null hypothesis were true, the probability of observing an absolute difference of 174,225 views or more between the mean views of videos from verified and non-verified users would be vanishingly small. In other words, the observed difference is very unlikely to be the result of sampling variability alone.

Therefore, as the p-value is far below the 5% significance level, you reject the null hypothesis: there is a statistically significant difference between the mean views of videos from verified users and those from unverified users.
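As a sketch of the last two steps of the test, the snippet below takes the p-value and the two group means printed earlier in this notebook (copied in by hand here, which is an assumption about how you would carry them over), prints the p-value in scientific notation, and applies the 5% decision rule.

```python
# Values copied from the notebook output above
p_value = 2.6088823687177823e-120
mean_verified = 91439.16416666667
mean_not_verified = 265663.78533885034

# Significance level chosen before running the test
alpha = 0.05

# Scientific notation makes the magnitude of the p-value explicit
print(f"p-value: {p_value:.2e}")

# Observed difference in mean views between the two groups
observed_diff = mean_not_verified - mean_verified
print(f"Observed difference in means: {observed_diff:,.0f} views")

# Decision rule: reject the null hypothesis when p < alpha
reject_null = p_value < alpha
print("Reject H0" if reject_null else "Fail to reject H0")
```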

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

## **Step 4: Communicate insights with stakeholders**

*Ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?

The statistically significant difference indicates that videos from non-verified users receive more views on average than videos from verified users. This challenges the assumption that verification automatically leads to higher visibility and suggests that other factors, such as content strategy, may play an important role in driving engagement. The test itself does not identify the cause of the difference.

With that said, it's important for TikTok to conduct a deeper study to understand how the behavior of these two groups differs on the platform.

Based on the findings, here are the recommended next steps:
* Perform a deeper analysis in order to understand what types of content unverified users are posting that may be driving higher engagement. Look at factors such as video length, trends, hashtags, and audience demographics.
* Encourage verified users to focus on creating more relatable, authentic content. This might involve engaging more with trends, being transparent with their audience, or posting more user-generated content.
* Stay updated on potential changes in TikTok’s algorithm to see if they influence the visibility of verified vs. unverified content, adjusting strategy accordingly.
* Set up regular monitoring and reporting on the viewership trends between verified and unverified users to track whether the pattern holds over time or shifts.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.