
In [1]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

In [2]:
# Load dataset into dataframe
data = pd.read_csv("/content/tiktok_dataset (1).csv")

In [3]:
# Display first few rows
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [4]:
# Generate a table of descriptive statistics about the data
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,12756.0,12756.0,12756.0,12755.0,12755.0,12755.0,12755.0,12755.0
mean,6378.5,5634399000.0,32.402948,378630.723167,125596.146844,24931.380008,1563.374755,521.316582
std,3682.484352,2542208000.0,16.205285,331186.470667,146603.576354,36510.342806,2283.429279,931.397885
min,1.0,1234959000.0,5.0,22.0,0.0,0.0,0.0,0.0
25%,3189.75,3417450000.0,18.0,9808.0,2810.0,481.0,31.0,6.0
50%,6378.5,5624307000.0,32.0,337386.0,66308.0,8415.0,538.0,121.0
75%,9567.25,7844339000.0,46.0,675094.5,204660.0,34664.5,2203.0,601.5
max,12756.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


In [5]:
# Check for missing values
data.isna().sum()

#                           0
claim_status                0
video_id                    0
video_duration_sec          0
video_transcription_text    0
verified_status             1
author_ban_status           1
video_view_count            1
video_like_count            1
video_share_count           1
video_download_count        1
video_comment_count         1
dtype: int64

In [6]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [7]:
# Display first few rows after handling missing values
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [8]:
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby("verified_status")["video_view_count"].mean()

verified_status
not verified    387152.276252
verified        191662.330935
Name: video_view_count, dtype: float64

# **Hypothesis testing**

Before conducting the hypothesis test, I state the null and alternative hypotheses for this project. The quantity being compared is the mean `video_view_count` of each `verified_status` group.


* Null hypothesis: There is no difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).

* Alternative hypothesis: There is a difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).
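
Written formally, the two hypotheses above can be stated as follows. The symbols are notation introduced here for illustration: $\mu_{\text{verified}}$ and $\mu_{\text{not verified}}$ stand for the population mean view counts of the two groups.

$$
H_0:\ \mu_{\text{not verified}} = \mu_{\text{verified}}
\qquad\text{versus}\qquad
H_A:\ \mu_{\text{not verified}} \neq \mu_{\text{verified}}
$$

Because the alternative is "not equal" rather than "greater than" or "less than", this is a two-sided test.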

In [9]:
# Conduct a two-sample t-test to compare means


# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Run a two-sample t-test on the two samples
# (equal_var=False gives Welch's t-test, which does not assume equal variances)
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

Ttest_indResult(statistic=15.209124189114362, pvalue=1.2515035115521706e-44)

Since the p-value (about 1.25e-44, with a t-statistic of about 15.21) is far smaller than the 5% significance level, I reject the null hypothesis. I conclude that there is a statistically significant difference in the mean video view count between verified and unverified accounts on TikTok.
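
To make the test above less of a black box, here is a minimal sketch of the arithmetic behind Welch's t-test, which is what `stats.ttest_ind` runs when `equal_var=False`. It uses only the standard library so each step is visible. The helper name `welch_t` and the two small samples are made up for illustration; they are not values from the TikTok dataset.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)

    # Squared standard error of each mean: sample variance (n - 1 denominator) / n
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b

    # Difference in means divided by its estimated standard error
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up illustrative samples (hypothetical, not from the dataset)
group_a = [10.0, 12.0, 14.0, 16.0, 18.0]
group_b = [5.0, 6.0, 7.0, 8.0]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The p-value is then the two-sided tail probability of a t distribution with `df` degrees of freedom, which is the part `scipy` handles in the notebook cell.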

**What business insight(s) can I draw from the result of my hypothesis test?**


The test shows a statistically significant difference in mean view counts between the two groups: in this sample, videos from unverified accounts average about 387,000 views, roughly twice the average of about 192,000 for verified accounts. This suggests there might be fundamental behavioral differences between these two groups of accounts.

It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

The next step will be to build a regression model with `verified_status` as the outcome. Although the end goal of the project is to make predictions on claim status, a model for `verified_status` can help characterize how verified and unverified accounts differ in behavior. Technical note for preparing the model: `verified_status` takes only two values, so logistic regression is the appropriate model type. Because the count variables are skewed and the two account types differ significantly, it will also be key to check for outliers and consider transforming those variables before fitting.
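
As a rough illustration of the logistic regression step proposed above, the sketch below fits a one-feature model by plain gradient descent, using only the standard library. Everything in it is an assumption made for illustration: the ten view counts and labels are made up, the `log1p` transform is one possible way to handle skew, and `fit_logistic` is a hypothetical helper. In the actual project a statistics or machine learning library would be the more likely choice.

```python
import math


def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the log loss wrt (w*x + b)
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Made-up view counts and labels (1 = verified, 0 = not verified)
views = [1200, 5400, 9800, 15000, 48000, 150000, 340000, 420000, 675000, 900000]
verified = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

# log1p compresses the long right tail of the view counts
log_views = [math.log1p(v) for v in views]

# Center the feature so gradient descent behaves well
mean_lv = sum(log_views) / len(log_views)
xs = [lv - mean_lv for lv in log_views]

w, b = fit_logistic(xs, verified)

# Predicted probability of "verified" for a low-view and a high-view video
p_low = sigmoid(w * (math.log1p(5_000) - mean_lv) + b)
p_high = sigmoid(w * (math.log1p(500_000) - mean_lv) + b)
print(f"w = {w:.3f}, b = {b:.3f}, p_low = {p_low:.3f}, p_high = {p_high:.3f}")
```

In this toy data the verified videos sit mostly at the low end of the view counts, mirroring the direction of the group means found earlier, so the fitted coefficient on log views comes out negative.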