# **TikTok**


# **Inspect and analyze data**


**The purpose** of this Notebook is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of the findings.

*This Notebook has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables


# **PACE stages**




## **PACE: Plan**




### **Task 1. Understand the situation**



1. Exploring the dataset.
2. Analyzing missing data.
3. Checking whether the missing data has non-random causes.
4. Analyzing the existing values and creating relationships within.
5. Studying the results.


## **PACE: Analyze**



### **Task 2a. Imports and data loading**




In [1]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
file_path_main = '/content/drive/MyDrive/Data Analytics/Main Projects/TikTok project/tiktok_dataset.csv'
data = pd.read_csv(file_path_main)

### **Task 2b. Understand the data - Inspect the data**

Displaying and examining the first 5 rows of the dataframe:

In [4]:
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


Getting summary information:

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


Reviewing dataframe statistic description:

In [6]:
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


1. Each row represents a different published TikTok video in which a claim/opinion has been made.
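2. Fewer than half of the variables have complete values. Of the 12 columns, only 5 (`#`, `video_id`, `video_duration_sec`, `author_ban_status`, and `verified_status`) have no missing values. The other 7 each have 19,084 non-null values, so each is missing 298.
3. The standard deviation is high for `video_view_count`, `video_like_count`, `video_share_count`, `video_download_count`, and `video_comment_count`. In each case it exceeds the mean, and the mean is far above the median, which points to strongly right-skewed distributions with a wide gap between the minimum and maximum values.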
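
A minimal sketch of how the missing values could be examined further, in particular whether they fall in the same rows. The frame `toy` and its values are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented); column names follow the dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", np.nan, "claim", np.nan],
    "video_view_count": [100.0, 50.0, np.nan, 75.0, np.nan],
    "video_like_count": [10.0, 5.0, np.nan, 7.0, np.nan],
    "verified_status": ["not verified"] * 5,
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Columns that contain at least one missing value.
cols_with_na = missing_per_column[missing_per_column > 0].index

# Rows where any / all of those columns are missing. If the two counts
# match, the gaps sit in the same rows rather than being scattered.
rows_any_missing = toy[cols_with_na].isna().any(axis=1).sum()
rows_all_missing = toy[cols_with_na].isna().all(axis=1).sum()

print(missing_per_column)
print(rows_any_missing, rows_all_missing)
```

On the real dataframe, the same check would be run on `data` instead of `toy`.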


### **Task 2c. Understand the data - Investigate the variables**



What are the different values for claim status and how many of each are in the data?

In [7]:
data["claim_status"].value_counts()

claim      9608
opinion    9476
Name: claim_status, dtype: int64
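
The counts above exclude rows with no claim status. A small sketch, on invented labels, of how `value_counts` can report proportions and keep missing labels as their own category:

```python
import numpy as np
import pandas as pd

# Hypothetical labels (invented); NaN stands for a row with no claim status.
labels = pd.Series(["claim", "opinion", "claim", np.nan], name="claim_status")

# dropna=False keeps missing labels as a separate category;
# normalize=True returns proportions instead of raw counts.
shares = labels.value_counts(normalize=True, dropna=False)
print(shares)
```

Applied to `data["claim_status"]`, this would give the share of claims, opinions, and unlabeled rows in one call.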

Examining the engagement trends associated with each different claim status:

Average view count of videos with "claim" status:

In [9]:
mask_claim = data["claim_status"] == "claim"
print("CLAIM video view count mean:")
print(f"{data[mask_claim]['video_view_count'].mean():_.2f}")
print()
print("CLAIM video view count median:")
print(f"{data[mask_claim]['video_view_count'].median():_}")

CLAIM video view count mean:
501_029.45

CLAIM video view count median:
501_555.0


Average view count of videos with "opinion" status:

In [10]:
mask_opinion = data["claim_status"]=="opinion"
print("OPINION video view count mean:")
print(f"{data[mask_opinion]['video_view_count'].mean():_.2f}")
print()
print("OPINION video view count median:")
print(f"{data[mask_opinion]['video_view_count'].median():_.2f}")

OPINION video view count mean:
4_956.43

OPINION video view count median:
4_953.00


**Question:** What do you notice about the mean and median within each claim category?

* Within each category, the mean and median are close to each other. However, the `claim` values are roughly 100 times larger than the `opinion` values.
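
The two masked cells above can also be written as a single grouped aggregation. The sketch below uses an invented frame `toy`; the `mean_to_median` column is an illustrative addition, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy frame (values invented); column names follow the dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400.0, 500.0, 600.0, 4.0, 5.0, 6.0],
})

# Mean and median view count for each claim status in one call.
summary = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

# A ratio near 1 means the mean and median agree within the group.
summary["mean_to_median"] = summary["mean"] / summary["median"]
print(summary)
```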

Examining trends associated with the ban status of the author:


Counts for each group combination of claim status and author ban status.

In [36]:
data.groupby(["claim_status","author_ban_status"]).agg({"author_ban_status": "count"}).rename(columns={"author_ban_status":"count"})

| claim_status | author_ban_status | count |
|---|---|---|
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |


**Question:** What do you notice about the number of claim videos with banned authors? Why might this relationship occur?
* Banned authors are far more common among claim videos than among opinion videos: 1,439 claim videos come from banned authors versus 196 opinion videos, roughly 7 times as many.

* Several factors might explain this relationship. Claim videos may more often be contentious or disputed, which raises the likelihood of policy or community-standards violations and therefore of author bans. The large number of banned authors among claim videos may also indicate a higher incidence of copyright infringement or other problematic content in this category. Finally, moderation policies or enforcement mechanisms may scrutinize claim videos more closely, so authors who repeatedly violate platform guidelines are identified and penalized more often.
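
Because the two claim categories differ in size, it can help to compare shares rather than raw counts. The sketch below rebuilds the count table shown above by hand (the frame name `counts` is an assumption) and converts each row to proportions.

```python
import pandas as pd

# Counts copied from the grouped output above.
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status
# within a claim status.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))

# On the raw dataframe, a similar table could be obtained with:
# pd.crosstab(data["claim_status"], data["author_ban_status"], normalize="index")
```

Under these counts, about 15% of claim videos come from banned authors, compared with about 2% of opinion videos.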


Focusing now on `author_ban_status`.

Calculating the means and medians of `video_view_count`, `video_like_count`, and `video_share_count` for each author ban status:

In [37]:
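def format_number(x):
    # Format with two decimals and "_" as the thousands separator.
    return '{:_.2f}'.format(round(x, 2))

# Mean and median of each engagement count, per author ban status.
data.groupby(["author_ban_status"]).agg(
    {"video_view_count": ["mean", "median"],
     "video_like_count": ["mean", "median"],
     "video_share_count": ["mean", "median"]}
).applymap(format_number)  # pandas >= 2.1 deprecates applymap in favor of DataFrame.map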

| author_ban_status | video_view_count mean | video_view_count median | video_like_count mean | video_like_count median | video_share_count mean | video_share_count median |
|---|---|---|---|---|---|---|
| active | 215_927.04 | 8_616.00 | 71_036.53 | 2_222.00 | 14_111.47 | 437.00 |
| banned | 445_845.44 | 448_201.00 | 153_017.24 | 105_573.00 | 29_998.94 | 14_468.00 |
| under review | 392_204.84 | 365_245.50 | 128_718.05 | 71_204.50 | 25_774.70 | 9_444.00 |


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?
* Banned authors and those under review get far more views, likes, and shares than active authors.
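* For most combinations of ban status and metric, the mean is much greater than the median, which indicates that a few videos have very high engagement counts. The exception is the view count for banned authors, where the mean and median are nearly equal.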
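
A small illustration of why a mean far above the median signals right skew. The view counts below are invented; a single very large value pulls the mean up while the median barely moves.

```python
import pandas as pd

# Hypothetical view counts (invented): five small values and one outlier.
views = pd.Series([10, 12, 15, 18, 20, 5000], name="video_view_count")

mean_views = views.mean()      # pulled upward by the outlier
median_views = views.median()  # robust to the outlier
skewness = views.skew()        # positive for a right-skewed sample

print(mean_views, median_views, skewness)
```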




Creating three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [22]:
data["likes_per_view"] = data["video_like_count"]/data["video_view_count"]

data["comments_per_view"] = data["video_comment_count"]/data["video_view_count"]

data["shares_per_view"] = data["video_share_count"]/data["video_view_count"]
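
A brief sketch of how these ratios behave for rows with missing counts, using an invented frame `toy`. In the summary statistics above, the minimum of `video_view_count` is 20, so division by zero does not arise for rows that have a view count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented); the last row has missing counts.
toy = pd.DataFrame({
    "video_view_count": [100.0, 200.0, np.nan],
    "video_like_count": [25.0, 100.0, np.nan],
})

# The ratio is NaN wherever either count is missing.
toy["likes_per_view"] = toy["video_like_count"] / toy["video_view_count"]

# mean() skips NaN by default, so only the first two rows contribute.
mean_rate = toy["likes_per_view"].mean()
print(toy)
print(mean_rate)
```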

Now, getting the statistics:

In [38]:
grouped_data = data.groupby(["claim_status","author_ban_status"]).agg(
    {
     "likes_per_view":["mean","median"],
     "comments_per_view":["mean","median"],
     "shares_per_view":["mean","median"]
     })
grouped_data

| claim_status | author_ban_status | likes_per_view mean | likes_per_view median | comments_per_view mean | comments_per_view median | shares_per_view mean | shares_per_view median |
|---|---|---|---|---|---|---|---|
| claim | active | 0.329542 | 0.326538 | 0.001393 | 0.000776 | 0.065456 | 0.049279 |
| claim | banned | 0.345071 | 0.358909 | 0.001377 | 0.000746 | 0.067893 | 0.051606 |
| claim | under review | 0.327997 | 0.320867 | 0.001367 | 0.000789 | 0.065733 | 0.049967 |
| opinion | active | 0.219744 | 0.21833 | 0.000517 | 0.000252 | 0.043729 | 0.032405 |
| opinion | banned | 0.206868 | 0.198483 | 0.000434 | 0.000193 | 0.040531 | 0.030728 |
| opinion | under review | 0.226394 | 0.228051 | 0.000536 | 0.000293 | 0.044472 | 0.035027 |


**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.


* Videos by banned authors and authors under review tend to get far more views, likes, and shares than videos by active authors. However, once a video is viewed, its engagement rate depends less on author ban status and more on claim status.

* Claim videos have far higher view counts than opinion videos. The table above shows that they also have a higher rate of likes per view on average, so they are more favorably received as well. They also receive more comments and shares per view than opinion videos.

* For claim videos, banned authors have slightly higher likes-per-view and shares-per-view rates than active authors or those under review. For opinion videos, however, active authors and those under review both get higher engagement rates than banned authors in all three categories.



## **PACE: Construct**




## **PACE: Execute**



Consider the questions:



What percentage of the data is comprised of claims and what percentage is comprised of opinions?
* Of the 19,382 rows in this dataset, 9,608 (about 49.6%) are claims and 9,476 (about 48.9%) are opinions. The remaining 298 rows (about 1.5%) have no claim status.

What factors correlate with a video's claim status?
* Engagement level is strongly correlated with claim status. This should be a focus of further inquiry.

What factors correlate with a video's engagement level?
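* View, like, and share counts are far higher for videos by banned authors and authors under review than for videos by active authors. Per view, the differences by ban status are small: among `claim` videos, banned authors have slightly higher like and share rates than active and under-review authors, while among `opinion` videos, authors under review have the highest rates.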