# **TikTok Project**
**End-of-course project**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).

**Project plan**

TikTok is working on the development of a predictive model that can determine whether a video contains a claim or offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

**Project background**

In this activity, you will examine the data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>


TikTok’s data team is in the earliest stages of the claims classification project. The **following tasks** are needed before the team can begin the data analysis process:

 * A project proposal that does the following:

    * Organizes project tasks into milestones

    * Classifies tasks using the PACE workflow

    * Identifies relevant stakeholders

Use the **Course 1 PACE Strategy Document** to plan your project while considering your audience members, teammates, key milestones, and overall project goal.

**Create a project proposal for the data team.**

For your first assignment, TikTok is asking for a project proposal that will create milestones for the tasks within the claims classification project. Remember to take into account your audience, team, project goal, and the PACE stage of each task when planning your project deliverable.

If you selected the TikTok scenario, you are working on the development of a predictive model that can be used to determine whether a video contains a claim or whether it offers an opinion.


# **TikTok users have the ability to report videos and comments that contain user claims.**

These reports identify content that needs to be reviewed by moderators.

This process generates a large number of user reports that are difficult to address quickly.

TikTok is therefore developing a predictive model that can determine whether a video contains a claim or offers an opinion.

With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
tik_tok_data = pd.read_csv("tiktok_dataset.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'tiktok_dataset.csv'

In [None]:
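# pd.read_csv already returns a DataFrame; this line only provides the name used in the rest of the notebook
tik_tok_df = pd.DataFrame(tik_tok_data)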

In [None]:
tik_tok_df.columns.tolist()

In [None]:
tik_tok_df.head(10)

In [None]:
tik_tok_df.describe()

In [None]:
tik_tok_df.info()

In [None]:
tik_tok_df['claim_status'].value_counts()

In [None]:
claim = tik_tok_df[tik_tok_df['claim_status'] == 'claim']

In [None]:
opinion = tik_tok_df[tik_tok_df['claim_status'] == 'opinion']

In [None]:
claim_na = tik_tok_df[tik_tok_df['claim_status'].isna()]

# Claim VS Opinion

In [None]:
print("Number of videos categorized as claims:",len(claim))
print("Number of videos categorized as opinions:",len(opinion))
print("Total # of categorized videos:",len(opinion)+len(claim))
print("Number of videos without a claim status:", len(claim_na))

In [None]:
tik_tok_df = tik_tok_df.dropna()
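
Called without arguments, `dropna()` removes every row that has a missing value in any column, not only rows without a claim status. The sketch below shows one way to check where the gaps are before dropping anything. It uses a small hypothetical dataframe (`toy`) rather than the TikTok dataset, and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror those used in this notebook.
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion"],
    "video_view_count": [100.0, np.nan, 5.0],
    "video_like_count": [10.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_any_missing = int(toy.isna().any(axis=1).sum())

# Number of rows that survive a plain dropna().
rows_left = len(toy.dropna())

print(missing_per_column)
print("Rows with any missing value:", rows_with_any_missing)
print("Rows left after dropna():", rows_left)
```

If only the claim status matters, `dropna(subset=['claim_status'])` would keep rows whose gaps are in other columns.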

In [None]:
tik_tok_df

# Claim_status VS videos

In [None]:
claim_mean_med = claim.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)

In [None]:
opinion_mean_med = opinion.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)

In [None]:
claim_vs_opinion_perc = ((claim_mean_med-opinion_mean_med)/claim_mean_med*100).round(2).drop('median').rename({'mean':'mean percentage'})

In [None]:
claims_fin = pd.concat({'claim':claim_mean_med, 'opinion':opinion_mean_med, 'percentage claim/opinion':claim_vs_opinion_perc})
claims_fin

Videos that make claims clearly attract far more attention than videos that offer opinions. Claim videos are viewed, liked, shared, downloaded, and commented on much more often. The gap between the claim mean and the opinion mean, expressed as a percentage of the claim mean, ranges from 99.01% to 99.61% across these metrics. Video duration, by contrast, does not vary between the two groups.
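
The comparison above can also be sketched with a single `groupby`, which avoids building separate `claim` and `opinion` tables. This is a minimal illustration on a hypothetical dataframe (`toy`) with invented numbers. The percentage uses the same formula as the notebook: (claim mean - opinion mean) / claim mean * 100.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this notebook.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "video_view_count": [1000, 3000, 10, 30],
    "video_like_count": [400, 600, 2, 8],
})

metrics = ["video_view_count", "video_like_count"]

# Mean and median of every metric, one row per claim status.
summary = toy.groupby("claim_status")[metrics].agg(["mean", "median"])

# Gap between the group means as a percentage of the claim mean.
means = toy.groupby("claim_status")[metrics].mean()
pct_gap = (
    (means.loc["claim"] - means.loc["opinion"]) / means.loc["claim"] * 100
).round(2)

print(summary)
print(pct_gap)
```

Because the gap is divided by the claim mean, it can never exceed 100% when opinion counts are non-negative. Values close to 100% mean the opinion mean is tiny relative to the claim mean.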

# Active_ban VS videos

In [None]:
active_ban = tik_tok_df[tik_tok_df['author_ban_status'] == 'active']
banned_ban = tik_tok_df[tik_tok_df['author_ban_status'] == 'banned']
under_review_ban = tik_tok_df[tik_tok_df['author_ban_status'] == 'under review']

In [None]:
active_ban_mean_med = active_ban.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)
banned_ban_mean_med = banned_ban.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)
under_review_ban_mean_med = under_review_ban.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)

In [None]:
author_ban_status = pd.concat({
    'active': active_ban_mean_med,
    'banned': banned_ban_mean_med,
    'under_review': under_review_ban_mean_med
})
author_ban_status

Banned authors have the most viewed, liked, shared, downloaded, and commented videos, followed by authors under review. These are the authors who cause "trouble" in terms of reports and whose videos are likely to be reported.
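
One way to probe the link between ban status and content type is a row-normalized cross-tabulation of `author_ban_status` against `claim_status`. The sketch below uses a hypothetical dataframe (`toy`) with invented rows, so the shares it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this notebook.
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "active",
                          "banned", "banned", "under review"],
    "claim_status": ["opinion", "opinion", "claim",
                     "claim", "claim", "claim"],
})

# Share of claim and opinion videos within each ban status (each row sums to 1).
share = pd.crosstab(
    toy["author_ban_status"], toy["claim_status"], normalize="index"
).round(2)

print(share)
```

If banned and under-review authors post mostly claim videos, their higher engagement could partly reflect the claim effect rather than ban status itself.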

In [None]:
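# Each row is one video, so these are video counts rather than counts of distinct authors
print("Number of videos by active authors:", len(active_ban))
print("Number of videos by banned authors:", len(banned_ban))
print("Number of videos by authors under review:", len(under_review_ban))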

In [None]:
verified = tik_tok_df[tik_tok_df['verified_status'] == 'verified']
verified_mean = verified.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)

In [None]:
not_verified = tik_tok_df[tik_tok_df['verified_status'] == 'not verified']
not_verified_mean = not_verified.drop(columns=['#', 'video_id','video_duration_sec']).select_dtypes(include=[np.number]).agg(['mean', 'median']).round(2)

In [None]:
# Each row is one video, so these are video counts rather than counts of distinct users
print("Number of videos by verified users:", len(verified))
print("Number of videos by not verified users:", len(not_verified))
print("Percentage difference verified/not verified:", round((len(verified) - len(not_verified)) / len(not_verified) * 100, 2))

In [None]:
verified_perc = (((verified_mean-not_verified_mean)/verified_mean)*100).round(2).drop('median').rename({'mean':'mean percentage'})
verified_perc

In [None]:
verified_and_not = pd.concat({'verified':verified_mean, 'not verified': not_verified_mean, 'percentage':verified_perc})
verified_and_not

In [None]:
tot = pd.concat({'claimings': claims_fin.drop('percentage claim/opinion'), 'author bans': author_ban_status, 'verifications': verified_and_not.drop('percentage')})
tot

# Final
The **core findings** are that **claim** videos get disproportionately more interactions, that **banned** authors surprisingly have the most viewed, liked, shared, and downloaded videos, and that **unverified users** (especially those making claims) **dominate interactions**.

Videos containing **claims** and videos containing **opinions** appear in almost equal numbers.
Judging by the summary statistics above, the **distribution** of the engagement counts is evidently **right-skewed**. Plots would help clarify the picture.
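
A quick numeric check for right skew is to compare the mean with the median and to look at the sample skewness. The sketch below does this on a hypothetical series (`toy_views`) with invented values. On the real data the same calls could be applied to columns such as `video_view_count`.

```python
import pandas as pd

# Hypothetical view counts: most videos are small, one is very large.
toy_views = pd.Series([10, 20, 30, 40, 5000], name="video_view_count")

mean_value = toy_views.mean()
median_value = toy_views.median()

# Sample skewness; positive values indicate a long right tail.
skewness = toy_views.skew()

print("mean:", mean_value)
print("median:", median_value)
print("skewness:", round(skewness, 2))
```

A mean far above the median, together with positive skewness, is the usual signature of a right-skewed distribution.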

These findings hint at:

* A “controversy‐drives‐engagement” effect: **claim-heavy content gets amplified.**

* A paradox in which the **most popular creators** are the ones who most often receive bans through TikTok’s moderation flags.

* **Verification status** as a proxy for content type and engagement strategy.
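
The interplay between verification status and claim status can be examined by grouping on both columns at once. This is a minimal sketch on a hypothetical dataframe (`toy`) with invented numbers, not a result from the TikTok dataset.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this notebook.
toy = pd.DataFrame({
    "verified_status": ["verified", "verified",
                        "not verified", "not verified", "not verified"],
    "claim_status": ["opinion", "claim", "claim", "claim", "opinion"],
    "video_view_count": [50, 400, 900, 1100, 20],
})

# Video count and mean views for each (verification, claim) combination.
by_group = (
    toy.groupby(["verified_status", "claim_status"])["video_view_count"]
    .agg(["count", "mean"])
)

print(by_group)
```

Reporting the count next to the mean makes it easier to see whether a group stands out because of higher engagement per video or simply because it contains more videos.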

# Data visualisation

In [None]:
claim_counts = tik_tok_df['claim_status'].value_counts()
plt.figure()
plt.bar(claim_counts.index, claim_counts.values, edgecolor='k')
plt.title("Video Count by Claim Status")
plt.xlabel("Claim Status")
plt.ylabel("Number of Videos")
for idx, val in enumerate(claim_counts.values):
    plt.text(idx, val + max(claim_counts.values)*0.01, f"{val}", ha='center')
plt.tight_layout()
plt.show()

In [None]:
metrics = ['video_view_count', 'video_like_count', 'video_share_count', 
           'video_download_count', 'video_comment_count']
avg_engagement = tik_tok_df.groupby('author_ban_status')[metrics].mean()
# Running top of each stacked bar, one entry per ban status
bottom = np.zeros(len(avg_engagement))

plt.figure()
for metric in metrics:
    plt.bar(avg_engagement.index, avg_engagement[metric], bottom=bottom, edgecolor='k')
    bottom = bottom + avg_engagement[metric].values
plt.title("Average Engagement by Ban Status")
plt.xlabel("Author Ban Status")
plt.ylabel("Average Count")
plt.legend(metrics, title="Metrics")
plt.tight_layout()
plt.show()

In [None]:
# DataFrame.boxplot with `by=` creates its own figure, so no plt.figure() call is needed
tik_tok_df.boxplot(column='video_like_count', by='verified_status')
plt.title("Likes Distribution by Verification Status")
plt.suptitle("")
plt.xlabel("Verified Status")
plt.ylabel("Video Like Count")
plt.tight_layout()
plt.show()