# TT Project
Get Started with Python

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# Inspect and analyze data

In this activity, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>

# **Identify data types and compile summary information**


# PACE stages

## PACE: Plan

### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

Start by getting familiar with the data files and how they are organized on disk (folders, file names, formats). Then try to identify the collection timespan, whether the data comes from second- or third-party sources, how it was collected, and whether it is anonymized. I would also try to understand whether to expect any biases, given the data sources. Finally, review the Data Dictionary to become familiar with the values and to work out how the available data can serve the project scope.

## PACE: Analyze

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### Task 2a. Imports and data loading
Start by importing the packages that you will need to load and explore the dataset. 

In [1]:
# Import packages
import pandas as pd
import numpy as np

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

In [2]:
# Load dataset into dataframe
data = pd.read_csv("../data/tt_dataset.csv")

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe.

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

In [3]:
# Display and examine the first ten rows of the dataframe
data.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [4]:
# Get summary info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [5]:
# Get summary statistics
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


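- Each row represents one observation: a single video, with its claim status, metadata, and engagement counts.
- There are 12 columns of three dtypes: `int64` (3), `float64` (5), and `object` (4), so not all of them are numeric. There are null values: `claim_status`, `video_transcription_text`, and the five engagement count columns each have 19,084 non-null entries out of 19,382 rows, so 298 values are missing in each of those columns.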
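- I didn't notice any impossible values (such as negative counts) at this phase. The engagement counts do look right-skewed, though: the mean view count (about 254,709) is far above the median (9,954.5), so high-end outliers are worth checking later.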
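
The null-value check can also be made explicit rather than read off `data.info()`. The following is a minimal sketch on a small, hypothetical sample (the values are illustrative and are not taken from the real dataset); the same calls should apply to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only
sample = pd.DataFrame(
    {
        "claim_status": ["claim", "opinion", np.nan, "claim"],
        "video_duration_sec": [59, 32, 31, 25],
        "video_view_count": [343296.0, 140877.0, np.nan, 437506.0],
    }
)

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that have at least one missing value
rows_with_missing = int(sample.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```

Counting rows with `any(axis=1)` helps show whether the missing values are concentrated in the same rows or scattered across different ones.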

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [6]:
# What are the different values for claim status and how many of each are in the data?
data.groupby(["claim_status"]).count()

# or
# data["claim_status"].value_counts()

| claim_status | # | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 |
| opinion | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 |


**Question:** What do you notice about the values shown?
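
To judge how balanced the two classes are, proportions are easier to read than raw counts. A minimal sketch, assuming a small hypothetical series in place of the real `claim_status` column:

```python
import pandas as pd

# Hypothetical sample; the real column holds 9,608 "claim" and 9,476 "opinion" rows
claim_status = pd.Series(
    ["claim", "opinion", "claim", "opinion", "claim"], name="claim_status"
)

# Absolute counts per category (missing values are excluded by default)
counts = claim_status.value_counts()

# Share of each category among the non-missing rows
shares = claim_status.value_counts(normalize=True)

print(counts)
print(shares)
```

Passing `dropna=False` to `value_counts()` would also report rows with no claim status as their own category.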

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [7]:
# What is the average view count of videos with "claim" status?
claim_status_group = data.groupby(["claim_status"])

claim_status_group[["video_view_count"]].agg(["mean", "median"])

| claim_status | video_view_count (mean) | video_view_count (median) |
|---|---|---|
| claim | 501029.452748 | 501555.0 |
| opinion | 4956.43225 | 4953.0 |


In [8]:
# What is the average view count of videos with "opinion" status?
mean_claim_status_group = claim_status_group["video_view_count"].mean()

mean_claim_status_group

claim_status
claim      501029.452748
opinion      4956.432250
Name: video_view_count, dtype: float64
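
The cells above use `groupby()`. The Boolean-masking approach mentioned in the task description could be sketched as follows, on a hypothetical sample with illustrative values:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only
sample = pd.DataFrame(
    {
        "claim_status": ["claim", "claim", "opinion", "opinion", "claim"],
        "video_view_count": [500000.0, 300000.0, 5000.0, 4000.0, 700000.0],
    }
)

# A Boolean mask is a True/False series that selects the matching rows
claim_mask = sample["claim_status"] == "claim"
claims = sample[claim_mask]
opinions = sample[sample["claim_status"] == "opinion"]

claim_mean = claims["video_view_count"].mean()
claim_median = claims["video_view_count"].median()
opinion_mean = opinions["video_view_count"].mean()
opinion_median = opinions["video_view_count"].median()

print(claim_mean, claim_median, opinion_mean, opinion_median)
```

Masking is convenient when only one category is of interest; `groupby()` scales better when every category needs the same statistics.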

**Question:** What do you notice about the mean and median within each claim category?

- **Within each category the mean and median are close, which suggests that view counts within a category are not heavily skewed by outliers. Between categories, however, the difference is huge: claim videos average about 501,000 views, while opinion videos average about 5,000.**

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [9]:
# Get counts for each group combination of claim status and author ban status
ban = data.groupby(["claim_status", "author_ban_status"]).agg(["count"])

ban["#"]

| claim_status | author_ban_status | count |
|---|---|---|
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |


**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?

- **Claim videos have far more banned authors than opinion videos (1,439 versus 196). Posting claims may be correlated with violating the terms of use, which would explain the higher ban count among claim authors.**
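
Because the two claim categories are not exactly the same size, row-wise proportions make the comparison fairer than raw counts. From the counts above, roughly 15% of claim videos (1,439 of 9,608) come from banned authors, versus about 2% of opinion videos (196 of 9,476). A minimal sketch of that calculation with `pd.crosstab`, using a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only
sample = pd.DataFrame(
    {
        "claim_status": ["claim"] * 4 + ["opinion"] * 4,
        "author_ban_status": [
            "active", "active", "banned", "under review",
            "active", "active", "active", "banned",
        ],
    }
)

# normalize="index" divides each row by its total, so every row sums to 1
ban_share = pd.crosstab(
    sample["claim_status"], sample["author_ban_status"], normalize="index"
)

print(ban_share)
```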

Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [11]:
# What's the median video share count of each author ban status?
video_share_banned = data.groupby(["author_ban_status"])

video_share_banned["video_share_count"].median()

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

- **The median share count for banned authors (14,468) is far higher than for active authors (437), roughly 33 times as high.**

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, the argument for the `agg()` function is a dictionary whose keys are columns. The values for each column are a list of the calculations you want to perform.

In [12]:
# Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns: `video_view_count`, `video_like_count`, `video_share_count`

# Pass a dictionary so only the numeric columns are aggregated (mean fails on object columns)
stats = ["count", "mean", "median"]
video_banned = data.groupby("author_ban_status").agg(
    {"video_view_count": stats, "video_like_count": stats, "video_share_count": stats}
)

video_banned

**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?

- **Banned authors' view, like and share counts are much higher than the corresponding counts of active authors' videos**
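
An alternative to the dictionary form of `agg()` is pandas named aggregation, which aggregates only the listed columns and produces flat, self-describing column names. A minimal sketch on a hypothetical sample; the output column names (`view_count`, `view_mean`, `share_median`) are illustrative choices:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only
sample = pd.DataFrame(
    {
        "author_ban_status": ["active", "active", "banned", "banned"],
        "video_view_count": [100.0, 300.0, 1000.0, 3000.0],
        "video_share_count": [10.0, 30.0, 200.0, 400.0],
    }
)

# Each keyword is a new column name mapped to (source column, aggregation)
summary = sample.groupby("author_ban_status").agg(
    view_count=("video_view_count", "count"),
    view_mean=("video_view_count", "mean"),
    share_median=("video_share_count", "median"),
)

print(summary)
```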

In [None]:
# Create three new columns to help better understand engagement rates

# Create a `likes_per_view` column, which represents the number of likes divided by the number of views for each video
data["likes_per_view"] = data["video_like_count"] / data["video_view_count"]

# Create a `comments_per_view` column, which represents the number of comments divided by the number of views for each video
data["comments_per_view"] = data["video_comment_count"] / data["video_view_count"]

# Create a `shares_per_view` column, which represents the number of shares divided by the number of views for each video
data["shares_per_view"] = data["video_share_count"] / data["video_view_count"]

data.head()
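
The `data.describe()` output reports a minimum `video_view_count` of 20, so these ratios should not divide by zero in this dataset. For data where zero views can occur, one possible guard is to treat zero as missing before dividing. A minimal sketch on a hypothetical sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample that includes a video with zero views
sample = pd.DataFrame(
    {
        "video_view_count": [1000.0, 0.0, 200.0],
        "video_like_count": [100.0, 0.0, 50.0],
    }
)

# Replace zero views with NaN so the ratio is reported as missing, not inf or 0/0
safe_views = sample["video_view_count"].replace(0, np.nan)
sample["likes_per_view"] = sample["video_like_count"] / safe_views

print(sample)
```

Aggregations such as `mean()` and `median()` skip missing values by default, so the guarded rows drop out of later group statistics.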

In [None]:
# Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group

ratio_stats = ["count", "mean", "median"]
statistics_claim_status = data.groupby(["claim_status", "author_ban_status"]).agg(
    {"likes_per_view": ratio_stats, "comments_per_view": ratio_stats, "shares_per_view": ratio_stats}
)

statistics_claim_status

**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.

- **In every author ban status group, the averages for claim videos are higher than those for opinion videos.**

## PACE: Construct

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.

## PACE: Execute

Consider the questions in your PACE Strategy Document and those below to craft your response.

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


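- About 50.3% of the labeled videos are claims (9,608 of 19,084) and about 49.7% are opinions (9,476). Another 298 rows have no claim status.
- Claim status is strongly associated with author ban status and with view count. Claim videos account for 1,439 of the 1,635 videos by banned authors and 1,603 of the 2,066 videos by authors under review, and their median view count is 501,555 versus 4,953 for opinion videos.
- Videos by banned authors show higher engagement: their median share count is 14,468, compared with 437 for active authors.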
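
The group comparisons above suggest associations but do not quantify them. As a possible next step, pairwise Pearson correlations between numeric columns could be computed. This is a minimal sketch on a hypothetical sample, not a result from the real data, and Pearson correlation captures only linear relationships:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only
sample = pd.DataFrame(
    {
        "video_view_count": [100.0, 200.0, 300.0, 400.0],
        "video_like_count": [10.0, 20.0, 30.0, 40.0],
        "video_duration_sec": [40, 10, 30, 20],
    }
)

numeric_columns = ["video_view_count", "video_like_count", "video_duration_sec"]

# Pairwise Pearson correlation between the selected numeric columns
corr_matrix = sample[numeric_columns].corr()

print(corr_matrix)
```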