# **TikTok Project**
**Course 2 - Get Started with Python**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>

To complete the activity, follow the instructions and answer the questions below. Then, you will use your responses to these questions and the questions included in the Course 2 PACE Strategy Document to create an executive summary.

Be sure to complete this activity before moving on to Course 3. You can assess your work by comparing the results to a completed exemplar after completing the end-of-course project.

# **Identify data types and compile summary information**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

# **PACE stages**

<img src="images/Pace.png" width="100" height="100" align=left>

   *        [Plan](#scrollTo=psz51YkZVwtN&line=3&uniqifier=1)
   *        [Analyze](#scrollTo=mA7Mz_SnI8km&line=4&uniqifier=1)
   *        [Construct](#scrollTo=Lca9c8XON8lc&line=2&uniqifier=1)
   *        [Execute](#scrollTo=401PgchTPr4E&line=2&uniqifier=1)

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:



### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

- Identify the columns and their data types.
- Check for any missing values or anomalies.
- Outline the key variables relevant to the claims classification project.
- Develop a structured plan for data analysis and exploratory analysis.

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
*   `import pandas as pd`

*   `import numpy as np`


In [2]:
# Import packages
import pandas as pd
import numpy as np

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [3]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

In [4]:
# Display and examine the first ten rows of the dataframe
print(data.head(10))


    # claim_status    video_id  video_duration_sec  \
0   1        claim  7017666017                  59   
1   2        claim  4014381136                  32   
2   3        claim  9859838091                  31   
3   4        claim  1866847991                  25   
4   5        claim  7105231098                  19   
5   6        claim  8972200955                  35   
6   7        claim  4958886992                  16   
7   8        claim  2270982263                  41   
8   9        claim  5235769692                  50   
9  10        claim  4660861094                  45   

                            video_transcription_text verified_status  \
0  someone shared with me that drone deliveries a...    not verified   
1  someone shared with me that there are more mic...    not verified   
2  someone shared with me that american industria...    not verified   
3  someone shared with me that the metro of st. p...    not verified   
4  someone shared with me that the number of 

In [5]:
# Get summary info
print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
None


In [6]:
# Get summary statistics
print(data.describe())


                  #      video_id  video_duration_sec  video_view_count  \
count  19382.000000  1.938200e+04        19382.000000      19084.000000   
mean    9691.500000  5.627454e+09           32.421732     254708.558688   
std     5595.245794  2.536440e+09           16.229967     322893.280814   
min        1.000000  1.234959e+09            5.000000         20.000000   
25%     4846.250000  3.430417e+09           18.000000       4942.500000   
50%     9691.500000  5.618664e+09           32.000000       9954.500000   
75%    14536.750000  7.843960e+09           47.000000     504327.000000   
max    19382.000000  9.999873e+09           60.000000     999817.000000   

       video_like_count  video_share_count  video_download_count  \
count      19084.000000       19084.000000          19084.000000   
mean       84304.636030       16735.248323           1049.429627   
std       133420.546814       32036.174350           2004.299894   
min            0.000000           0.000000          

Question 1:
The first ten rows show that each row represents a single TikTok video in which a claim or an opinion was made. The columns visible in the output are a row number (`#`), `claim_status`, `video_id`, `video_duration_sec`, `video_transcription_text`, and `verified_status`. The `data.info()` output lists the remaining columns: `author_ban_status` and five engagement counts (views, likes, shares, downloads, and comments). All ten rows shown are labeled `claim`.

Question 2:
Upon reviewing the `data.info()` output, I notice several things:

- The dataframe has 19,382 rows and 12 columns.
- The variables are a mix of types: five `float64`, three `int64`, and four `object` (string) columns.
- There are null values. `claim_status`, `video_transcription_text`, and all five count columns each have 19,084 non-null entries, so each is missing 298 values.
- Not all variables are numeric: `claim_status`, `video_transcription_text`, `verified_status`, and `author_ban_status` are categorical or text.
- `#` and `video_id` are identifiers rather than measurements, and there are no datetime columns.

Question 3:
When reviewing the `data.describe()` output, I notice the following about the distributions of each variable:

- The output covers numeric columns only. The statistics for `#` and `video_id` are not meaningful because those columns are identifiers.
- `video_duration_sec` ranges from 5 to 60 seconds, with a mean (32.4) close to its median (32), so it looks roughly symmetric.
- `video_view_count` has a mean of about 254,709 but a median of only 9,954.5, with a 75th percentile of 504,327 and a maximum of 999,817. A mean this far above the median points to a strongly right-skewed distribution.
- For likes, shares, and downloads, the standard deviation is larger than the mean (for example, 133,421 versus 84,305 for likes), which also suggests a long upper tail and possible outliers.
- No obviously impossible values, such as negative counts, appear in the visible output. The high-end values still warrant investigation to decide whether they are valid or should be handled separately.
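As a rough illustration of how the null values discussed under Question 2 could be counted directly, the sketch below uses `isna()` on a small toy dataframe. The dataframe `toy` and its values are hypothetical stand-ins, not rows from the TikTok dataset; in the notebook the same calls would presumably be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TikTok dataframe (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", np.nan, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 140877.0, np.nan, 437506.0],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Number of rows that contain at least one missing value
rows_with_nulls = int(toy.isna().any(axis=1).sum())
print("Rows with at least one null:", rows_with_nulls)
```

Comparing the per-column counts with the row-level count shows whether the missing values are concentrated in the same rows or scattered across different ones.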

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [7]:
# What are the different values for claim status and how many of each are in the data?
# Check the unique values in the claim_status column and count the occurrences of each
claim_status_counts = data['claim_status'].value_counts()

# Display the counts of each claim status
print(claim_status_counts)

claim      9608
opinion    9476
Name: claim_status, dtype: int64


**Question:** What do you notice about the values shown?
1. Distribution of Claims: There are 9,608 videos labeled `claim` and 9,476 labeled `opinion`.

2. Balance of Classes: The two classes are nearly balanced, at roughly 50.3% claims and 49.7% opinions among the labeled videos. This is helpful for a classification model, because a heavily imbalanced target can bias a model toward the more frequent class.

3. Presence of Other Categories: Only two values appear in `claim_status`; there are no additional categories.

4. Potential Data Quality Issues: The two counts sum to 19,084, not 19,382. This matches the 298 null values in `claim_status` reported by `data.info()`, since `value_counts()` excludes nulls by default. These rows need to be addressed before modeling.
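To express the counts above as proportions, `value_counts()` accepts `normalize=True`. The following is a minimal sketch on a hypothetical toy column (`toy` is not the real dataset); passing `dropna=False` additionally keeps missing labels in the denominator.

```python
import pandas as pd

# Toy claim_status column (hypothetical values): 3 claims, 1 opinion, 1 missing label
toy = pd.DataFrame({"claim_status": ["claim", "claim", "claim", "opinion", None]})

# Share of each label among the non-null rows (nulls are dropped by default)
shares = toy["claim_status"].value_counts(normalize=True)
print(shares)

# Share of each label among all rows, with missing labels counted as their own group
shares_incl_missing = toy["claim_status"].value_counts(normalize=True, dropna=False)
print(shares_incl_missing)
```

Which denominator is appropriate depends on whether the summary should describe only labeled videos or the whole file.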

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [21]:
# What is the average view count of videos with "claim" status?
# Calculate the average view count for videos with 'claim' status
average_claim_views = data[data['claim_status'] == 'claim']['video_view_count'].mean()

# Display the average view count for "claim" status
print("Average view count for 'claim' status:", average_claim_views)


Average view count for 'claim' status: 501029.4527477102


In [22]:
# What is the average view count of videos with "opinion" status?
# Calculate the average view count for videos with "opinion" status
average_opinion_views = data[data['claim_status'] == 'opinion']['video_view_count'].mean()

# Display the average view count for "opinion" status
print("Average view count for 'opinion' status:", average_opinion_views)


Average view count for 'opinion' status: 4956.43224989447


**Question:** What do you notice about the mean and median within each claim category?
1. Large Difference in Means: Videos labeled `claim` average about 501,029 views, while videos labeled `opinion` average about 4,956 views, a difference of roughly a factor of 100.

2. High Engagement for 'Claim' Videos: The high average view count for `claim` videos suggests that they reach far more viewers, possibly because they address more contentious or attention-grabbing topics.

3. Low Engagement for 'Opinion' Videos: The much lower average view count for `opinion` videos indicates that they reach a far smaller audience, or cover topics that are less likely to attract large viewership.

4. Median Not Yet Computed: The cells above calculate only the mean. Because the overall view count distribution is strongly right-skewed, the median for each claim status should also be calculated to check whether a few very popular videos are inflating the `claim` average.

5. Further Analysis Needed: Other engagement metrics (likes, shares, and comments) should be analyzed for both categories to understand overall user interaction with each type of content.
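The instructions above ask for both the mean and the median view count per claim status. A sketch of two equivalent approaches, Boolean masking and `groupby()`, is shown below on a hypothetical toy dataframe (`toy` and its numbers are made up for illustration).

```python
import pandas as pd

# Toy data (hypothetical values): three videos per claim status
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [100.0, 200.0, 900.0, 10.0, 20.0, 60.0],
})

# Boolean masking: select the 'claim' rows, then take the median of their view counts
claim_mask = toy["claim_status"] == "claim"
claim_median = toy.loc[claim_mask, "video_view_count"].median()
print("Median views for 'claim':", claim_median)

# groupby: mean and median for every claim status in one table
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
print(view_stats)
```

In the toy data the `claim` mean (400) sits well above its median (200) because of one large value, which is the kind of gap that signals skew.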

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [23]:
# Get counts for each group combination of claim status and author ban status
# Group by 'claim_status' and 'author_ban_status' and count the number of videos in each group
ban_status_counts = data.groupby(['claim_status', 'author_ban_status']).size().reset_index(name='video_count')

# Display the resulting counts
print(ban_status_counts)


  claim_status author_ban_status  video_count
0        claim            active         6566
1        claim            banned         1439
2        claim      under review         1603
3      opinion            active         8817
4      opinion            banned          196
5      opinion      under review          463


**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?
- The data shows 1,439 claim videos from banned authors, compared with only 196 opinion videos from banned authors. Relative to the size of each category, about 15% of claim videos (1,439 of 9,608) come from banned authors, versus about 2% of opinion videos (196 of 9,476). This suggests that banned authors are far more likely to post claim videos than opinion videos.
- One possible explanation is that users report these authors more often because their content tends to violate community guidelines or spark controversy.
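Raw counts are hard to compare when the groups differ in size, so one option is to normalize the counts within each claim status. The sketch below uses `pd.crosstab` with `normalize="index"` on a hypothetical toy dataframe; `toy` and its values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy data (hypothetical values): four claim videos and four opinion videos
toy = pd.DataFrame({
    "claim_status": ["claim"] * 4 + ["opinion"] * 4,
    "author_ban_status": [
        "active", "active", "banned", "under review",
        "active", "active", "active", "banned",
    ],
})

# Each row sums to 1: the share of every ban status within a claim status
ban_share = pd.crosstab(
    toy["claim_status"], toy["author_ban_status"], normalize="index"
)
print(ban_share)
```

Reading across a row then answers "what fraction of claim videos come from banned authors?" without having to divide the counts by hand.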

Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [26]:
# Calculate the median video share count for each author ban status
median_share_counts = data.groupby('author_ban_status')['video_share_count'].median().reset_index()

# Display the median video share counts for each author ban status
print(median_share_counts)


  author_ban_status  video_share_count
0            active              437.0
1            banned            14468.0
2      under review             9444.0


In [None]:
# What's the median video share count of each author ban status?
# Same calculation as cell [26] above
print(data.groupby('author_ban_status')['video_share_count'].median())

**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, the argument for the `agg()` function is a dictionary whose keys are columns. The values for each column are a list of the calculations you want to perform.
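The dictionary form described above could look like the following sketch, which runs on a hypothetical toy dataframe (`toy` and its values are made up). The notebook cell that follows uses pandas named aggregation instead, which yields flat column names; the dictionary form yields a two-level column index.

```python
import pandas as pd

# Toy data (hypothetical values): two videos per author ban status
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "video_view_count": [10.0, 30.0, 100.0, 300.0],
    "video_like_count": [1.0, 3.0, 40.0, 60.0],
    "video_share_count": [0.0, 2.0, 10.0, 30.0],
})

# Keys are column names; values are the list of calculations for that column
stats = toy.groupby("author_ban_status").agg({
    "video_view_count": ["count", "mean", "median"],
    "video_like_count": ["count", "mean", "median"],
    "video_share_count": ["count", "mean", "median"],
})
print(stats)

# Columns form a MultiIndex, so one statistic is selected with a (column, stat) tuple
banned_mean_views = stats.loc["banned", ("video_view_count", "mean")]
print("Mean views for banned authors:", banned_mean_views)
```

Either form gives the same numbers; the choice mainly affects how the resulting columns are named and accessed.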

In [27]:
# Group by author_ban_status and calculate count, mean, and median for specified columns
engagement_stats = data.groupby('author_ban_status').agg(
    video_view_count_count=('video_view_count', 'count'),
    video_view_count_mean=('video_view_count', 'mean'),
    video_view_count_median=('video_view_count', 'median'),
    video_like_count_count=('video_like_count', 'count'),
    video_like_count_mean=('video_like_count', 'mean'),
    video_like_count_median=('video_like_count', 'median'),
    video_share_count_count=('video_share_count', 'count'),
    video_share_count_mean=('video_share_count', 'mean'),
    video_share_count_median=('video_share_count', 'median')
).reset_index()

# Display the engagement statistics for each author ban status
print(engagement_stats)


  author_ban_status  video_view_count_count  video_view_count_mean  \
0            active                   15383          215927.039524   
1            banned                    1635          445845.439144   
2      under review                    2066          392204.836399   

   video_view_count_median  video_like_count_count  video_like_count_mean  \
0                   8616.0                   15383           71036.533836   
1                 448201.0                    1635          153017.236697   
2                 365245.5                    2066          128718.050339   

   video_like_count_median  video_share_count_count  video_share_count_mean  \
0                   2222.0                    15383            14111.466164   
1                 105573.0                     1635            29998.942508   
2                  71204.5                     2066            25774.696999   

   video_share_count_median  
0                     437.0  
1                   14468.0  
2  

**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?

Now, create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [28]:
# Create a likes_per_view column
# Create a comments_per_view column
# Create a shares_per_view column

# Create new columns for engagement rates
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

# Display the updated DataFrame with the new engagement rate columns
print(data[['author_ban_status', 'likes_per_view', 'comments_per_view', 'shares_per_view']].head())


  author_ban_status  likes_per_view  comments_per_view  shares_per_view
0      under review        0.056584           0.000000         0.000702
1            active        0.549096           0.004855         0.135111
2            active        0.108282           0.000365         0.003168
3            active        0.548459           0.001335         0.079569
4            active        0.622910           0.002706         0.073175


Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [29]:
# Group by 'claim_status' and 'author_ban_status', then aggregate the new engagement columns
engagement_summary = data.groupby(['claim_status', 'author_ban_status']).agg(
    count_likes_per_view=('likes_per_view', 'count'),
    mean_likes_per_view=('likes_per_view', 'mean'),
    median_likes_per_view=('likes_per_view', 'median'),
    count_comments_per_view=('comments_per_view', 'count'),
    mean_comments_per_view=('comments_per_view', 'mean'),
    median_comments_per_view=('comments_per_view', 'median'),
    count_shares_per_view=('shares_per_view', 'count'),
    mean_shares_per_view=('shares_per_view', 'mean'),
    median_shares_per_view=('shares_per_view', 'median')
).reset_index()

# Display the summary DataFrame
print(engagement_summary)


  claim_status author_ban_status  count_likes_per_view  mean_likes_per_view  \
0        claim            active                  6566             0.329542   
1        claim            banned                  1439             0.345071   
2        claim      under review                  1603             0.327997   
3      opinion            active                  8817             0.219744   
4      opinion            banned                   196             0.206868   
5      opinion      under review                   463             0.226394   

   median_likes_per_view  count_comments_per_view  mean_comments_per_view  \
0               0.326538                     6566                0.001393   
1               0.358909                     1439                0.001377   
2               0.320867                     1603                0.001367   
3               0.218330                     8817                0.000517   
4               0.198483                      196            

**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.




<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document and those below to craft your response.

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


==> ENTER YOUR RESPONSE HERE

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.