# **TikTok Project**
**Course 5 - Regression Analysis: Simplify complex data relationships**

You are a data professional at TikTok. The data team is working towards building a machine learning model that can be used to determine whether a video contains a claim or whether it offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

The team is getting closer to completing the project, having completed an initial plan of action, initial Python coding work, EDA, and hypothesis testing.

The TikTok team has reviewed the results of the hypothesis testing. TikTok’s Operations Lead, Maika Abadi, is interested in how different variables are associated with whether a user is verified. Earlier, the data team observed that if a user is verified, they are much more likely to post opinions. Now, the data team has decided to explore how to predict verified status to help them understand how video characteristics relate to verified users. Therefore, you have been asked to conduct a logistic regression using verified status as the outcome variable. The results may be used to inform the final model related to predicting whether a video is a claim vs an opinion.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 5 End-of-course project: Regression modeling**


In this activity, you will build a logistic regression model in Python. As you have learned, logistic regression helps you estimate the probability of an outcome. For data science professionals, this is a useful skill because it lets you relate more than one predictor variable to the outcome variable you are measuring. This opens the door to much more thorough and flexible analysis.

<br/>

**The purpose** of this project is to demonstrate knowledge of EDA and regression models.

**The goal** is to build a logistic regression model and evaluate the model.
<br/>
*This activity has three parts:*

**Part 1:** EDA & Checking Model Assumptions
* What are some purposes of EDA before constructing a logistic regression model?

**Part 2:** Model Building and Evaluation
* What resources do you find yourself using as you complete this stage?

**Part 3:** Interpreting Model Results

* What key insights emerged from your model(s)?

* What business recommendations do you propose based on the models built?

Follow the instructions and answer the questions below to complete the activity. Then, you will complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.


# **Build a regression model**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**
Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

### **Task 1. Imports and loading**
Import the data and packages that you've learned are needed for building regression models.

In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for data preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import resample

# Import packages for data modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder


Load the TikTok dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# Load dataset into dataframe
data = pd.read_csv("/content/tiktok_dataset.csv")

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

In this stage, consider the following question where applicable to complete your code response:

* What are some purposes of EDA before constructing a logistic regression model?


==> ENTER YOUR RESPONSE HERE

### **Task 2a. Explore data with EDA**

Analyze the data and check for and handle missing values and duplicates.

Inspect the first five rows of the dataframe.

In [None]:
# Display first few rows
### YOUR CODE HERE ###
data.head()

Get the number of rows and columns in the dataset.

In [None]:
# Get number of rows and columns
### YOUR CODE HERE ###
data.shape

Get the data types of the columns.

In [None]:
# Get data types of columns
### YOUR CODE HERE ###
data.dtypes

Get basic information about the dataset.

In [None]:
# Get basic information
### YOUR CODE HERE ###
data.info()

Generate basic descriptive statistics about the dataset.

In [None]:
# Generate basic descriptive stats
### YOUR CODE HERE ###
data.describe()

Check for and handle missing values.

In [None]:
# Check for missing values
### YOUR CODE HERE ###
data.isnull().sum()

In [None]:
# Drop rows with missing values
### YOUR CODE HERE ###
data = data.dropna(axis = 0).reset_index(drop=True)

In [None]:
# Display first few rows after handling missing values
### YOUR CODE HERE ###
data.head()

Check for and handle duplicates.

In [None]:
# Check for duplicates
### YOUR CODE HERE ###
data.duplicated().sum()

Check for and handle outliers.

In [None]:
# Create a boxplot to visualize distribution of `video_duration_sec`
### YOUR CODE HERE ###

plt.figure(figsize=(4,2))
plt.title('Boxplot to detect outliers for video_duration_sec', fontsize=12)
sns.boxplot(x=data['video_duration_sec'])
plt.show()

In [None]:
# Create a boxplot to visualize distribution of `video_view_count`
### YOUR CODE HERE ###

plt.figure(figsize=(4,2))
plt.title('Boxplot to detect outliers for video_view_count', fontsize=12)
sns.boxplot(x=data['video_view_count'])
plt.show()

In [None]:
# Create a boxplot to visualize distribution of `video_like_count`
### YOUR CODE HERE ###
plt.figure(figsize=(6,2))
plt.title('Boxplot to detect outliers for video_like_count', fontsize=12)
sns.boxplot(x=data['video_like_count'])
plt.show()


In [None]:
# Create a boxplot to visualize distribution of `video_comment_count`
### YOUR CODE HERE ###
plt.figure(figsize=(4,2))
plt.title('Boxplot to detect outliers for video_comment_count', fontsize=12)
sns.boxplot(x=data['video_comment_count'])
plt.show()
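
The boxplots above show outliers visually but do not count them. As a minimal sketch, assuming the common 1.5 × IQR rule is an acceptable definition of an outlier here, the helper below counts values outside the whiskers. The function name `count_iqr_outliers` and the toy values are hypothetical and are not taken from the TikTok dataset.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


# Hypothetical toy counts, not taken from the TikTok dataset
toy_counts = pd.Series([10, 12, 11, 13, 12, 11, 500])
n_outliers = count_iqr_outliers(toy_counts)
print(n_outliers)
```

The same helper could be applied to a column such as `data['video_like_count']` to decide whether the outliers are numerous enough to need handling.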


Check class balance.

In [None]:
# Check class balance for verified_status
### YOUR CODE HERE ###
data['verified_status'].value_counts()

Approximately 94.2% of the dataset represents videos posted by unverified accounts and 5.8% represents videos posted by verified accounts. So the outcome variable is not very balanced.
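
Percentages like these can be read directly from `value_counts(normalize=True)` rather than computed by hand. The sketch below uses a hypothetical toy Series in place of `data['verified_status']`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for data["verified_status"]
toy_status = pd.Series(["not verified"] * 16 + ["verified"] * 4)

# normalize=True returns proportions instead of raw counts
class_share = toy_status.value_counts(normalize=True) * 100
print(class_share.round(1))
```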

Use resampling to create class balance in the outcome variable, if needed.

In [None]:
# Use resampling to create class balance in the outcome variable, if needed

# Identify data points from majority and minority classes
data_majority = data[data["verified_status"] == "not verified"]
data_minority = data[data["verified_status"] == "verified"]

# Upsample the minority class (which is "verified")
data_minority_upsampled = resample(data_minority,
                                 replace=True,                 # to sample with replacement
                                 n_samples=len(data_majority), # to match majority class
                                 random_state=0)               # to create reproducible results

# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled]).reset_index(drop=True)


# Display new class counts
print(data_upsampled["verified_status"].value_counts())
data_upsampled.shape
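
One caveat with upsampling before the train-test split is that copies of the same verified-account row can land in both sets, which may make test metrics look better than they are. A possible alternative ordering, sketched here on a hypothetical toy frame (only the column names mirror the notebook), is to split first and then upsample the training set only. This illustrates the idea and is not the method this notebook follows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy frame; only the column names mirror the notebook's
toy = pd.DataFrame({
    "video_duration_sec": range(40),
    "verified_status": ["not verified"] * 32 + ["verified"] * 8,
})

# Split first, stratifying so both sets keep the original class ratio
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["verified_status"], random_state=0
)

# Upsample the minority class in the training set only
train_major = train[train["verified_status"] == "not verified"]
train_minor = train[train["verified_status"] == "verified"]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=0
)
train_balanced = pd.concat([train_major, train_minor_up]).reset_index(drop=True)

print(train_balanced["verified_status"].value_counts())
print(test["verified_status"].value_counts())
```

With this ordering the test set keeps the original class imbalance, so its metrics reflect the data the model would see in practice.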

Get the average `video_transcription_text` length for videos posted by verified accounts and the average `video_transcription_text` length for videos posted by unverified accounts.



In [None]:
# Get the average `video_transcription_text` length for verified accounts and the average `video_transcription_text` length for unverified accounts
### YOUR CODE HERE ###
data_upsampled[["verified_status", "video_transcription_text"]].groupby(by="verified_status")[["video_transcription_text"]].agg(func=lambda array: np.mean([len(text) for text in array]))


Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Create a new column 'text_length' and calculate the length of each string in 'video_transcription_text'
data_upsampled['text_length'] = data_upsampled['video_transcription_text'].apply(len)


In [None]:
# Display first few rows of dataframe after adding new column
### YOUR CODE HERE ###
data_upsampled.head()

Visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts.

In [None]:
# Visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts
# Create two histograms in one plot
### YOUR CODE HERE ###
# Plot a single stacked histogram, colored by verified status
sns.histplot(data=data_upsampled, stat="count", multiple="stack", x="text_length", kde=False, palette="pastel",
             hue="verified_status", element="bars", legend=True)
plt.xlabel("video_transcription_text length (number of characters)")
plt.ylabel("Count")
plt.title("Distribution of video_transcription_text length for videos posted by verified accounts and unverified accounts")
plt.show()

### **Task 2b. Examine correlations**

Next, code a correlation matrix to help determine the most correlated variables.

In [None]:
# Code a correlation matrix to help determine most correlated variables
### YOUR CODE HERE ###
data_upsampled.corr(numeric_only=True)

Visualize a correlation heatmap of the data.

In [None]:
# Create a heatmap to visualize how correlated variables are
### YOUR CODE HERE ###
plt.figure(figsize=(8, 6))
sns.heatmap(data_upsampled.corr(numeric_only=True), annot=True, cmap="crest")
plt.title("Heatmap of the dataset")
plt.show()

One of the model assumptions for logistic regression is no severe multicollinearity among the features. Take this into consideration as you examine the heatmap and choose which features to proceed with.
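
Reading a heatmap by eye gets harder as the number of features grows. As a rough sketch, the hypothetical helper `correlated_pairs` below lists feature pairs whose absolute Pearson correlation exceeds a cutoff. The 0.8 threshold and the toy data are assumptions for illustration, not values from this project.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical toy data: `likes` is built to track `views` closely
rng = np.random.default_rng(0)
views = rng.integers(0, 1000, size=200)
toy = pd.DataFrame({
    "views": views,
    "likes": views * 0.5 + rng.normal(0, 5, size=200),
    "duration": rng.integers(5, 60, size=200),
})
high_pairs = correlated_pairs(toy)
print(high_pairs)
```

Calling `correlated_pairs(data_upsampled)` would give a shortlist of candidate features to drop before modeling.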

**Question:** What variables are shown to be correlated in the heatmap?

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

After analysis and deriving variables with close relationships, it is time to begin constructing the model. Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### **Task 3a. Select variables**

Set your Y and X variables.

Select the outcome variable and the features.

In [None]:
# Select outcome variable
### YOUR CODE HERE ###
y = data_upsampled["verified_status"]

# Select features: drop the outcome, plus `video_like_count` to reduce multicollinearity
X = data_upsampled.drop(columns=['verified_status', 'video_like_count'])

### **Task 3b. Train-test split**

Split the data into training and testing sets.

In [None]:
# Split the data into training and testing sets
### YOUR CODE HERE ###
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Confirm that the dimensions of the training and testing sets are in alignment.

In [None]:
# Get shape of each training and testing set
### YOUR CODE HERE ###
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### **Task 3c. Encode variables**

Check the data types of the features.

In [None]:
# Check data types
### YOUR CODE HERE ###
X_train.dtypes

In [None]:
# Get unique values in `claim_status`
### YOUR CODE HERE ###
X_train["claim_status"].unique()


In [None]:
# Get unique values in `author_ban_status`
### YOUR CODE HERE ###
X_train["author_ban_status"].unique()

As shown above, the `claim_status` and `author_ban_status` features are each of data type `object` currently. In order to work with the implementations of models through `sklearn`, these categorical features will need to be made numeric. One way to do this is through one-hot encoding.


In [None]:
# One-hot encode the categorical training features
encoded_x_train = pd.get_dummies(X_train, columns=['claim_status', 'author_ban_status'])

# Display first few rows
encoded_x_train.head()

In [None]:
# Drop the raw transcription text, which is not numeric and cannot be passed to the model
encoded_x_train.drop(columns=['video_transcription_text'], inplace=True)

# Display first few rows
encoded_x_train.head()

Check the data type of the outcome variable.

In [None]:
# Get unique values of outcome variable
### YOUR CODE HERE ###
y_train.unique()


As shown above, the outcome variable is of data type `object` currently. Label encoding can be used to make this variable numeric.

Encode the categorical values of the outcome variable in the training set using an appropriate method.

In [None]:
y_train.head()

In [None]:


# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit LabelEncoder on the target Series and transform categorical values to numerical values
encoded_y_train = label_encoder.fit_transform(y_train)

encoded_y_train
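
It helps to confirm which integer `LabelEncoder` assigns to each class, because later steps (such as labeling a classification report) depend on that order. The sketch below uses hypothetical toy labels with the notebook's two class names. `LabelEncoder` stores its classes in sorted order, so "not verified" maps to 0 and "verified" to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels using the notebook's two class names
toy_y_train = pd.Series(["verified", "not verified", "verified", "not verified"])
toy_y_test = pd.Series(["not verified", "verified"])

encoder = LabelEncoder()
toy_y_train_encoded = encoder.fit_transform(toy_y_train)

# classes_ is sorted, so index 0 is "not verified" and index 1 is "verified"
print(list(encoder.classes_))

# Reuse the fitted encoder on the test labels so the mapping stays identical
toy_y_test_encoded = encoder.transform(toy_y_test)
print(toy_y_test_encoded)
```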

### **Task 3d. Model building**

Construct a model and fit it to the training set.

In [None]:
# Construct a logistic regression model and fit it to the training set
### YOUR CODE HERE ###
log_clf = LogisticRegression(random_state=0, max_iter=800).fit(encoded_x_train, encoded_y_train)


<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4a. Results and evaluation**

Evaluate your model.

Encode categorical features in the testing set using an appropriate method.

In [None]:
# One-hot encode the categorical testing features the same way as the training features
encoded_x_test = pd.get_dummies(X_test, columns=['claim_status', 'author_ban_status'])
encoded_x_test.drop(columns=['video_transcription_text'], inplace=True)


# Display first few rows
encoded_x_test.head()

Test the logistic regression model. Use the model to make predictions on the encoded testing set.

In [None]:
# Use the logistic regression model to get predictions on the encoded testing set
### YOUR CODE HERE ###
y_pred = log_clf.predict(encoded_x_test)
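
`predict` returns hard class labels, but a logistic regression model also exposes the estimated probabilities behind them through `predict_proba`. The sketch below fits a hypothetical one-feature toy model (not the notebook's `log_clf`) and checks that, for a binary model, thresholding the class-1 probability at 0.5 reproduces `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature toy data
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1
proba_class1 = toy_clf.predict_proba(X_toy)[:, 1]

# For a binary model, predict() matches thresholding that probability at 0.5
manual_pred = (proba_class1 > 0.5).astype(int)
print(np.array_equal(manual_pred, toy_clf.predict(X_toy)))
```

Inspecting probabilities can be useful if a threshold other than 0.5 would better balance precision and recall.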

Display the predictions on the encoded testing set.

In [None]:
# Display the predictions on the encoded testing set
### YOUR CODE HERE ###
y_pred

Display the true labels of the testing set.

In [None]:
# Display the true labels of the testing set
### YOUR CODE HERE ###
y_test

Encode the true labels of the testing set so it can be compared to the predictions.

In [None]:


# Reuse the LabelEncoder fitted on the training labels so both sets share the same mapping
encoded_y_test = label_encoder.transform(y_test)

encoded_y_test

Confirm again that the dimensions of the training and testing sets are in alignment since additional features were added.

In [None]:
# Get shape of each training and testing set
### YOUR CODE HERE ###
encoded_x_train.shape, encoded_y_train.shape, encoded_x_test.shape, encoded_y_test.shape

### **Task 4b. Visualize model results**

Create a confusion matrix to visualize the results of the logistic regression model.

In [None]:
# Compute values for confusion matrix
log_cm = confusion_matrix(encoded_y_test, y_pred, labels=log_clf.classes_)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm, display_labels=log_clf.classes_)

# Plot confusion matrix
log_disp.plot()

# Display plot
plt.show()
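
To connect the plotted matrix to the metrics in the classification report, the four cells can be unpacked and the metrics computed by hand. The sketch below uses hypothetical toy labels, assuming 0 stands for "not verified" and 1 for "verified".

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = "not verified", 1 = "verified"
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3), accuracy)
```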

Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the logistic regression model.

In [None]:
# Create classification report for logistic regression model
# LabelEncoder sorts classes alphabetically, so 0 = "not verified" and 1 = "verified"
target_labels = ["not verified", "verified"]
print(classification_report(encoded_y_test, y_pred, target_names=target_labels))

### **Task 4c. Interpret model coefficients**

In [None]:
# Get the feature names from the model and the model coefficients
# (each coefficient is the change in log-odds per one-unit increase in that feature)
# Place into a DataFrame for readability
pd.DataFrame(data={"Feature Name":log_clf.feature_names_in_, "Model Coefficient":log_clf.coef_[0]})

### **Task 4d. Conclusion**

1. What are the key takeaways from this project?

2. What results can be presented from this project?

Key takeaways:

- The dataset has a few strongly correlated variables, which could violate the "no severe multicollinearity" assumption of logistic regression. We therefore excluded `video_like_count` from model building.
- Based on the logistic regression model, each additional second of video duration is associated with a 0.009 increase in the log-odds of the user having verified status.
- The model's predictive power was acceptable but not strong: a precision of 61% is less than ideal, while a recall of 84% is very good. Overall accuracy is toward the lower end of what would typically be considered acceptable.
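
A log-odds coefficient is easier to communicate as an odds ratio, obtained by exponentiating it. The sketch below applies this to the 0.009 video-duration coefficient quoted above. The variable names are illustrative, and the interpretation assumes the other features are held constant.

```python
import numpy as np

# 0.009 is the video-duration coefficient quoted in the takeaways above
coef_duration = 0.009

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratio_per_second = np.exp(coef_duration)

# Over 10 extra seconds the log-odds changes add up, so the odds ratios multiply
odds_ratio_per_10_seconds = np.exp(coef_duration * 10)

print(round(odds_ratio_per_second, 4), round(odds_ratio_per_10_seconds, 4))
```

In other words, each additional second corresponds to roughly a 0.9% increase in the odds of verified status, and ten additional seconds to roughly a 9% increase.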


We developed a logistic regression model for verified status based on video features. The model had decent predictive power. Based on the estimated model coefficients from the logistic regression, longer videos tend to be associated with higher odds of the user being verified. Other video features have small estimated coefficients in the model, so their association with verified status seems to be minor.
