## Data Preparation for Instagram Dataset
**Objective**

The goal of this notebook is to clean and preprocess the data so that it is ready for machine learning modeling. We will go through each module's steps systematically, demonstrating how to collect, clean, explore, and prepare the data for deployment.

### Introduction to Data Preparation

**What is Data Preparation?**
Data preparation involves cleaning and transforming the data and, if needed, reducing its dimensionality before modeling. It ensures that the dataset is suitable for building machine learning models.

**Why is Data Preparation Important?**
Properly prepared data generally improves the performance of machine learning models and reduces the risk of inaccurate or biased predictions.

### Dataset Overview

The dataset contains information about various Instagram posts, including:
- **id**: Unique identifier for each post.
- **videoViewCount**: Number of views if the post is a video.
- **commentsCount**: Number of comments on the post.
- **timestamp**: Date and time when the post was published.
- **url**: URL of the Instagram post.
- **ownerId**: Unique identifier of the post's owner.
- **productType**: Category of the content (e.g., 'cloth').
- **type**: Type of the post (e.g., 'Image', 'Video').
- **videoDuration**: Duration of the video (if applicable).
- **likesCount**: Number of likes on the post.
- **videoPlayCount**: Number of times the video was played.
- **ownerUsername**: Username of the post owner.
- **ownerFullName**: Full name of the post owner.

### Data Cleaning

Here we import and clean the dataset by handling missing values, fixing inconsistencies, and removing duplicates.

In [1]:
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('../data/raw/instagram_posts.csv')

In [2]:
# Display the first 5 rows
data.head()

Unnamed: 0,id,videoViewCount,commentsCount,timestamp,url,ownerId,productType,type,videoDuration,likesCount,videoPlayCount,ownerUsername,ownerFullName
0,3.37773e+18,0,2,2024-05-28T06:13:04.000Z,https://www.instagram.com/p/C7gG02Io8z5/,48506231918,cloth,Image,0.0,5,0,aimee_apparel_,CHIAMAKA || LAGOS CLOTH VENDOR || ANAMBRA CLOT...
1,3.378452e+18,255,15,2024-05-29T06:08:32.000Z,https://www.instagram.com/p/C7irDFfogZg/,48506231918,cloth,Video,20.166,37,541,aimee_apparel_,CHIAMAKA || LAGOS CLOTH VENDOR || ANAMBRA CLOT...
2,3.377729e+18,632,51,2024-05-28T06:11:10.000Z,https://www.instagram.com/p/C7gGksNIQ_d/,48506231918,cloth,Video,15.138,112,1665,aimee_apparel_,CHIAMAKA || LAGOS CLOTH VENDOR || ANAMBRA CLOT...
3,3.376985e+18,643,81,2024-05-27T06:00:06.000Z,https://www.instagram.com/p/C7ddlvgoxK1/,4885237291,cloth,Video,16.902,143,1657,veecki._,VEE 👑 || Creator 📸
4,3.377065e+18,966,151,2024-05-27T08:16:41.000Z,https://www.instagram.com/p/C7dvnWHIve9/,48506231918,cloth,Video,43.675,197,2215,aimee_apparel_,CHIAMAKA || LAGOS CLOTH VENDOR || ANAMBRA CLOT...
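
The id values above print in scientific notation (for example 3.37773e+18), which suggests pandas parsed the column as float64. At this magnitude float64 cannot represent every integer, so distinct ids can collapse to the same value. A minimal sketch of one way around this, assuming the raw CSV stores the full integer ids; the two sample rows and their ids are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the raw CSV; the ids are made up.
sample_csv = io.StringIO(
    "id,likesCount\n"
    "3377730000000000001,5\n"
    "3377730000000000002,37\n"
)

# dtype={"id": str} keeps every digit of the identifier.
posts = pd.read_csv(sample_csv, dtype={"id": str})

# Converting to float shows what is lost when ids are parsed as float64.
ids_as_float = posts["id"].astype(float)

print(posts["id"].nunique())   # distinct ids when kept as strings
print(ids_as_float.nunique())  # distinct ids after float conversion
```

If the ids are only ever used as identifiers, reading them as strings is usually the safer choice, since no arithmetic is performed on them.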


#### Handling Missing Values

In [3]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

id                0
videoViewCount    0
commentsCount     0
timestamp         0
url               0
ownerId           0
productType       0
type              0
videoDuration     0
likesCount        0
videoPlayCount    0
ownerUsername     0
ownerFullName     0
dtype: int64


The output shows that the dataset has no missing values.
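
Note that `isnull()` only counts NaN entries. The `y` output later in this notebook contains a likesCount of -1, which may be a placeholder rather than a real count (that interpretation is an assumption, not something the dataset documents). A short sketch of a complementary check for negative values in count columns, using made-up rows:

```python
import pandas as pd

# Made-up rows; the -1 mimics the value visible in the y output of this notebook.
posts = pd.DataFrame({
    "likesCount": [5, 37, -1, 8],
    "commentsCount": [2, 15, 3, 20],
    "videoViewCount": [0, 255, 1372, 0],
})

# Counts cannot be negative, so any negative entry deserves a closer look.
count_columns = ["likesCount", "commentsCount", "videoViewCount"]
negative_counts = (posts[count_columns] < 0).sum()
print(negative_counts)
```

Whether such rows should be dropped, clipped to zero, or kept depends on what the value means in the source data.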

#### Handling Duplicates

In [4]:
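# Remove duplicate rows, if any. The .copy() makes data_cleaned an independent
# DataFrame, so later column assignments do not raise SettingWithCopyWarning.
data_cleaned = data.drop_duplicates().copy()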
print(f'Number of duplicates removed: {data.shape[0] - data_cleaned.shape[0]}')

Number of duplicates removed: 1


One duplicate record was removed so that repeated posts do not bias the model.

#### Handling Outliers

Outliers were already detected in the previous step, so here we decide whether to remove or treat them. Outliers in likesCount may represent highly popular posts, which are genuine observations rather than errors, so removing them is not always appropriate. We therefore keep them and only count them.

In [5]:
# Count outliers in likesCount with the 1.5 * IQR rule (they are counted here, not removed)
Q1 = data['likesCount'].quantile(0.25)
Q3 = data['likesCount'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['likesCount'] < (Q1 - 1.5 * IQR)) | (data['likesCount'] > (Q3 + 1.5 * IQR))]
print(f"Number of outliers in likesCount: {outliers.shape[0]}")

Number of outliers in likesCount: 308
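
The IQR rule used above can be wrapped in a small helper, and a log transform is one common way to keep popular posts while reducing their influence. The sketch below is illustrative only: `iqr_outlier_mask` is a hypothetical helper, the like counts are made up, and the notebook itself does not apply a log transform.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up like counts with one very popular post.
likes = pd.Series([5, 37, 112, 143, 197, 50_000], name="likesCount")

mask = iqr_outlier_mask(likes)
print(f"Number of outliers: {mask.sum()}")

# log1p compresses the long right tail instead of discarding it.
# It is only finite for values greater than -1, so negative counts
# would need to be handled before applying it.
likes_log = np.log1p(likes)
print(likes_log.round(2).tolist())
```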


### Data Transformation

Here we transform the data into a form suitable for machine learning models. This includes encoding categorical variables, scaling, and feature engineering.

#### Encoding Categorical Variables

The dataset has categorical features such as productType and type. These need to be encoded as numerical values.

In [6]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder_product = LabelEncoder()
label_encoder_type = LabelEncoder()

# Fit LabelEncoder to 'productType' and 'type' columns
data_cleaned['productType'] = label_encoder_product.fit_transform(data_cleaned['productType'])
data_cleaned['type'] = label_encoder_type.fit_transform(data_cleaned['type'])

# Print the classes and their corresponding codes
print("ProductType classes and codes:", dict(zip(label_encoder_product.classes_, label_encoder_product.transform(label_encoder_product.classes_))))
print("Type classes and codes:", dict(zip(label_encoder_type.classes_, label_encoder_type.transform(label_encoder_type.classes_))))

ProductType classes and codes: {'Beauty products': 0, 'Content creation/Tourism': 1, 'Entertainment': 2, 'Housing': 3, 'Logistics': 4, 'blog': 5, 'cloth': 6, 'food': 7, 'food/cakes': 8}
Type classes and codes: {'Image': 0, 'Video': 1}
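
Label encoding assigns integer codes in alphabetical order, so a model may read an ordering into productType (for example 'food' > 'cloth') that does not exist. One-hot encoding is a common alternative for nominal categories. A minimal sketch with made-up rows that reuse category names from the output above; this is not what the notebook does in the following cells:

```python
import pandas as pd

# Made-up rows reusing category names printed by the label encoders.
posts = pd.DataFrame({
    "productType": ["cloth", "food", "Housing", "cloth"],
    "type": ["Image", "Video", "Video", "Image"],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(posts, columns=["productType", "type"], dtype=int)
print(one_hot.columns.tolist())
print(one_hot)
```

The trade-off is a wider feature matrix: nine product types would become nine columns instead of one.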



In [7]:
print(data_cleaned)

                id  videoViewCount  commentsCount                 timestamp  \
0     3.377730e+18               0              2  2024-05-28T06:13:04.000Z   
1     3.378452e+18             255             15  2024-05-29T06:08:32.000Z   
2     3.377729e+18             632             51  2024-05-28T06:11:10.000Z   
3     3.376985e+18             643             81  2024-05-27T06:00:06.000Z   
4     3.377065e+18             966            151  2024-05-27T08:16:41.000Z   
...            ...             ...            ...                       ...   
2084  3.354629e+18            1884              4  2024-04-26T09:16:59.000Z   
2085  3.353973e+18               0              0  2024-04-25T11:32:35.000Z   
2086  3.298135e+18            1372              3  2024-02-08T10:36:24.000Z   
2087  3.366961e+18               0             20  2024-05-13T09:37:33.000Z   
2088  3.313981e+18            1136              1  2024-03-01T07:17:24.000Z   

                                           url     

#### Feature Engineering

Here we want to create new features that could improve model performance, such as the hour of day or day of the week when posts were made.

In [8]:
# Extract additional features from timestamp
data_cleaned['timestamp'] = pd.to_datetime(data_cleaned['timestamp'])
data_cleaned['hour'] = data_cleaned['timestamp'].dt.hour
data_cleaned['day_of_week'] = data_cleaned['timestamp'].dt.dayofweek
print(data_cleaned)

                id  videoViewCount  commentsCount                 timestamp  \
0     3.377730e+18               0              2 2024-05-28 06:13:04+00:00   
1     3.378452e+18             255             15 2024-05-29 06:08:32+00:00   
2     3.377729e+18             632             51 2024-05-28 06:11:10+00:00   
3     3.376985e+18             643             81 2024-05-27 06:00:06+00:00   
4     3.377065e+18             966            151 2024-05-27 08:16:41+00:00   
...            ...             ...            ...                       ...   
2084  3.354629e+18            1884              4 2024-04-26 09:16:59+00:00   
2085  3.353973e+18               0              0 2024-04-25 11:32:35+00:00   
2086  3.298135e+18            1372              3 2024-02-08 10:36:24+00:00   
2087  3.366961e+18               0             20 2024-05-13 09:37:33+00:00   
2088  3.313981e+18            1136              1 2024-03-01 07:17:24+00:00   

                                           url     


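Two new features, hour and day_of_week, are created to capture potential differences in engagement across hours and days. Because the timestamps are in UTC, hour is the UTC hour rather than the poster's local time, and day_of_week runs from 0 (Monday) to 6 (Sunday).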
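
As plain integers, hour 23 and hour 0 look far apart even though they are adjacent on the clock. A cyclical sine/cosine encoding is one option for models that are sensitive to this. The sketch below is an optional alternative, not part of the notebook's pipeline; two of the three timestamps are made up.

```python
import numpy as np
import pandas as pd

timestamps = pd.to_datetime(pd.Series([
    "2024-05-28T06:13:04.000Z",  # first row of the dataset
    "2024-05-27T23:00:00.000Z",  # made-up late-evening post
    "2024-05-27T00:00:00.000Z",  # made-up midnight post
]))
hour = timestamps.dt.hour

# Map each hour onto a circle with a 24-hour period.
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

# Distance on the circle: 23h vs 0h is small, 6h vs 0h is large.
d_wrap = np.hypot(hour_sin[1] - hour_sin[2], hour_cos[1] - hour_cos[2])
d_far = np.hypot(hour_sin[0] - hour_sin[2], hour_cos[0] - hour_cos[2])
print(round(d_wrap, 3), round(d_far, 3))
```

The same idea applies to day_of_week with a period of 7.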

### Feature Selection

Here we select the most relevant features for building machine learning models.

#### Correlation-Based Feature Selection

We already calculated the correlation matrix in the data understanding phase. Based on that, we drop features that are irrelevant, highly correlated with each other, or only weakly related to the target variable.

In [9]:
# Drop identifier and name columns, plus the raw timestamp (already expanded into hour and day_of_week)
data_cleaned.drop(columns=['id', 'url', 'ownerId', 'ownerUsername', 'ownerFullName', 'timestamp'], inplace=True)



In [10]:
# Select relevant features
features = ['commentsCount', 'productType', 'type', 'videoDuration', 'videoViewCount', 'hour', 'day_of_week']
target = 'likesCount'

X = data_cleaned[features]
y = data_cleaned[target]
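
The correlation-based screening described above can be sketched as follows. `highly_correlated` is a hypothetical helper, the 0.9 threshold is an arbitrary illustrative choice, and the numbers are made up; the notebook's actual selection relied on the correlation matrix from the data understanding phase.

```python
import numpy as np
import pandas as pd


def highly_correlated(frame: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """List columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Made-up numbers in which play counts closely track view counts.
demo = pd.DataFrame({
    "videoViewCount": [0, 100, 200, 300, 400],
    "videoPlayCount": [0, 210, 390, 610, 800],
    "commentsCount": [5, 1, 4, 2, 3],
})

to_drop = highly_correlated(demo)
print(to_drop)
```

Of each highly correlated pair, the helper flags the column that appears later, so column order determines which one is kept.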

In [11]:
X

Unnamed: 0,commentsCount,productType,type,videoDuration,videoViewCount,hour,day_of_week
0,2,6,0,0.000,0,6,1
1,15,6,1,20.166,255,6,2
2,51,6,1,15.138,632,6,1
3,81,6,1,16.902,643,6,0
4,151,6,1,43.675,966,8,0
...,...,...,...,...,...,...,...
2084,4,2,1,22.434,1884,9,4
2085,0,2,0,0.000,0,11,3
2086,3,8,1,10.148,1372,10,3
2087,20,2,0,0.000,0,9,0


In [12]:
y

0         5
1        37
2       112
3       143
4       197
       ... 
2084     44
2085      7
2086     -1
2087      8
2088     18
Name: likesCount, Length: 2089, dtype: int64

#### Scaling Numerical Features

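Many machine learning models, particularly distance-based and gradient-based ones, perform better when numerical features are on a similar scale. We use standard scaling here, which rescales each feature to zero mean and unit variance.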

In [13]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [14]:
X_scaled

array([[-0.15840666,  1.50473846, -0.91150981, ..., -0.11633374,
        -1.54294566, -0.90092747],
       [-0.13581564,  1.50473846,  1.0970809 , ..., -0.11382056,
        -1.54294566, -0.34997973],
       [-0.0732559 ,  1.50473846,  1.0970809 , ..., -0.11010499,
        -1.54294566, -0.90092747],
       ...,
       [-0.15666889,  2.44264646,  1.0970809 , ..., -0.10281183,
        -0.58650249,  0.20096801],
       [-0.12712679, -0.37107753, -0.91150981, ..., -0.11633374,
        -0.82561328, -1.45187521],
       [-0.16014443, -0.37107753,  1.0970809 , ..., -0.10513776,
        -1.30383487,  0.75191575]])
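
For reference, StandardScaler computes z = (x - mean) / std for each column, using the population standard deviation. The sketch below checks this against a manual calculation on a small made-up matrix (`X_demo` and `scaler_demo` are illustrative names, separate from the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: two columns on different scales.
X_demo = np.array([[2.0, 0.0], [15.0, 20.0], [51.0, 15.0], [81.0, 17.0]])

scaler_demo = StandardScaler().fit(X_demo)
scaled = scaler_demo.transform(X_demo)

# Manual standardization; NumPy's default std (ddof=0) matches StandardScaler.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```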

### Data Splitting

Here we split the data into training and testing sets to evaluate model performance.

In [15]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
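
In the cells above, the scaler is fitted on all rows before the split, so statistics from the test rows influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. A minimal sketch with synthetic data (all names ending in `_demo`, `_tr`, or `_te` are illustrative and not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.integers(0, 200, size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# ... then learn the scaling parameters from the training rows only
# and reuse them unchanged on the test rows.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set behaves like unseen data at evaluation time, which is also how the saved scaler will be used during inference.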

### Data Preparation for Deployment

Since we plan to use Streamlit for deployment after modeling, it is important to save the key elements of the data preparation process that will be needed during model inference.

#### Preprocessed Data

We save the preprocessed data so that further analysis or testing can reuse it without re-running the preprocessing steps each time.

In [16]:
# Convert the arrays back to DataFrames
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)
y_train = pd.Series(y_train, name='likesCount')
y_test = pd.Series(y_test, name='likesCount')

In [17]:
# Save preprocessed data
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

#### Scaler for Standardization

Since our model relies on standardized data (i.e., data scaled with StandardScaler), we need to save the fitted scaler. We will need it to transform any new data (from users) in the same way before making predictions with the model.

In [18]:
import joblib

# Save the scaler to disk
joblib.dump(scaler, '../models/scaler.pkl')

['scaler.pkl']

#### Feature Metadata

Here we save any metadata about the features, such as:
- The names of the features used in the model.
- Any categorical encodings we performed, such as the fitted LabelEncoder objects.

In [19]:
# Save the feature names
import json

with open('../models/feature_names.json', 'w') as f:
    json.dump(X_train.columns.tolist(), f)

In [20]:
# Save the label encoders for both 'productType' and 'type'
joblib.dump(label_encoder_product, '../models/label_encoder_product.pkl')
joblib.dump(label_encoder_type, '../models/label_encoder_type.pkl')

['label_encoder_type.pkl']
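
At inference time, the saved artifacts would be loaded and applied in the same order as during preparation. The sketch below shows one possible shape for that step. To stay runnable on its own, it fits stand-in encoders and a scaler on made-up rows and writes them to a temporary directory; `preprocess_post` is a hypothetical helper, and only the file names mirror the ones used above.

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

feature_names = ["commentsCount", "productType", "type", "videoDuration",
                 "videoViewCount", "hour", "day_of_week"]

# Stand-in artifacts fitted on made-up rows so the sketch runs without the real files.
train = pd.DataFrame({
    "commentsCount": [2, 15, 51], "productType": ["cloth", "food", "cloth"],
    "type": ["Image", "Video", "Video"], "videoDuration": [0.0, 20.166, 15.138],
    "videoViewCount": [0, 255, 632], "hour": [6, 6, 8], "day_of_week": [1, 2, 0],
})
enc_product = LabelEncoder().fit(train["productType"])
enc_type = LabelEncoder().fit(train["type"])
train["productType"] = enc_product.transform(train["productType"])
train["type"] = enc_type.transform(train["type"])
scaler_demo = StandardScaler().fit(train[feature_names])

model_dir = Path(tempfile.mkdtemp())  # temporary stand-in for ../models
joblib.dump(scaler_demo, model_dir / "scaler.pkl")
joblib.dump(enc_product, model_dir / "label_encoder_product.pkl")
joblib.dump(enc_type, model_dir / "label_encoder_type.pkl")
(model_dir / "feature_names.json").write_text(json.dumps(feature_names))


def preprocess_post(post: dict, model_dir: Path) -> pd.DataFrame:
    """Apply the saved encoders and scaler to one raw post (hypothetical helper)."""
    names = json.loads((model_dir / "feature_names.json").read_text())
    row = pd.DataFrame([post])
    row["productType"] = joblib.load(model_dir / "label_encoder_product.pkl").transform(row["productType"])
    row["type"] = joblib.load(model_dir / "label_encoder_type.pkl").transform(row["type"])
    # Select columns in the saved order before scaling.
    scaled = joblib.load(model_dir / "scaler.pkl").transform(row[names])
    return pd.DataFrame(scaled, columns=names)


new_post = {"commentsCount": 10, "productType": "food", "type": "Video",
            "videoDuration": 12.5, "videoViewCount": 300, "hour": 7, "day_of_week": 3}
prepared = preprocess_post(new_post, model_dir)
print(prepared)
```

Note that `LabelEncoder.transform` raises a `ValueError` for a category it did not see during fitting, so a deployed app would need to validate or restrict user input for productType and type.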