# Data Exploration: Telegram Medical Business Channels

This notebook explores the raw data scraped from Ethiopian medical business Telegram channels.

## Objectives

1. Load and examine raw message data from the data lake
2. Analyze message patterns and distributions
3. Explore channel-specific characteristics
4. Visualize key metrics (views, forwards, media usage)
5. Identify data quality issues and patterns

## Prerequisites

- Data lake populated with scraped messages (`data/raw/telegram_messages/`)
- Database loaded with raw data
- Required Python packages installed

## 1. Setup and Imports

Import necessary libraries and modules from the project.

In [21]:
import sys
import os
from pathlib import Path

# Add project root to path (assumes the notebook is run from a directory one level below the project root)
project_root = Path.cwd().parent
sys.path.append(str(project_root))

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import json
from datetime import datetime
from collections import Counter

# Import project modules
from src.database.db_connector import DatabaseConnector
from src.database.data_loader import DataLoader

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("✅ Imports successful")

✅ Imports successful


## 2. Load Data from Data Lake

Load raw JSON files from the data lake to understand the data structure.

In [22]:
# Path to data lake
data_lake_path = project_root / "data" / "raw" / "telegram_messages"

# Collect all JSON files
json_files = list(data_lake_path.rglob("*.json"))

print(f"Found {len(json_files)} JSON files in data lake")
print(f"\nSample files:")
for f in json_files[:5]:
    print(f"  {f.relative_to(project_root)}")

Found 4 JSON files in data lake

Sample files:
  data/raw/telegram_messages/2026-01-18/CheMed123.json
  data/raw/telegram_messages/2026-01-18/_manifest.json
  data/raw/telegram_messages/2026-01-18/lobelia4cosmetics.json
  data/raw/telegram_messages/2026-01-18/tikvahpharma.json


In [23]:
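# Load all messages from JSON files.
# Note: the glob above also matches _manifest.json, so its metadata record
# is appended as one extra, non-message row.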
all_messages = []

for json_file in json_files:
    try:
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            if isinstance(data, list):
                all_messages.extend(data)
            else:
                all_messages.append(data)
    except Exception as e:
        print(f"Error loading {json_file}: {e}")

print(f"\n✅ Loaded {len(all_messages)} messages from data lake")


✅ Loaded 277 messages from data lake

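The count of 277 is one more than the 276 rows that carry a `message_id`, which is consistent with `_manifest.json` having been loaded as if it were a message. A minimal sketch of a loader that skips such files is shown here. It assumes that any JSON file whose name starts with `_` holds scrape metadata rather than messages; the function name `load_messages` is hypothetical and not part of the project.

```python
import json
from pathlib import Path


def load_messages(data_lake_path: Path) -> list[dict]:
    """Load message records from a data lake, skipping sidecar files.

    Assumption: any JSON file whose name starts with "_" (for example
    _manifest.json) holds scrape metadata rather than messages.
    """
    messages = []
    for json_file in sorted(data_lake_path.rglob("*.json")):
        if json_file.name.startswith("_"):
            continue  # metadata file, not message data
        with open(json_file, "r", encoding="utf-8") as f:
            data = json.load(f)
        # A channel file may hold a list of messages or a single record
        messages.extend(data if isinstance(data, list) else [data])
    return messages
```

Calling `load_messages(data_lake_path)` in place of the loop above would leave the manifest out of `all_messages`, so the downstream row counts would need to be re-checked.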

## 3. Create DataFrame and Initial Exploration

Convert raw data to pandas DataFrame for analysis.

In [24]:
# Create DataFrame
df = pd.DataFrame(all_messages)

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

Dataset shape: (277, 12)

Columns: ['message_id', 'channel_name', 'channel_title', 'message_date', 'message_text', 'has_media', 'views', 'forwards', 'date', 'timestamp', 'channels', 'total_messages']

First few rows:


| | message_id | channel_name | channel_title | message_date | message_text | has_media | views | forwards | date | timestamp | channels | total_messages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97.0 | CheMed123 | CheMed | 2023-02-10T12:23:06+00:00 | ⚠️Notice!\nDear esteemed customers,\nDue to four-day motorbike movement restrictions, we have li... | True | 1307.0 | 1.0 | NaN | NaN | NaN | NaN |
| 1 | 96.0 | CheMed123 | CheMed | 2023-02-02T08:58:52+00:00 | Mela-One በውስጡ ሆርሞን ያለው ድንገተኛ ወሊድ መቆጣጠርያ ሲሆን ያለመከላከያ የተደረገ የግብረስጋ ግንኙነት ሲኖር በ72 ሰዓታት ወስጥ መወሰድ ይኖር... | True | 1174.0 | 3.0 | NaN | NaN | NaN | NaN |
| 2 | 95.0 | CheMed123 | CheMed | 2023-02-01T08:59:37+00:00 | አዚትሮማይሲን በሃኪም መድሃኒት ማዘዣ ከሚታዘዙ አንቲባዮቲኮች አንዱ ሲሆን በርከት ያሉ ባክቴርያዎችን ይገላል።\n\nበቀን አንዴ ለ3 ቀናት ምግብ ከመብላ... | True | 1058.0 | 4.0 | NaN | NaN | NaN | NaN |
| 3 | 94.0 | CheMed123 | CheMed | 2023-01-31T09:19:53+00:00 | Che-Med Trivia #3\n\nምግብና መጠጦች አንዳንድ መድሃኒቶች በደንብ እንዳይሰሩ ሊያደርጉ ይችላሉ። በዚህ ሁኔታ እነዚህን መድሃኒቶች ምግብ ከወሰ... | True | 819.0 | 1.0 | NaN | NaN | NaN | NaN |
| 4 | 93.0 | CheMed123 | CheMed | 2023-01-30T09:45:25+00:00 | Che-Med Trivia #2\n\nእንደ Ciprofloxacin, Doxycycline, Levothyroxine, Iron supplement ያሉ መድሃኒቶችን ከ... | True | 710.0 | 2.0 | NaN | NaN | NaN | NaN |


In [25]:
# Data types and missing values
print("Data Types and Missing Values:")
print("="*50)
info_df = pd.DataFrame({
    'Column': df.columns,
    'Type': df.dtypes,
    'Non-Null Count': df.count(),
    'Null Count': df.isnull().sum(),
    'Null %': (df.isnull().sum() / len(df) * 100).round(2)
})
info_df

Data Types and Missing Values:


| Column | Type | Non-Null Count | Null Count | Null % |
|---|---|---|---|---|
| message_id | float64 | 276 | 1 | 0.36 |
| channel_name | object | 276 | 1 | 0.36 |
| channel_title | object | 276 | 1 | 0.36 |
| message_date | object | 276 | 1 | 0.36 |
| message_text | object | 276 | 1 | 0.36 |
| has_media | object | 276 | 1 | 0.36 |
| views | float64 | 276 | 1 | 0.36 |
| forwards | float64 | 276 | 1 | 0.36 |
| date | object | 1 | 276 | 99.64 |
| timestamp | object | 1 | 276 | 99.64 |

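The `float64` type of `message_id`, `views`, and `forwards` is a side effect of the single row with missing values, since NumPy integer columns cannot hold NaN. The sketch below shows one way to restore integer and boolean types once that row is dropped. It runs on a small made-up frame rather than on `df`, and the choice to drop rows without a `message_id` is an assumption about what the pipeline should do.

```python
import pandas as pd

# Toy frame mimicking the raw columns; the last row stands in for
# the non-message record (all values here are illustrative)
raw = pd.DataFrame({
    "message_id": [97.0, 96.0, None],
    "has_media": [True, False, None],
    "views": [1307.0, 1174.0, None],
})

# Keep only rows that carry a message_id
messages = raw.dropna(subset=["message_id"]).copy()

# With the gaps gone, the float columns can become plain integers
messages["message_id"] = messages["message_id"].astype("int64")
messages["views"] = messages["views"].astype("int64")
messages["has_media"] = messages["has_media"].astype(bool)

print(messages.dtypes)
```

If missing values need to be kept instead of dropped, pandas' nullable `Int64` and `boolean` dtypes are an alternative to consider.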

## 4. Data Cleaning and Preparation

Prepare the data for analysis by handling missing values and converting data types.

In [26]:
# Convert date column to datetime
if 'message_date' in df.columns:
    df['message_date'] = pd.to_datetime(df['message_date'])
elif 'date' in df.columns:
    df['message_date'] = pd.to_datetime(df['date'])

# Extract useful date components
df['date_only'] = df['message_date'].dt.date
df['hour'] = df['message_date'].dt.hour
df['day_of_week'] = df['message_date'].dt.day_name()
df['month'] = df['message_date'].dt.month_name()

# Calculate message length
if 'message_text' in df.columns:
    df['message_length'] = df['message_text'].fillna('').str.len()
elif 'text' in df.columns:
    df['message_length'] = df['text'].fillna('').str.len()

# Create has_image flag (missing has_media values count as False)
if 'has_media' in df.columns:
    df['has_image'] = df['has_media'].eq(True)
elif 'image_path' in df.columns:
    df['has_image'] = df['image_path'].notna()

print("✅ Data cleaning completed")
df.head()

✅ Data cleaning completed



| | message_id | channel_name | channel_title | message_date | message_text | has_media | views | forwards | date | timestamp | channels | total_messages | date_only | hour | day_of_week | month | message_length | has_image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97.0 | CheMed123 | CheMed | 2023-02-10 12:23:06+00:00 | ⚠️Notice!\nDear esteemed customers,\nDue to four-day motorbike movement restrictions, we have li... | True | 1307.0 | 1.0 | NaN | NaN | NaN | NaN | 2023-02-10 | 12.0 | Friday | February | 320 | True |
| 1 | 96.0 | CheMed123 | CheMed | 2023-02-02 08:58:52+00:00 | Mela-One በውስጡ ሆርሞን ያለው ድንገተኛ ወሊድ መቆጣጠርያ ሲሆን ያለመከላከያ የተደረገ የግብረስጋ ግንኙነት ሲኖር በ72 ሰዓታት ወስጥ መወሰድ ይኖር... | True | 1174.0 | 3.0 | NaN | NaN | NaN | NaN | 2023-02-02 | 8.0 | Thursday | February | 174 | True |
| 2 | 95.0 | CheMed123 | CheMed | 2023-02-01 08:59:37+00:00 | አዚትሮማይሲን በሃኪም መድሃኒት ማዘዣ ከሚታዘዙ አንቲባዮቲኮች አንዱ ሲሆን በርከት ያሉ ባክቴርያዎችን ይገላል።\n\nበቀን አንዴ ለ3 ቀናት ምግብ ከመብላ... | True | 1058.0 | 4.0 | NaN | NaN | NaN | NaN | 2023-02-01 | 8.0 | Wednesday | February | 218 | True |
| 3 | 94.0 | CheMed123 | CheMed | 2023-01-31 09:19:53+00:00 | Che-Med Trivia #3\n\nምግብና መጠጦች አንዳንድ መድሃኒቶች በደንብ እንዳይሰሩ ሊያደርጉ ይችላሉ። በዚህ ሁኔታ እነዚህን መድሃኒቶች ምግብ ከወሰ... | True | 819.0 | 1.0 | NaN | NaN | NaN | NaN | 2023-01-31 | 9.0 | Tuesday | January | 287 | True |
| 4 | 93.0 | CheMed123 | CheMed | 2023-01-30 09:45:25+00:00 | Che-Med Trivia #2\n\nእንደ Ciprofloxacin, Doxycycline, Levothyroxine, Iron supplement ያሉ መድሃኒቶችን ከ... | True | 710.0 | 2.0 | NaN | NaN | NaN | NaN | 2023-01-30 | 9.0 | Monday | January | 254 | True |


## 5. Channel Statistics

Analyze message distribution and activity across channels.

In [27]:
# Messages per channel
channel_col = 'channel_name' if 'channel_name' in df.columns else 'channel'
channel_stats = df[channel_col].value_counts()

print("Messages per Channel:")
print("="*50)
for channel, count in channel_stats.items():
    print(f"{channel}: {count} messages")

# Visualize channel distribution
fig = px.bar(
    x=channel_stats.index,
    y=channel_stats.values,
    title="Message Distribution by Channel",
    labels={'x': 'Channel', 'y': 'Number of Messages'},
    color=channel_stats.values,
    color_continuous_scale='Blues'
)
fig.update_layout(showlegend=False, height=400)
fig.show()

Messages per Channel:
lobelia4cosmetics: 100 messages
tikvahpharma: 100 messages
CheMed123: 76 messages


## 6. Temporal Analysis

Explore posting patterns over time.

In [28]:
# Messages over time
daily_messages = df.groupby('date_only').size().reset_index(name='count')
daily_messages['date_only'] = pd.to_datetime(daily_messages['date_only'])

fig = px.line(
    daily_messages,
    x='date_only',
    y='count',
    title="Daily Message Volume",
    labels={'date_only': 'Date', 'count': 'Number of Messages'}
)
fig.update_traces(line_color='#1f77b4', line_width=2)
fig.update_layout(height=400)
fig.show()

In [29]:
# Posting patterns by hour of day
hourly_dist = df['hour'].value_counts().sort_index()

fig = px.bar(
    x=hourly_dist.index,
    y=hourly_dist.values,
    title="Message Distribution by Hour of Day",
    labels={'x': 'Hour (24h format)', 'y': 'Number of Messages'},
    color=hourly_dist.values,
    color_continuous_scale='Viridis'
)
fig.update_layout(showlegend=False, height=400)
fig.show()

In [30]:
# Day of week distribution
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_dist = df['day_of_week'].value_counts().reindex(day_order, fill_value=0)

fig = px.bar(
    x=day_dist.index,
    y=day_dist.values,
    title="Message Distribution by Day of Week",
    labels={'x': 'Day of Week', 'y': 'Number of Messages'},
    color=day_dist.values,
    color_continuous_scale='Teal'
)
fig.update_layout(showlegend=False, height=400)
fig.show()

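The `hour` and `day_of_week` values above are derived from timestamps stored with a `+00:00` offset, so they are in UTC. For reading posting patterns in local terms, a conversion to East Africa Time (UTC+3, no daylight saving) may be more useful. The sketch below uses two timestamps from the sample rows shown earlier; the fixed-offset `EAT` object is an illustrative choice that avoids depending on a time zone database.

```python
from datetime import timedelta, timezone

import pandas as pd

# East Africa Time is UTC+3 year-round (no daylight saving time)
EAT = timezone(timedelta(hours=3), name="EAT")

# Two sample timestamps taken from the first rows shown earlier
utc_times = pd.to_datetime(
    pd.Series(["2023-02-10T12:23:06+00:00", "2023-02-02T08:58:52+00:00"])
)

# Convert to local time, then extract the hour as before
local_times = utc_times.dt.tz_convert(EAT)
print(local_times.dt.hour.tolist())  # [15, 11]
```

Applied to `df['message_date']`, the same `tz_convert` call would shift every hourly bucket by three hours and could move late-evening UTC posts into the next local day.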
## 7. Engagement Metrics

Analyze views and forwards to understand content engagement.

In [31]:
# Check for engagement columns
views_col = 'views' if 'views' in df.columns else ('view_count' if 'view_count' in df.columns else None)
forwards_col = 'forwards' if 'forwards' in df.columns else ('forward_count' if 'forward_count' in df.columns else None)

if views_col:
    # Summary statistics for views
    print("View Count Statistics:")
    print("="*50)
    print(df[views_col].describe())
    
    # Distribution of views
    fig = px.histogram(
        df[df[views_col].notna()],
        x=views_col,
        nbins=50,
        title="Distribution of Message Views",
        labels={views_col: 'Views', 'count': 'Number of Messages'},
        color_discrete_sequence=['#636EFA']
    )
    fig.update_layout(height=400)
    fig.show()
else:
    print("⚠️ No views column found in data")

View Count Statistics:
count      276.000000
mean      1529.543478
std       3877.625492
min          0.000000
25%        309.750000
50%        484.500000
75%        766.250000
max      30160.000000
Name: views, dtype: float64


In [32]:
if forwards_col and forwards_col in df.columns:
    # Summary statistics for forwards
    print("Forward Count Statistics:")
    print("="*50)
    print(df[forwards_col].describe())
    
    # Messages with most forwards
    top_forwarded = df.nlargest(10, forwards_col)[[channel_col, forwards_col, 'message_length', 'has_image']]
    print("\nTop 10 Most Forwarded Messages:")
    print(top_forwarded)
else:
    print("⚠️ No forwards column found in data")

Forward Count Statistics:
count    276.000000
mean       3.003623
std        7.268612
min        0.000000
25%        0.000000
50%        1.000000
75%        2.000000
max       54.000000
Name: forwards, dtype: float64

Top 10 Most Forwarded Messages:
     channel_name  forwards  message_length  has_image
240  tikvahpharma      54.0            1880       True
266  tikvahpharma      54.0            1880       True
241  tikvahpharma      41.0            2710       True
267  tikvahpharma      41.0            2710       True
208  tikvahpharma      24.0               0       True
209  tikvahpharma      24.0               0       True
210  tikvahpharma      24.0               0       True
211  tikvahpharma      24.0               0       True
212  tikvahpharma      24.0               0       True
213  tikvahpharma      24.0             431       True

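Several rows in the table above share the same forward count and message length (for example, indices 240 and 266), which may indicate the same content stored under different message IDs. A check keyed on message text, not on IDs, can surface such cases. The sketch below uses made-up rows; the text values and IDs are hypothetical.

```python
import pandas as pd

# Toy data: rows 0 and 2 repeat the same text under different message IDs
sample = pd.DataFrame({
    "channel_name": ["tikvahpharma"] * 3,
    "message_id": [101, 102, 103],
    "message_text": ["Vacancy announcement", "Price list", "Vacancy announcement"],
    "forwards": [54, 3, 54],
})

# ID-based check: finds nothing, because every ID is distinct
id_dupes = sample.duplicated(subset=["channel_name", "message_id"]).sum()

# Content-based check: flag every row whose channel and text repeat
content_dupes = sample[
    sample.duplicated(subset=["channel_name", "message_text"], keep=False)
]

print(id_dupes)                      # 0
print(content_dupes.index.tolist())  # [0, 2]
```

Rows with empty text, such as media-only posts, would all match each other under this check, so it may be worth excluding them first.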

In [33]:
# Average engagement by channel
if views_col and forwards_col:
    channel_engagement = df.groupby(channel_col).agg({
        views_col: 'mean',
        forwards_col: 'mean',
        'message_id': 'count'
    }).round(2)
    channel_engagement.columns = ['Avg Views', 'Avg Forwards', 'Total Messages']
    
    print("Channel Engagement Metrics:")
    print("="*50)
    print(channel_engagement.sort_values('Avg Views', ascending=False))
    
    # Visualize
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Average Views by Channel', 'Average Forwards by Channel')
    )
    
    fig.add_trace(
        go.Bar(x=channel_engagement.index, y=channel_engagement['Avg Views'], name='Avg Views'),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Bar(x=channel_engagement.index, y=channel_engagement['Avg Forwards'], name='Avg Forwards'),
        row=1, col=2
    )
    
    fig.update_layout(height=400, showlegend=False, title_text="Channel Engagement Comparison")
    fig.show()

Channel Engagement Metrics:
                   Avg Views  Avg Forwards  Total Messages
channel_name                                              
tikvahpharma         2858.79          5.61             100
CheMed123            1416.47          3.16              76
lobelia4cosmetics     286.23          0.28             100

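Because view counts are strongly right-skewed (overall mean of about 1,530 against a median of 484.5), per-channel means can be dominated by a few viral posts. Reporting the median next to the mean is one way to guard against that. The sketch below uses made-up numbers to show the effect.

```python
import pandas as pd

# Toy data with one viral post in channel "a" (values are illustrative)
sample = pd.DataFrame({
    "channel_name": ["a", "a", "a", "b", "b", "b"],
    "views": [400, 500, 30000, 450, 500, 550],
})

# Compare mean and median per channel
summary = sample.groupby("channel_name")["views"].agg(["mean", "median"])
print(summary)
```

Here both channels have a median of 500 views, while the mean for channel `a` is pulled up to 10,300 by a single post. The same `agg(["mean", "median"])` call could be applied to `views_col` and `forwards_col` in the cell above.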

## 8. Content Analysis

Analyze message content characteristics.

In [34]:
# Message length distribution
print("Message Length Statistics:")
print("="*50)
print(df['message_length'].describe())

# Visualize message length distribution
fig = px.histogram(
    df[df['message_length'] > 0],
    x='message_length',
    nbins=50,
    title="Distribution of Message Lengths",
    labels={'message_length': 'Message Length (characters)', 'count': 'Number of Messages'},
    color_discrete_sequence=['#EF553B']
)
fig.update_layout(height=400)
fig.show()

Message Length Statistics:
count     277.000000
mean      663.104693
std       955.764437
min         0.000000
25%       177.000000
50%       384.000000
75%       412.000000
max      4073.000000
Name: message_length, dtype: float64


In [35]:
# Media usage analysis
media_stats = df['has_image'].value_counts()

fig = px.pie(
    values=media_stats.values,
    names=['With Image' if x else 'No Image' for x in media_stats.index],
    title="Message Media Distribution",
    color_discrete_sequence=['#00CC96', '#AB63FA']
)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(height=400)
fig.show()

print(f"\nMessages with images: {media_stats.get(True, 0)} ({media_stats.get(True, 0)/len(df)*100:.1f}%)")
print(f"Messages without images: {media_stats.get(False, 0)} ({media_stats.get(False, 0)/len(df)*100:.1f}%)")


Messages with images: 217 (78.3%)
Messages without images: 60 (21.7%)


In [36]:
# Media usage by channel
channel_media = df.groupby(channel_col)['has_image'].agg(['sum', 'count'])
channel_media['percentage'] = (channel_media['sum'] / channel_media['count'] * 100).round(1)
channel_media.columns = ['Messages with Image', 'Total Messages', 'Image %']

print("Media Usage by Channel:")
print("="*50)
print(channel_media.sort_values('Image %', ascending=False))

# Visualize
fig = px.bar(
    x=channel_media.index,
    y=channel_media['Image %'],
    title="Percentage of Messages with Images by Channel",
    labels={'x': 'Channel', 'y': 'Percentage of Messages with Images'},
    color=channel_media['Image %'],
    color_continuous_scale='Greens'
)
fig.update_layout(showlegend=False, height=400)
fig.show()

Media Usage by Channel:
                   Messages with Image  Total Messages  Image %
channel_name                                                   
lobelia4cosmetics                  100             100    100.0
CheMed123                           72              76     94.7
tikvahpharma                        45             100     45.0


## 9. Data Quality Assessment

Identify potential data quality issues.

In [38]:
print("Data Quality Checks:")
print("="*50)

# Check for duplicates (Telegram message IDs are unique only within a channel)
duplicates = df.duplicated(subset=[channel_col, 'message_id']).sum()
print(f"\n1. Duplicate message IDs: {duplicates}")

# Check for messages with no text and no media
empty_messages = df[(df['message_length'] == 0) & (df['has_image'] == False)]
print(f"\n2. Empty messages (no text, no media): {len(empty_messages)}")

# Check for future dates (make datetime timezone-aware for comparison)
from datetime import timezone
now = datetime.now(timezone.utc)
future_dates = df[df['message_date'] > now]
print(f"\n3. Messages with future dates: {len(future_dates)}")

# Check for missing critical fields
print("\n4. Missing critical fields:")
critical_fields = ['message_id', channel_col, 'message_date']
for field in critical_fields:
    if field in df.columns:
        missing = df[field].isna().sum()
        print(f"   - {field}: {missing} missing ({missing/len(df)*100:.1f}%)")

# Check for outliers in engagement metrics
if views_col:
    q75 = df[views_col].quantile(0.75)
    q25 = df[views_col].quantile(0.25)
    iqr = q75 - q25
    outliers = df[(df[views_col] > q75 + 1.5 * iqr) | (df[views_col] < q25 - 1.5 * iqr)]
    print(f"\n5. View count outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

Data Quality Checks:

1. Duplicate message IDs: 0

2. Empty messages (no text, no media): 5

3. Messages with future dates: 0

4. Missing critical fields:
   - message_id: 1 missing (0.4%)
   - channel_name: 1 missing (0.4%)
   - message_date: 1 missing (0.4%)

5. View count outliers: 36 (13.0%)

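The outlier check uses the common 1.5 × IQR rule. Plugging in the quartiles from the view count statistics in section 7 gives the actual cut-offs, worked through below as plain arithmetic.

```python
# Quartiles taken from the view count statistics printed in section 7
q25, q75 = 309.75, 766.25

iqr = q75 - q25                  # 456.5
lower_fence = q25 - 1.5 * iqr    # -375.0
upper_fence = q75 + 1.5 * iqr    # 1451.0

print(iqr, lower_fence, upper_fence)
```

Since the lower fence is negative and view counts cannot go below zero, all 36 flagged rows are high-view messages above 1,451 views.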

## 10. Key Findings Summary

Summarize the main insights from the data exploration.

In [39]:
print("="*60)
print("KEY FINDINGS - Data Exploration Summary")
print("="*60)

print(f"\n📊 DATASET OVERVIEW")
print(f"   • Total messages scraped: {len(df):,}")
print(f"   • Channels tracked: {df[channel_col].nunique()}")
print(f"   • Date range: {df['message_date'].min().date()} to {df['message_date'].max().date()}")
print(f"   • Average messages per day: {len(df) / df['date_only'].nunique():.1f}")

print(f"\n📝 CONTENT CHARACTERISTICS")
print(f"   • Average message length: {df['message_length'].mean():.0f} characters")
print(f"   • Messages with images: {df['has_image'].sum():,} ({df['has_image'].sum()/len(df)*100:.1f}%)")
print(f"   • Most active channel: {channel_stats.index[0]} ({channel_stats.values[0]} messages)")

if views_col:
    print(f"\n👁️ ENGAGEMENT METRICS")
    print(f"   • Average views per message: {df[views_col].mean():.0f}")
    print(f"   • Median views per message: {df[views_col].median():.0f}")
    print(f"   • Most viewed message: {df[views_col].max():,.0f} views")

if forwards_col and forwards_col in df.columns:
    print(f"   • Average forwards per message: {df[forwards_col].mean():.1f}")
    print(f"   • Most forwarded message: {df[forwards_col].max():,.0f} forwards")

print(f"\n📅 TEMPORAL PATTERNS")
peak_hour = hourly_dist.idxmax()
peak_day = day_dist.idxmax()
# hour is stored as float because of the row with a missing date; cast before formatting
print(f"   • Peak posting hour: {int(peak_hour):02d}:00 UTC ({hourly_dist.max()} messages)")
print(f"   • Most active day: {peak_day} ({day_dist.max()} messages)")

print(f"\n⚠️ DATA QUALITY")
print(f"   • Duplicate messages: {duplicates}")
print(f"   • Empty messages: {len(empty_messages)}")
print(f"   • Future dated messages: {len(future_dates)}")
print(f"   • Overall data quality: {'Good ✅' if duplicates == 0 and len(future_dates) == 0 else 'Needs attention ⚠️'}")

print("\n" + "="*60)

KEY FINDINGS - Data Exploration Summary

📊 DATASET OVERVIEW
   • Total messages scraped: 277
   • Channels tracked: 3
   • Date range: 2022-09-05 to 2026-01-18
   • Average messages per day: 4.7

📝 CONTENT CHARACTERISTICS
   • Average message length: 663 characters
   • Messages with images: 217 (78.3%)
   • Most active channel: lobelia4cosmetics (100 messages)

👁️ ENGAGEMENT METRICS
   • Average views per message: 1530
   • Median views per message: 484
   • Most viewed message: 30,160 views
   • Average forwards per message: 3.0
   • Most forwarded message: 54 forwards

📅 TEMPORAL PATTERNS
   • Peak posting hour: 06:00 UTC (45 messages)
   • Most active day: Thursday (61 messages)

⚠️ DATA QUALITY
   • Duplicate messages: 0
   • Empty messages: 5
   • Future dated messages: 0
   • Overall data quality: Good ✅

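The overall quality flag in the summary looks only at duplicates and future-dated messages, so it reports "Good" even though empty messages and a row with missing critical fields were found. A stricter label could take every check into account. The sketch below is one possible shape for that; the function name `quality_label` and its parameters are hypothetical.

```python
def quality_label(duplicates: int, future_dated: int, empty: int, missing_ids: int) -> str:
    """Return "Good" only when every check passes; otherwise name the failed checks."""
    issues = {
        "duplicate messages": duplicates,
        "future-dated messages": future_dated,
        "empty messages": empty,
        "missing message IDs": missing_ids,
    }
    failed = [name for name, count in issues.items() if count > 0]
    return "Good" if not failed else "Needs attention: " + ", ".join(failed)


# Counts taken from the data quality checks in section 9
print(quality_label(duplicates=0, future_dated=0, empty=5, missing_ids=1))
# Needs attention: empty messages, missing message IDs
```

Whether empty messages should count as a failure is a judgment call, since some of them may be legitimate service or deleted posts.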


## Conclusions

### Main Insights

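1. **Channel Activity**: The sample is nearly balanced: `lobelia4cosmetics` and `tikvahpharma` contribute 100 messages each and `CheMed123` contributes 76. The two round counts of 100 suggest a per-channel scrape limit, so these totals may not reflect relative posting frequency.

2. **Content Patterns**: 
   - Message length is right-skewed: the median is 384 characters against a mean of 663 and a maximum of 4,073
   - Images are used extensively (78.3% of rows overall) but unevenly: 100% of `lobelia4cosmetics` messages, 94.7% of `CheMed123`, and 45.0% of `tikvahpharma`

3. **Engagement Trends**: 
   - Views are heavily skewed (median 484, mean 1,530, maximum 30,160), so medians describe a typical message better than means
   - `tikvahpharma` has the highest average engagement (2,858.79 views, 5.61 forwards) despite the lowest image share, while `lobelia4cosmetics` has the lowest (286.23 views, 0.28 forwards) with an image on every message; in this sample, image usage alone does not track engagement

4. **Temporal Patterns**: 
   - Posting peaks at 06:00 UTC (09:00 in Ethiopia, UTC+3) with 45 messages, and Thursday is the most active day with 61 messages
   - These are posting times, not audience activity; they would need to be related to views before being used to guide posting schedules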

### Data Quality

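- No duplicate message IDs or future-dated messages were found
- Each critical field (`message_id`, `channel_name`, `message_date`) is missing in one row (0.4%); this row appears to be the `_manifest.json` record rather than a message, which also accounts for the `date`, `timestamp`, `channels`, and `total_messages` columns
- Five rows have neither text nor media, and 36 messages (13.0%) are view-count outliers under the 1.5 × IQR rule
- Several of the most-forwarded rows share identical forward counts and message lengths, which may indicate repeated content that an ID-based duplicate check cannot detect
- Once the manifest row is excluded, the data is ready for transformation into a structured warehouse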

### Next Steps

1. Load this raw data into the PostgreSQL raw schema
2. Transform into dimensional model (star schema) using dbt
3. Enrich with YOLO object detection for image analysis
4. Build analytical API for business questions