# Twitter Data Analysis (Dataset 1)

This notebook presents a descriptive analysis of a Twitter dataset using three approaches: pure Python, Pandas, and Polars. It covers basic data exploration and grouped statistics to understand engagement patterns.

## Script 1 — Pure Python (No Pandas, No Polars)


## 1. Load and Preview the Dataset
We start by loading the dataset with Python's built-in `csv` module and checking its structure.

In [1]:
import csv

# Open the file
with open("2024_tw_posts_president_scored_anon.csv", mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    header = next(reader)
    first_row = next(reader)

print("Header:\n", header)
print("\nFirst Row:\n", first_row)




Header:
 ['id', 'url', 'source', 'retweetCount', 'replyCount', 'likeCount', 'quoteCount', 'viewCount', 'createdAt', 'lang', 'bookmarkCount', 'isReply', 'isRetweet', 'isQuote', 'isConversationControlled', 'quoteId', 'inReplyToId', 'month_year', 'illuminating_scored_message', 'election_integrity_Truth_illuminating', 'advocacy_msg_type_illuminating', 'issue_msg_type_illuminating', 'attack_msg_type_illuminating', 'image_msg_type_illuminating', 'cta_msg_type_illuminating', 'engagement_cta_subtype_illuminating', 'fundraising_cta_subtype_illuminating', 'voting_cta_subtype_illuminating', 'covid_topic_illuminating', 'economy_topic_illuminating', 'education_topic_illuminating', 'environment_topic_illuminating', 'foreign_policy_topic_illuminating', 'governance_topic_illuminating', 'health_topic_illuminating', 'immigration_topic_illuminating', 'lgbtq_issues_topic_illuminating', 'military_topic_illuminating', 'race_and_ethnicity_topic_illuminating', 'safety_topic_illuminating', 'social_and_cultural
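
Beyond the header and first row, it can help to know how many rows and columns the file has before computing anything. The following is a minimal sketch of how that could be done with the `csv` module alone; the helper name `csv_shape` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def csv_shape(file_obj):
    """Return (data_rows, columns) for an open CSV file, not counting the header row."""
    reader = csv.reader(file_obj)
    header = next(reader, None)
    if header is None:
        return 0, 0  # Empty file: no header, no rows
    row_count = sum(1 for _ in reader)
    return row_count, len(header)

# In-memory stand-in for the real file, with made-up values
sample = io.StringIO("id,lang,likeCount\na,en,94\nb,en,2697\n")
print(csv_shape(sample))
```

With the real dataset, the same function could be called on the file object returned by `open(...)` inside a `with` block.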

## 2. Descriptive Statistics for Numeric Columns
We use the standard library to generate summary statistics (count, mean, min, max, standard deviation) for numeric features such as `likeCount` and `viewCount`.

In [2]:
import csv
import statistics

# File path
file_path = "2024_tw_posts_president_scored_anon.csv"

# Columns we want to analyze
numeric_columns = ["retweetCount", "replyCount", "likeCount", "quoteCount", "viewCount"]

# Data storage for each column
data = {col: [] for col in numeric_columns}

# Read the CSV file
with open(file_path, mode='r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        for col in numeric_columns:
            value = row[col]
            try:
                num = int(value)
                data[col].append(num)
            except (ValueError, TypeError):
                continue  # Skip empty, missing, or non-integer values

# Function to compute summary stats
def summarize(values):
    if not values:
        return None
    return {
        "Count": len(values),
        "Mean": round(statistics.mean(values), 2),
        "Min": min(values),
        "Max": max(values),
        "Std Dev": round(statistics.stdev(values), 2) if len(values) > 1 else 0.0
    }

# Print summary
for col in numeric_columns:
    print(f"Summary for '{col}':")
    print(summarize(data[col]))
    print("-" * 40)


Summary for 'retweetCount':
{'Count': 27304, 'Mean': 1322.06, 'Min': 0, 'Max': 144615, 'Std Dev': 3405.0}
----------------------------------------
Summary for 'replyCount':
{'Count': 27304, 'Mean': 1063.79, 'Min': 0, 'Max': 121270, 'Std Dev': 3174.98}
----------------------------------------
Summary for 'likeCount':
{'Count': 27304, 'Mean': 6913.69, 'Min': 0, 'Max': 915221, 'Std Dev': 21590.31}
----------------------------------------
Summary for 'quoteCount':
{'Count': 27304, 'Mean': 128.08, 'Min': 0, 'Max': 123320, 'Std Dev': 1131.53}
----------------------------------------
Summary for 'viewCount':
{'Count': 27304, 'Mean': 507084.73, 'Min': 5, 'Max': 333502775, 'Std Dev': 3212173.99}
----------------------------------------
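
The `summarize` function above reports count, mean, min, max, and standard deviation, but not quartiles. If quartiles are also wanted in the pure Python version, the standard library's `statistics.quantiles` is one option. The following is a minimal sketch; the helper name `quartiles` is an assumption, and the sample values are the `likeCount` entries of the first five rows shown in the previews. The `method="inclusive"` option interpolates linearly between data points, which should correspond to the default quartile behaviour of Pandas' `.describe()`.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) for a list of numbers, or None if fewer than two values."""
    if len(values) < 2:
        return None  # statistics.quantiles needs at least two data points
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3

# likeCount values of the first five rows in the dataset preview
sample_likes = [94, 2697, 332, 427, 106]
print(quartiles(sample_likes))
```

The result could be merged into the dictionary returned by `summarize` under keys such as `"25%"`, `"50%"`, and `"75%"`.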


## 3. Frequency Counts for Categorical Variables
Here, we analyze categorical columns such as `lang`, `isReply`, `isRetweet`, and `source` to understand their distributions.

In [3]:
from collections import Counter

# Categorical columns to analyze
categorical_columns = ["lang", "isReply", "isRetweet", "source"]

# Store frequency counts
category_counts = {col: Counter() for col in categorical_columns}

# Read and count values
with open(file_path, mode='r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        for col in categorical_columns:
            value = row[col].strip()
            category_counts[col][value] += 1

# Display results
for col in categorical_columns:
    print(f"Column: {col}")
    print(f"Unique values: {len(category_counts[col])}")
    print("Top 5 most common values:")
    for value, count in category_counts[col].most_common(5):
        print(f"  {value}: {count}")
    print("-" * 40)


Column: lang
Unique values: 12
Top 5 most common values:
  en: 27281
  fr: 6
  tl: 4
  es: 3
  da: 3
----------------------------------------
Column: isReply
Unique values: 2
Top 5 most common values:
  False: 23930
  True: 3374
----------------------------------------
Column: isRetweet
Unique values: 1
Top 5 most common values:
  False: 27304
----------------------------------------
Column: source
Unique values: 14
Top 5 most common values:
  Twitter Web App: 14930
  Twitter for iPhone: 8494
  Sprout Social: 2933
  Twitter Media Studio: 499
  Twitter for iPad: 266
----------------------------------------
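
Raw counts are easier to compare once they are expressed as shares of the total. The following is a minimal sketch that converts a `Counter` into percentages; the helper name `shares` is an assumption, and the counts are copied from the `isReply` output above.

```python
from collections import Counter

def shares(counter):
    """Convert a Counter of frequencies into percentages of the total, rounded to 2 places."""
    total = sum(counter.values())
    return {
        value: round(100 * count / total, 2)
        for value, count in counter.most_common()
    }

# Counts copied from the isReply output above
is_reply_counts = Counter({"False": 23930, "True": 3374})
print(shares(is_reply_counts))
```

In the notebook, the same function could be applied to each entry of `category_counts`.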


## 4. Grouped Aggregations by Language
We calculate average and count of `likeCount` and `viewCount` grouped by the `lang` column. This helps us compare engagement metrics across languages.

In [4]:
# Grouped summary for likeCount and viewCount by lang
group_field = "lang"
target_fields = ["likeCount", "viewCount"]

# Storage
grouped_data = {}

# Read CSV and group values
with open(file_path, mode='r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        key = row[group_field].strip()
        if key not in grouped_data:
            grouped_data[key] = {field: [] for field in target_fields}
        for field in target_fields:
            try:
                val = int(row[field])
                grouped_data[key][field].append(val)
            except (ValueError, TypeError):
                continue  # Skip empty, missing, or non-integer values

# Summarize each group
for lang, values_dict in grouped_data.items():
    print(f"Language Group: {lang}")
    for field in target_fields:
        values = values_dict[field]
        if values:
            mean_val = round(statistics.mean(values), 2)
            count_val = len(values)
            print(f"  {field} → Mean: {mean_val}, Count: {count_val}")
    print("-" * 40)


Language Group: en
  likeCount → Mean: 6913.52, Count: 27281
  viewCount → Mean: 507323.4, Count: 27281
----------------------------------------
Language Group: tl
  likeCount → Mean: 23366.25, Count: 4
  viewCount → Mean: 443447.25, Count: 4
----------------------------------------
Language Group: et
  likeCount → Mean: 47, Count: 1
  viewCount → Mean: 3355, Count: 1
----------------------------------------
Language Group: es
  likeCount → Mean: 1879, Count: 3
  viewCount → Mean: 245237.33, Count: 3
----------------------------------------
Language Group: pl
  likeCount → Mean: 507, Count: 1
  viewCount → Mean: 69531, Count: 1
----------------------------------------
Language Group: fr
  likeCount → Mean: 770, Count: 6
  viewCount → Mean: 41883.67, Count: 6
----------------------------------------
Language Group: in
  likeCount → Mean: 2, Count: 1
  viewCount → Mean: 109, Count: 1
----------------------------------------
Language Group: nl
  likeCount → Mean: 178, Count: 1
  viewCount

## 5. Insights from Output:
English (`en`) dominates with 27,281 of the 27,304 tweets.

Tweets in Tagalog (`tl`) had high average engagement (`likeCount ≈ 23,366`), but this is based on only 4 tweets, so a single viral post can dominate the mean.

Tweets in languages like `ht` or `tr` had very high view counts, again likely due to small sample size: each of these groups contains a single tweet.
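
One way to keep small groups from distorting comparisons is to separate them out before looking at the means. The following is a minimal sketch, assuming a hypothetical helper `split_by_sample_size` and an arbitrary threshold of 30 tweets; the statistics are copied from the grouped output above.

```python
def split_by_sample_size(group_stats, min_count=30):
    """Split {group: (mean, count)} into groups with at least min_count rows and the rest."""
    reliable = {g: s for g, s in group_stats.items() if s[1] >= min_count}
    small = {g: s for g, s in group_stats.items() if s[1] < min_count}
    return reliable, small

# (mean likeCount, tweet count) per language, copied from the grouped output above
like_stats = {
    "en": (6913.52, 27281),
    "tl": (23366.25, 4),
    "es": (1879, 3),
    "fr": (770, 6),
}

# The threshold of 30 is an illustrative assumption, not a rule from the notebook
reliable, small = split_by_sample_size(like_stats)
print("Reliable:", sorted(reliable))
print("Too small to compare:", sorted(small))
```

With this threshold only `en` remains, which reflects how little the non-English groups can tell us about engagement.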

## Script 2 — With Pandas

## Step 2: Descriptive Statistics Using Pandas

In this section, we replicate the descriptive analysis of the Twitter dataset using the Pandas library.

In [1]:
import pandas as pd

# Load the CSV using Pandas
df = pd.read_csv("2024_tw_posts_president_scored_anon.csv")

# Show basic info and preview
print("Dataset shape:", df.shape)
print("\nColumns:\n", df.columns.tolist())
df.head()


Dataset shape: (27304, 47)

Columns:
 ['id', 'url', 'source', 'retweetCount', 'replyCount', 'likeCount', 'quoteCount', 'viewCount', 'createdAt', 'lang', 'bookmarkCount', 'isReply', 'isRetweet', 'isQuote', 'isConversationControlled', 'quoteId', 'inReplyToId', 'month_year', 'illuminating_scored_message', 'election_integrity_Truth_illuminating', 'advocacy_msg_type_illuminating', 'issue_msg_type_illuminating', 'attack_msg_type_illuminating', 'image_msg_type_illuminating', 'cta_msg_type_illuminating', 'engagement_cta_subtype_illuminating', 'fundraising_cta_subtype_illuminating', 'voting_cta_subtype_illuminating', 'covid_topic_illuminating', 'economy_topic_illuminating', 'education_topic_illuminating', 'environment_topic_illuminating', 'foreign_policy_topic_illuminating', 'governance_topic_illuminating', 'health_topic_illuminating', 'immigration_topic_illuminating', 'lgbtq_issues_topic_illuminating', 'military_topic_illuminating', 'race_and_ethnicity_topic_illuminating', 'safety_topic_illumi

Unnamed: 0,id,url,source,retweetCount,replyCount,likeCount,quoteCount,viewCount,createdAt,lang,...,military_topic_illuminating,race_and_ethnicity_topic_illuminating,safety_topic_illuminating,social_and_cultural_topic_illuminating,technology_and_privacy_topic_illuminating,womens_issue_topic_illuminating,incivility_illuminating,scam_illuminating,freefair_illuminating,fraud_illuminating
0,cc46051622b8a9c1b883a3bbf12c640b12ac1cbdc7f48a...,f70a206472e9deaf6e313297c1efb891729ced346a0aeb...,Twitter for iPhone,10,37,94,2,15610,2023-09-30 14:11:00,en,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0
1,0e3db0c35a290c6df3b737d15882846c108cc80a9b7e5c...,a1962f54943732a0dc006c33b4b6f5764c7085a1282a59...,Twitter for iPhone,421,1005,2697,60,158324,2023-09-29 13:27:24,en,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0
2,256905919085d2946d5d187abc6cbe81a8abe3384793b3...,4ddbbdb7f4d8ef62fccf3ed20c993bb665b8620fedb089...,Twitter for iPhone,39,194,332,12,35535,2023-09-27 20:31:23,en,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0
3,a461b32b31e72b222df7fdda0a8e68b0092e31deda33a8...,c7e729c427e714baf06d88a2856a1be07d4ff52e1e6334...,Twitter for iPhone,47,332,427,62,199642,2023-09-26 01:52:40,en,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
4,ca2795ec79d62adc1fff06c4d3bc9da0bbc899e32c9b21...,c589bd751d7e1d275901b184087716d3155a582b164308...,Twitter Web App,17,46,106,3,17917,2023-09-21 13:24:13,en,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0,0


We use `.describe()` to generate summary statistics (count, mean, std, min, max, quartiles) for all numeric features such as `likeCount` and `viewCount`.

In [2]:
# Descriptive statistics for numeric columns
df.describe()

Unnamed: 0,retweetCount,replyCount,likeCount,quoteCount,viewCount,bookmarkCount,quoteId,inReplyToId,election_integrity_Truth_illuminating,advocacy_msg_type_illuminating,...,military_topic_illuminating,race_and_ethnicity_topic_illuminating,safety_topic_illuminating,social_and_cultural_topic_illuminating,technology_and_privacy_topic_illuminating,womens_issue_topic_illuminating,incivility_illuminating,scam_illuminating,freefair_illuminating,fraud_illuminating
count,27304.0,27304.0,27304.0,27304.0,27304.0,27304.0,3287.0,3345.0,26034.0,26034.0,...,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,27304.0,27304.0
mean,1322.055193,1063.785013,6913.692829,128.081563,507084.7,136.213522,1.764298e+18,1.758286e+18,0.037144,0.563686,...,0.010986,0.015403,0.037605,0.051971,0.002036,0.023316,0.178574,0.012368,0.001428,0.002747
std,3405.00424,3174.981654,21590.307989,1131.533468,3212174.0,712.580294,6.894687e+16,4.361197e+16,0.189118,0.495937,...,0.104237,0.123151,0.190242,0.221972,0.045075,0.150907,0.383003,0.110526,0.037767,0.052339
min,0.0,0.0,0.0,0.0,5.0,0.0,7.912639e+17,1.240067e+18,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,84.0,43.0,393.0,5.0,27852.75,4.0,1.726459e+18,1.726801e+18,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,333.0,131.0,1406.0,17.0,70942.0,21.0,1.756496e+18,1.746641e+18,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1071.0,501.25,5010.0,69.0,303663.0,76.0,1.816599e+18,1.789226e+18,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,144615.0,121270.0,915221.0,123320.0,333502800.0,42693.0,1.853576e+18,1.853531e+18,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [3]:
# Categorical columns to summarize
categorical_cols = ["lang", "isReply", "isRetweet", "source"]

for col in categorical_cols:
    print(f"\nColumn: {col}")
    print("Unique values:", df[col].nunique())
    print("Top 5 most frequent values:")
    print(df[col].value_counts().head(5))
    print("-" * 40)



Column: lang
Unique values: 12
Top 5 most frequent values:
lang
en    27281
fr        6
tl        4
es        3
da        3
Name: count, dtype: int64
----------------------------------------

Column: isReply
Unique values: 2
Top 5 most frequent values:
isReply
False    23930
True      3374
Name: count, dtype: int64
----------------------------------------

Column: isRetweet
Unique values: 1
Top 5 most frequent values:
isRetweet
False    27304
Name: count, dtype: int64
----------------------------------------

Column: source
Unique values: 14
Top 5 most frequent values:
source
Twitter Web App         14930
Twitter for iPhone       8494
Sprout Social            2933
Twitter Media Studio      499
Twitter for iPad          266
Name: count, dtype: int64
----------------------------------------


We'll use `.groupby()` to compute the mean and count of `likeCount` and `viewCount` per language, just like we did manually in the pure Python version.

In [4]:
# Grouped mean likeCount and viewCount by lang
grouped_stats = df.groupby("lang")[["likeCount", "viewCount"]].agg(["mean", "count"]).round(2)
grouped_stats


| lang | likeCount mean | likeCount count | viewCount mean | viewCount count |
|------|----------------|-----------------|----------------|-----------------|
| da | 597.0 | 3 | 33529.67 | 3 |
| en | 6913.52 | 27281 | 507323.4 | 27281 |
| es | 1879.0 | 3 | 245237.33 | 3 |
| et | 47.0 | 1 | 3355.0 | 1 |
| fr | 770.0 | 6 | 41883.67 | 6 |
| ht | 40516.0 | 1 | 1600170.0 | 1 |
| in | 2.0 | 1 | 109.0 | 1 |
| nl | 178.0 | 1 | 29081.0 | 1 |
| pl | 507.0 | 1 | 69531.0 | 1 |
| pt | 2537.0 | 1 | 153972.0 | 1 |
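
The `.agg(["mean", "count"])` call produces two-level column labels. If flat column names are preferred, Pandas' named aggregation is one option. The following is a minimal sketch on a tiny illustrative DataFrame; the name `toy` and the `fr` row values are made up, and the output column names are assumptions chosen for readability.

```python
import pandas as pd

# Tiny illustrative frame; the real notebook would use df loaded from the CSV
toy = pd.DataFrame({
    "lang": ["en", "en", "fr"],
    "likeCount": [94, 2697, 10],
    "viewCount": [15610, 158324, 500],
})

# Named aggregation: output_column=(input_column, aggregation)
flat_stats = (
    toy.groupby("lang")
    .agg(
        likeCount_mean=("likeCount", "mean"),
        likeCount_count=("likeCount", "count"),
        viewCount_mean=("viewCount", "mean"),
        viewCount_count=("viewCount", "count"),
    )
    .round(2)
    .sort_values("likeCount_count", ascending=False)
)
print(flat_stats)
```

Sorting by `likeCount_count` puts the largest groups first, which makes small-sample groups easy to spot at the bottom of the table.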


## Script 3 — With Polars

In [1]:
import polars as pl

# Load the CSV into a Polars DataFrame
df_pl = pl.read_csv("2024_tw_posts_president_scored_anon.csv")

# Show shape and preview
print("Shape:", df_pl.shape)
df_pl.head()

Shape: (27304, 47)


id,url,source,retweetCount,replyCount,likeCount,quoteCount,viewCount,createdAt,lang,bookmarkCount,isReply,isRetweet,isQuote,isConversationControlled,quoteId,inReplyToId,month_year,illuminating_scored_message,election_integrity_Truth_illuminating,advocacy_msg_type_illuminating,issue_msg_type_illuminating,attack_msg_type_illuminating,image_msg_type_illuminating,cta_msg_type_illuminating,engagement_cta_subtype_illuminating,fundraising_cta_subtype_illuminating,voting_cta_subtype_illuminating,covid_topic_illuminating,economy_topic_illuminating,education_topic_illuminating,environment_topic_illuminating,foreign_policy_topic_illuminating,governance_topic_illuminating,health_topic_illuminating,immigration_topic_illuminating,lgbtq_issues_topic_illuminating,military_topic_illuminating,race_and_ethnicity_topic_illuminating,safety_topic_illuminating,social_and_cultural_topic_illuminating,technology_and_privacy_topic_illuminating,womens_issue_topic_illuminating,incivility_illuminating,scam_illuminating,freefair_illuminating,fraud_illuminating
str,str,str,i64,i64,i64,i64,i64,str,str,i64,bool,bool,bool,bool,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64
"""cc46051622b8a9c1b883a3bbf12c64…","""f70a206472e9deaf6e313297c1efb8…","""Twitter for iPhone""",10,37,94,2,15610,"""2023-09-30 14:11:00""","""en""",0,False,False,False,False,,,"""2023-09""","""1876a8ce2704af06f47f4cf6c5bcad…",0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0
"""0e3db0c35a290c6df3b737d1588284…","""a1962f54943732a0dc006c33b4b6f5…","""Twitter for iPhone""",421,1005,2697,60,158324,"""2023-09-29 13:27:24""","""en""",13,False,False,False,False,,,"""2023-09""","""ca5cbace947a7eaae06ef2d2423ff6…",0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0
"""256905919085d2946d5d187abc6cbe…","""4ddbbdb7f4d8ef62fccf3ed20c993b…","""Twitter for iPhone""",39,194,332,12,35535,"""2023-09-27 20:31:23""","""en""",1,False,False,False,False,,,"""2023-09""","""ac5132800ac8301dd96b6502706c0a…",0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0
"""a461b32b31e72b222df7fdda0a8e68…","""c7e729c427e714baf06d88a2856a1b…","""Twitter for iPhone""",47,332,427,62,199642,"""2023-09-26 01:52:40""","""en""",7,False,False,False,False,,,"""2023-09""","""b12b8365f96e4ce77fc72599ac977a…",0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
"""ca2795ec79d62adc1fff06c4d3bc9d…","""c589bd751d7e1d275901b184087716…","""Twitter Web App""",17,46,106,3,17917,"""2023-09-21 13:24:13""","""en""",2,False,False,False,False,,,"""2023-09""","""58788e34f34d8a3f530dd68d9faf79…",0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0,0


In [2]:
# Descriptive statistics for numeric columns
df_pl.describe()

statistic,id,url,source,retweetCount,replyCount,likeCount,quoteCount,viewCount,createdAt,lang,bookmarkCount,isReply,isRetweet,isQuote,isConversationControlled,quoteId,inReplyToId,month_year,illuminating_scored_message,election_integrity_Truth_illuminating,advocacy_msg_type_illuminating,issue_msg_type_illuminating,attack_msg_type_illuminating,image_msg_type_illuminating,cta_msg_type_illuminating,engagement_cta_subtype_illuminating,fundraising_cta_subtype_illuminating,voting_cta_subtype_illuminating,covid_topic_illuminating,economy_topic_illuminating,education_topic_illuminating,environment_topic_illuminating,foreign_policy_topic_illuminating,governance_topic_illuminating,health_topic_illuminating,immigration_topic_illuminating,lgbtq_issues_topic_illuminating,military_topic_illuminating,race_and_ethnicity_topic_illuminating,safety_topic_illuminating,social_and_cultural_topic_illuminating,technology_and_privacy_topic_illuminating,womens_issue_topic_illuminating,incivility_illuminating,scam_illuminating,freefair_illuminating,fraud_illuminating
str,str,str,str,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""27304""","""27304""","""27304""",27304.0,27304.0,27304.0,27304.0,27304.0,"""27304""","""27304""",27304.0,27304.0,27304.0,27304.0,27304.0,3287.0,3345.0,"""27304""","""27304""",26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,26034.0,27304.0,27304.0
"""null_count""","""0""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,"""0""","""0""",0.0,0.0,0.0,0.0,0.0,24017.0,23959.0,"""0""","""0""",1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,1270.0,0.0,0.0
"""mean""",,,,1322.055193,1063.785013,6913.692829,128.081563,507084.731834,,,136.213522,0.123572,0.0,0.118664,0.000293,1.7643e+18,1.7583e+18,,,0.037144,0.563686,0.507682,0.307598,0.226435,0.109664,0.066912,0.007874,0.016786,0.007605,0.160214,0.018437,0.02854,0.042252,0.02297,0.055658,0.065299,0.003073,0.010986,0.015403,0.037605,0.051971,0.002036,0.023316,0.178574,0.012368,0.001428,0.002747
"""std""",,,,3405.00424,3174.981654,21590.307989,1131.533468,3212200.0,,,712.580294,,,,,6.8947e+16,4.3612e+16,,,0.189118,0.495937,0.499951,0.461508,0.418532,0.312477,0.249875,0.088389,0.12847,0.086879,0.366811,0.134529,0.166512,0.201168,0.149811,0.229264,0.247058,0.05535,0.104237,0.123151,0.190242,0.221972,0.045075,0.150907,0.383003,0.110526,0.037767,0.052339
"""min""","""0000635d0c9e7bdf89dfc13811d080…","""0000179c6b90798f167528aaaaf678…","""Canva""",0.0,0.0,0.0,0.0,5.0,"""2023-09-01 00:30:21""","""da""",0.0,0.0,0.0,0.0,0.0,7.9126e+17,1.2401e+18,"""2023-09""","""0000f20a94aa332e2e6ed7a0620f98…",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""25%""",,,,84.0,43.0,393.0,5.0,27853.0,,,4.0,,,,,1.7265e+18,1.7268e+18,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""50%""",,,,333.0,131.0,1406.0,17.0,70942.0,,,21.0,,,,,1.7565e+18,1.7466e+18,,,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""75%""",,,,1071.0,501.0,5010.0,69.0,303661.0,,,76.0,,,,,1.8166e+18,1.7892e+18,,,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""max""","""fffbb471d8b0bd6d990b4f9f22283b…","""ffffd63fa71574c0127b90e12fdba3…","""Twitter for iPhone""",144615.0,121270.0,915221.0,123320.0,333502775.0,"""2024-11-04 23:40:21""","""tr""",42693.0,1.0,0.0,1.0,1.0,1.8536e+18,1.8535e+18,"""2024-11""","""fffe6f31ba97d01463912106398493…",1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [4]:
# Categorical columns to inspect
cat_cols = ["lang", "source", "isReply", "isRetweet"]

for col in cat_cols:
    print(f"\nColumn: {col}")
    value_counts = (
        df_pl
        .group_by(col)  # Polars uses group_by (not groupby)
        # pl.len() replaces pl.count(), which is deprecated since Polars 0.20.5
        .agg(pl.len().alias("count"))
        .sort("count", descending=True)
        .head(5)
    )
    print(value_counts)
    print("-" * 40)



Column: lang
shape: (5, 2)
┌──────┬───────┐
│ lang ┆ count │
│ ---  ┆ ---   │
│ str  ┆ u32   │
╞══════╪═══════╡
│ en   ┆ 27281 │
│ fr   ┆ 6     │
│ tl   ┆ 4     │
│ es   ┆ 3     │
│ da   ┆ 3     │
└──────┴───────┘
----------------------------------------

Column: source
shape: (5, 2)
┌──────────────────────┬───────┐
│ source               ┆ count │
│ ---                  ┆ ---   │
│ str                  ┆ u32   │
╞══════════════════════╪═══════╡
│ Twitter Web App      ┆ 14930 │
│ Twitter for iPhone   ┆ 8494  │
│ Sprout Social        ┆ 2933  │
│ Twitter Media Studio ┆ 499   │
│ Twitter for iPad     ┆ 266   │
└──────────────────────┴───────┘
----------------------------------------

Column: isReply
shape: (2, 2)
┌─────────┬───────┐
│ isReply ┆ count │
│ ---     ┆ ---   │
│ bool    ┆ u32   │
╞═════════╪═══════╡
│ false   ┆ 23930 │
│ true    ┆ 3374  │
└─────────┴───────┘
----------------------------------------

Column: isRetweet
shape: (1, 2)
┌───────────┬───────┐
│ isRetweet ┆ count │
│

(Deprecated in version 0.20.5)
  .agg(pl.count())


In [5]:
# Grouped mean and count of likeCount and viewCount by language
grouped_stats = (
    df_pl
    .group_by("lang")
    .agg([
        pl.col("likeCount").mean().alias("likeCount_mean"),
        pl.col("likeCount").count().alias("likeCount_count"),
        pl.col("viewCount").mean().alias("viewCount_mean"),
        pl.col("viewCount").count().alias("viewCount_count")
    ])
    .sort("likeCount_count", descending=True)
)
print(grouped_stats)

shape: (12, 5)
┌──────┬────────────────┬─────────────────┬────────────────┬─────────────────┐
│ lang ┆ likeCount_mean ┆ likeCount_count ┆ viewCount_mean ┆ viewCount_count │
│ ---  ┆ ---            ┆ ---             ┆ ---            ┆ ---             │
│ str  ┆ f64            ┆ u32             ┆ f64            ┆ u32             │
╞══════╪════════════════╪═════════════════╪════════════════╪═════════════════╡
│ en   ┆ 6913.519886    ┆ 27281           ┆ 507323.401525  ┆ 27281           │
│ fr   ┆ 770.0          ┆ 6               ┆ 41883.666667   ┆ 6               │
│ tl   ┆ 23366.25       ┆ 4               ┆ 443447.25      ┆ 4               │
│ es   ┆ 1879.0         ┆ 3               ┆ 245237.333333  ┆ 3               │
│ da   ┆ 597.0          ┆ 3               ┆ 33529.666667   ┆ 3               │
│ …    ┆ …              ┆ …               ┆ …              ┆ …               │
│ nl   ┆ 178.0          ┆ 1               ┆ 29081.0        ┆ 1               │
│ pl   ┆ 507.0          ┆ 1          

## Insights from Output:
We now have the mean and count of `likeCount` and `viewCount`, grouped by `lang` (language) and sorted in descending order of `likeCount_count`.

These values match the Pandas results from Script 2 (only the row order differs), so the analysis of Dataset 1 is complete across all three scripts:

Script 1 (descriptive statistics in pure Python)

Script 2 (descriptive statistics in Pandas)

Script 3 (descriptive statistics in Polars)
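
Agreement between the three scripts can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming a hypothetical helper `results_agree` and a tolerance of 0.01 to absorb rounding differences; the values are the mean `likeCount` figures printed by each script for three languages.

```python
import math

# Mean likeCount per language as printed by each script
pure_python = {"en": 6913.52, "fr": 770, "es": 1879}
pandas_out = {"en": 6913.52, "fr": 770.0, "es": 1879.0}
polars_out = {"en": 6913.519886, "fr": 770.0, "es": 1879.0}

def results_agree(*results, abs_tol=0.01):
    """Return True if every result dict has the same keys and values within abs_tol."""
    first = results[0]
    for other in results[1:]:
        if other.keys() != first.keys():
            return False
        for key, value in first.items():
            if not math.isclose(value, other[key], abs_tol=abs_tol):
                return False
    return True

print(results_agree(pure_python, pandas_out, polars_out))
```

The tolerance is needed because the pure Python and Pandas outputs are rounded to two decimals while the Polars output is not.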


---

Notebook created as part of Task 04 – Descriptive Statistics.