# Arabic Tweets Dataset Analysis

This notebook explores a 50,000-tweet sample of the `pain/Arabic-Tweets` dataset from Hugging Face. We load the data, run exploratory data analysis, and extract surface features from the tweet text.

## 1. Setup and Data Loading

In [11]:
from datasets import load_dataset


In [None]:
# Download dataset from HuggingFace (commented out - CSV file is used instead)
# num_samples_to_take = 50000
# dataset_name = "pain/Arabic-Tweets"
# ds = load_dataset(dataset_name, split="train", streaming=True)
# ds = ds.take(num_samples_to_take)

**Note:** A 50,000-tweet sample was saved to `arabic_tweets_50k.csv`. The download cell above is commented out to avoid re-downloading, and the cells below read from the CSV instead.
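
One way such a CSV snapshot could be produced is sketched below using only the standard library. `save_sample` is a hypothetical helper, not part of this notebook, and it assumes each streamed record is a dict with a `text` field, which is the column the later cells rely on.

```python
import csv
import itertools

def save_sample(records, n, csv_path):
    """Write the `text` field of the first n records to a UTF-8 CSV file.

    `records` is any iterable of dicts (for example a streaming dataset).
    The `text` key is an assumption based on the column used in this notebook.
    """
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # header row
        # islice stops after n records without materializing the whole stream
        for record in itertools.islice(records, n):
            writer.writerow([record["text"]])
            written += 1
    return written
```

With the streaming dataset from the commented-out cell, this would be called as `save_sample(ds, 50000, 'arabic_tweets_50k.csv')`. Writing row by row keeps memory use flat, since only one record is held at a time.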

In [24]:
# Load dataset from CSV file
import pandas as pd

csv_path = 'arabic_tweets_50k.csv'
df = pd.read_csv(csv_path)
print(f"Dataset loaded from {csv_path}")
print(f"Shape: {df.shape}")

Dataset loaded from arabic_tweets_50k.csv
Shape: (50000, 14)


In [14]:
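# Shape before feature engineering: this cell (In [14]) ran before the CSV reload above (In [24]),
# when the frame held only the text column
df.shape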

(50000, 1)

In [25]:
df.head()


| | text | char_count | word_count | mention_count | hashtag_count | url_count | has_numbers | char_per_word | emoji_count | punctuation_count | exclamation_count | question_count | has_latin | is_all_caps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | مذاكره اخر اختبار هي اصعب شي امر فيه حاليا | 42 | 9 | 0 | 0 | 0 | False | 4.666667 | 0 | 0 | 0 | 0 | False | False |
| 1 | باريللا صارله كم مباراه مستواه سيء فيدال بس يض... | 96 | 19 | 0 | 0 | 0 | False | 5.052632 | 0 | 0 | 0 | 0 | False | False |
| 2 | معروفه اننا بنتزنق في اخر اسبوع من المديول | 42 | 8 | 0 | 0 | 0 | False | 5.25 | 0 | 0 | 0 | 0 | False | False |
| 3 | حساب عظيم اخر همه اثاره الجدل تفكير عميق وتحلي... | 72 | 14 | 0 | 0 | 0 | False | 5.142857 | 0 | 0 | 0 | 0 | False | False |
| 4 | دخل لاجامي مكان مادو الي قاتل اللعب وراء ودخل ... | 112 | 22 | 0 | 0 | 0 | False | 5.090909 | 0 | 0 | 0 | 0 | False | False |


## 2. Basic Exploration

Let's examine the structure and content of our dataset.

In [16]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())
print(f"\nTotal rows: {len(df)}")
print(f"Columns: {df.columns.tolist()}")

Missing values:
text    0
dtype: int64

Total rows: 50000
Columns: ['text']


In [None]:
# Display sample tweets
print("Sample tweets:")
for i, tweet in enumerate(df['text'].head(10), 1):
    print(f"{i}. {tweet}")

## 3. Feature Engineering

Extract useful features from the tweet text to better understand the content.

In [17]:
import re

# Text length features
df['char_count'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Count mentions, hashtags, and URLs (str.count takes a regex; \w also matches Arabic letters)
df['mention_count'] = df['text'].str.count(r'@\w+')
df['hashtag_count'] = df['text'].str.count(r'#\w+')
df['url_count'] = df['text'].str.count(r'http\S+|www\.\S+')

# Check for digits (in Python 3, \d matches ASCII 0-9 and Arabic-Indic digits such as ٠-٩)
df['has_numbers'] = df['text'].str.contains(r'\d', regex=True)

# Average characters per word (char_count includes spaces, so this slightly overstates word length)
df['char_per_word'] = df['char_count'] / df['word_count'].replace(0, 1)

print("Features extracted successfully!")
df[['text', 'char_count', 'word_count', 'mention_count', 'hashtag_count', 'url_count']].head()

Features extracted successfully!


| | text | char_count | word_count | mention_count | hashtag_count | url_count |
|---|---|---|---|---|---|---|
| 0 | مذاكره اخر اختبار هي اصعب شي امر فيه حاليا | 42 | 9 | 0 | 0 | 0 |
| 1 | باريللا صارله كم مباراه مستواه سيء فيدال بس يض... | 96 | 19 | 0 | 0 | 0 |
| 2 | معروفه اننا بنتزنق في اخر اسبوع من المديول | 42 | 8 | 0 | 0 | 0 |
| 3 | حساب عظيم اخر همه اثاره الجدل تفكير عميق وتحلي... | 72 | 14 | 0 | 0 | 0 |
| 4 | دخل لاجامي مكان مادو الي قاتل اللعب وراء ودخل ... | 112 | 22 | 0 | 0 | 0 |
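
Since the first rows contain no mentions, hashtags, or URLs, the patterns are easier to check on a made-up string. The sketch below applies the same regular expressions to a hypothetical tweet; the sample text and the `surface_counts` helper are illustrative and do not come from the dataset.

```python
import re

MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')
URL_RE = re.compile(r'http\S+|www\.\S+')
DIGIT_RE = re.compile(r'\d')  # Unicode-aware in Python 3: also matches ٠-٩

def surface_counts(text):
    """Return the mention, hashtag, URL, and digit features for one string."""
    return {
        'mention_count': len(MENTION_RE.findall(text)),
        'hashtag_count': len(HASHTAG_RE.findall(text)),
        'url_count': len(URL_RE.findall(text)),
        'has_numbers': bool(DIGIT_RE.search(text)),
    }

# Hypothetical tweet: one mention, one Arabic hashtag, one URL, one ASCII digit
sample = "@user1 شكرا #مساء_الخير https://example.com"
counts = surface_counts(sample)
```

Because `\w` is Unicode-aware, the Arabic hashtag is matched as a single token, underscore included. Note that `@\w+` would also fire on email addresses, so on uncleaned text the mention count is an upper bound.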


## 4. Exploratory Data Analysis (EDA)

### 4.1 Statistical Summary

In [18]:
# Descriptive statistics for numerical features
df[['char_count', 'word_count', 'mention_count', 'hashtag_count', 'url_count', 'char_per_word']].describe()

| | char_count | word_count | mention_count | hashtag_count | url_count | char_per_word |
|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 90.2637 | 17.28058 | 0.0 | 0.0 | 0.0 | 5.098266 |
| std | 72.612125 | 13.366369 | 0.0 | 0.0 | 0.0 | 0.554885 |
| min | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.5 |
| 25% | 35.0 | 7.0 | 0.0 | 0.0 | 0.0 | 4.75 |
| 50% | 63.0 | 12.0 | 0.0 | 0.0 | 0.0 | 5.1 |
| 75% | 126.0 | 24.0 | 0.0 | 0.0 | 0.0 | 5.451613 |
| max | 280.0 | 65.0 | 0.0 | 0.0 | 0.0 | 9.25 |
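
Because `char_count` is the full string length, spaces are included in `char_per_word`. As a worked check on the first tweet shown earlier, the sketch below compares that ratio with one computed from letters only. `chars_per_word` is a hypothetical helper, and returning 0.0 for empty text is a choice made here (the notebook divides by 1 instead).

```python
def chars_per_word(text, include_spaces=True):
    """Average characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0  # assumption: treat empty text as zero rather than dividing by 1
    total = len(text) if include_spaces else sum(len(w) for w in words)
    return total / len(words)

first_tweet = "مذاكره اخر اختبار هي اصعب شي امر فيه حاليا"
with_spaces = chars_per_word(first_tweet)                           # 42 / 9, as in the table
without_spaces = chars_per_word(first_tweet, include_spaces=False)  # 34 / 9
```

For a single-spaced tweet the two values differ by (words − 1) / words, which is just under one character, so the reported mean of about 5.10 suggests an average word length a little above 4 letters.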


In [19]:
# Calculate percentages
total_tweets = len(df)
print(f"Tweets with mentions: {(df['mention_count'] > 0).sum()} ({(df['mention_count'] > 0).sum()/total_tweets*100:.2f}%)")
print(f"Tweets with hashtags: {(df['hashtag_count'] > 0).sum()} ({(df['hashtag_count'] > 0).sum()/total_tweets*100:.2f}%)")
print(f"Tweets with URLs: {(df['url_count'] > 0).sum()} ({(df['url_count'] > 0).sum()/total_tweets*100:.2f}%)")
print(f"Tweets with numbers: {df['has_numbers'].sum()} ({df['has_numbers'].sum()/total_tweets*100:.2f}%)")

Tweets with mentions: 0 (0.00%)
Tweets with hashtags: 0 (0.00%)
Tweets with URLs: 0 (0.00%)
Tweets with numbers: 0 (0.00%)
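
All four counts are zero, which would be consistent with text that was cleaned (mentions, hashtags, URLs, and digits stripped) before it was published, although this notebook does not confirm that. One way to check is to tally which kinds of characters actually occur. The sketch below does so with the standard library; `classify_char` and its five buckets are illustrative choices, not part of the original analysis.

```python
from collections import Counter
import unicodedata

def classify_char(ch):
    """Assign a character to a coarse bucket based on its Unicode properties."""
    if ch.isspace():
        return 'space'
    if ch.isdigit():
        return 'digit'
    name = unicodedata.name(ch, '')  # empty string for unnamed code points
    if name.startswith('ARABIC'):
        return 'arabic'
    if name.startswith('LATIN'):
        return 'latin'
    return 'other'

def char_class_counts(texts):
    """Tally character buckets over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(classify_char(ch) for ch in text)
    return counts

# Small hypothetical input: one Arabic phrase and one mixed string
demo = char_class_counts(["مذاكره اخر", "ok 5!"])
```

Running `char_class_counts(df['text'])` on the full sample would show whether anything other than Arabic letters and spaces survives, which is what the zero counts above hint at.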


### 4.2 Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Character count distribution
axes[0, 0].hist(df['char_count'], bins=50, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Character Count Distribution')
axes[0, 0].set_xlabel('Character Count')
axes[0, 0].set_ylabel('Frequency')

# Word count distribution
axes[0, 1].hist(df['word_count'], bins=50, color='lightcoral', edgecolor='black')
axes[0, 1].set_title('Word Count Distribution')
axes[0, 1].set_xlabel('Word Count')
axes[0, 1].set_ylabel('Frequency')

# Mentions distribution
axes[0, 2].hist(df['mention_count'], bins=range(0, df['mention_count'].max()+2), color='lightgreen', edgecolor='black')
axes[0, 2].set_title('Mentions per Tweet Distribution')
axes[0, 2].set_xlabel('Number of Mentions')
axes[0, 2].set_ylabel('Frequency')

# Hashtags distribution
axes[1, 0].hist(df['hashtag_count'], bins=range(0, df['hashtag_count'].max()+2), color='gold', edgecolor='black')
axes[1, 0].set_title('Hashtags per Tweet Distribution')
axes[1, 0].set_xlabel('Number of Hashtags')
axes[1, 0].set_ylabel('Frequency')

# URL distribution
axes[1, 1].hist(df['url_count'], bins=range(0, df['url_count'].max()+2), color='plum', edgecolor='black')
axes[1, 1].set_title('URLs per Tweet Distribution')
axes[1, 1].set_xlabel('Number of URLs')
axes[1, 1].set_ylabel('Frequency')

# Character per word distribution
axes[1, 2].hist(df['char_per_word'], bins=50, color='orange', edgecolor='black')
axes[1, 2].set_title('Characters per Word Distribution')
axes[1, 2].set_xlabel('Chars per Word')
axes[1, 2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_cols = ['char_count', 'word_count', 'mention_count', 'hashtag_count', 'url_count', 'char_per_word']
sns.heatmap(df[correlation_cols].corr(), annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
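
In this sample `mention_count`, `hashtag_count`, and `url_count` are constant (all zeros), so their Pearson correlation is undefined (zero variance) and the corresponding heatmap cells carry no information. A possible refinement, sketched below on a toy frame built from the first four rows shown earlier, is to drop constant columns before calling `corr()`. `drop_constant_columns` is a hypothetical helper.

```python
import pandas as pd

def drop_constant_columns(frame):
    """Return only the columns that take more than one distinct value."""
    return frame.loc[:, frame.nunique() > 1]

# Toy frame: values copied from the first four rows of the feature table
toy = pd.DataFrame({
    "char_count": [42, 96, 42, 72],
    "word_count": [9, 19, 8, 14],
    "mention_count": [0, 0, 0, 0],  # constant, so it is dropped
})
informative = drop_constant_columns(toy)
corr = informative.corr()
```

In the notebook this would amount to passing `drop_constant_columns(df[correlation_cols]).corr()` to `sns.heatmap`, leaving only `char_count`, `word_count`, and `char_per_word` for this sample.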

### 4.3 Top Tweets Analysis

In [None]:
# Longest tweets
print("Top 5 Longest Tweets by Character Count:")
print("="*60)
for idx, row in df.nlargest(5, 'char_count')[['text', 'char_count', 'word_count']].iterrows():
    print(f"\nChars: {row['char_count']}, Words: {row['word_count']}")
    print(f"Text: {row['text'][:100]}...")
    
print("\n" + "="*60)
print("\nTop 5 Tweets with Most Mentions:")
print("="*60)
for idx, row in df.nlargest(5, 'mention_count')[['text', 'mention_count']].iterrows():
    print(f"\nMentions: {row['mention_count']}")
    print(f"Text: {row['text'][:100]}...")

## 5. Additional Text Features

Extract additional surface features: emoji, punctuation, exclamation and question marks, and the presence of Latin-script characters.

In [20]:
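# Count emojis (simplified check for common emoji ranges).
# The pattern is compiled once. The original catch-all range U+24C2-U+1F251 also
# covered CJK and Arabic presentation forms, so it is replaced here by narrower
# blocks (the exact choice of blocks is an approximation).
EMOJI_PATTERN = re.compile("["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002702-\U000027B0"  # dingbats
    "]")

def count_emojis(text):
    # One match per emoji code point (no trailing +, which would count runs)
    return len(EMOJI_PATTERN.findall(text))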

df['emoji_count'] = df['text'].apply(count_emojis)

# Count punctuation, including the Arabic comma (،), semicolon (؛), and question mark (؟)
df['punctuation_count'] = df['text'].apply(lambda x: sum(c in '!?,.:;…،؛؟' for c in x))

# Count exclamation and question marks
df['exclamation_count'] = df['text'].str.count('!')
df['question_count'] = df['text'].str.count('؟|\\?')  # Arabic and English question marks

# Arabic has no letter case, so "all caps" only applies to tweets containing Latin letters:
# flag those tweets, then mark the ones whose text is unchanged by upper()
df['has_latin'] = df['text'].str.contains('[a-zA-Z]', regex=True)
df['is_all_caps'] = df['has_latin'] & (df['text'].str.upper() == df['text'])

print("Additional features extracted!")
df[['text', 'emoji_count', 'punctuation_count', 'exclamation_count', 'question_count']].head(10)

Additional features extracted!


| | text | emoji_count | punctuation_count | exclamation_count | question_count |
|---|---|---|---|---|---|
| 0 | مذاكره اخر اختبار هي اصعب شي امر فيه حاليا | 0 | 0 | 0 | 0 |
| 1 | باريللا صارله كم مباراه مستواه سيء فيدال بس يض... | 0 | 0 | 0 | 0 |
| 2 | معروفه اننا بنتزنق في اخر اسبوع من المديول | 0 | 0 | 0 | 0 |
| 3 | حساب عظيم اخر همه اثاره الجدل تفكير عميق وتحلي... | 0 | 0 | 0 | 0 |
| 4 | دخل لاجامي مكان مادو الي قاتل اللعب وراء ودخل ... | 0 | 0 | 0 | 0 |
| 5 | اللهك اخر حاجه كتبتها | 0 | 0 | 0 | 0 |
| 6 | اي حد خالطني اخر اسبوعين يروح يحلل لان مبقيتش ... | 0 | 0 | 0 | 0 |
| 7 | مكانتش اخر مره عملتها بس دي كانت اغربهم | 0 | 0 | 0 | 0 |
| 8 | ليفربول يفشل في الفوز في اخر خمس مباريات بالدو... | 0 | 0 | 0 | 0 |
| 9 | كنت انتظر ساعه عند كوفي يدخلني اخر شيء شفت الا... | 0 | 0 | 0 | 0 |
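
Two details of character-class emoji patterns are worth checking on a small example: a trailing `+` makes `findall` return runs of adjacent emoji rather than individual ones, and a very wide range such as U+24C2 to U+1F251 also covers CJK ideographs and Arabic presentation forms. The sketch below uses made-up strings and a deliberately small set of ranges, so it is not a complete emoji detector.

```python
import re

RANGES = (
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
)
PER_EMOJI = re.compile("[" + RANGES + "]")     # one match per code point
PER_RUN = re.compile("[" + RANGES + "]+")      # one match per adjacent run
BROAD = re.compile("[\U000024C2-\U0001F251]")  # the wide range described above

text = "😀😀 مرحبا 🚀"                         # hypothetical tweet
per_emoji = len(PER_EMOJI.findall(text))       # counts each emoji separately
per_run = len(PER_RUN.findall(text))           # counts "😀😀" as a single match

broad_hits_cjk = bool(BROAD.search("中"))              # U+4E2D lies inside the range
broad_hits_arabic_form = bool(BROAD.search("\uFEFB"))  # Arabic presentation form
broad_hits_basic_arabic = bool(BROAD.search("مرحبا"))   # basic Arabic block lies below it
```

Even the per-code-point pattern counts multi-code-point emoji (skin-tone modifiers, ZWJ sequences, flags) as several matches, so `emoji_count` is best read as an approximation.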


## 6. Summary Statistics

Final overview of all features in the dataset.

In [21]:
# Display complete feature set
print("Dataset Overview:")
print(f"Total tweets: {len(df)}")
print(f"Total features: {len(df.columns)}")
print(f"\nFeature columns: {df.columns.tolist()}")
print("\n" + "="*60)
print("\nAll Features Summary:")
df.describe(include='all').T

Dataset Overview:
Total tweets: 50000
Total features: 14

Feature columns: ['text', 'char_count', 'word_count', 'mention_count', 'hashtag_count', 'url_count', 'has_numbers', 'char_per_word', 'emoji_count', 'punctuation_count', 'exclamation_count', 'question_count', 'has_latin', 'is_all_caps']


All Features Summary:


| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| text | 50000.0 | 50000.0 | مذاكره اخر اختبار هي اصعب شي امر فيه حاليا | 1.0 | | | | | | | |
| char_count | 50000.0 | | | | 90.2637 | 72.612125 | 3.0 | 35.0 | 63.0 | 126.0 | 280.0 |
| word_count | 50000.0 | | | | 17.28058 | 13.366369 | 1.0 | 7.0 | 12.0 | 24.0 | 65.0 |
| mention_count | 50000.0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hashtag_count | 50000.0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| url_count | 50000.0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| has_numbers | 50000.0 | 1.0 | False | 50000.0 | | | | | | | |
| char_per_word | 50000.0 | | | | 5.098266 | 0.554885 | 2.5 | 4.75 | 5.1 | 5.451613 | 9.25 |
| emoji_count | 50000.0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| punctuation_count | 50000.0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [22]:
# Key insights
print("Key Insights from the Dataset:")
print("="*60)
print(f"Average tweet length: {df['char_count'].mean():.2f} characters")
print(f"Average word count: {df['word_count'].mean():.2f} words")
print(f"Average characters per word: {df['char_per_word'].mean():.2f}")
print(f"\nTweets with emojis: {(df['emoji_count'] > 0).sum()} ({(df['emoji_count'] > 0).sum()/len(df)*100:.2f}%)")
print(f"Tweets with exclamations: {(df['exclamation_count'] > 0).sum()} ({(df['exclamation_count'] > 0).sum()/len(df)*100:.2f}%)")
print(f"Tweets with questions: {(df['question_count'] > 0).sum()} ({(df['question_count'] > 0).sum()/len(df)*100:.2f}%)")
print(f"\nMost common tweet length: {df['char_count'].mode()[0]} characters")
print(f"Most common word count: {df['word_count'].mode()[0]} words")

Key Insights from the Dataset:
Average tweet length: 90.26 characters
Average word count: 17.28 words
Average characters per word: 5.10

Tweets with emojis: 0 (0.00%)
Tweets with exclamations: 0 (0.00%)
Tweets with questions: 0 (0.00%)

Most common tweet length: 28 characters
Most common word count: 6 words
