# Social Network Analysis I - Week 1
## Python Fundamentals & Data Handling

**Instructor:** Dr. Mudassir Shabbir  
**Fall 2025**

---

### Course Overview
- **Level:** 300-level undergraduate  
- **Credits:** 3  
- **Prerequisite:** Algorithms course with grade B- or higher  
- **Format:** Hands-on lectures, labs, homeworks, group activities, and final project

### Learning Objectives
By the end of this course, you will:
1. Understand core concepts in supervised and unsupervised ML
2. Gain proficiency in Python for ML and data analysis
3. Model, analyze, and visualize real-world networks
4. Apply ML techniques to network problems (link prediction, node classification)
5. Collaborate on applied projects and communicate results

### Today's Focus
Building the foundation: Python skills and data handling with pandas

## 1. Environment Setup and Library Imports

### Development Environment Options
- **Google Colab (Recommended):** Free, cloud-based, pre-installed libraries, easy sharing
- **Local Jupyter Notebooks:** Full control, works offline, better for large datasets

### Getting Started with Colab
1. Go to https://colab.research.google.com
2. Sign in with Google account
3. Create new notebook
4. Start coding!

Let's import the essential libraries we'll use throughout this course:

In [None]:
# Essential Python Libraries for Social Network Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
# Hide library warnings to keep output readable; note this also hides deprecation notices
warnings.filterwarnings('ignore')

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('default')

print("✅ All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"NetworkX version: {nx.__version__}")

## 2. Python Basics Review

Let's refresh some fundamental Python concepts that are essential for data analysis and network analysis.
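
As a warm-up that uses only built-in types, here is a minimal sketch that turns undirected `'A-B'` connection strings into an adjacency mapping and counts each user's neighbours. The helper name `build_adjacency` is an illustrative assumption, not part of any library.

```python
from collections import defaultdict

def build_adjacency(connections):
    """Map each user to the set of users they are connected to.

    Assumes every connection is an undirected 'A-B' string.
    """
    adjacency = defaultdict(set)
    for conn in connections:
        a, b = conn.split('-')
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)

connections = ['Alice-Bob', 'Bob-Charlie', 'Alice-Charlie', 'Diana-Eve', 'Charlie-Diana']
adjacency = build_adjacency(connections)

# Degree = number of distinct neighbours of each user
degrees = {user: len(neighbours) for user, neighbours in adjacency.items()}
for user, degree in sorted(degrees.items(), key=lambda item: -item[1]):
    print(f"{user}: {degree}")
```

Sets are used for the neighbour lists so that a repeated connection does not inflate a user's degree.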

In [None]:
# Basic data types
number = 42
text = "Social Network Analysis"
is_fun = True

print(f"Number: {number} (type: {type(number)})")
print(f"Text: {text} (type: {type(text)})")
print(f"Boolean: {is_fun} (type: {type(is_fun)})")

# Lists and dictionaries
students = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
grades = {"Alice": 95, "Bob": 87, "Charlie": 92, "Diana": 78, "Eve": 88}

print(f"\nStudents: {students}")
print(f"Grades: {grades}")

# List comprehensions (very useful for data processing!)
squares = [x**2 for x in range(1, 6)]
high_grades = [name for name, grade in grades.items() if grade > 90]

print(f"\nSquares: {squares}")
print(f"High achievers: {high_grades}")

In [None]:
# Functions and control flow - essential for network analysis
def analyze_network_stats(connections):
    """Calculate basic network statistics from a list of connections."""
    if not connections:
        return "No connections found"
    
    total_connections = len(connections)
    unique_users = len(set([user for conn in connections for user in conn.split('-')]))
    
    # Calculate density (simplified metric)
    max_possible_connections = unique_users * (unique_users - 1) / 2
    density = total_connections / max_possible_connections if max_possible_connections > 0 else 0
    
    return {
        'total_connections': total_connections,
        'unique_users': unique_users,
        'network_density': round(density, 3)
    }

# Example usage with social network data
connections = ['Alice-Bob', 'Bob-Charlie', 'Alice-Charlie', 'Diana-Eve', 'Charlie-Diana']
stats = analyze_network_stats(connections)

print("Network Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

# Demonstrate control flow
print(f"\nNetwork Analysis:")
if stats['network_density'] > 0.5:
    print("  🔗 Dense network - high connectivity")
elif stats['network_density'] > 0.3:
    print("  🔗 Medium density network")
else:
    print("  🔗 Sparse network - low connectivity")

## 3. Creating and Exploring DataFrames

**Why Pandas for Network Analysis?**
- **Data Import:** Read CSV, JSON, Excel files easily
- **Data Cleaning:** Handle missing values, duplicates, inconsistencies  
- **Data Transformation:** Filter, group, aggregate, pivot
- **Network Representation:** Edge lists, adjacency matrices
- **Integration:** Works seamlessly with NetworkX, scikit-learn

**Key Pandas Objects:**
- **Series:** 1D labeled array (like a column)
- **DataFrame:** 2D labeled data structure (like a spreadsheet)
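
To make the relationship between the two objects concrete, the following sketch builds a DataFrame from two Series that share an index. The names `followers`, `posts`, and `profile` and the values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
followers = pd.Series([150, 89, 203], index=['alice', 'bob', 'charlie'], name='followers')
posts = pd.Series([45, 23, 67], index=['alice', 'bob', 'charlie'], name='posts')

# A DataFrame: several Series aligned on the same index
profile = pd.DataFrame({'followers': followers, 'posts': posts})

print(type(profile['followers']).__name__)  # selecting one column gives back a Series
print(profile.loc['bob', 'followers'])      # label-based lookup of a single cell
```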

In [None]:
# Creating DataFrames - Different approaches

# Method 1: From dictionary (most common for small datasets)
social_data = {
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', 'diana', 'eve'],
    'followers': [150, 89, 203, 45, 312],
    'following': [98, 156, 87, 178, 92],
    'posts': [45, 23, 67, 12, 89]
}

df = pd.DataFrame(social_data)
print("📊 Social Network User Data")
print("=" * 40)
print(df)

print(f"\n📏 DataFrame shape: {df.shape} (rows, columns)")
print(f"📋 Column names: {list(df.columns)}")
print(f"🔢 Data types:\n{df.dtypes}")

# Method 2: From lists of lists
user_data = [
    [6, 'frank', 78, 145, 34],
    [7, 'grace', 234, 67, 78],
    [8, 'henry', 156, 189, 56]
]

additional_df = pd.DataFrame(user_data, columns=['user_id', 'username', 'followers', 'following', 'posts'])
print(f"\n📊 Additional Users:")
print(additional_df)

In [None]:
# Essential DataFrame exploration methods

print("🔍 EXPLORING THE DATAFRAME")
print("=" * 50)

# 1. Basic info about the dataset
print("📋 DataFrame Info:")
print(df.info())

print("\n" + "=" * 50)

# 2. Statistical summary
print("📊 Statistical Summary:")
print(df.describe())

print("\n" + "=" * 50)

# 3. First and last few rows
print("👆 First 3 rows:")
print(df.head(3))

print("\n👇 Last 2 rows:")
print(df.tail(2))

print("\n" + "=" * 50)

# 4. Check for missing values
print("❓ Missing values check:")
missing_data = df.isnull().sum()
print(missing_data)

if missing_data.sum() == 0:
    print("✅ No missing values found!")

print("\n" + "=" * 50)

# 5. Quick insights
print("💡 Quick Insights:")
print(f"  📈 Average followers: {df['followers'].mean():.1f}")
print(f"  📊 Most active user: {df.loc[df['posts'].idxmax(), 'username']}")
print(f"  🏆 Most followed user: {df.loc[df['followers'].idxmax(), 'username']}")
print(f"  📱 Total posts across all users: {df['posts'].sum()}")
print(f"  👥 Average following: {df['following'].mean():.1f}")

## 4. Data Selection and Filtering

One of the most important skills in data analysis is being able to select and filter data efficiently. Let's practice different selection methods.
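
Before the full walkthrough, here is a small sketch of three filtering idioms that often trip people up: parenthesized boolean masks, `DataFrame.query`, and `isin` with negation. The `users` frame is a stand-in with made-up values.

```python
import pandas as pd

users = pd.DataFrame({
    'username': ['alice', 'bob', 'charlie', 'diana', 'eve'],
    'followers': [150, 89, 203, 45, 312],
    'posts': [45, 23, 67, 12, 89],
})

# Each comparison needs its own parentheses because & binds tighter than >
mask = (users['followers'] > 100) & (users['posts'] > 50)
by_mask = users[mask]

# The same filter written as a query string
by_query = users.query('followers > 100 and posts > 50')

# isin() keeps rows whose value is in a given collection; ~ negates the mask
selected = users[users['username'].isin(['alice', 'eve'])]
not_selected = users[~users['username'].isin(['alice', 'eve'])]

print(by_mask)
```

Python's `and`/`or` keywords do not work element-wise on Series, which is why masks use `&`, `|`, and `~` instead.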

In [None]:
# Data Selection and Filtering Techniques

print("🎯 DATA SELECTION TECHNIQUES")
print("=" * 50)

# 1. Select specific columns
print("📋 Select specific columns:")
user_metrics = df[['username', 'followers', 'posts']]
print(user_metrics)

print("\n" + "-" * 50)

# 2. Single column selection (returns Series)
print("👤 Usernames only (Series):")
usernames = df['username']
print(usernames)
print(f"Type: {type(usernames)}")

print("\n" + "-" * 50)

# 3. Filter rows based on conditions
print("🏆 Popular users (followers > 100):")
popular_users = df[df['followers'] > 100]
print(popular_users[['username', 'followers']])

print("\n📱 Active users (posts > 50):")
active_users = df[df['posts'] > 50]
print(active_users[['username', 'posts']])

print("\n" + "-" * 50)

# 4. Multiple conditions with & (and) and | (or)
print("⭐ Influential users (followers > 100 AND posts > 30):")
influential = df[(df['followers'] > 100) & (df['posts'] > 30)]
print(influential[['username', 'followers', 'posts']])

print("\n🚀 Super active OR popular (posts > 60 OR followers > 200):")
super_users = df[(df['posts'] > 60) | (df['followers'] > 200)]
print(super_users[['username', 'followers', 'posts']])

print("\n" + "-" * 50)

# 5. Using .loc and .iloc for specific selections
print("🎯 Using .loc (label-based selection):")
# idxmax() returns the index label of the row with the most followers; .loc looks that row up by label
top_user = df.loc[df['followers'].idxmax(), ['username', 'followers']]
print(f"Most followed user: {top_user['username']} with {top_user['followers']} followers")

print("\n🔢 Using .iloc (position-based selection):")
# Select first 3 rows and first 3 columns
subset = df.iloc[0:3, 0:3]
print(subset)

print("\n" + "-" * 50)

# 6. String operations on text columns
print("🔤 String filtering:")
users_with_a = df[df['username'].str.contains('a')]
print("Users with 'a' in username:")
print(users_with_a[['username']])

## 5. Handling Missing Data

Real-world data often has missing values. Let's learn how to identify and handle them appropriately.
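
Two choices matter when cleaning: which columns are truly required, and which statistic to impute with. The sketch below, on a made-up `sample` frame, drops rows only when a required column is missing and fills a numeric column with the median, which is less affected by extreme values than the mean. The flag column `followers_was_missing` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'username': ['alice', 'bob', None, 'diana'],
    'followers': [150, np.nan, 203, 5000],
})

# Drop a row only when a required column is missing
with_names = sample.dropna(subset=['username'])

# Compare the two candidate fill values; 5000 pulls the mean upward
median_followers = sample['followers'].median()
mean_followers = sample['followers'].mean()
print(f"median={median_followers}, mean={mean_followers:.1f}")

# Record which values were imputed, then assign the filled column back
imputed = sample.copy()
imputed['followers_was_missing'] = imputed['followers'].isna()
imputed['followers'] = imputed['followers'].fillna(median_followers)
print(imputed)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment behaviour that recent pandas versions warn about.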

In [None]:
# Create sample data with missing values to demonstrate handling techniques

# Simulate messy real-world data
messy_data = {
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'username': ['alice', 'bob', None, 'diana', 'eve', 'frank', 'grace', 'henry'],
    'followers': [150, np.nan, 203, 45, 312, np.nan, 234, 156],
    'following': [98, 156, 87, np.nan, 92, 145, 67, 189],
    'posts': [45, 23, 67, 12, np.nan, 34, 78, 56],
    'engagement_rate': [0.05, 0.03, np.nan, 0.08, 0.04, 0.06, np.nan, 0.07]
}

messy_df = pd.DataFrame(messy_data)

print("🚨 MESSY DATA WITH MISSING VALUES")
print("=" * 50)
print(messy_df)

print(f"\n📊 Missing data overview:")
missing_summary = messy_df.isnull().sum()
print(missing_summary)

print(f"\n📈 Missing data percentage:")
missing_percentage = (messy_df.isnull().sum() / len(messy_df) * 100).round(2)
for col, pct in missing_percentage.items():
    if pct > 0:
        print(f"  {col}: {pct}%")

print("\n" + "=" * 50)

# Different strategies for handling missing data

# Strategy 1: Drop rows with ANY missing values
print("🗑️ Strategy 1: Drop rows with ANY missing values")
clean_df_dropall = messy_df.dropna()
print(f"Original shape: {messy_df.shape}, After dropping: {clean_df_dropall.shape}")
print(clean_df_dropall)

print("\n" + "-" * 50)

# Strategy 2: Drop rows only if ALL values are missing
print("🗑️ Strategy 2: Drop rows only if ALL values are missing")
clean_df_dropall_missing = messy_df.dropna(how='all')
print(f"Shape after dropping all-missing rows: {clean_df_dropall_missing.shape}")

print("\n" + "-" * 50)

# Strategy 3: Fill missing values with mean (for numerical columns)
print("🔢 Strategy 3: Fill numerical missing values with mean")
filled_df = messy_df.copy()
numerical_cols = ['followers', 'following', 'posts', 'engagement_rate']

for col in numerical_cols:
    if col in filled_df.columns:
        mean_value = filled_df[col].mean()
        # Assign back instead of inplace=True, which may not update filled_df in recent pandas
        filled_df[col] = filled_df[col].fillna(mean_value)
        print(f"  Filled {col} missing values with mean: {mean_value:.2f}")

print("\nAfter filling numerical columns:")
print(filled_df)

print("\n" + "-" * 50)

# Strategy 4: Fill text missing values with a placeholder
print("📝 Strategy 4: Fill text missing values")
filled_df['username'] = filled_df['username'].fillna('unknown_user')
print("After filling username:")
print(filled_df[['user_id', 'username']])

print("\n" + "-" * 50)

# Strategy 5: Forward fill (useful for time series)
print("⏭️ Strategy 5: Forward fill example")
time_series_data = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=6, freq='D'),
    'user_count': [100, np.nan, np.nan, 120, np.nan, 140]
})
print("Original time series:")
print(time_series_data)

# .ffill() replaces the deprecated fillna(method='ffill')
time_series_data['user_count_ffill'] = time_series_data['user_count'].ffill()
print("\nAfter forward fill:")
print(time_series_data)

## 6. Data Type Conversion and Validation

Ensuring correct data types is crucial for analysis. Let's practice converting and validating data types.
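
Plain `astype` raises an error as soon as one value cannot be converted. When the data may contain junk entries, a more forgiving approach is to coerce failures to missing values and inspect them afterwards. This is a sketch on a made-up `raw` frame; the bad entries `'n/a'` and `'not a date'` are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    'user_id': ['123', '456', 'n/a', '101'],
    'event_date': ['2025-01-15', 'not a date', '2025-01-17', '2025-01-18'],
})

# errors='coerce' turns unparseable entries into NaN / NaT instead of raising
raw['user_id_num'] = pd.to_numeric(raw['user_id'], errors='coerce')
raw['event_date_parsed'] = pd.to_datetime(raw['event_date'], errors='coerce')

# Inspect what failed to convert before deciding how to handle it
bad_rows = raw[raw['user_id_num'].isna() | raw['event_date_parsed'].isna()]
print(bad_rows)

# The nullable 'Int64' dtype keeps integers even when some values are missing
raw['user_id_num'] = raw['user_id_num'].astype('Int64')
print(raw.dtypes)
```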

In [None]:
# Data Type Conversion and Validation Examples

# Create sample data with wrong data types (common in real datasets)
network_events = pd.DataFrame({
    'event_date': ['2025-01-15', '2025-01-16', '2025-01-17', '2025-01-18', '2025-01-19'],
    'user_id': ['123', '456', '789', '101', '202'],  # Should be integers
    'event_type': ['like', 'share', 'comment', 'follow', 'like'],
    'timestamp': ['09:30:00', '14:25:30', '18:45:15', '11:20:45', '16:10:30'],
    'value': ['1.5', '2.3', '0.8', '3.1', '1.9']  # Should be float
})

print("🔍 ORIGINAL DATA TYPES")
print("=" * 50)
print("Data:")
print(network_events)
print(f"\nData types:")
print(network_events.dtypes)

print("\n" + "=" * 50)

# Convert data types
print("🔄 CONVERTING DATA TYPES")
print("-" * 50)

# 1. Convert string dates to datetime
network_events['event_date'] = pd.to_datetime(network_events['event_date'])
print("✅ Converted event_date to datetime")

# 2. Convert string user_id to integer
network_events['user_id'] = network_events['user_id'].astype(int)
print("✅ Converted user_id to integer")

# 3. Convert string values to float
network_events['value'] = network_events['value'].astype(float)
print("✅ Converted value to float")

# 4. Convert event_type to category (saves memory for repeated values)
network_events['event_type'] = network_events['event_type'].astype('category')
print("✅ Converted event_type to category")

print(f"\nData types after conversion:")
print(network_events.dtypes)

print("\n" + "-" * 50)

# Validate the conversions
print("✅ VALIDATION RESULTS")
print("-" * 30)

# Check if dates are valid
print(f"Date range: {network_events['event_date'].min()} to {network_events['event_date'].max()}")

# Check if user_ids are positive integers
print(f"User ID range: {network_events['user_id'].min()} to {network_events['user_id'].max()}")

# Check value statistics
print(f"Value statistics: min={network_events['value'].min()}, max={network_events['value'].max()}, mean={network_events['value'].mean():.2f}")

# Check categories
print(f"Event types: {list(network_events['event_type'].cat.categories)}")

print("\n" + "=" * 50)

# Advanced: Create datetime from separate date and time columns
print("🕐 ADVANCED: Combining date and time")
print("-" * 40)

# Combine date and time into full datetime
network_events['full_datetime'] = pd.to_datetime(
    network_events['event_date'].dt.strftime('%Y-%m-%d') + ' ' + network_events['timestamp']
)

print("Combined datetime column:")
print(network_events[['event_date', 'timestamp', 'full_datetime']].head())

# Extract useful datetime components
network_events['hour'] = network_events['full_datetime'].dt.hour
network_events['day_of_week'] = network_events['full_datetime'].dt.day_name()

print(f"\nEvent distribution by hour:")
print(network_events['hour'].value_counts().sort_index())

print(f"\nFinal dataset:")
print(network_events)

## 7. Removing Duplicates and Outliers

Data quality is crucial for network analysis. Let's learn to identify and handle duplicates and outliers.
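
In an undirected network, `alice-bob` and `bob-alice` describe the same edge, yet a plain `drop_duplicates()` treats them as different rows. One possible approach, sketched here on a made-up `edges` frame, is to sort each pair into a canonical order first. The column names `user_a` and `user_b` are illustrative.

```python
import numpy as np
import pandas as pd

edges = pd.DataFrame({
    'from_user': ['alice', 'bob', 'charlie', 'alice'],
    'to_user': ['bob', 'alice', 'alice', 'charlie'],
})

# No two rows are identical, so a plain duplicate check finds nothing
print(f"Row-level duplicates: {edges.duplicated().sum()}")

# Sort the two endpoints within each row so direction is ignored
ordered = np.sort(edges[['from_user', 'to_user']].to_numpy(), axis=1)
undirected = pd.DataFrame(ordered, columns=['user_a', 'user_b'])

unique_undirected = undirected.drop_duplicates().reset_index(drop=True)
print(unique_undirected)
```

This only makes sense when direction carries no meaning; for follower relationships, where `alice` following `bob` differs from the reverse, keep the directed rows.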

In [None]:
# Handling Duplicates and Outliers

print("🔍 HANDLING DUPLICATES")
print("=" * 50)

# Create sample interaction data with duplicates
user_interactions = pd.DataFrame({
    'from_user': ['alice', 'bob', 'alice', 'charlie', 'alice', 'bob', 'diana', 'alice'],
    'to_user': ['bob', 'charlie', 'bob', 'alice', 'diana', 'charlie', 'eve', 'bob'],
    'interaction_type': ['like', 'comment', 'like', 'share', 'follow', 'comment', 'like', 'like'],
    'timestamp': pd.date_range('2025-01-01', periods=8, freq='h')  # lowercase 'h'; uppercase 'H' is deprecated in recent pandas
})

print("Original interaction data:")
print(user_interactions)
print(f"Shape: {user_interactions.shape}")

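# Check for duplicates
# Every row has a distinct timestamp, so a full-row comparison finds no duplicates.
# Comparing only the interaction columns reveals the repeated interactions.
interaction_cols = ['from_user', 'to_user', 'interaction_type']
print(f"\nFully identical rows: {user_interactions.duplicated().sum()}")
print(f"Repeated interactions (ignoring timestamp): {user_interactions.duplicated(subset=interaction_cols).sum()}")
print("Duplicate rows details:")
print(user_interactions[user_interactions.duplicated(subset=interaction_cols, keep=False)])

# Remove repeated interactions, keeping the first occurrence
unique_interactions = user_interactions.drop_duplicates(subset=interaction_cols)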
print(f"\nAfter removing duplicates:")
print(f"Shape: {unique_interactions.shape}")
print(unique_interactions)

# Remove duplicates based on specific columns only
unique_connections = user_interactions.drop_duplicates(subset=['from_user', 'to_user'], keep='first')
print(f"\nUnique connections (ignoring interaction type and time):")
print(f"Shape: {unique_connections.shape}")
print(unique_connections)

print("\n" + "=" * 50)

# HANDLING OUTLIERS
print("📊 HANDLING OUTLIERS")
print("-" * 50)

# Create sample data with outliers
user_stats = pd.DataFrame({
    'user_id': range(1, 21),
    'followers': [150, 89, 203, 45, 312, 78, 156, 234, 67, 189,
                  145, 298, 56, 187, 123, 5000, 245, 167, 89, 278],  # 5000 is an outlier
    'posts': [45, 23, 67, 12, 89, 34, 56, 78, 29, 67,
              43, 87, 23, 65, 45, 2000, 56, 34, 28, 71]  # 2000 is an outlier
})

print("User statistics with potential outliers:")
print(user_stats.describe())

# Method 1: IQR (Interquartile Range) method for outlier detection
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Detect outliers in followers
follower_outliers, lower_followers, upper_followers = detect_outliers_iqr(user_stats, 'followers')
print(f"\n🔍 Followers outlier detection:")
print(f"Valid range: {lower_followers:.1f} to {upper_followers:.1f}")
print("Outliers:")
print(follower_outliers[['user_id', 'followers']])

# Detect outliers in posts
post_outliers, lower_posts, upper_posts = detect_outliers_iqr(user_stats, 'posts')
print(f"\n📱 Posts outlier detection:")
print(f"Valid range: {lower_posts:.1f} to {upper_posts:.1f}")
print("Outliers:")
print(post_outliers[['user_id', 'posts']])

# Remove outliers
clean_user_stats = user_stats[
    (user_stats['followers'] >= lower_followers) & 
    (user_stats['followers'] <= upper_followers) &
    (user_stats['posts'] >= lower_posts) & 
    (user_stats['posts'] <= upper_posts)
]

print(f"\n✅ After removing outliers:")
print(f"Original shape: {user_stats.shape}")
print(f"Clean shape: {clean_user_stats.shape}")
print("\nClean data statistics:")
print(clean_user_stats.describe())

# Method 2: Z-score method (alternative approach)
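# Import zscore by name so the earlier `stats` dictionary is not overwritten
from scipy.stats import zscore

print(f"\n📊 Alternative: Z-score method")
z_scores_followers = np.abs(zscore(user_stats['followers']))
z_threshold = 2.5  # Cutoffs between 2 and 3 are typical; adjust for your data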

outliers_z = user_stats[z_scores_followers > z_threshold]
print(f"Outliers using Z-score (threshold={z_threshold}):")
print(outliers_z[['user_id', 'followers']])

# Summarize the data distribution (min, max, median)
print(f"\n📈 Data distribution summary:")
print(f"Followers - Min: {user_stats['followers'].min()}, Max: {user_stats['followers'].max()}, Median: {user_stats['followers'].median()}")
print(f"Posts - Min: {user_stats['posts'].min()}, Max: {user_stats['posts'].max()}, Median: {user_stats['posts'].median()}")

## 8. Data Grouping and Aggregation

Grouping and aggregation are powerful techniques for summarizing data and finding patterns in networks.
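
Passing a dictionary of lists to `agg` produces two-level column names, which can be awkward to work with. Named aggregation is an alternative that yields flat, self-describing columns. The sketch below uses a made-up `users` frame; the output names `n_users`, `avg_followers`, and `total_posts` are arbitrary.

```python
import pandas as pd

users = pd.DataFrame({
    'account_type': ['personal', 'business', 'personal', 'influencer', 'business'],
    'followers': [150, 89, 203, 45, 312],
    'posts': [45, 23, 67, 12, 89],
})

# Named aggregation: output_column=(input_column, function)
summary = users.groupby('account_type').agg(
    n_users=('followers', 'count'),
    avg_followers=('followers', 'mean'),
    total_posts=('posts', 'sum'),
)
print(summary)
```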

In [None]:
# Data Grouping and Aggregation Examples

# Create extended dataset with more diverse information
extended_user_data = {
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'username': ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank', 'grace', 'henry', 'iris', 'jack'],
    'followers': [150, 89, 203, 45, 312, 78, 234, 156, 67, 189],
    'following': [98, 156, 87, 178, 92, 145, 67, 189, 123, 145],
    'posts': [45, 23, 67, 12, 89, 34, 78, 56, 29, 67],
    'account_type': ['personal', 'business', 'personal', 'influencer', 'business', 
                     'personal', 'influencer', 'business', 'personal', 'influencer'],
    'location': ['USA', 'Canada', 'USA', 'UK', 'Germany', 'USA', 'UK', 'Canada', 'France', 'UK'],
    'verified': [False, True, False, True, True, False, True, True, False, True]
}

extended_df = pd.DataFrame(extended_user_data)

print("📊 GROUPING AND AGGREGATION")
print("=" * 50)
print("Extended dataset:")
print(extended_df)

print("\n" + "=" * 50)

# Basic grouping by account type
print("👥 GROUP BY ACCOUNT TYPE")
print("-" * 30)

account_groups = extended_df.groupby('account_type')
print("Basic statistics by account type:")
account_stats = account_groups[['followers', 'posts', 'following']].mean().round(2)
print(account_stats)

print("\n" + "-" * 30)

# Multiple aggregation functions
print("📈 MULTIPLE AGGREGATION FUNCTIONS")
print("-" * 40)

detailed_stats = extended_df.groupby('account_type').agg({
    'followers': ['mean', 'max', 'min', 'count'],
    'posts': ['sum', 'mean'],
    'following': 'mean'
}).round(2)

print("Detailed statistics by account type:")
print(detailed_stats)

print("\n" + "-" * 30)

# Group by multiple columns
print("🌍 GROUP BY MULTIPLE COLUMNS")
print("-" * 35)

location_account_stats = extended_df.groupby(['location', 'account_type']).agg({
    'followers': 'mean',
    'posts': 'sum',
    'verified': 'count'  # Count of users
}).round(2)

print("Statistics by location and account type:")
print(location_account_stats)

print("\n" + "-" * 30)

# Custom aggregation functions
print("🔧 CUSTOM AGGREGATION")
print("-" * 25)

def engagement_ratio(group):
    return (group['followers'] / group['following']).mean()

def activity_score(group):
    return (group['posts'] * 0.3 + group['followers'] * 0.001).mean()

custom_stats = extended_df.groupby('account_type').apply(lambda x: pd.Series({
    'avg_engagement_ratio': engagement_ratio(x),
    'activity_score': activity_score(x),
    'verification_rate': x['verified'].mean()
})).round(3)

print("Custom metrics by account type:")
print(custom_stats)

print("\n" + "=" * 50)

# CREATING DERIVED METRICS
print("⚡ DERIVED METRICS")
print("-" * 20)

# Add calculated columns
extended_df['engagement_ratio'] = extended_df['followers'] / extended_df['following']
extended_df['posts_per_follower'] = extended_df['posts'] / extended_df['followers'] * 1000  # per 1000 followers
extended_df['influence_score'] = (
    extended_df['followers'] * 0.4 + 
    extended_df['posts'] * 10 + 
    extended_df['verified'].astype(int) * 50
)

print("Dataset with derived metrics:")
print(extended_df[['username', 'account_type', 'engagement_ratio', 'posts_per_follower', 'influence_score']].round(2))

print("\n" + "-" * 30)

# Analyze derived metrics by groups
print("📊 Analysis of derived metrics:")
derived_analysis = extended_df.groupby('account_type')[['engagement_ratio', 'posts_per_follower', 'influence_score']].agg(['mean', 'std']).round(3)
print(derived_analysis)

print("\n" + "-" * 30)

# Find top performers in each category
print("🏆 TOP PERFORMERS BY CATEGORY")
print("-" * 35)

for account_type in extended_df['account_type'].unique():
    top_user = extended_df[extended_df['account_type'] == account_type].nlargest(1, 'influence_score')
    print(f"{account_type.title()}: {top_user['username'].iloc[0]} (Score: {top_user['influence_score'].iloc[0]:.1f})")

print("\n" + "-" * 30)

# Verification analysis
print("✅ VERIFICATION ANALYSIS")
print("-" * 25)

verification_stats = extended_df.groupby('verified').agg({
    'followers': ['mean', 'median'],
    'posts': 'mean',
    'influence_score': 'mean'
}).round(2)

print("Verified vs Non-verified users:")
print(verification_stats)

## 9. Creating Pivot Tables for Network Analysis

Pivot tables are excellent for creating adjacency matrices and analyzing network relationships.
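
One caveat: a pivot table only has a row for each user who appears as a source and a column for each user who appears as a target, so the result is not necessarily square. A minimal sketch of one way to fix this, on a made-up `edges` frame, is to reindex both axes over the full set of nodes.

```python
import pandas as pd

edges = pd.DataFrame({
    'from_user': ['alice', 'bob', 'diana'],
    'to_user': ['bob', 'alice', 'alice'],
    'weight': [1, 2, 1],
})

matrix = edges.pivot_table(index='from_user', columns='to_user',
                           values='weight', aggfunc='sum', fill_value=0)
# diana never appears as a target, so the pivot has 3 rows but only 2 columns
print(matrix.shape)

# Reindex both axes over every node to obtain a square adjacency matrix
nodes = sorted(set(edges['from_user']) | set(edges['to_user']))
square = matrix.reindex(index=nodes, columns=nodes, fill_value=0)
print(square)
```

A square matrix with identical row and column order is what most matrix-based network computations expect.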

In [None]:
# Pivot Tables for Network Analysis

print("🔄 PIVOT TABLES FOR NETWORK ANALYSIS")
print("=" * 50)

# Create sample interaction data
interactions = pd.DataFrame({
    'from_user': ['alice', 'alice', 'bob', 'bob', 'charlie', 'charlie', 'diana', 'diana', 'eve', 'frank'],
    'to_user': ['bob', 'charlie', 'alice', 'charlie', 'alice', 'bob', 'eve', 'frank', 'alice', 'bob'],
    'interaction_type': ['like', 'comment', 'share', 'like', 'comment', 'like', 'follow', 'like', 'share', 'comment'],
    'timestamp': pd.date_range('2025-01-01', periods=10, freq='6h'),  # lowercase 'h'; uppercase 'H' is deprecated in recent pandas
    'weight': [1, 2, 1, 1, 3, 1, 1, 1, 2, 1]  # interaction strength
})

print("Sample interaction data:")
print(interactions)

print("\n" + "=" * 50)

# Create basic adjacency matrix using pivot table
print("🔗 BASIC ADJACENCY MATRIX")
print("-" * 30)

# Count of interactions between users
adjacency_matrix = interactions.pivot_table(
    index='from_user',
    columns='to_user',
    values='interaction_type',
    aggfunc='count',
    fill_value=0
)

print("Adjacency matrix (interaction counts):")
print(adjacency_matrix)

print("\n" + "-" * 30)

# Weighted adjacency matrix
print("⚖️ WEIGHTED ADJACENCY MATRIX")
print("-" * 35)

weighted_matrix = interactions.pivot_table(
    index='from_user',
    columns='to_user',
    values='weight',
    aggfunc='sum',
    fill_value=0
)

print("Weighted adjacency matrix (sum of weights):")
print(weighted_matrix)

print("\n" + "-" * 30)

# Interaction type breakdown
print("📊 INTERACTION TYPE BREAKDOWN")
print("-" * 35)

interaction_breakdown = interactions.pivot_table(
    index='interaction_type',
    columns='from_user',
    values='weight',
    aggfunc='count',
    fill_value=0
)

print("Interactions by type and user:")
print(interaction_breakdown)

print("\n" + "=" * 50)

# More complex network analysis
print("🎯 ADVANCED NETWORK METRICS")
print("-" * 32)

# Calculate in-degree and out-degree
out_degree = interactions.groupby('from_user')['to_user'].count()
in_degree = interactions.groupby('to_user')['from_user'].count()

# Combine into a network metrics dataframe
all_users = set(interactions['from_user'].unique()) | set(interactions['to_user'].unique())
network_metrics = pd.DataFrame(index=sorted(all_users))
network_metrics['out_degree'] = out_degree
network_metrics['in_degree'] = in_degree
network_metrics = network_metrics.fillna(0).astype(int)
network_metrics['total_degree'] = network_metrics['out_degree'] + network_metrics['in_degree']

print("Network metrics by user:")
print(network_metrics)

print("\n" + "-" * 30)

# Time-based analysis
print("⏰ TIME-BASED INTERACTION ANALYSIS")
print("-" * 40)

# Add hour and day information
interactions['hour'] = interactions['timestamp'].dt.hour
interactions['day'] = interactions['timestamp'].dt.day

# Pivot by time
hourly_activity = interactions.pivot_table(
    index='hour',
    columns='interaction_type',
    values='weight',
    aggfunc='count',
    fill_value=0
)

print("Activity by hour and interaction type:")
print(hourly_activity)

print("\n" + "-" * 30)

# Cross-tabulation (alternative to pivot table)
print("📋 CROSS-TABULATION EXAMPLE")
print("-" * 32)

# Create cross-tab of user interactions
crosstab = pd.crosstab(
    interactions['from_user'], 
    interactions['interaction_type'], 
    margins=True
)

print("Cross-tabulation of users and interaction types:")
print(crosstab)

print("\n" + "=" * 50)

# Convert to NetworkX format (preview for later weeks)
print("🕸️ PREVIEW: NETWORKX INTEGRATION")
print("-" * 38)

# Create edge list format
edge_list = interactions[['from_user', 'to_user', 'weight']].copy()
edge_list.columns = ['source', 'target', 'weight']

print("Edge list format (ready for NetworkX):")
print(edge_list)

# Basic network statistics
unique_nodes = len(set(edge_list['source']) | set(edge_list['target']))
total_edges = len(edge_list)
max_possible_edges = unique_nodes * (unique_nodes - 1)
density = total_edges / max_possible_edges if max_possible_edges > 0 else 0

print(f"\nBasic network statistics:")
print(f"  Nodes: {unique_nodes}")
print(f"  Edges: {total_edges}")
print(f"  Density: {density:.3f}")
print(f"  Average total degree (in + out): {(total_edges * 2) / unique_nodes:.2f}")

# Most active users (counts outgoing interactions only)
print(f"\nMost active users (by outgoing interactions):")
user_activity = interactions.groupby('from_user').size().sort_values(ascending=False)
print(user_activity.head())

## 10. Data Visualization Basics

Visualization helps us understand patterns in network data. Let's create basic plots for network analysis.

In [None]:
# Data Visualization for Network Analysis

# Set up matplotlib for better plots
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("📊 DATA VISUALIZATION FOR NETWORKS")
print("=" * 50)

# Use our extended dataset for visualizations
viz_data = extended_df.copy()

# 1. HISTOGRAMS - Distribution of followers
print("📈 Creating visualizations...")

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Social Network Data Visualizations', fontsize=16, fontweight='bold')

# Histogram of followers
axes[0, 0].hist(viz_data['followers'], bins=8, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Distribution of Followers')
axes[0, 0].set_xlabel('Number of Followers')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

# Bar chart by account type
account_counts = viz_data['account_type'].value_counts()
axes[0, 1].bar(account_counts.index, account_counts.values, 
               color=['lightcoral', 'lightblue', 'lightgreen'])
axes[0, 1].set_title('Users by Account Type')
axes[0, 1].set_xlabel('Account Type')
axes[0, 1].set_ylabel('Number of Users')
axes[0, 1].tick_params(axis='x', rotation=45)

# Scatter plot: Followers vs Posts
colors = {'personal': 'blue', 'business': 'red', 'influencer': 'green'}
for account_type in viz_data['account_type'].unique():
    data_subset = viz_data[viz_data['account_type'] == account_type]
    axes[1, 0].scatter(data_subset['followers'], data_subset['posts'], 
                       c=colors[account_type], label=account_type, alpha=0.7, s=50)

axes[1, 0].set_title('Followers vs Posts by Account Type')
axes[1, 0].set_xlabel('Number of Followers')
axes[1, 0].set_ylabel('Number of Posts')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Box plot: Followers by location
location_data = [viz_data[viz_data['location'] == loc]['followers'].values 
                 for loc in viz_data['location'].unique()]
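# Set tick labels separately; the `labels` keyword of boxplot is deprecated in recent matplotlib
axes[1, 1].boxplot(location_data)
axes[1, 1].set_xticklabels(viz_data['location'].unique())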
axes[1, 1].set_title('Followers Distribution by Location')
axes[1, 1].set_xlabel('Location')
axes[1, 1].set_ylabel('Number of Followers')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("✅ Basic plots created!")

print("\n" + "=" * 50)

# 2. SEABORN PLOTS - More advanced visualizations
print("🎨 ADVANCED VISUALIZATIONS WITH SEABORN")
print("-" * 45)

# Create figure with seaborn plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Advanced Network Data Analysis', fontsize=16, fontweight='bold')

# Heatmap of correlation matrix
numeric_cols = ['followers', 'following', 'posts', 'engagement_ratio']
correlation_matrix = viz_data[numeric_cols].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, ax=axes[0, 0])
axes[0, 0].set_title('Correlation Matrix')

# Violin plot
sns.violinplot(data=viz_data, x='account_type', y='followers', ax=axes[0, 1])
axes[0, 1].set_title('Followers Distribution by Account Type')
axes[0, 1].tick_params(axis='x', rotation=45)

# Count plot with hue
sns.countplot(data=viz_data, x='location', hue='verified', ax=axes[1, 0])
axes[1, 0].set_title('Verified vs Non-verified by Location')
axes[1, 0].tick_params(axis='x', rotation=45)

# Scatter plot: engagement ratio vs followers, one series per account type
for account_type in viz_data['account_type'].unique():
    data_subset = viz_data[viz_data['account_type'] == account_type]
    axes[1, 1].scatter(data_subset['followers'], data_subset['engagement_ratio'], 
                       label=account_type, alpha=0.7)
axes[1, 1].set_title('Engagement Ratio vs Followers')
axes[1, 1].set_xlabel('Followers')
axes[1, 1].set_ylabel('Engagement Ratio')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print("✅ Advanced plots created!")

print("\n" + "=" * 50)

# 3. NETWORK-SPECIFIC VISUALIZATIONS
print("🕸️ NETWORK-SPECIFIC VISUALIZATIONS")
print("-" * 38)

# Create adjacency matrix heatmap
print("Creating adjacency matrix heatmap...")

# Use interaction data from previous section
interaction_matrix = pd.crosstab(interactions['from_user'], interactions['to_user'])

plt.figure(figsize=(10, 8))
sns.heatmap(interaction_matrix, annot=True, cmap='Blues', 
            square=True, cbar_kws={'label': 'Number of Interactions'})
plt.title('User Interaction Matrix', fontsize=14, fontweight='bold')
plt.xlabel('To User')
plt.ylabel('From User')
plt.tight_layout()
plt.show()

print("\n" + "-" * 30)

# Degree distribution
print("Creating degree distribution plot...")

plt.figure(figsize=(12, 5))

# Plot degree distribution
plt.subplot(1, 2, 1)
degree_dist = network_metrics['total_degree'].value_counts().sort_index()
plt.bar(degree_dist.index, degree_dist.values, alpha=0.7, color='orange')
plt.title('Degree Distribution')
plt.xlabel('Degree')
plt.ylabel('Number of Nodes')
plt.grid(True, alpha=0.3)

# Plot in-degree vs out-degree
plt.subplot(1, 2, 2)
plt.scatter(network_metrics['in_degree'], network_metrics['out_degree'], 
           alpha=0.7, s=100, color='purple')
for i, user in enumerate(network_metrics.index):
    plt.annotate(user, (network_metrics['in_degree'].iloc[i], 
                        network_metrics['out_degree'].iloc[i]),
                xytext=(5, 5), textcoords='offset points', fontsize=8)
plt.title('In-degree vs Out-degree')
plt.xlabel('In-degree')
plt.ylabel('Out-degree')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Network visualizations created!")

print("\n" + "=" * 50)

# Summary statistics for visualization
print("📊 VISUALIZATION INSIGHTS")
print("-" * 25)

print(f"Data insights from visualizations:")
print(f"  📈 Follower range: {viz_data['followers'].min()} - {viz_data['followers'].max()}")
print(f"  📊 Most common account type: {viz_data['account_type'].mode().iloc[0]}")
print(f"  🌍 Most represented location: {viz_data['location'].mode().iloc[0]}")
print(f"  ✅ Verification rate: {viz_data['verified'].mean():.1%}")
print(f"  🔗 Average engagement ratio: {viz_data['engagement_ratio'].mean():.2f}")

# Correlation insights
strongest_corr = correlation_matrix.abs().unstack().sort_values(ascending=False)
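# Remove self-correlations by comparing index labels (more robust than filtering on value < 1.0)
pair_index = strongest_corr.index
strongest_corr = strongest_corr[pair_index.get_level_values(0) != pair_index.get_level_values(1)]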
print(f"  🔗 Strongest correlation: {strongest_corr.index[0]} ({strongest_corr.iloc[0]:.3f})")

print(f"\n💡 Network insights:")
print(f"  👥 Most connected user: {network_metrics['total_degree'].idxmax()}")
print(f"  📤 Highest out-degree: {network_metrics['out_degree'].max()}")
print(f"  📥 Highest in-degree: {network_metrics['in_degree'].max()}")
print(f"  🎯 Average degree: {network_metrics['total_degree'].mean():.1f}")
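
One caveat about the adjacency heatmap above: `pd.crosstab` only creates rows for users who sent at least one interaction and columns for users who received at least one, so the result is not guaranteed to be square. A minimal sketch on a made-up three-user edge list (the names `toy_interactions`, `raw`, and `all_users` are illustrative and not part of the course data) shows one way to reindex it into a full adjacency matrix and read the degrees off it:

```python
import pandas as pd

# Hypothetical toy edge list: carol receives interactions but never sends any
toy_interactions = pd.DataFrame({
    'from_user': ['alice', 'alice', 'bob'],
    'to_user':   ['bob',   'carol', 'carol'],
})

# crosstab alone gives a 2x2 table (rows: alice, bob; columns: bob, carol)
raw = pd.crosstab(toy_interactions['from_user'], toy_interactions['to_user'])

# Reindex rows and columns over every user so the matrix is square
all_users = sorted(set(toy_interactions['from_user']) | set(toy_interactions['to_user']))
adjacency = raw.reindex(index=all_users, columns=all_users, fill_value=0)

out_degree = adjacency.sum(axis=1)  # row sums: interactions sent
in_degree = adjacency.sum(axis=0)   # column sums: interactions received

print(adjacency)
print(pd.DataFrame({'in_degree': in_degree, 'out_degree': out_degree}))
```

With a square matrix, row `i` and column `i` always refer to the same user, which keeps the heatmap labels aligned.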

## 11. Real-world Dataset Exercise

Let's apply everything we've learned to a simulated dataset with realistic, real-world data problems, combining all the techniques from this session.

In [None]:
# Comprehensive Real-world Dataset Exercise

print("🌍 REAL-WORLD DATASET EXERCISE")
print("=" * 50)
print("Simulating a Twitter-like social network dataset with realistic challenges")

# Create a more complex, realistic dataset
np.random.seed(42)  # For reproducible results

# Generate realistic social media data
n_users = 50
user_ids = range(1, n_users + 1)

# Generate user data with realistic patterns
user_data = []
for i in user_ids:
    # Account types with realistic distributions
    account_type = np.random.choice(['personal', 'business', 'influencer'], 
                                   p=[0.7, 0.2, 0.1])
    
    # Followers based on account type
    if account_type == 'influencer':
        followers = np.random.randint(1000, 10000)
    elif account_type == 'business':
        followers = np.random.randint(100, 2000)
    else:  # personal
        followers = np.random.randint(10, 500)
    
    # Following count (influencers follow fewer people proportionally)
    if account_type == 'influencer':
        following = np.random.randint(50, min(500, followers // 5))
    else:
        # The upper bound must exceed the lower bound of 20, otherwise randint raises ValueError
        following = np.random.randint(20, max(21, min(800, followers)))
    
    # Posts (more variable)
    posts = np.random.randint(5, 200)
    
    # Location distribution
    location = np.random.choice(['USA', 'UK', 'Canada', 'Germany', 'France', 'Australia'], 
                               p=[0.4, 0.15, 0.1, 0.1, 0.1, 0.15])
    
    # Verification (more likely for businesses and influencers)
    if account_type == 'influencer':
        verified = np.random.choice([True, False], p=[0.8, 0.2])
    elif account_type == 'business':
        verified = np.random.choice([True, False], p=[0.4, 0.6])
    else:
        verified = np.random.choice([True, False], p=[0.05, 0.95])
    
    # Join date (some variation)
    join_date = pd.Timestamp('2020-01-01') + pd.Timedelta(days=np.random.randint(0, 1500))
    
    user_data.append({
        'user_id': i,
        'username': f'user_{i:03d}',
        'followers': followers,
        'following': following,
        'posts': posts,
        'account_type': account_type,
        'location': location,
        'verified': verified,
        'join_date': join_date
    })

# Create DataFrame
social_network_df = pd.DataFrame(user_data)

print("📊 Generated realistic social network dataset:")
print(f"Shape: {social_network_df.shape}")
print(social_network_df.head(10))

print("\n" + "=" * 50)

# STEP 1: DATA EXPLORATION
print("🔍 STEP 1: COMPREHENSIVE DATA EXPLORATION")
print("-" * 45)

print("Basic information:")
print(social_network_df.info())

print(f"\nDataset summary:")
print(f"  📊 Total users: {len(social_network_df)}")
print(f"  📅 Date range: {social_network_df['join_date'].min().date()} to {social_network_df['join_date'].max().date()}")
print(f"  🌍 Locations: {social_network_df['location'].nunique()}")
print(f"  👥 Account types: {social_network_df['account_type'].nunique()}")

# Check for data quality issues
print(f"\n🔍 Data quality check:")
print(f"  Missing values: {social_network_df.isnull().sum().sum()}")
print(f"  Duplicate usernames: {social_network_df['username'].duplicated().sum()}")
print(f"  Suspicious ratios (following > followers * 10): {(social_network_df['following'] > social_network_df['followers'] * 10).sum()}")

print("\n" + "-" * 45)

# STEP 2: DATA CLEANING AND PREPROCESSING
print("🧹 STEP 2: DATA CLEANING AND PREPROCESSING")
print("-" * 45)

# Add some realistic missing values and errors
clean_df = social_network_df.copy()

# Introduce some missing values (realistic scenario)
missing_indices = np.random.choice(clean_df.index, size=5, replace=False)
clean_df.loc[missing_indices[:3], 'location'] = np.nan
clean_df.loc[missing_indices[3:], 'posts'] = np.nan

# Add a duplicate user (simulating a data entry error)
# Selecting with iloc[[0]] returns a one-row DataFrame, so column dtypes survive the concat
duplicate_user = clean_df.iloc[[0]].copy()
duplicate_user['user_id'] = len(clean_df) + 1  # New ID, same username
clean_df = pd.concat([clean_df, duplicate_user], ignore_index=True)

print("Data after introducing realistic issues:")
print(f"  Missing values: {clean_df.isnull().sum().sum()}")
print(f"  Duplicate usernames: {clean_df['username'].duplicated().sum()}")

# Clean the data
print(f"\nCleaning process:")

# Handle missing values
print("  📍 Filling missing locations with 'Unknown'")
# Assign the result back; inplace=True on a selected column is deprecated in pandas 2.x
clean_df['location'] = clean_df['location'].fillna('Unknown')

print("  📱 Filling missing posts with median by account type")
for account_type in clean_df['account_type'].unique():
    median_posts = clean_df[clean_df['account_type'] == account_type]['posts'].median()
    mask = (clean_df['account_type'] == account_type) & (clean_df['posts'].isna())
    clean_df.loc[mask, 'posts'] = median_posts

# Remove duplicates
print("  🗑️ Removing duplicate usernames")
clean_df = clean_df.drop_duplicates(subset=['username'], keep='first')

print(f"\nAfter cleaning:")
print(f"  Shape: {clean_df.shape}")
print(f"  Missing values: {clean_df.isnull().sum().sum()}")

print("\n" + "-" * 45)

# STEP 3: FEATURE ENGINEERING
print("⚙️ STEP 3: FEATURE ENGINEERING")
print("-" * 35)

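# Create derived features
# (engagement_ratio is the follower-to-following ratio, a rough proxy rather than measured engagement)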
clean_df['engagement_ratio'] = clean_df['followers'] / clean_df['following'].replace(0, 1)
clean_df['posts_per_follower'] = clean_df['posts'] / clean_df['followers'].replace(0, 1) * 1000
clean_df['account_age_days'] = (pd.Timestamp.now() - clean_df['join_date']).dt.days
clean_df['influence_score'] = (
    clean_df['followers'] * 0.4 + 
    clean_df['posts'] * 5 + 
    clean_df['verified'].astype(int) * 100 +
    clean_df['account_age_days'] * 0.1
)

# Categorize users by activity level
clean_df['activity_level'] = pd.cut(clean_df['posts'], 
                                   bins=[0, 25, 75, 200], 
                                   labels=['Low', 'Medium', 'High'])

print("✅ Created derived features:")
print("  - engagement_ratio")
print("  - posts_per_follower")
print("  - account_age_days")
print("  - influence_score")
print("  - activity_level")

print("\n" + "-" * 45)

# STEP 4: COMPREHENSIVE ANALYSIS
print("📊 STEP 4: COMPREHENSIVE ANALYSIS")
print("-" * 38)

# Analysis by account type
print("Analysis by account type:")
account_analysis = clean_df.groupby('account_type').agg({
    'followers': ['mean', 'median', 'max'],
    'posts': 'mean',
    'engagement_ratio': 'mean',
    'influence_score': 'mean',
    'verified': 'mean'
}).round(2)

print(account_analysis)

print(f"\n🏆 Top 5 most influential users:")
top_influencers = clean_df.nlargest(5, 'influence_score')[['username', 'account_type', 'followers', 'influence_score']]
for idx, user in top_influencers.iterrows():
    print(f"  {user['username']}: {user['influence_score']:.0f} ({user['account_type']}, {user['followers']} followers)")

print(f"\n🌍 Geographic distribution:")
geo_dist = clean_df['location'].value_counts()
for location, count in geo_dist.items():
    percentage = count / len(clean_df) * 100
    print(f"  {location}: {count} users ({percentage:.1f}%)")

print(f"\n📈 Activity level distribution:")
activity_dist = clean_df['activity_level'].value_counts()
for level, count in activity_dist.items():
    percentage = count / len(clean_df) * 100
    print(f"  {level}: {count} users ({percentage:.1f}%)")

print("\n" + "-" * 45)

# STEP 5: INSIGHTS AND RECOMMENDATIONS
print("💡 STEP 5: KEY INSIGHTS AND RECOMMENDATIONS")
print("-" * 50)

# Calculate key metrics
avg_engagement = clean_df['engagement_ratio'].mean()
verification_rate = clean_df['verified'].mean()
median_followers = clean_df['followers'].median()

print("Key insights:")
print(f"  📊 Average engagement ratio: {avg_engagement:.2f}")
print(f"  ✅ Overall verification rate: {verification_rate:.1%}")
print(f"  👥 Median followers: {median_followers:.0f}")

# Business insights
business_users = clean_df[clean_df['account_type'] == 'business']
influencer_users = clean_df[clean_df['account_type'] == 'influencer']

print(f"\n🏢 Business account insights:")
print(f"  Average followers: {business_users['followers'].mean():.0f}")
print(f"  Verification rate: {business_users['verified'].mean():.1%}")
print(f"  Most active location: {business_users['location'].mode().iloc[0]}")

print(f"\n⭐ Influencer account insights:")
print(f"  Average followers: {influencer_users['followers'].mean():.0f}")
print(f"  Verification rate: {influencer_users['verified'].mean():.1%}")
print(f"  Average engagement ratio: {influencer_users['engagement_ratio'].mean():.2f}")

print(f"\n📋 Recommendations for platform:")
print("  1. Focus verification efforts on influencers and businesses")
print("  2. Encourage content creation among low-activity users")
print("  3. Develop location-specific features for major markets")
print("  4. Monitor high-engagement-ratio accounts for potential bot activity")

print("\n" + "=" * 50)
print("🎯 EXERCISE COMPLETED!")
print("This comprehensive analysis demonstrates all the pandas skills")
print("we've learned in this session. You can apply these techniques")
print("to any real-world dataset in your homework and projects!")

# Final dataset summary
print(f"\nFinal dataset summary:")
print(f"  📊 Users analyzed: {len(clean_df)}")
print(f"  🔧 Total columns (original + derived): {len(clean_df.columns)}")
print(f"  📈 Insights generated: Multiple business-relevant findings")
print(f"  🎯 Ready for: Machine learning and network analysis (upcoming weeks)")
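
The loop in Step 2 that fills missing `posts` with the median of each account type can also be written without a loop. A minimal sketch on a hypothetical five-row frame (`demo` and `group_median` are illustrative names), using `groupby(...).transform('median')` to broadcast each group's median back onto its own rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each account type
demo = pd.DataFrame({
    'account_type': ['personal', 'personal', 'personal', 'business', 'business'],
    'posts':        [10.0,       np.nan,     30.0,       100.0,      np.nan],
})

# Median of each account type, aligned row-by-row with the original frame
group_median = demo.groupby('account_type')['posts'].transform('median')

# Fill only the missing entries; existing values are left untouched
demo['posts'] = demo['posts'].fillna(group_median)

print(demo)
```

Both versions give the same result; the `transform` form avoids building a boolean mask per group.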

## 🎯 Today's Summary and Next Steps

### What We Accomplished Today
✅ **Environment Setup:** Configured Python environment and imported essential libraries  
✅ **Python Review:** Refreshed fundamental programming concepts  
✅ **DataFrame Mastery:** Created, explored, and manipulated pandas DataFrames  
✅ **Data Cleaning:** Handled missing values, duplicates, and outliers  
✅ **Data Analysis:** Performed grouping, aggregation, and pivot table operations  
✅ **Visualization:** Created basic plots for network data  
✅ **Real-world Application:** Applied all techniques to a comprehensive dataset

### Lab Activity: Clean and Summarize Real Data
**Objective:** Apply today's techniques to a provided social network dataset

**Tasks:**
1. Load the provided Twitter interaction dataset
2. Explore data structure and identify quality issues
3. Clean the data (missing values, duplicates, outliers)
4. Create summary statistics and visualizations
5. Generate insights about network structure

**Deliverable:** Jupyter notebook with clean code, comments, and insights
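
As a starting point, the exploration and cleaning tasks might look like the sketch below. The lab dataset's file name and columns are not specified here, so the code builds a small stand-in edge list; the column names `source`, `target`, and `weight`, the default weight of 1.0, and the decision to drop self-loops are all assumptions to adapt to the real data.

```python
import numpy as np
import pandas as pd

# Stand-in for the lab data; in the lab you would load the provided file
# (for example with pd.read_csv) instead of building the frame by hand.
edges = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', None, 'a'],
    'target': ['b', 'b', 'c', 'c', 'a',  'c'],
    'weight': [1.0, 1.0, 2.0, 1.0, 3.0,  np.nan],
})

# Task 2: explore the structure and count quality issues
print(edges.dtypes)
print("Missing per column:", edges.isna().sum().to_dict())
print("Exact duplicate rows:", int(edges.duplicated().sum()))

# Task 3: clean
cleaned = (
    edges
    .dropna(subset=['source', 'target'])  # an edge needs both endpoints
    .drop_duplicates()                    # drop repeated rows
    .fillna({'weight': 1.0})              # assumed default weight
)
cleaned = cleaned[cleaned['source'] != cleaned['target']]  # assumed rule: drop self-loops

# Task 4: a first summary
print(cleaned)
print("Edges kept:", len(cleaned))
```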

### Homework 1 Assignment
**Due:** Next week before class  
**Task:** Explore and clean a dataset from UCI ML Repository or Kaggle

**Requirements:**
- Choose a dataset with network/social components
- Perform comprehensive data cleaning and exploration
- Compute at least 5 meaningful statistics
- Create 3-4 visualizations
- Write a 1-page summary of findings and data quality issues

**Submission:** Jupyter notebook + PDF summary via course portal
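
For a dataset with a network component, "meaningful statistics" could include quantities like the ones below. This is only an illustration on a hypothetical directed edge list (`edges`, `source`, and `target` are made-up names); choose statistics that suit your own dataset.

```python
import pandas as pd

# Hypothetical directed edge list with no duplicate edges or self-loops
edges = pd.DataFrame({
    'source': ['a', 'b', 'a', 'c', 'd'],
    'target': ['b', 'a', 'c', 'd', 'a'],
})

# 1-2. Number of nodes and edges
nodes = set(edges['source']) | set(edges['target'])
n_nodes = len(nodes)
n_edges = len(edges)

# 3. Density of a directed graph without self-loops: edges / (n * (n - 1))
density = n_edges / (n_nodes * (n_nodes - 1))

# 4. Average out-degree (equal to average in-degree): edges / nodes
avg_out_degree = n_edges / n_nodes

# 5. Node with the highest in-degree
in_degree = edges['target'].value_counts()
top_node, top_in_degree = in_degree.idxmax(), int(in_degree.max())

# 6. Reciprocity: share of edges whose reverse edge also exists
edge_set = set(zip(edges['source'], edges['target']))
reciprocity = sum((t, s) in edge_set for s, t in edge_set) / len(edge_set)

print(f"Nodes: {n_nodes}, edges: {n_edges}")
print(f"Density: {density:.3f}, average out-degree: {avg_out_degree:.2f}")
print(f"Highest in-degree: {top_node} ({top_in_degree})")
print(f"Reciprocity: {reciprocity:.2f}")
```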

### Looking Ahead: Week 2 - Supervised Learning
- Train/test splits and cross-validation
- Decision trees and k-nearest neighbors  
- Evaluation metrics: accuracy, precision, recall, F1-score
- Building classifiers for network problems
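
As a preview of the evaluation metrics listed above, the sketch below computes them by hand from a made-up set of binary labels and predictions (`y_true` and `y_pred` are illustrative values, not course data). scikit-learn, imported at the top of this notebook, offers equivalents such as `accuracy_score`; the point here is only to show the definitions.

```python
# Hypothetical ground-truth labels and classifier predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```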

### Resources for Success
- **Documentation:** pandas.pydata.org, numpy.org, matplotlib.org, seaborn.pydata.org
- **Getting Help:** Office hours, discussion forum, study groups
- **Best Practices:** Comment code, use meaningful names, test with small examples

### Questions? 
Feel free to ask during lab time or reach out during office hours!