# 📊 NumPy & Pandas Cheat Sheet for Social Media Data Analysis

**Dataset Columns:** `platform`, `post_type`, `post_length`, `views`, `likes`, `comments`, `shares`, `engagement_rate`

---
## 🔧 Setup & Sample Data

In [None]:
import numpy as np
import pandas as pd

# Sample Social Media Data
data = {
    'platform': ['Instagram', 'Twitter', 'Facebook', 'Instagram', 'TikTok', 'Twitter', 'Facebook', 'TikTok'],
    'post_type': ['image', 'text', 'video', 'reel', 'video', 'image', 'text', 'video'],
    'post_length': [150, 280, 500, 30, 60, 200, 1000, 45],
    'views': [5000, 1200, 8000, 25000, 50000, 800, 3000, 75000],
    'likes': [450, 50, 600, 3000, 8000, 30, 150, 12000],
    'comments': [45, 10, 80, 200, 500, 5, 20, 800],
    'shares': [20, 5, 100, 300, 1000, 2, 15, 1500],
    'engagement_rate': [10.3, 5.4, 9.75, 14.0, 19.0, 4.6, 6.2, 19.1]
}

df = pd.DataFrame(data)
print(df)

---
# 🔢 NUMPY CHEAT SHEET

## 1. Converting DataFrame Columns to NumPy Arrays

In [None]:
# Convert columns to NumPy arrays
views_arr = df['views'].to_numpy()
likes_arr = df['likes'].to_numpy()
comments_arr = df['comments'].to_numpy()
shares_arr = df['shares'].to_numpy()
engagement_arr = df['engagement_rate'].to_numpy()

print("Views Array:", views_arr)
print("Type:", type(views_arr))

## 2. Basic Statistical Operations

In [None]:
# Mean, Median, Standard Deviation
print("=== VIEWS STATISTICS ===")
print(f"Mean Views: {np.mean(views_arr):,.2f}")
print(f"Median Views: {np.median(views_arr):,.2f}")
print(f"Std Dev Views: {np.std(views_arr):,.2f}")
print(f"Variance: {np.var(views_arr):,.2f}")

print("\n=== MIN/MAX ===")
print(f"Min Views: {np.min(views_arr):,}")
print(f"Max Views: {np.max(views_arr):,}")
print(f"Range: {np.ptp(views_arr):,}")  # Peak to Peak (max - min)
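
When comparing these numbers with pandas, note that `np.std` and `np.var` default to the population formulas (`ddof=0`), while `Series.std()` and `Series.var()` default to the sample formulas (`ddof=1`). A minimal sketch of the difference, reusing the sample `views` values from the setup cell:

```python
import numpy as np
import pandas as pd

views = np.array([5000, 1200, 8000, 25000, 50000, 800, 3000, 75000])

pop_std = np.std(views)              # population std, ddof=0 (NumPy default)
sample_std = np.std(views, ddof=1)   # sample std, ddof=1
pandas_std = pd.Series(views).std()  # pandas default is ddof=1

print(f"NumPy default (ddof=0): {pop_std:,.2f}")
print(f"NumPy ddof=1:           {sample_std:,.2f}")
print(f"pandas Series.std():    {pandas_std:,.2f}")
```

With only 8 posts the gap between the two is noticeable, so it is worth choosing `ddof` deliberately.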

## 3. Percentiles & Quantiles

In [None]:
# Percentiles
print("25th Percentile (Q1):", np.percentile(views_arr, 25))
print("50th Percentile (Q2/Median):", np.percentile(views_arr, 50))
print("75th Percentile (Q3):", np.percentile(views_arr, 75))
print("90th Percentile:", np.percentile(views_arr, 90))

# Multiple percentiles at once
percentiles = np.percentile(engagement_arr, [10, 25, 50, 75, 90])
print("\nEngagement Rate Percentiles [10,25,50,75,90]:", percentiles)

## 4. Correlation Analysis

In [None]:
# Correlation between views and likes
correlation = np.corrcoef(views_arr, likes_arr)
print("Correlation Matrix (Views vs Likes):")
print(correlation)
print(f"\nCorrelation Coefficient: {correlation[0,1]:.4f}")

# Interpretation:
# 1.0 = perfect positive correlation
# 0 = no correlation
# -1.0 = perfect negative correlation
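
To make the interpretation above concrete, here is a small sketch that computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and checks it against `np.corrcoef`. The variable names `dv`, `dl`, and `r_manual` are illustrative only.

```python
import numpy as np

views = np.array([5000, 1200, 8000, 25000, 50000, 800, 3000, 75000], dtype=float)
likes = np.array([450, 50, 600, 3000, 8000, 30, 150, 12000], dtype=float)

# Deviations of each value from its column mean
dv = views - views.mean()
dl = likes - likes.mean()

# Pearson r = sum(dv * dl) / sqrt(sum(dv^2) * sum(dl^2))
r_manual = np.sum(dv * dl) / np.sqrt(np.sum(dv**2) * np.sum(dl**2))
r_numpy = np.corrcoef(views, likes)[0, 1]

print(f"Manual r: {r_manual:.4f}")
print(f"NumPy r:  {r_numpy:.4f}")
```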

## 5. Array Operations & Broadcasting

In [None]:
# Calculate total engagement (likes + comments + shares)
total_engagement = likes_arr + comments_arr + shares_arr
print("Total Engagement per post:", total_engagement)

# Calculate engagement rate manually
manual_engagement_rate = (total_engagement / views_arr) * 100
print("\nManual Engagement Rate %:", np.round(manual_engagement_rate, 2))

# Normalize views (0-1 scale)
normalized_views = (views_arr - np.min(views_arr)) / (np.max(views_arr) - np.min(views_arr))
print("\nNormalized Views (0-1):", np.round(normalized_views, 3))
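
The manual engagement rate divides by `views_arr`, which works here because every post has views. If a post could have zero views, one possible guard is `np.divide` with the `where` and `out` arguments. The three-post arrays below are hypothetical and not part of the sample dataset.

```python
import numpy as np

# Hypothetical data: the second post has zero views
views = np.array([5000, 0, 8000])
interactions = np.array([515, 0, 780])

rate = np.divide(
    interactions * 100.0,
    views,
    out=np.zeros(views.shape, dtype=float),  # value kept where the mask is False
    where=views != 0,                        # only divide where views is non-zero
)
print(rate)
```

Posts with zero views keep the placeholder rate of 0.0 instead of producing `inf` or `nan` with a runtime warning.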

## 6. Conditional Filtering with NumPy

In [None]:
# Filter: Posts with views > 10000
high_views = views_arr[views_arr > 10000]
print("High Views (>10K):", high_views)

# Filter: Posts with engagement > 10%
high_engagement = engagement_arr[engagement_arr > 10]
print("High Engagement (>10%):", high_engagement)

# Count posts meeting criteria
viral_count = np.sum(views_arr > 20000)
print(f"\nNumber of viral posts (>20K views): {viral_count}")

# Get indices of top performers
top_indices = np.where(engagement_arr > 15)[0]
print("Indices of high engagement posts:", top_indices)

## 7. Sorting & Ranking

In [None]:
# Sort views
sorted_views = np.sort(views_arr)
print("Sorted Views (ascending):", sorted_views)

# Sort descending
sorted_desc = np.sort(views_arr)[::-1]
print("Sorted Views (descending):", sorted_desc)

# Get indices that would sort the array
sort_indices = np.argsort(views_arr)[::-1]  # Descending order indices
print("\nTop 3 post indices by views:", sort_indices[:3])

# Index of max/min
print(f"Index of max views: {np.argmax(views_arr)}")
print(f"Index of min views: {np.argmin(views_arr)}")

## 8. Unique Values & Counts

In [None]:
# Convert categorical to array
platforms = df['platform'].to_numpy()

# Get unique values
unique_platforms = np.unique(platforms)
print("Unique Platforms:", unique_platforms)

# Get unique values with counts
unique, counts = np.unique(platforms, return_counts=True)
print("\nPlatform Distribution:")
for platform, count in zip(unique, counts):
    print(f"  {platform}: {count} posts")

## 9. Mathematical Operations

In [None]:
# Sum and cumulative sum
print("Total Views:", np.sum(views_arr))
print("Total Likes:", np.sum(likes_arr))
print("\nCumulative Views:", np.cumsum(views_arr))

# Log transformation (useful for skewed data)
log_views = np.log10(views_arr)
print("\nLog10 of Views:", np.round(log_views, 2))

# Square root transformation
sqrt_views = np.sqrt(views_arr)
print("Square Root of Views:", np.round(sqrt_views, 2))
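
`np.log10` returns `-inf` for a zero count, which can break later statistics. A common workaround, sketched here with hypothetical view counts, is to shift by one before taking the log, or to use `np.log1p` for the natural-log version.

```python
import numpy as np

# Hypothetical counts, including a post with zero views
views = np.array([0, 9, 99, 999])

with np.errstate(divide="ignore"):   # silence the divide-by-zero warning
    plain = np.log10(views)          # first element is -inf

shifted = np.log10(views + 1)        # log10(x + 1): zero maps to 0.0
natural = np.log1p(views)            # ln(1 + x), same idea with the natural log

print("log10(x):    ", plain)
print("log10(x + 1):", shifted)
print("log1p(x):    ", np.round(natural, 2))
```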

## 10. Creating Bins/Categories

In [None]:
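# Digitize: assign each view count to a bin (bins[i-1] <= x < bins[i])
bins = [0, 1000, 5000, 20000, 100000]
labels = np.array(['Low', 'Medium', 'High', 'Viral'])

bin_indices = np.digitize(views_arr, bins)  # 1-based for values inside the edges
print("Bin Indices:", bin_indices)
print("Bin Labels:", labels[bin_indices - 1])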

# Histogram
hist, bin_edges = np.histogram(views_arr, bins=4)
print("\nHistogram counts:", hist)
print("Bin edges:", bin_edges)

---
# 🐼 PANDAS CHEAT SHEET

## 1. Basic DataFrame Operations

In [None]:
# Basic Info
print("=== DATAFRAME INFO ===")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nData Types:\n{df.dtypes}")

# Preview data
print("\n=== HEAD & TAIL ===")
display(df.head(3))
display(df.tail(2))

## 2. Descriptive Statistics

In [None]:
# Full statistical summary
print("=== DESCRIBE (Numeric Columns) ===")
display(df.describe())

# Include categorical columns
print("\n=== DESCRIBE (All Columns) ===")
display(df.describe(include='all'))

## 3. Value Counts & Unique Values

In [None]:
# Value counts for categorical
print("=== PLATFORM DISTRIBUTION ===")
print(df['platform'].value_counts())

print("\n=== POST TYPE DISTRIBUTION ===")
print(df['post_type'].value_counts())

# Normalized (percentage)
print("\n=== PLATFORM % ===")
print(df['platform'].value_counts(normalize=True).mul(100).round(2))

# Number of unique values
print(f"\nUnique platforms: {df['platform'].nunique()}")
print(f"Unique post types: {df['post_type'].nunique()}")

## 4. Filtering Data

In [None]:
# Filter by single condition
high_views_df = df[df['views'] > 10000]
print("Posts with >10K views:")
display(high_views_df)

# Filter by multiple conditions (AND)
viral_engaged = df[(df['views'] > 10000) & (df['engagement_rate'] > 10)]
print("\nViral AND High Engagement:")
display(viral_engaged)

# Filter by multiple conditions (OR)
instagram_or_tiktok = df[(df['platform'] == 'Instagram') | (df['platform'] == 'TikTok')]
print("\nInstagram OR TikTok posts:")
display(instagram_or_tiktok)

# Filter using isin()
selected_platforms = df[df['platform'].isin(['Instagram', 'TikTok'])]
print("\nUsing isin() - Same result:")
display(selected_platforms)

## 5. GroupBy Operations

In [None]:
# Group by platform - basic aggregation
print("=== AVERAGE VIEWS BY PLATFORM ===")
print(df.groupby('platform')['views'].mean().round(0))

# Multiple aggregations
print("\n=== MULTIPLE STATS BY PLATFORM ===")
platform_stats = df.groupby('platform').agg({
    'views': ['mean', 'sum', 'count'],
    'likes': ['mean', 'sum'],
    'engagement_rate': 'mean'
}).round(2)
display(platform_stats)

# Group by multiple columns
print("\n=== GROUP BY PLATFORM & POST TYPE ===")
multi_group = df.groupby(['platform', 'post_type'])['engagement_rate'].mean().round(2)
print(multi_group)

## 6. Sorting & Ranking

In [None]:
# Sort by single column
sorted_by_views = df.sort_values('views', ascending=False)
print("=== SORTED BY VIEWS (DESC) ===")
display(sorted_by_views)

# Sort by multiple columns
sorted_multi = df.sort_values(['platform', 'views'], ascending=[True, False])
print("\n=== SORTED BY PLATFORM, THEN VIEWS ===")
display(sorted_multi)

# Top N
print("\n=== TOP 3 BY ENGAGEMENT ===")
display(df.nlargest(3, 'engagement_rate'))

# Bottom N
print("\n=== BOTTOM 3 BY VIEWS ===")
display(df.nsmallest(3, 'views'))

# Add rank column (method='min' keeps tied ranks as whole numbers before the int cast)
df['views_rank'] = df['views'].rank(ascending=False, method='min').astype(int)
print("\n=== WITH RANK COLUMN ===")
display(df[['platform', 'post_type', 'views', 'views_rank']])
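
`rank()` averages tied ranks by default, so casting to `int` can silently truncate a value such as 1.5. The sample data has no ties, so the sketch below uses a hypothetical series with one tie to compare the `method` options.

```python
import pandas as pd

# Hypothetical values with a tie at 5000
views = pd.Series([5000, 1200, 5000, 800])

avg_rank = views.rank(ascending=False)                                # ties share the average rank
min_rank = views.rank(ascending=False, method="min").astype(int)      # ties share the lowest rank
dense_rank = views.rank(ascending=False, method="dense").astype(int)  # no gaps after ties

print(pd.DataFrame({
    "views": views,
    "average": avg_rank,
    "min": min_rank,
    "dense": dense_rank,
}))
```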

## 7. Creating New Columns

In [None]:
# Simple calculation
df['total_interactions'] = df['likes'] + df['comments'] + df['shares']

# Calculated column
df['likes_per_view'] = (df['likes'] / df['views'] * 100).round(2)

# Conditional column using np.where
df['performance'] = np.where(df['engagement_rate'] > 10, 'High', 'Low')

# Multiple conditions using np.select
conditions = [
    df['views'] >= 50000,
    df['views'] >= 10000,
    df['views'] >= 5000
]
choices = ['Viral', 'Popular', 'Growing']
df['view_category'] = np.select(conditions, choices, default='Starting')

print("=== NEW COLUMNS ADDED ===")
display(df[['platform', 'views', 'total_interactions', 'likes_per_view', 'performance', 'view_category']])

## 8. Binning / Categorizing

In [None]:
# Cut: explicit bin edges (equal-width here); intervals are right-inclusive, e.g. (0, 5]
df['engagement_bin'] = pd.cut(df['engagement_rate'], 
                               bins=[0, 5, 10, 15, 20],
                               labels=['Poor', 'Average', 'Good', 'Excellent'])

# Qcut: Equal-frequency bins (quartiles)
df['views_quartile'] = pd.qcut(df['views'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print("=== BINNED DATA ===")
display(df[['platform', 'views', 'views_quartile', 'engagement_rate', 'engagement_bin']])
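
One detail of `pd.cut` that is easy to miss: the bins are open on the left and closed on the right by default, and values outside the edges become `NaN`. A short sketch with hypothetical engagement rates chosen to sit on the edges:

```python
import pandas as pd

# Hypothetical rates that land on or outside the bin edges
rates = pd.Series([0.0, 5.0, 5.1, 20.0, 25.0])
bins = [0, 5, 10, 15, 20]
labels = ['Poor', 'Average', 'Good', 'Excellent']

# Default: intervals are (0, 5], (5, 10], (10, 15], (15, 20]
default_cut = pd.cut(rates, bins=bins, labels=labels)

# include_lowest=True also closes the first interval on the left
inclusive_cut = pd.cut(rates, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "rate": rates,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

A rate of exactly 0 is unlabeled unless `include_lowest=True` is passed, and anything above the last edge stays `NaN` in both cases.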

## 9. Pivot Tables & Cross Tabulation

In [None]:
# Simple Pivot Table
print("=== PIVOT: AVERAGE ENGAGEMENT BY PLATFORM & POST TYPE ===")
pivot = df.pivot_table(
    values='engagement_rate',
    index='platform',
    columns='post_type',
    aggfunc='mean'
).round(2)
display(pivot)

# Cross tabulation (frequency table)
print("\n=== CROSSTAB: COUNT BY PLATFORM & PERFORMANCE ===")
crosstab = pd.crosstab(df['platform'], df['performance'])
display(crosstab)

## 10. Correlation Matrix

In [None]:
# Select numeric columns only
numeric_cols = df.select_dtypes(include=[np.number])

# Correlation matrix
print("=== CORRELATION MATRIX ===")
correlation_matrix = numeric_cols.corr().round(3)
display(correlation_matrix)

# Specific correlation
print(f"\nViews-Likes Correlation: {df['views'].corr(df['likes']):.4f}")
print(f"Views-Engagement Correlation: {df['views'].corr(df['engagement_rate']):.4f}")

## 11. Apply & Lambda Functions

In [None]:
# Apply to single column
df['views_in_k'] = df['views'].apply(lambda x: f"{x/1000:.1f}K")

# Apply to create categorical
def categorize_length(length):
    if length < 100:
        return 'Short'
    elif length < 500:
        return 'Medium'
    else:
        return 'Long'

df['content_length_category'] = df['post_length'].apply(categorize_length)

# Apply across row (axis=1)
df['summary'] = df.apply(
    lambda row: f"{row['platform']}-{row['post_type']}: {row['engagement_rate']}%",
    axis=1
)

print("=== APPLIED TRANSFORMATIONS ===")
display(df[['platform', 'views', 'views_in_k', 'post_length', 'content_length_category', 'summary']])

## 12. Aggregation Summary

In [None]:
# Comprehensive summary by platform
summary = df.groupby('platform').agg(
    total_posts=('platform', 'count'),
    total_views=('views', 'sum'),
    avg_views=('views', 'mean'),
    total_likes=('likes', 'sum'),
    avg_engagement=('engagement_rate', 'mean'),
    max_engagement=('engagement_rate', 'max'),
    total_interactions=('total_interactions', 'sum')
).round(2)

print("=== PLATFORM PERFORMANCE SUMMARY ===")
display(summary)

# Best performing platform
best_platform = summary['avg_engagement'].idxmax()
print(f"\n🏆 Best Platform by Avg Engagement: {best_platform}")

---
# 📋 QUICK REFERENCE

## NumPy Quick Reference
| Operation | Code |
|-----------|------|
| Mean | `np.mean(arr)` |
| Median | `np.median(arr)` |
| Std Dev | `np.std(arr)` |
| Min/Max | `np.min(arr)`, `np.max(arr)` |
| Percentile | `np.percentile(arr, 75)` |
| Correlation | `np.corrcoef(arr1, arr2)` |
| Sort | `np.sort(arr)` |
| Unique | `np.unique(arr)` |
| Filter | `arr[arr > value]` |
| Sum | `np.sum(arr)` |

## Pandas Quick Reference
| Operation | Code |
|-----------|------|
| Filter rows | `df[df['col'] > value]` |
| Group by | `df.groupby('col')['num_col'].mean()` |
| Value counts | `df['col'].value_counts()` |
| Sort | `df.sort_values('col')` |
| Top N | `df.nlargest(n, 'col')` |
| New column | `df['new'] = df['a'] + df['b']` |
| Pivot | `df.pivot_table(...)` |
| Apply | `df['col'].apply(func)` |
| Describe | `df.describe()` |
| Correlation | `df.corr(numeric_only=True)` |
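
With a mixed-type frame like this dataset, pandas 2.x no longer drops text columns silently in `corr()` or in a bare `groupby(...).mean()`, so the numeric columns have to be selected explicitly. A small sketch with a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric columns
df = pd.DataFrame({
    'platform': ['Instagram', 'Twitter', 'Instagram', 'Twitter'],
    'views': [5000, 1200, 25000, 800],
    'likes': [450, 50, 3000, 30],
})

# Restrict both operations to numeric columns
corr = df.corr(numeric_only=True)
means = df.groupby('platform').mean(numeric_only=True)

print(corr.round(3))
print(means)
```

Selecting columns first, as in `df.groupby('platform')['views'].mean()`, is an equivalent option.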

---
# ⚖️ PROS AND CONS

## NumPy
| Pros ✅ | Cons ❌ |
|---------|--------|
| Very fast for numerical computations | No built-in support for labeled data |
| Memory efficient | Harder to work with mixed data types |
| Great for mathematical operations | No built-in groupby functionality |
| Foundation for other libraries | Limited data manipulation features |
| Broadcasting is powerful | Steeper learning curve for indexing |

## Pandas
| Pros ✅ | Cons ❌ |
|---------|--------|
| Easy to work with tabular data | Adds overhead over raw NumPy for pure numeric array math |
| Powerful groupby & aggregation | Higher memory usage |
| Great for data cleaning | Can be confusing with index behavior |
| Handles missing data well | Multiple ways to do same thing |
| Built on NumPy (best of both) | Performance issues with very large data |

## When to Use Which?
- **NumPy**: Pure numerical operations, matrix math, image processing, when speed is critical
- **Pandas**: Working with CSV/Excel, data cleaning, groupby analysis, exploratory data analysis
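
To illustrate the trade-off, here is one way the same per-platform average could be computed in both libraries. The NumPy version is a sketch that combines `np.unique(..., return_inverse=True)` with `np.bincount`; it is not the only approach.

```python
import numpy as np
import pandas as pd

platforms = np.array(['Instagram', 'Twitter', 'Facebook', 'Instagram',
                      'TikTok', 'Twitter', 'Facebook', 'TikTok'])
views = np.array([5000, 1200, 8000, 25000, 50000, 800, 3000, 75000])

# NumPy: encode labels as integer codes, then sum and count per code
names, codes = np.unique(platforms, return_inverse=True)
numpy_means = np.bincount(codes, weights=views) / np.bincount(codes)

# pandas: a single labeled groupby call
frame = pd.DataFrame({'platform': platforms, 'views': views})
pandas_means = frame.groupby('platform')['views'].mean()

for name, mean in zip(names, numpy_means):
    print(f"{name}: {mean:,.0f}")
print(pandas_means)
```

Both give the same numbers. The NumPy route needs manual label bookkeeping, while pandas carries the labels through the result.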