# Ungraded Lab: Capstone Project Lab

## Overview
In this comprehensive capstone project, you'll analyze TrendWave Media's social media engagement data to derive actionable insights. You'll apply the full range of data science skills you've learned throughout the course while leveraging AI tools to enhance your workflow.

## Learning Outcomes
By completing this lab, you will:
- Execute a complete data science workflow from ingestion to reporting
- Apply advanced data cleaning and EDA techniques
- Create insightful visualizations
- Perform statistical analysis
- Document your process using Git

## Dataset Information
You'll be working with the <b>Capstone_Project_synthetic_social_media_data.csv</b> dataset, which contains social media engagement metrics including:
- Post metadata (ID, timestamp, content)
- User demographics
- Engagement metrics (likes, shares, comments)
- Sentiment scores
- Topic classifications

## Activities

### Activity 1: Data Ingestion and Cleaning
As a data scientist at TrendWave Media, your first task is to prepare the social media engagement data for analysis. The marketing team needs insights into user engagement patterns, but before any analysis can begin, you'll need to ensure the data is clean and properly formatted.

<b>Step 1:</b> Load and Examine the Dataset

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats

# Load the dataset
# YOUR CODE HERE

<b>Step 2:</b> Data Quality Assessment
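
As a starting point, the sketch below shows one way to run these checks. It uses a small made-up DataFrame rather than the lab dataset, and it assumes the column names `PostID`, `Likes`, and `UserGender` that appear in the reference solution; adjust them to match what your loaded data actually contains.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab dataset (values are invented for illustration).
toy = pd.DataFrame({
    "PostID": [1, 2, 2, 3],
    "Likes": [10.0, np.nan, np.nan, 7.0],
    "UserGender": ["F", "m", "m", None],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()

# Count rows whose PostID repeats an earlier row.
duplicate_post_ids = toy.duplicated(subset="PostID").sum()

# Inspect data types to spot columns that were read with the wrong type.
column_types = toy.dtypes

print(missing_per_column)
print("Duplicate PostIDs:", duplicate_post_ids)
print(column_types)
```

Running the same three checks again after cleaning is a quick way to confirm that the cleaning step did what you intended.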

In [None]:
# Check for missing values, duplicates, and data types
# YOUR CODE HERE

<b>Step 3:</b> Data Cleaning

In [None]:
# Handle missing values and standardize formats
# YOUR CODE HERE

### Activity 2: Exploratory Data Analysis
With clean data in hand, TrendWave's content strategy team needs to understand what drives user engagement across different content types and topics. They want insights that can directly inform content creation and posting strategies.

<b>Step 1:</b> Basic Statistical Analysis
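
One possible approach is sketched below on invented data. It assumes the `PostTopic`, `Likes`, and `Shares` column names used in the reference solution, and it pairs an overall summary with a per-topic summary so you can see whether topics differ before plotting anything.

```python
import pandas as pd

# Toy data for illustration only; swap in your cleaned DataFrame.
toy = pd.DataFrame({
    "PostTopic": ["tech", "tech", "sports", "sports"],
    "Likes": [10, 30, 5, 15],
    "Shares": [1, 3, 0, 2],
})

# Overall count, mean, standard deviation, min, quartiles, and max.
overall = toy[["Likes", "Shares"]].describe()

# Mean and median per topic; a large gap between the two hints at skew.
by_topic = toy.groupby("PostTopic")[["Likes", "Shares"]].agg(["mean", "median"])

print(overall)
print(by_topic)
```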

In [None]:
# Calculate descriptive statistics
# YOUR CODE HERE

<b>Step 2:</b> Engagement Metrics Analysis
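
Engagement counts are often right-skewed, with a few posts attracting most of the activity. The sketch below, on an invented `Likes` series, shows one way to quantify that using quartiles, skewness, and the common 1.5 × IQR outlier rule. The rule is a convention chosen here for illustration, not something the lab requires.

```python
import pandas as pd

# Invented values: mostly small counts plus one viral post.
likes = pd.Series([1, 2, 2, 3, 3, 4, 50], name="Likes")

# Quartiles describe the typical range without being pulled by extremes.
quantiles = likes.quantile([0.25, 0.5, 0.75])

# Positive skewness indicates a long right tail.
skewness = likes.skew()

# Flag values outside 1.5 * IQR of the quartiles.
q1, q3 = quantiles.loc[0.25], quantiles.loc[0.75]
iqr = q3 - q1
outliers = likes[(likes < q1 - 1.5 * iqr) | (likes > q3 + 1.5 * iqr)]

print(quantiles)
print("Skewness:", round(skewness, 2))
print("Outliers:", outliers.tolist())
```

The same steps can be repeated for `Shares` and `Comments` to compare how heavy-tailed each metric is.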

In [None]:
# Analyze likes, shares, and comments distributions
# YOUR CODE HERE

<b>Step 3:</b> Visualization Creation

In [None]:
# Create the following engagement trend visualizations:
# - A boxplot of the engagement distribution by topic
# - A time series chart of engagement
# - An interactive scatter plot of engagement relationships
# YOUR CODE HERE

### Activity 3: Statistical Analysis and Documentation
The executive team needs concrete evidence to support strategic decisions about content direction and resource allocation. The hypothesis being tested is whether there is an association between post topics and user demographics, which could influence content performance. Your analysis will help validate or challenge existing assumptions about content performance.
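
To make the hypothesis concrete: the null hypothesis is that post topic and user gender are independent, and a chi-squared test of independence measures how far the observed counts depart from what independence would predict. The sketch below runs the test on invented data. The significance level of 0.05 and the rule of thumb that expected counts should be at least 5 are common conventions assumed here; the lab does not specify them.

```python
import pandas as pd
from scipy import stats

# Invented data with a deliberately strong topic-gender association.
toy = pd.DataFrame({
    "PostTopic": ["tech"] * 40 + ["sports"] * 40,
    "UserGender": ["F"] * 30 + ["M"] * 10 + ["F"] * 10 + ["M"] * 30,
})

# Observed counts for every topic/gender combination.
contingency = pd.crosstab(toy["PostTopic"], toy["UserGender"])

# chi2_contingency also returns the counts expected under independence.
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

# Very small expected counts make the chi-squared approximation unreliable.
min_expected = expected.min()

print(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}, reject H0: {reject_null}")
print("Smallest expected count:", min_expected)
```

A small p-value is evidence of an association, not of its strength or cause, so report the contingency table alongside the test result.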

<b>Step 1:</b> Hypothesis Testing

In [None]:
# Perform Chi-Squared test on the relationship between PostTopic and UserGender 
# YOUR CODE HERE

<b>Step 2:</b> GitHub Integration and Documentation

In [None]:
# Create meaningful commits and documentation
# YOUR CODE HERE

## Success Checklist
- Dataset successfully loaded and cleaned
- Comprehensive EDA completed with visualizations
- Statistical analysis performed and documented
- Code and findings properly version controlled
- AI tools effectively leveraged for workflow enhancement

## Common Issues & Solutions
- Problem: Memory issues with a large dataset
    - Solution: Read the file in chunks or convert columns to more compact data types
- Problem: Visualizations do not render
    - Solution: Check your Jupyter Notebook display settings
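
The two memory techniques above can be sketched as follows. The example reads from an in-memory string so that it runs anywhere; with the lab file you would pass the CSV path to `pd.read_csv` instead. The column names are assumptions taken from the reference solution.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file (invented rows).
csv_text = "PostID,PostTopic,Likes\n1,tech,10\n2,sports,5\n3,tech,7\n4,sports,3\n"

# Chunking: stream the file a few rows at a time and aggregate as you go,
# so the full dataset never has to sit in memory at once.
total_likes = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_likes += chunk["Likes"].sum()

# Data type optimization: store repeated labels as categories and
# downcast integers to the smallest type that holds the values.
df = pd.read_csv(io.StringIO(csv_text))
before = df.memory_usage(deep=True).sum()
df["PostTopic"] = df["PostTopic"].astype("category")
df["Likes"] = pd.to_numeric(df["Likes"], downcast="integer")
after = df.memory_usage(deep=True).sum()

print("Total likes:", total_likes)
print(df.dtypes)
print(f"Memory before: {before} bytes, after: {after} bytes")
```

On a frame this small the saving may be negligible; the benefit shows up when a column has many rows and few distinct values.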

## Summary
Congratulations on completing your capstone project! You've demonstrated mastery of core data science concepts and tools while analyzing synthetic social media data modeled on a real-world scenario.

### Key Points
- Data science workflow mastery
- Advanced analysis techniques
- Professional documentation practices
- AI tool integration

## Solution Code 
Stuck on your code or want to check your solution? Here's a complete reference implementation to guide you. It represents just one effective approach, so try solving each activity independently first, then use it to get past obstacles or compare techniques. Happy coding!

### Activity 1: Data Ingestion and Cleaning - Solution Code

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats

# Load the dataset
df = pd.read_csv('Capstone_Project_synthetic_social_media_data.csv')

def clean_social_media_data(df):
    """Clean and prepare social media engagement data.
    
    Args:
        df (pandas.DataFrame): Raw social media data
        
    Returns:
        pandas.DataFrame: Cleaned dataset
    """
    # Create a copy
    df_clean = df.copy()
    
    # Convert timestamp to datetime; format='mixed' infers the format of each
    # value individually and requires pandas 2.0 or later
    df_clean['PostDateTime'] = pd.to_datetime(df_clean['PostDateTime'], format='mixed')
    
    # Handle missing values (assumes a missing count means no engagement
    # and a missing sentiment score means neutral)
    df_clean['Likes'] = df_clean['Likes'].fillna(0)
    df_clean['Shares'] = df_clean['Shares'].fillna(0)
    df_clean['Comments'] = df_clean['Comments'].fillna(0)
    df_clean['SentimentScore'] = df_clean['SentimentScore'].fillna(0)
    
    # Remove duplicate posts
    df_clean = df_clean.drop_duplicates(subset='PostID')
    
    # Remove rows where UserGender is missing
    df_clean = df_clean.dropna(subset=['UserGender'])
    
    # Standardize categories
    df_clean['PostTopic'] = df_clean['PostTopic'].str.lower()
    df_clean['UserGender'] = df_clean['UserGender'].str.upper()
    
    return df_clean

# Clean the data
df_clean = clean_social_media_data(df)

### Activity 2: Exploratory Data Analysis - Solution Code

In [None]:
def analyze_engagement_metrics(df):
    """Analyze social media engagement metrics.
    
    Args:
        df (pandas.DataFrame): Cleaned social media data
        
    Returns:
        dict: Engagement analysis results
    """
    results = {}
    
    # Calculate basic engagement statistics
    results['total_engagement'] = {
        'Likes': df['Likes'].sum(),
        'Shares': df['Shares'].sum(),
        'Comments': df['Comments'].sum()
    }
    
    # Analyze engagement by topic
    results['topic_engagement'] = df.groupby('PostTopic').agg({
        'Likes': 'mean',
        'Shares': 'mean',
        'Comments': 'mean'
    }).round(2)
    
    # Time-based analysis (group by posting hour without adding a column
    # to the caller's DataFrame)
    hour = df['PostDateTime'].dt.hour.rename('hour')
    results['hourly_engagement'] = df.groupby(hour)['Likes'].mean()
    
    return results

def create_engagement_visualizations(df):
    """Create visualizations for engagement metrics."""
    
    # Topic engagement distribution
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='PostTopic', y='Likes', data=df)
    plt.title('Likes by Topic')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Time series of engagement
    daily_engagement = df.groupby(df['PostDateTime'].dt.date)['Likes'].mean()
    plt.figure(figsize=(12, 6))
    daily_engagement.plot()
    plt.title('Daily Average Likes')
    plt.tight_layout()
    plt.show()
    
    # Interactive scatter plot
    fig = px.scatter(df, 
                     x='Likes', 
                     y='Shares',
                     color='PostTopic',
                     hover_data=['Comments'],
                     title='Engagement Relationships')
    fig.show()

# Run analysis
engagement_results = analyze_engagement_metrics(df_clean)
create_engagement_visualizations(df_clean)

### Activity 3: Statistical Analysis and Documentation - Solution Code

In [None]:
def perform_statistical_tests(df):
    """Conduct statistical analysis on engagement patterns.
    
    Args:
        df (pandas.DataFrame): Cleaned social media data
        
    Returns:
        dict: Statistical test results
    """
    results = {}
    
    # Chi-squared test of independence between PostTopic and UserGender
    contingency_table = pd.crosstab(df['PostTopic'], df['UserGender'])
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    
    results['topic_gender_relationship'] = {
        'chi2_statistic': chi2,
        'p_value': p_value,
        'degrees_of_freedom': dof
    }
    
    # One-way ANOVA: compare mean Likes across topics
    likes_by_topic = [group['Likes'] for _, group in df.groupby('PostTopic')]
    f_stat, p_value = stats.f_oneway(*likes_by_topic)
    
    results['topic_engagement_difference'] = {
        'f_statistic': f_stat,
        'p_value': p_value
    }
    
    return results

# Perform statistical analysis
statistical_results = perform_statistical_tests(df_clean)

# Document findings
print("\nStatistical Analysis Results:")
print("-----------------------------")
print(f"Topic-Gender Relationship p-value: {statistical_results['topic_gender_relationship']['p_value']:.4f}")
print(f"Topic Engagement Difference p-value: {statistical_results['topic_engagement_difference']['p_value']:.4f}")

# Create meaningful commits and documentation
# !git init -b main  # if not already done

# !git checkout -b feature/analysis 

# !git status 

# !git add . 

# !git commit -m "docs: add analysis and documentation" 

# If you are working in a brand new Git repo created locally, add the remote first:
# !git remote add origin https://github.com/your-username/your-repo.git

# Push the branch created above
# !git push -u origin feature/analysis


# If you run these Git commands and encounter an error like:
# "fatal: detected dubious ownership in repository"
# then Git is preventing access to the directory for security reasons.

# To fix this, mark the directory as 'safe' by running:
# !git config --global --add safe.directory /home/jovyan/work