# Session 1: Foundations of Data Analysis, Python Fundamentals, and Data Ethics

**Module:** Data Insights and Visualization  
**Level:** 7 | **Credits:** 10  
**Learning Outcomes Addressed:** LO1, LO4  
**Big Academy Saudi Arabia - Riyadh Campus** 🇸🇦

---

![Big Academy](https://img.shields.io/badge/Big%20Academy-Saudi%20Arabia-green?style=for-the-badge)
![Level](https://img.shields.io/badge/Level-7%20Master's-blue?style=for-the-badge)
![Credits](https://img.shields.io/badge/Credits-10-orange?style=for-the-badge)

<!-- CELL BREAK -->

## 📋 Session Overview

Welcome to the **Data Insights and Visualization** module! In this first session, we'll establish the foundational knowledge and skills needed for effective data analysis and visualization. By the end of this session, you'll understand the current data landscape, have a working Python environment, and appreciate the critical importance of data ethics.

-----------

### Learning Objectives
#### By the end of this session, you will be able to:

- **Explain the role of data analytics in modern business decision-making**
- **Set up and navigate a Python data analysis environment**
- **Identify different data types and quality dimensions**
- **Apply fundamental data ethics principles**
- **Perform basic data exploration using Python**
- **Load and inspect sample datasets to identify data quality issues**


-----------

## 🌍 Part 1: The Data Revolution in Business

### 1.1 Course Introduction

Welcome to **Data Insights and Visualization** - a transformative journey that will equip you with the skills to extract meaningful insights from complex datasets and communicate them effectively through compelling visualizations.

In today's digital economy, organizations generate and collect vast amounts of data every second. The ability to analyze this data and extract actionable insights has become a critical competitive advantage.

#### 🎯 This Module Combines:
- Statistical analysis and programming techniques
- Contemporary visualization tools (Python, Power BI)
- Business intelligence and decision support
- Data ethics and governance frameworks
- Real-world industry applications

-----------

### 1.2 Current State of Data Analytics

IDC forecast in 2018 that the global datasphere would grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. This explosion of data, often referred to as "Big Data," is characterized by:

- **Volume:** Massive amounts of data generated every second
- **Velocity:** Real-time data processing requirements
- **Variety:** Structured, semi-structured, and unstructured data formats
- **Veracity:** Ensuring data quality and reliability
- **Value:** Extracting actionable insights for business impact

**🏭 REAL-WORLD APPLICATIONS:**

- **Retail Analytics:** Customer segmentation, inventory optimization, price optimization
- **Healthcare Analytics:** Predictive diagnostics, treatment personalization, patient outcomes
- **Financial Services:** Risk assessment, fraud detection, algorithmic trading
- **Manufacturing:** Predictive maintenance, quality control, supply chain optimization
- **Government & Public Sector:** Smart cities, policy analysis, resource allocation

### 1.3 Career Opportunities in Data Visualization

The field offers diverse career paths:
- **Data Analyst:** Focus on descriptive analytics and reporting
- **Business Intelligence Analyst:** Strategic insights and dashboard creation
- **Data Scientist:** Advanced analytics and machine learning
- **Data Visualization Specialist:** Design and communication expert
- **Chief Data Officer:** Strategic data leadership

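### 1.4 Types of Analytics

- **Descriptive:** Summarizes what has happened (e.g., reports and dashboards)
- **Predictive:** Estimates what is likely to happen (e.g., forecasting, classification)
- **Prescriptive:** Recommends what action to take (e.g., optimization, decision rules)

---
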
## 📊 Part 2: Understanding Data Types and Structures

### 2.1 Structured vs Unstructured Data

#### 🔲 Structured Data
- **Definition:** Data organized in predefined format (rows and columns)
- **Examples:** Relational databases, CSV files, Excel spreadsheets, SQL tables
- **Characteristics:** Easy to analyze, query, and visualize; well-defined schema
- **Storage:** Often estimated at roughly 20% of organizational data

#### 🔳 Semi-Structured Data
- **Definition:** Data with some organizational properties but flexible schema
- **Examples:** JSON, XML, NoSQL databases, log files, email
- **Characteristics:** Self-describing, hierarchical, nested structures

#### 🔲 Unstructured Data
- **Definition:** Data without predefined structure or organization
- **Examples:** Text documents, images, videos, social media posts, audio files
- **Characteristics:** Requires preprocessing, complex analysis, rich insights
- **Storage:** Often estimated at roughly 80% of organizational data

In [1]:
print("📊 DATA TYPE EXAMPLES:\n")

# Structured Data Example
print("🔲 STRUCTURED DATA EXAMPLE:")
structured_example = """
Customer Database Table:
| CustomerID | Name        | Age | City   | PurchaseAmount |
|------------|-------------|-----|--------|----------------|
| 1001       | Ahmed Ali   | 25  | Riyadh | 450.00         |
| 1002       | Sara Hassan | 32  | Jeddah | 320.50         |
| 1003       | Omar Khalid | 28  | Dammam | 275.75         |
"""
print(structured_example)


📊 DATA TYPE EXAMPLES:

🔲 STRUCTURED DATA EXAMPLE:

Customer Database Table:
| CustomerID | Name        | Age | City   | PurchaseAmount |
|------------|-------------|-----|--------|----------------|
| 1001       | Ahmed Ali   | 25  | Riyadh | 450.00         |
| 1002       | Sara Hassan | 32  | Jeddah | 320.50         |
| 1003       | Omar Khalid | 28  | Dammam | 275.75         |



In [2]:
# Semi-Structured Data Example
print("🔳 SEMI-STRUCTURED DATA EXAMPLE (JSON):")
import json
semi_structured_example = {
    "customer": {
        "id": 1001,
        "name": "Ahmed Ali",
        "demographics": {
            "age": 25,
            "city": "Riyadh",
            "preferences": ["electronics", "books"]
        },
        "purchases": [
            {"date": "2024-01-15", "amount": 450.00, "category": "electronics"},
            {"date": "2024-01-20", "amount": 125.50, "category": "books"}
        ]
    }
}
print(json.dumps(semi_structured_example, indent=2))


🔳 SEMI-STRUCTURED DATA EXAMPLE (JSON):
{
  "customer": {
    "id": 1001,
    "name": "Ahmed Ali",
    "demographics": {
      "age": 25,
      "city": "Riyadh",
      "preferences": [
        "electronics",
        "books"
      ]
    },
    "purchases": [
      {
        "date": "2024-01-15",
        "amount": 450.0,
        "category": "electronics"
      },
      {
        "date": "2024-01-20",
        "amount": 125.5,
        "category": "books"
      }
    ]
  }
}


In [3]:
# Unstructured Data Example
print("\n🔲 UNSTRUCTURED DATA EXAMPLE:")
unstructured_example = """
Customer Review Text:
"I absolutely loved this product! The quality exceeded my expectations 
and the delivery was super fast. Will definitely recommend to friends. 
⭐⭐⭐⭐⭐ Five stars!"

Social Media Post:
"Just bought the new laptop from @TechStore Riyadh! Amazing customer 
service 👏 #TechLife #Riyadh #CustomerExperience"
"""
print(unstructured_example)



🔲 UNSTRUCTURED DATA EXAMPLE:

Customer Review Text:
"I absolutely loved this product! The quality exceeded my expectations 
and the delivery was super fast. Will definitely recommend to friends. 
⭐⭐⭐⭐⭐ Five stars!"

Social Media Post:
"Just bought the new laptop from @TechStore Riyadh! Amazing customer 
service 👏 #TechLife #Riyadh #CustomerExperience"



### 2.2 Data Quality Dimensions

High-quality data is essential for reliable analysis. The key dimensions include:

1. **Accuracy:** Data correctly represents the real-world entity
2. **Completeness:** All required data is present
3. **Consistency:** Data is uniform across systems and time
4. **Timeliness:** Data is up-to-date and available when needed
5. **Validity:** Data conforms to defined formats and business rules
6. **Uniqueness:** No unnecessary duplication of data
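
Several of these dimensions can be measured directly with pandas. The following is a minimal sketch: the `orders` table and its column names are hypothetical, and the validity rule (amounts must be positive) is an assumed business rule chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data with deliberate quality problems
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "city": ["Riyadh", "Jeddah", "Jeddah", None],
    "amount": [450.0, 320.5, 320.5, -10.0],
})

# Completeness: share of non-missing values per column
completeness = orders.notna().mean()

# Uniqueness: number of rows that exactly repeat an earlier row
duplicate_rows = int(orders.duplicated().sum())

# Validity: assumed business rule that amounts must be positive
invalid_amounts = int((orders["amount"] <= 0).sum())

print(completeness)
print(f"Duplicate rows: {duplicate_rows}")
print(f"Invalid amounts: {invalid_amounts}")
```

Accuracy, consistency, and timeliness usually cannot be checked from a single table alone; they need a trusted reference source or knowledge of when the data was captured.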


### 2.3 Common Data Formats

- **CSV (Comma-Separated Values):** Simple, widely supported
- **JSON (JavaScript Object Notation):** Web-friendly, hierarchical
- **XML (eXtensible Markup Language):** Self-describing, verbose
- **Parquet:** Columnar storage, optimized for analytics
- **Database formats:** SQL Server, MySQL, PostgreSQL, MongoDB
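
To make the difference between a flat and a hierarchical format concrete, the sketch below writes the same records as CSV and as JSON using only the standard library. The records are hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Hypothetical records, for illustration only
records = [
    {"customer_id": 1001, "city": "Riyadh", "amount": 450.0},
    {"customer_id": 1002, "city": "Jeddah", "amount": 320.5},
]

# CSV: flat rows under a single header line
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "city", "amount"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: each record carries its own keys and keeps numeric types
json_text = json.dumps(records, indent=2)

# Reading the CSV back yields strings, so types must be restored by the reader
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
print(json_text)
print(type(csv_rows[0]["amount"]).__name__)
```

One practical consequence: CSV carries no type information, which is why tools such as pandas infer column types when loading it.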

---

## 🛡️ Part 3: Data Ethics and Governance

### 3.1 Fundamental Ethical Principles

Data ethics is not just about compliance—it's about building trust and ensuring responsible use of data. The core principles include:

#### Privacy
- **Principle:** Individuals have the right to control their personal information
- **Application:** Anonymization, pseudonymization, data minimization
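
As one possible illustration of pseudonymization, the sketch below replaces an email address with a keyed hash using the standard library's `hmac` module. The key value and the `pseudonymize` helper are hypothetical; in practice the key would be stored securely and separately from the data. Pseudonymized data can still be linked back to a person by whoever holds the key, so it is generally still treated as personal data.

```python
import hashlib
import hmac

# Assumption: a real key would come from a secure store, not from source code
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(value: str) -> str:
    """Return a keyed SHA-256 digest that stands in for the original value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


token_a = pseudonymize("customer1@email.com")
token_b = pseudonymize("customer1@email.com")
token_c = pseudonymize("customer2@email.com")

print(token_a[:16])        # shortened for display
print(token_a == token_b)  # same input gives the same token, so joins still work
print(token_a == token_c)  # different inputs give different tokens
```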

#### Transparency
- **Principle:** Be open about data collection, use, and decision-making processes
- **Application:** Clear data collection notices, algorithm explainability

#### Fairness and Non-discrimination
- **Principle:** Avoid bias and ensure equitable treatment
- **Application:** Regular bias audits, diverse training data

#### Accountability
- **Principle:** Take responsibility for data practices and their consequences
- **Application:** Data governance frameworks, audit trails

**SAUDI DATA PROTECTION CONTEXT**

- Saudi Data and AI Authority (SDAIA) guidelines
- Personal Data Protection Law (PDPL) compliance
- Vision 2030 digital transformation objectives
- Cultural sensitivity in data collection and analysis
- Cross-border data transfer regulations

### 3.2 GDPR and Legal Compliance

The EU's General Data Protection Regulation (GDPR) is widely treated as a benchmark for data protection and has influenced regulations worldwide, including in Saudi Arabia.

#### Key Requirements
- **Lawful basis:** Processing must rest on one of six legal bases: consent, contract, legal obligation, vital interests, public task, or legitimate interests
- **Data subject rights:** Access, rectification, erasure, restriction, portability, objection
- **Privacy by design:** Data protection built into systems from the start
- **Data Protection Impact Assessments:** Required for high-risk processing
- **Breach notification:** Notify the supervisory authority within 72 hours of becoming aware of a breach
- **Data Protection Officer:** Required for certain organizations
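
The 72-hour breach notification window is simple date arithmetic. The sketch below computes the deadline from a detection time; the `notification_deadline` helper and the timestamp are hypothetical, and the clock is assumed to start when the organization becomes aware of the breach.

```python
from datetime import datetime, timedelta, timezone


def notification_deadline(detected_at: datetime) -> datetime:
    """Return the moment 72 hours after the breach was detected."""
    return detected_at + timedelta(hours=72)


# Hypothetical detection time, for illustration only
detected = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
deadline = notification_deadline(detected)

print(f"Detected: {detected.isoformat()}")
print(f"Deadline: {deadline.isoformat()}")
```

Using timezone-aware datetimes avoids ambiguity when the incident team and the supervisory authority are in different time zones.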


**📋 GDPR COMPLIANCE CHECKLIST**

- ✅ Legal basis for processing documented
- ✅ Data collection purpose clearly defined 
- ✅ Data retention period established
- ✅ Data subject rights procedures implemented
- ✅ Consent mechanisms in place
- ✅ Data breach response plan prepared
- ✅ Privacy policy updated and accessible
- ✅ Staff training on data protection completed

#### Practical Data Anonymization Example

In [32]:
import pandas as pd


def anonymize_customer_data(df):
    """
    Reduce the identifiability of customer data while preserving analytical value.

    Note: customer_id is kept, so records remain linkable; strictly speaking
    the result is pseudonymized rather than fully anonymous.
    """
    anonymized_df = df.copy()

    # Remove direct identifiers (errors='ignore' skips columns that are absent)
    anonymized_df = anonymized_df.drop(columns=['name', 'email'], errors='ignore')

    # Generalize age into ranges; bins are right-inclusive, e.g. (25, 35] -> '26-35'
    if 'age' in anonymized_df.columns:
        anonymized_df['age_group'] = pd.cut(
            anonymized_df['age'],
            bins=[0, 25, 35, 50, 65, 100],
            labels=['18-25', '26-35', '36-50', '51-65', '65+'],
        )
        anonymized_df = anonymized_df.drop(columns=['age'])

    # Generalize location data; cities missing from the mapping become NaN
    if 'city' in anonymized_df.columns:
        city_mapping = {'Riyadh': 'Central', 'Jeddah': 'Western',
                        'Dammam': 'Eastern', 'Mecca': 'Western'}
        anonymized_df['region'] = anonymized_df['city'].map(city_mapping)
        anonymized_df = anonymized_df.drop(columns=['city'])

    return anonymized_df

In [33]:
# Demonstrate anonymization on sample data
# (customers_df is created in the file I/O cell, In [26], which must be run first)
print("Original customer data:")
print(f"Shape: {customers_df.shape}")
print(f"Columns: {list(customers_df.columns)}")
print(customers_df.head(3))

# Apply anonymization
anonymized_customers = anonymize_customer_data(customers_df)
print("\nAnonymized customer data:")
print(f"Shape: {anonymized_customers.shape}")
print(f"Columns: {list(anonymized_customers.columns)}")
print(anonymized_customers.head(3))

print("\n✅ Personal identifiers removed, analytical value preserved")

Original customer data:
Shape: (20, 6)
Columns: ['customer_id', 'name', 'email', 'age', 'city', 'registration_date']
   customer_id        name                email  age    city registration_date
0         1001  Customer 1  customer1@email.com   56   Mecca        2023-01-01
1         1002  Customer 2  customer2@email.com   69  Jeddah        2023-01-08
2         1003  Customer 3  customer3@email.com   46  Jeddah        2023-01-15

Anonymized customer data:
Shape: (20, 4)
Columns: ['customer_id', 'registration_date', 'age_group', 'region']
   customer_id registration_date age_group   region
0         1001        2023-01-01     51-65  Western
1         1002        2023-01-08       65+  Western
2         1003        2023-01-15     36-50  Western

✅ Personal identifiers removed, analytical value preserved
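
Generalized columns such as `age_group` and `region` can still single out a person when a combination of values is rare. One common check is k-anonymity: every combination of quasi-identifiers should be shared by at least k records. The sketch below uses hypothetical records and an assumed threshold of 2; it is an illustration, not a complete privacy assessment.

```python
import pandas as pd

# Hypothetical generalized records, for illustration only
records = pd.DataFrame({
    "age_group": ["26-35", "26-35", "36-50", "36-50", "65+"],
    "region": ["Central", "Central", "Western", "Western", "Eastern"],
})

quasi_identifiers = ["age_group", "region"]

# Number of records sharing each combination of quasi-identifier values
group_sizes = records.groupby(quasi_identifiers).size()

# k is the size of the smallest group
k = int(group_sizes.min())

# Assumed threshold: groups with fewer than 2 records are flagged as risky
risky_groups = group_sizes[group_sizes < 2]

print(f"k = {k}")
print(risky_groups)
```

A group of size 1 means that record is unique on its quasi-identifiers; typical responses are coarser generalization or suppressing the record.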


### 3.3 Bias and Fairness in Data Representation

**🎯 TYPES OF BIAS IN DATA ANALYSIS:**

   - **Historical Bias**: Existing inequalities reflected in historical data
   - **Representation Bias**: Certain groups underrepresented in datasets
   - **Measurement Bias**: Systematic errors in data collection methods
   - **Evaluation Bias**: Using inappropriate benchmarks or metrics
   - **Aggregation Bias**: Assuming one model fits all demographic groups
   - **Confirmation Bias**: Looking for data that confirms preconceptions

**🛠️ BIAS MITIGATION STRATEGIES:**
- Diverse data collection sources
- Regular bias audits and monitoring
- Inclusive team composition
- Stakeholder feedback incorporation
- Algorithm fairness testing
- Documentation of decision processes

#### Detecting and Mitigating Bias

In [34]:
def check_data_representation(df, protected_attribute):
    """
    Check for representation bias in dataset
    """
    if protected_attribute not in df.columns:
        print(f"   ⚠️ Column '{protected_attribute}' not found in dataset")
        return
    
    distribution = df[protected_attribute].value_counts(normalize=True)
    print(f"📊 Distribution for '{protected_attribute}':")
    for category, percentage in distribution.items():
        print(f"   {category}: {percentage:.1%}")
    
    # Flag underrepresented groups (less than 10%)
    underrepresented = distribution[distribution < 0.1]
    if len(underrepresented) > 0:
        print(f"   ⚠️ Underrepresented groups: {list(underrepresented.index)}")
        print(f"   💡 Consider collecting more data for these groups")
    else:
        print(f"   ✅ No severely underrepresented groups detected")
    
    return distribution

# Demonstrate bias checking on customer data
print("🔍 BIAS DETECTION EXAMPLE:\n")
city_distribution = check_data_representation(customers_df, 'city')

print("\n" + "-"*50)
age_group_distribution = check_data_representation(anonymized_customers, 'age_group')


🔍 BIAS DETECTION EXAMPLE:

📊 Distribution for 'city':
   Mecca: 30.0%
   Jeddah: 30.0%
   Riyadh: 25.0%
   Dammam: 15.0%
   ✅ No severely underrepresented groups detected

--------------------------------------------------
📊 Distribution for 'age_group':
   36-50: 40.0%
   51-65: 25.0%
   18-25: 15.0%
   26-35: 15.0%
   65+: 5.0%
   ⚠️ Underrepresented groups: ['65+']
   💡 Consider collecting more data for these groups
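
A fixed 10% cutoff says nothing about whether a group's share matches the population the data is meant to represent. A complementary check compares sample shares against a reference distribution. In the sketch below, both the sample and the reference shares are hypothetical numbers chosen for illustration, not real population figures.

```python
import pandas as pd

# Hypothetical sample and reference shares, for illustration only
sample = pd.Series(["Riyadh"] * 5 + ["Jeddah"] * 3 + ["Dammam"] * 2)
reference_share = pd.Series({"Riyadh": 0.4, "Jeddah": 0.3, "Dammam": 0.3})

# Share of each city in the sample
sample_share = sample.value_counts(normalize=True)

# Ratio below 1 means the group is underrepresented relative to the reference
representation_ratio = (sample_share / reference_share).round(2)

print(representation_ratio)
```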


## 🐍 Part 4: Python Environment Setup and Basics

### 4.1 Setting Up Your Development Environment

#### Required Software
1. **Python 3.8+** - Programming language
2. **Jupyter Notebook** - Interactive development environment
3. **Essential Libraries:**
   - `pandas` - Data manipulation and analysis
   - `numpy` - Numerical computing
   - `matplotlib` - Basic plotting
   - `seaborn` - Statistical visualization

#### Installation Steps

```bash
# Using pip
pip install jupyter pandas numpy matplotlib seaborn plotly openpyxl

# Using conda (if you use the Anaconda distribution)
conda install jupyter pandas numpy matplotlib seaborn plotly openpyxl

# Or install the full Anaconda distribution (recommended for beginners)
# Download from: https://www.anaconda.com/products/distribution
```

#### Starting Jupyter Notebook

```bash
# Navigate to your project directory
cd /path/to/your/project

# Start Jupyter Notebook
jupyter notebook

# Or Jupyter Lab (more advanced interface)
jupyter lab
```

In [4]:
# Environment Verification
import sys
import platform
from datetime import datetime

print("🖥️ SYSTEM INFORMATION:")
print(f"  • Python Version: {sys.version}")
print(f"  • Platform: {platform.system()} {platform.release()}")
print(f"  • Architecture: {platform.architecture()[0]}")
print(f"  • Session Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# Essential Libraries Verification
essential_libraries = {
    'pandas': 'Data manipulation and analysis',
    'numpy': 'Numerical computing and arrays', 
    'matplotlib': 'Basic plotting and visualization',
    'seaborn': 'Statistical data visualization',
    'jupyter': 'Interactive development environment',
    'openpyxl': 'Excel file reading/writing'
}

print(f"\n📦 CHECKING ESSENTIAL LIBRARIES:")
missing_libraries = []
available_libraries = []

for lib, description in essential_libraries.items():
    try:
        __import__(lib)
        print(f"  ✅ {lib:<15} - {description}")
        available_libraries.append(lib)
    except ImportError:
        print(f"  ❌ {lib:<15} - Missing: {description}")
        missing_libraries.append(lib)

if missing_libraries:
    print(f"\n⚠️ MISSING LIBRARIES:")
    print(f"   Install with: pip install {' '.join(missing_libraries)}")
else:
    print(f"\n✅ ALL ESSENTIAL LIBRARIES AVAILABLE!")

🖥️ SYSTEM INFORMATION:
  • Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
  • Platform: Windows 11
  • Architecture: 64bit
  • Session Date: 2025-07-01 20:17:14

📦 CHECKING ESSENTIAL LIBRARIES:
  ✅ pandas          - Data manipulation and analysis
  ✅ numpy           - Numerical computing and arrays
  ✅ matplotlib      - Basic plotting and visualization
  ✅ seaborn         - Statistical data visualization
  ✅ jupyter         - Interactive development environment
  ✅ openpyxl        - Excel file reading/writing

✅ ALL ESSENTIAL LIBRARIES AVAILABLE!


In [5]:
print("🔧 IMPORTING ESSENTIAL LIBRARIES:")

try:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from datetime import datetime, date
    import warnings
    warnings.filterwarnings('ignore')
    
    print("  ✅ Core libraries imported successfully!")
    
    # Set plotting style
    plt.style.use('default')
    sns.set_palette("husl")
    plt.rcParams['figure.figsize'] = (10, 6)
    
    print("  ✅ Visualization settings configured!")
    
    # Display versions
    print(f"\n📊 LIBRARY VERSIONS:")
    print(f"  • pandas: {pd.__version__}")
    print(f"  • numpy: {np.__version__}")
    print(f"  • matplotlib: {plt.matplotlib.__version__}")
    print(f"  • seaborn: {sns.__version__}")
    
except ImportError as e:
    print(f"  ❌ Import Error: {e}")
    print("  Please install missing libraries before continuing.")


🔧 IMPORTING ESSENTIAL LIBRARIES:
  ✅ Core libraries imported successfully!
  ✅ Visualization settings configured!

📊 LIBRARY VERSIONS:
  • pandas: 2.2.2
  • numpy: 1.26.4
  • matplotlib: 3.9.2
  • seaborn: 0.13.2


### 4.2 Python Fundamentals for Data Analysis

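#### Essential Data Structures

Before the collection types below, recall Python's basic built-in types:

- **Integer (`int`):** Whole numbers (e.g., 10, -5, 0)
- **Float (`float`):** Numbers with decimal points (e.g., 3.14, -2.5); Python has no separate `double` type
- **String (`str`):** Text (e.g., 'a', 'Z', '7'); Python has no separate `char` type, so a single character is a string of length 1
- **Boolean (`bool`):** `True` or `False`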
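
A quick way to confirm which built-in type a value has is `type()`. The variable names and values below are illustrative only.

```python
# Illustrative values, one per basic built-in type
quantity = 10      # int
price = 3.14       # float
initial = "a"      # str of length 1 (Python has no separate char type)
in_stock = True    # bool

values = (quantity, price, initial, in_stock)

for value in values:
    print(f"{value!r:>6} -> {type(value).__name__}")

type_names = [type(value).__name__ for value in values]
```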

**1️⃣ LISTS - Ordered Collections**

In [7]:
sales_data = [100, 150, 200, 175, 300, 120, 180]
product_names = ['Laptop', 'Mouse', 'Keyboard', 'Monitor']

print(f"   sales_data = {sales_data}")
print(f"   product_names = {product_names}")
print(f"   • Length: {len(sales_data)}")
print(f"   • Sum: {sum(sales_data)}")
print(f"   • Average: {sum(sales_data)/len(sales_data):.2f}")
print(f"   • Max: {max(sales_data)}, Min: {min(sales_data)}")

   sales_data = [100, 150, 200, 175, 300, 120, 180]
   product_names = ['Laptop', 'Mouse', 'Keyboard', 'Monitor']
   • Length: 7
   • Sum: 1225
   • Average: 175.00
   • Max: 300, Min: 100
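
Two list features that come up constantly in data work are slicing and list comprehensions. The sketch below reuses the `sales_data` values from the cell above; the other variable names are illustrative.

```python
sales_data = [100, 150, 200, 175, 300, 120, 180]

first_three = sales_data[:3]   # slicing: the first three values
last_value = sales_data[-1]    # negative index: the last value

# List comprehension: keep only values above the average
average = sum(sales_data) / len(sales_data)
above_average = [sale for sale in sales_data if sale > average]

print(first_three)
print(last_value)
print(above_average)
```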


**2️⃣ DICTIONARIES - Key-Value Pairs**

In [8]:
customer = {
    'customer_id': 'C001',
    'name': 'Ahmed Al-Salem', 
    'age': 35,
    'location': 'Riyadh',
    'purchases': [100, 250, 75],
    'registration_date': '2023-01-15'
}
print(f"   customer = {customer}")
print(f"   • Name: {customer['name']}")
print(f"   • Total purchases: {sum(customer['purchases'])}")
print(f"   • Keys: {list(customer.keys())}")

   customer = {'customer_id': 'C001', 'name': 'Ahmed Al-Salem', 'age': 35, 'location': 'Riyadh', 'purchases': [100, 250, 75], 'registration_date': '2023-01-15'}
   • Name: Ahmed Al-Salem
   • Total purchases: 425
   • Keys: ['customer_id', 'name', 'age', 'location', 'purchases', 'registration_date']


**3️⃣ TUPLES - Ordered, Immutable Collections**

In [9]:
# Define a customer tuple
customer = (
    'C001',                      # customer_id
    'Ahmed Al-Salem',            # name
    35,                          # age
    'Riyadh',                    # location
    (100, 250, 75),              # purchases (as a tuple)
    '2023-01-15'                 # registration_date
)

print(f"   customer = {customer}")
print(f"   • Name: {customer[1]}")
print(f"   • Total purchases: {sum(customer[4])}")
print(f"   • Number of fields: {len(customer)}")
print(f"   • Purchases tuple: {customer[4]}")
print(f"   • Is purchases a tuple? {'Yes' if isinstance(customer[4], tuple) else 'No'}")


   customer = ('C001', 'Ahmed Al-Salem', 35, 'Riyadh', (100, 250, 75), '2023-01-15')
   • Name: Ahmed Al-Salem
   • Total purchases: 425
   • Number of fields: 6
   • Purchases tuple: (100, 250, 75)
   • Is purchases a tuple? Yes
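
Immutability is what distinguishes a tuple from a list: assigning to an element raises a `TypeError`. The short sketch below, using a shortened illustrative tuple, shows this along with tuple unpacking.

```python
customer = ("C001", "Ahmed Al-Salem", 35)

try:
    customer[2] = 36  # tuples do not support item assignment
    modified = True
except TypeError as error:
    modified = False
    print(f"TypeError: {error}")

# Unpacking assigns each field to a name in one step
customer_id, name, age = customer
print(customer_id, name, age)
```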


**4️⃣ NUMPY ARRAYS - Numerical Computing**

In [10]:
# NumPy Arrays - Efficient numerical operations
revenue = np.array([1000, 1200, 1100, 1400, 1300, 1150, 1250])
costs = np.array([600, 700, 650, 800, 750, 680, 720])

In [11]:
print(f"   revenue = {revenue}")
print(f"   costs = {costs}")

   revenue = [1000 1200 1100 1400 1300 1150 1250]
   costs = [600 700 650 800 750 680 720]


In [12]:
# Mathematical operations
profit = revenue - costs
profit_margin = (profit / revenue) * 100

In [13]:
print(f"   • Profit: {profit}")
print(f"   • Profit Margin %: {profit_margin.round(2)}")
print(f"   • Average Revenue: {revenue.mean():.2f}")
print(f"   • Revenue Std Dev: {revenue.std():.2f}")

   • Profit: [400 500 450 600 550 470 530]
   • Profit Margin %: [40.   41.67 40.91 42.86 42.31 40.87 42.4 ]
   • Average Revenue: 1200.00
   • Revenue Std Dev: 122.47


In [14]:
# Advanced NumPy operations
growth_rate = np.diff(revenue) / revenue[:-1] * 100
print(f"   • Revenue Growth Rate: {growth_rate.round(2)}")
print(f"   • Cumulative Revenue: {np.cumsum(revenue)}")

   • Revenue Growth Rate: [ 20.    -8.33  27.27  -7.14 -11.54   8.7 ]
   • Cumulative Revenue: [1000 2200 3300 4700 6000 7150 8400]
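
NumPy arrays also support boolean masking, which selects elements by condition without an explicit loop. The sketch below reuses the `revenue` values from above; `above_mean` and `strong_days` are illustrative names.

```python
import numpy as np

revenue = np.array([1000, 1200, 1100, 1400, 1300, 1150, 1250])

# Comparison produces an array of True/False values, one per element
above_mean = revenue > revenue.mean()

# Indexing with the mask keeps only the elements where it is True
strong_days = revenue[above_mean]

print(above_mean)
print(strong_days)
```

The same masking idea is used when filtering rows in a pandas DataFrame.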


**5️⃣ PANDAS DATAFRAMES - Structured Data**

In [15]:
# Create sample business dataset
df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop Pro', 'Wireless Mouse', 'USB Cable', 'Monitor 24"', 'Keyboard Mech'],
    'category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Electronics'],
    'price': [2500, 45, 15, 800, 150],
    'quantity_sold': [120, 300, 500, 80, 200],
    'region': ['Riyadh', 'Jeddah', 'Dammam', 'Riyadh', 'Jeddah']
})

print("Sample Sales DataFrame:")
print(df)

Sample Sales DataFrame:
  product_id    product_name     category  price  quantity_sold  region
0       P001      Laptop Pro  Electronics   2500            120  Riyadh
1       P002  Wireless Mouse  Electronics     45            300  Jeddah
2       P003       USB Cable  Accessories     15            500  Dammam
3       P004     Monitor 24"  Electronics    800             80  Riyadh
4       P005   Keyboard Mech  Electronics    150            200  Jeddah


In [16]:
# Calculate additional metrics
df['Sales'] = df['price'] * df['quantity_sold']

#### Basic Operations and Functions

In [17]:
# Descriptive statistics
df['Sales'].mean()    # Average sales

83000.0

In [18]:
df['Sales'].median()  # Median sales

30000.0

In [19]:
df['Sales'].std()     # Standard deviation

123277.63381895355

In [20]:
# Data inspection
df.head()        # First 5 rows

  product_id    product_name     category  price  quantity_sold  region   Sales
0       P001      Laptop Pro  Electronics   2500            120  Riyadh  300000
1       P002  Wireless Mouse  Electronics     45            300  Jeddah   13500
2       P003       USB Cable  Accessories     15            500  Dammam    7500
3       P004     Monitor 24"  Electronics    800             80  Riyadh   64000
4       P005   Keyboard Mech  Electronics    150            200  Jeddah   30000


In [21]:
df.info()        # Data types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_id     5 non-null      object
 1   product_name   5 non-null      object
 2   category       5 non-null      object
 3   price          5 non-null      int64 
 4   quantity_sold  5 non-null      int64 
 5   region         5 non-null      object
 6   Sales          5 non-null      int64 
dtypes: int64(3), object(4)
memory usage: 412.0+ bytes


In [22]:
df.describe()    # Summary statistics

             price  quantity_sold          Sales
count          5.0            5.0            5.0
mean         702.0          240.0        83000.0
std    1054.837665     167.928556  123277.633819
min           15.0           80.0         7500.0
25%           45.0          120.0        13500.0
50%          150.0          200.0        30000.0
75%          800.0          300.0        64000.0
max         2500.0          500.0       300000.0


In [23]:
df.shape         # Number of rows and columns

(5, 7)

In [24]:
# Filtering and selection
high_sales = df[df['Sales'] > 50000]  # keep rows with sales above 50,000

In [25]:
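riyadh_sales = df[df['region'] == 'Riyadh']  # region holds city names, so 'North' would match nothing
riyadh_sales

  product_id product_name     category  price  quantity_sold  region   Sales
0       P001   Laptop Pro  Electronics   2500            120  Riyadh  300000
3       P004  Monitor 24"  Electronics    800             80  Riyadh   64000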
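
Filters can be combined, and filtered or unfiltered data can be aggregated by group. The sketch below rebuilds the relevant columns of the sales DataFrame so that it runs on its own; `jeddah_electronics` and `sales_by_region` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop Pro', 'Wireless Mouse', 'USB Cable', 'Monitor 24"', 'Keyboard Mech'],
    'category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Electronics'],
    'price': [2500, 45, 15, 800, 150],
    'quantity_sold': [120, 300, 500, 80, 200],
    'region': ['Riyadh', 'Jeddah', 'Dammam', 'Riyadh', 'Jeddah'],
})
df['Sales'] = df['price'] * df['quantity_sold']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
jeddah_electronics = df[(df['region'] == 'Jeddah') & (df['category'] == 'Electronics')]

# Total sales per region
sales_by_region = df.groupby('region')['Sales'].sum()

print(jeddah_electronics)
print(sales_by_region)
```

The parentheses matter because `&` binds more tightly than `==` in Python; without them the expression raises an error or gives the wrong result.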


### 4.3 Reading and Writing Data Files

In [26]:
print("📁 CREATING SAMPLE DATASETS FOR FILE I/O DEMO:\n")

# Customer dataset
np.random.seed(42)  # For reproducible results
customers_df = pd.DataFrame({
    'customer_id': range(1001, 1021),
    'name': [f'Customer {i}' for i in range(1, 21)],
    'email': [f'customer{i}@email.com' for i in range(1, 21)],
    'age': np.random.randint(18, 70, 20),
    'city': np.random.choice(['Riyadh', 'Jeddah', 'Dammam', 'Mecca'], 20),
    'registration_date': pd.date_range('2023-01-01', periods=20, freq='W')
})

# Sales transactions dataset  
transactions_df = pd.DataFrame({
    'transaction_id': range(5001, 5051),
    'customer_id': np.random.choice(range(1001, 1021), 50),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], 50),
    'amount': np.random.gamma(2, 50, 50).round(2),
    'transaction_date': pd.date_range('2023-01-01', periods=50, freq='D')
})

print(f"✅ Customers dataset created: {customers_df.shape}")
print(f"✅ Transactions dataset created: {transactions_df.shape}")


📁 CREATING SAMPLE DATASETS FOR FILE I/O DEMO:

✅ Customers dataset created: (20, 6)
✅ Transactions dataset created: (50, 5)


In [27]:
# Display sample data
print("\n📊 Sample Customer Data:")
print(customers_df.head())

print("\n📊 Sample Transaction Data:")
print(transactions_df.head())


📊 Sample Customer Data:
   customer_id        name                email  age    city registration_date
0         1001  Customer 1  customer1@email.com   56   Mecca        2023-01-01
1         1002  Customer 2  customer2@email.com   69  Jeddah        2023-01-08
2         1003  Customer 3  customer3@email.com   46  Jeddah        2023-01-15
3         1004  Customer 4  customer4@email.com   32  Jeddah        2023-01-22
4         1005  Customer 5  customer5@email.com   60   Mecca        2023-01-29

📊 Sample Transaction Data:
   transaction_id  customer_id product_category  amount transaction_date
0            5001         1016      Electronics  149.67       2023-01-01
1            5002         1015             Home  103.98       2023-01-02
2            5003         1015         Clothing  163.24       2023-01-03
3            5004         1019      Electronics  145.96       2023-01-04
4            5005         1012             Home  545.57       2023-01-05


In [28]:
# File I/O Operations
print("📁 FILE INPUT/OUTPUT OPERATIONS:\n")

# CSV Operations
print("📄 CSV (Comma-Separated Values):")
try:
    # Write to CSV
    customers_df.to_csv('sample_customers.csv', index=False)
    transactions_df.to_csv('sample_transactions.csv', index=False)
    print("   ✅ CSV files created successfully")
    
    # Read from CSV
    csv_data = pd.read_csv('sample_customers.csv')
    print(f"   ✅ CSV read successfully: {csv_data.shape}")
    
except Exception as e:
    print(f"   ❌ CSV Error: {e}")

📁 FILE INPUT/OUTPUT OPERATIONS:

📄 CSV (Comma-Separated Values):
   ✅ CSV files created successfully
   ✅ CSV read successfully: (20, 6)


In [29]:
# Excel Operations
print("\n📊 EXCEL FILES:")
try:
    # Write to Excel with multiple sheets
    with pd.ExcelWriter('sample_business_data.xlsx', engine='openpyxl') as writer:
        customers_df.to_excel(writer, sheet_name='Customers', index=False)
        transactions_df.to_excel(writer, sheet_name='Transactions', index=False)
        
        # Summary sheet
        summary_df = pd.DataFrame({
            'Metric': ['Total Customers', 'Total Transactions', 'Average Transaction', 'Date Range'],
            'Value': [len(customers_df), len(transactions_df), 
                     f"${transactions_df['amount'].mean():.2f}",
                     f"{transactions_df['transaction_date'].min()} to {transactions_df['transaction_date'].max()}"]
        })
        summary_df.to_excel(writer, sheet_name='Summary', index=False)
    
    print("   ✅ Excel file with multiple sheets created")
    
    # Read specific sheet
    excel_customers = pd.read_excel('sample_business_data.xlsx', sheet_name='Customers')
    print(f"   ✅ Excel sheet read successfully: {excel_customers.shape}")
    
except Exception as e:
    print(f"   ❌ Excel Error: {e}")


📊 EXCEL FILES:
   ✅ Excel file with multiple sheets created
   ✅ Excel sheet read successfully: (20, 6)


In [30]:
# JSON Operations
print("\n🔗 JSON (JavaScript Object Notation):")
try:
    # Write to JSON
    customers_df.to_json('sample_customers.json', orient='records', indent=2)
    print("   ✅ JSON file created successfully")
    
    # Read from JSON
    json_data = pd.read_json('sample_customers.json')
    print(f"   ✅ JSON read successfully: {json_data.shape}")
    
except Exception as e:
    print(f"   ❌ JSON Error: {e}")


🔗 JSON (JavaScript Object Notation):
   ✅ JSON file created successfully
   ✅ JSON read successfully: (20, 6)


In [31]:
# Database connections (example)
# import sqlite3
# conn = sqlite3.connect('company_database.db')
# df_db = pd.read_sql_query('SELECT * FROM sales', conn)
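
The commented-out lines above assume a database file that may not exist on your machine. As a self-contained sketch, the example below uses an in-memory SQLite database instead; the `sales` table, its columns, and its rows are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (product, amount) VALUES (?, ?)",
    [("Laptop", 2500.0), ("Mouse", 45.0)],
)
conn.commit()

# Load the query result straight into a DataFrame
df_db = pd.read_sql_query(
    "SELECT product, amount FROM sales ORDER BY amount DESC", conn
)
conn.close()

print(df_db)
```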

## 🔬 Part 5: Hands-on Data Exploration

### 5.1 Loading Your First Dataset

Let's work with a sample sales dataset to practice our skills:

#### Import Required Packages

In [35]:
import os 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Load the Dataset

In [36]:
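# The CSV sits in a 'datasets' folder one level above the working directory
cwd = os.getcwd()
parent_dir = os.path.abspath(os.path.join(cwd, os.pardir))  # parent of cwd
file_path = os.path.join(parent_dir, 'datasets', 'Business_data.csv')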

In [37]:
df = pd.read_csv(file_path)
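
If the notebook is started from a different folder, `pd.read_csv` fails with a fairly generic error. The sketch below checks the path first and reports where it looked. The helper name `load_csv_checked` is hypothetical, and the folder layout is assumed to match the cell above.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path):
    """Read a CSV, failing early with a clear message if the file is absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    return pd.read_csv(path)


# Assumed layout: <parent of working directory>/datasets/Business_data.csv
file_path = Path.cwd().parent / "datasets" / "Business_data.csv"

try:
    df_checked = load_csv_checked(file_path)
    print(f"Loaded {len(df_checked):,} rows")
except FileNotFoundError as err:
    # Report the problem instead of stopping the notebook
    print(err)
```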

In [38]:
# Display sample of the data
print("\n📋 Sample Data:")
print(df.head(10))


📋 Sample Data:
   customer_id product_category  purchase_amount  customer_age  region  \
0        10001         Clothing           150.42          70.0  Dammam   
1        10002           Sports           375.48           8.0  Medina   
2        10003             Home           213.35          39.0  Jeddah   
3        10004            Books            26.86          39.0   Mecca   
4        10005      Electronics            91.74           7.0  Medina   
5        10006      Electronics            62.02          54.0  Jeddah   
6        10007      Electronics            17.53          69.0  Jeddah   
7        10008             Home             7.67          40.0  Medina   
8        10009            Books           138.23          19.0  Dammam   
9        10010             Home            44.24          48.0  Riyadh   

   purchase_date  payment_method  customer_satisfaction  
0  1/1/2023 0:00     Credit Card                      3  
1  1/1/2023 1:00  Mobile Payment                     

### 5.2 Comprehensive Data Quality Assessment

In [39]:
# 1. Basic Information
print(f"📋 BASIC INFORMATION:")
print(f"   • Dataset Shape: {df.shape}")
print(f"   • Memory Usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"   • Total Cells: {df.shape[0] * df.shape[1]:,}")

📋 BASIC INFORMATION:
   • Dataset Shape: (100000, 8)
   • Memory Usage: 25975.08 KB
   • Total Cells: 800,000


In [40]:
# 2. Data Types
print(f"\n📊 DATA TYPES:")
dtype_counts = df.dtypes.value_counts()
for dtype, count in dtype_counts.items():
    print(f"   • {dtype}: {count} columns")


📊 DATA TYPES:
   • object: 4 columns
   • int64: 2 columns
   • float64: 2 columns


In [41]:
# 3. Missing Values Analysis
print(f"\n❌ MISSING VALUES ANALYSIS:")
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

has_missing = missing_data[missing_data > 0]
if len(has_missing) > 0:
    print("   Columns with missing values:")
    for col, count in has_missing.items():
        print(f"   • {col}: {count} ({missing_percentage[col]:.1f}%)")
else:
    print("   ✅ No missing values detected")



❌ MISSING VALUES ANALYSIS:
   Columns with missing values:
   • purchase_amount: 30 (0.0%)
   • customer_age: 50 (0.1%)
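
With 100,000 rows, a rate rounded to one decimal place can read as 0.0% even though 30 values are missing. One way to avoid that is to report counts alongside percentages at higher precision. The sketch below does this on a small made-up DataFrame; the `demo` data and the `missing_summary` helper are illustrative and not part of the course dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing-value count and percentage per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(3),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Made-up rows for illustration
demo = pd.DataFrame({
    "purchase_amount": [150.42, np.nan, 213.35, 26.86],
    "customer_age": [70.0, 8.0, np.nan, np.nan],
    "region": ["Dammam", "Medina", "Jeddah", "Mecca"],
})

print(missing_summary(demo))
```

On the course data, the same helper would be called as `missing_summary(df)`.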


In [42]:
# 4. Duplicates
duplicates = df.duplicated().sum()
print(f"\n🔄 DUPLICATES:")
print(f"   • Duplicate rows: {duplicates} ({duplicates/len(df)*100:.1f}%)")



🔄 DUPLICATES:
   • Duplicate rows: 0 (0.0%)
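
A count of zero here only rules out rows that are identical in every column. Records that share a key but differ elsewhere are not counted. The sketch below shows the difference on made-up data, under the assumption that `customer_id` is meant to be unique per row.

```python
import pandas as pd

# Made-up rows: customer 10002 appears twice with different amounts
demo = pd.DataFrame({
    "customer_id": [10001, 10002, 10002, 10003],
    "purchase_amount": [150.42, 375.48, 380.00, 213.35],
})

# Rows identical in every column
full_duplicates = demo.duplicated().sum()

# Rows that repeat a key; keep=False marks every member of a repeated group
key_duplicates = demo[demo.duplicated(subset=["customer_id"], keep=False)]

print(f"Fully duplicated rows: {full_duplicates}")
print(f"Rows sharing a customer_id: {len(key_duplicates)}")
```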


In [43]:
print("\n=== BASIC STATISTICS ===")
print(df.describe())


=== BASIC STATISTICS ===
         customer_id  purchase_amount  customer_age  customer_satisfaction
count  100000.000000     99970.000000  99950.000000          100000.000000
mean    60000.500000       100.842831     39.937579               2.995740
std     28867.657797        85.305769     14.997718               1.413507
min     10001.000000         0.100000    -25.000000               1.000000
25%     35000.750000        48.300000     30.000000               2.000000
50%     60000.500000        84.180000     40.000000               3.000000
75%     85000.250000       134.677500     50.000000               4.000000
max    110000.000000      4800.120102    109.000000               5.000000


In [74]:
# Detailed Statistical Analysis
print("📊 DETAILED STATISTICAL ANALYSIS:\n")

# Numerical columns analysis
numerical_cols = df.select_dtypes(include=[np.number]).columns
print("📈 NUMERICAL COLUMNS SUMMARY:")
print(df[numerical_cols].describe().round(2))

📊 DETAILED STATISTICAL ANALYSIS:

📈 NUMERICAL COLUMNS SUMMARY:
       customer_id  purchase_amount  customer_age  customer_satisfaction
count    100000.00         99970.00      99950.00              100000.00
mean      60000.50           100.84         39.94                   3.00
std       28867.66            85.31         15.00                   1.41
min       10001.00             0.10        -25.00                   1.00
25%       35000.75            48.30         30.00                   2.00
50%       60000.50            84.18         40.00                   3.00
75%       85000.25           134.68         50.00                   4.00
max      110000.00          4800.12        109.00                   5.00


In [75]:
# Categorical columns analysis
categorical_cols = df.select_dtypes(include=['object']).columns
print(f"\n📈 CATEGORICAL COLUMNS ANALYSIS:")

for col in categorical_cols:
    unique_count = df[col].nunique()
    most_common = df[col].mode().iloc[0] if len(df[col].mode()) > 0 else 'N/A'
    print(f"   • {col}: {unique_count} unique values, most common: '{most_common}'")
    
    # Show distribution for categorical variables
    if unique_count <= 10:  # Only show distribution for variables with <= 10 categories
        value_counts = df[col].value_counts()
        print(f"     Distribution: {dict(value_counts.head())}")

# Date columns analysis
date_cols = df.select_dtypes(include=['datetime64']).columns
if len(date_cols) > 0:
    print(f"\n📅 DATE COLUMNS ANALYSIS:")
    for col in date_cols:
        print(f"   • {col}: from {df[col].min()} to {df[col].max()}")
        print(f"     Span: {(df[col].max() - df[col].min()).days} days")


📈 CATEGORICAL COLUMNS ANALYSIS:
   • product_category: 5 unique values, most common: 'Electronics'
     Distribution: {'Electronics': 30055, 'Clothing': 24972, 'Home': 20063, 'Books': 15084, 'Sports': 9826}
   • region: 5 unique values, most common: 'Riyadh'
     Distribution: {'Riyadh': 39800, 'Jeddah': 25239, 'Dammam': 14950, 'Mecca': 10009, 'Medina': 10002}
   • purchase_date: 100000 unique values, most common: '1/1/2023 0:00'
   • payment_method: 4 unique values, most common: 'Credit Card'
     Distribution: {'Credit Card': 40231, 'Debit Card': 29919, 'Cash': 14931, 'Mobile Payment': 14919}
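
`purchase_date` appears among the categorical columns because it was read as text (`object`), so the date branch of the cell above has nothing to report. A minimal sketch of converting the column first is shown below, using a few made-up values laid out like the sample output. The format string assumes month/day/year order, which the first rows cannot confirm, so it should be checked against the full file.

```python
import pandas as pd

# Made-up values, including one that cannot be parsed
demo = pd.DataFrame({
    "purchase_date": ["1/1/2023 0:00", "1/1/2023 1:00", "1/3/2023 12:00", "not a date"],
})

# errors="coerce" turns unparseable text into NaT instead of raising
demo["purchase_date"] = pd.to_datetime(
    demo["purchase_date"], format="%m/%d/%Y %H:%M", errors="coerce"
)

# The column is now picked up by the datetime selector
date_cols = demo.select_dtypes(include=["datetime64"]).columns
for col in date_cols:
    span = demo[col].max() - demo[col].min()
    print(f"{col}: from {demo[col].min()} to {demo[col].max()} ({span.days} days)")

print(f"Unparseable values: {demo['purchase_date'].isna().sum()}")
```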


## 📝 Part 6: Practical Exercises and Deliverables

### Exercise 1: Environment Setup Verification

1. Create a new Jupyter notebook
2. Import all required libraries (pandas, numpy, matplotlib, seaborn)
3. Create a simple DataFrame with sample data
4. Display basic information about your DataFrame

```python
# Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create your own sample dataset
# Hint: Use at least 3 columns with different data types
```

### Exercise 2: Data Quality Analysis

Using the provided sample dataset:

1. Calculate the percentage of missing values for each column
2. Identify the data types of each column
3. Find any duplicate rows
4. Create a simple visualization showing the distribution of one numerical column

```python
# Your analysis code here
```

### Exercise 3: Ethics Reflection

Write a brief reflection (200-300 words) addressing the following questions:

1. What ethical considerations should be taken into account when analyzing customer purchase data?
2. How would you ensure privacy protection while still extracting valuable business insights?
3. What potential biases might exist in retail sales data, and how could they impact analysis results?

*Write your reflection in the markdown cell below:*

---

### 📋 Session 1 Deliverables Checklist

#### ✅ Python Environment Setup Complete
- [ ] Python 3.8+ installed and verified
- [ ] Jupyter Notebook functional
- [ ] All required libraries imported successfully
- [ ] Sample notebook with basic operations completed

#### ✅ Data Quality Assessment
- [ ] Missing values analysis completed
- [ ] Data types identified and documented

#### ✅ Data Ethics Reflection
- [ ] GDPR compliance considerations addressed
- [ ] Bias identification and mitigation strategies
- [ ] Privacy protection strategies outlined
- [ ] Ethical handling recommendations provided

#### ✅ Basic Data Exploration
- [ ] Descriptive statistics calculated

## 📚 Additional Resources and Learning Materials

<!-- CELL BREAK -->

### 🎓 Essential Reading

#### 📖 Required Textbooks
- **Python for Data Analysis** by Wes McKinney (Chapters 1-4)
- **Ethics and Data Science** by Mike Loukides, Hilary Mason, and DJ Patil (Introduction)
- **Practical Statistics for Data Scientists** by Peter Bruce and Andrew Bruce (Chapters 1-2)

#### 📄 Official Documentation
- [Pandas Documentation](https://pandas.pydata.org/docs/) - Comprehensive guide
- [NumPy User Guide](https://numpy.org/doc/stable/user/) - Numerical computing

### 🔮 Next Session Preview

#### 📅 Session 2: Data Preparation, Cleaning and Wrangling

**What you'll learn:**
- Advanced data profiling and quality assessment techniques
- Handling missing data: detection, analysis, and treatment strategies
- Outlier detection and treatment methods (Z-score, IQR, winsorization)
- Data transformation: type conversions, standardizing formats
- Reshaping and merging datasets (pivot, join operations)
- Feature engineering: creating derived variables, binning, categorization
- Text preprocessing fundamentals for unstructured data

**Hands-on activities:**
- Clean real-world messy datasets
- Implement multiple missing data treatment strategies
- Create automated data cleaning pipelines
- Document data transformation decisions

**Prepare for next session:**
- Review pandas documentation for data cleaning methods
- Think about data quality issues you've encountered
- Practice with the sample datasets from today

## 🤝 Getting Help and Course Support

<!-- CELL BREAK -->

### 📞 Course Support Channels

#### 🏢 Big Academy Saudi Arabia
- **📍 Location:** Riyadh, Saudi Arabia
- **☎️ Phone:** +966 566 049 140
- **🌐 Website:** [bigacademy.edu.sa](https://bigacademy.edu.sa/)
- **📧 General Email:** som@bigacademy.com
- **📧 Course Email:** [To be provided by instructor]

#### 📱 Follow Us on Social Media
- **Instagram:** [@bigacademy.ksa](https://www.instagram.com/bigacademy.ksa/)
- **LinkedIn:** [Big Academy KSA](https://www.linkedin.com/school/big-academyksa/)
- **Facebook:** [Big Academy Saudi Arabia](https://m.facebook.com/61559705850355/)

## 🎉 Session 1 Complete!

### 🌟 Congratulations!

You have successfully completed **Session 1: Foundations of Data Analysis, Python Fundamentals, and Data Ethics**. You've taken the first important step in your data analytics journey!

### 🎯 What You've Accomplished Today:
- ✅ **Understood the data revolution** and its impact on modern business
- ✅ **Set up your Python environment** for data analysis
- ✅ **Learned essential data structures** and operations
- ✅ **Explored data types and quality dimensions**
- ✅ **Applied data ethics principles** including GDPR compliance
- ✅ **Performed hands-on data exploration** with a sample sales dataset

### 🚀 Next Steps:
1. **Complete all exercises** and submit your deliverables
2. **Practice with additional datasets** to reinforce your learning
3. **Prepare for Session 2** using the provided checklist
4. **Join the course community** and connect with fellow learners

### 💪 Keep Learning!
> *"Data is the new oil, but like oil, it needs to be refined to be valuable."*

> *"The journey of a thousand insights begins with a single dataset exploration."*

Remember: **Practice makes perfect!** The more you work with data, the more comfortable you'll become. Don't be afraid to experiment, make mistakes, and ask questions.

### 🌟 Ready for the Next Challenge?
**Session 2: Data Preparation, Cleaning and Wrangling** awaits you!

---

<div align="center">

**🎓 Big Academy Saudi Arabia - Where Education Knows No Limits**

*Empowering the next generation of data professionals in Saudi Arabia* 🇸🇦

![Big Academy](https://img.shields.io/badge/Made%20with%20❤️%20in-Saudi%20Arabia-green?style=for-the-badge)

</div>