### Task 1: Data Profiling to Understand Data Quality
**Description**: Use basic statistical methods to profile a dataset and identify potential quality issues.

**Steps**:
1. Load the dataset using pandas in Python.
2. Understand the data by checking its basic statistics.
3. Identify null values.
4. Check unique values for categorical columns.
5. Review outliers using box plots.
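
Box plots drawn with matplotlib's default settings mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a numeric complement to step 5, the sketch below counts such points per column. The function name `iqr_outlier_counts` and the sample data are illustrative assumptions, not part of the original exercise.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons with NaN evaluate to False, so missing values are never counted
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return outside.sum()

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "age": [25, 27, 29, 31, 33, 35, 120],
    "score": [50.0, 52.0, 54.0, 56.0, 58.0, 60.0, 62.0],
    "city": ["A", "B", "A", "C", "B", "A", "C"],
})
outlier_counts = iqr_outlier_counts(sample)
print(outlier_counts)
```

A nonzero count is a prompt to inspect the column, not proof of bad data; skewed but valid distributions can also produce points beyond the whiskers.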

In [None]:
# write your code from here

In [1]:
import pandas as pd

# Load the dataset (replace 'your_dataset.csv' with the actual file path)
df = pd.read_csv('your_dataset.csv')

# Display basic statistics
print("Basic Statistics:")
print(df.describe(include='all'))

# Check for null values
print("\nNull Values:")
print(df.isnull().sum())

# Check unique values for categorical columns
print("\nUnique Values in Categorical Columns:")
categorical_columns = df.select_dtypes(include=['object', 'category']).columns  # text and pandas categorical columns
for col in categorical_columns:
    print(f"{col}: {df[col].nunique()} unique values")

# Review outliers using box plots (optional, requires matplotlib)
import matplotlib.pyplot as plt

numerical_columns = df.select_dtypes(include=['number']).columns
for col in numerical_columns:
    plt.figure()
    df.boxplot(column=col)
    plt.title(f"Boxplot for {col}")
    plt.show()

FileNotFoundError: [Errno 2] No such file or directory: 'your_dataset.csv'

### Task 2: Implement Simple Data Validation
**Description**: Write a Python script to validate the data types and constraints of each column in a dataset.

**Steps**:
1. Define constraints for each column.
2. Validate each column based on its constraints.

In [None]:
# write your code from here

In [4]:
# Define constraints for each column
constraints = {
    'column_name_1': {'type': 'int', 'min': 0, 'max': 100},  # Example constraint
    'column_name_2': {'type': 'str', 'allowed_values': ['A', 'B', 'C']},  # Example constraint
    # Add constraints for other columns as needed
}

# Function to validate a column based on constraints
def validate_column(column, rules):
    """Return True if the Series `column` satisfies every rule in the dict `rules`."""
    # Data type check
    if 'type' in rules:
        if rules['type'] == 'int' and not pd.api.types.is_integer_dtype(column):
            return False
        if rules['type'] == 'str' and not pd.api.types.is_string_dtype(column):
            return False
    # Range checks (min() and max() skip missing values by default)
    if 'min' in rules and column.min() < rules['min']:
        return False
    if 'max' in rules and column.max() > rules['max']:
        return False
    # Membership check; missing values count as not allowed
    if 'allowed_values' in rules and not column.isin(rules['allowed_values']).all():
        return False
    return True

# Ensure the dataframe 'df' is defined
if 'df' in globals():
    # Validate each column
    validation_results = {}
    for col, col_constraints in constraints.items():
        if col in df.columns:
            validation_results[col] = validate_column(df[col], col_constraints)
        else:
            validation_results[col] = False  # Column not found in the dataset
else:
    raise NameError("The dataframe 'df' is not defined. Please ensure CELL INDEX 2 is executed.")

# Print validation results
print("Validation Results:")
for col, result in validation_results.items():
    print(f"{col}: {'Valid' if result else 'Invalid'}")

NameError: The dataframe 'df' is not defined. Please ensure CELL INDEX 2 is executed.

### Task 3: Detect Missing Data Patterns
**Description**: Analyze and visualize missing data patterns in a dataset.

**Steps**:
1. Visualize missing data using a heatmap.
2. Identify patterns in missing data.
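
The two steps above can be sketched as follows. `df.isnull()` yields a boolean mask that can be drawn as a heatmap, and grouping rows by which columns are missing shows which gaps tend to occur together. The function names and the sample data are assumptions made for illustration; the plot uses matplotlib's `imshow` directly, not a dedicated missing-data library.

```python
import numpy as np
import pandas as pd

def missing_patterns(df: pd.DataFrame) -> pd.Series:
    """Count how often each combination of missing columns occurs."""
    mask = df.isnull()
    # Label every row with the names of the columns that are missing in it
    labels = mask.apply(
        lambda row: ", ".join(map(str, row.index[row.to_numpy()])) or "(complete)",
        axis=1,
    )
    return labels.value_counts()

def plot_missing_heatmap(df: pd.DataFrame):
    """Draw the null mask as a heatmap (rows x columns); requires matplotlib."""
    import matplotlib.pyplot as plt  # lazy import: the summary above works without it
    mask = df.isnull()
    fig, ax = plt.subplots()
    ax.imshow(mask.to_numpy().astype(int), aspect="auto", cmap="gray_r",
              interpolation="nearest", vmin=0, vmax=1)
    ax.set_xticks(range(mask.shape[1]))
    ax.set_xticklabels(mask.columns, rotation=45, ha="right")
    ax.set_ylabel("Row position")
    ax.set_title("Missing values (dark = missing)")
    return ax

# Hypothetical sample data: income and employer are missing together
sample = pd.DataFrame({
    "income": [50.0, np.nan, 62.0, np.nan, 48.0],
    "employer": ["X", None, "Y", None, "Z"],
    "age": [30, 41, np.nan, 29, 35],
})
patterns = missing_patterns(sample)
print(patterns)
```

In a notebook, calling `plot_missing_heatmap(sample)` renders the mask. Columns whose dark cells line up on the same rows, as `income` and `employer` do here, suggest the values are missing for a shared reason.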

In [None]:
# write your code from here

In [5]:
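# Validate one specific column using its constraints
col = 'column_name_1'  # column to re-check; set explicitly so the cell does not rely on a leftover loop variable
if col in constraints: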
    col_constraints = constraints[col]
    if col in df.columns:
        validation_results[col] = validate_column(df[col], col_constraints)
    else:
        validation_results[col] = False  # Column not found in the dataset
else:
    print(f"No constraints defined for column: {col}")

# Print the validation result for the specific column
print(f"Validation Result for {col}: {'Valid' if validation_results.get(col, False) else 'Invalid'}")

NameError: name 'df' is not defined

### Task 4: Integrate Automated Data Quality Checks
**Description**: Integrate automated data quality checks using the Great Expectations library for a dataset.

**Steps**:
1. Install and initialize Great Expectations.
2. Set up Great Expectations.
3. Add further checks and validate.
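
Great Expectations' setup and API calls differ between major versions, so no library calls are shown here. As a library-independent sketch of what "add further checks and validate" amounts to, the plain-pandas code below mimics the declarative expectation pattern: each check is declared as data, run against the DataFrame, and reported with a success flag. The helper names (`expect_not_null`, `expect_between`, `run_suite`) and the sample data are hypothetical and are not Great Expectations functions.

```python
import pandas as pd

def expect_not_null(df, column):
    """Pass when the column contains no missing values."""
    unexpected = int(df[column].isnull().sum())
    return {"check": f"not_null({column})",
            "success": unexpected == 0,
            "unexpected_count": unexpected}

def expect_between(df, column, min_value, max_value):
    """Pass when every non-null value lies in [min_value, max_value]."""
    values = df[column].dropna()
    unexpected = int((~values.between(min_value, max_value)).sum())
    return {"check": f"between({column}, {min_value}, {max_value})",
            "success": unexpected == 0,
            "unexpected_count": unexpected}

def run_suite(df, suite):
    """Run each (check function, keyword arguments) pair and collect the results."""
    results = [check(df, **kwargs) for check, kwargs in suite]
    return {"success": all(r["success"] for r in results), "results": results}

# Hypothetical sample data and suite, for illustration only
sample = pd.DataFrame({"age": [25, 40, 130, None], "name": ["a", "b", "c", "d"]})
suite = [
    (expect_not_null, {"column": "name"}),
    (expect_not_null, {"column": "age"}),
    (expect_between, {"column": "age", "min_value": 0, "max_value": 120}),
]
report = run_suite(sample, suite)
for r in report["results"]:
    status = "PASS" if r["success"] else "FAIL"
    print(f"{status}  {r['check']}  unexpected={r['unexpected_count']}")
```

Keeping the suite as a list of declarations makes it easy to extend and to rerun automatically whenever new data arrives, which is the role an expectation suite plays in a dedicated tool.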

In [None]:
# write your code from here
