# China Cancer Patient Records - Data Analysis

This notebook analyzes the synthetic China cancer patient dataset. We'll load the data, explore its structure, and compute initial summary statistics.

**Dataset Source:** Kaggle - China Cancer Patient Records
**File:** china_cancer_patients_synthetic.csv

## 1. Import Required Libraries

Import pandas and other necessary libraries for data analysis and visualization.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Set style for plots
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 2. Define File Path

Specify the file path for the CSV file containing the China cancer patient data.

In [None]:
# Define the path to the dataset
data_path = "/Users/f/.cache/kagglehub/datasets/ak0212/china-cancer-patient-records/versions/1/china_cancer_patients_synthetic.csv"

print(f"Data path: {data_path}")
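
The absolute path above only resolves on one machine. A minimal sketch of a more portable variant follows, assuming the same kagglehub cache layout sits under the current user's home directory. The helper name `build_data_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Hypothetical helper: rebuilds the cache path relative to a home directory
# instead of hard-coding one user's folder. It assumes the kagglehub cache
# layout seen in the path above.
def build_data_path(home: Path) -> Path:
    return (
        home / ".cache" / "kagglehub" / "datasets" / "ak0212"
        / "china-cancer-patient-records" / "versions" / "1"
        / "china_cancer_patients_synthetic.csv"
    )

portable_path = build_data_path(Path.home())
print(f"Data path: {portable_path}")
# Check for the file before trying to read it
print(f"File exists: {portable_path.exists()}")
```

If the file is reported as missing, the dataset has probably not been downloaded to this machine yet, or the cache lives somewhere else.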

## 3. Read CSV File

Load the cancer patient data into a pandas DataFrame for analysis.

In [None]:
# Read the CSV file into a DataFrame
df = pd.read_csv(data_path)

print("Dataset loaded successfully!")
print(f"Shape of the dataset: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
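
A load can succeed and still return no rows, for example when the file contains only a header. The sketch below wraps `pd.read_csv` in a small check. The wrapper `load_csv` is a hypothetical helper, and the two-row sample with its column names is invented stand-in data, not the real dataset.

```python
import io

import pandas as pd

# Hypothetical wrapper: reads a CSV and fails early if it has no rows.
def load_csv(source) -> pd.DataFrame:
    frame = pd.read_csv(source)
    if frame.empty:
        raise ValueError("CSV was read but contains no rows")
    return frame

# Invented stand-in data so the sketch runs without the real file
sample_csv = io.StringIO("patient_id,age\nP001,54\nP002,61\n")
sample_df = load_csv(sample_csv)
print(f"Shape of the sample: {sample_df.shape}")
```

In the notebook itself, the same function could be called as `load_csv(data_path)`.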

## 4. Preview Data

Display the first few rows and basic information about the dataset.

In [None]:
# Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display basic information about the dataset
print("Dataset Information:")
print("=" * 50)
df.info()

In [None]:
# Display column names and their data types
print("Column names and data types:")
print("=" * 40)
for i, (col, dtype) in enumerate(df.dtypes.items(), 1):
    print(f"{i:2d}. {col:<25} - {dtype}")
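
The printed list above can also be built as a small summary table, which is easier to sort and filter. This is a sketch under stated assumptions: `summarize_columns` is a hypothetical helper, and the toy frame uses invented columns in place of the real dataset.

```python
import pandas as pd

# Hypothetical helper: one row per column with its dtype, non-null count,
# and number of distinct non-null values.
def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(),
    })

# Invented toy data; the real column names may differ
toy = pd.DataFrame({
    "age": [54, 61, None],
    "province": ["Hunan", "Hunan", "Sichuan"],
})
summary = summarize_columns(toy)
print(summary)
```

On the real data this would be `summarize_columns(df)`.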

In [None]:
# Display descriptive statistics for numerical columns
print("Descriptive Statistics:")
print("=" * 50)
df.describe()
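
When a DataFrame has numeric columns, `df.describe()` summarizes only those by default. The sketch below shows two complementary checks on an invented toy frame (the column names are assumptions, not taken from the dataset): missing-value counts per column, and `describe(include="all")` to cover text columns as well.

```python
import pandas as pd

# Invented toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [54, 61, 47],
    "province": ["Hunan", "Hunan", None],
})

# Count missing values in each column
missing = toy.isna().sum()
print(missing)

# include="all" adds count/unique/top/freq rows for non-numeric columns
all_stats = toy.describe(include="all")
print(all_stats)
```

Applied to the notebook's data, the equivalent calls would be `df.isna().sum()` and `df.describe(include="all")`.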