# Cell Types Project Group 9
*Replace this with the name of your project*

## Team Member Names & Contributions
*Feel free to name your team, but please also include your real names and IDs here. Please specify who in your group worked on which parts of the project.*

- **Adya Pidara**: You know, blowing up things and such.
- **Brandon Huynh**: Cleverly sneaking into small spaces
- **Mariel Tampubolon**: AKA The "Muscle"

## Abstract

*Fill in your 3-4 sentence abstract here*

## Research Question

*Fill in your research question here*

## Background and Prior Work

*Fill in your background and prior work here (~500 words). You are welcome to use additional subheadings. You should also include a paragraph describing each dataset and how you'll be using them.* 

### References (include links):
(1)

(2)

## Hypothesis


*Fill in your hypotheses here*

## Setup
*Are there packages that need to be imported, or datasets that need to be downloaded?*

In [1]:
%pip install pandas
%pip install matplotlib
%pip install scipy
%pip install seaborn

Collecting pandas
  Downloading pandas-3.0.0-cp314-cp314-macosx_11_0_arm64.whl.metadata (79 kB)
Collecting numpy>=2.3.3 (from pandas)
  Downloading numpy-2.4.2-cp314-cp314-macosx_14_0_arm64.whl.metadata (6.6 kB)
Downloading pandas-3.0.0-cp314-cp314-macosx_11_0_arm64.whl (9.9 MB)
Downloading numpy-2.4.2-cp314-cp314-macosx_14_0_arm64.whl (5.2 MB)
Installing collected packages: numpy, pandas
Successfully installed numpy-2.4.2 pandas-3.0.0
Note: you may need to restart the kernel to use updated packages.
Collecting matplotlib
  Downloading matplotlib-3.10.8-cp314-cp314-macosx_11_0_arm64.whl.metadata (52 kB)
Coll

## Data Wrangling

In [None]:
import pandas as pd

platform_file = "GPL96-57554.txt"  # your GPL file

# Read first few rows to inspect
platform_preview = pd.read_csv(platform_file, sep="\t", nrows=5)
print(platform_preview.columns)



Index(['#ID = Affymetrix Probe Set ID'], dtype='str')
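
The single column name printed above, `#ID = Affymetrix Probe Set ID`, suggests that `read_csv` treated the first `#` annotation line of the GPL file as the header row. The following is a minimal sketch of one way to skip those lines, assuming every annotation line before the real header starts with `#`. The helper name `read_gpl_table` and the toy file contents are made up for illustration and are not part of pandas or the real platform file.

```python
from io import StringIO

import pandas as pd


def read_gpl_table(handle):
    """Read a tab-separated GPL table, skipping leading '#' comment lines.

    Hypothetical helper for illustration; not a pandas function.
    """
    lines = handle.readlines()
    # Index of the first line that is not a '#' comment: the real header row.
    start = next(
        (i for i, line in enumerate(lines) if not line.startswith("#")), None
    )
    if start is None:
        raise ValueError("No header row found after the comment lines")
    return pd.read_csv(StringIO("".join(lines[start:])), sep="\t", dtype=str)


# Tiny made-up stand-in for a GPL file (not real annotation data).
example_text = (
    "#ID = Affymetrix Probe Set ID\n"
    "#Gene Symbol = Gene symbol\n"
    "ID\tGene Symbol\n"
    "probe_1\tGENE_A\n"
    "probe_2\tGENE_B\n"
)
example_df = read_gpl_table(StringIO(example_text))
print(example_df.columns.tolist())
```

With a real file, the same helper could be called as `read_gpl_table(open(platform_file))`. Skipping by line position is used here rather than `read_csv(..., comment="#")` because the `comment` option also cuts off any data line at the first `#` it contains, which could truncate annotation fields.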


Describe your data wrangling steps here.

In [21]:
import pandas as pd
import numpy as np
from scipy import stats
from io import StringIO
import gzip

# -----------------------------
# 0. File setup
# -----------------------------
series_matrix_file = "GSE7621_series_matrix.txt.gz"
platform_file = "GPL570.txt"  # your GPL file

# -----------------------------
# 1. Extract !Sample_title from the series matrix
# -----------------------------
sample_title_row = None
with gzip.open(series_matrix_file, 'rt') as f:
    for line in f:
        line = line.strip()
        if line.startswith("!Sample_title"):
            sample_title_row = [x.replace('"','').strip() for x in line.split("\t")[1:]]
            break

if sample_title_row is None:
    raise ValueError("Could not find !Sample_title line in the series matrix!")

print(f"Found {len(sample_title_row)} samples.")

# -----------------------------
# 2. Load the expression table
# -----------------------------
data_lines = []
header = None
with gzip.open(series_matrix_file, 'rt') as f:
    table_started = False
    for line in f:
        line = line.strip()
        if line == "!series_matrix_table_begin":
            table_started = True
            continue
        if line == "!series_matrix_table_end":
            break
        if table_started:
            if "ID_REF" in line and header is None:
                header = [col.replace('"','').strip() for col in line.split("\t")]
            elif not line.startswith("!"):  # data rows
                data_lines.append(line)

if header is None:
    raise ValueError("Header line (ID_REF) not found!")

rows = [line.split("\t") for line in data_lines]
df = pd.DataFrame(rows, columns=header)

# Clean ID_REF column
df["ID_REF"] = df["ID_REF"].str.replace('"','').str.strip()

# Convert expression values to float
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# -----------------------------
# 3. Map samples to PD vs Control
# -----------------------------
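# Titles were already cleaned in step 1; zip() below would silently truncate on a length mismatch
if len(sample_title_row) != len(df.columns) - 1:
    raise ValueError("Number of sample titles does not match number of sample columns!")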
control_samples = [col for col, title in zip(df.columns[1:], sample_title_row) if "normal" in title.lower()]
pd_samples = [col for col, title in zip(df.columns[1:], sample_title_row) if "pd" in title.lower()]

print(f"Control samples ({len(control_samples)}): {control_samples}")
print(f"PD samples ({len(pd_samples)}): {pd_samples}")

# -----------------------------
# 4. Load GPL platform table robustly
# -----------------------------
with open(platform_file, 'r') as f:
    lines = f.readlines()

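# Skip all comment lines starting with #
header_line_index = None
for i, line in enumerate(lines):
    if not line.startswith("#"):
        header_line_index = i
        break

if header_line_index is None:
    raise ValueError("No table header found in the GPL platform file!")

table_lines = lines[header_line_index:]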
table_str = "".join(table_lines)
platform_df = pd.read_csv(StringIO(table_str), sep="\t", dtype=str)

# Clean ID column
id_column = "ID"
platform_df[id_column] = platform_df[id_column].str.replace('"','').str.strip()

gene_column = "Gene Symbol"
platform_df[gene_column] = platform_df[gene_column].str.strip()

print("Columns in platform table:", platform_df.columns.tolist())

# -----------------------------
# 5. Find CACNA1D probes
# -----------------------------
cacna1d_probes = platform_df[platform_df[gene_column].str.contains("CACNA1D", na=False)][id_column].tolist()
print(f"CACNA1D probes found in GPL: {cacna1d_probes}")

# -----------------------------
# 6. Filter expression data for CACNA1D probes
# -----------------------------
cacna1d_df = df[df["ID_REF"].isin(cacna1d_probes)]
print(f"CACNA1D probes found in expression data: {cacna1d_df['ID_REF'].tolist()}")

if cacna1d_df.empty:
    print("No CACNA1D probes matched in expression data!")
else:
    # 6a. Mean expression per probe
    print("\nMean expression per probe:")
    probe_means = []
    for idx, row in cacna1d_df.iterrows():
        mean_ctrl = row[control_samples].mean()
        mean_pd = row[pd_samples].mean()
        probe_means.append((mean_ctrl, mean_pd))
        print(f"{row['ID_REF']}: Control={mean_ctrl:.3f}, PD={mean_pd:.3f}")

    # 6b. Average across all probes
    mean_ctrl_all = np.mean([c for c, p in probe_means])
    mean_pd_all = np.mean([p for c, p in probe_means])
    print(f"\nAverage across all probes: Control={mean_ctrl_all:.3f}, PD={mean_pd_all:.3f}")

    # 6c. Log2 fold change
    log2fc = np.log2(mean_pd_all + 1e-9) - np.log2(mean_ctrl_all + 1e-9)
    print(f"Log2 fold change (PD vs Control): {log2fc:.3f}")

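    # 6d. T-test across all probe values (pooled: each sample contributes one value
    # per probe, so the values are not independent and the p-value may be overconfident)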
    control_values = cacna1d_df[control_samples].values.flatten()
    pd_values = cacna1d_df[pd_samples].values.flatten()
    t_stat, p_val = stats.ttest_ind(pd_values, control_values, equal_var=False, nan_policy='omit')
    print(f"T-test (all probes combined): t={t_stat:.3f}, p={p_val:.4g}")


Found 25 samples.
Control samples (9): ['GSM184354', 'GSM184355', 'GSM184356', 'GSM184357', 'GSM184358', 'GSM184359', 'GSM184360', 'GSM184361', 'GSM184362']
PD samples (16): ['GSM184363', 'GSM184364', 'GSM184365', 'GSM184366', 'GSM184367', 'GSM184368', 'GSM184369', 'GSM184370', 'GSM184371', 'GSM184372', 'GSM184373', 'GSM184374', 'GSM184375', 'GSM184376', 'GSM184377', 'GSM184378']
Columns in platform table: ['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date', 'Sequence Type', 'Sequence Source', 'Target Description', 'Representative Public ID', 'Gene Title', 'Gene Symbol', 'ENTREZ_GENE_ID', 'RefSeq Transcript ID', 'Gene Ontology Biological Process', 'Gene Ontology Cellular Component', 'Gene Ontology Molecular Function']
CACNA1D probes found in GPL: ['1555993_at', '207998_s_at', '210108_at', '243334_at']
CACNA1D probes found in expression data: ['1555993_at', '207998_s_at', '210108_at', '243334_at']

Mean expression per probe:
1555993_at: Control=44.414, PD=55.763
207
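
Step 6d pools the values of all four CACNA1D probes into one test, so each sample is counted once per probe. One possible alternative, sketched here under assumptions, is to run Welch's t-test separately for each probe so that every sample contributes exactly one value per test. The helper name `per_probe_welch`, its parameters, and the toy numbers are hypothetical and are not taken from GSE7621.

```python
import numpy as np
import pandas as pd
from scipy import stats


def per_probe_welch(expr, control_cols, case_cols, id_col="ID_REF"):
    """Run Welch's t-test separately for each probe (row) of `expr`.

    Hypothetical helper for illustration: one test per probe, so each
    sample contributes exactly one value to each test.
    """
    records = []
    for _, row in expr.iterrows():
        ctrl = row[control_cols].to_numpy(dtype=float)
        case = row[case_cols].to_numpy(dtype=float)
        t_stat, p_val = stats.ttest_ind(
            case, ctrl, equal_var=False, nan_policy="omit"
        )
        mean_ctrl = np.nanmean(ctrl)
        mean_case = np.nanmean(case)
        records.append({
            id_col: row[id_col],
            "mean_control": mean_ctrl,
            "mean_case": mean_case,
            # Assumes strictly positive, non-log-transformed intensities.
            "log2fc": np.log2(mean_case) - np.log2(mean_ctrl),
            "t": t_stat,
            "p": p_val,
        })
    return pd.DataFrame(records)


# Made-up toy values: probe_a doubles in the case group, probe_b does not change.
toy = pd.DataFrame({
    "ID_REF": ["probe_a", "probe_b"],
    "c1": [10.0, 5.0], "c2": [12.0, 5.5], "c3": [11.0, 4.5],
    "p1": [20.0, 5.2], "p2": [22.0, 4.8], "p3": [24.0, 5.0],
})
result = per_probe_welch(toy, ["c1", "c2", "c3"], ["p1", "p2", "p3"])
print(result)
```

In the notebook this could be called as `per_probe_welch(cacna1d_df, control_samples, pd_samples)`. Because it produces one p-value per probe, a correction for multiple comparisons would usually be applied before interpreting them.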

## Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## Conclusion & Discussion

*Fill in your discussion information here*