<style>
.analysis-title {
    color: #2563eb !important;
    font-size: 2.8rem;
    font-weight: 700;
    text-align: center;
    border-bottom: 4px solid #dbeafe;
    padding-bottom: 15px;
    margin-bottom: 25px;
    text-shadow: 2px 2px 4px rgba(0,0,0,0.1);
}
.metadata-box {
    background: linear-gradient(135deg, #f8fafc 0%, #e2e8f0 100%);
    border-left: 5px solid #3b82f6;
    padding: 20px;
    border-radius: 10px;
    margin: 20px 0;
    box-shadow: 0 4px 15px rgba(0,0,0,0.1);
}
.metadata-text {
    color: #1e293b;
    font-size: 1.1rem;
    line-height: 1.8;
    margin: 0;
}
.overview-header {
    color: #1d4ed8 !important;
    font-size: 2rem;
    font-weight: 600;
    border-left: 5px solid #3b82f6;
    padding-left: 15px;
    margin-top: 30px;
    margin-bottom: 15px;
}
.section-text {
    color: #374151;
    font-size: 1.05rem;
    line-height: 1.7;
    text-align: justify;
}
.subsection-header {
    color: #4338ca !important;
    font-size: 1.4rem;
    font-weight: 600;
    margin-top: 25px;
    margin-bottom: 10px;
    border-bottom: 2px solid #e0e7ff;
    padding-bottom: 5px;
}
.data-list {
    background-color: #f8fafc;
    border: 1px solid #e2e8f0;
    border-radius: 8px;
    padding: 15px;
    margin: 15px 0;
}
.data-list ul {
    margin: 0;
    color: #475569;
    font-size: 1rem;
}
.data-list li {
    margin-bottom: 8px;
    padding-left: 5px;
}
.data-list code {
    background-color: #f1f5f9;
    color: #dc2626;
    padding: 2px 6px;
    border-radius: 4px;
    font-weight: 500;
}
</style>

<h1 class="analysis-title">Genes with Damaging Mutations Analysis</h1>

<div class="metadata-box">
<p class="metadata-text">
<strong style="color: #1e40af;">Project:</strong> Computational Biology DMV Petri Dish<br>
<strong style="color: #1e40af;">Author:</strong> Chris Indorf<br>
<strong style="color: #1e40af;">Date:</strong> August 1, 2025<br>
<strong style="color: #1e40af;">Language:</strong> Python
</p>
</div>

<h2 class="overview-header">Overview</h2>

<p class="section-text">This notebook retrieves and visualizes genes with damaging mutations in lung cancer cell lines. It queries a PostgreSQL data warehouse for mutation data and draws an interactive bar chart of the results.</p>

<h3 class="subsection-header">Data Sources</h3>
<div class="data-list">
<ul>
<li><strong>Database:</strong> data_warehouse (PostgreSQL)</li>
<li><strong>Tables:</strong>
    <ul>
        <li><code>im_dep_sprime_damaging_mutations</code></li>
        <li><code>im_dep_raw_secondary_dose_curve</code></li>
    </ul>
</li>
</ul>
</div>

<h3 class="subsection-header">Libraries Used</h3>
<div class="data-list">
<ul>
<li><code>pandas</code> - Data manipulation and analysis</li>
<li><code>psycopg2</code> - PostgreSQL database connectivity</li>
<li><code>altair</code> - Statistical data visualization</li>
</ul>
</div>

In [1]:
# Import required libraries
import psycopg2
import pandas as pd
import altair as alt

<style>
.section-header {
    color: #1d4ed8 !important;
    font-size: 2rem;
    font-weight: 600;
    border-left: 5px solid #3b82f6;
    padding-left: 15px;
    margin-top: 30px;
    margin-bottom: 15px;
    background: linear-gradient(90deg, #f0f9ff 0%, transparent 100%);
    padding-top: 10px;
    padding-bottom: 10px;
}
.description-text {
    color: #374151;
    font-size: 1.05rem;
    line-height: 1.7;
    margin-bottom: 20px;
}
.process-box {
    background: linear-gradient(135deg, #ecfdf5 0%, #d1fae5 100%);
    border: 1px solid #a7f3d0;
    border-radius: 10px;
    padding: 20px;
    margin: 20px 0;
}
.process-list {
    color: #065f46;
    font-size: 1rem;
    margin: 0;
}
.process-list li {
    margin-bottom: 10px;
    font-weight: 500;
}
.code-highlight {
    background-color: #fef2f2;
    color: #dc2626;
    padding: 2px 6px;
    border-radius: 4px;
    font-family: 'Monaco', 'Consolas', monospace;
    font-weight: 600;
}
</style>

<h2 class="section-header">Data Retrieval</h2>

<p class="description-text">Connect to the PostgreSQL database and execute a query to retrieve genes with damaging mutations in lung cancer cell lines. The query performs the following operations:</p>

<div class="process-box">
<ol class="process-list">
<li><strong>Filters</strong> for <span class="code-highlight">mutation_value = 2</span> (damaging mutations)</li>
<li><strong>Restricts</strong> to lung cancer cell lines (<span class="code-highlight">ccle_name LIKE '%LUNG'</span>, i.e. names ending in <span class="code-highlight">LUNG</span>)</li>
<li><strong>Groups</strong> by <span class="code-highlight">gene_id</span> and counts affected cell lines</li>
<li><strong>Orders</strong> results by mutation count (descending) and gene_id</li>
</ol>
</div>
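
<p class="description-text">The four steps above can be tried without access to the warehouse. The following is a minimal sketch, not the notebook's actual pipeline: it runs the same query shape against an in-memory SQLite database from Python's standard library. All rows, gene IDs, and cell-line identifiers are invented for illustration; only the table and column names come from this notebook. The <span class="code-highlight">public.</span> schema prefix is dropped because SQLite has no such schema, and SQLite's <span class="code-highlight">LIKE</span> is case-insensitive for ASCII by default, whereas PostgreSQL's is case-sensitive.</p>

```python
import sqlite3

# Toy stand-ins for the two warehouse tables (every row here is made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE im_dep_sprime_damaging_mutations (
    gene_id INTEGER, cell_line TEXT, mutation_value INTEGER);
CREATE TABLE im_dep_raw_secondary_dose_curve (
    depmap_id TEXT, ccle_name TEXT);
""")
conn.executemany(
    "INSERT INTO im_dep_raw_secondary_dose_curve VALUES (?, ?)",
    [
        ("ID-1", "LINE_A_LUNG"),
        ("ID-1", "LINE_A_LUNG"),    # same cell line listed twice
        ("ID-2", "LINE_B_LUNG"),
        ("ID-3", "LINE_C_BREAST"),  # not a lung line
    ],
)
conn.executemany(
    "INSERT INTO im_dep_sprime_damaging_mutations VALUES (?, ?, ?)",
    [
        (101, "ID-1", 2),
        (101, "ID-2", 2),
        (102, "ID-1", 2),
        (102, "ID-2", 1),  # mutation_value != 2, filtered out
        (103, "ID-3", 2),  # damaging, but not in a lung line
        (100, "ID-2", 2),
    ],
)

query = """
SELECT gene_id, COUNT(cell_line)
FROM im_dep_sprime_damaging_mutations
WHERE mutation_value = 2
  AND cell_line IN (
    SELECT depmap_id
    FROM im_dep_raw_secondary_dose_curve
    WHERE ccle_name LIKE '%LUNG')
GROUP BY gene_id
ORDER BY COUNT(cell_line) DESC, gene_id;
"""

rows = conn.execute(query).fetchall()
for gene_id, n_cell_lines in rows:
    print(gene_id, n_cell_lines)

conn.close()
```

<p class="description-text">Because the lung filter is an <span class="code-highlight">IN</span> subquery rather than a join, the repeated <span class="code-highlight">ID-1</span> row in the toy dose-curve table does not inflate the count for any gene.</p>
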

In [2]:
# Establish database connection
conn = psycopg2.connect(
    host='dmvpetridishdatastore.dev',
    port=5432,
    database='data_warehouse',
    user='comp_bio_u2',
    password='ENTER PASSWORD HERE'  # Placeholder: replace with the actual password in this development version
)

# Define SQL query to retrieve genes with damaging mutations in lung cancer cell lines
query = """
SELECT gene_id, COUNT(cell_line) AS count
FROM public.im_dep_sprime_damaging_mutations
WHERE mutation_value = 2 
  AND cell_line IN (
    SELECT depmap_id 
    FROM public.im_dep_raw_secondary_dose_curve
    WHERE ccle_name LIKE '%LUNG')  
GROUP BY gene_id
ORDER BY COUNT(cell_line) DESC, gene_id;
"""

# Execute query and create DataFrame
df = pd.read_sql_query(query, conn)

# Display results
print("Cell line damaging mutations:")
print(df)

# Show distribution of mutation counts across genes
# Left column: number of cell lines with damaging mutations
# Right column: number of genes with that count
print(df['count'].value_counts())

# Close database connection
conn.close()



Cell line damaging mutations:
     gene_id  count
0       7115     73
1       8533     21
2       8291     13
3      17047     13
4       8520      9
..       ...    ...
920    18314      1
921    18331      1
922    18334      1
923    18579      1
924    18842      1

[925 rows x 2 columns]
count
1     835
2      67
3      12
4       2
6       2
13      2
73      1
21      1
5       1
9       1
8       1
Name: count, dtype: int64
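
<p class="description-text">The connection cell above keeps a password placeholder in the code. One possible alternative, sketched here under assumptions, reads the password from an environment variable and falls back to an interactive prompt. The variable name <span class="code-highlight">DMV_DB_PASSWORD</span> and the helper <span class="code-highlight">build_conn_params</span> are hypothetical; the host, port, database, and user are the values already used in this notebook. The sketch only builds the keyword arguments and does not open a connection.</p>

```python
import os
from getpass import getpass


def build_conn_params(env_var="DMV_DB_PASSWORD"):
    """Return keyword arguments for psycopg2.connect without a literal password.

    env_var is a hypothetical variable name chosen for this sketch.
    """
    password = os.environ.get(env_var)
    if password is None:
        # Fall back to a prompt so the password is never saved in the notebook.
        password = getpass("Database password: ")
    return {
        "host": "dmvpetridishdatastore.dev",
        "port": 5432,
        "database": "data_warehouse",
        "user": "comp_bio_u2",
        "password": password,
    }
```

<p class="description-text">With this helper, the connection call would become <span class="code-highlight">psycopg2.connect(**build_conn_params())</span>.</p>
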


<style>
.filtering-header {
    color: #1d4ed8 !important;
    font-size: 2rem;
    font-weight: 600;
    border-left: 5px solid #f59e0b;
    padding-left: 15px;
    margin-top: 30px;
    margin-bottom: 15px;
    background: linear-gradient(90deg, #fffbeb 0%, transparent 100%);
    padding-top: 10px;
    padding-bottom: 10px;
}
.filtering-explanation {
    background: linear-gradient(135deg, #fff7ed 0%, #fed7aa 100%);
    border: 1px solid #fdba74;
    border-radius: 10px;
    padding: 20px;
    margin: 20px 0;
    color: #9a3412;
    font-size: 1.05rem;
    line-height: 1.7;
}
.highlight-number {
    background-color: #fef3c7;
    color: #92400e;
    padding: 3px 8px;
    border-radius: 4px;
    font-weight: 700;
    font-size: 1.1rem;
}
</style>

<h2 class="filtering-header">Data Filtering</h2>

<div class="filtering-explanation">
<p>For visualization, we exclude the genes with a damaging mutation in only one cell line (<span class="highlight-number">835 genes</span>) and keep the genes mutated in two or more cell lines. Removing these singleton mutations makes the plot shorter and easier to read.</p>
</div>

In [3]:
# Filter data to exclude genes with only single cell line mutations
# This improves plot readability by focusing on recurrent mutations
dfplot = df[df['count'] > 1]

# Display filtered dataset
print(dfplot)

# Show updated distribution
dfplot['count'].value_counts()

    gene_id  count
0      7115     73
1      8533     21
2      8291     13
3     17047     13
4      8520      9
..      ...    ...
85    18033      2
86    18084      2
87    18115      2
88    18181      2
89    18189      2

[90 rows x 2 columns]


count
2     67
3     12
4      2
6      2
13     2
73     1
21     1
9      1
8      1
5      1
Name: count, dtype: int64

<style>
.viz-header {
    color: #1d4ed8 !important;
    font-size: 2rem;
    font-weight: 600;
    border-left: 5px solid #8b5cf6;
    padding-left: 15px;
    margin-top: 30px;
    margin-bottom: 15px;
    background: linear-gradient(90deg, #faf5ff 0%, transparent 100%);
    padding-top: 10px;
    padding-bottom: 10px;
}
.viz-description {
    color: #374151;
    font-size: 1.05rem;
    line-height: 1.7;
    margin-bottom: 20px;
}
.features-box {
    background: linear-gradient(135deg, #f3e8ff 0%, #ddd6fe 100%);
    border: 1px solid #c4b5fd;
    border-radius: 10px;
    padding: 20px;
    margin: 20px 0;
}
.features-title {
    color: #5b21b6;
    font-size: 1.3rem;
    font-weight: 600;
    margin-top: 0;
    margin-bottom: 15px;
}
.features-list {
    color: #6b21a8;
    font-size: 1rem;
    margin: 0;
}
.features-list li {
    margin-bottom: 12px;
    font-weight: 500;
}
.feature-highlight {
    background-color: #ede9fe;
    color: #5b21b6;
    padding: 2px 6px;
    border-radius: 4px;
    font-weight: 600;
}
</style>

<h2 class="viz-header">Data Visualization</h2>

<p class="viz-description">Create an interactive horizontal bar chart using Altair to visualize genes with damaging mutations across multiple lung cancer cell lines.</p>

<div class="features-box">
<h3 class="features-title">Chart Features:</h3>
<ul class="features-list">
<li><span class="feature-highlight">Logarithmic scale:</span> Accommodates the wide range of mutation counts (2-73)</li>
<li><span class="feature-highlight">Interactive zooming:</span> Pan and zoom functionality for detailed exploration</li>
<li><span class="feature-highlight">Count labels:</span> Numerical values displayed at the end of each bar</li>
<li><span class="feature-highlight">Sorted display:</span> Genes ordered by mutation frequency (highest to lowest)</li>
</ul>
</div>
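
<p class="viz-description">To illustrate why a logarithmic axis suits counts ranging from 2 to 73, the sketch below computes where a count would land on a log axis with domain [1, 80] and a width of 600 pixels, the values used in the chart code in this section, and compares it with a linear axis over the same domain. The helper <code>log_position</code> is hypothetical and only mimics the mapping of a log scale; the positions Altair actually renders may differ in detail.</p>

```python
import math


def log_position(count, domain=(1, 80), width=600):
    """Map a count to a pixel offset on a log-scaled axis (illustrative only)."""
    lo, hi = domain
    if count <= 0:
        raise ValueError("a log scale has no position for zero or negative values")
    span = math.log10(hi) - math.log10(lo)
    return width * (math.log10(count) - math.log10(lo)) / span


for count in (2, 13, 73):
    # Linear position over the same [1, 80] domain, for comparison.
    linear = 600 * (count - 1) / (80 - 1)
    print(f"count={count:>2}  log: {log_position(count):6.1f} px   linear: {linear:6.1f} px")
```

<p class="viz-description">On the linear axis a gene with a count of 2 gets a bar under 10 pixels long, while on the log axis it gets roughly 95 pixels, so the many low-count genes remain visible next to the gene with 73. The domain starts at 1 rather than 0 because zero has no position on a log scale.</p>
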

In [4]:
# Prepare data for visualization
data = dfplot.copy()
data = data.sort_values(by=['count', 'gene_id'], ascending=[False, True])
data['gene_id_str'] = data['gene_id'].astype(str)

# Create horizontal bar chart with rectangles
bars = alt.Chart(data).mark_rect(
    color='steelblue',
    stroke='white',
    strokeWidth=1
).encode(
    x=alt.X('count:Q',
            title='Number of cell lines with damaging mutations',
            scale=alt.Scale(type='log', domain=[1, 80])),
    x2=alt.value(1),  # Pin the bar start near the left edge; alt.value is a pixel offset, which lines up with x=1 at the default zoom
    y=alt.Y('gene_id_str:N',
            title='Gene',
            sort=alt.SortField(field='count', order='descending'),
            scale=alt.Scale(padding=0.5),
            axis=alt.Axis(labelLimit=150))  # Increase space for gene ID labels
)

# Add count labels at the end of each bar
text_labels = alt.Chart(data).mark_text(
    align='left',
    baseline='middle',
    dx=5,  # Offset text from bar end
    fontSize=10
).encode(
    x=alt.X('count:Q',
            scale=alt.Scale(type='log', domain=[1, 80])),
    y=alt.Y('gene_id_str:N',
            sort=alt.SortField(field='count', order='descending')),
    text=alt.Text('count:Q')
)

# Combine bars and labels, add interactivity
chart = (bars + text_labels).properties(
    width=600,
    height=1350,
    title='Genes with Damaging Mutations in More Than One Lung Cancer Cell Line'
).interactive()

# Display the chart
chart.show()

<style>
.results-header {
    color: #dc2626 !important;
    font-size: 2.2rem;
    font-weight: 700;
    text-align: center;
    border: 3px solid #fecaca;
    background: linear-gradient(135deg, #fef2f2 0%, #fee2e2 100%);
    padding: 20px;
    border-radius: 15px;
    margin: 30px 0 25px 0;
    text-shadow: 1px 1px 3px rgba(0,0,0,0.1);
}
.summary-intro {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    padding: 25px;
    border-radius: 12px;
    margin: 25px 0;
    box-shadow: 0 8px 25px rgba(0,0,0,0.15);
}
.summary-text {
    font-size: 1.1rem;
    line-height: 1.7;
    margin: 0;
}
.highlight-stat {
    background-color: #fde68a;
    color: #92400e;
    padding: 4px 10px;
    border-radius: 6px;
    font-weight: 700;
    font-size: 1.15rem;
}
.key-findings {
    background-color: #f0f9ff;
    border: 2px solid #bfdbfe;
    border-radius: 10px;
    padding: 20px;
    margin: 20px 0;
}
.findings-title {
    color: #1e40af !important;
    font-size: 1.4rem;
    font-weight: 600;
    margin-top: 0;
    margin-bottom: 15px;
    border-bottom: 2px solid #dbeafe;
    padding-bottom: 8px;
}
.findings-list {
    color: #1e3a8a;
    font-size: 1rem;
    margin: 0;
}
.findings-list li {
    margin-bottom: 10px;
    margin-left: 0;
    border-left: 3px solid #60a5fa;
    padding-top: 5px;
    padding-bottom: 5px;
    padding-left: 12px;
    background-color: #f8fafc;
}
.data-distribution {
    background: linear-gradient(135deg, #ecfdf5 0%, #d1fae5 100%);
    border: 2px solid #a7f3d0;
    border-radius: 10px;
    padding: 20px;
    margin: 20px 0;
}
.distribution-title {
    color: #065f46 !important;
    font-size: 1.4rem;
    font-weight: 600;
    margin-top: 0;
    margin-bottom: 15px;
}
.distribution-list {
    color: #047857;
    font-size: 1.1rem;
    font-weight: 500;
    margin: 0;
}
.distribution-list li {
    background-color: #f0fdf4;
    margin-bottom: 8px;
    padding: 10px 15px;
    border-radius: 6px;
    border-left: 4px solid #10b981;
}
.conclusion-box {
    background: linear-gradient(135deg, #fef7ff 0%, #fae8ff 100%);
    border: 2px solid #d8b4fe;
    border-radius: 12px;
    padding: 25px;
    margin: 25px 0;
    color: #581c87;
    font-size: 1.05rem;
    line-height: 1.8;
    font-style: italic;
}
.emphasis-text {
    background-color: #e879f9;
    color: #ffffff;
    padding: 3px 8px;
    border-radius: 4px;
    font-weight: 600;
    font-style: normal;
}
</style>

<h2 class="results-header">Results Summary</h2>

<div class="summary-intro">
<p class="summary-text">The analysis identified <span class="highlight-stat">925 genes</span> with damaging mutations in lung cancer cell lines. Of these, 835 are mutated in only one cell line and 90 in two or more.</p>
</div>

<div class="key-findings">
<h3 class="findings-title">Key Findings</h3>
<ul class="findings-list">
<li><strong>Gene 7115</strong> shows the highest frequency with <strong>73 affected cell lines</strong></li>
<li><strong>90 genes</strong> have mutations in 2 or more cell lines (displayed in the chart)</li>
<li><strong>835 genes</strong> have mutations in only a single cell line (filtered from visualization)</li>
<li>Most of the recurrently mutated genes (<strong>67 of 90</strong>) are mutated in exactly 2 cell lines</li>
</ul>
</div>

<div class="data-distribution">
<h3 class="distribution-title">Data Distribution</h3>
<ul class="distribution-list">
<li><strong>2 cell lines:</strong> 67 genes</li>
<li><strong>3 cell lines:</strong> 12 genes</li>
<li><strong>4+ cell lines:</strong> 11 genes</li>
</ul>
</div>
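
<p class="section-text">The bins above follow directly from the <code>value_counts()</code> output printed in the Data Retrieval section. As a check, this sketch re-derives them from that distribution. The dictionary is copied from the printed output; the variable names are illustrative.</p>

```python
# Distribution copied from the value_counts() output:
# key = number of cell lines, value = number of genes with that count.
distribution = {1: 835, 2: 67, 3: 12, 4: 2, 6: 2, 13: 2, 73: 1, 21: 1, 5: 1, 9: 1, 8: 1}

total_genes = sum(distribution.values())
singletons = distribution[1]
recurrent = total_genes - singletons
four_plus = sum(n for cells, n in distribution.items() if cells >= 4)

print(f"total genes:      {total_genes}")
print(f"single cell line: {singletons} ({singletons / total_genes:.1%})")
print(f"2+ cell lines:    {recurrent}")
print(f"4+ cell lines:    {four_plus}")
```
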

<div class="conclusion-box">
<p>This pattern suggests that most genes carry a damaging mutation in only one cell line, while a small number are recurrently mutated across lung cancer cell lines and may be candidate <span class="emphasis-text">key driver genes</span> in lung cancer pathogenesis.</p>
</div>