# BUAN 446 Final Project
## Comprehensive Data Analysis: Lehigh Student Success Study

---

**Name:** _____________________

**Date:** _____________________

---

### Instructions

This project assesses your ability to apply the complete data analysis workflow using NumPy, pandas, and visualization tools. You will conduct a comprehensive analysis of the Lehigh student dataset and produce professional-quality deliverables.

**Time Estimate:** 2-3 hours

**Rules:**
- You **MAY** use AI tools for this project (this is a RUN-level assessment)
- You **MUST** document your AI usage in Part 6
- You **MUST** be able to explain every line of code you submit
- You may **NOT** collaborate with other students
- Submit your completed notebook via CourseConnect by the deadline

**Grading:**
- Part 1: Data Loading and Exploration (15 points)
- Part 2: NumPy Analysis (20 points)
- Part 3: Pandas Analysis (25 points)
- Part 4: Data Visualization (25 points)
- Part 5: Executive Summary (10 points)
- Part 6: AI Collaboration Log (5 points)
- **Total: 100 points**

**Tips:**
- Read each task completely before coding
- Your visualizations should be publication-quality (titles, labels, appropriate chart types)
- The executive summary should be written for a non-technical audience
- If you use AI, focus on understanding the code, not just generating it

---

## Dataset Overview

You will work with `lehigh_students_clean.csv`, which contains 600 student records:

| Column | Type | Description |
|--------|------|-------------|
| Student_ID | String | Unique identifier (e.g., "LU100001") |
| College | String | One of 5 Lehigh colleges |
| Major | String | Student's declared major |
| Class_Year | String | Freshman, Sophomore, Junior, Senior, or Graduate |
| GPA | Float | Grade point average (0.0 - 4.0) |
| Credits_Attempted | Integer | Total credits attempted |
| Credits_Earned | Integer | Total credits earned |

**Research Questions:**
1. Are there significant performance differences across colleges?
2. Do students improve academically as they progress through their studies?
3. What factors are associated with at-risk students?
4. What actionable recommendations can we provide to university administration?

---

# Part 1: Data Loading and Exploration (15 points)

Load the data and perform initial exploration to understand its structure and quality.

### Task 1.1: Load and Inspect the Data (8 points)

1. Import the necessary libraries (pandas, numpy, matplotlib, seaborn)
2. Load `lehigh_students_clean.csv` into a DataFrame called `df`
3. Display the first 10 rows
4. Show the DataFrame's shape, column names, and data types
5. Generate summary statistics for numerical columns
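
The inspection steps above can be sketched on a tiny stand-in dataset. The three rows and the college labels "College A/B/C" below are made up for illustration; for the project itself you would pass the real file name to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real CSV so the sketch runs anywhere.
sample_csv = io.StringIO(
    "Student_ID,College,GPA,Credits_Attempted,Credits_Earned\n"
    "LU100001,College A,3.62,60,60\n"
    "LU100002,College B,2.85,45,42\n"
    "LU100003,College C,1.90,30,24\n"
)

# For the project: pd.read_csv("lehigh_students_clean.csv")
demo = pd.read_csv(sample_csv)

print(demo.head(10))          # first rows (up to 10)
print(demo.shape)             # (number of rows, number of columns)
print(demo.columns.tolist())  # column names
print(demo.dtypes)            # data type of each column
print(demo.describe())        # summary statistics for numeric columns
```

`describe()` covers only numeric columns by default, which matches step 5.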

In [None]:
# Your code here






### Task 1.2: Create Derived Variables (7 points)

Add the following columns to your DataFrame:

1. `Completion_Rate`: Credits_Earned / Credits_Attempted * 100 (rounded to 1 decimal place)
2. `Academic_Standing`: Categorize based on GPA:
   - "Dean's List" if GPA >= 3.5
   - "Good Standing" if GPA >= 2.0 (but < 3.5)
   - "Probation" if GPA < 2.0

After creating these columns, display the value counts for Academic_Standing.
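
One possible way to build both columns is shown below on made-up rows. It uses `np.select`, which is only one of several options (`pd.cut` or a custom function with `apply` would also work); the frame name `demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows chosen to hit each threshold, including the boundary values.
demo = pd.DataFrame({
    "GPA": [3.80, 3.50, 2.00, 1.95],
    "Credits_Attempted": [60, 45, 30, 16],
    "Credits_Earned": [60, 42, 27, 12],
})

# Percentage of attempted credits that were earned, rounded to 1 decimal place.
demo["Completion_Rate"] = (
    demo["Credits_Earned"] / demo["Credits_Attempted"] * 100
).round(1)

# np.select checks the conditions in order and uses the first one that is True,
# so the second condition only applies to rows with GPA below 3.5.
conditions = [demo["GPA"] >= 3.5, demo["GPA"] >= 2.0]
labels = ["Dean's List", "Good Standing"]
demo["Academic_Standing"] = np.select(conditions, labels, default="Probation")

print(demo["Academic_Standing"].value_counts())
```

Including rows exactly at 3.5 and 2.0 is a quick way to confirm that the `>=` boundaries behave as the task describes.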

In [None]:
# Your code here






---

# Part 2: NumPy Analysis (20 points)

Use NumPy for numerical analysis of the student data.

### Task 2.1: GPA Statistical Analysis (10 points)

Extract the GPA column as a NumPy array and calculate:

1. Mean, median, and standard deviation
2. Minimum and maximum values
3. The 25th, 50th, and 75th percentiles
4. The number of students with GPA above 3.5 (using boolean indexing)
5. The number of students with GPA below 2.0 (using boolean indexing)

Print all results with appropriate labels and formatting.
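
As a rough illustration of these calculations, the sketch below runs them on a five-value made-up array. In the project the array would come from the DataFrame, for example with `df["GPA"].to_numpy()`.

```python
import numpy as np

# Made-up GPA values; in the project use df["GPA"].to_numpy() instead.
gpa = np.array([1.8, 2.4, 3.0, 3.6, 3.9])

print(f"Mean:   {gpa.mean():.3f}")
print(f"Median: {np.median(gpa):.3f}")
print(f"Std:    {gpa.std():.3f}")  # NumPy default is the population std (ddof=0)
print(f"Min:    {gpa.min():.2f}")
print(f"Max:    {gpa.max():.2f}")

# Several percentiles can be requested in a single call.
p25, p50, p75 = np.percentile(gpa, [25, 50, 75])
print(f"25th: {p25:.2f}, 50th: {p50:.2f}, 75th: {p75:.2f}")

# Boolean indexing: the comparison yields a True/False mask that selects elements.
high = gpa[gpa > 3.5]
low = gpa[gpa < 2.0]
print(f"Above 3.5: {len(high)}")
print(f"Below 2.0: {len(low)}")
```

Note that `np.std` divides by n by default while pandas' `Series.std` divides by n - 1, so the two can give slightly different results on the same data.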

In [None]:
import numpy as np

# Your code here






### Task 2.2: Outlier Detection (10 points)

Identify GPA outliers using the IQR method:

1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
2. Calculate the IQR (Q3 - Q1)
3. Define outlier boundaries as:
   - Lower bound: Q1 - 1.5 * IQR
   - Upper bound: Q3 + 1.5 * IQR
4. Count how many students have GPAs below the lower bound
5. Count how many students have GPAs above the upper bound
6. Print the outlier boundaries and counts

**Interpretation:** Write 2-3 sentences explaining what this tells us about the GPA distribution.
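
The IQR steps above can be sketched as follows. The nine values are invented so that exactly one of them falls below the lower bound; they are not drawn from the project dataset.

```python
import numpy as np

# Invented values with one obviously low observation (0.5).
values = np.array([0.5, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Summing a boolean mask counts the True entries.
n_low = int((values < lower).sum())
n_high = int((values > upper).sum())

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(f"Below lower bound: {n_low}, above upper bound: {n_high}")
```

When the data has a fixed ceiling, as GPA does at 4.0, the upper bound can land at or above that ceiling, in which case no high outliers are possible.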

In [None]:
# Your code here






**Your interpretation:**

*Write 2-3 sentences here*



---

# Part 3: Pandas Analysis (25 points)

Use pandas for grouped analysis and comparisons.

### Task 3.1: College Performance Comparison (10 points)

Create a summary table showing for each college:
- Number of students
- Average GPA (rounded to 3 decimal places)
- Average Completion Rate (rounded to 1 decimal place)
- Number of students on Dean's List
- Number of students on Probation

Sort the results by average GPA (highest to lowest) and display the table.
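
A summary table of this shape can be built with pandas named aggregation. The sketch below is one approach, run on invented rows; the output column names (`Students`, `Avg_GPA`, and so on) and the college labels "A" and "B" are placeholders.

```python
import pandas as pd

# Invented rows; "A" and "B" are placeholder college labels.
demo = pd.DataFrame({
    "College": ["A", "A", "B", "B", "B"],
    "GPA": [3.9, 1.8, 3.0, 3.6, 2.5],
    "Completion_Rate": [100.0, 70.0, 95.0, 98.0, 90.0],
    "Academic_Standing": [
        "Dean's List", "Probation", "Good Standing", "Dean's List", "Good Standing",
    ],
})

summary = (
    demo.groupby("College")
    .agg(
        Students=("GPA", "count"),  # counts non-missing GPA values per college
        Avg_GPA=("GPA", "mean"),
        Avg_Completion=("Completion_Rate", "mean"),
        Deans_List=("Academic_Standing", lambda s: (s == "Dean's List").sum()),
        Probation=("Academic_Standing", lambda s: (s == "Probation").sum()),
    )
    .round({"Avg_GPA": 3, "Avg_Completion": 1})  # per-column rounding
    .sort_values("Avg_GPA", ascending=False)
)

print(summary)
```

Each keyword in `agg` becomes an output column, defined by a (source column, function) pair, which keeps the whole table in a single chained expression.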

In [None]:
# Your code here






### Task 3.2: Class Year Progression Analysis (8 points)

Analyze whether students improve academically as they progress:

1. Calculate the average GPA for each Class_Year
2. Order the results logically: Freshman → Sophomore → Junior → Senior → Graduate
3. Calculate the GPA difference between Seniors and Freshmen
4. Print the results and the calculated improvement

**Interpretation:** Is there evidence that students improve over time? Write 2-3 sentences.
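
By default `groupby` sorts group labels alphabetically, which would put Freshman before Graduate before Junior. A minimal sketch of one way to impose the logical order, using `reindex` on invented data:

```python
import pandas as pd

# Invented rows, deliberately out of order.
demo = pd.DataFrame({
    "Class_Year": ["Senior", "Freshman", "Graduate", "Junior",
                   "Sophomore", "Freshman", "Senior"],
    "GPA": [3.4, 2.8, 3.7, 3.1, 3.0, 3.0, 3.2],
})

year_order = ["Freshman", "Sophomore", "Junior", "Senior", "Graduate"]

# reindex rearranges the alphabetically sorted result into the order we specify.
avg_by_year = demo.groupby("Class_Year")["GPA"].mean().reindex(year_order)

senior_minus_freshman = avg_by_year["Senior"] - avg_by_year["Freshman"]

print(avg_by_year.round(3))
print(f"Senior - Freshman difference: {senior_minus_freshman:+.3f}")
```

An ordered `pd.Categorical` is an alternative that also makes later plots respect the same order.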

In [None]:
# Your code here






**Your interpretation:**

*Write 2-3 sentences here*



### Task 3.3: At-Risk Student Profile (7 points)

Create a profile of at-risk students (those on Academic Probation):

1. Filter the DataFrame to only students with Academic_Standing == "Probation"
2. Calculate what percentage of total students are at-risk
3. Show the distribution of at-risk students by College
4. Show the distribution of at-risk students by Class_Year
5. Calculate the average Completion_Rate for at-risk vs. all students

**Interpretation:** What patterns do you notice? Write 2-3 sentences.
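
The filtering and comparison steps might look like the sketch below. The five rows and the college labels are invented, and the `Academic_Standing` column is assumed to exist already (it is created in Task 1.2).

```python
import pandas as pd

# Invented rows; assumes Academic_Standing and Completion_Rate already exist.
demo = pd.DataFrame({
    "College": ["A", "B", "B", "A", "B"],
    "Academic_Standing": [
        "Probation", "Good Standing", "Probation", "Dean's List", "Probation",
    ],
    "Completion_Rate": [70.0, 95.0, 80.0, 100.0, 75.0],
})

# A boolean mask keeps only the rows where the condition is True.
at_risk = demo[demo["Academic_Standing"] == "Probation"]

pct_at_risk = len(at_risk) / len(demo) * 100
by_college = at_risk["College"].value_counts()

print(f"At-risk share: {pct_at_risk:.1f}%")
print(by_college)
print(f"Avg completion (at-risk): {at_risk['Completion_Rate'].mean():.1f}")
print(f"Avg completion (all):     {demo['Completion_Rate'].mean():.1f}")
```

Raw counts by college can mislead when colleges differ in size, so comparing each college's at-risk count to its total enrollment is often more informative.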

In [None]:
# Your code here






**Your interpretation:**

*Write 2-3 sentences here*



---

# Part 4: Data Visualization (25 points)

Create publication-quality visualizations that communicate your findings.

### Task 4.1: GPA Distribution (6 points)

Create a histogram showing the distribution of GPA values:
- Use 20 bins
- Add vertical lines for the mean (red, dashed) and median (blue, dashed)
- Include a legend identifying the lines
- Add appropriate title and axis labels
- Add vertical lines at GPA = 2.0 and GPA = 3.5 (the academic standing thresholds)
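
A minimal sketch of a histogram with reference lines, using synthetic values rather than the project data. The colors and line styles for the two threshold lines are arbitrary choices, since the task only fixes the style of the mean and median lines.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic GPA-like values, clipped to the valid 0.0 - 4.0 range.
rng = np.random.default_rng(0)
values = np.clip(rng.normal(3.0, 0.5, 200), 0.0, 4.0)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(values, bins=20, edgecolor="black", alpha=0.7)

# The label argument is what ax.legend() picks up for each line.
ax.axvline(values.mean(), color="red", linestyle="--",
           label=f"Mean ({values.mean():.2f})")
ax.axvline(np.median(values), color="blue", linestyle="--",
           label=f"Median ({np.median(values):.2f})")
ax.axvline(2.0, color="gray", linestyle=":", label="Probation threshold (2.0)")
ax.axvline(3.5, color="green", linestyle=":", label="Dean's List threshold (3.5)")

ax.set_title("Distribution of GPA (synthetic data)")
ax.set_xlabel("GPA")
ax.set_ylabel("Number of students")
ax.legend()
```

Distinct line styles for the statistics (dashed) and the thresholds (dotted) help readers tell the two kinds of reference line apart.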

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Your code here






### Task 4.2: College Comparison Box Plot (6 points)

Create a box plot comparing GPA distributions across the five colleges:
- Use seaborn for cleaner aesthetics
- Rotate x-axis labels if needed for readability
- Add horizontal reference lines at GPA = 2.0 and GPA = 3.5
- Include appropriate title and axis labels
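
The sketch below shows one way to combine a seaborn box plot with matplotlib reference lines. It assumes seaborn is installed and uses invented rows with placeholder college labels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows; "College A" and "College B" are placeholder labels.
demo = pd.DataFrame({
    "College": ["College A"] * 5 + ["College B"] * 5,
    "GPA": [2.1, 2.9, 3.2, 3.6, 3.9, 1.7, 2.5, 3.0, 3.3, 3.8],
})

fig, ax = plt.subplots(figsize=(8, 5))

# Passing ax= draws the seaborn plot on an axes object we control.
sns.boxplot(data=demo, x="College", y="GPA", ax=ax)

ax.axhline(2.0, color="red", linestyle="--", label="Probation threshold (2.0)")
ax.axhline(3.5, color="green", linestyle="--", label="Dean's List threshold (3.5)")

ax.tick_params(axis="x", labelrotation=45)  # rotate long college names
ax.set_title("GPA by College (invented data)")
ax.set_xlabel("College")
ax.set_ylabel("GPA")
ax.legend()
fig.tight_layout()
```

Because seaborn draws onto a regular matplotlib axes, every matplotlib method (`axhline`, `set_title`, `legend`) remains available afterward.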

In [None]:
# Your code here






### Task 4.3: Credits Scatter Plot (6 points)

Create a scatter plot showing Credits_Attempted vs Credits_Earned:
- Color the points by GPA (use a colormap like 'RdYlGn')
- Add a diagonal line representing 100% completion (y = x)
- Include a colorbar showing the GPA scale
- Add appropriate title and axis labels
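
A minimal sketch of a color-mapped scatter plot with a y = x reference line, using five invented points. Fixing `vmin` and `vmax` to the GPA range is an optional choice made here so the colorbar always spans 0.0 to 4.0.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Invented values; in the project these come from the DataFrame columns.
attempted = np.array([30, 45, 60, 90, 120])
earned = np.array([24, 45, 57, 90, 110])
gpa = np.array([1.9, 3.6, 2.8, 3.9, 3.1])

fig, ax = plt.subplots(figsize=(8, 6))

# c= supplies the values to color by; cmap= chooses the colormap.
points = ax.scatter(attempted, earned, c=gpa, cmap="RdYlGn", vmin=0.0, vmax=4.0)

# Points on this line earned every credit they attempted.
limit = max(attempted.max(), earned.max())
ax.plot([0, limit], [0, limit], color="black", linestyle="--",
        label="100% completion (y = x)")

# The colorbar needs the object returned by scatter to know the color scale.
cbar = fig.colorbar(points, ax=ax)
cbar.set_label("GPA")

ax.set_title("Credits Attempted vs. Credits Earned (invented data)")
ax.set_xlabel("Credits Attempted")
ax.set_ylabel("Credits Earned")
ax.legend()
```

Points below the diagonal represent students who earned fewer credits than they attempted.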

In [None]:
# Your code here






### Task 4.4: Multi-Panel Dashboard (7 points)

Create a 2x2 figure with four subplots:
1. Top-left: Bar chart of student counts by College
2. Top-right: Bar chart of average GPA by Class_Year (in logical order)
3. Bottom-left: Pie chart of Academic Standing distribution
4. Bottom-right: Box plot of Completion_Rate by Academic Standing

Requirements:
- Use `plt.subplots(2, 2, figsize=(14, 10))`
- Each subplot must have a title
- Use `plt.tight_layout()` to prevent overlapping
- Save the figure as 'student_dashboard.png' with dpi=150
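
The layout mechanics can be sketched as below. All numbers and labels are invented, and the figure is written to a temporary file named `demo_dashboard.png` (a hypothetical name) rather than the file name the task requires.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt

# axes is a 2x2 array; index it as axes[row, column].
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top-left: bar chart of counts (invented numbers, placeholder labels).
axes[0, 0].bar(["College A", "College B"], [3, 2])
axes[0, 0].set_title("Students by College")

# Top-right: bar chart of averages in a fixed logical order.
years = ["Freshman", "Sophomore", "Junior", "Senior", "Graduate"]
axes[0, 1].bar(years, [2.9, 3.0, 3.1, 3.3, 3.7])
axes[0, 1].set_title("Average GPA by Class Year")

# Bottom-left: pie chart with percentage labels.
axes[1, 0].pie([2, 5, 1], labels=["Dean's List", "Good Standing", "Probation"],
               autopct="%1.0f%%")
axes[1, 0].set_title("Academic Standing")

# Bottom-right: one box per group; boxes are placed at x = 1, 2, ...
axes[1, 1].boxplot([[95.0, 98.0, 100.0], [70.0, 75.0, 80.0]])
axes[1, 1].set_xticklabels(["Good Standing", "Probation"])
axes[1, 1].set_title("Completion Rate by Standing")

plt.tight_layout()

# Hypothetical output path for this sketch only.
out_path = os.path.join(tempfile.gettempdir(), "demo_dashboard.png")
fig.savefig(out_path, dpi=150)
```

Calling `savefig` before any `plt.show()` avoids saving an empty figure, which can happen in some environments once the figure has been displayed and closed.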

In [None]:
# Your code here






---

# Part 5: Executive Summary (10 points)

Write a professional summary of your findings for university administration.

### Task 5.1: Executive Summary

Write a 200-300 word executive summary that includes:

1. **Overview** (2-3 sentences): What data did you analyze and why?

2. **Key Findings** (3-4 bullet points): What are the most important insights?
   - Include specific numbers to support each finding
   - Reference your visualizations where appropriate

3. **Recommendations** (2-3 bullet points): What actions should the university take?
   - Be specific and actionable
   - Connect recommendations to your findings

4. **Limitations** (1-2 sentences): What limitations should decision-makers be aware of?

**Write your summary as plain text in the Markdown cell below, not in a code cell.**

## Executive Summary: Lehigh University Student Performance Analysis

### Overview

*Your overview here*

### Key Findings

- *Finding 1*
- *Finding 2*
- *Finding 3*
- *Finding 4*

### Recommendations

- *Recommendation 1*
- *Recommendation 2*
- *Recommendation 3*

### Limitations

*Your limitations here*


---

# Part 6: AI Collaboration Log (5 points)

Document how you used AI tools during this project.

### Task 6.1: AI Usage Documentation

Complete the table below documenting your AI usage. Be honest: you are graded on the quality of your documentation, not on whether you used AI.

| Task/Part | AI Tool Used | What You Asked | What You Learned | Did You Modify the Output? |
|-----------|--------------|----------------|------------------|---------------------------|
| Part 1 | (e.g., ChatGPT, Claude, Copilot, None) | | | |
| Part 2 | | | | |
| Part 3 | | | | |
| Part 4 | | | | |
| Part 5 | | | | |

**Reflection Questions:**

1. Which parts of this project were you able to complete without AI assistance?

*Your answer:*

2. For parts where you used AI, could you explain the code to a classmate? If not, which parts need more review?

*Your answer:*

3. What did you learn about effective AI collaboration from this project?

*Your answer:*


---

# Submission Checklist

Before submitting, verify that:

- [ ] All code cells run without errors (Kernel → Restart & Run All)
- [ ] Part 1: Data loads correctly with derived variables created
- [ ] Part 2: All NumPy calculations complete, with the Task 2.2 interpretation written
- [ ] Part 3: All three analysis tasks complete with interpretations
- [ ] Part 4: All four visualizations are publication-quality with proper labels
- [ ] Part 4: Dashboard saved as 'student_dashboard.png'
- [ ] Part 5: Executive summary is 200-300 words with all required sections
- [ ] Part 6: AI collaboration log is complete and honest
- [ ] Your name is at the top of the notebook

**Save your notebook and submit the .ipynb file (and student_dashboard.png) via CourseConnect.**

---

## Honor Code Statement

By submitting this project, I affirm that:
- I did not collaborate with other students
- I have documented all AI tool usage honestly
- I can explain every line of code in this submission

**Electronic Signature (type your name):** _____________________