# Project 1 
**Aileen Yang cy2830** <br>
In this project, I am exploring how students' lifestyle habits affect their academic performance. <br>
**Dataset:** Student Habits vs Academic Performance <br>
**Link:** https://www.kaggle.com/datasets/jayaantanaath/student-habits-vs-academic-performance/data <br> <br>
This dataset records students' lifestyle habits, such as study hours, social media usage, and diet quality, alongside the exam scores they achieved, so the habits can be compared with academic performance.

In [None]:
# Import Data
import pandas as pd

df = pd.read_csv('student_habits_performance.csv')

### Mean, median, mode using pandas library <br>
**Numerical column chosen:** <br>
**social_media_hours:** the number of hours students spend on social media every day. <br>
My hypothesis is that time spent on social media affects students' academic performance the most, because it creates distractions and takes away time that could be spent studying. My prediction is that the more hours a student spends on social media, the lower the exam score they are likely to attain.

**Mean**

In [37]:
df["social_media_hours"].mean()

np.float64(2.5055)

**Median**

In [38]:
df["social_media_hours"].median()

np.float64(2.5)

**Mode**

In [39]:
df["social_media_hours"].mode()

0    3.1
Name: social_media_hours, dtype: float64

- The mean is 2.51 hours, meaning students spend about 2.5 hours on social media on average
- The median is also 2.5 hours; a median this close to the mean suggests the distribution is fairly symmetric
- The mode is 3.1 hours, which is the most frequently occurring value
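
A mean close to the median is only a rough sign of symmetry. As a minimal sketch, Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, puts a number on it: values near 0 indicate a roughly symmetric distribution. The function name and the toy lists below are hypothetical and are not taken from the dataset.

```python
import statistics

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    spread = statistics.stdev(values)  # sample standard deviation
    if spread == 0:
        # All values are identical, so there is no skew to measure
        return 0.0
    return 3 * (statistics.mean(values) - statistics.median(values)) / spread

# Hypothetical toy data, for illustration only
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

print(pearson_median_skewness(symmetric))
print(pearson_median_skewness(right_skewed))
```

Applied to the `social_media_hours` values, a result near 0 would support the symmetry reading, while a clearly positive or negative result would indicate a tail on one side.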

### Mean, median, mode using Python

In [40]:
# Read data values from csv
import csv

values = []

with open('student_habits_performance.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        values.append(float(row['social_media_hours']))

**Mean**

In [30]:
def cal_mean(values):
    n = len(values)
    val_sum = sum(values)
    mean = val_sum / n
    return mean

cal_mean(values)

2.5055

**Median**

In [31]:
def cal_median(values):
    s = sorted(values)
    n = len(s)
    
    # Compute the middle number
    mid = n // 2
    
    # If there is odd number of values, return exactly the middle value
    if n % 2 == 1:
        return s[mid]
    
    # If there is even number of values, return the average of the two middle values
    else:
        return (s[mid - 1] + s[mid]) / 2
    
cal_median(values)

2.5

**Mode**

In [41]:
from collections import Counter

def cal_mode(values):
    counts = Counter(values)
    
    max_count = max(counts.values())
    modes = []
    # Counter has already tallied how often each value occurs.
    # Collect every value whose count equals the highest count (ties are all kept)
    for item, value in counts.items():
        if value == max_count:
            modes.append(item)
    return modes

cal_mode(values)

[3.1]
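
As a cross-check on the hand-written functions, the same three statistics are available in Python's standard `statistics` module. This is a minimal sketch on a small hypothetical list rather than the dataset values; `statistics.multimode` requires Python 3.8 or later and returns every tied mode, matching the list returned by `cal_mode`.

```python
import statistics

# Hypothetical sample values, not taken from the dataset
sample = [1.5, 3.1, 2.0, 3.1, 2.5, 0.8]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)   # averages the two middle values for even n
modes = statistics.multimode(sample)       # list of all most frequent values

print(mean_value, median_value, modes)
```

Passing the `values` list read from the CSV to these functions should reproduce the results computed by hand.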

### Data Visualization
Let's create visualizations to explore relationships between different factors and exam performance.

In [33]:
def visual_blocks(data, title):
    chars = "▁▂▃▄▅▆▇█"
    
    # Turn data into dictionary
    data = data.to_dict()
    vals = list(data.values())
    
    # Find the smallest and largest value in the dataset
    min_val = min(vals)
    max_val = max(vals)

    print(title)
    
    for label, i in data.items():
        # If the max and min value are the same, the level should be 0
        if max_val == min_val:
            level = 0
        # Otherwise scale the value to an index from 0 to 7 into chars
        else:
            level = int((i - min_val) / (max_val - min_val) * (len(chars) - 1))
        
        print(f"{label}: {chars[level]}  {i:.2f} ")

### Does participating in extracurriculars affect exam performance?

In [42]:
avg_extra = df.groupby('extracurricular_participation')['exam_score'].mean()
visual_blocks(avg_extra, title = "Average exam score by extracurricular participation")

Average exam score by extracurricular participation
No: ▁  69.59 
Yes: █  69.62 


**Finding:** The average exam scores are nearly identical (69.59 for "No" vs 69.62 for "Yes"). This suggests that extracurricular participation has little to no effect on exam performance in this dataset. 

This comparison doesn't account for confounding variables. It is possible that students who do extracurriculars also study more efficiently, offsetting any time lost to activities.
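
Because `visual_blocks` scales between the smallest and largest value it receives, a 0.03-point gap is drawn as the lowest versus the highest block. A minimal sketch of an alternative is to scale against a fixed range instead. The 0 to 100 range is an assumption about how exam scores are bounded, and `fixed_scale_level` is a hypothetical helper, not part of the original notebook.

```python
CHARS = "▁▂▃▄▅▆▇█"

def fixed_scale_level(value, low=0.0, high=100.0):
    """Map a value to a block index using a fixed range (assumed 0-100)."""
    # Clamp so out-of-range values cannot produce an invalid index
    clamped = min(max(value, low), high)
    return int((clamped - low) / (high - low) * (len(CHARS) - 1))

# Group averages copied from the output above
averages = {"No": 69.59, "Yes": 69.62}

for label, score in averages.items():
    print(f"{label}: {CHARS[fixed_scale_level(score)]}  {score:.2f}")
```

With a fixed scale, both groups map to the same block, which matches the finding that the averages are nearly identical.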

### Does diet quality affect exam performance?

In [35]:
avg_by_diet = df.groupby('diet_quality')['exam_score'].mean().sort_values()
visual_blocks(avg_by_diet, title="Average exam score by diet quality")


Average exam score by diet quality
Poor: ▁  68.13 
Good: ▄  69.37 
Fair: █  70.43 


**Finding:** Diet quality shows only a weak relationship with exam scores. <br>
Students with "Fair" or "Good" diets score about 1 to 2 points higher on average than students with "Poor" diets (70.43 and 69.37 vs 68.13). This difference is relatively small, suggesting diet quality is not a major factor in academic performance for this dataset.

The relationship is also not monotonic: "Fair" diet performs better than "Good" diet, which seems counterintuitive. This might be due to how diet quality was categorized in the original data collection.
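
Sorting by average score hides the natural order of the categories. A minimal sketch that lists the groups from worst to best diet makes the "Fair" versus "Good" reversal easier to see. The `diet_order` list is an assumption about how the three labels rank; the averages are copied from the output above.

```python
# Assumed ranking of the diet labels, from worst to best
diet_order = ["Poor", "Fair", "Good"]

# Group averages copied from the output above
avg_by_diet = {"Poor": 68.13, "Good": 69.37, "Fair": 70.43}

# Reorder by category rank rather than by score
ordered = [(level, avg_by_diet[level]) for level in diet_order if level in avg_by_diet]

for level, avg in ordered:
    print(f"{level}: {avg:.2f}")
```

Read in this order, the averages rise from "Poor" to "Fair" and then dip slightly at "Good".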


### Does the number of study hours per day affect exam performance?

In [36]:
# Group by length of study hours and compute mean exam_score
avg_by_study = (
    df.groupby('study_hours_per_day')['exam_score']
      .mean()
      .sort_index()
)

visual_blocks(avg_by_study, title="Average exam score by study hour (ascending)")


Average exam score by study hour (ascending)
0.0: ▂  40.81 
0.1: ▄  53.40 
0.2: ▂  31.50 
0.3: ▂  32.62 
0.5: ▂  39.52 
0.6: ▁  18.40 
0.7: ▃  48.02 
0.8: ▂  39.81 
0.9: ▁  29.53 
1.0: ▃  46.42 
1.1: ▃  48.08 
1.2: ▂  41.69 
1.3: ▃  46.10 
1.4: ▃  41.89 
1.5: ▃  49.41 
1.6: ▃  49.70 
1.7: ▃  52.74 
1.8: ▃  51.67 
1.9: ▃  49.49 
2.0: ▄  55.90 
2.1: ▄  56.33 
2.2: ▄  56.06 
2.3: ▄  60.77 
2.4: ▄  59.21 
2.5: ▄  60.27 
2.6: ▄  60.56 
2.7: ▄  58.40 
2.8: ▄  63.71 
2.9: ▄  61.30 
3.0: ▄  64.12 
3.1: ▅  66.00 
3.2: ▅  66.57 
3.3: ▅  68.44 
3.4: ▅  72.18 
3.5: ▅  69.74 
3.6: ▅  71.30 
3.7: ▅  73.15 
3.8: ▅  70.43 
3.9: ▅  71.46 
4.0: ▅  75.28 
4.1: ▅  76.46 
4.2: ▆  77.70 
4.3: ▅  74.42 
4.4: ▆  79.23 
4.5: ▆  77.25 
4.6: ▆  80.35 
4.7: ▆  80.31 
4.8: ▆  81.59 
4.9: ▆  83.78 
5.0: ▆  83.92 
5.1: ▆  87.55 
5.2: ▆  84.05 
5.3: ▆  84.36 
5.4: ▆  86.48 
5.5: ▆  84.63 
5.6: ▇  91.90 
5.7: ▆  87.27 
5.8: ▇  96.92 
5.9: ▆  87.85 
6.0: ▇  95.57 
6.1: ▇  98.23 
6.2: ▇  96.97 
6.3: ▇  95.14 
6.4: ▇  93

**Finding:** This shows a strong positive relationship. The averages for individual values bounce around, especially below 1 hour, but the overall trend is that average exam scores rise as study hours increase:
- 0-1.9 hours: averages mostly around 30-55
- 2-3.9 hours: averages around 55-75
- 4-5.9 hours: averages around 75-95
- 6+ hours: averages above 90 in the rows shown

Groups of students who study 6+ hours per day average above 90, while groups studying less than 2 hours average below 55. These are group averages, so individual students can still score higher or lower.

This is the clearest pattern among the factors examined here.
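
Grouping by every distinct value of `study_hours_per_day` produces many small groups, which may explain some of the noise at the low end. A minimal sketch of one alternative is to average scores within fixed-width bins and report the group size alongside each average. The function name and the toy lists are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def mean_score_by_bin(hours, scores, bin_width=1.0):
    """Average scores within bins of study hours; returns {bin_start: (mean, count)}."""
    totals = defaultdict(lambda: [0.0, 0])  # bin_start -> [sum of scores, count]
    for h, s in zip(hours, scores):
        bin_start = int(h // bin_width) * bin_width
        totals[bin_start][0] += s
        totals[bin_start][1] += 1
    return {
        start: (total / count, count)
        for start, (total, count) in sorted(totals.items())
    }

# Hypothetical toy data, for illustration only
toy_hours = [0.2, 0.9, 1.4, 1.6, 2.3, 2.8]
toy_scores = [30.0, 40.0, 45.0, 55.0, 60.0, 64.0]

binned = mean_score_by_bin(toy_hours, toy_scores)
for start, (mean, count) in binned.items():
    print(f"{start:.0f}-{start + 1:.0f} h: {mean:.2f} (n={count})")
```

Showing the count next to each average helps separate a real dip from a group that contains only one or two students.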

### Conclusion

I started by hypothesizing that social media usage would be the biggest factor affecting exam performance. However, among the factors I compared against exam scores, I found that:

1. **Study hours** is by far the strongest predictor of exam success among the factors examined
2. **Diet quality** shows only a weak relationship
3. **Extracurricular participation** shows almost no relationship
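
To put the original social media hypothesis on the same footing as this ranking, one option is to compute a Pearson correlation coefficient between each numeric habit and the exam score. The following is a minimal sketch in plain Python; `pearson_r` and the toy lists are hypothetical, and no result for the actual dataset is implied.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, for illustration only
toy_social_hours = [1.0, 2.0, 3.0, 4.0]
toy_exam_scores = [80.0, 70.0, 60.0, 50.0]

r = pearson_r(toy_social_hours, toy_exam_scores)
print(round(r, 3))
```

Running it once on `social_media_hours` and once on `study_hours_per_day`, each paired with `exam_score`, would allow the two habits to be compared by the size and sign of the coefficient.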

**Key Takeaway** <br>
The data suggests that how students spend their time matters more than most other factors. Students who dedicate more hours to studying achieve higher exam scores on average. Because each comparison above looks at one factor at a time, it does not show whether this holds regardless of diet, extracurricular activities, or other habits.
