# Student Performance Dataset Analysis

## Objective
Analyze a dataset of student exam scores to uncover key insights, answer specific questions, and present findings using Python.

## Workflow:
1. Data Loading
2. Data Exploration
3. Data Cleaning
4. Data Analysis
5. Data Visualization

## Dataset Overview:
- Dataset: Student Performance Dataset (`student-mat.csv`)
- Key Columns:
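  - `G1`, `G2`, `G3`: First-period, second-period, and final grades. `G3` is the final grade used throughout this analysis.
  - `studytime`: Weekly study time. In the original UCI data this is an ordinal level from 1 (less than 2 hours) to 4 (more than 10 hours), not a raw hour count.
  - `sex`: Gender of students, coded `F` (female) or `M` (male).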
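
The `studytime` column is easier to read with labels attached. The sketch below assumes this file keeps the coding from the UCI Student Performance attribute description; `STUDYTIME_LABELS` and `label_studytime` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Assumed coding, taken from the UCI attribute description (not verified against this file).
STUDYTIME_LABELS = {
    1: "<2 hours",
    2: "2 to 5 hours",
    3: "5 to 10 hours",
    4: ">10 hours",
}

def label_studytime(codes: pd.Series) -> pd.Series:
    """Map studytime codes to an ordered categorical of readable labels."""
    dtype = pd.CategoricalDtype(categories=list(STUDYTIME_LABELS.values()), ordered=True)
    # Codes outside 1-4 become NaN rather than raising.
    return codes.map(STUDYTIME_LABELS).astype(dtype)

example = pd.Series([2, 3, 1, 4, 2])
print(label_studytime(example).value_counts(sort=False))
```

Keeping the labels ordered means sorting and `min`/`max` follow the study time levels rather than alphabetical order.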

## Key Questions:
1. What is the average score in math (G3)?
2. How many students scored above 15 in their final grade (G3)?
3. Is there a correlation between study time and the final grade (G3)?
4. Which gender has a higher average final grade (G3)?

## Deliverables:
- Detailed analysis with Python code.
- Visualizations for better understanding.
- A concise summary of key insights.


# Import Libraries

In [6]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure inline plots in Jupyter Notebook
%matplotlib inline


# Step 1: Data Loading

In [55]:
## Step 1: Data Loading

# - Load the dataset using pandas.
# - Preview the first few rows of the dataset to understand its structure.

# Load the dataset
file_path = 'student-mat - student-mat.csv'  # Update this path if the file is located elsewhere
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(data.head())


First few rows of the dataset:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   
4     GP   F   16       U     GT3       T     3     3    other     other  ...   

  famrel freetime  goout  Dalc  Walc health absences  G1  G2  G3  
0      4        3      4     1     1      3        6   5   6   6  
1      5        3      3     1     1      3        4   5   5   6  
2      4        3      2     2     3      3       10   7   8  10  
3      3        2      2     1     1      5        2  15  14  15  
4      4        3      2     1     2      5        4   6  10  10  

[5 rows x 33 columns]
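
A missing or misnamed column only shows up later as a `KeyError`, for example when a plotting cell runs against a stale `data` object. A minimal sketch of a loader that fails early is shown here. `load_student_data` and `REQUIRED_COLUMNS` are hypothetical names, and the `sep` parameter is included because the original UCI file is semicolon-delimited, while the file loaded above appears to be comma-delimited.

```python
import io
import pandas as pd

# Columns this analysis relies on (hypothetical helper constant).
REQUIRED_COLUMNS = ["G1", "G2", "G3", "studytime", "sex"]

def load_student_data(source, sep=",", required=REQUIRED_COLUMNS):
    """Read the CSV and raise immediately if an expected column is absent."""
    df = pd.read_csv(source, sep=sep)
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df

# Demonstration on an in-memory CSV instead of the real file.
sample = io.StringIO("G1,G2,G3,studytime,sex\n5,6,6,2,F\n15,14,15,3,M\n")
print(load_student_data(sample).shape)
```

For the real file, `source` would be the same `file_path` used above.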


# Step 2: Data Exploration

In [57]:
## Step 2: Data Exploration

# - Check for missing values to identify incomplete data.
# - Display column data types for better understanding of the dataset.
# - Determine the size of the dataset to understand its scale.

# Check for missing values
print("\nMissing values in each column:")
print(data.isnull().sum())

# Display column data types
print("\nColumn data types:")
print(data.dtypes)

# Understand the dataset's size
print("\nDataset size (rows, columns):")
print(data.shape)



Missing values in each column:
school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

Column data types:
school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery 
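
Besides missing values, data types, and shape, a numeric summary of the grade columns is a useful exploration step. The sketch below uses a hypothetical helper, `summarize_grades`, on a toy frame. It also counts zero scores, on the assumption that a zero may mark a missed exam rather than an earned grade and so deserves a closer look.

```python
import pandas as pd

def summarize_grades(df, grade_cols=("G1", "G2", "G3")):
    """Return count, mean, min, and max per grade column, plus the number of zeros."""
    cols = list(grade_cols)
    summary = df[cols].agg(["count", "mean", "min", "max"]).T
    summary["zeros"] = (df[cols] == 0).sum()
    return summary

# Toy data standing in for the real dataset.
toy = pd.DataFrame({"G1": [5, 7, 15], "G2": [6, 8, 14], "G3": [6, 0, 15]})
print(summarize_grades(toy))
```

On the real data, `summarize_grades(data)` would give one row per grade column.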

# Step 3: Data Cleaning

In [59]:
## Step 3: Data Cleaning

# - Replace missing values in numeric columns with the column median to ensure data completeness.
# - Remove duplicate entries for accuracy.

# Handle missing values (replace with median)
data.fillna(data.median(numeric_only=True), inplace=True)

# Remove duplicate entries
data.drop_duplicates(inplace=True)

print("\nData after cleaning:")
# info() prints its report directly and returns None, so it is not wrapped in print()
data.info()


Data after cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    
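
The missing-value check in Step 2 reports zero missing values in every column, so the median fill is probably a no-op on this dataset. A cleaning step that reports what it changed makes this visible. The following is a minimal sketch on toy data. `clean_with_report` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def clean_with_report(df):
    """Fill numeric gaps with column medians, drop duplicate rows, and report the effect."""
    # Only numeric columns are filled; missing text values are left as they are.
    cleaned = df.fillna(df.median(numeric_only=True)).drop_duplicates()
    cleaned = cleaned.reset_index(drop=True)
    report = {
        "missing_cells": int(df.isnull().sum().sum()),
        "rows_before": len(df),
        "rows_after": len(cleaned),
        "rows_dropped": len(df) - len(cleaned),
    }
    return cleaned, report

# Toy data: one missing grade and one row that becomes a duplicate.
toy = pd.DataFrame({"sex": ["F", "M", "M", "F"], "G3": [10.0, None, 12.0, 10.0]})
cleaned, report = clean_with_report(toy)
print(report)
```

Returning a new frame instead of using `inplace=True` keeps the raw data available for comparison.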

# Step 4: Data Analysis

In [61]:
## Step 4: Data Analysis

### Key Questions:
# 1. What is the average score in math (G3)?
# 2. How many students scored above 15 in their final grade (G3)?
# 3. Is there a correlation between study time and the final grade (G3)?
# 4. Which gender has a higher average final grade (G3)?

# 1. Average score in math (G3)
average_score = data['G3'].mean()
print(f"Average final grade (G3): {average_score:.2f}")

# 2. Count of students scoring above 15 in final grade (G3)
above_15_count = (data['G3'] > 15).sum()
print(f"Number of students scoring above 15 in final grade (G3): {above_15_count}")

# 3. Correlation between study time and final grade
correlation = data['studytime'].corr(data['G3'])
print(f"Correlation between study time and final grade (G3): {correlation:.2f}")

# 4. Gender with a higher average final grade
gender_average = data.groupby('sex')['G3'].mean()
print("Average final grade by gender:")
print(gender_average)

# Determine which gender has a higher average grade
higher_avg_gender = gender_average.idxmax()
print(f"Gender with a higher average final grade: {higher_avg_gender}")


Average final grade (G3): 10.42
Number of students scoring above 15 in final grade (G3): 40
Correlation between study time and final grade (G3): 0.10
Average final grade by gender:
sex
F     9.966346
M    10.914439
Name: G3, dtype: float64
Gender with a higher average final grade: M
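
The coefficient above is a Pearson correlation, which measures linear association. If `studytime` is an ordinal level rather than a raw hour count, a rank-based (Spearman) correlation may be a more natural check. The sketch below computes it as the Pearson correlation of the ranks, which avoids any dependency beyond pandas. `spearman_corr` is a hypothetical helper and the numbers are toy values, not results from the dataset.

```python
import pandas as pd

def spearman_corr(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    # rank() assigns average ranks to ties, which is the usual convention.
    return x.rank().corr(y.rank())

# Toy values standing in for data['studytime'] and data['G3'].
studytime = pd.Series([1, 2, 2, 3, 4])
grades = pd.Series([6, 9, 10, 14, 19])
print(f"Pearson:  {studytime.corr(grades):.2f}")
print(f"Spearman: {spearman_corr(studytime, grades):.2f}")
```

On the real data the call would be `spearman_corr(data['studytime'], data['G3'])`.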


# Step 5: Data Visualization

In [52]:
## Step 5: Data Visualization

### Objectives:
# 1. Histogram: Visualize the distribution of final grades (G3).
# 2. Scatter Plot: Analyze the relationship between study time and final grades (G3).
# 3. Bar Chart: Compare average final grades by gender.

# 1. Histogram of Final Grades (G3)
plt.figure(figsize=(8, 6))
plt.hist(data['G3'], bins=10, color='blue', edgecolor='black', alpha=0.7)
plt.title('Histogram of Final Grades (G3)')
plt.xlabel('Final Grade (G3)')
plt.ylabel('Frequency')
plt.show()

# 2. Scatter Plot Between Study Time and Final Grade
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data['studytime'], y=data['G3'], hue=data['sex'], alpha=0.7)
plt.title('Study Time vs Final Grade')
plt.xlabel('Study Time (coded level)')
plt.ylabel('Final Grade (G3)')
plt.legend(title='Gender')
plt.show()

# 3. Bar Chart Comparing Average Scores of Male and Female Students
gender_average.plot(kind='bar', color=['blue', 'pink'], figsize=(8, 6))
plt.title('Average Final Grade by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Final Grade (G3)')
plt.xticks(rotation=0)
plt.show()

KeyError: 'G3'
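
If `studytime` takes only a few discrete values, a scatter plot stacks many points on top of each other. Plotting the mean final grade per study time level can be easier to read. This is a minimal sketch on toy data. `plot_mean_grade_by_studytime` is a hypothetical helper, and the column names are the ones used in this analysis.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_mean_grade_by_studytime(df, ax=None):
    """Draw a bar chart of the mean final grade (G3) at each studytime level."""
    means = df.groupby("studytime")["G3"].mean().sort_index()
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 6))
    # Levels are converted to strings so they are drawn as categories.
    ax.bar(means.index.astype(str), means.values, color="steelblue", edgecolor="black")
    ax.set_title("Mean Final Grade by Study Time Level")
    ax.set_xlabel("Study time level")
    ax.set_ylabel("Mean final grade (G3)")
    return ax, means

# Toy data standing in for the real dataset.
toy = pd.DataFrame({"studytime": [1, 1, 2, 2, 3, 4], "G3": [8, 10, 11, 9, 12, 14]})
ax, means = plot_mean_grade_by_studytime(toy)
```

Returning `means` alongside the axes lets the plotted values be checked or reused in the summary.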

# Summary

## Summary of Findings

1. **Average Final Grade**:
   - The average final grade (G3) is 10.42.

2. **Students Scoring Above 15**:
   - 40 students scored above 15 in their final grades.

3. **Correlation Between Study Time and Final Grade**:
   - The correlation coefficient is 0.10, indicating a weak positive relationship.

4. **Gender With Higher Average Grade**:
   - Male students (`M`) have a higher average final grade, with an average score of 10.91 compared with 9.97 for female students (`F`).

### Visual Insights:
- **Histogram**: The distribution of grades shows [e.g., most students scored between X and Y].
- **Scatter Plot**: The relationship between study time and grades is weak, in line with the correlation coefficient of 0.10.
- **Bar Chart**: Male students scored higher than female students on average (10.91 vs. 9.97).
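
To keep the written summary in sync with the computed values, the figures can be formatted directly from the analysis results. The sketch below is one possible approach on toy data. `describe_strength` and `build_summary` are hypothetical helpers, and the 0.3 and 0.7 cutoffs for "weak", "moderate", and "strong" are an assumed rule of thumb, not a standard.

```python
import pandas as pd

def describe_strength(r: float) -> str:
    """Rough verbal label for a correlation coefficient (cutoffs are an assumption)."""
    if r == 0:
        return "no linear"
    size = abs(r)
    strength = "weak" if size < 0.3 else "moderate" if size < 0.7 else "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

def build_summary(df: pd.DataFrame) -> str:
    """Format the four key findings from the data instead of typing them by hand."""
    by_sex = df.groupby("sex")["G3"].mean()
    r = df["studytime"].corr(df["G3"])
    lines = [
        f"Average final grade (G3): {df['G3'].mean():.2f}",
        f"Students scoring above 15: {(df['G3'] > 15).sum()}",
        f"Study time vs. G3 correlation: {r:.2f} ({describe_strength(r)} relationship)",
        f"Higher average by gender: {by_sex.idxmax()} ({by_sex.max():.2f})",
    ]
    return "\n".join(lines)

# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "studytime": [1, 2, 3, 4],
    "G3": [8, 12, 16, 18],
})
print(build_summary(toy))
```

Calling `build_summary(data)` after Step 4 would produce the same four lines for the real dataset.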
