# Mini Project: Exploring Factors Influencing Academic Performance

### This project analyzes a dataset collected from a student survey to explore potential factors that might influence academic performance.

# 1. Introduction

### This project aims to uncover insights into factors potentially affecting student academic performance using a dataset from a student survey. We'll explore various demographic, academic, and social factors through data cleaning, exploratory data analysis (EDA), and visualization techniques.

# 2. Data Source

### The dataset used in this project was sourced from https://www.kaggle.com/datasets/joshuanaude/effects-of-alcohol-on-student-performance.

### Problem Definition:
#### The dataset is related to student life and various factors affecting it. Potential questions could include:
#### 1. How does alcohol consumption impact academic performance?
#### 2. Is there a correlation between accommodation type and academic success?
#### 3. Do students' relationships with their parents affect their alcohol consumption habits?

# 3. Data Cleaning and Preparation

### Importing Necessary Libraries

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [None]:
# Load the dataset into a pandas DataFrame
df=pd.read_csv("/Users/bikashkumarsah/Downloads/survey121.csv")

In [None]:
# Show the first 5 rows
df.head()

### Dataset dimension and information

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

### Since the timestamp is not required for the analysis, we can drop it
### It only contains the submission date and time of the survey

In [None]:
df=df.drop(columns="Timestamp")

In [None]:
df

### Renaming Columns: Columns were renamed for clarity and consistency.

In [None]:
# Rename all columns in a single call using one mapping of old name -> new name
df.rename(columns={
    'Social': 'Socializing Frequency',
    'Your Sex?': 'Sex',
    'Your Matric (grade 12) Average/ GPA (in %)': 'Grade 12 GPA(%)',
    'What year were you in last year (2023) ?': 'Previous Academic Year(2023)',
    'What faculty does your degree fall under?': 'Enrolled Faculty',
    'Your 2023 academic year average/GPA in % (Ignore if you are 2024 1st year student)': '2023 GPA(%)',
    'Were you on scholarship/bursary in 2023?': 'Scholarships(2023)',
    'Additional amount of studying (in hrs) per week': 'Additional Study hrs/week',
    'On a night out, how many alcoholic drinks do you consume?': 'Alcoholic Drinks/Night',
    'How many classes do you miss per week due to alcohol reasons, (i.e: being hungover or too tired?)': 'Classes Missed/Week (Alcohol)',
    'How many modules have you failed thus far into your studies?': 'Failed Modules to Date',
    'Are you currently in a romantic relationship?': 'Relationship Status',
    'Do your parents approve alcohol consumption?': 'Parental Approval of Alcohol',
    'How strong is your relationship with your parent/s?': 'Relationship with Parents',
    'Your Accommodation Status Last Year (2023)': 'Accommodation Status(2023)',
}, inplace=True)
df

### Data Transformation:
### Range values in 'Monthly Allowance in 2023' were converted to their mean values.

In [None]:
import numpy as np
import re
def calculate_mean(range_str):
    # Define regular expression pattern to extract lower and upper bounds
    pattern = r'R\s*(\d+)\s*-\s*R\s*(\d+)'
    match = re.match(pattern, range_str)
    if match:
        lower = int(match.group(1))
        upper = int(match.group(2))
        return (lower + upper) / 2
    else:
        # If no match is found, return NaN so the value can be imputed later
        return np.nan

# Apply the calculate_mean function to the specific column
df['Monthly Allowance in 2023'] = df['Monthly Allowance in 2023'].astype(str).apply(calculate_mean)

### Categorical variables like 'Additional Study hrs/week', 'Alcoholic Drinks/Night', etc., were converted to numerical representations.

In [None]:
# Convert "Additional Study hrs/week" to numerical
df['Additional Study hrs/week'] = df['Additional Study hrs/week'].replace({'1-3': 2, '3-5': 4, '5-8': 6, '8+': 9, '0': 0})
df['Alcoholic Drinks/Night'] = df['Alcoholic Drinks/Night'].replace({'1-3': 2, '3-5': 4, '5-8': 6, '8+': 9, '0': 0})
df['Classes Missed/Week (Alcohol)'] = df['Classes Missed/Week (Alcohol)'].replace({'4+': 5, '0': 0,'1':1,'2':2,'3':3})
df['Scholarships(2023)'] = df['Scholarships(2023)'].replace({'Yes (NSFAS, etc...)': 'Yes'})
df['Socializing Frequency'] = df['Socializing Frequency'].replace({'Only weekends': 2,'4+':5})
df['Failed Modules to Date'] = df['Failed Modules to Date'].replace({'4+':5})

### Checking Null values

In [None]:
a=df.isnull().sum()
a

## Handling Missing Values:

### Rows with missing values in critical columns (e.g., 'Additional Study hrs/week', 'Failed Modules to Date') were dropped.

In [None]:
# Drop rows with missing values in any of the critical columns
critical_columns = [
    'Additional Study hrs/week',
    'Classes Missed/Week (Alcohol)',
    'Failed Modules to Date',
    'Relationship Status',
    'Parental Approval of Alcohol',
    'Relationship with Parents',
    'Scholarships(2023)',
    'Enrolled Faculty',
    'Grade 12 GPA(%)',
]
df.dropna(subset=critical_columns, inplace=True)

In [None]:
a=df.isnull().sum()
a

### Missing values in 'Monthly Allowance in 2023' were imputed with the median allowance.

In [None]:
median_allowance = df['Monthly Allowance in 2023'].median()

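# Fill NaN values with the calculated median
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df['Monthly Allowance in 2023'] = df['Monthly Allowance in 2023'].fillna(median_allowance)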

# Check if NaNs have been replaced
print(df['Monthly Allowance in 2023'].isnull().sum()) 

### Missing values in 'Accommodation Status(2023)' were filled randomly using forward or backward fill.

In [None]:
def random_fill(df, column_name):
  """Fills NaNs in a column randomly using forward or backward fill.

  Args:
    df: The pandas DataFrame.
    column_name: The name of the column to fill.
  """

  for i in df.index:
    if pd.isna(df.loc[i, column_name]):
      # Randomly choose forward or backward fill
      fill_method = np.random.choice(['ffill', 'bfill'])
      # Series.ffill()/Series.bfill() replace the deprecated fillna(method=...)
      df.loc[i, column_name] = getattr(df[column_name], fill_method)().loc[i]

# Fill the missing values in 'Accommodation Status(2023)' in place

random_fill(df, 'Accommodation Status(2023)')

### Missing values in 'Previous Academic Year(2023)' were imputed with the mode within each faculty.

In [None]:
# Example: Imputing with mode within each faculty
df['Previous Academic Year(2023)'] = df.groupby('Enrolled Faculty')['Previous Academic Year(2023)'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
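The group-wise mode imputation can be checked on a small toy table. The sketch below uses hypothetical column names and values (`faculty`, `year`) and a helper named `fill_with_group_mode`, none of which come from the survey data. It shows what happens when a group has a clear mode and when a group has no observed values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'Law' has no observed year at all
toy = pd.DataFrame({
    "faculty": ["Science", "Science", "Science", "Arts", "Arts", "Law"],
    "year": ["1st", "1st", np.nan, "2nd", np.nan, np.nan],
})

def fill_with_group_mode(s):
    """Fill NaNs with the group's most frequent value, or 'Unknown' if none exists."""
    modes = s.mode()  # NaNs are ignored when computing the mode
    return s.fillna(modes.iloc[0] if not modes.empty else "Unknown")

toy["year_filled"] = toy.groupby("faculty")["year"].transform(fill_with_group_mode)
print(toy)
```

Groups with at least one observed value inherit their most common value, while a group that is entirely missing falls back to the `'Unknown'` label.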

### Missing values in '2023 GPA(%)' were imputed using a linear regression model trained on other features.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


# 2. Split data into rows with and without a 2023 GPA
# (.copy() avoids a SettingWithCopyWarning when the imputed values are assigned later):
df_missing = df[df['2023 GPA(%)'].isnull()].copy()
df_complete = df[df['2023 GPA(%)'].notnull()].copy()

# 3. Prepare data for regression:
# Select predictor columns
predictors = ['Grade 12 GPA(%)', 'Additional Study hrs/week'] 

# One-hot encode categorical predictors (if any) - NOT NEEDED IN THIS CASE
# You would need this if you had categorical predictors like 'Enrolled Faculty'
# encoder = OneHotEncoder(handle_unknown='ignore')
# X_encoded = encoder.fit_transform(df_complete[categorical_predictors]).toarray()

# Create separate DataFrames for predictors and target
X_complete = df_complete[predictors]
y_complete = df_complete['2023 GPA(%)']

# 4. Split complete data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X_complete, y_complete, test_size=0.2, random_state=42)

# 5. Train the regression model:
model = LinearRegression()
model.fit(X_train, y_train)

# 6. Evaluate model performance (optional, but recommended):
# You can use metrics like R-squared or Mean Squared Error to see how well the model performs
# on the test set.

# 7. Predict missing GPAs:
X_missing = df_missing[predictors]
imputed_gpas = model.predict(X_missing)



# 8. Fill missing values and restore the original row order:
df_missing['2023 GPA(%)'] = imputed_gpas
df = pd.concat([df_complete, df_missing]).sort_index()
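The evaluation step mentioned in the comments above can be sketched as follows. This example runs on synthetic data generated inside the block: the column names (`grade12`, `study_hours`, `gpa`) and the coefficients used to create them are assumptions for illustration and say nothing about the real survey. In the notebook itself, the same two metric calls could be applied to `y_test` and `model.predict(X_test)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the complete rows (assumed relationship, for illustration only)
rng = np.random.default_rng(42)
n = 200
synthetic = pd.DataFrame({
    "grade12": rng.uniform(50, 95, n),
    "study_hours": rng.choice([0, 2, 4, 6, 9], n),
})
synthetic["gpa"] = (20 + 0.6 * synthetic["grade12"]
                    + 0.8 * synthetic["study_hours"]
                    + rng.normal(0, 5, n))

X_train, X_test, y_train, y_test = train_test_split(
    synthetic[["grade12", "study_hours"]], synthetic["gpa"],
    test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R-squared: share of variance explained; RMSE: typical error in GPA percentage points
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R^2: {r2:.3f}, RMSE: {rmse:.2f}")
```

A low R-squared or a large RMSE on the held-out rows would suggest that the imputed GPAs are rough estimates and should be treated with caution in later plots.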

In [None]:
df

### Round the imputed '2023 GPA(%)' values to two decimal places.

In [None]:
df['2023 GPA(%)'] = df['2023 GPA(%)'].round(2)

In [None]:
a=df.isnull().sum()
a

# 4. Exploratory Data Analysis (EDA) and Visualizations

### We performed extensive EDA to uncover patterns and relationships within the data. Key visualizations and findings include:

### Demographic Distribution:
#### Pie charts were used to visualize the gender distribution and academic year distribution of the surveyed students.

In [None]:
# Gender distribution
gender_distribution = df['Sex'].value_counts()

# Academic year distribution
academic_year_distribution = df['Previous Academic Year(2023)'].value_counts()

# Plotting the pie charts
plt.figure(figsize=(16, 8))

# Gender distribution pie chart
plt.subplot(1, 2, 1)
plt.pie(gender_distribution, labels=gender_distribution.index, autopct='%1.1f%%', startangle=140, colors=['skyblue', 'lightgreen'])
plt.title('Gender Distribution Among Surveyed Students')

# Academic year distribution pie chart
plt.subplot(1, 2, 2)
plt.pie(academic_year_distribution, labels=academic_year_distribution.index, autopct='%1.1f%%', startangle=140, colors=['orange', 'lightcoral','blue','skyblue','red'])
plt.title('Academic Year Distribution Among Surveyed Students')

plt.tight_layout()
plt.show()

### Accommodation and Relationship Status:
#### Grouped bar charts illustrated the distribution of accommodation status and relationship status across different academic years.

In [None]:
# Accommodation status by academic year


# Grouping the data by 'Previous Academic Year' and 'Accommodation Status' and counting the occurrences
accommodation_counts = df.groupby(['Previous Academic Year(2023)', 'Accommodation Status(2023)']).size().unstack()

# Plotting the data
accommodation_counts.plot(kind='bar', stacked=False, figsize=(10, 7))
plt.title('Accommodation Status by Academic Year')
plt.xlabel('Academic Year')
plt.ylabel('Number of Students')
plt.legend(title='Accommodation Status')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Relationship status by academic year


# Grouping the data by 'Previous Academic Year' and 'Relationship Status' and counting the occurrences
relationship_counts = df.groupby(['Previous Academic Year(2023)', 'Relationship Status']).size().unstack()

# Plotting the data
relationship_counts.plot(kind='bar', stacked=False, figsize=(10, 7))
plt.title('Relationship Status by Academic Year')
plt.xlabel('Academic Year')
plt.ylabel('Number of Students')
plt.legend(title='Relationship Status')
plt.xticks(rotation=45)
plt.show()

### Parental Approval of Alcohol:
#### A grouped bar chart showed the distribution of parental approval of alcohol consumption across academic years.

In [None]:
# Parental Approval for alcohol consumption by academic year


# Grouping the data by 'Previous Academic Year' and 'Parental Approval of Alcohol' and counting the occurrences
parental_approval_counts = df.groupby(['Previous Academic Year(2023)', 'Parental Approval of Alcohol']).size().unstack()

# Plotting the data
parental_approval_counts.plot(kind='bar', stacked=False, figsize=(10, 7))
plt.title('Parental Approval of Alcohol by Academic Year')
plt.xlabel('Academic Year')
plt.ylabel('Number of Students')
plt.legend(title='Parental Approval of Alcohol')
plt.xticks(rotation=45)
plt.show()

### Grade Distribution:
#### Histograms and box plots were used to visualize the distribution of grades (both Grade 12 and 2023 academic year averages) and identify potential outliers.

In [None]:
# Distribution of students' grades (Grade 12 and Previous academic year)

# histograms to see the frequency distributions of grades

# Histogram for 12TH Grades
plt.figure(figsize = (10, 6))
df['Grade 12 GPA(%)'].hist(bins = 20)
plt.title('Distribution of Grade 12 Marks')
plt.xlabel('12th Marks (%)')
plt.ylabel('Frequency')
plt.grid(False)
plt.show()


# Histogram for previous academic year marks

plt.figure(figsize = (10, 6))
df['2023 GPA(%)'].hist(bins = 20)
plt.title('Distribution of Previous academic year (2023) Marks')
plt.xlabel('Marks (%)')
plt.ylabel('Frequency')
plt.grid(False)
plt.show()

In [None]:
# Box plot to understand the spread and identify any outliers in the data

# Box plot for Grade 12 marks

plt.figure(figsize = (10, 6))
df.boxplot(column = 'Grade 12 GPA(%)')
plt.title('Box plot of Grade 12 Marks')
plt.ylabel('12th Marks (%)')
plt.show()


# Box plot for Previous academic year marks
plt.figure(figsize = (10, 6))
df.boxplot(column = '2023 GPA(%)')
plt.title('Box plot of Previous academic year (2023) Marks')
plt.ylabel('Marks (%)')
plt.show()

### Academic Performance by Gender:
#### Box plots and violin plots were used to compare the distribution of 2023 academic year grades between genders.

In [None]:
# How does the academic performance (2023 academic year average) vary by gender?

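# Box plot for the distribution of grades between different genders
# Pass an explicit Axes: with `by=`, DataFrame.boxplot would otherwise open its own figure
# and leave the one created beforehand empty
fig, ax = plt.subplots(figsize = (10, 6))
df.boxplot(column = '2023 GPA(%)', by = 'Sex', ax = ax)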
plt.title('Box plot of 2023 Marks by Gender')
plt.suptitle('')
plt.ylabel('Previous academic year marks (%)')
plt.xlabel('Sex')
plt.show()

In [None]:
# Violin plot to show the density and distribution of grades by gender

plt.figure(figsize = (10, 6))
sns.violinplot(x = 'Sex', y = '2023 GPA(%)', data =df, inner = 'quartile')
plt.title('Violin plot of 2023 Marks')
plt.xlabel('Sex')
plt.ylabel('Previous academic year marks (%)')
plt.show()

### Impact of Socializing and Alcohol:
#### A box plot compared 2023 academic performance across levels of socializing frequency.


In [None]:
# Relation between socializing frequency and academic performance

plt.figure(figsize = (10, 6))

# Box plot for the distribution of grades by socializing frequency
sns.boxplot(x = 'Socializing Frequency', y = '2023 GPA(%)', data = df)
plt.title('Academic Performance by Socializing Frequency')
plt.xlabel('Socializing Frequency')
plt.ylabel('2023 marks (%)')
plt.xticks(rotation=45)  # Rotate x labels if necessary
plt.grid(True)

plt.show()


### Monthly Allowances, Accommodation, and Missed Classes:
#### Histograms, box plots, and scatter plots were used to analyze the distribution of monthly allowances, the relationship between accommodation status and academic performance, and the correlation between missed classes due to alcohol and academic performance.

In [None]:
# Distribution of Monthly Allowances:
plt.figure(figsize=(8, 6))
sns.histplot(df['Monthly Allowance in 2023'], bins=10, kde=False)
plt.title('Distribution of Monthly Allowances')
plt.xlabel('Monthly Allowance')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Accommodation Status vs. Academic Performance:
plt.figure(figsize=(8, 6))
sns.boxplot(x='Accommodation Status(2023)', y='2023 GPA(%)', data=df)
plt.title('Academic Performance by Accommodation Status')
plt.xlabel('Accommodation Status')
plt.ylabel('2023 GPA')
plt.show()

plt.figure(figsize=(8, 6))
sns.barplot(x='Accommodation Status(2023)', y='2023 GPA(%)', data=df, errorbar=None)  # seaborn >= 0.12; older versions use ci=None
plt.title('Average Grades by Accommodation Status')
plt.xlabel('Accommodation Status')
plt.ylabel('Average 2023 GPA')
plt.show()

In [None]:
# Missed Classes (Alcohol) vs. Academic Performance:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Classes Missed/Week (Alcohol)', y='2023 GPA(%)', data=df)
plt.title('Correlation between Missed Classes (Alcohol) and Academic Performance')
plt.xlabel('Classes Missed/Week (Alcohol)')
plt.ylabel('2023 GPA')
plt.show()

plt.figure(figsize=(8, 6))
sns.barplot(x='Classes Missed/Week (Alcohol)', y='2023 GPA(%)', data=df, errorbar=None)  # seaborn >= 0.12; older versions use ci=None
plt.title('Average Grades by Number of Classes Missed (Alcohol)')
plt.xlabel('Classes Missed/Week (Alcohol)')
plt.ylabel('Average 2023 GPA')
plt.show()
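The scatter plot above is titled as a correlation, but no coefficient is computed. A rank-based (Spearman) coefficient is one reasonable choice here because the missed-classes variable is ordinal. The sketch below uses a small hypothetical table, not the survey data; in the notebook the same call could be made on `df[['Classes Missed/Week (Alcohol)', '2023 GPA(%)']]`.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "classes_missed": [0, 0, 1, 1, 2, 3, 4, 5],
    "gpa": [78, 74, 70, 72, 65, 66, 58, 55],
})

# DataFrame.corr returns the full pairwise matrix; pick the off-diagonal entry
corr_matrix = toy.corr(method="spearman")
rho = corr_matrix.loc["classes_missed", "gpa"]
print(f"Spearman rho: {rho:.2f}")
```

A coefficient near -1 would indicate that GPA tends to fall as missed classes rise. On its own it does not show that missing classes causes lower grades.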

### Failed Modules and Relationship Status:
#### Histograms, count plots, box plots, and bar plots were used to visualize the distribution of failed modules and explore potential relationships between romantic relationship status and academic performance.

In [None]:
# Distribution of Failed Modules:
plt.figure(figsize=(8, 6))
df['Failed Modules to Date'] = df['Failed Modules to Date'].astype(str)
sns.histplot(df['Failed Modules to Date'], bins=10, kde=False)
plt.title('Distribution of Failed Modules')
plt.xlabel('Number of Failed Modules')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(8, 6))
sns.countplot(x='Failed Modules to Date', data=df)
plt.title('Counts of Failed Modules')
plt.xlabel('Number of Failed Modules')
plt.ylabel('Count')
plt.show()

In [None]:

# Romantic Relationship vs. Academic Performance:
plt.figure(figsize=(8, 6))
sns.boxplot(x='Relationship Status', y='2023 GPA(%)', data=df)
plt.title('Academic Performance by Relationship Status')
plt.xlabel('Relationship Status')
plt.ylabel('2023 GPA')
plt.show()

plt.figure(figsize=(8, 6))
sns.barplot(x='Relationship Status', y='2023 GPA(%)', data=df, errorbar=None)  # seaborn >= 0.12; older versions use ci=None
plt.title('Average Grades by Relationship Status')
plt.xlabel('Relationship Status')
plt.ylabel('Average 2023 GPA')
plt.show()

# 5. Conclusion

### The EDA provided insights into factors potentially influencing student academic performance. We observed differences in grade distributions based on gender, alcohol consumption patterns, and other factors. These are descriptive patterns only; further analysis, potentially with more sophisticated statistical modeling, would be needed to draw stronger conclusions or to test for causal relationships.