# Data Preprocessing and Feature Engineering Capstone Project

----

#### Assume you are the Chancellor of a private university whose B.Tech Data Science program is producing poor results. You hire a data analyst to work on the raw student data and give you useful insights. The analyst works through data collection, data cleaning, data encoding, and data visualization so that the resulting insights are useful to your university.

---

#### DATA COLLECTION

##### The dataset used in this analysis is a mock dataset created to illustrate the essential steps of data analysis. Although the data is synthetic, the techniques applied here mirror those used on real datasets, so working through it builds familiarity with the same workflow you would apply to genuine data.

In [1]:
import pandas as pd

In [6]:
# Load the dataset
df = pd.read_excel('DPFE Capstone Project Dataset.xlsx')
df.head(10)

| | Student ID | Name | Age | Gender | Attendance (%) | Midterm Score | Project Score | Final Exam Score | Overall Score | Scholarship | Study Material | Programming Language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rajesh | 20.0 | Male | 95.0 | 85 | 90.0 | 88.0 | 89.5 | Yes | Yes | Python |
| 1 | 2 | Priya | 21.0 | Female | 92.0 | 78 | 85.0 | 90.0 | 84.75 | No | Yes | R |
| 2 | 3 | Arjun | 19.0 | Male | 88.0 | 80 | 82.0 | 85.0 | 83.75 | Yes | No | Python |
| 3 | 4 | Aarav | 20.0 | Male | 90.0 | 85 | NaN | 88.0 | 89.5 | Yes | Yes | Python |
| 4 | 5 | Sameer | 20.0 | Male | 94.0 | 75 | 80.0 | 82.0 | 80.25 | No | No | R |
| 5 | 6 | Ishika | 21.0 | Female | 92.0 | 80 | 85.0 | NaN | 84.75 | No | Yes | R |
| 6 | 7 | Advait | 19.0 | Male | NaN | 78 | 82.0 | 85.0 | 83.75 | Yes | No | Python |
| 7 | 8 | Nivedita | 20.0 | Female | 90.0 | 90 | 92.0 | 95.0 | 92.25 | Yes | Yes | Python |
| 8 | 9 | Akash | 22.0 | Male | 85.0 | 75 | 78.0 | 80.0 | 77.0 | No | No | Python |
| 9 | 10 | Ishita | 21.0 | Female | 92.0 | 88 | 85.0 | 88.0 | 87.25 | No | Yes | R |


---

#### DATA CLEANING

##### Data cleaning is necessary to ensure the accuracy, consistency, and reliability of the data by identifying and rectifying errors, inconsistencies, and missing values, thus enabling valid and trustworthy analysis and decision-making.

In [8]:
# Check for missing values
df.isnull().sum()

Student ID              0
Name                    0
Age                     2
Gender                  1
Attendance (%)          1
Midterm Score           0
Project Score           1
Final Exam Score        1
Overall Score           1
Scholarship             1
Study Material          1
Programming Language    1
dtype: int64

##### Missing values were filled with a strategy chosen per column. The mean was used for the numerical scores 'Project Score', 'Final Exam Score', and 'Overall Score', which keeps each column's average unchanged. The median was used for 'Attendance (%)' so that the replacement value is not skewed by outliers. The mode (the most frequent value) was used for the categorical variables 'Study Material', 'Scholarship', 'Gender', and 'Programming Language', and also for 'Age', which is numeric but takes only a few discrete values.

In [9]:
# Replace missing values with appropriate strategies.
# Assign the result back to the column: calling fillna(inplace=True) on a
# selected column warns in recent pandas versions and stops updating df
# under copy-on-write.
df['Project Score'] = df['Project Score'].fillna(df['Project Score'].mean())
df['Final Exam Score'] = df['Final Exam Score'].fillna(df['Final Exam Score'].mean())
df['Attendance (%)'] = df['Attendance (%)'].fillna(df['Attendance (%)'].median())
df['Overall Score'] = df['Overall Score'].fillna(df['Overall Score'].mean())
df['Age'] = df['Age'].fillna(df['Age'].mode()[0])
df['Study Material'] = df['Study Material'].fillna(df['Study Material'].mode()[0])
df['Scholarship'] = df['Scholarship'].fillna(df['Scholarship'].mode()[0])
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Programming Language'] = df['Programming Language'].fillna(df['Programming Language'].mode()[0])
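
To see how the three strategies behave, here is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are invented for illustration and are not part of the project dataset). The attendance column contains one extreme value, which is why the median is the safer fill there.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "score": [80.0, 90.0, None, 100.0],        # mean of known values is 90.0
    "attendance": [90.0, 92.0, 10.0, None],    # median is 90.0, mean is 64.0
    "language": ["Python", "R", "Python", None],
})

# Mean for a roughly symmetric numeric column
toy["score"] = toy["score"].fillna(toy["score"].mean())

# Median for a numeric column with an outlier (the 10.0 would drag the mean down)
toy["attendance"] = toy["attendance"].fillna(toy["attendance"].median())

# Mode (most frequent value) for a categorical column
toy["language"] = toy["language"].fillna(toy["language"].mode()[0])

print(toy)
```

Each column is assigned back explicitly, so the result does not depend on whether an in-place call on a column selection modifies the original frame.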

In [10]:
# Check for missing values
df.isnull().sum()

Student ID              0
Name                    0
Age                     0
Gender                  0
Attendance (%)          0
Midterm Score           0
Project Score           0
Final Exam Score        0
Overall Score           0
Scholarship             0
Study Material          0
Programming Language    0
dtype: int64

In [13]:
# Check for duplicate rows
df.duplicated().sum()

0

In [14]:
import numpy as np

# Identify outliers in the score columns using absolute z-scores
score_cols = ['Midterm Score', 'Project Score', 'Final Exam Score', 'Overall Score']
z_scores = np.abs((df[score_cols] - df[score_cols].mean()) / df[score_cols].std())
z_scores

| | Midterm Score | Project Score | Final Exam Score | Overall Score |
|---|---|---|---|---|
| 0 | 0.502431 | 0.998983 | 0.323062 | 0.931244 |
| 1 | 0.803289 | 0.081477 | 0.797925 | 0.112578 |
| 2 | 0.430226 | 0.729754 | 0.389232 | 0.332330 |
| 3 | 0.502431 | 0.000000 | 0.323062 | 0.931244 |
| 4 | 1.362883 | 1.161938 | 1.101526 | 1.101461 |
| ... | ... | ... | ... | ... |
| 57 | 1.435088 | 1.431167 | 1.985082 | 1.535561 |
| 58 | 1.362883 | 1.594122 | 1.576389 | 1.815655 |
| 59 | 1.062026 | 0.081477 | 0.323062 | 0.436802 |
| 60 | 0.502431 | 0.998983 | 0.323062 | 0.931244 |


In [15]:
# Keep only rows where every score's absolute z-score is below 3
df = df[(z_scores < 3).all(axis=1)]
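
The effect of this filter is easier to see on a small example. The sketch below uses a hypothetical toy frame (values invented for illustration) in which one midterm score of 0 sits far from the rest. As in the cell above, `std()` is the pandas default sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy scores: 19 ordinary midterm values plus one extreme 0
toy = pd.DataFrame({
    "Midterm Score": list(range(71, 90)) + [0],
    "Final Exam Score": list(range(70, 90)),
})

# Absolute z-score of every cell, computed column by column
z = ((toy - toy.mean()) / toy.std()).abs()

# Keep a row only if all of its scores are within 3 standard deviations
kept = toy[(z < 3).all(axis=1)]

print(len(toy), len(kept))  # 20 19
```

Only the row with the midterm score of 0 is dropped. A row is removed as soon as any one of its columns crosses the threshold, even when its other scores are ordinary.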

In [None]:
# Saving the cleaned dataset
df.to_csv('cleaned_dataset.csv', index=False)
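
A quick way to confirm the saved file is usable is to read it back and inspect it. The sketch below is an assumption-laden illustration: it writes a hypothetical two-row frame to a temporary directory rather than touching the real `cleaned_dataset.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned frame
toy = pd.DataFrame({"Student ID": [1, 2], "Name": ["Rajesh", "Priya"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_dataset.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.isnull().sum().sum())
```

Because the index is not written, the reloaded frame has exactly the original columns. Without `index=False`, reading the file back would produce an extra `Unnamed: 0` column holding the old index.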

---

#### DATA ENCODING

---

#### DATA VISUALIZATION

---

### - INSIGHT -