# Student Performance Indicator
## Life Cycle of a Machine Learning Project

### Problem statement
This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.

### 1. Understanding the Problem Statement
- Clearly define the objective of the project.
- Identify the key performance indicators (KPIs) for student performance.
- Understand the stakeholders' requirements and expectations.

### 2. Data Collection
- Gather relevant data from various sources (e.g., student records, exam scores, attendance, etc.).
- Ensure data is collected in a structured and consistent manner.
- Document the sources and types of data collected.

### 3. Data Checks to Perform
- Check for missing values and handle them appropriately.
- Identify and treat outliers in the data.
- Ensure data consistency and accuracy.
- Verify data types and convert them if necessary.
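The checks above can be sketched with pandas on a small, made-up frame. The column names and values below are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value, one duplicated row, one extreme score.
records = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 5, 6],
    "score": [70.0, 65.0, np.nan, 72.0, 68.0, 68.0, 5.0],
})

# Missing values per column.
missing_per_column = records.isna().sum()

# Fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(records.duplicated().sum())

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = records["score"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (records["score"] < q1 - 1.5 * iqr) | (records["score"] > q3 + 1.5 * iqr)
outliers = records.loc[is_outlier, "score"].tolist()

# Verify that the column has the expected numeric type.
score_is_numeric = pd.api.types.is_numeric_dtype(records["score"])

print(missing_per_column, duplicate_rows, outliers, score_is_numeric, sep="\n")
```

How to treat a flagged value (drop, cap, or keep) depends on whether it is a data-entry error or a genuine observation.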

### 4. Exploratory Data Analysis (EDA)
- Perform descriptive statistics to understand the data distribution.
- Visualize data using plots and graphs to identify patterns and trends.
- Analyze correlations between different variables.
- Summarize key findings from the data.
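As a rough illustration of the descriptive-statistics and correlation steps, the sketch below uses the three score columns from the first five rows of this dataset (as shown in the `df.head()` output). Five rows are far too few for real conclusions; the shortened column names are an assumption for brevity.

```python
import pandas as pd

# Scores of the first five students; column names shortened for illustration.
scores = pd.DataFrame({
    "math": [72, 69, 90, 47, 76],
    "reading": [72, 90, 95, 57, 78],
    "writing": [74, 88, 93, 44, 75],
})

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = scores.describe()

# Pairwise Pearson correlation between the score columns.
correlations = scores.corr()

print(summary)
print(correlations)
```

On the full dataset the same two calls apply unchanged to the numeric columns.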

### 5. Data Pre-Processing
- Clean the data by removing duplicates and irrelevant information.
- Normalize or standardize the data if required.
- Encode categorical variables.
- Split the data into training and testing sets.
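A minimal sketch of these pre-processing steps, assuming scikit-learn is available. The data and the `hours_studied` column are hypothetical. The split is done before fitting the encoder and scaler so that they learn only from the training rows, which is one common way to avoid leaking test information.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical feature, one numeric feature, one target.
data = pd.DataFrame({
    "lunch": ["standard", "free/reduced"] * 5,
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "score": [50, 45, 60, 52, 70, 61, 80, 72, 90, 85],
})

# Remove exact duplicate rows (none in this sample, shown for completeness).
data = data.drop_duplicates()

X = data[["lunch", "hours_studied"]]
y = data["score"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the categorical column and standardize the numeric one.
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
    ("numeric", StandardScaler(), ["hours_studied"]),
])

X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```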

### 6. Model Building
- Select appropriate algorithms for the problem.
- Train the models using the training data.
- Fine-tune model parameters to optimize performance.
- Evaluate models using the testing data.
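The steps above might look like the following sketch, assuming scikit-learn. The data is synthetic, and the choice of linear regression and Ridge is only an example, not the model selected for this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (illustrative): y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: ordinary least squares trained on the training split.
baseline = LinearRegression().fit(X_train, y_train)

# Tuning: search the Ridge regularization strength with 5-fold cross-validation.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate both models on the held-out test split (R^2 for regressors).
print("baseline R^2:", baseline.score(X_test, y_test))
print("best alpha:", search.best_params_["alpha"])
print("tuned R^2:", search.score(X_test, y_test))
```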

### 7. Model Evaluation
- Use metrics suited to the task: R², MAE, and RMSE for a continuous target such as a test score; accuracy, precision, recall, and F1-score for a classification target.
- Perform cross-validation to ensure model robustness.
- Compare different models and select the best-performing one.
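Because the target in this project is a test score, which is continuous, regression metrics are a natural fit. The sketch below computes three common ones directly from their definitions; the true and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical true and predicted scores.
y_true = np.array([70.0, 80.0, 90.0, 60.0])
y_pred = np.array([72.0, 78.0, 85.0, 65.0])

errors = y_true - y_pred

# Mean absolute error: average size of the errors.
mae = np.mean(np.abs(errors))

# Root mean squared error: penalizes large errors more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```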

### 8. Model Deployment
- Prepare the final model for deployment.
- Integrate the model into the existing system.
- Monitor model performance in the production environment.
- Update the model as needed based on new data and feedback.

### 9. Model Maintenance
- Regularly retrain the model with new data to maintain accuracy.
- Monitor for model drift and address it promptly.
- Gather feedback from stakeholders and make necessary improvements.
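One simple way to watch for drift is to compare the distribution of a feature or score at training time with recent data. The sketch below assumes SciPy and uses a two-sample Kolmogorov-Smirnov test on synthetic samples; the 0.01 threshold and the sample parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic samples: recent scores are shifted upward relative to training.
rng = np.random.default_rng(1)
training_scores = rng.normal(loc=66, scale=15, size=500)
recent_scores = rng.normal(loc=76, scale=15, size=500)

# A small p-value suggests the two samples come from different distributions.
result = ks_2samp(training_scores, recent_scores)
drift_detected = bool(result.pvalue < 0.01)

print(f"KS statistic={result.statistic:.3f}, drift detected: {drift_detected}")
```

A detected shift is a prompt to investigate and possibly retrain, not proof that the model has degraded.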

### 10. Documentation and Reporting
- Document the entire process, including data sources, EDA, model building, and evaluation.
- Create comprehensive reports for stakeholders.
- Ensure transparency and reproducibility of the project.



## Model Training
Choose the best model.

In [10]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

### Import the dataset

In [2]:
# Use a forward slash: "\S" in a normal string is an invalid escape sequence
df = pd.read_csv("data/StudentsPerformance.csv")

In [4]:
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


## Data Checks to Perform

1. Check for missing values
2. Check for duplicates
3. Check the number of unique values in each column
4. Check the statistics of the data
5. Check the categories present in each categorical column


#### Checking for missing values

In [6]:
df.isna().sum()
## There are no missing values in the dataset


gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

#### Checking for duplicates in the dataset

In [7]:
df.duplicated().sum()

np.int64(0)

#### Checking null counts and data types of the DataFrame


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [9]:
## Unique values per column; the five low-cardinality columns are categorical
df.nunique()

gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

#### Checking the statistics of the data


In [11]:
df.describe()

## Three numerical columns

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


### Insights from the dataset
1. From the description, the means of the three scores are close to each other, ranging from 66.09 to 69.17.
2. The standard deviations are also close, ranging from 14.60 to 15.20.
3. The minimum score is 0 for math, 17 for reading, and 10 for writing.


### Exploring the data 


In [18]:
print("Categories in 'gender' variable:", end=" ")
print(df['gender'].unique())

print("Categories in 'race/ethnicity' variable:", end=" ")
print(df['race/ethnicity'].unique())

print("Categories in 'parental level of education' variable:", end=" ")
print(df['parental level of education'].unique())

print("Categories in 'lunch' variable:", end=" ")
print(df['lunch'].unique())

print("Categories in 'test preparation course' variable:", end=" ")
print(df['test preparation course'].unique())

Categories in 'gender' variable: ['female' 'male']
Categories in 'race/ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test preparation course' variable: ['none' 'completed']


In [19]:
## The categorical columns will need one-hot encoding to convert them to numerical values
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 3 numerical features : ['math score', 'reading score', 'writing score']

We have 5 categorical features : ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
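An alternative sketch of the same column split uses `DataFrame.select_dtypes`, which avoids comparing against the `'O'` dtype directly (how string columns are stored may differ between pandas versions). It then one-hot encodes the categorical columns with `pd.get_dummies`. The two-row `sample` frame is a cut-down illustration, not the full dataset.

```python
import pandas as pd

# Two rows and four columns taken from the data above, for illustration only.
sample = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math score": [72, 47],
    "reading score": [72, 57],
})

# Split columns by dtype family rather than by an exact dtype code.
numeric_features = sample.select_dtypes(include="number").columns.tolist()
categorical_features = sample.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(sample, columns=categorical_features)

print(numeric_features)
print(categorical_features)
print(encoded.columns.tolist())
```

`pd.get_dummies` is convenient for exploration; for a train/test workflow, an encoder fitted on the training data only is usually the safer choice.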


In [20]:
## Compute the total score; its average is used as the target that summarizes the marks

df['total_score'] = df['math score'] + df['reading score'] + df['writing score']

df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | total_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 278 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 148 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 229 |


In [21]:
df['average_score']=df['total_score']/3

df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | total_score | average_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 | 72.666667 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 | 82.333333 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 278 | 92.666667 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 148 | 49.333333 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 229 | 76.333333 |
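The same two derived columns can also be computed row-wise over a list of score columns, which scales better if more subjects are added. This is a sketch on the first three rows only; `score_columns` and `sample` are names introduced here for illustration.

```python
import pandas as pd

score_columns = ["math score", "reading score", "writing score"]

# First three rows of the score columns shown above.
sample = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

# axis=1 aggregates across columns, giving one value per student.
sample["total_score"] = sample[score_columns].sum(axis=1)
sample["average_score"] = sample[score_columns].mean(axis=1)

print(sample)
```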


In [27]:
reading_full = df[df['reading score'] == 100]['average_score'].count()
writing_full = df[df['writing score'] == 100]['average_score'].count()
math_full = df[df['math score'] == 100]['average_score'].count()

print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')

Number of students with full marks in Maths: 7
Number of students with full marks in Writing: 14
Number of students with full marks in Reading: 17


In [28]:
# Note: the comparison is <= 20, so a score of exactly 20 is included
reading_at_most_20 = df[df['reading score'] <= 20]['average_score'].count()
writing_at_most_20 = df[df['writing score'] <= 20]['average_score'].count()
math_at_most_20 = df[df['math score'] <= 20]['average_score'].count()

print(f'Number of students with 20 marks or fewer in Maths: {math_at_most_20}')
print(f'Number of students with 20 marks or fewer in Writing: {writing_at_most_20}')
print(f'Number of students with 20 marks or fewer in Reading: {reading_at_most_20}')

Number of students with 20 marks or fewer in Maths: 4
Number of students with 20 marks or fewer in Writing: 3
Number of students with 20 marks or fewer in Reading: 1
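The per-subject counts can be computed for all score columns at once by summing boolean masks. The sketch below also shows how `<= 20` and `< 20` differ when a score is exactly 20. The values are hypothetical and chosen to include that boundary case.

```python
import pandas as pd

# Hypothetical scores, including values exactly at 20 and 100.
sample = pd.DataFrame({
    "math score": [100, 18, 76, 20],
    "reading score": [100, 57, 100, 25],
    "writing score": [93, 10, 75, 20],
})

# Comparing a DataFrame to a scalar gives a boolean frame;
# summing it counts the True values in each column.
full_marks = (sample == 100).sum()
at_most_20 = (sample <= 20).sum()
below_20 = (sample < 20).sum()

print(full_marks, at_most_20, below_20, sep="\n\n")
```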


## Insights from the dataset with graphs and plots
