
Jamboree has helped thousands of students get into top colleges abroad. Whether for the GMAT, GRE or SAT, its problem-solving methods aim for maximum scores with minimum effort.
Jamboree recently launched a website feature that lets students check their probability of getting into an Ivy League college. The feature estimates the chance of graduate admission from an Indian perspective.

# **IMPORTING LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge


from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# **READING DATASET**

In [None]:
df=pd.read_csv("https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv")
df.head(10)

In [None]:
df.shape

Dataset has 500 rows and 9 columns

In [None]:
df.info()

There are no null values

# **CHECKING NULL VALUES**

In [None]:
df.isna().sum()

In [None]:
df.columns

In [None]:
df.duplicated().sum()

There are no duplicate values

# **NON GRAPHICAL ANALYSIS**

In [None]:
df["University Rating"].value_counts()

Rating 3 is the most common (162 of the 500 applicants, 32.4%).
Rating 2 is the second most common (126 applicants, 25.2%).
Together, ratings 2 and 3 account for 57.6% of all applicants, so most applicants are associated with universities in the average-to-below-average range.
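As a quick arithmetic check of these shares, the sketch below recomputes them from the two counts quoted in the text and the 500-row total reported by `df.shape`. The dictionary is typed in by hand for illustration; it is not read from the dataset.

```python
# Counts quoted in the text; the total comes from df.shape (500 rows).
rating_counts = {3: 162, 2: 126}
total_rows = 500

# Share of each rating as a percentage of all rows
shares = {rating: 100 * count / total_rows for rating, count in rating_counts.items()}
combined_share = sum(shares.values())

for rating, share in shares.items():
    print(f"Rating {rating}: {share:.1f}%")
print(f"Ratings 2 and 3 combined: {combined_share:.1f}%")
```

In the notebook itself, `df["University Rating"].value_counts(normalize=True)` returns the same proportions for every rating directly.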

In [None]:
df["SOP"].value_counts()

In [None]:
# the column name has a trailing space
df["LOR "].value_counts()

In [None]:
df["Research"].value_counts()

# **UNIVARIATE ANALYSIS**

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot histograms
sns.histplot(df["GRE Score"], bins=10, kde=True, ax=axes[0], color='blue')
axes[0].set_title("GRE score Distribution")

sns.histplot(df["TOEFL Score"], bins=10, kde=True, ax=axes[1], color='green')
axes[1].set_title("TOEFL Score Distribution")

sns.histplot(df["CGPA"], bins=10, kde=True, ax=axes[2], color='red')
axes[2].set_title("CGPA Value Distribution")

plt.tight_layout()
plt.show()

GRE Score Distribution: The histogram shows how GRE scores are spread among applicants. A skewed distribution would indicate that scores cluster toward one end of the range, with a thinner tail toward the other.

TOEFL Score Distribution: This plot shows the spread of TOEFL scores. A roughly normal shape means most students score around the mean, while skewness indicates a concentration of scores in a particular range.

CGPA Distribution: The CGPA histogram reflects the academic performance of applicants. A left-skewed distribution (long tail toward low values) would mean most applicants have high CGPAs, while a right-skewed distribution (long tail toward high values) would mean most applicants have lower CGPAs.

Comparison Across Metrics: If the distributions differ significantly, applicant profiles vary across these measures. For instance, a uniform GRE distribution alongside a skewed CGPA distribution might suggest diverse academic backgrounds.
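Skew direction is easy to mix up, so here is a minimal sketch that measures it numerically. The two samples are hypothetical CGPA-like values invented for illustration, and `sample_skewness` is a helper defined here, not part of the notebook.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical samples, invented for illustration only
mostly_high = [6.5, 7.8, 8.6, 8.8, 9.0, 9.1, 9.2, 9.3]  # long tail toward low values
mostly_low = [6.5, 6.6, 6.7, 6.8, 7.0, 7.2, 8.4, 9.5]   # long tail toward high values

skew_high = sample_skewness(mostly_high)  # negative: left-skewed
skew_low = sample_skewness(mostly_low)    # positive: right-skewed
print(skew_high, skew_low)
```

On the real data, `df["CGPA"].skew()` gives a comparable figure. pandas applies a bias correction, so the value will differ slightly from this moment-based version.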

# **BIVARIATE ANALYSIS**

In [None]:
# Scatter plots of University Rating, SOP and LOR against Chance of Admit
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# University Rating vs Chance of Admit
sns.scatterplot(x=df["University Rating"], y=df["Chance of Admit "], color='purple', ax=axes[0])
axes[0].set_title("University Rating vs Chance of Admit")
axes[0].set_xlabel("University Rating")
axes[0].set_ylabel("Chance of Admit")

# SOP vs Chance of Admit
sns.scatterplot(x=df["SOP"], y=df["Chance of Admit "], color='orange', ax=axes[1])
axes[1].set_title("SOP vs Chance of Admit")
axes[1].set_xlabel("SOP")
axes[1].set_ylabel("Chance of Admit")

# LOR vs Chance of Admit (the column name has a trailing space)
sns.scatterplot(x=df["LOR "], y=df["Chance of Admit "], color='brown', ax=axes[2])
axes[2].set_title("LOR vs Chance of Admit")
axes[2].set_xlabel("LOR")
axes[2].set_ylabel("Chance of Admit")

plt.tight_layout()
plt.show()




# **MULTIVARIATE ANALYSIS**

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x="University Rating", y="Chance of Admit ", hue="Research", data=df, palette="coolwarm")
plt.title("Chance of Admit vs University Rating (Grouped by Research Experience)")
plt.xlabel("University Rating")
plt.ylabel("Chance of Admit")
plt.show()

In [None]:
sns.pairplot(df, vars=["GRE Score", "TOEFL Score", "CGPA", "Chance of Admit "], hue="University Rating", palette="coolwarm")
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

# **DATA PREPROCESSING**

In [None]:
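# "Chance of Admit " (note the trailing space) is the target; every other column is used as a feature.
# The serial-number column is a row identifier rather than a real predictor; it is kept here
# so that the metrics reported below stay reproducible.
x = df.drop("Chance of Admit ", axis=1)
y = df["Chance of Admit "]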

In [None]:
# splitting the data for training and testing
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## **NORMALIZING DATA**

In [None]:
scaler = MinMaxScaler()
# Fit the scaler on the training split only, then apply the same scaling to the test split
x_train_transform = scaler.fit_transform(x_train)
x_test_transform = scaler.transform(x_test)
x_train_transform_df = pd.DataFrame(x_train_transform, columns=x_train.columns)
x_test_transform_df = pd.DataFrame(x_test_transform, columns=x_test.columns)

# Display first 5 rows of the normalized features for each split
print('train_features_normalized_values', x_train_transform_df.head(5))
print('test_features_normalized_values', x_test_transform_df.head(5))

# **MODEL BUILDING**

In [None]:
lr=LinearRegression()
lr.fit(x_train_transform,y_train)

In [None]:

predict_test=lr.predict(x_test_transform)
predict_train=lr.predict(x_train_transform)


# **FINDING ERRORS**

In [None]:
# calculating errors on test data
MSE = mean_squared_error(y_test, predict_test)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test, predict_test)
# store the score under its own name so the imported r2_score function is not overwritten
r2_test = r2_score(y_test, predict_test)
adjusted_r2_test = 1 - (1 - r2_test) * (len(y_test) - 1) / (len(y_test) - x_test.shape[1] - 1)
print("MSE:", MSE)
print("RMSE:", RMSE)
print("MAE:", MAE)
print("r2_score:", r2_test)
print("adjusted_r2_score:", adjusted_r2_test)

In [None]:
# calculating errors on training data
MSE = mean_squared_error(y_train, predict_train)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_train, predict_train)
r2_train = r2_score(y_train, predict_train)
adjusted_r2_train = 1 - (1 - r2_train) * (len(y_train) - 1) / (len(y_train) - x_train.shape[1] - 1)
print("MSE:", MSE)
print("RMSE:", RMSE)
print("MAE:", MAE)
print("r2_score:", r2_train)
print("adjusted_r2_score:", adjusted_r2_train)

1. Model Accuracy and Fit
   - Test R² Score: 0.8263 (82.63%). The model explains 82.63% of the variance in the target variable on the test split, which is a strong fit.
   - Train R² Score: not reported here; it still needs to be computed from the training predictions.
   - If the train R² is close to the test R², the model generalizes well and isn't overfitting.
2. Model Errors & Stability
   - Test MSE: 0.00355, Test RMSE: 0.0596, Test MAE: 0.0433
   - Train MSE: 0.00338, Train RMSE: 0.0581, Train MAE: 0.0422
   - The train and test errors are very close, which indicates low variance and suggests that the model is not overfitting.
   - RMSE and MAE are low: the typical prediction error is about 0.04 to 0.06 on a target that is a probability between 0 and 1.
3. Overfitting or Underfitting Check
   - The model is not overfitting (if it were, train errors would be much lower than test errors).
   - The model is not underfitting (if it were, both train and test errors would be high and R² significantly lower).
   - This balance suggests a well-tuned model with good generalization.
4. Adjusted R² Interpretation
   - Test Adjusted R²: 0.8111
   - Train Adjusted R²: not reported here; it depends on the train R².
   - Adjusted R² is close to R² (0.8111 vs 0.8263), so the number of features is not inflating the score.
   - If Adjusted R² were significantly lower than R², it would indicate that some features are not contributing to the model.

- ✅ The model is performing well with strong predictive power.
- ✅ No signs of overfitting or underfitting.
- ✅ Errors are low, and the gap between test and train errors is minimal.
- ✅ Adjusted R² suggests that the features are meaningful.
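The adjusted R² figure can be checked by hand. The sketch below assumes 100 test rows (20% of the 500-row dataset) and 8 feature columns (all 9 columns except the target); `adjusted_r2` is a helper name introduced here for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R²: penalizes R² for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Test R² quoted in the text, with the assumed split size and feature count
test_adjusted = adjusted_r2(0.8263, n_samples=100, n_features=8)
print(round(test_adjusted, 4))
```

The result is roughly 0.811, in line with the value reported above. The penalty shrinks as the sample grows, so the same R² on the 400-row training split would be adjusted by a smaller amount.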

# **ORDINARY LEAST SQUARES**

Ordinary Least Squares (OLS) is a method used to estimate the coefficients (parameters) of a linear regression model by minimizing the sum of squared residuals (errors). It finds the best-fitting line that minimizes the difference between the observed values and the predicted values.
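To make the definition concrete, here is a minimal NumPy sketch of OLS on synthetic data. The data and the true coefficients (1.0, 2.0, -3.0) are invented for illustration and are unrelated to the admissions dataset.

```python
import numpy as np

# Synthetic data, invented for illustration: y = 1.0 + 2.0*x1 - 3.0*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
residuals = y - X_design @ coef
ssr = np.sum(residuals ** 2)

print("coefficients:", coef)
print("sum of squared residuals:", ssr)
```

Any other coefficient vector gives a larger sum of squared residuals on the same data, which is what "best-fitting" means here.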

In [None]:

# Add an intercept to the features
X_sm = sm.add_constant(x_train_transform)

# Fit OLS model
model = sm.OLS(y_train, X_sm)
results = model.fit()

# Print model summary
print(results.summary())

# **LASSO AND RIDGE REGRESSION**

In [None]:
modelm1=Lasso(alpha=0.001)
modelm1.fit(x_train_transform,y_train)

modelm2=Ridge(alpha=0.001)
modelm2.fit(x_train_transform,y_train)


In [None]:
x1_predict_test=modelm1.predict(x_test_transform)
x2_predict_test=modelm2.predict(x_test_transform)

x1_predict_train=modelm1.predict(x_train_transform)
x2_predict_train=modelm2.predict(x_train_transform)


In [None]:
MSE_lasso = mean_squared_error(y_test, x1_predict_test)
RMSE_lasso = np.sqrt(MSE_lasso)  # RMSE Calculation
MAE_lasso = mean_absolute_error(y_test, x1_predict_test)

print("Lasso Regression - Test Data:")
print("MSE:", MSE_lasso)
print("RMSE:", RMSE_lasso)
print("MAE:", MAE_lasso)


# Ridge Regression Evaluation
MSE_ridge = mean_squared_error(y_test, x2_predict_test)
RMSE_ridge = np.sqrt(MSE_ridge)
MAE_ridge = mean_absolute_error(y_test, x2_predict_test)


print("\nRidge Regression - Test Data:")
print("MSE:", MSE_ridge)
print("RMSE:", RMSE_ridge)
print("MAE:", MAE_ridge)


MSE_lasso = mean_squared_error(y_train, x1_predict_train)
RMSE_lasso = np.sqrt(MSE_lasso)  # RMSE Calculation
MAE_lasso = mean_absolute_error(y_train, x1_predict_train)

print("\nLasso Regression - train Data:")
print("MSE:", MSE_lasso)
print("RMSE:", RMSE_lasso)
print("MAE:", MAE_lasso)


# Ridge Regression Evaluation
MSE_ridge = mean_squared_error(y_train, x2_predict_train)
RMSE_ridge = np.sqrt(MSE_ridge)
MAE_ridge = mean_absolute_error(y_train, x2_predict_train)


print("\nRidge Regression - train Data:")
print("MSE:", MSE_ridge)
print("RMSE:", RMSE_ridge)
print("MAE:", MAE_ridge)

Both models perform well with low errors and no clear overfitting.
Lasso slightly outperforms Ridge in terms of lower error metrics.
If feature selection is important, Lasso is preferable.
If all features are expected to be important, Ridge is more stable.
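The special case of an orthonormal design (XᵀX = I) shows why Lasso can select features while Ridge cannot. This is a textbook simplification and is not claimed to hold for this dataset. With penalties λ‖b‖₁ and (λ/2)‖b‖² added to ½‖y − Xb‖², Lasso soft-thresholds each OLS coefficient and Ridge rescales it. The coefficient values and helper names below are hypothetical.

```python
def lasso_shrink(b_ols, lam):
    """Soft-thresholding: move the coefficient toward zero by lam, stopping at zero."""
    magnitude = abs(b_ols) - lam
    if magnitude <= 0:
        return 0.0
    return magnitude if b_ols > 0 else -magnitude

def ridge_shrink(b_ols, lam):
    """Proportional shrinkage: the coefficient gets smaller but never exactly zero."""
    return b_ols / (1.0 + lam)

ols_coefficients = [0.9, -0.4, 0.05, -0.02]  # hypothetical values
lam = 0.1

lasso_coefficients = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, lam) for b in ols_coefficients]
print("lasso:", lasso_coefficients)
print("ridge:", ridge_coefficients)
```

The two small coefficients become exactly zero under Lasso, while Ridge keeps all four nonzero. The scaling of λ here does not match the `alpha` convention used by scikit-learn, so the numbers are illustrative only.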

In [None]:
# Calculate errors
lasso_errors = y_test - x1_predict_test
ridge_errors = y_test - x2_predict_test

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(lasso_errors, bins=30, kde=True, color="blue")
plt.xlabel("Error")
plt.title("Lasso Regression Error Distribution")

plt.subplot(1, 2, 2)
sns.histplot(ridge_errors, bins=30, kde=True, color="red")
plt.xlabel("Error")
plt.title("Ridge Regression Error Distribution")

plt.tight_layout()
plt.show()


# **ASSUMPTIONS OF LINEAR REGRESSION**

# **NORMALITY OF RESIDUAL CHECK**

The normality of residuals in regression analysis refers to the assumption that the residuals (the differences between observed and predicted values) follow a normal distribution.

Violations of this assumption can lead to incorrect conclusions about the significance of predictor variables and the overall fit of the model.

Residuals are defined as

$$e = y - \hat{y}$$

so each observation can be written as

$$y = w_0 + w^T x + (y - \hat{y})$$


In [None]:
# Finding Training and Validation Residuals
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# residuals of training data (residual = observed - predicted)
X_train_scaled = scaler.fit_transform(x_train)
X_train_sm = sm.add_constant(X_train_scaled)
sm_model = sm.OLS(y_train, X_train_sm).fit()
Y_predict_tr = sm_model.predict(X_train_sm)
errors_train = y_train - Y_predict_tr

# residuals of validation data: reuse the scaler and the model fitted on the training data
X_test_scaled = scaler.transform(x_test)
X_test_sm = sm.add_constant(X_test_scaled, has_constant="add")
Y_predict_ts = sm_model.predict(X_test_sm)
errors_test = y_test - Y_predict_ts

In [None]:
# histogram for training and validation errors

fig, axes = plt.subplots(1, 2, figsize=(10, 3))

sns.histplot(errors_train, kde= True, ax= axes[0], color="r")
axes[0].set_xlabel("Training Residuals")
axes[0].set_title("Histogram of training residuals")

sns.histplot(errors_test, kde= True, ax= axes[1])
axes[1].set_xlabel("Validation Residuals")
axes[1].set_title("Histogram of validation residuals")

plt.tight_layout()
plt.show()


Both residual distributions appear approximately normal.

**Mean Of Residuals**

In [None]:
print("Mean of Training Residuals:", errors_train.mean())
print("Mean of Validation Residuals:", errors_test.mean())

The mean of residuals of both training and validation data is very close to 0.

# **HOMOSCEDASTICITY**

Homoscedasticity, also known as homogeneity of variance, is a key assumption in regression analysis. It refers to the condition where the variance of the residuals is constant across all levels of the independent variables. It means that the spread of the residuals remains the same as you move along the range of the predictor variables.

Homoscedasticity ensures that the standard errors of the regression coefficients are estimated accurately.

In a homoscedastic model, the variability of the residuals is consistent across the range of the predictor variables. This consistency allows for more reliable predictions from the regression model.
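A rough numerical companion to the visual check is to compare the residual variance in the lower and upper halves of the predicted values. The sketch below does this on synthetic residuals; it is an informal variance-ratio check, not a formal statistical test, and `variance_ratio` is a name introduced here for illustration.

```python
import numpy as np

def variance_ratio(predicted, residuals):
    """Residual variance in the upper half of predictions divided by the lower half."""
    order = np.argsort(predicted)
    sorted_residuals = np.asarray(residuals)[order]
    half = len(sorted_residuals) // 2
    return sorted_residuals[half:].var() / sorted_residuals[:half].var()

# Synthetic predictions and residuals, invented for illustration
rng = np.random.default_rng(0)
predicted = rng.uniform(0.0, 1.0, size=1000)

constant_spread = rng.normal(scale=0.05, size=1000)         # homoscedastic
growing_spread = rng.normal(size=1000) * 0.1 * predicted    # spread grows with prediction

ratio_constant = variance_ratio(predicted, constant_spread)
ratio_growing = variance_ratio(predicted, growing_spread)
print("constant spread ratio:", ratio_constant)
print("growing spread ratio:", ratio_growing)
```

A ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 suggests that the spread changes with the predicted value. In the notebook the same function could be applied to `Y_predict_tr` and `errors_train`.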

In [None]:
#scatterplot for Homoscedasticity check for training and validation data

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.scatterplot(x=Y_predict_tr, y=errors_train, ax=axes[0])
axes[0].set_xlabel("Predicted Chance of Admit (training)")
axes[0].set_ylabel("Training Residuals")
axes[0].set_title("Predicted values vs Training Residuals")


sns.scatterplot(x=Y_predict_ts, y=errors_test, ax=axes[1])
axes[1].set_xlabel("Predicted Chance of Admit (validation)")
axes[1].set_ylabel("Validation Residuals")
axes[1].set_title("Predicted values vs Validation Residuals")


plt.show()

As the residuals are roughly even across all the predicted values, we can say that the property of Homoscedasticity is met.



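# **MULTICOLLINEARITY CHECK (VIF)**

In [None]:
def calculate_vif(features):
    # Variance inflation factor for each column of the DataFrame that is passed in
    vif = pd.DataFrame()
    vif['Features'] = features.columns
    vif['VIF'] = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
    return vif

In [None]:
calculate_vif(pd.DataFrame(x_train_transform, columns=x_train.columns))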

In [None]:
x_train=x_train.drop(["GRE Score", "CGPA", "LOR "], axis=1)
calculate_vif(x_train)


In [None]:
x_train=x_train.drop(["TOEFL Score","SOP"], axis=1)
calculate_vif(x_train)

After dropping the high-VIF features, Serial No, University Rating and Research each have a VIF below 5, so we can consider them for feature evaluation.
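For reference, the VIF of feature j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing feature j on all the other features. The NumPy sketch below computes it on synthetic columns. It includes an intercept in each auxiliary regression, which is an assumption of this sketch, so its values may differ from those printed by the notebook's `calculate_vif`.

```python
import numpy as np

def vif_scores(X):
    """VIF per column: regress each column on the others (with intercept), then 1 / (1 - R²)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic columns, invented for illustration: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

scores = vif_scores(np.column_stack([a, b, c]))
print(scores)
```

The two nearly duplicated columns get very large VIFs, while the independent column stays close to 1. This is the pattern the threshold of 5 is meant to catch.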

# **ASSUMPTION OF LINEARITY**

The assumption of linearity in linear regression refers to the relationship between the independent variables (predictors) and the dependent variable (response).

It assumes that the relationship between the predictors and the response variable is linear or can be adequately approximated by a linear function.

In [None]:
# H0: There is no linear correlation between the two variables
# Ha: There is a linear correlation between the two variables
from scipy.stats import pearsonr

alpha = 0.05
for c in x_train.columns:
    # pearsonr returns the correlation coefficient and the p-value of the test
    r, pval = pearsonr(x_train[c], y_train)
    print('Column Name:', c)
    print('Pearson r:', r)
    print('P-value:', pval)
    if pval < alpha:
        print('There is a linear correlation between the independent and dependent variable')
    else:
        print('There is no significant linear correlation between the independent and dependent variable')
    print('\n')
    print('-'*100)
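Pearson correlation only measures linear association, which is why it suits a linearity check but can miss other kinds of dependence. The sketch below uses invented data to show this: a perfectly linear relationship gives r = 1, while a perfectly quadratic one on a symmetric range gives r close to 0.

```python
import numpy as np

# Invented data for illustration: x is symmetric around zero
x = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x + 1.0   # exact linear dependence on x
y_quadratic = x ** 2       # exact dependence on x, but not linear

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
print("linear:", r_linear)
print("quadratic:", r_quadratic)
```

A non-significant Pearson test therefore rules out a linear relationship, not every relationship; the scatter plots from the bivariate analysis remain a useful complement.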

# **INSIGHTS**

1. The target variable, Chance of Admit, depends mainly on CGPA, GRE Score, TOEFL Score, SOP and LOR.

2. The factor that most influences the Chance of Admit is CGPA.

3. Other factors such as Research and University Rating are also present, but they contribute less to the Chance of Admit.

4. Students who have done research and have high university ratings tend to have a high Chance of Admit.

5. Students with a high Chance of Admit have a CGPA around or above 8.5 and a good GRE Score.

6. The features are concentrated within a limited range and do not contain outliers.

7. During model creation, the best scores from Ridge and Lasso were almost the same: around 81% on training data and 83% on validation data.

# **RECOMMENDATION**

1. Focus on Key Features: Since the Chance of Admit depends mostly on CGPA, GRE Score, TOEFL Score, SOP and LOR, these features should be prioritized during feature selection and model building.

2. Emphasize CGPA: Since CGPA is identified as the most influential factor in determining the Chance of Admit, special attention should be given to this feature during model building and analysis.

3. Consider Research and University Ratings: Although Research and University Rating may not contribute significantly to the Chance of Admit individually, they still play a role in certain cases. They should be kept as additional factors in the analysis, especially for students with high university ratings and research experience.

4. Explore Feature Interactions: Investigate potential interactions between features, especially for students with a high Chance of Admit. It may be beneficial to examine how different combinations of features contribute to the overall likelihood of admission.

5. Segmentation Analysis: Conduct segmentation analysis to identify groups of students with similar characteristics and Chance of Admit. This can help tailor admission strategies and identify specific target groups for recruitment efforts.

6. Ensure Model Robustness: Since the Ridge and Lasso models produced almost similar scores, it's essential to ensure the robustness of the models and validate their performance on independent datasets.

7. Continuous Monitoring and Improvement: Continuously monitoring the model performance by updating the model with new data and refining the feature selection process can help maintain its effectiveness over time.