# **Portfolio Project Exploratory Data Analysis**

In this project, I walk through exploratory data analysis (EDA) step by step, focusing on two tasks: handling missing values and removing duplicates. The dataset is *Student Admission Record* by Zeeshan Ahmad, taken from Kaggle.com.

# **1. Read the Dataset**

First, upload your dataset to the files of your Google Colab session. After that, import the pandas library to read the dataset.

*Note:*
Lines that start with `#` are comments.

In [11]:
# import library pandas to read the dataset
import pandas as pd

# read the csv dataset using the pd.read_csv() syntax
# you can use pd.read_excel() or pd.read_json() if your dataset is not in csv format
data = pd.read_csv('/content/student_admission_record.csv')

In [12]:
# To display the dataset, we call it by writing the variable. Here, I am using the data variable.
data

| | Name | Age | Gender | Admission Test Score | High School Percentage | City | Admission Status |
|---|---|---|---|---|---|---|---|
| 0 | Shehroz | 24.0 | Female | 50.0 | 68.90 | Quetta | Rejected |
| 1 | Waqar | 21.0 | Female | 99.0 | 60.73 | Karachi | |
| 2 | Bushra | 17.0 | Male | 89.0 | | Islamabad | Accepted |
| 3 | Aliya | 17.0 | Male | 55.0 | 85.29 | Karachi | Rejected |
| 4 | Bilal | 20.0 | Male | 65.0 | 61.13 | Lahore | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 152 | Ali | 19.0 | Female | 85.0 | 78.09 | Quetta | Accepted |
| 153 | Bilal | 17.0 | Female | 81.0 | 84.40 | Islamabad | Rejected |
| 154 | Fatima | 21.0 | Female | 98.0 | 50.86 | Multan | Accepted |
| 155 | Shoaib | -1.0 | Male | 91.0 | 80.12 | Quetta | Accepted |
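When a table is too long to read in full, checking its shape and first rows is usually enough. The sketch below is only an illustration: it reads a few rows copied from the preview above out of an in-memory string, so it runs without the original file. The names `csv_text` and `sample` are made up for this example.

```python
import io
import pandas as pd

# A few rows copied from the preview above, stored as CSV text
# so the example runs without the original file.
csv_text = """Name,Age,Gender,Admission Test Score,High School Percentage,City,Admission Status
Shehroz,24.0,Female,50.0,68.90,Quetta,Rejected
Waqar,21.0,Female,99.0,60.73,Karachi,
Bushra,17.0,Male,89.0,,Islamabad,Accepted
Aliya,17.0,Male,55.0,85.29,Karachi,Rejected
"""

# pd.read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two rows only
```

Empty fields in the CSV text are read as missing values, which is how the gaps in the preview end up in the DataFrame.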


# **2. Data Summary**

We can use the `info()` method.

`info()` is useful for quickly understanding the structure and content of a DataFrame: it lists every column with its non-null count and data type, which helps with data exploration and preparation.

In [15]:
# don't forget to put the variable name in front of the method call
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Name                    147 non-null    object 
 1   Age                     147 non-null    float64
 2   Gender                  147 non-null    object 
 3   Admission Test Score    146 non-null    float64
 4   High School Percentage  146 non-null    float64
 5   City                    147 non-null    object 
 6   Admission Status        147 non-null    object 
dtypes: float64(3), object(4)
memory usage: 8.7+ KB
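`info()` only prints its report. If the same numbers are needed as values, for example to compare columns, they can be collected with `count()` and `dtypes`. This is a minimal sketch on a small made-up DataFrame; `sample` and `summary` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Small made-up table with one missing value per column
sample = pd.DataFrame({
    "Name": ["Shehroz", "Waqar", None],
    "Age": [24.0, np.nan, 17.0],
})

non_null = sample.count()  # non-null values per column
summary = pd.DataFrame({
    "non_null": non_null,
    "dtype": sample.dtypes.astype(str),
})

print(len(sample))  # total number of rows
print(summary)
```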


# **3. Checking for Missing Values**

A missing value is *data that is missing or unavailable in a dataset.*

To check for missing values, we can use `.isna().sum()`, which counts the missing entries in each column.

In [16]:
# checking for missing values
data.isna().sum()

| Column | Missing values |
|---|---|
| Name | 10 |
| Age | 10 |
| Gender | 10 |
| Admission Test Score | 11 |
| High School Percentage | 11 |
| City | 10 |
| Admission Status | 10 |


*Summary*

Based on the output, every column has missing values: 10 or 11 out of 157 rows each.
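Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to get both, using a small made-up DataFrame (`sample` and `report` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Small made-up table with some gaps
sample = pd.DataFrame({
    "Age": [24.0, np.nan, 17.0, 20.0],
    "City": ["Quetta", "Karachi", None, None],
})

missing_count = sample.isna().sum()
# The mean of True/False values is the fraction of True values
missing_share = sample.isna().mean() * 100

report = pd.DataFrame({
    "missing": missing_count,
    "percent": missing_share.round(1),
})
print(report)
```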

In [17]:
#  check statistical summary
data.describe()

| | Age | Admission Test Score | High School Percentage |
|---|---|---|---|
| count | 147.0 | 146.0 | 146.0 |
| mean | 19.680272 | 77.657534 | 75.684726 |
| std | 4.540512 | 16.855343 | 17.368014 |
| min | -1.0 | -5.0 | -10.0 |
| 25% | 18.0 | 68.25 | 65.0525 |
| 50% | 20.0 | 79.0 | 77.545 |
| 75% | 22.0 | 89.0 | 88.3125 |
| max | 24.0 | 150.0 | 110.5 |


*Summary*

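Based on the summary, the minimum values of `Age` (-1), `Admission Test Score` (-5), and `High School Percentage` (-10) are negative, which does not make sense for these columns. The maximum `High School Percentage` (110.5) is also above 100.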
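Suspicious values like these can be counted with a range check per column. The following is only a sketch: the valid ranges in `valid_ranges` are my assumptions (the dataset does not document them), and `sample` is a small made-up DataFrame.

```python
import numpy as np
import pandas as pd

# Small made-up table with one impossible value per column
sample = pd.DataFrame({
    "Age": [24.0, -1.0, np.nan],
    "Admission Test Score": [50.0, 91.0, 150.0],
    "High School Percentage": [68.90, 80.12, -10.0],
})

# Assumed valid ranges (not given by the dataset): adjust to your own rules
valid_ranges = {
    "Age": (0, 120),
    "Admission Test Score": (0, 100),
    "High School Percentage": (0, 100),
}

counts = {}
for column, (low, high) in valid_ranges.items():
    # between() is False for missing values, so exclude them explicitly
    out_of_range = sample[column].notna() & ~sample[column].between(low, high)
    counts[column] = int(out_of_range.sum())
    print(column, "->", counts[column], "suspicious value(s)")
```

Missing values are left out of the count on purpose, since they are handled separately.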

In [18]:
# addressing missing values
for column in data.columns:
  if data[column].dtype == 'object':
    # if the column is of type object, fill it with the mode
    data[column] = data[column].fillna(data[column].mode()[0])
  else:
    # if the column is numeric, fill it with the mean
    data[column] = data[column].fillna(data[column].mean())

*Note:*
Calling `data[column].fillna(..., inplace=True)` sets values on an intermediate copy. pandas warns that this pattern will stop working in pandas 3.0, so the result is assigned back to the column instead.
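The same fill can also be done in a single call by passing a dictionary of column-to-value pairs to `DataFrame.fillna`. The sketch below shows that variant on a small made-up DataFrame; `sample` and `fill_values` are hypothetical names, and the numeric check uses `pd.api.types.is_numeric_dtype` rather than comparing against `'object'`.

```python
import numpy as np
import pandas as pd

# Small made-up table with one gap per column
sample = pd.DataFrame({
    "Age": [24.0, np.nan, 18.0],
    "City": ["Quetta", None, "Quetta"],
})

# Build one fill value per column: mean for numeric columns, mode otherwise
fill_values = {}
for column in sample.columns:
    if pd.api.types.is_numeric_dtype(sample[column]):
        fill_values[column] = sample[column].mean()
    else:
        fill_values[column] = sample[column].mode()[0]

# fillna() returns a new DataFrame, so assign it back
sample = sample.fillna(fill_values)
print(sample)
```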


In [19]:
# now check for missing values again
data.isna().sum()

| Column | Missing values |
|---|---|
| Name | 0 |
| Age | 0 |
| Gender | 0 |
| Admission Test Score | 0 |
| High School Percentage | 0 |
| City | 0 |
| Admission Status | 0 |


In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Name                    157 non-null    object 
 1   Age                     157 non-null    float64
 2   Gender                  157 non-null    object 
 3   Admission Test Score    157 non-null    float64
 4   High School Percentage  157 non-null    float64
 5   City                    157 non-null    object 
 6   Admission Status        157 non-null    object 
dtypes: float64(3), object(4)
memory usage: 8.7+ KB


# **4. How to Deal with Duplicate Data**

In [26]:
# first, we're checking if there is any duplicate data
check_duplicate = data.duplicated().sum()

print("Amount of duplicate data: ", check_duplicate)

Amount of duplicate data:  7


In [27]:
# handling duplicate data
data = data.drop_duplicates()

In [28]:
# checking on duplicate data after we handle it
check_duplicate = data.duplicated().sum()

print("Amount of duplicate data: ", check_duplicate)

Amount of duplicate data:  0
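Before dropping duplicates, it can help to look at the repeated rows themselves. This is a minimal sketch on a small made-up DataFrame (`sample` and `deduplicated` are hypothetical names). It also resets the index afterwards, which the cells above do not do.

```python
import pandas as pd

# Small made-up table where the first and third rows are identical
sample = pd.DataFrame({
    "Name": ["Ali", "Bilal", "Ali", "Fatima"],
    "City": ["Quetta", "Lahore", "Quetta", "Multan"],
})

# keep=False marks every copy of a repeated row, not only the later ones
print(sample[sample.duplicated(keep=False)])

# Drop the repeats (the first copy is kept) and renumber the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduplicated))
```

Without `reset_index(drop=True)`, the remaining rows keep their original index labels, so the index would have gaps where rows were removed.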


Yay! We did it! We have handled both the missing values and the duplicate data!

# **ABOUT ME**

Hello! My name is Edelweiss Saraswati Priyono. I'm an entry-level data scientist, majoring in Data Science at Bina Nusantara University. Please check out my GitHub and LinkedIn. Thank you!

[GitHub](https://github.com/edelweissaraswati)

[LinkedIn](https://www.linkedin.com/in/edelweiss-saraswati-priyono-3a4331260/)