# Exploratory Data Analysis
In this notebook, we perform exploratory data analysis (EDA) to gain a deeper understanding of the dataset and its individual features. We also pre-process the data to fix any anomalies found along the way.

In [1]:
# Import needed dependencies
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

In [2]:
# Load the dataset (PCOS_Infertility)
pcos_df = pd.read_csv("data/PCOS_infertility.csv")
pcos_df.shape

(541, 6)

In [3]:
# Preview dataset
pcos_df.head()

,Sl. No,Patient File No.,PCOS (Y/N),I beta-HCG(mIU/mL),II beta-HCG(mIU/mL),AMH(ng/mL)
0,1,10001,0,1.99,1.99,2.07
1,2,10002,0,60.8,1.99,1.53
2,3,10003,1,494.08,494.08,6.63
3,4,10004,0,1.99,1.99,1.22
4,5,10005,0,801.45,801.45,2.26


In [4]:
# Check datatypes of values
pcos_df.dtypes

Sl. No                      int64
Patient File No.            int64
PCOS (Y/N)                  int64
  I   beta-HCG(mIU/mL)    float64
II    beta-HCG(mIU/mL)    float64
AMH(ng/mL)                 object
dtype: object

`AMH(ng/mL)` should be of type `float64`, but it is listed as type `object`.
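
A numeric-looking column typically ends up with a non-numeric dtype when at least one entry cannot be parsed as a number. The sketch below illustrates this on hypothetical toy data; the `amh` column name and its values are made up for illustration and are not read from the PCOS file.

```python
import io

import pandas as pd

# Hypothetical toy CSVs, not the PCOS data
clean_csv = "amh\n2.07\n1.53\n6.63\n"
dirty_csv = "amh\n2.07\na\n6.63\n"  # one entry is not a number

clean_dtype = pd.read_csv(io.StringIO(clean_csv))["amh"].dtype
dirty_dtype = pd.read_csv(io.StringIO(dirty_csv))["amh"].dtype

# The clean column is parsed as floats; the dirty one falls back to a
# non-numeric dtype (reported as `object` in the output above)
print(clean_dtype, dirty_dtype)
```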

In [5]:
# Get unique values of the feature to check
pcos_df['AMH(ng/mL)'].unique()

array(['2.07', '1.53', '6.63', '1.22', '2.26', '6.74', '3.05', '1.54',
       '1', '1.61', '4.47', '1.67', '7.94', '2.38', '0.88', '0.69',
       '3.78', '1.92', '2.85', '2.13', '4.13', '2.5', '1.89', '0.26',
       '3.84', '3.56', '1.56', '1.69', '2.34', '1.58', '2.36', '3.64',
       '2.78', '0.33', '2.35', '3.88', '3.55', '4.33', '3.66', '4.5',
       '3.2', '2.1', '6.55', '1.2', '2.33', '3.22', '2.333', '2.31',
       '4.2', '3.21', '2.14', '2.3', '4.6', '5.8', '5.2', '4.63', '1.01',
       '2.58', '0.35', '5.23', '3.68', '2.55', '4.91', '1.03', '6.56',
       '3.91', '5.42', '1.65', '2.06', '1.81', '3.81', '3.65', '8.98',
       '1.7', '3.18', '2.75', '0.86', '2.29', '2.19', '8.46', '4.59',
       '1.04', '4.27', '3.86', '1.42', '10.07', '0.98', '4.07', '3.9',
       '10', '16.9', '17', '21.9', '1.6', '3.3', '21', '12.7', '1.8',
       '3.6', '15', '5', '17.9', '19.8', '9.2', '2.4', '5.14', '0.3',
       '11.48', '19.3', '8.8', '19', '4.3', '1.4', '12.6', '4.8', '17.1',
       '11

In [6]:
# Convert values to type float
# pcos_df['AMH(ng/mL)'] = pcos_df['AMH(ng/mL)'].astype(float)  # fails: the feature contains erroneous values (e.g. 'a')

The code above has been commented out because it raises a `ValueError`: the column contains at least one non-numeric value that cannot be converted to a float.
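
One way to locate such values directly, rather than by scanning the column by eye, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` instead of raising. This is a minimal sketch on a hypothetical sample Series standing in for `pcos_df['AMH(ng/mL)']`, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample standing in for pcos_df['AMH(ng/mL)']
amh = pd.Series(["2.07", "1.53", "a", "6.63"], name="AMH(ng/mL)")

# errors="coerce" converts unparseable entries to NaN instead of raising
as_numbers = pd.to_numeric(amh, errors="coerce")

# Entries that were present in the original but could not be parsed
bad_mask = as_numbers.isna() & amh.notna()
bad_values = amh[bad_mask]
print(bad_values)
```

The `amh.notna()` term keeps genuinely missing entries out of the result, so only malformed values are reported.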

In [7]:
# Sort the values
pcos_df['AMH(ng/mL)'].sort_values()

422     0.1
272    0.16
524    0.19
227     0.2
374     0.2
       ... 
454     9.7
351     9.8
144     9.9
434     9.9
305       a
Name: AMH(ng/mL), Length: 541, dtype: object
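
Note that the sorted output ends at `9.9` followed by `a`, even though the column also holds larger values such as `21.9`. Because the column holds strings, the values are compared character by character rather than numerically. The sketch below contrasts the two orderings on a hypothetical handful of values.

```python
import pandas as pd

# Hypothetical sample of string values like those in the column
values = pd.Series(["9.9", "10.07", "0.1", "a", "21.9"])

# String sort: compared character by character, so "10.07" < "9.9" < "a"
string_order = values.sort_values().tolist()

# Numeric sort: coerce to numbers first; unparseable entries become NaN
# and are placed last by sort_values
numeric = pd.to_numeric(values, errors="coerce")
numeric_order = values.loc[numeric.sort_values().index].tolist()

print(string_order)
print(numeric_order)
```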

Since only a single row (index 305) holds a non-numeric value, we can drop it with negligible data loss: 1 of 541 rows.

In [8]:
# Drop row where AMH(ng/mL) is "a"
pcos_df = pcos_df.drop(pcos_df[pcos_df['AMH(ng/mL)'] == 'a'].index)
pcos_df.shape

(540, 6)

In [9]:
# Retry the type conversion now that the non-numeric row has been dropped
pcos_df['AMH(ng/mL)'] = pcos_df['AMH(ng/mL)'].astype(float)
pcos_df.dtypes

Sl. No                      int64
Patient File No.            int64
PCOS (Y/N)                  int64
  I   beta-HCG(mIU/mL)    float64
II    beta-HCG(mIU/mL)    float64
AMH(ng/mL)                float64
dtype: object
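
The drop and conversion steps above hard-code the value `'a'`. As an alternative sketch, the two steps can be combined so that any unparseable value is handled, assuming it is acceptable to drop every affected row. The frame below is a hypothetical stand-in for `pcos_df`.

```python
import pandas as pd

# Hypothetical frame standing in for pcos_df
df = pd.DataFrame({
    "PCOS (Y/N)": [0, 0, 1, 0],
    "AMH(ng/mL)": ["2.07", "1.53", "a", "6.63"],
})

col = "AMH(ng/mL)"
n_before = len(df)

# Convert to numbers; anything unparseable becomes NaN
df[col] = pd.to_numeric(df[col], errors="coerce")

# Drop rows where the value is NaN after conversion
df = df.dropna(subset=[col])

n_dropped = n_before - len(df)
print(f"dropped {n_dropped} of {n_before} rows")
```

This variant also removes rows whose value was already missing before the conversion, so it is worth checking the number of dropped rows before committing to it.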