# Dataset Cleaning & Exploration   

_Erin Cameron   
2024-04-09_

## 1.0) Set up

In [107]:
# Installations


In [141]:
# Perform import statements
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## 2.0) Load & Format Data

In [142]:
print("Loading...")
data = pd.read_csv("../data/FedCycleData071012.csv")
print("Load complete!")

Loading...
Load complete!


In [143]:
# Display the dataset size
print("Dataset size: " + str(data.shape)) 

Dataset size: (1665, 80)


In [144]:
print(data.columns.tolist()) # Display the columns

['ClientID', 'CycleNumber', 'Group', 'CycleWithPeakorNot', 'ReproductiveCategory', 'LengthofCycle', 'MeanCycleLength', 'EstimatedDayofOvulation', 'LengthofLutealPhase', 'FirstDayofHigh', 'TotalNumberofHighDays', 'TotalHighPostPeak', 'TotalNumberofPeakDays', 'TotalDaysofFertility', 'TotalFertilityFormula', 'LengthofMenses', 'MeanMensesLength', 'MensesScoreDayOne', 'MensesScoreDayTwo', 'MensesScoreDayThree', 'MensesScoreDayFour', 'MensesScoreDayFive', 'MensesScoreDaySix', 'MensesScoreDaySeven', 'MensesScoreDayEight', 'MensesScoreDayNine', 'MensesScoreDayTen', 'MensesScoreDay11', 'MensesScoreDay12', 'MensesScoreDay13', 'MensesScoreDay14', 'MensesScoreDay15', 'TotalMensesScore', 'MeanBleedingIntensity', 'NumberofDaysofIntercourse', 'IntercourseInFertileWindow', 'UnusualBleeding', 'PhasesBleeding', 'IntercourseDuringUnusBleed', 'Age', 'AgeM', 'Maristatus', 'MaristatusM', 'Yearsmarried', 'Wedding', 'Religion', 'ReligionM', 'Ethnicity', 'EthnicityM', 'Schoolyears', 'SchoolyearsM', 'Occupation

In [145]:
print(data.head()) # Display the head of the data

  ClientID  CycleNumber  Group  CycleWithPeakorNot  ReproductiveCategory  \
0  nfp8122            1      0                   1                     0   
1  nfp8122            2      0                   1                     0   
2  nfp8122            3      0                   1                     0   
3  nfp8122            4      0                   1                     0   
4  nfp8122            5      0                   1                     0   

   LengthofCycle MeanCycleLength EstimatedDayofOvulation LengthofLutealPhase  \
0             29           27.33                      17                  12   
1             27                                      15                  12   
2             29                                      15                  14   
3             27                                      15                  12   
4             28                                      16                  12   

  FirstDayofHigh  ... Method Prevmethod Methoddate Whychart Ne

## 3.0) Data Exploration

_I am interested in using menstrual cycle data to predict whether a woman has experienced miscarriages. Here, I explore the binary class I am trying to predict (based on the "Miscarriages" column in the data frame) and how this class relates to the other variables in the dataset._

### 3.1) Exploring the `LengthofMenses` variable

_Across the 1665 rows of my dataset, the LengthofMenses values were distributed as follows:_
* _21/1665 (~1.26%) were 2 days._
* _63/1665 (~3.78%) were 3 days._
* _346/1665 (~20.78%) were 4 days._
* _629/1665 (~37.78%) were 5 days._
* _380/1665 (~22.82%) were 6 days._
* _155/1665 (~9.31%) were 7 days._
* _41/1665 (~2.46%) were 8 days._
* _20/1665 (~1.20%) were 9 days._
* _4/1665 (~0.24%) were 10 days._
* _1/1665 (~0.06%) was 11 days._
* _1/1665 (~0.06%) was 15 days._
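
_The counts and percentages in the list above can be reproduced with a few lines of pandas. Below is a minimal sketch on a small made-up Series (`toy` and its values are illustrative, not taken from the dataset). It divides by the total number of rows, so rows with a blank value stay in the denominator, as they do in the list above._

```python
import pandas as pd

# Hypothetical stand-in for data["LengthofMenses"]; values are strings, as in the CSV
toy = pd.Series(["5", "5", "4", "6", " ", "5", "4", "6"])

counts = toy.value_counts()
# Divide by the total number of rows (blanks included), then convert to percent
percent = (counts / len(toy) * 100).round(2)

summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```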

_isna() reports no NaN values in this column. A follow-up string check finds 4 rows where the value is a single space (" "); these are the missing values, and the 4 rows were removed from the study._

In [146]:
# Display the unique values in this column and counts for each answer
print("=====> LengthofMenses")
display(data["LengthofMenses"].value_counts().sort_index())

=====> LengthofMenses


        4
10      4
11      1
15      1
2      21
3      63
4     346
5     629
6     380
7     155
8      41
9      20
Name: LengthofMenses, dtype: int64
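
_The output above is ordered as text ("10" comes before "2") because the column was loaded as strings. Below is a minimal sketch of one way to get a numeric ordering, assuming it is acceptable to turn blank entries into NaN; `toy` is a made-up Series, not the real column._

```python
import pandas as pd

# Hypothetical stand-in for data["LengthofMenses"]
toy = pd.Series(["10", "2", "3", "2", " ", "15"])

# Strip whitespace, then coerce: anything that is not a number becomes NaN
numeric = pd.to_numeric(toy.str.strip(), errors="coerce")

# value_counts() skips NaN by default; sort_index() now orders numerically
counts = numeric.value_counts().sort_index()
print(counts)
```

_With this conversion the blank rows also show up in `numeric.isna()`, which the string column does not allow._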

In [147]:
# Display isna() values
print("=====>  Are there any NaN values in the \"LengthofMenses\" column of the dataset that need to be removed?")
display(data["LengthofMenses"].isna().value_counts())

# Display rows whose value is a single space (a blank placeholder that isna() does not catch)
print("\n\n=====>  Are there any missing values in the \"LengthofMenses\" column of the dataset that need to be removed?")
filtered_data = data.loc[data["LengthofMenses"].str.contains(" ")]
print("ClientID:\n", filtered_data["LengthofMenses"])

=====>  Are there any NaN values in the "LengthofMenses" column of the dataset that need to be removed?


False    1665
Name: LengthofMenses, dtype: int64



=====>  Are there any missing values in the "LengthofMenses" column of the dataset that need to be removed?
ClientID:
 1107     
1298     
1340     
1664     
Name: LengthofMenses, dtype: object
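
_The same two checks (NaN and blank strings) are repeated for each variable in this notebook. Below is a minimal sketch of a helper that runs both checks on every column at once. The function name `count_missing` and the `toy` frame are hypothetical and not part of the original analysis._

```python
import numpy as np
import pandas as pd

def count_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN cells and whitespace-only string cells in each column."""
    nan_counts = df.isna().sum()
    # Only string cells can be whitespace-only; other types count as not blank
    blank_counts = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    return pd.DataFrame({"nan": nan_counts, "blank": blank_counts})

# Made-up example frame with both kinds of missing value
toy = pd.DataFrame({
    "LengthofMenses": ["5", " ", "4"],
    "Miscarriages": [" ", " ", "0"],
    "Age": [31.0, np.nan, 28.0],
})
report = count_missing(toy)
print(report)
```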


In [148]:
# Now that we have identified the missing values, drop the rows that contain them
data = data.drop(data[data["LengthofMenses"] == " "].index)

# Repeat the single-space check to confirm the rows have been removed
filtered_data = data.loc[data["LengthofMenses"].str.contains(" ")]
print("ClientID:\n", filtered_data["LengthofMenses"])

ClientID:
 Series([], Name: LengthofMenses, dtype: object)


In [149]:
# Review the dataset size again; it should have 4 fewer rows
print("Dataset size: " + str(data.shape))

Dataset size: (1661, 80)
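
_An alternative to dropping on `== " "` is to convert blanks to NaN first and then use `dropna`. Below is a minimal sketch on made-up data, assuming every empty or whitespace-only string should count as missing. It also turns the "4 fewer rows" check into an assertion._

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data frame
toy = pd.DataFrame({
    "ClientID": ["a", "b", "c", "d"],
    "LengthofMenses": ["5", " ", "4", " "],
})

rows_before = len(toy)
# Turn empty or whitespace-only strings into NaN
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
n_missing = int(cleaned["LengthofMenses"].isna().sum())
cleaned = cleaned.dropna(subset=["LengthofMenses"])

# Sanity check: exactly n_missing rows were dropped
assert len(cleaned) == rows_before - n_missing
print(cleaned)
```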


### 3.2) Exploring the `Miscarriages` variable

_isna() reports no NaN values in this column. A follow-up string check finds 1522 rows where the value is a single space (" "); these are the missing values, and the 1522 rows were removed from the study._

In [150]:
# Display the unique values in this column and counts for each answer
print("=====> Miscarriages")
display(data["Miscarriages"].value_counts().sort_index())

=====> Miscarriages


     1522
0     107
1      23
2       6
3       1
4       2
Name: Miscarriages, dtype: int64

In [151]:
# Display isna() values
print("=====>  Are there any NaN values in the \"Miscarriages\" column of the dataset that need to be removed?")
display(data["Miscarriages"].isna().value_counts())

# Count rows whose value is a single space (a blank placeholder that isna() does not catch)
print("\n\n=====>  Are there any missing values in the \"Miscarriages\" column of the dataset that need to be removed?")
filtered_data = data.loc[data["Miscarriages"].str.contains(" ")]
print(filtered_data["Miscarriages"].value_counts())

=====>  Are there any NaN values in the "Miscarriages" column of the dataset that need to be removed?


False    1661
Name: Miscarriages, dtype: int64



=====>  Are there any missing values in the "Miscarriages" column of the dataset that need to be removed?
     1522
Name: Miscarriages, dtype: int64


In [152]:
# Now that we have identified the missing values, drop the rows that contain them
data = data.drop(data[data["Miscarriages"] == " "].index)

# Repeat the single-space check to confirm the rows have been removed
filtered_data = data.loc[data["Miscarriages"].str.contains(" ")]
print("ClientID:\n", filtered_data["Miscarriages"])

ClientID:
 Series([], Name: Miscarriages, dtype: object)
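
_The "Miscarriages" column stores a count (0 to 4 in this dataset), while the goal stated at the start of this section is a binary class. Below is a minimal sketch of one possible way to derive such a label. The column name `HadMiscarriage` and the "at least one" threshold are my assumptions, not something defined in the dataset. On the 139 remaining rows, this rule would give 107 rows in class 0 and 32 rows (23 + 6 + 1 + 2) in class 1._

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data frame
toy = pd.DataFrame({"Miscarriages": ["0", "1", "0", "2", "4"]})

counts = pd.to_numeric(toy["Miscarriages"])
# Assumed binary label: 1 if at least one miscarriage was reported, else 0
toy["HadMiscarriage"] = (counts > 0).astype(int)

print(toy["HadMiscarriage"].value_counts().sort_index())
```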


In [153]:
# We are left with 139 rows that have values for both Miscarriages and LengthofMenses
# For each LengthofMenses value, count the rows at each Miscarriages value

result = data.groupby("LengthofMenses")["Miscarriages"].value_counts().unstack(fill_value=0)

# Reformat the DataFrame to display the desired output format
output = result.rename_axis(columns="Miscarriages").reset_index()
output.columns.name = None  # Remove the column name for better display

# Sort output by 'LengthofMenses'
output = output.sort_values(by="LengthofMenses")

print(output)

  LengthofMenses   0   1  2  3  4
0             15   1   0  0  0  0
1              2   1   0  0  0  0
2              3   1   0  1  0  0
3              4  20   4  1  1  0
4              5  36  10  2  0  2
5              6  29   6  2  0  0
6              7  11   1  0  0  0
7              8   6   2  0  0  0
8              9   2   0  0  0  0
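
_In the table above, 15 is listed before 2 because LengthofMenses is still a string column and is sorted as text. Below is a minimal sketch of the same kind of table built with `pd.crosstab` after converting both columns to numbers. The `toy` frame is made up, and using `crosstab` here is my substitution for the groupby/unstack approach._

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data frame
toy = pd.DataFrame({
    "LengthofMenses": ["15", "2", "5", "5", "4", "5"],
    "Miscarriages":   ["0",  "0", "1", "0", "2", "0"],
})

length = pd.to_numeric(toy["LengthofMenses"])
misc = pd.to_numeric(toy["Miscarriages"])

# Rows: LengthofMenses, columns: Miscarriages, cells: number of rows
table = pd.crosstab(length, misc)
print(table)
```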


### 3.3) Exploring the `BMI` variable

_isna() reports no NaN values in this column. A follow-up string check finds that 11 of the remaining 139 rows hold a blank (" ") value; these are the missing values that need to be removed from the study._

In [154]:
# Display the unique values in this column and counts for each answer
print("=====> BMI")
display(data["BMI"].value_counts().sort_index())

=====> BMI


                    11
16.8266565885614     1
18.0095789708176     1
18.7750062988158     1
18.7926041434618     1
18.8520761245675     1
18.8822314049587     1
19.1955471539593     1
19.366391184573      1
19.4602076124567     1
19.6484815909702     1
19.737548828125      2
19.9338374291115     1
19.9668639053254     1
20.0263370061811     1
20.1170655567118     1
20.1733241505969     1
20.3768115942029     1
20.4660355029586     1
20.4828303850156     1
20.4960973370064     1
20.524437716263      2
20.5462333081381     1
20.595703125         2
20.671864557808      2
20.8030612244898     1
20.93896484375       1
20.9465306122449     1
21.0314776274714     1
21.1416796613945     1
21.254724111867      2
21.2822265625        1
21.453857421875      1
21.4548897304522     1
21.6089695137314     1
21.6307692307692     1
21.9247048340388     1
21.945889698231      1
21.9485766758494     1
22.1099632690542     1
22.1487082545684     1
22.1967993079585     1
22.2616666666667     1
22.31201171

In [155]:
print(data.shape)

(139, 80)


In [156]:
# Display isna() values
print("=====>  Are there any NaN values in the \"BMI\" column of the dataset that need to be removed?")
display(data["BMI"].isna().value_counts())

# Count rows whose value contains a space (blank placeholders that isna() does not catch)
print("\n\n=====>  Are there any missing values in the \"BMI\" column of the dataset that need to be removed?")
filtered_data = data.loc[data["BMI"].str.contains(" ")]
print("ClientID:\n", filtered_data["BMI"].value_counts())

=====>  Are there any NaN values in the "BMI" column of the dataset that need to be removed?


False    139
Name: BMI, dtype: int64



=====>  Are there any missing values in the "BMI" column of the dataset that need to be removed?
ClientID:
      11
Name: BMI, dtype: int64
