# Cleaning and testing `masculinity-survey.csv`

This notebook cleans the `masculinity-survey.csv` file in the `raw` folder. <br>
The file is part of a dataset published by FiveThirtyEight and contains the responses to the Masculinity Survey conducted by SurveyMonkey in partnership with FiveThirtyEight and WNYC Studios in 2018.

Each step of data cleaning comes with test cases that verify the state of the data. <br>
These test cases also serve as **specifications** for each step. Should you need to edit the data cleaning code for machine learning, you can read the test cases as reminders of what each step does and edit the relevant section of code with confidence.

# **Summary of the results**

Below are the first few rows of the dataset before and after cleaning. For more details, please refer to the full cleaning process.

In [1]:
%%capture
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
tqdm.pandas()  # registers progress_apply on pandas objects

%matplotlib inline
import pandas as pd

## Before the cleaning

In [2]:
survey_raw = pd.read_csv("raw/masculinity-survey.csv")
survey_raw.head()

Unnamed: 0,AMONG ADULT MEN,Unnamed: 1,Adult Men,Age,Unnamed: 4,Unnamed: 5,Race,Unnamed: 7,Children,Unnamed: 9,Sexual Orientation,Unnamed: 11
0,,,,18 - 34,35 - 64,65 and up,White,Non-white,Has children,No children,Straight,Gay/Bisexual
1,"In general, how masculine or ""manly"" do you feel?",,,,,,,,,,,
2,,Very masculine,37%,29%,42%,37%,34%,44%,43%,30%,40%,21%
3,,Somewhat masculine,46%,47%,46%,47%,50%,39%,47%,46%,47%,49%
4,,Not very masculine,11%,13%,9%,13%,11%,11%,6%,17%,10%,19%


In [3]:
survey_raw.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Columns: 12 entries, AMONG ADULT MEN to Unnamed: 11
dtypes: object(12)
memory usage: 21.9+ KB


## After the cleaning

In [4]:
# survey_cleaned = pd.read_csv("cleaned/cleaned-survey.csv")
# survey_cleaned.head(10)

In [5]:
# survey_cleaned.info(verbose=False)

# **Full cleaning processes below**

## Copying dataset

In case we need to compare the raw and cleaned datasets later on, we work on a copy of the dataset named `survey_wip` and leave `survey_raw` untouched.

In [6]:
survey_wip = survey_raw.copy()

## Inspecting dataset

`survey_wip` contains the survey results: each row holds either a question or one of its answer options, and each column holds the percentages for one demographic group.

In [7]:
survey_wip.head()

Unnamed: 0,AMONG ADULT MEN,Unnamed: 1,Adult Men,Age,Unnamed: 4,Unnamed: 5,Race,Unnamed: 7,Children,Unnamed: 9,Sexual Orientation,Unnamed: 11
0,,,,18 - 34,35 - 64,65 and up,White,Non-white,Has children,No children,Straight,Gay/Bisexual
1,"In general, how masculine or ""manly"" do you feel?",,,,,,,,,,,
2,,Very masculine,37%,29%,42%,37%,34%,44%,43%,30%,40%,21%
3,,Somewhat masculine,46%,47%,46%,47%,50%,39%,47%,46%,47%,49%
4,,Not very masculine,11%,13%,9%,13%,11%,11%,6%,17%,10%,19%


In [8]:
survey_wip.describe(include="all")

Unnamed: 0,AMONG ADULT MEN,Unnamed: 1,Adult Men,Age,Unnamed: 4,Unnamed: 5,Race,Unnamed: 7,Children,Unnamed: 9,Sexual Orientation,Unnamed: 11
count,35,195,189,155,190,190,190,190,190,190,190,155
unique,35,104,59,58,62,64,61,62,62,58,61,60
top,AMONG EMPLOYED: Have you seen or heard of a se...,No answer,7%,0%,1%,1%,1%,5%,1%,1%,1%,0%
freq,1,23,11,19,17,13,13,10,19,11,13,17


First off, `survey_wip` has non-intuitive column names. We have to rename them for readability. <br>
Second, `survey_wip` has nothing but string values. We have to convert the percentage columns to a numeric datatype. <br>
Third, `survey_wip` has numerous `NaN` values. We have to clear out as many of them as possible.
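
The second and third observations can be checked numerically rather than by eye. The sketch below is only an illustration: `sample_raw` is a hypothetical miniature of the raw file, typed in by hand from the rows displayed above and limited to four of the twelve columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw file (not read from disk),
# copied from the first rows shown above.
sample_raw = pd.DataFrame(
    {
        "AMONG ADULT MEN": [
            np.nan,
            'In general, how masculine or "manly" do you feel?',
            np.nan,
            np.nan,
            np.nan,
        ],
        "Unnamed: 1": [np.nan, np.nan, "Very masculine", "Somewhat masculine", "Not very masculine"],
        "Adult Men": [np.nan, np.nan, "37%", "46%", "11%"],
        "Age": ["18 - 34", np.nan, "29%", "47%", "13%"],
    }
)

# Second observation: which columns are not stored as numbers?
non_numeric_cols = [
    col for col in sample_raw.columns
    if not pd.api.types.is_numeric_dtype(sample_raw[col])
]

# Third observation: how many values are missing in each column?
missing_per_column = sample_raw.isna().sum()

print(non_numeric_cols)
print(missing_per_column)
```

Running the same two checks on `survey_wip` itself would give the full picture for all twelve columns.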

## Changing column names

All the new column names will be in snake case.

In [9]:
survey_mapper = {"AMONG ADULT MEN": "question", "Unnamed: 1": "answer",
                 "Adult Men": "age_all", "Age": "age_18_34", "Unnamed: 4": "age_35_64", "Unnamed: 5": "age_65+",
                 "Race": "race_white", "Unnamed: 7": "race_non_white", "Children": "has_children", "Unnamed: 9": "no_children",
                 "Sexual Orientation": "straight", "Unnamed: 11": "gay_bisexual"}
survey_wip.rename(columns=survey_mapper, inplace=True)
survey_wip.columns

Index(['question', 'answer', 'age_all', 'age_18_34', 'age_35_64', 'age_65+',
       'race_white', 'race_non_white', 'has_children', 'no_children',
       'straight', 'gay_bisexual'],
      dtype='object')
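
A test case for this step could look like the sketch below. It is not taken from the original notebook: `EXPECTED_COLUMNS`, `check_renamed_columns`, and `sample_renamed` are hypothetical names, and the empty frame merely stands in for `survey_wip`.

```python
import pandas as pd

# Column names the later steps rely on, copied from the output above.
EXPECTED_COLUMNS = [
    "question", "answer", "age_all", "age_18_34", "age_35_64", "age_65+",
    "race_white", "race_non_white", "has_children", "no_children",
    "straight", "gay_bisexual",
]


def check_renamed_columns(df: pd.DataFrame) -> None:
    """Hypothetical test helper: fail if the rename step left anything behind."""
    assert list(df.columns) == EXPECTED_COLUMNS, "unexpected column names or order"
    assert not any(str(col).startswith("Unnamed") for col in df.columns)


# Stand-in for survey_wip: an empty frame that already has the new names.
sample_renamed = pd.DataFrame(columns=EXPECTED_COLUMNS)
check_renamed_columns(sample_renamed)
```

In the notebook itself, the call would be `check_renamed_columns(survey_wip)` right after the rename cell.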

## Truncating the dataset

In [10]:
survey_wip.drop(index=0, inplace=True) # Dropping the first row
survey_wip.reset_index(drop=True, inplace=True)
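
The first row only repeats the subgroup labels (`18 - 34`, `White`, and so on), which the new column names already encode, so it can be dropped. A possible specification for this step is sketched below on a hypothetical three-column stand-in for `survey_wip`; the names `sample` and `truncated` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for survey_wip right after the rename step.
sample = pd.DataFrame(
    {
        "question": [np.nan, 'In general, how masculine or "manly" do you feel?', np.nan],
        "answer": [np.nan, np.nan, "Very masculine"],
        "age_18_34": ["18 - 34", np.nan, "29%"],
    }
)

# Same two calls as in the cell above, chained instead of using inplace=True.
truncated = sample.drop(index=0).reset_index(drop=True)

# Spec: exactly one row is gone, the index runs 0..n-1 again,
# and the subgroup label no longer appears among the values.
assert len(truncated) == len(sample) - 1
assert list(truncated.index) == list(range(len(truncated)))
assert "18 - 34" not in truncated["age_18_34"].tolist()
```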

## Converting the numbers

In [11]:
numeric_cols = survey_wip.columns[2:]
numeric_cols

Index(['age_all', 'age_18_34', 'age_35_64', 'age_65+', 'race_white',
       'race_non_white', 'has_children', 'no_children', 'straight',
       'gay_bisexual'],
      dtype='object')

In [12]:
survey_wip[numeric_cols] = survey_wip[numeric_cols].progress_apply(lambda col : col.str.replace("%", "").astype(float) / 100)
survey_wip.head()


Unnamed: 0,question,answer,age_all,age_18_34,age_35_64,age_65+,race_white,race_non_white,has_children,no_children,straight,gay_bisexual
0,"In general, how masculine or ""manly"" do you feel?",,,,,,,,,,,
1,,Very masculine,0.37,0.29,0.42,0.37,0.34,0.44,0.43,0.3,0.4,0.21
2,,Somewhat masculine,0.46,0.47,0.46,0.47,0.5,0.39,0.47,0.46,0.47,0.49
3,,Not very masculine,0.11,0.13,0.09,0.13,0.11,0.11,0.06,0.17,0.1,0.19
4,,Not at all masculine,0.05,0.1,0.02,0.03,0.04,0.06,0.03,0.06,0.03,0.1
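
The conversion can be verified with a test along the following lines. This is a sketch on hypothetical sample data (`sample`, `sample_numeric_cols`), using plain `apply` instead of `progress_apply` so that it runs without tqdm; the transformation itself mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for survey_wip before the conversion.
sample = pd.DataFrame(
    {
        "answer": [np.nan, "Very masculine", "Somewhat masculine"],
        "age_all": [np.nan, "37%", "46%"],
        "age_18_34": [np.nan, "29%", "47%"],
    }
)
sample_numeric_cols = sample.columns[1:]

# Strip the percent sign, cast to float, and rescale to the 0..1 range.
sample[sample_numeric_cols] = sample[sample_numeric_cols].apply(
    lambda col: col.str.replace("%", "", regex=False).astype(float) / 100
)

# Spec: every converted column is a float column whose values are fractions.
for col in sample_numeric_cols:
    assert pd.api.types.is_float_dtype(sample[col])
    assert sample[col].dropna().between(0, 1).all()
```

Rows that carry only a question keep `NaN` in the numeric columns, which is why the range check ignores missing values.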


## Saving the dataframe

In [13]:
survey_wip.to_csv("cleaned/cleaned-survey.csv", index=False)
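
One way to test the saving step is a round trip: write the frame, read it back, and compare. The sketch below assumes a hypothetical three-row frame `sample_cleaned` and writes to a temporary directory rather than to the `cleaned` folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned survey_wip.
sample_cleaned = pd.DataFrame(
    {
        "question": ['In general, how masculine or "manly" do you feel?', np.nan, np.nan],
        "answer": [np.nan, "Very masculine", "Somewhat masculine"],
        "age_all": [np.nan, 0.37, 0.46],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned-survey.csv")
    sample_cleaned.to_csv(path, index=False)  # same call as in the cell above
    reloaded = pd.read_csv(path)

# Spec: the file on disk reproduces the frame (columns, dtypes, values, NaN positions).
pd.testing.assert_frame_equal(reloaded, sample_cleaned)
```

Note that a column consisting only of `NaN` would be read back as a float column, so such a column could fail the dtype comparison even when the values match.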