# Cleaning `masculinity-survey.csv`

This notebook documents the data cleaning process for the `masculinity-survey.csv` file in the `raw` folder. <br>
The file is part of a dataset published by FiveThirtyEight and describes the responses to the Masculinity Survey conducted by SurveyMonkey in partnership with FiveThirtyEight and WNYC Studios in 2018.

# **Summary of the results**

* input : `masculinity-survey.csv`
* output : `cleaned-survey.csv`

Below are the first 5 rows of the dataset before and after cleaning. <br>
For more details, please refer to the full cleaning process.

In [1]:
%%capture
from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects

%matplotlib inline
import pandas as pd
import numpy as np

## Before the cleaning

In [2]:
survey_raw = pd.read_csv("raw/masculinity-survey.csv")
survey_raw.head()

Unnamed: 0,AMONG ADULT MEN,Unnamed: 1,Adult Men,Age,Unnamed: 4,Unnamed: 5,Race,Unnamed: 7,Children,Unnamed: 9,Sexual Orientation,Unnamed: 11
0,,,,18 - 34,35 - 64,65 and up,White,Non-white,Has children,No children,Straight,Gay/Bisexual
1,"In general, how masculine or ""manly"" do you feel?",,,,,,,,,,,
2,,Very masculine,37%,29%,42%,37%,34%,44%,43%,30%,40%,21%
3,,Somewhat masculine,46%,47%,46%,47%,50%,39%,47%,46%,47%,49%
4,,Not very masculine,11%,13%,9%,13%,11%,11%,6%,17%,10%,19%


In [3]:
survey_raw.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Columns: 12 entries, AMONG ADULT MEN to Unnamed: 11
dtypes: object(12)
memory usage: 21.9+ KB


## After the cleaning

In [4]:
survey_cleaned = pd.read_csv("cleaned/cleaned-survey.csv")
survey_cleaned.head()

Unnamed: 0,question,question_num,answer,is_multi_select,age_all,age_18_34,age_35_64,age_65+,race_white,race_non_white,has_children,no_children,straight,gay_bisexual
0,"In general, how masculine or ""manly"" do you feel?",q01,Very masculine,False,0.37,0.29,0.42,0.37,0.34,0.44,0.43,0.3,0.4,0.21
1,"In general, how masculine or ""manly"" do you feel?",q01,Somewhat masculine,False,0.46,0.47,0.46,0.47,0.5,0.39,0.47,0.46,0.47,0.49
2,"In general, how masculine or ""manly"" do you feel?",q01,Not very masculine,False,0.11,0.13,0.09,0.13,0.11,0.11,0.06,0.17,0.1,0.19
3,"In general, how masculine or ""manly"" do you feel?",q01,Not at all masculine,False,0.05,0.1,0.02,0.03,0.04,0.06,0.03,0.06,0.03,0.1
4,"In general, how masculine or ""manly"" do you feel?",q01,No answer,False,0.01,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.0


In [5]:
survey_cleaned.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Columns: 14 entries, question to gay_bisexual
dtypes: bool(1), float64(10), object(3)
memory usage: 20.1+ KB


# **Full cleaning processes below**

## Copying dataset

In case we need to compare the raw and cleaned datasets later on, we create a copy of the raw dataset and name it `survey_wip`. <br>
`survey_wip` is what we are going to work with.

In [6]:
survey_wip = survey_raw.copy()

## Inspecting dataset

`survey_wip` contains the survey results, where each row holds either a question or one of its answer options, and each column represents a demographic group.

In [7]:
survey_wip.head()

Unnamed: 0,AMONG ADULT MEN,Unnamed: 1,Adult Men,Age,Unnamed: 4,Unnamed: 5,Race,Unnamed: 7,Children,Unnamed: 9,Sexual Orientation,Unnamed: 11
0,,,,18 - 34,35 - 64,65 and up,White,Non-white,Has children,No children,Straight,Gay/Bisexual
1,"In general, how masculine or ""manly"" do you feel?",,,,,,,,,,,
2,,Very masculine,37%,29%,42%,37%,34%,44%,43%,30%,40%,21%
3,,Somewhat masculine,46%,47%,46%,47%,50%,39%,47%,46%,47%,49%
4,,Not very masculine,11%,13%,9%,13%,11%,11%,6%,17%,10%,19%


In [8]:
survey_wip.describe(include="all")

Unnamed: 0,AMONG ADULT MEN,Unnamed: 1,Adult Men,Age,Unnamed: 4,Unnamed: 5,Race,Unnamed: 7,Children,Unnamed: 9,Sexual Orientation,Unnamed: 11
count,35,195,189,155,190,190,190,190,190,190,190,155
unique,35,104,59,58,62,64,61,62,62,58,61,60
top,AMONG EMPLOYED: How much have you heard about ...,No answer,1%,0%,1%,1%,1%,5%,1%,1%,1%,0%
freq,1,23,11,19,17,13,13,10,19,11,13,17


First off, `survey_wip` has non-intuitive column names. We have to rename them for readability. <br>
Second, `survey_wip` has nothing but string values. We have to convert the percentage columns to a numeric datatype. <br>
Third, `survey_wip` has numerous `NaN` values scattered throughout. We have to clear out as many of them as possible.
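The three observations above can be checked programmatically. The following is a minimal sketch on a hypothetical toy frame that only imitates the raw layout (the values are illustrative, not taken from the survey); it counts missing cells per column and looks for columns that are already numeric.

```python
import pandas as pd

# Toy frame imitating the raw layout (illustrative values, not the real data)
toy = pd.DataFrame({
    "AMONG ADULT MEN": [None, "Question text?", None, None],
    "Unnamed: 1": [None, None, "Yes", "No"],
    "Adult Men": [None, None, "60%", "40%"],
})

# Number of missing cells in each column
null_counts = toy.isna().sum()

# Columns pandas already treats as numeric (none, since percentages are strings)
numeric_found = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]

print(null_counts)
print(numeric_found)
```

Running the same two checks on `survey_wip` should give a quick picture of how much cleaning is left after each step.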

## Changing column names

All the new column names will be in snake case.

In [9]:
survey_mapper = {"AMONG ADULT MEN":"question", "Unnamed: 1":"answer",
                 "Adult Men":"age_all", "Age":"age_18_34", "Unnamed: 4": "age_35_64", "Unnamed: 5": "age_65+",
                 "Race":"race_white", 'Unnamed: 7':"race_non_white", "Children":"has_children", "Unnamed: 9":"no_children",
                 "Sexual Orientation":"straight", "Unnamed: 11":"gay_bisexual"}
survey_wip.rename(columns=survey_mapper, inplace=True)
survey_wip.columns

Index(['question', 'answer', 'age_all', 'age_18_34', 'age_35_64', 'age_65+',
       'race_white', 'race_non_white', 'has_children', 'no_children',
       'straight', 'gay_bisexual'],
      dtype='object')
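One thing to keep in mind is that `DataFrame.rename` silently skips mapper keys that do not match any column, so a typo in the mapper goes unnoticed. The sketch below, using a hypothetical toy frame and mapper, shows one possible way to detect such keys before renaming.

```python
import pandas as pd

# Hypothetical toy columns and mapper, for illustration only
toy = pd.DataFrame(columns=["AMONG ADULT MEN", "Unnamed: 1", "Adult Men"])
toy_mapper = {
    "AMONG ADULT MEN": "question",
    "Unnamed: 1": "answer",
    "Adult men": "age_all",  # deliberate typo: lowercase "men"
}

# Mapper keys that do not correspond to any existing column
missing_keys = sorted(set(toy_mapper) - set(toy.columns))

# rename() ignores the unmatched key, so "Adult Men" keeps its old name
renamed = toy.rename(columns=toy_mapper)

print(missing_keys)
print(list(renamed.columns))
```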

## Truncating the dataset

In [10]:
survey_wip.drop(index=0, inplace=True) # Dropping the first row, which only holds the sub-group labels now encoded in the column names
survey_wip.reset_index(drop=True, inplace=True)

## Converting the numbers

In [11]:
numeric_cols = survey_wip.columns[2:]
numeric_cols

Index(['age_all', 'age_18_34', 'age_35_64', 'age_65+', 'race_white',
       'race_non_white', 'has_children', 'no_children', 'straight',
       'gay_bisexual'],
      dtype='object')

In [12]:
survey_wip[numeric_cols] = survey_wip[numeric_cols].progress_apply(lambda col : col.str.replace("%", "").astype(float) / 100)
survey_wip.head()

100%|████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 454.75it/s]


Unnamed: 0,question,answer,age_all,age_18_34,age_35_64,age_65+,race_white,race_non_white,has_children,no_children,straight,gay_bisexual
0,"In general, how masculine or ""manly"" do you feel?",,,,,,,,,,,
1,,Very masculine,0.37,0.29,0.42,0.37,0.34,0.44,0.43,0.3,0.4,0.21
2,,Somewhat masculine,0.46,0.47,0.46,0.47,0.5,0.39,0.47,0.46,0.47,0.49
3,,Not very masculine,0.11,0.13,0.09,0.13,0.11,0.11,0.06,0.17,0.1,0.19
4,,Not at all masculine,0.05,0.1,0.02,0.03,0.04,0.06,0.03,0.06,0.03,0.1
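To make the conversion easier to follow in isolation, here is a small sketch on a hypothetical toy Series. It uses `str.rstrip` and `pd.to_numeric` instead of the `str.replace(...).astype(float)` chain used in the cell above; this is an alternative spelling of the same idea, not the notebook's own code.

```python
import pandas as pd

# Toy percentage strings, including a missing value (illustrative only)
percentages = pd.Series(["37%", "0%", None, "100%"])

# Strip the trailing "%" sign, parse as numbers, and rescale to the 0-1 range
fractions = pd.to_numeric(percentages.str.rstrip("%")) / 100

print(fractions)
```

Missing values stay missing after the conversion, which is why the question-only rows still show `NaN` in the numeric columns.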


## Filling in `question` column

In [13]:
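# Forward-fill so that every answer row carries the text of its question.
# Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions.
survey_wip['question'] = survey_wip['question'].ffill()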
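As a minimal illustration of forward filling, assuming a toy Series with made-up question texts, each missing entry takes the most recent non-missing value above it:

```python
import pandas as pd

# Toy question column: text appears once, followed by empty answer rows
questions = pd.Series(["Q1?", None, None, "Q2?", None])

# Propagate the last seen question text downwards
filled = questions.ffill()

print(filled.tolist())
```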

## Dropping rows with nothing but question

In [14]:
non_questions = survey_wip.columns.drop('question')
survey_wip = survey_wip.dropna(how='all', subset=non_questions)
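The combination of `how='all'` and `subset` removes a row only when every column other than `question` is missing. A short sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame after forward filling: row 0 carries only the question text
toy = pd.DataFrame({
    "question": ["Q1?", "Q1?", "Q1?"],
    "answer": [None, "Yes", "No"],
    "age_all": [None, 0.6, 0.4],
})

# Every column except "question"
value_cols = toy.columns.drop("question")

# Drop rows where all of those columns are missing
kept = toy.dropna(how="all", subset=value_cols)

print(kept)
```

Note that the original index labels are preserved, which is why the later outputs in this notebook start at index 1 rather than 0.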

## Adding `is_multi_select` column

In [15]:
survey_wip['is_multi_select'] = survey_wip['question'].str.contains(r"Select all that apply")
survey_wip.head()

Unnamed: 0,question,answer,age_all,age_18_34,age_35_64,age_65+,race_white,race_non_white,has_children,no_children,straight,gay_bisexual,is_multi_select
1,"In general, how masculine or ""manly"" do you feel?",Very masculine,0.37,0.29,0.42,0.37,0.34,0.44,0.43,0.3,0.4,0.21,False
2,"In general, how masculine or ""manly"" do you feel?",Somewhat masculine,0.46,0.47,0.46,0.47,0.5,0.39,0.47,0.46,0.47,0.49,False
3,"In general, how masculine or ""manly"" do you feel?",Not very masculine,0.11,0.13,0.09,0.13,0.11,0.11,0.06,0.17,0.1,0.19,False
4,"In general, how masculine or ""manly"" do you feel?",Not at all masculine,0.05,0.1,0.02,0.03,0.04,0.06,0.03,0.06,0.03,0.1,False
5,"In general, how masculine or ""manly"" do you feel?",No answer,0.01,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.0,False
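The flag relies on a substring match. The sketch below repeats the idea on a few toy question strings and passes `regex=False` to make the literal match explicit; that argument is an addition for illustration and is not used in the cell above.

```python
import pandas as pd

# Toy question texts (the first two wordings appear in the survey, the third is made up)
questions = pd.Series([
    "Do you have any children? (Select all that apply.)",
    "Are you:",
    "Hypothetical question? (Select all that apply).",
])

# Literal substring match; catches both "apply.)" and "apply)." variants
flags = questions.str.contains("Select all that apply", regex=False)

print(flags.tolist())
```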


## Adding `question_num` column

In [16]:
question_dict = {'In general, how masculine or "manly" do you feel?':"q01",
                 'How important is it to you that others see you as masculine?':"q02",
                 'Where have you gotten your ideas about what it means to be a good man? (Select all that apply.)':"q04",
                 'Do you think that society puts pressure on men in a way that is unhealthy or bad for them?':"q05",
                 'Ask a friend for professional advice':"q07_01",
                 'Ask a friend for personal advice':"q07_02",
                 'Express physical affection to male friends, like hugging, rubbing shoulders':"q07_03",
                 'Cry':"q07_04",
                 'Get in a physical fight with another person':"q07_05",
                 'Have sexual relations with women, including anything from kissing to sex':"q07_06",
                 'Have sexual relations with men, including anything from kissing to sex':"q07_07",
                 'Watch sports of any kind':"q07_08",
                 'Work out':"q07_09",
                 'See a therapist':"q07_10",
                 'Feel lonely or isolated':"q07_11",
                 'Which of the following do you worry about on a daily or near daily basis? (Select all that apply.)':"q08",
                 'Which of the following categories best describes your employment status?':"q09",
                 'AMONG EMPLOYED: In which of the following ways would you say it?s an advantage to be a man at your work right now? (Select all that apply).':"q10",
                 'AMONG EMPLOYED: In which of the following ways would you say it?s a disadvantage to be a man at your work right now? (Select all that apply.)':"q11",
                 'AMONG EMPLOYED: Have you seen or heard of a sexual harassment incident at your work? If so, how did you respond? (Select all that apply.)':"q12",
                 'AMONG THOSE WHO DID NOT RESPOND TO HARASSMENT: And which of the following is the main reason you did not respond?':"q13",
                 'AMONG EMPLOYED: How much have you heard about the #MeToo movement?':"q14",
                 "AMONG EMPLOYED WHO'VE HEARD OF METOO: As a man, would you say you think about your behavior at work differently in the wake of #MeToo?":"q15",
                 "Do you typically feel as though you're expected to make the first move in romantic relationships?":"q17",
                 'How often do you try to be the one who pays when on a date?':"q18",
                 'AMONG THOSE WHO TRY TO PAY ON FIRST DATE: Which of the following are reasons why you try to pay on dates? (Select all that apply).':"q19",
                 'When you want to be physically intimate with someone, how do you gauge their interest? (Select all that apply.)':"q20",
                 'Over the past 12 months, when it comes to sexual boundaries, which of the following things have you done? (Select all that apply.)':"q21",
                 'Have you changed your behavior in romantic relationships in the wake of #MeToo movement?':"q22",
                 'Are you now married, widowed, divorced, separated, or have you never been married?':"q24",
                 'Do you have any children? (Select all that apply.)':"q25",
                 'Would you describe your sexual orientation as:':"q26",
                 'Are you:':"q28",
                 'What is the last grade of school you completed?':"q29",
                 'age3':np.nan}
survey_wip['question_num'] = survey_wip['question'].progress_apply(lambda question : question_dict[question])

100%|████████████████████████████████████████████████████████████████████████████████████████████| 195/195 [00:00<00:00, 97576.86it/s]
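The lookup above raises a `KeyError` if a question text is missing from `question_dict`. As a possible alternative, sketched here with a hypothetical toy dictionary, `Series.map` returns `NaN` for unknown keys, which makes it easy to list the questions that still need a number.

```python
import pandas as pd

# Hypothetical lookup table and question texts, for illustration only
lookup = {"Question one?": "q01", "Question two?": "q02"}
questions = pd.Series(["Question one?", "Question two?", "Question three?"])

# map() leaves NaN where the question text is not a key in the dictionary
question_nums = questions.map(lookup)

# Question texts that did not receive a number
unmapped = questions[question_nums.isna()].unique().tolist()

print(question_nums.tolist())
print(unmapped)
```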


## Reordering columns

In [17]:
reordered_cols = ['question', "question_num", 'answer', 'is_multi_select', 'age_all', 'age_18_34', 'age_35_64', 'age_65+',
                  'race_white', 'race_non_white', 'has_children', 'no_children', 'straight', 'gay_bisexual']
survey_wip = survey_wip[reordered_cols]
survey_wip.head()

Unnamed: 0,question,question_num,answer,is_multi_select,age_all,age_18_34,age_35_64,age_65+,race_white,race_non_white,has_children,no_children,straight,gay_bisexual
1,"In general, how masculine or ""manly"" do you feel?",q01,Very masculine,False,0.37,0.29,0.42,0.37,0.34,0.44,0.43,0.3,0.4,0.21
2,"In general, how masculine or ""manly"" do you feel?",q01,Somewhat masculine,False,0.46,0.47,0.46,0.47,0.5,0.39,0.47,0.46,0.47,0.49
3,"In general, how masculine or ""manly"" do you feel?",q01,Not very masculine,False,0.11,0.13,0.09,0.13,0.11,0.11,0.06,0.17,0.1,0.19
4,"In general, how masculine or ""manly"" do you feel?",q01,Not at all masculine,False,0.05,0.1,0.02,0.03,0.04,0.06,0.03,0.06,0.03,0.1
5,"In general, how masculine or ""manly"" do you feel?",q01,No answer,False,0.01,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.0


## Saving the dataframe

In [18]:
survey_wip.to_csv("cleaned/cleaned-survey.csv", index=False)
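Writing with `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column. The following sketch checks this round trip on a hypothetical toy frame, using an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Toy frame with the same kinds of columns as the cleaned data (illustrative values)
toy = pd.DataFrame({
    "question": ["Q1?", "Q1?"],
    "is_multi_select": [False, False],
    "age_all": [0.37, 0.46],
})

# Write to an in-memory buffer without the index, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
print(restored.shape)
```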