# Cleaning and testing `raw-responses.csv`

This notebook contains the data cleaning process for the `raw-responses.csv` file inside the `raw` folder. <br>
`raw-responses.csv` is part of the dataset published by FiveThirtyEight and contains the responses to the Masculinity Survey conducted by SurveyMonkey in partnership with FiveThirtyEight and WNYC Studios in 2018.

Each step of data cleaning comes with test cases verifying the state of the data. <br>
These test cases also serve as **specifications** for each step. Should you need to edit the data cleaning code for machine learning, you can read the test cases as reminders of what each step does and edit the specific section of code with confidence.

# **Summary of the results**

Below are the first 5 rows of the dataset before and after cleaning. For more details, please refer to the full cleaning process.

In [1]:
%%capture
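from tqdm.notebook import tqdm  # `from tqdm import tqdm_notebook` is deprecated
tqdm.pandas()  # registers progress_apply / progress_applymap on pandas objects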

%matplotlib inline
import pandas as pd
import numpy as np

In [2]:
response_raw = pd.read_csv("raw/raw-responses.csv")
response_raw.head()

Unnamed: 0.1,Unnamed: 0,StartDate,EndDate,q0001,q0002,q0004_0001,q0004_0002,q0004_0003,q0004_0004,q0004_0005,...,q0035,q0036,race2,racethn4,educ3,educ4,age3,kids,orientation,weight
0,1,5/10/18 4:01,5/10/18 4:06,Somewhat masculine,Somewhat important,Not selected,Not selected,Not selected,Pop culture,Not selected,...,Middle Atlantic,Windows Desktop / Laptop,Non-white,Hispanic,College or more,College or more,35 - 64,No children,Gay/Bisexual,1.714026
1,2,5/10/18 6:30,5/10/18 6:53,Somewhat masculine,Somewhat important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,...,East North Central,iOS Phone / Tablet,White,White,Some college,Some college,65 and up,Has children,Straight,1.24712
2,3,5/10/18 7:02,5/10/18 7:09,Very masculine,Not too important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,Has children,Straight,0.515746
3,4,5/10/18 7:27,5/10/18 7:31,Very masculine,Not too important,Father or father figure(s),Mother or mother figure(s),Other family members,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,Some college,Some college,65 and up,Has children,No answer,0.60064
4,5,5/10/18 7:35,5/10/18 7:42,Very masculine,Very important,Not selected,Not selected,Other family members,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,No children,Straight,1.0334


In [3]:
response_cleaned = pd.read_csv("cleaned/cleaned-responses.csv")  # the file written at the end of this notebook
response_cleaned.head()

# **Full cleaning processes below**

## Copying dataset

In case we need to compare the raw and cleaned datasets later on, we create a copy of the raw dataset and name it `response_wip`. <br>
`response_wip` is what we are going to work with.

In [4]:
response_wip = response_raw.copy()

## Inspecting dataset

`response_wip` contains the responses to the survey, where each row represents a survey respondent and each column represents a survey question.

In [5]:
response_wip.head()

Unnamed: 0.1,Unnamed: 0,StartDate,EndDate,q0001,q0002,q0004_0001,q0004_0002,q0004_0003,q0004_0004,q0004_0005,...,q0035,q0036,race2,racethn4,educ3,educ4,age3,kids,orientation,weight
0,1,5/10/18 4:01,5/10/18 4:06,Somewhat masculine,Somewhat important,Not selected,Not selected,Not selected,Pop culture,Not selected,...,Middle Atlantic,Windows Desktop / Laptop,Non-white,Hispanic,College or more,College or more,35 - 64,No children,Gay/Bisexual,1.714026
1,2,5/10/18 6:30,5/10/18 6:53,Somewhat masculine,Somewhat important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,...,East North Central,iOS Phone / Tablet,White,White,Some college,Some college,65 and up,Has children,Straight,1.24712
2,3,5/10/18 7:02,5/10/18 7:09,Very masculine,Not too important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,Has children,Straight,0.515746
3,4,5/10/18 7:27,5/10/18 7:31,Very masculine,Not too important,Father or father figure(s),Mother or mother figure(s),Other family members,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,Some college,Some college,65 and up,Has children,No answer,0.60064
4,5,5/10/18 7:35,5/10/18 7:42,Very masculine,Very important,Not selected,Not selected,Other family members,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,No children,Straight,1.0334


In [6]:
response_wip.describe(include='all')

Unnamed: 0.1,Unnamed: 0,StartDate,EndDate,q0001,q0002,q0004_0001,q0004_0002,q0004_0003,q0004_0004,q0004_0005,...,q0035,q0036,race2,racethn4,educ3,educ4,age3,kids,orientation,weight
count,1615.0,1615,1615,1615,1615,1615,1615,1615,1615,1615,...,1595,1613,1615,1615,1615,1615,1615,1606,1615,1615.0
unique,,1378,1377,5,5,2,2,2,2,2,...,9,5,2,4,3,4,3,2,4,
top,,5/17/18 6:48,5/17/18 7:35,Somewhat masculine,Somewhat important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,...,South Atlantic,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,Has children,Straight,
freq,,4,4,826,628,1109,960,1051,1312,1056,...,302,880,1351,1351,997,515,855,1065,1408,
mean,808.0,,,,,,,,,,...,,,,,,,,,,1.0
std,466.354658,,,,,,,,,,...,,,,,,,,,,1.438996
min,1.0,,,,,,,,,,...,,,,,,,,,,0.019744
25%,404.5,,,,,,,,,,...,,,,,,,,,,0.10258
50%,808.0,,,,,,,,,,...,,,,,,,,,,0.596892
75%,1211.5,,,,,,,,,,...,,,,,,,,,,1.010046


First, `response_wip` has ambiguous column names. We have to rename them for readability. <br>
Second, `response_wip` has string datatypes for datetime data and survey answers. We have to convert them into more appropriate datatypes. <br>
Third, `response_wip` has numerous `Not selected` values. We have to convert them into boolean values. <br>
Fourth, `response_wip` has a column named `Unnamed: 0`, which serves no purpose other than as an index, which we already have. We have to drop the column.

## Changing column names

All the new column names will be in snake case.

In [7]:
response_new_cols = response_wip.columns.copy()

response_new_cols = response_new_cols.str.replace(r"(?<=[q_])00", "", regex=True) # Making question column names shorter (regex defaults to False since pandas 2.0)
response_new_cols = response_new_cols.str.replace("StartDate", "start_date")
response_new_cols = response_new_cols.str.replace("EndDate", "end_date")
response_new_cols = response_new_cols.str.replace("kids", "has_children")

response_mapper = dict(zip(response_wip.columns, response_new_cols))
response_wip.rename(columns=response_mapper, inplace=True)
response_wip.columns

Index(['Unnamed: 0', 'start_date', 'end_date', 'q01', 'q02', 'q04_01',
       'q04_02', 'q04_03', 'q04_04', 'q04_05', 'q04_06', 'q05', 'q07_01',
       'q07_02', 'q07_03', 'q07_04', 'q07_05', 'q07_06', 'q07_07', 'q07_08',
       'q07_09', 'q07_10', 'q07_11', 'q08_01', 'q08_02', 'q08_03', 'q08_04',
       'q08_05', 'q08_06', 'q08_07', 'q08_08', 'q08_09', 'q08_10', 'q08_11',
       'q08_12', 'q09', 'q10_01', 'q10_02', 'q10_03', 'q10_04', 'q10_05',
       'q10_06', 'q10_07', 'q10_08', 'q11_01', 'q11_02', 'q11_03', 'q11_04',
       'q11_05', 'q12_01', 'q12_02', 'q12_03', 'q12_04', 'q12_05', 'q12_06',
       'q12_07', 'q13', 'q14', 'q15', 'q17', 'q18', 'q19_01', 'q19_02',
       'q19_03', 'q19_04', 'q19_05', 'q19_06', 'q19_07', 'q20_01', 'q20_02',
       'q20_03', 'q20_04', 'q20_05', 'q20_06', 'q21_01', 'q21_02', 'q21_03',
       'q21_04', 'q22', 'q24', 'q25_01', 'q25_02', 'q25_03', 'q26', 'q28',
       'q29', 'q30', 'q34', 'q35', 'q36', 'race2', 'racethn4', 'educ3',
       'educ4', 'age3',
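
To make the lookbehind pattern in the cell above concrete, here is a minimal sketch on a few column names taken from the raw header. The variable names `sample_cols` and `shortened` are ours and are not part of the notebook.

```python
import pandas as pd

# A few raw column names from the header shown earlier
sample_cols = pd.Index(["q0001", "q0004_0001", "StartDate", "weight"])

# "(?<=[q_])00" matches "00" only when it directly follows "q" or "_",
# so the zero padding is removed while the question numbers are kept
shortened = sample_cols.str.replace(r"(?<=[q_])00", "", regex=True)
print(list(shortened))
```

Names that do not contain the pattern, such as `weight`, pass through unchanged.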

## Truncating the dataset

In [8]:
response_wip.drop(columns="Unnamed: 0", inplace=True) # Dropping the column that indexes rows

## Dropping redundant columns in `response_wip`

Among the columns in `response_wip`, two pairs are redundant: `race2` and `racethn4`, and `educ3` and `educ4`. <br>
Therefore, it is best to drop the less precise column of each pair for readability.

In [9]:
response_wip['race2'].value_counts()

White        1351
Non-white     264
Name: race2, dtype: int64

In [10]:
response_wip['racethn4'].value_counts()

White       1351
Other        121
Black         72
Hispanic      71
Name: racethn4, dtype: int64

It seems clear that `racethn4` is the more precise. <br>
But to absolutely make sure, let's implement a test to check if `race2` and `racethn4` match.

In [11]:
def test_race_match(response):
    """
    Tests row-wise if `race2` and `racethn4` match as below :
    
    race2     : racethn4
    -----------------
    White     : White
    Non-white : Other
    Non-white : Black
    Non-white : Hispanic
    """
    # Explicit mapping, so the test does not depend on the order returned by value_counts()
    matches = {"White": "White", "Other": "Non-white", "Black": "Non-white", "Hispanic": "Non-white"}
    return matches[response['racethn4']] == response['race2']

Now let's run the test on `response_wip` and see for how many rows `race2` and `racethn4` match.

In [12]:
response_wip.progress_apply(test_race_match, axis="columns").value_counts()

True    1615
dtype: int64

`race2` and `racethn4` match in every row! <br>
Now we can drop `race2` column with confidence.

In [13]:
response_wip.drop(columns='race2', inplace=True)
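
The same consistency check can likely be written without a row-wise function. The sketch below assumes the mapping from the docstring above and runs on a small hypothetical sample whose last row is deliberately inconsistent; `race_mapper` and `mismatches` are illustrative names.

```python
import pandas as pd

# Hypothetical sample; the last row is deliberately inconsistent
sample = pd.DataFrame({
    "race2":    ["White", "Non-white", "Non-white", "White"],
    "racethn4": ["White", "Hispanic", "Black", "Other"],
})

race_mapper = {"White": "White", "Other": "Non-white",
               "Black": "Non-white", "Hispanic": "Non-white"}

# Derive the race2 value implied by racethn4, then keep the rows that disagree
expected_race2 = sample["racethn4"].map(race_mapper)
mismatches = sample[expected_race2 != sample["race2"]]
print(mismatches)
```

An empty `mismatches` frame would mean the two columns agree in every row, and a non-empty one points directly at the offending rows.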

Next up are the `educ3` and `educ4` columns. Let's see how they compare.

In [14]:
response_wip['educ3'].value_counts()

College or more        997
Some college           440
High school or less    178
Name: educ3, dtype: int64

In [15]:
response_wip['educ4'].value_counts()

College or more         515
Post graduate degree    482
Some college            440
High school or less     178
Name: educ4, dtype: int64

It seems very likely that `educ3` and `educ4` match and `educ4` is the more detailed of the two. <br>
Let's implement a test for it.

In [16]:
def test_educ_match(response):
    """
    Tests row-wise if `educ3` and `educ4` match as below :
    
    educ3               : educ4
    ------------------------------------------
    College or more     : College or more
    College or more     : Post graduate degree
    Some college        : Some college
    High school or less : High school or less
    """
    # Explicit mapping, so the test does not depend on the order returned by value_counts()
    matches = {"College or more": "College or more", "Post graduate degree": "College or more",
               "Some college": "Some college", "High school or less": "High school or less"}
    return matches[response['educ4']] == response['educ3']

Now let's run the test on `response_wip` and see for how many rows `educ3` and `educ4` match.

In [17]:
response_wip.progress_apply(test_educ_match, axis="columns").value_counts()

True    1615
dtype: int64

`educ3` and `educ4` match in every row. <br>
Now we can drop `educ3` column with confidence.

In [18]:
response_wip.drop(columns="educ3", inplace=True)
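
Another way to look at whether one column is a finer version of another is a cross-tabulation. This is a sketch on hypothetical rows that use the category labels shown above, not the notebook's own test; `educ4_refines_educ3` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample using the category labels shown above
sample = pd.DataFrame({
    "educ3": ["College or more", "College or more", "Some college", "High school or less"],
    "educ4": ["College or more", "Post graduate degree", "Some college", "High school or less"],
})

# Rows are educ4 values, columns are educ3 values, cells are row counts
table = pd.crosstab(sample["educ4"], sample["educ3"])

# educ4 refines educ3 if every educ4 value co-occurs with exactly one educ3 value
educ4_refines_educ3 = bool((table > 0).sum(axis="columns").eq(1).all())
print(table)
print(educ4_refines_educ3)
```

The table also shows at a glance which `educ4` values collapse into the same `educ3` value.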

## Converting datatype of `has_children` column

The `has_children` column currently has two string values: `Has children` and `No children`.

In [19]:
response_wip['has_children'].value_counts()

Has children    1065
No children      541
Name: has_children, dtype: int64

For ease of processing, we need to convert them into boolean values.

In [20]:
# Explicit mapping, so the result does not depend on the order returned by value_counts()
kids_mapper = {"Has children": True, "No children": False}

response_wip['has_children'] = response_wip['has_children'].map(kids_mapper)
response_wip['has_children'].value_counts()

True     1065
False     541
Name: has_children, dtype: int64
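
The counts above add up to 1606 of the 1615 rows, so the column also holds missing answers. Here is a sketch of how those survive the mapping, assuming pandas' nullable `boolean` dtype is acceptable for later steps (the notebook itself does not use it).

```python
import pandas as pd

# Hypothetical sample with one missing answer
kids = pd.Series(["Has children", "No children", None])

kids_mapper = {"Has children": True, "No children": False}

# map() leaves missing answers as NaN; the nullable "boolean" dtype then keeps
# True / False / <NA> in one column instead of falling back to object
has_children = kids.map(kids_mapper).astype("boolean")
print(has_children)
```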

In [21]:
response_wip.head()

Unnamed: 0,start_date,end_date,q01,q02,q04_01,q04_02,q04_03,q04_04,q04_05,q04_06,...,q30,q34,q35,q36,racethn4,educ4,age3,has_children,orientation,weight
0,5/10/18 4:01,5/10/18 4:06,Somewhat masculine,Somewhat important,Not selected,Not selected,Not selected,Pop culture,Not selected,Not selected,...,New York,"$0-$9,999",Middle Atlantic,Windows Desktop / Laptop,Hispanic,College or more,35 - 64,False,Gay/Bisexual,1.714026
1,5/10/18 6:30,5/10/18 6:53,Somewhat masculine,Somewhat important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Not selected,...,Ohio,"$50,000-$74,999",East North Central,iOS Phone / Tablet,White,Some college,65 and up,True,Straight,1.24712
2,5/10/18 7:02,5/10/18 7:09,Very masculine,Not too important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Other (please specify),...,Michigan,"$50,000-$74,999",East North Central,Windows Desktop / Laptop,White,College or more,35 - 64,True,Straight,0.515746
3,5/10/18 7:27,5/10/18 7:31,Very masculine,Not too important,Father or father figure(s),Mother or mother figure(s),Other family members,Not selected,Not selected,Not selected,...,Indiana,"$50,000-$74,999",East North Central,Windows Desktop / Laptop,White,Some college,65 and up,True,No answer,0.60064
4,5/10/18 7:35,5/10/18 7:42,Very masculine,Very important,Not selected,Not selected,Other family members,Not selected,Not selected,Not selected,...,Ohio,"$50,000-$74,999",East North Central,Windows Desktop / Laptop,White,College or more,35 - 64,False,Straight,1.0334


## Converting `start_date` and `end_date` into datetime objects

In [22]:
response_wip['start_date'] = pd.to_datetime(response_wip['start_date'])
response_wip['end_date'] = pd.to_datetime(response_wip['end_date'])
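
Passing an explicit `format` makes the parsing assumption visible. The sketch below uses the first two rows of the raw data and assumes the strings are month/day/two-digit year followed by 24-hour time, which the source does not state.

```python
import pandas as pd

# The first two rows of the raw data shown earlier
sample = pd.DataFrame({
    "start_date": ["5/10/18 4:01", "5/10/18 6:30"],
    "end_date":   ["5/10/18 4:06", "5/10/18 6:53"],
})

# Assumed layout: month/day/two-digit year, then 24-hour time
date_format = "%m/%d/%y %H:%M"
for col in ["start_date", "end_date"]:
    sample[col] = pd.to_datetime(sample[col], format=date_format)

# With real datetimes, the time spent on the survey is a plain subtraction
duration = sample["end_date"] - sample["start_date"]
print(duration)
```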

## Cleaning question columns

Columns whose names start with `q` denote answers to survey questions. <br>

In [23]:
question_filter = response_wip.columns.str.contains(r"q[0-9]+")
questions = response_wip.columns[question_filter]
questions

Index(['q01', 'q02', 'q04_01', 'q04_02', 'q04_03', 'q04_04', 'q04_05',
       'q04_06', 'q05', 'q07_01', 'q07_02', 'q07_03', 'q07_04', 'q07_05',
       'q07_06', 'q07_07', 'q07_08', 'q07_09', 'q07_10', 'q07_11', 'q08_01',
       'q08_02', 'q08_03', 'q08_04', 'q08_05', 'q08_06', 'q08_07', 'q08_08',
       'q08_09', 'q08_10', 'q08_11', 'q08_12', 'q09', 'q10_01', 'q10_02',
       'q10_03', 'q10_04', 'q10_05', 'q10_06', 'q10_07', 'q10_08', 'q11_01',
       'q11_02', 'q11_03', 'q11_04', 'q11_05', 'q12_01', 'q12_02', 'q12_03',
       'q12_04', 'q12_05', 'q12_06', 'q12_07', 'q13', 'q14', 'q15', 'q17',
       'q18', 'q19_01', 'q19_02', 'q19_03', 'q19_04', 'q19_05', 'q19_06',
       'q19_07', 'q20_01', 'q20_02', 'q20_03', 'q20_04', 'q20_05', 'q20_06',
       'q21_01', 'q21_02', 'q21_03', 'q21_04', 'q22', 'q24', 'q25_01',
       'q25_02', 'q25_03', 'q26', 'q28', 'q29', 'q30', 'q34', 'q35', 'q36'],
      dtype='object')

Questions divide into boolean questions and categorical questions. <br>
Boolean questions further divide into yes/no questions and selected/not selected questions.

### Cleaning boolean questions

We identify each kind by its answers: selected/not selected questions contain `Not selected`, and yes/no questions contain `Yes`.

In [24]:
# na=False treats missing answers as "no match" so that .any() only sees booleans
select_questions = pd.Index([col for col in questions if response_wip[col].str.contains(r"^Not selected$", na=False).any()])
yes_no_questions = pd.Index([col for col in questions if response_wip[col].str.contains(r"^Yes$", na=False).any()])

#### Cleaning `select_questions`

Below are select question columns before cleaning.

In [25]:
response_wip[select_questions].head()

Unnamed: 0,q04_01,q04_02,q04_03,q04_04,q04_05,q04_06,q08_01,q08_02,q08_03,q08_04,...,q20_04,q20_05,q20_06,q21_01,q21_02,q21_03,q21_04,q25_01,q25_02,q25_03
0,Not selected,Not selected,Not selected,Pop culture,Not selected,Not selected,Not selected,Not selected,Your hair or hairline,Not selected,...,Every situation is different,It isn?t always clear how to gauge someone?s i...,Not selected,Not selected,Not selected,Not selected,None of the above,Not selected,Not selected,No children
1,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,Your weight,Not selected,Not selected,...,Not selected,Not selected,Other (please specify),Not selected,Not selected,Not selected,None of the above,Not selected,"Yes, one or more children 18 or older",Not selected
2,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Other (please specify),Not selected,Not selected,Not selected,Not selected,...,Every situation is different,Not selected,Not selected,Not selected,Not selected,Not selected,None of the above,Not selected,"Yes, one or more children 18 or older",Not selected
3,Father or father figure(s),Mother or mother figure(s),Other family members,Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,...,Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,"Yes, one or more children 18 or older",Not selected
4,Not selected,Not selected,Other family members,Not selected,Not selected,Not selected,Not selected,Your weight,Not selected,Not selected,...,Not selected,Not selected,Not selected,Not selected,Not selected,Not selected,None of the above,Not selected,Not selected,No children


Below is how we clean select question columns

In [26]:
def map_select(val):
    if val == "Not selected":
        return False
    elif pd.isna(val):
        return np.nan
    else:
        return True

response_wip[select_questions] = response_wip[select_questions].progress_applymap(map_select)

Below are select question columns after cleaning

In [27]:
response_wip[select_questions].head()

Unnamed: 0,q04_01,q04_02,q04_03,q04_04,q04_05,q04_06,q08_01,q08_02,q08_03,q08_04,...,q20_04,q20_05,q20_06,q21_01,q21_02,q21_03,q21_04,q25_01,q25_02,q25_03
0,False,False,False,True,False,False,False,False,True,False,...,True,True,False,False,False,False,True,False,False,True
1,True,False,False,False,False,False,False,True,False,False,...,False,False,True,False,False,False,True,False,True,False
2,True,False,False,False,False,True,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False
3,True,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,True,False,False,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,True
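
The element-wise logic of `map_select` can probably also be expressed with whole-frame operations. This is a sketch on a hypothetical sample, not a replacement the notebook prescribes.

```python
import pandas as pd

# Hypothetical sample with one missing answer
sample = pd.DataFrame({
    "q04_01": ["Father or father figure(s)", "Not selected", None],
    "q04_04": ["Not selected", "Pop culture", "Not selected"],
})

# True wherever the cell is anything other than "Not selected"
selected = sample.ne("Not selected")
# Put NaN back where the original answer was missing
selected = selected.where(sample.notna())
print(selected)
```

Like `map_select`, this keeps missing answers as `NaN` rather than counting them as selected.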


#### Cleaning `yes_no_questions`

Below are yes/no questions before cleaning

In [28]:
response_wip[yes_no_questions].head()

Unnamed: 0,q05,q15,q17,q22
0,Yes,,Yes,No
1,Yes,,No,No
2,No,No,Yes,No
3,No,,Yes,No answer
4,Yes,Yes,No,No


Below is how we clean yes/no questions

In [29]:
yes_no_mapper = {"Yes":True, "No":False, "No answer":np.nan}
response_wip[yes_no_questions] = response_wip[yes_no_questions].progress_apply(lambda col : col.map(yes_no_mapper, na_action="ignore"))

In [30]:
response_wip[yes_no_questions].head()

Unnamed: 0,q05,q15,q17,q22
0,True,,True,False
1,True,,False,False
2,False,False,True,False
3,False,,True,
4,True,True,False,False
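
In the spirit of using tests as specifications, a check like the following could state what the boolean columns should look like after cleaning. `check_only_values` is a hypothetical helper, and the sample deliberately leaves one column uncleaned.

```python
import numpy as np
import pandas as pd

def check_only_values(df, allowed):
    """Return the columns that contain a non-missing value outside `allowed`."""
    return [col for col in df.columns
            if not set(df[col].dropna().unique()) <= set(allowed)]

# Hypothetical columns; q22 is deliberately left uncleaned
sample = pd.DataFrame({
    "q05": [True, False, True],
    "q15": [np.nan, False, True],
    "q22": [False, "No answer", False],
})

print(check_only_values(sample, allowed={True, False}))
```

An empty list would mean every column holds only `True`, `False`, or missing values.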


### Cleaning categorical questions

Below are the categorical question columns

In [31]:
categorical_questions = questions.difference(select_questions).difference(yes_no_questions)
categorical_questions

Index(['q01', 'q02', 'q07_01', 'q07_02', 'q07_03', 'q07_04', 'q07_05',
       'q07_06', 'q07_07', 'q07_08', 'q07_09', 'q07_10', 'q07_11', 'q09',
       'q13', 'q14', 'q18', 'q24', 'q26', 'q28', 'q29', 'q30', 'q34', 'q35',
       'q36'],
      dtype='object')

#### Replacing `No answer`, `Prefer not to answer`, and `Other (please specify)` with null value

In [32]:
non_answers = ["No answer", "Prefer not to answer", "Other (please specify)"]
response_wip[categorical_questions] = response_wip[categorical_questions].replace(non_answers, np.nan)

#### Converting the datatype into ordered `pandas.Categorical`

In [33]:
response_wip[categorical_questions] = response_wip[categorical_questions].astype(pd.CategoricalDtype(ordered=True))

#### Reordering the misordered categories

##### Reordering `q07` categories
Below are the q07 question columns. <br>

In [34]:
q07_questions = response_wip.columns[response_wip.columns.str.contains(r"^q07")]
q07_questions

Index(['q07_01', 'q07_02', 'q07_03', 'q07_04', 'q07_05', 'q07_06', 'q07_07',
       'q07_08', 'q07_09', 'q07_10', 'q07_11'],
      dtype='object')

They all share the same set of categories.

In [35]:
from functools import reduce

def test_cat_shared(df):
    """
    Tests if all columns in the given df have identical sets of categories
    """
    # Use the df argument rather than the global; items() replaces iteritems(), removed in pandas 2.0
    col_cat = pd.Series({col : set(vals.dropna()) for col, vals in df.items()})
    shared  = reduce(lambda a,b : a.intersection(b), col_cat.values)
    results = col_cat.apply(lambda cat : cat == shared)
    if results.all():
        print("Passed ; all the columns have identical sets of categories")
    else:
        print("Failed ; some columns do not have identical sets of categories")

test_cat_shared(response_wip[q07_questions])

Passed ; all the columns have identical sets of categories


Below are the categories of q07 question columns

In [36]:
response_wip['q07_01'].cat.categories

Index(['Never, and not open to it', 'Never, but open to it', 'Often', 'Rarely',
       'Sometimes'],
      dtype='object')

Below we reorder the categories across columns

In [37]:
q07_reordered = ['Never, and not open to it', 'Never, but open to it', 'Rarely', 'Sometimes', 'Often']

for label, content in response_wip[q07_questions].items():
    response_wip[label] = response_wip[label].cat.reorder_categories(q07_reordered)
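
To illustrate why the reordering matters, here is a sketch on a small hypothetical series. When no categories are given, pandas infers them in sorted (lexical) order, so ordered comparisons follow the alphabet until the categories are reordered. `frequency_order` is an illustrative name.

```python
import pandas as pd

answers = pd.Series(["Often", "Rarely", "Sometimes", "Never, but open to it"])

# Without explicit categories, the inferred order is lexical
by_alphabet = answers.astype(pd.CategoricalDtype(ordered=True))
print(by_alphabet.max())

# After reordering, comparisons follow the intended frequency scale
frequency_order = ["Never, but open to it", "Rarely", "Sometimes", "Often"]
by_frequency = by_alphabet.cat.reorder_categories(frequency_order)
print(by_frequency.max())
print((by_frequency >= "Sometimes").tolist())
```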

##### Reordering `q14` categories

In [38]:
q14_reordered = ['Nothing at all', 'Only a little', 'Some', 'A lot']
response_wip['q14'] = response_wip['q14'].cat.reorder_categories(q14_reordered)

##### Reordering `q18` categories

In [39]:
q18_reordered = ['Never', "Rarely", "Sometimes", "Often", "Always"]
response_wip['q18'] = response_wip['q18'].cat.reorder_categories(q18_reordered)

##### Reordering `q29` categories

In [40]:
response_wip['q29'].cat.categories

Index(['Associate's degree', 'College graduate',
       'Did not complete high school', 'High school or G.E.D.',
       'Post graduate degree', 'Some college'],
      dtype='object')

In [41]:
q29_reordered = ['Did not complete high school', 'High school or G.E.D.', "Associate's degree",
                 'Some college', 'College graduate', 'Post graduate degree']
response_wip['q29'] = response_wip['q29'].cat.reorder_categories(q29_reordered)

#### Splitting question columns

Some question columns contain data that can be further divided into two separate columns. <br>

##### Splitting `q09` question column

In [42]:
extracted = response_wip['q09'].str.extract(r"(?P<is_employed>[a-zA-Z ]*[Ee]mployed)[,-][ ]?(?P<employment_status>[a-zA-Z -]*)")
extracted = extracted.apply(lambda col: col.str.lower())
extracted['is_employed'] = extracted['is_employed'].map({"employed":True, "not employed":False}, na_action="ignore")
response_wip = pd.concat([response_wip, extracted], axis="columns")
response_wip.drop(columns="q09", inplace=True)
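
The following sketch shows what the named groups in the pattern above capture. The answer strings are hypothetical, and the real `q09` labels may differ.

```python
import pandas as pd

# Hypothetical answer strings; the real q09 labels may differ
q09 = pd.Series(["Employed, working full-time", "Not employed, retired", None])

pattern = r"(?P<is_employed>[a-zA-Z ]*[Ee]mployed)[,-][ ]?(?P<employment_status>[a-zA-Z -]*)"

# Each named group becomes a column; unmatched or missing rows become NaN
parts = q09.str.extract(pattern).apply(lambda col: col.str.lower())
parts["is_employed"] = parts["is_employed"].map(
    {"employed": True, "not employed": False}, na_action="ignore"
)
print(parts)
```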

##### Splitting `q36` question column

In [43]:
response_wip[["os", "device"]] = response_wip['q36'].str.extract(r"(?P<os>\w*) (?P<device>\w* / \w*)")
response_wip.drop(columns="q36", inplace=True)
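
As a quick check of the pattern above, here is a sketch on two labels that appear in the raw data plus one hypothetical label, `Other`, that does not fit the pattern.

```python
import pandas as pd

# The first two labels appear in the raw data; "Other" is a hypothetical non-matching value
q36 = pd.Series(["Windows Desktop / Laptop", "iOS Phone / Tablet", "Other"])

# The first word becomes `os`, the "word / word" remainder becomes `device`
parts = q36.str.extract(r"(?P<os>\w*) (?P<device>\w* / \w*)")
print(parts)
```

Values that do not match the pattern end up as `NaN` in both new columns, so it is worth counting missing values before and after the split.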

#### Setting some columns as unordered

The categories of some categorical question columns have no inherent order.

##### Setting `q13` as unordered

In [44]:
response_wip['q13'] = response_wip['q13'].cat.as_unordered()

`q13` column also has `?` instead of `'` in categories. <br>
We will also replace them.

In [45]:
response_wip['q13'] = response_wip['q13'].str.replace("?", "'", regex=False)  # literal match; "?" on its own is not a valid regex
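
Note that `.str.replace` returns plain strings, so the column is no longer categorical afterwards. If the categorical dtype should be kept, renaming the categories is one possible alternative. This is a sketch on hypothetical labels.

```python
import pandas as pd

# Hypothetical category labels with the same encoding problem
q13 = pd.Series(["Don?t know", "Yes", "Don?t know"], dtype="category")

# rename_categories rewrites each label once and keeps the categorical dtype
repaired = q13.cat.rename_categories(lambda label: label.replace("?", "'"))
print(repaired.tolist())
print(repaired.dtype)
```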

##### Setting `q28` as unordered

In [46]:
response_wip['q28'] = response_wip['q28'].cat.as_unordered()

##### Setting `q30` as unordered

In [47]:
response_wip['q30'] = response_wip['q30'].cat.as_unordered()

##### Setting `q35` as unordered

In [48]:
response_wip['q35'] = response_wip['q35'].cat.as_unordered()

# Saving the cleaned dataframe

In [49]:
response_wip.to_csv("cleaned/cleaned-responses.csv", index=False)
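
One caveat worth keeping in mind: CSV stores text only, so datetime and categorical dtypes are not restored automatically when the file is read back. The sketch below uses a hypothetical two-column frame and an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Hypothetical frame with one datetime and one ordered categorical column
cleaned = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-05-10 04:01", "2018-05-10 06:30"]),
    "q14": pd.Categorical(["Some", "A lot"],
                          categories=["Nothing at all", "Only a little", "Some", "A lot"],
                          ordered=True),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime column; q14 comes back as plain strings
reloaded = pd.read_csv(buffer, parse_dates=["start_date"])
print(reloaded.dtypes)
```

The category order would have to be reapplied after loading, for example with `astype(pd.CategoricalDtype(categories, ordered=True))`.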