# Classifying Masculinity with KMeans++

In this project, I will be investigating the way people think about masculinity by applying the KMeans algorithm to data from FiveThirtyEight. FiveThirtyEight is a popular website known for its use of statistical analysis in many of its stories.

FiveThirtyEight and WNYC Studios used the survey in `masculinity-survey.pdf` to gather their male readers' thoughts on masculinity. I'm going to look for further insights in the survey responses by clustering them with KMeans.

In [1]:
# Investigating the data
import pandas as pd

survey = pd.read_csv("masculinity.csv")
print(survey.columns)
print(len(survey))
print(survey["q0007_0001"].value_counts())
print(survey.head())

Index(['Unnamed: 0', 'StartDate', 'EndDate', 'q0001', 'q0002', 'q0004_0001',
       'q0004_0002', 'q0004_0003', 'q0004_0004', 'q0004_0005', 'q0004_0006',
       'q0005', 'q0007_0001', 'q0007_0002', 'q0007_0003', 'q0007_0004',
       'q0007_0005', 'q0007_0006', 'q0007_0007', 'q0007_0008', 'q0007_0009',
       'q0007_0010', 'q0007_0011', 'q0008_0001', 'q0008_0002', 'q0008_0003',
       'q0008_0004', 'q0008_0005', 'q0008_0006', 'q0008_0007', 'q0008_0008',
       'q0008_0009', 'q0008_0010', 'q0008_0011', 'q0008_0012', 'q0009',
       'q0010_0001', 'q0010_0002', 'q0010_0003', 'q0010_0004', 'q0010_0005',
       'q0010_0006', 'q0010_0007', 'q0010_0008', 'q0011_0001', 'q0011_0002',
       'q0011_0003', 'q0011_0004', 'q0011_0005', 'q0012_0001', 'q0012_0002',
       'q0012_0003', 'q0012_0004', 'q0012_0005', 'q0012_0006', 'q0012_0007',
       'q0013', 'q0014', 'q0015', 'q0017', 'q0018', 'q0019_0001', 'q0019_0002',
       'q0019_0003', 'q0019_0004', 'q0019_0005', 'q0019_0006', 'q0019_0007',
      

# Mapping the Data

In order to start thinking about using the KMeans algorithm with this data, the survey responses need to be converted into numerical data. Consider question 7. The data can't be clustered using the phrases `"Often"` or `"Rarely"`, but those phrases can be converted into numbers. For example, the data can be mapped in the following way: 
* `"Often"` -> `4`
* `"Sometimes"` ->  `3`
* `"Rarely"` -> `2` 
* `"Never, but open to it"` -> `1`
* `"Never, and not open to it"` -> `0`.

Note that it's important that these responses form an ordered scale. `"Often"` is at one end of the spectrum with `"Never, and not open to it"` at the other, and the other values fall in sequence between the two, so larger numbers always mean "more often".
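
A minimal sketch of this mapping on a made-up column (the `toy` Series and its values are hypothetical, not survey data). It also shows that `Series.map` with a dictionary turns any response that is not a key of the dictionary into `NaN`, which is why rows with missing answers need to be handled before clustering.

```python
import pandas as pd

# Ordinal encoding for the question 7 responses, as described above.
frequency_map = {
    "Often": 4,
    "Sometimes": 3,
    "Rarely": 2,
    "Never, but open to it": 1,
    "Never, and not open to it": 0,
}

# Hypothetical responses; "No answer" is deliberately absent from the mapping.
toy = pd.Series(["Often", "Rarely", "No answer", "Never, but open to it"])

encoded = toy.map(frequency_map)
print(encoded)

# Responses that are not keys of the dictionary become NaN.
print(encoded.isna().sum())
```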

# Full list of questions asked in question 7, located in 'masculinity-survey.pdf'

# How often would you say you do each of the following?

1. Ask a friend for professional advice

2. Ask a friend for personal advice

3. Express physical affection to male friends, like hugging, rubbing shoulders 

4. Cry

5. Get in a physical fight with another person

6. Have sexual relations with women, including anything from kissing to sex 

7. Have sexual relations with men, including anything from kissing to sex 

8. Watch sports of any kind

9. Work out

10. See a therapist

11. Feel lonely or isolated

In [2]:
# Selection of questions to map for KMeans
cols_to_map = ["q0007_0001", "q0007_0002", "q0007_0003", "q0007_0004",
       "q0007_0005", "q0007_0006", "q0007_0007", "q0007_0008", "q0007_0009",
       "q0007_0010", "q0007_0011"]
for col in cols_to_map:
    survey[col] = survey[col].map(
        {"Often": 4, "Sometimes": 3, "Rarely": 2, "Never, but open to it": 1, "Never, and not open to it": 0})

# Plotting the Data

I now have 11 different features that can be used in the KMeans algorithm. Before I jump into clustering, I will plot two of these features against each other on a 2D scatter plot.

In [3]:
from matplotlib import pyplot as plt

plt.scatter(survey["q0007_0001"], survey["q0007_0002"], alpha = 0.1)
plt.xlabel("Ask friend for professional advice")
plt.ylabel("Ask friend for personal advice")
plt.show()

<Figure size 640x480 with 1 Axes>
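
Because each answer takes only five values, many points in a scatter plot land on top of each other. One way to see how many respondents share each pair of answers is a count table. A sketch using `pandas.crosstab` on made-up answers (the `toy` frame is hypothetical and only borrows the column names):

```python
import pandas as pd

# Hypothetical encoded answers for two sub-questions.
toy = pd.DataFrame({
    "q0007_0001": [4, 4, 3, 3, 3, 0],
    "q0007_0002": [4, 4, 3, 2, 3, 0],
})

# Rows are answers to the first question, columns are answers to the second;
# each cell counts the respondents with that pair of answers.
counts = pd.crosstab(toy["q0007_0001"], toy["q0007_0002"])
print(counts)
```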

# Building the KMeans Model

It's now time to start clustering! There are so many interesting questions we could ask about this data. Let's start by seeing if clusters form based on traditionally masculine concepts. 

First I will consider the first four sub-questions in question 7. Those four activities aren't necessarily seen as traditionally masculine. On the other hand, sub-questions 5, 8, and 9 are often seen as very masculine activities. What would happen if 2 clusters were found based on those 7 questions? Would we find clusters that represent traditionally feminine and traditionally masculine people? Let's find out.

In [4]:
from sklearn.cluster import KMeans

# The 7 survey responses that will be focused on
questions_of_interest = ["q0007_0001", "q0007_0002", "q0007_0003", "q0007_0004", "q0007_0005", 
                         "q0007_0008", "q0007_0009"]

# Creating a new dataframe without NaN values 
rows_to_cluster = survey.dropna(subset = questions_of_interest)

# Building and fitting the KMeans model
# Hypothesis: one cluster captures traditionally masculine answers and the other traditionally feminine answers,
# so n_clusters = 2
classifier = KMeans(n_clusters = 2)
classifier.fit(rows_to_cluster[questions_of_interest])
print(classifier.cluster_centers_)

[[1.87798408 1.84350133 0.84615385 1.72413793 0.56763926 2.63660477
  1.97612732]
 [2.84425036 2.81513828 2.84133916 2.39883552 0.69577875 3.0713246
  2.89665211]]


# Separating the Cluster Members

When looking at the two clusters, the first four numbers represent the traditionally feminine activities and the last three represent the traditionally masculine activities. If the data points separated into a feminine cluster and a masculine cluster, one would expect one cluster to have high values for the first four numbers and the other cluster to have high values for the last three numbers.

Instead, the second cluster (index 1 in the output above) has a higher value in every feature. Since a higher number means the person was more likely to "often" do something, the clusters seem to represent "people who do things" and "people who don't do things".
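
This comparison can be checked directly from the printed centers. A small sketch that copies the two rows from the output above by hand and subtracts them feature by feature (the `centers` array is re-entered here for illustration, not read from the fitted model):

```python
import numpy as np

# Cluster centers copied from the printed output above
# (columns: q0007_0001 to q0007_0005, then q0007_0008 and q0007_0009).
centers = np.array([
    [1.87798408, 1.84350133, 0.84615385, 1.72413793, 0.56763926, 2.63660477, 1.97612732],
    [2.84425036, 2.81513828, 2.84133916, 2.39883552, 0.69577875, 3.0713246, 2.89665211],
])

# Per-feature difference: row for cluster 1 minus row for cluster 0.
difference = centers[1] - centers[0]
print(difference.round(2))

# True when one cluster has the higher mean on every feature.
print(bool((difference > 0).all() or (difference < 0).all()))
```

The gap is smallest for getting in a physical fight (`q0007_0005`) and largest for expressing physical affection to male friends (`q0007_0003`).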

I might be able to find out more information about these clusters by looking at the specific members of each cluster.

In [5]:
# Creating lists of cluster indices for the two clusters
cluster_zero_indices = []
cluster_one_indices = []

for i in range(len(classifier.labels_)):
    if classifier.labels_[i] == 0:
        cluster_zero_indices.append(i)
    elif classifier.labels_[i] == 1:
        cluster_one_indices.append(i)
        
print(cluster_zero_indices)

[1, 4, 6, 7, 9, 10, 12, 14, 17, 18, 19, 24, 29, 35, 39, 42, 49, 51, 52, 53, 54, 55, 57, 58, 62, 63, 65, 66, 75, 78, 79, 82, 84, 86, 87, 88, 89, 90, 92, 94, 95, 97, 98, 101, 106, 107, 109, 113, 116, 117, 119, 123, 128, 129, 130, 131, 132, 134, 139, 142, 143, 154, 172, 175, 176, 178, 179, 180, 181, 184, 187, 189, 195, 196, 198, 199, 201, 209, 212, 222, 229, 230, 231, 233, 236, 237, 240, 241, 247, 248, 249, 250, 256, 260, 261, 263, 264, 272, 275, 281, 283, 284, 286, 288, 291, 296, 297, 299, 300, 301, 305, 310, 311, 325, 328, 331, 336, 337, 340, 341, 343, 347, 350, 351, 353, 361, 367, 369, 377, 378, 390, 391, 392, 393, 394, 396, 397, 398, 399, 409, 410, 411, 412, 415, 417, 418, 419, 425, 428, 429, 432, 449, 454, 455, 457, 459, 461, 463, 468, 470, 471, 476, 477, 478, 484, 489, 490, 493, 494, 496, 498, 499, 502, 508, 509, 510, 515, 516, 521, 523, 525, 526, 529, 531, 533, 542, 546, 549, 555, 556, 559, 560, 562, 563, 564, 566, 567, 570, 577, 579, 580, 585, 588, 589, 592, 593, 599, 603, 610, 61

# Investigating the Cluster Members

Now that I have the indices for each cluster, I will try to gain some insight into these two clusters.

In [6]:
# Creating dataframes for the 2 clusters
cluster_zero_df = rows_to_cluster.iloc[cluster_zero_indices]
cluster_one_df = rows_to_cluster.iloc[cluster_one_indices]
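
The same split can also be written with a boolean mask instead of a loop. A sketch on a made-up frame, where `toy_labels` stands in for `classifier.labels_` and `toy_rows` for `rows_to_cluster` (both hypothetical). The labels are in row order rather than keyed by the DataFrame index, so the mask is passed as a plain NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical rows whose index has gaps, as it would after dropna().
toy_rows = pd.DataFrame({"q0007_0001": [4.0, 1.0, 3.0, 0.0]}, index=[0, 2, 5, 9])

# Stand-in for classifier.labels_: one label per row, in row order.
toy_labels = np.array([1, 0, 1, 0])

# A NumPy boolean mask selects rows by position, like iloc with index lists.
toy_cluster_zero = toy_rows[toy_labels == 0]
toy_cluster_one = toy_rows[toy_labels == 1]

print(toy_cluster_zero)
print(toy_cluster_one)
```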

Reminder of traditionally masculine activities:

q0007_0005: Get in a physical fight with another person
 
q0007_0008: Watch sports of any kind

q0007_0009: Work out

* `"Often"` -> `4`
* `"Sometimes"` ->  `3`
* `"Rarely"` -> `2` 
* `"Never, but open to it"` -> `1`
* `"Never, and not open to it"` -> `0`.

In [7]:
masc = ["q0007_0005", "q0007_0008", "q0007_0009"]
for q in masc:
    print("Cluster 0:")
    print(cluster_zero_df[q].value_counts()/len(cluster_zero_df))
    print("Cluster 1:")
    print(cluster_one_df[q].value_counts()/len(cluster_one_df))

Cluster 0:
0.0    0.641910
1.0    0.183024
2.0    0.148541
3.0    0.018568
4.0    0.007958
Name: q0007_0005, dtype: float64
Cluster 1:
0.0    0.570597
2.0    0.211063
1.0    0.193595
3.0    0.018923
4.0    0.005822
Name: q0007_0005, dtype: float64
Cluster 0:
4.0    0.350133
3.0    0.281167
2.0    0.172414
0.0    0.148541
1.0    0.047745
Name: q0007_0008, dtype: float64
Cluster 1:
4.0    0.442504
3.0    0.296943
2.0    0.190684
0.0    0.040757
1.0    0.029112
Name: q0007_0008, dtype: float64
Cluster 0:
2.0    0.320955
3.0    0.225464
0.0    0.177719
1.0    0.148541
4.0    0.127321
Name: q0007_0009, dtype: float64
Cluster 1:
3.0    0.350801
4.0    0.320233
2.0    0.257642
1.0    0.048035
0.0    0.023290
Name: q0007_0009, dtype: float64


Reminder of traditionally feminine activities:

q0007_0001: Ask a friend for professional advice
 
q0007_0002: Ask a friend for personal advice

q0007_0003: Express physical affection to male friends, like hugging, rubbing shoulders

q0007_0004: Cry

* `"Often"` -> `4`
* `"Sometimes"` ->  `3`
* `"Rarely"` -> `2` 
* `"Never, but open to it"` -> `1`
* `"Never, and not open to it"` -> `0`.

In [8]:
fem = ["q0007_0001", "q0007_0002", "q0007_0003", "q0007_0004"]
for q in fem:
    print("Cluster 0:")
    print(cluster_zero_df[q].value_counts()/len(cluster_zero_df))
    print("Cluster 1:")
    print(cluster_one_df[q].value_counts()/len(cluster_one_df))

Cluster 0:
2.0    0.352785
3.0    0.275862
1.0    0.238727
0.0    0.106101
4.0    0.026525
Name: q0007_0001, dtype: float64
Cluster 1:
3.0    0.553130
2.0    0.225619
4.0    0.174672
1.0    0.034934
0.0    0.011645
Name: q0007_0001, dtype: float64
Cluster 0:
2.0    0.432361
3.0    0.241379
1.0    0.190981
0.0    0.119363
4.0    0.015915
Name: q0007_0002, dtype: float64
Cluster 1:
3.0    0.538574
2.0    0.272198
4.0    0.155750
1.0    0.032023
0.0    0.001456
Name: q0007_0002, dtype: float64
Cluster 0:
0.0    0.525199
2.0    0.236074
1.0    0.175066
3.0    0.055703
4.0    0.007958
Name: q0007_0003, dtype: float64
Cluster 1:
3.0    0.438137
2.0    0.314410
4.0    0.219796
1.0    0.018923
0.0    0.008734
Name: q0007_0003, dtype: float64
Cluster 0:
2.0    0.464191
1.0    0.190981
3.0    0.180371
0.0    0.148541
4.0    0.015915
Name: q0007_0004, dtype: float64
Cluster 1:
2.0    0.448326
3.0    0.410480
1.0    0.061135
4.0    0.052402
0.0    0.027656
Name: q0007_0004, dtype: float64


In [9]:
print("Cluster Zero:")
print(cluster_zero_df["educ4"].value_counts()/len(cluster_zero_df))
print("Cluster One:")
print(cluster_one_df["educ4"].value_counts()/len(cluster_one_df))

Cluster Zero:
Some college            0.312997
College or more         0.286472
Post graduate degree    0.251989
High school or less     0.145889
Name: educ4, dtype: float64
Cluster One:
Post graduate degree    0.365357
College or more         0.330422
Some college            0.231441
High school or less     0.072780
Name: educ4, dtype: float64


# Discussion

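My analysis suggests that responses to the 7 questions I focused on do not split people into a "masculine" category and a "feminine" category. Instead, the two clusters separate respondents who report doing most of these activities more often from those who report doing them less often. The clusters also differ in level of education: about 37% of cluster one reports a postgraduate degree, compared with about 25% of cluster zero.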
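
To make the education comparison concrete, the proportions printed by the last cell can be placed side by side. A small sketch that re-enters those printed values by hand (the `education` frame is built for illustration and is not produced by the notebook):

```python
import pandas as pd

# Proportions copied from the educ4 output above.
education = pd.DataFrame(
    {
        "cluster_zero": [0.145889, 0.312997, 0.286472, 0.251989],
        "cluster_one": [0.072780, 0.231441, 0.330422, 0.365357],
    },
    index=["High school or less", "Some college", "College or more", "Post graduate degree"],
)

# Positive values mean the category is more common in cluster one.
education["difference"] = education["cluster_one"] - education["cluster_zero"]
print(education.round(3))
```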

Special thanks to the team at FiveThirtyEight and Codecademy for providing me with access to this data!