# Gender Bias in Street Names

The aim of this project is to analyse gender patterns in street names, using the city of Thiruvananthapuram as the study area.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import plotly.express as px

## Data
This data was scraped from OpenStreetMap using the Overpass Turbo/API.

In [2]:
df=pd.read_csv("streetnames.csv")
df

Unnamed: 0,Street_Name
0,101 Acre Road.
1,1st street
2,2nd street
3,2nd street -2
4,3rd street
...,...
2502,vssc road
2503,way to Anganvai
2504,way to railway
2505,yajani


The CSV file contains the names of all the streets in the city of Thiruvananthapuram.

## Data Handling

Numbers, special characters, stopwords, keywords such as highway, road, apartment, street, mandir, masjid, church, square, marg, gali, etc., and neighborhood names are first removed from the street names. The parsed names longer than 3 characters are then passed to the NamSor API (an API that classifies personal names by gender) to identify the gender.

In [3]:
type(df['Street_Name'][4])

str

In [4]:
# converting all names to lowercase (vectorised, avoids chained assignment)
df['Street_Name']=df['Street_Name'].str.lower()

In [6]:
# user-defined function to replace every occurrence of the given substrings with a space
import string
regular_punct=list(string.punctuation)
def remove_punct(text,punct_list):
    # note: this is a substring replacement, so 'way' is also removed from 'railway'
    for punc in punct_list:
        text=text.replace(punc,' ')
    return text.strip()


In [7]:
keywords=['highway','road','apartment','street','mandir','temple','masjid','church','square','circle','junction','nagar','lane','colony','mosque','way']
numbers=['1','2','0','3','4','5','6','7','8','9']

In [8]:
# strip punctuation, then keywords, then digits from every street name
for items in (regular_punct,keywords,numbers):
    df['Street_Name']=df['Street_Name'].apply(remove_punct,punct_list=items)

In [9]:
# updated dataframe
df

Unnamed: 0,Street_Name
0,acre
1,st
2,nd
3,nd
4,rd
...,...
2502,vssc
2503,to anganvai
2504,to rail
2505,yajani
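
The output above shows that "way to railway" became "to rail": because the keywords are removed as substrings, the keyword "way" is also cut out of "railway". A minimal sketch of a whole-word alternative follows. The names `clean_name` and `keep_for_classification` and the `min_length` parameter are hypothetical, and the sketch assumes the names contain only ASCII letters; it is one possible approach, not the cleaning that produced the results in this notebook.

```python
import re

# Same keyword list as in the notebook above.
KEYWORDS = ['highway', 'road', 'apartment', 'street', 'mandir', 'temple',
            'masjid', 'church', 'square', 'circle', 'junction', 'nagar',
            'lane', 'colony', 'mosque', 'way']

# \b anchors make the pattern match whole words only.
KEYWORD_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, KEYWORDS)) + r')\b')
# Anything that is not a lowercase ASCII letter or whitespace (digits, punctuation).
NON_LETTER_RE = re.compile(r'[^a-z\s]')


def clean_name(name):
    """Lowercase, drop digits/punctuation, then drop whole-word keywords."""
    text = NON_LETTER_RE.sub(' ', name.lower())
    text = KEYWORD_RE.sub(' ', text)
    # Collapse the runs of spaces left behind by the substitutions.
    return ' '.join(text.split())


def keep_for_classification(name, min_length=3):
    """Hypothetical filter for the 'longer than 3 characters' rule."""
    return len(name) > min_length


print(clean_name('way to railway'))
print(keep_for_classification(clean_name('2nd street -2')))
```

In the notebook this could be applied with `df['Street_Name'].apply(clean_name)`, followed by a boolean mask built from `keep_for_classification`.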


In [12]:
df=df.dropna(how='all')

In [14]:
df.to_csv("final.csv")

## Gender Classification

To identify the gender associated with each street name, we pass the cleaned names to the NamSor API and save the result to a CSV file.
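
The notebook does not show how the names were submitted to NamSor. As a hedged illustration of one preparation step, the sketch below groups the cleaned names into batches of records with an `id` and a `name`, mirroring the `id` and `name` columns of the results file. The function `make_batches` and the batch size of 100 are assumptions, and no NamSor request is made here.

```python
def make_batches(names, batch_size=100):
    """Yield lists of {'id', 'name'} records, at most batch_size per list."""
    batch = []
    for position, name in enumerate(names, start=1):
        # Zero-padded ids in the style seen in the results file (id-000001).
        batch.append({'id': f'id-{position:06d}', 'name': name})
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch


batches = list(make_batches(['acre', 'agra', 'yajani'], batch_size=2))
print(len(batches))
```

Keeping a stable `id` per name makes it straightforward to join the classifier output back to the original rows.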

In [11]:
df1=pd.read_csv("genderFullGeoBatch_final.csv")

In [12]:
df1

Unnamed: 0,script,id,name,likelyGender,genderScale,score,probabilityCalibrated
0,LATIN,id-000001,acre,male,-0.285406,4.713575,0.642703
1,LATIN,id-000002,th stone mulamukk,female,0.182001,2.677813,0.591000
2,LATIN,id-000003,th stone sneha,female,0.717793,8.890934,0.858896
3,LATIN,id-000004,agra,male,-0.013239,0.306919,0.506619
4,LATIN,id-000005,akg centre kunnukuzhi,male,-0.696158,8.552085,0.848079
...,...,...,...,...,...,...,...
2430,LATIN,id-002431,vssc,male,-0.239412,3.570787,0.619706
2431,LATIN,id-002432,to anganvai,male,-0.492222,7.240833,0.746111
2432,LATIN,id-002433,to rail,male,-0.464234,7.145236,0.732117
2433,LATIN,id-002434,yajani,female,0.848578,11.030036,0.924289


The two important fields we received are the likely gender of the name (likelyGender) and the calibrated probability that this classification is correct (probabilityCalibrated). We use that probability to decide whether a street name is treated as gendered or not.

In [15]:
# classifying all the names with probability less than 65% as Ungendered
condition=[(df1['probabilityCalibrated']<0.65)]
choice=['Ungendered']

In [16]:
df1['test1']=np.select(condition,choice,default=df1['likelyGender'])

In [17]:
df1

Unnamed: 0,script,id,name,likelyGender,genderScale,score,probabilityCalibrated,test1
0,LATIN,id-000001,acre,male,-0.285406,4.713575,0.642703,Ungendered
1,LATIN,id-000002,th stone mulamukk,female,0.182001,2.677813,0.591000,Ungendered
2,LATIN,id-000003,th stone sneha,female,0.717793,8.890934,0.858896,female
3,LATIN,id-000004,agra,male,-0.013239,0.306919,0.506619,Ungendered
4,LATIN,id-000005,akg centre kunnukuzhi,male,-0.696158,8.552085,0.848079,male
...,...,...,...,...,...,...,...,...
2430,LATIN,id-002431,vssc,male,-0.239412,3.570787,0.619706,Ungendered
2431,LATIN,id-002432,to anganvai,male,-0.492222,7.240833,0.746111,male
2432,LATIN,id-002433,to rail,male,-0.464234,7.145236,0.732117,male
2433,LATIN,id-002434,yajani,female,0.848578,11.030036,0.924289,female
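
The 65% cut-off is a judgment call, so it can help to see how the labels move when it changes. The sketch below applies the same rule to the nine rows visible in the preview above, using `Series.where` instead of `np.select`. The helper `label_with_threshold` and the alternative thresholds 0.55 and 0.75 are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Rows copied from the df1 preview above.
sample = pd.DataFrame({
    'name': ['acre', 'th stone mulamukk', 'th stone sneha', 'agra',
             'akg centre kunnukuzhi', 'vssc', 'to anganvai', 'to rail', 'yajani'],
    'likelyGender': ['male', 'female', 'female', 'male', 'male',
                     'male', 'male', 'male', 'female'],
    'probabilityCalibrated': [0.642703, 0.591000, 0.858896, 0.506619, 0.848079,
                              0.619706, 0.746111, 0.732117, 0.924289],
})


def label_with_threshold(frame, threshold):
    """Keep likelyGender where the probability reaches the threshold."""
    confident = frame['probabilityCalibrated'] >= threshold
    # where() keeps values where the mask is True and substitutes elsewhere.
    return frame['likelyGender'].where(confident, 'Ungendered')


for threshold in (0.55, 0.65, 0.75):
    counts = label_with_threshold(sample, threshold).value_counts()
    print(threshold, counts.to_dict())
```

At 0.65 this reproduces the `test1` labels shown for these rows; running the same loop on the full `df1` would show how sensitive the final counts are to the chosen threshold.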


In [18]:
df1['index']=df1.index

Creating a new dataframe with only the index, the name and the gender label, for convenience.

In [19]:
df2=df1[['index','name','test1']]

In [20]:
df2

Unnamed: 0,index,name,test1
0,0,acre,Ungendered
1,1,th stone mulamukk,Ungendered
2,2,th stone sneha,female
3,3,agra,Ungendered
4,4,akg centre kunnukuzhi,male
...,...,...,...
2430,2430,vssc,Ungendered
2431,2431,to anganvai,male
2432,2432,to rail,male
2433,2433,yajani,female


In [27]:
df2['test1'].value_counts()

Ungendered    1364
male           662
female         409
Name: test1, dtype: int64

## Visualisation

In [24]:
x={'Gender':['Male','Female','Ungendered'],'Count':[662,409,1364]}
df3=pd.DataFrame(data=x,index=None)
df3

Unnamed: 0,Gender,Count
0,Male,662
1,Female,409
2,Ungendered,1364
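
The counts above are typed in by hand from the `value_counts()` output. A minimal sketch of deriving the same kind of table directly from the label column is shown below, so the table stays in sync if the threshold or the data changes. The helper name `gender_counts` is hypothetical, and the short `example` Series is a stand-in for `df2['test1']`.

```python
import pandas as pd


def gender_counts(labels):
    """Return a Gender/Count table from a Series of labels."""
    return (
        labels.value_counts()
        .rename_axis('Gender')      # name the index so it becomes a column
        .reset_index(name='Count')  # move the index into a 'Gender' column
    )


# Tiny stand-in for df2['test1']; in the notebook you would pass that column.
example = pd.Series(['male', 'Ungendered', 'female',
                     'Ungendered', 'male', 'Ungendered'])
print(gender_counts(example))
```

Note that the labels keep the casing of the source column (`male`, `female`), unlike the capitalised labels in the hand-built table.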


In [25]:
import plotly.graph_objects as go


In [34]:
fig1=go.Figure(data=[go.Pie(labels=df3['Gender'],values=df3['Count'],hole=.2)])
fig1.show()

## Results

From the visualisation, about 56% of the street names (1364 streets) are Ungendered, meaning they are either not named after a person or could not be classified with at least 65% probability. Of the rest, 27.2% (662 streets) are classified as male and only 16.8% (409 streets) as female. Among the names that could be classified, male names therefore outnumber female names by roughly 1.6 to 1, which points to a clear gender imbalance in how streets are named.
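
The percentages quoted above can be checked with a few lines of arithmetic. The sketch below recomputes each share from the three counts reported earlier, and additionally the male/female split among gendered names only. The helper `shares` is a hypothetical name introduced for illustration.

```python
# Counts taken from the value_counts() output earlier in the notebook.
counts = {'Ungendered': 1364, 'male': 662, 'female': 409}


def shares(counts):
    """Return each count as a percentage of the total, rounded to 1 decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


overall = shares(counts)
# Restrict to names that received a gender label.
gendered = shares({k: v for k, v in counts.items() if k != 'Ungendered'})

print(overall)   # {'Ungendered': 56.0, 'male': 27.2, 'female': 16.8}
print(gendered)  # {'male': 61.8, 'female': 38.2}
```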