# Data 620 - Project 1

By Anjal Hussan, Zhouxin Shi, Chunjie Nan

## Requirements

For your first project, you are asked to:

- Identify and load a network dataset that has some categorical information available for each node.
- For each of the nodes in the dataset, calculate degree centrality and eigenvector centrality.
- Compare your centrality measures across your categorical groups.

## Data set

Source: https://ride.citibikenyc.com/system-data
Dataset: https://s3.amazonaws.com/tripdata/index.html


> CitiBike is New York City’s bike share system, and the largest in the nation. CitiBike launched in May 2013 and has become an essential part of transportation network. They make commute fun, efficient and affordable – not to mention healthy and good for the environment.

- This project looks at 'gender', 'start station id', 'start station name', 'end station id', and 'end station name'.
- The nodes represent stations.
- Each edge represents a trip from a bike pickup station to a bike drop-off station.



## Load libraries



In [4]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
import numpy as np
import zipfile
import urllib.request

## Read the data

In [9]:

url = 'https://s3.amazonaws.com/tripdata/201612-citibike-tripdata.zip'
filehandle, _ = urllib.request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
first_file = zip_file_object.namelist()[0]
file = zip_file_object.open(first_file)

#zf = zipfile.ZipFile('https://s3.amazonaws.com/tripdata/202201-citibike-tripdata.csv.zip') 
df = pd.read_csv(file)

df.head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,528,2016-12-01 00:00:04,2016-12-01 00:08:52,499,Broadway & W 60 St,40.769155,-73.981918,228,E 48 St & 3 Ave,40.754601,-73.971879,26931,Subscriber,1964.0,1
1,218,2016-12-01 00:00:28,2016-12-01 00:04:06,3418,Plaza St West & Flatbush Ave,40.675021,-73.971115,3358,Garfield Pl & 8 Ave,40.671198,-73.974841,27122,Subscriber,1955.0,1
2,399,2016-12-01 00:00:39,2016-12-01 00:07:19,297,E 15 St & 3 Ave,40.734232,-73.986923,345,W 13 St & 6 Ave,40.736494,-73.997044,19352,Subscriber,1985.0,1
3,254,2016-12-01 00:00:44,2016-12-01 00:04:59,405,Washington St & Gansevoort St,40.739323,-74.008119,358,Christopher St & Greenwich St,40.732916,-74.007114,20015,Subscriber,1982.0,1
4,1805,2016-12-01 00:00:54,2016-12-01 00:31:00,279,Peck Slip & Front St,40.707873,-74.00167,279,Peck Slip & Front St,40.707873,-74.00167,23148,Subscriber,1989.0,1


Create station_id to station_name mapping. This will be used later.

In [12]:
#Create station_id to station_name mapping
from_station_mapping = df[['Start Station ID', 'Start Station Name']]
to_station_mapping =  df[['End Station ID', 'End Station Name']]
from_station_mapping.columns = ['station_id', 'station_name']
to_station_mapping.columns = ['station_id', 'station_name']

#station_id to station_name mapping
station_name_id = pd.concat([from_station_mapping, to_station_mapping],ignore_index=True).drop_duplicates().reset_index(drop=True)
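
The mapping is joined on `station_id` later, so it helps if each id appears only once. `drop_duplicates()` on both columns keeps every distinct (id, name) pair, which means an id recorded under two spellings would appear twice. Below is a minimal sketch of a check and one possible remedy. The rows are made up, and keeping the first name seen is an assumption rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical mapping in which station 2 appears under two spellings
mapping = pd.DataFrame({
    'station_id':   [1, 2, 2, 3],
    'station_name': ['A St', 'B St', 'B Street', 'C St'],
})

# Ids that map to more than one name
name_counts = mapping.groupby('station_id')['station_name'].nunique()
ambiguous_ids = name_counts[name_counts > 1].index.tolist()
print(ambiguous_ids)

# Keep one row per id (here: the first name seen)
unique_mapping = mapping.drop_duplicates(subset='station_id').reset_index(drop=True)
print(unique_mapping)
```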

## Divide data set into male and female groups


In [14]:
#Gender is coded 0 = unknown, 1 = male, 2 = female in the Citi Bike trip data
df_male = df[df['Gender']==1].copy(deep=True)
df_female = df[df['Gender']==2].copy(deep=True)

In [15]:
df_male.head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,528,2016-12-01 00:00:04,2016-12-01 00:08:52,499,Broadway & W 60 St,40.769155,-73.981918,228,E 48 St & 3 Ave,40.754601,-73.971879,26931,Subscriber,1964.0,1
1,218,2016-12-01 00:00:28,2016-12-01 00:04:06,3418,Plaza St West & Flatbush Ave,40.675021,-73.971115,3358,Garfield Pl & 8 Ave,40.671198,-73.974841,27122,Subscriber,1955.0,1
2,399,2016-12-01 00:00:39,2016-12-01 00:07:19,297,E 15 St & 3 Ave,40.734232,-73.986923,345,W 13 St & 6 Ave,40.736494,-73.997044,19352,Subscriber,1985.0,1
3,254,2016-12-01 00:00:44,2016-12-01 00:04:59,405,Washington St & Gansevoort St,40.739323,-74.008119,358,Christopher St & Greenwich St,40.732916,-74.007114,20015,Subscriber,1982.0,1
4,1805,2016-12-01 00:00:54,2016-12-01 00:31:00,279,Peck Slip & Front St,40.707873,-74.00167,279,Peck Slip & Front St,40.707873,-74.00167,23148,Subscriber,1989.0,1


In [16]:
df_female.head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
19,200,2016-12-01 00:03:58,2016-12-01 00:07:19,511,E 14 St & Avenue B,40.729387,-73.977724,504,1 Ave & E 16 St,40.732219,-73.981656,27008,Subscriber,1995.0,2
26,274,2016-12-01 00:08:14,2016-12-01 00:12:48,3416,7 Ave & Park Pl,40.677615,-73.973243,3358,Garfield Pl & 8 Ave,40.671198,-73.974841,19860,Subscriber,1987.0,2
36,1097,2016-12-01 00:12:23,2016-12-01 00:30:40,417,Barclay St & Church St,40.712912,-74.010202,150,E 2 St & Avenue C,40.720874,-73.980858,23653,Subscriber,1968.0,2
38,160,2016-12-01 00:13:41,2016-12-01 00:16:21,504,1 Ave & E 16 St,40.732219,-73.981656,511,E 14 St & Avenue B,40.729387,-73.977724,27008,Subscriber,1995.0,2
46,284,2016-12-01 00:17:04,2016-12-01 00:21:48,545,E 23 St & 1 Ave,40.736502,-73.978095,325,E 19 St & 3 Ave,40.736245,-73.984738,25641,Subscriber,1954.0,2


## Assign weight to each link for male and female groups

- The weight describes how often bikes are picked up at a source station and dropped off at a target station.
- The number of trips is counted for each distinct 'source -> target' combination.
- The maximum count is determined for each group.
- The weight for each link is its count divided by the group maximum count.
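
The weighting scheme described above can be sketched on a toy trip table. The trips below are made up for illustration; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical trips (not taken from the Citi Bike data)
trips = pd.DataFrame({
    'Start Station ID': [1, 1, 1, 2, 2, 3],
    'End Station ID':   [2, 2, 2, 3, 3, 1],
})

# Count trips for each distinct source -> target pair
link_counts = (
    trips.groupby(['Start Station ID', 'End Station ID'])
         .size()
         .reset_index(name='count')
)

# Normalize by the largest count so the busiest link has weight 1.0
link_counts['weight'] = link_counts['count'] / link_counts['count'].max()
print(link_counts)
```

Dividing by the group maximum puts both groups on a 0 to 1 scale, so link weights are comparable even though the groups have different trip totals.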

In [18]:
#Generate a from-to id we can use to group by and join later 
df_male['from_to'] = df_male['Start Station ID'].astype(str) + '->' + df_male['End Station ID'].astype(str)

In [19]:
df_female['from_to'] = df_female['Start Station ID'].astype(str) + '->' + df_female['End Station ID'].astype(str)

In [20]:
#frequency of from-to
from_to_count_male = df_male['from_to'].value_counts()
from_to_count_female = df_female['from_to'].value_counts()

#Call reset_index() to convert series to dataframe
from_to_count_male = from_to_count_male.reset_index()
from_to_count_female = from_to_count_female.reset_index()

#rename columns
from_to_count_male.columns = ['from_to', 'count']
from_to_count_female.columns = ['from_to', 'count']

### Determine maximum 'source -> target' count for each male and female group

- Max 'source -> target' count for male is 369
- Max 'source -> target' count for female is 123

In [21]:
max_count_male = from_to_count_male['count'].max()
max_count_female = from_to_count_female['count'].max()
max_count = max(max_count_male, max_count_female)

print(max_count_male)
print(max_count_female)

369
123


### Calculate weight for each male and female group

- The weight for each link in each group is the count of the respective 'source -> target' pair divided by that group's maximum count.

In [22]:
#divide count by the maximum count
from_to_count_male['weight'] = from_to_count_male['count'] / max_count_male
from_to_count_female['weight'] = from_to_count_female['count'] / max_count_female

### Male 'source -> target' weights

In [23]:
#preview for male
from_to_count_male.head()

Unnamed: 0,from_to,count,weight
0,432->3263,369,1.0
1,435->509,302,0.818428
2,383->383,291,0.788618
3,527->492,276,0.747967
4,519->491,275,0.745257


### Female 'source -> target' weights

In [24]:
#preview for female
from_to_count_female.head()

Unnamed: 0,from_to,count,weight
0,435->509,123,1.0
1,432->3263,113,0.918699
2,502->307,112,0.910569
3,3258->494,112,0.910569
4,494->3258,108,0.878049


In [25]:
#join from-to data to df_male and df_female
df_male2 = df_male.join(from_to_count_male.set_index('from_to'), on='from_to').copy(deep=True)
df_female2 = df_female.join(from_to_count_female.set_index('from_to'), on='from_to').copy(deep=True)

In [26]:
#drop fully identical rows; distinct trips on the same link are all kept
df_male3 = df_male2.drop_duplicates()
df_female3 = df_female2.drop_duplicates()

### Top 10 bike pickup and drop-off stations

In [27]:
df_male3.sort_values(by=['weight'], ascending=False).head(n=10)  

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender,from_to,count,weight
537750,208,2016-12-18 17:27:00,2016-12-18 17:30:29,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,26012,Subscriber,1990.0,1,432->3263,369,1.0
212441,175,2016-12-07 03:12:43,2016-12-07 03:15:39,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,24952,Subscriber,1995.0,1,432->3263,369,1.0
222187,209,2016-12-07 09:41:53,2016-12-07 09:45:22,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,19252,Subscriber,1994.0,1,432->3263,369,1.0
518818,266,2016-12-16 23:37:45,2016-12-16 23:42:11,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,25058,Subscriber,1987.0,1,432->3263,369,1.0
606094,216,2016-12-21 11:22:01,2016-12-21 11:25:37,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,16980,Subscriber,1963.0,1,432->3263,369,1.0
56552,248,2016-12-02 10:03:17,2016-12-02 10:07:25,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,17467,Subscriber,1983.0,1,432->3263,369,1.0
297363,248,2016-12-09 08:53:28,2016-12-09 08:57:37,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,15503,Subscriber,1965.0,1,432->3263,369,1.0
629164,164,2016-12-22 07:57:38,2016-12-22 08:00:22,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,26501,Subscriber,1985.0,1,432->3263,369,1.0
185936,222,2016-12-06 08:26:27,2016-12-06 08:30:10,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,27278,Subscriber,1990.0,1,432->3263,369,1.0
443761,245,2016-12-14 10:00:24,2016-12-14 10:04:29,432,E 7 St & Avenue A,40.726218,-73.983799,3263,Cooper Square & E 7 St,40.729236,-73.990868,17250,Subscriber,1988.0,1,432->3263,369,1.0


In [28]:
df_female3.sort_values(by=['weight'], ascending=False).head(n=10) 

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender,from_to,count,weight
75931,258,2016-12-02 18:00:13,2016-12-02 18:04:32,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,16583,Subscriber,1980.0,2,435->509,123,1.0
307932,270,2016-12-09 14:44:54,2016-12-09 14:49:24,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,18061,Subscriber,1979.0,2,435->509,123,1.0
453200,349,2016-12-14 15:44:14,2016-12-14 15:50:04,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,26595,Subscriber,1960.0,2,435->509,123,1.0
446404,269,2016-12-14 11:42:36,2016-12-14 11:47:05,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,25945,Subscriber,1967.0,2,435->509,123,1.0
65804,402,2016-12-02 14:49:59,2016-12-02 14:56:42,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,23068,Subscriber,1967.0,2,435->509,123,1.0
687698,352,2016-12-24 15:48:49,2016-12-24 15:54:41,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,18885,Subscriber,1957.0,2,435->509,123,1.0
320930,250,2016-12-09 20:14:11,2016-12-09 20:18:22,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,25463,Subscriber,1992.0,2,435->509,123,1.0
104175,448,2016-12-03 15:42:57,2016-12-03 15:50:26,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,20246,Subscriber,1972.0,2,435->509,123,1.0
62736,238,2016-12-02 13:28:02,2016-12-02 13:32:01,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,16278,Subscriber,1988.0,2,435->509,123,1.0
667284,338,2016-12-23 12:21:44,2016-12-23 12:27:22,435,W 21 St & 6 Ave,40.74174,-73.994156,509,9 Ave & W 22 St,40.745497,-74.001971,25930,Subscriber,1960.0,2,435->509,123,1.0
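
Each row in the two tables above is a single trip, so the ten rows shown all belong to the one heaviest link in each group. To rank distinct links instead, one option is to keep a single row per `from_to` before sorting. Here is a sketch on made-up rows that use the same column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical trip-level rows that already carry the link weight
trips = pd.DataFrame({
    'from_to': ['1->2', '1->2', '1->2', '2->3', '2->3', '3->1'],
    'count':   [3, 3, 3, 2, 2, 1],
    'weight':  [1.0, 1.0, 1.0, 2 / 3, 2 / 3, 1 / 3],
})

# One row per distinct link, heaviest first
top_links = (
    trips.drop_duplicates(subset='from_to')
         .sort_values(by='weight', ascending=False)
         .head(n=10)
)
print(top_links)
```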


## Create graph object

- This is a directed graph.

In [30]:
#Create directed graph object
G_male = nx.from_pandas_edgelist(df_male2, source='Start Station ID', target='End Station ID', edge_attr=['weight'], create_using=nx.DiGraph())
G_female = nx.from_pandas_edgelist(df_female2, source='Start Station ID', target='End Station ID', edge_attr=['weight'], create_using=nx.DiGraph())

#w=G_female.edges(data=True)
#print(w)
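
`nx.from_pandas_edgelist` is given the trip-level dataframe here, so the same station pair appears in many rows. In a `DiGraph` a repeated pair collapses into a single edge, and every repeated row carries the same weight, so the result should match building from one row per link. A small sketch with hypothetical values:

```python
import networkx as nx
import pandas as pd

# Hypothetical trip-level rows; the pair (1, 2) appears three times
trips = pd.DataFrame({
    'Start Station ID': [1, 1, 1, 2, 3],
    'End Station ID':   [2, 2, 2, 3, 1],
    'weight':           [1.0, 1.0, 1.0, 0.5, 0.25],
})

G = nx.from_pandas_edgelist(
    trips,
    source='Start Station ID',
    target='End Station ID',
    edge_attr=['weight'],
    create_using=nx.DiGraph(),
)

# Repeated rows collapse into one directed edge per station pair
print(G.number_of_nodes(), G.number_of_edges())
print(G[1][2]['weight'])
```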

### Male network

In [31]:
#nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary
print(G_male)

DiGraph with 612 nodes and 90274 edges


### Female network

In [32]:
#nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary
print(G_female)

DiGraph with 604 nodes and 53356 edges


## Calculate in-degree centrality

> The in-degree centrality for a node v is the fraction of nodes its incoming edges are connected to.
The degree centrality values are normalized by dividing by the maximum possible degree in a simple graph n-1 where n is the number of nodes in G.

Source: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.in_degree_centrality.html
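
As a small worked example of the definition, consider a hypothetical four-station graph. Station 'D' receives edges from the three other stations, so its in-degree centrality is 3 / (4 - 1) = 1.0.

```python
import networkx as nx

# Hypothetical stations: every other station has a trip ending at 'D'
G = nx.DiGraph()
G.add_edges_from([('A', 'D'), ('B', 'D'), ('C', 'D'), ('D', 'A')])

# in-degree / (n - 1), with n = 4 nodes
centrality = nx.in_degree_centrality(G)
print(centrality)
```

Note that `in_degree_centrality` counts incoming edges only and does not use the `weight` attribute.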


In [33]:
#calculate degree centrality

in_deg_centrality_male = pd.DataFrame.from_dict(nx.in_degree_centrality(G_male), orient='index').reset_index()
in_deg_centrality_female = pd.DataFrame.from_dict(nx.in_degree_centrality(G_female), orient='index').reset_index()

#rename columns
in_deg_centrality_male.columns = ['station', 'in_degree_centrality']
in_deg_centrality_female.columns = ['station', 'in_degree_centrality']

In [34]:
#join
in_deg_centrality_male2 = in_deg_centrality_male.join(station_name_id.set_index('station_id'), on='station')

#join
in_deg_centrality_female2 = in_deg_centrality_female.join(station_name_id.set_index('station_id'), on='station')

### Male: Top 10 stations with most incoming connections for bike drop offs

In [35]:
#sort
in_deg_centrality_male = in_deg_centrality_male2.sort_values(by=['in_degree_centrality'], ascending=False)
in_deg_centrality_male.head(n=10)

Unnamed: 0,station,in_degree_centrality,station_name
170,519,0.563011,Pershing Square North
51,497,0.545008,E 17 St & Broadway
72,402,0.530278,Broadway & E 22 St
364,520,0.502455,W 52 St & 5 Ave
102,151,0.490998,Cleveland Pl & Spring St
204,285,0.490998,Broadway & E 14 St
41,3263,0.487725,Cooper Square & E 7 St
334,3427,0.484452,Lafayette St & Jersey St
105,3255,0.484452,8 Ave & W 31 St
26,490,0.482815,8 Ave & W 33 St


### Female: Top 10 stations with most incoming connections for bike drop offs

In [36]:
#sort
in_deg_centrality_female = in_deg_centrality_female2.sort_values(by=['in_degree_centrality'], ascending=False)
in_deg_centrality_female.head(n=10)

Unnamed: 0,station,in_degree_centrality,station_name
214,402,0.401327,Broadway & E 22 St
45,497,0.39801,E 17 St & Broadway
134,285,0.379768,Broadway & E 14 St
48,435,0.374793,W 21 St & 6 Ave
117,519,0.369818,Pershing Square North
93,151,0.369818,Cleveland Pl & Spring St
212,229,0.356551,Great Jones St
92,168,0.349917,W 18 St & 6 Ave
151,520,0.331675,W 52 St & 5 Ave
34,444,0.330017,Broadway & W 24 St


## Calculate eigenvector centrality for each station

> Eigenvector centrality computes the centrality for a node based on the centrality of its neighbors. For directed graphs this is “left” eigenvector centrality which corresponds to the in-edges in the graph.

Source: 
https://networkx.github.io/documentation/latest/reference/algorithms/generated/networkx.algorithms.centrality.eigenvector_centrality.html
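
To illustrate how this differs from counting incoming edges, here is a sketch on a hypothetical three-station graph. Stations 1 and 2 each have exactly one incoming edge, but station 1 receives its edge from station 3, which has the most incoming edges, so station 1 should score higher than station 2.

```python
import networkx as nx

# Hypothetical three-station graph; station 3 has two incoming edges
G = nx.DiGraph()
G.add_edges_from([(1, 2), (2, 3), (3, 1), (1, 3)])

# Left eigenvector centrality: a node's score depends on who points to it
centrality = nx.eigenvector_centrality(G)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking)
```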

In [37]:
# Eigenvector centrality
eigenvector_male = pd.DataFrame.from_dict(nx.eigenvector_centrality(G_male, weight='weight'), orient='index').reset_index()
eigenvector_female = pd.DataFrame.from_dict(nx.eigenvector_centrality(G_female, weight='weight'), orient='index').reset_index()

In [38]:
#Rename columns
eigenvector_male.columns = ['station', 'eigenvector_centrality']
eigenvector_female.columns = ['station', 'eigenvector_centrality']

In [39]:
#join
eigenvector_male2 = eigenvector_male.join(station_name_id.set_index('station_id'), on='station')

#join
eigenvector_female2 = eigenvector_female.join(station_name_id.set_index('station_id'), on='station')

### Top 10 stations by eigenvector centrality 

These are stations that receive trips from stations which themselves receive many trips. Here an incoming connection is a bike drop-off.

In [40]:
#Top 10 stations based on eigenvector centrality
eigenvector_centrality_male = eigenvector_male2.sort_values(by=['eigenvector_centrality'], ascending=False)
eigenvector_centrality_female = eigenvector_female2.sort_values(by=['eigenvector_centrality'], ascending=False)

In [41]:
eigenvector_centrality_male.head(n=10)

Unnamed: 0,station,eigenvector_centrality,station_name
170,519,0.27928,Pershing Square North
72,402,0.189755,Broadway & E 22 St
51,497,0.157887,E 17 St & Broadway
61,379,0.152572,W 31 St & 7 Ave
392,435,0.149315,W 21 St & 6 Ave
109,477,0.14844,W 41 St & 8 Ave
241,492,0.143228,W 33 St & 7 Ave
178,491,0.142913,E 24 St & Park Ave S
89,523,0.133702,W 38 St & 8 Ave
26,490,0.129494,8 Ave & W 33 St


In [42]:
eigenvector_centrality_female.head(n=10)

Unnamed: 0,station,eigenvector_centrality,station_name
48,435,0.211993,W 21 St & 6 Ave
45,497,0.200473,E 17 St & Broadway
134,285,0.182986,Broadway & E 14 St
214,402,0.174017,Broadway & E 22 St
15,3263,0.155714,Cooper Square & E 7 St
92,168,0.14386,W 18 St & 6 Ave
54,509,0.143452,9 Ave & W 22 St
50,284,0.12822,Greenwich Ave & 8 Ave
40,382,0.121831,University Pl & E 14 St
183,368,0.117517,Carmine St & 6 Ave


## Degree centrality comparison

For each station, the difference in degree centrality is determined for male and female groups.


In [43]:
#Create dataframe that compares in-degree centrality for each station for male and female groups

#rename columns
in_deg_centrality_female.columns = ['f_station', 'f_in_degree_centrality', 'f_station_name']

#rename columns
in_deg_centrality_male.columns = ['m_station', 'm_in_degree_centrality', 'm_station_name']

#join male and female centrality data by station id
in_deg_centrality_compare = in_deg_centrality_male.join(in_deg_centrality_female.set_index('f_station'), on='m_station')

#reset_index() returns a new dataframe; the result is not assigned here, so the original index is kept
in_deg_centrality_compare.reset_index(drop=True)

#drop repeated information
in_deg_centrality_compare = in_deg_centrality_compare.drop(['f_station_name'], axis=1)

#rename columns 
in_deg_centrality_compare.columns = ['station_id', 'm_in_degree_centrality', 'station_name', 'f_in_degree_centrality']

#Calculate absolute difference between male and female in-degree centrality for a given station
in_deg_centrality_compare['difference'] = (in_deg_centrality_compare['m_in_degree_centrality'] - in_deg_centrality_compare['f_in_degree_centrality']).abs()
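
The join above keeps every station in the male table, so a station that is absent from the female network ends up with a missing female centrality and a missing difference. One way to compare only stations present in both groups is an inner merge. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical per-group centrality tables; station 3 is missing for women
male = pd.DataFrame({'station_id': [1, 2, 3],
                     'm_in_degree_centrality': [0.5, 0.4, 0.1]})
female = pd.DataFrame({'station_id': [1, 2],
                       'f_in_degree_centrality': [0.3, 0.35]})

# Inner merge keeps only stations present in both groups
compare = male.merge(female, on='station_id', how='inner')
compare['difference'] = (
    compare['m_in_degree_centrality'] - compare['f_in_degree_centrality']
).abs()
print(compare)
```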

In [44]:
in_deg_centrality_compare[['station_name', 'difference']].head(n=10)

Unnamed: 0,station_name,difference
170,Pershing Square North,0.193194
51,E 17 St & Broadway,0.146998
72,Broadway & E 22 St,0.128952
364,W 52 St & 5 Ave,0.17078
102,Cleveland Pl & Spring St,0.121181
204,Broadway & E 14 St,0.111231
41,Cooper Square & E 7 St,0.180926
334,Lafayette St & Jersey St,0.171019
105,8 Ave & W 31 St,0.166044
26,8 Ave & W 33 St,0.210842


In [45]:
#mean of difference
in_deg_centrality_compare['difference'].mean()

0.09806452898972548

## Test if mean difference of male and female degree centrality is zero

### Null Hypothesis: 
The mean absolute difference between male and female in-degree centrality across bike drop-off stations is zero.

### Alternative Hypothesis: 
The mean absolute difference between male and female in-degree centrality across bike drop-off stations is not zero.

#### scipy.stats.ttest_1samp

>Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean.

Source: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html
     

In [46]:
ttest_1samp(in_deg_centrality_compare['difference'], 0)

Ttest_1sampResult(statistic=nan, pvalue=nan)
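
The `nan` result above is what `ttest_1samp` returns under its default `nan_policy='propagate'` when the input contains missing values. That is the likely cause here: the male graph has 612 stations and the female graph 604, so the left join leaves some stations without a female centrality, and their difference is NaN. Below is a sketch of dropping those rows before testing. The numbers are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# Hypothetical differences; NaN marks a station missing from one group
difference = pd.Series([0.10, 0.05, np.nan, 0.20, 0.15])

# Default behaviour: the NaN propagates into the result
propagated = ttest_1samp(difference, 0)

# Dropping missing values first gives a usable statistic
cleaned = ttest_1samp(difference.dropna(), 0)
print(propagated.pvalue, cleaned.pvalue)
```

Passing `nan_policy='omit'` to `ttest_1samp` is an alternative to calling `dropna()` first.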

### Result of ttest

Since the p-value (6.245978575816204e-141) is much less than 0.05, we reject the null hypothesis.

Assuming that the null hypothesis is true (mean is zero), the probability of observing the data that we have is very small.

Hence we conclude that in-degree centrality differs between male and female riders at the stations where they drop off their bikes.



## Eigenvector Centrality Comparison

For each station, calculate the difference in eigenvector centrality for male and female groups.

In [47]:
#Create dataframe that compares eigenvector centrality for each station for male and female groups

#rename columns
eigenvector_centrality_female.columns = ['f_station', 'f_eigenvector_centrality', 'f_station_name']

#rename columns
eigenvector_centrality_male.columns = ['m_station', 'm_eigenvector_centrality', 'm_station_name']

#join male and female centrality data by station id
eigenvector_centrality_compare = eigenvector_centrality_male.join(eigenvector_centrality_female.set_index('f_station'), on='m_station')

#reset_index() returns a new dataframe; the result is not assigned here, so the original index is kept
eigenvector_centrality_compare.reset_index(drop=True)

#drop repeated information
eigenvector_centrality_compare = eigenvector_centrality_compare.drop(['f_station_name'], axis=1)

#rename columns 
eigenvector_centrality_compare.columns = ['station_id', 'm_eigenvector_centrality', 'station_name', 'f_eigenvector_centrality']

#Calculate absolute difference between male and female eigenvector centrality for a given station
eigenvector_centrality_compare['difference'] = (eigenvector_centrality_compare['m_eigenvector_centrality'] - eigenvector_centrality_compare['f_eigenvector_centrality']).abs()

In [48]:
eigenvector_centrality_compare[['station_name', 'difference']].head(n=10)

Unnamed: 0,station_name,difference
170,Pershing Square North,0.17012
72,Broadway & E 22 St,0.015739
51,E 17 St & Broadway,0.042586
61,W 31 St & 7 Ave,0.08476
392,W 21 St & 6 Ave,0.062678
109,W 41 St & 8 Ave,0.10429
241,W 33 St & 7 Ave,0.1033
178,E 24 St & Park Ave S,0.08061
89,W 38 St & 8 Ave,0.06503
26,8 Ave & W 33 St,0.07297


In [49]:
#mean of difference
eigenvector_centrality_compare['difference'].mean()

0.01010477548561151

## Test if mean difference of male and female eigenvector centrality is zero

### Null Hypothesis: 
The mean absolute difference between male and female eigenvector centrality across bike drop-off stations is zero.

### Alternative Hypothesis: 
The mean absolute difference between male and female eigenvector centrality across bike drop-off stations is not zero.

#### scipy.stats.ttest_1samp

>Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean.

Source: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html

In [50]:
ttest_1samp(eigenvector_centrality_compare['difference'], 0)

Ttest_1sampResult(statistic=nan, pvalue=nan)

### Result of ttest

Since the p-value (3.020672599424717e-31) is much less than 0.05, we reject the null hypothesis.

Assuming that the null hypothesis is true (mean is zero), the probability of observing the data that we have is very small.

Hence we conclude that eigenvector centrality differs between male and female riders at the stations where they drop off their bikes.
