# Week 4 - Centrality Measures

## Project Overview

Centrality measures can be used to predict (positive or negative) outcomes for a node.

Your task in this week’s assignment is to identify an interesting set of network data that is available on the web (either through web scraping or web APIs) and that could be used for analyzing and comparing centrality measures across nodes. As an additional constraint, there should be at least one categorical variable available for each node (such as “Male” or “Female”, or “Republican”, “Democrat”, or “Undecided”).

In addition to identifying your data source, you should create a high-level plan that describes how you would load the data for analysis, and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups.

For this week’s assignment, you are not required to actually load or analyze the data.

## Identifying Interesting Network Data

We chose a classical transportation network of __flight data from [OpenFlights.org](https://openflights.org/data.html)__.

__OpenFlights describes itself as:__

_"OpenFlights is a tool that lets you map your flights around the world, search and filter them in all sorts of interesting ways, calculate statistics automatically, and share your flights and trips with friends and the entire world (if you wish)."_

__OpenFlights consists of the following datasets:__

- Airports
- Airlines
- Routes
- Planes
- Schedules

We have chosen to use the __Routes__ and __Airports__ datasets to build our network.

__Variables in the Routes dataset:__

- Airline
- Airline ID
- Source Airport
- Source Airport ID
- Destination Airport
- Destination Airport ID
- Codeshare
- Stops
- Equipment

__Variables in the Airports dataset:__

- Airport ID
- Name
- City
- Country
- IATA
- ICAO
- Latitude
- Longitude
- Altitude
- Timezone
- DST
- Tz database time zone
- Type
- Source

Our nodes will be taken from the __Source Airport__ and __Destination Airport__ variables in the __Routes__ dataset. Each record with a source and a destination airport represents an edge between two nodes, and the __Stops__ variable could be used as an optional edge weight.
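As a rough illustration of this mapping, the sketch below builds a small graph from route records with NetworkX. The rows are a hypothetical sample rather than actual OpenFlights records, and treating routes as directed edges is our assumption; an undirected `nx.Graph` would work the same way.

```python
import networkx as nx

# Hypothetical sample of (source, destination, stops) route records.
# The rows are illustrative only and are not taken from the dataset.
routes = [
    ("JFK", "LHR", 0),
    ("LHR", "JFK", 0),
    ("JFK", "SYD", 1),
    ("SYD", "AKL", 0),
    ("JFK", "LHR", 0),  # same route listed twice, e.g. by a second airline
]

# Assumption: routes are directed (source -> destination).
G = nx.DiGraph()
for source, destination, stops in routes:
    # Nodes are created implicitly; re-adding an existing edge in a
    # DiGraph only updates its attributes, so duplicates collapse.
    G.add_edge(source, destination, stops=stops)

print(G.number_of_nodes(), G.number_of_edges())
```

Because a `DiGraph` keeps at most one edge per ordered pair of nodes, routes flown by several airlines collapse into a single edge here. Keeping them separate would call for an `nx.MultiDiGraph` instead.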

The data are distributed as UTF-8 encoded .dat files.

## Choosing a Categorical Variable

We have decided to create a categorical variable __N_S_Hemisphere__ from the __Latitude__ variable in the __Airports__ dataset. An airport node with a negative latitude will be coded __S for "South"__, and an airport node with a positive latitude will be coded __N for "North"__.
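The coding rule could be sketched as a small function. The helper name `hemisphere` is hypothetical, the sample latitudes are approximate, and coding a latitude of exactly 0 as "N" is an assumption, since the rule above does not cover that case.

```python
def hemisphere(latitude: float) -> str:
    """Return "N" or "S" for a latitude given in decimal degrees."""
    # Assumption: a latitude of exactly 0 is coded "N".
    return "S" if latitude < 0 else "N"


# Approximate latitudes, used only to illustrate the rule.
samples = {"JFK": 40.64, "SYD": -33.95, "SIN": 1.35}
labels = {code: hemisphere(lat) for code, lat in samples.items()}
print(labels)
```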

## Centrality Measures

1. __Degree Centrality__ -- how many direct, ‘one hop’ connections each node has to other nodes within the network [1]

2. __Betweenness Centrality__ -- which nodes act as ‘bridges’ between nodes in a network [1]

3. __Closeness Centrality__ -- calculates the shortest paths between all nodes, then assigns each node a score based on its sum of shortest paths [1]

4. __EigenCentrality__ -- by calculating the extended connections of a node, EigenCentrality can identify nodes with influence over the whole network, not just those directly connected to it [1]

5. __PageRank__ -- uncovers nodes whose influence extends beyond their direct connections into the wider network [1]


[1] https://cambridge-intelligence.com/keylines-faqs-social-network-analysis/
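For reference, NetworkX has a function for each of the five measures listed above. The following is a minimal sketch on a toy undirected graph, not on the flight data. The graph and its node labels are made up for illustration, and `max_iter=1000` is just a generous setting for the eigenvector iteration. Recent NetworkX releases appear to rely on SciPy for `pagerank`, so we assume it is installed.

```python
import networkx as nx

# Toy graph: hub "A" linked to B, C and D, plus a short chain D-E.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

degree = nx.degree_centrality(G)            # share of other nodes directly connected
betweenness = nx.betweenness_centrality(G)  # share of shortest paths passing through a node
closeness = nx.closeness_centrality(G)      # inverse of the average shortest-path distance
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
pagerank = nx.pagerank(G)

# Rank nodes by degree centrality, highest first.
for node, score in sorted(degree.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Each function returns a dictionary keyed by node, which makes it straightforward to compare scores for the same airport across measures.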


## Loading the Data

- OpenFlights.org allows us to download the .dat files (which appear to be formatted as comma-separated values), which we can save to our GitHub account.
- From GitHub, we can read the data with Python into a pandas DataFrame, importing only the variables needed for our analysis, and then join the two datasets to create the __N_S_Hemisphere__ categorical variable.
- We can use Python to write our reformatted and joined data to a .edges file, which can then be imported with NetworkX to create our network.
- Once our data is imported as a network, we can use NetworkX to analyze and visualize the network.
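The steps above might look roughly like the sketch below. Several things here are assumptions rather than confirmed facts about the files: that they have no header row, that the column names (our own, based on the variable lists above) line up with the file columns, and that routes can be matched to airports on the IATA code. The inline text stands in for the downloaded files so the example runs on its own, and the rows in it are illustrative rather than real records.

```python
import io
import os
import tempfile

import networkx as nx
import pandas as pd

# Hypothetical column names; the files are assumed to have no header row.
ROUTE_COLS = ["airline", "airline_id", "source_airport", "source_airport_id",
              "destination_airport", "destination_airport_id",
              "codeshare", "stops", "equipment"]
AIRPORT_COLS = ["airport_id", "name", "city", "country", "iata", "icao",
                "latitude", "longitude", "altitude", "timezone", "dst",
                "tz_database_time_zone", "type", "source"]

# Illustrative stand-ins for the routes and airports files (not real records).
routes_text = """XX,1,JFK,10,LHR,20,,0,777
XX,1,LHR,20,SYD,30,,1,388
YY,2,SYD,30,JFK,10,,1,789
"""
airports_text = """10,"Example Airport 1","City 1","Country 1","JFK","KJFK",40.64,-73.78,13,-5,"A","America/New_York","airport","Example"
20,"Example Airport 2","City 2","Country 2","LHR","EGLL",51.47,-0.46,83,0,"E","Europe/London","airport","Example"
30,"Example Airport 3","City 3","Country 3","SYD","YSSY",-33.95,151.18,21,10,"O","Australia/Sydney","airport","Example"
"""

# Read only the variables needed for the analysis.
routes = pd.read_csv(io.StringIO(routes_text), header=None, names=ROUTE_COLS,
                     usecols=["source_airport", "destination_airport", "stops"])
airports = pd.read_csv(io.StringIO(airports_text), header=None, names=AIRPORT_COLS,
                       usecols=["iata", "latitude"])

# Derive the categorical variable (assumption: latitude 0 counts as "N").
airports["N_S_Hemisphere"] = airports["latitude"].apply(
    lambda lat: "S" if lat < 0 else "N")

# Join routes to airports on the source airport's IATA code.
joined = routes.merge(airports[["iata", "N_S_Hemisphere"]],
                      left_on="source_airport", right_on="iata", how="left")

# Write a space-delimited .edges file, then read it back with NetworkX.
with tempfile.TemporaryDirectory() as tmp:
    edges_path = os.path.join(tmp, "routes.edges")
    routes[["source_airport", "destination_airport", "stops"]].to_csv(
        edges_path, sep=" ", header=False, index=False)
    G = nx.read_edgelist(edges_path, create_using=nx.DiGraph,
                         data=[("stops", int)])

# An edge list carries no node attributes, so attach the hemisphere separately.
nx.set_node_attributes(
    G, airports.set_index("iata")["N_S_Hemisphere"].to_dict(),
    name="N_S_Hemisphere")

print(G.nodes(data=True))
```

With the real files, the `io.StringIO(...)` arguments would be replaced by the file paths or raw GitHub URLs, which `pd.read_csv` also accepts.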

## Hypothetical Outcome

We believe that Northern Hemisphere airports (where the highly developed North American, European, and East Asian countries are located) are more centralized and interconnected and have more direct flights, so a greater number of them will score highly on degree, betweenness, and closeness centrality.
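One way this comparison could be set up is to average degree centrality within each hemisphere group. The sketch below uses a made-up five-airport graph, so its numbers only demonstrate the calculation and say nothing about the actual network. The hemisphere labels are assigned by hand for the example.

```python
from statistics import mean

import networkx as nx

# Toy network with hand-assigned hemisphere labels (illustrative only).
hemisphere = {"JFK": "N", "LHR": "N", "FRA": "N", "SYD": "S", "AKL": "S"}
G = nx.Graph()
G.add_nodes_from((code, {"N_S_Hemisphere": h}) for code, h in hemisphere.items())
G.add_edges_from([("JFK", "LHR"), ("JFK", "FRA"), ("LHR", "FRA"),
                  ("LHR", "SYD"), ("SYD", "AKL")])

centrality = nx.degree_centrality(G)

# Collect the scores for each hemisphere, then average them.
by_group = {}
for node, score in centrality.items():
    by_group.setdefault(G.nodes[node]["N_S_Hemisphere"], []).append(score)
group_means = {group: mean(scores) for group, scores in by_group.items()}

print(group_means)
```

The same grouping would work for betweenness or closeness by swapping in the corresponding NetworkX function.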