# City Lines Data Analysis

> This notebook was created as part of the examination requirements of the "Information Structures and Implications" class offered by the Master of Digital Humanities programme at KU Leuven.

## What's this notebook about?

It is often thought that the complexity of a city's transportation system is linked to that city's level of "development". We want to investigate whether this widely held belief holds true by interrogating the city lines dataset and combining it with other datasets that can inform us about human development. Along the way, we also want to uncover some lesser-known facts about metro systems, such as dominant colors and crowdedness.

The questions that we will ask are as follows: 

1. Is the education level of a country related to its total railway length?
2. Is the subjective well-being of a country related to its total railway length?
3. Is personal mobile phone ownership related to the variety of transportation modes in a country?
4. Are freedom of speech rankings related to the variety of transportation modes in a country?
5. Is there a relationship between a country and the time it takes to finish the construction of a railway station?
6. Are there any “late bloomer” cities, that is, cities that started building their metro system late but quickly built up many lines and stations?
7. What are the most “crowded” (short line, lots of stations) and the most “spacious” (long line, barely any stations) lines?
8. What are some unique hues that nobody uses in coloring their metro lines?
9. Is there a correlation between the age of a line and its color?
10. What is the most popular line color for each country?

## Code

### Setup

#### Import the required packages

In [76]:
from pathlib import Path
import mysql.connector as connector
import pandas as pd


#### Establish connection with the database

In [77]:
credentials = {
    "username": "root",
    "password": ""
}

conn = connector.connect(user=credentials["username"],
                         passwd=credentials["password"],
                         host="localhost",
                         database="city_lines")


### Analysis

#### Question nine

> ***Is there a correlation between the age of a line and its color?***

##### English explanation

##### Code

In [138]:
sql_query = """
SELECT l.id AS line_id, l.name AS line_name,
       l.color AS line_color,
       SUBSTRING(l.color, 2, 2) AS r_value,
       SUBSTRING(l.color, 4, 2) AS g_value,
       SUBSTRING(l.color, 6, 2) AS b_value,
       s.name AS station_name, (2021 - s.opening) AS age
  FROM `lines` l
        JOIN station_lines sl ON (sl.line_id = l.id)
        JOIN stations s ON (s.id = sl.station_id )
 GROUP BY line_id, age
 ORDER BY line_id ASC, age DESC;
"""

result = pd.read_sql(sql_query, conn)
result

# Keep only the oldest station of each line, so that "age" is the age of the line
oldest_mask = result.groupby("line_id")["age"].transform("max") == result["age"]
result = (result.loc[oldest_mask, :]
          .sort_values("age", ascending=False))

# Convert the two-digit hexadecimal channel strings to integers (0-255).
# Shorthand colors such as "#000" do not yield three two-digit channels,
# so those values become NaN instead of raising an error.
channels = ["r_value", "g_value", "b_value"]
result[channels] = result[channels].apply(
    lambda col: col.map(lambda h: int(h, 16) if isinstance(h, str) and len(h) == 2 else float("nan")))
result
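The color column mixes six-digit values with shorthand ones such as `#000`, which fixed-position `SUBSTRING` calls cannot split into three channels. One possible alternative, sketched below, is to parse the whole color string in Python. The helper name `hex_to_rgb` is hypothetical and not part of the original notebook, and the sketch assumes that three-digit values follow the CSS shorthand convention (each digit doubled).

```python
import string


def hex_to_rgb(color):
    """Return (r, g, b) integers for '#rrggbb' or '#rgb', or None if unparseable."""
    if not isinstance(color, str):
        return None
    digits = color.strip().lstrip("#")
    if len(digits) == 3:
        # Assumption: expand CSS-style shorthand, "abc" -> "aabbcc"
        digits = "".join(ch * 2 for ch in digits)
    if len(digits) != 6 or any(ch not in string.hexdigits for ch in digits):
        return None
    # Read the three two-digit channels as base-16 integers
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#4a90e2"))  # (74, 144, 226)
print(hex_to_rgb("#000"))     # (0, 0, 0)
```

Applied to the query result, this could look like `result["line_color"].map(hex_to_rgb)`, leaving `None` wherever a color cannot be parsed.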

##### Interpretation of results

#### Question ten

> ***What is the most popular line color for each country?***

This applies only to countries that have more than one line recorded.

##### English explanation

##### Code

In [114]:
sql_query = """
SELECT c.country, l.color, COUNT(l.color) AS occurence_count
  FROM cities c
       JOIN `lines` l ON (c.id = l.city_id)
 GROUP BY c.country, l.color
 ORDER BY c.country ASC, occurence_count DESC;
"""

result = pd.read_sql(sql_query, conn)

# The result needs further processing
max_mask = result.groupby("country")["occurence_count"].transform("max") == result["occurence_count"]
result = (result.loc[max_mask, :]
          .sort_values("occurence_count", ascending=False))
result

# TODO: normalize the counts, for example as a percentage of each country's lines

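|     | country   | color   | occurence_count |
|-----|-----------|---------|-----------------|
| 479 | Japan     | #5f0101 | 108             |
| 127 | Chile     | #4a90e2 | 29              |
| 9   | Australia | #000    | 22              |
| 262 | France    | #000    | 20              |
| 0   | Argentina | #f3d379 | 12              |
| ... | ...       | ...     | ...             |
| 449 | Hungary   | #71be1c | 1               |
| 450 | Hungary   | #003b83 | 1               |
| 62  | Bolivia   | #000000 | 1               |
| 61  | Bolivia   | #ff8a26 | 1               |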
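Raw counts favor countries with many recorded lines, so a normalized share may be easier to compare. The following is a minimal sketch of one way to do that, assuming a frame with the same columns as the query result. The data is made up for illustration, and `share_pct` is a hypothetical column name. It also drops countries with only one recorded line, in keeping with the restriction stated for this question.

```python
import pandas as pd

# Made-up counts, shaped like the query result (country, color, occurence_count)
counts = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "color": ["#000", "#fff", "#f00", "#000", "#0f0", "#00f"],
    "occurence_count": [6, 3, 1, 1, 1, 1],
})

by_country = counts.groupby("country")["occurence_count"]

# Total number of lines per country, aligned with the original rows
totals = by_country.transform("sum")
counts["share_pct"] = (counts["occurence_count"] / totals * 100).round(1)

# Keep the most frequent color(s) per country (ties are all kept),
# and only for countries with more than one recorded line
top_mask = (by_country.transform("max") == counts["occurence_count"]) & (totals > 1)
top = counts.loc[top_mask].sort_values("share_pct", ascending=False)
print(top)
```

In this toy data, country A keeps one color with a 60% share, country B keeps two tied colors at 50% each, and country C is dropped because it has a single line.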


##### Interpretation of results