# City Lines Data Analysis

> This notebook was created as part of the examination requirements of "Information Structures and Implications" class offered by the Master of Digital Humanities programme in KU Leuven.

## What's this notebook about?

Often times it is thought that the complexity level of a city's transportation systems is linked to that city's level of "development". We want to investigate whether this widely-held belief holds true by interrogating the city lines dataset and combining it with other datasets which can inform us about human development. While doing this, we also want to uncover some less-known facts about metro systems such as dominant colors and crowdedness.

The questions that we will ask are as follows: 

1. Is the education level of a country related to its total railway length?
2. Is the subjective well-being of a country related to its total railway length?
3. Is personal mobile phone ownership related to the variety of transportation modes in a country?
4. Are freedom of speech rankings related to the variety of transportation modes in a country?
5. Is there a relationship in between country and the time it takes to finish the construction of a railway station?
6. Are there any “late bloomer” cities? Cities that started building up their metro system late but have quickly built up many lines and stations.
7. What are the most “crowded” (short line, lots of stations) and the most “spacious” (long line, barely no stations) lines?
8. What are some unique hues that nobody uses in coloring their metro lines?
9. Is there a correlation between the age of a line and its color?
10. What is the most popular line color for each city?

## Code

### Setup

#### Import the required packages

In [1]:
from pathlib import Path
import mysql.connector as connector
import pandas as pd


#### Establish connection with the database

In [2]:
credentials = {
    "username": "root",
    "password": ""
}

conn = connector.connect(user=credentials["username"],
                         passwd=credentials["password"],
                         host="localhost",
                         database="city_lines")


### Analysis

#### Question four

> ***Is there a correlation between the age of a line and its color?***

##### English explanation

###### SQL explanation

Join the 'lines' table with the 'stations' table through the 'station_lines' table. Group first by line_id, then by line age. Select line id's and line colors. Calculate the line age for each line and station combination by substracting from the current date the station opening date. Also convert the hexcode color value into separate red, green and blue values in integer format. Order by line_id in ascending, and then by line age descending.

###### Python explanation




##### Code

In [3]:
sql_query = """
SELECT l.id AS line_id,
       l.color AS line_color,
       CONV(SUBSTRING(l.color, 2, 2), 16, 10) AS r_value,
       CONV(SUBSTRING(l.color, 4, 2), 16, 10) AS g_value,
       CONV(SUBSTRING(l.color, 4, 2), 16, 10) AS b_value,
      (2021 - s.opening) AS age
  FROM `lines` l
        JOIN station_lines sl ON (sl.line_id = l.id)
        JOIN stations s ON (s.id = sl.station_id )
 GROUP BY line_id, age
 ORDER BY line_id ASC, age DESC;
"""

result = pd.read_sql(sql_query, conn)

# The result needs further processing
max_mask = result.groupby("line_id")["age"].transform(max) == result["age"]
result = (result.loc[max_mask, :]
          .sort_values("age", ascending=False))
result.loc[:, ["r_value", "g_value", "b_value"]] = result.loc[:, ["r_value", "g_value", "b_value"]].astype(int)
result


Unnamed: 0,line_id,line_color,r_value,g_value,b_value,age
2477,942,#c0cd30,192,205,205,2021.0
1775,560,#276cd9,39,108,108,2021.0
1838,591,#000,0,0,0,2021.0
1837,590,#000,0,0,0,2021.0
1830,583,#883f98,136,63,63,2021.0
...,...,...,...,...,...,...
856,274,#fc921c,252,146,146,-997978.0
2346,864,#f33043,243,48,48,-997978.0
93,62,#ffbc2d,255,188,188,-997978.0
858,275,#ff8000,255,128,128,-997978.0


##### Interpretation of results

#### Question ten

> ***What is the most popular line color for each country?***

##### English explanation

###### SQL explanation
Select each country from the cities table that appear more than twice in the "lines" table. This is a proxy for countries that have more than one transportation line logged into the database. Select color of the each line to represent the lines. Group by country first, then by color. Calculate how many times each color occurs within its country group, name this as "occurence_count"

###### Python explanation
We read the SQL table we get into a dataframe. To make our dataframe more intelligible, we first convert "occurence_count" to a percentage. Then we filter the dataframe to get only the most recurring color.
In the end, we have a dataframe of countries, their most popular line color and its occurence percentage.

##### Code

In [4]:
sql_query = """
SELECT x.country, z.color, COUNT(z.color) AS occurence_count
  FROM (SELECT c.country, COUNT(c.country)
          FROM cities c
               JOIN `lines` l ON (c.id = l.city_id)
         GROUP BY c.country
         HAVING COUNT(c.country) > 2) x
        JOIN cities y ON (x.country = y.country)
        JOIN `lines` z on (y.id = z.city_id)
 GROUP BY x.country, z.color
 ORDER BY x.country ASC, occurence_count DESC;
"""

result = pd.read_sql(sql_query, conn)

# The result needs further processing
perc = (result.groupby(["country"])["occurence_count"].max()
        / result.groupby(["country"])["occurence_count"].sum()*100) # Get the percentage versions
perc = pd.DataFrame(perc).reset_index()
max_mask = result.groupby("country")["occurence_count"].transform(max) == result["occurence_count"] # Select the max only
result = (result.loc[max_mask, :]
          .sort_values("country", ascending=True))
result = pd.merge(result, perc, on="country")
result = (result
          .drop("occurence_count_x", axis=1)
          .rename({"occurence_count_y": "occurence_perc"}, axis=1)
          ["occurence_perc"])
result


0     42.857143
1     48.888889
2     55.000000
3     13.636364
4     13.636364
        ...    
75    33.333333
76    12.000000
77    25.000000
78    25.000000
79     7.746479
Name: occurence_perc, Length: 80, dtype: float64

##### Interpretation of results

Since we are dealing with color, it does not make sense to talk without visualizing the results first.