# <b>Denison CS181/DA210 Final Project</b> 
## Deliverable #3: Data Storage and Analysis
#### Cheryl Nguyen, Minh Le
#### Dr. Amert
#### April 21st, 2023

In [1]:
import os
import io
import sys
import importlib
import pandas as pd
from lxml import etree
import requests
from IPython.display import Image
import plotly.express as px
import os.path

htmlparser =  etree.HTMLParser()

module_dir = os.path.join("..", "..", "modules")
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

import json
import sqlalchemy as sa

datadir = "data"

%load_ext sql

---

We have updated our `data_acquisition.ipynb` file to include an additional file, `countries.csv`. Hence, we provide the `Reasons for Eligibility` again.

#### <b>Reasons for Eligibility</b>:

- The life expectancy dataset is taken from Wikipedia. According to the Copyrights section in Terms of Use (https://en.wikipedia.org/wiki/Wikipedia:Copyrights#Guidelines_for_images_and_other_media_files), texts on Wikipedia are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL). These licenses allow users to have free access to information stored in Wikipedia, and in turn, they can copy, modify, and redistribute it.

- For the GDP per capita dataset, we inspected the Data Access and Licensing page of The World Bank (https://datacatalog.worldbank.org/public-licenses#cc-by) and learned that the data provided on this website is released under the Creative Commons Attribution 4.0 (CC-BY 4.0) International license. Under this license, the provider enables web users to “copy, modify and distribute data in any format for any purpose, including commercial use.” (The World Bank) To be certain, we also referred to the overall Terms of Use (https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets), which allow users to “extract, download, and make copies of the data contained in the Datasets.” In other words, we can use the data in this course, which is for educational purposes.

- The CSV file named `countries.csv` is taken from the `denison.edu` website (http://datasystems.denison.edu/data.html) under the `Tabular Data` section. This file is eligible for use because it originates from an educational website owned by Denison University, as the `denison.edu` domain in the URL shows. As we are using this file for purely educational purposes in a project assigned by a Denison professor during the school year, we may use and modify it within the scope of this project only.

---

# <font color='gold'><b>Part 1: Database Walkthrough</b></font>

In this part, we will summarize how the data is stored in the database along with the queries used to access that data.

### <b>1. Database Design</b>

`indicators`: 
- Primary key: `country_and_area` (singleton)
- Fields: `region`, `gdp_per_capita`, `life`
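
A minimal sketch of this design as a SQLite table is given below. The column types are assumptions (the design lists only the field names), and an in-memory database stands in for the project's database file.

```python
import sqlite3

# In-memory database used only for illustration; it stands in for the real file.
conn = sqlite3.connect(":memory:")

# Column types are assumed; only the field names come from the design above.
conn.execute("""
    CREATE TABLE indicators (
        country_and_area TEXT PRIMARY KEY,
        region           TEXT,
        gdp_per_capita   REAL,
        life             REAL
    )
""")

# Two rows copied from the query output shown in Part 2 of this notebook.
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?, ?)",
    [
        ("Aruba", "Latin America & Caribbean", 29342.100858, 74.6),
        ("Afghanistan", "South Asia", 368.754614, 62.0),
    ],
)

# The primary key rejects a second row for the same country.
try:
    conn.execute("INSERT INTO indicators VALUES ('Aruba', NULL, NULL, NULL)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM indicators").fetchone()[0]
print(row_count, duplicate_rejected)
```

Because `country_and_area` is the only key column, each country or area can appear at most once in the table.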

### <b>2. Set Credentials and Build a Connection</b>

In [2]:
def getsqlite_creds(dirname=".",filename="creds.json",source="sqlite"):
    """ Using directory and filename parameters, open a credentials file
        and obtain the three parts needed for a connection string to
        a local provider using the `source` dictionary within
        an outer dictionary.  
        
        Return a scheme, a database directory, and a database name
    """
    assert os.path.isfile(os.path.join(dirname, filename))
    with open(os.path.join(dirname, filename)) as f:
        D = json.load(f)
    sqlite = D[source]
    return sqlite["scheme"], sqlite["dbdir"], sqlite["database"]


In [3]:
def buildConnectionString(source="sqlite_country"):
    scheme, dbdir, database = getsqlite_creds(source=source)
    template = '{}:///{}/{}.db'
    return template.format(scheme, dbdir, database)

In [4]:
# Build the connection string
cstring = buildConnectionString("sqlite_country")
print("Connection string:", cstring)

# Connect to the database
engine = sa.create_engine(cstring)
connection = engine.connect()

Connection string: sqlite:///./country.db
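
The printed connection string suggests a `creds.json` entry along the following lines. The file itself is not shown in this notebook, so the contents below are an assumption reconstructed from the keys read by `getsqlite_creds` (`scheme`, `dbdir`, `database`) and from the printed string.

```python
import json

# Assumed contents of creds.json, inferred from the keys used above
# and from the printed connection string; the real file may differ.
creds_text = """
{
    "sqlite_country": {
        "scheme": "sqlite",
        "dbdir": ".",
        "database": "country"
    }
}
"""

creds = json.loads(creds_text)["sqlite_country"]

# Same template as buildConnectionString.
example_cstring = "{}:///{}/{}.db".format(
    creds["scheme"], creds["dbdir"], creds["database"]
)
print(example_cstring)
```

Keeping the scheme, directory, and database name in a separate file means the notebook code does not need to change if the database is moved.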


In [5]:
%sql $cstring 


---

# <font color='silver'><b>Part 2: Data Analysis</b></font>

In this part, we will execute advanced SQL statements to acquire necessary data for the analysis. The answer to our central question <b>Does GDP per capita have a relationship with the life expectancy of countries in the world?</b> will be provided at the end of the analysis.

First, we will draw a scatter plot with a linear regression line showing the correlation between GDP per capita and life expectancy of the countries. To do this, we have to use an SQL query to acquire a DataFrame as a resource for plotting.

In [6]:
query1 = """
SELECT * 
FROM indicators
WHERE gdp_per_capita IS NOT NULL and life IS NOT NULL
"""

df1 = pd.read_sql_query(query1, con=connection)
df1.head(10)

| | country_and_area | region | gdp_per_capita | life |
|---|---|---|---|---|
| 0 | Aruba | Latin America & Caribbean | 29342.100858 | 74.6 |
| 1 | Afghanistan | South Asia | 368.754614 | 62.0 |
| 2 | Angola | Sub-Saharan Africa | 1953.533757 | 61.6 |
| 3 | Albania | Europe & Central Asia | 6492.872012 | 76.5 |
| 4 | Andorra | Europe & Central Asia | 42137.327271 | 80.4 |
| 5 | United Arab Emirates | Middle East & North Africa | 44315.554183 | 78.7 |
| 6 | Argentina | Latin America & Caribbean | 10636.120196 | 75.4 |
| 7 | Armenia | Europe & Central Asia | 4966.513471 | 72.0 |
| 8 | American Samoa | East Asia & Pacific | 15743.310758 | 72.5 |
| 9 | Antigua and Barbuda | Latin America & Caribbean | 15781.395702 | 78.5 |


In [7]:
def life_gdp(data):
    """
    Creates a scatter plot with an OLS trendline showing the relationship
    between GDP per capita and life expectancy, and returns the figure
    """
    # Relabel the axes via `labels` so the caller's DataFrame is not modified
    fig = px.scatter(data, x = "gdp_per_capita", y = "life", 
                    title = "Relationship between GDP Per Capita and Life Expectancy", 
                    labels = {'gdp_per_capita': 'GDP Per Capita', 'life': 'Life Expectancy'},
                    hover_name = 'country_and_area', size_max = 20, trendline = 'ols', trendline_color_override = "red")
    fig.show()
    return fig

In [8]:
# Display scatter plot
plot = life_gdp(df1)

As the graph shows, there is a positive relationship between GDP per capita and life expectancy across countries. The upward slope of the linear regression line holds even though the data point for `Monaco` is an outlier.
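
The direction of this relationship can also be checked numerically. The sketch below computes a Pearson correlation coefficient and a least-squares slope with NumPy on only the ten rows displayed by `df1.head(10)`, so the values are illustrative and do not describe the full dataset. NumPy is an assumption here; the notebook itself relies on Plotly's `trendline = 'ols'` option.

```python
import numpy as np

# First ten rows from the query output above (not the full dataset).
gdp = np.array([
    29342.100858, 368.754614, 1953.533757, 6492.872012, 42137.327271,
    44315.554183, 10636.120196, 4966.513471, 15743.310758, 15781.395702,
])
life = np.array([74.6, 62.0, 61.6, 76.5, 80.4, 78.7, 75.4, 72.0, 72.5, 78.5])

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(gdp, life)[0, 1]

# Slope and intercept of the least-squares line life = slope * gdp + intercept.
slope, intercept = np.polyfit(gdp, life, 1)

print(f"r = {r:.3f}, slope = {slope:.6f}, intercept = {intercept:.2f}")
```

A positive `r` and a positive `slope` are consistent with the upward trendline in the scatter plot.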

---

Subsequently, we will find the average life expectancy for each region and use a bar graph to illustrate this. 

In [9]:
query2 = """
SELECT region, AVG(life) AS average_life
FROM indicators
WHERE gdp_per_capita IS NOT NULL and life IS NOT NULL
GROUP BY region
ORDER BY average_life 
"""

df2 = pd.read_sql_query(query2, con=connection)
df2.head(10)

| | region | average_life |
|---|---|---|
| 0 | Sub-Saharan Africa | 61.536585 |
| 1 | South Asia | 70.525 |
| 2 | Latin America & Caribbean | 72.839394 |
| 3 | East Asia & Pacific | 73.110345 |
| 4 | Middle East & North Africa | 75.26875 |
| 5 | Europe & Central Asia | 77.661702 |
| 6 | North America | 79.733333 |


In [10]:
# Display bar graph
fig = px.bar(df2, x='region', y='average_life',
             title = "Average Life Expectancy by Regions",
             hover_data=['region', 'average_life'], color='average_life',
             labels={'region': 'Regions', 'average_life': 'Life Expectancy'}, height=600)
fig.show()

### <b>Analysis</b>

Regions with higher economic development record higher average life expectancy, as evidenced by the top positions of North America (USA, Canada, etc.) and Europe & Central Asia. Conversely, the lowest average life expectancies appear in regions with lower economic development (Sub-Saharan Africa and South Asia).

---

Then, we will use SQL queries to separate countries into 2 groups: high GDP per capita and low GDP per capita. The threshold will be the mean GDP per capita.

The query labels the high and low GDP per capita groups as 1 and 0, respectively, and returns the average life expectancy of each group.

In [11]:
query3 = """
SELECT low_high_group, AVG(life) AS avg_life
FROM (SELECT country_and_area, life, gdp_per_capita > (SELECT AVG(gdp_per_capita)
                                                                 FROM indicators) AS low_high_group
      FROM indicators
      WHERE gdp_per_capita IS NOT NULL)
GROUP BY low_high_group
"""

df3 = pd.read_sql_query(query3, con=connection)
df3

| | low_high_group | avg_life |
|---|---|---|
| 0 | 0 | 68.562791 |
| 1 | 1 | 80.41875 |


As seen above, the average life expectancy of the high GDP per capita group (about 80.4 years) is higher than that of the low GDP per capita group (about 68.6 years), a gap of approximately 12 years.
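
To see how the comparison inside `query3` produces the 0/1 labels, the sketch below runs the same query shape on the ten rows displayed earlier, loaded into an in-memory SQLite database. The sample is partial, the `region` column is omitted, the column types are assumed, and a `COUNT(*)` column is added for illustration, so the group averages here differ from the results above.

```python
import sqlite3

# (country_and_area, gdp_per_capita, life) for the ten rows shown earlier.
rows = [
    ("Aruba", 29342.100858, 74.6),
    ("Afghanistan", 368.754614, 62.0),
    ("Angola", 1953.533757, 61.6),
    ("Albania", 6492.872012, 76.5),
    ("Andorra", 42137.327271, 80.4),
    ("United Arab Emirates", 44315.554183, 78.7),
    ("Argentina", 10636.120196, 75.4),
    ("Armenia", 4966.513471, 72.0),
    ("American Samoa", 15743.310758, 72.5),
    ("Antigua and Barbuda", 15781.395702, 78.5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators ("
    "country_and_area TEXT PRIMARY KEY, gdp_per_capita REAL, life REAL)"
)
conn.executemany("INSERT INTO indicators VALUES (?, ?, ?)", rows)

# In SQLite, a comparison evaluates to 1 (true) or 0 (false),
# which is what turns the subquery column into a group label.
query = """
SELECT low_high_group, COUNT(*) AS n, AVG(life) AS avg_life
FROM (SELECT life,
             gdp_per_capita > (SELECT AVG(gdp_per_capita)
                               FROM indicators) AS low_high_group
      FROM indicators
      WHERE gdp_per_capita IS NOT NULL)
GROUP BY low_high_group
"""

# Map each label to (number of countries, average life expectancy).
groups = {label: (n, avg) for label, n, avg in conn.execute(query)}
for label, (n, avg) in sorted(groups.items()):
    print(label, n, round(avg, 2))
```

Because GDP per capita is right-skewed, the mean sits above most countries, so the low group tends to be larger than the high group.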

In [12]:
# Close the connection
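try:
    connection.close()
except Exception:
    # Ignore errors raised while closing (e.g. connection already closed)
    pass
# Release the engine's pooled connections before dropping the reference
engine.dispose()
del engine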

---

## <b>Conclusion</b>

Up to this point, we have performed SQL queries and plotted graphs that show the following:
- Average life expectancy broadly tracks the level of economic development across the 7 regions.
- The average life expectancy of countries with high GDP per capita is higher than that of countries with low GDP per capita.

=> There is a strong alignment between average life expectancy and the level of economic development (as measured by GDP per capita). 

=> The 2021 database helps reveal that <font color='gold'><b>GDP per capita is positively correlated with the life expectancy of countries around the world</b></font>.

---
---
### <b>References</b>
1. gdp.xml: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?end=2019&start=1960

2. life.html: https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy#World_Health_Organization_(2019)

3. countries.csv: http://datasystems.denison.edu/data.html