# Extract CSVs/Create DataFrames

Import the dependencies, including the PostgreSQL username and password stored in `config.py`.

In [None]:
import pandas as pd
from sqlalchemy import create_engine
from config import username, password

Import the CSVs obtained from Kaggle, which contain the World Happiness Report measurements and the CIA World Factbook country data. The country data uses European number formatting, so the code specifies the thousands and decimal separators in order to parse the numbers correctly.

In [None]:
happiness_file = "Data/happiness.csv"
happiness_df = pd.read_csv(happiness_file)
happiness_df.head()

In [None]:
country_file = "Data/countries.csv"
country_df = pd.read_csv(country_file, thousands=".", decimal=",")
country_df.head()
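
As a minimal sketch of what the two separator arguments do, the snippet below parses a made-up one-row CSV held in memory. The sample values and the `sample_csv` and `sample_df` names are illustrative only and are not taken from the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical sample row: "." groups thousands and "," marks the decimal point.
# Fields that contain a comma are quoted so they are not split as delimiters.
sample_csv = 'Country,Population,Literacy (%)\n"Examplia ","1.234.567","82,5"\n'

sample_df = pd.read_csv(io.StringIO(sample_csv), thousands=".", decimal=",")

# Both numeric columns should come back as numbers rather than strings.
print(sample_df.dtypes)
print(sample_df)
```

Without these arguments, pandas would typically leave such columns as text, which would make later numeric comparisons unreliable.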

# Transform Happiness & Countries Data

The two data sets name some countries differently. The country names in each were compared, and the code below renames the mismatched entries so that rows from the two tables can be matched correctly.

In [None]:
happiness_df.replace({'Country name': {"Bosnia and Herzegovina" : "Bosnia & Herzegovina",
                                       "Congo (Kinshasa)" : "Congo DR",
                                       "Congo (Brazzaville)" : "Congo Republic",
                                       "Ivory Coast" : "Cote d'Ivoire",
                                       "Hong Kong S.A.R. of China" : "Hong Kong",
                                       "Taiwan Province of China" : "Taiwan",
                                       "Myanmar" : "Burma",
                                       "Trinidad and Tobago": "Trinidad & Tobago"}},
                                       inplace=True)
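
The name comparison itself is not shown in this notebook. One way it could be done, sketched here on toy data, is a set difference in each direction. The two small DataFrames and the variable names are assumptions for illustration; in practice the `Country name` and `Country` columns of the real DataFrames would be used.

```python
import pandas as pd

# Toy stand-ins for the two data sets (illustrative values only).
happiness_names = pd.DataFrame({"Country name": ["Myanmar", "Ivory Coast", "Norway"]})
factbook_names = pd.DataFrame({"Country": ["Burma ", "Cote d'Ivoire ", "Norway "]})

# Strip whitespace before comparing so trailing spaces do not hide real matches.
left = set(happiness_names["Country name"].str.strip())
right = set(factbook_names["Country"].str.strip())

only_in_happiness = sorted(left - right)  # names with no counterpart in the Factbook data
only_in_factbook = sorted(right - left)   # names with no counterpart in the happiness data

print(only_in_happiness)
print(only_in_factbook)
```

Names that appear in either list are candidates for the replacement dictionaries used in this section.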

The World Happiness data contains the raw measurements used to compile the subscores, as well as general information about each country that is not needed for the comparison, so these columns are dropped. The remaining columns are then renamed for simplicity and to match the table created in the PostgreSQL database.

In [None]:
happiness = happiness_df.drop(columns=["Regional indicator", "Ladder score", "Standard error of ladder score",
                                       "upperwhisker", "lowerwhisker", "Ladder score in Dystopia",
                                       "Perceptions of corruption", "Generosity", "Freedom to make life choices",
                                       "Healthy life expectancy", "Social support", "Logged GDP per capita"])
happiness.head()

In [None]:
happiness = happiness.rename(columns={"Country name": "country",
                                      "Explained by: Log GDP per capita": "log_gdp",
                                      "Explained by: Social support": "social_support",
                                      "Explained by: Healthy life expectancy": "life_expectancy",
                                      "Explained by: Freedom to make life choices": "freedom_of_choice",
                                      "Explained by: Generosity": "generosity",
                                      "Explained by: Perceptions of corruption": "corruption_perception",
                                      "Dystopia + residual": "dystopia_residual"})
happiness.head()
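
Because the load step later appends to an existing table, a column name that does not match the table would likely cause the insert to fail. The sketch below is one possible pre-load check. The `column_mismatch` helper is hypothetical, and the expected column list is assumed from the rename step above, since `schema.sql` is not shown here and may differ.

```python
import pandas as pd

# Assumed column list, derived from the rename step; schema.sql may differ.
EXPECTED_HAPPINESS_COLUMNS = [
    "country", "log_gdp", "social_support", "life_expectancy",
    "freedom_of_choice", "generosity", "corruption_perception",
    "dystopia_residual",
]


def column_mismatch(df, expected):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [col for col in expected if col not in df.columns]
    unexpected = [col for col in df.columns if col not in expected]
    return missing, unexpected


# Toy frame standing in for `happiness`, with one column left unrenamed.
toy = pd.DataFrame(columns=["country", "log_gdp", "Explained by: Generosity"])
missing, unexpected = column_mismatch(toy, EXPECTED_HAPPINESS_COLUMNS)
print("missing:", missing)
print("unexpected:", unexpected)
```

Calling `column_mismatch(happiness, EXPECTED_HAPPINESS_COLUMNS)` on the real DataFrame should return two empty lists if the rename covered every column.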

Some country names are clearer in one data set than in the other, so the less clear World Factbook names are replaced. In addition, the country names in the World Factbook data end with a trailing space, which is stripped so that they match the World Happiness data.

In [None]:
country_df["Country"] = country_df["Country"].str.strip()
country_df.replace({'Country': {"Gambia, The" : "Gambia",
                                "Central African Rep." : "Central African Republic",
                                "Congo, Dem. Rep." : "Congo DR",
                                "Congo, Repub. of the" : "Congo Republic",
                                "Korea, South": "South Korea"}}, inplace=True)

Not all of the World Factbook data is needed, so columns with no apparent relation to happiness are dropped. The remaining columns are then renamed for simplicity and to match the table created in the PostgreSQL database.

In [None]:
countries = country_df.drop(columns=['Region', 'Population', 'Area (sq. mi.)', 'Coastline (coast/area ratio)',
                                     'Phones (per 1000)', 'Arable (%)', 'Crops (%)', 'Other (%)', 'Climate', 'Birthrate',
                                     'Deathrate', 'Agriculture', 'Industry', 'Service'])
countries.head()

In [None]:
countries = countries.rename(columns={'Country':'country', 'Pop. Density (per sq. mi.)':'pop_density', 
                                      'Net migration':'net_migration', 
                                      'Infant mortality (per 1000 births)':'infant_mortality', 
                                      'GDP ($ per capita)':'gdp', 'Literacy (%)':'literacy_rate'})
countries.head()

# Create database connection/Load data into database

Create the connection string, passing in the username and password from `config.py`.

In [None]:
connection_string = f"{username}:{password}@localhost:5432/happiness_db"
engine = create_engine(f'postgresql://{connection_string}')
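
An f-string URL can break if the password contains reserved characters such as `@` or `/`. As a hedged alternative, SQLAlchemy 1.4 and later provide `URL.create`, which escapes each part when the URL is rendered. The credentials below are placeholders for illustration; the notebook itself reads them from `config.py`.

```python
from sqlalchemy.engine import URL

# Placeholder credentials (assumptions for this sketch, not real values).
example_username = "postgres"
example_password = "p@ss/word"

url = URL.create(
    drivername="postgresql",
    username=example_username,
    password=example_password,
    host="localhost",
    port=5432,
    database="happiness_db",
)

# The rendered URL shows the password with its reserved characters escaped.
print(url.render_as_string(hide_password=False))
```

The resulting `url` object can be passed to `create_engine` in place of the f-string.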

List the tables created by running `schema.sql`.

In [None]:
from sqlalchemy import inspect
inspect(engine).get_table_names()  # Engine.table_names() was removed in SQLAlchemy 2.0

Load the DataFrames into the tables, then run SQL `select *` queries to verify success.

In [None]:
happiness.to_sql(name='happiness', con=engine, if_exists='append', index=False)

In [None]:
countries.to_sql(name='countries', con=engine, if_exists='append', index=False)

In [None]:
pd.read_sql_query('select * from happiness', con=engine)

In [None]:
pd.read_sql_query('select * from countries', con=engine)
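
One thing to watch with `if_exists='append'` is that re-running a load cell adds the same rows again, unless a constraint in `schema.sql` rejects them. The sketch below illustrates this with an in-memory SQLite database standing in for PostgreSQL, so it runs without a server. The toy data and variable names are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Toy rows standing in for the cleaned `happiness` DataFrame.
toy = pd.DataFrame({"country": ["Norway", "Burma"], "generosity": [0.1, 0.2]})

# SQLite in memory is a stand-in here; the notebook itself targets PostgreSQL.
con = sqlite3.connect(":memory:")
count_query = "select count(*) as n from happiness"

toy.to_sql(name="happiness", con=con, if_exists="append", index=False)
first_count = int(pd.read_sql_query(count_query, con=con)["n"].iloc[0])

# Simulate re-running the load cell: the same rows are appended a second time.
toy.to_sql(name="happiness", con=con, if_exists="append", index=False)
second_count = int(pd.read_sql_query(count_query, con=con)["n"].iloc[0])

print(first_count, second_count)
con.close()
```

Comparing the row count in the database with `len(happiness)` after loading is a quick way to confirm that each table was loaded exactly once.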

### Return to the `readme.md` for further instructions.