# Lab Assignment 9: Data Management Using `pandas`, Part 2
## DS 6001: Practice and Application of Data Science

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

## Problem 0
Import the following libraries:

In [6]:
import numpy as np
import pandas as pd

## Problem 1
In the first part of this lab, the goal is to merge data from the United Nations World Health Organization (https://www.who.int/who-un/en/) with data from the Varieties of Democracy Project (https://www.v-dem.net/en/). The UN-WHO studies health outcomes in a cross-national context, and V-Dem studies the quality of democracy as it changes across countries and over time. We would want to merge these two datasets together if we wanted to study whether democratic quality can predict health outcomes.

The UN data contains cross-national time series data from the United Nations and World Health Organization, and includes three features:

* The number of physicians per 1000 people
* The percent of the population that is malnourished
* Health expenditure per capita

The VDem data comes from the Varieties of Democracy project, which aims to measure the quality of democracy and the amount of corruption in different countries over time (https://www.v-dem.net/en/data/data-version-8/). This data file contains indices regarding a country’s democratic quality, level of civil liberties, and corruption. It also contains a binary indicator that separates countries into democratic and nondemocratic states, and it includes a categorization of the corruption scale.

The URLs for the two datasets are:

In [7]:
undata_url = "https://github.com/jkropko/DS-6001/raw/master/localdata/UNdata.csv"
VDem_url = "https://github.com/jkropko/DS-6001/raw/master/localdata/vdem.csv"

### Part a
Load both CSV files. Make sure to check whether there are rows that should not be included in the dataframe, and whether there are missing codes that should be replaced with `NaN`. Fix these problems at the data loading stage, if you can. (Don't worry about column names or category labels yet.) Also, the UN data covers the years 1960-2014, and the VDem data covers the years 1960-2015. To make the timeframe match up, delete rows in the VDem data from 2015. (1 point)

In [70]:
# Load the data and inspect the head
undata = pd.read_csv(undata_url)
undata.head()

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,0.0348442494869232,..,..,..,..,0.0634277984499931,...,0.136,0.146,0.145,0.175,0.194,0.234,0.225,0.266,..,..
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,0.276291221380234,..,..,..,..,0.48128342628479,...,1.15,1.146,..,1.144,1.132,1.113,1.145,1.145,..,..
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,0.173148155212402,..,..,..,..,0.116413652896881,...,..,1.207,..,..,1.207,..,..,..,..,..
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,..,..,..,..,..,..,...,3.64,3.716,..,3.912,4,..,..,..,..,..


In [71]:
# Check if there are any `NaN` values -- it turns out Yes!
undata.isnull().values.any()

True

In [72]:
# Check column names
undata.columns

Index(['Series Name', 'Series Code', 'Country Name', 'Country Code',
       '1960 [YR1960]', '1961 [YR1961]', '1962 [YR1962]', '1963 [YR1963]',
       '1964 [YR1964]', '1965 [YR1965]', '1966 [YR1966]', '1967 [YR1967]',
       '1968 [YR1968]', '1969 [YR1969]', '1970 [YR1970]', '1971 [YR1971]',
       '1972 [YR1972]', '1973 [YR1973]', '1974 [YR1974]', '1975 [YR1975]',
       '1976 [YR1976]', '1977 [YR1977]', '1978 [YR1978]', '1979 [YR1979]',
       '1980 [YR1980]', '1981 [YR1981]', '1982 [YR1982]', '1983 [YR1983]',
       '1984 [YR1984]', '1985 [YR1985]', '1986 [YR1986]', '1987 [YR1987]',
       '1988 [YR1988]', '1989 [YR1989]', '1990 [YR1990]', '1991 [YR1991]',
       '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]', '1995 [YR1995]',
       '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]', '1999 [YR1999]',
       '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
       '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]',
       '2008 [YR2008]', '2009 [

In [73]:
# Check if there are missing values in "Series Name" -- 3 rows
undata['Series Name'].isnull().sum()

3

In [74]:
# Visual check of the missing records
undata[undata['Series Name'].isna()]

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
774,,,,,,,,,,,...,,,,,,,,,,
775,,,,,,,,,,,...,,,,,,,,,,
776,,,,,,,,,,,...,,,,,,,,,,


In [75]:
# Check if there are missing values in "Series Code" -- 5 rows
undata['Series Code'].isnull().sum()

5

In [76]:
# Visual check of the missing records -- same missing values for Country Code and Country Name
undata[undata['Series Code'].isnull()] 

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
774,,,,,,,,,,,...,,,,,,,,,,
775,,,,,,,,,,,...,,,,,,,,,,
776,,,,,,,,,,,...,,,,,,,,,,
777,Data from database: Health Nutrition and Popul...,,,,,,,,,,...,,,,,,,,,,
778,Last Updated: 12/16/2016,,,,,,,,,,...,,,,,,,,,,


In [77]:
undata = undata.dropna(subset=['Series Code'])

In [78]:
# Check if there are any `NaN` values after dropping the rows with a missing "Series Code" -- turns out No!
undata.isnull().values.any()

False
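
The `False` above only means that no cells are `NaN`; the `..` entries visible in the earlier `head()` output are still stored as text. A minimal sketch of handling both problems at the loading stage follows. It uses a made-up miniature CSV rather than the real file, so the footer length passed to `skipfooter` is an assumption that would need to be checked against the actual data.

```python
import io
import pandas as pd

# Hypothetical miniature of the UN file: ".." marks a missing value, and the
# last three lines are a blank row plus two footer notes, not data.
toy_csv = io.StringIO(
    "Series Name,Country Name,1960 [YR1960],1961 [YR1961]\n"
    '"Physicians (per 1,000 people)",Afghanistan,0.035,..\n'
    '"Physicians (per 1,000 people)",Andorra,..,..\n'
    ",,,\n"
    "Data from database: Example,,,\n"
    "Last Updated: 12/16/2016,,,"
)

toy = pd.read_csv(
    toy_csv,
    na_values="..",   # read the ".." missing code as NaN
    skipfooter=3,     # skip the trailing non-data lines (toy value)
    engine="python",  # skipfooter requires the python parsing engine
)

print(toy)
print(toy.dtypes)  # the year columns are now numeric instead of object
```

With the missing code declared at load time, the year columns come in as floats, so no later string-to-number conversion is needed.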

In [79]:
# Load and inspect the democracy data
VDem = pd.read_csv(VDem_url)
VDem.head()

Unnamed: 0,X1,country_name,country_id,country_text_id,year,historical_date,codingstart,gapstart,gapend,codingend,...,v2xcs_ccsi_codehigh,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh
0,1,Mexico,3,MEX,1960,1960-01-01,1900,,,2014,...,0.451123,0.170201,0.681416,0.811379,0.524055,0.347498,0.42127,0.273726,0.555367,0.714971
1,2,Mexico,3,MEX,1961,1961-01-01,1900,,,2014,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.344214,0.417813,0.270614,0.555367,0.714971
2,3,Mexico,3,MEX,1962,1962-01-01,1900,,,2014,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.344214,0.417813,0.270614,0.555367,0.714971
3,4,Mexico,3,MEX,1963,1963-01-01,1900,,,2014,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.344214,0.417813,0.270614,0.555367,0.714971
4,5,Mexico,3,MEX,1964,1964-01-01,1900,,,2014,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.356873,0.428861,0.284885,0.555367,0.714971


In [80]:
# Check if there are any `NaN` values in the VDem data -- turns out Yes!
VDem.isnull().values.any()

True

In [67]:
# Check columns -- "country_name" and "year" are probably the most important for the merge
VDem.columns

Index(['X1', 'country_name', 'country_id', 'country_text_id', 'year',
       'historical_date', 'codingstart', 'gapstart', 'gapend', 'codingend',
       'COWcode', 'v2x_polyarchy', 'v2x_polyarchy_codehigh',
       'v2x_polyarchy_codelow', 'v2x_api', 'v2x_api_codehigh',
       'v2x_api_codelow', 'v2x_mpi', 'v2x_mpi_codehigh', 'v2x_mpi_codelow',
       'v2x_EDcomp_thick', 'v2x_EDcomp_thick_codehigh',
       'v2x_EDcomp_thick_codelow', 'v2x_libdem', 'v2x_libdem_codehigh',
       'v2x_libdem_codelow', 'v2x_liberal', 'v2x_liberal_codehigh',
       'v2x_liberal_codelow', 'v2x_partipdem', 'v2x_partipdem_codehigh',
       'v2x_partipdem_codelow', 'v2x_partip', 'v2x_partip_codehigh',
       'v2x_partip_codelow', 'v2x_delibdem', 'v2x_delibdem_codehigh',
       'v2x_delibdem_codelow', 'v2xdl_delib', 'v2xdl_delib_codehigh',
       'v2xdl_delib_codelow', 'v2x_egaldem', 'v2x_egaldem_codehigh',
       'v2x_egaldem_codelow', 'v2x_egal', 'v2x_egal_codehigh',
       'v2x_egal_codelow', 'v2x_frassoc_thic

In [82]:
# Check if there are missing values for countries -- 0 rows
VDem['country_name'].isnull().sum()

0

In [83]:
# Check the year -- also 0, probably good to go for now...
VDem['year'].isnull().sum()

0

In [84]:
len(VDem)

8534
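
Part a also asks for the 2015 rows to be removed from the VDem data so that the timeframes match. One way this could be done is sketched below on a toy dataframe; the names `vdem_toy` and `vdem_trimmed` are illustrative only and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the VDem data (made-up rows, real column names)
vdem_toy = pd.DataFrame({
    "country_name": ["Mexico", "Mexico", "Mexico"],
    "year": [2013, 2014, 2015],
})

# Keep every row up to and including 2014, i.e. drop the 2015 rows
vdem_trimmed = vdem_toy.loc[vdem_toy["year"] <= 2014].reset_index(drop=True)

print(vdem_trimmed)
```

Applied to the real dataframe, the same boolean filter on `year` would shorten `VDem` before the merge in Part f.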

### Part b
The UN data contain certain rows that refer to groups of countries instead of to individual countries. Here’s a list of these non-countries:

In [85]:
noncountries = ['Arab World',  'Caribbean small states',  'Central Europe and the Baltics', 
    'Early-demographic dividend',  'East Asia & Pacific', 'East Asia & Pacific (excluding high income)', 
    'East Asia & Pacific (IDA & IBRD countries)', 'Euro area', 'Europe & Central Asia', 
    'Europe & Central Asia (excluding high income)', 'Europe & Central Asia (IDA & IBRD countries)', 'European Union', 
    'Fragile and conflict affected situations', 'Heavily indebted poor countries (HIPC)', 
    'High income', 'Late-demographic dividend', 'Latin America & Caribbean', 
    'Latin America & Caribbean (excluding high income)', 
    'Latin America & the Caribbean (IDA & IBRD countries)', 'Least developed countries: UN classification', 
    'Low & middle income', 'Low income', 'Lower middle income', 
    'Middle East & North Africa', 'Middle East & North Africa (excluding high income)',
    'Middle East & North Africa (IDA & IBRD countries)', 
    'Middle income', 'North America', 'OECD members', 
    'Other small states', 'Pacific island small states', 'Post-demographic dividend', 
    'Pre-demographic dividend', 'Small states', 'South Asia', 
    'South Asia (IDA & IBRD)', 'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)', 
    'Sub-Saharan Africa (IDA & IBRD countries)', 'Upper middle income', 'World']




We can use `.query()` to remove the non-countries from the data, but in this case there are complications due to the space in the name of the column `Country Name` and the use of an external list. So here let's use an alternative method:

First, apply the `.isin(noncountries)` method to the `Country Name` column of the UN data to create a series of values that are `True` if the `Country Name` on a row is one of the non-countries, and `False` otherwise. Second, use the `~` operator to negate the logical values: turn `True` to `False` and vice versa. Finally, pass this logical series to the `.loc[]` attribute of the dataframe to drop the rows that refer to these noncountries from the UN data. (1 point)

(If you wanted to use `.query()`, you would first need to rename `Country Name` to remove the space, then you can use an `@` in front of `noncountries` to refer to the external list. But for this problem follow the instructions listed above.)

In [310]:
# Store the true countries together with all of the columns in "uncountries" dataframe
uncountries = undata.loc[~undata['Country Name'].isin(noncountries)]

In [311]:
# Check the expected length
len(uncountries)

651
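
For comparison, the `.query()` route described in the problem text can be sketched on a toy dataframe. The data and the names `un_toy` and `groups` are made up for illustration; the point is only that the two approaches select the same rows.

```python
import pandas as pd

# Toy stand-in for the UN data, with two aggregate "non-country" rows
un_toy = pd.DataFrame({
    "Country Name": ["Albania", "World", "Euro area", "Zimbabwe"],
    "value": [1, 2, 3, 4],
})
groups = ["World", "Euro area"]

# Approach used in this lab: .isin(), negated with ~, passed to .loc[]
by_isin = un_toy.loc[~un_toy["Country Name"].isin(groups)]

# Alternative: rename the column to remove the space, then refer to the
# external list inside .query() with the @ prefix
renamed = un_toy.rename(columns={"Country Name": "country"})
by_query = renamed.query("country not in @groups")

print(by_isin)
print(by_query)
```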

### Part c
Reshape the UN data to move the years from the columns to the rows. (Once the years are in the rows, they will have values such as "1960 [YR1960]".) (2 points)

In [312]:
uncountries.columns

Index(['Series Name', 'Series Code', 'Country Name', 'Country Code',
       '1960 [YR1960]', '1961 [YR1961]', '1962 [YR1962]', '1963 [YR1963]',
       '1964 [YR1964]', '1965 [YR1965]', '1966 [YR1966]', '1967 [YR1967]',
       '1968 [YR1968]', '1969 [YR1969]', '1970 [YR1970]', '1971 [YR1971]',
       '1972 [YR1972]', '1973 [YR1973]', '1974 [YR1974]', '1975 [YR1975]',
       '1976 [YR1976]', '1977 [YR1977]', '1978 [YR1978]', '1979 [YR1979]',
       '1980 [YR1980]', '1981 [YR1981]', '1982 [YR1982]', '1983 [YR1983]',
       '1984 [YR1984]', '1985 [YR1985]', '1986 [YR1986]', '1987 [YR1987]',
       '1988 [YR1988]', '1989 [YR1989]', '1990 [YR1990]', '1991 [YR1991]',
       '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]', '1995 [YR1995]',
       '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]', '1999 [YR1999]',
       '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
       '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]',
       '2008 [YR2008]', '2009 [

In [313]:
ids = ['Series Name', 'Series Code', 'Country Name', 'Country Code']

# Every column that is not an ID column is a year column ('1960 [YR1960]' through '2015 [YR2015]')
values = [col for col in uncountries.columns if col not in ids]


uncountries = pd.melt(uncountries, id_vars=ids, value_vars=values)

In [314]:
uncountries.head()

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,variable,value
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,1960 [YR1960],0.0348442494869232
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,1960 [YR1960],0.276291221380234
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,1960 [YR1960],0.173148155212402
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,1960 [YR1960],..
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,1960 [YR1960],..
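
As a side note, `pd.melt` accepts `var_name` and `value_name` arguments, which would name the new columns during the reshape instead of leaving the defaults `variable` and `value`. A small sketch on made-up data:

```python
import pandas as pd

# Toy wide-format data: one row per country, one column per year
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960 [YR1960]": [0.1, 0.2],
    "1961 [YR1961]": [0.3, 0.4],
})

# Move the year columns into rows and name the new columns directly
long = wide.melt(id_vars=["Country Name"], var_name="year", value_name="value")

print(long)
```

When `value_vars` is omitted, as here, every column not listed in `id_vars` is melted.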


### Part d
Rename the `variable` column to `year`. Then use string methods to remove the ends such as "[YR1960]" from the values of the new `year` column and convert the column to an integer data type.

Also, for whatever reason, real world data often contains multiple variables that are just different representations of the same information. In this case, the `Series Name` and `Series Code` variables tell us exactly the same thing, and the `Country Name` and `Country Code` variables tell us exactly the same thing. Unless I have a very good reason to keep both, I generally prefer to drop variables that are redundant and coded in a less helpful way. So drop `Series Code` and `Country Code`. (2 points)

In [315]:
# Rename the `variable` column to `year`
uncountries = uncountries.rename(columns={'variable':'year'})

In [316]:
# Keep only the first 4 characters, which are the year
uncountries['year'] = uncountries['year'].str[:4]

In [317]:
# Change type to int
uncountries['year'] = uncountries['year'].astype('int')
uncountries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36456 entries, 0 to 36455
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Series Name   36456 non-null  object
 1   Series Code   36456 non-null  object
 2   Country Name  36456 non-null  object
 3   Country Code  36456 non-null  object
 4   year          36456 non-null  int64 
 5   value         36456 non-null  object
dtypes: int64(1), object(5)
memory usage: 1.7+ MB
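
The `info()` output above still lists `Series Code` and `Country Code`, which Part d asks to drop. The sketch below shows, on a made-up two-row dataframe, one possible way to do the year cleanup with a regular expression and to drop the redundant columns explicitly; it is an illustration, not the code that produced the output above.

```python
import pandas as pd

# Toy long-format data mimicking the melted UN dataframe
long = pd.DataFrame({
    "Series Name": ["Physicians (per 1,000 people)"] * 2,
    "Series Code": ["SH.MED.PHYS.ZS"] * 2,
    "Country Name": ["Afghanistan", "Afghanistan"],
    "Country Code": ["AFG", "AFG"],
    "year": ["1960 [YR1960]", "2014 [YR2014]"],
    "value": ["0.0348", ".."],
})

# Pull out the leading four digits and convert them to integers
long["year"] = long["year"].str.extract(r"^(\d{4})", expand=False).astype(int)

# Drop the redundant code columns
long = long.drop(columns=["Series Code", "Country Code"])

print(long.dtypes)
```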


### Part e
Reshape the data to move the values of `Series Name` to separate columns. Make sure all of the columns exist in the dataframe after reshaping and are not stored in a row index or multi-index. Then rename the columns so that all of the columns have concise and descriptive names. (2 points)

In [318]:
# Check the names of the series that are in "Series Name" columns
uncountries['Series Name'].unique()

array(['Physicians (per 1,000 people)',
       'Prevalence of undernourishment (% of population)',
       'Health expenditure per capita (current US$)'], dtype=object)

In [319]:
# Create a dictionary that maps the original series names to better (shorter) names
better_names = {'Physicians (per 1,000 people)':'docs_per_thousand',
       'Prevalence of undernourishment (% of population)': 'undernourishment_percent',
       'Health expenditure per capita (current US$)': 'health_expenditure'
               }

In [320]:
# Replace
uncountries.replace({'Series Name': better_names}, inplace=True)

In [321]:
# Check the result
uncountries['Series Name'].unique()

array(['docs_per_thousand', 'undernourishment_percent',
       'health_expenditure'], dtype=object)

In [322]:
# Perform pivot on "Series name" column as it has 3 different features
uncountries = uncountries.pivot_table(index=['Country Name','year'],
                                      columns='Series Name',
                                      values='value', 
                                      aggfunc='first').reset_index()


In [323]:
# Check the resultant dataframe -- there are many missing values
uncountries.head()

Series Name,Country Name,year,docs_per_thousand,health_expenditure,undernourishment_percent
0,Afghanistan,1960,0.0348442494869232,..,..
1,Afghanistan,1961,..,..,..
2,Afghanistan,1962,..,..,..
3,Afghanistan,1963,..,..,..
4,Afghanistan,1964,..,..,..


In [324]:
uncountries.tail()

Series Name,Country Name,year,docs_per_thousand,health_expenditure,undernourishment_percent
12147,Zimbabwe,2011,0.083,48.46958014,33.5
12148,Zimbabwe,2012,..,57.25376348,33.2
12149,Zimbabwe,2013,..,62.30922835,33.5
12150,Zimbabwe,2014,..,57.71045218,34.0
12151,Zimbabwe,2015,..,..,33.4


In [None]:
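# Remove the leftover "Series Name" label from the column axis and keep the result
uncountries = uncountries.rename_axis(None, axis='columns')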
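
The `head()` and `tail()` outputs above still contain `..` strings, so the three feature columns are text rather than numbers. A possible cleanup is sketched below on made-up data. It converts the values with `pd.to_numeric(errors="coerce")` before reshaping, and it uses `set_index` plus `unstack` in place of `pivot_table`; both choices are assumptions of this sketch rather than part of the original solution.

```python
import pandas as pd

# Toy long-format data with ".." as the missing code
long = pd.DataFrame({
    "Country Name": ["A", "A", "A", "A"],
    "year": [1960, 1960, 1961, 1961],
    "Series Name": ["docs_per_thousand", "health_expenditure"] * 2,
    "value": ["0.03", "..", "..", "12.5"],
})

# Anything that cannot be parsed as a number (here "..") becomes NaN
long["value"] = pd.to_numeric(long["value"], errors="coerce")

# Move the series names into columns, then flatten the result:
# reset_index() returns the keys to ordinary columns, and rename_axis()
# clears the leftover "Series Name" label on the column axis
wide = (
    long.set_index(["Country Name", "year", "Series Name"])["value"]
    .unstack("Series Name")
    .reset_index()
    .rename_axis(None, axis="columns")
)

print(wide)
print(wide.dtypes)
```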

### Part f
Next we are going to join the cleaned UN data with the VDem data. In a perfect world, both datasets would include a shared numeric country ID field that we can use to match countries in one dataset to countries in the other. Unfortunately, the UN data identifies the countries only by name. Worse still, while there is a big overlap, the two datasets cover different sets of countries.

First decide whether this merge is a one-to-one, one-to-many, many-to-one, or many-to-many merge and describe your rationale in words.

Then perform a test merge that checks whether your expectation that the merge is one-to-one, one-to-many, many-to-one, or many-to-many is confirmed, and reports whether each row is matched, appears only in the UN data, or appears only in the VDem data. Use the `.unique()` or `.value_counts()` method to display the names of the countries that are not matched. (2 points)

In [348]:
len(VDem.country_name.unique())

172

In [349]:
len(uncountries['Country Name'].unique())

217

## Answer
#### As shown above, `VDem` has far fewer unique countries than the `UN` dataset, so a reasonable first step is a left join that keeps every row of the `UN` dataset. Because each row in both datasets is identified by a country name and a year, this should be a one-to-one merge. Many rows are nevertheless expected to be unmatched, and the `matched` indicator column will flag them.

In [351]:
# Perform left-join on countries and years, check with indicator:
merged_data = pd.merge(uncountries, 
                       VDem, 
                       how='left', 
                       left_on=['Country Name', 'year'], 
                       right_on=['country_name', 'year'],
                       indicator='matched')
merged_data.head()

Unnamed: 0,Country Name,year,docs_per_thousand,health_expenditure,undernourishment_percent,X1,country_name,country_id,country_text_id,historical_date,...,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh,matched
0,Afghanistan,1960,0.0348442494869232,..,..,1583.0,Afghanistan,36.0,AFG,1960-01-01,...,0.14355,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
1,Afghanistan,1961,..,..,..,1584.0,Afghanistan,36.0,AFG,1961-01-01,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
2,Afghanistan,1962,..,..,..,1585.0,Afghanistan,36.0,AFG,1962-01-01,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
3,Afghanistan,1963,..,..,..,1586.0,Afghanistan,36.0,AFG,1963-01-01,...,0.140015,0.111794,0.212807,0.050778,0.181335,0.232855,0.129815,0.172381,0.301402,both
4,Afghanistan,1964,..,..,..,1587.0,Afghanistan,36.0,AFG,1964-01-01,...,0.229685,0.17783,0.304231,0.090927,0.174778,0.232559,0.116997,0.167323,0.300779,both


In [357]:
merged_data.matched.unique()

['both', 'left_only']
Categories (3, object): ['left_only', 'right_only', 'both']

In [356]:
# Only 60% are properly matched
len(merged_data.loc[merged_data.matched=='both'])/len(merged_data)

0.6023699802501645

In [363]:
# There are many unmatched rows, associated with the following countries:
merged_data.loc[merged_data.matched!='both']['Country Name'].unique()

array(['Albania', 'American Samoa', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Armenia', 'Aruba', 'Australia', 'Austria',
       'Azerbaijan', 'Bahamas, The', 'Bahrain', 'Bangladesh', 'Barbados',
       'Belarus', 'Belgium', 'Belize', 'Bermuda',
       'Bosnia and Herzegovina', 'Botswana', 'British Virgin Islands',
       'Brunei Darussalam', 'Bulgaria', 'Cabo Verde', 'Cameroon',
       'Canada', 'Cayman Islands', 'Central African Republic', 'Chad',
       'Channel Islands', 'Chile', 'China', 'Comoros', 'Congo, Dem. Rep.',
       'Congo, Rep.', "Cote d'Ivoire", 'Croatia', 'Curacao', 'Cyprus',
       'Czech Republic', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt, Arab Rep.',
       'Equatorial Guinea', 'Estonia', 'Faroe Islands', 'Finland',
       'France', 'French Polynesia', 'Gabon', 'Gambia, The', 'Georgia',
       'Germany', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guam',
       'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti', 'Ho
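
The problem statement asks for a test merge that both checks the merge type and reports rows found only in the VDem data. A left join cannot produce `right_only` rows, so the sketch below uses an outer join together with the `validate` argument of `pd.merge`, which raises an error if the keys are not unique on both sides. The two toy dataframes and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: one matching country-year, two UN-only rows, one VDem-only row
un_toy = pd.DataFrame({
    "Country Name": ["France", "France", "Aruba"],
    "year": [1960, 1970, 1970],
    "docs_per_thousand": [1.0, 1.2, 0.9],
})
vdem_toy = pd.DataFrame({
    "country_name": ["France", "Taiwan"],
    "year": [1970, 1970],
    "v2x_polyarchy": [0.8, 0.2],
})

test_merge = pd.merge(
    un_toy,
    vdem_toy,
    how="outer",                       # keep unmatched rows from both sides
    left_on=["Country Name", "year"],
    right_on=["country_name", "year"],
    validate="one_to_one",             # raises MergeError if keys repeat
    indicator="matched",
)

# Count matched, UN-only, and VDem-only rows
counts = test_merge["matched"].value_counts()

# Names of the unmatched countries on each side
unmatched_un = test_merge.loc[test_merge["matched"] == "left_only", "Country Name"].unique()
unmatched_vdem = test_merge.loc[test_merge["matched"] == "right_only", "country_name"].unique()

print(counts)
print(unmatched_un, unmatched_vdem)
```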

### Part g
There are many unmatched rows in this merge. There are three reasons why rows failed to match:
* Differences in geographical coverage: for example, the VDem data includes Taiwan, but the UN data does not
* Differences in time coverage: for example, the UN data includes records for France every year from 1970 through 2014, and VDem includes rows for France from 1960 to 2012, leaving 12 rows for France without matching years
* Differences in spelling: for example, South Korea is called "Korea, Rep." in the UN data and "Korea_South" in the VDem data.

We can't do anything about differences in geographic or temporal coverage. But we can recode some country names to account for differences in spelling and to match more rows that should match. Here is a list of differently spelled countries:

* "Burma_Myanmar" in VDem is "Myanmar" in the UN data
* "Cape Verde" in VDem is "Cabo Verde" in the UN data
* "Congo_Democratic Republic of" in VDem is "Congo, Dem. Rep." in the UN data
* "Congo_Republic of the" in VDem is "Congo, Rep." in the UN data
* "East Timor" in VDem is "Timor-Leste" in the UN data
* "Egypt" in VDem is "Egypt, Arab Rep." in the UN data
* "Gambia" in VDem is "Gambia, The" in the UN data
* "Iran" in VDem is "Iran, Islamic Rep." in the UN data
* "Ivory Coast" in VDem is "Cote d’Ivoire" in the UN data
* "Korea_North" in VDem is "Korea, Dem. People’s Rep." in the UN data
* "Korea_South" in VDem is "Korea, Rep." in the UN data
* "Kyrgyzstan" in VDem is "Kyrgyz Republic" in the UN data
* "Laos" in VDem is "Lao PDR" in the UN data
* "Macedonia" in VDem is "Macedonia, FYR" in the UN data
* "Palestine_West_Bank" in VDem is "West Bank and Gaza" in the UN Data (there is also "Palestine_Gaza" in VDem, but since the UN combines data for the West Bank and Gaza, let's just use "Palestine_West_Bank" for this assignment)
* "Russia" in VDem is "Russian Federation" in the UN data
* "Slovakia" in VDem is "Slovak Republic" in the UN data
* "Syria" in VDem is "Syrian Arab Republic" in the UN data
* "Venezuela" in VDem is "Venezuela, RB" in the UN data
* "Vietnam_Democratic Republic of" in VDem is "Vietnam" in the UN data
* "Yemen" in VDem is "Yemen, Rep." in the UN data

Recode the country names listed above in one of the two dataframes to match the names in the other dataframe. Then perform an inner join of the two dataframes. Some rows will be dropped because of differences in coverage, but no rows will be dropped because of differences in spelling. (2 points)

In [371]:
lookup_table = {
    "Burma_Myanmar" :"Myanmar" ,
    "Cape Verde" : "Cabo Verde",
    "Congo_Democratic Republic of" :"Congo, Dem. Rep." ,
    "Congo_Republic of the": "Congo, Rep." ,
    "East Timor" : "Timor-Leste" ,
    "Egypt" :"Egypt, Arab Rep." ,
    "Gambia" :"Gambia, The" ,
    "Iran" : "Iran, Islamic Rep." ,
    "Ivory Coast" : "Cote d’Ivoire" ,
    "Korea_North" : "Korea, Dem. People’s Rep.",
    "Korea_South" : "Korea, Rep.",
    "Kyrgyzstan" : "Kyrgyz Republic",
    "Laos" : "Lao PDR",
    "Macedonia" :"Macedonia, FYR",
    "Palestine_West_Bank" : "West Bank and Gaza",
#"Palestine_Gaza" in VDem, but since the UN combines data for the West Bank and Gaza, 
#let's just use "Palestine_West_Bank" for this assignment)
    "Russia" : "Russian Federation" ,
    "Slovakia" : "Slovak Republic" ,
    "Syria" : "Syrian Arab Republic",
    "Venezuela" :"Venezuela, RB" ,
    "Vietnam_Democratic Republic of" : "Vietnam" ,
    "Yemen" :"Yemen, Rep."   
}
lookup_table

{'Burma_Myanmar': 'Myanmar',
 'Cape Verde': 'Cabo Verde',
 'Congo_Democratic Republic of': 'Congo, Dem. Rep.',
 'Congo_Republic of the': 'Congo, Rep.',
 'East Timor': 'Timor-Leste',
 'Egypt': 'Egypt, Arab Rep.',
 'Gambia': 'Gambia, The',
 'Iran': 'Iran, Islamic Rep.',
 'Ivory Coast': 'Cote d’Ivoire',
 'Korea_North': 'Korea, Dem. People’s Rep.',
 'Korea_South': 'Korea, Rep.',
 'Kyrgyzstan': 'Kyrgyz Republic',
 'Laos': 'Lao PDR',
 'Macedonia': 'Macedonia, FYR',
 'Palestine_West_Bank': 'West Bank and Gaza',
 'Russia': 'Russian Federation',
 'Slovakia': 'Slovak Republic',
 'Syria': 'Syrian Arab Republic',
 'Venezuela': 'Venezuela, RB',
 'Vietnam_Democratic Republic of': 'Vietnam',
 'Yemen': 'Yemen, Rep.'}

In [374]:
# Replace country names in VDem to better align with the UN dataset
VDem.replace({'country_name': lookup_table}, inplace=True)

In [375]:
# Replaced countries are shown below:
VDem.country_name.unique()

array(['Mexico', 'Suriname', 'Sweden', 'Switzerland', 'Ghana',
       'South Africa', 'Japan', 'Myanmar', 'Russian Federation',
       'Albania', 'Egypt, Arab Rep.', 'Yemen, Rep.', 'Colombia', 'Poland',
       'Brazil', 'United States', 'Portugal', 'El Salvador',
       'South Yemen', 'Bangladesh', 'Bolivia', 'Haiti', 'Honduras',
       'Mali', 'Pakistan', 'Peru', 'Senegal', 'South Sudan', 'Sudan',
       'Vietnam', 'Vietnam_Republic of', 'Afghanistan', 'Argentina',
       'Ethiopia', 'India', 'Kenya', 'Korea, Dem. People’s Rep.',
       'Korea, Rep.', 'Kosovo', 'Lebanon', 'Nigeria', 'Philippines',
       'Tanzania', 'Taiwan', 'Thailand', 'Uganda', 'Venezuela, RB',
       'Benin', 'Bhutan', 'Burkina Faso', 'Cambodia', 'Indonesia',
       'Mozambique', 'Nepal', 'Nicaragua', 'Niger', 'Zambia', 'Zimbabwe',
       'Guinea', 'Cote d’Ivoire', 'Mauritania', 'Canada', 'Australia',
       'Botswana', 'Burundi', 'Cabo Verde', 'Central African Republic',
       'Chile', 'Costa Rica', 'Timor-Leste

In [376]:
# Perform left-join on countries and years, use UN names:
un_merged_data = pd.merge(uncountries, 
                       VDem, 
                       how='left', 
                       left_on=['Country Name', 'year'], 
                       right_on=['country_name', 'year'],
                       indicator='matched')

In [377]:
# Only 68% are properly matched now:
len(un_merged_data.loc[un_merged_data.matched=='both'])/len(un_merged_data)

0.6837557603686636
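
Two points may be worth checking here. First, Part g asks for an inner join, whereas the merge above is a left join. Second, the Part f output prints the UN name as "Cote d'Ivoire" with a straight apostrophe, while the lookup table uses a typographic apostrophe (’). If the UN file does use the straight form, which is an assumption to verify against the data, the two recoded names containing an apostrophe would still fail to match. The sketch below shows one possible fix on made-up data: normalize the apostrophe after recoding, then perform the inner join.

```python
import pandas as pd

# Toy stand-ins; the UN names are assumed to use a straight apostrophe
un_toy = pd.DataFrame({
    "Country Name": ["Cote d'Ivoire", "Korea, Rep."],
    "year": [2000, 2000],
    "docs_per_thousand": [0.1, 1.3],
})
vdem_toy = pd.DataFrame({
    "country_name": ["Ivory Coast", "Korea_South", "Taiwan"],
    "year": [2000, 2000, 2000],
})

# Recode dictionary written with a typographic apostrophe, as in the lab text
recode = {"Ivory Coast": "Cote d’Ivoire", "Korea_South": "Korea, Rep."}

vdem_toy["country_name"] = (
    vdem_toy["country_name"]
    .replace(recode)
    .str.replace("’", "'", regex=False)  # normalize to a straight apostrophe
)

# The inner join keeps only the country-years present in both dataframes
inner = pd.merge(
    un_toy,
    vdem_toy,
    how="inner",
    left_on=["Country Name", "year"],
    right_on=["country_name", "year"],
    validate="one_to_one",
)

print(inner)
```

In the toy data, Taiwan drops out because of coverage, while both recoded countries match.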

## Problem 2
[Kickstarter](https://www.kickstarter.com/) is a website on which people can pledge financial support for creative projects. Patrons are only charged if a project raises enough money to meet a pre-specified goal, and projects can offer items as "rewards" for patrons who contribute at particular levels. One interesting aspect of Kickstarter is the ability to [search projects by "ending soon"](https://www.kickstarter.com/discover/advanced?sort=end_date). If you have a few dollars to spare and want to feel like a hero, you can swoop in at the last minute to contribute enough for a project to meet its goal.

Cathie So created a project on Kaggle in which she [scraped Kickstarter](https://www.kaggle.com/socathie/kickstarter-project-statistics/data?select=live.csv) and collected data on 4000 live projects (projects that were currently collecting pledges from patrons) as of October 10, 2016, at 5pm Pacific time. The data are here:

In [426]:
# Read/inspect the dataset
kickstarter = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/live.csv")
kickstarter.head()

Unnamed: 0.1,Unnamed: 0,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,title,type,url
0,0,15823.0,"\n'Catalysts, Explorers & Secret Keepers: Wome...",Museum of Science Fiction,US,usd,2016-11-01T23:59:00-04:00,"Washington, DC",186,DC,"Catalysts, Explorers & Secret Keepers: Women o...",Town,/projects/1608905146/catalysts-explorers-and-s...
1,1,6859.0,\nA unique handmade picture book for kids & ar...,"Tyrone Wells & Broken Eagle, LLC",US,usd,2016-11-25T01:13:33-05:00,"Portland, OR",8,OR,The Whatamagump (a hand-crafted story picture ...,Town,/projects/thewhatamagump/the-whatamagump-a-han...
2,2,17906.0,\nA horror comedy about a repairman who was in...,Tessa Stone,US,usd,2016-11-23T23:00:00-05:00,"Los Angeles, CA",102,CA,Not Drunk Enough Volume 1!,Town,/projects/1890925998/not-drunk-enough-volume-1...
3,3,67081.0,\nThe Johnny Wander autobio omnibus you've all...,Johnny Wander,US,usd,2016-11-01T23:50:00-04:00,"Brooklyn, NY",191,NY,Our Cats Are More Famous Than Us: A Johnny Wan...,County,/projects/746734715/our-cats-are-more-famous-t...
4,4,32772.0,\nThe vision for this project is the establish...,Beau's All Natural Brewing Company,RW,cad,2016-11-18T23:05:48-05:00,"Kigali, Rwanda",34,Kigali Province,The Rwanda Craft Brewery Project,Town,/projects/beaus/the-rwanda-craft-brewery-proje...


### Part a
Notice that the `end.time` column, the date and time at which the project stops accepting pledges, is formatted as follows:
```
2016-11-01T23:59:00-04:00
```
This format is "YYYY-MM-DDThh:mm:ss-TZD": four digits for the year, a dash, two digits for the month, another dash, and two digits for the day; a "T" that separates the date from the time; two digits each for the hour, minute, and second, separated by colons; and the time zone, expressed as the difference in hours from Greenwich Mean Time (also called UTC). For example, -04:00 is four hours behind UTC.

But `end.time` is also currently read as a string, with `object` data type:

In [427]:
kickstarter.dtypes

Unnamed: 0             int64
amt.pledged          float64
blurb                 object
by                    object
country               object
currency              object
end.time              object
location              object
percentage.funded      int64
state                 object
title                 object
type                  object
url                   object
dtype: object

In [428]:
# Convert "end.time" to datetime
kickstarter['timestamp'] = pd.to_datetime(kickstarter['end.time'], utc=True)
kickstarter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   Unnamed: 0         4000 non-null   int64              
 1   amt.pledged        4000 non-null   float64            
 2   blurb              4000 non-null   object             
 3   by                 4000 non-null   object             
 4   country            3999 non-null   object             
 5   currency           4000 non-null   object             
 6   end.time           4000 non-null   object             
 7   location           4000 non-null   object             
 8   percentage.funded  4000 non-null   int64              
 9   state              4000 non-null   object             
 10  title              4000 non-null   object             
 11  type               4000 non-null   object             
 12  url                4000 non-null   object       

Convert `end.time` to a timestamp, and extract the month, day, year, hour, minute, and second of the end time. To allow the `pd.to_datetime()` function to read timezones, use the `utc=True` argument. (2 points)

In [429]:
# Splitting datetime into respective columns
kickstarter['year'] = kickstarter.timestamp.dt.year
kickstarter['month'] = kickstarter.timestamp.dt.month
kickstarter['day'] = kickstarter.timestamp.dt.day
kickstarter['hour'] = kickstarter.timestamp.dt.hour
kickstarter['minute'] = kickstarter.timestamp.dt.minute
kickstarter['second'] = kickstarter.timestamp.dt.second
kickstarter.head(2)

Unnamed: 0.1,Unnamed: 0,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,title,type,url,timestamp,year,month,day,hour,minute,second
0,0,15823.0,"\n'Catalysts, Explorers & Secret Keepers: Wome...",Museum of Science Fiction,US,usd,2016-11-01T23:59:00-04:00,"Washington, DC",186,DC,"Catalysts, Explorers & Secret Keepers: Women o...",Town,/projects/1608905146/catalysts-explorers-and-s...,2016-11-02 03:59:00+00:00,2016,11,2,3,59,0
1,1,6859.0,\nA unique handmade picture book for kids & ar...,"Tyrone Wells & Broken Eagle, LLC",US,usd,2016-11-25T01:13:33-05:00,"Portland, OR",8,OR,The Whatamagump (a hand-crafted story picture ...,Town,/projects/thewhatamagump/the-whatamagump-a-han...,2016-11-25 06:13:33+00:00,2016,11,25,6,13,33
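
Note that the first row ends on November 1 in its local time zone, but the extracted day is 2. A minimal sketch of why, using two `end.time` strings copied from the rows above: `utc=True` shifts every value onto the UTC timeline before the components are extracted. The names `raw`, `parsed`, and `local_first` are made up for this illustration.

```python
import pandas as pd

# Two end.time strings copied from the rows displayed above
raw = pd.Series(["2016-11-01T23:59:00-04:00", "2016-11-25T01:13:33-05:00"])

# utc=True converts every offset-aware string onto a single UTC timeline
parsed = pd.to_datetime(raw, utc=True)
utc_days = parsed.dt.day.tolist()
utc_hours = parsed.dt.hour.tolist()

# Parsing one string on its own keeps its original -04:00 offset
local_first = pd.Timestamp(raw.iloc[0])

print(utc_days, utc_hours)                # components in UTC
print(local_first.day, local_first.hour)  # components in local time
```

So the year, month, day, and hour columns created above describe the end time in UTC, which can differ from the local calendar day for projects that end late in the evening.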


### Part b
Create a dataframe with one row for every ending day in the `kickstarter` data that reports the average amount pledged (`amt.pledged`) on each day. Sort the rows in descending order by average amount pledged, and display the five days with the highest averages. (2 points)

In [430]:
# Creating an extra column for the date to keep track of each ending day
kickstarter['date'] = kickstarter.timestamp.dt.date
kickstarter.head(2)

Unnamed: 0.1,Unnamed: 0,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,...,type,url,timestamp,year,month,day,hour,minute,second,date
0,0,15823.0,"\n'Catalysts, Explorers & Secret Keepers: Wome...",Museum of Science Fiction,US,usd,2016-11-01T23:59:00-04:00,"Washington, DC",186,DC,...,Town,/projects/1608905146/catalysts-explorers-and-s...,2016-11-02 03:59:00+00:00,2016,11,2,3,59,0,2016-11-02
1,1,6859.0,\nA unique handmade picture book for kids & ar...,"Tyrone Wells & Broken Eagle, LLC",US,usd,2016-11-25T01:13:33-05:00,"Portland, OR",8,OR,...,Town,/projects/thewhatamagump/the-whatamagump-a-han...,2016-11-25 06:13:33+00:00,2016,11,25,6,13,33,2016-11-25


In [443]:
# Group by date, average the amount pledged per day, and sort in descending order
# (.mean() takes no column name; the column is already selected with ['amt.pledged'])
ending_day = pd.DataFrame(
    kickstarter.groupby('date')['amt.pledged'].mean()).sort_values(
    by=['amt.pledged'], ascending=False)

In [444]:
# Reporting the top 5
ending_day.head(5)

date,amt.pledged
2016-12-14,47938.375
2016-11-04,26975.388889
2016-11-11,24990.669065
2016-12-17,22160.230769
2016-11-18,21016.234043
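
As an alternative, the same kind of table can probably be built without wrapping the result in `pd.DataFrame()`. This is a small sketch on made-up data (the `toy` dataframe and its values are invented for illustration), assuming that `as_index=False` is acceptable so that `date` stays a regular column:

```python
import pandas as pd

# Made-up rows: the first two fall on the same UTC date
toy = pd.DataFrame({
    "end.time": [
        "2016-11-01T23:59:00-04:00",
        "2016-11-02T10:00:00-04:00",
        "2016-11-25T01:13:33-05:00",
    ],
    "amt.pledged": [100.0, 300.0, 50.0],
})

# Same preparation as above: parse to UTC, then keep only the calendar date
toy["timestamp"] = pd.to_datetime(toy["end.time"], utc=True)
toy["date"] = toy["timestamp"].dt.date

# as_index=False returns a dataframe with 'date' as an ordinary column
daily_mean = (
    toy.groupby("date", as_index=False)["amt.pledged"]
    .mean()
    .sort_values(by="amt.pledged", ascending=False)
)
print(daily_mean.head(5))
```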


### Part c
Display the text of the longest `blurb` in the data. (2 points)

In [449]:
# First, create column to store length of the blurb strings
kickstarter['length'] = kickstarter['blurb'].str.len()

In [458]:
# Second, use .loc to select the row with the longest length, show the blurb only
kickstarter.loc[max(kickstarter['length'])]['blurb']

'\nA box of cool projects every month, shipped directly from our shop to your mailbox.  Open it, build it, take it apart.  Make something!\n'

In [459]:
# Check how long that string is and whether it is truly the largest value in the "length" column
len(kickstarter.loc[max(kickstarter['length'])]['blurb']) == kickstarter['length'].max()

True
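
One caution about the lookup above: `max(kickstarter['length'])` returns a length, and `.loc` treats that number as a row label rather than as "the row with the longest blurb." The check passes here, which suggests the row at that label happens to tie for the maximum length, but that is not guaranteed in general. A minimal sketch on made-up data (the `toy` dataframe is invented for illustration) uses `idxmax()`, which returns the label of the first row with the largest value:

```python
import pandas as pd

# Made-up blurbs with lengths 5, 22, and 10
toy = pd.DataFrame({"blurb": ["short", "the longest blurb here", "medium one"]})
lengths = toy["blurb"].str.len()

# idxmax() gives the index label of the first row with the maximum length
longest = toy.loc[lengths.idxmax(), "blurb"]

# If several blurbs tie for the maximum, a boolean mask returns all of them
ties = toy.loc[lengths == lengths.max(), "blurb"]

# Using the maximum length itself as a label fails here: there is no row 22
try:
    toy.loc[lengths.max()]
    label_lookup_worked = True
except KeyError:
    label_lookup_worked = False

print(longest)
print(label_lookup_worked)
```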

### Part d
How many blurbs for projects with end dates between November 15, 2016 and December 7, 2016 contain the phrase "science fiction"? [Hint: Don't forget to make this search case-insensitive and to sort the `kickstarter` dataframe by `end.time` before setting `end.time` as the index.] (2 points)

In [None]:
sorted_records = kickstarter.sort_values(by=['end.time'])
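
The remaining steps might look something like the following sketch, shown on made-up data rather than the real `kickstarter` dataframe. It makes three assumptions that are not in the original answer: the date window is interpreted in UTC, the end of the window is written as "before December 8" so that all of December 7 is included, and the parsed `timestamp` column (rather than the raw `end.time` string) is used as the index.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "end.time": [
        "2016-11-20T12:00:00-05:00",
        "2016-12-01T08:00:00-08:00",
        "2016-11-10T12:00:00-05:00",
        "2016-12-05T09:00:00-05:00",
        "2016-12-09T00:00:00-05:00",
    ],
    "blurb": [
        "A Science Fiction anthology",
        "Hard SCIENCE FICTION comics",
        "science fiction zine",
        "A cookbook",
        "Science fiction film",
    ],
})

# Parse to UTC, sort by time, and use the timestamp as the index
toy["timestamp"] = pd.to_datetime(toy["end.time"], utc=True)
toy = toy.sort_values(by="timestamp").set_index("timestamp")

# Time-zone-aware bounds: November 15 through the end of December 7 (UTC)
start = pd.Timestamp("2016-11-15", tz="UTC")
end = pd.Timestamp("2016-12-08", tz="UTC")
window = toy.loc[(toy.index >= start) & (toy.index < end)]

# case=False makes the search case-insensitive; na=False skips missing blurbs
matches = window["blurb"].str.contains("science fiction", case=False, na=False)
count = int(matches.sum())
print(count)
```

In this toy example, two of the three rows inside the window mention the phrase, and the two matching rows outside the window are excluded.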