## Getting started with Pandas
## Econ 148

This demo introduces students to pandas using an economics dataset. We will use data from the Economic Transformation Database (ETD), which provides internationally comparable sectoral data on employment and productivity in Africa, Asia, and Latin America. Feel free to explore the data further at https://www.wider.unu.edu/database/etd-economic-transformation-database.

Kruse, H., E. Mensah, K. Sen, and G. J. de Vries (2022). “A manufacturing renaissance? Industrialization trends in the developing world”, IMF Economic Review DOI: 10.1057/s41308-022-00183-7

License: The GGDC/UNU-WIDER Economic Transformation Database is licensed under a Creative Commons Attribution 4.0 International License.

In [None]:
import pandas as pd

In [None]:
ETDdf = pd.read_csv("ETD.csv")
ETDdf

We have successfully imported the data! Before we begin, however, we need to understand the data we are working with and clean it so that it is easy to analyze.

In [None]:
ETDdf.shape

In [None]:
# Number of years covered: 1990 through 2018 inclusive is 2018 - 1989 = 29
years = 2018 - 1989
years

In [None]:
num_country = ETDdf["country"].nunique()
num_country

In [None]:
num_vars = ETDdf["var"].nunique()
num_vars

In [None]:
rows = years*num_country*num_vars
rows

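The DataFrame has 4,437 rows and 23 columns. The row count should equal the product computed above (years × countries × variables) if every country reports every variable in every year. Notice that a few columns contain only 'NaN' values. Next, we explore which features we have.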
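One way to find the columns that hold only missing values is to combine `isna()` with `all()`. The sketch below uses a small made-up table rather than the ETD file; the column name `Unnamed: 3` is a hypothetical stand-in for an empty column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ETD table; "Unnamed: 3" is a hypothetical empty column.
toy = pd.DataFrame({
    "country": ["Cambodia", "Kenya", "Peru"],
    "Agriculture": [1.0, np.nan, 3.0],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
})

# isna().all() is True for a column only when every entry in it is missing.
all_nan = toy.isna().all()
empty_cols = all_nan[all_nan].index.tolist()

# dropna(axis="columns", how="all") removes exactly those columns
# and leaves columns with only some missing values in place.
cleaned = toy.dropna(axis="columns", how="all")
```

The same two calls should work on `ETDdf`, assuming its empty columns are read in as `NaN`.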

In [None]:
ETDdf["country"].unique()

In [None]:
ETDdf.columns

In [None]:
ETDdf.dtypes

In [None]:
ETDdf["Agriculture"]

### Oh no, we have a problem
If we look at the data as imported, there are commas in the numbers, so the values are showing up as strings.


In [None]:
ETDdf['Agriculture'] = ETDdf['Agriculture'].str.replace(',', '')
ETDdf['Agriculture'] = pd.to_numeric(ETDdf['Agriculture'])
ETDdf

In [None]:
ETDdf["Agriculture"]
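An alternative to cleaning the column after loading is to tell `read_csv` about the separator up front with its `thousands` parameter. The sketch below parses a small made-up CSV string, not the ETD file; whether `ETD.csv` parses cleanly this way is an assumption that would need checking.

```python
import io
import pandas as pd

# Made-up CSV text in the same style as the problem described above.
raw = io.StringIO(
    'country,year,Agriculture\n'
    'A,1990,"1,234.5"\n'
    'B,1990,"987.0"\n'
)

# thousands="," tells the parser to drop the separator before converting.
toy = pd.read_csv(raw, thousands=",")
```

With this option the column arrives as numbers, so no `str.replace` step is needed afterwards.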

In [None]:
ETDdf[ETDdf["country"]== 'Cambodia']

In [None]:
ETDdf.loc[ETDdf["cnt"]=="KHM"]

In [None]:
ETDdf.loc[ETDdf["cnt"]=="KHM",:]

In [None]:
ETDdf.iloc[522:608,:]

In [None]:
ETDdf.iloc[522:609,:]
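The two `.iloc` slices above differ by one row because positional slicing excludes the end position, as in ordinary Python slicing, whereas label slicing with `.loc` includes the end label. A minimal sketch on a made-up ten-row frame (the `value` column is hypothetical):

```python
import pandas as pd

# Ten rows with the default integer index 0..9.
toy = pd.DataFrame({"value": range(10, 20)})

by_position = toy.iloc[2:5]  # positions 2, 3, 4 (end excluded)
by_label = toy.loc[2:5]      # labels 2, 3, 4, 5 (end included)
```

With the default index, labels and positions coincide, so the only difference is whether the endpoint is kept.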

In [None]:
KHMdf = ETDdf[ETDdf["country"] == 'Cambodia']

In [None]:
KHMdf

In [None]:
KHM2018df = KHMdf[KHMdf['year'] == 2018]
KHM2018df
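The two-step filter above (first country, then year) can also be written as a single boolean mask. This is a sketch on made-up data; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Cambodia", "Cambodia", "Kenya", "Kenya"],
    "year": [2017, 2018, 2017, 2018],
    "Agriculture": [1.0, 2.0, 3.0, 4.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["country"] == "Cambodia") & (toy["year"] == 2018)
subset = toy.loc[mask]
```

Keeping the intermediate frame, as the cells above do, is handy when the country subset is reused; the combined mask is shorter when only the final rows are needed.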

### Example of changing the index to year
Remember that there are three `var` values per country per year in the full dataset, so `year` alone does not uniquely identify a row.

In [None]:
ETDdf.set_index("year", inplace=True)
ETDdf
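Because several rows share each year, the resulting index has repeated labels. The sketch below shows what that means for lookups and one possible way to get unique labels with a multi-level index. The data are made up, and the `var` codes `VA` and `EMP` are used here only as hypothetical labels.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Cambodia"] * 4,
    "var": ["VA", "EMP", "VA", "EMP"],
    "year": [1990, 1990, 1991, 1991],
    "Total": [10.0, 20.0, 11.0, 21.0],
})

# Without inplace=True, set_index returns a new frame and leaves toy unchanged.
by_year = toy.set_index("year")
rows_1990 = by_year.loc[1990]  # a DataFrame, since the label 1990 matches two rows

# Indexing on all identifying columns makes every label unique.
multi = toy.set_index(["country", "var", "year"]).sort_index()
one_value = multi.loc[("Cambodia", "VA", 1991), "Total"]
```

Returning a new frame instead of using `inplace=True` also keeps the original column layout available for later cells.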

### Example of **destringing** the whole dataset
Remember that all the monetary variables are stored as strings. We can copy and paste their names from the list of variables.

Let's bring in the **raw** dataset again.

In [None]:
ETDdf = pd.read_csv("ETD.csv")
ETDdf

In [None]:
ETDdf.dtypes

In [None]:
columns_to_convert = [
    'Agriculture', 'Mining', 'Manufacturing', 'Utilities', 'Construction',
    'Trade services', 'Transport services', 'Business services',
    'Financial services', 'Real estate', 'Government services',
    'Other services', 'Total',
]

In [None]:
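for column in columns_to_convert:
    # Remove the comma separators so the text looks like a plain number
    ETDdf[column] = ETDdf[column].str.replace(',', '')
    # Convert to numeric; errors='coerce' turns anything unparseable into NaN
    ETDdf[column] = pd.to_numeric(ETDdf[column], errors='coerce')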

In [None]:
ETDdf

In [None]:
ETDdf.dtypes
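The loop above can also be expressed as one `apply` over the selected columns. The sketch below runs on a made-up frame; the helper name `destring` and the `"n/a"` entry are hypothetical, chosen to show what `errors="coerce"` does with text that is not a number.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "Agriculture": ["1,234.5", "67.0", "n/a"],
    "Mining": ["10", "2,000", "3"],
})
cols = ["Agriculture", "Mining"]

def destring(col):
    """Strip commas, then convert; unparseable entries become NaN."""
    cleaned = col.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# apply calls destring once per selected column.
toy[cols] = toy[cols].apply(destring)
```

Because coercion silently produces `NaN`, it is worth comparing the count of missing values before and after the conversion.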

In [None]:
ETDdf.loc[0:5]

In [None]:
ETDdf.head()

In [None]:
ETDdf.tail()

In [None]:
ETDdf[-5:]

In [None]:
ETDdf.loc[0:4,"Agriculture":"Manufacturing"]

In [None]:
ETDdf.loc[100:103,"country":"Manufacturing"]

In [None]:
ETDdf.loc[100:103,"Manufacturing"]

In [None]:
ETDdf.loc[100,"Manufacturing"]

In [None]:
ETDdf.loc[:,["country","var","Manufacturing"]]
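The `.loc` cells above return different kinds of objects depending on how the rows and columns are specified. A minimal sketch on made-up data (the values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "Manufacturing": [1.0, 2.0, 3.0],
})

cell = toy.loc[1, "Manufacturing"]                # one label, one column: a scalar
column_slice = toy.loc[0:1, "Manufacturing"]      # label slice, one column: a Series
table = toy.loc[:, ["country", "Manufacturing"]]  # list of columns: a DataFrame
```

Passing a list of columns, even a list with one name, keeps the result a DataFrame.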

In [None]:
ETDdf.iloc[[1, 2, 3], [0, 1, 2]]


In [None]:
ETDdf.iloc[1:4, 0:3]

In [None]:
ETDdf.iloc[1:4, 4]

In [None]:
ETDdf.iloc[:, 0:4]

In [None]:
ETDdf[522:608]

In [None]:
ETDdf[["year","country", "var","Manufacturing"]]