## Getting started with Pandas
## Econ 148

This demo introduces pandas using an economics dataset. We will use data from the Economic Transformation Database (ETD), which provides internationally comparable sectoral data on employment and productivity in Africa, Asia, and Latin America. Feel free to explore the data further at https://www.wider.unu.edu/database/etd-economic-transformation-database.

Kruse, H., E. Mensah, K. Sen, and G. J. de Vries (2022). “A manufacturing renaissance? Industrialization trends in the developing world”, IMF Economic Review. DOI: 10.1057/s41308-022-00183-7

License: The GGDC/UNU-WIDER Economic Transformation Database is licensed under a Creative Commons Attribution 4.0 International License.

In [50]:
import pandas as pd
# pd.read_excel needs openpyxl to read .xlsx files; uncomment the next line to install it
#!pip install openpyxl


In [None]:
dataverse_url = "https://dataverse.nl/api/access/datafile/382704"
sheet_name = "Data"
ETDdf_XL = pd.read_excel(dataverse_url, sheet_name=sheet_name)
ETDdf_XL

In [None]:

github_url="https://raw.githubusercontent.com/UCB-Econ-148/econ148-sp24/refs/heads/main/lec/Lec2.1/ETD.csv"
ETDdf_GH = pd.read_csv(github_url, thousands=',')  # the CSV writes numbers with comma thousands separators, e.g. "1,234"
ETDdf_GH 
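
To see why the `thousands=','` argument matters, here is a minimal sketch on a made-up two-row CSV (the country name and values are hypothetical, not taken from the ETD). Without the argument, pandas keeps the column as text; with it, the commas are stripped and the column is parsed as numbers.

```python
import io

import pandas as pd

# Hypothetical CSV text; the quoted numbers use commas as thousands separators.
csv_text = 'country,year,Agriculture\nToyland,2017,"1,234.5"\nToyland,2018,"2,000"\n'

# Default parsing: "1,234.5" is not recognized as a number, so the column stays text.
raw = pd.read_csv(io.StringIO(csv_text))

# With thousands=',' the separators are removed before numeric conversion.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["Agriculture"].dtype)
print(parsed["Agriculture"].dtype)
```

Checking `.dtypes` right after loading is a quick way to catch this kind of problem before doing any arithmetic on the columns.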

In [None]:
ETDdf = pd.read_csv("ETD.csv", thousands=',')  # the CSV writes numbers with comma thousands separators, e.g. "1,234"
ETDdf

We have successfully imported the data! Before we begin, however, we need to understand the data we are working with and adjust it so that it is convenient to analyze.

In [None]:
ETDdf.shape

In [None]:
years = 2018 - 1989  # number of years covered if the data run from 1990 through 2018
years

In [None]:
num_country = ETDdf["country"].nunique()
num_country

In [None]:
num_vars= ETDdf["var"].nunique()
num_vars

In [None]:
rows = years * num_country * num_vars  # expected row count if every country reports every variable in every year
rows

We see that there are 4,437 rows and 23 columns. Notice, however, that a few columns contain only `NaN` values. Next, we explore which features we have.
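
One way to locate columns that contain only `NaN` values is to combine `isna()` with `all()`. The sketch below uses a small made-up DataFrame (the column name `empty_col` is hypothetical); the same two calls should work on `ETDdf`.

```python
import numpy as np
import pandas as pd

# Toy data: "empty_col" stands in for a column that holds only missing values.
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "Agriculture": [1.0, 2.0, 3.0],
    "empty_col": [np.nan, np.nan, np.nan],
})

# isna() marks missing cells; all() is True for columns where every cell is missing.
all_nan_columns = toy.columns[toy.isna().all()].tolist()
print(all_nan_columns)

# dropna with how="all" removes only the columns that are entirely missing.
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.shape)
```

Using `how="all"` rather than the default `how="any"` keeps columns that have only a few missing entries.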

In [None]:
ETDdf["country"].unique()

In [None]:
ETDdf.columns

In [None]:
ETDdf.dtypes

In [None]:
ETDdf["Agriculture"]

In [None]:
ETDdf[ETDdf["country"]== 'Cambodia']

In [None]:
ETDdf.loc[ETDdf["cnt"]=="KHM"]

In [None]:
ETDdf.loc[ETDdf["cnt"]=="KHM",:]

In [None]:
ETDdf.iloc[522:608,:]

In [None]:
ETDdf.iloc[522:609,:]
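
The two `iloc` cells above differ by one row because position-based slicing excludes the stop position, whereas label-based slicing with `loc` includes the stop label. A minimal sketch on a made-up five-row DataFrame with the default integer index:

```python
import pandas as pd

# Toy data with the default index 0..4, so labels and positions coincide.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# iloc slices by position and excludes the stop: positions 1 and 2.
by_position = toy.iloc[1:3]

# loc slices by label and includes the stop: labels 1, 2 and 3.
by_label = toy.loc[1:3]

print(len(by_position), len(by_label))
```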

## Now let's make a new DataFrame for Cambodia

In [68]:
KMHdf=ETDdf[ETDdf["country"]== 'Cambodia']

You could try making a DataFrame for a different country:


In [69]:
my_df=ETDdf[ETDdf["country"]== '...']

In [None]:
KMHdf

## How about a DataFrame for just Cambodia in 2018?

In [None]:
KMH2018df=KMHdf[KMHdf['year']==2018]
KMH2018df
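
The two-step filter above can also be written as a single boolean mask. This is a sketch on made-up rows (the values are hypothetical); note that each condition needs its own parentheses and that pandas uses `&` rather than `and` to combine them.

```python
import pandas as pd

# Toy rows that mimic the country/year layout of the ETD.
toy = pd.DataFrame({
    "country": ["Cambodia", "Cambodia", "Bangladesh", "Bangladesh"],
    "year": [2017, 2018, 2017, 2018],
    "Manufacturing": [1.0, 2.0, 3.0, 4.0],
})

# Combine both conditions element-wise; parentheses are required around each one.
mask = (toy["country"] == "Cambodia") & (toy["year"] == 2018)
subset = toy[mask]
print(subset)
```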

In [None]:
ETDdf.head()

In [None]:
ETDdf.tail()

In [None]:
ETDdf[-5:]

In [None]:
ETDdf.loc[0:4,"Agriculture":"Manufacturing"]

In [None]:
ETDdf.loc[100:103,"country":"Manufacturing"]

In [None]:
ETDdf.loc[100:103,"Manufacturing"]

In [None]:
ETDdf.loc[100,"Manufacturing"]

In [None]:
ETDdf.loc[:,["country","var","Manufacturing"]]

In [None]:
ETDdf.iloc[[1, 2, 3], [0, 1, 2]]


In [None]:
ETDdf.iloc[1:4, 0:3]

In [None]:
ETDdf.iloc[1:4, 4]

In [None]:
ETDdf.iloc[:, 0:4]

In [None]:
ETDdf[522:608]

## In what year did Bangladesh employ more people in manufacturing than in agriculture?


In [None]:
Bang_df=ETDdf[ETDdf["country"]=="Bangladesh"]
Bang_df

In [None]:
Bang_ManAG_df = Bang_df[["year","country", "var","Manufacturing", "Agriculture"]]
Bang_ManAG_df

In [None]:
# Subset to the rows where var == "EMP" (employment)
Bang_ManAGEMP_df = Bang_ManAG_df[Bang_ManAG_df["var"]=="EMP"]
Bang_ManAGEMP_df 
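
Besides reading the answer off a plot, the question in the heading can be checked directly by comparing the two columns row by row. The sketch below uses made-up employment figures, not ETD values; with the real data the same comparison could be applied to `Bang_ManAGEMP_df`, and the result may well be empty if manufacturing never overtakes agriculture.

```python
import pandas as pd

# Hypothetical employment series in which manufacturing overtakes agriculture.
toy_emp = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "Manufacturing": [5.0, 7.0, 9.0, 11.0],
    "Agriculture": [10.0, 9.0, 8.0, 7.0],
})

# Keep only the years in which manufacturing employment exceeds agriculture.
crossed = toy_emp[toy_emp["Manufacturing"] > toy_emp["Agriculture"]]

# Guard against the case where the condition is never met.
first_year = crossed["year"].min() if not crossed.empty else None
print(first_year)
```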



In [None]:
# Subset to the rows where var == "VA_Q15" (value added)
Bang_ManAGVA_df = Bang_ManAG_df[Bang_ManAG_df["var"]=="VA_Q15"]
Bang_ManAGVA_df

In [89]:
import matplotlib.pyplot as plt


In [None]:
plot = Bang_ManAGEMP_df.plot(x='year', y=['Manufacturing', 'Agriculture'], kind='line')
plot.set_ylabel('Employment (thousands of persons)')  # assumes EMP is reported in thousands; check the ETD documentation


In [None]:
# Plot value added; the "Q15" suffix suggests constant 2015 prices (check the ETD documentation for units)
plot = Bang_ManAGVA_df.plot(x='year', y=['Manufacturing', 'Agriculture'], kind='line')
plot.set_ylabel('Value added (constant 2015 prices)')


In [None]:
# A small DataFrame built from a dictionary of columns
df = pd.DataFrame({
    'Name': ['Grace', 'Alan', 'Aarthi'],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})
df

In [None]:
df.loc[1, "City"]
