# Goal of this practice

In this practice, we will learn several pandas DataFrame operations for transforming data. We will use World Bank data that gives yearly GDP per country since 1960. Many tasks can be formulated with this data; for now we will focus on inspecting it and computing basic statistics.

# Import pandas

In [2]:
import pandas as pd

data_path = "../resources/gdp.csv"


# Load the data

Before loading a data file, it helps to look at its raw contents. If the file is small, open it in a simple text editor. If it is large, use the `head` or `less` commands in a **terminal** to read the first few lines. This shows whether some rows need to be skipped in order to load the data. For example, the following code throws an error because the file is not in the format `read_csv` expects.


In [4]:
gdp_df = pd.read_csv(data_path)

ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 65


Let's open the terminal and change to `Module3/resources`. If we execute `head -n 5 gdp.csv` (i.e., read the first 5 lines), we will see that the first 4 lines are metadata, so we need to skip these rows.

In [5]:
gdp_df = pd.read_csv(data_path, skiprows=4)
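The same check can be done from Python. The sketch below is only an illustration: the `raw` string is a made-up miniature file (four preamble lines followed by a small table), not the real `gdp.csv`, and the names `raw`, `preview`, and `toy_df` are hypothetical.

```python
import io
from itertools import islice

import pandas as pd

# Hypothetical toy file: 4 preamble lines, then a header row and two data rows.
raw = (
    "Data Source,World Development Indicators\n"
    "Last Updated Date,2020-01-01\n"
    "Note,toy file for illustration\n"
    "Note,values are made up\n"
    "Country Name,Country Code,1960,1961\n"
    "Aruba,ABW,,\n"
    "Afghanistan,AFG,100.0,110.0\n"
)

# Peek at the first 5 lines, much like `head -n 5` in a terminal.
preview = list(islice(io.StringIO(raw), 5))
for line in preview:
    print(line.rstrip())

# Skip the 4 preamble lines so the 5th line is used as the header.
toy_df = pd.read_csv(io.StringIO(raw), skiprows=4)
print(toy_df.columns.tolist())
```

With a real file, `open(data_path)` would take the place of `io.StringIO(raw)` in the preview step.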

## Inspect the data frame

In [6]:
# write your code here
gdp_df = pd.read_csv(data_path, skiprows=4)
# Task: print the first two rows of data
print(gdp_df.head(2))
# Task: what is the shape of the data?
print(gdp_df.shape)
# Task: print the column names
print(gdp_df.columns)
# Task: show the data types
print(gdp_df.dtypes)

  Country Name Country Code     Indicator Name  Indicator Code          1960  \
0        Aruba          ABW  GDP (current US$)  NY.GDP.MKTP.CD           NaN   
1  Afghanistan          AFG  GDP (current US$)  NY.GDP.MKTP.CD  5.377778e+08   

           1961          1962          1963          1964          1965  ...  \
0           NaN           NaN           NaN           NaN           NaN  ...   
1  5.488889e+08  5.466667e+08  7.511112e+08  8.000000e+08  1.006667e+09  ...   

           2011          2012          2013          2014          2015  \
0  2.549721e+09  2.534637e+09  2.581564e+09  2.649721e+09  2.691620e+09   
1  1.780428e+10  2.000162e+10  2.056105e+10  2.048487e+10  1.990711e+10   

           2016          2017          2018  2019  Unnamed: 64  
0  2.646927e+09  2.700559e+09           NaN   NaN          NaN  
1  1.936264e+10  2.019176e+10  1.936297e+10   NaN          NaN  

[2 rows x 65 columns]
(264, 65)
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indica

# Remove unnecessary columns

In [9]:
# TODO: drop 'Country Code', 'Indicator Name', 'Indicator Code' columns with drop function
gdp_transpose = gdp_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code'])

## Get country wise stats

The pandas `describe` method computes summary statistics per column. Right now each country is a row, so we need to transpose the data so that each country becomes a column.

In [14]:
# Set the Index on Country Name - previously was numeric.
gdp_transpose = gdp_df.set_index('Country Name')

# TODO: Transpose the index (i.e. country) and columns (year)
gdp_transpose = gdp_transpose.transpose()

In [15]:
# apply the describe function
gdp_transpose.describe()

Country Name,Aruba,Afghanistan,Angola,Albania,Andorra,Arab World,United Arab Emirates,Argentina,Armenia,American Samoa,...,Virgin Islands (U.S.),Vietnam,Vanuatu,World,Samoa,Kosovo,"Yemen, Rep.",South Africa,Zambia,Zimbabwe
count,35.0,42.0,42.0,38.0,52.0,54.0,47.0,60.0,32.0,20,...,19.0,37.0,43.0,62.0,40,22.0,32.0,62.0,62.0,62.0
unique,35.0,42.0,41.0,38.0,52.0,54.0,47.0,60.0,32.0,20,...,19.0,37.0,43.0,62.0,40,22.0,32.0,62.0,62.0,62.0
top,2646927000.0,548888900.0,5550483000.0,11386930000.0,3660531000.0,1185227000000.0,33943610000.0,31256280000.0,12433090000.0,GDP (current US$),...,4439000000.0,28683660000.0,914301100.0,1650922000000.0,WSM,2535334000.0,16746340000.0,152587400000.0,3872667000.0,19091020000.0
freq,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1,...,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0
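The `count`/`unique`/`top`/`freq` rows above are what `describe` reports for object-dtype columns. This happens when text rows such as the country code or indicator name are still present after transposing, as the `top` values `GDP (current US$)` and `WSM` suggest. The following is a minimal sketch on a made-up frame (the names `toy`, `mixed`, and `numeric` and all values are hypothetical) that contrasts the two cases.

```python
import pandas as pd

# Hypothetical toy data: one text column next to three year columns.
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "2000": [1.0, 10.0],
    "2001": [2.0, 20.0],
    "2002": [3.0, 30.0],
})

# Transposing with the text column still present mixes strings and floats,
# so every transposed column ends up with object dtype.
mixed = toy.set_index("Country Name").transpose()
print(mixed.describe())    # count / unique / top / freq

# Dropping the text column first keeps every transposed column numeric.
numeric = (
    toy.drop(columns=["Country Code"])
       .set_index("Country Name")
       .transpose()
)
print(numeric.describe())  # count / mean / std / min / quartiles / max
```

Applied to the notebook, this would mean calling `set_index` on the frame returned by `drop` rather than on `gdp_df`.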


## Repeat the above exercise. This time only focus on the last 10 years of data. 


In [19]:
gdp_df_transpose = gdp_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code'])

gdp_df_transpose.head()

gdp_df_transpose = gdp_df_transpose.set_index('Country Name')

gdp_last_10years_df = gdp_df_transpose.iloc[:, -10:]

gdp_last_10years_df.head()

Country Name,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 64
Aruba,2549721000.0,2534637000.0,2581564000.0,2649721000.0,2691620000.0,2646927000.0,2700559000.0,,,
Afghanistan,17804280000.0,20001620000.0,20561050000.0,20484870000.0,19907110000.0,19362640000.0,20191760000.0,19362970000.0,,
Angola,111789700000.0,128052900000.0,136709900000.0,145712200000.0,116193600000.0,101123900000.0,122123800000.0,105751000000.0,,
Albania,12890870000.0,12319780000.0,12776280000.0,13228240000.0,11386930000.0,11861350000.0,13025060000.0,15102500000.0,,
Andorra,3442063000.0,3164615000.0,3281585000.0,3350736000.0,2811489000.0,2877312000.0,3013387000.0,3236544000.0,,
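In the output above, the last of the ten selected columns is `Unnamed: 64`, which holds no values in the rows shown, so the slice covers nine year columns plus one empty column. One possible way to handle this, sketched here on made-up data (the names `toy`, `cleaned`, and `last_2_years` are hypothetical), is to drop columns that are entirely missing before slicing by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a trailing column that contains no values at all.
toy = pd.DataFrame(
    {
        "2017": [1.0, 4.0],
        "2018": [2.0, 5.0],
        "2019": [3.0, np.nan],
        "Unnamed: 4": [np.nan, np.nan],
    },
    index=pd.Index(["A", "B"], name="Country Name"),
)

# Remove columns in which every value is missing, then take the trailing years.
cleaned = toy.dropna(axis=1, how="all")
last_2_years = cleaned.iloc[:, -2:]
print(last_2_years.columns.tolist())
```

Note that `how="all"` would also drop a year column for which no country has a value yet, so it is worth checking which columns remain before slicing.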


## Extract the following information from this data

In [None]:
# Task: which country has the maximum GDP in 2011?

# Task: get the avg GDP of France from the year 2001 to 2010

# Task: show country-wise percent change in GDP with the `pct_change()` function
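
The calls these tasks rely on can be tried out on a small frame first. The sketch below uses made-up countries and values (the names `toy`, `top_2011`, `avg_a`, and `growth` are hypothetical) and assumes the same layout as above: countries in the index, years as string column labels.

```python
import pandas as pd

# Hypothetical toy data: countries as rows, years as string column labels.
toy = pd.DataFrame(
    {"2009": [100.0, 50.0], "2010": [110.0, 40.0], "2011": [121.0, 60.0]},
    index=pd.Index(["A", "B"], name="Country Name"),
)

# Index label of the row with the largest value in one year.
top_2011 = toy["2011"].idxmax()

# Mean over a label-based range of year columns for one country
# (label slices with .loc include both endpoints).
avg_a = toy.loc["A", "2009":"2010"].mean()

# Year-over-year percent change computed along the columns of each row.
growth = toy.pct_change(axis="columns")

print(top_2011)
print(avg_a)
print(growth)
```

The first year column of `growth` is `NaN` because it has no previous year to compare against.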
