# Introduction to Pandas and Dataframes

**Pandas** is a popular Python library for data analysis. It works with **DataFrames**, which you can think of as "Excel sheets as variables in Python".

In [2]:
import pandas as pd # because we don't want to type 'pandas', only 'pd'. Think of all the hours of typing saved!

## Importing data

Now, let's import some data! We do this using the `read_csv` function from the pandas library. Since we imported pandas as `pd`, we call it with `pd.read_csv`. We tell it one thing: the location of the file to be read.

Let's get started by importing a CSV file, which is a plain-text version of a data table. You should have some sample CSVs on your machine already. Double-check that they're in the **same folder as this notebook**.

Some sample CSV files can be found in our server's `Resources` folder. To use them, you'll need to go to the homepage, download the CSV files from the `Resources` folder, and **re-upload to the same folder as this notebook**.

Once you're ready, run the next cell to store all the data from the CSV in a single variable:

In [3]:
# Make a DataFrame called data.
# This DataFrame contains all the data from the CSV file.
data = pd.read_csv("gdp_asia.csv")

In [None]:
data

Printing is a bit unwieldy because of all the data. To read just a bit of info from the beginning, which lets you check if everything imported correctly, use `.head()`:

In [7]:
data.head()
#data.head(10) # Shows the first 10 rows of data.

Unnamed: 0,country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
0,Afghanistan,779.445314,820.85303,853.10071,836.197138,739.981106,786.11336,978.011439,852.395945,649.341395,635.341351,726.734055,974.580338
1,Bahrain,9867.084765,11635.79945,12753.27514,14804.6727,18268.65839,19340.10196,19211.14731,18524.02406,19035.57917,20292.01679,23403.55927,29796.04834
2,Bangladesh,684.244172,661.637458,686.341554,721.186086,630.233627,659.877232,676.981866,751.979403,837.810164,972.770035,1136.39043,1391.253792
3,Cambodia,368.469286,434.038336,496.913648,523.432314,421.624026,524.972183,624.475478,683.895573,682.303175,734.28517,896.226015,1713.778686
4,China,400.448611,575.987001,487.674018,612.705693,676.900092,741.23747,962.421381,1378.904018,1655.784158,2289.234136,3119.280896,4959.114854


Notice that the column headings are each of the columns in the CSV file, and the row headings are just numbers. This is OK, but not quite right--we want each row heading to be the country. In pandas terminology, we want to **set the country column as the index**. To do this, we re-import, and specify an `index_col`:

In [8]:
data = pd.read_csv("gdp_asia.csv", index_col = "country")

In [9]:
data.head(10)

Unnamed: 0_level_0,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Afghanistan,779.445314,820.85303,853.10071,836.197138,739.981106,786.11336,978.011439,852.395945,649.341395,635.341351,726.734055,974.580338
Bahrain,9867.084765,11635.79945,12753.27514,14804.6727,18268.65839,19340.10196,19211.14731,18524.02406,19035.57917,20292.01679,23403.55927,29796.04834
Bangladesh,684.244172,661.637458,686.341554,721.186086,630.233627,659.877232,676.981866,751.979403,837.810164,972.770035,1136.39043,1391.253792
Cambodia,368.469286,434.038336,496.913648,523.432314,421.624026,524.972183,624.475478,683.895573,682.303175,734.28517,896.226015,1713.778686
China,400.448611,575.987001,487.674018,612.705693,676.900092,741.23747,962.421381,1378.904018,1655.784158,2289.234136,3119.280896,4959.114854
Hong Kong China,3054.421209,3629.076457,4692.648272,6197.962814,8315.928145,11186.14125,14560.53051,20038.47269,24757.60301,28377.63219,30209.01516,39724.97867
India,546.565749,590.061996,658.347151,700.770611,724.032527,813.337323,855.723538,976.512676,1164.406809,1458.817442,1746.769454,2452.210407
Indonesia,749.681655,858.900271,849.28977,762.431772,1111.107907,1382.702056,1516.872988,1748.356961,2383.140898,3119.335603,2873.91287,3540.651564
Iran,3035.326002,3290.257643,4187.329802,5906.731805,9613.818607,11888.59508,7608.334602,6642.881371,7235.653188,8263.590301,9240.761975,11605.71449
Iraq,4129.766056,6229.333562,8341.737815,8931.459811,9576.037596,14688.23507,14517.90711,11643.57268,3745.640687,3076.239795,4390.717312,4471.061906


Now, the first two rows look a bit weird, but that's pandas' way of telling us that "country" is an index column. You can have multiple index columns, but that's beyond the scope of this class.
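
To see what `index_col` does without needing a file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The `io.StringIO` wrapper and the two-row sample are assumptions for illustration only (the values are copied from the output above). It also shows that calling `set_index` on an already-loaded DataFrame should give the same result.

```python
import io
import pandas as pd

# A two-row stand-in for gdp_asia.csv, held in memory for illustration only.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Afghanistan,779.445314,820.85303
Bahrain,9867.084765,11635.79945
"""

# Option 1: set the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col="country")

# Option 2: read first, then move the "country" column into the index.
plain = pd.read_csv(io.StringIO(csv_text))
indexed_later = plain.set_index("country")

print(indexed.index.tolist())
print(indexed.equals(indexed_later))
```

Either route works; `index_col` just saves a step when you already know which column should label the rows.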

<hr>

## Some DataFrame operations

Below are some useful methods and attributes of DataFrames. Try them out, and see what they do!

In [None]:
data.describe()

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33 entries, Afghanistan to Yemen Rep.
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gdpPercap_1952  33 non-null     float64
 1   gdpPercap_1957  33 non-null     float64
 2   gdpPercap_1962  33 non-null     float64
 3   gdpPercap_1967  33 non-null     float64
 4   gdpPercap_1972  33 non-null     float64
 5   gdpPercap_1977  33 non-null     float64
 6   gdpPercap_1982  33 non-null     float64
 7   gdpPercap_1987  33 non-null     float64
 8   gdpPercap_1992  33 non-null     float64
 9   gdpPercap_1997  33 non-null     float64
 10  gdpPercap_2002  33 non-null     float64
 11  gdpPercap_2007  33 non-null     float64
dtypes: float64(12)
memory usage: 3.4+ KB


In [12]:
data.mean()

gdpPercap_1952     5195.484004
gdpPercap_1957     5787.732940
gdpPercap_1962     5729.369625
gdpPercap_1967     5971.173374
gdpPercap_1972     8187.468699
gdpPercap_1977     7791.314020
gdpPercap_1982     7434.135157
gdpPercap_1987     7608.226508
gdpPercap_1992     8639.690248
gdpPercap_1997     9834.093295
gdpPercap_2002    10174.090397
gdpPercap_2007    12473.026870
dtype: float64

In [13]:
data.max()

gdpPercap_1952    108382.35290
gdpPercap_1957    113523.13290
gdpPercap_1962     95458.11176
gdpPercap_1967     80894.88326
gdpPercap_1972    109347.86700
gdpPercap_1977     59265.47714
gdpPercap_1982     33693.17525
gdpPercap_1987     28118.42998
gdpPercap_1992     34932.91959
gdpPercap_1997     40300.61996
gdpPercap_2002     36023.10540
gdpPercap_2007     47306.98978
dtype: float64
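
`max()` tells us the largest value in each column, but not which country it belongs to. As a small sketch (using a three-row stand-in DataFrame called `sample`, which is an assumption for illustration with values copied from the tables above), `idxmax()` returns the row label instead, and `axis=1` makes a calculation run across each row rather than down each column.

```python
import pandas as pd

# Small stand-in for the Asia data (values copied from the tables above).
sample = pd.DataFrame(
    {
        "gdpPercap_1952": [779.445314, 9867.084765, 400.448611],
        "gdpPercap_2007": [974.580338, 29796.04834, 4959.114854],
    },
    index=pd.Index(["Afghanistan", "Bahrain", "China"], name="country"),
)

# max() gives the largest value in each column...
print(sample.max())

# ...while idxmax() gives the row label (country) where that value occurs.
print(sample.idxmax())

# axis=1 switches direction: one result per row instead of per column.
print(sample.mean(axis=1))
```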

In [14]:
data.columns

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')

In [15]:
data.index

Index(['Afghanistan', 'Bahrain', 'Bangladesh', 'Cambodia', 'China',
       'Hong Kong China', 'India', 'Indonesia', 'Iran', 'Iraq', 'Israel',
       'Japan', 'Jordan', 'Korea Dem. Rep.', 'Korea Rep.', 'Kuwait', 'Lebanon',
       'Malaysia', 'Mongolia', 'Myanmar', 'Nepal', 'Oman', 'Pakistan',
       'Philippines', 'Saudi Arabia', 'Singapore', 'Sri Lanka', 'Syria',
       'Taiwan', 'Thailand', 'Vietnam', 'West Bank and Gaza', 'Yemen Rep.'],
      dtype='object', name='country')

In [16]:
data.T

country,Afghanistan,Bahrain,Bangladesh,Cambodia,China,Hong Kong China,India,Indonesia,Iran,Iraq,...,Philippines,Saudi Arabia,Singapore,Sri Lanka,Syria,Taiwan,Thailand,Vietnam,West Bank and Gaza,Yemen Rep.
gdpPercap_1952,779.445314,9867.084765,684.244172,368.469286,400.448611,3054.421209,546.565749,749.681655,3035.326002,4129.766056,...,1272.880995,6459.554823,2315.138227,1083.53203,1643.485354,1206.947913,757.797418,605.066492,1515.592329,781.717576
gdpPercap_1957,820.85303,11635.79945,661.637458,434.038336,575.987001,3629.076457,590.061996,858.900271,3290.257643,6229.333562,...,1547.944844,8157.591248,2843.104409,1072.546602,2117.234893,1507.86129,793.577415,676.285448,1827.067742,804.830455
gdpPercap_1962,853.10071,12753.27514,686.341554,496.913648,487.674018,4692.648272,658.347151,849.28977,4187.329802,8341.737815,...,1649.552153,11626.41975,3674.735572,1074.47196,2193.037133,1822.879028,1002.199172,772.04916,2198.956312,825.623201
gdpPercap_1967,836.197138,14804.6727,721.186086,523.432314,612.705693,6197.962814,700.770611,762.431772,5906.731805,8931.459811,...,1814.12743,16903.04886,4977.41854,1135.514326,1881.923632,2643.858681,1295.46066,637.123289,2649.715007,862.442146
gdpPercap_1972,739.981106,18268.65839,630.233627,421.624026,676.900092,8315.928145,724.032527,1111.107907,9613.818607,9576.037596,...,1989.37407,24837.42865,8597.756202,1213.39553,2571.423014,4062.523897,1524.358936,699.501644,3133.409277,1265.047031
gdpPercap_1977,786.11336,19340.10196,659.877232,524.972183,741.23747,11186.14125,813.337323,1382.702056,11888.59508,14688.23507,...,2373.204287,34167.7626,11210.08948,1348.775651,3195.484582,5596.519826,1961.224635,713.53712,3682.831494,1829.765177
gdpPercap_1982,978.011439,19211.14731,676.981866,624.475478,962.421381,14560.53051,855.723538,1516.872988,7608.334602,14517.90711,...,2603.273765,33693.17525,15169.16112,1648.079789,3761.837715,7426.354774,2393.219781,707.235786,4336.032082,1977.55701
gdpPercap_1987,852.395945,18524.02406,751.979403,683.895573,1378.904018,20038.47269,976.512676,1748.356961,6642.881371,11643.57268,...,2189.634995,21198.26136,18861.53081,1876.766827,3116.774285,11054.56175,2982.653773,820.799445,5107.197384,1971.741538
gdpPercap_1992,649.341395,19035.57917,837.810164,682.303175,1655.784158,24757.60301,1164.406809,2383.140898,7235.653188,3745.640687,...,2279.324017,24841.61777,24769.8912,2153.739222,3340.542768,15215.6579,4616.896545,989.023149,6017.654756,1879.496673
gdpPercap_1997,635.341351,20292.01679,972.770035,734.28517,2289.234136,28377.63219,1458.817442,3119.335603,8263.590301,3076.239795,...,2536.534925,20586.69019,33519.4766,2664.477257,4014.238972,20206.82098,5852.625497,1385.896769,7110.667619,2117.484526


### <font color="red">Exercise 1: Import and check</font>

Import the data from the CSV files of the other continents, and check how many rows of data they each have.

In [19]:
africa = pd.read_csv("gdp_africa.csv", index_col = "country")
# print(len(africa)) # len counts the number of rows of data.
len(africa)

52

<hr>

## Reading and filtering data in DataFrames

There are many, many ways of accessing data in DataFrames. Here are a few ways--you can read up on other ways at the [Pandas DataFrame documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) page. It's worth clicking over just to see the variety of functions you can call to handle DataFrames!

Here, though, we'll start with accessing information the way you'd expect, by row and column:

In [20]:
data=pd.read_csv("gdp_asia.csv", index_col="country")
data.head()

Unnamed: 0_level_0,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Afghanistan,779.445314,820.85303,853.10071,836.197138,739.981106,786.11336,978.011439,852.395945,649.341395,635.341351,726.734055,974.580338
Bahrain,9867.084765,11635.79945,12753.27514,14804.6727,18268.65839,19340.10196,19211.14731,18524.02406,19035.57917,20292.01679,23403.55927,29796.04834
Bangladesh,684.244172,661.637458,686.341554,721.186086,630.233627,659.877232,676.981866,751.979403,837.810164,972.770035,1136.39043,1391.253792
Cambodia,368.469286,434.038336,496.913648,523.432314,421.624026,524.972183,624.475478,683.895573,682.303175,734.28517,896.226015,1713.778686
China,400.448611,575.987001,487.674018,612.705693,676.900092,741.23747,962.421381,1378.904018,1655.784158,2289.234136,3119.280896,4959.114854


In [21]:
print("row 0, column 0")
print(data.iloc[0,0])

row 0, column 0
779.4453145


In [None]:
print("Row 1, column 0")
print(data.iloc[1,0])

In [None]:
print("Row 1, column 1")
print(data.iloc[1,1])

In [23]:
# We can access a whole series of data using slicing:
print("Rows 0 to 3, column 0")
print(data.iloc[0:4, 0])

# data2 = data.iloc[0:4, 0]
# print(data2)

Rows 0 to 3, column 0
country
Afghanistan     779.445314
Bahrain        9867.084765
Bangladesh      684.244172
Cambodia        368.469286
Name: gdpPercap_1952, dtype: float64


In [24]:
# Row and column numbers aren't very expressive. With loc, we can use names instead:

# Remember to use loc[...], not iloc[...], and square brackets rather than parentheses.
data.loc["Afghanistan","gdpPercap_1952"]

779.4453145

In [25]:
data.loc["China","gdpPercap_1952"]

400.4486107

In [26]:
# This works with slices as well, but take note--it includes the upper limit, unlike numerical slicing, which doesn't.
# So even though "China" is below, it gets included in the slice; in data.iloc[0:4,0], China was excluded because it
# was index 4.

data.loc["Afghanistan":"China","gdpPercap_1952"]

country
Afghanistan     779.445314
Bahrain        9867.084765
Bangladesh      684.244172
Cambodia        368.469286
China           400.448611
Name: gdpPercap_1952, dtype: float64
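
The difference between the two kinds of slicing can be checked side by side. This is a minimal sketch on a five-row Series built by hand (the name `gdp_1952` is an assumption for illustration; the values come from the output above).

```python
import pandas as pd

gdp_1952 = pd.Series(
    [779.445314, 9867.084765, 684.244172, 368.469286, 400.448611],
    index=["Afghanistan", "Bahrain", "Bangladesh", "Cambodia", "China"],
    name="gdpPercap_1952",
)

# iloc slices by position: 0, 1, 2, 3 (position 4 is excluded).
by_position = gdp_1952.iloc[0:4]

# loc slices by label: both end labels are included.
by_label = gdp_1952.loc["Afghanistan":"China"]

print(len(by_position), len(by_label))
```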

In [27]:
# You can mix and match to slice out smaller DataFrames, which you can store for later analysis

data_sub = data.loc["Afghanistan":"China","gdpPercap_1952":"gdpPercap_1962"]
data_sub

| country     | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 |
| ----------- | -------------- | -------------- | -------------- |
| Afghanistan | 779.445314     | 820.85303      | 853.10071      |
| Bahrain     | 9867.084765    | 11635.79945    | 12753.27514    |
| Bangladesh  | 684.244172     | 661.637458     | 686.341554     |
| Cambodia    | 368.469286     | 434.038336     | 496.913648     |
| China       | 400.448611     | 575.987001     | 487.674018     |


In [None]:
# If you want to slice everything, you can use the : by itself

data.loc["Afghanistan":"China", :]

In [30]:
# loc can slice up to a label that doesn't exist, as long as the labels are in sorted order:

data_sub = data.loc["Afghanistan":"China","gdpPercap_1952":"gdpPercap_1980"] # The 1980 data doesn't exist!
# data_sub
data_sub.max() # Shows the maximum value of each column.

gdpPercap_1952     9867.084765
gdpPercap_1957    11635.799450
gdpPercap_1962    12753.275140
gdpPercap_1967    14804.672700
gdpPercap_1972    18268.658390
gdpPercap_1977    19340.101960
dtype: float64
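
Slicing to a missing label like `gdpPercap_1980` only works because the column names happen to be in sorted order. Here is a small sketch of what happens otherwise, using a one-row DataFrame built by hand (the names `sorted_cols` and `shuffled` are assumptions for illustration).

```python
import pandas as pd

sorted_cols = pd.DataFrame(
    [[779.445314, 820.85303, 853.10071]],
    index=["Afghanistan"],
    columns=["gdpPercap_1952", "gdpPercap_1957", "gdpPercap_1962"],
)

# "gdpPercap_1960" is not a column, but the labels are sorted,
# so pandas can still work out where the slice should stop.
picked = sorted_cols.loc[:, "gdpPercap_1952":"gdpPercap_1960"]
print(picked.columns.tolist())

# Same columns in a shuffled order: the missing label can no longer be placed.
shuffled = sorted_cols[["gdpPercap_1962", "gdpPercap_1952", "gdpPercap_1957"]]
try:
    shuffled.loc[:, "gdpPercap_1952":"gdpPercap_1960"]
    slice_failed = False
except KeyError:
    slice_failed = True
print(slice_failed)
```

If in doubt, slice between labels that really exist.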

In [None]:
# And finally, you can select individual rows or columns by creating a list inside loc.
data.loc[:,["gdpPercap_1952", "gdpPercap_2007"]]

In [32]:
# We can also split this into two lines for better view:

rows_to_show = ["China", "Japan", "Singapore"]
cols_to_show = ["gdpPercap_1952", "gdpPercap_2007"]

data.loc[rows_to_show, cols_to_show]

| country   | gdpPercap_1952 | gdpPercap_2007 |
| --------- | -------------- | -------------- |
| China     | 400.448611     | 4959.114854    |
| Japan     | 3216.956347    | 31656.06806    |
| Singapore | 2315.138227    | 47143.17964    |


In [None]:
#If we want to view all rows or all columns...

rows_to_show = [:]  # A bare ":" is not something Python can store this way, so this line raises a SyntaxError.
cols_to_show = ["gdpPercap_1952", "gdpPercap_2007"]

data.loc[rows_to_show, cols_to_show]

In [None]:
# A bare ":" only has meaning inside square brackets, so it can't be assigned to a variable like a list can.
# If we want to use this notation to specify all rows/cols, we need to put it directly inside .loc:

cols_to_show = ["gdpPercap_1952", "gdpPercap_2007"]

data.loc[:, cols_to_show]
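
If you do want "all rows" stored in a variable, Python's built-in `slice(None)` is the object form of a bare `:`. The sketch below assumes a three-row stand-in DataFrame called `sample` (values copied from the output above) rather than the full CSV.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "gdpPercap_1952": [400.448611, 3216.956347, 2315.138227],
        "gdpPercap_2007": [4959.114854, 31656.06806, 47143.17964],
    },
    index=pd.Index(["China", "Japan", "Singapore"], name="country"),
)

rows_to_show = slice(None)  # the object form of a bare ":"
cols_to_show = ["gdpPercap_2007"]

everything = sample.loc[rows_to_show, cols_to_show]
print(everything.shape)
```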

### <font color="red">Exercise 2: Get data</font>

Import the necessary CSV file, and set up a DataFrame with the per-capita GDP of Canada, the United States, and Mexico for the two most recent years in the data (2002 and 2007). Your result should look like this:

|                 | gdpPercap_2002 | gdpPercap_2007 |
| --------------- | -------------- | -------------- |
| Canada          | 33328.96507    | 36319.23501    |
| United States   | 39097.09955    | 42951.65309    |
| Mexico          | 10742.44053    | 11977.57496    |

#### Exercise 2 answer

In [36]:
americas = pd.read_csv("gdp_americas.csv", index_col="country")
rows_to_show = ["Canada","United States","Mexico"]
cols_to_show = ["gdpPercap_2002","gdpPercap_2007"]

# north_america = americas.loc[rows_to_show, cols_to_show]
# print(north_america)
americas.loc[rows_to_show, cols_to_show]

| country       | gdpPercap_2002 | gdpPercap_2007 |
| ------------- | -------------- | -------------- |
| Canada        | 33328.96507    | 36319.23501    |
| United States | 39097.09955    | 42951.65309    |
| Mexico        | 10742.44053    | 11977.57496    |


<hr>

## Filtering data

Here's one of the most powerful features of DataFrames--being able to quickly work with large chunks of data. If you had to do this with for loops, it'd be a bit of a pain to filter everything out item by item, not to mention having to reconstruct your lists one by one.

In [45]:
import pandas as pd
data = pd.read_csv("gdp_asia.csv", index_col="country")
subset = data.loc["Afghanistan":"China", ["gdpPercap_1952"]]
subset.head()

| country     | gdpPercap_1952 |
| ----------- | -------------- |
| Afghanistan | 779.445314     |
| Bahrain     | 9867.084765    |
| Bangladesh  | 684.244172     |
| Cambodia    | 368.469286     |
| China       | 400.448611     |


In [42]:
# This condition generates a "truth table" of sorts:

subset > 500 # print this to see the truth table.

country
Afghanistan     True
Bahrain         True
Bangladesh      True
Cambodia       False
China          False
Name: gdpPercap_1952, dtype: bool

In [None]:
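# We can store the condition table as a variable, and apply it as a filter, using square brackets.
# Because subset is a one-column DataFrame here, values that fail the condition show up as NaN;
# filtering on a single column, e.g. subset[subset["gdpPercap_1952"] > 500], drops those rows instead.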

its_over_500 = subset > 500
subset[its_over_500]

In [None]:
# In summarised, but harder-to-read format, we could do this:

subset[subset>500]
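
What a filter gives back depends on whether the condition was built from a whole DataFrame or from a single column. This is a small sketch of both cases on a hand-built copy of `subset` (the names `masked` and `filtered` are assumptions for illustration).

```python
import pandas as pd

subset = pd.DataFrame(
    {"gdpPercap_1952": [779.445314, 9867.084765, 684.244172, 368.469286, 400.448611]},
    index=pd.Index(
        ["Afghanistan", "Bahrain", "Bangladesh", "Cambodia", "China"], name="country"
    ),
)

# Condition built from the whole DataFrame: same shape back, failures become NaN.
masked = subset[subset > 500]

# Condition built from one column (a Series): failing rows are dropped.
filtered = subset[subset["gdpPercap_1952"] > 500]

print(masked["gdpPercap_1952"].isna().sum())
print(filtered.index.tolist())
```

Calling `masked.dropna()` afterwards should bring the first result in line with the second.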

## More Filtering

We're leaving this here as independent study--take a look at what's being done, and try to figure it out, particularly when it comes to the two-condition criteria!

In [27]:
import pandas as pd

# Can you figure out what's being done in the below code? 

dataAll = pd.read_csv("gdp_pop_all.csv", index_col = "country")
# print(dataAll)
# dataAll.to_excel("dataAll.xlsx")
# #

criteria = (dataAll["continent"] == "Asia") & (dataAll["gdpPercap_2007"] > 9000)
#

dataAll["gdp_2007"] = dataAll["gdpPercap_2007"] * dataAll["pop_2007"]
#

# dataAll[criteria][["gdpPercap_2007","pop_2007", "gdp_2007"]]
dataAll.loc[criteria, "gdpPercap_1952":"gdpPercap_2007"]
# (dataAll[criteria]["gdpPercap_1952":"gdpPercap_2007"] slices rows, not columns,
# so it returns an empty DataFrame here.)
# 

# This next line has the same effect as the commented-out line above.
# dataAll.loc[criteria,["gdpPercap_2007","pop_2007", "gdp_2007"]]
# #
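
As a hint for the two-condition criteria, here is a minimal sketch on four hypothetical rows shaped like `gdp_pop_all.csv`. The DataFrame `sample` and the name `rich_asia` are assumptions for illustration; the GDP values are copied from tables elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows shaped like gdp_pop_all.csv, for illustration only.
sample = pd.DataFrame(
    {
        "continent": ["Asia", "Asia", "Europe", "Asia"],
        "gdpPercap_2007": [4959.114854, 31656.06806, 36126.4927, 47143.17964],
    },
    index=pd.Index(["China", "Japan", "Austria", "Singapore"], name="country"),
)

# Each comparison makes a True/False Series; & keeps rows where both are True.
# The parentheses are required because & binds more tightly than == and >.
criteria = (sample["continent"] == "Asia") & (sample["gdpPercap_2007"] > 9000)

# .loc takes the row filter first, then the columns to show.
rich_asia = sample.loc[criteria, ["gdpPercap_2007"]]
print(rich_asia.index.tolist())
```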


### <font color="red">Exercise 3: African data and filtering</font>

* Read data from the Africa file
* Find the Per Capita GDP of Egypt in 2007
* Find countries whose Per Capita GDP exceeded Egypt's that year. 

There should be 9 countries.

#### Exercise 3 answer

In [None]:
africa = pd.read_csv("gdp_africa.csv", index_col="country")

egypt_2007 = africa.loc["Egypt", "gdpPercap_2007"]

criteria = africa["gdpPercap_2007"] > egypt_2007

# We can take a look at the truth table
# to see which rows will be shown due to the criteria.
# print(criteria)

# Show all columns.
africa[criteria]

# Or show just a specific column.
# africa[criteria][["gdpPercap_2007"]]

# Or show a range of columns.
# africa.loc[criteria, "gdpPercap_1952":"gdpPercap_2007"]

# To see all the column names:
# africa.columns

### <font color="red">Exercise 4: What does this do?</font>

What does each of the lines in this chunk of code do? Run it, find out, and explain it to someone sitting next to you.

In [None]:
first = pd.read_csv('gdp_pop_all.csv', index_col='country')
# print(first)
# first.to_excel("first.xlsx")
criteria = first['continent'] == 'Americas'
second = first[criteria]
# second.to_excel("second.xlsx")
third = second.drop('Puerto Rico')
# third.to_excel("third.xlsx")
fourth = third.drop('continent', axis = 1) # axis = 1 for dropping a column.
fourth.to_csv('result.csv')
fourth.to_excel('result.xlsx')

## Inserting data

Inserting column data into your DataFrames is straightforward. Just add it in:

In [None]:
data = pd.read_csv('gdp_pop_all.csv', index_col='country')
data["Has people?"] = "Yes"
data.head() # scroll to the right to see

##### Inserting the result of a criteria

In [None]:
data = pd.read_csv('gdp_pop_all.csv', index_col='country')

# The next line contains a condition,
# so criteria will have either True or False values
# applied to each cell in the "Is in Africa?" column.
criteria = data["continent"] == "Africa"
data["Is in Africa?"] = criteria
data

##### Then replacing the result of a criteria

In [None]:
data.replace(True, "Yes", inplace=True) # inplace=True changes data directly instead of returning a copy. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
data.replace(False, "No", inplace=True)
data
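
Calling `replace` on the whole DataFrame looks at every column. If you only want to relabel one column, a gentler option is `map` on just that column. The sketch below uses a hypothetical three-row DataFrame called `sample`, not the real CSV.

```python
import pandas as pd

# Hypothetical three-row example, for illustration only.
sample = pd.DataFrame(
    {"continent": ["Africa", "Asia", "Africa"]},
    index=pd.Index(["Egypt", "Japan", "Kenya"], name="country"),
)
sample["Is in Africa?"] = sample["continent"] == "Africa"

# map() translates each value through the dictionary, touching only this column.
sample["Is in Africa?"] = sample["Is in Africa?"].map({True: "Yes", False: "No"})
print(sample["Is in Africa?"].tolist())
```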

##### What about inserting rows onto your DataFrame?

This involves creating another DataFrame and then appending one DataFrame to the other.

In [None]:
# Some starter code:
df = pd.read_csv("gdp_asia.csv", index_col="country")
# print(df)
rows = ['Singapore','Malaysia']
cols = ['gdpPercap_1952','gdpPercap_1957']
subset1 = df.loc[rows,cols]
# print(subset1)

rows = ['Thailand','Indonesia']
cols = ['gdpPercap_1952','gdpPercap_1957']
subset2 = df.loc[rows,cols]
# print(subset2)

rows = ['Thailand','Indonesia']
cols = ['gdpPercap_1952','gdpPercap_2007']
subset3 = df.loc[rows,cols]
# print(subset3)

In [None]:
subset1.append(subset2) # Only works on pandas versions before 2.0.

###### `append` was deprecated in pandas 1.4 and removed in pandas 2.0, so use `concat` instead (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html):

In [52]:
# concat takes in a list of dataframes to append.
pd.concat([subset1, subset2])

| country   | gdpPercap_1952 | gdpPercap_1957 |
| --------- | -------------- | -------------- |
| Singapore | 2315.138227    | 2843.104409    |
| Malaysia  | 1831.132894    | 1810.066992    |
| Thailand  | 757.797418     | 793.577415     |
| Indonesia | 749.681655     | 858.900271     |


In [53]:
# Any missing data appears as NaN (Not a Number):

# subset2.append(subset3) # append was removed in pandas 2.0, so use concat instead. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
pd.concat([subset2, subset3])

| country   | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_2007 |
| --------- | -------------- | -------------- | -------------- |
| Thailand  | 757.797418     | 793.577415     | NaN            |
| Indonesia | 749.681655     | 858.900271     | NaN            |
| Thailand  | 757.797418     | NaN            | 7458.396327    |
| Indonesia | 749.681655     | NaN            | 3540.651564    |
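
If the NaN values are unwanted, one option is `concat`'s `join` parameter. As a sketch, the cell below rebuilds `subset2` and `subset3` by hand (values copied from the output above) and compares the default behaviour with `join="inner"`; the names `stacked` and `shared_only` are assumptions for illustration.

```python
import pandas as pd

subset2 = pd.DataFrame(
    {"gdpPercap_1952": [757.797418, 749.681655],
     "gdpPercap_1957": [793.577415, 858.900271]},
    index=pd.Index(["Thailand", "Indonesia"], name="country"),
)
subset3 = pd.DataFrame(
    {"gdpPercap_1952": [757.797418, 749.681655],
     "gdpPercap_2007": [7458.396327, 3540.651564]},
    index=pd.Index(["Thailand", "Indonesia"], name="country"),
)

# Default (join="outer"): keep every column, filling the gaps with NaN.
stacked = pd.concat([subset2, subset3])

# join="inner": keep only the columns that both DataFrames share.
shared_only = pd.concat([subset2, subset3], join="inner")

print(stacked.shape, shared_only.columns.tolist())
```

Note that both results still contain each country twice, because `concat` stacks rows without checking for repeated index labels.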


### Merging DataFrames

`merge()` will add on columns from one DataFrame to another.

In [78]:
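# Some starter code. The output shown below comes from the European data,
# so we load that file first:
data = pd.read_csv("gdp_europe.csv", index_col="country")
subset1 = data.loc["Albania":"Bulgaria", "gdpPercap_1952":"gdpPercap_1957"]
subset2 = data.loc["Albania":"Bulgaria", "gdpPercap_2002":"gdpPercap_2007"]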

In [None]:
subset1

In [77]:
# We specify that we're matching the indices
# (i.e. the rows) of the left dataframe with the indices
# of the right dataframe
merged = subset1.merge(subset2, left_index=True, right_index=True)
merged

| country                | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_2002 | gdpPercap_2007 |
| ---------------------- | -------------- | -------------- | -------------- | -------------- |
| Albania                | 1601.056136    | 1942.284244    | 4604.211737    | 5937.029526    |
| Austria                | 6137.076492    | 8842.59803     | 32417.60769    | 36126.4927     |
| Belgium                | 8343.105127    | 9714.960623    | 30485.88375    | 33692.60508    |
| Bosnia and Herzegovina | 973.533195     | 1353.989176    | 6018.975239    | 7446.298803    |
| Bulgaria               | 2444.286648    | 3008.670727    | 7696.777725    | 10680.79282    |
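
The merge above works neatly because both DataFrames contain exactly the same countries. When they don't, the `how` parameter decides which rows survive. Here is a minimal sketch with two hand-built DataFrames (`left` and `right` are assumptions for illustration; the values are copied from the output above).

```python
import pandas as pd

left = pd.DataFrame(
    {"gdpPercap_1952": [1601.056136, 6137.076492]},
    index=pd.Index(["Albania", "Austria"], name="country"),
)
right = pd.DataFrame(
    {"gdpPercap_2007": [36126.4927, 33692.60508]},
    index=pd.Index(["Austria", "Belgium"], name="country"),
)

# Default how="inner": only countries present in both DataFrames survive.
inner = left.merge(right, left_index=True, right_index=True)

# how="outer": keep every country, with NaN where one side has no data.
outer = left.merge(right, left_index=True, right_index=True, how="outer")

print(inner.index.tolist())
print(outer.index.tolist())
```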


### <font color="red">Exercise 5: European data analysis</font>

Import the GDP data for Europe. Write an expression to select each of the following:

* GDP per capita for all countries in 1982.
* GDP per capita for Denmark for all years.
* GDP per capita for all countries for years after 1985.
* GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952. Show a DataFrame with 1952, 2007, and "2007/1952", for example:

|                        | gdpPercap_1952 | gdpPercap_2007  | 2007/1952 |
| ---------------------- | -------------- | --------------  | --------- |
| Albania                | 1601.056136    | 5937.029526     | 3.708196  |
| Austria                | 6137.076492    | 36126.492700    | 5.886596  |
| Belgium                | 8343.105127    | 33692.605080    | 4.038377  |
| Bosnia and Herzegovina | 973.533195     | 7446.298803     | 7.648736  |
| Bulgaria               | 2444.286648    | 10680.792820    | 4.369697  |

In [62]:
data = pd.read_csv('gdp_europe.csv', index_col='country')

##### Exercise 5 Answers

In [None]:
# a) GDP for all countries in 1982
data.loc[:,"gdpPercap_1982"]

In [None]:
# b) GDP for Denmark for all years
data.loc["Denmark",:]

In [None]:
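# c) GDP after 1985. There is no 1985 column, but because the column labels are sorted, the slice starts at the next one (1987).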
data.loc[:, "gdpPercap_1985":]

In [None]:
# d) GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
data["2007/1952"] = data["gdpPercap_2007"] / data["gdpPercap_1952"]
cols_to_show = ["gdpPercap_1952", "gdpPercap_2007", "2007/1952"]
data[cols_to_show]