<a id='top'></a>
### Table of contents

1. Importing Visualization Libraries and Data

2. [Deriving new variables](#variables)

- [Affordability](#afford)
- [Percent of undernourished people](#percent)
- [Price index](#index)
- [Food basket size](#basket)
- [Annual Price Growth](#growth)


In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
from scipy import stats
from scipy.stats import pearsonr
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import scale
from pylab import rcParams
import matplotlib.ticker as mt
from matplotlib.ticker import ScalarFormatter
from fuzzywuzzy import process, fuzz
from datetime import date
from datetime import datetime

In [2]:
%matplotlib inline
rcParams['figure.figsize']=14,7
sns.set_style('whitegrid')

In [3]:
path=r'C:\Users\frauz\Documents\Python Projects\Final Project\Data\Data Prepared' #creating a path

<a id='variables'></a>
# Deriving new variables

[Back to top](#top)

In [4]:
# Importing the data

df_full=pd.read_pickle(os.path.join(path,'data_market_capnocapmerged_no_var.pkl'))

df_full.head()

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,country,estim_earnings,year_month,inflation,gdp_pcapita
0,AFG,2000-01-15,AFN,capital,Bread,14.26,0.3048,2000,,,Afghanistan,,2000_01,,
1,AFG,2000-01-15,AFN,capital,Wheat,13.75,0.2939,2000,,,Afghanistan,,2000_01,,
2,AFG,2000-01-15,AFN,capital,Wheat flour,18.57,0.3969,2000,,,Afghanistan,,2000_01,,
3,AFG,2000-01-15,AFN,non_capital,Bread,15.58,0.332967,2000,,,Afghanistan,,2000_01,,
4,AFG,2000-01-15,AFN,non_capital,Wheat,11.723333,0.250567,2000,,,Afghanistan,,2000_01,,


In [5]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193836 entries, 0 to 193835
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   iso                      193836 non-null  object        
 1   date                     193836 non-null  datetime64[ns]
 2   currency                 193836 non-null  object        
 3   capital_market           193836 non-null  object        
 4   product_name             193836 non-null  object        
 5   price_unit               193836 non-null  float64       
 6   usdprice_unit            193836 non-null  float64       
 7   year                     193836 non-null  int32         
 8   population               161482 non-null  float64       
 9   millions_undernourished  135049 non-null  float64       
 10  country                  193836 non-null  object        
 11  estim_earnings           63673 non-null   float64       
 12  year_month      

In [6]:
# Visualizations revealed some very suspicious prices: in Yemen in 2016, the prices in USD were not recorded accurately.

# I will recalculate the 2016 USD prices for Yemen from the local-currency prices, using an exchange rate of 0.004 USD per unit of local currency

df_full.loc[(df_full['iso']=='YEM')&(df_full['year']==2016), 'usdprice_unit']=df_full['price_unit']*0.004

# Checking that no implausibly high USD prices (above 1000) remain for Yemen

df_full[['iso','year','price_unit','product_name','usdprice_unit']][(df_full['iso']=='YEM')&(df_full['usdprice_unit']>1000)]

Unnamed: 0,iso,year,price_unit,product_name,usdprice_unit
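
The Yemen correction relies on `.loc` aligning the right-hand Series by index, so only the masked rows are overwritten. The sketch below illustrates that behaviour on hypothetical toy rows; only the 0.004 rate is taken from the notebook, and all other values are made up.

```python
import pandas as pd

# Hypothetical toy rows; only the 0.004 rate comes from the notebook
toy = pd.DataFrame({
    'iso': ['YEM', 'YEM', 'AFG'],
    'year': [2016, 2015, 2016],
    'price_unit': [500.0, 400.0, 50.0],
    'usdprice_unit': [125000.0, 1.86, 0.75],
})

mask = (toy['iso'] == 'YEM') & (toy['year'] == 2016)

# The right-hand side is computed for every row, but .loc aligns on the index
# and writes only into the rows selected by the mask
toy.loc[mask, 'usdprice_unit'] = toy['price_unit'] * 0.004
print(toy)
```

Rows outside the mask (Yemen in 2015, Afghanistan in 2016) keep their original USD prices.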


<a id='afford'></a>
## Affordability

Product Affordability Index: This index is calculated by dividing the average earnings by the price of the product, both in local currency. It quantifies how many units of a particular product a person can afford based on their monthly income.
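
As a minimal sketch of the index, assuming hypothetical earnings and prices in the same local currency (the column names follow the notebook, the values do not come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: monthly earnings and unit prices in one local currency
toy = pd.DataFrame({
    'product_name': ['Rice', 'Beans', 'Maize'],
    'estim_earnings': [400.0, 400.0, np.nan],  # earnings are missing for the last row
    'price_unit': [2.0, 5.0, 1.0],
})

# Units of each product that one month's earnings can buy
toy['affordability_index'] = toy['estim_earnings'] / toy['price_unit']
print(toy)
```

Rows without an earnings estimate get a missing index, which is why the index in the notebook has the same non-null count as `estim_earnings`.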

[Back to top](#top)

In [7]:
# Calculating monthly product affordability index

df_full['affordability_index']=df_full['estim_earnings']/df_full['price_unit']

In [8]:
df_full['affordability_index'].describe()

count    6.367300e+04
mean     3.313157e+04
std      5.771881e+05
min      4.798316e-01
25%      1.318812e+02
50%      2.847500e+02
75%      5.617667e+02
max      2.217459e+07
Name: affordability_index, dtype: float64

In [9]:
df_full[(df_full['country']=='Timor-Leste')&(df_full['estim_earnings'].notnull())]

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,country,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index
172595,TLS,2013-01-15,USD,non_capital,Beans,3.051111,3.051111,2013,1161555.0,0.3,Timor-Leste,412.77,2013_01,10.987234,1201.423609,135.285142
172596,TLS,2013-01-15,USD,non_capital,Cassava,0.693333,0.693333,2013,1161555.0,0.3,Timor-Leste,412.77,2013_01,10.987234,1201.423609,595.341346
172597,TLS,2013-01-15,USD,non_capital,Maize,1.139167,1.139167,2013,1161555.0,0.3,Timor-Leste,412.77,2013_01,10.987234,1201.423609,362.343819
172598,TLS,2013-01-15,USD,non_capital,Rice,1.115455,1.115455,2013,1161555.0,0.3,Timor-Leste,412.77,2013_01,10.987234,1201.423609,370.046455
172599,TLS,2013-02-15,USD,non_capital,Beans,2.948889,2.948889,2013,1161555.0,0.3,Timor-Leste,412.77,2013_02,10.987234,1201.423609,139.974755
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173467,TLS,2021-12-15,USD,non_capital,Sugar,0.985000,0.985000,2021,1320942.0,0.3,Timor-Leste,257.59,2021_12,,2741.393945,261.512690
173468,TLS,2021-12-15,USD,non_capital,Sweet potatoes,1.041111,1.041111,2021,1320942.0,0.3,Timor-Leste,257.59,2021_12,,2741.393945,247.418356
173469,TLS,2021-12-15,USD,non_capital,Taro,1.023000,1.023000,2021,1320942.0,0.3,Timor-Leste,257.59,2021_12,,2741.393945,251.798631
173470,TLS,2021-12-15,USD,non_capital,Tomatoes,1.589000,1.589000,2021,1320942.0,0.3,Timor-Leste,257.59,2021_12,,2741.393945,162.108244


<a id='percent'></a>
## Percent of undernourished people

The percentage of undernourished people is calculated by dividing the number of undernourished people by the total population and multiplying the result by 100. Since the number of undernourished people is recorded in millions, I first multiply it by one million.
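
A small worked example of the unit conversion, using hypothetical figures rather than values from the dataset:

```python
import pandas as pd

# Hypothetical country: 2.5 million undernourished people out of 10 million
toy = pd.DataFrame({
    'millions_undernourished': [2.5],
    'population': [10_000_000.0],
})

# Convert millions to people, divide by the population, express as a percentage
toy['%_undernourished'] = (toy['millions_undernourished'] * 1_000_000
                           / toy['population'] * 100)
print(toy)
```

Here a quarter of the population is undernourished, so the derived value is 25 percent.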

[Back to top](#top)

In [10]:
# Calculating the percentage of undernourished people by country

df_full['%_undernourished']=df_full['millions_undernourished']*1000000/df_full['population']*100

In [11]:
df_full['%_undernourished'].describe()
# Since the dataset focuses on the most vulnerable regions, these values look plausible

count    135049.000000
mean         19.368563
std          12.255154
min           2.415867
25%           9.551492
50%          16.916652
75%          28.013181
max          70.774067
Name: %_undernourished, dtype: float64

<a id='index'></a>
## Price index

Price Index: This value is used for normalized comparison across all prices and currencies. It's calculated by dividing the observed price by the base price and then multiplying the result by one hundred.

Since different countries have different observation periods, there is no single date on which every product was observed in all countries. Therefore, I've decided to use the latest observation for each country, market type and product as the base price. With this choice, the latest observation always has an index of 100.
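
The idea can be sketched on toy data as follows. This is a compact alternative to the merge-based steps used in this notebook, not a reproduction of them: it sorts by date and takes the last price in each group. The data are hypothetical, and the sketch assumes one price per group and date with no missing prices.

```python
import pandas as pd

# Hypothetical prices for two countries with different observation periods
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'product_name': ['Rice'] * 5,
    'date': pd.to_datetime(['2020-01-15', '2020-02-15', '2020-03-15',
                            '2021-05-15', '2021-06-15']),
    'price_unit': [8.0, 9.0, 10.0, 3.0, 4.0],
})

keys = ['country', 'product_name']

# After sorting by date, the last price in each group is the base price
toy = toy.sort_values(keys + ['date'])
toy['base_price'] = toy.groupby(keys)['price_unit'].transform('last')

# Index of 100 at the base date; lower values mean the price was below the base
toy['price_index'] = toy['price_unit'] / toy['base_price'] * 100
print(toy)
```

Each country is indexed against its own latest observation, so the index compares price paths within a country rather than price levels between countries.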

[Back to top](#top)

In [12]:
# Finding the last (max) observation date for each country, market type and product

df_full['base_price_date']=df_full.groupby(['country','capital_market','product_name'])['date'].transform('max')

In [13]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193836 entries, 0 to 193835
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   iso                      193836 non-null  object        
 1   date                     193836 non-null  datetime64[ns]
 2   currency                 193836 non-null  object        
 3   capital_market           193836 non-null  object        
 4   product_name             193836 non-null  object        
 5   price_unit               193836 non-null  float64       
 6   usdprice_unit            193836 non-null  float64       
 7   year                     193836 non-null  int32         
 8   population               161482 non-null  float64       
 9   millions_undernourished  135049 non-null  float64       
 10  country                  193836 non-null  object        
 11  estim_earnings           63673 non-null   float64       
 12  year_month      

In [14]:
df_full[df_full['base_price_date'].isnull()]

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,country,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index,%_undernourished,base_price_date


In [15]:
#Isolating records that contain base prices

df_base=df_full.loc[df_full['date']==df_full['base_price_date']]

df_base.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2248 entries, 271 to 193835
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   iso                      2248 non-null   object        
 1   date                     2248 non-null   datetime64[ns]
 2   currency                 2248 non-null   object        
 3   capital_market           2248 non-null   object        
 4   product_name             2248 non-null   object        
 5   price_unit               2248 non-null   float64       
 6   usdprice_unit            2248 non-null   float64       
 7   year                     2248 non-null   int32         
 8   population               484 non-null    float64       
 9   millions_undernourished  418 non-null    float64       
 10  country                  2248 non-null   object        
 11  estim_earnings           238 non-null    float64       
 12  year_month               2248 

In [16]:
df_base.duplicated().value_counts()

False    2248
dtype: int64

In [17]:
df_base[['country','capital_market','product_name','price_unit']].value_counts()

country      capital_market  product_name  price_unit
Afghanistan  capital         Bread         50.000000     1
Nicaragua    non_capital     Maize         0.288889      1
                             Bread         1.800000      1
                             Cabbage       0.422222      1
                             Cheese        6.155556      1
                                                        ..
Ghana        non_capital     Eggplants     10.257857     1
                             Cowpeas       9.357273      1
                             Cassava       9.461250      1
             capital         Yam           21.665000     1
Zimbabwe     non_capital     Wheat flour   3.000000      1
Length: 2248, dtype: int64

In [18]:
# Merging the new dataframe with the main one to add base price to each record

df_base_price=df_full.merge(df_base[['country','capital_market','product_name','price_unit']], on=['country','capital_market','product_name'])

In [19]:
df_base_price.tail(70)

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit_x,usdprice_unit,year,population,millions_undernourished,country,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index,%_undernourished,base_price_date,price_unit_y
193766,ZWE,2021-03-15,USD,capital,Rice,1.1886,1.1886,2021,15993524.0,6.1,Zimbabwe,170.213637,2021_03,98.546105,1773.920411,143.205146,38.140437,2021-10-15,0.9761
193767,ZWE,2021-04-15,USD,capital,Rice,1.1841,1.1841,2021,15993524.0,6.1,Zimbabwe,146.119485,2021_04,98.546105,1773.920411,123.401305,38.140437,2021-10-15,0.9761
193768,ZWE,2021-05-15,USD,capital,Rice,1.0910,1.0910,2021,15993524.0,6.1,Zimbabwe,146.119485,2021_05,98.546105,1773.920411,133.931700,38.140437,2021-10-15,0.9761
193769,ZWE,2021-06-15,USD,capital,Rice,1.0429,1.0429,2021,15993524.0,6.1,Zimbabwe,167.311618,2021_06,98.546105,1773.920411,160.429205,38.140437,2021-10-15,0.9761
193770,ZWE,2021-07-15,USD,capital,Rice,1.0835,1.0835,2021,15993524.0,6.1,Zimbabwe,146.119485,2021_07,98.546105,1773.920411,134.858777,38.140437,2021-10-15,0.9761
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193831,ZWE,2020-08-15,USD,capital,Maize,0.3150,0.3150,2020,15669666.0,6.1,Zimbabwe,,2020_08,557.201817,1372.696674,,38.928717,2021-07-15,0.3248
193832,ZWE,2020-09-15,USD,capital,Maize,0.3057,0.3057,2020,15669666.0,6.1,Zimbabwe,,2020_09,557.201817,1372.696674,,38.928717,2021-07-15,0.3248
193833,ZWE,2021-05-15,USD,capital,Maize,0.3447,0.3447,2021,15993524.0,6.1,Zimbabwe,146.119485,2021_05,98.546105,1773.920411,423.903351,38.140437,2021-07-15,0.3248
193834,ZWE,2021-06-15,USD,capital,Maize,0.3265,0.3265,2021,15993524.0,6.1,Zimbabwe,167.311618,2021_06,98.546105,1773.920411,512.439870,38.140437,2021-07-15,0.3248


In [20]:
df_base_price.info() #the number of records is the same

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193836 entries, 0 to 193835
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   iso                      193836 non-null  object        
 1   date                     193836 non-null  datetime64[ns]
 2   currency                 193836 non-null  object        
 3   capital_market           193836 non-null  object        
 4   product_name             193836 non-null  object        
 5   price_unit_x             193836 non-null  float64       
 6   usdprice_unit            193836 non-null  float64       
 7   year                     193836 non-null  int32         
 8   population               161482 non-null  float64       
 9   millions_undernourished  135049 non-null  float64       
 10  country                  193836 non-null  object        
 11  estim_earnings           63673 non-null   float64       
 12  year_month      

In [21]:
df_base_price.duplicated().value_counts() #no duplicates detected

False    193836
dtype: int64
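
The row-count and duplicate checks above can also be enforced at merge time. The sketch below uses the `validate` argument of `DataFrame.merge` on hypothetical toy frames; it illustrates the option and is not part of the original workflow.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy frames: many price records, one base price per product
prices = pd.DataFrame({'product_name': ['Rice', 'Rice', 'Maize'],
                       'price_unit': [1.0, 1.2, 0.3]})
base = pd.DataFrame({'product_name': ['Rice', 'Maize'],
                     'base_price': [1.2, 0.3]})

# 'many_to_one' requires the merge keys to be unique in the right-hand frame
merged = prices.merge(base, on='product_name', validate='many_to_one')

# A duplicated key on the right would silently multiply rows without validate;
# with validate the merge raises instead
bad_base = pd.concat([base, base.iloc[[0]]])
try:
    prices.merge(bad_base, on='product_name', validate='many_to_one')
    raised = False
except MergeError:
    raised = True
print(len(merged), raised)
```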

In [22]:
# Dropping base_price_date and renaming price_units

df_base_price.drop(columns='base_price_date', inplace=True)
df_base_price.rename(columns={'price_unit_y':'base_price','price_unit_x':'price_unit'}, inplace=True)

In [23]:
# Calculating price_index

df_base_price['price_index']=df_base_price['price_unit']/df_base_price['base_price']*100

In [24]:
df_base_price['price_index'].describe()

count    193836.000000
mean         75.411841
std         256.749440
min           0.003333
25%          44.765045
50%          69.232101
75%          92.720264
max       80289.617202
Name: price_index, dtype: float64

In [25]:
df_base_price

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,country,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index,%_undernourished,base_price,price_index
0,AFG,2000-01-15,AFN,capital,Bread,14.2600,0.3048,2000,,,Afghanistan,,2000_01,,,,,50.0000,28.520000
1,AFG,2000-02-15,AFN,capital,Bread,13.4900,0.2860,2000,,,Afghanistan,,2000_02,,,,,50.0000,26.980000
2,AFG,2000-03-15,AFN,capital,Bread,11.7600,0.2482,2000,,,Afghanistan,,2000_03,,,,,50.0000,23.520000
3,AFG,2000-04-15,AFN,capital,Bread,12.9200,0.2733,2000,,,Afghanistan,,2000_04,,,,,50.0000,25.840000
4,AFG,2000-05-15,AFN,capital,Bread,16.4400,0.3478,2000,,,Afghanistan,,2000_05,,,,,50.0000,32.880000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193831,ZWE,2020-08-15,USD,capital,Maize,0.3150,0.3150,2020,15669666.0,6.1,Zimbabwe,,2020_08,557.201817,1372.696674,,38.928717,0.3248,96.982759
193832,ZWE,2020-09-15,USD,capital,Maize,0.3057,0.3057,2020,15669666.0,6.1,Zimbabwe,,2020_09,557.201817,1372.696674,,38.928717,0.3248,94.119458
193833,ZWE,2021-05-15,USD,capital,Maize,0.3447,0.3447,2021,15993524.0,6.1,Zimbabwe,146.119485,2021_05,98.546105,1773.920411,423.903351,38.140437,0.3248,106.126847
193834,ZWE,2021-06-15,USD,capital,Maize,0.3265,0.3265,2021,15993524.0,6.1,Zimbabwe,167.311618,2021_06,98.546105,1773.920411,512.439870,38.140437,0.3248,100.523399


<a id='basket'></a>
## Food basket size

In this context, food basket refers to the number of staple products included in the monitoring by the UN World Food Programme.
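
A minimal sketch of the count on hypothetical toy rows: `transform('nunique')` counts the distinct products per country and broadcasts the result back to every row of that country.

```python
import pandas as pd

# Hypothetical toy rows: repeated observations of a few products per country
toy = pd.DataFrame({
    'iso': ['AFG', 'AFG', 'AFG', 'YEM', 'YEM'],
    'product_name': ['Bread', 'Wheat', 'Bread', 'Rice', 'Rice'],
})

# Distinct products per country, repeated on every row of that country
toy['basket_size'] = toy.groupby('iso')['product_name'].transform('nunique')
print(toy)
```

Because the value is repeated on every record, summary statistics computed over all rows are weighted by the number of observations per country.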

[Back to top](#top)

In [26]:
#Calculating the number of products that are monitored in each country

df_base_price['basket_size']=df_base_price.groupby('iso')['product_name'].transform('nunique')
df_base_price.head()

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,country,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index,%_undernourished,base_price,price_index,basket_size
0,AFG,2000-01-15,AFN,capital,Bread,14.26,0.3048,2000,,,Afghanistan,,2000_01,,,,,50.0,28.52,9
1,AFG,2000-02-15,AFN,capital,Bread,13.49,0.286,2000,,,Afghanistan,,2000_02,,,,,50.0,26.98,9
2,AFG,2000-03-15,AFN,capital,Bread,11.76,0.2482,2000,,,Afghanistan,,2000_03,,,,,50.0,23.52,9
3,AFG,2000-04-15,AFN,capital,Bread,12.92,0.2733,2000,,,Afghanistan,,2000_04,,,,,50.0,25.84,9
4,AFG,2000-05-15,AFN,capital,Bread,16.44,0.3478,2000,,,Afghanistan,,2000_05,,,,,50.0,32.88,9


In [27]:
df_base_price['basket_size'].describe()

count    193836.000000
mean         21.663587
std          11.893905
min           1.000000
25%          12.000000
50%          20.000000
75%          29.000000
max          50.000000
Name: basket_size, dtype: float64

<a id='growth'></a>
## Annual Price Growth

This represents the percentage difference between the first and last recorded prices for a given product/country within one year.
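
The calculation can be sketched compactly as below. This is an alternative formulation, not the step-by-step merges used in the cells that follow: it sorts by date and takes the first and last price in each group. The first and last prices for 2000 match the Afghanistan rows shown in this notebook; the mid-year Bread price is a hypothetical filler.

```python
import pandas as pd

toy = pd.DataFrame({
    'product_name': ['Bread', 'Bread', 'Bread', 'Wheat flour', 'Wheat flour'],
    'date': pd.to_datetime(['2000-01-15', '2000-06-15', '2000-12-15',
                            '2000-01-15', '2000-12-15']),
    'price_unit': [14.26, 15.00, 16.13, 18.57, 15.11],  # 15.00 is hypothetical
})
toy['year'] = toy['date'].dt.year

# First and last recorded price per product and year, in date order
growth = (toy.sort_values('date')
             .groupby(['product_name', 'year'])['price_unit']
             .agg(first_price='first', last_price='last')
             .reset_index())

# Percentage change from the first to the last recorded price of the year
growth['%_annual_growth'] = ((growth['last_price'] - growth['first_price'])
                             / growth['first_price'] * 100).round(2)
print(growth)
```

Only the first and last observations of the year enter the result, so price swings in between do not affect it.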

[Back to top](#top)

In [28]:
# First, I want to isolate the earliest (min) and latest (max) observation dates for each country, market type, year and product

df_growth_max=df_full.groupby(['country',
                               'capital_market',
                               'year',
                               'product_name']).agg({'date':'max'})

In [29]:
df_growth_max.reset_index(inplace=True)
df_growth_max.head()

Unnamed: 0,country,capital_market,year,product_name,date
0,Afghanistan,capital,2000,Bread,2000-12-15
1,Afghanistan,capital,2000,Livestock,2000-12-15
2,Afghanistan,capital,2000,Wheat,2000-12-15
3,Afghanistan,capital,2000,Wheat flour,2000-12-15
4,Afghanistan,capital,2001,Bread,2001-12-15


In [30]:
df_growth_max.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19189 entries, 0 to 19188
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   country         19189 non-null  object        
 1   capital_market  19189 non-null  object        
 2   year            19189 non-null  int64         
 3   product_name    19189 non-null  object        
 4   date            19189 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 749.7+ KB


In [31]:
df_growth_min=df_full.groupby(['country',
                               'capital_market',
                               'year',
                               'product_name']).agg({'date':'min'})

In [32]:
df_growth_min.reset_index(inplace=True)
df_growth_min.head()

Unnamed: 0,country,capital_market,year,product_name,date
0,Afghanistan,capital,2000,Bread,2000-01-15
1,Afghanistan,capital,2000,Livestock,2000-05-15
2,Afghanistan,capital,2000,Wheat,2000-01-15
3,Afghanistan,capital,2000,Wheat flour,2000-01-15
4,Afghanistan,capital,2001,Bread,2001-01-15


In [33]:
df_growth_min.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19189 entries, 0 to 19188
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   country         19189 non-null  object        
 1   capital_market  19189 non-null  object        
 2   year            19189 non-null  int64         
 3   product_name    19189 non-null  object        
 4   date            19189 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 749.7+ KB


In [34]:
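# Now I can add the prices recorded on the earliest and latest dates to these dataframes
# Note: max_price and min_price below are the prices on the latest and earliest dates of the year, not the highest and lowest prices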

df_max_price=pd.merge(df_growth_max, df_full[['country',
                                              'capital_market',
                                                   'date',
                                              'year',
                                              'product_name',
                                                   'price_unit']], on=['country',
                                                                       'capital_market',
                                                                       'date',
                                                                       'year',
                                                                       'product_name'])

In [35]:
df_max_price.head()

Unnamed: 0,country,capital_market,year,product_name,date,price_unit
0,Afghanistan,capital,2000,Bread,2000-12-15,16.13
1,Afghanistan,capital,2000,Livestock,2000-12-15,1100000.0
2,Afghanistan,capital,2000,Wheat,2000-12-15,13.78
3,Afghanistan,capital,2000,Wheat flour,2000-12-15,15.11
4,Afghanistan,capital,2001,Bread,2001-12-15,9.78


In [36]:
df_max_price.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19189 entries, 0 to 19188
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   country         19189 non-null  object        
 1   capital_market  19189 non-null  object        
 2   year            19189 non-null  int64         
 3   product_name    19189 non-null  object        
 4   date            19189 non-null  datetime64[ns]
 5   price_unit      19189 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 1.0+ MB


In [37]:
df_max_price.rename(columns={'price_unit':'max_price'}, inplace=True)

In [38]:
df_min_price=pd.merge(df_growth_min, df_full[['country',
                                              'capital_market',
                                                   'date',
                                              'year',
                                              'product_name',
                                                   'price_unit']], on=['country',
                                                                       'capital_market',
                                                                       'date',
                                                                       'year',
                                                                       'product_name'])
df_min_price.head()

Unnamed: 0,country,capital_market,year,product_name,date,price_unit
0,Afghanistan,capital,2000,Bread,2000-01-15,14.26
1,Afghanistan,capital,2000,Livestock,2000-05-15,1075000.0
2,Afghanistan,capital,2000,Wheat,2000-01-15,13.75
3,Afghanistan,capital,2000,Wheat flour,2000-01-15,18.57
4,Afghanistan,capital,2001,Bread,2001-01-15,15.9


In [39]:
df_min_price.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19189 entries, 0 to 19188
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   country         19189 non-null  object        
 1   capital_market  19189 non-null  object        
 2   year            19189 non-null  int64         
 3   product_name    19189 non-null  object        
 4   date            19189 non-null  datetime64[ns]
 5   price_unit      19189 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 1.0+ MB


In [40]:
df_min_price.rename(columns={'price_unit':'min_price'},  inplace=True)

In [41]:
df_growth_dif=pd.merge(df_max_price[['country','capital_market','year','product_name','max_price']],
                       df_min_price[['country','capital_market','year','product_name','min_price']], on=['country','capital_market','year','product_name'])
df_growth_dif.head()

Unnamed: 0,country,capital_market,year,product_name,max_price,min_price
0,Afghanistan,capital,2000,Bread,16.13,14.26
1,Afghanistan,capital,2000,Livestock,1100000.0,1075000.0
2,Afghanistan,capital,2000,Wheat,13.78,13.75
3,Afghanistan,capital,2000,Wheat flour,15.11,18.57
4,Afghanistan,capital,2001,Bread,9.78,15.9


In [42]:
df_growth_dif.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19189 entries, 0 to 19188
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         19189 non-null  object 
 1   capital_market  19189 non-null  object 
 2   year            19189 non-null  int64  
 3   product_name    19189 non-null  object 
 4   max_price       19189 non-null  float64
 5   min_price       19189 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 1.0+ MB


In [43]:
# Calculating % annual growth

df_growth_dif['%_annual_growth']=round((df_growth_dif['max_price']-df_growth_dif['min_price'])/df_growth_dif['min_price']*100, 2)
df_growth_dif.head(30) #though some % are negative, in general the numbers look realistic

Unnamed: 0,country,capital_market,year,product_name,max_price,min_price,%_annual_growth
0,Afghanistan,capital,2000,Bread,16.13,14.26,13.11
1,Afghanistan,capital,2000,Livestock,1100000.0,1075000.0,2.33
2,Afghanistan,capital,2000,Wheat,13.78,13.75,0.22
3,Afghanistan,capital,2000,Wheat flour,15.11,18.57,-18.63
4,Afghanistan,capital,2001,Bread,9.78,15.9,-38.49
5,Afghanistan,capital,2001,Livestock,1440000.0,1260000.0,14.29
6,Afghanistan,capital,2001,Wheat,4.37,13.71,-68.13
7,Afghanistan,capital,2001,Wheat flour,5.6,15.1,-62.91
8,Afghanistan,capital,2002,Bread,14.43,7.72,86.92
9,Afghanistan,capital,2002,Livestock,2901333.0,1153000.0,151.63


In [44]:
# Now I can add the annual growth % to the main dataframe

df_merged=pd.merge(df_base_price,df_growth_dif[['country',
                                                'capital_market',
                                               'year',
                                                'product_name',
                                               '%_annual_growth']], on=['country',
                                                                        'capital_market',
                                                                        'year',
                                                                        'product_name'], how='left')

In [45]:
df_merged.tail(50)

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,...,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index,%_undernourished,base_price,price_index,basket_size,%_annual_growth
193786,ZWE,2021-10-15,USD,capital,Salt,0.7155,0.7155,2021,15993524.0,6.1,...,147.136989,2021_10,98.546105,1773.920411,205.642194,38.140437,0.7155,100.0,16,-29.18
193787,ZWE,2020-02-15,USD,capital,Sugar,0.974,0.974,2020,15669666.0,6.1,...,,2020_02,557.201817,1372.696674,,38.928717,1.5193,64.108471,16,26.13
193788,ZWE,2020-03-15,USD,capital,Sugar,0.9713,0.9713,2020,15669666.0,6.1,...,,2020_03,557.201817,1372.696674,,38.928717,1.5193,63.930758,16,26.13
193789,ZWE,2020-04-15,USD,capital,Sugar,1.4,1.4,2020,15669666.0,6.1,...,,2020_04,557.201817,1372.696674,,38.928717,1.5193,92.1477,16,26.13
193790,ZWE,2020-05-15,USD,capital,Sugar,1.42,1.42,2020,15669666.0,6.1,...,,2020_05,557.201817,1372.696674,,38.928717,1.5193,93.464095,16,26.13
193791,ZWE,2020-08-15,USD,capital,Sugar,1.2544,1.2544,2020,15669666.0,6.1,...,,2020_08,557.201817,1372.696674,,38.928717,1.5193,82.564339,16,26.13
193792,ZWE,2020-09-15,USD,capital,Sugar,1.1682,1.1682,2020,15669666.0,6.1,...,,2020_09,557.201817,1372.696674,,38.928717,1.5193,76.890673,16,26.13
193793,ZWE,2020-10-15,USD,capital,Sugar,1.2285,1.2285,2020,15669666.0,6.1,...,,2020_10,557.201817,1372.696674,,38.928717,1.5193,80.859606,16,26.13
193794,ZWE,2021-03-15,USD,capital,Sugar,1.4857,1.4857,2021,15993524.0,6.1,...,170.213637,2021_03,98.546105,1773.920411,114.567972,38.140437,1.5193,97.788455,16,2.26
193795,ZWE,2021-04-15,USD,capital,Sugar,1.4801,1.4801,2021,15993524.0,6.1,...,146.119485,2021_04,98.546105,1773.920411,98.722711,38.140437,1.5193,97.419864,16,2.26


In [46]:
df_merged['%_annual_growth'].isnull().sum() # no missing values: every record received a growth value

0

In [47]:
df_merged.info() #number of records remained the same

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193836 entries, 0 to 193835
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   iso                      193836 non-null  object        
 1   date                     193836 non-null  datetime64[ns]
 2   currency                 193836 non-null  object        
 3   capital_market           193836 non-null  object        
 4   product_name             193836 non-null  object        
 5   price_unit               193836 non-null  float64       
 6   usdprice_unit            193836 non-null  float64       
 7   year                     193836 non-null  int32         
 8   population               161482 non-null  float64       
 9   millions_undernourished  135049 non-null  float64       
 10  country                  193836 non-null  object        
 11  estim_earnings           63673 non-null   float64       
 12  year_month      

In [48]:
df_merged.duplicated().value_counts()

False    193836
dtype: int64

In [49]:
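# Unifying the product label; assigning the result back avoids relying on an in-place change to a column selection
df_merged['product_name']=df_merged['product_name'].replace('Shrimp','Shrimps')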

In [50]:
df_merged[df_merged['product_name']=='Shrimps']['country'].value_counts()

Gambia, The    80
Philippines    69
Benin          38
Name: country, dtype: int64

In [51]:
df_merged[df_merged['capital_market']=='non-capital']

Unnamed: 0,iso,date,currency,capital_market,product_name,price_unit,usdprice_unit,year,population,millions_undernourished,...,estim_earnings,year_month,inflation,gdp_pcapita,affordability_index,%_undernourished,base_price,price_index,basket_size,%_annual_growth


In [55]:
df_merged['capital_market'].value_counts()

non_capital    121730
capital         72106
Name: capital_market, dtype: int64

In [52]:
# Removing columns that I won't need

df_for_tableau=df_merged.drop(columns=['base_price','year','year_month'])

In [53]:
# Exporting data for further analysis. Final dataframe: 193836 records, 21 columns

df_merged.to_pickle(os.path.join(path,'capnoncap_market_level_data_final.pkl'))
df_for_tableau.to_csv(os.path.join(path,'capnoncap_market_level_data_final.csv'))