<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center> 

_____

<a id='home'></a>

# Appending

<a target="_blank" href="https://colab.research.google.com/github/CienciaDeDatosEspacial/code_and_data/blob/main/Appending.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a id='appending'></a>

As the name implies, this process binds several data frames (DFs) into one: one or more DFs are placed below or on top of another DF. Appending can be done when these requisites are fulfilled:
1. All the DFs share the same column names.
2. The columns of all the DFs are in the same position.

It is better if the matching columns also share the same data types, but that can be solved during the formatting process.
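
The requisites can be illustrated with a minimal sketch. The two small data frames below (`dfA` and `dfB`) are made up for the example and are not part of the fragility data:

```python
import pandas as pd

# two toy data frames with the same column names in the same position (hypothetical data)
dfA = pd.DataFrame({"Country": ["A", "B"], "Total": [1.0, 2.0]})
dfB = pd.DataFrame({"Country": ["C"], "Total": [3.0]})

# checking the requisites: same names, same positions
sameColumns = list(dfA.columns) == list(dfB.columns)

# dfB is put below dfA
appended = pd.concat([dfA, dfB])
```

Here `sameColumns` is `True`, and `appended` has the two rows of `dfA` followed by the single row of `dfB`.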


Let's visit this website: https://fundforpeace.org/what-we-do/country-risk-and-fragility-data/

There, you will find several Excel files with the _Fragile States Index_ per year. I have the files from 2013 to 2021 in a GitHub repo:

![](fragilityGit.png)

Let's read every file. For that, we will use a link to each. Let's do it step by step:

In [1]:
# link to the repo folder - common to all files (note the trailing slash)
dataRepo='https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/'

In [2]:
# creating file names into a list:
years=range(2013,2022)
fileNames=['fsi-'+str(year)+'.xlsx' for year in years]
# list of file names
fileNames

['fsi-2013.xlsx',
 'fsi-2014.xlsx',
 'fsi-2015.xlsx',
 'fsi-2016.xlsx',
 'fsi-2017.xlsx',
 'fsi-2018.xlsx',
 'fsi-2019.xlsx',
 'fsi-2020.xlsx',
 'fsi-2021.xlsx']

In [3]:
# creating the url to each file:
alltheLinks=[dataRepo+fn for fn in fileNames]
alltheLinks

['https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2013.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2014.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2015.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2016.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2017.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2018.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2019.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2020.xlsx',
 'https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/fragility/fsi-2021.xlsx']

In [4]:
# allDFs will be a list of data frames

We will save each data frame in the list **allDFs**. We will use pandas, which needs the packages **openpyxl** and **xlrd** installed to read Excel files:

In [7]:
# creating list of DFs
import pandas as pd

allDFs=[pd.read_excel(link) for link in alltheLinks] 


In [8]:
# saving column names
allColumnNames=[]
for df in allDFs:
    allColumnNames.append(set(df.columns))# list of sets!

In [9]:
# checking how many columns per df

[len(cols) for cols in allColumnNames]

[16, 16, 16, 16, 16, 16, 17, 17, 16]

Two of the files (2019 and 2020) have an extra column.
Let's find the common columns:

In [10]:
# details of common columns
commonColumns=set.intersection(*allColumnNames) # expanding list of sets (*)
len(commonColumns),commonColumns

(16,
 {'C1: Security Apparatus',
  'C2: Factionalized Elites',
  'C3: Group Grievance',
  'Country',
  'E1: Economy',
  'E2: Economic Inequality',
  'E3: Human Flight and Brain Drain',
  'P1: State Legitimacy',
  'P2: Public Services',
  'P3: Human Rights',
  'Rank',
  'S1: Demographic Pressures',
  'S2: Refugees and IDPs',
  'Total',
  'X1: External Intervention',
  'Year'})

These are the columns not in the common names:

In [11]:
# all minus the common
set.union(*allColumnNames)-commonColumns

{'Change from Previous Year'}

We could make a list of data frames with only the common columns:

In [13]:
# DFs with the common columns
allDFs_sameNames=[df.loc[:,list(commonColumns)] for df in allDFs]

Appending in pandas requires a list of data frames; in this case that is **allDFs_sameNames**. Then we proceed:

In [14]:
# appending
allDFsConcat=pd.concat(allDFs_sameNames)
allDFsConcat.head()

| | Year | P3: Human Rights | Rank | Country | C1: Security Apparatus | C2: Factionalized Elites | E1: Economy | Total | P2: Public Services | S1: Demographic Pressures | X1: External Intervention | E3: Human Flight and Brain Drain | C3: Group Grievance | P1: State Legitimacy | S2: Refugees and IDPs | E2: Economic Inequality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 00:00:00 | 10.0 | 1st | Somalia | 9.7 | 10.0 | 9.4 | 113.9 | 9.8 | 9.5 | 9.4 | 8.9 | 9.3 | 9.5 | 10.0 | 8.4 |
| 1 | 2013-01-01 00:00:00 | 9.8 | 2nd | Congo Democratic Republic | 10.0 | 9.5 | 8.5 | 111.9 | 9.5 | 10.0 | 9.7 | 7.1 | 9.4 | 9.6 | 10.0 | 8.8 |
| 2 | 2013-01-01 00:00:00 | 9.3 | 3rd | Sudan | 9.8 | 10.0 | 7.8 | 111.0 | 8.8 | 8.8 | 10.0 | 8.4 | 10.0 | 9.6 | 10.0 | 8.5 |
| 3 | 2013-01-01 00:00:00 | 9.3 | 4th | South Sudan | 9.6 | 9.8 | 8.6 | 110.6 | 9.8 | 8.9 | 10.0 | 6.5 | 10.0 | 9.1 | 10.0 | 8.9 |
| 4 | 2013-01-01 00:00:00 | 9.8 | 5th | Chad | 9.4 | 9.5 | 8.0 | 109.0 | 9.9 | 9.5 | 7.9 | 8.0 | 8.8 | 9.7 | 9.7 | 8.9 |
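
By default, `pd.concat` keeps the row labels of each original data frame, so the same labels repeat once per appended file. If unique row labels are preferred, the `ignore_index` argument can be used. A minimal sketch with two made-up data frames (`first` and `second` are hypothetical):

```python
import pandas as pd

# toy data frames, one per "year" (hypothetical data)
first = pd.DataFrame({"Country": ["A", "B"], "Year": [2013, 2013]})
second = pd.DataFrame({"Country": ["A", "B"], "Year": [2014, 2014]})

# default: each data frame keeps its own row labels, so 0 and 1 appear twice
keptLabels = pd.concat([first, second])

# ignore_index=True: the rows are relabelled from 0 to n-1
newLabels = pd.concat([first, second], ignore_index=True)
```

Whether to relabel is a design choice: the repeated labels do no harm in this notebook, but unique labels are safer if rows are later selected by label.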


We could pay attention to the current data types:

In [15]:
allDFsConcat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1603 entries, 0 to 178
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Year                              1603 non-null   object 
 1   P3: Human Rights                  1603 non-null   float64
 2   Rank                              1603 non-null   object 
 3   Country                           1603 non-null   object 
 4   C1: Security Apparatus            1603 non-null   float64
 5   C2: Factionalized Elites          1603 non-null   float64
 6   E1: Economy                       1603 non-null   float64
 7   Total                             1603 non-null   float64
 8   P2: Public Services               1603 non-null   float64
 9   S1: Demographic Pressures         1603 non-null   float64
 10  X1: External Intervention         1603 non-null   float64
 11  E3: Human Flight and Brain Drain  1603 non-null   float64
 12  C3: Group Gr

The column **Year** was expected to be of a numeric type, but we got _object_ instead. Let's explore that column:

In [16]:
# exploring year column as frequency table
allDFsConcat.Year.value_counts()

Year
2021                   179
2013-01-01 00:00:00    178
2014-01-01 00:00:00    178
2015-01-01 00:00:00    178
2016-01-01 00:00:00    178
2017-01-01 00:00:00    178
2018-01-01 00:00:00    178
2019-01-01 00:00:00    178
2020-01-01 00:00:00    178
Name: count, dtype: int64

Except for the year 2021, the values are in date-time format. We just need the year as an integer, so:

In [None]:
# keeping just the year value
yearAsNumber=[]
for y in allDFsConcat.Year:
    try:
        yearAsNumber.append(y.year) # the year from a date-time value
    except AttributeError:
        yearAsNumber.append(y) # not a date-time: the value has no .year

#verifying
pd.Series(yearAsNumber).value_counts()
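
The same idea can be written as a small function applied to the whole column. This is only a sketch under the assumption that every value is either date-like or already a year; `toYear` and `mixedYears` are hypothetical names used for illustration:

```python
import datetime
import pandas as pd

def toYear(value):
    # date-like values expose .year; anything else is assumed to be a year already
    return value.year if hasattr(value, "year") else int(value)

# toy column mixing date-time values and a plain number (hypothetical data)
mixedYears = pd.Series([datetime.datetime(2013, 1, 1),
                        pd.Timestamp("2014-01-01"),
                        2021], dtype=object)

# apply the function to every value and force an integer type
cleanYears = mixedYears.map(toYear).astype(int)
```

Here `cleanYears` holds the integers 2013, 2014 and 2021.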

In [None]:
# overwriting the year column
allDFsConcat.Year=yearAsNumber

You may have noticed that the column ordering does not look appropriate. In general, you expect the leftmost columns to identify the rows, not to hold measurements. Let's move 'Country', 'Year' and 'Total' to the left:

In [None]:
# this is a trick: setting columns as index
allDFsConcat.set_index(['Country','Year','Total'],inplace=True)
allDFsConcat.head()

Since I will not use _Rank_, I will get rid of it:

In [None]:
# dropping unneeded column
allDFsConcat.drop(columns='Rank',inplace=True)

Let's order the current column names:

In [None]:
# ordering column names alphabetically
allDFsConcat.sort_index(axis=1,inplace=True) # axis=0 would sort by the row index

Now put the row indexes back:

In [None]:
# indexes will be columns
allDFsConcat.reset_index(inplace=True)
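
The index trick (set the index, drop, sort the remaining columns, reset) can also be written as an explicit column selection. The sketch below is an alternative, not the method used in this notebook, and it runs on a made-up data frame (`toy`) with only a few of the real column names:

```python
import pandas as pd

# toy data frame standing in for allDFsConcat (hypothetical values)
toy = pd.DataFrame({"Rank": ["1st"],
                    "P1: State Legitimacy": [9.5],
                    "Total": [113.9],
                    "Country": ["Somalia"],
                    "C1: Security Apparatus": [9.7],
                    "Year": [2013]})

idColumns = ["Country", "Year", "Total"]  # columns that go to the left
unwanted = ["Rank"]                        # columns to leave out

# remaining columns, sorted alphabetically
measures = sorted(c for c in toy.columns if c not in idColumns + unwanted)

# new data frame with the desired column order
reordered = toy[idColumns + measures]
```

The resulting column order is Country, Year, Total, followed by the measurement columns in alphabetical order.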

Let's do some cleaning on the column names:

In [None]:
# see column names
allDFsConcat.columns.to_list()

In [None]:
# clean column names
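# raw strings (r'...') avoid invalid escape sequence warnings for \s
allDFsConcat.columns=allDFsConcat.columns.str.replace(r':\s',"_",regex=True) # colon+space becomes underscore
allDFsConcat.columns=allDFsConcat.columns.str.replace(r'\s',"",regex=True) # remaining spaces are removed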
#see
allDFsConcat.columns.to_list()
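
To see what each replacement does, here is a minimal sketch on three of the column names, held in a separate index (`toyNames` is a hypothetical name):

```python
import pandas as pd

# a few of the column names, for illustration
toyNames = pd.Index(["C1: Security Apparatus", "S2: Refugees and IDPs", "Country"])

# step 1: a colon followed by a space becomes an underscore
step1 = toyNames.str.replace(r":\s", "_", regex=True)

# step 2: any remaining whitespace is removed
step2 = step1.str.replace(r"\s", "", regex=True)
```

After the first step the names are `C1_Security Apparatus`, `S2_Refugees and IDPs` and `Country`; after the second they are `C1_SecurityApparatus`, `S2_RefugeesandIDPs` and `Country`.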

Let's convert the country names to upper case:

In [None]:
# overwriting country
allDFsConcat.Country=allDFsConcat.Country.str.upper()

Finally, let's check the format:

In [None]:
allDFsConcat.info()

We should save this result:

In [None]:
import os

# make sure the output folder exists before saving
os.makedirs("data",exist_ok=True)
allDFsConcat.to_csv(os.path.join("data","Fragility.csv"),index=False)
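
A quick way to confirm that a file was saved as intended is to read it back. The sketch below does this with a made-up data frame (`toy`) and a temporary folder, so it does not touch the real **data** folder:

```python
import os
import tempfile
import pandas as pd

# toy data frame standing in for the final result (hypothetical values)
toy = pd.DataFrame({"Country": ["SOMALIA"], "Year": [2013], "Total": [113.9]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Fragility.csv")
    toy.to_csv(path, index=False)   # index=False: row labels are not written
    readBack = pd.read_csv(path)    # read the file just saved
```

Because `index=False` was used, `readBack` has exactly the same three columns as `toy`, with no extra unnamed index column.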