## Data Exploration and Cleaning

The dataset has 1155 entries (in this case, heritage sites around the world) and 15 columns.

In [3]:
df.shape

(1155, 15)

Let's look at the columns we have.

In [4]:
df.columns

Index(['Name', 'short_description', 'date_inscribed', 'danger', 'date_end',
       'longitude', 'latitude', 'area_hectares', 'category_long',
       'category_short', 'Country name', 'Region', 'iso_code', 'transboundary',
       'rev_bis'],
      dtype='object')

The columns category_short, iso_code and rev_bis do not seem useful for our purposes, so we will delete them.

In [5]:
# Drop the three unused columns in a single call
df = df.drop(columns=['category_short', 'iso_code', 'rev_bis'])

I want to know what kind of categories we have

In [6]:
df['category_long'].unique()

array(['Cultural', 'Natural', 'Mixed'], dtype=object)

I also want to know the regions we have. I see some weird combinations, but I will fix them later.

In [7]:
df['Region'].unique()

array(['Europe and North America', 'Latin America and the Caribbean',
       'Africa', 'Arab States', 'Asia and the Pacific',
       'Asia and the Pacific,Europe and North America',
       'Asia and the Pacific,Europe and North America,Latin America and the Caribbean'],
      dtype=object)

The first obvious question that comes to mind is: How many heritage sites does Mexico hold? 

In [8]:
df[df['Country name']=='Mexico'].count()

Name                 35
short_description    35
date_inscribed       35
danger               35
date_end              0
longitude            35
latitude             35
area_hectares        35
category_long        35
Country name         35
Region               35
transboundary        35
dtype: int64
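A side note on the output above: `count()` reports the number of non-null values per column, which is why `date_end` shows 0 while every other column shows 35. If only the number of matching rows is needed, summing the boolean mask is one alternative. The sketch below uses a small made-up frame (`toy` and its rows are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Toy data standing in for the real frame; the rows are invented for illustration.
toy = pd.DataFrame({
    'Name': ['Site A', 'Site B', 'Site C'],
    'Country name': ['Mexico', 'Italy', 'Mexico'],
    'date_end': [None, None, None],
})

# count() gives non-null values per column, so an all-missing column reports 0.
per_column = toy[toy['Country name'] == 'Mexico'].count()
print(per_column)

# Summing the boolean mask counts matching rows, regardless of missing values.
n_mexico = int((toy['Country name'] == 'Mexico').sum())
print(n_mexico)
```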

We find out that the answer is 35, but is that a considerable amount, or is this just average or maybe very low? Let's find out.

Let's look at the top 20 countries. Mexico is in 7th place.

In [9]:
df.groupby('Country name').count().sort_values(by='Name',ascending=False).head(20)

Country name,Name,short_description,date_inscribed,danger,date_end,longitude,latitude,area_hectares,category_long,Region,transboundary
China,55,55,55,55,0,55,55,54,55,55,55
Italy,51,51,51,51,0,51,51,50,51,51,51
Spain,45,45,45,45,0,45,45,44,45,45,45
France,43,43,43,43,0,43,43,42,43,43,43
Germany,41,41,41,41,1,41,41,40,41,41,41
India,39,39,39,39,2,39,39,39,39,39,39
Mexico,35,35,35,35,0,35,35,35,35,35,35
United Kingdom of Great Britain and Northern Ireland,32,32,32,32,1,32,32,32,32,32,32
Russian Federation,26,26,26,26,0,26,26,26,26,26,26
Iran (Islamic Republic of),26,26,26,26,1,26,26,26,26,26,26
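One caveat with grouping on `Country name`: transboundary sites store several countries in a single comma-separated string (for example 'Mongolia,Russian Federation'), so they form their own groups and are not credited to each country. A minimal sketch of one way to credit every country, assuming the names are separated by plain commas; `toy` and its rows are made up for illustration:

```python
import pandas as pd

# Toy data standing in for the real frame; the rows are invented for illustration.
toy = pd.DataFrame({
    'Name': ['Site A', 'Site B', 'Site C', 'Site D'],
    'Country name': [
        'Mexico',
        'Mexico',
        'Mongolia,Russian Federation',  # a transboundary site
        'Russian Federation',
    ],
})

# Split the comma-separated country lists and give each country its own row.
per_country = (
    toy.assign(country=toy['Country name'].str.split(','))
       .explode('country')
)

# Count sites per country, largest first.
counts = per_country['country'].value_counts()
print(counts)
```

With this approach a shared site is counted once for each country it belongs to, so the per-country totals no longer add up to the number of rows.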


Now I want to explore the data by region, but here we notice something strange. Europe and North America, for some reason, form a single region. There is also a region called 'Asia and the Pacific', yet the combined labels 'Asia and the Pacific,Europe and North America' and 'Asia and the Pacific,Europe and North America,Latin America and the Caribbean' exist as well. That seems redundant, so we need to fix it.

In [10]:
df['Region'].unique()

array(['Europe and North America', 'Latin America and the Caribbean',
       'Africa', 'Arab States', 'Asia and the Pacific',
       'Asia and the Pacific,Europe and North America',
       'Asia and the Pacific,Europe and North America,Latin America and the Caribbean'],
      dtype=object)
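The combined labels can also be picked out programmatically rather than by eye. The sketch below assumes that a comma inside `Region` always marks a site assigned to more than one region, which holds for the labels listed above; `toy` and its rows are invented for illustration:

```python
import pandas as pd

# Toy data standing in for the real frame; the rows are invented for illustration.
toy = pd.DataFrame({
    'Name': ['Site A', 'Site B', 'Site C'],
    'Region': [
        'Africa',
        'Asia and the Pacific,Europe and North America',
        'Europe and North America',
    ],
})

# A comma in the label means the site was assigned to more than one region.
is_multi = toy['Region'].str.contains(',', regex=False)

# Show only the sites that carry a combined label.
multi_region = toy.loc[is_multi, ['Name', 'Region']]
print(multi_region)
```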

First, let's separate North America from Europe. I will list all the countries in this region, leaving out transboundary sites.

In [11]:
df[(df['Region']=='Europe and North America') & (df['transboundary']==0)]['Country name'].unique()

array(['Canada', 'Germany', 'Poland', 'United States of America',
       'Bulgaria', 'Croatia', 'France', 'Italy', 'Montenegro', 'Norway',
       'Serbia', 'Cyprus', 'Malta', 'Portugal', 'Switzerland', 'Holy See',
       'Spain', 'Turkey', 'Greece', 'Slovenia',
       'United Kingdom of Great Britain and Northern Ireland', 'Hungary',
       'Russian Federation', 'Ukraine', 'Finland', 'Romania', 'Sweden',
       'Albania', 'Czechia', 'Ireland', 'Slovakia', 'Denmark', 'Georgia',
       'Lithuania', 'Luxembourg', 'Netherlands', 'Armenia', 'Austria',
       'Estonia', 'Latvia', 'Belgium', 'Azerbaijan', 'Belarus', 'Israel',
       'Andorra', 'Iceland', 'Bosnia and Herzegovina', 'San Marino'],
      dtype=object)

United Kingdom of Great Britain and Northern Ireland is such a long name. For simplicity, I will replace it with United Kingdom.

In [12]:
df.loc[(df['Country name']=='United Kingdom of Great Britain and Northern Ireland') , 'Country name'] = 'United Kingdom'

Now, I will set the column "Region" as "North America" for the instances in which the column "Country name" is Canada or United States of America.

In [13]:
df.loc[(df['Country name']=='Canada') | (df['Country name']=='United States of America') , 'Region'] = 'North America'

And the region that used to be "Europe and North America" now will be "Europe".

In [14]:
df.loc[(df['Region']=='Europe and North America') , 'Region'] = 'Europe'
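The two relabelling steps can also be written with `isin` and `replace`. This is only a sketch of the same idea on a made-up frame (`toy` and the `north_america` list are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy data standing in for the real frame; the rows are invented for illustration.
toy = pd.DataFrame({
    'Country name': ['Canada', 'United States of America', 'France', 'Mexico'],
    'Region': ['Europe and North America'] * 3 + ['Latin America and the Caribbean'],
})

north_america = ['Canada', 'United States of America']

# Step 1: move the two North American countries to their own region.
toy.loc[toy['Country name'].isin(north_america), 'Region'] = 'North America'

# Step 2: rows still carrying the combined label are treated as European.
# replace() with a dict matches whole values, so longer combined labels are untouched.
toy['Region'] = toy['Region'].replace({'Europe and North America': 'Europe'})

print(toy)
```

The order matters: if step 2 ran first, Canada and the United States would be relabelled as Europe before they could be moved.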

We can see the data frame successfully updated.

In [15]:
df[(df['Country name']=='Canada') | (df['Country name']=='United States of America')].head(3)

Unnamed: 0,Name,short_description,date_inscribed,danger,date_end,longitude,latitude,area_hectares,category_long,Country name,Region,transboundary
0,L’Anse aux Meadows National Historic Site,<p>At the tip of the Great Northern Peninsula ...,1978,0,,-55.616667,51.466667,7991.0,Cultural,Canada,North America,0
1,Nahanni National Park,"<p>Located along the South Nahanni River, one ...",1978,0,,-125.589444,61.547222,476560.0,Natural,Canada,North America,0
10,Mesa Verde National Park,<p>A great concentration of ancestral Pueblo I...,1978,0,,-108.485556,37.261667,21043.0,Cultural,United States of America,North America,0


Grouping the data frame by region, we can see that we still need to fix the redundant labels "Asia and the Pacific,Europe and North America" and "Asia and the Pacific,Europe and North America,Latin America and the Caribbean". First, we need to see which sites correspond to these regions.

In [16]:
df.groupby('Region').count()

Region,Name,short_description,date_inscribed,danger,date_end,longitude,latitude,area_hectares,category_long,Country name,transboundary
Africa,98,98,98,98,10,98,98,98,98,98,98
Arab States,88,88,88,88,4,88,88,83,88,88,88
Asia and the Pacific,275,275,275,275,7,275,275,273,275,275,275
"Asia and the Pacific,Europe and North America",2,2,2,2,0,2,2,2,2,2,2
"Asia and the Pacific,Europe and North America,Latin America and the Caribbean",1,1,1,1,0,1,1,1,1,1,1
Europe,505,505,505,505,11,505,505,492,505,505,505
Latin America and the Caribbean,146,146,146,146,7,146,146,146,146,146,146
North America,40,40,40,40,2,40,40,40,40,40,40


I see 2 sites labelled with the region 'Asia and the Pacific,Europe and North America'. How can a site be in several regions at once?

Uvs Nuur Basin (https://whc.unesco.org/en/list/769/) lies between Mongolia and Russia.

Landscapes of Dauria (https://whc.unesco.org/en/list/1448/) lies between Mongolia and Russia as well.

Both are transboundary sites, so I will change their region to Asia and the Pacific.

In [17]:
df.loc[df['Region']== 'Asia and the Pacific,Europe and North America']

Unnamed: 0,Name,short_description,date_inscribed,danger,date_end,longitude,latitude,area_hectares,category_long,Country name,Region,transboundary
752,Uvs Nuur Basin,"<p>The Uvs Nuur Basin (1,068,853 ha), is the n...",2003,0,,92.719722,50.275,898063.5,Natural,"Mongolia,Russian Federation","Asia and the Pacific,Europe and North America",1
1072,Landscapes of Dauria,<p>Shared between Mongolia and the Russian Fed...,2017,0,,115.425444,49.930222,912624.0,Natural,"Mongolia,Russian Federation","Asia and the Pacific,Europe and North America",1


In [18]:
df.loc[(df['Region']=='Asia and the Pacific,Europe and North America') , 'Region'] = 'Asia and the Pacific'

I see 1 site labelled with the region 'Asia and the Pacific,Europe and North America,Latin America and the Caribbean'. How can one site be in so many regions?

The Architectural Work of Le Corbusier (https://whc.unesco.org/en/list/1321/) is a series of 17 sites across 7 countries; however, the latitude and longitude recorded for it are not correct. For simplicity, this entry will be deleted.

In [19]:
df=df.drop(df.loc[df['Region']== 'Asia and the Pacific,Europe and North America,Latin America and the Caribbean'].index[0])
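An equivalent way to remove the entry is to keep every row that does not carry the combined label, followed by a quick check that no multi-region labels remain. This is a sketch on a made-up frame (`toy`, `label` and `cleaned` are illustrative names), assuming a comma only ever appears in combined labels:

```python
import pandas as pd

# Toy data standing in for the real frame; the rows are invented for illustration.
toy = pd.DataFrame({
    'Name': ['Site A', 'Site B', 'Site C'],
    'Region': [
        'Europe',
        'Asia and the Pacific,Europe and North America,Latin America and the Caribbean',
        'Africa',
    ],
})

label = 'Asia and the Pacific,Europe and North America,Latin America and the Caribbean'

# Keep every row whose region is not the combined label.
cleaned = toy[toy['Region'] != label]

# Sanity check: after cleaning, no region label should contain a comma.
assert not cleaned['Region'].str.contains(',', regex=False).any()
print(cleaned)
```

Unlike dropping `.index[0]`, the mask removes every matching row and does not fail when there is no match.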