# Installation instructions

```bash
conda install -c conda-forge geopandas
conda install -c conda-forge cartopy

# GDAL bindings (if not already installed):
# Windows: pip install gdal
# macOS:   brew install gdal
```

# Exploratory Data Analysis in Action: global power plant distribution

In the subsequent notebooks we apply a variety of EDA techniques to a real-world data set.

The data set [__Global Power-Plants__](https://www.kaggle.com/ramjasmaurya/global-powerplants) is available on [Kaggle](https://www.kaggle.com/).



It has already been downloaded for you and can be found in the `data` folder:

    ../data/powerplants.csv

<img src="./_img/power_plant.webp"> 

Source: [The Quint](https://www.thequint.com/news/environment/thermal-power-plants-use-up-more-water-than-permitted-rti-data-shows/).

### Content

This dataset contains information about power plants worldwide. Each record includes the name, country, energy source type, geographic location, start date and other data elements. In this analysis we want to learn something about the geospatial distribution and the energy share of power plants, both worldwide and in specific regions. With respect to the global goal of reducing greenhouse gas emissions, energy production is one of the most important sectors. Up-to-date knowledge about the share of green and fossil energy sources, and how it changes over time, is therefore important information for political decisions.



# Data preparation

**Import statements**

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
sys.path.append("../src")

plt.rcParams["figure.figsize"] = [20,9]

![](./_img/Time_data_science.png)

Source: [Gil Press (2016)](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#55852a146f63)

## Reading the Dataset

In [None]:
pp = pd.read_csv("../data/powerplants.csv")

In [None]:
pp.shape

In [None]:
pp.sample(10)

## Data Cleaning

[Data cleansing or data cleaning](https://en.wikipedia.org/wiki/Data_cleansing) is the process of **detecting and correcting (or removing) corrupt or inaccurate records** from a record set, table, or database and refers to **identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data**.

### Dealing with incomplete (`NaN`) and irrelevant data

Missing values are a well-known problem: nearly everywhere data is measured and recorded, some values go missing. The reasons vary: values may not be measured, values may be measured but get lost, or values may be measured but are considered unusable. Missing values can lead to problems, because further data processing and analysis steps often rely on complete data sets. Therefore missing values need to be replaced with reasonable values. In statistics this process is called **imputation**.

When faced with the problem of missing values it is important to understand the mechanism that causes missing data. Such an understanding is useful, as it may be employed as background knowledge for selecting an appropriate imputation strategy. 

**Check for `NaN`**

Note that in many cases missing values are assigned special characters, such as `-999`, `NA`, `k.A.` etc.; hence, you as a data analyst are responsible for taking appropriate action.    
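
As a small illustration of handling such placeholders, the sketch below reads a made-up CSV snippet (not the power plant data) and uses the `na_values` parameter of `pd.read_csv` to treat `-999` and `k.A.` as missing. The column names and values are invented for this example; `NA` is already recognised as missing by pandas by default.

```python
import io

import pandas as pd

# Invented CSV content with three different "missing" markers
raw = io.StringIO(
    "plant,capacity,start\n"
    "A,120,1998\n"
    "B,-999,k.A.\n"
    "C,NA,2005\n"
)

# Declare the additional placeholders that should be parsed as NaN
toy = pd.read_csv(raw, na_values=["-999", "k.A."])

# Count the missing values per column
print(toy.isnull().sum())
```

Declaring the placeholders while reading keeps numeric columns numeric, so later calls such as `isnull()` see all gaps in the same way.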

In [None]:
pp.shape[0]

In [None]:
pp.notnull().sum()

In [None]:
pp.isnull().sum()

> **Challenge:** Calculate the percentage of NaN-values in each column. (4 min)

In [None]:
## Your code here


In [None]:
# %load ../src/_solutions/percentage.py

**Strategies to deal with missing data in Python**

In general there are many options to consider when imputing missing values, for example:
* A constant value that has meaning within the domain, such as 0, distinct from all other values.
* A value from another randomly selected record.
* A mean, median or mode value for the column.
* A value estimated by another predictive model.

There are some libraries implementing more or less advanced missing value imputation strategies such as 

* [`statsmodels`](http://www.statsmodels.org/dev/imputation.html) ([Multiple Imputation with Chained Equations (MICE)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/))
* [`fancyimpute`](https://github.com/iskandr/fancyimpute) (matrix completion and imputation algorithms)
* [`scikit-learn`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) (mean, median, most frequent)
* [`pandas`](https://pandas.pydata.org/pandas-docs/stable/missing_data.html) ([`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html), [`interpolate`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html) methods)
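
To make some of the simpler strategies concrete, here is a minimal sketch using only the pandas methods `fillna` and `interpolate` on invented toy data. The variable names are hypothetical and the example is independent of the power plant data set.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two gaps (values are made up)
capacity = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

filled_constant = capacity.fillna(0)                 # constant value
filled_median = capacity.fillna(capacity.median())   # median of the observed values
filled_linear = capacity.interpolate()               # linear interpolation between neighbours

# Toy categorical column: fill the gap with the most frequent value (mode)
fuel = pd.Series(["Gas", None, "Gas", "Wind"])
filled_mode = fuel.fillna(fuel.mode().iloc[0])

print(pd.DataFrame({
    "constant": filled_constant,
    "median": filled_median,
    "linear": filled_linear,
}))
print(filled_mode.tolist())
```

Which of these is appropriate depends on why the values are missing; interpolation, for instance, only makes sense when neighbouring rows are actually related.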


**Our working strategy to deal with missing data**

_Because the share of missing values in our data set is considerable and we cannot easily predict or make assumptions about the specific missing values, we simply remove the columns with a high share of missing data._ 

_Nevertheless we have to keep in mind that our dataset still contains gaps, which we have to handle individually in the following analysis._

In [None]:
pp

> **Challenge:** Drop all columns in the dataframe that have more than 60 percent null values. Refactor your code afterwards and externalise your code into a function that takes the original dataframe and a threshold. The function should return the new dataframe.

> **Hint**: Make use of the `df.index` attribute. (8 min)

In [None]:
# Hint: The index attribute allows us to extract the column names in this case
ratios = ((pp.isnull().sum() / pp.shape[0]) * 100).round(2)
display(ratios)
# for example here are the columns for which the "missing-ness" ratio is 0
print() # new line
print("Columns with a ratio of 0")
display(ratios.loc[ratios == 0.0].index)

In [None]:
type(ratios)

*Hint:* Remember the `copy()` method to return the result of the function.

In [None]:
## Your code here


In [None]:
# %load ../src/_solutions/filter_dataframe.py

In [None]:
print(f"Number of columns before cleanup: {pp.shape[1]}")
pp = filter_dataframe(pp, threshold=60)
print(f"Number of columns after cleanup: {pp.shape[1]}")
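
For comparison, one possible way to write such a filter is sketched below on invented toy data. The function name `drop_sparse_columns` is hypothetical and the sketch is not necessarily identical to the `filter_dataframe` solution loaded above; it only assumes that the threshold is given in percent.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=60):
    """Return a copy of df without columns whose share of nulls (in percent) exceeds threshold."""
    null_share = df.isnull().mean() * 100                   # percentage of nulls per column
    keep = null_share.loc[null_share <= threshold].index    # column names to keep
    return df[keep].copy()


toy = pd.DataFrame({
    "a": [1, 2, 3, 4],                    # no missing values
    "b": [np.nan, np.nan, np.nan, 1.0],   # 75 % missing
    "c": [np.nan, np.nan, 1.0, 2.0],      # 50 % missing
})

cleaned = drop_sparse_columns(toy, threshold=60)
print(cleaned.columns.tolist())
```

Returning a copy means that later modifications of the result do not affect the original dataframe.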

## Dealing with data structures

Upon closer inspection we see that the `start date` column is stored as a floating point number only containing a year (which is rather an integer, isn't it?).

For simplicity let's transform it to integer numbers.

In [None]:
pp[["country code", "start date"]].sample(10, random_state=42)

In [None]:
pp['start date'].dtype

In order to transform the column to an integer datatype we have to deal with the `NaN` values first, because `NaN` cannot be stored in a plain integer column.

One option is to assign a distinct value that cannot occur as a real entry in the column.


> **Challenge:** Assign an integer to NaN values using the `fillna()` and change the data type to integer using the `astype()` method. (5 min)

In [None]:
## Your code here


In [None]:
# %load ../src/_solutions/imputation.py

In [None]:
pp[["country code", "start date"]].sample(10, random_state=42)
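
As an aside, the sentinel approach can be compared with the nullable integer dtype `Int64` that pandas provides, which keeps missing entries as `<NA>`. The sketch below uses invented years, and the sentinel `-1` is an arbitrary choice for illustration; this alternative is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy column of years stored as floats, with one gap
years = pd.Series([1998.0, np.nan, 2005.0], name="start date")

# Sentinel approach: replace NaN with a value that cannot be a real year
with_sentinel = years.fillna(-1).astype(int)

# Alternative: nullable integer dtype, the gap stays marked as missing
nullable = years.astype("Int64")

print(with_sentinel.tolist())
print(nullable)
```

With a sentinel, every later computation (for example a mean start year) has to exclude the sentinel explicitly, whereas the nullable dtype is skipped by default in pandas aggregations.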

## Visualisation of spatial datasets

_Note: In the subsequent cells we load Python libraries for spatial data analysis, such as `shapely`, `fiona`, `geopandas`, `cartopy` and `folium`. Make sure that you have installed the [GDAL bindings](http://www.gdal.org/index.html) on your computer._

In this section we make use of third-party libraries for visualising geospatial information, such as [GeoPandas](http://geopandas.org/index.html) and [shapely](http://toblerity.org/shapely/), which abstract away many algorithmic and computational issues of spatial data processing and plotting by integrating the workhorses of geospatial computing, such as [GEOS](http://trac.osgeo.org/geos/), [GDAL](http://www.gdal.org/), [OGR](http://gdal.org/1.11/ogr/) and [proj.4](http://proj4.org/), among others.

The geographical information is stored in the `latitude` and `longitude` columns. We can use this information to locate each power plant in the world.

**Transform the variables `latitude` and `longitude` to spatial coordinates**

In [None]:
from shapely.geometry import Point
geometry = [Point(xy) for xy in zip(pp['longitude'], pp['latitude'])]
geometry[0:5]

The `Point` class does not yield any useful information upon printing it. 

However it contains `x` and `y` attributes that map our longitudes and latitudes onto x-y coordinates.

In [None]:
point_0 = geometry[0]
print(point_0.x)
print(point_0.y)

It is nothing more than a slightly more efficient way of storing coordinates. 

And the library that will allow us to describe coordinates uses exactly this representation.

**Use GeoPandas to make a pandas `DataFrame` spatially aware.**

[GeoPandas](http://geopandas.org/index.html) extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by [shapely](https://shapely.readthedocs.io/en/stable/). GeoPandas further depends on [fiona](https://fiona.readthedocs.io/en/stable/) for file access and descartes and [matplotlib](https://matplotlib.org/) for plotting.

It combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple geometries for shapely. 

In [None]:
import geopandas as gpd
gdf = gpd.GeoDataFrame(pp, geometry=geometry)
gdf.head()

**Make sure that every entry has valid spatial coordinates**

> **Challenge:** Check the columns `longitude` and `latitude` for NaN-values. (4 min)

In [None]:
## Your code here


In [None]:
# %load ../src/_solutions/long_lat.py

**Assign a spatial coordinate reference system (`crs`) to our GeoPandas object**

In general the CRS may be defined in several ways, for example the CRS may be defined as [Well-known text (WKT)](https://en.wikipedia.org/wiki/Well-known_text) format, or [JSON](https://en.wikipedia.org/wiki/JSON) format, or [GML](https://en.wikipedia.org/wiki/Geography_Markup_Language) format, or in the [Proj4](https://en.wikipedia.org/wiki/PROJ.4) format, among many others.

The Proj4 format is a generic, string-based description of a CRS. It defines projection types and parameter values for particular projections. For instance the Proj4 format string for the [European Terrestrial Reference System 1989 (ETRS89)](https://en.wikipedia.org/wiki/European_Terrestrial_Reference_System_1989) is:

    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs

Given the enormous number of existing CRS, the [International Association of Oil & Gas Producers (IOGP)](https://en.wikipedia.org/wiki/International_Association_of_Oil_%26_Gas_Producers), which absorbed the former **_European Petroleum Survey Group (EPSG)_**, maintains a collection of definitions for global, regional, national and local coordinate reference systems and coordinate transformations, the [EPSG Geodetic Parameter Dataset](http://www.epsg.org/). Within this collection each coordinate reference system gets a unique integer identifier, commonly referred to as its EPSG code. For instance, the EPSG identifier for the latest revision of the [World Geodetic System (WGS84)](https://en.wikipedia.org/wiki/World_Geodetic_System) is simply [4326](http://spatialreference.org/ref/epsg/4326/).


A nice lookup page for different coordinate reference systems is found [here](https://epsg.io/) and a fancy visualization of many prominent map projections is found [here](https://bl.ocks.org/mbostock/raw/3711652/).


In [None]:
gdf.set_crs('epsg:4326', inplace=True)

### Context matters: Load _Natural Earth countries_ dataset, bundled with GeoPandas

[Natural Earth](http://www.naturalearthdata.com/) is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software. A subset comes bundled with GeoPandas and is accessible from the `gpd.datasets` module. We’ll use it as a helpful global base layer map. Note that the `gpd.datasets` module was deprecated in GeoPandas 0.14 and removed in 1.0; with newer versions the dataset has to be downloaded from Natural Earth directly.



In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(2)

In [None]:
world.crs

In [None]:
world.plot(facecolor='lightgray')

**Combine world map and the power plant data set**

In [None]:
base = world.plot(facecolor='lightgray') # create base plot (world map)
gdf.plot(
    ax=base,         # specify object to plot over
    marker='o',      # select shape of each plotted point (circle, cross, etc.)
    color='red',     # color
    markersize=2,    # size of each plotted point
    alpha=0.1        # transparency of each plotted point (areas with fewer power plants become less visible)
);

### Spatially filtering the data by geolocation

In a world view the point observations overlap each other a lot, so we cannot see any pattern. It is therefore useful to focus the analysis on a specific geographical region. In the following we want to focus on a continental and a country view.

**Europe focus**

As a first example we want to restrict our analysis to data points, which refer to an area within Europe. Although it is not straightforward to define Europe as an entity, in terms of geography, politics or sphere of cultural identity, we define Europe as an area between the coordinates  

$$\text{33.0 to 73.5°N and 27.0°W to 45.0°E.}$$

In order to represent that area spatially, we construct a `Polygon` object to represent the [bounding box](https://en.wikipedia.org/wiki/Minimum_bounding_box) of Europe. 

In [None]:
from shapely.geometry import Polygon
# generate geopandas object
poly_europe = gpd.GeoSeries([Polygon([(-27,33), (45,33), (45,73.5), (-27,73.5)])])
bb_europe = gpd.GeoDataFrame({'geometry': poly_europe})
# assign crs
bb_europe.set_crs(epsg=4326, inplace=True)
bb_europe
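
As a side note, shapely also offers the `box(minx, miny, maxx, maxy)` helper for axis-aligned rectangles. The following sketch, which is independent of the notebook's variables, checks that it describes the same area as the explicit `Polygon` and that a point roughly at Berlin's coordinates (13.4°E, 52.5°N) lies inside it.

```python
from shapely.geometry import Point, Polygon, box

# Explicit corner list, as used for the bounding box of Europe
explicit = Polygon([(-27, 33), (45, 33), (45, 73.5), (-27, 73.5)])

# Shorthand for an axis-aligned rectangle: box(minx, miny, maxx, maxy)
shorthand = box(-27, 33, 45, 73.5)

# Both describe the same area
print(shorthand.equals(explicit))
print(shorthand.bounds)

# A point is given as (x, y) = (longitude, latitude)
berlin = Point(13.4, 52.5)
print(shorthand.contains(berlin))
```

Note the coordinate order: x corresponds to longitude and y to latitude, which is easy to mix up when reading bounds that are stated as "latitude, longitude".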

**Plot world map and bounding box of Europe**

In [None]:
base = world.plot(facecolor='lightgray')
bb_europe.plot(ax=base, alpha=0.5);

**Subset (intersect) the GeoPandas `DataFrame` with the bounding box of Europe**

In [None]:
gdf_europe = gpd.sjoin(gdf, bb_europe, how="inner", predicate='intersects').drop("index_right", axis=1)

print(gdf_europe.shape)
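
For a purely rectangular region, the same kind of selection can also be sketched in plain pandas by comparing the coordinate columns with the box limits. The toy data below is invented; the spatial join above is the more general tool, since it also works for arbitrary polygons.

```python
import pandas as pd

# Invented points: three outside the European bounding box, one inside
plants = pd.DataFrame({
    "name": ["too_far_north", "too_far_south", "too_far_west", "inside"],
    "latitude": [80.0, 10.0, 50.0, 50.0],
    "longitude": [10.0, 10.0, -40.0, 10.0],
})

# 33.0 to 73.5°N and 27.0°W to 45.0°E; `between` includes the limits by default
in_box = (
    plants["latitude"].between(33.0, 73.5)
    & plants["longitude"].between(-27.0, 45.0)
)
subset = plants.loc[in_box]
print(subset)
```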

In [None]:
gdf_europe.plot(markersize=2);

**We also catch parts of the African and Asian continents**

Fortunately the Natural Earth dataset also contains continent definitions. We can intersect our subset with the countries that Natural Earth assigns to Europe, so that only points matching both definitions of Europe remain.

In [None]:
gdf_europe = gpd.overlay(gdf_europe, world.loc[world['continent'] == 'Europe'], how='intersection')
print(gdf_europe.shape)

In [None]:
gdf_europe.plot(markersize=2);

Note that the result is again not perfect: for example, we do not include the European part of Turkey.
On the other hand, the Natural Earth dataset on its own counts all of Russia and some islands far away from the European continent as Europe.

This definition of Europe should be sufficient for our analysis, but feel free to bring in your own if needed.

In [None]:
world.loc[world['continent'] == 'Europe'].plot()

**In order to keep the spatial context we extract the area of Europe from the world map**

In [None]:
europe = gpd.overlay(world, bb_europe, how='intersection')
europe.set_crs('epsg:4326', inplace=True);

In [None]:
base = europe.plot(facecolor='lightgray')
gdf_europe.plot(ax=base, marker='o', markersize=2, alpha=0.75);

## Dealing with categorical data

`geopandas` includes `column`, `categorical` and `legend` keywords in order to better distinguish between individual types of data inside the plot. 

In [None]:
base = europe.plot(facecolor='lightgray')
gdf_europe.plot(ax=base, marker='o', markersize=2, alpha=0.75, column='primary_fuel', categorical=True, legend=True);

### Choosing individual sub-categories

As our dataset provides many categories, the resulting plot is not ideal.

Far too many categories are displayed and some even share the same color.

One way to circumvent this is to choose the individual categories we want to display,
for example only the most frequent ones in the dataset.

In [None]:
gdf_europe["primary_fuel"].value_counts()

In [None]:
# only get top-5 categories
gdf_europe["primary_fuel"].value_counts().iloc[:5]

In [None]:
top_5_fuels = gdf_europe["primary_fuel"].value_counts().iloc[:5].index

In [None]:
top_5_fuels

In [None]:
base = europe.plot(facecolor='lightgray')
gdf_europe.loc[gdf_europe["primary_fuel"].isin(top_5_fuels)].plot(
        ax=base,                # plotting object to plot over
        marker='o',             # marker shape
        markersize=2,           # marker size
        alpha=0.75,             # transparency
        column='primary_fuel',  # column to pick categories from
        categorical=True,       # force the plot to choose different colors for individual categories
        legend=True             # include legend
)

**Colors chosen this way are selected by GeoPandas and may not fit the categories. For better control we can loop over each fuel**

In [None]:
# Create dictionary containing colors for each top_5 fuel kind
colors = {
    "Solar": "yellow",
    "Wind": "blue",
    "Hydro": "aqua",
    "Gas": "red",
    "Biomass": "darkgoldenrod",
}

# Plot base map
base = europe.plot(facecolor='lightgray')
for fuel in top_5_fuels:
    # Plot each respective fuel
    gdf_europe.loc[gdf_europe["primary_fuel"] == fuel].plot(
        ax=base,                # plotting object to plot over
        marker='o',             # marker shape
        markersize=2,           # marker size
        alpha=0.5,              # transparency
        color = colors[fuel],   # color
        label = fuel            # name in legend
)           
base.legend() # show legend

**Still pretty ugly although more informative than the original plot**

For further analysis let's add a new column which decides whether the fuel type is "green", i.e. sustainable, or not.

**Feel free to add further categories or change the definition of sustainable energy**

In [None]:
gdf['primary_fuel'].unique()

In [None]:
green_fuels = ['Hydro', 'Solar', 'Wind', 'Biomass', 'Wave and Tidal', 'Geothermal']
def is_green(entry):
    return entry in green_fuels

gdf['green'] = gdf['primary_fuel'].apply(is_green)

> **Challenge:** Try to select only entries within `green_fuels` using the `loc` and `isin()` method. (5 min)

In [None]:
## Your code here


In [None]:
# %load ../src/_solutions/locisin.py

In [None]:
base = world.plot(facecolor='lightgray')

gdf.loc[gdf['green'] == True].plot(ax = base, markersize=2, alpha=0.1, color='green', label='green')
gdf.loc[gdf['green'] == False].plot(ax = base, markersize=2, alpha=0.1, color='red', label='unsustainable')
base.legend()

**The legend directly infers the marker shape, label and color from the plot. Because of the low alpha we can barely see the corresponding markers.**
> Hack:

In [None]:
base = world.plot(facecolor='lightgray')

# Add dummy plots (of empty lists -> nothing to be plotted) for correct legend assignment
base.scatter([],[],color='green', marker='o', label='green')
base.scatter([],[],color='red', marker='o', label='unsustainable')

gdf.loc[gdf['green'] == True].plot(ax = base, markersize=2, alpha=0.1, color='green')
gdf.loc[gdf['green'] == False].plot(ax = base, markersize=2, alpha=0.1, color='red')
base.legend(fontsize=16)

**Regional focus - Germany**

In [None]:
filtered = world[world.name == "Germany"]

In [None]:
gdf_germany = gdf.loc[gdf['country'] == "Germany"].copy()

In [None]:
germany = gpd.overlay(world, filtered, how='intersection')
germany.set_crs('epsg:4326', inplace=True)

In [None]:
base = germany.plot(facecolor='lightgray')
# Add dummy plots (of empty lists -> nothing to be plotted) for correct legend assignment
base.scatter([],[],color='green', marker='o', label='green')
base.scatter([],[],color='red', marker='o', label='unsustainable')

gdf_germany.loc[gdf_germany['green'] == True].plot(ax = base, markersize=4, alpha=0.7, color='green')
gdf_germany.loc[gdf_germany['green'] == False].plot(ax = base, markersize=4, alpha=0.7, color='red')
base.legend(fontsize=12)

## Variable size of plotted points

We can also add a column name as `markersize` argument inside the plotting method.

In [None]:
base = germany.plot(facecolor='lightgray')
# Add dummy plots (of empty lists -> nothing to be plotted) for correct legend assignment
base.scatter([],[],color='green', marker='o', label='green')
base.scatter([],[],color='red', marker='o', label='unsustainable')

gdf_germany.loc[gdf_germany['green'] == True].plot(ax = base, markersize="estimated_generation_gwh_2020", alpha=0.7, color='green')
gdf_germany.loc[gdf_germany['green'] == False].plot(ax = base, markersize="estimated_generation_gwh_2020", alpha=0.7, color='red')
base.legend(fontsize=12)

**Normalizing the values**

The resulting markers are far too big. Previously we set `markersize=4`; now the marker sizes are taken from the column, whose values can be far greater than that.

One way to deal with this is to scale the data beforehand.

**E.g. Scale data to a set upper bound $m$: $m \cdot \frac{x}{\max(x)}$**

The maximum value in $x$ will now be equal to $m$ and all other values will keep the same relative distance to it.

This may result in many very small points depending on the distribution of $x$.

In [None]:
upper_bound = 400
gdf_germany['estimated_generation_gwh_2020_scaled'] = upper_bound * gdf_germany['estimated_generation_gwh_2020'] / gdf_germany['estimated_generation_gwh_2020'].max()

In [None]:
gdf_germany['estimated_generation_gwh_2020_scaled'].describe()

In [None]:
base = germany.plot(facecolor='lightgray')
# Add dummy plots (of empty lists -> nothing to be plotted) for correct legend assignment
base.scatter([],[],color='green', marker='o', label='green')
base.scatter([],[],color='red', marker='o', label='unsustainable')

gdf_germany.loc[gdf_germany['green'] == True].plot(ax = base, markersize="estimated_generation_gwh_2020_scaled", alpha=0.7, color='green')
gdf_germany.loc[gdf_germany['green'] == False].plot(ax = base, markersize="estimated_generation_gwh_2020_scaled", alpha=0.7, color='red')
base.legend(fontsize=12)

**E.g. Min-Max Scaling to $[a,b]$: $a + \frac{(x-\min(x)) \cdot (b - a)}{\max(x) - \min(x)}$**

This way we can control both the minimum and maximum values in which our values will end up.

> **Challenge:** Try to calculate Min-Max Scaling for the column `gdf_germany['estimated_generation_gwh_2020']` with `a=4` and `b=400`. (10 min)

In [None]:
## Your code here


In [None]:
# %load ../src/_solutions/minmax_manually.py

You can also use the function `minmax_scaler()` from the `helper` module.

In [None]:
from helper import minmax_scaler

In [None]:
gdf_germany['estimated_generation_gwh_2020_scaled'] = minmax_scaler(gdf_germany['estimated_generation_gwh_2020'], lower_bound=4, upper_bound=400)
gdf_germany['estimated_generation_gwh_2020_scaled'].describe()

In [None]:
base = germany.plot(facecolor='lightgray')
# Add dummy plots (of empty lists -> nothing to be plotted) for correct legend assignment
base.scatter([],[],color='green', marker='o', label='green')
base.scatter([],[],color='red', marker='o', label='unsustainable')

gdf_germany.loc[gdf_germany['green'] == True].plot(ax = base, markersize="estimated_generation_gwh_2020_scaled", alpha=0.7, color='green')
gdf_germany.loc[gdf_germany['green'] == False].plot(ax = base, markersize="estimated_generation_gwh_2020_scaled", alpha=0.7, color='red')
base.legend(fontsize=12)

We notice that the "big" dots stayed somewhat similar whereas the smaller ones grew.
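
The effect can be reproduced on a few invented numbers. The sketch below implements both formulas from above; the function names `scale_to_upper_bound` and `minmax_scale_sketch` are hypothetical and not necessarily how `helper.minmax_scaler` is implemented, and the min-max version assumes that the column is not constant.

```python
import pandas as pd


def scale_to_upper_bound(x, m):
    """Scale x so that its maximum becomes m: m * x / max(x)."""
    return m * x / x.max()


def minmax_scale_sketch(x, a, b):
    """Scale x linearly into [a, b]; assumes max(x) != min(x)."""
    return a + (x - x.min()) * (b - a) / (x.max() - x.min())


# Invented values spanning several orders of magnitude
x = pd.Series([1.0, 10.0, 100.0, 1000.0])

comparison = pd.DataFrame({
    "raw": x,
    "upper_bound": scale_to_upper_bound(x, m=400),
    "min_max": minmax_scale_sketch(x, a=4, b=400),
})
print(comparison.round(2))
```

The largest value maps to 400 in both cases, while the smallest value is lifted to the lower bound 4 by min-max scaling instead of shrinking towards zero.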

## Even more colorful plots

We offer a function to produce plots using earth projections you may be more familiar with.

Feel free to take a look at it; it is imported from the `helper` module in the `src` folder.

In [None]:
from helper import cuteplot

In [None]:
?cuteplot

In [None]:
cuteplot(gdf_europe)

In [None]:
cuteplot(gdf_europe.loc[gdf_europe["primary_fuel"] == "Solar"], color = "blue", alpha = 0.5)

**The function returns a `Figure` and an `Axes` object, allowing for further manipulation, for example setting a title as shown here**

In [None]:
fig, ax = cuteplot(gdf.loc[gdf["primary_fuel"] == "Hydro"], color = "blue", alpha = 0.3, map_extent=None, label = "Hydro")

ax.set_title("Hydro Power Plants across the Earth", size=24)

**We can also pass the `ax` object back into the function and plot further points on top**

In [None]:
cuteplot(gdf.loc[gdf["primary_fuel"] == "Coal"], color = "red", alpha = 0.3, map_extent=None, label = "Coal", ax = ax)

In [None]:
ax.legend(fontsize=16)
ax.set_title("Hydro and Coal Power Plants across the Earth", size=24)
fig

## Saving our Dataframe to disk for further analysis

We have spent quite a lot of time preparing our dataset for the analysis. Finally, we want to save our dataframes to disk so that we can reload this working state at any time. There are plenty of options; in this tutorial we use the `pickle` library to serialize our dataframes:

In [None]:
gdf_world = gpd.overlay(gdf, world, how='intersection')

In [None]:
gdf_world = gdf_world[['country code', 'country', 'name of powerplant', 'capacity in MW',
       'latitude', 'longitude', 'primary_fuel', 'start date', 'owner of plant',
       'geolocation_source', 'estimated_generation_gwh_2020', 'green', 'continent', 'geometry']]

In [None]:
gdf_europe = gpd.sjoin(gdf_world.loc[gdf_world['continent'] == "Europe"], bb_europe, how="inner", predicate='intersects').drop("index_right", axis=1)
gdf_germany = gdf_world.loc[gdf_world['country'] == "Germany"]

In [None]:
## serialize the data to disk (the files are closed automatically)
import pickle
with open("../data/gdf_world.p", "wb") as f:
    pickle.dump(gdf_world, f)
with open("../data/gdf_europe.p", "wb") as f:
    pickle.dump(gdf_europe, f)
with open("../data/gdf_germany.p", "wb") as f:
    pickle.dump(gdf_germany, f)

In [None]:
## alternatively save in geojson format
gdf_world.to_file("../data/gdf_world.geojson", driver='GeoJSON')
gdf_europe.to_file("../data/gdf_europe.geojson", driver='GeoJSON')
gdf_germany.to_file("../data/gdf_germany.geojson", driver='GeoJSON')

In [None]:
### gdf = gpd.read_file("file.geojson")
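
Loading a pickled object again follows the same pattern in reverse. The sketch below shows a full round trip with a small invented pandas `DataFrame` and a temporary directory, so it does not touch the files written above; the assumption is that the pickled geodataframes can be reloaded in the same way with `pickle.load`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small invented frame standing in for one of the prepared dataframes
frame = pd.DataFrame({
    "country": ["Germany", "France"],
    "capacity in MW": [100.0, 250.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "frame.p")

    # Write the object to disk ...
    with open(path, "wb") as f:
        pickle.dump(frame, f)

    # ... and read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(frame))
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.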

## Ready!! Now it's your turn :-)