# Data Exploration with pandas
This notebook walks through basic data exploration with the pandas library.

In [17]:
import pandas as pd
import numpy as np

# we will use the famous gapminder dataset
from gapminder import gapminder
df = gapminder

Look at the top and bottom of the dataset to get a first impression of its contents.

For this we call the methods .head() and .tail() on our DataFrame. By default they return the first and the last five rows.

In [18]:
df.head()

,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [19]:
df.tail()

,country,continent,year,lifeExp,pop,gdpPercap
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298
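Both methods accept an optional row count. The following is a minimal sketch on a hypothetical ten-row frame named `toy` (not the gapminder data), showing how to ask for a different number of rows:

```python
import pandas as pd

# Hypothetical stand-in for a larger frame: ten rows, one column.
toy = pd.DataFrame({"year": list(range(1952, 2002, 5))})

first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```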


.shape is an attribute that holds the dimensions of the DataFrame as a tuple: (number of rows, number of columns).

In [20]:
df.shape

(1704, 6)
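Because .shape is a plain tuple, it can be unpacked directly. A small sketch, using a hand-built two-row frame `toy` as an assumed stand-in for the real data:

```python
import pandas as pd

# Hypothetical two-row frame; values taken from the first and last rows shown earlier.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "year": [1952, 2007],
    "lifeExp": [28.801, 43.487],
})

n_rows, n_cols = toy.shape  # tuple unpacking: (rows, columns)
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the row count only.
assert len(toy) == n_rows
```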

To explore the data types we use the method .info().
For every column it displays the non-null count and the dtype.

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
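.info() prints its report rather than returning it. When the same facts are needed as values, .dtypes and .isna().sum() are one possible route. The sketch below uses a hypothetical frame `toy` with one missing value inserted on purpose:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the NaN is added deliberately to show the missing-value count.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe", "Zimbabwe"],
    "lifeExp": [28.801, np.nan, 43.487],
})

column_types = toy.dtypes              # Series: column name -> dtype
missing_per_column = toy.isna().sum()  # Series: column name -> number of missing values

print(column_types)
print(missing_per_column)
```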


With .describe() we get basic summary statistics of our dataset, such as count, mean, standard deviation, and quartiles.

In [22]:
df.describe()  # analyzes numeric columns only
# to summarize other data types, pass them via include, here for object columns
# (np.object was removed in NumPy 1.24; the builtin object works instead)
#df.describe(include=object)

,year,lifeExp,pop,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,59.474439,29601210.0,7215.327081
std,17.26533,12.917107,106157900.0,9857.454543
min,1952.0,23.599,60011.0,241.165876
25%,1965.75,48.198,2793664.0,1202.060309
50%,1979.5,60.7125,7023596.0,3531.846988
75%,1993.25,70.8455,19585220.0,9325.462346
max,2007.0,82.603,1318683000.0,113523.1329
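For text columns, .describe() reports count, unique, top, and freq instead of numeric statistics. A minimal sketch, assuming a small hypothetical frame `toy` and selecting the text columns explicitly:

```python
import pandas as pd

# Hypothetical three-row frame with two text columns and one numeric column.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "continent": ["Asia", "Asia", "Africa"],
    "year": [1952, 1957, 2007],
})

# Describing only the text columns yields count / unique / top / freq.
text_summary = toy[["country", "continent"]].describe()
print(text_summary)
```

Selecting the columns by name sidesteps any question of which dtype the installed pandas version assigns to strings.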


.value_counts() counts how often each distinct value occurs in a specific column.

In [24]:
df['country'].value_counts()

Swaziland          12
Albania            12
Benin              12
Brazil             12
South Africa       12
                   ..
Slovak Republic    12
Cuba               12
Chad               12
Singapore          12
Vietnam            12
Name: country, Length: 142, dtype: int64
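Passing normalize=True turns the counts into relative frequencies. A short sketch on a hypothetical Series `continents`:

```python
import pandas as pd

# Hypothetical column of continent labels.
continents = pd.Series(["Asia", "Asia", "Africa", "Europe"], name="continent")

counts = continents.value_counts()                # absolute counts, most frequent first
shares = continents.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```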

.unique() returns the distinct values of a column in order of first appearance.

In [25]:
df['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Leba
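When only the number of distinct values matters, .nunique() avoids listing them all. A small sketch on a hypothetical Series `countries`:

```python
import pandas as pd

# Hypothetical column with one repeated country.
countries = pd.Series(["Afghanistan", "Afghanistan", "Albania", "Zimbabwe"])

distinct = countries.unique()     # distinct values, in order of first appearance
n_distinct = countries.nunique()  # how many distinct values there are

print(list(distinct))
print(n_distinct)
```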

.to_markdown() returns our DataFrame as a Markdown table string, which we then print.

Requires package tabulate: https://pypi.org/project/tabulate/

In [38]:
# to_markdown() calls tabulate internally; the package must be installed, but no explicit import is needed
print(df.head().to_markdown())

|    | country     | continent   |   year |   lifeExp |      pop |   gdpPercap |
|---:|:------------|:------------|-------:|----------:|---------:|------------:|
|  0 | Afghanistan | Asia        |   1952 |    28.801 |  8425333 |     779.445 |
|  1 | Afghanistan | Asia        |   1957 |    30.332 |  9240934 |     820.853 |
|  2 | Afghanistan | Asia        |   1962 |    31.997 | 10267083 |     853.101 |
|  3 | Afghanistan | Asia        |   1967 |    34.02  | 11537966 |     836.197 |
|  4 | Afghanistan | Asia        |   1972 |    36.088 | 13079460 |     739.981 |


In [39]:
# we can change the table format with the tablefmt argument
print(df.head().to_markdown(tablefmt="grid"))

+----+-------------+-------------+--------+-----------+----------+-------------+
|    | country     | continent   |   year |   lifeExp |      pop |   gdpPercap |
+====+=============+=============+========+===========+==========+=============+
|  0 | Afghanistan | Asia        |   1952 |    28.801 |  8425333 |     779.445 |
+----+-------------+-------------+--------+-----------+----------+-------------+
|  1 | Afghanistan | Asia        |   1957 |    30.332 |  9240934 |     820.853 |
+----+-------------+-------------+--------+-----------+----------+-------------+
|  2 | Afghanistan | Asia        |   1962 |    31.997 | 10267083 |     853.101 |
+----+-------------+-------------+--------+-----------+----------+-------------+
|  3 | Afghanistan | Asia        |   1967 |    34.02  | 11537966 |     836.197 |
+----+-------------+-------------+--------+-----------+----------+-------------+
|  4 | Afghanistan | Asia        |   1972 |    36.088 | 13079460 |     739.981 |
+----+-------------+-------------+--------+-----------+----------+-------------+


.loc is a label-based indexer: it selects rows by index label and, optionally, columns by name.
Once an index is set, use .loc to select rows of the dataset by that index.


In [45]:
df.set_index('continent', inplace=True)

In this case, we select all rows where the index is "Europe"

In [46]:
df.loc['Europe']

continent,year,lifeExp,pop,gdpPercap
Europe,1952,55.230,1282697,1601.056136
Europe,1957,59.280,1476505,1942.284244
Europe,1962,64.820,1728137,2312.888958
Europe,1967,66.220,1984060,2760.196931
Europe,1972,67.690,2263554,3313.422188
...,...,...,...,...
Europe,1987,75.007,56981620,21664.787670
Europe,1992,76.420,57866349,22705.092540
Europe,1997,77.218,58808266,26074.531360
Europe,2002,78.471,59912431,29478.999190
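Two details of this pattern are worth a sketch: set_index without inplace=True returns a new frame and leaves the original untouched, and the result type of .loc depends on how many rows match a scalar label. The frame `toy` below is a hypothetical three-row stand-in with values taken from the output above:

```python
import pandas as pd

# Hypothetical frame; rows copied from the Europe and Africa outputs.
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Africa"],
    "year": [1952, 1957, 1987],
    "lifeExp": [55.230, 59.280, 62.351],
})

# Without inplace=True, set_index returns a new frame; toy keeps its columns.
by_continent = toy.set_index("continent")

europe = by_continent.loc["Europe"]          # label matches two rows -> DataFrame
africa = by_continent.loc["Africa"]          # label matches one row  -> Series
africa_frame = by_continent.loc[["Africa"]]  # list of labels         -> always a DataFrame

print(type(europe).__name__, type(africa).__name__, type(africa_frame).__name__)
```

Wrapping a single label in a list is a simple way to get a DataFrame back regardless of how many rows match.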


We can select multiple index labels by passing a list:

In [47]:
df.loc[['Europe', 'Africa']]

continent,year,lifeExp,pop,gdpPercap
Europe,1952,55.230,1282697,1601.056136
Europe,1957,59.280,1476505,1942.284244
Europe,1962,64.820,1728137,2312.888958
Europe,1967,66.220,1984060,2760.196931
Europe,1972,67.690,2263554,3313.422188
...,...,...,...,...
Africa,1987,62.351,9216418,706.157306
Africa,1992,60.377,10704340,693.420786
Africa,1997,46.809,11404948,792.449960
Africa,2002,39.989,11926563,672.038623


Additionally, we can select columns by passing a second list:

In [49]:
df.loc[['Europe', 'Africa'],['lifeExp', 'pop']]

continent,lifeExp,pop
Europe,55.230,1282697
Europe,59.280,1476505
Europe,64.820,1728137
Europe,66.220,1984060
Europe,67.690,2263554
...,...,...
Africa,62.351,9216418
Africa,60.377,10704340
Africa,46.809,11404948
Africa,39.989,11926563


Columns can also be selected as a range with ":". Unlike regular Python slices, a .loc label slice includes both endpoints.

In [50]:
df.loc[['Europe'], 'lifeExp':'gdpPercap']


continent,lifeExp,pop,gdpPercap
Europe,55.230,1282697,1601.056136
Europe,59.280,1476505,1942.284244
Europe,64.820,1728137,2312.888958
Europe,66.220,1984060,2760.196931
Europe,67.690,2263554,3313.422188
...,...,...,...
Europe,75.007,56981620,21664.787670
Europe,76.420,57866349,22705.092540
Europe,77.218,58808266,26074.531360
Europe,78.471,59912431,29478.999190
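The row selector of .loc can also be a boolean mask, which makes the same kind of selection possible without changing the index. The sketch below assumes a hypothetical two-row frame `toy` that keeps continent as a regular column, with values taken from the outputs above:

```python
import pandas as pd

# Hypothetical frame with the default integer index.
toy = pd.DataFrame({
    "continent": ["Europe", "Africa"],
    "year": [1952, 1987],
    "lifeExp": [55.230, 62.351],
    "pop": [1282697, 9216418],
    "gdpPercap": [1601.056136, 706.157306],
})

# Label slice: both 'lifeExp' and 'gdpPercap' are part of the result.
sliced = toy.loc[:, "lifeExp":"gdpPercap"]

# Boolean mask for rows, list of names for columns; no set_index required.
europe = toy.loc[toy["continent"] == "Europe", ["year", "lifeExp"]]

print(list(sliced.columns))
print(europe)
```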
