# EDA - gapminder data

### 1. Import needed package(s)
At the start of any notebook or script, we put all of the import statements.

In [1]:
import pandas as pd

### 2. Reading Data

Use the `read_csv()` function to read in the population file as a DataFrame.

**Note:** replace the ??? with the correct pandas commands

In [5]:
df_population = pd.read_csv('./data/population_web.csv')

### 3. First impression

Use `.shape` and `.head()` to get a first idea of the dataset.

In [24]:
df_population.shape

(22275, 3)

In [25]:
df_population.head()

|   | Total population | year | population |
|---|---|---|---|
| 0 | Abkhazia | 1800 | NaN |
| 1 | Afghanistan | 1800 | 3280000.0 |
| 2 | Akrotiri and Dhekelia | 1800 | NaN |
| 3 | Albania | 1800 | 410445.0 |
| 4 | Algeria | 1800 | 2503218.0 |


### 4. NaN

Some of the cells in our dataframe have a value of `NaN`. What does that mean?

- NaN stands for `not a number`

But wait, **strings** are also not numbers?!

- Correct! What it really means is that the data is **missing**.
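
As a small illustration of how missing values behave, here is a sketch using a made-up column modelled on the first rows above (the name `population` and its values are assumptions for this example only, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-column modelled on the first rows of the population data
population = pd.Series([np.nan, 3280000.0, np.nan, 410445.0])

# NaN is a special floating-point value, so the column stays numeric
print(population.dtype)

# NaN is not equal to anything, not even to itself ...
print(np.nan == np.nan)

# ... so missing values are detected with .isna() rather than with ==
is_missing = population.isna()
print(is_missing.tolist())
```

Because a comparison with `==` never matches `NaN`, `.isna()` is the usual way to find missing cells.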

### 5. Access to index and column names

- How can we access the index and the column names?

In [26]:
df_population.index

RangeIndex(start=0, stop=22275, step=1)

`.index` tells us that the index is a numeric `RangeIndex` that starts at 0 and counts up in steps of 1. The `stop` value of 22275 is exclusive, so the last index label is 22274.
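
Like Python's built-in `range`, a `RangeIndex` does not include its `stop` value. A quick check, building an index by hand with the same values as in the output above (the name `idx` is just for this sketch):

```python
import pandas as pd

# A RangeIndex with the same start, stop, and step as the output above
idx = pd.RangeIndex(start=0, stop=22275, step=1)

print(len(idx))   # number of labels
print(idx[0])     # first label
print(idx[-1])    # last label; the stop value itself is excluded
```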

In [27]:
df_population.columns

Index(['Total population', 'year', 'population'], dtype='object')

`.columns` returns the column names as a *list-like* `Index` object. We can access each element as we would in a Python list.
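
A minimal sketch of list-style access to column names, using an empty stand-in DataFrame with the same column names (`df_demo` is a hypothetical name, not part of the exercise):

```python
import pandas as pd

# Empty stand-in DataFrame with the same column names as df_population
df_demo = pd.DataFrame(columns=['Total population', 'year', 'population'])

first_column = df_demo.columns[0]        # positional access, like a list
column_names = df_demo.columns.tolist()  # convert to a plain Python list
has_year = 'year' in df_demo.columns     # membership test

print(first_column)
print(column_names)
print(has_year)
```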

### 6. Information about each column

`.info()` gives us the non-null count and the data type of each column.

In [28]:
df_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22275 entries, 0 to 22274
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Total population  22275 non-null  object 
 1   year              22275 non-null  int64  
 2   population        20176 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 522.2+ KB
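
The non-null counts also tell us how many values are missing: `population` has 20176 non-null values in 22275 rows, so 22275 - 20176 = 2099 values are missing. The same calculation can be sketched on a small made-up DataFrame (`df_demo` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with two missing population values
df_demo = pd.DataFrame({
    'year': [1800, 1800, 1800, 1800],
    'population': [np.nan, 3280000.0, np.nan, 410445.0],
})

non_null = df_demo.count()          # non-null values per column, as in .info()
missing = len(df_demo) - non_null   # number of rows minus non-null values
print(missing)
```

`df_demo.isna().sum()` should give the same counts directly.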


### 7. Pandas `.read_csv()` parameters

We now have a dataframe for population ready to use. Let's also read in one that maps countries to their respective continents. Do you remember how to read in a dataframe with pandas?

In [29]:
df_continents = pd.read_csv('./data/continents.csv')

To check that the data has been read in correctly, it is always a good idea to look at the first few rows of the dataframe. How did we do that?

In [30]:
df_continents.head()

|   | continent;country |
|---|---|
| 0 | Africa;Algeria |
| 1 | Africa;Angola |
| 2 | Africa;Benin |
| 3 | Africa;Botswana |
| 4 | Africa;Burkina |


The data does not seem to have been read in correctly. What is wrong with it?
- Notice the column title `continent;country`. It should be two separate columns.
- We have only one column, with the continent and country names separated by a `;` and not by a `,` as expected.

What can we do to clean it up?

- Although the file has a `.csv` extension, its values are not necessarily separated by commas.
- In the printout above we can see that continent and country names are separated by `;`.
- We can use a parameter of the `read_csv()` function called `sep` to fix this.
- `sep` is short for separator; the default is `,`.
- The character that separates the values is also referred to as a **delimiter**.

Now we can read in the data with the correct delimiter:

In [32]:
df_continents = pd.read_csv('./data/continents.csv', sep=';')

Then take a look to see if the data has the correct structure now:

In [33]:
df_continents.head()

|   | continent | country |
|---|---|---|
| 0 | Africa | Algeria |
| 1 | Africa | Angola |
| 2 | Africa | Benin |
| 3 | Africa | Botswana |
| 4 | Africa | Burkina |
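
To try out the effect of `sep` without a file on disk, the following sketch reads the same kind of text from an in-memory string. The string `raw_text` and the names `df_wrong` and `df_right` are made up for this example:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk
raw_text = "continent;country\nAfrica;Algeria\nAfrica;Angola\n"

# Default separator ',': each line ends up in a single column
df_wrong = pd.read_csv(io.StringIO(raw_text))

# Separator ';': the values are split into two columns
df_right = pd.read_csv(io.StringIO(raw_text), sep=';')

print(df_wrong.shape)
print(df_right.shape)
print(df_right.columns.tolist())
```

Comparing the two shapes is a quick way to spot a wrong delimiter: a single column usually means the separator did not match.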


### Short review

Using the `df_continents` dataframe, find the correct command to produce each of the following results.

In [34]:
# 1. display the DataFrame
df_continents

|   | continent | country |
|---|---|---|
| 0 | Africa | Algeria |
| 1 | Africa | Angola |
| 2 | Africa | Benin |
| 3 | Africa | Botswana |
| 4 | Africa | Burkina |
| ... | ... | ... |
| 189 | South America | Paraguay |
| 190 | South America | Peru |
| 191 | South America | Suriname |
| 192 | South America | Uruguay |


In [48]:
# 2. display the first 5 rows
df_continents.head()

|   | continent | country |
|---|---|---|
| 0 | Africa | Algeria |
| 1 | Africa | Angola |
| 2 | Africa | Benin |
| 3 | Africa | Botswana |
| 4 | Africa | Burkina |


In [47]:
# 3. display the last 5 rows
df_continents.tail()
# df_continents.iloc[-5:]

|   | continent | country |
|---|---|---|
| 189 | South America | Paraguay |
| 190 | South America | Peru |
| 191 | South America | Suriname |
| 192 | South America | Uruguay |
| 193 | South America | Venezuela |


In [49]:
# 4. display the number of rows and columns
df_continents.shape

(194, 2)

In [50]:
# 5. list the column names
df_continents.columns

Index(['continent', 'country'], dtype='object')

In [51]:
# 6. list the row index
df_continents.index

RangeIndex(start=0, stop=194, step=1)

In [53]:
# 7. display the column types
df_continents.dtypes

continent    object
country      object
dtype: object

## License
(c) 2021 Kristian Rother and Samuel McGuire.
Distributed under the conditions of the MIT License.