# EDA - gapminder data
### Exploratory Data Analysis

### 1. Import needed package(s)
At the start of any notebook or script, we put all of the import statements.

In [1]:
import pandas as pd

### 2. Reading Data

Use the `read_csv()` function to read in the population file as a DataFrame.

**Note:** replace the ??? with the correct pandas commands

In [6]:
df_population = pd.read_csv('../data/population.csv')

df_population

Unnamed: 0,Total population,year,population
0,Abkhazia,1800,
1,Afghanistan,1800,3280000.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,410445.0
4,Algeria,1800,2503218.0
...,...,...,...
22270,Northern Marianas,2015,
22271,South Georgia and the South Sandwich Islands,2015,
22272,US Minor Outlying Islands,2015,
22273,Virgin Islands,2015,


### 3. First impression

Use `.shape` and `.head()` to get a first idea of the dataset.

In [4]:
df_population.shape

(22275, 3)

In [5]:
df_population.head()

Unnamed: 0,Total population,year,population
0,Abkhazia,1800,
1,Afghanistan,1800,3280000.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,410445.0
4,Algeria,1800,2503218.0


In [42]:
df_population.sample(10)

Unnamed: 0,Total population,year,population
1510,Malaysia,1850,534296.0
8076,India,1964,487690114.0
6099,"Congo, Dem. Rep.",1957,14161242.0
20554,Singapore,2009,4965105.0
17296,Venezuela,1997,23108003.0
14853,Albania,1989,3253659.0
1309,Somaliland,1840,
15340,Suriname,1990,408276.0
15523,Latvia,1991,2645056.0
10673,Tanzania,1973,14984973.0


In [8]:
df_population.dtypes

Total population     object
year                  int64
population          float64
dtype: object
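
One detail in the output above is worth a second look: `population` is stored as `float64` although head counts are whole numbers. A likely reason is the missing values, since `NaN` is a float and a NumPy-backed integer column cannot hold it. The sketch below uses two small Series (`s_complete` and `s_with_gap` are names made up for this illustration) with values copied from rows shown earlier.

```python
import pandas as pd

# Two population values copied from the rows shown above
s_complete = pd.Series([410445, 2503218])

# The same kind of data, but with one missing entry
s_with_gap = pd.Series([410445, None])

print(s_complete.dtype)  # an integer dtype
print(s_with_gap.dtype)  # float64, because the gap is stored as NaN
```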

In [9]:
df_population.index

RangeIndex(start=0, stop=22275, step=1)

In [43]:
df_population.isnull()

Unnamed: 0,Total population,year,population
0,False,False,True
1,False,False,False
2,False,False,True
3,False,False,False
4,False,False,False
...,...,...,...
22270,False,False,True
22271,False,False,True
22272,False,False,True
22273,False,False,True
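
The full boolean table is hard to read at this size. A common follow-up is to aggregate it per column. The sketch below does this on a toy frame (`df_toy`, a hypothetical name) built from the first five rows shown earlier, not on the full file.

```python
import numpy as np
import pandas as pd

# Toy frame built from the first five rows of the population data
df_toy = pd.DataFrame({
    "Total population": ["Abkhazia", "Afghanistan", "Akrotiri and Dhekelia",
                         "Albania", "Algeria"],
    "year": [1800] * 5,
    "population": [np.nan, 3280000.0, np.nan, 410445.0, 2503218.0],
})

# isnull() gives a boolean frame; sum() counts the True values per column
missing_per_column = df_toy.isnull().sum()
print(missing_per_column)

# mean() of booleans gives the share of missing values per column
missing_share = df_toy.isnull().mean()
print(missing_share)
```

On the real data, `df_population.isnull().sum()` would give the same kind of per-column summary.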


In [45]:
df_population.duplicated('Total population')  # True for every row whose country name already appeared in an earlier row

0        False
1        False
2        False
3        False
4        False
         ...  
22270     True
22271     True
22272     True
22273     True
22274     True
Length: 22275, dtype: bool

In [46]:
df_pop_dd = df_population.drop_duplicates('Total population')
df_pop_dd

Unnamed: 0,Total population,year,population
0,Abkhazia,1800,
1,Afghanistan,1800,3280000.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,410445.0
4,Algeria,1800,2503218.0
...,...,...,...
270,Northern Marianas,1800,
271,South Georgia and the South Sandwich Islands,1800,
272,US Minor Outlying Islands,1800,
273,Virgin Islands,1800,
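
Because each country appears once per year, `duplicated('Total population')` flags every row after a country's first appearance, and `drop_duplicates` keeps only that first row (here the rows from 1800). The sketch below repeats this on a three-row toy frame with values copied from the tables above. The names `df_toy`, `is_repeat`, and `first_rows` are made up for the illustration. If only the distinct names are needed, `unique()` and `nunique()` are a more direct route.

```python
import pandas as pd

# Three rows copied from the tables above: Albania appears twice
df_toy = pd.DataFrame({
    "Total population": ["Albania", "Algeria", "Albania"],
    "year": [1800, 1800, 1989],
    "population": [410445.0, 2503218.0, 3253659.0],
})

# True for every row whose country name already appeared further up
is_repeat = df_toy.duplicated("Total population")
print(is_repeat.tolist())  # [False, False, True]

# Keeps only the first row per country, here the two rows from 1800
first_rows = df_toy.drop_duplicates("Total population")

# Distinct names and their count, without touching the other columns
countries = df_toy["Total population"].unique()
n_countries = df_toy["Total population"].nunique()
```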


In [32]:
df_west_bank = df_population[df_population['Total population'] == 'West Bank']

df_west_bank['population'].sum()

0.0

In [36]:
df_Abkhazia = df_population[df_population['Total population']=='Abkhazia']
df_Abkhazia['population'].sum()

0.0
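
A sum of `0.0` is ambiguous: it can mean that every value is zero, or that there is no value at all, because `sum()` skips missing entries by default. The sketch below shows two ways to tell the cases apart, assuming a toy frame (`df_toy`, with illustrative rows rather than the real file) in which one region has no data.

```python
import numpy as np
import pandas as pd

# Illustrative rows: one region without any data, one with a value
df_toy = pd.DataFrame({
    "Total population": ["Abkhazia", "Abkhazia", "Albania"],
    "year": [1800, 2015, 1800],
    "population": [np.nan, np.nan, 410445.0],
})

# Boolean mask: keep only the rows of one country
df_abkhazia = df_toy[df_toy["Total population"] == "Abkhazia"]

# Default behaviour: NaN is skipped, so an all-NaN column sums to 0.0
total_default = df_abkhazia["population"].sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN
total_strict = df_abkhazia["population"].sum(min_count=1)

# count() returns the number of non-missing values
n_values = df_abkhazia["population"].count()

print(total_default, total_strict, n_values)
```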

In [41]:
df_Akrotiri_Dhekelia = df_population[df_population['Total population'] == 'Akrotiri and Dhekelia']
df_Akrotiri_Dhekelia.sample(15)

Unnamed: 0,Total population,year,population
8252,Akrotiri and Dhekelia,1965,11848.0
15677,Akrotiri and Dhekelia,1992,14328.0
16227,Akrotiri and Dhekelia,1994,14531.0
827,Akrotiri and Dhekelia,1830,
20352,Akrotiri and Dhekelia,2009,
17327,Akrotiri and Dhekelia,1998,14946.0
9077,Akrotiri and Dhekelia,1968,12101.0
14302,Akrotiri and Dhekelia,1987,13832.0
4952,Akrotiri and Dhekelia,1953,10889.0
20077,Akrotiri and Dhekelia,2008,15700.0


### 4. NaN

Some of the cells in our dataframe have a value of `NaN`. What does that mean?

- NaN stands for `not a number`

But wait, **strings** are also not numbers?!

- Correct! What `NaN` really means in pandas is that the data is **missing**.

In [10]:
df_population['population'].sum()

363596236498.0
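
To make the idea of "missing" more concrete: `NaN` is a special floating-point value, and pandas treats it (as well as `None`) as a marker for missing data. The sketch below uses two made-up Series, `s_numbers` and `s_names`, to show how missing entries are detected and that `sum()` ignores them by default.

```python
import numpy as np
import pandas as pd

print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False, so comparisons cannot be used to find NaN

# A numeric column and a text column, each with one missing entry
s_numbers = pd.Series([3280000.0, np.nan, 410445.0])
s_names = pd.Series(["Albania", None, "Algeria"])

# isna() flags missing entries regardless of the column type
numbers_missing = s_numbers.isna()
names_missing = s_names.isna()

# sum() skips the missing entry instead of returning NaN
total = s_numbers.sum()
print(total)  # 3690445.0
```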

### 5. Access to index and column names

- How do we access the index and the column names?

In [12]:
df_population.index

RangeIndex(start=0, stop=22275, step=1)

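`.index` tells us that the index is a `RangeIndex`: integer labels starting at 0 and increasing in steps of 1. The `stop` value is exclusive, so the last label is 22274, which gives 22275 rows in total.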
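
A `RangeIndex` follows the same convention as Python's built-in `range`: `start` is included and `stop` is not. The short sketch below rebuilds the index shown above by hand (the variable names are made up) to check the first label, the last label, and the length.

```python
import pandas as pd

# Same parameters as in the output above
idx = pd.RangeIndex(start=0, stop=22275, step=1)

first_label = idx[0]   # 0
last_label = idx[-1]   # 22274, because stop is exclusive
n_labels = len(idx)    # 22275, matches the row count from .shape

print(first_label, last_label, n_labels)
```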

In [13]:
df_population.columns

Index(['Total population', 'year', 'population'], dtype='object')

`.columns` returns the column names as a pandas `Index` object, which is *list-like*. We can access each element by position, as we would in a Python list.
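
A minimal sketch of working with the column labels, assuming an empty toy frame (`df_toy`) that only carries the three column names from above. The rename at the end is a suggestion, not something the notebook does: the column called `Total population` actually holds country names, so a clearer label may help later.

```python
import pandas as pd

# Empty toy frame with the same column names as df_population
df_toy = pd.DataFrame(columns=["Total population", "year", "population"])

cols = df_toy.columns
first = cols[0]           # positional access, like a list
as_list = cols.tolist()   # plain Python list of the names
has_year = "year" in cols # membership test works as well

# Optional: give the first column a more descriptive name
df_renamed = df_toy.rename(columns={"Total population": "country"})
print(df_renamed.columns.tolist())  # ['country', 'year', 'population']
```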

### 6. Information about each column

`.info()` gives us the non-NaN count and the data type of each column.

In [14]:
df_population.info

<bound method DataFrame.info of                                    Total population  year  population
0                                          Abkhazia  1800         NaN
1                                       Afghanistan  1800   3280000.0
2                             Akrotiri and Dhekelia  1800         NaN
3                                           Albania  1800    410445.0
4                                           Algeria  1800   2503218.0
...                                             ...   ...         ...
22270                             Northern Marianas  2015         NaN
22271  South Georgia and the South Sandwich Islands  2015         NaN
22272                     US Minor Outlying Islands  2015         NaN
22273                                Virgin Islands  2015         NaN
22274                                     West Bank  2015         NaN

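[22275 rows x 3 columns]>

Careful: the cell above uses `df_population.info` without parentheses, so the method is never executed. What gets printed is only the representation of the bound method, which happens to include the whole DataFrame. Write `df_population.info()` to get the per-column summary.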
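
The difference between referring to a method and calling it can be sketched on a toy frame (`df_toy`, built from the first five rows of the population data). The `buf` parameter of `info()` is used here only to capture the printed text in a string so that it can be inspected.

```python
import io

import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "Total population": ["Abkhazia", "Afghanistan", "Akrotiri and Dhekelia",
                         "Albania", "Algeria"],
    "year": [1800] * 5,
    "population": [np.nan, 3280000.0, np.nan, 410445.0, 2503218.0],
})

# Without parentheses we only get a reference to the method
print(callable(df_toy.info))  # True

# With parentheses the method runs and prints one line per column
df_toy.info()

# buf= redirects the same text into a string buffer
buffer = io.StringIO()
df_toy.info(buf=buffer)
summary = buffer.getvalue()
```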

### 7. Pandas `read_csv()` parameters

We now have a DataFrame for population ready to use. Let's also read in one that maps countries to their respective continents. Do you remember how to read in a DataFrame with pandas?

In [15]:
df_continents = pd.read_csv('../data/continents.csv')

To make sure the data has been read in correctly, it is always a good idea to look at the first few rows of the DataFrame. How did we do that?

In [16]:
df_continents.head()

Unnamed: 0,continent;country
0,Africa;Algeria
1,Africa;Angola
2,Africa;Benin
3,Africa;Botswana
4,Africa;Burkina


The data does not seem to have been read in correctly. What is wrong with it?
- notice the column title `continent;country`: it should be two separate columns
- instead, we have a single column in which continent and country are separated by a `;` and not by a `,` as `read_csv()` expects by default

What can we do to clean it up?

- the file has a `.csv` extension, but that does not guarantee that the values are actually separated by commas
- in the printout above we can see continent and country names separated by `;`
- we can use the `sep` parameter of the `read_csv()` function to fix this
- `sep` is short for separator; its default value is `,`
- the character that separates the values is also referred to as a **delimiter**

Now we can read in the data with the correct delimiter:

In [17]:
df_continents = pd.read_csv('../data/continents.csv', sep=';')

Then take a look to see if the data has the correct structure now:

In [18]:
df_continents.head()

Unnamed: 0,continent,country
0,Africa,Algeria
1,Africa,Angola
2,Africa,Benin
3,Africa,Botswana
4,Africa,Burkina
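
The effect of `sep` can be reproduced without the file. The sketch below reads a small semicolon-separated string (`raw`, made up from the first rows shown above) through `io.StringIO`, once with the default separator and once with `sep=';'`.

```python
import io

import pandas as pd

# First rows of the continents data as raw text, separated by semicolons
raw = "continent;country\nAfrica;Algeria\nAfrica;Angola\nAfrica;Benin\n"

# Default separator ',': each line ends up in a single column
df_default = pd.read_csv(io.StringIO(raw))
print(df_default.shape)  # (3, 1)

# Correct separator ';': two columns, as intended
df_semicolon = pd.read_csv(io.StringIO(raw), sep=";")
print(df_semicolon.shape)  # (3, 2)
```

Comparing `.shape` or `.columns` before and after is a quick check that the delimiter was chosen correctly.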


### Short review

Using the `df_continents` DataFrame, find the command that produces each of the following results.

In [19]:
# 1. display the DataFrame
df_continents

Unnamed: 0,continent,country
0,Africa,Algeria
1,Africa,Angola
2,Africa,Benin
3,Africa,Botswana
4,Africa,Burkina
...,...,...
189,South America,Paraguay
190,South America,Peru
191,South America,Suriname
192,South America,Uruguay


In [20]:
# 2. display the first 5 rows
df_continents.head(5)

Unnamed: 0,continent,country
0,Africa,Algeria
1,Africa,Angola
2,Africa,Benin
3,Africa,Botswana
4,Africa,Burkina


In [21]:
# 3. display the last 5 rows
df_continents.tail(5)

Unnamed: 0,continent,country
189,South America,Paraguay
190,South America,Peru
191,South America,Suriname
192,South America,Uruguay
193,South America,Venezuela


In [22]:
# 4. display the number of rows and columns
df_continents.shape

(194, 2)

In [23]:
# 5. list the column names
df_continents.columns

Index(['continent', 'country'], dtype='object')

In [28]:
# 6. list the row index
df_continents.index

RangeIndex(start=0, stop=194, step=1)

In [29]:
# 7. display the column types
df_continents.dtypes

continent    object
country      object
dtype: object

## License
(c) 2021 Kristian Rother and Samuel McGuire.
Distributed under the conditions of the MIT License.