<a href="https://pandas.pydata.org/">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png" width="300px">
</a>


# Data Wrangling with pandas



## Objectives
This is the demonstration notebook for the first task. The facilitators will present a lecture that is based around this notebook. Before performing the first task as a team, we recommend you first go through this walkthrough together, as it will show a lot of techniques and code that will be useful in the task.  


The main objective of this walkthrough is to familiarize you with the [pandas Python package](https://pandas.pydata.org/). Pandas is a widely used Python package for discovering, cleaning and structuring tabular data. In particular, this notebook will cover:
* The basic concept behind the core data structures within Pandas: [Series](https://pandas.pydata.org/docs/user_guide/dsintro.html#series) and [DataFrames](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)
* How to process, organize and clean data stored in DataFrames
* Visualization of data stored within a DataFrame.

After completing this notebook you will be able to:
* Understand key features of the Pandas library.
* Import tabular data from spreadsheets into Pandas
* Understand your data, its size, features and its structure.
* Handle missing values
* Correct data format
* Transform variables, including grouping continuous values and normalization
* Perform statistical analysis
* Visualise data using the Matplotlib library


## What do we mean by data wrangling?
Data wrangling is the process of reading in the raw data and then cleaning and processing it so that the resulting information is in suitable format for analysis. Part of the data wrangling process includes data discovery, data cleaning (identifying and then removing any erroneous data), handling missing data, transforming variables, and visualizing the resulting data.

Data wrangling is often needed before any quantitative analysis can begin; however, it is often the most time-consuming and tedious part of the process. There are many different skills and actions that are required as part of data wrangling, many of which are illustrated in the example code below. Many of these steps are part of the functionality that Pandas provides. 

Here are some common, often used steps that are considered part of data wrangling:

* **Discovering**: 
Reviewing and understanding the input data to better understand its structure and what variables will be useful for your problem 
* **Structuring**:
Standardising the format of disparate types of data to make the data usable for automated or semi-automated analysis. The data must be structured to fit the analytics model.

* **Cleaning** : 
There are often missing or implausible values in a dataset. These occur for a number of reasons and they could adversely affect the reliability and repeatability of an analysis. Cleaning techniques identify and, in some cases, correct these instances so that they can be handled appropriately in subsequent analyses.

* **Enriching**
You are pretty familiar with the data at this point. This is the moment to ask yourself if you want to enhance the data. Are you looking to add other data to it?


*  **Validating** 
This step involves iterative programming steps that authenticate your data’s quality and safety. For example, you may have problems if your data is not clean or enriched and the attributes are not distributed evenly.

Many of these different stages of data wrangling require you to visualize the data in some form other than the base tabular data. Visualisation libraries such as matplotlib and seaborn are often used to plot data as well as perform some basic statistical analysis. For example, one simple way of identifying outliers for cleaning is to plot the data points and visually identify points that appear to be well outside the plausible range of the rest of the data for a given variable. 



## Table of Contents
* [1. Importing pandas and reading in data](#1-importing-pandas-and-reading-in-data)
* [2. Discovering and reviewing the data](#2-discovering-and-reviewing-the-data)
* [3. Handling missing data](#3-handling-missing-data)
* [4. Checking the data representation](#4-checking-the-data-representation)
* [5. Grouping data](#5-grouping-data)
* [6. Sorting data](#6-sorting-data)
* [7. Statistical summary operations](#7-statistical-summary-operations)
* [8. Combining and merging data sets](#8-combining-and-merging-data-sets)
* [9. Visualizing the data](#9-visualizing-the-data)
    * [Data standardization](#data_standardization)
    * [Data normalization (centering/scaling)](#data_normalization)


## 1. Importing pandas and reading in data
### 1.1 Importing packages
The first thing that we need to do is to import the Python packages we are going to need. This includes pandas, [numpy](https://numpy.org/) (which Pandas uses to store data in the two-dimensional tabular structure), and [matplotlib](https://matplotlib.org/), a popular Python package used for data visualisation.

For convenience, we will import these packages under their conventional short aliases (`pd`, `np`, `plt`), so that we don't have to type the full package name each time we use some functionality from the package. 

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


###  1.2 Reading in data
For this notebook, we are using a dataset provided by the World Health Organization. There are many sources from which you might get data to work with in pandas. The most common is a file containing tabular data. The best-known format for tabular data is the Excel spreadsheet. Another commonly used format is the plain-text *comma separated values* file with the extension `csv`. Each line of a csv file represents a row, and the columns within each row are separated by commas. 

Besides local files, you may also get data by querying a database or requesting data from the web. 

All of these sources generate tabular data that can be stored within pandas DataFrames.

For the purpose of this notebook, we have assumed you have [downloaded the data](https://liveuclac.sharepoint.com/:x:/r/sites/TeamCodersEventsPlanning/Shared%20Documents/General/WHO/who_case_statistics_modified3.csv?d=wec3e29c1cb9d43779f37cc43bf9eed3b&csf=1&web=1&e=wy7e84) and stored it to a local file in the same directory as this notebook. 

pandas has many different functions that will read in data depending on the data format. Below, we store the filename in the variable `input_file` and then use the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to read this file and store it in a DataFrame, under a variable named `df_raw`. Here the `df` stands for DataFrame, so that you know what is stored in the variable.  

In [70]:

# If you have not downloaded the file into this directory
# please amend the first line so that input_file contains
# the full location (path) of the file
input_file = "data/who_case_statistics_modified3.csv" 
df_raw = pd.read_csv(input_file)
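
If you do not have the data file to hand, the behaviour of `read_csv` can still be tried out on a few lines of text. The sketch below is only an illustration: the CSV content and the name `df_demo` are made up for this example and are not part of the WHO dataset.

```python
import io

import pandas as pd

# Hypothetical miniature CSV, used here instead of the real WHO file
csv_text = """country,year,cases_no
Albania,1987,21
Albania,1988,
Belgium,2011,6
"""

# read_csv accepts a file path or any file-like object such as StringIO
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)
print(df_demo.shape)
```

The empty field in the second data row is read as a missing value, which is the same behaviour you will see later with the real file.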

## 2. Discovering and reviewing the data 
Before we can work with the data, we need to know more about its contents. Here are some core functions that can help you start to review the structure and contents of your DataFrame. 
* `pandas.DataFrame.head()` - prints out the first *n* rows of the DataFrame 
* `pandas.DataFrame.tail()` - prints out the last *n* rows of the DataFrame

You can understand a bit more of the data structure using the following commands:
* `pandas.DataFrame.info()` - summary information (column names, counts, data types) 
* `pandas.DataFrame.describe()` - provides basic summary statistics for each variable

Some basic low-level information about the DataFrame can be extracted with the following attributes:
* `pandas.DataFrame.shape` - provides the number of rows and columns in the DataFrame
* `pandas.DataFrame.dtypes` - indicates the datatype (string, float, integer, etc) used to store the data for each variable
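
As a quick illustration of the functions and attributes listed above, here is a minimal sketch on a small hand-made DataFrame. The values and the name `df_demo` are invented for this example; the real data is explored in the cells that follow.

```python
import pandas as pd

# Small made-up table, loosely modelled on the columns used in this notebook
df_demo = pd.DataFrame({
    "country": ["Albania", "Belgium", "Mexico", "Grenada"],
    "year": [1987, 2011, 2015, 2002],
    "cases_no": [21.0, 6.0, 7.0, None],
})

first_two = df_demo.head(2)      # first n rows (n defaults to 5)
last_one = df_demo.tail(1)       # last n rows (n defaults to 5)
n_rows, n_cols = df_demo.shape   # shape is an attribute, so no parentheses

print(first_two)
print(last_one)
print(n_rows, n_cols)
print(df_demo.dtypes)            # one data type per column
print(df_demo.describe())        # summary statistics for the numeric columns
```

Note that `head()`, `tail()` and `describe()` are methods and need parentheses, while `shape` and `dtypes` are attributes and do not.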


### 2.1 Review the data
Use the method `head()` to display the first five rows of the DataFrame.


In [71]:
df_raw.head()

Unnamed: 0,country,year,sex,age,cases_no,population,HDI for year,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21,312900,,2156624900
1,Albania,1987,male,35-54 years,16,308000,,2156624900
2,Albania,1987,female,15-24 years,14,289700,,2156624900
3,Albania,1987,male,75+ years,1,21800,,2156624900
4,Albania,1987,male,25-34 years,9,274300,,2156624900


Notice how there are headers at the top, which match up with the column names. Each row also has an _index_, which starts at 0 and counts up for this spreadsheet. Indices are like headers but for rows.

In [38]:
df_raw.shape

(27840, 8)

In [4]:
df_raw.dtypes

country                object
year                    int64
sex                    object
age                    object
cases_no               object
population              int64
HDI for year          float64
 gdp_for_year ($)      object
dtype: object


### Exercise 1
Use the tail method to see the last part of the data.
How many lines do you see?
Use the tail method again to see the last 20 lines of the data.

In [None]:
# Write your code here

### Exercise 2
Use the info method to gather information  about the data. 


In [None]:
# Write your code here

### Observations
* The data frame contains 27840 rows and 8 columns.
* The data frame is an object of class pandas.DataFrame.
* Columns that hold only numeric data are stored as floats or integers
* Columns that hold text or a mixture of data are stored as objects. 


### Exercise 3
Write any other observation that you see about the data. 


### 2.2 Understanding headers
How did pandas know how to name the columns? If you opened the file in a text editor, you would see that the first row contains these names. When you run `read_csv`, pandas assumes that the column names (or _headers_) are located in the first row. If that is not the case, you can specify which row they are in by using the `header` argument. 

However, if there are not any column names in the file, then it is important to tell pandas that, lest you lose actual data. This can be done by setting `header` to `None`. 


In [73]:
# Reading in the data - but assuming no headers
df_raw = pd.read_csv(input_file, header=None)
# Now look at the head, does it look correct?
df_raw.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,country,year,sex,age,cases_no,population,HDI for year,gdp_for_year ($)
1,Albania,1987,male,15-24 years,21,312900,,2156624900
2,Albania,1987,male,35-54 years,16,308000,,2156624900
3,Albania,1987,female,15-24 years,14,289700,,2156624900
4,Albania,1987,male,75+ years,1,21800,,2156624900


In [74]:
# Also look at the info, does it look correct?
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27841 entries, 0 to 27840
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       27841 non-null  object
 1   1       27841 non-null  object
 2   2       27841 non-null  object
 3   3       27841 non-null  object
 4   4       23576 non-null  object
 5   5       27841 non-null  object
 6   6       8369 non-null   object
 7   7       27841 non-null  object
dtypes: object(8)
memory usage: 1.7+ MB
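
The cells above show what happens when `header=None` is used on a file that does have a header row. The opposite mistake, reading a file without headers using the defaults, silently turns the first data row into column names. The sketch below illustrates this on made-up CSV text; the column names passed to `names` are chosen for this example only.

```python
import io

import pandas as pd

# Hypothetical CSV text that has no header row
no_header_csv = "Albania,1987,21\nBelgium,2011,6\n"

# Default behaviour: the first data row is consumed as column names
df_wrong = pd.read_csv(io.StringIO(no_header_csv))

# header=None keeps every row as data; names= supplies the column labels
df_right = pd.read_csv(
    io.StringIO(no_header_csv),
    header=None,
    names=["country", "year", "cases_no"],
)

print(df_wrong.shape)   # one row of data has been lost to the header
print(df_right.shape)
print(df_right.dtypes)
```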


### 2.3 Selecting the columns and getting the values
The way to select a column is to write
`dataframe['column_name']`
Let's show all the countries in the data set.

In [75]:
df_raw = pd.read_csv(input_file)
countries=df_raw['country']
print(countries)

0            Albania
1            Albania
2            Albania
3            Albania
4            Albania
            ...     
27835        Belgium
27836       Thailand
27837    Netherlands
27838        Grenada
27839         Mexico
Name: country, Length: 27840, dtype: object


Notice that, even though we selected only one column (`country`), the indices on the left remain. As mentioned, these are like headers for rows, and they are there even if there is only one column. A single column selected this way is a pandas Series, so we can get the counts of the different values in the `country` column using the `pandas.Series.value_counts()` method.

In [6]:
df_raw['country'].value_counts() # select the column with df_raw['country'] then add the method .value_counts()

Austria                   383
Netherlands               383
Iceland                   382
Mauritius                 382
Belgium                   373
                         ... 
Bosnia and Herzegovina     24
Macau                      12
Cabo Verde                 12
Dominica                   12
Mongolia                   10
Name: country, Length: 101, dtype: int64

This shows us how many times each country is found in the column. If you wanted instead to get a list of all countries found in this column, you can use the `unique` function. The output from this will be a one-dimensional numpy array.  


In [77]:
country_array = df_raw['country'].unique()
print(country_array)
print(type(country_array))

['Albania' 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Aruba' 'Australia'
 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Barbados' 'Belarus' 'Belgium'
 'Belize' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Cabo Verde'
 'Canada' 'Chile' 'Colombia' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus'
 'Czech Republic' 'Denmark' 'Dominica' 'Ecuador' 'El Salvador' 'Estonia'
 'Fiji' 'Finland' 'France' 'Georgia' 'Germany' 'Greece' 'Grenada'
 'Guatemala' 'Guyana' 'Hungary' 'Iceland' 'Ireland' 'Israel' 'Italy'
 'Jamaica' 'Japan' 'Kazakhstan' 'Kiribati' 'Kuwait' 'Kyrgyzstan' 'Latvia'
 'Lithuania' 'Luxembourg' 'Macau' 'Maldives' 'Malta' 'Mauritius' 'Mexico'
 'Mongolia' 'Montenegro' 'Netherlands' 'New Zealand' 'Nicaragua' 'Norway'
 'Oman' 'Panama' 'Paraguay' 'Philippines' 'Poland' 'Portugal'
 'Puerto Rico' 'Qatar' 'Republic of Korea' 'Romania' 'Russian Federation'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and Grenadines'
 'San Marino' 'Serbia' 'Seychelles' 'Singapore' 'Slovakia' 'Slovenia'
 'South 

In order to convert the numpy array to a Python list, we will use the numpy `tolist()` method. In case the country list is not alphabetically sorted, we can use the `sorted()` function.


In [78]:
country_list = country_array.tolist() # Convert a numpy array to a list
sorted_country_list = sorted(country_list)
country_list  # The list starts and finishes with []
#  we can print the list within the print function:      print(f'The list of countries is:{sorted_country_list}')

['Albania',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Bosnia and Herzegovina',
 'Brazil',
 'Bulgaria',
 'Cabo Verde',
 'Canada',
 'Chile',
 'Colombia',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Dominica',
 'Ecuador',
 'El Salvador',
 'Estonia',
 'Fiji',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Greece',
 'Grenada',
 'Guatemala',
 'Guyana',
 'Hungary',
 'Iceland',
 'Ireland',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Kazakhstan',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 'Latvia',
 'Lithuania',
 'Luxembourg',
 'Macau',
 'Maldives',
 'Malta',
 'Mauritius',
 'Mexico',
 'Mongolia',
 'Montenegro',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Norway',
 'Oman',
 'Panama',
 'Paraguay',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Korea',
 'Romania',
 'Russian Federation',
 'Saint 

In [25]:
#Find the length of the list
len(country_list)

101
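
The steps above (counting values, listing unique values, sorting them, and counting them) can be condensed a little. The following is a sketch on a made-up Series rather than the real `country` column; `nunique()` gives the number of distinct values directly, and `sorted()` accepts the array returned by `unique()` without an explicit `tolist()`.

```python
import pandas as pd

# Made-up stand-in for the 'country' column
countries = pd.Series(
    ["Mexico", "Albania", "Mexico", "Belgium", "Albania", "Mexico"]
)

counts = countries.value_counts()          # occurrences of each value
n_unique = countries.nunique()             # number of distinct values
sorted_names = sorted(countries.unique())  # plain Python list, alphabetical

print(counts)
print(n_unique)
print(sorted_names)
```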

We have looked at how to explore categorical data. Now let's look at numeric data. We can get the mean of a column using the `mean()` function.

In [28]:
case_numbers = df_raw['cases_no']
case_numbers

0         21
1         16
2         14
3          1
4          9
        ... 
27835      6
27836    152
27837     21
27838    NaN
27839      7
Name: cases_no, Length: 27840, dtype: object

In [30]:
cases_mean = df_raw['cases_no'].mean()

TypeError: can only concatenate str (not "int") to str

You should have received an error here, and the key reason for this can be seen at the bottom of the printout of the variable `case_numbers`. The `cases_no` column is of the data type `object`. This typically denotes a mixture of different data, both numeric and non-numeric. As a result, pandas cannot work out a mean. We will look at handling missing and non-numeric data in the next section.
 


## 3. Handling missing data
In the above printout of `cases_no`, we can see the code "NaN" in some of the rows, which is a special numeric value in pandas which means "Not A Number". When Pandas reads in the file, it identifies empty values, as well as certain strings (for example `#N/A` or `None`), as missing data and replaces this with the NaN code. However, especially when data has not yet been cleaned, there may be many different ways that a missing value is indicated, such as the string `Null` or a specific numeric code. There are many ways of handling missing data, but the first step in all of them is to identify the missing data and make sure they are labelled as such by pandas. In this section, we will identify all of the missing values and prepare the format of our DataFrame to be ready for analysis.

Steps for working with missing data:
* Identify missing data
* Deal with missing data
* Correct data format

### 3.1 Identifying missing data
Pandas represents missing data with a value of `NaN`. We can check whether a value is missing or not using the `isnull()` or `isna()` function. The opposite function to detect non-missing values is `notnull()` or `notna()`. The output from these functions is boolean (i.e. `True` or `False`).

The following code checks every cell in our DataFrame for a missing value and then counts how many rows share each combination of missing and non-missing columns. 

In [31]:
df_raw.isnull().value_counts() # True marks a missing cell; value_counts() counts the rows for each True/False combination

country  year   sex    age    cases_no  population  HDI for year   gdp_for_year ($) 
False    False  False  False  False     False       True          False                 16357
                                                    False         False                  7218
                              True      False       True          False                  3115
                                                    False         False                  1150
dtype: int64

The output shows all of the combinations of missing data that occur. Two columns - `cases_no` and `HDI for year` - have some missing values because the `isnull()` test has returned `True` for them. 

In [32]:
missing_data = df_raw.isnull() # find data that are null and return 'True' if is null and 'False' if it is not
missing_data.head(20)

Unnamed: 0,country,year,sex,age,cases_no,population,HDI for year,gdp_for_year ($)
0,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,True,False,True,False


What if we wanted to determine the number of missing values in each column? We can use a loop to get the number of missing values found in each column.

In [33]:
# This loop will go through each column in the data frame
for column in missing_data: # Select each column-name in the header and prints it 
    print(column)
    print(missing_data[column].value_counts())
    print('----')

country
False    27840
Name: country, dtype: int64
----
year
False    27840
Name: year, dtype: int64
----
sex
False    27840
Name: sex, dtype: int64
----
age
False    27840
Name: age, dtype: int64
----
cases_no
False    23575
True      4265
Name: cases_no, dtype: int64
----
population
False    27840
Name: population, dtype: int64
----
HDI for year
True     19472
False     8368
Name: HDI for year, dtype: int64
----
 gdp_for_year ($) 
False    27840
Name:  gdp_for_year ($) , dtype: int64
----


The length of the DataFrame is 27840 and the column `cases_no` has 23575 rows where `isnull()` is False, indicating that a value is present, and 4265 rows where it is True (27840 - 23575). 

The HDI column has more observations with missing values - 19472 in total. As we will not use this column for analysis, we can simply remove it.
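
The per-column counts produced by the loop above can also be obtained in a single expression, because `True` counts as 1 when summed. The sketch below shows this on a small made-up DataFrame; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data with some missing entries
df_demo = pd.DataFrame({
    "country": ["Albania", "Belgium", "Grenada"],
    "cases_no": [21.0, np.nan, np.nan],
    "hdi": [np.nan, 0.9, np.nan],
})

missing_per_column = df_demo.isna().sum()   # number of missing values per column
missing_fraction = df_demo.isna().mean()    # share of missing values per column

print(missing_per_column)
print(missing_fraction)
```

The fraction is often the more useful figure when deciding whether a column is worth keeping.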

#### Identifying non-numeric data in a column
Most of the missing values have already been identified and appropriately labelled by pandas, but since `cases_no` is not a numeric datatype, there must still be some non-numeric data stored in it. All the values are currently strings, so we can find all rows where the strings represent data that is non-numeric. We will use the method `str.isdigit()`, which will generate a set of Boolean values much like our test above with `isnull()` did. 

In [43]:
non_numeric = ( df_raw['cases_no'].str.isdigit() == False)
non_numeric.value_counts()

False    27820
True        20
Name: cases_no, dtype: int64

Notice how this method has only found twenty values, far fewer than we found with the `isnull()` method. That is because `str.isdigit()` returns `NaN` for entries that are already missing, and the comparison `NaN == False` evaluates to `False`, so those rows are not flagged. However, the twenty flagged values are the ones causing all of the problems, and we need to find out what they are and handle them appropriately. 
 
We are using a trick in pandas called _boolean indexing_. The following code will only select rows in the column `cases_no` where the variable `non_numeric` is equal to `True`. Thus it will return only twenty rows, the ones that have non-numeric data. Then, we will use the method `unique()` again to remove duplicate values. 

In [45]:
df_raw['cases_no'][non_numeric].unique()

array(['Null', 'Unknown'], dtype=object)
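
Boolean indexing is easier to see on a handful of values. The sketch below uses a made-up Series with no missing entries, which is why the mask can be negated with `~`; in the notebook's data the comparison `== False` is used instead because the column also contains `NaN`.

```python
import pandas as pd

# Made-up strings standing in for a few 'cases_no' entries
cases = pd.Series(["21", "Null", "14", "Unknown", "9"])

is_non_numeric = ~cases.str.isdigit()   # True where the string is not all digits
flagged = cases[is_non_numeric]         # keeps only the rows where the mask is True

print(is_non_numeric.tolist())
print(flagged)
```

The selected rows keep their original index labels (1 and 3 here), which makes it easy to trace them back to the full table. In the real data, the same pattern returned the two strings shown in the output above.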

So these twenty rows have two unique values: `Null` and `Unknown`. These two strings were not automatically identified and labelled as missing by pandas when reading in the file, so we now set them as missing ourselves. We can use the `replace()` function to replace these values with `NaN`.

In [5]:
df_raw['cases_no'] = df_raw['cases_no'].replace('Null',np.nan)
df_raw['cases_no'] = df_raw['cases_no'].replace('Unknown',np.nan)
non_numeric = ( df_raw['cases_no'].str.isdigit() == False)
non_numeric.value_counts()

False    27840
Name: cases_no, dtype: int64

### Exercise 4
The output should indicate that `cases_no` no longer has non-numeric data. How can you tell that? Put your answer in the blank markdown cell below.

#### Answer

### Exercise 5
What datatype is `cases_no` now? Will the mean work?

In [None]:
# Put your answer to Exercise 5 here

So we still need to have pandas convert this to a numeric type that will allow us to do more analysis with it.  

In [6]:
df_raw['cases_no'] = pd.to_numeric(df_raw['cases_no'])
print(df_raw['cases_no'].dtype)
df_raw['cases_no']

float64


0         21.0
1         16.0
2         14.0
3          1.0
4          9.0
         ...  
27835      6.0
27836    152.0
27837     21.0
27838      NaN
27839      7.0
Name: cases_no, Length: 27840, dtype: float64
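
An alternative to replacing each placeholder string by hand is to let `pd.to_numeric` turn anything it cannot parse into `NaN` with `errors="coerce"`. This is a sketch on made-up values, not the approach taken in this notebook. It is convenient, but it also hides unexpected strings, so it is usually worth inspecting the non-numeric values first, as was done above.

```python
import pandas as pd

# Made-up raw values: digits, placeholder strings and one true missing value
raw_cases = pd.Series(["21", "Null", "14", "Unknown", None])

# errors="coerce" converts unparseable entries to NaN instead of raising
numeric_cases = pd.to_numeric(raw_cases, errors="coerce")

print(numeric_cases)
print(numeric_cases.dtype)
```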

Does the mean now work?

In [7]:
cases_mean = df_raw['cases_no'].mean()
print(cases_mean)

217.80072171513478
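
Once you know which strings mark missing values in a file, you can also tell pandas about them when reading it, using the `na_values` argument of `read_csv`. The sketch below uses made-up CSV text rather than the WHO file; with the placeholders declared up front, the column is read as numeric straight away.

```python
import io

import pandas as pd

# Hypothetical CSV text with two custom missing-value markers
csv_text = """country,cases_no
Albania,21
Belgium,Null
Grenada,Unknown
Mexico,7
"""

# Extra strings listed in na_values are treated as missing on read
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=["Null", "Unknown"])

print(df_demo)
print(df_demo["cases_no"].dtype)
print(df_demo["cases_no"].mean())   # missing values are skipped by default
```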


### 3.2 Dealing with missing data
Now that we have identified what data is missing, how should we handle that for our analysis? There are many options, and the right one will depend on the data and on what assumptions can be made about why the data is missing. For some of the more complex options, you should consult with a statistician before implementing.

* Drop data
  * Drop the whole row
  * Drop the whole column
* Replace data - often known as imputation
  * Replace it with the mean
  * Replace it with multiple guesses from the expected distribution (multiple imputation)
  * Replace it by the mode
  * Replace it based on other functions




####  Dropping a column with lots of missing data.
To remove a column we use the `pandas.DataFrame.drop()` method. The input argument should be the list of column names that you want to drop, and the `axis` argument should be set to `1` to tell pandas that you want to drop columns, not rows. 

We don't need the column `HDI for year` for our analysis, and since there are a lot of missing values there, we can just drop it from the data frame. 

To avoid overwriting the original DataFrame, we will save the output to a new variable.


In [8]:
df_no_hdi = df_raw.drop(['HDI for year'], axis = 1)                         
df_no_hdi

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.0,312900,2156624900
1,Albania,1987,male,35-54 years,16.0,308000,2156624900
2,Albania,1987,female,15-24 years,14.0,289700,2156624900
3,Albania,1987,male,75+ years,1.0,21800,2156624900
4,Albania,1987,male,25-34 years,9.0,274300,2156624900
...,...,...,...,...,...,...,...
27835,Belgium,2011,female,25-34 years,6.0,707535,527008453887
27836,Thailand,2016,male,75+ years,152.0,1124052,411755164833
27837,Netherlands,1998,female,15-24 years,21.0,934500,432476116419
27838,Grenada,2002,female,5-14 years,,11760,540336926


Now we turn to the other variable in the data frame that had missing values, `cases_no`. Now that all of the missing values are coded properly, we can identify all of the missing data with the `isnull()` function. 

In [9]:
missing_values = df_no_hdi['cases_no'].isnull()
missing_values.value_counts()

False    23555
True      4285
Name: cases_no, dtype: int64

To do a _complete case analysis_, we can simply use the `dropna()` function. Even though `cases_no` is the only column with missing data, we use the `subset` argument to tell `dropna()` to only consider this column when deciding which rows to drop. We use `axis=0` to tell it to drop rows, not columns. 

In [10]:
df_no_missing_cases = df_no_hdi.dropna(axis=0, subset=['cases_no'])
df_no_missing_cases.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23555 entries, 0 to 27839
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             23555 non-null  object 
 1   year                23555 non-null  int64  
 2   sex                 23555 non-null  object 
 3   age                 23555 non-null  object 
 4   cases_no            23555 non-null  float64
 5   population          23555 non-null  int64  
 6    gdp_for_year ($)   23555 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 1.4+ MB


Note that the new data frame only contains 23555 entries, which is how many non-missing values there are. Let's take a look at the actual data. 

In [64]:
df_no_missing_cases

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.0,312900,2156624900
1,Albania,1987,male,35-54 years,16.0,308000,2156624900
2,Albania,1987,female,15-24 years,14.0,289700,2156624900
3,Albania,1987,male,75+ years,1.0,21800,2156624900
4,Albania,1987,male,25-34 years,9.0,274300,2156624900
...,...,...,...,...,...,...,...
27834,Ukraine,2005,female,25-34 years,182.0,3380536,86142018069
27835,Belgium,2011,female,25-34 years,6.0,707535,527008453887
27836,Thailand,2016,male,75+ years,152.0,1124052,411755164833
27837,Netherlands,1998,female,15-24 years,21.0,934500,432476116419


It's a bit odd that we have only 23,555 rows now but some of the rows in the spreadsheet still have their old index (27839). We can fix that by using `reset_index(drop=True)`, with the `drop=True` meaning that we do not want to retain the old index.

In [11]:
df_no_missing_cases = df_no_missing_cases.reset_index(drop=True)
df_no_missing_cases

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.0,312900,2156624900
1,Albania,1987,male,35-54 years,16.0,308000,2156624900
2,Albania,1987,female,15-24 years,14.0,289700,2156624900
3,Albania,1987,male,75+ years,1.0,21800,2156624900
4,Albania,1987,male,25-34 years,9.0,274300,2156624900
...,...,...,...,...,...,...,...
23550,Ukraine,2005,female,25-34 years,182.0,3380536,86142018069
23551,Belgium,2011,female,25-34 years,6.0,707535,527008453887
23552,Thailand,2016,male,75+ years,152.0,1124052,411755164833
23553,Netherlands,1998,female,15-24 years,21.0,934500,432476116419
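
To make the effect of `drop=True` concrete, here is a minimal sketch on a made-up DataFrame. Without `drop=True`, `reset_index()` keeps the old labels as a new column named `index`; with it, the old labels are discarded.

```python
import pandas as pd

# Made-up column with gaps; dropping the gaps leaves index labels 0, 2, 4
df_demo = pd.DataFrame({"cases_no": [21.0, None, 14.0, None, 9.0]})
kept = df_demo.dropna()

with_old_index = kept.reset_index()          # old labels stored in an 'index' column
clean_index = kept.reset_index(drop=True)    # old labels discarded

print(list(kept.index))
print(with_old_index)
print(clean_index)
```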


####  Replacing missing values
Replacing or imputing missing data can be a complex subject. Here we are only going to show the simplest option of replacing these missing values with a single value. One option is to replace the missing value with a mean.
In order to do that we make a copy of the DataFrame and then modify the specific cells identified as missing using the `loc[]` indexer, which only modifies the rows that are missing in the `cases_no` column.

In [12]:
df_cases_replaced = df_no_hdi.copy()
df_cases_replaced.loc[missing_values,'cases_no'] = cases_mean
df_cases_replaced

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.000000,312900,2156624900
1,Albania,1987,male,35-54 years,16.000000,308000,2156624900
2,Albania,1987,female,15-24 years,14.000000,289700,2156624900
3,Albania,1987,male,75+ years,1.000000,21800,2156624900
4,Albania,1987,male,25-34 years,9.000000,274300,2156624900
...,...,...,...,...,...,...,...
27835,Belgium,2011,female,25-34 years,6.000000,707535,527008453887
27836,Thailand,2016,male,75+ years,152.000000,1124052,411755164833
27837,Netherlands,1998,female,15-24 years,21.000000,934500,432476116419
27838,Grenada,2002,female,5-14 years,217.800722,11760,540336926


You can see that the next to last row of the DataFrame (the Grenada row) now has the mean instead of a specific number. But notice that this gives a rather large number of cases (218) for a very small population in Grenada (11760).

Another option is to replace with the most frequent value, also known as the mode. This can be done with the `mode()` function. As there can be more than one mode, it will return a pandas Series. However, we can see here that there is only one mode, so we will replace using that value. 

In [13]:
cases_mode = df_no_hdi['cases_no'].mode()
print(f"The mode is {cases_mode}")
df_cases_replaced.loc[missing_values,'cases_no'] = cases_mode.iloc[0]
df_cases_replaced

The mode is 0    1.0
dtype: float64


Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.0,312900,2156624900
1,Albania,1987,male,35-54 years,16.0,308000,2156624900
2,Albania,1987,female,15-24 years,14.0,289700,2156624900
3,Albania,1987,male,75+ years,1.0,21800,2156624900
4,Albania,1987,male,25-34 years,9.0,274300,2156624900
...,...,...,...,...,...,...,...
27835,Belgium,2011,female,25-34 years,6.0,707535,527008453887
27836,Thailand,2016,male,75+ years,152.0,1124052,411755164833
27837,Netherlands,1998,female,15-24 years,21.0,934500,432476116419
27838,Grenada,2002,female,5-14 years,1.0,11760,540336926


Now, instead of the mean (217.8), the missing value in the penultimate row is replaced with the mode (1.0).
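
The same single-value replacements can be written with the `fillna()` method, which fills every missing entry in a Series and returns a new Series. This is a sketch on made-up numbers, offered as an alternative to the `loc[]` approach used above rather than a replacement for it.

```python
import numpy as np
import pandas as pd

# Made-up case counts with two missing entries
cases = pd.Series([1.0, 1.0, 4.0, np.nan, 10.0, np.nan])

mean_filled = cases.fillna(cases.mean())            # mean of the observed values
mode_filled = cases.fillna(cases.mode().iloc[0])    # first (here, only) mode

print(mean_filled.tolist())
print(mode_filled.tolist())
```

Because `fillna()` returns a copy, the original Series still contains its missing values afterwards.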

## 4. Checking the data representation

We have previously corrected the issues with the `cases_no` column. Now let's take a look what issues are still remaining by using the `info()` function.

In [14]:
df_cases_replaced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27840 entries, 0 to 27839
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             27840 non-null  object 
 1   year                27840 non-null  int64  
 2   sex                 27840 non-null  object 
 3   age                 27840 non-null  object 
 4   cases_no            27840 non-null  float64
 5   population          27840 non-null  int64  
 6    gdp_for_year ($)   27840 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 1.5+ MB


We would expect `age` and the GDP column to be numeric, but both have the `object` dtype. Let's look at the values they contain to find out why.


In [17]:
print(df_cases_replaced['age'].unique())

['15-24 years' '35-54 years' '75+ years' '25-34 years' '55-74 years'
 '5-14 years']


So `age` does not represent a numeric value in years; rather, it is a _categorical_ variable representing age bands. We will convert this column to a categorical type. But as you can see above, the values are not sorted the way we would like the categories to be (look at where the 5-14 year old group is). So we need to tell pandas what order we want these age bands in.

In [26]:
age_bands_sorted = [
    '5-14 years','15-24 years','25-34 years', 
    '35-54 years','55-74 years','75+ years'  
    ]
df_cases_replaced['age'] = pd.Categorical(df_cases_replaced['age'], 
                                          categories=age_bands_sorted)
df_cases_replaced['age']

0        15-24 years
1        35-54 years
2        15-24 years
3          75+ years
4        25-34 years
            ...     
27835    25-34 years
27836      75+ years
27837    15-24 years
27838     5-14 years
27839      75+ years
Name: age, Length: 27840, dtype: category
Categories (6, object): ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years']

See how this variable now has a dtype of `category` and the categories are sorted so that the levels go up with age?
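The `categories` argument fixes the order in which the levels are listed. If you also want pandas to treat the bands as ranked, so that comparisons and `min()`/`max()` work, one option is to pass `ordered=True` as well. The sketch below uses a small made-up column (`toy_age` is a hypothetical name, not part of the notebook).

```python
import pandas as pd

age_bands_sorted = ['5-14 years', '15-24 years', '25-34 years',
                    '35-54 years', '55-74 years', '75+ years']

# Hypothetical toy column, not the real data
toy_age = pd.Series(['75+ years', '5-14 years', '25-34 years'])
toy_age = pd.Series(pd.Categorical(toy_age,
                                   categories=age_bands_sorted,
                                   ordered=True))

# Sorting follows the category order, not alphabetical order
print(toy_age.sort_values().tolist())

# min() and comparisons are only available for ordered categoricals
print(toy_age.min())
print((toy_age > '15-24 years').tolist())
```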

Now what is going on with the column representing GDP? First we count how many values are not made up purely of digits.

In [20]:
# True marks values that are NOT made up purely of digits
gdp_not_numeric = (df_cases_replaced[' gdp_for_year ($) '].str.isdigit() == False)
gdp_not_numeric.value_counts()


True    27840
Name:  gdp_for_year ($) , dtype: int64

None of the values are numeric. Let's look at the first few rows again to see what is going on.

In [22]:
df_cases_replaced[' gdp_for_year ($) '].head()

0    2,156,624,900
1    2,156,624,900
2    2,156,624,900
3    2,156,624,900
4    2,156,624,900
Name:  gdp_for_year ($) , dtype: object

It is likely that the commas are causing the trouble here. If we had spotted this before we read in the file, we could have added the argument `thousands=','` to our `read_csv` call.
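As a minimal sketch of that `read_csv` option, the example below parses a tiny in-memory CSV. The two rows and the names `toy_csv` and `toy_df` are made up for illustration; the real file is read from disk earlier in the notebook.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the numbers are quoted so that the commas
# inside them are not treated as column separators
toy_csv = io.StringIO(
    'country,gdp\n'
    'Albania,"2,156,624,900"\n'
    'Grenada,"540,336,926"\n'
)

# thousands=',' tells pandas to drop the separators while parsing
toy_df = pd.read_csv(toy_csv, thousands=',')
print(toy_df['gdp'].dtype)
print(toy_df)
```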

Before we fix this, though, the leading and trailing spaces in the column name are annoying, so let's get rid of those first.

In [23]:
df_cases_replaced = df_cases_replaced.rename(columns={' gdp_for_year ($) ':'gdp_for_year_usd'})
df_cases_replaced.columns

Index(['country', 'year', 'sex', 'age', 'cases_no', 'population',
       'gdp_for_year_usd'],
      dtype='object')

The original header of the GDP column contained spaces and the $ character, which are easy to get wrong every time we refer to the column. The new name is easy to type and less error-prone.
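When several headers carry stray whitespace, renaming them one by one gets tedious. A possible alternative, which this notebook does not use, is to strip every column name at once with the string methods on the column index. The frame below is a made-up example.

```python
import pandas as pd

# Hypothetical frame with untidy headers
toy_df = pd.DataFrame(columns=[' gdp_for_year ($) ', 'country ', ' year'])

# Remove leading and trailing whitespace from every column name
toy_df.columns = toy_df.columns.str.strip()
print(list(toy_df.columns))
```

This only removes the surrounding spaces; characters such as `$` or brackets would still need a `rename`, as done above.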

Fixing the values themselves requires a string operation on the column.

In [24]:
# First replace each comma ',' with an empty string '', then cast the column to 64-bit integers
df_cases_replaced['gdp_for_year_usd'] = df_cases_replaced['gdp_for_year_usd'].str.replace(',', '').astype('int64')
df_cases_replaced['gdp_for_year_usd']

0          2156624900
1          2156624900
2          2156624900
3          2156624900
4          2156624900
             ...     
27835    527008453887
27836    411755164833
27837    432476116419
27838       540336926
27839    183144164357
Name: gdp_for_year_usd, Length: 27840, dtype: int64

### **Wonderful!**
#### Save the data

We have finally obtained a cleaned data set with no missing values and with all data in its proper format.
We can save it now, as it is the 'clean' data set that is ready for processing.


In [None]:
df_cases_replaced.to_csv(r'who_suicide_statistics_clean_data.csv', index=False)

## 5. Grouping data
Now that our data is cleaned, we will move on to the operations we can perform on the DataFrame. Often we want to group a series of observations together and perform an operation on them. This is accomplished by the `groupby()` function, which allows users to:
* **split** data into groups according to some value in the variable (like country)
* **apply** a function to all of the observations within the group
* **combine** the results into a new aggregated data frame.
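The three steps above can be sketched on a tiny made-up data frame. The country labels and numbers are invented, and `toy_df`, `grouped` and `toy_totals` are hypothetical names.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'B'],
    'cases_no': [1.0, 3.0, 10.0, 20.0, 30.0],
})

# split: one group per distinct value of 'country'
grouped = toy_df.groupby('country')
for name, group in grouped:
    print(name, len(group))

# apply + combine: sum within each group, results gathered into one Series
toy_totals = grouped['cases_no'].sum()
print(toy_totals)
```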

To select the data for a single country out of the 101 in the data set, we will group the data frame by `country` and then pick the country we want with `.get_group('selection')`. Let's select the data for the United Kingdom.

In [31]:
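# Group by a single column name (a string, not a list) so that
# get_group() can be called with a plain country name
df_by_country = df_cases_replaced.groupby('country')
df_uk = df_by_country.get_group('United Kingdom')
df_uk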

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year_usd
26476,United Kingdom,1985,male,75+ years,264.0,1202838,489285164271
26477,United Kingdom,1985,male,55-74 years,915.0,5170113,489285164271
26478,United Kingdom,1985,male,35-54 years,128.0,6899879,489285164271
26479,United Kingdom,1985,male,25-34 years,62.0,3969689,489285164271
26480,United Kingdom,1985,female,55-74 years,678.0,6002096,489285164271
...,...,...,...,...,...,...,...
26843,United Kingdom,2015,female,25-34 years,181.0,4414464,2885570309161
26844,United Kingdom,2015,female,75+ years,18.0,3070457,2885570309161
26845,United Kingdom,2015,female,15-24 years,14.0,3966564,2885570309161
26846,United Kingdom,2015,female,5-14 years,6.0,3663221,2885570309161


Summarizing data across both sexes, all of the different age levels and over thirty years of data may be too coarse a grouping for most statistical analysis.

So we can group by multiple columns at a time, and perform operations that summarize across these groups. For example, let's say we want the total number of cases each year per country across all the age groups and both male and female individuals.


In [52]:
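# numeric_only=True restricts the sum to the numeric columns;
# recent pandas versions no longer drop text or categorical columns silently
df_by_country_yr = df_cases_replaced.groupby(['country','year']).sum(numeric_only=True)
df_by_country_yr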

Unnamed: 0_level_0,Unnamed: 1_level_0,cases_no,population,gdp_for_year_usd
country,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Albania,1987,76.0,2709600,25879498800
Albania,1988,65.0,2764300,25512000000
Albania,1989,71.0,2803100,28021499856
Albania,1992,51.0,2822500,8513431008
Albania,1993,56.0,2807300,14736852456
...,...,...,...,...
Uzbekistan,2010,1077.0,25651783,471993251148
Uzbekistan,2011,1631.0,25978049,550982294268
Uzbekistan,2012,1214.0,26381830,621858880056
Uzbekistan,2013,1662.0,26838924,692285441532


The `age` and `sex` columns are no longer included in the output because they are not numeric and the rows have been aggregated across them.

While the sum operation makes sense for the cases and the population, it does not for the GDP column. Its value is the same in every row for a given country and year, so summing it counts the same GDP once for every age band and sex.
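One way to avoid summing the GDP is to give each column its own aggregation function through `agg()`. The sketch below uses a small made-up frame with invented values and a hypothetical name `toy_by_country_yr`. It sums cases and population but keeps the first GDP value of each group, which assumes the GDP really is identical within a country and year.

```python
import pandas as pd

# Hypothetical toy data: two rows per year, GDP repeated within each year
toy_df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A'],
    'year': [2000, 2000, 2001, 2001],
    'cases_no': [1.0, 2.0, 3.0, 4.0],
    'population': [100, 200, 110, 210],
    'gdp_for_year_usd': [5000, 5000, 6000, 6000],
})

toy_by_country_yr = toy_df.groupby(['country', 'year']).agg({
    'cases_no': 'sum',
    'population': 'sum',
    'gdp_for_year_usd': 'first',   # same value in every row of the group
})
print(toy_by_country_yr)
```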

## 6. Sorting data
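Observations (rows) can be rearranged so that they are sorted, either by numerical value or in alphabetical order, using the method `sort_values()`. Let's see which year had the most cases in the UK. By default the values are sorted in ascending order, so the year with the most cases ends up at the bottom of the table.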

In [53]:
df_uk = df_by_country_yr.loc['United Kingdom',['cases_no','population']]
df_uk.sort_values('cases_no')

Unnamed: 0_level_0,cases_no,population
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2002,2668.0,55908944
1998,2967.0,55527379
1997,2991.0,55266635
1985,3017.0,53006535
2007,3348.0,57382757
2011,3421.0,58881852
2004,3508.0,56455011
2003,3529.0,56171771
1987,3604.0,53239668
2015,3650.0,61082942
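If you would rather have the largest value at the top, `sort_values()` accepts `ascending=False`, and `idxmax()` returns the index label of the maximum directly. A minimal sketch on made-up numbers (the years and counts are invented, not UK data):

```python
import pandas as pd

# Hypothetical yearly totals indexed by year
toy_df = pd.DataFrame({'cases_no': [30.0, 50.0, 20.0]},
                      index=pd.Index([2000, 2001, 2002], name='year'))

# Largest value first
toy_sorted = toy_df.sort_values('cases_no', ascending=False)
print(toy_sorted.head(1))

# Index label (here the year) of the largest value
toy_peak_year = toy_df['cases_no'].idxmax()
print(toy_peak_year)
```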


## 7. Statistical Summary Operations  
The methods `max()`, `min()`, `mean()` and `std()` return the maximum, minimum, mean and standard deviation. The `pandas.DataFrame.describe()` method shows all of these statistics at once.

In [54]:
df_uk['cases_no'].max() 

4683.0

In [62]:
# Use the print() function to print the numbers
# calculated with .max(), .min(), .mean() and .std().
# The :0.1f format specifier controls the number of decimal places
# shown for the mean and the standard deviation.
print(f"The mean number of cases in the UK is {df_uk['cases_no'].mean():0.1f}")
print(f"The standard deviation of cases in the UK is {df_uk['cases_no'].std():0.1f}")
print(f"The minimum number of cases in the UK is {df_uk['cases_no'].min()}")
print(f"The maximum number of cases in the UK is {df_uk['cases_no'].max()}")


The mean number of cases in the UK is 3753.3
The standard deviation of cases in the UK is 452.8
The minimum number of cases in the UK is 2668.0
The maximum number of cases in the UK is 4683.0
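As a short illustration of `describe()`, the sketch below summarises a made-up Series (the four values are invented and `toy_cases` is a hypothetical name). The same call on `df_uk['cases_no']` should give the statistics printed above, along with the count and the quartiles.

```python
import pandas as pd

toy_cases = pd.Series([10.0, 20.0, 30.0, 40.0], name='cases_no')

# count, mean, std, min, quartiles and max in a single call
toy_summary = toy_cases.describe()
print(toy_summary)
```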


#### Exercise 7 

Obtain the same stats for Greece and store it in a variable `df_greece`

In [64]:
# Get the total number of cases per year for Greece and 
# produce the summary statistics 
df_greece = df_by_country_yr.loc['Greece',['cases_no','population']]
df_greece.sort_values('cases_no')

Unnamed: 0_level_0,cases_no,population
year,Unnamed: 1_level_1,Unnamed: 2_level_1
1989,213.0,9527500
2007,221.0,10653591
2009,258.0,10719618
1998,260.0,10310630
1992,262.0,9834155
2001,298.0,10441529
1990,314.0,9617000
2005,320.0,10582816
2002,325.0,10482045
2006,341.0,10618891


## 8. Combining and merging data sets 
Often we will need to combine data from two different data sets together. There are two primary operations to combine data:
* **concatenating** two files, typically vertically to provide additional observations of the same data.
* **joining/merging** linked records from two different files that share a common key or set of keys. 

There are different types of joins (_inner_, _outer_, _left_, _right_) that differ in which values they keep. A schematic is shown below.

![merge](./fig/fig_merge.jpg)


### 8.1 Merging data
Let's say we want to do a side by side comparison of cases in the UK and Greece. First let's extract them again from the main dataframe.

In [83]:
# Group by a single column name so get_group() accepts a plain country name
df_by_country = df_cases_replaced.groupby('country')
df_uk = df_by_country.get_group('United Kingdom')
df_greece = df_by_country.get_group('Greece')
print(df_uk.head())
print(df_greece.head())

              country  year     sex          age  cases_no  population  \
26476  United Kingdom  1985    male    75+ years     264.0     1202838   
26477  United Kingdom  1985    male  55-74 years     915.0     5170113   
26478  United Kingdom  1985    male  35-54 years     128.0     6899879   
26479  United Kingdom  1985    male  25-34 years      62.0     3969689   
26480  United Kingdom  1985  female  55-74 years     678.0     6002096   

       gdp_for_year_usd  
26476      489285164271  
26477      489285164271  
26478      489285164271  
26479      489285164271  
26480      489285164271  
      country  year     sex          age  cases_no  population  \
10022  Greece  1985    male    75+ years      35.0      233300   
10023  Greece  1985    male  55-74 years      85.0      874300   
10024  Greece  1985    male  35-54 years      82.0     1248900   
10025  Greece  1985    male  25-34 years       4.0      682900   
10026  Greece  1985  female    75+ years      15.0      317900   

  

We are going to use the `reset_index()` function to replace the index values inherited from the original rows of the master spreadsheet with new ones that start again from 0.

In [84]:
df_uk = df_uk.reset_index(drop=True)
df_greece = df_greece.reset_index(drop=True)
df_uk
df_greece

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year_usd
0,Greece,1985,male,75+ years,35.0,233300,47820850975
1,Greece,1985,male,55-74 years,85.0,874300,47820850975
2,Greece,1985,male,35-54 years,82.0,1248900,47820850975
3,Greece,1985,male,25-34 years,4.0,682900,47820850975
4,Greece,1985,female,75+ years,15.0,317900,47820850975
...,...,...,...,...,...,...,...
367,Greece,2015,female,35-54 years,36.0,1627797,195541761243
368,Greece,2015,female,75+ years,12.0,683939,195541761243
369,Greece,2015,female,15-24 years,3.0,541528,195541761243
370,Greece,2015,female,5-14 years,1.0,526126,195541761243


Now we will use the `merge` function in pandas to combine them. We have a few things to think about here. 
* One data frame needs to be the left data frame (see above figure) and the other needs to be the right data frame. 
* No matter if the record is found in just the left data frame, just the right data frame, or both, we want to keep it. That is called an _outer_ join and is specified by setting the parameter to `how='outer'`.
* The `year`, `age`, and `sex` columns are the _keys_ that we will want to match for the row by row comparison. These are fed into the `on` parameter of merge.
* The other three variables `cases_no`, `population`, and `gdp_for_year_usd` are the variables that we are going to compare. 
* We can find out if there was a match or not by setting `indicator=True`. This creates a new column called `_merge` that indicates if the observation was found in the left, the right, or both.
* We can add suffixes so that we know where the column data comes from. The country columns now become irrelevant as we have suffixed these values to the columns that matter. 

In [85]:
df_uk_greece = pd.merge(df_uk, df_greece,
                        how='outer',on=['year','age','sex'],
                        suffixes=['_uk','_greece'],indicator=True) 
# Since suffixed in the merge, we can drop the countries
df_uk_greece = df_uk_greece.drop(columns = ['country_uk','country_greece'])
df_uk_greece

Unnamed: 0,year,sex,age,cases_no_uk,population_uk,gdp_for_year_usd_uk,cases_no_greece,population_greece,gdp_for_year_usd_greece,_merge
0,1985,male,75+ years,264.0,1202838,489285164271,35.0,233300,47820850975,both
1,1985,male,55-74 years,915.0,5170113,489285164271,85.0,874300,47820850975,both
2,1985,male,35-54 years,128.0,6899879,489285164271,82.0,1248900,47820850975,both
3,1985,male,25-34 years,62.0,3969689,489285164271,4.0,682900,47820850975,both
4,1985,female,55-74 years,678.0,6002096,489285164271,47.0,1000100,47820850975,both
...,...,...,...,...,...,...,...,...,...,...
367,2015,female,25-34 years,181.0,4414464,2885570309161,17.0,662893,195541761243,both
368,2015,female,75+ years,18.0,3070457,2885570309161,12.0,683939,195541761243,both
369,2015,female,15-24 years,14.0,3966564,2885570309161,3.0,541528,195541761243,both
370,2015,female,5-14 years,6.0,3663221,2885570309161,1.0,526126,195541761243,both
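Every row visible in the output above is marked `both`. To see what the `_merge` column looks like when the keys do not all match, here is a minimal sketch with two made-up frames that overlap in only one year. All names and values are invented for illustration.

```python
import pandas as pd

toy_left = pd.DataFrame({'year': [2000, 2001], 'cases_no': [5, 6]})
toy_right = pd.DataFrame({'year': [2001, 2002], 'cases_no': [7, 8]})

# The outer join keeps years found in either frame;
# indicator=True records where each row came from
toy_merged = pd.merge(toy_left, toy_right, how='outer', on='year',
                      suffixes=('_left', '_right'), indicator=True)
print(toy_merged)

# Count how many rows matched on both sides
print(toy_merged['_merge'].value_counts())
```

Rows that are present on only one side get `NaN` in the columns that come from the other frame.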


### 8.2 Concatenating data
Concatenation is a bit different from merging. Concatenating is about bolting on additional rows (if the columns are the same in both) or additional columns (if the rows are the same). Unlike merging, which only works with two data frames at a time, concatenation can combine two or more data frames in a single call.

Let's use our two data frames `df_uk` and `df_greece` again and try to concatenate vertically. 
If we were to run the following command
```python
df_concatenated = pd.concat([df1, df2])
```

then it would _vertically_ concatenate the data frames (`df1` and `df2`), and the default result would look like the picture:



![concatenate1](./fig/fig1_concat.jpg)

This example assumes that your column names are the same between the two data frames. If your column names are different while concatenating along rows (axis 0), then by default the columns will also be added, and NaN values will be filled in as applicable.

![concatenate2](./fig/fig2_concat.jpg)

Concatenation is usually performed vertically to add more observations (rows) of the same columns. However, you can also concatenate horizontally along the same row indices by passing the parameter `axis='columns'`.
```python
df_concatenated = pd.concat([df1, df2], axis='columns')
```



![concatenate3](./fig/fig3_concat.jpg)

Let's concatenate our data sets and view the result.

In [86]:
df_uk_greece_concat = pd.concat([df_uk, df_greece])
df_uk_greece_concat

Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year_usd
0,United Kingdom,1985,male,75+ years,264.0,1202838,489285164271
1,United Kingdom,1985,male,55-74 years,915.0,5170113,489285164271
2,United Kingdom,1985,male,35-54 years,128.0,6899879,489285164271
3,United Kingdom,1985,male,25-34 years,62.0,3969689,489285164271
4,United Kingdom,1985,female,55-74 years,678.0,6002096,489285164271
...,...,...,...,...,...,...,...
367,Greece,2015,female,35-54 years,36.0,1627797,195541761243
368,Greece,2015,female,75+ years,12.0,683939,195541761243
369,Greece,2015,female,15-24 years,3.0,541528,195541761243
370,Greece,2015,female,5-14 years,1.0,526126,195541761243


The result is that the data sets are stacked, with `df_uk` on top of `df_greece`. By default they are concatenated along the row axis (`axis=0`). If you look carefully, the indices repeat: the first thirty rows of the UK data have indices 0 to 29, and the first thirty rows of the Greece data also have indices 0 to 29.
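If repeated index labels are not wanted, `pd.concat` can renumber the rows with `ignore_index=True`. A minimal sketch with two made-up frames (`toy_a` and `toy_b` are hypothetical names):

```python
import pandas as pd

toy_a = pd.DataFrame({'cases_no': [1, 2]})
toy_b = pd.DataFrame({'cases_no': [3, 4]})

# Default behaviour: each frame keeps its own labels, so they repeat
toy_repeated = pd.concat([toy_a, toy_b])
print(toy_repeated.index.tolist())

# ignore_index=True discards the old labels and numbers the rows from 0
toy_stacked = pd.concat([toy_a, toy_b], ignore_index=True)
print(toy_stacked.index.tolist())
```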

In [88]:
# axis=1 concatenates horizontally, placing the two data frames side by side
df_uk_greece_colconcat = pd.concat([df_uk, df_greece], axis=1)
print(df_uk_greece_colconcat.columns)
df_uk_greece_colconcat

Index(['country', 'year', 'sex', 'age', 'cases_no', 'population',
       'gdp_for_year_usd', 'country', 'year', 'sex', 'age', 'cases_no',
       'population', 'gdp_for_year_usd'],
      dtype='object')


Unnamed: 0,country,year,sex,age,cases_no,population,gdp_for_year_usd,country.1,year.1,sex.1,age.1,cases_no.1,population.1,gdp_for_year_usd.1
0,United Kingdom,1985,male,75+ years,264.0,1202838,489285164271,Greece,1985,male,75+ years,35.0,233300,47820850975
1,United Kingdom,1985,male,55-74 years,915.0,5170113,489285164271,Greece,1985,male,55-74 years,85.0,874300,47820850975
2,United Kingdom,1985,male,35-54 years,128.0,6899879,489285164271,Greece,1985,male,35-54 years,82.0,1248900,47820850975
3,United Kingdom,1985,male,25-34 years,62.0,3969689,489285164271,Greece,1985,male,25-34 years,4.0,682900,47820850975
4,United Kingdom,1985,female,55-74 years,678.0,6002096,489285164271,Greece,1985,female,75+ years,15.0,317900,47820850975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
367,United Kingdom,2015,female,25-34 years,181.0,4414464,2885570309161,Greece,2015,female,35-54 years,36.0,1627797,195541761243
368,United Kingdom,2015,female,75+ years,18.0,3070457,2885570309161,Greece,2015,female,75+ years,12.0,683939,195541761243
369,United Kingdom,2015,female,15-24 years,14.0,3966564,2885570309161,Greece,2015,female,15-24 years,3.0,541528,195541761243
370,United Kingdom,2015,female,5-14 years,6.0,3663221,2885570309161,Greece,2015,female,5-14 years,1.0,526126,195541761243


Now the data frames are placed side by side. This is unhelpful here because both have the same column names, which makes selecting a specific column difficult and confusing.
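One possible way around the duplicated names, not used in this notebook, is the `keys` argument of `pd.concat`, which adds an outer level to the column labels so each block can be addressed separately. The frames and key labels below are made up for illustration.

```python
import pandas as pd

toy_uk = pd.DataFrame({'year': [2000, 2001], 'cases_no': [5, 6]})
toy_greece = pd.DataFrame({'year': [2000, 2001], 'cases_no': [1, 2]})

# keys labels each block, producing two-level column names
toy_side_by_side = pd.concat([toy_uk, toy_greece], axis=1,
                             keys=['uk', 'greece'])
print(toy_side_by_side.columns.tolist())

# A column is now selected with an (outer, inner) pair
print(toy_side_by_side[('greece', 'cases_no')].tolist())
```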


## 9. Visualizing the data

There is only so much you can learn from looking at data in tabular form, especially when there are large numbers of variables and observations. Effective data visualization is an essential tool with many applications, from initial exploration to disseminating findings. There are various libraries that you can use in Python to visualize your data, with two of the most popular being [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/). Here, we will use the Matplotlib library to do some example visualizations.

In [None]:
%matplotlib inline  

  **%matplotlib** is a magic function. Used on its own, it displays figures in a pop-up window; used with `inline`, figures are displayed inside the notebook.

In [None]:
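# Assumption: UK_suicides_per_year and Greece_suicides_per_year each hold
# one row per year with the columns country, year and suicides_no
df1_merged = pd.merge(UK_suicides_per_year, Greece_suicides_per_year, how='right', on='year')
df1_merged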



We can change the column names to be more specific, or use the `.join()` method instead, which has the arguments `lsuffix` and `rsuffix` for this purpose.
Let's change the headers as we saw earlier.


In [None]:
headers=['UK','year','suicides_no_UK','Greece','suicides_no_Greece']
df1_merged.columns=headers
df1_merged.head()

Now we are ready to plot. First import the pyplot module of the matplotlib library as `plt`,
then call the `.plot()` function on our data frame. The arguments start with the x column and the y column(s).

In [None]:

import matplotlib.pyplot as plt
df1_merged.plot(x='year', y=['suicides_no_UK','suicides_no_Greece'])



The default plot is a line plot. Other kinds of plot can be selected with the `kind` argument: `line`, `bar`, `barh` (horizontal bars), `hist` (histograms), `box`, `area`, `pie` (pie charts) and `scatter`.

In [None]:
df1_merged.plot(kind= 'bar', x='year', y=['suicides_no_UK','suicides_no_Greece'])

## Operations with Statistical Functions
### Example Statistical Functions: **Normalization** and **Standardization**
Data is usually collected from different sources in different formats.
Normalization and standardization are processes that transform data into a common format, allowing the researcher to make meaningful comparisons.

**Normalization** is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable from 0 to 1 and scaling the variable so the variance is 1.



### **Example**

To demonstrate normalization, let's say we want to scale the columns `suicides_no_UK` and `suicides_no_Greece` of the `df1_merged` data frame.
We would like to normalize those variables so that their values lie between 0 and 1.
We replace each original value by (original value)/(maximum value).

**Step 1)** Create a function of normalization 

In [None]:
def normalize(number):
    # Divide every value by the largest value in the Series
    return number / number.max()

**Step 2)** Apply the function to the selected column 

In [None]:
suicides_UK_normalised=normalize(df1_merged['suicides_no_UK'])
suicides_UK_normalised

In [None]:
suicides_Greece_normalised=normalize(df1_merged['suicides_no_Greece'])
suicides_Greece_normalised
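Dividing by the maximum puts the largest value at 1, but the smallest value only reaches 0 if the original minimum is 0. A common alternative, shown here as a sketch and not used in this notebook, is min-max scaling, which maps the minimum to 0 and the maximum to 1. The function name `min_max_normalize` and the toy values are made up for illustration.

```python
import pandas as pd

def min_max_normalize(values):
    # Map the smallest value to 0 and the largest value to 1
    return (values - values.min()) / (values.max() - values.min())

toy_cases = pd.Series([200.0, 300.0, 400.0])

# Division by the maximum, as in normalize() above
toy_max_scaled = toy_cases / toy_cases.max()
print(toy_max_scaled.tolist())      # [0.5, 0.75, 1.0]

# Min-max scaling
toy_min_max = min_max_normalize(toy_cases)
print(toy_min_max.tolist())         # [0.0, 0.5, 1.0]
```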

### **Data Standardization**
Data standardization is a particular type of data normalization where we subtract the mean and divide by the standard deviation. The data is transformed so that the mean is 0 and the standard deviation is 1; values below the mean become negative and values above it become positive.

In [None]:
def standardize(number):
    # Subtract the mean and divide by the standard deviation
    return (number - number.mean()) / number.std()

In [None]:
standardize(UK_suicides_per_year['suicides_no'])
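As a quick check of what standardization does, the sketch below applies the same formula to a made-up Series and confirms that the result has a mean of 0 and a standard deviation of 1, up to floating-point rounding. The values and the name `toy_z` are invented for illustration.

```python
import pandas as pd

def standardize(values):
    # Subtract the mean and divide by the (sample) standard deviation
    return (values - values.mean()) / values.std()

toy_cases = pd.Series([10.0, 20.0, 30.0, 40.0])
toy_z = standardize(toy_cases)

print(toy_z.tolist())
print(round(toy_z.mean(), 10), round(toy_z.std(), 10))
```

Values below the original mean (25) come out negative and values above it come out positive.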

## Visualise and compare
I want to compare the normalised values for both countries, so I will use the merged data again.

We can either add the two columns of normalised values to the `df1_merged` data set, so that it also includes these statistics, or we can create a new data set.

Let's go with the first option:

In [None]:
# Add the normalised Series calculated above as new columns of the merged data frame
df1_merged['UK_suicides_normalised'] = suicides_UK_normalised
df1_merged['Greece_suicides_normalised'] = suicides_Greece_normalised
df1_merged.head()

We can now plot the normalised data and compare the two countries.

In [None]:
# Plot the two normalised columns as a grouped bar chart using the data frame's plot() method
df1_merged.plot(kind='bar', x='year', y=['UK_suicides_normalised', 'Greece_suicides_normalised'])

You can see that, in general, the normalised number of cases in the UK was higher until 2011. After 2011 the normalised number of cases in Greece was higher, which is likely related to the economic crisis that started in 2010 and had its greatest impact on society after 2011.
With this kind of comparative plot you can visualise, and gain insight into, the political and socio-economic situation of the different countries among the 101 present in this data set.


## Grouping plots
You can group similar plots in a single figure using subplots. `matplotlib.pyplot.figure()`, shortened to `plt.figure()`, creates a space into which we will place all our plots. The parameter `figsize` tells Python how big to make this space. Each subplot is placed into the figure using its `add_subplot` method, which takes three parameters: the total number of rows of subplots, the total number of columns of subplots, and which subplot your variable refers to (counted left to right, top to bottom). Each subplot is stored in a different variable (`ax1`, `ax2`, `ax3`, `ax4`). Once a subplot is created, its axes can be labelled using `set_title()`, `set_xlabel()` or `set_ylabel()`, for example `ax1.set_title()`. Here are our four plots in a two-by-two grid:

In [None]:
fig = plt.figure(figsize=(12.0, 8.0))

# Top left: bar chart of UK cases, drawn on the first subplot (ax1)
ax1 = fig.add_subplot(2, 2, 1)
ax1.set_title('UK suicides')
ax1.bar(UK_data['year'],
        UK_data['suicides_no'],
        color='blue')

# Top right: UK GDP, converted to billions of USD
ax2 = fig.add_subplot(2, 2, 2)
ax2.set_title('GDP in billions USD')
ax2.plot(UK_data['year'], UK_data['gdp_for_year_usd'] / 10**9, color='purple')

# Bottom left: bar chart of cases in Greece
ax3 = fig.add_subplot(2, 2, 3)
ax3.set_title('Greece suicides')
ax3.bar(Greece_data['year'],
        Greece_data['suicides_no'],
        color='orange')

# Bottom right: Greek GDP, converted to billions of USD
ax4 = fig.add_subplot(2, 2, 4)
ax4.set_title('GDP in billions USD')
ax4.plot(Greece_data['year'], Greece_data['gdp_for_year_usd'] / 10**9, color='green')


I want to plot the GDP per capita over the years for both countries. To do so, I need to calculate it by dividing the GDP column by the population column and adding the result as an extra column.

Let's plot the total number of suicides over all years for each of the 101 countries.

We can write it as a single chained expression: from the data frame, group by the `country` column, sum, sort by the number of suicides, and plot `suicides_no`.

In [None]:


# DataFrame.plot() creates its own figure, so no separate plt.figure() call is needed
(data_frame.groupby('country').sum(numeric_only=True)
           .sort_values(by='suicides_no', ascending=False)[['suicides_no']][:101]
           .plot(kind='bar', figsize=(16, 8), title='Sum of suicides by country during 1985-2016'))
plt.ylabel("Sum of suicides")
plt.xlabel("Country")

In [None]:
# Plot only the 20 countries with the highest total number of suicides
(data_frame.groupby('country').sum(numeric_only=True)
           .sort_values(by='suicides_no', ascending=False)[['suicides_no']][:20]
           .plot(kind='bar', figsize=(16, 8), color='purple',
                 title='Sum of suicides for the top 20 countries, 1985-2016'))
plt.ylabel("Sum of suicides")




## Exercise 
Calculate the GDP/capita of the countries and visualise the cases according to GDP/capita.



# Convert to GDP/capita

### Thank you for completing this lab!

## Author
##  Mary Tziraki


                                                                                     

## <h3 align="center"> © Health+Bioscience IDEAS 2022. All rights reserved. </h3>
