<a href="https://pandas.pydata.org/">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png" width="300px">
</a>


# Pandas and Data Wrangling



## Objectives
You can work through this notebook during the lecture on Pandas and data wrangling to get hands-on experience with the material as it is taught, or run it later at your own pace. 
It aims to make you familiar with data frames and with how to use Pandas to clean, analyse and visualise them.

After completing this notebook you will be able to:
*   Understand the Pandas library and its features
*   Import your data
*   Understand your data: its size, features and structure
*   Handle missing values
*   Correct the data format
*   Do statistical analysis, and standardize and normalize data
*   Visualise data using the Matplotlib library


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ul>
    <li><a href="#Import_Libraries_and_the_Data_Set">Import Libraries and the Data Set</a></li> 
    <li><a href="#Discover_and_understand_the_Data_Set">Discover and understand the data set</a></li> 
    <li><a href="#Identify_missing_Data">Identify and handle missing values</a></li>     
    <li><a href="#deal_missing_values">Deal with missing values</a></li>
    <li><a href="#correct_data_format">Correct data format</a></li>
    <li><a href="#data_standardization">Data standardization</a></li>
    <li><a href="#data_normalization">Data normalization (centering/scaling)</a></li>
    
</ul>

</div>

<hr>


## Tabular Data analysis with Pandas Library
The Pandas library is widely used to discover, clean and structure data.


> _Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language._

From https://pandas.pydata.org/.

<h2>What is the purpose of data wrangling?</h2>


Data wrangling is the process of cleaning a raw data set and converting it into a format that is compatible with analysis. The process includes data discovery, finding and deleting or replacing missing values, changing the data format into a usable form, and data visualisation.
While data cleaning involves removing erroneous data from your data set, data wrangling involves more steps and processes.  

Data wrangling is the first and an essential part of data analysis; however, it is often the most time-consuming and tedious part. It is important to understand the process and the code, as you will use them often before analysing your data. 


### Data Wrangling Steps
The exact tasks required vary, but the basic steps include:

* **Discovering**: 
Understand the data and find out what information is useful for your problem.
* **Structuring**:
Standardise the data format for disparate types of data and make the data usable for automated or semi-automated analysis. The data must be structured to fit the analytics model.

* **Cleaning**: 
Any dataset may contain outliers and errors that could alter the outcome of an analysis, so structured data must be cleaned. This involves handling null values, eliminating redundancies and standardising formatting to improve data consistency.


* **Enriching**:
You are familiar with the data at this point. This is the moment to ask yourself whether you want to enhance it. Are you looking to add other data to it?


* **Validating**:
This step involves iterative programming steps that verify your data's quality and safety. For example, you may have problems if your data is not clean or enriched, or if the attributes are not distributed evenly.


## Visualise Data 
Visualisation libraries such as Matplotlib and seaborn are often used to plot and statistically analyse data.
Plots make it easy to discover and clean outliers, which are data points that don't make sense and often distort the analysis.

### ---------------------------------------------------------------


<h2 id="Import_Libraries_and_the_Data_Set"> 1) Import Libraries and the Data Set </h2>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


###  Reading the dataset:
You can load the data from a URL
or 
directly from your computer.


### From URL
First, we assign the URL of the dataset to "filename", as in the next code example. It doesn't apply to the case here, because we read the data from a local file.

In [None]:
#filename = "https://....the website/data.csv"

### Reading the Dataset from a file in your computer 
If the file is in the same folder as your code, the best practice is to use a relative path such as path="./data.csv".
If it is not in the same folder, you need to use the path of the data folder.

In [None]:

#NOTE: Change the path to match yours
#my_path = "/Users/yourName /TeamCoders_Event_Based_Model/2_DataCleaningAndWrangling/who_suicide_statistics_modified3.csv"
# A path starting with "./" is read relative to the current directory, i.e. the folder that contains the Jupyter notebook
my_path = "./data/who_suicide_statistics_modified3.csv"
data_raw = pd.read_csv(my_path, header=None)  # header=None: no row of the file is treated as column names


 <h2 id="Discover_and_understand_the_Data_Set"> 2) Discover and understand the data set. </h2>
Display the data and read it. Discover the data by displaying it using the following methods:

- **.head()** 
- **.tail()**

Investigate data with :
- **.info()**
- **.describe()**

Find the shape (rows and columns) and the data type of the values using the following:
- **.shape**
- **.dtypes**
- the function **len(data)**


Use the method <b>head()</b> to display the first five rows of the dataframe.


In [None]:
data_raw.head()

In [None]:
data_raw.shape

In [None]:
data_raw.dtypes

## Exercise 1)
Use the tail method to see the last part of the data.
How many lines do you see?
Use the tail method again to see the last 20 lines of the data.

In [None]:
# Write your code here

## Exercise 2)
Use the info method to gather information  about the data. 


In [None]:
# Write your code here

## Exercise 3)
Use the len() function to find the length of your data. 


In [None]:
# Write your code here

### Observations
* The data frame has 27841 rows and 8 columns that describe the demographics of suicides around the world.
* There are no headers; the columns are labelled with integer indices. 
* The column names are in the first row (index 0); we need to remove that row and make it the header.
* The data frame is itself an object, and all of its columns have the data type object. 
* This is not a data structure that is compatible with the analysis we will follow.

## Exercise 4 )
Write any other observation that you see about the data. 


## **2.1) Fix the headers (column names).** 
If the data frame has no column names assigned, or they are in the wrong place, we can restore the headers in one of the following three ways:


### **A) How to replace the header with the first row.**
In our data set the descriptions of the columns are in the first row (index 0). We need to replace the headers with the first row. We need to:
- 1) locate the first row and assign it as columns
- 2) select the dataframe from the 2nd row downwards
- 3) reset the index of the dataframe 

* The first row of the dataframe is assigned to df.columns using the **df.iloc[0]** statement.
* Next, the dataframe is sliced from the second row, using its index 1 (with **.iloc[1:]**). 
* On the same line we reset the row index using the **reset_index()** method.
* With these steps, the header of the dataframe is replaced with the first row of the dataframe.
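
The three steps can be tried on a small table first. The sketch below uses a made-up three-row frame (the name `toy` and its values are assumptions for illustration, not part of the data set) whose first row holds the column names, as happens after reading a file with header=None.

```python
import pandas as pd

# Hypothetical toy table: row 0 holds the column names.
toy = pd.DataFrame([["country", "year"], ["Greece", "1985"], ["Albania", "1987"]])

toy.columns = toy.iloc[0]                  # 1) the first row becomes the header
toy = toy.iloc[1:].reset_index(drop=True)  # 2) keep rows 1 onwards, 3) renumber them from 0
toy.columns.name = None                    # optional: remove the leftover label of the old row

print(toy)
```

The last line is cosmetic: without it the header keeps the label of the row it came from.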


### A) Method: use the headers that exist in row 0

In [None]:
data_raw.columns = data_raw.iloc[0]
data_raw = data_raw.iloc[1:].reset_index(drop=True)  # drop the old first row and renumber the rows from 0

data_raw.head()

We recommend running only one method when you first run the notebook.
We have run method (A), so we don't need to run (B) and (C); they are therefore commented out (with a # in front). If you want to run them, remove the #. 

### B) Method 
We create the headers from the descriptions in the 1st line (row 0) and then delete row 0.
The following code is commented out (with # in front), so it does not run. To run it, skip the code of method A and delete the # in front of each line.

We create a Python list **headers** containing the names of the headers.


In [None]:
# headers = ["country","year","gender","age","suicides_no","population","HDI for year","gdp_for_year ($)"]

In [None]:
# data_raw.columns = headers            # 1) Add the headers as column names
# data_raw = data_raw.drop(0, axis=0)   # 2) Delete the row with the labels, which is row 0
# data_raw.head()                       # 3) Check the first five lines of the data frame

###  C) Method

This method only works when the headers are in the 1st line (line 0) of the file. If there are no headers at all, you need to use method B to insert them manually.
Use the Pandas function read_csv('filename.csv') without header=None to load the data; by default it takes the first line as the header. 


In [None]:
# df = pd.read_csv(my_path)
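
To see the difference between the two ways of calling read_csv without touching the real file, here is a minimal sketch that reads a made-up CSV text from memory. The variable names and the content of `csv_text` are assumptions for illustration.

```python
import io
import pandas as pd

csv_text = "country,year\nGreece,1985\nAlbania,1987\n"  # made-up file content

no_header = pd.read_csv(io.StringIO(csv_text), header=None)  # the labels stay in row 0 as data
with_header = pd.read_csv(io.StringIO(csv_text))             # the first line becomes the header

print(no_header.shape)    # (3, 2)
print(with_header.shape)  # (2, 2)
print(with_header.dtypes)
```

With the header recognised, the year column contains only numbers, so read_csv can give it a numeric type instead of treating it as text.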

The method **.head()** displays **the first five rows** of the dataframe by default; when you pass a number as an argument, it displays that many rows.


In [None]:
# To see what the data set looks like, we'll use the head() method for the first 12 lines.
data_raw.head(12)


## **2.2) Understand the data set - Select columns - get values**
The way to select a column is to write 
**dataframe['column_name']**
Let's show all the countries in the data set.

In [None]:
data_raw_countries=data_raw['country']
print(data_raw_countries)

### **How we can count values within the column**
Use the **.value_counts().** method

In [None]:
data_raw['country'].value_counts() #select the column with data_raw['country'] then add the method .value_counts()

In [None]:
print((data_raw_countries).value_counts()) # You can also use it within the print function

### **How to make selections in columns and make lists** 
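To find all the values that appear at least once in a column, we use the **.unique()** method.
We use it here to list each country once and to cast the values to the 'string' data type. Note that .unique() keeps the order of first appearance, which is alphabetical only if the data is already sorted by country.  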


In [None]:
data_raw_countries_alphabetically = data_raw['country'].unique().astype(str)
data_raw_countries_alphabetically

In [None]:
type(data_raw_countries_alphabetically)

data_raw_countries_alphabetically is a NumPy array, which we convert to a list using the **.tolist()** method.


In [None]:
data_countries_list= data_raw_countries_alphabetically.tolist() # Convert a numpy array to a list
data_countries_list  # The list starts and finishes with []
#  we can print the list within the print function:      print(f'The list of countries is:{data_countries_list}')

In [None]:
#Find the length of the list
len(data_countries_list)
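
A short sketch of how .unique() orders its result, using a made-up Series (the names and values are assumptions for illustration). It keeps the order of first appearance, so an explicit sorted() is needed when alphabetical order matters.

```python
import pandas as pd

countries = pd.Series(["Greece", "Albania", "Greece", "Zambia", "Albania"])

in_order = list(countries.unique())  # each value once, in order of first appearance
alphabetical = sorted(in_order)      # explicit alphabetical order

print(in_order)      # ['Greece', 'Albania', 'Zambia']
print(alphabetical)  # ['Albania', 'Greece', 'Zambia']
print(len(in_order)) # 3
```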

### Data columns Investigation 
### Find the mean of a numerical column
The method to find the mean of a column is **.mean()**

In [None]:
data_raw_suicides = data_raw['suicides_no']
data_raw_suicides

In [None]:
data_raw_suicides_mean = data_raw['suicides_no'].mean()

### ERROR !
 We observe that:
There are string values in the suicides_no column, so we cannot find the mean with the .mean() method, because it applies only to numeric values (integers or floats)!

Let's display all the unique values in the 'suicides_no' column with the method **.unique()** and try to change the type of these values to integers using typecasting with **.astype(int)**.
 

In [None]:
data_raw_suicides = data_raw['suicides_no'].unique().astype(int) # Try to convert the unique values of the suicides_no column into integers (raises an error here)
data_raw_suicides

### Error and Observations:
   * There are many NaN (Not a Number) data points in the suicides_no column, which indicate missing values and cannot be converted into integers.
   * As displayed in data_raw above, the HDI column is dominated by NaN. We will delete the column with the Human Development Index because we will not use it for the analysis and it has many missing values that we cannot fill in. We will also save space and time. 
   


# 3) **How to work with missing data?**
As we can see, there are several NaN (not a number) values in the data frame; missing values may also appear as the word Null or as other words and strings. These missing values may hinder our further analysis. This section shows how to identify them and work with missing data in order to bring our data frame into a format that is ready for analysis.

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>





<h2 id="Identify_missing_Data">3.1 Identify missing Data</h2>


### **3.1.1 Find missing data**
By default, Pandas reads missing values as NaN (null). We use the following methods to identify these missing values:

- **.isnull()**
- **.notnull()**

The output is a data frame of boolean values indicating whether each value is in fact missing.



With the **isnull()** method the output is True where a value is null and False where there is a real value. The opposite happens with the .notnull() method.

In [None]:
data_raw.isnull().value_counts() # True marks a null; on a data frame, value_counts() counts how often each row pattern of True/False occurs

There are null values in just two columns: suicides_no and HDI for year.

We observe the opposite with **.notnull()**. An equivalent is to put the tilde sign **(~)**, which indicates negation, in front of the **.isnull()** result.  

In [None]:
data_raw.notnull().value_counts() # True marks a real value and False marks a null

In [None]:
(~data_raw.isnull()).value_counts() # The ~ negates isnull(), which gives the same result as notnull()

#### Both expressions give the same result, with the Booleans reversed compared with **.isnull()**.

**(~data_raw.isnull()).value_counts()** # The tilde symbol negates isnull(), which is equivalent to notnull()

**(data_raw.notnull()).value_counts()** # False marks the nulls and True marks the real values
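
The relationship between the two methods and the tilde can be checked on a small made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "suicides_no": [21, np.nan, 14],
    "population": [312900, 308000, np.nan],
})

is_missing = toy.isnull()   # True where a value is missing
is_present = toy.notnull()  # True where a value is present

print(is_missing)
print(is_present.equals(~is_missing))  # True: ~isnull() and notnull() agree
print(is_missing.sum())                # True counts as 1, so this is the number of missing values per column
```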


In [None]:
missing_data = data_raw.isnull() # find data that are null and return 'True' if is null and 'False' if it is not
missing_data.head(20)

### **3.1.2.Count the missing values in the data frame**.

### Count missing values in each column

Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method **.value_counts()** counts how many times each of the two values, "True" and "False", occurs in the column.




In [None]:
missing_data=(data_raw.isnull())

In [None]:
type(missing_data)

In [None]:
# The for loop 
for column in data_raw.columns: # Loop over the column names in the header
    print(column)
    print(missing_data[column].value_counts())
    print('----')

In [None]:
# The commented code gives the same result as the loop above; it takes the column names from missing_data and converts them to a list first.
# for column in missing_data.columns.values.tolist():
#     print(column)
#     print(missing_data[column].value_counts())
#     print("")

Another way to inspect the missing values is to display the boolean data frame:

In [None]:
missing_data = data_raw.isnull() # find data that are null and return 'True' if is null and 'False' if it is not
missing_data.head(10)

As discussed, the only columns that contain missing data are "suicides_no" and "HDI for year".

In [None]:
print(data_raw.isnull().sum()) # Count the null values in each column of the data frame

After the first row was promoted to the header, the data frame has 27840 rows, and the column suicides_no has 23575 non-null values. That is 27840 - 23575 = 4265 fewer than the total number of rows; these entries are counted as True (is null). The remaining values may still include strings rather than numbers, so we need to investigate the column.

The HDI column has a lot of missing data: 19472 values are missing. We will not use it for the analysis, so it is best to delete this column.
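
The subtraction above relies on a general relationship: the number of rows minus the number of non-null values equals the number of nulls. A minimal sketch on a made-up column (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

col = pd.Series([5, np.nan, 7, np.nan, 9], name="suicides_no")

total_rows = len(col)         # counts every row, including missing ones
present = col.count()         # counts only the non-null values
missing = col.isnull().sum()  # counts only the null values

print(total_rows, present, missing)  # 5 3 2
```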

## **3.2 Investigation for alphanumeric characters or words misplaced as values**
There are two ways to identify the non-integer values in the column suicides_no:
1) Use the regular expressions module, which we need to import as re. 
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

2) Find the non-numeric data with the method **.str.isdigit()**: comparing its result with False selects all the data points that are strings without digits.

### 1st Way Regular expressions module **re**

If you want to run the following code, uncomment it from the import downwards.
For this run of the notebook I have selected the 2nd way (.str.isdigit()).

In [None]:
# Import the regular expressions module as re.
# The re module provides an interface to the regular expression engine,
# allowing you to compile REs into objects and then perform matches with them


#import re 
#replace = re.compile("([a-zA-Z]+)")  # compile a pattern that matches a run of alphabetic characters

#data_raw['string'] = data_raw['suicides_no'].str.extract(replace) # Adds a column 'string' with the letters found in each value
#data_raw['integer'] = data_raw['suicides_no'].str.replace(replace, " ", regex=True) # Adds a column 'integer' with the letters replaced by a space

#data_raw['string'].unique() # Presents the unique "words" from the string column

In the column suicides_no there are the strings **'Null'** and **'Unknown'** as well as **NaN (not a number)** values, and we need to replace them.
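
For readers who do not want to add columns to data_raw, the regular-expression idea can be tried on a small made-up Series. This is only a sketch; the name `raw` and its values are assumptions for illustration.

```python
import pandas as pd

raw = pd.Series(["21", "Null", "14", "Unknown", None], dtype=object)

# Extract the first run of letters from each value; values without letters give NaN.
words = raw.str.extract(r"([a-zA-Z]+)", expand=False)

found = sorted(words.dropna().unique())
print(found)  # ['Null', 'Unknown']
```

Passing expand=False returns a Series rather than a one-column data frame, which keeps the result easy to filter.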

### 2nd Way: use the **.str.isdigit()** string method


The following code checks whether each string in the column consists only of digits and keeps the entries for which the answer is False, which means it selects all the strings that are not numbers. The method **unique()** then shows which strings these are.

In [None]:
non_numeric=data_raw['suicides_no'].str.isdigit() == False

In [None]:
type(non_numeric)  # non_numeric is a 1D boolean array, a pandas Series

In [None]:
data_raw['suicides_no'][non_numeric].unique()

We observe the strings "Null" and "Unknown", which we need to remove.

This is an elegant way to find the strings and words in the dataset without regular expressions; however, it does not show us the NaN data points, because .str.isdigit() returns NaN rather than False for them. We can include them by using **pd.to_numeric** with errors='coerce', which turns every value that cannot be converted to a number into NaN. The following code is more complete and shorter than the version with regular expressions (re).

In [None]:
suicides =data_raw['suicides_no']

In [None]:
suicides2 = pd.to_numeric(suicides, errors='coerce')  # values that are not numbers become NaN

In [None]:
print(data_raw['suicides_no'][suicides2.isna()].unique())# Shows the data which are not numeric

The output above shows that the column contains NaN (not a number) values as well as the words Null and Unknown.
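
To separate true NaN entries from text entries, the coerced result can be combined with a check on the original values. A minimal sketch on made-up data (the names `raw`, `as_number`, `not_numeric` and `text_only` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

raw = pd.Series(["21", "Null", np.nan, "Unknown", "14"], dtype=object)

as_number = pd.to_numeric(raw, errors="coerce")  # anything that is not a number becomes NaN
not_numeric = as_number.isna()                   # true NaN and non-numeric strings
text_only = not_numeric & raw.notna()            # only the non-numeric strings

print(raw[not_numeric].unique())
print(raw[text_only].unique())
```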

In [None]:
data_raw.head(15)

In [None]:

df_null=data_raw.loc[data_raw['suicides_no']=='Null']
df_null

In [None]:
data_raw.head(15)

#### In the column suicides_no there are the strings 'Null' and 'Unknown' as well as NaN values, and we need to replace them.

<h2 id="deal_missing_values">4 Deal with missing data</h2>
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions
    </li>
</ol>



### 4.1  Drop / Delete a column with lots of missing data.
To remove columns we use the .drop() method as **df.drop(["column1", "column2", ...], axis=1)**. The argument axis=1 indicates that the method is applied to columns.
 * I will drop the column HDI for year. There are a lot of missing values in it and I don't think it contributes to the result.
 
 * I will name the new data frame **data_frame** because it is not the raw data anymore.


In [None]:
# If you have run the regular expressions code and created the columns 'string' and 'integer', run the commented line instead. 
#data_frame= data_raw.drop(['HDI for year','string','integer'], axis = 1) 

# If you have used the 2nd way to identify the string values (with .isdigit()), run the following command. 
data_frame= data_raw.drop(['HDI for year'], axis = 1)                         
data_frame

So far we have found that there are 4265 missing integers in the suicides_no column, which would be NaN or the strings 'Unknown' and 'Null'.

We have deleted the HDI for year column because it had a lot of missing values.

We have created a new data_frame with the data, which is no longer the raw data.

###  4.2 REPLACE :
#### Locate the data in suicides_no that appeared as strings and REPLACE them

In [None]:
df_null=data_frame.loc[data_frame['suicides_no']=='Null']
df_null

In [None]:
df_null.shape

There are 4265 entries in the suicides_no column with wrong or missing values, which matches what we calculated earlier with the missing_data variable.

In [None]:
# Count the 'Null' values in the 'suicides_no' column 
# df_null["suicides_no"].value_counts()  # Same as:  df_null.suicides_no.value_counts()  
df_null.suicides_no.value_counts()  

# To be concise, we could write it in one line:
# data_frame.loc[data_frame['suicides_no']=='Null'].suicides_no.value_counts()

In [None]:
df_unknown=data_frame.loc[data_frame['suicides_no'] =='Unknown'] 
df_unknown  ## show the lines with the word Unknown in the suicides_no column 

In [None]:
df_unknown.suicides_no.value_counts() ## count the values with the word Unknown in the suicides_no column 

## REPLACE  incorrect / missing values!
There are three types of data in the column 'suicides_no' in our data frame that need to be cleared out: the word 'Null', the word 'Unknown' and NaN (Not a Number). I will replace the word 'Null' with 0, because it is actually zero. 

The strategy with other words which don't contribute to the data (unless it is categorical data) is to replace them with NaN. I replace " " and "Unknown" with NaN (Not a Number), which is Pandas' default missing value marker, for reasons of computational speed and convenience. Then we can delete all the NaN data points.
- Replace all remaining strings with NaN
- Delete all NaN values

Here are the functions we use to locate one value and replace it with another:

* .loc[data['Column Name']=='Character to locate']
to locate the characters in the named column
* .replace(A, B, inplace = True) 
to replace value A with value B.
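
As an alternative sketch, .replace() also accepts a dictionary that maps each old value to its new value, so several replacements can be written in one call that returns a new frame instead of modifying the original. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"suicides_no": pd.Series(["21", "Null", "Unknown", " ", "14"], dtype=object)})

# Map each unwanted value to its replacement; toy itself is left unchanged.
cleaned = toy.replace({"Null": 0, "Unknown": np.nan, " ": np.nan})

print(cleaned)
print(cleaned["suicides_no"].isna().sum())  # 2
```

Returning a new frame keeps the raw values available for comparison, at the cost of holding two copies in memory.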



In [None]:
# Replace an empty space " " and "Unknown" with NaN, and "Null" with 0, using the NumPy library
data_frame.replace(" ", np.nan, inplace = True)
data_frame.replace("Null", 0, inplace = True)
data_frame.replace("Unknown", np.nan, inplace = True)
data_frame.head(15)

### Let's now count how many values are NaN

In [None]:
data_frame.isna().sum()  # number of NaN values in each column

Based on the summary above, each column has 27840 rows of data and only one of the columns, suicides_no, contains missing data:

<ol>
    <li>"suicides_no": 4273 missing data which are replaced with NaN</li>
    
</ol>


### DELETE/DROP  NaN 
To deal with the NaN values we first need to see where they appear. We can inspect the data frame by calling .head() or .tail().

To delete the NaN values, apply *.dropna()* to the data. Afterwards, reset the index of the data frame, because the deleted rows leave gaps in it.
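
A minimal sketch of the gaps that .dropna() leaves in the index and how .reset_index(drop=True) closes them, on a made-up frame (names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "suicides_no": [3.0, np.nan, 8.0, np.nan],
})

dropped = toy.dropna(subset=["suicides_no"])  # remove rows where suicides_no is NaN
renumbered = dropped.reset_index(drop=True)   # renumber the remaining rows from 0

print(dropped.index.tolist())     # [0, 2]
print(renumbered.index.tolist())  # [0, 1]
```

The subset argument limits the check to the listed columns; without it, a NaN in any column removes the row.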


In [None]:
data_frame.tail(10).reset_index()  # reset_index() renumbers the displayed rows and keeps the old index as a column, which makes them easier to count

### Remove the NaN values 
We can see that the entries 27824, 27825 and 27828 have NaN values in the suicides_no column. We use **.dropna()** to remove them. Always reset the index afterwards. 

In [None]:
# Now I will drop any rows with NaN since I don't know what to replace them with
data_frame = data_frame.dropna() # deletes the rows with NaN
data_frame.tail(10).reset_index()  # renumbers only the displayed copy; data_frame itself keeps its original index

In [None]:
data_frame.head()

Whole columns or rows should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty.


There is some freedom in choosing which method to use to replace data; however, some methods may seem more reasonable than others. We will apply each method according to the data and what the column represents:

**Replace by mean:** depending on the context, the missing data could be replaced with the mean.
   

#### Calculate the mean value for the "suicides_no" column
It does not make sense for this data, but I show how to substitute with the mean to demonstrate the method. 
If you want to use it elsewhere, you can remove the # in front of the code. 


In [None]:
# How to calculate and show the average of a column
avg_suicide_no = data_frame['suicides_no'].astype('int').mean()  # First cast the values to integers (type casting)
print("Average suicides:", avg_suicide_no)

In [None]:
print ("Average suicides integer:", int(avg_suicide_no))

Replace "NaN" with the integer average value in the "suicide_no" column.
As written above we will not replace by a mean in this example.
I just show the code to use it when appropriate.



In [None]:
#data_frame['suicides_no'] = data_frame['suicides_no'].replace(np.nan, avg_suicide_no)
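
For a column that still contains NaN, the same idea can be sketched with .fillna() on made-up values (an illustration only, not a recommendation for this data set):

```python
import numpy as np
import pandas as pd

col = pd.Series([2.0, np.nan, 4.0, 6.0], name="suicides_no")

mean_value = col.mean()          # NaN is skipped: (2 + 4 + 6) / 3 = 4.0
filled = col.fillna(mean_value)  # returns a new Series with NaN replaced by the mean

print(mean_value)       # 4.0
print(filled.tolist())  # [2.0, 4.0, 4.0, 6.0]
```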

### Replace by Frequency:
Find the most frequent value (the mode) of the column and replace the NaN with it.

In [None]:
data_frame['suicides_no'].value_counts()

There are 1576 different numbers in the column. We can use the ".idxmax()" method on the result of value_counts() to find the most common value automatically.

In [None]:
data_frame['suicides_no'].value_counts().idxmax()

We can see that number **'1' is the most common value**.


The replacement procedure is very similar to what we have seen previously:


In [None]:
# Replace the missing NaN values in the suicides_no column with the most frequent number, which is 1
#data_frame['suicides_no'] = data_frame['suicides_no'].replace(np.nan, 1)
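
A small sketch of replacement by frequency on made-up values. It finds the most common value in two ways, with value_counts().idxmax() and with .mode(); if several values are tied, the two calls may pick different ones, so the choice should be checked by eye.

```python
import numpy as np
import pandas as pd

col = pd.Series([1, 1, 3, np.nan, 1, 3], name="suicides_no")

most_common = col.value_counts().idxmax()  # value_counts() ignores NaN by default
same_by_mode = col.mode().iloc[0]          # .mode() returns a Series, so take its first entry

filled = col.fillna(most_common)

print(most_common, same_by_mode)  # 1.0 1.0
print(filled.tolist())            # [1.0, 1.0, 3.0, 1.0, 1.0, 3.0]
```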

<b>Good!</b> Now, we have a dataset with no missing values.


<h2 id="correct_data_format"> 5 Correct data format </h2>

We are almost there!
The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).

In Pandas, we use:

- **.dtypes** to check the data types
- **.astype()** to change the data type


<h4>Let's list the data types for each column</h4>


As **.dtypes** showed earlier, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, the 'year', 'suicides_no', 'population' and 'gdp_for_year ($)' variables are numerical values, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert the data types into a proper format for each column using the **.astype()** method.

<h4>Convert data types to proper format</h4>


In [None]:
data_frame[["year", "suicides_no", "population"]] = data_frame[["year", "suicides_no", "population"]].astype("int")
# Other type casts work the same way: .astype("float"), .astype("str")
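
An alternative sketch: .astype() also accepts a dictionary that maps column names to types, which converts several columns in one call and leaves the others untouched. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Greece", "Greece"],
    "year": ["1985", "1986"],
    "population": ["312900", "308000"],
})

# Convert only the listed columns; "country" keeps its text type.
converted = toy.astype({"year": "int64", "population": "int64"})

print(converted.dtypes)
print(converted["population"].sum())  # 620900
```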

## The problems of the column 'gdp_for_year ($)'
* The header of the column, 'gdp_for_year ($)', contains spaces and the dollar symbol, and we need to replace it with a simpler, consistent name like 'gdp_for_year_usd'.
* The numbers in the column should be formatted without commas.

### 1st Change the header

In [None]:
headers= data_frame.columns # Select the headers
headers

The column with the GDP has a header that contains spaces and the $ character, which cause problems whenever we refer to the header. We need to change it to a name that is easy to write, to avoid mistakes.

In [None]:
# The column name " gdp_for_year ($) " contains spaces, so I replace it with a name without spaces, "gdp_for_year_usd"
# Build a new list of headers, using a lambda function to replace that one name
headers = list(map(lambda x: x.replace(' gdp_for_year ($) ', 'gdp_for_year_usd'), headers))
headers

In [None]:
data_frame.columns=headers
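
The same renaming can also be sketched with the .rename() method, which takes a mapping from old to new column names and leaves all other columns as they are. The frame `toy` is made up for illustration, and the exact spacing of the old name is an assumption taken from the code above.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["Greece"], " gdp_for_year ($) ": ["1,000"]})

# Map the old column name to the new one; names not in the mapping are kept.
renamed = toy.rename(columns={" gdp_for_year ($) ": "gdp_for_year_usd"})

print(list(renamed.columns))  # ['country', 'gdp_for_year_usd']
```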

### 2nd Change the format 

In [None]:
data_frame.dtypes

In [None]:
data_frame['gdp_for_year_usd'] # Show the values, which are of the type object

We need to REMOVE the COMMAS (,) from the numbers and typecast them to integers.

In [None]:
# First replace each comma ',' with an empty string ''. Then typecast the column to 64-bit integers, which can hold large GDP values
data_frame['gdp_for_year_usd'] = data_frame['gdp_for_year_usd'].str.replace(',', '', regex=False).astype('int64')
data_frame['gdp_for_year_usd']
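
A short sketch of the comma removal on made-up values. It uses a 64-bit integer type explicitly, on the assumption that GDP figures can exceed the range of a 32-bit integer (about 2.1 billion).

```python
import pandas as pd

gdp = pd.Series(["2,156,624,900", "63,067,077,179"], name="gdp_for_year_usd")  # made-up values

# regex=False treats ',' as a literal character rather than a pattern.
gdp_int = gdp.str.replace(",", "", regex=False).astype("int64")

print(gdp_int.tolist())  # [2156624900, 63067077179]
```

When the commas are known in advance, read_csv can also strip them while loading through its thousands=',' parameter.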

In [None]:
# Here the gdp_for_year_usd column is already an integer, so this cell does not need to run; if it were not, we would typecast it to integer
#data_frame['gdp_for_year_usd'] = data_frame['gdp_for_year_usd'].astype("int64")

<h4>Let us list the columns after the conversion</h4>


In [None]:
data_frame.dtypes

In [None]:
data_frame.head()

### **Wonderful!**
#### Save the data

Now we have finally obtained the cleaned dataset, with no missing values and with all data in its proper format.
We can save it as the 'clean' data set, which is ready for processing. 


In [None]:
data_frame.to_csv(r'./data/who_suicide_statistics_clean_data.csv', index=False)

# **Data Operations**
Data operations can be performed through various built-in methods for faster data processing and analysis. A few methods are:
- Operations using Grouping
- Operations with Sorting
- Operations with Statistics 
- Operations with Functions (Standardisation and Normalisation)




##  Grouping
Use **.groupby('column name')** to group the data. Grouping involves a combination of **splitting** the object, **applying** a function and **combining** the results.

To select the data of a single country from the 101, I'll group the data frame by 'country' and then select the country of preference using **.get_group('selection')**.
Let's select the data for the United Kingdom.

In [None]:
country_grouped = data_frame.groupby('country')  # group by a single column name, so that get_group() accepts a plain string
UK_data=country_grouped.get_group('United Kingdom')
UK_data

We make another data frame by selecting the columns that we want to analyse 

In [None]:
UK_data_year_suicides= UK_data[['country','year', 'suicides_no']] # Select the columns to analyse from UK_data; you could add another column as well
UK_data_year_suicides

| | country | year | suicides_no |
|---|---|---|---|
| 26476 | United Kingdom | 1985 | 264 |
| 26477 | United Kingdom | 1985 | 915 |
| 26478 | United Kingdom | 1985 | 128 |
| 26479 | United Kingdom | 1985 | 62 |
| 26480 | United Kingdom | 1985 | 678 |
| ... | ... | ... | ... |
| 26843 | United Kingdom | 2015 | 181 |
| 26844 | United Kingdom | 2015 | 18 |
| 26845 | United Kingdom | 2015 | 14 |
| 26846 | United Kingdom | 2015 | 6 |


To find the total suicides for each year, group by two columns together and use the method **.sum()**.


In [None]:
UK_suicides_per_year =UK_data_year_suicides.groupby(['country','year'],as_index=False).sum()
UK_suicides_per_year.head(10)
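
The split, apply and combine steps can be followed on a tiny made-up frame, where the totals are easy to check by hand (all names and numbers are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Greece", "Greece", "Greece", "Albania"],
    "year": [1985, 1985, 1986, 1985],
    "suicides_no": [10, 5, 7, 2],
})

greece = toy.groupby("country").get_group("Greece")  # split: keep the rows of one country

# apply and combine: add up suicides_no within each (country, year) pair
per_year = greece.groupby(["country", "year"], as_index=False)["suicides_no"].sum()

print(per_year)
```

With as_index=False the grouping columns stay ordinary columns, so the result has one row per year with the columns country, year and suicides_no.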

## Sorting values 
To sort values by size or in alphabetical order, we use the method **.sort_values()**.

In [None]:
UK_suicides_per_year_sorted=UK_suicides_per_year.sort_values('suicides_no') # The default is ascending order
UK_suicides_per_year_sorted

## Operations with Statistics 
The methods **.max()**, **.min()**, **.mean()** and **.std()** are used to find the maximum, minimum, mean and standard deviation of the data. The **.describe()** method shows all of these statistics at once.

In [None]:
UK_suicides_per_year['suicides_no'].max() # Select the column we are interested in from the data frame and apply the .max() method

In [None]:
# Use the print() function to show the values calculated with .max(), .min(), .mean(), .std() and the other statistical methods
print('The maximum number of cases in the UK is', UK_suicides_per_year['suicides_no'].max())
print('The minimum number of cases in the UK is', UK_suicides_per_year['suicides_no'].min())  # Use the yearly totals, as for the maximum
print('The standard deviation of cases in the UK is', UK_suicides_per_year['suicides_no'].std())
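As a compact alternative to calling each method separately, `.describe()` and `.agg()` return several statistics at once. This is a sketch on a made-up series; note that pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Made-up values for illustration only
cases = pd.Series([10, 20, 30, 40], name="cases")

summary = cases.describe()                             # count, mean, std, min, quartiles, max
selected = cases.agg(["min", "max", "mean", "std"])    # only the statistics you ask for

print(summary)
print(selected)
```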

## Investigate another country's data and create another subset

Let's repeat for another country what we calculated and presented for the UK. You could choose any country in the initial data frame. I'm choosing Greece here and repeating the code above:

In [None]:
# Investigate another country
country_grouped = data_frame.groupby('country')  # A string key, so that get_group() accepts a plain label
Greece_data = country_grouped.get_group('Greece')
Greece_data

In [None]:
Greece_data_year_suicides= Greece_data[['country','year', 'suicides_no']] # country_grouped.get_group('Greece').
Greece_data_year_suicides

In [None]:
Greece_suicides_per_year =Greece_data_year_suicides.groupby(['country','year'],as_index=False).sum()
Greece_suicides_per_year.head(10)

In [None]:
Greece_suicides_per_year_sorted=Greece_suicides_per_year.sort_values('suicides_no') # The default is ascending order
Greece_suicides_per_year_sorted

In [None]:
print('The maximum number of cases in Greece is', Greece_suicides_per_year['suicides_no'].max())
print('The minimum number of cases in Greece is', Greece_suicides_per_year['suicides_no'].min())  # Use the yearly totals, as for the maximum
print('The standard deviation of cases in Greece is', Greece_suicides_per_year['suicides_no'].std())



## Combine / Merge the data sets 

Another useful operation is combining data sets. Pandas offers several ways to combine separate datasets.
With pandas, you can **merge**, **join** and **concatenate** your datasets, which allows you to unify them and then understand and interpret them together.

![merge](./fig/fig_merge.jpg)


If you use the **merge()** function, the default is an inner join, which keeps only the rows whose key values appear in both DataFrames, so the result is often shorter than the inputs. With merge(), you also have control over which column(s) to join on. Let's say that you want to merge the two datasets on 'year' only, since each year appears exactly once in each of them. To do so, you can use the **on** parameter:

In [None]:
UK_suicides_per_year.shape

In [None]:
Greece_suicides_per_year.shape

In [None]:
UK_Greece_merged = pd.merge(UK_suicides_per_year, Greece_suicides_per_year, on='year') # Merge the two datasets side by side, matching rows on the common 'year' column
UK_Greece_merged

Depending on the datasets, you can also join on the 'left' or on the 'right' with the **how** parameter, which keeps every row of the left or the right DataFrame respectively. In this example it makes no practical difference. The **how** parameter is used like below:

In [None]:
df1_merge=pd.merge(UK_suicides_per_year, Greece_suicides_per_year,how='right', on='year')
df1_merge 
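The effect of **how** is easier to see when the two data frames do not cover the same years. The sketch below uses two made-up data frames; `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Made-up data frames that only partly overlap in 'year'
left = pd.DataFrame({"year": [2000, 2001, 2002], "cases_left": [1, 2, 3]})
right = pd.DataFrame({"year": [2001, 2002, 2003], "cases_right": [20, 30, 40]})

inner = pd.merge(left, right, on="year")                   # years present in both
left_join = pd.merge(left, right, on="year", how="left")   # every year of `left`
outer = pd.merge(left, right, on="year", how="outer", indicator=True)  # all years

print(inner)
print(left_join)
print(outer)
```

Rows without a partner in the other data frame get NaN in the columns that come from that data frame.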

## Use the .join() method: Combine Data on a Column or Index
merge() is a module-level function (which means that you pass both DataFrames as arguments), whereas .join() is a DataFrame method. You call .join() on one DataFrame and pass it only the other DataFrame to join.

Here I'll create one data frame (one_country) to which I assign the Greek results, and then I'll join the UK DataFrame to it. The result is stored in two_countries_combined.

In [None]:
one_country= Greece_suicides_per_year

In [None]:


# The other DataFrame must be indexed by the common column. lsuffix and rsuffix are appended to the overlapping column names of the left and right DataFrame respectively
two_countries_combined=one_country.join(UK_suicides_per_year.set_index(["year"]), on=["year"], how= "inner", lsuffix="_x", rsuffix="_y")
two_countries_combined

In [None]:
two_countries_combined2=one_country.join(UK_suicides_per_year.set_index(["year"]), on=["year"], lsuffix="_left") # The default is how="left"; with these datasets it gives the same result as how="inner"
two_countries_combined2
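To make the relationship between .join() and merge() concrete, here is a small sketch with made-up data frames in which both routes give the same table:

```python
import pandas as pd

# Made-up data frames with the same years and one overlapping column name
left = pd.DataFrame({"year": [2000, 2001], "cases": [1, 2]})
right = pd.DataFrame({"year": [2000, 2001], "cases": [10, 20]})

# .join() aligns on the index of the other frame, so index both frames by year
joined = left.set_index("year").join(
    right.set_index("year"), lsuffix="_left", rsuffix="_right"
)

# The same result expressed with merge()
merged = pd.merge(left, right, on="year", suffixes=("_left", "_right")).set_index("year")

print(joined)
print(merged)
```

The two results agree here only because both frames contain the same years; .join() defaults to a left join while merge() defaults to an inner join.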

## Use the concat() function: Combine Data Across Rows or Columns
Concatenation is a bit different from the merging techniques of the merge() function and the .join() method, where you specify the type of join (inner or outer) and, depending on that type, might lose rows and information.

With concatenation, the datasets are stitched together along either the row axis or the column axis.

If you concatenate two data frames (df1 and df2) with no extra parameters, as in the following code, the default result looks like the picture:

**df_concatenated = pandas.concat([df1, df2])**



![concatenate1](./fig/fig1_concat.jpg)

This example assumes that your column names are the same.  If your column names are different while concatenating along rows (axis 0), then by default the columns will also be added, and NaN values will be filled in as applicable.

![concatenate2](./fig/fig2_concat.jpg)

If you want to accomplish concatenation along columns instead, you’ll use a concat() call like you did above, but you’ll also need to pass the axis parameter with a value of 1 or "columns":

**df_concatenated = pandas.concat([df1, df2], axis="columns")**  or **df_concatenated = pandas.concat([df1, df2], axis=1)**



![concatenate3](./fig/fig3_concat.jpg)

Let's concatenate our datasets and view them.

In [None]:
df1_concatenated=pd.concat([UK_suicides_per_year, Greece_suicides_per_year])
df1_concatenated

This places the datasets one underneath the other. By default they are concatenated along the row axis (axis=0). If you look carefully, each dataset keeps its own index (0-30 for the first data frame and 0-30 again for the second one), so the index values are repeated.

In [None]:
df2_concatenated=pd.concat([UK_suicides_per_year, Greece_suicides_per_year], axis =1)
df2_concatenated

You see the two data frames next to each other.

By default, a concatenation results in a **set union**, where all data is preserved. This corresponds to an outer join in merge() and .join(), and you can change it with the **join** parameter of concat().
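The following sketch, on two made-up data frames, shows the repeated index after a default concatenation, how `ignore_index=True` renumbers the rows, and how `join="inner"` keeps only the columns that both data frames share:

```python
import pandas as pd

# Made-up data frames; df2 has one extra column
df1 = pd.DataFrame({"year": [2000, 2001], "cases": [1, 2]})
df2 = pd.DataFrame({"year": [2000, 2001], "cases": [10, 20], "note": ["x", "y"]})

stacked = pd.concat([df1, df2])                          # index 0, 1, 0, 1; 'note' is NaN for the df1 rows
renumbered = pd.concat([df1, df2], ignore_index=True)    # index 0, 1, 2, 3
shared_only = pd.concat([df1, df2], join="inner")        # keep only columns present in both

print(stacked)
print(renumbered)
print(shared_only)
```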



## **Visualise the data**

Whether you are investigating your data or preparing to publish your findings, visualisation is an essential tool. There are various libraries that you can use in Python to visualise your data; two of the most popular are Matplotlib and Seaborn. Here we use the Matplotlib library.

In [None]:
%matplotlib inline  

**%matplotlib** is a magic function. Used on its own, it displays the figure in a pop-up window; used with **inline**, the figure appears inside the notebook.

In [None]:
df1_merged = pd.merge(UK_suicides_per_year, Greece_suicides_per_year, how='right', on='year')
df1_merged 



We can change the column names to be more specific, or use the .join() method instead, which has the arguments lsuffix and rsuffix for that purpose.
Let's change the headers, as we saw earlier in paragraph 1.


In [None]:
headers=['UK','year','suicides_no_UK','Greece','suicides_no_Greece']
df1_merged.columns=headers
df1_merged.head()
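Assigning a full list to `.columns` relies on getting the order of the columns right. Two alternatives, sketched here on made-up data frames, are to choose the suffixes at merge time or to rename selected columns by name:

```python
import pandas as pd

# Made-up data frames for illustration only
uk = pd.DataFrame({"year": [2000, 2001], "cases": [1, 2]})
greece = pd.DataFrame({"year": [2000, 2001], "cases": [10, 20]})

# Option 1: choose the suffixes at merge time
merged = pd.merge(uk, greece, on="year", suffixes=("_UK", "_Greece"))

# Option 2: rename selected columns by name after the merge
renamed = pd.merge(uk, greece, on="year").rename(
    columns={"cases_x": "cases_UK", "cases_y": "cases_Greece"}
)

print(merged.columns.tolist())
print(renamed.columns.tolist())
```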

Now we are ready to plot. First import the pyplot module of the matplotlib library as plt,
then call the .plot() method on our data frame. The arguments start with the x column and the y column(s).

In [None]:

import matplotlib.pyplot as plt
df1_merged.plot(x='year', y=['suicides_no_UK','suicides_no_Greece'])



The default plot is a line plot. Other kinds of plot are selected with the **kind** argument: bar, barh (horizontal bars), hist (histogram), box, area, pie (pie chart), line and scatter.

In [None]:
df1_merged.plot(kind= 'bar', x='year', y=['suicides_no_UK','suicides_no_Greece'])

## Operations with Functions: Statistical Functions
### Example Statistical Functions: **Normalization** and **Standardization**
Data is usually collected from different sources in different formats.
Normalization and standardization are processes that transform data into a common format, allowing the researcher to make meaningful comparisons.

**Normalization** is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling the variable so that its values range from 0 to 1, or scaling the variable so that its variance is 1.



### **Example**

To demonstrate normalization, let's say we want to scale the columns "suicides_no_UK" and "suicides_no_Greece" of the df1_merged data frame.
We would like to normalize those variables so that their values lie between 0 and 1.
We replace each original value by (original value)/(maximum value).

**Step 1)** Create a function of normalization 

In [None]:
def normalize(number):
    return (number)/number.max()

**Step 2)** Apply the function to the selected column 

In [None]:
suicides_UK_normalised=normalize(df1_merged['suicides_no_UK'])
suicides_UK_normalised

In [None]:
suicides_Greece_normalised=normalize(df1_merged['suicides_no_Greece'])
suicides_Greece_normalised
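Dividing by the maximum maps the largest value to 1, but the smallest value only reaches 0 if it is 0 to begin with. A common alternative, which is not the method used in this notebook, is min-max scaling: (value - minimum)/(maximum - minimum). The sketch below compares the two on a made-up series; the function name `min_max_scale` is chosen here for illustration.

```python
import pandas as pd

def min_max_scale(values):
    """Rescale so that the smallest value maps to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())

# Made-up values for illustration only
cases = pd.Series([50, 100, 150, 200])

by_max = cases / cases.max()       # dividing by the maximum: 0.25 up to 1.0
by_range = min_max_scale(cases)    # min-max scaling: 0.0 up to 1.0

print(by_max.tolist())
print(by_range.tolist())
```

Min-max scaling is undefined when all values are equal, because the denominator is then 0.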

### **Data Standardization**
Data standardization is a particular type of data normalization in which we subtract the mean and divide by the standard deviation. The data is transformed into a common format with a mean of 0 and a standard deviation of 1, so the values range from negative to positive around the mean.

In [None]:
def standarize_test(number):
    return (number-number.mean())/number.std()

In [None]:
standarize_test(UK_suicides_per_year['suicides_no'])
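A quick way to check a standardization function is to confirm that its output has a mean of about 0 and a standard deviation of about 1. This sketch uses a made-up series and a function named `standardize` for illustration:

```python
import pandas as pd

def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    return (values - values.mean()) / values.std()

# Made-up values for illustration only
cases = pd.Series([10.0, 20.0, 30.0, 40.0])
z_scores = standardize(cases)

print(z_scores.mean())  # approximately 0
print(z_scores.std())   # approximately 1
```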

## Visualise and compare
I want to compare the normalised values of both countries, and for that I use the merged data again.

We can add the two columns of normalised values to the df1_merged dataset, so that it also includes these statistics, or we can create a new dataset.

Let's go with the first option:

In [None]:
df1_merged['UK_suicides_normalised']= suicides_UK_normalised # Add a new column to the merged data frame, holding the normalised series calculated above
df1_merged['Greece_suicides_normalised']= suicides_Greece_normalised
df1_merged.head()

We can now plot the normalised data and compare the two countries.

In [None]:
df1_merged.plot(kind= 'bar', x='year', y=['UK_suicides_normalised','Greece_suicides_normalised']) # Call .plot() directly on the data frame

You can see that, in general, the normalised number of cases (each country's yearly total as a fraction of its own maximum) was higher in the UK until 2011. After 2011 it was higher in Greece, which we attribute to the economic crisis that started in 2010 and had its greatest impact on society after 2011.
With this kind of comparative plot you can visualise and gain insight into the political and socio-economic situation of the different countries among the 101 presented in this data set.


## Grouping plots
You can group similar plots in a single figure using subplots. The function matplotlib.pyplot.figure(), shortened to plt.figure(), creates a space into which we will place all our plots. The parameter figsize tells Python how big to make this space. Each subplot is placed into the figure using its add_subplot method. The add_subplot method takes 3 parameters: the first denotes the total number of rows of subplots, the second the total number of columns of subplots, and the third which subplot your variable is referencing (left-to-right, top-to-bottom). Each subplot is stored in a different variable (ax1, ax2, ax3, ax4). Once a subplot is created, it can be titled using its set_title() method, and its axes can be labelled with set_xlabel() and set_ylabel(). Here are our four plots in a two-by-two grid:

In [None]:
fig=plt.figure(figsize=(12.0, 8.0))


ax1 = fig.add_subplot(2,2,1)
# Draw the bar plot on ax1, the first subplot, using the methods of the axes object
ax1.set_title('UK suicides')
ax1.bar(UK_data['year'],
        UK_data['suicides_no'],
        color='blue')


ax2= fig.add_subplot(2,2,2)
ax2.set_title('GDP in billions USD')
ax2.plot(UK_data['year'],UK_data['gdp_for_year_usd']/10**9, color='purple' )


ax3 = fig.add_subplot(2,2,3)
ax3.set_title('Greece suicides')
ax3.bar(Greece_data['year'],
        Greece_data['suicides_no'],
        color='orange')


ax4= fig.add_subplot(2,2,4)
ax4.set_title('GDP in billions USD')
ax4.plot(Greece_data['year'],Greece_data['gdp_for_year_usd']/10**9, color='green' )
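An alternative to calling add_subplot once per plot is `plt.subplots()`, which creates the figure and the whole grid of axes in one call. This is a sketch on a made-up data frame, not a replacement for the cell above; the column names are chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values for illustration only
toy = pd.DataFrame({"year": [2000, 2001, 2002], "cases": [3, 5, 4], "gdp": [1.0, 1.2, 1.1]})

# One call creates the figure and a 1 x 2 grid of axes
fig, (ax_cases, ax_gdp) = plt.subplots(1, 2, figsize=(10, 4))

ax_cases.bar(toy["year"], toy["cases"], color="blue")
ax_cases.set_title("Cases")
ax_cases.set_xlabel("Year")

ax_gdp.plot(toy["year"], toy["gdp"], color="purple")
ax_gdp.set_title("GDP")
ax_gdp.set_xlabel("Year")

fig.tight_layout()  # Avoid overlapping titles and labels
```

Calling the plotting methods on each axes object, rather than on plt, makes it explicit which subplot receives which plot.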


I want to plot the GDP per capita over the years for both countries.
I need to calculate it by dividing the two columns and adding the result as an extra column.

Let's plot the total number of suicides over the years for all 101 countries.

We could write it in one line: from the data frame, group by the country column, sum the suicides, sort the totals and plot suicides_no.

In [None]:


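# DataFrame.plot() creates its own figure, so a separate plt.figure() call is not needed.
# Select suicides_no before summing so that only this column is aggregated
data_frame.groupby('country')[['suicides_no']].sum().sort_values(by='suicides_no',ascending=False)[:101].plot(kind='bar',figsize=(16,8),title='Sum of suicides by country during 1985-2016')
plt.ylabel("Sum of suicides")
plt.xlabel("Country")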

In [None]:
# Plot the 20 countries with the highest totals
data_frame.groupby('country')[['suicides_no']].sum().sort_values(by='suicides_no',ascending=False)[:20].plot(kind='bar',figsize=(16,8),color='purple', title='Sum of suicides of the top 20 countries during 1985-2016')
plt.ylabel("Sum of suicides")
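Sorting and then slicing is one way to pick the top entries. As a sketch of an alternative, `Series.nlargest()` returns the largest values directly; the data frame below is made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C", "C"],
    "cases": [5, 5, 30, 1, 2],
})

totals = toy.groupby("country")["cases"].sum()  # one total per country
top_two = totals.nlargest(2)                    # the two largest totals, largest first

print(top_two)
```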




## Exercise 
Calculate the GDP/capita of the countries and visualise the cases according to GDP/capita.



# Convert to GDP/capita

### Thank you for completing this lab!

## Author
##  Mary Tziraki


                                                                                     

## <h3 align="center"> © Health+Bioscience IDEAS 2022. All rights reserved. </h3>
